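This document surveys toolkits for speech processing, natural language processing, and machine learning, along with the areas they are used in. It covers speech recognition (such as CMU Sphinx, HTK, and Kaldi), machine translation (such as Moses), part-of-speech tagging and related language analysis (such as Stanford CoreNLP), and deep learning and related libraries (such as gensim and Theano). It also mentions libraries for signal processing and speech synthesis (such as ISSE and HTS). The resources range from basic libraries to complete modelling systems.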
Speech and Natural Language Processing
#######################################
.. image:: https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg
   :alt: Awesome
   :target: https://github.com/sindresorhus/awesome
A curated list of speech and natural language processing resources. Other lists can be found in this list <https://github.com/bayandin/awesome-awesomeness>_. If you want to contribute to this list (please do), send me a pull request. All subcategories are listed in alphabetical order.
Finite State Toolkits and Regular Expressions
AT&T FSM Library <http://www2.research.att.com/~fsmtools/fsm/>_ The AT&T FSM Library™ is a set of general-purpose software tools for Unix for building, combining, optimizing, and searching weighted finite-state acceptors and transducers.
Carmel <https://github.com/graehl/carmel>_ Finite-state toolkit with Expectation-Maximization (EM) and Bayesian training for FSTs and context-free derivation forests.
Categorial semiring [specific URL] An implementation of the categorial semiring described in Sproat et al. (2014).
dk.brics.automaton <http://www.brics.dk/automaton/>_ A Java package providing efficient implementations of finite-state automata and regular expressions.
A system based on the theory of finite automata (FAs) and their extensions, which have been studied extensively in theoretical computer science.
Fare <https://github.com/moodmosaic/Fare>_ Finite automata and regular expressions for .NET, written in C#.
am A JavaScript library for working with automata and formal grammars (in particular regular and context-free languages).
Foma <https://code.google.com/p/foma/>_ Finite-state compiler and C library
fsa <http:>_ Toolkit used in RWTH ASR engine
fsm 2.0: Thomas Hanneforth's C++ library, which includes some useful operations such as three-way composition.
fstrain <https://github.com/markusdr/fstrain>_ A toolkit for training finite-state models.
jopenfst <https://github.com/steveash/jopenfst>_ A Java port of the C++ OpenFst library, originally forked from the CMU Sphinx project.
Kleene: a high-level programming language for building finite-state machines, built on top of the OpenFst library.
MIT FST Toolkit <https://www.google.com/>_ No longer maintained, but it has some interesting commands not found in other toolkits.
MoMs-for-StochasticLanguages <https://github.com/ICML14MoMCompare/MoMs-for-StochasticLanguages>_ Spectral and related training algorithms for weighted finite-state automata (WFSAs).
Shortest path for PDT <https://github.com/kho/openfst>_ An implementation of shortest-path computation for pushdown transducers (PDTs), built on OpenFst.
Noam <https://github.com/izuzak/noam>_ A JavaScript library for working with automata and with grammars for regular and context-free languages. It also includes some interesting visualizations built with viz.js <https://github.com/mdaines/viz.js>_.
A library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs).
openfst-utils: useful utilities for OpenFst, including an implementation of the categorial semiring.
openlat <https://github.com/benob/openlat>_ A toolkit for manipulating word lattices, built on OpenFst. It can import and export HTK-format lattices.
PyFst <https://github.com/vchahun/pyfst>_ Python interface to OpenFst
SFST - Stuttgart Finite State Transducer Tools <http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/SFST.html>_
SFST is a toolbox for building morphological analyzers and other tools based on finite-state transducer technology.
Treba <https://code.google.com/p/treba/>_ A basic command-line tool for training, decoding, and calculating with weighted (probabilistic) finite-state automata (PFSA) and hidden Markov models (HMMs).
Many of the toolkits in the Machine Translation section also include or make use of interesting graph and semiring operations.
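Several entries in this section deal with weighted finite-state acceptors, semirings, and shortest-path search. As a rough illustration of how these ideas fit together, the sketch below finds the cost of the cheapest accepting path through a tiny weighted acceptor in the tropical semiring, where weights add along a path and alternative paths are combined by taking the minimum. It is a toy in plain Python and does not use the API of OpenFst or any other toolkit listed here; the arc format and the name ``shortest_distance`` are assumptions made up for this example.

```python
import heapq


def shortest_distance(arcs, start, finals):
    """Return the cheapest path cost from start to any final state.

    arcs:   dict mapping state -> list of (label, weight, next_state)
    finals: dict mapping final state -> final weight
    Weights are tropical: they add along a path, and competing paths
    are combined with min. Assumes non-negative weights (Dijkstra).
    """
    best = {start: 0.0}
    queue = [(0.0, start)]
    answer = float("inf")  # tropical "zero": no accepting path found yet
    while queue:
        cost, state = heapq.heappop(queue)
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        if state in finals:
            answer = min(answer, cost + finals[state])
        for _label, weight, nxt in arcs.get(state, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return answer


# Hypothetical toy acceptor:
#   0 -a/1.0-> 1 -b/2.0-> 2 (final, weight 0.5)
#   0 -c/4.0-> 2            (a costlier shortcut)
ARCS = {
    0: [("a", 1.0, 1), ("c", 4.0, 2)],
    1: [("b", 2.0, 2)],
}
FINALS = {2: 0.5}

if __name__ == "__main__":
    print(shortest_distance(ARCS, 0, FINALS))
```

Swapping the two operations (``+`` and ``min``) for another pair is what moving to a different semiring amounts to; real toolkits generalize this so the same algorithm works for log, probability, or lexicographic weights.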
Language Modelling Toolkits
Bayesian Recurrent Neural Network for Language Modeling <http://chien.cm.nctu.edu.tw/bayesian-recurrent-neural-network-for-language-modeling/>_ A C/C++ implementation of the Bayesian recurrent neural network for language modeling (BRNNLM).
Berkeley LM <http://code.google.com/p/berkeleylm/>_
Bigfatlm <https://github.com/jhclark/bigfatlm>_
Provides Hadoop-based training of Kneser-Ney language models, written in Java.
CSLM (link) An open-source toolkit for building continuous-space language models.
DALM <https://github.com/jnory/DALM>_ Double array language model.
KenLM[1]
Kenneth Heafield's language modelling toolkit, which is fast and memory-efficient.
lwlm <http://chasen.org/~daiti-m/dist/lwlm/>_ An exact, full Bayesian implementation of the Latent Words Language Model (Deschacht and Moens, 2009).
Maximum Entropy Modeling <http://homepages.inf.ed.ac.uk/lzhang10/maxent.html>_ Le Zhang's page, with a rich collection of links related to MaxEnt models.
Maximum entropy language models: SRILM extension <http://www.phon.ioc.ee/dokuwiki/doku.php?id=people:tanel:srilm-me.en>_
This patch adds to the SRILM toolkit the ability to train and apply maximum entropy (MaxEnt) language models. Currently, only n-gram features are supported.
mitlm <https://code.google.com/p/gitmlm/>_ My personal favorite language modelling toolkit. It is very fast and appears to give slightly better accuracy.
MSRLM <http://research.microsoft.com/en-us/downloads/78e26f9c-fc9a-44bb-80a7-69324c62df8c/default.aspx>_
A scalable language modelling toolkit for building language models from very large amounts of data. It includes variants of modified absolute discounting and Kneser-Ney smoothing.
OpenGrm <http://opengrm.org>_ A language modelling toolkit built for use with OpenFst.
cpyp <https://github.com/redpony/cpyp>_ A C++ library for modelling with Pitman-Yor processes.
Random Language Model (RandLM) <http://sourceforge.net/projects/randlm/>_ Bloom-filter-based techniques for building a family of randomized language models.
RNNLM <http://www.fit.vutbr.cz/~imikolov/rnnlm/>_ Recurrent neural network language model toolkit.
ReFr <http://code.google.com/p/refr>_ The reranker framework used at the Johns Hopkins workshop on confusion language modelling.
rwthlm: a toolkit for training neural network language models (feedforward, recurrent, and long short-term memory), written by Martin Sundermeyer.
SRILM <http://www.speech.sri.com/projects/srilm/>_ A very widely used toolkit. The source code is freely available for non-commercial use only; commercial use requires a license.
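Several of the toolkits in this section implement discounting-based smoothing such as absolute discounting and Kneser-Ney. As a rough sketch of the basic idea only, the example below builds a bigram model with interpolated absolute discounting: a fixed discount is subtracted from every seen bigram count, and the freed probability mass is redistributed according to the unigram distribution. The function name ``train_bigram``, the discount of 0.75, and the toy corpus are assumptions chosen for illustration; the code does not mirror the API or the exact estimator of any toolkit listed here.

```python
from collections import Counter, defaultdict


def train_bigram(sentences, discount=0.75):
    """Return (prob, vocab) for a bigram model with absolute discounting."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[1:])  # <s> is a history only, never predicted
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[prev][word] += 1
    total = sum(unigrams.values())
    unigram_prob = {w: c / total for w, c in unigrams.items()}

    def prob(word, prev):
        """P(word | prev), interpolating discounted bigrams with unigrams."""
        followers = bigrams.get(prev)
        if not followers:
            return unigram_prob.get(word, 0.0)  # unseen history: unigram only
        history_count = sum(followers.values())
        discounted = max(followers[word] - discount, 0.0) / history_count
        # Mass removed from the seen bigrams, handed to the unigram model.
        backoff_weight = discount * len(followers) / history_count
        return discounted + backoff_weight * unigram_prob.get(word, 0.0)

    return prob, sorted(unigram_prob)


if __name__ == "__main__":
    # Hypothetical toy corpus, already tokenized.
    corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
    prob, vocab = train_bigram(corpus)
    for word in vocab:
        print(word, round(prob(word, "a"), 4))
```

Kneser-Ney smoothing keeps the same discounting step but replaces the plain unigram distribution with one based on how many distinct histories each word follows.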
Speech Recognition
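AaltoASR <https://github.com/aalto-speech>_ A feature-rich speech recognition toolkit hosted on GitHub, with support for multiple languages and dialects.
An open-source framework for concurrent speech processing, hosted on GitHub.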
An open-source toolkit designed to support both static and dynamic decoder implementations.
kaldi-nnet-dur-model: a neural network phone duration model for speech recognition, built on Kaldi. It is described in an Interspeech paper; the entry also points to a 2014 ICASSP paper for architecture and implementation details.
CMU Sphinx <http://cmusphinx.sourceforge.net/>_ An open-source speech recognition toolkit from Carnegie Mellon University.
HTK <http://htk.eng.cam.ac.uk/>_ The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models.
Juicer <https://github.com/idiap/juicer>_ A WFST-based decoder for automatic speech recognition (ASR).
Julius <http://julius.sourceforge.jp/en_index.php>_ A high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder for speech-related researchers and developers.
Kaldi <http://kaldi.org/>_ A modern open-source toolkit led by Dan Povey, featuring many state-of-the-art techniques.
OpenDcd <http://opendcd.org/>_ An open-source WFST-based speech recognition decoder.
Phonetisaurus <https://code.google.com/p/phonetisaurus/>_
Josef Novak's very fast WFST-based phoneticizer. The site also offers tutorials and slides that explain how it works.
Sail Align <https://github.com/nassosoassos/sail_align>_ An open-source software toolkit for robust long speech-text alignment. Through an adaptive, iterative speech recognition and alignment scheme, it can handle very long (and possibly noisy) audio signals and is robust to transcription errors. It is developed mainly as a Perl library; however, its functionality also depends on...
SCARF: A Segmental CRF Toolkit for Speech Recognition <http://research.microsoft.com/en-us/projects/scarf/>_
SCARF is a software toolkit for doing speech recognition with segmental conditional random fields.
trainc <https://code.google.com/p/trainc/>_
David Rybach and Michael Riley's tool for the direct construction of context-dependency transducers, described in an Interspeech best paper.
RASR <http://www-i6.informatik.rwth-aachen.de/rwth-asr/>_ RWTH ASR: the speech recognition system of RWTH Aachen University.
Signal Processing
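An Interactive Source Separation Editor <http://isse.sourceforge.net/>_
ISSE is an open-source, freely available, cross-platform audio editing tool that separates audio sources by letting the user paint on time-frequency visualizations of the signal.
Bob <https://github.com/idiap/bob>_
Bob is a free signal processing and machine learning toolbox, originally developed by the Biometrics group at the Idiap Research Institute.
Matlab Audio Processing Examples <http://www.ee.columbia.edu/~dpwe/resources/matlab/>_
A collection of Matlab example code.
SAcC - Subband Autocorrelation Classification Pitch Tracker <http://labrosa.ee.columbia.edu/projects/SAcC/>_
SAcC performs noise-robust pitch tracking by classifying subband autocorrelations with an MLP neural network.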
Text-to-Speech
HTS <http://hts.sp.nitech.ac.jp/>_ An HMM-based speech synthesis system.
RusPhonetizer <https://github.com/wilpert/RusPhonetizer>_ A toolkit providing the linguistic rules and dictionaries needed for the phonetic transcription of Russian.
Speech Data
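cmudict: a free pronunciation dictionary, released on GitHub by the CMU Sphinx project.
LibriSpeech ASR corpus: approximately 1,000 hours of 16 kHz read English speech. The corpus was prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from audiobooks of the LibriVox project and has been carefully segmented and aligned.
TED-LIUM Corpus: built from the audio of TED talks and their transcriptions available on the TED website.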
Machine Translation
Berkeley Aligner <https://code.google.com/p/berkeleyaligner/>_
A word alignment package that implements recent advances in unsupervised word alignment.
cdec <https://github.com/redpony/cdec>_
A decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms.
Jane <http://www-i6.informatik.rwth-aachen.de/jane/>_
Jane is RWTH's open-source statistical machine translation toolkit. It supports state-of-the-art techniques for phrase-based and hierarchical phrase-based translation.
Joshua <http://joshua-decoder.org/>_
A hierarchical and syntax-based machine translation decoder written in Java.
Moses <http://www.statmt.org/moses/>_ A well-known open-source machine translation system.
alignment-with-openfst <https://github.com/ldmt-muri/alignment-with-openfst>_ An alignment system built on OpenFst, hosted on GitHub.
zmert <http://cs.jhu.edu/~ozaidan/zmert/>_ An open-source Java implementation of minimum error rate training (MERT) by Omar F. Zaidan <http://www.cs.jhu.edu/~ozaidan/>_.
Machine Learning
BIDData <https://github.com/BIDData>_ BIDMat is designed to support large-scale exploratory data analysis. Its companion library, BIDMach, provides a machine learning interface.
libFM: Factorization Machine Library <http://libfm.org/>_
Sofia-ML: a suite of fast incremental algorithms for classification, regression, and ranking, developed by Google researchers.
Spearmint: a package for performing Bayesian optimization, based on the algorithms in the paper "Practical Bayesian Optimization of Machine Learning Algorithms" by Jasper Snoek, Hugo Larochelle, and Ryan P. Adams (Advances in Neural Information Processing Systems, 2012).
Deep Learning
convnet-benchmarks <https://github.com/soumith/convnet-benchmarks>_ Benchmarks of various convolutional network implementations.
Caffe [link] A very active deep learning framework with cuDNN support and a variety of backends.
cuDNN <https://developer.nvidia.com/cudnn>_ Nvidia's GPU-accelerated library of deep learning primitives, described in this paper <http://arxiv.org/pdf/1410.0759.pdf>_. Torch 7 supports cuDNN through the bindings here <https://github.com/soumith/cudnn.torch>_, and Python wrappers are available here <https://github.com/hannes-brt/cudnn-python-wrappers>_.
CURRENNT <http://sourceforge.net/projects/currennt/>_ Munich's open-source CUDA recurrent neural network toolkit, described in this paper <http://www.mmk.ei.tum.de/publ/pdf/14/14wen7.pdf>_.
Gensim: a Python topic modelling toolkit by Radim Rehurek that includes a word2vec implementation. It is easy to install and use.
GloVe <http://www.socher.org/index.php/Main/GloveGlobalVectorsForWordRepresentation>_ Global vectors for word representation.
GroundHog <https://github.com/lisa-groundhog/GroundHog>_ A neural network based machine translation toolkit.
Kaldi LSTM: a C++ implementation of long short-term memory (LSTM) networks within the Kaldi framework, intended for automatic speech recognition and language modelling, among other applications.
OxLM: Oxford Neural Language Modelling Toolkit <https://github.com/pauldb89/OxLM>_ A neural language modelling toolkit, described in this paper by Baltescu, Blunsom, and Hoang <https://ufal.mff.cuni.cz/pbml/102/art-baltescu-blunsom-hoang.pdf>_.
Neural Probabilistic Language Model (NPLM) Toolkit <http://nlg.isi.edu/software/nplm/>_ A toolkit for training and using neural language models in the style of Bengio (2003). It is fast even for large vocabularies (a million words or more): models can be trained on large amounts of text in about a week and queried in about 40 microseconds, which makes them usable inside a machine translation decoder.
RNNLM2WFST (GitHub): a toolkit for converting recurrent neural network language models (RNNLMs) into weighted finite-state transducers (WFSTs).
ViennaCL: an open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs.
Natural Language Processing
- BLLIP reranking parser https://github.com/BLLIP/bllip-parser: A statistical natural language parser consisting of a generative constituent parser (first stage) and a discriminative maximum entropy reranker (second stage).
- Apache OpenNLP http://opennlp.apache.org/: The Apache OpenNLP library is a machine learning based toolkit for processing natural language text.
- SEAL https://github.com/TeamCohen/SEAL: SEAL (Set Expander for Any Language), described in this paper http://www.cs.cmu.edu/~wcohen/postscript/icdm-2007.pdf.
- Stanford CoreNLP http://nlp.stanford.edu/software/corenlp.shtml: Stanford CoreNLP provides a suite of Java-based natural language analysis tools.
Applications
Cloud ASR using PyKaldi <https://github.com/UFAL-DSG/cloud-asr>_ A software platform for running speech recognition as an online service.
Other Tools
GraphViz.sty <https://github.com/mprentice/GraphViz-sty>_
A handy package for embedding the dot language directly in LaTeX documents. It is useful for tweaking small, colored weighted finite-state transducer (WFST) diagrams in papers and presentations.
Blogs
- Between Zero and One_ by William Hartmann
- CMUSphinx blog_ CMU Sphinx
- Language Log_ Language Log
- Natural Language Processing and Text Analytics blog_ Natural Language Processing and Text Analytics
- Natural Language Processing Blog_ by Hal Daumé III
- Spoken language processing blog: “Some thoughts on Spoken Language Processing, with tangents on Natural Language Processing, Machine Learning, and Signal Processing thrown in for good measure.”
Books
- Deep Learning: Methods and Applications http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf By Li Deng and Dong Yu
- Foundations of Data Science http://www.cs.cornell.edu/jeh/NOSOLUTIONS90413.pdf. Draft by John Hopcroft and Ravindran Kannan
- An introduction to Matrix Methods and Applications http://stanford.edu/class/ee103/mma.pdf. (Working Title) S. Boyd and L. Vandenberghe
