The ICSI Speech Group has released a set of tools for speech processing, particularly automatic speech recognition (ASR). These tools are intended for use by specialists. Even so, note that they are poorly documented and supported compared to some other toolkits, such as HTK. We see our tools as appropriate mainly for those who need one of their particular strengths, such as the strong support for acoustic modeling using multi-layer perceptron (MLP) neural networks.
Our tools are released with no warranty and with no guarantee of technical support. Please direct support questions to the Yahoo Groups online forum at http://groups.yahoo.com/group/icsi-speech-tools (this forum can also be subscribed to as a mailing list, by writing to icsi-speech-tools-subscribe(@)yahoogroups.com). If you use our tools, we hope you will share in the support effort for newer users on the forum. Besides the documentation that comes with the tools, useful information can also be found in the ICSI Speech FAQ.
These tools are freely available for non-commercial use; that is, taking them is equivalent to acquiring a license for such use. Commercial exploitation (including use in product development) requires obtaining a commercial license from ICSI. In some cases, the technology is protected by patent in addition to copyright. (Some of the applicable patents, but not all, are mentioned in source code comments.)
The SPRACHcore package contains a number of ICSI tools. Notable contents are the RASTA, PLP, and MSG feature extraction tools; the feacat and pfile_utils tools, which allow manipulation and transformation of feature files; and our Quicknet MLP toolkit. The noway decoder, which we have often used for hybrid MLP/HMM ASR, is also included, by permission of its principal author Steve Renals (University of Edinburgh). The SPRACHcore web page includes links to both the SPRACHcore package and newer software releases that are not included in SPRACHcore.
Quicknet is a general purpose MLP toolkit and has been used for tasks other than ASR, including handwriting recognition. We have used Quicknet for both "hybrid" MLP/HMM ASR, in which only MLPs are used to model state emission probabilities, and "tandem" ASR, which uses MLPs as a kind of nonlinear discriminant analysis prior to Gaussian mixture modeling. The latest version has many improvements compared to the version that was originally released in SPRACHcore, such as the option to speed up computation through the use of the ATLAS BLAS library.
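The difference between the hybrid and tandem uses of MLP outputs can be illustrated with a short Python sketch. It is not Quicknet code, and the function names, the three-class example, and the flooring constant are assumptions made for illustration. In the hybrid case, dividing each class posterior by the class prior gives a likelihood scaled by a factor common to all classes (by Bayes' rule), which can stand in for an HMM emission probability. In the tandem case, a common choice is to take the logarithm of the posteriors so that they can be decorrelated and then modeled by Gaussian mixtures as ordinary features; the decorrelation step is omitted here.

```python
import math

def scaled_likelihoods(posteriors, priors, floor=1e-10):
    """Hybrid use: divide each class posterior by its class prior.

    By Bayes' rule, p(x|q) / p(x) = P(q|x) / P(q), so the result is the
    likelihood scaled by a factor that is the same for every class.
    """
    return [max(p, floor) / max(q, floor) for p, q in zip(posteriors, priors)]

def tandem_log_features(posteriors, floor=1e-10):
    """Tandem use: log-compress posteriors before further processing."""
    return [math.log(max(p, floor)) for p in posteriors]

# Hypothetical MLP output for one frame and hypothetical class priors.
frame_posteriors = [0.7, 0.2, 0.1]
class_priors = [0.5, 0.25, 0.25]

print(scaled_likelihoods(frame_posteriors, class_priors))
print(tandem_log_features(frame_posteriors))
```

Class priors are usually estimated from the relative frequencies of the class labels in the training data. The floor only guards against taking the logarithm of zero or dividing by zero.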
This section provides a discussion of MLP acoustic modeling, to help potential users of Quicknet understand where it will be most useful.
Flexibility: Multi-layer perceptron acoustic modeling has sometimes given better performance than diagonal-covariance Gaussian mixture model (GMM) acoustic modeling when working with novel or unusual feature vectors. This is not always the case, but it was seen for MSG features in an investigation that included experiments in which the KLT (for decorrelation) and Gaussianization (using the SPRACHcore pfile_gaussian tool) were used to post-process the MSG features before passing them to the GMM system. The ICASSP 2000 paper by Sharma et al. also has information on MSG performance differences. An advantage for MLPs when using a novel feature vector was also reported in the paper submitted to INTERSPEECH 2008 by Wang, Gelbart, and Hemmert. Gabor filter features were also observed to perform better with MLP-based tandem acoustic modeling in most of the experiments discussed here. The tandem approach, which uses the MLP as a kind of nonlinear discriminant analysis prior to Gaussian mixture modeling, has been a valuable technique for integrating novel features into GMM-based ASR systems. It has also been exploited to improve performance using conventional PLP features.
Compact acoustic models: Hybrid ASR using MLP has sometimes required fewer acoustic model parameters (in other words, a smaller acoustic model) than GMM-based systems with comparable performance.
Straightforward multi-stream ASR: Frame-level posterior probabilities produced by MLPs are simple to combine in multi-stream ASR approaches, in which the decisions of several classifiers running in parallel are combined. (It is also possible to combine GMM outputs in multi-stream approaches. However, posterior probabilities produced by MLPs can sometimes be easier to work with, since they always lie between 0 and 1.)
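To make the multi-stream point above concrete, the following Python sketch combines the posterior vectors that several classifiers produce for the same frame. It uses one common rule, a weighted average of log posteriors followed by renormalization; other rules, such as a plain average of the posteriors, are also possible. The function name, the stream weights, and the example values are assumptions made for illustration and are not taken from the ICSI tools.

```python
import math

def combine_posteriors(streams, weights=None, floor=1e-10):
    """Combine per-stream posteriors for one frame.

    Takes a weighted average of the log posteriors (a weighted geometric
    mean) and renormalizes so that the result sums to 1.
    """
    n_streams = len(streams)
    if weights is None:
        weights = [1.0 / n_streams] * n_streams
    n_classes = len(streams[0])
    log_combined = [
        sum(w * math.log(max(s[k], floor)) for w, s in zip(weights, streams))
        for k in range(n_classes)
    ]
    # Subtract the maximum before exponentiating, for numerical stability.
    peak = max(log_combined)
    unnormalized = [math.exp(v - peak) for v in log_combined]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical posteriors from two MLPs trained on different features.
stream_a = [0.6, 0.3, 0.1]
stream_b = [0.5, 0.1, 0.4]
print(combine_posteriors([stream_a, stream_b]))
```

Because every input already lies between 0 and 1, the streams need no rescaling before they are combined, which is the convenience described above.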
Frame-level labeling is required: While GMM/HMM systems can often be trained mainly or entirely with word-level labeling, frame-level labels are normally required to train our MLPs. However, we have not found this to be a serious inconvenience: if frame-level labels are not available, we generate them from word-level labels by running a trained speech recognition system in forced alignment mode.
More limited parallelism in training: Quicknet allows training to be parallelized across CPUs that share memory in a multiprocessor system. This is very useful for speeding up training, but in a networked environment it may not reach the training speed of some GMM/HMM systems, which can distribute training across a number of networked computers.
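For the frame-level labeling point above, the Python sketch below shows how the output of a forced alignment could be expanded into one label per frame. The segment format (label, start time, end time in seconds), the 10 ms frame step, and the "sil" filler label are assumptions made for illustration. The actual output format depends on the recognizer used for alignment.

```python
def segments_to_frame_labels(segments, n_frames, frame_step=0.010, fill="sil"):
    """Expand (label, start_sec, end_sec) segments into one label per frame.

    Frames not covered by any segment keep the `fill` label.
    """
    labels = [fill] * n_frames
    for label, start, end in segments:
        first = max(int(round(start / frame_step)), 0)
        last = min(int(round(end / frame_step)), n_frames)
        for t in range(first, last):
            labels[t] = label
    return labels

# Hypothetical alignment for a 10-frame utterance.
alignment = [("h", 0.00, 0.03), ("ay", 0.03, 0.08)]
print(segments_to_frame_labels(alignment, n_frames=10))
```

Each resulting label can then be paired with the feature vector of the same frame to form one MLP training example.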
This list of references has been deliberately kept short. Most of these publications, and newer publications on some topics, can be downloaded from the ICSI publications page.
RASTA: Hynek Hermansky and Nelson Morgan. RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, October 1994
MSG: Brian Kingsbury. Perceptually-inspired signal processing strategies for robust speech recognition in reverberant environments. PhD Dissertation. University of California at Berkeley, December 1998
PLP: Hynek Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738-1752, April 1990
Hybrid ASR: Nelson Morgan and Hervé Bourlard. An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition. IEEE Signal Processing Magazine, pp. 25-42, May 1995
Multi-stream ASR: Dan Ellis. Improved recognition by combining different features and different systems. Proc. AVIOS-2000, San Jose, May 2000
Tandem ASR: Hynek Hermansky, Dan Ellis and Sangita Sharma. Tandem connectionist feature stream extraction for conventional HMM systems. ICASSP-2000, Istanbul, June 2000
A. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S. Kajarekar, N. Morgan, and S. Sivadas. Qualcomm-ICSI-OGI Features for ASR. ICSLP-2002, Denver, Colorado, USA, September 2002
Q. Zhu, B. Chen, N. Morgan, and A. Stolcke. On using MLP features in LVCSR. Proc. Intl. Conf. Spoken Language Processing, Jeju, Korea, October 2004