ICSI Speech FAQ:
7.1 What does it mean to train a speech recognizer?

Answer by: dpwe - 2000-08-12


Training generally refers to the adjustment of a system with reference to a body of examples. In the classic pattern recognition problem, where input patterns are to be labelled as members or non-members of some specific class, training is the process by which certain classification algorithms can learn to do this task through exposure to a corpus of positive and negative examples.

Training is in contrast to direct design and implementation, where a human expert directly encodes rules to achieve the classification goal. Most practical classification systems involve a certain amount of design, specifying the space of all possible classifications a system could perform, followed by a training stage to set various free parameters and determine the actual classification performed. The number of parameters set in this way may vary from a single threshold to the millions of weights in our neural nets.
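
As a toy illustration of the single-threshold extreme (not drawn from any actual ICSI system), training can be as simple as scanning labelled examples for the cutoff that minimizes classification errors:

    # Illustrative extreme of the "free parameter" spectrum: a classifier
    # whose only trainable parameter is a single threshold, chosen as the
    # candidate value that minimizes errors on labelled examples.
    import numpy as np

    def train_threshold(scores, labels):
        candidates = np.sort(scores)
        errors = [np.mean((scores > t) != labels) for t in candidates]
        return candidates[int(np.argmin(errors))]

    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
    labels = np.array([False, False, False, True, True, True])
    print(train_threshold(scores, labels))  # 0.4: separates the two classes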

The algorithm used to set these parameters is of course a very important design problem in itself. Speech recognition makes extensive use of the EM algorithm, a general iterative procedure for finding model parameters that maximize the likelihood of some example data, even when some of the relevant variables are hidden (for instance, which subword unit each frame of training speech actually belongs to).
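
The sketch below shows EM fitting a two-component one-dimensional Gaussian mixture. It is purely illustrative: real acoustic-model training involves many mixture components, high-dimensional features, and numerical safeguards, but the alternation between an E-step (compute posterior "responsibilities" under the current parameters) and an M-step (re-estimate parameters from those responsibilities) is the same.

    # Minimal sketch of EM for a two-component 1-D Gaussian mixture.
    import numpy as np

    def em_gmm(x, n_iter=50):
        # Crude initialization: put the two means at the data extremes.
        mu = np.array([x.min(), x.max()], dtype=float)
        var = np.array([x.var(), x.var()])
        w = np.array([0.5, 0.5])
        for _ in range(n_iter):
            # E-step: posterior probability of each component per point.
            lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                    / np.sqrt(2 * np.pi * var)
            resp = lik / lik.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means, variances from the
            # responsibility-weighted data.
            n_k = resp.sum(axis=0)
            w = n_k / len(x)
            mu = (resp * x[:, None]).sum(axis=0) / n_k
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
        return w, mu, var

    x = np.concatenate([np.random.normal(-2, 1, 500),
                        np.random.normal(3, 1, 500)])
    print(em_gmm(x))  # recovers weights near 0.5, means near -2 and 3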

When we speak of training a speech recognizer, this normally refers to re-estimating the acoustic model classifier, and perhaps adjusting the relative weights for alternative pronunciations in the dictionary. However, all stages of the model have been trained, explicitly or implicitly, at some point in their development. As explained in the FAQ page on approaches to speech recognition, there are three basic stages to a recognizer:

Feature extraction is often taken as fixed for a given problem, although some parameters in the original Rasta algorithm were optimized for a particular task, and more recent work on LDA feature analysis (at OGI and by Mike Shire) produces feature representations that are trained on labelled examples (e.g. to maximize the discrimination between different classes). The Tandem modeling approach can be viewed as using a neural net for feature extraction, and is most certainly trained.
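
For concreteness, here is a generic short-time spectral front end (framing, windowing, log power spectrum) of the kind usually treated as fixed. This is not Rasta, PLP, or the LDA features mentioned above; the function is a made-up sketch that just shows the shape of the computation:

    # Generic short-time spectral features: frame the waveform, window
    # each frame, and take the log power spectrum. Illustrative only.
    import numpy as np

    def log_spectral_features(signal, sr=16000, frame_ms=25, hop_ms=10):
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        window = np.hamming(frame)
        feats = []
        for start in range(0, len(signal) - frame + 1, hop):
            spectrum = np.abs(np.fft.rfft(signal[start:start + frame] * window))
            feats.append(np.log(spectrum ** 2 + 1e-10))  # floored log power
        return np.array(feats)  # shape: (n_frames, frame // 2 + 1)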

Speech sound classification is the part of the recognizer most often meant when training is mentioned. Typically this is a Gaussian mixture model or a neural network, trained on a very large body of examples to estimate the probability that a given segment of previously-unseen speech corresponds to each of the predefined subword units used within the recognizer. See the FAQ page on training neural networks.
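
A minimal sketch of that classification step might look like the following: a one-hidden-layer net maps a feature frame to posterior probabilities over a hypothetical five-unit subword inventory. The weights here are random placeholders; in a real system they are exactly what training estimates.

    # Sketch of frame classification: features in, P(unit | frame) out.
    import numpy as np

    UNITS = ["ah", "iy", "s", "t", "sil"]  # hypothetical subword inventory

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((100, 20)), np.zeros(20)  # 100-dim features
    W2, b2 = rng.standard_normal((20, len(UNITS))), np.zeros(len(UNITS))

    def posteriors(frame):
        hidden = np.tanh(frame @ W1 + b1)
        logits = hidden @ W2 + b2
        e = np.exp(logits - logits.max())     # numerically stable softmax
        return dict(zip(UNITS, e / e.sum()))  # posterior per subword unit

    print(posteriors(rng.standard_normal(100)))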

Finally, the HMM decoding relies on probabilistic models of the allowable pronunciations (and which words they represent), and the possible word sequences (i.e. the problem grammar) to find the most likely word sequence. Pronunciation models are often initialized from pre-defined dictionaries and linguistic rules, but are then pruned or weighted in a training stage that sees how well each candidate pronunciation matches real data. Grammars (language models) rely almost entirely on observed counts from very large text corpora to estimate the relative probability of different word sequences.
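
Count-based language modeling is easy to sketch: bigram probabilities are estimated from observed counts in a corpus, with a smoothing scheme (add-one here, purely for illustration; practical systems use more refined methods) so that unseen word pairs keep nonzero probability.

    # Bigram language model estimated from counts, with add-one smoothing.
    from collections import Counter

    corpus = "the cat sat on the mat the cat ran".split()  # toy corpus
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    vocab = len(unigrams)

    def p_bigram(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

    print(p_bigram("the", "cat"))  # observed often, so relatively likely
    print(p_bigram("cat", "the"))  # never observed, but nonzero after smoothing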


Previous: 6.10 What is posterior combination? What other kinds of combination are possible? - Next: 7.2 I just got this new data. How can I start training from it?
Back to ICSI Speech FAQ index
