ICSI Speech FAQ:
2.2 What are the basic approaches to speech recognition?
Answer by: dpwe - 2000-07-22
Essentially all speech recognition systems use the same basic three-stage
architecture:
- Feature extraction, in which the raw acoustic waveform
is re-represented in a more useful space, typically a
low-dimensional feature space based on coarse spectral
measurements over a 10-50 ms time window.
- Probabilistic classification of the feature vectors,
in which each frame is scored according to how likely it is
to be an instance of each of a number of predefined subword
linguistic units.
- Search for the best word-sequence hypothesis, in which a
word sequence is found that is consistent with the constraints of
lexicon and grammar, and which corresponds to a subword-unit
sequence that is highly ranked in the classifier output.
These stages are illustrated in the overview block diagram:
[Block diagram: waveform -> feature extraction -> probabilistic
classification -> hypothesis search -> recognized word sequence]
Of course, particular systems may blur the boundaries between these stages,
for instance by performing the subword likelihood estimation as part of the
search for well-matched word sequences.
Systems can vary at any of these stages.
- For feature extraction, there are any number of different
algorithms for deriving feature vectors from speech. They differ
in their ability to emphasize linguistically relevant information
over irrelevant information (such as speaker identity), in their
robustness to noise and distortion, and in their ability
to produce vectors that make the job of classification
easier. However, since the result is in every case a feature
space in which classification is performed, varying the feature
extraction has little impact on the overall recognizer architecture;
it mainly affects the accuracy. (A minimal feature-extraction
sketch appears after this list.)
- Classification can be done by any of the techniques known to
pattern recognition. Early speech recognizers used vector
quantization to reduce the speech features to a discrete
set of codewords, then learned associations between particular
codewords and particular subword units. Modern systems model
the continuous feature-vector space directly, most
often by parametric modeling of the distribution associated
with each speech class, typically with Gaussian mixture models
(GMMs); a GMM scoring sketch also follows this list.
A significant alternative is the use of neural networks to
classify a speech vector (or a short sequence of vectors)
discriminatively, estimating the probability that it arises
from each of the classes.
- Hypothesis search is usually performed within the framework of
hidden Markov models (HMMs), a formulation that expresses the
constraints of pronunciation (lexicon or dictionary) and word
sequence (grammar) in a single finite-state network, for which
efficient search and training algorithms are known; a Viterbi
decoding sketch follows this list as well. The main alternative is
dynamic time warping, the simpler predecessor to HMMs,
in which template models are stretched and compressed in time
to match the observed features.
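The following is a minimal sketch of the frame-based feature extraction
described above, not any particular ICSI front end; the 16 kHz sample rate,
25 ms window, 10 ms hop, the band count, and the name spectral_features are
all illustrative assumptions.

    import numpy as np

    def spectral_features(signal, sample_rate=16000,
                          win_ms=25.0, hop_ms=10.0, n_bands=20):
        """Slice the waveform into overlapping windows and reduce each
        window's power spectrum to a handful of coarse band energies."""
        win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
        hop = int(sample_rate * hop_ms / 1000)   # frame advance
        frames = []
        for start in range(0, len(signal) - win + 1, hop):
            frame = signal[start:start + win] * np.hamming(win)
            power = np.abs(np.fft.rfft(frame)) ** 2
            # Coarse spectral measurement: log mean power in n_bands bands.
            bands = np.array_split(power, n_bands)
            frames.append(np.log([b.mean() + 1e-10 for b in bands]))
        return np.array(frames)                  # shape (n_frames, n_bands)

Each row of the result is one frame's low-dimensional feature vector,
ready to be handed to the classification stage.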
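In the same hedged spirit, here is a sketch of GMM-based frame
classification, assuming scikit-learn is available and that training
features have already been grouped by subword unit; train_acoustic_models
and frame_log_likelihoods are hypothetical names, not an ICSI API.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_acoustic_models(features_by_unit, n_components=4):
        """Fit one Gaussian mixture per subword unit, given e.g.
        {'ah': (N_ah, dim) array, 'k': (N_k, dim) array, ...}."""
        return {unit: GaussianMixture(n_components).fit(X)
                for unit, X in features_by_unit.items()}

    def frame_log_likelihoods(models, frames):
        """Score every frame against every unit's mixture.
        Returns (n_frames, n_units) log p(frame | unit) and the unit order."""
        units = sorted(models)
        scores = np.stack([models[u].score_samples(frames) for u in units],
                          axis=1)
        return scores, units

A neural-network classifier would replace these per-class likelihood models
with a single discriminative model that outputs per-class probabilities.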
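Finally, a minimal log-domain Viterbi sketch of the HMM search stage; in a
real recognizer the state network encodes the lexicon and grammar, whereas
the transition matrix and initial probabilities here are just placeholders.

    import numpy as np

    def viterbi(log_obs, log_trans, log_init):
        """log_obs:   (T, S) per-frame log-likelihoods for S states
           log_trans: (S, S) state-transition log-probabilities
           log_init:  (S,)   initial-state log-probabilities
           Returns the highest-scoring state sequence."""
        T, S = log_obs.shape
        delta = log_init + log_obs[0]         # best score ending in each state
        back = np.zeros((T, S), dtype=int)    # backpointers for path recovery
        for t in range(1, T):
            cand = delta[:, None] + log_trans # score via every predecessor
            back[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + log_obs[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):         # trace the best path backwards
            path.append(int(back[t, path[-1]]))
        return path[::-1]

Mapping the recovered state sequence back through the network yields the
word hypothesis; dynamic time warping can be seen as essentially the same
alignment search performed against template frames instead of HMM states.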
At ICSI we have historically used neural nets as our acoustic models
- the so-called 'hybrid connectionist' approach pioneered by Morgan and
Bourlard - rather than the more common Gaussian mixture models. For a
discussion of why, see the next FAQ answer (2.3, Why do we use
connectionist rather than GMM?).