ICSI Speech FAQ:
9.1 What is decoding?

Answer by: dpwe - 2000-08-14


In the three-stage overview of speech recognition, the first two stages calculate probabilities that a particular frame of the input speech corresponds to each of a specific set of speech sounds, and the final stage, the HMM decoder, finds the most likely word sequence given the per-frame, per-sound probability matrix, in light of the phone models, the word pronunciations in the dictionary, and the word-sequence constraints described by the grammar.

The term "decoder" doesn't really give a good impression of what this process consists of; it is derived from the origins of this kind of stochastic model fitting in coding theory. However, the other term commonly used to describe this part of the chain, search, gives a very good idea of what is involved: The decoder essentially searches over all possible alignments of all possible pronunciations of all possible word sequences to find the most likely one.

The hidden Markov model for speech makes a couple of assumptions to give a very simple expression for the probability of acoustic sequence given a particular state-labeling: it is simply the product of the probabilities of each frame's acoustic features corresponding to that frame's label (i.e. the result of the acoustic classifier, known as the acoustic score), multiplied by the probability of each frame's label given the previous label (which may include rather more state than simply the label, such as how many steps have been spent within that phone, which word is being modeled, and what the preceding words are). This second component depends only on the a-priori structural constraints of the pronunciations and grammar, and is called the language score. Although these two components (acoustic and language score) are directly combined in the pure HMM theory, in practice a log-domain scale factor is usually introduced, known as the acoustic model scale factor or the language model scale factor (depending on how it is applied).

So it is theoretically simple to compare all the possible labelings to find the most likely. In practice, however, with 50-100 frames per second, and a minimum of 30-60 distinct labels, the number of possible alignments rapidly becomes too large to handle by any imaginable hardware. The name of the game, then, is pruning - limiting to search to the vanishingly small proportion of possible alignments that actually show some promise for a high score given the acoustics. All decoders perform pruning of various kinds; most also have some kind of parameter to trade execution time -- essentially how thoroughly the hypothesis space will be searched -- for final overall performance -- how close to the absolute best utterance you get. (This source of error -- the failure to find the most likely utterance due to hypothesis space pruning -- is known as 'search error', and is normally kept very small by empirically increasing the decode search time until the marginal improvement is negligable. However, for certain applications, such as real-time operation, significant search error may be an acceptable compromise.)

Search time very often dominates the recognition processing, and for large-vocabulary tasks, the amount of process space required to hold all the intermediate hypotheses can reach hundreds of megabytes (this is why SWEDE has 768MB of RAM and GIN has 1GB - we needed them for Broadcast News decodes using Noway). Thus, decoding is a big resource bottleneck, and decoders are typically fairly complicated affairs, making all kinds of engineering comprimises and tricks to optimize performance and simplify the representation. What started out as a one-line equation, max(P(W|X)), turns into this huge matrix of heuristics and hierarchy of representations (phones, words, utterances) to find the end result. That's why we have such a messy situation at ICSI, with at least three very different decoder programs in use, as described in the following page.


Previous: 8.3 How do I build an n-gram grammar for noway or chronos? - Next: 9.2 What are the decoders we use at ICSI?
Back to ICSI Speech FAQ index

Generated by build-faq-index on Tue Mar 24 16:18:17 PDT 2009