Hidden Markov models (HMMs) are well known as the central magic in speech recognition. The valid subword (phone) state sequences that can form words, and the valid word sequences that can form utterances, are in effect assembled into a single large network of states, and the HMM decoder (e.g. Y0, noway or chronos) finds the most-likely path through this network, scoring each state according to a posterior probability from the acoustic model, and scoring each transition to other states according to the predefined constraints on pronunciation, lexicon and grammar.
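To make the "most-likely path" idea concrete, here is a minimal Viterbi sketch in Python. It is not the code from Y0, noway or chronos; the function name, the data layout (lists of per-frame log posteriors, a dict of allowed transitions) and the toy numbers are all assumptions made for illustration.

```python
import math

NEG_INF = -math.inf

def viterbi(log_post, log_trans, start, end):
    """Best-scoring path through a state network (illustrative sketch only).

    log_post[t][s]  : log posterior of state s at frame t (from the acoustic model)
    log_trans[r][s] : log probability of the transition r -> s; absent = not allowed
    start, end      : sets of states permitted at the first / last frame
    """
    states = range(len(log_post[0]))
    score = [log_post[0][s] if s in start else NEG_INF for s in states]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_post)):
        new_score, ptr = [], []
        for s in states:
            best_r, best = None, NEG_INF
            for r in states:
                if s in log_trans.get(r, {}) and score[r] > NEG_INF:
                    cand = score[r] + log_trans[r][s]
                    if cand > best:
                        best_r, best = r, cand
            new_score.append(best + log_post[t][s] if best_r is not None else NEG_INF)
            ptr.append(best_r)
        score = new_score
        back.append(ptr)
    last = max(end, key=lambda s: score[s])
    if score[last] == NEG_INF:
        return NEG_INF, []  # no valid path reaches an end state
    path = [last]
    for ptr in reversed(back):  # trace the backpointers to the first frame
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path

# Toy network: two states, left-to-right, three frames (made-up numbers).
log = math.log
toy_post = [[log(0.9), log(0.1)], [log(0.6), log(0.4)], [log(0.2), log(0.8)]]
toy_trans = {0: {0: log(0.5), 1: log(0.5)}, 1: {1: log(0.5)}}
best_score, best_path = viterbi(toy_post, toy_trans, start={0}, end={1})
print(best_path)
```

A real decoder works on a far larger network and prunes unlikely hypotheses rather than scoring every state at every frame, but the recursion is the same in spirit.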
In practice, however, we don't define one huge network, but break it down into various levels of abstraction. Word-sequence constraints are encoded in a language model or grammar file, which abstracts away the different pronunciations. Pronunciations are normally defined per word in a dictionary file in terms of phone units that can be more or less abstracted away from the bottom-level phone states. The final level of abstraction is the phone models themselves, which define the sequence of states, and the posterior probability associated with each state, that the actual decoder has to work through.
The phone models definition file does the slightly messy job of defining a set of arbitrary state networks (connecting start and end states via a limited set of transitions, each with an associated probability). It is described in several places, including the Y0 man page. It's not the kind of thing you'd want to write by hand, but if you want to have some kind of exotic state structure, this is how you can define it.
Y0 is often used for experiments with weird state structures because it actually omits the dictionary-pronunciation abstraction, and defines its pronunciations directly in terms of one of these HMM state files. This makes pronunciations very painful to edit. noway, by contrast, defines pronunciations in terms of certain phone symbols, then defines the states corresponding to the phones in a separate HMM model file. The entire HMM for the word is then simply constructed (at least notionally) by concatenating the phone models indicated in the pronunciation. Y0-style word models can (sometimes) be interconverted with noway-style phone models plus pronunciation dictionaries using the tools y0lex2noway, y0lex2nowayphones, and nowaylex2y0.
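The concatenation step can be sketched as follows. This is a notional illustration only: the in-memory layout (a dict of left-to-right state chains, each state a pair of posterior label and self-loop probability), the function name `build_word_hmm`, and the example phones and probabilities are all hypothetical, and none of it reflects the actual noway file formats.

```python
# Hypothetical in-memory phone models (NOT the noway file format):
# each model is a left-to-right chain of (posterior_label, self_loop_prob).
PHONE_MODELS = {
    "k":  [("k", 0.0), ("k", 0.5)],
    "ae": [("ae", 0.0), ("ae", 0.0), ("ae", 0.6)],
    "t":  [("t", 0.0), ("t", 0.5)],
}

def build_word_hmm(pronunciation, phone_models):
    """Concatenate the phone models named in a pronunciation.

    Returns (states, transitions), where transitions is a list of
    (from_index, to_index, probability). The index len(states) stands
    for the word's exit.
    """
    states = []
    for phone in pronunciation:
        states.extend(phone_models[phone])
    transitions = []
    for i, (_, self_loop) in enumerate(states):
        if self_loop > 0.0:
            transitions.append((i, i, self_loop))        # stay in this state
        transitions.append((i, i + 1, 1.0 - self_loop))  # advance to next state / exit
    return states, transitions

# A made-up dictionary entry: "cat" pronounced as k ae t.
cat_states, cat_transitions = build_word_hmm(["k", "ae", "t"], PHONE_MODELS)
print([label for label, _ in cat_states])
```

Editing the pronunciation then means changing one list of phone symbols, rather than rewriting the state network by hand.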
HMM phone models are usually pretty simple, consisting of a number of repeated states associated with the same phone posterior. The main variable is how many times the state is repeated; this constitutes the minimum duration (in frames) of that phone symbol. The exit probability of the final state (and its complement, the self-loop probability) also affects the distribution of durations that would be generated by the model, and thus can be used for duration modeling.
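To see how the repeat count and the exit probability shape the durations, here is a small Python sketch. It assumes the simplest case: a chain of n_states states in which only the final state has a self-loop and every earlier state always advances after one frame. Under that assumption the duration is n_states plus a geometrically distributed number of extra frames. The function names are made up for this example.

```python
def duration_pmf(n_states, exit_prob, max_frames):
    """P(duration = d) for d = 1..max_frames (list index 0 corresponds to d = 1).

    Assumes only the final state can repeat, with self-loop
    probability 1 - exit_prob.
    """
    self_loop = 1.0 - exit_prob
    pmf = []
    for d in range(1, max_frames + 1):
        if d < n_states:
            pmf.append(0.0)  # shorter than the minimum duration
        else:
            pmf.append(self_loop ** (d - n_states) * exit_prob)
    return pmf

def expected_duration(n_states, exit_prob):
    """Mean duration in frames: the minimum plus the mean number of self-loops."""
    return n_states + (1.0 - exit_prob) / exit_prob

# A 3-state model whose final state exits with probability 0.5.
print(duration_pmf(3, 0.5, 6))
print(expected_duration(3, 0.5))
```

Raising the repeat count shifts the whole distribution to longer durations, while lowering the exit probability stretches its tail.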
Sometimes we use context-dependent duration models, where the 'eh' phone in one word might be given a different minimum duration than in a different word. This is done by defining two separate models in the HMM phone models file, e.g. 'eh3' and 'eh4'.
More recently, phonemodel files have been done away with entirely in favor of the phi file, as described in the phi file page of this FAQ.
Generated by build-faq-index on Tue Mar 24 16:18:15 PDT 2009