This page takes you through the basic steps involved in training and testing a speech recognizer using the ICSI speech tools. In essence, we will go through each of the boxes outlined in the block diagram on the overview page.
For your first standard training experience, let's use a simple but non-trivial speech database: the coresubset of Numbers.
Numbers V1.0 was collected by the Oregon Graduate Institute's Center for Spoken Language Understanding. It comprises 10 hours of speech collected from the public over telephone lines. Callers responded to census prompts; OGI later clipped out the sections containing numbers and gathered them into a distinct corpus. OGI has phonetically hand-transcribed about half of the utterances.
At ICSI, you can find the complete set of Numbers files in /u/drspeech/data/NUMBERS95.
For some experiments at ICSI, we further reduced the Numbers set to about 3.5 hours of speech data by limiting membership in the "coresubset" to those utterances with the following qualities:
Historically, there have been several different tools for calculating features at ICSI. If you are using one of the standard feature sets, it is easiest to use feacalc. The man page includes the command to produce one of the sample p-files mentioned below.
Sample RASTA-PLP files can be found in the example directory /u/sulin/speech/miscellaneous/example/ftrarch. There is one for training (numbers_cs+train+cv+r8+w25+s10+F+M+e+d.pfile), one for cross-validation (numbers_cs+cv+r8+w25+s10+F+M+e+d.pfile), one for developing recognition systems (numbers_cs+dev+r8+w25+s10+F+M+e+d.pfile) and one reserved for final evaluation (numbers_cs+test+r8+w25+s10+F+M+e+d.pfile). Note that the training file includes a copy of the cross-validation file.
The y0 decoder, written successively by Yochai Konig, Chuck Wooters and Mike Hochberg, is a simple Viterbi (dynamic programming) decoder. Y0's virtues are that it is simple and robust, and that it can perform forced alignment. Its disadvantages are that it is not really being supported (we would rather retire it) and that it is limited to bigram grammars.
The noway decoder, written by Steve Renals, is a much more complex stack decoder (A*). It has many useful features (lattice generation and n-gram grammars, for example) and is being actively supported and improved; we are moving towards it as the standard decoder once y0 is taken out of service. The disadvantage is that the stack decoding algorithm can be somewhat sensitive to pruning parameters and unusual input.
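To make the "dynamic programming" idea behind a Viterbi decoder like y0 concrete, here is a generic sketch (this is an illustration of the algorithm, not y0's actual code; the states, transition probabilities, and per-frame emission probabilities are invented for the example):

```python
import math

def viterbi(obs_probs, trans, init):
    """Find the most likely state sequence by dynamic programming.

    obs_probs[t][s] -- probability of frame t's observation in state s
    trans[p][s]     -- probability of moving from state p to state s
    init[s]         -- probability of starting in state s
    """
    n_states = len(init)
    # Work in log-probabilities to avoid underflow over long utterances.
    score = [math.log(init[s]) + math.log(obs_probs[0][s])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at frame t+1
    for t in range(1, len(obs_probs)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + math.log(trans[p][s]))
            ptr.append(best_prev)
            new_score.append(score[best_prev]
                             + math.log(trans[best_prev][s])
                             + math.log(obs_probs[t][s]))
        back.append(ptr)
        score = new_score
    # Trace back from the best final state to recover the path.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```

In a real decoder the states come from the phone models in the lexicon and the transition scores fold in the language model, but the maximize-and-trace-back structure is the same.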
Since many of the input files are formatted differently for noway than for y0, you must select the file formats that match the decoder you plan to use.
The language model (also called the "grammar") provides an estimate of the prior probability of a string of words. We typically calculate this using the word strings from a corpus's training set. An "N-gram grammar" estimates the probability of each word given the preceding N-1 words, with the estimates derived from the training data; N-grams are the most frequently used form of grammar in current speech recognition engines.
For Numbers, we typically use a bigram grammar; that is, the language model specifies the probability that a certain word follows a certain other word.
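Conceptually, estimating a bigram grammar from the training word strings is just relative-frequency counting over word pairs. A toy sketch (the training sentences here are invented; real grammars also need smoothing for unseen pairs, and the actual y0/noway file formats are produced by make_bigram, not by this code):

```python
from collections import Counter

def bigram_probs(sentences):
    """Estimate P(w2 | w1) by relative frequency, adding <s> and </s>
    markers so sentence-initial and sentence-final words are modeled."""
    pair_counts = Counter()
    unigram_counts = Counter()
    for words in sentences:
        seq = ["<s>"] + words + ["</s>"]
        for w1, w2 in zip(seq, seq[1:]):
            pair_counts[(w1, w2)] += 1
            unigram_counts[w1] += 1
    return {pair: count / unigram_counts[pair[0]]
            for pair, count in pair_counts.items()}

train = [["two", "five"], ["two", "oh"], ["three", "five"]]
lm = bigram_probs(train)
# e.g. lm[("two", "five")] is 0.5: "two" occurs twice,
# and is followed by "five" once.
```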
Sample y0 and noway bigram grammars can be found in /u/sulin/speech/miscellaneous/example/lm. The file numbers_cs_train.ybigram is formatted for the y0 decoder. The file numbers_cs_train.nbigram is the same grammar, but formatted for the noway decoder. These were created using the procedure outlined in the make_bigram man page.
A lexicon defines the pronunciation of a word and includes information like how long each phone of the word is. At ICSI we use two forms, a y0-style lexicon and a noway-style lexicon. Equivalent information can be represented in both formats. A y0-style lexicon includes both the phone sequence and phone durations in one file. The noway-style representation keeps the phone sequence in a separate file from the phone durations.
The man pages for y0 and noway describe the format in detail. In /u/sulin/speech/project/numbers_cs/400HU/lex, the y0-style lexicon gildea90per-iter0.lex is equivalent to the noway-style lexicon defined by the pair of files gildea90per-iter0.dict and gildea90per-iter0.phone.
Here are a few useful things to do with lexicons (Eric wrote the scripts that do everything.)
In this example, the first field is the sentence number, the second is the starting frame of the word, and the third is the ending frame of the word. These are followed by the recognized word and then the pronunciation labels. Note that you can create custom lexicons simply by writing pronunciations in the same .word format.
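Given that field layout, reading a .word line back into its parts is straightforward. A minimal sketch, assuming whitespace-separated fields as described above (the sample line and its frame values are invented for illustration):

```python
def parse_word_line(line):
    """Split one .word line into its fields: sentence number,
    start frame, end frame, recognized word, and the remaining
    fields as the pronunciation labels."""
    fields = line.split()
    return {
        "sentence": int(fields[0]),
        "start": int(fields[1]),
        "end": int(fields[2]),
        "word": fields[3],
        "phones": fields[4:],  # pronunciation labels, one per phone
    }

record = parse_word_line("3 10 42 five f ay v")
# record["word"] is "five" and record["phones"] is ["f", "ay", "v"]
```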
Babylex can then build a duration-less lexicon in y0 style or noway style. To add durations, you first need estimates of phonetic durations. One source of these is calcdurs, which calculates phone durations from the .word files you provide. Once you have durations, use mpadddur to add them to the duration-less y0-style file produced by babylex, giving you a complete y0-style lexicon.
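The duration-estimation step amounts to averaging each phone's observed length over the training alignments. The following is a simplified sketch of that idea, not calcdurs itself; it assumes the alignments have already been reduced to (phone, start_frame, end_frame) triples, and the sample segments are invented:

```python
from collections import defaultdict

def mean_phone_durations(segments):
    """Average the duration (in frames) of each phone across a set
    of aligned segments given as (phone, start, end) triples."""
    totals = defaultdict(int)   # total frames observed per phone
    counts = defaultdict(int)   # number of occurrences per phone
    for phone, start, end in segments:
        totals[phone] += end - start
        counts[phone] += 1
    return {phone: totals[phone] / counts[phone] for phone in totals}

segs = [("f", 0, 5), ("ay", 5, 15), ("f", 20, 27)]
durs = mean_phone_durations(segs)
# "f" appears twice (5 and 7 frames), so durs["f"] is 6.0
```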
To summarize, the steps are:
Now suppose you don't have any lexicons or phones to use to create initial pronunciations for your lexicon. There are several sources for pronunciations. /u/drspeech/data/dict/master contains a collection of pronunciations from many different dictionaries. These can be used to bootstrap a new lexicon.
Use qnstrn. This is the underlying C program for training, and it can run on a host CPU, on a SPERT board, or on both.
An example 400HU net trained for Numbers using the example pfiles above can be found in /u/sulin/speech/miscellaneous/example/net.
Use dr_recog. This script takes care of running qnsfwd to do the forward-pass of the network classifier, and will also run a scoring program on the results (see below).
That concludes all the steps involved in training a basic recognizer. At this stage, we might consider changing our input feature processing or other parameters of the process and trying again to see if we can improve performance. An alternative route to improvement is to use the more complex procedure known as embedded training, which is described in the next page.