Standard training of a speech recognizer

This page takes you through the basic steps involved in training and testing a speech recognizer using the ICSI speech tools. In essence, we will go through each of the boxes outlined in the block diagram on the overview page.

Choosing a database

There are a number of speech databases available at ICSI.

For your first standard training experience, let's use a simple, but non-trivial speech database, the coresubset of Numbers.

Numbers V1.0 was collected by Oregon Graduate Institute's Center for Spoken Language Understanding. It is composed of 10 hours of speech collected from the public over telephone lines. Callers responded to census prompts; OGI later clipped out the sections containing numbers and gathered these into a distinct corpus. OGI has phonetically hand-transcribed about half of the utterances.

At ICSI, you can find the complete set of Numbers files in /u/drspeech/data/NUMBERS95.

For some experiments at ICSI, we further reduced the Numbers set to about 3.5 hours of speech data by limiting membership in the "coresubset" to those utterances with the following qualities:

These files are under directories with "cs" in the directory name, as in /u/drspeech/data/NUMBERS95/phnfile/cs.

Building a feature archive (p-file)

When building a feature file, you have to make some choices:

Historically, there have been several different tools for calculating features at ICSI. If you are using one of the standard feature sets, it is easiest to use feacalc. The man page includes the command to produce one of the sample p-files mentioned below.

Sample RASTA-PLP files can be found in the example directory /u/sulin/speech/miscellaneous/example/ftrarch. There is one for training (numbers_cs+train+cv+r8+w25+s10+F+M+e+d.pfile), one for cross-validation (numbers_cs+cv+r8+w25+s10+F+M+e+d.pfile), one for developing recognition systems (numbers_cs+dev+r8+w25+s10+F+M+e+d.pfile) and one reserved for final evaluation (numbers_cs+test+r8+w25+s10+F+M+e+d.pfile). Note that the training file includes a copy of the cross-validation file.
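The fields in those filenames appear to encode the analysis parameters: r8 for an 8 kHz sample rate, w25 for a 25 ms analysis window, and s10 for a 10 ms step. As a rough sketch of the framing arithmetic (the exact edge handling is up to feacalc), the number of feature frames per utterance works out as:

```python
def num_frames(n_samples, sample_rate=8000, win_ms=25, step_ms=10):
    """Number of full analysis frames in a signal (no padding),
    mirroring the r8+w25+s10 settings in the example pfile names."""
    win = int(sample_rate * win_ms / 1000)    # 200 samples at 8 kHz
    step = int(sample_rate * step_ms / 1000)  # 80 samples at 8 kHz
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // step

# one second of 8 kHz speech -> 98 full frames at a 10 ms hop
print(num_frames(8000))
```

So a feature archive for 3.5 hours of speech holds on the order of a million frames, each carrying one feature vector.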


Choosing a decoder

At ICSI we usually use either the y0 decoder or the noway decoder.

The y0 decoder, written successively by Yochai Konig, Chuck Wooters and Mike Hochberg, is a simple Viterbi (dynamic programming) decoder. Y0's virtues are that it is simple and robust, and it can do forced alignment. Its disadvantages are that it is no longer really supported (we would rather retire it) and that it is limited to bigram grammars.

The noway decoder, written by Steve Renals, is a much more complex stack (A*) decoder. It has many useful features (lattice generation and n-gram grammars, for example), is being actively supported and improved, and we are moving towards it as the standard decoder once y0 is taken out of service. Its disadvantage is that the stack decoding algorithm can be somewhat sensitive to pruning parameters and strange input.

Since many of the input files are formatted differently for noway than for y0, you have to select which file formats you will need for your decoding task.
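To make the Viterbi idea concrete, here is a toy dynamic-programming search over a plain HMM, sketched in Python. A real decoder like y0 additionally folds in the lexicon, phone durations and the bigram grammar; this shows only the core recursion.

```python
def viterbi(obs_logprobs, trans_logprobs, init_logprobs):
    """Minimal Viterbi dynamic-programming search, the algorithm at
    the heart of a decoder like y0 (toy version: bare HMM, log domain,
    no lexicon or grammar constraints).
    obs_logprobs:   T x S per-frame state log-likelihoods
    trans_logprobs: S x S matrix, trans[i][j] = log P(j | i)
    init_logprobs:  length-S initial state log-probabilities
    Returns the single best state sequence."""
    T, S = len(obs_logprobs), len(init_logprobs)
    delta = [init_logprobs[s] + obs_logprobs[0][s] for s in range(S)]
    back = []
    for t in range(1, T):
        new_delta, bp = [], []
        for j in range(S):
            # best predecessor state for state j at time t
            best_i = max(range(S),
                         key=lambda i: delta[i] + trans_logprobs[i][j])
            bp.append(best_i)
            new_delta.append(delta[best_i] + trans_logprobs[best_i][j]
                             + obs_logprobs[t][j])
        delta = new_delta
        back.append(bp)
    # backtrace from the best final state
    state = max(range(S), key=lambda s: delta[s])
    path = [state]
    for bp in reversed(back):
        state = bp[state]
        path.append(state)
    return list(reversed(path))
```

The same recursion, run with word-end bookkeeping, is what turns per-frame phone probabilities into a word string.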

Language model

The language model (also called the "grammar") provides an estimate of the prior probability of a string of words. We typically calculate this from the word strings of a corpus's training set. In an "N-gram grammar", the probability of each word given the preceding N-1 words is estimated from the training data; these are the most frequently used forms of grammar in current speech recognition engines.

For Numbers, we typically use a bigram grammar; that is, the language model specifies the probability that a certain word follows a certain other word.

Sample y0 and noway bigram grammars can be found in /u/sulin/speech/miscellaneous/example/lm. The file numbers_cs_train.ybigram is formatted for the y0 decoder. The file numbers_cs_train.nbigram is the same grammar, but formatted for the noway decoder. These were created using the procedure outlined in the make_bigram man page.
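The estimation behind a bigram grammar can be sketched in a few lines of Python. This is plain maximum-likelihood counting; the actual make_bigram procedure may also apply smoothing or backoff for word pairs never seen in training.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram estimates P(w2 | w1) from training
    word strings, padded with <s>/</s> sentence markers. A production
    tool would also smooth the estimates; this sketch does not."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])               # count conditioning words
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

lm = train_bigram([["one", "two"], ["one", "three"]])
print(lm[("<s>", "one")])   # 1.0: every training sentence starts with "one"
print(lm[("one", "two")])   # 0.5: "one" is followed by "two" half the time
```

The resulting probabilities are what get written out, in decoder-specific formats, as the .ybigram and .nbigram files.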

Word models

A lexicon defines the pronunciation of a word and includes information such as how long each phone of the word lasts. At ICSI we use two forms, a y0-style lexicon and a noway-style lexicon; equivalent information can be represented in both formats. A y0-style lexicon includes both the phone sequence and the phone durations in one file, while the noway-style representation keeps the phone sequence in a separate file from the phone durations.

The man pages for y0 and noway describe the format in detail. In /u/sulin/speech/project/numbers_cs/400HU/lex, the y0-style lexicon gildea90per-iter0.lex is equivalent to the noway-style lexicon defined by the pair of files gildea90per-iter0.dict and

Here are a few useful things to do with lexicons (Eric wrote the scripts that do everything.)

Now suppose you don't have any lexicons or phone sets with which to create initial pronunciations for your lexicon. There are several sources for pronunciations: /u/drspeech/data/dict/master contains a collection of pronunciations from many different dictionaries. These can be used to bootstrap a new lexicon.
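The bootstrapping step amounts to a lookup: for each word in the new task's vocabulary, pull any known pronunciations from the master collection and flag the rest for hand entry. The word-to-phone-string format below is invented for illustration and is not the actual layout of the master dictionary files.

```python
def bootstrap_lexicon(wordlist, master_dict):
    """Pull initial pronunciations for a new task's word list out of
    a master pronunciation collection. master_dict maps each word to
    a list of phone strings (an assumed format, for illustration)."""
    lexicon, missing = {}, []
    for word in wordlist:
        prons = master_dict.get(word)
        if prons:
            lexicon[word] = prons
        else:
            missing.append(word)   # needs a hand-written pronunciation
    return lexicon, missing

master = {"one": ["w ah n"], "two": ["t uw"], "three": ["th r iy"]}
lex, todo = bootstrap_lexicon(["one", "two", "oh"], master)
print(todo)   # ['oh'] has no entry and must be added by hand
```

Once the initial pronunciations are in place, the durations can be filled in with uniform defaults and refined later by alignment.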

Training the classifier network

Use qnstrn. This is the underlying C program for training; it runs either on a host CPU or on a SPERT board.


An example 400HU net trained for Numbers using the example pfiles above can be found in /u/sulin/speech/miscellaneous/example/net.
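For intuition about what qnstrn is doing, here is a toy one-hidden-layer net trained with backpropagation under a cross-entropy error on frame labels, in Python with numpy. This is only the bare recipe; the real trainer adds details such as bunch-mode updates, a cross-validation-driven learning-rate schedule, and SPERT support.

```python
import numpy as np

def train_mlp(X, labels, n_hidden=8, n_out=3, lr=0.5, epochs=300, seed=0):
    """Toy frame classifier: one tanh hidden layer, softmax output,
    full-batch gradient descent on the cross-entropy error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.5, (n_hidden, n_out));      b2 = np.zeros(n_out)
    Y = np.eye(n_out)[labels]                        # one-hot frame targets
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                     # hidden activations
        Z = H @ W2 + b2
        P = np.exp(Z - Z.max(1, keepdims=True))
        P /= P.sum(1, keepdims=True)                 # softmax posteriors
        dZ = (P - Y) / len(X)                        # cross-entropy gradient
        dH = (dZ @ W2.T) * (1 - H**2)                # backprop through tanh
        W2 -= lr * H.T @ dZ; b2 -= lr * dZ.sum(0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)
    # cross-entropy on the training frames, from the last forward pass
    loss = -np.log(P[np.arange(len(X)), labels]).mean()
    return (W1, b1, W2, b2), loss
```

The trained network's softmax outputs play the role of the per-frame phone posteriors that the decoder consumes.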

Running the recognizer

Use dr_recog. This script takes care of running qnsfwd to do the forward-pass of the network classifier, and will also run a scoring program on the results (see below).


Evaluating the performance

Now you've got recognized words. To score them against the correct words, we usually use the wordscore program. For fancier output, sclite can also be used. See Alfred's icsi2sclite for a quick way to convert wordscore-compatible formats to sclite-compatible formats.
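Under the hood, scoring is a minimum-edit-distance alignment between the reference and recognized word strings; the word error rate is the total of substitutions, insertions and deletions divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(ref, hyp):
    """Word error rate via minimum edit distance, the measure a
    scoring tool like wordscore or sclite reports. ref and hyp are
    lists of words; returns (subs + ins + dels) / len(ref)."""
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                 # delete everything in ref
    for j in range(H + 1):
        d[0][j] = j                 # insert everything in hyp
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[R][H] / R

print(word_error_rate("one two three".split(), "one too three four".split()))
# one substitution + one insertion over 3 reference words -> 2/3
```

The scoring tools report the same quantity, along with per-utterance alignments showing where each error occurred.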

That concludes all the steps involved in training a basic recognizer. At this stage, we might consider changing our input feature processing or other parameters of the process and trying again to see if we can improve performance. An alternative route to improvement is to use the more complex procedure known as embedded training, which is described in the next page.

[ index - embedded training ]
Dan Ellis