7.3 How do I get target labels to use in training?

Answer by: dpwe - 2000-08-12

As explained in the FAQ page on target file formats, training an acoustic model requires a corpus of features derived from the training set waveforms, plus a corresponding set of target labels, one for each frame, against which the model will be trained.
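
To make "one label for each frame" concrete, here is a minimal Python sketch that expands time-aligned phone segments into a per-frame label list. The 10 ms frame step, the (start, end, label) segment layout, the "h#" filler label and the function name are all assumptions made for illustration; they are not taken from the target file formats described elsewhere in this FAQ.

```python
def segments_to_frame_labels(segments, num_frames, frame_step=0.010, fill="h#"):
    """Expand (start_sec, end_sec, label) segments into one label per frame.

    frame_step (seconds) and the fill label are assumed values for this sketch.
    Frames not covered by any segment keep the fill label.
    """
    labels = [fill] * num_frames
    for start, end, label in segments:
        # Convert segment boundaries from seconds to frame indices.
        first = max(0, int(round(start / frame_step)))
        last = min(num_frames, int(round(end / frame_step)))
        for i in range(first, last):
            labels[i] = label
    return labels


# Hypothetical 100 ms utterance: silence, then "s", then "ih".
segments = [(0.00, 0.03, "h#"), (0.03, 0.08, "s"), (0.08, 0.10, "ih")]
frame_labels = segments_to_frame_labels(segments, num_frames=10)
print(frame_labels)
```

The same expansion applies whether the segment boundaries come from hand labelling or from an automatic alignment; only the source of the boundaries differs.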

Notionally, we are training the acoustic model to reproduce a phonemic (or other subword unit) labelling task that could be performed by a human linguistics expert. Thus one source of labels is manual production, where such experts are hired and their data recorded. There are manual (or hand) labels for the venerable TIMIT set, as well as for several corpora produced at OGI (including NUMBERS95 and STORIES). We also did some hand labelling of Switchboard in the Switchboard Transcription Project.

However, it can take up to 100x longer than real time for a transcriber to place the boundaries and identify the phones. Thus, most tasks are content to use humans only for lexical transcription of the training set (just typing in the words, which can be done in 2-5x real time), and then use forced alignment to get automatic labelling of the data at the phone level.
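
The difference in effort is easy to work out. The sketch below applies the real-time factors quoted above to a hypothetical 10-hour training set; the corpus size and the function name are assumptions chosen only to illustrate the arithmetic.

```python
def transcription_hours(audio_hours, real_time_factor):
    """Estimated human effort, in hours, for a given real-time factor."""
    return audio_hours * real_time_factor


audio_hours = 10.0  # hypothetical corpus size, not a figure from any real task

phone_level = transcription_hours(audio_hours, 100)    # up to 100x real time
word_level_low = transcription_hours(audio_hours, 2)   # 2x real time
word_level_high = transcription_hours(audio_hours, 5)  # 5x real time

print(f"Hand phone labelling: up to {phone_level:.0f} hours")
print(f"Word transcription:   {word_level_low:.0f} to {word_level_high:.0f} hours")
```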

For a new task, the initial forced alignment will need to be based on an existing acoustic model (etc.), which may not be a great match to the data. Through several iterations of embedded training (retraining the model on the current labels, then realigning with the retrained model), however, stable targets can usually be obtained.
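
The align-and-retrain loop can be sketched with a toy example. Everything here is an assumption made for illustration: the "acoustic model" is just one mean value per phone, the features are one-dimensional, and the alignment is a small Viterbi search that requires each phone in the transcription to cover at least one frame. A real system would use a proper acoustic model and decoder in place of these stand-ins.

```python
def forced_align(features, phone_seq, means):
    """Viterbi alignment of frames to phone_seq (in order, each phone >= 1 frame).

    Assumes len(features) >= len(phone_seq). Cost is squared distance to the
    phone's mean, a toy stand-in for an acoustic model score.
    """
    T, N = len(features), len(phone_seq)
    INF = float("inf")
    cost = [[INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    cost[0][0] = (features[0] - means[phone_seq[0]]) ** 2
    for t in range(1, T):
        for j in range(N):
            stay = cost[t - 1][j]                       # remain in phone j
            move = cost[t - 1][j - 1] if j > 0 else INF  # advance from phone j-1
            if move < stay:
                best, back[t][j] = move, j - 1
            else:
                best, back[t][j] = stay, j
            if best < INF:
                cost[t][j] = best + (features[t] - means[phone_seq[j]]) ** 2
    # Trace back from the last phone at the last frame.
    j, path = N - 1, [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = j
        j = back[t][j]
    return [phone_seq[k] for k in path]


def reestimate(features, labels, means):
    """'Retrain' the toy model: each phone's mean over its assigned frames."""
    new_means = dict(means)
    for phone in means:
        frames = [x for x, lab in zip(features, labels) if lab == phone]
        if frames:
            new_means[phone] = sum(frames) / len(frames)
    return new_means


def embedded_training(features, phone_seq, initial_means, max_iters=10):
    """Alternate alignment and re-estimation until the labels stop changing."""
    means, labels = dict(initial_means), None
    for _ in range(max_iters):
        new_labels = forced_align(features, phone_seq, means)
        if new_labels == labels:
            break  # targets are stable
        labels = new_labels
        means = reestimate(features, labels, means)
    return labels, means


# Hypothetical data: three phones, with a poorly matched starting model.
features = [0.1, 0.0, 0.2, 1.9, 2.1, 2.0, 4.2, 3.9]
phone_seq = ["a", "b", "c"]
initial_means = {"a": 1.0, "b": 1.5, "c": 2.0}

labels, means = embedded_training(features, phone_seq, initial_means)
print(labels)
print(means)
```

With this data the first alignment, made with the mismatched starting means, gives "b" only a single frame; after one round of re-estimation the boundaries move, and a further pass confirms that the labels no longer change.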

