ICSI Speech FAQ:
7.6 What is embedded training?

Answer by: dpwe - 2000-07-27

This answer isn't really done, I just needed a placeholder for a forward reference.

Embedded training is the name given to the process of iteratively training a speech recognizer: train an initial recognizer, use its models to generate a new set of training labels via forced alignment, train a new recognizer from those labels, realign again, and repeat until the resulting system stops improving.
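The loop described above can be sketched as follows. This is a minimal sketch, not any of the ICSI scripts: train_model and forced_align are toy stand-ins (hypothetical names) for a real trainer and aligner, and the stopping test used here (labels no longer changing) is just a simple proxy for "the system stops getting better", which in practice would be measured on a development set.

```python
# Minimal sketch of the embedded-training loop. The helpers are toy
# stand-ins (assumptions for illustration), not real tools: an actual
# system would invoke the acoustic-model trainer and a forced-alignment
# pass at these two points.

def train_model(labels):
    # Stand-in "trainer": the model is just a copy of its training labels.
    return list(labels)

def forced_align(model):
    # Stand-in "forced alignment": snap each label toward a fixed target,
    # mimicking labels drifting until they stabilize.
    target = ['sil', 'w', 'ah', 'n', 'sil']
    return [t for m, t in zip(model, target)]

def embedded_training(initial_labels, max_iters=10):
    labels = initial_labels
    for i in range(max_iters):
        model = train_model(labels)          # train on current labels
        new_labels = forced_align(model)     # realign to get new labels
        if new_labels == labels:             # labels stable: stop iterating
            return labels, i
        labels = new_labels
    return labels, max_iters

final_labels, n_iters = embedded_training(['sil', 'w', 'aa', 'n', 'sil'])
```

With the toy aligner the labels settle after a single realignment; real trainings typically take several iterations, and (as noted below) are not guaranteed to improve at each one.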

Although embedded training has certainly worked well in many circumstances (typical NUMBERS95 trainings, and our marathon ICSI-Cambridge-Sheffield 1998 Broadcast News effort), it's not uncommon for the recognition to get worse even on the first iteration.

One curious thing about embedded training is that the interpretation of each label becomes slightly vague. If you train a model to hand-labeled phoneme targets, you can state with some confidence what it means when a particular output of your classifier becomes active. But through the process of successive realignments, the nature of each label class can drift. In particular, if your corpus includes a limited range of phone contexts (such as NUMBERS or Aurora digits), it is common for the 'phone' labels to start including some portion of their common surrounding contexts.

Embedded training can be performed manually by successively training and realigning. There are also some scripts that attempt to automate the process. See, for example, Brian Kingsbury's original dr_embed (which uses the disparaged Y0, since that was all that was available at the time), and the more recent but task-specific aurora-embed.

A note on evaluating realignments

One weakness that has emerged several times in the past few years concerns the evaluation of realignments. When a cycle of embedded training generates a new set of target labels for a dataset, it would be valuable to be able to measure 'how' and 'by how much' the labels have changed. One overall metric is the proportion of frame labels that agree between the two sets (although I don't know of any simple tool to calculate this).

More interesting would be some kind of analysis in terms of substituted, deleted and inserted phone segments, plus statistics on the shifts of transition boundaries. A further breakdown of boundary motion according to the surrounding phonetic context would begin to give a better idea of what really happened in the realignment (like the band-to-band asynchrony plots presented in Nikki's thesis). This would be a great project for someone to take on (although the first stage would be to search for existing work and tools in this area).
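As a concrete starting point, the two simplest measures suggested above can be sketched as below. This assumes a frame-level label format (one phone symbol per frame), which is an assumption for illustration; frame_agreement, segments, and boundary_shifts are hypothetical helper names. A full substitution/deletion/insertion analysis would additionally need an edit-distance alignment of the two segment sequences (as scoring tools like sclite do for word strings).

```python
# Sketch of two simple realignment comparisons, assuming frame-level
# alignments: (1) the proportion of frames whose labels agree, and
# (2) the shift of each segment boundary when the phone sequence is
# unchanged between the two alignments.

def frame_agreement(a, b):
    """Proportion of frames whose labels match between two alignments."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def segments(frames):
    """Collapse a frame sequence into (phone, start, end) segments."""
    segs, start = [], 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[i - 1]:
            segs.append((frames[start], start, i))
            start = i
    return segs

def boundary_shifts(a, b):
    """For two alignments with the same phone sequence, return the shift
    (in frames) of each internal segment boundary from a to b."""
    sa, sb = segments(a), segments(b)
    assert [p for p, _, _ in sa] == [p for p, _, _ in sb]
    return [eb - ea for (_, _, ea), (_, _, eb) in zip(sa[:-1], sb[:-1])]

# Toy example: the /sil/-/w/ boundary moves one frame later after realignment.
old = ['sil', 'sil', 'w', 'w', 'w', 'ah', 'ah', 'n', 'sil']
new = ['sil', 'sil', 'sil', 'w', 'w', 'ah', 'ah', 'n', 'sil']
agreement = frame_agreement(old, new)   # 8 of 9 frames agree
shifts = boundary_shifts(old, new)      # [1, 0, 0, 0]
```

Histograms of such boundary shifts, broken down by the phone classes on either side of each boundary, would be one way to build the context-dependent analysis described above.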

Previous: 7.5 How can I use the alignment files produced by the SRI trainer? - Next: 8.1 What are grammars for?

Generated by build-faq-index on Tue Mar 24 16:18:17 PDT 2009