ICSI Speech FAQ:
7.2 I just got this new data. How can I start training from it?

Answer by: dpwe - 2000-07-31


It's always a bit of a problem getting started on a new task. Here are some instructions adapted from a recipe I wrote for Sunil Sivadas at OGI when we were getting started on the SPINE data. Hopefully they will serve as an example.


Now that we have the waveforms for the training data, and the word-level
transcripts for each utterance, we need to go through the procedure to
train a network on them.

These are the steps, in outline:

 - Define the training and test set utterances.  [I'm not sure if there 
   is a separate test or dev-test set defined].  This normally 
   results in a file containing a list of utterance IDs or filenames.

   It is normally considered good practice for the test set to consist 
   entirely of speakers who are not in the training set, to give a good 
   measure of speaker independence in the modeling.  To a degree, it 
   depends on what the 'real' test set will look like (i.e. same
   speakers or different speakers).

   The larger the test set, the more accurately we can measure the 
   performance improvements of our systems (i.e. the tighter the 
   confidence bounds on the test WER results).  But large test 
   sets also make for slow tests, and sometimes can eat up too 
   much of the training pool - more than 5-10% seems excessive.  
   A test set of 10,000 words is nice; 1,000 words is getting decidedly 
   slim.

 - Build feature archives for the training set.  These are going 
   to be used for the initial forced alignment, so they should be 
   something vanilla, and something for which there is already 
   a net that we can use for bootstrapping.  I was getting
   reasonable-looking results with the plp12N Broadcast News net on the
   few SPINE utterances I tried, so maybe that's the place to start.

 - Come up with a pronunciation dictionary for the task, based on 
   the lexicon (list of all words we will be able to recognize).  Typically, 
   we take all the words we have from the training set transcripts plus 
   any other 'representative texts', then take pronunciations from some 
   large baseline dictionary (such as the Broadcast News dictionary).

 - Using just the features, the word transcripts and the dictionary,
   we can make a forced alignment with an existing boot net.
   This is what dr_align_efsg does.  Basically, once we have the pieces
   we just run the script.  If the boot net really doesn't match the
   data, the process may fail, i.e. certain utterances don't align at
   all.  Each of these failures needs to be investigated, in case it's
   indicating a deeper problem (perhaps an incorrect transcript, or
   something wrong with the feature calculation), but we can if necessary
   simply drop them from the training set.

 - The initial targets from this forced alignment can be used to train 
   a net specific to this task, via qnstrn.

 - This net can be used to make a new alignment, which should be 
   significantly better.  Someone who worked in pronunciation 
   modeling would want to revisit the original dictionary, to see 
   if it was really matching the data, but I'm not good at that part.

 - This set of 2nd generation labels can be used to train further nets, 
   perhaps based on different sets of features etc.

 - In order to do an actual recognition, we will also need a grammar
   (language model).  This is hard, both conceptually and practically,
   certainly for me anyway.  It's very hard to get good estimates
   of trigram probabilities from speech transcripts alone; normally,
   you want extra texts with millions and millions of words.
   Maybe the grammar is limited enough that we can use bigrams, although
   trigrams generally give better word error rates when there is enough
   text to train them. There's a lot of art in getting the right amount
   of smoothing for your language model given the amount of training
   data, and I've never been involved in this part.  Perhaps someone
   else has developed a grammar for this task that they can share with us?

 - We can then do recognition tests using our newly-trained nets, the 
   test set, and the grammar and dictionary.
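
As an illustration of the first step, here is a minimal Python sketch of a speaker-disjoint train/test split, plus a rough estimate of how tight the WER confidence bounds are for a given test set size. Everything in it is an assumption made for the example: the function names, the (utterance ID, speaker ID, word count) input format, and the simplification of treating every word as an independent trial (real WER also counts insertions, and errors are correlated, so the true bounds will be somewhat wider).

```python
import math
from collections import defaultdict

def split_by_speaker(utts, target_test_words, max_test_fraction=0.10):
    """Hypothetical helper: utts is a list of (utt_id, speaker_id, n_words).

    Whole speakers are moved into the test set, in speaker-ID order, until
    the test set holds at least target_test_words. A speaker is skipped if
    adding it would push the test set past max_test_fraction of all words.
    Returns (train_ids, test_ids); no speaker appears in both.
    """
    words_by_spk = defaultdict(int)
    for _, spk, n_words in utts:
        words_by_spk[spk] += n_words
    total_words = sum(words_by_spk.values())

    test_speakers = set()
    test_words = 0
    for spk in sorted(words_by_spk):
        if test_words >= target_test_words:
            break
        if test_words + words_by_spk[spk] > max_test_fraction * total_words:
            continue
        test_speakers.add(spk)
        test_words += words_by_spk[spk]

    train_ids = [u for u, spk, _ in utts if spk not in test_speakers]
    test_ids = [u for u, spk, _ in utts if spk in test_speakers]
    return train_ids, test_ids

def wer_half_width(wer, n_words, z=1.96):
    """Approximate 95% half-width of a WER estimate (binomial assumption)."""
    return z * math.sqrt(wer * (1.0 - wer) / n_words)
```

Under that binomial assumption, and taking a WER of 30% purely as an example figure, wer_half_width(0.30, 10000) comes to about 0.009 (roughly +/- 0.9% absolute), while wer_half_width(0.30, 1000) comes to about 0.028 (roughly +/- 2.8% absolute), which is why a 1,000-word test set makes small improvements hard to detect.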

That's the basic plan.
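
The dictionary step can be sketched in the same way. The following Python is only an illustration under stated assumptions: it is not part of any ICSI tool, the function name is made up, the baseline dictionary is assumed to be already loaded as a mapping from word to a list of pronunciations, and words are assumed to be matched in lower case. It collects the lexicon from the transcripts, copies pronunciations for the words the baseline covers, and lists the words that still need pronunciations from somewhere else.

```python
def build_task_dictionary(transcripts, baseline):
    """Hypothetical helper for assembling a task dictionary.

    transcripts: iterable of word-level transcript strings.
    baseline:    dict mapping word -> list of pronunciation strings.
    Returns (task_dict, missing), where task_dict holds the baseline
    pronunciations for every covered lexicon word and missing is the
    sorted list of lexicon words the baseline does not contain.
    """
    lexicon = set()
    for line in transcripts:
        # Assumption: case-insensitive matching on whitespace-separated words.
        lexicon.update(line.lower().split())

    task_dict = {w: baseline[w] for w in sorted(lexicon) if w in baseline}
    missing = sorted(w for w in lexicon if w not in baseline)
    return task_dict, missing
```

The missing list is the part that needs attention by hand: each word on it either gets a pronunciation written for it or cannot be recognized (or aligned) at all.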


Generated by build-faq-index on Tue Mar 24 16:18:17 PDT 2009