Getting started on a new task is always a bit of a problem. Here are some instructions adapted from a recipe I wrote for Sunil Sivadas at OGI when we were getting started on the SPINE data. Hopefully they will serve as an example.
Now that we have the waveforms for the training data, and the word-level transcripts for each utterance, we need to go through the procedure to train a network on it. These are the steps, in outline:

- Define the training and test set utterances. [I'm not sure if there is a separate test or dev-test set defined]. This normally results in a file containing a list of utterance IDs or filenames. It is normally considered good practice for the test set to consist entirely of speakers who are not in the training set, to give a good measure of speaker independence in the modeling. To a degree, it depends on what the 'real' test set will look like (i.e. same speakers or different speakers). The larger the test set, the more accurately we can measure the performance improvements of our systems (i.e. the tighter the confidence bounds on the test WER results). But large test sets also make for slow tests, and can sometimes eat up too much of the training pool; more than 5-10% seems excessive. A test set of 10,000 words is nice; 1,000 words is getting decidedly slim.
- Build feature archives for the training set. These are going to be used for the initial forced alignment, so they should be something vanilla, and something for which there is already a net that we can use for bootstrapping. I was getting reasonable-looking results with the plp12N Broadcast News net on the few SPINE utterances I tried, so maybe that's the place to start.
- Come up with a pronunciation dictionary for the task, based on the lexicon (the list of all words we will be able to recognize). Typically, we take all the words we have from the training set transcripts plus any other 'representative texts', then take pronunciations from some large baseline dictionary (such as the Broadcast News dictionary).
- Using just the features, the word transcripts and the dictionary, we can make a forced alignment by using a previous boot net. This is what dr_align_efsg does. Basically, once we have the pieces we just run the script. If the boot net really doesn't match the data, the process may fail, i.e. certain utterances won't align at all. Each of these failures needs to be investigated, in case it's indicating a deeper problem (perhaps an incorrect transcript, or something wrong with the feature calculation), but we can if necessary simply drop them from the training set.
- The initial targets from this forced alignment can be used to train a net specific to this task, via qnstrn.
- This net can be used to make a new alignment, which should be significantly better. Someone who worked in pronunciation modeling would want to revisit the original dictionary, to see if it was really matching the data, but I'm not good at that part.
- This set of 2nd generation labels can be used to train further nets, perhaps based on different sets of features etc.
- In order to do an actual recognition, we will also need a grammar (language model). This is hard, both conceptually and practically, certainly for me anyway. It's very hard to get good estimates of trigram probabilities from speech transcripts; normally, you want extra texts with millions and millions of words. Maybe the grammar is limited enough that we can use bigrams, although trigrams always give much better word error rates. There's a lot of art in getting the right amount of smoothing for your language model given the amount of training data, and I've never been involved in this part. Perhaps someone else has developed a grammar for this task that they can share with us?
- We can then do recognition tests using our newly-trained nets, the test set, and the grammar and dictionary.

That's the basic plan.
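To make the first step a little more concrete, here is a rough sketch of a speaker-disjoint train/test split, along with a back-of-envelope estimate of how tight the WER confidence bounds get as the test set grows. None of this is part of the ICSI tools: the function names, the (utterance ID, speaker ID, word count) input format and the 30% WER figure are all assumptions made up for illustration. The interval uses a simple binomial approximation that treats word errors as independent, which is optimistic for real speech data.

```python
import math
import random
from collections import defaultdict


def split_by_speaker(utts, test_fraction=0.1, seed=0):
    """Split utterances so that no speaker appears in both sets.

    utts: list of (utt_id, speaker_id, n_words) tuples (hypothetical format).
    Whole speakers are moved into the test set, in random order, as long as
    the test set stays within test_fraction of the total word count.
    Returns (train_ids, test_ids, test_words).
    """
    by_speaker = defaultdict(list)
    for utt_id, speaker, n_words in utts:
        by_speaker[speaker].append((utt_id, n_words))

    total_words = sum(n_words for _, _, n_words in utts)
    budget = test_fraction * total_words

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    train_ids, test_ids, test_words = [], [], 0
    for speaker in speakers:
        speaker_words = sum(n for _, n in by_speaker[speaker])
        if test_words + speaker_words <= budget:
            test_ids.extend(u for u, _ in by_speaker[speaker])
            test_words += speaker_words
        else:
            train_ids.extend(u for u, _ in by_speaker[speaker])
    return train_ids, test_ids, test_words


def wer_half_width(wer, n_words, z=1.96):
    """Approximate 95% confidence half-width on a WER measurement.

    Binomial (normal) approximation: assumes independent word errors.
    """
    return z * math.sqrt(wer * (1.0 - wer) / n_words)


if __name__ == "__main__":
    # Toy pool: 20 speakers, 5 utterances each, 10 words per utterance.
    pool = [(f"s{s}_u{u}", f"s{s}", 10) for s in range(20) for u in range(5)]
    train, test, test_words = split_by_speaker(pool, test_fraction=0.1)
    print(len(train), "train utterances;", len(test), "test utterances;",
          test_words, "test words")
    for n in (1000, 10000):
        print(n, "words: +/-", round(100 * wer_half_width(0.30, n), 2),
              "% absolute at 30% WER")
```

Under these assumptions, a system at 30% WER is pinned down to roughly +/-0.9% absolute with 10,000 test words, but only to about +/-2.8% with 1,000, which is why the smaller test set is "getting decidedly slim" for measuring modest improvements.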
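The dictionary step in the outline above can be sketched along the same lines: collect the lexicon from the training transcripts, look each word up in the baseline dictionary, and list whatever is missing so that someone can supply pronunciations by hand. This is only an illustration; the in-memory data structures, the lowercasing of words and the phone symbols in the example are assumptions, not the actual Broadcast News dictionary format.

```python
from collections import Counter


def build_task_dictionary(transcripts, baseline_dict):
    """Build a task dictionary from transcripts and a baseline dictionary.

    transcripts: iterable of word-level transcript strings.
    baseline_dict: mapping word -> list of pronunciations, where each
        pronunciation is a list of phone symbols (hypothetical format).
    Returns (task_dict, oov_counts): the pronunciations found, and a Counter
    of words that had no entry in the baseline dictionary.
    """
    # Lowercasing is an assumption; match whatever the baseline dictionary uses.
    word_counts = Counter(
        word for line in transcripts for word in line.lower().split()
    )
    task_dict = {}
    oov_counts = Counter()
    for word, count in word_counts.items():
        if word in baseline_dict:
            task_dict[word] = baseline_dict[word]
        else:
            oov_counts[word] = count
    return task_dict, oov_counts


if __name__ == "__main__":
    # Made-up transcripts and pronunciations, for illustration only.
    transcripts = ["bravo two moving north", "roger bravo two"]
    baseline = {
        "two": [["t", "uw"]],
        "moving": [["m", "uw", "v", "ih", "ng"]],
        "north": [["n", "ao", "r", "th"]],
        "roger": [["r", "aa", "jh", "er"]],
    }
    task_dict, oov = build_task_dictionary(transcripts, baseline)
    print("words with pronunciations:", sorted(task_dict))
    print("missing from baseline:", oov.most_common())
```

Sorting the missing words by frequency is a deliberate choice: the most common out-of-vocabulary words are the ones most worth fixing before running the forced alignment, since every utterance containing them is at risk of failing to align.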
Generated by build-faq-index on Tue Mar 24 16:18:17 PDT 2009