For the past few months, I've been working with the AURORA noisy digits task, produced by ETSI. The performance of different systems on this task is summarized on a separate page. Here, however, I describe how I got started with connectionist modeling of the task.
The HTK whole-word system that is defined as the AURORA standard is pretty smart - it just takes the word transcripts of the training utterances, and makes its own segmentations, including (I believe) state tying, completely from scratch (it does do something like 16 iterations of EM). To train the neural-net models used in our hybrid connectionist-HMM systems, however, we need to get explicit 'hard-target' phone labels for the entire training set, something not included in the standard database. The nature and quality of these labels has a huge influence on the overall performance of the resulting system, so I figure it is worthwhile documenting how I generated them, and why I did some of the weirder steps.
To get any kind of system going, I needed pronunciations for the digits lexicon, and a net that could at least approximately generate posterior probabilities for those phones. I borrowed these from the NUMBERS95 task, which has been widely used at ICSI. NUMBERS is a continuous whole-numbers task with a 39-word lexicon of which digits are a subset.
I was anticipating that Rasta would be good for this task, since the data had added noise, and perhaps variable coloration. Looking around for a good Rasta-based NUMBERS net, I found one I liked among Brian Kingsbury's old files. Brian did a lot of work with NUMBERS and improved the baseline. It was actually an (18x9):480:56 net. 480 hidden units is pretty much the point of diminishing returns for NUMBERS, and we use the full ICSI56 output phoneset, even though several of the phones don't occur in numbers (and even more are missing from digits alone). I converted the Y0-format pronunciations and phone models into Noway format, both with and without context-dependent phones, and both as priors/models pairs and as "phi" files. Using the phi format, I did a forced alignment based on the Aurora data passing through the NUMBERS net, e.g.:
    qnsfwd \
        ftr1_file=train-rand.pf \
        ftr1_norm_file=train-rand.norms \
        window_extent=9 \
        ftr1_ftr_count=18 \
        ftr1_window_len=9 \
        init_weight_file=/u/bedk/speech/experiments/rasta-baseline/rasta8+d+iter2.weights \
        mlp3_input_size=162 \
        mlp3_hidden_size=480 \
        mlp3_output_size=56 \
        log_file=/tmp/qnsfwd.log \
        activation_format=lna8 \
    | dr_align_efsg \
        activation_format=lna \
        utt_words_file=../../wrdfile/train-rand.ref \
        utt_words_file_type=ref \
        steptime=10 \
        phi=../../lex/n95tr.phi.icsi2 \
        dictionary=../../lex/digits.dict \
        ctm_file=- \
    | labels2pfile \
        ctmfile=- \
        steptime=10 windowtime=10 zeropad=40 \
        phoneset=/u/drspeech/data/phonesets/icsi56.phset \
        samplerate=8000 force=1 \
        opformat=ilab \
        pfile=train-rand.ilab
The labels thus generated allowed me to generate priors for each of the phones within this dataset (pfile_print -q -ns -i train-rand.pflab | labels2priors phoneset=/u/drspeech/data/phonesets/icsi56.phset labels=- priors=train-rand.prior), train a new net of any size on the AURORA data, and test it using the noway decoder and the converted NUMBERS models and lexicon. This completed, giving me an initial error rate of 4.8% on the first, clean, 1001-sentence test set (about 3300 words). In order to let the label boundaries 'settle in', I redid the forced alignment using this net, then trained a second net (i.e. one round of 'embedded training'). This new net got 4.6% word error on the clean test set, so more embedding didn't seem necessary.
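To make the priors step concrete, here is a minimal Python sketch of what estimating phone priors from frame labels amounts to: the relative frequency of each phone index over all labelled frames. It is not the labels2priors implementation; the function name `label_priors` and the toy label sequence are made up for illustration.

```python
from collections import Counter

def label_priors(frame_labels, num_phones):
    """Return the relative frequency of each phone index over all frames.

    frame_labels: one integer phone index per frame (hypothetical toy input).
    num_phones:   size of the phone set (56 for ICSI56).
    """
    total = len(frame_labels)
    if total == 0:
        raise ValueError("no frames to estimate priors from")
    counts = Counter(frame_labels)
    # Phones that never occur get a prior of exactly zero here.
    return [counts.get(i, 0) / total for i in range(num_phones)]

# Toy example: a 3-phone set and 8 labelled frames.
labels = [0, 0, 0, 0, 1, 1, 2, 0]
priors = label_priors(labels, 3)
print(priors)
```

Note that any phone absent from the labels ends up with a zero prior, which is why the priors have to be re-estimated whenever a new phone is introduced into the labels.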
I did however take a look at the errors in detail to look for patterns of problems. There were 151 individual word errors, so I was able to go through the output of wordscore -v and consider each error in turn. There were some very distinct patterns:
Inspecting the errors in the original system showed a disproportionate number of "oh oh"s becoming "oh", and I realized we had seen this problem before. In fact, it was almost inevitable: the only pronunciation of "oh" was /aw/, and there is no way to avoid fusing /aw aw/ into /aw/ on recognition. So I had to reintroduce a trick used in our much older work on TIDIGITS, of using a /q/ phone (nominally glottal stop) as an optional prefix to "oh". Thus "oh oh", spoken fluently, is transcribed as /aw q aw/, with the /q/ marking the interruption (or amplitude dip) in the voicing that indicates the boundary between the words. While I was at it, I added an optional /q/ for "eight" too, to stand in for the optional dip between adjacent vowels, i.e. "three eight" as /th r iy q ey tcl t/ (although you can say it without the /q/).
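The fusing problem can be illustrated with a small Python sketch. It assumes a simplified view in which an utterance is a sequence of per-frame phone labels and adjacent identical labels merge into one phone; the frame counts are invented for illustration.

```python
from itertools import groupby

def collapse(frames):
    """Merge runs of identical frame labels into a single phone each."""
    return [phone for phone, _ in groupby(frames)]

# Hypothetical frame labellings (frame counts are made up).
oh      = ["aw"] * 20                              # a single "oh"
oh_oh   = ["aw"] * 12 + ["aw"] * 14                # "oh oh", no /q/
oh_q_oh = ["aw"] * 12 + ["q"] * 3 + ["aw"] * 11    # "oh oh" with /q/

print(collapse(oh))       # ['aw']
print(collapse(oh_oh))    # ['aw']
print(collapse(oh_q_oh))  # ['aw', 'q', 'aw']
```

Without the /q/, the two-word and one-word cases differ only in duration, so nothing in the phone sequence marks the word boundary.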
Adding the /q/ raised the problem of how to bootstrap the network to generate the /q/, or equivalently how to generate training labels that included the /q/. Making "oh" have a single pronunciation of /q aw/ didn't seem right, since when the preceding word ends in a consonant, there really isn't any glottal stop or amplitude dip of the kind I'm trying to measure (e.g. "one oh" -> /w ah n aw/). But if I do a forced alignment with a net trained without /q/ using a lexicon to which an optional /q/ has been added (i.e. with alternative pronunciations of "oh" as /aw/ and /q aw/), the alternate including /q/ would never be selected.
So I did something pretty weird. First, I manhandled the net in MATLAB to create a new /q/ output whose weights were a copy of the silence (/h#/) weights. The MATLAB script readmlpwts will read a QN-style weights file into a set of MATLAB arrays. I made the 56th column (/q/) of the hidden-to-output weights matrix a copy of the 55th column (/h#/), then wrote out the modified weights file with writemlpwts. This net still estimated /q/ with a much lower probability than /h#/ because of a difference in the bias values for the output units, which I didn't copy.
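Here is a rough Python sketch of the weight-copying trick on a toy net, not the MATLAB code I actually used. It assumes a softmax output layer and a hidden-by-output weight layout; the tiny matrix, the bias values and the hidden activations are all invented. With the weights copied but the biases left alone, the ratio of the /q/ posterior to the /h#/ posterior is fixed at exp(bias_q - bias_h#) on every frame.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def output_posteriors(hidden, weights, bias):
    """weights[h][o] connects hidden unit h to output unit o (assumed layout)."""
    logits = [
        sum(h * row[o] for h, row in zip(hidden, weights)) + bias[o]
        for o in range(len(bias))
    ]
    return softmax(logits)

def copy_output_unit(weights, src, dst):
    """Copy the hidden-to-output weights of unit src onto unit dst, in place."""
    for row in weights:
        row[dst] = row[src]

# Toy net: 3 hidden units, 4 outputs; index 2 plays /h#/, index 3 plays /q/.
SIL, Q = 2, 3
weights = [[0.2, -0.1, 0.9, 0.0],
           [-0.4, 0.3, 0.7, 0.0],
           [0.1, 0.5, -0.2, 0.0]]
bias = [0.0, 0.0, 1.0, -1.0]   # biases are NOT copied

copy_output_unit(weights, SIL, Q)
post = output_posteriors([0.5, 0.8, 0.1], weights, bias)
ratio = post[Q] / post[SIL]
print(ratio, math.exp(bias[Q] - bias[SIL]))
```

So the copied unit fires wherever silence does, only scaled down by a constant factor, which is enough for a forced alignment that requires /q/ to place it somewhere silence-like.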
Then I did a forced alignment with this net, using a dictionary which forced any occurrence of "oh" to start with at least one frame labelled as /q/. I trained a net to these labels, then realigned again, then trained again. These nets had problems decoding (perhaps because I was forcing /q/s in even when there was no need for an intervening phone), but looking at the posterior outputs, it seemed as though the /q/ output was really detecting energy dips at the start of "oh", where needed.
So I relaxed the dictionary to have "oh" be pronounceable either as /aw/ or as /q aw/. I also added an optional /q/ in front of "eight" (e.g. for "three eight" = /th r iy q ey tcl t/) and made the final /t/ burst in "eight" optional, so that "eight two" can be /ey tcl tcl t uw/. I realigned with this dictionary, then trained, then realigned, then trained, to let the /q/ settle down to being used hopefully only where it was appropriate or useful. I also had to re-estimate the priors, of course, to permit /q/s. This net got down to 2.5% word error rate on the clean test set, and the prevalence of "oh oh" contractions was essentially eliminated.
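The relaxed dictionary entries can be spelled out with a short Python sketch that expands optional phones into explicit alternative pronunciations. The (phone, optional) representation and the `expand` helper are hypothetical, chosen for illustration; this is not the Noway dictionary format.

```python
from itertools import product

def expand(pron):
    """Expand a list of (phone, optional) pairs into all pronunciation variants."""
    # An optional phone contributes either itself or nothing.
    choices = [([p], []) if optional else ([p],) for p, optional in pron]
    return [
        [phone for part in combo for phone in part]
        for combo in product(*choices)
    ]

# Hypothetical encodings of the two relaxed entries described above.
OH    = [("q", True), ("aw", False)]
EIGHT = [("q", True), ("ey", False), ("tcl", False), ("t", True)]

oh_variants = expand(OH)
eight_variants = expand(EIGHT)
for word, variants in (("oh", oh_variants), ("eight", eight_variants)):
    for v in variants:
        print(word, " ".join(v))
```

The variant /ey tcl/ is the one that lets "eight two" come out as /ey tcl tcl t uw/, and the forced alignment is then free to pick whichever variant best fits each token.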
n.b.: Given that my system was based on a NUMBERS95 lexicon and phone set, I guess all our N95 systems must have a lot of trouble with "oh oh" (which also seems to be disproportionately common). Anyone want to check this?