ICSI Speech FAQ:
9.4 How do I balance insertions and deletions?

Answer by: dpwe - 2000-09-27


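Word recognition errors come in three flavors: substitutions (a reference word is recognized as a different word), deletions (a reference word is missing from the output) and insertions (the output contains an extra word that is not in the reference).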

Word error rate (WER) is defined as the sum of all three types of errors divided by the total number of words in the reference (perfect) transcript. (Thus, through insertions, word error rate can exceed 100%.)

So, for example, if the original utterance is:

THE CAT SAT ON THE MAT

but recognition returned:

THE CATS ON THE MAT ARE

we might say that CAT was substituted with CATS, SAT was deleted, and ARE was inserted.

The word error rate here would be 3 errors for 6 words in the true transcript, or 50%. (You will note that the precise classification of errors depends on the alignment between true and recognized transcripts; we could have said that CAT was deleted and that SAT had been substituted with CATS, but the word error rate is not affected.)
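The counting above can be sketched in a few lines of Python. This is only an illustrative minimum-edit-distance alignment, not the scoring tool actually used at ICSI; the function names align_counts and wer are made up for this example. Which errors get labelled as substitutions versus deletions depends on how ties are broken in the backtrace, but the total does not.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i              # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = j              # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the end, classifying each step along one optimal path.
    subs = dels = inss = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            subs += mismatch        # match (0) or substitution (1)
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1               # reference word with no counterpart
            i -= 1
        else:
            inss += 1               # hypothesis word with no counterpart
            j -= 1
    return subs, dels, inss


def wer(ref, hyp):
    """Word error rate: total errors divided by the reference length."""
    return sum(align_counts(ref, hyp)) / len(ref)


ref = "THE CAT SAT ON THE MAT".split()
hyp = "THE CATS ON THE MAT ARE".split()
print(align_counts(ref, hyp), wer(ref, hyp))
```

For the example utterance this gives three errors in total and a WER of 0.5, matching the hand count.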

Now, a given continuous-speech recognition setup can usually be tuned to be more or less inclined to generate words, all other things being equal. Thus, we can vary our recognizer to produce lots of output words (implying lots of insertions, as well as perhaps the correct words) or rather few output words (implying lots of deletions, although the words it does produce may be mostly correct). Of course, normally we want to minimize word error rate, which usually corresponds to making about the same number of insertions and deletions.

The ratio of insertions to deletions (i.e. the propensity for the recognizer to generate more or fewer words) is usually controlled by one or more 'penalty' factors, which weight different hypotheses according to the number of transitions they make. There are all kinds of ways to do this, and it interacts with the language model (which defines the score for each particular word transition), but one knob is the phone_deletion_penalty, which is a parameter to noway.
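To make the idea of a penalty concrete, here is a toy sketch in Python. It is not how noway works internally: it uses a generic per-word penalty subtracted from each hypothesis score, which runs in the opposite direction to phone_deletion_penalty (where a larger value yields more words). The hypotheses, their log scores, and the name best_hypothesis are all invented for illustration.

```python
# Made-up candidate transcripts with made-up log scores (higher is better).
hypotheses = [
    (["THE", "CAT", "SAT"], -10.0),
    (["THE", "CAT", "IS", "AT"], -9.5),
    (["CAT", "SAT"], -11.0),
]


def best_hypothesis(hyps, word_penalty):
    """Pick the hypothesis with the best score after charging word_penalty per word."""
    return max(hyps, key=lambda h: h[1] - word_penalty * len(h[0]))[0]


# Sweep the (hypothetical) penalty and watch the winning hypothesis get shorter.
for penalty in (0.0, 0.75, 2.0):
    print(penalty, " ".join(best_hypothesis(hypotheses, penalty)))
```

With no penalty the longest hypothesis wins on raw score; as the per-word charge grows, shorter hypotheses take over, which is the insertion/deletion trade-off in miniature.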

(n.b.: we should probably have a separate FAQ on acoustic model and language model scaling factors, AMSF and LMSF, and what they can do.)

phone_deletion_penalty is defined in such a way that making it larger increases insertions and reduces deletions (i.e. if you are penalized for deleting phones, you prefer instead to generate more of them). As a fairly random example, here is a table of results for some experiments I did varying the phone_deletion_penalty (pdp) for the Aurora task (as mentioned in my status report of 1999jun04). This table actually comes from /u/drspeech/data/aurora/experiments/dpwe/RESULTS :

    Search across PDPs on train-cv-800 was pretty flat.  Best value 
    was actually 0.1, but 0.4 had a much better balance of dels/inserts:
                               S   D   I   E   WER
train-cv-msg3N+plp12Nd-pdp.05  31  42   4  77  2.9%
train-cv-msg3N+plp12Nd-pdp.075 31  39   5  75  2.8%
train-cv-msg3N+plp12Nd-pdp.1   31  35   5  71  2.6%
train-cv-msg3N+plp12Nd-pdp.15  33  33   8  74  2.8%
train-cv-msg3N+plp12Nd-pdp.2   35  31  10  76  2.8%
train-cv-msg3N+plp12Nd-pdp.3   36  27  12  75  2.8%
train-cv-msg3N+plp12Nd-pdp.35  36  26  13  75  2.8%
train-cv-msg3N+plp12Nd-pdp.4   36  23  14  73  2.7%
train-cv-msg3N+plp12Nd-pdp.45  35  27  14  76  2.8%
train-cv-msg3N+plp12Nd-pdp.5   36  26  18  80  3.0%
train-cv-msg3N+plp12Nd-pdp.6   35  25  21  81  3.0%

Each row name ends with the pdp value (varying from 0.05 to 0.6). The columns are counts of substitutions (S), deletions (D), insertions (I) and total errors (E), followed by the net WER percentage. As can be seen here, reducing deletions increased insertions almost equivalently over this range, so the WER wasn't too sensitive to the pdp value, but values far outside this range would likely behave differently.
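As a quick sanity check on the table, the following Python sketch re-reads the rows exactly as printed above, confirms that each total equals substitutions plus deletions plus insertions, and lists the deletion/insertion gap next to the total. The parsing code and the names parse_results and RESULTS are my own and are not part of the original experiment scripts.

```python
RESULTS = """\
train-cv-msg3N+plp12Nd-pdp.05  31  42   4  77  2.9%
train-cv-msg3N+plp12Nd-pdp.075 31  39   5  75  2.8%
train-cv-msg3N+plp12Nd-pdp.1   31  35   5  71  2.6%
train-cv-msg3N+plp12Nd-pdp.15  33  33   8  74  2.8%
train-cv-msg3N+plp12Nd-pdp.2   35  31  10  76  2.8%
train-cv-msg3N+plp12Nd-pdp.3   36  27  12  75  2.8%
train-cv-msg3N+plp12Nd-pdp.35  36  26  13  75  2.8%
train-cv-msg3N+plp12Nd-pdp.4   36  23  14  73  2.7%
train-cv-msg3N+plp12Nd-pdp.45  35  27  14  76  2.8%
train-cv-msg3N+plp12Nd-pdp.5   36  26  18  80  3.0%
train-cv-msg3N+plp12Nd-pdp.6   35  25  21  81  3.0%
"""


def parse_results(text):
    """Return a list of dicts with pdp, S, D, I, E taken from the table rows."""
    rows = []
    for line in text.splitlines():
        name, s, d, i, e, _wer = line.split()
        pdp = float(name.rsplit("pdp", 1)[1])   # e.g. ".05" -> 0.05
        rows.append({"pdp": pdp, "S": int(s), "D": int(d), "I": int(i), "E": int(e)})
    return rows


rows = parse_results(RESULTS)

# Every total should be the sum of the three error types.
consistent = all(r["E"] == r["S"] + r["D"] + r["I"] for r in rows)

# Lowest total error count, and the deletion-minus-insertion gap per row.
best = min(rows, key=lambda r: r["E"])
for r in rows:
    print(f'pdp={r["pdp"]:<6} E={r["E"]:<3} D-I={r["D"] - r["I"]}')
print("consistent:", consistent, "lowest E at pdp =", best["pdp"])
```

The lowest total falls at pdp 0.1, where deletions still outnumber insertions by 30; at 0.4 the gap is down to 9 for only two more errors, which is the balance the note in the RESULTS file refers to.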



Generated by build-faq-index on Tue Mar 24 16:18:17 PDT 2009