We use an internal speech recognition system based on weighted finite-state transducers [4]. Our IR system is an internally modified version of Cornell's well-known SMART retrieval system [1, 6]. For speech retrieval, we believe that parallel text corpora, for example printed news from the same time period, can be successfully exploited to improve the retrieval effectiveness of a system. This is especially true for the news material currently being used in the SDR track. We use these ideas in our SDR track participation, and initial results from the use of a parallel corpus are quite encouraging.
The recognizer is based on a standard time-synchronous beam search algorithm. The probabilities defining the transduction from context-dependent phone sequences to word sequences are estimated from word-level grapheme-to-phone mappings and are implemented in the general framework of weighted finite-state transducers [4]. Transducer composition is used to generate word lattice output.
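To make the composition step concrete, here is a minimal sketch of composing two epsilon-free weighted transducers over the tropical semiring, in the spirit of combining a phone-to-word lexicon with a word-level grammar. This is not the AT&T implementation; the dictionary-based state/arc representation and the `compose` function are illustrative assumptions.

```python
from collections import defaultdict, deque

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers.

    Each transducer maps a state to a list of arcs (in_label, out_label,
    weight, next_state); state 0 is the start state.  Weights are negative
    log probabilities, so they are added along a path (tropical semiring).
    """
    arcs = defaultdict(list)
    start = (0, 0)
    queue, seen = deque([start]), {start}
    while queue:
        s1, s2 = queue.popleft()
        for in1, out1, w1, n1 in t1.get(s1, []):
            for in2, out2, w2, n2 in t2.get(s2, []):
                if out1 == in2:                 # t1's output feeds t2's input
                    nxt = (n1, n2)
                    arcs[(s1, s2)].append((in1, out2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return arcs
```

A real system must additionally handle epsilon labels and apply determinization and minimization to keep the composed network compact; those steps are omitted here.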
We use continuous density, three-state, left-to-right, context-dependent hidden Markov phone models. These models were trained on 39-dimensional feature vectors consisting of the first 13 mel-frequency cepstral coefficients and their first and second time derivatives. Training iterations included eigenvector rotations, k-means clustering, maximum likelihood normalization of means and variances and Viterbi alignment. The output probability distributions consist of a weighted mixture of Gaussians with diagonal covariance, with each mixture containing at most 12 components. The training data were divided into wideband and narrowband partitions, resulting in two acoustic models.
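For illustration, the 39-dimensional front end described above can be approximated with off-the-shelf tools; the sketch below uses librosa (not the front end used here), and the file name, sampling rate, and frame settings are assumptions.

```python
import numpy as np
import librosa

# Load one audio story (file name and 16 kHz sampling rate are assumptions).
signal, sr = librosa.load("broadcast_story.wav", sr=16000)

# 13 mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# First and second time derivatives of the cepstral coefficients.
delta1 = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Stack into 39-dimensional feature vectors, one column per frame.
features = np.vstack([mfcc, delta1, delta2])   # shape: (39, num_frames)
```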
Both the first-pass trigram language model and the rescoring 4-gram model are standard Katz backoff models [3], using the same 237 thousand word vocabulary. To choose the vocabulary, all of the words from the SDR98 training transcripts were used. This base vocabulary was supplemented with all words of frequency greater than two appearing in the New York Times and LA Times segments of LDC's North American News corpus (LDC Catalog Number: LDC95T21, see www.ldc.upenn.edu) for the period from June 1997 through January 1998. The vocabulary includes about 5,000 common acronyms (e.g., ``N.P.R.''), and the training texts were preprocessed to include these acronyms.
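A minimal sketch of this vocabulary-selection rule is shown below; the file names, tokenization, and acronym handling are assumptions, not the actual preprocessing.

```python
from collections import Counter

def build_vocabulary(sdr98_tokens, na_news_tokens, min_count=3):
    """All words from the SDR98 training transcripts, plus every word
    occurring more than twice (frequency > 2) in the NA News text."""
    vocab = set(sdr98_tokens)
    counts = Counter(na_news_tokens)
    vocab.update(word for word, count in counts.items() if count >= min_count)
    return vocab

# Hypothetical token streams; real preprocessing also maps acronyms such as
# "N.P.R." to single vocabulary items before counting.
sdr98_tokens = open("sdr98_train_transcripts.txt").read().split()
na_news_tokens = open("na_news_jun97_jan98.txt").read().split()
vocabulary = build_vocabulary(sdr98_tokens, na_news_tokens)
```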
The language model training was based on three transcription sources (the SDR98 training transcripts, HUB4 transcripts, and transcripts of NBC Nightly News) and one print source (the LDC NA News corpus of newspaper text). The first-pass trigram model was built by first constructing a backoff language model from the 271 million words of training text, yielding 15.8 million 2-grams and 22.4 million 3-grams. This model was reduced in size, using the approach of Seymore and Rosenfeld [7], to 1.4 million 2-grams and 1.1 million 3-grams. When composed with the lexicon, this smaller trigram model yielded a network of manageable size. The second-pass model used 6.2 million 2-grams, 7.8 million 3-grams, and 4.0 million 4-grams. For this model, the three transcription sources (SDR, HUB4, NBC) were in effect interpolated with the text source (NA News), with the latter being given a weight of 0.1. The word error rate of our recognizer on the SDR track data was 31%. These transcriptions are labeled ATT-S1.
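Reading the combination as a simple linear interpolation, the second-pass probability estimate could be sketched as follows; the function names and the callable per-source models are assumptions, and Katz backoff estimation itself is not shown.

```python
def interpolated_prob(word, history, p_transcripts, p_nanews, nanews_weight=0.1):
    """Linearly interpolate transcription-based and NA News n-gram estimates.

    p_transcripts and p_nanews are callables returning P(word | history)
    under the respective backoff models; the newspaper text is given a
    weight of 0.1, as in the second-pass model described above.
    """
    return ((1.0 - nanews_weight) * p_transcripts(word, history)
            + nanews_weight * p_nanews(word, history))
```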
Like most other participants, we create word-level transcriptions for these stories using our recognizer and use our ad-hoc searching algorithm to do retrieval over these erroneous transcriptions. The effectiveness of a ranking is measured via non-interpolated average precision, a standard metric used in IR to measure retrieval effectiveness. More details on the ad-hoc task and its evaluation can be found in [10].
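For reference, non-interpolated average precision for a single query can be computed as in the sketch below (the ranking and relevance-judgment inputs are hypothetical); the reported figure is the mean over all queries.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one query.

    ranked_docs: document ids in ranked order; relevant: set of relevant ids.
    Precision is recorded at each rank where a relevant document appears,
    and the sum is divided by the total number of relevant documents.
    """
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Example: relevant documents {A, C} retrieved at ranks 1 and 3.
print(average_precision(["A", "B", "C"], {"A", "C"}))   # (1/1 + 2/3) / 2 = 0.833
```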
Table 1: Term weighting schemes.

| Code | Definition |
|---|---|
| d | tf factor: 1 + ln(1 + ln(tf)); 0 if tf = 0 |
| t | idf factor: log((N + 1) / df) |
| b | pivoted byte length normalization factor: 1 / (0.8 + 0.2 × (length of document in bytes) / (average document length in bytes)) |
| dnb | weighting: d factor × b factor |
| dtb | weighting: d factor × t factor × b factor |
| dtn | weighting: d factor × t factor |

where tf is the term's frequency in the text, N is the total number of documents, and df is the number of documents containing the term.
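The schemes in Table 1 translate directly into code; the following is a small sketch (function and variable names are our own) of the d, t, and b factors and the dnb and dtb weights.

```python
import math

def d_factor(tf):
    """tf factor: 1 + ln(1 + ln(tf)); 0 if tf = 0."""
    if tf == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(tf))

def t_factor(num_docs, df):
    """idf factor: log((N + 1) / df)."""
    return math.log((num_docs + 1) / df)

def b_factor(doc_bytes, avg_doc_bytes):
    """Pivoted byte length normalization factor."""
    return 1.0 / (0.8 + 0.2 * doc_bytes / avg_doc_bytes)

def dnb_weight(tf, doc_bytes, avg_doc_bytes):
    return d_factor(tf) * b_factor(doc_bytes, avg_doc_bytes)

def dtb_weight(tf, num_docs, df, doc_bytes, avg_doc_bytes):
    return d_factor(tf) * t_factor(num_docs, df) * b_factor(doc_bytes, avg_doc_bytes)
```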
Table 2: Average precision for retrieval from human and ATT-S1 transcriptions, with and without query expansion. Percentages marked + give the gain from query expansion; percentages marked - give the loss relative to the human transcriptions.

| Transcript | Base | Query Expansion |
|---|---|---|
| Human | 0.4595 | 0.5300 (+15.3%) |
| ATT-S1 | 0.4353 (-5.3%) | 0.5020 (+15.3%) (-5.3%) |
Several techniques are plausible for bringing new words into a document. An obvious one from an IR perspective is document expansion using similar documents: find some documents related to a given document, and add new words from the related documents to the document at hand. From a speech recognition perspective, the obvious choice is to use word lattices, which contain multiple recognition hypotheses for an utterance. A word lattice contains words that are acoustically similar to the recognized words and could plausibly have been said instead of the words in the one-best transcription.
In our official TREC-7 participation we used a constrained document expansion which used only those expansion words that are proposed by similar documents and also appear in a word lattice. However, after the official conference we did a more rigorous study of document expansion and discovered that constraining expansion to terms from the word lattices generated by our recognizer held no additional benefit: document expansion using only the NA News corpus yielded results that were equally good or better. This also allows us to test document expansion for retrieval from the automatic transcriptions provided by other SDR track participants, for which we do not have word lattices.
Table 3: Automatic transcriptions used in the document expansion experiments and their word error rates (WER).

| Code | Provided By | WER |
|---|---|---|
| Human | NIST | 0% |
| CUHTK-S1 | Cambridge University | 24.8% |
| Dragon98-S1 | Dragon Systems | 29.8% |
| ATT-S1 | AT&T Labs | 31.0% |
| NIST-B1 | Carnegie Mellon (CMU) | 34.1% |
| SHEF-S1 | Sheffield University | 36.8% |
| NIST-B2 | Carnegie Mellon (CMU) | 46.9% |
| DERASRU-S2 | DERA | 61.5% |
| DERASRU-S1 | DERA | 66.2% |
We test document expansion on different automatic transcriptions provided to NIST by various track participants. Table 3 lists these transcriptions along with their word error rates. To expand a document, we first retrieve documents related to it from the expansion corpus and then add selected new words from these related documents, as follows.
The document vector is modified using a Rocchio formulation:

\vec{d}_{new} = \alpha \, \vec{d} + \frac{\beta}{n} \sum_{i=1}^{n} \vec{r}_i

where \vec{d} is the initial document vector, \vec{r}_i is the vector for the i-th related document, and \vec{d}_{new} is the modified document vector. All documents are dnb weighted (see Table 1). New words are added to the document. For term selection, the Rocchio weights for new words are multiplied by their idf, the top terms are selected, and the idf is then stripped from a selected term's final weight. Furthermore, to ensure that this document expansion process doesn't change the effective length of the document vectors, and thereby change the results due to document length normalization effects [8], we force the total weight of all terms in the new vector to be the same as the total weight of all terms in the initial document vector. We expand documents by 100% of their original length (i.e., if the original document has 60 indexed terms, then we add 60 new terms to the document).
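A minimal sketch of this expansion step is given below. The dictionary representation, the parameter values, and the assumption that related documents have already been retrieved from the expansion corpus are ours, not the exact settings used in the experiments.

```python
def expand_document(doc_vec, related_vecs, idf, alpha=1.0, beta=1.0):
    """Rocchio-style document expansion sketch.

    doc_vec and each element of related_vecs map terms to dnb weights
    (see Table 1); idf maps terms to their idf values.
    """
    n = len(related_vecs)
    rocchio = {term: alpha * w for term, w in doc_vec.items()}
    for rvec in related_vecs:
        for term, w in rvec.items():
            rocchio[term] = rocchio.get(term, 0.0) + beta * w / n

    # Rank candidate new terms by Rocchio weight times idf, and select as
    # many new terms as the document originally had (100% expansion).
    new_terms = [t for t in rocchio if t not in doc_vec]
    new_terms.sort(key=lambda t: rocchio[t] * idf.get(t, 0.0), reverse=True)
    selected = new_terms[:len(doc_vec)]

    # Selected terms keep their Rocchio weight; the idf used for ranking is
    # stripped from the final weight.
    expanded = {t: rocchio[t] for t in doc_vec}
    expanded.update({t: rocchio[t] for t in selected})

    # Rescale so the total weight equals that of the original vector, leaving
    # document length normalization effects unchanged.
    scale = sum(doc_vec.values()) / sum(expanded.values())
    return {t: w * scale for t, w in expanded.items()}
```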
Table 4: Average precision for unexpanded and expanded documents, with and without query expansion, for each transcription.

| Transcript | Unexpanded: Base | Unexpanded: Qry Expn | Expanded: Base | Expanded: Qry Expn |
|---|---|---|---|---|
| Human | 0.4595 | 0.5300 | 0.5108 | 0.5549 |
| CUHTK-S1 | 0.4376 | 0.5035 | 0.5220 | 0.5372 |
| Dragon98-S1 | 0.4190 | 0.5100 | 0.5061 | 0.5284 |
| ATT-S1 | 0.4353 | 0.5020 | 0.5080 | 0.5343 |
| NIST-B1 | 0.4104 | 0.4820 | 0.4862 | 0.5259 |
| SHEF-S1 | 0.4073 | 0.4890 | 0.5068 | 0.5421 |
| NIST-B2 | 0.3352 | 0.3965 | 0.4377 | 0.4743 |
| DERASRU-S2 | 0.3633 | 0.3962 | 0.4585 | 0.5065 |
| DERASRU-S1 | 0.3236 | 0.3613 | 0.4526 | 0.4849 |
The results for the unexpanded as well as the expanded documents are listed in Table 4. The two main highlights of these results are: (1) document expansion yields large improvements in average precision for every transcription, including the human transcriptions; and (2) the improvements are largest for the transcriptions with the highest word error rates, substantially narrowing the gap between automatic and human transcriptions.
These points are highlighted in Figure 1. The left plot shows the average precision on the y-axis against the WER on the x-axis. All numbers plotted in Figure 1 are for unexpanded queries (i.e., we use the columns marked Base in Table 4). This keeps query expansion effects out of these graphs and allows us to study the effects of document expansion in isolation. The horizontal lines are for human transcriptions, whereas the other lines are for the different automatic transcriptions. As we can see in the left graph, document expansion (solid lines) yields large improvements across the board for this task over not doing document expansion (dashed lines). This is indicated by the noticeably higher average precision for the solid lines as compared to the corresponding dashed lines.
The right graph in Figure 1 plots the %-loss from human transcriptions on the y-axis for unexpanded and expanded documents. The baseline for the expanded documents is higher; it is the expanded human transcriptions, i.e., the solid horizontal line on the left graph. We observe that for the poorest transcriptions (DERASRU-S1) document expansion yields an impressive improvement of about 40% (over 0.3236) and reduces the performance gap from the human transcriptions to about 12% instead of the original 30%, despite the higher baseline used. The results are similar for other transcriptions.
It might be the case that for this test collection document expansion is beneficial in general, and that it holds no special advantage for automatic speech transcripts. However, the right graph in Figure 1 shows that this is not the case, and document expansion indeed is more useful when the text is erroneous. The dashed line on the right graph shows the loss in average precision when retrieval is done from (unexpanded) automatic transcriptions instead of (unexpanded) human transcriptions. This line has the same shape as the dashed line on the left graph since it is essentially the same curve on a different scale (0 to -100 in % loss, the human transcriptions being the 0% mark). We notice that the loss for CUHTK-S1 (the leftmost point) is close to 0%, whereas it is 30% for DERASRU-S1 (the rightmost point). The solid line on the right plot shows the losses for the various transcripts for expanded documents. The baseline for this curve is higher; it corresponds to the solid horizontal line on the left graph. We see that document expansion benefits the poor transcriptions much more than it benefits the human or the better automatic transcriptions. For poor transcriptions, the gap in retrieval effectiveness shrinks from 27% to about 15% for NIST-B2, from 22% to about 10% for DERASRU-S2, and from about 30% to about 11% for DERASRU-S1. All these loss reductions are quite significant.
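The loss percentages quoted above follow directly from the Base columns of Table 4; for example:

```python
def pct_loss(system_ap, human_ap):
    """Percent loss in average precision relative to the human transcriptions."""
    return 100.0 * (human_ap - system_ap) / human_ap

# Unexpanded documents: DERASRU-S1 vs. human transcriptions (Table 4, Base).
print(round(pct_loss(0.3236, 0.4595), 1))   # about 30% loss before expansion

# Expanded documents: the baseline is the expanded human transcriptions.
print(round(pct_loss(0.4526, 0.5108), 1))   # about 11% loss after expansion
```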
In summary, document expansion is more useful for automatic speech transcripts than it is for human transcriptions. Automatic transcriptions that are relatively poor need the most help during retrieval, and document expansion helps exactly these transcriptions, quite noticeably. It is encouraging that even with word error rates as high as 65%, the retrieval effectiveness drops just 10-15% after document expansion; this drop would have been 22-30% without expansion.