Abstract submitted to Eurospeech'99, Budapest (1999jan15)

Classifying audio segments based on speech-recognition acoustic models

Gethin Williams (University of Sheffield) & Daniel P. W. Ellis (ICSI, Berkeley) 

We approach the problem of classifying audio segments according to their suitability for speech recognition, using phone probability estimates as features. In tasks such as the transcription of broadcast audio, segment classification is an important intermediary between a segmenter that identifies boundaries between different types of material, and a speech recognition engine which would be expensive and unprofitable to employ on segments consisting of theme music or speech overwhelmed by background noise. Previous work in this area has been based on the same features (e.g. MFCCs) used for speech recognition [1]; we employ novel statistics based on the phone-class posterior probability estimates generated by the neural-network classifier in our hybrid connectionist-HMM speech recognizer.

It may seem paradoxical to use the output of a classifier trained exclusively to discriminate phonetic classes as the basis for classifying material as speech or nonspeech, but we have observed several distinctions at this level between `good' speech (i.e. likely to lead to a recognizer output with few errors) and other segments. One interpretation is that the classifier has been discriminatively trained to focus on the regions of feature space that are particularly crucial in marking phonetic distinctions; a speech signal will be primarily distributed in these regions, whereas an arbitrary nonspeech sound will cross them only irregularly. Our recognizers are particularly suitable for this task because the small number of context-independent posterior phone probabilities are relatively quick to calculate, and identification of nonspeech at this stage can avoid the decoder search through the hidden Markov models, which is the most time-consuming part of the recognition process.

Starting from the posterior probability estimates of 54 phone classes generated in our Broadcast News system [2], we defined four per-segment features to capture the distinctions we had observed. First is the mean per-frame entropy across the class posteriors, since well-modeled speech will usually be dominated by a single category, giving a low entropy (we have used this previously as a confidence measure [3]). Secondly, a measure of the `dynamism' of the estimates, namely the mean-squared frame-to-frame difference, penalizes the tendency of nonspeech audio to remain in a particular phone class for disproportionately long periods. The third measure calculates the energy in the acoustic signal for frames classified as background/silence as a ratio of the total segment energy: for clean speech, background/silence is usually low-energy, but in nonspeech many other signal episodes may be best matched by this class. Finally, a measure of the match between the variance in each phone class and a template derived from known good speech helps reject segments that happened to match some speech classes, but in a pattern very different from real speech.
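The four features above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code: the function name and arguments are hypothetical, frames are labeled by their maximum-posterior class, and the variance-template match is taken here as a simple Euclidean distance, which the abstract does not specify.

```python
import numpy as np

def segment_features(post, energy, sil_idx, var_template, eps=1e-12):
    """Four per-segment features from a (T, K) matrix of per-frame phone
    posteriors, a length-T frame-energy vector, the index of the
    background/silence class, and a per-class variance template derived
    from known good speech. Names and signature are illustrative."""
    # 1. Mean per-frame entropy of the posterior distribution:
    #    low when one class dominates, as in well-modeled speech.
    entropy = -np.mean(np.sum(post * np.log(post + eps), axis=1))
    # 2. "Dynamism": mean-squared frame-to-frame difference of the
    #    posteriors; low when the audio sits in one class too long.
    dynamism = np.mean(np.sum(np.diff(post, axis=0) ** 2, axis=1))
    # 3. Energy of frames whose best-matching class is background/
    #    silence, as a ratio of total segment energy.
    labels = np.argmax(post, axis=1)
    bg_ratio = energy[labels == sil_idx].sum() / (energy.sum() + eps)
    # 4. Mismatch between the per-class posterior variances and the
    #    speech-derived template (Euclidean distance is an assumption).
    var_match = np.linalg.norm(np.var(post, axis=0) - var_template)
    return entropy, dynamism, bg_ratio, var_match
```

In the paper's setting K = 54 and the posteriors come from the neural-network acoustic model; any (T, K) stochastic matrix works for the arithmetic.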

A simple Gaussian classifier based on these measures made no errors on segment-level classifications of 160 speech/music examples [4]. In a practical setting, the classification can be used to make the most efficient use of the decoder by rejecting the least-well-modeled segments. Our full paper will report details of this approach used on Broadcast News material.
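A "simple Gaussian classifier" here can be read as one multivariate Gaussian fitted per class over the four-dimensional feature vectors, with segments assigned to the class of highest likelihood. The sketch below makes that reading concrete; the class name, regularization constant, and use of class priors are assumptions, not details from the abstract.

```python
import numpy as np

class GaussianClassifier:
    """One full-covariance Gaussian per class over the segment
    features; classification by maximum posterior log-likelihood.
    A minimal sketch with illustrative names."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_, self.covs_, self.priors_ = [], [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.means_.append(Xc.mean(axis=0))
            # Small diagonal regularization keeps the covariance
            # invertible (an assumption, not from the paper).
            self.covs_.append(np.cov(Xc, rowvar=False)
                              + 1e-6 * np.eye(X.shape[1]))
            self.priors_.append(len(Xc) / len(X))
        return self

    def predict(self, X):
        scores = []
        for m, S, p in zip(self.means_, self.covs_, self.priors_):
            d = X - m
            inv = np.linalg.inv(S)
            # Gaussian log-likelihood up to a shared constant.
            ll = (-0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
                  - 0.5 * np.log(np.linalg.det(S)) + np.log(p))
            scores.append(ll)
        return self.classes_[np.argmax(scores, axis=0)]
```

Rejecting segments then amounts to dropping those classified as nonspeech (or, with a threshold on the likelihood ratio, the least-well-modeled ones) before they reach the decoder.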

[1] M. Siegler, U. Jain, B. Raj & R. Stern (1997) "Acoustic segmentation, classification and clustering of Broadcast News audio," Proc. DARPA Speech Recognition Workshop.

[2] G. Cook, J. Christie, D. Ellis, E. Fosler-Lussier, Y. Gotoh, B. Kingsbury, N. Morgan, S. Renals, A. Robinson, & G. Williams (1999) "The SPRACH System for the Transcription of Broadcast News," Abstract submitted for the 1999 DARPA Broadcast News Workshop.

[3] J. Barker, G. Williams, S. Renals (1998) "Acoustic Confidence Measures for Segmenting Broadcast News," Proc. ICSLP, Sydney.

[4] E. Scheirer & M. Slaney (1997) "Construction and evaluation of a robust multifeature speech/music discriminator," Proc. ICASSP, Munich. 
