EARS: RT Tasks
Rich Transcription Tasks
The SRI/ICSI/UW Project Team will advance the state-of-the-art in automatic
Rich Transcription of speech by creating a collection of novel models,
algorithms, and techniques.
Core automatic speech recognition (ASR): Our objective in this task is
to markedly reduce the error rate of the core recognition process.
We will take a multifaceted approach to improving all aspects of the recognition
system, leveraging all levels of information in the speech signal. In particular,
we will focus on components that are 'broken' or have received little
attention so far.
- Front end processing
  - pitch-dependent analysis for multi-speaker and speaker/noise separation
  - multiple front ends tuned for extraction of different phonetic features
- Acoustic modeling
  - further improved speaking rate-dependent modeling
  - rapid adaptation to speakers, dialects, and speaking styles
- Pronunciation and duration modeling
  - data-driven learning of rules for generating and adapting pronunciation models
  - duration modeling at the phone, syllable, and word levels
- Language modeling
  - parameter-tying techniques for data-efficient discriminative LM training
- Post-recognition error correction
  - word-posterior estimators based on prosodic and other features not used in the recognizer
  - improved confidence estimates (feeds into the metadata task)
  - system combination at the feature level or through word posteriors
  - decorrelating systems for more effective system combination
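To make the word-posterior combination idea concrete, here is a minimal sketch (not the project's actual implementation): the hypotheses of several recognizers are assumed to be pre-aligned into confusion-network "slots", and each slot is decided by a weighted vote over the systems' word posteriors. The function name and slot representation are illustrative assumptions.

```python
from collections import defaultdict

def combine_systems(aligned_slots, weights=None):
    """Pick, for each slot, the word with the highest weighted posterior sum.

    aligned_slots: list of slots; each slot is a list (one entry per system)
                   of (word, posterior) pairs.
    weights: optional per-system weights (default: uniform).
    """
    output = []
    for slot in aligned_slots:
        w = weights if weights is not None else [1.0] * len(slot)
        scores = defaultdict(float)
        for (word, post), wt in zip(slot, w):
            scores[word] += wt * post
        best = max(scores, key=scores.get)
        if best != "<eps>":        # "<eps>" marks a deletion; emit nothing
            output.append(best)
    return output

# Two systems disagree on the middle word; the posteriors break the tie.
slots = [
    [("the", 0.9), ("the", 0.8)],
    [("cat", 0.4), ("hat", 0.7)],
    [("sat", 0.9), ("sat", 0.6)],
]
print(combine_systems(slots))  # → ['the', 'hat', 'sat']
```

In practice the hard part is the alignment itself; the vote above is the simplest form of posterior-level combination, and decorrelating the component systems (different front ends, different models) is what makes the vote informative.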
Rapid Development of ASR in New Languages and Domains (Portability):
- New Linguistic Phenomena
  - trajectory models of lexical tone
  - cross-boundary pronunciation effects
  - text normalization
- Domain Adaptation
  - text gathering and LM composition using transformation models
  - dynamic LM adaptation with uncertainty models
  - lexicon expansion
- Leveraging Limited Acoustic Resources
  - adaptation of a multilingual base acoustic model
  - speaker clustering and dependence models to handle limited speaker diversity in the target language
  - automatically derived sub-word units
  - automatic selection of data for transcription
- Severely Constrained Lexical Resources
  - automatic pronunciation acquisition
  - rapid development of morphological analysis tools
- Development of recognition systems in resource-rich languages, such as Mandarin, both to provide the contrast case of porting to a non-English language where resources are less constrained, and to explore new linguistic phenomena such as tone modeling
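One common form of LM domain adaptation, sketched below purely for illustration, is linear interpolation of a large background model with a small in-domain model. The toy unigram probabilities and the function name are assumptions; in practice the interpolation weight would be tuned (e.g., by EM on held-out in-domain text), and the project's transformation and uncertainty models go well beyond this.

```python
def interpolate_lm(p_background, p_indomain, lam=0.3):
    """Return P(w) = lam * P_in(w) + (1 - lam) * P_bg(w) over the joint vocabulary."""
    vocab = set(p_background) | set(p_indomain)
    return {w: lam * p_indomain.get(w, 0.0) + (1 - lam) * p_background.get(w, 0.0)
            for w in vocab}

# Toy unigram LMs: a general background model and a small financial-news model.
bg = {"the": 0.5, "stock": 0.1, "cat": 0.4}
ind = {"the": 0.4, "stock": 0.5, "market": 0.1}
adapted = interpolate_lm(bg, ind, lam=0.5)
# "stock" gains probability mass; "market" enters the model.
```

Because both inputs are valid distributions, any convex combination of them is too, so the adapted model needs no renormalization.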
Metadata Extraction and Modeling: Current ASR output is impoverished:
too much information is missing. In this task we seek to introduce structural
information comparable to what a good human transcriber provides, to augment
it with higher-level information, and to feed information back
to the recognizer to improve ASR. Metadata topics include:
- Punctuation and topic segmentation
- Disfluency detection and clean-up
- Semantic annotation
- Dialogue act modeling
- Speaker recognition, segmentation, and tracking
- Annotation of speaker attributes
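As a concrete (and deliberately minimal) illustration of disfluency clean-up, the sketch below deletes filled pauses and immediate word repetitions from a word sequence. This is an assumed toy baseline, not the project's method: real disfluency detection in this task would also exploit prosodic features and statistical models.

```python
FILLED_PAUSES = {"uh", "um", "er"}

def clean_disfluencies(words):
    """Remove filled pauses and exact immediate repetitions ("I I want...")."""
    out = []
    for w in words:
        if w.lower() in FILLED_PAUSES:
            continue                          # drop filled pauses
        if out and out[-1].lower() == w.lower():
            continue                          # drop immediate repetition
        out.append(w)
    return out

print(clean_disfluencies("i i um want to to go".split()))
# → ['i', 'want', 'to', 'go']
```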
Evaluation: In this task, we will develop and maintain a state-of-the-art
Rich Transcription evaluation system, essential both for evaluating new
ASR technologies developed under this program and for participating in
the annual EARS evaluations.
- Develop and maintain a state-of-the-art Rich Transcription system based on SRI's Decipher technology
- Assemble an alternate system and/or alternate modules using publicly available components such as HTK or ICSI-based training and recognition modules
- Engineer the evaluation system for computational efficiency
- Participate in the EARS evaluations, for both English and non-English languages and for both telephone and broadcast media
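The central metric in these evaluations is word error rate (WER). As a reference point, here is a self-contained sketch of the standard definition: the word-level edit distance (substitutions, insertions, and deletions) between reference and hypothesis, divided by the reference length. Production scoring tools additionally handle alignment conventions, normalization, and scoring regions.

```python
def wer(ref, hyp):
    """Word error rate of hypothesis string `hyp` against reference `ref`."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(r)][len(h)] / len(r)

# One deletion against a six-word reference: WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```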