EARS: RT Tasks
Rich Transcription Tasks
The SRI/ICSI/UW Project Team will advance the state-of-the-art in automatic
Rich Transcription of speech by creating a collection of novel models,
algorithms, and techniques.
Core automatic speech recognition (ASR): Our objective in this task is
to markedly reduce the error rate of the core recognition process.
We will take a multifaceted approach to improving all aspects of the recognition
system, leveraging all levels of information in the speech signal. In particular,
we will focus on components that are 'broken' or have received little
attention so far.
- Front end processing
  - pitch-dependent analysis for multi-speaker and speaker/noise separation
  - multiple front ends tuned for extraction of different phonetic features
- Acoustic modeling
  - further improved speaking rate-dependent modeling
  - rapid adaptation to speakers, dialects, and speaking styles
- Pronunciation and duration modeling
  - data-driven learning of rules for generating and adapting pronunciation models
  - duration modeling at the phone, syllable, and word levels
- Language modeling
  - parameter-tying techniques for data-efficient discriminative LM training
- Post-recognition error correction
  - word-posterior estimators based on prosodic and other features not used in the recognizer
  - improved confidence estimates (feeds into the metadata task)
  - system combination at the feature level or through word posteriors
  - decorrelating systems for more effective system combination
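To make the word-posterior combination idea concrete, here is a minimal sketch (not the project's actual implementation): the hypotheses of several recognizers are assumed to be pre-aligned into confusion-network "slots", and each slot is decided by a weighted vote over the systems' word posteriors. The function name and slot representation are illustrative assumptions.

```python
from collections import defaultdict

def combine_systems(aligned_slots, weights=None):
    """Pick, for each slot, the word with the highest weighted posterior sum.

    aligned_slots: list of slots; each slot is a list (one entry per system)
                   of (word, posterior) pairs.
    weights: optional per-system weights (default: uniform).
    """
    output = []
    for slot in aligned_slots:
        w = weights if weights is not None else [1.0] * len(slot)
        scores = defaultdict(float)
        for (word, post), wt in zip(slot, w):
            scores[word] += wt * post
        best = max(scores, key=scores.get)
        if best != "<eps>":        # "<eps>" marks a deletion; emit nothing
            output.append(best)
    return output

# Two systems disagree on the middle word; the posteriors break the tie.
slots = [
    [("the", 0.9), ("the", 0.8)],
    [("cat", 0.4), ("hat", 0.7)],
    [("sat", 0.9), ("sat", 0.6)],
]
print(combine_systems(slots))  # → ['the', 'hat', 'sat']
```

In practice the hard part is the alignment itself; the vote above is the simplest form of posterior-level combination, and decorrelating the component systems (different front ends, different models) is what makes the vote informative.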
Rapid Development of ASR in New Languages and Domains (Portability):
- New Linguistic Phenomena
  - trajectory models of lexical tone
  - cross-boundary pronunciation effects
  - text normalization
- Domain Adaptation
  - text gathering and LM composition using transformation models
  - dynamic LM adaptation with uncertainty models
  - lexicon expansion
- Leveraging Limited Acoustic Resources
  - adaptation of a multilingual base acoustic model
  - speaker clustering and dependence models to handle limited speaker diversity in the target language
  - automatically derived sub-word units
  - automatic selection of data for transcription
- Severely Constrained Lexical Resources
  - automatic pronunciation acquisition
  - rapid development of morphological analysis tools
- Development of recognition systems in resource-rich languages, such as Mandarin, both to provide the contrast case of porting to a non-English language where resources are less constrained, and to explore new linguistic phenomena such as tone modeling
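One common form of LM domain adaptation, sketched below purely for illustration, is linear interpolation of a large background model with a small in-domain model. The toy unigram probabilities and the function name are assumptions; in practice the interpolation weight would be tuned (e.g., by EM on held-out in-domain text), and the project's transformation and uncertainty models go well beyond this.

```python
def interpolate_lm(p_background, p_indomain, lam=0.3):
    """Return P(w) = lam * P_in(w) + (1 - lam) * P_bg(w) over the joint vocabulary."""
    vocab = set(p_background) | set(p_indomain)
    return {w: lam * p_indomain.get(w, 0.0) + (1 - lam) * p_background.get(w, 0.0)
            for w in vocab}

# Toy unigram LMs: a general background model and a small financial-news model.
bg = {"the": 0.5, "stock": 0.1, "cat": 0.4}
ind = {"the": 0.4, "stock": 0.5, "market": 0.1}
adapted = interpolate_lm(bg, ind, lam=0.5)
# "stock" gains probability mass; "market" enters the model.
```

Because both inputs are valid distributions, any convex combination of them is too, so the adapted model needs no renormalization.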
Metadata Extraction and Modeling: Current ASR output is impoverished:
too much information is missing. In this task we seek to introduce structural
information comparable to what a good human transcriber provides, to augment
it with higher-level information, and to feed information back
to the recognizer to improve ASR. Metadata topics include:
- Punctuation and topic segmentation
- Disfluency detection and clean-up
- Semantic annotation
- Dialogue act modeling
- Speaker recognition, segmentation, and tracking
- Annotation of speaker attributes
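As a concrete (and deliberately minimal) illustration of disfluency clean-up, the sketch below deletes filled pauses and immediate word repetitions from a word sequence. This is an assumed toy baseline, not the project's method: real disfluency detection in this task would also exploit prosodic features and statistical models.

```python
FILLED_PAUSES = {"uh", "um", "er"}

def clean_disfluencies(words):
    """Remove filled pauses and exact immediate repetitions ("I I want...")."""
    out = []
    for w in words:
        if w.lower() in FILLED_PAUSES:
            continue                          # drop filled pauses
        if out and out[-1].lower() == w.lower():
            continue                          # drop immediate repetition
        out.append(w)
    return out

print(clean_disfluencies("i i um want to to go".split()))
# → ['i', 'want', 'to', 'go']
```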
Evaluation: In this task, we will develop and maintain a state-of-the-art
Rich Transcription evaluation system, essential both for evaluating new
ASR technologies developed under this program and for participating in
the annual EARS evaluations.
- Develop and maintain a state-of-the-art Rich Transcription system based on SRI's Decipher technology
- Assemble an alternate system and/or alternate modules using publicly available components such as HTK or ICSI-based training and recognition modules
- Engineer the evaluation system for computational efficiency
- Participate in the EARS evaluations, for both English and non-English languages and for both telephone and broadcast media
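The central metric in these evaluations is word error rate (WER). As a reference point, here is a self-contained sketch of the standard definition: the word-level edit distance (substitutions, insertions, and deletions) between reference and hypothesis, divided by the reference length. Production scoring tools additionally handle alignment conventions, normalization, and scoring regions.

```python
def wer(ref, hyp):
    """Word error rate of hypothesis string `hyp` against reference `ref`."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(r)][len(h)] / len(r)

# One deletion against a six-word reference: WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```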