EARS: RT Objectives
Rich Transcription Objectives
Problems:
Current speech-to-text technologies still suffer from several key limitations
that make them impractical for many potential applications, in both Government
and commerce:
- Word-level transcription accuracy for spontaneous, conversational speech
is only about 70%, and significantly lower in adverse conditions, such
as in noisy environments or when the style of speech or topic of discussion
was not well covered in the training data.
- Recognition output is an unstructured string of words, without proper capitalization,
punctuation, paragraph breaks, speaker labels, and other common markings
that make text readable. Furthermore, important information conveyed by
speaking style rather than by the words themselves is lost, such as what
the speaker was emphasizing (i.e., considered important) or any emotional
qualities apparent in the manner of speaking.
- Large amounts of data and expensive hand-transcriptions and annotations
are required to achieve state-of-the-art performance. For languages other
than English, such data may be hard to obtain, and the linguistic expertise
for data annotation and system development may not be available, further
limiting performance of speech-to-text systems in those languages.
Solutions:
The core of the EARS Rich Transcription program is directed at overcoming
these limitations:
- Reduce the word error rate of automatic speech transcription by leveraging
knowledge sources that are not currently captured in recognition systems but
contain information that can help rule out incorrect recognition hypotheses.
- Improve the accuracy and efficiency of transcription through better combination
of available knowledge sources.
- Enrich the recognition output with multiple levels of metadata annotation,
including proper names and sentence boundaries (enabling proper capitalization
and punctuation), types of utterances (e.g., questions versus statements),
changes in topics, and speaker identity. The transcribed speech will also be
edited so that the intended meaning can be readily inferred.
- Combine the recognition of word information and metadata into a unified
recognition model, so that information about one can help improve the other.
This will yield not only overall higher accuracy but also mutually
consistent outputs.
- Develop techniques to quickly adapt the recognition system to new languages,
speaking styles, and domains of discourse, without the full retraining
on vast amounts of data that is currently required.
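
To make the word error rate figure concrete, the sketch below shows one common way such a rate can be computed: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name word_error_rate and the two example sentences are illustrative assumptions and do not come from the EARS program. Treating word accuracy as 1 minus the error rate is also a simplification.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper for illustration; scoring tools used in formal
    evaluations also normalize text, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution, one deletion, one insertion
# against a ten-word reference.
reference = "the meeting starts at ten tomorrow morning in room four"
hypothesis = "the meeting start at ten morning in the room four"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}")  # prints "WER = 0.30"
```

Under the simplification above, an error rate of 0.30 corresponds to the roughly 70% word-level accuracy cited for conversational speech.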