EARS: Rich Transcription Objectives
Current speech-to-text technologies still suffer from several key limitations that make them impractical for many potential applications, in both Government and commercial settings:
Word-level transcription accuracy for spontaneous, conversational speech is only about 70%, and significantly lower in adverse conditions, such as in noisy environments or when the style of speech or topic of discussion is not well covered in the training data.
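The roughly 70% accuracy figure corresponds to a word error rate (WER) of about 30%. WER is the word-level edit distance (insertions, deletions, and substitutions) between the recognizer's hypothesis and a reference transcript, divided by the reference length. A minimal sketch, using standard dynamic programming (function name and example sentences are illustrative, not from the program):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words gives a WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```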
Recognition output is an unstructured string of words, without proper capitalization,
punctuation, paragraph breaks, speaker labels, and other common markings
that make text readable. Furthermore, important information conveyed by
speaking style, rather than the words themselves, is lost, such as what
the speaker was emphasizing (i.e., considered important), or any emotional
qualities apparent in the manner of speaking.
Large amounts of data and expensive hand-transcriptions and annotations
are required to achieve state-of-the-art performance. For languages other
than English, such data may be hard to obtain, and the linguistic expertise
for data annotation and system development may not be available, further
limiting performance of speech-to-text systems in those languages.
The core of the EARS Rich Transcription program is directed at overcoming these limitations. Its objectives are to:
- Reduce the word error rate of automatic speech transcription by leveraging knowledge sources not currently captured in recognition systems, but which contain information that can help rule out incorrect recognition hypotheses.
- Improve the accuracy and efficiency of transcription through better combination of available knowledge sources.
- Enrich the recognition output with multiple levels of metadata annotation, including proper names and sentence boundaries (enabling proper capitalization and punctuation), types of utterances (e.g., questions versus statements), changes in topic, and speaker identity. Disfluent speech will also be annotated so that the speaker's intended meaning can be inferred.
- Combine the recognition of word information and metadata into a unified recognition model, so that information about one can help improve the other. This will yield not only overall higher accuracy but also mutually consistent word and metadata output.
- Develop techniques to quickly adapt the recognition system to new languages, speaking styles, and domains of discourse, without full retraining on the vast amounts of data currently required.
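To make the enrichment objective concrete, the toy sketch below turns an unstructured lowercase word string into readable text, assuming sentence-boundary and proper-name annotations have been supplied by upstream metadata detectors. All function names, inputs, and the annotation format here are hypothetical illustrations, not part of the EARS specification:

```python
def enrich(words, sentence_ends, proper_names):
    """Render raw recognizer output as readable text.

    words:          list of lowercase tokens from the recognizer
    sentence_ends:  set of token indices that end a sentence (metadata)
    proper_names:   set of token indices that are proper names (metadata)
    """
    out = []
    capitalize_next = True  # capitalize the first word of each sentence
    for i, w in enumerate(words):
        if i in proper_names:
            w = w.capitalize()
        if capitalize_next:
            w = w[0].upper() + w[1:]
            capitalize_next = False
        if i in sentence_ends:
            w += "."        # a real system would also choose '?' for questions
            capitalize_next = True
        out.append(w)
    return " ".join(out)

words = "hello my name is alice how are you".split()
print(enrich(words, sentence_ends={4, 7}, proper_names={4}))
# → "Hello my name is Alice. How are you."
```

In a real rich-transcription system the boundary and name indices would come from statistical detectors rather than being given, and the same idea extends to speaker labels, topic breaks, and disfluency markup.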