EARS: Novel Approaches Objectives
Problems:
As noted in the Rich Transcription pages, current
speech-to-text technologies are still quite poor for some applications,
particularly for those requiring the transcription of conversational speech.
This may at least partially be explained by the following observations:
-
From a distance, all current ASR systems look the same: they (1) compute the local spectral
envelope, (2) determine the likelihoods of speech sounds having generated this envelope, and then
(3) search for the most likely sequence of HMMs. It seems likely that substantial improvements
will require breaking out of this mold in at least some respect.
-
The spectral envelope (or a simple function of it, such as the mel cepstrum) is distorted by many things,
including speaking style, head movement, channel, noise, and even small amounts of room reverberation.
However, radically different alternative features may be a poor fit to the assumptions of the commonly
used statistical models.
-
Speech recognition systems can be thought of as "half-deaf", given that phonetic recognition is quite poor,
particularly for conversational and/or degraded speech. Their success in some applications is due to
extensive constraints, such as a restricted domain, a specific speaker, or noise-cancelling mics. These
constraints can mask the underlying fundamental weakness of the technology.
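
The three-stage mold described in the first observation above can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the front end or decoder of any actual system: the function names, the frame length and hop (400 and 160 samples), the use of a plain low-order real cepstrum instead of the mel cepstrum, and the single diagonal Gaussian per state are all choices made here for brevity.

```python
import numpy as np

def cepstral_features(signal, frame_len=400, hop=160, n_ceps=13):
    """Stage 1: per-frame low-order real cepstrum, a compact stand-in for the
    local spectral envelope (assumption: no mel warping is applied here)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = np.empty((n_frames, n_ceps))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        # Low-quefrency cepstral coefficients describe the smooth envelope.
        feats[t] = np.fft.irfft(log_mag)[:n_ceps]
    return feats

def log_likelihoods(feats, means, variances):
    """Stage 2: log p(frame | state), one diagonal Gaussian per state
    (assumption: real systems use far richer acoustic models)."""
    diff = feats[:, None, :] - means[None, :, :]          # (T, S, D)
    return -0.5 * np.sum(diff**2 / variances
                         + np.log(2 * np.pi * variances), axis=2)

def viterbi(log_lik, log_trans, log_init):
    """Stage 3: most likely HMM state sequence given per-frame log-likelihoods."""
    T, S = log_lik.shape
    delta = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans               # scores[i, j]: i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_lik[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

Every stage consumes only the output of the one before it, so any distortion of the envelope in stage 1 propagates directly into the likelihoods and the search.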
Solutions:
The EARS Novel Approaches program in general (and the ICSI/SRI/UW/Columbia/IDIAP/OGI project in particular)
is directed at overcoming these limitations, using several fundamental principles:
-
Escape the dependence on the spectral envelope - replace (or augment) it with alternative features,
especially probabilistic ones (as opposed to simple transformations of local spectral energy)
-
Use multiple front ends across time/frequency - in particular, representing local spectral information
across longer stretches of time, and other more general functions of the time/frequency plane.
-
Design optimal combination schemes for these alternative feature streams.
-
Modify the statistical models to accommodate the new front ends. This may be particularly important
for features that are computed over temporal windows far longer than those in current practice.
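
Two of the principles above, longer temporal spans and the combination of feature streams, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the project's method: the helper names `stack_context` and `combine_posteriors`, the edge-padding choice, and the weighted log-linear rule, which is only one of many possible combination schemes.

```python
import numpy as np

def stack_context(feats, context):
    """Concatenate each frame with `context` frames on either side,
    repeating the first/last frame at the edges (an illustrative choice)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t : t + 2 * context + 1].ravel() for t in range(T)])

def combine_posteriors(streams, weights):
    """Weighted log-linear combination of per-stream class posteriors,
    renormalized so that each frame sums to one.

    streams: list of (T, C) arrays of posteriors, one per front end.
    weights: one non-negative weight per stream (hypothetical values).
    """
    log_mix = sum(w * np.log(p + 1e-10) for w, p in zip(weights, streams))
    log_mix -= log_mix.max(axis=1, keepdims=True)   # numerical stability
    mix = np.exp(log_mix)
    return mix / mix.sum(axis=1, keepdims=True)
```

Stacking multiplies the feature dimension by `2 * context + 1`, which is one concrete reason the statistical models may need to change when the window grows.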