Research in the Speech Group at ICSI
The major application area researched in the
Speech Group at ICSI is speech recognition,
although some of this work has led to basic research in
auditory processing. Overall, there have been three major foci of the
work:
- Auditory-inspired signal processing
In this work we focus on
signal processing transformations that provide increased robustness
(particularly for speech recognition) in the presence of acoustic
interference such as noise, reverberation, or modified channel frequency
response. We are interested in functional
models of human hearing, but maintain an engineering perspective
(that is, we are willing to use decidedly non-human approaches
if they work). In this category, we have developed
RASTA approaches,
most notably RASTA-PLP. In this style of processing,
simple models of forward temporal masking are used to accentuate
the components of spectral trajectories that have speech-like
temporal character, and discriminate against sources of variance
that change at a different rate (typically slower). The more recent
form of this approach, J-RASTA, provides robustness to both
channel spectral mismatch (between training and test) and additive noise.
Current work in this area focuses on using other auditory concepts
to provide greater robustness to the time smearing that occurs with
reverberation and, more generally, on building better models of temporal
processing as it relates to speech intelligibility under realistic
conditions, both for people and machines.
- Statistical modeling
Since 1988 we have worked in the area
of hybrid connectionist/HMM systems. Recently we have been focusing
on transition-based models of speech. Some of this work is centered
around constraining the statistical models to have some of the
temporal properties suggested by work in the previous research
category, particularly for the model we have
called SPAM
(Stochastic Perceptual Auditory-event-based Models). In other work we have
developed a recursive algorithm for training transition-based systems
that in theory maximizes the posterior probability for the correct
model sequence; this approach is called REMAP,
which stands for Recursive Estimation and Maximization of A Posteriori
Probabilities.
- Models of phonology and language for speech understanding
We have also been working on basic questions in language modeling
for speech recognition at several levels, including automatic learning
of stochastic pronunciation models, and the incorporation of both
structural grammars and simple statistical N-gram grammars in
speech understanding systems. To this end we have built
an interactive query system called BeRP,
the Berkeley Restaurant Project.
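As a rough illustration of the RASTA idea described in the first item above, the sketch below band-pass filters the time trajectory of a single log-spectral band. It is a minimal sketch under stated assumptions, not our production code: the filter coefficients are the ones commonly quoted for the classic RASTA filter, the pole value of 0.98 and the 100 frames-per-second rate are assumptions, and the function name `rasta_filter` is hypothetical.

```python
import math


def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one log-spectral band over time (one value per frame).

    Assumed transfer function (not taken from this page):
        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole * z^-1)
    """
    # FIR part: the coefficients sum to zero, so a constant input is cancelled.
    numer = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    prev_y = 0.0
    for n in range(len(trajectory)):
        acc = 0.0
        for k, b in enumerate(numer):
            if n - k >= 0:  # frames before the start are treated as zero
                acc += b * trajectory[n - k]
        # IIR part: leaky integration controlled by the pole.
        prev_y = acc + pole * prev_y
        out.append(prev_y)
    return out


FRAME_RATE = 100.0  # assumed frames per second (10 ms frame step)

# A fixed offset in the log spectrum, as a constant channel mismatch would add.
offset = [5.0] * 600
# A 4 Hz modulation, roughly the rate at which speech spectra change.
speech_like = [math.sin(2 * math.pi * 4.0 * n / FRAME_RATE) for n in range(600)]

offset_out = rasta_filter(offset)
speech_out = rasta_filter(speech_like)

# The constant offset decays toward zero; the 4 Hz component is largely kept.
print("residual offset:", abs(offset_out[-1]))
print("peak of 4 Hz component:", max(abs(v) for v in speech_out[-100:]))
```

Because a fixed channel response becomes an additive constant in the log-spectral domain, filtering each band this way suppresses it while leaving speech-rate modulations mostly intact.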
There are a number of other areas of research that do not fit
neatly in any of these categories. For instance, we have been looking
at the sources of degradation in the machine recognition of rapid speech.
We are interested in the detection and incorporation of
information about accent in our recognition systems.
We are researching the interaction between the statistical learning
of acoustic and language models. In the last year or two, we have also
increased our focus on natural and impromptu meetings as a task, with subtasks
of online and offline transcription, far-field microphone recognition
(for impromptu placement as one might have with a PDA or digital
recorder), and modeling of the conversations at many levels.
Nelson Morgan -
May 25, 2001