Listening to speech recognition - the surfsynth home page


I've written a little tool that takes RASTA-style subband envelope files and remodulates them onto a noise excitation, reconstructing an audio signal containing effectively the same information as the envelope file. As Bob Shannon and others have recently reported, such speech is certainly intelligible, but it's interesting to listen to because it gives you a feeling for how much is lost from the very start in speech recognition by discarding the signal's fine structure (i.e. the voicing etc.). The following example illustrates a number of stages in RASTA speech processing, both visually as spectrograms and, thanks to surfsynth, with the same information in audio form.
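
To make the resynthesis idea concrete, here is a minimal Python sketch (not surfsynth's actual code) of the general technique: each subband envelope modulates a band-pass-filtered white-noise carrier, and the modulated bands are summed. The frame rate, band edges, and filter order here are illustrative assumptions.

    import numpy as np
    from scipy.signal import butter, lfilter

    def resynth_from_envelopes(env, frame_rate=100.0, sr=8000,
                               band_edges=None):
        # env: (n_frames, n_bands) linear-amplitude subband envelopes
        n_frames, n_bands = env.shape
        if band_edges is None:
            # hypothetical log-spaced band edges, 100 Hz to near Nyquist
            band_edges = np.geomspace(100.0, 0.95 * sr / 2, n_bands + 1)
        n_samples = int(n_frames * sr / frame_rate)
        t_frames = np.arange(n_frames) / frame_rate
        t_samples = np.arange(n_samples) / sr
        out = np.zeros(n_samples)
        rng = np.random.default_rng(0)
        for band in range(n_bands):
            # band-limited white-noise carrier for this channel
            noise = rng.standard_normal(n_samples)
            b, a = butter(2, [band_edges[band], band_edges[band + 1]],
                          btype='band', fs=sr)
            carrier = lfilter(b, a, noise)
            # interpolate the frame-rate envelope up to the sample rate
            gain = np.interp(t_samples, t_frames, env[:, band])
            out += gain * carrier
        return out / (np.max(np.abs(out)) + 1e-9)   # normalize peak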

This tool, and some of the intermediate data forms used below, were developed as part of my ASR-within-CASA project, as described in my 1997 Mohonk paper and my putative ICASSP 98 paper. The images are screenshots from pfview, and from recogviz for the spectrogram of the original signal.


Speech analysis example

This is utterance 4758zi from the numbers95_cs corpus. Click on the spectrogram images to hear the corresponding samples.

Original utterance


Smoothed RASTA-PLP subband envelopes

This is the signal coming out of the PLP Bark-scaled filterbank, to which the RASTA filter is applied. It has already been smoothed by the 4-point low-pass part of the RASTA filter.
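
For reference, the RASTA filtering itself amounts to a simple band-pass applied along time to each log-domain envelope. Here is a sketch using the coefficients published by Hermansky and Morgan (FIR ramp numerator, integrator pole at 0.98); the exact filter and its low-pass/band-pass factoring used for these figures may differ.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(log_env, pole=0.98):
        # log_env: (n_frames, n_bands) log-domain subband envelopes;
        # filter each band along the time axis.
        b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # band-pass FIR ramp
        a = np.array([1.0, -pole])                       # leaky integrator
        return lfilter(b, a, log_env, axis=0)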

RASTA-PLP filtered features

Each subband has been band-pass filtered in the log domain (RASTA filtering) then approximated by an all-pole model across frequency (PLP modelling). It is then converted back into the spectral domain for display and resynthesis. Because the RASTA filtering occurs in the log domain, there aren't any problems with the energies going negative. The "equal-loudness" weighting applied prior to the PLP all-pole modelling has made the low frequencies much weaker than the rest of the spectrum.

Notice how the onset-enhancing nature of the RASTA filter really emphasizes the first syllable, as well as the "six", which comes after a brief pause.
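
The across-frequency all-pole (PLP) modelling of each frame can be sketched via the standard autocorrelation route: inverse-DFT the band power spectrum to an autocorrelation, run Levinson-Durbin, then evaluate the resulting smooth model spectrum back on the band frequencies. The model order below is an assumption, and full PLP's equal-loudness and cube-root-compression steps are omitted for brevity.

    import numpy as np

    def levinson(r, order):
        # Levinson-Durbin recursion: all-pole coefficients a (a[0] = 1)
        # and residual energy err, from autocorrelation r[0..order].
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            a_prev = a.copy()
            a[1:i + 1] += k * a_prev[i - 1::-1]
            err *= 1.0 - k * k
        return a, err

    def plp_smooth(power_bands, order=8):
        # power_bands: (n_bands,) nonnegative band spectrum for one frame
        # (after RASTA filtering and conversion back out of the log domain).
        n = len(power_bands)
        # autocorrelation = inverse DFT of the (implicitly symmetric) spectrum
        r = np.fft.irfft(power_bands, n=2 * (n - 1))[:order + 1]
        a, err = levinson(r, order)
        # evaluate the smoothed all-pole spectrum on the same band frequencies
        A = np.fft.rfft(a, n=2 * (n - 1))
        return err / (np.abs(A) ** 2 + 1e-12)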

Reconstruction from recognizer labels

As described in my putative ICASSP'98 paper, I trained a single diagonal Gaussian on the 9x15 spectral feature window for each label, then overlap-added the means, weighted by the inverses of the SDs, to reconstruct an "ideal" feature sequence from the phonemic labelling generated by the recognizer. The actual labels, the sole data used to generate the image and sound, are shown below. I only just noticed that the recognizer in fact got it wrong - it's transcribed as "five one nine six one" instead of "five four nine six one"!
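
A hedged sketch of that reconstruction is given below, under assumed data structures (dicts mapping each label to a 9x15 patch of per-cell means and SDs); the exact windowing and normalization in the paper may differ.

    import numpy as np

    WIN, NCHAN = 9, 15   # 9-frame window, 15 spectral channels (from the text)

    def reconstruct_from_labels(labels, means, sds):
        # labels: (n_frames,) integer label per frame (the recognizer output).
        # means, sds: hypothetical dicts mapping label -> (WIN, NCHAN) arrays
        # of per-cell Gaussian means and standard deviations.
        n = len(labels)
        acc = np.zeros((n, NCHAN))
        wsum = np.zeros((n, NCHAN))
        half = WIN // 2
        for t, lab in enumerate(labels):
            lo, hi = max(0, t - half), min(n, t + half + 1)
            w = 1.0 / sds[lab]          # inverse-SD confidence weights
            # rows of the patch that fall inside the utterance
            a, b = lo - (t - half), hi - (t - half)
            acc[lo:hi] += (w * means[lab])[a:b]
            wsum[lo:hi] += w[a:b]
        return acc / np.maximum(wsum, 1e-9)   # normalize the overlap-add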

(This is what it sounds like if you don't undo the pre-PLP "equal-loudness" weighting.)

Adding in the slowly-varying part from the original signal gives something more comparable to the input, but still very blurred:

Just for reference, here is the slowly-varying part of the original signal without the label-derived structure added to it:

I'd have to agree that the label-derived structure doesn't change this a whole lot, but it does add a little bit of definition. I think you can hear the difference better than you can see it in the spectrograms.


Using surfsynth

Read all about it in the surfsynth man page.


Updated: $Date: 1997/11/04 21:54:54 $
DAn Ellis <dpwe@icsi.berkeley.edu>
International Computer Science Institute, Berkeley CA