ICSI Speech FAQ:
5.4 How do you calculate MSG features?

Answer by: dpwe - 2000-08-02


Update

This page describes the historical approach to calculating MSG features. However, I recently made a script, msgcalc, that provides an integrated command for MSG calculation, mirroring almost all the relevant command-line options of feacalc, the standard Rasta-PLP feature calculation program. So read this page to learn about MSG, but use msgcalc to calculate the features, following the information in its man page.

2000-08-04 dpwe@icsi.berkeley.edu

The Modulation-filtered Spectrogram (MSG) features were developed by Brian Kingsbury, inspired by Rasta processing and auditory science (there's a bit more information on the feature types page). Brian also developed a whole infrastructure for feature calculation called drspeech_featools. The philosophy was to provide execution-time flexibility in feature calculation by making each feature calculation process a chain of relatively simple programs, connected together by Unix pipes. This means that a single delta-calculation or online-normalization program can be added to a pipe with any kind of base feature, rather than having to be compiled into each separate program. This approach went along with a C++ library of signal processing primitives, a version of which lives in /u/drspeech/src/bedk_frontend. That directory also includes the source of the modspec program (which unfortunately has no man page at this time).

Although modspec has become a widely-used program, it still exists as part of this component approach to feature calculation, making it a little more awkward to use than a single, monolithic program like feacalc (within which it should arguably be incorporated at some point). Thus a typical calculation of MSG features is done via a shell script whose core might look something like:

  # msg3 (0-8 Hz and 8-16 Hz) for 10ms steptime
  FILT_DIR=/u/drspeech/data/modspec/10ms/msg3
  FILT_A=$FILT_DIR/lo0_hi8_n21_dn5.sos
  FILT_B=$FILT_DIR/lo8_hi16_n21.sos
  TAU1=160
  TAU2_A=320
  TAU2_B=320

  STEPTIME=0.010
  WINTIME=0.025

  ## 8 kHz settings
  SRHZ=8000
  NFTRS=14
  NFFT=256
  NWIN=`calc "int($WINTIME*$SRHZ)"`
  NSTEP=`calc "int($STEPTIME*$SRHZ)"`

  mknod $PIPE p
  wavs2onlaudio sf=$SRHZ infilename=$WAVLIST ipsffmt=PCM/R8FsC1Eb \
      wavdir=$WAVDIR wavext=.raw \
  | tee $PIPE \
  | modspec \
      -sf $SRHZ -nfft $NFFT -nwin $NWIN -nstep $NSTEP \
      -efilt $FILT_A \
      -agctau1 $TAU1 -agctau2 $TAU2_A \
  | feacat -width $NFTRS -ip onl -ox -op pfile -out $BASENAME-msg3a.pf \
  & modspec \
      -sf $SRHZ -nfft $NFFT -nwin $NWIN -nstep $NSTEP \
      -efilt $FILT_B \
      -agctau1 $TAU1 -agctau2 $TAU2_B \
    < $PIPE \
  | feacat -width $NFTRS -ip onl -op pfile -out $BASENAME-msg3b.pf
  wait    # let the backgrounded first-bank pipeline finish before merging
  rm $PIPE

  pfile_merge -i1 $BASENAME-msg3a.pf -i2 $BASENAME-msg3b.pf -o $BASENAME.pf
  rm $BASENAME-msg3a.pf $BASENAME-msg3b.pf

Here, $WAVLIST is a list of waveform file IDs, which are made into filenames by prepending $WAVDIR and appending ".raw". The resulting files are raw PCM with an 8 kHz sampling rate, 16-bit short integer samples, a single channel and big-endian byte order (ipsffmt=PCM/R8FsC1Eb). The program wavs2onlaudio converts this set of waveform files into the simple online audio stream format used by Brian's tools.
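
To make the naming convention concrete, here is a minimal sketch of the ID-to-filename expansion described above. It only illustrates the convention and is not how wavs2onlaudio is implemented; the helper name ids_to_paths, the directory and the utterance IDs are all hypothetical.

```shell
# Hypothetical helper: read file IDs (one per line) on stdin and print
# the filename formed by prepending a directory and appending an extension.
#   $1 = directory to prepend (plays the role of $WAVDIR)
#   $2 = extension to append  (plays the role of wavext, e.g. .raw)
ids_to_paths() {
    while IFS= read -r id; do
        if [ -n "$id" ]; then
            printf '%s/%s%s\n' "$1" "$id" "$2"
        fi
    done
}

# Example with made-up IDs; in the script above the IDs come from $WAVLIST.
printf 'utt001\nutt002\n' | ids_to_paths /data/wavs .raw
```

Running the expansion over the list before starting a long job is a cheap way to check that every expected .raw file actually exists.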

This stream is split in two (by tee), with one copy being fed to each of two invocations of modspec. All the modspec formats we use comprise two banks, differing mainly in their modulation frequency pass-band, acting something like the direct+delta features of common feature representations. The key argument to modspec is -efilt, which specifies the "sos" file defining the modulation-domain filter in terms of second-order sections. The modulation filters must of course match the implicit feature-domain sampling rate defined by $STEPTIME. SOS files can be read and written by Matlab with read_sos and write_sos (in /u/drspeech/share/lib/matlab/icsi).
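
The relationship between the step time and the feature-domain sampling rate is simple arithmetic, sketched below together with the window and step sizes in samples. This is only an illustration under assumptions: it uses awk rather than the calc command of the original script, it rounds rather than truncates, and the FRAME_HZ and NYQ_HZ variable names are introduced here.

```shell
# Settings copied from the 8 kHz, 10 ms script above.
SRHZ=8000
WINTIME=0.025
STEPTIME=0.010

# Window and step length in samples (rounded to the nearest integer).
NWIN=$(awk -v t="$WINTIME" -v sr="$SRHZ" 'BEGIN { printf "%d", t * sr + 0.5 }')
NSTEP=$(awk -v t="$STEPTIME" -v sr="$SRHZ" 'BEGIN { printf "%d", t * sr + 0.5 }')

# Feature-domain (frame) rate and the highest modulation frequency
# that this frame rate can represent (half the frame rate).
FRAME_HZ=$(awk -v t="$STEPTIME" 'BEGIN { printf "%g", 1 / t }')
NYQ_HZ=$(awk -v t="$STEPTIME" 'BEGIN { printf "%g", 0.5 / t }')

echo "nwin=$NWIN nstep=$NSTEP frame_rate=${FRAME_HZ}Hz modulation_nyquist=${NYQ_HZ}Hz"
# prints: nwin=200 nstep=80 frame_rate=100Hz modulation_nyquist=50Hz
```

With STEPTIME=0.016 the same arithmetic gives a 62.5 Hz frame rate, so a filter designed for 10 ms steps would have its pass-band edges in the wrong place at 16 ms steps. The filter directories used on this page are accordingly named by step time (10ms, 16ms).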

MSG processing also includes two stages of automatic gain control, whose time constants affect the precise form of the features and are therefore also given as arguments (-agctau1 and -agctau2). In some variants, the time constants differ between the two banks (based on Brian's empirical optimization).

modspec writes its output in the simple online feature format, which is converted into the more convenient pfile format by feacat. Finally, after the modspec processing is done, the separate pfiles for the two banks are glued together side-by-side with pfile_merge to create the final msg3 pfile. (This might often be passed through per-utterance normalization with pfile_normutts to make msg3N.)
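
The plumbing itself (one source stream, a named pipe, two processing chains, a side-by-side merge) can be tried out without any of the speech tools. The sketch below is a toy stand-in built only from standard Unix utilities: the two awk commands play the role of the two modspec banks and paste plays the role of pfile_merge. It assumes mkfifo and mktemp are available (mkfifo is equivalent to the mknod ... p call used above).

```shell
# Scratch directory and named pipe for the second copy of the stream.
tmp=$(mktemp -d)
fifo="$tmp/split.fifo"
mkfifo "$fifo"

# Chain A runs in the background: the source stream is copied into the
# named pipe by tee, and also passed down the ordinary pipe.
printf '1\n2\n3\n' \
  | tee "$fifo" \
  | awk '{ print $1 * 2 }' > "$tmp/bank_a.txt" &

# Chain B runs in the foreground, reading the copy from the named pipe.
awk '{ print $1 + 100 }' < "$fifo" > "$tmp/bank_b.txt"

# Make sure the backgrounded chain has finished before merging.
wait

# Glue the two outputs together column-wise, one row per input line.
RESULT=$(paste "$tmp/bank_a.txt" "$tmp/bank_b.txt")
echo "$RESULT"

rm -r "$tmp"
```

Each output row holds the chain-A value and the chain-B value for the same input line, separated by a tab, which is the same side-by-side arrangement that the merged pfile has for its two banks of features.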

What's with msg1, msg3 etc?

As mentioned above, there can be any number of variants of MSG processing, differing in the modulation filtering and AGC time constants of the banks. Just two have made it into mainstream recognition jobs at ICSI. The first is msg3 (as above), in which the modulation bands are 0-8 Hz and 8-16 Hz. This was Brian's final feature set (or close to it), which he found to work best with small-vocabulary, telephone-bandwidth tasks such as NUMBERS95. We have also used it with the Aurora noisy digits task.

The other MSG variant in common use is known as msg1 (msg2 did exist but wasn't competitive, so it has disappeared). msg1 uses modulation pass bands of 0-16 Hz and 2-16 Hz, and was found to be the most successful complement to PLP features for the Broadcast News task (based on either the full-band 16 kHz waveforms, or waveforms downsampled to 8 kHz to make them less different from telephone-bandwidth data). Broadcast News uses 16 ms window steps, which of course must be factored into the filter design. A fragment to calculate msg1 features, including downsampling the waveforms to 8 kHz, might look like:

FILT_DIR=/u/drspeech/data/modspec/16ms/msg1
wavs2onlaudio sf=16000 rangerate=1 rngstartoffset=-0.072 rngendoffset=0.072 \
    infilename=$LISTFILE \
| sndrsmp -S PCM/R16Abb -T PCM/Abb -r 8000 - - \
| tee $PIPE \
| modspec \
    -efilt $FILT_DIR/lo0_hi16.sos \
    -agctau1 160 -agctau2 320 \
| feacat -width 14 -ip onl -op pfile -out $BASENAME-msg1a.pf \
& modspec \
    -efilt $FILT_DIR/lo2_hi16.sos \
    -agctau1 160 -agctau2 640 \
  < $PIPE \
| feacat -width 14 -ip onl -op pfile -out $BASENAME-msg1b.pf

Note the different agctau2 in each bank.

For the curious, I made some plots of the temporal properties of MSG features compared to PLP. They're online at my MSG temporal structure page.



Generated by build-faq-index on Tue Mar 24 16:18:16 PDT 2009