ICSI Speech FAQ:
7.4 What is forced alignment?

Answer by: dpwe - 2000-07-27


In order to train an acoustic model (or, more generally, any statistical pattern classifier), you need a training set of labelled examples. In speech recognition we usually want our acoustic models to classify frames into a set of subword units such as phones, context-dependent subphones etc. We then face the problem of generating labels for our training data in terms of these classes.
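Concretely, such a training set is just a sequence of feature vectors with one class index per frame. A minimal sketch in Python (all of the sizes here are illustrative, not from any particular corpus):

  import numpy as np

  n_frames, n_dims, n_phones = 100000, 13, 61   # illustrative sizes only

  features = np.random.randn(n_frames, n_dims)         # acoustic features
  targets = np.random.randint(0, n_phones, n_frames)   # phone index per frame

  # A frame-level classifier (e.g. the MLP of a hybrid HMM/ANN system) is
  # trained on (features[i], targets[i]) pairs; the rest of this answer is
  # about how the targets column gets filled in.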

This can be done by hand, at least in some cases. For the TIMIT corpus, for instance, experts in linguistics went through each waveform, carefully marking the boundaries between the different phones as they understood them. At ICSI, we did some similar work on several hours of the conversational Switchboard corpus (the so-called Switchboard Transcription Project or STP). However, this work is incredibly time-consuming and suffers from consistency problems between experts.

For the many tens of hours of speech in corpora like Broadcast News, it is inconceivable that humans would manually label all the phones. Instead, court reporters are used to obtain word-level transcripts of the utterances; the word sequence is then used to constrain an optimal alignment between existing speech models and the new speech data. This process is called forced alignment.
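At its core the alignment is a Viterbi-style dynamic program: given per-frame log-likelihoods for each phone and the fixed phone sequence that the word transcript (plus a pronunciation dictionary) dictates, find the monotonic assignment of frames to phones that maximizes the total score. The following Python sketch illustrates the idea only; a real aligner uses multi-state phone models, minimum durations and pruning:

  import numpy as np

  def force_align(loglik, phone_seq):
      """Assign each frame to a position in a fixed phone sequence.
      loglik    : (n_frames, n_phones) array of per-frame log-likelihoods
      phone_seq : phone indices in transcript order
      Returns the phone-sequence position occupied by each frame."""
      T, S = loglik.shape[0], len(phone_seq)
      score = np.full((T, S), -np.inf)    # best score ending at (frame, pos)
      back = np.zeros((T, S), dtype=int)  # backpointers for the best path
      score[0, 0] = loglik[0, phone_seq[0]]
      for t in range(1, T):
          for s in range(S):
              stay = score[t - 1, s]                           # same phone
              adv = score[t - 1, s - 1] if s > 0 else -np.inf  # next phone
              best, back[t, s] = (stay, s) if stay >= adv else (adv, s - 1)
              score[t, s] = best + loglik[t, phone_seq[s]]
      path = [S - 1]                      # last frame must end the sequence
      for t in range(T - 1, 0, -1):
          path.append(back[t, path[-1]])
      path.reverse()
      return path

Each maximal run of equal values in the returned path is one phone segment; multiplying the segment boundaries by the frame step gives the label times.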

Forced alignment is a neat way to obtain training targets for large speech corpora. It may even be preferable to manual labelling, because the alignment is applied absolutely consistently by the machine. Typically, the labelling is then refined through iterated alignment and retraining, so-called embedded training.
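Schematically, embedded training is just the following loop; align and train here are hypothetical placeholders for whatever aligner and acoustic-model trainer are in use:

  def embedded_training(features, transcripts, model, align, train, n_iters=5):
      """Iteratively realign and retrain (a schematic sketch).
      align(model, feats, words) -> frame labels for one utterance
      train(features, labels)    -> a new acoustic model"""
      for _ in range(n_iters):
          # 1. Forced alignment with the current model relabels every frame.
          labels = [align(model, f, w) for f, w in zip(features, transcripts)]
          # 2. Retraining on the new labels usually improves the model, which
          #    in turn improves the next round of alignments.
          model = train(features, labels)
      return model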

How to perform forced alignment at ICSI

There are potentially several mechanisms for this. It used to be that the only way we could do it was with our archaic Y0 decoder. These days, however, the preferred option is the dr_align class of programs, of which dr_align_efsg is at present the only instance.

Some additional notes (gelbart, 2/27/02)

I got an email from Dan with valuable information about making the number of frames in the label file match the number of frames in the feature pfile, and about creating ilab-format label files (which are smaller than CTM files or label pfiles).

Date: Wed, 13 Feb 2002 11:34:04 -0500
From: Dan Ellis 
To: John Eric Fosler-Lussier 
Cc: David Gelbart ,
     J. Eric Fosler-Lussier 
Subject: Re: Question about dr_align_efsg

>>  b) What's strange is that the pflab file that results from the CTM
>>  file doesn't have the same number of frames as the pfile (each has one
>>  less frame than the pfile).  I played around with the zeropad option
>>  (setting it to 4.5 frames*16 ms framestep=72 ms) and got it to produce
>>  the right number of labels.  However, it's not clear to me why a
>>  zeropad of 64 (4*16) doesn't work, as the math would suggest.  Perhaps
>>  DAn can shed some light on why this is so.

windowtime also factors in.  The exact length of the pfile depends on
how many complete windows could fit in the original signal, as well 
as the window alignment policy.  For the default (first window starts
with first sample, rather than any kind of extrapolation), the number
of frames in the pfile will be:

  1 + (nsamples - nwindow)/nframe

.. and the output of a forward pass will have (contextwin-1)/2 fewer 
frames at each end.  To convert this into the appropriately-matching 
sampled-labels file depends on exactly how the timings in the CTM file
were generated.  If they were true times, setting steptime and
windowtime to match the feature calculation and leaving zeropad at
zero should work.  But if the times in the CTM are the index of the
unpadded output of neural net multiplied by the frame rate, then
probably the easiest thing to do is to set windowtime=steptime (to
hide any cleverness trying to account for the window/step offset), 
set zeropad to 4 x steptime (for cw=9) to account for the lost frames
in the context window, and then add a further 0.5 x (wintime-steptime) 
to account for the time skew of the first frame due to the 'skirts' of
the first complete window.  With wintime=32 ms, steptime=16 ms, this 
would amount to 4.5x16 ms.  

What we're saying is that the first frame out of the neural net
corresponds to the 5th frame of the feature calculation, which was
calculated from the samples between 64 and 96 ms into the soundfile
(the first frame is 0..32ms, the second frame is 16..48ms etc), or, if
we talk only about the 16 ms in the center of the analysis frame
(i.e. ignoring the low-amplitude skirts of the Hamming window), it's
the samples between 72 and 88 ms into the soundfile, so we have to pad
the offset-ignoring times in the CTM file by 72ms to make them match
correctly.  

>>  BTW, was there a way to directly write ilab files from ctm files?)

  labels2pfile pfile=file.ilab opformat=ilab...

will write ilab rather than pflab, despite the confusing syntax.

  DAn.
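To make the arithmetic in DAn's message easy to check, here it is in Python (the 16 kHz sample rate and 3-second file length are example values only):

  def n_pfile_frames(n_samples, n_window, n_step):
      # DAn's formula: frame count when the first window starts at sample 0
      # and only complete windows are counted (hence integer division).
      return 1 + (n_samples - n_window) // n_step

  sr = 16000                       # example sample rate
  n_window = int(0.032 * sr)       # 32 ms window = 512 samples
  n_step = int(0.016 * sr)         # 16 ms step   = 256 samples
  print(n_pfile_frames(3 * sr, n_window, n_step))      # -> 186 frames

  cw = 9                           # context window of the forward pass
  lost_per_end = (cw - 1) // 2     # 4 frames lost at each end
  wintime, steptime = 32.0, 16.0   # ms
  zeropad = lost_per_end * steptime + 0.5 * (wintime - steptime)
  print(zeropad)                   # -> 72.0 ms = 4.5 x 16 ms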

Also, I got the following commands, which can be used to check that the number of frames in the label file matches the number of frames in the feature pfile, in an email from Eric Fosler-Lussier. Since the commands are a little involved, I thought they were worth saving here. The label file in this case is assumed to be in pfile format rather than ilab.

feacat -sr 0:499 -q -v -ip pf my_features.pfile > /tmp/x1

labcat -q -v -sr 0:499 -ip pfile -op ascii my_labels.pflab > /tmp/x2 

diff /tmp/x1 /tmp/x2

