ICSI Speech FAQ:
4.2 How do I convert a time in seconds into a frame index?

Answer by: dpwe - 2001-04-13


Relating feature frames to times in the soundfile

On Thu, Apr 12 2001, Chuck Wooters wrote:

Can someone tell me how to convert from a time (in seconds) to a frame index?

This is a good FAQ. It's easy enough to get this wrong by one frame, and that messes things up when we try to combine results from different systems. In particular, you can get into trouble with large times represented as floats, which lose resolution at the sample level (e.g. in long Broadcast News recordings).

It depends on how the feature calculation is run, but mostly these days we run in the slightly wasteful "only complete frames" mode (i.e. rasta without -y, feacalc without -pad), in which case frame n (counting from zero) is based on the samples n*hop+[0:(win-1)] (also counting from zero, although I'm using Matlab syntax to describe the indices).

Thus the center of frame n is at sample win/2 + n*hop (or time (win/2 + n*hop)/samplerate).

Let's say we have win=25ms, hop=10ms and samplerate=8000Hz. Then the window is 200 samples and the hop is 80 samples. The first window (n=0) includes samples 0..199, the next one (n=1) covers 80..279, etc. Samples 80..199 are used in both frames (the overlap), but 80..139 are 'closer' to frame n=0 and 140..199 are 'closer' to frame n=1.

Thus, to convert a time t to a feature frame n:

  1. Round (don't truncate) the time to the nearest sample. The times should be quantized to sample instants anyway, which is why truncating is bad: if the floating-point value happens to land just below the true value, truncating will put you one sample early. So be sure to round:
         float t;                /* time in seconds */
         float samplerate;       /* samples per second, e.g. 8000 */
         int t_samp;

         t_samp = (int)rint(t * samplerate);   /* rint() is declared in <math.h> */

  2. Offset by (win-hop)/2 samples, which is the extra bit at the beginning that comes from using only wholly-enclosed frames (you would skip this step in rasta -y mode, I think).
         int win_samp, hop_samp;
         int t_samp_adj;
    
         t_samp_adj = t_samp - (win_samp - hop_samp)/2;
    
    win_samp - hop_samp is always even, meaning I've never even thought about what the right answer would be if it wasn't.
  3. Divide by the number of samples per hop. Here, truncating is the correct rounding: values 0..(hop_samp - 1) map to frame 0, hop_samp..(2*hop_samp - 1) to frame 1, etc. (All three steps are combined in the sketch below.)
         int frame;
    
         frame = t_samp_adj / hop_samp;
    
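Putting the three steps together, here's a minimal C sketch (the function name time_to_frame is just for illustration; it isn't anything from the rasta or feacalc sources):

    #include <math.h>

    /* Convert a time in seconds to a feature frame index, assuming the
       "only complete frames" mode described above (rasta without -y,
       feacalc without -pad). */
    int time_to_frame(double t, double samplerate,
                      int win_samp, int hop_samp)
    {
        int t_samp = (int)rint(t * samplerate);             /* 1. round */
        int t_samp_adj = t_samp - (win_samp - hop_samp)/2;  /* 2. offset */
        /* 3. integer division truncates, which is the rounding we want;
           times before the first full frame all end up in frame 0 here */
        return t_samp_adj / hop_samp;
    }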

So, for example, suppose I have a pfile which contains frames of PLP data calculated from an utterance beginning at time index 56.4438 and running through time index 64.0459 (a duration of 7.6021 seconds). Using a window size of 25 msec and a step of 10 msec, there are 758 frames in the pfile. (Since the -pad option was not used, according to the feacalc man page the number of frames is 1+floor((fileduration - windowtime)/steptime).) Now my question is: how do I figure out which frames in the pfile correspond to the time span from, say, 57.0 through 58.0?

The 758 frames in the pfile are centered on times

  56.4438 + 0.025/2 + n*0.010

where n is the frame index, from zero (the center of frame n lying win/2 + n*hop after the start of the utterance, as above). In most cases, the 'support' of a frame would be considered as its central time +/- 5 ms (for non-overlapping support), although of course it includes the samples for t_cent +/- 12.5 ms.

Thus, the first frame that is 'mostly' inside your range is the one whose time center is >= 57.0:

    56.4438 + 0.025/2 + n*0.010 >= 57.0

==> n1 >= ( 57.0 - 56.4438 - 0.0125 ) / 0.010
       >= 54.37

==> n1 = 55 (covering times 57.0063 +/- 5 ms)

and the last frame 'mostly inside' your range is the last one whose central time is < 58.0, hence

    n2 < ( 58.0 - 56.4438 - 0.0125 ) / 0.010
       < 154.37
==> n2 = 154

i.e. you'd use frames 55..154 (counting from zero in the pfile), which is 100 frames, which sounds right.

But if you wanted the calculations to be reproducible, you'd quantize all the times to sample counts and do the divisions on integers.
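
For instance, here is a minimal all-integer C sketch of the example above (the variable names are mine, and an 8000 Hz sampling rate is assumed, since the pfile example doesn't state one):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double samplerate = 8000.0;
        int win_samp = 200, hop_samp = 80;          /* 25 ms and 10 ms */
        int t0 = (int)rint(56.4438 * samplerate);   /* utterance start */
        int t1 = (int)rint(64.0459 * samplerate);   /* utterance end */
        int s = (int)rint(57.0 * samplerate) - t0;  /* range start, in samples */
        int e = (int)rint(58.0 * samplerate) - t0;  /* range end, in samples */

        /* whole frames in the utterance: 1 + floor((duration - win)/hop) */
        int n_frames = 1 + (t1 - t0 - win_samp) / hop_samp;

        /* first frame whose center, win_samp/2 + n*hop_samp, is >= s
           (adding hop_samp - 1 makes the integer division round up) */
        int n1 = (s - win_samp/2 + hop_samp - 1) / hop_samp;

        /* last frame whose center is < e */
        int n2 = (e - win_samp/2 - 1) / hop_samp;

        printf("%d frames; use frames %d..%d\n", n_frames, n1, n2);
        /* prints: 758 frames; use frames 55..154 */
        return 0;
    }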

Making frame indices correspond to predictable times in the utterance

Chuck's follow-up question was:

What if we just did something like what George suggested and considered the center of each frame to be the sample located at the index given by hop_size * sample_rate * frame_number, where hop_size is in seconds (e.g. 0.01), sample_rate is in samples per second (e.g. 8000), and frame_number is just the index of the frame? So, the 0th frame would always be centered at sample 0 (.01*8000*0), the first frame would always be centered at sample 80 (.01*8000*1), etc.

Dan's reply was:

I guess this question, what to do with incomplete windows at the edge of utterances, is as old as frame-based calculation itself - if not older! - and the reason that no obvious, well-known answer has arisen is that there is no single, good answer to the question of what values to use for the 'undefined' samples beyond the defined time limits.

But Chuck's point is that, if we're dealing with excerpts from long audio segments, like Broadcast News or Meetings or Switchboard, there is a good answer about what values to use for these samples - take the actual samples from the parent, longer audio waveform. No need to reflect samples using feacalc's -pad flag, nor to replicate feature vectors, nor to reflect entire feature vector frames. Just take the data.

This is the purpose behind feacalc's -rngstartoffset etc. options. It's a clumsy and difficult-to-understand mechanism, but it does the best thing in the circumstances. For instance, a typical BN feature calculation process might be:

bn_stm2list < ../../stm/h4e_97.stm > h4e_97.ranges
feacalc -ras no -plp 12 -dom cep -delta 0 -hpf -dither \
    -rangerate 1.0 -rngstartoffset -0.072 -rngendoffset 0.072 \
    -steptime 16.0 -windowtime 32.0 \
    -opformat pfile -filecmd "bn_file %u" \
    -list ./h4e_97.ranges \
    -out h4e_97-plp12.pf

The critical numbers are the rngoffsets of 0.072 sec = 72 ms. This comprises, at each end:

  - 8 ms = (32 ms - 16 ms)/2, the half-window-minus-half-step absorbed by whole-frame feature calculation, plus
  - 64 ms = 4 frames x 16 ms, the context frames the MLP classifier consumes on each side of the frame it is classifying.

What feacalc does internally, when -range?rate is nonzero, is to read the two time specifications after each filename in the list, add -rngstartoffset and -rngendoffset to the start and end respectively, divide the results by the -range?rate argument to convert them to seconds, and then extract the segment of the soundfile between those times to pass on to the subsequent feature calculation stage (which may or may not be using -pad, -zeropad etc.).
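
In other words (a sketch of the arithmetic just described, not feacalc's actual code):

    /* Turn the start/end fields of one -list line into the segment
       boundaries, in seconds, handed on to the feature calculation.
       For the command above: rangerate = 1.0, rngstartoffset = -0.072,
       rngendoffset = 0.072. */
    void range_to_seconds(double start, double end,
                          double rangerate,
                          double rngstartoffset, double rngendoffset,
                          double *seg_start, double *seg_end)
    {
        *seg_start = (start + rngstartoffset) / rangerate;
        *seg_end   = (end + rngendoffset) / rangerate;
    }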

The result of this was that when we ran our MLP classifiers on the features resulting from this calculation, the first frame of posteriors corresponded exactly to the first 16 ms of the segment-as-defined, which agreed with the results of the RNN models we were combining with.

If you're going to worry about feature windows, you might want to worry about MLP context windows at the same time. Typically, it's at the classifier output that you care about the temporal alignment of different feature calculation schemes. Unless it isn't.


