On Thu, Apr 12 2001, Chuck Wooters wrote:
Can someone tell me how to convert from a time (in seconds) to a frame index?
This is a good FAQ question. It's easy enough to get it wrong by one frame, and that messes us up when we try to combine results from different systems. In particular, you can get into trouble with large times represented as floats, losing resolution at the sample level (e.g. in long Broadcast News recordings).
It depends how the feature calculation is run, but mostly these days we run in the slightly wasteful "only complete frames" mode (i.e. rasta without -y, feacalc without -pad), in which case frame n (starting from zero) is based on the samples n*hop+[0:(win-1)] (also starting from zero, although I'm using Matlab syntax to describe the indices).
Thus the center of frame n is at sample win/2 + n*hop (or time (win/2 + n*hop)/samplerate).
Let's say we have win=25ms, hop=10ms and samplerate=8000Hz. Then the window is 200 samples and the hop is 80 samples. The first window (n=0) includes samples 0..199, the next one (n=1) is 80..279, etc. Samples 80..199 are used in both frames (the overlap), but 80..139 are 'closer' to frame n=0 and 140..199 are closer to frame n=1.
Thus, to convert a time t to a feature frame n:
    float t;                  /* time in seconds */
    float samplerate;         /* samples per second */
    int win_samp, hop_samp;   /* window and hop lengths, in samples */
    int t_samp, t_samp_adj, frame;

    t_samp = rint(t * samplerate);
    t_samp_adj = t_samp - (win_samp - hop_samp)/2;
    frame = t_samp_adj / hop_samp;

(win_samp - hop_samp is always even in practice, meaning I've never even thought about what the right answer would be if it wasn't.)
So, for example, if I have a pfile which contains frames of PLP data calculated from an utterance beginning at time index 56.4438 through time index 64.0459 (a duration of 7.6021 seconds). Using a window size of 25 msecs and a step of 10 msecs, there are 758 frames in the pfile. (Since the -pad option was not used, then according to the feacalc man page, the number of frames is 1+floor((fileduration - windowtime)/steptime)). Now my question is: how do I figure out which frames in the pfile correspond to the time span from, say, 57.0 through 58.0?
The 758 frames in the pfile are centered on times

    56.4438 + 0.025/2 + n*0.010

where n is the frame index, from zero (per the definition above, the center of frame n falls half a window, here 12.5 ms, after the frame's first sample). In most cases, the 'support' of a frame would be considered as its central time +/- 5 ms (for non-overlapping support), although of course it includes the samples for t_cent +/- 12.5 ms.

Thus, the first frame that is 'mostly' inside your range is the first one whose central time is >= 57.0:

    56.4438 + 0.0125 + n*0.010 >= 57.0
      ==> n1 >= (57.0 - 56.4438 - 0.0125) / 0.010 = 54.37
      ==> n1 = 55   (covering times 57.0063 +/- 5 ms)

and the last frame 'mostly inside' your range is the last one whose central time is < 58.0, hence

    n2 < (58.0 - 56.4438 - 0.0125) / 0.010 = 154.37  ==> n2 = 154

i.e. you'd use frames 55..154 (counting from zero in the pfile), which is 100 frames, which sounds right.
But if you wanted the calculations to be reproducible, you'd quantize all the times to sample counts and do the divisions on integers.
Chuck's follow-up question was:
What if we just did something like what George suggested and considered the center of each frame to be the sample located at the index given by (hop_size * sample_rate * frame_number), where hop_size is in seconds (e.g. 0.01), sample_rate is in samples per second (e.g. 8000), and frame_number is just the index of the frame? So frame 0 would always be centered at sample 0 (.01*8000*0), frame 1 would always be centered at sample 80 (.01*8000*1), etc.
Dan's reply was:
I guess this question, what to do with incomplete windows at the edge of utterances, is as old as frame-based calculation itself - if not older! - and the reason that no obvious, well-known answer has arisen is because there is no single, good answer to the question of what values to use for the 'undefined' samples beyond the defined time limits.
But Chuck's point is that, if we're dealing with excerpts from long audio segments, like Broadcast News or Meetings or Switchboard, there is a good answer about what values to use for these samples - take the actual samples from the parent, longer audio waveform. No need to reflect samples using feacalc's -pad flag, nor to replicate feature vectors, nor to reflect entire feature vector frames. Just take the data.
This is the purpose behind feacalc's -rngstartoffset etc. options. It's a clumsy and difficult-to-understand mechanism, but it does the best thing in the circumstances. For instance, a typical BN feature calculation process might be:
bn_stm2list < ../../stm/h4e_97.stm > h4e_97.ranges
which converts the stm segment-definition file like:
h4e_97 1 David_Brancaccio 0.117000 11.294563 THE RECORDS FOR THE ...
h4e_97 1 David_Brancaccio 11.294563 41.943063 FIRST RETAIL SALES ...
h4e_97 1 David_Brancaccio 41.943063 47.835500 JUST ONE MONTH SINCE ...
...
into a 'ranged list' file like:
h4e_97 0.117 11.294563
h4e_97 11.294563 41.943063
h4e_97 41.943063 47.8355
...
Then use this ranged list for the feature calculation:
feacalc -ras no -plp 12 -dom cep -delta 0 -hpf -dither \
    -rangerate 1.0 -rngstartoffset -0.072 -rngendoffset 0.072 \
    -steptime 16.0 -windowtime 32.0 \
    -opformat pfile -filecmd "bn_file %u" \
    -list ./h4e_97.ranges \
    -out h4e_97-plp12.pf
The critical numbers are the rng offsets of 0.072 sec = 72 ms. This comprises, at each end: 8 ms, i.e. (32 - 16)/2, half the excess of the analysis window over the step, so that a complete window is available for the first and last steps of the segment-as-defined; plus 64 ms, i.e. 4 frames of 16 ms, of input context window for the MLP classifier (see below).
What feacalc does internally, when -rangerate is nonzero, is: read the two time specifications that follow each filename in the list; add -rngstartoffset to the start and -rngendoffset to the end; divide the results by the -rangerate argument to convert them to seconds; then extract the segment of the soundfile between those times and pass it on to the subsequent feature calculation stage (which may or may not be using -pad, -zeropad etc.).
The result of this was that when we ran our MLP classifiers on the features resulting from this calculation, the first frame of posteriors corresponded exactly to the first 16 ms of the segment-as-defined, which agreed with the results of the RNN models we were combining with.
If you're going to worry about feature windows, you might want to worry about MLP context windows at the same time. Typically, it's at the classifier output that you care about the temporal alignment of different feature calculation schemes. Unless it isn't.
Previous: 4.1 How is the SNR of a speech example defined? - Next: 4.3 How can I simulate different acoustic conditions?
Back to ICSI Speech FAQ index
Generated by build-faq-index on Tue Mar 24 16:18:15 PDT 2009