Speech data used to train the acoustic models needs associated labels that provide the targets against which the models are trained. For the actual training step, these targets are enumerated, one per input feature frame, as described under target file formats. However, it is usually more natural to describe these aligned labels in terms of segments with start and end times. This is the kind of format produced by hand-alignment programs such as xwaves. It is also the format generated by forced-alignment programs such as dr_align_efsg.
These label alignment files are typically in ASCII format, with one line per segment. Each line specifies the start time of the segment, the end time (or duration), and the label symbol (phone, word, etc.) for that segment. There may be some other information in a header. Typically there is one label file for each utterance or waveform file, although some formats support multiple utterance alignments in a single file, forming a kind of label archive.
A single utterance may have several label files describing the alignments of different units. For the Switchboard Transcription Project, for example, we had alignments for phones, for syllables (consisting of up to 4 or 5 phones), and for words (some number of syllables). Of course, the boundaries of different units don't necessarily line up.
There are several different formats, each a slight variation on this basic scheme. They are:
|TIMIT||*.phn, *.wrd||Format used for the venerable TIMIT word and phone hand-alignments. No header, and start and end times are in integer sample counts (i.e. in units of 1/16000 sec or whatever the sampling rate is), although sometimes other time bases (such as milliseconds) are used. Used in /u/drspeech/data/timit/phnfile etc. Because each line specifies both start and end, it is possible for a labelling to include gaps or overlaps between labels.|
|xlabel||*.phn, *.wrd, *.syl||Native format for the xlabel tool of xwaves. We used this to collect hand-alignment data within the Switchboard Transcription Project (i.e. under /u/stp/data). Starts with a few ASCII header fields, then each line is <end time in sec> <color index> <label>. Note that the start time is implicitly the end of the preceding segment, or the beginning of the file for the first segment.|
|CTM||*.ctm||The NIST-defined format used in the DARPA/NIST evaluations, particularly the scoring program. Lines are like: 1 A 2.560 0.016 COMMISSION, i.e. utterance number, channel, start time in sec, duration in sec, then label. I think there is an optional confidence score tagged on the end. Because the utterance number is part of each line, multiple utterances (in order) can be held in a single file. noway, chronos and efsgd write this format.|
|MLF||*.mlf||Label archive file used with HTK. I don't know the format or much about it, but it holds alignment data for lots of utterances in one file. Learn more from the HTKBook.|
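The three line layouts described in the table for TIMIT, xlabel and CTM can each be read into a common list of (start, end, label) segments in seconds. The following is only a minimal sketch in Python, not the code used by any of the tools named here. The `Segment` type and the function names are hypothetical, the 16000 Hz default is just the example rate mentioned above, and the assumption that the xlabel header ends at a line containing only `#` should be checked against real files.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str


def parse_timit(lines: List[str], sample_rate: float = 16000.0) -> List[Segment]:
    """TIMIT-style lines: <start sample> <end sample> <label>."""
    segments = []
    for line in lines:
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        start, end = int(fields[0]), int(fields[1])
        segments.append(
            Segment(start / sample_rate, end / sample_rate, fields[2].strip())
        )
    return segments


def parse_xlabel(lines: List[str], header_end: str = "#") -> List[Segment]:
    """xlabel-style lines: <end time in sec> <color index> <label>.

    Assumption: the header is terminated by a line equal to `header_end`.
    """
    segments = []
    in_header = True
    prev_end = 0.0  # the first segment starts at the beginning of the file
    for line in lines:
        if in_header:
            if line.strip() == header_end:
                in_header = False
            continue
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        end = float(fields[0])
        # The start time is implicit: it is the end of the preceding segment.
        segments.append(Segment(prev_end, end, fields[2].strip()))
        prev_end = end
    return segments


def parse_ctm(lines: List[str]) -> Dict[str, List[Segment]]:
    """CTM-style lines: <utt> <channel> <start sec> <duration sec> <label> [...]."""
    by_utt: Dict[str, List[Segment]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        start, duration = float(fields[2]), float(fields[3])
        # Any trailing field (e.g. a confidence score) is ignored here.
        by_utt.setdefault(fields[0], []).append(
            Segment(start, start + duration, fields[4])
        )
    return by_utt
```

Because `parse_ctm` groups segments by utterance number, a single CTM file holding many utterances comes back as one segment list per utterance, whereas the other two parsers assume one utterance per file.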
Phone label files of these types can be converted into quantized, sampled target files via the slightly misnamed labels2pfile (misnamed because it also writes ilab files, which are actually more desirable). The reverse is performed by pfile2labels. Both programs support TIMIT, xlabel and CTM-style files, so you can convert between label file formats by running one after the other, via a temporary target labels file. (This won't work for word label files, since they can't be represented by the numerical index intermediate format of the sampled label files.)
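To make the idea of quantized, sampled targets concrete, here is a rough sketch in Python of turning segments into one label index per frame. It is not the labels2pfile implementation: the function name, the 10 ms frame step, the choice of sampling at each frame's centre, and the `fill` value for frames that fall in a gap are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def segments_to_frames(
    segments: List[Tuple[float, float, str]],
    label_index: Dict[str, int],
    frame_step: float = 0.010,  # assumed frame step in seconds
    fill: int = -1,             # assumed marker for frames not covered by any label
) -> List[int]:
    """Return one label index per frame, sampled at each frame's centre time."""
    if not segments:
        return []
    total = max(end for _, end, _ in segments)
    n_frames = int(round(total / frame_step))
    targets = [fill] * n_frames
    for i in range(n_frames):
        t = (i + 0.5) * frame_step  # centre of frame i
        for start, end, label in segments:
            if start <= t < end:
                targets[i] = label_index[label]
                break  # with overlapping labels, the first match wins
    return targets
```

For example, `segments_to_frames([(0.0, 0.03, "h#"), (0.03, 0.05, "sh")], {"h#": 0, "sh": 1})` gives `[0, 0, 0, 1, 1]`. The lookup through `label_index` is also why this route only suits a closed symbol set such as phones: an open word vocabulary has no fixed index to map to.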
Generated by build-faq-index on Tue Mar 24 16:18:15 PDT 2009