ICSI Speech FAQ:
3.12 What are the reference transcript data formats?

Answer by: dpwe - 2000-07-31


When we get a pile of speech training data, the bulk of the raw data consists of the waveforms of the individual utterances. But to train with these utterances, we also need to know the word transcript of each one (which we can then convert into training labels via forced alignment). Typically, organizations creating a new speech corpus will hire court reporters or other expert transcribers to type in the exact word sequence for each acoustic segment in the training set.

Thus the information in a reference transcript is, at its core, just a sequence of words that is associated with a particular waveform file. There is normally no timing information (beyond the implicit sequence), and the words should be canonicalized, without punctuation, etc. Sometimes a subset of nonspeech events such as voiced pauses or coughs is also transcribed; sometimes they are ignored.

In the very simplest cases, the word transcripts can be implicit in the file names. For example, TIDIGITS utterance waveforms have names like man/ae/52o82a.wav, from which we know the word transcription should be "five two oh eight two" without a separate transcript. (However, for the purposes of alignment and scoring, we would normally construct an explicit transcript file nonetheless.)
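
As a rough illustration of recovering a transcript from such a file name, here is a minimal Python sketch. The function name tidigits_transcript is made up for this example. The digit and "o" mappings follow the example above; the "z" = "zero" mapping and the treatment of the final character as a non-digit tag are assumptions about the TIDIGITS naming scheme that the text above does not spell out.

```python
import os

# Character-to-word map. The digits and "o" -> "oh" follow the example in the
# text; "z" -> "zero" is an assumption about the TIDIGITS naming scheme.
DIGIT_WORDS = {
    "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
    "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "o": "oh", "z": "zero",
}


def tidigits_transcript(path):
    """Return the word transcript implied by a TIDIGITS-style file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Assumption: the last character of the stem (here "a") is a tag that
    # distinguishes repeated productions, not a spoken digit.
    digit_chars = stem[:-1]
    return " ".join(DIGIT_WORDS[ch] for ch in digit_chars)


print(tidigits_transcript("man/ae/52o82a.wav"))  # five two oh eight two
```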

The simplest transcript file consists of the word sequences in ASCII, with the sequence for each separate utterance on a single line. The association between utterance and transcript may be made through a separate list file (listing the waveform file names or IDs in the same order as the transcripts), or the first token on each line may be the utterance ID. Thus, for NUMBERS95, the file /u/drspeech/data/NUMBERS95/list/numbers95-cs-train+cv-rand.wrdtrans contains lines like:

three one three
twenty eight two two seven
five three one eight eight

.. and the sibling file /u/drspeech/data/NUMBERS95/list/numbers95-cs-train+cv-rand.utids contains the corresponding utterance IDs, e.g.:

2486st
2667zi
4787zi
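
Pairing the two sibling files is just a matter of reading them in lockstep. The following Python sketch shows one way to do it; pair_transcripts is a hypothetical helper, and the sample lines are the excerpts above, assumed (for illustration only) to line up one-to-one. In real use the two arguments would be the opened .utids and .wrdtrans files.

```python
def pair_transcripts(utid_lines, trans_lines):
    """Pair utterance IDs with word transcripts by line position."""
    ids = [line.strip() for line in utid_lines]
    words = [line.split() for line in trans_lines]
    # The only link between the two files is line order, so a length
    # mismatch means the association cannot be trusted.
    if len(ids) != len(words):
        raise ValueError("ID list and transcript file differ in length")
    return dict(zip(ids, words))


# Sample lines taken from the excerpts above (assumed to correspond in order).
utids = ["2486st\n", "2667zi\n", "4787zi\n"]
trans = [
    "three one three\n",
    "twenty eight two two seven\n",
    "five three one eight eight\n",
]

table = pair_transcripts(utids, trans)
print(table["2667zi"])  # ['twenty', 'eight', 'two', 'two', 'seven']
```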

An alternative to separate sibling files is to put the utterance ID on the same line as the transcript, say as the first or last word, perhaps within parentheses to mark it clearly as distinct from the word transcription.
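
For the single-file variant, a line has to be split back into ID and words. The Python sketch below assumes one particular, hypothetical layout (the ID as the last token, in parentheses); the regular expression and the name split_inline_id are illustrative choices, not a format defined by any of the corpora mentioned here.

```python
import re

# Hypothetical layout: "<word> <word> ... (<utterance-id>)".
TRAILING_ID = re.compile(r"^(?P<words>.*?)\s*\((?P<utid>[^()\s]+)\)\s*$")


def split_inline_id(line):
    """Split a transcript line into (utterance ID, list of words)."""
    match = TRAILING_ID.match(line)
    if match is None:
        raise ValueError("no parenthesized utterance ID at end of line")
    return match.group("utid"), match.group("words").split()


print(split_inline_id("three one three (2486st)"))
# ('2486st', ['three', 'one', 'three'])
```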

More sophisticated tasks may use the DARPA/NIST "STM" file format (as used, for instance, by the scoring program sclite). This is particularly well suited to holding transcriptions of small chunks of larger files. After a header with some machine-readable format-describing comments, each line looks like:

h4e_97 1 Connie_Brod 1095.990875 1097.364813 <O,F1> WHO ARE YOU VOTING FOR

i.e. the source file ID, the channel number, a unique speaker ID, start and end times in seconds, optional condition indicators (defined in the header), and the word transcript. Thus multiple utterances can be defined within a single wav file. Broadcast News utterances can be quite long (minutes, or hundreds of words), so the individual lines in the STM files can sometimes overflow fixed-length buffers, e.g. in the non-GNU versions of tail and grep, as supplied with Solaris. Watch out for that.
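
The field layout just described can be picked apart with a few lines of Python. This is a sketch under stated assumptions, not a full STM reader: StmSegment and parse_stm_line are names invented here, header/comment lines are assumed to start with ";;", and the condition indicators are assumed to be a single whitespace-free token in angle brackets, as in the example line.

```python
from typing import List, NamedTuple, Optional


class StmSegment(NamedTuple):
    file_id: str
    channel: str
    speaker: str
    start: float           # seconds
    end: float             # seconds
    labels: Optional[str]  # e.g. "<O,F1>", or None if absent
    words: List[str]


def parse_stm_line(line):
    """Parse one STM line; return None for blank or comment lines."""
    line = line.strip()
    # Assumption: header/comment lines begin with ";;".
    if not line or line.startswith(";;"):
        return None
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("too few fields for an STM segment line")
    file_id, channel, speaker = fields[:3]
    start, end = float(fields[3]), float(fields[4])
    rest = fields[5:]
    labels = None
    # The condition indicators are optional; detect them by the brackets.
    if rest and rest[0].startswith("<") and rest[0].endswith(">"):
        labels, rest = rest[0], rest[1:]
    return StmSegment(file_id, channel, speaker, start, end, labels, rest)


seg = parse_stm_line(
    "h4e_97 1 Connie_Brod 1095.990875 1097.364813 <O,F1> WHO ARE YOU VOTING FOR"
)
print(seg.speaker, seg.labels, seg.words)
```

Reading the file line by line in a scripting language like this also sidesteps the fixed-length buffer problem mentioned above, since Python strings grow as needed.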



Generated by build-faq-index on Tue Mar 24 16:18:15 PDT 2009