ICSI Speech FAQ:
3.3 What are the feature data formats?

Answer by: dpwe - 2000-07-25

As explained before, the sound waveform files of a collection of speech recordings are converted into feature space as the first stage of the recognition process. This representation typically converts a window of 20-50 ms of speech (enough to straddle the pitch cycle, so features avoid pitch-cycle modulation) into a feature vector of 9 to 50 elements, repeated every 10-20 ms with overlapping windows.
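The framing described above can be made concrete with a small sketch. It assumes that only complete windows are kept (no padding at the end of the recording), which is a simplification; real front ends may handle the edges differently. The function names and the 25 ms / 10 ms defaults are illustrative choices within the ranges quoted above, not taken from any ICSI tool.

```python
def frame_count(duration_ms, window_ms=25.0, hop_ms=10.0):
    """Number of complete analysis windows that fit in a recording.

    Assumption: no padding, so a recording shorter than one window
    yields no frames at all.
    """
    if duration_ms < window_ms:
        return 0
    return int((duration_ms - window_ms) // hop_ms) + 1


def frame_spans(duration_ms, window_ms=25.0, hop_ms=10.0):
    """(start, end) times in ms of each window.

    Successive windows overlap whenever hop_ms < window_ms.
    """
    n = frame_count(duration_ms, window_ms, hop_ms)
    return [(i * hop_ms, i * hop_ms + window_ms) for i in range(n)]


if __name__ == "__main__":
    # One second of speech, 25 ms windows every 10 ms.
    print(frame_count(1000))
    print(frame_spans(45))
```

With these settings each feature vector summarizes 25 ms of signal, and adjacent windows share 15 ms of it.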

Since 10 ms of 16-bit waveform sampled at 16 kHz constitutes 320 bytes, and 25 32-bit floating point feature values correspond to 100 bytes, feature vector representations occupy, very roughly, a third to a half of the disk space that the original waveform files consume.
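The size comparison above is simple enough to check directly. The sketch below just reproduces the arithmetic; the function names are made up for illustration.

```python
def waveform_bytes(duration_ms, sample_rate_hz=16000, bits_per_sample=16):
    """Bytes of raw PCM waveform covering duration_ms of audio."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * bits_per_sample // 8


def feature_bytes(n_values, bytes_per_value=4):
    """Bytes for one feature vector of 32-bit floats."""
    return n_values * bytes_per_value


if __name__ == "__main__":
    wav = waveform_bytes(10)   # 10 ms at 16 kHz, 16 bit
    ftr = feature_bytes(25)    # one 25-element vector per 10 ms step
    print(wav, ftr, ftr / wav)
```

The ratio depends on the vector size and the frame step: larger vectors push it up, and a longer step between frames pushes it down.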

Because speech training corpora typically consist of a large number of relatively short segments, which are all used in sequence in the training process, there are good reasons for storing all the features for a given training corpus in a single file - both for convenience in managing it on disk (one file rather than thousands), and to avoid the overhead of continually closing and opening files during the training process. These files are known as 'feature archives'. Of course, a feature archive file format can contain just a single utterance.
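To illustrate the idea of a feature archive (many utterances in one file, with enough bookkeeping to find each one), here is a toy sketch. The layout is entirely hypothetical: little-endian 32-bit floats packed back to back, with the index kept separately in memory. It is not the pfile format or any other format described on this page.

```python
import io
import struct


def write_archive(utterances, width):
    """Pack all utterances into one byte string.

    Returns (data, index), where index holds a (first_frame, n_frames)
    pair per utterance. Hypothetical layout, for illustration only.
    """
    buf = io.BytesIO()
    index = []
    first = 0
    for frames in utterances:
        for frame in frames:
            if len(frame) != width:
                raise ValueError("all frames must have the same width")
            buf.write(struct.pack(f"<{width}f", *frame))
        index.append((first, len(frames)))
        first += len(frames)
    return buf.getvalue(), index


def read_utterance(data, index, width, utt):
    """Recover the frames of utterance number utt from the packed data."""
    first, n_frames = index[utt]
    frame_bytes = 4 * width  # 32-bit floats
    frames = []
    for i in range(first, first + n_frames):
        chunk = data[i * frame_bytes:(i + 1) * frame_bytes]
        frames.append(list(struct.unpack(f"<{width}f", chunk)))
    return frames


if __name__ == "__main__":
    utts = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
    data, index = write_archive(utts, width=2)
    print(len(data), index, read_utterance(data, index, 2, 1))
```

The point of the index is that a training program can step through the utterances in sequence, or jump to any one of them, without opening a separate file each time.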

Because the training process typically also requires a class label ('target') for each frame, some of the formats also support storing one or more labels alongside each frame. However, since the targets may change without the features changing (for example, after a realignment in 'embedded training'), storing labels in the same file as the features is now deprecated. For historical reasons, the pfile and pre formats still include provision for per-frame labels.

Although this page allegedly concerns feature formats, it turns out that the outputs of the acoustic model classifier, which are just fixed-length vectors of probabilities, one for each feature frame, present essentially the same file format requirements, and thus can be held in any of these file types. A couple of the entries below (e.g. lna and rapbin) are more commonly associated with classifier output probabilities than with features (in fact, lna files are exclusively for posterior probabilities).

The major feature file formats are listed below. These are simply file formats - the actual features contained within them could be anything, as long as they are regularly-sampled, fixed-size numerical vectors. Almost all of these formats can be read and written by ICSI tools, the most significant being feacat.

Pfile (*.pfile, *.pf): Homegrown ICSI format with a long history, this is the 'native' format for qnstrn etc., and the feature archive format most commonly encountered at ICSI. It's a slightly gnarly format, with a 32kB header at the front, and an optional index table at the end, and all current programs read and write it via the libquicknet QN_{In,Out}FtrLabStream_PFile classes. See this man page for the pfile format.

LNA (*.lna): A compact format from Cambridge, designed specifically to hold posterior probability values with one byte per value. The probabilities are quantized on a log scale, so representation of the very small posteriors often seen is still quite accurate. Of course, values outside the range [0,1] cannot be represented. As is typical of Cambridge file formats, there is no header and no index: the only way to find out how many utterances are present is to scan through the entire file, counting the end-of-segment flags. See this man page for the lna format.

Cambridge PRE (*.pre): Another headerless Cambridge format, this time for training corpora, so that features are stored generically as floats, and each frame includes a slot for a 7-bit target label. Not currently used at ICSI. See this documentation page from the Cambridge/Softsound Abbot package on Acoustic Vector Formats.

Rapbin (*.rapbin): The native format for posterior probability outputs from QuickNet, basically a binary stream of 32-bit floats. Used to exist in some non-binary formats, but we don't talk about those any more. See the man page for the rapbin format.

Online feature (*.olf): A format specially developed for real-time demos, where the feature calculation stage passes its results to the acoustic classifier over a pipe, and thus cannot know ahead of time how long a particular stream will be. You could use a headerless format like Pre in this case, but olftrs have a rudimentary header specifying their feature vector size, which is helpful. See the man page for the online_ftrs format.

HTK (*.htk): The format used by the popular HTK package, rather less common within ICSI, but often important when we wish to share data with other labs. HTK files have a short header, and typically one file per utterance (rather than the monolithic feature archive files preferred at ICSI). HTK files have a couple of options and contain flags that can describe the feature format (i.e. cepstra, including deltas etc.). See this description of parameter file formats from the local copy of the HTKBook documentation.

UNI_IO (*.uni): A format defined and used by DaimlerChrysler, our partners in the European RESPITE project. Not of much interest otherwise.

Ascii (*.asc): Feacat etc. support input and output of ASCII-format files. In order to indicate the breaks between utterances, and as a minimal sanity check, the first two columns of ASCII representations are the utterance number and the frame number, which must be in order.
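
As an illustration of the ASCII layout just described, the following sketch reads such a file, groups frames by utterance, and enforces the ordering check. It assumes whitespace-separated columns with the feature values following the two leading integer columns, and it interprets "in order" as non-decreasing utterance numbers with strictly increasing frame numbers inside each utterance. Both are assumptions made for this example, not a statement of what feacat itself enforces, and the function name is made up.

```python
def parse_ascii_features(lines):
    """Group frames by utterance number, checking that rows arrive in order.

    Assumed row layout: utterance number, frame number, then the feature
    values, all separated by whitespace.
    """
    utterances = {}
    prev_utt = prev_frame = None
    width = None
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: need utterance, frame and values")
        utt, frame = int(fields[0]), int(fields[1])
        values = [float(x) for x in fields[2:]]
        if width is None:
            width = len(values)
        elif len(values) != width:
            raise ValueError(f"line {line_no}: vector size changed")
        if prev_utt is not None:
            if utt < prev_utt:
                raise ValueError(f"line {line_no}: utterance out of order")
            if utt == prev_utt and frame <= prev_frame:
                raise ValueError(f"line {line_no}: frame out of order")
        utterances.setdefault(utt, []).append(values)
        prev_utt, prev_frame = utt, frame
    return utterances


if __name__ == "__main__":
    sample = ["0 0 1.0 2.0", "0 1 1.5 2.5", "1 0 3.0 4.0"]
    print(parse_ascii_features(sample))
```

Passing an open file object works as well as a list of strings, since the function only iterates over lines.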


Generated by build-faq-index on Tue Mar 24 16:18:14 PDT 2009