ICSI Speech FAQ:
3.10 What are the grammar data formats?

Answer by: fosler - 2000-08-10


What are grammars for?

Similar to the dictionary formats, there is a division between grammar formats for y0 and noway/chronos. To further complicate matters, there is a community standard format (commonly called ARPA n-grams) that can be used as a common interchange format. Usually, a grammar-producing toolkit (such as the SRILM toolkit) will produce ARPA n-grams, which then must be converted into a usable grammar format for the particular decoder.

In particular, all of our decoders use some form of n-gram grammar. Since it is (usually) impossible to estimate a probability for every possible n-gram, a backoff strategy must be applied: to calculate the probability of a missing n-gram, the probability of the corresponding (n-1)-gram is multiplied by a backoff weight. There are different methods for calculating these backoff weights; see the FAQ entry "How do I build an n-gram grammar for noway or chronos?" for details on how to do this with the SRILM toolkit. Or better yet, ask Andreas. ;-)

To calculate the probability of a missing n-gram, the CMU-Cambridge toolkit provides (in its comments) the following recipe, which is copied verbatim:

p(wd3|wd1,wd2)= if(trigram exists)           p_3(wd1,wd2,wd3)
                else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
                else                         p(wd3|w2)

p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
            else              bo_wt_1(wd1)*p_1(wd2)
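
The recipe above can be sketched in a few lines of Python. This is only an illustration, not code from any of the toolkits: the dictionary layout (`p1`, `p2`, `p3`, `bo1`, `bo2`) is made up for the example, all values are assumed to be log10 numbers (so the multiplication in the recipe becomes an addition), and a missing backoff weight is assumed to be 0, i.e. log10(1).

```python
def bigram_logprob(lm, w1, w2):
    """log10 p(w2 | w1), following the bigram half of the recipe."""
    if (w1, w2) in lm["p2"]:
        return lm["p2"][(w1, w2)]
    # Back off: bo_wt_1(w1) * p_1(w2) is a sum in the log domain.
    # Assumption: a missing backoff weight counts as 0 (= log10 of 1).
    return lm["bo1"].get(w1, 0.0) + lm["p1"][w2]


def trigram_logprob(lm, w1, w2, w3):
    """log10 p(w3 | w1, w2), following the trigram half of the recipe."""
    if (w1, w2, w3) in lm["p3"]:
        return lm["p3"][(w1, w2, w3)]
    if (w1, w2) in lm["p2"]:
        # The bigram context exists, so its backoff weight applies.
        return lm["bo2"].get((w1, w2), 0.0) + bigram_logprob(lm, w2, w3)
    # No such context: fall straight through to the bigram estimate.
    return bigram_logprob(lm, w2, w3)


# Hypothetical toy model, values in log10.
toy_lm = {
    "p1": {"a": -1.0, "b": -0.5},
    "bo1": {"a": -0.3},
    "p2": {("a", "b"): -0.2},
    "bo2": {("a", "b"): -0.1},
    "p3": {},
}
```

For instance, `trigram_logprob(toy_lm, "a", "b", "a")` finds no trigram, finds the bigram context ("a", "b"), and so adds its backoff weight to the backed-off bigram estimate for ("b", "a").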

ARPA n-gram format: This is the standard ASCII format used by many decoders for n-grams; it was introduced by Doug Paul. All probabilities and backoff weights are given in log10 form. The first part of the file is just header comments and is ignored by the decoder. The beginning of the LM data is marked by the 6-byte sequence

\data\
on a line by itself. The following lines give the number of distinct n-grams of each order, e.g.:

\data\
ngram 1=500
ngram 2=1250
ngram 3=2500

The data that follow list the 1-grams, 2-grams, etc., one section per order. Here is an example for a trigram grammar:

\1-grams:
log10_prob(word1) word1 log10_backoff(word1)
log10_prob(word2) word2 log10_backoff(word2)
...
\2-grams:
log10_prob(word1|word1) word1 word1 log10_backoff(word1,word1)
log10_prob(word1|word2) word2 word1 log10_backoff(word2,word1)
log10_prob(word2|word1) word1 word2 log10_backoff(word1,word2)
...
\3-grams:
log10_prob(word3|word1,word2) word1 word2 word3
...
\end\

Note that the data are terminated by the

\end\
marker. All log probabilities in the file should be 0 or less (they are log probabilities, not neg-log probabilities). Backoff weights are not probabilities, so their log10 values can occasionally be positive. Also, backoff weights for the highest-order n-grams (as opposed to the (n-1)-grams, etc.) in an n-gram grammar are not required, since they are never used.
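
To make the layout concrete, here is a minimal Python sketch of a reader for the format as described above. It is not the parser used by any of our decoders; the function name `parse_arpa` and the returned data structures are invented for this example, and it assumes whitespace-separated fields with an optional trailing backoff weight.

```python
import re


def parse_arpa(lines):
    """Parse ARPA-format lines into (counts, ngrams).

    counts maps order -> number of n-grams declared in the header;
    ngrams maps order -> {word tuple: (log10 prob, log10 backoff or None)}.
    """
    counts, ngrams = {}, {}
    order = None       # current \N-grams: section, None before the first one
    in_data = False    # everything before \data\ is header comment
    for raw in lines:
        line = raw.strip()
        if not in_data:
            in_data = (line == "\\data\\")
            continue
        if line == "\\end\\":
            break
        if not line:
            continue
        m = re.fullmatch(r"ngram (\d+)\s*=\s*(\d+)", line)
        if m:
            counts[int(m.group(1))] = int(m.group(2))
            continue
        m = re.fullmatch(r"\\(\d+)-grams:", line)
        if m:
            order = int(m.group(1))
            ngrams[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight, if present, is the field after the last word.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (prob, backoff)
    return counts, ngrams
```

Called as `parse_arpa(open(path))` (with `path` being whatever ARPA file you have), it returns the header counts and one dictionary per n-gram order, which makes it easy to check that the declared counts match the data.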

One issue to be aware of is that the SRILM toolkit (i.e. ngram-count) sometimes leaves off the backoff weights. Noway, however, gets confused by this behavior, so you need to fill in a fake backoff weight (0 is a good choice). The program add-dummy-bows (installed under /u/drspeech/src/srilm) adds dummy backoff weights.
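
If add-dummy-bows is not at hand, the idea can be sketched in Python as below. This is not the add-dummy-bows program, and I have not checked it against that program's behavior. It is a rough illustration that assumes whitespace-separated fields and, by default, pads only the orders below the highest one declared in the header (the `pad_highest` switch is an invention for this example, in case your decoder wants a weight on every line).

```python
import re


def add_dummy_backoffs(lines, pad_highest=False):
    """Return ARPA lines with a log10 backoff weight of 0 added where missing."""
    lines = [line.rstrip("\n") for line in lines]
    # Find the highest order declared in the header ("ngram 3=2500" -> 3).
    top = 0
    for line in lines:
        m = re.fullmatch(r"ngram (\d+)\s*=\s*\d+", line.strip())
        if m:
            top = max(top, int(m.group(1)))
    out, order = [], None
    for line in lines:
        stripped = line.strip()
        m = re.fullmatch(r"\\(\d+)-grams:", stripped)
        if m:
            order = int(m.group(1))
        elif stripped == "\\end\\":
            order = None
        elif order is not None and stripped:
            fields = stripped.split()
            # One probability plus `order` words means no backoff weight.
            if len(fields) == order + 1 and (pad_highest or order < top):
                line = line + "\t0"
        out.append(line)
    return out
```

A weight of 0 in log10 corresponds to multiplying by 1, so the padded entries leave backed-off probabilities unchanged.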

Noway: The noway decoder understands ARPA format (given the caveat mentioned above), but it prefers to use its own binary format (i.e., it runs faster). To compile an ARPA n-gram into the noway binary format, execute the following:

noway -ngram [arpalm] -no_decode -write_lm [nowaylm]
You can tell a noway-style ngram by its "magic number" -- the first four bytes of the file. A noway ngram starts with the magic number "NGx\n", where x is equal to n (e.g., NG2 is a bigram).

Note that there was an older format for bigrams and trigrams in noway, with magic numbers of "BI1\n", "BI2\n", "TR1\n", "TR2\n", "1IB\n", "2RT\n", etc. If you can't get an older bigram grammar to work with -ngram [file], then you should try -bigram [file] instead.
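
A quick way to check which kind of file you have is to look at the first four bytes yourself. The Python sketch below does that; it assumes the x in "NGx\n" is a single ASCII digit, and the set of older magic numbers contains only the ones quoted above (that list ends in "etc.", so the set is incomplete). The function names are invented for this example.

```python
import re

# Older noway magic numbers quoted in this FAQ entry; known to be incomplete.
OLD_MAGICS = {b"BI1\n", b"BI2\n", b"TR1\n", b"TR2\n", b"1IB\n", b"2RT\n"}


def classify_magic(first_four):
    """Classify the first four bytes of a grammar file.

    Returns ("ngram", n), ("old", None) or ("unknown", None).
    """
    # Assumption: the order is stored as one ASCII digit, as in "NG2\n".
    m = re.fullmatch(rb"NG(\d)\n", first_four)
    if m:
        return ("ngram", int(m.group(1)))
    if first_four in OLD_MAGICS:
        return ("old", None)
    return ("unknown", None)


def classify_file(path):
    """Read the magic number from a file and classify it."""
    with open(path, "rb") as f:
        return classify_magic(f.read(4))
```

An "unknown" result usually just means the file is something else entirely, for example an ARPA text file whose first bytes are header comments.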

Chronos: in theory, chronos understands a number of n-gram formats, including the Cambridge-CMU binary format, but it works best (in my opinion) with noway binary trigrams (n-gram format). Note: we, as a group, have had much trouble trying to get bigrams to work with chronos. Much respect and honor goes to the knight who slays this dragon. Also important to note is that chronos only has one combined start and end symbol (typically <s>) so that it can work on multi-sentence utterances. This means that you have to convert all </s> symbols in a grammar produced by SRILM to that combined symbol.

Y0: only supports bigrams and wordpair grammars. I'm not going to describe wordpair grammars here (see the man page for y0), but the bigram format is similar to the ARPA format. The main difference is that the log probs are in base e (not in base 10). From the man page:

          The format of the file is:

          >wordx wordx-unigram wordx-backoff-weight
          word-following-wordx bigram-probability
          word-following-wordx bigram-probability
          word-following-wordx bigram-probability
           ...
          >wordy wordy-unigram wordy-backoff-weight
          word-following-wordy bigram-probability
          word-following-wordy bigram-probability
          word-following-wordy bigram-probability
           ...

          Note that there is a > before wordx and wordy. The
          spelling of (almost) all of the words in this file must
          match the spelling found in the lexicon file (see the
          -l option above).

          Note that the words <s> and </s> should appear in the
          bigram file and not in the lexicon. The <s> represents
          the (null-) model for sentence start and </s> the
          (null-) model for sentence end. These are used to
          specify the language model for a word starting or
          ending a sentence.
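
Going from ARPA-style log10 values to the base-e values y0 expects is just a multiplication by ln(10). The Python sketch below shows the conversion and lays the result out in the shape quoted from the man page. It is not a tested converter: it assumes every numeric field in the y0 file (unigram, backoff weight, bigram probability) is a natural-log value, and the function names, input dictionaries, and six-decimal formatting are all choices made for this example.

```python
import math

LN10 = math.log(10.0)


def log10_to_ln(x):
    """Convert a log10 value to a natural-log value."""
    return x * LN10


def y0_bigram_lines(unigrams, bigrams):
    """Lay out a bigram grammar in the y0 shape quoted above.

    unigrams: {word: (log10 prob, log10 backoff weight)}
    bigrams:  {(word1, word2): log10 prob of word2 following word1}
    """
    followers = {}
    for (w1, w2), lp in bigrams.items():
        followers.setdefault(w1, []).append((w2, lp))
    lines = []
    for word, (lp, bo) in unigrams.items():
        # ">wordx wordx-unigram wordx-backoff-weight"
        lines.append(">%s %.6f %.6f" % (word, log10_to_ln(lp), log10_to_ln(bo)))
        for w2, lp2 in followers.get(word, []):
            # "word-following-wordx bigram-probability"
            lines.append("%s %.6f" % (w2, log10_to_ln(lp2)))
    return lines
```

Check the y0 man page for the exact number format before relying on output like this.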



Generated by build-faq-index on Tue Mar 24 16:18:15 PDT 2009