
Introduction

 

Context-free grammars are widely used as models of natural language syntax. In their probabilistic version, which defines a language as a probability distribution over strings, they have been used in a variety of applications: for the selection of parses for ambiguous inputs [Fujisaki et al.1991]; to guide the rule choice efficiently during parsing [Jones and Eisner1992]; to compute island probabilities for non-linear parsing [Corazza et al.1991]. In speech recognition, probabilistic context-free grammars play a central role in integrating low-level word models with higher-level language models [Ney1992], as well as in non-finite state acoustic and phonotactic modeling [Lari and Young1991]. In some work, context-free grammars are combined with scoring functions that are not strictly probabilistic [Nakagawa1987], or they are used with context-sensitive and/or semantic probabilities [Magerman and Marcus1991, Magerman and Weir1992, Jones and Eisner1992, Briscoe and Carroll1993].

Although clearly not a perfect model of natural language, stochastic context-free grammars (SCFGs) are superior to non-probabilistic CFGs, with probability theory providing a sound theoretical basis for ranking and pruning of parses, as well as for integration with models for non-syntactic aspects of language. All of the applications listed above involve (or could potentially make use of) one or more of the following standard tasks, compiled by Jelinek and Lafferty (1991).

1. What is the probability that a given string x is generated by a grammar G?
2. What is the single most likely parse (or derivation) for x?
3. What is the probability that x occurs as a prefix of some string generated by G (the prefix probability of x)?
4. How should the parameters (e.g., rule probabilities) in G be chosen to maximize the probability over a training set of strings?

The algorithm described in this article can compute solutions to all four of these problems in a single framework, with a number of additional advantages over previously presented isolated solutions.
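To make these tasks concrete, consider the following minimal sketch (a hypothetical toy grammar and a naive brute-force enumeration, not the algorithm developed in this article). Because the toy grammar is acyclic, all complete derivations can be enumerated and their probabilities summed, which answers problems (1) and (3) directly; a recursive grammar would instead require the dynamic-programming methods discussed below.

    from itertools import product

    # A tiny, acyclic SCFG (purely illustrative).  Each nonterminal maps to a
    # list of (right-hand side, probability) pairs; the probabilities of all
    # rules with the same left-hand side sum to one.
    GRAMMAR = {
        "S":   [(("NP", "VP"), 1.0)],
        "NP":  [(("Det", "N"), 1.0)],
        "VP":  [(("V", "NP"), 0.5), (("V",), 0.5)],
        "Det": [(("the",), 0.5), (("a",), 0.5)],
        "N":   [(("dog",), 0.5), (("cat",), 0.5)],
        "V":   [(("saw",), 0.5), (("barked",), 0.5)],
    }

    def derivations(symbol):
        """Yield (terminal string, probability) for every complete derivation
        from `symbol`.  Terminates only because the toy grammar is acyclic."""
        if symbol not in GRAMMAR:                      # terminal symbol
            yield (symbol,), 1.0
            return
        for rhs, rule_prob in GRAMMAR[symbol]:
            # Expand each right-hand side symbol and combine the alternatives.
            for parts in product(*(list(derivations(s)) for s in rhs)):
                string = tuple(w for words, _ in parts for w in words)
                prob = rule_prob
                for _, p in parts:
                    prob *= p
                yield string, prob

    def string_probability(words):
        """Problem (1): total probability that the grammar generates `words`."""
        return sum(p for s, p in derivations("S") if s == tuple(words))

    def prefix_probability(words):
        """Problem (3): probability that a generated string starts with `words`."""
        w = tuple(words)
        return sum(p for s, p in derivations("S") if s[:len(w)] == w)

    print(string_probability("the dog barked".split()))   # 0.0625
    print(prefix_probability("the dog".split()))          # 0.25

Problem (2) would select the single derivation of maximal probability instead of summing, and problem (4) adjusts the rule probabilities themselves rather than evaluating them.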

Most probabilistic parsers are based on a generalization of bottom-up chart parsing, such as the CYK algorithm. Partial parses are assembled just as in non-probabilistic parsing (modulo possible pruning based on probabilities), while substring probabilities (also known as ``inside'' probabilities) can be computed in a straightforward way. Thus, the CYK chart parser underlies the standard solutions to problems (1) and (4) [Baker1979], as well as (2) [Jelinek1985]. While the solution of Jelinek and Lafferty (1991) to problem (3) is not a direct extension of CYK parsing, they nevertheless present their algorithm in terms of its similarities to the computation of inside probabilities.
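As an illustration of the bottom-up approach (again with a hypothetical CNF toy grammar, not taken from the paper), the following sketch fills a CYK-style chart with inside probabilities; replacing the sums over rules and split points by maximizations would yield the most likely parse of problem (2).

    from collections import defaultdict

    # Hypothetical CNF grammar: binary rules A -> B C and lexical rules A -> w.
    BINARY = {
        "S":  [("NP", "VP", 1.0)],
        "NP": [("Det", "N", 1.0)],
        "VP": [("V", "NP", 1.0)],
    }
    LEXICAL = {
        "Det": [("the", 0.5), ("a", 0.5)],
        "N":   [("dog", 0.5), ("cat", 0.5)],
        "V":   [("saw", 1.0)],
    }

    def inside_probabilities(words):
        """chart[i, j][A] = P(A derives words[i:j]), filled bottom-up
        by increasing span length (the ``inside'' probabilities)."""
        n = len(words)
        chart = defaultdict(lambda: defaultdict(float))
        for i, w in enumerate(words):                     # spans of length 1
            for A, rules in LEXICAL.items():
                for terminal, p in rules:
                    if terminal == w:
                        chart[i, i + 1][A] += p
        for length in range(2, n + 1):                    # longer spans
            for i in range(n - length + 1):
                j = i + length
                for A, rules in BINARY.items():
                    for B, C, p in rules:
                        for k in range(i + 1, j):         # split point
                            chart[i, j][A] += p * chart[i, k][B] * chart[k, j][C]
        return chart

    chart = inside_probabilities("the dog saw a cat".split())
    print(chart[0, 5]["S"])     # probability of the whole sentence: 0.0625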

In our algorithm, computations for tasks (1) and (3) proceed incrementally, as the parser scans its input from left to right; in particular, prefix probabilities are available as soon as the prefix has been seen, and are updated incrementally as it is extended. Tasks (2) and (4) require one more (reverse) pass over the chart constructed from the input.

Incremental, left-to-right computation of prefix probabilities is particularly important since that is a necessary condition for using SCFGs as a replacement for finite-state language models in many applications, such as speech decoding. As pointed out by Jelinek and Lafferty (1991), knowing probabilities $P(x_0 \ldots x_i)$ for arbitrary prefixes $x_0 \ldots x_i$ enables probabilistic prediction of possible follow-words $x_{i+1}$, as $P(x_{i+1} \mid x_0 \ldots x_i) = P(x_0 \ldots x_i x_{i+1}) / P(x_0 \ldots x_i)$. These conditional probabilities can then be used as word transition probabilities in a Viterbi-style decoder or to incrementally compute the cost function for a stack decoder [Bahl et al.1983].
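For instance, reusing the brute-force prefix_probability sketch above (illustrative only; the article computes these quantities incrementally with a chart rather than by enumeration), the conditional probability of a follow-word is simply a ratio of two prefix probabilities:

    def follow_word_probability(prefix, word):
        """P(word | prefix) = P(prefix followed by word ...) / P(prefix ...)."""
        return prefix_probability(list(prefix) + [word]) / prefix_probability(prefix)

    print(follow_word_probability("the dog".split(), "barked"))   # 0.5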

Another application where prefix probabilities play a central role is the extraction of n-gram probabilities from SCFGs [Stolcke and Segal1994]. Here, too, efficient incremental computation saves time since the work for common prefix strings can be shared.

The key to most of the features of our algorithm is that it is based on the top-down parsing method for non-probabilistic CFGs developed by Earley (1970). Earley's algorithm is appealing because it runs with best-known complexity on a number of special classes of grammars. In particular, Earley parsing is more efficient than the bottom-up methods in cases where top-down prediction can rule out potential parses of substrings. The worst-case computational expense of the algorithm (either for the complete input, or incrementally for each new word) is as good as that of the other known specialized algorithms, but can be substantially better on well-known grammar classes.
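For readers unfamiliar with Earley's method, the following sketch of a plain (non-probabilistic) Earley recognizer, using the earlier toy grammar stripped of probabilities, shows the three operations (prediction, scanning, completion) on which the probabilistic algorithm of this article builds; the names and grammar format are hypothetical.

    # Minimal non-probabilistic Earley recognizer (illustrative sketch only).
    RULES = {
        "S":   [("NP", "VP")],
        "NP":  [("Det", "N")],
        "VP":  [("V", "NP"), ("V",)],
        "Det": [("the",), ("a",)],
        "N":   [("dog",), ("cat",)],
        "V":   [("saw",), ("barked",)],
    }

    def earley_recognize(words, start="S"):
        """Return True iff `words` is generated by RULES.  A state is
        (lhs, rhs, dot, origin); chart[k] holds states whose dot is at
        input position k.  (Epsilon rules are not handled in this sketch.)"""
        n = len(words)
        chart = [set() for _ in range(n + 1)]
        for rhs in RULES[start]:
            chart[0].add((start, rhs, 0, 0))
        for k in range(n + 1):
            agenda = list(chart[k])
            while agenda:
                lhs, rhs, dot, origin = agenda.pop()
                if dot < len(rhs) and rhs[dot] in RULES:           # prediction
                    for new_rhs in RULES[rhs[dot]]:
                        state = (rhs[dot], new_rhs, 0, k)
                        if state not in chart[k]:
                            chart[k].add(state)
                            agenda.append(state)
                elif dot < len(rhs):                               # scanning
                    if k < n and words[k] == rhs[dot]:
                        chart[k + 1].add((lhs, rhs, dot + 1, origin))
                else:                                              # completion
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            state = (l2, r2, d2 + 1, o2)
                            if state not in chart[k]:
                                chart[k].add(state)
                                agenda.append(state)
        return any(l == start and d == len(r) and o == 0
                   for l, r, d, o in chart[n])

    print(earley_recognize("the dog barked".split()))   # True
    print(earley_recognize("dog the barked".split()))   # False

Top-down prediction is what keeps the chart small: only rules reachable from the start symbol at the current input position are ever entered as states.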

Earley's parser (and hence ours) also deals with any context-free rule format in a seamless way, without requiring conversions to Chomsky Normal Form (CNF), as is often assumed. Another advantage is that our probabilistic Earley parser has been extended to take advantage of partially bracketed input, and to return partial parses on ungrammatical input. The latter extension removes one of the common objections against top-down, predictive (as opposed to bottom-up) parsing approaches [Magerman and Weir1992].


