Next: Computing expected production counts Up: Extensions Previous: Viterbi parses

Rule probability estimation

 

The rule probabilities in a SCFG can be iteratively estimated using the EM (Expectation-Maximization) algorithm [Dempster et al.1977]. Given a sample corpus D, the estimation procedure finds a set of parameters that represent a local maximum of the grammar likelihood function P(D | G), which is given by the product of the string probabilities

P(D | G) = \prod_{x \in D} P(x | G)

i.e., the samples are assumed to be independently and identically distributed.
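Because of the independence assumption, the corpus log-likelihood is simply a sum of per-string log probabilities. A minimal sketch (the string probabilities P(x | G) would come from the parser's forward pass; the values below are made up for illustration):

```python
import math

def corpus_log_likelihood(string_probs):
    """log P(D|G) = sum over x in D of log P(x|G), assuming i.i.d. samples."""
    return sum(math.log(p) for p in string_probs)

# Hypothetical string probabilities P(x|G) for a three-sentence corpus.
probs = [0.5, 0.25, 0.125]
ll = corpus_log_likelihood(probs)  # equals log(0.5 * 0.25 * 0.125)
```

Working in log space avoids the numerical underflow that the raw product would suffer on realistic corpora.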

The two steps of this algorithm can be briefly characterized as follows.

E-step:
Compute expectations for how often each grammar rule is used, given the corpus D and the current grammar parameters (rule probabilities).
M-step:
Reset the parameters so as to maximize the likelihood relative to the expected rule counts found in the E-step.

This procedure is iterated until the parameter values (as well as the likelihood) converge. It can be shown that each iteration of the algorithm produces a likelihood that is at least as high as the previous one; the EM algorithm is therefore guaranteed to find at least a local maximum of the likelihood function.
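The overall iteration can be sketched as follows. The E-step is left abstract here (a callable returning expected rule counts and the corpus log-likelihood); all names are ours, not the paper's:

```python
def em_estimate(probs, e_step, max_iters=100, tol=1e-8):
    """Iterate E- and M-steps until the likelihood converges.

    probs:  dict mapping rules (lhs, rhs) to probabilities
    e_step: callable probs -> (expected_counts, log_likelihood)
    """
    prev_ll = float("-inf")
    for _ in range(max_iters):
        counts, ll = e_step(probs)           # E-step: expected rule counts
        totals = {}                          # M-step: renormalize per LHS
        for (lhs, _rhs), c in counts.items():
            totals[lhs] = totals.get(lhs, 0.0) + c
        probs = {rule: c / totals[rule[0]] for rule, c in counts.items()}
        if ll - prev_ll < tol:               # likelihood never decreases
            break
        prev_ll = ll
    return probs

# Toy E-step with fixed expected counts (a real one parses the corpus).
fixed = {("S", ("a",)): 3.0, ("S", ("b",)): 1.0}
est = em_estimate({("S", ("a",)): 0.5, ("S", ("b",)): 0.5},
                  lambda p: (fixed, 0.0))
# est assigns S -> a probability 3/4 and S -> b probability 1/4.
```

With constant expected counts the loop converges after one update; in a real run the counts change each round as the reestimated probabilities reweight the parses.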

EM is a generalization of the well-known Baum-Welch algorithm for HMM estimation [Baum et al.1970]; the original formulation for the case of SCFGs is due to [Baker1979]. For SCFGs, the E-step involves computing the expected number of times each production is applied in generating the training corpus. The M-step then consists of a simple normalization of these counts to yield the new production probabilities.
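In symbols, writing c(X → λ) for the expected count of a production obtained in the E-step, the M-step reestimate is (our notation):

```latex
\hat{P}(X \rightarrow \lambda) =
  \frac{c(X \rightarrow \lambda)}{\sum_{\mu} c(X \rightarrow \mu)}
```

where the sum ranges over all right-hand sides μ of productions with left-hand side X, so the reestimated probabilities for each nonterminal sum to one.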

In this section we examine the computation of the production count expectations required for the E-step. The crucial notion introduced by [Baker1979] for this purpose is the ``outer probability'' of a nonterminal, or the joint probability that the nonterminal is generated with a given prefix and suffix of terminals. Essentially the same method can be used in the Earley framework, after extending the definition of outer probabilities to apply to arbitrary Earley states.

Definition (outer probability): Given a string x with |x| = l, the outer probability β_i(kX → λ.μ) is the sum of the probabilities of all paths that start with the initial state, generate the prefix x_0 ... x_{k-1}, pass through the predicted state kX → .λμ, generate the suffix x_i ... x_{l-1} starting from state kX → λ.μ, and end in the final state.

Outer probabilities complement inner probabilities in that they refer precisely to those parts of complete paths generating x not covered by the corresponding inner probability γ_i(kX → λ.μ). Therefore, the choice of the production X → λμ is not part of the outer probability associated with a state kX → λ.μ. In fact, the definition makes no reference to the first part λ of the RHS: all states sharing the same k, X and μ will have identical outer probabilities.

Intuitively, β_i(kX → λ.μ) is the probability that an Earley parser operating as a string generator yields the prefix x_0 ... x_{k-1} and the suffix x_i ... x_{l-1}, while passing through state kX → λ.μ at position i (which is independent of λ). As was the case for forward probabilities, β_i(kX → λ.μ) is actually an expectation of the number of such states in the path, as unit-production cycles can result in multiple occurrences of a single state. Again, we gloss over this technicality in our terminology. The name is motivated by the fact that β_i(kX → λ.μ) reduces to the ``outer probability'' of X as defined in [Baker1979] if the dot is in final position.
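For comparison, Baker's original inner and outer probabilities for a grammar in Chomsky normal form can be computed with the classic inside-outside recursion. The following self-contained sketch uses our own naming and a made-up toy grammar; it is not the Earley-based computation developed here:

```python
from collections import defaultdict

def inside_outside(words, binary, lexical, start="S"):
    """Inside/outside probabilities for a CNF SCFG.

    binary:  dict (X, Y, Z) -> P(X -> Y Z)
    lexical: dict (X, word) -> P(X -> word)
    Returns (inside, outside), each mapping (i, j, X) to a probability
    for the span words[i:j].
    """
    n = len(words)
    inside = defaultdict(float)
    for i, w in enumerate(words):                  # length-1 spans
        for (X, word), p in lexical.items():
            if word == w:
                inside[i, i + 1, X] += p
    for span in range(2, n + 1):                   # longer spans, bottom up
        for i in range(n - span + 1):
            j = i + span
            for (X, Y, Z), p in binary.items():
                for k in range(i + 1, j):
                    inside[i, j, X] += p * inside[i, k, Y] * inside[k, j, Z]
    outside = defaultdict(float)
    outside[0, n, start] = 1.0                     # whole sentence under S
    for span in range(n, 1, -1):                   # top down
        for i in range(n - span + 1):
            j = i + span
            for (X, Y, Z), p in binary.items():
                out = outside[i, j, X]
                if out == 0.0:
                    continue
                for k in range(i + 1, j):
                    outside[i, k, Y] += p * out * inside[k, j, Z]
                    outside[k, j, Z] += p * out * inside[i, k, Y]
    return inside, outside

# Toy grammar: S -> S S (0.4) | 'a' (0.6); sentence "a a".
ins, outs = inside_outside(["a", "a"], {("S", "S", "S"): 0.4},
                           {("S", "a"): 0.6})
# P(x) = 0.4 * 0.6 * 0.6; for every length-1 span, the product
# inside * outside summed over nonterminals also equals P(x).
```

The final comment illustrates the property exploited in the E-step: the product of inner and outer quantities for a state gives the probability mass of all complete derivations passing through it.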





Andreas Stolcke
Sat Jun 29 21:49:02 PDT 1996