Next: Computing expected production counts Up: Extensions Previous: Viterbi parses

Rule probability estimation

 

The rule probabilities in a SCFG can be iteratively estimated using the EM (Expectation-Maximization) algorithm [Dempster et al.1977]. Given a sample corpus D, the estimation procedure finds a set of parameters that represent a local maximum of the grammar likelihood function P(D | G), which is given by the product of the string probabilities

P(D | G) = \prod_{x \in D} P(x | G)

i.e., the samples are assumed to be independently and identically distributed.
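Because of the independence assumption, the corpus log-likelihood is simply a sum of per-string log probabilities. A minimal sketch (the string probabilities P(x | G) would come from the parser's forward pass; the values below are made up for illustration):

```python
import math

def corpus_log_likelihood(string_probs):
    """log P(D|G) = sum over x in D of log P(x|G), assuming i.i.d. samples."""
    return sum(math.log(p) for p in string_probs)

# Hypothetical string probabilities P(x|G) for a three-sentence corpus.
probs = [0.5, 0.25, 0.125]
ll = corpus_log_likelihood(probs)  # equals log(0.5 * 0.25 * 0.125)
```

Working in log space avoids the numerical underflow that the raw product would suffer on realistic corpora.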

The two steps of this algorithm can be briefly characterized as follows.

E-step:
Compute expectations for how often each grammar rule is used, given the corpus D and the current grammar parameters (rule probabilities).
M-step:
Reset the parameters so as to maximize the likelihood relative to the expected rule counts found in the E-step.

This procedure is iterated until the parameter values (as well as the likelihood) converge. It can be shown that each iteration of the algorithm produces a likelihood that is at least as high as the previous one; the EM algorithm is therefore guaranteed to find at least a local maximum of the likelihood function.
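The overall iteration can be sketched as follows. The E-step is left abstract here (a callable returning expected rule counts and the corpus log-likelihood); all names are ours, not the paper's:

```python
def em_estimate(probs, e_step, max_iters=100, tol=1e-8):
    """Iterate E- and M-steps until the likelihood converges.

    probs:  dict mapping rules (lhs, rhs) to probabilities
    e_step: callable probs -> (expected_counts, log_likelihood)
    """
    prev_ll = float("-inf")
    for _ in range(max_iters):
        counts, ll = e_step(probs)           # E-step: expected rule counts
        totals = {}                          # M-step: renormalize per LHS
        for (lhs, _rhs), c in counts.items():
            totals[lhs] = totals.get(lhs, 0.0) + c
        probs = {rule: c / totals[rule[0]] for rule, c in counts.items()}
        if ll - prev_ll < tol:               # likelihood never decreases
            break
        prev_ll = ll
    return probs

# Toy E-step with fixed expected counts (a real one parses the corpus).
fixed = {("S", ("a",)): 3.0, ("S", ("b",)): 1.0}
est = em_estimate({("S", ("a",)): 0.5, ("S", ("b",)): 0.5},
                  lambda p: (fixed, 0.0))
# est assigns S -> a probability 3/4 and S -> b probability 1/4.
```

With constant expected counts the loop converges after one update; in a real run the counts change each round as the reestimated probabilities reweight the parses.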

EM is a generalization of the well-known Baum-Welch algorithm for HMM estimation [Baum et al.1970]; the original formulation for the case of SCFGs is due to [Baker1979]. For SCFGs, the E-step involves computing the expected number of times each production is applied in generating the training corpus. The M-step then consists of a simple normalization of these counts to yield the new production probabilities.
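In symbols, writing c(X → λ) for the expected count of a production obtained in the E-step, the M-step reestimate is (our notation):

```latex
\hat{P}(X \rightarrow \lambda) =
  \frac{c(X \rightarrow \lambda)}{\sum_{\mu} c(X \rightarrow \mu)}
```

where the sum ranges over all right-hand sides μ of productions with left-hand side X, so the reestimated probabilities for each nonterminal sum to one.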

In this section we examine the computation of the production count expectations required for the E-step. The crucial notion introduced by [Baker1979] for this purpose is the ``outer probability'' of a nonterminal, or the joint probability that the nonterminal is generated with a given prefix and suffix of terminals. Essentially the same method can be used in the Earley framework, after extending the definition of outer probabilities to apply to arbitrary Earley states.

Definition (outer probability): Given a string x with |x| = l, the outer probability β_i(kX → λ.μ) is the sum of the probabilities of all paths that start with the initial state, generate the prefix x_0 ... x_{k-1}, pass through the predicted state kX → .λμ, generate the suffix x_i ... x_{l-1} starting from state kX → λ.μ, and end in the final state.

Outer probabilities complement inner probabilities in that they refer precisely to those parts of complete paths generating x not covered by the corresponding inner probability γ_i(kX → λ.μ). Therefore, the choice of the production X → λμ is not part of the outer probability associated with a state kX → λ.μ. In fact, the definition makes no reference to the first part λ of the RHS: all states sharing the same k, X and μ will have identical outer probabilities.

Intuitively, β_i(kX → λ.μ) is the probability that an Earley parser operating as a string generator yields the prefix x_0 ... x_{k-1} and the suffix x_i ... x_{l-1}, while passing through state kX → λ.μ at position i (which is independent of λ). As was the case for forward probabilities, β_i(kX → λ.μ) is actually an expectation of the number of such states in the path, as unit-production cycles can result in multiple occurrences of a single state. Again, we gloss over this technicality in our terminology. The name is motivated by the fact that β_i(kX → λ.μ) reduces to the ``outer probability'' of X as defined in [Baker1979] if the dot is in final position.
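For comparison, Baker's original inner and outer probabilities for a grammar in Chomsky normal form can be computed with the classic inside-outside recursion. The following self-contained sketch uses our own naming and a made-up toy grammar; it is not the Earley-based computation developed here:

```python
from collections import defaultdict

def inside_outside(words, binary, lexical, start="S"):
    """Inside/outside probabilities for a CNF SCFG.

    binary:  dict (X, Y, Z) -> P(X -> Y Z)
    lexical: dict (X, word) -> P(X -> word)
    Returns (inside, outside), each mapping (i, j, X) to a probability
    for the span words[i:j].
    """
    n = len(words)
    inside = defaultdict(float)
    for i, w in enumerate(words):                  # length-1 spans
        for (X, word), p in lexical.items():
            if word == w:
                inside[i, i + 1, X] += p
    for span in range(2, n + 1):                   # longer spans, bottom up
        for i in range(n - span + 1):
            j = i + span
            for (X, Y, Z), p in binary.items():
                for k in range(i + 1, j):
                    inside[i, j, X] += p * inside[i, k, Y] * inside[k, j, Z]
    outside = defaultdict(float)
    outside[0, n, start] = 1.0                     # whole sentence under S
    for span in range(n, 1, -1):                   # top down
        for i in range(n - span + 1):
            j = i + span
            for (X, Y, Z), p in binary.items():
                out = outside[i, j, X]
                if out == 0.0:
                    continue
                for k in range(i + 1, j):
                    outside[i, k, Y] += p * out * inside[k, j, Z]
                    outside[k, j, Z] += p * out * inside[i, k, Y]
    return inside, outside

# Toy grammar: S -> S S (0.4) | 'a' (0.6); sentence "a a".
ins, outs = inside_outside(["a", "a"], {("S", "S", "S"): 0.4},
                           {("S", "a"): 0.6})
# P(x) = 0.4 * 0.6 * 0.6; for every length-1 span, the product
# inside * outside summed over nonterminals also equals P(x).
```

The final comment illustrates the property exploited in the E-step: the product of inner and outer quantities for a state gives the probability mass of all complete derivations passing through it.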





Andreas Stolcke
Sat Jun 29 21:49:02 PDT 1996