Assembling partial parses

Next: Discussion Up: Robust parsing Previous: Seeding the chart

Assembling partial parses

Instead of consulting the chart for individual substring/nonterminal pairs it may be useful to obtain a list of all complete partial parses of the input. A complete partial parse is a sequence of nonterminals that together generate the input. For example, using the grammar in Table 1, the input a circle touches above a square has the complete partial parses `NP VT PP' and `Det N VT P NP', among others. The input is grammatical exactly if S is among the complete partial parses.

First note that there may not exist a complete partial parse if the input contains unknown symbols. As a preprocessing step, or on-line during parsing, one may have to create new preterminals to account for such new input symbols.

The Earley algorithm can be minimally extended to also generate the list of all partial parses. What is needed is some device that assembles abutting nonterminals from partial parses left-to-right. This work can be carried out as a by-product of the normal completion process using the concept of a wildcard state. A wildcard state is a special kind of dummy state in which the RHS can have any number of nonterminals to the left of the dot, and a wildcard ? to the right of the dot:

As usual, such a state means that the nonterminals in have generated the substring . The wildcard indicates that any continuation of the nonterminal sequence is allowed.

The wildcard semantics are taken into account during prediction and completion. A wildcard state generates predictions for all nonterminals, thus having the same effect as the nonterminal-specific dummy states in the previous section. During completion, a wildcard state combines with all complete states to yield new wildcard states:

for all Y. (That is, the complete nonterminal Y is inserted before the dot and the wildcard is retained following the dot to allow further completions.) This is implemented by a trivial modification to the standard completion step. Inner probabilities, Viterbi probabilities and Viterbi back-pointers are processed as usual.

The net effect of processing wildcard states is that all complete partial parses can be read off the final state set in the chart as the right-hand sides of wildcard states (after discarding the wildcard itself). The inner probability of a wildcard state reflects the combined likelihood of the partial parse for the given input. The Viterbi probability of a wildcard state is the joint maximum achieved by the most likely parses for each of the substrings. Different derivations from the same complete partial parse may split the input string at different places. The Viterbi-parse procedure when applied to a wildcard state will recover the most likely such split.

Table (b) shows a trace of wildcard state completions used in enumerating the partial parses for the example given earlier.

The total number of complete partial parses can be exponential in the length of the input. It may therefore be desirable to compute only a subset of them, applying some application-specific filter criterion. One such criterion is that one is only interested in ``maximal'' complete partial parses. A complete partial parse is called maximal if it has no subsequence of nonterminals that can be replaced by another nonterminal so as to yield another complete partial parse. For example, in the case of a circle touches above a square, the only maximal parse is `NP VT PP'.

It turns out that a filter for maximal partial parses is easy to implement in the Earley framework. Maximal parses contain only nonterminals that are not themselves part of a larger partial parse. Therefore, during completion, we can mark all states that contributed to a larger constituent, and later identify the unmarked states as the ones corresponding to maximal parses. (The chart in Table (a) has all maximal states labeled with MAX.) When completing wildcard states we simply skip all completions due to non-maximal states. The list of complete partial parses obtained from the chart will then contain precisely the maximal ones. }

Next: Discussion Up: Robust parsing Previous: Seeding the chart

Andreas Stolcke
Sat Jun 29 21:49:02 PDT 1996