next up previous
Next: Seeding the chart Up: Extensions Previous: Complexity

Robust parsing

 

In many applications ungrammatical input has to be dealt with in some way. Traditionally it was seen as a drawback of top-down parsing algorithms such as Earley's that they sacrifice ``robustness,'' i.e., the ability to find partial parses in an ungrammatical input, for the efficiency gained from top-down prediction [Magerman and Weir1992].

One approach to the problem is to build robustness into the grammar itself. In the simplest case one could add top-level productions

eqnarray6686

where X can expand to any nonterminal, including an ``unknown word'' category. This grammar will cause the Earley parser to find all partial parses of substrings, effectively behaving like a bottom-up parser constructing the chart in left-to-right fashion. More refined variations are possible: the top-level productions could be used to model which phrasal categories (sentence fragments) can likely follow each other. This probabilistic information can then be used in a pruning version of the Earley parser (Section 6.1) to arrive at a compromise between robust and expectation-driven parsing.

An alternative method for making Earley parsing more robust is to modify the parser itself so as to accept arbitrary input and find all or a chosen subset of possible substring parses.

In the case of Earley's parser there is a simple extension to accomplish just that, based on the notion of a wildcard state

displaymath8885

where the wildcard ? stands for an arbitrary continuation of the RHS.

During prediction, a wildcard to the left of the dot causes the chart to be seeded with dummy states tex2html_wrap_inline8891 for each phrasal category X of interest. Conversely, a minimal modification to the standard completion step allows the wildcard states to collect all abutting substring parses:

displaymath8886

for all Y. This way each partial parse will be represented by exactly one wildcard state in the final chart position.

A detailed account of this technique is given in Stolcke:earley-tr. One advantage over the grammar modifying approach is that it can be tailored to use various criteria at runtime to decide which partial parses to follow. Below we present such a simple extension to Earley's algorithm (probabilistic or not). In the probabilistic version, it will also produce the likelihoods of those partial parses. The potential advantage over the grammar modifying approach is that it can be modified to make use of various criteria for which partial parses to allow at runtime.

The extension for robust parsing does not require any changes to the way the Earley parser operates on the chart, only that the chart be ``seeded'' with some extra states before starting. The computation performed as a result of this modification will be essentially equivalent to that of a CYK bottom-up parser, but with the advantage that a single parsing engine can be used for both standard and robust parsing.




next up previous
Next: Seeding the chart Up: Extensions Previous: Complexity

Andreas Stolcke
Sat Jun 29 21:49:02 PDT 1996