Matched Pairs Sentence-Segment Word Error (MAPSSWE) Test

Unlike the other NIST significance tests, the MAPSSWE test is parametric. It examines the numbers of errors occurring in units (segments of utterances) of varying size that are specific to the outputs of the two systems being compared. These segments are chosen in a manner that approximately validates the needed independence assumption. Because the number of units is large, the central limit theorem permits treating the mean of the per-segment error differences as approximately normally distributed. The test is essentially a t-test for the mean difference of normal distributions with unknown variances. The MAPSSWE test, sometimes simply (and misleadingly) called the matched pairs test, was suggested for speech recognition evaluation by Gillick and Cox [1] and implemented at NIST [2].

Segments Chosen

The MAPSSWE test operates on segments of utterances that may be arbitrarily short (see the example below) or as long as an entire utterance. Having, in general, multiple units per utterance helps ensure a sample large enough to justify the normality assumption. The units are chosen in such a way that the needed independence assumption is essentially valid.

The units chosen are specific to the pair of systems under consideration. They are segments of utterances bounded on both sides by words correctly recognized by both systems (or by the beginning or end of the utterance). Recognizers often use trigram language models, in which the word selected depends on the two preceding (and possibly the succeeding) words. Therefore, NIST uses segments bounded by at least two words correctly recognized by both systems, or bounded by the beginning or end of the utterance (possibly together with a single word correctly recognized by both systems). These segments can then be expected to have occurred in the same acoustic and linguistic contexts for both systems, justifying the independence assumption.

The figure below offers an example. Comparing the hypothesis strings of Systems A and B with the reference string, it will be seen that there are four errorful segments that meet the boundary criteria. For segments I and IV, A is incorrect and B correct (a substitution and a deletion in I, an insertion in IV); for segment II, A is correct and B incorrect (a deletion). For segment III both are incorrect (a substitution for A, two for B).

       

         I             II               III               IV
REF:  |it was|the best|of|times it|was the worst|of times|  |it was
      |      |        |  |        |             |        |  |
SYS A:|ITS   |the best|of|times it|IS the worst |of times|OR|it was
      |      |        |  |        |             |        |  |
SYS B:|it was|the best|  |times it|WON the TEST |of times|  |it was
      
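To make the boundary rule concrete, the following sketch in Python identifies such segments for a single utterance. It is a simplification, not NIST's scoring code: it assumes both systems' outputs have already been aligned to the reference, summarized as one flag per reference word that is true where both systems recognized that word correctly, and it therefore misses pure insertions such as segment IV above, which a full implementation would attach to an errorful segment through the alignment. The function name is illustrative.

    def find_error_segments(both_correct):
        """Return (start, end) reference-word index pairs of errorful
        segments.  both_correct[i] is True where BOTH systems recognized
        reference word i correctly.  A jointly-correct run is a segment
        boundary if it has at least two words, or at least one word when
        it touches the start or end of the utterance."""
        n = len(both_correct)
        boundary = [False] * n
        i = 0
        while i < n:
            if not both_correct[i]:
                i += 1
                continue
            j = i
            while j < n and both_correct[j]:
                j += 1
            if j - i >= 2 or i == 0 or j == n:
                for k in range(i, j):
                    boundary[k] = True
            i = j
        # Errorful segments are the maximal runs of non-boundary words.
        segments, i = [], 0
        while i < n:
            if boundary[i]:
                i += 1
                continue
            j = i
            while j < n and not boundary[j]:
                j += 1
            segments.append((i, j - 1))
            i = j
        return segments

    # Flags for the 14 reference words of the figure above.
    flags = [False, False, True, True, False, True, True,
             False, True, False, True, True, True, True]
    print(find_error_segments(flags))   # [(0, 1), (4, 4), (7, 9)]

The three spans returned correspond to segments I, II, and III. Note that segment III absorbs the word "the", which both systems recognized correctly but which cannot serve as a boundary because it is a jointly-correct run of only one word.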

Test Statistic

The MAPSSWE test involves the difference in the numbers of errors of the two systems in each segment. NIST takes the number of errors as the sum of the numbers of substitutions, deletions, and insertions after alignment, but any consistent definition for errors may be used.

The mean of the error differences over all segments is computed. After normalization by its estimated standard error, this value has an approximately standard normal distribution for a sufficiently large total number of segments (n > 50). See [1].

More specifically, let

    Z_i = N_{A,i} - N_{B,i},    i = 1, 2, \ldots, n,

where N_{A,i} is the number of errors in the i'th segment for system A, and N_{B,i} is the number of errors in the i'th segment for system B. Then let

    \hat{\mu}_z = \frac{1}{n} \sum_{i=1}^{n} Z_i

be the mean of the error differences, and let

    \hat{\sigma}_z^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \hat{\mu}_z)^2

be the estimate of their variance, so that the test statistic is

    W = \frac{\hat{\mu}_z}{\hat{\sigma}_z / \sqrt{n}}

The null hypothesis asserts that the distribution of error differences has mean zero (two-tailed) or has mean no larger than zero (one-tailed, with System B a possible improvement on System A). The null hypothesis is then rejected if the measured value w of W is such that

    2\,(1 - \Phi(|w|)) \le 0.05    (two-tailed),  or  1 - \Phi(w) \le 0.05    (one-tailed),

where \Phi is the standard normal cumulative distribution function.
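
The computation of W is simple enough to state as code. The sketch below, in Python with only the standard library, is illustrative rather than NIST's implementation (for that, see [2]); the toy data reuse the four per-segment error counts from the figure above, far too few segments for the normal approximation, which calls for n > 50.

    import math

    def mapsswe_w(errors_a, errors_b):
        """W = mean(Z) / (std(Z) / sqrt(n)), where Z_i is the difference
        in the two systems' error counts on the i'th errorful segment."""
        z = [a - b for a, b in zip(errors_a, errors_b)]
        n = len(z)
        mean = sum(z) / n
        var = sum((zi - mean) ** 2 for zi in z) / (n - 1)  # unbiased
        return mean / math.sqrt(var / n)

    def two_tailed_p(w):
        """2 * P(N >= |w|) for a standard normal N, computed via erfc."""
        return math.erfc(abs(w) / math.sqrt(2.0))

    # Error counts for segments I-IV of the figure above.
    errors_a = [2, 0, 1, 1]          # sub+del, correct, sub, ins
    errors_b = [0, 1, 2, 0]          # correct, del, two subs, correct
    w = mapsswe_w(errors_a, errors_b)
    print(w, two_tailed_p(w))        # significant at 5% if p <= 0.05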

Power

The MAPSSWE test is generally the most powerful of the tests used by NIST; it rarely fails to find significance in situations where the other tests do. This is primarily because of its inherently large sample size. Since the MAPSSWE test is parametric, it is also possible to make some theoretical assertions about its power.

Randomization

There is another way to view the MAPSSWE test, one that acknowledges its reliance on a normality approximation and is more in the spirit of the nonparametric tests otherwise used by NIST: regard it as an approximation to a related randomization test.

The MAPSSWE test determines n sentence segments on which the two systems' outputs are errorful and generally different. Here n is assumed large, generally at least in the hundreds. Consider randomly assigning the two outputs for each segment to each of the two systems involved. There are 2^n such assignments. The test essentially examines whether the particular assignment consisting of the actual system outputs is unusual, among the set of all assignments, with regard to the statistic of the difference in total errors.

The randomization test looks at where the actual total error difference falls in the distribution of total error differences over all possible assignments of segment outputs. If it falls in the 5% tail of the distribution (one-tailed) or in one of the 2.5% tails (two-tailed), then the difference appears significant and the null hypothesis may be rejected. The MAPSSWE test as currently performed is in fact derived from a normal approximation to this randomization distribution. Other approximations are also possible. See [3], pp. 212-216.

This randomization test is unaffected by segments for which the error difference is 0, including in particular segments for which the two system outputs are identical but incorrect. Such segments might well be omitted from the MAPSSWE test as currently implemented; i.e., segment boundaries might be required to contain only two identical (not necessarily correctly recognized) words. This situation should not occur frequently; when it does, it may well arise from errors in the reference transcripts.

It is probably infeasible to generate all 2^n error differences to carry out the full randomization test, though a number of approaches to reducing the computation are available. Instead, a randomly chosen sample of the possible assignments may be drawn using a random number generator. If the sample size is large enough, then with very high confidence, the estimated significance will be arbitrarily close to that found in the full test. This approach to significance testing has been used in the MUC tests of message understanding from text. A sample size of 9,999 assignments is often used.
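
A sketch of this sampled randomization test in Python follows. Since reassigning a segment's two outputs between the systems simply negates that segment's error difference Z_i, each random assignment reduces to an independent fair sign flip of each Z_i. The one-tailed direction shown (does System A really make more errors than System B?) and the function name are illustrative choices, not part of any NIST tool.

    import random

    def randomization_p(errors_a, errors_b, num_samples=9999, seed=1):
        """Estimate the one-tailed significance level: the fraction of
        random output-to-system assignments whose total error difference
        is at least as extreme as the observed one."""
        z = [a - b for a, b in zip(errors_a, errors_b)]
        observed = sum(z)
        rng = random.Random(seed)
        hits = 0
        for _ in range(num_samples):
            # Assign each segment's two outputs to the two systems at
            # random: a fair sign flip of each per-segment difference.
            total = sum(zi if rng.random() < 0.5 else -zi for zi in z)
            if total >= observed:
                hits += 1
        # The observed assignment itself counts as one of the samples.
        return (hits + 1) / (num_samples + 1)

With the often-used sample size of 9,999, the null hypothesis would be rejected at the 5% level if the returned estimate is at most 0.05.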

The other NIST statistical tests, namely the Sign test, the Wilcoxon test, and the McNemar test (which is a form of the Sign test), may also be viewed as randomization tests, where the items being randomized are the signs or the ranks of the performance differences.

References

[1] L. Gillick and S. Cox, "Some Statistical Issues in the Comparison of Speech Recognition Algorithms", ICASSP 89, pp. 532-535.

[2] D. Pallett et al., "Tools for the Analysis of Benchmark Speech Recognition Tests", ICASSP 90, vol. 1, pp. 97-100.

[3] J. Pratt and J. Gibbons, Concepts of Nonparametric Theory, Springer-Verlag, 1981.

Created: 04-Oct-2000
Last updated: 20-Jul-2001