The sign test, first suggested for use in speech recognition benchmark tests by Makhoul ([1], p. 12), compares word error rates on the different speakers, on the different conversation sides, or on other prespecified subsets of a test set. It looks simply at which system performs better on each such subset. If there is systematic evidence of differences in a consistent direction, this may prove to be significant even if the magnitudes of the differences are small.

If the measure used as the basis for deciding better performance is continuous, then the probability of exactly equal performance is zero. In practice, the possibility of equality must be allowed for, generally by dropping such subsets (speakers) from the collection considered. This presents only slight theoretical difficulties (see [2], p. 855) and is standard practice.

If the null hypothesis holds, then on each subset either system has probability 1/2 of showing the better performance. Thus, if there are N subsets, the distribution of the statistic CA is the binomial B(N,1/2). Let c, cA, and cB be the measured values of C, CA, and CB, respectively, with CA and CB denoting the numbers of subsets on which systems A and B, respectively, perform better, and C = min(CA, CB). The null hypothesis is rejected if

Prob(C <= c) = Prob(min(CA, CB) <= c) = Prob(CA <= c) + Prob(CB <= c) = 2 * Prob(B(N,1/2) <= c) <= 0.05 (two-tailed)

Prob(CA <= cA) = Prob(B(N,1/2) <= cA) <= 0.05 (one-tailed)

These probabilities may be found directly from tables for the binomial distribution or, for large N (> 10), from the normal approximation. Table 1 lists critical values, i.e., upper bounds on C for significance at p=0.05, for a range of values of N. (See [3] for one source of this data.)

The sign test is generally less powerful than the Wilcoxon test, described next, which applies in similar evaluation situations. It is, however, simple and easy to use, and is thus regularly used by NIST in evaluation reports.
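The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code NIST uses: the function names `binom_cdf` and `sign_test`, the returned dictionary keys, and the per-speaker error rates in the usage note are all hypothetical. It assumes that a lower word error rate means better performance, that CA counts the subsets on which system A is better, and that tied subsets are dropped before counting. The binomial tail is computed exactly rather than with the normal approximation.

```python
from math import comb


def binom_cdf(k, n):
    """Return Prob(B(n, 1/2) <= k), computed exactly."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n


def sign_test(errors_a, errors_b, alpha=0.05):
    """Sign test on paired per-subset error rates (hypothetical helper).

    Assumption: lower error rate = better performance.
    Subsets with exactly equal error rates are dropped.
    """
    pairs = list(zip(errors_a, errors_b))
    c_a = sum(1 for a, b in pairs if a < b)  # subsets where A is better
    c_b = sum(1 for a, b in pairs if b < a)  # subsets where B is better
    n = c_a + c_b                            # N after dropping ties
    c = min(c_a, c_b)

    # Two-tailed: 2 * Prob(B(N,1/2) <= c), capped at 1.
    p_two = min(1.0, 2 * binom_cdf(c, n))
    # One-tailed, as written in the text: Prob(B(N,1/2) <= cA).
    p_one = binom_cdf(c_a, n)

    return {
        "n": n,
        "c_a": c_a,
        "c_b": c_b,
        "c": c,
        "p_two_tailed": p_two,
        "p_one_tailed": p_one,
        "reject_two_tailed": p_two <= alpha,
        "reject_one_tailed": p_one <= alpha,
    }
```

As a usage example with made-up numbers: if system A has the lower error rate on 9 of 10 untied speakers, then c = 1 and N = 10, so the two-tailed probability is 2 * (1 + 10) / 1024 = 22/1024, about 0.021, and the null hypothesis is rejected at the 0.05 level. Calling `sign_test([0.10] * 9 + [0.30], [0.20] * 9 + [0.10])` reproduces this case.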
[1] D. Pallett, J. Fiscus, and J. Garofolo, "Resource Management Corpus: September 1992 Test Set Benchmark Test Results", Proceedings of ARPA Microelectronics Technology Office Continuous Speech Recognition Workshop, Stanford, CA, September 21-22, 1992.

[2] R. Winkler and W. Hays, Statistics: Probability, Inference and Decision, second edition, Holt, Rinehart, and Winston, 1975.

[3] G. Kanji, 100 Statistical Tests, SAGE Publications, 1994.
Created: 04-Oct-2000
Last updated: 20-Apr-2001