Sign Test


The sign test, first suggested for use in speech recognition benchmark tests by Makhoul ([1], p. 12), compares the word error rates of two systems on individual speakers, conversation sides, or other prespecified subsets of a test set. It looks simply at which system performs better on each such subset. If there is systematic evidence of differences in a consistent direction, this may prove to be significant even if the magnitudes of the differences are small.

If the measure used as the basis for deciding better performance is continuous, then the probability of exactly equal performance is zero. In practice, the possibility of equality must be allowed for, generally by dropping such subsets (speakers) from the collection considered. This has only slight theoretical difficulties (see [2], p. 855), and is standard practice.

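Let C_A be the number of subsets on which system A performs better than system B, C_B the number on which system B performs better, and C = min(C_A, C_B). If the null hypothesis holds, then on any given subset the probability is 1/2 that either system will have the better performance. Thus, if there are N subsets remaining after ties are dropped, the distribution of the statistic C_A is the binomial B(N,1/2).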
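As a minimal sketch of the counting step, the following assumes that "better performance" means a lower word error rate on a subset and that C_A and C_B count the subsets won by systems A and B. The function name and the input lists are hypothetical, chosen only for illustration.

```python
def sign_counts(errors_a, errors_b):
    """Return (c_a, c_b, n) for paired per-subset error rates.

    c_a: subsets on which system A has the lower error rate
    c_b: subsets on which system B has the lower error rate
    n:   subsets kept after ties are dropped (c_a + c_b)
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same subsets")
    c_a = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    c_b = sum(1 for a, b in zip(errors_a, errors_b) if b < a)
    return c_a, c_b, c_a + c_b


# Hypothetical word error rates (percent) for four speakers.
c_a, c_b, n = sign_counts([10.0, 12.5, 8.0, 9.0], [11.0, 12.5, 7.5, 9.5])
print(c_a, c_b, n)
```

The second speaker is a tie and is dropped, so only three subsets enter the test.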

Let c, c_A, and c_B be the measured values of C, C_A, and C_B, respectively. The null hypothesis is rejected if

Prob(C <= c) = Prob(min(C_A, C_B) <= c) = Prob(C_A <= c) + Prob(C_B <= c)

= 2 * Prob(B(N,1/2) <= c) <= 0.05 (two-tailed)

Prob(C_A <= c_A) = Prob(B(N,1/2) <= c_A) <= 0.05 (one-tailed)
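The rejection rule above can be evaluated exactly from the binomial distribution. The sketch below is one way to do so using only the Python standard library; the function names are illustrative, and N is taken to be c_A + c_B (ties already dropped).

```python
from math import comb


def binom_cdf_half(n, c):
    """Exact Prob(B(n, 1/2) <= c)."""
    if c < 0:
        return 0.0
    c = min(c, n)
    return sum(comb(n, k) for k in range(c + 1)) / 2 ** n


def sign_test_p(c_a, c_b, two_tailed=True):
    """Sign test probability for measured counts c_a and c_b."""
    n = c_a + c_b
    if two_tailed:
        # 2 * Prob(B(N,1/2) <= min(c_A, c_B)), capped at 1
        return min(1.0, 2 * binom_cdf_half(n, min(c_a, c_b)))
    # One-tailed: Prob(B(N,1/2) <= c_A)
    return binom_cdf_half(n, c_a)


print(sign_test_p(1, 8))                    # two-tailed
print(sign_test_p(1, 8, two_tailed=False))  # one-tailed
```

With c_A = 1 and c_B = 8 (N = 9), the two-tailed probability is 20/512, about 0.039, so the null hypothesis would be rejected at the 0.05 level.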

These probabilities may be found directly from tables for the binomial distribution or, for large N (> 10), from the normal approximation. Table 1 lists critical values, i.e., upper bounds on C for significance at p=0.05, for a range of values of N. (See [3] for one source of these data.)
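For the normal approximation, B(N,1/2) has mean N/2 and standard deviation sqrt(N)/2. The sketch below adds a continuity correction of 0.5, which is a common choice but an assumption here, since the text does not say which form of the approximation is intended. The function names are illustrative.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def sign_test_normal_approx(n, c, two_tailed=True):
    """Approximate Prob(B(n, 1/2) <= c), doubled for the two-tailed test."""
    mean = n / 2.0
    sd = sqrt(n) / 2.0
    # Continuity correction: treat the integer c as the interval up to c + 0.5.
    tail = normal_cdf((c + 0.5 - mean) / sd)
    return min(1.0, 2.0 * tail) if two_tailed else tail


print(sign_test_normal_approx(25, 7, two_tailed=False))
```

For N = 25 and c = 7 this gives a one-tailed value of about 0.0228, against an exact binomial value of about 0.0216.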

The sign test is generally less powerful than the Wilcoxon test, described next, which applies in similar evaluation situations. It is, however, simple and easy to use, and is thus regularly used by NIST in evaluation reports.

Sign Test Critical Values, p=0.05
Number of Subsets (N)	Two-Tailed	One-Tailed
5	---	0
6	0	0
7	0	0
8	0	1
9	1	1
10	1	1
11	1	2
12	2	2
13	2	3
14	2	3
15	3	3
16	3	4
17	4	4
18	4	5
19	4	5
20	5	5
21	5	6
22	5	6
23	6	7
24	6	7
25	7	7
Table 1: Critical values for sign test for different numbers of subsets at significance p=0.05. For significance, the test statistic must be less than or equal to the critical value.
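The entries of Table 1 can be regenerated from the exact binomial distribution. The sketch below, with an illustrative function name, finds for each N the largest c for which the one-tailed or two-tailed probability does not exceed 0.05.

```python
from math import comb


def critical_value(n, tails, alpha=0.05):
    """Largest c with tails * Prob(B(n, 1/2) <= c) <= alpha, or None if none."""
    best = None
    cumulative = 0  # running sum of binomial coefficients C(n, 0..c)
    for c in range(n + 1):
        cumulative += comb(n, c)
        if tails * cumulative / 2 ** n <= alpha:
            best = c
        else:
            break
    return best


# Print rows in the same layout as Table 1.
for n in range(5, 26):
    two = critical_value(n, tails=2)
    one = critical_value(n, tails=1)
    print(n, "---" if two is None else two, one, sep="\t")
```

For N = 5 no two-tailed critical value exists, because even c = 0 gives 2 * (1/32) = 0.0625, which exceeds 0.05.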

References

[1] D. Pallett, J. Fiscus, and J. Garofolo, "Resource Management Corpus: September 1992 Test Set Benchmark Test Results", Proceedings of ARPA Microelectronics Technology Office Continuous Speech Recognition Workshop, Stanford, CA, September 21-22, 1992.

[2] R. Winkler and W. Hays, Statistics: Probability, Inference and Decision, second edition, Holt, Rinehart, and Winston, 1975.

[3] G. Kanji, 100 Statistical Tests, SAGE Publications, 1994.

Created: 04-Oct-2000

Last updated: 20-Apr-2001