Title: The Significance of Result Differences
1. The Significance of Result Differences
2. Why Significance Tests?
- everybody knows we have to test the significance of our results, but do we really?
- evaluation results are valid for
  - data from a specific corpus
  - extracted with specific methods
  - for a particular type of collocations
  - according to the intuitions of one particular annotator (or two)
3. Why Significance Tests?
- significance tests are about generalisations
- basic question: "If we repeated the evaluation experiment (on similar data), would we get the same results?"
- influence of source corpus, domain, collocation type and definition, annotation guidelines, ...
4.–5. Evaluation of Association Measures
[figure slides]
6. A Different Perspective
- pair types are described by contingency tables (O11, O12, O21, O22) → coordinates in 4-D space
  - O22 is redundant because O11 + O12 + O21 + O22 = N
- can also describe a pair type by its joint and marginal frequencies (f, f1, f2) → coordinates in 3-D space
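The relationship between the two coordinate systems can be made concrete with a small R sketch; the frequency values below are invented for illustration:

  ## frequency signature of one pair type (made-up numbers)
  f  <- 30      # joint frequency (= O11)
  f1 <- 500     # marginal frequency of the first component
  f2 <- 1000    # marginal frequency of the second component
  N  <- 1e6     # sample size

  ## reconstruct the full contingency table from (f, f1, f2)
  O11 <- f
  O12 <- f1 - f
  O21 <- f2 - f
  O22 <- N - f1 - f2 + f          # redundant: the four cells sum to N
  stopifnot(O11 + O12 + O21 + O22 == N)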
7. A Different Perspective
- data set = cloud of points in three-dimensional space
  - visualisation is "challenging"
- many association measures depend on O11 and E11 only (MI, gmean, t-score, binomial)
- projection to (O11, E11) → coordinates in 2-D space (ignoring the ratio f1 / f2)
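As an illustration, under the usual definitions MI = log2(O11 / E11) and t-score = (O11 - E11) / sqrt(O11), the scores can be computed directly from the 2-D coordinates; the values below are hypothetical:

  ## hypothetical coordinates in the (O11, E11) plane
  O11 <- 30                          # observed joint frequency
  E11 <- 0.5                         # expected frequency f1 * f2 / N
  MI      <- log2(O11 / E11)         # mutual information
  t.score <- (O11 - E11) / sqrt(O11) # t-score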
8.–12. The Parameter Space of Collocation Candidates
[figure slides: visualisations of the parameter space]
13. N-best Lists in Parameter Space
- N-best list for an AM includes all pair types with score ≥ c (threshold c obtained from the data)
- score ≥ c describes a subset of the parameter space
- for a sound association measure, the isoline score = c is a lower boundary (because scores should increase with O11 for a fixed value of E11)
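A minimal R sketch of how the threshold c is obtained from the data; the score vector is simulated for the example:

  ## hypothetical association scores for 50,000 candidate pair types
  set.seed(42)
  scores <- rexp(50000)
  ## threshold c = score of the 1000th-best candidate
  c.threshold <- sort(scores, decreasing = TRUE)[1000]
  n.best <- which(scores >= c.threshold)   # the 1000-best list: score >= c
  length(n.best)                           # 1000 (barring ties)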
14.–17. N-Best Isolines in the Parameter Space
[figure slides: N-best isolines for MI (14–15) and t-score (16–17)]
18.–20. 95% / 99% Confidence Intervals
[figure slides: confidence intervals for precision]
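Such confidence intervals for an observed precision value can be computed with a binomial test; e.g. for a 1000-best list with a hypothetical 420 true positives:

  binom.test(420, 1000)$conf.int                      # 95% confidence interval
  binom.test(420, 1000, conf.level = 0.99)$conf.int   # 99% confidence interval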
21. Comparing Precision Values
- number of TPs and FPs for the 1000-best lists
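A naive comparison would treat the two TP counts as independent binomial proportions (the counts below are invented); note that this ignores the pairing of the two lists, drawn from the same candidate set, which is why the next slides use McNemar's test instead:

  ## hypothetical TP counts for the 1000-best lists of two AMs
  prop.test(c(420, 380), c(1000, 1000))   # unpaired test for equal precision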
22. McNemar's Test
- [2 × 2 table: TP candidates cross-classified as in / not in each AM's 1000-best list]
- ideally, all TPs in the 1000-best list (possible!)
- H0: differences between the AMs are random
23. McNemar's Test
- [2 × 2 table: TP candidates cross-classified as in / not in each AM's 1000-best list]
- > mcnemar.test(tbl)
- p-value < 0.001 → highly significant
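A self-contained version of this test call, with invented cell counts (rows = in / not in the 1000-best list of the first AM, columns = the same for the second AM):

  ## hypothetical cross-classification of TP candidates
  tbl <- matrix(c(310,  35,
                   80, 590),
                nrow = 2, byrow = TRUE,
                dimnames = list(AM1 = c("in", "not in"),
                                AM2 = c("in", "not in")))
  mcnemar.test(tbl)   # compares the discordant cells (35 vs. 80)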
24.–26. Significant Differences
[figure slides]
27. Lowest-Frequency Data Samples
- too much data for full manual evaluation → random samples
- AdjN data
  - 965 pairs with f = 1 (15% sample)
  - manually identified 31 TPs (3.2%)
- PNV data
  - 983 pairs with f < 3 (0.35% sample)
  - manually identified 6 TPs (0.6%)
28. Lowest-Frequency Data Samples
- estimate proportion p of TPs among all lowest-frequency data
- confidence set from binomial test
- AdjN: 31 TPs among 965 items
  - p ≤ 5% with 99% confidence
  - at most ≈ 320 TPs
- PNV: 6 TPs among 983 items
  - p ≤ 1.5% with 99% confidence
  - there might still be ≈ 4200 TPs !!
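These confidence sets, and the scaled-up TP counts, can be reproduced with R's binomial test; the scaling simply divides by the sample proportions given on the previous slide:

  binom.test(31, 965, conf.level = 0.99)$conf.int   # upper bound ~ 0.05 (AdjN)
  0.05 * 965 / 0.15                                 # ~ 320 TPs in the full data
  binom.test(6, 983, conf.level = 0.99)$conf.int    # upper bound ~ 0.015 (PNV)
  0.015 * 983 / 0.0035                              # ~ 4200 TPs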
29. N-best Lists for Lowest-Frequency Data
- evaluate 10,000-best lists
- to reduce manual annotation work, take a 10% sample from each list (i.e. 1,000 candidates for each AM)
- precision graphs for N-best lists
  - up to N = 10,000 for the PNV data
- 95% confidence estimates for precision of the best-performing AM (from binomial test)
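A sketch of such a sample-based estimate (the TP count is invented): if 270 of the 1,000 annotated candidates sampled from a 10,000-best list turn out to be TPs,

  binom.test(270, 1000)$conf.int   # 95% confidence interval for the list's precision

and multiplying the interval by N = 10,000 gives a range for the number of TPs in the full list.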
30.–32. Random Sample Evaluation
[figure slides]