Title: The Case for Corpus Profiling
1. The Case for Corpus Profiling
- Anne De Roeck
- (Udo Kruschwitz, Nick Webb, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite, Dawei Song)
- Centre for Research in Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.
3. Fact or Factoid: Hyperlinks
- Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999; Hawking et al. 1999).
- Hyperlinks do significantly improve recall and precision in narrow domains and intranets (Chen et al. 1999; Kruschwitz 2001).
6. Fact or Factoid: Stemming
- Stemming does not improve the effectiveness of retrieval (Harman 1991).
- Stemming improves performance for morphologically complex languages (Popovic and Willett 1992).
- Stemming improves performance on short documents (Krovetz 1993).
7. Fact or Factoid: Long or Short?
- Stemming improves performance on short documents (Krovetz 1993).
- Short keyword-based queries behave differently from long structured queries (Fujii and Croft 1999).
- Keyword-based retrieval works better on long texts (Jurafsky and Martin 2000).
11. Fact
- Performance of IR and NLP techniques depends on the characteristics of the dataset.
- Performance will vary with task, technique and language.
- Datasets really are significantly different:
  - Vital statistics
  - Sparseness
12. Description
13. Vital Stats
14-15. Type-to-Token Ratios (figures)
16. Assumption
- Successful (statistical?) techniques can be successfully ported to other languages:
  - Western European languages
  - Japanese, Chinese, Malay, ...
- WordSmith: effective use requires a 5M-word corpus (Garside 2000).
17. Type-to-Token Ratio
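As a rough illustration (not from the slides), here is a minimal Python sketch of how type-to-token ratios might be computed; the function names are hypothetical. Chunked TTR is one common way to keep the measure comparable across corpora of different sizes, since raw TTR falls as a corpus grows.

```python
from collections import Counter

def type_token_ratio(tokens):
    """Ratio of distinct word forms (types) to running words (tokens)."""
    return len(set(tokens)) / len(tokens)

def ttr_by_chunk(tokens, chunk_size=1000):
    """TTR over fixed-size chunks, so corpora of different sizes
    can be compared on an equal footing."""
    return [type_token_ratio(tokens[i:i + chunk_size])
            for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
```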
18. Cargo Cult Science?
19. Cargo Cult Science?
- Richard Feynman (1974):
- "It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated."
20. Cargo Cult Science?
- Richard Feynman (1974):
- "Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it."
- "In summary, the idea is to try to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another."
21. Cargo Cult Science?
- The role of data in the outcome of experiments should be clarified.
  - Why?
  - How?
24. Why explore the role of data?
- Methodological: replicability
  - Barbu and Mitkov (2001): anaphora resolution
  - Donaway et al. (2000): automatic summarisation
- Epistemological: theory induction
  - What is the relationship between data properties and technique performance?
- Practical: application
  - What is the relationship between two sets of data?
  - What is this dataset (language?) like?
25. How to explore the role of data?
- One way: profiling for bias
- Assumption: a collection will be biased w.r.t. technique and task
- Find measures that reflect bias
- Verify effects experimentally
26. How to explore the role of data?
- Profile standard collections
  - Adds to past experiments
- Profile new data
  - Gauge distance to known collections
  - Estimate effectiveness of techniques
28. Why Profile for Bias?
- And by the way, the others think it is vital (machine learning, data mining, pattern matching, etc.).
- And so did we! (Or do we?)
29. Profiling: An Abandoned Agenda?
- Sparck Jones (1973): Collection properties influencing automatic term classification performance. Information Storage and Retrieval, Vol. 9.
- Sparck Jones (1975): A performance yardstick for test collections. Journal of Documentation, 31(4).
30. What has changed?
- Sparck Jones (1973):
  - Is a collection usably classifiable?
    - Number of query terms which can be used for matching.
  - Is a collection usefully classifiable?
    - Number of useful, linked terms in a document or collection.
  - Is a collection classifiable?
    - Size of vocabulary and rate of incidence.
32. Profiling: An Abandoned Agenda
- Term weighting formula tailored to the query (Salton 1972)
- Stop word identification relative to collection/query (Wilbur and Sirotkin 1992; Yang and Wilbur 1996)
- Effect of collection homogeneity on language model quality (Rose and Haddock 1997)
35. What has changed?
- Proliferation of (test) collections
- More data per collection
- Increased application need
- Sparseness is only one kind of bias
- Better (ways of computing) measures?
37. Profiling Measures
- Requirements: measures should be
  - relevant to NLP techniques, given the task
  - fine-grained
  - cheap to implement
- Need to agree a framework
  - Fixed points:
    - Collections?
    - Properties?
    - Measures?
38. Profiling Measures
- Simple starting point:
  - Vital statistics
  - Zipf (sparseness, idiosyncrasy) (a quick rank-frequency check is sketched below)
  - Type-to-token ratio (sparseness, specialisation)
  - Manual sampling (quality, content)
- Refine?
  - Homogeneity?
  - Burstiness?
  - (Words and genre?)
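As an illustration of the kind of cheap measure meant here (my own sketch, with a hypothetical function name): fitting the rank-frequency curve on log-log axes. Zipf's law predicts a slope near -1, and large deviations hint at idiosyncratic or specialised vocabulary.

```python
import numpy as np
from collections import Counter

def zipf_slope(tokens):
    """Fit log(frequency) against log(rank); Zipf's law predicts
    a slope near -1 for natural-language text."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope
```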
39. Profiling Measures
- Homogeneity (or: how strong is the evidence defeating the homogeneity assumption?)
- Term distribution models (words!)
- Frequentist vs non-frequentist
- Very frequent terms (!!)
40. Very Frequent Terms
- Lots of them
- Reputedly noise-like (random? homogeneous?)
- Present in most datasets (useful for comparison)
- Stop word identification relative to collection/query is independently relevant (Wilbur and Sirotkin 1992; Yang and Wilbur 1996)
41. Homogeneity
- Homogeneity assumption:
  - Bag of words
  - Function word distribution
  - Content word distribution
- Measure of heterogeneity as a dataset profile
  - Kilgarriff and others, 1992 onwards
  - Measure distance between corpora
  - Identify genre
42. Heterogeneity Measures
- χ² (Kilgarriff 1997; Rose and Haddock 1997)
- G² (Rose and Haddock 1997; Rayson and Garside 2000)
- Correlation, Mann-Whitney (Kilgarriff 1996)
- Log-likelihood (Rayson and Garside 2000)
- Spearman's S (Rose and Haddock 1997)
- Kullback-Leibler divergence (Cavaglia 2002) (a minimal sketch follows)
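For concreteness, a minimal sketch (my own; the function name and the add-alpha smoothing are assumptions, not details from the talk) of KL divergence between the term distributions of two corpus halves:

```python
import numpy as np

def kl_divergence(p_counts, q_counts, vocab, alpha=1.0):
    """KL divergence between two term distributions, with add-alpha
    smoothing so unseen terms don't produce infinities."""
    p = np.array([p_counts.get(w, 0) + alpha for w in vocab], dtype=float)
    q = np.array([q_counts.get(w, 0) + alpha for w in vocab], dtype=float)
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```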
43. Measuring Heterogeneity
- Divide the corpus, using 5000-word chunks, into random halves (sketched below)
- Frequency list for each half
- Calculate χ² for term frequency distribution differences between the halves
- Normalise for corpus length
- Iterate over successive random halves
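A minimal Python sketch of this procedure as I read it from the slide; the function name, the restriction to the most frequent terms, and the particular length normalisation are my assumptions rather than details given in the talk.

```python
import random
from collections import Counter

def chi2_heterogeneity(tokens, chunk_size=5000, n_terms=500, iterations=10):
    """Kilgarriff-style heterogeneity: cut the corpus into chunks,
    shuffle them into random halves, and compute a chi-squared statistic
    over the most frequent terms; iterate and average."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    vocab = [w for w, _ in Counter(tokens).most_common(n_terms)]
    scores = []
    for _ in range(iterations):
        random.shuffle(chunks)
        half = len(chunks) // 2
        c1 = Counter(w for ch in chunks[:half] for w in ch)
        c2 = Counter(w for ch in chunks[half:] for w in ch)
        n1, n2 = sum(c1.values()), sum(c2.values())
        chi2 = 0.0
        for w in vocab:
            o1, o2 = c1[w], c2[w]
            e1 = (o1 + o2) * n1 / (n1 + n2)   # expected count in half 1
            e2 = (o1 + o2) * n2 / (n1 + n2)   # expected count in half 2
            if e1 > 0 and e2 > 0:
                chi2 += (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
        scores.append(chi2 / (n1 + n2))  # one plausible length normalisation
    return sum(scores) / len(scores)
```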
44. Measuring Heterogeneity
- Kilgarriff registers values of the χ² statistic
- A high value indicates high heterogeneity
- Finds high heterogeneity in all texts
45. Defeating the Homogeneity Assumption
- Assume the word distribution is homogeneous (bag of words)
- Explore chunk sizes:
  - Chunk size 1 → homogeneous (random)
  - Chunk size 5000 → heterogeneous (Kilgarriff 1997)
- χ² test (statistic + p-value)
- Defeat the assumption at a level of statistical significance
- Register differences between datasets
- Focus on frequent terms (!)
46. Homogeneity detection at a level of statistical significance
- p-value: evidence for/against the hypothesis
  - p < 0.1: weak evidence against
  - p < 0.05: significant (moderate evidence against the hypothesis)
  - p < 0.01: strong evidence against
  - p < 0.001: very strong evidence against
- Indication of statistically significant non-homogeneity (see the sketch below)
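To make the statistic/p-value pairing concrete, a minimal sketch (my own, with a hypothetical function name) using SciPy's standard chi-squared contingency test on the two halves' frequency lists. It assumes the counts are collections.Counter objects (which return 0 for unseen terms) and that vocab is restricted to terms seen in at least one half, so no column sums to zero.

```python
import numpy as np
from scipy.stats import chi2_contingency

def homogeneity_pvalue(counts1, counts2, vocab):
    """Rows = the two random halves, columns = terms; a small p-value
    is evidence against the homogeneity assumption for this division."""
    table = np.array([[counts1[w] for w in vocab],
                      [counts2[w] for w in vocab]])
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p
```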
47. Dividing a Corpus
- docDiv: place documents in random halves
  - term distribution across documents
- halfdocDiv: place half-documents in random halves
  - term distribution within the same document
- chunkDiv: place chunks (between 1 and 5000 words) in random halves
  - term distribution between text chunks (genre?)
- (All three divisions are sketched below.)
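A minimal Python sketch of the three divisions as I understand them; the names mirror the slide, but the exact construction (in particular sending a document's two halves to opposite sides in halfdocDiv, so the comparison is within documents) is my reading, not a detail spelled out here. Documents are represented as token lists.

```python
import random

def doc_div(docs):
    """docDiv: shuffle whole documents into two random halves."""
    docs = list(docs)
    random.shuffle(docs)
    mid = len(docs) // 2
    return docs[:mid], docs[mid:]

def halfdoc_div(docs):
    """halfdocDiv: split each document in two and send the halves to
    opposite sides (in random order), probing within-document variation."""
    left, right = [], []
    for doc in docs:
        mid = len(doc) // 2
        a, b = doc[:mid], doc[mid:]
        if random.random() < 0.5:
            a, b = b, a
        left.append(a)
        right.append(b)
    return left, right

def chunk_div(docs, chunk_size=5000):
    """chunkDiv: concatenate the corpus, cut it into fixed-size chunks,
    and shuffle the chunks into two random halves."""
    tokens = [w for doc in docs for w in doc]
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    random.shuffle(chunks)
    mid = len(chunks) // 2
    return chunks[:mid], chunks[mid:]
```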
48. Results: docDiv
49. Results: halfdocDiv
50. Results: chunkDiv (5)
51. Results: chunkDiv (100)
52. Results
- docDiv:
  - heterogeneity across most documents, except
  - AP and DOE (20 terms or fewer)
- halfdocDiv:
  - tests are sensitive to certain document types
  - DOE very homogeneous
  - PAT and OU very heterogeneous
- chunkDiv:
  - chunk length vs. document boundary?
  - similar behaviour of WSJ and SJM
- (Intranet data gives extreme results. How transferable is corpus-based training?)
53. Pro
- Is the heterogeneity test a reasonable profiling measure?
  - sensitive to document types, e.g. different behaviour for halfdocDiv
  - cheap to implement
  - relation between the measure and its p-value
54. Drawbacks
- Frequency-based
- Coarse-grained
- Not homogeneous: bursty
  - Bursty in what way?
  - Useful for applications?
55. Profiling by Measuring Burstiness
- The pioneers' agenda: clumps!
  - Sparck Jones and Needham 1964
- Models:
  - Two-Poisson (Church 2000)
  - K-mixtures (Katz 1996)
  - Exponential mixtures (Sarkar et al. 2006)
56. Sarkar Burstiness Model
- Model gaps between occurrences (not term occurrence counts)
- Mixture of two exponential distributions (density sketched below)
- Between-burst gaps: 1/λ1 (or λ1)
- Within-burst gaps: 1/λ2 (or λ2)
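The slides leave the mixture formula to a figure; here is a plausible reconstruction (my own, following the usual two-exponential gap mixture, with λ1 and λ2 as rates so that 1/λ is the mean gap; the slides quote both forms):

```python
import numpy as np

def gaps(positions):
    """Gaps (in tokens) between successive occurrences of a term."""
    return np.diff(np.asarray(positions))

def gap_density(x, p, lam1, lam2):
    """Two-exponential mixture over gap lengths x.
    lam1, lam2 are rates, so the mean gaps are 1/lam1 (between bursts)
    and 1/lam2 (within bursts); p weighs the between-burst component."""
    return p * lam1 * np.exp(-lam1 * x) + (1 - p) * lam2 * np.exp(-lam2 * x)
```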
57. Burstiness Model
- First occurrence
- No occurrence: censoring
58. Burstiness Model
- Bayesian estimation:
  - posterior ∝ prior × likelihood
  - choose an uninformative prior
  - estimate the posterior using Gibbs sampling (MCMC)
- WinBUGS software:
  - 1000-iteration burn-in
  - a further 5000 iterations for the estimate
- (A non-Bayesian alternative fit is sketched below.)
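The talk fits this model by Gibbs sampling in WinBUGS; as a self-contained stand-in, here is a maximum-likelihood EM fit of the same two-exponential mixture (a different estimation technique from the one used in the talk, and the function name is my own):

```python
import numpy as np

def fit_gap_mixture_em(x, iters=200):
    """EM for a two-component exponential mixture over gap lengths x.
    Component 1 starts with the smaller rate (longer mean gap),
    matching the between-burst role."""
    x = np.asarray(x, dtype=float)
    p, lam1, lam2 = 0.5, 0.5 / x.mean(), 2.0 / x.mean()
    for _ in range(iters):
        d1 = p * lam1 * np.exp(-lam1 * x)          # between-burst component
        d2 = (1 - p) * lam2 * np.exp(-lam2 * x)    # within-burst component
        r = d1 / (d1 + d2)                         # responsibility of comp. 1
        p = r.mean()
        lam1 = r.sum() / (r * x).sum()             # MLE rate updates
        lam2 = (1 - r).sum() / ((1 - r) * x).sum()
    return p, lam1, lam2
```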
59. Burstiness Model
- Word behaviour hypotheses:
  - Small λ1, small λ2: frequently occurring (function?) word
  - Large λ1, small λ2: bursty (content?) word
  - Small λ1, large λ2: frequent but well-spaced (function?) word
  - Large λ1, large λ2: infrequent, scattered (function?) word
- p: proportion of times a term does not occur in a burst
- 1−p: proportion of times a term appears in a burst
62. Very frequent function words
63. Less frequent function words
64. Style-indicative terms
65. Content terms
66. What now?
- Experimental verification
- Other aspects:
  - Coverage (narrow or broad)
  - Layout and metadata
  - Language
  - Links and mark-up
68. Conclusions
- NLP/IR has papered over the elephant in the room.
- Dataset profiling can be a useful way of augmenting known results.
- Profiles have to be relative to the task.
- Measures have to be informative.
- Finding effective profiling measures is a substantial, difficult, and essential research agenda.