Title: The Case for Corpus Profiling
1. The Case for Corpus Profiling
- Anne De Roeck
- (Udo Kruschwitz, Nick Webb, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite, Dawei Song)
- Centre for Research in Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.
3. Fact or Factoid: Hyperlinks
- Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Pickard 1999; Hawking et al. 1999).
- Hyperlinks do significantly improve recall and precision in narrow domains and intranets (Chen et al. 1999; Kruschwitz 2001).
6. Fact or Factoid: Stemming
- Stemming does not improve the effectiveness of retrieval (Harman 1991).
- Stemming improves performance for morphologically complex languages (Popovic and Willett 1992).
- Stemming improves performance on short documents (Krovetz 1993).
7. Fact or Factoid: Long or Short?
- Stemming improves performance on short documents (Krovetz 1993).
- Short keyword-based queries behave differently from long structured queries (Fujii and Croft 1999).
- Keyword-based retrieval works better on long texts (Jurafsky and Martin 2000).
11. Fact
- Performance of IR and NLP techniques depends on the characteristics of the dataset.
- Performance will vary with task, technique and language.
- Datasets really are significantly different:
  - Vital statistics
  - Sparseness
12. Description
13. Vital Stats
14-15. Type-to-Token Ratios (figures)
16. Assumption
- Successful (statistical?) techniques can be successfully ported to other languages:
  - Western European languages
  - Japanese, Chinese, Malay, ...
- WordSmith: effective use requires a 5M-word corpus (Garside 2000).
17. Type-to-Token Ratio
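As a rough illustration (not from the slides), here is a minimal Python sketch of how type-to-token ratios might be computed; the function names are hypothetical. Chunked TTR is one common way to keep the measure comparable across corpora of different sizes, since raw TTR falls as a corpus grows.

```python
from collections import Counter

def type_token_ratio(tokens):
    """Ratio of distinct word forms (types) to running words (tokens)."""
    return len(set(tokens)) / len(tokens)

def ttr_by_chunk(tokens, chunk_size=1000):
    """TTR over fixed-size chunks, so corpora of different sizes
    can be compared on an equal footing."""
    return [type_token_ratio(tokens[i:i + chunk_size])
            for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
```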
18. Cargo Cult Science?
19. Cargo Cult Science?
- Richard Feynman (1974):
- "It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated."
20. Cargo Cult Science?
- Richard Feynman (1974):
- "Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it."
- "In summary, the idea is to try to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another."
21. Cargo Cult Science?
- The role of data in the outcome of experiments should be clarified.
  - Why?
  - How?
24. Why explore the role of data?
- Methodological: replicability
  - Barbu and Mitkov (2001): anaphora resolution
  - Donaway et al. (2000): automatic summarisation
- Epistemological: theory induction
  - What is the relationship between data properties and technique performance?
- Practical: application
  - What is the relationship between two sets of data?
  - What is this dataset (language?) like?
25. How to explore the role of data?
- One way: profiling for bias
- Assumption: a collection will be biased w.r.t. technique and task
- Find measures that reflect bias
- Verify effects experimentally
26. How to explore the role of data?
- Profile standard collections
  - Adds to past experiments
- Profile new data
  - Gauge distance to known collections
  - Estimate effectiveness of techniques
28. Why Profile for Bias?
- And by the way, the others think it is vital (machine learning, data mining, pattern matching, etc.).
- And so did we! (Or do we?)
29. Profiling: An Abandoned Agenda?
- Sparck Jones (1973): Collection properties influencing automatic term classification performance. Information Storage and Retrieval, Vol. 9.
- Sparck Jones (1975): A performance yardstick for test collections. Journal of Documentation, 31(4).
30. What has changed?
- Sparck Jones (1973):
  - Is a collection usably classifiable?
    - Number of query terms which can be used for matching.
  - Is a collection usefully classifiable?
    - Number of useful, linked terms in a document or collection.
  - Is a collection classifiable?
    - Size of vocabulary and rate of incidence.
32. Profiling: An Abandoned Agenda
- Term weighting formula tailored to the query (Salton 1972)
- Stop word identification relative to collection/query (Wilbur and Sirotkin 1992; Yang and Wilbur 1996)
- Effect of collection homogeneity on language model quality (Rose and Haddock 1997)
35. What has changed?
- Proliferation of (test) collections
- More data per collection
- Increased application need
- Sparseness is only one kind of bias
- Better (ways of computing) measures?
37. Profiling Measures
- Requirements: measures should be
  - relevant to NLP techniques, given the task
  - fine-grained
  - cheap to implement
- Need to agree a framework
  - Fixed points:
    - Collections?
    - Properties?
    - Measures?
38. Profiling Measures
- Simple starting point:
  - Vital statistics
  - Zipf (sparseness, idiosyncrasy) (a quick rank-frequency check is sketched below)
  - Type-to-token ratio (sparseness, specialisation)
  - Manual sampling (quality, content)
- Refine?
  - Homogeneity?
  - Burstiness?
  - (Words and genre?)
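As an illustration of the kind of cheap measure meant here (my own sketch, with a hypothetical function name): fitting the rank-frequency curve on log-log axes. Zipf's law predicts a slope near -1, and large deviations hint at idiosyncratic or specialised vocabulary.

```python
import numpy as np
from collections import Counter

def zipf_slope(tokens):
    """Fit log(frequency) against log(rank); Zipf's law predicts
    a slope near -1 for natural-language text."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope
```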
39. Profiling Measures
- Homogeneity (or: how strong is the evidence defeating the homogeneity assumption?)
- Term distribution models (words!)
- Frequentist vs non-frequentist
- Very frequent terms (!!)
40. Very Frequent Terms
- Lots of them
- Reputedly noise-like (random? homogeneous?)
- Present in most datasets (useful for comparison)
- Stop word identification relative to collection/query is independently relevant (Wilbur and Sirotkin 1992; Yang and Wilbur 1996)
41. Homogeneity
- Homogeneity assumption:
  - Bag of words
  - Function word distribution
  - Content word distribution
- Measure of heterogeneity as a dataset profile
  - Kilgarriff and others, 1992 onwards
  - Measure distance between corpora
  - Identify genre
42. Heterogeneity Measures
- χ² (Kilgarriff 1997; Rose and Haddock 1997)
- G² (Rose and Haddock 1997; Rayson and Garside 2000)
- Correlation, Mann-Whitney (Kilgarriff 1996)
- Log-likelihood (Rayson and Garside 2000)
- Spearman's S (Rose and Haddock 1997)
- Kullback-Leibler divergence (Cavaglia 2002) (a minimal sketch follows)
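For concreteness, a minimal sketch (my own; the function name and the add-alpha smoothing are assumptions, not details from the talk) of KL divergence between the term distributions of two corpus halves:

```python
import numpy as np

def kl_divergence(p_counts, q_counts, vocab, alpha=1.0):
    """KL divergence between two term distributions, with add-alpha
    smoothing so unseen terms don't produce infinities."""
    p = np.array([p_counts.get(w, 0) + alpha for w in vocab], dtype=float)
    q = np.array([q_counts.get(w, 0) + alpha for w in vocab], dtype=float)
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```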
43. Measuring Heterogeneity
- Divide the corpus, using 5000-word chunks, into random halves (sketched below)
- Frequency list for each half
- Calculate χ² for term frequency distribution differences between the halves
- Normalise for corpus length
- Iterate over successive random halves
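A minimal Python sketch of this procedure as I read it from the slide; the function name, the restriction to the most frequent terms, and the particular length normalisation are my assumptions rather than details given in the talk.

```python
import random
from collections import Counter

def chi2_heterogeneity(tokens, chunk_size=5000, n_terms=500, iterations=10):
    """Kilgarriff-style heterogeneity: cut the corpus into chunks,
    shuffle them into random halves, and compute a chi-squared statistic
    over the most frequent terms; iterate and average."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    vocab = [w for w, _ in Counter(tokens).most_common(n_terms)]
    scores = []
    for _ in range(iterations):
        random.shuffle(chunks)
        half = len(chunks) // 2
        c1 = Counter(w for ch in chunks[:half] for w in ch)
        c2 = Counter(w for ch in chunks[half:] for w in ch)
        n1, n2 = sum(c1.values()), sum(c2.values())
        chi2 = 0.0
        for w in vocab:
            o1, o2 = c1[w], c2[w]
            e1 = (o1 + o2) * n1 / (n1 + n2)   # expected count in half 1
            e2 = (o1 + o2) * n2 / (n1 + n2)   # expected count in half 2
            if e1 > 0 and e2 > 0:
                chi2 += (o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2
        scores.append(chi2 / (n1 + n2))  # one plausible length normalisation
    return sum(scores) / len(scores)
```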
44. Measuring Heterogeneity
- Kilgarriff registers values of the χ² statistic
- A high value indicates high heterogeneity
- Finds high heterogeneity in all texts
45. Defeating the Homogeneity Assumption
- Assume the word distribution is homogeneous (bag of words)
- Explore chunk sizes:
  - Chunk size 1 → homogeneous (random)
  - Chunk size 5000 → heterogeneous (Kilgarriff 1997)
- χ² test (statistic + p-value)
- Defeat the assumption at a level of statistical significance
- Register differences between datasets
- Focus on frequent terms (!)
46. Homogeneity detection at a level of statistical significance
- p-value: evidence for/against the hypothesis
  - p < 0.1: weak evidence against
  - p < 0.05: significant (moderate evidence against the hypothesis)
  - p < 0.01: strong evidence against
  - p < 0.001: very strong evidence against
- Indication of statistically significant non-homogeneity (see the sketch below)
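To make the statistic/p-value pairing concrete, a minimal sketch (my own, with a hypothetical function name) using SciPy's standard chi-squared contingency test on the two halves' frequency lists. It assumes the counts are collections.Counter objects (which return 0 for unseen terms) and that vocab is restricted to terms seen in at least one half, so no column sums to zero.

```python
import numpy as np
from scipy.stats import chi2_contingency

def homogeneity_pvalue(counts1, counts2, vocab):
    """Rows = the two random halves, columns = terms; a small p-value
    is evidence against the homogeneity assumption for this division."""
    table = np.array([[counts1[w] for w in vocab],
                      [counts2[w] for w in vocab]])
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p
```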
47. Dividing a Corpus
- docDiv: place documents in random halves
  - term distribution across documents
- halfdocDiv: place half-documents in random halves
  - term distribution within the same document
- chunkDiv: place chunks (between 1 and 5000 words) in random halves
  - term distribution between text chunks (genre?)
- (All three divisions are sketched below.)
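A minimal Python sketch of the three divisions as I understand them; the names mirror the slide, but the exact construction (in particular sending a document's two halves to opposite sides in halfdocDiv, so the comparison is within documents) is my reading, not a detail spelled out here. Documents are represented as token lists.

```python
import random

def doc_div(docs):
    """docDiv: shuffle whole documents into two random halves."""
    docs = list(docs)
    random.shuffle(docs)
    mid = len(docs) // 2
    return docs[:mid], docs[mid:]

def halfdoc_div(docs):
    """halfdocDiv: split each document in two and send the halves to
    opposite sides (in random order), probing within-document variation."""
    left, right = [], []
    for doc in docs:
        mid = len(doc) // 2
        a, b = doc[:mid], doc[mid:]
        if random.random() < 0.5:
            a, b = b, a
        left.append(a)
        right.append(b)
    return left, right

def chunk_div(docs, chunk_size=5000):
    """chunkDiv: concatenate the corpus, cut it into fixed-size chunks,
    and shuffle the chunks into two random halves."""
    tokens = [w for doc in docs for w in doc]
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    random.shuffle(chunks)
    mid = len(chunks) // 2
    return chunks[:mid], chunks[mid:]
```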
48. Results: docDiv
49. Results: halfdocDiv
50. Results: chunkDiv (5)
51. Results: chunkDiv (100)
52. Results
- docDiv:
  - heterogeneity across most documents, except
  - AP and DOE (20 terms or fewer)
- halfdocDiv:
  - tests are sensitive to certain document types
  - DOE very homogeneous
  - PAT and OU very heterogeneous
- chunkDiv:
  - chunk length vs. document boundary?
  - similar behaviour of WSJ and SJM
- (Intranet data gives extreme results. How transferable is corpus-based training?)
53. Pro
- Is the heterogeneity test a reasonable profiling measure?
  - sensitive to document types, e.g. different behaviour for halfdocDiv
  - cheap to implement
  - relation between the measure and its p-value
54. Drawbacks
- Frequency-based
- Coarse-grained
- Not homogeneous: bursty
  - Bursty in what way?
  - Useful for applications?
55. Profiling by Measuring Burstiness
- The pioneers' agenda: clumps!
  - Sparck Jones and Needham 1964
- Models:
  - Two-Poisson (Church 2000)
  - K-mixtures (Katz 1996)
  - Exponential mixtures (Sarkar et al. 2006)
56. Sarkar Burstiness Model
- Model gaps between occurrences (not term occurrence counts)
- Mixture of two exponential distributions (density sketched below)
- Between-burst gaps: 1/λ1 (or λ1)
- Within-burst gaps: 1/λ2 (or λ2)
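The slides leave the mixture formula to a figure; here is a plausible reconstruction (my own, following the usual two-exponential gap mixture, with λ1 and λ2 as rates so that 1/λ is the mean gap; the slides quote both forms):

```python
import numpy as np

def gaps(positions):
    """Gaps (in tokens) between successive occurrences of a term."""
    return np.diff(np.asarray(positions))

def gap_density(x, p, lam1, lam2):
    """Two-exponential mixture over gap lengths x.
    lam1, lam2 are rates, so the mean gaps are 1/lam1 (between bursts)
    and 1/lam2 (within bursts); p weighs the between-burst component."""
    return p * lam1 * np.exp(-lam1 * x) + (1 - p) * lam2 * np.exp(-lam2 * x)
```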
57. Burstiness Model
- First occurrence
- No occurrence: censoring
58. Burstiness Model
- Bayesian estimation:
  - posterior ∝ prior × likelihood
  - choose an uninformative prior
  - estimate the posterior using Gibbs sampling (MCMC)
- WinBUGS software:
  - 1000-iteration burn-in
  - a further 5000 iterations for the estimate
- (A non-Bayesian alternative fit is sketched below.)
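The talk fits this model by Gibbs sampling in WinBUGS; as a self-contained stand-in, here is a maximum-likelihood EM fit of the same two-exponential mixture (a different estimation technique from the one used in the talk, and the function name is my own):

```python
import numpy as np

def fit_gap_mixture_em(x, iters=200):
    """EM for a two-component exponential mixture over gap lengths x.
    Component 1 starts with the smaller rate (longer mean gap),
    matching the between-burst role."""
    x = np.asarray(x, dtype=float)
    p, lam1, lam2 = 0.5, 0.5 / x.mean(), 2.0 / x.mean()
    for _ in range(iters):
        d1 = p * lam1 * np.exp(-lam1 * x)          # between-burst component
        d2 = (1 - p) * lam2 * np.exp(-lam2 * x)    # within-burst component
        r = d1 / (d1 + d2)                         # responsibility of comp. 1
        p = r.mean()
        lam1 = r.sum() / (r * x).sum()             # MLE rate updates
        lam2 = (1 - r).sum() / ((1 - r) * x).sum()
    return p, lam1, lam2
```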
59. Burstiness Model
- Word behaviour hypotheses:
  - Small λ1, small λ2: frequently occurring (function?) word
  - Large λ1, small λ2: bursty (content?) word
  - Small λ1, large λ2: frequent but well-spaced (function?) word
  - Large λ1, large λ2: infrequent, scattered (function?) word
- p: proportion of times a term does not occur in a burst
- 1−p: proportion of times a term appears in a burst
62. Very frequent function words
63. Less frequent function words
64. Style-indicative terms
65. Content terms
66. What now?
- Experimental verification
- Other aspects:
  - Coverage (narrow or broad)
  - Layout and metadata
  - Language
  - Links and mark-up
68. Conclusions
- NLP/IR has papered over the elephant in the room.
- Dataset profiling can be a useful way of augmenting known results.
- Profiles have to be relative to the task.
- Measures have to be informative.
- Finding effective profiling measures is a substantial, difficult, and essential research agenda.