Transcript and Presenter's Notes

Title: The Case for Corpus Profiling


1
The Case for Corpus Profiling
  • Anne De Roeck
  • (Udo Kruschwitz, Nick Webb, Abduelbaset Goweder,
    Avik Sarkar, Paul Garthwaite, Dawei Song)
  • Centre for Research in Computing
  • The Open University, Walton Hall,
  • Milton Keynes, MK7 6AA, UK.

2
Fact or Factoid? Hyperlinks
  • Hyperlinks do not significantly improve recall
    and precision in diverse domains, such as the
    TREC test data (Savoy and Pickard 1999, Hawking
    et al 1999).

3
Fact or Factoid? Hyperlinks
  • Hyperlinks do not significantly improve recall
    and precision in diverse domains, such as the
    TREC test data (Savoy and Pickard 1999, Hawking
    et al 1999).
  • Hyperlinks do significantly improve recall and
    precision in narrow domains and Intranets (Chen
    et al 1999, Kruschwitz 2001).

4
Fact or Factoid? Stemming
  • Stemming does not improve effectiveness of
    retrieval (Harman 1991)

5
Fact or Factoid? Stemming
  • Stemming does not improve effectiveness of
    retrieval (Harman 1991)
  • Stemming improves performance for morphologically
    complex languages (Popovic and Willett 1992)

6
Fact or Factoid? Stemming
  • Stemming does not improve effectiveness of
    retrieval (Harman 1991)
  • Stemming improves performance for morphologically
    complex languages (Popovic and Willett 1992)
  • Stemming improves performance on short documents
    (Krovetz 1993)

7
Fact or Factoid? Long or Short
  • Stemming improves performance on short documents
    (Krovetz 1993)
  • Short keyword based queries behave differently
    from long structured queries (Fujii and Croft
    1999)
  • Keyword based retrieval works better on long
    texts (Jurafsky and Martin 2000)

8
Fact
  • Performance of IR and NLP techniques depends on
    the characteristics of the dataset.

9
Fact
  • Performance of IR and NLP techniques depends on
    the characteristics of the dataset.
  • Performance will vary with task, technique and
    language.

10
Fact
  • Performance of IR and NLP techniques depends on
    the characteristics of the dataset.
  • Performance will vary with task, technique and
    language.
  • Datasets really are significantly different.

11
Fact
  • Performance of IR and NLP techniques depends on
    the characteristics of the dataset.
  • Performance will vary with task, technique and
    language.
  • Datasets really are significantly different.
  • Vital Statistics
  • Sparseness

12
Description
13
Vital Stats
14
Type to Token Ratios
15
Type to Token Ratios
16
Assumption
  • Successful (statistical?) techniques can be
    successfully ported to other languages.
  • Western European languages
  • Japanese, Chinese, Malay,
  • WordSmith: effective use requires a 5M-word corpus
    (Garside 2000)

17
Type to Token ratio
18
Cargo Cult Science?
  • Richard Feynman (1974)

19
Cargo Cult Science?
  • Richard Feynman (1974)
  • It's a kind of scientific integrity, a
    principle of scientific thought that corresponds
    to a kind of utter honesty--a kind of leaning
    over backwards. For example, if you're doing an
    experiment, you should report everything that you
    think might make it invalid--not only what you
    think is right about it: other causes that could
    possibly explain your results; and things you
    thought of that you've eliminated by some other
    experiment, and how they worked--to make sure the
    other fellow can tell they have been eliminated.

20
Cargo Cult Science?
  • Richard Feynman (1974)
  • Details that could throw doubt on your
    interpretation must be given, if you know them.
    You must do the best you can--if you know
    anything at all wrong, or possibly wrong--to
    explain it.
  • In summary, the idea is to give all of the
    information to help others to judge the value of
    your contribution; not just the information that
    leads to judgement in one particular direction or
    another.

21
Cargo Cult Science?
  • The role of data in the outcome of experiments
    should be clarified
  • Why?
  • How?

22
Why explore the role of data?
  • Methodological: Replicability
  • Barbu and Mitkov (2001): Anaphora resolution
  • Donaway et al (2000): Automatic Summarisation

23
Why explore the role of data?
  • Methodological: Replicability
  • Barbu and Mitkov (2001): Anaphora resolution
  • Donaway et al (2000): Automatic Summarisation
  • Epistemological: Theory induction
  • What is the relationship between data properties
    and technique performance?

24
Why explore the role of data?
  • Methodological: Replicability
  • Barbu and Mitkov (2001): Anaphora resolution
  • Donaway et al (2000): Automatic Summarisation
  • Epistemological: Theory induction
  • What is the relationship between data properties
    and technique performance?
  • Practical: Application
  • What is the relationship between two sets of data?
  • What is this dataset (language?) like?

25
How to explore the role of data?
  • One way: Profiling for Bias
  • Assumption: Collection will be biased w.r.t.
    technique and task
  • Find measures that reflect bias
  • Verify effects experimentally

26
How to explore the role of data?
  • Profile standard collections
  • Adds to past experiments
  • Profile new data
  • Gauge distance to known collections
  • Estimate effectiveness of techniques

27
Why Profile for Bias?
  • And by the way, the others think it is vital.
  • (Machine Learning, Data Mining, Pattern Matching
    etc.)

28
Why Profile for Bias?
  • And by the way, the others think it is vital.
  • (Machine Learning, Data Mining, Pattern Matching
    etc.)
  • And so did we! (or do we?)

29
Profiling: An Abandoned Agenda?
  • Sparck-Jones (1973)
  • Collection properties influencing automatic
    term classification performance. Information
    Storage and Retrieval, Vol 9.
  • Sparck-Jones (1975)
  • A performance yardstick for test collections.
    Journal of Documentation, 31(4).

30
What has changed?
  • Sparck-Jones (1973)
  • Is a collection useably classifiable?
  • Number of query terms which can be used for
    matching.
  • Is a collection usefully classifiable?
  • Number of useful, linked terms in document or
    collection
  • Is a collection classifiable?
  • Size of vocabulary and rate of incidence

31
(No Transcript)
32
Profiling: An Abandoned Agenda
  • Term weighting formula tailored to query
  • Salton 1972
  • Stop word identification relative to
    collection/query
  • Wilbur & Sirotkin 1992; Yang & Wilbur 1996
  • Effect of collection homogeneity on language
    model quality
  • Rose & Haddock 1997

33
What has changed?
  • Proliferation of (test) collections
  • More data per collection
  • Increased application need

34
What has changed?
  • Proliferation of (test) collections
  • More data per collection
  • Increased application need
  • Sparseness is only one kind of bias

35
What has changed?
  • Proliferation of (test) collections
  • More data per collection
  • Increased application need
  • Sparseness is only one kind of bias
  • Better (ways of computing) measures?

36
Profiling Measures
  • Requirements: measures should be
  • relevant to NLP techniques, given the task
  • fine grained
  • cheap to implement

37
Profiling Measures
  • Requirements: measures should be
  • relevant to NLP techniques, given the task
  • fine grained
  • cheap to implement
  • Need to agree a framework
  • Fixed points
  • Collections?
  • Properties?
  • Measures?

38
Profiling Measures
  • Simple starting point:
  • Vital Statistics (see the sketch after this slide)
  • Zipf (sparseness, idiosyncrasy)
  • Type to token ratio (sparseness, specialisation)
  • Manual sampling (quality, content)
  • Refine?
  • Homogeneity?
  • Burstiness?
  • (Words and Genre?)

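A rough sketch (not from the presentation) of the two cheapest measures named above, type-to-token ratio and a crude Zipf fit; the function name, the output fields and the naive whitespace tokenisation are my assumptions.

```python
# Type-to-token ratio plus a crude Zipf fit: the least-squares slope of
# log(frequency) against log(rank), which sits near -1 for "typical"
# Zipfian text. Assumes naive whitespace tokenisation.
from collections import Counter
import math

def vital_stats(text):
    tokens = text.lower().split()
    freqs = Counter(tokens)
    n_types, n_tokens = len(freqs), len(tokens)
    ttr = n_types / n_tokens                     # type-to-token ratio

    ranked = sorted(freqs.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, n_types + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x, mean_y = sum(xs) / n_types, sum(ys) / n_types
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    return {"tokens": n_tokens, "types": n_types,
            "type_token_ratio": ttr, "zipf_slope": slope}
```
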
39
Profiling Measures
  • Homogeneity (or: how strong is the evidence
    defeating the homogeneity assumption?)
  • Term Distribution Models (Words!)
  • Frequentist vs non-frequentist
  • Very frequent terms (!!)

40
Very Frequent Terms
  • Lots of them
  • Reputedly noise-like (random? homogeneous?)
  • Present in most datasets (comparison)
  • Stop word identification relative to
    collection/query is independently relevant
  • Wilbur & Sirotkin 1992; Yang & Wilbur 1996

41
Homogeneity
  • Homogeneity Assumption
  • Bag of Words
  • Function word distribution
  • Content word distribution
  • Measure of Heterogeneity as dataset profile
  • Kilgarriff & others, 1992 onwards
  • Measure distance between corpora
  • Identify genre

42
Heterogeneity Measures
  • χ² (Kilgarriff 1997; Rose & Haddock 1997)
  • G² (Rose & Haddock 1997; Rayson & Garside 2000)
  • Correlation, Mann-Whitney (Kilgarriff 1996)
  • Log-likelihood (Rayson & Garside 2000)
  • Spearman's S (Rose & Haddock 1997)
  • Kullback-Leibler divergence (Cavaglia 2002)
    (sketches of χ² and KL divergence after this slide)

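Two of the listed measures, χ² and Kullback-Leibler divergence, are easy to state concretely over a pair of word-frequency dictionaries. The sketch below is an illustration only; the function names, the add-k smoothing constant and the representation of the two corpus halves are my assumptions, not details from the cited papers.

```python
# Compare two corpus halves represented as {word: count} dictionaries.
import math
from collections import Counter

def chi_squared(freqs_a, freqs_b):
    """Pearson chi-squared over the combined vocabulary of two halves."""
    n_a, n_b = sum(freqs_a.values()), sum(freqs_b.values())
    total = n_a + n_b
    stat = 0.0
    for w in set(freqs_a) | set(freqs_b):
        o_a, o_b = freqs_a.get(w, 0), freqs_b.get(w, 0)
        e_a = (o_a + o_b) * n_a / total      # expected counts under homogeneity
        e_b = (o_a + o_b) * n_b / total
        stat += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return stat

def kl_divergence(freqs_p, freqs_q, smoothing=0.5):
    """KL(P || Q) with add-k smoothing so unseen words do not give log(0)."""
    vocab = set(freqs_p) | set(freqs_q)
    n_p = sum(freqs_p.values()) + smoothing * len(vocab)
    n_q = sum(freqs_q.values()) + smoothing * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (freqs_p.get(w, 0) + smoothing) / n_p
        q = (freqs_q.get(w, 0) + smoothing) / n_q
        kl += p * math.log(p / q)
    return kl
```
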
43
Measuring Heterogeneity
  • Divide corpus into 5000-word chunks, placed in
    random halves
  • Frequency list for each half
  • Calculate χ² for term frequency distribution
    differences between halves
  • Normalise for corpus length
  • Iterate over successive random halves
    (see the sketch after this slide)

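A sketch of the chunk-and-split procedure above, reusing the illustrative chi_squared() from the earlier sketch; the chunk size, the number of random splits and the normalisation by token count are assumptions about the details rather than the presenters' exact settings.

```python
# Deal fixed-size chunks into two random halves, compare the halves'
# frequency lists with chi-squared, and repeat over random splits.
import random
from collections import Counter

def heterogeneity(tokens, chunk_size=5000, iterations=10, seed=0):
    rng = random.Random(seed)
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    scores = []
    for _ in range(iterations):
        rng.shuffle(chunks)
        half = len(chunks) // 2
        freqs_a = Counter(w for chunk in chunks[:half] for w in chunk)
        freqs_b = Counter(w for chunk in chunks[half:] for w in chunk)
        stat = chi_squared(freqs_a, freqs_b)   # from the earlier sketch
        # normalise for corpus length (one simple choice of normalisation)
        scores.append(stat / (sum(freqs_a.values()) + sum(freqs_b.values())))
    return sum(scores) / len(scores)
```
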
44
Measuring Heterogeneity
  • Kilgarriff registers values of the χ² statistic
  • High value indicates high heterogeneity
  • Finds high heterogeneity in all texts

45
Defeating the Homogeneity Assumption
  • Assume word distribution is homogeneous (bag of
    words)
  • Explore chunk sizes
  • Chunk size 1 -> homogeneous (random)
  • Chunk size 5000 -> heterogeneous (Kilgarriff 1997)
  • χ² test (statistic + p-value)
  • Defeat assumption with statistical significance
  • Register differences between datasets
  • Focus on frequent terms (!)

46
Homogeneity detection at a level of statistical
significance
  • p-value: evidence for/against the hypothesis
  • < 0.1 -- weak evidence against
  • < 0.05 -- significant (moderate evidence
    against the hypothesis)
  • < 0.01 -- strong evidence against
  • < 0.001 -- very strong evidence against
  • Indication of statistically significant
    non-homogeneity (see the sketch after this slide)

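Turning the comparison into a significance test can be done with an off-the-shelf χ² contingency test. A minimal sketch, assuming SciPy is available and that the two halves are given as word-frequency dictionaries (my representation, not the authors'):

```python
# Build a 2 x |vocabulary| contingency table of term counts in the two
# halves and read off the chi-squared p-value.
from scipy.stats import chi2_contingency

def homogeneity_p_value(freqs_a, freqs_b):
    vocab = sorted(set(freqs_a) | set(freqs_b))
    table = [[freqs_a.get(w, 0) for w in vocab],
             [freqs_b.get(w, 0) for w in vocab]]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value   # small p-value: evidence against homogeneity
```
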
47
Dividing a Corpus
  • docDiv: place documents in random halves
  • term distribution across documents
  • halfdocDiv: place half-documents in random halves
  • term distribution within the same document
  • chunkDiv: place chunks (between 1 and 5000 words)
    in random halves
  • term distribution between text chunks (genre?)
    (sketches of the three divisions after this slide)

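Hedged sketches of the three division strategies; the function names mirror the slide labels, but the exact splitting rules used in the original experiments (for instance, where a document is cut in half) are my assumptions. Documents are taken to be token lists.

```python
import random

def doc_div(documents, seed=0):
    """docDiv: deal whole documents into two random halves."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    half = len(docs) // 2
    return docs[:half], docs[half:]

def halfdoc_div(documents, seed=0):
    """halfdocDiv: cut each document in two, deal the halves randomly."""
    pieces = []
    for doc in documents:
        mid = len(doc) // 2
        pieces.extend([doc[:mid], doc[mid:]])
    random.Random(seed).shuffle(pieces)
    half = len(pieces) // 2
    return pieces[:half], pieces[half:]

def chunk_div(tokens, chunk_size, seed=0):
    """chunkDiv: cut the running text into fixed-size chunks (1..5000 words)."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    random.Random(seed).shuffle(chunks)
    half = len(chunks) // 2
    return chunks[:half], chunks[half:]
```
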
48
Results DocDiv
49
Results HalfDocDiv
50
Results ChunkDiv (5)
51
Results ChunkDiv (100)
52
Results
  • docDiv
  • heterogeneity across most documents, except
  • AP and DOE (20 terms or fewer)
  • halfdocDiv
  • tests sensitive to certain document types
  • DOE very homogeneous
  • PAT and OU very heterogeneous
  • chunkDiv
  • chunk length vs. document boundary?
  • similar behaviour of WSJ and SJM
  • (Intranet data gives extreme results. How
    transferable is corpus based training?)

53
Pro
  • Heterogeneity test: a reasonable profiling measure?
  • sensitive to document types
  • e.g. different behaviour for halfdocDiv
  • cheap to implement
  • relation between measure and p-value

54
Drawbacks
  • Frequency based
  • Coarse grained
  • Not homogeneous, but bursty
  • Bursty in what way?
  • Useful for applications?

55
Profiling by Measuring Burstiness
  • Pioneers' agenda: Clumps!
  • Sparck-Jones & Needham 1964
  • Models
  • Two-Poisson (Church 2000)
  • K-mixtures (Katz 1996)
  • Exponential mixtures (Sarkar et al 2006)

56
Sarkar Burstiness Model
  • Model gaps (not term occurrence)
  • Mixture of exponential distributions
  • Between-burst gaps (mean 1/λ1, rate λ1)
  • Within-burst gaps (mean 1/λ2, rate λ2)
    (see the sketch after this slide)

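A minimal statement of the gap model above: gaps between successive occurrences of a term are drawn from a two-component exponential mixture, a between-burst component with rate λ1 and a within-burst component with rate λ2, weighted by p and 1-p. The helper names are illustrative, not from the paper.

```python
import math

def gaps(positions):
    """Gaps between successive occurrences of a term in a token stream."""
    return [b - a for a, b in zip(positions, positions[1:])]

def gap_density(gap, p, lambda1, lambda2):
    """Mixture density of a gap: p weights the between-burst component."""
    between = lambda1 * math.exp(-lambda1 * gap)   # between-burst component
    within = lambda2 * math.exp(-lambda2 * gap)    # within-burst component
    return p * between + (1 - p) * within
```
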
57
Burstiness Model
  • First occurrence
  • No occurrence: censoring

58
Burstiness Model
  • Bayesian estimation
  • posterior ∝ prior × likelihood
  • choose uninformative prior
  • estimate posterior using Gibbs Sampling (MCMC)
  • WinBUGS software
  • 1000-iteration burn-in
  • further 5000 iterations for estimate
    (see the sketch after this slide)

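The slides estimate the posterior with Gibbs sampling in WinBUGS. As a self-contained stand-in, below is a sketch of a random-walk Metropolis sampler in NumPy for the same three parameters (p, λ1, λ2), with a flat prior on the valid region and the burn-in and sample counts quoted above; it illustrates the estimation idea and is not the authors' model code.

```python
import numpy as np

def log_likelihood(gap_data, p, lam1, lam2):
    """Log-likelihood of the gaps under the two-exponential mixture."""
    dens = p * lam1 * np.exp(-lam1 * gap_data) + \
           (1 - p) * lam2 * np.exp(-lam2 * gap_data)
    return np.sum(np.log(dens))

def metropolis_fit(gap_data, burn_in=1000, samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    gap_data = np.asarray(gap_data, dtype=float)
    theta = np.array([0.5, 0.01, 0.5])            # initial (p, lambda1, lambda2)
    steps = np.array([0.02, 0.002, 0.05])         # random-walk step sizes
    current = log_likelihood(gap_data, *theta)
    kept = []
    for i in range(burn_in + samples):
        proposal = theta + steps * rng.standard_normal(3)
        p, lam1, lam2 = proposal
        if 0 < p < 1 and lam1 > 0 and lam2 > 0:   # flat prior on this region
            candidate = log_likelihood(gap_data, *proposal)
            if np.log(rng.random()) < candidate - current:
                theta, current = proposal, candidate
        if i >= burn_in:
            kept.append(theta.copy())
    return np.mean(kept, axis=0)                  # posterior-mean estimate
```
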
59
Burstiness Model
  • Word behaviour hypotheses
  • Small λ1, small λ2: frequently occurring
    (function?) word
  • Large λ1, small λ2: bursty (content?) word
  • Small λ1, large λ2: frequent but well-spaced
    (function?) word
  • Large λ1, large λ2: infrequent, scattered
    (function?) word
  • p: proportion of times the term does not occur in
    a burst
  • 1-p: proportion of times the term appears in a
    burst
    (see the sketch after this slide)

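A literal reading of the table above as code. Whether the slide's λ values denote rates or mean gaps is ambiguous in the transcript; this sketch compares mean gap lengths (1/rate), which matches the "small values mean a frequent word" reading, and the threshold separating "small" from "large" is an arbitrary placeholder.

```python
def describe_word(mean_between_gap, mean_within_gap, threshold=100.0):
    """Qualitative word-behaviour label from fitted mean gap lengths."""
    small_between = mean_between_gap < threshold
    small_within = mean_within_gap < threshold
    if small_between and small_within:
        return "frequently occurring (function?) word"
    if not small_between and small_within:
        return "bursty (content?) word"
    if small_between and not small_within:
        return "frequent but well-spaced (function?) word"
    return "infrequent, scattered (function?) word"
```
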
60
(No Transcript)
61
(No Transcript)
62
Very frequent function words
63
Less frequent function words
64
Style indicative terms
65
Content terms
66
What now?
  • Experimental verification.
  • Other aspects
  • Coverage (narrow or broad)
  • Lay-out and meta data
  • Language
  • Links and mark-up

67
Conclusions
68
Conclusions
  • NLP/IR has papered over the elephant in the room
  • Dataset profiling can be a useful way of
    augmenting known results
  • Profiles have to be relative to task
  • Measures have to be informative
  • Finding effective profiling measures is a
    substantial, difficult and essential research agenda