
1
Statistical Measures for Corpus Profiling
  • Michael P. Oakes
  • University of Sunderland
  • Corpus Profiling Workshop, 2008.

2
Contents
  • Why study differences between corpora?
    (Kilgarriff, 2001)
  • Case Study in parsing (Sekine, 1997).
  • Words and countable linguistic features.
  • Overall differences between corpora and
    contributions of individual features
  • Information theory
  • Chi-squared test
  • Factor Analysis
  • Gold standard comparison of measures
    (Kilgarriff, 2001).

3
Why study differences between corpora?
  • Kilgarriff (2001), Comparing Corpora, Int. J.
    Corpus Linguistics 6(1), pp. 97-133.
  • Taxonomise the field: how does a new corpus
    stand in relation to existing ones?
  • If an interesting finding is made for one
    corpus, for what other corpora does it hold?
  • Is a new corpus sufficiently different from ones
    you have already got to be worth acquiring?
  • Difficulty in porting a new corpus to an
    existing NLP system: time and cost are
    measurable.

4
Different Text Types
  • Englishes of the world, e.g. US vs. UK (Hofland
    and Johannson, 1982)
  • Social differentiation e.g. gender, age, social
    class (Rayson, Leech and Hodges 1997),
    diachronic, geographical location.
  • Stylometry, e.g. disputed authorship
  • Genre analysis, e.g. science fiction, e-shop
    (Santini, 2006)
  • Sentiment analysis (Westerveld, 2008).
  • Relevant vs. non-relevant documents?
    Probabilistic IR.
  • Statistical techniques exist to discriminate
    between these text types. Here the interest is in
    the types of language per se, rather than their
    amenability to NLP tools.

5
Words and countable linguistic features
  • Bits of words e.g. 2-grams (Kjell, 1994)
  • Words (many studies)
  • Linguistic features for Factor Analysis (Biber,
    1995) e.g. questions, past participles.
  • Phrase rewrite rules (Sekine 1997, Baayen, van
    Halteren and Tweedie, 1996).
  • Any countable feature characteristic of one
    corpus as opposed to another.
  • Not hapax legomena, e.g. Semitisms in the New
    Testament.

6
Domain independence of parsing (Sekine, 1997)
  • Used 8 genres from the Brown Corpus, chosen to
    give equal amount of fiction (KLNP) and
    non-fiction (ABEJ).
  • Characterised domains by production rules which
    fire.
  • From this data produced a matrix of Cross Entropy
    of grammar across domains.
  • Then average linking of the domains based on the
    matrix of cross entropy gave intuitively
    reasonable results.
  • Evaluated (training / test) corpus difference on
    parser performance.
  • Discussed size of the training corpus.
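Sekine's comparison can be sketched as follows: each domain is treated as a probability distribution over the production rules that fire, and the cross entropy H(T, M) = −Σ_r p_T(r) log₂ p_M(r) is computed for each test/model pair. A minimal Python sketch; the rule counts and the smoothing floor are illustrative assumptions, not Sekine's actual data:

```python
import math

def cross_entropy(p_counts, q_counts, smooth=1e-9):
    """Cross entropy H(p, q) = -sum_r p(r) * log2(q(r)) over rules.

    p_counts: rule frequencies in the test domain (p),
    q_counts: rule frequencies in the model domain (q).
    A tiny floor on q avoids log(0) for rules unseen in the model.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    h = 0.0
    for rule, count in p_counts.items():
        p = count / p_total
        q = q_counts.get(rule, 0) / q_total
        h -= p * math.log2(max(q, smooth))
    return h

# Hypothetical rule counts, loosely echoing the slide's top rules
domain_a = {"PP -> IN NP": 840, "NP -> NN PX": 542, "S -> NP VP": 428}
domain_b = {"PP -> IN NP": 579, "NP -> PRP": 952, "S -> NP VP": 577}

h_ab = cross_entropy(domain_a, domain_b)  # test on A, model from B
```

As in Sekine's matrix, a domain scores lowest against a model built from itself, so the diagonal of such a matrix is its smallest entry in each row.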

7
Broad Text Category  Genre                                   Texts in Brown  Texts in LOB
Press                A  Reportage                                 44              44
                     B  Editorial                                 27              27
                     C  Reviews                                   17              17
General Prose        D  Religion                                  17              17
                     E  Skills, Trades, Hobbies                   36              38
                     F  Popular Lore                              48              44
                     G  Belles Lettres, Biographies, Essays       75              77
                     H  Miscellaneous                             30              30
                     J  Academic Prose                            80              80
Fiction              K  General Fiction                           29              29
                     L  Mystery and Detective                     24              24
                     M  Science Fiction                            6               6
                     N  Adventure and Western                     29              29
                     P  Romance and Love Story                    29              29
                     R  Humour                                     9               9
8
Sekine characterised domains by production rules
which fire
Domain A               Domain B
PP → IN NP (8.40)      NP → PRP (9.52)
NP → NN PX (5.42)      PP → IN NP (5.79)
S → S (5.06)           S → NP VP (5.77)
S → NP VP (4.28)       S → S (5.37)
NP → DT NNX (3.81)     NP → DT NNX (3.90)
9
Sekine Cross-Entropy of Grammar Across Domains
T/M A B E J K L N P
A 5.13 5.35 5.41 5.45 5.51 5.52 5.53 5.55
B 5.47 5.19 5.50 5.51 5.55 5.58 5.60 5.60
E 5.50 5.48 5.20 5.48 5.58 5.59 5.58 5.61
J 5.39 5.37 5.35 5.15 5.52 5.57 5.58 5.61
K 5.32 5.25 5.31 5.41 4.95 5.14 5.15 5.17
L 5.32 5.25 5.31 5.41 4.95 5.14 5.15 5.17
N 5.29 5.25 5.28 5.43 5.10 5.06 4.89 5.12
P 5.43 5.36 5.40 5.55 5.23 5.21 5.21 5.00
10
(No Transcript)
11
(No Transcript)
12
Overall differences between corpora and
contributions of individual features.
  • Vocabulary richness (e.g. type/token ratio,
    Yule's K characteristic, V2/N) is a
    characteristic of the entire corpus. Puts all
    corpora on a linear scale.
  • The techniques we will look at (chi-squared,
    information theoretic and factor analysis) can
    both give a value for the overall difference
    between two corpora, and quantify the
    contributions made by individual features.

13
Measures of Vocabulary Richness
  • Yule's K characteristic: K = 10,000 × (M2 − M1)
    / (M1 × M1), where M1 = number of tokens and
    M2 = (V1 × 1²) + (V2 × 2²) + (V3 × 3²) + …,
    with Vi the number of types occurring i times.
  • Gerson 35.9, Kempis 59.7, De Imitatione Christi
    84.2
  • Heaps' Law: vocabulary size as a function of
    text size, M = kT^b. Parameters k and b could
    discriminate texts, and allow them to be plotted
    in two dimensions.
  • Entropy is a form of vocabulary richness (but
    high individual contributions from both common
    and rare words).
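Both measures are straightforward to compute from a frequency list. A minimal Python sketch; the sample text and the Heaps parameters are illustrative assumptions:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10,000 * (M2 - M1) / M1^2, where M1 is the number of
    tokens and M2 = sum over i of V_i * i^2 (V_i = types occurring
    i times)."""
    freqs = Counter(tokens)
    m1 = sum(freqs.values())
    spectrum = Counter(freqs.values())  # maps i -> V_i
    m2 = sum(v * i * i for i, v in spectrum.items())
    return 10000.0 * (m2 - m1) / (m1 * m1)

def heaps_vocab(n_tokens, k=10.0, b=0.5):
    """Heaps' law: expected vocabulary size M = k * T^b."""
    return k * n_tokens ** b

text = "the cat sat on the mat and the dog sat too".split()
k_value = yules_k(text)
```

A higher K indicates a more repetitive vocabulary, which is why the three texts above fall on a single linear scale.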

14
The chi-squared test (Oakes and Farrow, 2006)
(O − E)² / E values for three words in five
balanced corpora (Σ (O − E)² / E = 414916.8)
Australian British US Indian NZ
A 12.68 1.36 2.55 76.65 8.33
Commonwealth 399.63 31.20 32.95 19.84 2.16
zzzzooop - - - - -
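The per-cell (O − E)²/E contributions come from a standard contingency-table chi-squared, with expected counts derived from row and column totals. A minimal sketch with hypothetical word counts (not the Oakes and Farrow data):

```python
def chi_squared_contributions(table):
    """Per-cell (O - E)^2 / E for a words-by-corpora contingency table.

    table: dict mapping word -> list of observed counts, one per corpus.
    E for a cell is row_total * column_total / grand_total.
    """
    words = list(table)
    n_corpora = len(next(iter(table.values())))
    col_totals = [sum(table[w][j] for w in words) for j in range(n_corpora)]
    grand = sum(col_totals)
    contrib = {}
    for w in words:
        row_total = sum(table[w])
        contrib[w] = []
        for j in range(n_corpora):
            e = row_total * col_totals[j] / grand
            o = table[w][j]
            contrib[w].append((o - e) ** 2 / e)
    return contrib

# Hypothetical counts of two words in three corpora
counts = {"commonwealth": [400, 30, 35], "kangaroo": [50, 5, 5]}
cells = chi_squared_contributions(counts)
chi2 = sum(sum(row) for row in cells.values())  # overall difference
```

Summing the cells gives the overall difference between the corpora, while individual cells show which words drive it.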
15
Measures from Information Theory (Dagan et al.,
1997)
  • Kullback-Leibler (KL) divergence (also called
    relative entropy), used as a measure of semantic
    similarity by Dagan et al., 1997.
  • Has a meaning in coding theory.
  • Problems: we get a value of infinity if there is
    a word with frequency 0 in corpus B and > 0 in
    corpus A, and the measure is not symmetrical.
  • Dagan (1997): Information Radius.
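Both the infinity problem and the fix can be shown directly. Information radius is computed here as D(p‖m) + D(q‖m) with m the average of the two distributions, following Dagan et al.; the toy word distributions are hypothetical:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)).
    Returns infinity if some x has p(x) > 0 but q(x) = 0."""
    d = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        d += px * math.log2(px / qx)
    return d

def information_radius(p, q):
    """IRad(p, q) = D(p || m) + D(q || m), with m = (p + q) / 2.
    Symmetric and always finite: m(x) > 0 wherever p(x) or q(x) > 0."""
    m = {x: (p.get(x, 0.0) + q.get(x, 0.0)) / 2 for x in set(p) | set(q)}
    return kl_divergence(p, m) + kl_divergence(q, m)

# Hypothetical relative frequencies in two corpora
p = {"her": 0.5, "she": 0.3, "gun": 0.2}
q = {"her": 0.4, "she": 0.4, "love": 0.2}
```

Here kl_divergence(p, q) is infinite because "gun" never occurs in q, while information_radius(p, q) stays finite and is the same in both directions.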

16
Information Radius
  • L (Fiction: detective) and P (Fiction: romance):
    0.180
  • A (Press: reportage) and B (Press: editorial):
    0.257
  • J (Academic prose) and P (Fiction: romance):
    0.572

17
Detective versus Romantic Fiction
Word  Detective  Romance    Word  Detective  Romance
The .00821 -.00732 Her .00819 -.00522
Of .00308 -.00277 She .00784 -.00535
A .00280 -.00257 You .00453 -.00345
Was .00180 -.00172 To .00235 -.00229
It .00161 -.00148 Be .00128 -.00110
He .00157 -.00148 They .00126 -.00097
On .00110 -.00099 Would .00121 -.00097
Been .00106 -.00089 Are .00087 -.00056
Man .00089 -.00061 Your .00084 -.00062
Money .00065 -.00034 Love .00081 -.00039
18
Factor Analysis
  • Decathlon analogy: running, jumping and
    throwing.
  • Biber (1988): groups of countable features which
    consistently co-occur in texts are said to
    define a linguistic dimension.
  • Such features are said to have positive loadings
    with respect to that dimension, but dimensions
    can also be defined by features which are in
    complementary distribution, i.e. negatively
    loaded.
  • Example: at one pole are many pronouns and
    contractions, near which lie conversational
    texts and panel discussions. At the other pole
    (few pronouns and contractions) are scientific
    texts and fiction.
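Full factor analysis extracts dimensions from the correlation matrix of all features across texts; as a minimal illustration of "consistent co-occurrence", the Pearson correlation between two feature-count series already shows positive vs. negative loading behaviour. The per-text counts below are hypothetical:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two feature-count series across texts."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-text counts of three features in five texts
pronouns        = [40, 35, 38, 5, 4]   # high in conversational texts
contractions    = [22, 20, 25, 1, 2]   # co-occurs with pronouns
nominalisations = [3, 5, 2, 30, 28]    # complementary distribution

r_pos = pearson(pronouns, contractions)     # both load positively
r_neg = pearson(pronouns, nominalisations)  # loads negatively
```

Features that correlate strongly and positively would load on the same pole of a dimension; strongly negative correlation marks the opposite pole.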

19
(No Transcript)
20
Evaluation of Measures (Kilgarriff 2001)
  • Reference corpus made up of known proportions of
    two corpora: 100% A, 0% B; 90% A, 10% B; 80% A,
    20% B
  • This gives a set of gold-standard judgements:
    subcorpus 1 is more like subcorpus 2 than
    subcorpus 3, etc.
  • Compare the machine ranking of corpora with the
    gold-standard ranking using Spearman's rank
    correlation coefficient.
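The comparison of the two rankings can be sketched with the difference-of-ranks formula for Spearman's coefficient (assuming no ties; the rankings below are hypothetical):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation via the difference-of-ranks formula
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).  Assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Gold-standard similarity ranking vs. a measure's ranking (hypothetical)
gold    = [1, 2, 3, 4, 5]
measure = [1, 3, 2, 4, 5]
rho = spearman_rho(gold, measure)
```

A rho of 1 means the measure reproduces the gold-standard ordering exactly; the closer to 1, the better the measure.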

21
Conclusions
  • Some measures allow comparisons of entire
    corpora; others enable the identification of
    typical features.
  • Different measures allow different kinds of
    maps: vocabulary richness allows ranking of
    corpora on a linear scale, Heaps' Law a 2D map
    of two parameters. Information-theoretic
    measures give the (dis)similarity between two
    corpora, best viewed using clustering. With
    Factor Analysis, you don't know what the
    dimensions are until you've done it.
  • Maps enable contours of application success.
  • Maps enable contours of application success.