Title: Statistical Measures for Corpus Profiling
1Statistical Measures for Corpus Profiling
- Michael P. Oakes
- University of Sunderland
- Corpus Profiling Workshop, 2008.
2Contents
- Why study differences between corpora? (Kilgarriff, 2001)
- Case study in parsing (Sekine, 1997).
- Words and countable linguistic features.
- Overall differences between corpora and contributions of individual features
- Information theory
- Chi-squared test
- Factor Analysis
- Gold standard comparison of measures (Kilgarriff, 2001).
3Why study differences between corpora?
- Kilgarriff (2001), Comparing Corpora, Int. J. Corpus Linguistics 6(1), pp. 97-133.
- Taxonomise the field: how does a new corpus stand in relation to existing ones?
- If an interesting finding is found for one corpus, for what other corpora does it hold?
- Is a new corpus sufficiently different from ones you have already got to be worth acquiring?
- Difficulty in porting a new corpus to an existing NLP system: time and cost are measurable.
4Different Text Types
- Englishes of the world, e.g. US vs. UK (Hofland and Johansson, 1982)
- Social differentiation, e.g. gender, age, social class (Rayson, Leech and Hodges, 1997), diachronic, geographical location.
- Stylometry, e.g. disputed authorship
- Genre analysis, e.g. science fiction, e-shop (Santini, 2006)
- Sentiment analysis (Westerveld, 2008).
- Relevant vs. non-relevant documents? Probabilistic IR.
- Statistical techniques exist to discriminate between these text types. Here the interest is in the types of language per se, rather than their amenability to NLP tools.
5Words and countable linguistic features
- Bits of words, e.g. 2-grams (Kjell, 1994)
- Words (many studies)
- Linguistic features for Factor Analysis (Biber, 1995), e.g. questions, past participles.
- Phrase rewrite rules (Sekine, 1997; Baayen, van Halteren and Tweedie, 1996).
- Any countable feature characteristic of one corpus as opposed to another.
- Not hapax legomena, Semitisms in the New Testament.
6Domain independence of parsing (Sekine, 1997)
- Used 8 genres from the Brown Corpus, chosen to give equal amounts of fiction (KLNP) and non-fiction (ABEJ).
- Characterised domains by production rules which fire.
- From this data produced a matrix of cross entropy of grammar across domains.
- Then average linking of the domains based on the matrix of cross entropy gave intuitively reasonable results.
- Evaluated the effect of (training / test) corpus difference on parser performance.
- Discussed size of the training corpus.
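7Text categories in the Brown and LOB corpora
| Broad Text Category | Genre | Texts in Brown | Texts in LOB |
| --- | --- | --- | --- |
| Press | A Reportage | 44 | 44 |
| | B Editorial | 27 | 27 |
| | C Reviews | 17 | 17 |
| General Prose | D Religion | 17 | 17 |
| | E Skills, Trades, Hobbies | 36 | 38 |
| | F Popular Lore | 48 | 44 |
| | G Belles Lettres, Biographies, Essays | 75 | 77 |
| | H Miscellaneous | 30 | 30 |
| | J Academic Prose | 80 | 80 |
| Fiction | K General Fiction | 29 | 29 |
| | L Mystery and Detective | 24 | 24 |
| | M Science Fiction | 6 | 6 |
| | N Adventure and Western | 29 | 29 |
| | P Romance and Love Story | 29 | 29 |
| | R Humour | 9 | 9 |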
8Sekine characterised domains by production rules which fire
| Domain A | Domain B |
| --- | --- |
| PP → IN NP (8.40) | NP → PRP (9.52) |
| NP → NNPX (5.42) | PP → IN NP (5.79) |
| S → S (5.06) | S → NP VP (5.77) |
| S → NP VP (4.28) | S → S (5.37) |
| NP → DT NNX (3.81) | NP → DT NNX (3.90) |
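9Sekine Cross-Entropy of Grammar Across Domains
| T/M | A | B | E | J | K | L | N | P |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 5.13 | 5.35 | 5.41 | 5.45 | 5.51 | 5.52 | 5.53 | 5.55 |
| B | 5.47 | 5.19 | 5.50 | 5.51 | 5.55 | 5.58 | 5.60 | 5.60 |
| E | 5.50 | 5.48 | 5.20 | 5.48 | 5.58 | 5.59 | 5.58 | 5.61 |
| J | 5.39 | 5.37 | 5.35 | 5.15 | 5.52 | 5.57 | 5.58 | 5.61 |
| K | 5.32 | 5.25 | 5.31 | 5.41 | 4.95 | 5.14 | 5.15 | 5.17 |
| L | 5.32 | 5.25 | 5.31 | 5.41 | 4.95 | 5.14 | 5.15 | 5.17 |
| N | 5.29 | 5.25 | 5.28 | 5.43 | 5.10 | 5.06 | 4.89 | 5.12 |
| P | 5.43 | 5.36 | 5.40 | 5.55 | 5.23 | 5.21 | 5.21 | 5.00 |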
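The cross-entropy and average-linking steps can be sketched in a few lines of Python. This is a minimal illustration, not Sekine's implementation: the function names `cross_entropy` and `average_linkage` are made up here, the clustering uses only the A, B, K and N entries copied from the table, and averaging the two directions to get a symmetric distance is an assumption.

```python
import math
from itertools import combinations

def cross_entropy(p, q):
    """H(p, q) = -sum_r p(r) * log2 q(r), in bits.
    p and q map production rules to relative frequencies.
    Returns infinity if q gives zero probability to a rule that p uses."""
    total = 0.0
    for rule, p_r in p.items():
        if p_r == 0:
            continue
        q_r = q.get(rule, 0.0)
        if q_r == 0:
            return math.inf
        total -= p_r * math.log2(q_r)
    return total

def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage.
    Returns the merges in order as (cluster, cluster, distance)."""
    def avg(c1, c2):
        # Mean distance over all cross-cluster pairs of members.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Values copied from the table (first key = row, second key = column).
H = {
    "A": {"A": 5.13, "B": 5.35, "K": 5.51, "N": 5.53},
    "B": {"A": 5.47, "B": 5.19, "K": 5.55, "N": 5.60},
    "K": {"A": 5.32, "B": 5.25, "K": 4.95, "N": 5.15},
    "N": {"A": 5.29, "B": 5.25, "K": 5.10, "N": 4.89},
}

def symmetric(a, b):
    # Assumption: symmetrise by averaging the two directions.
    return (H[a][b] + H[b][a]) / 2

merges = average_linkage(["A", "B", "K", "N"], symmetric)
for c1, c2, d in merges:
    print(c1, "+", c2, "at", round(d, 4))
```

On this subset the two fiction domains (K, N) merge first and the two press domains (A, B) second, although the gap between press-to-press and press-to-fiction distances is small under this simple symmetrisation.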
12Overall differences between corpora and contributions of individual features.
- Vocabulary richness (e.g. type/token ratio, Yule's K characteristic, V2/N) is a characteristic of the entire corpus. Puts all corpora on a linear scale.
- The techniques we will look at (chi-squared, information theoretic and factor analysis) can both give a value for the overall difference between two corpora, and quantify the contributions made by individual features.
13Measures of Vocabulary Richness
- Yule's K characteristic: $K = 10000 \, (M_2 - M_1) / (M_1 \times M_1)$, where $M_1$ is the number of tokens and $M_2 = (V_1 \times 1^2) + (V_2 \times 2^2) + (V_3 \times 3^2) + \dots$, with $V_i$ the number of word types occurring $i$ times.
- Gerson 35.9, Kempis 59.7, De Imitatione Christi 84.2
- Heaps' Law: vocabulary size as a function of text size, $M = kT^b$. Parameters $k$ and $b$ could discriminate texts, and allow them to be plotted in two dimensions.
- Entropy is a form of vocabulary richness (but high individual contributions from both common and rare words).
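A minimal sketch of these measures, assuming the text is already tokenised into a list of strings. The function names are illustrative, and fitting Heaps' Law by least squares on the log-log vocabulary growth curve is one common choice rather than something the slides prescribe.

```python
import math
from collections import Counter

def type_token_ratio(tokens):
    """Number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    """Yule's K = 10000 * (M2 - M1) / (M1 * M1)."""
    word_freqs = Counter(tokens)
    spectrum = Counter(word_freqs.values())  # V_i: types occurring i times
    m1 = len(tokens)
    m2 = sum(v_i * i * i for i, v_i in spectrum.items())
    return 10000 * (m2 - m1) / (m1 * m1)

def heaps_fit(tokens):
    """Estimate k and b in M = k * T**b by a least-squares fit of
    log M = log k + b * log T over the vocabulary growth curve."""
    seen = set()
    xs, ys = [], []
    for t, token in enumerate(tokens, start=1):
        seen.add(token)
        xs.append(math.log(t))
        ys.append(math.log(len(seen)))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    k = math.exp(mean_y - b * mean_x)
    return k, b

sample = "the cat sat on the mat the end".split()
print(type_token_ratio(sample))  # 0.75 (6 types / 8 tokens)
print(yules_k(sample))           # 937.5 (M1 = 8, M2 = 5*1 + 1*9 = 14)
print(heaps_fit(sample))
```

On real corpora the estimates only become meaningful for texts of thousands of tokens; the toy sentence is there to make the arithmetic checkable by hand.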
14The chi-squared test (Oakes and Farrow, 2006)
(O - E)² / E values for three words in five balanced corpora (Σ (O - E)² / E = 414916.8)
| Word | Australian | British | US | Indian | NZ |
| --- | --- | --- | --- | --- | --- |
| A | 12.68 | 1.36 | 2.55 | 76.65 | 8.33 |
| Commonwealth | 399.63 | 31.20 | 32.95 | 19.84 | 2.16 |
| zzzzooop | - | - | - | - | - |
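The per-cell (O - E)² / E values and their overall sum can be computed from a word-by-corpus frequency table as in the sketch below. The counts are invented for illustration, the function name is hypothetical, and every row and column is assumed to have a non-zero total.

```python
def chi_squared_contributions(counts):
    """counts maps each word to a list of observed frequencies, one per corpus.
    Returns a dict mapping each word to its (O - E)^2 / E values,
    where E = row total * column total / grand total."""
    n_corpora = len(next(iter(counts.values())))
    col_totals = [sum(row[j] for row in counts.values()) for j in range(n_corpora)]
    grand_total = sum(col_totals)
    contributions = {}
    for word, row in counts.items():
        row_total = sum(row)
        cells = []
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            cells.append((observed - expected) ** 2 / expected)
        contributions[word] = cells
    return contributions

# Hypothetical counts for two words in two corpora.
counts = {"a": [10, 20], "b": [30, 40]}
contrib = chi_squared_contributions(counts)
chi_squared = sum(sum(cells) for cells in contrib.values())
print(contrib)
print(round(chi_squared, 4))  # 0.7937
```

The individual cells show which word in which corpus departs most from expectation, while their sum is the overall chi-squared statistic for the difference between the corpora.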
15Measures from Information Theory (Dagan et al., 1997)
- Kullback-Leibler (KL) divergence (also called relative entropy) used as a measure of semantic similarity by Dagan et al., 1997.
- Meaning in coding theory: the average number of extra bits needed when events from distribution A are encoded with a code optimised for distribution B.
- Problems: we get a value of infinity if there is a word with frequency 0 in corpus B and > 0 in corpus A, and the measure is not symmetrical.
- Dagan (1997), Information Radius.
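Both problems, and the way the information radius avoids them, can be seen in a small sketch. It assumes word distributions given as dicts of relative frequencies, logs to base 2, and the information radius defined as D(p || m) + D(q || m) with m the average of p and q (some authors halve each term). The function names and the toy distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits. Infinite when q gives zero probability
    to a word that has non-zero probability in p."""
    total = 0.0
    for word, p_w in p.items():
        if p_w == 0:
            continue
        q_w = q.get(word, 0.0)
        if q_w == 0:
            return math.inf
        total += p_w * math.log2(p_w / q_w)
    return total

def information_radius(p, q):
    """D(p || m) + D(q || m), where m is the average of p and q.
    m is non-zero wherever p or q is, so the result is always finite."""
    words = set(p) | set(q)
    m = {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in words}
    return kl_divergence(p, m) + kl_divergence(q, m)

# Toy distributions: "murder" is absent from corpus_b, "love" from corpus_a.
corpus_a = {"the": 0.5, "murder": 0.5}
corpus_b = {"the": 0.5, "love": 0.5}
print(kl_divergence(corpus_a, corpus_b))       # inf
print(information_radius(corpus_a, corpus_b))  # 1.0
print(information_radius(corpus_b, corpus_a))  # 1.0

# KL divergence is not symmetrical.
p = {"x": 0.9, "y": 0.1}
q = {"x": 0.5, "y": 0.5}
print(kl_divergence(p, q), kl_divergence(q, p))
```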
16Information Radius
- L (Fiction: detective) and P (Fiction: romance): 0.180
- A (Press: reportage) and B (Press: editorial): 0.257
- J (Academic prose) and P (Fiction: romance): 0.572
17Detective versus Romantic Fiction
Detective Romance Detective Romance
The .00821 -.00732 Her .00819 -.00522
Of .00308 -.00277 She .00784 -.00535
A .00280 -.00257 You .00453 -.00345
Was .00180 -.00172 To .00235 -.00229
It .00161 -.00148 Be .00128 -.00110
He .00157 -.00148 They .00126 -.00097
On .00110 -.00099 Would .00121 -.00097
Been .00106 -.00089 Are .00087 -.00056
Man .00089 -.00061 Your .00084 -.00062
Money .00065 -.00034 Love .00081 -.00039
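One way to obtain a per-word breakdown with one positive and one negative figure per word is to split the information radius sum into its individual terms. The slides do not state how the figures in the table were computed, so the sketch below is an assumption about the general approach, with invented relative frequencies and a hypothetical function name.

```python
import math

def irad_terms(p, q):
    """For each word w, return the pair
    (p_w * log2(p_w / m_w), q_w * log2(q_w / m_w)), with m_w = (p_w + q_w) / 2.
    The term is positive for the corpus in which the word is relatively
    more frequent and negative for the other corpus."""
    terms = {}
    for word in set(p) | set(q):
        p_w, q_w = p.get(word, 0.0), q.get(word, 0.0)
        m_w = (p_w + q_w) / 2
        term_p = p_w * math.log2(p_w / m_w) if p_w > 0 else 0.0
        term_q = q_w * math.log2(q_w / m_w) if q_w > 0 else 0.0
        terms[word] = (term_p, term_q)
    return terms

# Invented relative frequencies, for illustration only.
detective = {"man": 0.75, "love": 0.25}
romance = {"man": 0.25, "love": 0.75}
terms = irad_terms(detective, romance)
for word, (t_detective, t_romance) in sorted(terms.items()):
    print(word, round(t_detective, 5), round(t_romance, 5))

# Summing every term recovers the information radius between the two corpora.
total = sum(t_p + t_q for t_p, t_q in terms.values())
print(round(total, 5))
```

Ranking words by the size of their positive term then gives a list of the words most characteristic of each corpus.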
18Factor Analysis
- Decathlon analogy: running, jumping and throwing.
- Biber (1988): groups of countable features which consistently co-occur in texts are said to define a linguistic dimension.
- Such features are said to have positive loadings with respect to that dimension, but dimensions can also be defined by features which are in complementary distribution, i.e. negatively loaded.
- Example: at one pole are many pronouns and contractions, near which lie conversational texts and panel discussions. At the other pole, with few pronouns and contractions, are scientific texts and fiction.
20Evaluation of Measures (Kilgarriff 2001)
- Reference corpus made up of known proportions of two corpora: 100% A, 0% B; 90% A, 10% B; 80% A, 20% B; ...
- This gives a set of gold standard judgements: subcorpus 1 is more like subcorpus 2 than subcorpus 3, etc.
- Compare machine ranking of corpora with the gold standard ranking using Spearman's rank correlation coefficient.
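The final comparison step can be sketched as follows. Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values given their average rank. The gold ranking and the measure's scores are invented for illustration, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)

# Hypothetical example: five corpus pairs ordered by the gold standard
# (1 = most similar) and the distances some measure assigns to them.
gold_ranking = [1, 2, 3, 4, 5]
measure_scores = [0.1, 0.3, 0.2, 0.5, 0.9]
print(spearman_rho(gold_ranking, measure_scores))  # approximately 0.9
```

A measure that orders the pairs exactly as the gold standard does would score 1.0; the swap of the second and third pairs in this example lowers the coefficient to about 0.9.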
21Conclusions
- Some measures allow comparisons of entire corpora, others enable the identification of typical features.
- Different measures allow different kinds of maps: vocabulary richness allows ranking of corpora on a linear scale, Heaps' Law a 2D map of two parameters. Information theoretic measures give the (dis)similarity between two corpora, best viewed using clustering. With Factor Analysis, you don't know what the dimensions are until you've done it.
- Maps enable contours of application success.