The London Corpora projects - the benefits of hindsight - some lessons for diachronic corpus design. Sean Wallis, Survey of English Usage, University College London

Transcript and Presenter's Notes

1
The London Corpora projects
  • - the benefits of hindsight -
  • some lessons for diachronic corpus design

Sean Wallis
Survey of English Usage, University College London
s.wallis_at_ucl.ac.uk
4
Motivating questions
  • What is meant by the phrase "a balanced corpus"?
  • How do sampling decisions made by corpus builders
    affect the type of research questions that may be
    asked of the data?
  • Reviewing ICE-GB and DCPSE:
  • Should the data have been more sociolinguistically
    representative, by social class and region?
  • Should texts have been sampled in strata so
    that speakers of all categories of gender and age
    were (equally) represented in each genre?

5
ICE-GB
  • British Component of ICE
  • Corpus of speech and writing (1990-1992)
  • 60% spoken, 40% written; 1 million words;
    orthographically transcribed speech, marked up,
    tagged and fully parsed
  • Sampling principles:
  • International sampling scheme, including broad
    range of spoken and written categories
  • But:
  • Adults who had completed secondary education
  • British corpus geographically limited:
  • speakers mostly from London / SE UK (or sampled
    there)

6
DCPSE
  • Diachronic Corpus of Present-day Spoken English
    (late 1950s - early 1990s)
  • 800,000 words (nominal)
  • London-Lund component annotated as ICE-GB
  • orthographically transcribed and fully parsed
  • Created from subsamples of LLC and ICE-GB
  • Matching numbers of texts in text categories
  • Not sampled over equal duration:
  • LLC (1958-1977), ICE-GB (1990-1992)
  • Text passages in LLC larger than ICE-GB:
  • LLC (5,000 words), ICE-GB (2,000 words)
  • But text passages may include subtexts:
  • telephone calls and newspaper articles are
    frequently short

7
DCPSE
  • Representative?
  • Text categories of unequal size
  • Broad range of text types sampled
  • Not balanced by speaker demography

8
A balanced corpus?
  • Corpora are reusable experimental datasets
  • Data collection (sampling) should avoid limiting
    future research goals
  • Samples should be representative
  • What are they representative of?
  • Quantity vs. quality
  • Large/lighter annotation vs. small/richer
  • Are larger corpora more (easily) representative?
  • Problems for historical corpora
  • Can we add samples to make the corpus more
    representative?

12
Representativeness
  • Do we mean representative...
  • of the language?
  • A sample in the corpus is a genuine random sample
    of the type of text in the language
  • of text types?
  • Effort made to include examples of all text
    types in the language (including speech contexts)
  • of speaker types?
  • Sampling decisions made to include equal numbers
    (by gender, age, geography, etc.) of participants
    in each text category
  • Should subdivide data independently
    (stratification)

[Slide labels: of the language = random sample; of text types = broad; of speaker types = stratified]
16
Stratified sampling
  • Ideal
  • Corpus independently subdivided by each variable
  • Equal subdivisions?
  • Not required
  • Independent variables: constant probability in
    each subset
  • e.g. proportion of words spoken by women not
    affected by text genre
  • e.g. same ratio of women:men in age groups, etc.
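
The independence condition on this slide can be checked numerically. The sketch below is illustrative only: the genre names and word counts are hypothetical (not figures from ICE-GB or DCPSE), and the function names are invented for the example. It compares the proportion of words spoken by women in each genre against the overall proportion; in an ideally stratified sample every deviation would be close to zero, even if the genres differ in size.

```python
# Hypothetical word counts per genre and speaker gender (not real corpus figures).
counts = {
    "conversation": {"female": 3000, "male": 6000},
    "broadcast": {"female": 1000, "male": 2000},
    "lectures": {"female": 500, "male": 3500},
}


def female_share(cell):
    """Proportion of words in one genre that were spoken by women."""
    return cell["female"] / (cell["female"] + cell["male"])


def stratification_report(counts):
    """Return the overall female share and each genre's deviation from it."""
    total_female = sum(cell["female"] for cell in counts.values())
    total_words = sum(cell["female"] + cell["male"] for cell in counts.values())
    overall = total_female / total_words
    deviations = {
        genre: female_share(cell) - overall for genre, cell in counts.items()
    }
    return overall, deviations


overall, deviations = stratification_report(counts)
for genre, dev in deviations.items():
    print(f"{genre}: {dev:+.3f} relative to overall {overall:.3f}")
```

In this made-up data the first two genres have the same female share despite their different sizes, which is the sense in which equal subdivisions are not required, while the third genre departs from it.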

17
Stratified sampling
  • What is the reality?

18
ICE-GB gender / written-spoken
  • Proportion of words in each category spoken by
    women and men
  • The authors of some texts are unspecified
  • Some written material may be jointly authored
  • female/male ratio varies slightly (?0.02)
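
A minimal sketch of how such proportions might be computed, assuming that words from texts with unspecified or joint authorship are set aside in an "unknown" bucket and excluded from the denominator. All counts here are hypothetical and the function name is invented for illustration; this is not the procedure used to produce the slide's chart.

```python
# Hypothetical word counts; "unknown" covers unspecified or jointly authored texts.
words = {
    "written": {"female": 60_000, "male": 140_000, "unknown": 200_000},
    "spoken": {"female": 210_000, "male": 390_000, "unknown": 0},
}


def gender_proportions(cell):
    """Female and male proportions among words whose author gender is known."""
    known = cell["female"] + cell["male"]
    if known == 0:
        return None  # nothing can be said about this category
    return {"female": cell["female"] / known, "male": cell["male"] / known}


props = {mode: gender_proportions(cell) for mode, cell in words.items()}

# Pool the categories to obtain the TOTAL row.
total = {
    key: sum(cell[key] for cell in words.values())
    for key in ("female", "male", "unknown")
}
props["TOTAL"] = gender_proportions(total)

# Difference in the female proportion between written and spoken material.
gap = abs(props["written"]["female"] - props["spoken"]["female"])
print(props, gap)
```

Excluding the unknown bucket is one possible design choice; it keeps the female and male proportions summing to 1, at the cost of hiding how much material has no gender attribution.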

[Figure: proportion p (0 to 1) of words by female and male authors/speakers in the written and spoken components and in TOTAL]
19
ICE-GB gender / spoken genres
  • Gender variation in spoken subcategories

[Figure: proportion p (0 to 1) of words by female and male speakers in each spoken subcategory]
20
ICE-GB gender / written genres
  • Gender variation in written genres

24
ICE-GB
  • Sampling was not stratified across variables
  • Women contribute 1/3 of corpus words
  • Some genres are all male (where specified):
  • speech: spontaneous commentary, legal
    presentation
  • academic writing: technology, natural sciences
  • non-academic writing: technology, social science
  • Is this representative?
  • When we compare
  • technology writing with creative writing
  • academic writing with student essays
  • are we also finding gender effects?
  • Difficult to compensate for absent data in
    analysis!
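
Absent strata of this kind can at least be detected before analysis begins. The following sketch assumes text metadata is available as simple (genre, gender, word count) records; the records and the function name are hypothetical. It lists every genre and gender combination that contributes no words at all.

```python
from collections import defaultdict

# Hypothetical text metadata: (genre, author or speaker gender, word count).
texts = [
    ("creative writing", "female", 2000),
    ("creative writing", "male", 2000),
    ("technology writing", "male", 2000),
    ("technology writing", "male", 2000),
    ("student essays", "female", 2000),
    ("student essays", "male", 2000),
]


def empty_strata(texts, genders=("female", "male")):
    """Return (genre, gender) pairs that contribute no words to the sample."""
    words = defaultdict(int)
    genres = []
    for genre, gender, n_words in texts:
        words[(genre, gender)] += n_words
        if genre not in genres:
            genres.append(genre)  # keep first-seen order
    return [
        (genre, gender)
        for genre in genres
        for gender in genders
        if words[(genre, gender)] == 0
    ]


missing = empty_strata(texts)
print(missing)
```

With an empty cell like this, a comparison between the affected genre and any other genre cannot be separated from a comparison between male and female authors, which is the concern the slide raises.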

25
DCPSE gender / genre
  • DCPSE has a simpler genre categorisation
  • also divided by time

[Figure: proportion p (0 to 1) by gender in each DCPSE genre: prepared speech, assorted spontaneous, legal cross-examination, parliamentary language, spontaneous commentary, broadcast interviews, broadcast discussions, telephone conversations, informal face-to-face conversations, formal face-to-face conversations, and TOTAL]
26
DCPSE gender / time
  • DCPSE has a simpler genre categorisation
  • also divided by time
  • note the gap

[Figure: gender proportion p (0 to 1) plotted against time, 1958 to 1992]
27
DCPSE genre / time
  • Proportion in each spoken genre, over time
  • sampled by matching LLC and ICE-GB overall
  • this is a stratified sample (but only
    LLC vs. ICE-GB)
  • uneven sampling over 5-year periods (within LLC)
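
Unevenness of this kind can be made visible by grouping texts into 5-year periods and computing the share of each genre within each period. The sketch below makes several assumptions: the (year, genre) records are hypothetical, periods are taken to start on years divisible by five, and shares are counted in texts rather than words.

```python
from collections import Counter

# Hypothetical (recording year, genre) records, one per text.
texts = [
    (1958, "informal face-to-face"),
    (1959, "prepared speech"),
    (1961, "informal face-to-face"),
    (1963, "informal face-to-face"),
    (1964, "telephone conversations"),
    (1976, "telephone conversations"),
]


def period_start(year, width=5):
    """First year of the period containing `year` (e.g. 1958 -> 1955)."""
    return (year // width) * width


def genre_share_by_period(texts):
    """Map (period start, genre) to that genre's share of texts in the period."""
    per_period = Counter()
    per_cell = Counter()
    for year, genre in texts:
        start = period_start(year)
        per_period[start] += 1
        per_cell[(start, genre)] += 1
    return {
        (start, genre): n / per_period[start]
        for (start, genre), n in per_cell.items()
    }


shares = genre_share_by_period(texts)
for (start, genre), share in sorted(shares.items()):
    print(f"{start}-{start + 4} {genre}: {share:.2f}")
```

Comparing these per-period shares against a fixed target distribution would show which periods over- or under-sample a genre; periods with no texts at all simply do not appear in the output.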

[Figure: proportion p of material in each spoken genre (formal face-to-face, informal face-to-face, prepared speech, spontaneous commentary, telephone conversations) over time, 1960 to 1990; the ICE-GB proportions are marked as the target for LLC]
30
DCPSE
  • LLC sampling not stratified
  • Issue not considered, data collected over
    extended period
  • Some data was surreptitiously recorded
  • DCPSE matched samples by genre
  • Same text category sizes in ICE-GB and LLC
  • But problems in LLC (and ICE) percolate
  • No stratification by speaker
  • Result: difficult and sometimes impossible to
    separate out speaker-demographic effects from
    text-category effects
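
A small worked example may make the confounding concrete. The counts below are invented for illustration: within each genre, women and men use a hypothetical linguistic feature at exactly the same rate, but women contribute mostly to one genre and men mostly to the other. Pooling over genres then produces an apparent gender difference that is really a genre difference.

```python
# Hypothetical data: (genre, gender) -> (feature occurrences, opportunities).
data = {
    ("conversation", "female"): (80, 800),
    ("conversation", "male"): (20, 200),
    ("commentary", "female"): (60, 200),
    ("commentary", "male"): (240, 800),
}


def within_genre_rates(data):
    """Rate of the feature for each (genre, gender) cell."""
    return {key: hits / n for key, (hits, n) in data.items()}


def pooled_rate(data, gender):
    """Rate of the feature for one gender, ignoring genre."""
    hits = sum(h for (_, g), (h, _) in data.items() if g == gender)
    total = sum(n for (_, g), (_, n) in data.items() if g == gender)
    return hits / total


rates = within_genre_rates(data)
female_pooled = pooled_rate(data, "female")
male_pooled = pooled_rate(data, "male")
print(rates)
print(female_pooled, male_pooled)
```

Here the per-genre rates are identical for both genders, yet the pooled rates differ. When a cell is empty rather than merely small, even the within-genre comparison is unavailable.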

31
Conclusions
  • Ideal would be that
  • the corpus was representative in all 3 ways
  • a genuine random sample
  • a broad range of text types
  • a stratified sampling of speakers
  • But these principles are unlikely to be
    compatible
  • e.g. speaker age and utterance context
  • Some compensatory approaches may be employed at
    research (data analysis) stage
  • what about absent or atypical data?
  • what if we have few speakers/writers?
  • So...

32
Conclusions
  • pay attention to stratification in deciding
    which texts to include in subcategories
  • consider replacing texts in outlying categories
  • justify and document non-inclusion of stratum by
    evidence
  • e.g. there are no published articles
    attributable to authors of this age in this time
    period