Corpus design


1
Corpus design

See
G Kennedy, Introduction to Corpus Linguistics,
Ch.2
CF Meyer, English Corpus Linguistics, Ch. 2

2
What is a corpus?

Corpus (pl. corpora) = body
Collection of written text or transcribed speech
Usually but not necessarily purposefully
collected
Usually but not necessarily structured
Usually but not necessarily annotated
(Usually stored on and accessible via computer)
Corpus vs text archive

3
Issues in corpus design

General purpose vs specialized
Dynamic (monitor) vs static
Representativeness and balance
Size
Storage and access
Permission
Text capture and markup
Organizations

4
General purpose vs specialized

Probably obvious how to assemble a specialized
corpus: appropriateness of texts for inclusion is
self-defined
General-purpose corpus implies very careful
planning to ensure balance
Implies making some assumptions about the nature
of language, even though (as corpus linguists)
that may go against the grain

5
Dynamic vs static

Static corpus will give a snapshot of language
use at a given time
Easier to control balance of content
May limit usefulness, esp. as time passes (eg
Brown corpus now of historical interest, in some
respects BNC already out of date)
Dynamic corpus: ever-changing
Called monitor corpus because it allows us to
monitor language change over time
But more or less impossible to ensure balance

6
Planned balance: example of BNC

Sampling and representativeness very difficult to
ensure
BNC designers very explicit about their
assumptions
Acknowledge that many decisions are subjective in
the end
100m words of contemporary spoken and written
British English
Representative of BrE as a whole
Balanced with regard to genre, subject matter and
style
Also designed to be appropriate for a variety of
uses: lexicography, education, research,
commercial applications (computational tools)

7
BNC

4,124 texts: 90% written, 10% spoken
Largest collection of spoken English ever
collected (10m words), but reflects typical
imbalance in favour of written text (for
understandable practical reasons)
Written portion: 75% informative, 25% imaginative
Amount of fiction is slightly disproportionately
high compared to amount published during the
sampling period, justified because of cultural
importance of fiction and creative writing

8
Subject coverage

Planned to reflect pattern of book publishing in
UK over last 20 years

Subject            Number of texts   % of total written
Imaginative              625                22
World affairs            453                18
Social science           510                15
Leisure                  374                11
Applied science          364                 8
Commerce                 284                 8
Arts                     259                 8
Natural science          144                 4
Belief & thought         146                 3
Unclassified              50                 3

9
Sources of written material

60% books
25% periodicals
5% brochures and other ephemera,
eg bus tickets, produce containers, junk mail
5% unpublished: letters, essays, minutes
5% plays, speeches (written to be spoken)

10
Register levels

30% literary or technical (high)
45% middle
25% informal (low)
Obvious difficulty of how to judge levels a
priori

11
Spoken corpus

Context-governed material
Lectures, tutorials, classrooms
News reports
Product demonstrations, consultations, interviews
Sermons, political speeches, public meetings,
parliamentary debates
Sports commentaries, phone-ins, chat shows
Samples from 12 different regions

12
Spoken corpus

Ordinary conversation
2000 hrs from 124 volunteers, 38 different
regions
Four different socio-economic groupings
Equal male and female, age range 15 to 60
All conversations over a 2-day period recorded
No secret recording, and allowed to erase
Systematic details kept of time, location,
details of participants (sex, age, race,
occupation, education, social group, ...), topic,
etc.
Transcription issues
include false starts, hesitations, etc.
some paralinguistic features (shouting,
whispering),
use of dialect words/grammar
but no phonetic information

13
Another example: ICE

Collection of samples of English as
spoken/written around the world
Common design (as well as common annotation
scheme, and shared tools for exploitation)
500 texts of approximately 2,000 words each
60% spoken, 40% written
Specific domains and genres prescribed
Prescribing common design in this way makes the
corpora comparable

14
ICE text categories

Each sample should be 2,000 words

Spoken (300)
  Dialogues (180)
    Private (100)
      Conversations (90)
      Phone calls (10)
    Public (80)
      Class lessons (20)
      Broadcast discussions (20)
      Broadcast interviews (10)
      Parliamentary debates (10)
      Cross-examinations (10)
      Business transactions (10)
  Monologues (120)
    Unscripted (70)
      Commentaries (20)
      Unscripted speeches (30)
      Demonstrations (10)
      Legal presentations (10)
    Scripted (50)
      Broadcast news (20)
      Broadcast talks (20)
      Non-broadcast talks (10)
Written (200)
  Non-printed (50)
    Student writing (20)
      Student essays (10)
      Exam scripts (10)
    Letters (30)
      Social letters (15)
      Business letters (15)
  Printed (150)
    Academic (40)
      Humanities (10)
      Social Sciences (10)
      Natural Sciences (10)
      Technology (10)
    Popular (40)
      Humanities (10)
      Social Sciences (10)
      Natural Sciences (10)
      Technology (10)
    Reportage (20)
      Press reports (20)
    Instructional (20)
      Administrative writing (10)
      Skills/hobbies (10)
    Persuasive (10)
      Editorials (10)
    Creative (20)
      Novels (20)

15
Length of corpus

Resources available to create and manage corpus
determine how long it can be
Funding, researchers, computing facilities
Speech is easy to capture, but much more
time-consuming to process than written language
Transcription and annotation requires 6
person-hours per 1 minute of speech (Santa
Barbara Corpus of Spoken American English)
4 person-hours per 1,000 words of written sample,
but between 5 and 10 person-hours per 1,000 words
of speech (more for dialogues due to overlapping
speech) (International Corpus of English)
On this basis, American component of ICE would
take one researcher working 40 hrs/week 3 years
to complete
BNC is 100 times bigger than that
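
The 3-year estimate can be checked with a rough calculation. The sketch below uses the ICE rates quoted above and the common ICE design (300 spoken and 200 written texts of 2,000 words each). The 48-week working year and all names in the code are assumptions made for illustration, not figures from the ICE project.

```python
# Rough effort estimate for one ICE-style component (1m words).
# Rates come from the slide; WEEKS_PER_YEAR is an assumption.

WORDS_PER_TEXT = 2_000
SPOKEN_TEXTS, WRITTEN_TEXTS = 300, 200
WRITTEN_HOURS_PER_1K = 4          # person-hours per 1,000 written words
SPOKEN_HOURS_PER_1K = (5, 10)     # lower and upper rate for speech
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 48               # assumption: 48 working weeks a year


def years_needed(spoken_rate):
    """Years for one researcher, given a person-hour rate for speech."""
    written_hours = WRITTEN_TEXTS * WORDS_PER_TEXT / 1_000 * WRITTEN_HOURS_PER_1K
    spoken_hours = SPOKEN_TEXTS * WORDS_PER_TEXT / 1_000 * spoken_rate
    return (written_hours + spoken_hours) / (HOURS_PER_WEEK * WEEKS_PER_YEAR)


low, high = (years_needed(rate) for rate in SPOKEN_HOURS_PER_1K)
print(f"{low:.1f} to {high:.1f} years")
```

Under these assumptions the range brackets the 3-year figure, and the spoken texts account for most of the hours.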

16
Length of corpus

Length is also determined by the use to which it will
be put
Corpora for lexicographic use need to be (much)
bigger
Early corpora (1m words) seemed huge, mainly due
to limitations of computers to process them
Sinclair (1991) described a 20m word corpus as
small but nevertheless useful
Even in a billion-word corpus, data for some
words/constructions would be sparse
How many tokens of a linguistic item are needed
for descriptive adequacy?
Typically 40-50% of all word types occur only
once in a given text (or corpus)
For polysemous words at least half of the
possible meanings will occur only once (if at
all)
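
The share of word types that occur only once can be measured directly. The following is a minimal sketch with a deliberately crude tokenizer (lowercase runs of letters and apostrophes). The function name and the sample sentence are made up for illustration, and a toy text this short will not show the 40-50% figure typical of real corpora.

```python
import re
from collections import Counter


def hapax_share(text):
    """Fraction of word types that occur exactly once in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(tokens)
    if not counts:
        return 0.0
    once = sum(1 for n in counts.values() if n == 1)
    return once / len(counts)


sample = "the cat sat on the mat and the dog sat on the rug"
print(hapax_share(sample))
```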

17
Type and token

Token means individual occurrence of a word
Type means a distinct word, however often it occurs
The man saw the girl with the telescope
8 tokens, 6 types
Type may refer to lexeme, or individual word
form
run, runs, ran, running: 1 or 4 types?
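
The counts for the example sentence, and the difference between counting word forms and counting lexemes, can be reproduced in a few lines. This is only a sketch: splitting on whitespace stands in for real tokenization, and the `LEMMAS` table is a hypothetical hand-made mapping, not the output of any lemmatizer.

```python
def count_tokens_and_types(sentence):
    """Return (tokens, types), treating lowercased word forms as types."""
    tokens = sentence.lower().split()
    return len(tokens), len(set(tokens))


print(count_tokens_and_types("The man saw the girl with the telescope"))  # (8, 6)

# Hypothetical lemma table, for illustration only.
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}
forms = ["run", "runs", "ran", "running"]

word_form_types = len(set(forms))                          # 4 types as word forms
lexeme_types = len({LEMMAS.get(f, f) for f in forms})      # 1 type as a lexeme
print(word_form_types, lexeme_types)
```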

Some attempts to base corpus size on known
statistics of existing corpora
Biber (1993): reliable information on
frequently occurring linguistic items such as
nouns can be obtained from a 120k-word sample, while an
infrequently occurring construction such as
conditional clauses would need 2.4m words
How are such figures arrived at?
Observe point at which measures stabilise
Also, how much data can a lexicographer absorb?
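
One simple way to "observe the point at which measures stabilise" is to track an item's rate per 1,000 tokens as the sample grows and to note the size from which it stays close to its final value. The sketch below is an illustration under that assumption, not Biber's actual procedure. The function names, the tolerance and the use of the final rate as the reference point are all choices made here.

```python
def cumulative_rates(tokens, target, step):
    """Rate of `target` per 1,000 tokens after each `step` tokens."""
    rates, hits = [], 0
    for i, tok in enumerate(tokens, start=1):
        if tok == target:
            hits += 1
        if i % step == 0:
            rates.append((i, 1000 * hits / i))
    return rates


def stabilisation_point(rates, tolerance):
    """Smallest sample size from which every later rate stays within
    `tolerance` of the final rate; None if `rates` is empty."""
    if not rates:
        return None
    final = rates[-1][1]
    point = None
    for size, rate in reversed(rates):
        if abs(rate - final) > tolerance:
            break
        point = size
    return point


# Synthetic token stream: an unrepresentative start, then a steady pattern.
tokens = ["the"] * 5 + ["x"] * 5 + ["the", "x", "x", "x", "x"] * 18
print(stabilisation_point(cumulative_rates(tokens, "the", 10), 25))
```

A rarer item needs a larger sample before its rate settles, which is the intuition behind the different sample sizes quoted for nouns and conditional clauses.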
