
Title: Corpus design


1
Corpus design
  • See
  • G Kennedy, Introduction to Corpus Linguistics,
    Ch. 2
  • CF Meyer, English Corpus Linguistics, Ch. 2

2
What is a corpus?
  • Corpus (pl. corpora) = body
  • Collection of written text or transcribed speech
  • Usually but not necessarily purposefully
    collected
  • Usually but not necessarily structured
  • Usually but not necessarily annotated
  • (Usually stored on and accessible via computer)
  • Corpus ≠ text archive

3
Issues in corpus design
  • General purpose vs specialized
  • Dynamic (monitor) vs static
  • Representativeness and balance
  • Size
  • Storage and access
  • Permission
  • Text capture and markup
  • Organizations

4
General purpose vs specialized
  • Probably obvious how to assemble a specialized
    corpus: appropriateness of texts for inclusion is
    self-defined
  • General-purpose corpus implies very careful
    planning to ensure balance
  • Implies making some assumptions about the nature
    of language, even though (as corpus linguists)
    that may go against the grain

5
Dynamic vs static
  • Static corpus will give a snapshot of language
    use at a given time
  • Easier to control balance of content
  • May limit usefulness, esp. as time passes (e.g.
    Brown corpus now of historical interest, in some
    respects BNC already out of date)
  • Dynamic corpus: ever-changing
  • Called monitor corpus because it allows us to
    monitor language change over time
  • But more or less impossible to ensure balance

6
Planned balance: example of BNC
  • Sampling and representativeness very difficult to
    ensure
  • BNC designers very explicit about their
    assumptions
  • Acknowledge that many decisions are subjective in
    the end
  • 100m words of contemporary spoken and written
    British English
  • Representative of BrE as a whole
  • Balanced with regard to genre, subject matter and
    style
  • Also designed to be appropriate for a variety of
    uses: lexicography, education, research,
    commercial applications (computational tools)

7
BNC
  • 4,124 texts: 90% written, 10% spoken
  • Largest collection of spoken English ever
    collected (10m words), but reflects typical
    imbalance in favour of written text (for
    understandable practical reasons)
  • Written portion: 75% informative, 25% imaginative
  • Amount of fiction is slightly disproportionately
    high compared to amount published during the
    sampling period, justified because of cultural
    importance of fiction and creative writing
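As a rough check on the proportions above, the sketch below turns them into approximate word counts. It assumes the percentages are shares of words rather than of texts (the slide does not say which), and all variable names are illustrative.

```python
# Approximate BNC composition in words, derived from the slide's proportions.
# Assumption: the percentages are shares of words, not shares of texts.
TOTAL_WORDS = 100_000_000

spoken_words = TOTAL_WORDS * 10 // 100          # 10% spoken
written_words = TOTAL_WORDS * 90 // 100         # 90% written
informative_words = written_words * 75 // 100   # 75% of the written portion
imaginative_words = written_words * 25 // 100   # 25% of the written portion

for label, words in [
    ("spoken", spoken_words),
    ("written", written_words),
    ("  informative", informative_words),
    ("  imaginative", imaginative_words),
]:
    print(f"{label:<15}{words / 1e6:>6.1f}m words")
```

Under that assumption the spoken share comes to 10m words, which matches the figure quoted on the slide.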

8
Subject coverage
  • Planned to reflect pattern of book publishing in
    UK over last 20 years

Subject              Number of texts   % of total written
Imaginative                      625                   22
World affairs                    453                   18
Social science                   510                   15
Leisure                          374                   11
Applied science                  364                    8
Commerce                         284                    8
Arts                             259                    8
Natural science                  144                    4
Belief and thought               146                    3
Unclassified                      50                    3

9
Sources of written material
  • 60% books
  • 25% periodicals
  • 5% brochures and other ephemera (e.g. bus
    tickets, produce containers, junk mail)
  • 5% unpublished letters, essays, minutes
  • 5% plays, speeches (written to be spoken)

10
Register levels
  • 30% literary or technical (high)
  • 45% middle
  • 25% informal (low)
  • Obvious difficulty of how to judge levels a
    priori

11
Spoken corpus
  • Context-governed material
  • Lectures, tutorials, classrooms
  • News reports
  • Product demonstrations, consultations, interviews
  • Sermons, political speeches, public meetings,
    parliamentary debates
  • Sports commentaries, phone-ins, chat shows
  • Samples from 12 different regions

12
Spoken corpus
  • Ordinary conversation
  • 2000 hrs from 124 volunteers, 38 different
    regions
  • Four different socio-economic groupings
  • Equal male and female, age range 15 to 60
  • All conversations over a 2-day period recorded
  • No secret recording, and allowed to erase
  • Systematic details kept of time, location,
    details of participants (sex, age, race,
    occupation, education, social group, ...), topic,
    etc.
  • Transcription issues
  • include false starts, hesitations, etc.
  • some paralinguistic features (shouting,
    whispering),
  • use of dialect words/grammar
  • but no phonetic information

13
Another example: ICE (International Corpus of English)
  • Collection of samples of English as
    spoken/written around the world
  • Common design (as well as common annotation
    scheme, and shared tools for exploitation)
  • 500 texts of approximately 2,000 words each
  • 60% spoken, 40% written
  • Specific domains and genres prescribed
  • Prescribing common design in this way makes the
    corpora comparable

14
ICE text categories
Each sample should be 2,000 words; figures in brackets are numbers of texts
Spoken (300)
  Dialogues (180)
    Private (100): Conversations (90), Phone calls (10)
    Public (80): Class lessons (20), Broadcast discussions (20), Broadcast interviews (10), Parliamentary debates (10), Cross-examinations (10), Business transactions (10)
  Monologues (120)
    Unscripted (70): Commentaries (20), Unscripted speeches (30), Demonstrations (10), Legal presentations (10)
    Scripted (50): Broadcast news (20), Broadcast talks (20), Non-broadcast talks (10)
Written (200)
  Non-printed (50)
    Student writing (20): Student essays (10), Exam scripts (10)
    Letters (30): Social letters (15), Business letters (15)
  Printed (150)
    Academic (40): Humanities (10), Social Sciences (10), Natural Sciences (10), Technology (10)
    Popular (40): Humanities (10), Social Sciences (10), Natural Sciences (10), Technology (10)
    Reportage (20): Press reports (20)
    Instructional (20): Administrative writing (10), Skills/hobbies (10)
    Persuasive (10): Editorials (10)
    Creative (20): Novels (20)
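One way to see that the prescribed categories add up to the 500-text design is to hold them in a nested structure and sum the leaves. This is only an illustrative sketch: the dictionary layout and the helper name `count_texts` are assumptions, while the category names and counts are copied from the slide.

```python
# ICE text categories as a nested dict: inner nodes are dicts, leaves are the
# number of 2,000-word texts prescribed for that category (from the slide).
ICE_DESIGN = {
    "Spoken": {
        "Dialogues": {
            "Private": {"Conversations": 90, "Phone calls": 10},
            "Public": {
                "Class lessons": 20, "Broadcast discussions": 20,
                "Broadcast interviews": 10, "Parliamentary debates": 10,
                "Cross-examinations": 10, "Business transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "Commentaries": 20, "Unscripted speeches": 30,
                "Demonstrations": 10, "Legal presentations": 10,
            },
            "Scripted": {
                "Broadcast news": 20, "Broadcast talks": 20,
                "Non-broadcast talks": 10,
            },
        },
    },
    "Written": {
        "Non-printed": {
            "Student writing": {"Student essays": 10, "Exam scripts": 10},
            "Letters": {"Social letters": 15, "Business letters": 15},
        },
        "Printed": {
            "Academic": {"Humanities": 10, "Social Sciences": 10,
                         "Natural Sciences": 10, "Technology": 10},
            "Popular": {"Humanities": 10, "Social Sciences": 10,
                        "Natural Sciences": 10, "Technology": 10},
            "Reportage": {"Press reports": 20},
            "Instructional": {"Administrative writing": 10,
                              "Skills/hobbies": 10},
            "Persuasive": {"Editorials": 10},
            "Creative": {"Novels": 20},
        },
    },
}


def count_texts(node):
    """Return the number of texts under a category (a leaf or a subtree)."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


WORDS_PER_TEXT = 2_000
total_texts = count_texts(ICE_DESIGN)
print(total_texts, "texts,", total_texts * WORDS_PER_TEXT, "words")
print("Spoken:", count_texts(ICE_DESIGN["Spoken"]))
print("Written:", count_texts(ICE_DESIGN["Written"]))
```

Because every ICE component follows the same tree, the same check can be run on any national component, which is what makes the corpora comparable.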
15
Length of corpus
  • Resources available to create and manage corpus
    determine how long it can be
  • Funding, researchers, computing facilities
  • Speech is easy to capture, but much more
    time-consuming to process than written language
  • Transcription and annotation require 6
    person-hours per 1 minute of speech (Santa
    Barbara Corpus of Spoken American English)
  • 4 person-hours per 1,000 words of written sample,
    but between 5 and 10 person-hours per 1,000 words
    of speech (more for dialogues due to overlapping
    speech) (International Corpus of English)
  • On this basis, American component of ICE would
    take one researcher working 40 hrs/week 3 years
    to complete
  • BNC is 100 times bigger than that
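The three-year estimate can be roughly reproduced from the per-1,000-word rates above. The sketch assumes a 1m-word component (500 texts of 2,000 words, split 60% spoken and 40% written as in the ICE design) and a 52-week working year with no holidays; those assumptions and all names are illustrative, not taken from the slide.

```python
# Back-of-envelope effort estimate for one ICE component.
# Assumptions: 500 texts x 2,000 words, 60% spoken / 40% written,
# and a 52-week working year at 40 hours per week.
WORDS_TOTAL = 500 * 2_000
spoken_kwords = WORDS_TOTAL * 60 // 100 // 1_000    # thousands of spoken words
written_kwords = WORDS_TOTAL * 40 // 100 // 1_000   # thousands of written words

WRITTEN_HOURS_PER_KWORD = 4         # person-hours per 1,000 written words
SPOKEN_HOURS_PER_KWORD = (5, 10)    # lower and upper rate for speech
HOURS_PER_YEAR = 40 * 52


def years_needed(spoken_rate):
    """Researcher-years for the whole component at a given speech rate."""
    hours = (written_kwords * WRITTEN_HOURS_PER_KWORD
             + spoken_kwords * spoken_rate)
    return hours / HOURS_PER_YEAR


low, high = (years_needed(rate) for rate in SPOKEN_HOURS_PER_KWORD)
print(f"{low:.1f} to {high:.1f} researcher-years")
```

Under these assumptions the range comes out at roughly two to four years, which brackets the figure of 3 years given on the slide.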

16
Length of corpus
  • Length is also determined by the use to which it
    will be put
  • Corpora for lexicographic use need to be (much)
    bigger
  • Early corpora (1m words) seemed huge, mainly due
    to limitations of computers to process them
  • Sinclair (1991) described a 20m word corpus as
    small but nevertheless useful
  • Even in a billion-word corpus, data for some
    words/constructions would be sparse
  • How many tokens of a linguistic item are needed
    for descriptive adequacy?
  • Typically 40-50% of all word types occur only
    once in a given text (or corpus)
  • For polysemous words at least half of the
    possible meanings will occur only once (if at
    all)
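The share of word types that occur only once can be measured directly on any tokenised text. The following is a minimal sketch: the function name `hapax_share` and the toy sentence are invented for illustration, and a text this small will not show the 40-50% figure quoted above.

```python
from collections import Counter


def hapax_share(tokens):
    """Fraction of word types that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("need at least one token")
    once = sum(1 for n in counts.values() if n == 1)
    return once / len(counts)


# Toy text, whitespace-tokenised and lower-cased (a deliberate simplification).
text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.lower().split()
print(f"{len(set(tokens))} types, share occurring once: {hapax_share(tokens)}")
```

Running the same function on samples of increasing size is one way to see how sparse the data for individual words remains even as the corpus grows.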

17
Type and token
  • Token means individual occurrence of a word
  • Type means a distinct word, counted once however
    often it occurs
  • The man saw the girl with the telescope
  • 8 tokens, 6 types
  • Type may refer to lexeme, or individual word
    form
  • run, runs, ran, running: 1 or 4 types?
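The counts for the example sentence, and the two readings of "type", can be checked in a few lines. The lemma table `LEMMAS` is a hypothetical hand-written mapping used only for this illustration; a real study would use a proper lemmatiser.

```python
# Tokens vs types for the example sentence.
sentence = "The man saw the girl with the telescope"
tokens = sentence.lower().split()   # lower-case so "The" and "the" match
types = set(tokens)
print(len(tokens), "tokens,", len(types), "types")   # 8 tokens, 6 types

# Word-form types vs lexeme types for the forms of "run".
LEMMAS = {"runs": "run", "ran": "run", "running": "run"}  # toy lemma table
forms = ["run", "runs", "ran", "running"]
word_form_types = set(forms)
lexeme_types = {LEMMAS.get(form, form) for form in forms}
print(len(word_form_types), "word-form types,", len(lexeme_types), "lexeme type")
```

Counting by word form gives 4 types for the run family and counting by lexeme gives 1, so any type count needs to state which definition it uses.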

18
  • Some attempts to base corpus size on known
    statistics of existing corpora
  • Biber (1993): reliable information on
    frequently occurring linguistic items such as
    nouns can be obtained from a 120k-word sample,
    while an infrequently occurring construction such
    as a conditional clause would need 2.4m words
  • How are such figures arrived at?
  • Observe point at which measures stabilise
  • Also, how much data can a lexicographer absorb?
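One simple way to "observe the point at which measures stabilise" is to track the cumulative relative frequency of an item as the sample grows and note where it settles near its full-sample value. The sketch below is an illustration only, not the procedure used by Biber (1993); the function name, the step size and the 5% tolerance are all assumptions.

```python
def stabilisation_point(tokens, is_target, step=1000, tolerance=0.05):
    """Smallest checkpoint size from which the cumulative relative frequency
    of the target stays within `tolerance` (relative) of its full-sample
    value. Returns None if the sample is empty or the target never occurs."""
    tokens = list(tokens)
    if not tokens:
        return None
    hits = 0
    checkpoints = []   # (sample size, cumulative relative frequency)
    for i, token in enumerate(tokens, start=1):
        if is_target(token):
            hits += 1
        if i % step == 0 or i == len(tokens):
            checkpoints.append((i, hits / i))
    final = checkpoints[-1][1]
    if final == 0:
        return None
    stable_from = None
    for size, freq in checkpoints:
        if abs(freq - final) <= tolerance * final:
            if stable_from is None:
                stable_from = size
        else:
            stable_from = None   # drifted out again: restart the search
    return stable_from


# Toy data: the target "a" is absent at first, then makes up half the tokens.
toy_tokens = ["b"] * 10 + ["a", "b"] * 45
print(stabilisation_point(toy_tokens, lambda t: t == "a", step=10))
```

Rarer items need many more tokens before the estimate settles, which is the intuition behind the much larger sample suggested for infrequent constructions.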