THE BNC The British National Corpus - PowerPoint PPT Presentation


1
THE BNC: The British National Corpus
  • TU Chemnitz
  • English Language and Linguistics
  • HS: Practical Corpus Linguistics
  • Lecturer: Martin Weisser, Ph.D.
  • Presented by: Kristin Flanagan
  • Winter semester 2005/2006

2
What is the BNC?
  • a corpus of more than 100 million words of
    contemporary spoken and written British English
  • designed to be representative of British English as a whole
  • covers a wide range of genres of written and
    spoken English, for educational, academic and
    commercial use
  • compiled by Oxford University Press,
    Longman Group (UK) Ltd., W. R. Chambers, the
    British Library, and the Universities of Oxford and
    Lancaster
  • 1991-1995

3
How is the BNC divided?
  • 4,124 texts, drawn 90% from written and 10% from spoken
    sources
  • Written:
  • texts from written sources comprise
  • 75% informative prose, all post-1975
  • 25% imaginative prose (literary works), all post-1960
  • 60% taken from books
  • 25% from periodicals
  • 5% from public brochures
  • 5% from unpublished letters, minutes, essays
  • 5% written plays or speeches
  • the written corpus represents different levels of
    British English
  • 30% more literary or technical "high" style
  • 45% "middle" style
  • 25% informal "low" style
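
As a rough illustration of what these proportions mean in absolute terms, the sketch below converts them into approximate word counts. It assumes a round total of 100 million words and applies the 90% written share to word counts; both are simplifying assumptions for illustration, not figures taken from the BNC documentation. The names `approx_words`, `MEDIUM`, `DOMAIN` and `STYLE` are made up for this example.

```python
# Shares in percent, as listed on the slide.
WRITTEN_PERCENT = 90
MEDIUM = {
    "books": 60,
    "periodicals": 25,
    "public brochures": 5,
    "unpublished material": 5,
    "written plays or speeches": 5,
}
DOMAIN = {"informative": 75, "imaginative": 25}
STYLE = {"high": 30, "middle": 45, "low": 25}

# Assumption: a round 100 million words in total.
TOTAL_WORDS = 100_000_000


def approx_words(breakdown, base_words):
    """Split base_words according to a percentage breakdown summing to 100."""
    if sum(breakdown.values()) != 100:
        raise ValueError("percentages do not sum to 100")
    return {label: base_words * pct // 100 for label, pct in breakdown.items()}


# Assumption: the 90% written share is applied to the word total.
written_words = TOTAL_WORDS * WRITTEN_PERCENT // 100

if __name__ == "__main__":
    for name, breakdown in (("medium", MEDIUM), ("domain", DOMAIN), ("style", STYLE)):
        print(name, approx_words(breakdown, written_words))
```

The check inside `approx_words` also confirms that each of the three breakdowns on the slide adds up to 100%.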

4
The spoken part of the BNC
  • all spoken data dates back no earlier than 1991
  • two major sources
  • 1. context-dependent section
  • educational and informative events
  • news reports
  • business events
  • official and public events
  • leisure events
  • 2. demographic section (40%)
  • 2,000 hours of transcribed recordings
  • made by 124 volunteers
  • covering all their conversations over a period of
    two to seven days
  • from 38 different parts of the UK
  • four different socio-economic groupings
  • balanced coverage of female and male speakers
  • ages between 15 and 60

5
Modifications made to CLAWS to deal with spoken
data
  • additions to the lexicon
  • - interjections of various sorts (ah,
    mhm, okey-dokey, etc.)
  • - various taboo words
  • - slang words
  • - a small list of words used differently in speech
    vs. writing
  • additions to the contractions recognized by CLAWS
    (dya, gotta, etc.)
  • special treatment of truncated words and some
    interjections
  • recognition of repetition
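
To make the last point concrete, here is a minimal sketch of what recognising repetition in spoken data could look like. It is not the CLAWS procedure, which the slide does not describe; it simply flags any token that is identical (ignoring case) to the token immediately before it. The function name `find_repetitions` is made up for this example.

```python
def find_repetitions(tokens):
    """Return (index, token) for each token that repeats the one right before it."""
    repeats = []
    for i in range(1, len(tokens)):
        # Compare case-insensitively so that "It it" also counts as a repetition.
        if tokens[i].lower() == tokens[i - 1].lower():
            repeats.append((i, tokens[i]))
    return repeats


if __name__ == "__main__":
    utterance = "I I think it it it works".split()
    print(find_repetitions(utterance))
```

A tagger could use such flags to avoid treating the repeated word as a separate grammatical constituent, but how CLAWS actually handles this is not covered here.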

6
Grammatical tagging of the spoken part of the BNC
  • the corpus is encoded in SGML (Standard
    Generalized Markup Language), in conformance with
    the TEI (Text Encoding Initiative) guidelines
  • every word is assigned a part-of-speech tag
    from the C5 tagset, which has 61 tags
  • tagging is carried out by the CLAWS4 tagging system
  • the CLAWS4 output then goes through a manual post-editing phase
  • finally, the text is sent back to Oxford University Computing
    Services to be checked for possible errors

7
Tagging
  • utterances are enclosed in a pair of SGML tags,
    <u> and </u>
  • the speaker is identified by the who attribute
  • overlapping passages are marked with SGML <ptr>
    tags
  • laughs, pauses, etc. are marked with the appropriate
    SGML tag
  • after processing by UCREL, the text is split
    into s-units (essentially functional sentences)
  • each s-unit is given an identifying code number
  • each word is assigned a part-of-speech tag

8
Sound example
  • <creation date='?'>
  • Origination/creation date not known
  • </creation>
  • <partics>
  • <person age=5 dialect=XLO educ=2
    flang=EN-GBR id=PS0SV n=W0001
    role=other sex=f soc=C1>
  • Age: 66
  • BMRB code: 101
  • BNC name: Sidney
  • Name: Irene
  • Occupation: retired
  • </person>
  • <person age=X dialect=XLO educ=X
    flang=EN-GBR id=PS140 n=W0002
    resp=PS0SV role=other sex=f soc=UU>
  • Name: Ethel
  • </person>
  • <person age=5 dialect=XLO educ=X
    flang=EN-GBR id=PS141 n=W0003
    resp=PS0SV role=other sex=f soc=UU>
  • Age: 67
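
The speaker attributes in such a header can be read out mechanically. The sketch below assumes the start tag is written in full SGML form, with angle brackets and unquoted `attribute=value` pairs, as in `<person age=5 ... soc=C1>`. The helper name `parse_person_tag` is made up for this example; real BNC files should be handled with a proper SGML or XML parser.

```python
import re

# One unquoted attribute=value pair; the value runs up to whitespace or ">".
ATTRIBUTE = re.compile(r"(\w+)=([^\s>]+)")


def parse_person_tag(tag):
    """Return the attributes of a <person ...> start tag as a dict."""
    return dict(ATTRIBUTE.findall(tag))


# First participant from the header shown on the slide.
SAMPLE = (
    "<person age=5 dialect=XLO educ=2 flang=EN-GBR id=PS0SV n=W0001 "
    "role=other sex=f soc=C1>"
)

if __name__ == "__main__":
    person = parse_person_tag(SAMPLE)
    print(person["id"], person["sex"], person["soc"])
```

The `id` value extracted this way is what the `who` attribute of an utterance refers to.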

9
Transcription of the previous sound example
(excerpt)
  • <u who=PS140>
  • <s n=0068>
  • <w AV0>Well <w CJS>once <w PNP>you<w VHB>'ve
    <w VDN>done <w DT0>that
  • <w PNP>it<w VBZ>'s <w PRP>like <w VVG>twisting
    <w DPS>your <w NN1>ankle<c PUN>.
  • <s n=0069>
  • <w PNP>It<w VBZ>'s <w AV0>always <w AJC>weaker<c PUN>.
  • <w PNP>It<w VBZ>'s <w AV0>always <w AJC>weaker<c PUN>.
  • </u>
  • <u who=PS141>
  • <s n=0078>
  • <ptr t=KDYLC00B> <w PNP>You <w VVB>mean
    <w PNP>you <w VM0>ca<w XX0>n't <w VVI>see
  • <ptr t=KDYLC00C> <w PNP>it <w TO0>to <w VVI>put
    <w PRP>round <w AT0>the <w NN1>corner
  • <w AV0>here <w PRP>like <w DT0>this<c PUN>?
  • </u>
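
A transcription in this format can be turned into a list of (word, tag) pairs with a few lines of code. The sketch below assumes the markup is in full SGML form, i.e. `<u who=...>` for the utterance, `<w TAG>` before each word and `<c TAG>` before each punctuation mark. The function name `parse_utterance` is made up for this example, and the regular expressions are only meant to cover the constructs seen in this excerpt.

```python
import re

SPEAKER = re.compile(r"<u who=(\w+)>")
# <w TAG>word or <c TAG>punctuation; the text runs up to the next "<".
TOKEN = re.compile(r"<([wc]) ([A-Z0-9-]+)>([^<]*)")


def parse_utterance(sgml):
    """Return (speaker id, [(word, tag), ...]) for one <u> element."""
    match = SPEAKER.search(sgml)
    speaker = match.group(1) if match else None
    tokens = [(text.strip(), tag) for _kind, tag, text in TOKEN.findall(sgml)]
    return speaker, tokens


# Excerpt: one s-unit from the first speaker's utterance.
SAMPLE = """<u who=PS140>
<s n=0069>
<w PNP>It<w VBZ>'s <w AV0>always <w AJC>weaker<c PUN>.
</u>"""

if __name__ == "__main__":
    speaker, tokens = parse_utterance(SAMPLE)
    print(speaker)
    for word, tag in tokens:
        print(f"{word}\t{tag}")
```

Note that contracted forms are split into two tagged tokens, so "It's" comes out as "It" (PNP) followed by "'s" (VBZ).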

10
Appendix to transcription: part-of-speech (POS)
tags
  • AJC: comparative adjective
  • AT0: article
  • AV0: adverb (unmarked)
  • CJS: subordinating conjunction
  • DPS: possessive determiner form
  • DT0: general determiner
  • PNP: personal pronoun
  • PRP: preposition (except for of)
  • PUN: punctuation, general mark
  • TO0: infinitive marker to
  • VBZ: -s form of the verb be (is, 's)
  • VDN: past participle of the verb do (done)
  • VHB: base forms of the verb have (have, 've)
  • VM0: modal auxiliary verb
  • VVB: base form of lexical verb (except the
    infinitive)
  • VVG: -ing form of lexical verb
  • VVI: infinitive of lexical verb
  • XX0: the negative not or n't

11
Compare Original Version with SGML (taken from
BNC Sampler Treebank)
  • Original:
  • S0KC8001 v
  • [S [S Hello_UH there_RL <pause>_NULL right_RR
    ,_YCOM and_CC [N what_DDQ N] do_VD0
    [N you_PPY N][V want_VVI V] ,_YCOM
    [N some_DD mushrooms_NN2 N]S] <pause>_NULL
    [S[N which_DDQ mushrooms_NN2 N] shall_VM
    [N we_PPIS2 N][V have_VHI V]
    [N my_APPGE love_NN1 N]S] ?_YQUE S]
  • SGML:
  • <S n="001"> <S co="first"> <w f="Hello" t="UH">
    <w f="there" t="RL"> <pause> <w f="right" t="RR">
    <c f="," t="YCOM"> <w f="and" t="CC"> <N>
    <w f="what" t="DDQ"> </N> <w f="do" t="VD0"> <N>
    <w f="you" t="PPY"> </N><V> <w f="want" t="VVI">
    </V> <c f="," t="YCOM"> <N> <w f="some" t="DD">
    <w f="mushrooms" t="NN2"> </N></S> <pause>
    <S co="subs"><N> <w f="which" t="DDQ">
    <w f="mushrooms" t="NN2"> </N> <w f="shall" t="VM">
    <N> <w f="we" t="PPIS2"> </N><V> <w f="have"
    t="VHI"> </V><N> <w f="my" t="APPGE"> <w f="love"
    t="NN1"> </N></S> <c f="?" t="YQUE"> </S>
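
The correspondence between the two versions can be sketched in code. The example below reads `word_TAG` tokens from the original format and writes each one as an SGML element with `f` (form) and `t` (tag) attributes. It rests on two assumptions that go beyond the slide: that bracket labels such as `[N` and `N]` are separated from tokens by whitespace, and that tags beginning with `Y` (YCOM, YQUE) mark punctuation and therefore become `<c>` rather than `<w>` elements. The function names `parse_tokens` and `to_sgml` are made up for this example, and the phrase-level elements (`<S>`, `<N>`, `<V>`) are left out.

```python
def parse_tokens(line):
    """Return (word, tag) pairs for every word_TAG chunk in a treebank line."""
    pairs = []
    for chunk in line.split():
        # Bracket labels such as "[N" or "N]" contain no underscore and are skipped.
        if "_" in chunk:
            word, tag = chunk.rsplit("_", 1)
            pairs.append((word, tag))
    return pairs


def to_sgml(word, tag):
    """Render one token in the f="..." t="..." style of the SGML version."""
    if tag == "NULL":
        # e.g. <pause>_NULL: the "word" is already an SGML element.
        return word
    # Assumption: tags starting with "Y" are punctuation tags.
    element = "c" if tag.startswith("Y") else "w"
    return f'<{element} f="{word}" t="{tag}">'


# First part of the example sentence, with brackets restored.
SAMPLE = (
    "[S [S Hello_UH there_RL <pause>_NULL right_RR ,_YCOM and_CC "
    "[N what_DDQ N] do_VD0 [N you_PPY N][V want_VVI V]"
)

if __name__ == "__main__":
    print(" ".join(to_sgml(word, tag) for word, tag in parse_tokens(SAMPLE)))
```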

12
How can the BNC be used to analyse British
English?
  • Examples
  • by comparing various channels and text-types
  • by comparing speech and writing
  • by comparing conversation and task-orientated
    speech
  • by comparing imaginative and informative writing

13
Examples from the frequency list
  • figures are given per million words
  • antonyms: good 1,276 vs. bad 264
  • weekdays: Sunday 93 vs. Tuesday 36
  • living creatures: dog 124 vs. fly 16
  • weather: wind 85 vs. sunshine 13
  • sports: football 67 vs. sailing 9
  • colours: black 226 vs. purple 11
  • most frequent common noun: time 1,833
  • most frequent verb: be 42,277
  • most frequent determiner: the 61,847
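
Because the figures are normalised per million words, they can be compared directly and scaled to a corpus of any size. The sketch below uses the values from this slide. The helper names (`ratio`, `expected_count`, `per_million`) are made up for this example, and the 100-million-word corpus size used in the demo is a round figure chosen for illustration.

```python
# Frequencies per million words, as listed on the slide.
PER_MILLION = {
    "good": 1276, "bad": 264,
    "Sunday": 93, "Tuesday": 36,
    "dog": 124, "fly": 16,
    "wind": 85, "sunshine": 13,
    "football": 67, "sailing": 9,
    "black": 226, "purple": 11,
    "time": 1833, "be": 42277, "the": 61847,
}


def ratio(word_a, word_b):
    """How many times more frequent word_a is than word_b."""
    return PER_MILLION[word_a] / PER_MILLION[word_b]


def expected_count(word, corpus_size):
    """Scale a per-million figure to a corpus of corpus_size words."""
    return PER_MILLION[word] * corpus_size / 1_000_000


def per_million(raw_count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


if __name__ == "__main__":
    print(f"good is {ratio('good', 'bad'):.2f} times as frequent as bad")
    print(f"expected 'the' in 100 million words: {expected_count('the', 100_000_000):,.0f}")
```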

14
Errors and disadvantages of the BNC
  • typing mistakes already present in the texts
    provided
  • typing errors made by transcribers
  • sampling errors
  • encoding errors
  • tagging errors
  • quotations in other languages may lead to
    confusion (the German article die has nothing to do with
    mortality)
  • the corpus may be biased by buzzwords current at the
    time of collection

15
Sources
  • http://www.natcorp.ox.ac.uk/
  • BNC Sampler
  • Kennedy, Graeme. An Introduction to Corpus
    Linguistics. London: Longman, 1998.
  • Leech, Geoffrey/Greg Myers/Jenny Thomas (eds.).
    Spoken English on Computer: Transcription,
    Mark-up and Application. New York: Longman, 1995.
  • Leech, Geoffrey/Paul Rayson/Andrew Wilson. Word
    Frequencies in Written and Spoken English: Based
    on the British National Corpus. London: Longman,
    2000.
  • Aston, Guy/Lou Burnard. The BNC Handbook:
    Exploring the British National Corpus with SARA.
    Edinburgh: Edinburgh University Press, 1998.