Presentazione di PowerPoint - PowerPoint PPT Presentation

Slides: 103
Provided by: glo995
Transcript and Presenter's Notes

1
(No Transcript)
2
What is a CORPUS?
  • A corpus is a collection of pieces of language
    that are selected and ordered according to
    explicit linguistic criteria in order to be used
    as a sample of the language
  • (Sinclair 1996)

3
What is a CORPUS?
  • the term corpus as used in modern
    linguistics can best be defined as a collection
    of sampled texts, written or spoken, in
    machine-readable form which may be annotated with
    various forms of linguistic information
  • (McEnery, Xiao and Tono 2006)

4
Key concepts re. Corpora
  • Machine-readable texts
  • Authentic texts
  • Sampled texts
  • Representative of a particular language or
    language variety

5
Is Corpus Linguistics a new approach to the study
of language?
  • The expression Corpus Linguistics first appeared
    in the early 80s.
  • Corpus-based language study, however, has a
    substantial history.

6
Corpus-based language study
  • In the pre-Chomskyan era
  • Field linguists (Boas)
  • Structuralists (Sapir, Newman, Bloomfield, Pike,
    etc.)
  • Corpora were just a few paper slips with data:
    "shoebox corpora", non-representative.
  • Corpus-based only in that the methodology was
    empirical and based on observable data.

7
The 50s: the protests
  • Chomsky (1962) criticized the (contemporary) corpus
    methodology on account of the skewedness of
    corpora.
  • Non-representative, time consuming, competence
    vs. performance, I-language vs. E-language
  • Corpora were marginalized.

8
The revolutionary 60s
  • With the advances in computer technology the
    exploitation of massive corpora became feasible.
  • Brown Corpus
  • Brown University Standard Corpus of Present-Day
    American English

9
The 80s: the boom
  • From the 80s onwards the number and size of
    corpora and corpus-based studies have increased
    dramatically.
  • Corpora have revolutionized almost all
    branches of linguistics.

10
A few remarks
  • Computers
  • allow us to speed up the processing of data.
  • avoid human bias in data analysis
  • allow the enrichment of data with metadata

11
Intuition vs. Corpus
  • Intuition should be applied with caution
  • Influence of dialect, sociolect, idiolect
  • No universal agreement on (degree of)
    acceptability
  • Informants monitor their use of language
    (non-spontaneous)
  • Introspection is not observable

12
Intuition vs. Corpus
  • Corpus-based approach draws upon authentic or
    real texts
  • Computer-based analysis can retrieve differences
    that intuition alone cannot perceive
  • Reliable quantitative data

13
Should we dismiss intuition then?
  • Not at all!
  • The key to using corpus data is to find the
    balance between the use of corpus data and the
    use of one's own intuition.

14
Should we dismiss intuition then?
  • Not all research questions can be addressed by
    the corpus-based approach.
  • Corpus-based approach and intuition-based
    approach
  • ARE NOT MUTUALLY EXCLUSIVE

15
Leech (1991: 14) writes
  • Neither the corpus linguist of the 1950s,
    who rejected intuition, nor the general linguist
    of the 1960s, who rejected corpus data, was able
    to achieve the interaction of data coverage and
    the insight that characterise the many successful
    corpus analyses of recent years.

16
Is CL a methodology or a theory?
  • No universal agreement.
  • CL is a METHODOLOGY and not an independent branch
    of linguistics such as semantics, pragmatics,
    syntax, etc.
  • CL can be employed to explore almost any area of
    linguistic research.

17
Corpus-based or Corpus-driven approaches?
  • Corpus-based approaches are used to expound,
    test or exemplify theories and descriptions that
    were formulated before large corpora became
    available to inform language study
    (Tognini-Bonelli 2001: 65).
  • Therefore, corpus-based linguists are not
    strictly committed to corpus data and they would
    discard inconvenient evidence by insulation,
    standardisation and instantiation (i.e. via
    corpus annotation).

18
Corpus-based or Corpus-driven approaches?
  • Corpus-driven linguists are strictly committed
    to the integrity of the data as a whole.
  • Theoretical statements are fully consistent with,
    and reflect directly, the evidence provided by
    the corpus.
  • (Tognini-Bonelli 2001: 84-85).

19
Corpus-based or Corpus-driven approaches?
  • The distinction is overstated: they are 2
    idealized extremes.
  • 4 basic differences between the 2 approaches:
  • Types of corpora used
  • Attitudes towards theories and intuitions
  • Focuses of research
  • Paradigmatic claims

20
  • C.B. Approaches
    • Corpus must be representative and balanced
    • Size is not all-important
    • Minimum frequency is used to exclude non-relevant
      results
    • In favour of corpus annotation
    • Generally have existing theory as a starting
      point, and correct and revise such theory in the
      light of corpus evidence
    • Distinction between the different levels of
      language analysis
  • C.D. Approaches
    • Corpus will balance itself when it grows to be
      big enough (cumulative representativeness)
    • Corpus must be very large
    • Corpus evidence is exploited fully, but this way
      the number of combinations is enormous
    • Against corpus annotation (no preconceived
      theories)
    • No distinction between lexis, syntax,
      pragmatics, etc. There is only 1 level of
      language description: the functionally complete
      unit of meaning, or language patterning
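
The minimum-frequency cut-off mentioned for corpus-based approaches can be illustrated with a small word-frequency list. This is only a sketch: the tokenization rule (lowercase alphabetic strings) and the function name `frequency_list` are assumptions made for the example, not part of any particular corpus tool.

```python
import re
from collections import Counter

def frequency_list(text, min_freq=2):
    """Return (word, count) pairs whose count is at least min_freq."""
    # Crude tokenization, assumed for illustration only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Items below the threshold are treated as non-relevant and dropped.
    return [(w, c) for w, c in counts.most_common() if c >= min_freq]

sample = "the cat sat on the mat and the dog sat on the rug"
print(frequency_list(sample, min_freq=2))
```

Raising `min_freq` shortens the list; a corpus-driven analyst would object that the discarded low-frequency items are evidence too.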

21
  • We will only refer to CORPUS-BASED APPROACHES
  • A few key notions in Corpus Linguistics

22
Representativeness
  • Essential feature of a corpus.
  • Balance (the range of genres included in a
    corpus) and sampling (how the text chunks for
    each genre are selected) ensure
    representativeness.

23
Representativeness
  • A corpus is representative if
  • the findings based on its contents can be
    generalized to the said language variety (Leech
    1991)
  • its samples include the full range of
    variability in a population (Biber 1993)

24
Representativeness
  • It changes over time (Hunston 2002): if a corpus
    is not regularly updated, it rapidly becomes
    unrepresentative.

25
Representativeness
  • Criteria to select texts for a corpus
  • External criteria (Biber's situational
    perspective): defined situationally, e.g. genres,
    registers, text types, etc.
  • Internal criteria (Biber's linguistic
    perspective): defined linguistically, taking into
    account the distribution of linguistic features.
    These are CIRCULAR, because a corpus is typically
    designed to study linguistic distribution, so there
    is no point in analysing a corpus where the
    distribution of linguistic features is predetermined.

26
Representativeness
  • 2 main types (for the range of text categories
    represented)
  • General corpora: a basis for an overall
    description of a language (variety); their
    representativeness depends on the sampling from a
    broad range of genres.
  • Specialized corpora: domain- or genre-specific
    corpora; their representativeness can be measured
    by the degree of closure or saturation (lexical
    features).
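
One rough way to look at lexical closure is to track how many new word types appear as more text is added: when the curve flattens, the vocabulary of the specialized corpus is approaching saturation. The sketch below assumes the corpus is already tokenized; `type_growth` and the segment size are illustrative choices, not an established measure.

```python
def type_growth(tokens, segment_size):
    """Cumulative number of distinct word types after each segment."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), segment_size):
        seen.update(tokens[start:start + segment_size])
        growth.append(len(seen))
    return growth

# Toy "corpus" of 12 tokens, read in segments of 3.
tokens = "a b c a b c a b d a b c".split()
print(type_growth(tokens, 3))
```

A curve that stops rising from one segment to the next suggests that further sampling adds little new vocabulary.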

27
Balance
  • The range of text categories included in the
    corpus
  • The acceptable balance is determined by the intended
    uses.
  • A balanced corpus covers a wide range of text
    categories which are supposed to be
    representative of the language (variety) under
    consideration.

28
Balance
  • There is no scientific measure for balance.
  • It is more important for sample corpora than
    for monitor corpora

29
Sampling
  • A corpus is a sample of a given population
  • A sample is representative if what we find for
    the sample holds for the general population
  • Samples are scaled-down versions of a larger
    population

30
Sampling
  • Sampling unit: for written text, a sampling unit
    could be a book, periodical or newspaper.
  • Population: the assembly of all sampling units.
    It can be defined in terms of language
    production, reception (demographic, sex, age,
    etc.) or language as a product (category, genre
    of language data).
  • Sampling frame: the list of sampling units.

31
Sampling
  • Sampling techniques
  • Simple random sampling: all sampling units within
    the sampling frame are numbered and the sample is
    chosen by use of a table of random numbers. Rare
    features may not be accounted for.
  • Stratified random sampling: the population is
    divided into relatively homogeneous groups, i.e.
    the strata, and these are then sampled at
    random. It is never less representative than
    simple random sampling.
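
The two techniques can be compared in a short sketch. The sampling frame here is invented (90 "news" units and 10 "poetry" units), and the function names, the fixed seed and the rule of taking at least one unit per stratum are assumptions made for the illustration.

```python
import random

def simple_random_sample(frame, k, seed=0):
    """Draw k units from the whole sampling frame at random."""
    rng = random.Random(seed)
    return rng.sample(frame, k)

def stratified_random_sample(frame, stratum_of, fraction, seed=0):
    """Group units into strata, then sample each stratum at random."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for name in sorted(strata):
        units = strata[name]
        # Assumption: every stratum contributes at least one unit.
        k = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, k))
    return sample

# Hypothetical frame: each unit is (genre, id).
frame = [("news", i) for i in range(90)] + [("poetry", i) for i in range(10)]
picked = stratified_random_sample(frame, lambda u: u[0], 0.1)
print(sorted({genre for genre, _ in picked}))
```

With simple random sampling a draw of 10 units may contain no poetry at all; the stratified version cannot miss a stratum.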

32
Sampling
  • Sample size
  • Full texts: no balance; the peculiarities of
    individual texts may show through.
  • Text chunks (e.g. 2000 running words) are
    sufficient: frequent linguistic features are stable
    in their distribution and hence short text chunks
    are sufficient for their study (Biber 1993). Text
    initial, middle and end samples must be balanced.

33
Sampling
  • Proportion and number of samples
  • The number of samples across text categories
    should be proportional to their frequencies
    and/or weights in the target population in order
    for the resulting corpus to be considered as
    representative
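
As a worked illustration of proportional allocation: if a category makes up a given share of the target population, it receives the same share of the samples. The category names and counts below are invented for the example, and `proportional_allocation` is a hypothetical helper.

```python
def proportional_allocation(population_counts, total_samples):
    """Give each category a number of samples proportional to its size."""
    total = sum(population_counts.values())
    # Rounding means the allocations may not always sum exactly
    # to total_samples.
    return {
        category: round(total_samples * count / total)
        for category, count in population_counts.items()
    }

# Hypothetical population: number of texts per category.
population = {"press": 500, "fiction": 300, "academic": 200}
print(proportional_allocation(population, 50))
```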

34
What matters is the Research Question!
  • Claims of corpus representativeness and balance
    should be interpreted in relative terms as there
    is no objective way to balance a corpus or to
    measure its representativeness.
  • Representativeness is a fluid concept: the
    research question that one has in mind when
    building a corpus determines what is an
    acceptable balance for the corpus one should use
    and whether it is suitably representative.

35
Data collection
  • Spoken data must be transcribed from audio
    recordings.
  • Written text must be rendered machine-readable by
    keyboarding or OCR (Optical Character
    Recognition) scanning.
  • Language data so collected form a RAW CORPUS.

36
Corpus Mark-up
  • System of standard codes inserted into a
    document stored in electronic form to provide
    information about the text itself and govern
    formatting, printing and other processes.
  • Most widely used mark-up schemes
  • TEI (Text Encoding Initiative)
  • CES (Corpus Encoding Standard)

37
Corpus Mark-up
  • It is essential in corpus-building because
  • sampled texts are out of context and mark-up allows
    contextual information to be recovered
  • it provides more information than the file
    names alone (re. text types, sociolinguistic
    variables, textual information structure)
  • it adds value to the corpus because it allows for
    a broader range of questions to be addressed
  • it allows editorial comments to be inserted during
    the corpus building process.

38
Corpus Mark-up
  • Extra-textual and textual information must be
    kept separate from the corpus data.
  • Examples
  • COCOA mark-up scheme
  • <A WILLIAM SHAKESPEARE>
  • A: attribute name (author)
  • WILLIAM SHAKESPEARE: attribute value
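
A COCOA reference of the form `<A WILLIAM SHAKESPEARE>` can be read with a simple pattern: an attribute name, a space, then the attribute value, all inside angle brackets. The regular expression below is a minimal sketch that assumes this simple shape; real COCOA-encoded corpora may need more care.

```python
import re

# Attribute name, whitespace, then everything up to the closing bracket.
COCOA_TAG = re.compile(r"<(\w+)\s+([^>]+)>")

def parse_cocoa(line):
    """Return (attribute_name, attribute_value) pairs found in a line."""
    return [(name, value.strip()) for name, value in COCOA_TAG.findall(line)]

print(parse_cocoa("<A WILLIAM SHAKESPEARE>"))
```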

39
TEI Mark-up Scheme
  • Each individual text is a document consisting of
    a header and a body, in turn composed of
    different elements.
  • Ex. in the header there are 4 main elements
  • A file description <fileDesc>
  • An encoding description <encodingDesc>
  • A text profile <profileDesc>
  • A revision history <revisionDesc>
  • Tags can be nested, i.e. they can appear inside
    other elements.
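
The header structure described above can be sketched as a tiny XML document and read with Python's standard library. The document is deliberately simplified: the title text is made up, the four header elements are mostly empty, and the namespace that real TEI XML files declare is omitted.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free TEI-like document (an assumption for brevity).
doc = """<TEI>
  <teiHeader>
    <fileDesc><titleStmt><title>Sample text</title></titleStmt></fileDesc>
    <encodingDesc/>
    <profileDesc/>
    <revisionDesc/>
  </teiHeader>
  <text><body><p>Hello corpus.</p></body></text>
</TEI>"""

root = ET.fromstring(doc)
header = root.find("teiHeader")
# The four main header elements, in document order.
header_parts = [child.tag for child in header]
# Nested tags are reached with a path expression.
title = header.findtext("fileDesc/titleStmt/title")
print(header_parts, title)
```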

40
TEI Mark-up Scheme
  • It can be expressed using a number of different
    formal languages.
  • SGML (Standard Generalized Mark-up Language,
    used by the BNC)
  • XML (Extensible Mark-up Language)

41
CES Mark-up Scheme
  • Designed specifically for the encoding of
    language corpora.
  • Document-wide mark-up (bibliographical
    description, encoding description, etc.)
  • Gross structural mark-up (volume, chapter,
    paragraph, footnotes, etc.; it also specifies
    recommended character sets)
  • Mark-up for subparagraph structures (sentence,
    quotations, words, abbreviations, etc.)

42
CES Mark-up Scheme
  • It specifies a minimal encoding level that
    corpora must achieve to be considered
    standardized in terms of descriptive
    representation as well as general architecture.
  • 3 levels of standardization designed to achieve
    the goal of universal document interchange
  • Metalanguage level
  • Syntactic level
  • Semantic level

43
Corpus Annotation
  • Necessary in order to extract relevant
    information from corpora.
  • The process of adding interpretive,
    linguistic information to an electronic corpus of
    spoken and/or written language data
  • (Leech 1997)

44
Annotation vs. Mark-up
  • Corpus mark-up provides objective, verifiable
    information.
  • Annotation is concerned with interpretive
    linguistic information.

45
The advantages of annotation
  1. It makes extracting information easier, faster
    and enables human analysts to exploit and
    retrieve analyses of which they are not
    themselves capable.

46
The advantages of annotation
  • 2. Annotated corpora are reusable resources.
  • 3. Annotated corpora are multifunctional: they
    can be annotated for one purpose and reused
    for another.

47
The advantages of annotation
  • 4. Corpus annotation records a linguistic
    analysis explicitly.
  • 5. Corpus annotation provides a standard
    reference resource, a stable base of linguistic
    analyses, so that successive studies can be
    compared and contrasted on a common basis.

48
Criticisms of corpus annotation
  1. Annotation produces cluttered corpora
  2. Annotation imposes an analysis
  3. Annotation overvalues corpora, making them less
    accessible
  4. Is annotation accurate and consistent?

49
How are corpora annotated?
  • Automatic annotation
  • Computer-assisted annotation
  • Manual annotation
  • Sinclair (1992): the introduction of the human
    element in corpus annotation reduces consistency.

50
Types of annotation
  • Different types of annotation can be carried out
    with different means.
  • For some types automatic annotation is very
    accurate. Other types require post-editing,
    i.e. human correction.

51
Types of annotation
  • Corpora can be annotated at different levels of
    linguistic analysis.
  • Phonological level
  • Syllable boundaries (phonetic/phonemic
    annotation)
  • Prosodic features (prosodic annotation)

52
Types of annotation
  • Morphological level
  • Prefixes
  • Suffixes
  • Stems
  • (morphological annotation)

53
Types of annotation
  • Lexical level
  • Part of speech (POS Tagging)
  • Lemmas (lemmatization)
  • Semantic fields (semantic annotation)
  • Syntactic level
  • parsing
  • treebanking
  • bracketing

54
Types of annotation
  • Discourse level
  • Anaphoric relations (coreference annotation)
  • Speech acts (pragmatic annotation)
  • Stylistic features such as speech and thought
    presentation (stylistic annotation).

55
POS Tagging
  • POS tagging is the most common type of annotation.
  • Also known as grammatical tagging or
    morpho-syntactic annotation.
  • It provides the basis of further forms of
    analysis such as parsing and semantic
    annotation.
  • Many linguistic analyses, e.g. of the collocates of
    a word, depend heavily on POS tagging.

56
POS Tagging
  • It can be performed automatically with taggers
    like CLAWS
  • http://www.comp.lancs.ac.uk/ucrel/claws/
  • You can try it for free online.
  • Examples of tags: NN1 (singular noun), VVZ (verb in
    the third person singular of the simple present
    tense), VVD (verb in the simple past form), AJ0
    (adjective in the basic form), etc.

57
POS Tagging
  • Problems
  • Word segmentation (tokenization)
  • Multiwords (so that, in spite of)
  • Mergers (can't, gonna)
  • Variably spelled compounds (noticeboard,
    notice-board, notice board)
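
The segmentation problems listed above show up even in a toy tokenizer. The sketch below splits mergers in the style seen in CLAWS output (ca + n't, gon + na) and keeps hyphenated compounds together; the rules are hand-written assumptions for illustration, they are not how CLAWS itself works, and multiword units such as "in spite of" are not handled.

```python
import re

def tokenize(text):
    """Split text into word tokens, separating common English mergers."""
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text)  # can't -> ca n't
    text = re.sub(r"\b(\w+)('ve|'s|'re|'ll|'d|'m)\b", r"\1 \2", text)
    text = re.sub(r"\bgonna\b", "gon na", text)  # gonna -> gon na
    # Clitics, (possibly hyphenated) words, then single punctuation marks.
    return re.findall(r"n't|'\w+|\w+(?:-\w+)*|[^\w\s]", text)

print(tokenize("I've said we can't"))
print(tokenize("ain't gonna happen"))
# The three spellings of one compound yield different token counts.
for spelling in ("noticeboard", "notice-board", "notice board"):
    print(spelling, len(tokenize(spelling)))
```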

58
Lemmatization
  • Type of annotation that reduces the inflectional
    variants of words to their respective lexemes or
    lemmas as they appear in dictionary entries
  • do, does, did, done, doing → DO
  • corpus, corpora → CORPUS
  • Small capital letters are the convention.
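
At its simplest, lemmatization is a lookup from word form to lemma. The table below contains only the two examples from this slide; the fallback of treating an unknown form as its own lemma is an assumption of the sketch, and real lemmatizers rely on much larger lexicons and on POS tags.

```python
# Tiny lemma table built from the examples above.
LEMMAS = {
    "do": "DO", "does": "DO", "did": "DO", "done": "DO", "doing": "DO",
    "corpus": "CORPUS", "corpora": "CORPUS",
}

def lemmatize(tokens):
    """Map each token to its lemma, written in capitals by convention."""
    # Unknown forms fall back to their own upper-cased form (assumption).
    return [LEMMAS.get(token.lower(), token.upper()) for token in tokens]

print(lemmatize(["Corpora", "did", "help"]))
```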

59
Lemmatization
  • It is important in vocabulary studies and
    lexicography, e.g. in studying the distribution
    pattern of lexemes and improving dictionaries
    and computer lexicons.
  • It can be automatically performed.

60
Parsing
  • Once a corpus is POS tagged, it is possible to
    bring these morpho-syntactic categories into
    higher level syntactic relationships with one
    another, that is, to analyse the sentences in a
    corpus into their constituents.
  • Parsing consists in bracketing.
  • It can be automated but with a low precision
    rate.

61
Parsing
  • Example
    (S (NP Mary)
       (VP visited
           (NP a
               (ADJP very nice)
               boy)))
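
A bracketed parse is easy to read back into a tree. The sketch below takes one well-formed bracketing of the example sentence and turns it into nested lists; `parse_brackets` and `leaves` are illustrative helpers, not functions of any existing parser.

```python
import re

def parse_brackets(text):
    """Turn a bracketed parse into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        node = [tokens[pos + 1]]  # the constituent label, e.g. NP
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unbalanced brackets")
    return tree

def leaves(node):
    """Collect the words of a tree from left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

tree = parse_brackets("(S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))")
print(tree)
print(" ".join(leaves(tree)))
```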

62
Semantic annotation
  • It assigns codes indicating the semantic features
    or the semantic fields of the words in a text. It
    is knowledge-based, so it needs to be manual most
    of the time.
  • Two types
  • One marks the semantic relationships between the
    constituents in a sentence
  • One marks the semantic features of words in a
    text

63
Coreference annotation
  • Pronouns
  • Repetition
  • Substitution
  • Ellipsis
  • Computer-assisted at best.

64
Pragmatic annotation
  • Speech/dialogue acts in domain-specific dialogue.
  • The most coherent system is DRI (Discourse
    Representation Initiative).
  • 3 layers of coding
  • Segmentation (dividing dialogue in textual
    units, utterances)
  • Functional annotation (dialogue act annotation)
  • Utterance tags (applying utterance tags that
    characterize the role of the utterance as a
    dialogue act)

65
Pragmatic annotation
  • Utterance tags
  • Communicative status (intelligible, complete,
    etc.)
  • Information level and status (indicating the
    semantic content of the utterance and how it
    relates to the task in question)
  • Forward-looking communicative function
    (utterances that may constrain or affect the
    discourse, e.g. assert, request, question and
    offer)
  • Backward-looking communicative function
    (utterances that relate to previous parts of the
    discourse, e.g. accept, backchannelling, answer)

66
Stylistic annotation
  • It is particularly associated with stylistic
    features in literary texts.
  • An example: the representation of people's
    speech and thoughts, known as speech and thought
    presentation (STP)

67
Other types of tagging
  • Error tagging
  • Problem-oriented annotation

68
Types of corpora
  • Multilingual
  • Monolingual

69
Multilingual Corpora
  • Parallel corpora (source texts plus
    translations), e.g. the Canadian Hansard
  • Comparable corpora (monolingual subcorpora
    designed using the same sampling techniques),
    e.g. the Aarhus corpus of contract law
  • Multilingual
  • Bilingual

70
Multilingual Corpora
  • Important resources for translation and
    contrastive studies.
  • Multilingual corpora
  • give new insight into the languages compared
  • can be used to study language specific and
    universal features
  • illuminate differences between source texts and
    translations
  • can be used for a number of practical
    applications, in lexicography, language teaching,
    translation, etc.

71
Parallel Corpora
  • Bilingual vs. Multilingual
  • Unidirectional (from La to Lb or from Lb to La
    alone) vs. Bidirectional (from La to Lb and from
    Lb to La) vs. Multidirectional (from La to Lb, Lc
    etc.)

72
Comparable corpora
  • A corpus containing components that are collected
    using the same sampling techniques and similar
    balance and representativeness, e.g. the same
    proportions of the texts of the same genres in
    the same domains in a range of different
    languages in the same sampling period.

73
Comparable vs. parallel corpora
  • The sampling frame is essential for comparable
    corpora but not for parallel corpora because the
    texts are exact translations of each other.

74
Corpus Alignment
  • In order for us to be able to fully exploit
    parallel corpora, they need to be aligned.
  • Different types of alignment
  • Word-level alignment
  • Sentence-level alignment
  • Paragraph alignment

75
General Corpora
  • British National Corpus (100,106,008 words)
  • The American National Corpus
  • ICE-CUP

76
Specialized Corpora
  • Guangzhou Petroleum English Corpus (411,612 words
    of written English from the petrochemical domain)
  • HKUST Computer Science Corpus (1,000,000 words of
    written English sampled from undergraduate
    textbooks in computer science)
  • CPSA (Corpus of Professional Spoken American
    English)
  • MICASE (1,700,000 words of English spoken in the
    academic domain)

77
Written Corpora
  • BROWN Corpus (written texts, AE in 1961)
  • LOB Corpus (Comparable to BROWN Corpus, BE, early
    1960s)
  • FROWN Corpus (AE, Early 1990s)
  • FLOB Corpus (BE, Early 1990s)

78
Spoken Corpora
  • London-Lund Corpus (LLC)
  • Lancaster/IBM Spoken English Corpus (SEC)
  • Cambridge and Nottingham Corpus of Discourse in
    English (CANCODE)
  • Santa Barbara Corpus of Spoken American English
    (SBCSAE)
  • Wellington Corpus of Spoken New Zealand English
    (WSC)

79
Synchronic Corpora
  • Useful to compare varieties of English. Texts
    all date to the same period.
  • Brown and LOB
  • Frown and FLOB
  • International Corpus of English (ICE) (Texts
    produced after 1989)
  • BNC

80
Diachronic Corpora
  • Texts date to different periods in time. Ideal to
    study language change and history.
  • Brown/Frown
  • Lob/Flob
  • Helsinki Diachronic Corpus of English Texts
    (8th-18th century)
  • ARCHER Corpus: A Representative Corpus of
    Historical English Registers (BE and AE,
    1650-1990).

81
Learner/developmental Corpora
  • Foreign language or L2 acquisition (learner
    corpora, LC) / L1 acquired by children
    (developmental corpora, DC)
  • CHILDES (DC)
  • International Corpus of Learner English ICLE
    (LC)
  • Cambridge Learner Corpus (LC)

82
Monitor Corpora
  • Constantly supplemented with fresh material and
    keep increasing in size, though the proportion of
    text types included in the corpus remains
    constant.
  • Bank of English (BoE)
  • Global English Monitor Corpus
  • AVIATOR

83
The BNC
  • The British National Corpus (BNC) is a 100
    million word collection of samples of written and
    spoken language from a wide range of sources,
    designed to represent a wide cross-section of
    current British English, both spoken and written.

84
The BNC
  • The written part of the BNC (90%) includes, for
    example, extracts from regional and national
    newspapers, specialist periodicals and journals
    for all ages and interests, academic books and
    popular fiction, published and unpublished
    letters and memoranda, school and university
    essays, etc. The spoken part (10%) includes a
    large amount of unscripted informal conversation,
    recorded by volunteers selected from different
    age groups, regions and social classes in a
    demographically balanced way, together with
    spoken language collected in all kinds of
    different contexts, ranging from formal business
    or government meetings to radio shows and
    phone-ins.

85
The BNC
  • The corpus is encoded according to the
    Guidelines of the Text Encoding Initiative (TEI)
    to represent both the output from CLAWS
    (automatic part-of-speech tagger) and a variety
    of other structural properties of texts (e.g.
    headings, paragraphs, lists etc.). Full
    classification, contextual and bibliographic
    information is also included with each text in
    the form of a TEI-conformant header.

86
What sort of corpus is the BNC?
  • Monolingual: it deals with modern British
    English, not other languages used in Britain.
    However, non-British English and foreign language
    words do occur in the corpus.
  • Synchronic: it covers British English of the late
    twentieth century, rather than the historical
    development which produced it.
  • General: it includes many different styles and
    varieties, and is not limited to any particular
    subject field, genre or register. In particular,
    it contains examples of both spoken and written
    language.
  • Sample: for written sources, samples of 45,000
    words are taken from various parts of
    single-author texts. Shorter texts up to a
    maximum of 45,000 words, or multi-author texts
    such as magazines and newspapers, are included in
    full. Sampling allows for a wider coverage of
    texts within the 100 million limit, and avoids
    over-representing idiosyncratic texts.

87
BNC and Sketch Engine
  • Sketch Engine is an excellent user interface for
    querying the BNC.
  • Here are some screenshots.

88-100
(No Transcript)
101
An example of a POS-tagged text
  • I've been giving some thought to the whole idea
    of writing a book as of late (I've also been
    giving some thought to winning the lottery, and
    we can all see where that's got me) and it came
    to me while showering the other night that if I
    were to ever write a book (which ain't gonna
    happen, but let's just say for the sake of
    argument) I would bill myself as the anti-Francis
    Mayes.

102
An example of a POS-tagged text
  • I_PNP 've_VHB been_VBN giving_VVG some_DT0
    thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1
    of_PRF writing_VVG a_AT0 book_NN1 as_PRP21
    of_PRP22 late_AJ0 (_( I_PNP 've_VHB also_AV0
    been_VBN giving_VVG some_DT0 thought_NN1 to_PRP
    winning_VVG the_AT0 lottery_NN1 ,_, and_CJC
    we_PNP can_VM0 all_DT0 see_VVI where_AVQ that_DT0
    's_VHZ got_VVN me_PNP )_) and_CJC it_PNP came_VVD
    to_PRP me_PNP while_CJS showering_VVG the_AT0
    other_AJ0 night_NN1 that_CJT if_CJS I_PNP
    were_VBD to_TO0 ever_AV0 write_VVI a_AT0 book_NN1
    (_( which_DTQ ai_UNC n't_XX0 gon_VVG na_TO0
    happen_VVI ,_, but_CJC let_VM021 's_VM022
    just_AV0 say_VVI for_PRP the_AT0 sake_NN1 of_PRF
    argument_NN1 )_) I_PNP would_VM0 bill_NN1
    myself_PNX as_PRP the_AT0 anti-Francis_AJ0
    Mayes_NP0 ._.
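
Text in this word_TAG format is straightforward to process. The sketch below splits each item at its last underscore and then pulls out the nouns; the short input string is the opening of the tagged example above, and `read_tagged` is a hypothetical helper name.

```python
from collections import Counter

def read_tagged(text):
    """Split 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split at the last underscore, so punctuation items work too.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs

tagged = "I_PNP 've_VHB been_VBN giving_VVG some_DT0 thought_NN1 to_PRP the_AT0 whole_AJ0 idea_NN1"
pairs = read_tagged(tagged)
nouns = [word for word, tag in pairs if tag.startswith("NN")]
tag_counts = Counter(tag for _, tag in pairs)
print(nouns)
print(tag_counts.most_common(3))
```

Selecting by tag in this way is what makes searches such as "all singular nouns" possible without listing the words themselves.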