Introduction to corpus linguistics


Slides: 27

Provided by: V109


Transcript and Presenter's Notes


1
Introduction to corpus linguistics
2
Corpus

The old-school concept
A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse
(The Oxford Companion to the English Language)
The modern view
A collection of naturally occurring language text, chosen to characterize a state or variety of a language
(John Sinclair, Corpus, Concordance, Collocation, OUP)

3
Corpus vs. archive

Text archive
Collection of texts in their original format
(Oxford Text Archive, http://ota.ox.ac.uk/)
Corpus
Texts collected and processed in a unified, systematic manner
(British National Corpus, http://www.natcorp.ox.ac.uk/)

6
Short history

A brief mention of just a select few
Brown Corpus (Brown University)
1 m words
15 genres
500 samples of 2,000 words each
Area: US
Time: 1961
LOB Corpus (Lancaster-Oslo/Bergen)
British English counterpart of Brown

7
Cobuild

Major corpus initiative by Collins and Birmingham University (John Sinclair)
1991: 20 m words
-> Bank of English, currently 450 m words
http://www.cobuild.collins.co.uk

8
British National Corpus

100 m words, careful selection
10% spoken material
time span: from 1960 (fiction), from 1975 (non-fiction)
40,000-50,000-word texts
TEI-compliant SGML coding
http://www.comp.lancs.ac.uk/ucrel/bncindex/

10
International Corpus of English

20 corpora of 1 m words each, devoted to varieties of English around the world
500 texts (300 spoken, 200 written) of 2,000 words each
time span: 1990-1996
ICE-GB available in a demo version
syntactic annotation, graphical tool ICECUP

12
Corpus processing: tokenization

Preprocessing
tokenization: segmenting the text into
sentences (sometimes tricky: sentence delimiters can occur in mid-sentence positions)
words (problem: multi-word units)
Normalization
restoring the full forms of clitics and abbreviations ("can't", "I've")
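The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not a production tokenizer: the abbreviation list and the clitic expansion table are small hypothetical samples, and multi-word units are not handled at all.

```python
# Hypothetical abbreviation list; a real tokenizer needs a far larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "univ.", "etc."}

# Illustrative clitic expansions (assumed for this sketch only).
EXPANSIONS = {"can't": "can not", "i've": "I have"}


def split_sentences(text):
    """Split after . ! ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without a final delimiter
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Strip surrounding punctuation and expand known clitics."""
    words = []
    for raw in sentence.split():
        word = raw.strip('.,!?;:"()')
        if word:
            words.extend(EXPANSIONS.get(word.lower(), word).split())
    return words


for sentence in split_sentences("Dr. Smith can't come. I've seen him."):
    print(tokenize(sentence))
```

The period in "Dr." is the kind of mid-sentence delimiter the slide warns about: without the abbreviation check the first sentence would be cut after one token.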

13
Corpus processing: tagging

Tagging
labelling every word with its part-of-speech (POS) category
Problem: ambiguity
out of context, a word can belong to different parts of speech or have different analyses within the same POS
English: set (N) vs. set (V)
Hungarian: bánt, either 'bánik' (VBD) or 'bánt' (VBZ)
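The ambiguity problem can be made concrete with a toy lexicon that lists every tag a word form may carry out of context. The entries and Penn-style tag names below are assumptions chosen for illustration; they do not come from a real lexicon file.

```python
# Toy lexicon: candidate tags per word form, out of context.
# All entries are illustrative assumptions.
LEXICON = {
    "the": {"DT"},
    "they": {"PRP"},
    "set": {"NN", "VB"},
    "table": {"NN", "VB"},
    "bánt": {"VBD", "VBZ"},
}


def possible_tags(word):
    """Return the candidate tags for a word; unknown words get an empty set."""
    return LEXICON.get(word.lower(), set())


def ambiguous_words(words):
    """Return the words that have more than one candidate tag."""
    return [w for w in words if len(possible_tags(w)) > 1]


print(ambiguous_words(["They", "set", "the", "table"]))
```

A lexicon lookup alone can only list the candidates; choosing among them requires context.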

14
Corpus processing: disambiguation

Disambiguation
determining the correct analysis in context
Two approaches
both need a manually corrected training corpus
statistical
Hidden Markov model
calculating probabilities within a span of usually one or two words
success rate can be around 98%
rule-based
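The statistical approach can be sketched as a bigram Hidden Markov model decoded with the Viterbi algorithm, i.e. a context span of one preceding word. All probabilities below are invented for the example; in practice they would be estimated from a manually corrected training corpus, and the tiny tagset is an assumption as well.

```python
import math

TAGS = ["DT", "NN", "PRP", "VB"]

# Invented probabilities, for illustration only.
START = {"DT": 0.5, "NN": 0.1, "PRP": 0.3, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.01, "NN": 0.90, "PRP": 0.01, "VB": 0.08},
    "NN": {"DT": 0.10, "NN": 0.20, "PRP": 0.10, "VB": 0.60},
    "PRP": {"DT": 0.05, "NN": 0.05, "PRP": 0.05, "VB": 0.85},
    "VB": {"DT": 0.50, "NN": 0.20, "PRP": 0.20, "VB": 0.10},
}
EMIT = {
    "DT": {"the": 0.9},
    "NN": {"set": 0.3, "table": 0.3},
    "PRP": {"they": 0.9},
    "VB": {"set": 0.4},
}
UNSEEN = 1e-6  # crude smoothing for word/tag pairs missing from EMIT


def emission(tag, word):
    return EMIT[tag].get(word, UNSEEN)


def viterbi(words):
    """Return the most probable tag sequence under the toy bigram HMM."""
    if not words:
        return []
    # Each column maps tag -> (log probability, previous tag).
    columns = [{t: (math.log(START[t] * emission(t, words[0])), None) for t in TAGS}]
    for word in words[1:]:
        previous, column = columns[-1], {}
        for tag in TAGS:
            best_prev = max(TAGS, key=lambda p: previous[p][0] + math.log(TRANS[p][tag]))
            score = previous[best_prev][0] + math.log(TRANS[best_prev][tag])
            column[tag] = (score + math.log(emission(tag, word)), best_prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["the", "set"]))
print(viterbi(["they", "set", "the", "table"]))
```

With these numbers "set" is resolved as a noun after a determiner and as a verb after a pronoun, which is the kind of local context the slide refers to.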

15
Syntactic annotation

Difficult to do on such a scale
shallow parsing
Treebank: a collection of syntactically analyzed sentences
Penn Treebank
http://www.cis.upenn.edu/treebank/
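Treebanks commonly store each analyzed sentence as a labelled bracketing. The sketch below reads one such bracketing into nested tuples and recovers the tagged words; the example sentence is invented rather than taken from the Penn Treebank, and the reader handles only this simple well-formed notation.

```python
def parse_bracketed(text):
    """Parse '(LABEL child ...)' notation into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        assert tokens[position] == "(", "expected an opening bracket"
        position += 1
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word
                position += 1
        position += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_leaves(tree):
    """Collect (word, POS) pairs from the pre-terminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_leaves(child))
    return pairs


tree = parse_bracketed("(S (NP (DT the) (NN corpus)) (VP (VBZ grows)))")
print(tagged_leaves(tree))
```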

16
Recent trends

Word sense disambiguation (SENSEVAL)
http://www.itri.brighton.ac.uk/events/senseval/
Message understanding
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html
SEMANTIC WEB
making information on the web understandable for
machines
a vision requiring a huge effort, not clear
whether feasible at all

17
Representative sample?

A corpus of any size is inevitably a sample
Of what?
Two approaches
sampling speakers: demographic sampling
sampling their output: text-type sampling

18
The notion of representativeness

Sample vs. population
the sample should be proportional to the population for a given feature
example of demographic sampling:
if we know from census figures that 48% of people living in Budapest are male,
we should compile our sample so that 48% of the informants are male
-> our sample is representative of Budapest residents for gender
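Proportional sampling amounts to turning population percentages into informant quotas. The sketch below uses the 48% figure from the example; the 52% complement and the sample sizes are assumptions made for illustration. Leftover places after rounding down go to the groups with the largest remainders.

```python
def quotas(percentages, sample_size):
    """Allocate informants per group in proportion to population percentages.

    percentages: mapping of group -> whole-number percentage (should sum to 100).
    """
    # Integer arithmetic avoids floating-point rounding surprises.
    scaled = {group: pct * sample_size for group, pct in percentages.items()}
    result = {group: value // 100 for group, value in scaled.items()}
    leftover = sample_size - sum(result.values())
    by_remainder = sorted(scaled, key=lambda g: scaled[g] % 100, reverse=True)
    for group in by_remainder[:leftover]:
        result[group] += 1
    return result


# 48% male comes from the example above; 52% female is the assumed complement.
print(quotas({"male": 48, "female": 52}, 200))
```

For a sample of 200 informants this yields 96 male and 104 female informants, which matches the population ratio for gender and for no other feature.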

19
Trouble with representativeness

What should be the units of sampling?
Registers, text types, genres, etc.
But there is no independent evidence about their ratio in the totality of language output
-> representativeness is an ideal, but impossible to implement

20
Approaches to Representativeness

Douglas Biber
Rejects notion of proportional sampling
Sample should be as varied as possible
Representativeness measured in terms of wide
variety of text types included in the sample

21
The Web as a corpus?

Pro
immense database
dynamically growing
ideal 'quick and dirty' method

Cons
lots of rubbish, irrelevant data
difficult to extract hits
no language analysis
only string query, which is crude
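Why a plain string query is crude can be shown on a made-up sample text: a substring search also counts hits inside longer words, and even a whole-word search cannot tell the noun from the verb, because no language analysis is available. The sample text and function names are hypothetical.

```python
import re


def substring_hits(text, query):
    """Count raw, non-overlapping substring matches, ignoring case."""
    return text.lower().count(query.lower())


def word_hits(text, query):
    """Count whole-word matches only, ignoring case."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


sample = "They set the table. The set was reset after sunset."
print(substring_hits(sample, "set"))  # also matches inside "reset" and "sunset"
print(word_hits(sample, "set"))       # verb and noun uses are still conflated
```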

22
One quick example

Representativity or representativeness
Throw the two words at Google and have a look at
the figures
Think about the conclusions
There are special front-end sites
