Corpus-Based Work - PowerPoint PPT Presentation

About This Presentation

Title:

Corpus-Based Work

Description:

Corpus-Based Work Chapter 4 Foundations of statistical natural language processing – PowerPoint PPT presentation

Slides: 23

Provided by: ahme84

Transcript and Presenter's Notes

Title: Corpus-Based Work

1
Corpus-Based Work

Chapter 4
Foundations of statistical natural language
processing

2
Introduction

Requirements of NLP work
Computers
Corpora
Application/Software
This section covers some issues concerning the
formats and problems encountered in dealing with
raw data
Low-level processing before actual work
Word/Sentence extraction

3
Getting Set Up

Computers
Memory requirements for large corpora
Statistical NLP methods involve counts required
to be accessed speedily
Corpora
A corpus is a special collection of textual
material collected according to a certain set of
criteria
Licensing
Most of the time free sources are not
linguistically marked-up
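
As a rough illustration of why counts must be cheap to access, here is a minimal sketch that keeps word counts in an in-memory hash table. The function name `count_words` and the sample lines are made up for this example; a real corpus would be read from files.

```python
from collections import Counter

def count_words(lines):
    """Count whitespace-separated, lower-cased tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Toy data standing in for a corpus (assumption, not from the slides)
sample = ["the cat sat on the mat", "The dog sat"]
counts = count_words(sample)
print(counts["the"], counts["sat"])  # lookups are constant-time on average
```

The whole table lives in memory, which is one reason large corpora push up memory requirements.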

4

Corpora
Representative sample
What we find for sample also holds for general
population
Balanced corpus
Each subtype of text matching predetermined
criterion of importance
Importance in statistical NLP
Representative corpus
In results type/domain of corpus should be
included

5

Software
Text editors
TextPad, Emacs, BBEdit
Regular expressions
Patterns as regular language
Programming language
C/C++ widely used (efficient)
Perl for text preparation and formatting
Built-in database and easy handling of
complicated structures make Prolog important
Java, as a pure object-oriented language, gives
automatic memory management

6
Looking at Text

Either in raw format or marked-up
Markup is used for putting some codes into data
file, giving some information about text
Issues in automatic processing
Junk formatting/content (Corpus Cleaning)
Case sensitivity (All capitalize)
Proper Nouns?
Stress through capitalization
Loss of contextual information
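
A small sketch of the information loss that case folding can cause. The token list is invented for illustration; the point is only that a proper noun and a common word collapse into one type after lower-casing.

```python
def fold_case(tokens):
    """Lower-case every token (the simplest form of case normalization)."""
    return [t.lower() for t in tokens]

# Invented example: "Brown" is a surname, "brown" a colour
tokens = ["Richard", "Brown", "painted", "the", "fence", "brown"]
folded = fold_case(tokens)

types_before = len(set(tokens))
types_after = len(set(folded))
print(types_before, types_after)
```

After folding there is one fewer distinct type, because the name and the colour can no longer be told apart.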

7

Tokenization
Text is divided into units called tokens
Treatment of punctuation marks?
What is a word?
Graphic word (Kucera and Francis 1967)
A string of contiguous alphanumeric characters
with white space on either side.
This is not a practical definition, even for
text in Latin script
Especially in news corpora, some odd entries can
be present, e.g. Micro$oft, C|net
Apart from these oddities there are some other
issues
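
To make the limits of the graphic-word definition concrete, here is a minimal sketch comparing two naive tokenizers. Both function names and the sample sentence are assumptions made for this example, not part of the original slides.

```python
import re

def graphic_words(text):
    """Whitespace-delimited chunks: the 'graphic word' idea taken literally."""
    return text.split()

def alnum_tokens(text):
    """Runs of ASCII letters and digits only; everything else is a separator."""
    return re.findall(r"[A-Za-z0-9]+", text)

text = "Micro$oft bought C|net, they said."
print(graphic_words(text))  # punctuation stays glued to words
print(alnum_tokens(text))   # odd names are broken into pieces
```

Neither output is what one would want: splitting on white space keeps `C|net,` and `said.` with their punctuation, while keeping only alphanumeric runs breaks `Micro$oft` into `Micro` and `oft`.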

8

Periods
Words are not always bounded by white spaces
(commas, semicolons and periods)
Periods are at the end of sentence and also at
the end of abbreviations
In abbreviations they should stay attached to the
word (Wash. vs. wash)
When abbreviations occur at the end of sentence
there is only one period present, performing both
functions
Within morphology, this phenomenon is referred to
as haplology

9

Single Apostrophes
Difficulties in dealing with constructions such
as I'll or isn't
The count of graphic words is 1 according to the
basic definition, but they should be counted as 2 words
1. S → NP VP
2. If we split, some funny words may occur in the
collection
Single quotes marking the end of a quotation
Possessive forms of words ending in s or z
Charles' Law, Muaz' book
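
A hedged sketch of splitting contractions into two tokens. The clitic list and the function name `split_clitic` are hypothetical and deliberately small; the example also shows the "funny words" that splitting can produce.

```python
import re

# Hypothetical, deliberately short list of clitics to split off
CLITIC = re.compile(r"^(\w+)(n't|'ll|'re|'ve|'s|'d|'m)$", re.IGNORECASE)

def split_clitic(token):
    """Return [stem, clitic] if the token ends in a known clitic, else [token]."""
    m = CLITIC.match(token)
    return [m.group(1), m.group(2)] if m else [token]

for tok in ["I'll", "isn't", "can't", "Charles'", "dog"]:
    print(tok, split_clitic(tok))
```

Splitting `can't` this way yields `ca` and `n't`, neither of which is an ordinary word, and a trailing apostrophe as in `Charles'` is left alone because it could be either a possessive or a closing quote.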

10

Hyphenation
Does sequence of letters with hyphen in-between,
count as one or two?
Line ending hyphens
Remove hyphen at the end of line and join both
parts together
If there is some other type of hyphen at end of
line (haplology) then? (text-based)
Mostly in electronic text line breaking hyphens
are not present, but there are some other
issues.
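
The line-ending rule above can be sketched as a single regular-expression substitution. The function name `join_line_hyphens` is made up for this example, and the second call shows where the simple rule goes wrong.

```python
import re

def join_line_hyphens(text):
    """Remove a hyphen directly before a newline and join the two word parts."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print(join_line_hyphens("tokeni-\nzation is hard"))
print(join_line_hyphens("a text-\nbased medium"))
```

The first call restores `tokenization`, but the second produces `textbased`: when a genuine hyphen happens to fall at the line end, the rule deletes it as well.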

11

Some things with hyphens are clearly treated as
one word
E-mail, A-1-Plus and co-operate
Other cases are arguable
Non-lawyer, pro-Arabs and so-called
The hyphens here are called lexical hyphens
Inserted before or after small word formatives to
split vowel sequence in some cases
Third class of hyphens is inserted to indicate
correct grouping
A text-based medium
A final take-it-or-leave-it offer

12

Inconsistencies in hyphenation
Cooperate vs. Co-operate
So we can have multiple forms treated as either
one word or two
Lexemes
Single dictionary entry with single meaning
Homographs
Two lexemes have overlapping forms/nature
Saw

13

Word segmentation in other languages
Opposite issue
White spaces but not word boundary
the New York-New Haven railroad
I couldn't work the answer out
In spite of, in order to, because of
Variant coding of information of certain semantic
type
Phone numbers 42-111-128-128
Problem in information extraction
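
One simple way to cope with variant coding is to normalize before comparing. The sketch below strips everything except digits from a phone number; the function name and the variant spellings are assumptions for illustration, and real information extraction would need far more care (country codes, extensions and so on).

```python
import re

def normalize_phone(s):
    """Keep only the digits so that differently formatted numbers compare equal."""
    return re.sub(r"\D", "", s)

variants = ["42-111-128-128", "42 111 128 128", "(42) 111.128.128"]
normalized = {normalize_phone(v) for v in variants}
print(normalized)
```

All three spellings reduce to the same digit string, so they can be recognized as one number.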

14

Speech Corpora Issues
More contractions
Various phonetic representations
Pronunciation variants
Sentence fragments
Filler words
Morphology
Keep various forms separately or collapse them?
e.g. sit, sits, sat
Grouping them together and working with lexemes
(Initially looks easier)

15

Stemming
Strips off affixes
Lemmatization
To extract the lemma or lexeme from inflected
form
Empirical research within IR shows that stemming
does not help in performance
Information loss (operating → operate)
Closely related tokens are grouped in chunks,
which are more useful
Not good for morphologically rich languages
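
A deliberately crude suffix stripper, to show what "strips off affixes" means and where it falls short. This is not the Porter stemmer or any published algorithm; the suffix list and the minimum stem length of 3 are arbitrary choices made for this sketch.

```python
# Arbitrary suffix list for illustration, tried in this order
SUFFIXES = ("ing", "es", "s", "ed")

def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["operating", "operates", "operated", "sits", "sat"]:
    print(w, crude_stem(w))
```

The forms of operate all collapse to `operat`, losing the distinction between them, while `sat` is not related to `sit` at all: irregular forms need lemmatization with a lexicon rather than affix stripping.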

16

Sentences
What is a sentence?
In English, something ending with ., ? or !
Abbreviations issues
Other issues
"You reminded me," she remarked, "of your mother."
Nested things are classified as clauses
Quotation marks after punctuation
The period is not the sentence boundary in this case

17

Sentence boundary (SB) detection
Place tentative SB after all occurrences of .?!
Move the boundary after quotation mark (if any)
Disqualify a period boundary if it is
Preceded by an abbreviation that does not normally
end a sentence and is usually followed by a
capitalized name, e.g. Prof., Dr.
Or preceded by an abbreviation and not followed by
a capitalized word, as with etc., Jr.
Disqualify a boundary with ? or !
If followed by a lower case letter
Regard all others as correct SBs
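
The heuristic on this slide can be sketched roughly as follows. The two abbreviation sets are hypothetical and far too short for real use, tokens are obtained by splitting on white space, and the function name `split_sentences` is made up for this example.

```python
# Hypothetical abbreviation lists, for illustration only
TITLE_ABBREVS = {"Prof.", "Dr.", "Mr.", "Mrs.", "vs."}  # rarely sentence-final
OTHER_ABBREVS = {"etc.", "Jr.", "Inc."}                  # may end a sentence

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        core = tok.rstrip("\"'")  # the boundary moves after a closing quote
        if not core or core[-1] not in ".?!":
            continue  # no tentative boundary here
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if core[-1] == ".":
            if core in TITLE_ABBREVS:
                continue  # abbreviation that does not end a sentence
            if core in OTHER_ABBREVS and nxt is not None and not nxt[0].isupper():
                continue  # abbreviation not followed by a capitalized word
        elif nxt is not None and nxt[0].islower():
            continue  # ? or ! followed by a lower-case letter
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = 'Dr. Smith met Prof. Jones, Smith Jr. and others etc. at noon. "Really?" she asked. Yes!'
for s in split_sentences(text):
    print(s)
```

On this invented input the sketch finds three sentences. It inherits the weaknesses of the heuristic itself: an abbreviation missing from the lists is treated as a sentence end.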

18

Riley (1989) used classification trees for SB
detection
Features of trees included case and length of
words preceding or following a period and
probabilities of words to occur before and after
a sentence boundary
It required large quantity of labeled data
Palmer and Hearst used POS of such words and
implemented with neural networks (98-99%
accurate)
In other languages?

19

Marked-up Data
Some sort of code is used to provide information
(mostly SGML, XML)
It can be done automatically, manually or by a
mixture of both (semi-automatic)
Some texts mark up just sentence and paragraph
boundaries
Others mark up more than this basic information,
e.g. Penn Treebank (full syntactic structure)
Common mark up is POS tagging
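
As a minimal sketch of reading text that marks up only paragraph and sentence boundaries, the snippet below parses a tiny XML fragment with Python's standard library. The element names `text`, `p` and `s` are assumptions chosen for this example; actual corpora define their own schemes.

```python
import xml.etree.ElementTree as ET

# Invented fragment: <p> marks paragraphs, <s> marks sentences
doc = """<text>
  <p><s>Corpora need cleaning.</s><s>Markup helps.</s></p>
  <p><s>Tags carry information.</s></p>
</text>"""

root = ET.fromstring(doc)
paragraphs = [[s.text for s in p.findall("s")] for p in root.findall("p")]
print(paragraphs)
```

With the boundaries encoded in the markup, no sentence boundary detection is needed: the structure is read directly from the tags.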

20

Grammatical Tagging
Generally done with conventional POS categories
like noun, verb, etc.
Also some information regarding the nature of the
words, like plurality of nouns or superlative
forms of adjectives
Tag set
The most influential tag sets have been the ones
used to tag the American Brown corpus and the
Lancaster-Oslo-Bergen corpus

21

Size of tag sets
Brown 87 basic tags, 179 total tags
Penn 45
CLAWS1 132
Penn tag set is widely used in computational work
Tags are different in different tag sets
Larger tag sets obviously have fine-grained
distinctions
Detail level is according to domain of corpora
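
To illustrate what fine-grained distinctions mean in practice, here is a small sketch that collapses a few Penn Treebank tags into coarser classes. The tags on the left are real Penn tags; the coarse labels, the mapping itself and the function name `coarsen` are invented for this example.

```python
# Penn Treebank tags (left) mapped to made-up coarse labels (right)
COARSE = {
    "NN": "NOUN", "NNS": "NOUN",               # singular vs. plural noun
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",   # base, comparative, superlative
    "VB": "VERB", "VBD": "VERB",               # base form vs. past tense
}

def coarsen(tagged):
    """Replace each fine-grained tag by its coarse class; unknown tags pass through."""
    return [(word, COARSE.get(tag, tag)) for word, tag in tagged]

tagged = [("dogs", "NNS"), ("biggest", "JJS"), ("ran", "VBD"), ("the", "DT")]
print(coarsen(tagged))
```

The coarse version drops plurality, degree and tense, which is exactly the information a larger tag set is designed to keep.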

22

The design of tag set
Grammatical class of word
Features to tell the behavior of the word
Part of Speech
Semantic grounds
Syntactic distributional grounds
Morphological grounds
Splitting tags in further categories gives
improved information but makes classification
harder
There is not a simple relationship between tag
set size and performance of taggers
