Title: NLP/CL: Review
1 NLP/CL Review
School of Computing FACULTY OF ENGINEERING
- Eric Atwell, Language Research Group
- (with thanks to other contributors)
2 Objectives of the module
- On completion of this module, students should be able to:
- understand the theory and terminology of empirical modelling of natural language
- understand and use algorithms, resources and techniques for implementing and evaluating NLP systems
- be familiar with some of the main language engineering application areas
- appreciate why unrestricted natural language processing is still a major research task.
- In a nutshell:
- Why NLP is difficult: language is a complex system
- How to solve it? Corpus-based machine-learning approaches
- Motivation: applications of The Language Machine
3 The main sub-areas of linguistics
- Phonetics and Phonology: the study of linguistic sounds, or speech.
- Morphology: the study of the meaningful components of words.
- Syntax: the study of the structural relationships between words.
- Semantics: the study of the meanings of words, phrases, sentences.
- Discourse: the study of linguistic units larger than a single utterance.
- Pragmatics: the study of how language is used to accomplish goals.
4 Python, NLTK, WEKA
- Python: a good programming language for NLP
- Interpreted
- Object-oriented
- Easy to interface to other things (text files, web, DBMS)
- Good ideas from Java, Lisp, Tcl, Perl
- Easy to learn
- FUN!
- Python NLTK: Natural Language Toolkit, with demos and tutorials (a small taster follows below)
- WEKA: machine learning toolkit with classifiers, e.g. J48 decision trees
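A minimal NLTK taster (an illustrative sketch, not from the slides; it assumes NLTK is installed along with its tokenizer and tagger data packages):

```python
import nltk

text = "Colorless green ideas sleep furiously."
tokens = nltk.word_tokenize(text)   # split the text into word tokens
tagged = nltk.pos_tag(tokens)       # assign a PoS tag to each token
print(tagged)
# e.g. [('Colorless', 'JJ'), ('green', 'JJ'), ('ideas', 'NNS'), ...]
# (exact tags may vary with the tagger model)
```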
5 Why is NLP difficult?
- Computers are not brains
- There is evidence that much of language understanding is built into the human brain
- Computers do not socialize
- Much of language is about communicating with people
- Key problems:
- Representation of meaning and hidden structure
- Language presupposes knowledge about the world
- Language is ambiguous: a message can have many interpretations
- Language presupposes communication between people
6 Ambiguity: Grammar (PoS) and Meaning
- Iraqi Head Seeks Arms
- Juvenile Court to Try Shooting Defendant
- Teacher Strikes Idle Kids
- Kids Make Nutritious Snacks
- British Left Waffles on Falkland Islands
- Red Tape Holds Up New Bridges
- Bush Wins on Budget, but More Lies Ahead
- Hospitals are Sued by 7 Foot Doctors
- (Headlines leave out punctuation and function-words)
- Lynne Truss, 2003. Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation
7 The Role of Memorization
- Children learn words quickly
- Around age two they learn about 1 word every 2 hours
- (or roughly 9 words per day, across waking hours)
- Often only one exposure is needed to associate a meaning with a word
- Can make mistakes, e.g. overgeneralization: "I goed to the store."
- Exactly how they do this is still under study
- Adult vocabulary
- Typical adult: about 60,000 words
- Literate adults: about twice that.
8 But there is too much to memorize!
- establish
- establishment
- (the establishment of the Church of England as the official state church)
- disestablishment
- antidisestablishment
- antidisestablishmentarian
- antidisestablishmentarianism
- (a political philosophy that is opposed to the separation of church and state)
- MAYBE we don't remember every word separately
- MAYBE we remember MORPHEMES and how to combine them
9 Rationalism v Empiricism
- Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary)
- Noam Chomsky, 1957, Syntactic Structures
- Argued that we should build models through introspection
- A language model is a set of rules thought up by an expert
- Like Expert Systems
- Chomsky thought data was full of errors; better to rely on linguists' intuitions
10 Empiricism v Rationalism
- Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary)
- The field was stuck in the rationalist approach for quite some time: linguistic models built for a specific example did not generalise.
- A new approach started around 1990: Corpus Linguistics
- Well, not really new, but in the 50s to 80s they didn't have the text, disk space, or GHz
- Main idea: machine learning from CORPUS data
- How to do corpus linguistics:
- Get a large text collection (a corpus; plural: corpora)
- Compute statistical models over the words/PoS/parses/... in the corpus
- Surprisingly effective
11 Example Problem
- Grammar checking example
- Which word to use? <principal> or <principle>
- Empirical solution: look at which words surround each use
- "I am in my third year as the principal of Anamosa High School."
- "School-principal transfers caused some upset."
- "This is a simple formulation of the quantum mechanical uncertainty principle."
- "Power without principle is barren, but principle without power is futile." (Tony Blair)
12 Using Very Large Corpora
- Keep track of which words are the neighbors of each spelling in well-edited text, e.g.
- Principal: high, school
- Principle: rule
- At grammar-check time, choose the spelling best predicted by the probability of co-occurring with the surrounding words (a sketch follows below).
- No need to understand the meaning!?
- Surprising results:
- Log-linear improvement even up to a billion words!
- Getting more data is better than fine-tuning algorithms!
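A minimal sketch of this idea (not from the slides; the cooccur counts below are hypothetical, standing in for neighbour counts harvested from a large well-edited corpus):

```python
# Choose between confusable spellings by co-occurrence with context words.
# The counts below are invented for illustration.
cooccur = {
    "principal": {"school": 120, "high": 85, "rule": 2},
    "principle": {"school": 3, "high": 1, "rule": 95},
}

def choose_spelling(candidates, context_words):
    """Pick the candidate whose recorded neighbours best match the context."""
    def score(word):
        counts = cooccur.get(word, {})
        return sum(counts.get(w, 0) for w in context_words)
    return max(candidates, key=score)

print(choose_spelling(["principal", "principle"], ["high", "school"]))
# -> principal
```

A real system would smooth these counts and weight them by distance from the target word, but even raw counts show why no understanding of meaning is needed.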
13 The Effects of LARGE Datasets
- From Banko & Brill, 2001. "Scaling to Very Very Large Corpora for Natural Language Disambiguation", Proc. ACL
14 Corpus, word tokens and types
- Corpus: text selected by language, genre, domain, ...
- Brown, LOB, BNC, Penn Treebank, MapTask, CCA, ...
- Corpus annotation: text headers, PoS, parses, ...
- Corpus size is its number of words, which depends on tokenisation
- We can count word tokens, word types, and the type-token distribution (see the sketch below)
- Lexeme/lemma is the root form, v inflections (be v am/is/was)
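A small sketch of token and type counting with NLTK (assumes the Gutenberg corpus data has been downloaded with nltk.download):

```python
from nltk.corpus import gutenberg

tokens = gutenberg.words("austen-emma.txt")    # all word tokens in the text
types = set(w.lower() for w in tokens)         # distinct word types
print(len(tokens), "tokens,", len(types), "types")
print("type/token ratio:", len(types) / len(tokens))
```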
15 Tokenization and Morphology
- Tokenization: by whitespace, regular expressions, ...
- Problems: "It's", "data-base", "New York"
- Jabberwocky shows we can break words into morphemes
- Morpheme types: root/stem, affix, clitic
- Derivational vs. Inflectional
- Regular vs. Irregular
- Concatenative vs. Templatic (root-and-pattern)
- Morphological analysers: Porter stemmer, Morphy, PC-Kimmo (a sketch follows below)
- Morphology by lookup: CatVar, CELEX, OALD
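A sketch contrasting the Porter stemmer with WordNet's Morphy-based lemmatiser in NLTK (assumes the 'wordnet' data package is available):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()   # looks words up via WordNet's Morphy

for w in ["establishes", "established", "was"]:
    print(w, "->", stemmer.stem(w), "/", lemmatizer.lemmatize(w, pos="v"))
# 'was' stems to the non-word 'wa', but lemmatises to the root form 'be'
```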
16 Corpus word-counts and n-grams
- FreqDist counts of tokens and their distribution can be useful
- E.g. find main characters in Gutenberg texts
- E.g. compare word-lengths in different languages
- Humans can predict the next word
- N-gram models are based on counts in a large corpus
- Auto-generate a story ... (but it gets stuck in a local maximum; see the sketch below)
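A sketch of FreqDist counting and greedy bigram generation with NLTK (assumes the Gutenberg data; always picking the single most likely next word is exactly what gets stuck in a local maximum):

```python
import nltk
from nltk.corpus import gutenberg

words = [w.lower() for w in gutenberg.words("austen-emma.txt")]
fd = nltk.FreqDist(words)
print(fd.most_common(10))          # the commonest word tokens

# Condition each word on its predecessor, then generate greedily.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
word = "she"
for _ in range(15):
    print(word, end=" ")
    word = cfd[word].max()         # most likely next word; soon loops
```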
17 Word-counts follow Zipf's Law
- Zipf's law applies to a word type-token frequency distribution: frequency is proportional to the inverse of the rank in a ranked list
- f × r = k, where f is frequency, r is rank, and k is a constant
- i.e. a few very common words, a small to medium number of middle-frequency words, and a long tail of low-frequency (1) words
- Chomsky argued against corpus evidence as it is finite and limited compared to introspection. Zipf's law shows that many words/structures appear only once or not at all in a given corpus, supporting the argument that corpus evidence is limited compared to introspection
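A quick empirical check of f × r = k (a sketch over one NLTK Gutenberg text; the product drifts, but should stay roughly constant across the ranked list):

```python
import nltk
from nltk.corpus import gutenberg

fd = nltk.FreqDist(w.lower() for w in gutenberg.words("austen-emma.txt"))
for rank, (word, freq) in enumerate(fd.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        print(rank, word, freq, freq * rank)   # f * r stays in one ballpark
```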
18 Kilgarriff's Sketch Engine
- Sketch Engine shows a Word Sketch, a list of collocates: words co-occurring with the target word more frequently than predicted by independent probabilities
- A lexicographer can colour-code groups of related collocates, indicating different senses or meanings of the target word
- With a large corpus the lexicographer should find all current senses, better than relying on intuition/introspection
- Large user-base of experience; used in the development of several published dictionaries for English
- For minority languages with few existing corpus resources, Sketch Engine is combined with Web-Bootcat to enable lexicographers to collect their own Web-as-Corpus
19 Parts of Speech
- Parts of Speech groups words into grammatical categories
- and separates different functions of a word
- In English, many words are ambiguous: 2 or more PoS-tags
- Very simple tagger: everything is NN
- Better PoS-taggers: unigram, bigram, trigram, Brill, ... (a sketch follows below)
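A sketch of the baseline taggers named above, built with NLTK on the Brown news corpus (assumes the 'brown' data package has been downloaded):

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories="news")
default = nltk.DefaultTagger("NN")                      # everything is NN
unigram = nltk.UnigramTagger(tagged_sents, backoff=default)
bigram = nltk.BigramTagger(tagged_sents, backoff=unigram)
print(bigram.tag("The principal seeks new arms".split()))
```

Each tagger backs off to the simpler one when it has no evidence, so rare contexts still get the unigram or NN guess.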
20 Training and Testing of Machine Learning Algorithms
- Algorithms that learn from data see a set of examples and try to generalize from them.
- Training set
- Examples trained on
- Test set
- Also called held-out data or unseen data
- Use this for testing your algorithm
- Must be separate from the training set
- Otherwise, you cheated!
- Gold Standard
- A test set that a community has agreed on and uses as a common benchmark; use for final evaluation (a split sketch follows below)
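A sketch of keeping training and test data separate (the 90/10 split is an arbitrary choice for illustration; assumes the Brown corpus data):

```python
import nltk
from nltk.corpus import brown

tagged = brown.tagged_sents(categories="news")
split = int(len(tagged) * 0.9)
train, test = tagged[:split], tagged[split:]   # final 10% is held out

tagger = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger("NN"))
print(tagger.evaluate(test))   # accuracy on unseen data only
# (the method is called tagger.accuracy(test) in recent NLTK versions)
```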
21 Grammar and Parsing
- Context-Free Grammars and Constituency
- Some common CFG phenomena for English
- Sentence-level constructions
- NP, PP, VP
- Coordination
- Subcategorization
- Top-down and Bottom-up Parsing (a toy CFG example follows below)
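A toy CFG and chart parser in NLTK (the grammar is a made-up miniature, just enough to show PP-attachment ambiguity):

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'dog' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the man saw a dog with a telescope".split()):
    print(tree)   # two parses: the PP attaches to the NP or to the VP
```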
22 Problems with context-free grammars and parsers
- Parse-trees show the syntactic structure of sentences
- Key constituents: S, NP, VP, PP
- You can draw a parse-tree and the corresponding CFG
- Problems with Context-Free Grammar:
- Coordination: X → X and X is a meta-rule, not a strict CFG rule
- Agreement: needs duplicate CFG rules for singular/plural etc.
- Subcategorization: needs separate CFG non-terminals for trans/intrans/...
- Movement: the object/subject of a verb may be moved in questions
- Dependency parsing captures deeper semantics but is harder
- Parsing: top-down v bottom-up v combined
- Ambiguity causes backtracking, so a CHART PARSER stores partial parses
23 Parsing sentences left-to-right
- "The horse raced past the barn"
- [S [NP [A the] [N horse]] [VP [V raced] [PP [I past] [NP [A the] [N barn]]]]]
- "The horse raced past the barn fell"
- [S [NP [NP [A the] [N horse]] [VP [V raced] [PP [I past] [NP [A the] [N barn]]]]] [VP [V fell]]]
24 Chunking or Shallow Parsing
- Break text up into non-overlapping contiguous subsets of tokens.
- Shallow parsing or chunking is useful for:
- Entity recognition: people, locations, organizations
- Studying linguistic patterns:
- gave NP
- gave up NP in NP
- gave NP NP
- gave NP to NP
- Prosodic phrase breaks: pauses in speech
- Can ignore complex structure when not relevant
- Chunking can be done via regular expressions over PoS-tags (a sketch follows below)
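A minimal NP-chunking sketch with NLTK's RegexpParser (the tag pattern is deliberately simple):

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]

grammar = "NP: {<DT>?<JJ>*<NN>}"   # optional determiner, adjectives, noun
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(sentence))     # NP chunks marked as subtrees
```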
25 Information Extraction
- Partial parsing gives us NP chunks
- IE: Named Entity Recognition
- People, places, companies, dates etc.
- In a cohesive text, some NPs refer to the same thing/person
- Needed: an algorithm for NP coreference resolution, e.g.
- "Hudda", "ten of hearts", "Mrs Anthrax", "she" all refer to the same Named Entity
26 Semantics: Word Sense Disambiguation
- e.g. mouse (animal / PC-interface)
- It's a hard task (Very)
- Humans are very good at it
- Computers are not
- Active field of research for over 50 years
- Mistakes in disambiguation have negative results
- Beginning to be of practical use (a Lesk sketch follows below)
- Desirable skill (Google, M)
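NLTK ships a simplified Lesk algorithm that can serve as a weak WSD baseline (a sketch; assumes the 'wordnet' data package is available):

```python
from nltk.wsd import lesk

sent = "I moved the mouse and clicked the left button".split()
sense = lesk(sent, "mouse", pos="n")   # pick the WordNet noun sense
print(sense, "-", sense.definition())
# prints whichever sense's dictionary gloss overlaps most with the context
```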
27 Machine learning v cognitive modelling
- NLP has been successful using ML from data, without linguistic / cognitive models
- Supervised ML: given labelled data (e.g. PoS-tagged text to train a PoS-tagger, to tag new text in the style of the training text)
- Unsupervised ML: no labelled data (e.g. clustering words with similar contexts gives PoS-tag categories)
- Unsupervised ML is harder, but increasingly successful!
28 NLP applications
- Machine Translation
- Localization: adapting text (e.g. ads) to the local language
- Information Retrieval (Google, etc.)
- Information Extraction
- Detecting Terrorist Activities
- Understanding the Quran
- ...
- For more, see The Language Machine
29 And Finally
- Any final questions?
- Feedback please (e.g. email me)
- Good luck in the exam!
- Look at past exam papers
- BUT note changes in topics covered
- And if you do use NLP in your career, please let me know