Title: SIMS 290-2: Applied Natural Language Processing
1. SIMS 290-2: Applied Natural Language Processing
Marti Hearst, October 13, 2004
2. Today
- Finish hand-built rule systems
- Machine learning approaches to information extraction
  - Sliding windows
  - Rule learners (older)
  - Feature-based ML (more recent)
- IE tools
3. Two kinds of NE approaches
- Knowledge engineering
  - Rule-based
  - Developed by experienced language engineers
  - Makes use of human intuition
  - Requires only a small amount of training data
  - Development can be very time-consuming
  - Some changes may be hard to accommodate
- Learning systems
  - Use statistics or other machine learning
  - Developers do not need LE expertise
  - Require large amounts of annotated training data
  - Some changes may require re-annotation of the entire training corpus
  - Annotators are cheap (but you get what you pay for!)
4. Baseline: list lookup approach
- System that recognises only entities stored in its lists (gazetteers)
- Advantages: simple, fast, language independent, easy to retarget (just create lists)
- Disadvantages: impossible to enumerate all names; collection and maintenance of lists; cannot deal with name variants; cannot resolve ambiguity
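As a rough illustration, the lookup baseline can be sketched as a greedy longest-match over token n-grams. The gazetteer entries below are invented examples, not a real resource:

```python
# Toy gazetteers (invented entries) mapping entity type -> known names.
GAZETTEERS = {
    "PERSON": {"jaime carbonell", "sebastian thrun"},
    "ORG": {"carnegie mellon university"},
    "LOC": {"atlanta", "kansas city"},
}

def lookup_entities(tokens, max_len=4):
    """Greedy longest-match lookup over token n-grams.

    Returns (start, end, label) spans; longer matches win, which is
    one simple way to resolve overlaps (ambiguity is not resolved)."""
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            for label, names in GAZETTEERS.items():
                if span in names:
                    match = (i, i + n, label)
                    break
            if match:
                break
        if match:
            entities.append(match)
            i = match[1]  # skip past the matched span
        else:
            i += 1
    return entities

print(lookup_entities("Jaime Carbonell of Carnegie Mellon University".split()))
# -> [(0, 2, 'PERSON'), (3, 6, 'ORG')]
```

Note how the disadvantages show up immediately: "Dr. Carbonell" or "CMU" would be missed because only the exact listed variants match.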
5. Creating Gazetteer Lists
- Online phone directories and yellow pages for person and organisation names (e.g. Paskaleva 02)
- Location lists
  - US GEOnet Names Server (GNS) data: 3.9 million locations with 5.37 million names (e.g. Manov 03)
  - UN site: http://unstats.un.org/unsd/citydata
  - Global Discovery database from Europa Technologies Ltd, UK (e.g. Ignat 03)
- Automatic collection from annotated training data
6. Rule-based Example: FACILE
- FACILE: used in MUC-7 (Black et al. 98)
- Uses Inxight's LinguistX tools for tagging and morphological analysis
- Database for external information, with a role similar to a gazetteer
- Linguistic info per token, encoded as a feature vector:
  - Text offsets
  - Orthographic pattern (first/all capitals, mixed, lowercase)
  - Token and its normalised form
  - Syntax category and features
  - Semantics from the database or morphological analysis
  - Morphological analyses
- Example: (1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (^PER_CIV_F) (("Mrs." "Title" "Abbr")) NIL)
  - PER_CIV_F = female civilian (from the database)
7. FACILE
- Context-sensitive rules written in a special rule notation, executed by an interpreter
  - Writing rules in Perl is too error-prone and hard
- Rules of the form A => B \ C / D, where
  - A is a set of attribute-value expressions with an optional score; the attributes refer to elements of the input token feature vector
  - B and D are the left and right context respectively, and can be empty
  - B, C, D are sequences of attribute-value pairs and Kleene regular expression operations; variables are also supported
- Example: [syn=NP, sem=ORG] (0.9) => \ [norm="university"], [token="of"], [sem=REGION|COUNTRY|CITY] /
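To make the rule format concrete, here is a minimal sketch of how an interpreter might apply one such rule. The feature names and the token dictionaries are simplified stand-ins, not FACILE's actual notation or engine:

```python
# Hypothetical sketch of a FACILE-style rule A => B \ C / D:
# assign result features A to a focus span C when left context B
# and right context D also match.  Tokens are feature dicts.
def matches(token, constraints):
    return all(token.get(k) == v for k, v in constraints.items())

def apply_rule(tokens, rule):
    """rule = (result_features, left, focus, right), each context a
    list of constraint dicts; returns (start, end, result) spans."""
    result, left, focus, right = rule
    l, f, r = len(left), len(focus), len(right)
    spans = []
    for i in range(l, len(tokens) - f - r + 1):
        if (all(matches(tokens[i - l + j], c) for j, c in enumerate(left))
                and all(matches(tokens[i + j], c) for j, c in enumerate(focus))
                and all(matches(tokens[i + f + j], c) for j, c in enumerate(right))):
            spans.append((i, i + f, result))
    return spans

# Mirrors the slide's example: "university of <region>" -> ORG
rule = ({"syn": "NP", "sem": "ORG"},
        [],  # empty left context
        [{"norm": "university"}, {"token": "of"}, {"sem": "REGION"}],
        [])  # empty right context
tokens = [{"norm": "university", "token": "University"},
          {"token": "of"},
          {"sem": "REGION", "token": "Georgia"}]
print(apply_rule(tokens, rule))  # one ORG span covering all three tokens
```

The real system adds scores, Kleene operators, and variables on top of this basic shape.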
8. FACILE
- Rule for the markup of person names when the first name is not present or known from the gazetteers, e.g. "Mr J. Cass"
- [SYN=PROP, SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M, LAST=_S]   (_F, _I, _M, _S are variables that transfer info from the RHS)
- =>
- [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
- \ [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
-   [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
-   [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
-   [SYN=NAME, TOKEN=_M]?,
-   [ORTH=C|A|O, SYN=PROP, TOKEN=_S, SOURCE!=RULE]   (proper name, not recognised by a rule)
- /
9. FACILE
- Preference mechanism
  - The rule with the highest score is preferred
  - Longer matches are preferred to shorter matches
- The result is always one semantic categorisation of the named entity in the text
- Evaluation (MUC-7 scores)
  - Organization: 86% precision, 66% recall
  - Person: 90% precision, 88% recall
  - Location: 81% precision, 80% recall
  - Dates: 93% precision, 86% recall
10. Extraction by Sliding Window
11. Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science, Carnegie Mellon University
3:30 pm, 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement
12. Extraction by Sliding Window
(same announcement text repeated)
13. Extraction by Sliding Window
(same announcement text repeated)
14. Extraction by Sliding Window
(same announcement text repeated)
15. A Naïve Bayes Sliding Window Model
(Freitag 1997)

... 00 pm  Place: Wean Hall Rm 5409  Speaker: Sebastian Thrun ...
    [ w_{t-m} ... w_{t-1} | w_t ... w_{t+n} | w_{t+n+1} ... w_{t+n+m} ]
           prefix                contents              suffix

- Estimate Pr(LOCATION | window) using Bayes' rule
- Try all reasonable windows (vary length, position)
- Assume independence for length, prefix words, suffix words, content words
- Estimate from data quantities like Pr("Place" in prefix | LOCATION)
- If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it
- Other examples of sliding windows: Baluja et al. 2000 (decision tree over individual words and their context)
16. Naïve Bayes Sliding Window Results
- Domain: CMU UseNet Seminar Announcements (the announcement shown above)

Field        F1
Person Name  30
Location     61
Start Time   98
17. SRV: a realistic sliding-window-classifier IE system
(Freitag, AAAI 98)
- What windows to consider?
  - All windows containing as many tokens as the shortest example, but no more tokens than the longest example
- How to represent a classifier? It might:
  - Restrict the length of the window
  - Restrict the vocabulary or formatting used before/after/inside the window
  - Restrict the relative order of tokens
  - Use inductive logic programming techniques to express all of these
- <title>Course Information for CS213</title> <h1>CS 213 C Programming</h1>
18. SRV: a rule-learner for sliding-window classification
- Primitive predicates used by SRV:
  - token(X,W), allLowerCase(W), numerical(W)
  - nextToken(W,U), previousToken(W,V)
- HTML-specific predicates:
  - inTitleTag(W), inH1Tag(W), inEmTag(W)
  - emphasized(W): inEmTag(W) or inBTag(W) or ...
  - tableNextCol(W,U): U is some token in the column after the column W is in
  - tablePreviousCol(W,V), tableRowHeader(W,T), ...
19. Automatic Pattern-Learning Systems
20. Automatic Pattern-Learning Systems
- Pros:
  - Portable across domains
  - Tend to have broad coverage
  - Robust in the face of degraded input
  - Automatically find appropriate statistical patterns
  - System knowledge not needed by those who supply the domain knowledge
- Cons:
  - Annotated training data, and lots of it, is needed
  - Isn't necessarily better or cheaper than a hand-built solution
- Examples: Riloff et al., AutoSlog (UMass); Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas)
  - Learn lexico-syntactic patterns from templates
21. Rapier (Califf and Mooney, AAAI-99)
- Rapier learns three regex-style patterns for each slot:
  - Pre-filler pattern, Filler pattern, Post-filler pattern
22. Features for IE Learning Systems
- Part of speech: the syntactic role of a specific word
- Semantic classes: synonyms or other related words
  - Price class: price, cost, amount, ...
  - Month class: January, February, March, ..., December
  - US State class: Alaska, Alabama, ..., Washington, Wyoming
- WordNet: large online thesaurus containing (among other things) semantic classes
23. Rapier rule matching example
- "... sold to the bank for an undisclosed amount ..."
    POS:     vb  pr  det  nn  pr  det  jj  nn
    SClass:                                price
- Pre-filler:  1) tag: {nn, nnp}  2) list: length 2
  Filler:      1) word: undisclosed, tag: jj
  Post-filler: 1) sem: price
- "... paid Honeywell an undisclosed price ..."
    POS:     vb  nnp  det  jj  nn
    SClass:                    price
24. Rapier Rules: Details
- A Rapier rule consists of:
  - A pre-filler pattern
  - A filler pattern
  - A post-filler pattern
- pattern := sequence of subpatterns
- subpattern := set of constraints
- Constraints:
  - Word: exact word that must be present
  - Tag: matched word must have the given POS tag
  - Class: semantic class of the matched word
  - Disjunctions can be specified
  - List, length N: between 0 and N words satisfying the other constraints
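A sketch of how one pattern element might be matched against a token, under these constraint types. The token representation and dictionary-based encoding are my own simplification of Rapier's rule language:

```python
# A token is (word, POS tag, semantic class); a pattern element maps
# constraint names to the set of allowed values (sets give disjunction).
def element_matches(token, element):
    word, tag, sem = token
    values = {"word": word, "tag": tag, "class": sem}
    for key, allowed in element.items():
        if values[key] not in allowed:
            return False
    return True

# Filler element from the slide's example: the word "undisclosed" tagged jj.
element = {"word": {"undisclosed"}, "tag": {"jj"}}
print(element_matches(("undisclosed", "jj", None), element))  # True
print(element_matches(("undisclosed", "nn", None), element))  # False

# Disjunction, as in the pre-filler "tag: nn or nnp":
pre = {"tag": {"nn", "nnp"}}
print(element_matches(("bank", "nn", None), pre))  # True
```

A full matcher would additionally chain elements in sequence and handle the variable-length "list" constraint.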
25. Rapier's Learning Algorithm
- Input: set of training examples (a list of documents annotated with "extract this substring")
- Output: set of rules
- Init: Rules = a rule that exactly matches each training example
- Repeat several times:
  - Seed: select M examples randomly and generate the K most-accurate, maximally-general filler-only rules (pre-filler and post-filler match anything)
  - Grow: for N = 1, 2, 3, ..., try to improve the K best rules by adding N words of pre-filler or post-filler context
  - Keep: Rules := Rules + the best of the K rules, minus subsumed rules
26. Learning example (one iteration)
- Two examples: "located in Atlanta, Georgia" and "offices in Kansas City, Missouri"
- These generalize to an appropriately general rule (high precision, high recall)
27. Rapier results: Precision vs. Training Examples
28. Rapier results: Recall vs. Training Examples
29. Summary: Rule-learning approaches to sliding-window classification
- SRV, Rapier, and WHISK (Soderland, KDD 97)
- Representations for classifiers allow restriction of the relationships between tokens, etc.
- Representations are carefully chosen subsets of even more powerful representations
- Use of these heavyweight representations is complicated, but seems to pay off in results
30. Successors to MUC
- CoNLL: Conference on Computational Natural Language Learning
  - Different topics each year:
    - 2000: Chunking boundaries
    - 2001: Identifying clauses in text
    - 2002, 2003: Language-independent NER
    - 2004: Semantic role recognition
  - http://cnts.uia.ac.be/conll2003/ (also conll2004, conll2002)
  - Sponsored by SIGNLL, the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics
- ACE: Automated Content Extraction
  - Entity detection and tracking
  - Sponsored by NIST
  - http://wave.ldc.upenn.edu/Projects/ACE/
- Several others recently
  - See http://cnts.uia.ac.be/conll2003/ner/
31. CoNLL-2003
- Goal: identify boundaries and types of named entities
  - People, Organizations, Locations, Misc.
- Experiment with incorporating external resources (gazetteers) and unlabeled data
- Data
  - Uses IOB notation
  - 4 pieces of info for each term: Word, POS, Chunk, EntityType
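To illustrate the IOB encoding, here is a small sketch that recovers entity spans from a tagged sentence. It assumes the common IOB2 convention (every entity starts with B-); the example sentence is a classic from the shared-task materials:

```python
# Recover (start, end, label) entity spans from a sequence of IOB tags:
# B-X begins an entity of type X, I-X continues it, O is outside.
def iob_to_spans(tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        starts_new = (tag.startswith("B-") or tag == "O"
                      or (tag.startswith("I-") and label != tag[2:]))
        if starts_new:
            if start is not None:
                spans.append((start, i, label))
            start, label = ((i, tag[2:]) if tag != "O" else (None, None))
    return spans

tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad"]
tags = ["B-ORG", "O", "B-PER", "O", "O", "B-LOC"]
print(iob_to_spans(tags))  # [(0, 1, 'ORG'), (2, 3, 'PER'), (5, 6, 'LOC')]
```

In the data files each line carries all four columns, e.g. `U.N. NNP I-NP I-ORG`.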
32. Details on Training/Test Sets
- Reuters Newswire and European Corpus Initiative data

Sang and De Meulder, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition", Proceedings of CoNLL-2003
33. Summary of Results
- 16 systems participated
- Machine learning techniques:
  - Combinations of Maximum Entropy Models (5), Hidden Markov Models (4), Winnow/Perceptron (4)
  - Used once each: Support Vector Machines, Conditional Random Fields, Transformation-Based Learning, AdaBoost, and memory-based learning
  - Combining techniques often worked well
- Features:
  - The choice of features is at least as important as the ML method
  - Top-scoring systems used many types
  - No one feature stands out as essential (other than words)
34. (No Transcript)
35. Use of External Information
- The improvements from using gazetteers and from using unlabeled data were nearly equal
- Gazetteers were less useful for German than for English (higher quality for English)
36. Precision, Recall, and F-Scores
- The top systems were not significantly different
37. Combining Results
- What happens if we combine the results of all of the systems?
- Used a majority vote of 5 systems for each set
- English: F = 90.30 (14% error reduction over the best system)
- German: F = 74.17 (6% error reduction over the best system)
- Top four systems in more detail
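One plausible reading of the combination scheme is a per-token majority vote over the systems' tag sequences; a sketch, with invented toy predictions:

```python
from collections import Counter

def majority_vote(system_outputs):
    """system_outputs: list of tag sequences, one per system, all the
    same length.  Returns the per-position majority tag."""
    return [Counter(tags).most_common(1)[0][0]
            for tags in zip(*system_outputs)]

# Five hypothetical systems tagging the same three tokens.
outputs = [["B-PER", "I-PER", "O"],
           ["B-PER", "O",     "O"],
           ["B-PER", "I-PER", "B-LOC"],
           ["O",     "I-PER", "O"],
           ["B-PER", "I-PER", "O"]]
print(majority_vote(outputs))  # ['B-PER', 'I-PER', 'O']
```

The intuition for the reported error reductions is that different systems make partly uncorrelated errors, so the vote filters out many individual mistakes.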
38. Zhang and Johnson
- Experimented with the effects of different features
- Used a learning method they developed called Robust Risk Minimization (RRM)
  - Related to the Winnow method
- Used it to predict the class label t_i associated with each token w_i
  - Estimate P(t_i = c | x_i) for every possible class label c, where x_i is a feature vector associated with token i
  - x_i can include information about previous tags
- Found that relatively simple, language-independent features get you much of the way
39. Zhang and Johnson
- Simple features include:
  - The tokens themselves, in a window of +/- 2
  - The previous 2 predicted tags
  - The conjunction of the previous tag and the current token
  - Initial capitalization of tokens, in a window of +/- 2
- More elaborate features include:
  - Word shape information: initial caps, all caps, all digits, digits containing punctuation
  - Token prefix (length 3-4) and suffix (length 1-4)
  - POS
  - Chunking info (chunk bag-of-words at the current token)
  - Marked-up entities from training data
  - Other dictionaries
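The simple feature set above can be sketched as a per-token feature extractor. The feature names and encoding are illustrative, not Zhang and Johnson's exact representation:

```python
# Extract the "simple" features for token i: tokens and capitalization
# in a +/-2 window, the two previous predicted tags, and the
# (previous tag, current token) conjunction.
def simple_features(tokens, i, prev_tags):
    feats = {}
    for off in range(-2, 3):
        j = i + off
        tok = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats[f"tok[{off}]"] = tok
        feats[f"cap[{off}]"] = tok[:1].isupper()
    feats["tag[-1]"] = prev_tags[-1] if prev_tags else "<s>"
    feats["tag[-2]"] = prev_tags[-2] if len(prev_tags) > 1 else "<s>"
    feats["tag[-1]+tok[0]"] = feats["tag[-1]"] + "|" + tokens[i]
    return feats

feats = simple_features(["Jaime", "Carbonell", "speaks"], 1, ["B-PER"])
print(feats["tok[0]"], feats["cap[-1]"], feats["tag[-1]+tok[0]"])
# Carbonell True B-PER|Carbonell
```

Because the features use predicted tags, tagging proceeds left to right, feeding each decision into the next token's feature vector.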
40. Language independent
41. Florian, Ittycheriah, Jing, Zhang
- Combined four machine learning algorithms
  - The best-performing was the Zhang and Johnson RRM
- Voting algorithm
  - Giving them all equal-weight votes worked well
  - So did using the RRM algorithm to choose among them
  - English F-measure went from 89.94 to 91.63
- Did well with the supplied features; did even better with some complex additional features:
  - The output of 2 other NER systems
    - Trained on 1.7M annotated words in 32 categories
  - A list of gazetteers
  - Improved English F-measure to 93.9 (21% error reduction)
42. Effects of Unknown Words
- Florian et al. note that German is harder:
  - It has more unknown words
  - All nouns are capitalized
43. Klein, Smarr, Nguyen, Manning
- The standard approach for unknown words is to extract features like suffixes, prefixes, and capitalization
- Idea: use all-character n-grams, rather than words, as the primary representation
  - Integrates unknown words seamlessly into the model
- Improved the results of their classifier by 25%
44. Balancing n-grams with Other Evidence
- Example: "morning at Grace Road"
  - The classifier needs to determine that "Grace" is part of a Location rather than a Person
- Used a Conditional Markov Model (aka Maximum Entropy Model)
- Also added other shape information:
  - 20-month -> d-x
  - Italy -> Xx
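The shape mapping shown above can be sketched as: classify each character (uppercase, lowercase, digit, or the character itself) and collapse adjacent repeats:

```python
# Word shape: X = uppercase, x = lowercase, d = digit, other characters
# kept as-is; consecutive repeats of the same class are collapsed.
def word_shape(word):
    shape = []
    for ch in word:
        cls = ("X" if ch.isupper() else
               "x" if ch.islower() else
               "d" if ch.isdigit() else ch)
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)

print(word_shape("20-month"), word_shape("Italy"))  # d-x Xx
```

Collapsing repeats keeps the feature space small: "Italy" and "Rome" share the shape `Xx`, while "20-month" and "3-year" share `d-x`.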
45. (No Transcript)
46. (No Transcript)
47. Chieu and Ng
- Used a Maximum Entropy approach
  - Estimates probabilities based on the principle of making as few assumptions as possible
  - But allows specification of constraints between features and outcome (derived from training data)
- Used a rich feature set, like those already discussed
- Interesting additional features:
  - Lists derived from the training set
  - Global features: look at how the words appear elsewhere within the document
- Doesn't say which of these features do well
48. Lists Derived from Training Data
- UNI (useful unigrams)
  - The top 20 words that precede instances of the class
  - Computed using a correlation metric
- UBI (useful bigrams): pairs of preceding words
  - CITY OF, ARRIVES IN
  - The bigram has a higher probability of preceding the class than the unigram alone
    - CITY OF is better evidence than just OF
- NCS (useful name class suffixes)
  - Tokens that frequently terminate a class
  - INC, COMMITTEE
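A simplified sketch of building the UNI list from annotated data: count the words immediately preceding entity spans of a class and keep the top k. The paper uses a correlation metric rather than the raw counts used here, and the training sentences below are invented:

```python
from collections import Counter

def preceding_unigrams(sentences, spans_per_sentence, k=20):
    """sentences: token lists; spans_per_sentence: (start, end) entity
    spans of one class per sentence.  Returns the k most frequent
    immediately-preceding words (uppercased, as on the slide)."""
    counts = Counter()
    for tokens, spans in zip(sentences, spans_per_sentence):
        for start, _end in spans:
            if start > 0:  # entity at sentence start has no preceding word
                counts[tokens[start - 1].upper()] += 1
    return [w for w, _ in counts.most_common(k)]

sents = [["ARRIVES", "IN", "Paris"], ["CITY", "OF", "Boston"], ["in", "Paris"]]
spans = [[(2, 3)], [(2, 3)], [(1, 2)]]
print(preceding_unigrams(sents, spans))  # ['IN', 'OF']
```

The UBI list would be built the same way over the two preceding words, keeping bigrams whose class-conditional probability beats the unigram's.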
49. Using Other Occurrences within the Document
- Zone
  - Where is the token from? (headline, author, body)
- Unigrams
  - Whether UNI holds for an occurrence of w elsewhere
- Bigrams
  - Whether UBI holds for an occurrence of w elsewhere
- Suffix
  - Whether NCS holds for an occurrence of w elsewhere
- InitCaps
  - A way to check whether a word is capitalized due to its position in the sentence or not; also checks the first word in a sequence of capitalized words
  - "Even News Broadcasting Corp., noted for its accurate reporting, made the erroneous announcement."
50. MUC Redux
- Task: fill slots of templates
- MUC-4 (1992)
  - All systems were hand-engineered
  - One MUC-6 entry used learning; it failed miserably
51. (No Transcript)
52. MUC Redux
- Fast forward 12 years: now use ML!
- Chieu et al. show a machine learning approach that can do as well as most of the hand-engineered MUC-4 systems
- Uses state-of-the-art components:
  - Sentence segmenter
  - POS tagger
  - NER
  - Statistical parser
  - Co-reference resolution
- Features look at syntactic context
  - Use subject-verb-object information
  - Use head words of NPs
- Trains classifiers for each slot type

Chieu, Hai Leong, Ng, Hwee Tou, and Lee, Yoong Keok (2003). "Closing the Gap: Learning-Based Information Extraction Rivaling Knowledge-Engineering Methods", in ACL-03.
53. The best systems took 10.5 person-months of hand-coding!
54. IE Techniques: Summary
- Machine learning approaches are doing well, even without comprehensive word lists
  - Can develop a pretty good starting list with a bit of web page scraping
- Features mainly have to do with the preceding and following tags, as well as syntax and word shape
  - The latter is somewhat language dependent
- With enough training data, results are getting pretty decent on well-defined entities
- ML is the way of the future!
55. IE Tools
- Research tools
  - GATE: http://gate.ac.uk/
  - MinorThird: http://minorthird.sourceforge.net/
  - Alembic (only NE tagging): http://www.mitre.org/tech/alembic-workbench/
- Commercial
  - ?? I don't know which ones work well
56. NE Annotation Tools - GATE
57. NE Annotation Tools - Alembic
58. NE Annotation Tools - Alembic (2)
59. GATE
- GATE: the University of Sheffield's open-source infrastructure for language processing
- Automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
- Has a finite-state pattern-action rule language
- Has an example rule-based system called ANNIE
  - ANNIE modified for MUC guidelines: 89.5 f-measure on the MUC-7 corpus
60. NE Components: the ANNIE system, a reusable and easily extendable set of components
61. GATE's Named Entity Grammars
- Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
- Hand-coded rules are applied to annotations to identify NEs
- Annotations come from the format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
- Use of contextual information
- Finds person names, locations, organisations, dates, addresses
62. Named Entities in GATE
63. Named Entity Coreference