Title: Machine Learning for Information Extraction: An Overview
1. Machine Learning for Information Extraction: An Overview
- Kamal Nigam
- Google Pittsburgh
With input, slides and suggestions from William
Cohen, Andrew McCallum and Ion Muslea
2. Example: A Problem
Mt. Baker, the school district
Baker Hostetler, the company
Baker, a job opening
Genomics job
3. Example: A Solution
4. Job Openings (Category: Food Services, Keyword: Baker, Location: Continental U.S.)
5. Extracting Job Openings from the Web
Title: Ice Cream Guru. Description: If you dream of cold creamy … Contact: susan_at_foodscience.com. Category: Travel/Hospitality. Function: Food Services.
6. Potential Enabler of Faceted Search
7. Lots of Structured Information in Text
8. IE from Research Papers
9-13. What is Information Extraction?

- Recovering structured data from formatted text
- Identifying fields (e.g. named entity recognition)
- Understanding relations between fields (e.g. record association)
- Normalization and deduplication
- Today: focus mostly on field identification, a little on record association
14. IE Posed as a Machine Learning Task

- Training data: documents marked up with ground truth
- In contrast to text classification, local features are crucial (a small sketch follows this slide). Features of:
  - Contents
  - Text just before item
  - Text just after item
  - Begin/end boundaries

[Figure: example snippet "… :00 pm Place: Wean Hall Rm 5409 Speaker: Sebastian Thrun …" annotated with its prefix, contents, and suffix regions]
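A minimal sketch (not from the talk) of how such prefix/contents/suffix features might be computed for a candidate span; it assumes whitespace tokenization, and the feature names are illustrative only:

def window_features(tokens, start, end, k=3):
    """Features for a candidate field occupying tokens[start:end], drawn from
    its contents, the k tokens just before it, and the k tokens just after it."""
    prefix = tokens[max(0, start - k):start]
    contents = tokens[start:end]
    suffix = tokens[end:end + k]

    feats = {}
    # Content features
    feats["length"] = end - start
    feats["all_caps"] = all(t.isupper() for t in contents)
    feats["contains_digit"] = any(c.isdigit() for t in contents for c in t)
    # Boundary features: the literal tokens just before / just after the candidate
    for i, t in enumerate(prefix):
        feats["prefix_%d=%s" % (i, t.lower())] = True
    for i, t in enumerate(suffix):
        feats["suffix_%d=%s" % (i, t.lower())] = True
    return feats

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
print(window_features(tokens, 8, 10))  # candidate span "Sebastian Thrun"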
15. Good Features for Information Extraction

Creativity and Domain Knowledge Required!

Example line/paragraph features: begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

- Example word features
  - identity of word
  - is in all caps
  - ends in "-ski"
  - is part of a noun phrase
  - is in a list of city names
  - is under node X in WordNet or Cyc
  - is in bold font
  - is in hyperlink anchor
- Features of past and future
  - last person name was female
  - next two words are "and Associates"
16. Good Features for Information Extraction

Creativity and Domain Knowledge Required!

Is Capitalized, Is Mixed Caps, Is All Caps, Initial Cap, Contains Digit, All lowercase, Is Initial, Punctuation (Period, Comma, Apostrophe, Dash), Preceded by HTML tag, Character n-gram classifier says string is a person name (80% accurate), In stopword list (the, of, their, etc.), In honorific list (Mr, Mrs, Dr, Sen, etc.), In person suffix list (Jr, Sr, PhD, etc.), In name particle list (de, la, van, der, etc.), In Census lastname list segmented by P(name), In Census firstname list segmented by P(name), In locations lists (states, cities, countries), In company name list (J. C. Penny), In list of company suffixes (Inc, Associates, Foundation)

- Word Features
  - lists of job titles
  - lists of prefixes
  - lists of suffixes
  - 350 informative phrases
- HTML/Formatting Features
  - {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
  - begin, end of line

(A small feature-function sketch follows this slide.)
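As a rough illustration (my sketch, not the talk's) of how a few such lexicon and orthography word features could be computed; the tiny word lists below are stand-ins for the real honorific, name-particle, and company-suffix resources:

# Tiny illustrative stand-ins for the real lexicons named above.
HONORIFICS = {"mr", "mrs", "dr", "sen"}
NAME_PARTICLES = {"de", "la", "van", "der"}
COMPANY_SUFFIXES = {"inc", "associates", "foundation"}

def word_features(word):
    w = word.strip(".,")
    return {
        "is_capitalized": w[:1].isupper(),
        "is_all_caps": w.isupper(),
        "is_mixed_caps": any(c.isupper() for c in w[1:]) and not w.isupper(),
        "contains_digit": any(c.isdigit() for c in w),
        "ends_in_ski": w.lower().endswith("ski"),
        "in_honorific_list": w.lower() in HONORIFICS,
        "in_name_particle_list": w.lower() in NAME_PARTICLES,
        "in_company_suffix_list": w.lower() in COMPANY_SUFFIXES,
    }

print(word_features("Dr."))       # honorific, capitalized
print(word_features("Kowalski"))  # capitalized, ends in -ski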
17. IE History

- Pre-Web
  - Mostly news articles
    - De Jong's FRUMP [1982]: hand-built system to fill Schank-style scripts from news wire
    - Message Understanding Conference (MUC) DARPA '87-'95, TIPSTER '92-'96
  - Most early work dominated by hand-built models
    - E.g. SRI's FASTUS, hand-built FSMs
    - But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs [Elkan & Leek 97, BBN: Bikel et al 98]
- Web
  - AAAI '94 Spring Symposium on Software Agents
    - Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
  - Tom Mitchell's WebKB, '96
    - Build KBs from the Web
  - Wrapper Induction
    - Initially hand-built, then ML [Soderland 96, Kushmerick 97, …]
18. Landscape of ML Techniques for IE

[Figure: classify candidates, e.g. "Abraham Lincoln was born in Kentucky." fed to a classifier that asks "which class?"]

Any of these models can be used to capture words, formatting, or both.
19. Sliding Windows and Boundary Detection
20. Information Extraction by Sliding Windows

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm, 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement
21-24. Information Extraction by Sliding Window (continued)

(These slides repeat the same CMU UseNet seminar announcement, illustrating the window sliding across the text and classifying each candidate position.)
25. Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]

[Figure: example snippet "… :00 pm Place: Wean Hall Rm 5409 Speaker: Sebastian Thrun …" with a window of tokens w_{t-m} … w_{t-1} (prefix), w_t … w_{t+n} (contents), w_{t+n+1} … w_{t+n+m} (suffix)]

- Standard supervised learning setting (sketched in code below):
  - Positive instances: candidates with the real label
  - Negative instances: all other candidates
  - Features based on candidate, prefix and suffix
- Special-purpose rule learning systems work well, e.g. a learned rule of the form:

  courseNumber(X) :- tokenLength(X, =, 2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true)
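The sketch mentioned above: a hypothetical illustration (not any of the cited systems) of the candidate-generation and labeling step; features like those of slide 14 would then be computed for each candidate:

def candidate_windows(tokens, max_len=4):
    """Enumerate every token span of length 1..max_len as a candidate field."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield (start, end)

def training_instances(tokens, true_spans):
    """Positive instances: candidates matching an annotated span.
    Negative instances: all other candidates."""
    for span in candidate_windows(tokens):
        yield span, (span in true_spans)

tokens = ": 00 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
labeled = list(training_instances(tokens, true_spans={(11, 13)}))  # "Sebastian Thrun"
print(sum(1 for _, y in labeled if y), "positive out of", len(labeled), "candidates")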
26. Rule-Learning Approaches to Sliding-Window Classification: Summary

- Representations for classifiers allow restriction of the relationships between tokens, etc.
- Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
- Use of these heavyweight representations is complicated, but seems to pay off in results
27-31. IE by Boundary Detection

(These slides show the same CMU UseNet seminar announcement again, this time illustrating the detection of candidate field start and end boundaries.)
32. BWI: Learning to Detect Boundaries [Freitag & Kushmerick, AAAI 2000]

- Another formulation: learn three probabilistic classifiers:
  - START(i) = Prob(position i starts a field)
  - END(j) = Prob(position j ends a field)
  - LEN(k) = Prob(an extracted field has length k)
- Then score a possible extraction (i,j) by START(i) x END(j) x LEN(j-i) (see the sketch after this slide)
- LEN(k) is estimated from a histogram
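A small sketch of that scoring rule. The START and END tables and the training field lengths below are made-up stand-ins for what the learned boundary detectors and the histogram would provide:

from collections import Counter

def length_histogram(training_field_lengths):
    """LEN(k): empirical probability that an extracted field has length k."""
    counts = Counter(training_field_lengths)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def score(i, j, start_p, end_p, len_p):
    """Score a candidate extraction spanning positions i..j (j exclusive)."""
    return start_p.get(i, 0.0) * end_p.get(j, 0.0) * len_p.get(j - i, 0.0)

start_p = {11: 0.9, 3: 0.2}                # made-up START(i) values
end_p = {13: 0.8, 5: 0.1}                  # made-up END(j) values
len_p = length_histogram([2, 2, 3, 1, 2])  # made-up training field lengths

candidates = [(i, j) for i in start_p for j in end_p if j > i]
best = max(candidates, key=lambda ij: score(ij[0], ij[1], start_p, end_p, len_p))
print(best, score(best[0], best[1], start_p, end_p, len_p))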
33. BWI: Learning to Detect Boundaries

- BWI uses boosting to find detectors for START and END
- Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i)
- Each pattern is a sequence of tokens and/or wildcards like anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
- The weak learner for patterns uses greedy search (with lookahead) to repeatedly extend a pair of initially empty BEFORE, AFTER patterns
34. BWI: Learning to Detect Boundaries

Results (F1 by field):
  Person Name: 30
  Location: 61
  Start Time: 98
35. Problems with Sliding Windows and Boundary Finders

- Decisions in neighboring parts of the input are made independently from each other:
  - A naïve-Bayes sliding window may predict a seminar end time before the seminar start time.
  - It is possible for two overlapping windows to both be above threshold.
  - In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
36. Finite State Machines
37. Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

[Figure: finite state / graphical model with states S_{t-1}, S_t, S_{t+1} (transitions) and observations O_{t-1}, O_t, O_{t+1}; the model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8]

Parameters, for all states S = {s1, s2, …}:
- Start state probabilities P(s_t)
- Transition probabilities P(s_t | s_{t-1})
- Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: maximize probability of training observations (w/ prior).
38. IE with Hidden Markov Models

Given a sequence of observations:

  Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM, find the most likely state sequence (Viterbi; a decoder sketch follows this slide):

  Yesterday Lawrence Saul spoke this example sentence.

Any words said to be generated by the designated "person name" state are extracted as a person name:

  Person name: Lawrence Saul
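A compact Viterbi decoder sketch over a toy two-state model; the states, probabilities, and smoothing constant are invented for illustration and are not the parameters of any system in the talk:

import math

def viterbi(obs, states, start_p, trans_p, emit_p, smooth=1e-12):
    """Most likely state sequence for obs under an HMM given as probability dicts."""
    V = [{}]      # V[t][s] = best log-probability of any path ending in state s at time t
    back = [{}]   # back[t][s] = previous state on that best path
    for s in states:
        V[0][s] = math.log(start_p.get(s, smooth)) + math.log(emit_p[s].get(obs[0], smooth))
        back[0][s] = None
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, best = max(((p, V[t - 1][p] + math.log(trans_p[p].get(s, smooth)))
                              for p in states), key=lambda x: x[1])
            V[t][s] = best + math.log(emit_p[s].get(obs[t], smooth))
            back[t][s] = prev
    s = max(states, key=lambda q: V[-1][q])   # best final state
    path = [s]
    for t in range(len(obs) - 1, 0, -1):      # follow backpointers
        s = back[t][s]
        path.append(s)
    return list(reversed(path))

states = ["person-name", "other"]
start_p = {"person-name": 0.1, "other": 0.9}
trans_p = {"person-name": {"person-name": 0.6, "other": 0.4},
           "other": {"person-name": 0.1, "other": 0.9}}
emit_p = {"person-name": {"lawrence": 0.3, "saul": 0.3},
          "other": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                    "example": 0.2, "sentence": 0.2}}
obs = "yesterday lawrence saul spoke this example sentence".split()
print(list(zip(obs, viterbi(obs, states, start_p, trans_p, emit_p))))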
39. Generative Extraction with HMMs [McCallum, Nigam, Seymore & Rennie 00]

- Parameters: P(s_t | s_{t-1}) and P(o_t | s_t), for all states s_t and words o_t
- The parameters define a generative model
40. HMM Example: Nymble [Bikel et al. 97]

Task: Named Entity Extraction

[Figure: state diagram with start-of-sentence and end-of-sentence states, name-class states Person, Org, (five other name classes), and Other]

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then to P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then to P(o_t).

Train on 450k words of news wire text.

Results (F1):
  Case   Language   F1
  Mixed  English    93
  Upper  English    91
  Mixed  Spanish    90

Other examples of HMMs in IE: [Leek 97], [Freitag & McCallum 99], [Seymore et al. 99]
41. Regrets from the Atomic View of Tokens

Would like a richer representation of text: multiple overlapping features, whole chunks of text.

- Line, sentence, or paragraph features
  - length
  - is centered in page
  - percent of non-alphabetics
  - white-space aligns with next line
  - containing sentence has two verbs
  - grammatically contains a question
  - contains links to authoritative pages
  - emissions that are uncountable
  - features at multiple levels of granularity
- Example word features
  - identity of word
  - is in all caps
  - ends in "-ski"
  - is part of a noun phrase
  - is in a list of city names
  - is under node X in WordNet or Cyc
  - is in bold font
  - is in hyperlink anchor
- Features of past and future
  - last person name was female
  - next two words are "and Associates"
42. Problems with Richer Representation and a Generative Model

- These arbitrary features are not independent:
  - Overlapping and long-distance dependencies
  - Multiple levels of granularity (words, characters)
  - Multiple modalities (words, formatting, layout)
  - Observations from past and future
- HMMs are generative models of the text
- Generative models do not easily handle these non-independent features. Two choices:
  - Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
  - Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
43. Conditional Sequence Models

- We would prefer a conditional model P(s | o) instead of P(s, o):
  - Can examine features, but not responsible for generating them.
  - Don't have to explicitly model their dependencies.
  - Don't waste modeling effort trying to generate what we are given at test time anyway.
- If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.
44. Conditional Markov Models

Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000]

[Figure: two chain models over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}. Generative (traditional HMM): transitions between states, states emit the observations. Conditional: transitions between states, states conditioned on the observations.]

Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.
45. Exponential Form for Next-State Function

Capture the dependency on s_{t-1} with |S| independent functions, P_{s_{t-1}}(s_t | o_t). Each state contains a next-state classifier that, given the next observation, produces a probability of the next state:

  P_{s_{t-1}}(s_t | o_t) = (1 / Z(o_t, s_{t-1})) exp( Σ_k λ_k f_k(o_t, s_t) )

where each λ_k is a weight and each f_k is a feature.

Recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum entropy (a prediction-time sketch follows this slide).
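The prediction-time side of such a per-state classifier, sketched with hypothetical states, features, and weights (the maximum-entropy training step itself is not shown):

import math

def next_state_probs(prev_state, obs_feats, weights, states):
    """P_{prev_state}(s | o): one maximum-entropy next-state classifier per previous state.
    weights[prev_state][s][f] is the weight lambda_k for feature f firing toward state s;
    obs_feats is the set of features active on the current observation."""
    scores = {s: sum(weights[prev_state][s].get(f, 0.0) for f in obs_feats) for s in states}
    z = sum(math.exp(v) for v in scores.values())   # per-source-state normalizer Z(o, prev_state)
    return {s: math.exp(v) / z for s, v in scores.items()}

# Hypothetical model for illustration only.
states = ["speaker", "other"]
weights = {"other":   {"speaker": {"prev-token=speaker:": 2.0, "is-capitalized": 1.0},
                       "other":   {"is-lowercase": 1.5}},
           "speaker": {"speaker": {"is-capitalized": 1.2},
                       "other":   {}}}
print(next_state_probs("other", {"prev-token=speaker:", "is-capitalized"}, weights, states))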
46. Label Bias Problem

- Consider this MEMM, and enough training data to perfectly model it:

[FSM: from start state 0, the path 0→1→2→3 reads "rib" and the path 0→4→5→3 reads "rob"]

  Pr(0→1→2→3 | "rib") = 1    Pr(0→4→5→3 | "rob") = 1

  Pr(0→1→2→3 | "rob") = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1
  Pr(0→4→5→3 | "rib") = Pr(4|0,r)/Z1 · Pr(5|4,i)/Z2 · Pr(3|5,b)/Z3 = 0.5 · 1 · 1

Because each state's outgoing distribution must sum to one, states 1 and 4 pass on all the probability mass they receive regardless of the middle observation, so the model cannot prefer one path over the other after seeing the first letter.
47. From HMMs to MEMMs to CRFs

Conditional Random Fields (CRFs) [Lafferty, McCallum & Pereira 2001]

[Figure: three chain-structured models over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: HMM (directed, states emit observations), MEMM (directed, states conditioned on observations), and linear-chain CRF (undirected).]

(A special case of MEMMs and CRFs.)
48. Conditional Random Fields (CRFs)

[Figure: linear chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, all conditioned on the observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}]

Markov on s, conditional dependency on o.

The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph (written out below).

Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|²), just like HMMs.
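Written out, this is the standard linear-chain form from Lafferty, McCallum & Pereira 2001 (the slide itself leaves the equation implicit):

p(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})}
    \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big),
\qquad
Z(\mathbf{o}) = \sum_{\mathbf{s}'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \Big)

Each f_k is an arbitrary feature of a state clique and the whole observation sequence, and λ_k is its learned weight.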
49. Training CRFs

- Methods:
  - iterative scaling (quite slow)
  - conjugate gradient (much faster)
  - conjugate gradient with preconditioning (super fast)
  - limited-memory quasi-Newton methods (also super fast)
- Complexity comparable to standard Baum-Welch

[Sha & Pereira 2002; Malouf 2002]
50. Sample IE Applications of CRFs

- Noun phrase segmentation [Sha & Pereira 03]
- Named entity recognition [McCallum & Li 03]
- Protein names in bio abstracts [Settles 05]
- Addresses in web pages [Culotta et al. 05]
- Semantic roles in text [Roth & Yih 05]
- RNA structural alignment [Sato & Sakakibara 05]
51. Examples of Recent CRF Research

- Semi-Markov CRFs [Sarawagi & Cohen 05]
  - Awkwardness of token-level decisions for segments
  - A segment sequence model alleviates this
  - Two-level model with sequences of segments, which are sequences of tokens
- Stochastic Meta-Descent [Vishwanathan et al. 06]
  - Stochastic gradient optimization for training
  - Take gradient steps with small batches of examples
  - Order of magnitude faster than L-BFGS
  - Same resulting accuracies for extraction
52. Further Reading about CRFs

- Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press, 2006.
- http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf