Title: Machine Learning for Information Extraction: An Overview
1. Machine Learning for Information Extraction: An Overview
- Kamal Nigam
- Google Pittsburgh
With input, slides and suggestions from William
Cohen, Andrew McCallum and Ion Muslea
2. Example: A Problem
Mt. Baker, the school district
Baker Hostetler, the company
Baker, a job opening
Genomics job
3. Example: A Solution
4. Job Openings (Category: Food Services, Keyword: Baker, Location: Continental U.S.)
5. Extracting Job Openings from the Web
Title: Ice Cream Guru. Description: If you dream of cold creamy … Contact: susan_at_foodscience.com. Category: Travel/Hospitality. Function: Food Services.
6. Potential Enabler of Faceted Search
7. Lots of Structured Information in Text
8. IE from Research Papers
9-13. What is Information Extraction?

- Recovering structured data from formatted text
- Identifying fields (e.g. named entity recognition)
- Understanding relations between fields (e.g. record association)
- Normalization and deduplication
- Today: focus mostly on field identification, a little on record association
14. IE Posed as a Machine Learning Task

- Training data: documents marked up with ground truth
- In contrast to text classification, local features are crucial (a small sketch follows this slide). Features of:
  - Contents
  - Text just before item
  - Text just after item
  - Begin/end boundaries

[Figure: example snippet "… :00 pm Place: Wean Hall Rm 5409 Speaker: Sebastian Thrun …" annotated with its prefix, contents, and suffix regions]
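A minimal sketch (not from the talk) of how such prefix/contents/suffix features might be computed for a candidate span; it assumes whitespace tokenization, and the feature names are illustrative only:

def window_features(tokens, start, end, k=3):
    """Features for a candidate field occupying tokens[start:end], drawn from
    its contents, the k tokens just before it, and the k tokens just after it."""
    prefix = tokens[max(0, start - k):start]
    contents = tokens[start:end]
    suffix = tokens[end:end + k]

    feats = {}
    # Content features
    feats["length"] = end - start
    feats["all_caps"] = all(t.isupper() for t in contents)
    feats["contains_digit"] = any(c.isdigit() for t in contents for c in t)
    # Boundary features: the literal tokens just before / just after the candidate
    for i, t in enumerate(prefix):
        feats["prefix_%d=%s" % (i, t.lower())] = True
    for i, t in enumerate(suffix):
        feats["suffix_%d=%s" % (i, t.lower())] = True
    return feats

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
print(window_features(tokens, 8, 10))  # candidate span "Sebastian Thrun"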
15. Good Features for Information Extraction

Creativity and Domain Knowledge Required!

Example line/paragraph features: begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

- Example word features
  - identity of word
  - is in all caps
  - ends in "-ski"
  - is part of a noun phrase
  - is in a list of city names
  - is under node X in WordNet or Cyc
  - is in bold font
  - is in hyperlink anchor
- Features of past and future
  - last person name was female
  - next two words are "and Associates"
16. Good Features for Information Extraction

Creativity and Domain Knowledge Required!

Is Capitalized, Is Mixed Caps, Is All Caps, Initial Cap, Contains Digit, All lowercase, Is Initial, Punctuation (Period, Comma, Apostrophe, Dash), Preceded by HTML tag, Character n-gram classifier says string is a person name (80% accurate), In stopword list (the, of, their, etc.), In honorific list (Mr, Mrs, Dr, Sen, etc.), In person suffix list (Jr, Sr, PhD, etc.), In name particle list (de, la, van, der, etc.), In Census lastname list segmented by P(name), In Census firstname list segmented by P(name), In locations lists (states, cities, countries), In company name list (J. C. Penny), In list of company suffixes (Inc, Associates, Foundation)

- Word Features
  - lists of job titles
  - lists of prefixes
  - lists of suffixes
  - 350 informative phrases
- HTML/Formatting Features
  - {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
  - begin, end of line

(A small feature-function sketch follows this slide.)
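As a rough illustration (my sketch, not the talk's) of how a few such lexicon and orthography word features could be computed; the tiny word lists below are stand-ins for the real honorific, name-particle, and company-suffix resources:

# Tiny illustrative stand-ins for the real lexicons named above.
HONORIFICS = {"mr", "mrs", "dr", "sen"}
NAME_PARTICLES = {"de", "la", "van", "der"}
COMPANY_SUFFIXES = {"inc", "associates", "foundation"}

def word_features(word):
    w = word.strip(".,")
    return {
        "is_capitalized": w[:1].isupper(),
        "is_all_caps": w.isupper(),
        "is_mixed_caps": any(c.isupper() for c in w[1:]) and not w.isupper(),
        "contains_digit": any(c.isdigit() for c in w),
        "ends_in_ski": w.lower().endswith("ski"),
        "in_honorific_list": w.lower() in HONORIFICS,
        "in_name_particle_list": w.lower() in NAME_PARTICLES,
        "in_company_suffix_list": w.lower() in COMPANY_SUFFIXES,
    }

print(word_features("Dr."))       # honorific, capitalized
print(word_features("Kowalski"))  # capitalized, ends in -ski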
17. IE History

- Pre-Web
  - Mostly news articles
    - De Jong's FRUMP [1982]: hand-built system to fill Schank-style scripts from news wire
    - Message Understanding Conference (MUC) DARPA '87-'95, TIPSTER '92-'96
  - Most early work dominated by hand-built models
    - E.g. SRI's FASTUS, hand-built FSMs
    - But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs [Elkan & Leek 97, BBN: Bikel et al 98]
- Web
  - AAAI '94 Spring Symposium on Software Agents
    - Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
  - Tom Mitchell's WebKB, '96
    - Build KBs from the Web
  - Wrapper Induction
    - Initially hand-built, then ML [Soderland 96, Kushmerick 97, …]
18. Landscape of ML Techniques for IE

[Figure: classify candidates, e.g. "Abraham Lincoln was born in Kentucky." fed to a classifier that asks "which class?"]

Any of these models can be used to capture words, formatting, or both.
19. Sliding Windows and Boundary Detection
20. Information Extraction by Sliding Windows

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm, 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement
21-24. Information Extraction by Sliding Window (continued)

(These slides repeat the same CMU UseNet seminar announcement, illustrating the window sliding across the text and classifying each candidate position.)
25. Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]

[Figure: example snippet "… :00 pm Place: Wean Hall Rm 5409 Speaker: Sebastian Thrun …" with a window of tokens w_{t-m} … w_{t-1} (prefix), w_t … w_{t+n} (contents), w_{t+n+1} … w_{t+n+m} (suffix)]

- Standard supervised learning setting (sketched in code below):
  - Positive instances: candidates with the real label
  - Negative instances: all other candidates
  - Features based on candidate, prefix and suffix
- Special-purpose rule learning systems work well, e.g. a learned rule of the form:

  courseNumber(X) :- tokenLength(X, =, 2), every(X, inTitle, false), some(X, A, <previousToken>, inTitle, true), some(X, B, <>, tripleton, true)
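The sketch mentioned above: a hypothetical illustration (not any of the cited systems) of the candidate-generation and labeling step; features like those of slide 14 would then be computed for each candidate:

def candidate_windows(tokens, max_len=4):
    """Enumerate every token span of length 1..max_len as a candidate field."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield (start, end)

def training_instances(tokens, true_spans):
    """Positive instances: candidates matching an annotated span.
    Negative instances: all other candidates."""
    for span in candidate_windows(tokens):
        yield span, (span in true_spans)

tokens = ": 00 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
labeled = list(training_instances(tokens, true_spans={(11, 13)}))  # "Sebastian Thrun"
print(sum(1 for _, y in labeled if y), "positive out of", len(labeled), "candidates")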
26. Rule-Learning Approaches to Sliding-Window Classification: Summary

- Representations for classifiers allow restriction of the relationships between tokens, etc.
- Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
- Use of these heavyweight representations is complicated, but seems to pay off in results
27-31. IE by Boundary Detection

(These slides show the same CMU UseNet seminar announcement again, this time illustrating the detection of candidate field start and end boundaries.)
32. BWI: Learning to Detect Boundaries [Freitag & Kushmerick, AAAI 2000]

- Another formulation: learn three probabilistic classifiers:
  - START(i) = Prob(position i starts a field)
  - END(j) = Prob(position j ends a field)
  - LEN(k) = Prob(an extracted field has length k)
- Then score a possible extraction (i,j) by START(i) x END(j) x LEN(j-i) (see the sketch after this slide)
- LEN(k) is estimated from a histogram
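A small sketch of that scoring rule. The START and END tables and the training field lengths below are made-up stand-ins for what the learned boundary detectors and the histogram would provide:

from collections import Counter

def length_histogram(training_field_lengths):
    """LEN(k): empirical probability that an extracted field has length k."""
    counts = Counter(training_field_lengths)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def score(i, j, start_p, end_p, len_p):
    """Score a candidate extraction spanning positions i..j (j exclusive)."""
    return start_p.get(i, 0.0) * end_p.get(j, 0.0) * len_p.get(j - i, 0.0)

start_p = {11: 0.9, 3: 0.2}                # made-up START(i) values
end_p = {13: 0.8, 5: 0.1}                  # made-up END(j) values
len_p = length_histogram([2, 2, 3, 1, 2])  # made-up training field lengths

candidates = [(i, j) for i in start_p for j in end_p if j > i]
best = max(candidates, key=lambda ij: score(ij[0], ij[1], start_p, end_p, len_p))
print(best, score(best[0], best[1], start_p, end_p, len_p))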
33. BWI: Learning to Detect Boundaries

- BWI uses boosting to find detectors for START and END
- Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i)
- Each pattern is a sequence of tokens and/or wildcards like anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
- The weak learner for patterns uses greedy search (with lookahead) to repeatedly extend a pair of initially empty BEFORE, AFTER patterns
34. BWI: Learning to Detect Boundaries

Results (F1 by field):
  Person Name: 30
  Location: 61
  Start Time: 98
35. Problems with Sliding Windows and Boundary Finders

- Decisions in neighboring parts of the input are made independently from each other:
  - A naïve-Bayes sliding window may predict a seminar end time before the seminar start time.
  - It is possible for two overlapping windows to both be above threshold.
  - In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
36. Finite State Machines
37. Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

[Figure: finite state / graphical model with states S_{t-1}, S_t, S_{t+1} (transitions) and observations O_{t-1}, O_t, O_{t+1}; the model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8]

Parameters, for all states S = {s1, s2, …}:
- Start state probabilities P(s_t)
- Transition probabilities P(s_t | s_{t-1})
- Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: maximize probability of training observations (w/ prior).
38. IE with Hidden Markov Models

Given a sequence of observations:

  Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM, find the most likely state sequence (Viterbi; a decoder sketch follows this slide):

  Yesterday Lawrence Saul spoke this example sentence.

Any words said to be generated by the designated "person name" state are extracted as a person name:

  Person name: Lawrence Saul
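A compact Viterbi decoder sketch over a toy two-state model; the states, probabilities, and smoothing constant are invented for illustration and are not the parameters of any system in the talk:

import math

def viterbi(obs, states, start_p, trans_p, emit_p, smooth=1e-12):
    """Most likely state sequence for obs under an HMM given as probability dicts."""
    V = [{}]      # V[t][s] = best log-probability of any path ending in state s at time t
    back = [{}]   # back[t][s] = previous state on that best path
    for s in states:
        V[0][s] = math.log(start_p.get(s, smooth)) + math.log(emit_p[s].get(obs[0], smooth))
        back[0][s] = None
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, best = max(((p, V[t - 1][p] + math.log(trans_p[p].get(s, smooth)))
                              for p in states), key=lambda x: x[1])
            V[t][s] = best + math.log(emit_p[s].get(obs[t], smooth))
            back[t][s] = prev
    s = max(states, key=lambda q: V[-1][q])   # best final state
    path = [s]
    for t in range(len(obs) - 1, 0, -1):      # follow backpointers
        s = back[t][s]
        path.append(s)
    return list(reversed(path))

states = ["person-name", "other"]
start_p = {"person-name": 0.1, "other": 0.9}
trans_p = {"person-name": {"person-name": 0.6, "other": 0.4},
           "other": {"person-name": 0.1, "other": 0.9}}
emit_p = {"person-name": {"lawrence": 0.3, "saul": 0.3},
          "other": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                    "example": 0.2, "sentence": 0.2}}
obs = "yesterday lawrence saul spoke this example sentence".split()
print(list(zip(obs, viterbi(obs, states, start_p, trans_p, emit_p))))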
39. Generative Extraction with HMMs [McCallum, Nigam, Seymore & Rennie 00]

- Parameters: P(s_t | s_{t-1}) and P(o_t | s_t), for all states s_t and words o_t
- The parameters define a generative model
40. HMM Example: Nymble [Bikel et al. 97]

Task: Named Entity Extraction

[Figure: state diagram with start-of-sentence and end-of-sentence states, name-class states Person, Org, (five other name classes), and Other]

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then to P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then to P(o_t).

Train on 450k words of news wire text.

Results (F1):
  Case   Language   F1
  Mixed  English    93
  Upper  English    91
  Mixed  Spanish    90

Other examples of HMMs in IE: [Leek 97], [Freitag & McCallum 99], [Seymore et al. 99]
41. Regrets from the Atomic View of Tokens

Would like a richer representation of text: multiple overlapping features, whole chunks of text.

- Line, sentence, or paragraph features
  - length
  - is centered in page
  - percent of non-alphabetics
  - white-space aligns with next line
  - containing sentence has two verbs
  - grammatically contains a question
  - contains links to authoritative pages
  - emissions that are uncountable
  - features at multiple levels of granularity
- Example word features
  - identity of word
  - is in all caps
  - ends in "-ski"
  - is part of a noun phrase
  - is in a list of city names
  - is under node X in WordNet or Cyc
  - is in bold font
  - is in hyperlink anchor
- Features of past and future
  - last person name was female
  - next two words are "and Associates"
42. Problems with Richer Representation and a Generative Model

- These arbitrary features are not independent:
  - Overlapping and long-distance dependencies
  - Multiple levels of granularity (words, characters)
  - Multiple modalities (words, formatting, layout)
  - Observations from past and future
- HMMs are generative models of the text
- Generative models do not easily handle these non-independent features. Two choices:
  - Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
  - Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
43. Conditional Sequence Models

- We would prefer a conditional model P(s | o) instead of P(s, o):
  - Can examine features, but not responsible for generating them.
  - Don't have to explicitly model their dependencies.
  - Don't waste modeling effort trying to generate what we are given at test time anyway.
- If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.
44. Conditional Markov Models

Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000]

[Figure: two chain models over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}. Generative (traditional HMM): transitions between states, states emit the observations. Conditional: transitions between states, states conditioned on the observations.]

Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.
45. Exponential Form for Next-State Function

Capture the dependency on s_{t-1} with |S| independent functions, P_{s_{t-1}}(s_t | o_t). Each state contains a next-state classifier that, given the next observation, produces a probability of the next state:

  P_{s_{t-1}}(s_t | o_t) = (1 / Z(o_t, s_{t-1})) exp( Σ_k λ_k f_k(o_t, s_t) )

where each λ_k is a weight and each f_k is a feature.

Recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum entropy (a prediction-time sketch follows this slide).
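The prediction-time side of such a per-state classifier, sketched with hypothetical states, features, and weights (the maximum-entropy training step itself is not shown):

import math

def next_state_probs(prev_state, obs_feats, weights, states):
    """P_{prev_state}(s | o): one maximum-entropy next-state classifier per previous state.
    weights[prev_state][s][f] is the weight lambda_k for feature f firing toward state s;
    obs_feats is the set of features active on the current observation."""
    scores = {s: sum(weights[prev_state][s].get(f, 0.0) for f in obs_feats) for s in states}
    z = sum(math.exp(v) for v in scores.values())   # per-source-state normalizer Z(o, prev_state)
    return {s: math.exp(v) / z for s, v in scores.items()}

# Hypothetical model for illustration only.
states = ["speaker", "other"]
weights = {"other":   {"speaker": {"prev-token=speaker:": 2.0, "is-capitalized": 1.0},
                       "other":   {"is-lowercase": 1.5}},
           "speaker": {"speaker": {"is-capitalized": 1.2},
                       "other":   {}}}
print(next_state_probs("other", {"prev-token=speaker:", "is-capitalized"}, weights, states))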
46. Label Bias Problem

- Consider this MEMM, and enough training data to perfectly model it:

[FSM: from start state 0, the path 0→1→2→3 reads "rib" and the path 0→4→5→3 reads "rob"]

  Pr(0→1→2→3 | "rib") = 1    Pr(0→4→5→3 | "rob") = 1

  Pr(0→1→2→3 | "rob") = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1
  Pr(0→4→5→3 | "rib") = Pr(4|0,r)/Z1 · Pr(5|4,i)/Z2 · Pr(3|5,b)/Z3 = 0.5 · 1 · 1

Because each state's outgoing distribution must sum to one, states 1 and 4 pass on all the probability mass they receive regardless of the middle observation, so the model cannot prefer one path over the other after seeing the first letter.
47. From HMMs to MEMMs to CRFs

Conditional Random Fields (CRFs) [Lafferty, McCallum & Pereira 2001]

[Figure: three chain-structured models over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: HMM (directed, states emit observations), MEMM (directed, states conditioned on observations), and linear-chain CRF (undirected).]

(A special case of MEMMs and CRFs.)
48. Conditional Random Fields (CRFs)

[Figure: linear chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, all conditioned on the observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}]

Markov on s, conditional dependency on o.

The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph (written out below).

Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|²), just like HMMs.
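Written out, this is the standard linear-chain form from Lafferty, McCallum & Pereira 2001 (the slide itself leaves the equation implicit):

p(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})}
    \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big),
\qquad
Z(\mathbf{o}) = \sum_{\mathbf{s}'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \Big)

Each f_k is an arbitrary feature of a state clique and the whole observation sequence, and λ_k is its learned weight.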
49. Training CRFs

- Methods:
  - iterative scaling (quite slow)
  - conjugate gradient (much faster)
  - conjugate gradient with preconditioning (super fast)
  - limited-memory quasi-Newton methods (also super fast)
- Complexity comparable to standard Baum-Welch

[Sha & Pereira 2002; Malouf 2002]
50. Sample IE Applications of CRFs

- Noun phrase segmentation [Sha & Pereira 03]
- Named entity recognition [McCallum & Li 03]
- Protein names in bio abstracts [Settles 05]
- Addresses in web pages [Culotta et al. 05]
- Semantic roles in text [Roth & Yih 05]
- RNA structural alignment [Sato & Sakakibara 05]
51. Examples of Recent CRF Research

- Semi-Markov CRFs [Sarawagi & Cohen 05]
  - Awkwardness of token-level decisions for segments
  - A segment sequence model alleviates this
  - Two-level model with sequences of segments, which are sequences of tokens
- Stochastic Meta-Descent [Vishwanathan et al. 06]
  - Stochastic gradient optimization for training
  - Take gradient steps with small batches of examples
  - Order of magnitude faster than L-BFGS
  - Same resulting accuracies for extraction
52. Further Reading about CRFs

- Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press, 2006.
- http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf