Machine Learning for Information Extraction: An Overview

Transcript and Presenter's Notes
1
Machine Learning for Information Extraction:
An Overview
  • Kamal Nigam
  • Google Pittsburgh

With input, slides and suggestions from William
Cohen, Andrew McCallum and Ion Muslea
2
Example: A Problem
Mt. Baker, the school district
Baker Hostetler, the company
Baker, a job opening
Genomics job
3
Example: A Solution
4
Job Openings:
  Category: Food Services
  Keyword: Baker
  Location: Continental U.S.
5
Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy …
Contact: susan@foodscience.com
Category: Travel/Hospitality
Function: Food Services
6
Potential Enabler of Faceted Search
7
Lots of Structured Information in Text
8
IE from Research Papers
9
What is Information Extraction?
  • Recovering structured data from formatted text

10
What is Information Extraction?
  • Recovering structured data from formatted text
  • Identifying fields (e.g. named entity
    recognition)

11
What is Information Extraction?
  • Recovering structured data from formatted text
  • Identifying fields (e.g. named entity
    recognition)
  • Understanding relations between fields (e.g.
    record association)

12
What is Information Extraction?
  • Recovering structured data from formatted text
  • Identifying fields (e.g. named entity
    recognition)
  • Understanding relations between fields (e.g.
    record association)
  • Normalization and deduplication

13
What is Information Extraction?
  • Recovering structured data from formatted text
  • Identifying fields (e.g. named entity
    recognition)
  • Understanding relations between fields (e.g.
    record association)
  • Normalization and deduplication
  • Today: focus mostly on field identification, a
    little on record association

14
IE Posed as a Machine Learning Task
  • Training data: documents marked up with ground
    truth
  • In contrast to text classification, local
    features are crucial. Features of (sketched in
    code below):
  • Contents
  • Text just before item
  • Text just after item
  • Begin/end boundaries

…:00 pm  Place: Wean Hall Rm 5409  Speaker: Sebastian Thrun …
(prefix: the text up through "Speaker:"; contents:
"Sebastian Thrun"; suffix: the text after)
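The bullets above map directly onto a feature
function. Below is a minimal sketch (not from the
slides) of how contents, prefix, suffix and boundary
features might be assembled for a candidate span; the
window size k and the feature names are illustrative
assumptions.

def span_features(tokens, start, end, k=3):
    """Features for candidate span tokens[start:end] (sketch)."""
    feats = []
    for tok in tokens[start:end]:                        # contents
        feats.append("contents=" + tok.lower())
        if tok.istitle():
            feats.append("contents-capitalized")
    for i, tok in enumerate(tokens[max(0, start - k):start]):
        feats.append("prefix-%d=%s" % (i, tok.lower())) # text just before
    for i, tok in enumerate(tokens[end:end + k]):
        feats.append("suffix-%d=%s" % (i, tok.lower())) # text just after
    feats.append("length=%d" % (end - start))           # begin/end boundaries
    return feats

# e.g. span_features("Speaker Sebastian Thrun will talk".split(), 1, 3)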
15
Good Features for Information Extraction
Creativity and Domain Knowledge Required!
contains-question-mark, contains-question-word,
ends-with-question-mark, first-alpha-is-capitalized,
indented, indented-1-to-4, indented-5-to-10,
more-than-one-third-space, only-punctuation,
prev-is-blank, prev-begins-with-ordinal,
shorter-than-30
  • Example word features
  • identity of word
  • is in all caps
  • ends in -ski
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet or Cyc
  • is in bold font
  • is in hyperlink anchor
  • features of past & future
  • last person name was female
  • next two words are "& Associates"

begins-with-number, begins-with-ordinal,
begins-with-punctuation, begins-with-question-word,
begins-with-subject, blank, contains-alphanum,
contains-bracketed-number, contains-http,
contains-non-space, contains-number, contains-pipe
16
Good Features for Information Extraction
Creativity and Domain Knowledge Required!
Is Capitalized, Is Mixed Caps, Is All Caps, Initial
Cap, Contains Digit, All lowercase, Is Initial,
Punctuation, Period, Comma, Apostrophe, Dash,
Preceded by HTML tag

Character n-gram classifier says string is a person
name (80% accurate)
In stopword list (the, of, their, etc.)
In honorific list (Mr, Mrs, Dr, Sen, etc.)
In person suffix list (Jr, Sr, PhD, etc.)
In name particle list (de, la, van, der, etc.)
In Census lastname list, segmented by P(name)
In Census firstname list, segmented by P(name)
In locations lists (states, cities, countries)
In company name list ("J. C. Penny")
In list of company suffixes (Inc, & Associates,
Foundation)
  • Word Features (a per-token sketch in code
    follows below)
  • lists of job titles
  • lists of prefixes
  • lists of suffixes
  • 350 informative phrases
  • HTML/Formatting Features
  • {begin, end, in} × {<b>, <i>, <a>, <hN>} ×
    {lengths 1, 2, 3, 4, or longer}
  • begin, end of line
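As a concrete illustration, here is a minimal sketch
(not from the slides) of a per-token feature function
implementing a few of the tests above; the gazetteer
and honorific sets are tiny stand-ins for the real
lists the slide mentions.

import re

CITY_NAMES = {"pittsburgh", "boston"}    # stand-in for a city-name list
HONORIFICS = {"mr", "mrs", "dr", "sen"}  # stand-in for the honorific list

def word_features(word):
    """A few of the example word features for one token (sketch)."""
    f = {}
    f["word=" + word.lower()] = 1                       # identity of word
    f["all-caps"] = int(word.isupper())                 # is in all caps
    f["init-cap"] = int(word[:1].isupper())             # initial capital
    f["ends-ski"] = int(word.lower().endswith("ski"))   # ends in -ski
    f["has-digit"] = int(bool(re.search(r"\d", word)))  # contains digit
    f["in-city-list"] = int(word.lower() in CITY_NAMES)
    f["honorific"] = int(word.lower().rstrip(".") in HONORIFICS)
    return f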

17
IE History
  • Pre-Web
  • Mostly news articles
  • De Jong's FRUMP [1982]
  • Hand-built system to fill Schank-style scripts
    from news wire
  • Message Understanding Conference (MUC) DARPA
    '87-'95, TIPSTER '92-'96
  • Most early work dominated by hand-built models
  • E.g. SRI's FASTUS, hand-built FSMs.
  • But by 1990s, some machine learning [Lehnert,
    Cardie, Grishman] and then HMMs [Elkan & Leek '97,
    BBN's Bikel et al. '98]
  • Web
  • AAAI '94 Spring Symposium on Software Agents
  • Much discussion of ML applied to Web. [Maes,
    Mitchell, Etzioni]
  • Tom Mitchell's WebKB, '96
  • Build KBs from the Web.
  • Wrapper Induction
  • Initially hand-built, then ML [Soderland '96,
    Kushmerick '97, …]

18
Landscape of ML Techniques for IE
Classify Candidates

  "Abraham Lincoln was born in Kentucky."  ->  Classifier  ->  which class?
Any of these models can be used to capture words,
formatting or both.
19
Sliding Windows & Boundary Detection
20
Information Extraction by Sliding Windows
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer
Science Carnegie Mellon University
3:30 pm, 7500 Wean
Hall Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
21-24
Information Extraction by Sliding Window
(Slides 21-24 repeat the announcement above as the
window slides across the text.)
25
Information Extraction with Sliding Windows
[Freitag '97, '98; Soderland '97; Califf '98]
…:00 pm  Place: Wean Hall Rm 5409  Speaker: Sebastian Thrun …

  … w_{t-m} … w_{t-1}   w_t … w_{t+n}   w_{t+n+1} … w_{t+n+m} …
    (prefix)            (contents)      (suffix)
  • Standard supervised learning setting (candidate
    generation sketched in code below)
  • Positive instances: candidates with real label
  • Negative instances: all other candidates
  • Features based on candidate, prefix and suffix
  • Special-purpose rule learning systems work well

courseNumber(X) :-
    tokenLength(X, =, 2),
    every(X, inTitle, false),
    some(X, A, <previousToken>, inTitle, true),
    some(X, B, <>, tripleton, true)
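A minimal sketch (not from the slides) of the
candidate-generation side of this setup;
extract_features stands in for a span feature
function like the one sketched earlier, and max_len
is an illustrative assumption.

def windows(tokens, max_len=5):
    """Enumerate all candidate spans up to max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end

def make_training_data(tokens, true_spans, extract_features):
    """Positive instances: candidates matching a labeled field.
    Negative instances: all other candidates."""
    data = []
    for start, end in windows(tokens):
        label = (start, end) in true_spans
        data.append((extract_features(tokens, start, end), label))
    return data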
26
Rule-learning approaches to sliding-window
classification: Summary
  • Representations for classifiers allow restriction
    of the relationships between tokens, etc
  • Representations are carefully chosen subsets of
    even more powerful representations based on logic
    programming (ILP and Prolog)
  • Use of these heavyweight representations is
    complicated, but seems to pay off in results

27
IE by Boundary Detection
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer
Science Carnegie Mellon University
3:30 pm, 7500 Wean
Hall Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
28-31
IE by Boundary Detection
(Slides 28-31 repeat the announcement above while
candidate start and end boundaries are marked.)
32
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
  • Another formulation: learn three probabilistic
    classifiers
  • START(i) = Prob(position i starts a field)
  • END(j) = Prob(position j ends a field)
  • LEN(k) = Prob(an extracted field has length k)
  • Then score a possible extraction (i,j) by
    START(i) × END(j) × LEN(j-i)
  • LEN(k) is estimated from a histogram (scoring
    sketched in code below)
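A minimal sketch (not from the paper) of the scoring
rule above: LEN comes from a histogram of training
field lengths, and start_prob / end_prob stand in for
the learned boundary classifiers.

from collections import Counter

def length_model(train_lengths):
    """LEN(k) from a histogram of observed field lengths."""
    hist = Counter(train_lengths)
    total = float(sum(hist.values()))
    return lambda k: hist[k] / total

def best_extraction(start_prob, end_prob, LEN, n, max_len=10):
    """Highest-scoring (i, j) under START(i) * END(j) * LEN(j - i)."""
    best, best_score = None, 0.0
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            score = start_prob(i) * end_prob(j) * LEN(j - i)
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score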

33
BWI: Learning to detect boundaries
  • BWI uses boosting to find detectors for START
    and END
  • Each weak detector has a BEFORE and AFTER pattern
    (on tokens before/after position i).
  • Each pattern is a sequence of tokens and/or
    wildcards like anyAlphabeticToken, anyToken,
    anyUpperCaseLetter, anyNumber, …
  • Weak learner for patterns uses greedy search
    (with lookahead) to repeatedly extend a pair of
    empty BEFORE, AFTER patterns

34
BWI: Learning to detect boundaries
Field        F1
Person Name  30
Location     61
Start Time   98
35
Problems with Sliding Windows and Boundary
Finders
  • Decisions in neighboring parts of the input are
    made independently from each other.
  • Naïve Bayes Sliding Window may predict a seminar
    end time before the seminar start time.
  • It is possible for two overlapping windows to
    both be above threshold.
  • In a Boundary-Finding system, left boundaries are
    laid down independently from right boundaries,
    and their pairing happens as a separate step.

36
Finite State Machines
37
Hidden Markov Models
HMMs are the standard sequence modeling tool in
genomics, music, speech, NLP, …

Finite state model / graphical model:

  states:       … -> s_{t-1} -> s_t -> s_{t+1} -> …  (transitions)
  observations:       o_{t-1}    o_t    o_{t+1}      (emissions)

Generates: a state sequence and an observation
sequence o1 o2 o3 o4 o5 o6 o7 o8

Parameters, for all states S = {s1, s2, …}:
  Start state probabilities            P(s_t)
  Transition probabilities             P(s_t | s_{t-1})
  Observation (emission) probabilities P(o_t | s_t)
    (usually a multinomial over an atomic, fixed
    alphabet)

Training: maximize probability of training
observations (w/ prior)
38
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Lawrence Saul spoke this example
sentence.
and a trained HMM,
find the most likely state sequence (Viterbi):
Yesterday Lawrence Saul spoke this example
sentence.
Any words said to be generated by the designated
"person name" state are extracted as a person name:
Person name: Lawrence Saul
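A minimal Viterbi sketch (not from the slides),
assuming the trained HMM is given as log-probability
dictionaries; the flat -20.0 log-prob for unseen
words is an illustrative smoothing assumption.

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for the observations `obs`.
    log_start[s], log_trans[s][s2], log_emit[s][word] are log-probs."""
    V = [{s: log_start[s] + log_emit[s].get(obs[0], -20.0) for s in states}]
    back = []
    for word in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            col[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s].get(word, -20.0)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):   # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Tokens aligned with a state named e.g. "person-name" are the extraction.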
39
Generative Extraction with HMMs
[McCallum, Nigam, Seymore & Rennie '00]
  • Parameters P(s_t | s_{t-1}), P(o_t | s_t), for
    all states s_t, words o_t
  • Parameters define generative model

40
HMM Example: Nymble
[Bikel et al. '97]
Task: Named Entity Extraction

States: start-of-sentence, end-of-sentence, Person,
Org, (five other name classes), Other

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}),
  with back-off to P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or
  P(o_t | s_t, o_{t-1}),
  with back-off to P(o_t | s_t), then P(o_t)
  (a simple back-off sketch follows this slide)
Train on 450k words of news wire text.
Results:
  Case   Language  F1
  Mixed  English   93%
  Upper  English   91%
  Mixed  Spanish   90%
Other examples of HMMs in IE: [Leek '97], [Freitag &
McCallum '99], [Seymore et al. '99]
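A simple back-off sketch (not Nymble's exact scheme,
which derives its back-off weights from counts
rather than fixing them): linear interpolation of
the three emission estimates, with illustrative
weights.

def backoff_emit(o, s, o_prev, p2, p1, p0, lam=(0.6, 0.3, 0.1)):
    """P(o_t | s_t, o_{t-1}) backed off to P(o_t | s_t) and P(o_t).
    p2, p1, p0 are pre-normalized relative-frequency estimates."""
    return (lam[0] * p2.get((s, o_prev), {}).get(o, 0.0)
            + lam[1] * p1.get(s, {}).get(o, 0.0)
            + lam[2] * p0.get(o, 0.0))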
41
Regrets from Atomic View of Tokens
Would like a richer representation of text:
multiple overlapping features, whole chunks of
text.
  • line, sentence, or paragraph features
  • length
  • is centered in page
  • percent of non-alphabetics
  • white-space aligns with next line
  • containing sentence has two verbs
  • grammatically contains a question
  • contains links to authoritative pages
  • emissions that are uncountable
  • features at multiple levels of granularity
  • Example word features
  • identity of word
  • is in all caps
  • ends in -ski
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet or Cyc
  • is in bold font
  • is in hyperlink anchor
  • features of past & future
  • last person name was female
  • next two words are "& Associates"

42
Problems with Richer Representation and a
Generative Model
  • These arbitrary features are not independent
  • Overlapping and long-distance dependencies
  • Multiple levels of granularity (words,
    characters)
  • Multiple modalities (words, formatting, layout)
  • Observations from past and future
  • HMMs are generative models of the text
  • Generative models do not easily handle these
    non-independent features. Two choices
  • Model the dependencies. Each state would have
    its own Bayes Net. But we are already starved
    for training data!
  • Ignore the dependencies. This causes
    over-counting of evidence (à la naïve Bayes).
    Big problem when combining evidence, as in
    Viterbi!

43
Conditional Sequence Models
  • We would prefer a conditional model P(s|o)
    instead of P(s,o)
  • Can examine features, but not responsible for
    generating them.
  • Don't have to explicitly model their
    dependencies.
  • Don't waste modeling effort trying to generate
    what we are given at test time anyway.
  • If successful, this answers the challenge of
    integrating the ability to handle many arbitrary
    features with the full power of finite state
    automata.

44
Conditional Markov Models
Maximum Entropy Markov Models [McCallum, Freitag &
Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi,
1996]; SNoW-based Markov Model [Punyakanok & Roth,
2000]
Generative (traditional HMM):
  transitions s_{t-1} -> s_t -> s_{t+1}, with each
  state generating (pointing down to) its
  observation o_t

Conditional:
  the same state chain, but with the arrows between
  observations and states reversed: each observation
  o_t conditions (points up into) its state s_t
Standard belief propagation: the forward-backward
procedure. Viterbi and Baum-Welch follow naturally.
45
Exponential Form for Next State Function
Capture the dependency on s_{t-1} with |S|
independent functions, P_{s_{t-1}}(s_t | o_t). Each
state contains a next-state classifier that, given
the next observation, produces a probability of the
next state:

  P_{s_{t-1}}(s_t | o_t) = (1 / Z(o_t, s_{t-1})) exp( sum_k lambda_k f_k(o_t, s_t) )

with lambda_k a learned weight and f_k a feature.

Recipe:
  - Labeled data is assigned to transitions.
  - Train each state's exponential model by maximum
    entropy.
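A minimal sketch (not from the slides) of one
state's next-state classifier in this exponential
form; the weight and feature names are illustrative.

import math

def next_state_dist(weights, feats, states):
    """P_{s_{t-1}}(s_t | o_t) as a softmax over weighted features.
    weights[s][k] is the weight lambda_k for destination state s;
    feats is the set of features active on the observation o_t."""
    scores = {s: sum(weights[s].get(k, 0.0) for k in feats) for s in states}
    Z = sum(math.exp(v) for v in scores.values())  # per-state normalizer
    return {s: math.exp(v) / Z for s, v in scores.items()}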
46
Label Bias Problem
  • Consider this MEMM, and enough training data to
    perfectly model it (two paths from state 0 to
    state 3: 0-1-2-3 reading "rib", and 0-4-5-3
    reading "rob")

Pr(0123 | rib) = 1     Pr(0453 | rob) = 1

Pr(0123 | rob) = Pr(1 | 0, r)/Z1 * Pr(2 | 1, o)/Z2 * Pr(3 | 2, b)/Z3
               = 0.5 * 1 * 1
Pr(0453 | rib) = Pr(4 | 0, r)/Z1 * Pr(5 | 4, i)/Z2 * Pr(3 | 5, b)/Z3
               = 0.5 * 1 * 1
47
From HMMs to MEMMs to CRFs
Conditional Random Fields (CRFs)
[Lafferty, McCallum & Pereira 2001]

HMM:  directed chain S_{t-1} -> S_t -> S_{t+1}, each
      state generating its observation O_t
      (a special case of MEMMs and CRFs)
MEMM: directed chain over states, with each
      observation O_t conditioning its state S_t
CRF:  undirected chain over states, each state
      linked to its observation O_t; conditional and
      globally normalized
48
Conditional Random Fields (CRFs)
States S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} over
the observations O = (O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4})

Markov on s, conditional dependency on o.

The Hammersley-Clifford-Besag theorem stipulates
that the CRF has this form: an exponential function
of the cliques in the graph (written out below).

Assuming that the dependency structure of the states
is tree-shaped (a linear chain is a trivial tree),
inference can be done by dynamic programming in time
O(|o| |S|^2), just like HMMs.
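The exponential form referred to above, written out
for the linear-chain case (as in Lafferty, McCallum
& Pereira 2001):

  P(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})}
      \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big),
  \qquad
  Z(\mathbf{o}) = \sum_{\mathbf{s}'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \Big)

Note that Z(o) is a single sum over whole state
sequences; this global normalization is what avoids
the label bias of the per-state normalizers Z1, Z2,
Z3 on the previous slide.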
49
Training CRFs
  • Methods
  • iterative scaling (quite slow)
  • conjugate gradient (much faster)
  • conjugate gradient with preconditioning (super
    fast)
  • limited-memory quasi-Newton methods (also super
    fast)
  • Complexity comparable to standard Baum-Welch

[Sha & Pereira 2002; Malouf 2002]
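All of these optimizers maximize the same objective;
for reference, the standard penalized conditional
log-likelihood and its gradient (the form used by
Sha & Pereira 2002), with a Gaussian prior of
variance \sigma^2:

  \mathcal{L}(\Lambda) = \sum_i \log P_\Lambda(\mathbf{s}^{(i)} \mid \mathbf{o}^{(i)})
      - \sum_k \frac{\lambda_k^2}{2\sigma^2}
  \qquad
  \frac{\partial \mathcal{L}}{\partial \lambda_k}
    = \sum_i \Big( F_k(\mathbf{s}^{(i)}, \mathbf{o}^{(i)})
        - \mathbb{E}_{P_\Lambda(\mathbf{s} \mid \mathbf{o}^{(i)})}\big[ F_k(\mathbf{s}, \mathbf{o}^{(i)}) \big] \Big)
    - \frac{\lambda_k}{\sigma^2}

where F_k sums the feature f_k over sequence
positions; the expectations are computed with
forward-backward.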
50
Sample IE Applications of CRFs
  • Noun phrase segmentation [Sha & Pereira '03]
  • Named entity recognition [McCallum & Li '03]
  • Protein names in bio abstracts [Settles '05]
  • Addresses in web pages [Culotta et al. '05]
  • Semantic roles in text [Roth & Yih '05]
  • RNA structural alignment [Sato & Sakakibara '05]

51
Examples of Recent CRF Research
  • Semi-Markov CRFs [Sarawagi & Cohen '05]
  • Awkwardness of token-level decisions for segments
  • Segment sequence model alleviates this
  • Two-level model with sequences of segments, which
    are sequences of tokens
  • Stochastic Meta-Descent [Vishwanathan et al. '06]
  • Stochastic gradient optimization for training
  • Take gradient steps with small batches of
    examples
  • Order of magnitude faster than L-BFGS
  • Same resulting accuracies for extraction
  • Same resulting accuracies for extraction

52
Further Reading about CRFs
  • Charles Sutton and Andrew McCallum. An
    Introduction to Conditional Random Fields for
    Relational Learning. In Introduction to
    Statistical Relational Learning. Edited by Lise
    Getoor and Ben Taskar. MIT Press. 2006.
  • http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf