Title: Information Extraction
1 Information Extraction
- Yunyao Li
- EECS /SI 767
- 03/29/2006
2 The Problem
- Date
- Time (Start - End)
- Location
- Speaker
- Person
3 What is Information Extraction
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered, saying ...
NAME | TITLE | ORGANIZATION (empty slot table)
Courtesy of William W. Cohen
4 What is Information Extraction
As a task: filling slots in a database from sub-segments of text.
(The same news passage as on slide 3.)
IE
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Software Foundation
Courtesy of William W. Cohen
5 What is Information Extraction
Information Extraction = segmentation + classification + association + clustering
(The same news passage as on slide 3.)
Segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
aka named entity extraction
Courtesy of William W. Cohen
6-8 What is Information Extraction
(Slides 6-8 repeat the passage and extracted segments from slide 5, highlighting the remaining sub-tasks: classification, association, and clustering.)
Courtesy of William W. Cohen
10 Landscape of IE Techniques
Lexicons: check whether a candidate segment is a member of a list.
Example: "Abraham Lincoln was born in Kentucky." - is "Kentucky" a member of {Alabama, Alaska, ..., Wisconsin, Wyoming}?
Our focus today: statistical sequence models (HMM, MEMM, CRF).
Courtesy of William W. Cohen
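The lexicon approach above amounts to a membership test per token. A minimal sketch, with a hypothetical lexicon and label set:

```python
# Lexicon-based tagging sketch: the lexicon, tokenization, and "LOC"/"O"
# labels are illustrative, not part of any particular IE system.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

def tag_with_lexicon(tokens, lexicon):
    """Label a token 'LOC' if it is a member of the lexicon, else 'O'."""
    return [(t, "LOC" if t in lexicon else "O") for t in tokens]

tags = tag_with_lexicon("Abraham Lincoln was born in Kentucky .".split(), US_STATES)
```

This catches "Kentucky" but nothing else; the rest of the deck shows why sequence models are a better fit for slots that a fixed list cannot enumerate.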
11 Markov Property
S1 = rain, S2 = cloud, S3 = sun
The state of a system at time t+1, q_{t+1}, is conditionally independent of q_{t-1}, q_{t-2}, ..., q_1, q_0 given q_t. In other words, the current state alone determines the probability distribution of the next state.
[State-transition diagram over S1, S2, S3; the edge probabilities (1/2, 1/3, 1/2, 2/3, 1) could not be matched to edges after extraction.]
12 Markov Property
S1 = rain, S2 = cloud, S3 = sun
[The same state-transition diagram, with state-transition probabilities A; edge values (1/2, 1/3, 1/2, 2/3, 1) garbled in extraction.]
Q: given that today is sunny (i.e., q_1 = S3), what is the probability of sun -> cloud under the model?
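Under the Markov property, the answer to a question like this is read off the transition matrix. A sketch with an assumed matrix (the exact numbers on the slide's diagram did not survive extraction, so these are illustrative):

```python
# Markov-chain sketch. The transition probabilities below are assumed for
# illustration; each row is a distribution over next states.
A = {
    "rain":  {"rain": 0.0, "cloud": 1.0, "sun": 0.0},
    "cloud": {"rain": 1/3, "cloud": 0.0, "sun": 2/3},
    "sun":   {"rain": 0.0, "cloud": 1/2, "sun": 1/2},
}

def path_probability(path, A):
    """P(q_2, ..., q_n | q_1) under the Markov property: a product of
    one-step transition probabilities."""
    p = 1.0
    for prev, cur in zip(path, path[1:]):
        p *= A[prev][cur]
    return p

p = path_probability(["sun", "cloud"], A)  # P(cloud tomorrow | sun today)
```

With this matrix the answer is simply A["sun"]["cloud"]; longer sequences multiply one factor per step.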
13 Hidden Markov Model
S1 = rain, S2 = cloud, S3 = sun
[The same state-transition diagram, now with observation (emission) probabilities attached to each state; the values (1/10, 9/10, 4/5, 7/10, 3/10, 1/5) could not be matched to states after extraction.]
14 IE with Hidden Markov Model
Given a sequence of observations:
SI/EECS 767 is held weekly at SIN2.
and a trained HMM with states for course name, location name, and background,
find the most likely state sequence (Viterbi):
SI/EECS 767 is held weekly at SIN2
Any words generated by the designated "course name" state are extracted as a course name:
Course name: SI/EECS 767
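The decoding step can be sketched as a tiny Viterbi decoder. The states, transition, and emission probabilities below are invented for illustration (two states instead of three, a three-word input), not a trained model:

```python
import math

# Toy HMM for Viterbi decoding. All probabilities are made up; a real model
# would be estimated from labeled announcements.
states = ["course", "background"]
start = {"course": 0.3, "background": 0.7}
trans = {"course": {"course": 0.6, "background": 0.4},
         "background": {"course": 0.1, "background": 0.9}}
emit = {"course": {"SI/EECS": 0.5, "767": 0.4, "is": 0.1},
        "background": {"SI/EECS": 0.05, "767": 0.05, "is": 0.9}}

def viterbi(obs):
    """Most likely state sequence for obs, computed in log-space."""
    V = {s: (math.log(start[s]) + math.log(emit[s][obs[0]]), [s]) for s in states}
    for o in obs[1:]:
        V_new = {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: V[p][0] + math.log(trans[p][s]))
            score, path = V[prev]
            V_new[s] = (score + math.log(trans[prev][s]) + math.log(emit[s][o]),
                        path + [s])
        V = V_new
    return max(V.values())[1]

path = viterbi(["SI/EECS", "767", "is"])
```

The tokens decoded as "course" would then be extracted as the course name, as on the slide.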
15 Named Entity Extraction
(Bikel et al., 1998)
Hidden states: the name classes. [Diagram lost in extraction.]
16 Named Entity Extraction
Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
(1) Generating the first word of a name class
(2) Generating the rest of the words in the name class
(3) Generating the end of a name class
17 Training: Estimating Probabilities
18 Back-Off
To handle unknown words and insufficient training data, back off to less specific distributions:
Transition probabilities: P(s_t | s_{t-1}) -> P(s_t)
Observation probabilities: P(o_t | s_t) -> P(o_t)
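The back-off idea can be sketched as follows. The counts and vocabulary size are made up, and this simplified version switches between levels outright, whereas Bikel et al. interpolate the levels with computed weights:

```python
from collections import Counter

# Back-off sketch with hypothetical counts: prefer the context-conditioned
# estimate, fall back to a less specific one when counts are missing.
bigrams = Counter({("held", "weekly"): 3})
unigrams = Counter({"held": 5, "weekly": 3, "at": 4})
total = sum(unigrams.values())
vocab_size = 10_000  # assumed vocabulary size for the uniform fallback

def backoff_prob(word, prev):
    if bigrams[(prev, word)] > 0:        # P(word | prev) from bigram counts
        return bigrams[(prev, word)] / unigrams[prev]
    if unigrams[word] > 0:               # back off to P(word)
        return unigrams[word] / total
    return 1 / vocab_size                # unseen word: back off to uniform

p1 = backoff_prob("weekly", "held")  # bigram available
p2 = backoff_prob("at", "weekly")    # falls back to unigram
p3 = backoff_prob("SIN2", "at")      # unseen word: uniform
```

Each fallback level is less informative but better estimated, which is exactly the trade-off the slide's arrows describe.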
19 HMM - Experimental Results
Trained on 500k words of newswire text.
[Results table lost in extraction.]
20 Learning HMM for IE
(Seymore, 1999)
Considers labeled, unlabeled, and distantly-labeled data.
21 Some Issues with HMM
- Need to enumerate all possible observation sequences
- Not practical to represent multiple interacting features or long-range dependencies of the observations
- Very strict independence assumptions on the observations
22 Maximum Entropy Markov Models
(Lafferty, 2001)
[Diagram: state chain S_{t-1} -> S_t -> S_{t+1}, with each state conditioned on its observation O_t.]
Features of an observation can include: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; ...
For "Wisniewski": is "Wisniewski"; is part of a noun phrase; ends in "-ski".
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.
Courtesy of William W. Cohen
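The "maxent model" here is a softmax over weighted features of the observation and the previous state. A minimal sketch of one local decision, with invented features and weights:

```python
import math

# MEMM local-classifier sketch: P(s_t | s_{t-1}, o_t) via a softmax over
# feature weights. The features, states, and weights are hypothetical.
weights = {
    ("ends-in-ski", "PERSON"): 2.0,
    ("capitalized", "PERSON"): 1.0,
    ("prev=OTHER", "OTHER"): 0.5,
}

def features(obs, prev_state):
    f = [f"prev={prev_state}"]
    if obs[0].isupper():
        f.append("capitalized")
    if obs.endswith("ski"):
        f.append("ends-in-ski")
    return f

def next_state_dist(obs, prev_state, states=("PERSON", "OTHER")):
    """Distribution over the next state, normalized per state (softmax)."""
    scores = {s: sum(weights.get((f, s), 0.0) for f in features(obs, prev_state))
              for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

dist = next_state_dist("Wisniewski", "OTHER")
```

Unlike an HMM emission model, nothing here requires the features to be independent: "capitalized" and "ends-in-ski" can overlap freely.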
23 MEMM
[Same diagram and feature list as on slide 22.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
Courtesy of William W. Cohen
24 HMM vs. MEMM
[Graphical models: in the HMM, each state S_t generates its observation O_t (arrows point from states to observations); in the MEMM, the arrows are reversed, so each state S_t is conditioned on O_t and S_{t-1}.]
25 Label Bias Problem with MEMM
Consider this MEMM:
Pr(1->2 | ro) = Pr(2 | 1, ro) Pr(1 | ro) = Pr(2 | 1, o) Pr(1 | r)
Pr(1->2 | ri) = Pr(2 | 1, ri) Pr(1 | ri) = Pr(2 | 1, i) Pr(1 | r)
Because state 1 has a single outgoing transition, per-state normalization forces Pr(2 | 1, o) = Pr(2 | 1, i) = 1, so
Pr(1->2 | ro) = Pr(1->2 | ri)
But the two should differ: the observation ("o" vs. "i") ought to affect which path is preferred.
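The effect can be checked numerically. The local conditional tables below are a hypothetical toy MEMM in which state 1 has a single successor, so its row is 1.0 regardless of the observation:

```python
# Label-bias sketch: with per-state normalization, a state with one outgoing
# transition passes all its probability mass along no matter what it observes.
# P_next[(state, observation)] is the local next-state distribution.
P_next = {
    (0, "r"): {1: 0.6, 4: 0.4},  # state 0 branches on "r"
    (1, "o"): {2: 1.0},          # state 1 has only one successor...
    (1, "i"): {2: 1.0},          # ...so the observation cannot matter
}

def path_prob(path, obs):
    """Probability of a state path given observations, one local step at a time."""
    p = 1.0
    for (s, s_next), o in zip(zip(path, path[1:]), obs):
        p *= P_next[(s, o)][s_next]
    return p

p_ro = path_prob([0, 1, 2], "ro")
p_ri = path_prob([0, 1, 2], "ri")
```

The two path probabilities come out identical even though the observations differ, which is precisely the label bias problem; a CRF's global normalization avoids it.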
26 Solving the Label Bias Problem
- Change the state-transition structure of the model
  - Not always practical to change the set of states
- Start with a fully-connected model and let the training procedure figure out a good structure
  - Precludes the use of prior structural knowledge, which is very valuable (e.g., in information extraction)
27 Random Field
[Definition figure lost in extraction.]
Courtesy of Rongkun Shen
28 Conditional Random Field
[Definition figure lost in extraction.]
Courtesy of Rongkun Shen
29 Conditional Distribution
[The CRF conditional distribution: p(y | x) is proportional to the exponential of the summed, weighted edge features f_k and vertex features g_k over the label graph.]
30 Conditional Distribution
- CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
p(y | x) = (1 / Z(x)) exp( sum_{e,k} lambda_k f_k(e, y|_e, x) + sum_{v,k} mu_k g_k(v, y|_v, x) )
where Z(x) is a normalization over all label sequences for the data sequence x.
31 HMM-like CRF
- Define a single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data used to train the CRF:
f_{y',y}(<u,v>, y|_{<u,v>}, x) = 1 if y_u = y' and y_v = y, 0 otherwise
g_{y,x}(v, y|_v, x) = 1 if y_v = y and x_v = x, 0 otherwise
[Chain diagram: labels Y_{t-1}, Y_t, Y_{t+1} over observations X_{t-1}, X_t, X_{t+1}.]
The weights lambda_{y',y} and mu_{y,x} are then equivalent to the logarithms of the HMM transition probability Pr(y | y') and observation probability Pr(x | y).
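This equivalence can be checked directly: set each CRF weight to the log of the corresponding HMM probability, and the unnormalized CRF score of a label sequence matches the HMM joint probability. The two-state, two-symbol model below is a toy example:

```python
import math

# Toy HMM probabilities (invented for illustration).
trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.9, ("A", "y"): 0.1, ("B", "x"): 0.2, ("B", "y"): 0.8}

# CRF weights per the slide: lambda_{y',y} = log Pr(y|y'), mu_{y,x} = log Pr(x|y).
lam = {k: math.log(v) for k, v in trans.items()}
mu = {k: math.log(v) for k, v in emit.items()}

def crf_score(labels, obs):
    """exp of the summed edge-feature and vertex-feature weights."""
    s = sum(mu[(y, x)] for y, x in zip(labels, obs))
    s += sum(lam[(y1, y2)] for y1, y2 in zip(labels, labels[1:]))
    return math.exp(s)

def hmm_joint(labels, obs, start_prob=1.0):
    """HMM joint probability (uniform start handled via start_prob=1)."""
    p = start_prob
    for i, (y, x) in enumerate(zip(labels, obs)):
        if i > 0:
            p *= trans[(labels[i - 1], y)]
        p *= emit[(y, x)]
    return p

score = crf_score(["A", "B"], ["x", "y"])
joint = hmm_joint(["A", "B"], ["x", "y"])
```

The CRF is strictly more general: nothing forces its weights to be logs of normalized probabilities.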
32 HMM-like CRF
- For a chain structure, the conditional probability of a label sequence can be expressed in matrix form.
- For each position i in the observed sequence x, define the matrix
M_i(y', y | x) = exp( sum_k lambda_k f_k(e_i, y|_{e_i}, x) + sum_k mu_k g_k(v_i, y|_{v_i}, x) )
where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.
33 HMM-like CRF
The normalization function is the (start, stop) entry of the product of these matrices:
Z(x) = ( M_1(x) M_2(x) ... M_{n+1}(x) )_{start, stop}
The conditional probability of a label sequence y is
p(y | x) = ( prod_{i=1..n+1} M_i(y_{i-1}, y_i | x) ) / Z(x)
where y_0 = start and y_{n+1} = stop.
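The matrix identity can be verified on a toy chain: multiplying the M_i matrices and reading the (start, stop) entry gives the same Z(x) as summing path scores explicitly. The entries below are arbitrary positive numbers standing in for exp(summed feature weights):

```python
# Matrix-form normalization sketch. Labels are indexed [start, A, B, stop];
# the M_i values are illustrative, not derived from real features.
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

START, A, B, STOP = 0, 1, 2, 3
M1 = [[0, 2.0, 1.0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # start -> {A,B}
M2 = [[0, 0, 0, 0], [0, 1.5, 0.5, 0], [0, 0.5, 2.5, 0], [0, 0, 0, 0]]  # {A,B} -> {A,B}
M3 = [[0, 0, 0, 0], [0, 0, 0, 1.0], [0, 0, 0, 1.0], [0, 0, 0, 0]]  # {A,B} -> stop

# Z(x) as the (start, stop) entry of the matrix product.
Z = matmul(matmul(M1, M2), M3)[START][STOP]

# Brute-force check: sum of path scores over all interior label sequences.
brute = sum(M1[START][a] * M2[a][b] * M3[b][STOP]
            for a in (A, B) for b in (A, B))
```

The matrix product costs O(n |Y|^3) instead of the exponential cost of enumerating label sequences, which is what makes exact CRF training and inference tractable on chains.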
34 Parameter Estimation
The problem: determine the parameters lambda_1, lambda_2, ...; mu_1, mu_2, ... from training data with empirical distribution p~(x, y).
The goal: maximize the log-likelihood objective function
O = sum_{x,y} p~(x, y) log p(y | x)
35 Parameter Estimation: Iterative Scaling Algorithms
Update the weights as lambda_k <- lambda_k + d(lambda_k) and mu_k <- mu_k + d(mu_k) for appropriately chosen d(lambda_k) and d(mu_k).
d(lambda_k) for an edge feature f_k is the solution of
sum_{x,y} p~(x) p(y | x) f_k(x, y) exp( d(lambda_k) T(x, y) ) = E~[f_k]
where T(x, y) is the total feature count of (x, y).
T(x, y) is a global property of (x, y), and efficiently computing the right-hand sides of the above equation is a problem.
36 Algorithm S
Define a slack feature
s(x, y) = S - sum_i sum_k f_k(e_i, y|_{e_i}, x) - sum_i sum_k g_k(v_i, y|_{v_i}, x)
so that T(x, y) becomes the constant S.
For each index i = 0, ..., n+1, define forward vectors
alpha_0(y | x) = 1 if y = start, 0 otherwise;  alpha_i(x) = alpha_{i-1}(x) M_i(x)
and backward vectors
beta_{n+1}(y | x) = 1 if y = stop, 0 otherwise;  beta_i(x) = M_{i+1}(x) beta_{i+1}(x)
37 Algorithm S
[Update equations lost in extraction.]
38 Algorithm S
[Update equations lost in extraction.]
39 Algorithm T
Algorithm T keeps track of partial T totals: it accumulates feature expectations into counters indexed by the value of T(x).
It uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.
40 Experiments
- Modeling the label bias problem
- 2,000 training and 500 test samples generated by an HMM
- CRF error: 4.6%
- MEMM error: 42%
The CRF solves the label bias problem.
41 Experiments
- Modeling mixed-order sources
- The CRF converges in 500 iterations
- The MEMM converges in 100 iterations
42 MEMM vs. HMM
The HMM outperforms the MEMM.
43 CRF vs. MEMM
The CRF usually outperforms the MEMM.
44 CRF vs. HMM
Each open square represents a data set with alpha < 1/2, and a solid square a data set with alpha >= 1/2. When the data is mostly second order (alpha >= 1/2), the discriminatively trained CRF usually outperforms the HMM.
45 POS Tagging Experiments
- First-order HMM, MEMM, and CRF models
- Data set: Penn Treebank
- 50-50 train-test split
- Uses the MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.
46 Interactive IE using CRF
An interactive parser updates IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the IE results.
47 Some IE Tools Available
- MALLET (UMass)
  - statistical natural language processing
  - document classification
  - clustering
  - information extraction
  - other machine learning applications to text
- Sample application
  - GeneTaggerCRF: a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
48 MinorThird
- http://minorthird.sourceforge.net/
- A collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text
- Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
49 GATE
- http://gate.ac.uk/ie/annie.html
- A leading toolkit for text mining
- Distributed with an information extraction component set called ANNIE (demo)
- Used in many research projects
  - A long list can be found on its website
- Being integrated with IBM UIMA
50 Sunita Sarawagi's CRF Package
- http://crf.sourceforge.net/
- A Java implementation of conditional random fields for sequential labeling.
51 UIMA (IBM)
- Unstructured Information Management Architecture
- A platform for building unstructured-information management solutions from combinations of semantic analysis (IE) and search components.
52 Some Interesting Websites Based on IE
- ZoomInfo
- CiteSeer.org (some of us use it every day!)
- Google Local, Google Scholar
- ... and many more