Title: Information Extraction
1 Information Extraction
- Yunyao Li
- EECS /SI 767
- 03/29/2006
2 The Problem
- Date
- Time (Start - End)
- Location
- Speaker
- Person
3 What is Information Extraction
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered, saying ...
NAME | TITLE | ORGANIZATION (empty slot table)
Courtesy of William W. Cohen
4 What is Information Extraction
As a task: filling slots in a database from sub-segments of text.
(The same news passage as on slide 3.)
IE
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Software Foundation
Courtesy of William W. Cohen
5 What is Information Extraction
Information Extraction = segmentation + classification + association + clustering
(The same news passage as on slide 3.)
Segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
aka named entity extraction
Courtesy of William W. Cohen
6-8 What is Information Extraction
(Slides 6-8 repeat the passage and extracted segments from slide 5, highlighting the remaining sub-tasks: classification, association, and clustering.)
Courtesy of William W. Cohen
10 Landscape of IE Techniques
Lexicons: check whether a candidate segment is a member of a list.
Example: "Abraham Lincoln was born in Kentucky." - is "Kentucky" a member of {Alabama, Alaska, ..., Wisconsin, Wyoming}?
Our focus today: statistical sequence models (HMM, MEMM, CRF).
Courtesy of William W. Cohen
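The lexicon approach above amounts to a membership test per token. A minimal sketch, with a hypothetical lexicon and label set:

```python
# Lexicon-based tagging sketch: the lexicon, tokenization, and "LOC"/"O"
# labels are illustrative, not part of any particular IE system.
US_STATES = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}

def tag_with_lexicon(tokens, lexicon):
    """Label a token 'LOC' if it is a member of the lexicon, else 'O'."""
    return [(t, "LOC" if t in lexicon else "O") for t in tokens]

tags = tag_with_lexicon("Abraham Lincoln was born in Kentucky .".split(), US_STATES)
```

This catches "Kentucky" but nothing else; the rest of the deck shows why sequence models are a better fit for slots that a fixed list cannot enumerate.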
11 Markov Property
S1 = rain, S2 = cloud, S3 = sun
The state of a system at time t+1, q_{t+1}, is conditionally independent of q_{t-1}, q_{t-2}, ..., q_1, q_0 given q_t. In other words, the current state alone determines the probability distribution of the next state.
[State-transition diagram over S1, S2, S3; the edge probabilities (1/2, 1/3, 1/2, 2/3, 1) could not be matched to edges after extraction.]
12 Markov Property
S1 = rain, S2 = cloud, S3 = sun
[The same state-transition diagram, with state-transition probabilities A; edge values (1/2, 1/3, 1/2, 2/3, 1) garbled in extraction.]
Q: given that today is sunny (i.e., q_1 = S3), what is the probability of sun -> cloud under the model?
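Under the Markov property, the answer to a question like this is read off the transition matrix. A sketch with an assumed matrix (the exact numbers on the slide's diagram did not survive extraction, so these are illustrative):

```python
# Markov-chain sketch. The transition probabilities below are assumed for
# illustration; each row is a distribution over next states.
A = {
    "rain":  {"rain": 0.0, "cloud": 1.0, "sun": 0.0},
    "cloud": {"rain": 1/3, "cloud": 0.0, "sun": 2/3},
    "sun":   {"rain": 0.0, "cloud": 1/2, "sun": 1/2},
}

def path_probability(path, A):
    """P(q_2, ..., q_n | q_1) under the Markov property: a product of
    one-step transition probabilities."""
    p = 1.0
    for prev, cur in zip(path, path[1:]):
        p *= A[prev][cur]
    return p

p = path_probability(["sun", "cloud"], A)  # P(cloud tomorrow | sun today)
```

With this matrix the answer is simply A["sun"]["cloud"]; longer sequences multiply one factor per step.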
13 Hidden Markov Model
S1 = rain, S2 = cloud, S3 = sun
[The same state-transition diagram, now with observation (emission) probabilities attached to each state; the values (1/10, 9/10, 4/5, 7/10, 3/10, 1/5) could not be matched to states after extraction.]
14 IE with Hidden Markov Model
Given a sequence of observations:
SI/EECS 767 is held weekly at SIN2.
and a trained HMM with states for course name, location name, and background,
find the most likely state sequence (Viterbi):
SI/EECS 767 is held weekly at SIN2
Any words generated by the designated "course name" state are extracted as a course name:
Course name: SI/EECS 767
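The decoding step can be sketched as a tiny Viterbi decoder. The states, transition, and emission probabilities below are invented for illustration (two states instead of three, a three-word input), not a trained model:

```python
import math

# Toy HMM for Viterbi decoding. All probabilities are made up; a real model
# would be estimated from labeled announcements.
states = ["course", "background"]
start = {"course": 0.3, "background": 0.7}
trans = {"course": {"course": 0.6, "background": 0.4},
         "background": {"course": 0.1, "background": 0.9}}
emit = {"course": {"SI/EECS": 0.5, "767": 0.4, "is": 0.1},
        "background": {"SI/EECS": 0.05, "767": 0.05, "is": 0.9}}

def viterbi(obs):
    """Most likely state sequence for obs, computed in log-space."""
    V = {s: (math.log(start[s]) + math.log(emit[s][obs[0]]), [s]) for s in states}
    for o in obs[1:]:
        V_new = {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: V[p][0] + math.log(trans[p][s]))
            score, path = V[prev]
            V_new[s] = (score + math.log(trans[prev][s]) + math.log(emit[s][o]),
                        path + [s])
        V = V_new
    return max(V.values())[1]

path = viterbi(["SI/EECS", "767", "is"])
```

The tokens decoded as "course" would then be extracted as the course name, as on the slide.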
15 Named Entity Extraction
(Bikel et al., 1998)
Hidden states: the name classes. [Diagram lost in extraction.]
16 Named Entity Extraction
Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
(1) Generating the first word of a name class
(2) Generating the rest of the words in the name class
(3) Generating the end of a name class
17 Training: Estimating Probabilities
18 Back-Off
To handle unknown words and insufficient training data, back off to less specific distributions:
Transition probabilities: P(s_t | s_{t-1}) -> P(s_t)
Observation probabilities: P(o_t | s_t) -> P(o_t)
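The back-off idea can be sketched as follows. The counts and vocabulary size are made up, and this simplified version switches between levels outright, whereas Bikel et al. interpolate the levels with computed weights:

```python
from collections import Counter

# Back-off sketch with hypothetical counts: prefer the context-conditioned
# estimate, fall back to a less specific one when counts are missing.
bigrams = Counter({("held", "weekly"): 3})
unigrams = Counter({"held": 5, "weekly": 3, "at": 4})
total = sum(unigrams.values())
vocab_size = 10_000  # assumed vocabulary size for the uniform fallback

def backoff_prob(word, prev):
    if bigrams[(prev, word)] > 0:        # P(word | prev) from bigram counts
        return bigrams[(prev, word)] / unigrams[prev]
    if unigrams[word] > 0:               # back off to P(word)
        return unigrams[word] / total
    return 1 / vocab_size                # unseen word: back off to uniform

p1 = backoff_prob("weekly", "held")  # bigram available
p2 = backoff_prob("at", "weekly")    # falls back to unigram
p3 = backoff_prob("SIN2", "at")      # unseen word: uniform
```

Each fallback level is less informative but better estimated, which is exactly the trade-off the slide's arrows describe.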
19 HMM - Experimental Results
Trained on 500k words of newswire text.
[Results table lost in extraction.]
20 Learning HMM for IE
(Seymore, 1999)
Considers labeled, unlabeled, and distantly-labeled data.
21 Some Issues with HMM
- Need to enumerate all possible observation sequences
- Not practical to represent multiple interacting features or long-range dependencies of the observations
- Very strict independence assumptions on the observations
22 Maximum Entropy Markov Models
(Lafferty, 2001)
[Diagram: state chain S_{t-1} -> S_t -> S_{t+1}, with each state conditioned on its observation O_t.]
Features of an observation can include: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; ...
For "Wisniewski": is "Wisniewski"; is part of a noun phrase; ends in "-ski".
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.
Courtesy of William W. Cohen
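The "maxent model" here is a softmax over weighted features of the observation and the previous state. A minimal sketch of one local decision, with invented features and weights:

```python
import math

# MEMM local-classifier sketch: P(s_t | s_{t-1}, o_t) via a softmax over
# feature weights. The features, states, and weights are hypothetical.
weights = {
    ("ends-in-ski", "PERSON"): 2.0,
    ("capitalized", "PERSON"): 1.0,
    ("prev=OTHER", "OTHER"): 0.5,
}

def features(obs, prev_state):
    f = [f"prev={prev_state}"]
    if obs[0].isupper():
        f.append("capitalized")
    if obs.endswith("ski"):
        f.append("ends-in-ski")
    return f

def next_state_dist(obs, prev_state, states=("PERSON", "OTHER")):
    """Distribution over the next state, normalized per state (softmax)."""
    scores = {s: sum(weights.get((f, s), 0.0) for f in features(obs, prev_state))
              for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

dist = next_state_dist("Wisniewski", "OTHER")
```

Unlike an HMM emission model, nothing here requires the features to be independent: "capitalized" and "ends-in-ski" can overlap freely.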
23 MEMM
[Same diagram and feature list as on slide 22.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
Courtesy of William W. Cohen
24 HMM vs. MEMM
[Graphical models: in the HMM, each state S_t generates its observation O_t (arrows point from states to observations); in the MEMM, the arrows are reversed, so each state S_t is conditioned on O_t and S_{t-1}.]
25 Label Bias Problem with MEMM
Consider this MEMM:
Pr(1->2 | ro) = Pr(2 | 1, ro) Pr(1 | ro) = Pr(2 | 1, o) Pr(1 | r)
Pr(1->2 | ri) = Pr(2 | 1, ri) Pr(1 | ri) = Pr(2 | 1, i) Pr(1 | r)
Because state 1 has a single outgoing transition, per-state normalization forces Pr(2 | 1, o) = Pr(2 | 1, i) = 1, so
Pr(1->2 | ro) = Pr(1->2 | ri)
But the two should differ: the observation ("o" vs. "i") ought to affect which path is preferred.
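The effect can be checked numerically. The local conditional tables below are a hypothetical toy MEMM in which state 1 has a single successor, so its row is 1.0 regardless of the observation:

```python
# Label-bias sketch: with per-state normalization, a state with one outgoing
# transition passes all its probability mass along no matter what it observes.
# P_next[(state, observation)] is the local next-state distribution.
P_next = {
    (0, "r"): {1: 0.6, 4: 0.4},  # state 0 branches on "r"
    (1, "o"): {2: 1.0},          # state 1 has only one successor...
    (1, "i"): {2: 1.0},          # ...so the observation cannot matter
}

def path_prob(path, obs):
    """Probability of a state path given observations, one local step at a time."""
    p = 1.0
    for (s, s_next), o in zip(zip(path, path[1:]), obs):
        p *= P_next[(s, o)][s_next]
    return p

p_ro = path_prob([0, 1, 2], "ro")
p_ri = path_prob([0, 1, 2], "ri")
```

The two path probabilities come out identical even though the observations differ, which is precisely the label bias problem; a CRF's global normalization avoids it.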
26 Solving the Label Bias Problem
- Change the state-transition structure of the model
  - Not always practical to change the set of states
- Start with a fully-connected model and let the training procedure figure out a good structure
  - Precludes the use of prior structural knowledge, which is very valuable (e.g., in information extraction)
27 Random Field
[Definition figure lost in extraction.]
Courtesy of Rongkun Shen
28 Conditional Random Field
[Definition figure lost in extraction.]
Courtesy of Rongkun Shen
29 Conditional Distribution
[The CRF conditional distribution: p(y | x) is proportional to the exponential of the summed, weighted edge features f_k and vertex features g_k over the label graph.]
30 Conditional Distribution
- CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
p(y | x) = (1 / Z(x)) exp( sum_{e,k} lambda_k f_k(e, y|_e, x) + sum_{v,k} mu_k g_k(v, y|_v, x) )
where Z(x) is a normalization over all label sequences for the data sequence x.
31 HMM-like CRF
- Define a single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data used to train the CRF:
f_{y',y}(<u,v>, y|_{<u,v>}, x) = 1 if y_u = y' and y_v = y, 0 otherwise
g_{y,x}(v, y|_v, x) = 1 if y_v = y and x_v = x, 0 otherwise
[Chain diagram: labels Y_{t-1}, Y_t, Y_{t+1} over observations X_{t-1}, X_t, X_{t+1}.]
The weights lambda_{y',y} and mu_{y,x} are then equivalent to the logarithms of the HMM transition probability Pr(y | y') and observation probability Pr(x | y).
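This equivalence can be checked directly: set each CRF weight to the log of the corresponding HMM probability, and the unnormalized CRF score of a label sequence matches the HMM joint probability. The two-state, two-symbol model below is a toy example:

```python
import math

# Toy HMM probabilities (invented for illustration).
trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.9, ("A", "y"): 0.1, ("B", "x"): 0.2, ("B", "y"): 0.8}

# CRF weights per the slide: lambda_{y',y} = log Pr(y|y'), mu_{y,x} = log Pr(x|y).
lam = {k: math.log(v) for k, v in trans.items()}
mu = {k: math.log(v) for k, v in emit.items()}

def crf_score(labels, obs):
    """exp of the summed edge-feature and vertex-feature weights."""
    s = sum(mu[(y, x)] for y, x in zip(labels, obs))
    s += sum(lam[(y1, y2)] for y1, y2 in zip(labels, labels[1:]))
    return math.exp(s)

def hmm_joint(labels, obs, start_prob=1.0):
    """HMM joint probability (uniform start handled via start_prob=1)."""
    p = start_prob
    for i, (y, x) in enumerate(zip(labels, obs)):
        if i > 0:
            p *= trans[(labels[i - 1], y)]
        p *= emit[(y, x)]
    return p

score = crf_score(["A", "B"], ["x", "y"])
joint = hmm_joint(["A", "B"], ["x", "y"])
```

The CRF is strictly more general: nothing forces its weights to be logs of normalized probabilities.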
32 HMM-like CRF
- For a chain structure, the conditional probability of a label sequence can be expressed in matrix form.
- For each position i in the observed sequence x, define the matrix
M_i(y', y | x) = exp( sum_k lambda_k f_k(e_i, y|_{e_i}, x) + sum_k mu_k g_k(v_i, y|_{v_i}, x) )
where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.
33 HMM-like CRF
The normalization function is the (start, stop) entry of the product of these matrices:
Z(x) = ( M_1(x) M_2(x) ... M_{n+1}(x) )_{start, stop}
The conditional probability of a label sequence y is
p(y | x) = ( prod_{i=1..n+1} M_i(y_{i-1}, y_i | x) ) / Z(x)
where y_0 = start and y_{n+1} = stop.
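The matrix identity can be verified on a toy chain: multiplying the M_i matrices and reading the (start, stop) entry gives the same Z(x) as summing path scores explicitly. The entries below are arbitrary positive numbers standing in for exp(summed feature weights):

```python
# Matrix-form normalization sketch. Labels are indexed [start, A, B, stop];
# the M_i values are illustrative, not derived from real features.
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

START, A, B, STOP = 0, 1, 2, 3
M1 = [[0, 2.0, 1.0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # start -> {A,B}
M2 = [[0, 0, 0, 0], [0, 1.5, 0.5, 0], [0, 0.5, 2.5, 0], [0, 0, 0, 0]]  # {A,B} -> {A,B}
M3 = [[0, 0, 0, 0], [0, 0, 0, 1.0], [0, 0, 0, 1.0], [0, 0, 0, 0]]  # {A,B} -> stop

# Z(x) as the (start, stop) entry of the matrix product.
Z = matmul(matmul(M1, M2), M3)[START][STOP]

# Brute-force check: sum of path scores over all interior label sequences.
brute = sum(M1[START][a] * M2[a][b] * M3[b][STOP]
            for a in (A, B) for b in (A, B))
```

The matrix product costs O(n |Y|^3) instead of the exponential cost of enumerating label sequences, which is what makes exact CRF training and inference tractable on chains.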
34 Parameter Estimation
The problem: determine the parameters lambda_1, lambda_2, ...; mu_1, mu_2, ... from training data with empirical distribution p~(x, y).
The goal: maximize the log-likelihood objective function
O = sum_{x,y} p~(x, y) log p(y | x)
35 Parameter Estimation: Iterative Scaling Algorithms
Update the weights as lambda_k <- lambda_k + d(lambda_k) and mu_k <- mu_k + d(mu_k) for appropriately chosen d(lambda_k) and d(mu_k).
d(lambda_k) for an edge feature f_k is the solution of
sum_{x,y} p~(x) p(y | x) f_k(x, y) exp( d(lambda_k) T(x, y) ) = E~[f_k]
where T(x, y) is the total feature count of (x, y).
T(x, y) is a global property of (x, y), and efficiently computing the right-hand sides of the above equation is a problem.
36 Algorithm S
Define a slack feature
s(x, y) = S - sum_i sum_k f_k(e_i, y|_{e_i}, x) - sum_i sum_k g_k(v_i, y|_{v_i}, x)
so that T(x, y) becomes the constant S.
For each index i = 0, ..., n+1, define forward vectors
alpha_0(y | x) = 1 if y = start, 0 otherwise;  alpha_i(x) = alpha_{i-1}(x) M_i(x)
and backward vectors
beta_{n+1}(y | x) = 1 if y = stop, 0 otherwise;  beta_i(x) = M_{i+1}(x) beta_{i+1}(x)
37 Algorithm S
[Update equations lost in extraction.]
38 Algorithm S
[Update equations lost in extraction.]
39 Algorithm T
Algorithm T keeps track of partial T totals: it accumulates feature expectations into counters indexed by the value of T(x).
It uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.
40 Experiments
- Modeling the label bias problem
- 2,000 training and 500 test samples generated by an HMM
- CRF error: 4.6%
- MEMM error: 42%
The CRF solves the label bias problem.
41 Experiments
- Modeling mixed-order sources
- The CRF converges in 500 iterations
- The MEMM converges in 100 iterations
42 MEMM vs. HMM
The HMM outperforms the MEMM.
43 CRF vs. MEMM
The CRF usually outperforms the MEMM.
44 CRF vs. HMM
Each open square represents a data set with alpha < 1/2, and a solid square a data set with alpha >= 1/2. When the data is mostly second order (alpha >= 1/2), the discriminatively trained CRF usually outperforms the HMM.
45 POS Tagging Experiments
- First-order HMM, MEMM, and CRF models
- Data set: Penn Treebank
- 50-50 train-test split
- Uses the MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.
46 Interactive IE using CRF
An interactive parser updates IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the IE results.
47 Some IE Tools Available
- MALLET (UMass)
  - statistical natural language processing
  - document classification
  - clustering
  - information extraction
  - other machine learning applications to text
- Sample application
  - GeneTaggerCRF: a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
48 MinorThird
- http://minorthird.sourceforge.net/
- A collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text
- Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
49 GATE
- http://gate.ac.uk/ie/annie.html
- A leading toolkit for text mining
- Distributed with an information extraction component set called ANNIE (demo)
- Used in many research projects
  - A long list can be found on its website
- Being integrated with IBM UIMA
50 Sunita Sarawagi's CRF Package
- http://crf.sourceforge.net/
- A Java implementation of conditional random fields for sequential labeling.
51 UIMA (IBM)
- Unstructured Information Management Architecture
- A platform for building unstructured-information management solutions from combinations of semantic analysis (IE) and search components.
52 Some Interesting Websites Based on IE
- ZoomInfo
- CiteSeer.org (some of us use it every day!)
- Google Local, Google Scholar
- ... and many more