Information Extraction
Transcript and Presenter's Notes
1
Information Extraction
  • Yunyao Li
  • EECS/SI 767
  • 03/29/2006

2
The Problem
[Figure: slots to fill - Date, Time (Start - End), Location, Speaker (a Person)]
3
What is Information Extraction
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels -- the coveted code behind the Windows operating system -- to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying ...

NAME | TITLE | ORGANIZATION
Courtesy of William W. Cohen
4
What is Information Extraction
As a task: filling slots in a database from sub-segments of text.
(Same news excerpt as above.)
IE
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..
Courtesy of William W. Cohen
5
What is Information Extraction
Information Extraction = segmentation + classification + association + clustering
(Same news excerpt as above.)
Segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
aka named entity extraction
Courtesy of William W. Cohen
6
What is Information Extraction
Information Extraction = segmentation + classification + association + clustering
(Same news excerpt and extracted segments as above.)
Courtesy of William W. Cohen
7
What is Information Extraction
Information Extraction = segmentation + classification + association + clustering
(Same news excerpt and extracted segments as above.)
Courtesy of William W. Cohen
8
What is Information Extraction
Information Extraction = segmentation + classification + association + clustering
(Same news excerpt and extracted segments as above.)
Courtesy of William W. Cohen
9
  • Live Example: Seminar

10
Landscape of IE Techniques
[Figure: the Lexicons approach. "Abraham Lincoln was born in Kentucky." Is a segment a member of a list such as Alabama, Alaska, ..., Wisconsin, Wyoming?]
Our focus today!
Courtesy of William W. Cohen
11
Markov Property
S1 = rain, S2 = cloud, S3 = sun
The state of the system at time t+1, q_{t+1}, is conditionally independent of q_{t-1}, q_{t-2}, ..., q_1, q_0 given q_t. In other words, the current state determines the probability distribution of the next state.
[Figure: state-transition diagram over S1, S2, S3 with transition probabilities 1/2, 1/3, 1/2, 2/3, 1]
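Stated as an equation (a restatement of the property above; a_{ij} denotes the probability of moving from state S_i to state S_j):

P(q_{t+1} = S_j | q_t = S_i, q_{t-1}, ..., q_0) = P(q_{t+1} = S_j | q_t = S_i) = a_{ij}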
12
Markov Property
S1 = rain, S2 = cloud, S3 = sun
State-transition probabilities, A
[Figure: state-transition diagram over S1, S2, S3 with transition probabilities 1/2, 1/3, 1/2, 2/3, 1]
Q: Given that today is sunny (i.e., q_1 = S3), what is the probability of sun-cloud under the model?
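A worked answer, written symbolically since the exact arc values are only legible in the original diagram (a_{ij} is the transition probability from S_i to S_j):

P(q_2 = sun, q_3 = cloud | q_1 = sun)
  = P(q_2 = S3 | q_1 = S3) · P(q_3 = S2 | q_2 = S3)
  = a_{33} · a_{32}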
13
Hidden Markov Model
S1 = rain, S2 = cloud, S3 = sun
[Figure: HMM with hidden states S1, S2, S3, transition probabilities 1/2, 1/3, 1/2, 2/3, 1 as before, and arcs to observations with emission probabilities 1/10, 9/10, 4/5, 1/5, 7/10, 3/10]
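For reference, the standard ingredients of an HMM that the figure illustrates (a generic restatement, not the slide's specific numbers):

  - a set of hidden states S_1, ..., S_N (here rain, cloud, sun)
  - state-transition probabilities a_{ij} = P(q_{t+1} = S_j | q_t = S_i)
  - emission probabilities b_j(o) = P(o_t = o | q_t = S_j) for each observation o
  - an initial state distribution π_i = P(q_1 = S_i)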
14
IE with Hidden Markov Model
Given a sequence of observations:
SI/EECS 767 is held weekly at SIN2 .
and a trained HMM with states for course name, location name, and background,
find the most likely state sequence (Viterbi):
SI/EECS 767 is held weekly at SIN2
Any words said to be generated by the designated "course name" state are extracted as a course name.
Course name: SI/EECS 767
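A minimal Viterbi decoding sketch in Python. The state set, start/transition scores, and the emission-score matrix passed in are illustrative placeholders, not the trained HMM from the talk:

  import numpy as np

  states = ["background", "course_name", "location_name"]  # assumed toy state set
  log_start = np.log([0.8, 0.1, 0.1])                       # assumed initial probabilities
  log_trans = np.log([[0.8, 0.1, 0.1],                      # assumed transition probabilities
                      [0.3, 0.6, 0.1],
                      [0.3, 0.1, 0.6]])

  def viterbi(log_emit):
      # log_emit[t, s] = log P(word at position t | state s); returns the most likely state sequence
      T, S = log_emit.shape
      delta = np.full((T, S), -np.inf)    # best log-score of any path ending in state s at step t
      back = np.zeros((T, S), dtype=int)  # backpointers
      delta[0] = log_start + log_emit[0]
      for t in range(1, T):
          scores = delta[t - 1][:, None] + log_trans + log_emit[t][None, :]
          back[t] = scores.argmax(axis=0)
          delta[t] = scores.max(axis=0)
      path = [int(delta[-1].argmax())]
      for t in range(T - 1, 0, -1):
          path.append(int(back[t, path[-1]]))
      return [states[s] for s in reversed(path)]

  # Words decoded as "course_name" are then concatenated and reported as the extracted course name.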
15
Named Entity Extraction
(Bikel et al., 1998)
Hidden states
16
Named Entity Extraction
Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
(1) Generating the first word of a name class
(2) Generating the rest of the words in the name class
(3) Generating the end of a name class
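A rough sketch of how the three generation events combine for one name-class span, following the description in Bikel et al. (1998); word-feature pairs and other details of the original model are omitted, and o^(t)_j denotes the j-th word emitted in state s_t:

P(s_t | s_{t-1}, o_{t-1})                          (choose the next name class)
  · P(o^(t)_1 | s_t, s_{t-1})                      (generate its first word)
  · Π_{j=2..m} P(o^(t)_j | s_t, o^(t)_{j-1})       (generate the remaining words)
  · P(end | s_t, o^(t)_m)                          (generate the end of the name class)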
17
Training: Estimating Probabilities
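These probabilities are estimated from annotated training data by maximum-likelihood counts; a sketch in the notation of the previous slides, where c(·) denotes an event count:

P(s_t | s_{t-1}, o_{t-1}) ≈ c(s_t, s_{t-1}, o_{t-1}) / c(s_{t-1}, o_{t-1})
P(o_t | s_t, s_{t-1})     ≈ c(o_t, s_t, s_{t-1}) / c(s_t, s_{t-1})
P(o_t | s_t, o_{t-1})     ≈ c(o_t, s_t, o_{t-1}) / c(s_t, o_{t-1})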
18
Back-Off
To handle unknown words and insufficient training data, back off to less specific models:
Transition probabilities: P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t), then P(o_t)
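One common way to combine the back-off levels is linear interpolation, sketched below; Bikel et al. (1998) compute the mixing weights from training counts rather than fixing them, so the λ values here are purely illustrative:

P_smoothed(o_t | s_t, s_{t-1}) = λ_1 P(o_t | s_t, s_{t-1}) + λ_2 P(o_t | s_t) + λ_3 P(o_t),  with λ_1 + λ_2 + λ_3 = 1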
19
HMM: Experimental Results
Trained on 500k words of newswire text.
Results
20
Learning HMM for IE
(Seymore et al., 1999)
Considers labeled, unlabeled, and distantly-labeled data.
21
Some Issues with HMM
  • Need to enumerate all possible observation
    sequences
  • Not practical to represent multiple interacting
    features or long-range dependencies of the
    observations
  • Very strict independence assumptions on the
    observations

22
Maximum Entropy Markov Models
(Lafferty et al., 2001)
[Figure: a chain of states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1}. Example observation features: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor. E.g., for the observation "Wisniewski": part of noun phrase, ends in "-ski".]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.
Courtesy of William W. Cohen
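In symbols, the per-state exponential (maxent) model that replaces the HMM's generative components can be written as follows (a standard MEMM formulation, restated here rather than copied from the slide):

P(s_t | s_{t-1}, o_t) = (1 / Z(o_t, s_{t-1})) exp( Σ_k λ_k f_k(o_t, s_t) )

where Z(o_t, s_{t-1}) normalizes over the possible current states s_t.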
23
MEMM
[Same figure as the previous slide.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
Courtesy of William W. Cohen
24
HMM vs. MEMM
[Figure: graphical models for the HMM and the MEMM over states S_{t-1}, S_t, S_{t+1}, ... and observations O_{t-1}, O_t, O_{t+1}, ...]
25
Label Bias Problem with MEMM
Consider this MEMM:
Pr(1→2 | ro) = Pr(2 | 1, ro) Pr(1 | ro) = Pr(2 | 1, o) Pr(1 | r)
Pr(1→2 | ri) = Pr(2 | 1, ri) Pr(1 | ri) = Pr(2 | 1, i) Pr(1 | r)
Because state 1 has only one outgoing transition, per-state normalization forces Pr(2 | 1, o) = Pr(2 | 1, i) = 1, so
Pr(1→2 | ro) = Pr(1→2 | ri)
But it should be that Pr(1→2 | ro) ≠ Pr(1→2 | ri).
26
Solve the Label Bias Problem
  • Change the state-transition structure of the model
  • Not always practical to change the set of states
  • Start with a fully-connected model and let the training procedure figure out a good structure
  • Precludes the use of prior structural knowledge, which is very valuable (e.g., in information extraction)

27
Random Field
Courtesy of Rongkun Shen
28
Conditional Random Field
Courtesy of Rongkun Shen
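For reference, the defining property of a conditional random field (Lafferty et al., 2001): given a graph G = (V, E) whose vertices index the label variables Y_v, (X, Y) is a conditional random field when, conditioned on the observation X, the labels obey the Markov property with respect to the graph:

P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ∼ v)

where w ∼ v means that w and v are neighbors in G.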
29
Conditional Distribution
30
Conditional Distribution
  • CRFs use the observation-dependent
    normalization Z(x) for the conditional
    distributions

Z(x) is a normalization over the data sequence x
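Written out (restated from Lafferty et al., 2001, with edge features f_k and vertex features g_k):

p_θ(y | x) = (1 / Z(x)) exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} µ_k g_k(v, y|_v, x) )

where Z(x) sums the same exponential term over all label assignments y.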
31
HMM-like CRF
  • A single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data is used to train the CRF:

f_{y',y}(y_u, y_v, x) = 1 if y_u = y' and y_v = y, and 0 otherwise
g_{y,x}(y_v, x_v) = 1 if y_v = y and x_v = x, and 0 otherwise

[Figure: linear-chain CRF over labels Y_{t-1}, Y_t, Y_{t+1}, ... and observations X_{t-1}, X_t, X_{t+1}, ...]

λ_{y',y} and µ_{y,x} are then equivalent to the logarithms of the HMM transition probability Pr(y' → y) and observation probability Pr(x | y).
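With these indicator features and weights λ_{y',y} = log Pr(y' → y) and µ_{y,x} = log Pr(x | y), the CRF score of a chain labeling reduces to the HMM joint probability (a quick check in simplified notation):

exp( Σ_t λ_{y_{t-1}, y_t} + Σ_t µ_{y_t, x_t} ) = Π_t Pr(y_t | y_{t-1}) Pr(x_t | y_t)

so normalizing over all label sequences recovers the HMM's conditional distribution p(y | x).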
32
HMM-like CRF
  • For a chain structure, the conditional probability of a label sequence can be expressed in matrix form.
  • For each position i in the observed sequence x, define a matrix M_i(y', y | x), where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.
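The matrix referred to above, restated from Lafferty et al. (2001) in the notation of the previous slides:

M_i(y', y | x) = exp( Σ_k λ_k f_k(e_i, (y', y), x) + Σ_k µ_k g_k(v_i, y, x) )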
33
HMM-like CRF
The normalization function Z(x) is the (start, stop) entry of the product of these matrices.
The conditional probability of a label sequence y is the product of the corresponding matrix entries, normalized by Z(x),
where y_0 = start and y_{n+1} = stop.
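In symbols (restated from Lafferty et al., 2001):

Z(x) = [ M_1(x) M_2(x) ··· M_{n+1}(x) ]_{start, stop}

p(y | x) = (1 / Z(x)) Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x)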
34
Parameter Estimation
The problem: determine the parameters (the feature weights λ_k and µ_k) from training data with empirical distribution p̃(x, y).
The goal: maximize the log-likelihood objective function.
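The objective function, restated from Lafferty et al. (2001), where p̃ is the empirical distribution of the training data:

O(θ) = Σ_{x,y} p̃(x, y) log p_θ(y | x)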
35
Parameter Estimation: Iterative Scaling Algorithms
Update the weights as λ_k ← λ_k + δλ_k and µ_k ← µ_k + δµ_k for appropriately chosen δλ_k and δµ_k.
For an edge feature f_k, δλ_k is the solution of an equation whose right-hand side involves T(x, y), the total feature count of (x, y).
T(x, y) is a global property of (x, y), and efficiently computing the right-hand side of this equation is a problem.
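For reference, the edge-feature update equation and the definition of the total feature count T, restated from Lafferty et al. (2001):

Σ_{x,y} p̃(x, y) Σ_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)
    = Σ_x p̃(x) Σ_y p_θ(y | x) Σ_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x) exp( δλ_k T(x, y) )

T(x, y) = Σ_{i,k} f_k(e_i, y|_{e_i}, x) + Σ_{i,k} g_k(v_i, y|_{v_i}, x)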
36
Algorithm S
Define a slack feature s(x, y).
For each index i = 0, ..., n+1 we define forward vectors α_i(x)
and backward vectors β_i(x).
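Restated from Lafferty et al. (2001), with the constant S chosen so that s(x, y) ≥ 0 for every training pair, which makes T(x, y) constant:

s(x, y) = S − Σ_i Σ_k f_k(e_i, y|_{e_i}, x) − Σ_i Σ_k g_k(v_i, y|_{v_i}, x)

α_0(y | x) = 1 if y = start, 0 otherwise;      α_i(x) = α_{i-1}(x) M_i(x)
β_{n+1}(y | x) = 1 if y = stop, 0 otherwise;   β_i(x)^T = M_{i+1}(x) β_{i+1}(x)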
37
Algorithm S



38
Algorithm S
39
Algorithm T
Keeps track of partial T totals: it accumulates feature expectations into counters indexed by T(x).
Uses forward-backward recurrences to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k, given that T(x) = t.
40
Experiments
  • Modeling the label bias problem
  • 2000 training and 500 test samples generated by an HMM
  • CRF error is 4.6%
  • MEMM error is 42%

The CRF solves the label bias problem.
41
Experiments
  • Modeling mixed-order sources
  • The CRF converges in 500 iterations
  • The MEMM converges in 100 iterations

42
MEMM vs. HMM
The HMM outperforms the MEMM
43
CRF vs. MEMM
The CRF usually outperforms the MEMM
44
CRF vs. HMM
Each open square represents a data set with α < 1/2, and a solid square indicates a data set with α ≥ 1/2. When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM.
45
POS Tagging Experiments
  • First-order HMM, MEMM, and CRF models
  • Data set: Penn Treebank
  • 50-50 train-test split
  • Uses the MEMM parameter vector as a starting point for training the corresponding CRF, to accelerate convergence.

46
Interactive IE using CRF
An interactive parser updates the IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the extraction.
47
Some Available IE Tools
  • MALLET (UMass)
  • statistical natural language processing,
  • document classification,
  • clustering,
  • information extraction,
  • other machine learning applications to text
  • Sample application:
  • GeneTaggerCRF: a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.

48
MinorThird
  • http://minorthird.sourceforge.net/
  • a collection of Java classes for storing text,
    annotating text, and learning to extract entities
    and categorize text
  • Stored documents can be annotated in independent
    files using TextLabels (denoting, say,
    part-of-speech and semantic information)

49
GATE
  • http://gate.ac.uk/ie/annie.html
  • a leading toolkit for text mining
  • distributed with an information extraction component set called ANNIE (demo)
  • used in many research projects
  • a long list can be found on its website
  • integration with IBM UIMA is underway

50
Sunita Sarawagi's CRF package
  • http://crf.sourceforge.net/
  • A Java implementation of conditional random
    fields for sequential labeling.

51
UIMA (IBM)
  • Unstructured Information Management Architecture.
  • A platform for unstructured information
    management solutions from combinations of
    semantic analysis (IE) and search components.

52
Some Interesting Websites Based on IE
  • ZoomInfo
  • CiteSeer.org (some of us use it every day!)
  • Google Local, Google Scholar
  • and many more