Title: Information Extraction using HMMs
1. Information Extraction using HMMs
2. IE by text segmentation
- Source: a concatenation of structured elements with limited reordering and some missing fields
- Examples: addresses, bibliographic records
Example address segmented into fields:
4089 (House number) Whispering Pines (Building) Nobel Drive (Road) San Diego (City) CA (State) 92122 (Zip)
Example bibliographic record:
P. P. Wangikar, T. P. Graycar, D. A. Estell, D. S. Clark, J. S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
3. Hidden Markov Models
- Doubly stochastic models
- Efficient dynamic programming algorithms exist for:
  - Finding Pr(S)
  - The highest-probability path P that maximizes Pr(S, P) (Viterbi)
  - Training the model (Baum-Welch algorithm)
[Figure: four-state HMM (S1-S4) with transition probabilities on the arcs and a per-state emission distribution over the symbols A and C]
4. Input features
- Content of the element
  - Specific keywords like street, zip, vol, pp
  - Properties of words, like capitalization, part of speech, whether it is a number
- Inter-element sequencing
- Intra-element sequencing
- Element length
- External database
  - Dictionary words
  - Semantic relationships between words
- Frequency constraints
5. IE with Hidden Markov Models
- Probabilistic models for IE
[Figure: HMM whose states correspond to citation elements such as Title, Author, Journal, Year]
6. HMM Structure
- Naïve model: one state per element
- Nested model: each element is itself another HMM
7. Comparing nested models
- Naïve: single state per tag
  - Element length distribution: geometric (a, a², a³, ...)
  - Intra-tag sequencing not captured
- Chain
  - Element length distribution: each length gets its own parameter
  - Intra-tag sequencing captured
  - Arbitrary mixing of dictionary words, e.g. "California York"
  - Pr(W|L) not modeled well
- Parallel path
  - Element length distribution: each length gets a parameter
  - Separates the vocabulary of different-length elements (limited bigram model)
8. Embedding an HMM in a state
9. Bigram model of Bikel et al.
- Each inner model is a detailed bigram model
- First word conditioned on the state and the previous state
- Subsequent words conditioned on the previous word and the state
- Special start and end symbols that can be thought of as words
- Large number of parameters (training data on the order of 60,000 words in the smallest experiment)
- Backing-off mechanism to simpler parent models (lambda parameters control the mixing)
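The slides do not give the backoff formula itself; a common way to write this kind of lambda-weighted interpolation (a sketch, not necessarily Bikel et al.'s exact parameterization) is

  P(w_i \mid w_{i-1}, s) = \lambda \, \hat{P}_{\text{bigram}}(w_i \mid w_{i-1}, s) + (1 - \lambda) \, \hat{P}_{\text{parent}}(w_i \mid s)

where the parent model can itself back off further (e.g. to a uniform distribution over the vocabulary), and each lambda reflects how much training data supports the more detailed model.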
10. Separate HMM per tag
- Special prefix and suffix states to capture the start and end of a tag
[Figure: inner HMM for the Road name tag, with Prefix and Suffix states surrounding the inner emitting states]
11. HMM Dictionary
- For each word (feature), associate the probability of emitting that word
- Multinomial model
- Features of a word, for example:
  - part of speech
  - capitalized or not
  - type: number, letter, word, etc.
- Maximum entropy models (McCallum 2000) and other exponential models
- Bikel: <word, feature> pairs
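As a rough illustration of such a dictionary over <word, feature> pairs, here is a minimal Python sketch; the feature function and all names below are illustrative, not taken from Bikel et al.:

from collections import Counter

def emission_dictionary(tokens_with_states):
    # Multinomial emission dictionary per state over <word, feature> pairs.
    # The feature function is an illustrative stand-in.
    def feature(word):
        if word.isdigit():
            return "NUMBER"
        if word[:1].isupper():
            return "CAPITALIZED"
        return "LOWER"

    counts = {}
    for word, state in tokens_with_states:
        counts.setdefault(state, Counter())[(word.lower(), feature(word))] += 1
    return {
        state: {pair: c / sum(ctr.values()) for pair, c in ctr.items()}
        for state, ctr in counts.items()
    }

# Example: emission_dictionary([("San", "City"), ("Diego", "City"), ("92122", "Zip")])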
12. Feature Hierarchy
13. Learning model parameters
- When the training data defines a unique path through the HMM (counting sketch below):
  - Transition probabilities
    - Probability of transitioning from state i to state j = (number of transitions from i to j) / (total number of transitions out of state i)
  - Emission probabilities
    - Probability of emitting symbol k from state i = (number of times k is generated from i) / (number of transitions from i)
- When the training data defines multiple paths:
  - A more general EM-like algorithm (Baum-Welch)
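A minimal Python sketch of the unique-path counting, assuming each training record is given as a list of (symbol, state) pairs; names are illustrative:

from collections import defaultdict

def train_from_labeled_paths(records):
    # Maximum-likelihood HMM parameters when training data fixes a unique path.
    # records: list of sequences, each a list of (symbol, state) pairs.
    # Smoothing (next slide) is deliberately omitted here.
    trans = defaultdict(lambda: defaultdict(int))   # state i -> state j -> count
    emit = defaultdict(lambda: defaultdict(int))    # state i -> symbol k -> count
    for rec in records:
        for sym, state in rec:
            emit[state][sym] += 1
        for (_, s_i), (_, s_j) in zip(rec, rec[1:]):
            trans[s_i][s_j] += 1
    trans_p = {i: {j: c / sum(row.values()) for j, c in row.items()}
               for i, row in trans.items()}
    emit_p = {i: {k: c / sum(row.values()) for k, c in row.items()}
              for i, row in emit.items()}
    return trans_p, emit_p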
14. Smoothing
- Two kinds of missing symbols
  - Case 1: unknown over the entire dictionary
  - Case 2: zero count in some state
- Approaches
  - Laplace smoothing: (k_i + 1) / (m + T), i.e. add one to the symbol's count in the state and normalize by the state's total count plus the vocabulary size
  - Absolute discounting
    - P(unknown) is proportional to the number of distinct tokens: P(unknown) = k × (number of distinct symbols)
    - P(known) = (actual probability) − k
    - k is a small fixed constant, smaller for case 2 than for case 1
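A minimal Python sketch of both schemes for a single state's emission counts; how the unknown mass is divided among unseen symbols is not specified on the slide, so the uniform split below is an assumption:

def laplace_smooth(counts, vocab_size):
    # Add-one smoothing: a seen symbol gets (count + 1) / (m + T),
    # every unseen symbol gets 1 / (m + T).
    m = sum(counts.values())
    denom = m + vocab_size
    seen = {sym: (c + 1) / denom for sym, c in counts.items()}
    return seen, 1.0 / denom

def absolute_discount(counts, vocab_size, k=1e-4):
    # Subtract a small constant k from each seen symbol's probability and give
    # the freed mass (k * number of distinct symbols) to the unseen symbols.
    # The value of k and the uniform split are assumptions of this sketch.
    m = sum(counts.values())
    distinct = len(counts)
    seen = {sym: c / m - k for sym, c in counts.items()}
    unseen_each = (k * distinct) / max(vocab_size - distinct, 1)
    return seen, unseen_each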
15. Smoothing (cont.)
- Smoothing parameters derived from data
  - Partition the training data into two parts
  - Train on part 1
  - Use part 2 to map all new tokens to UNK and treat UNK as a new word in the vocabulary
- This works for case 1 but not for case 2. Bikel et al. use this method for case 1; for case 2, zero counts are backed off to 1 / (vocabulary size).
16. Using the HMM to segment
- Find the highest-probability path through the HMM
- Viterbi: quadratic dynamic programming algorithm
- Example input: 115 Grant street Mumbai 400070
[Figure: Viterbi lattice over the tokens "115", "Grant", ..., "400070", with candidate states House, Road, City, ot, Pin at each position]
17. Most Likely Path for a Given Sequence
- The probability that the path \pi is taken and the sequence x is generated is the product of the transition probabilities a and the emission probabilities e along the path:

  Pr(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}
18. Example
[Figure: example HMM over the alphabet {A, C, G, T}, with a begin state 0, emitting states 1-4 (each with its own emission distribution over A, C, G, T), an end state 5, and transition probabilities labelling the arcs]
19. Finding the most probable path: the Viterbi algorithm
- Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k
- We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state N
- v_k(i) can be defined recursively
- Dynamic programming lets us compute it efficiently
20. Finding the most probable path: the Viterbi algorithm
21. The Viterbi algorithm
- Recursion for the emitting states (i = 1 ... L):

  v_l(i) = e_l(x_i) \max_k ( v_k(i-1) \, a_{kl} )

- Keep track of the most probable path with back pointers:

  ptr_i(l) = \arg\max_k ( v_k(i-1) \, a_{kl} )
22. The Viterbi algorithm
- To recover the most probable path, follow the pointers back, starting at the best final state \pi_L^* = \arg\max_k v_k(L)
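A compact Python version of this recursion and traceback (a sketch; the variable names and dictionary-based probability tables are illustrative):

def viterbi(obs, states, start_p, trans_p, emit_p):
    # v[i][s]: probability of the most probable path emitting obs[:i+1] and ending in state s
    v = [{s: start_p.get(s, 0.0) * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            prev, p = max(((ps, v[i - 1][ps] * trans_p[ps].get(s, 0.0)) for ps in states),
                          key=lambda t: t[1])
            v[i][s] = p * emit_p[s].get(obs[i], 0.0)
            back[i][s] = prev
    # traceback from the best final state
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), v[-1][last]

# Illustrative call for the address example above, with made-up probability tables:
# path, p = viterbi(["115", "Grant", "street", "Mumbai", "400070"],
#                   ["House", "Road", "City", "Pin"], start_p, trans_p, emit_p)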
23. Database Integration
- Augment the dictionary
  - Example: a list of cities
  - Assigning probabilities is a problem
- Exploit functional dependencies
  - Example:
    - Santa Barbara -> USA
    - Piskinov -> Georgia
24. 2001 University Avenue, Kendall Sq. Piskinov, Georgia
25. Frequency constraints
- Constraints of the form: the same tag cannot appear in two disconnected segments
  - E.g. the Title of a citation cannot appear twice
  - A street name cannot appear twice
- Not relevant for named-entity-tagging kinds of problems
26. Constrained Viterbi
[Figure: the original Viterbi recurrence shown next to the modified, constraint-aware recurrence]
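The modified recurrence itself is not recoverable from this text. As a rough sketch of the idea (one possible formulation, not necessarily the authors'), the Viterbi state can be extended with the set of tags whose segments have already closed, so a closed tag can never reopen; all names below are illustrative:

def constrained_viterbi(obs, states, start_p, trans_p, emit_p, no_repeat_tags):
    # DP cell: (current state, frozenset of tags already closed) -> (prob, path so far).
    # Assumes a naive model in which each tag is a single state; a tag's segment
    # closes when the path leaves that state.
    V = {(s, frozenset()): (start_p.get(s, 0.0) * emit_p[s].get(obs[0], 0.0), [s])
         for s in states}
    for x in obs[1:]:
        nxt = {}
        for (ps, closed), (p, path) in V.items():
            if p == 0.0:
                continue
            for s in states:
                closed2 = closed | {ps} if (ps != s and ps in no_repeat_tags) else closed
                if s in closed2:
                    continue          # the constraint: a closed tag cannot reappear
                cand = p * trans_p[ps].get(s, 0.0) * emit_p[s].get(x, 0.0)
                if cand > nxt.get((s, closed2), (0.0, None))[0]:
                    nxt[(s, closed2)] = (cand, path + [s])
        V = nxt
    return max(V.values(), key=lambda pv: pv[0]) if V else (0.0, [])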
27. Comparative Evaluation
- Naïve model: one state per element in the HMM
- Independent HMM: one HMM per element
- Rule learning method: Rapier
- Nested model: each state in the naïve model replaced by an HMM
28. Results: Comparative Evaluation

Dataset                    Instances    Elements
IITB student addresses     2388         17
Company addresses          769          6
US addresses               740          6

The nested model does best in all three cases (from Borkar 2001).
29. Results: Effect of Feature Hierarchy
- Feature selection showed at least a 3% increase in accuracy
30. Results: Effect of training data size
- HMMs are fast learners: we get very close to the maximum accuracy with just 50 to 100 addresses
31. HMM approach summary

What is captured             How the HMM captures it
Inter-element sequencing     Outer HMM transitions
Intra-element sequencing     Inner HMM
Element length               Multi-state inner HMM
Characteristic words         Dictionary
Non-overlapping tags         Global optimization