Title: Learning Hidden Markov Model Structure for Information Extraction
1. Learning Hidden Markov Model Structure for Information Extraction
- Kristie Seymore
- Andrew McCallum
- Ronald Rosenfeld
2. Hidden Markov Model Structures
- Machine learning tool applied to information extraction
- Part-of-speech tagging (Kupiec 1992)
- Topic detection and tracking (Yamron et al. 1998)
- Dialog act modeling (Stolcke, Shriberg, and others, 1998)
3. HMMs in Information Extraction
- Gene names and locations (Leek 1997)
- Named-entity extraction (Nymble system; Bikel et al. 1997)
- Information extraction (Freitag and McCallum 1999)
- Common strategy in prior work:
- One HMM per field
- One state per class
- Hand-built models based on human inspection of the data
4. HMM Advantages
- Strong statistical foundations
- Used widely in natural language processing
- Handles new data robustly
- Established training algorithms that are computationally efficient to develop and evaluate
5. HMM Disadvantages
- Require an a priori notion of the model topology
- Need large amounts of training data
6. Authors' Contributions
- Model structure determined automatically from data
- One HMM to extract all fields
- Introduced DISTANTLY-LABELED DATA
7. OUTLINE
- Information extraction basics with HMMs
- Learning model structure from data
- Training data
- Experiment results
- Model selection
- Error breakdown
- Conclusions
- Future work
8. Information Extraction Basics with HMMs
- OBJECTIVE: label every word of a CS research paper header with its class
- Title
- Author
- Date
- Keyword
- Etc.
- One HMM models the entire header (all classes)
- Each header corresponds to a path from the initial state to the final state
9. Discrete-Output, First-Order HMM
- Q: set of states
- qI: initial state
- qF: final state
- Σ = {σ1, σ2, . . . , σm}: discrete output vocabulary
- x = x1 x2 . . . xl: output string
- PROCESS
- Start in the initial state, transition to a new state, emit an output symbol, transition to another state, emit another symbol, . . . until the FINAL STATE is reached
- PARAMETERS
- P(q → q′): transition probabilities
- P(q ↑ σ): emission probabilities
10. The probability of string x being emitted by an HMM M is computed as a sum over all possible state paths, where q0 and q(l+1) are restricted to be qI and qF respectively, and x(l+1) is an end-of-string token (computed efficiently with the Forward algorithm)
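For reference, the sum the slide describes can be written out, following the notation of the Seymore, McCallum, and Rosenfeld paper, as

P(x \mid M) = \sum_{q_1,\ldots,q_l \in Q^l} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k)

where q_0 = q_I, q_{l+1} = q_F, and x_{l+1} is the end-of-string token.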
11. The output is observable, but the underlying state sequence is HIDDEN
12. To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence, the Viterbi algorithm is used
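A minimal sketch of Viterbi decoding for a discrete-output HMM of this kind, assuming dictionary-based transition and emission tables and non-emitting initial/final states (the function and parameter names are illustrative, not from the paper):

import math

def viterbi(words, states, trans, emit, start="INIT", end="FINAL"):
    """Return the most likely state path for `words`.

    trans[q][q2]: P(q -> q2); emit[q][w]: P(q emits w).
    `start`/`end` are non-emitting initial/final states.
    """
    NEG_INF = float("-inf")

    def logp(p):
        return math.log(p) if p > 0 else NEG_INF

    # delta[q] = best log-probability of any path ending in state q
    # after emitting the words seen so far; ptr[q] = best predecessor.
    delta = {q: logp(trans[start].get(q, 0.0)) + logp(emit[q].get(words[0], 0.0))
             for q in states}
    back = [{}]
    for t in range(1, len(words)):
        new_delta, ptr = {}, {}
        for q in states:
            best_prev, best_score = None, NEG_INF
            for p in states:
                score = delta[p] + logp(trans[p].get(q, 0.0))
                if score > best_score:
                    best_prev, best_score = p, score
            new_delta[q] = best_score + logp(emit[q].get(words[t], 0.0))
            ptr[q] = best_prev
        delta, back = new_delta, back + [ptr]
    # Fold in the transition to the final state, then backtrack.
    last = max(states, key=lambda q: delta[q] + logp(trans[q].get(end, 0.0)))
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))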
13. HMM Application
- Each state is associated with a class (e.g. title, author)
- Each word in the header is an observation
- Each state emits words carrying its associated CLASS TAG
- The model is learned from TRAINING DATA
14. Learning Model Structure from Data
- Decide on the states and the transitions between them
- Start from labeled training data (one state per word, chained per header)
- Use MERGING techniques
- Neighbor merging: adjacent states with the same label are combined (e.g. consecutive title words collapse into one title state with a self-transition); a sketch follows this list
- V-merging: two states with the same label that share transitions to or from a common state are combined (so there is a single transition into title and a single transition out)
- Apply Bayesian model merging to trade off fit to the training data against model size
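A rough sketch of the neighbor-merging step, under the assumption that each labeled header starts as a simple chain of (word, label) states; the helper names are illustrative, not from the paper:

def neighbor_merge(headers):
    """Collapse runs of identical labels in each labeled header.

    headers: list of [(word, label), ...] sequences.
    Returns per-header state chains, where each state is a
    (label, [words...]) pair; a run of repeated labels gains an
    implicit self-transition because its words now share one state.
    """
    merged = []
    for header in headers:
        chain = []
        for word, label in header:
            if chain and chain[-1][0] == label:
                chain[-1][1].append(word)      # same label as previous state: merge
            else:
                chain.append((label, [word]))  # new state for a new label run
        merged.append(chain)
    return merged

# Example: consecutive title words collapse into a single title state.
example = [[("learning", "title"), ("hidden", "title"), ("markov", "title"),
            ("kristie", "author"), ("seymore", "author")]]
print(neighbor_merge(example))
# [[('title', ['learning', 'hidden', 'markov']), ('author', ['kristie', 'seymore'])]]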
15. Example Hidden Markov Model (figure)
16. Bayesian model merging seeks the model structure that maximizes the probability of the model M given the training data D, by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
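In symbols, the merging criterion the slide describes is the model posterior (up to a normalizing constant):

P(M \mid D) \propto P(D \mid M)\, P(M)

where P(M) acts as a prior favoring smaller models, P(D | M) measures fit to the data, and merging continues as long as the posterior improves.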
17. Three Types of Training Data
- Labeled data
- Unlabeled data
- Distantly-labeled data
18. Labeled Data
- Manual to produce, and therefore expensive
- Provides counts c(), from which the model parameters are estimated
19. Formulas for deriving parameters from the counts c(): (4) transition probabilities and (5) emission probabilities
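The formulas referenced as (4) and (5) are the standard ratio-of-counts (maximum-likelihood) estimates; reconstructed in the notation used above (the paper may additionally apply smoothing):

P(q \rightarrow q') = \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}, \qquad
P(q \uparrow \sigma) = \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}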
20. Unlabeled Data
- Needs initial parameter estimates from labeled data
- Trained with the Baum-Welch algorithm
- An iterative expectation-maximization (EM) algorithm that adjusts model parameters to locally maximize the likelihood of the unlabeled data (see the re-estimation sketch below)
- Sensitive to initial parameters
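As a reminder of what one EM iteration does (standard Baum-Welch re-estimation, not a formula taken from the paper): with forward and backward probabilities α_t(q) and β_t(q), the posterior state occupancy and emission update are

\gamma_t(q) = \frac{\alpha_t(q)\, \beta_t(q)}{\sum_{q'} \alpha_t(q')\, \beta_t(q')}, \qquad
P(q \uparrow \sigma) \leftarrow \frac{\sum_{t:\, x_t = \sigma} \gamma_t(q)}{\sum_t \gamma_t(q)}

with an analogous expected-count update for the transition probabilities.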
21. Distantly-Labeled Data
- Data labeled for another purpose
- Partially applicable to this domain for training
- EXAMPLE: BibTeX bibliographic entries, whose labeled fields overlap with the classes in CS research paper headers
22. Experiment Results
- Text prepared automatically with a program (a preprocessing sketch follows this list)
- Header: from the beginning of the paper to the word INTRODUCTION or to the end of the first page
- Punctuation, case, and newlines removed
- Labels:
- ABSTRACT marks the abstract
- INTRO marks "Introduction"
- PAGE marks the end of the first page
- 1,000 headers manually labeled
- 65 discarded due to poor formatting
- Fixed word vocabularies derived from the training data
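A rough sketch of the kind of preprocessing the slide describes; the exact tokenization and truncation rules in the paper may differ, and the first-page cutoff used here is an assumed stand-in:

import re

def prepare_header(raw_text, first_page_chars=4000):
    """Extract and normalize a paper header for tagging.

    Keeps text from the start of the paper up to the word
    'Introduction' (or an assumed first-page cutoff), lowercases it,
    and strips punctuation and newlines.
    """
    # Truncate at "Introduction" if present, otherwise at the cutoff.
    match = re.search(r"\bintroduction\b", raw_text, flags=re.IGNORECASE)
    header = raw_text[:match.start()] if match else raw_text[:first_page_chars]
    header = header.lower()
    header = re.sub(r"[^\w\s]", " ", header)   # drop punctuation
    header = re.sub(r"\s+", " ", header)       # collapse whitespace and newlines
    return header.strip().split()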
23. Sources and Amounts of Training Data (table)
24. Model Selection
- MODELS 1-4: one state per class
- MODEL 1: fully connected HMM with uniform transition probabilities between states
- MODEL 2: maximum-likelihood estimates for transitions seen in training, others uniform
- MODEL 3: maximum-likelihood estimates for all transitions (the BASELINE model)
- MODEL 4: Model 3 plus smoothing, so no transition has zero probability
25. Accuracy of Models (word classification accuracy); L: labeled data, L+D: labeled and distantly-labeled data (table)
26. Multiple States per Class: state counts chosen by hand vs. derived automatically, using distantly-labeled data
27. Comparison of the BASELINE, best MULTI-STATE, and V-MERGED models
28. Unlabeled Data Training
- Initial parameters estimated from labeled (L) and distantly-labeled (D) data
- Each emission distribution interpolated with the distribution learned from unlabeled data (U); the interpolation weight is either fixed at 0.5 or varied to find the optimum
- Includes smoothing
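One plausible reading of the mixture this slide refers to (an assumption on my part; the paper's exact weighting scheme may differ): each emission distribution is interpolated as

P(\sigma \mid q) = \lambda\, P_{L+D}(\sigma \mid q) + (1 - \lambda)\, P_{U}(\sigma \mid q)

with λ = 0.5 or chosen to optimize accuracy.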
29. Error Breakdown
- Errors broken down by CLASS TAG
- BOLD marks tags that appear in the distantly-labeled data
30. Error breakdown table by class (figure)
31. Conclusions
- HMMs work well for extracting fields from research paper headers
- Improvement factors:
- Multiple states per class
- Distantly-labeled data (about a 10% improvement)
- Distantly-labeled data can reduce the amount of labeled data needed
32. Future Work
- Use Bayesian model merging to fully automate model structure learning
- Also describe layout by position on the page
- Model internal state structure
33. Model of Internal State Structure
- First two words modeled explicitly
- Multiple affiliations possible
- Last two words modeled explicitly
34. My Assessment
- Highly mathematical and complex
- Even the unlabeled data is in a preset order
- The model requires substantial work to set up training data
- A change in the target data would completely change the model
- Valuable experiments on how heuristics and smoothing affect results
- Wish they had included a sample first page
35. QUESTIONS