Title: Learning Hidden Markov Model Structure for Information Extraction
1. Learning Hidden Markov Model Structure for Information Extraction
- Kristie Seymore
- Andrew McCallum
- Ronald Rosenfeld
2. Hidden Markov Model Structures
- Machine learning tool applied to information extraction
- Part-of-speech tagging (Kupiec 1992)
- Topic detection and tracking (Yamron et al. 1998)
- Dialog act modeling (Stolcke, Shriberg, and others, 1998)
3. HMMs in Information Extraction
- Gene names and locations (Leek 1997)
- Named-entity extraction (Nymble system; Bikel et al. 1997)
- Information extraction (Freitag and McCallum 1999)
- Common strategy in prior work:
- One HMM per field
- One state per class
- Hand-built models based on human inspection of the data
4. HMM Advantages
- Strong statistical foundations
- Used widely in natural language processing
- Handles new data robustly
- Established training algorithms that are computationally efficient to develop and evaluate
5. HMM Disadvantages
- Require an a priori notion of the model topology
- Need large amounts of training data
6. Authors' Contributions
- Model structure determined automatically from data
- One HMM to extract all fields
- Introduced DISTANTLY-LABELED DATA
7. OUTLINE
- Information extraction basics with HMMs
- Learning model structure from data
- Training data
- Experiment results
- Model selection
- Error breakdown
- Conclusions
- Future work
8. Information Extraction Basics with HMMs
- OBJECTIVE: label every word of a CS research paper header with its class
- Title
- Author
- Date
- Keyword
- Etc.
- One HMM models the entire header (all classes)
- Each header corresponds to a path from the initial state to the final state
9. Discrete-Output, First-Order HMM
- Q: set of states
- qI: initial state
- qF: final state
- Σ = {σ1, σ2, . . . , σm}: discrete output vocabulary
- x = x1 x2 . . . xl: output string
- PROCESS
- Start in the initial state, transition to a new state, emit an output symbol, transition to another state, emit another symbol, . . . until the FINAL STATE is reached
- PARAMETERS
- P(q → q′): transition probabilities
- P(q ↑ σ): emission probabilities
10. The probability of string x being emitted by an HMM M is computed as a sum over all possible state paths, where q0 and q(l+1) are restricted to be qI and qF respectively, and x(l+1) is an end-of-string token (computed efficiently with the Forward algorithm)
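For reference, the sum the slide describes can be written out, following the notation of the Seymore, McCallum, and Rosenfeld paper, as

P(x \mid M) = \sum_{q_1,\ldots,q_l \in Q^l} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k)

where q_0 = q_I, q_{l+1} = q_F, and x_{l+1} is the end-of-string token.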
11. The output is observable, but the underlying state sequence is HIDDEN
12. To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence, the Viterbi algorithm is used
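A minimal sketch of Viterbi decoding for a discrete-output HMM of this kind, assuming dictionary-based transition and emission tables and non-emitting initial/final states (the function and parameter names are illustrative, not from the paper):

import math

def viterbi(words, states, trans, emit, start="INIT", end="FINAL"):
    """Return the most likely state path for `words`.

    trans[q][q2]: P(q -> q2); emit[q][w]: P(q emits w).
    `start`/`end` are non-emitting initial/final states.
    """
    NEG_INF = float("-inf")

    def logp(p):
        return math.log(p) if p > 0 else NEG_INF

    # delta[q] = best log-probability of any path ending in state q
    # after emitting the words seen so far; ptr[q] = best predecessor.
    delta = {q: logp(trans[start].get(q, 0.0)) + logp(emit[q].get(words[0], 0.0))
             for q in states}
    back = [{}]
    for t in range(1, len(words)):
        new_delta, ptr = {}, {}
        for q in states:
            best_prev, best_score = None, NEG_INF
            for p in states:
                score = delta[p] + logp(trans[p].get(q, 0.0))
                if score > best_score:
                    best_prev, best_score = p, score
            new_delta[q] = best_score + logp(emit[q].get(words[t], 0.0))
            ptr[q] = best_prev
        delta, back = new_delta, back + [ptr]
    # Fold in the transition to the final state, then backtrack.
    last = max(states, key=lambda q: delta[q] + logp(trans[q].get(end, 0.0)))
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))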
13. HMM Application
- Each state is associated with a class (e.g. title, author)
- Each word in the header is an observation
- Each state emits words carrying its associated CLASS TAG
- The model is learned from TRAINING DATA
14. Learning Model Structure from Data
- Decide on the states and the transitions between them
- Start from labeled training data (one state per word, chained per header)
- Use MERGING techniques
- Neighbor merging: adjacent states with the same label are combined (e.g. consecutive title words collapse into one title state with a self-transition); a sketch follows this list
- V-merging: two states with the same label that share transitions to or from a common state are combined (so there is a single transition into title and a single transition out)
- Apply Bayesian model merging to trade off fit to the training data against model size
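A rough sketch of the neighbor-merging step, under the assumption that each labeled header starts as a simple chain of (word, label) states; the helper names are illustrative, not from the paper:

def neighbor_merge(headers):
    """Collapse runs of identical labels in each labeled header.

    headers: list of [(word, label), ...] sequences.
    Returns per-header state chains, where each state is a
    (label, [words...]) pair; a run of repeated labels gains an
    implicit self-transition because its words now share one state.
    """
    merged = []
    for header in headers:
        chain = []
        for word, label in header:
            if chain and chain[-1][0] == label:
                chain[-1][1].append(word)      # same label as previous state: merge
            else:
                chain.append((label, [word]))  # new state for a new label run
        merged.append(chain)
    return merged

# Example: consecutive title words collapse into a single title state.
example = [[("learning", "title"), ("hidden", "title"), ("markov", "title"),
            ("kristie", "author"), ("seymore", "author")]]
print(neighbor_merge(example))
# [[('title', ['learning', 'hidden', 'markov']), ('author', ['kristie', 'seymore'])]]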
15. Example Hidden Markov Model (figure)
16. Bayesian model merging seeks the model structure that maximizes the probability of the model M given the training data D, by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
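In symbols, the merging criterion the slide describes is the model posterior (up to a normalizing constant):

P(M \mid D) \propto P(D \mid M)\, P(M)

where P(M) acts as a prior favoring smaller models, P(D | M) measures fit to the data, and merging continues as long as the posterior improves.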
17. Three Types of Training Data
- Labeled data
- Unlabeled data
- Distantly-labeled data
18. Labeled Data
- Manual to produce, and therefore expensive
- Provides counts c(), from which the model parameters are estimated
19. Formulas for deriving parameters from the counts c(): (4) transition probabilities and (5) emission probabilities
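The formulas referenced as (4) and (5) are the standard ratio-of-counts (maximum-likelihood) estimates; reconstructed in the notation used above (the paper may additionally apply smoothing):

P(q \rightarrow q') = \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}, \qquad
P(q \uparrow \sigma) = \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}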
20. Unlabeled Data
- Needs initial parameter estimates from labeled data
- Trained with the Baum-Welch algorithm
- An iterative expectation-maximization (EM) algorithm that adjusts model parameters to locally maximize the likelihood of the unlabeled data (see the re-estimation sketch below)
- Sensitive to initial parameters
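As a reminder of what one EM iteration does (standard Baum-Welch re-estimation, not a formula taken from the paper): with forward and backward probabilities α_t(q) and β_t(q), the posterior state occupancy and emission update are

\gamma_t(q) = \frac{\alpha_t(q)\, \beta_t(q)}{\sum_{q'} \alpha_t(q')\, \beta_t(q')}, \qquad
P(q \uparrow \sigma) \leftarrow \frac{\sum_{t:\, x_t = \sigma} \gamma_t(q)}{\sum_t \gamma_t(q)}

with an analogous expected-count update for the transition probabilities.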
21. Distantly-Labeled Data
- Data labeled for another purpose
- Partially applicable to this domain for training
- EXAMPLE: BibTeX bibliographic entries, whose labeled fields overlap with the classes in CS research paper headers
22. Experiment Results
- Text prepared automatically with a program (a preprocessing sketch follows this list)
- Header: from the beginning of the paper to the word INTRODUCTION or to the end of the first page
- Punctuation, case, and newlines removed
- Labels:
- ABSTRACT marks the abstract
- INTRO marks "Introduction"
- PAGE marks the end of the first page
- 1,000 headers manually labeled
- 65 discarded due to poor formatting
- Fixed word vocabularies derived from the training data
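A rough sketch of the kind of preprocessing the slide describes; the exact tokenization and truncation rules in the paper may differ, and the first-page cutoff used here is an assumed stand-in:

import re

def prepare_header(raw_text, first_page_chars=4000):
    """Extract and normalize a paper header for tagging.

    Keeps text from the start of the paper up to the word
    'Introduction' (or an assumed first-page cutoff), lowercases it,
    and strips punctuation and newlines.
    """
    # Truncate at "Introduction" if present, otherwise at the cutoff.
    match = re.search(r"\bintroduction\b", raw_text, flags=re.IGNORECASE)
    header = raw_text[:match.start()] if match else raw_text[:first_page_chars]
    header = header.lower()
    header = re.sub(r"[^\w\s]", " ", header)   # drop punctuation
    header = re.sub(r"\s+", " ", header)       # collapse whitespace and newlines
    return header.strip().split()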
23. Sources and Amounts of Training Data (table)
24. Model Selection
- MODELS 1-4: one state per class
- MODEL 1: fully connected HMM with uniform transition probabilities between states
- MODEL 2: maximum-likelihood estimates for transitions seen in training, others uniform
- MODEL 3: maximum-likelihood estimates for all transitions (the BASELINE model)
- MODEL 4: Model 3 plus smoothing, so no transition has zero probability
25. Accuracy of Models (word classification accuracy); L: labeled data, L+D: labeled and distantly-labeled data (table)
26. Multiple States per Class: state counts chosen by hand vs. derived automatically, using distantly-labeled data
27. Comparison of the BASELINE, best MULTI-STATE, and V-MERGED models
28. Unlabeled Data Training
- Initial parameters estimated from labeled (L) and distantly-labeled (D) data
- Each emission distribution interpolated with the distribution learned from unlabeled data (U); the interpolation weight is either fixed at 0.5 or varied to find the optimum
- Includes smoothing
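One plausible reading of the mixture this slide refers to (an assumption on my part; the paper's exact weighting scheme may differ): each emission distribution is interpolated as

P(\sigma \mid q) = \lambda\, P_{L+D}(\sigma \mid q) + (1 - \lambda)\, P_{U}(\sigma \mid q)

with λ = 0.5 or chosen to optimize accuracy.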
29. Error Breakdown
- Errors broken down by CLASS TAG
- BOLD marks tags that appear in the distantly-labeled data
30. Error breakdown table by class (figure)
31. Conclusions
- HMMs work well for extracting fields from research paper headers
- Improvement factors:
- Multiple states per class
- Distantly-labeled data (about a 10% improvement)
- Distantly-labeled data can reduce the amount of labeled data needed
32. Future Work
- Use Bayesian model merging to fully automate model structure learning
- Also describe layout by position on the page
- Model internal state structure
33. Model of Internal State Structure
- First two words modeled explicitly
- Multiple affiliations possible
- Last two words modeled explicitly
34. My Assessment
- Highly mathematical and complex
- Even the unlabeled data is in a preset order
- The model requires substantial work to set up training data
- A change in the target data would completely change the model
- Valuable experiments on how heuristics and smoothing affect results
- Wish they had included a sample first page
35. QUESTIONS