Learning Hidden Markov Model Structure for Information Extraction


Transcript and Presenter's Notes

1
Learning Hidden Markov Model Structure for
Information Extraction
  • Kristie Seymore,
  • Andrew McCallum,
  • Ronald Rosenfeld

2
Hidden Markov Model Structures
  • A machine learning tool applied to Information
    Extraction
  • Part-of-speech tagging (Kupiec 1992)
  • Topic detection and tracking (Yamron et al. 1998)
  • Dialog act modeling (Stolcke, Shriberg, et al.
    1998)

3
HMM in Information Extraction
  • Gene names and locations (Leek 1997)
  • Named-entity extraction (the Nymble system) and
    information extraction (Freitag and McCallum 1999)
  • Typical information extraction strategy
  • One HMM per field
  • One state per class
  • Hand-built models based on human inspection of the
    data

4
HMM Advantages
  • Strong statistical foundations
  • Widely used in Natural Language Processing
  • Handles new data robustly
  • Uses established training algorithms which are
    computationally efficient to develop and evaluate

5
HMM Disadvantages
  • Require a priori notion of model topology
  • Need large amounts of training data

6
Authors' Contributions
  • Automatically determined model structure from
    data
  • One HMM to extract all information
  • Introduced DISTANTLY-LABELED DATA

7
OUTLINE
  • Information Extraction basics with HMM
  • Learning model structure from data
  • Training data
  • Experimental results
  • Model selection
  • Error breakdown
  • Conclusions
  • Future work

8
Information Extraction basics with HMM
  • OBJECTIVE - label every word of a CS research paper
    header with its class (see the small example after
    this list)
  • Title
  • Author
  • Date
  • Keyword
  • Etc.
  • One HMM models an entire header,
    from the initial state to the final state
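
A toy illustration (not from the paper) of this word-level labeling; the words and class tags below are invented for clarity:

```python
# Hypothetical (word, class-tag) pairs for the start of a paper header
labeled_header = [
    ("learning", "title"), ("hidden", "title"), ("markov", "title"),
    ("model", "title"), ("structure", "title"),
    ("kristie", "author"), ("seymore", "author"),
    ("may", "date"), ("1999", "date"),
]
```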

9
Discrete output, First-order HMM
  • Q - set of states
  • qI - initial state
  • qF - final state
  • Σ = σ1, σ2, . . . , σm - discrete output
    vocabulary
  • x = x1 x2 . . . xl - output string
  • PROCESS
  • From the initial state, transition to a new state
    and emit an output symbol,
  • transition to another state and emit another output
    symbol,
  • . . . until the FINAL STATE is reached
  • PARAMETERS
  • P(q -> q') - transition probabilities
  • P(q ↑ σ) - emission probabilities

10
The probability of string x being emitted by an
HMM M is computed as a sum over all possible state
paths, where q_0 and q_(l+1) are restricted to be qI
and qF respectively, and x_(l+1) is an end-of-string
token (computed efficiently with the Forward algorithm)
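
A hedged reconstruction of the formula being described, using the notation from the HMM definition above (the standard first-order factorization, with q_0 = qI and q_(l+1) = qF):

```latex
P(x \mid M) \;=\; \sum_{q_1 \ldots q_{l+1}} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k)
```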
11
The output is observable, but the underlying
state sequence is HIDDEN
12
To recover the state sequence V(x | M) that has the
highest probability of having produced the observation
sequence, the Viterbi algorithm is used (a minimal
sketch follows)
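
A minimal sketch of Viterbi decoding for such a discrete HMM. The state names, toy probabilities, and helper structure below are illustrative assumptions, not the paper's actual model:

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, state path) for the most likely state sequence."""
    # Best probability and path for each state after the first observation.
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s]) for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            # Choose the predecessor that maximizes the path probability into s.
            prob, path = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0) * emit_p[s].get(obs, 0.0),
                 best[-1][p][1] + [s])
                for p in states
            )
            layer[s] = (prob, path)
        best.append(layer)
    return max(best[-1].values())

# Toy example: two header classes emitting words
states = ["title", "author"]
start_p = {"title": 0.9, "author": 0.1}
trans_p = {"title": {"title": 0.7, "author": 0.3}, "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"learning": 0.5, "markov": 0.5}, "author": {"kristie": 0.5, "seymore": 0.5}}
print(viterbi(["learning", "markov", "kristie"], states, start_p, trans_p, emit_p))
# -> (~0.024, ['title', 'title', 'author'])
```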
13
HMM application
  • Each state has a class (e.g., title, author)
  • Each word in the header is an observation
  • Each state emits words from the header with an
    associated CLASS TAG
  • This is learned from TRAINING DATA

14
Learning model structure from data
  • Decide on the set of states and the transitions
    between them
  • Set up labeled training data
  • Use MERGING techniques
  • Neighbor merging - collapses adjacent word states
    with the same label (e.g., all adjacent title words;
    see the sketch after this list)
  • V-merging - merges two states with the same label
    that share transitions (e.g., leaving a single
    transition into and out of the title states)
  • Apply Bayesian model merging to trade off fit to
    the data against model size
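
A toy sketch of the neighbor-merging idea (my illustration, not the authors' implementation): runs of adjacent word states carrying the same label collapse into a single state with a self-transition:

```python
def neighbor_merge(label_sequence):
    """Collapse runs of identically labeled adjacent word states into one state.

    Returns (label, self_transition_count) pairs: repeated words in the same
    field become a single state with a self-loop instead of a chain of states.
    """
    states = []
    for label in label_sequence:
        if states and states[-1][0] == label:
            states[-1] = (label, states[-1][1] + 1)  # strengthen the self-loop
        else:
            states.append((label, 0))                # start a new state
    return states

# Example: one header labeled word by word
print(neighbor_merge(["title", "title", "title", "author", "author", "date"]))
# -> [('title', 2), ('author', 1), ('date', 0)]
```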

15
Example Hidden Markov Model
16
Bayesian model merging seeks to find the model
structure that maximizes the probability of the
model (M) given some training data (D), by
iteratively merging states until an optimal
tradeoff between fit to the data and model size
has been reached
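
Stated symbolically (a standard restatement implied by the slide rather than quoted from it), the merged model M* maximizes the posterior, where P(D | M) measures fit to the data and the prior P(M) favors smaller models; P(D) is constant across models and can be dropped:

```latex
M^{*} \;=\; \arg\max_{M} P(M \mid D) \;=\; \arg\max_{M} P(D \mid M)\, P(M)
```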
17
Three types of training data
  • Labeled data
  • Unlabeled data
  • Distantly-labeled data

18
Labeled data
  • Manually labeled, so expensive to produce
  • Provides counts c() from which the model
    parameters are estimated

19
Formulas for deriving parameters using counts c():
(4) transition probabilities and (5) emission
probabilities
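
A hedged reconstruction of those count-based maximum-likelihood estimates, using the transition and emission notation from the HMM definition earlier:

```latex
P(q \rightarrow q') = \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}
\qquad
P(q \uparrow \sigma) = \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}
```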
20
Unlabeled Data
  • Requires initial parameter estimates from labeled
    data
  • Trained with the Baum-Welch algorithm (sketched
    below)
  • An iterative expectation-maximization algorithm
    that adjusts model parameters to locally maximize
    the likelihood of the unlabeled data
  • Sensitive to the initial parameters
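
As a sketch of the re-estimation step (the standard Baum-Welch update, assumed here rather than taken from the slides), each iteration replaces the transition probabilities with expected relative frequencies under the current model, and the emission probabilities analogously:

```latex
\hat{P}(q \rightarrow q') \;=\; \frac{\mathbb{E}\left[\text{number of } q \rightarrow q' \text{ transitions} \mid x, M\right]}{\mathbb{E}\left[\text{number of visits to } q \mid x, M\right]}
```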

21
Distantly-labeled data
  • Data labeled for another purpose
  • Can be partially applied to this domain for
    training
  • EXAMPLE - labeled BibTeX bibliographic citations,
    whose fields overlap with CS research paper headers

22
Experimental results
  • Text is prepared automatically by a computer
    program
  • A header runs from the beginning of the paper to
    the word INTRODUCTION or to the end of the 1st page
  • Punctuation, case, and newlines are removed
  • Special tokens mark key positions
  • ABSTRACT - the abstract
  • INTRO - the word Introduction
  • PAGE - the end of the 1st page
  • 1,000 headers were manually labeled
  • (65 were discarded due to poor formatting)
  • Fixed word vocabularies are derived from the
    training data
23
Sources and Amounts of Training Data
24
Model selection
  • MODELS 1-4 - one state per class
  • MODEL 1 - fully connected HMM with uniform
    transition estimates between states
  • MODEL 2 - maximum-likelihood transition estimates
    where available, others uniform
  • MODEL 3 - maximum-likelihood estimates for all
    transitions (the BASELINE model)
  • MODEL 4 - adds smoothing so that no probability is
    zero

25
Accuracy of models (by word classification accuracy).
L = labeled data; L+D = labeled and distantly-labeled
data
26
Multiple states per class: hand-selected vs.
automatically selected state counts, with
distantly-labeled data
27
Comparison of the BASELINE model with the best
MULTI-STATE and V-MERGED models
28
Unlabeled data training: the initial model is trained
on labeled (L) and distantly-labeled (D) data; its
emission distributions are then combined with
distributions estimated from unlabeled data (U), using
either a weight of 0.5 for each emission distribution
or a weight varied to its optimum value; results
include smoothing
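
If the combination works as described, the standard form would be a linear interpolation of emission distributions; the mixing weight λ and the subscripts below are my assumed notation, not the slide's:

```latex
P_{\text{new}}(q \uparrow \sigma) \;=\; \lambda\, P_{L+D}(q \uparrow \sigma) \;+\; (1 - \lambda)\, P_{U}(q \uparrow \sigma)
```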
29
Error breakdown
  • Errors broken down by CLASS TAG
  • BOLD marks tags covered by distantly-labeled data

30
(No Transcript)
31
Conclusions
  • The approach works well on research paper headers
  • Factors that improve accuracy
  • Multi-state classes
  • Distantly-labeled data (roughly a 10% gain)
  • Distantly-labeled data can reduce the amount of
    labeled data needed

32
Future work
  • Use Bayesian model merging to completely automate
    model learning
  • Also describe layout by position on the page
  • Model internal state structure

33
Model of internal state structure: the first two words
are modeled explicitly, multiple affiliations are
possible, and the last two words are modeled explicitly
34
My Assessment
  • Highly mathematical and complex
  • Even the unlabeled data is in a preset order
  • The model requires real work to set up the training
    data
  • A change in the target data will completely change
    the model
  • Valuable experiments showing how heuristics and
    smoothing affect the results
  • I wish they had included a sample first page

35
QUESTIONS