Title: Sequence labeling and beam search
1. Sequence labeling and beam search
2. Outline
- Classification problem (Recap)
- Sequence labeling problem
- HMM and Viterbi algorithm
- Beam search
- MaxEnt case study
3. Classification Problem
4. Classification problem
- Setting
  - C: a finite set of labels
  - Input: x
  - Output: y, where y ∈ C.
- Training data: an instance list (x_i, y_i)
  - Supervised learning: y_i is known
  - Unsupervised learning: y_i is unknown
  - Semi-supervised learning: y_i is unknown for most instances.
5. The 1st step: data conversion
- Represent x as something else.
- Why?
  - The number of possible x is infinite.
  - The new representation makes learning possible.
- How? (see the sketch below)
  - Represent x as a feature vector.
  - Define feature templates: what part of x is useful for determining its y?
  - Calculate feature values.
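For illustration only (the templates and the example text are assumptions, not taken from the slides), feature templates and feature values for a text instance might be computed like this:

```python
def text_features(text):
    """Convert a raw instance x into a feature-value dict using two
    illustrative templates (chosen for illustration, not from the slides):
    lowercased unigrams, and a coarse length bucket."""
    features = {}
    for token in text.lower().split():
        features["unigram=" + token] = 1        # template 1: which words occur in x
    features["length=" + ("short" if len(text) < 40 else "long")] = 1  # template 2
    return features

# text_features("Time flies like an arrow")
# -> {'unigram=time': 1, 'unigram=flies': 1, 'unigram=like': 1,
#     'unigram=an': 1, 'unigram=arrow': 1, 'length=short': 1}
```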
6. The 2nd step: modeling
- kNN and Rocchio: find the closest neighbors / prototypes.
- DT and DL: find the matched group.
7. Modeling: NB and MaxEnt
- Given x, choose y, s.t.
  y* = arg max_y P(y | x) = arg max_y P(x, y)
- How to calculate P(x, y)?
- How many unique (x, y) pairs are there?
- How can we make the task simpler?
  - Decomposition (see the sketch below)
  - Number of parameters: 2^k * |C| → O(k * |C|)
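A minimal sketch of the Naive Bayes decomposition P(x, y) ≈ P(y) ∏_j P(f_j | y), computed in log space; the dictionaries `prior` and `cond` are assumed to hold already-estimated (smoothed) parameters, and the names are illustrative rather than from the slides:

```python
import math

def nb_classify(features, labels, prior, cond):
    """Pick y* = argmax_y P(y) * prod_j P(f_j | y), computed in log space.

    prior[y]     -- P(y)
    cond[(f, y)] -- P(f | y) for each active feature f
    (assumed, smoothed non-zero probabilities; names are illustrative)
    """
    best_label, best_logprob = None, float("-inf")
    for y in labels:
        logprob = math.log(prior[y])
        for f in features:
            logprob += math.log(cond[(f, y)])
        if logprob > best_logprob:
            best_label, best_logprob = y, logprob
    return best_label
```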
8. The 3rd step: training
- kNN: no training
- Rocchio: calculate prototypes
- DT and DL: learn the trees/rules by selecting important features and splitting the data
- NB: calculate the parameter values by simple counting
- MaxEnt: estimate the parameters iteratively
9. The 4th step: testing
- kNN: calculate the distance between x and its neighbors
- Rocchio: calculate the distance between x and the prototypes
- DT and DL: traverse the tree/list
- NB and MaxEnt: calculate P(x, y)
10. Attribute-value table
- Each row corresponds to an instance.
- Each column except the last one corresponds to a feature.
- No features refer to the class label.
- ⇒ At test time
  - the classification of x_i does not affect the classification of x_j.
  - all the feature values are available before testing starts.
11. Sequence labeling problem
12. Sequence labeling problem
- Task: to find the most probable labeling of a sequence.
- Examples
  - POS tagging
  - NP chunking
  - NE detection
  - Word segmentation
  - IGT detection
  - Parsing
  - ...
13. Questions
- Training data: (x_i, y_i)
  - What is x_i? What is y_i?
- What are the features?
- How to convert x_i to a feature vector for training data? How to do that for test data?
14. How to solve a sequence labeling problem?
- Using a sequence labeling algorithm, e.g., HMM
- Using a classification algorithm, in one of three ways (see the sketch after this list):
  - Don't use features that refer to class labels.
  - Use those features and get their values by running other processes.
  - Use those features and find a good (global) solution.
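To make the second option concrete, here is a minimal sketch of left-to-right classification-based tagging, where a feature may refer to the previously predicted tag; `classify` and `extract_features` are hypothetical placeholders for whatever classifier and feature templates are actually used:

```python
def greedy_sequence_label(words, classify, extract_features):
    """Label a sequence left to right with an ordinary classifier.

    classify(features) -> label and extract_features(words, i, prev_tag)
    are hypothetical placeholders, not names from the slides.
    A feature may refer to the previous *predicted* tag, whose value is
    produced by an earlier step of this same process.
    """
    tags = []
    for i, word in enumerate(words):
        prev_tag = tags[i - 1] if i > 0 else "<s>"   # label feature filled at run time
        features = extract_features(words, i, prev_tag)
        tags.append(classify(features))
    return tags
```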
15. Major steps
- Data conversion
  - What is the label set?
- Modeling
- Training
- Testing
  - How to combine individual labels to get a label sequence?
  - How to find a good label sequence?
16. HMM and Viterbi algorithm
17. Two types of HMMs
- State-emission HMM (Moore machine)
  - The emission probability depends only on the state (the from-state or the to-state).
- Arc-emission HMM (Mealy machine)
  - The emission probability depends on the (from-state, to-state) pair.
18. State-emission HMM
[Figure: states s1, s2, ..., sN, each emitting output symbols such as w1, w3, w4, w5]
- Two kinds of parameters
  - Transition probability: P(s_j | s_i)
  - Output (emission) probability: P(w_k | s_i)
- # of parameters: O(NM + N^2)
19. Arc-emission HMM
[Figure: states s1, s2, ..., sN, with output symbols such as w1, ..., w5 emitted on the arcs between states]
Same kinds of parameters, but the emission probabilities depend on both states: P(w_k, s_j | s_i).
# of parameters: O(N^2 M + N^2)
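As a rough illustration of the difference (the numbers are chosen only for illustration, not from the slides): with N = 50 states and M = 10,000 output symbols, the state-emission model needs about 50^2 + 50 * 10,000 = 502,500 parameters, while the arc-emission model needs about 50^2 + 50^2 * 10,000 = 25,002,500, since each (from-state, to-state) arc carries its own output distribution.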
20. Constraints
For any integer n and any HMM:
21. Properties of HMM
- Limited horizon
- Time invariance: the probabilities do not change over time
- The states are hidden because we know the structure of the machine (i.e., the state set S and the output alphabet Σ), but we don't know which state sequence generates a particular output.
22. Three fundamental questions for HMMs
- Finding the probability of an observation
- Finding the best state sequence
- Training: estimating the parameters
23. (2) Finding the best state sequence
- Given the observation O_{1,T} = o_1 ... o_T, find the state sequence X_{1,T+1} = X_1 ... X_{T+1} that maximizes P(X_{1,T+1} | O_{1,T}).
- ⇒ Viterbi algorithm
24. Viterbi algorithm
- The probability of the best path that produces o_1 ... o_{t-1} while ending up in state s_i
Initialization
Induction
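A minimal sketch of the Viterbi recursion for a state-emission HMM (the arc-emission case is analogous, with emissions conditioned on the arc); the parameter dictionaries `init`, `trans`, and `emit` are assumed names, not from the slides:

```python
def viterbi(obs, states, init, trans, emit):
    """Find the most probable state sequence for obs under a
    state-emission HMM (a sketch; parameter names are assumptions).

    init[s]      -- P(s at time 1)
    trans[s][t]  -- P(t | s)
    emit[s][o]   -- P(o | s)
    """
    # initialization: delta[0][s] is the best probability of ending in s after obs[0]
    delta = [{s: init[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    # induction: extend the best ("survivor") path into each state at each time step
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev, prob = max(((p, delta[t - 1][p] * trans[p][s]) for p in states),
                             key=lambda x: x[1])
            delta[t][s] = prob * emit[s][obs[t]]
            back[t][s] = prev
    # trace back from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), delta[-1][last]
```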
25. Important concepts
- State vs. class label
- Assumption: P(t_i | t_1 ... t_{i-1}) = P(t_i | t_{i-1})
- Multiple sequences of states (paths) can lead to a given state, but one is the most likely path to that state, called the "survivor path".
26. Viterbi search
27. Beam Search
28. Beam search (basic)
29. More options
- Expanding options: topN, minhyps
  - If hyps_num < minhyps
    - then use max(topN, minhyps) tags for w_i
    - else use topN tags
- Pruning options: maxhyps, beam, minhyps
  - Keep a hyp iff
    - prob(hyp) * beam > max_prob,
    - hyp is among the top maxhyps, or
    - hyp is among the top minhyps
30. Beam search
- Generate m tags for w_1, set s_{1j} accordingly.
- For i = 2 to n (n is the sentence length):
  - Expanding: for each surviving sequence s_{(i-1)j}
    - Generate m tags for w_i, given s_{(i-1)j} as the previous tag context.
    - Append each tag to s_{(i-1)j} to make a new sequence.
  - Pruning
- Return the highest-probability sequence s_{n1} (see the sketch below).
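A minimal sketch of this loop, assuming a scoring function tag_probs(words, i, prev_tags) that returns per-tag probabilities for w_i given the previous tags; the function name and the simple top-k pruning are illustrative, and the beam/minhyps options from the previous slide are omitted:

```python
import heapq

def beam_search_tag(words, tag_probs, m=3, maxhyps=10):
    """Beam-search tagging sketch.

    tag_probs(words, i, prev_tags) -> {tag: P(tag | context)} is an
    assumed scoring function; m and maxhyps mirror the expanding and
    pruning options above (only top-k pruning is shown here).
    """
    hyps = [(1.0, [])]                              # each hypothesis: (probability, tag sequence)
    for i in range(len(words)):
        expanded = []
        for prob, seq in hyps:                      # expanding
            scores = tag_probs(words, i, seq)
            top_m = heapq.nlargest(m, scores.items(), key=lambda kv: kv[1])
            for tag, p in top_m:
                expanded.append((prob * p, seq + [tag]))
        # pruning: keep only the best maxhyps hypotheses
        hyps = heapq.nlargest(maxhyps, expanded, key=lambda h: h[0])
    best_prob, best_seq = max(hyps, key=lambda h: h[0])
    return best_seq, best_prob
```

Unlike Viterbi, hypotheses that fall outside the beam are discarded permanently, which is why the result may be inexact.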
31. Viterbi vs. Beam search
- DP vs. heuristic search
- Globally optimal vs. inexact
- Small window vs. big window for features
32. Additional slides
33. (1) Finding the probability of the observation
- Forward probability: the probability of producing o_1 ... o_{t-1} while ending up in state s_i
34. Calculating forward probability
Initialization
Induction
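A matching sketch of the forward recursion (initialization, then induction), using the same assumed parameter dictionaries as the Viterbi sketch above; the only change from Viterbi is summing over predecessor states instead of maximizing:

```python
def forward_probability(obs, states, init, trans, emit):
    """P(obs) under a state-emission HMM via the forward algorithm
    (a sketch; init/trans/emit are the assumed parameter dictionaries
    used in the Viterbi sketch above).
    """
    # initialization: alpha[s] = P(s) * P(obs[0] | s)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    # induction: sum over all predecessor states instead of taking the max
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    # total probability of the observation
    return sum(alpha.values())
```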