Title: CPSC 503 Computational Linguistics
1. CPSC 503: Computational Linguistics
- Finish HMMs
- Part-of-Speech Tagging
- Lecture 10
- Giuseppe Carenini
2. Today (13/2)
- Finish HMMs: the three key problems
- Part-of-speech tagging
- What it is
- Why we need it
- How to do it
3. Hidden Markov Model (Arc Emission)
[Figure: arc-emission HMM with a Start state and states s1-s4; arcs are labelled with transition probabilities and with the output symbols a, b, i and their emission probabilities]
4. Hidden Markov Model
Formal specification as a five-tuple:
- Set of states
- Output alphabet
- Initial state probabilities
- State transition probabilities
- Symbol emission probabilities
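In the standard notation (e.g., Manning/Schütze) the five-tuple can be written roughly as:

    μ = (S, K, Π, A, B)
    S = {s_1, …, s_N}        set of states
    K = {k_1, …, k_M}        output alphabet
    Π = {π_i}                initial state probabilities
    A = {a_ij}               state transition probabilities
    B = {b_ijk}              symbol emission probabilities (arc emission: symbol k emitted on the transition i → j)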
5. Three fundamental questions for HMMs
- Decoding: finding the probability of an observation (brute force, or the Forward/Backward algorithm)
- Finding the best state sequence (the Viterbi algorithm)
- Training: finding the model parameters which best explain the observations
Manning/Schütze, 2000, p. 325
6. Computing the probability of an observation sequence
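In the notation above, the standard decomposition is roughly:

    P(O | μ) = Σ_X P(O | X, μ) P(X | μ)
             = Σ_{x_1 … x_{T+1}} π_{x_1} Π_{t=1..T} a_{x_t x_{t+1}} b_{x_t x_{t+1} o_t}

Direct evaluation sums over N^{T+1} state sequences; the forward procedure below computes the same quantity in O(T·N²) time.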
7. Decoding Example
State sequence: path probability
- s1, s1, s1: 0
- s1, s2, s1: 0
- …
- s1, s4, s4: .6 · .7 · .6 · .4 · .5
- s2, s4, s3: 0
- s2, s1, s4: .4 · .4 · .7 · 1 · .5
- …
Manning/Schütze, 2000, p. 327
8. The forward procedure
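In Manning/Schütze's arc-emission notation, the forward variables and recursion are roughly:

    α_i(t) = P(o_1 … o_{t-1}, X_t = i | μ)
    Initialization: α_i(1) = π_i
    Induction: α_j(t+1) = Σ_{i=1..N} α_i(t) a_ij b_{ij o_t}
    Total: P(O | μ) = Σ_{i=1..N} α_i(T+1)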
9. The backward procedure
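Likewise, the backward variables and recursion are roughly:

    β_i(t) = P(o_t … o_T | X_t = i, μ)
    Initialization: β_i(T+1) = 1
    Induction: β_i(t) = Σ_{j=1..N} a_ij b_{ij o_t} β_j(t+1)
    Total: P(O | μ) = Σ_{i=1..N} π_i β_i(1)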
10. Combining forward and backward
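The two combine, for any t with 1 ≤ t ≤ T+1, as roughly:

    P(O | μ) = Σ_{i=1..N} α_i(t) β_i(t)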
11. Finding the Best State Sequence
- δ_j(t): probability of the most probable path that leads to that node (state j at time t)
- The Viterbi Algorithm
- Initialization: δ_j(1) = π_j, 1 ≤ j ≤ N
- Induction: δ_j(t+1) = max_{1 ≤ i ≤ N} δ_i(t) a_ij b_{ij o_t}, 1 ≤ j ≤ N
- Store backtrace: ψ_j(t+1) = argmax_{1 ≤ i ≤ N} δ_i(t) a_ij b_{ij o_t}, 1 ≤ j ≤ N
- Termination and path readout:
  X̂_{T+1} = argmax_{1 ≤ i ≤ N} δ_i(T+1)
  X̂_t = ψ_{X̂_{t+1}}(t+1)
  P(X̂) = max_{1 ≤ i ≤ N} δ_i(T+1)
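As an illustration only, here is a minimal Python sketch of the Viterbi recursion above for an arc-emission HMM; the function name, the NumPy layout of pi, A, B, and the indexing convention are assumptions made for this sketch, not part of the original slides.

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Most likely state sequence for an arc-emission HMM (illustrative sketch).

        pi[i]      : initial probability of state i             (length N)
        A[i, j]    : transition probability i -> j              (N x N)
        B[i, j, k] : probability of emitting symbol k on i -> j (N x N x M)
        obs        : observed symbol indices o_1 ... o_T
        """
        N, T = len(pi), len(obs)
        delta = np.zeros((T + 1, N))           # delta[t, j]: best-path probability into state j
        psi = np.zeros((T + 1, N), dtype=int)  # psi[t, j]: best predecessor of state j

        delta[0] = pi                          # initialization: delta_j(1) = pi_j
        for t in range(T):                     # induction over the observations
            scores = delta[t][:, None] * A * B[:, :, obs[t]]   # scores[i, j] = delta_i(t) a_ij b_{ij o_t}
            delta[t + 1] = scores.max(axis=0)
            psi[t + 1] = scores.argmax(axis=0)

        # termination and path readout
        path = [int(delta[T].argmax())]
        for t in range(T, 0, -1):
            path.append(int(psi[t, path[-1]]))
        path.reverse()
        return path, float(delta[T].max())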
12. Parameter Estimation
- Find the values of the model parameters μ = (A, B, Π) which best explain the observations O
- Using Maximum Likelihood Estimation, we want to find the values that maximize P(O | μ)
- There is no known analytic method
- Iterative hill-climbing algorithm known as Baum-Welch or the Forward-Backward algorithm (a special case of the EM algorithm)
13. Baum-Welch Algorithm: Key ideas
- 1) Start with some (perhaps randomly chosen) model
- 2) Now you can compute:
  - Expected # of transitions from i to j
  - Expected # of transitions from state i
  - Expected # of transitions from i to j with k observed
- 3) Now you can compute re-estimates of the model parameters
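In Manning/Schütze's arc-emission notation, the quantities behind steps 2) and 3) are roughly (using the forward and backward variables α and β from above):

    p_t(i, j) = P(X_t = i, X_{t+1} = j | O, μ)
              = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_m α_m(t) β_m(t)

    π̂_i  = expected frequency in state i at time 1 = Σ_j p_1(i, j)
    â_ij = (expected # of transitions from i to j) / (expected # of transitions from i)
         = Σ_{t=1..T} p_t(i, j) / Σ_{t=1..T} Σ_j p_t(i, j)
    b̂_ijk = (expected # of transitions from i to j with k observed) / (expected # of transitions from i to j)
          = Σ_{t: o_t = k} p_t(i, j) / Σ_{t=1..T} p_t(i, j)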
14. Parts of Speech Tagging
- What is it?
- Why do we need it?
- Word classes (Tags)
- Distribution
- Tagsets
- How to do it
- Rule-based
- Stochastic
- Transformation-based
15. Parts of Speech Tagging: What
- Brainpower_NNP ,_, not_RB physical_JJ plant_NN
,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ
asset_NN ._.
- Tag meanings
  - NNP (proper noun, singular), RB (adverb), JJ (adjective), NN (noun, singular or mass), VBZ (verb, 3rd person singular present), DT (determiner), POS (possessive ending), . (sentence-final punctuation)
16. Parts of Speech Tagging: Why?
- Part-of-speech (word class, morph. class,
syntactic category) gives a significant amount of
info about the word and its neighbors
Useful in the following NLP tasks
- As a basis for (Partial) Parsing
- IR
- Word-sense disambiguation
- Speech synthesis
- Improve language models (Spelling/Speech)
17. Parts of Speech
- Eight basic categories
  - Noun, verb, pronoun, preposition, adjective, adverb, article, conjunction
- These categories are based on
  - morphological properties (the affixes they take)
  - distributional properties (what other words can occur nearby)
  - e.g., green: "It is so ___", "both ___", "The ___ is"
- Not semantics!
18. Parts of Speech
- Two kinds of category
- Closed class (generally function words): very short, frequent and important
  - Prepositions, articles, conjunctions, pronouns, determiners, auxiliaries, numerals
- Open class: objects, actions, events, properties
  - Nouns (proper/common, mass/count), verbs, adjectives, adverbs
- If you run across an unknown word, which kind is it more likely to be?
19. PoS Distribution
- Parts of speech follow the usual skewed behavior in language
- [Chart: number of words by how many PoS tags they can take; most words have 1 PoS, fewer have 2 PoS, and only a few have many PoS, but these ambiguous words are unfortunately very frequent]
- But luckily the different tags associated with a word are not equally likely
20. Sets of Parts of Speech: Tagsets
- Most commonly used
- 45-tag Penn Treebank,
- 61-tag C5,
- 146-tag C7
- The choice of tagset is based on the application (do you care about distinguishing between "to" as a preposition and "to" as an infinitive marker?)
- Accurate tagging can be done even with large tagsets
21. PoS Tagging
Input:
- text: Brainpower, not physical plant, is now a firm's chief asset.
- tagset
- dictionary: word_i → set of possible tags
Output:
- Brainpower_NNP ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._.
22. Tagger Types
- Rule-based
- Stochastic
- HMM tagger: > 92%
- Transformation-based tagger (Brill): > 95%
- Maximum Entropy Models: > 97%
23. Rule-Based (ENGTWOL '95)
- A lexicon transducer returns, for each word, all possible morphological parses
- A set of 1,000 constraints is applied to rule out inappropriate PoS
24. HMM Stochastic Tagging
- Tags correspond to HMM states
- Words correspond to the HMM alphabet symbols
- Tagging: given a sequence of words (observations), find the most likely sequence of tags (states)
- But this is finding the best state sequence (Viterbi)!
- We need state transition and symbol emission probabilities:
  - 1) estimate them from a tagged corpus, or
  - 2) no corpus: parameter estimation (Baum-Welch)
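For reference, a standard bigram HMM tagger (not spelled out in the extracted slide) chooses roughly

    t̂_1 … t̂_n = argmax_{t_1 … t_n} P(t_1 … t_n | w_1 … w_n)
              ≈ argmax_{t_1 … t_n} Π_{i=1..n} P(w_i | t_i) P(t_i | t_{i-1})

and option 1) amounts to maximum likelihood estimates from a tagged corpus:

    P(t_i | t_{i-1}) ≈ C(t_{i-1}, t_i) / C(t_{i-1}),   P(w_i | t_i) ≈ C(t_i, w_i) / C(t_i)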
25. Transformation-Based Learning (the Brill Tagger, 95-97)
Combines rule-based and stochastic approaches
- Rules specify tags for words based on context
- Rules are automatically induced from a
pre-tagged training corpus
26. TBL: How TBL rules are applied
- Step 1: Assign each word the tag that is most likely given no contextual information.
  - Race example: P(NN | race) = .98, P(VB | race) = .02
- Step 2: Apply transformation rules that use the context that was just established.
  - Race example: Change NN to VB when the previous tag is TO.
  - Johanna is expected to race tomorrow. The race is already over.
27. How TBL rules are learned
- Major stages (supervised!)
- 0. Save the hand-tagged corpus.
- 1. Label every word with its most-likely tag.
- 2. Examine every possible transformation and select the one that most improves the tagging.
- 3. Retag the data according to this rule.
- 4. Repeat 2-3 until some stopping point is reached.
- Output: an ordered list of transformations
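A minimal Python sketch of this greedy loop, limited to a single rule template ("change tag a to b when the previous tag is z"); the function name, the most_likely_tag dictionary, and the NN default for unknown words are assumptions made for this illustration, not part of Brill's actual tagger.

    from collections import Counter

    def learn_tbl_rules(words, gold_tags, most_likely_tag, max_rules=10):
        """Greedy TBL learning sketch: rules are (a, b, z) = change a to b after tag z."""
        # Stage 1: label every word with its most likely tag (unknown words default to NN here)
        tags = [most_likely_tag.get(w, "NN") for w in words]
        rules = []
        tagset = set(gold_tags)
        for _ in range(max_rules):
            gains = Counter()                 # net error reduction for each candidate rule
            for i in range(1, len(words)):
                a, z = tags[i], tags[i - 1]
                if tags[i] != gold_tags[i]:
                    gains[(a, gold_tags[i], z)] += 1      # this rule would fix the token
                else:
                    for b in tagset:
                        if b != a:
                            gains[(a, b, z)] -= 1         # this rule would break the token
            if not gains:
                break
            best_rule, gain = gains.most_common(1)[0]
            if gain <= 0:
                break                                     # stopping point: nothing improves tagging
            rules.append(best_rule)
            a, b, z = best_rule
            # Stage 3: retag the data according to the selected rule (applied simultaneously)
            tags = [b if i > 0 and tags[i] == a and tags[i - 1] == z else tags[i]
                    for i in range(len(tags))]
        return rules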
28. The Universe of Possible Transformations?
- Change tag a to b if … (some condition on the surrounding words and tags)
Huge search space!
29. Evaluating Taggers
- Accuracy: percent correct (most current taggers: 96-7%); test on unseen data!
- Human ceiling: agreement rate of humans on the classification (96-7%)
- Unigram baseline: assign each token to the class it occurred in most frequently in the training set (race → NN)
- What is causing the errors? Build a confusion matrix
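A tiny Python sketch of the accuracy and confusion-matrix bookkeeping; the function name and data layout are assumptions for illustration.

    from collections import Counter

    def evaluate(gold_tags, predicted_tags):
        """Token accuracy plus a confusion matrix keyed by (gold, predicted)."""
        confusion = Counter(zip(gold_tags, predicted_tags))
        correct = sum(n for (g, p), n in confusion.items() if g == p)
        return correct / len(gold_tags), confusion

    # Off-diagonal cells show what is causing the errors,
    # e.g. confusion[("NN", "VB")] counts nouns mistagged as verbs.
    accuracy, confusion = evaluate(["NN", "VB", "DT"], ["NN", "NN", "DT"])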
30. Knowledge-Formalisms Map (including probabilistic formalisms)
[Diagram: map from levels of linguistic knowledge (Morphology, Syntax, Semantics, Pragmatics: Discourse and Dialogue) to formalisms: state machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models); rule systems and probabilistic versions (e.g., (Prob.) Context-Free Grammars); logical formalisms (First-Order Logics); AI planners]
31. Next Time
- Read about tagging unknown words
- Read Chapter 9