Title: CPSC 503 Computational Linguistics
1. CPSC 503: Computational Linguistics
- Finish HMMs
- Part-of-Speech Tagging
- Lecture 10
- Giuseppe Carenini
2. Today (13/2)
- Finish HMMs: the three key problems
- Part-of-speech tagging
- What it is
- Why we need it
- How to do it
3. Hidden Markov Model (Arc Emission)
[Figure: arc-emission HMM with a Start state and states s1-s4; arcs are labelled with transition probabilities and with the output symbols a, b, i and their emission probabilities]
4. Hidden Markov Model
Formal specification as a five-tuple:
- Set of states
- Output alphabet
- Initial state probabilities
- State transition probabilities
- Symbol emission probabilities
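In the standard notation (e.g., Manning/Schütze) the five-tuple can be written roughly as:

    μ = (S, K, Π, A, B)
    S = {s_1, …, s_N}        set of states
    K = {k_1, …, k_M}        output alphabet
    Π = {π_i}                initial state probabilities
    A = {a_ij}               state transition probabilities
    B = {b_ijk}              symbol emission probabilities (arc emission: symbol k emitted on the transition i → j)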
5. Three fundamental questions for HMMs
- Decoding: finding the probability of an observation (brute force, or the Forward/Backward algorithm)
- Finding the best state sequence (the Viterbi algorithm)
- Training: finding the model parameters which best explain the observations
Manning/Schütze, 2000, p. 325
6. Computing the probability of an observation sequence
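In the notation above, the standard decomposition is roughly:

    P(O | μ) = Σ_X P(O | X, μ) P(X | μ)
             = Σ_{x_1 … x_{T+1}} π_{x_1} Π_{t=1..T} a_{x_t x_{t+1}} b_{x_t x_{t+1} o_t}

Direct evaluation sums over N^{T+1} state sequences; the forward procedure below computes the same quantity in O(T·N²) time.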
7. Decoding Example
State sequence: path probability
- s1, s1, s1: 0
- s1, s2, s1: 0
- …
- s1, s4, s4: .6 · .7 · .6 · .4 · .5
- s2, s4, s3: 0
- s2, s1, s4: .4 · .4 · .7 · 1 · .5
- …
Manning/Schütze, 2000, p. 327
8. The forward procedure
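In Manning/Schütze's arc-emission notation, the forward variables and recursion are roughly:

    α_i(t) = P(o_1 … o_{t-1}, X_t = i | μ)
    Initialization: α_i(1) = π_i
    Induction: α_j(t+1) = Σ_{i=1..N} α_i(t) a_ij b_{ij o_t}
    Total: P(O | μ) = Σ_{i=1..N} α_i(T+1)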
9. The backward procedure
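Likewise, the backward variables and recursion are roughly:

    β_i(t) = P(o_t … o_T | X_t = i, μ)
    Initialization: β_i(T+1) = 1
    Induction: β_i(t) = Σ_{j=1..N} a_ij b_{ij o_t} β_j(t+1)
    Total: P(O | μ) = Σ_{i=1..N} π_i β_i(1)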
10. Combining forward and backward
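The two combine, for any t with 1 ≤ t ≤ T+1, as roughly:

    P(O | μ) = Σ_{i=1..N} α_i(t) β_i(t)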
11. Finding the Best State Sequence
- δ_j(t): probability of the most probable path that leads to that node (state j at time t)
- The Viterbi Algorithm
- Initialization: δ_j(1) = π_j, 1 ≤ j ≤ N
- Induction: δ_j(t+1) = max_{1 ≤ i ≤ N} δ_i(t) a_ij b_{ij o_t}, 1 ≤ j ≤ N
- Store backtrace: ψ_j(t+1) = argmax_{1 ≤ i ≤ N} δ_i(t) a_ij b_{ij o_t}, 1 ≤ j ≤ N
- Termination and path readout:
  X̂_{T+1} = argmax_{1 ≤ i ≤ N} δ_i(T+1)
  X̂_t = ψ_{X̂_{t+1}}(t+1)
  P(X̂) = max_{1 ≤ i ≤ N} δ_i(T+1)
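As an illustration only, here is a minimal Python sketch of the Viterbi recursion above for an arc-emission HMM; the function name, the NumPy layout of pi, A, B, and the indexing convention are assumptions made for this sketch, not part of the original slides.

    import numpy as np

    def viterbi(pi, A, B, obs):
        """Most likely state sequence for an arc-emission HMM (illustrative sketch).

        pi[i]      : initial probability of state i             (length N)
        A[i, j]    : transition probability i -> j              (N x N)
        B[i, j, k] : probability of emitting symbol k on i -> j (N x N x M)
        obs        : observed symbol indices o_1 ... o_T
        """
        N, T = len(pi), len(obs)
        delta = np.zeros((T + 1, N))           # delta[t, j]: best-path probability into state j
        psi = np.zeros((T + 1, N), dtype=int)  # psi[t, j]: best predecessor of state j

        delta[0] = pi                          # initialization: delta_j(1) = pi_j
        for t in range(T):                     # induction over the observations
            scores = delta[t][:, None] * A * B[:, :, obs[t]]   # scores[i, j] = delta_i(t) a_ij b_{ij o_t}
            delta[t + 1] = scores.max(axis=0)
            psi[t + 1] = scores.argmax(axis=0)

        # termination and path readout
        path = [int(delta[T].argmax())]
        for t in range(T, 0, -1):
            path.append(int(psi[t, path[-1]]))
        path.reverse()
        return path, float(delta[T].max())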
12. Parameter Estimation
- Find the values of the model parameters μ = (A, B, Π) which best explain the observations O
- Using Maximum Likelihood Estimation, we want to find the values that maximize P(O | μ)
- There is no known analytic method
- Iterative hill-climbing algorithm known as Baum-Welch or the Forward-Backward algorithm (a special case of the EM algorithm)
13. Baum-Welch Algorithm: Key ideas
- 1) Start with some (perhaps randomly chosen) model
- 2) Now you can compute:
  - Expected # of transitions from i to j
  - Expected # of transitions from state i
  - Expected # of transitions from i to j with k observed
- 3) Now you can compute re-estimates of the model parameters
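In Manning/Schütze's arc-emission notation, the quantities behind steps 2) and 3) are roughly (using the forward and backward variables α and β from above):

    p_t(i, j) = P(X_t = i, X_{t+1} = j | O, μ)
              = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_m α_m(t) β_m(t)

    π̂_i  = expected frequency in state i at time 1 = Σ_j p_1(i, j)
    â_ij = (expected # of transitions from i to j) / (expected # of transitions from i)
         = Σ_{t=1..T} p_t(i, j) / Σ_{t=1..T} Σ_j p_t(i, j)
    b̂_ijk = (expected # of transitions from i to j with k observed) / (expected # of transitions from i to j)
          = Σ_{t: o_t = k} p_t(i, j) / Σ_{t=1..T} p_t(i, j)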
14. Parts of Speech Tagging
- What is it?
- Why do we need it?
- Word classes (Tags)
- Distribution
- Tagsets
- How to do it
- Rule-based
- Stochastic
- Transformation-based
15. Parts of Speech Tagging: What
- Brainpower_NNP ,_, not_RB physical_JJ plant_NN
,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ
asset_NN ._.
- Tag meanings
  - NNP (proper noun, singular), RB (adverb), JJ (adjective), NN (noun, singular or mass), VBZ (verb, 3rd person singular present), DT (determiner), POS (possessive ending), . (sentence-final punctuation)
16. Parts of Speech Tagging: Why?
- Part-of-speech (word class, morph. class,
syntactic category) gives a significant amount of
info about the word and its neighbors
Useful in the following NLP tasks
- As a basis for (Partial) Parsing
- IR
- Word-sense disambiguation
- Speech synthesis
- Improve language models (Spelling/Speech)
17. Parts of Speech
- Eight basic categories
  - Noun, verb, pronoun, preposition, adjective, adverb, article, conjunction
- These categories are based on
  - morphological properties (the affixes they take)
  - distributional properties (what other words can occur nearby)
  - e.g., green: "It is so ___", "both ___", "The ___ is"
- Not semantics!
18. Parts of Speech
- Two kinds of category
- Closed class (generally function words): very short, frequent and important
  - Prepositions, articles, conjunctions, pronouns, determiners, auxiliaries, numerals
- Open class: objects, actions, events, properties
  - Nouns (proper/common, mass/count), verbs, adjectives, adverbs
- If you run across an unknown word, which kind is it more likely to be?
19. PoS Distribution
- Parts of speech follow the usual skewed behavior in language
- [Chart: number of words by how many PoS tags they can take; most words have 1 PoS, fewer have 2 PoS, and only a few have many PoS, but these ambiguous words are unfortunately very frequent]
- But luckily the different tags associated with a word are not equally likely
20. Sets of Parts of Speech: Tagsets
- Most commonly used
- 45-tag Penn Treebank,
- 61-tag C5,
- 146-tag C7
- The choice of tagset is based on the application (do you care about distinguishing between "to" as a preposition and "to" as an infinitive marker?)
- Accurate tagging can be done even with large tagsets
21. PoS Tagging
Input:
- text: Brainpower, not physical plant, is now a firm's chief asset.
- tagset
- dictionary: word_i → set of possible tags
Output:
- Brainpower_NNP ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._.
22. Tagger Types
- Rule-based
- Stochastic
- HMM tagger: > 92%
- Transformation-based tagger (Brill): > 95%
- Maximum Entropy Models: > 97%
23. Rule-Based (ENGTWOL '95)
- A lexicon transducer returns, for each word, all possible morphological parses
- A set of 1,000 constraints is applied to rule out inappropriate PoS
24. HMM Stochastic Tagging
- Tags correspond to HMM states
- Words correspond to the HMM alphabet symbols
- Tagging: given a sequence of words (observations), find the most likely sequence of tags (states)
- But this is finding the best state sequence (Viterbi)!
- We need state transition and symbol emission probabilities:
  - 1) estimate them from a tagged corpus, or
  - 2) no corpus: parameter estimation (Baum-Welch)
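For reference, a standard bigram HMM tagger (not spelled out in the extracted slide) chooses roughly

    t̂_1 … t̂_n = argmax_{t_1 … t_n} P(t_1 … t_n | w_1 … w_n)
              ≈ argmax_{t_1 … t_n} Π_{i=1..n} P(w_i | t_i) P(t_i | t_{i-1})

and option 1) amounts to maximum likelihood estimates from a tagged corpus:

    P(t_i | t_{i-1}) ≈ C(t_{i-1}, t_i) / C(t_{i-1}),   P(w_i | t_i) ≈ C(t_i, w_i) / C(t_i)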
25. Transformation-Based Learning (the Brill Tagger, 95-97)
Combines rule-based and stochastic approaches
- Rules specify tags for words based on context
- Rules are automatically induced from a
pre-tagged training corpus
26. TBL: How TBL rules are applied
- Step 1: Assign each word the tag that is most likely given no contextual information.
  - Race example: P(NN | race) = .98, P(VB | race) = .02
- Step 2: Apply transformation rules that use the context that was just established.
  - Race example: Change NN to VB when the previous tag is TO.
  - Johanna is expected to race tomorrow. The race is already over.
27. How TBL rules are learned
- Major stages (supervised!)
- 0. Save the hand-tagged corpus.
- 1. Label every word with its most-likely tag.
- 2. Examine every possible transformation and select the one that most improves the tagging.
- 3. Retag the data according to this rule.
- 4. Repeat 2-3 until some stopping point is reached.
- Output: an ordered list of transformations
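A minimal Python sketch of this greedy loop, limited to a single rule template ("change tag a to b when the previous tag is z"); the function name, the most_likely_tag dictionary, and the NN default for unknown words are assumptions made for this illustration, not part of Brill's actual tagger.

    from collections import Counter

    def learn_tbl_rules(words, gold_tags, most_likely_tag, max_rules=10):
        """Greedy TBL learning sketch: rules are (a, b, z) = change a to b after tag z."""
        # Stage 1: label every word with its most likely tag (unknown words default to NN here)
        tags = [most_likely_tag.get(w, "NN") for w in words]
        rules = []
        tagset = set(gold_tags)
        for _ in range(max_rules):
            gains = Counter()                 # net error reduction for each candidate rule
            for i in range(1, len(words)):
                a, z = tags[i], tags[i - 1]
                if tags[i] != gold_tags[i]:
                    gains[(a, gold_tags[i], z)] += 1      # this rule would fix the token
                else:
                    for b in tagset:
                        if b != a:
                            gains[(a, b, z)] -= 1         # this rule would break the token
            if not gains:
                break
            best_rule, gain = gains.most_common(1)[0]
            if gain <= 0:
                break                                     # stopping point: nothing improves tagging
            rules.append(best_rule)
            a, b, z = best_rule
            # Stage 3: retag the data according to the selected rule (applied simultaneously)
            tags = [b if i > 0 and tags[i] == a and tags[i - 1] == z else tags[i]
                    for i in range(len(tags))]
        return rules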
28. The Universe of Possible Transformations?
- Change tag a to b if … (some condition on the surrounding words and tags)
Huge search space!
29. Evaluating Taggers
- Accuracy: percent correct (most current taggers: 96-7%); test on unseen data!
- Human ceiling: agreement rate of humans on the classification (96-7%)
- Unigram baseline: assign each token to the class it occurred in most frequently in the training set (race → NN)
- What is causing the errors? Build a confusion matrix
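A tiny Python sketch of the accuracy and confusion-matrix bookkeeping; the function name and data layout are assumptions for illustration.

    from collections import Counter

    def evaluate(gold_tags, predicted_tags):
        """Token accuracy plus a confusion matrix keyed by (gold, predicted)."""
        confusion = Counter(zip(gold_tags, predicted_tags))
        correct = sum(n for (g, p), n in confusion.items() if g == p)
        return correct / len(gold_tags), confusion

    # Off-diagonal cells show what is causing the errors,
    # e.g. confusion[("NN", "VB")] counts nouns mistagged as verbs.
    accuracy, confusion = evaluate(["NN", "VB", "DT"], ["NN", "NN", "DT"])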
30. Knowledge-Formalisms Map (including probabilistic formalisms)
[Diagram: map from levels of linguistic knowledge (Morphology, Syntax, Semantics, Pragmatics: Discourse and Dialogue) to formalisms: state machines and probabilistic versions (Finite State Automata, Finite State Transducers, Markov Models); rule systems and probabilistic versions (e.g., (Prob.) Context-Free Grammars); logical formalisms (First-Order Logics); AI planners]
31. Next Time
- Read about tagging unknown words
- Read Chapter 9