Title: CS60057 Speech and Natural Language Processing
1. CS60057: Speech and Natural Language Processing
Lecture 8, 9 August 2007
2. Why Do We Care about Parts of Speech?
- Pronunciation
  - Hand me the lead pipe.
- Predicting what words can be expected next
  - Personal pronoun (e.g., I, she) ____________
- Stemming
  - -s means singular for verbs, plural for nouns
- As the basis for syntactic parsing and then meaning extraction
  - I will lead the group into the lead smelter.
- Machine translation
  - (E) content N → (F) contenu N
  - (E) content Adj → (F) content Adj or satisfait Adj
3. LIN6932: Topics in Computational Linguistics
- Hana Filip
- Lecture 4
- Part of Speech Tagging (II) - Introduction to Probability
- February 1, 2007
4. What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things. Maybe Adjective is the class of words for properties of nouns. Consider "green book": book is a Noun and green is an Adjective. Now consider "book worm" and "This green is very soothing." Here book modifies worm the way an adjective would, and green behaves like a noun, so the distinction cannot be purely semantic.
5. How Many Parts of Speech Are There?
- A first cut at the easy distinctions:
- Open classes
  - nouns, verbs, adjectives, adverbs
- Closed classes (function words)
  - conjunctions: and, or, but
  - pronouns: I, she, him
  - prepositions: with, on
  - determiners: the, a, an
6. Part of Speech Tagging
- 8 (ish) traditional parts of speech
  - Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
- Called parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS
- We'll use "POS" most frequently
- I'll assume that you all know what these are
7. POS Examples
- N noun: chair, bandwidth, pacing
- V verb: study, debate, munch
- ADJ adjective: purple, tall, ridiculous
- ADV adverb: unfortunately, slowly
- P preposition: of, by, to
- PRO pronoun: I, me, mine
- DET determiner: the, a, that, those
8. Tagsets
- Brown corpus tagset (87 tags): http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
- Penn Treebank tagset (45 tags): http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
- C7 tagset (146 tags): http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
9. POS Tagging: Definition
- The process of assigning a part-of-speech or lexical class marker to each word in a corpus
10. POS Tagging: Example
- WORD tag
- the DET
- koala N
- put V
- the DET
- keys N
- on P
- the DET
- table N
11. POS Tagging: Choosing a Tagset
- There are many parts of speech and potential distinctions we can draw
- To do POS tagging, we need to choose a standard set of tags to work with
- Could pick a very coarse tagset
  - N, V, Adj, Adv
- A more commonly used set is finer grained: the UPenn TreeBank tagset (45 tags)
  - PRP, WRB, WP, VBG
- Even more fine-grained tagsets exist
12. Penn TreeBank POS Tagset
13. Using the UPenn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Prepositions and subordinating conjunctions are marked IN (although/IN I/PRP ...)
- Except the preposition/complementizer "to", which is just marked TO.
14. POS Tagging
- Words often have more than one POS: back
  - The back door: JJ
  - On my back: NN
  - Win the voters back: RB
  - Promised to back the bill: VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word.
(These examples are from Dekang Lin.)
15. How Hard Is POS Tagging? Measuring Ambiguity
16. Algorithms for POS Tagging
- Ambiguity: In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags). Worse, 40% of the tokens are ambiguous.
17. Algorithms for POS Tagging
- Why can't we just look them up in a dictionary?
  - Words that aren't in the dictionary: http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
- One idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti.
  - Nouns are more likely than verbs, which are more likely than pronouns.
- Another idea: use morphology. (A sketch of the dictionary-plus-hapax idea follows below.)
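
A minimal sketch (not from the slides) of this idea in Python: give each known word its most frequent tag, and fall back, for unknown words, on the majority tag among hapax legomena. The corpus format, a flat list of (word, tag) pairs, is an assumed convention.

    from collections import Counter, defaultdict

    def train(tagged_corpus):
        # tag counts per word
        word_tags = defaultdict(Counter)
        for word, tag in tagged_corpus:
            word_tags[word][tag] += 1
        # tags of words seen exactly once approximate the unknown-word distribution
        hapax_tags = Counter(next(iter(c)) for c in word_tags.values()
                             if sum(c.values()) == 1)
        best = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
        default = hapax_tags.most_common(1)[0][0] if hapax_tags else "NN"
        return best, default

    def tag(words, best, default):
        return [(w, best.get(w, default)) for w in words]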
18. Algorithms for POS Tagging: Knowledge
- Dictionary
- Morphological rules, e.g.,
  - _____-tion
  - _____-ly
  - capitalization
- N-gram frequencies
  - to _____
  - DET _____ N
- But what about rare words, e.g., smelt (two verb forms, "melt" and the past tense of "smell", and one noun form, a small fish)?
- Combining these:
  - V _____-ing: "I was gracking" vs. "Gracking is fun."
19. POS Tagging: Approaches
- Rule-based tagging
  - (ENGTWOL)
- Stochastic (probabilistic) tagging
  - HMM (Hidden Markov Model) tagging
- Transformation-based tagging
  - Brill tagger
- Do we return one best answer or several answers and let later steps decide?
- How does the requisite knowledge get entered?
20. Three Methods for POS Tagging
- 1. Rule-based tagging
- Example: the Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon
- Basic idea:
  - Assign all possible tags to words (a morphological analyzer is used)
  - Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may also be machine-learned)
21. Three Methods for POS Tagging
- 2. Transformation-based tagging
- Example: the Brill (1995) tagger, a combination of the rule-based and stochastic (probabilistic) tagging methodologies
- Basic idea:
  - Start with a tagged corpus and a dictionary (with most frequent tags)
  - Set the most probable tag for each word as a start value
  - Change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order (like rule-based taggers)
  - Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach); a sketch of rule application follows below
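
A minimal sketch (an illustration, not Brill's actual system) of applying transformations of the form "change tag A to B when the previous tag is C" to an initially tagged sentence:

    def apply_rules(tagged, rules):
        # rules: list of (from_tag, to_tag, prev_tag) triples, fired in order
        tags = [t for _, t in tagged]
        for from_tag, to_tag, prev_tag in rules:
            for i in range(1, len(tags)):
                if tags[i] == from_tag and tags[i - 1] == prev_tag:
                    tags[i] = to_tag
        return list(zip([w for w, _ in tagged], tags))

    # the rule from the slide: a verb after a determiner becomes a noun
    print(apply_rules([("the", "DT"), ("race", "VB")], [("VB", "NN", "DT")]))
    # -> [('the', 'DT'), ('race', 'NN')]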
22. Three Methods for POS Tagging
- 3. Stochastic (probabilistic) tagging
- Example: HMM (Hidden Markov Model) tagging: a training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context
23. Topics
- Probability
- Conditional probability
- Independence
- Bayes' rule
- HMM tagging
- Markov chains
- Hidden Markov Models
24. Introduction to Probability
- Experiment (trial)
  - a repeatable procedure with well-defined possible outcomes
- Sample space (S)
  - the set of all possible outcomes
  - finite or infinite
- Example: coin toss experiment
  - possible outcomes: S = {heads, tails}
- Example: die toss experiment
  - possible outcomes: S = {1, 2, 3, 4, 5, 6}
25. Introduction to Probability
- The definition of the sample space depends on what we are asking
- Sample space (S): the set of all possible outcomes
- Example: die toss experiment for whether the number is even or odd
  - possible outcomes: {even, odd}
  - not {1, 2, 3, 4, 5, 6}
26. More Definitions
- Events
  - an event is any subset of outcomes from the sample space
- Example: die toss experiment
  - let A represent the event that the outcome of the die toss is divisible by 3
  - A = {3, 6}
  - A is a subset of the sample space S = {1, 2, 3, 4, 5, 6}
27. Introduction to Probability
- Some definitions
- Events
  - an event is a subset of the sample space
  - simple and compound events
- Example: deck-of-cards draw experiment
  - suppose the sample space is S = {heart, spade, club, diamond} (the four suits)
  - let A represent the event of drawing a heart
  - let B represent the event of drawing a red card
  - A = {heart} (simple event)
  - B = {heart} ∪ {diamond} = {heart, diamond} (compound event)
  - a compound event can be expressed as a set union of simple events
- Example: alternative sample space S = the set of 52 cards
  - A and B would both be compound events
28. Introduction to Probability
- Some definitions
- Counting
  - suppose an operation oi can be performed in ni ways
  - a set of k operations o1, o2, ..., ok can be performed in n1 × n2 × ... × nk ways
- Example: dice toss experiment, 6 possible outcomes per die
  - two dice are thrown at the same time
  - number of sample points in the sample space = 6 × 6 = 36
29. Definition of Probability
- The probability law assigns to an event a nonnegative number
  - called P(A)
  - also called the probability of A
  - that encodes our knowledge or belief about the collective likelihood of all the elements of A
- The probability law must satisfy certain properties
30. Probability Axioms
- Nonnegativity
  - P(A) ≥ 0, for every event A
- Additivity
  - If A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B)
- Normalization
  - The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
31. An Example
- An experiment involving a single coin toss
- There are two possible outcomes, H and T
- Sample space S is {H, T}
- If the coin is fair, we should assign equal probabilities to the 2 outcomes
- Since they have to sum to 1:
  - P(H) = 0.5
  - P(T) = 0.5
  - P({H, T}) = P(H) + P(T) = 1.0
32. Another Example
- An experiment involving 3 coin tosses
- The outcome is a 3-long string of H or T
- S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Assume each outcome is equiprobable (uniform distribution)
- What is the probability of the event that exactly 2 heads occur?
  - A = {HHT, HTH, THH} (3 of the 8 equally likely outcomes)
  - P(A) = P(HHT) + P(HTH) + P(THH) (additivity over the disjoint individual outcomes)
  - = 1/8 + 1/8 + 1/8
  - = 3/8
33. Probability Definitions
- In summary: the probability of drawing a spade from 52 well-shuffled playing cards is 13/52 = 1/4 = 0.25
34. Moving Toward Language
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
- What's the probability of a random word (from a random dictionary page) being a verb?
35. Probability and Part-of-Speech Tags
- What's the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these:
  - All words: just count all the words in the dictionary
  - # of ways to get a verb: the number of words which are verbs!
  - If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20
36. Conditional Probability
- A way to reason about the outcome of an experiment based on partial information
  - In a word-guessing game, the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
  - How likely is it that a person has a disease given that a medical test was negative?
  - A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
37. More Precisely
- Given an experiment, a corresponding sample space S, and a probability law
- Suppose we know that the outcome is some event B
- We want to quantify the likelihood that the outcome also belongs to some other event A
- We need a new probability law that gives us the conditional probability of A given B: P(A|B)
38. An Intuition
- Let's say A is "it's raining".
- Let's say P(A) in dry Florida is .01
- Let's say B is "it was sunny ten minutes ago"
- P(A|B) means "what is the probability of it raining now, given that it was sunny 10 minutes ago?"
- P(A|B) is probably way less than P(A)
- Perhaps P(A|B) is .0001
- Intuition: the knowledge about B should change our estimate of the probability of A.
39. Conditional Probability
- Let A and B be events in the sample space
- P(A|B) = the conditional probability of event A occurring given some fixed event B occurring
- Definition: P(A|B) = P(A ∩ B) / P(B)
40. Conditional Probability
- Note: P(A,B) = P(A|B) · P(B)
- Also: P(A,B) = P(B,A)
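
A small numeric check of the definition, using the earlier fair-die events (an illustration, not from the slides): A = "divisible by 3", B = "even".

    S = {1, 2, 3, 4, 5, 6}
    A = {x for x in S if x % 3 == 0}   # {3, 6}
    B = {x for x in S if x % 2 == 0}   # {2, 4, 6}
    p = lambda E: len(E) / len(S)      # uniform probability law
    print(p(A & B) / p(B))             # P(A|B) = (1/6) / (1/2) = 1/3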
41. Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) · P(B) iff A, B independent.
- P(heads, tails) = P(heads) · P(tails) = .5 × .5 = .25
- Note: P(A|B) = P(A) iff A, B independent
- Also: P(B|A) = P(B) iff A, B independent
42. Bayes' Theorem
- Idea: the probability of A conditional on another event B is generally different from the probability of B conditional on A. But there is a definite relationship between the two.
43. Deriving Bayes' Rule
The probability of event A given event B is: P(A|B) = P(A ∩ B) / P(B)
44. Deriving Bayes' Rule
The probability of event B given event A is: P(B|A) = P(A ∩ B) / P(A)
45. Deriving Bayes' Rule
Rearranging both definitions: P(A ∩ B) = P(A|B) P(B) and P(A ∩ B) = P(B|A) P(A)
46. Deriving Bayes' Rule
Equating the two expressions for P(A ∩ B): P(A|B) P(B) = P(B|A) P(A)
47. Deriving Bayes' Rule
P(A|B) = P(B|A) P(A) / P(B)
The theorem may be paraphrased as: conditional/posterior probability = (LIKELIHOOD × PRIOR) / NORMALIZING CONSTANT
48. Hidden Markov Model (HMM) Tagging
- Using an HMM to do POS tagging
- HMM is a special case of Bayesian inference
- It is also related to the noisy channel model in ASR (Automatic Speech Recognition)
49. POS Tagging as a Sequence Classification Task
- We are given a sentence (an "observation" or "sequence of observations")
  - Secretariat is expected to race tomorrow
  - a sequence of n words w1...wn
- What is the best sequence of tags which corresponds to this sequence of observations?
- Probabilistic/Bayesian view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1...wn.
50. Getting to HMM
- Let T = t1, t2, ..., tn
- Let W = w1, w2, ..., wn
- Goal: out of all sequences of tags t1...tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, ..., wn:
  T̂ = argmax_T P(T|W)
- The hat ^ means "our estimate of the best", i.e., the most probable tag sequence
- argmax_x f(x) means "the x such that f(x) is maximized"; it maximizes our estimate of the best tag sequence
51. Getting to HMM
- This equation is guaranteed to give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian classification:
  - Use Bayes' rule to transform it into a set of other probabilities that are easier to compute
  - Thomas Bayes: British mathematician (1702-1761)
52. Bayes' Rule
Bayes' rule breaks down any conditional probability P(x|y) into three other probabilities:
P(x|y) = P(y|x) P(x) / P(y)
P(x|y): the conditional probability of an event x assuming that y has occurred
53. Bayes' Rule
Applied to tagging: T̂ = argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W)
We can drop the denominator: it does not change for each tag sequence, since we are looking for the best tag sequence for the same observation, i.e., for the same fixed set of words.
54. Bayes' Rule
With the denominator dropped: T̂ = argmax_T P(W|T) P(T)
55. Likelihood and Prior
P(W|T) is the likelihood; P(T) is the prior.
56. Likelihood and Prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
   P(W|T) ≈ ∏_{i=1..n} P(wi|ti)
2. Bigram assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ ∏_{i=1..n} P(ti|ti-1)
3. The most probable tag sequence estimated by the bigram tagger:
   T̂ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
57. Likelihood and Prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
   P(W|T) ≈ ∏_{i=1..n} P(wi|ti)
58. Likelihood and Prior: Further Simplifications
2. Bigram assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ ∏_{i=1..n} P(ti|ti-1)
Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams. Bigrams are used as the basis for simple statistical analysis of text. The bigram assumption is related to the first-order Markov assumption.
59. Likelihood and Prior: Further Simplifications
3. The most probable tag sequence estimated by the bigram tagger (using the bigram assumption):
   T̂ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
(A brute-force illustration of this argmax follows below.)
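
To make the equation concrete, here is a brute-force sketch of the argmax (exponential in sentence length, so for tiny examples only; real taggers use the Viterbi algorithm instead). The probability-table format is an assumed convention, not from the slides:

    from itertools import product

    def best_tag_sequence(words, tagset, p_word_given_tag, p_tag_given_prev):
        # score a candidate tag sequence by the bigram decomposition
        def score(tags):
            p, prev = 1.0, "<s>"
            for w, t in zip(words, tags):
                p *= p_word_given_tag.get((w, t), 0.0) * p_tag_given_prev.get((t, prev), 0.0)
                prev = t
            return p
        return max(product(tagset, repeat=len(words)), key=score)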
60. Two Kinds of Probabilities (1)
- Tag transition probabilities: P(ti|ti-1)
  - Determiners likely to precede adjectives and nouns
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
  - But P(DT|JJ) to be low
61. Two Kinds of Probabilities (1)
- Tag transition probabilities: P(ti|ti-1)
- Compute P(NN|DT) by counting in a labeled corpus:
  P(NN|DT) = C(DT, NN) / C(DT), the # of times DT is followed by NN over the # of times DT occurs
62. Two Kinds of Probabilities (2)
- Word likelihood probabilities: P(wi|ti)
  - P(is|VBZ) = the probability of VBZ (3sg pres verb) being "is"
- Compute P(is|VBZ) by counting in a labeled corpus:
  P(is|VBZ) = C(is, VBZ) / C(VBZ)
- If we were expecting a third person singular verb, how likely is it that this verb would be "is"? (A sketch of both estimates follows below.)
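
A minimal sketch of both maximum-likelihood estimates from a hand-tagged corpus (the corpus format, a list of sentences of (word, tag) pairs, is an assumed convention):

    from collections import Counter

    def estimate(tagged_sents):
        bigram, prev_count, tag_count, emit = Counter(), Counter(), Counter(), Counter()
        for sent in tagged_sents:
            prev = "<s>"
            for word, tag in sent:
                bigram[(prev, tag)] += 1   # C(t_{i-1}, t_i)
                prev_count[prev] += 1      # C(t_{i-1})
                tag_count[tag] += 1        # C(t_i)
                emit[(word, tag)] += 1     # C(w_i, t_i)
                prev = tag
        p_trans = lambda t, prev: bigram[(prev, t)] / prev_count[prev]  # P(t_i | t_{i-1})
        p_emit = lambda w, t: emit[(w, t)] / tag_count[t]               # P(w_i | t_i)
        return p_trans, p_emit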
63. An Example: the Verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?
64. Disambiguating "race"
65. Disambiguating "race"
- P(NN|TO) = .00047
- P(VB|TO) = .83
  - The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: how likely are we to expect a verb/noun given the previous tag TO?
- P(race|NN) = .00057
- P(race|VB) = .00012
  - Lexical likelihoods from the Brown corpus for "race", given the POS tag NN or VB.
- P(NR|VB) = .0027
- P(NR|NN) = .0012
  - Tag sequence probabilities for the likelihood of an adverb occurring given the previous tag verb or noun
- P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
- P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
- Multiply the lexical likelihoods by the tag sequence probabilities: the verb wins (a check of the arithmetic follows below)
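
Reproducing the slide's arithmetic in Python:

    p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
    p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
    print(p_vb, p_nn, p_vb > p_nn)      # ~2.7e-07 vs ~3.2e-10: the verb reading wins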
66. Hidden Markov Models
- What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM)
- Let's just spend a bit of time tying this into the model
- In order to define an HMM, we will first introduce the Markov chain, or "observable Markov model".
67. Definitions
- A weighted finite-state automaton adds probabilities to the arcs
  - The probabilities on the arcs leaving any state must sum to one
- A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can't represent inherently ambiguous problems
  - Useful for assigning probabilities to unambiguous sequences
68. Markov Chain: First-Order Observable Markov Model
- A set of states
  - Q = q1, q2, ..., qN; the state at time t is qt
- A set of transition probabilities
  - a set of probabilities A = a01, a02, ..., an1, ..., ann
  - each aij represents the probability of transitioning from state i to state j
  - the set of these is the transition probability matrix A
- Distinguished start and end states
- Special initial probability vector π
  - πi: the probability that the MM will start in state i; each πi expresses the probability p(qi|START)
69. Markov Chain: First-Order Observable Markov Model
- Markov chain for weather: Example 1
- three types of weather: sunny, rainy, foggy
- we want to find the following conditional probabilities:
  P(qn | qn-1, qn-2, ..., q1)
- i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days
- We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
- Problem: the larger n is, the more observations we must collect
  - Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories
70. Markov Chain: First-Order Observable Markov Model
- Therefore, we make a simplifying assumption, called the (first-order) Markov assumption:
  for a sequence of observations q1, ..., qn, the current state only depends on the previous state:
  P(qn | qn-1, ..., q1) ≈ P(qn | qn-1)
- the joint probability of certain past and current observations:
  P(q1, ..., qn) = ∏_{i=1..n} P(qi | qi-1)
71. Markov Chain: First-Order Observable Markov Model
72. Markov Chain: First-Order Observable Markov Model
- Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?
- Using the Markov assumption and the probabilities in table 1, this translates into:
  P(q2=sunny, q3=rainy | q1=sunny) = P(q3=rainy | q2=sunny) × P(q2=sunny | q1=sunny)
73. The Weather Figure: Specific Example
- Markov chain for weather: Example 2
74. Markov Chain for Weather
- What is the probability of 4 consecutive rainy days?
- The sequence is rainy-rainy-rainy-rainy
- i.e., the state sequence is 3-3-3-3
- P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
(A quick check of this follows below.)
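
A check in Python (0.2 and 0.6 come from the slide; the rest of the transition row is an assumption, chosen so it sums to 1):

    pi = {"sunny": 0.5, "foggy": 0.3, "rainy": 0.2}       # assumed initial vector
    a_rainy = {"rainy": 0.6, "sunny": 0.2, "foggy": 0.2}  # assumed row for "rainy"
    print(pi["rainy"] * a_rainy["rainy"] ** 3)            # 0.2 * 0.6**3 = 0.0432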
75. Hidden Markov Model
- For Markov chains, the output symbols are the same as the states
  - See sunny weather: we're in state sunny
- But in part-of-speech tagging (and other things):
  - the output symbols are words
  - but the hidden states are part-of-speech tags
- So we need an extension!
- A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
- This means we don't know which state we are in.
76. Markov Chain for Weather
77. Markov Chain for Words
Observed events: words. Hidden events: tags.
78. Hidden Markov Models
- States: Q = q1, q2, ..., qN
- Observations: O = o1, o2, ..., oN
  - each observation is a symbol from a vocabulary V = v1, v2, ..., vV
- Transition probabilities (prior)
  - transition probability matrix A = {aij}
- Observation likelihoods (likelihood)
  - output probability matrix B = {bi(ot)}
  - a set of observation likelihoods ("emission probabilities"), each expressing the probability of an observation ot being generated from a state i
- Special initial probability vector π
  - πi: the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)
(A sketch of these components as data structures follows below.)
79. Assumptions
- Markov assumption: the probability of a particular state depends only on the previous state
- Output-independence assumption: the probability of an output observation depends only on the state that produced that observation
80. HMM for Ice Cream
- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Boston, MA for the summer of 2007
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was
81. Noam task
- Given:
  - the ice-cream observation sequence 1,2,3,2,2,2,3 (cf. output symbols)
- Produce:
  - the weather sequence C,C,H,C,C,C,H (cf. hidden, causing states)
82. HMM for Ice Cream
83. Different Types of HMM Structure
- Ergodic: fully connected
- Bakis: left-to-right
84. HMM Taggers
- Two kinds of probabilities:
  - A: transition probabilities (PRIOR) (slide 36)
  - B: observation likelihoods (LIKELIHOOD) (slide 36)
- HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability (a Viterbi sketch follows below)
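
Computing that maximization efficiently is exactly what the Viterbi algorithm (previewed on slide 88) does. A self-contained sketch, using the ice-cream scenario with invented toy probabilities (assumptions, not trained values):

    def viterbi(obs, states, pi, A, B):
        # v[t][s]: probability of the best path that ends in state s at time t
        v = [{s: pi[s] * B[s].get(obs[0], 0.0) for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            v.append({})
            back.append({})
            for s in states:
                prev = max(states, key=lambda r: v[t - 1][r] * A[r][s])
                v[t][s] = v[t - 1][prev] * A[prev][s] * B[s].get(obs[t], 0.0)
                back[t][s] = prev
        last = max(states, key=lambda s: v[-1][s])   # best final state
        path = [last]
        for t in range(len(obs) - 1, 0, -1):         # follow the backpointers
            path.append(back[t][path[-1]])
        return list(reversed(path))

    states = ["H", "C"]                              # hot / cold days
    pi = {"H": 0.8, "C": 0.2}
    A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
    B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
    print(viterbi([3, 1, 3], states, pi, A, B))      # -> ['H', 'H', 'H']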
85. Weighted FSM Corresponding to the Hidden States of the HMM, Showing the A Probabilities
86. B Observation Likelihoods for the POS HMM
87. HMM Taggers
- The probabilities are trained on hand-labeled training corpora (training set)
- Combine different N-gram levels
- Evaluated by comparing their output on a test set to human labels for that test set (the Gold Standard)
88. Next Time
- Minimum edit distance
  - a dynamic programming algorithm
  - a probabilistic version of this, called Viterbi, is a key part of the Hidden Markov Model!
89. Evaluation
90. Error Analysis
91. Tag Indeterminacy
92. Unknown Words