Title: CS60057 Speech and Natural Language Processing
1. CS60057: Speech and Natural Language Processing
Lecture 8, 9 August 2007
2. Why Do We Care about Parts of Speech?
- Pronunciation
  - Hand me the lead pipe.
- Predicting what words can be expected next
  - Personal pronoun (e.g., I, she) ____________
- Stemming
  - -s means singular for verbs, plural for nouns
- As the basis for syntactic parsing and then meaning extraction
  - I will lead the group into the lead smelter.
- Machine translation
  - (E) content N → (F) contenu N
  - (E) content Adj → (F) content Adj or satisfait Adj
3. LIN6932: Topics in Computational Linguistics
- Hana Filip
- Lecture 4
- Part of Speech Tagging (II) - Introduction to Probability
- February 1, 2007
4. What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things. Maybe Adjective is the class of words for properties of nouns. Consider "green book": book is a Noun and green is an Adjective. Now consider "book worm" and "This green is very soothing." Here book modifies worm the way an adjective would, and green behaves like a noun, so the distinction cannot be purely semantic.
5. How Many Parts of Speech Are There?
- A first cut at the easy distinctions:
- Open classes
  - nouns, verbs, adjectives, adverbs
- Closed classes (function words)
  - conjunctions: and, or, but
  - pronouns: I, she, him
  - prepositions: with, on
  - determiners: the, a, an
6. Part of Speech Tagging
- 8 (ish) traditional parts of speech
  - Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
- This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
- Called parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS
- We'll use "POS" most frequently
- I'll assume that you all know what these are
7. POS Examples
- N noun: chair, bandwidth, pacing
- V verb: study, debate, munch
- ADJ adjective: purple, tall, ridiculous
- ADV adverb: unfortunately, slowly
- P preposition: of, by, to
- PRO pronoun: I, me, mine
- DET determiner: the, a, that, those
8. Tagsets
- Brown corpus tagset (87 tags): http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
- Penn Treebank tagset (45 tags): http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
- C7 tagset (146 tags): http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
9. POS Tagging: Definition
- The process of assigning a part-of-speech or lexical class marker to each word in a corpus
10. POS Tagging: Example
- WORD tag
- the DET
- koala N
- put V
- the DET
- keys N
- on P
- the DET
- table N
11. POS Tagging: Choosing a Tagset
- There are many parts of speech and potential distinctions we can draw
- To do POS tagging, we need to choose a standard set of tags to work with
- Could pick a very coarse tagset
  - N, V, Adj, Adv
- A more commonly used set is finer grained: the UPenn TreeBank tagset (45 tags)
  - PRP, WRB, WP, VBG
- Even more fine-grained tagsets exist
12. Penn TreeBank POS Tagset
13. Using the UPenn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Prepositions and subordinating conjunctions are marked IN (although/IN I/PRP ...)
- Except the preposition/complementizer "to", which is just marked TO.
14. POS Tagging
- Words often have more than one POS: back
  - The back door: JJ
  - On my back: NN
  - Win the voters back: RB
  - Promised to back the bill: VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word.
(These examples are from Dekang Lin.)
15. How Hard Is POS Tagging? Measuring Ambiguity
16. Algorithms for POS Tagging
- Ambiguity: In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags). Worse, 40% of the tokens are ambiguous.
17. Algorithms for POS Tagging
- Why can't we just look them up in a dictionary?
  - Words that aren't in the dictionary: http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
- One idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti.
  - Nouns are more likely than verbs, which are more likely than pronouns.
- Another idea: use morphology. (A sketch of the dictionary-plus-hapax idea follows below.)
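
A minimal sketch (not from the slides) of this idea in Python: give each known word its most frequent tag, and fall back, for unknown words, on the majority tag among hapax legomena. The corpus format, a flat list of (word, tag) pairs, is an assumed convention.

    from collections import Counter, defaultdict

    def train(tagged_corpus):
        # tag counts per word
        word_tags = defaultdict(Counter)
        for word, tag in tagged_corpus:
            word_tags[word][tag] += 1
        # tags of words seen exactly once approximate the unknown-word distribution
        hapax_tags = Counter(next(iter(c)) for c in word_tags.values()
                             if sum(c.values()) == 1)
        best = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
        default = hapax_tags.most_common(1)[0][0] if hapax_tags else "NN"
        return best, default

    def tag(words, best, default):
        return [(w, best.get(w, default)) for w in words]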
18. Algorithms for POS Tagging: Knowledge
- Dictionary
- Morphological rules, e.g.,
  - _____-tion
  - _____-ly
  - capitalization
- N-gram frequencies
  - to _____
  - DET _____ N
- But what about rare words, e.g., smelt (two verb forms, "melt" and the past tense of "smell", and one noun form, a small fish)?
- Combining these:
  - V _____-ing: "I was gracking" vs. "Gracking is fun."
19. POS Tagging: Approaches
- Rule-based tagging
  - (ENGTWOL)
- Stochastic (probabilistic) tagging
  - HMM (Hidden Markov Model) tagging
- Transformation-based tagging
  - Brill tagger
- Do we return one best answer or several answers and let later steps decide?
- How does the requisite knowledge get entered?
20. Three Methods for POS Tagging
- 1. Rule-based tagging
- Example: the Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon
- Basic idea:
  - Assign all possible tags to words (a morphological analyzer is used)
  - Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may also be machine-learned)
21. Three Methods for POS Tagging
- 2. Transformation-based tagging
- Example: the Brill (1995) tagger, a combination of the rule-based and stochastic (probabilistic) tagging methodologies
- Basic idea:
  - Start with a tagged corpus and a dictionary (with most frequent tags)
  - Set the most probable tag for each word as a start value
  - Change tags according to rules of the type "if word-1 is a determiner and word is a verb, then change the tag to noun", applied in a specific order (like rule-based taggers)
  - Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach); a sketch of rule application follows below
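
A minimal sketch (an illustration, not Brill's actual system) of applying transformations of the form "change tag A to B when the previous tag is C" to an initially tagged sentence:

    def apply_rules(tagged, rules):
        # rules: list of (from_tag, to_tag, prev_tag) triples, fired in order
        tags = [t for _, t in tagged]
        for from_tag, to_tag, prev_tag in rules:
            for i in range(1, len(tags)):
                if tags[i] == from_tag and tags[i - 1] == prev_tag:
                    tags[i] = to_tag
        return list(zip([w for w, _ in tagged], tags))

    # the rule from the slide: a verb after a determiner becomes a noun
    print(apply_rules([("the", "DT"), ("race", "VB")], [("VB", "NN", "DT")]))
    # -> [('the', 'DT'), ('race', 'NN')]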
22. Three Methods for POS Tagging
- 3. Stochastic (probabilistic) tagging
- Example: HMM (Hidden Markov Model) tagging: a training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context
23. Topics
- Probability
- Conditional probability
- Independence
- Bayes' rule
- HMM tagging
- Markov chains
- Hidden Markov Models
24. Introduction to Probability
- Experiment (trial)
  - a repeatable procedure with well-defined possible outcomes
- Sample space (S)
  - the set of all possible outcomes
  - finite or infinite
- Example: coin toss experiment
  - possible outcomes: S = {heads, tails}
- Example: die toss experiment
  - possible outcomes: S = {1, 2, 3, 4, 5, 6}
25. Introduction to Probability
- The definition of the sample space depends on what we are asking
- Sample space (S): the set of all possible outcomes
- Example: die toss experiment for whether the number is even or odd
  - possible outcomes: {even, odd}
  - not {1, 2, 3, 4, 5, 6}
26. More Definitions
- Events
  - an event is any subset of outcomes from the sample space
- Example: die toss experiment
  - let A represent the event that the outcome of the die toss is divisible by 3
  - A = {3, 6}
  - A is a subset of the sample space S = {1, 2, 3, 4, 5, 6}
27. Introduction to Probability
- Some definitions
- Events
  - an event is a subset of the sample space
  - simple and compound events
- Example: deck-of-cards draw experiment
  - suppose the sample space is S = {heart, spade, club, diamond} (the four suits)
  - let A represent the event of drawing a heart
  - let B represent the event of drawing a red card
  - A = {heart} (simple event)
  - B = {heart} ∪ {diamond} = {heart, diamond} (compound event)
  - a compound event can be expressed as a set union of simple events
- Example: alternative sample space S = the set of 52 cards
  - A and B would both be compound events
28. Introduction to Probability
- Some definitions
- Counting
  - suppose an operation oi can be performed in ni ways
  - a set of k operations o1, o2, ..., ok can be performed in n1 × n2 × ... × nk ways
- Example: dice toss experiment, 6 possible outcomes per die
  - two dice are thrown at the same time
  - number of sample points in the sample space = 6 × 6 = 36
29. Definition of Probability
- The probability law assigns to an event a nonnegative number
  - called P(A)
  - also called the probability of A
  - that encodes our knowledge or belief about the collective likelihood of all the elements of A
- The probability law must satisfy certain properties
30. Probability Axioms
- Nonnegativity
  - P(A) ≥ 0, for every event A
- Additivity
  - If A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B)
- Normalization
  - The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
31. An Example
- An experiment involving a single coin toss
- There are two possible outcomes, H and T
- Sample space S is {H, T}
- If the coin is fair, we should assign equal probabilities to the 2 outcomes
- Since they have to sum to 1:
  - P(H) = 0.5
  - P(T) = 0.5
  - P({H, T}) = P(H) + P(T) = 1.0
32. Another Example
- An experiment involving 3 coin tosses
- The outcome is a 3-long string of H or T
- S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Assume each outcome is equiprobable (uniform distribution)
- What is the probability of the event that exactly 2 heads occur?
  - A = {HHT, HTH, THH} (3 of the 8 equally likely outcomes)
  - P(A) = P(HHT) + P(HTH) + P(THH) (additivity over the disjoint individual outcomes)
  - = 1/8 + 1/8 + 1/8
  - = 3/8
33. Probability Definitions
- In summary: the probability of drawing a spade from 52 well-shuffled playing cards is 13/52 = 1/4 = 0.25
34. Moving Toward Language
- What's the probability of drawing a 2 from a deck of 52 cards with four 2s?
- What's the probability of a random word (from a random dictionary page) being a verb?
35. Probability and Part-of-Speech Tags
- What's the probability of a random word (from a random dictionary page) being a verb?
- How to compute each of these:
  - All words: just count all the words in the dictionary
  - # of ways to get a verb: the number of words which are verbs!
  - If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20
36. Conditional Probability
- A way to reason about the outcome of an experiment based on partial information
  - In a word-guessing game, the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
  - How likely is it that a person has a disease given that a medical test was negative?
  - A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
37. More Precisely
- Given an experiment, a corresponding sample space S, and a probability law
- Suppose we know that the outcome is some event B
- We want to quantify the likelihood that the outcome also belongs to some other event A
- We need a new probability law that gives us the conditional probability of A given B: P(A|B)
38. An Intuition
- Let's say A is "it's raining".
- Let's say P(A) in dry Florida is .01
- Let's say B is "it was sunny ten minutes ago"
- P(A|B) means "what is the probability of it raining now, given that it was sunny 10 minutes ago?"
- P(A|B) is probably way less than P(A)
- Perhaps P(A|B) is .0001
- Intuition: the knowledge about B should change our estimate of the probability of A.
39. Conditional Probability
- Let A and B be events in the sample space
- P(A|B) = the conditional probability of event A occurring given some fixed event B occurring
- Definition: P(A|B) = P(A ∩ B) / P(B)
40. Conditional Probability
- Note: P(A,B) = P(A|B) · P(B)
- Also: P(A,B) = P(B,A)
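
A small numeric check of the definition, using the earlier fair-die events (an illustration, not from the slides): A = "divisible by 3", B = "even".

    S = {1, 2, 3, 4, 5, 6}
    A = {x for x in S if x % 3 == 0}   # {3, 6}
    B = {x for x in S if x % 2 == 0}   # {2, 4, 6}
    p = lambda E: len(E) / len(S)      # uniform probability law
    print(p(A & B) / p(B))             # P(A|B) = (1/6) / (1/2) = 1/3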
41. Independence
- What is P(A,B) if A and B are independent?
- P(A,B) = P(A) · P(B) iff A, B independent.
- P(heads, tails) = P(heads) · P(tails) = .5 × .5 = .25
- Note: P(A|B) = P(A) iff A, B independent
- Also: P(B|A) = P(B) iff A, B independent
42. Bayes' Theorem
- Idea: the probability of A conditional on another event B is generally different from the probability of B conditional on A. But there is a definite relationship between the two.
43. Deriving Bayes' Rule
The probability of event A given event B is: P(A|B) = P(A ∩ B) / P(B)
44. Deriving Bayes' Rule
The probability of event B given event A is: P(B|A) = P(A ∩ B) / P(A)
45. Deriving Bayes' Rule
Rearranging both definitions: P(A ∩ B) = P(A|B) P(B) and P(A ∩ B) = P(B|A) P(A)
46. Deriving Bayes' Rule
Equating the two expressions for P(A ∩ B): P(A|B) P(B) = P(B|A) P(A)
47. Deriving Bayes' Rule
P(A|B) = P(B|A) P(A) / P(B)
The theorem may be paraphrased as: conditional/posterior probability = (LIKELIHOOD × PRIOR) / NORMALIZING CONSTANT
48. Hidden Markov Model (HMM) Tagging
- Using an HMM to do POS tagging
- HMM is a special case of Bayesian inference
- It is also related to the noisy channel model in ASR (Automatic Speech Recognition)
49. POS Tagging as a Sequence Classification Task
- We are given a sentence (an "observation" or "sequence of observations")
  - Secretariat is expected to race tomorrow
  - a sequence of n words w1...wn
- What is the best sequence of tags which corresponds to this sequence of observations?
- Probabilistic/Bayesian view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1...wn.
50. Getting to HMM
- Let T = t1, t2, ..., tn
- Let W = w1, w2, ..., wn
- Goal: out of all sequences of tags t1...tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1, w2, ..., wn:
  T̂ = argmax_T P(T|W)
- The hat ^ means "our estimate of the best", i.e., the most probable tag sequence
- argmax_x f(x) means "the x such that f(x) is maximized"; it maximizes our estimate of the best tag sequence
51. Getting to HMM
- This equation is guaranteed to give us the best tag sequence
- But how do we make it operational? How do we compute this value?
- Intuition of Bayesian classification:
  - Use Bayes' rule to transform it into a set of other probabilities that are easier to compute
  - Thomas Bayes: British mathematician (1702-1761)
52. Bayes' Rule
Bayes' rule breaks down any conditional probability P(x|y) into three other probabilities:
P(x|y) = P(y|x) P(x) / P(y)
P(x|y): the conditional probability of an event x assuming that y has occurred
53. Bayes' Rule
Applied to tagging: T̂ = argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W)
We can drop the denominator: it does not change for each tag sequence, since we are looking for the best tag sequence for the same observation, i.e., for the same fixed set of words.
54. Bayes' Rule
With the denominator dropped: T̂ = argmax_T P(W|T) P(T)
55. Likelihood and Prior
P(W|T) is the likelihood; P(T) is the prior.
56. Likelihood and Prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
   P(W|T) ≈ ∏_{i=1..n} P(wi|ti)
2. Bigram assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ ∏_{i=1..n} P(ti|ti-1)
3. The most probable tag sequence estimated by the bigram tagger:
   T̂ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
57. Likelihood and Prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
   P(W|T) ≈ ∏_{i=1..n} P(wi|ti)
58. Likelihood and Prior: Further Simplifications
2. Bigram assumption: the probability of a tag appearing depends only on the previous tag:
   P(T) ≈ ∏_{i=1..n} P(ti|ti-1)
Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams. Bigrams are used as the basis for simple statistical analysis of text. The bigram assumption is related to the first-order Markov assumption.
59. Likelihood and Prior: Further Simplifications
3. The most probable tag sequence estimated by the bigram tagger (using the bigram assumption):
   T̂ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
(A brute-force illustration of this argmax follows below.)
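
To make the equation concrete, here is a brute-force sketch of the argmax (exponential in sentence length, so for tiny examples only; real taggers use the Viterbi algorithm instead). The probability-table format is an assumed convention, not from the slides:

    from itertools import product

    def best_tag_sequence(words, tagset, p_word_given_tag, p_tag_given_prev):
        # score a candidate tag sequence by the bigram decomposition
        def score(tags):
            p, prev = 1.0, "<s>"
            for w, t in zip(words, tags):
                p *= p_word_given_tag.get((w, t), 0.0) * p_tag_given_prev.get((t, prev), 0.0)
                prev = t
            return p
        return max(product(tagset, repeat=len(words)), key=score)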
60. Two Kinds of Probabilities (1)
- Tag transition probabilities: P(ti|ti-1)
  - Determiners likely to precede adjectives and nouns
    - That/DT flight/NN
    - The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
  - But P(DT|JJ) to be low
61. Two Kinds of Probabilities (1)
- Tag transition probabilities: P(ti|ti-1)
- Compute P(NN|DT) by counting in a labeled corpus:
  P(NN|DT) = C(DT, NN) / C(DT), the # of times DT is followed by NN over the # of times DT occurs
62. Two Kinds of Probabilities (2)
- Word likelihood probabilities: P(wi|ti)
  - P(is|VBZ) = the probability of VBZ (3sg pres verb) being "is"
- Compute P(is|VBZ) by counting in a labeled corpus:
  P(is|VBZ) = C(is, VBZ) / C(VBZ)
- If we were expecting a third person singular verb, how likely is it that this verb would be "is"? (A sketch of both estimates follows below.)
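
A minimal sketch of both maximum-likelihood estimates from a hand-tagged corpus (the corpus format, a list of sentences of (word, tag) pairs, is an assumed convention):

    from collections import Counter

    def estimate(tagged_sents):
        bigram, prev_count, tag_count, emit = Counter(), Counter(), Counter(), Counter()
        for sent in tagged_sents:
            prev = "<s>"
            for word, tag in sent:
                bigram[(prev, tag)] += 1   # C(t_{i-1}, t_i)
                prev_count[prev] += 1      # C(t_{i-1})
                tag_count[tag] += 1        # C(t_i)
                emit[(word, tag)] += 1     # C(w_i, t_i)
                prev = tag
        p_trans = lambda t, prev: bigram[(prev, t)] / prev_count[prev]  # P(t_i | t_{i-1})
        p_emit = lambda w, t: emit[(w, t)] / tag_count[t]               # P(w_i | t_i)
        return p_trans, p_emit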
63. An Example: the Verb "race"
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?
64. Disambiguating "race"
65. Disambiguating "race"
- P(NN|TO) = .00047
- P(VB|TO) = .83
  - The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: how likely are we to expect a verb/noun given the previous tag TO?
- P(race|NN) = .00057
- P(race|VB) = .00012
  - Lexical likelihoods from the Brown corpus for "race", given the POS tag NN or VB.
- P(NR|VB) = .0027
- P(NR|NN) = .0012
  - Tag sequence probabilities for the likelihood of an adverb occurring given the previous tag verb or noun
- P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
- P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
- Multiply the lexical likelihoods by the tag sequence probabilities: the verb wins (a check of the arithmetic follows below)
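
Reproducing the slide's arithmetic in Python:

    p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
    p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
    print(p_vb, p_nn, p_vb > p_nn)      # ~2.7e-07 vs ~3.2e-10: the verb reading wins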
66. Hidden Markov Models
- What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM)
- Let's just spend a bit of time tying this into the model
- In order to define an HMM, we will first introduce the Markov chain, or "observable Markov model".
67. Definitions
- A weighted finite-state automaton adds probabilities to the arcs
  - The probabilities on the arcs leaving any state must sum to one
- A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
- Markov chains can't represent inherently ambiguous problems
  - Useful for assigning probabilities to unambiguous sequences
68. Markov Chain: First-Order Observable Markov Model
- A set of states
  - Q = q1, q2, ..., qN; the state at time t is qt
- A set of transition probabilities
  - a set of probabilities A = a01, a02, ..., an1, ..., ann
  - each aij represents the probability of transitioning from state i to state j
  - the set of these is the transition probability matrix A
- Distinguished start and end states
- Special initial probability vector π
  - πi: the probability that the MM will start in state i; each πi expresses the probability p(qi|START)
69. Markov Chain: First-Order Observable Markov Model
- Markov chain for weather: Example 1
- three types of weather: sunny, rainy, foggy
- we want to find the following conditional probabilities:
  P(qn | qn-1, qn-2, ..., q1)
- i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days
- We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
- Problem: the larger n is, the more observations we must collect
  - Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories
70. Markov Chain: First-Order Observable Markov Model
- Therefore, we make a simplifying assumption, called the (first-order) Markov assumption:
  for a sequence of observations q1, ..., qn, the current state only depends on the previous state:
  P(qn | qn-1, ..., q1) ≈ P(qn | qn-1)
- the joint probability of certain past and current observations:
  P(q1, ..., qn) = ∏_{i=1..n} P(qi | qi-1)
71. Markov Chain: First-Order Observable Markov Model
72. Markov Chain: First-Order Observable Markov Model
- Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?
- Using the Markov assumption and the probabilities in table 1, this translates into:
  P(q2=sunny, q3=rainy | q1=sunny) = P(q3=rainy | q2=sunny) × P(q2=sunny | q1=sunny)
73. The Weather Figure: Specific Example
- Markov chain for weather: Example 2
74. Markov Chain for Weather
- What is the probability of 4 consecutive rainy days?
- The sequence is rainy-rainy-rainy-rainy
- i.e., the state sequence is 3-3-3-3
- P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
(A quick check of this follows below.)
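
A check in Python (0.2 and 0.6 come from the slide; the rest of the transition row is an assumption, chosen so it sums to 1):

    pi = {"sunny": 0.5, "foggy": 0.3, "rainy": 0.2}       # assumed initial vector
    a_rainy = {"rainy": 0.6, "sunny": 0.2, "foggy": 0.2}  # assumed row for "rainy"
    print(pi["rainy"] * a_rainy["rainy"] ** 3)            # 0.2 * 0.6**3 = 0.0432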
75. Hidden Markov Model
- For Markov chains, the output symbols are the same as the states
  - See sunny weather: we're in state sunny
- But in part-of-speech tagging (and other things):
  - the output symbols are words
  - but the hidden states are part-of-speech tags
- So we need an extension!
- A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
- This means we don't know which state we are in.
76. Markov Chain for Weather
77. Markov Chain for Words
Observed events: words. Hidden events: tags.
78. Hidden Markov Models
- States: Q = q1, q2, ..., qN
- Observations: O = o1, o2, ..., oN
  - each observation is a symbol from a vocabulary V = v1, v2, ..., vV
- Transition probabilities (prior)
  - transition probability matrix A = {aij}
- Observation likelihoods (likelihood)
  - output probability matrix B = {bi(ot)}
  - a set of observation likelihoods ("emission probabilities"), each expressing the probability of an observation ot being generated from a state i
- Special initial probability vector π
  - πi: the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)
(A sketch of these components as data structures follows below.)
79. Assumptions
- Markov assumption: the probability of a particular state depends only on the previous state
- Output-independence assumption: the probability of an output observation depends only on the state that produced that observation
80. HMM for Ice Cream
- You are a climatologist in the year 2799, studying global warming
- You can't find any records of the weather in Boston, MA for the summer of 2007
- But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was
81. Noam task
- Given:
  - the ice-cream observation sequence 1,2,3,2,2,2,3 (cf. output symbols)
- Produce:
  - the weather sequence C,C,H,C,C,C,H (cf. hidden, causing states)
82. HMM for Ice Cream
83. Different Types of HMM Structure
- Ergodic: fully connected
- Bakis: left-to-right
84. HMM Taggers
- Two kinds of probabilities:
  - A: transition probabilities (PRIOR) (slide 36)
  - B: observation likelihoods (LIKELIHOOD) (slide 36)
- HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability (a Viterbi sketch follows below)
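
Computing that maximization efficiently is exactly what the Viterbi algorithm (previewed on slide 88) does. A self-contained sketch, using the ice-cream scenario with invented toy probabilities (assumptions, not trained values):

    def viterbi(obs, states, pi, A, B):
        # v[t][s]: probability of the best path that ends in state s at time t
        v = [{s: pi[s] * B[s].get(obs[0], 0.0) for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            v.append({})
            back.append({})
            for s in states:
                prev = max(states, key=lambda r: v[t - 1][r] * A[r][s])
                v[t][s] = v[t - 1][prev] * A[prev][s] * B[s].get(obs[t], 0.0)
                back[t][s] = prev
        last = max(states, key=lambda s: v[-1][s])   # best final state
        path = [last]
        for t in range(len(obs) - 1, 0, -1):         # follow the backpointers
            path.append(back[t][path[-1]])
        return list(reversed(path))

    states = ["H", "C"]                              # hot / cold days
    pi = {"H": 0.8, "C": 0.2}
    A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
    B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
    print(viterbi([3, 1, 3], states, pi, A, B))      # -> ['H', 'H', 'H']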
85. Weighted FSM Corresponding to the Hidden States of the HMM, Showing the A Probabilities
86. B Observation Likelihoods for the POS HMM
87. HMM Taggers
- The probabilities are trained on hand-labeled training corpora (training set)
- Combine different N-gram levels
- Evaluated by comparing their output on a test set to human labels for that test set (the Gold Standard)
88. Next Time
- Minimum edit distance
  - a dynamic programming algorithm
  - a probabilistic version of this, called Viterbi, is a key part of the Hidden Markov Model!
89. Evaluation
90. Error Analysis
91. Tag Indeterminacy
92. Unknown Words