CS60057 Speech - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
CS60057 Speech and Natural Language Processing
  • Autumn 2007

Lecture 8, 9 August 2007
2
Why Do We Care about Parts of Speech?
  • Pronunciation
  • Hand me the lead pipe.
  • Predicting what words can be expected next
  • Personal pronoun (e.g., I, she) ____________
  • Stemming
  • -s means singular for verbs, plural for nouns
  • As the basis for syntactic parsing and then
    meaning extraction
  • I will lead the group into the lead smelter.
  • Machine translation
  • (E) content/N → (F) contenu/N
  • (E) content/Adj → (F) content/Adj or
    satisfait/Adj

3
LIN6932 Topics in Computational Linguistics
  • Hana Filip
  • Lecture 4
  • Part of Speech Tagging (II) - Introduction to
    Probability
  • February 1, 2007

4
What is a Part of Speech?
Is this a semantic distinction? For example,
maybe Noun is the class of words for people,
places and things, and maybe Adjective is the
class of words for properties of nouns. Consider
"green book": book is a Noun, green is an
Adjective. Now consider "book worm" and "This
green is very soothing."
5
How Many Parts of Speech Are There?
  • A first cut at the easy distinctions
  • Open classes
  • nouns, verbs, adjectives, adverbs
  • Closed classes function words
  • conjunctions and, or, but
  • pronouns I, she, him
  • prepositions with, on
  • determiners the, a, an

6
Part of speech tagging
  • 8 (ish) traditional parts of speech
  • Noun, verb, adjective, preposition, adverb,
    article, interjection, pronoun, conjunction, etc
  • This idea has been around for over 2000 years
    (Dionysius Thrax of Alexandria, c. 100 B.C.)
  • Called parts-of-speech, lexical categories, word
    classes, morphological classes, lexical tags, POS
  • We'll use POS most frequently
  • I'll assume that you all know what these are

7
POS examples
  • N noun chair, bandwidth, pacing
  • V verb study, debate, munch
  • ADJ adjective purple, tall, ridiculous
  • ADV adverb unfortunately, slowly
  • P preposition of, by, to
  • PRO pronoun I, me, mine
  • DET determiner the, a, that, those

8
Tagsets
Brown corpus tagset (87 tags):
http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
Penn Treebank tagset (45 tags):
http://www.cs.colorado.edu/martin/SLP/Figures/ (8.6)
C7 tagset (146 tags):
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
9
POS Tagging Definition
  • The process of assigning a part-of-speech or
    lexical class marker to each word in a corpus

10
POS Tagging example
  • WORD tag
  • the DET
  • koala N
  • put V
  • the DET
  • keys N
  • on P
  • the DET
  • table N

11
POS tagging Choosing a tagset
  • There are many parts of speech and potential
    distinctions we can draw
  • To do POS tagging, we need to choose a standard set
    of tags to work with
  • Could pick very coarse tagsets
  • N, V, Adj, Adv.
  • More commonly used set is finer grained, the
    UPenn TreeBank tagset, 45 tags
  • PRP, WRB, WP, VBG
  • Even more fine-grained tagsets exist

12
Penn TreeBank POS Tag set
13
Using the UPenn tagset
  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
    number/NN of/IN other/JJ topics/NNS ./.
  • Prepositions and subordinating conjunctions
    marked IN (although/IN I/PRP...)
  • Except the preposition/complementizer "to", which
    is just marked TO.

14
POS Tagging
  • Words often have more than one POS: back
  • The back door: back/JJ
  • On my back: back/NN
  • Win the voters back: back/RB
  • Promised to back the bill: back/VB
  • The POS tagging problem is to determine the POS
    tag for a particular instance of a word.

These examples from Dekang Lin
15
How hard is POS tagging? Measuring ambiguity
16
Algorithms for POS Tagging
  • Ambiguity: In the Brown corpus, 11.5% of the
    word types are ambiguous (using 87 tags)

Worse, 40% of the tokens are ambiguous.
17
Algorithms for POS Tagging
  • Why can't we just look them up in a dictionary?
  • Words that aren't in the dictionary

http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
  • One idea: P(ti | wi), the probability that a
    random hapax legomenon in the corpus has tag ti.
  • Nouns are more likely than verbs, which are more
    likely than pronouns.
  • Another idea use morphology.

18
Algorithms for POS Tagging - Knowledge
  • Dictionary
  • Morphological rules, e.g.,
  • _____-tion
  • _____-ly
  • capitalization
  • N-gram frequencies
  • to _____
  • DET _____ N
  • But what about rare words, e.g., smelt (two verb
    forms, "melt" and the past tense of "smell", and one
    noun form, a small fish)
  • Combining these
  • V _____-ing: "I was gracking" vs. "Gracking
    is fun."
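These knowledge sources can be sketched as a tiny unknown-word guesser. The rules and the nonce word "gracking" follow the slide; everything else (rule order, the auxiliary set, the noun default) is an illustrative assumption, not the lecture's actual method:

```python
import re

# Illustrative morphology-based tag guesser for unknown words.
# Rules mirror the slide's hints; the specifics are assumptions.
def guess_tag(word, prev_word=None):
    if re.search(r"tion$", word):
        return "N"            # _____-tion suffix suggests a noun
    if re.search(r"ly$", word):
        return "ADV"          # _____-ly suffix suggests an adverb
    if word.endswith("ing") and prev_word in {"is", "was", "were"}:
        return "V"            # aux + _____-ing: "I was gracking"
    if word[0].isupper():
        return "N"            # capitalization hint
    return "N"                # default: nouns are most likely

print(guess_tag("gracking", prev_word="was"))  # V
print(guess_tag("Gracking"))                   # N ("Gracking is fun.")
```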

19
POS Tagging - Approaches
  • Approaches
  • Rule-based tagging
  • (ENGTWOL)
  • Stochastic (Probabilistic) tagging
  • HMM (Hidden Markov Model) tagging
  • Transformation-based tagging
  • Brill tagger
  • Do we return one best answer or several answers
    and let later steps decide?
  • How does the requisite knowledge get entered?

20
3 methods for POS tagging
  • 1. Rule-based tagging
  • Example Karlsson (1995) EngCG tagger based on
    the Constraint Grammar architecture and ENGTWOL
    lexicon
  • Basic Idea
  • Assign all possible tags to words (morphological
    analyzer used)
  • Remove wrong tags according to set of constraint
    rules (typically more than 1000 hand-written
    constraint rules, but may be machine-learned)

21
3 methods for POS tagging
  • 2. Transformation-based tagging
  • Example Brill (1995) tagger - combination of
    rule-based and stochastic (probabilistic) tagging
    methodologies
  • Basic Idea
  • Start with a tagged corpus + dictionary (with
    most frequent tags)
  • Set the most probable tag for each word as a
    start value
  • Change tags according to rules of the type "if word-1
    is a determiner and word is a verb then change
    the tag to noun", applied in a specific order (like
    rule-based taggers)
  • machine learning is used: the rules are
    automatically induced from a previously tagged
    training corpus (like the stochastic approach)

22
3 methods for POS tagging
  • 3. Stochastic (Probabilistic) tagging
  • Example: HMM (Hidden Markov Model) tagging: a
    training corpus is used to compute the probability
    (frequency) of a given word having a given POS
    tag in a given context

23
Topics
  • Probability
  • Conditional Probability
  • Independence
  • Bayes Rule
  • HMM tagging
  • Markov Chains
  • Hidden Markov Models

24
6. Introduction to Probability
  • Experiment (trial)
  • Repeatable procedure with well-defined possible
    outcomes
  • Sample Space (S)
  • the set of all possible outcomes
  • finite or infinite
  • Example
  • coin toss experiment
  • possible outcomes: S = {heads, tails}
  • Example
  • die toss experiment
  • possible outcomes: S = {1, 2, 3, 4, 5, 6}

25
Introduction to Probability
  • Definition of sample space depends on what we are
    asking
  • Sample Space (S): the set of all possible
    outcomes
  • Example
  • die toss experiment for whether the number is
    even or odd
  • possible outcomes: {even, odd}
  • not {1, 2, 3, 4, 5, 6}

26
More definitions
  • Events
  • an event is any subset of outcomes from the
    sample space
  • Example
  • die toss experiment
  • let A represent the event such that the outcome
    of the die toss experiment is divisible by 3
  • A = {3, 6}
  • A is a subset of the sample space S = {1, 2, 3, 4, 5, 6}

27
Introduction to Probability
  • Some definitions
  • Events
  • an event is a subset of sample space
  • simple and compound events
  • Example
  • deck of cards draw experiment
  • suppose sample space S = {heart, spade, club, diamond}
    (four suits)
  • let A represent the event of drawing a heart
  • let B represent the event of drawing a red card
  • A = {heart} (simple event)
  • B = {heart} ∪ {diamond} = {heart, diamond}
    (compound event)
  • a compound event can be expressed as a set union
    of simple events
  • Example
  • alternative sample space S = set of 52 cards
  • A and B would both be compound events

28
Introduction to Probability
  • Some definitions
  • Counting
  • suppose an operation oi can be performed in ni
    ways
  • then a set of k operations o1, o2, ..., ok can be
    performed in n1 × n2 × ... × nk ways
  • Example
  • dice toss experiment, 6 possible outcomes
  • two dice are thrown at the same time
  • number of sample points in sample space = 6 × 6
    = 36
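The counting rule can be checked with a short enumeration (a sketch, not part of the original slides):

```python
from itertools import product

# Counting rule: two six-sided dice thrown together
# give 6 * 6 = 36 sample points.
outcomes = list(product(range(1, 7), repeat=2))  # all (die1, die2) pairs
print(len(outcomes))  # 36
```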

29
Definition of Probability
  • The probability law assigns to an event a
    nonnegative number
  • Called P(A)
  • Also called the probability of A
  • That encodes our knowledge or belief about the
    collective likelihood of all the elements of A
  • Probability law must satisfy certain properties

30
Probability Axioms
  • Nonnegativity
  • P(A) ≥ 0, for every event A
  • Additivity
  • If A and B are two disjoint events, then the
    probability of their union satisfies
  • P(A ∪ B) = P(A) + P(B)
  • Normalization
  • The probability of the entire sample space S is
    equal to 1, i.e., P(S) = 1.

31
An example
  • An experiment involving a single coin toss
  • There are two possible outcomes, H and T
  • Sample space S is {H, T}
  • If the coin is fair, we should assign equal
    probabilities to the 2 outcomes
  • Since they have to sum to 1
  • P(H) = 0.5
  • P(T) = 0.5
  • P({H, T}) = P(H) + P(T) = 1.0

32
Another example
  • Experiment involving 3 coin tosses
  • Outcome is a 3-long string of H or T
  • S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • Assume each outcome is equiprobable
  • Uniform distribution
  • What is the probability of the event that exactly 2
    heads occur?
  • A = {HHT, HTH, THH} (3 outcomes)
  • P(A) = P(HHT) + P(HTH) + P(THH) (additivity:
    union of disjoint events)
  • = 1/8 + 1/8 + 1/8 (8 total equiprobable outcomes)
  • = 3/8
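The three-coin-toss example can be verified by enumerating the sample space (a sketch added for illustration):

```python
from itertools import product
from fractions import Fraction

# Three coin tosses with a uniform distribution over the 8 outcomes.
sample_space = list(product("HT", repeat=3))
A = [o for o in sample_space if o.count("H") == 2]  # exactly two heads
p_a = Fraction(len(A), len(sample_space))
print(p_a)  # 3/8
```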

33
Probability definitions
  • In summary
  • Probability of drawing a spade from 52
    well-shuffled playing cards: 13/52 = 1/4 = 0.25

34
Moving toward language
  • What's the probability of drawing a 2 from a deck
    of 52 cards with four 2s?
  • What's the probability of a random word (from a
    random dictionary page) being a verb?

35
Probability and part of speech tags
  • What's the probability of a random word (from a
    random dictionary page) being a verb?
  • How to compute each of these
  • All words: just count all the words in the
    dictionary
  • # of ways to get a verb: # of words which are
    verbs!
  • If a dictionary has 50,000 entries, and 10,000
    are verbs, P(V) is 10000/50000 = 1/5 = .20
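The slide's arithmetic, written out (the dictionary sizes are the slide's hypothetical numbers):

```python
# P(V) for the slide's hypothetical 50,000-entry dictionary
# in which 10,000 entries are verbs.
n_entries = 50_000
n_verbs = 10_000
p_v = n_verbs / n_entries
print(p_v)  # 0.2
```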

36
Conditional Probability
  • A way to reason about the outcome of an
    experiment based on partial information
  • In a word guessing game the first letter of the
    word is a "t". What is the likelihood that the
    second letter is an "h"?
  • How likely is it that a person has a disease
    given that a medical test was negative?
  • A spot shows up on a radar screen. How likely is
    it that it corresponds to an aircraft?

37
More precisely
  • Given an experiment, a corresponding sample space
    S, and a probability law
  • Suppose we know that the outcome is some event B
  • We want to quantify the likelihood that the
    outcome also belongs to some other event A
  • We need a new probability law that gives us the
    conditional probability of A given B
  • P(A|B)

38
An intuition
  • Let's say A is "it's raining".
  • Let's say P(A) in dry Florida is .01
  • Let's say B is "it was sunny ten minutes ago"
  • P(A|B) means "what is the probability of it
    raining now if it was sunny 10 minutes ago"
  • P(A|B) is probably way less than P(A)
  • Perhaps P(A|B) is .0001
  • Intuition: The knowledge about B should change
    our estimate of the probability of A.

39
Conditional Probability
  • let A and B be events in the sample space
  • P(A|B): the conditional probability of event A
    occurring given some fixed event B occurring
  • definition: P(A|B) = P(A ∩ B) / P(B)
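The definition can be exercised on the earlier die-toss events. A = {3, 6} (divisible by 3) comes from the earlier slide; B = even is an assumed conditioning event added for illustration:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # die-toss sample space
A = {3, 6}               # outcome divisible by 3 (from the earlier slide)
B = {2, 4, 6}            # assumed conditioning event: outcome is even

def p(event):            # uniform probability law on a fair die
    return Fraction(len(event), len(S))

p_a_given_b = p(A & B) / p(B)   # P(A|B) = P(A ∩ B) / P(B)
print(p_a_given_b)  # 1/3
```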

40
Conditional probability
  • P(A|B) = P(A ∩ B) / P(B)
  • Or equivalently, P(A ∩ B) = P(A|B) P(B)

Note: P(A,B) = P(A|B) P(B). Also P(A,B) = P(B,A).
41
Independence
  • What is P(A,B) if A and B are independent?
  • P(A,B) = P(A) P(B) iff A, B independent.
  • P(heads, tails) = P(heads) P(tails) = .5 × .5
    = .25
  • Note: P(A|B) = P(A) iff A, B independent
  • Also: P(B|A) = P(B) iff A, B independent
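The coin-flip case can be checked by enumeration (a sketch; the two-flip sample space is an assumption matching the slide's heads/tails example):

```python
from fractions import Fraction
from itertools import product

# Two independent fair coin flips; check P(A,B) = P(A) P(B).
S = list(product("HT", repeat=2))
A = [o for o in S if o[0] == "H"]                   # first flip heads
B = [o for o in S if o[1] == "T"]                   # second flip tails
AB = [o for o in S if o[0] == "H" and o[1] == "T"]  # both events occur

def p(event):
    return Fraction(len(event), len(S))

print(p(AB), p(A) * p(B))  # 1/4 1/4
```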

42
Bayes Theorem
  • Idea: The probability of an event A conditional on
    another event B is generally different from the
    probability of B conditional on A. But there is a
    definite relationship between the two.

43
Deriving Bayes Rule
The probability of event A given event B is:
P(A|B) = P(A ∩ B) / P(B)
44
Deriving Bayes Rule
The probability of event B given event A is:
P(B|A) = P(A ∩ B) / P(A)
45
Deriving Bayes Rule
Rearranging each definition gives:
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
46
Deriving Bayes Rule
Dividing both sides by P(B) yields Bayes' rule:
P(A|B) = P(B|A) P(A) / P(B)
47
Deriving Bayes Rule
the theorem may be paraphrased as: conditional/posterior
probability = (LIKELIHOOD multiplied by PRIOR)
divided by NORMALIZING CONSTANT
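As a numeric sanity check of the paraphrase, with made-up values (none of these numbers come from the lecture):

```python
# Bayes' rule: posterior = likelihood * prior / normalizing constant.
# All three inputs below are illustrative assumptions.
prior = 0.2          # P(A)
likelihood = 0.9     # P(B|A)
evidence = 0.3       # P(B), the normalizing constant
posterior = likelihood * prior / evidence   # P(A|B)
print(round(posterior, 2))  # 0.6
```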
48
Hidden Markov Model (HMM) Tagging
  • Using an HMM to do POS tagging
  • HMM is a special case of Bayesian inference
  • It is also related to the noisy channel model
    in ASR (Automatic Speech Recognition)

49
POS tagging as a sequence classification task
  • We are given a sentence (an observation or
    sequence of observations)
  • Secretariat is expected to race tomorrow
  • a sequence of n words w1...wn.
  • What is the best sequence of tags which
    corresponds to this sequence of observations?
  • Probabilistic/Bayesian view:
  • Consider all possible sequences of tags
  • Out of this universe of sequences, choose the tag
    sequence which is most probable given the
    observation sequence of n words w1...wn.

50
Getting to HMM
  • Let T = t1, t2, ..., tn
  • Let W = w1, w2, ..., wn
  • Goal: Out of all sequences of tags t1...tn, get
    the most probable sequence of POS tags T
    underlying the observed sequence of words
    w1, w2, ..., wn
  • The hat ^ means "our estimate of the best", the most
    probable tag sequence
  • Argmax_x f(x) means "the x such that f(x) is
    maximized"
  • it maximizes our estimate of the best tag
    sequence

51
Getting to HMM
  • This equation is guaranteed to give us the best
    tag sequence
  • But how do we make it operational? How do we
    compute this value?
  • Intuition of Bayesian classification
  • Use Bayes rule to transform it into a set of
    other probabilities that are easier to compute
  • Thomas Bayes: British mathematician (1702-1761)

52
Bayes Rule
Breaks down any conditional probability P(x|y)
into three other probabilities. P(x|y): the
conditional probability of an event x assuming
that y has occurred
53
Bayes Rule
We can drop the denominator: it does not change
for each tag sequence; we are looking for the
best tag sequence for the same observation, for
the same fixed set of words
54
Bayes Rule
55
Likelihood and prior
T̂ = argmax_T P(W|T) P(T)   (likelihood × prior)
56
Likelihood and prior: Further Simplifications
1. the probability of a word appearing depends
only on its own POS tag, i.e., it is independent of
other words around it:
P(W|T) ≈ ∏ i=1..n P(wi|ti)
2. BIGRAM assumption: the probability of a
tag appearing depends only on the previous tag:
P(T) ≈ ∏ i=1..n P(ti|ti-1)
3. The most probable tag sequence estimated by
the bigram tagger:
T̂ ≈ argmax_T ∏ i=1..n P(wi|ti) P(ti|ti-1)
57
Likelihood and prior: Further Simplifications
1. the probability of a word appearing depends
only on its own POS tag, i.e., it is independent of
other words around it:
P(W|T) ≈ ∏ i=1..n P(wi|ti)
58
Likelihood and prior: Further Simplifications
2. BIGRAM assumption: the probability of a
tag appearing depends only on the previous tag:
P(T) ≈ ∏ i=1..n P(ti|ti-1)
Bigrams are groups of two written letters, two
syllables, or two words; they are a special case
of N-grams. Bigrams are used as the basis for
simple statistical analysis of text. The bigram
assumption is related to the first-order Markov
assumption.
59
Likelihood and prior: Further Simplifications
3. The most probable tag sequence estimated by
the bigram tagger:
T̂ ≈ argmax_T ∏ i=1..n P(wi|ti) P(ti|ti-1)
(bigram assumption)
60
Two kinds of probabilities (1)
  • Tag transition probabilities: P(ti|ti-1)
  • Determiners likely to precede adjs and nouns
  • That/DT flight/NN
  • The/DT yellow/JJ hat/NN
  • So we expect P(NN|DT) and P(JJ|DT) to be high
  • But what do we expect P(DT|JJ) to be?

61
Two kinds of probabilities (1)
  • Tag transition probabilities: P(ti|ti-1)
  • Compute P(NN|DT) by counting in a labeled corpus:

P(NN|DT) = C(DT, NN) / C(DT): the # of times DT
is followed by NN over the # of times DT occurs
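The counting can be sketched on a toy tag sequence (the sequence below is made up, not from the Brown corpus):

```python
from collections import Counter

# Toy tag sequence (invented) to illustrate P(NN|DT) = C(DT,NN) / C(DT).
tags = ["DT", "NN", "VBZ", "DT", "JJ", "NN", "DT", "NN"]
bigram_counts = Counter(zip(tags, tags[1:]))
dt_count = sum(1 for t in tags[:-1] if t == "DT")  # DTs with a following tag
p_nn_given_dt = bigram_counts[("DT", "NN")] / dt_count
print(p_nn_given_dt)  # 2/3
```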
62
Two kinds of probabilities (2)
  • Word likelihood probabilities: P(wi|ti)
  • P(is|VBZ): probability of VBZ (3sg Pres verb)
    being "is"
  • Compute P(is|VBZ) by counting in a labeled corpus:
    P(is|VBZ) = C(VBZ, is) / C(VBZ)

If we were expecting a third person singular
verb, how likely is it that this verb would be
is?
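The same counting idea works for word likelihoods; the tiny labeled corpus below is invented for illustration:

```python
from collections import Counter

# Toy labeled corpus (invented): estimate P(is|VBZ) = C(VBZ, "is") / C(VBZ).
corpus = [("is", "VBZ"), ("runs", "VBZ"), ("is", "VBZ"),
          ("dog", "NN"), ("the", "DT")]
tag_counts = Counter(tag for _, tag in corpus)
pair_counts = Counter(corpus)
p_is_given_vbz = pair_counts[("is", "VBZ")] / tag_counts["VBZ"]
print(p_is_given_vbz)  # 2/3
```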
63
An Example the verb race
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
    tomorrow/NR
  • People/NNS continue/VB to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
    space/NN
  • How do we pick the right tag?

64
Disambiguating race
65
Disambiguating race
  • P(NN|TO) = .00047
  • P(VB|TO) = .83
  • The tag transition probabilities P(NN|TO) and
    P(VB|TO) answer the question: "How likely are we
    to expect verb/noun given the previous tag TO?"
  • P(race|NN) = .00057
  • P(race|VB) = .00012
  • Lexical likelihoods from the Brown corpus for
    "race" given a POS tag NN or VB.
  • P(NR|VB) = .0027
  • P(NR|NN) = .0012
  • tag sequence probability for the likelihood of an
    adverb occurring given the previous tag verb or
    noun
  • P(VB|TO) P(NR|VB) P(race|VB) = .00000027
  • P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
  • Multiply the lexical likelihoods with the tag
    sequence probabilities: the verb wins
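The comparison can be reproduced directly with the probabilities quoted on the slide:

```python
# Tagging "race" in "to race tomorrow", using the slide's Brown-corpus numbers.
p_verb = 0.83 * 0.0027 * 0.00012     # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057  # P(NN|TO) * P(NR|NN) * P(race|NN)
print(p_verb > p_noun)  # True: the verb reading wins
```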

66
Hidden Markov Models
  • What we've described with these two kinds of
    probabilities is a Hidden Markov Model (HMM)
  • Let's just spend a bit of time tying this into
    the model
  • In order to define HMM, we will first introduce
    the Markov Chain, or observable Markov Model.

67
Definitions
  • A weighted finite-state automaton adds
    probabilities to the arcs
  • The probabilities on the arcs leaving any state
    must sum to one
  • A Markov chain is a special case of a WFST in
    which the input sequence uniquely determines
    which states the automaton will go through
  • Markov chains can't represent inherently
    ambiguous problems
  • Useful for assigning probabilities to unambiguous
    sequences

68
Markov chain First-order observed Markov
Model
  • a set of states
  • Q = q1, q2 ... qN; the state at time t is qt
  • a set of transition probabilities
  • a set of probabilities A = a01, a02, ..., an1, ..., ann
  • Each aij represents the probability of
    transitioning from state i to state j
  • The set of these is the transition probability
    matrix A
  • Distinguished start and end states
  • Special initial probability vector π
  • πi: the probability that the MM will start in
    state i; each πi expresses the probability
    P(qi|START)

69
Markov chain First-order observed Markov
Model
  • Markov Chain for weather: Example 1
  • three types of weather: sunny, rainy, foggy
  • we want to find the following conditional
    probabilities
  • P(qn|qn-1, qn-2, ..., q1)
  • i.e., the probability of the unknown weather
    on day n, depending on the (known) weather of
    the preceding days
  • We could infer this probability from the
    relative frequency (the statistics) of past
    observations of weather sequences
  • Problem: the larger n is, the more observations
    we must collect.
  • Suppose that n = 6; then we have to collect
    statistics for 3^(6-1) = 243 past histories

70
Markov chain First-order observed Markov
Model
  • Therefore, we make a simplifying assumption,
    called the (first-order) Markov assumption
  • for a sequence of observations q1, ..., qn:
  • current state only depends on previous state:
    P(qn|qn-1, ..., q1) ≈ P(qn|qn-1)
  • the joint probability of certain past and current
    observations: P(q1, ..., qn) ≈ ∏ i=1..n P(qi|qi-1)

71
Markov chain First-order observable Markov
Model

72
Markov chain First-order observed Markov
Model
  • Given that today the weather is sunny, what's
    the probability that tomorrow is sunny and the
    day after is rainy?
  • Using the Markov assumption and the
    probabilities in table 1, this translates into:
    P(q2=sunny, q3=rainy | q1=sunny)
    = P(q2=sunny | q1=sunny) × P(q3=rainy | q2=sunny)

73
The weather figure: a specific example
  • Markov Chain for weather: Example 2

74
Markov chain for weather
  • What is the probability of 4 consecutive rainy
    days?
  • Sequence is rainy-rainy-rainy-rainy
  • i.e., state sequence is 3-3-3-3
  • P(3, 3, 3, 3) =
  • π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432
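The arithmetic checks out directly; the initial and transition values below are the ones used in the slide's computation (0.2 and 0.6), assumed to come from the weather figure:

```python
# Probability of the state sequence 3-3-3-3 (four consecutive rainy days):
# one initial probability followed by three self-transitions.
pi_rainy = 0.2   # initial probability of starting in the rainy state
a_rr = 0.6       # rainy -> rainy transition probability
p = pi_rainy * a_rr ** 3
print(round(p, 4))  # 0.0432
```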

75
Hidden Markov Model
  • For Markov chains, the output symbols are the
    same as the states.
  • See sunny weather: we're in state sunny
  • But in part-of-speech tagging (and other things):
  • The output symbols are words
  • But the hidden states are part-of-speech tags
  • So we need an extension!
  • A Hidden Markov Model is an extension of a Markov
    chain in which the output symbols are not the
    same as the states.
  • This means we don't know which state we are in.

76
Markov chain for weather
77
Markov chain for words
Observed events: words. Hidden events: tags.
78
Hidden Markov Models
  • States Q = q1, q2 ... qN
  • Observations O = o1, o2 ... oN
  • Each observation is a symbol from a vocabulary V
    = v1, v2, ..., vV
  • Transition probabilities (prior)
  • Transition probability matrix A = [aij]
  • Observation likelihoods (likelihood)
  • Output probability matrix B = bi(ot)
  • a set of observation likelihoods, each
    expressing the probability of an observation ot
    being generated from a state i (emission
    probabilities)
  • Special initial probability vector π
  • πi: the probability that the HMM will start in
    state i; each πi expresses the probability
    P(qi|START)

79
Assumptions
  • Markov assumption the probability of a
    particular state depends only on the previous
    state
  • Output-independence assumption the probability
    of an output observation depends only on the
    state that produced that observation

80
HMM for Ice Cream
  • You are a climatologist in the year 2799
  • Studying global warming
  • You can't find any records of the weather in
    Boston, MA for the summer of 2007
  • But you find Jason Eisner's diary
  • Which lists how many ice-creams Jason ate every
    day that summer
  • Our job: figure out how hot it was

81
Noam task
  • Given
  • Ice Cream Observation Sequence 1,2,3,2,2,2,3
  • (cp. with output symbols)
  • Produce
  • Weather Sequence C,C,H,C,C,C,H
  • (cp. with hidden states, causing states)

82
HMM for ice cream
83
Different types of HMM structure
Ergodic = fully-connected
Bakis = left-to-right
84
HMM Taggers
  • Two kinds of probabilities
  • A: transition probabilities (PRIOR) (slide 36)
  • B: observation likelihoods (LIKELIHOOD) (slide 36)
  • HMM Taggers choose the tag sequence which
    maximizes the product of word likelihood and tag
    sequence probability
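The argmax the slide describes can be sketched as a brute-force bigram HMM tagger. All probabilities below are invented toy values, and real taggers use the Viterbi algorithm rather than enumerating every tag sequence:

```python
from itertools import product

# Toy bigram HMM tagger sketch (invented probabilities, brute-force argmax).
A = {("START", "DT"): 0.8, ("START", "NN"): 0.2,  # transitions P(t_i|t_{i-1})
     ("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
     ("NN", "NN"): 0.3, ("NN", "DT"): 0.7}
B = {("the", "DT"): 0.7, ("the", "NN"): 0.001,    # emissions P(w_i|t_i)
     ("dog", "NN"): 0.1, ("dog", "DT"): 0.001}

def score(words, tags):
    """Product of word likelihoods and tag transition probabilities."""
    p, prev = 1.0, "START"
    for w, t in zip(words, tags):
        p *= A.get((prev, t), 0.0) * B.get((w, t), 0.0)
        prev = t
    return p

words = ["the", "dog"]
best = max(product(["DT", "NN"], repeat=len(words)),
           key=lambda tags: score(words, tags))
print(best)  # ('DT', 'NN')
```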

85
Weighted FSM corresponding to hidden states of
HMM, showing A probs
86
B observation likelihoods for POS HMM
87
HMM Taggers
  • The probabilities are trained on hand-labeled
    training corpora (training set)
  • Combine different N-gram levels
  • Evaluated by comparing their output from a test
    set to human labels for that test set (Gold
    Standard)

88
Next Time
  • Minimum Edit Distance
  • A dynamic programming algorithm
  • A probabilistic version of this, called Viterbi,
    is a key part of the Hidden Markov Model!

89
Evaluation
90
Error Analysis
91
Tag indeterminacy
92
Unknown words