Title: Stochastic Methods
Random Variables
- A random variable is a function whose domain is a sample space and whose range is a set of outcomes, usually real numbers.
- A Boolean random variable is a function from an event space to {true, false} or {1.0, 0.0}.
- A Bernoulli experiment:
- Is performed n times
- The trials are independent
- The probability of success on each trial is a constant p; the probability of failure is q = 1 - p
- A random variable, Y, counts the number of successes in the n trials
Example
- A fair die is cast four times.
- Success: a six is rolled.
- Failure: any other outcome.
- An observed sequence is a sequence of four 1s and 0s; (0,0,1,0) means that a six was rolled on the third trial. Call this event X.
- Since every trial in the sequence is independent, the probability of this event is the product of the probabilities of the atomic events composing it.
- So, p(X) = (5/6)(5/6)(1/6)(5/6) = (1/6)(5/6)^3, about 0.096
Binomial Probabilities
- In a sequence of Bernoulli trials, we are interested in the total number of successes, not their order.
- Let the event X be the number of observed successes in n Bernoulli trials.
- If x individual successes occur, where x = 0, 1, 2, ..., n, then n - x failures occur.
- Facts:
- The number of ways of selecting x positions from n in total is just nCx, the binomial coefficient.
- Since the trials are independent, the probability of any particular sequence with x successes and n - x failures is just p^x (1 - p)^(n - x).
- The probability of the event X is the sum of the probabilities of all such individual sequences: (nCx) p^x (1 - p)^(n - x).
- The random variable, X, is said to have a binomial distribution.
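A minimal Python sketch of this formula (assuming Python 3.8+ for math.comb; the function name is mine, not from the slides):

from math import comb

def binomial_pmf(n, x, p):
    """Probability of exactly x successes in n Bernoulli trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The die example above: exactly 1 six in 4 casts
print(binomial_pmf(4, 1, 1/6))  # 4 * (1/6) * (5/6)**3, about 0.386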
What is the probability of obtaining 5 heads in 7 flips of a fair coin?
- The probability of the event X, p(X), is the sum of the probabilities of all individual events: (nCx) p^x (1 - p)^(n - x).
- The event X is 5 successes.
- n = 7, x = 5
- p(a single success) = 1/2
- p(a single failure) = 1/2
- P(X) = (7C5)(1/2)^5(1/2)^2 = 21/128, about 0.164
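A quick check of this arithmetic in Python (illustrative only):

from math import comb

print(comb(7, 5) * (1/2)**5 * (1/2)**2)  # 21/128 = 0.1640625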
Expectation
- If the reward for the occurrence of an event E, with probability p(E), is r, and the cost of the event not occurring, with probability 1 - p(E), is c, then the expectation of the event occurring, ex(E), is:
- ex(E) = r * p(E) + c * (1 - p(E))
Expectation Example
- A fair roulette wheel has integers 0 to 36.
- A player places 5 dollars on any slot.
- If the wheel stops on that slot, the player wins 35 dollars; otherwise she loses her 5 dollars.
- So,
- p(winning) = 1/37
- p(losing) = 36/37
- ex(E) = 35(1/37) + (-5)(36/37)
- which is approximately -3.92
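A small Python sketch of the expectation formula applied to this roulette example (names are mine):

def expectation(reward, cost, p_win):
    # ex(E) = r * p(E) + c * (1 - p(E))
    return reward * p_win + cost * (1 - p_win)

print(expectation(35, -5, 1/37))  # -145/37, about -3.92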
Bayes Theorem For Two Events
- Recall that we defined conditional (posterior) probability like this:
- (1) p(d|s) = p(d ∩ s) / p(s)
- We can also express s in terms of d:
- (2) p(s|d) = p(s ∩ d) / p(d)
- Multiplying (2) by p(d) we get:
- (3) p(s|d) p(d) = p(s ∩ d)
- Substituting (3) into (1) gives Bayes theorem for two events:
- p(d|s) = p(s|d) p(d) / p(s)
- If d is a disease and s is a symptom, the theorem tells us that the probability of the disease given the symptom is the probability of the symptom given the disease, times the probability of the disease, divided by the probability of the symptom.
The Probability of the Intersection of Multiple Events (that are not necessarily independent)
- Called the Multiplication Rule:
- p(A ∩ B) = p(A) p(B|A)
This can be extended to multiple events:
- Given p(A ∩ B) = p(A) p(B|A), we have p(A ∩ B ∩ C) = p(A) p(B|A) p(C|A ∩ B)
Let's do one more:
- p(A1 ∩ A2 ∩ ... ∩ An) = p(A1) p(A2|A1) p(A3|A1 ∩ A2) ... p(An|A1 ∩ ... ∩ An-1)
- This is called the chain rule and can be proven by induction.
Example: Four cards are to be dealt one after another, at random and without replacement, from a fair deck. What is the probability of receiving a spade, a heart, a diamond, and a club, in that order?
- A = event of being dealt a spade
- B = event of being dealt a heart
- C = event of being dealt a diamond
- D = event of being dealt a club
- P(A) = 13/52
- P(B|A) = 13/51
- P(C|A ∩ B) = 13/50
- P(D|A ∩ B ∩ C) = 13/49
- Total probability: (13/52)(13/51)(13/50)(13/49), about 0.0044
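The same chain-rule product as a quick Python check (assuming Python 3.8+ for math.prod):

from math import prod

# p(A), p(B|A), p(C|A ∩ B), p(D|A ∩ B ∩ C)
probs = [13/52, 13/51, 13/50, 13/49]
print(prod(probs))  # about 0.0044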
An Application
- Def: A probabilistic finite state machine (PFSM) is a finite state machine where each arc is associated with a probability, indicating how likely that path is to be taken. The probabilities of all arcs leaving a node must sum to 1.0.
- A PFSM is an acceptor when one or more states are indicated as start states and one or more states are indicated as accept states.
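A minimal sketch of a PFSM acceptor in Python, under the assumption that transitions are stored as {state: {symbol: (next_state, probability)}}; all names and the toy machine below are illustrative, not from the slides:

def sequence_probability(transitions, start, accept_states, symbols):
    """Probability of the path spelled by `symbols`, or 0.0 if the path
    is impossible or does not end in an accept state."""
    state, prob = start, 1.0
    for s in symbols:
        if s not in transitions.get(state, {}):
            return 0.0
        state, p = transitions[state][s]
        prob *= p
    return prob if state in accept_states else 0.0

# Toy machine: probabilities on arcs leaving q0 sum to 1.0
pfsm = {"q0": {"a": ("q1", 0.6), "b": ("q2", 0.4)},
        "q1": {"b": ("q2", 1.0)}}
print(sequence_probability(pfsm, "q0", {"q2"}, ["a", "b"]))  # 0.6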
Phones/Phonemes
- Def: A phone is a speech sound.
- Def: A phoneme is a collection of related phones (allophones) that are pronounced differently in different contexts.
- So [t] is a phoneme.
- The t sound in "tunafish" differs from the t sound in "starfish". The first t is aspirated, meaning the vocal cords briefly don't vibrate, producing a sound like a puff of air. A t preceded by an s, as in "starfish", is unaspirated.
- FSA showing the probabilities of allophones in the word "tomato"
More Phonemes
- The same thing happens with k and g: both are unaspirated (the k because it follows an s), leading to the mishearing of the Jimi Hendrix lyric
- "'Scuse me, while I kiss the sky"
- as
- "'Scuse me, while I kiss this guy"
PFSA for the pronunciation of "tomato"
Phoneme Recognition Problem
- Computational linguists have collections of spoken and written language called corpora.
- The Brown Corpus and the Switchboard Corpus are two examples. Together, they contain 2.5 million written and spoken words that we can use as a base.
Now Suppose
- Our machine has identified the phone "I".
- Next, the machine identifies the phone "ni" (as in "knee").
- It turns out that an investigation of the Switchboard corpus shows 7 words that can be pronounced "ni" after "I":
- the, neat, need, new, knee, to, you
How can this be?
- The phoneme t is often deleted at the end of a word (say "neat little" quickly).
- "the" can be pronounced like "ni" when the preceding word ends in n, as in "in the" (talk like a Jersey gangster here, or Bob Marley).
Strategy
- Compile the probabilities of each of the candidate words from the corpora.
- Apply Bayes theorem for two events.
- Word     Frequency    Probability
- knee            61        .000024
- the         114834        .046
- neat           338        .00013
- need          1417        .00056
- new           2625        .001
Apply Simplified Bayes
- Since all of the candidates will be divided by p(ni), we can drop it, giving p(word|ni) proportional to p(ni|word) p(word).
- But where does p(ni|word) come from?
- Rules of pronunciation variation in English are well known.
- Run them over the corpora and generate probabilities for each.
- So, for example, the probability that a word-initial "th" becomes "n" when the preceding word ends in "n" is .15.
- This can be done for the other pronunciation rules.
Result
- Word    p(ni|word)    p(word)     p(ni|word) p(word)
- new        .36         .001          .00036
- neat       .52         .00013        .000068
- need       .11         .00056        .000062
- knee      1.0          .000024       .000024
- the       0.0          .046          0.0
- "the" has a probability of 0.0 since the preceding word, "I", does not end in n.
- Notice that "new" seems to be the most likely candidate. This might be resolved at the syntactic level.
- Another possibility is to look at the probability of two-word combinations in the corpora: "I new" is less probable than "I need".
- This is referred to as N-gram analysis.
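The ranking on this slide as a short Python sketch, using the numbers given above (the variable names are mine):

candidates = {
    # word: (p(ni|word), p(word))
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.0,  0.000024),
    "the":  (0.0,  0.046),
}
scores = {w: lik * prior for w, (lik, prior) in candidates.items()}
print(max(scores, key=scores.get))  # 'new', with score 0.00036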
General Bayes Theorem
- Recall Bayes theorem for two events:
- p(A|B) = p(B|A) p(A) / p(B)
- We would like to generalize this to multiple events.
Example
- Suppose:
- Bowl A contains 2 red and 4 white chips
- Bowl B contains 1 red and 2 white chips
- Bowl C contains 5 red and 4 white chips
- We select a bowl at random and want to compute the probability of drawing a red chip.
- Suppose further:
- p(A) = 1/3
- p(B) = 1/6
- p(C) = 1/2
- where A, B, C are the events that bowls A, B, C are chosen.
- p(R) depends on two probabilities: p(which bowl is chosen) and then p(drawing a red chip from that bowl).
- So, p(R) is the union of mutually exclusive events:
- p(R) = p(A) p(R|A) + p(B) p(R|B) + p(C) p(R|C) = (1/3)(2/6) + (1/6)(1/3) + (1/2)(5/9) = 4/9
- Now suppose that the outcome of the experiment is a red chip, but we don't know which bowl it was drawn from.
- So we can compute the conditional probability for each of the bowls.
- From the definition of conditional probability and the result above, we know:
- p(A|R) = p(A ∩ R) / p(R) = p(A) p(R|A) / p(R) = (1/9) / (4/9) = 1/4
- We can do the same thing for the other bowls:
- p(B|R) = 1/8
- p(C|R) = 5/8
- This accords with intuition. The probability that bowl C was chosen increases over its original probability because, since it has proportionally more red chips, it is the more likely source.
- The original probabilities are called prior probabilities.
- The conditional probabilities (e.g., p(A|R)) are called the posterior probabilities.
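The bowl example as a short Python calculation (a sketch; the dictionaries are mine):

priors      = {"A": 1/3, "B": 1/6, "C": 1/2}   # p(bowl)
p_red_given = {"A": 2/6, "B": 1/3, "C": 5/9}   # p(R | bowl)

p_red = sum(priors[b] * p_red_given[b] for b in priors)              # 4/9
posteriors = {b: priors[b] * p_red_given[b] / p_red for b in priors}
print(posteriors)  # {'A': 0.25, 'B': 0.125, 'C': 0.625}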
To Generalize
- Let events B1, B2, ..., Bm constitute a partition of the sample space S.
- That is, B1 ∪ B2 ∪ ... ∪ Bm = S and Bi ∩ Bj = ∅ for i ≠ j.
- Suppose R is an event and the prior probabilities p(B1), ..., p(Bm) are all > 0. Then R is the union of m mutually exclusive events, namely R = (R ∩ B1) ∪ ... ∪ (R ∩ Bm), so:
- p(R) = p(B1) p(R|B1) + p(B2) p(R|B2) + ... + p(Bm) p(R|Bm)
- Now, if p(R) > 0, we have from the definition of conditional probability that:
- p(Bk|R) = p(Bk) p(R|Bk) / [p(B1) p(R|B1) + ... + p(Bm) p(R|Bm)]
- p(Bk|R) is the posterior probability.
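The general formula as a small Python function (a sketch; the name bayes_posteriors is mine). It reproduces the bowl numbers from the earlier slides:

def bayes_posteriors(priors, likelihoods):
    """priors: [p(B1), ..., p(Bm)]; likelihoods: [p(R|B1), ..., p(R|Bm)]."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                      # p(R), by total probability
    return [j / total for j in joint]       # [p(B1|R), ..., p(Bm|R)]

print(bayes_posteriors([1/3, 1/6, 1/2], [2/6, 1/3, 5/9]))  # [0.25, 0.125, 0.625]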
Example
- Machines A, B, and C produce bolts of the same size.
- Each machine produces as follows:
- Machine A: 35% of the bolts, with 2% defective
- Machine B: 25% of the bolts, with 1% defective
- Machine C: 40% of the bolts, with 3% defective
- Suppose we select one bolt at random at the end of the day. The probability that it is defective is:
- p(D) = (.35)(.02) + (.25)(.01) + (.40)(.03) = .0215
- Now suppose the selected bolt is defective. The probability that it was produced by machine C is:
- p(C|D) = (.40)(.03) / .0215, about .56
- Notice how the posterior probability increased once we concentrated on C, since C produces both more bolts and more defective bolts.
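The bolt example as a quick Python check (names are mine):

share     = {"A": 0.35, "B": 0.25, "C": 0.40}   # fraction of production
defective = {"A": 0.02, "B": 0.01, "C": 0.03}   # defect rate per machine

p_def = sum(share[m] * defective[m] for m in share)      # 0.0215
print(share["C"] * defective["C"] / p_def)               # about 0.56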
Evidence and Hypotheses
- We can think of these various events as evidence (E) and hypotheses (H):
- p(Hk|E) = p(E|Hk) p(Hk) / [p(E|H1) p(H1) + ... + p(E|Hm) p(Hm)]
- where p(Hk|E) is the probability that hypothesis k is true given the evidence E, p(Hk) is the probability that hypothesis k is true overall, p(E|Hk) is the probability of observing evidence E when Hk is true, and m is the number of hypotheses.
Why Bayes Works
- The probability of the evidence given a hypothesis is often easier to determine than the probability of the hypothesis given the evidence.
- Suppose the evidence is a headache.
- The hypothesis is meningitis.
- It is easier to determine the number of patients who have headaches given that they have meningitis than it is to determine the number of patients who have meningitis given that they have headaches, because the population of headache sufferers is much larger than the population of meningitis patients.
But There Are Issues
- When we thought about bowls (hypotheses) and chips (evidence), finding the probability of each bowl given a red chip required that we compute 3 posterior probabilities, one for each of the three bowls. If we also worked it out for white chips, we would have to compute 3 x 2 = 6 posterior probabilities.
- Now suppose our hypotheses are drawn from a set of m diseases and our evidence from a set of n symptoms; we then have to compute m x n posterior probabilities.
But There's More
- Bayes assumes that the hypotheses partition the set of evidence into disjoint sets.
- This is fine with bolts and machines or chips and bowls, but much less fine with natural phenomena. Pneumonia and strep probably don't partition the set of fever sufferers, since the two could overlap.
That is
- We have to use a form of Bayes theorem that considers any single hypothesis, hi, in the context of the union of multiple pieces of evidence, e1 ... en.
- If n is the number of symptoms and m the number of diseases, this works out to be m x n^2 + n^2 + m pieces of information to collect. In an expert system that is to classify 200 diseases using 2000 symptoms, this is roughly 800,000,000 pieces of information to collect.
Naïve Bayes to the Rescue
- Naive Bayes classification assumes that the pieces of evidence are independent given the hypothesis.
- The probability that a fruit is an apple, given that it is red, round, and firm, can be calculated from the individual probabilities that the observed fruit is red, that it is round, and that it is firm.
- The probability that a person has strep, given that he has a fever and a sore throat, can be calculated from the individual probabilities that a person has a fever and has a sore throat.
- In effect, we want to calculate this:
- p(hi | e1 ∩ e2 ∩ ... ∩ en)
- Since the intersection of sets is a set, Bayes theorem lets us write:
- p(hi | E) = p(E | hi) p(hi) / p(E), where E = e1 ∩ e2 ∩ ... ∩ en
- Since we only want to classify, and the denominator is the same for every hypothesis, we can ignore it, giving:
- p(hi | E) proportional to p(E | hi) p(hi)
Independent Events to the Rescue
- Assume that all pieces of evidence are independent given a particular hypothesis.
- Recall the chain rule:
- p(A ∩ B ∩ C) = p(A) p(B|A) p(C|A ∩ B)
- Since p(B|A) = p(B) and p(C|A ∩ B) = p(C) when the events are independent, this becomes:
- p(A ∩ B ∩ C) = p(A) p(B) p(C)
So, with a little hand-waving, p(E|hi) factors into a product and Bayes theorem becomes:
- p(hi | E) proportional to p(e1|hi) x p(e2|hi) x ... x p(en|hi) x p(hi)
Leading to the naïve Bayes classifier:
- classify(e1, ..., en) = the hi that maximizes p(hi) p(e1|hi) p(e2|hi) ... p(en|hi)
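A minimal sketch of such a classifier in Python: pick the hypothesis that maximizes p(h) times the product of p(e|h) over the observed evidence. All names and numbers below are illustrative, not from the slides (assuming Python 3.8+ for math.prod):

from math import prod

def naive_bayes_classify(priors, likelihoods, evidence):
    """priors: {h: p(h)}; likelihoods: {h: {e: p(e|h)}}; evidence: list of e."""
    scores = {h: priors[h] * prod(likelihoods[h].get(e, 0.0) for e in evidence)
              for h in priors}
    return max(scores, key=scores.get), scores

# Toy usage with made-up numbers
priors = {"strep": 0.1, "flu": 0.9}
likelihoods = {"strep": {"fever": 0.8, "sore_throat": 0.9},
               "flu":   {"fever": 0.7, "sore_throat": 0.3}}
print(naive_bayes_classify(priors, likelihoods, ["fever", "sore_throat"]))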