Title: Stochastic Methods
Random Variables
- A random variable is a function whose domain is a sample space and whose range is a set of outcomes, usually real numbers.
- A Boolean random variable is a function from an event space to {true, false} or {1.0, 0.0}.
- A Bernoulli experiment:
- Is performed n times
- The trials are independent
- The probability of success on each trial is a constant p; the probability of failure is q = 1 - p
- A random variable, Y, counts the number of successes in the n trials
Example
- A fair die is cast four times.
- Success: a six is rolled.
- Failure: any other outcome.
- An observed sequence is a sequence of four 1s and 0s; (0,0,1,0) means that a six was rolled on the third trial. Call this event X.
- Since every trial in the sequence is independent, the probability of this event is the product of the probabilities of the atomic events composing it.
- So, p(X) = (5/6)(5/6)(1/6)(5/6) = (1/6)(5/6)^3, about 0.096
Binomial Probabilities
- In a sequence of Bernoulli trials, we are interested in the total number of successes, not their order.
- Let the event X be the number of observed successes in n Bernoulli trials.
- If x individual successes occur, where x = 0, 1, 2, ..., n, then n - x failures occur.
- Facts:
- The number of ways of selecting x positions from n in total is just nCx, the binomial coefficient.
- Since the trials are independent, the probability of any particular sequence with x successes and n - x failures is just p^x (1 - p)^(n - x).
- The probability of the event X is the sum of the probabilities of all such individual sequences: (nCx) p^x (1 - p)^(n - x).
- The random variable, X, is said to have a binomial distribution.
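A minimal Python sketch of this formula (assuming Python 3.8+ for math.comb; the function name is mine, not from the slides):

from math import comb

def binomial_pmf(n, x, p):
    """Probability of exactly x successes in n Bernoulli trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The die example above: exactly 1 six in 4 casts
print(binomial_pmf(4, 1, 1/6))  # 4 * (1/6) * (5/6)**3, about 0.386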
What is the probability of obtaining 5 heads in 7 flips of a fair coin?
- The probability of the event X, p(X), is the sum of the probabilities of all individual events: (nCx) p^x (1 - p)^(n - x).
- The event X is 5 successes.
- n = 7, x = 5
- p(a single success) = 1/2
- p(a single failure) = 1/2
- P(X) = (7C5)(1/2)^5(1/2)^2 = 21/128, about 0.164
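A quick check of this arithmetic in Python (illustrative only):

from math import comb

print(comb(7, 5) * (1/2)**5 * (1/2)**2)  # 21/128 = 0.1640625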
Expectation
- If the reward for the occurrence of an event E, with probability p(E), is r, and the cost of the event not occurring, with probability 1 - p(E), is c, then the expectation of the event occurring, ex(E), is:
- ex(E) = r * p(E) + c * (1 - p(E))
Expectation Example
- A fair roulette wheel has integers 0 to 36.
- A player places 5 dollars on any slot.
- If the wheel stops on that slot, the player wins 35 dollars; otherwise she loses her 5 dollars.
- So,
- p(winning) = 1/37
- p(losing) = 36/37
- ex(E) = 35(1/37) + (-5)(36/37)
- which is approximately -3.92
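A small Python sketch of the expectation formula applied to this roulette example (names are mine):

def expectation(reward, cost, p_win):
    # ex(E) = r * p(E) + c * (1 - p(E))
    return reward * p_win + cost * (1 - p_win)

print(expectation(35, -5, 1/37))  # -145/37, about -3.92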
Bayes Theorem For Two Events
- Recall that we defined conditional (posterior) probability like this:
- (1) p(d|s) = p(d ∩ s) / p(s)
- We can also express s in terms of d:
- (2) p(s|d) = p(s ∩ d) / p(d)
- Multiplying (2) by p(d) we get:
- (3) p(s|d) p(d) = p(s ∩ d)
- Substituting (3) into (1) gives Bayes theorem for two events:
- p(d|s) = p(s|d) p(d) / p(s)
- If d is a disease and s is a symptom, the theorem tells us that the probability of the disease given the symptom is the probability of the symptom given the disease, times the probability of the disease, divided by the probability of the symptom.
The Probability of the Intersection of Multiple Events (that are not necessarily independent)
- Called the Multiplication Rule:
- p(A ∩ B) = p(A) p(B|A)
This can be extended to multiple events:
- Given p(A ∩ B) = p(A) p(B|A), we have p(A ∩ B ∩ C) = p(A) p(B|A) p(C|A ∩ B)
Let's do one more:
- p(A1 ∩ A2 ∩ ... ∩ An) = p(A1) p(A2|A1) p(A3|A1 ∩ A2) ... p(An|A1 ∩ ... ∩ An-1)
- This is called the chain rule and can be proven by induction.
Example: Four cards are to be dealt one after another, at random and without replacement, from a fair deck. What is the probability of receiving a spade, a heart, a diamond, and a club, in that order?
- A = event of being dealt a spade
- B = event of being dealt a heart
- C = event of being dealt a diamond
- D = event of being dealt a club
- P(A) = 13/52
- P(B|A) = 13/51
- P(C|A ∩ B) = 13/50
- P(D|A ∩ B ∩ C) = 13/49
- Total probability: (13/52)(13/51)(13/50)(13/49), about 0.0044
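The same chain-rule product as a quick Python check (assuming Python 3.8+ for math.prod):

from math import prod

# p(A), p(B|A), p(C|A ∩ B), p(D|A ∩ B ∩ C)
probs = [13/52, 13/51, 13/50, 13/49]
print(prod(probs))  # about 0.0044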
An Application
- Def: A probabilistic finite state machine (PFSM) is a finite state machine where each arc is associated with a probability, indicating how likely that path is to be taken. The probabilities of all arcs leaving a node must sum to 1.0.
- A PFSM is an acceptor when one or more states are indicated as start states and one or more states are indicated as accept states.
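A minimal sketch of a PFSM acceptor in Python, under the assumption that transitions are stored as {state: {symbol: (next_state, probability)}}; all names and the toy machine below are illustrative, not from the slides:

def sequence_probability(transitions, start, accept_states, symbols):
    """Probability of the path spelled by `symbols`, or 0.0 if the path
    is impossible or does not end in an accept state."""
    state, prob = start, 1.0
    for s in symbols:
        if s not in transitions.get(state, {}):
            return 0.0
        state, p = transitions[state][s]
        prob *= p
    return prob if state in accept_states else 0.0

# Toy machine: probabilities on arcs leaving q0 sum to 1.0
pfsm = {"q0": {"a": ("q1", 0.6), "b": ("q2", 0.4)},
        "q1": {"b": ("q2", 1.0)}}
print(sequence_probability(pfsm, "q0", {"q2"}, ["a", "b"]))  # 0.6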
Phones/Phonemes
- Def: A phone is a speech sound.
- Def: A phoneme is a collection of related phones (allophones) that are pronounced differently in different contexts.
- So [t] is a phoneme.
- The t sound in "tunafish" differs from the t sound in "starfish". The first t is aspirated, meaning the vocal cords briefly don't vibrate, producing a sound like a puff of air. A t preceded by an s, as in "starfish", is unaspirated.
- FSA showing the probabilities of allophones in the word "tomato"
More Phonemes
- The same thing happens with k and g: both are unaspirated (the k because it follows an s), leading to the mishearing of the Jimi Hendrix lyric
- "'Scuse me, while I kiss the sky"
- as
- "'Scuse me, while I kiss this guy"
PFSA for the pronunciation of "tomato"
Phoneme Recognition Problem
- Computational linguists have collections of spoken and written language called corpora.
- The Brown Corpus and the Switchboard Corpus are two examples. Together, they contain 2.5 million written and spoken words that we can use as a base.
Now Suppose
- Our machine has identified the phone "I".
- Next, the machine identifies the phone "ni" (as in "knee").
- It turns out that an investigation of the Switchboard corpus shows 7 words that can be pronounced "ni" after "I":
- the, neat, need, new, knee, to, you
How can this be?
- The phoneme t is often deleted at the end of a word (say "neat little" quickly).
- "the" can be pronounced like "ni" when the preceding word ends in n, as in "in the" (talk like a Jersey gangster here, or Bob Marley).
Strategy
- Compile the probabilities of each of the candidate words from the corpora.
- Apply Bayes theorem for two events.
- Word     Frequency    Probability
- knee            61        .000024
- the         114834        .046
- neat           338        .00013
- need          1417        .00056
- new           2625        .001
Apply Simplified Bayes
- Since all of the candidates will be divided by p(ni), we can drop it, giving p(word|ni) proportional to p(ni|word) p(word).
- But where does p(ni|word) come from?
- Rules of pronunciation variation in English are well known.
- Run them over the corpora and generate probabilities for each.
- So, for example, the probability that a word-initial "th" becomes "n" when the preceding word ends in "n" is .15.
- This can be done for the other pronunciation rules.
Result
- Word    p(ni|word)    p(word)     p(ni|word) p(word)
- new        .36         .001          .00036
- neat       .52         .00013        .000068
- need       .11         .00056        .000062
- knee      1.0          .000024       .000024
- the       0.0          .046          0.0
- "the" has a probability of 0.0 since the preceding word, "I", does not end in n.
- Notice that "new" seems to be the most likely candidate. This might be resolved at the syntactic level.
- Another possibility is to look at the probability of two-word combinations in the corpora: "I new" is less probable than "I need".
- This is referred to as N-gram analysis.
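The ranking on this slide as a short Python sketch, using the numbers given above (the variable names are mine):

candidates = {
    # word: (p(ni|word), p(word))
    "new":  (0.36, 0.001),
    "neat": (0.52, 0.00013),
    "need": (0.11, 0.00056),
    "knee": (1.0,  0.000024),
    "the":  (0.0,  0.046),
}
scores = {w: lik * prior for w, (lik, prior) in candidates.items()}
print(max(scores, key=scores.get))  # 'new', with score 0.00036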
General Bayes Theorem
- Recall Bayes theorem for two events:
- p(A|B) = p(B|A) p(A) / p(B)
- We would like to generalize this to multiple events.
Example
- Suppose:
- Bowl A contains 2 red and 4 white chips
- Bowl B contains 1 red and 2 white chips
- Bowl C contains 5 red and 4 white chips
- We select a bowl at random and want to compute the probability of drawing a red chip.
- Suppose further:
- p(A) = 1/3
- p(B) = 1/6
- p(C) = 1/2
- where A, B, C are the events that bowls A, B, C are chosen.
- p(R) depends on two probabilities: p(which bowl is chosen) and then p(drawing a red chip from that bowl).
- So, p(R) is the union of mutually exclusive events:
- p(R) = p(A) p(R|A) + p(B) p(R|B) + p(C) p(R|C) = (1/3)(2/6) + (1/6)(1/3) + (1/2)(5/9) = 4/9
- Now suppose that the outcome of the experiment is a red chip, but we don't know which bowl it was drawn from.
- So we can compute the conditional probability for each of the bowls.
- From the definition of conditional probability and the result above, we know:
- p(A|R) = p(A ∩ R) / p(R) = p(A) p(R|A) / p(R) = (1/9) / (4/9) = 1/4
- We can do the same thing for the other bowls:
- p(B|R) = 1/8
- p(C|R) = 5/8
- This accords with intuition. The probability that bowl C was chosen increases over its original probability because, since it has proportionally more red chips, it is the more likely source.
- The original probabilities are called prior probabilities.
- The conditional probabilities (e.g., p(A|R)) are called the posterior probabilities.
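The bowl example as a short Python calculation (a sketch; the dictionaries are mine):

priors      = {"A": 1/3, "B": 1/6, "C": 1/2}   # p(bowl)
p_red_given = {"A": 2/6, "B": 1/3, "C": 5/9}   # p(R | bowl)

p_red = sum(priors[b] * p_red_given[b] for b in priors)              # 4/9
posteriors = {b: priors[b] * p_red_given[b] / p_red for b in priors}
print(posteriors)  # {'A': 0.25, 'B': 0.125, 'C': 0.625}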
To Generalize
- Let events B1, B2, ..., Bm constitute a partition of the sample space S.
- That is, B1 ∪ B2 ∪ ... ∪ Bm = S and Bi ∩ Bj = ∅ for i ≠ j.
- Suppose R is an event and the prior probabilities p(B1), ..., p(Bm) are all > 0. Then R is the union of m mutually exclusive events, namely R = (R ∩ B1) ∪ ... ∪ (R ∩ Bm), so:
- p(R) = p(B1) p(R|B1) + p(B2) p(R|B2) + ... + p(Bm) p(R|Bm)
- Now, if p(R) > 0, we have from the definition of conditional probability that:
- p(Bk|R) = p(Bk) p(R|Bk) / [p(B1) p(R|B1) + ... + p(Bm) p(R|Bm)]
- p(Bk|R) is the posterior probability.
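The general formula as a small Python function (a sketch; the name bayes_posteriors is mine). It reproduces the bowl numbers from the earlier slides:

def bayes_posteriors(priors, likelihoods):
    """priors: [p(B1), ..., p(Bm)]; likelihoods: [p(R|B1), ..., p(R|Bm)]."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                      # p(R), by total probability
    return [j / total for j in joint]       # [p(B1|R), ..., p(Bm|R)]

print(bayes_posteriors([1/3, 1/6, 1/2], [2/6, 1/3, 5/9]))  # [0.25, 0.125, 0.625]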
Example
- Machines A, B, and C produce bolts of the same size.
- Each machine produces as follows:
- Machine A: 35% of the bolts, with 2% defective
- Machine B: 25% of the bolts, with 1% defective
- Machine C: 40% of the bolts, with 3% defective
- Suppose we select one bolt at random at the end of the day. The probability that it is defective is:
- p(D) = (.35)(.02) + (.25)(.01) + (.40)(.03) = .0215
- Now suppose the selected bolt is defective. The probability that it was produced by machine C is:
- p(C|D) = (.40)(.03) / .0215, about .56
- Notice how the posterior probability increased once we concentrated on C, since C produces both more bolts and more defective bolts.
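The bolt example as a quick Python check (names are mine):

share     = {"A": 0.35, "B": 0.25, "C": 0.40}   # fraction of production
defective = {"A": 0.02, "B": 0.01, "C": 0.03}   # defect rate per machine

p_def = sum(share[m] * defective[m] for m in share)      # 0.0215
print(share["C"] * defective["C"] / p_def)               # about 0.56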
Evidence and Hypotheses
- We can think of these various events as evidence (E) and hypotheses (H):
- p(Hk|E) = p(E|Hk) p(Hk) / [p(E|H1) p(H1) + ... + p(E|Hm) p(Hm)]
- where p(Hk|E) is the probability that hypothesis k is true given the evidence E, p(Hk) is the probability that hypothesis k is true overall, p(E|Hk) is the probability of observing evidence E when Hk is true, and m is the number of hypotheses.
Why Bayes Works
- The probability of the evidence given a hypothesis is often easier to determine than the probability of the hypothesis given the evidence.
- Suppose the evidence is a headache.
- The hypothesis is meningitis.
- It is easier to determine the number of patients who have headaches given that they have meningitis than it is to determine the number of patients who have meningitis given that they have headaches, because the population of headache sufferers is much larger than the population of meningitis patients.
But There Are Issues
- When we thought about bowls (hypotheses) and chips (evidence), finding the probability of each bowl given a red chip required that we compute 3 posterior probabilities, one for each of the three bowls. If we also worked it out for white chips, we would have to compute 3 x 2 = 6 posterior probabilities.
- Now suppose our hypotheses are drawn from a set of m diseases and our evidence from a set of n symptoms; we then have to compute m x n posterior probabilities.
But There's More
- Bayes assumes that the hypotheses partition the set of evidence into disjoint sets.
- This is fine with bolts and machines or chips and bowls, but much less fine with natural phenomena. Pneumonia and strep probably don't partition the set of fever sufferers, since the two could overlap.
That is
- We have to use a form of Bayes theorem that considers any single hypothesis, hi, in the context of the union of multiple pieces of evidence, e1 ... en.
- If n is the number of symptoms and m the number of diseases, this works out to be m x n^2 + n^2 + m pieces of information to collect. In an expert system that is to classify 200 diseases using 2000 symptoms, this is roughly 800,000,000 pieces of information to collect.
Naïve Bayes to the Rescue
- Naive Bayes classification assumes that the pieces of evidence are independent given the hypothesis.
- The probability that a fruit is an apple, given that it is red, round, and firm, can be calculated from the individual probabilities that the observed fruit is red, that it is round, and that it is firm.
- The probability that a person has strep, given that he has a fever and a sore throat, can be calculated from the individual probabilities that a person has a fever and has a sore throat.
- In effect, we want to calculate this:
- p(hi | e1 ∩ e2 ∩ ... ∩ en)
- Since the intersection of sets is a set, Bayes theorem lets us write:
- p(hi | E) = p(E | hi) p(hi) / p(E), where E = e1 ∩ e2 ∩ ... ∩ en
- Since we only want to classify, and the denominator is the same for every hypothesis, we can ignore it, giving:
- p(hi | E) proportional to p(E | hi) p(hi)
Independent Events to the Rescue
- Assume that all pieces of evidence are independent given a particular hypothesis.
- Recall the chain rule:
- p(A ∩ B ∩ C) = p(A) p(B|A) p(C|A ∩ B)
- Since p(B|A) = p(B) and p(C|A ∩ B) = p(C) when the events are independent, this becomes:
- p(A ∩ B ∩ C) = p(A) p(B) p(C)
So, with a little hand-waving, p(E|hi) factors into a product and Bayes theorem becomes:
- p(hi | E) proportional to p(e1|hi) x p(e2|hi) x ... x p(en|hi) x p(hi)
Leading to the naïve Bayes classifier:
- classify(e1, ..., en) = the hi that maximizes p(hi) p(e1|hi) p(e2|hi) ... p(en|hi)
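A minimal sketch of such a classifier in Python: pick the hypothesis that maximizes p(h) times the product of p(e|h) over the observed evidence. All names and numbers below are illustrative, not from the slides (assuming Python 3.8+ for math.prod):

from math import prod

def naive_bayes_classify(priors, likelihoods, evidence):
    """priors: {h: p(h)}; likelihoods: {h: {e: p(e|h)}}; evidence: list of e."""
    scores = {h: priors[h] * prod(likelihoods[h].get(e, 0.0) for e in evidence)
              for h in priors}
    return max(scores, key=scores.get), scores

# Toy usage with made-up numbers
priors = {"strep": 0.1, "flu": 0.9}
likelihoods = {"strep": {"fever": 0.8, "sore_throat": 0.9},
               "flu":   {"fever": 0.7, "sore_throat": 0.3}}
print(naive_bayes_classify(priors, likelihoods, ["fever", "sore_throat"]))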