Title: Stochastic Methods
1
Stochastic Methods
  • A Review

2
Random Variables
  • A random variable is a function whose domain is a
    sample space and whose range is a set of
    outcomes, usually real numbers
  • A boolean random variable is a function from an
    event space to {true, false} or {1.0, 0.0}
  • A Bernoulli experiment
  • Is performed n times
  • The trials are independent
  • The probability of success on each trial is a
    constant p; the probability of failure is q =
    1 - p
  • A random variable, Y, counts the number of
    successes in the n trials
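
To make the definition concrete, here is a minimal Python
sketch (not from the slides) that simulates a Bernoulli
experiment and counts successes; the function name and seed
are illustrative only:

  import random

  def bernoulli_trials(n, p, seed=None):
      # Simulate n independent Bernoulli trials with success
      # probability p; return Y, the number of successes.
      rng = random.Random(seed)
      return sum(1 for _ in range(n) if rng.random() < p)

  # Example: count sixes (success = rolling a six) in 4 casts
  # of a fair die, matching the example on the next slide.
  print(bernoulli_trials(n=4, p=1/6, seed=42))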

3
Example
  • A fair die is cast four times
  • Success: a six is rolled
  • Failure: any other outcome
  • An observed sequence is a sequence of four 1s and
    0s; (0,0,1,0) means that a six has been rolled
    on the third trial.
  • Call this event X.
  • Since every trial in the sequence is independent,
    the probability of this event is the product of
    the probabilities of the atomic events composing
    it.
  • So, p(X) = (5/6)(5/6)(1/6)(5/6) = (1/6)(5/6)^3

4
Binomial Probabilities
  • In a sequence of Bernoulli trials, we are
    interested in the total number of successes, not
    their order
  • Let the event, X, be the number of observed
    successes in n Bernoulli trials.
  • If x individual successes occur, where x = 0, 1,
    2, …, n
  • then n - x failures occur

5
  • Facts
  • The number of ways of selecting x positions from
    n in total is just
  • nCx = n! / (x!(n - x)!)
  • Since the trials are independent, the probability
    of each particular sequence with x successes and
    n - x failures is the product of the individual
    probabilities: p^x (1-p)^(n-x)
  • The probability of the event X is the sum of the
    probabilities of all individual events
  • p(X = x) = (nCx) p^x (1-p)^(n-x)
  • The random variable, X, is said to have a
    binomial distribution

6
What is the probability of obtaining 5 heads in 7
flips of a fair coin?
  • The probability of the event X, p(X), is the sum
    of the probabilities of all individual events
  • (nCx) p^x (1-p)^(n-x)
  • The event X is 5 successes
  • n = 7, x = 5
  • p(a single success) = 1/2
  • p(a single failure) = 1/2
  • P(X) = (7C5)(1/2)^5(1/2)^2 = 21/128 ≈ 0.164
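
A short Python check of the binomial formula (a sketch, not
part of the original slides); math.comb computes nCx:

  from math import comb

  def binomial_pmf(x, n, p):
      # p(X = x) = (nCx) * p^x * (1-p)^(n-x)
      return comb(n, x) * p**x * (1 - p)**(n - x)

  # 5 heads in 7 flips of a fair coin: (7C5)(1/2)^5(1/2)^2
  print(binomial_pmf(5, 7, 0.5))    # 0.1640625 = 21/128

  # One six in four die casts: the pmf sums the (4C1) orderings
  # of the single-sequence probability (1/6)(5/6)^3.
  print(binomial_pmf(1, 4, 1/6))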

7
Expectation
  • If the reward for the occurrence of an event E,
    with probability p(E), is r, and the cost of the
    event not occurring, with probability 1 - p(E),
    is c, then the expectation of the event, ex(E), is
  • ex(E) = r × p(E) + c × (1 - p(E))

8
Expectation Example
  • A fair roulette wheel has the integers 0 to 36.
  • Each player places 5 dollars on any slot.
  • If the wheel stops on that slot, the player wins
    35, else she loses her 5 dollars
  • So,
  • p(winning) = 1/37
  • p(losing) = 36/37
  • ex(E) = 35(1/37) + (-5)(36/37)
  • ≈ -3.92
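
A one-function Python sketch of this expectation calculation
(illustrative only):

  def expectation(reward, p_event, cost):
      # ex(E) = reward * p(E) + cost * (1 - p(E))
      return reward * p_event + cost * (1 - p_event)

  # Roulette: win 35 with p = 1/37, lose the 5-dollar stake otherwise.
  print(expectation(35, 1/37, -5))    # about -3.92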

9
Bayes Theorem For Two Events
  • Recall that we defined conditional (posterior)
    probability like this
  • p(d|s) = p(s ∩ d) / p(s)            (1)
  • We can also express s in terms of d
  • p(s|d) = p(s ∩ d) / p(d)            (2)
  • Multiplying (2) by p(d) we get
  • p(s ∩ d) = p(s|d) p(d)              (3)
  • Substituting (3) into (1) gives Bayes theorem
    for two events
  • p(d|s) = p(s|d) p(d) / p(s)
10
  • If d is a disease and s is a symptom, the theorem
    tells us that the probability of the disease
    given the symptom is the probability of the
    symptom given the disease times the probability
    of the disease divided by the probability of the
    symptom
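
As a sketch of how the theorem is used, the Python below plugs
in made-up numbers for this disease/symptom reading; none of
the figures come from the slides:

  def bayes_two_events(p_s_given_d, p_d, p_s):
      # p(d|s) = p(s|d) * p(d) / p(s)
      return p_s_given_d * p_d / p_s

  # Hypothetical: 90% of patients with the disease show the symptom,
  # the disease strikes 1 in 50,000 people, the symptom 1 in 10.
  print(bayes_two_events(0.9, 1 / 50_000, 0.1))   # 0.00018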

11
The Probability of the Intersection of Multiple
Events (that are not necessarily independent)
  • Called the multiplication rule
  • p(A ∩ B) = p(A) p(B|A)

12
This can be extended to multiple events.
Given p(A ∩ B) = p(A) p(B|A), then
p(A ∩ B ∩ C) = p(A ∩ B) p(C|A ∩ B) = p(A) p(B|A) p(C|A ∩ B)
13
Let's do one more
p(A ∩ B ∩ C ∩ D) = p(A) p(B|A) p(C|A ∩ B) p(D|A ∩ B ∩ C)
This is called the chain rule and can be proven by
induction
14
Example: Four cards are to be dealt one after
another at random and without replacement from a
fair deck. What is the probability of receiving a
spade, a heart, a diamond, and a club in that
order?
A = event of being dealt a spade
B = event of being dealt a heart
C = event of being dealt a diamond
D = event of being dealt a club
P(A) = 13/52
P(B|A) = 13/51
P(C|A ∩ B) = 13/50
P(D|A ∩ B ∩ C) = 13/49
Total probability = (13/52)(13/51)(13/50)(13/49)
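
A quick Python check of this chain-rule product, using exact
fractions (a sketch, not from the slides):

  from fractions import Fraction

  # p(A ∩ B ∩ C ∩ D) = p(A) p(B|A) p(C|A ∩ B) p(D|A ∩ B ∩ C)
  factors = [Fraction(13, 52), Fraction(13, 51),
             Fraction(13, 50), Fraction(13, 49)]
  total = Fraction(1)
  for f in factors:
      total *= f
  print(total, float(total))   # 2197/499800, about 0.0044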
15
An Application
  • Def: A probabilistic finite state machine is a
    finite state machine where each arc is associated
    with a probability, indicating how likely that
    path is to be taken. The probabilities of all
    arcs leaving a node must sum to 1.0.
  • A PFSM is an acceptor when one or more states are
    indicated as start states and one or more states
    are indicated as accept states.
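
One way to represent a PFSM in Python is a map from each state
to its outgoing arcs; the states below are hypothetical
(loosely inspired by the tomato example that follows), and the
validator checks the sum-to-1.0 condition from the definition:

  # state -> list of (next_state, arc_probability)
  pfsm = {
      "start": [("t", 1.0)],
      "t":     [("ow", 0.5), ("ah", 0.5)],
      "ow":    [("m", 1.0)],
      "ah":    [("m", 1.0)],
  }

  def validate(machine):
      # Probabilities of all arcs leaving a node must sum to 1.0.
      for state, arcs in machine.items():
          total = sum(p for _, p in arcs)
          assert abs(total - 1.0) < 1e-9, f"{state} sums to {total}"

  validate(pfsm)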

16
Phones/Phonemes
  • Def: A phone is a speech sound
  • Def: A phoneme is a collection of related phones
    (allophones) that are pronounced differently in
    different contexts
  • So t is a phoneme.
  • The t sound in tunafish differs from the t
    sound in starfish. The first t is aspirated,
    meaning the vocal cords briefly don't vibrate,
    producing a sound like a puff of air. A t
    preceded by an s is unaspirated
  • FSA showing the probabilities of allophones in
    the word tomato

17
More Phonemes
  • This happens with k and g: both are
    unaspirated, leading to the mishearing of the
    Jimi Hendrix lyric
  • 'Scuse me, while I kiss the sky
  • as 'Scuse me, while I kiss this guy

18
PFSA for the pronunciation of tomato
19
Phoneme Recognition Problem
  • Computational linguists have collections of
    spoken and written language called corpora.
  • The Brown Corpus and the Switchboard Corpus are
    two examples. Together, they contain 2.5 million
    written and spoken words that we can use as a
    base.

20
Now Suppose
  • Our machine has identified the phone I
  • Next the machine identifies the phone ni (as
    in knee)
  • It turns out that an investigation of the
    Switchboard corpus shows 7 words that can be
    pronounced ni after I
  • the, neat, need, new, knee, to, you

21
How can this be?
  • The phoneme t is often deleted at the end of a
    word (say "neat little" quickly)
  • the can be pronounced like ni after a word
    ending in n, such as in (talk like a Jersey
    gangster here, or Bob Marley)

22
Strategy
  • Compile the probabilities of each of the
    candidate words from the corpora
  • Apply Bayes theorem for two events

23
  • Word Frequency Probability
  • knee 61 .000024
  • the 114834 .046
  • neat 338 .00013
  • need 1417 .00056
  • new 2625 .001

24
Apply Simplified Bayes
Since all of the candidates will be divided by
p(ni), we can drop it, giving
p(word|ni) ∝ p(ni|word) p(word)
  • But where does p(ni|word) come from?
  • Rules of pronunciation variation in English are
    well-known.
  • Run them through the corpora and generate
    probabilities for each.
  • So, for example, the probability that word-initial
    th becomes n if the preceding word ended in
    n is .15
  • This can be done for other pronunciation rules

25
Result
  • Word   p(ni|word)   p(word)    p(ni|word)p(word)
  • new    .36          .001       .00036
  • neat   .52          .00013     .000068
  • need   .11          .00056     .000062
  • knee   1.0          .000024    .000024
  • the    0.0          .046       0.0
  • the has a probability of 0.0 since the preceding
    phone was I, not n
  • Notice that new seems to be the most likely
    candidate. This might be resolved at the
    syntactic level
  • Another possibility is to look at the probability
    of two-word combinations in the corpora
  • I new is less probable than I need
  • This is referred to as N-gram analysis
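
The ranking above is easy to reproduce; this Python sketch
scores each candidate by p(ni|word) * p(word) using the
slide's numbers:

  candidates = {                # word: (p(ni|word), p(word))
      "new":  (0.36, 0.001),
      "neat": (0.52, 0.00013),
      "need": (0.11, 0.00056),
      "knee": (1.0,  0.000024),
      "the":  (0.0,  0.046),
  }
  scores = {w: lik * prior for w, (lik, prior) in candidates.items()}
  for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
      print(f"{word:5s} {score:.6f}")   # "new" comes out on top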

26
General Bayes Theorem
  • Recall Bayes Theorem for two events
  • P(A|B) = p(B|A) p(A) / p(B)
  • We would like to generalize this to multiple
    events

27
Example
  • Suppose
  • Bowl A contains 2 red and 4 white chips
  • Bowl B contains 1 red and 2 white chips
  • Bowl C contains 5 red and 4 white chips
  • We want to select a bowl at random and compute
    the probability of drawing a red chip
  • Suppose further
  • P(A) = 1/3
  • P(B) = 1/6
  • P(C) = 1/2
  • Where A,B,C are the events that A,B,C are chosen

28
  • p(R) depends on two probabilities: p(which bowl
    is chosen) and then p(drawing a red chip from it)
  • So, p(R) is the probability of the union of
    mutually exclusive events
  • p(R) = p(A) p(R|A) + p(B) p(R|B) + p(C) p(R|C)
    = (1/3)(2/6) + (1/6)(1/3) + (1/2)(5/9)
    = 1/9 + 1/18 + 5/18 = 4/9

29
  • Now suppose that the outcome of the experiment is
    a red chip, but we dont know which bowl it was
    drawn from.
  • So we can compute the conditional probability for
    each of the bowls.
  • From the definition of conditional probability
    and the result above, we know
  • p(A|R) = p(A) p(R|A) / p(R) = (1/9) / (4/9) = 1/4

30
  • We can do the same thing for the other bowls
  • p(B|R) = 1/8
  • p(C|R) = 5/8
  • This accords with intuition. The probability
    that bowl C was chosen increases over its
    original probability because, since it has
    proportionally more red chips, it is the more
    likely candidate.
  • The original probabilities are called prior
    probabilities
  • The conditional probabilities (e.g., p(A|R)) are
    called the posterior probabilities.

31
To Generalize
  • Let events B1, B2, …, Bm constitute a partition of
    the sample space S.
  • That is, Bi ∩ Bj = ∅ for i ≠ j, and
    B1 ∪ B2 ∪ … ∪ Bm = S

Suppose R is an event, with p(B1), …, p(Bm) the prior
probabilities, all of which are > 0. Then R is the
union of m mutually exclusive events, namely
R = (B1 ∩ R) ∪ (B2 ∩ R) ∪ … ∪ (Bm ∩ R), so
p(R) = p(B1) p(R|B1) + … + p(Bm) p(R|Bm)
32
  • Now,
  • If p(R) > 0, we have from the definition of
    conditional probability that
  • p(Bk|R) = p(Bk) p(R|Bk) /
    [p(B1) p(R|B1) + … + p(Bm) p(R|Bm)]

p(Bk|R) is the posterior probability
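
The general form is a few lines of Python; this sketch (not
from the slides) reproduces both the bowls example above and
the bolts example that follows:

  def posteriors(priors, likelihoods):
      # p(Bk|R) = p(Bk) p(R|Bk) / sum_i p(Bi) p(R|Bi)
      joint = [p * l for p, l in zip(priors, likelihoods)]
      total = sum(joint)        # p(R), by total probability
      return [j / total for j in joint]

  # Bowls A, B, C: priors 1/3, 1/6, 1/2; p(red|bowl) = 2/6, 1/3, 5/9
  print(posteriors([1/3, 1/6, 1/2], [2/6, 1/3, 5/9]))  # 0.25, 0.125, 0.625

  # Machines A, B, C: priors .35, .25, .40; p(defect|machine) = .02, .01, .03
  print(posteriors([0.35, 0.25, 0.40], [0.02, 0.01, 0.03]))  # C ≈ 0.56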
33
Example
  • Machines A, B, C produce bolts of the same size.
  • Each machine produces as follows
  • Machine A: 35% of the bolts, with 2% defective
  • Machine B: 25% of the bolts, with 1% defective
  • Machine C: 40% of the bolts, with 3% defective
  • Suppose we select one bolt at the end of the day.
    The probability that it is defective is
  • p(D) = (.35)(.02) + (.25)(.01) + (.40)(.03) = .0215

34
  • Now suppose the selected bolt is defective. The
    probability that it was produced by machine C is
  • p(C|D) = (.40)(.03) / .0215 ≈ .56

Notice how the posterior probability increased
once we concentrated on C, since C produces both
more bolts and more defective bolts.
35
Evidence and Hypotheses
  • We can think of these various events as evidence
    (E) and hypotheses (H).
  • p(Hk|E) = p(E|Hk) p(Hk) /
    [p(E|H1) p(H1) + … + p(E|Hm) p(Hm)]

where p(Hk|E) is the probability that hypothesis
k is true given the evidence, E; p(Hk) is the
probability that hypothesis k is true
overall; p(E|Hk) is the probability of observing
evidence, E, when Hk is true; and m is the number
of hypotheses
36
Why Bayes Works
  • The probability of evidence given hypotheses is
    often easier to determine than the probability of
    hypotheses given the evidence.
  • Suppose the evidence is a headache.
  • The hypothesis is meningitis.
  • It is easier to determine the number of patients
    who have headaches given that they have
    meningitis than it is to determine the number of
    patients who have meningitis, given that they
    have headaches.
  • Because the population of headache sufferers is
    much larger than the population of meningitis
    sufferers

37
But There Are Issues
  • When we thought about bowls (hypotheses) and
    chips (evidence), the probability of a kind of
    bowl given a red chip required that we compute a
    posterior probability for each of the three bowls.
    If we also worked it out for white chips, we
    would have to compute 3 × 2 = 6 posterior
    probabilities.
  • Now suppose our hypotheses are drawn from a set
    of m diseases and our evidence from a set of n
    symptoms; then we have to compute m × n posterior
    probabilities.

38
But Theres More
  • Bayes assumes that the hypotheses partition the
    set of evidence into disjoint sets.
  • This is fine with bolts and machines or red chips
    and bowls, but much less fine with natural
    phenomena. Pneumonia and strep probably don't
    partition the set of fever sufferers (since they
    could overlap)

39
That is
  • We have to use a form of Bayes theorem that
    considers any single hypothesis, hi, in the
    context of multiple symptoms, ei
  • p(hi | e1 ∩ e2 ∩ … ∩ en)

If n is the number of symptoms and m the number
of diseases, this works out to be m × n² + n² + m
pieces of information to collect. In an expert
system that is to classify 200 diseases using
2000 symptoms, this is roughly 800,000,000 pieces
of information to collect.
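
With the count read as m × n² + n² + m (the dominant term is
m × n²; the exact formula is my reconstruction of a garbled
line), a quick Python check confirms the 800 million figure:

  m, n = 200, 2000            # diseases, symptoms
  print(m * n**2 + n**2 + m)  # 804000200, roughly 800 million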
40
Naïve Bayes to the Rescue
  • Naive Bayes classification assumes that variables
    are independent.
  • The probability that a fruit is an apple, given
    that it is red, round, and firm, can be
    calculated from the independent probabilities
    that the observed fruit is red, that it is round,
    and that it is firm.
  • The probability that a person has strep, given
    that he has a fever, and a sore throat, can be
    calculated from the independent probabilities
    that a person has a fever and has a sore throat.

41
  • In effect, we want to calculate this
  • p(hi | e1 ∩ e2 ∩ … ∩ en)

Since the intersection of sets is a set,
E = e1 ∩ e2 ∩ … ∩ en, Bayes theorem lets us write
p(hi|E) = p(E|hi) p(hi) / p(E)
Since we only want to classify, and the
denominator is constant, we can ignore it, giving
p(hi|E) ∝ p(E|hi) p(hi)
42
Independent Events to the Rescue
  • Assume that all pieces of evidence are
    independent, given a particular hypothesis.
  • Recall the chain rule
  • p(A ∩ B ∩ C) = p(A) p(B|A) p(C|A ∩ B)

Since p(B|A) = p(B) and p(C|A ∩ B) = p(C), that
is, the events are independent, then
p(A ∩ B ∩ C) = p(A) p(B) p(C)
43
Becomes (with a little hand-waving)
P(hi|E) ∝ p(e1|hi) × p(e2|hi) × … × p(en|hi) × p(hi)
44
Leading to the naïve Bayes classifier
  • Choose the hypothesis Hj that maximizes
  • p(Hj) p(e1|Hj) p(e2|Hj) … p(en|Hj)
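
A compact Python sketch of such a classifier; the hypotheses,
evidence names, and probabilities below are invented for
illustration and do not come from the slides:

  from math import prod

  def naive_bayes_classify(evidence, priors, likelihoods):
      # Choose the hypothesis h maximizing p(h) * prod_i p(e_i|h).
      def score(h):
          return priors[h] * prod(likelihoods[h][e] for e in evidence)
      return max(priors, key=score)

  # Hypothetical strep-vs-flu numbers for the fever/sore-throat example.
  priors = {"strep": 0.01, "flu": 0.05}
  likelihoods = {
      "strep": {"fever": 0.8, "sore_throat": 0.9},
      "flu":   {"fever": 0.9, "sore_throat": 0.4},
  }
  print(naive_bayes_classify(["fever", "sore_throat"], priors, likelihoods))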