Title: Bayesian Methods with Monte Carlo Markov Chains I
1Bayesian Methods with Monte Carlo Markov Chains I
- Henry Horng-Shing Lu
- Institute of Statistics
- National Chiao Tung University
- hslu_at_stat.nctu.edu.tw
- http://tigpbp.iis.sinica.edu.tw/courses.htm
2Part 1 Introduction to Bayesian Methods
3Bayes' Theorem
- Conditional Probability
- One Derivation
- Alternative Derivation
- http://en.wikipedia.org/wiki/Bayes'_theorem
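- A minimal sketch of the statement these bullets refer to: for events A and B with P(B) > 0,
  \[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)} \]
- The second form (one alternative derivation) expands P(B) by the law of total probability.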
4False Positive and Negative
- Medical diagnosis
- Type I and II errors in hypothesis testing for statistical inference
- http://en.wikipedia.org/wiki/False_positive
                                      Actual Status: Disease (H1)          Actual Status: Normal (H0)
Test Result: Positive (Reject H0)     True Positive (Power, 1-β)           False Positive (Type I Error, α)
Test Result: Negative (Accept H0)     False Negative (Type II Error, β)    True Negative (Confidence Level, 1-α)
5Bayesian Inference (1)
- False positives in a medical test
- Test accuracy by conditional probabilities
- Prior probabilities
6Bayesian Inference (2)
- Posterior probabilities by Bayes' theorem
7Bayesian Inference (3)
- Equal prior probabilities
- Posterior probabilities by Bayes' theorem
- http://en.wikipedia.org/wiki/Bayesian_inference
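- A small R sketch of the computation on slides 5-7; the sensitivity, specificity and prior used here are assumed for illustration and are not the lecture's values.

  # Posterior probability of disease given a positive test, by Bayes' theorem.
  # All numbers below are illustrative assumptions.
  sensitivity   <- 0.99    # P(positive | disease), i.e. 1 - beta
  specificity   <- 0.95    # P(negative | normal),  i.e. 1 - alpha
  prior_disease <- 0.001   # assumed prior P(disease)

  evidence  <- sensitivity * prior_disease +
               (1 - specificity) * (1 - prior_disease)    # P(positive)
  posterior <- sensitivity * prior_disease / evidence      # P(disease | positive)
  posterior   # about 0.019, so most positive results are false positives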
8Bayesian Inference (4)
- In the courtroom
- Based on the evidence other than the DNA match, and by Bayes' theorem, the posterior probability of guilt given the DNA match can be updated from the prior.
9Naive Bayes Classifier
- A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.
- http://en.wikipedia.org/wiki/Naive_Bayes_classifier
10Naive Bayes Probabilistic Model (1)
- The probability model for a classifier is a conditional model p(C | F_1, ..., F_n), where C is a dependent class variable and F_1, ..., F_n are several feature variables.
- By Bayes' theorem,
11Naive Bayes Probabilistic Model (2)
- Use repeated applications of the definition of conditional probability, and so forth.
- Assume that each F_i is conditionally independent of every other F_j for j ≠ i; this means that
12Naive Bayes Probabilistic Model (3)
- So the joint model can be expressed as
- So the conditional distribution over the class variable C can be expressed as
- where Z is a constant if the values of the feature variables are known.
- Constructing a classifier from the probability model (see the sketch below).
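- A sketch of the model on slides 10-12, using C for the class and F_1, ..., F_n for the features, as in the cited Wikipedia article:
  \[ p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)} \]
- Under the naive conditional-independence assumption p(F_i | C, F_j, ...) = p(F_i | C),
  \[ p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C) \]
- and the classifier picks the class
  \[ \hat{c} = \arg\max_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c). \]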
13Bayesian Spam Filtering (1)
- Bayesian spam filtering, a form of e-mail filtering, is the process of using a Naive Bayes classifier to identify spam email.
- References
- http://en.wikipedia.org/wiki/Spam_%28e-mail%29
- http://en.wikipedia.org/wiki/Bayesian_spam_filtering
- http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf
14Bayesian Spam Filtering (2)
- Probabilistic model
- where "words" means certain words in spam emails.
- Particular words have particular probabilities of occurring in spam emails and in legitimate emails. For instance, most email users will frequently encounter the word "Viagra" in spam emails, but will seldom see it in other emails.
15Bayesian Spam Filtering (3)
- Before mails can be filtered using this method, the user needs to generate a database with words and tokens (such as the sign, IP addresses and domains, and so on), collected from a sample of spam mails and valid mails.
- After generating the database, each word in the email contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem.
16Bayesian Spam Filtering (4)
- Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.
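- A minimal R sketch of this combination rule; the per-word probabilities and the 0.5 prior below are hypothetical, not values from the lecture or from any real filter's database.

  # Naive Bayes spam score from (hypothetical) per-word probabilities.
  p_word_given_spam <- c(viagra = 0.20,   offer = 0.10, meeting = 0.01)
  p_word_given_ham  <- c(viagra = 0.0005, offer = 0.02, meeting = 0.05)
  p_spam <- 0.5                       # assumed prior P(spam)

  words_in_email <- c("viagra", "offer")

  # P(spam | words) by Bayes' theorem with the word-independence assumption
  num <- p_spam * prod(p_word_given_spam[words_in_email])
  den <- num + (1 - p_spam) * prod(p_word_given_ham[words_in_email])
  posterior_spam <- num / den         # about 0.9995 here
  posterior_spam > 0.95               # exceeds the 95% threshold, so mark as spam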
17Bayesian Network (1)
- A Bayesian network is a compact representation of probability distributions via conditional independence.
- For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms.
- http://en.wikipedia.org/wiki/Bayesian_network
- http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
- http://www.cs.huji.ac.il/~nirf/Nips01-Tutorial/index.html
18Bayesian Network (2)
- Conditional independencies and a graphical language capture the structure of many real-world distributions
- The graph structure provides much insight into the domain
- Allows knowledge discovery
19Bayesian Network (3)
- Qualitative part
- Directed acyclic graph (DAG)
- Nodes - random variables
- Edges - direct influence
- Quantitative part - set of conditional probability distributions
- Together - define a unique distribution in a factored form
20Inference
- Posterior probabilities
- Probability of any event given any evidence
- Most likely explanation
- Scenario that explains evidence
- Rational decision making
- Maximize expected utility
- Value of Information
- Effect of intervention
21Example 1 (1)
22Example 1 (2)
- By the chain rule of probability, the joint probability of all the nodes in the graph above is
- By using conditional independence relationships, we can rewrite this as
- where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.
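- Written out (a reconstruction, taking C, S, R and W to be the cloudy, sprinkler, rain and wet-grass variables of the water-sprinkler network in the cited Murphy tutorial):
  \[ P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C, S)\, P(W \mid C, S, R) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid S, R) \]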
23Example 1 (3)
- Bayes' theorem
- where the denominator is a normalizing constant, equal to the probability (likelihood) of the data.
24Example 1 (4)
- The posterior probability of each explanation
- So we see that it is more likely that the grass is wet because it is raining; the likelihood ratio is computed in the sketch below.
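- An R sketch of this "explaining away" computation. The conditional probability tables below follow the water-sprinkler example in Murphy's tutorial (cited on slide 17); treat them as assumed values rather than the lecture's own numbers.

  # P(S = on | W = wet) and P(R = rain | W = wet) by summing out C, S, R.
  p_c <- 0.5                                   # P(C = cloudy), assumed
  p_s_given_c <- c(no = 0.5, yes = 0.1)        # P(S = on | C)
  p_r_given_c <- c(no = 0.2, yes = 0.8)        # P(R = rain | C)
  p_w <- function(s, r) {                      # P(W = wet | S, R)
    if (s && r) 0.99 else if (s || r) 0.90 else 0.0
  }

  joint <- joint_s <- joint_r <- 0
  for (cl in c(FALSE, TRUE)) for (sp in c(FALSE, TRUE)) for (rn in c(FALSE, TRUE)) {
    key <- if (cl) "yes" else "no"
    pc  <- if (cl) p_c else 1 - p_c
    ps  <- if (sp) p_s_given_c[[key]] else 1 - p_s_given_c[[key]]
    pr  <- if (rn) p_r_given_c[[key]] else 1 - p_r_given_c[[key]]
    p   <- pc * ps * pr * p_w(sp, rn)          # P(C, S, R, W = wet)
    joint <- joint + p
    if (sp) joint_s <- joint_s + p
    if (rn) joint_r <- joint_r + p
  }
  joint_s / joint    # P(S = on  | W = wet), about 0.43
  joint_r / joint    # P(R = rain | W = wet), about 0.71: rain is the likelier explanation
  joint_r / joint_s  # likelihood ratio, about 1.65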
25Part 2 MLE vs. Bayesian Methods
26Maximum Likelihood Estimates (MLEs) vs. Bayesian
Methods
- Binomial Experiments: http://www.math.tau.ac.il/~nin/Courses/ML04/ml2.ppt
- More Explanations and Examples
- http://www.dina.dk/phd/s/s6/learning2.pdf
27MLE (1)
- Binomial Experiments: suppose we toss a coin N times and the random variable is the outcome (heads or tails) of each toss.
- We denote by θ the (unknown) probability P(H).
- Estimation task
- Given a sequence of toss samples x[1], ..., x[N], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.
28MLE (2)
- The number of heads we see, N_H, has a binomial distribution
- and thus
- Clearly, the MLE of θ is N_H / N, and it is also equal to the MME (method of moments estimate) of θ.
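- Written out (a reconstruction of the missing expressions), with N_H heads and N_T = N - N_H tails:
  \[ P(N_H \mid \theta) = \binom{N}{N_H}\, \theta^{N_H} (1-\theta)^{N_T}, \qquad \hat{\theta}_{\mathrm{MLE}} = \frac{N_H}{N}. \]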
29MLE (3)
- Suppose we observe the sequence H, H.
- The MLE estimate is P(H) = 1, P(T) = 0.
- Should we really believe that tails are impossible at this stage?
- Such an estimate can have a disastrous effect.
- If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible.
30Bayesian Reasoning
- In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution.
- This probability distribution can be viewed as a subjective probability.
- This is a personal judgment of uncertainty.
31Bayesian Inference
- P(θ): prior distribution over the values of θ
- P(x[1], ..., x[N] | θ): likelihood of the binomial experiment given a known value θ
- Given x[1], ..., x[N], we can compute the posterior distribution on θ
- The marginal likelihood is
- http://www.dina.dk/phd/s/s6/learning2.pdf
32Binomial Example (1)
- In the binomial experiment, the unknown parameter is θ.
- Simplest prior for θ: the uniform prior on [0, 1].
- Likelihood
- where k is the number of heads in the sequence
- Marginal likelihood
33Binomial Example (2)
- Using integration by parts, we have
- Multiplying both sides by the binomial coefficient, we have
34Binomial Example (3)
- The recursion terminates when ,
- Thus,
- We conclude that the posterior is
35Binomial Example (4)
- How do we predict (estimate) using the posterior?
- We can think of this as computing the probability of the next element in the sequence.
- Assumption: if we know θ, the probability of the next outcome is independent of the previous ones.
36Binomial Example (5)
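- A sketch of the computation behind slides 32-36: with the uniform prior and k heads in N tosses, the posterior is
  \[ P(\theta \mid x[1], \dots, x[N]) = \frac{\theta^{k}(1-\theta)^{N-k}}{\int_0^1 \theta^{k}(1-\theta)^{N-k}\, d\theta} = \frac{(N+1)!}{k!\,(N-k)!}\, \theta^{k}(1-\theta)^{N-k}, \]
  i.e. a Beta(k+1, N-k+1) density, and the predictive probability of heads on the next toss is
  \[ P(x[N+1] = H \mid x[1], \dots, x[N]) = \int_0^1 \theta\, P(\theta \mid x[1], \dots, x[N])\, d\theta = \frac{k+1}{N+2}. \]
- For the sequence H, H this gives 3/4 rather than the MLE's value of 1 (Laplace's rule of succession).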
37Beta Prior (1)
- The uniform prior distribution is a particular case of the Beta distribution. Its general form is shown below.
- where a, b > 0; we write it as Beta(a, b).
- The expected value of the parameter is a / (a + b).
- The uniform prior is Beta(1, 1).
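- In symbols (a reconstruction of the missing expressions):
  \[ \mathrm{Beta}(\theta \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}, \qquad 0 \le \theta \le 1,\ a, b > 0, \]
  \[ E[\theta] = \frac{a}{a+b}, \qquad \text{and the uniform prior is } \mathrm{Beta}(\theta \mid 1, 1) = 1. \]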
38Beta Prior (2)
- There are important theoretical reasons for using the Beta prior distribution.
- One of them also has important practical consequences: it is the conjugate distribution of binomial sampling.
- If the prior is Beta(a, b) and we have observed some data with n1 and n2 cases for the two possible values of the variable, then the posterior is also Beta, with parameters a + n1 and b + n2.
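- In symbols: if the prior is Beta(a, b) and the data contain n_1 and n_2 cases of the two values, then
  \[ P(\theta \mid \text{data}) \propto \theta^{n_1}(1-\theta)^{n_2} \cdot \theta^{a-1}(1-\theta)^{b-1} = \theta^{a+n_1-1}(1-\theta)^{b+n_2-1}, \]
  i.e. the posterior is Beta(a + n_1, b + n_2).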
39Beta Prior (3)
- The expected value for the posterior distribution is shown in the sketch below.
- The prior parameters represent the prior probabilities for the values of the variable, based on our past experience.
- The value s is called the equivalent sample size; it measures the importance of our past experience.
- Larger values of s make the prior probabilities more important.
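- One common way to write this (taking the prior to be Beta(s·p_1, s·p_2), where p_1 and p_2 are the prior probabilities and s is the equivalent sample size; treat this parametrization as an assumption, since the lecture's notation may differ):
  \[ E[\theta \mid \text{data}] = \frac{s\,p_1 + n_1}{s + n_1 + n_2}, \]
  which tends to the MLE n_1 / (n_1 + n_2) as s → 0 and gives the prior probability p_1 more weight as s grows.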
40Beta Prior (4)
- When s → 0, then we have maximum likelihood estimation.
41Multinomial Experiments
- Now, assume that we have a variable taking values in a finite set {x_1, ..., x_k}, we have a series of independent observations of this distribution, and we want to estimate the values p_i = P(x_i), i = 1, ..., k.
- Let n_i be the number of cases in the sample in which we have obtained the value x_i.
- The MLE of p_i is n_i / n, where n is the sample size.
- The problems with small samples are completely analogous.
42Dirichlet Prior (1)
- We can also follow the Bayesian approach, but the prior distribution is the Dirichlet distribution, a generalization of the Beta distribution for more than 2 cases.
- The expression of the Dirichlet density is shown below.
- where s is the equivalent sample size.
43Dirichlet Prior (2)
- The expected vector is given below.
- A greater value of s makes this distribution more concentrated around the mean vector.
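- A sketch of the Dirichlet prior in the same (assumed) parametrization, with prior probabilities p_1, ..., p_k and equivalent sample size s, so that α_i = s·p_i:
  \[ f(\theta_1, \dots, \theta_k) = \frac{\Gamma(\alpha_1 + \dots + \alpha_k)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}, \qquad E[\theta_i] = \frac{\alpha_i}{s} = p_i. \]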
44Dirichlet Posterior
- If we have a set of data with counts n_1, ..., n_k, then the posterior distribution is also Dirichlet, with updated parameters.
- The Bayesian estimates of the probabilities are given below.
- where n = n_1 + ... + n_k.
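- In the same notation, with counts n_1, ..., n_k and n = n_1 + ... + n_k:
  \[ \text{posterior} = \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_k + n_k), \qquad \hat{p}_i = E[\theta_i \mid \text{data}] = \frac{n_i + s\,p_i}{n + s}. \]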
45Multinomial Example (1)
- Imagine that we have an urn with balls of different colors, red (R), blue (B) and green (G), but in unknown quantities.
- Assume that we pick up balls with replacement, obtaining the following sequence:
- .
46Multinomial Example (2)
- If we assume a Dirichlet prior distribution with parameters , then the estimated frequencies for red, blue and green can be computed as in the sketch below.
- Observe that green has a positive probability, even though it never appears in the sequence.
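- A small R sketch of this update; the draws and the Dirichlet parameters below are hypothetical, since the slide's actual sequence and prior are not reproduced here.

  # Dirichlet-multinomial update for the urn example (assumed numbers).
  prior  <- c(R = 1, B = 1, G = 1)    # assumed Dirichlet(1, 1, 1) prior
  counts <- c(R = 4, B = 2, G = 0)    # assumed draws, e.g. R, R, B, R, B, R

  posterior <- prior + counts          # posterior is Dirichlet(prior + counts)
  estimate  <- posterior / sum(posterior)
  estimate   # green keeps a positive estimate even though it never appeared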
47Part 3 An Example in Genetics
48Example 1 in Genetics (1)
- Two linked loci with alleles A and a, and B and b
- A, B dominant
- a, b recessive
- A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB, ab
49Example 1 in Genetics (2)
- Probabilities for genotypes in gametes:

             No Recombination    Recombination
  Male            1-r                 r
  Female          1-r                 r

             AB         ab         aB        Ab
  Male     (1-r)/2    (1-r)/2     r/2       r/2
  Female   (1-r)/2    (1-r)/2     r/2       r/2
50Example 1 in Genetics (3)
- Fisher, R. A. and Balmukand, B. (1928). The estimation of linkage from the offspring of selfed heterozygotes. Journal of Genetics, 20, 79-92.
- More
- http://en.wikipedia.org/wiki/Genetics
- http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf
51Example 1 in Genetics (4)
                          MALE
  FEMALE          AB (1-r)/2           ab (1-r)/2           aB r/2            Ab r/2
  AB (1-r)/2   AABB (1-r)(1-r)/4    aABb (1-r)(1-r)/4    aABB r(1-r)/4     AABb r(1-r)/4
  ab (1-r)/2   AaBb (1-r)(1-r)/4    aabb (1-r)(1-r)/4    aaBb r(1-r)/4     Aabb r(1-r)/4
  aB r/2       AaBB (1-r)r/4        aabB (1-r)r/4        aaBB r·r/4        AabB r·r/4
  Ab r/2       AABb (1-r)r/4        aAbb (1-r)r/4        aABb r·r/4        AAbb r·r/4
52Example 1 in Genetics (5)
- Four distinct phenotypes: AB, Ab, aB and ab.
- A: the dominant phenotype from (Aa, AA, aA).
- a: the recessive phenotype from aa.
- B: the dominant phenotype from (Bb, BB, bB).
- b: the recessive phenotype from bb.
- AB: 9 gametic combinations.
- Ab: 3 gametic combinations.
- aB: 3 gametic combinations.
- ab: 1 gametic combination.
- Total: 16 combinations.
53Example 1 in Genetics (6)
54Example 1 in Genetics (7)
- Hence, the random sample of n from the offspring of selfed heterozygotes will follow a multinomial distribution.
- We know that
- and
- So
55Bayesian for Example 1 in Genetics (1)
- To simplify computation, we let
- The random sample of n from the offspring of selfed heterozygotes will follow a multinomial distribution, as reconstructed below.
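- Reconstructing the probabilities from the genotype table on slide 51 (this matches the standard Fisher-Balmukand setup, though the lecture's symbols may differ): the four phenotype probabilities are
  \[ P(AB) = \frac{2 + (1-r)^2}{4}, \qquad P(Ab) = P(aB) = \frac{1 - (1-r)^2}{4}, \qquad P(ab) = \frac{(1-r)^2}{4}. \]
- Letting θ = (1-r)^2 be the simplifying parameter, the counts (n_1, n_2, n_3, n_4) of the four phenotypes in a sample of n offspring satisfy
  \[ (n_1, n_2, n_3, n_4) \sim \mathrm{Multinomial}\!\left(n;\ \frac{2+\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\right). \]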
56Bayesian for Example 1 in Genetics (2)
- We assume a Dirichlet prior distribution with parameters to estimate the probabilities for AB, Ab, aB and ab.
- Recall that
- AB: 9 gametic combinations.
- Ab: 3 gametic combinations.
- aB: 3 gametic combinations.
- ab: 1 gametic combination.
- We consider
57Bayesian for Example 1 in Genetics (3)
- Suppose that we observe the data of
- .
- So the posterior distribution is also Dirichlet with parameters
- The Bayesian estimates of the probabilities are
- Consider the original model,
- The random sample of n also follow a multinomial
distribution - We will assume a Beta prior distribution
59Bayesian for Example 1 in Genetics (5)
- The posterior distribution becomes
- The integration in the above denominator,
- does not have a close form.
60Bayesian for Example 1 in Genetics (6)
- How do we solve this problem? The Monte Carlo Markov Chain (MCMC) method!
- What value is appropriate for ?
61Part 4 Monte Carlo Methods
62Monte Carlo Methods (1)
- Consider the game of solitaire: what's the chance of winning with a properly shuffled deck?
- http://en.wikipedia.org/wiki/Monte_Carlo_method
- http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt
- Chance of winning is 1 in 4!
63Monte Carlo Methods (2)
- Hard to compute analytically because winning or losing depends on a complex procedure of reorganizing cards.
- Insight: why not just play a few hands, and see empirically how many do in fact win?
- More generally, we can approximate a probability density function using only samples from that density.
64Monte Carlo Methods (3)
- Given a very large set and a distribution over it.
- We draw a set of i.i.d. random samples.
- We can then approximate the distribution using these samples.
65Monte Carlo Methods (4)
- We can also use these samples to compute expectations
- And even use them to find a maximum
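- In symbols (a sketch): with i.i.d. samples x^(1), ..., x^(N) drawn from p(x),
  \[ p(x) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{x^{(i)} = x\}, \qquad E_p[f(X)] \approx \frac{1}{N} \sum_{i=1}^{N} f\big(x^{(i)}\big), \qquad \hat{x} \approx \arg\max_{x \in \{x^{(1)}, \dots, x^{(N)}\}} p(x). \]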
66Monte Carlo Example
- Let X_1, ..., X_n be i.i.d. N(0,1); find E(X^4).
- Solution
- Use the Monte Carlo method to approximate it:
- > x <- rnorm(100000)  # 100000 samples from N(0,1)
- > x <- x^4
- > mean(x)
- [1] 3.034175
67Exercises
- Write your own programs similar to those examples presented in this talk.
- Write programs for those examples mentioned at the reference web pages.
- Write programs for the other examples that you know.