Title: Bayesian Methods with Monte Carlo Markov Chains I
1Bayesian Methods with Monte Carlo Markov Chains I
- Henry Horng-Shing Lu
- Institute of Statistics
- National Chiao Tung University
- hslu_at_stat.nctu.edu.tw
- http://tigpbp.iis.sinica.edu.tw/courses.htm
2Part 1 Introduction to Bayesian Methods
3Bayes' Theorem
- Conditional Probability
- One Derivation
- Alternative Derivation
- http://en.wikipedia.org/wiki/Bayes'_theorem
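- A minimal sketch of the statement these bullets refer to: for events A and B with P(B) > 0,
  \[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)} \]
- The second form (one alternative derivation) expands P(B) by the law of total probability.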
4False Positive and Negative
- Medical diagnosis
- Type I and II errors in hypothesis testing for statistical inference
- http://en.wikipedia.org/wiki/False_positive
                                      Actual Status: Disease (H1)          Actual Status: Normal (H0)
Test Result: Positive (Reject H0)     True Positive (Power, 1-β)           False Positive (Type I Error, α)
Test Result: Negative (Accept H0)     False Negative (Type II Error, β)    True Negative (Confidence Level, 1-α)
5Bayesian Inference (1)
- False positives in a medical test
- Test accuracy by conditional probabilities
- Prior probabilities
6Bayesian Inference (2)
- Posterior probabilities by Bayes' theorem
7Bayesian Inference (3)
- Equal prior probabilities
- Posterior probabilities by Bayes' theorem
- http://en.wikipedia.org/wiki/Bayesian_inference
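- A small R sketch of the computation on slides 5-7; the sensitivity, specificity and prior used here are assumed for illustration and are not the lecture's values.

  # Posterior probability of disease given a positive test, by Bayes' theorem.
  # All numbers below are illustrative assumptions.
  sensitivity   <- 0.99    # P(positive | disease), i.e. 1 - beta
  specificity   <- 0.95    # P(negative | normal),  i.e. 1 - alpha
  prior_disease <- 0.001   # assumed prior P(disease)

  evidence  <- sensitivity * prior_disease +
               (1 - specificity) * (1 - prior_disease)    # P(positive)
  posterior <- sensitivity * prior_disease / evidence      # P(disease | positive)
  posterior   # about 0.019, so most positive results are false positives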
8Bayesian Inference (4)
- In the courtroom
- Based on the evidence other than the DNA match, and by Bayes' theorem, the posterior probability of guilt given the DNA match can be updated from the prior.
9Naive Bayes Classifier
- A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.
- http://en.wikipedia.org/wiki/Naive_Bayes_classifier
10Naive Bayes Probabilistic Model (1)
- The probability model for a classifier is a conditional model p(C | F_1, ..., F_n), where C is a dependent class variable and F_1, ..., F_n are several feature variables.
- By Bayes' theorem,
11Naive Bayes Probabilistic Model (2)
- Use repeated applications of the definition of conditional probability, and so forth.
- Assume that each F_i is conditionally independent of every other F_j for j ≠ i; this means that
12Naive Bayes Probabilistic Model (3)
- So the joint model can be expressed as
- So the conditional distribution over the class variable C can be expressed as
- where Z is a constant if the values of the feature variables are known.
- Constructing a classifier from the probability model (see the sketch below).
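- A sketch of the model on slides 10-12, using C for the class and F_1, ..., F_n for the features, as in the cited Wikipedia article:
  \[ p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)} \]
- Under the naive conditional-independence assumption p(F_i | C, F_j, ...) = p(F_i | C),
  \[ p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C) \]
- and the classifier picks the class
  \[ \hat{c} = \arg\max_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c). \]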
13Bayesian Spam Filtering (1)
- Bayesian spam filtering, a form of e-mail filtering, is the process of using a Naive Bayes classifier to identify spam email.
- References
- http://en.wikipedia.org/wiki/Spam_%28e-mail%29
- http://en.wikipedia.org/wiki/Bayesian_spam_filtering
- http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf
14Bayesian Spam Filtering (2)
- Probabilistic model
- where "words" means certain words in spam emails.
- Particular words have particular probabilities of occurring in spam emails and in legitimate emails. For instance, most email users will frequently encounter the word "Viagra" in spam emails, but will seldom see it in other emails.
15Bayesian Spam Filtering (3)
- Before mails can be filtered using this method, the user needs to generate a database with words and tokens (such as the sign, IP addresses and domains, and so on), collected from a sample of spam mails and valid mails.
- After generating the database, each word in the email contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem.
16Bayesian Spam Filtering (4)
- Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.
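- A minimal R sketch of this combination rule; the per-word probabilities and the 0.5 prior below are hypothetical, not values from the lecture or from any real filter's database.

  # Naive Bayes spam score from (hypothetical) per-word probabilities.
  p_word_given_spam <- c(viagra = 0.20,   offer = 0.10, meeting = 0.01)
  p_word_given_ham  <- c(viagra = 0.0005, offer = 0.02, meeting = 0.05)
  p_spam <- 0.5                       # assumed prior P(spam)

  words_in_email <- c("viagra", "offer")

  # P(spam | words) by Bayes' theorem with the word-independence assumption
  num <- p_spam * prod(p_word_given_spam[words_in_email])
  den <- num + (1 - p_spam) * prod(p_word_given_ham[words_in_email])
  posterior_spam <- num / den         # about 0.9995 here
  posterior_spam > 0.95               # exceeds the 95% threshold, so mark as spam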
17Bayesian Network (1)
- A Bayesian network is a compact representation of probability distributions via conditional independence.
- For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms.
- http://en.wikipedia.org/wiki/Bayesian_network
- http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
- http://www.cs.huji.ac.il/~nirf/Nips01-Tutorial/index.html
18Bayesian Network (2)
- Conditional independencies and a graphical language capture the structure of many real-world distributions
- The graph structure provides much insight into the domain
- Allows knowledge discovery
19Bayesian Network (3)
- Qualitative part
- Directed acyclic graph (DAG)
- Nodes - random variables
- Edges - direct influence
- Quantitative part - set of conditional probability distributions
- Together - define a unique distribution in a factored form
20Inference
- Posterior probabilities
- Probability of any event given any evidence
- Most likely explanation
- Scenario that explains evidence
- Rational decision making
- Maximize expected utility
- Value of Information
- Effect of intervention
21Example 1 (1)
22Example 1 (2)
- By the chain rule of probability, the joint probability of all the nodes in the graph above is
- By using conditional independence relationships, we can rewrite this as
- where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.
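- Written out (a reconstruction, taking C, S, R and W to be the cloudy, sprinkler, rain and wet-grass variables of the water-sprinkler network in the cited Murphy tutorial):
  \[ P(C, S, R, W) = P(C)\, P(S \mid C)\, P(R \mid C, S)\, P(W \mid C, S, R) = P(C)\, P(S \mid C)\, P(R \mid C)\, P(W \mid S, R) \]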
23Example 1 (3)
- Bayes' theorem
- where the denominator is a normalizing constant, equal to the probability (likelihood) of the data.
24Example 1 (4)
- The posterior probability of each explanation
- So we see that it is more likely that the grass is wet because it is raining; the likelihood ratio is computed in the sketch below.
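- An R sketch of this "explaining away" computation. The conditional probability tables below follow the water-sprinkler example in Murphy's tutorial (cited on slide 17); treat them as assumed values rather than the lecture's own numbers.

  # P(S = on | W = wet) and P(R = rain | W = wet) by summing out C, S, R.
  p_c <- 0.5                                   # P(C = cloudy), assumed
  p_s_given_c <- c(no = 0.5, yes = 0.1)        # P(S = on | C)
  p_r_given_c <- c(no = 0.2, yes = 0.8)        # P(R = rain | C)
  p_w <- function(s, r) {                      # P(W = wet | S, R)
    if (s && r) 0.99 else if (s || r) 0.90 else 0.0
  }

  joint <- joint_s <- joint_r <- 0
  for (cl in c(FALSE, TRUE)) for (sp in c(FALSE, TRUE)) for (rn in c(FALSE, TRUE)) {
    key <- if (cl) "yes" else "no"
    pc  <- if (cl) p_c else 1 - p_c
    ps  <- if (sp) p_s_given_c[[key]] else 1 - p_s_given_c[[key]]
    pr  <- if (rn) p_r_given_c[[key]] else 1 - p_r_given_c[[key]]
    p   <- pc * ps * pr * p_w(sp, rn)          # P(C, S, R, W = wet)
    joint <- joint + p
    if (sp) joint_s <- joint_s + p
    if (rn) joint_r <- joint_r + p
  }
  joint_s / joint    # P(S = on  | W = wet), about 0.43
  joint_r / joint    # P(R = rain | W = wet), about 0.71: rain is the likelier explanation
  joint_r / joint_s  # likelihood ratio, about 1.65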
25Part 2 MLE vs. Bayesian Methods
26Maximum Likelihood Estimates (MLEs) vs. Bayesian
Methods
- Binomial Experiments: http://www.math.tau.ac.il/~nin/Courses/ML04/ml2.ppt
- More Explanations and Examples
- http://www.dina.dk/phd/s/s6/learning2.pdf
27MLE (1)
- Binomial Experiments: suppose we toss a coin N times and the random variable is the outcome (heads or tails) of each toss.
- We denote by θ the (unknown) probability P(H).
- Estimation task
- Given a sequence of toss samples x[1], ..., x[N], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.
28MLE (2)
- The number of heads we see, N_H, has a binomial distribution
- and thus
- Clearly, the MLE of θ is N_H / N, and it is also equal to the MME (method of moments estimate) of θ.
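- Written out (a reconstruction of the missing expressions), with N_H heads and N_T = N - N_H tails:
  \[ P(N_H \mid \theta) = \binom{N}{N_H}\, \theta^{N_H} (1-\theta)^{N_T}, \qquad \hat{\theta}_{\mathrm{MLE}} = \frac{N_H}{N}. \]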
29MLE (3)
- Suppose we observe the sequence H, H.
- The MLE estimate is P(H) = 1, P(T) = 0.
- Should we really believe that tails are impossible at this stage?
- Such an estimate can have a disastrous effect.
- If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible.
30Bayesian Reasoning
- In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution.
- This probability distribution can be viewed as a subjective probability.
- This is a personal judgment of uncertainty.
31Bayesian Inference
- P(θ): prior distribution over the values of θ
- P(x[1], ..., x[N] | θ): likelihood of the binomial experiment given a known value θ
- Given x[1], ..., x[N], we can compute the posterior distribution on θ
- The marginal likelihood is
- http://www.dina.dk/phd/s/s6/learning2.pdf
32Binomial Example (1)
- In the binomial experiment, the unknown parameter is θ.
- Simplest prior for θ: the uniform prior on [0, 1].
- Likelihood
- where k is the number of heads in the sequence
- Marginal likelihood
33Binomial Example (2)
- Using integration by parts, we have
- Multiplying both sides by the binomial coefficient, we have
34Binomial Example (3)
- The recursion terminates when ,
- Thus,
- We conclude that the posterior is
35Binomial Example (4)
- How do we predict (estimate) using the posterior?
- We can think of this as computing the probability of the next element in the sequence.
- Assumption: if we know θ, the probability of the next outcome is independent of the previous ones.
36Binomial Example (5)
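- A sketch of the computation behind slides 32-36: with the uniform prior and k heads in N tosses, the posterior is
  \[ P(\theta \mid x[1], \dots, x[N]) = \frac{\theta^{k}(1-\theta)^{N-k}}{\int_0^1 \theta^{k}(1-\theta)^{N-k}\, d\theta} = \frac{(N+1)!}{k!\,(N-k)!}\, \theta^{k}(1-\theta)^{N-k}, \]
  i.e. a Beta(k+1, N-k+1) density, and the predictive probability of heads on the next toss is
  \[ P(x[N+1] = H \mid x[1], \dots, x[N]) = \int_0^1 \theta\, P(\theta \mid x[1], \dots, x[N])\, d\theta = \frac{k+1}{N+2}. \]
- For the sequence H, H this gives 3/4 rather than the MLE's value of 1 (Laplace's rule of succession).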
37Beta Prior (1)
- The uniform prior distribution is a particular case of the Beta distribution. Its general form is shown below.
- where a, b > 0; we write it as Beta(a, b).
- The expected value of the parameter is a / (a + b).
- The uniform prior is Beta(1, 1).
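- In symbols (a reconstruction of the missing expressions):
  \[ \mathrm{Beta}(\theta \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}, \qquad 0 \le \theta \le 1,\ a, b > 0, \]
  \[ E[\theta] = \frac{a}{a+b}, \qquad \text{and the uniform prior is } \mathrm{Beta}(\theta \mid 1, 1) = 1. \]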
38Beta Prior (2)
- There are important theoretical reasons for using the Beta prior distribution.
- One of them also has important practical consequences: it is the conjugate distribution of binomial sampling.
- If the prior is Beta(a, b) and we have observed some data with n1 and n2 cases for the two possible values of the variable, then the posterior is also Beta, with parameters a + n1 and b + n2.
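- In symbols: if the prior is Beta(a, b) and the data contain n_1 and n_2 cases of the two values, then
  \[ P(\theta \mid \text{data}) \propto \theta^{n_1}(1-\theta)^{n_2} \cdot \theta^{a-1}(1-\theta)^{b-1} = \theta^{a+n_1-1}(1-\theta)^{b+n_2-1}, \]
  i.e. the posterior is Beta(a + n_1, b + n_2).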
39Beta Prior (3)
- The expected value for the posterior distribution is shown in the sketch below.
- The prior parameters represent the prior probabilities for the values of the variable, based on our past experience.
- The value s is called the equivalent sample size; it measures the importance of our past experience.
- Larger values of s make the prior probabilities more important.
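- One common way to write this (taking the prior to be Beta(s·p_1, s·p_2), where p_1 and p_2 are the prior probabilities and s is the equivalent sample size; treat this parametrization as an assumption, since the lecture's notation may differ):
  \[ E[\theta \mid \text{data}] = \frac{s\,p_1 + n_1}{s + n_1 + n_2}, \]
  which tends to the MLE n_1 / (n_1 + n_2) as s → 0 and gives the prior probability p_1 more weight as s grows.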
40Beta Prior (4)
- When s → 0, then we have maximum likelihood estimation.
41Multinomial Experiments
- Now, assume that we have a variable taking values in a finite set {x_1, ..., x_k}, we have a series of independent observations of this distribution, and we want to estimate the values p_i = P(x_i), i = 1, ..., k.
- Let n_i be the number of cases in the sample in which we have obtained the value x_i.
- The MLE of p_i is n_i / n, where n is the sample size.
- The problems with small samples are completely analogous.
42Dirichlet Prior (1)
- We can also follow the Bayesian approach, but the prior distribution is the Dirichlet distribution, a generalization of the Beta distribution for more than 2 cases.
- The expression of the Dirichlet density is shown below.
- where s is the equivalent sample size.
43Dirichlet Prior (2)
- The expected vector is given below.
- A greater value of s makes this distribution more concentrated around the mean vector.
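- A sketch of the Dirichlet prior in the same (assumed) parametrization, with prior probabilities p_1, ..., p_k and equivalent sample size s, so that α_i = s·p_i:
  \[ f(\theta_1, \dots, \theta_k) = \frac{\Gamma(\alpha_1 + \dots + \alpha_k)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}, \qquad E[\theta_i] = \frac{\alpha_i}{s} = p_i. \]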
44Dirichlet Posterior
- If we have a set of data with counts n_1, ..., n_k, then the posterior distribution is also Dirichlet, with updated parameters.
- The Bayesian estimates of the probabilities are given below.
- where n = n_1 + ... + n_k.
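- In the same notation, with counts n_1, ..., n_k and n = n_1 + ... + n_k:
  \[ \text{posterior} = \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_k + n_k), \qquad \hat{p}_i = E[\theta_i \mid \text{data}] = \frac{n_i + s\,p_i}{n + s}. \]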
45Multinomial Example (1)
- Imagine that we have an urn with balls of different colors, red (R), blue (B) and green (G), but in unknown quantities.
- Assume that we pick up balls with replacement, obtaining the following sequence:
- .
46Multinomial Example (2)
- If we assume a Dirichlet prior distribution with parameters , then the estimated frequencies for red, blue and green can be computed as in the sketch below.
- Observe that green has a positive probability, even though it never appears in the sequence.
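- A small R sketch of this update; the draws and the Dirichlet parameters below are hypothetical, since the slide's actual sequence and prior are not reproduced here.

  # Dirichlet-multinomial update for the urn example (assumed numbers).
  prior  <- c(R = 1, B = 1, G = 1)    # assumed Dirichlet(1, 1, 1) prior
  counts <- c(R = 4, B = 2, G = 0)    # assumed draws, e.g. R, R, B, R, B, R

  posterior <- prior + counts          # posterior is Dirichlet(prior + counts)
  estimate  <- posterior / sum(posterior)
  estimate   # green keeps a positive estimate even though it never appeared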
47Part 3 An Example in Genetics
48Example 1 in Genetics (1)
- Two linked loci with alleles A and a, and B and b
- A, B dominant
- a, b recessive
- A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB, ab
49Example 1 in Genetics (2)
- Probabilities for genotypes in gametes:

             No Recombination    Recombination
  Male            1-r                 r
  Female          1-r                 r

             AB         ab         aB        Ab
  Male     (1-r)/2    (1-r)/2     r/2       r/2
  Female   (1-r)/2    (1-r)/2     r/2       r/2
50Example 1 in Genetics (3)
- Fisher, R. A. and Balmukand, B. (1928). The estimation of linkage from the offspring of selfed heterozygotes. Journal of Genetics, 20, 79-92.
- More
- http://en.wikipedia.org/wiki/Genetics
- http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf
51Example 1 in Genetics (4)
                          MALE
  FEMALE          AB (1-r)/2           ab (1-r)/2           aB r/2            Ab r/2
  AB (1-r)/2   AABB (1-r)(1-r)/4    aABb (1-r)(1-r)/4    aABB r(1-r)/4     AABb r(1-r)/4
  ab (1-r)/2   AaBb (1-r)(1-r)/4    aabb (1-r)(1-r)/4    aaBb r(1-r)/4     Aabb r(1-r)/4
  aB r/2       AaBB (1-r)r/4        aabB (1-r)r/4        aaBB r·r/4        AabB r·r/4
  Ab r/2       AABb (1-r)r/4        aAbb (1-r)r/4        aABb r·r/4        AAbb r·r/4
52Example 1 in Genetics (5)
- Four distinct phenotypes: AB, Ab, aB and ab.
- A: the dominant phenotype from (Aa, AA, aA).
- a: the recessive phenotype from aa.
- B: the dominant phenotype from (Bb, BB, bB).
- b: the recessive phenotype from bb.
- AB: 9 gametic combinations.
- Ab: 3 gametic combinations.
- aB: 3 gametic combinations.
- ab: 1 gametic combination.
- Total: 16 combinations.
53Example 1 in Genetics (6)
54Example 1 in Genetics (7)
- Hence, the random sample of n from the offspring of selfed heterozygotes will follow a multinomial distribution.
- We know that
- and
- So
55Bayesian for Example 1 in Genetics (1)
- To simplify computation, we let
- The random sample of n from the offspring of selfed heterozygotes will follow a multinomial distribution, as reconstructed below.
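- Reconstructing the probabilities from the genotype table on slide 51 (this matches the standard Fisher-Balmukand setup, though the lecture's symbols may differ): the four phenotype probabilities are
  \[ P(AB) = \frac{2 + (1-r)^2}{4}, \qquad P(Ab) = P(aB) = \frac{1 - (1-r)^2}{4}, \qquad P(ab) = \frac{(1-r)^2}{4}. \]
- Letting θ = (1-r)^2 be the simplifying parameter, the counts (n_1, n_2, n_3, n_4) of the four phenotypes in a sample of n offspring satisfy
  \[ (n_1, n_2, n_3, n_4) \sim \mathrm{Multinomial}\!\left(n;\ \frac{2+\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\right). \]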
56Bayesian for Example 1 in Genetics (2)
- We assume a Dirichlet prior distribution with parameters to estimate the probabilities for AB, Ab, aB and ab.
- Recall that
- AB: 9 gametic combinations.
- Ab: 3 gametic combinations.
- aB: 3 gametic combinations.
- ab: 1 gametic combination.
- We consider
57Bayesian for Example 1 in Genetics (3)
- Suppose that we observe the data of
- .
- So the posterior distribution is also Dirichlet with parameters
- The Bayesian estimates of the probabilities are
- Consider the original model,
- The random sample of n also follow a multinomial
distribution - We will assume a Beta prior distribution
59Bayesian for Example 1 in Genetics (5)
- The posterior distribution becomes
- The integration in the above denominator,
- does not have a close form.
60Bayesian for Example 1 in Genetics (6)
- How do we solve this problem? The Monte Carlo Markov Chain (MCMC) method!
- What value is appropriate for ?
61Part 4 Monte Carlo Methods
62Monte Carlo Methods (1)
- Consider the game of solitaire: what's the chance of winning with a properly shuffled deck?
- http://en.wikipedia.org/wiki/Monte_Carlo_method
- http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt
- Chance of winning is 1 in 4!
63Monte Carlo Methods (2)
- Hard to compute analytically because winning or losing depends on a complex procedure of reorganizing cards.
- Insight: why not just play a few hands, and see empirically how many do in fact win?
- More generally, we can approximate a probability density function using only samples from that density.
64Monte Carlo Methods (3)
- Given a very large set and a distribution over it.
- We draw a set of i.i.d. random samples.
- We can then approximate the distribution using these samples.
65Monte Carlo Methods (4)
- We can also use these samples to compute expectations
- And even use them to find a maximum
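- In symbols (a sketch): with i.i.d. samples x^(1), ..., x^(N) drawn from p(x),
  \[ p(x) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{x^{(i)} = x\}, \qquad E_p[f(X)] \approx \frac{1}{N} \sum_{i=1}^{N} f\big(x^{(i)}\big), \qquad \hat{x} \approx \arg\max_{x \in \{x^{(1)}, \dots, x^{(N)}\}} p(x). \]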
66Monte Carlo Example
- Let X_1, ..., X_n be i.i.d. N(0,1); find E(X^4).
- Solution
- Use the Monte Carlo method to approximate it:
- > x <- rnorm(100000)  # 100000 samples from N(0,1)
- > x <- x^4
- > mean(x)
- [1] 3.034175
67Exercises
- Write your own programs similar to those examples presented in this talk.
- Write programs for those examples mentioned at the reference web pages.
- Write programs for the other examples that you know.