Title: Nuts and Bolts: A Review of Probability Theory

1. Nuts and Bolts: A Review of Probability Theory
- Review classical, frequentist, and subjective interpretations of probability
- Probability axioms and definitions
- Conditional probability
- Bayes' Theorem
2. The Frequency Interpretation of Probability

- The frequency interpretation: the probability that some specific outcome of a process will be obtained can be interpreted as the relative frequency with which that outcome would be obtained if the process were repeated a large number of times under similar conditions.
- e.g., the probability of obtaining a head in a fair coin toss is ½ because the relative frequency of heads should be ½ if I were to flip the coin many times.
- How do p-values relate to the frequency interpretation of probability?
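The coin-toss intuition above can be checked directly by simulation. A minimal sketch (our own illustration, not from the slides; the `relative_frequency` helper is hypothetical):

```python
import random

def relative_frequency(n_flips, p_heads=0.5, seed=0):
    """Simulate n_flips tosses of a coin and return the relative frequency of heads."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_flips) if rng.random() < p_heads)
    return heads / n_flips

# The relative frequency settles near 1/2 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

With only 10 flips the relative frequency can stray far from ½; at 100,000 flips it sits very close, which is exactly the sense in which the frequentist interpretation assigns the coin a probability of ½.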
3. The Frequency Interpretation and the Sampling Distribution

- When we make statistical inferences from the frequentist perspective, we assume that our data are a sample from an entire population.
- The population is described by the population mean and the population variance, which are unknown.
- The sample is described by the sample mean and the sample variance.
- The sample mean and variance provide estimates of the mean and variance of the entire population.
- Importantly, these estimates are known only with some uncertainty.
- Our uncertainty about a statistic like the mean is summarized by its sampling distribution.
4. The Sampling Distribution

- The sampling distribution is a hypothetical distribution of all possible values of a statistic of interest for samples of size N that could be formed from a given population.
- The observed sample mean is just one realization from this distribution.
- Needless to say, this is a theoretical construct: with a large population there will be billions or even trillions of unique samples, and it would be easier to simply sample the entire population than to enumerate them.
- P-values refer to the proportion of hypothetical draws from the sampling distribution that are consistent with the null hypothesis.
- Question: if p-values are based on the concept of a sampling distribution, do they make sense if your data contain the entire population?
5. The Classical Interpretation of Probability

- The classical interpretation is based on the concept of equally likely outcomes.
- If the outcome of some process must be one of n different outcomes, and if these n outcomes are equally likely to occur, then the probability of each outcome is 1/n.
- e.g., if I were to flip a fair coin, the probability of a head would be ½ because heads and tails are equally likely outcomes.
6. The Appeal of the Classical Approach

- The classical approach offers an appealing summary of uncertainty in a one-shot situation.
- The figure below defines the sample space for a hypothetical experiment where outcomes are either a success or a failure. The probability of a success is the size of the success region over the size of the entire sample space.
- (Note: the success and failure regions are each defined by a series of equally likely outcomes, such as a 1 or 2 on a six-sided die counting as a success.)
[Figure: A Hypothetical Sample Space, divided into a series of equal-sized regions labeled Success and Failure]
7. Problems with the Classical Interpretation

- The drawback of the classical interpretation is that the concept of equally likely outcomes is itself probabilistic.
- In a sense, this makes the classical definition of probability circular.
- Furthermore, the concept begins to break down in contexts other than gambling, where events are not equally likely.
- The classical response is Laplace's Rule of Insufficient Reason: in the absence of compelling evidence to the contrary, we should assume that events are equally likely.
- This response is actually more useful to Bayesians when defending their priors than to classicists.
8. The Subjective Interpretation of Probability

- The probability that a person assigns to a possible outcome of some process represents his or her own judgment of the likelihood that the outcome will be obtained.
- In contrast to the classical and frequentist interpretations of probability, this means that different individuals could have different probability judgments.
- e.g., if I were to flip a fair coin, the probability of a head could be 3/4 because, for some reason, I think that God wants it to be a head.
9. Is Subjective Probability Theory Really That Ad Hoc?

- Not necessarily. Good Bayesians elicit priors in a manner that ensures coherence.
- Consider: the probability that an individual attributes to an event E is defined as the number p such that, for an arbitrary positive or negative stake S, the individual would be willing to exchange the certain quantity of money pS for a lottery in which she receives S if E occurs and zero otherwise.
- "This being granted, once an individual has evaluated the probabilities of certain events, two cases present themselves: either it is possible to bet with him in such a way as to be assured of winning, or else this possibility does not exist. In the first case, one should say that the evaluation of probabilities given by this individual contains an incoherence, an intrinsic contradiction; in the other we say the individual is coherent. It is precisely this condition of coherence which constitutes the sole principle from which one can deduce the whole calculus of probability." (de Finetti, Chapter 1)
- Extensions of de Finetti's axioms form the basis of subjective expected utility theory. Later chapters of the book introduce the concept of exchangeability, which we won't talk about today, but which is rather important to probability theory.
10. The Axiomatic Definition of Probability

- Suppose that for experimental model M, the sample space S of possible outcomes is defined such that A1, ..., An ⊆ S.
- Let Pr(Ai) = the probability of an event Ai in the sample space S.
- A probability distribution on a sample space S is a specification of numbers Pr(Ai) which satisfy axioms A1, A2, and A3:
- A1. For any event Ai, Pr(Ai) ≥ 0.
- A2. Pr(S) = 1.
- A3. For any infinite sequence of disjoint events A1, A2, ...,
  Pr(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ Pr(Ai).
- Note: it turns out that each of these three axioms can be justified using the coherence criterion.
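The axioms are easy to verify mechanically on a finite sample space, where A3 reduces to finite additivity. A minimal sketch (our own illustration; the `prob` helper and the two-outcome space are hypothetical):

```python
from fractions import Fraction

# A probability distribution on the finite sample space S = {H, T}.
S = {"H", "T"}
pr = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

def prob(event):
    """Pr(event), where an event is any subset of S."""
    return sum((pr[o] for o in event), Fraction(0))

assert all(prob({o}) >= 0 for o in S)   # A1: nonnegativity
assert prob(S) == 1                     # A2: Pr(S) = 1
# A3 in its finite form: for disjoint A and B, Pr(A ∪ B) = Pr(A) + Pr(B).
A, B = {"H"}, {"T"}
assert A & B == set()
assert prob(A | B) == prob(A) + prob(B)
```

Using `Fraction` keeps the arithmetic exact, so the additivity check is an equality rather than a floating-point approximation.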
11. Some Theorems Based on the Definition of Probability, and a Few Proofs

- Theorem 1. Pr(∅) = 0.
- Proof:
- By definition, Aj and Ak are disjoint if Aj ∩ Ak = ∅.
- Further, it is obvious that ∅ ∩ ∅ = ∅.
- Thus, if Aj = ∅ and Ak = ∅, then Aj and Ak are disjoint.
- Let A1, A2, ... define an infinite sequence of events such that each Ai = ∅.
- By the above definitions, it follows that the events Ai are disjoint.
- Since the Ai are disjoint, we can exploit A3 such that
  Pr(∅) = Pr(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ Pr(Ai) = Σ_{i=1}^∞ Pr(∅).
- The only number that equals an infinite sum of copies of itself is 0, so Pr(∅) = 0.
12. Some Theorems, cont.

- Theorem 2. For any finite sequence of n disjoint events A1, ..., An,
  Pr(∪_{i=1}^n Ai) = Σ_{i=1}^n Pr(Ai).
- Proof:
- Let A1, ..., An define the n disjoint events, and let Ak = ∅ for k = n+1, n+2, ....
- By the definition of disjoint events, we now have an infinite sequence of disjoint events.
- By A3 and Theorem 1 (which states that Pr(∅) = 0),
  Pr(∪_{i=1}^n Ai) = Pr(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ Pr(Ai)
                   = Σ_{i=1}^n Pr(Ai) + Σ_{i=n+1}^∞ Pr(Ai)
                   = Σ_{i=1}^n Pr(Ai) + 0
                   = Σ_{i=1}^n Pr(Ai).
13. Some Theorems, cont.

- Theorem 3. For any event A, Pr(A^C) = 1 - Pr(A).
- Theorem 4. For any event A, 0 ≤ Pr(A) ≤ 1.
- Proof (by contradiction, in two parts):
- Part 1. Suppose Pr(A) < 0. That would violate axiom A1, a contradiction.
- Part 2. Suppose Pr(A) > 1. Then by Theorem 3, Pr(A^C) < 0, which also contradicts A1.
- Thus, 0 ≤ Pr(A) ≤ 1.
14. Some Theorems, cont.

- Theorem 5. For any two events A and B,
  Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B).
- Proof:
- A ∪ B = (A ∩ B^C) ∪ (A ∩ B) ∪ (A^C ∩ B).
- Since all three sets on the right-hand side are disjoint, Theorem 2 implies
  Pr(A ∪ B) = Pr(A ∩ B^C) + Pr(A ∩ B) + Pr(A^C ∩ B)
            = Pr(A ∩ B^C) + Pr(A ∩ B) + Pr(A^C ∩ B) + Pr(A ∩ B) - Pr(A ∩ B).
- Further, we know that Pr(A) = Pr(A ∩ B^C) + Pr(A ∩ B)
- and that Pr(B) = Pr(A ∩ B) + Pr(A^C ∩ B).
- Thus, Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B).
15. Independent Events

- Intuitively, we define independence as follows:
- Two events A and B are independent if the occurrence or non-occurrence of one of the events has no influence on the occurrence or non-occurrence of the other event.
- Mathematically, we define independence as follows:
- Two events A and B are independent if Pr(A ∩ B) = Pr(A)Pr(B).
16. Example of Independence

- Are party ID and vote choice independent in presidential elections?
- Suppose Pr(Rep. ID) = .4, Pr(Rep. Vote) = .5, and Pr(Rep. ID ∩ Rep. Vote) = .35.
- To test for independence, we ask whether
  Pr(Rep. ID) × Pr(Rep. Vote) = .35.
- Substituting into the equation, we find that
  Pr(Rep. ID) × Pr(Rep. Vote) = .4 × .5 = .2 ≠ .35,
- so the events are not independent.
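The independence test on this slide is a one-line comparison. A minimal sketch (our own helper, not from the slides; the tolerance parameter is an implementation detail for floating-point arithmetic):

```python
def independent(p_a, p_b, p_ab, tol=1e-9):
    """Check whether Pr(A ∩ B) equals Pr(A) * Pr(B), within a small tolerance."""
    return abs(p_ab - p_a * p_b) < tol

# The slide's numbers: .4 * .5 = .2, which is not .35, so dependence.
print(independent(0.4, 0.5, 0.35))  # False
# A hypothetical joint probability of .20 would satisfy independence.
print(independent(0.4, 0.5, 0.20))  # True
```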
17. Independence of Several Events

- The events A1, ..., An are independent if
  Pr(A1 ∩ A2 ∩ ... ∩ An) = Pr(A1)Pr(A2)···Pr(An),
- and this identity must hold for every subset of the events.
18. Conditional Probability

- Conditional probabilities allow us to understand how the probability of an event A changes after it has been learned that some other event B has occurred.
- The key concept for thinking about conditional probabilities is that the occurrence of B reshapes the sample space for subsequent events.
- That is, we begin with a sample space S,
- with A and B ⊆ S.
- The conditional probability of A given B looks just at the subset of the sample space where B occurs.
- The conditional probability of A given B is denoted Pr(A | B).
- Importantly, according to Bayesian orthodoxy, all probability distributions are implicitly or explicitly conditioned on the model.
[Figure: a sample space S containing overlapping regions A and B, with Pr(A | B) indicated on the overlap]
19. Conditional Probability, cont.

- By definition: if A and B are two events such that Pr(B) > 0, then
  Pr(A | B) = Pr(A ∩ B) / Pr(B).

[Figure: the same sample space S, with Pr(A | B) shown as the part of A lying inside B]

Example: What is Pr(Republican Vote | Republican Identifier)?
Pr(Rep. Vote ∩ Rep. ID) = .35 and Pr(Rep. ID) = .4.
Thus, Pr(Republican Vote | Republican Identifier) = .35 / .4 = .875.
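The definition translates directly into code. A minimal sketch (our own helper, not from the slides), reproducing the party-ID example:

```python
def conditional(p_a_and_b, p_b):
    """Pr(A | B) = Pr(A ∩ B) / Pr(B); defined only when Pr(B) > 0."""
    if p_b <= 0:
        raise ValueError("Pr(B) must be positive")
    return p_a_and_b / p_b

# The slide's example: Pr(Rep. Vote | Rep. ID) = .35 / .4 = .875
print(conditional(0.35, 0.40))
```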
20. Useful Properties of Conditional Probabilities

- Property 1. The conditional probability for independent events:
  If A and B are independent events, then Pr(A | B) = Pr(A).
- Property 2. The multiplication rule for conditional probabilities:
  In an experiment involving two non-independent events A and B, the probability that both A and B occur can be found in the following two ways:
  Pr(A ∩ B) = Pr(A | B) Pr(B) = Pr(B | A) Pr(A).
21. Conditional Probability and Partitions of a Sample Space

- The set of events A1, ..., Ak forms a partition of a sample space S if the events are disjoint and ∪_{i=1}^k Ai = S.
- If the events A1, ..., Ak partition S and if B is any other event in S (note that Ai ∩ B may equal ∅ for some i), then the events A1 ∩ B, A2 ∩ B, ..., Ak ∩ B will form a partition of B.
- Thus, B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ ... ∪ (Ak ∩ B)
- and Pr(B) = Σ_{i=1}^k Pr(Ai ∩ B).
- Finally, if Pr(Ai) > 0 for all i, then
  Pr(B) = Σ_{i=1}^k Pr(B | Ai) Pr(Ai).
22. Example of Conditional Probability and Partitions of a Sample Space

- Pr(B) = Σ_{i=1}^k Pr(B | Ai) Pr(Ai)
- Example: What is the probability of a Republican vote?
  Pr(Rep. Vote) = Pr(Rep. Vote | Rep. ID) Pr(Rep. ID)
                + Pr(Rep. Vote | Ind. ID) Pr(Ind. ID)
                + Pr(Rep. Vote | Dem. ID) Pr(Dem. ID)
- Note: the definition of Pr(B) above provides the denominator for Bayes' Theorem.
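The law of total probability is a weighted sum over the partition. A minimal sketch (our own helper; the slide gives no numbers here, so the vote probabilities and party shares below are hypothetical):

```python
def total_probability(conditionals, priors):
    """Pr(B) = sum over i of Pr(B | Ai) * Pr(Ai), for a partition A1, ..., Ak."""
    assert abs(sum(priors) - 1.0) < 1e-9, "partition probabilities must sum to 1"
    return sum(c * p for c, p in zip(conditionals, priors))

# Hypothetical numbers: Pr(Rep. Vote | ID) for Rep., Ind., and Dem. identifiers.
p_vote_given_id = [0.90, 0.50, 0.05]
p_id = [0.40, 0.25, 0.35]
print(total_probability(p_vote_given_id, p_id))  # .9*.4 + .5*.25 + .05*.35
```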
23. Bayes' Theorem (Rule, Law)

- Bayes' Theorem: Let events A1, ..., Ak form a partition of the space S such that Pr(Aj) > 0 for all j, and let B be any event such that Pr(B) > 0. Then for i = 1, ..., k,
  Pr(Ai | B) = Pr(B | Ai) Pr(Ai) / Σ_{j=1}^k Pr(B | Aj) Pr(Aj).
- Proof: by the definition of conditional probability, Pr(Ai | B) = Pr(Ai ∩ B) / Pr(B). The multiplication rule gives the numerator, Pr(Ai ∩ B) = Pr(B | Ai) Pr(Ai), and the law of total probability gives the denominator, Pr(B) = Σ_{j=1}^k Pr(B | Aj) Pr(Aj).
- Bayes' Theorem is just a simple rule for computing the conditional probability of events Ai given B from the conditional probability of B given each event Ai and the unconditional probability of each Ai.
24. Interpretation of Bayes' Theorem

- Pr(Ai): the prior distribution for the Ai. It summarizes your beliefs about the probability of event Ai before Ai or B is observed.
- Pr(B | Ai): the conditional probability of B given Ai. It summarizes the likelihood of event B given Ai.
- Σ_k Pr(Ak) Pr(B | Ak): the normalizing constant. This is equal to the sum of the quantities in the numerator over all events Ak. Thus, Pr(Ai | B) represents the likelihood of event Ai relative to all other elements of the partition of the sample space.
- Pr(Ai | B): the posterior distribution of Ai given B. It represents the probability of event Ai after B has been observed.
25. Example of Bayes' Theorem

- What is the probability in a survey that someone is black given that they respond that they are black when asked?
- Suppose that 10% of the population is black, so Pr(B) = .10.
- Suppose that 95% of blacks respond "Yes" when asked if they are black, so Pr(Y1 | B) = .95.
- Suppose that 5% of non-blacks respond "Yes" when asked if they are black, so Pr(Y1 | B^C) = .05.
- By Bayes' Theorem,
  Pr(B | Y1) = (.95)(.10) / [(.95)(.10) + (.05)(.90)] = .095 / .14 ≈ .679.
- We reach the surprising conclusion that even if 95% of black and non-black respondents correctly classify themselves according to race, the probability that someone is black given that they say they are black is less than .7.
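The calculation above can be sketched with a small helper (our own, not from the slides) that normalizes the numerators over the whole partition:

```python
def bayes(prior, likelihoods):
    """Posterior over a partition: Pr(Ai | B) = Pr(B|Ai)Pr(Ai) / sum_j Pr(B|Aj)Pr(Aj)."""
    numerators = [lik * p for lik, p in zip(likelihoods, prior)]
    norm = sum(numerators)
    return [n / norm for n in numerators]

# Partition {B, B^C} with the slide's numbers: Pr(B) = .10, Pr(Y1|B) = .95, Pr(Y1|B^C) = .05.
posterior = bayes(prior=[0.10, 0.90], likelihoods=[0.95, 0.05])
print(posterior[0])  # Pr(B | Y1) = .095 / .14, just under .7
```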
26. Combining Data

- When applying Bayes' Theorem, the order in which you collect the data doesn't matter.
- It also doesn't matter whether you peek at the data halfway through an experiment.
27. Example, cont.

- Continuing the last example, suppose that the interviewer also makes an estimate of the respondent's race. Let's say the interviewer correctly classifies 90 percent of respondents, and her classification is independent of the self-classification.
- Thus, Pr(Y2 | B) = .9 and Pr(Y2 | B^C) = .1.
- One way to incorporate the new information is to recalculate our estimate from scratch using both responses at once. Alternatively, we can just update our last set of results, treating the earlier posterior Pr(B | Y1) as the new prior:
  Pr(B | Y1, Y2) = Pr(Y2 | B) Pr(B | Y1) / [Pr(Y2 | B) Pr(B | Y1) + Pr(Y2 | B^C) Pr(B^C | Y1)].
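Both routes give the same posterior, which is the point of the slide. A minimal sketch (our own `update` helper, specialized to the binary partition {B, B^C}; the independence of the two responses lets the from-scratch likelihood factor into a product):

```python
def update(prior_b, lik_b, lik_bc):
    """One Bayesian update for the binary partition {B, B^C}."""
    num = lik_b * prior_b
    return num / (num + lik_bc * (1 - prior_b))

# Route 1: from scratch, multiplying the two independent likelihoods together.
from_scratch = update(0.10, 0.95 * 0.90, 0.05 * 0.10)

# Route 2: sequentially, feeding the first posterior back in as the prior.
after_y1 = update(0.10, 0.95, 0.05)
sequential = update(after_y1, 0.90, 0.10)

print(from_scratch, sequential)  # the two routes agree
```

The agreement of `from_scratch` and `sequential` illustrates the previous slide's claim that neither the order of the data nor peeking halfway through changes the final posterior.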