Title: Continuous Probability Distributions
Chapter 8: Continuous Probability Distributions
- Continuous probability distributions are typically associated with ratio scales
  - Height: how likely is it that a child in the class is 1.7 meters tall?
  - Finance: what are the chances that the ratio of first- and second-quarter profits will be ≥ 1.25?
  - Vision science: at what wavelength (measured in nm) of the electromagnetic spectrum are the human M photoreceptors maximally receptive?
  - Physics: the magnetic moment of the electron is 1.001159652201 ± 0.000000000030
- Similar to discrete probability distributions, continuous probability distributions identify the events in a probability space with sets of numbers on the number line.
- A function f is a probability density function (pdf) if
  - f(x) ≥ 0, for every number x
  - $\int_{-\infty}^{\infty} f(x)\,dx = 1$
- If f is the pdf of a continuous distribution, then $F(x) = \int_{-\infty}^{x} f(t)\,dt$ is the cumulative distribution function (cdf) of that distribution.
- We define the probability of the event of the random variable yielding a value less than some number a as
  - Pr(X < a) = F(a)
- Similarly, the probability of X being greater than a is
  - Pr(X > a) = 1 − F(a)
- We define the probability of the event of the random variable yielding a value in the interval (a, b) as
  - Pr(a ≤ X ≤ b) = F(b) − F(a) (see the code sketch below)
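These identities are easy to check numerically. Below is a minimal sketch, assuming Python with SciPy is available (the slides themselves use Eviews), using a standard normal X purely as a stand-in for "some continuous distribution".

```python
# Probabilities from a cdf: Pr(X < a) = F(a), Pr(X > a) = 1 - F(a),
# Pr(a <= X <= b) = F(b) - F(a).
from scipy.stats import norm

X = norm(loc=0, scale=1)        # X ~ N(0, 1), an illustrative choice
a, b = -0.5, 1.0

print(X.cdf(a))                 # Pr(X < a)  = F(a)
print(1 - X.cdf(a))             # Pr(X > a)  = 1 - F(a)
print(X.cdf(b) - X.cdf(a))      # Pr(a <= X <= b) = F(b) - F(a)
```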
- We would like to understand a continuous probability distribution like the one plotted on this slide.
- So let's approximate it with a grouped (discrete) version:
  - We ignore the extreme values whose absolute value exceeds 4.
  - We use cell marks (cf. chap. 3) to estimate the probability of falling within a given range.
- Because of the way we take cell marks, with only a few categories our accuracy is limited.
  - With more categories, we'll get more accurate.
- Now we can figure out the probabilities of being in one of these categories.
- Just as in the previous chapter, we can represent these probabilities precisely and completely with a histogram.
- At this point it is crucial to remember that histograms express probabilities of events (i.e., the probability of being in one of these intervals) as the area of the histogram corresponding to the event.
- The probability that X yields a value between 0.5 and 1 is the area of the histogram bars over that interval.
- In probability theory, we demand that the total area of the histogram = 1.
  - (This contrasts with how Eviews handles sample distributions.)
- So the bar areas (the cell probabilities) must sum to 1.
- Let's check our accuracy on Eviews.
- We now have a probability distribution for 9 possible categories.
  - Each category is an interval of possible values.
- We trimmed off the extremities: the values greater than 4 or less than −4.
  - We'll leave these extremities alone for now.
- But why stop with just 9 categories?
- Let's make a more fine-grained histogram, one with 20 categories.
- How about 40 categories? 100? 1000? (A code sketch of this idea follows below.)
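A minimal sketch of this refinement process, assuming NumPy and Matplotlib are available, with a standard normal sample chosen purely for illustration: density histograms with ever finer cells approach a smooth curve.

```python
# Density histograms with more and more categories approximate the pdf.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
x = x[np.abs(x) <= 4]              # trim the extremities, as on the slides

for n_cells in (9, 20, 100):
    plt.hist(x, bins=n_cells, range=(-4, 4), density=True, alpha=0.4,
             label=f"{n_cells} categories")   # density=True makes total area = 1

plt.legend()
plt.show()
```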
- If you remember your calculus, you can see what we're doing here.
- We are creating increasingly fine-grained (discrete) approximations of a continuous curve.
- We finish off (this part of) our project by going whole-hog.
  - We don't stop with n = 100, or n = 10 million.
  - Instead, we let n go to infinity.
- Let's look at this situation a bit more carefully.
- For any number n you like (for ease, let's assume n > 10):
  - We create a partition of the interval (a, b) by specifying n + 1 points, all equally spaced apart.
  - Thus a = c₀, b = cₙ, and for every cᵢ (i ≤ n), $c_i = a + i\,\frac{b-a}{n}$.
- Thus, intuitively speaking, our probability distribution (leaving out the extremities for now) turns out to be represented by the histogram of the grouped data for n groups, but with n → ∞, and each category containing a single number.
- Let's use Length(cᵢ) and Height(cᵢ) to denote the length of category cᵢ (= cᵢ − cᵢ₋₁) and the height of the bar associated with cᵢ.
- In the limit, the tops of the histogram bars trace out a function f: $f(x) = \lim_{n \to \infty} \mathrm{Height}(c_i)$, where cᵢ is the cell containing x.
- In our current example, a = −4 and b = 4.
- f is the pdf of the continuous distribution.
- It characterizes how probabilities are distributed across the infinitely many numbers in the interval (a, b).
- It replaces the probability function pr(cᵢ₋₁ < X ≤ cᵢ) used in our discrete distributions.
Extending the distribution to the entire line of real numbers
- Let's now turn to those numbers outside of (a, b) that we've ignored so far.
- To make the situation visually more obvious, let's pretend we were working with the interval (−1.5, 1.5) instead of (−4, 4).
- So far, we've seen how to go from a coarse histogram to a smooth curve on the interval (−1.5, 1.5).
- Notice that by working with (−1.5, 1.5) our estimation of the curve is forced to be more inaccurate,
  - because the area under the curve must be 1.
- But now what about those extremities that we've been ignoring?
- We want our theory to allow every number to be a possible value, not just those between a and b.
- So we need to extend our theory just a little bit more.
- We will do what we just did, but we will extend each boundary by some quantity m:
  - (a − m, b + m)
  - E.g. (−1.5 − 0.5, 1.5 + 0.5)
  - So our new interval will be (−2, 2)
- Notice also how our approximation improves.
- Let's make m even larger, and go from (−2, 2) to (−3, 3).
- Now our approximation is getting pretty good.
- The remaining probabilities that we haven't yet accounted for,
  - pr(X ≤ −3) and pr(X ≥ 3),
- are rather small, but that doesn't matter here.
- We can continue extending our probability space by setting
  - m = 5
  - m = 6
  - m = 60
  - m = 10,000
- Let's examine three features of the pdf f:
  - Our construction of f ensures that pr(X = c) = 0, for any number c.
  - f is a derivative.
  - f is not a probability function.
- Some preliminaries.
- Recall that pr(cᵢ₋₁ < X ≤ cᵢ) = F(cᵢ) − F(cᵢ₋₁) = Height(cᵢ) × Length(cᵢ).
- More specifically, for any appropriate n and i, such as n = 100 and i = 32,
  $\mathrm{Height}(c_i) = \dfrac{\mathrm{pr}(c_{i-1} < X \le c_i)}{\mathrm{Length}(c_i)} = \dfrac{F(c_i) - F(c_{i-1})}{c_i - c_{i-1}}$
- 1. pr(X = k) = 0 for all numbers k.
- Earlier we showed that pr(cᵢ₋₁ < X ≤ cᵢ) = Height(cᵢ) × Length(cᵢ).
- Hence, for the cell cᵢ containing k, pr(X = k) ≤ Height(cᵢ) × Length(cᵢ).
- But as n gets very large, the length of every cell cᵢ (= cᵢ − cᵢ₋₁) gets very small, so this bound forces pr(X = k) = 0.
- 2. f is a derivative.
- From our preliminaries, we have $\mathrm{Height}(c_i) = \dfrac{F(c_i) - F(c_{i-1})}{c_i - c_{i-1}}$.
- Hence each bar height is a difference quotient of F.
- Recall that f(x) is the limit, as n → ∞, of Height(cᵢ) for the cell cᵢ containing x.
- Notice also that each Height(cᵢ) is a difference quotient of F.
- So we can argue that f(x) is the limit of these difference quotients as the cell lengths shrink to 0.
- So, in conclusion, we have $f(x) = \lim_{n \to \infty} \dfrac{F(c_i) - F(c_{i-1})}{c_i - c_{i-1}}$ (with cᵢ the cell containing x).
- But this means that f is a derivative.
- Question: Is it possible to put this last equality in the form for derivatives given by my calculus book, $F'(x) = \lim_{h \to 0} \dfrac{F(x+h) - F(x)}{h}$?
  - Here, f = F′.
- Let h = (b − a)/n, the common length of the cells.
- So h is determined by n, and as n gets large, h gets arbitrarily small.
- For each h, we can define a function $f_h(x) = \dfrac{F(c_i) - F(c_{i-1})}{h}$,
  - where cᵢ₋₁ < x ≤ cᵢ. (A numerical sketch of this difference quotient follows below.)
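A minimal numerical sketch of "f is a derivative", assuming SciPy and using the standard normal as an illustrative distribution: the difference quotients of the cdf F approach the pdf f as h shrinks.

```python
# (F(x + h) - F(x)) / h  ->  f(x)  as  h -> 0
from scipy.stats import norm

x = 0.7
f_exact = norm.pdf(x)                     # f(x)
for h in (0.5, 0.1, 0.01, 0.001):
    quotient = (norm.cdf(x + h) - norm.cdf(x)) / h
    print(h, quotient, f_exact)           # the quotient approaches f(x)
```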
- There is another way that we can tell that f is a derivative.
- From the Fundamental Theorem of Calculus, we have the relationship $F(b) - F(a) = \int_a^b f(x)\,dx$, so that F′ = f.
- 3. f is not a probability function.
- Notice that pdfs take single numbers as their arguments; probability functions take sets of numbers as their arguments.
- A concrete (counter-)example (sketched in code below):
  - Sometimes f takes on values greater than 1.
  - Probability functions, by definition, cannot do this!
  - But for any 0 ≤ a ≤ b ≤ 1, pr(a ≤ X ≤ b) ≤ 1.
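A minimal sketch, assuming SciPy, of a density whose values exceed 1. The uniform distribution on (0, 0.5) is an illustrative choice (not necessarily the slide's own example): its pdf equals 2 on that interval, yet every probability it assigns is still at most 1.

```python
from scipy.stats import uniform

U = uniform(loc=0, scale=0.5)     # uniform distribution on (0, 0.5)
print(U.pdf(0.25))                # 2.0  -- a density value greater than 1
print(U.cdf(0.5) - U.cdf(0.0))    # 1.0  -- but probabilities never exceed 1
```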
The Uniform Distribution
- The uniform distribution is simple but important.
- The uniform distribution over the interval (a, b) is defined by the pdf $f(x) = \frac{1}{b-a}$ for a < x < b, and f(x) = 0 otherwise.
- Here is the uniform distribution on (0, 1).
- Here is the uniform distribution on (−2, 14).
- Here are the cdfs of the two distributions. Why is the cdf F(x) = (x − a)/(b − a) (for a < x < b)?
- In general, the cdf of U(a, b) (i.e., the uniform distribution on the interval from a to b) is F(x) = 0 for x ≤ a, F(x) = (x − a)/(b − a) for a < x < b, and F(x) = 1 for x ≥ b. (A quick check follows below.)
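A minimal sketch, assuming SciPy, checking the U(a, b) cdf formula against the library implementation.

```python
from scipy.stats import uniform

a, b = -2.0, 14.0
U = uniform(loc=a, scale=b - a)   # SciPy parameterizes U(a, b) as loc=a, scale=b-a

x = 6.0
print(U.cdf(x))                   # library cdf
print((x - a) / (b - a))          # the formula (x - a)/(b - a); the two agree
```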
- The uniform distribution is useful in cases where a number is known (or assumed) to fall within a definite finite interval, and you have no further information about what that number might be.
  - Since you have no reason to treat one number as more likely than another, you give them all the same density.
- The uniform distribution appears when all the data must appear in some fixed interval, but there is absolutely no further information or structure that would bias the random variable to take one value rather than another.
Example
- The uniform distribution is often used as a kind of null or default hypothesis regarding the distribution of probabilities within a population.
- E.g., in a situation where people have varying degrees of tendency to visit McDonald's over Burger King, the least informative hypothesis would be a uniform distribution of probabilities (on the interval [0, 1]).
- The uniform distribution on (a, b) is a probability distribution: the density is never negative, and $\int_a^b \frac{1}{b-a}\,dx = \frac{b-a}{b-a} = 1$.
Expectations
- Expectations are defined similarly to those for discrete random variables.
- If X is a continuous random variable whose pdf is f, then $E(X) = \int_{-\infty}^{\infty} x\, f(x)\,dx$.
- Using this definition, we can also define the variance, standard deviation, etc. of X:
  $\mathrm{Var}(X) = E\big[(X - \mu)^2\big] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$, where μ = E(X).
- You should be able to calculate that if X ~ U(a, b), then
  $E(X) = \dfrac{a+b}{2}$ and $\mathrm{Var}(X) = \dfrac{(b-a)^2}{12}$ (verified symbolically below).
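A minimal sketch, assuming SymPy, verifying the U(a, b) mean and variance by integrating against the density 1/(b − a), exactly as in the definitions above.

```python
import sympy as sp

x, a, b = sp.symbols("x a b", real=True)
f = 1 / (b - a)                                                   # pdf of U(a, b) on (a, b)

mean = sp.simplify(sp.integrate(x * f, (x, a, b)))                # E(X)
var = sp.simplify(sp.integrate((x - mean) ** 2 * f, (x, a, b)))   # Var(X)

print(mean)              # a/2 + b/2, i.e. (a + b)/2
print(sp.factor(var))    # (a - b)**2/12, i.e. (b - a)**2/12
```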
- Importantly, everything we have proven about expectations for discrete random variables holds for continuous random variables.
  - The linearity of expectations holds: E(a + bX) = a + bE(X).
- Whether we use the linearity of expectations or calculate directly from our definitions, for any continuously distributed random variable X whose mean is μ and standard deviation is σ, we have E(Z) = 0 and Var(Z) = 1,
  - where Z = (X − μ)/σ is the standardization of X.
The Normal Distribution
- The normal distribution (aka the Gaussian
distribution) is probably the most common
distribution in all of science.
- [Figure: the pdf of N(0, 1)]
- The pdf of the normal distribution is
  $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- In the case where μ = 0, σ = 1, this equation simplifies to (and has a special name, the standard normal pdf)
  $\varphi(x) = \dfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$
- The cdf of the normal distribution is
  $F(x) = \int_{-\infty}^{x} \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\mu)^2}{2\sigma^2}}\,dt$
- In the case where μ = 0, σ = 1, this equation simplifies to (and has a special name, the standard normal cdf)
  $\Phi(x) = \int_{-\infty}^{x} \dfrac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt$
- We often use expressions like
  - N(μ, σ²),
  - which is shorthand for "the normal distribution with a mean of μ and a variance of σ²."
- We also write things like
  - X ~ N(μ, σ²),
  - which is shorthand for "X is normally distributed, with a mean of μ and a variance of σ²." (See the sketch below.)
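A minimal sketch, assuming SciPy, of working with a random variable X ~ N(μ, σ²). Note that SciPy's norm is parameterized by the standard deviation σ, not the variance σ².

```python
from scipy.stats import norm

mu, sigma = 3.0, 2.0            # X ~ N(3, 4): mean 3, variance 4
X = norm(loc=mu, scale=sigma)

print(X.pdf(3.0))               # density at the mean, 1/(sigma*sqrt(2*pi))
print(X.cdf(3.0))               # 0.5: half the probability lies below the mean
print(X.mean(), X.var())        # 3.0 and 4.0
```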
- [Figures: pdfs of X ~ N(0, 1), N(3, 1), N(6, 1), N(16, 1), N(3.1, 1), N(0, 1), N(0, 3), and N(0, 5)]
- X ~ N(0, 1), Y ~ N(0, 3), Z ~ N(0, 5)
- For any N(μ, σ²), where is the high point of the pdf? (At x = μ.)
- For any N(μ, σ²),
  - the standardized skew is 0,
  - the standardized kurtosis is 3.
- Notice that the pdf for N(μ, σ²) can be seen as using the standardization of X:
  $f(x) = \dfrac{1}{\sigma}\,\varphi\!\left(\dfrac{x-\mu}{\sigma}\right)$,
  - where $z = \dfrac{x-\mu}{\sigma}$ and φ is the standard normal pdf.
- It is easy to turn one normal distribution into another (a simulation check follows below).
  - If X ~ N(0, 1), and Y = a + bX (b ≠ 0), then
    - Y ~ N(a, b²)
  - If X ~ N(μ, σ²), and Y = (X − μ)/σ, then
    - Y ~ N(0, 1)
  - If X ~ N(μ, σ²), and Y = a + bX (b ≠ 0), then
    - Y ~ N(bμ + a, (bσ)²)
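A minimal simulation sketch, assuming NumPy, of the last rule above: if X ~ N(μ, σ²) and Y = a + bX, then Y ~ N(bμ + a, (bσ)²). The particular numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
a, b = -1.0, 3.0

x = rng.normal(mu, sigma, size=1_000_000)   # draws from N(mu, sigma^2)
y = a + b * x

print(y.mean(), b * mu + a)                 # sample mean vs. b*mu + a = 5.0
print(y.var(), (b * sigma) ** 2)            # sample variance vs. (b*sigma)^2 = 20.25
```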
- Two numbers that you will encounter are 1.64 and 1.96 (see the sketch below).
  - 1.64: For any normally distributed X, there is a 95% chance that the value of X will be less than μ + (1.64 × σ).
  - 1.64: For any normally distributed X, there is a 95% chance that the value of X will be greater than μ − (1.64 × σ).
  - 1.96: For any normally distributed X, there is a 95% chance that the value of X will be within 1.96 standard deviations of the mean.
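A minimal sketch, assuming SciPy, of where 1.64 and 1.96 come from: they are quantiles of the standard normal distribution.

```python
from scipy.stats import norm

print(norm.ppf(0.95))                     # ~1.645: Pr(X < mu + 1.645*sigma) = 0.95
print(norm.ppf(0.975))                    # ~1.960: Pr(|X - mu| < 1.96*sigma) = 0.95
print(norm.cdf(1.96) - norm.cdf(-1.96))   # ~0.95, the two-sided check
```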
The Central Limit Theorem
- The CLT is a big part of why the normal distribution is so important to science.
- Given the way we typically do empirical science, by collecting random samples, in a certain sense, regardless of what distribution they are coming from, as the sample size gets large, the mean of the random sample becomes approximately normally distributed.
- A random sample is a collection of n many i.i.d. random variables X₁, …, Xₙ.
  - Notice the capital letters: these are random variables, not known quantities.
- The i.i.d. part is important.
  - Independent and Identically Distributed.
- But let the distribution that they all have in common be any probability distribution in the world that you like.
- Then as n gets very large, the sum of the standardizations of the Xᵢs (divided by n^{1/2}) approaches a normal distribution with a mean of 0 and a variance of 1.
- This is called the Central Limit Theorem.
- More carefully, it says:
  - If X₁, …, Xₙ are i.i.d. random variables from a distribution with mean μ and variance σ², then
    $\dfrac{1}{\sqrt{n}} \sum_{i=1}^{n} \dfrac{X_i - \mu}{\sigma} \;\xrightarrow{\;d\;}\; N(0, 1)$ as n → ∞.
- In short, the Central Limit Theorem says that the variability in the whole of any (large) random sample is approximately distributed as N(0, 1). (A simulation sketch follows below.)
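A minimal simulation sketch, assuming NumPy and SciPy, of the statement above: standardized sums of i.i.d. draws from a very non-normal distribution (exponential, chosen purely for illustration) look approximately N(0, 1) for large n.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
mu, sigma = 1.0, 1.0                    # mean and sd of the Exponential(1) distribution

samples = rng.exponential(scale=1.0, size=(reps, n))
z = ((samples - mu) / sigma).sum(axis=1) / np.sqrt(n)   # (1/sqrt(n)) * sum of standardizations

print(z.mean(), z.std())                    # close to 0 and 1
print(np.mean(z <= 1.96), norm.cdf(1.96))   # empirical vs. theoretical, both ~0.975
```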
- What this means is that if you randomly sample a population,
  - children, cities, cancer patients, purchases, etc.,
- then if you standardize these measurements and add them up,
- the result, divided by n^{1/2}, can of course still vary some,
  - you could've sampled different children, patients, …,
- but will nevertheless vary with a distribution that is similar to N(0, 1), especially as n gets large.
- Notice that the only sources of uncertainty come from the i.i.d. random variables X₁, …, Xₙ.
- Thus, the Central Limit Theorem tells us that the sum of all these random variables is approximately normally distributed.
  - This distribution will be as N(nμ, nσ²), not necessarily as N(0, 1).
- The CLT explains why normal distributions are fairly common.
- When a population is made up of individuals who are all of the same general type, and who differ from one another due to a large number of influences that are themselves mutually independent, the resulting population will often be (approximately) normally distributed.
- E.g., people are roughly the same height, but heights differ due to many largely independent influences:
  - diet, various genetic propensities, illness during adolescence, age, amputation, etc.
- Thus, this population (of human heights) might naturally be modeled as a random variable Y = Y₁ + Y₂ + … + Yₙ,
  - where the Yᵢ are the individual influences, and
  - Y is (approximately) normally distributed.
- Since Y = Y₁ + … + Yₙ, the CLT tells us that Y will be approximately normally distributed.
- And if we know the mean and variance of the Yᵢs, then if we can approximate n, we can approximate the precise distribution of Y.
- In our height example, the Yᵢs might be:
  - Y₁ = quality of diet during adolescence
  - Y₂ = racial/ethnic background (on a good scale)
  - Y₃ = degree of height propensity from some given genetic type
  - Y₄ = amount of mercury in local water supply
  - Y₅ = severity of measles in childhood
  - Y₆ = severity of mumps in childhood
  - etc.
- More generally, if our measurement Y is the combination of some other variables, etc., along with the Xᵢs, then we may have a situation where, e.g.,
  - Y = a + bZ + (X₁ + … + Xₙ)
  - Y = a + bZ + ε
- Here the single random variable (X₁ + … + Xₙ) is approximately normally distributed,
  - although Y and/or Z may not be.
- The CLT is one of the reasons why the error in our models frequently turns out to be a random variable ε ~ N(0, σ²).
- A nice visualization of this phenomenon is at http://www.inf.ethz.ch/personal/gut/lognormal/
- Notice that the CLT can be seen as involving the standardization of a (big) random variable:
  $\dfrac{1}{\sqrt{n}} \sum_{i=1}^{n} \dfrac{X_i - \mu}{\sigma}$ is the very same thing as $\dfrac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}$, the standardization of the sum $\sum_{i=1}^{n} X_i$.
Chebychev's Inequality
- Let X be a random variable with any distribution you like, with a mean μ and standard deviation σ.
- Chebychev's theorem then says that for any c > 0,
  $\mathrm{pr}(|X - \mu| \ge c\sigma) \le \dfrac{1}{c^2}$
- In other words, regardless of X's distribution, the probability of X yielding a value more than c standard deviations away from X's mean is at most 1/c².
- So regardless of X's actual distribution (see the check below),
  - the probability that X yields a value more than 2 standard deviations from the mean is at most 1/4 = .25,
  - the probability that X yields a value more than 5 standard deviations from the mean is at most 1/25 = .04.
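A minimal sketch, assuming NumPy, comparing Chebychev's bound 1/c² with the actual tail probability for one particular distribution (an exponential, chosen purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, standard deviation 1
mu, sigma = x.mean(), x.std()

for c in (2, 5):
    empirical = np.mean(np.abs(x - mu) >= c * sigma)
    print(c, empirical, 1 / c**2)    # empirical tail probability vs. the bound 1/c^2
```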
- The core of classical statistical inference involves finding data which is simply too unlikely to have come from a certain distribution.
- E.g., often our data sets x₁, …, xₙ produce a certain number b.
- Often our experimental design allows us to create a complex random variable W out of the others that generated the data set, X₁, …, Xₙ.
- We then see whether the probability that W would produce a value as extreme as b is below a certain threshold:
  - pr(W ≤ −b or W ≥ b) < .05?
- In theory, we could merely use our threshold (.05 in our example) to figure out how extreme our data had to be to allow us to draw this conclusion.
- If we use Chebyshev's inequality, W will have to be further than σ/(.05)^{1/2} away from the mean of W.
- Although this boundary is rather remarkable, because it holds for any random variable, it is rather inefficient.
- If we can obtain more information about the null hypothesis that we are testing, we may be able to draw stronger conclusions from less extreme data.
- For example: Suppose our null hypothesis distribution is N(0, 1), and our threshold is .05.
- From Chebyshev's inequality, we can calculate: pr(|X − μ| ≥ cσ) ≤ 1/c² = .05.
- So we solve for c: c = 1/√.05 ≈ 4.472.
- Since our null hypothesis distribution is N(0, 1), we can continue: with μ = 0 and σ = 1, the cutoff is cσ ≈ 4.472.
- Thus, to draw a statistical inference using Chebychev's inequality, our random variable would have to yield a value more extreme than ±4.472.
- As we've seen, this hardly ever occurs from N(0, 1)!
- In short, it can be rather hard to draw inferences using Chebyshev's inequality.
- This is a price you pay for the fact that the inequality is so general.
- But what if we made use of the information that our null hypothesis was N(0, 1)?
- This amounts to utilizing more information in the experimental design.
- As you will learn later, if you do use this information, then you can draw an inference (at the .05 level) even if your data isn't more extreme than ±4.472.
  - Instead, it only needs to be more extreme than ±1.96 (see the sketch below).
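A minimal sketch, assuming SciPy, of the two cutoffs compared on these slides: the Chebychev cutoff 1/√.05 versus the exact N(0, 1) cutoff at the .05 level.

```python
from math import sqrt
from scipy.stats import norm

alpha = 0.05
print(1 / sqrt(alpha))        # ~4.472: Chebychev-based cutoff, valid for any distribution
print(norm.ppf(1 - alpha/2))  # ~1.960: exact two-sided cutoff under the N(0, 1) null
```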
- In sum, there is a kind of trade-off:
- Chebychev's inequality requires no (significant) background assumptions, and so applies everywhere.
  - But it is very inefficient.
- The techniques we will explore later require some significant background assumptions, and so cannot apply to all situations.
  - But they are much more efficient.