IE241 Introduction to Mathematical Statistics

Slides: 242

Provided by: ie1Kaist


Topic

Probability
  a priori
  set theory
  axiomatic definition
  marginal probability
  conditional probability
  independent events
  Bayes formula
Discrete sample spaces
  permutations
  combinations
Statistical distributions
  random variable
  binomial distribution
Moments
  moment generating function
Other discrete distributions

Estimate of mean
Estimate of variance
  degrees of freedom
KAIST sample
Percentiles and quartiles
Sampling distributions
  of the mean
Central Limit Theorem
Confidence intervals
  for the mean
Student's t
  for the variance
Chi-square distribution
Coefficient of variation
Properties of estimators
  unbiased
  consistent

Statistics is the discipline that permits you
to make decisions in the face of uncertainty.
Probability, a division of mathematics, is the
theory of uncertainty. Statistics is based on
probability theory, but is not strictly a
division of mathematics.
However, in order to understand statistical
theory and procedures, you must have an
understanding of the basics of probability.

Probability arose in the 17th century
because of games of chance. Its definition at
the time was an a priori one: if there are n
mutually exclusive, equally likely outcomes and
if nA of these outcomes have attribute A, then
the probability of A is nA/n.

This definition of probability seems
reasonable for certain situations. For example,
if one wants the probability of a diamond in a
selection from a card deck, then A = diamond, nA = 13,
n = 52, and the probability of a diamond = 13/52
= 1/4.
As another example, consider the probability
of an even number on one roll of a die. In this
case, A = even number on roll, n = 6, nA = 3, and
the probability of an even number = 3/6 = 1/2.
As a third example, you are interested in the
probability of one particular jack (the jack of a
single named suit) on one draw from a card deck.
Then A = that jack, n = 52, and nA = 1, so the
probability of that jack = 1/52.

The conditions of equally likely and mutually
exclusive are critical to this a priori approach.
For example, suppose you want the probability
of the event A, where A is either a king or a
spade drawn at random from a new deck. Now when
you figure the number of ways you can achieve the
event A, you count 13 spades and 4 kings, which
seems to give nA = 17, for a probability of
17/52.
But one of the kings is a spade, so kings and
spades are not mutually exclusive. This means
that you are double counting. The correct answer
is nA = 16, for a probability of 16/52.
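
The double-counting correction can be checked by listing the deck and counting each card once. This is a minimal sketch, not part of the original slides; the card encoding and the helper name a_priori_probability are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical encoding of a standard 52-card deck as (rank, suit) pairs.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
DECK = list(product(RANKS, SUITS))


def a_priori_probability(outcomes, has_attribute):
    """Return nA/n for equally likely, mutually exclusive outcomes."""
    n = len(outcomes)
    n_a = sum(1 for outcome in outcomes if has_attribute(outcome))
    return Fraction(n_a, n)


def is_king(card):
    return card[0] == "K"


def is_spade(card):
    return card[1] == "spades"


# Each card is examined once, so the king of spades is not double counted.
p_king_or_spade = a_priori_probability(DECK, lambda c: is_king(c) or is_spade(c))
p_diamond = a_priori_probability(DECK, lambda c: c[1] == "diamonds")

print(p_king_or_spade)  # 16/52 in lowest terms
print(p_diamond)        # 13/52 in lowest terms
```

Counting with a single "king or spade" test per card is what removes the overlap; adding the two separate counts (13 + 4) is what produces the incorrect 17.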

As another example, suppose the event A is 2
heads in two tosses of a fair coin. Now the
outcomes are 2H, 2T, or 1 of each. This would
seem to give a probability of 1/3.
But the last outcome really has twice the
probability of each of the others because the
right way to list the outcomes is HH, TT, HT,
TH. Now we see that 1 head and 1 tail can occur
in either of two ways and the correct probability
of 2H is 1/4.

But there are some problems with the a priori
approach.
Suppose you want the probability that a
positive integer drawn at random is even. You
might assume that it would be 1/2, but since
there are infinitely many integers and they need
not be ordered in any given way, there is no way
to prove that the probability of an even integer
is 1/2.
The integers can even be ordered so that the
ratio of evens to odds oscillates and never
approaches any definite value as n increases.

Besides the difficulty of an infinite number
of possible outcomes, there is also another
problem with the a priori definition. Suppose
the outcomes are not equally likely.
As an example, suppose that a coin is biased
in favor of heads. Now it is clearly not correct
to say that the probability of a head = the
probability of a tail = 1/2 in a given toss of a
coin.

Because of these difficulties, another
definition of probability arose which is based on
set theory.
Imagine a conceptual experiment that can be
repeated under similar conditions. Each outcome
of the experiment is called a sample point s.
The totality of all sample points resulting from
this experiment is called a sample space S.
An example is two tosses of a coin. In this
case, there are four sample points in S:
(H,H), (H,T), (T,H), (T,T).

Some definitions
If s is an element of S, then s ∈ S.
Two sets are equal if every element of one is
also an element of the other.
If every element of S1 is an element of S, but
not conversely, then S1 is a subset of S, denoted
S1 ⊂ S.
The universal set is S where all other sets are
subsets of S.

More definitions
The complement of a set A with respect to the
sample space S is the set of points in S but not
in A. It is usually denoted Ā.
If a set contains no sample points, it is called
the null set, ∅.
If S1 and S2 are two sets ⊂ S, then all sample
points in S1 or S2 or both are called the union
of S1 and S2 which is denoted S1 ∪ S2.

More definitions
If S1 and S2 are two sets ⊂ S, then the event
consisting of points in both S1 and S2 is called
the intersection of S1 and S2 which is denoted S1
∩ S2.
S is called a continuous sample space if S
contains a continuum of points.
S is called a discrete sample space if S contains
a finite number of points or a countable
infinity of points which can be put in one-to-one
correspondence with the positive integers.

Now we can proceed with the axiomatic
definition of probability. Let S be a sample
space where A is an event in S. Then P is a
probability function on S if the following three
axioms are satisfied:
Axiom 1. P(A) is a real nonnegative number
for every event A in S.
Axiom 2. P(S) = 1.
Axiom 3. If S1, S2, ..., Sn is a sequence of
mutually exclusive events in S, that is, if
Si ∩ Sj = ∅ for all i, j where i ≠ j, then
P(S1 ∪ S2 ∪ ... ∪ Sn) = P(S1) + P(S2) + ... + P(Sn)

Some theorems that follow from this definition
If A is an event in S, then the probability that
A does not happen = 1 - P(A).
If A is an event in S, then 0 ≤ P(A) ≤ 1.
P(∅) = 0.
If A and B are any two events in S, then P(A ∪ B)
= P(A) + P(B) - P(A ∩ B), where
A ∩ B represents the joint occurrence of both
A and B. P(A ∩ B) is also called P(A,B).

As an illustration of this last theorem: in
S, there are many points, but the event A and the
event B are overlapping. If we didn't subtract
the P(A ∩ B) portion, we would be counting it twice
for P(A ∪ B).

[Figure: Venn diagram of two overlapping events A and B in S]

Marginal probability is the term used when one
or more criteria of classification are ignored.
Let's say we have a sample of 60 people who
are either male or female and also who are either
rich, middle-class, or poor.

In this case, we have the cross-tabulation of
gender and financial status shown in the table
below.
The marginal probability of male is 34/60 and
the marginal probability of middle-class is
48/60.

Gender            Rich   Middle-class   Poor   Gender marginal
Male                 3             28      3                34
Female               1             20      5                26
Status marginal      4             48      8                60
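
The marginal probabilities can be recovered from the cell counts by summing over the classification that is being ignored. This is a small sketch using the counts from the table; the dictionary layout and variable names are illustrative choices.

```python
from fractions import Fraction

# Cell counts from the cross-tabulation: rows are gender, columns are status.
counts = {
    "Male": {"Rich": 3, "Middle-class": 28, "Poor": 3},
    "Female": {"Rich": 1, "Middle-class": 20, "Poor": 5},
}
statuses = ["Rich", "Middle-class", "Poor"]

total = sum(sum(row.values()) for row in counts.values())

# Ignore status: sum across each gender row.
gender_marginal = {g: Fraction(sum(row.values()), total) for g, row in counts.items()}

# Ignore gender: sum down each status column.
status_marginal = {
    s: Fraction(sum(counts[g][s] for g in counts), total) for s in statuses
}

print(gender_marginal["Male"])          # 34/60 in lowest terms
print(status_marginal["Middle-class"])  # 48/60 in lowest terms
```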

More theorems
If A and B are two events in S such that P(B) > 0,
the conditional probability of A given that B has
happened is
P(A | B) = P(A ∩ B) / P(B).
Then it follows that the joint probability P(A ∩
B) = P(A | B) P(B).

More theorems
If A and B are two events in S, A and B are
independent of one another if any of the
following is satisfied:
P(A | B) = P(A)
P(B | A) = P(B)
P(A ∩ B) = P(A) P(B)

P(A ∪ B) is the probability that either the event
A or the event B happens. When we talk about
either/or situations, we are always adding
probabilities.
P(A ∪ B) = P(A) + P(B) - P(A,B)
P(A ∩ B) or P(A,B) is the probability that both
the event A and the event B happen. When we talk
about both/and situations, we are always
multiplying probabilities.
P(A ∩ B) = P(A) P(B) if A and B are
independent and
P(A ∩ B) = P(A | B) P(B) if A and B are not
independent.

As an example of conditional probability,
consider an urn with 6 red balls and 4 black balls. If
two balls are drawn without replacement, what is
the probability that the second ball is red if we
know that the first was red?
Let B be the event that the first ball is red
and A be the event the second ball is red. P(A ∩
B) is the probability that both balls are red.
There are 10C2 = 45 ways of drawing two balls
and 6C2 = 15 ways of getting two red balls.
So P(A ∩ B) = 15/45 = 1/3. P(B), the
probability that the first ball is red, is 6/10
= 3/5.
Therefore, P(A | B) = (1/3) / (3/5) = 5/9.

This probability could be computed from the
sample space directly because once the first red
ball has been drawn, there remain only 5 red
balls and 4 black balls. So the probability of
drawing red the second time is 5/9.
The idea of conditional probability is to
reduce the total sample space to that portion of
the sample space in which the given event has
happened. All possible probabilities computed in
this reduced sample space must sum to 1. So the
probability of drawing black the second time
is 4/9.
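
Both routes to the urn answer can be checked with exact fractions. The sketch below computes P(A | B) from the definition and again from the reduced sample space; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

red, black = 6, 4
total = red + black

# Route 1: the definition P(A | B) = P(A and B) / P(B).
p_both_red = Fraction(comb(red, 2), comb(total, 2))  # 15/45
p_first_red = Fraction(red, total)                   # 6/10
p_second_red_given_first_red = p_both_red / p_first_red

# Route 2: reduced sample space, 5 red and 4 black remain after one red is drawn.
p_direct = Fraction(red - 1, total - 1)
p_black_given_first_red = Fraction(black, total - 1)

print(p_second_red_given_first_red, p_direct)  # both 5/9
```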

Another example involves a test for detecting
cancer which has been developed and is being
tested in a large hospital.
It was found that 98% of cancer patients
reacted positively to the test, while only 4% of
non-cancer patients reacted positively.
If 3% of the patients in the hospital have
cancer, what is the probability that a patient
selected at random from the hospital who reacts
positively will have cancer?

Given:
P(reaction | cancer) = .98
P(reaction | no cancer) = .04
P(cancer) = .03
P(no cancer) = .97
Needed:
P(cancer | reaction)

P(r ∩ c) = P(r | c) P(c)
= (.98)(.03)
= .0294
P(r ∩ nc) = P(r | nc) P(nc)
= (.04)(.97)
= .0388
P(r) = P(r ∩ c) + P(r ∩ nc)
= .0294 + .0388
= .0682

Now we have the information we need to solve
the problem:
P(c | r) = P(r ∩ c) / P(r) = .0294 / .0682 = .431

Conditional probability led to the development
of Bayes' formula, which is used to determine the
likelihood of a hypothesis, given an outcome.
$$ P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{\sum_j P(D \mid H_j)\, P(H_j)} $$
This formula gives the likelihood of Hi given
the data D you actually got versus the total
likelihood of every hypothesis given the data you
got. So Bayes' strategy is a likelihood ratio
test.
Bayes' formula is one way of dealing with
questions like the last one. If we find a
reaction, what is the probability that it was
caused by cancer?

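Now let's cast Bayes' formula in the context
of our cancer situation, where there are two
possible hypotheses that might cause the
reaction, cancer and other.
$$ P(c \mid r) = \frac{P(r \mid c)\, P(c)}{P(r \mid c)\, P(c) + P(r \mid nc)\, P(nc)} = \frac{(.98)(.03)}{(.98)(.03) + (.04)(.97)} = \frac{.0294}{.0682} = .431 $$
which confirms what we got with the classic
conditional probability approach.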
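
The cancer calculation can be written as a small general routine for any finite set of hypotheses. This is a sketch only; the function name posterior and the dictionary keys are illustrative assumptions, and the inputs are the figures given in the example.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' formula to a finite set of hypotheses.

    priors[h] is P(h); likelihoods[h] is P(data | h).
    Returns a dict mapping each hypothesis h to P(h | data).
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # total probability of the data
    return {h: joint[h] / evidence for h in joint}


priors = {"cancer": 0.03, "no cancer": 0.97}
likelihoods = {"cancer": 0.98, "no cancer": 0.04}  # P(reaction | hypothesis)

result = posterior(priors, likelihoods)
print(round(result["cancer"], 3))  # about 0.431
```

Even with a test that detects 98% of cancers, fewer than half of the positive reactions come from cancer patients, because only 3% of the patients have cancer to begin with.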

Consider another simple example where there
are two identical boxes. Box 1 contains 2 red
balls and box 2 contains 1 red ball and 1 white
ball. Now a box is selected by chance and 1 ball
is drawn from it, which turns out to be red. What
is the probability that Box 1 was the one that
was selected?
Using conditional probability, we would find
P(Box1 | R) = P(Box1, R) / P(R)
and get the numerator by
P(Box1, R) = P(Box1) P(R | Box1)
= (½)(1)
= 1/2
Then we get the denominator by
P(R) = P(Box1, R) + P(Box2, R)
= ½ + ¼
= 3/4

Putting these in the formula,
P(Box1 | R) = (1/2) / (3/4) = 2/3
If we use the sample space method, we have
four equally likely outcomes:
B1R1 B1R2 B2R B2W
The condition R restricts the sample space to
the first three of these, each with probability
1/3. Then
P(Box1 | R) = 2/3

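Now let's try it with Bayes' formula. There
are only two hypotheses here, so H1 = Box1 and H2
= Box2. The data, of course, is R. So we can
find
$$ P(H_1 \mid R) = \frac{P(R \mid H_1)\, P(H_1)}{P(R \mid H_1)\, P(H_1) + P(R \mid H_2)\, P(H_2)} = \frac{(1)(1/2)}{(1)(1/2) + (1/2)(1/2)} = \frac{2}{3} $$
And we can find
$$ P(H_2 \mid R) = \frac{(1/2)(1/2)}{(1)(1/2) + (1/2)(1/2)} = \frac{1}{3} $$
So we can see that the odds of the data
favoring Box1 to Box2 are 2:1.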
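
The sample space method for the two-box problem can be sketched by listing the four equally likely outcomes and keeping only those consistent with the observed red ball. The tuple encoding below is an illustrative assumption.

```python
from fractions import Fraction

# Four equally likely outcomes as (box, ball colour): B1R1, B1R2, B2R, B2W.
outcomes = [("Box1", "R"), ("Box1", "R"), ("Box2", "R"), ("Box2", "W")]

# Conditioning on red restricts the sample space to the red outcomes.
red_outcomes = [o for o in outcomes if o[1] == "R"]

p_box1_given_red = Fraction(
    sum(1 for o in red_outcomes if o[0] == "Box1"), len(red_outcomes)
)
p_box2_given_red = Fraction(
    sum(1 for o in red_outcomes if o[0] == "Box2"), len(red_outcomes)
)
odds_box1_to_box2 = p_box1_given_red / p_box2_given_red

print(p_box1_given_red, p_box2_given_red, odds_box1_to_box2)
```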

Discrete sample spaces with a finite number of
points
Let s1, s2, s3, ..., sn be n sample points in S
which are equally likely. Then
P(s1) = P(s2) = P(s3) = ... = P(sn) = 1/n.
If nA of these sample points are in the event
A, then P(A) = nA/n, which is the same as the
a priori definition.
Clearly this definition satisfies the axiomatic
conditions because the sample points are mutually
exclusive and equally likely.

Now we need to know how many arrangements of a
set of objects there are. Take as an example the
number of arrangements of the three letters a, b,
c.
In this case, the answer is easy:
abc, acb, bac, bca, cab, cba.
But if the number of arrangements were much
larger, it would be nice to have a formula that
figures out how many for us. This formula is the
number of arrangements or permutations of N
things = N!.
Now we can find the number of permutations of
N things if we take only x of them at a time.
This formula is NPx = N! / (N-x)!

Next we want to know how many combinations of
a set of N objects there are if we take x of them
at a time. This is different from the number of
permutations because we don't care about the
ordering of the objects, so abc and cab count as
one combination though they represent two
permutations.
The formula for the number of combinations
of N things taking x at a time is
NCx = N! / (x! (N-x)!)

How many pairs of cards can be drawn from a
deck, where we don't care about the order in
which they are drawn? The solution is
52C2 = 52! / (2! 50!) = 1326
ways that two cards can be drawn.
Now suppose we want to know the probability
that both cards will be spades. Since there are
13 spades in the deck and we are drawing 2 cards,
the number of ways that 2 spades can be drawn
from the 13 available is
13C2 = 13! / (2! 11!) = 78
So the probability that two spades will be
drawn is 78/1326 = 1/17.
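
The counting formulas are available directly in Python's standard library (math.perm and math.comb, Python 3.8 or later), so the card example can be verified in a few lines. This is a sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial, perm

# Arrangements of the three letters a, b, c: 3! = 6 of them.
arrangements = ["".join(p) for p in permutations("abc")]

n_pairs = comb(52, 2)        # unordered pairs of cards from the deck
n_spade_pairs = comb(13, 2)  # unordered pairs of spades
p_two_spades = Fraction(n_spade_pairs, n_pairs)

# Each combination of x objects corresponds to x! permutations.
ordered_pairs = perm(52, 2)

print(arrangements)
print(n_pairs, n_spade_pairs, p_two_spades)
```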

Statistical Distributions
Now we begin the study of statistical
distributions. If there is a distribution, then
something must be being distributed. This
something is a random variable.
You are familiar with variables in functions
like a linear form y = ax + b. In this case,
a and b are constants for any given linear
function and x and y are variables.
In the equation for the circumference of a
circle, we have C = πd where C and d are
variables and π is a constant.

A random variable is different from a
mathematical variable because it has a
probability function associated with it.
More precisely, a random variable is a
real-valued function defined on a probability
space, where the function transforms points of S
into values on the real axis.

For example, the number of heads in two tosses
of a fair coin can be transformed as

Points in S   s1 = HH   s2 = HT   s3 = TH   s4 = TT
X(s)          2         1         1         0

Now X(s) is real-valued and can be used in a
distribution function.

Because a probability is associated with each
element in S, this probability is now associated
with each corresponding value of the random
variable.
There are two kinds of random variables:
discrete and continuous.
A random variable is discrete if it assumes only
a finite (or denumerable) number of values.
A random variable is continuous if it assumes a
continuum of values.

We begin with discrete random variables.
Consider a random experiment where four fair
coins are tossed and the number of heads is
recorded.
In this case, the random variable X takes on
the five values 0, 1, 2, 3, 4. The probability
associated with each value of the random variable
X is called its probability function p(X) or
probability mass function, because the
probability is massed at each of a discrete
number of points.

One of the most frequently used discrete
distributions in applications of statistics is
the binomial. The binomial distribution is used
for n repeated trials of a given experiment, such
as tossing a coin. In this case, the random
variable X has the probability function
$$ P(x) = {}_{n}C_{x}\, p^{x} q^{n-x}, \quad \text{where } p + q = 1, \quad x = 0, 1, 2, 3, \ldots, n $$

In one toss of a coin, this reduces to $p^x q^{1-x}$
and is called the point binomial or Bernoulli
distribution. p = the probability that an
event will occur and, of course, q = the
probability that it will not occur.
p and n are called parameters of this family
of distributions. Each time either p or n
changes, we have a new member of the binomial
family of distributions, just as each time a or b
changed in the linear function we had a new
member of the family of linear functions.
The binomial distribution for 10 tosses of a
fair coin is shown below. The actual values
are shown in the accompanying table. Note the
symmetry of the distribution. This always
happens when p = .5.

[Figure: binomial distribution for n = 10, p = .5; graph not transcribed]
X P(x)
0 0.000977
1 0.009766
2 0.043945
3 0.117188
4 0.205078
5 0.246094
6 0.205078
7 0.117188
8 0.043945
9 0.009766
10 0.000977
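
The table can be regenerated from the binomial probability function with n = 10 and p = .5. This is a minimal sketch; the function name binomial_pmf is an illustrative choice.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(x) = nCx * p**x * q**(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


n, p = 10, 0.5
table = [(x, binomial_pmf(x, n, p)) for x in range(n + 1)]

for x, prob in table:
    print(f"{x:2d} {prob:.6f}")
```

Changing p or n in the last lines gives another member of the binomial family.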

The probability of 5 heads is highest so 5 is
called the mode of x. The mode of any
distribution is its most frequently occurring
value. The mode is a measure of central
tendency.
5 is also the mean of X, which in general for
the binomial is np. The mean of any distribution
is the most important measure of central
tendency. It is the measure of location on the
x-axis.

Every distribution has a set of moments.
Moments for theoretical distributions are
expected values of powers of the random variable.
The rth moment is $E[(X - \theta)^r]$, where E is the
expectation operator and θ is an origin.
The expected value of a random variable is
defined as
E(X) = µ
where µ is Greek because it is the theoretical
mean or average of the random variable.
µ is the first moment about 0.

The second moment is about µ itself
$E[(X - \mu)^2]$
and is called the variance σ² of the random
variable.
The third moment $E[(X - \mu)^3]$ is also about µ and
is a measure of skewness or non-symmetry of the
distribution.

The mean of the distribution is a measure of
its location on the x axis. The mean is the only
point such that the sum of the deviations from it
is 0. The mean is the most important measure of
centrality of the distribution.
The variance is a measure of the spread of the
distribution or the extent of its variability.
The mean and variance are the two most
important moments.

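Every distribution has a moment generating
function (mgf), which for a discrete distribution
is
$$ M_x(\theta) = E(e^{\theta x}) = \sum_x e^{\theta x}\, p(x) $$

The way this works is:
Assume that p(x) is a function such that the
series above converges. Then
$$ M_x(\theta) = \sum_x \left(1 + \theta x + \frac{\theta^2 x^2}{2!} + \cdots\right) p(x) = 1 + \theta \mu'_1 + \frac{\theta^2}{2!} \mu'_2 + \cdots $$
where $\mu'_k = E(X^k)$.

In this expression, the coefficient of $\theta^k/k!$
is the kth moment about the origin.
To evaluate a particular moment,
it may be convenient to compute the proper
derivative of $M_x(\theta)$ at θ = 0, since repeated
differentiation of this moment generating
function will show that
$$ \left. \frac{d^k M_x(\theta)}{d\theta^k} \right|_{\theta = 0} = \mu'_k $$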

From the mgf, we can find the first moment
around θ = 0, which is the mean. The mean of the
binomial is np.
We can also find the second moment around θ =
µ, the variance. The variance of the binomial
is npq.
The mgf enables us to find all the moments of
a distribution.
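
As a worked illustration of how the mgf yields the binomial mean and variance, the derivation can be sketched as follows. The slides state only the results np and npq; the closed form of the binomial mgf used here, and the symbol θ for the mgf argument, are assumptions based on the standard textbook treatment.

```latex
% Sketch: mean and variance of the binomial from its mgf, with q = 1 - p.
\begin{align*}
M_x(\theta) &= \sum_{x=0}^{n} e^{\theta x}\, {}_{n}C_{x}\, p^{x} q^{n-x}
             = \sum_{x=0}^{n} {}_{n}C_{x}\, (p e^{\theta})^{x} q^{n-x}
             = (q + p e^{\theta})^{n} \\
M_x'(\theta) &= n p e^{\theta} (q + p e^{\theta})^{n-1}
  \quad\Rightarrow\quad M_x'(0) = np \\
M_x''(\theta) &= n p e^{\theta} (q + p e^{\theta})^{n-1}
  + n(n-1) p^{2} e^{2\theta} (q + p e^{\theta})^{n-2}
  \quad\Rightarrow\quad M_x''(0) = np + n(n-1)p^{2} \\
\sigma^{2} &= M_x''(0) - \left[ M_x'(0) \right]^{2}
  = np + n(n-1)p^{2} - n^{2}p^{2} = np - np^{2} = npq
\end{align*}
```

The second line uses q + p = 1 when evaluating at θ = 0, and the last line uses the fact that the variance equals the second moment about the origin minus the square of the mean.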

Now suppose in our binomial we change p to .7.
Then a different binomial distribution function
results, as shown in the next graph and the table
of data accompanying it.
This makes sense because with a probability of
.7 that you will get heads, you should see more
heads.

[Figure: binomial distribution for n = 10, p = .7; graph not transcribed]
X P(x)
0 5.9E-06
1 0.000138
2 0.001447
3 0.009002
4 0.036757
5 0.102919
6 0.200121
7 0.266828
8 0.233474
9 0.121061
10 0.028248

This distribution is called a skewed
distribution because it is not symmetric.
Skewing can be in either the positive or the
negative direction. The skew is named by the
direction of the long tail of the distribution.
The measure of skew is the third moment around θ =
µ.
So the binomial with p = .7 is negatively
skewed.

The mean of this binomial = np = 10(.7) = 7.
So you will expect more heads when the
probability of heads is greater than that of
tails.
The variance of this binomial is npq =
10(.7)(.3) = 2.1.
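
The mean, variance, and direction of skew for this binomial can be checked by computing the moments directly from the probabilities. This is a sketch with illustrative variable names; it computes the moments as probability-weighted sums rather than through the mgf.

```python
from math import comb

n, p = 10, 0.7
q = 1 - p
pmf = [comb(n, x) * p**x * q ** (n - x) for x in range(n + 1)]

# First moment about 0, then second and third moments about the mean.
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
third_central = sum((x - mean) ** 3 * px for x, px in enumerate(pmf))

print(round(mean, 4), round(variance, 4), round(third_central, 4))
```

A negative third moment about the mean corresponds to the long tail pointing toward the low values of x, which is the negative skew described above.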

Another discrete distribution that comes in
handy when p is very small is the Poisson
distribution. Its distribution function is
$$ p(x) = \frac{e^{-\mu} \mu^x}{x!}, \quad x = 0, 1, 2, \ldots $$
where µ > 0.
In the Poisson distribution, the parameter is
µ, which is the mean value of x in this
distribution.

The Poisson distribution is an approximation
to the binomial distribution when np is large
relative to p and n is large relative to np.
Because it does not involve n, it is particularly
useful when n is unknown.
As an example of the Poisson, assume that a
volume V of some fluid contains a large number n
of some very small organisms. These organisms
have no social instincts and therefore are just
as likely to appear in one part of the liquid as
in any other part.
Now take a drop D of the liquid to examine
under a microscope. Then the probability that
any one of the organisms appears in D is D/V.

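The probability that x of them are in D is
$$ p(x) = {}_{n}C_{x} \left(\frac{D}{V}\right)^{x} \left(1 - \frac{D}{V}\right)^{n-x} $$
The Poisson is an approximation to this
expression, which is simply a binomial in which
p = D/V is very small.
The above binomial can be transformed to the
Poisson
$$ p(x) = \frac{e^{-Dd} (Dd)^x}{x!} $$
where Dd = µ and n/V = d.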
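
How close the Poisson comes to the binomial can be seen numerically. The sketch below uses assumed illustrative values (n = 1000 organisms and a drop that is .002 of the volume); these numbers do not come from the slides.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, mu):
    return exp(-mu) * mu**x / factorial(x)


# Assumed values for illustration only: n organisms, p = D/V.
n, p = 1000, 0.002
mu = n * p  # the Poisson mean, equal to D * d with d = n / V

for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, mu), 5))

max_diff = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, mu)) for x in range(11))
```

With p this small and n this large, the two sets of probabilities differ by less than .001 at every x from 0 to 10.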

Another discrete distribution is the
hypergeometric distribution, which is used when
there is no replacement after each experiment.
Because there is no replacement, the value of
p changes from one trial to the next. In the
binomial, p is always constant from trial to
trial.

Suppose that 20 applicants appear for a job
interview and only 5 will be selected. The value
of p for the first selection is 1/20.
After the first applicant is selected, p
changes from 1/20 to 1/19 because the one
selected is not thrown back in to be selected
again.
For the 5th selection, p has moved to 1/16,
which is quite different from the original 1/20.

Now if there had been 1000 applicants and only
2 were going to be selected, p would change from
1/1000 to 1/999, which is not enough of a change
to be important.
So the binomial could be used here with little
error arising from the assumptions that the
trials are independent and p is constant.

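The hypergeometric distribution is
$$ p(x) = \frac{{}_{k}C_{x}\; {}_{N-k}C_{n-x}}{{}_{N}C_{n}} $$
where N is the number of items available, k is the
number of those items counted as successes, n is the
number selected without replacement, and x is the
number of successes among those selected.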

Another discrete distribution is the negative
binomial. The negative binomial distribution is
used for the question: on which trial(s) will the
first (and later) success(es) come?
Let p be the probability of success and let
p(x) be the probability that exactly x + r trials
will be needed to produce r successes.

The negative binomial is
$$ p(x) = p^{r} \left( {}_{x+r-1}C_{r-1} \right) q^{x} $$
where x = 0, 1, 2, ...
and p + q = 1
Notice that this turns the binomial on its
head because instead of the number of successes
in n trials, it gives the number of trials to r
successes. This is why it is called the negative
binomial.

The binomial is the most important of the
discrete distributions in applications, but you
should have a passing familiarity with the
others.
Now we move on to distributions of continuous
random variables.

Because a continuous random variable has a
nondenumerable number of values, its probability
function is a density function. A probability
density function is abbreviated pdf.
There is a logical problem associated with
assigning probabilities to the infinity of points
on the x-axis and still having the density sum to
1. So what we do is deal with intervals instead
of with points. Hence P(x = a) = 0 for any
number a.

By far, the most important distribution in
statistics is the normal or Gaussian
distribution. Its formula is
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}, \quad -\infty < x < \infty $$

The normal distribution is characterized by
only two parameters, its mean µ and its standard
deviation σ.
The mgf for a continuous distribution is
$$ M_x(\theta) = E(e^{\theta x}) = \int_{-\infty}^{\infty} e^{\theta x} f(x)\, dx $$

This mgf is of the same form as that for
discrete distributions shown earlier, and it
generates moments in the same manner.
A normal distribution with µ = 1.5 and σ =
.9 is shown next.

[Figure: normal density with µ = 1.5 and σ = .9; graph not transcribed]

This is the familiar bell curve. If the
standard deviation σ were smaller, the curve
would be tighter. And if σ were larger, the
curve would be flatter and more spread out.
Any normal distribution may be transformed
into the standard normal distribution with
µ = 0 and σ = 1. The transformation is
z = (x - µ) / σ
In this case, z is called the standard normal
variate or random variable.

If we use the transformed variable z, the
normal density becomes
$$ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} $$

The area under the curve for any normal
distribution from µ to µ + 1σ is .34 and the area
from µ to µ - 1σ is .34. So from µ - 1σ to µ + 1σ is 68%
of the area, which means that the values of the
random variable X falling between those two
limits use up .68 of the total probability.
The area from µ to µ + 1.96σ is .475 and because
the normal curve is symmetric, it is the same
from µ to µ - 1.96σ. So from µ - 1.96σ to µ + 1.96σ is 95%
of the area under the curve, and the values of
the random variable in that range use up .95 of
the total probability.
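
These areas can be checked without tables by using the error function in Python's standard library. This is a sketch; the helper name standard_normal_cdf is an illustrative choice, and it relies on the standard identity that the standard normal cdf equals 0.5(1 + erf(z/√2)).

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


area_mu_to_1sigma = standard_normal_cdf(1.0) - standard_normal_cdf(0.0)
area_within_1sigma = standard_normal_cdf(1.0) - standard_normal_cdf(-1.0)
area_mu_to_196sigma = standard_normal_cdf(1.96) - standard_normal_cdf(0.0)
area_within_196sigma = standard_normal_cdf(1.96) - standard_normal_cdf(-1.96)

print(round(area_mu_to_1sigma, 4), round(area_within_1sigma, 4))
print(round(area_mu_to_196sigma, 4), round(area_within_196sigma, 4))
```

Because any normal variable can be converted to z, the same areas hold for every choice of mean and standard deviation.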

[Figure: normal curve with areas of .34 on each side of the mean out to 1σ and .135 in each band between 1σ and 2σ]

The normal distribution is very important
for statisticians because it is a mathematically
manageable distribution with wide ranging
applicability, but it is also important on its
own merits.
For one thing, many populations in various
scientific or natural fields have a normal
distribution to a good degree of approximation.
To make inferences about these populations, it is
necessary to know the distributions for various
functions of the sample observations.
The normal distribution may be used as an
approximation to the binomial for large n.

Theorem
If X represents the number of successes in n
independent trials of an event for which p is the
probability of success on a single trial, then
the variable (X - np)/√(npq) has a distribution that
approaches the normal distribution with mean 0
and variance 1 as n becomes increasingly large.

Corollary
The proportion of successes X/n will be
approximately normally distributed with mean p
and standard deviation √(pq/n)
if n is sufficiently large.
Consider the following illustration of the
normal approximation to the binomial.

In Mendelian genetics, certain crosses of peas
should give yellow and green peas in a ratio of
3:1. In an experiment that produced 224 peas,
176 turned out to be yellow and only 48 were
green.
The 224 peas may be considered 224 trials of a
binomial experiment where the probability of a
yellow pea = ¾. Given this, the average number
of yellow peas should be 224(3/4) = 168 and σ =
√(224(3/4)(1/4)) = 6.5.

Is the theory wrong? Or is the finding of 176
yellow peas just normal variation?
To save the laborious computation required by
the binomial, we can use the normal approximation
to get a region around the mean of 168 which
encompasses 95% of the values that would be found
in the normal distribution:
168 ± 1.96(6.5) = 168 ± 12.7, or from about 155.3 to 180.7.
Since the 176 yellow peas found in this
experiment is within this interval, there is no
reason to reject Mendelian inheritance.
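
The pea calculation can be sketched in a few lines: compute the binomial mean and standard deviation, form the region within 1.96 standard deviations of the mean, and see whether the observed count falls inside it. The variable names are illustrative.

```python
from math import sqrt

n, p = 224, 3 / 4
observed_yellow = 176

mean = n * p                    # expected number of yellow peas
sigma = sqrt(n * p * (1 - p))   # binomial standard deviation, sqrt(npq)

lower = mean - 1.96 * sigma
upper = mean + 1.96 * sigma
z = (observed_yellow - mean) / sigma
within_interval = lower <= observed_yellow <= upper

print(round(mean, 1), round(sigma, 2))
print(round(lower, 1), round(upper, 1), round(z, 2), within_interval)
```

The observed count lies a little over one standard deviation above the mean, well inside the 95% region.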

The normal distribution will be revisited
later, but for now we'll move on to some other
interesting continuous distributions.

The first of these is the uniform or
rectangular distribution.
f(x) = 1/(β - α), α ≤ x ≤ β
f(x) = 0 elsewhere
This is an important distribution for
selecting random samples and computers use it for
this purpose.

Another important continuous distribution is
the gamma distribution, which is used for the
length of time it takes to do something or for
the time between events.
The gamma is a two-parameter family of
distributions, with α and β as the parameters.
Given β > 0 and α > -1, the gamma density is
$$ f(x) = \frac{x^{\alpha} e^{-x/\beta}}{\Gamma(\alpha + 1)\, \beta^{\alpha + 1}}, \quad x > 0 $$

Another important continuous distribution is
the beta distribution, which is used to model
proportions, such as the proportion of lead in
paint or the proportion of time that the FAX
machine is under repair.
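This is a two-parameter family of distributions
with parameters α and β, which both must be
greater than -1. The beta density is
$$ f(x) = \frac{\Gamma(\alpha + \beta + 2)}{\Gamma(\alpha + 1)\, \Gamma(\beta + 1)}\, x^{\alpha} (1 - x)^{\beta}, \quad 0 < x < 1 $$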

The log normal distribution is another
interesting continuous distribution.
Let x be a random variable. If $\log_e(x)$ is
normally distributed, then x has a log normal
distribution. The log normal has two parameters,
α and β, both of which are greater than 0, and
its density is defined for x > 0.

As with the discrete distributions, most of
the continuous distributions are of passing
interest. Only the normal distribution at this
point is critically important. You will come
back to it again and again in statistical study.

Now one kind of distribution we haven't
covered so far is the cumulative distribution.
Whereas the distribution of the random variable
is denoted p(x) if it is discrete and f(x) if it
is continuous, the cumulative distribution is
denoted P(x) and F(x) for discrete and continuous
distributions, respectively.
The cumulative distribution or cdf is the
probability that X ≤ Xc and thus it is the area
under the p(x) or f(x) function up to and
including the point Xc.

The most interesting cumulative distribution
function or cdf is the normal one, often called
the normal ogive.

The points in a continuous cdf like the normal
F(x) are obtained by integrating over the f(x) to
the point in question.

The cdf can be used to find the probability
that a random variable X is ≤ some value
of interest because the cdf gives probabilities
directly.
In the normal distribution shown earlier with
µ = 1.5 and σ = 0.9, the probability that X ≤ 2 is
given by the cdf as .71. Also the probability
that 1 ≤ x ≤ 2 is given by F(2) - F(1) = .71 -
.29 = .42.

Now you know from this normal cdf that the
probability that X ≤ 2 is .71.
Suppose you want the probability that X ≥ 2.
Well if P(X ≤ 2) = .71, then
P(X ≥ 2) = 1 - .71 = .29.
Note that you are ignoring the fact that P(X =
2) is included in the cdf
probability because P(X = 2) = 0 in a continuous
pdf.
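
The cdf values quoted for the normal distribution with mean 1.5 and standard deviation .9 can be reproduced with the error function. This is a sketch; the helper name normal_cdf is an illustrative choice.

```python
from math import erf, sqrt


def normal_cdf(x, mu, sigma):
    """F(x) for a normal distribution, via the standardised value z."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


mu, sigma = 1.5, 0.9

p_at_most_2 = normal_cdf(2, mu, sigma)                                # F(2)
p_between_1_and_2 = normal_cdf(2, mu, sigma) - normal_cdf(1, mu, sigma)  # F(2) - F(1)
p_at_least_2 = 1 - p_at_most_2                                        # complement of F(2)

print(round(p_at_most_2, 2), round(p_between_1_and_2, 2), round(p_at_least_2, 2))
```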

For the binomial distribution, a point on the
cumulative distribution function P(x) is obtained
by summing probabilities of the p(x) up to the
point in question. Then P(xi) = p(x ≤ xi). In
general,
$$ P(x_i) = \sum_{x \le x_i} p(x) $$

[Figure: cumulative distribution function of the binomial for n = 10, p = .5; graph not transcribed]

From this cdf, we can see that the
probability that the number of heads will be ≤ 2
is .05.
And the probability that the number of heads
will be ≤ 6 is .82.
But the probability that the number of heads
will be between two numbers is tricky here
because the cdf includes the probability of x,
not just the values < x. So if you want the
probability that 2 ≤ x ≤ 6, you need to use
P(6) - P(1) because if you subtracted P(2) from
P(6), you would exclude the value 2 heads.
So P(2 ≤ x ≤ 6) = P(6) - P(1) = .82 - .01
= .81.
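
The endpoint rule for a discrete cdf can be checked exactly for 10 tosses of a fair coin. This sketch uses exact fractions; the function name binomial_cdf is illustrative. The exact values (about .055, .828, and .011) agree with the figures read from the graph to within .01.

```python
from fractions import Fraction
from math import comb


def binomial_cdf(x, n):
    """P(X <= x) for n tosses of a fair coin (p = 1/2), as an exact fraction."""
    return Fraction(sum(comb(n, k) for k in range(x + 1)), 2**n)


n = 10

# P(2 <= X <= 6): subtract the cdf just below the lower endpoint.
p_2_to_6 = binomial_cdf(6, n) - binomial_cdf(1, n)

# Direct sum of p(2) + ... + p(6) for comparison.
p_direct = Fraction(sum(comb(n, k) for k in range(2, 7)), 2**n)

# Subtracting P(2) instead drops the value x = 2 from the range.
p_excluding_2 = binomial_cdf(6, n) - binomial_cdf(2, n)

print(float(binomial_cdf(2, n)), float(binomial_cdf(6, n)), float(p_2_to_6))
```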

So if you are given a point on the binomial
cdf, say, (4, .38), then the probability that
X ≤ 4 is .38.
But suppose you want the probability that X >
4. Then 1 - P(X ≤ 4)
= 1 - .38
= .62 is the answer.
But if you want the probability that X ≥ 4,
you can't get it from the information given
because P(X = 4) is included in the binomial cdf.

Now we have covered the major distributions of
interest. But so far, we have been dealing
with theoretical distributions, where the unknown
parameters are given in Greek.
Since we don't know the parameters, we have to
estimate them. This means we have to develop
empirical distributions and estimate the
parameters.

To think about empirical distributions, we must
first consider the topic of sampling.
We need a sample to develop the empirical
distribution, but the sample must be selected
randomly. Only random samples are valid for
statistical use. If any other sample is used,
say, because it is conveniently available, the
information gained from it is useless except to
describe the sample itself.


Now how can you tell if a sample is random?
Can you tell by looking at the data you got from
your sample?
Does a random sample have to be representative
of the group from which it was obtained?
The answer to these questions is a resounding
NO.

Now let's develop what a random sample really
is.
First, there is a population with a variable
of interest. The population is all elements of
concern, for example, all males from age 18 to
age 30 in Korea. Maybe the variable of interest
is height.
The population is always very large and often
infinite. Otherwise, we would just measure the
entire population on the variable of interest and
not bother with sampling.


Since we can never measure every element
(person, object, manufactured part, etc.) in the
population, we draw a sample of these elements to
measure some variable of interest. This variable
is the random variable.


The sample may be taken from some portion of
the population, and not from the entire
population. The portion of the population from
which the sample is drawn is called the sampling
frame.
Maybe the sample was taken from males between
18 and 30 in Seoul, not in all of Korea. Then
although Korea is the population of interest,
Seoul is the sampling frame. Any conclusions
reached from the Seoul sample apply only to the
set of 18 to 30 year-old males in Seoul, not in
all of Korea.

To show how far astray you can go when you
don't pay attention to the sampling frame,
consider the US presidential election of 1948.
Harry Truman was running against Tom Dewey.
All the polling agencies were sure Dewey would
win and the morning paper after the election
carried the headline
DEWEY DEFEATS TRUMAN
There is a famous picture of the victorious
Truman holding up the morning paper for all to
see.

105

How did the pollsters go so wrong? It was in
their sampling frame.
It turns out that they had used the phone
directories all over the US to select their
sample. But the phone directories all over the
US do not contain all the US voters. At that
time, many people didn't have phones and many
others were unlisted.
This is a glaring and very famous example of
just how wrong you can be when you don't follow
the sampling rules.

106

Now, assuming you've got the right sampling
frame, the next requirement is a random sample.
The sample must be taken randomly for any
conclusions to be valid. All conclusions apply
only to the sampling frame, not to the entire
population.
A random sample is one in which each and
every element in the sampling frame has an equal
chance of being selected for the sample.
This means that you can get some random
samples that are quite unrepresentative of the
sampling frame. But the larger the random sample
is, the more representative it tends to be.
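
The point about sample size can be checked with a small simulation.
This is only a sketch: the sampling frame below is a made-up list of
10,000 heights, and the names frame, draw_random_sample and
typical_gap are hypothetical, not part of the course material.

```python
import random
from statistics import fmean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 10,000 made-up heights in cm.
frame = [random.gauss(175, 6) for _ in range(10_000)]
frame_mean = fmean(frame)

def draw_random_sample(frame, n):
    """Simple random sample: every element has the same chance of selection."""
    return random.sample(frame, n)

def typical_gap(frame, n, repeats=300):
    """Average distance between a sample mean and the frame mean."""
    gaps = [abs(fmean(draw_random_sample(frame, n)) - frame_mean)
            for _ in range(repeats)]
    return fmean(gaps)

# The gap shrinks as n grows, although any single sample can still be far off.
for n in (12, 100, 1000):
    print(n, round(typical_gap(frame, n), 2))
```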

107

Suppose you want to estimate the height of
males in Chicago between the ages of 18 and 30.
If you were looking for a random sample of
size 12 in order to estimate the height, you
might end up with the Chicago Bulls basketball
team. This sample of 12 is just as likely as any
other sample of 12 particular males. But it
certainly isn't representative of the height of
Chicago young males.

108

But you must take a random sample to have any
justification for your conclusions.
Now the ONLY way you can know that a sample is
random is if it was selected by a legitimate
random sampling procedure.
Today, most random selections are done by
computer. But there are other methods, such as
drawing names out of a container if the container
was appropriately shaken.

109

The lottery in the US is done by putting a set
of numbered balls in a machine. The machine
stirs them up and selects 5 numbered balls, one
at a time. These numbers are the lottery
winners.
Anyone who bought a lottery ticket which has
the same 5 numbers as were drawn will win the
lottery.
Because this equipment was designed as lottery
equipment, it is fair to say that the sample of 5
balls drawn is a random sample.

110

Formally, in statistics, a random sample is
thought of as n independent and identically
distributed (iid) random variables, that is, x₁,
x₂, x₃, ..., xₙ.
In this case, xᵢ is the random variable from
which the ith value in the sample was obtained.
When we want to speak of a random sample, we
say "Let {xᵢ} be a set of n iid random
variables."

111

Once you get the random sample, you can get
the distribution of the variable of interest for
the sample.
Then you can use the empirical sample
distribution to estimate the parameters in the
sampling frame, but not in the entire population.
Most of what we estimate are the two most
important moments, µ and σ².

112

Since we don't know the theoretical mean µ and
variance σ², we can estimate them from our
sample.
The mean estimate is
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where n is the sample size.

113

The estimate of the second moment, the
variance, is
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Although the variance is a measure of the
spread or variability of the distribution around
the mean, usually we take the square root of the
variance, the standard deviation, to get the
measure in the same scale as the mean. The
standard deviation is also a measure of
variability.

114

Now two questions arise. First, if we are
going to take the square root anyway, why do we
bother to square the estimate in the first place?
The answer is simple if you look at the
formula carefully.

115

Clearly, if you didn't square the deviations
in the numerator, they would always sum to 0,
because the mean is the value such that the
deviations around it always sum to 0.
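
A quick numerical check makes this concrete. The five values below
are hypothetical and chosen only for illustration.

```python
values = [4, 5, 6, 7, 8]            # hypothetical observations
mean = sum(values) / len(values)    # 6.0

deviations = [x - mean for x in values]
squared = [d ** 2 for d in deviations]

print(sum(deviations))  # the raw deviations cancel each other out
print(sum(squared))     # the squared deviations do not
```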

116

Now for the second question. Why is it that
when we estimate the mean, we divide by n, but
when we estimate the variance, we divide by n -1?
The answer is in the concept of degrees of
freedom.
When we estimate the mean, each value of x is
free to be whatever it is. Thus, there are no
constraints on any value of x, so there are n
degrees of freedom because there are n
observations in the sample.

117

But when we estimate the variance, we use the
mean estimate in the formula. Once we know the
mean, which we must to compute the variance, we
lose one degree of freedom.
Suppose we have 5 observations and their mean
is 6. If the values 4, 5, 6, 7 are 4 of these 5
observations, the 5th observation is not free to
be anything but 8.
So when we use the estimated mean in a formula
we always lose a degree of freedom.
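
The same arithmetic can be written as a tiny function. The name
last_observation is hypothetical; the sketch only restates the
example above.

```python
def last_observation(known_values, mean, n):
    """Value forced on the one remaining observation once the mean is fixed."""
    # The n observations must add up to n * mean, so the last one is determined.
    return n * mean - sum(known_values)

print(last_observation([4, 5, 6, 7], mean=6, n=5))  # 8
```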

118

In the formula for the variance, only n − 1 of
the (xᵢ − x̄)² terms are free to vary. The nth
one is not free to vary. That's why we divide by
n − 1.
One last point:
The sample mean and the sample variance for
normal distributions are independent of one
another.

119

Now let's take a random sample of size 18 of
the height of Korean male students at KAIST.
Let's say the height measurements are
165,166,168,168,172,172,172,175,175,175,
175,178,178,178,182,182,184,185, all in cm.
Now the mean of these is 175 cm. The standard
deviation is 6 cm. And the distribution is
symmetric, as shown next.

120
(Figure: distribution of the 18 height measurements; no transcript available)
121

The distribution would be much closer to
normal if the sample were larger, but with 18
observations, it still is symmetric.
The median of the distribution is 175, the same
as the mean. The median is a measure of central
tendency such that half of the observations fall
below and half above.
The mode of this distribution is also 175.
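
These summary values can be reproduced from the 18 measurements with
Python's standard statistics module. This is a sketch for checking
the arithmetic, not part of the original slides.

```python
import statistics

heights_cm = [165, 166, 168, 168, 172, 172, 172, 175, 175, 175,
              175, 178, 178, 178, 182, 182, 184, 185]

n = len(heights_cm)
mean = statistics.mean(heights_cm)          # sum divided by n
variance = statistics.variance(heights_cm)  # squared deviations divided by n - 1
std_dev = statistics.stdev(heights_cm)      # square root of the variance
median = statistics.median(heights_cm)
mode = statistics.mode(heights_cm)

print(n, mean, variance, std_dev, median, mode)
```

The mean, median and mode all come out at 175 cm, the variance at 36
and the standard deviation at 6 cm.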

122

For normal distributions, the mean, median,
and mode are all equal. In fact for all unimodal
symmetric distributions, the mean, median, and
mode are all equal.
The mth percentile is the point below which
m% of the observations fall. The 10th percentile is
the point below which are 10% of the
observations. The 60th percentile is the point
below which are 60% of the observations.
The 1st quartile is the point below which are
25% of the observations. The 3rd quartile is the
point below which are 75% of the observations.
The median is the 50th percentile and the 2nd
quartile.
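
As a sketch, the quartiles of the KAIST sample can be obtained with
statistics.quantiles. Note that for small samples different software
uses slightly different interpolation rules, so the 1st and 3rd
quartiles may differ a little from a hand calculation.

```python
import statistics

heights_cm = [165, 166, 168, 168, 172, 172, 172, 175, 175, 175,
              175, 178, 178, 178, 182, 182, 184, 185]

# n=4 asks for the three quartile cut points; n=100 would give percentiles.
q1, q2, q3 = statistics.quantiles(heights_cm, n=4)

print(q1, q2, q3)  # the 2nd quartile is the median
```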

123

This is our first empirical distribution. We
know its mean, its standard deviation, and its
general shape. The estimates of the mean and
standard deviation are called statistics and are
shown in roman type.
Now assume that the sample that we used was
indeed a random sample of male students at KAIST.
Now we can ask how good is our estimate of the
true mean of all KAIST male students.

124

In order to answer this question, assume
that you did this study -- selecting 18 male
students at KAIST and measuring their height --
infinitely often. After each study, you record
the sample mean and variance.
Now you have infinitely many sample means from
samples of n = 18, and they must have a
distribution, with a mean and variance. Note
that now we are getting the distribution of a
statistic, not a fundamental measurement.
Distributions of statistics are called
sampling distributions.
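
We cannot repeat the study infinitely often, but a computer can
repeat it many times. The sketch below assumes, purely for
illustration, that heights are normal with mean 175 and standard
deviation 6; those population values are assumptions, not facts
about KAIST students.

```python
import random
from statistics import fmean, stdev

random.seed(1)
MU, SIGMA, N = 175.0, 6.0, 18   # assumed population values and sample size

def one_study():
    """Draw one sample of N heights and return its sample mean."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    return fmean(sample)

sample_means = [one_study() for _ in range(20_000)]

# The sample means have their own distribution: a sampling distribution.
print(round(fmean(sample_means), 2))   # close to MU
print(round(stdev(sample_means), 2))   # close to SIGMA / sqrt(N), about 1.41
```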

125

So far, we have had theoretical population
distributions of the random variable X and
empirical sample distributions of the random
variable X.
Now we move into sampling distributions, where
the random variable is not X but a function of X
called a statistic.

126

The first sampling distribution we will
consider is that of the sample mean so we can see
how good our estimate of the population mean is.
Because we don't really do the experiment
infinitely often (we just imagine that it is
possible to do so), we need to know the
distribution of the sample mean.

127

This is where an amazing theorem comes to our
rescue: the Central Limit Theorem.
Let x̄ be the mean of a random sample of size n
from f(x), where f(x) has mean µ and finite
variance σ². Now define
$$y = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Then the distribution of y approaches the normal
with mean 0 and variance 1 as n increases without
bound.
Note that y here is just the standardized
version of the statistic x̄.

128

This theorem holds for means of samples of any
size n where f(x) is normal.
But the really amazing thing is that it also
holds for means of any distributional form of
f(x) for large n. Of course, the more the
distribution differs from normality, the larger n
must be.
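
A simulation sketch shows the theorem at work on a clearly non-normal
f(x). The exponential distribution with mean 1 and standard deviation
1 is chosen here only as an example. If the standardized mean is
close to standard normal, about 95% of its values should fall within
±1.96.

```python
import math
import random

random.seed(2)

def standardized_mean(n):
    """y = (xbar - mu) / (sigma / sqrt(n)) for an exponential sample (mu = sigma = 1)."""
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (xbar - 1.0) / (1.0 / math.sqrt(n))

def share_within(n, z=1.96, repeats=5_000):
    """Fraction of standardized means that land inside [-z, z]."""
    hits = sum(abs(standardized_mean(n)) <= z for _ in range(repeats))
    return hits / repeats

# The share should move toward 0.95 as n grows.
for n in (2, 30, 200):
    print(n, share_within(n))
```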

129

Now we're back to our original question: How
good is our sample estimate of the mean of the
population?
We know that x̄ is distributed approximately
normally with mean µ thanks to the CLT. The
standard deviation of x̄ is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
The standard deviation of x̄ is often called
the standard error because x̄ is an estimate of µ
and any variation of x̄ around µ is error of
estimate. By contrast, the standard deviation of
X is just the natural variation of X and is not
error.

130

So now we can define a confidence interval for
our estimate of the mean:
$$\bar{x} - z_\alpha \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_\alpha \frac{\sigma}{\sqrt{n}}$$
where zα is the standard normal deviate which
leaves α/2 in each tail of the normal
distribution.
If zα = 1.96, then the confidence interval
will contain the parameter µ 95% of the time.
Hence, this is called a 95% confidence interval
and its two end points are called confidence
limits.
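
As a sketch, the interval can be computed as follows. The function
name z_interval is hypothetical, and the call at the end assumes,
for illustration only, that the population standard deviation is
known to be 6 cm.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for mu when the population sigma is treated as known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lower, upper = z_interval(xbar=175, sigma=6, n=18)
print(round(lower, 2), round(upper, 2))
```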

131

If σ is small, the interval will be very
tight, so the estimate is a precise one. On the
other hand, if σ is large, the interval will be
wide, so the estimate is not so precise.
Now it is important to get the interpretation
of a confidence interval clear. It does NOT mean
that the population mean µ has a 95% probability
of falling within the interval.

132

That would be tantamount to saying that µ is a
random variable that has a probability function
associated with it.
But µ is a parameter, not a random variable,
so its value is fixed. It is unknown but fixed.

133

So the proper interpretation for a 95%
confidence interval is this. Imagine that you
have taken zillions (zillions means infinitely
often) of random samples of n = 18 KAIST male
students and obtained the mean and standard
deviation of their height for each sample.
Now imagine that you can form the 95%
confidence interval for each sample estimate as
we have done above. Then 95% of these zillions
of confidence intervals will contain the
parameter µ.
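
This interpretation can be imitated on a computer. The sketch below
replaces "zillions" with 20,000 repetitions and assumes a normal
population with µ = 175 and a known σ = 6; both values are
illustrative assumptions.

```python
import math
import random
from statistics import fmean

random.seed(3)
MU, SIGMA, N = 175.0, 6.0, 18   # assumed population values and sample size

def interval_covers_mu():
    """Draw one sample, form its 95% interval, and report whether it contains MU."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = fmean(sample)
    half_width = 1.96 * SIGMA / math.sqrt(N)
    return xbar - half_width <= MU <= xbar + half_width

trials = 20_000
coverage = sum(interval_covers_mu() for _ in range(trials)) / trials
print(coverage)  # fraction of intervals that captured MU, close to 0.95
```

Note that µ never changes in this simulation. Only the intervals
vary from sample to sample.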

134

It may seem counter-intuitive to say that we
have 95% confidence that our 95% confidence
interval contains µ, but that there is not 95%
probability that µ falls in the interval.
But if you understand the proper
interpretation, you can see the difference. The
idea is that 95% of the intervals formed in this
way will capture µ. This is why they are called
confidence intervals, not probability intervals.

135

Now we can also form 99% confidence intervals
simply by changing the 1.96 in the formula to
2.58. Of course, this will widen the interval,
but you will have greater confidence.
90% confidence intervals can be formed by
using 1.645 in the formula. This will narrow the
interval, but you will have less confidence.
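
The multipliers come from the standard normal distribution. A short
sketch using the standard library (the helper name z_for is
hypothetical):

```python
from statistics import NormalDist

def z_for(confidence):
    """Standard normal deviate that leaves (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for(level), 3))
```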

136

But when we try to find a confidence interval,
we run into a problem. How can we find the
confidence interval when we don't know the
parameter σ?
Of course, we could substitute the estimate s
for σ, but then our confidence statement would be
inexact, and especially so for small samples.
The way out was shown by W. S. Gosset, who
wrote under the pseudonym "Student". His classic
paper introducing the t distribution has made him
the founder of the modern theory of exact
statistical inference.

137

Student's t is
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
t involves only one parameter, µ, and has the t
distribution with n − 1 degrees of freedom, which
involves no unknown parameters.

138

The t distribution is
$$f(t) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\,\Gamma\left(\frac{k}{2}\right)} \left(1 + \frac{t^2}{k}\right)^{-(k+1)/2}$$
where k is the only parameter and k = the
number of degrees of freedom.
Student's t distribution is symmetric like the
normal but with higher and longer tails for small
k. The t distribution approaches the normal as k
→ ∞, as can be seen in the t table on the
following page.

139
Table of t values for selected df and F(t)

                      F(t)
 df    .75    .90    .95   .975    .99   .995  .9995
 17   .689  1.333  1.740  2.110  2.567  2.898  3.965
 30   .683  1.310  1.697  2.042  2.457  2.750  3.646
 40   .681  1.303  1.684  2.021  2.423  2.704  3.551
 60   .679  1.296  1.671  2.000  2.390  2.660  3.460
120   .677  1.289  1.658  1.980  2.358  2.617  3.373
  ∞   .674  1.282  1.645  1.960  2.326  2.576  3.291
140

Now we can solve the problem of computing
confidence intervals for the mean:
$$\bar{x} - t\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t\,\frac{s}{\sqrt{n}}$$
This formula
is correct only if s is computed with n − 1 in the
denominator.
t is tabled so that its extreme points (to get
95%, 99% confidence intervals, etc.) are given by
t.975 and t.995, respectively. There is also a
tdist function in Excel which gives the tail
probability for any value.

141

In our sample of 18 KAIST males, the estimated
mean = 175 cm and the estimated standard deviation
= 6 cm. So our 95% confidence interval is
175 ± 2.110 (6 / √18) or
(172 ≤ µ ≤ 178)
where 2.110 is the tabled value of t.975 for
17 df. This interval isn't very tight but then
we had only 18 observations.
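
The calculation can be checked directly from the 18 measurements.
In this sketch the t value is typed in from the table (the constant
name T_975_DF17 is hypothetical), because Python's standard library
has no t distribution.

```python
import math
import statistics

heights_cm = [165, 166, 168, 168, 172, 172, 172, 175, 175, 175,
              175, 178, 178, 178, 182, 182, 184, 185]

T_975_DF17 = 2.110   # tabled value of t.975 for 17 df

n = len(heights_cm)
xbar = statistics.mean(heights_cm)
s = statistics.stdev(heights_cm)          # computed with n - 1 in the denominator
half_width = T_975_DF17 * s / math.sqrt(n)

lower, upper = xbar - half_width, xbar + half_width
print(round(lower, 1), round(upper, 1))
```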

142

Technically, we always have to use the t
distribution for confidence intervals for the
mean, even for large samples, because the value σ
is always unknown.
But it turns out that when the sample size is
over 30, the t distribution and the normal
distribution give nearly the same values, that
is, z.975 ≈ t.975 (1.960 versus 2.042 at 30 df),
because the t distribution approaches the
normal distribution as df → ∞.

143

What about the distribution of s²,
the estimate of σ²?
The statistic s², scaled as (n − 1)s²/σ², has a
chi-square distribution with n − 1 df. Chi-square
is a new distribution for us, but it is the
distribution of the quantity
$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}$$

144

or, if we convert to a standard normal deviate,
where
$$z_i = \frac{x_i - \mu}{\sigma}$$
then
$$\sum_{i=1}^{n} z_i^2$$
has a chi-square distribution with n df. So
the sample variance, suitably scaled, has a
chi-square distribution.

145

What about a confidence interval for σ²? In
our KAIST sample, n = 18, s = 6, and s² = 36.
The formula for the confidence interval is
$$\frac{(n-1)s^2}{\chi^2_{.975}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{.025}}$$
This is a 95% confidence interval for σ² and
it is very wide because we had only 18
observations. The two χ² values are those for
.975 and .025 with n − 1 = 17 df. Confidence
intervals for variances are rarely of interest.
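
To see how wide the interval is, here is a sketch of the arithmetic.
The two chi-square values are not given in these notes. They are
taken from a standard chi-square table for 17 df and should be
treated as assumed inputs, as should the constant names.

```python
n, s2 = 18, 36   # KAIST sample size and sample variance

# Assumed inputs from a standard chi-square table, 17 df.
CHI2_975_DF17 = 30.191   # point with 97.5% of the distribution below it
CHI2_025_DF17 = 7.564    # point with 2.5% of the distribution below it

lower = (n - 1) * s2 / CHI2_975_DF17
upper = (n - 1) * s2 / CHI2_025_DF17
print(round(lower, 1), round(upper, 1))   # limits for the variance, in cm squared
```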

146

Much more common is the problem of comparing
two variances where the two random variables are
of different orders of magnitude.
For example, which is more variable, the
weight of elephants or the weight of mice?
Now we know that elephants have a very large
mean weight and mice have a very small mean
weight. But is their variability around their
mean very different?

147

The only way we can answer this is to take
their variability relative to their average
weight. To do so, we use the standard deviation
as the measure of variability.
The quantity
$$CV = \frac{s}{\bar{x}}$$
is a measure of relative variability called
the coefficient of variation.

148

Now if you had a random sample of elephant
weights and a random sample of mouse weights, you
could compare the coefficient of variation of
elephant weight with the coefficient of variation
of mouse weight and answer the question.
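
A minimal sketch of that comparison follows. The weights are
invented for illustration and are not real measurements, and the
function name coefficient_of_variation is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (unit-free)."""
    return statistics.stdev(values) / statistics.mean(values)

elephants_kg = [4800, 5200, 5500, 6000, 6300]   # made-up weights
mice_kg = [0.018, 0.020, 0.022, 0.025, 0.030]   # made-up weights

print(round(coefficient_of_variation(elephants_kg), 3))
print(round(coefficient_of_variation(mice_kg), 3))
```

Because the coefficient of variation has no units, the two numbers
can be compared directly even though the weights differ by several
orders of magnitude.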

149

What are the properties of an estimator that
make it good?
1. Unbiased
2. Consistent
3. Best unbiased

150

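Let's look at each of these in turn.
1. An unbiased estimator is one where
$$E(\hat{\theta}) = \theta$$
The sample mean is an unbiased estimator of µ
because
$$E(\bar{x}) = E\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i)$$
and since E(X) = µ and there are n E(X) in this
sum, we have
$$E(\bar{x}) = \frac{1}{n}(n\mu) = \mu$$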

151

Is s² an unbiased estimator of σ²?
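
One way to explore the question is by simulation. The sketch below
assumes a normal population with σ = 2 (so σ² = 4) and samples of
size 5. These values are arbitrary choices for illustration. It
averages the variance estimate over many samples, once with n − 1 in
the denominator and once with n.

```python
import random

random.seed(4)
SIGMA, N = 2.0, 5   # assumed values; the true variance is SIGMA**2 = 4

def both_estimates():
    """Return the variance estimate with divisor n - 1 and with divisor n."""
    sample = [random.gauss(0.0, SIGMA) for _ in range(N)]
    xbar = sum(sample) / N
    ss = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations
    return ss / (N - 1), ss / N

results = [both_estimates() for _ in range(40_000)]
avg_n_minus_1 = sum(r[0] for r in results) / len(results)
avg_n = sum(r[1] for r in results) / len(results)

print(round(avg_n_minus_1, 2), round(avg_n, 2))
```

In this simulation the n − 1 version averages close to 4, while the
n version averages close to 3.2, that is, (n − 1)/n times σ². This
is consistent with s² being unbiased when it is computed with n − 1.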

152

2. A consistent estimator is one for which the
estimator gets closer and closer to the parameter
value as n increases without limit.
3. A best unbiased estimator, also called a
minimum variance unbiased estimator, is one which
is first of all unbiased and has the minimum
variance among all unbiased estimators.