Basic Statistics II

1
Basic Statistics II

Biostatistics, MHA, CDC, Jul 09
Prof. KG Satheesh Kumar
Asian School of Business

2
Frequency Distribution and Probability
Distribution

Frequency Distribution: a plot of frequency along
the y-axis and the variable along the x-axis
A histogram is an example
Probability Distribution: a plot of probability
along the y-axis and the variable along the x-axis
Both have the same shape
Properties of probability distributions
Probability is always between 0 and 1
Sum of probabilities must be 1

3
Theoretical Probability Distributions

For a discrete variable we have discrete
probability distribution
Binomial Distribution
Poisson Distribution
Geometric Distribution
Hypergeometric Distribution
For a continuous variable we have continuous
probability distribution
Uniform (rectangular) Distribution
Exponential Distribution
Normal Distribution

4
The Normal Distribution

If a random variable X is affected by many
independent causes, none of which is
overwhelmingly large, the probability
distribution of X closely follows the normal
distribution. Then X is called a normal variate and
we write X ~ N(μ, σ²), where μ is the mean and σ²
is the variance
A normal pdf is completely defined by its mean, μ,
and variance, σ². The square root of the variance is
called the standard deviation, σ.
If several independent random variables are
normally distributed, their sum will also be
normally distributed with mean equal to the sum
of individual means and variance equal to the sum
of individual variances.
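
The additivity rule for independent normal variables can be illustrated with Python's standard-library `statistics.NormalDist`, whose `+` operator treats its two operands as independent. This is only a sketch: the two distributions below use made-up numbers that do not appear in the slides.

```python
from statistics import NormalDist

# Hypothetical independent normal variables (illustrative values only)
x1 = NormalDist(mu=10, sigma=3)
x2 = NormalDist(mu=20, sigma=4)

# Adding two NormalDist objects assumes independence:
# the means add, and the variances add (3^2 + 4^2 = 25).
total = x1 + x2

print(total.mean)      # sum of the individual means
print(total.variance)  # sum of the individual variances
```

Note that it is the variances, not the standard deviations, that add.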

5
The Normal pdf
6
The area under any pdf between two given values
of X is the probability that X falls between
these two values
7
Standard Normal Variate, Z

The standard normal variate (SNV), Z, is the normal
random variable with mean 0 and standard deviation 1
Tables are available for standard normal
probabilities
X and Z are connected by
Z = (X − μ) / σ and X = μ + σZ
The area under the X curve between X1 and X2 is
equal to the area under the Z curve between Z1 and Z2.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.1   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.2   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.3   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.4   0.1554  0.1591  0.1628  0.1664  0.1700  0.1736  0.1772  0.1808  0.1844  0.1879
0.5   0.1915  0.1950  0.1985  0.2019  0.2054  0.2088  0.2123  0.2157  0.2190  0.2224
0.6   0.2257  0.2291  0.2324  0.2357  0.2389  0.2422  0.2454  0.2486  0.2517  0.2549
0.7   0.2580  0.2611  0.2642  0.2673  0.2704  0.2734  0.2764  0.2794  0.2823  0.2852
0.8   0.2881  0.2910  0.2939  0.2967  0.2995  0.3023  0.3051  0.3078  0.3106  0.3133
0.9   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.0   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.1   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.2   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
1.3   0.4032  0.4049  0.4066  0.4082  0.4099  0.4115  0.4131  0.4147  0.4162  0.4177
1.4   0.4192  0.4207  0.4222  0.4236  0.4251  0.4265  0.4279  0.4292  0.4306  0.4319
1.5   0.4332  0.4345  0.4357  0.4370  0.4382  0.4394  0.4406  0.4418  0.4429  0.4441
1.6   0.4452  0.4463  0.4474  0.4484  0.4495  0.4505  0.4515  0.4525  0.4535  0.4545
1.7   0.4554  0.4564  0.4573  0.4582  0.4591  0.4599  0.4608  0.4616  0.4625  0.4633

Standard Normal Probabilities (Table of z
distribution)
The z-value is read from the left and top margins and
the probability (the shaded area in the diagram, i.e.
the area between 0 and z) is in the body of the table
9
Illustration

Q. A tube light has a mean life of 4500 hours with a
standard deviation of 1500 hours. In a lot of
1000 tubes, estimate the number of tubes lasting
between 4000 and 6000 hours.
P(4000 < X < 6000) = P(−1/3 < Z < 1)
= 0.1306 + 0.3413
= 0.4719
Hence the probable number of tubes in a lot of
1000 lasting 4000 to 6000 hours is 472
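
The table lookup above can be cross-checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in place of the printed z-table; the variable names are illustrative.

```python
from statistics import NormalDist

life = NormalDist(mu=4500, sigma=1500)  # tube life in hours, from the question

# Standardise the limits: Z = (X - mu) / sigma
z_low = (4000 - life.mean) / life.stdev
z_high = (6000 - life.mean) / life.stdev

# P(4000 < X < 6000) is the area under the pdf between the two limits
p = life.cdf(6000) - life.cdf(4000)
expected_tubes = round(1000 * p)

print(z_low, z_high)
print(round(p, 4), expected_tubes)
```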

10
Illustration

Q. The cost of a certain procedure is estimated to
average Rs. 25,000 per patient. Assuming a normal
distribution and a standard deviation of Rs. 5,000,
find a value such that 95% of the patients pay
less than that.
Using tables, P(Z < Z1) = 0.95 gives Z1 = 1.645.
Hence X1 = 25000 + 1.645 × 5000 = Rs. 33,225
95% of the patients pay less than Rs. 33,225
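
Going from a probability back to a value of X is the inverse of a table lookup. As a minimal sketch, `NormalDist.inv_cdf` from the standard library does this directly; the result differs from the slide's figure only by the rounding of Z1 to 1.645.

```python
from statistics import NormalDist

cost = NormalDist(mu=25_000, sigma=5_000)  # cost in Rs., from the question

z1 = NormalDist().inv_cdf(0.95)   # standard normal value with 95% of the area below it
x1 = cost.mean + z1 * cost.stdev  # X = mu + sigma * Z

print(round(z1, 3), round(x1))
```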

11
Sampling Basics

Population or Universe is the collection of all
units of interest. E.g. Households of a specific
type in a given city at a certain time.
Population may be finite or infinite
Sampling Frame is the list of all the units in
the population with identifiers such as serial numbers,
house numbers, telephone numbers, etc.
Sample is a set of units drawn from the
population according to some specified procedure
Unit is an element or group of elements on which
observations are made. E.g. a person, a family, a
school, a book, a piece of furniture etc.

12
Census Vs Sampling

Census
Thought to be accurate and reliable, but often
not so if the population is large
More resources (money, time, manpower)
Unsuitable for destructive tests
Sampling
Fewer resources
Highly qualified and skilled persons can be used
Sampling error, which can be reduced by using a large
and representative sample

13
Sampling Methods

Probability Sampling (Random Sampling)
Simple Random Sampling
Systematic Random Sampling
Stratified Random Sampling
Cluster Sampling (Single stage , Multi-stage)
Non-probability Sampling
Convenience Sampling
Judgment Sampling
Quota Sampling

14
Limitations of Non-Random Sampling

Selection does not ensure a known chance that a
unit will be selected (i.e. non-representative)
Inaccurate in view of the selection bias
Results cannot be used for generalisation because
inferential statistics requires probability
sampling for valid conclusions
Useful for pilot studies and exploratory research

15
Sampling Distribution and Standard Error of the
Mean

The sampling distribution of x̄ is the
probability distribution of all possible values
of x̄ for a given sample size n taken from the
population.
According to the Central Limit Theorem (CLT), for a large
enough sample size n, the sampling distribution
is approximately normal with mean μ and standard
deviation σ/√n. This standard deviation is called the
standard error of the mean.
The CLT holds for non-normal populations also and
states: for large enough n, x̄ ~ N(μ, σ²/n)
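
A small simulation can make the Central Limit Theorem concrete. The sketch below is an illustration only: the exponential population, the sample size of 50 and the 5,000 repetitions are arbitrary choices, not values from the slides. It checks that the sample means cluster around μ with a spread close to σ/√n.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

n = 50              # sample size (illustrative choice)
num_samples = 5000  # number of repeated samples (illustrative choice)

# Exponential population with rate 1: mean 1, SD 1, clearly non-normal
mu, sigma = 1.0, 1.0

# Draw many samples of size n and record each sample mean
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

print("mean of sample means:", statistics.fmean(sample_means))
print("SD of sample means:  ", statistics.stdev(sample_means))
print("sigma / sqrt(n):     ", sigma / sqrt(n))
```

The last two printed values should be close to each other, which is the standard error of the mean at work.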

16
Illustration

Q. When sampling from a population with SD 55,
using a sample size of 150, what is the
probability that the sample mean will be at least
8 units away from the population mean?
Standard Error of the mean, SE = 55/sqrt(150)
= 4.4907
Hence 8 units = 1.7815 SE
Area within 1.7815 SE on both sides of the mean
= 2 × 0.4625 = 0.925
Hence required probability = 1 − 0.925 = 0.075

17
Illustration

Q. An Economist wishes to estimate the average
family income in a certain population. The
population SD is known to be 4,500 and the
economist uses a random sample of size 225. What
is the probability that the sample mean will fall
within 800 of the population mean?
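
The slide leaves this question unanswered. One possible way to work it, assuming the CLT approximation applies, is sketched below; the helper name `prob_within` is made up for this example, and the same helper also re-checks the previous illustration.

```python
from math import sqrt
from statistics import NormalDist

def prob_within(distance, population_sd, n):
    """P(|sample mean - population mean| < distance), using the CLT."""
    se = population_sd / sqrt(n)  # standard error of the mean
    z = distance / se             # distance expressed in standard errors
    return 2 * NormalDist().cdf(z) - 1

# Previous illustration: SD 55, n = 150, at least 8 units away
p_far = 1 - prob_within(8, 55, 150)

# This question: SD 4,500, n = 225, within 800 of the population mean
p_near = prob_within(800, 4500, 225)

print(round(p_far, 3), round(p_near, 3))
```

Here SE = 4500/sqrt(225) = 300, so 800 is about 2.67 standard errors and the probability comes out a little above 0.99.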

18
Point and Interval Estimation

The value of an estimator (see next slide),
obtained from a sample can be used to estimate
the value of the population parameter. Such an
estimate is called a point estimate.
This is a 50:50 estimate, in the sense that the
actual parameter value is equally likely to be on
either side of the point estimate.
A more useful estimate is the interval estimate,
where an interval is specified along with a
measure of confidence (90%, 95%, 99%, etc.)
The interval estimate with its associated measure
of confidence is called a confidence interval.
A confidence interval is a range of numbers
believed to include the unknown population
parameter, with a certain level of confidence

19
Estimators

Population parameters (μ, σ², p) and sample
statistics (x̄, s², ps)
An estimator of a population parameter is a
sample statistic used to estimate the parameter
Statistic x̄ is an estimator of parameter μ
Statistic s² is an estimator of parameter σ²
Statistic ps is an estimator of parameter p

20
Illustration

Q. A wine importer needs to report the average
percentage of alcohol in bottles of French wine.
From experience with previous kinds of wine, the
importer believes the population SD is 1.2. The
importer randomly samples 60 bottles of the new
wine and obtains a sample mean of 9.3. Find the
90% confidence interval for the average
percentage of alcohol in the population.

21
Answer

Standard Error = 1.2/sqrt(60) = 0.1549
For a 90% confidence interval, Z = 1.645
Hence the margin of error = 1.645 × 0.1549
= 0.2548
Hence the 90% confidence interval is
9.3 ± 0.25 (approximately 9.3 ± 0.3)
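
The steps of this answer can be sketched in a few lines. The Z value is taken from `statistics.NormalDist` rather than a printed table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, population_sd, n = 9.3, 1.2, 60  # values from the question
confidence = 0.90

se = population_sd / sqrt(n)  # standard error of the mean

# Two-sided interval: (1 - confidence) / 2 of the area goes in each tail
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z * se
lower, upper = sample_mean - margin, sample_mean + margin

print(round(se, 4), round(z, 3), round(margin, 4))
print(round(lower, 2), round(upper, 2))
```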

22
More Sampling Distributions

Sampling Distribution is the probability
distribution of a given test statistic (e.g. Z),
which is a numerical quantity calculated from
sample statistic
Sampling distribution depends on the distribution
of the population, the statistic being considered
and the sample size
Distribution of Sample Mean: Z or t distribution
Distribution of Sample Proportion: Z (large
sample)
Distribution of Sample Variance: Chi-square
distribution

23
The t-distribution

The t-distribution is also bell-shaped and very
similar to the standard normal distribution, Z ~ N(0, 1)
Its mean is 0 and its variance is df/(df − 2), for df > 2
df = degrees of freedom = n − 1, where n = sample size
For large sample sizes, t and Z are practically identical
For small n, the variance of t is larger than
that of Z and hence the tails are wider, indicating the
uncertainty introduced by the unknown population SD
or the smaller sample size n

25
Illustration

Q. A large drugstore wants to estimate the
average weekly sales for a brand of soap. A
random sample of 13 weeks gives the following
numbers: 123, 110, 95, 120, 87, 89, 100, 105, 98,
88, 75, 125, 101. Determine the 90% confidence
interval for average weekly sales.
Sample mean = 101.23 and sample SD = 15.13. From the
t-table, the value for 90% confidence at df = 12 is t =
1.782. Hence Margin of Error = 1.782 ×
15.13/sqrt(13) = 7.48. The 90% confidence
interval is (93.75, 108.71)
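
The sample statistics and the interval can be reproduced as follows. This is a sketch using only the standard library, which has no t-distribution, so the critical value 1.782 is simply copied from the slide rather than computed.

```python
import statistics
from math import sqrt

sales = [123, 110, 95, 120, 87, 89, 100, 105, 98, 88, 75, 125, 101]

n = len(sales)
mean = statistics.mean(sales)
sd = statistics.stdev(sales)  # sample SD (divides by n - 1)

t_crit = 1.782  # t-table value for 90% confidence, df = 12, as quoted on the slide
margin = t_crit * sd / sqrt(n)
lower, upper = mean - margin, mean + margin

print(round(mean, 2), round(sd, 2), round(margin, 2))
print(round(lower, 2), round(upper, 2))
```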

26
Chi-Square Distribution

Chi-square distribution is the probability
distribution of the sum of several independent
squared Z variables
It has a df parameter associated with it (like t
distribution).
Being a sum of squares, the chi-squares cannot be
negative and hence the distribution curve is
entirely on the positive side, skewed to the
right.

The mean is df and variance is 2df
27
Confidence Interval for population variance using
chi-square distribution

A random sample of 30 gives a sample variance of
18,540 for a certain variable. Give a 95%
confidence interval for the population variance
Point estimate for population variance = 18,540
Given df = 29, Excel gives the chi-square values:
for a right-tail area of 2.5%, 45.7, and for 97.5%, 16.0
Hence for the population variance,
the lower limit of the confidence interval
= 18540 × 29/45.7 = 11,765 and
the upper limit of the confidence interval
= 18540 × 29/16.0 = 33,604
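
The interval above follows the pattern (df × s²)/χ² with the two chi-square values swapped between the limits. A minimal sketch of the arithmetic, with the critical values copied from the slide rather than computed:

```python
n = 30
sample_variance = 18_540
df = n - 1

# Chi-square critical values for df = 29, as quoted on the slide
chi2_upper = 45.7  # right-tail area 2.5%
chi2_lower = 16.0  # right-tail area 97.5%

# The larger chi-square value gives the lower limit, and vice versa
lower = df * sample_variance / chi2_upper
upper = df * sample_variance / chi2_lower

print(round(lower), round(upper))
```

The interval is not symmetric about 18,540 because the chi-square distribution is skewed to the right.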

29
Chi-Square Test for Goodness of Fit

A goodness-of-fit test is a statistical test of how
well sample data support an assumption about the
distribution of a population
The chi-square statistic used is
χ² = Σ(O − E)²/E, where O is the observed value
and E the expected value
The above value is then compared with the
critical value (obtained from a table or using
Excel) for the given df and the required level of
significance, α (1% or 5%)

30
Illustration

Q. A company comes out with a new watch and
wants to find out whether people have special
preferences for colour or whether all four
colours under consideration are equally
preferred. A random sample of 80 prospective
buyers indicated preferences as follows: 12, 40,
8, 20. Is there a colour preference at 1%
significance?
Assuming no preference, the expected values would
all be 20. Hence the chi-square value is 64/20 +
400/20 + 144/20 + 0 = 30.4
For df = 3 and 1% significance, the critical value
(the point with a right-tail area of 1%) is 11.3.
The computed value of 30.4 is far greater than
11.3 and hence deep in the rejection region. So
we reject the assumption of no colour preference.

Q. The following data are the numbers of babies born
on various days of the week during the
past one year in a hospital. Can we assume that
birth is independent of the day of the week?
Sun 116, Mon 184, Tue 148, Wed 145, Thu 153,
Fri 150, Sat 154 (Total 1050)
Ans. Assuming independence, the expected values
would all be 1050/7 = 150. Hence the chi-square
value is 34²/150 + 34²/150 + 2²/150 + 5²/150 + 3²/150
+ 0²/150 + 4²/150 = 2366/150 = 15.77
For df = 6 and 5% significance, the critical value
(the point with a right-tail area of 5%) is 12.6.
The computed value of 15.77 is greater than the
critical value of 12.6 and hence falls in the
rejection region. So we reject the assumption of
independence.
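
Both goodness-of-fit illustrations use the same statistic with equal expected counts, so one small function covers them. In this sketch the function name is made up, and the critical values are the ones quoted on the slides rather than computed.

```python
def chi_square_statistic(observed, expected=None):
    """Sum of (O - E)^2 / E; expected counts default to equal shares."""
    if expected is None:
        expected = [sum(observed) / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

watch = chi_square_statistic([12, 40, 8, 20])
births = chi_square_statistic([116, 184, 148, 145, 153, 150, 154])

# Critical values quoted on the slides: 11.3 (df 3, 1%) and 12.6 (df 6, 5%)
print(round(watch, 2), "reject" if watch > 11.3 else "do not reject")
print(round(births, 2), "reject" if births > 12.6 else "do not reject")
```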

32
Correlation

Correlation refers to the concomitant variation
between two variables in such a way that change
in one is associated with a change in the other
The statistical technique used to analyse the
strength and direction of the above association
between two variables is called correlation
analysis

33
Correlation and Causation

Even if an association is established between two
variables no cause-effect relationship is implied
Association between x and y may be looked upon
as
x causes y
y causes x
x and y influence each other (mutual influence)
x and y are both influenced by z, v (influence of
third variable)
due to chance (spurious association)
Hence caution needed while interpreting
correlation

34
Types of Correlations

Positive (direct) and negative (inverse)
Positive: direction of change is the same
Negative: direction of change is opposite
Linear and non-linear
Linear: changes are in a constant ratio
Non-linear: ratio of change is varying
Simple, Partial and Multiple
Simple: only two variables are involved
Partial: there may be a third and other variables,
but they are kept constant
Multiple: association of multiple variables
considered simultaneously

35
Scatter Diagrams
[Figure: scatter diagrams for correlation coefficients r = 1, r = −0.54, r = 0.85, r = −0.94, r = 0.42 and r = 0.17]
36
Correlation Coefficient

Correlation coefficient (r) indicates the
strength and direction of association
The value of r is between −1 and +1
−1: perfect negative correlation
+1: perfect positive correlation
Magnitude above 0.75: very high correlation
0.50 to 0.75: high correlation
0.25 to 0.50: low correlation
Below 0.25: very low correlation

37
Methods of Correlation Analysis

Scatter Diagram
A quick approximate visual idea of association
Karl Pearson's Coefficient of Correlation
For numeric data measured on an interval or ratio
scale
r = Cov(x, y) / (SDx × SDy)
Spearman's Rank Correlation
For ordinal (rank) data
R = 1 − 6 × (Sum of Squared Differences of Ranks) /
[n(n² − 1)]
Method of Least Squares
r² = bxy × byx, i.e. the product of the regression
coefficients

38
Karl Pearson Correlation Coefficient
(Product-Moment Correlation)

r = Covariance(x, y) / (SD of x × SD of y)
Recall: n Var(X) = SSxx, n Var(Y) = SSyy and
n Cov(X, Y) = SSxy
Thus r² = Cov²(X, Y) / [Var(X) Var(Y)]
= SSxy² / (SSxx × SSyy)
Note: r² is called the coefficient of determination

39
Sample Problem

The following data refer to two variables,
promotional expense (Rs. lakhs) and sales ('000
units), collected in the context of a promotional
study. Calculate the correlation coefficient.
Promo: 7 10 9 4 11 5 3
Sales: 12 14 13 5 15 7 4

40
Promo (X)  Sales (Y)  X - Ave(X)  Y - Ave(Y)  Sxy  Sxx  Syy
7          12          0           2            0    0    4
10         14          3           4           12    9   16
9          13          2           3            6    4    9
4           5         -3          -5           15    9   25
11         15          4           5           20   16   25
5           7         -2          -3            6    4    9
3           4         -4          -6           24   16   36

Ave(X) = 7, Ave(Y) = 10, SSxy = 83, SSxx = 58, SSyy = 124

Coefficient of Determination, r-squared = 83 × 83 / (58 × 124) = 0.95787
Coefficient of Correlation, r = square root of 0.95787 = 0.978708
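
The sums of squares and the correlation coefficient for the promo/sales data can be recomputed with a short script. This is a sketch with illustrative variable names, following the SSxy, SSxx, SSyy route used on the slides.

```python
from math import sqrt

promo = [7, 10, 9, 4, 11, 5, 3]     # X, Rs. lakhs
sales = [12, 14, 13, 5, 15, 7, 4]   # Y, '000 units

mean_x = sum(promo) / len(promo)
mean_y = sum(sales) / len(sales)

# Sums of products and squares of deviations from the means
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(promo, sales))
ss_xx = sum((x - mean_x) ** 2 for x in promo)
ss_yy = sum((y - mean_y) ** 2 for y in sales)

r_squared = ss_xy ** 2 / (ss_xx * ss_yy)  # coefficient of determination
r = ss_xy / sqrt(ss_xx * ss_yy)           # keeps the sign of SSxy

print(ss_xy, ss_xx, ss_yy)
print(round(r_squared, 5), round(r, 6))
```

Computing r as SSxy / sqrt(SSxx × SSyy), rather than as the square root of r², preserves the sign when the correlation is negative.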
41
Spearman's Rank Correlation Coefficient

The ranks of 15 students in two subjects A and B
are given below. Find Spearman's Rank Correlation
Coefficient
(1,10) (2,7) (3,2) (4,6) (5,4) (6,8)
(7,3) (8,1) (9,11) (10,15) (11,9) (12,5)
(13,14) (14,12) and (15,13)
Solution: SSD of Ranks = 81 + 25 + 1 + 4 + 1 + 4 +
16 + 49 + 4 + 25 + 4 + 49 + 1 + 4 + 4 = 272
R = 1 − 6 × 272/(14 × 15 × 16) = 0.5143
Hence there is a moderate degree of positive correlation
between the ranks of students in the two subjects
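
The rank-difference calculation can be checked as follows. This sketch applies the formula R = 1 − 6 Σd² / [n(n² − 1)], which assumes there are no tied ranks, as is the case here.

```python
# (rank in subject A, rank in subject B) for each of the 15 students
pairs = [(1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1),
         (9, 11), (10, 15), (11, 9), (12, 5), (13, 14), (14, 12), (15, 13)]

n = len(pairs)

# Sum of squared differences of ranks
ssd = sum((a - b) ** 2 for a, b in pairs)

# n(n^2 - 1) equals (n - 1) * n * (n + 1) = 14 * 15 * 16 for n = 15
rho = 1 - 6 * ssd / (n * (n ** 2 - 1))

print(ssd, round(rho, 4))
```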

42
Regression Analysis

Statistical technique for expressing the
relationship between two (or more) variables in
the form of an equation (regression equation)
Dependent or response or predicted variable
Independent or regressor or predictor variable
Used for prediction or forecasting

43
Types of Regression Models

Simple and Multiple Regression Models
Simple: only one independent variable
Multiple: more than one independent variable
Linear and Nonlinear Regression Models
Linear: the value of the response variable changes in
proportion to the change in the predictor, so that
Y = a + bX

44
Simple Linear Regression Model

Y = a + bX,
where a and b are constants to be determined using the
given data
Note: more strictly, we may write Y = ayx + byx X
To determine a and b, solve the following two
equations (called normal equations)
ΣY = a n + b ΣX ------- (1)
ΣXY = a ΣX + b ΣX² ------- (2)

45
Calculating Regression Coeff

Instead of solving the simultaneous equations one
may directly use formulae
For Y = a + bX, i.e. regression of Y on X
byx = SSxy / SSxx
ayx = Ȳ − byx X̄, where Ȳ and X̄ are the mean values of Y and X
For the X = a + bY form (regression of X on Y)
bxy = SSxy / SSyy
axy = X̄ − bxy Ȳ, where Ȳ and X̄ are the mean values of Y and X

46
Example

For the earlier problem of Sales (dependent
variable) Vs Promotional expenses (independent
variable) set up the simple linear regression
model and predict the sales when promotional
spending is Rs.13 lakhs
Solution: We need to find a and b in Y = a + bX
b = SSxy / SSxx = 83/58 = 1.4310
a = Ȳ − b X̄ (using the means) = 10 − 1.4310 × 7 = −0.017
Hence the regression equation is Y = −0.017 + 1.4310X
For X = 13 lakhs, we get Y = 18.59, i.e. 18,590
units of predicted sales
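
The fit and the prediction can be reproduced with the direct formulae b = SSxy/SSxx and a = Ȳ − b X̄. The sketch below uses illustrative variable names and unrounded coefficients, so its output may differ from the slide in the last decimal place.

```python
promo = [7, 10, 9, 4, 11, 5, 3]     # X, Rs. lakhs
sales = [12, 14, 13, 5, 15, 7, 4]   # Y, '000 units

mean_x = sum(promo) / len(promo)
mean_y = sum(sales) / len(sales)

ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(promo, sales))
ss_xx = sum((x - mean_x) ** 2 for x in promo)

b = ss_xy / ss_xx        # slope of the regression of Y on X
a = mean_y - b * mean_x  # intercept: the line passes through the means

predicted = a + b * 13   # predicted sales for Rs. 13 lakhs of promotion

print(round(b, 4), round(a, 3), round(predicted, 2))
```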

47
Linear Regression using Excel
48
Properties of Regression Coeff

Coefficient of determination r² = byx × bxy
If one regression coefficient is greater than one in
magnitude, the other must be less than one in magnitude,
because r² lies between 0 and 1
Both regression coefficients must have the same sign,
which is also the sign of the correlation coefficient r
The regression lines intersect at the means of X
and Y
Each regression coefficient gives the slope of
the respective regression line

49
Coefficient of Determination

Recall
SSyy = sum of squared deviations of Y from the
mean
Let us define
SSR as the sum of squared deviations of the estimated
values of Y (using the regression equation) from the
mean
SSE as the sum of squared errors
(error = actual Y − estimated Y)
It can be shown that
SSyy = SSR + SSE, i.e. Total Variation =
Explained Variation + Unexplained (error)
Variation
r² = SSR/SSyy = Explained Variation / Total
Variation
Thus r2 represents the proportion of the total
variability of the dependent variable y that is
accounted for or explained by the independent
variable x

50
Coefficient of Determination for Statistical
Validity of Promo-Sales Regression Model
Ye = −0.017 + 1.4310X is the estimated value of Y

Promo (X)  Sales (Y)  Ye     (Ye - Mean)²  (Y - Ye)²  (Y - Mean)²
7          12         10.00   0.00         4.00        4
10         14         14.29  18.43         0.09       16
9          13         12.86   8.19         0.02        9
4           5          5.71  18.43         0.50       25
11         15         15.72  32.76         0.52       25
5           7          7.14   8.19         0.02        9
3           4          4.28  32.76         0.08       36

Ave(X) = 7, Ave(Y) = 10, SSR = 118.77, SSE = 5.22, SSyy = 124

Coefficient of determination, r-squared = 118.77/124 = 0.957824

Thus 96% of the variation in sales is explained by promo expenses
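
The decomposition SSyy = SSR + SSE can be verified numerically for the promo/sales data. This is a sketch with illustrative names; it uses unrounded regression coefficients, so SSR and SSE may differ from the tabulated figures in the second decimal place.

```python
promo = [7, 10, 9, 4, 11, 5, 3]     # X, Rs. lakhs
sales = [12, 14, 13, 5, 15, 7, 4]   # Y, '000 units

mean_x = sum(promo) / len(promo)
mean_y = sum(sales) / len(sales)

# Least-squares fit of Y on X
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(promo, sales))
     / sum((x - mean_x) ** 2 for x in promo))
a = mean_y - b * mean_x
fitted = [a + b * x for x in promo]

ss_yy = sum((y - mean_y) ** 2 for y in sales)             # total variation
ssr = sum((f - mean_y) ** 2 for f in fitted)              # explained variation
sse = sum((y - f) ** 2 for y, f in zip(sales, fitted))    # unexplained variation

r_squared = ssr / ss_yy

print(round(ssr, 2), round(sse, 2), ss_yy)
print(round(r_squared, 4))
```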
