Statistics with Economics and Business Applications

Slides: 61

Provided by: ValuedGate793

Learn more at: https://yuedong.faculty.pstat.ucsb.edu

1
Statistics with Economics and Business
Applications

Chapter 11: Linear Regression and Correlation
Correlation Coefficient, Least Squares, ANOVA
Table, Tests and Confidence Intervals, Estimation
and Prediction

2
Review

I. What's in the last two lectures?
Hypothesis tests for means and proportions (Chapter 8)
II. What's in the next three lectures?
Correlation coefficient and linear regression (read Chapter 11)

3
Introduction

So far we have done statistics on one variable at a time.
We are now interested in relationships between two variables
and how to use one variable to predict another variable.
Does weight depend on height?
Does blood pressure level predict life
expectancy?
Do SAT scores predict college performance?
Does taking PSTAT 5E make you a better person?

4
Example: Age and Fatness
The following data were collected in a study of
age and fatness in humans.
One of the questions was, "What is the
relationship between age and fatness?"
Mazess, R.B., Peppler, W.W., and Gibbons, M.
(1984) Total body composition by dual-photon
(153Gd) absorptiometry. American Journal of
Clinical Nutrition, 40, 834-839
5
Example: Age and Fatness
The following scatterplot shows that % fat in
general tends to increase with age. The
relationship is close to, but not exactly, linear.
6
Example: Advertising and Sales

The following table contains sales (y) and
advertising expenditures (x) for 10 branches of
a retail store.

x (100s)   18   7  14  31  21   5  11  16  26  29
y (1000s)  55  17  36  85  62  18  33  41  63  87
7
Describing the Scatterplot

What pattern or form do you see?
Straight line upward or downward
Curve or no pattern at all
How strong is the pattern?
Strong or weak
Are there any unusual observations?
Clusters or outliers

8
Examples
Positive linear - strong
Negative linear - weak
Curvilinear
No relationship
9
Investigation of Relationship

There are two approaches to investigating a linear
relationship:
Correlation coefficient: a numerical measure of
the strength and direction of the linear
relationship between x and y.
Linear regression: a linear equation that expresses
the relationship between x and y. It provides the
form of the relationship.

10
The Correlation Coefficient

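The strength and direction of the relationship
between x and y are measured using the
correlation coefficient (Pearson product moment
coefficient of correlation), r:

$$r = \frac{s_{xy}}{s_x s_y}$$

where
$s_x$ = standard deviation of the x's, $s_y$ = standard
deviation of the y's, and $s_{xy}$ = sample covariance of x and y,
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$.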
11
Example

The table shows the heights and weights of
n = 10 randomly selected college football
players.

Player       1    2    3    4    5    6    7    8    9   10
Height, x   73   71   75   72   72   75   67   69   71   69
Weight, y  185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of
squares.
12
Football Players
r = .8261: strong positive correlation. As a
player's height increases, so does his weight.
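
The value r = .8261 can be checked directly from the table of heights and weights. The following is a minimal Python sketch, assuming the sums-of-squares form r = S_xy / sqrt(S_xx S_yy), which is algebraically the same as s_xy / (s_x s_y). The function name `pearson_r` is illustrative and not part of the lecture.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / math.sqrt(s_xx * s_yy)

# Heights (x) and weights (y) of the 10 football players from the table
height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

r = pearson_r(height, weight)
print(round(r, 4))  # 0.8261
```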
13
Interpreting r

$-1 \le r \le 1$: the sign of r indicates the direction of the linear
relationship.
$r \approx 0$: no linear relationship; random scatter of points.
$r \approx 1$ or $-1$: strong relationship, either positive or negative.
$r = 1$ or $-1$: all points fall exactly on a straight line.
14
Example: Advertising and Sales

Often we want to investigate the form of the
relationship for the purpose of description,
control and prediction. For the advertising and
sales example, sales increase as advertising
expenditure increases. The relationship is
almost, but not exactly, linear.
Description: how sales depend on advertising
expenditure
Control: how much to spend on advertising to
reach certain sales goals
Prediction: how much in sales we expect if we
spend a certain amount of money on advertising

15
Linear Deterministic Model

Denote x as the independent variable and y as the
dependent variable. For example, x = age and y = % fat
in the first example; x = advertising expenditure,
y = sales in the second example.
We want to find how y depends on x, or how to
predict y using x.
One of the simplest deterministic mathematical
relationships between two variables x and y is a
linear relationship: $y = \alpha + \beta x$

$\alpha$ = intercept, $\beta$ = slope
16
Reality

In the real world, things are never so clean!
Age influences fatness, but it is not the sole
influence. There are other factors such as sex,
body type and random variation (e.g. measurement
error).
Other factors such as time of year, state of the
economy and size of inventory, besides
advertising expenditure, can influence sales.
Observations of (x, y) do not fall on a straight
line.

"As far as the laws of mathematics refer to
reality, they are not certain; and as far as they
are certain, they do not refer to reality." (Albert
Einstein)
17
Probabilistic Model

Probabilistic model:
y = deterministic model + random error
Random error represents random fluctuation from
the deterministic model.
The probabilistic model is assumed for the
population.
Simple linear regression model:
$y = \alpha + \beta x + \varepsilon$
Without the random deviation $\varepsilon$, all observed
points (x, y) would fall exactly on the
deterministic line. The inclusion of $\varepsilon$ in the
model equation allows points to deviate from the
line by random amounts.
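
To make the role of the random error concrete, the sketch below generates points from a line with and without an error term. The parameter values (intercept 3.0, slope 0.5, error standard deviation 5.0) and the x values are hypothetical, chosen only for illustration, and the normal error is an assumption made here for the simulation.

```python
import random

def simulate(xs, alpha, beta, sigma, seed=0):
    """Generate y = alpha + beta * x + e, with e drawn from Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

xs = [20, 30, 40, 50, 60]
alpha, beta = 3.0, 0.5  # hypothetical parameter values

on_line = simulate(xs, alpha, beta, sigma=0.0)  # no random error
noisy = simulate(xs, alpha, beta, sigma=5.0)    # with random error

for x, y_det, y_obs in zip(xs, on_line, noisy):
    print(x, y_det, round(y_obs, 2))
```

With sigma set to 0 every point lies exactly on the deterministic line; with a positive sigma the points scatter around it.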

18
Simple Linear Regression Model
19
Basic Assumptions of the Simple Linear Regression
Model

The distribution of $\varepsilon$ at any particular x value
has mean value 0.
The standard deviation of $\varepsilon$ is the same for any
particular value of x. This standard deviation is
denoted by $\sigma$.
The distribution of $\varepsilon$ at any particular x value
is normal.
The random errors are independent of one another.

20
Interpretation of Terms

The line $\alpha + \beta x$ describes the average value of y for
any fixed value of x.
The slope $\beta$ of the population regression line is
the average change in y associated with a 1-unit
increase in x.
The intercept $\alpha$ is the height of the population
line when x = 0.
The size of $\sigma$ determines the extent to which (x,
y) observations deviate from the population line.

21
The Distribution of y

For any fixed x, y has a normal distribution
with mean $\alpha + \beta x$ and standard deviation $\sigma$.

22
Data

So far we have described the population
probabilistic model.
Usually the three population parameters, $\alpha$, $\beta$ and $\sigma$,
are unknown. We need to estimate them from data.
Data: n pairs of observations of independent and
dependent variables
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Probabilistic model:
$y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, \ldots, n$
The $\varepsilon_i$ are independent normal with mean 0 and
standard deviation $\sigma$.

23
Steps in Regression Analysis
When you perform simple regression analysis, use
a step-by-step approach:
1. Fit the model to data: estimate parameters.
2. Use the analysis of variance F test (or t test) and r² to determine
how well the model fits the data.
3. Use diagnostic plots to check for violation of the
regression assumptions.
4. Proceed to estimate or predict the quantity of interest.
We now discuss statistical methods for each step
and use the fatness example to illustrate each
step.
24
The Method of Least Squares

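The equation of the best-fitting line is
calculated using n pairs of data $(x_i, y_i)$.

We choose our estimates $\hat{\alpha}$ and $\hat{\beta}$ of
$\alpha$ and $\beta$ so that the sum of the squared vertical
distances of the points from the line,
$$SSE = \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2,$$
is minimized.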

25
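Least Squares Estimators
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$.
The fitted line is $\hat{y} = \hat{\alpha} + \hat{\beta} x$.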
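
As an illustration, the least squares estimates can be computed for the advertising and sales table given earlier. This is a minimal sketch, assuming the standard closed-form solution (slope = S_xy / S_xx, intercept = y-bar minus slope times x-bar); the function name `least_squares` is hypothetical.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Advertising expenditure (x) and sales (y) for the 10 branches
advertising = [18, 7, 14, 31, 21, 5, 11, 16, 26, 29]
sales = [55, 17, 36, 85, 62, 18, 33, 41, 63, 87]

intercept, slope = least_squares(advertising, sales)
print(round(intercept, 3), round(slope, 3))
```

The slope is read in the units of the table: the estimated change in y for a one-unit increase in x.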
26
Example: Age and Fatness
The following data were collected in a study of
age and fatness in humans.
One of the questions was, "What is the
relationship between age and fatness?"
Mazess, R.B., Peppler, W.W., and Gibbons, M.
(1984) Total body composition by dual-photon
(153Gd) absorptiometry. American Journal of
Clinical Nutrition, 40, 834-839
27
Example: Age and Fatness
28
Example: Age and Fatness
29
The Analysis of Variance

The total variation in the experiment is measured
by the total sum of squares:
$$\text{Total SS} = S_{yy} = \sum (y_i - \bar{y})^2$$

The Total SS is divided into two parts:
SSR (sum of squares for regression) measures the
variation explained by including the independent
variable x in the model.
SSE (sum of squares for error) measures the
leftover variation not explained by x.

30
The ANOVA Table

Total df = n - 1
Regression df = 1
Error df = n - 1 - 1 = n - 2
Mean squares: MSR = SSR/1, MSE = SSE/(n - 2)

Source      df     SS        MS   F
Regression  1      SSR       MSR  MSR/MSE
Error       n - 2  SSE       MSE
Total       n - 1  Total SS
31
The F Test

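We can test the overall usefulness of the
linear model using an F test. If the model is
useful, MSR will be large compared to the
unexplained variation, MSE.
The test statistic is F = MSR/MSE, which has an F
distribution with 1 and n - 2 degrees of freedom
when the slope is zero.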

32
Coefficient of Determination

r² is the square of the correlation coefficient;
equivalently, r² = SSR / Total SS.
r² is a number between zero and one, and a value
close to zero suggests a poor model.
It gives the proportion of variation in y that
can be attributed to an approximate linear
relationship between x and y.
A very high value of r² can arise even though
the relationship between the two variables is
non-linear. The fit of a model should never
be judged from the r² value alone.
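
A small numerical illustration of the last point: the sketch below computes r² for data that follow an exact quadratic curve. The data (x = 1, ..., 10 and y = x squared) are made up for this purpose and do not come from the lecture.

```python
def r_squared(x, y):
    """Squared correlation coefficient, (S_xy)^2 / (S_xx * S_yy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy ** 2 / (s_xx * s_yy)

x = list(range(1, 11))
y = [xi ** 2 for xi in x]  # exactly quadratic, not linear

r2 = r_squared(x, y)
print(round(r2, 3))  # 0.95
```

Even though a straight line is the wrong form for these data, r² is about 0.95, which is why a scatterplot and residual plots are needed alongside r².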

33
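Estimate of $\sigma$
An estimator of the variance $\sigma^2$ is
$$s^2 = \text{MSE} = \frac{\text{SSE}}{n - 2}$$
Thus, an estimator of the standard deviation $\sigma$ is
$$s = \sqrt{\frac{\text{SSE}}{n - 2}}$$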
34
Example: Age and Fatness
35
Example: Age and Fatness
ANOVA Table
Source      df  SS       MS      F
Regression   1   891.87  891.87  26.94
Error       16   529.71   33.11
Total       17  1421.58

With r² = 0.627 or 62.7%, we can say that 62.7% of
the observed variation in % Fat can be attributed
to the probabilistic linear relationship with
human age.
The magnitude of a typical sample deviation from
the least squares line is about 5.75 (%), which is
reasonably large compared to the y values
themselves.
This suggests that the model is only useful
in the sense of providing gross ballpark estimates
of % Fat for humans based on age.
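
The quantities quoted in this example follow from the ANOVA table by simple arithmetic. The sketch below takes Total SS = 1421.58, SSE = 529.71 and the degrees of freedom from the table, and obtains SSR as the difference Total SS - SSE; variable names are illustrative.

```python
import math

n = 18              # Total df = n - 1 = 17 in the table
total_ss = 1421.58  # Total SS from the table
sse = 529.71        # SSE from the table

ssr = total_ss - sse         # Total SS = SSR + SSE
msr = ssr / 1                # regression df = 1
mse = sse / (n - 2)          # error df = n - 2 = 16
f_stat = msr / mse
r_squared = ssr / total_ss
s = math.sqrt(mse)           # estimate of the error standard deviation

print(round(mse, 2), round(f_stat, 2), round(r_squared, 3), round(s, 2))
# 33.11 26.94 0.627 5.75
```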

36
Inference Concerning the Slope $\beta$

Do the data present sufficient evidence to
indicate that y increases (or decreases) linearly
as x increases?
Is the independent variable x useful in
predicting y?
A "no" answer to the above questions means that y does
not change, regardless of the value of x. This
implies that the slope of the line, $\beta$, is zero.

37
Sampling Distribution

When the four basic assumptions of the simple
linear regression model are satisfied, the
following are true:
The mean value of $\hat{\beta}$ is $\beta$. That is, $\hat{\beta}$ is
unbiased.
The standard deviation of the statistic $\hat{\beta}$ is
$\sigma_{\hat{\beta}} = \sigma / \sqrt{S_{xx}}$.
$\hat{\beta}$ has a normal distribution (a consequence
of the error $\varepsilon$ being normally distributed).
The probability distribution of the standardized
variable
$$t = \frac{\hat{\beta} - \beta}{s / \sqrt{S_{xx}}}$$
is the t distribution with df = n - 2.

38
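Confidence Interval for $\beta$
When the four basic assumptions of the simple
linear regression model are satisfied, a
$(1-\alpha)100\%$ confidence interval for $\beta$ is
$$\hat{\beta} \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}}$$
where the t critical value $t_{\alpha/2}$ is based on df = n - 2
(here $\alpha$ is the error rate of the interval, not the intercept).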
39
Example: Age and Fatness
Based on sample data, % Fat increases by 0.55 on
average with each year of age, and we are 95%
confident that the true increase per year is
between 0.33 and 0.77.
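
The same kind of interval can be worked out for the advertising and sales table, whose data are given in full earlier. This is a sketch under the usual formula, slope plus or minus t times s / sqrt(S_xx); the critical value 2.306 is the t table entry for 95% confidence with df = n - 2 = 8, hard-coded here rather than computed.

```python
import math

advertising = [18, 7, 14, 31, 21, 5, 11, 16, 26, 29]
sales = [55, 17, 36, 85, 62, 18, 33, 41, 63, 87]

n = len(advertising)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n
s_xx = sum((x - x_bar) ** 2 for x in advertising)
s_yy = sum((y - y_bar) ** 2 for y in sales)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))

slope = s_xy / s_xx
sse = s_yy - s_xy ** 2 / s_xx    # SSE = Total SS - SSR
s = math.sqrt(sse / (n - 2))     # estimate of the error standard deviation

t_crit = 2.306                   # t table value, 95% confidence, df = 8
half_width = t_crit * s / math.sqrt(s_xx)
lower, upper = slope - half_width, slope + half_width

print(round(lower, 2), round(upper, 2))
```

Because the whole interval lies above zero, it agrees with a two-sided test that rejects a zero slope at the 5% level.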
40
Hypothesis Tests Concerning $\beta$

Step 1: Specify the null and alternative
hypotheses
$H_0: \beta = \beta_0$ versus $H_a: \beta \ne \beta_0$ (two-sided test)
$H_0: \beta = \beta_0$ versus $H_a: \beta > \beta_0$ (one-sided test)
$H_0: \beta = \beta_0$ versus $H_a: \beta < \beta_0$ (one-sided test)
Step 2: Test statistic
$$t = \frac{\hat{\beta} - \beta_0}{s / \sqrt{S_{xx}}}$$
Step 3: When the four basic assumptions of the simple
linear regression model are satisfied, under $H_0$
the test statistic t has a Student's t
distribution with n - 2 degrees of freedom.
41
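Hypothesis Tests Concerning $\beta$

Step 4: Find the p-value. Compute the
sample statistic
$$t^* = \frac{\hat{\beta} - \beta_0}{s / \sqrt{S_{xx}}}$$
$H_a: \beta \ne \beta_0$ (two-sided test): p-value = $2P(t > |t^*|)$
$H_a: \beta > \beta_0$ (one-sided test): p-value = $P(t > t^*)$
$H_a: \beta < \beta_0$ (one-sided test): p-value = $P(t < t^*)$

$P(t > |t^*|)$, $P(t > t^*)$ and $P(t < t^*)$ can be found from
the t table.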
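
As a sketch of these steps, the code below tests whether the slope is zero for the advertising and sales table given earlier. It compares the statistic with the two-sided 5% critical value from the t table (2.306 for df = 8) instead of computing an exact p-value; the function name `slope_t_statistic` is hypothetical.

```python
import math

def slope_t_statistic(x, y, beta0=0.0):
    """t statistic for H0: slope = beta0 in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    sse = s_yy - s_xy ** 2 / s_xx      # SSE = Total SS - SSR
    s = math.sqrt(sse / (n - 2))       # estimate of the error standard deviation
    return (slope - beta0) / (s / math.sqrt(s_xx))

advertising = [18, 7, 14, 31, 21, 5, 11, 16, 26, 29]
sales = [55, 17, 36, 85, 62, 18, 33, 41, 63, 87]

t_stat = slope_t_statistic(advertising, sales)
t_crit = 2.306                         # t table value, two-sided 5%, df = 8
reject_h0 = abs(t_stat) > t_crit

print(round(t_stat, 2), reject_h0)
```

For a test of zero slope, the square of this t statistic equals the ANOVA F statistic MSR/MSE, so the two tests lead to the same conclusion.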
42
Example: Age and Fatness
43
Interpreting a Significant Regression

Even if you do not reject the null hypothesis
that the slope of the line equals 0, it does not
necessarily mean that y and x are unrelated.
It may happen that y and x are perfectly related
in a nonlinear way.
Causality: Do not conclude that x causes y. There
may be an unknown variable at work!

44
Checking the Regression Assumptions
Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.

The relationship between x and y is linear, given
by $y = \alpha + \beta x + \varepsilon$.
The random error terms $\varepsilon$ are independent and, for
any value of x, have a normal distribution with
mean 0 and variance $\sigma^2$.

45
Residuals

The residual corresponding to $(x_i, y_i)$ is
$$e_i = y_i - \hat{y}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i)$$
The residual is the leftover variation in each
data point after the variation explained by the
regression model has been removed.
Residuals reflect random errors, thus we can use
them to check for violations of the assumptions about
random errors.
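
For a concrete picture, the sketch below fits the least squares line to the football players' heights and weights from the earlier table and lists the residuals (observed weight minus fitted weight). Variable names are illustrative.

```python
height = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
weight = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]

n = len(height)
x_bar = sum(height) / n
y_bar = sum(weight) / n
s_xx = sum((x - x_bar) ** 2 for x in height)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residual = observed y minus fitted y for each data point
fitted = [intercept + slope * x for x in height]
residuals = [y - y_hat for y, y_hat in zip(weight, fitted)]

for player, e in enumerate(residuals, start=1):
    print(player, round(e, 2))

# Index (0-based) of the observation farthest from the fitted line
largest = max(range(n), key=lambda i: abs(residuals[i]))
```

Least squares residuals always sum to zero (up to rounding), so it is their pattern and spread, not their average, that is informative. Here player 4 (72 inches, 210 pounds) sits farthest from the line.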

46
Diagnostic Tools: Residual Plots
We can check the normality and equal variance
assumptions using

1. Normal probability plot of residuals
2. Plot of residuals versus fit or residuals
versus variables

47
Normal Probability Plot

If the normality assumption is valid, the plot
should resemble a straight line, sloping upward
to the right.
If not, you will often see the pattern fail in
the tails of the graph.

48
Residuals versus Fits

If the equal variance assumption is valid, the
plot should appear as a random scatter around the
zero center line.
If not, you will see a pattern in the residuals.

49
Estimation and Prediction

Once we have
determined that the regression line is useful, and
used the diagnostic plots to check for violation
of the regression assumptions,
we are ready to use the regression line to

Estimate the average value of y for a given value
of x
Predict a particular value of y for a given value
of x.

50
Estimation and Prediction
Predicting a particular value of y when $x = x_0$
Estimating the average value of y when $x = x_0$
51
Estimation and Prediction

The best estimate of the average value of y and the best
prediction of y for a given value $x = x_0$ is
$$\hat{y} = \hat{\alpha} + \hat{\beta} x_0$$
The prediction of y is more difficult, requiring
a wider range of values in the prediction
interval.

52
Estimation and Prediction
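
The difference between the two intervals can be sketched numerically. The code below uses the standard textbook formulas, which are assumed here since the slide's own formulas did not survive in this transcript: the confidence interval for the average of y at x0 has half-width t * s * sqrt(1/n + (x0 - x-bar)^2 / S_xx), and the prediction interval adds 1 under the square root. It uses the advertising and sales table, a hypothetical value x0 = 20, and the t table value 2.306 (95%, df = 8).

```python
import math

advertising = [18, 7, 14, 31, 21, 5, 11, 16, 26, 29]
sales = [55, 17, 36, 85, 62, 18, 33, 41, 63, 87]

n = len(advertising)
x_bar = sum(advertising) / n
y_bar = sum(sales) / n
s_xx = sum((x - x_bar) ** 2 for x in advertising)
s_yy = sum((y - y_bar) ** 2 for y in sales)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

x0 = 20          # hypothetical advertising level, in the table's units
t_crit = 2.306   # t table value, 95% confidence, df = 8

y_hat = intercept + slope * x0
leverage = 1 / n + (x0 - x_bar) ** 2 / s_xx
ci_half = t_crit * s * math.sqrt(leverage)       # average of y at x0
pi_half = t_crit * s * math.sqrt(1 + leverage)   # one new observation at x0

print(round(y_hat, 2), round(ci_half, 2), round(pi_half, 2))
```

Both intervals are centered at the same fitted value; the prediction interval is wider because it also accounts for the random error of a single new observation.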
53
Example: Age and Fatness
If $x_0 = 45$ is substituted for x in the fitted
line, we have both an estimated average
% Fat for 45-year-old humans and a predicted % Fat
for a 45-year-old human.
The two interpretations are quite different.
54
Example: Age and Fatness
55
Steps in Regression Analysis
When you perform simple regression analysis, use
a step-by-step approach:
1. Fit the model to data: estimate parameters.
2. Use the analysis of variance F test (or t test) and r² to determine
how well the model fits the data.
3. Use diagnostic plots to check for violation of the
regression assumptions.
4. Proceed to estimate or predict the quantity of interest.
56
Minitab Output
57
Key Concepts

I. Scatter plot
Pattern and unusual observations
II. Correlation coefficient: a numerical measure
of the strength and direction of the linear relationship.
Interpretation of r

58
Key Concepts

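III. A Linear Probabilistic Model
1. When the data exhibit a linear relationship,
the appropriate model is $y = \alpha + \beta x + \varepsilon$.
2. The random error $\varepsilon$ has a normal distribution
with mean 0 and variance $\sigma^2$.
IV. Method of Least Squares
1. Estimates $\hat{\alpha}$ and $\hat{\beta}$, for $\alpha$ and $\beta$, are
chosen to minimize SSE, the sum of the squared
deviations about the regression line,
$SSE = \sum (y_i - \hat{y}_i)^2$.
2. The least squares estimates are
$\hat{\beta} = S_{xy} / S_{xx}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$.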

59
Key Concepts

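V. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = $S_{yy}$
and SSR = $(S_{xy})^2 / S_{xx}$.
2. The best estimate of $\sigma^2$ is MSE = SSE / (n - 2).
VI. Testing, Estimation, and Prediction
1. A test for the significance of the linear
regression, $H_0: \beta = 0$, can be implemented using the
statistic
$$t = \frac{\hat{\beta}}{s / \sqrt{S_{xx}}} \quad \text{or, equivalently,} \quad F = \frac{\text{MSR}}{\text{MSE}}$$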

60
Key Concepts

2. The strength of the relationship between x and
y can be measured using
$$r^2 = \frac{\text{SSR}}{\text{Total SS}}$$
which gets closer to 1 as the relationship gets
stronger.
3. Use residual plots to check for nonnormality,
inequality of variances, and an incorrectly fit
model.
4. Confidence intervals can be constructed to
estimate the slope $\beta$ of the regression line and
to estimate the average value of y, for a given
value of x.
5. Prediction intervals can be constructed to
predict a particular observation, y, for a given
value of x. For a given x, prediction intervals
are always wider than confidence intervals.