Title: Statistics and Data Analysis
1Statistics and Data Analysis
- Professor William Greene
- Stern School of Business
- IOMS Department
- Department of Economics
2Statistics and Data Analysis
Part 15 Regression Models
3Linear Regression Models
- Analyzing residuals
- Violations of assumptions
- Unusual data points
- Hints for improving the model
- Model building
- Linear models: cost functions
- Semilog models: growth models
- Logs and elasticities
4An Enduring Art Mystery
Graphics show relative sizes of the two works.
The Persistence of Statistics. Hildebrand, Ott
and Gray, 2005
Why do larger paintings command higher prices?
The Persistence of Memory. Salvador Dali, 1931
5The Data
6Monet in Large and Small
Sale prices of 328 signed Monet paintings
The residuals do not show any obvious patterns
that seem inconsistent with the assumptions of
the model.
$\log(\text{Price}) = a + b \log(\text{Surface area}) + e$
7Monet Regression
8Using the Residuals
- How do you know the model is good?
- Various diagnostics to be developed over the semester.
- But the first place to look is at the residuals.
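As a rough illustration of "look at the residuals first", the sketch below fits a log-log regression by least squares and computes the residuals. The data are synthetic stand-ins (not the Monet sample), and the helper name `ols` and the simulation parameters are assumptions made for this example only.

```python
import random

def ols(x, y):
    """Least squares intercept a and slope b for y = a + b*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Synthetic stand-in for auction data: hypothetical values, NOT the Monet sample.
random.seed(0)
log_area = [random.uniform(5.0, 8.0) for _ in range(200)]
log_price = [-8.0 + 1.3 * x + random.gauss(0.0, 1.0) for x in log_area]

a, b = ols(log_area, log_price)
residuals = [y - (a + b * x) for x, y in zip(log_area, log_price)]

# With an intercept, least squares residuals average zero and are
# uncorrelated with the regressor by construction.
mean_resid = sum(residuals) / len(residuals)
cross = sum(x * e for x, e in zip(log_area, residuals))
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"mean residual = {mean_resid:.2e}, sum(x*e) = {cross:.2e}")
```

Because the mean and the linear association are removed by construction, what remains to inspect in a residual plot is curvature, changing spread, and isolated large values.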
9Residuals Can Signal a Flawed Model
- Standard application: cost function for output of a production process.
- Compare linear equation to a quadratic model (in logs).
- (124 American electric utilities)
10Electricity Cost Function
11Candidate Model for Cost
$\log c = a + b \log q + e$
[Figure annotations: in two areas of the plot, most of the points are above the regression line; in a third area, most of the points are below the regression line.]
12A Missing Variable?
Residuals from the (log)linear cost model
13A Better Model?
$\log(\text{Cost}) = \alpha + \beta_1 \log(\text{Output}) + \beta_2 [\log(\text{Output})]^2 + \varepsilon$
14Candidate Models for Cost
The quadratic equation is the appropriate model.
$\log c = a + b_1 \log q + b_2 \log^2 q + e$
15Missing Variable Included
Residuals from the quadratic cost model
Residuals from the linear cost model
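The comparison between the linear and quadratic cost models can be sketched on simulated data. Everything below is an assumption for illustration: the data are generated from a made-up quadratic-in-logs relationship (not the utility sample), and `solve`, `least_squares`, and `thirds_means` are hypothetical helper names. The point is only that omitting the squared term leaves a systematic sign pattern in the residuals.

```python
import random

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z

def least_squares(X, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def residuals(X, y, coef):
    return [yi - sum(c * v for c, v in zip(coef, row)) for row, yi in zip(X, y)]

def thirds_means(e):
    """Mean residual in the low, middle, and high thirds (data sorted by x)."""
    k = len(e) // 3
    return [sum(part) / len(part) for part in (e[:k], e[k:2 * k], e[2 * k:])]

# Simulated data, sorted by log output; parameters are made up.
random.seed(1)
log_q = sorted(random.uniform(0.0, 10.0) for _ in range(124))
log_c = [1.0 + 0.2 * x + 0.05 * x * x + random.gauss(0.0, 0.15) for x in log_q]

X_lin = [[1.0, x] for x in log_q]
X_quad = [[1.0, x, x * x] for x in log_q]
e_lin = residuals(X_lin, log_c, least_squares(X_lin, log_c))
e_quad = residuals(X_quad, log_c, least_squares(X_quad, log_c))

print("linear   :", [round(m, 3) for m in thirds_means(e_lin)])
print("quadratic:", [round(m, 3) for m in thirds_means(e_quad)])
```

In this simulation the linear model's residuals are mostly positive at both ends and mostly negative in the middle, while the quadratic model's residuals average near zero in every third.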
16Heteroscedasticity
- Hetero: differences
- Scedastic: function, variation around the mean
- Arises when y is proportional to x
- Arises sometimes when there are natural, heterogeneous groups
17Heteroscedasticity
Residuals from a regression of salaries on years
of experience.
Standard deviation of the residuals seems not to
be constant.
18Problem with the Model?
This usually suggests the model should be defined
in terms of logs of the variable.
19Sometimes Heteroscedasticity Can Be Cured By
Taking Logs
Residuals from a regression of logs of salaries on years of experience.
$\text{Salary} = \alpha e^{\beta t} e^{\varepsilon}$
We will explore this model below.
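A small simulation may help show why taking logs can remove the heteroscedasticity. The sketch assumes salaries are generated by the multiplicative model, with the intercept 9.84 and slope 0.05 borrowed from the semilog salary equation quoted later in the deck; the disturbance standard deviation (0.25), the sample size, and the helper names are assumptions made for this example.

```python
import math
import random

def ols(x, y):
    """Least squares intercept a and slope b for y = a + b*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def spread_ratio(x, y):
    """Residual SD in the high-x half divided by residual SD in the low-x half.

    Assumes the observations are already sorted by x.
    """
    a, b = ols(x, y)
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    half = len(e) // 2
    return sd(e[half:]) / sd(e[:half])

# Simulated salaries: Salary = exp(9.84) * exp(0.05 t) * exp(eps).
# The disturbance SD of 0.25 is an assumption for this sketch.
random.seed(2)
years = sorted(random.uniform(0.0, 30.0) for _ in range(400))
salary = [math.exp(9.84 + 0.05 * t + random.gauss(0.0, 0.25)) for t in years]

ratio_levels = spread_ratio(years, salary)
ratio_logs = spread_ratio(years, [math.log(s) for s in salary])
print(f"levels: {ratio_levels:.2f}, logs: {ratio_logs:.2f}")
```

A ratio well above 1 for the regression in levels means the residual spread grows with experience; a ratio near 1 for the regression in logs means the spread is roughly constant.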
20Should I Worry About Heteroscedasticity?
- Not a problem for using least squares to estimate $\alpha$ or $\beta$.
- But there is a better method than least squares.
- Assessment of the uncertainty of the least squares estimates may be too optimistic.
- (Not contagious)
21Unusual Data Points
Outliers have (what appear to be) very large disturbances, $\varepsilon$.
[Figures: wolf weight vs. tail length; the 500 most successful movies]
22Outliers (?)
Remember the empirical rule: about 99.7% of observations will lie within the mean ± 3 standard deviations. (We show $(a + bx) \pm 3 s_e$ below.)
Titanic is 8.1 standard deviations from the regression! Only 0.86% of the 466 observations lie outside the bounds. (We will refine this later.)
These observations might deserve a close look.
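The band of three residual standard deviations around the fitted line can be sketched as follows. The data are simulated with one deliberately planted unusual point (index 10); nothing here comes from the movie sample, and the helper name `ols` is an assumption for this example.

```python
import math
import random

def ols(x, y):
    """Least squares intercept a and slope b for y = a + b*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Simulated data with one planted unusual observation (hypothetical values).
random.seed(3)
x = [random.uniform(0.0, 10.0) for _ in range(100)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
y[10] += 9.0

a, b = ols(x, y)
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Residual standard deviation with n - 2 degrees of freedom.
s_e = math.sqrt(sum(ei * ei for ei in e) / (len(e) - 2))

# Flag points outside (a + b*x) +/- 3*s_e.
flagged = [i for i, ei in enumerate(e) if abs(ei) > 3.0 * s_e]
print("s_e =", round(s_e, 3), "flagged indices:", flagged)
```

Note that the planted point inflates `s_e` itself, which widens the band; this is one reason a flag is a prompt to look at the observation rather than a verdict.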
23Prices paid at auction for Monet paintings vs.
surface area (in logs)
$\log(\text{Price}) = a + b \log(\text{Area}) + e$
Not an outlier: Monet chose to paint a small painting.
Possibly an outlier: why was the price so low?
24What to Do About Outliers
- (1) Examine the data.
- (2) Are they due to measurement error or obvious coding errors? Delete the observations.
- (3) Are they just unusual observations? Do nothing.
- (4) Generally, resist the temptation to remove outliers, especially if the sample is large. (500 movies is large. 10 wolves is not.)
- (5) Question why you think it is an outlier. Is it really?
25Regression Options
26Minitab's Opinions
Minitab uses ±2S to flag large residuals.
27On Removing Outliers
- Be careful about singling out particular observations this way.
- The resulting model might be a product of your opinions.
- Removing outliers might create new outliers that were not outliers before.
- Statistical inferences from the model will be incorrect.
28Mechanically Remove Outliers?
29Removing Outliers Creates Outliers
Were they really outliers?
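The claim that removing outliers creates new ones can be checked with a simulation in which, by construction, there are no true outliers at all. The sketch below applies a mechanical "drop anything beyond 2 residual standard deviations, then refit" rule several times. The data, the cutoff of 2, the number of rounds, and the helper names are all assumptions for this example.

```python
import math
import random

def ols(x, y):
    """Least squares intercept a and slope b for y = a + b*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def trim_once(x, y, k=2.0):
    """Fit, then drop points whose residual exceeds k residual SDs."""
    a, b = ols(x, y)
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(ei * ei for ei in e) / (len(e) - 2))
    keep = [i for i, ei in enumerate(e) if abs(ei) <= k * s]
    flagged = len(e) - len(keep)
    return [x[i] for i in keep], [y[i] for i in keep], flagged

# Well-behaved simulated data: normal disturbances, no contamination.
random.seed(4)
x = [random.uniform(0.0, 10.0) for _ in range(500)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

flag_counts = []
for _ in range(4):
    x, y, flagged = trim_once(x, y)
    flag_counts.append(flagged)

print("flagged in each round:", flag_counts)
print("observations left:", len(x))
```

Each round shrinks the residual standard deviation, so points that were inside the band on the previous fit fall outside it on the next one, even though the underlying data were never contaminated.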
30Normal Distribution of $e_i$?
31Probability Plot
Graph -> Probability Plots
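For readers without Minitab, the idea behind a normal probability plot can be sketched directly: sort the residuals and pair them with the standard normal quantiles they would be expected to match. The plotting positions `(i - 0.5) / n` are one common convention chosen for this sketch and are not necessarily what Minitab uses; the two samples are simulated, and the helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist

def probability_plot_points(sample):
    """Pairs (theoretical normal quantile, ordered observation)."""
    n = len(sample)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(sample)))

def correlation(pairs):
    """Pearson correlation of the plotted points (near 1 means near a line)."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)

# Two simulated sets of "residuals": one normal, one strongly skewed.
random.seed(5)
normal_resid = [random.gauss(0.0, 1.0) for _ in range(200)]
skewed_resid = [random.expovariate(1.0) - 1.0 for _ in range(200)]

r_normal = correlation(probability_plot_points(normal_resid))
r_skewed = correlation(probability_plot_points(skewed_resid))
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

If the residuals are approximately normal, the plotted points fall close to a straight line and the correlation is close to 1; skewness shows up as a bend in the plot and a lower correlation.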
32Using and Interpreting the Model
- Interpreting the linear model
- Semilog and growth models
- Log-log model and elasticities
33Statistical Cost Analysis
The units of the LHS and RHS must be the same.
Cost ($M) = a + b × MKWH
- Y = cost in $M, so a is also in $M: a = 2.444 $M
- b is in $M/MKWH: b = 0.005291 $M/MKWH
- So a = fixed cost = total cost if MKWH = 0
- b = marginal cost = dCost/dMKWH
- b × MKWH = variable cost
Generation cost ($M) and output (millions of KWH) for 124 American electric utilities (1970).
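The interpretation of the two coefficients can be made concrete with a few lines of arithmetic. The coefficients 2.444 and 0.005291 are the values quoted on the slide; the output level of 1,000 MKWH is a hypothetical figure chosen only for illustration, and reading "M" as millions of dollars is an assumption.

```python
A_FIXED = 2.444        # a: fixed cost, assumed to be in millions of dollars
B_MARGINAL = 0.005291  # b: marginal cost, millions of dollars per MKWH

def variable_cost(mkwh):
    """Variable cost b * MKWH."""
    return B_MARGINAL * mkwh

def total_cost(mkwh):
    """Fitted total cost a + b * MKWH."""
    return A_FIXED + variable_cost(mkwh)

output = 1000.0  # hypothetical output level in MKWH

print("fixed cost    :", total_cost(0.0))
print("variable cost :", round(variable_cost(output), 3))
print("total cost    :", round(total_cost(output), 3))
# One more MKWH raises fitted cost by b, whatever the starting output.
print("marginal cost :", round(total_cost(output + 1.0) - total_cost(output), 6))
```

Because the model is linear in output, the marginal cost is the same at every output level; a quadratic model would let it vary.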
34Semilog Models and Growth Rates
$\log(\text{Salary}) = 9.84 + 0.05\,\text{Years} + e$
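To see why the slope in a semilog model is read as a growth rate, compare fitted salaries one year apart. The short derivation below assumes natural logarithms and ignores the disturbance $e$; the subscript $t$ for years of experience is notation introduced here.

```latex
\begin{align*}
\log(\text{Salary}_t) &= a + b\,t, \qquad a = 9.84,\; b = 0.05 \\
\frac{\text{Salary}_{t+1}}{\text{Salary}_t}
  &= \frac{e^{a + b(t+1)}}{e^{a + bt}} = e^{b} \\
\text{growth per year} &= e^{b} - 1 = e^{0.05} - 1 \approx 0.0513
\end{align*}
```

For a small slope, $e^{b} - 1 \approx b$, so each additional year of experience is associated with roughly a 5% higher salary.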
35Growth in a Semilog Model
36Using Semilog Models for Trends
Frequent Flyer Flights for 72 Months. (Text, Ex.
11.1, p. 508)
37Regression Approach
- $\log(\text{Flights}) = \alpha + \beta\,\text{Months} + \varepsilon$
- $a = 2.770$, $b = 0.03710$, $s = 0.06102$
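The fitted trend can be turned into growth rates and a fitted path with a few lines of code. The estimates 2.770 and 0.03710 come from the slide; the sketch assumes natural logarithms and simply exponentiates the fitted log value, ignoring the disturbance. The function name `fitted_flights` is hypothetical.

```python
import math

A = 2.770    # intercept estimate from the slide
B = 0.03710  # slope estimate: change in log(Flights) per month

def fitted_flights(month):
    """Exponentiated fitted value exp(a + b * month); ignores the disturbance."""
    return math.exp(A + B * month)

# Growth implied by the trend, assuming natural logs.
monthly_growth = math.exp(B) - 1.0
annual_growth = math.exp(12.0 * B) - 1.0

print(f"monthly growth: {100 * monthly_growth:.2f}%")
print(f"annual growth : {100 * annual_growth:.1f}%")
print(f"fitted value at month 72: {fitted_flights(72):.1f}")
```

Compounding matters here: twelve months of growth at the monthly rate gives considerably more than twelve times the monthly rate.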
38Elasticity and Loglinear Models
- $\log y = \alpha + \beta \log x + \varepsilon$
- Elasticity: the responsiveness of one variable to changes in another
- E.g., in economics, demand elasticity = $(\%\Delta Q) / (\%\Delta P)$
- Math: ratio of percentage changes
- $\%\Delta Q / \%\Delta P = [100(\Delta Q)/Q] / [100(\Delta P)/P]$
- Units of measurement and the 100 fall out of this equation.
- Elasticity = $(\Delta Q/\Delta P)(P/Q)$
- Elasticities are unit-free
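Two of the points above, that the elasticity in a log-log model is the slope and that it does not depend on units, can be checked numerically. The demand curve below is entirely hypothetical (a constant-elasticity curve with exponent -1.2), and the function names are assumptions for this sketch.

```python
BETA = -1.2  # hypothetical exponent: log Q = log(500) + BETA * log P

def demand_dollars(price):
    """Quantity in units as a function of price in dollars (made-up curve)."""
    return 500.0 * price ** BETA

def demand_cents(price_cents):
    """Same curve with price in cents and quantity in thousands of units."""
    return 0.001 * demand_dollars(price_cents / 100.0)

def elasticity(demand, price, dp):
    """(dQ/dP) * (P/Q), with dQ/dP approximated by a small finite difference."""
    q = demand(price)
    dq = demand(price + dp) - q
    return (dq / dp) * (price / q)

e_dollars = elasticity(demand_dollars, 4.0, dp=1e-6)
e_cents = elasticity(demand_cents, 400.0, dp=1e-4)

print(f"elasticity, dollars and units   : {e_dollars:.4f}")
print(f"elasticity, cents and thousands : {e_cents:.4f}")
```

Both calculations recover the exponent of the curve, which is the slope in the log-log form, and rescaling price or quantity leaves the result unchanged.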
39Monet Regression
40Summary
- Residual analysis
- Consistent with model assumptions?
- Suggest missing elements in the model
- Building the regression model
- Interpreting the model: cost function
- Growth model: semilog
- Double log and estimating elasticities