Title: Statistics and Data Analysis
1Statistics and Data Analysis
- Professor William Greene
- Stern School of Business
- IOMS Department
- Department of Economics
2Statistics and Data Analysis
Part 19 Multiple Regression 3
3Multiple Regression Modeling
1/57
- Data Preparation
- Examining the data
- Transformations
- Scaling
- Analysis of the Regression
- Residuals and outliers
- Influential data points
- The fit of the regression
- R squared and adjusted R squared
- Analysis of variance
- Individual coefficient estimates and t statistics
- Testing for significance of a set of coefficients
- Prediction
4Data Preparation
2/57
- Get rid of observations with missing values.
- Small numbers of missing values: delete observations.
- Large numbers of missing values: may need to give up on certain variables.
- There are theories and methods for filling missing values. (Advanced techniques. Usually not useful or appropriate for real world work.)
- Be sure that missingness is not directly related to the values of the dependent variable. E.g., a regression that follows systematically removing high values of Y is likely to be biased if you then try to use the results to describe the entire population. (E.g., any sample related to income or consumption that is drawn at an airport is likely to be biased vis-a-vis the entire population.)
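A minimal sketch of deleting incomplete observations, assuming the data are held as tuples (y, x1, x2) with None marking a missing value. The records are made up for illustration. Comparing the mean of Y in the kept and dropped rows is only a crude check on whether missingness is related to Y.

```python
# Hypothetical records: (y, x1, x2); None marks a missing value.
rows = [
    (10.0, 1.0, 5.0),
    (12.0, None, 6.0),
    (11.0, 2.0, None),
    (15.0, 3.0, 7.0),
    (14.0, 2.5, 6.5),
]


def is_complete(row):
    """True if no value in the observation is missing."""
    return all(value is not None for value in row)


def mean(values):
    return sum(values) / len(values)


# Keep only complete observations; set the rest aside for inspection.
complete = [row for row in rows if is_complete(row)]
dropped = [row for row in rows if not is_complete(row)]

# If these two means differ a lot, missingness may be related to Y.
mean_y_kept = mean([row[0] for row in complete])
mean_y_dropped = mean([row[0] for row in dropped])
print(len(complete), len(dropped), mean_y_kept, mean_y_dropped)
```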
5Transform the Data?
3/57
- Just because a variable is skewed does not mean you should take logs. Take logs if the model you are fitting calls for taking logs. More later.
- Scaling? E.g., per capita data. Scale by assets, number of shares, or sales if it is appropriate in the context of the model (the study). More later.
- Do not transform variables without a good reason to do so. Skewness, by itself, is not a good reason. Don't scale variables because the values are large.
- Transform data appropriately for the study you are doing (i.e., for the story you are trying to tell your reader).
6Using Logs
4/57
- Generally, use logs for size variables.
- Use logs if you are seeking to estimate elasticities.
- Use logs if your data span a very large range of values and the independent variables do not (a modeling issue: some art mixed in with the science).
- If the data contain 0s or negative values, then logs will be inappropriate for the study. Do not use ad hoc fixes like adding something to Y so it will be positive.
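A small sketch of why logs are used to estimate elasticities: in a regression of log y on log x, the slope is the elasticity. The data are synthetic and assumed to follow y = 3x^0.5 exactly, and the helper function is written here for illustration only.

```python
import math


def ols_slope_intercept(x, y):
    """Least squares intercept and slope for a one-predictor regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Synthetic data with a constant elasticity of 0.5 (an assumption).
x = [1.0, 2.0, 5.0, 10.0, 50.0, 100.0]
y = [3.0 * xi ** 0.5 for xi in x]

# Regress log y on log x; the slope is the elasticity of y with respect to x.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
intercept, elasticity = ols_slope_intercept(log_x, log_y)
print(elasticity, math.exp(intercept))
```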
7More on Using Logs
5/57
- Generally only for continuous variables like income, or variables that are essentially continuous.
- Not for discrete variables like binary variables or qualitative variables (e.g., stress level 1, 2, 3, 4, 5).
- Generally be consistent in the equation: don't mix logs and levels.
- Generally DO NOT take the log of time (t) in a model with a time trend. TIME is discrete and not a measure.
8Residuals
8/57
- Residual: the difference between the actual value of Y and the value predicted by the regression.
- E.g., Switzerland
- Estimated equation is DALE = 36.900 + 2.9787 EDUC + .004601 PCHexp
- Swiss values are EDUC = 9.418360, PCHexp = 2646.442
- Regression prediction = 77.1307
- Actual Swiss DALE = 72.71622
- Residual = 72.71622 - 77.1307 = -4.41448
- The regression overpredicts Switzerland
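The Swiss calculation can be checked with a few lines of arithmetic. The coefficients and data values are the ones on the slide; the variable names are chosen here for illustration.

```python
# Estimated coefficients from the slide.
a, b_educ, b_pchexp = 36.900, 2.9787, 0.004601

# Swiss values from the slide.
educ, pchexp = 9.418360, 2646.442
actual_dale = 72.71622

# Prediction from the fitted equation, then residual = actual - predicted.
predicted_dale = a + b_educ * educ + b_pchexp * pchexp
residual = actual_dale - predicted_dale

# A negative residual means the regression overpredicts.
print(round(predicted_dale, 4), round(residual, 4))
```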
9Using Residuals
9/57
- As indicators of bad data
- As indicators of observations that deserve attention
- As a diagnostic tool to evaluate the regression model
10Outliers
13/57
- A residual is $e_i = y_i - a - b_1 x_{i1} - \dots$
- The standard deviation of the residuals is $s_e = \sqrt{\sum_i e_i^2 / (n - K - 1)}$.
- Standardized residuals are $e_i / s_e$.
- Large residuals have $|e_i / s_e| > 2$.
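A sketch of flagging large standardized residuals, assuming the residual standard deviation uses n - K - 1 degrees of freedom as in standard regression output. The residuals are made up for illustration and sum to zero, as least squares residuals do.

```python
import math


def standardized_residuals(residuals, num_predictors):
    """Return (s_e, list of e_i / s_e) using n - K - 1 degrees of freedom."""
    n = len(residuals)
    degrees_of_freedom = n - num_predictors - 1
    s_e = math.sqrt(sum(e * e for e in residuals) / degrees_of_freedom)
    return s_e, [e / s_e for e in residuals]


# Hypothetical residuals from a regression with one predictor.
residuals = [0.5, -0.3, 0.2, -0.4, 3.0, -0.6, 0.1, -0.2, -1.0, -1.3]
s_e, z = standardized_residuals(residuals, num_predictors=1)

# Flag observations whose standardized residual exceeds 2 in absolute value.
flagged = [i for i, zi in enumerate(z) if abs(zi) > 2]
print(s_e, flagged)
```

A flagged observation deserves a look; it is not by itself a reason to delete anything.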
11Strip Mining the Sample: Residuals and Outliers
10/57
12An Aside About Plotting
11/57
13Appropriate Plot
14Appropriate Residual Plot
12/57
15Strip Mining the Data: Unusual Observations
14/57
16When to Remove Outliers
15/57
- Outliers have very large residuals
- Only if it is ABSOLUTELY necessary
- The data are obviously miscoded
- There is something clearly wrong with the observation
- Do not remove outliers just because Minitab flags them. This is not sufficient reason.
17Units of Measurement
22/57
- $y = a + b_1 x_1 + b_2 x_2 + e$
- If you multiply every observation of variable x by the same constant, c, then the regression coefficient will be divided by c.
- E.g., multiply X by .001 to change dollars to thousands of dollars, then b is multiplied by 1000. b times x will be unchanged.
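A quick numerical check of the units rule, using made-up data and a one-predictor least squares slope written here for illustration. Rescaling x by c = .001 should multiply the slope by 1000 and leave b times x unchanged.

```python
def ols_slope(x, y):
    """Least squares slope for a one-predictor regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: x measured in dollars.
x_dollars = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Rescale x to thousands of dollars.
c = 0.001
x_thousands = [c * v for v in x_dollars]

b_dollars = ols_slope(x_dollars, y)
b_thousands = ols_slope(x_thousands, y)

# The coefficient is divided by c; the product b * x is the same either way.
print(b_dollars, b_thousands)
print(b_dollars * x_dollars[0], b_thousands * x_thousands[0])
```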
18Scaling the Data
23/57
- Units of measurement and coefficients
- Macro data and per capita figures
- Gasoline data
- WHO data
- Micro data and normalizations
- R&D and Profits
19The Gasoline Market
24/57
Aggregate consumption or expenditure data would not be interesting. Income data are already per capita.
20The WHO Data
25/57
Per Capita GDP and Per Capita Health Expenditure.
Aggregate values would make no sense.
21Profits and R&D by Industry
26/57
Is there a relationship between R&D and Profits? This just shows that big industries have larger profits and R&D than small ones.
Gujarati, D. Basic Econometrics, McGraw Hill,
1995, p. 388.
22Normalized by Sales
27/57
Profits/Sales = α + β R&D/Sales + ε
23More Movie Madness
28/57
- McDonald's and Movies (Craig, Douglas, Greene: International Journal of Marketing)
- Log Foreign Box Office(movie,country,year) = α + β1 LogBox(movie,US,year) + β2 LogPCIncome + β4 LogMacsPC + GenreEffect + CountryEffect + ε.
24
29/57
We used McDonald's Per Capita
25Movie Madness Data (n = 2198)
30/57
26Macs and Movies
31/57
Genres (MPAA): 1=Drama, 2=Romance, 3=Comedy, 4=Action, 5=Fantasy, 6=Adventure, 7=Family, 8=Animated, 9=Thriller, 10=Mystery, 11=Science Fiction, 12=Horror, 13=Crime
Countries and Some of the Data
Code  Country    Pop (mm)  Per cap income  Number of McDonalds  Language
1     Argentina  37        12090           173                  Spanish
2     Chile      15        9110            70                   Spanish
3     Spain      39        19180           300                  Spanish
4     Mexico     98        8810            270                  Spanish
5     Germany    82        25010           1152                 German
6     Austria    8         26310           159                  German
7     Australia  19        25370           680                  English
8     UK         60        23550           1152                 UK
27Making the Genre Variables
32/57
Calc ?
28Movie Genres
33/57
29
34/57
CRIME is the left out GENRE. AUSTRIA is the left
out country. Australia and UK were left out for
other reasons (algebraic problem with only 8
countries).
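A sketch of building the genre indicator variables with one category left out, assuming the 13 MPAA genre labels listed earlier and CRIME as the omitted genre. The function name is hypothetical; Minitab builds these through its own menus.

```python
# The 13 genres from the slides; Crime is the left-out (base) category.
GENRES = [
    "Drama", "Romance", "Comedy", "Action", "Fantasy", "Adventure",
    "Family", "Animated", "Thriller", "Mystery", "Science Fiction",
    "Horror", "Crime",
]
BASE_GENRE = "Crime"


def genre_indicators(genre):
    """Return the 12 indicator (0/1) variables for one movie's genre."""
    return {g: int(g == genre) for g in GENRES if g != BASE_GENRE}


comedy_row = genre_indicators("Comedy")
crime_row = genre_indicators("Crime")

# A Comedy has exactly one indicator equal to 1; a Crime movie has none,
# so its effect is absorbed by the constant term.
print(sum(comedy_row.values()), sum(crime_row.values()))
```

Leaving one category out avoids an exact linear relationship between the indicators and the constant term.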
30Model Fit
35/57
- How well does the model fit the data?
- $R^2$ measures fit: the larger the better
- Time series: expect .9 or better
- Cross sections: it depends
- Social science data: .1 is good
- Industry or market data: .5 is routine
31OK Fit
36/57
32Success Measure
37/57
- Hypothesis: There is no regression.
- Equivalent Hypothesis: $R^2 = 0$.
- How to test: For now, rough rule. Look for F > 2 for multiple regression. F = 144.34 for Movie Madness.
33A Formal Test of the Regression Model
38/57
- Is there a significant relationship?
- Equivalently, is $R^2 > 0$?
- Statistically, not numerically.
- Testing:
- Compute $F = \dfrac{R^2 / K}{(1 - R^2)/(n - K - 1)}$
- Determine if F is large using the appropriate table
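As a sketch, the F statistic can be computed directly from R squared. The inputs are the Movie Madness values reported elsewhere in these slides (R-Sq of 57.0 percent, n = 2198, K = 20); because R squared is rounded to three digits, the result is close to, but not exactly, the 144.34 Minitab reports.

```python
def overall_f(r_squared, n, k):
    """F statistic for the hypothesis that R squared is zero.

    n is the sample size and k is the number of predictors.
    """
    return (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))


# Movie Madness: rounded R squared, sample size, number of predictors.
f_stat = overall_f(0.570, n=2198, k=20)
print(round(f_stat, 2))
```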
34The F Test for the Model
39/57
- Determine the appropriate critical value from the table.
- Is the F from the computed model larger than the theoretical F from the table?
- Yes: Conclude the relationship is significant
- No: Conclude $R^2 = 0$.
35
40/57
n1 = Number of predictors
n2 = Sample size - number of predictors - 1
36Testing .
41/57
- Use Minitab's F Calculator
37Finding the Critical F
42/57
Leave as is
Number of predictors in the model K
n-K-1
Standard .95
38Compare Sample F to Critical F
43/57
- F = 144.34 for Movie Madness
- Critical value from the table is 1.57536.
- Reject the hypothesis of no relationship.
39An Equivalent Approach
44/57
- What is the P Value?
- We observed an F of 144.34 (or, whatever it is).
- If there really were no relationship, how likely is it that we would have observed an F this large (or larger)?
- Depends on n and K
- The probability is reported with the regression results as the P Value.
40The F Test
45/57
S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58
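The entries in the analysis of variance table are linked by simple arithmetic. The sketch below recomputes the mean squares, F, S and R squared from the sums of squares and degrees of freedom in the output; the variable names are chosen here for illustration.

```python
import math

# Sums of squares and degrees of freedom from the Minitab output.
ss_regression, df_regression = 2617.58, 20
ss_residual, df_residual = 1974.01, 2177
ss_total = 4591.58

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the mean squares; S is the square root of MS residual.
f_stat = ms_regression / ms_residual
s = math.sqrt(ms_residual)

# R squared is the share of the total sum of squares that is explained.
r_squared = ss_regression / ss_total

print(round(ms_regression, 2), round(ms_residual, 2))
print(round(f_stat, 2), round(s, 4), round(r_squared, 3))
```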
41A Huge Theorem
46/57
- $R^2$ always goes up when you add variables to your model.
- Always.
42Adjusted R Squared
47/57
- Adjusted $R^2$ penalizes your model for obtaining its fit with lots of variables. Adjusted $R^2 = 1 - \dfrac{n-1}{n-K-1}\,(1 - R^2)$
- Adjusted $R^2$ is denoted $\bar{R}^2$
- Adjusted $R^2$ is not the mean of anything and it is not a square. This is just a name.
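A short sketch of the adjusted R squared calculation, using the Movie Madness values reported elsewhere in these slides (R-Sq of 57.0 percent, n = 2198, K = 20). The function name is chosen here for illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R squared for n observations and k predictors."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r_squared)


# Movie Madness values; the result should be close to the reported 56.6%.
adj = adjusted_r_squared(0.570, n=2198, k=20)
print(round(adj, 3))
```

With n this large the penalty factor (n-1)/(n-K-1) is barely above 1, so the adjustment is small.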
43The Analysis of Variance
48/57
S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58

If n is very large, $R^2$ and Adjusted $R^2$ will not differ by very much. 2198 is quite large for this purpose.
44Exploring the Relationship
49/57
- F statistic examines the entire relationship. Benchmark: F > 2 is good for a multiple regression.
- What about individual coefficients? (E.g., is there a significant relationship between the number of McDonald's and the local box office result?)
45
50/57
Use individual t statistics. T > 2 or T < -2 suggests the variable is significant. T for LogPCMacs = 9.66. This is large. Note the 2 for t statistics and the 4 = 2² for the F statistic for a simple regression (one predictor) is not a coincidence.
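The link between 2 and 4 can be checked numerically: in a regression with one predictor, the squared t statistic on the slope equals the F statistic. This is a sketch with made-up data and a hand-written least squares routine, not Minitab's computation.

```python
import math


def simple_regression_t_and_f(x, y):
    """Return (t statistic for the slope, F statistic) for y = a + b*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    # Residual variance with n - K - 1 = n - 2 degrees of freedom (K = 1).
    s_squared = sse / (n - 2)
    t_stat = b / math.sqrt(s_squared / sxx)
    r_squared = 1.0 - sse / sst
    f_stat = (r_squared / 1) / ((1.0 - r_squared) / (n - 2))
    return t_stat, f_stat


# Hypothetical data for a one-predictor regression.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]
t_stat, f_stat = simple_regression_t_and_f(x, y)
print(t_stat ** 2, f_stat)
```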
46What About a Group of Variables?
51/57
- Is Genre significant?
- There are 12 genre variables
- Some are significant (fantasy, mystery, horror), some are not.
- Can we conclude the group as a whole is?
- Maybe. We need a test.
47Theory for the Test
52/57
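- A larger model has a higher $R^2$ than a smaller one.
- (Larger model means it has all the variables in the smaller one, plus some additional ones)
- (1) Compute this statistic with a calculator: $F = \dfrac{(R^2_{\text{larger}} - R^2_{\text{smaller}})/(\text{number of added variables})}{(1 - R^2_{\text{larger}})/(n - K - 1)}$, where K is the number of predictors in the larger model.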
48Is Genre Significant?
56/57
With the 12 Genre indicator variables: S = 0.952237, R-Sq = 57.0%
Without the 12 Genre indicator variables: S = 0.967685, R-Sq = 55.4%

$F = \dfrac{(0.570 - 0.554)/12}{(1 - 0.570)/(2198 - 20 - 1)} = 6.750$

Cumulative Distribution Function
F distribution with 12 DF in numerator and 2177 DF in denominator
x     P( X < x )
6.75  1.00000

THIS IS LARGER THAN 0.95
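A sketch of the same calculation in code, using the two R squared values on this slide. The function and argument names are chosen here for illustration; the cumulative probability still has to come from Minitab or an F table.

```python
def group_f(r2_larger, r2_smaller, num_added, n, k_larger):
    """F statistic for testing a group of coefficients.

    num_added is the number of variables in the group and k_larger is
    the number of predictors in the larger model.
    """
    numerator = (r2_larger - r2_smaller) / num_added
    denominator = (1.0 - r2_larger) / (n - k_larger - 1)
    return numerator / denominator


# With and without the 12 genre indicators; the larger model has 20 predictors.
f_genre = group_f(0.570, 0.554, num_added=12, n=2198, k_larger=20)
print(round(f_genre, 2))
```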
49Now What?
55/57
- If the value that Minitab shows you is greater than 0.95, then the F statistic is large
- I.e., conclude that the group of coefficients is significant
- This means that at least one is nonzero, not that all necessarily are.
50Testing .
53/57
- Use Minitab's F Calculator
51F Test
54/57
Leave as is
Number of coefficients in the group
N-K-1 for the larger model
Your F from Step 1
Push the button
52Summary
57/57
- Data preparation: missing values
- Residuals and outliers
- Scaling the data
- Model fit and analysis of variance: $R^2$
- Testing
- One variable (coefficient): the t test
- A set of variables: the F test