Statistics and Data Analysis - PowerPoint PPT Presentation


Title: Statistics and Data Analysis


1
Statistics and Data Analysis
  • Professor William Greene
  • Stern School of Business
  • IOMS Department
  • Department of Economics

2
Statistics and Data Analysis
Part 19: Multiple Regression 3
3
Multiple Regression Modeling
1/57
  • Data Preparation
  • Examining the data
  • Transformations
  • Scaling
  • Analysis of the Regression
  • Residuals and outliers
  • Influential data points
  • The fit of the regression
  • R squared and adjusted R squared
  • Analysis of variance
  • Individual coefficient estimates and t statistics
  • Testing for significance of a set of coefficients
  • Prediction

4
Data Preparation
2/57
  • Get rid of observations with missing values.
  • Small numbers of missing values: delete the
    observations.
  • Large numbers of missing values: you may need to
    give up on certain variables.
  • There are theories and methods for filling
    missing values. (Advanced techniques. Usually
    not useful or appropriate for real world work.)
  • Be sure that missingness is not directly
    related to the values of the dependent variable.
    E.g., a regression that follows systematically
    removing high values of Y is likely to be
    biased if you then try to use the results to
    describe the entire population. (E.g., any
    sample related to income or consumption that is
    drawn at an airport is likely to be biased
    vis-à-vis the entire population.)
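
A minimal sketch of the first two points, assuming a toy data set in which None marks a missing value (the records and the names y, x1, x2 are made up for illustration). It deletes incomplete observations and reports the share missing for each variable, which is what you would look at before deciding whether to give up on a variable.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"y": 10.0, "x1": 1.0, "x2": 5.0},
    {"y": None, "x1": 2.0, "x2": 6.0},
    {"y": 12.5, "x1": 3.0, "x2": None},
    {"y": 15.0, "x1": 4.0, "x2": 8.0},
]


def drop_incomplete(records):
    """Keep only observations that have no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def missing_share(records, name):
    """Fraction of observations in which the variable `name` is missing."""
    return sum(r[name] is None for r in records) / len(records)


complete = drop_incomplete(rows)
shares = {name: missing_share(rows, name) for name in rows[0]}
print(len(complete), shares)
```

This only handles the mechanics. Whether the deleted rows are related to the dependent variable still has to be judged from how the data were collected.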

5
Transform the Data?
3/57
  • Just because a variable is skewed does not mean
    you should take logs. Take logs if the model you
    are fitting calls for taking logs. More later.
  • Scaling? E.g., per capita data. Scaling by
    assets or number of shares, or sales? Only if it
    is appropriate in the context of the model (the
    study). More later.
  • Do not transform variables without a good reason
    to do so. Skewness, by itself, is not a good
    reason. Don't scale variables just because the
    values are large.
  • Transform data appropriately for the study you
    are doing (i.e., for the story you are trying to
    tell your reader).

6
Using Logs
4/57
  • Generally, use logs for size variables
  • Use logs if you are seeking to estimate
    elasticities
  • Use logs if your data span a very large range of
    values and the independent variables do not (a
    modeling issue: some art mixed in with the
    science).
  • If the data contain 0s or negative values, then
    logs will be inappropriate for the study. Do not
    use ad hoc fixes like adding something to Y so it
    will be positive.
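
To illustrate the elasticity point, here is a small sketch under stated assumptions: the data are made up and generated exactly from y = 3 x^0.8, and `ols` is a hypothetical helper for a one-predictor least squares fit. Regressing log y on log x recovers the exponent, which is the elasticity of y with respect to x.

```python
import math


def ols(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b


# Made-up data spanning a large range, generated from y = 3 * x**0.8.
x = [1.0, 10.0, 100.0, 1000.0, 10000.0]
y = [3.0 * xi ** 0.8 for xi in x]

log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
a, b = ols(log_x, log_y)

# b estimates the elasticity; exp(a) estimates the multiplicative constant.
print(round(b, 6), round(math.exp(a), 6))
```

Note that math.log raises an error for zero or negative inputs, which is the practical face of the last bullet above.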

7
More on Using Logs
5/57
  • Generally only for continuous variables like
    income or variables that are essentially
    continuous.
  • Not for discrete variables like binary variables
    or qualitative variables (e.g., stress level
    1,2,3,4,5)
  • Generally be consistent in the equation: don't
    mix logs and levels.
  • Generally DO NOT take the log of time (t) in a
    model with a time trend. TIME is discrete and
    not a measure.

8
Residuals
8/57
  • Residual: the difference between the actual
    value of Y and the value predicted by the
    regression.
  • E.g., Switzerland
  • Estimated equation is DALE = 36.900 +
    2.9787 EDUC + 0.004601 PCHexp
  • Swiss values are EDUC = 9.418360, PCHexp = 2646.442
  • Regression prediction = 77.1307
  • Actual Swiss DALE = 72.71622
  • Residual = 72.71622 - 77.1307 = -4.41448
  • The regression overpredicts Switzerland.
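
The Swiss example can be reproduced with a few lines of arithmetic. This sketch uses only the coefficients and values quoted on the slide; the variable names are chosen here for illustration.

```python
# Coefficients of the estimated equation quoted on the slide.
intercept = 36.900
b_educ = 2.9787
b_pchexp = 0.004601

# Swiss values quoted on the slide.
educ = 9.418360
pchexp = 2646.442
actual_dale = 72.71622

predicted = intercept + b_educ * educ + b_pchexp * pchexp
residual = actual_dale - predicted  # actual minus predicted

print(round(predicted, 4), round(residual, 4))
```

A negative residual means the actual value is below the prediction, i.e. the regression overpredicts. Small differences in the last digits relative to the slide come from the rounded coefficients.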

9
Using Residuals
9/57
  • As indicators of bad data
  • As indicators of observations that deserve
    attention
  • As a diagnostic tool to evaluate the regression
    model

10
Outliers
13/57
  • A residual is ei = yi - a - b1xi1 - ...
  • The standard deviation of the residuals is
    se = sqrt[ sum of ei² / (n - K - 1) ], where K is
    the number of predictors.
  • Standardized residuals are ei/se.
  • Large residuals have |ei/se| > 2.
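
A rough sketch of the flagging rule, assuming the usual regression standard error with n - K - 1 in the denominator (K = number of predictors). The residuals are hypothetical numbers, not taken from any data set in these slides.

```python
import math


def standardized_residuals(residuals, n_predictors):
    """Return (s_e, list of e_i / s_e), using n - K - 1 degrees of freedom."""
    n = len(residuals)
    s_e = math.sqrt(sum(e * e for e in residuals) / (n - n_predictors - 1))
    return s_e, [e / s_e for e in residuals]


# Hypothetical residuals from a regression with one predictor (they sum to 0).
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 3.0, -0.6, -0.5, -1.0, -1.0]

s_e, z_scores = standardized_residuals(residuals, n_predictors=1)

# Flag observations whose standardized residual exceeds 2 in absolute value.
flagged = [i for i, z in enumerate(z_scores) if abs(z) > 2]
print(round(s_e, 4), flagged)
```

A flag is only a prompt to look at the observation. It is not, by itself, a reason to delete it.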

11
Strip Mining the Sample: Residuals and Outliers
10/57
12
An Aside About Plotting
11/57
13
Appropriate Plot
14
Appropriate Residual Plot
12/57
15
Strip Mining the Data: Unusual Observations
14/57
16
When to Remove Outliers
15/57
  • Outliers have very large residuals
  • Remove them only if it is ABSOLUTELY necessary:
  • The data are obviously miscoded
  • There is something clearly wrong with the
    observation
  • Do not remove outliers just because Minitab flags
    them. This is not sufficient reason.

17
Units of Measurement
22/57
  • y = a + b1x1 + b2x2 + e
  • If you multiply every observation of variable x
    by the same constant, c, then the regression
    coefficient will be divided by c.
  • E.g., multiply x by .001 to change dollars to
    thousands of dollars, then b is multiplied by
    1000. b times x will be unchanged.
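
A quick numerical check of the scaling rule, using made-up data and a hypothetical one-predictor `ols` helper. Multiplying x by .001 multiplies the slope by 1000 and leaves the intercept and the fitted values unchanged.

```python
def ols(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return ybar - b * xbar, b


# Made-up data: x in dollars, y in arbitrary units.
x_dollars = [12000.0, 25000.0, 31000.0, 47000.0, 52000.0]
y = [3.1, 4.9, 6.2, 8.8, 9.1]

a, b = ols(x_dollars, y)

# Rescale x to thousands of dollars (c = .001).
x_thousands = [0.001 * v for v in x_dollars]
a_scaled, b_scaled = ols(x_thousands, y)

fitted = [a + b * v for v in x_dollars]
fitted_scaled = [a_scaled + b_scaled * v for v in x_thousands]

print(b, b_scaled, b_scaled / b)
```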

18
Scaling the Data
23/57
  • Units of measurement and coefficients
  • Macro data and per capita figures
  • Gasoline data
  • WHO data
  • Micro data and normalizations
  • R&D and Profits

19
The Gasoline Market
24/57
Aggregate consumption or expenditure data would
not be interesting. Income data are already per
capita.
20
The WHO Data
25/57
Per Capita GDP and Per Capita Health Expenditure.
Aggregate values would make no sense.
Years
21
Profits and R&D by Industry
26/57
Is there a relationship between R&D and Profits?
This just shows that big industries have larger
profits and R&D than small ones.
Gujarati, D. Basic Econometrics, McGraw Hill,
1995, p. 388.
22
Normalized by Sales
27/57
Profits/Sales = a + ß R&D/Sales + e
23
More Movie Madness
28/57
  • McDonalds and Movies (Craig, Douglas, Greene,
    International Journal of Marketing)
  • Log Foreign Box Office(movie,country,year) = a +
    ß1 LogBox(movie,US,year) + ß2 LogPCIncome +
    ß4 LogMacsPC + GenreEffect +
    CountryEffect + e.

24
29/57
We used McDonalds Per Capita
25
Movie Madness Data (n = 2198)
30/57
26
Macs and Movies
31/57
Genres (MPAA): 1=Drama, 2=Romance, 3=Comedy, 4=Action,
5=Fantasy, 6=Adventure, 7=Family, 8=Animated, 9=Thriller,
10=Mystery, 11=Science Fiction, 12=Horror, 13=Crime

Countries and Some of the Data
Code  Country    Pop(mm)  Per cap Income  No. of McDonalds  Language
1     Argentina       37           12090               173  Spanish
2     Chile           15            9110                70  Spanish
3     Spain           39           19180               300  Spanish
4     Mexico          98            8810               270  Spanish
5     Germany         82           25010              1152  German
6     Austria          8           26310               159  German
7     Australia       19           25370               680  English
8     UK              60           23550              1152  UK
27
Making the Genre Variables
32/57
Calc ?
28
Movie Genres
33/57
29
34/57
CRIME is the left out GENRE. AUSTRIA is the left
out country. Australia and UK were left out for
other reasons (algebraic problem with only 8
countries).
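
A sketch of how the genre indicator variables could be built, with Crime as the left-out genre as stated on this slide. The function name `genre_indicators` is hypothetical. One category has to be left out because a full set of indicators would sum to 1 for every movie and duplicate the constant term.

```python
# The 13 MPAA genre labels listed in the deck.
GENRES = [
    "Drama", "Romance", "Comedy", "Action", "Fantasy", "Adventure",
    "Family", "Animated", "Thriller", "Mystery", "Science Fiction",
    "Horror", "Crime",
]


def genre_indicators(genre, left_out="Crime"):
    """Return 0/1 indicators for every genre except the left-out one."""
    return {g: int(g == genre) for g in GENRES if g != left_out}


row_horror = genre_indicators("Horror")
row_crime = genre_indicators("Crime")

print(len(row_horror), row_horror["Horror"], sum(row_crime.values()))
```

A Crime movie gets zeros on all 12 indicators, so each genre coefficient is read as a difference from Crime.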
30
Model Fit
35/57
  • How well does the model fit the data?
  • R² measures fit: the larger the better
  • Time series: expect .9 or better
  • Cross sections: it depends
  • Social science data: .1 is good
  • Industry or market data: .5 is routine

31
OK Fit
36/57
32
Success Measure
37/57
  • Hypothesis: There is no regression.
  • Equivalent Hypothesis: R² = 0.
  • How to test: For now, a rough rule. Look for
    F > 2 for multiple regression. F = 144.34 for
    Movie Madness.

33
A Formal Test of the Regression Model
38/57
  • Is there a significant relationship?
  • Equivalently, is R² > 0?
  • Statistically, not numerically.
  • Testing
  • Compute F = (R²/K) / [(1 - R²)/(n - K - 1)]
  • Determine if F is large using the appropriate
    table
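
The F computation can be sketched as a one-line function. The inputs below are the Movie Madness figures reported elsewhere in this deck (R² = 57.0%, n = 2198, K = 20); the function name `overall_f` is made up for illustration.

```python
def overall_f(r_squared, n, k):
    """F statistic for the hypothesis that all K slope coefficients are zero."""
    return (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))


# Movie Madness: R-Sq = 57.0%, n = 2198 observations, K = 20 predictors.
f_movies = overall_f(0.570, 2198, 20)
print(round(f_movies, 2))
```

Because R² is rounded to three digits here, the result differs slightly from the 144.34 that the regression output reports.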

34
The F Test for the Model
39/57
  • Determine the appropriate critical value from
    the table.
  • Is the F from the computed model larger than the
    theoretical F from the table?
  • Yes: Conclude the relationship is significant
  • No: Conclude R² = 0.

35

40/57
n1 = Number of predictors; n2 = Sample size -
number of predictors - 1
36
Testing .
41/57
  • Use Minitab's F Calculator

37
Finding the Critical F
42/57
Leave as is
Number of predictors in the model K
n-K-1
Standard .95
38
Compare Sample F to Critical F
43/57
  • F = 144.34 for Movie Madness
  • Critical value from the table is 1.57536.
  • Reject the hypothesis of no relationship.

39
An Equivalent Approach
44/57
  • What is the P Value?
  • We observed an F of 144.34 (or, whatever it is).
  • If there really were no relationship, how likely
    is it that we would have observed an F this large
    (or larger)?
  • Depends on n and K
  • The probability is reported with the regression
    results as the P Value.

40
The F Test
45/57
S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58
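
Every derived number in this output follows from the DF and SS columns. The sketch below recomputes them from the values shown; the variable names are chosen here for illustration.

```python
import math

# DF and SS columns from the analysis of variance table.
ss_reg, df_reg = 2617.58, 20
ss_res, df_res = 1974.01, 2177
ss_total = 4591.58

ms_reg = ss_reg / df_reg      # mean square for the regression
ms_res = ss_res / df_res      # mean square for the residual error
f_stat = ms_reg / ms_res      # F = MS(regression) / MS(residual)
r_squared = ss_reg / ss_total
s = math.sqrt(ms_res)         # standard deviation of the residuals

print(round(ms_reg, 2), round(ms_res, 2), round(f_stat, 2),
      round(r_squared, 3), round(s, 4))
```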
41
A Huge Theorem
46/57
  • R² always goes up when you add variables to your
    model.
  • Always.

42
Adjusted R Squared
47/57
  • Adjusted R² penalizes your model for obtaining
    its fit with lots of variables. Adjusted R² =
    1 - [(n - 1)/(n - K - 1)](1 - R²)
  • Adjusted R² is denoted R̄² (R-bar squared).
  • Adjusted R² is not the mean of anything and it is
    not a square. This is just a name.
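
A short sketch of the adjustment, using the Movie Madness figures from this deck (R² = 57.0%, n = 2198, K = 20). The second call uses a hypothetical small sample of n = 30 with the same R² and K to show that the result can even be negative, which is one way to see that it is not a square.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared: 1 - [(n - 1)/(n - K - 1)] * (1 - R-squared)."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r_squared)


# Movie Madness: large n, so the penalty is small.
adj_movies = adjusted_r_squared(0.570, 2198, 20)

# Hypothetical small sample with the same R-squared and K: heavy penalty.
adj_small = adjusted_r_squared(0.570, 30, 20)

print(round(adj_movies, 3), round(adj_small, 3))
```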

43
The Analysis of Variance
48/57
S = 0.952237   R-Sq = 57.0%   R-Sq(adj) = 56.6%

Analysis of Variance
Source            DF       SS      MS       F      P
Regression        20  2617.58  130.88  144.34  0.000
Residual Error  2177  1974.01    0.91
Total           2197  4591.58

If n is very large, R² and Adjusted R² will not
differ by very much. 2198 is quite large for this
purpose.
44
Exploring the Relationship
49/57
  • F statistic examines the entire relationship.
    Benchmark: F > 2 is good for a multiple
    regression.
  • What about individual coefficients? (E.g., is
    there a significant relationship between the
    number of McDonalds and the local box office
    result?)

45
50/57
Use individual t statistics. T > 2 or T < -2
suggests the variable is significant. T for
LogPCMacs = 9.66. This is large. Note: the 2
for t statistics and the 4 = 2² for the F
statistic for a simple regression (one predictor)
is not a coincidence.
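
The link between the two benchmarks can be checked numerically: in a simple regression with one predictor, the squared t statistic on the slope equals the F statistic. The data below are made up for illustration.

```python
import math

# Made-up data for a simple regression (one predictor).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx                 # slope
a = ybar - b * xbar           # intercept

sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - ybar) ** 2 for yi in y)

s_e = math.sqrt(sse / (n - 2))          # n - K - 1 with K = 1
t_stat = b / (s_e / math.sqrt(sxx))     # slope divided by its standard error

r_squared = 1.0 - sse / sst
f_stat = (r_squared / 1) / ((1.0 - r_squared) / (n - 2))

print(t_stat ** 2, f_stat)
```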
46
What About a Group of Variables?
51/57
  • Is Genre significant?
  • There are 12 genre variables
  • Some are significant (fantasy, mystery, horror)
    some are not.
  • Can we conclude the group as a whole is?
  • Maybe. We need a test.

47
Theory for the Test
52/57
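  • A larger model has a higher R² than a smaller
    one.
  • (Larger model means it has all the variables in
    the smaller one, plus some additional ones)
  • (1) Compute this statistic with a calculator:
    F = [(R² of larger model - R² of smaller model) /
    (number of coefficients in the group)] /
    [(1 - R² of larger model) / (n - K - 1)],
    with K counted in the larger model.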

48
Is Genre Significant?
56/57
With the 12 Genre indicator variables:    S = 0.952237   R-Sq = 57.0%
Without the 12 Genre indicator variables: S = 0.967685   R-Sq = 55.4%

F = [(0.570 - 0.554)/12] / [(1 - 0.570)/(2198 - 20 - 1)] = 6.750

Cumulative Distribution Function
F distribution with 12 DF in numerator and 2177 DF in denominator
   x   P( X <= x )
6.75       1.00000
THIS IS LARGER THAN 0.95
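
The F statistic for the group of genre variables can be reproduced from the two R² values on this slide. A minimal sketch, with a hypothetical function name `group_f`:

```python
def group_f(r2_larger, r2_smaller, n_group, n, k_larger):
    """F statistic for testing a group of coefficients.

    n_group is the number of coefficients in the group being tested;
    k_larger is the number of predictors in the larger model.
    """
    numerator = (r2_larger - r2_smaller) / n_group
    denominator = (1.0 - r2_larger) / (n - k_larger - 1)
    return numerator / denominator


# With and without the 12 genre indicators: R-Sq = 57.0% vs. 55.4%.
f_genre = group_f(0.570, 0.554, n_group=12, n=2198, k_larger=20)
print(round(f_genre, 3))
```

The degrees of freedom for looking up this F are 12 in the numerator and 2198 - 20 - 1 = 2177 in the denominator, matching the Minitab output above.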
49
Now What?
55/57
  • If the value that Minitab shows you is greater
    than 0.95, then the F statistic is large
  • I.e., conclude that the group of coefficients is
    significant
  • This means that at least one is nonzero, not that
    all necessarily are.

50
Testing .
53/57
  • Use Minitab's F Calculator

51
F Test
54/57
Leave as is
Number of coefficients in the group
N-K-1 for the larger model
Your F from Step 1
Push the button
52
Summary
57/57
  • Data preparation: missing values
  • Residuals and outliers
  • Scaling the data
  • Model fit and analysis of variance: R²
  • Testing
  • One variable (coefficient): the t test
  • A set of variables: the F test