Title: Statistics and Data Analysis
1 Statistics and Data Analysis
- Professor William Greene
- Stern School of Business
- IOMS Department
- Department of Economics
2 Statistics and Data Analysis
Part 18 Multiple Regression 2
3 Multiple Regression Models
Part 17 Multiple Regression 2
- Using Minitab To Compute A Multiple Regression
- Basic Multiple Regression
- Using Binary Variables
- Logs and Elasticities
- Hedonic Regression and Interpretation
- Trends in Time Series Data
- Using Quadratic Terms to Improve the Model
4 Application: WHO
- Data Used in Assignment 1: WHO data on 191 countries in 1995-1999.
- Analysis of Disability Adjusted Life Expectancy (DALE)
- EDUC = average years of education
- PCHexp = per capita health expenditure
- $\text{DALE} = \alpha + \beta_1\,\text{EDUC} + \beta_2\,\text{PCHexp} + \varepsilon$
5 The (Famous) WHO Data
7 Specify the Variables in the Model
9 Graphs? Maybe
10 Regression Results
11 Practical Model Building
- Understanding the regression: the left-out variable problem
- Using different kinds of variables
- Dummy variables
- Logs
- Time trend
- Quadratic
12 A Fundamental Result
- What happens when you leave a crucial variable out of your model? (Bad things.)

Regression Analysis: G versus GasPrice (no income)
The regression equation is
G = 3.50 + 0.0280 GasPrice

Predictor   Coef        SE Coef     T       P
Constant    3.4963      0.1678      20.84   0.000
GasPrice    0.028034    0.002809    9.98    0.000

Regression Analysis: G versus GasPrice, Income
The regression equation is
G = 0.134 - 0.00163 GasPrice + 0.000026 Income

Predictor   Coef          SE Coef       T       P
Constant    0.13449       0.02081       6.46    0.000
GasPrice    -0.0016281    0.0004152     -3.92   0.000
Income      0.00002634    0.00000231    11.43   0.000
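The sign flip on GasPrice can be reproduced in a small experiment. The sketch below uses made-up data, not the gasoline data from the slide: quantity truly falls with price and rises with income, and income happens to trend upward together with price. The variable names, the coefficients (-0.3 and 0.1), and the helper `ols` are all assumptions for illustration.

```python
import math

def ols(columns, y):
    """Least squares coefficients of y on the given columns (normal equations)."""
    k, n = len(columns), len(y)
    a = []
    for i in range(k):
        # Row i of the augmented matrix [X'X | X'y].
        row = [sum(columns[i][t] * columns[j][t] for t in range(n)) for j in range(k)]
        row.append(sum(columns[i][t] * y[t] for t in range(n)))
        a.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Hypothetical data: price and income both trend upward over 40 periods.
n = 40
price = [10 + 0.5 * t + 2 * math.cos(1.7 * t) for t in range(n)]
income = [100 + 5 * t + 3 * math.sin(t) for t in range(n)]
# True relationship (assumed): price lowers quantity, income raises it.
quantity = [2 - 0.3 * p + 0.1 * m for p, m in zip(price, income)]
const = [1.0] * n

short_fit = ols([const, price], quantity)         # income left out
full_fit = ols([const, price, income], quantity)  # income included

print("price coefficient, income omitted: ", round(short_fit[1], 4))
print("price coefficient, income included:", round(full_fit[1], 4))
```

With income omitted, the price coefficient picks up the effect of the rising income and comes out positive; with income included it returns to the assumed -0.3.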
13 Using Dummy Variables
- Dummy variable = binary variable = a variable that takes values 0 and 1.
- E.g., OECD life expectancies compared to the rest of the world
- $\text{DALE} = \alpha + \beta_1\,\text{EDUC} + \beta_2\,\text{PCHexp} + \beta_3\,\text{OECD} + \varepsilon$
OECD countries: Australia, Austria, Belgium, Canada, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Japan, Korea, Luxembourg, Mexico, The Netherlands, New Zealand, Norway, Poland, Portugal, Slovak Republic, Spain, Sweden, Switzerland, Turkey, United Kingdom, United States.
14 OECD Life Expectancy
According to these results, after accounting for
education and health expenditure differences,
people in the OECD countries have a life
expectancy that is 1.19 years shorter than people
in other countries.
15 Binary Variable in Regression
The regression shifts down by 1.191 years for the OECD countries.
Non-OECD: DALE = 36.770 + 2.9962 EDUC + .005079 PCHExp
OECD: DALE = 36.770 + 2.9962 EDUC + .005079 PCHExp - 1.191
We set PCHExp to 1000, approximately the sample mean.
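The two parallel lines can be evaluated directly. This is a minimal sketch using the coefficients quoted above; the plus signs between terms are an assumption (the operators did not survive in the slide text), and the education levels 4, 8 and 12 are hypothetical values chosen only to show the constant gap.

```python
def dale(educ, pchexp, oecd):
    """Fitted DALE; coefficient values from the slide, '+' signs assumed."""
    return 36.770 + 2.9962 * educ + 0.005079 * pchexp - 1.191 * oecd

PCHEXP = 1000  # approximately the sample mean, as on the slide

for educ in (4, 8, 12):  # hypothetical education levels
    non_oecd = dale(educ, PCHEXP, oecd=0)
    oecd = dale(educ, PCHEXP, oecd=1)
    # The gap is the dummy coefficient at every education level.
    print(educ, round(non_oecd, 3), round(oecd, 3), round(oecd - non_oecd, 3))
```

The gap between the lines does not depend on EDUC or PCHExp, which is what "the regression shifts down" means.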
16 Plotting
For DALE_NonOECD, remove the -1.191 term.
17 Two Plots
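18 Dummy Variable in Log Regression
- E.g., Monet's signature equation
- $\log \text{Price} = \alpha + \beta_1 \log \text{Area} + \beta_2\,\text{Signed}$
- Unsigned: $\text{Price}_U = \exp(\alpha)\,\text{Area}^{\beta_1}$
- Signed: $\text{Price}_S = \exp(\alpha)\,\text{Area}^{\beta_1} \exp(\beta_2)$
- Signed/Unsigned $= \exp(\beta_2)$
- % Difference $= 100\%\,(\text{Signed} - \text{Unsigned})/\text{Unsigned}$
- $= 100\%\,[\exp(\beta_2) - 1]$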
19 The Signature Effect: 253%
$100\%\,[\exp(1.2618) - 1] = 100\%\,[3.532 - 1] = 253.2\%$
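The arithmetic for the signature effect can be checked in a few lines. The sketch below only uses the coefficient 1.2618 from the slide; the variable names are illustrative.

```python
import math

beta_signed = 1.2618  # coefficient on Signed in the log price equation

ratio = math.exp(beta_signed)    # Signed price / Unsigned price
pct_difference = 100 * (ratio - 1)

print("Signed/Unsigned ratio:", round(ratio, 3))
print("Percent difference:   ", round(pct_difference, 1))
```

Note that the coefficient itself (1.2618, or "126%") is not the percentage effect; for a dummy variable in a log equation the exact effect is 100[exp(coefficient) - 1].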
20 Monet Paintings in Millions
Difference is about 253%.
Predicted Price is exp(4.122 + 1.3458 logArea + 1.2618 Signed) / 1000000
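The predicted price formula can be written as a small function. This is a sketch under the assumption that the fitted equation is log Price = 4.122 + 1.3458 log Area + 1.2618 Signed with natural logs; the area value 1000 is a hypothetical input, not a painting from the data.

```python
import math

def predicted_price_millions(area, signed):
    """Predicted price in millions; coefficients from the slide, signs assumed."""
    log_price = 4.122 + 1.3458 * math.log(area) + 1.2618 * signed
    return math.exp(log_price) / 1_000_000

area = 1000.0  # hypothetical area, in the same units as the data
unsigned = predicted_price_millions(area, signed=0)
signed = predicted_price_millions(area, signed=1)

print("Unsigned:", round(unsigned, 3), "million")
print("Signed:  ", round(signed, 3), "million")
print("Ratio:   ", round(signed / unsigned, 3))
```

The ratio of the two predictions is exp(1.2618) at every area, so the signed and unsigned curves differ by a constant percentage rather than a constant amount.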
22 Dummy Variable for One Observation
Proofs: See p. 40.
- Single out one observation for special attention.
- The equation will predict that observation perfectly.
- For the other coefficients, it is the same as removing that observation from the sample.
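Both claims can be verified numerically. The sketch below uses eight made-up observations with an unusual second observation; the data, the variable names and the helper `ols` are assumptions for illustration, not the store sales data.

```python
def ols(columns, y):
    """Least squares coefficients of y on the given columns (normal equations)."""
    k, n = len(columns), len(y)
    a = []
    for i in range(k):
        row = [sum(columns[i][t] * columns[j][t] for t in range(n)) for j in range(k)]
        row.append(sum(columns[i][t] * y[t] for t in range(n)))
        a.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Hypothetical data; observation 2 (index 1) is the unusual one.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 9.0, 5.2, 5.9, 7.1, 8.2, 8.8, 10.1]
special = 1
const = [1.0] * len(x)
dummy = [1.0 if i == special else 0.0 for i in range(len(x))]

# Regression with a dummy variable for the one observation.
with_dummy = ols([const, x, dummy], y)
fitted_special = with_dummy[0] + with_dummy[1] * x[special] + with_dummy[2]
residual_special = y[special] - fitted_special

# Regression that simply drops that observation.
keep = [i for i in range(len(x)) if i != special]
dropped = ols([[1.0] * len(keep), [x[i] for i in keep]], [y[i] for i in keep])

print("residual for the special observation:", round(residual_special, 10))
print("constant and slope with dummy:", [round(b, 6) for b in with_dummy[:2]])
print("constant and slope, obs dropped:", [round(b, 6) for b in dropped])
```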
23 A London Effect on UK Electronic Store Sales?
Observation 2 is London: Fit = Actual, Residual = 0.
25 Logs in Regression
26 Elasticity
- The coefficient on log(Area) is 1.346.
- For each 1% increase in area, price goes up by 1.346%, even accounting for the signature effect.
- The elasticity is 1.346.
- Remarkable. Not only does price increase with area, it increases faster than area.
27 Monet By the Square Inch
28 Elasticities of Demand for Gasoline
29 Logs and Elasticities
- Theory: In the equation $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + \varepsilon$
- $\beta$ = (change in y) / (unit change in x)
- Elasticity = $\beta \times$ (mean of x) / (mean of y)
- When the variables are in logs: change in log x = % change in x
- $\log y = \alpha + \beta_1 \log x_1 + \beta_2 \log x_2 + \dots + \beta_K \log x_K + \varepsilon$
- Elasticity = $\beta$
- These will often give approximately the same answer.
- When in doubt, use logs.
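The two elasticity calculations can be compared on a small example. The sketch below uses hypothetical data generated from y = 2 x^0.8 with no noise, so the true elasticity is 0.8 by construction; the helper `simple_slope` and all numbers are assumptions for illustration.

```python
import math

def simple_slope(x, y):
    """Least squares slope in a regression of y on a constant and x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data with a constant elasticity of 0.8.
x = [10.0 + i for i in range(21)]
y = [2.0 * xi ** 0.8 for xi in x]

# Linear model: elasticity = slope * (mean of x) / (mean of y).
beta_linear = simple_slope(x, y)
elasticity_linear = beta_linear * (sum(x) / len(x)) / (sum(y) / len(y))

# Log model: the slope is the elasticity.
elasticity_log = simple_slope([math.log(v) for v in x], [math.log(v) for v in y])

print("elasticity from linear model at the means:", round(elasticity_linear, 3))
print("elasticity from log model:                ", round(elasticity_log, 3))
```

The linear model gives an elasticity that is close to, but not exactly, the log model's value, because it is evaluated only at the sample means.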
30 Elasticities
Price elasticity = -0.02070
Income elasticity = 1.10318
31 A Set of Dummy Variables
- Complete set of dummy variables divides the sample into groups.
- Fit the regression with group effects.
- Need to drop one (any one) of the variables to compute the regression. (Avoid the dummy variable trap.)
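The reason one dummy must be dropped can be seen by adding up the columns. The sketch below uses a hypothetical list of group labels; the names and the choice of which group to omit are assumptions for illustration.

```python
# Hypothetical group membership for six observations.
groups = ["North", "South", "Midwest", "West", "South", "North"]
levels = sorted(set(groups))

# Complete set of dummy variables: one column per group.
full_set = [[1 if g == level else 0 for level in levels] for g in groups]

# Every row sums to 1, which is exactly the constant term's column.
# The constant and the complete set are perfectly collinear: the trap.
row_sums = [sum(row) for row in full_set]

# Dropping any one group removes the problem; here the first level is dropped.
omitted = levels[0]
reduced_set = [[1 if g == level else 0 for level in levels if level != omitted]
               for g in groups]

print("levels:", levels)
print("row sums of the complete set:", row_sums)
print("omitted group:", omitted)
print("columns kept:", len(reduced_set[0]))
```

Coefficients on the remaining dummies are then measured relative to the omitted group.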
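32 Rankings of 132 U.S. Liberal Arts Colleges
Nancy Burnett, Journal of Economic Education, 1998
$\text{Reputation} = \alpha + \beta_1\,\text{Religious} + \beta_2\,\text{GenderEcon} + \beta_3\,\text{EconFac} + \beta_4\,\text{North} + \beta_5\,\text{South} + \beta_6\,\text{Midwest} + \beta_7\,\text{West} + \varepsilon$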
33 Minitab to the Rescue
34 Unordered Categorical Variables
House price data (fictitious)
Type 1 = Split level
Type 2 = Ranch
Type 3 = Colonial
Type 4 = Tudor
Use 3 dummy variables for this kind of data. (Not all 4.) Using variable STYLE in the model makes no sense. You could change the numbering scale any way you like. 1,2,3,4 are just labels.
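The point that the codes are only labels can be illustrated by building the indicator variables twice, once from the original codes and once from an arbitrary renumbering. The sketch below uses seven hypothetical houses; the helper `make_indicators` and the renumbering are assumptions for illustration.

```python
def make_indicators(codes, labels, omitted):
    """Return {category name: 0/1 list} for every category except the omitted one."""
    return {name: [1 if c == code else 0 for c in codes]
            for code, name in labels.items() if code != omitted}

labels = {1: "SplitLevel", 2: "Ranch", 3: "Colonial", 4: "Tudor"}
style = [1, 2, 2, 3, 4, 1, 3]  # hypothetical STYLE codes for seven houses

# Three dummy variables; Split level (code 1) is the omitted category.
dummies = make_indicators(style, labels, omitted=1)

# Renumber the categories arbitrarily and rebuild the indicators.
renumber = {1: 40, 2: 10, 3: 30, 4: 20}
labels_renumbered = {renumber[code]: name for code, name in labels.items()}
style_renumbered = [renumber[code] for code in style]
dummies_renumbered = make_indicators(style_renumbered, labels_renumbered,
                                     omitted=renumber[1])

for name, column in dummies.items():
    print(name, column)
print("same indicators after renumbering:", dummies == dummies_renumbered)
```

A regression on the indicators is unaffected by the numbering, whereas a regression on STYLE itself would change with every renumbering.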
35 Transform Style to Types
37 House Price Regression
Each of these is relative to a Split Level, since that is the omitted category. E.g., the price of a Ranch house is $74,369 less than a Split Level of the same size with the same number of bedrooms.
38 Ordered Categories
- Health Satisfaction: 1 = Poor, 2 = So_so, 3 = OK, 4 = Good, 5 = Great
- How to handle such a variable?
- Just use as is? No. So_so - Poor = 1, but this is not (necessarily) equal to Great - Good = 1.
- Use 4 of the indicator variables.
- Coding: It is not useful to consider modifications of the variable, such as -2,-1,0,1,2 or 2,4,6,8,10. None make sense, as this is just a label. Could also use 1,4,8,17,26, which would also make no sense.
- This needs a special kind of model if it is the dependent variable, not a regression equation.
39 Hedonic Regression
- A theory of prices
- Price = sum of prices for components
- House price
- Land size
- Rooms: fixed amount per room
- Swimming pool
- View
- N car garage
- Etc.
- Computers
- Speed
- Screen size
- Other features
40 Fumiro Computer Data
41 Transform Manufacturer Names to Indicator Variables
Calc > Make Indicator Variables
42 Hedonic Regression
43 Time Trends in Regression
- $y = \alpha + \beta_1 x + \beta_2 t + \varepsilon$: $\beta_2$ is the year to year increase not explained by anything else.
- $\log y = \alpha + \beta_1 \log x + \beta_2 t + \varepsilon$ (not log t, just t): $100\beta_2$ is the year to year % increase not explained by anything else.
44 Time Trend Regression
After accounting for Income, the price and the price of new cars, per capita gasoline consumption falls by 1.25% per year. I.e., if Income and the prices were unchanged, consumption would fall by 1.25% per year. But, of course, these other things do not remain unchanged.
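The reading of a trend coefficient as a percentage change is an approximation that works well when the coefficient is small. The sketch below assumes the gasoline equation is in logs with a trend coefficient of -0.0125 (a value inferred from the 1.25 figure quoted above, not taken from printed output) and compares the approximate and exact yearly changes.

```python
import math

beta_trend = -0.0125  # assumed coefficient on t in a log-linear equation

approx_pct_per_year = 100 * beta_trend                  # the 100*beta rule
exact_pct_per_year = 100 * (math.exp(beta_trend) - 1)   # exact change

# Cumulative change after 10 years with everything else held fixed.
ten_year_pct = 100 * (math.exp(10 * beta_trend) - 1)

print("approximate % change per year:", round(approx_pct_per_year, 3))
print("exact % change per year:      ", round(exact_pct_per_year, 3))
print("cumulative % change, 10 years:", round(ten_year_pct, 2))
```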
45 Nonlinear Equation
- Using a quadratic (like using logs)
- $y = \alpha + \beta_1 x + \beta_2 x^2 + \varepsilon$
- Usually $\beta_1 > 0$.
- If $\beta_2 > 0$ / If $\beta_2 < 0$
[Figure: two plots of y against x, one for each case]
46 A Quadratic Income vs. Age Regression
----------------------------------------------------
LHS=HHNINC   Mean                  =  .3520836
             Standard deviation    =  .1769083
Model size   Parameters            =  3
             Degrees of freedom    =  27323
Residuals    Sum of squares        =  794.9667
             Standard error of e   =  .1705730
Fit          R-squared             =  .7040754E-01
----------------------------------------------------
Variable     Coefficient     Mean of X
----------------------------------------------------
Constant     -.39266196
AGE           .02458140      43.5256898
AGESQ        -.00027237      2022.85549
EDUC          .01994416      11.3206310
----------------------------------------------------
Note the coefficient on Age squared is negative. Age ranges from 25 to 65.
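With a negative coefficient on Age squared, the fitted income profile rises and then falls, and the turning point is at Age = -b_AGE / (2 b_AGESQ). The sketch below computes that age from the coefficients in the table; the education level of 12 years used for the profile is a hypothetical value.

```python
B_CONST = -0.39266196
B_AGE = 0.02458140
B_AGESQ = -0.00027237
B_EDUC = 0.01994416

def predicted_income(age, educ):
    """Fitted HHNINC from the quadratic-in-age equation in the table."""
    return B_CONST + B_AGE * age + B_AGESQ * age ** 2 + B_EDUC * educ

# Age at which the fitted profile peaks: set the derivative
# B_AGE + 2 * B_AGESQ * age to zero.
peak_age = -B_AGE / (2 * B_AGESQ)

EDUC = 12  # hypothetical years of education
for age in (25, 35, 45, 55, 65):
    print(age, round(predicted_income(age, EDUC), 4))
print("peak age:", round(peak_age, 2))
```

The peak falls inside the observed age range of 25 to 65, so the downturn is a feature of the fitted curve within the data rather than an extrapolation.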
47 Implied By The Model
48 Case Study: A Huge Sports Contract
- Alex Rodriguez hired by the Texas Rangers for something like $25 million per year.
- Costs: the salary plus and minus some fine tuning of the numbers
- Benefits: more fans in the stands.
- How to determine if the benefits exceed the costs? Use a regression model.
49 PDV of the Costs
- Using 8% discount factor
- Accounting for all costs
- Roughly $21M to $28M in each year from 2001 to 2010, then the deferred payments from 2010 to 2020
- Total costs: About $165 Million in 2001 (Present discounted value)
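A present discounted value is the sum of each year's payment divided by (1 + rate) raised to the number of years until it is paid. The sketch below shows the calculation at 8% for a hypothetical flat payment stream; it is not the actual contract schedule, which is not listed here, so the result is not meant to reproduce the total quoted above.

```python
def present_value(payments, rate):
    """Discount annual payments to year 0; payments[0] is paid immediately."""
    return sum(p / (1 + rate) ** t for t, p in enumerate(payments))

RATE = 0.08

# Hypothetical schedule: a flat 25 (in $ millions) in each of 10 years.
# This is an assumption for illustration, NOT the actual contract.
hypothetical_payments = [25.0] * 10

pdv = present_value(hypothetical_payments, RATE)

print("undiscounted total:", sum(hypothetical_payments))
print("present value at 8%:", round(pdv, 1))
```

Later payments count for less, so the present value is always below the undiscounted total when the rate is positive.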
50 Benefits
- More fans in the seats
- Gate
- Parking
- Merchandise
- Increased chance at playoffs and world series
- Sponsorships
- (Loss to revenue sharing)
- Franchise value
51 How Many New Fans?
- Projected 8 more wins per year.
- What is the relationship between wins and attendance?
- Not known precisely
- Many empirical studies (The Journal of Sports Economics)
- Use a regression model to find out.
52 A Regression Model
- Based on 10 years of baseball data on wins and attendance
- Approximately (depends on your model):
- This year's attendance =
- team specific constant
- + 20,000 × Number of Wins
- + 6,000 × Last Year's Number of Wins
- + .42 × Last Year's Attendance
- + error
53 Marginal Value of a Win
- Roughly, increase in this year's attendance if the team wins one more game
- = (20,000 + 6,000) / (1 - .42)
- = About 45,000 fans per year per win
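The division by (1 - .42) comes from the lagged attendance term: extra fans this year also raise next year's attendance, and so on. The sketch below computes the closed-form value and also iterates the attendance equation year by year for a team that wins one more game every season; the variable names are illustrative.

```python
B_WINS = 20_000      # effect of this year's wins
B_LAG_WINS = 6_000   # effect of last year's wins
B_LAG_ATT = 0.42     # effect of last year's attendance

# Closed form: long-run attendance gain from one more win every year.
long_run_gain = (B_WINS + B_LAG_WINS) / (1 - B_LAG_ATT)

# Same thing by iterating the equation year by year.
gain = 0.0
path = []
for year in range(50):
    direct = B_WINS + (B_LAG_WINS if year > 0 else 0)
    gain = direct + B_LAG_ATT * gain
    path.append(gain)

print("gain in the first three years:", [round(g) for g in path[:3]])
print("long-run gain (closed form):  ", round(long_run_gain))
```

The gain starts at 20,000 in the first year and builds toward the long-run value of roughly 45,000.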
54 Marginal Value of an A Rod
- 8 games × 45,000 fans = 360,000 fans
- 360,000 fans ×
- $18 per ticket
- + $2.50 parking etc.
- + $1.80 stuff (hats, bobble head dolls, ...)
- = $8.0 Million per year !!!!! It's not close. (Marginal cost is about $16.5M / year)
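The revenue arithmetic can be laid out step by step. The sketch below uses only the figures on the slide; the variable names are illustrative.

```python
extra_wins = 8
fans_per_win = 45_000
extra_fans = extra_wins * fans_per_win

ticket = 18.00
parking = 2.50
merchandise = 1.80
revenue_per_fan = ticket + parking + merchandise

extra_revenue = extra_fans * revenue_per_fan
marginal_cost = 16.5e6  # per year, from the slide

print("extra fans per year:   ", extra_fans)
print("revenue per fan:       ", round(revenue_per_fan, 2))
print("extra revenue per year:", round(extra_revenue))
print("covers the cost?       ", extra_revenue >= marginal_cost)
```

Extra revenue comes to about half of the marginal cost, which is the sense in which the comparison is not close.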
55 Postscripts
- (1) Texas was not out of last place for a single day while A-Rod was on the team. Was it worth it? You make the call.
- (2) What about the Yankees? They now pay most of the same costs. Is it worth it? How would you find out?
- (3) What about that David Beckham contract with Major League Soccer?
56 Summary
- Using Minitab To Compute a Regression
- Building a Model
- Logs
- Dummy variables
- Qualitative variables
- Trends
- Quadratics
- Effects across time
- All Assuming You Know the Right Variables!