Title: Class 7: Thurs., Sep. 30
1 Class 7: Thurs., Sep. 30
2 Outliers and Influential Observations
- Outlier: Any really unusual observation.
- Outlier in the X direction (called a high-leverage point): Has the potential to influence the regression line.
- Outlier in the direction of the scatterplot: An observation that deviates from the overall pattern of the relationship between Y and X. Typically has a residual that is large in absolute value.
- Influential observation: A point that, if removed, would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.
3 Housing Prices and Crime Rates
- A community in the Philadelphia area is interested in how crime rates are associated with property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection by gains in tax revenues from higher property values.
- The town council looked at a recent issue of Philadelphia Magazine (April 1996) and found data for itself and 109 other communities in Pennsylvania near Philadelphia. The data are in philacrimerate.JMP.
  - House price: Average house price for sales during the most recent year.
  - Crime rate: Rate of crimes per 1000 population.
5 Which points are influential?
Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.
6 Excluding Observations from Analysis in JMP
- To exclude an observation from the regression analysis in JMP, go to the row of the observation, click Rows, and then click Exclude/Unexclude. A red circle with a diagonal line through it should appear next to the observation.
- To put the observation back into the analysis, go to the row of the observation, click Rows, and then click Exclude/Unexclude again. The red circle should no longer appear next to the observation.
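Outside JMP, the same exclude-and-refit step can be sketched in Python. This is only an illustration under stated assumptions: the data are made up (they are not the philacrimerate values), and the names `fit_line` and `excluded` are hypothetical. A boolean mask plays the role of JMP's Exclude/Unexclude marker.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    slope, intercept = np.polyfit(x, y, deg=1)  # polyfit lists the highest degree first
    return intercept, slope

# Made-up data: 20 ordinary points near the line y = 200 - 2x, plus one extreme point.
x_main = np.linspace(10, 40, 20)
y_main = 200 - 2 * x_main + 5 * np.sin(np.arange(20))  # deterministic "noise"
x = np.append(x_main, 300.0)
y = np.append(y_main, 100.0)

# Mark the last row as excluded (the analogue of the red circle in JMP).
excluded = np.zeros(x.size, dtype=bool)
excluded[-1] = True

intercept_all, slope_all = fit_line(x, y)
intercept_kept, slope_kept = fit_line(x[~excluded], y[~excluded])

print(f"all rows:        slope = {slope_all:.3f}")
print(f"row 21 excluded: slope = {slope_kept:.3f}")
```

Setting `excluded[-1] = False` and refitting corresponds to putting the observation back into the analysis.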
7 Formal measures of leverage and influence
- Leverage: Hat values (JMP calls them "hats").
- Influence: Cook's Distance (JMP calls them "Cook's D Influence").
- To obtain them in JMP, click Analyze, Fit Model; put the Y variable in Y and the X variable in the Model Effects box. Click the Run Model box. After the model is fit, click the red triangle next to Response. Click Save Columns, then click Hats for leverages and Cook's D Influence for Cook's Distances.
- To sort observations by Cook's Distance or leverage, click Tables, Sort, and then put the variable you want to sort by in the By box.
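For reference, the usual textbook definitions of these two measures for simple linear regression are written out below. The symbols are assumptions introduced here, not taken from the slides: e_i is the residual of observation i, p is the number of coefficients (2 for simple linear regression), and s^2 is the residual variance estimate. I am assuming JMP's saved columns follow these standard definitions.

```latex
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2},
\qquad
D_i = \frac{e_i^2}{p\, s^2} \cdot \frac{h_i}{(1 - h_i)^2},
\qquad
s^2 = \frac{1}{n - p} \sum_{j=1}^{n} e_j^2
```

The hat value depends only on how far x_i is from the mean of X, while Cook's Distance combines the size of the residual with the leverage.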
8 Center City Philadelphia has both high influence (Cook's Distance much greater than 1) and high leverage (hat value > 3 × 2/99 ≈ 0.06). No other observations have high influence or high leverage.
9 Rules of Thumb for High Leverage and High Influence
- High leverage: Any observation with a leverage (hat value) > (3 × number of coefficients in regression model)/n has high leverage, where
  - number of coefficients in regression model = 2 for simple linear regression.
  - n = number of observations.
- High influence: Any observation with a Cook's Distance greater than 1 has high influence.
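The two rules of thumb can be sketched in Python as follows. This is a minimal illustration, not JMP's implementation: the function names `leverage_and_cooks` and `flag_points` are hypothetical, the data are made up, and the hat values and Cook's Distances are computed from the standard textbook formulas.

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Hat values and Cook's Distances for the simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    X = np.column_stack([np.ones(n), x])          # intercept column + x
    p = X.shape[1]                                # number of coefficients (2)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
    resid = y - X @ beta
    # Diagonal of the hat matrix H = X (X'X)^{-1} X'
    hats = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    s2 = resid @ resid / (n - p)                  # residual variance estimate
    cooks = resid**2 / (p * s2) * hats / (1.0 - hats) ** 2
    return hats, cooks

def flag_points(x, y, n_coef=2):
    """Apply the rules of thumb: leverage > 3 * n_coef / n, Cook's Distance > 1."""
    hats, cooks = leverage_and_cooks(x, y)
    n = hats.size
    return hats > 3 * n_coef / n, cooks > 1

# Made-up data: 20 ordinary points plus one point far out in the X direction.
x = np.append(np.linspace(10, 40, 20), 300.0)
y = np.append(200 - 2 * np.linspace(10, 40, 20) + 5 * np.sin(np.arange(20)), 100.0)

hats, cooks = leverage_and_cooks(x, y)
high_leverage, high_influence = flag_points(x, y)
print("high leverage rows: ", np.flatnonzero(high_leverage))
print("high influence rows:", np.flatnonzero(high_influence))
```

A quick sanity check on any implementation is that the hat values sum to the number of coefficients (2 for simple linear regression).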
10 What to Do About Suspected Influential Observations?
- See flowchart handout.
- Does removing the observation change the substantive conclusions?
  - If not, can say something like: "Observation x has high influence relative to all other observations, but we tried refitting the regression without Observation x and our main conclusions didn't change."
11
- If removing the observation does change the substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?
  - If yes, omit the observation and proceed.
  - If no, does the observation have high leverage (outlier in the explanatory variable)?
    - If yes, omit the observation and proceed. Report that conclusions apply only to a limited range of the explanatory variable.
    - If no, not much can be said. More data (or clarification of the influential observation) are needed to resolve the questions.
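The sequence of questions on these two slides can be written as a small function. This is only a sketch of the order in which the questions are asked; the function name, argument names, and return strings are hypothetical, and the yes/no answers still have to come from the analyst's judgment.

```python
def influential_observation_action(changes_conclusions: bool,
                                   different_population: bool,
                                   high_leverage: bool) -> str:
    """Walk the questions in order and return the suggested action."""
    if not changes_conclusions:
        # Removing the point does not change the substantive conclusions.
        return "keep: report that refitting without the point did not change the conclusions"
    if different_population:
        # The point belongs to a population other than the one under investigation.
        return "omit: the point belongs to a different population"
    if high_leverage:
        # Outlier in the explanatory variable.
        return "omit: report that conclusions apply only to a limited range of X"
    return "unresolved: more data or clarification of the point is needed"

# Example: the point changes the conclusions, has no known reason to come from
# another population, and has high leverage.
print(influential_observation_action(True, False, True))
```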
12 General Principles for Dealing with Influential Observations
- General principle: Delete observations from the analysis sparingly, only when there is good cause (the observation does not belong to the population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.
13 The Question of Causation
- The community that ran this regression would like to increase property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection by gains in tax revenue from higher property values.
- The regression without Center City Philadelphia is:
  - Linear Fit
  - HousePrice = 225233.55 - 2288.6894 CrimeRate
- The community concludes that if it can cut its crime rate from 30 down to 20 incidents per 1000 population, it will increase its average house price by 2288.6894 × 10 = $22,887.
- Is the community's conclusion justified?
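The arithmetic behind the community's figure can be written out from the fitted line. The notation \hat{Y}(x) for the estimated mean house price at crime rate x is introduced here for illustration and is not in the slides.

```latex
\begin{aligned}
\hat{Y}(20) &= 225233.55 - 2288.6894 \times 20 = 179459.762 \\
\hat{Y}(30) &= 225233.55 - 2288.6894 \times 30 = 156572.868 \\
\hat{Y}(20) - \hat{Y}(30) &= 2288.6894 \times 10 = 22886.894 \approx 22{,}887
\end{aligned}
```

The calculation compares the estimated mean house price of communities with a crime rate of 20 to that of communities with a crime rate of 30. Whether one community would see that change by lowering its own crime rate is a separate question.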
14 Potential Outcomes Model
- Let $Y_i^{30}$ denote what the house price for community i would be if its crime rate was 30, and $Y_i^{20}$ denote what the house price for community i would be if its crime rate was 20.
- X (crime rate) causes a change in Y (house price) for community i if $Y_i^{30} \neq Y_i^{20}$. A decrease in crime rate causes an increase in house price for community i if $Y_i^{20} > Y_i^{30}$.
15 Association is Not Causation
- A regression model tells us how the mean of Y|X is associated with changes in X. A regression model does not tell us what would happen if we actually changed X.
- Possible explanations for an observed association between Y and X:
  - Y causes X.
  - X causes Y.
  - There is a confounding variable Z that is associated with changes in both X and Y.
- Any combination of the three explanations may apply to an observed association.
16 Y Causes X
Perhaps it is changes in house price (Y) that cause changes in crime rate (X). When house prices increase, the residents of a community have more to lose by engaging in criminal activities; this is called the economic theory of crime.
17 Confounding Variables
- Confounding variable for the causal relationship between X and Y: A variable Z that is associated with both X and Y.
- Example of a confounding variable in the Philadelphia crime rate data: Level of education. Level of education may be associated with both house prices and crime rate.
- The effect of crime rate on house price is confounded with the effect of education on house price. If we just look at data on house price and crime rate, we can't distinguish between the effect of crime rate on house price and the effect of education on house price.
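A small simulation can make the confounding problem concrete. Everything in this sketch is hypothetical: the data are simulated, and the coefficients are invented so that education (Z) drives both crime rate (X) and house price (Y) while crime rate has no effect of its own on house price. The function name `simulate_confounding` is also made up.

```python
import numpy as np

def simulate_confounding(n=5000, seed=42):
    """Simulate Z -> X and Z -> Y with NO effect of X on Y; return two slope estimates for X."""
    rng = np.random.default_rng(seed)
    education = rng.normal(0, 1, n)                             # Z
    crime_rate = 30 - 5 * education + rng.normal(0, 3, n)       # X depends on Z
    house_price = 150 + 20 * education + rng.normal(0, 10, n)   # Y depends on Z only

    # Regression of Y on X alone: picks up the association induced by Z.
    X1 = np.column_stack([np.ones(n), crime_rate])
    slope_unadjusted = np.linalg.lstsq(X1, house_price, rcond=None)[0][1]

    # Regression of Y on X and Z: the coefficient on X is now adjusted for Z.
    X2 = np.column_stack([np.ones(n), crime_rate, education])
    slope_adjusted = np.linalg.lstsq(X2, house_price, rcond=None)[0][1]
    return slope_unadjusted, slope_adjusted

slope_unadjusted, slope_adjusted = simulate_confounding()
print(f"slope of house price on crime rate, ignoring education: {slope_unadjusted:.2f}")
print(f"slope of house price on crime rate, adjusting for education: {slope_adjusted:.2f}")
```

In this simulated world the unadjusted slope is clearly negative even though changing crime rate would not change house price, and the slope shrinks toward zero once education is included. With real data the confounder may not be measured, so the adjustment may not be available.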
18 Note on Confounding Variables and Lurking Variables
- The book's distinction between "lurking variable" and "confounding variable" is confusing, and the term "lurking variable" is not standard in statistics, whereas "confounding variable" is. So I will just use the term "confounding variable" in the rest of the course.
19 Examples of Confounding Variables
- Many studies have found that people who are
active in their religion live longer than
nonreligious people. Potential confounding
variables?
20 Weekly Wages (Y) and Education (X) in March 1988 CPS
Will getting an extra year of education cause an increase of $50.41 on average in your weekly wage? What are some potential confounding variables?
21 Math enrollment data: The residual plot vs. time indicates that there is a confounding variable associated with time. It turns out that one of the schools (say the engineering school) in the university changed its program to require that entering students take another mathematics course. The variable of whether the engineering school requires its students to take another mathematics course is a confounding variable.
22 Establishing Causation
- Best method is an experiment, but many times that
is not ethically or practically possible (e.g.,
smoking and cancer, education and earnings).
23
- Main strategy for learning about causation when we can't do an experiment: Consider all confounding variables you can think of. Try to take them into account (we'll see how to do this when we study multiple regression in Chapter 11) and see if the association between Y and X remains once the known confounding variables have been accounted for.
24 Other Criteria for Establishing Causation When We Can't Do an Experiment
- The association is strong.
- The association is consistent.
- Higher doses are associated with stronger responses.
- The alleged cause precedes the effect in time.
- The alleged cause is plausible.