Class 7: Thurs., Sep. 30

1
Class 7: Thurs., Sep. 30
2
Outliers and Influential Observations
  • Outlier: Any really unusual observation.
  • Outlier in the X direction (called a high leverage
    point): Has the potential to influence the
    regression line.
  • Outlier in the direction of the scatterplot: An
    observation that deviates from the overall
    pattern of relationship between Y and X.
    Typically has a residual that is large in
    absolute value.
  • Influential observation: A point that, if
    removed, would markedly change the statistical
    analysis. For simple linear regression, points
    that are outliers in the X direction are often
    influential.

3
Housing Prices and Crime Rates
  • A community in the Philadelphia area is
    interested in how crime rates are associated with
    property values. If low crime rates increase
    property values, the community might be able to
    cover the costs of increased police protection by
    gains in tax revenues from higher property
    values.
  • The town council looked at a recent issue of
    Philadelphia Magazine (April 1996) and found data
    for itself and 109 other communities in
    Pennsylvania near Philadelphia. The data are in
    philacrimerate.JMP. House price = average house
    price for sales during the most recent year; Crime
    Rate = rate of crimes per 1000 population.

4
(No Transcript)
5
Which points are influential?
Center City Philadelphia is influential; Gladwyne
is not. In general, points that have high
leverage are more likely to be influential.
6
Excluding Observations from Analysis in JMP
  • To exclude an observation from the regression
    analysis in JMP, go to the row of the
    observation, click Rows and then click
    Exclude/Unexclude. A red circle with a diagonal
    line through it should appear next to the
    observation.
  • To put the observation back into the analysis, go
    to the row of the observation, click Rows and
    then click Exclude/Unexclude. The red circle
    should no longer appear next to the observation.

7
Formal measures of leverage and influence
  • Leverage: Hat values (JMP calls them "hats").
  • Influence: Cook's Distance (JMP calls them "Cook's
    D Influence").
  • To obtain them in JMP, click Analyze, Fit Model,
    put the Y variable in Y and the X variable in the
    Model Effects box. Click Run Model. After the model
    is fit, click the red triangle next to Response.
    Click Save Columns and then click Hats for
    leverages and click Cook's D Influence for
    Cook's Distances.
  • To sort observations in terms of Cook's Distance
    or leverage, click Tables, Sort and then put the
    variable you want to sort by in the By box.

8
Center City Philadelphia has both high influence
(Cook's Distance much greater than 1) and high
leverage (hat value > 3*2/99 = 0.06). No
other observations have high influence or high
leverage.
9
Rules of Thumb for High Leverage and High
Influence
  • High Leverage: Any observation with a leverage
    (hat value) > (3 × # of coefficients in
    regression model)/n has high leverage, where
  • # of coefficients in regression model = 2 for
    simple linear regression.
  • n = number of observations.
  • High Influence: Any observation with a Cook's
    Distance greater than 1 has high
    influence.
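The two rules of thumb above can be checked outside JMP. The following is a minimal sketch in Python with NumPy, assuming the standard formulas for hat values (the diagonal of the hat matrix) and Cook's Distance. The function name `leverage_and_cooks` and the ten data points are hypothetical and are not taken from philacrimerate.JMP.

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Return (hat values, Cook's Distances) for the simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    X = np.column_stack([np.ones(n), x])          # design matrix: intercept and slope columns
    p = X.shape[1]                                # number of coefficients (2 here)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares coefficients
    resid = y - X @ beta
    # Hat values are the diagonal of H = X (X'X)^{-1} X'.
    hats = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    mse = resid @ resid / (n - p)                 # residual mean square
    cooks = resid**2 / (p * mse) * hats / (1.0 - hats) ** 2
    return hats, cooks

# Hypothetical data: nine points on a line plus one point far out in x and off the line.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40], dtype=float)
y = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 20], dtype=float)

hats, cooks = leverage_and_cooks(x, y)
leverage_cutoff = 3 * 2 / len(x)                  # (3 x # of coefficients) / n
high_leverage = hats > leverage_cutoff
high_influence = cooks > 1.0
print("high leverage rows:", np.flatnonzero(high_leverage))
print("high influence rows:", np.flatnonzero(high_influence))
```

In this made-up example only the last point exceeds both cutoffs, which mirrors the role Center City Philadelphia plays in the crime rate data.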

10
What to Do About Suspected Influential
Observations?
  • See flowchart handout.
  • Does removing the observation change the
    substantive conclusions?
  • If not, can say something like: "Observation x has
    high influence relative to all other observations,
    but we tried refitting the regression without
    Observation x and our main conclusions didn't
    change."
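Refitting without the suspected observation is the same check that Exclude/Unexclude performs in JMP. The following is a rough Python sketch of it, using hypothetical data and a hypothetical helper `fit_line`. It compares the fitted line with and without one observation.

```python
import numpy as np

def fit_line(x, y):
    """Least squares line; returns (intercept, slope)."""
    slope, intercept = np.polyfit(x, y, deg=1)    # polyfit returns the highest degree first
    return intercept, slope

# Hypothetical data: the last observation is far out in x and off the pattern of the rest.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40], dtype=float)
y = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 20], dtype=float)

suspect = 9                                       # row index of the suspected observation
keep = np.arange(len(x)) != suspect

intercept_all, slope_all = fit_line(x, y)
intercept_wo, slope_wo = fit_line(x[keep], y[keep])

print(f"all observations:  intercept={intercept_all:.2f}, slope={slope_all:.3f}")
print(f"without suspect:   intercept={intercept_wo:.2f}, slope={slope_wo:.3f}")
```

Here the slope changes sign when the point is dropped, so the substantive conclusion does change and the remaining questions in the flowchart apply.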

11
  • If removing the observation does change
    substantive conclusions, is there any reason to
    believe the observation belongs to a population
    other than the one under investigation?
  • If yes, omit the observation and proceed.
  • If no, does the observation have high leverage
    (outlier in explanatory variable)?
  • If yes, omit the observation and proceed. Report
    that conclusions only apply to a limited range of
    the explanatory variable.
  • If no, not much can be said. More data (or
    clarification of the influential observation) are
    needed to resolve the questions.
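The questions above form a small decision procedure. The following Python sketch is one reading of it, not the flowchart handout itself. The function name, argument names, and returned wording are hypothetical.

```python
def influential_observation_action(changes_conclusions, other_population, high_leverage):
    """Suggest an action for a suspected influential observation.

    Each argument is a yes/no answer (bool) to one question, asked in order:
    does removal change the substantive conclusions, does the observation belong
    to a different population, and does it have high leverage.
    """
    if not changes_conclusions:
        return "report that refitting without it did not change the main conclusions"
    if other_population:
        return "omit the observation and proceed"
    if high_leverage:
        return ("omit the observation and report that conclusions apply only to "
                "a limited range of the explanatory variable")
    return "not much can be said; more data or clarification is needed"

# Example: removal changes the conclusions, same population, high leverage.
print(influential_observation_action(True, False, True))
```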

12
General Principles for Dealing with Influential
Observations
  • General principle: Delete observations from the
    analysis sparingly, only when there is good
    cause (the observation does not belong to the
    population being investigated or is a point with
    high leverage). If you do delete observations from
    the analysis, you should state clearly which
    observations were deleted and why.

13
The Question of Causation
  • The community that ran this regression would like
    to increase property values. If low crime rates
    increase property values, the community might be
    able to cover the costs of increased police
    protection by gains in tax revenue from higher
    property values.
  • The regression without Center City Philadelphia
    is
  • Linear Fit
  • HousePrice = 225233.55 - 2288.6894 CrimeRate
  • The community concludes that if it can cut its
    crime rate from 30 down to 20 incidents per 1000
    population, it will increase its average house
    price by 2288.6894 × 10 = $22,887.
  • Is the community's conclusion justified?
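The community's arithmetic can be reproduced from the fitted line. The short Python sketch below uses the coefficients quoted above; the function name `predicted_mean_price` is hypothetical. The result is the difference in estimated mean house price between communities with crime rates of 20 and 30. Whether it is also what would happen if one community changed its own crime rate is exactly the question posed.

```python
INTERCEPT = 225233.55      # fitted intercept (regression without Center City Philadelphia)
SLOPE = -2288.6894         # fitted slope: change in mean house price per unit of crime rate

def predicted_mean_price(crime_rate):
    """Estimated mean house price at a given crime rate (crimes per 1000 population)."""
    return INTERCEPT + SLOPE * crime_rate

difference = predicted_mean_price(20) - predicted_mean_price(30)
print(f"Estimated mean price at crime rate 30: {predicted_mean_price(30):,.0f}")
print(f"Estimated mean price at crime rate 20: {predicted_mean_price(20):,.0f}")
print(f"Difference: {difference:,.0f}")
```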

14
Potential Outcomes Model
  • Let $Y_i^{30}$ denote what the house price for
    community i would be if its crime rate were 30 and
    $Y_i^{20}$ denote what the house price for community i
    would be if its crime rate were 20.
  • X (crime rate) causes a change in Y (house price)
    for community i if $Y_i^{30} \neq Y_i^{20}$. A decrease in
    crime rate causes an increase in house price for
    community i if $Y_i^{20} > Y_i^{30}$.

15
Association is Not Causation
  • A regression model tells us about how the mean of
    Y|X is associated with changes in X. A
    regression model does not tell us what would
    happen if we actually changed X.
  • Possible Explanations for an Observed Association
    Between Y and X:
  • Y causes X
  • X causes Y
  • There is a confounding variable Z that is
    associated with changes in both X and Y.
  • Any combination of the three explanations
    may apply to an observed association.

16
Y Causes X
Perhaps it is changes in house price that cause
changes in crime rate. When house prices
increase, the residents of a community have more
to lose by engaging in criminal activities; this is
called the economic theory of crime.
17
Confounding Variables
  • Confounding variable for the causal relationship
    between X and Y: A variable Z that is associated
    with both X and Y.
  • Example of a confounding variable in the Philadelphia
    crime rate data: Level of education. Level of
    education may be associated with both house
    prices and crime rate.
  • The effect of crime rate on house price is
    confounded with the effect of education on house
    price. If we just look at data on house price
    and crime rate, we can't distinguish between the
    effect of crime rate on house price and the
    effect of education on house price.
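A small simulation can make the confounding point concrete. In the Python sketch below, all numbers are made up and the variables are in arbitrary standardized units. Education drives both crime rate and house price, while crime rate has no effect on house price at all. The simple regression of price on crime still shows a clear negative slope. The slope disappears once education is included in the regression, which is the multiple regression adjustment covered later in the course.

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed so the run is reproducible
n = 5000

# Hypothetical data-generating process, not estimated from the Philadelphia data.
education = rng.normal(size=n)                       # confounder Z
crime = -1.0 * education + rng.normal(size=n)        # X depends on Z
price = 2.0 * education + rng.normal(size=n)         # Y depends on Z only, not on X

# Simple regression of price on crime: the slope reflects the confounder.
slope_simple = np.polyfit(crime, price, deg=1)[0]

# Regression of price on crime and education: the crime slope is adjusted for Z.
X = np.column_stack([np.ones(n), crime, education])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
slope_adjusted = coef[1]

print(f"slope of price on crime, ignoring education:  {slope_simple:.3f}")
print(f"slope of price on crime, adjusting for education: {slope_adjusted:.3f}")
```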

18
Note on Confounding Variables and Lurking
Variables
  • The book's distinction between "lurking variable"
    and "confounding variable" is confusing, and the
    term "lurking variable" is not standard in
    statistics, whereas "confounding variable" is.
    So I will just use the term "confounding variable"
    in the rest of the course.

19
Examples of Confounding Variables
  • Many studies have found that people who are
    active in their religion live longer than
    nonreligious people. Potential confounding
    variables?

20
Weekly Wages (Y) and Education (X) in March 1988
CPS
Will getting an extra year of education cause an
increase of $50.41 on average in your weekly
wage? What are some potential confounding
variables?
21
Math enrollment data: The residual plot vs. time
indicates that there is a confounding variable
associated with time. It turns out that one of
the schools (say the engineering school) in the
university changed its program to require that
entering students take another mathematics
course. The variable of whether the engineering
school requires its students to take another
mathematics course is a confounding variable.
22
Establishing Causation
  • Best method is an experiment, but many times that
    is not ethically or practically possible (e.g.,
    smoking and cancer, education and earnings).

23
  • Main strategy for learning about causation when
    we can't do an experiment: Consider all
    confounding variables you can think of. Try to
    take them into account (we'll see how to do this
    when we study multiple regression in Chapter 11)
    and see if the association between Y and X remains
    once the known confounding variables have been
    accounted for.

24
Other Criteria for Establishing Causation When We
Can't Do an Experiment
  • The association is strong.
  • The association is consistent.
  • Higher doses are associated with stronger
    responses.
  • The alleged cause precedes the effect in time.
  • The alleged cause is plausible.