1
Detecting outliers
  • We've seen how to use hat matrix diagonals as a
    way of detecting potentially high leverage points
  • But what about points that may not have much
    leverage, but are large outliers?
  • Such points interfere with model fit measures and
    our tests of significance
  • We've looked at the residuals for a number of
    uses, but we haven't made much (direct) use of
    the properties of the residuals
  • Recall that for all of the models we've
    discussed, we've assumed that the residuals are
    normally distributed, with mean zero and standard
    deviation sigma

2
Standardized residuals
  • When we learned about the normal distribution we
    saw how to work out normal tail probabilities
    (sketched in code below)
  • Recall that if X ~ N(mu, sigma^2) then
    Z = (X - mu) / sigma ~ N(0, 1)
  • So if the residuals r_i have an approximate
    normal distribution with mean 0 and variance
    sigma^2, then r_i / sigma ~ N(0, 1)
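  • A minimal Python sketch of working out such a tail probability (scipy
    assumed; a standard normal table gives the same numbers):

    from scipy.stats import norm

    z = 2.4                                  # an example standardized residual
    p = norm.cdf(-abs(z)) + norm.sf(abs(z))  # Pr(Z < -|z|) + Pr(Z > |z|)
    print(p)                                 # a small value suggests an unusually large residual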

3
Standardized residuals
  • The trouble is that we don't know the standard
    deviation of the residuals
  • But we have an estimate of it: a standard error
  • It turns out that the standard error of the ith
    residual is the estimate of the overall standard
    deviation, s (the square root of the Residual
    Mean Square), times a function of the associated
    hat matrix diagonal, i.e. se(r_i) = s * sqrt(1 - h_ii)

So the standardized residuals are given by
    d_i = r_i / (s * sqrt(1 - h_ii))
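  • A minimal numpy sketch of this calculation (X is assumed to be the
    predictor matrix without an intercept column, y the response vector):

    import numpy as np

    def standardized_residuals(X, y):
        X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept column
        H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T    # hat matrix
        r = y - H @ y                               # ordinary residuals
        n, p = X1.shape
        s = np.sqrt(r @ r / (n - p))                # sqrt of the Residual Mean Square
        h = np.diag(H)                              # hat matrix diagonals h_ii
        return r / (s * np.sqrt(1 - h))             # r_i / (s * sqrt(1 - h_ii))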
4
Studentized residuals
  • Complementary to the idea of the standardized
    residuals are the Studentized residuals
    t_i = r_i / (s_(i) * sqrt(1 - h_ii)),
    where s_(i) is the residual standard deviation if
    the least squares procedure is run without the
    ith observation (see the sketch below)
  • Numerically, they are similar to the standardized
    residuals
  • However they may have better theoretical
    properties
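
  • A sketch of getting both quantities from statsmodels (X and y assumed as
    in the previous sketch):

    import statsmodels.api as sm

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    infl = fit.get_influence()
    standardized = infl.resid_studentized_internal  # uses the overall estimate s
    studentized = infl.resid_studentized_external   # uses s_(i), observation i left out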

5
Using Standardised / Studentized residuals
  • Theory aside, how do we use these quantities?
  • If the standardized residuals have a standard
    normal distribution, then it is very easy to spot
    large residuals
  • We know that for a standard normal
  • Pr(Z < -1.96 or Z > 1.96) = 0.05
  • Pr(Z < -2.57 or Z > 2.57) = 0.01
  • Large residuals will have large Z-scores and
    hence will have low probabilities of being
    observed if these points truly came from a
    standard normal distribution
  • Therefore, if we have large (in magnitude)
    standardized residuals, we might think that these
    points do not belong to the model (flagged in the
    sketch below)
  • The Studentized residuals actually have a
    t-distribution, so if the regression is on a
    small number of points (< 20) the Studentized
    residuals might be preferable.
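
  • A sketch of flagging suspect points with these cut-offs, reusing the
    standardized_residuals function from the earlier sketch:

    import numpy as np

    z = standardized_residuals(X, y)
    print(np.where(np.abs(z) > 1.96)[0])   # unusual at roughly the 5% level
    print(np.where(np.abs(z) > 2.57)[0])   # unusual at roughly the 1% level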

6
Multicollinearity
  • One problem that we have in multiple regression
    that we don't have in simple linear regression is
    multicollinearity
  • The concept of multicollinearity is simple,
    detecting it and deciding what to do about it is
    hard
  • Before we make a definition of multicollinearity
    we need a couple of extra terms
  • In regression, the response is often called the
    dependent variable, its value being dependent on
    the value of the predictors
  • Correspondingly the predictors are called the
    independent variables

7
Multicollinearity: a definition
  • The previous terms give us a handle on
    multicollinearity
  • Multicollinearity occurs when there is a linear
    relationship or dependency amongst the
    independent variables
  • This sounds a little bizarre: a model with
    predictors X and X^2 is okay (and in fact is
    quite a common one)
  • But a model with predictors X and 3X + 2 is not,
    and would suffer from the effects of
    multicollinearity
  • The difference is that the relationship between X
    and 3X + 2 is linear, whereas the relationship
    between X and X^2 is non-linear

8
  • In an example as extreme as the previous one, we
    would know straight away there was a problem
  • This is because in calculating the least squares
    solution, the matrix X^T X would be singular. This
    means it has no inverse (illustrated in the sketch
    below)
  • If you don't know any linear algebra, you can
    think of this as trying to calculate 1/x when
    x = 0
  • Minitab would report
  • "z is highly correlated with other X variables"
  • "z has been removed from the equation"
  • Most other packages will report that X^T X is
    singular or ill-conditioned
  • In more realistic problems, there may be linear
    dependence between the variables, but not perfect
    dependence
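
  • A small numpy sketch of the extreme case, where one predictor is an
    exact linear function of another (made-up data, for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=30)
    z = 3 * x + 2                              # z is an exact linear function of x
    X = np.column_stack([np.ones(30), x, z])   # design matrix with intercept

    print(np.linalg.matrix_rank(X.T @ X))      # 2, not 3: the columns are dependent
    print(np.linalg.cond(X.T @ X))             # enormous condition number: (near-)singular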

9
Example
  • The following dataset comes from a cheese tasting
    experiment.
  • As cheddar cheese matures, a variety of chemical
    processes take place. The taste of matured cheese
    is related to the concentration of several
    chemicals in the final product.
  • In a study of cheddar cheese from the LaTrobe
    Valley of Victoria, Australia, samples of cheese
    were analysed for their chemical composition and
    were subjected to taste tests.
  • Overall taste scores were obtained by combining
    the scores from several tasters.

10
  • The response in this model is taste
  • The predictors are the log concentrations of
    lactic acid (lactic), acetic acid (acetic) and
    hydrogen sulphide (H2S)
  • Lactic acid is present in all milk products
  • Acetic acid comes from vinegar and is a
    preservative
  • Hydrogen sulphide, the gas that gives Rotorua its
    fresh smell, is used as a preservative
  • Because we don't know too much about the
    relationship between the predictors and the
    response (we haven't been given any other
    information) we fit a straightforward multiple
    linear regression model

11
Predictor       Coef   StDev      T      P
Constant      -28.88   19.74  -1.46  0.155
Lactic        19.671   8.629   2.28  0.031
Acetic         0.328   4.460   0.07  0.942
H2S            3.912   1.248   3.13  0.004

S = 10.13   R-Sq = 65.2%   R-Sq(adj) = 61.2%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       3  4994.5  1664.8  16.22  0.000
Residual Error  26  2668.4   102.6
Total           29  7662.9
  • The P-value from the ANOVA table is small, so the
    regression is significant
  • The regression table indicates that perhaps the
    lactic acid and hydrogen sulphide concentrations
    are important in predicting taste whereas the
    acetic acid concentration is not
  • The adjusted R2 is reasonable: we're explaining
    about 60% of the variation
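
  • A sketch of the same fit with statsmodels; the file name cheese.csv and
    the column names taste, Acetic, H2S and Lactic are assumptions, not part
    of the original slides:

    import pandas as pd
    import statsmodels.formula.api as smf

    cheese = pd.read_csv("cheese.csv")          # assumed data file
    fit = smf.ols("taste ~ Lactic + Acetic + H2S", data=cheese).fit()
    print(fit.summary())                        # coefficient table, R-Sq, ANOVA F-test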

12
  • Once again we see a typical pattern
  • The norplot of the residuals looks fine
  • There appears to be a strong linear relationship
  • No one point deviates from the line in any
    significant way
  • However, the pred-res plot tells a different
    story
  • Very strong funnel shape plus curvature
  • Something is wrong with our model

13
Detecting multicollinearity
  • When we saw the pattern in the pred-res plot
    previously, it was because we had fit a linear
    model to a non-linear curve
  • How would we know that multicollinearity may be
    the cause this time and not just some non-linear
    effect?
  • We don't, but because we're fitting a multiple
    linear regression model, we should check for
    multicollinearity as a matter of course
  • We defined multicollinearity to be the existence
    of a linear dependency between the predictors
  • How do we observe linear relationships?
    Scatterplots
  • How do we quantify linear relationships?
    Correlation coefficients

14
Correlations (Pearson)

          Lactic  Acetic
Acetic     0.604
H2S        0.645   0.618
  • A scatterplot matrix gives a scatterplot for each
    pair of predictors
  • We can see from the scatterplot matrix that there
    is definitely a relationship between each pair of
    predictors
  • This is backed up by the correlation matrix, which
    gives the linear correlation coefficient for each
    pair of predictors
  • All 3 correlation coefficients are over 0.6,
    indicating a significant problem with
    multicollinearity (both checks are sketched below)
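
  • The same two checks sketched with pandas, reusing the cheese data frame
    and the column names assumed earlier:

    import pandas as pd
    import matplotlib.pyplot as plt

    predictors = cheese[["Lactic", "Acetic", "H2S"]]
    pd.plotting.scatter_matrix(predictors)   # one scatterplot per pair of predictors
    plt.show()
    print(predictors.corr())                 # Pearson correlation matrix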

15
Possible solutions
  • If detection of multicollinearity is hard,
    deciding what to do about it is even harder
  • One solution, if two variables have a significant
    linear relationship between them, is to drop one
    of the variables
  • In our extreme case (where one variable was a
    linear function of the other) this is effectively
    what Minitab did when it removed z from the
    equation
  • How does it work out in this example?
  • We need to think about dropping variables in a
    sensible or systematic way
  • One way of doing this is with a partial
    regression plot

16
Partial regression plots
  • Think for a moment about the general case where
    we have k possible explanatory variables
  • We wish to decide whether to include the kth
    variable in the model
  • Including the variable might fail to improve the
    model in two different ways
  • The variable may be completely unrelated to the
    response so including it will not improve model
    fit, or
  • Given the other variables in the model, the
    additional variability explained by this variable
    may be negligible, i.e. this variable might be
    linearly dependent on the other predictors
    (multicollinearity)
  • We can use this to our advantage

17
  • The part of the response not accounted for by the
    other variables x1, ..., x(k-1) is simply the
    residual we get when we regress the response on
    these variables
  • We want to see if there is any relationship
    between these residuals and the part of xk not
    related to x1, ..., x(k-1)
  • This unrelated part of xk is best measured by the
    residuals we get when we regress xk on
    x1, ..., x(k-1)
  • We examine the relationship between these two
    sets of residuals by plotting the first set of
    residuals versus the second; this is called a
    partial regression plot (sketched in code after
    this list)
  • If there appears to be no relationship in this
    plot then adding the variable will not improve
    the fit
  • If the relationship seems linear, then adding the
    variable will very likely improve fit
  • Finally, if there is a curve in the plot then a
    transformation of the predictor may well improve
    the fit further
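
  • A sketch of building one of these plots by hand for H2S (statsmodels and
    matplotlib assumed, cheese data frame and column names as before):

    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    # part of the response not accounted for by the other predictors
    res_y = smf.ols("taste ~ Lactic + Acetic", data=cheese).fit().resid
    # part of H2S not related to the other predictors
    res_x = smf.ols("H2S ~ Lactic + Acetic", data=cheese).fit().resid

    plt.scatter(res_x, res_y)            # a linear pattern suggests H2S adds to the fit
    plt.xlabel("H2S | Lactic, Acetic")
    plt.ylabel("taste | Lactic, Acetic")
    plt.show()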

18
  • These plots can be quite hard to interpret
  • However, there appears to be at least a weak
    relationship for H2S and Lactic, but not for
    Acetic, so we will fit the model with Lactic and
    H2S as the only predictors

19
Predictor       Coef   StDev      T      P
Constant     -27.592   8.982  -3.07  0.005
Lactic        19.887   7.959   2.50  0.019
H2S            3.946   1.136   3.47  0.002

S = 9.942   R-Sq = 65.2%   R-Sq(adj) = 62.6%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  4993.9  2497.0  25.26  0.000
Residual Error  27  2669.0    98.9
Total           29  7662.9
  • The regression seems okay with significant
    coefficients and a reasonable R2 value
  • But we still have non-constant variance
  • We might try taking logs of the response
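
  • A sketch of that refit with a logged response (same assumed data frame;
    patsy formulas accept numpy functions such as np.log directly):

    import numpy as np
    import statsmodels.formula.api as smf

    log_fit = smf.ols("np.log(taste) ~ Lactic + H2S", data=cheese).fit()
    print(log_fit.summary())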

20
  • Taking the logarithm of the response has cured
    our residual problems, but there are two
    residuals which are disturbing the fit
  • If we remove these points the R2 goes from 42% to
    53.6%, an acceptable level