Title: 2. Exploratory Data Analysis
12. Exploratory Data Analysis
- OR An ABC of EDA
- Peter Watson
- http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/graphFAQ
2No a priori ideas (model) in EDA!
- For classical analysis, the sequence is
- Problem => Data => Model => Analysis => Conclusions
- For EDA, the sequence is
- Problem => Data => Analysis => Model => Conclusions
3EDA - Exploratory data analysis
- Informal graphical techniques (Tukey, 1977) which
- look at underlying structure
- identify outliers
- check assumptions in later formal analyses (Normality, equality of variance)
- Most of the EDA techniques are graphical and quite simple
- "A picture is worth more than ten thousand words" (Chinese proverb)
4Graphical displays
- Histograms
- Boxplots
- Quantile Plots
- Error Bar plots (groups)
- Stem and Leaf Displays
- Scatterplots (especially for checking linearity of correlations, residual plots from regressions)
- (see also regression talk)
- Under SPSS EXPLORE
5Symmetry
- Clustered around median
- median = mean = mode
- no skewness
- CIs of mean assume symmetry
6Skew and kurtosis
- Skew
- < 0: lower (downward) straggle
- 0: symmetric
- > 0: upper straggle
- Rules of Thumb (Hair et al., 1998; Simon, 2002)
- Negative skew < -1
- Positive skew > 1
- Kurtosis
- < 0: flat (Platykurtic)
- 0: normal peak
- > 0: peaked about mean (Leptokurtic)
- Rules of Thumb (Simon, 2002)
- Positive Kurtosis > 3
- Negative Kurtosis < -3
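The rules of thumb above can be tried out in a few lines of Python. This is a minimal sketch, assuming the simple moment-based formulas for skewness and excess kurtosis (SPSS reports bias-adjusted versions, so its values differ slightly). The function names and the scores are hypothetical, not data from this talk.

```python
import statistics

def skewness(xs):
    """Moment-based skewness: third central moment divided by sd cubed."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)

def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * sd ** 4) - 3

def rule_of_thumb(xs):
    """Apply the cut-offs quoted above: |skew| > 1 and |kurtosis| > 3."""
    s, k = skewness(xs), excess_kurtosis(xs)
    return {"skew": s, "kurtosis": k,
            "skew_ok": abs(s) <= 1, "kurtosis_ok": abs(k) <= 3}

# Hypothetical scores with one very large value (long upper tail).
scores = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 13, 46]
print(rule_of_thumb(scores))
```

A single extreme score is enough to push the skew past the cut-off of 1 in this made-up sample.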
7Peakedness
- Kurtosis measures peakedness
- Don't want too peaked or too uniform distributions
- Too peaked - no variation
- Too flat - no one typical value
8Types of Kurtosis (Miles & Shevlin, 2001)
9Bimodality
- This is a mixture of two distributions (clear dip around the middle; multipeaked)
- Histograms are usually good at spotting this
- Suggests modelling the first half and second half separately
10Beck Score
- Positive skew (> 1.0)
- Most scores around zero
- Scores above 13 - clinically depressed
- One score of 46!
11Boxplots
12Boxplots
- median line in red box
- Middle half in red box (1.3 sds)
- Outliers circles and stars
- Shape of data
13Outliers in boxplots
- Inner fence - moderately weird. Over 1.5 boxlengths from upper/lower quartiles. Circles in SPSS
- (about 2.7 sds from mean in normal data)
- Outer fence - decidedly weird. Over 3 boxlengths from upper/lower quartiles. Asterisks in SPSS
- (about 4.7 sds from mean in normal data)
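The fence rules above can be sketched in Python. This is an illustration only: it uses ordinary quartiles from the standard library rather than the hinges SPSS uses, and the function names and scores are hypothetical. The last part checks where the fences fall for a normal distribution, which is where the figures of roughly 2.7 and 4.7 sds come from.

```python
import statistics

def fences(xs):
    """Return (inner_low, inner_high, outer_low, outer_high) boxplot fences."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    box = q3 - q1                      # boxlength (interquartile range)
    return (q1 - 1.5 * box, q3 + 1.5 * box, q1 - 3 * box, q3 + 3 * box)

def classify(xs):
    """Label each value outside the fences as 'moderate' or 'extreme'."""
    in_lo, in_hi, out_lo, out_hi = fences(xs)
    labels = {}
    for x in xs:
        if x < out_lo or x > out_hi:
            labels[x] = "extreme"      # beyond the outer fence
        elif x < in_lo or x > in_hi:
            labels[x] = "moderate"     # between inner and outer fence
    return labels

# Hypothetical scores, not data from this talk.
scores = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 13, 46]
print(classify(scores))

# Fence positions for a standard normal distribution, in sd units.
q3 = statistics.NormalDist().inv_cdf(0.75)   # upper quartile, about 0.674 sds
inner_sd = q3 + 1.5 * (2 * q3)               # boxlength is 2 * q3 by symmetry
outer_sd = q3 + 3.0 * (2 * q3)
print(round(inner_sd, 2), round(outer_sd, 2))
```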
14Hinges
- Boxplots actually use Hinges to define locations of boxes and outliers
- Upper Hinge similar to Upper quartile
- Lower Hinge similar to Lower quartile
- Inter-hinge spread similar to interquartile range
15Boxplot of Beck score
- Positive skew
- Concentration of outliers above median score
16Effect of an outlier
- biases mean (green line)
- inflates variance of mean
- median more robust
17Robustness to outliers
- Number of positive responses (max = 6)
- 0,0,0,4,4,5,5,5,6,6,6,6,6,6,6
- 95% Bootstrap Confidence Intervals
- Median (4.89, 5.11); Observed Median = 5.00
- Mean (4.11, 4.41); Observed Mean = 4.33
18Consistency of median
19Obtaining 95% CIs for skewed data: Bootstrapping (Efron & Tibshirani, 1993)
See also http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
20Example revisited
- 1000 random samples of size equal to original sample (N = 15)
- Results
- Point estimates: Mean = 4.37, Median = 5
- 95% CIs: Mean (3.27, 5.27), Median (4, 6)
- Outliers exerting undue influence on the mean
- See http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/boot
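The resampling scheme described above (1000 samples, each the size of the original) can be sketched as a percentile bootstrap. This is a minimal sketch, not the procedure used to produce the figures on the slide: the function name, the fixed seed and the percentile method are assumptions, and the limits it prints depend on the random resamples, so they will not reproduce the quoted intervals exactly.

```python
import random
import statistics

def bootstrap_ci(data, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `stat`, resampling len(data) values."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    n = len(data)
    estimates = sorted(
        stat([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)
    )
    lo = estimates[int((alpha / 2) * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# The 15 responses listed earlier (number of positive responses, max 6).
responses = [0, 0, 0, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6]

print("mean", statistics.fmean(responses),
      bootstrap_ci(responses, statistics.fmean))
print("median", statistics.median(responses),
      bootstrap_ci(responses, statistics.median))
```

Because the three zeros pull the mean down but leave the median untouched, the bootstrap interval for the mean sits lower than the interval for the median.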
21The sampling distribution of the variance follows a chi-square which tends to N(n-1, 2(n-1)) as n increases
- s.e.(mean) should be 5/sqrt(25) = 1
- N = 25: Normal(24, 48); observed mean is 23.75, observed variance = 6.82 x 6.82 = 46.5
22Other approaches to identifying outliers (besides boxplots)
- Cases with z-scores exceeding 2.5 in absolute value
- (z-score subtracts mean and divides by s.d.)
- Grubbs' test (see CBU website for details)
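The z-score rule above is easy to apply directly. A minimal sketch, assuming the sample standard deviation (n-1 divisor) and a cut-off of 2.5 on the absolute z-score; the function name and scores are hypothetical.

```python
import statistics

def z_outliers(xs, cutoff=2.5):
    """Return the values whose absolute z-score exceeds `cutoff`."""
    m = statistics.fmean(xs)
    sd = statistics.stdev(xs)          # sample s.d. (n - 1 divisor)
    return [x for x in xs if abs((x - m) / sd) > cutoff]

# Hypothetical scores, not data from this talk.
scores = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 13, 46]
print(z_outliers(scores))
```

Note that the outlier itself inflates both the mean and the s.d. used to judge it, so this rule can miss outliers in small samples.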
23Quantile Plots
- Raw Beck score
- deviates from straight line
- Substantial skew
- Limits choice of statistical tests we can use to analyse Beck score
- Bump above line: positive skew
24Reverse scored Beck
- Beck is now negatively skewed
- the bump is now under the line
25S shapes
- Symptoms of Kurtosis
- uniformity
- peakedness
26Testing normality more formally
- Kolmogorov-Smirnov test
- Shapiro-Wilk
- Overly sensitive for large samples
Non-Normal
27Symmetric plots
- Rank Beck distances above and below the Beck score median
- Plots i-th lowest distance above the median against i-th lowest distance below the median
- Not many points plotted: so many points lie below the median that it doesn't show asymmetry very well (multiple points with same co-ordinates)
- If symmetric, points fall on line x = y
- distances above median > distances below median
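The points of a symmetry plot can be computed as follows. This is a sketch under the description above: values equal to the median are dropped, and when the two sides have different sizes only the pairs that exist on both sides are kept. The function name and data are hypothetical.

```python
import statistics

def symmetry_pairs(xs):
    """Pair the i-th smallest distance below the median with the
    i-th smallest distance above it."""
    med = statistics.median(xs)
    below = sorted(med - x for x in xs if x < med)
    above = sorted(x - med for x in xs if x > med)
    return list(zip(below, above))     # zip stops at the shorter side

print(symmetry_pairs([1, 2, 3, 4, 5]))     # symmetric: points lie on x = y
print(symmetry_pairs([1, 2, 3, 4, 10]))    # upper tail: points above x = y
```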
28Stem and Leaf of Beck Score
- Stem Leaf
- 6 . 0 = 6.0
- Each leaf = 4 cases
29Temperature
- What is unusual about this distribution?
- Clue: spacing.
- Each leaf = two temperatures
- STEM LEAF
- -6 . 6 = -6.6 Degrees C
- Frequency Stem Leaf
- 2.00 -6 . 6
- 4.00 -5 . 00
- 10.00 -4 . 44444
- 6.00 -3 . 338
- 14.00 -2 . 2227777
- 14.00 -1 . 1116666
- 8.00 -0 . 0055
- 12.00 0 . 005555
- 6.00 1 . 616
- 14.00 2 . 7777777
- 16.00 3 . 33333888
- 14.00 4 . 4444444
- 13.00 5 . 555555
- 11.00 7 . 22227
- 7.00 8 . 888
- 6.00 9 . 444
30Scales (c/o RSS News)
- Grain diameters recorded to nearest division, 1 inch apart
- Subsequently told to report in cm; 1 inch = 2.5cm approx.
- Raw data (1,0,2,1,1,0,4,1,3,0,1,1) in inches
- (2.50, 0, 5.00, 2.50, 2.50, 0, 10.00, 2.50, 7.50, 0, 2.50, 2.50) in cm
- The village post office is 1.21 km (2 miles) across the valley on the left
- (1 lb) 454 grammes of cheese, (1 pint) 560ml of beer
- The human mind likes whole numbers
31Percentage success
- What is wrong with this graph?
- No axis labels or title
- Y axis strangely scaled. Can't have percentages < 0 or greater than 100
- green markers smaller
- Green and red not distinguishable by colour blind person; yellow partially hidden by background
- Other caveats
- Joining points can be misleading
- make sure tick marks on scales are not too near one another to give false effect.
32Scale invariance or not.
- When is 40 approximately equal to 25?
- When is 73 equal to 111? (asked on University Challenge in 2005)
- ANSWERS
- In km/h and mph (40 km/h is about 25 mph). In base 8 (111 in base 8 is 73 in base 10). Computers think in binary (base 2)!
- But
- February's temperature was 55F (13C) which is three times the average
- So the average equals 55/3 = 18.3F (13/3 = 4.3C)?
- Why is this patently untrue?
- ANSWER
- 18.3F is below freezing (< 32F) but 4.3C is above freezing (> 0C).
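The temperature puzzle above can be checked with the standard conversion formulas. A short sketch (the function names are made up for illustration): dividing by three on the Fahrenheit scale and on the Celsius scale gives two different temperatures, because neither scale has a true zero.

```python
def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

def c_to_f(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

third_f = 55 / 3                       # "a third of 55F", about 18.3F
third_c = 13 / 3                       # "a third of 13C", about 4.3C

print(round(f_to_c(third_f), 1))       # a third of 55F expressed in Celsius
print(round(c_to_f(third_c), 1))       # a third of 13C expressed in Fahrenheit
```

The first value is below freezing and the second is above it, so "three times the average" has no scale-free meaning here.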
33One more thing...
- When is Halloween equal to Christmas?
- ANSWER: Oct. 31 = Dec. 25, i.e. 8x3 + 1 = 2x10 + 5
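Both base-8 puzzles (73 and 111, Oct. 31 and Dec. 25) can be confirmed with Python's built-in base conversion. The variable names are just for illustration.

```python
# Read a string of digits as a base-8 (octal) number.
octal_111 = int("111", 8)    # 1*64 + 1*8 + 1
octal_31 = int("31", 8)      # 3*8 + 1

print(octal_111)             # compare with 73
print(octal_31)              # compare with 25 ("Oct. 31" vs "Dec. 25")
print(oct(25))               # decimal 25 written back in base 8
```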
34Error Bar Charts
- Interactive Bar charts
- Bar length represents 95% Confidence interval for the mean
- females have higher depression scores than males
35Bubble Plots (in R): years in education related to income/prestige
36Multiple scatter plots (R)
37Ladder of Powers (Marsh,1988)
- Powers (double star function in SPSS COMPUTE) e.g. 3**2 = 9
- 2 square
- 1 untransformed
- 0.5 square root
- 0 (natural) log
- -0.5 inverse square root
- -1 reciprocal
- -2 inverse square
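The ladder above translates directly into code, since Python also uses the double star for powers. A minimal sketch, assuming strictly positive data (logs and reciprocals are undefined at zero) and treating power 0 as the natural log, as on the ladder; the names are hypothetical.

```python
import math

# The rungs of the ladder, from strongest "stretch" to strongest "squeeze".
LADDER = [2, 1, 0.5, 0, -0.5, -1, -2]

def power_transform(x, p):
    """Raise x to the power p, with p = 0 taken to mean natural log."""
    if x <= 0:
        raise ValueError("this sketch assumes strictly positive values")
    return math.log(x) if p == 0 else x ** p

for p in LADDER:
    print(p, power_transform(4.0, p))
```

Moving down the ladder pulls in a long upper tail progressively harder, which is why trial and error along the rungs is a common way to choose a power.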
38Choosing a power
- Trial and Error
- Box-Cox transformation
- SPSS Box-Cox macro available at
- http://stat.tamu.edu/ftp/pub/mspeed/stat653/spss/
39Box-Cox applied to Beck score
- Looks for a power that minimises Beck score variance
- Suggests a power of 0.3 (near to log transform (power = 0))
- Regression: improve fit of a covariate to predict a test score. Box-Cox can flag up a non-linear relationship
- Can be used to help determine z-scores and means but can be misleading for very skewed data e.g. when floor and ceiling effects are present
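A Box-Cox search can be sketched as a grid search over candidate powers. This is an illustration, not the SPSS macro: it assumes the usual profile log-likelihood criterion (maximising it is equivalent to minimising the variance of the suitably rescaled transformed scores), a grid from -2 to 2 in steps of 0.1, and simulated log-normal data standing in for a skewed score. All names are hypothetical.

```python
import math
import random
import statistics

def boxcox(x, lam):
    """Box-Cox transform of one positive value."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def boxcox_loglik(xs, lam):
    """Profile log-likelihood of power `lam`, up to an additive constant."""
    n = len(xs)
    var = statistics.pvariance([boxcox(x, lam) for x in xs])
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(x) for x in xs)

def best_power(xs, grid=None):
    """Grid search for the power with the highest profile log-likelihood."""
    grid = grid or [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: boxcox_loglik(xs, lam))

# Simulated positively skewed data (log-normal), so a power near 0 is expected.
rng = random.Random(1)
skewed = [rng.lognormvariate(0, 1) for _ in range(500)]
print(best_power(skewed))
```

Scores that include zeros (as raw Beck scores can) would need a constant added before transforming, since the transform requires positive values.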
40Predicted test score using a covariate vs actual test score (raw and square rooted)
More linear relationship taking square root (right hand side picture)
41Box Cox on residual variance
- test score = constant + A x item score + residual
- Can use Box-Cox on residuals of fitting item score on test score
- suggests using square root of y
- This is the transform of test score which minimizes the residual variance
42Exponential
- Clicks = constant + A exp(-B x Age)
- Another type of non-linear relationship. Characterised by ever increasing rates of change as you get older
43Log Beck
- Skew = 0.60
- Kurtosis = -0.06
- Acceptable using rule of thumb
44Quantile plot - log Beck
- Fits closer to a straight line
- Log transform has made the distribution more Normal
- Log transform enables the use of more powerful statistical tests
45Symmetry of midpoints
- midpoints of percentiles
- average of thresholds marking blue and green areas should be equal in symmetric distributions
46Midpoints of beck score
- Beck
- Median = 6
- 0.5(Sum of Midpoints) - Median
- Quarters 8.3
- Eighths 16.7
- Sixteenths 38
- Log(Beck + 1)
- Median = 1.95
- 0.5(Sum of Midpoints) - Median
- Quarters 3.0
- Eighths 14.4
- Sixteenths 26.7
- MORE SYMMETRIC!
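The midpoint comparison above can be sketched as follows: for quarters, eighths and sixteenths, average the lowest and highest cut points and subtract the median. This is an illustration with hypothetical data and function names; the quantile method is an assumption, and the output is in raw score units, which need not match the scale of the figures on the slide.

```python
import statistics

def midsummary_shifts(xs):
    """0.5 * (lowest + highest cut point) - median, for three tail depths."""
    med = statistics.median(xs)
    shifts = {}
    for name, n in (("quarters", 4), ("eighths", 8), ("sixteenths", 16)):
        cuts = statistics.quantiles(xs, n=n, method="inclusive")
        shifts[name] = 0.5 * (cuts[0] + cuts[-1]) - med
    return shifts

symmetric = list(range(1, 10))
skewed = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 12, 13, 46]

print(midsummary_shifts(symmetric))   # all shifts are zero
print(midsummary_shifts(skewed))      # shifts grow further into the tails
```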
47Rank transform
- Downweight outliers
- Useful if power transformations fail
- Useful summary measures
- Medians
- Interquartile ranges (Boxplots)
- Rank sums (Non-parametric tests)
48Using ranks - example
- Compare cost (in ) of two care centres
- Care Centres O and R
- Any patient cost saving?
49Centre O stem leaf display
- STEM WIDTH = 200
- 2 EXTREMES
- POSITIVE SKEW
50Centre R - Stem and Leaf
- (stem width = 100)
- outliers present
- positive skew
- rank test needed
51RESULTS
- UNRANKED
- t(147) = 0.91, p = .36
- centre costs the same
- Uses means
- RANKED
- mean Rank
- Study O 65.06
- Study R 85.63
- M-W Z = -2.96, p = .003
- Centre R costlier
- Uses ranks
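The contrast between means and ranks can be reproduced on a small made-up example. This sketch computes a Mann-Whitney z statistic using the normal approximation with no tie or continuity correction, so it will not match SPSS output exactly when ties are present. The cost figures and function names are hypothetical, not the care centre data.

```python
import math
import statistics

def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # mean of positions i..j, 1-based
        i = j + 1
    return ranks

def mann_whitney_z(a, b):
    """Return (z, two-sided p) for group a versus group b."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical costs: one extreme value in centre O dominates its mean.
centre_o = [120, 150, 180, 200, 210, 250, 2400]
centre_r = [260, 300, 320, 350, 400, 450, 500]

print(statistics.fmean(centre_o), statistics.fmean(centre_r))
print(mann_whitney_z(centre_o, centre_r))
```

In this made-up sample the mean makes centre O look costlier because of one outlier, while the ranks show that centre R's costs are typically higher.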
52Nonparametric tests
- PROS
- Downweight outliers
- Fewer assumptions
- Useful for skewed distributions
- CONS
- Less powerful
- Lose information
- Limited range of tests
53Equal Group Variances
- Important for t-tests and ANOVAs
- No covariate by group interaction in ANCOVA. Quade's (1967) method is a nonparametric equivalent
- May need to transform outcome
- Tests available to identify problems
54Levene's test
- Are group variances equal?
- Gets slope of spread vs location
- Compares slope to 0
- produces F-test
55Proportions
- Variance of a proportion depends on value of proportion!
- Arcsine transform resolves this
- In SPSS use function in COMPUTE to do transform
- 2 * arsin(sqrt(p))
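The same transform in Python, as a minimal sketch (the function name is made up; `math.asin` plays the role of the SPSS arsin function and works in radians):

```python
import math

def arcsine_transform(p):
    """Variance-stabilising transform for a proportion p in [0, 1]."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return 2 * math.asin(math.sqrt(p))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(arcsine_transform(p), 3))
```

The transformed values run from 0 to pi, and proportions near 0 or 1 are spread out more than those near 0.5.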
56Funny you should say that...
- There is no truth to the allegation that statisticians are mean. They are just your standard normal deviates.
- Why don't statisticians like to model new clothes?
- Lack of fit.
- Did you hear about the statistician who invented a device to measure the weight of trees? It's referred to as the ? scale
- ?log
- Old statisticians never die, they just undergo a transformation.
- Or in summary. Normal lack of fit: try a log transformation!
- http://research.microsoft.com/users/lamport/pubs/hair.pdf
57And Finally...
- A Statistician is someone who can have their head in an oven and their feet in an ice box and say that on the whole they are feeling perfectly normal
- Check you are using appropriate summary measures
- Further details including references on EDA at
- http://www.itl.nist.gov/div898/handbook/eda/eda.htm
- Thanks to Frank Duckworth: RSS News article on scales
- Thanks to Chrissy Fletcher for supplying the jokes
- Allan Reese (CEFAS, graphical comments)
- Next week (Thursday). 11am
- Ian Nimmo-Smith
- The anatomy of statistical methods: models, hypotheses, significance and power