Title: Checking assumptions - exploratory data analysis (EDA)
1Research Methods 1998: Graphical design and
analysis
© Gerry Quinn, Monash University, 1998. Do not
modify or distribute without express written
permission of the author.
2Graphical displays
- Exploration
  - assumptions (normality, equal variances)
  - unusual values
  - which analysis?
- Analysis
  - model fitting
- Presentation/communication of results
3Space shuttle data
4Space shuttle data
- NASA meeting Jan 27th 1986
- day before launch of shuttle Challenger
- Concern about low air temperatures at launch
- Affect O-rings that seal joints of rocket motors
- Previous data studied
5O-ring failure vs temperature Pre 1986
6Challenger flight
Jan 28th 1986 - forecast temp 31°F
7O-ring failure vs temperature
8Checking assumptions - exploratory data analysis
(EDA)
- Shape of sample (and therefore population)
  - is distribution normal (symmetrical) or skewed?
- Spread of sample
  - are variances similar in different groups?
- Are outliers present?
  - observations very different from the rest of the sample
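The shape and spread checks can also be put into numbers. The sketch below is one possible way to do that, not a method taken from these slides: it assumes a moment-based skewness coefficient for shape and a largest-to-smallest variance ratio for spread. The function names and the example counts are hypothetical.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: near 0 if symmetrical, positive if right-skewed."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def variance_ratio(groups):
    """Largest group variance divided by the smallest (1 means equal spread)."""
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / min(variances)

# Hypothetical counts, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
skewed = [0, 1, 1, 2, 2, 3, 3, 4, 5, 40]

print(sample_skewness(symmetric))   # 0.0 for this perfectly symmetric sample
print(sample_skewness(skewed) > 0)  # True: long right tail
print(variance_ratio([[1, 2, 3], [10, 20, 30]]))
```

A skewness well above zero or a large variance ratio is only a prompt to look at the plots; neither number replaces them.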
9Distributions of biological data
- Bell-shaped symmetrical distribution
  - normal
- Skewed asymmetrical distribution
  - log-normal
  - Poisson
10Common skewed distributions
- Log-normal distribution
  - μ proportional to σ (mean proportional to standard deviation)
  - measurement data, e.g. length, weight etc.
- Poisson distribution
  - μ = σ² (mean equals variance)
  - count data, e.g. numbers of individuals
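Both mean-variance relationships can be checked directly from the distributions. The sketch below sums the Poisson probabilities to get the mean and variance, and uses the standard log-normal moment formulas (mean = exp(mu_log + sigma_log²/2), sd = mean × sqrt(exp(sigma_log²) - 1)). The function names and parameter values are hypothetical choices for illustration.

```python
import math

def poisson_mean_var(lam, kmax=200):
    """Mean and variance of a Poisson distribution, summed from its probabilities."""
    p = math.exp(-lam)          # P(Y = 0)
    mean = second = 0.0
    for k in range(1, kmax + 1):
        p *= lam / k            # P(Y = k) from P(Y = k - 1)
        mean += k * p
        second += k * k * p
    return mean, second - mean ** 2

def lognormal_mean_sd(mu_log, sigma_log):
    """Mean and sd of Y when log(Y) is normal with the given mean and sd."""
    mean = math.exp(mu_log + sigma_log ** 2 / 2)
    sd = mean * math.sqrt(math.exp(sigma_log ** 2) - 1)
    return mean, sd

for lam in (0.5, 4, 20):
    print(lam, poisson_mean_var(lam))       # mean and variance both close to lam

for mu_log in (0, 1, 3):
    mean, sd = lognormal_mean_sd(mu_log, 0.5)
    print(round(mean, 3), round(sd / mean, 3))  # sd / mean stays constant
```

For the log-normal case the ratio sd / mean depends only on sigma_log, which is what "μ proportional to σ" means in practice.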
11Exploring sample data
12Example data set
- Quinn & Keough (in press)
- Surveys of 8 rocky shores along Point Nepean coast
- 10 sampling times (1988 - 1993)
- 15 quadrats (0.25 m²) at each site
- Numbers of all gastropod species and cover of
macroalgae recorded from each quadrat
13Frequency distributions
Observations grouped into classes
[Figure: two frequency histograms, NORMAL and LOG-NORMAL. x-axis: value of variable (class); y-axis: number of observations]
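Grouping observations into classes is a simple counting step. The sketch below assumes equal-width classes that start at a chosen lower limit, with each class including its lower bound but not its upper bound. The function name, class width and counts are hypothetical.

```python
from collections import Counter

def frequency_table(values, class_width, start=0):
    """Count observations per class [lower, upper); assumes all values >= start."""
    counts = Counter(int((v - start) // class_width) for v in values)
    rows = []
    for i in range(max(counts) + 1):
        lower = start + i * class_width
        rows.append((lower, lower + class_width, counts.get(i, 0)))
    return rows

# Hypothetical counts per quadrat, invented for illustration only.
for lower, upper, n in frequency_table([0, 3, 9, 10, 25], class_width=10):
    print(f"{lower:>3} to <{upper:<3} {n}")
```

The choice of class width changes how the histogram looks, so it is worth trying more than one.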
14Number of Cellana per quadrat
Survey 5, all shores combined. Total no. quadrats: 120
[Figure: frequency histogram. x-axis: number of Cellana per quadrat (0 to 100); y-axis: frequency (0 to 30)]
15Dotplots
- Each observation represented by a dot
- Number of Cellana per quadrat, Cheviot Beach survey 5
- No. quadrats: 15
[Figure: dotplot. x-axis: number of Cellana per quadrat (0 to 40)]
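A dotplot needs nothing more than one mark per observation, stacked where values repeat. The sketch below prints a rough text version, turned on its side so that each row is one observed value. The function name and the counts are hypothetical, not the Cheviot Beach data.

```python
from collections import Counter

def text_dotplot(values):
    """One row per distinct value, one dot per observation with that value."""
    counts = Counter(values)
    rows = [f"{value:>4} | " + "." * counts[value] for value in sorted(counts)]
    return "\n".join(rows)

# Hypothetical counts per quadrat, invented for illustration only.
print(text_dotplot([0, 2, 2, 5, 5, 5, 31]))
```

Unlike a histogram, no class width has to be chosen, which suits small samples such as 15 quadrats.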
16Boxplot
18Boxplots of Cellana numbers in survey 5
[Figure: boxplots by site. x-axis: site (S, FPE, RR, SP, CPE, CB, LB, CPW); y-axis: number of Cellana per quadrat (0 to 100)]
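The numbers behind each box can be computed per site. The slides do not state which boxplot convention was used, so the sketch below assumes the common one: box from the first to the third quartile, whiskers to the most extreme observations within 1.5 × IQR of the box, and anything beyond shown as an outlier. Quartile definitions differ between packages; this uses the "inclusive" method of Python's `statistics.quantiles`. The function name and counts are hypothetical.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Quartiles, whisker ends and outliers under the assumed 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical counts for two sites, invented for illustration only.
sites = {"A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50], "B": [10, 12, 15, 18, 20, 22]}
for name, counts in sites.items():
    print(name, boxplot_stats(counts))
```

Comparing box lengths across sites gives a quick view of whether variances are similar, and the outlier lists show which quadrats to check.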
19Scatterplots
- Plotting bivariate data
- Value of two variables recorded for each observation
- Each variable plotted on one axis (x or y)
- Symbols represent each observation
- Assess relationship between two variables
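A numerical companion to the scatterplot is the Pearson correlation coefficient, which measures only the linear part of a relationship. This is an addition for illustration rather than something the slides prescribe; the function name and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired observations x[i], y[i]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired values per quadrat, invented for illustration only.
cover = [5, 10, 20, 40, 60, 80]
counts = [30, 28, 20, 12, 6, 2]
print(round(pearson_r(cover, counts), 3))
```

A curved or clumped pattern can give a modest coefficient even when the variables are strongly related, so the plot should still be inspected.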
20Cheviot Beach survey 5 (n = 15)
[Figure: scatterplot. Axes: number of Cellana per quadrat and cover of Hormosira per quadrat]
21Scatterplot matrix
- Abbreviated to SPLOM
- Extension of scatterplot
- For plotting relationships between 3 or more variables on one plot
- Bivariate plots in multiple panels on SPLOM
22SPLOM for Cheviot Beach survey 5
- CELLANA: numbers of Cellana
- SIPHALL: numbers of Siphonaria
- HORMOS: cover of Hormosira
- n = 15 quadrats
23Transformations
- Improve normality.
- Remove relationship between mean and variance.
- Make variances more similar in different populations.
- Reduce influence of outliers.
- Make relationships between variables more linear
(regression analysis).
24Log transformation
- Lognormal → Normal
- y → log(y)
- Measurement data
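A small numerical sketch shows why the log transformation helps when the standard deviation grows with the mean. It assumes base-10 logs (any base gives the same pattern) and strictly positive values; data containing zeros are often handled with log(y + c) for a small constant c, which is not shown here. The function name and measurements are hypothetical.

```python
import math
import statistics

def log10_transform(values):
    """log10 of each value; only defined for values greater than zero."""
    return [math.log10(v) for v in values]

# Hypothetical measurements (invented): group b is group a scaled up tenfold,
# so its standard deviation is ten times larger on the raw scale.
a = [1.0, 2.0, 4.0]
b = [10.0, 20.0, 40.0]

print(statistics.stdev(a), statistics.stdev(b))
print(statistics.stdev(log10_transform(a)), statistics.stdev(log10_transform(b)))
```

On the log scale the tenfold difference becomes a shift of one unit, and the two groups have the same spread.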
25Power transformation
- Poisson → Normal
- y → √y, i.e. y → y^0.5; alternatively y → y^0.25
- Count data
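For Poisson counts the variance equals the mean, so groups with larger means are more variable. The sketch below computes, from the Poisson probabilities themselves, the variance of the raw counts and of the square-root-transformed counts at two means. The function names and the chosen means are hypothetical.

```python
import math

def poisson_probs(lam, kmax=400):
    """P(Y = 0), ..., P(Y = kmax) for a Poisson distribution with mean lam."""
    p = math.exp(-lam)
    probs = [p]
    for k in range(1, kmax + 1):
        p *= lam / k
        probs.append(p)
    return probs

def variance_of_power(lam, power):
    """Variance of Y ** power when Y is Poisson with mean lam."""
    probs = poisson_probs(lam)
    first = sum(p * k ** power for k, p in enumerate(probs))
    second = sum(p * k ** (2 * power) for k, p in enumerate(probs))
    return second - first ** 2

for lam in (10, 40):
    raw = variance_of_power(lam, 1.0)
    rooted = variance_of_power(lam, 0.5)
    print(lam, round(raw, 3), round(rooted, 3))
```

The raw variances differ fourfold between the two means, while the variances after the square-root transformation are both close to 0.25.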
26Arcsin √ transformation
- Square → Normal
- y → sin⁻¹(√y)
- Proportions and percentages
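The arcsine square-root transformation applies to proportions between 0 and 1, so percentages are divided by 100 first. A minimal sketch, with hypothetical function names and invented cover values; the result is in radians.

```python
import math

def percent_to_proportion(percentages):
    """Convert percentages (0 to 100) to proportions (0 to 1)."""
    return [p / 100 for p in percentages]

def arcsine_sqrt(proportions):
    """asin(sqrt(p)) in radians; requires 0 <= p <= 1."""
    return [math.asin(math.sqrt(p)) for p in proportions]

# Hypothetical percentage cover values, invented for illustration only.
cover_percent = [0, 5, 50, 95, 100]
print([round(v, 3) for v in arcsine_sqrt(percent_to_proportion(cover_percent))])
```

The transformed values run from 0 to π/2, and the scale is stretched near 0 and 1, where raw proportions are compressed.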
27Outliers
- Observations very different from rest of sample - identified in boxplots.
- Check if mistakes (e.g. typos, broken measuring device) - if so, omit.
- Extreme values in skewed distribution - transform.
- Alternatively, do analysis twice - outliers in and outliers excluded. Worry if influential.
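Running the analysis twice can be as simple as computing the same summary with the suspect values in and then excluded. The sketch below uses the mean and standard deviation as a stand-in for whatever analysis is planned; the function name and the counts are hypothetical.

```python
import statistics

def compare_with_and_without(values, suspects):
    """Mean and standard deviation with the suspect values in, then excluded."""
    kept = [v for v in values if v not in suspects]
    return {
        "outliers in": (statistics.fmean(values), statistics.stdev(values)),
        "outliers excluded": (statistics.fmean(kept), statistics.stdev(kept)),
    }

# Hypothetical counts per quadrat, invented for illustration only.
counts = [2, 3, 3, 4, 5, 40]
for label, (mean, sd) in compare_with_and_without(counts, suspects=[40]).items():
    print(f"{label}: mean = {mean:.2f}, sd = {sd:.2f}")
```

If the two runs lead to different conclusions, the outlier is influential and both results should be reported.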
28Assumptions not met?
- Check and deal with outliers
- Transformation
  - might fix non-normality and unequal variances
- Nonparametric rank test
  - does not assume normality
  - does assume similar variances
  - Mann-Whitney-Wilcoxon
  - only suitable for simple analyses
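The rank test works on the ranks of the pooled observations rather than the raw values. The sketch below computes only the Mann-Whitney U statistics for two groups, giving tied values their average rank; it does not compute a P value, for which tables or a statistics package would be needed. The function names and the data are hypothetical.

```python
def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """U statistics for groups a and b; they always sum to len(a) * len(b)."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    return u_a, len(a) * len(b) - u_a

# Hypothetical counts from two sites, invented for illustration only.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # (0.0, 9.0): no overlap at all
print(mann_whitney_u([1, 2], [2, 3]))        # (0.5, 3.5): one tie across groups
```

Because only ranks enter the calculation, a single extreme value has no more influence than any other largest observation.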
29Category or line plot
[Figure: line plot. x-axis: survey; y-axis: mean number of Cellana per quadrat]