Title: Laboratory in Oceanography: Data and Methods
Laboratory in Oceanography: Data and Methods
Intro to the Statistics Toolbox
- MAR599, Spring 2009
- Miles A. Sundermeyer
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
- Measures of Central Tendency
- Geometric Mean
- Harmonic Mean
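The Statistics Toolbox provides geomean and harmmean for these; as an illustrative sketch (not part of the original slides), the same quantities in Python/NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])

arith = x.mean()                     # arithmetic mean
geo = np.exp(np.log(x).mean())       # geometric mean: exp(mean(log(x)))
harm = x.size / np.sum(1.0 / x)      # harmonic mean: n / sum(1/x)

print(arith, geo, harm)              # arithmetic >= geometric >= harmonic
```

For any positive data the three means satisfy arithmetic >= geometric >= harmonic, with equality only when all values are identical.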
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
- Measures of Dispersion
- Interquartile range: difference between the 75th and 25th percentiles
- Mean absolute deviation: mean(abs(x-mean(x)))
- Moment: mean((x-mean(x)).^order) (e.g., order = 2 gives the variance)
- Skewness: third central moment of x, divided by the cube of its standard deviation (positive/negative skewness implies a longer right/left tail)
- Kurtosis: fourth central moment of x, divided by the 4th power of its standard deviation (high kurtosis means a sharper peak and longer, fatter tails)
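Each of these dispersion measures is one line of arithmetic. A Python/NumPy sketch (sample data illustrative; for a standard normal sample, skewness is near 0 and kurtosis near 3):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

iqr = np.percentile(x, 75) - np.percentile(x, 25)    # interquartile range
mad = np.mean(np.abs(x - x.mean()))                  # mean absolute deviation
m2 = np.mean((x - x.mean()) ** 2)                    # 2nd central moment = variance
skew = np.mean((x - x.mean()) ** 3) / x.std() ** 3   # skewness
kurt = np.mean((x - x.mean()) ** 4) / x.std() ** 4   # kurtosis (normal -> ~3)
```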
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
Examples of skewness and kurtosis (figure)
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
- Bootstrap Method
- Involves choosing random samples with replacement from a data set and analyzing each sample the same way as the original data set. The number of elements in each bootstrap sample equals the number of elements in the original data set. The range of sample estimates obtained provides a means of estimating the uncertainty of the quantity being estimated.
- In general, the bootstrap method can be used to compute uncertainty for any functional calculation, provided the sample data set is representative of the true distribution.
- Jackknife Method
- Similar to the bootstrap; the jackknife uses re-sampling (leaving out one observation at a time) to estimate the bias and variance of sample statistics.
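The Statistics Toolbox provides bootstrp for resampling; the procedure described above can also be sketched directly in Python/NumPy (sample data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # stand-in "observed" sample

# Bootstrap: resample with replacement, same size as the original data set,
# and recompute the statistic of interest for each resample.
nboot = 2000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(nboot)
])

# 95% confidence limits from the spread of the bootstrap estimates
lo, hi = np.percentile(boot_means, [2.5, 97.5])

# Jackknife: leave one observation out at a time
jack_means = np.array([np.delete(data, i).mean() for i in range(data.size)])
```

Any statistic (median, correlation, integral time scale, ...) can replace .mean() in the resampling loop, which is what makes the method so general.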
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
Example: Bootstrap method for estimating uncertainty on the Lagrangian integral time scale (from Sundermeyer and Price, 1998):
"Integrating the LACFs using 100 days as the upper limit of the integral of Rii(t) in (12) gives the integral timescales I(11,22) = (10.6 ± 4.8, 5.4 ± 2.8) days for the (zonal, meridional) components, where uncertainties represent 95% confidence limits estimated using a bootstrap method [e.g., Press et al., 1986]."
Intro to Statistics Toolbox: Statistics Toolbox/Statistical Visualization
- Probability Distribution Plots
- Normal Probability Plots
>> x = normrnd(10,1,25,1);
>> normplot(x)
>> x = exprnd(10,100,1);
>> normplot(x)
Intro to Statistics Toolbox: Statistics Toolbox/Statistical Visualization
- Probability Distribution Plots
- Quantile-Quantile Plots
>> x = poissrnd(10,50,1); y = poissrnd(5,100,1);
>> qqplot(x,y)
>> x = normrnd(5,1,100,1);
>> y = wblrnd(2,0.5,100,1);
>> qqplot(x,y)
Intro to Statistics Toolbox: Statistics Toolbox/Statistical Visualization
- Probability Distribution Plots
- Cumulative Distribution Plots
>> y = evrnd(0,3,100,1);
>> cdfplot(y)
>> hold on
>> x = -20:0.1:10;
>> f = evcdf(x,0,3);
>> plot(x,f,'m')
>> legend('Empirical','Theoretical','Location','NW')
Intro to Statistics Toolbox: Statistics Toolbox/Probability Distributions/Supported Distributions
- Supported distributions include a wide range of:
- Continuous distributions (data)
- Continuous distributions (statistics)
- Discrete distributions
- Multivariate distributions
- http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html
Intro to Statistics Toolbox: Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions (cont'd) (table)
Supported statistics (table)
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- Hypothesis Testing
- Can only disprove a hypothesis
- null hypothesis: an assertion about a population. It is "null" in that it represents a status quo belief, such as the absence of a characteristic or the lack of an effect.
- alternative hypothesis: a contrasting assertion about the population that can be tested against the null hypothesis
- H1: µ ≠ null hypothesis value (two-tailed test)
- H1: µ > null hypothesis value (right-tail test)
- H1: µ < null hypothesis value (left-tail test)
- test statistic: a random sample of the population is collected, and a test statistic is computed to characterize the sample. The statistic varies with the type of test, but its distribution under the null hypothesis must be known (or assumed).
- p-value: the probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as or more extreme than the value computed from the sample.
- significance level: a threshold of probability; a typical value of α is 0.05. If p-value < α, the test rejects the null hypothesis; if p-value > α, there is insufficient evidence to reject the null hypothesis.
- confidence interval: an estimated range of values with a specified probability of containing the true population value of a parameter.
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- Hypothesis Testing
- Hypothesis tests make assumptions about the distribution of the random variable being sampled in the data. These must be considered when choosing a test and when interpreting the results.
- The z-test (ztest) and the t-test (ttest) both assume that the data are independently sampled from a normal distribution.
- Both the z-test and the t-test are relatively robust with respect to departures from this assumption, so long as the sample size n is large enough.
- The difference between the z-test and the t-test is in the assumption about the standard deviation σ of the underlying normal distribution. A z-test assumes that σ is known; a t-test does not. Thus the t-test must estimate σ from the sample.
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- ztest
- The test requires σ (the standard deviation of the population) to be known
- The formula for calculating the z score for the z-test is:
    z = (x̄ - µ) / (σ / √n)
- where x̄ is the sample mean, µ is the mean of the population, σ is the population standard deviation, and n is the sample size
- The z-score is compared to a z-table, which contains the percent of area under the normal curve between the mean and the z-score. This table indicates whether the calculated z-score is within the realm of chance, or if it is so different from the mean that the sample mean is unlikely to have occurred by chance.
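In MATLAB this test is ztest(x, mu0, sigma). The same computation, sketched in Python/SciPy (the sample, null mean, and known σ are illustrative values only):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=1.0, size=25)   # sample; sigma assumed known

mu0, sigma = 10.0, 1.0                         # null-hypothesis mean, known std
z = (x.mean() - mu0) / (sigma / np.sqrt(x.size))
p = 2 * norm.sf(abs(z))                        # two-tailed p-value
```

Looking up the area under the normal curve with norm.sf replaces the z-table step described above.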
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- ttest
- Like the z-test, except the t-test does not require σ to be known
- The formula for calculating the t score for the t-test is:
    t = (x̄ - µ) / (s / √n)
- where x̄ is the sample mean, µ is the mean of the population, and s is the sample standard deviation
- Under the null hypothesis that the population is distributed with mean µ, the z-statistic has a standard normal distribution, N(0,1). Under the same null hypothesis, the t-statistic has Student's t distribution with n - 1 degrees of freedom.
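In MATLAB this is [h,p] = ttest(x, mu0). A Python/SciPy sketch (illustrative sample) comparing the hand formula above against the library routine:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=1.0, size=25)

mu0 = 10.0
# t = (xbar - mu) / (s / sqrt(n)), with s the sample standard deviation
t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(x.size))
t_sp, p_sp = stats.ttest_1samp(x, mu0)   # library equivalent
```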
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- ttest2
- Performs a t-test of the null hypothesis that data in the vectors x and y are independent random samples from normal distributions with equal means; the unknown variances may be either equal or unequal.
- The formula for calculating the score for ttest2 (unequal-variance case) is:
    t = (x̄ - ȳ) / √(sx²/n + sy²/m)
- where x̄ and ȳ are the sample means, sx² and sy² are the sample variances, and n and m are the sample sizes
- The null hypothesis is that the two samples are distributed with the same mean.
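The unequal-variance (Welch) form of the two-sample statistic can be checked against SciPy; a sketch with illustrative samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(5.0, 1.0, size=40)
y = rng.normal(5.5, 1.0, size=50)

# Welch form: t = (xbar - ybar) / sqrt(sx^2/n + sy^2/m)
t = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / x.size
                                    + y.var(ddof=1) / y.size)
t_sp, p_sp = stats.ttest_ind(x, y, equal_var=False)   # library equivalent
```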
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- ANOVA (ANalysis Of VAriance)
- ANOVA is like a t-test among multiple (typically >2) data sets simultaneously
- t-tests can be done between two data sets, or one set and a true value
- uses the F-distribution instead of the t-distribution
- assumes that all of the data sets have equal variances
- One-way ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is:
    yij = α.j + εij
- where
- yij is a matrix of observations, in which each column represents a different group
- α.j is a matrix whose columns are the group means (the "dot j" notation means α applies to all rows of column j; that is, αij is the same for all i)
- εij is a matrix of random disturbances
- The model assumes that the columns of y are a constant plus a random disturbance. ANOVA tests whether the constants are all the same.
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- One-way ANOVA
- Example (Hogg and Ledolter): bacteria counts in milk. Columns represent different shipments; rows are bacteria counts from cartons chosen randomly from each shipment. Do some shipments have higher counts than others?
>> load hogg
>> hogg
hogg =
    24 14 11  7 19
    15  7  9  7 24
    21 12  7  4 19
    27 17 13  7 15
    33 14 12 12 10
    23 16 18 18 20
>> [p,tbl,stats] = anova1(hogg);
>> p
p = 1.1971e-04
- The standard ANOVA table has columns for the sums of squares (SS), degrees of freedom (df), mean squares (SS/df), F statistic, and p-value.
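The same one-way test can be reproduced outside MATLAB; a Python/SciPy sketch with the hogg values above entered by column (one list per shipment):

```python
from scipy import stats

# hogg data: each list is one shipment (one column of the MATLAB matrix)
shipments = [
    [24, 15, 21, 27, 33, 23],
    [14,  7, 12, 17, 14, 16],
    [11,  9,  7, 13, 12, 18],
    [ 7,  7,  4,  7, 12, 18],
    [19, 24, 19, 15, 10, 20],
]
F, p = stats.f_oneway(*shipments)   # one-way ANOVA across the five shipments
```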
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- One-way ANOVA (cont'd)
- In this case the p-value is about 0.0001, a very small value. This is a strong indication that the bacteria counts from the different shipments are not the same. An F statistic as extreme as this would occur by chance only about once in 10,000 times if the counts were truly equal.
- The p-value returned by anova1 depends on assumptions about the random disturbances εij in the model equation. For the p-value to be correct, these disturbances need to be independent, normally distributed, and have constant variance.
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- Multiple Comparisons
- Sometimes we need to determine not just whether there are differences among the means, but which pairs of means are significantly different.
- In a t-test, we compute a t-statistic and compare it to a critical value. However, when testing multiple pairs, if the probability of the t-statistic exceeding the critical value is 5%, then for 10 pairs it becomes much more likely that at least one of these will falsely exceed that criterion.
- Can perform a multiple comparison test using the multcompare function by supplying it with the stats output from anova1.
- Example:
>> load hogg
>> [p,tbl,stats] = anova1(hogg);
>> [c,m] = multcompare(stats);
- Example: see Light_DO.m
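multcompare adjusts for the multiple-testing problem described above. A simpler (and more conservative) alternative, sketched in Python/SciPy, is Bonferroni-corrected pairwise t-tests; this is not the same procedure multcompare uses, and the three groups here are just three of the hogg shipments:

```python
from itertools import combinations
from scipy import stats

groups = {
    "ship1": [24, 15, 21, 27, 33, 23],
    "ship2": [14,  7, 12, 17, 14, 16],
    "ship5": [19, 24, 19, 15, 10, 20],
}
pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)            # Bonferroni-corrected threshold

for g1, g2 in pairs:
    t, p = stats.ttest_ind(groups[g1], groups[g2])
    verdict = "different" if p < alpha else "not distinguishable"
    print(g1, g2, round(p, 4), verdict)
```

Dividing α by the number of comparisons keeps the family-wise error rate at (or below) 5%.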
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- Two-way ANOVA
- Determine whether data from several groups have a common mean. Differs from one-way ANOVA in that the groups in two-way ANOVA have two categories of defining characteristics instead of one (e.g., think of two independent variables/dimensions).
- Two-way ANOVA is again a special case of the linear model. The two-way ANOVA form of the model is:
    yijk = µ + α.j + βi. + γij + εijk
- where
- yijk is a matrix of observations (with rows i, columns j, and repetition k)
- µ is a constant matrix of the overall mean of the observations
- α.j is a matrix whose columns are the deviations of each observation attributable to the first independent variable; all values in a given column of α.j are identical, and the values in each row sum to 0
- βi. is a matrix whose rows are the deviations of each observation attributable to the second independent variable; all values in a given row of βi. are identical, and the values in each column sum to 0
- γij is a matrix of interactions; the values in each row sum to 0, and the values in each column sum to 0
- εijk is a matrix of random disturbances
- The model assumes that the columns of y are a series of constants plus a random disturbance. You want to know whether the constants are all the same.
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
Two-way ANOVA Example: Determine the effect of car model and factory on the mileage rating of cars. There are three models (columns) and two factories (rows). Data from the first factory are in the first three rows; data from the second factory are in the last three rows. Do some cars have different mileage than others?
>> load mileage
>> mileage
mileage =
    33.3000 34.5000 37.4000
    33.4000 34.8000 36.8000
    32.9000 33.8000 37.6000
    32.6000 33.4000 36.6000
    32.5000 33.7000 37.0000
    33.0000 33.9000 36.7000
>> cars = 3;
>> [p,tbl,stats] = anova2(mileage,cars);
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- Two-way ANOVA (cont'd)
- In this case the p-value for the first effect is zero to four decimal places. This indicates that the effect of the first predictor varies from one sample to another. An F statistic as extreme as this would occur by chance less than once in 10,000 times if the samples were truly equal.
- The p-value for the second effect is 0.0039, which is also highly significant. This indicates that the effect of the second predictor varies from one sample to another.
- There does not appear to be any interaction between the two predictors. The p-value, 0.8411, means that the observed result is quite likely (84 out of 100 times) given that there is no interaction.
- The p-values returned by anova2 depend on assumptions about the random disturbances εijk in the model equation. For the p-values to be correct, these disturbances need to be independent, normally distributed, and have constant variance.
- In addition, anova2 requires that the data be balanced, which means there must be the same number of samples for each combination of control variables. Other ANOVA methods support unbalanced data with any number of predictors.
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
- Linear Regression Models
- In statistics, linear regression models take the form of a summation of terms: coefficient × (independent variable, or combination of independent variables).
- For example:
    y = ß1 + ß2·x1 + ß3·x2 + ß4·x1·x2 + ß5·x1² + ε
- In this example, the response variable y is modeled as a combination of constant, linear, interaction, and quadratic terms formed from two predictor variables x1 and x2.
- Uncontrolled factors and experimental errors are modeled by ε. Given data on x1, x2, and y, regression estimates the model parameters ßj (j = 1, ..., 5).
- More general linear regression models represent the relationship between a continuous response y and a continuous or categorical predictor x in the form
    y = ß0 + ß1·f1(x) + ... + ßp·fp(x) + ε
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
Example (system of equations): Suppose we have a series of measurements of stream discharge and stage, measured at n different times.
time (day):         0     14    28    42    56    70
stage (m):          0.612 0.647 0.580 0.629 0.688 0.583
discharge (m3/s):   0.330 0.395 0.241 0.338 0.531 0.279
Suppose we now wish to fit a rating curve to these measurements. Let x = stage and y = discharge; then we can write this series of measurements as yi = m·xi + b, with i = 1...n. This in turn can be written as y = Xb, with
    X = [x1 1; x2 1; ...; xn 1],  b = [m; b]
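In MATLAB the least-squares solution of y = Xb is simply b = X\y. A Python/NumPy sketch of the same fit, using the stage and discharge values tabulated above:

```python
import numpy as np

stage = np.array([0.612, 0.647, 0.580, 0.629, 0.688, 0.583])      # x
discharge = np.array([0.330, 0.395, 0.241, 0.338, 0.531, 0.279])  # y

# Design matrix X = [x, 1]; solve y = X b in the least-squares sense
X = np.column_stack([stage, np.ones_like(stage)])
b, *_ = np.linalg.lstsq(X, discharge, rcond=None)
m, c = b   # slope and intercept of the rating curve
```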
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
(Figure: rating-curve fit, yi = m·xi + b, i.e., y = Xb)
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
- Example: Harmonic Analysis
- sin(θ+φ) = sin(θ)cos(φ) + sin(φ)cos(θ)
- Let A = C·cos(φ), B = C·sin(φ)
- => C·sin(ωt+φ) = A·sin(ωt) + B·cos(ωt)
- Linear regression: y = Xb
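The A, B substitution above turns amplitude/phase fitting into ordinary linear regression. A Python/NumPy sketch with synthetic data (the amplitude, phase, period, and noise level are illustrative values only, not tidal constants from the slides):

```python
import numpy as np

# Synthetic "tidal" record: C*sin(w*t + phi) plus noise (assumed values)
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)              # time (days)
w = 2 * np.pi / 0.5175                   # M2-like angular frequency, illustrative
C_true, phi_true = 1.5, 0.7
y = C_true * np.sin(w * t + phi_true) + 0.1 * rng.normal(size=t.size)

# Linear regression y = X b with X = [sin(wt), cos(wt)], b = [A; B]
X = np.column_stack([np.sin(w * t), np.cos(w * t)])
A, B = np.linalg.lstsq(X, y, rcond=None)[0]

C_hat = np.hypot(A, B)      # amplitude: C = sqrt(A^2 + B^2)
phi_hat = np.arctan2(B, A)  # phase, since A = C cos(phi), B = C sin(phi)
```

Additional columns sin(2ωt), cos(2ωt), ... extend the same regression to the higher harmonics (M4, M6, ...) discussed on the next slide.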
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
Example: Harmonic analysis (cont'd): Southampton surface currents. Harmonic analysis for M2, M4 = 2×M2, M6 = 3×M2, ...
- Note: tidal harmonics can cause the tidal cycle to appear asymmetric.
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
Generalized linear models (GLMs) are a flexible generalization of ordinary least squares regression. They relate the random distribution of the measured variable of the experiment (the distribution function) to the systematic (non-random) portion of the experiment (the linear predictor) through a function called the link function.
Generalized additive models (GAMs) are another extension of GLMs in which the linear predictor η is not restricted to be linear in the covariates X but is an additive function of the xi's. The smooth functions fi are estimated from the data. In general this requires a large number of data points and is computationally intensive.
Data Handling in Matlab: Useful Tidbits
- regress: performs multiple linear regression using least squares
- nlinfit: performs nonlinear least-squares regression
- glmfit: fits a generalized linear model