Title: Laboratory in Oceanography: Data and Methods
Laboratory in Oceanography: Data and Methods
Intro to the Statistics Toolbox
- MAR599, Spring 2009
- Miles A. Sundermeyer
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
- Measures of Central Tendency
- Geometric Mean
- Harmonic Mean
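The Statistics Toolbox provides geomean and harmmean for these; as an illustrative sketch (not part of the original slides), the same quantities in Python/NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0])

arith = x.mean()                     # arithmetic mean
geo = np.exp(np.log(x).mean())       # geometric mean: exp(mean(log(x)))
harm = x.size / np.sum(1.0 / x)      # harmonic mean: n / sum(1/x)

print(arith, geo, harm)              # arithmetic >= geometric >= harmonic
```

For any positive data the three means satisfy arithmetic >= geometric >= harmonic, with equality only when all values are identical.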
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
- Measures of Dispersion
- Interquartile range: difference between the 75th and 25th percentiles
- Mean absolute deviation: mean(abs(x-mean(x)))
- Moment: mean((x-mean(x)).^order) (e.g., order = 2 gives the variance)
- Skewness: third central moment of x, divided by the cube of its standard deviation (positive/negative skewness implies a longer right/left tail)
- Kurtosis: fourth central moment of x, divided by the 4th power of its standard deviation (high kurtosis means a sharper peak and longer, fatter tails)
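Each of these dispersion measures is one line of arithmetic. A Python/NumPy sketch (sample data illustrative; for a standard normal sample, skewness is near 0 and kurtosis near 3):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

iqr = np.percentile(x, 75) - np.percentile(x, 25)    # interquartile range
mad = np.mean(np.abs(x - x.mean()))                  # mean absolute deviation
m2 = np.mean((x - x.mean()) ** 2)                    # 2nd central moment = variance
skew = np.mean((x - x.mean()) ** 3) / x.std() ** 3   # skewness
kurt = np.mean((x - x.mean()) ** 4) / x.std() ** 4   # kurtosis (normal -> ~3)
```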
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
Examples of skewness and kurtosis (figure)
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
- Bootstrap Method
- Involves choosing random samples with replacement from a data set and analyzing each sample the same way as the original data set. The number of elements in each bootstrap sample equals the number of elements in the original data set. The range of sample estimates obtained provides a means of estimating the uncertainty of the quantity being estimated.
- In general, the bootstrap method can be used to compute uncertainty for any functional calculation, provided the sample data set is representative of the true distribution.
- Jackknife Method
- Similar to the bootstrap; the jackknife uses re-sampling (leaving out one observation at a time) to estimate the bias and variance of sample statistics.
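The Statistics Toolbox provides bootstrp for resampling; the procedure described above can also be sketched directly in Python/NumPy (sample data and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # stand-in "observed" sample

# Bootstrap: resample with replacement, same size as the original data set,
# and recompute the statistic of interest for each resample.
nboot = 2000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(nboot)
])

# 95% confidence limits from the spread of the bootstrap estimates
lo, hi = np.percentile(boot_means, [2.5, 97.5])

# Jackknife: leave one observation out at a time
jack_means = np.array([np.delete(data, i).mean() for i in range(data.size)])
```

Any statistic (median, correlation, integral time scale, ...) can replace .mean() in the resampling loop, which is what makes the method so general.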
Intro to Statistics Toolbox: Statistics Toolbox/Descriptive Statistics
Example: Bootstrap method for estimating uncertainty on the Lagrangian integral time scale (from Sundermeyer and Price, 1998):
"Integrating the LACFs using 100 days as the upper limit of the integral of Rii(t) in (12) gives the integral timescales I(11,22) = (10.6 ± 4.8, 5.4 ± 2.8) days for the (zonal, meridional) components, where uncertainties represent 95% confidence limits estimated using a bootstrap method [e.g., Press et al., 1986]."
Intro to Statistics Toolbox: Statistics Toolbox/Statistical Visualization
- Probability Distribution Plots
- Normal Probability Plots
>> x = normrnd(10,1,25,1);
>> normplot(x)
>> x = exprnd(10,100,1);
>> normplot(x)
Intro to Statistics Toolbox: Statistics Toolbox/Statistical Visualization
- Probability Distribution Plots
- Quantile-Quantile Plots
>> x = poissrnd(10,50,1); y = poissrnd(5,100,1);
>> qqplot(x,y)
>> x = normrnd(5,1,100,1);
>> y = wblrnd(2,0.5,100,1);
>> qqplot(x,y)
Intro to Statistics Toolbox: Statistics Toolbox/Statistical Visualization
- Probability Distribution Plots
- Cumulative Distribution Plots
>> y = evrnd(0,3,100,1);
>> cdfplot(y)
>> hold on
>> x = -20:0.1:10;
>> f = evcdf(x,0,3);
>> plot(x,f,'m')
>> legend('Empirical','Theoretical','Location','NW')
Intro to Statistics Toolbox: Statistics Toolbox/Probability Distributions/Supported Distributions
- Supported distributions include a wide range of:
- Continuous distributions (data)
- Continuous distributions (statistics)
- Discrete distributions
- Multivariate distributions
- http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html
Intro to Statistics Toolbox: Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions (cont'd) (table)
Supported statistics (table)
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- Hypothesis Testing
- Can only disprove a hypothesis
- null hypothesis: an assertion about a population. It is "null" in that it represents a status quo belief, such as the absence of a characteristic or the lack of an effect.
- alternative hypothesis: a contrasting assertion about the population that can be tested against the null hypothesis
- H1: µ ≠ null hypothesis value (two-tailed test)
- H1: µ > null hypothesis value (right-tail test)
- H1: µ < null hypothesis value (left-tail test)
- test statistic: a random sample of the population is collected, and a test statistic is computed to characterize the sample. The statistic varies with the type of test, but its distribution under the null hypothesis must be known (or assumed).
- p-value: the probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as or more extreme than the value computed from the sample.
- significance level: a threshold of probability; a typical value of α is 0.05. If p-value < α, the test rejects the null hypothesis; if p-value > α, there is insufficient evidence to reject the null hypothesis.
- confidence interval: an estimated range of values with a specified probability of containing the true population value of a parameter.
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- Hypothesis Testing
- Hypothesis tests make assumptions about the distribution of the random variable being sampled in the data. These must be considered when choosing a test and when interpreting the results.
- The z-test (ztest) and the t-test (ttest) both assume that the data are independently sampled from a normal distribution.
- Both the z-test and the t-test are relatively robust with respect to departures from this assumption, so long as the sample size n is large enough.
- The difference between the z-test and the t-test is in the assumption about the standard deviation σ of the underlying normal distribution. A z-test assumes that σ is known; a t-test does not. Thus the t-test must estimate σ from the sample.
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- ztest
- The test requires σ (the standard deviation of the population) to be known
- The formula for calculating the z score for the z-test is:
    z = (x̄ - µ) / (σ / √n)
- where x̄ is the sample mean, µ is the mean of the population, σ is the population standard deviation, and n is the sample size
- The z-score is compared to a z-table, which contains the percent of area under the normal curve between the mean and the z-score. This table indicates whether the calculated z-score is within the realm of chance, or if it is so different from the mean that the sample mean is unlikely to have occurred by chance.
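In MATLAB this test is ztest(x, mu0, sigma). The same computation, sketched in Python/SciPy (the sample, null mean, and known σ are illustrative values only):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=1.0, size=25)   # sample; sigma assumed known

mu0, sigma = 10.0, 1.0                         # null-hypothesis mean, known std
z = (x.mean() - mu0) / (sigma / np.sqrt(x.size))
p = 2 * norm.sf(abs(z))                        # two-tailed p-value
```

Looking up the area under the normal curve with norm.sf replaces the z-table step described above.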
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- ttest
- Like the z-test, except the t-test does not require σ to be known
- The formula for calculating the t score for the t-test is:
    t = (x̄ - µ) / (s / √n)
- where x̄ is the sample mean, µ is the mean of the population, and s is the sample standard deviation
- Under the null hypothesis that the population is distributed with mean µ, the z-statistic has a standard normal distribution, N(0,1). Under the same null hypothesis, the t-statistic has Student's t distribution with n - 1 degrees of freedom.
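In MATLAB this is [h,p] = ttest(x, mu0). A Python/SciPy sketch (illustrative sample) comparing the hand formula above against the library routine:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=1.0, size=25)

mu0 = 10.0
# t = (xbar - mu) / (s / sqrt(n)), with s the sample standard deviation
t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(x.size))
t_sp, p_sp = stats.ttest_1samp(x, mu0)   # library equivalent
```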
Intro to Statistics Toolbox: Statistics Toolbox/Hypothesis Tests
- ttest2
- Performs a t-test of the null hypothesis that data in the vectors x and y are independent random samples from normal distributions with equal means; the unknown variances may be either equal or unequal.
- The formula for calculating the score for ttest2 (unequal-variance case) is:
    t = (x̄ - ȳ) / √(sx²/n + sy²/m)
- where x̄ and ȳ are the sample means, sx² and sy² are the sample variances, and n and m are the sample sizes
- The null hypothesis is that the two samples are distributed with the same mean.
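The unequal-variance (Welch) form of the two-sample statistic can be checked against SciPy; a sketch with illustrative samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(5.0, 1.0, size=40)
y = rng.normal(5.5, 1.0, size=50)

# Welch form: t = (xbar - ybar) / sqrt(sx^2/n + sy^2/m)
t = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / x.size
                                    + y.var(ddof=1) / y.size)
t_sp, p_sp = stats.ttest_ind(x, y, equal_var=False)   # library equivalent
```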
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- ANOVA (ANalysis Of VAriance)
- ANOVA is like a t-test among multiple (typically >2) data sets simultaneously
- t-tests can be done between two data sets, or one set and a true value
- uses the F-distribution instead of the t-distribution
- assumes that all of the data sets have equal variances
- One-way ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is:
    yij = α.j + εij
- where
- yij is a matrix of observations, in which each column represents a different group
- α.j is a matrix whose columns are the group means (the "dot j" notation means α applies to all rows of column j; that is, αij is the same for all i)
- εij is a matrix of random disturbances
- The model assumes that the columns of y are a constant plus a random disturbance. ANOVA tests whether the constants are all the same.
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- One-way ANOVA
- Example (Hogg and Ledolter): bacteria counts in milk. Columns represent different shipments; rows are bacteria counts from cartons chosen randomly from each shipment. Do some shipments have higher counts than others?
>> load hogg
>> hogg
hogg =
    24 14 11  7 19
    15  7  9  7 24
    21 12  7  4 19
    27 17 13  7 15
    33 14 12 12 10
    23 16 18 18 20
>> [p,tbl,stats] = anova1(hogg);
>> p
p = 1.1971e-04
- The standard ANOVA table has columns for the sums of squares (SS), degrees of freedom (df), mean squares (SS/df), F statistic, and p-value.
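The same one-way test can be reproduced outside MATLAB; a Python/SciPy sketch with the hogg values above entered by column (one list per shipment):

```python
from scipy import stats

# hogg data: each list is one shipment (one column of the MATLAB matrix)
shipments = [
    [24, 15, 21, 27, 33, 23],
    [14,  7, 12, 17, 14, 16],
    [11,  9,  7, 13, 12, 18],
    [ 7,  7,  4,  7, 12, 18],
    [19, 24, 19, 15, 10, 20],
]
F, p = stats.f_oneway(*shipments)   # one-way ANOVA across the five shipments
```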
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- One-way ANOVA (cont'd)
- In this case the p-value is about 0.0001, a very small value. This is a strong indication that the bacteria counts from the different shipments are not the same. An F statistic as extreme as this would occur by chance only about once in 10,000 times if the counts were truly equal.
- The p-value returned by anova1 depends on assumptions about the random disturbances εij in the model equation. For the p-value to be correct, these disturbances need to be independent, normally distributed, and have constant variance.
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- Multiple Comparisons
- Sometimes we need to determine not just whether there are differences among the means, but which pairs of means are significantly different.
- In a t-test, we compute a t-statistic and compare it to a critical value. However, when testing multiple pairs, if the probability of the t-statistic exceeding the critical value is 5%, then for 10 pairs it becomes much more likely that at least one of these will falsely exceed that criterion.
- Can perform a multiple comparison test using the multcompare function by supplying it with the stats output from anova1.
- Example:
>> load hogg
>> [p,tbl,stats] = anova1(hogg);
>> [c,m] = multcompare(stats);
- Example: see Light_DO.m
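multcompare adjusts for the multiple-testing problem described above. A simpler (and more conservative) alternative, sketched in Python/SciPy, is Bonferroni-corrected pairwise t-tests; this is not the same procedure multcompare uses, and the three groups here are just three of the hogg shipments:

```python
from itertools import combinations
from scipy import stats

groups = {
    "ship1": [24, 15, 21, 27, 33, 23],
    "ship2": [14,  7, 12, 17, 14, 16],
    "ship5": [19, 24, 19, 15, 10, 20],
}
pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)            # Bonferroni-corrected threshold

for g1, g2 in pairs:
    t, p = stats.ttest_ind(groups[g1], groups[g2])
    verdict = "different" if p < alpha else "not distinguishable"
    print(g1, g2, round(p, 4), verdict)
```

Dividing α by the number of comparisons keeps the family-wise error rate at (or below) 5%.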
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- Two-way ANOVA
- Determine whether data from several groups have a common mean. Differs from one-way ANOVA in that the groups in two-way ANOVA have two categories of defining characteristics instead of one (e.g., think of two independent variables/dimensions).
- Two-way ANOVA is again a special case of the linear model. The two-way ANOVA form of the model is:
    yijk = µ + α.j + βi. + γij + εijk
- where
- yijk is a matrix of observations (with rows i, columns j, and repetition k)
- µ is a constant matrix of the overall mean of the observations
- α.j is a matrix whose columns are the deviations of each observation attributable to the first independent variable; all values in a given column of α.j are identical, and the values in each row sum to 0
- βi. is a matrix whose rows are the deviations of each observation attributable to the second independent variable; all values in a given row of βi. are identical, and the values in each column sum to 0
- γij is a matrix of interactions; the values in each row sum to 0, and the values in each column sum to 0
- εijk is a matrix of random disturbances
- The model assumes that the columns of y are a series of constants plus a random disturbance. You want to know whether the constants are all the same.
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
Two-way ANOVA Example: Determine the effect of car model and factory on the mileage rating of cars. There are three models (columns) and two factories (rows). Data from the first factory are in the first three rows; data from the second factory are in the last three rows. Do some cars have different mileage than others?
>> load mileage
>> mileage
mileage =
    33.3000 34.5000 37.4000
    33.4000 34.8000 36.8000
    32.9000 33.8000 37.6000
    32.6000 33.4000 36.6000
    32.5000 33.7000 37.0000
    33.0000 33.9000 36.7000
>> cars = 3;
>> [p,tbl,stats] = anova2(mileage,cars);
Intro to Statistics Toolbox: Statistics Toolbox/Analysis of Variance
- Two-way ANOVA (cont'd)
- In this case the p-value for the first effect is zero to four decimal places. This indicates that the effect of the first predictor varies from one sample to another. An F statistic as extreme as this would occur by chance less than once in 10,000 times if the samples were truly equal.
- The p-value for the second effect is 0.0039, which is also highly significant. This indicates that the effect of the second predictor varies from one sample to another.
- There does not appear to be any interaction between the two predictors. The p-value, 0.8411, means that the observed result is quite likely (84 out of 100 times) given that there is no interaction.
- The p-values returned by anova2 depend on assumptions about the random disturbances εijk in the model equation. For the p-values to be correct, these disturbances need to be independent, normally distributed, and have constant variance.
- In addition, anova2 requires that the data be balanced, which means there must be the same number of samples for each combination of control variables. Other ANOVA methods support unbalanced data with any number of predictors.
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
- Linear Regression Models
- In statistics, linear regression models take the form of a summation of terms: coefficient × (independent variable, or combination of independent variables).
- For example:
    y = ß1 + ß2·x1 + ß3·x2 + ß4·x1·x2 + ß5·x1² + ε
- In this example, the response variable y is modeled as a combination of constant, linear, interaction, and quadratic terms formed from two predictor variables x1 and x2.
- Uncontrolled factors and experimental errors are modeled by ε. Given data on x1, x2, and y, regression estimates the model parameters ßj (j = 1, ..., 5).
- More general linear regression models represent the relationship between a continuous response y and a continuous or categorical predictor x in the form
    y = ß0 + ß1·f1(x) + ... + ßp·fp(x) + ε
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
Example (system of equations): Suppose we have a series of measurements of stream discharge and stage, measured at n different times.
time (day):         0     14    28    42    56    70
stage (m):          0.612 0.647 0.580 0.629 0.688 0.583
discharge (m3/s):   0.330 0.395 0.241 0.338 0.531 0.279
Suppose we now wish to fit a rating curve to these measurements. Let x = stage and y = discharge; then we can write this series of measurements as yi = m·xi + b, with i = 1...n. This in turn can be written as y = Xb, with
    X = [x1 1; x2 1; ...; xn 1],  b = [m; b]
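In MATLAB the least-squares solution of y = Xb is simply b = X\y. A Python/NumPy sketch of the same fit, using the stage and discharge values tabulated above:

```python
import numpy as np

stage = np.array([0.612, 0.647, 0.580, 0.629, 0.688, 0.583])      # x
discharge = np.array([0.330, 0.395, 0.241, 0.338, 0.531, 0.279])  # y

# Design matrix X = [x, 1]; solve y = X b in the least-squares sense
X = np.column_stack([stage, np.ones_like(stage)])
b, *_ = np.linalg.lstsq(X, discharge, rcond=None)
m, c = b   # slope and intercept of the rating curve
```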
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
(Figure: rating-curve fit, yi = m·xi + b, i.e., y = Xb)
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
- Example: Harmonic Analysis
- sin(θ+φ) = sin(θ)cos(φ) + sin(φ)cos(θ)
- Let A = C·cos(φ), B = C·sin(φ)
- => C·sin(ωt+φ) = A·sin(ωt) + B·cos(ωt)
- Linear regression: y = Xb
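The A, B substitution above turns amplitude/phase fitting into ordinary linear regression. A Python/NumPy sketch with synthetic data (the amplitude, phase, period, and noise level are illustrative values only, not tidal constants from the slides):

```python
import numpy as np

# Synthetic "tidal" record: C*sin(w*t + phi) plus noise (assumed values)
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)              # time (days)
w = 2 * np.pi / 0.5175                   # M2-like angular frequency, illustrative
C_true, phi_true = 1.5, 0.7
y = C_true * np.sin(w * t + phi_true) + 0.1 * rng.normal(size=t.size)

# Linear regression y = X b with X = [sin(wt), cos(wt)], b = [A; B]
X = np.column_stack([np.sin(w * t), np.cos(w * t)])
A, B = np.linalg.lstsq(X, y, rcond=None)[0]

C_hat = np.hypot(A, B)      # amplitude: C = sqrt(A^2 + B^2)
phi_hat = np.arctan2(B, A)  # phase, since A = C cos(phi), B = C sin(phi)
```

Additional columns sin(2ωt), cos(2ωt), ... extend the same regression to the higher harmonics (M4, M6, ...) discussed on the next slide.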
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
Example: Harmonic analysis (cont'd): Southampton surface currents. Harmonic analysis for M2, M4 = 2×M2, M6 = 3×M2, ...
- Note: tidal harmonics can cause the tidal cycle to appear asymmetric.
Intro to Statistics Toolbox: Statistics Toolbox/Regression Analysis
Generalized linear models (GLMs) are a flexible generalization of ordinary least squares regression. They relate the random distribution of the measured variable of the experiment (the distribution function) to the systematic (non-random) portion of the experiment (the linear predictor) through a function called the link function.
Generalized additive models (GAMs) are another extension of GLMs in which the linear predictor η is not restricted to be linear in the covariates X but is an additive function of the xi's. The smooth functions fi are estimated from the data. In general this requires a large number of data points and is computationally intensive.
Data Handling in Matlab: Useful Tidbits
- regress: performs multiple linear regression using least squares
- nlinfit: performs nonlinear least-squares regression
- glmfit: fits a generalized linear model