Laboratory in Oceanography: Data and Methods

1
Laboratory in Oceanography: Data and Methods
Intro to the Statistics Toolbox
  • MAR599, Spring 2009
  • Miles A. Sundermeyer

2
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
  • Measures of Central Tendency
  • Geometric Mean
  • Harmonic Mean
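
Both means can be computed with base MATLAB alone. The sketch below assumes a small made-up sample x of positive values (not from the slides); the Statistics Toolbox functions geomean and harmmean should return the same numbers.

```matlab
% Hypothetical sample of positive values (not from the slides)
x = [1 2 4 8];

% Geometric mean: n-th root of the product, computed via logs
g = exp(mean(log(x)));

% Harmonic mean: reciprocal of the mean of the reciprocals
h = 1 / mean(1 ./ x);

% Arithmetic mean for comparison; for positive data h <= g <= a
a = mean(x);

fprintf('harmonic %.4f, geometric %.4f, arithmetic %.4f\n', h, g, a);
```

Both measures are only meaningful for positive data, which is why the sample contains no zeros or negative values.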

3
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
  • Measures of Dispersion
  • Interquartile range: difference between the 75th
    and 25th percentiles
  • Mean absolute deviation: mean(abs(x-mean(x)))
  • Moment: mean((x-mean(x)).^order) (e.g., order = 2
    gives variance)
  • Skewness: third central moment of x, divided by
    cube of its standard deviation (pos/neg skewness
    implies longer right/left tail)
  • Kurtosis: fourth central moment of x, divided by
    4th power of its standard deviation (high
    kurtosis means sharper peak and longer/fatter
    tails)
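
The definitions above translate directly into base MATLAB. The sketch below uses a made-up, right-skewed sample (an assumption for illustration) and the 1/n-normalized moments, so the standard deviation used for scaling is sqrt(m2) rather than std(x).

```matlab
% Hypothetical right-skewed sample (values chosen for illustration only)
x = [2 3 3 4 4 4 5 5 6 12];

mu    = mean(x);
madev = mean(abs(x - mu));     % mean absolute deviation
m2    = mean((x - mu).^2);     % 2nd central moment (variance, 1/n form)
m3    = mean((x - mu).^3);     % 3rd central moment
m4    = mean((x - mu).^4);     % 4th central moment
sigma = sqrt(m2);              % standard deviation consistent with m2

skew = m3 / sigma^3;           % > 0 indicates a longer right tail
kurt = m4 / sigma^4;           % equals 3 for a normal distribution

% Interquartile range from the sorted data (simple index-based estimate)
xs  = sort(x);
iqr_est = xs(ceil(0.75*numel(xs))) - xs(ceil(0.25*numel(xs)));
```

The single large value (12) is what makes the skewness positive here; the index-based interquartile range is a rough estimate and may differ slightly from the toolbox's interpolated percentiles.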

4
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
Examples of Skewness and Kurtosis
5
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
  • Bootstrap Method
  • Involves choosing random samples with replacement
    from a data set and analyzing each sample data
    set the same way as the original data set. The
    number of elements in each bootstrap sample set
    equals the number of elements in the original
    data set. The range of sample estimates obtained
    provides a means of estimating the uncertainty of
    the quantity being estimated.
  • In general, the bootstrap method can be used to
    compute uncertainty for any functional
    calculation, provided the sample data set is
    representative of the true distribution.
  • Jackknife Method
  • The jackknife is similar to the bootstrap, but
    re-samples by leaving out one observation at a
    time; it is used to estimate the bias and
    variance of sample statistics.
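
The two re-sampling schemes can be sketched in a few lines of base MATLAB. The data, the seed, and the number of bootstrap samples below are assumptions made for illustration; the Statistics Toolbox functions bootstrp and jackknife wrap the same idea.

```matlab
rng(0);                           % fix the seed so the sketch is repeatable
x = 10 + 2*randn(50, 1);          % hypothetical data: 50 samples
n = numel(x);
nboot = 1000;                     % number of bootstrap samples (assumed)

% Bootstrap: resample n values with replacement, repeat the analysis
est = zeros(nboot, 1);
for k = 1:nboot
    idx = randi(n, n, 1);         % n indices drawn with replacement
    est(k) = mean(x(idx));        % same statistic as for the original data
end
est = sort(est);
ci = [est(round(0.025*nboot)), est(round(0.975*nboot))];  % 95% percentile limits

% Jackknife: leave one observation out at a time
jk = zeros(n, 1);
for i = 1:n
    jk(i) = mean(x([1:i-1, i+1:n]));
end
jk_se = sqrt((n - 1)/n * sum((jk - mean(jk)).^2));  % jackknife standard error
```

For the sample mean, the jackknife standard error reduces to std(x)/sqrt(n), which makes a convenient check; for other statistics only the line computing the estimate needs to change.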

6
Intro to Statistics Toolbox
Statistics Toolbox/Descriptive Statistics
Example: Bootstrap method for estimating
uncertainty on the Lagrangian integral time scale
(from Sundermeyer and Price, 1998)
"Integrating the LACFs using 100 days as the
upper limit of the integral of Rii(t) in (12)
gives the integral timescales I(11,22) = (10.6 ±
4.8, 5.4 ± 2.8) days for the (zonal, meridional)
components, where uncertainties represent 95%
confidence limits estimated using a bootstrap
method [e.g., Press et al., 1986]."
7
Intro to Statistics Toolbox
Statistics Toolbox/Statistical Visualization
  • Probability Distribution Plots
  • Normal Probability Plots
  • >> x = normrnd(10,1,25,1);
  • >> normplot(x)

>> x = exprnd(10,100,1);
>> normplot(x)
8
Intro to Statistics Toolbox
Statistics Toolbox/Statistical Visualization
  • Probability Distribution Plots
  • Quantile-Quantile Plots
  • >> x = poissrnd(10,50,1); y = poissrnd(5,100,1);
  • >> qqplot(x,y)

>> x = normrnd(5,1,100,1);
>> y = wblrnd(2,0.5,100,1);
>> qqplot(x,y)
9
Intro to Statistics Toolbox
Statistics Toolbox/Statistical Visualization
  • Probability Distribution Plots
  • Cumulative Distribution Plots
  • >> y = evrnd(0,3,100,1);
  • >> cdfplot(y)
  • >> hold on
  • >> x = -20:0.1:10;
  • >> f = evcdf(x,0,3);
  • >> plot(x,f,'m')
  • >> legend('Empirical','Theoretical','Location','NW')

10
Intro to Statistics Toolbox
Statistics Toolbox/Probability Distributions/Supported Distributions
  • Supported distributions include a wide range of:
  • Continuous distributions (data)
  • Continuous distributions (statistics)
  • Discrete distributions
  • Multivariate distributions
  • http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/&http://www.mathworks.com/support/product/product.html?product=ST

11
Intro to Statistics Toolbox
Statistics Toolbox/Probability Distributions/Supported Distributions
Supported distributions (cont'd)
12
Intro to Statistics Toolbox
Statistics Toolbox/Probability Distributions/Supported Distributions
Supported statistics
13
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
  • Hypothesis Testing
  • Can only disprove a hypothesis
  • null hypothesis: an assertion about a
    population. It is "null" in that it represents a
    status quo belief, such as the absence of a
    characteristic or the lack of an effect.
  • alternative hypothesis: a contrasting assertion
    about the population that can be tested against
    the null hypothesis
  • H1: µ ≠ null hypothesis value (two-tailed
    test)
  • H1: µ > null hypothesis value (right-tail
    test)
  • H1: µ < null hypothesis value (left-tail test)
  • test statistic: a random sample of the population
    is collected, and a test statistic is computed to
    characterize the sample. The statistic varies
    with the type of test, but its distribution under
    the null hypothesis must be known (or assumed).
  • p-value: probability, under the null hypothesis,
    of obtaining a value of the test statistic as
    extreme or more extreme than the value computed
    from the sample.
  • significance level: threshold of probability α;
    a typical value of α is 0.05. If p-value < α, the
    test rejects the null hypothesis; if p-value > α,
    there is insufficient evidence to reject the null
    hypothesis.
  • confidence interval: estimated range of values
    with a specified probability of containing the
    true population value of a parameter.

14
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
  • Hypothesis Testing
  • Hypothesis tests make assumptions about the
    distribution of the random variable being sampled
    in the data. These must be considered when
    choosing a test and when interpreting the
    results.
  • The z-test (ztest) and the t-test (ttest) both
    assume that the data are independently sampled
    from a normal distribution.
  • Both the z-test and the t-test are relatively
    robust with respect to departures from this
    assumption, so long as the sample size n is large
    enough.
  • The difference between the z-test and the t-test
    is in the assumption about the standard deviation
    σ of the underlying normal distribution. A z-test
    assumes that σ is known; a t-test does not. Thus
    the t-test must estimate σ from the sample.

15
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
  • ztest
  • The test requires σ (the standard deviation of
    the population) to be known
  • The formula for calculating the z-score for the
    z-test is
    $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
  • where
  • x̄ is the sample mean, µ is the mean of the
    population, σ is the population standard
    deviation, and n is the sample size
  • The z-score is compared to a z-table, which
    contains the percent of area under the normal
    curve between the mean and the z-score. This
    table indicates whether the calculated z-score is
    within the realm of chance, or whether it is so
    different from the mean that the sample mean is
    unlikely to have happened by chance.
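
The z-table lookup can be replaced by a direct calculation. The sketch below is a minimal illustration in base MATLAB; the sample x, the null-hypothesis mean mu0, and the "known" sigma are all made-up values. With the Statistics Toolbox, ztest(x, mu0, sigma) performs the same test.

```matlab
% Hypothetical sample; mu0 and sigma are assumed known for the sketch
x     = [10.2 9.8 10.5 10.9 10.1 10.4 9.9 10.6];
mu0   = 10;      % mean under the null hypothesis
sigma = 0.4;     % known population standard deviation (assumed)
n     = numel(x);

z = (mean(x) - mu0) / (sigma / sqrt(n));   % z-score of the sample mean

% Two-tailed p-value from the standard normal distribution:
% 2*(1 - Phi(|z|)) equals erfc(|z|/sqrt(2)), so no table is needed
p = erfc(abs(z) / sqrt(2));

alpha  = 0.05;
reject = p < alpha;   % true means the null hypothesis is rejected
```

For a one-tailed test the p-value would be half of this value, provided the sign of z agrees with the direction of the alternative hypothesis.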

16
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
  • ttest
  • Like the z-test, except the t-test does not
    require σ to be known
  • The formula for calculating the t-score for the
    t-test is
    $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
  • where
  • x̄ is the sample mean, µ is the mean of the
    population
  • s is the sample standard deviation, and n is the
    sample size
  • Under the null hypothesis that the population is
    distributed with mean µ, the z-statistic has a
    standard normal distribution, N(0,1). Under the
    same null hypothesis, the t-statistic has
    Student's t distribution with n − 1 degrees of
    freedom.
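
A short check of the t-score against the toolbox function may help here. The sample and mu0 below are made up for illustration; the call to ttest assumes the Statistics Toolbox is installed.

```matlab
% Hypothetical sample and null-hypothesis mean
x   = [10.2 9.8 10.5 10.9 10.1 10.4 9.9 10.6];
mu0 = 10;
n   = numel(x);

s = std(x);                              % sample standard deviation (n-1 form)
t = (mean(x) - mu0) / (s / sqrt(n));     % t-score with n-1 degrees of freedom

% Toolbox version: h = 1 means the null hypothesis is rejected at the 5% level
[h, p, ci, stats] = ttest(x, mu0);

% stats.tstat should agree with t, and stats.df with n-1
fprintf('t = %.4f, toolbox t = %.4f, df = %d, p = %.4f\n', ...
        t, stats.tstat, stats.df, p);
```

The only difference from the z-test is that s is estimated from the same sample, which is why the reference distribution changes from N(0,1) to Student's t.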

17
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
  • ttest2
  • performs a t-test of the null hypothesis that
    data in the vectors x and y are independent
    random samples from normal distributions with
    equal means and unknown variances; the unknown
    variances may be assumed either equal or unequal.
  • The formula for calculating the t-score for
    ttest2 is
    $t = \dfrac{\bar{x} - \bar{y}}{\sqrt{s_x^2/n + s_y^2/m}}$
  • where
  • x̄, ȳ are the sample means, sx², sy² are the
    sample variances, and n, m are the sample sizes
  • The null hypothesis is that the two samples are
    distributed with the same mean.
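
The equal- and unequal-variance forms of the two-sample statistic can be compared by hand. As an illustration, the sketch below borrows the first two columns of the bacteria-count data used in the ANOVA example of these slides; treating them as a two-sample problem is an assumption made here, not something the slides do.

```matlab
% First two shipments from the bacteria-count (hogg) example
x = [24 15 21 27 33 23]';
y = [14  7 12 17 14 16]';
n = numel(x);  m = numel(y);

% Unequal-variance form of the statistic
t_unequal = (mean(x) - mean(y)) / sqrt(var(x)/n + var(y)/m);

% Equal-variance form uses a pooled variance estimate
sp2       = ((n - 1)*var(x) + (m - 1)*var(y)) / (n + m - 2);
t_pooled  = (mean(x) - mean(y)) / sqrt(sp2 * (1/n + 1/m));
df_pooled = n + m - 2;     % degrees of freedom for the pooled test
```

With equal sample sizes the two forms give the same statistic; they differ in the degrees of freedom used to turn it into a p-value.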

18
Intro to Statistics Toolbox
Statistics Toolbox/Hypothesis Tests
19
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
  • ANOVA (ANalysis Of VAriance)
  • ANOVA is like a t-test among multiple (typically
    >2) data sets simultaneously
  • t-tests can be done between two data sets, or one
    set and a true value
  • uses the F-distribution instead of the
    t-distribution
  • assumes that all of the data sets have equal
    variances
  • One-way ANOVA is a simple special case of the
    linear model. The one-way ANOVA form of the model
    is
    $y_{ij} = \alpha_{.j} + \varepsilon_{ij}$
  • where
  • yij is a matrix of observations; each column
    represents a different group.
  • α.j is a matrix whose columns are the group
    means. (The "dot j" notation means α applies to
    all rows of column j. That is, αij is the same
    for all i.)
  • εij is a matrix of random disturbances.
  • The model assumes that the columns of y are a
    constant plus a random disturbance. ANOVA tests
    whether the constants are all the same.

20
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
  • One-way ANOVA
  • Example: Hogg and Ledolter bacteria counts in
    milk. Columns represent different shipments, rows
    are bacteria counts from cartons chosen randomly
    from each shipment. Do some shipments have higher
    counts than others?
  • >> load hogg
  • >> hogg
  • hogg =
  • 24 14 11 7 19
  • 15 7 9 7 24
  • 21 12 7 4 19
  • 27 17 13 7 15
  • 33 14 12 12 10
  • 23 16 18 18 20
  • >> [p,tbl,stats] = anova1(hogg);
  • >> p
  • p = 1.1971e-04
  • The standard ANOVA table has columns for the sums
    of squares, dof, mean squares (SS/df), F
    statistic, and p-value.
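
The entries of the ANOVA table can be reproduced by hand, which shows where the F statistic comes from. The sketch below retypes the data from the slide; variable names are illustrative, and the final line assumes the Statistics Toolbox function fcdf is available.

```matlab
% Bacteria counts from the slide: rows are cartons, columns are shipments
hogg = [24 14 11  7 19
        15  7  9  7 24
        21 12  7  4 19
        27 17 13  7 15
        33 14 12 12 10
        23 16 18 18 20];

[n, k] = size(hogg);              % n observations per group, k groups
grand  = mean(hogg(:));           % overall mean
gmeans = mean(hogg, 1);           % column (group) means

ssb = n * sum((gmeans - grand).^2);                 % between-group sum of squares
ssw = sum(sum((hogg - repmat(gmeans, n, 1)).^2));   % within-group sum of squares

dfb = k - 1;                      % between-group degrees of freedom
dfw = k*(n - 1);                  % within-group degrees of freedom
F   = (ssb/dfb) / (ssw/dfw);      % ratio of mean squares

p = 1 - fcdf(F, dfb, dfw);        % upper-tail probability of the F distribution
```

If the data are typed correctly, p should agree with the value returned by anova1 above.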

21
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
  • One-way ANOVA (cont'd)
  • In this case the p-value is about 0.0001, a very
    small value. This is a strong indication that the
    bacteria counts from the different shipments are
    not the same. An F statistic as extreme as this
    would occur by chance only once in 10,000 times
    if the counts were truly equal.
  • The p-value returned by anova1 depends on
    assumptions about the random disturbances εij in
    the model equation. For the p-value to be
    correct, these disturbances need to be
    independent, normally distributed, and have
    constant variance.
22
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
  • Multiple Comparisons
  • Sometimes we need to determine not just whether
    there are differences among means, but which
    pairs of means are significantly different.
  • In a t-test, we compute a t-statistic and compare
    it to a critical value. When testing multiple
    pairs, however, false positives accumulate: if
    the probability of the t-statistic exceeding the
    critical value by chance is 5% for one pair, then
    for 10 pairs it is much more likely that at least
    one of them will exceed it by chance.
  • Can perform a multiple comparison test using the
    multcompare function by supplying it with the
    stats output from anova1.
  • Example
  • >> load hogg
  • >> [p,tbl,stats] = anova1(hogg);
  • >> [c,m] = multcompare(stats)
  • Example
  • see Light_DO.m
23
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
  • Two-way ANOVA
  • Determine whether data from several groups have a
    common mean. Differs from one-way ANOVA in that
    the groups in two-way ANOVA have two categories
    of defining characteristics instead of one (e.g.,
    think of two independent variables/dimensions)
  • Two-way ANOVA is again a special case of the
    linear model. The two-way ANOVA form of the model
    is
    $y_{ijk} = \mu + \alpha_{.j} + \beta_{i.} + \gamma_{ij} + \varepsilon_{ijk}$
  • where
  • yijk is a matrix of observations (with rows i,
    columns j, and repetition k).
  • µ is a constant matrix of the overall mean of
    the observations.
  • α.j is a matrix whose columns are deviations of
    each observation attributable to the first
    independent variable. All values in a given
    column of α.j are identical, and values in each
    row sum to 0.
  • βi. is a matrix whose rows are the deviations of
    each observation attributable to the second
    independent variable. All values in a given row
    of βi. are identical, and values in each column
    sum to 0.
  • γij is a matrix of interactions. Values in each
    row sum to 0, and values in each column sum to 0.
  • εijk is a matrix of random disturbances.
  • The model assumes that the columns of y are a
    series of constants plus a random disturbance.
    You want to know if the constants are all the
    same.

24
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
Two-way ANOVA Example: Determine the effect of car
model and factory on the mileage rating of
cars. There are three models (columns) and two
factories (rows). Data from the first factory are
in the first three rows, data from the second
factory are in the last three rows. Do some cars
have different mileage than others?
>> load mileage
>> mileage
mileage =
   33.3000   34.5000   37.4000
   33.4000   34.8000   36.8000
   32.9000   33.8000   37.6000
   32.6000   33.4000   36.6000
   32.5000   33.7000   37.0000
   33.0000   33.9000   36.7000
>> cars = 3;
>> [p,tbl,stats] = anova2(mileage,cars);
25
Intro to Statistics Toolbox
Statistics Toolbox/Analysis of Variance
  • Two-way ANOVA (cont'd)
  • In this case the p-value for the first effect is
    zero to four decimal places. This indicates that
    the effect of the first predictor varies from one
    sample to another. An F statistic as extreme as
    this would occur by chance only once in 10,000
    times if the samples were truly equal.
  • The p-value for the second effect is 0.0039,
    which is also highly significant. This indicates
    that the effect of the second predictor varies
    from one sample to another.
  • There does not appear to be any interaction
    between the two predictors. The p-value, 0.8411,
    means that the observed result is quite likely
    (84 out of 100 times) given that there is no
    interaction.
  • The p-values returned by anova2 depend on
    assumptions about the random disturbances εijk in
    the model equation. For the p-values to be
    correct, these disturbances need to be
    independent, normally distributed, and have
    constant variance.
  • In addition, anova2 requires that data be
    balanced, which means there must be the same
    number of samples for each combination of control
    variables. Other ANOVA methods support
    unbalanced data with any number of predictors.

26
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
  • Linear Regression Models
  • In statistics, linear regression models take the
    form of a summation of terms, each of the form
    coefficient × (independent variable or
    combination of independent variables).
  • For example
    $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon$
  • In this example, the response variable y is
    modeled as a combination of constant, linear,
    interaction, and quadratic terms formed from two
    predictor variables x1 and x2.
  • Uncontrolled factors and experimental errors are
    modeled by ε. Given data on x1, x2, and y,
    regression estimates the model parameters βj (j =
    0, ..., 5).
  • More general linear regression models represent
    the relationship between a continuous response y
    and a continuous or categorical predictor x in
    the form
    $y = \beta_1 f_1(x) + \dots + \beta_p f_p(x) + \varepsilon$

27
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Example (system of equations): Suppose we have a
series of measurements of stream discharge and
stage, measured at n different times.

time (day)         0      14     28     42     56     70
stage (m)          0.612  0.647  0.580  0.629  0.688  0.583
discharge (m3/s)   0.330  0.395  0.241  0.338  0.531  0.279

Suppose we now wish to fit a rating curve to these
measurements. Let x = stage, y = discharge; then
we can write this series of measurements as
yi = m xi + b, with i = 1, ..., n. This in turn
can be written as y = X b, where the vector b
holds the unknowns m and b, or
$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix}$
28
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
yi = m xi + b; y = X b
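
The system y = X b for the stage and discharge measurements can be solved in the least-squares sense with the backslash operator. The sketch below uses the six measurements from the example; the variable names (coef, yfit, resid) are illustrative.

```matlab
x = [0.612 0.647 0.580 0.629 0.688 0.583]';   % stage (m)
y = [0.330 0.395 0.241 0.338 0.531 0.279]';   % discharge (m^3/s)
n = numel(x);

X    = [x, ones(n, 1)];    % design matrix: one column per unknown
coef = X \ y;              % least-squares solution of y = X*coef
m = coef(1);               % slope
b = coef(2);               % intercept

yfit  = X * coef;          % fitted discharge at the measured stages
resid = y - yfit;          % residuals of the fit

fprintf('discharge = %.3f * stage + %.3f\n', m, b);
```

With the Statistics Toolbox, regress(y, X) should return the same coefficients along with optional confidence intervals.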
29
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
  • Example: Harmonic Analysis
  • sin(θ+φ) = sin(θ)cos(φ) + sin(φ)cos(θ)
  • Let A = C cos(φ), B = C sin(φ)
  • => C sin(ωt+φ) = A sin(ωt) + B cos(ωt)
  • Linear regression: y = Xb
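
Because the amplitude and phase enter linearly through A and B, a harmonic fit reduces to the same least-squares problem as any other linear regression. The sketch below fits a synthetic, noise-free record; the period, amplitude, phase, and sampling are all assumed values chosen for illustration.

```matlab
% Synthetic record (all values below are assumptions for the sketch)
T = 12.42;                     % M2 tidal period (hours)
w = 2*pi / T;                  % angular frequency (rad/hour)
t = (0:0.5:72)';               % sample times (hours)
Ctrue = 1.5;  phitrue = 0.7;   % amplitude and phase to recover
y = Ctrue * sin(w*t + phitrue);

X    = [sin(w*t), cos(w*t)];   % one column per unknown coefficient
coef = X \ y;                  % least squares for [A; B]
A = coef(1);  B = coef(2);

C   = sqrt(A^2 + B^2);         % amplitude, since A = C*cos(phi), B = C*sin(phi)
phi = atan2(B, A);             % phase (radians)
```

Additional constituents can be included by appending a sine and cosine column for each extra frequency; a column of ones would add a mean level.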

30
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Example: Harmonic analysis (cont'd): Southampton
surface currents. Harmonic analysis for M2,
M4 = 2×M2, M6 = 3×M2, ...
  • Note: Tidal harmonics can cause the tidal cycle
    to appear asymmetric.

31
Intro to Statistics Toolbox
Statistics Toolbox/Regression Analysis
Generalized linear models (GLMs) are a flexible
generalization of ordinary least squares
regression. They relate the random distribution
of the measured variable of the experiment (the
distribution function) to the systematic
(non-random) portion of the experiment (the
linear predictor) through a function called the
link function.
Generalized additive models (GAMs) are a further
extension of GLMs in which the linear predictor η
is not restricted to be linear in the covariates X
but is an additive function of the xi's:
$\eta = \beta_0 + f_1(x_1) + f_2(x_2) + \dots$
The smooth functions fi are estimated from the
data. In general this requires a large number of
data points and is computationally intensive.
32
Data Handling Matlab Useful Tidbits
  • Useful Tidbits
  • regress - performs multiple linear regression
    using least squares
  • nlinfit - performs nonlinear least-squares
    regression.
  • glmfit - fits a generalized linear model.