Title: Some statistics
1 Some statistics
- Use statistics as a drunkard uses a lamppost: for support, not for illumination.
2 Why use statistics?
- Assess significance and error (confidence)
  - What is the error of that C14 date?
  - I'm holding a pair of aces. How likely is it that someone else has a straight?
- Prediction
  - What are the chances of an earthquake here in the next 20 years?
3 Prediction in geology is hard
- Satellite orbits are an example of deterministic variables: they can be calculated. The sun rises every morning.
- Earthquakes are pretty random in time.
- This is characteristic of nonlinear systems such as long-term weather (first shown in the 1960s by Lorenz). Even if the current state is known almost exactly, tiny errors grow until the future cannot be modeled (chaotic behavior, the butterfly effect).
- Geological systems have lots of nonlinear effects.
- Statistics are often the only weapon for these types of systems.
4 Linear and non-linear
- However, many common geologic systems are modeled as linear systems.
- Linear systems follow these rules (superposition):
  - f(x + y) = f(x) + f(y)
  - f(cx) = c f(x)
- It means that effects can be added. The linear assumption is often a poor assumption, but nonlinear systems are hard.
- Is f(x) = ax^2 linear?
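One way to explore the closing question is to test the two superposition rules numerically. The sketch below is illustrative only: the helper name `is_linear`, the constant `a = 3.0`, and the sample points are assumptions made for the example, and passing on a handful of points is evidence of linearity, not a proof.

```python
def is_linear(f, points, scalars, tol=1e-9):
    """Check additivity and scaling of f on a few sample points."""
    additive = all(abs(f(x + y) - (f(x) + f(y))) <= tol
                   for x in points for y in points)
    scaling = all(abs(f(c * x) - c * f(x)) <= tol
                  for x in points for c in scalars)
    return additive and scaling

a = 3.0                           # arbitrary constant, chosen for illustration
points = [-2.0, 0.5, 1.0, 4.0]    # arbitrary test inputs
scalars = [-1.0, 2.0, 10.0]       # arbitrary scale factors

linear_result = is_linear(lambda x: a * x, points, scalars)
quadratic_result = is_linear(lambda x: a * x**2, points, scalars)
print(linear_result, quadratic_result)
```

The function f(x) = ax passes both checks, while f(x) = ax^2 fails: for example, f(1 + 1) = 4a but f(1) + f(1) = 2a.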
5 Data analysis
- With current technology, it is usually easy to make lots of measurements.
- Geophysics: a 3D seismic survey might have 1 x 10^11 measurements.
- But anything over a few dozen is hard to handle without statistics.
- Statistics are good for
  - Getting a feel for the data
  - Checking if two datasets are related
  - Testing a hypothesis rigorously
6 Some terms
- Random variable: represents the random outcome of a measurement.
- Distribution: the expected range of values for a random variable.
7 Discrete versus continuous
- Discrete: consists of sample points
- Continuous: functions don't have breaks
- We will mostly deal with discrete functions here. Mostly.
- To convert a continuous function to discrete, we need to sample it in some way.
- The Nyquist theorem says we have to sample at no less than twice the highest frequency in the signal to retain the shape (sampling at less than that will lead to significant error).
  - Tides vary twice per day, so we must sample at least 4 times per day.
  - Earthquake waves vary up to 50 times per second (50 Hz), so we must sample at least 100 times per second (100 Hz).
- Also applies to spatial sampling. If an ore body is 100 m across, we must sample at least every 50 m to get any sort of idea of shape.
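The cost of sampling too slowly can be seen with a small sketch. It assumes a pure 50 Hz cosine as the signal and an illustrative 60 Hz sampling rate, which is below the 100 Hz minimum; the function name `sample_cosine` is hypothetical.

```python
import math

def sample_cosine(freq_hz, sample_rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at t = k / sample_rate for k = 0..n_samples-1."""
    return [math.cos(2 * math.pi * freq_hz * k / sample_rate_hz)
            for k in range(n_samples)]

signal_hz = 50.0                 # highest frequency in the signal (from the text)
nyquist_rate = 2 * signal_hz     # minimum sampling rate, 100 Hz

# Undersample the 50 Hz wave at 60 Hz (assumed rate, below the minimum).
undersampled = sample_cosine(signal_hz, 60.0, 12)
# Sample a 10 Hz wave at the same 60 Hz rate.
alias = sample_cosine(10.0, 60.0, 12)

# The two sets of samples agree, so the 50 Hz wave looks like a 10 Hz wave.
indistinguishable = all(abs(u - a) < 1e-9 for u, a in zip(undersampled, alias))
print(nyquist_rate, indistinguishable)
```

Once the samples are taken, nothing in them distinguishes the true 50 Hz signal from the 10 Hz alias, which is why the sampling rate has to be chosen before the measurement.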
8 Accuracy and precision
- Accurate measurements are close to the truth.
- Precise measurements are close to each other (very little scatter) but may (or may not) be accurate.
- A set of very precise measurements may include a bias.
- Example: using a compass next to a large magnet, we can make very precise measurements, but the accuracy is terrible.
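The compass example can be put into numbers with a short sketch. All values here are hypothetical: a true bearing of 90 degrees and five made-up readings that cluster tightly around the wrong value.

```python
import statistics

true_bearing = 90.0  # hypothetical true bearing, in degrees
# Hypothetical readings taken next to a magnet: tightly grouped but offset.
readings = [121.8, 122.1, 122.0, 121.9, 122.2]

bias = statistics.mean(readings) - true_bearing   # accuracy: offset from truth
scatter = statistics.stdev(readings)              # precision: spread of readings
print(f"bias = {bias:.1f} deg, scatter = {scatter:.2f} deg")
```

A small scatter with a large bias is the signature of measurements that are precise but not accurate.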
9 Significant figures
A significant figure is a digit from 1 to 9, or a zero that is not a placeholder. It shows the precision of the measurement (but not necessarily the accuracy).
3.14567 has 6 significant figures.
0.00320 has 3 significant figures.
If we conduct calculations we must use the proper number of significant figures: use the number with the fewest significant figures. For example, 1.2 x 1.111111111111 = 1.3 (we round off the number of decimal places to ensure that the answer has the correct precision).
It is easy to calculate too many decimal places in Excel.
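Rounding to a fixed number of significant figures can be sketched in a few lines. The helper `round_sig` is a hypothetical name, not a built-in, and it handles only ordinary finite numbers.

```python
import math

def round_sig(value, sig_figs):
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))   # position of the leading digit
    return round(value, sig_figs - 1 - exponent)

product = 1.2 * 1.111111111111    # raw product carries far too many digits
rounded = round_sig(product, 2)   # 1.2 has only two significant figures
print(product, rounded)
```

The least precise input sets the number of figures to keep, so the product of the two numbers above is reported with two figures.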
10 Mean
- Known as arithmetic mean, mean, or average.
- We will deal largely with discrete samples.
- In Excel use average()
- Yields an unbiased estimate of the mean value (μ).
- Approaches the real expected value for large n.
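For reference, the arithmetic mean of n discrete samples is usually written as below. The symbols x_i (the individual samples) and x-bar (the sample mean) are notation assumed here, not taken from the slide.

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```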
11 Median
- The number in the middle of a set of numbers
- 50% of the numbers are above it; 50% are below it.
- Often represents a typical value better than the mean does.
- In Excel use median()
12 Average deviation and variance
- Measures how much a dataset varies around the mean.
- Average deviation (avedev)
- Sample variance (var)
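Both measures can be written out by hand, as in the sketch below. It assumes the usual definitions behind the Excel functions named above (mean absolute deviation for avedev, division by n - 1 for var); the data values and function names are hypothetical.

```python
def average_deviation(values):
    """Mean absolute deviation about the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations about the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical measurements
avedev = average_deviation(data)
var = sample_variance(data)
print(avedev, var)
```

The average deviation stays in the units of the data, while the variance is in squared units.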
13 Standard deviation
- Often useful to have the deviation from the average described in the same units as the original data.
- In Excel use stdev
- This is the n-1 form, whose variance estimate is unbiased (use it if the mean was calculated from the same samples).
- If we use n rather than n-1 we get a different estimate (more efficient but biased if the mean was calculated from the same data).
- Square root of the variance
- Both the variance and the standard deviation are commonly used to show the error in a measurement.
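The difference between dividing by n-1 and dividing by n is easy to see on a small dataset. The sketch below uses Python's standard `statistics` module and hypothetical data values.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical measurements

sd_n_minus_1 = statistics.stdev(data)   # divides by n - 1
sd_n = statistics.pstdev(data)          # divides by n
print(sd_n_minus_1, sd_n)
```

The n-1 form is always slightly larger, and the gap shrinks as the number of samples grows.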
14 These are estimates
- Need an infinite number of samples for the true values.
- Different subsets of the data will yield different estimates of the mean.
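A quick simulation illustrates both points. It assumes a made-up population of uniform random values on [0, 1), whose true mean is 0.5, and a fixed seed so the run is repeatable.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
# Hypothetical population: uniform values on [0, 1), true mean 0.5.
population = [rng.random() for _ in range(100_000)]

mean_a = statistics.fmean(population[:20])      # one small subset
mean_b = statistics.fmean(population[20:40])    # a different small subset
mean_all = statistics.fmean(population)         # all samples
print(mean_a, mean_b, mean_all)
```

The two small subsets give different estimates, while the estimate from all of the samples sits much closer to the true value.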
15 Difference between mean and median
- Assume we have three numbers
  - 1, 3, 10
- The mean (or average) is 4.6667
- The median is 3
- Example
  - L.T. averaged 5.2 yards per carry in 2006
  - Why doesn't the coach always run the ball?
  - Should easily get 10 yards in four downs, on average
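The three-number example can be checked with Python's standard `statistics` module. The sketch uses the numbers 1, 3, 10, the set that gives the mean and median quoted above.

```python
import statistics

numbers = [1, 3, 10]
mean_value = statistics.mean(numbers)       # 14 / 3
median_value = statistics.median(numbers)   # middle value of the sorted list
print(round(mean_value, 4), median_value)
```

One large value pulls the mean upward but leaves the median where it was.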
16 Propagation of errors
If we multiply by a constant number, we multiply the error by the same number.
(4.0 ± 0.3)(3.1456) = 12.6 ± 0.9
If we add two numbers, add the square of each error and then take the square root.
Δz = sqrt((Δx)^2 + (Δy)^2)
For multiplication and division, we add the squares of the ratios of the errors to the numbers.
(Δx/x)^2 + (Δy/y)^2 = (Δz/z)^2
(z)(sqrt((Δx/x)^2 + (Δy/y)^2)) = Δz
sqrt() means take the square root.
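The three rules translate directly into small functions. This is a minimal sketch that assumes the errors are independent; the function names `scale`, `add`, and `multiply` are hypothetical.

```python
import math

def scale(value, error, constant):
    """Multiply a measurement by an exact constant."""
    return value * constant, error * abs(constant)

def add(x, dx, y, dy):
    """Sum of two measurements: errors add in quadrature."""
    return x + y, math.sqrt(dx**2 + dy**2)

def multiply(x, dx, y, dy):
    """Product of two measurements: relative errors add in quadrature."""
    z = x * y
    return z, abs(z) * math.sqrt((dx / x)**2 + (dy / y)**2)

scaled = scale(4.0, 0.3, 3.1456)   # the constant-multiplication example
print(f"{scaled[0]:.1f} +/- {scaled[1]:.1f}")
```

Division uses the same relative-error rule as multiplication, with z = x / y in place of the product.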
17 Example
Suppose we want to estimate the amount of oil in a prospect we have just found. Using seismic data, we find that the trap is
300 ± 20 m high
1000 ± 10 m long
500 ± 5 m wide
It is composed of sandstone. A core through the sandstone yields 6 measurements of the porosity: 0.29, 0.19, 0.25, 0.23, 0.22, and 0.26.
The mean is 0.24 with a variance of 0.0012 and standard deviation of 0.035.
How much oil does it hold, with correct error estimates?
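The porosity statistics quoted above can be reproduced from the six core measurements. The sketch uses Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1.

```python
import statistics

porosity = [0.29, 0.19, 0.25, 0.23, 0.22, 0.26]   # core measurements from the text

mean_porosity = statistics.mean(porosity)
variance = statistics.variance(porosity)   # n - 1 form
std_dev = statistics.stdev(porosity)       # square root of the variance
print(round(mean_porosity, 2), round(variance, 4), round(std_dev, 3))
```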
18 Volume error
Volume = (300 ± 20)(1000 ± 10)(500 ± 5) = 1.5 x 10^8 m^3
Error = sqrt((20/300)^2 + (10/1000)^2 + (5/500)^2) x 1.5 x 10^8 = 1.0 x 10^7
Now we want to multiply by the porosity (0.24 ± 0.04):
(1.5 x 10^8 ± 1.0 x 10^7)(0.24 ± 0.04) = 3.6 x 10^7 ± 6.5 x 10^6 cubic meters of crude oil
There are about 6.3 barrels per cubic meter:
2.3 x 10^8 ± 4.1 x 10^7 barrels, or about 230 million barrels
At $100 per barrel: $23 ± 4 billion
For comparison, the last big discovery made in the Gulf of Mexico, Thunderhorse, holds about 1 billion barrels.
[Figure: the Thunderhorse platform after Hurricane Dennis]
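The whole chain of calculations can be sketched in code, which makes each step easy to re-check. The sketch assumes independent errors, a porosity error of 0.04, and a conversion factor of roughly 6.29 barrels per cubic meter; the helper name `product_with_error` is hypothetical.

```python
import math

def product_with_error(values, errors):
    """Multiply measurements; relative errors add in quadrature."""
    product = math.prod(values)
    relative = math.sqrt(sum((e / v) ** 2 for v, e in zip(values, errors)))
    return product, abs(product) * relative

# Trap dimensions in meters, from the text.
volume, volume_err = product_with_error([300.0, 1000.0, 500.0], [20.0, 10.0, 5.0])

# Porosity 0.24 +/- 0.04, from the text.
oil_m3, oil_m3_err = product_with_error([volume, 0.24], [volume_err, 0.04])

BARRELS_PER_M3 = 6.29    # assumed conversion factor
PRICE_PER_BARREL = 100   # dollars, from the text

# Multiplying by exact constants scales the value and the error alike.
barrels, barrels_err = oil_m3 * BARRELS_PER_M3, oil_m3_err * BARRELS_PER_M3
value, value_err = barrels * PRICE_PER_BARREL, barrels_err * PRICE_PER_BARREL

print(f"volume  = {volume:.2e} +/- {volume_err:.1e} m^3")
print(f"oil     = {oil_m3:.2e} +/- {oil_m3_err:.1e} m^3")
print(f"barrels = {barrels:.2e} +/- {barrels_err:.1e}")
print(f"value   = {value:.2e} +/- {value_err:.1e} dollars")
```

The porosity term dominates the final uncertainty, so more core measurements would tighten the estimate far more than a better seismic survey would.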
19 We can think of the histogram of pixels as a
distribution. It tells us the likelihood of
finding a specific color.
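Turning a histogram into a distribution only requires dividing each count by the total. The sketch below uses a hypothetical eight-pixel image described by color labels.

```python
from collections import Counter

# Hypothetical image, flattened to a list of color labels.
pixels = ["blue", "blue", "green", "blue", "white", "green", "blue", "white"]

counts = Counter(pixels)        # histogram: color -> number of pixels
total = sum(counts.values())
probabilities = {color: n / total for color, n in counts.items()}
print(probabilities)
```

The resulting values sum to 1, and each one is the chance that a randomly chosen pixel has that color.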
20 Some terms
- Random variable: represents the random outcome of a measurement.
- Probability distribution function (PDF): describes how often a particular measurement might occur.
[Figure: PDF for a normal coin (heads 0.5, tails 0.5) and PDF for a weighted coin that always lands heads (heads 1, tails 0); vertical axis from 0 to 1]
21 Answer
- L.T. averaged 5.3 yards per run in the last game, but the median run was 3 yards.
- A few long runs greatly increased the average.
- So the coach is not completely crazy.