Statistics - PowerPoint PPT Presentation

Provided by: PeterW233
Transcript and Presenter's Notes

Title: Statistics


1
Statistics
  • Statistics is . . . . .
  • making a silk purse out of a pig's ear,
  • i.e. getting reliable results from data which is
    never quite as good as you hoped it was going to
    be.

N.B. The book used for these notes was
Statistics by R J Barlow, Wiley 1989.
2
How long is a piece of string?
  • N measurements x1, . . . . . xN of the length of
    a piece of string
  • Moments of the distribution: mean x̄ = (1/N) Σ xi
    and variance V = (1/N) Σ (xi − x̄)²
  • We suppose that x̄ is an estimator of the string
    length μ and √V an estimator of the measurement
    errors.

A good estimator should be unbiased (its
expectation value should equal the true value),
should be consistent (the limit as N → ∞ should
be the true value) and should be efficient (its
variance should be small).
(It turns out that the mean satisfies these
criteria, but the variance does not: it is biased.
However the expression s² = Σ (xi − x̄)²/(N − 1)
is unbiased.)
But what do we mean by the true value?
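The biased and bias-corrected variance estimators above can be sketched numerically; the measurement values here are hypothetical (a "true" length of 10.0 with Gaussian errors of standard deviation 0.5):

```python
import random

# Hypothetical string measurements: true length 10.0, Gaussian errors of s.d. 0.5
random.seed(1)
xs = [10.0 + random.gauss(0.0, 0.5) for _ in range(8)]

N = len(xs)
mean = sum(xs) / N                        # unbiased estimator of the true length
V = sum((x - mean) ** 2 for x in xs) / N  # biased: underestimates the true variance
s2 = V * N / (N - 1)                      # Bessel's correction removes the bias

print(mean, V, s2)
```

Averaged over many such simulated datasets, V comes out systematically below the true variance of 0.25, while s² does not.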
3
The population model
  • Set up a model of the measurement, where each
    observation is subject to an error ei which is a
    random variable of mean 0 and variance V, which
    added to μ gives xi.
  • This model dataset has an infinite number of
    randomly-distributed observations from which our
    real dataset has been sampled.
  • The expectation value ⟨x̄⟩ is evidently μ.
    Because we know the probability distribution of
    the xi we can answer questions like "if I repeat
    this experiment 1000 times, what will be the
    magnitude and distribution of the values of
    x̄ − μ?" and so deduce the accuracy of x̄.

4
Statistical inference
  • This is an example of the process of statistical
    inference, where we set up a model of the
    measurement and then deduce its properties using
    suitable estimators. For more complicated models
    than the one used here we might use a fitting
    process to deduce the values of several
    parameters of the model, but the process is
    essentially the same.

Conceptually, we can repeat the measurement as
often as we like. This allows us to answer the
question "what is the probability that I will
get a value of x̄ in the range a to b?" by
repeating the (thought) experiment many times and
finding the fraction which give a result in this
range. This is the frequentist approach to
defining the probability of an outcome.
The obvious problem is that the idea is highly
artificial: no one would really try to answer
these questions by actually repeating the
experiment, and sometimes in astronomy that is not
even possible in principle. Observations of a
classical gamma-ray burst could not be repeated
because that star no longer exists! On the other
hand it defines clearly what we mean by
probability, which otherwise is not easy.
5
The Calculus of Probabilities
  • Suppose a certain event has two possible
    outcomes A and B with probabilities P(A) and
    P(B). Also let P(A|B) be the probability of A
    given B.
  • Then, if A and B are mutually exclusive:
  • P(A) + P(B) = 1
  • P(A·B) = 0
  • P(A + B) = 1

If A and B are independent, P(A|B) = P(A).
Otherwise, more generally, P(A + B) = P(A) +
P(B) − P(A·B), and P(A·B) = P(A) P(B|A) =
P(B) P(A|B).
The last expression can be re-arranged to give
Bayes' Theorem: P(A|B) = P(A) P(B|A) / P(B).
These axioms are the basis of the mathematical
theory of probability.
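A minimal numerical illustration of Bayes' Theorem; the probabilities below are hypothetical, with P(B) obtained from the law of total probability:

```python
# Hypothetical numbers: P(B) is obtained from the law of total probability.
P_A = 0.01              # P(A)
P_B_given_A = 0.95      # P(B|A)
P_B_given_notA = 0.10   # P(B|not A)

P_B = P_B_given_A * P_A + P_B_given_notA * (1 - P_A)
P_A_given_B = P_A * P_B_given_A / P_B   # Bayes' Theorem
print(round(P_A_given_B, 4))
```

Note how a small prior P(A) keeps the posterior P(A|B) small even when P(B|A) is large.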
6
The Bayesian approach
  • Suppose P(m|d) is the probability of our
    model, given the observed data (the posterior
    probability); then by Bayes' Theorem we can write
  • P(m|d) = P(m) P(d|m) / P(d)
  • where P(m) is the probability of the model
    before we take into account our new measurements
    (the prior probability),
  • P(d|m) is the probability of getting our
    particular dataset if the model is true,
  • and P(d) is a normalisation constant which makes
    the sum of P(m|d) equal to 1, taken over all
    possible models.

Now P(m|d) is just exactly what we would like
to know: going back to our piece of string,
P(μ|x̄) is much better than P(x̄|μ), which is what
we get from the frequentist analysis. But nobody
knows what this probability really means, and
they attach different names (propensity,
plausibility, confidence) which make it plain it
is somewhat subjective, unlike the frequentist
probability which has an objective meaning.
In addition it is usually difficult to know what
value to give the prior; normally it is given
some rather bland value, implying little prior
knowledge.
7
Probability theory and statistical inference
How do we calculate the probability P(d|m) of
getting a given dataset from a knowledge of the
true model? This is probability theory.
Conversely, statistical inference (parameter
estimation) asks: what can we say about the
probability P(m|d) of the model, or its
parameters, from the properties of our dataset?
8
(No Transcript)
9
Probability distributions
The specification of the model will tell us the
distribution function which we must use to
calculate the probability P(r) of any particular
outcome r for a measurement (an error ei or a
number of photons).
For example, if we are observing random events
which occur at a mean rate of ? per sec, then we
must use a Poissonian distribution with that
mean value.
If we are measuring the heights of students in a
class, with mean ? and variance ?2, we must use
a Gaussian distribution with those parameters
(Central Limit Theorem).
Note that r is discrete but x is a continuous
variable, so PG is a probability density:
dPG = PG dx.
10
Poisson distribution

P(r) = e^(−λ) λ^r / r!

The mean ⟨r⟩ = λ, of course, but in addition the
variance V = λ also.
Gaussian distribution

P(x) = exp(−(x − μ)²/2σ²) / √(2πσ²)

For large λ, the Poisson distribution tends to a
Gaussian of mean and variance λ.
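The Poisson-to-Gaussian convergence is easy to check numerically; a small sketch comparing the two distributions near the peak for λ = 100:

```python
import math

def poisson(r, lam):
    # P(r) = exp(-lam) * lam**r / r!
    return math.exp(-lam) * lam ** r / math.factorial(r)

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

lam = 100
# For large lambda the Poisson probabilities approach a Gaussian
# of mean lambda and variance lambda
for r in (90, 100, 110):
    print(r, round(poisson(r, lam), 5), round(gauss(r, lam, lam), 5))
```

At λ = 100 the two agree to a few parts in a thousand near the peak; at small λ the Poisson distribution is visibly skewed and the approximation fails.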
11
The χ² distribution
Another important distribution is that of χ², the
statistic often used in fitting. Its probability
density is
P(χ²; n) = (χ²)^(n/2 − 1) e^(−χ²/2) / (2^(n/2) Γ(n/2))
where n = N − m is the number of degrees of
freedom (N is the number of data points fitted,
and m the number of parameters optimised) and Γ
is the gamma function.
12
The Central Limit Theorem
It is noticeable that Poissonian and χ²
distributions tend to become more symmetrical
and to look more like a Gaussian at large λ or n,
respectively; they do in fact tend to
Gaussians. Another important case is where the
distribution of a quantity (like the heights of
a population) depends on a number of independent
factors: it then always tends to be
Gaussian-distributed. The Central Limit Theorem
states that any variable which depends on a large
number of independent factors will be
Gaussian-distributed.
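A quick simulation of the theorem: summing 12 independent uniform "factors" (each of mean 1/2 and variance 1/12) gives a variable that is already very nearly Gaussian, with mean 6 and variance 1:

```python
import random
import statistics

random.seed(2)
# Each value is the sum of 12 independent uniform factors;
# the sum is close to Gaussian with mean 6 and variance 12 * (1/12) = 1
sums = [sum(random.random() for _ in range(12)) for _ in range(20000)]
print(round(statistics.mean(sums), 2), round(statistics.pstdev(sums), 2))
```

A histogram of `sums` is visually indistinguishable from the matching Gaussian except far out in the tails.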
13
Model-fitting with Maximum Likelihood
  • Calculate P(d|m) for the particular case of
    models of photon-counting.
  • Usually called the likelihood of the data; the
    only difference is normalisation:
  • L(d; a) = P(d|m)
  • so lnL(a) = Σi ln P(ni; a), summed over the
    data points.

Next we maximise the likelihood by finding the
values of the parameters a for which
∂lnL/∂a = 0.
14
Example
  • Model: a constant source of counting rate a.
  • Since the statistics are Poissonian, we have
    lnL(a) = Σi (ni ln a − a) + const, so
    ∂lnL/∂a = Σi ni/a − N = 0.
  • So the Maximum Likelihood value is just the
    mean: â = (1/N) Σi ni.
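This can be verified with a brute-force scan of the Poisson log-likelihood over candidate rates; the counts below are hypothetical:

```python
import math

counts = [3, 5, 4, 6, 2, 4]   # hypothetical counts in equal time bins

def lnL(a):
    # Poisson log-likelihood for a constant rate a, dropping the ln(n!) terms
    return sum(n * math.log(a) - a for n in counts)

# Scan candidate rates; the likelihood peaks at the sample mean
best = max((k / 100 for k in range(1, 1001)), key=lnL)
print(best, sum(counts) / len(counts))
```

The grid maximum lands on the sample mean of 4.0, as the analytic result requires.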

15
In reality . . .
  • Of course the model is likely to be much more
    complicated,
  • e.g. a point source of strength a at position
    (ξ, η) in the field of view, together with a
    background b, would be modelled by
  • m(x, y) = b + a·h(x − ξ, y − η)
  • where h is the point spread function of the
    image.

Now lnL has to be maximised with respect to the 4
parameters a, b, ξ and η, so it must be done
numerically.
Note that with ML/Poissonian statistics it is
never correct to subtract the background before
fitting as it changes the statistics.
16
Finally, in real life (i.e. LAT)
  • The source model is considered as
  • This model is folded with the Instrument Response
    Functions (IRFs) to obtain the predicted counts
    in the measured quantity space (E,p,t)
  • where the combined IRF and the orientation
    vector of the spacecraft enter. The integral is
    performed over the Source Region, i.e. the sky
    region encompassing all sources contributing to
    the Region of Interest (ROI).

17
Extended ML fitting
  • The above formulation of the likelihood is
    correct only if the total number of photon events
    is predetermined, by stopping the exposure when a
    desired value has been reached. In practice, of
    course, we normally collect events until a desired
    exposure time is reached, in which case the total
    number of events is a parameter of the model.
    This changes the form of lnL slightly; the main
    effect is to increase the estimated errors.

18
Bayesian Model-fitting
  • From Bayes' Theorem, P(m|d) = P(m) P(d|m) / P(d)
  • If we have no prior information about the model
    parameters, P(m) will be uniform over the
    parameter space and P(d) is only a normalising
    factor, so the parameter values obtained from
    the ML analysis, which maximise P(d|m), will
    also maximise P(m|d) and so are the Bayesian
    values.

19
χ² (Least Squares) Model-fitting
  • If our data is binned,
  • and if the number of photon events per bin is
    large enough (>5, 10 or 20, say),
  • then we can replace the Poisson distribution in
    the model by a Gaussian distribution,
  • and for a one-parameter model
    lnL(m) = −Σi (xi − m)²/2σ² + const.

For a multi-parameter model
lnL = −Σi (xi − mi)²/2σi² + const, where mi and
σi² are the predicted mean and variance in the
ith bin.
Maximising lnL is equivalent to minimising
χ² = Σi (xi − mi)²/σi², so Least
Squares fitting is just ML fitting with large
numbers of events. In fact, in place of lnL,
the Cash statistic C = −2 lnL can be used, which
makes the correspondence even closer.
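For a constant (one-parameter) model the χ² minimum has a closed form, the weighted mean, which a short sketch can confirm; the data and variances below are hypothetical:

```python
# One-parameter (constant) model fitted by least squares: the chi-squared
# minimum is the weighted mean of the data (hypothetical values below).
data = [12.1, 9.8, 11.0, 10.4]
var  = [1.0, 1.5, 0.8, 1.2]      # assumed variances per bin

def chi2(m):
    return sum((x - m) ** 2 / v for x, v in zip(data, var))

m_hat = sum(x / v for x, v in zip(data, var)) / sum(1 / v for v in var)
# The analytic weighted mean should beat nearby trial values
print(round(m_hat, 3), chi2(m_hat) < chi2(m_hat + 0.1), chi2(m_hat) < chi2(m_hat - 0.1))
```

Setting dχ²/dm = 0 gives the weighted-mean formula directly, which is why the check succeeds.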
20
Pros and cons of χ²
  • Cons:
  • (i) Data must be binned, reducing resolution.
  • (ii) What value to use for the estimate of σi²?
  • Commonly, xi is used, but this introduces a bias.
  • Other choices have been proposed. As an example,
    Keith Arnaud did 1000 simulations of a Chandra
    power-law spectrum of slope 1.7 with 1000 events
    in the spectrum and fitted this data using
    different fit statistics, to get the following
    results for the fitted slope:
  • Standard χ²: 1.41 (−0.15, +0.16)
  • Gehrels χ²: 1.88 (−0.22, +0.25)
  • Churazov χ²: 1.69 (−0.15, +0.16)
  • ML: 1.70 (−0.13, +0.14)
  • The superiority of ML is obvious; it is partly
    due to this issue of the estimate of the
    variance, and partly to the use of the
    approximate distribution function.

Pros: χ² has one important strength, however: its
distribution is known, which allows an estimate
to be made of the goodness-of-fit of the model.
21
Calculating Optimal Parameter Values
  • The calculation of maximal/minimal statistics is
    usually done numerically using a standard
    computer package.
  • BEWARE!!
  • Suppose we need to minimise a statistic S using a
    model with m free parameters. The simplest method
    is, starting from a suitably-chosen point on the
    S-m surface, to calculate the direction of
    steepest descent and take a small step in that
    direction. Repeating this procedure, one will
    eventually arrive at a minimum.

The problem is that unless m is small, the S-m
surface may well have multiple minima and the one
found may be only a local minimum. Always repeat
the calculation from a large range of initial
parameters and make sure the fit always converges
on the same solution.
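The local-minimum trap can be sketched in one dimension with a toy statistic that has two minima; naive steepest descent converges to different answers from different starting points, which is exactly why multiple starts are needed:

```python
# Sketch of the local-minimum trap: a 1-D statistic S(a) with two minima.
# Steepest descent converges to different minima from different starts,
# so the fit must be repeated from a range of initial values.
def S(a):
    return (a ** 2 - 1) ** 2 + 0.3 * a

def dS(a):
    return 4 * a * (a ** 2 - 1) + 0.3

def descend(a, step=0.01, iters=5000):
    for _ in range(iters):
        a -= step * dS(a)
    return a

minima = sorted({round(descend(a0), 3) for a0 in (-2.0, -0.5, 0.5, 2.0)})
print(minima, [round(S(a), 3) for a in minima])
```

Starts on the left fall into the (global) minimum near a = −1; starts on the right fall into the shallower local minimum near a = +1.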
22
Confidence intervals
We now assume that this means that μ will lie
within this range in 90% of experiments.
23
Reverend Bayes is back!! This is a Bayesian
statement, based on a uniform prior. Suppose we
estimate a parameter m to be x, and that we have
an estimate of the error s. We assume that x is
Gaussian-distributed and use a prior P(m) = 1.
Note that this is correctly normalised: x must
have a value between −∞ and +∞, and P integrated
over this range is 1. If we know m must be
positive, we instead use a prior P(m) = 1 for
m ≥ 0 and 0 otherwise, and now we must
re-normalise the posterior over the physical
region m > 0.
24
Dealing with negative intensities
We can use this result to advantage in those
cases where the confidence interval includes
physically-nonsensical negative values, e.g. an
intensity of 0.06 ± 0.10 units. Using the
re-normalised posterior gives the 90% confidence
interval as 0.0 to 0.19 instead of −0.04 to 0.16.
See Barlow, p. 130.
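The re-normalised posterior is easy to evaluate numerically; the sketch below finds a 90% upper limit for the 0.06 ± 0.10 measurement (the exact numbers depend on the interval convention chosen, so they need not match the interval quoted from Barlow):

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

x, s = 0.06, 0.10        # measured intensity and its error
norm = 1.0 - phi(-x / s) # posterior renormalised to the physical region m > 0

def cdf(m):
    # posterior probability that the true intensity is below m
    return (phi((m - x) / s) - phi(-x / s)) / norm

# 90% upper limit by bisection
lo, hi = 0.0, 1.0
while hi - lo > 1e-7:
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if cdf(mid) < 0.90 else (lo, mid)
print(round(0.5 * (lo + hi), 3))
```

The truncation pushes the upper limit above the naive Gaussian value, because probability from the unphysical m < 0 region is redistributed to m > 0.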
25
Confidence Intervals in Model-fitting
  • The fit statistic gives directly the relative
    probability attached to any set of parameter
    values; this is true whether we use lnL or χ².
  • If the prior is bland, this is also the model
    probability given the observed data.
  • The parameter for which an error estimate is
    required is varied in steps, whilst at each stage
    re-optimising the rest, until the change ΔS in
    the fit statistic S reaches a required value, say
    2.71.

Δχ² for one parameter is distributed as χ²(1);
for two parameters, as χ²(2).
Confidence: 68% 90% 95% 99%
ΔS (one parameter): 1.0 2.71 3.84 6.63
ΔS (two parameters): 2.30 4.61 5.99 9.21
For ML, Δ(lnL) is distributed as −χ²/2
provided the number of data points n is large,
so these values can be multiplied by −1/2.
If two parameters are varied at once, the line
ΔS = 2.3 is an ellipse which traces out the 68%
confidence region. If the axes of the ellipse are
not parallel to the parameter axes, then the two
parameters are correlated.
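The ΔS stepping procedure can be sketched on a small weighted-mean example (hypothetical data); stepping the parameter until Δχ² = 1.0 recovers the 68% one-parameter error, which for this linear model is also known analytically:

```python
# Profile the chi-squared of a weighted-mean example: step the parameter
# until chi2 rises by 1.0 above its minimum (the 68% one-parameter error).
data = [12.1, 9.8, 11.0, 10.4]
var  = [1.0, 1.5, 0.8, 1.2]      # assumed variances

def chi2(m):
    return sum((x - m) ** 2 / v for x, v in zip(data, var))

m_hat = sum(x / v for x, v in zip(data, var)) / sum(1 / v for v in var)

m = m_hat
while chi2(m) - chi2(m_hat) < 1.0:
    m += 1e-4
sigma = m - m_hat

# For this linear model the error is known analytically: 1 / sqrt(sum of weights)
print(round(sigma, 3), round((1.0 / sum(1 / v for v in var)) ** 0.5, 3))
```

There is only one parameter here, so nothing needs re-optimising at each step; in a multi-parameter fit the remaining parameters would be re-fitted at every step of the scan.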
26
Correlated variables
27
Goodness of Fit
  • It is obvious from the expression
    χ² = Σi (xi − mi)²/σi²
    that we would expect the minimum value of χ² to
    be close to N, and this is true: the distribution
    of χ² is as given earlier, where
  • n = N − m is the number of degrees of freedom,
    N is the number of data points and m is the
    number of parameters in the model.
  • This is true for all n, but if n is large enough
    the distribution approaches a Gaussian of mean n
    and variance 2n.

This assumes the model is a good one, and so can
be used as a test of that hypothesis, by
comparing χ²min, i.e. the best-fit
value, with the expected value. If n is large,
we compare the reduced χ²min / n with 1.
28
(No Transcript)
29
A good agreement implies the model is a good
description of the data; if the reduced χ² is too
large, not only is the model poor, but the size
of the excess indicates how likely such a value
is by chance.
For example, a bremsstrahlung fit to a 20-channel
spectrum gives χ²min = 35. There are 2 parameters,
so there are 18 dof. χ²min is suspiciously
high: P ≈ 0.01 by chance.
Alternatively, a low value implies that the
errors are overestimated.
This is a very valuable property of χ² fitting
and, perhaps, the only reason to consider using
it. Unfortunately, there is no ML equivalent.
The usual solution is to run e.g. 1000 Monte
Carlo simulations of the experiment.
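The tail probability for the bremsstrahlung example can be computed directly; for an even number of degrees of freedom the χ² survival function reduces to a Poisson CDF, so no special library is needed:

```python
import math

def chi2_sf(x, n):
    # P(chi2 > x) for even n degrees of freedom: equal to the CDF of a
    # Poisson variable of mean x/2 evaluated at n/2 - 1 (even n only)
    term, total = 1.0, 1.0
    for k in range(1, n // 2):
        term *= (x / 2.0) / k
        total += term
    return math.exp(-x / 2.0) * total

# The bremsstrahlung example: chi2_min = 35 with 18 dof
print(round(chi2_sf(35.0, 18), 4))   # roughly 0.01
```

For odd n the same quantity requires the incomplete gamma function (e.g. `scipy.stats.chi2.sf`).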
30
How many Parameters?
  • Adding more adjustable parameters will usually
    make a model fit better.
  • Example: try adding a power-law component (2
    extra parameters) to the bremsstrahlung.
  • χ²min drops to 18 with 16 dof: obviously a
    better fit.
  • How can you know when to stop?

The first answer is just to stop when you get an
acceptable reduced χ², but the F test gives a
more rigorous answer.
31
The F test
  • Suppose an initial model with νi degrees of
    freedom gives a χ² of Si, and after fitting extra
    parameters one gets Sf with νf dof.
  • Calculate F = ((Si − Sf)/(νi − νf)) / (Sf/νf)
  • and look up the significance of the result in a
    table of the F-distribution.

(It is essential to note that books usually give
this test in its simpler, original form and the
tables to go with it. You need tables of the
distribution of F(νi − νf, νf), as in Bevington,
Data Reduction and Error Analysis for the
Physical Sciences.)
For the example, F = ((35 − 18)/2) / (18/16) = 7.6.
The F-test table shows the chance of exceeding
6.23 by chance is 0.01, so the power law is
significant at 99% confidence.
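The arithmetic of the example, written out as code (the critical value 6.23 still has to come from an F(2, 16) table or a library such as `scipy.stats.f`):

```python
# F statistic for nested chi-squared fits, using the slide's numbers:
# S_i = 35 with nu_i = 18 dof; adding the power law gives S_f = 18, nu_f = 16.
S_i, nu_i = 35.0, 18
S_f, nu_f = 18.0, 16

F = ((S_i - S_f) / (nu_i - nu_f)) / (S_f / nu_f)
print(round(F, 1))   # 7.6, to be compared with the F(2, 16) distribution
```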
32
Conditions for using the F-test
  • The F-test can only validly be used if two
    conditions are satisfied:
  • the simpler model is nested within the more
    complex one;
  • the null hypothesis does not lie on a
    boundary of the parameter space
  • (the problematic cases include an added spectral
    line, point source or power law).
  • See Protassov et al., ApJ 2002, 571, 545.

33
The Maximum Likelihood Test Statistic
The F-statistic is closely related to the test
statistic TS used to determine the significance
of a possible point source (see below):
TS = 2 (lnL1 − lnL2)
If N is large, TS is approximately distributed as
χ²(nf − ni)/2. For a point source, this is
χ²(4)/2 (position, flux, spectral index) or
χ²(3)/2 (position, intensity).
TS = 25 for 4 additional parameters corresponds
to P ≈ 2×10⁻⁵.
Rule of thumb: TS ≈ s² (s is actually 4.1 in this
case).
There are 20000 PSFs in the LAT sky, so 0.4
spurious sources are expected.
34
Hypothesis testing
  • We started by considering the problem of
    parameter estimation.
  • We have now moved into hypothesis testing, e.g.
    that a spectrum includes a significant power-law
    component.
  • An important tool of hypothesis testing is the
    Null Hypothesis, useful where a testable form of
    the hypothesis cannot be set up.

35
The Null Hypothesis
Suppose that we count photons from a source in
constant intervals, and find the mean number of
events in each is λ. Then in one interval we
measure some (greater) number, l. Is this due to
a flare in the source?
First we formulate the null hypothesis: there is
no flare, and the measurement l is simply drawn
from the same population as all the others.
The probability of obtaining a count of l or
greater is
P(≥ l) = Σ (k = l to ∞) e^(−λ) λ^k / k!
which for λ = 10, l = 20 is about 0.35%. How
significant this result is depends on how many
measurements in all we have made.
In 1000 intervals, 3 or 4 would give this value
by chance alone.
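The Poisson tail sum is straightforward to evaluate; for λ = 10 and l = 20 it comes out at about 0.0035:

```python
import math

def poisson_sf(l, lam):
    # P(X >= l) for a Poisson variable of mean lam
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(l))

print(poisson_sf(20, 10.0))   # about 0.0035
```

Multiplying this single-trial probability by the number of intervals examined gives the expected number of chance occurrences, which is the relevant figure for significance.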
36
Source Searching
  • Searching an image for point sources is another
    example of hypothesis testing, where one is
    looking for pixels (or groups of pixels) which
    contain a significant excess of counts.
  • One estimates the background b from some selected
    large region of the image and compares this with
    the number of counts n in each pixel.
  • The significance of a hypothetical source is then
    (n − b)/√b in standard deviations, and the
    hypothesis that there is no source present can
    then be rejected at some level of confidence.

A group of pixels would be used to increase the
sensitivity of the test when the image point
spread function covers several pixels.
A maximum probability of 10⁻⁶ ("5σ") is usually
taken as a realistic value when searching for
point sources in an image, which can easily have
10⁵ pixels.
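The significance formula in one line, with hypothetical pixel counts:

```python
import math

b = 100.0   # background counts estimated for a pixel (hypothetical)
n = 155     # observed counts in the pixel

significance = (n - b) / math.sqrt(b)
print(significance)   # 5.5 sigma, above the usual 5-sigma search threshold
```

Note that this Gaussian approximation assumes b is large; for few-count pixels the exact Poisson tail probability should be used instead.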
37
Timing Analyses
  • Dave showed data from the Vela Pulsar using
    gtptest

38
Testing Pulsation Significance: Output from
gtptest on Vela
  • Type of test: Chi-squared Test, 10 phase bins
  • Probability distribution: Chi-squared, 9 degrees
    of freedom
  • Test Statistic: 824.028880866426
  • Chance Probability Range: (0, 2.03757046903054e-99)
  • Type of test: Rayleigh Test
  • Probability distribution: Chi-squared, 2 degrees
    of freedom
  • Test Statistic: 46.2571601550502
  • Chance Probability Range: (9.02305042259081e-11,
    9.02395272685048e-11)
  • Type of test: Z2n Test, 10 harmonics
  • Probability distribution: Chi-squared, 20 degrees
    of freedom
  • Test Statistic: 1511.03487971911
  • Chance Probability Range: (0, 2.0785338644267e-99)
  • Type of test: H Test, 10 maximum harmonics
  • Probability distribution: H Test-specific
  • Test Statistic: 1475.03487971911
  • Chance Probability Range: (0, 4e-08)
  • From Dave's timing talk

39
(No Transcript)
40
Pulsation tests
  • Chi-squared: null hypothesis is that all bins
    are equal
  • Rayleigh: phases form a random walk
  • Z2n: pulsation is a sine wave
    with n harmonics
  • H Test: pulsation
    is a sine wave, with automatic selection
    of n

Why have more than one? Because the waveform can
vary from one or two sharp peaks per cycle to a
sine wave. The tests are most sensitive for
waveforms which match their underlying
hypothesis. Also, the last three tests do not
require the data to be binned.
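The unbinned Rayleigh test is simple enough to sketch; under the null hypothesis (uniform phases) the statistic is distributed as χ²(2), while a concentrated pulse drives it to very large values. The phase samples below are simulated, not real pulsar data:

```python
import math
import random

def rayleigh_stat(phases):
    # Rayleigh test statistic Z = 2 (C^2 + S^2) / N, where C and S are the
    # summed cosines and sines of the phases; Z ~ chi2(2) under no pulsation
    N = len(phases)
    c = sum(math.cos(2 * math.pi * p) for p in phases)
    s = sum(math.sin(2 * math.pi * p) for p in phases)
    return 2.0 * (c * c + s * s) / N

random.seed(0)
flat   = [random.random() for _ in range(1000)]                # no pulsation
pulsed = [random.gauss(0.5, 0.05) % 1.0 for _ in range(1000)]  # one sharp peak

print(round(rayleigh_stat(flat), 1), round(rayleigh_stat(pulsed), 1))
```

The flat sample gives a value of order a few (consistent with χ²(2)); the pulsed sample gives a value in the thousands. Note the Rayleigh test uses only the fundamental, so it is least sensitive to narrow multi-peaked profiles, which is where Z2n and the H test win.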