Title: Statistics
1Statistics
- Statistics is . . . . .
- making a silk purse out of a pig's ear,
- i.e. getting reliable results from data which is
never quite as good as you hoped it was going to
be.
N.B. The book used for these notes was
Statistics by R J Barlow, Wiley 1989.
2How long is a piece of string?
- N measurements x1, . . . . . xN of the length of
a piece of string - Moments of distribution mean
and variance V - We suppose that is an estimator of the string
length ? and ? an estimator of the measurement
errors.
a good estimator should be unbiasedits
expectation value should equal the true
value, should be consistentthe limit as N??
should be the true value and should be
efficientits variance should be small.
(It turns out that the mean satisfies these
criteria, but the variance does notit is biased.
However the expression is unbiased.)
But what do we mean by the true value?
3The population model
- Set up a model of the measurement, where each
observation is subject to an error ei which is a
random variable of mean 0 and variance V, which
added to ? gives xi. - This model dataset has an infinite number of
randomly-distributed observations from which our
real dataset has been sampled. - The expectation value lt gt is evidently ?.
Because we know the probability distribution of
the xi we can answer questions like " if I repeat
this experiment 1000 times, what will be the
magnitude and distribution of the values of ?-
?" and so to deduce the accuracy of .
4Statistical inference
- This is an example of the process of statistical
inference, where we set up a model of the
measurement and then deduce its properties using
suitable estimators. For more complicated models
than the one used here we might use a fitting
process to deduce the values of several
parameters of the model, but the process is
essentially the same.
Conceptually, we can repeat the measurement as
often as we like. This allows us to answer the
question, what is the probability that I will
get a value of in the range a to b by
repeating the (thought) experiment many times and
finding the fraction which give a result in this
range. This is the frequentist approach to
defining the probability of an outcome.
The obvious problem is that the idea is highly
artificial no one would really try to answer
these questions by actually repeating the
experimentsometimes in astronomy that is not
even possible in principle. Observations of a
classical gamma-ray burst could not be repeated
because that star no longer exists! On the other
hand it defines clearly what we mean by
probability, which otherwise is not easy.
5The Calculus of Probabilities
- Suppose a certain event has two possible
outcomes A and B with probabilities P(A ) and
P(B). Also let P(A ? B) be the probability of A
given B. - Then, if A and B are mutually exclusive P(A)
P(B) 1 - P(A B) 0
- P(A B) 1
If A and B are independent P(A ? B) P(A)
Otherwise, more generally, P(A B) P(A)
P(B) P(A B) and P(A B) P(A) P(B ? A)
P(B) P(A ? B)
The last expression can be re-arranged to give
Bayes Theorem P(A ? B) P( A) P(B ? A) /
P(B)
These axioms are the basis of the mathematical
theory of probablility
6The Bayesian approach
- Suppose P(m ? d) is the probability of our
model, given the observed data (the posterior
probability), then by Bayes Theorem we can write - P(m ? d) P( m) P(d ? m) / P(d)
- where P( m) is the probability of the model
before we take into account our new measurements
(the prior probability), - P(d ? m) is the probability of getting our
particular dataset if the model is true, - and P(d) is a normalisation constant which makes
the sum of P(m ? d) 1, taken over all possible
models.
Now P(m ? d) is just exactly what we would like
to knowgoing back to our piece of string, P(? ?
x) is much better than P( ? x) which is what
we get from the frequentist analysis. But nobody
knows what this probability really means, and
they attach different namespropensity,
plausibility, confidence which make it plain it
is somewhat subjective, unlike the frequentist
probability which has an objective meaning.
In addition it is usually difficult to know what
value to give the prior normally it is given
some rather bland value, implying little prior
knowledge.
7Probability theory and statistical inference
How do we calculate the probability P(d ? m) of
getting a given dataset, from a knowledge of
the model?
True model
Parameter estimation
statistical inference-- what can we say about the
probability P(m ? d) of the model or its
parameters from the properties of our dataset?
8(No Transcript)
9Probability distributions
The specification of the model will tell us the
distribution function which we must use to
calculate the probability P(r) of any particular
outcome r for a measurement ( error ei or
numbers of photons).
For example, if we are observing random events
which occur at a mean rate of ? per sec, then we
must use a Poissonian distribution with that
mean value.
For example, if we are observing random events
which occur at a mean rate of ? per sec, then we
must use a Poissonian distribution with that
mean value.
If we are measuring the heights of students in a
class, with mean ? and variance ?2, we must use
a Gaussian distribution with those parameters
(Central Limit Theorem).
Note that r is discrete but x is a continuous
variable, so PG is a probability density dPG
PG. dx
10Poisson distribution
The mean ltrgt ?, of course, but in addition the
variance ?????? also
Gaussian distribution
For large ?, the Poisson distribution tends to a
Gaussian of mean and variance ?
11The ?2 distribution
Another important distribution is that of ?2, the
statistic often used in fitting. This is where
nN-m is the number of degrees of freedomN is
the number of data points fitted, and m the
number of parameters optimisedand ? is the
gamma function.
12The Central Limit Theorem
It is noticeable that Poissonian and ?2
distributions tend to become more symmetrical
and to look more like a Gaussian at large ??or n,
respectively they do in fact tend to
Gaussians. Another important case is where the
distribution of a quantity (like the heights of
a population) depends on a number of independent
factors it then always tends to be
Gaussian-distributed. The Central Limit Theorem
states that any variable which depends on a large
number of independent factors will be
Gaussian-distributed.
13Model-fitting with Maximum Likelihood
- Calculate P(d ? m) for the particular case of
models of photon-counting. - Usually called the likelihood of the dataonly
difference is normalisation - L(da) P(d ? m)
- so
Next we maximise the likelihood by using the
values of a for which
14Example
- Model constant source of counting rate a.
- Since the statistics are Poissonian, we have
- So the Maximum Likelihood value is just the mean.
15In reality . . .
- Of course the model is likely to be much more
complicated, - e.g. a point source of strength a at position
(????) in the field of view, together with a
background b, would be modelled by - m(x, y) b a. h(x - ?, y - ?)
- where h(a, b) is the point spread function of the
image.
Now lnL has to be minimised with respect to 4
parameters, a, b, ? and ?, so it must be done
numerically.
Note that with ML/Poissonian statistics it is
never correct to subtract the background before
fitting as it changes the statistics.
16Finally, in real life (i.e. LAT)
- The source model is considered as
- This model is folded with the Instrument Response
Functions (IRFs) to obtain the predicted counts
in the measured quantity space (E,p,t) - where
- is the combined IRF. is the orientation
vector of the spacecraft. The integral is
performed over the Source Region, i.e. the sky
region encompassing all sources contributing to
the Region-of Interest (ROI).
17Extended ML fitting
- The above formulation of the likelihood is
correct only if the total number of photon events
is predetermined by stopping the exposure when a
desired value has been reached. In practice of
course we normally collect events until a desired
exposure time is reached in which case the total
number of events is a parameter of the model.
This changes the form of lnL slightly the main
effect is to increase the estimated errors.
18Bayesian Model-fitting
- From Bayes Theorem, P(m ? d) P( m)
P(d ? m) / P(d) - If we have no prior information about the model
parameters, P( m) will be uniform over the
parameter space and P(d) is only a normalising
factor, so the parameter values that which were
obtained from the ML analysis which maximise P(d
? m) will also maximise P(m ? d) and so are the
Bayesian values.
19?2 (Least squares) Model-fitting
- If our data is binned
- and if the number of photon events per bin is
large enough (gt5, 10, 20, say) - then we can replace the Poisson distribution in
the model by a Gaussian distribution - and for a one-parameter model
For a multi-parameter model , where mi and
?i2 are the predicted mean and variance in the
ith bin.
Maximising lnL is equivalent to minimising
, so Least
Squares fitting is just ML fitting with large
numbers of events. In fact, in place of lnL,
the Cash statistic C -2lnL can be used which
makes the correspondence even closer.
20Pros and cons of ??
- Cons
- (i) Data must be binned reducing resolution
- (ii) What value to use for the estimate of ?i2?
- Commonly, xi is used but this introduces a bias
- Other choices have been proposed. As an example,
Keith Arnaud did 1000 simulations of a Chandra
power law spectrum of slope 1.7 with 1000 events
in the spectrum and fitted this data using
different fit statistics to get the following
results for the fitted slope - Standard ?2 1.41 (-0.15, 0.16)
- Gehrels ?2 1.88 (-0.22, 0.25)
- Churasov ?2 1.69 (-0.15, 0.16)
- ML 1.70 (-0.13, 0.14)
- The superiority of ML is obvious it is partly
due to this issue of the estimate of the
variance, and partly to the use of the
approximate distribution function.
Pros ?2 has one important strength, howeverits
distribution is known which allows an estimate to
be made of the goodness-of-fit of the model.
21Calculating Optimal Parameter Values
- The calculation of maximal/minimal statistics is
usually done numerically using a standard
computer package. - BEWARE!!
- Suppose we need to minimise a statistic S using a
model with m free parameters. The simplest method
is, starting from a suitably-chosen point on the
S-m surface, to calculate the direction of
steepest descent and take a small step in that
direction. Repeating this procedure, one will
eventually arrive at a minimum.
The problem is that unless m is small, the S-m
surface may well have multiple minima and the one
found may be only a local minimum. Always repeat
the calculation from a large range of initial
parameters and make sure the fit always converges
on the same solution.
22Confidence intervals
We now assume that this means that ? will lie
within this range in 90 of experiments.
23Reverend Bayes is back!!This is a Bayesian
statement based on a uniform prior.Suppose we
estimate a parameter m to be x, and that we have
an estimate of the error s. We assume that x is
Gaussian-distributed and use a prior P(m)
1Note that this is correctly normalisedx
must have a value between - 8 and 8 and P
integrated over this range is 1.We know m must
be positive, so now we use a prior P(m) 1, m
0 0 otherwiseand now we must
re-normalise so that x gt
0
24Dealing with negative intensities
We can use this result to advantage in those
cases where the confidence interval includes
physically-nonsensical negative values, e.g. an
intensity of 0.060.10 units. Using gives
the 90 confidence interval as 0.0 to 0.19
instead of -0.04 to 0.16. See Barlow, p130
25Confidence Intervals in Model-fitting
- The fit statistic gives directly the relative
probability attached to any set of parameter
values this is true whether LnL or ?2 - If the prior is bland, this is also the model
probability given the observed data. - The parameter for which an error estimate is
required is varied in steps, whilst at each stage
re-optimising the rest, until the change ?S in
the fit statistic S reaches a required value, say
2.71
? ?2 for one parameter is distributed as ?2 (1)
for two parameters as ?2 (2) Confidence, 68
90 95 99 ?S (one parameter) 1.0 2.71
3.84 6.63 ?S (two paras) 2.30 4.61 5.99
9.21 For ML, ?(lnL) is distributed as - ?2 /2
provided the number of data points n is large,
so these values can be multiplied by -1/2..
If two parameters are varied at once, the line ?S
2.3 is an ellipse which traces out the 68
confidence region. If the axes of the ellipse are
not parallel to the parameter axes, then the two
parameters are correlated.
26Correlated variables
27Goodness of Fit
- It is obvious from the expression
- that we would expect the minimum value of ?2 to
be close to N and this is truethe distribution
of ?2 is -
- n N-m is the number of degrees of freedom
where N is the number of data points and m is the
number of parameters in the model - This is true for all n, but if n is large enough
it approaches a Gaussian of mean n and variance
2n -
This assumes the model is a good one and so can
be used as a test of that hypothesis, by
comparing the reduced ?2min i.e. the best-fit
value, with the expected value. If n is large,
we compare the reduced ?2min / n with 1.
28(No Transcript)
29A good agreement implies the model is a good
description of the data if the reduced ?2 is too
large not only is the model poor, but the
difference indicates how likely such a value is
by chance.
For example, a bremsstrahlung fit to a 20 channel
spectrum gives ?2min 35 There are 2 parameters
so there are 18 dof. ?2min is suspiciously
highP 0.01 by chance
Alternatively, a low value implies that the
errors are overestimated.
This is a very valuable property of ?2 fitting
and, perhaps, the only reason to consider using
it. Unfortunately, there is no ML equivalent.
The usual solution is to run e.g. 1000 Monte
Carlo simulations of the experiment.
30How many Parameters?
- Adding more adjustable parameters will usually
make a model fit better. - Exampletry adding a power law component (2 extra
parameters) to the bremsstrahlung - ?2min drops to 18 with 16 dofobviously a better
fit - How can you know when to stop?
The first answer is just to stop when you get an
acceptable reduced ?2, but the F test gives a
more rigorous answer.
31The F test
- Suppose an initial model with ?i degrees of
freedom gives a ?2 of Si and after fitting extra
parameters one gets Sf with ?f dof. - Calculate
- and look up the significance of the result in a
table of the F-distribution.
(It is essential to note that books usually give
this test in its simpler, original form and the
tables to go with it. You need tables of the
distribution of F(???????????as in Bevington,
Data reduction and error analysis for the
physical sciences)
For the example, F (35-18)/18 x 16/2 7.6.
The F-test table shows the chance of exceeding
6.23 by chance is 0.01 so the power law is
significant at 99 confidence.
32Conditions for using the F-test
- The F-test can only validly be used if two
conditions are satisfied - the simpler model is nested within the more
complex - the null hypothesis does not lie along a
boundary in data space - (added spectral line, point source, power law)
- See Protassov et al , Ap J 2002, 571, 545
33The Maximum Likelihood Test StatisticThe
F-statistic is closely related to the test
statistic TS used to determine the significance
of a possible point source, (see below)
TS 2 ( lnL1 lnL2 )If N is large, TS
is approximately distributed as ?2 (nf - ni)
/2For a point source, this is ?2 (4) /2
(position, flux, spectral index)or ?2 (3) /2
(position, intensity)TS 25 for 4 additional
parameters corresponds to P 2.10-5Rule of
thumb TS s2 (s is actually 4.1 in this
case)There are 20000 PSFs in the LAT sky so 0.4
spurious sources expected.
34Hypothesis testing
- We started by considering the problem of
parameter estimation - We have now moved into hypothesis testinge.g.
that a spectrum includes a significant power law
component. - An important tool of hypothesis testing is the
Null Hypothesis, useful where a testable form of
the hypothesis cannot be set up.
35the Null hypothesis
Suppose that we count photons from a source in
constant intervals, and find the mean number of
events in each is ?. Then in one interval we
measure some (greater) number, l. Is this due to
a flare in the source?
First we formulate the null hypothesis-- there is
no flare, the measurement l is simply drawn from
the same population as all the others.
The probability of obtaining a count of l or
greater is
which for ?10, l20 is 1 How significant this
result is, depends on how many measurements in
all we have made.
In 1000 intervals, 10 would give this value by
chance alone
36Source Searching
- Searching an image for point sources is another
example of hypothesis testing, where one is
looking for pixels (or groups of pixels) which
contain a significant excess of counts. - One estimates the background b from some selected
large region of the image and compares this with
the number of counts n in each pixel. - The significance of a hypothetical source is then
(n-b)/ vb in standard deviations, and the
hypothesis that there is no source present can
then be rejected at some level of confidence.
A group of pixels would be used to increase the
sensitivity of the test when the image point
spread function covers several pixels.
A maximum probability of 10-6 ("5?") is usually
taken as a realistic value when searching for
point sources in an image which can easily have
105 pixels.
37Timing Analyses
- Dave showed data from the Vela Pulsar using
gtptest
38 Testing Pulsation Significance Output from
gtptest on Vela
- Type of test Chi-squared Test, 10 phase bins
- Probability distribution Chi-squared, 9 degrees
of freedom - Test Statistic 824.028880866426
- Chance Probability Range (0, 2.03757046903054e-99
) - Type of test Rayleigh Test
- Probability distribution Chi-squared, 2 degrees
of freedom - Test Statistic 46.2571601550502
- Chance Probability Range (9.02305042259081e-11,
9.02395272685048e-11) - Type of test Z2n Test, 10 harmonics
- Probability distribution Chi-squared, 20 degrees
of freedom - Test Statistic 1511.03487971911
- Chance Probability Range (0, 2.0785338644267e-99)
- Type of test H Test, 10 maximum harmonics
- Probability distribution H Test-specific
- Test Statistic 1475.03487971911
- Chance Probability Range (0, 4e-08)
- From Dave's timing talk
39(No Transcript)
40Pulsation tests
- Chi-squared null hypothesis all bins equal
- Rayleigh phases form a random walk
- Z2n pulsation is a sine wave
with n harmonics - H Test pulsation
is a sine wave with automatic selection
of n
Why have more than one? Because the waveform can
vary from one or two sharp peaks/cycle to a sine
wave The tests are most sensitive for waveforms
which match their null hypothesis Also, the last
three tests do not require the data to be binned