Statistical inference for astrophysics (PowerPoint presentation transcript)
1
Statistical inference for astrophysics
  • A short course for astronomers
  • Cardiff University, 16-17 Feb 2005
  • Graham Woan, University of Glasgow

2
Lecture Plan
  • Why statistical astronomy?
  • What is probability?
  • Estimating parameter values
  • the Bayesian way
  • the Frequentist way
  • Testing hypotheses
  • the Bayesian way
  • the Frequentist way
  • Assigning Bayesian probabilities
  • Monte-Carlo methods

Lectures 1 & 2
Lectures 3 & 4
3
Why statistical astronomy?
Generally, statistics has got a bad reputation:
"There are three types of lies: lies, damned lies, and statistics."
Benjamin Disraeli / Mark Twain
Often for good reason:
"Jun 3rd 2004: two researchers at the University of Girona in Spain have found that 38% of a sample of papers in Nature contained one or more statistical errors."
The Economist
4
Why statistical astronomy?
Data analysis methods are often regarded as
simple recipes
http://www.numerical-recipes.com/
5
Why statistical astronomy?
Data analysis methods are often regarded as
simple recipes
...but in analysis, as in life, the recipes sometimes don't work as you expect:
  • Low number counts
  • Distant sources
  • Correlated residuals
  • Incorrect assumptions
  → Systematic errors and/or misleading results

6
Why statistical astronomy?
and the tools can be clunky
" The trouble is that what we statisticians
call modern statistics was developed under strong
pressure on the part of biologists. As a result,
there is practically nothing done by us which is
directly applicable to problems of astronomy."
Jerzy Neyman, founder of frequentist hypothesis
testing.
7
Why statistical astronomy?
For example, we can observe only the one Universe
(From Bennett et al 2003)
8
The Astrophysicist's Shopping List
We want tools capable of:
  • dealing with very faint sources
  • handling very large data sets
  • correcting for selection effects
  • diagnosing systematic errors
  • avoiding unnecessary assumptions
  • estimating parameters and testing models

9
Why statistical astronomy?
Key question:
How do we infer properties of the Universe from incomplete and imprecise astronomical data?
Our goal:
To make the best inference, based on our observed data and any prior knowledge, reserving the right to revise our position if new information comes to light.
Let's come to this problem afresh with an astrophysicist's eye, and bypass some of the jargon of orthodox statistics, going right back to plain probability...
10
Right-thinking gentlemen 1
"A decision was wise, even though it led to disastrous consequences, if the evidence at hand indicated it was the best one to make; and a decision was foolish, even though it led to the happiest possible consequences, if it was unreasonable to expect those consequences... We should do the best with what we have, not what we wished we had."
Herodotus, c. 500 BC
11
Right-thinking gentlemen 2
"Probability theory is nothing but common sense reduced to calculation."
Pierre-Simon Laplace (1749-1827)
12
Right-thinking gentlemen 3
Occam's Razor:
"Frustra fit per plura, quod fieri potest per pauciora." ("It is vain to do with more what can be done with less.")
Everything else being equal, we favour models which are simple.
William of Occam (1288-1348 AD)
13
The meaning of probability
  • Definitions, algebra and useful distributions

14
But what is probability?
  • There are three popular interpretations of the word, each with an interesting history:
  • Probability as a measure of our degree of belief in a statement
  • Probability as a measure of the limiting relative frequency of outcome of a set of identical experiments
  • Probability as the fraction of favourable (equally likely) possibilities
  • We will call these the Bayesian, Frequentist and Combinatorial interpretations.
  • Note there are signs of trouble here:
  • How do you quantify degree of belief?
  • How do you define relative frequency without using ideas of probability?
  • What does "equally likely" mean?
  • Thankfully, at some level, all three interpretations agree on the algebra of probability, which we will present in Bayesian terms.

15
Algebra of (Bayesian) probability
  • The probability of a statement, such as "y = 3", "the mass of a neutron star is 1.4 solar masses" or "it will rain tomorrow", is a number between 0 and 1, such that 0 ≤ p ≤ 1, with p = 0 for a statement that is certainly false and p = 1 for one that is certainly true.
  • For some statement X, p(X) + p(not X) = 1, where "not X" denotes the negation of the statement -- the Sum Rule.
  • If there are two statements X and Y, then the joint probability is p(X,Y) = p(X|Y) p(Y), where the vertical line denotes the conditional statement "X given Y is true" -- the Product Rule.
16
Algebra of (Bayesian) probability
  • From these we can deduce that p(X or Y | I) = p(X|I) + p(Y|I) - p(X,Y|I), where I represents common background information -- the Extended Sum Rule.
  • And, because p(X,Y) = p(Y,X), we get p(X|Y,I) = p(Y|X,I) p(X|I) / p(Y|I), which is called Bayes' Theorem.
  • Note that these results are also applicable in Frequentist probability theory, with a suitable change in meaning of p.
17
Algebra of (Bayesian) probability
Bayes' theorem is the appropriate rule for updating our degree of belief when we have new data d:

p(H|d,I) = p(H|I) p(d|H,I) / p(d|I)
(posterior = prior x likelihood / evidence)

Note that the word "evidence" is sometimes used for something else (the log odds). We will stick to the p(d|I) definition here.
18
Algebra of (Bayesian) probability
  • We can also deduce the marginal probabilities. If X and Y are propositions that can take on values drawn from {x1 ... xm} and {y1 ... yn}, then p(xi) = Σj p(xi, yj). This gives us the probability of X when we don't care about Y. In these circumstances, Y is known as a nuisance parameter.
  • All these relationships can be smoothly extended from discrete probabilities to probability densities, e.g. p(x) = ∫ p(x,y) dy, where p(y)dy is the probability that y lies in the range y to y+dy.
19
Algebra of (Bayesian) probability
These equations are the key to Bayesian inference, the methodology upon which (much) astronomical data analysis is now founded. Clear introduction by Devinder Sivia (OUP). (Fuller bibliography tomorrow.)
See also the free book by Prasenjit Saha (QMW, London): http://ankh-morpork.maths.qmw.ac.uk/~saha/book
20
Example
  • A gravitational wave detector may have seen a type II supernova as a burst of gravitational radiation. Burst-like signals can also come from instrumental glitches, and only 1 in 10,000 bursts is really a supernova, so the data are checked for glitches by examining veto channels. The test is expected to confirm the burst is astronomical in 95% of cases in which it truly is, and in 1% when it truly is not. The burst passes the veto test!! What is the probability we have seen a supernova?
  • Answer: denote "the burst is a supernova" by S, "the burst is a glitch" by G, and "the burst passes the veto test" by V. Let I represent the information that the burst seen is typical of those used to deduce the information in the question. Then we are told that:

p(S|I) = 0.0001, p(G|I) = 0.9999 (prior probabilities for S and G)
p(V|S,I) = 0.95, p(V|G,I) = 0.01 (likelihoods for S and G)
21
Example cont
  • But we want to know the probability that it's a supernova, given it passed the veto test. By Bayes' theorem, p(S|V,I) = p(S|I) p(V|S,I) / p(V|I), and we are directly told everything on the rhs except p(V|I), the probability that any burst candidate would pass the veto test. If the burst is either a supernova or a hardware glitch then we can marginalise over these alternatives:
p(V|I) = p(V|S,I) p(S|I) + p(V|G,I) p(G|I) = 0.95 x 0.0001 + 0.01 x 0.9999 ≈ 0.0101
so p(S|V,I) ≈ (0.95 x 0.0001) / 0.0101 ≈ 0.01.
22
Example cont
  • So, despite passing the test there is only a 1% probability that the burst is a supernova! Veto tests have to be blisteringly good if supernovae are rare.
  • Why? Because most bursts that pass the test are just instrumental glitches. It really is just "common sense reduced to calculation".
  • Note however that by passing the test, the probability that this burst is from a supernova has increased by a factor of 100 (from 0.0001 to 0.01).
  • Moral: the probability that a supernova burst gets through the veto (0.95) is not the probability that it's a supernova burst if it gets through the veto (0.01).
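This calculation is simple enough to check numerically. A minimal sketch (plain Python; all numbers are taken from the example above):

    # Check of the veto-test example (numbers from the slides).
    p_S = 0.0001              # prior: 1 in 10,000 bursts is a supernova
    p_G = 1.0 - p_S           # prior: burst is an instrumental glitch
    p_V_given_S = 0.95        # veto test passes a true supernova 95% of the time
    p_V_given_G = 0.01        # ...and passes a glitch 1% of the time

    p_V = p_V_given_S * p_S + p_V_given_G * p_G   # evidence: marginalise over S and G
    p_S_given_V = p_V_given_S * p_S / p_V         # Bayes' theorem
    print(p_S_given_V)                            # ~0.0094, i.e. about 1%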
23
The basis of frequentist probability
24
Basis of frequentist probability
  • Bayesian probability theory is simultaneously a very old and a very young field:
  • Old: the original interpretation of Bernoulli, Bayes, Laplace
  • Young: the state of the art in (astronomical) data analysis
  • But BPT was rejected for several centuries because probability as "degree of belief" was seen as too subjective, leading to the frequentist approach:
25
Basis of frequentist probability
Probability = long-run relative frequency of an event, which at first appears to be something that can be measured objectively. E.g. rolling a die: what is p(6)? If the die is fair we expect p(6) = 1/6. These probabilities are fixed (but unknown) numbers. Can imagine rolling the die M times. The number rolled is a random variable: a different outcome each time.
26
Basis of frequentist probability
  • We define p(6) = lim (M → ∞) n(6)/M, where n(6) is the number of sixes in M rolls. If the die is fair, p(6) = 1/6.
  • But "objectivity" is an illusion:
  • assumes each outcome equally likely (i.e. equally probable)
  • also assumes an infinite series of identical trials
  • What can we say about the fairness of the die after (say) 5 rolls, or 10, or 100?

27
Basis of frequentist probability
  • In the frequentist approach, a lot of mathematical machinery is specifically defined to let us address frequency questions:
  • We take the data as a random sample of size M, drawn from an assumed underlying pdf
  • Sampling distribution, derived from the underlying pdf and M
  • Define an estimator: a function of the sample data that is used to estimate properties of the pdf
  • But how do we decide what makes an acceptable estimator?

28
Basis of frequentist probability
  • Example: measuring a galaxy redshift. Let the true redshift be z0, a fixed but unknown parameter. We use two telescopes to estimate z0, and compute sampling distributions for the estimators ẑ1 and ẑ2, modelling errors.
  • Small telescope, low-dispersion spectrometer:
  • Unbiased: repeat the observation a large number of times and the average estimate is equal to z0
  • BUT the variance of ẑ1 is large due to the low dispersion.

29
Basis of frequentist probability
  • Example: measuring a galaxy redshift (continued).
  • Large telescope, high-dispersion spectrometer, but a faulty astronomer! (e.g. wrong calibration)
  • Biased: the average estimate is not equal to z0
  • BUT the variance of ẑ2 is small. Which is the better estimator?

30
Basis of frequentist probability
What about the sample mean?
  • Let x1 ... xM be a random sample from a pdf with mean μ and variance σ². Then the sample mean is x̄ = (1/M) Σi xi.
  • Can show that ⟨x̄⟩ = μ -- an unbiased estimator.
  • But bias is defined formally in terms of an infinite set of randomly chosen samples, each of size M.
  • What can we say with a finite number of samples, each of finite size? Before that...

31
Some important probability distributions: quick definitions
32
Some important pdfs: discrete case
  • Poisson pdf: p(n|λ) = e^(-λ) λ^n / n!
  • e.g. number of photons/second counted by a CCD, number of galaxies per square degree counted by a survey

n = number of detections. The Poisson pdf assumes detections are independent, and that there is a constant rate λ. Can show that ⟨n⟩ = λ and Var(n) = λ.
33
(Same content as the previous slide, with the Poisson pdf plotted.)
34
Some important pdfs: discrete case
  • Binomial pdf: p(r|θ,N) = [N! / (r! (N-r)!)] θ^r (1-θ)^(N-r)
  • number of successes r from N observations, for two mutually exclusive outcomes ("Heads" and "Tails")
  • e.g. number of binary stars, Seyfert galaxies, supernovae...

r = number of successes, θ = probability of success for a single observation. Can show that ⟨r⟩ = Nθ and Var(r) = Nθ(1-θ).
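Both sets of quoted moments are easy to verify by direct summation over the pmf. A sketch (Python with numpy and scipy assumed; the parameter values are arbitrary examples):

    # Numerical check of the quoted means and variances.
    import numpy as np
    from scipy.stats import poisson, binom

    lam = 4.2                                 # an arbitrary Poisson rate
    n = np.arange(200)
    p = poisson.pmf(n, lam)
    mean = (n * p).sum()
    print(mean, (n**2 * p).sum() - mean**2)   # both ~lam

    N, theta = 20, 0.3                        # arbitrary binomial parameters
    r = np.arange(N + 1)
    q = binom.pmf(r, N, theta)
    mean = (r * q).sum()
    print(mean, (r**2 * q).sum() - mean**2)   # N*theta and N*theta*(1-theta)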
35
Some important pdfs: continuous case
  • Uniform pdf: p(x) = 1/(b-a) for a ≤ x ≤ b, and 0 otherwise.
36
Some important pdfs: continuous case
  • Central, or Normal pdf (also known as Gaussian): p(x) = (1/(σ√(2π))) exp(-(x-μ)²/(2σ²))
37
Cumulative distribution function (CDF):
Prob(x < a) = ∫ from -∞ to a of p(x) dx
38
Measures and moments of a pdf
The nth moment of a pdf is defined as:
Discrete case: ⟨x^n⟩ = Σi xi^n p(xi)
Continuous case: ⟨x^n⟩ = ∫ x^n p(x) dx
39
Measures and moments of a pdf
The 1st moment is called the mean or expectation value:
Discrete case: ⟨x⟩ = Σi xi p(xi)
Continuous case: ⟨x⟩ = ∫ x p(x) dx
40
Measures and moments of a pdf
The 2nd moment is called the mean square:
Discrete case: ⟨x²⟩ = Σi xi² p(xi)
Continuous case: ⟨x²⟩ = ∫ x² p(x) dx
41
Measures and moments of a pdf
The variance is defined as ⟨(x - ⟨x⟩)²⟩ and is often written σ². σ is called the standard deviation.
Discrete case: σ² = Σi (xi - ⟨x⟩)² p(xi)
Continuous case: σ² = ∫ (x - ⟨x⟩)² p(x) dx
In general, σ² = ⟨x²⟩ - ⟨x⟩².
42
Measures and moments of a pdf
pdf                    mean       variance
Poisson (discrete)     λ          λ
Binomial (discrete)    Nθ         Nθ(1-θ)
Uniform (continuous)   (a+b)/2    (b-a)²/12
Normal (continuous)    μ          σ²
43
Measures and moments of a pdf
The median divides the CDF into two equal halves:
Prob(x < x_med) = Prob(x > x_med) = 0.5
44
Measures and moments of a pdf
The mode is the value of x for which the pdf is a maximum.
For a Normal pdf, mean = median = mode.
45
Parameter estimation
  • The Bayesian way

46
Bayesian parameter estimation
In the Bayesian approach, we can test our model, in the light of our data (i.e. rolling the die) and see how our degree of belief in its "fairness" evolves, for any sample size, considering only the data that we did actually observe:

posterior (what we know now) ∝ prior (what we knew before) x likelihood (influence of our observations)
47
Bayesian parameter estimation
Astronomical example 1: probability that a galaxy is a Seyfert 1.
We want to know the fraction of Seyfert galaxies that are type 1. How large a sample do we need to reliably measure this? Model as a binomial pdf, with θ = global fraction of Seyfert 1s. Suppose we sample N Seyferts, and observe r Seyfert 1s. Then
p(r|θ,N) ∝ θ^r (1-θ)^(N-r)
is the probability of obtaining the observed data, given the model: the likelihood of θ.
48
Bayesian parameter estimation
  • But what do we choose as our prior? This has been the source of much argument between Bayesians and frequentists, though it is often not that important.
  • We can sidestep this for a moment, realising that if our data are good enough, the choice of prior shouldn't matter: in the posterior, the likelihood dominates the prior.
49
Bayesian parameter estimation
Consider a simulation of this problem using two different priors p(θ|I):
  • Flat prior: all values of θ equally probable
  • Normal prior: peaked at θ = 0.5
50-62
(A sequence of plots of the posterior p(θ|data,I), for both priors, after observing: 0 galaxies; 1 galaxy (S1); 2 (S1 S1); 3 (S1 S1 S2); 4 (S1 S1 S2 S2); 5 (S1 S1 S2 S2 S2); 10 (5 S1, 5 S2); 20 (7 S1, 13 S2); 50 (17 S1, 33 S2); 100 (32 S1, 68 S2); 200 (59 S1, 141 S2); 500 (126 S1, 374 S2); and 1000 galaxies (232 S1, 768 S2). The two posteriors converge as the sample grows.)
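The evolving posterior is easy to reproduce numerically. A grid-based sketch (Python with numpy assumed; the normal prior's width of 0.1 is an assumption, as the slides do not state it):

    # Posterior evolution for the Seyfert-1 fraction on a grid.
    import numpy as np

    theta = np.linspace(0, 1, 1001)
    flat_prior = np.ones_like(theta)
    normal_prior = np.exp(-0.5 * ((theta - 0.5) / 0.1) ** 2)   # assumed width 0.1

    for N, r in [(1, 1), (5, 2), (100, 32), (1000, 232)]:      # counts from the slides
        likelihood = theta**r * (1 - theta)**(N - r)           # binomial likelihood of theta
        for name, prior in [("flat", flat_prior), ("normal", normal_prior)]:
            post = prior * likelihood
            post /= post.sum() * (theta[1] - theta[0])         # normalise on the grid
            print(N, name, theta[np.argmax(post)])             # the posterior mode

By N = 1000 the two modes agree closely, illustrating the insensitivity to the prior for good data.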
63
Bayesian parameter estimation
What do we learn from all this?
  • As our data improve (e.g. our sample increases), the posterior pdf narrows and becomes less sensitive to our choice of prior.
  • The posterior conveys our (evolving) degree of belief in different values of θ, in the light of our data.
  • If we want to express our result as a single number we could perhaps adopt the mean, median, or mode.
  • We can use the variance of the posterior pdf to assign an uncertainty for θ.
  • It is very straightforward to define confidence intervals.

64
Bayesian parameter estimation
Bayesian confidence intervals:
We are 95% sure that θ lies between θ1 and θ2, where θ1 and θ2 enclose 95% of the area under the posterior p(θ|data,I). Note the confidence interval is not unique, but we can define the shortest C.I.
65
Bayesian parameter estimation
Astronomical example 2: flux density of a GRB.
Take Gamma Ray Bursts to be equally luminous events, distributed homogeneously in the Universe. We see three gamma ray photons from a GRB in an interval of 1 s. What is the flux of the source, F?
  • The seat-of-the-pants answer is F = 3 photons/s, with an uncertainty of about √3, but we can do better than that by including our prior information on luminosity and homogeneity. Call this background information I.

Homogeneity implies that the probability the source is in any particular volume of space is proportional to the volume, so the prior probability that the source is in a thin shell of radius r is p(r|I) dr ∝ r² dr.
66
Bayesian parameter estimation
  • But the sources have a fixed luminosity, L, so r and F are directly related by F = L/(4πr²), hence r ∝ F^(-1/2).
  • The prior on F is therefore p(F|I) = p(r|I) |dr/dF| ∝ F^(-5/2). Interpretation: low-flux sources are intrinsically more probable, as there is more space for them to sit in.
  • We now apply Bayes' theorem to determine the posterior for F after seeing n photons.
67
Bayesian parameter estimation
  • The likelihood for F comes from the Poisson nature of photons: p(n|F,I) ∝ (FT)^n e^(-FT), so finally
p(F|n,I) ∝ F^(n-5/2) e^(-FT).
For n = 3 the most probable value of F is 0.5 photons/s.
  • Clearly it is more probable this is a distant source from which we have seen an unusually high number of photons than it is an unusually nearby source from which we have seen an expected number of photons. (The most probable value of F is (n - 5/2)/T, approaching n/T for n >> 1.)

(Plot: the posterior p(F|n=3,I), with its mode well below F = n/T.)
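A quick numerical check of the posterior mode (Python with numpy assumed):

    # Posterior for the GRB flux: p(F|n) proportional to F^(n-5/2) exp(-F*T).
    import numpy as np

    n, T = 3, 1.0                          # 3 photons in 1 s
    F = np.linspace(0.01, 10.0, 5000)
    post = F**(n - 2.5) * np.exp(-F * T)
    post /= post.sum() * (F[1] - F[0])     # normalise on the grid
    print(F[np.argmax(post)])              # ~0.5 photons/s, i.e. (n - 5/2)/T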
68
Parameter estimation
  • The frequentist way

69
Frequentist parameter estimation
  • Recall that in frequentist (orthodox) statistics, probability is a limiting relative frequency of outcome, so only random variables can have frequentist probabilities, as only these show variation with repeated measurement. So we can't talk about the probability of a model parameter, or of a hypothesis: e.g., a measurement of a mass is an RV, but the mass itself is not.
  • So no orthodox probabilistic statement can be interpreted as directly referring to the parameter in question! For example, orthodox confidence intervals do not indicate the range in which we are confident the parameter value lies. That's what Bayesian intervals do.
  • So what do they mean?

70
Frequentist parameter estimation
  • Orthodox parameter estimation proceeds as follows. Imagine we have some data, di, that we want to use to estimate the value of a parameter, a, in a model. These data must depend on the true value of a in a known way, but as they are random variables all we can say is that we know, or can estimate, p(di|a) for some given a.
  • 1. Use the data to construct a statistic (i.e. a function of the data) that can be used as an estimator for a, called â. A good estimator will have a pdf that depends heavily on a, and which is sharply peaked at, or very close to, a.
71
Frequentist parameter estimation
  • 2. One such estimator is the maximum likelihood (ML) estimator, constructed in the following way: given the distribution from which the data are drawn, p(d|a), construct the sampling distribution p({di}|a), which is just p(d1|a) p(d2|a) p(d3|a)... if the data are independent. Interpret this as a function of a, called the likelihood of a, and call the value of a that maximises it for the given data â. The corresponding sampling distribution, p({di}|â), is the one from which the data were most likely drawn.

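As an illustrative sketch of step 2 (not from the slides): hypothetical Gaussian data with known σ, with the log-likelihood maximised on a grid (Python with numpy assumed):

    # Illustrative ML estimation: Gaussian data with known sigma.
    import numpy as np

    rng = np.random.default_rng(1)
    a_true, sigma = 2.0, 0.5
    d = rng.normal(a_true, sigma, size=20)     # the random sample

    a = np.linspace(0.0, 4.0, 4001)            # grid of candidate parameter values
    # log L(a) = sum_i log p(d_i|a): independent data, so the product becomes a sum
    logL = -0.5 * ((d[:, None] - a[None, :])**2).sum(axis=0) / sigma**2
    a_hat = a[np.argmax(logL)]
    print(a_hat, d.mean())                     # the ML estimate equals the sample mean here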
72
Frequentist parameter estimation
  • ...but remember that this is just one value of â, albeit drawn from a population that has an attractive average behaviour.
  • And we still haven't said anything about a!
  • 3. Define a confidence interval around a enclosing, say, 95% of the expected values of â from repeated experiments.
73
Frequentist parameter estimation
  • The width of this interval may be known from the sampling pdf, or it may have to be estimated too.
  • 4. We can now say that p(â - Δ < a < â + Δ) = 0.95, for some suitable interval half-width Δ. Note this is a probabilistic statement about the estimator â, not about a. However this expression can be restated: 0.95 is the relative frequency with which the statement "a lies in the region â ± Δ" is true, over many repeated experiments.

(Sketch: the intervals â ± Δ from many different experiments; for about 95% of them the statement "a lies in the interval" is true.)

The disastrous shorthand for this is "a = â ± Δ". Note that this is not a statement about our degree of belief that a lies in the numerical interval generated in our particular experiment.
74
Example B and F compared
  • Two gravitational wave (GW) detectors see 7 simultaneous burst-like signals in the data from one hour, consisting of GW signals and spurious signals. When the two sets of data are offset in time by 10 minutes there are 9 simultaneous signals. What is the true GW burst rate? (Note: no need to be an expert to realise there is not much GW activity here!)
  • A frequentist solution: take the events as Poisson, characterised by the background rate, rb, and the GW rate, rg. We get the estimator r̂g = 7 - 9 = -2 per hour, where the variance of its sampling distribution is 7 + 9 = 16, i.e. σ = 4.
75
Example B and F compared
  • So we would quote our result as rg = -2 ± 4 per hour.
  • But burst rates can't be negative! What does this mean? In fact it is quite consistent with our definition of a frequentist confidence interval. Our value of r̂g is drawn from its sampling distribution, which will look something like a distribution of width σ = 4 centred on the true rate; our particular sample happened to fall at -2.
So, in the shorthand of the subject, this result is quite correct.
76
Example B and F compared
  • The Bayesian solution: we'll go through this carefully. Our job is to determine rg. In Bayesian terms that means we are after p(rg|data). We start by realising there are two experiments here: one to determine the background rate and one to determine the joint rate, so we will also determine p(rb|data). If the background count comes from a Poisson process of mean rb then the probability of n events is
p(n|rb) = e^(-rb) rb^n / n!,
which is our Bayesian likelihood for rb. We will choose a prior probability for rb proportional to 1/rb. This is the scale-invariant prior that encapsulates total ignorance of the non-zero rate (of course in reality we may have something that constrains rb more tightly a priori). See later...

77
Example B and F compared
  • So our posterior for rb, based on the background counts, is p(rb|n) ∝ rb^(n-1) e^(-rb), which, normalising and setting n = 9, gives p(rb|n=9) = rb^8 e^(-rb) / 8!.
  • The probability of seeing m coincident bursts, given the two rates, is the Poisson probability with mean rb + rg:
p(m|rb,rg) = e^(-(rb+rg)) (rb+rg)^m / m!

(Plot: the background-rate posterior p(rb|n=9), peaking near rb = 8.)
78
Example B and F compared
  • And our joint posterior for the rates is, by Bayes' theorem,
p(rb,rg|n,m) ∝ p(m|rb,rg) p(rb,rg)   (likelihood x prior).
The joint prior in the above can be split into a probability for rb, which we have just evaluated, and a prior for rg. This rate may be zero, so we will say that we are equally ignorant over whether to expect 0, 1, 2, ... counts from this source. This translates into a uniform prior on rg, so our joint prior is p(rb,rg) ∝ p(rb|n). Finally we get the posterior on rg by marginalising over rb:
p(rg|n,m) ∝ ∫ p(m|rb,rg) p(rb|n) drb
79
Example B and F compared
...giving our final answer to the problem:

(Plot: the posterior p(rg|n=9,m=7), compared with the frequentist result rg = -2 ± 4 to 1 sigma.)

Note that p(rg < 0) = 0, due to the prior, and that the true value of rg could very easily be as high as 4 or 5.
80
Example B and F compared
Let's see how this result would change if the background count was 3, rather than 9 (joint count still 7):

(Plot: the posterior p(rg|n=3,m=7).)

Again, this looks very reasonable.
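This two-stage calculation (posterior for rb, then marginalisation over it) is straightforward to reproduce on a grid. A sketch (Python with numpy assumed; grid sizes and ranges are arbitrary choices):

    # Grid version of the GW burst-rate calculation.
    import numpy as np

    n, m = 9, 7                                # background and coincident counts
    rb = np.linspace(0.01, 40.0, 2000)         # background rate grid
    rg = np.linspace(0.0, 20.0, 1000)          # GW rate grid (uniform prior)
    drb = rb[1] - rb[0]

    # posterior for the background rate: Poisson likelihood x 1/rb prior
    p_rb = rb**(n - 1) * np.exp(-rb)
    p_rb /= p_rb.sum() * drb

    # likelihood of m coincident events: Poisson with mean rb + rg
    R = rg[:, None] + rb[None, :]
    like = R**m * np.exp(-R)

    # marginalise over the nuisance parameter rb
    p_rg = (like * p_rb[None, :]).sum(axis=1) * drb
    p_rg /= p_rg.sum() * (rg[1] - rg[0])
    print(rg[np.argmax(p_rg)])                 # posterior mode (at or near zero for these counts)

Rerunning with n = 3 reproduces the second case, where the posterior peaks above zero.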
81
  • Bayesian hypothesis testing
  • And why we already know how to do it

82
Bayesian hypothesis testing
  • In Bayesian analysis, hypothesis testing can be performed as a generalised application of Bayes' theorem. Generally a hypothesis differs from a parameter in that many values of the parameter(s) are consistent with one hypothesis. Hypotheses are models that depend on parameters.
  • Note however that we cannot define the probability of one hypothesis given some data, d, without defining all the alternatives in its class, i.e. p(Hi|d) = p(Hi) p(d|Hi) / Σj p(Hj) p(d|Hj), and often this is impossible. So questions like "do the data fit a gaussian?" are not well-formed until you list all the curves the data could fit. A well-formed question would be "do the data fit a gaussian better than these other curves?", or more usually "do the data fit a gaussian better than a lorentzian?".
  • Simple comparisons can be expressed as an odds ratio, O = p(H1|d) / p(H2|d).

83
Bayesian hypothesis testing
  • The odds ratio can be divided into the prior odds and the Bayes factor:
O = [p(H1) / p(H2)] x [p(d|H1) / p(d|H2)]
  • The prior odds simply express our prior preference for H1 over H2, and are set to 1 if you are indifferent.
  • The Bayes factor is just the ratio of the evidences, as defined in the earlier lectures. Recall that for a model that depends on a parameter a,
p(d|H) = ∫ p(d|a,H) p(a|H) da,
so the evidence is simply the joint probability of the parameter(s) and the data, marginalised over all hypothesis parameter values.

84
Bayesian hypothesis testing
  • Example: a spacecraft is sent to a moon of Saturn and, using a penetrating probe, detects a liquid sea deep under the surface at 1 atmosphere pressure and a temperature of -3 C. However, the thermometer has a fault, so that the temperature reading may differ from the true temperature by as much as 5 C, with a uniform probability within this range. Determine the temperature of the liquid, assuming it is water (liquid within 0 < T < 100 C) and then assuming it is ethanol (liquid within -80 < T < 80 C). What are the odds of it being ethanol? (Based loosely on a problem by John Skilling.)

(Sketch: the measurement at -3 C, against the ranges over which ethanol (-80 to 80 C) and water (0 to 100 C) are liquid.)
85
Bayesian hypothesis testing
  • Call the water hypothesis H1 and the ethanol hypothesis H2.
  • For H1: the prior on the temperature is p(T|H1) = 1/100 for 0 < T < 100 C. The likelihood of the temperature is the probability of the data d, given the temperature: p(d|T) = 1/10 for |d - T| < 5, thought of as a function of T, for d = -3.
86
Bayesian hypothesis testing
  • The posterior for T is the normalised product of the prior and the likelihood, giving p(T|d,H1) = 1/2 for 0 < T < 2 C: H1 only allows the temperature to be between 0 and 2 C.
  • The evidence for water (as we defined it) is p(d|H1) = ∫ p(d|T) p(T|H1) dT = (1/10)(1/100) x 2 = 0.002.
  • For H2: by the same arguments the posterior is p(T|d,H2) = 1/10 for -8 < T < 2 C, and the evidence for ethanol is p(d|H2) = (1/10)(1/160) x 10 = 0.00625.
87
Bayesian hypothesis testing
  • So under the water hypothesis we have a tighter possible range for the liquid's temperature, but it may not be water. In fact, the odds of it being water rather than ethanol are O = 0.002/0.00625 = 0.32, which means about 3:1 in favour of ethanol. Of course this depends on our prior odds too, which we have set to 1. If the choice was between water and whisky under the surface of the moon the result would be very different, though the Bayes factor would be roughly the same!
  • Why do we prefer the ethanol option? Because too much of the prior for temperature, assuming water, falls at values that are excluded by the data. In other words, the water hypothesis is unnecessarily complicated. Bayes factors naturally implement Occam's razor in a quantitative way.

88
Bayesian hypothesis testing
(Plots: prior x likelihood for water (prior height 0.01) and for ethanol (prior height 0.00625), with the likelihood of height 0.1 spanning -8 < T < 2.)
The overlap integral (evidence) is greater for H2 (ethanol) than for H1 (water).
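The two evidences and the resulting odds are easy to verify numerically. A sketch (Python with numpy assumed):

    # Evidences and odds for the water vs. ethanol hypotheses.
    import numpy as np

    d = -3.0                                                  # measured temperature, deg C
    like = lambda T: np.where(np.abs(d - T) <= 5, 0.1, 0.0)   # uniform +/-5 C error

    def evidence(T_lo, T_hi):
        T = np.linspace(T_lo, T_hi, 200001)
        prior = 1.0 / (T_hi - T_lo)                           # uniform over the liquid range
        return np.sum(prior * like(T)) * (T[1] - T[0])

    E_water = evidence(0.0, 100.0)      # ~0.002
    E_ethanol = evidence(-80.0, 80.0)   # ~0.00625
    print(E_water / E_ethanol)          # odds ~0.32, i.e. about 3:1 in favour of ethanol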
89
Bayesian hypothesis testing
  • To look at this a bit more generally, we can split the evidence into two approximate parts, the maximum of the likelihood and an "Occam factor":
evidence ≈ maximum likelihood x Occam factor, with Occam factor ≈ likelihood_range / prior_range.
The Occam factor penalises models that include wasted parameter space, even if they show a good ML fit.
90
Bayesian hypothesis testing
  • This is a very powerful property of the Bayesian method. Example: you're given a time series of 1000 data points comprising a number of sinusoids embedded in gaussian noise. Determine the number of sinusoids and the standard deviation of the noise.
  • We could think of this as comparing hypotheses Hn that there are n sinusoids in the data, with n ranging from 0 to nmax. Equivalently, we could consider it as a parameter-fitting problem, with n an unknown parameter within the model. The joint posterior for the n signals, with amplitudes {An} and frequencies {ωn} and noise variance σ², given the overall model and data D, is
p({An},{ωn},σ,n|D) ∝ p(D|{An},{ωn},σ,n) p({An},{ωn},σ,n).
The likelihood term, based on gaussian noise, is proportional to exp(-Σi [Di - model(ti)]² / 2σ²), and we can set the priors as independent and uniform over sensible ranges.
91
Bayesian hypothesis testing
  • Our result for n is just its marginal probability, p(n|D) = ∫∫∫ p({An},{ωn},σ,n|D) d{An} d{ωn} dσ, and similarly we could marginalise for σ. Recent work, in collaboration with Umstatter and Christensen, has explored this:
  • 1000 data points with 50 embedded signals, σ = 2.6
Result: around 33 signals can be recovered from the data; the rest are indistinguishable from noise, and the inferred σ is consequently higher.
92
Bayesian hypothesis testing
  • This has implications for the analysis of LISA data, which is expected to contain many (perhaps 50,000) signals from white dwarf binaries. The data will contain resolvable binaries and binaries that just contribute to the overall noise (either because they are faint or because their frequencies are too close together). Bayes can sort these out without having to introduce ad hoc acceptance and rejection criteria, and without needing to know the "true noise level" (whatever that means!).

93
Frequentist hypothesis testing: and why we should tread carefully, if at all
94
Frequentist hypothesis testing: significance tests
  • A note on why these should really be avoided!
  • The method goes like this:
  • To test a hypothesis H1, consider another hypothesis, called the null hypothesis, H0, the truth of which would deny H1. Then argue against H0...
  • Use the data you have gathered to compute a test statistic Tobs (e.g. the chi-squared statistic) which has a calculable pdf if H0 is true. This can be calculated analytically or by Monte Carlo methods.
  • Look where your observed value of the statistic lies in the pdf, and reject H0 based on how far into the wings of the distribution you have fallen.

(Sketch: the pdf p(T|H0); reject H0 if your result lies far out in the tail.)
95
Frequentist hypothesis testing: significance tests
  • H0 is rejected at the x% level if x% of the probability lies to the right of the observed value of the statistic (or is "worse" in some other sense), and this makes no reference to how improbable the value is under any alternative hypothesis (not even H1!). So:

(Sketch: x% of the area under p(T|H0) lies beyond Tobs.)

"An hypothesis H0 that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."
Harold Jeffreys, Theory of Probability, 1939
96
Frequentist hypothesis testing: significance tests
  • E.g., you are given a data point Tobs affected by gaussian noise, drawn either from N(μ=-1, σ=0.5) or from N(μ=1, σ=0.5).
  H0: the datum comes from μ = -1
  H1: the datum comes from μ = 1
  Test whether H1 is true.
  • Here our statistic is simply the value of the observed data. We will choose a critical region of T > 0, so that if Tobs > 0 we reject H0 and therefore accept H1.

(Sketch: the two pdfs, centred on T = -1 for H0 and T = +1 for H1.)
97
Frequentist hypothesis testing
(Sketch: the Type I and Type II error probabilities as tail areas of the two pdfs on either side of T = 0.)
  • Formally, this can go wrong in two ways:
  • A Type I error occurs when we reject the null hypothesis when it is true.
  • A Type II error occurs when we accept the null hypothesis when it is false.
  • Both of these we should strive to minimise.
  • The probabilities (p-values) of these are shown as the coloured areas above (about 5% for this problem).
  • If Tobs > 0 then we reject the null hypothesis at the 5% level and therefore accept H1.

98
Frequentist hypothesis testing
  • But note that we do not consider the relative probabilities of the hypotheses (we can't! Orthodox statistics does not allow the idea of hypothesis probabilities), so the results can be misleading.
  • For example, let Tobs = 0.01. This lies just inside the critical region, so we reject H0 in favour of H1 at the 5% level, despite the fact that we know a value of 0.01 is just as likely under both hypotheses (actually just as unlikely, but it has happened).
  • Note that the Bayes factor for this same result is 1, reflecting the intuitive answer that you can't decide between the hypotheses based on such a datum.
  • Moral: the subtleties of p-values are rarely reflected in papers that quote them, and many errors of "Type III" (misuse!) occur. Always take them with a pinch of salt, and avoid them if possible; there are better tools available.
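A sketch of the comparison at Tobs = 0.01 (Python with scipy assumed):

    # p-value vs. Bayes factor for the two-Gaussian example.
    from scipy.stats import norm

    T_obs = 0.01
    H0 = norm(loc=-1.0, scale=0.5)
    H1 = norm(loc=+1.0, scale=0.5)

    p_value = H0.sf(T_obs)                           # tail area beyond T_obs under H0
    bayes_factor = H1.pdf(T_obs) / H0.pdf(T_obs)     # likelihood ratio at the datum
    print(p_value, bayes_factor)                     # small p-value, yet B ~ 1.08:
                                                     # the datum barely favours either hypothesis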

99
  • Assigning Bayesian probabilities
  • (you have to start somewhere!)

100
Assigning Bayesian probabilities
  • We have done much on the manipulation of probabilities, but not much on the initial assignments of likelihoods and priors. Here are a few words...
  • The principle of insufficient reason (Bernoulli, 1713) helps us out: if we have N mutually exclusive possibilities for an outcome, and no reason to believe one more than another, then each should be assigned the probability 1/N.
  • So the probability of throwing a 6 on a die is 1/6, if all you know is that it has 6 faces.
  • The key is to realise that this is one example of a more general principle: your state of knowledge is such that the probabilities are invariant under some transformation group (here, the exchange of numbers on faces).
  • Using this idea we can extend the principle of insufficient reason to continuous parameters.

101
Assigning Bayesian probabilities
  • So, a parameter that is not known to within a change of origin (a "location parameter") should have a uniform prior probability: p(x|I) = const.
  • A parameter for which you have no knowledge of scale (a "scale parameter") is a location parameter in its log, so p(x|I) ∝ 1/x: the so-called Jeffreys prior.
  • Note that both these priors are improper: you can't normalise them, so their unfettered use is restricted. However, you are rarely that ignorant about the value of a parameter.

102
Assigning Bayesian probabilities
  • More generally, given some information, I, that we wish to use to assign probabilities p1 ... pn to n different possibilities, the most honest way of assigning the probabilities is the one that maximises H = -Σi pi log pi, subject to the constraints imposed by I. H is traditionally called the information entropy of the distribution and measures the amount of uncertainty in the distribution in the sense of Shannon (though there are several routes to the same result). Honesty demands we maximise this, otherwise we are implicitly imposing further constraints not contained in I.
  • For example, the maximum entropy solution for the probability of seeing k events, given only the information that they are characterised by a single rate constant λ, is the Poisson distribution. If you are told the first and second moments of a continuous distribution, the maxent solution is a gaussian, etc.
103
Monte Carlo methods: how to do those nasty marginalisations
104
Monte Carlo Methods
  • 1. Uniform random numbers, U[0,1]
  • See Numerical Recipes! http://www.numerical-recipes.com/
105
Monte Carlo Methods
2. Transformed random variables. Suppose we have x ~ p(x), and let y = f(x). Then, equating the probability of a number between y and y+dy with the probability of a number between x and x+dx, p(y) dy = p(x) dx, so p(y) = p(x) |dx/dy| (the modulus because probability must be positive).
106
Monte Carlo Methods
  • Transformed random variables: suppose we have x ~ U[0,1], and let y = a + (b-a)x. Then p(y) = 1/(b-a) for a ≤ y ≤ b: a uniform deviate on [a,b].
Numerical Recipes uses the transformation method to provide deviates from other standard distributions (e.g. normal deviates, via the Box-Muller transformation).
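A one-line illustration of the transformation (Python with numpy assumed):

    # Transformation method: map U[0,1] deviates onto U[a,b].
    import numpy as np

    rng = np.random.default_rng()
    a, b = 2.0, 5.0
    x = rng.uniform(0.0, 1.0, size=100_000)
    y = a + (b - a) * x                    # p(y) = p(x)|dx/dy| = 1/(b-a) on [a,b]
    print(y.min(), y.max(), y.mean())      # mean ~ (a+b)/2 = 3.5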
107
Monte Carlo Methods
  • 3. Probability integral transform: suppose we can compute the CDF, C(y), of some desired random variable.
  • 1) Sample u from U[0,1]
  • 2) Compute y = C⁻¹(u)
  • Then y is drawn from the pdf corresponding to C.

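A sketch for an exponential pdf, whose CDF inverts analytically (Python with numpy assumed; the target pdf is an assumed example):

    # Probability integral transform for an exponential pdf, p(y) = exp(-y).
    import numpy as np

    rng = np.random.default_rng()
    u = rng.uniform(0.0, 1.0, size=100_000)   # 1) uniform deviates
    y = -np.log(1.0 - u)                      # 2) y = C^-1(u), with C(y) = 1 - exp(-y)
    print(y.mean())                           # ~1, the mean of this exponential pdf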
108
Monte Carlo Methods

4. Rejection sampling: suppose we want to sample from some pdf p(x), and we know that p(x) ≤ M q(x) everywhere, where q(x) is a simpler pdf called the proposal distribution. Then:
  • Sample x1 from q(x)
  • Sample y from U[0, M q(x1)]
  • If y < p(x1), ACCEPT; otherwise REJECT
The set of accepted values {xi} is a sample from p(x).
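A sketch with an assumed Gaussian-shaped target and a uniform proposal (Python with numpy assumed):

    # Rejection sampling: unnormalised target p(x) = exp(-0.5 (x-1)^2), which is <= 1,
    # with a uniform proposal on [-4, 6] under a constant envelope of height 1.
    import numpy as np

    rng = np.random.default_rng()
    p = lambda x: np.exp(-0.5 * (x - 1.0)**2)   # unnormalised target; max value 1 at x = 1

    x1 = rng.uniform(-4.0, 6.0, size=100_000)   # sample x1 from the proposal q(x)
    y = rng.uniform(0.0, 1.0, size=100_000)     # sample y uniformly under the envelope
    samples = x1[y < p(x1)]                     # accept where y < p(x1)
    print(samples.mean(), samples.std())        # ~1 and ~1: a Gaussian about x = 1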
109
Monte Carlo Methods

4. Rejection sampling (continued): the method can be very slow if the shaded region (between the envelope and p(x)) is too large, particularly in high-N problems, such as the LISA problem considered earlier.
110
Monte Carlo Methods
  • Metropolis-Hastings algorithm: speed this up by letting the proposal distribution depend on the current sample.
  • Sample an initial point x0
  • Sample a tentative new state x' from q(x'|xt) (e.g. a Gaussian centred on the current state xt)
  • Compute the acceptance ratio a = [p(x') q(xt|x')] / [p(xt) q(x'|xt)]
  • If a ≥ 1: accept; otherwise: accept with probability a
The sequence {xt} is a Markov chain. Note that rejected samples appear in the chain as repeated values of the current state.
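A minimal sketch of the algorithm for a one-dimensional target (Python with numpy assumed; the target and the proposal width are arbitrary choices):

    # Minimal Metropolis sampler with a symmetric Gaussian proposal.
    import numpy as np

    rng = np.random.default_rng()
    log_p = lambda x: -0.5 * (x - 1.0)**2        # log of an unnormalised 1-D target

    x = 0.0                                      # initial state
    chain = []
    for _ in range(50_000):
        x_new = x + rng.normal(0.0, 1.0)         # tentative new state from q(x'|x)
        a = np.exp(log_p(x_new) - log_p(x))      # acceptance ratio (q symmetric, so the q's cancel)
        if rng.uniform() < a:                    # accept if a >= 1, else with probability a
            x = x_new
        chain.append(x)                          # rejections repeat the current state
    print(np.mean(chain), np.std(chain))         # ~1 and ~1 for this target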
111
Monte Carlo Methods

A histogram of the contents of the chain for a parameter converges to the marginal pdf for that parameter. In this way, high-dimensional posteriors can be explored and marginalised to return parameter ranges and interrelations inferred from complex data sets (e.g. the WMAP results).
http://www.statslab.cam.ac.uk/mcmc/pages/links.html
112
(No Transcript)
113
The end... nearly
114
Final thoughts
  • Things to keep in mind:
  • Priors are fundamental, but not always influential. If you have good data, most reasonable priors return very similar posteriors.
  • Degree of belief is subjective but not arbitrary: two people with the same degree of belief agree about the value of the corresponding probability.
  • Write down the probability of everything, and marginalise over those parameters that are of no interest.
  • Don't pre-filter if you can help it. Work with the data, not statistics of the data.

115
Bayesian bibliography for astronomy
  • There are an increasing number of good books and articles on Bayesian methods. Here are just a few:
  • E.T. Jaynes, Probability Theory: The Logic of Science, CUP, 2003
  • D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, CUP, 2004
  • D.S. Sivia, Data Analysis: A Bayesian Tutorial, OUP, 1996
  • T.J. Loredo, From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics, 1990; copy at http://bayes.wustl.edu/gregory/articles.pdf
  • G.L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation, 1988; copy at http://bayes.wustl.edu/glb/book.pdf
  • G. D'Agostini, Bayesian Reasoning in High-Energy Physics: Principles and Applications, http://preprints.cern.ch/cgi-bin/setlink?base=cernrep&categ=Yellow_Report&id=99-03
  • Soon to appear: P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, CUP, 2005
  • Review sites:
  • http://bayes.wustl.edu/
  • http://www.bayesian.org/