G. Cowan
1 Statistical Issues for ATLAS Physics
ATLAS Statistics Workshop, CERN, 18-19 January 2007
Glen Cowan, Physics Department, Royal Holloway, University of London
g.cowan_at_rhul.ac.uk
www.pp.rhul.ac.uk/cowan
2 Outline
1. Probability: frequentist vs. subjective (Bayesian)
2. Statistical tests: multivariate methods, goodness-of-fit tests
3. Parameter estimation: maximum likelihood, least squares, variance of estimators
4. Interval estimation: setting limits
5. Further topics: systematic errors, MCMC
3 Frequentist Statistics: general philosophy
In frequentist statistics, probabilities are associated only with the data, i.e., outcomes of repeatable observations.
Probability = limiting frequency.
Probabilities such as P(Higgs boson exists), P(0.117 < αs < 0.121), etc. are either 0 or 1, but we don't know which.
The tools of frequentist statistics tell us what
to expect, under the assumption of certain
probabilities, about hypothetical repeated
observations.
The preferred theories (models, hypotheses, ...) are those for which our observations would be considered "usual".
4 Bayesian Statistics: general philosophy
In Bayesian statistics, the interpretation of probability is extended to degree of belief (subjective probability). Use this for hypotheses:

P(H | x) = P(x | H) π(H) / ∫ P(x | H') π(H') dH'

where P(x | H) is the probability of the data assuming hypothesis H (the likelihood), π(H) is the prior probability (i.e., before seeing the data), P(H | x) is the posterior probability (i.e., after seeing the data), and the normalization involves a sum (integral) over all possible hypotheses.
Bayesian methods can provide a more natural treatment of non-repeatable phenomena: systematic uncertainties, the probability that the Higgs boson exists, ...
No golden rule for priors ("if-then" character of Bayes' theorem).
5 Statistical tests (in a particle physics context)
Suppose the result of a measurement for an individual event is a collection of numbers
x1 = number of muons, x2 = mean pT of jets, x3 = missing energy, ...
The vector x = (x1, ..., xn) follows some n-dimensional joint pdf, which depends on the type of event produced, i.e., on which physics process it came from.
For each reaction we consider we will have a hypothesis for the pdf of x, e.g., f(x | H0), f(x | H1), etc.
Often call H0 the signal hypothesis (the event type we want); H1, H2, ... are background hypotheses.
6 Selecting events
Suppose we have a data sample with two kinds of events, corresponding to hypotheses H0 and H1, and we want to select those of type H0. Each event is a point in x-space. What decision boundary should we use to accept/reject events as belonging to event type H0?
Probably start with cuts.
(Plot: H0 and H1 event populations in x-space, with a rectangular cut-based "accept" region.)
7 Other ways to select events
Or maybe use some other sort of decision boundary: linear or nonlinear.
(Plots: linear and nonlinear decision boundaries separating H0 from H1, each defining an "accept" region.)
How can we do this in an optimal way?
8 Test statistics
Construct a test statistic of lower dimension (e.g. a scalar), t(x).
The goal is to compactify the data without losing the ability to discriminate between hypotheses.
We can work out the pdfs g(t | H0), g(t | H1), ...
The decision boundary is now a single cut on t. This effectively divides the sample space into two regions, where we either accept H0 (acceptance region) or reject it (critical region).
9 Significance level and power of a test
Probability to reject H0 if it is true (error of the 1st kind):
α = P(t ∈ critical region | H0)   (significance level)
Probability to accept H0 if H1 is true (error of the 2nd kind):
β = P(t ∉ critical region | H1)   (1 − β = power)
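As a numerical illustration (not from the slides), here is a short Python sketch that computes α and the power 1 − β for a simple cut on a scalar test statistic t, assuming Gaussian pdfs g(t|H0) and g(t|H1); the means, width and cut value are arbitrary choices.

```python
# Illustration: alpha and power for a cut on a scalar test statistic t,
# assuming (hypothetical) Gaussian pdfs g(t|H0) and g(t|H1).
from scipy.stats import norm

mu0, mu1, sigma = 0.0, 2.0, 1.0   # assumed means/width of g(t|H0), g(t|H1)
t_cut = 1.0                       # accept H0 for t < t_cut, reject otherwise

alpha = norm.sf(t_cut, mu0, sigma)   # P(reject H0 | H0 true), significance level
beta  = norm.cdf(t_cut, mu1, sigma)  # P(accept H0 | H1 true), error of the 2nd kind
print(f"alpha = {alpha:.3f}, power = 1 - beta = {1 - beta:.3f}")
```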
10 Efficiency, purity, etc.
Signal efficiency εs = (signal events accepted) / (all signal events); background efficiency εb is defined analogously.
Expected number of signal events: s = εs σs L
Expected number of background events: b = εb σb L
Prior probabilities πs, πb are proportional to the cross sections, so for e.g. the signal purity,
purity = s / (s + b).
11 Constructing a test statistic
How can we select events in an optimal way? The Neyman-Pearson lemma (proof in Brandt, Ch. 8) states:
To get the lowest εb for a given εs (highest power for a given significance level), choose the acceptance region such that
f(x | H0) / f(x | H1) > c,
where c is a constant which determines εs. Equivalently, the optimal scalar test statistic is
t(x) = f(x | H0) / f(x | H1).
N.B. any monotonic function of this is just as good.
12 Purity vs. efficiency: optimal trade-off
Consider selecting n events, with expected numbers s from signal and b from background, so n ~ Poisson(s + b). Suppose b is known and the goal is to estimate s with minimum relative statistical error. Take as estimator
ŝ = n − b.
The variance of a Poisson variable equals its mean, therefore
V[ŝ] = V[n] = s + b  →  σ_ŝ / s = √(s + b) / s.
So we should maximize
s / √(s + b), i.e., s² / (s + b),
equivalent to maximizing the product of signal efficiency × purity.
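A small Python sketch of this optimization (all numbers and the Gaussian shapes of g(t|s) and g(t|b) are illustrative assumptions, not from the slides): scan a cut on a scalar statistic t and pick the value that maximizes s/√(s+b).

```python
# Sketch (assumed inputs): choose the cut on a scalar statistic t that
# maximizes s/sqrt(s+b), equivalent to maximizing s^2/(s+b) as above.
import numpy as np
from scipy.stats import norm

S_tot, B_tot = 100.0, 10000.0          # hypothetical expected signal/background before cuts
t_cut = np.linspace(-3, 6, 500)        # scan of possible cut values (accept t > t_cut)

eff_s = norm.sf(t_cut, loc=2.0, scale=1.0)   # signal efficiency, assuming g(t|s) = N(2,1)
eff_b = norm.sf(t_cut, loc=0.0, scale=1.0)   # background efficiency, assuming g(t|b) = N(0,1)

s = S_tot * eff_s
b = B_tot * eff_b
fom = s / np.sqrt(s + b)               # figure of merit from the slide

best = np.argmax(fom)
print(f"best cut t > {t_cut[best]:.2f}, s = {s[best]:.1f}, b = {b[best]:.1f}, "
      f"s/sqrt(s+b) = {fom[best]:.2f}")
```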
13 Why Neyman-Pearson doesn't always help
The problem is that we usually don't have explicit formulae for the pdfs f(x | H0), f(x | H1).
Instead we may have Monte Carlo models for signal and background processes, so we can produce simulated data and enter each event into an n-dimensional histogram. Use e.g. M bins for each of the n dimensions, for a total of M^n cells.
But n is potentially large, → prohibitively large number of cells to populate with Monte Carlo data.
Compromise: make an Ansatz for the form of the test statistic with fewer parameters; determine them (e.g. using MC) to give the best discrimination between signal and background.
14 Fisher discriminant
Assume a linear test statistic, t(x) = Σ_i w_i x_i, and maximize the separation between the two classes. This corresponds to a linear decision boundary.
(Plot: H0 and H1 populations separated by a linear boundary defining the "accept" region.)
Equivalent to Neyman-Pearson if the signal and background pdfs are multivariate Gaussian with equal covariances; otherwise not optimal, but still often a simple, practical solution. Sometimes first transform the data to better approximate Gaussians.
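A minimal numpy sketch of the Fisher construction, using the standard formula w ∝ W⁻¹(μs − μb) with W the sum of the within-class covariance matrices; the toy Gaussian training samples are purely illustrative.

```python
# Sketch: Fisher discriminant coefficients from (toy) training samples.
# w is proportional to W^-1 (mu_s - mu_b), where W is the sum of the
# within-class covariance matrices.
import numpy as np

rng = np.random.default_rng(1)
x_sig = rng.multivariate_normal([1.0, 0.5], [[1.0, 0.3], [0.3, 1.0]], size=5000)
x_bkg = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=5000)

mu_s, mu_b = x_sig.mean(axis=0), x_bkg.mean(axis=0)
W = np.cov(x_sig, rowvar=False) + np.cov(x_bkg, rowvar=False)
w = np.linalg.solve(W, mu_s - mu_b)       # Fisher coefficients (up to normalization)

t_sig = x_sig @ w                          # linear test statistic t(x) = w . x
t_bkg = x_bkg @ w
print("coefficients:", w, " separation:",
      (t_sig.mean() - t_bkg.mean()) / np.sqrt(0.5 * (t_sig.var() + t_bkg.var())))
```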
15 Nonlinear test statistics
The optimal decision boundary may not be a hyperplane, → nonlinear test statistic.
(Plot: H0 and H1 populations separated by a nonlinear boundary defining the "accept" region.)
Multivariate statistical methods are a Big Industry: neural networks, support vector machines, boosted decision trees, kernel density methods, ...
Particle physics can benefit from progress in Machine Learning.
16 Neural networks: the multi-layer perceptron
Use e.g. the logistic sigmoid activation function, h(u) = 1 / (1 + e^(−u)).
Define values for the hidden nodes, φ_i(x) = h( w_i0 + Σ_j w_ij x_j ).
The network output is given by t(x) = h( W_0 + Σ_i W_i φ_i(x) ).
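A minimal forward pass of such a network in numpy, under the single-hidden-layer structure sketched above; the weights here are random placeholders standing in for values that would in practice be determined by training on simulated events.

```python
# Minimal forward pass of a single-hidden-layer perceptron with logistic
# sigmoid activations; the weights are random placeholders (in practice
# they would come from training on MC events).
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 5
w_hidden = rng.normal(size=(n_hidden, n_inputs + 1))   # hidden-layer weights (+ bias)
w_out = rng.normal(size=n_hidden + 1)                  # output-layer weights (+ bias)

def network_output(x):
    """t(x) for one event with input variables x."""
    phi = sigmoid(w_hidden @ np.append(1.0, x))        # hidden-node values
    return sigmoid(w_out @ np.append(1.0, phi))        # scalar output in (0, 1)

print(network_output(np.array([0.2, -1.0, 0.5])))
```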
17 Neural network discussion
Why not use all of the available input variables?
Fewer inputs → fewer parameters to be adjusted, → parameters better determined for finite training data.
Some inputs may be highly correlated → drop all but one.
Some inputs may contain little or no discriminating power between the hypotheses → drop them.
The NN exploits higher moments (nonlinear features) of the joint pdf f(x | H), but these may not be well modeled in the training data. Better to have a simpler t(x) where you can understand what it's doing.
18 Neural network discussion (2)
The purpose of the statistical test is often to select objects for further study and then measure their properties. Need to avoid input variables that are correlated with the properties of the selected objects that you want to study. (Not always easy; correlations may be poorly known.)
Some NN references:
L. Lönnblad et al., Comp. Phys. Comm. 70 (1992) 167
C. Peterson et al., Comp. Phys. Comm. 81 (1994) 185
C.M. Bishop, Neural Networks for Pattern Recognition, OUP (1995)
J. Hertz et al., Introduction to the Theory of Neural Computation, Addison-Wesley, New York (1991)
19 Testing goodness-of-fit
Suppose hypothesis H predicts the pdf f(x | H) for a set of observations x = (x1, ..., xn).
We observe a single point in this space: x_obs.
What can we say about the validity of H in light of the data?
Decide what part of the data space represents less compatibility with H than does the point x_obs. (Not unique!)
(Plot: regions of x-space that are more compatible / less compatible with H.)
20 p-values
Express goodness-of-fit by giving the p-value for H:
p = probability, under the assumption of H, to observe data with equal or lesser compatibility with H relative to the data we got.
This is not the probability that H is true!
In frequentist statistics we don't talk about P(H) (unless H represents a repeatable observation). In Bayesian statistics we do use Bayes' theorem to obtain
P(H | x) ∝ P(x | H) π(H),
where π(H) is the prior probability for H.
For now stick with the frequentist approach; the result is a p-value, regrettably easy to misinterpret as P(H).
21 The significance of an observed signal
Suppose we observe n events; these can consist of
nb events from known processes (background) and
ns events from a new process (signal).
If ns, nb are Poisson r.v.s with means s, b, then n = ns + nb is also Poisson, with mean s + b.
Suppose b = 0.5, and we observe nobs = 5. Should we claim evidence for a new discovery? Give the p-value for the hypothesis s = 0.
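This p-value can be checked numerically, e.g. with scipy (the package choice is mine, not from the slides):

```python
# p-value of the s = 0 hypothesis for the counting example above:
# P(n >= 5) with n ~ Poisson(b), b = 0.5.
from scipy.stats import poisson

b, n_obs = 0.5, 5
p_value = poisson.sf(n_obs - 1, b)     # P(n >= n_obs) = 1 - P(n <= n_obs - 1)
print(f"p-value = {p_value:.2e}")      # about 1.7e-4
```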
22 The significance of a peak
Suppose we measure a value x for each event and find a histogram with an apparent peak.
Each bin content (observed) is a Poisson r.v.; the means are given by the dashed lines (expected background).
In the two bins with the peak, 11 entries are found with b = 3.2 expected. The p-value for the s = 0 hypothesis is P(n ≥ 11; b = 3.2).
23 The significance of a peak (2)
But... did we know where to look for the peak? → give P(n ≥ 11) in any 2 adjacent bins.
Is the observed width consistent with the expected x resolution? → take the x window several times the expected resolution.
How many bins × distributions have we looked at? → look at a thousand of them, you'll find a 10^-3 effect.
Did we adjust the cuts to enhance the peak? → freeze cuts, repeat analysis with new data.
How about the bins to the sides of the peak... (too low!)
Should we publish????
24 Making a discovery
Often compute the p-value of the background-only hypothesis H0 using a test variable related to a characteristic of the signal.
p-value = probability to see data as incompatible with H0, or more so, relative to the data observed.
Requires a definition of "incompatible with H0".
HEP folklore: claim discovery if the p-value is equivalent to a 5σ fluctuation of a Gaussian variable (one-sided).
The actual p-value at which a discovery becomes believable will depend on the signal in question (subjective).
Why not do a Bayesian analysis? Often we don't know how to assign meaningful prior probabilities.
25 Parameter estimation
Consider a pdf f(x; θ) containing one or more undetermined parameters, where x is a random variable and θ the parameter.
Suppose we have a sample of observed values x1, ..., xn.
We want to find some function of the data to estimate the parameter(s): the estimator θ̂ (written with a hat).
No golden rule -- construct estimators to possess desirable properties (small or zero bias, small variance, etc.)
26 Properties of estimators
If we were to repeat the entire measurement, the estimates from each would follow a pdf.
(Plot: sampling distributions of an estimator that is "best", one with large variance, and one that is biased.)
We want small (or zero) bias (systematic error) → the average of repeated measurements should tend to the true value.
And we want a small variance (statistical error).
→ Small bias and small variance are in general conflicting criteria.
27 The likelihood function
Suppose the output of our measurement is some set of quantities x (here x symbolizes the whole set of measured values). A hypothesized model will predict the joint pdf for x, f(x | θ). The model may contain undetermined parameters θ = (θ1, ..., θm).
Now evaluate f(x | θ) with the data sample obtained and regard it as a function of the parameter(s). This is the likelihood function:
L(θ) = f(x | θ).
If the data are n independent and identically distributed (iid) observations of x, x1, ..., xn, then the joint pdf of x factorizes and
L(θ) = Π_i f(x_i | θ).
28 Maximum likelihood estimators
If the hypothesized θ is close to the true value, then we expect a high probability to get data like that which we actually found.
So we define the maximum likelihood (ML) estimator(s) to be the parameter value(s) for which the likelihood is maximum. ML estimators are not guaranteed to have any optimal properties (but in practice they're very good).
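As a minimal sketch of the ML method in practice (toy example, not from the slides): fit an exponential decay constant τ by numerically minimizing −ln L.

```python
# Sketch: maximum-likelihood fit of an exponential decay constant tau
# to toy data by minimizing -ln L numerically.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
t = rng.exponential(scale=1.5, size=1000)      # toy data, true tau = 1.5

def neg_log_like(tau):
    # f(t; tau) = (1/tau) exp(-t/tau), t >= 0
    return np.sum(np.log(tau) + t / tau)

res = minimize_scalar(neg_log_like, bounds=(0.1, 10.0), method="bounded")
print(f"tau_hat = {res.x:.3f}")                # for this pdf, equals the sample mean
```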
29 The method of least squares
Suppose we measure N values, y1, ..., yN, assumed to be independent Gaussian r.v.s with means E[y_i] = μ(x_i; θ).
Assume known values of the control variable x1, ..., xN and known variances σ_i².
We want to estimate θ, i.e., fit the curve to the data points.
The likelihood function is
L(θ) = Π_i (2π σ_i²)^(−1/2) exp[ −(y_i − μ(x_i; θ))² / (2σ_i²) ].
30 The method of least squares (2)
The log-likelihood function is therefore
ln L(θ) = −(1/2) Σ_i (y_i − μ(x_i; θ))² / σ_i² + terms not depending on θ.
So maximizing the likelihood is equivalent to minimizing
χ²(θ) = Σ_i (y_i − μ(x_i; θ))² / σ_i².
The minimum of this quantity defines the least squares estimator, even for cases where the data are not Gaussian. Outliers arising from non-Gaussian errors can have unexpectedly large weight in LS fits.
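A minimal least-squares fit in Python with known Gaussian errors (toy straight-line data; the use of scipy.optimize.curve_fit is my choice, not from the slides):

```python
# Sketch: least-squares fit of a straight line to points with known
# Gaussian errors sigma_i, using toy data.
import numpy as np
from scipy.optimize import curve_fit

def mu(x, theta0, theta1):            # fit function mu(x; theta)
    return theta0 + theta1 * x

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 3.1, 3.9, 5.2, 5.8])          # toy measurements
sigma = np.full_like(y, 0.2)                      # known standard deviations

theta_hat, cov = curve_fit(mu, x, y, sigma=sigma, absolute_sigma=True)
chi2 = np.sum(((y - mu(x, *theta_hat)) / sigma) ** 2)
print("theta_hat =", theta_hat, " errors =", np.sqrt(np.diag(cov)), " chi2 =", chi2)
```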
31 Parameter estimation discussion
There are some cases where we would intentionally choose a biased (non-ML) estimator, e.g., in unfolding problems where we accept a small bias in exchange for a large reduction in variance.
And there could be cases where, for the sake of convenience, we choose LS rather than ML even for non-Gaussian data.
And occasionally we choose an unbiased estimator rather than ML if it's not too difficult, e.g., the sample variance.
But usually ML estimators are about as good as you can do.
"No one ever got fired for maximizing the likelihood." -- anonymous
32 Confidence intervals
Frequentist intervals (limits) for a parameter θ can be found by defining a test of the hypothesized value θ (do this for all θ):
Specify values of the data x that are disfavoured by θ (the critical region) such that P(x in critical region) ≤ γ for a prespecified γ, e.g., 0.05 or 0.1.
If x is observed in the critical region, reject the value θ.
Now invert the test to define a confidence interval as the set of θ values that would not be rejected in a test with significance level γ (the confidence level is 1 − γ).
The interval will cover the true value of θ with probability ≥ 1 − γ.
Equivalent to the Neyman confidence-belt construction. The confidence belt is the acceptance region of the test.
33 Poisson data with background
Count n events, e.g., in a fixed time or integrated luminosity.
s = expected number of signal events
b = expected number of background events
n ~ Poisson(s + b)
Sometimes b is known, other times it is in some way uncertain.
Goals: (i) convince people that s ≠ 0 (discovery); (ii) measure or place limits on s, taking into consideration the uncertainty in b.
Widely discussed in the HEP community, see e.g. proceedings of the PHYSTAT meetings (Durham, Fermilab, CERN workshops, ...).
34 Setting limits: classical method
E.g. for an upper limit on s, take the critical region to be low values of n; the limit sup at confidence level 1 − β is thus found from
β = P(n ≤ nobs; sup + b) = Σ_{n=0}^{nobs} f(n; sup + b).
Similarly for a lower limit at confidence level 1 − α,
α = P(n ≥ nobs; slo + b) = Σ_{n=nobs}^{∞} f(n; slo + b).
Sometimes choose α = β = γ/2 → central confidence interval.
35 Calculating classical limits
To solve for slo, sup, one can exploit the relation to the χ² distribution:
sup = (1/2) F⁻¹_χ²(1 − β; 2(nobs + 1)) − b,
slo = (1/2) F⁻¹_χ²(α; 2 nobs) − b,
where F⁻¹_χ² is the quantile (inverse cdf) of the χ² distribution.
For a low fluctuation of n this can give a negative result for sup, i.e., the confidence interval is empty.
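These quantile formulae translate directly into code, e.g. with scipy (the example values of n and b are arbitrary):

```python
# Classical (frequentist) Poisson limits using the chi-square quantile
# relation quoted above (scipy's chi2.ppf gives the quantile F^-1).
from scipy.stats import chi2

n, b = 5, 0.5          # observed events and known background (example values)
cl = 0.95              # confidence level 1 - beta (upper), 1 - alpha (lower)

s_up = 0.5 * chi2.ppf(cl, 2 * (n + 1)) - b                       # upper limit
s_lo = 0.5 * chi2.ppf(1.0 - cl, 2 * n) - b if n > 0 else 0.0     # lower limit
print(f"s_lo = {s_lo:.2f}, s_up = {s_up:.2f}")
```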
36 The Bayesian approach
In Bayesian statistics we need to start with a prior pdf π(θ); this reflects the degree of belief about θ before doing the experiment. Bayes' theorem tells how our beliefs should be updated in light of the data x:
p(θ | x) ∝ L(x | θ) π(θ).
Integrate the posterior pdf p(θ | x) to give an interval with any desired probability content. For e.g. a Poisson parameter, the 95% CL upper limit sup is found from
0.95 = ∫_0^{sup} p(s | n) ds.
37 Bayesian prior for Poisson parameter
Include the knowledge that s ≥ 0 by setting the prior π(s) = 0 for s < 0. Often try to reflect "prior ignorance" with e.g. a flat prior, π(s) = constant for s ≥ 0.
This is not normalized, but that is OK as long as L(s) dies off for large s. It is not invariant under a change of parameter: if we had instead used a flat prior for, say, the mass of the Higgs boson, this would imply a non-flat prior for the expected number of Higgs events. It doesn't really reflect a reasonable degree of belief, but it is often used as a point of reference, or viewed as a recipe for producing an interval whose frequentist properties can be studied (coverage will depend on the true s).
38 Bayesian interval with flat prior for s
Solve numerically to find the limit sup. For the special case b = 0, the Bayesian upper limit with a flat prior is numerically the same as the classical case (a coincidence).
Otherwise the Bayesian limit is everywhere greater than the classical one (conservative). It never goes negative, and it doesn't depend on b if n = 0.
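A sketch of the numerical solution for the flat-prior Bayesian upper limit (toy values of n and b; the integration scheme is my choice, not from the slides):

```python
# Sketch: Bayesian 95% upper limit on s with a flat prior for s >= 0 and
# known b, by numerical integration of the posterior.
import numpy as np
from scipy.stats import poisson
from scipy.integrate import quad
from scipy.optimize import brentq

n, b, cl = 5, 0.5, 0.95

def posterior_unnorm(s):
    return poisson.pmf(n, s + b)       # flat prior => posterior ~ L(n | s + b)

norm_const, _ = quad(posterior_unnorm, 0.0, 100.0)

def cdf_minus_cl(s_up):
    return quad(posterior_unnorm, 0.0, s_up)[0] / norm_const - cl

s_up = brentq(cdf_minus_cl, 0.0, 100.0)
print(f"Bayesian upper limit s_up = {s_up:.2f}")
```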
39 Likelihood ratio limits (Feldman-Cousins)
Define the likelihood ratio for the hypothesized parameter value s:
λ(s) = L(s) / L(ŝ),
where ŝ is the ML estimator; note 0 ≤ λ(s) ≤ 1.
The critical region is defined by low values of the likelihood ratio. The resulting intervals can be one- or two-sided (depending on n).
(Re)discovered for HEP by Feldman and Cousins, Phys. Rev. D 57 (1998) 3873.
40 More on intervals from the LR test (Feldman-Cousins)
Caveat with coverage: suppose we find n >> b. Usually one then quotes a measurement of s. If, however, n isn't large enough to claim discovery, one sets a limit on s.
FC pointed out that if this decision is made based on n, then the actual coverage probability of the interval can be less than the stated confidence level ("flip-flopping"). FC intervals remove this, providing a smooth transition from 1- to 2-sided intervals, depending on n.
But, suppose FC gives e.g. 0.1 < s < 5 at 90% CL; the p-value of s = 0 can still be substantial. Is part of the upper limit wasted?
41 Properties of upper limits
Example: take b = 5.0, 1 − β = 0.95.
(Plots: upper limit sup vs. n, and mean upper limit vs. s.)
42 Upper limit versus b
Feldman & Cousins, Phys. Rev. D 57 (1998) 3873.
If n = 0 is observed, should the upper limit depend on b?
Classical: yes. Bayesian: no. FC: yes.
43 Coverage probability of confidence intervals
Because of the discreteness of Poisson data, the probability for the interval to include the true value is in general > the confidence level ("over-coverage").
44 Discussion on limits
Different sorts of limits answer different questions. A frequentist confidence interval does not (necessarily) answer, "What do we believe the parameter's value is?"
Coverage is nice, but is it crucial? Look also at sensitivity, e.g., E[sup | s = 0].
Consider also politics: the need for consensus/conventions, convenience and the ability to combine results, ...
For any result, the consumer will compute (mentally or otherwise) the posterior using the consumer's prior. Need the likelihood (or a summary thereof).
45 Statistical vs. systematic errors
Statistical errors: How much would the result fluctuate upon repetition of the measurement? Implies some set of assumptions to define the probability of the outcome of the measurement.
Systematic errors: What is the uncertainty in my result due to uncertainty in my assumptions, e.g., model (theoretical) uncertainty, modelling of the measurement apparatus?
Usually taken to mean that the sources of error do not vary upon repetition of the measurement. Often result from the uncertain value of, e.g., calibration constants, efficiencies, etc.
46 Systematic errors and nuisance parameters
The response of the measurement apparatus is never modelled perfectly.
(Plot: measured value y vs. true value x, comparing the model to the truth.)
The model can be made to approximate the truth better by including more free parameters.
systematic uncertainty ↔ nuisance parameters
47 A typical fit (symbolic)
Given measurements y_i and (usually) covariances V_ij.
The predicted value is the expectation value μ(x; θ) plus a possible bias b, where θ represents PDF parameters, σ's, etc., and x is a control variable.
Often take the covariance matrix to be diagonal, and minimize the corresponding χ².
This is equivalent to maximizing L(θ) ∝ e^(−χ²/2), i.e., least squares is the same as maximum likelihood using a Gaussian likelihood function.
48 Its Bayesian equivalent
Take Gaussian pdfs for the measurements and priors for the parameters θ and biases b, form the joint probability for all parameters, and use Bayes' theorem:
p(θ, b | y) ∝ L(y | θ, b) π(θ) π_b(b).
To get the desired probability for θ, integrate (marginalize) over b.
→ The posterior is Gaussian with mode the same as the least squares estimator, and σ_θ the same as from χ² = χ²_min + 1. (Back where we started!)
49 Marginalizing with Markov Chain Monte Carlo
In a Bayesian analysis we usually need to integrate over some (or all) of the parameters, e.g., to obtain the probability density for the prediction of an observable ω(θ).
The integrals are often of high dimension and usually cannot be done in closed form or with acceptance-rejection Monte Carlo.
Markov Chain Monte Carlo (MCMC) has revolutionized Bayesian computation. (Google: Metropolis-Hastings, MCMC.) It produces a correlated sequence of points in the sampled space. The correlations here are not fatal, but the statistical error is larger than the naive √n.
50 A prior for the bias, π_b(b), with longer tails
Represents an "error on the error"; the standard deviation of π_s(s) is σ_s.
(Plot: π_b(b) vs. b for the Gaussian and longer-tailed priors.)
Gaussian (σ_s = 0): P(b > 4σ_sys) = 6.3 × 10⁻⁵
σ_s = 0.5: P(b > 4σ_sys) = 0.65
51 A simple test
Suppose the fit effectively averages four measurements. Take σ_sys = σ_stat = 0.1, uncorrelated.
Case 1: the data appear compatible.
(Plots: the four measurements vs. experiment, and the posterior p(μ | y).)
Usually summarize the posterior p(μ | y) with its mode and standard deviation.
52 Simple test with inconsistent data
Case 2: there is an outlier.
(Plots: the four measurements vs. experiment, and the posterior p(μ | y).)
→ The Bayesian fit is less sensitive to the outlier.
→ The error is now connected to the goodness-of-fit.
53 Goodness-of-fit vs. size of error
In an LS fit, the value of the minimized χ² does not affect the size of the error on the fitted parameter. In a Bayesian analysis with a non-Gaussian prior for the systematics, a high χ² corresponds to a larger error (and vice versa).
(Plot: posterior σ_μ vs. χ², compared with σ_μ from least squares, for 2000 repetitions of the experiment with σ_s = 0.5 and no actual bias.)
54 Bayesian limits with uncertainty on b
The uncertainty on b goes into the prior, e.g., a Gaussian π_b(b) centred on the estimated background.
Put this into Bayes' theorem and marginalize over b; then use p(s | n) to find intervals for s with any desired probability content.
The controversial part here is the prior for the signal, π_s(s) (the treatment of nuisance parameters is easy).
55 Cousins-Highland method
Regard b as random, characterized by a pdf π(b). This makes sense in the Bayesian approach, but in the frequentist model b is constant (although unknown). A measurement b_meas is random, but it is not the mean number of background events; rather, b is.
Compute anyway
P(n | s) = ∫ P(n | s, b) π_b(b) db.
This would be the probability for n if Nature were to generate a new value of b upon repetition of the experiment with π_b(b). Now e.g. use this P(n | s) in the classical recipe for an upper limit at CL = 1 − β.
The result has a hybrid Bayesian/frequentist character.
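A sketch of this recipe in Python, assuming a Gaussian prior for b truncated at b = 0 (the toy numbers and the truncation choice are mine, not from the slides):

```python
# Sketch of the Cousins-Highland construction: average the Poisson
# probability over a (truncated) Gaussian prior for b, then use the
# resulting P(n|s) in the classical upper-limit recipe.
import numpy as np
from scipy.stats import poisson, norm
from scipy.integrate import quad
from scipy.optimize import brentq

n_obs, b_meas, sigma_b, cl = 5, 3.0, 1.0, 0.95

def prob_n_given_s(n, s):
    """P(n|s) = integral of Poisson(n; s+b) * pi_b(b) db, over b >= 0."""
    def integrand(b):
        return poisson.pmf(n, s + b) * norm.pdf(b, b_meas, sigma_b)
    norm_b = norm.sf(0.0, b_meas, sigma_b)          # truncate prior at b = 0
    return quad(integrand, 0.0, b_meas + 8 * sigma_b)[0] / norm_b

def upper_limit_condition(s):
    # classical recipe: P(n <= n_obs | s_up) = 1 - CL
    return sum(prob_n_given_s(k, s) for k in range(n_obs + 1)) - (1.0 - cl)

s_up = brentq(upper_limit_condition, 0.0, 50.0)
print(f"Cousins-Highland upper limit s_up = {s_up:.2f}")
```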
56 Integrated likelihoods
Consider again signal s and background b, and suppose we have uncertainty in b characterized by a prior pdf π_b(b). Define the integrated likelihood as
L(n | s) = ∫ L(n | s, b) π_b(b) db,
also called the modified profile likelihood; in any case it is not a real likelihood.
Now use this to construct a likelihood ratio test and invert to obtain confidence intervals. Feldman-Cousins + Cousins-Highland (FHC2), see e.g. J. Conrad et al., Phys. Rev. D 67 (2003) 012002 and the Conrad/Tegenfeldt PHYSTAT05 talk. Calculators are available (Conrad, Tegenfeldt, Barlow).
57 Interval from inverting the profile LR test
Suppose we have a measurement b_meas of b. Build the likelihood ratio test with the profile likelihood,
λ_p(s) = L(s, b̂_s) / L(ŝ, b̂),
where b̂_s is the value of b that maximizes L for the given s, and (ŝ, b̂) is the global maximum. Use this to construct confidence intervals.
See the PHYSTAT05 talks by Cranmer, Feldman, Cousins, Reid.
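A sketch of a profile-likelihood interval for s with b constrained by an auxiliary measurement b_meas ± σ_b. For brevity it uses the asymptotic χ² approximation (−2 ln λ_p ≤ 1 for roughly 68% CL) rather than the full Neyman construction with toy experiments described in the talks cited above; all numerical inputs are illustrative.

```python
# Sketch: profile-likelihood-ratio interval for s, with b constrained by
# an auxiliary Gaussian measurement b_meas +- sigma_b.  Uses the
# asymptotic chi-square approximation instead of toy experiments.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

n_obs, b_meas, sigma_b = 10, 3.0, 1.0

def nll(s, b):            # -ln L(s, b) up to constants
    mu = s + b
    return mu - n_obs * np.log(mu) + 0.5 * ((b - b_meas) / sigma_b) ** 2

# global minimum over (s, b)
fit = minimize(lambda p: nll(p[0], p[1]), x0=[n_obs - b_meas, b_meas],
               bounds=[(0.0, None), (1e-6, None)])
nll_min = fit.fun

def q(s):                  # -2 ln lambda_p(s), profiling b at fixed s
    prof = minimize_scalar(lambda b: nll(s, b), bounds=(1e-6, 50.0), method="bounded")
    return 2.0 * (prof.fun - nll_min)

s_scan = np.linspace(0.0, 25.0, 501)
inside = s_scan[np.array([q(s) for s in s_scan]) <= 1.0]
print(f"approx. 68% CL interval: [{inside.min():.2f}, {inside.max():.2f}]")
```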
58 Wrapping up...
Some areas to think about:
Look at Bayesian methods, especially for systematics.
Multivariate methods (→ machine learning).
Practical concerns: conventions(?), ease of use, etc.
Tools: ROOT, R, ...? (Much room for discussion here.)
Hopefully Nature will be kind and we won't be so worried about setting limits as we were at LEP... :-)
59 Extra slides
60 Some statistics books, papers, etc.
G. Cowan, Statistical Data Analysis, Clarendon Press, Oxford, 1998; see also www.pp.rhul.ac.uk/cowan/sda
R.J. Barlow, Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, Wiley, 1989; see also hepwww.ph.man.ac.uk/roger/book.html
L. Lyons, Statistics for Nuclear and Particle Physics, CUP, 1986
W. Eadie et al., Statistical and Computational Methods in Experimental Physics, North-Holland, 1971 (2nd ed. imminent)
S. Brandt, Statistical and Computational Methods in Data Analysis, Springer, New York, 1998 (with program library on CD)
W.M. Yao et al. (Particle Data Group), Review of Particle Physics, Journal of Physics G 33 (2006) 1; see also pdg.lbl.gov, sections on probability, statistics, Monte Carlo
61 Bayes' theorem
From the definition of conditional probability we have
P(A | B) = P(A ∩ B) / P(B)  and  P(B | A) = P(A ∩ B) / P(A),
but P(A ∩ B) = P(B | A) P(A), so
P(A | B) = P(B | A) P(A) / P(B)   (Bayes' theorem).
First published (posthumously) by the Reverend Thomas Bayes (1702-1761):
An essay towards solving a problem in the doctrine of chances, Philos. Trans. R. Soc. 53 (1763) 370; reprinted in Biometrika 45 (1958) 293.
62 Digression: marginalization with MCMC
Bayesian computations involve integrals like the marginalization above, often of high dimensionality and impossible in closed form, and also impossible with normal acceptance-rejection Monte Carlo.
Markov Chain Monte Carlo (MCMC) has revolutionized Bayesian computation. (Google for MCMC, Metropolis, Bayesian computation, ...)
MCMC generates a correlated sequence of random numbers: it cannot be used for many applications, e.g., detector MC, and the effective statistical error is greater than √n.
Basic idea: sample the full multidimensional parameter space, then look, e.g., only at the distribution of the parameters of interest.
63 MCMC basics: Metropolis-Hastings algorithm
Goal: given an n-dimensional pdf p(θ), generate a sequence of points θ1, θ2, θ3, ...
Use a proposal density q(θ; θ0), e.g., a Gaussian centred about the current point θ0.
1) Start at some point θ0.
2) Generate a proposed point θ from q(θ; θ0).
3) Form the Hastings test ratio α = min[ 1, p(θ) q(θ0; θ) / ( p(θ0) q(θ; θ0) ) ].
4) Generate u uniformly in [0, 1].
5) If u ≤ α, move to the proposed point (θ1 = θ); else the old point is repeated (θ1 = θ0).
6) Iterate.
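A minimal Metropolis sampler following these steps, with a symmetric Gaussian proposal (so the test ratio reduces to p(θ)/p(θ0), as noted on the next slide); the 2-d Gaussian target is an arbitrary stand-in for a posterior, not from the slides.

```python
# Minimal Metropolis sampler following the steps above, with a symmetric
# Gaussian proposal density.  The target pdf is an arbitrary 2-d example.
import numpy as np

def log_target(theta):
    # example target: correlated 2-d Gaussian (stand-in for a posterior)
    x, y = theta
    return -0.5 * (x**2 + y**2 - 1.6 * x * y) / (1 - 0.8**2)

rng = np.random.default_rng(3)
theta = np.zeros(2)                      # 1) start at some point
step = 0.5
chain = []

for _ in range(20000):
    proposal = theta + step * rng.normal(size=2)          # 2) generate proposal
    log_alpha = log_target(proposal) - log_target(theta)  # 3) test ratio (symmetric q)
    if np.log(rng.uniform()) < log_alpha:                 # 4)-5) accept, or repeat old point
        theta = proposal
    chain.append(theta)                                   # 6) iterate

chain = np.array(chain)
print("posterior means:", chain.mean(axis=0))             # marginalize: 1-d projections
```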
64 Metropolis-Hastings (continued)
This rule produces a correlated sequence of points (note how each new point depends on the previous one). For our purposes this correlation is not fatal, but the statistical errors are larger than the naive √n.
The proposal density can be (almost) anything, but choose it so as to minimize the autocorrelation. Often take the proposal density symmetric, q(θ; θ0) = q(θ0; θ); the test ratio is then (Metropolis-Hastings)
α = min[ 1, p(θ) / p(θ0) ].
I.e. if the proposed step is to a point of higher p(θ), take it; if not, only take the step with probability p(θ)/p(θ0). If the proposed step is rejected, hop in place.
65 Metropolis-Hastings caveats
Actually one can only prove that the sequence of points follows the desired pdf in the limit where it runs forever.
There may be a "burn-in" period where the sequence does not initially follow p(θ).
Unfortunately there are few useful theorems to tell us when the sequence has converged.
Look at trace plots and the autocorrelation. Check the result with a different proposal density. If you think it's converged, try it again with 10 times more points.