Title: Probability

Slide 1: Lecture 2
- Probability
- and what it has to do with data analysis
Slide 2: Abstraction
- Random variable, x
- it has no set value until you realize it
- its properties are described by a probability distribution, P
Slide 3: One way to think about it
A pot of an infinite number of x's, distributed according to p(x).
[Figure: a pot of x's, with distribution p(x)]
Drawing one x from the pot realizes x.
Slide 4: Describing P
- If x can take on only discrete values, say (1, 2, 3, 4, or 5), then a table would work:

x  1    2    3    4    5
P  10%  30%  40%  15%  5%

(e.g., a 40% probability that x = 3, a 15% probability that x = 4)

Probabilities should sum to 100%.
Slide 5:
- Sometimes you see probabilities written as fractions instead of percentages.

Probabilities should sum to 1:

x  1     2     3     4     5
P  0.10  0.30  0.40  0.15  0.05

(a 0.15 probability that x = 4)

And sometimes you see probabilities plotted as a histogram.
[Figure: histogram of P(x) versus x for x = 1, ..., 5, y-axis from 0.0 to 0.5; the bar at x = 4 has height 0.15]
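The "pot" picture from an earlier slide can be sketched in code: draw many realizations from the table above and check that the empirical frequencies approach P(x). This is an illustrative sketch, not part of the lecture.

```python
# Sample many realizations of x from the discrete table and compare
# empirical frequencies to the tabulated probabilities P(x).
import random

random.seed(0)
values = [1, 2, 3, 4, 5]
probs = [0.10, 0.30, 0.40, 0.15, 0.05]
assert abs(sum(probs) - 1.0) < 1e-12  # probabilities sum to 1

draws = random.choices(values, weights=probs, k=100_000)
freq = {v: draws.count(v) / len(draws) for v in values}
for v, p in zip(values, probs):
    print(v, round(freq[v], 3), p)  # empirical frequency is close to P(x)
```

With 100,000 draws the empirical frequencies match the table to within about a percent.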
Slide 6:
- If x can take on any value, then use a smooth function (or distribution), p(x), instead of a table.
[Figure: p(x) versus x, with the area under the curve between x1 and x2 shaded]
The probability that x is between x1 and x2 is proportional to this area; mathematically,

P(x1 < x < x2) = ∫_{x1}^{x2} p(x) dx
Slide 7:
[Figure: p(x) versus x, with the entire area under the curve shaded]
The probability that x is between -∞ and +∞ is 100%, so the total area is 1. Mathematically,

∫_{-∞}^{+∞} p(x) dx = 1
Slide 8: One reason why all this is relevant
- Any measurement of data that contains noise is treated as a random variable, d - and
Slide 9:
- The distribution p(d) embodies both the true value of the datum being measured and the measurement noise - and
Slide 10:
- All quantities derived from a random variable are themselves random variables - so

Slide 11:
- The algebra of random variables allows you to understand how measurement noise affects inferences made from the data
Slide 12: Basic description of distributions
We want two basic numbers:
1) something that describes what x's commonly occur
2) something that describes the variability of the x's
Slide 13:
1) something that describes what x's commonly occur - that is, where the distribution is centered
Slide 14: Mode
The x at which the distribution has its peak; the most-likely value of x.
[Figure: p(x) versus x, with the peak marked at x_mode]
Slide 15:
- The most popular car in the US is the Honda CR-V
- But the next car you see on the highway will probably not be a Honda CR-V
[Figure: highway photo - "Where's a CR-V?"]
Slide 16: But modes can be deceptive
100 realizations of x, binned by range:

range  0-1  1-2  2-3  3-4  4-5  5-6  6-7  7-8  8-9  9-10
count   3   18   11    8   11   14    8    7   11    9

Sure, the 1-2 range has the most counts, but most of the measurements are bigger than 2!
[Figure: p(x) versus x from 0 to 10, with the peak at x_mode in the 1-2 range]
Slide 17: Median
50% chance that x is smaller than x_median; 50% chance that x is bigger than x_median.
No special reason the median needs to coincide with the peak.
[Figure: p(x) versus x, with the area split into two 50% halves at x_median]
Slide 18: Expected value (or mean)
The value you would get if you took the mean of lots of realizations of x.
Let's examine a discrete distribution, for simplicity ...
[Figure: bar chart of P(x) versus x for x = 1, 2, 3]
Slide 19: Hypothetical table of 140 realizations of x
Suppose x = 1 occurs 20 times, x = 2 occurs 80 times, and x = 3 occurs 40 times. Then

mean = [20×1 + 80×2 + 40×3] / 140
     = (20/140)×1 + (80/140)×2 + (40/140)×3
     = P(1)×1 + P(2)×2 + P(3)×3
     = Σ_i P(x_i) x_i
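The identity above can be checked numerically: the mean computed directly from the 140 realizations equals the probability-weighted sum Σ_i P(x_i) x_i.

```python
# Check the slide's identity using its counts: 20 ones, 80 twos, 40 threes.
counts = {1: 20, 2: 80, 3: 40}
n = sum(counts.values())  # 140 realizations

# Mean computed directly from the realizations
mean_direct = sum(x * c for x, c in counts.items()) / n

# Mean computed as sum_i P(x_i) * x_i
p = {x: c / n for x, c in counts.items()}
mean_weighted = sum(p[x] * x for x in p)

print(round(mean_direct, 6), round(mean_weighted, 6))  # both 2.142857
assert abs(mean_direct - mean_weighted) < 1e-12
```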
Slide 20: By analogy, for a smooth distribution
- Expected (or mean) value of x:

E(x) = ∫_{-∞}^{+∞} x p(x) dx
Slide 21:
2) something that describes the variability of the x's - that is, the width of the distribution
Slide 22: Here's a perfectly sensible way to define the width of a distribution
[Figure: p(x) versus x; the central interval of width W50 contains 50% of the probability, with 25% in each tail]
It's not used much, though.
Slide 23: Width of a distribution - here's another way
Take the parabola (x - E(x))², then multiply by p(x) and integrate.
[Figure: the parabola (x - E(x))² and p(x) plotted versus x, both centered on E(x)]

Slide 24:
The idea is that if the distribution is narrow, then most of the probability lines up with the low spot of the parabola. But if it is wide, then some of the probability lines up with the high parts of the parabola.
[Figure: (x - E(x))² p(x) versus x - compute this total area]

Variance: σ² = ∫_{-∞}^{+∞} (x - E(x))² p(x) dx
Slide 25: σ = √variance, a measure of width
[Figure: p(x) versus x, with an interval of width σ marked around E(x)]
We don't immediately know its relationship to area, though.
Slide 26: The Gaussian (or normal) distribution

p(x) = [1 / (√(2π) σ)] exp( -(x - x̄)² / (2σ²) )

- σ² is the variance
- x̄ is the expected value
Memorize me!
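A quick numerical sanity check of this formula: the pdf integrates to 1, and (previewing a later slide) about 95% of the probability lies within 2σ of the mean. This sketch uses simple trapezoid sums rather than any special library.

```python
# Evaluate the normal pdf and integrate it numerically by the trapezoid rule.
import math

def normal_pdf(x, mean, sigma):
    return math.exp(-(x - mean)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def integrate(f, a, b, n=100_000):
    h = (b - a) / n
    return h * (f(a) / 2 + f(b) / 2 + sum(f(a + i * h) for i in range(1, n)))

mean, sigma = 3.0, 0.5   # the second example on the next slide
total = integrate(lambda x: normal_pdf(x, mean, sigma), mean - 10 * sigma, mean + 10 * sigma)
within_2sigma = integrate(lambda x: normal_pdf(x, mean, sigma), mean - 2 * sigma, mean + 2 * sigma)
print(round(total, 4), round(within_2sigma, 4))  # ~1.0 and ~0.9545
```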
Slide 27: Examples of normal distributions
[Figure: p(x) versus x for x̄ = 1, σ = 1, and for x̄ = 3, σ = 0.5]
Slide 28: Properties of the normal distribution
- Expectation = Median = Mode = x̄
- 95% of the probability lies within 2σ of the expected value
[Figure: p(x) versus x, with the central 95% of the area shaded]
Slide 29: Again, why all this is relevant
- Inference depends on data
- You use a measurement, d, to deduce the value of some underlying parameter of interest, m
- e.g., use measurements of travel time, d, to deduce the seismic velocity, m, of the earth
Slide 30:
- The model parameter, m, depends on the measurement, d
- so m is a function of d: m(d)
- so ...
Slide 31:
- If the data, d, is a random variable
- then so is the model parameter, m
- All inferences made from uncertain data are themselves uncertain
- Model parameters are described by a distribution, p(m)
Slide 32: Functions of a random variable
Any function of a random variable is itself a random variable.
Slide 33: Special case of a linear relationship and a normal distribution
- Normal p(d) with mean d̄ and variance σ²_d
- Linear relationship: m = a d + b
- Then p(m) is normal with mean a d̄ + b and variance a² σ²_d
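This rule can be verified by Monte Carlo: draw many realizations of d, map each through m = a d + b, and compare the sample mean and variance of m to the predicted values. The numbers (d̄ = 5, σ_d = 2, a = 3, b = 1) are made up for illustration.

```python
# Monte Carlo check: if d ~ Normal(dbar, sigma_d^2) and m = a*d + b,
# then m has mean a*dbar + b and variance a^2 * sigma_d^2.
import random

random.seed(1)
dbar, sigma_d = 5.0, 2.0
a, b = 3.0, 1.0

m = [a * random.gauss(dbar, sigma_d) + b for _ in range(200_000)]
mean_m = sum(m) / len(m)
var_m = sum((v - mean_m)**2 for v in m) / len(m)

print(round(mean_m, 2), round(var_m, 1))  # ~ a*dbar + b = 16, a^2*sigma_d^2 = 36
```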
Slide 34: Multivariate distributions
Slide 35: Example
- Liberty Island is inhabited by both pigeons and seagulls
- 40% of the birds are pigeons, and 60% of the birds are gulls
- 50% of pigeons are white and 50% are grey
- 100% of gulls are white
Slide 36: Two variables
- species, s, takes two values: pigeon (p) and gull (g)
- color, c, takes two values: white (w) and tan (t)

Of 100 birds: 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls.
Slide 37: What is the probability that a bird (a random bird, that is) has species s and color c?

         c = w   c = t
s = p     20%     20%
s = g     60%      0%

Note: the sum of all boxes is 100%.
Slide 38: This is called the Joint Probability, and is written P(s,c)
Slide 39: Two continuous variables, say x1 and x2, have a joint probability distribution, written p(x1, x2), with

∫∫ p(x1, x2) dx1 dx2 = 1
Slide 40: You would contour a joint probability distribution
[Figure: contour plot of p(x1, x2) in the (x1, x2) plane]
Slide 41: What is the probability that a bird has color c?
Of 100 birds: 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls.

Start with P(s,c):

         c = w   c = t
s = p     20%     20%
s = g     60%      0%

and sum the columns to get P(c):

P(w) = 80%,  P(t) = 20%
Slide 42: What is the probability that a bird has species s?
Of 100 birds: 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls.

Start with P(s,c):

         c = w   c = t
s = p     20%     20%
s = g     60%      0%

and sum the rows to get P(s):

P(p) = 40%,  P(g) = 60%
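The two marginalizations on these slides can be sketched in a few lines: summing the columns of the joint table P(s,c) gives P(c), and summing the rows gives P(s).

```python
# Marginalize the joint table P(s,c) from the bird example (as fractions).
joint = {
    ("p", "w"): 0.20, ("p", "t"): 0.20,
    ("g", "w"): 0.60, ("g", "t"): 0.00,
}
species = ["p", "g"]
colors = ["w", "t"]

P_c = {c: sum(joint[(s, c)] for s in species) for c in colors}  # sum columns
P_s = {s: sum(joint[(s, c)] for c in colors) for s in species}  # sum rows

print({c: round(v, 3) for c, v in P_c.items()})  # {'w': 0.8, 't': 0.2}
print({s: round(v, 3) for s, v in P_s.items()})  # {'p': 0.4, 'g': 0.6}
```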
Slide 43: These operations make sense with distributions, too
[Figure: contour plot of p(x1, x2), with its two marginal curves p(x1) and p(x2)]

p(x1) = ∫ p(x1, x2) dx2   (distribution of x1, irrespective of x2)
p(x2) = ∫ p(x1, x2) dx1   (distribution of x2, irrespective of x1)
Slide 44: Given that a bird is species s, what is the probability that it has color c?
Of 100 birds: 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls.

         c = w   c = t
s = p     50%     50%
s = g    100%      0%

Note: all rows sum to 100%.
Slide 45: This is called the Conditional Probability of c given s, and is written P(c|s). Similarly ...
Slide 46: Given that a bird is color c, what is the probability that it has species s?
Of 100 birds: 20 are white pigeons, 20 are grey pigeons, 60 are white gulls, 0 are grey gulls. So 25% of white birds are pigeons.

         c = w   c = t
s = p     25%    100%
s = g     75%      0%

Note: all columns sum to 100%.
Slide 47: This is called the Conditional Probability of s given c, and is written P(s|c)
Slide 48: Beware! P(c|s) ≠ P(s|c)

P(c|s):
         c = w   c = t
s = p     50%     50%
s = g    100%      0%

P(s|c):
         c = w   c = t
s = p     25%    100%
s = g     75%      0%
Slide 49: Actor Patrick Swayze, pancreatic cancer victim
A lot of errors occur from confusing the two.
Probability that, if you have pancreatic cancer, you will die from it: 90%
Probability that, if you die, you will have died of pancreatic cancer: 1.4%
Slide 50: Note
25% of 80% is 20%: P(s=p|c=w) × P(c=w) = P(s=p, c=w)

Slide 51: and
50% of 40% is 20%: P(c=w|s=p) × P(s=p) = P(s=p, c=w)
Slide 52: and if

P(s,c) = P(s|c) P(c) = P(c|s) P(s)

then

P(s|c) = P(c|s) P(s) / P(c)   and   P(c|s) = P(s|c) P(c) / P(s)

which is called Bayes' Theorem.
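Bayes' Theorem can be exercised on the bird example: starting from P(c|s) and P(s), recover P(s|c). Given a white bird, the probability it is a pigeon should come out to 25%, matching the earlier table.

```python
# Bayes' Theorem on the bird example: P(s|c) = P(c|s) P(s) / P(c).
P_s = {"p": 0.40, "g": 0.60}                     # species probabilities
P_c_given_s = {("w", "p"): 0.50, ("t", "p"): 0.50,
               ("w", "g"): 1.00, ("t", "g"): 0.00}

# P(c) by total probability: P(c) = sum_s P(c|s) P(s)
P_c = {c: sum(P_c_given_s[(c, s)] * P_s[s] for s in P_s) for c in ["w", "t"]}

def bayes(s, c):
    return P_c_given_s[(c, s)] * P_s[s] / P_c[c]

print(round(bayes("p", "w"), 3))  # 0.25: 25% of white birds are pigeons
```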
Slide 53: In this example
- Bird color is the observable, the "data"; bird species is the "model parameter"
- P(c|s), color given species, or P(d|m), is making a prediction based on the model: given a pigeon, what is the probability that it's grey?
- P(s|c), species given color, or P(m|d), is making an inference from the data: given a grey bird, what is the probability that it's a pigeon?
Slide 54: Why Bayes' Theorem is important
It provides a framework for relating making a prediction from the model, P(d|m), to making an inference from the data, P(m|d).
Slide 55: Bayes' Theorem also implies that the joint distribution of data and model parameters, p(d, m), is the fundamental quantity. If you know p(d, m), you know everything there is to know.
Slide 56:
- Expectation
- Variance
- and
- Covariance
- of a multivariate distribution
Slide 57: The expectation is computed by first reducing the distribution to one dimension
[Figure: contour plot of p(x1, x2) with its marginals; take the expectation of p(x1) to get x̄1, and the expectation of p(x2) to get x̄2]
Slide 58: The variance is also computed by first reducing the distribution to one dimension
[Figure: contour plot of p(x1, x2) with its marginals; take the variance of p(x1) to get σ1², and the variance of p(x2) to get σ2²]
Slide 59: Note that in this distribution, if x1 is bigger than x̄1, then x2 tends to be bigger than x̄2; and if x1 is smaller than x̄1, then x2 tends to be smaller than x̄2.
This is a positive correlation.
[Figure: elongated contours of p(x1, x2) tilted upward, with the expected value (x̄1, x̄2) marked]
Slide 60: Conversely, in this distribution, if x1 is bigger than x̄1, then x2 tends to be smaller than x̄2; and if x1 is smaller than x̄1, then x2 tends to be bigger than x̄2.
This is a negative correlation.
[Figure: elongated contours of p(x1, x2) tilted downward, with the expected value (x̄1, x̄2) marked]
Slide 61: This correlation can be quantified by multiplying the distribution by a four-quadrant function
[Figure: the (x1, x2) plane divided into four quadrants about (x̄1, x̄2), alternating in sign]
and then integrating. The function (x1 - x̄1)(x2 - x̄2) works fine:

C = ∫∫ (x1 - x̄1)(x2 - x̄2) p(x1, x2) dx1 dx2

This is called the covariance.
Slide 62: Note that the matrix C with elements

C_ij = ∫∫ (x_i - x̄_i)(x_j - x̄_j) p(x_i, x_j) dx_i dx_j

has diagonal elements σ_xi², the variance of x_i, and off-diagonal elements cov(x_i, x_j), the covariance of x_i and x_j:

      [ σ1²         cov(x1,x2)  cov(x1,x3) ]
C  =  [ cov(x1,x2)  σ2²         cov(x2,x3) ]
      [ cov(x1,x3)  cov(x2,x3)  σ3²        ]
Slide 63: The vector of means of a multivariate distribution, x̄, and the covariance matrix of a multivariate distribution, Cx, summarize a lot - but not everything - about a multivariate distribution.
Slide 64: Functions of a set of random variables
A set of N random variables collected in a vector, x.
Slide 65: Special case
- Linear function: y = Mx
- The expectation of y is

ȳ = M x̄

Memorize!
Slide 66: So

Cy = M Cx M^T

Memorize!
Slide 67: Note that these rules work regardless of the distribution of x.
If y is linearly related to x, y = Mx, then

ȳ = M x̄         (rule for means)
Cy = M Cx M^T    (rule for propagating error)

Memorize!
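Both rules can be verified by Monte Carlo for a simple special case: a 2-vector x with independent normal components (so Cx is diagonal) mapped through y = Mx. The particular numbers (x̄ = [1, 2], σ = [0.5, 1.5], and M below) are made up for illustration.

```python
# Monte Carlo check of ybar = M xbar and Cy = M Cx M^T for y = Mx.
import random

random.seed(2)
xbar = [1.0, 2.0]
sigma = [0.5, 1.5]        # independent components, so Cx = diag(0.25, 2.25)
M = [[2.0, 1.0],
     [0.0, 3.0]]

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]

# Draw realizations of x and map each through y = Mx
ys = []
for _ in range(200_000):
    x = [random.gauss(xbar[i], sigma[i]) for i in range(2)]
    ys.append(matvec(M, x))

ybar = [sum(y[i] for y in ys) / len(ys) for i in range(2)]
Cy = [[sum((y[i] - ybar[i]) * (y[j] - ybar[j]) for y in ys) / len(ys)
       for j in range(2)] for i in range(2)]

print([round(v, 2) for v in ybar])               # ~ M xbar = [4.0, 6.0]
print([[round(c, 2) for c in row] for row in Cy])
# ~ M Cx M^T = [[3.25, 6.75], [6.75, 20.25]]
```

Working out M Cx M^T by hand gives the same numbers: for example Var(y1) = Var(2 x1 + x2) = 4(0.25) + 2.25 = 3.25.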