Title: A segmentation algorithm for copy number data
1. A segmentation algorithm for copy number data.
- Blissfully Short (we promise!)
- Archisman Rudra and Raoul-Sam Daruwala
2. Biological Significance
- Genomes in a population are polymorphic, giving rise to diversity and variation.
- Parts of the genome are deleted (hemi- or homozygously), with a decrease in copy number, or amplified, with an increase in copy number.
- We assume there is a normal copy number value, which represents regions with no deletions or amplifications.
3. A Bayesian Approach.
- Priors: deletion + amplification
- Data: priors + noise
- Goal: find the most plausible hypothesis of regional changes and their associated copy numbers.
4. Prior Structure (I)
- The prior is a probability distribution over the structure described below.
- Given N probes, we assume the data is divided into k sub-intervals I_j(µ_j, i).
- Each sub-interval has its own mean value, µ_j, and i is the index of the last probe in the interval.
- µ_j may be the normal mean value (the global mean).
5. Prior Structure (II)
- The prior depends on two parameters, p_e and p_b.
- p_e is the probability of a particular probe being normal.
- p_b is the average number of intervals per unit length.
6. Prior Structure (III)
We take (equation not preserved in this transcript) as the prior distribution, where global is the number of probes with the global mean value and local is the number of remaining probes.
The data is modeled by adding independent Gaussian noise to this prior structure. In each interval I_j, the data is modeled as (equation not preserved in this transcript).
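The generative picture on this slide (piecewise-constant means plus independent Gaussian noise) can be illustrated with a short simulation. This is a minimal sketch, not the authors' code: the function name `simulate_probes`, the 0-based indexing, and the single noise level `sigma` shared by all intervals are assumptions made for illustration.

```python
import random

def simulate_probes(segments, sigma, seed=0):
    """Generate noisy probe values from a piecewise-constant mean profile.

    segments: list of (mu_j, last_index) pairs, mirroring I_j(mu_j, i);
              last_index is the 0-based index of the last probe in the interval.
    sigma:    standard deviation of the Gaussian noise
              (assumed here to be the same in every interval).
    """
    rng = random.Random(seed)
    values = []
    start = 0
    for mu, last in segments:
        # Every probe in the interval shares the mean mu and gets its own noise draw.
        for _ in range(start, last + 1):
            values.append(rng.gauss(mu, sigma))
        start = last + 1
    return values

# Ten probes: normal (mean 0), a deleted region (mean -1), normal again.
probes = simulate_probes([(0.0, 4), (-1.0, 7), (0.0, 9)], sigma=0.2)
```

Setting `sigma=0.0` returns the underlying mean profile itself, which is convenient for checking a segmentation routine on noise-free input.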
7. Likelihood Function
- The µ values of non-global probes are unknown.
- We estimate these µ values using the sample mean for that interval.
- Our Bayesian solution maximizes L to yield the optimal segmentation.
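To make the role of the sample mean concrete, the sketch below scores a candidate segmentation. It is an illustration under stated assumptions, not the formula from the slides: the prior is simplified to a fixed log-penalty log(p_b) per interval plus log(p_e) per global probe and log(1 - p_e) per remaining probe, the noise level `sigma` is taken as known, and constants that do not depend on the segmentation are dropped. The name `log_score` and the `(last_index, is_global)` representation are hypothetical.

```python
import math

def log_score(values, segments, p_e, p_b, sigma, global_mean=0.0):
    """Log-score of a segmentation under the simplified prior described above.

    segments: list of (last_index, is_global) pairs with 0-based last_index.
    """
    score = len(segments) * math.log(p_b)  # assumed per-interval penalty
    start = 0
    for last, is_global in segments:
        chunk = values[start:last + 1]
        if is_global:
            mu = global_mean
            score += len(chunk) * math.log(p_e)
        else:
            # The unknown mean of a non-global interval is estimated by the sample mean.
            mu = sum(chunk) / len(chunk)
            score += len(chunk) * math.log(1.0 - p_e)
        # Gaussian noise term: squared residuals about the interval mean.
        score -= sum((x - mu) ** 2 for x in chunk) / (2.0 * sigma ** 2)
        start = last + 1
    return score

values = [0.0] * 4 + [-1.0] * 4
one_interval = [(7, True)]
two_intervals = [(3, True), (7, False)]
better = log_score(values, two_intervals, 0.55, 0.01, 0.2) > \
         log_score(values, one_interval, 0.55, 0.01, 0.2)
```

In this toy input the two-interval hypothesis scores higher because the residual cost of forcing the deleted probes onto the global mean outweighs the penalty for one extra interval.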
8. A dynamic programming algorithm.
- Extension: adds a new interval to the end of a segmentation.
- The likelihood function can be computed incrementally.
9. Dynamic Programming (II)
Let Opt_i be the optimal segmentation of the first i probes. Let WS_i be the working set for computing Opt_i.
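The extension idea can be sketched as a standard dynamic program over prefixes. This is an illustrative sketch, not the authors' implementation: it assumes an additive score (a log(p_b) penalty per interval, log(p_e) per global probe, log(1 - p_e) per other probe, and a known noise level `sigma`), and it uses every shorter prefix optimum as the working set for Opt_i, which gives quadratic running time. The slides do not say how WS_i is pruned, so no pruning is attempted.

```python
import math

def segment(values, p_e, p_b, sigma, global_mean=0.0):
    """Return (segments, score); segments is a list of (last_index, is_global)."""
    n = len(values)
    # Prefix sums give the residual of any interval in O(1), so extending a
    # segmentation by one interval is an incremental update of the score.
    s1, s2 = [0.0], [0.0]
    for x in values:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def interval_score(a, b):
        """Best score of a single interval over probes a..b-1 (global or local)."""
        m = b - a
        sx, sxx = s1[b] - s1[a], s2[b] - s2[a]
        rss_global = sxx - 2.0 * global_mean * sx + m * global_mean ** 2
        rss_local = sxx - sx * sx / m  # residual about the sample mean
        glob = m * math.log(p_e) - rss_global / (2.0 * sigma ** 2)
        loc = m * math.log(1.0 - p_e) - rss_local / (2.0 * sigma ** 2)
        return (glob, True) if glob >= loc else (loc, False)

    best = [0.0] + [-math.inf] * n  # best[i] is the score of Opt_i
    back = [None] * (n + 1)         # back[i] = (start of last interval, is_global)
    for i in range(1, n + 1):
        for a in range(i):          # extend Opt_a by one interval covering a..i-1
            sc, is_global = interval_score(a, i)
            cand = best[a] + sc + math.log(p_b)
            if cand > best[i]:
                best[i], back[i] = cand, (a, is_global)

    # Walk the back-pointers to recover the intervals of Opt_n.
    segments, i = [], n
    while i > 0:
        a, is_global = back[i]
        segments.append((i - 1, is_global))
        i = a
    segments.reverse()
    return segments, best[n]

profile = [0.0] * 10 + [-1.0] * 10 + [0.0] * 10
segs, score = segment(profile, p_e=0.55, p_b=0.01, sigma=0.2)
```

Under this simplified score, a larger `p_e` makes global intervals cheaper and a larger `p_b` makes additional intervals cheaper.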
10. A reasonable choice of priors yields good segmentation.
11. Raising the value of p_e causes more points to be classified as part of a normal interval.
12. A reasonable choice of priors yields good segmentation.
13. Raising p_b causes the points that are not normal to be segmented very aggressively.
14. Selection of Priors
- The choice of p_e and p_b is critical in obtaining a reasonable result.
- Two kinds of errors:
- Goodness of fit: too few segments.
- Over-fitting: more segments than necessary.
- We either minimize the maximum likelihood value or test for over-fitting using a standard statistical test (the F-test).
15. Prior Selection: Minimax Criterion
- Choose values of p_e and p_b which minimize the maximum likelihood value.
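Read literally, the minimax rule is a search over candidate (p_e, p_b) pairs for the pair whose best attainable likelihood is smallest. The sketch below shows that outer loop only. The callable `best_score`, which should return the maximized likelihood for a given pair, and the grid values are hypothetical placeholders, since the slides do not state how the search was carried out.

```python
from itertools import product

def minimax_priors(best_score, pe_grid, pb_grid):
    """Return the (p_e, p_b) pair on the grid that minimizes best_score(p_e, p_b).

    best_score is assumed to return the maximum likelihood value reached by the
    optimal segmentation for the given priors.
    """
    return min(product(pe_grid, pb_grid), key=lambda pair: best_score(*pair))

# Toy stand-in for the maximized likelihood, used only to exercise the search.
toy_score = lambda p_e, p_b: (p_e - 0.5) ** 2 + (p_b - 0.01) ** 2
chosen = minimax_priors(toy_score, [0.3, 0.5, 0.7], [0.001, 0.01, 0.1])
```

A grid search is the simplest choice here; any other optimizer over (p_e, p_b) would fit the same interface.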
16. Chromosome 8
(p_e, p_b) max at (0.55, 0.01)
17. Prior Selection: F Criterion
- We want to ensure that, for every break introduced, we are not over-fitting the data.
- For each break we have a T² statistic and the appropriate tail probability (p-value) calculated from the distribution of the statistic. In this case, this is an F distribution.
- For the whole segmentation we take the minimum p-value over all breaks.
- The best (p_e, p_b) is the one that leads to the maximum min p-value.
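A per-break test of this kind can be sketched as follows. For one-dimensional data the two-sample T² statistic is the squared pooled-variance t statistic, which follows an F distribution with (1, n1 + n2 - 2) degrees of freedom when the two means are equal. Comparing the two intervals on either side of a break with a pooled variance is an assumption of this sketch; the slides do not spell out which samples enter the statistic. The function names are hypothetical, and the tail probability uses the closed-form series for Student's t with integer degrees of freedom so that only the standard library is needed.

```python
import math

def f1_tail(f_stat, dof):
    """Upper-tail probability of F(1, dof), i.e. the two-sided Student-t tail."""
    theta = math.atan(math.sqrt(f_stat / dof))
    s, c = math.sin(theta), math.cos(theta)
    if dof % 2 == 0:
        # sin(t) * [1 + (1/2)cos^2 + (1*3)/(2*4)cos^4 + ... + cos^(dof-2) term]
        term = total = 1.0
        for j in range(2, dof - 1, 2):
            term *= (j - 1) / j * c * c
            total += term
        inside = s * total
    else:
        # (2/pi) * {theta + sin(t) * [cos + (2/3)cos^3 + ... + cos^(dof-2) term]}
        total = 0.0
        if dof > 1:
            term = total = c
            for j in range(3, dof - 1, 2):
                term *= (j - 1) / j * c * c
                total += term
        inside = (2.0 / math.pi) * (theta + s * total)
    return 1.0 - inside

def break_p_value(left, right):
    """p-value for a break between two adjacent intervals.

    Needs at least three probes in total and a nonzero pooled variance.
    """
    n1, n2 = len(left), len(right)
    m1, m2 = sum(left) / n1, sum(right) / n2
    ss = sum((x - m1) ** 2 for x in left) + sum((x - m2) ** 2 for x in right)
    dof = n1 + n2 - 2
    t2 = (m1 - m2) ** 2 / ((ss / dof) * (1.0 / n1 + 1.0 / n2))
    return f1_tail(t2, dof)

def min_break_p_value(values, last_indices):
    """Minimum p-value over all breaks; last_indices are 0-based interval ends."""
    bounds = [0] + [i + 1 for i in last_indices]
    chunks = [values[a:b] for a, b in zip(bounds, bounds[1:])]
    return min(break_p_value(l, r) for l, r in zip(chunks, chunks[1:]))

data = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0]
p_break = break_p_value(data[:4], data[4:])
```

Because the tail is computed as one minus a series, very small p-values lose precision; a statistics library would be the better choice for production use.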
18. Chromosome 8
(p_e, p_b) max at (0.55, 0.01)