A segmentation algorithm for copy number data

Slides: 23

Provided by: raoulsam

1
A segmentation algorithm for copy number data.

Blissfully Short (we promise!)
Archisman Rudra and Raoul-Sam Daruwala

2
Biological Significance

Genomes in a population are polymorphic, giving
rise to diversity and variation.
Parts of the genome are deleted (hemi- or
homozygously), with a decrease in copy number, or
amplified, with an increase in copy number.
We assume there is a normal copy number value,
which represents regions with no deletions or
amplifications.

3
A Bayesian Approach.

Priors: Deletion, Amplification
Data: Priors + Noise
Goal: Find the most plausible hypothesis of
regional changes and their associated copy numbers.

4
Prior Structure (I)

The prior is a probability distribution over the
structure described below.
Given N probes, we assume the data is divided
into k sub-intervals I_j(µ_j, i_j).
Each sub-interval has its own mean value µ_j, and
i_j is the last probe index in the interval.
µ_j may be the normal mean value (the global mean).
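The structure above can be sketched as an ordered list of (mean, last-probe-index) pairs. This is only an illustration: the names `Interval`, `GLOBAL_MEAN`, `is_valid`, and `count_global` are hypothetical, the 0-based indexing is a choice made here, and the value 0.0 for the global mean is an assumption.

```python
from dataclasses import dataclass
from typing import List

GLOBAL_MEAN = 0.0  # assumed value of the normal (global) mean, e.g. a log-ratio of 0


@dataclass
class Interval:
    mu: float  # mean value of the interval (mu_j)
    last: int  # index of the last probe in the interval (i_j), 0-based here


def is_valid(intervals: List[Interval], n_probes: int) -> bool:
    """Check that the intervals tile probes 0..n_probes-1 in order."""
    if not intervals:
        return n_probes == 0
    lasts = [iv.last for iv in intervals]
    increasing = all(a < b for a, b in zip(lasts, lasts[1:]))
    return increasing and lasts[0] >= 0 and lasts[-1] == n_probes - 1


def count_global(intervals: List[Interval], tol: float = 1e-12) -> int:
    """Number of probes lying in intervals whose mean is the global mean."""
    total, start = 0, 0
    for iv in intervals:
        if abs(iv.mu - GLOBAL_MEAN) <= tol:
            total += iv.last - start + 1
        start = iv.last + 1
    return total


# Ten probes: normal, then a deleted stretch, then normal again.
seg = [Interval(GLOBAL_MEAN, 3), Interval(-0.8, 6), Interval(GLOBAL_MEAN, 9)]
```

Storing only the last index of each interval is enough, because every interval starts right after the previous one ends.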

5
Prior Structure (II)

The prior depends on two parameters, p_e and p_b.
p_e is the probability of a particular probe being
normal.
p_b is the average number of intervals per unit
length.

6
Prior Structure (III)

We define the prior distribution (its formula is
not preserved in this transcript) in terms of two
counts: global, the number of probes with the
global mean value, and local, the number of
remaining probes.
The data is modeled by adding independent
Gaussian noise to this prior structure: in each
interval I_j, each data point is the interval
mean µ_j plus noise.

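The slide's own equation for the data model is missing from this transcript. One form consistent with the surrounding description would be the following, where the notation is assumed here rather than recovered: x_i is the observed value at probe i, and a single noise variance sigma^2 shared by all probes is an assumption.

```latex
x_i = \mu_j + \varepsilon_i,
\qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{independently, for every probe } i \in I_j .
```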
7
Likelihood Function

The µ values of non-global probes are unknown.
We estimate each of them using the sample mean
of its interval.
Our Bayesian solution maximizes the likelihood
function L to yield the optimal segmentation.
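The sample-mean estimate can be illustrated with a small sketch of the Gaussian part of the score. It is not the authors' function L: the prior term is left out because its formula is not in the transcript, the noise level `sigma` is assumed known, and the name `gaussian_log_likelihood` and the (start, end, mu) segment encoding are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def gaussian_log_likelihood(
    x: Sequence[float],
    segments: Sequence[Tuple[int, int, Optional[float]]],
    sigma: float = 1.0,
) -> float:
    """Sum of Gaussian log-densities over all probes.

    Each segment is (start, end, mu) with `end` exclusive and at least one
    probe. A segment whose mu is None is non-global: its mean is unknown
    and is estimated by the sample mean of the segment.
    """
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for start, end, mu in segments:
        values = x[start:end]
        if mu is None:
            mu = sum(values) / len(values)  # sample-mean estimate of mu_j
        total += sum(const - (v - mu) ** 2 / (2.0 * sigma ** 2) for v in values)
    return total


x = [0.1, -0.1, 0.0, 1.1, 0.9, 1.0, 0.05, -0.05]
one_piece = [(0, 8, 0.0)]  # every probe at the global mean
three_pieces = [(0, 3, 0.0), (3, 6, None), (6, 8, 0.0)]  # middle stretch amplified
```

On this toy data the three-piece segmentation scores higher than the single global interval, since the middle probes sit far from the global mean. The likelihood alone always favors more intervals, which is why a prior on the segmentation is needed as a counterweight.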

8
A dynamic programming algorithm.

Extension: adds a new interval to the end.
The likelihood function can be computed
incrementally.

9
Dynamic Programming (II)
Let Opt_i be the optimal segmentation of the first
i probes. Let WS_i be the working set for
computing Opt_i.
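The extension idea can be sketched as a dynamic program in which the optimum for the first i probes is built by appending one final interval to the optimum for some shorter prefix. The score used below is an assumption standing in for the authors' likelihood: squared error around each interval's sample mean plus a fixed `penalty` per interval. The function name `segment` is hypothetical, and the option of pinning an interval to the global mean is omitted.

```python
from typing import List, Sequence, Tuple


def segment(x: Sequence[float], penalty: float) -> List[Tuple[int, int]]:
    """Return intervals (start, end), end exclusive, that minimize
    squared error around interval means + penalty * number of intervals."""
    n = len(x)
    # Prefix sums make the cost of any interval an O(1) computation,
    # which is what lets the score be updated incrementally.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(a: int, b: int) -> float:
        """Squared error of x[a:b] around its sample mean."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] + [float("inf")] * n  # best[i]: optimal score for the first i probes
    prev = [0] * (n + 1)               # prev[i]: start of the last interval in that optimum
    for i in range(1, n + 1):
        for j in range(i):             # extend the optimum for j probes with x[j:i]
            score = best[j] + cost(j, i) + penalty
            if score < best[i]:
                best[i], prev[i] = score, j

    intervals = []
    i = n
    while i > 0:                       # follow the back-pointers to recover the intervals
        intervals.append((prev[i], i))
        i = prev[i]
    return intervals[::-1]


x = [0.0, 0.1, -0.1, 2.0, 2.1, 1.9, 0.0, 0.05]
result = segment(x, penalty=0.5)
```

With this input and `penalty=0.5` the sketch returns the three intervals (0, 3), (3, 6), (6, 8). It tries every earlier position j for each i, so it runs in O(N^2) time; the working set WS_i presumably narrows that set of candidates, but the transcript does not say how.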
10
A reasonable choice of priors yields good
segmentation.
11
Raising the value of p_e causes more points to be
classified as part of a normal interval.
12
A reasonable choice of priors yields good
segmentation.
13
Raising p_b causes the points that aren't normal
to be segmented very aggressively.
14
Selection of Priors

The choice of p_e and p_b is critical in
obtaining a reasonable result.
Two kinds of errors:
Goodness of fit: too few segments.
Over-fitting: more segments than necessary.
We either minimize the maximum likelihood value or
test for over-fitting using a standard statistical
test (the F-test).

15
Prior Selection: Minimax Criterion

Choose the values of p_e and p_b that minimize the
maximum likelihood value.

16
Chromosome 8
(p_e, p_b) max at (0.55, 0.01)
17
Prior Selection: F Criterion

We want to ensure that no break we introduce
over-fits the data.
For each break we have a T^2 statistic and the
appropriate tail probability (p-value) calculated
from the distribution of the statistic. In this
case, this is an F distribution.
For the whole segmentation we take the minimum
p-value over all breaks.
The best (p_e, p_b) is the one that leads to the
maximum min p-value.
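The per-break step can be sketched for one-dimensional data, where a T^2 statistic reduces to the squared two-sample t statistic and follows an F(1, n1 + n2 - 2) distribution under the null hypothesis of no break. The pooled (equal) variance, the function names `t2_statistic` and `f_tail_1_dof`, and the numerical integration used in place of a statistics library are all assumptions of this sketch, not details taken from the slides.

```python
import math
from typing import Sequence, Tuple


def t2_statistic(left: Sequence[float], right: Sequence[float]) -> Tuple[float, int]:
    """Squared two-sample t statistic with pooled variance, plus its
    denominator degrees of freedom. Needs len(left) + len(right) >= 3
    and a non-zero pooled variance."""
    n1, n2 = len(left), len(right)
    m1, m2 = sum(left) / n1, sum(right) / n2
    ss = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
    dof = n1 + n2 - 2
    pooled_var = ss / dof
    t2 = (m1 - m2) ** 2 / (pooled_var * (1.0 / n1 + 1.0 / n2))
    return t2, dof


def f_tail_1_dof(t2: float, dof: int, steps: int = 2000) -> float:
    """Upper tail P(F(1, dof) > t2), computed as the two-sided Student-t
    tail by Simpson integration of the t density on [0, sqrt(t2)].
    `steps` must be even."""
    t = math.sqrt(t2)
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))

    def pdf(u: float) -> float:
        return math.exp(log_c - (dof + 1) / 2 * math.log1p(u * u / dof))

    h = t / steps
    area = pdf(0.0) + pdf(t)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 1.0 - 2.0 * area * h / 3.0)


# One candidate break between two neighbouring segments.
t2, dof = t2_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
p_value = f_tail_1_dof(t2, dof)
```

For the two toy segments the statistic is 13.5 on (1, 4) degrees of freedom, with a tail probability of roughly 0.02. Repeating this for every break of a segmentation gives the set of p-values from which the minimum is taken.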