Computational Systems Biology of Cancer:

About This Presentation

Title:

Computational Systems Biology of Cancer:

Description:

Computational Systems Biology of Cancer: ... – PowerPoint PPT presentation

Number of Views:246

Avg rating:3.0/5.0

Slides: 82

Provided by: Bud138

Learn more at: https://cs.nyu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Computational Systems Biology of Cancer:

1
(II)
Computational Systems Biology of Cancer
2
Bud Mishra

Professor of Computer Science, Mathematics and
Cell Biology
Courant Institute, NYU School of Medicine, Tata
Institute of Fundamental Research, and Mt. Sinai
School of Medicine

3
The New Synthesis
Genome Evolution
Selection
perturbed pathways
genetic instability
4
Is the Genomic View of Cancer Necessarily
Accurate ?

If I said yes, that would then suggest that that
might be the only place where it might be done
which would not be accurate, necessarily
accurate. It might also not be inaccurate, but
I'm disinclined to mislead anyone.
US Secretary of Defense, Mr. Donald Rumsfeld,
Once again quoted completely out of context.

5
Cancer Initiation and Progression
Genomics (Mutations, Translocations,
Amplifications, Deletions) Epigenomics (Hyper
Hypo-Methylation) Transcriptomics (Alternate
Splicing, mRNA) Proteomics (Synthesis,
Post-Translational Modification,
Degradation) Signaling
Cancer Initiation and Progression
Proliferation, Motility, Immortality, Metastasis,
Signaling
6
Mishras Mystical 3Ms

Rapid and accurate solutions
Bioinformatic, statistical, systems, and
computational approaches.
Approaches that are scalable, agnostic to
technologies, and widely applicable
Promises, challenges and obstacles

Measure
Mine
Model
7
Measure

What we can quantify and what we cannot

8
Microarray Analysis of Cancer Genome

Representations are reproducible samplings of DNA
populations in which the resulting DNA has a new
format and reduced complexity.
We array probes derived from low complexity
representations of the normal genome
We measure differences in gene copy number
between samples ratiometrically
Since representations have a lower nucleotide
complexity than total genomic DNA, we obtain a
stronger specific hybridization signal relative
to non-specific and noise

9
Minimizing Cross Hybridization(Complexity
Reduction)
10
Copy Number Fluctuation
A1
B1
C1
A2
B2
C2
A3
B3
C3
11
Critical Innovations

Data Normalization and Background Correction for
Affy-Chips
10K, 100K, 500K (Affy) Generalized RMA
Multi-Experiment-Based Probe-Characterization
(Kalman EM)
A novel genome segmenter algorithm
Empirical Bayes Approach Maximum A Posteriori
(MAP)
Generative Model (Hierarchical, Heteroskedastic)
Dynamic Programming Solution
Cubic-Time Linear-time Approximation using
Beam-Search Heuristic
Single Molecule Technologies
Optical and Nanotechnologies
Sequencing SMASH
Epigenomics
Transcriptomics

12
Background Correction Normalization
13
Oligo Arrays SNP genotyping

Given 500K human SNPs to be measured, select 10
25-mers that over lap each SNP location for
Allele A.
Select another 10 25-mers corresponding to SNP
Allele B.
Problem Cross Hybridization

14
Using SNP arrays to detect Genomic Aberrations

Each SNP probeset measures absense/presence of
one of two Alleles.
If a region of DNA is deleted by cancer, one or
both alleles will be missing!
If a region of DNA is duplicated/amplified by
cancer, one or both alleles will be amplified.
Problem Oligo arrays are noisy.

15
90 humans, 1 SNP (A0.48)
Allele B
Allele A
16
90 humans, 1 SNP (A0.24)
Allele B
Allele A
17
90 humans, 1 SNP (A0.96)
Allele B
Allele A
18
Background Correction Normalization

Consider a genomic location L and two similar
nucleotide sequences sL,x and sL,y starting at
that location in the two copies of a diploid
genomes
E.g., they may differ in one SNP.
Let qx and qy be their respective copy numbers in
the whole genome and all copies are selected in
the reduced complexity representation. The gene
chip contains four probes px 2 sL,x py 2 sL,y
px, py 2 G.
After PCR amplification, we have some Kx qx
amount of DNA that is complementary to the probe
px, etc.K' (¼ Kx) amount of DNA that is
additionally approximately complementary to the
probe px.

19
Normalize using a Generalized RMA

I U - mn
a sn2 - fN(0,1)(a/b)/FN(0,1)(a/b)
(1 b Bsn/FN(0,1)(a/b)-1
bsn/Bsn )
(1 FN(0,1)(a/b)/(b Bsn)-1,
Where a U-mn -a sn2 b sn, and
bsn å Ii,j U mn fN(0,1)(Ii,j U mn )
Bsn å fN(0,1)(Ii,j U mn )

20
Background Correction Normalization

If the probe has an affinity fx, then the
measured intensity is can be expressed as
Kx qx K fx noise
qx K/Kx fx noise
With Expm1 e s1, a multiplicative logNormal
noise,
m2 e s2 an additive Gaussian noise,
and fx Kx fx an amplified affinity.
A more general model
Ix qx K/Kx fx em1e s1 m2 e s2

21
Mathematical Model

In particular, we have four values of measured
intensities
Ix qx fx Nxe m1 e s1 m2 e s2
Ix Nx e m1 e s1 m2 e s2
Iy qy fy Ny e m1 e s1 m2 e s2
Iy Ny e m1 e s1 m2 e s2

22
Bioinformatics Data modeling

Good news For each 25-bp probe, the fluorescent
signal increases linearly with the amount of
complementary DNA in the sample (up to some limit
where it saturates).
Bad news The linear scaling and offset differ
for each 25-bp probe. Scaling varies by factors
of more than 10x.
Noise Due to PCR cross hybridization and
measurement noise.

23
Scaling Offset differ

Scaling varies across probes
Each 25-bp sequence has different thermodynamic
properties.
Scaling varies across samples
The scanning laser for different samples may have
different levels.
The starting DNA concentrations may differ PCR
may amplify differently.
Offset varies across probes
Different levels of Cross Hybridization with the
rest of the Genome.
Offset varies across samples
Different sample genomes may differ slightly
(sample degradation impurities, etc.)

24
Linear Model Noise
25
Noise minimization
26
Final Data Model
27
MLE using gradients
28
Data Outliers

Our data model fails for few data points (bad
probes)
Soln (1) Improve the model
Soln (2) Discard the outliers
Soln (3) Alternate model for the outliers
Weight the data approprately.

29
Outlier Model
30
Problem with MLE No unique maxima
31
Scaling of MLE estimate
32
Segmentation to reduce noise

The true copy number (Allele AB) is normally 2
and does not vary across the genome, except at a
few locations (breakpoints).
Segmentation can be used to estimate the location
of breakpoints and then we can average all
estimated copy number values between each pair of
breakpoints to reduce noise.

33
Allelic Frequencies Cancer Normal
34
Allelic Frequencies Cancer Normal
35
Segmentation Break-Point Detection
36
Algorithmic Approaches

Local Approach
Change-point Detection
(QSum, KS-Test, Permutation Test)
Global Approach
HMM models
Wavelet Decomposition
Bayesian Empirical Bayes Approach
Generative Models
(One- or Multi-level Hierarchical)
Maximum A Posteriori

37
HMM
Model with a very high degree of freedom, but not
enough data points. Small Sample statistics a
Overfitting, Convergence to local maxima, etc.
38
HMM, finally
Model with a very high degree of freedom, but not
enough data points. Small Sample statistics a
Overfitting, Convergence to local maxima, etc.
3
1
2
39
HMM, last time

Advantages
Small Number of parameters. Can be optimized by
MAP estimator. (EM has difficulties).
Easy to model deviation from Markvian properties
(e.g., polymorphisms, power-law, Polyas urn like
process, local properties of chromosomes, etc.)

We will simply model the number of break-points
by a Poisson process, and lengths of the
aberrational segments by an exponential
process. Two parameter model pb pe
¹ 2
1-pe
pe
2
pb
1-pb
40
Generative Model
Breakpoints, Poisson, pb Segmental Length,
Exponential, pe Copy number, Empirical
Distribution Noise, Gaussian, m, s
Amplification, c4
Amplification, c3
Deletion, c0
Deletion, c1
41
A reasonable choice of priors yields good
segmentation.
42
A reasonable choice of priors yields good
segmentation.
43
A MAP (Maximum A Posteriori) Estimators

Priors
Deletion Amplification
Data
Priors Noise
Goal Find the most plausible hypothesis of
regional changes and their associated copy
numbers
Generalizes HMMThe prior depends on two
parameters pe and pb.

pe is the probability of a particular probe being
normal.
pb is the average number of intervals per unit
length.

44
Likelihood Function

The likelihood function for first n probes
L(h i1, m1, , ik, mk i)
Exp(-pb n) (pb n)k
(2 p s2)(-n/2)Õi1n Exp-(vi - mj)2/2s2
pe(global)(1-pe)(local)
Where ik n and i belongs to the jth interval.
Maximum A Posteriori algorithm (implemented as a
Dynamic Programming Solution) optimizes L to get
the best segmentation
L(h i1, m1, , ik, mk i)

45
Dynamic Programming Algorithm

Generalizes Viterbi and Extends.
Uses the optimal parameters for the generative
model
Adds a new interval to the end
h i1, m1, , ik, mk i h ik1, mk1 i h i1,
m1, , ik, mk, ik1, mk1 i
Incremental computation of the likelihood
function
Log L(h i1, m1, , ik, mk, ik1, mk1 i)
Log L(h i1, m1, , ik, mki)
new-res./2s2 Log(pbn) (ik1 ik) Log (2ps2)
(ik1 ik) Iglobal Log pe Ilocal Log(1
pe)

46
Prior Selection F criterion

For each break we have a T2 statistic and the
appropriate tail probability (p value) calculated
from the distribution of the statistic. In this
case, this is an F distribution.
The best (pe,pb) is the one that leads to the
maximum min p-value.

47
Segmentation Analysis
48
Comparison of chromosome 13 tumor using 4
different segmentation algorithm
vMAP
DNAcopy
CGH Explorer v.2.43
GLAD
13q13.1
13q31.3
49
Comparative Analysis BAC Array
50
Comparative Analysis Nimblegen
51
Comparative Analysis Affy 10K
52
Simulated Data

Array CGH simulations and an ROC analysis
Using the same scheme as Lai et al.
Weil R. Lai, Mark D. Johnson, Raju Kucherlapati,
and Peter J. Park (2005), Comparative analysis
of algorithms for identifying amplifications and
deletions in array CGH data, Bioinformatics,
21(19) 3763-3770.
Segmented by Vmap and DNAcopy
Vmap algorithm was tested at 11 segmentation
Pvalues of 0.1, 5 10-2, 10-2, 10-3, 10-4, ,
10-10.
DNAcopy algorithm was tested at 9 segmentation
alpha values of .9, .5, .1, 10-2, 10-3, 10-4, ,
10-7.
Analysis by Alex Pearlman et al. (2006)

53
VMAP
54
DNACopy
55
(No Transcript)
56
Prostate Tumor Gains and Losses Genome view of
19K BAC CGH
Log ratio
57
Segmentation of Multi-BAC Events On Chromosome 13
Proximal breakpoints were identical for T1 and
T3. Distal breakpoints overlapped for T1, T2,
and T3.
Tumor1 Tumor2 Tumor3
58
Further Improvement

We employed a hierarchical Bayesian model in
which global false discovery rates can be
calculated using the different levels of the
model.
Noise processes are also estimated using the
appropriate global parameters.

59
Specific Features of the Model

We build a model in which, given the region
segmentations,
we assume that the copy numbers Ij region
j, (1 j k)
in that regions are mutually independent
Gaussian Xi,j N(qj, sj2), (1 i nj)
random variables with mean qj and variance sj2.
We further assume that each copy region mean
parameter qj is in one of a small number of
states 2 1,,S with respective probabilities,
p1, , pS of being in state s. qj is in state s
(with probability ps) if it has a Gaussian
distribution with state mean qs and state
variance ts2 .
States serve to characterize regions. The state
means and variances are the hyperparameters of
the model.

60
ImplementationDynamic Programming

Given the hyperparameters, we segment regions
using a dynamic programming approach. This
consists in constructing probe regions as
follows
After the (j-1)st region has been constructed
A) we choose the next two contiguous regions to
the right of those already constructed by
optimizing the corresponding log likelihood,
subject to the condition that the p-value of the
t-statistic distinguishing between these two
(aforementioned) regions is above a given
threshold.
B) Having chosen these (aforementioned) regions,
the probe regions already constructed, contiguous
to them, may also need to be altered.

61
Segmentation (ROMA,chr3)
62
SMASH

Single Molecule Approaches to Sequencing by
Hybridization
Extensions to Optical Mapping

63
SMASH

Genomic DNA is carefully extracted from small
number of cells of an organism (e.g., human) in
normal or diseased states. (Fig 1 shows a cancer
cell to be studied for its oncogeneomic
characterization.)

64
SMASH

LNA probes of length 6 8 nucleotides are
hybridized to dsDNA (double-stranded genomic DNA)
in a test tube (Fig 2) and the modified DNA is
stretched on a 1 x 1 chip that has microfluidic
channels manufactured on its surface. These
surfaces have been chemically treated to create a
positive charge.

DNA samples are prepared for analysis with LNA
probes and restriction enzymes.
65
SMASH

Since DNA is slightly negatively charged, it
adheres to the surface as it flows along these
channels and stretches out. Individual molecules
range in size from 0.3 3 million base pairs in
length.
Next, bright emitters are attached to the probes
on the surface and the molecules are imaged (Fig
3).

66
SMASH

A restriction enzyme1 is added to break the DNA
at specific sites. Since DNA molecules are under
slight tension, the cut fragments of DNA relax
like entropic springs, leaving small visible gaps
corresponding to the positions of the restriction
site (Fig 4).
1. A restriction enzyme is a highly specific
molecular scissor that recognizes short
nucleotide sequences and cuts the DNA at only
those recognition sites.

67
SMASH

The DNA is then stained with a fluorogen (Fig 5)
and reimaged. The two images are combined to
create a composite image suggesting the locations
of a specific short word (e.g., probes) within
the context of a pattern of restriction sites.

68
SMASH

The intensity of the light emitted by the dye at
one frequency provides a measure of the length of
the DNA fragments.
The intensity of the light emitted by the
bright-emitters on probes provides an intensity
profile for locations of the probes.
Images of each DNA molecule are then converted
into ideograms, where the restriction sites are
represented by a tall rectangle and probe sites
by small circles (Fig 6).

69
SMASH

The steps above are repeated for all possible
probe compositions (modulo reverse
complementarity).
Sutta software then uses the data from all such
individual ideograms to create an assembly of the
haplotypic ordered restriction maps with
approximate probe locations superimposed on the
map.

70
SMASH

Local clusters of overlapping words are combined
by Suttas PSBH (positional sequencing by
hybridization) algorithm to overlay the inferred
haplotypic sequence on top of the restriction map
(Fig 7).

71
Gapped Probes

Mixing solid bases with wild-card bases
E.g., xxxxxx (10-4-mers) or xxxxxx
(12-6-mers)
An wild-card base
Universal In terms of its ability to form base
pairs with the other natural DNA/RNA bases.
Applications in primers and in probes for
hybridization
Examples
The naturally occurring base hypoxanthine, as its
ribo- or 2'-deoxyribonucleoside
2'-deoxyisoinosine
7-deaza-2'-deoxyinosine
2-aza-2'-deoxyinosine

72
Simulation Results

Probe Map Assumptions
For single DNA molecules
Probe location Standard Deviation 240 bases
Data coverage per probe map 50x
Probe hybridization rate 30, and
false positive rate of 10 probes per megabase,
uniformly distributed.
Analytically estimation of the average error rate
in the probe consensus map
Probe location SD 60 bases
False Positive rate lt 2.4
False Negative rate lt 2.0.

73
Simulation Results
UNGAPPED
GAPPED
74
Simulation Results

Simulation based on non-random sequences from the
human genome 96 blocks of 1 Kb (from chromosome
1) concatenated together along with its in silico
restriction map.
Error summary for the gapped probe pattern
xxx xxx
Error count excluding repeats or near repeats
0.32bp / 10Kb
There is no error due to incorrect
rearrangements.
There is no loss of information at haplotypic
level.
Assembly failed in 2 of 96 blocks of 1kb 2.1
failure rate (out of memory).

75
GENomic conTIG

Gentig uses a purely Bayesian Approach.
It models all the error processes in the prior.
FAST It initially starts with a conservative but
fast pairwise overlap configuration, computed
efficiently using Geometric Hashing.
ACCURATE It iteratively combines pairs of maps
or map contigs, while optimizing the likelihood
score subject to a constraint imposed by a
false-positive constraint.
It has special heuristics to handle non-local
errors.

76
HAPTIG HAPlotypic conTIG Candida Albicans
FAST ACCURATE BAYESIAN ALGORITHM

The left end of chromsome-1 of the common fungus
Candida Albicans (being sequenced by Stanford).
You can clearly see 3 polymorphisms
(A) Fragment 2 is of size 41.19kb (top) vs
38.73kb (bottom).
(B) The 3rd fragment of size 7.76kb is missing
from the top haplotype.
(C)The large fragment in the middle is of size
61.78kb vs 59.66kb.

77
Lambda DNA with probes
10 mm
78
A
Fig. A Four AFM images of lambda DNA with PNA
probes hybridized to the distal recognition site,
located 6,900 bp or 2.28 microns from the end
(green arrow). Non-specifically bound probes
indicated by the red arrows. Z-scale is /- 1.5
nm.
79
E. coli
Figure 3. Two optical images of E coli K12
genomic DNA after restriction digestion with
6-cutter restriction enzyme Xho 1 and
hybridization with an 8-mer PNA probe. Bound
probes are indicated by blue arrows and
non-specifically bound probes by the red arrows.
Scale bar shown is 10 micron.
80
Discussions

81
Answer to Cancer