Title: Computational Systems Biology of Cancer:
1(II)
Computational Systems Biology of Cancer
2Bud Mishra
- Professor of Computer Science, Mathematics and
Cell Biology -
- Courant Institute, NYU School of Medicine, Tata
Institute of Fundamental Research, and Mt. Sinai
School of Medicine
3The New Synthesis
Genome Evolution
Selection
perturbed pathways
genetic instability
4Is the Genomic View of Cancer Necessarily
Accurate ?
- If I said yes, that would then suggest that that
might be the only place where it might be done
which would not be accurate, necessarily
accurate. It might also not be inaccurate, but
I'm disinclined to mislead anyone. - US Secretary of Defense, Mr. Donald Rumsfeld,
Once again quoted completely out of context.
5Cancer Initiation and Progression
Genomics (Mutations, Translocations,
Amplifications, Deletions) Epigenomics (Hyper
Hypo-Methylation) Transcriptomics (Alternate
Splicing, mRNA) Proteomics (Synthesis,
Post-Translational Modification,
Degradation) Signaling
Cancer Initiation and Progression
Proliferation, Motility, Immortality, Metastasis,
Signaling
6Mishras Mystical 3Ms
- Rapid and accurate solutions
- Bioinformatic, statistical, systems, and
computational approaches. - Approaches that are scalable, agnostic to
technologies, and widely applicable - Promises, challenges and obstacles
Measure
Mine
Model
7Measure
- What we can quantify and what we cannot
8Microarray Analysis of Cancer Genome
- Representations are reproducible samplings of DNA
populations in which the resulting DNA has a new
format and reduced complexity. - We array probes derived from low complexity
representations of the normal genome - We measure differences in gene copy number
between samples ratiometrically - Since representations have a lower nucleotide
complexity than total genomic DNA, we obtain a
stronger specific hybridization signal relative
to non-specific and noise
9Minimizing Cross Hybridization(Complexity
Reduction)
10Copy Number Fluctuation
A1
B1
C1
A2
B2
C2
A3
B3
C3
11Critical Innovations
- Data Normalization and Background Correction for
Affy-Chips - 10K, 100K, 500K (Affy) Generalized RMA
- Multi-Experiment-Based Probe-Characterization
(Kalman EM) - A novel genome segmenter algorithm
- Empirical Bayes Approach Maximum A Posteriori
(MAP) - Generative Model (Hierarchical, Heteroskedastic)
- Dynamic Programming Solution
- Cubic-Time Linear-time Approximation using
Beam-Search Heuristic - Single Molecule Technologies
- Optical and Nanotechnologies
- Sequencing SMASH
- Epigenomics
- Transcriptomics
12Background Correction Normalization
13Oligo Arrays SNP genotyping
- Given 500K human SNPs to be measured, select 10
25-mers that over lap each SNP location for
Allele A. - Select another 10 25-mers corresponding to SNP
Allele B. - Problem Cross Hybridization
14Using SNP arrays to detect Genomic Aberrations
- Each SNP probeset measures absense/presence of
one of two Alleles. - If a region of DNA is deleted by cancer, one or
both alleles will be missing! - If a region of DNA is duplicated/amplified by
cancer, one or both alleles will be amplified. - Problem Oligo arrays are noisy.
1590 humans, 1 SNP (A0.48)
Allele B
Allele A
1690 humans, 1 SNP (A0.24)
Allele B
Allele A
1790 humans, 1 SNP (A0.96)
Allele B
Allele A
18Background Correction Normalization
- Consider a genomic location L and two similar
nucleotide sequences sL,x and sL,y starting at
that location in the two copies of a diploid
genomes - E.g., they may differ in one SNP.
- Let qx and qy be their respective copy numbers in
the whole genome and all copies are selected in
the reduced complexity representation. The gene
chip contains four probes px 2 sL,x py 2 sL,y
px, py 2 G. - After PCR amplification, we have some Kx qx
amount of DNA that is complementary to the probe
px, etc.K' (¼ Kx) amount of DNA that is
additionally approximately complementary to the
probe px.
19Normalize using a Generalized RMA
- I U - mn
- a sn2 - fN(0,1)(a/b)/FN(0,1)(a/b)
- (1 b Bsn/FN(0,1)(a/b)-1
- bsn/Bsn )
- (1 FN(0,1)(a/b)/(b Bsn)-1,
- Where a U-mn -a sn2 b sn, and
- bsn å Ii,j U mn fN(0,1)(Ii,j U mn )
- Bsn å fN(0,1)(Ii,j U mn )
20Background Correction Normalization
- If the probe has an affinity fx, then the
measured intensity is can be expressed as - Kx qx K fx noise
- qx K/Kx fx noise
- With Expm1 e s1, a multiplicative logNormal
noise, - m2 e s2 an additive Gaussian noise,
- and fx Kx fx an amplified affinity.
- A more general model
- Ix qx K/Kx fx em1e s1 m2 e s2
21Mathematical Model
- In particular, we have four values of measured
intensities - Ix qx fx Nxe m1 e s1 m2 e s2
- Ix Nx e m1 e s1 m2 e s2
- Iy qy fy Ny e m1 e s1 m2 e s2
- Iy Ny e m1 e s1 m2 e s2
22Bioinformatics Data modeling
- Good news For each 25-bp probe, the fluorescent
signal increases linearly with the amount of
complementary DNA in the sample (up to some limit
where it saturates). - Bad news The linear scaling and offset differ
for each 25-bp probe. Scaling varies by factors
of more than 10x. - Noise Due to PCR cross hybridization and
measurement noise.
23Scaling Offset differ
- Scaling varies across probes
- Each 25-bp sequence has different thermodynamic
properties. - Scaling varies across samples
- The scanning laser for different samples may have
different levels. - The starting DNA concentrations may differ PCR
may amplify differently. - Offset varies across probes
- Different levels of Cross Hybridization with the
rest of the Genome. - Offset varies across samples
- Different sample genomes may differ slightly
(sample degradation impurities, etc.)
24Linear Model Noise
25Noise minimization
26Final Data Model
27MLE using gradients
28Data Outliers
- Our data model fails for few data points (bad
probes) - Soln (1) Improve the model
- Soln (2) Discard the outliers
- Soln (3) Alternate model for the outliers
Weight the data approprately.
29Outlier Model
30Problem with MLE No unique maxima
31Scaling of MLE estimate
32Segmentation to reduce noise
- The true copy number (Allele AB) is normally 2
and does not vary across the genome, except at a
few locations (breakpoints). - Segmentation can be used to estimate the location
of breakpoints and then we can average all
estimated copy number values between each pair of
breakpoints to reduce noise.
33Allelic Frequencies Cancer Normal
34Allelic Frequencies Cancer Normal
35Segmentation Break-Point Detection
36Algorithmic Approaches
- Local Approach
- Change-point Detection
- (QSum, KS-Test, Permutation Test)
- Global Approach
- HMM models
- Wavelet Decomposition
- Bayesian Empirical Bayes Approach
- Generative Models
- (One- or Multi-level Hierarchical)
- Maximum A Posteriori
37HMM
Model with a very high degree of freedom, but not
enough data points. Small Sample statistics a
Overfitting, Convergence to local maxima, etc.
38HMM, finally
Model with a very high degree of freedom, but not
enough data points. Small Sample statistics a
Overfitting, Convergence to local maxima, etc.
3
1
2
39HMM, last time
- Advantages
- Small Number of parameters. Can be optimized by
MAP estimator. (EM has difficulties). - Easy to model deviation from Markvian properties
(e.g., polymorphisms, power-law, Polyas urn like
process, local properties of chromosomes, etc.)
We will simply model the number of break-points
by a Poisson process, and lengths of the
aberrational segments by an exponential
process. Two parameter model pb pe
¹ 2
1-pe
pe
2
pb
1-pb
40Generative Model
Breakpoints, Poisson, pb Segmental Length,
Exponential, pe Copy number, Empirical
Distribution Noise, Gaussian, m, s
Amplification, c4
Amplification, c3
Deletion, c0
Deletion, c1
41A reasonable choice of priors yields good
segmentation.
42A reasonable choice of priors yields good
segmentation.
43A MAP (Maximum A Posteriori) Estimators
- Priors
- Deletion Amplification
- Data
- Priors Noise
- Goal Find the most plausible hypothesis of
regional changes and their associated copy
numbers - Generalizes HMMThe prior depends on two
parameters pe and pb.
- pe is the probability of a particular probe being
normal. - pb is the average number of intervals per unit
length.
44Likelihood Function
- The likelihood function for first n probes
- L(h i1, m1, , ik, mk i)
- Exp(-pb n) (pb n)k
- (2 p s2)(-n/2)Õi1n Exp-(vi - mj)2/2s2
- pe(global)(1-pe)(local)
- Where ik n and i belongs to the jth interval.
- Maximum A Posteriori algorithm (implemented as a
Dynamic Programming Solution) optimizes L to get
the best segmentation - L(h i1, m1, , ik, mk i)
45Dynamic Programming Algorithm
- Generalizes Viterbi and Extends.
- Uses the optimal parameters for the generative
model - Adds a new interval to the end
- h i1, m1, , ik, mk i h ik1, mk1 i h i1,
m1, , ik, mk, ik1, mk1 i - Incremental computation of the likelihood
function - Log L(h i1, m1, , ik, mk, ik1, mk1 i)
- Log L(h i1, m1, , ik, mki)
- new-res./2s2 Log(pbn) (ik1 ik) Log (2ps2)
- (ik1 ik) Iglobal Log pe Ilocal Log(1
pe)
46Prior Selection F criterion
- For each break we have a T2 statistic and the
appropriate tail probability (p value) calculated
from the distribution of the statistic. In this
case, this is an F distribution. - The best (pe,pb) is the one that leads to the
maximum min p-value. -
47Segmentation Analysis
48 Comparison of chromosome 13 tumor using 4
different segmentation algorithm
vMAP
DNAcopy
CGH Explorer v.2.43
GLAD
13q13.1
13q31.3
49Comparative Analysis BAC Array
50Comparative Analysis Nimblegen
51Comparative Analysis Affy 10K
52Simulated Data
- Array CGH simulations and an ROC analysis
- Using the same scheme as Lai et al.
- Weil R. Lai, Mark D. Johnson, Raju Kucherlapati,
and Peter J. Park (2005), Comparative analysis
of algorithms for identifying amplifications and
deletions in array CGH data, Bioinformatics,
21(19) 3763-3770. - Segmented by Vmap and DNAcopy
- Vmap algorithm was tested at 11 segmentation
Pvalues of 0.1, 5 10-2, 10-2, 10-3, 10-4, ,
10-10. - DNAcopy algorithm was tested at 9 segmentation
alpha values of .9, .5, .1, 10-2, 10-3, 10-4, ,
10-7. - Analysis by Alex Pearlman et al. (2006)
53VMAP
54DNACopy
55(No Transcript)
56 Prostate Tumor Gains and Losses Genome view of
19K BAC CGH
Log ratio
57Segmentation of Multi-BAC Events On Chromosome 13
Proximal breakpoints were identical for T1 and
T3. Distal breakpoints overlapped for T1, T2,
and T3.
Tumor1 Tumor2 Tumor3
58Further Improvement
- We employed a hierarchical Bayesian model in
which global false discovery rates can be
calculated using the different levels of the
model. - Noise processes are also estimated using the
appropriate global parameters.
59Specific Features of the Model
- We build a model in which, given the region
segmentations, - we assume that the copy numbers Ij region
j, (1 j k) - in that regions are mutually independent
- Gaussian Xi,j N(qj, sj2), (1 i nj)
- random variables with mean qj and variance sj2.
- We further assume that each copy region mean
parameter qj is in one of a small number of
states 2 1,,S with respective probabilities,
p1, , pS of being in state s. qj is in state s
(with probability ps) if it has a Gaussian
distribution with state mean qs and state
variance ts2 . - States serve to characterize regions. The state
means and variances are the hyperparameters of
the model.
60ImplementationDynamic Programming
- Given the hyperparameters, we segment regions
using a dynamic programming approach. This
consists in constructing probe regions as
follows - After the (j-1)st region has been constructed
- A) we choose the next two contiguous regions to
the right of those already constructed by
optimizing the corresponding log likelihood,
subject to the condition that the p-value of the
t-statistic distinguishing between these two
(aforementioned) regions is above a given
threshold. - B) Having chosen these (aforementioned) regions,
the probe regions already constructed, contiguous
to them, may also need to be altered.
61Segmentation (ROMA,chr3)
62SMASH
- Single Molecule Approaches to Sequencing by
Hybridization - Extensions to Optical Mapping
63SMASH
- Genomic DNA is carefully extracted from small
number of cells of an organism (e.g., human) in
normal or diseased states. (Fig 1 shows a cancer
cell to be studied for its oncogeneomic
characterization.)
64SMASH
- LNA probes of length 6 8 nucleotides are
hybridized to dsDNA (double-stranded genomic DNA)
in a test tube (Fig 2) and the modified DNA is
stretched on a 1 x 1 chip that has microfluidic
channels manufactured on its surface. These
surfaces have been chemically treated to create a
positive charge.
DNA samples are prepared for analysis with LNA
probes and restriction enzymes.
65SMASH
- Since DNA is slightly negatively charged, it
adheres to the surface as it flows along these
channels and stretches out. Individual molecules
range in size from 0.3 3 million base pairs in
length. - Next, bright emitters are attached to the probes
on the surface and the molecules are imaged (Fig
3).
66SMASH
- A restriction enzyme1 is added to break the DNA
at specific sites. Since DNA molecules are under
slight tension, the cut fragments of DNA relax
like entropic springs, leaving small visible gaps
corresponding to the positions of the restriction
site (Fig 4). - 1. A restriction enzyme is a highly specific
molecular scissor that recognizes short
nucleotide sequences and cuts the DNA at only
those recognition sites.
67SMASH
- The DNA is then stained with a fluorogen (Fig 5)
and reimaged. The two images are combined to
create a composite image suggesting the locations
of a specific short word (e.g., probes) within
the context of a pattern of restriction sites.
68SMASH
- The intensity of the light emitted by the dye at
one frequency provides a measure of the length of
the DNA fragments. - The intensity of the light emitted by the
bright-emitters on probes provides an intensity
profile for locations of the probes. - Images of each DNA molecule are then converted
into ideograms, where the restriction sites are
represented by a tall rectangle and probe sites
by small circles (Fig 6).
69SMASH
- The steps above are repeated for all possible
probe compositions (modulo reverse
complementarity). - Sutta software then uses the data from all such
individual ideograms to create an assembly of the
haplotypic ordered restriction maps with
approximate probe locations superimposed on the
map.
70SMASH
- Local clusters of overlapping words are combined
by Suttas PSBH (positional sequencing by
hybridization) algorithm to overlay the inferred
haplotypic sequence on top of the restriction map
(Fig 7).
71Gapped Probes
- Mixing solid bases with wild-card bases
- E.g., xxxxxx (10-4-mers) or xxxxxx
(12-6-mers) - An wild-card base
- Universal In terms of its ability to form base
pairs with the other natural DNA/RNA bases. - Applications in primers and in probes for
hybridization - Examples
- The naturally occurring base hypoxanthine, as its
ribo- or 2'-deoxyribonucleoside - 2'-deoxyisoinosine
- 7-deaza-2'-deoxyinosine
- 2-aza-2'-deoxyinosine
72Simulation Results
- Probe Map Assumptions
- For single DNA molecules
- Probe location Standard Deviation 240 bases
- Data coverage per probe map 50x
- Probe hybridization rate 30, and
- false positive rate of 10 probes per megabase,
uniformly distributed. - Analytically estimation of the average error rate
in the probe consensus map - Probe location SD 60 bases
- False Positive rate lt 2.4
- False Negative rate lt 2.0.
73Simulation Results
UNGAPPED
GAPPED
74Simulation Results
- Simulation based on non-random sequences from the
human genome 96 blocks of 1 Kb (from chromosome
1) concatenated together along with its in silico
restriction map. - Error summary for the gapped probe pattern
- xxx xxx
- Error count excluding repeats or near repeats
- 0.32bp / 10Kb
- There is no error due to incorrect
rearrangements. - There is no loss of information at haplotypic
level. - Assembly failed in 2 of 96 blocks of 1kb 2.1
failure rate (out of memory).
75GENomic conTIG
- Gentig uses a purely Bayesian Approach.
- It models all the error processes in the prior.
- FAST It initially starts with a conservative but
fast pairwise overlap configuration, computed
efficiently using Geometric Hashing. - ACCURATE It iteratively combines pairs of maps
or map contigs, while optimizing the likelihood
score subject to a constraint imposed by a
false-positive constraint. - It has special heuristics to handle non-local
errors.
76HAPTIG HAPlotypic conTIG Candida Albicans
FAST ACCURATE BAYESIAN ALGORITHM
- The left end of chromsome-1 of the common fungus
Candida Albicans (being sequenced by Stanford). - You can clearly see 3 polymorphisms
- (A) Fragment 2 is of size 41.19kb (top) vs
38.73kb (bottom). - (B) The 3rd fragment of size 7.76kb is missing
from the top haplotype. - (C)The large fragment in the middle is of size
61.78kb vs 59.66kb.
77Lambda DNA with probes
10 mm
78A
Fig. A Four AFM images of lambda DNA with PNA
probes hybridized to the distal recognition site,
located 6,900 bp or 2.28 microns from the end
(green arrow). Non-specifically bound probes
indicated by the red arrows. Z-scale is /- 1.5
nm.
79E. coli
Figure 3. Two optical images of E coli K12
genomic DNA after restriction digestion with
6-cutter restriction enzyme Xho 1 and
hybridization with an 8-mer PNA probe. Bound
probes are indicated by blue arrows and
non-specifically bound probes by the red arrows.
Scale bar shown is 10 micron.
80Discussions
81Answer to Cancer
- If I know the answer I'll tell you the answer,
and if I don't, I'll just respond, cleverly. - US Secretary of Defense, Mr. Donald Rumsfeld.
82To be continued