Computational Systems Biology of Cancer: - PowerPoint PPT Presentation

Slides: 82
Provided by: Bud138
Learn more at: https://cs.nyu.edu
Transcript and Presenter's Notes

1
(II)
Computational Systems Biology of Cancer
2
Bud Mishra
  • Professor of Computer Science, Mathematics and
    Cell Biology
  • Courant Institute, NYU School of Medicine, Tata
    Institute of Fundamental Research, and Mt. Sinai
    School of Medicine

3
The New Synthesis
Genome Evolution
Selection
perturbed pathways
genetic instability
4
Is the Genomic View of Cancer Necessarily Accurate?
  • If I said yes, that would then suggest that that
    might be the only place where it might be done
    which would not be accurate, necessarily
    accurate. It might also not be inaccurate, but
    I'm disinclined to mislead anyone.
  • US Secretary of Defense, Mr. Donald Rumsfeld,
    Once again quoted completely out of context.

5
Cancer Initiation and Progression
  • Genomics (Mutations, Translocations, Amplifications, Deletions)
  • Epigenomics (Hyper- and Hypo-Methylation)
  • Transcriptomics (Alternate Splicing, mRNA)
  • Proteomics (Synthesis, Post-Translational Modification, Degradation)
  • Signaling
Cancer Initiation and Progression
Proliferation, Motility, Immortality, Metastasis,
Signaling
6
Mishra's Mystical 3Ms
  • Rapid and accurate solutions
  • Bioinformatic, statistical, systems, and
    computational approaches.
  • Approaches that are scalable, agnostic to
    technologies, and widely applicable
  • Promises, challenges and obstacles

Measure
Mine
Model
7
Measure
  • What we can quantify and what we cannot

8
Microarray Analysis of Cancer Genome
  • Representations are reproducible samplings of DNA
    populations in which the resulting DNA has a new
    format and reduced complexity.
  • We array probes derived from low complexity
    representations of the normal genome
  • We measure differences in gene copy number
    between samples ratiometrically
  • Since representations have a lower nucleotide
    complexity than total genomic DNA, we obtain a
    stronger specific hybridization signal relative
    to non-specific and noise
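
The ratiometric comparison described above can be illustrated with a per-probe log2 ratio of test to reference intensity. This is a minimal sketch, not the presenter's pipeline: the function names `log2_ratios` and `implied_copy_number`, the sample intensities, and the diploid baseline of 2 copies are all assumptions made for illustration.

```python
import math

def log2_ratios(test, reference):
    """Per-probe log2(test / reference); 0 means equal copy number."""
    if len(test) != len(reference):
        raise ValueError("intensity lists must have the same length")
    return [math.log2(t / r) for t, r in zip(test, reference)]

def implied_copy_number(log_ratio, baseline=2.0):
    """Copy number implied by a log2 ratio, assuming the reference has `baseline` copies."""
    return baseline * 2.0 ** log_ratio

# Hypothetical intensities for four probes (tumor vs. normal).
tumor = [1000.0, 2000.0, 500.0, 1000.0]
normal = [1000.0, 1000.0, 1000.0, 1000.0]
ratios = log2_ratios(tumor, normal)
copies = [implied_copy_number(r) for r in ratios]
```

Working in log ratios makes gains and losses symmetric around zero, which is convenient for the segmentation steps later in the talk.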

9
Minimizing Cross Hybridization (Complexity Reduction)
10
Copy Number Fluctuation
A1
B1
C1
A2
B2
C2
A3
B3
C3
11
Critical Innovations
  • Data Normalization and Background Correction for
    Affy-Chips
  • 10K, 100K, 500K (Affy): Generalized RMA
  • Multi-Experiment-Based Probe Characterization
    (Kalman EM)
  • A novel genome segmenter algorithm
  • Empirical Bayes approach: Maximum A Posteriori
    (MAP)
  • Generative Model (Hierarchical, Heteroskedastic)
  • Dynamic Programming Solution
  • Cubic-time; linear-time approximation using a
    beam-search heuristic
  • Single Molecule Technologies
  • Optical and Nanotechnologies
  • Sequencing: SMASH
  • Epigenomics
  • Transcriptomics

12
Background Correction and Normalization
13
Oligo Arrays: SNP Genotyping
  • Given 500K human SNPs to be measured, select 10
    25-mers that overlap each SNP location for
    Allele A.
  • Select another 10 25-mers corresponding to SNP
    Allele B.
  • Problem: Cross Hybridization
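
The probe selection described above can be sketched as tiling 25-base windows across the SNP position, once per allele. This is an illustrative sketch only: the function name `tiling_probes`, the even spacing of start positions, and the toy sequence are assumptions, and a real array design would also consider strand, probe thermodynamics, and cross hybridization.

```python
def tiling_probes(sequence, snp_index, allele, probe_len=25, n_probes=10):
    """Return up to n_probes windows of probe_len bases that cover snp_index,
    with the base at snp_index set to the given allele."""
    variant = sequence[:snp_index] + allele + sequence[snp_index + 1:]
    # A window is valid if it contains the SNP and lies inside the sequence.
    first = max(0, snp_index - probe_len + 1)
    last = min(snp_index, len(variant) - probe_len)
    starts = list(range(first, last + 1))
    if len(starts) > n_probes:
        # Spread the chosen start positions evenly over the valid range.
        step = (len(starts) - 1) / max(n_probes - 1, 1)
        starts = [starts[round(i * step)] for i in range(n_probes)]
    return [variant[s:s + probe_len] for s in starts]

# Toy 60-base sequence with a hypothetical A/G SNP at index 30.
genome = "ACGT" * 15
probes_a = tiling_probes(genome, 30, "A")
probes_b = tiling_probes(genome, 30, "G")
```

Each Allele A probe differs from its Allele B partner at exactly one base, which is why the two probe sets are prone to cross hybridization.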

14
Using SNP arrays to detect Genomic Aberrations
  • Each SNP probeset measures absence/presence of
    one of two alleles.
  • If a region of DNA is deleted by cancer, one or
    both alleles will be missing!
  • If a region of DNA is duplicated/amplified by
    cancer, one or both alleles will be amplified.
  • Problem: Oligo arrays are noisy.

15
90 humans, 1 SNP (A = 0.48)
[Plot axes: Allele A, Allele B]
16
90 humans, 1 SNP (A = 0.24)
[Plot axes: Allele A, Allele B]
17
90 humans, 1 SNP (A = 0.96)
[Plot axes: Allele A, Allele B]
18
Background Correction and Normalization
  • Consider a genomic location L and two similar
    nucleotide sequences $s_{L,x}$ and $s_{L,y}$ starting at
    that location in the two copies of a diploid
    genome.
  • E.g., they may differ in one SNP.
  • Let $q_x$ and $q_y$ be their respective copy numbers in
    the whole genome, and assume all copies are selected in
    the reduced-complexity representation. The gene
    chip contains four probes: $p_x \in s_{L,x}$, $p_y \in s_{L,y}$,
    and $p'_x, p'_y \in G$.
  • After PCR amplification, we have some $K_x q_x$
    amount of DNA that is complementary to the probe
    $p_x$, etc., and a $K'$ ($\approx K_x$) amount of DNA that is
    additionally approximately complementary to the
    probe $p_x$.

19
Normalize using a Generalized RMA
  • I U - mn
  • a sn2 - fN(0,1)(a/b)/FN(0,1)(a/b)
  • (1 b Bsn/FN(0,1)(a/b)-1
  • bsn/Bsn )
  • (1 FN(0,1)(a/b)/(b Bsn)-1,
  • where $a = U - \mu_n - \alpha \sigma_n^2$, $b = \sigma_n$, and
  • bsn å Ii,j U mn fN(0,1)(Ii,j U mn )
  • Bsn å fN(0,1)(Ii,j U mn )

20
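Background Correction and Normalization
  • If the probe has an affinity $\phi_x$, then the
    measured intensity can be expressed as
  • $[K_x q_x + K'] \phi_x + \text{noise}$
  • $= [q_x + K'/K_x] \phi'_x + \text{noise}$
  • with $e^{\mu_1 + \epsilon \sigma_1}$, a multiplicative log-normal
    noise,
  • $\mu_2 + \epsilon \sigma_2$, an additive Gaussian noise,
  • and $\phi'_x = K_x \phi_x$, an amplified affinity.
  • A more general model:
  • $I_x = [q_x + K'/K_x] \phi'_x e^{\mu_1 + \epsilon \sigma_1} + \mu_2 + \epsilon \sigma_2$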
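
To get a feel for the noise model above, one can simulate intensities that combine a multiplicative log-normal term with an additive Gaussian term. This is a hedged sketch: the function name `simulate_intensity`, the parameter names, the use of two independent standard normal draws for the two noise terms, and all numeric values are assumptions for illustration, not values from the talk.

```python
import math
import random

def simulate_intensity(q, phi, cross_hyb, mu1, sigma1, mu2, sigma2, rng):
    """Draw one intensity:
    (q + cross_hyb) * phi * exp(mu1 + e1 * sigma1) + mu2 + e2 * sigma2,
    where e1 and e2 are independent standard normal draws (an assumption)."""
    e1 = rng.gauss(0.0, 1.0)  # drives the multiplicative log-normal noise
    e2 = rng.gauss(0.0, 1.0)  # drives the additive Gaussian noise
    signal = (q + cross_hyb) * phi
    return signal * math.exp(mu1 + e1 * sigma1) + mu2 + e2 * sigma2

rng = random.Random(42)
# With both sigmas at zero the model is deterministic: 2.5 * 100 + 50.
noiseless = simulate_intensity(2, 100.0, 0.5, 0.0, 0.0, 50.0, 0.0, rng)
# Hypothetical noise levels for a copy-number-2 probe.
noisy = [simulate_intensity(2, 100.0, 0.5, 0.0, 0.1, 50.0, 5.0, rng)
         for _ in range(1000)]
mean_noisy = sum(noisy) / len(noisy)
```

The multiplicative term dominates at high intensity and the additive term dominates near background, which is why both are kept in the model.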

21
Mathematical Model
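  • In particular, we have four values of measured
    intensities:
  • $I_x = [q_x \phi'_x + N_x] e^{\mu_1 + \epsilon \sigma_1} + \mu_2 + \epsilon \sigma_2$
  • $I'_x = N_x e^{\mu_1 + \epsilon \sigma_1} + \mu_2 + \epsilon \sigma_2$
  • $I_y = [q_y \phi'_y + N_y] e^{\mu_1 + \epsilon \sigma_1} + \mu_2 + \epsilon \sigma_2$
  • $I'_y = N_y e^{\mu_1 + \epsilon \sigma_1} + \mu_2 + \epsilon \sigma_2$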

22
Bioinformatics: Data Modeling
  • Good news: For each 25-bp probe, the fluorescent
    signal increases linearly with the amount of
    complementary DNA in the sample (up to some limit
    where it saturates).
  • Bad news: The linear scaling and offset differ
    for each 25-bp probe. Scaling varies by factors
    of more than 10x.
  • Noise: Due to PCR, cross hybridization, and
    measurement noise.

23
Scaling and Offset Differ
  • Scaling varies across probes:
  • Each 25-bp sequence has different thermodynamic
    properties.
  • Scaling varies across samples:
  • The scanning laser for different samples may have
    different levels.
  • The starting DNA concentrations may differ; PCR
    may amplify differently.
  • Offset varies across probes:
  • Different levels of cross hybridization with the
    rest of the genome.
  • Offset varies across samples:
  • Different sample genomes may differ slightly
    (sample degradation, impurities, etc.)
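
Because each probe has its own scaling and offset, one simple way to characterize a probe is to fit a straight line to its intensities across samples of known copy number. The sketch below uses ordinary least squares; the function name `fit_scale_offset` and the calibration values are hypothetical, and the talk's own estimator (MLE with an outlier model) is more elaborate than this.

```python
def fit_scale_offset(copy_numbers, intensities):
    """Least-squares fit of intensity = scale * copy_number + offset for one probe."""
    n = len(copy_numbers)
    mean_x = sum(copy_numbers) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in copy_numbers)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(copy_numbers, intensities))
    scale = sxy / sxx
    offset = mean_y - scale * mean_x
    return scale, offset

# Two hypothetical probes measured at copy numbers 1..4:
# probe A responds ten times more strongly than probe B.
copies = [1, 2, 3, 4]
probe_a = [190.0, 340.0, 490.0, 640.0]   # 150 * c + 40
probe_b = [415.0, 430.0, 445.0, 460.0]   # 15 * c + 400
scale_a, offset_a = fit_scale_offset(copies, probe_a)
scale_b, offset_b = fit_scale_offset(copies, probe_b)
```

Once a probe's scale and offset are known, an observed intensity can be mapped back to an estimated copy number as (intensity - offset) / scale.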

24
Linear Model Noise
25
Noise minimization
26
Final Data Model
27
MLE using gradients
28
Data Outliers
  • Our data model fails for a few data points (bad
    probes)
  • Solution (1): Improve the model
  • Solution (2): Discard the outliers
  • Solution (3): Alternate model for the outliers;
    weight the data appropriately.

29
Outlier Model
30
Problem with MLE: No Unique Maximum
31
Scaling of MLE estimate
32
Segmentation to reduce noise
  • The true copy number (Allele A + B) is normally 2
    and does not vary across the genome, except at a
    few locations (breakpoints).
  • Segmentation can be used to estimate the location
    of breakpoints and then we can average all
    estimated copy number values between each pair of
    breakpoints to reduce noise.
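
The averaging step described above is easy to sketch once breakpoints are known. In this illustration the function name `segment_means` and the convention that a breakpoint is the index where a new segment starts are assumptions; estimating the breakpoints themselves is the harder problem addressed in the following slides.

```python
def segment_means(values, breakpoints):
    """Replace each value by the mean of its segment.
    Each breakpoint is the index at which a new segment starts."""
    bounds = [0] + sorted(breakpoints) + [len(values)]
    smoothed = []
    for start, end in zip(bounds, bounds[1:]):
        if end > start:
            mean = sum(values[start:end]) / (end - start)
            smoothed.extend([mean] * (end - start))
    return smoothed

# Hypothetical noisy copy-number estimates with one breakpoint at index 3.
noisy = [2.1, 1.9, 2.0, 3.9, 4.1, 4.0]
smoothed = segment_means(noisy, [3])
```

Averaging m probes within a segment reduces the standard deviation of independent noise by a factor of about the square root of m.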

33
Allelic Frequencies Cancer Normal
34
Allelic Frequencies Cancer Normal
35
Segmentation Break-Point Detection
36
Algorithmic Approaches
  • Local Approach
  • Change-point Detection (QSum, KS-Test, Permutation Test)
  • Global Approach
  • HMM models
  • Wavelet Decomposition
  • Bayesian / Empirical Bayes Approach
  • Generative Models (One- or Multi-level Hierarchical)
  • Maximum A Posteriori

37
HMM
Model with a very high degree of freedom, but not
enough data points. Small-sample statistics lead to
overfitting, convergence to local maxima, etc.
38
HMM, finally
Model with a very high degree of freedom, but not
enough data points. Small-sample statistics lead to
overfitting, convergence to local maxima, etc.
[Diagram: HMM with three states, labeled 1, 2, 3]
39
HMM, last time
  • Advantages
  • Small number of parameters. Can be optimized by
    a MAP estimator. (EM has difficulties.)
  • Easy to model deviation from Markovian properties
    (e.g., polymorphisms, power-law, Pólya's urn-like
    process, local properties of chromosomes, etc.)

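We will simply model the number of break-points
by a Poisson process, and lengths of the
aberrational segments by an exponential
process. Two-parameter model: $p_b$, $p_e$.
[State diagram: two states, copy number $= 2$ and $\ne 2$, with transition labels $p_b$, $1 - p_b$, $p_e$, $1 - p_e$]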
40
Generative Model
  • Breakpoints: Poisson, $p_b$
  • Segmental length: Exponential, $p_e$
  • Copy number: Empirical distribution
  • Noise: Gaussian, $\mu$, $\sigma$
  • Amplification, c = 4
  • Amplification, c = 3
  • Deletion, c = 0
  • Deletion, c = 1
41
A reasonable choice of priors yields good
segmentation.
42
A reasonable choice of priors yields good
segmentation.
43
A MAP (Maximum A Posteriori) Estimator
  • Priors: Deletion + Amplification
  • Data: Priors + Noise
  • Goal: Find the most plausible hypothesis of
    regional changes and their associated copy
    numbers
  • Generalizes HMM: the prior depends on two
    parameters, $p_e$ and $p_b$.
  • $p_e$ is the probability of a particular probe being
    normal.
  • $p_b$ is the average number of intervals per unit
    length.

44
Likelihood Function
  • The likelihood function for the first n probes:
  • $L(\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle)$
  • $= e^{-p_b n} (p_b n)^k$
  • $\times (2\pi\sigma^2)^{-n/2} \prod_{i=1}^{n} \exp[-(v_i - \mu_j)^2 / 2\sigma^2]$
  • $\times p_e^{(\#\mathrm{global})} (1 - p_e)^{(\#\mathrm{local})}$
  • where $i_k = n$ and $i$ belongs to the $j$th interval.
  • The Maximum A Posteriori algorithm (implemented as a
    Dynamic Programming Solution) optimizes $L$ to get
    the best segmentation
  • $L(\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle)$

45
Dynamic Programming Algorithm
  • Generalizes Viterbi and Extends.
  • Uses the optimal parameters for the generative
    model
  • Adds a new interval to the end
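  • $\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle \circ \langle i_{k+1}, \mu_{k+1} \rangle = \langle i_1, \mu_1, \ldots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle$
  • Incremental computation of the likelihood
    function:
  • $\log L(\langle i_1, \mu_1, \ldots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle)$
  • $= \log L(\langle i_1, \mu_1, \ldots, i_k, \mu_k \rangle)$
  • $- \text{new-res.}/2\sigma^2 + \log(p_b n) - \tfrac{1}{2}(i_{k+1} - i_k) \log(2\pi\sigma^2)$
  • $+ (i_{k+1} - i_k) [I_{\mathrm{global}} \log p_e + I_{\mathrm{local}} \log(1 - p_e)]$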
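
One way to read the recursion above is as a dynamic program over the position of the last breakpoint. The sketch below is an illustrative quadratic-time version and not the presenter's implementation. The function name `map_segment`, the `baseline` parameter, the reading of "global" intervals as pinned to a baseline mean and "local" intervals as having a fitted mean, and the example parameter values are all assumptions. Constant terms of the log likelihood are dropped, and the beam-search approximation mentioned earlier in the talk is not included.

```python
import math

def map_segment(values, p_b, p_e, sigma, baseline=0.0):
    """Return (segments, score); segments are (start, end, mean), end exclusive."""
    n = len(values)
    # Prefix sums give each interval's residual sum of squares in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def interval_score(i, j):
        length = j - i
        total = s1[j] - s1[i]
        squares = s2[j] - s2[i]
        # "Global" interval: mean pinned to baseline, log p_e per probe.
        rss_global = squares - 2.0 * baseline * total + length * baseline ** 2
        glob = -rss_global / (2.0 * sigma ** 2) + length * math.log(p_e)
        # "Local" interval: mean fitted to the data, log(1 - p_e) per probe.
        mean = total / length
        rss_local = max(squares - total * total / length, 0.0)
        local = -rss_local / (2.0 * sigma ** 2) + length * math.log(1.0 - p_e)
        # Every interval also contributes the Poisson term log(p_b * n).
        poisson = math.log(p_b * n)
        if glob >= local:
            return glob + poisson, baseline
        return local + poisson, mean

    best = [0.0] + [-math.inf] * n
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            score, mean = interval_score(i, j)
            if best[i] + score > best[j]:
                best[j] = best[i] + score
                back[j] = (i, mean)
    # Follow the back-pointers from the last probe to recover the intervals.
    segments = []
    j = n
    while j > 0:
        i, mean = back[j]
        segments.append((i, j, mean))
        j = i
    segments.reverse()
    return segments, best[n]

# Hypothetical log-ratio profile: a gained block of 10 probes in the middle.
profile = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
segments, score = map_segment(profile, p_b=0.01, p_e=0.9, sigma=0.2)
```

In this toy run the two flanking intervals are scored as "global" and the middle one as "local"; with smaller `sigma` or larger `p_b` the program tolerates more breakpoints.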

46
Prior Selection: F Criterion
  • For each break we have a $T^2$ statistic and the
    appropriate tail probability (p-value) calculated
    from the distribution of the statistic. In this
    case, this is an F distribution.
  • The best $(p_e, p_b)$ is the one that leads to the
    maximum min p-value.

47
Segmentation Analysis
48
Comparison of chromosome 13 tumor using 4
different segmentation algorithms
vMAP
DNAcopy
CGH Explorer v.2.43
GLAD
13q13.1
13q31.3
49
Comparative Analysis: BAC Array
50
Comparative Analysis: Nimblegen
51
Comparative Analysis: Affy 10K
52
Simulated Data
  • Array CGH simulations and an ROC analysis
  • Using the same scheme as Lai et al.
  • Weil R. Lai, Mark D. Johnson, Raju Kucherlapati,
    and Peter J. Park (2005), "Comparative analysis
    of algorithms for identifying amplifications and
    deletions in array CGH data," Bioinformatics,
    21(19): 3763-3770.
  • Segmented by Vmap and DNAcopy
  • The Vmap algorithm was tested at 11 segmentation
    P-values: 0.1, $5 \times 10^{-2}$, $10^{-2}$, $10^{-3}$, $10^{-4}$, ...,
    $10^{-10}$.
  • The DNAcopy algorithm was tested at 9 segmentation
    alpha values: 0.9, 0.5, 0.1, $10^{-2}$, $10^{-3}$, $10^{-4}$, ...,
    $10^{-7}$.
  • Analysis by Alex Pearlman et al. (2006)
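
Each segmentation threshold listed above yields one point on an ROC curve once the calls are compared with the simulated truth. The sketch below shows that bookkeeping for a single threshold; the function name `roc_point`, the per-probe boolean encoding, and the example calls are assumptions and are not taken from the Lai et al. scheme.

```python
def roc_point(truth, called):
    """Return (false_positive_rate, true_positive_rate) for per-probe calls.
    truth[i] is True if probe i lies in a simulated aberration;
    called[i] is True if the segmenter flagged probe i as aberrant."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    return fp / negatives, tp / positives

# Hypothetical truth and calls for eight probes at one threshold.
truth = [False, False, False, False, True, True, True, True]
called = [False, True, False, False, True, True, True, False]
fpr, tpr = roc_point(truth, called)
```

Repeating this for every threshold and plotting the true positive rate against the false positive rate gives the curve used to compare the two segmenters.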

53
VMAP
54
DNACopy
55
(No Transcript)
56
Prostate Tumor Gains and Losses: Genome View of
19K BAC CGH
Log ratio
57
Segmentation of Multi-BAC Events On Chromosome 13
Proximal breakpoints were identical for T1 and
T3. Distal breakpoints overlapped for T1, T2,
and T3.
Tumor1 Tumor2 Tumor3
58
Further Improvement
  • We employed a hierarchical Bayesian model in
    which global false discovery rates can be
    calculated using the different levels of the
    model.
  • Noise processes are also estimated using the
    appropriate global parameters.

59
Specific Features of the Model
  • We build a model in which, given the region
    segmentations, we assume that the copy numbers
    $X_{i,j}$ in region $I_j$ ($1 \le j \le k$) are
    mutually independent Gaussian random variables,
    $X_{i,j} \sim N(\theta_j, \sigma_j^2)$ ($1 \le i \le n_j$),
    with mean $\theta_j$ and variance $\sigma_j^2$.
  • We further assume that each copy region mean
    parameter $\theta_j$ is in one of a small number of
    states $s \in \{1, \ldots, S\}$, with respective
    probabilities $p_1, \ldots, p_S$ of being in state $s$.
    $\theta_j$ is in state $s$ (with probability $p_s$) if it has
    a Gaussian distribution with state mean $\theta_s$ and
    state variance $\tau_s^2$.
  • States serve to characterize regions. The state
    means and variances are the hyperparameters of
    the model.
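
The two-level structure above can be made concrete by sampling from it: first a state, then a region mean around that state's mean, then probe values around the region mean. This is a sketch under stated assumptions; the function name `sample_region`, the parameter names, and the three example states (loss, normal, gain) with their values are hypothetical.

```python
import random

def sample_region(n_probes, state_probs, state_means, state_sds, probe_sd, rng):
    """Draw one region from the hierarchical model:
    state -> region mean ~ N(state mean, state sd^2) -> probes ~ N(region mean, probe sd^2)."""
    state = rng.choices(range(len(state_probs)), weights=state_probs)[0]
    region_mean = rng.gauss(state_means[state], state_sds[state])
    probes = [rng.gauss(region_mean, probe_sd) for _ in range(n_probes)]
    return state, region_mean, probes

rng = random.Random(1)
# Hypothetical states: loss (1 copy), normal (2 copies), gain (3 copies).
state, region_mean, probes = sample_region(
    n_probes=5,
    state_probs=[0.1, 0.8, 0.1],
    state_means=[1.0, 2.0, 3.0],
    state_sds=[0.1, 0.1, 0.1],
    probe_sd=0.2,
    rng=rng,
)
```

Fitting the model runs this process in reverse: given probe values, infer the segmentation, the region means, and the state of each region.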

60
Implementation: Dynamic Programming
  • Given the hyperparameters, we segment regions
    using a dynamic programming approach. This
    consists in constructing probe regions as
    follows:
  • After the (j-1)st region has been constructed:
  • A) we choose the next two contiguous regions to
    the right of those already constructed by
    optimizing the corresponding log likelihood,
    subject to the condition that the p-value of the
    t-statistic distinguishing between these two
    (aforementioned) regions is above a given
    threshold.
  • B) Having chosen these (aforementioned) regions,
    the probe regions already constructed, contiguous
    to them, may also need to be altered.

61
Segmentation (ROMA, chr3)
62
SMASH
  • Single Molecule Approaches to Sequencing by
    Hybridization
  • Extensions to Optical Mapping

63
SMASH
  • Genomic DNA is carefully extracted from a small
    number of cells of an organism (e.g., human) in
    normal or diseased states. (Fig 1 shows a cancer
    cell to be studied for its oncogenomic
    characterization.)

64
SMASH
  • LNA probes of length 6-8 nucleotides are
    hybridized to dsDNA (double-stranded genomic DNA)
    in a test tube (Fig 2) and the modified DNA is
    stretched on a 1 x 1 chip that has microfluidic
    channels manufactured on its surface. These
    surfaces have been chemically treated to create a
    positive charge.

DNA samples are prepared for analysis with LNA
probes and restriction enzymes.
65
SMASH
  • Since DNA is slightly negatively charged, it
    adheres to the surface as it flows along these
    channels and stretches out. Individual molecules
    range in size from 0.3 to 3 million base pairs in
    length.
  • Next, bright emitters are attached to the probes
    on the surface and the molecules are imaged (Fig
    3).

66
SMASH
  • A restriction enzyme [1] is added to break the DNA
    at specific sites. Since DNA molecules are under
    slight tension, the cut fragments of DNA relax
    like entropic springs, leaving small visible gaps
    corresponding to the positions of the restriction
    site (Fig 4).
  • [1] A restriction enzyme is a highly specific
    molecular scissor that recognizes short
    nucleotide sequences and cuts the DNA at only
    those recognition sites.
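
The pattern of gaps left by the enzyme corresponds to an ordered list of fragment lengths, which can be computed in silico for a known sequence. In this sketch the function name `restriction_fragments` and the toy sequence are assumptions; the example uses the XhoI recognition site CTCGAG, cut after the first C, and scans one strand only, which is sufficient for a palindromic site.

```python
def restriction_fragments(sequence, site, cut_offset=0):
    """Return ordered fragment lengths after cutting at every occurrence of site.
    cut_offset is the cut position measured from the start of the site."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# Toy 26-base sequence containing two XhoI sites (C^TCGAG).
dna = "AAAA" + "CTCGAG" + "TTTTTTTT" + "CTCGAG" + "GG"
fragments = restriction_fragments(dna, "CTCGAG", cut_offset=1)
```

An ordered restriction map is this list of fragment sizes read along the molecule, up to the sizing error of the optical measurement.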

67
SMASH
  • The DNA is then stained with a fluorogen (Fig 5)
    and reimaged. The two images are combined to
    create a composite image suggesting the locations
    of a specific short word (e.g., probes) within
    the context of a pattern of restriction sites.

68
SMASH
  • The intensity of the light emitted by the dye at
    one frequency provides a measure of the length of
    the DNA fragments.
  • The intensity of the light emitted by the
    bright-emitters on probes provides an intensity
    profile for locations of the probes.
  • Images of each DNA molecule are then converted
    into ideograms, where the restriction sites are
    represented by a tall rectangle and probe sites
    by small circles (Fig 6).

69
SMASH
  • The steps above are repeated for all possible
    probe compositions (modulo reverse
    complementarity).
  • Sutta software then uses the data from all such
    individual ideograms to create an assembly of the
    haplotypic ordered restriction maps with
    approximate probe locations superimposed on the
    map.
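
Counting probes "modulo reverse complementarity" means that a word and its reverse complement are treated as the same probe, since they mark the same site on double-stranded DNA. The sketch below enumerates one representative per pair; the function names and the choice of the lexicographically smaller word as representative are assumptions for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of a DNA word over the alphabet ACGT."""
    return word.translate(COMPLEMENT)[::-1]

def canonical_probes(k):
    """One representative per reverse-complement pair of k-mers."""
    reps = set()
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        reps.add(min(word, reverse_complement(word)))
    return sorted(reps)

six_mers = canonical_probes(6)
```

For even k the count is (4^k + 4^(k/2)) / 2, because self-complementary words are their own partners; for 6-mers that is 2080 distinct probes rather than 4096.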

70
SMASH
  • Local clusters of overlapping words are combined
    by Sutta's PSBH (positional sequencing by
    hybridization) algorithm to overlay the inferred
    haplotypic sequence on top of the restriction map
    (Fig 7).

71
Gapped Probes
  • Mixing solid bases with wild-card bases
  • E.g., xxxxxx (10-4-mers) or xxxxxx
    (12-6-mers)
  • A wild-card base
  • Universal: in terms of its ability to form base
    pairs with the other natural DNA/RNA bases.
  • Applications in primers and in probes for
    hybridization
  • Examples
  • The naturally occurring base hypoxanthine, as its
    ribo- or 2'-deoxyribonucleoside
  • 2'-deoxyisoinosine
  • 7-deaza-2'-deoxyinosine
  • 2-aza-2'-deoxyinosine
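
A gapped probe can be modeled as a pattern in which solid positions must match a specific base and wild-card positions match any base. The sketch below finds where such a pattern would hybridize on a single strand. The function name `gapped_matches`, the "-" wild-card symbol, and the pattern ACG--TAC are hypothetical, since the exact probe patterns on the slide did not survive in the transcript.

```python
def gapped_matches(sequence, pattern, wildcard="-"):
    """Return start positions where pattern matches sequence;
    wildcard positions in the pattern match any base."""
    width = len(pattern)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(p == wildcard or p == b for p, b in zip(pattern, window)):
            hits.append(start)
    return hits

# Hypothetical pattern: 6 solid bases and 2 wild-card bases.
target = "TTACGAATACGGACGCCTACAA"
hits = gapped_matches(target, "ACG--TAC")
```

The wild-card positions let a probe span more sequence than its solid bases alone would, which is the property the gapped designs exploit.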

72
Simulation Results
  • Probe map assumptions, for single DNA molecules:
  • Probe location standard deviation: 240 bases
  • Data coverage per probe map: 50x
  • Probe hybridization rate: 30%, and
  • false positive rate of 10 probes per megabase,
    uniformly distributed.
  • Analytical estimation of the average error rate
    in the probe consensus map:
  • Probe location SD: 60 bases
  • False positive rate < 2.4
  • False negative rate < 2.0

73
Simulation Results
UNGAPPED
GAPPED
74
Simulation Results
  • Simulation based on non-random sequences from the
    human genome: 96 blocks of 1 kb (from chromosome
    1) concatenated together, along with its in silico
    restriction map.
  • Error summary for the gapped probe pattern
  • xxx xxx
  • Error count excluding repeats or near repeats:
  • 0.32 bp / 10 kb
  • There is no error due to incorrect
    rearrangements.
  • There is no loss of information at the haplotypic
    level.
  • Assembly failed in 2 of 96 blocks of 1 kb: a 2.1%
    failure rate (out of memory).

75
GENomic conTIG
  • Gentig uses a purely Bayesian Approach.
  • It models all the error processes in the prior.
  • FAST: It starts with a conservative but
    fast pairwise overlap configuration, computed
    efficiently using Geometric Hashing.
  • ACCURATE: It iteratively combines pairs of maps
    or map contigs, while optimizing the likelihood
    score subject to a false-positive constraint.
  • It has special heuristics to handle non-local
    errors.

76
HAPTIG (HAPlotypic conTIG): Candida albicans
FAST, ACCURATE BAYESIAN ALGORITHM
  • The left end of chromosome 1 of the common fungus
    Candida albicans (being sequenced by Stanford).
  • You can clearly see 3 polymorphisms:
  • (A) Fragment 2 is of size 41.19 kb (top) vs.
    38.73 kb (bottom).
  • (B) The 3rd fragment, of size 7.76 kb, is missing
    from the top haplotype.
  • (C) The large fragment in the middle is of size
    61.78 kb vs. 59.66 kb.

77
Lambda DNA with probes
10 μm
78
Fig. A: Four AFM images of lambda DNA with PNA
probes hybridized to the distal recognition site,
located 6,900 bp or 2.28 microns from the end
(green arrow). Non-specifically bound probes are
indicated by the red arrows. Z-scale is +/- 1.5
nm.
79
E. coli
Figure 3. Two optical images of E. coli K12
genomic DNA after restriction digestion with the
6-cutter restriction enzyme XhoI and
hybridization with an 8-mer PNA probe. Bound
probes are indicated by blue arrows and
non-specifically bound probes by the red arrows.
Scale bar shown is 10 microns.
80
Discussions
  • Q&A

81
Answer to Cancer
  • If I know the answer I'll tell you the answer,
    and if I don't, I'll just respond, cleverly.
  • US Secretary of Defense, Mr. Donald Rumsfeld.

82
To be continued
  • Break