Summary of Some Ideas

Slides: 149

Provided by: csN6

Learn more at: https://cs.nyu.edu

Transcript and Presenter's Notes

Title: Summary of Some Ideas

1
Regeneron Seminar 6.4.2004

2
BAC to the Future (mapping, CGH, genetics &
beyond)

Bud Mishra
Courant Inst., Cold Spring Harbor Lab., Tata
Inst. of Fund. Res., Mt. Sinai School of Medicine

3
Cancer
4
A Challenge

At present, description of a recently diagnosed
tumor in terms of its underlying genetic lesions
remains a distant prospect. Nonetheless, we look
ahead 10 or 20 years to the time when the
diagnosis of all somatically acquired lesions
present in a tumor cell genome will become a
routine procedure.
Douglas Hanahan and Robert Weinberg
Cell, Vol. 100, 57-70, 7 Jan 2000

5
Amplifications & Deletions
6
Goals

Spontaneous Somatic Mutations
Common Amplifications and Homozygous Deletions in
the Genome of Human Tumor Cells
Spontaneous Mutations in the Parental Germline
Sporadic Hereditary Diseases
Autism, Juvenile Schizophrenia, Childhood
Neoplasms
Based on a Collection of LCR Probes,
Representative of the Genome
Detailed Chromosomal Positions of the Probes are
assumed unknown and may need to be created ab
initio.

7
Biotechnology
Where we explore various tools of the trade
8
Tools of the Trade: SCISSORS

Type II Restriction Enzyme
Biochemicals capable of cutting the
double-stranded DNA by breaking two -O-P-O-
bridges, one on each backbone
Restriction Site
Corresponds to specific short sequences, e.g. EcoRI:
GAATTC
Naturally occurring protein in bacteria. Defends
the bacterium from invading viral DNA. The bacterium
produces another enzyme that methylates the
restriction sites of its own DNA

9
Tools of the Trade: GLUE

DNA Ligase
Cellular enzyme. Joins two strands of DNA
molecules by repairing phosphodiester bonds
T4 DNA Ligase (E. coli infected with
bacteriophage T4)
Hybridization
Hydrogen bonding between two complementary single
stranded DNA fragments, or an RNA fragment and a
complementary single stranded DNA fragment
results in a double stranded DNA or a DNA-RNA
fragment

10
Tools of the Trade: COPIER

DNA Amplification
Main Ingredients: Insert (the DNA segment to be
amplified), Vector (a cloning vector that
combines with an insert to create a replicon),
Host Organism (usually bacteria).

11
Tools of the Trade: COPIER

PCR (Polymerase Chain Reaction)
Main Ingredients: Primers, Catalysts, Templates,
and the dNTPs.

12
Karyotyping & CGH
Where we examine existing methods to characterize
the Cancer Genome.
13
Karyotyping
14
Karyotyping
15
Karyotypic Analysis
Not enough chromosomes: Turner's Syndrome
Too many chromosomes: Down's Syndrome
Mixed-up pieces (Translocations): Philadelphia Chromosome
Missing pieces or Deletions: Cri-du-chat Syndrome
Other anomalies: Fragile X Syndrome
16
Ploidy Analysis

Compare DNA content of unknown cell population to
DNA content of reference cell population
If amount of DNA differs from the reference the
unknown sample may be aneuploid (or haploid,
triploid, tetraploid, etc.)

17
CGH: Comparative Genomic Hybridization

Equal amounts of biotin-labeled tumor DNA and
digoxigenin-labeled normal reference DNA are
hybridized to normal metaphase chromosomes
The tumor DNA is visualized with fluorescein and
the normal DNA with rhodamine
The signal intensities of the different
fluorochromes are quantitated along the single
chromosomes
The over-and underrepresented DNA segments are
quantified by computation of tumor/normal ratio
images and average ratio profiles

Amplification
Deletion
18
CGH: Comparative Genomic Hybridization
19
CGH: Comparative Genomic Hybridization
20
RDA: Representational Difference Analysis

PCR-based Set-Differencing
Two sets $S_1$ and $S_2$
If $x \in S_1$ then both $x_w$ and $x_c$ have primers.
If $x \in S_2$ then neither $x_w$ nor $x_c$ has primers.
If $x \in S_1 \setminus S_2$ then $x$ undergoes exponential
growth.
If $x \in S_1 \cap S_2$ then $x$ undergoes linear growth.
If $x \in S_2 \setminus S_1$ then $x$ has no growth.

21
SAGE: Serial Analysis of Gene Expression

Three principles underlie the SAGE methodology
A short sequence tag (10-14bp) contains
sufficient information to uniquely identify a
transcript provided that the tag is obtained
from a unique position within each transcript.
Sequence tags can be linked together to form long
serial molecules that can be cloned and
sequenced.
Quantitation of the number of times a particular
tag is observed provides the expression level of
the corresponding transcript.

22
Microarray Based Methods

RNA expression microarray analysis
Analysis of DNA copy number changes using CGH to
microarrayed BACs
Analysis of DNA copy number changes using
microarrayed cDNAs and ESTs

23
Analysis of copy number changes
Where we develop a novel method to find copy
number fluctuations: ROMA & array CGH
24
Microarray Analysis

Representations are reproducible samplings of DNA
populations in which the resulting DNA has a new
format and reduced complexity.
We array probes derived from low complexity
representations of the normal genome
We measure differences in gene copy number
between samples ratiometrically
Since representations have a lower nucleotide
complexity than total genomic DNA, we obtain a
stronger specific hybridization signal relative
to non-specific hybridization and noise

25
Tumor vs. Normal

Copy number can be measured by computing the fold
changes
Yellow: Copy number unchanged
Red: Amplification (More tumor material than
normal)
Green: Deletion (Less tumor material than normal)
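
The color calls on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `call_copy_number` and the cut-off of 0.5 on the log2 ratio are hypothetical choices for the example, not values from the talk.

```python
import math


def call_copy_number(tumor, normal, threshold=0.5):
    """Return (log2 fold change, color call) for one probe.

    tumor, normal: positive hybridization intensities for the two samples.
    threshold: assumed cut-off on |log2 ratio| (hypothetical value).
    """
    log_ratio = math.log2(tumor / normal)
    if log_ratio > threshold:
        return log_ratio, "red"      # more tumor material: amplification
    if log_ratio < -threshold:
        return log_ratio, "green"    # less tumor material: deletion
    return log_ratio, "yellow"       # copy number unchanged


if __name__ == "__main__":
    for t, n in [(400.0, 100.0), (100.0, 100.0), (50.0, 100.0)]:
        print(t, n, call_copy_number(t, n))
```

Working on the log scale makes a two-fold gain and a two-fold loss symmetric around zero, which is why a single threshold can serve both calls.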

26
Sir Ernest Rutherford
"For Mike's sake, Soddy, don't call it
transmutation. They'll have our heads off as
alchemists." Rutherford, winner of the 1908 Nobel
Prize in Chemistry for cataloging alpha and beta
particles

"All science is either physics or stamp
collecting."

27
Low Complexity Representation

Superior Hybridization Kinetics and
Signal-to-Noise Ratio.
Reproducible, Reliable & Consistent.
Can be prepared in large amounts from microscopic
amounts of material.
Parallel representations preserve gene ratios
between samples treated in parallel.

28
MAP (Maximum A Posteriori) Estimation Algorithm
Where we develop a novel algorithm to segment
regions of similar copy number alterations
29
How Representations are Made..
30
BglII Representation (3)
31
Copy Number Fluctuation
32
HMM
33
HMM, finally
Model with a very high degree of freedom, but not
enough data points. Small-sample statistics ⇒
overfitting, convergence to local maxima, etc.
34
HMM, last time

Advantages
Small Number of parameters. Can be optimized by
MAP estimator. (EM has difficulties).
Easy to model deviation from Markovian properties
(e.g., polymorphisms, power-law, Pólya's urn-like
process, local properties of chromosomes, etc.)

We will simply model the number of break-points
by a Poisson process, and lengths of the
aberrational segments by an exponential
process. Two-parameter model: pb, pe
35
A MAP (Maximum A Posteriori) Estimator

The prior depends on two parameters pe and pb.
pe is the probability of a particular probe being
normal.
pb is the average number of intervals per unit
length.

Generalizes HMM

[Figure: observed data with deletions and amplifications, combined with the priors and a noise model]
Goal: Find the most plausible hypothesis of
regional changes and their associated copy numbers

36
Likelihood Function

The µ values of non-global probes are unknown.
We estimate these µ values using the sample mean
for that interval.
Our Bayesian solution maximizes L to yield the
optimal segmentation

37
A dynamic programming algorithm.

Generalizes VITERBI
Extension
Adds a new interval to the end.
Likelihood function can be incrementally computed
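
The "add a new interval to the end" recurrence can be sketched as a small dynamic program. This is a simplified stand-in, not the estimator from the talk: it assumes Gaussian noise with a known `sigma`, uses the sample mean of each interval as its level, and replaces the two-parameter prior (pe, pb) with a single hypothetical `break_penalty` charged per interval.

```python
import math


def segment(values, sigma=0.2, break_penalty=3.0):
    """Split values into intervals of roughly constant level.

    Minimizes  sum of squared residuals / (2 sigma^2)
             + break_penalty * (number of intervals),
    which is a negative log-posterior under the simplifying assumptions
    stated above. Returns a list of (start, end, mean), end exclusive.
    """
    n = len(values)
    # Prefix sums give the cost of any interval in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for k, v in enumerate(values):
        s1[k + 1] = s1[k] + v
        s2[k + 1] = s2[k] + v * v

    def cost(i, j):
        total = s1[j] - s1[i]
        sse = s2[j] - s2[i] - total * total / (j - i)
        return max(sse, 0.0) / (2.0 * sigma ** 2)

    best = [0.0] + [math.inf] * n   # best[j]: optimal score of values[:j]
    back = [0] * (n + 1)            # back[j]: start of the last interval
    for j in range(1, n + 1):
        for i in range(j):          # extend a solution for values[:i] by [i, j)
            score = best[i] + cost(i, j) + break_penalty
            if score < best[j]:
                best[j], back[j] = score, i

    intervals = []
    j = n
    while j > 0:
        i = back[j]
        intervals.append((i, j, (s1[j] - s1[i]) / (j - i)))
        j = i
    return intervals[::-1]


if __name__ == "__main__":
    profile = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
    print(segment(profile))
```

The score of a prefix depends only on where its last interval starts, so it can be built up incrementally; this version runs in O(n^2) time.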

38
A reasonable choice of priors yields good
segmentation.
39
A reasonable choice of priors yields good
segmentation.
40
Sir Ernest Rutherford

If your experiment needs statistics, you ought
to have done a better experiment.

41
Prior Selection: F criterion

For each break we have a T² statistic and the
appropriate tail probability (p-value) calculated
from the distribution of the statistic. In this
case, this is an F distribution.
The best (pe, pb) is the one that leads to the
maximum min p-value.

42
Thought Experiments, Algorithms & Simulations
Where we think about how to assign chromosomal
locations to probes using array hybridization
43
Locations of the Probes
44
Locations of the Probes
45
Mapping Representational Probes

Statistics of inter-probe pair-wise distance
measurements
Estimating distances by hybridization with pools
of clones from a library
Simulation results

46
Sir Ernest Rutherford

"We haven't the money, so we've got to think."

47
Measuring distances

A one-dimensional Buffon's needle problem.
Take two points on a line, and drop unit-length
needles of some color.
The probability that the two points will have
different colors increases monotonically with the
distance between these two points
as the distance increases from 0 to 1, and
attains a fixed value for all distances longer
than 1.
One can generalize by considering
More than two points: P points.
Dropping a small set of bichromatic needles

[Figure: three points p on a line with dropped needles; Distance = 3/6 = 0.5]
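
The single-color version of this thought experiment is easy to check numerically. The sketch below assumes that needle left ends fall uniformly at random (approximately a Poisson process of a given `rate` per unit length) and that a point's "color" is simply whether some needle covers it; the function names and parameter values are hypothetical.

```python
import math
import random


def p_differ(d, rate):
    """Probability that exactly one of two points at distance d is covered
    by a unit-length needle, for Poisson needle starts of the given rate."""
    overlap = min(d, 1.0)   # beyond distance 1 the two points are independent
    return 2.0 * (math.exp(-rate) - math.exp(-rate * (1.0 + overlap)))


def simulate(d, n_needles=100, span=100.0, trials=4000, seed=0):
    """Monte Carlo estimate of the same probability (rate = n_needles / span)."""
    rng = random.Random(seed)
    x, y = span / 2.0, span / 2.0 + d
    differ = 0
    for _ in range(trials):
        starts = [rng.uniform(0.0, span) for _ in range(n_needles)]
        covered_x = any(s <= x <= s + 1.0 for s in starts)
        covered_y = any(s <= y <= s + 1.0 for s in starts)
        differ += covered_x != covered_y
    return differ / trials


if __name__ == "__main__":
    for d in (0.0, 0.25, 0.5, 1.0, 2.0):
        print(d, round(p_differ(d, 1.0), 4), round(simulate(d), 4))
```

Because the curve is strictly increasing on [0, 1], an observed disagreement frequency can be inverted to estimate a distance, but only for points closer than one needle length.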
48
The Experiments
Probes are points
BACs are needles
Hybridization on an array simulates dropping the
bichromatic needles

[Figure: cX coverage subsamples drawn from a High Coverage BAC Library, repeated M times]
49
A Mathematical Problem

A set of P points $\{x_1, x_2, \ldots, x_P\} \subseteq [0, G]$, i.i.d. with
pdf $f(x) = 1/G$ for all $x \in [0, G]$
Distance $d_{i,j} = d(x_i, x_j)$, measured between
two arbitrary points $x_i$ and $x_j$.
Given $O(P^2)$ distances, infer positions.

50
Distance vs Observed
51
Matrix-to-Line

Given a $P \times P$ positive symmetric real-valued
matrix $D$ of measured distances.
The entry $d_{i,j} \sim f(d \mid x)$.
Choose an embedding of the points
$\{x_1, x_2, \ldots, x_P\} \subset [0, G]$,
which maximizes a likelihood function
$\prod_{1 \le i, j \le P} f(|x_i - x_j| \mid d_{i,j})$

52
Bayes Formula
53
Minimizing a Quadratic Cost Function
54
A Physical Model
[Figure: points P1, P2, P3, P4 joined by springs with rest lengths d1,2, d1,3, d1,4, d2,3, d2,4, d3,4]
Mass-less balls connected with springs of
different stiffness
55
Algorithm
Join

Consider only measured distances of length less than $qL$
Examine these distances in increasing order.
$q \in (0,1)$, to be determined by the Chernoff
bounds
Initially, every probe is a singleton contig.
Two operations, Join and Adjust, either combine
smaller contigs or improve an existing contig.

56
Algorithm
Adjust

Join and Adjust locally minimize the
log-likelihood cost function
Local minimum of a weighted sum-of-squares error
function
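
The spring picture suggests one way to find such a local minimum: plain gradient descent on the weighted sum of squared errors between current and measured distances. The sketch below is an assumption-laden illustration, not the Adjust step from the talk; the function name `adjust`, the step size and the iteration count are hypothetical, and the starting layout is assumed to have the probes in the right order.

```python
def adjust(positions, dist, weight=None, steps=2000, lr=0.05):
    """Locally minimize sum_ij w_ij * (|x_i - x_j| - d_ij)^2 by gradient descent.

    positions: initial guess for the probe coordinates.
    dist: {(i, j): measured distance} for the pairs that were measured.
    weight: optional {(i, j): spring stiffness}; defaults to 1.0 per pair.
    """
    weight = weight or {}
    x = list(positions)
    for _ in range(steps):
        grad = [0.0] * len(x)
        for (i, j), d in dist.items():
            delta = x[i] - x[j]
            sign = 1.0 if delta >= 0 else -1.0
            g = 2.0 * weight.get((i, j), 1.0) * (abs(delta) - d) * sign
            grad[i] += g
            grad[j] -= g
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    truth = [0.0, 1.0, 3.0, 6.0]
    measured = {(i, j): abs(truth[i] - truth[j])
                for i in range(4) for j in range(i + 1, 4)}
    print(adjust([0.0, 0.5, 2.0, 4.0], measured))
```

The result is determined only up to translation (and reflection), since the cost depends on pairwise differences alone.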

57
Algorithmic Complexity
58
The Experiments

Outcomes for probe $p_i$
$p_i$ hybridizes to zero BACs.
Outcome B (blank)
$p_i$ hybridizes to at least one red BAC and zero
green BACs.
Outcome R (red)
$p_i$ hybridizes to zero red BACs and at least one
green BAC.
Outcome G (green)
$p_i$ hybridizes to at least one red BAC and at
least one green BAC.
Outcome Y (yellow)
We call these events $i_B$, $i_R$, $i_G$ and $i_Y$
respectively.

59
Hamming Distance

The full experiment consists of M random samples.
The output is a color string for each probe.
$s_j = \langle s_{j,k} \rangle_{k=1}^{M}$ with $s_{j,k} \in \{B, R, G, Y\}$
associated with probe $p_j$
$H_{i,j}$ = number of places where $s_i$ and $s_j$ differ
$C_{i,j}$ = number of places where $s_i$ and $s_j$ are the same but
not blank
$T_{i,j}$ = number of places where $s_i$ and $s_j$ are both blank
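
A small sketch of how the color strings and their pairwise counts could be computed. The function names `outcome` and `compare` are hypothetical; the third count (both blank) is labeled T here to match the event T used on the later probability slides.

```python
def outcome(red_hits, green_hits):
    """Color outcome for one probe in one experiment, given how many
    red-labeled and green-labeled BACs it hybridizes to."""
    if red_hits and green_hits:
        return "Y"   # yellow: at least one of each
    if red_hits:
        return "R"
    if green_hits:
        return "G"
    return "B"       # blank: no BAC covers the probe


def compare(s_i, s_j):
    """Return (H, C, T) for two color strings of equal length M:
    H = places that differ, C = places equal and not blank,
    T = places where both are blank."""
    if len(s_i) != len(s_j):
        raise ValueError("color strings must cover the same experiments")
    h = c = t = 0
    for a, b in zip(s_i, s_j):
        if a != b:
            h += 1
        elif a == "B":
            t += 1
        else:
            c += 1
    return h, c, t


if __name__ == "__main__":
    print(compare("RGYBB", "RGBBY"))   # (2, 2, 1)
```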

60
Notations

$N_f$ = Clones per experiment
$M$ = Experiments
$L$ = Length of a clone
$G$ = Length of a genome
$a = N_f/G$ = Pr[A clone starts at a site]
$c = N_f L/G = a L$ = coverage per experiment
$a' = a_G = a_R = a/2 = c/(2L)$

61
Computing the Probabilities

Probability of Events
$C = (i_G \wedge j_G) \vee (i_R \wedge j_R) \vee (i_Y \wedge j_Y)$
$T = (i_B \wedge j_B)$
$H = \neg (C \vee T)$

62
Computing the Probabilities
63
Computing the Probabilities

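$\Pr(C \mid x \le L) = [1 - 2 e^{-a' L} + 2 e^{-a'(L+x)}]^2 - e^{-2a'(L+x)}$
$\Pr(T \mid x \le L) = e^{-2a'(L+x)}$
$\Pr(H \mid x \le L) = 1 - [1 - 2 e^{-a' L} + 2 e^{-a'(L+x)}]^2$
where $a' = a/2$ is the per-color clone start rate.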
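
These three probabilities can be evaluated directly and checked for consistency. The sketch below assumes that red and green clone starts are independent Poisson processes, each with the per-color rate (half the overall clone start rate), and that the two probes lie a distance x ≤ L apart; the function name `pair_probabilities` and the argument name `a_color` are hypothetical.

```python
import math


def pair_probabilities(x, L, a_color):
    """Return (Pr C, Pr T, Pr H) for two probes at distance x <= L.

    C: same non-blank color, T: both blank, H: colors differ.
    a_color: clone start rate per unit length for ONE color.
    """
    if not 0.0 <= x <= L:
        raise ValueError("this formula only covers 0 <= x <= L")
    # Probability that the two probes agree on coverage by one given color.
    agree_one_color = (1.0 - 2.0 * math.exp(-a_color * L)
                       + 2.0 * math.exp(-a_color * (L + x)))
    p_same = agree_one_color ** 2          # agree on red AND on green
    p_t = math.exp(-2.0 * a_color * (L + x))
    p_c = p_same - p_t
    p_h = 1.0 - p_same
    return p_c, p_t, p_h


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 1.0):
        print(x, pair_probabilities(x, L=1.0, a_color=1.0))
```

The three events partition the sample space, so the values sum to one, and Pr(H) grows with x, which is what lets the Hamming distance serve as a proxy for physical distance.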

64
Computing the Probabilities
65
Final Estimator
66
Chernoff Bound

False Positives: $(d < qL) \wedge (x > L)$
False Negatives: $(x < qL) \wedge (d > L)$

67
Computing the Chernoff Bounds
68
Yeast Mapping
69
Steps in Mapping
70
Data from One Experiment
71
Expectation Maximization
72
Map
73
Local Distances
74
Sequence Validation
75
Sequence Validation
76
Sequence Validation
77
Sir Ernest Rutherford

I have become more and more impressed by the
power of the scientific method of extending our
knowledge of nature.
Experiment, directed by the imagination of either
an individual, or still better of a group of
individuals of varied mental outlook is able to
achieve results which far transcend the
imagination alone of the greatest natural
philosopher.

78
Sir Ernest Rutherford

Experiment without imagination, or imagination
without recourse to experiment, can accomplish
little. But for effective progress, a happy blend
of these powers is necessary

Students
Fang Chen
Jiawu Feng
Ofer Gill
Matthias Heymann
Iuliana Ionita
Venkatesh P. Mysore
Marina Spivak
Bing Sun
Yi (Joey) Zhou
Visitors
Marco Isopi
Carla Piazza
Alberto Policriti
Naomi Silver
Chris Wiggins
Franz Winkler

Principal Investigator
Bud Mishra
Researchers
Marco Antoniotti
Paolo Barbano
Vera Cherepinsky
Raoul-Sam Daruwala
Gilad Lerman
Joe McQuown
Toto Paxia
Archisman Rudra
Nadia Ugel
Alumni
Will Casey
Marc Rejali

80
The End

http://www.cs.nyu.edu/mishra
http://bioinformatics.cat.nyu.edu
Valis, Gene Grammar, NYU MAD, Cell Simulation,

81
Other Ongoing Projects

SINGLE MOLECULE MAPPING
Single Molecule Genomics Optical Mapping,
Optical Sequencing RFLP Haplotyping
(In collaboration with Wisc Funded by NCI)
ARRAY MAPPING
(In collaboration with MIT Funded by NSF ITR)
ARRAY CGH
Microarray-based Genome Mapping--
(In collaboration with NYU Med School CSHL ---
funded by NCI/NIH)
EXPRESSION DATA ANALYSIS
(In collaboration with NYU Biology Med School
funded by NSF MHHI)

82
SINGLE MOLECULE OPTICAL MAPPING
83
Error Sources

Sizing Error
(Bernoulli labeling, absorption cross-section,
PSF)
Partial Digestion
False Optical Sites
Orientation
Spurious molecules, Optical chimerism, Calibration

Image of restriction enzyme digested YAC clone
YAC clone 6H3, derived from human chromosome 11,
digested with the restriction endonuclease Eag I
and Mlu I, stained with a fluorochrome and imaged
by fluorescence microscopy.
84
Optical Mapping: Interplay between Biology and
Computation
85
Y

"From a gene's point of view, reshuffling is a
great restorative.
The Y, in its solitary state, disapproves of such
laxity. Apart from small parts near each tip
which line up with a shared section of the X, it
stands aloof from the great DNA swap. Its genes,
such as they are, remain in purdah as the
generations succeed. As a result, each Y is a
genetic republic, insulated from the outside
world. Like most closed societies it becomes both
selfish and wasteful. Every lineage evolves an
identity of its own which, quite often, collapses
under the weight of its own inborn weaknesses.
Celibacy has ruined man's chromosome."
Steve Jones, Y: The Descent of Men, 2002.

86
Mapping the DAZ locus on Y Chromosome
87
GCP is NP-Complete

Transformation from Hamiltonian Path Problem
restricted to cubic graphs.

Choose p 3/4 k M
88
NP-Completeness

G has a Hamiltonian path
$v_1, v_2, \ldots, v_M$
Then, the admissible placement is
$D_1, D_2, \ldots, D_M$
with at most two intervals $I_j$, $I_{j+1}$
overlapping with k cuts in common.
Conversely, any admissible
placement with a goodness > k induces a
permutation $\pi$ on the indices of the vertices of
G.
$v_{\pi(1)}, v_{\pi(2)}, \ldots, v_{\pi(M)}$ is a Hamiltonian path

[Figure: vertices v1, v2, v3 mapped to data D1, D2, D3 and the Consensus Map]
89
Experiment Design

Relation among the error parameters
$3 b n p / 4 \le k \le n p^4 q / 2$
$\Rightarrow p \ge (3b / (2q))^{1/3}$
Parameter choice for shotgun mapping: make the
partial digestion probability rather high (close
to 1) or the relative sizing error as low as possible, for
instance by using a rare cutter.

90
Contour Plot as a Function of Sizing Error
(x-axis) and Digestion Rate (y-axis)

The calculation is for the human genome, G = 3,300
Mb.
The average molecule length is 5 Mb, with an
overlap of 1 Mb
The average restriction fragment length is 25 kb
For a sizing error of 3 kb, the required
digestion rate is 94%
If the sizing error is reduced to 2 kb, the
required digestion rate drops to 88%
If the sizing error is reduced to 1 kb, the
required digestion rate drops to 80%
(See Mathematica Demo)

91
Gentig's Successes

E. coli
P. falciparum, D. radiodurans, Y. pestis
Rhodobacter sphaeroides, Shigella flexneri
Salmonella enterica
Aspergillus fumigatus
The automated Gentig system is routinely used
to map microbe genomes quickly and effortlessly
by scientists with no quantitative or
computational training.

Shotgun Optical Mapping of Genomes
92
VALIS: vast active living intelligence
system
94
Key Feature of Valis

State of the art of rapid prototyping in
bioinformatics, functional genomics and systems
biology.
Multilanguage Scripting
Data storage
Graphical User interfaces

95
Visual Genome
96
Multi-Scripting

A Valis script can be written in any supported
language
JScript, VBScript, Python, PERL, Lisp, R and
SETL.
All the scripts see the same Valis class
hierarchy.
For example, once a user learns that a Valis
Sequence Object has a method called Input that
will read the sequence from a file, the user can
subsequently use this same primitive from all the
different languages.

97
Advantages

We can take the best from each language
Graph algorithms in SETL
Sockets in Python
Regular Expressions in Perl
AI in Lisp
Statistics in R
..

98
Data Storage

Based on Extended B-Trees
At the lowest level there is a heap of pages
Must correctly keep track of the reference counts
of each record/object to implement value semantics

99
B Tree Indexes

Leaf pages contain data entries, and are chained
(prev & next)
Non-leaf pages contain index entries and direct
searches

Non-leaf Pages
Leaf Pages
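
The two page roles can be illustrated with a toy, single-level index in Python. This is a sketch only and does not reflect how Valis stores data: the class names `Leaf` and `Index`, the page size, and the restriction to one index level over a non-empty, statically built key set are all assumptions made for brevity.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Leaf:
    entries: List[Tuple[int, str]]                    # sorted (key, value) data entries
    prev: Optional["Leaf"] = field(default=None, repr=False, compare=False)
    next: Optional["Leaf"] = field(default=None, repr=False, compare=False)


@dataclass
class Index:
    keys: List[int]        # keys[i] is the smallest key stored in children[i + 1]
    children: List[Leaf]


def build(items, page_size=2):
    """Pack sorted items into chained leaf pages under one index page."""
    items = sorted(items)
    leaves = [Leaf(items[i:i + page_size]) for i in range(0, len(items), page_size)]
    for left, right in zip(leaves, leaves[1:]):
        left.next, right.prev = right, left
    return Index([leaf.entries[0][0] for leaf in leaves[1:]], leaves)


def find_leaf(root, key):
    """The index page directs the search to the one leaf that may hold key."""
    return root.children[bisect_right(root.keys, key)]


def range_scan(root, lo, hi):
    """Descend once, then follow the next pointers along the leaf chain."""
    out = []
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k, v in leaf.entries:
            if k > hi:
                return out
            if k >= lo:
                out.append((k, v))
        leaf = leaf.next
    return out


if __name__ == "__main__":
    root = build([(5, "e"), (1, "a"), (3, "c"), (7, "g"), (9, "i")])
    print(range_scan(root, 3, 7))
```

Chaining the leaves is what makes range queries cheap: after one descent through the index, the scan never has to go back up.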
100
Hardware

Although Valis is designed to be used on
workstations, we can run the computation-
intensive processes on Beowulf computing
servers. This cluster has
16 compute nodes connected via Gigabit Ethernet
and a low-latency, high-speed network from
Dolphinics.
Cluster nodes are dual 2.4GHz Intel Xeon
processors with 4GB of memory.
It is arranged in a 3D torus topology allowing
each node to communicate directly with its three
nearest neighbors.
The disk storage capacity for hosting our
databases is about one Terabyte.
The operating system is Linux and we use high
performance MPI libraries from Scali.

101
Visualization

Once the processing is completed, it is very
important to be able to quickly visualize the
results.
For this reason Valis provides numerous
visualization tools that allow a user to quickly
display
sequences, maps,
microarray data,
tables,
graphs and annotations.
These widgets can be customized from the scripts.

102
Valis Demo

Bioinformatics 1.1, 1.2, 1.3
Systems Biology Simple Pathway 2.1, 2.2, 2.3
Systems Biology Apoptosis 2.4
Simpathica
XS-System
BioWave
BioSim
NYUMAD

103
SIMPATHICA: Systems Biology
How much of reasoning about biology can be
automated?
104
Why do we need a tool?
We claim that, by drawing upon mathematical
approaches developed in the context of dynamical
systems, kinetic analysis, computational theory
and logic, it is possible to create powerful
simulation, analysis and reasoning tools for
working biologists to be used in deciphering
existing data, devising new experiments and
ultimately, understanding functional properties
of genomes, proteomes, cells, organs and
organisms.
Simulate Biologists! Not Biology!!
105
Reasoning and Experimentation
106
Simpathica is a modular system
Canonical Form

Characteristics
Predefined Modular Structure
Automated Translation from Graphical to
Mathematical Model
Scalability

107
Glycolysis
Glycogen
P_i
Glucose-1-P
Glucose
Phosphorylase a
Phosphoglucomutase
Glucokinase
Glucose-6-P
Phosphoglucose isomerase
Fructose-6-P
Phosphofructokinase
108
Reaction Scheme for Wnt Signaling.
The reaction steps of the Wnt pathway are
numbered 1 to 19. Protein complexes are denoted
by the names of their components. Phosphorylated
components are marked by an asterisk.
Single-headed solid arrows characterize
irreversible reactions. Double-headed arrows
denote binding equilibria. Blue arrows mark
reactions that have only been taken into account
when studying the effect of high axin
concentrations.
109
Broken arrows represent activation of Dsh by the
Wnt ligand (step 1), Dsh-mediated initiation of
the release of GSK3β from the destruction complex
(step 3), and APC-mediated degradation of axin
(step 15). The broken arrows indicate that the
components mediate but do not participate
stoichiometrically in the reaction scheme. The
irreversible reactions 2, 4, 5, 9-11, and 13 are
unimolecular, and reactions 6, 7, 8, 16, and 17
are reversible binding steps.
110
Steady State Concentration
111
β-catenin degradation
112
Wnt Demo

Systems Biology Wnt Pathway 3.1, 3.2
SimpathicaA

113
The Cell Cycle
G1
start
cell division
Cdk
Cdk
Cdk
Cyclin
S
M (anaphase)
APC
APC
finish
G2
M (metaphase)
114
Cyclin B/Cdk and Cdh1/APC

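$d[\mathrm{CycB}]/dt = k_1 - (k_2' + k_2''[\mathrm{Cdh1}])[\mathrm{CycB}]$
$d[\mathrm{Cdh1}]/dt = (k_3' + k_3'' A)(1 - [\mathrm{Cdh1}])/(J_3 + 1 - [\mathrm{Cdh1}]) - k_4 m [\mathrm{CycB}][\mathrm{Cdh1}]/(J_4 + [\mathrm{Cdh1}])$

A pair of nonlinear ODEs (ordinary differential
equations) describing the biochemical reactions
at the center.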
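
A pair of ODEs like this can be integrated with a few lines of forward Euler. The sketch below is illustrative rather than the simulator used in the talk: the equations are read as dCycB/dt = k1 - (k2' + k2''·Cdh1)·CycB and dCdh1/dt = (k3' + k3''·A)(1 - Cdh1)/(J3 + 1 - Cdh1) - k4·m·CycB·Cdh1/(J4 + Cdh1), and all rate constants, the cell mass `m`, the activator level `A`, and the initial state are assumed values chosen for the example, not taken from the slides.

```python
def simulate_cell_cycle(m, A=0.0, cycb0=0.0, cdh1_0=1.0, t_end=50.0, dt=0.001):
    """Forward-Euler integration of the CycB / Cdh1 antagonism.

    m: cell mass (scales Cdh1 inactivation by CycB); A: activator of Cdh1.
    Returns the final (CycB, Cdh1) pair.
    """
    # Illustrative rate constants (assumptions, not from the slides).
    k1, k2p, k2pp = 0.04, 0.04, 1.0
    k3p, k3pp, k4 = 1.0, 10.0, 35.0
    J3, J4 = 0.04, 0.04

    cycb, cdh1 = cycb0, cdh1_0
    for _ in range(int(round(t_end / dt))):
        # Synthesis minus Cdh1-dependent degradation of cyclin B.
        d_cycb = k1 - (k2p + k2pp * cdh1) * cycb
        # Activation minus CycB-dependent inactivation of Cdh1.
        d_cdh1 = ((k3p + k3pp * A) * (1.0 - cdh1) / (J3 + 1.0 - cdh1)
                  - k4 * m * cycb * cdh1 / (J4 + cdh1))
        cycb += dt * d_cycb
        cdh1 += dt * d_cdh1
    return cycb, cdh1


if __name__ == "__main__":
    print(simulate_cell_cycle(m=0.3))
```

With these assumed values and a small mass, a run started with Cdh1 fully active settles into a state with high Cdh1 and low CycB. A fixed-step Euler scheme is adequate here only because the step is small relative to the fastest rate.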

115
Simulation of Yeast Cell Cycle
116
Simulation of Yeast Cell Cycle
117
Simulation of Yeast Cell Cycle
118
The Natural Language Interface
119
Story generation

Temporal Logic formulae can be rendered in
English.
Temporal Logic formulae can be generated
automatically (with care).
Each formula can be tested against a set of
datasets; differences can then be noted.
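
Testing a formula against several datasets can be sketched with three small operators over a finite time series. This is not the Simpathica implementation: the operator functions, the dataset names and all numbers are hypothetical, and `until` is given the "strong" reading (the releasing condition must actually occur).

```python
from typing import Callable, Dict, Sequence

State = Dict[str, float]
Pred = Callable[[State], bool]


def eventually(trace: Sequence[State], pred: Pred) -> bool:
    """True if pred holds at some time point of the trace."""
    return any(pred(s) for s in trace)


def always(trace: Sequence[State], pred: Pred) -> bool:
    """True if pred holds at every time point of the trace."""
    return all(pred(s) for s in trace)


def until(trace: Sequence[State], hold: Pred, release: Pred) -> bool:
    """True if hold stays true at every point before release first holds."""
    for state in trace:
        if release(state):
            return True
        if not hold(state):
            return False
    return False   # release never happened


# Made-up time series standing in for two simulated datasets.
datasets = {
    "wild type": [{"CDH1": 1.0, "CYCB": 0.1},
                  {"CDH1": 0.8, "CYCB": 0.3},
                  {"CDH1": 0.2, "CYCB": 0.9}],
    "mutant":    [{"CDH1": 1.0, "CYCB": 0.1},
                  {"CDH1": 0.9, "CYCB": 0.2}],
}


def crossed(s: State) -> bool:
    return s["CDH1"] <= s["CYCB"]


if __name__ == "__main__":
    for name, trace in datasets.items():
        print(name, "eventually CDH1 <= CYCB:", eventually(trace, crossed))
```

Running the same formula over every dataset and tabulating the truth values gives the raw material for a report of differences between datasets.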

120
Cell Cycle Story Generation Results (HTML
rendering)
Report on "Test Experiment Tyson WT, 1 Mutant, 2
Mutants."
RESULTS
The results refer to the
following datasets. The first dataset is named
"Ian's Experiment/Tyson Yeast Dataset WT". The
second dataset is named "Ian's Experiment/Tyson
Yeast Dataset Mut1". The third dataset is named
"Ian's Experiment/Tyson Yeast Dataset mut2".

CDH1 less than or equal to 1.0071783 will
always hold until CDH1 activates CYCB, is true
in the first dataset, is true in the second
dataset, and is false in the third dataset.
CDH1 represses CYCB implies CYCB is greater than
or equal to 0.65, is false in the first
dataset, is true in the second dataset, and is
true in the third dataset.
eventually, CDH1 is less than or equal to CYCB,
is false in the first dataset, is true in the
second dataset, and is true in the third
dataset.

121
Genomics: Large Segmental Duplications
122
Recent Segmental Duplications
Human

3.5%-5% of the human genome is found to contain
segmental duplications, with length > 5 kb or 1 kb,
identity > 90%.
August, 2001 assembly,
Bailey, et al. 2002.
April, 2003 assembly,
Cheung, et al. 2003.
These duplications are estimated to have emerged
about 40Mya under neutral assumption.
The duplications are mostly interspersed
(non-tandem), and happen both inter- and
intra-chromosomally.

From Bailey, et al. 2002
123
Recent Segmental Duplications
Mouse

1.2% of the mouse genome is found to contain
segmental duplications, with length > 5 kb,
identity > 90%.
February, 2003 mouse assembly,
Cheung, et al. 2003.
These duplications are estimated to have emerged
about 25Mya under neutral assumption.
The duplications happen both inter- and
intra-chromosomally.

From Cheung, et al. 2003
124
Statistical Analysis: Duplication Flanking
Sequences

What are the molecular mechanisms that caused the
recent segmental duplications in the human and
mouse genomes?
Thermodynamic instability in the DNA sequences
Recombination between homologous repeat elements
Other unknown mechanisms.

125
Thermodynamics
126
Hypotheses
Scenario 1 Repeat-Mediated Homologous
recombination
Scenario 2 Preferential Repeat Insertion after
Duplication
Scenario 3 Artificial Boundary Effect from
Duplication Mapping
Duplicated segment
Duplicated segment
Duplicated segment
Overrepresentation of repeats in the flanking
regions
127
The Model
128
The Mathematical Model
$h_1$ = proportion of duplications by repeat recombination
$h_1^{++}$ = proportion of duplications by recombination of the specific repeat
$h_1^{--}$ = proportion of duplications by recombination of other repeats
$h_0$ = proportion of duplications by other repeat-unrelated mechanism
$h_0^{++}$ = proportion of $h_0$ with common specific repeat in the flanking regions
$h_0^{+-}$ = proportion of $h_0$ with no common specific repeat in the flanking regions
$h_0^{--}$ = proportion of $h_0$ with no specific repeat in the flanking regions
$\alpha$ = mutation rate in duplicated sequences
$\beta$ = insertion rate of the specific repeat
$\gamma$ = mutation rate in the specific repeat
$\delta$ = divergence level of duplications
$\epsilon$ = divergence interval of duplications
129
Model Validation
[Figure: panels for Alu and L1 plotting $f^{--}$, $f^{+-}$ and $f^{++}$ against Diversity]

The model parameters ($\alpha_{Alu}$, $\beta_{Alu}$, $\gamma_{Alu}$, $\alpha_{L1}$, $\beta_{L1}$,
$\gamma_{L1}$) are estimated from the reported mutation and
insertion rates in the literature.
The relative strengths of the alternative
hypotheses can be estimated by model fitting to
the real data.
$h_1^{Alu} = 0.76$, $h_1^{++,Alu} = 0.3$; $h_1^{L1} = 0.76$, $h_1^{++,L1} = 0.35$.

130
ChIP-Chip Analysis
131
Details (math)

Idea in a nutshell (assume symmetric data)
Throw out genes which deviate significantly at
various scales
Stop at exhausted scales (cubes)
Threshold in stopping cubes by estimating
Cs(yx)
Average over shifted grids

132
Procedure of Algorithm

Recall Fixed dyadic grid along L
Compute FQ (also fQ, ßQ,s Q) in top-down alg
Stop at an interval if either

133
Normalization.

Recall setting (simplified): Data = N × 2 matrix of
log EVs
Problem: systematic variation. Different EVs
are recorded for the same amount of mRNA.
Normalization: removal of variation to allow
balanced comparisons

134
Normalization (continues)

Related stat. terminology conditional mean
estimation previous terminology
Related math problem Construct a graph (or
chord-arc curve) a strip around it
Approach to solve math problem Combine ideas of
multiscale curve/graph constructions (Jones,
David and Semmes, L) with the multistrip
construction before

135
ChIP-Chip Experiments I
136
ChIP-Chip Experiments II
137
ChIP-Chip Experiments III
138
CARTwheels: Redescription
139
What is redescription?

Shift of vocabulary from one language (descriptor
family) to another to describe the same entity
Descriptor is any meaningful way of defining a
subset within a universal set of entities
Set theoretic operations used on basic
descriptors to define derived descriptors
Evaluated on the basis of Jaccard's coefficient
$(A \Leftrightarrow B) = |A \cap B| / |A \cup B|$
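
Jaccard's coefficient is the size of the intersection of two descriptor sets divided by the size of their union, which is a one-liner on Python sets. The sketch below is illustrative only: the gene names are made up, and returning 1.0 for two empty descriptors is an assumed convention.

```python
def jaccard(a, b):
    """Jaccard's coefficient |A ∩ B| / |A ∪ B| between two descriptors,
    each given as the collection of entities it selects."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0   # assumed convention: two empty descriptors match exactly
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical descriptors over a universe of genes.
    heat_shock_up = {"g1", "g2", "g3"}
    go_stress_response = {"g2", "g3", "g4"}
    print(jaccard(heat_shock_up, go_stress_response))   # 0.5
```

A coefficient of 1 means the two descriptors pick out exactly the same entities (an exact redescription); values below 1 quantify how inexact the redescription is.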

140
Why Redescribe?

Allows feature construction
Can handle any kind of data in terms of
descriptors; no data-specific mining required
Can find commonalities and differences between
various descriptors/descriptor families at the
same time
Can look for stories using a series of inexact
redescriptions

141
CARTwheels algorithm for redescription
142
CARTwheels algorithm for redescription
143
Implementation details

Simplified version of CARTwheels algorithm used
to speed up the process
Algorithm implemented in C on UNIX
Visualization implemented in Java on UNIX
Interacts with Postgres based database to extract
data/descriptors

144
Implementation details: descriptors used

Experimental (microarray) data for yeast from
Gasch et al. Descriptors constructed of the form
>, <
9 different stresses used from Gasch et al. data
GO category assignments for genes (biological
process, cellular component, molecular function)