Title: Design and Optimization of Universal DNA Arrays
1 Design and Optimization of Universal DNA Arrays
- Ion Mandoiu
- Computer Science & Engineering Department
- University of Connecticut
- http://www.engr.uconn.edu/ion/
2 DNA Arrays
- Exploit Watson-Crick complementarity to simultaneously perform a large number of substring tests
- Used in a variety of high-throughput genomic analyses
- - Transcription (gene expression) analysis
- - Single Nucleotide Polymorphism (SNP) genotyping
- - Alternative splicing, ChIP-on-chip, tiling arrays, genomic-based species identification, point-of-service diagnosis, ...
- Common array formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide
3 SNP Genotyping
- Genome variation: 0.1% of the DNA differs from one individual to another
- 80% of the variation is represented by Single Nucleotide Polymorphisms (SNPs)
- - 2 possible nucleotides (alleles) for each SNP
- SNP genotyping: determining the alleles present at the SNP sites
- Highest throughput for SNP genotyping is achieved by high-density DNA microarrays based on direct hybridization
4 SNP genotyping via direct hybridization
[Figure: a labeled sample is hybridized to an array with 2 probes/SNP]
- SNP1 with alleles T/G
- SNP2 with alleles A/G
5 Universal DNA Arrays
- "Programmable" arrays
- - Array consists of application-independent oligonucleotides
- - Analysis carried out by a sequence of reactions involving application-specific primers
- Flexible AND cost effective
- Universal array architectures: tag arrays, APEX arrays, SBE/SBH arrays, ...
6 Overview
- Tag Array Design
- - Tag Set Design
- - Tag Assignment Algorithms
- SBE/SBH Assays
- - Decoding and Multiplexing Algorithms
- Conclusions
7 SNP Genotyping with Tag Arrays
[Figure: reporter probes (tag + primer), single-base extension, and antitags on the array]
1. Mix reporter probes with unlabeled genomic DNA
2. Solution phase hybridization
3. Single-Base Extension (SBE)
4. Solid phase hybridization
8 Tag Array Advantages
- Cost effective
- - Same array used in many analyses → can be mass produced
- Easy to customize
- - Only need to synthesize new set of reporter probes
- Reliable
- - Solution phase hybridization better understood than hybridization on solid support
9 Tag Set Design Problem
[Figure: tags t1, t2 and their antitags]
- (H1) Tags hybridize strongly to complementary antitags
- (H2) No tag hybridizes to a non-complementary antitag
- (H3) Tags do not cross-hybridize to each other
Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)
10 Hybridization Models
- Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state
- 2-4 rule
- - Tm = 2 (#As and Ts) + 4 (#Cs and Gs)
- More accurate models exist, e.g., the nearest-neighbor model
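The 2-4 rule above can be written as a one-line estimate. The sketch below is only an illustration; the function name `melting_temp_2_4` is made up for this example and the result is a rough estimate in degrees Celsius, not a replacement for nearest-neighbor models.

```python
def melting_temp_2_4(seq: str) -> int:
    """Estimate Tm with the 2-4 rule: 2 per A/T plus 4 per C/G."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("C") + seq.count("G")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return 2 * at + 4 * gc


print(melting_temp_2_4("CCAGATT"))  # 4 A/T and 3 C/G -> 20
```

Halving this value gives the string weight used by the c-token model (weight 1 for A/T, 2 for C/G).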
11 Hybridization Models (contd.)
- Hamming distance model, e.g., [Marathe et al. 01]
- - Models rigid DNA strands
- LCS/edit distance model, e.g., [Torney et al. 03]
- - Models infinitely elastic DNA strands
- c-token model [Ben-Dor et al. 00]
- - Duplex formation requires formation of nucleation complex between perfectly complementary substrings
- - Nucleation complex must have weight ≥ c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)
12 c-h Code Problem
- c-token: left-minimal DNA string of weight ≥ c, i.e.,
- - w(x) ≥ c
- - w(x') < c for every proper suffix x' of x
- A set of tags is a c-h code if
- - (C1) Every tag has weight ≥ h
- - (C2) Every c-token is used at most once
c-h Code Problem [Ben-Dor et al. 00]: Given c and h, find maximum cardinality c-h code
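The two definitions above translate directly into checks. This is a minimal sketch under the stated weights (A/T = 1, C/G = 2); the helper names `is_c_token`, `c_tokens`, and `is_c_h_code` are hypothetical, and (C2) is read here as forbidding repeats both across tags and within a tag.

```python
from typing import Iterable, List

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def weight(s: str) -> int:
    return sum(WEIGHT[b] for b in s)


def is_c_token(x: str, c: int) -> bool:
    # The longest proper suffix x[1:] is the heaviest proper suffix,
    # so it is enough to check that one.
    return weight(x) >= c and weight(x[1:]) < c


def c_tokens(tag: str, c: int) -> List[str]:
    """For each end position, the shortest suffix of weight >= c (if any)."""
    tokens = []
    for end in range(1, len(tag) + 1):
        total = 0
        for start in range(end - 1, -1, -1):
            total += WEIGHT[tag[start]]
            if total >= c:
                tokens.append(tag[start:end])
                break
    return tokens


def is_c_h_code(tags: Iterable[str], c: int, h: int) -> bool:
    seen = set()
    for tag in tags:
        if weight(tag) < h:
            return False  # violates (C1)
        for tok in c_tokens(tag, c):
            if tok in seen:
                return False  # violates (C2)
            seen.add(tok)
    return True
```

For example, with c = h = 4 the set {AAAA, CC} passes both checks, while {AAAA, AAAAT} fails because both tags contain the c-token AAAA.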
13 Algorithms for c-h Code Problem
- [Ben-Dor et al. 00] approximation algorithm based on DeBruijn sequences
- Alphabetic tree search algorithm
- - Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags
- - Easily modified to handle various combinations of constraints
- [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming
- - Practical runtime using Garg-Konemann approximation and LP-rounding
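The lexicographic enumeration described above can be sketched as a brute-force greedy. This is not the authors' implementation: it assumes fixed-length candidate tags, enumerates all of them without the prefix pruning a real tree search would use, and the names `greedy_tags` and `c_tokens` are illustrative.

```python
from itertools import product
from typing import List, Set

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def c_tokens(tag: str, c: int) -> List[str]:
    """For each end position, the shortest suffix of weight >= c (if any)."""
    tokens = []
    for end in range(1, len(tag) + 1):
        total = 0
        for start in range(end - 1, -1, -1):
            total += WEIGHT[tag[start]]
            if total >= c:
                tokens.append(tag[start:end])
                break
    return tokens


def greedy_tags(length: int, c: int, h: int) -> List[str]:
    used: Set[str] = set()
    tags: List[str] = []
    # product over a sorted alphabet yields strings in lexicographic order
    for letters in product("ACGT", repeat=length):
        tag = "".join(letters)
        if sum(WEIGHT[b] for b in tag) < h:
            continue  # too light to be a tag
        toks = c_tokens(tag, c)
        if len(set(toks)) < len(toks) or used.intersection(toks):
            continue  # a c-token would be used twice
        used.update(toks)
        tags.append(tag)
    return tags
```

Extra constraints (for example a maximum tag weight) can be added as further `continue` tests inside the loop, which is the sense in which the search is easy to modify.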
14 Token Content of a Tag
- c=4
- Tag: CCAGATT
Tag → sequence of c-tokens
End pos: 2, 3, 4, 5, 6, 7
c-token: CC → CCA → CAG → AGA → GAT → GATT
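The token sequence of this example can be reproduced with a short scan: for every end position, take the shortest suffix whose weight reaches c. The function name `token_content` is made up for this sketch; positions are 1-based as on the slide.

```python
from typing import List, Tuple

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def token_content(tag: str, c: int) -> List[Tuple[int, str]]:
    """Return (end position, c-token) pairs for a tag."""
    result = []
    for end in range(1, len(tag) + 1):
        total = 0
        # grow the suffix leftwards until its weight reaches c
        for start in range(end - 1, -1, -1):
            total += WEIGHT[tag[start]]
            if total >= c:
                result.append((end, tag[start:end]))
                break
    return result


print(token_content("CCAGATT", 4))
# [(2, 'CC'), (3, 'CCA'), (4, 'CAG'), (5, 'AGA'), (6, 'GAT'), (7, 'GATT')]
```

No token ends at position 1 because the single letter C has weight 2, which is below c = 4.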
15 Layered c-token graph for length-l tags
[Figure: layered graph with source s and sink t; layers c/2, (c/2)+1, ..., l-1, l; c-tokens c1, ..., cN in each layer]
16 Integer Program Formulation [MPT05]
- Maximum integer flow problem w/ set capacity constraints
- O(hN) constraints & variables, where N = #c-tokens
17 Number of c-tokens
- W = A or T, S = C or G
- G_n = #strings of weight n
- - G_1 = 2, G_2 = 6, G_n = 2G_{n-2} + 2G_{n-1}
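The recurrence can be checked numerically. In the sketch below, `num_strings` implements the recurrence with the conventions G_0 = 1 (empty string) and G_n = 0 for n < 0. The closed form in `num_c_tokens` is not from the slides: it is derived here from the c-token definition (a token starting with W must have weight exactly c, one starting with S has weight c or c+1) and is cross-checked against brute-force enumeration.

```python
from functools import lru_cache
from itertools import product

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


@lru_cache(maxsize=None)
def num_strings(n: int) -> int:
    """G_n: number of DNA strings of weight n."""
    if n < 0:
        return 0
    if n == 0:
        return 1
    # last letter is A/T (weight 1) or C/G (weight 2)
    return 2 * num_strings(n - 1) + 2 * num_strings(n - 2)


def num_c_tokens(c: int) -> int:
    """Derived count (assumption of this sketch): 4*G_{c-1} + 2*G_{c-2}."""
    return 4 * num_strings(c - 1) + 2 * num_strings(c - 2)


def brute_force_c_tokens(c: int) -> int:
    count = 0
    for length in range(1, c + 1):  # a c-token has at most c letters
        for s in product("ACGT", repeat=length):
            w = sum(WEIGHT[b] for b in s)
            if w >= c and w - WEIGHT[s[0]] < c:
                count += 1
    return count


print([num_strings(n) for n in range(1, 5)])  # [2, 6, 16, 44]
```

Comparing `num_c_tokens(c)` with `brute_force_c_tokens(c)` for small c is a quick sanity check on the derived formula.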
18 Number of c-tokens
19 Packing LP Formulation
20 Garg-Konemann Algorithm
1. x ← 0; y ← δ // y_i are variables of the dual LP
2. Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i
3. While weight(p) < 1 do
- M ← max_i |p ∩ V_i|
- x_p ← x_p + 1/M
- For every i, y_i ← y_i (1 + ε |p ∩ V_i| / M)
- Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i
4. For every p, x_p ← x_p / (1 - log_{1+ε} δ)
[GK98] The algorithm computes a factor (1-ε)^2 approximation to the optimal LP solution with (N/ε) log_{1+ε} N shortest path computations
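The multiplicative-weights loop above can be illustrated on a tiny packing LP. This sketch makes two simplifying assumptions that are not in the slides: the paths are given as an explicit list of columns (so the shortest-path computation is replaced by a minimum over that list), and δ is set to (1+ε)((1+ε)m)^(-1/ε), the value commonly used in the Garg-Konemann analysis for m constraints. The name `garg_konemann` and the column encoding are illustrative.

```python
import math
from typing import List, Sequence


def garg_konemann(columns: Sequence[Sequence[int]], eps: float) -> List[float]:
    """Approximately maximize sum(x) subject to, for each constraint i,
    sum_p columns[p][i] * x[p] <= 1. Every column needs a positive entry."""
    m = len(columns[0])
    delta = (1 + eps) * ((1 + eps) * m) ** (-1.0 / eps)  # assumed choice
    y = [delta] * m                  # dual variables
    x = [0.0] * len(columns)         # primal variables

    def length(p: int) -> float:
        return sum(a * yi for a, yi in zip(columns[p], y))

    while True:
        p = min(range(len(columns)), key=length)  # stands in for shortest path
        if length(p) >= 1:
            break
        big_m = max(columns[p])
        x[p] += 1.0 / big_m
        y = [yi * (1 + eps * a / big_m) for a, yi in zip(columns[p], y)]

    scale = 1 - math.log(delta, 1 + eps)  # final scaling for feasibility
    return [xp / scale for xp in x]


cols = [[1, 0], [0, 1], [1, 1]]  # column p lists |p ∩ V_i| for each i
x = garg_konemann(cols, 0.1)
print(sum(x))
```

In the tag design setting the columns are never listed explicitly; each iteration instead finds a minimum weight s-t path in the layered c-token graph.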
21 LP Based Tag Set Design
- Run Garg-Konemann and store the minimum weight paths in a list
- Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags
- Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags
22 Periodic Tags [MT05]
- Key observation: c-token uniqueness constraint in c-h code formulation is too strong
- - A c-token should not appear in two different tags, but can be repeated in a tag
- A tag t is called periodic if it is a prefix of the infinite repetition of some string, called its "period"
- - Periodic strings make best use of c-tokens
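A periodic tag is fully determined by its period and its length. The helpers below are a small sketch with made-up names (`periodic_tag`, `shortest_period`); they only illustrate the definition and do not check any hybridization constraint.

```python
from itertools import cycle, islice


def periodic_tag(period: str, length: int) -> str:
    """Prefix of the given length of the infinite repetition of `period`."""
    if not period:
        raise ValueError("period must be non-empty")
    return "".join(islice(cycle(period), length))


def shortest_period(tag: str) -> str:
    """Shortest string whose infinite repetition has `tag` as a prefix."""
    for k in range(1, len(tag) + 1):
        if all(tag[i] == tag[i - k] for i in range(k, len(tag))):
            return tag[:k]
    return tag


print(periodic_tag("ACG", 8))       # ACGACGAC
print(shortest_period("ACGACGAC"))  # ACG
```

Because the tag keeps cycling through the same substrings, it touches only the few c-tokens that occur along its period, which leaves more c-tokens available for other tags.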
23 c-token factor graph, c=4 (incomplete)
[Figure: vertices shown include CC, AAG, AAC, AAAA, AAAT]
24 Vertex-disjoint Cycle Packing Problem
- Given directed graph G, find maximum number of vertex disjoint directed cycles in G
- [MT 05] APX-hard even for regular directed graphs with in-degree and out-degree 2
- - h-c/2+1 approximation factor for tag set design problem
- [Salavatipour and Verstraete 05]
- - Quasi-NP-hard to approximate within Ω(log^{1-ε} n)
- - O(n^{1/2}) approximation algorithm
25 Cycle Packing Algorithm
- Construct c-token factor graph G
- T ← Ø
- For all cycles C defining periodic tags, in increasing order of cycle length,
- - Add to T the tag defined by C
- - Remove C from G
- Perform an alphabetic tree search and add to T tags consisting of unused c-tokens
- Return T
- Gives an increase of over 40% in the number of tags compared to previous methods
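The greedy loop over cycles can be sketched on a generic directed graph. The code below is an illustration only: it works on an arbitrary adjacency dictionary rather than the c-token factor graph, it skips the final tree search, and the names `find_cycle` and `greedy_cycle_packing` as well as the `max_length` cutoff are assumptions of this example.

```python
from typing import Dict, List, Optional, Set

Graph = Dict[str, Set[str]]


def find_cycle(graph: Graph, alive: Set[str], start: str,
               length: int) -> Optional[List[str]]:
    """DFS for a simple directed cycle with exactly `length` vertices
    through `start`, using only vertices that are still alive."""
    path = [start]
    on_path = {start}

    def dfs(v: str) -> bool:
        if len(path) == length:
            return start in graph.get(v, set())  # can we close the cycle?
        for w in sorted(graph.get(v, set())):
            if w in alive and w not in on_path:
                path.append(w)
                on_path.add(w)
                if dfs(w):
                    return True
                path.pop()
                on_path.remove(w)
        return False

    return list(path) if dfs(start) else None


def greedy_cycle_packing(graph: Graph, max_length: int) -> List[List[str]]:
    alive = set(graph)
    cycles = []
    for length in range(1, max_length + 1):  # shortest cycles first
        for v in sorted(graph):
            if v not in alive:
                continue
            cyc = find_cycle(graph, alive, v, length)
            if cyc is not None:
                cycles.append(cyc)
                alive.difference_update(cyc)  # remove the cycle from G
    return cycles


g = {"a": {"b"}, "b": {"a", "c"}, "c": {"d"}, "d": {"e"}, "e": {"c"}}
print(greedy_cycle_packing(g, 3))  # [['a', 'b'], ['c', 'd', 'e']]
```

Taking short cycles first consumes fewer vertices per selected cycle, which is the intuition behind processing cycles in increasing order of length.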
26 Experimental Results
27 Antitag-to-Antitag Hybridization
- Additional constraint: antitags do not cross-hybridize, including self
- Formalization in c-token hybridization model:
- - (C3) No two tags contain complementary substrings of weight ≥ c
- Cycle packing and tree search extend easily
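Constraint (C3) can be tested pairwise. The sketch below assumes that "complementary" means Watson-Crick reverse complement, and it relies on the observation that any substring of weight ≥ c ends with a c-token of its tag, so it is enough to look up the reverse complement of each c-token. The names `reverse_complement`, `c_tokens`, and `violates_c3` are illustrative.

```python
from typing import List

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]


def c_tokens(tag: str, c: int) -> List[str]:
    """For each end position, the shortest suffix of weight >= c (if any)."""
    tokens = []
    for end in range(1, len(tag) + 1):
        total = 0
        for start in range(end - 1, -1, -1):
            total += WEIGHT[tag[start]]
            if total >= c:
                tokens.append(tag[start:end])
                break
    return tokens


def violates_c3(tag_a: str, tag_b: str, c: int) -> bool:
    """True if tag_a and tag_b contain complementary substrings of
    weight >= c. Pass the same tag twice to test self-complementarity."""
    return any(reverse_complement(tok) in tag_b for tok in c_tokens(tag_a, c))


print(violates_c3("AACC", "GGTT", 4))  # True: GGTT is the reverse complement
print(violates_c3("AAAA", "CCCC", 4))  # False
```

Calling `violates_c3(t, t, c)` covers the "including self" case, for example for a tag such as ACGT that contains the self-complementary substring CG.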
28 Results w/ Extended Constraints
29 More Hybridization Constraints
[Figure: tags t1, t2]
- Enforced during tag assignment by
- - Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03]
- - Exploiting availability of multiple primer candidates [MPT05]
30 Assignable Primers
- If primer p hybridizes to the complement of tag t, at most one of the assignments (p,t), (p,t') and (p',t) can be made
- Set P of primers is assignable to a set T of tags if the condition above is satisfied for every p,p' and t,t'
31 Characterization of Assignable Sets
- Conflict graph
- - G = (T ∪ P, E), where (t,p) ∈ E if t and p hybridize
- X = number of primers adjacent to a degree 1 tag
- Y = number of degree 0 tags
[Figure: example conflict graph with X=1, Y=2]
- [Ben-Dor 04] Set P is assignable to T iff
- - X + Y ≥ |P|
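The characterization gives a direct test for assignability. This is a minimal sketch in which the conflict graph is passed as a collection of (tag, primer) pairs; the function name `is_assignable` and this encoding are assumptions of the example.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def is_assignable(tags: Iterable[str], primers: Iterable[str],
                  edges: Iterable[Tuple[str, str]]) -> bool:
    """Check X + Y >= |P| on the conflict graph given by (tag, primer) edges."""
    tags = list(tags)
    primers = set(primers)
    neighbors = defaultdict(set)
    for t, p in edges:
        if p in primers:
            neighbors[t].add(p)
    # X: primers adjacent to at least one tag of degree 1
    x = len({p for t in tags if len(neighbors[t]) == 1 for p in neighbors[t]})
    # Y: tags of degree 0
    y = sum(1 for t in tags if len(neighbors[t]) == 0)
    return x + y >= len(primers)


tags = ["t1", "t2", "t3", "t4"]
edges = [("t1", "p1"), ("t4", "p2"), ("t4", "p3")]
print(is_assignable(tags, ["p1", "p2", "p3"], edges))        # X=1, Y=2 -> True
print(is_assignable(tags, ["p1", "p2", "p3", "p4"], edges))  # 3 < 4 -> False
```

Intuitively, a primer adjacent to a degree 1 tag can take that tag, every other primer needs a tag that hybridizes with no primer at all, and tags of degree 2 or more cannot be used.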
32 Finding Assignable Primer Sets
Multiplexing Problem: given primer set P and tag set T, find partition of P into minimum number of assignable sets
Maximum Assignable Primer Set Problem: given primer set P and tag set T, find a maximum size assignable subset of P
- Both problems are NP-hard [Ben-Dor 04]
33 Integration with Primer Selection
- In practice, several primer candidates with equivalent functionality
- - In SNP genotyping, can pick primer from either forward or reverse strand
- - In gene expression/identification applications, many primers have desired length, Tm, etc.
34 Pooled Array Multiplexing Problem
Pooled Multiplexing Problem: Given set of primer pools P and tag set T, find a primer from each pool and a partition of selected primers into minimum number of assignable sets
35 X+Y Characterization Fails for Pools
36 Pooled Multiplexing Algorithms
- Primer-Del: greedy deletion for pools similar to [Ben-Dor et al 04]
- - Repeatedly delete primer of maximum potential until X+Y ≥ #pools, where
- - Potential of tag t is 2^{-deg(t)}
- - Potential of primer p is sum of potentials of conflicting tags
- - Subtract ½ if primer adjacent to a tag of degree 1
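The potential-based deletion can be sketched for the simpler unpooled case, where every pool holds a single primer and the stopping rule becomes X+Y ≥ number of remaining primers. This is a simplification, not the pooled algorithm itself; the tag potential is taken to be 2 raised to minus the tag degree, ties are broken by primer name (an arbitrary choice), and `primer_del` is a hypothetical name.

```python
from typing import Iterable, Set, Tuple


def primer_del(tags: Iterable[str], primers: Iterable[str],
               edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Greedily delete primers of maximum potential until the remaining
    set satisfies X + Y >= number of remaining primers."""
    tags = list(tags)
    edges = list(edges)
    remaining = set(primers)
    while remaining:
        nbrs = {t: {p for (u, p) in edges if u == t and p in remaining}
                for t in tags}
        x = len({p for ps in nbrs.values() if len(ps) == 1 for p in ps})
        y = sum(1 for ps in nbrs.values() if not ps)
        if x + y >= len(remaining):
            break

        def potential(p: str) -> float:
            adjacent = [ps for ps in nbrs.values() if p in ps]
            total = sum(2.0 ** -len(ps) for ps in adjacent)
            if any(len(ps) == 1 for ps in adjacent):
                total -= 0.5  # deleting p would lose a degree 1 tag
            return total

        # max() keeps the first maximum, so ties go to the smallest name
        victim = max(sorted(remaining), key=potential)
        remaining.remove(victim)
    return remaining


edges = [("t1", "p1"), ("t1", "p2"), ("t2", "p2"), ("t3", "p3")]
print(sorted(primer_del(["t1", "t2", "t3"], ["p1", "p2", "p3"], edges)))
# ['p2', 'p3']
```

In this example p1 and p2 tie at potential 0.25, p1 is deleted, and the remaining primers p2 and p3 are each adjacent to a degree 1 tag, so the set is assignable.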
37 Pooled Multiplexing Algorithms
- Primer-Del: greedy deletion for pools similar to [Ben-Dor et al 04]
- Primer-Del+: same but never delete last primer from pool unless no other choice
- Min-Pot: select primer with min potential from each pool, then run Primer-Del
- Min-Deg: select primer with min degree, then run Primer-Del
- Iterative ILP: iteratively find a maximum assignable pool set using integer linear program
38 Integer Linear Program for MAPS
- where z_pt = 1 iff primer p is assigned to tag t
39 Results: GenFlex Tags, c=8
40 Results: GenFlex Tags, c=7
41 Results: 213 [MPT05] Tags, c=7
42 Herpes B Gene Expression Assay
GenFlex Tags
Periodic Tags
43 Overview
- Tag Array Design
- - Tag Set Design
- - Tag Assignment Algorithms
- SBE/SBH Assays
- - Decoding and Multiplexing Algorithms
- Conclusions
44 SBE/SBH Assay [MP 06]
[Figure: primers (e.g., TTGCA, CCATT, GATAA) undergo single-base extension (SBE), followed by hybridization to k-mer array (SBH)]
45 Some notations
- P = set of primers, X = set of probes
- Ep ⊆ {A,C,T,G}: the set of possible extensions for primer p
- The spectrum of primer p, SpecX(p), is the set of probes hybridizing with p
- The extended spectrum of primer p with extension set Ep is denoted SpecX(p,Ep)
46 Decodable primer sets
- Four parallel single-color SBE/SBH experiments: one type of extension in each SBE experiment
- - P is weakly decodable with respect to extension e if for every primer p
- One SBE/SBH experiment with 4 colors (4 extensions)
- - P is weakly decodable if for every primer p and every extension e ∈ Ep
47 Strongly r-decodable primer sets
- Hybridization involving labeled nucleotide is less predictable
- - → Informative probes should not rely on it
- Signal from one SNP may obscure signal from another when read at the same probe due to differences in DNA amplification efficiency
- - → Informative probes cannot be shared between SNPs
- P is strongly r-decodable if for every primer p
- - where r = redundancy parameter
48 MPPP
- A set of primer pools P = {P1,...,Pn} is strongly r-decodable iff there is a primer pi in each pool Pi such that {p1,...,pn} is strongly r-decodable.
- Minimum Pool Partitioning Problem (MPPP)
- Given
- - primer pools set P and extensions sets Ep, for every primer p
- - probe set X
- - redundancy r
- Find
- - partition of P into the min number of strongly r-decodable subsets
49 MDPSP
- Maximum r-Decodable Pool Subset Problem (MDPSP)
- Given
- - primer pools set P and extensions sets Ep, for every primer p
- - probe set X
- - redundancy r
- Find
- - strongly r-decodable subset of P of maximum size
50 Min-Greedy Algorithm for Maximum Induced Matching in General Graphs
- Pick a vertex u of min degree
- Pick a vertex v of min degree from among u's neighbors
- Add edge (u,v) to the matching
- Delete all neighbors of u and v from the graph
- Repeat the above steps until the graph becomes empty
- [Duckworth 05] d-1 approximation factor for d-regular graphs
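The steps above can be sketched for an undirected graph stored as a symmetric adjacency dictionary. Two details are assumptions of this sketch rather than part of the slide: isolated vertices are discarded before each pick (so u always has a neighbor), and ties are broken by vertex name. The function name `min_greedy_induced_matching` is illustrative.

```python
from typing import Dict, Iterable, List, Set, Tuple


def min_greedy_induced_matching(adj: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Greedy induced matching; `adj` must be symmetric."""
    g = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    matching = []

    def remove(vertices: Iterable[str]) -> None:
        for x in list(vertices):
            for y in g.pop(x, set()):
                if y in g:
                    g[y].discard(x)

    while True:
        remove([u for u, vs in g.items() if not vs])  # drop isolated vertices
        if not g:
            break
        u = min(sorted(g), key=lambda z: len(g[z]))     # min degree vertex
        v = min(sorted(g[u]), key=lambda z: len(g[z]))  # min degree neighbor
        matching.append((u, v))
        # u and v are neighbors of each other, so they are removed as well
        remove(g[u] | g[v])
    return matching


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
        "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"}}
print(min_greedy_induced_matching(path))  # [('a', 'b'), ('d', 'e')]
```

Removing all neighbors of both endpoints is what keeps the matching induced: no remaining vertex is adjacent to an already matched vertex.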
51 Min-Greedy Algorithms for MDPSP
- Bipartite hybridization graph G
- - Primers in left side, probes in right side
- Two types of edges
- - N+(p) = SpecX(p)
- - N-(p) = SpecX(p,Ep) \ SpecX(p)
- Two algorithm variants
- - MinPrimerGreedy: pick primer first
- - MinProbeGreedy: pick probe first
- Delete primer/probe if N+ degree drops below r/1
52 Experimental results for k-mers (|Ep|=4, primer length=20)
53 MDPSP Size vs Primer Length
k=10
54 Experimental results for c-tokens (|Ep|=4, primer length=20)
55 MDPSP Size vs Primer Length
c=13
56 Overview
- Tag Array Design
- - Tag Set Design
- - Tag Assignment Algorithms
- SBE/SBH Assays
- - Decoding and Multiplexing Algorithms
- Conclusions
57 Conclusions and Ongoing Work
- Combinatorial algorithms yield significant increases in multiplexing rates of universal arrays
- New SBE/SBH architecture particularly promising based on preliminary simulation results
- Ongoing work
- - Extend methods to more accurate hybridization models, e.g., use NN melting temperature models
- - More complex (e.g., temperature dependent) DNA tag set non-interaction requirements for DNA self/mediated assembly
- - Probabilistic decoding in presence of hybridization errors
- - Application to novel domains, e.g., DNA barcoding
58 Acknowledgments
- Claudia Prajescu and Dragos Trinca
- Funding from NSF (Awards 0546457 and 0543365) and
UCONN Research Foundation