Title: Algorithms for Biochip Design and Optimization
1. Algorithms for Biochip Design and Optimization
- Ion Mandoiu
- Computer Science & Engineering Department
- University of Connecticut
2. Overview
- Physical design of DNA arrays
- DNA tag set design
- Digital microfluidic biochip testing
- Conclusions
3. Driver Biochip Applications
- Driver applications
- Gene expression (transcription analysis)
- SNP genotyping
- CNP analysis
- Genomic-based microorganism identification
- Point-of-care diagnosis
- Healthcare, forensics, environmental monitoring, ...
- As focus shifts from basic research to clinical applications, there are increasingly stringent design requirements on sensitivity, specificity, and cost
- Assay design and optimization become critical
4. Single Nucleotide Polymorphisms
- Human genome ≈ 3 × 10^9 base pairs
- Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
- Total SNPs ≈ 1 × 10^7
- Difference between any two individuals ≈ 3 × 10^6 SNPs (≈ 0.1% of entire genome)
ataggtccCtatttcgcgcCgtatacacgggTctata
ataggtccGtatttcgcgcAgtatacacgggActata
ataggtccCtatttcgcgcCgtatacacgggTctata
5. Watson-Crick Complementarity
- Four nucleotide types: A, C, T, G
- As paired with Ts (2 hydrogen bonds)
- Cs paired with Gs (3 hydrogen bonds)
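A minimal Python sketch of the pairing rule on this slide. The function names are illustrative only, and the reversal reflects the standard fact that the two strands of a duplex run antiparallel.

```python
# Watson-Crick pairs: A-T (2 hydrogen bonds), C-G (3 hydrogen bonds).
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}

def reverse_complement(seq: str) -> str:
    """Strand that pairs perfectly with `seq` (read in the opposite direction)."""
    return "".join(PAIR[base] for base in reversed(seq.upper()))

def hydrogen_bonds(seq: str) -> int:
    """Hydrogen bonds in the duplex formed by `seq` and its perfect complement."""
    return sum(BONDS[base] for base in seq.upper())

print(reverse_complement("ATAGGTCC"))  # GGACCTAT
print(hydrogen_bonds("ATAGGTCC"))      # 20
```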
6. SNP genotyping via direct hybridization
Labeled sample
- SNP1 with alleles T/G
- SNP2 with alleles A/G
Array with 2 probes/SNP
Hybridization
7. In-Place Probe Synthesis
8. In-Place Probe Synthesis
9. In-Place Probe Synthesis
10. Simplified DNA Array Flow
- Design: Probe Selection → Physical Design (Probe Placement & Embedding)
- Manufacturing: Mask Manufacturing → Array Manufacturing
- End User: Hybridization Experiment → Analysis of Hybridization Intensities (gene expression levels, SNP genotypes, ...)
11. Unwanted Illumination Effect
- Unintended illumination during manufacturing ⇒ synthesis of erroneous probes
- Effect gets worse with technology scaling
12. Border Length Minimization Objective
- Effects of unintended illumination ∝ border length
13. Synchronous Synthesis
- Periodic deposition sequence, e.g., (ACTG)^k
- Each probe grown by one nucleotide in each period
⇒ # border conflicts between adjacent probes = 2 × Hamming distance
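A small sketch of why synchronous synthesis gives 2 × Hamming distance. It assumes equal-length probes and counts a conflict at every deposition step where exactly one of two adjacent probes is grown. Function names are illustrative.

```python
def synchronous_embedding(probe, period="ACTG"):
    """Deposition steps at which `probe` is grown: nucleotide i uses period i."""
    return {i * len(period) + period.index(base) for i, base in enumerate(probe)}

def border_conflicts(p, q, period="ACTG"):
    """Steps at which exactly one of the two adjacent probes is exposed."""
    return len(synchronous_embedding(p, period) ^ synchronous_embedding(q, period))

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

p, q = "ACGT", "AGGA"
print(border_conflicts(p, q), 2 * hamming(p, q))  # 4 4
```

Each mismatched position contributes two conflicting steps within its period: one where only the first probe is grown and one where only the second is.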
14. 2D Placement Problem
- Find minimum cost mapping of the Hamming graph onto the grid graph
- Special case of the Quadratic Assignment Problem
Edge cost = 2 × Hamming distance
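As an illustration of the objective, the sketch below evaluates the cost of a given placement: the sum of 2 × Hamming distance over all pairs of horizontally or vertically adjacent cells. The grid-as-list-of-rows representation is an assumption made for the example.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def placement_cost(grid):
    """Total border length of a placement given as a list of rows of probes."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # right neighbor
                total += 2 * hamming(grid[r][c], grid[r][c + 1])
            if r + 1 < rows:  # bottom neighbor
                total += 2 * hamming(grid[r][c], grid[r + 1][c])
    return total

good = [["ACGT", "ACGA"], ["TCGT", "TCGA"]]
bad = [["ACGT", "TCGA"], ["ACGA", "TCGT"]]
print(placement_cost(good), placement_cost(bad))  # 8 12
```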
15. 2D Placement: Sliding-Window Matching
- Proposed by [Doll et al. 94] in VLSI context
- Slide window over entire chip
- Repeat a fixed # of iterations (⇒ O(N) time for fixed window size), or until improvement drops below a certain threshold
16. 2D Placement: Epitaxial Growth
- Proposed by [PreasL88, ShahookarM91] in VLSI context
- Simulates crystal growth
- Efficient row implementation
- Use lexicographical sorting for initial ordering of probes
- Fill cells row-by-row
- Bound number of candidate probes considered when filling each cell
- Constant # of lookahead rows ⇒ O(N^(3/2)) runtime, N = # probes
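A simplified sketch of the row-by-row idea, not the Row-Epitaxial implementation itself. As an assumption, the "lookahead rows" are replaced here by a fixed window of the next `lookahead` probes in lexicographic order, and each cell takes the candidate with the smallest Hamming distance to its already placed left and top neighbors.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def row_fill_placement(probes, cols, lookahead=20):
    """Greedy row-by-row placement with a bounded candidate window."""
    remaining = sorted(probes)  # lexicographic initial ordering
    grid = []
    while remaining:
        row = []
        for c in range(cols):
            if not remaining:
                break
            neighbors = []
            if row:
                neighbors.append(row[-1])       # left neighbor
            if grid:
                neighbors.append(grid[-1][c])   # top neighbor
            window = remaining[:lookahead]      # bounded candidate set
            best = min(range(len(window)),
                       key=lambda i: sum(hamming(window[i], n) for n in neighbors))
            row.append(remaining.pop(best))
        grid.append(row)
    return grid

probes = ["ACGT", "TCGA", "ACGA", "TCGT", "AAGT", "TTGA"]
for row in row_fill_placement(probes, cols=3, lookahead=4):
    print(row)
```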
17. 2D Placement: Recursive Partitioning
- Very effective in VLSI placement [AlpertK95, Caldwell et al. 00]
- 4-way partition using linear time clustering
- Repeat until Row-Epitaxial can be applied
18. Asynchronous Synthesis
[Figure: probes embedded asynchronously into the deposition sequence; labels: Deposition Sequence, Probes]
19. Optimal Single-Probe Re-Embedding
- Efficient solution by dynamic programming
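The slide does not spell out the dynamic program, so the following is only one plausible sketch. It assumes the neighbors' embeddings are fixed and given as boolean masks over the deposition steps, and that a conflict is charged for every step and neighbor where exactly one of the two probes is grown. `cost[i]` holds the cheapest way to have placed the first `i` nucleotides after the steps processed so far.

```python
def best_reembedding_cost(probe, deposition, neighbor_masks):
    """Minimum number of border conflicts over all embeddings of `probe`.

    neighbor_masks[j][t] is True when neighbor j is grown at step t.
    Returns float('inf') if `probe` cannot be embedded at all.
    """
    steps = len(deposition)
    n_neighbors = len(neighbor_masks)
    exposed = [sum(1 for mask in neighbor_masks if mask[t]) for t in range(steps)]
    INF = float("inf")
    cost = [0] + [INF] * len(probe)
    for t in range(steps):
        nxt = [INF] * (len(probe) + 1)
        for i, so_far in enumerate(cost):
            if so_far == INF:
                continue
            # Probe masked at step t: conflicts with neighbors grown at t.
            nxt[i] = min(nxt[i], so_far + exposed[t])
            # Probe grown at step t: conflicts with neighbors masked at t.
            if i < len(probe) and deposition[t] == probe[i]:
                nxt[i + 1] = min(nxt[i + 1], so_far + n_neighbors - exposed[t])
        cost = nxt
    return cost[len(probe)]

deposition = "ACGTACGT"
early = [t in (0, 3) for t in range(8)]  # neighbor "AT" grown at steps 0 and 3
late = [t in (4, 7) for t in range(8)]   # neighbor "AT" grown at steps 4 and 7
print(best_reembedding_cost("AT", deposition, [early]))        # 0
print(best_reembedding_cost("AT", deposition, [early, late]))  # 4
```

The table has (probe length + 1) entries per deposition step, so the running time is proportional to probe length × deposition length.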
20. In-Place Re-Embedding Algorithms
- 2D placement fixed, allow only probe embeddings to change
- Greedy: optimally re-embed probe with largest gain
- Chessboard: alternate re-embedding of black/white cells
- Sequential: re-embed probes row-by-row
Chip size   Greedy          Chessboard      Sequential
            LB      CPU     LB      CPU     LB      CPU
100         125.7   40      120.5   54      119.9   64
500         127.1   943     121.4   1423    120.9   1535
21. Integration with Probe Selection
Flow: Probe Selection → Probe Pools → Physical Design (Placement & Embedding)
Pool Row-Epitaxial, chip size 100x100
Pool size   Improv   CPU sec.
1           -        217
2           4.3      1040
4           8.2      1796
8           11.8     3645
16          15.2     7515
22. Overview
- Physical design of DNA arrays
- DNA tag set design
- Digital microfluidic biochip testing
- Conclusions
23. Universal Tag Arrays
- [Brenner 97, Morris et al. 98]
- Array consisting of application-independent tags
- Two-part reporter probes: application-specific primers ligated to antitags
- Detection carried out by a sequence of reactions separately involving the primer and the antitag part of reporter probes
24. Universal Tag Array Advantages
- Cost effective
- Same tag array used for different analyses
- ⇒ can be mass-produced
- Only need to synthesize new set of reporter probes
- More reliable!
- Solution phase hybridization better understood than hybridization on solid support
25. SNP Genotyping with Tag Arrays
1. Mix reporter probes with unlabeled genomic DNA
2. Solution phase hybridization
3. Single-Base Extension (SBE)
4. Solid phase hybridization
[Figure labels: tag, antitag, primer; bases G, A, G, C]
26. Tag Set Design Problem
[Figure: tags t1, t2 and their antitags]
- (H1) Tags hybridize strongly to complementary antitags
- (H2) No tag hybridizes to a non-complementary antitag
Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H2)
27. Hybridization Models
- Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state
- 2-4 rule
- Tm = 2 (# As and Ts) + 4 (# Cs and Gs)
- More accurate models exist, e.g., the nearest-neighbor model
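The 2-4 rule is simple enough to state as code. A minimal sketch (the function name is illustrative):

```python
def tm_2_4_rule(seq: str) -> int:
    """Melting temperature estimate: 2 per A/T plus 4 per C/G."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    cg = seq.count("C") + seq.count("G")
    return 2 * at + 4 * cg

print(tm_2_4_rule("CCAGATT"))  # 2*4 + 4*3 = 20
```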
28. Hybridization Models (contd.)
- Hamming distance model, e.g., [Marathe et al. 01]
- Models rigid DNA strands
- LCS/edit distance model, e.g., [Torney et al. 03]
- Models infinitely elastic DNA strands
- c-token model [Ben-Dor et al. 00]
- Duplex formation requires formation of nucleation complex between perfectly complementary substrings
- Nucleation complex must have weight ≥ c, where wt(A) = wt(T) = 1, wt(C) = wt(G) = 2 (2-4 rule)
29. c-h Code Problem
- c-token: left-minimal DNA string of weight ≥ c, i.e.,
- w(x) ≥ c
- w(x') < c for every proper suffix x' of x
- A set of tags is a c-h code if
- (C1) Every tag has weight ≥ h
- (C2) Every c-token is used at most once
c-h Code Problem [Ben-Dor et al. 00]: Given c and h, find maximum cardinality c-h code
30. Algorithms for c-h Code Problem
- [Ben-Dor et al. 00] approximation algorithm based on De Bruijn sequences
- Alphabetic tree search algorithm
- Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags
- Easily modified to handle various combinations of constraints
- [MT 05, 06] Optimum c-h codes can be computed in practical time for small values of c by using integer programming
- Practical runtime using Garg-Konemann approximation and LP-rounding
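A minimal sketch of the alphabetic tree search, under two assumptions that the slide does not state: candidate tags are grown one nucleotide at a time until their weight first reaches h, and a branch is pruned as soon as it would reuse a c-token. It is exponential in the worst case and meant for small c and h only.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def weight(s):
    return sum(WEIGHT[b] for b in s)

def ending_token(s, c):
    """The c-token ending at the last position of s, or None if none exists."""
    w = 0
    for start in range(len(s) - 1, -1, -1):
        w += WEIGHT[s[start]]
        if w >= c:
            return s[start:]
    return None

def alphabetic_tree_search(c, h, alphabet="ACGT"):
    used, tags = set(), []

    def extend(prefix, tokens):
        if weight(prefix) >= h:
            tags.append(prefix)
            used.update(tokens)
            return
        for base in alphabet:
            if used.intersection(tokens):
                return  # a saved tag already consumed this prefix's c-tokens
            candidate = prefix + base
            token = ending_token(candidate, c)
            if token in used or token in tokens:
                continue  # (C2): every c-token is used at most once
            extend(candidate, tokens + [token] if token else tokens)

    extend("", [])
    return tags

tags = alphabetic_tree_search(c=3, h=6)
print(len(tags), tags[:3])
```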
31. Token Content of a Tag
- c = 4
- Tag CCAGATT ⇒ sequence of c-tokens
End pos   2    3     4     5     6     7
c-token   CC → CCA → CAG → AGA → GAT → GATT
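The token content of this slide's example can be reproduced with a short sketch: for every end position, take the shortest substring ending there whose weight is at least c (weights as in the 2-4 rule). The function name is illustrative.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def c_tokens(tag, c):
    """List of (end position, c-token) pairs, positions counted from 1."""
    tokens = []
    for end in range(1, len(tag) + 1):
        w = 0
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:  # shortest suffix of tag[:end] with weight >= c
                tokens.append((end, tag[start:end]))
                break
    return tokens

print(c_tokens("CCAGATT", 4))
# [(2, 'CC'), (3, 'CCA'), (4, 'CAG'), (5, 'AGA'), (6, 'GAT'), (7, 'GATT')]
```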
32. Layered c-token graph for length-l tags
[Figure: layered graph with source s and sink t; layers indexed c/2, (c/2)+1, ..., l-1, l; c-tokens c_1, ..., c_N]
33. Integer Program Formulation [MPT05]
- Maximum integer flow problem w/ set capacity constraints
- O(hN) constraints & variables, where N = # c-tokens
34. Packing LP Formulation
35. Garg-Konemann Algorithm
1. x ← 0; y ← δ // y_i are variables of the dual LP
2. Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i
3. While weight(p) < 1 do
- M ← max_i |p ∩ V_i|
- x_p ← x_p + 1/M
- For every i, y_i ← y_i (1 + ε |p ∩ V_i| / M)
- Find min weight s-t path p, where weight(v) = y_i for v ∈ V_i
4. For every p, x_p ← x_p / (1 - log_{1+ε} δ)
[GK98] The algorithm computes a factor (1 - ε)^2 approximation to the optimal LP solution with (N/ε) log_{1+ε} N shortest path computations
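A toy sketch of the multiplicative-weights loop on this slide. Two things are assumptions, not taken from the slide: paths are supplied as an explicit list and the cheapest one is found by brute force (instead of a shortest-path computation in the layered graph), and δ is set to the value used in the standard Garg-Konemann analysis of packing LPs.

```python
import math

def garg_konemann(paths, num_sets, eps=0.1):
    """Approximate max sum(x_p) s.t. sum_p |p ∩ V_i| x_p <= 1 for every set i.

    paths: list of dicts mapping set index i -> |p ∩ V_i| (positive ints).
    """
    # Assumed choice of delta (standard analysis); not given on the slide.
    delta = (1 + eps) * ((1 + eps) * num_sets) ** (-1 / eps)
    y = [delta] * num_sets          # dual variables
    x = [0.0] * len(paths)          # primal variables

    def path_weight(p):
        return sum(count * y[i] for i, count in p.items())

    while True:
        j = min(range(len(paths)), key=lambda k: path_weight(paths[k]))
        if path_weight(paths[j]) >= 1:
            break
        m = max(paths[j].values())
        x[j] += 1 / m
        for i, count in paths[j].items():
            y[i] *= 1 + eps * count / m
    scale = 1 - math.log(delta, 1 + eps)  # final scaling keeps x feasible
    return [value / scale for value in x]

# Three paths, each touching two of three sets; the LP optimum is 1.5.
paths = [{0: 1, 1: 1}, {1: 1, 2: 1}, {0: 1, 2: 1}]
x = garg_konemann(paths, num_sets=3)
print(sum(x))
```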
36. LP Based Tag Set Design
- Run Garg-Konemann and store the minimum weight paths in a list
- Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags
- Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags
37. Periodic Tags [MT05]
- Key observation: c-token uniqueness constraint in c-h code formulation is too strong
- A c-token should not appear in two different tags, but can be repeated in a tag
- A tag t is called periodic if it is the prefix of (α)^∞ for some period α
- Periodic strings make best use of c-tokens
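A short sketch of why periodic tags are economical: a tag built by repeating a period reuses the same few c-tokens however long it gets. The helper names and the example period CAG are illustrative choices.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def periodic_tag(period, length):
    """Prefix of the given length of period repeated indefinitely."""
    reps = length // len(period) + 1
    return (period * reps)[:length]

def distinct_c_tokens(tag, c):
    """Set of c-tokens occurring in tag (shortest substrings of weight >= c)."""
    tokens = set()
    for end in range(1, len(tag) + 1):
        w = 0
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.add(tag[start:end])
                break
    return tokens

tag = periodic_tag("CAG", 9)
print(tag)                              # CAGCAGCAG
print(sorted(distinct_c_tokens(tag, 4)))  # ['CAG', 'GC', 'GCA']
```

Here a tag of weight 15 consumes only three distinct 4-tokens, whereas the c-h code constraint would forbid any repetition inside the tag.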
38. c-token factor graph, c = 4 (incomplete)
[Figure: partial factor graph on c-tokens CC, AAG, AAC, AAAA, AAAT]
39. Vertex-disjoint Cycle Packing Problem
- Given directed graph G, find maximum number of vertex disjoint directed cycles in G
- [MT 05] APX-hard even for regular directed graphs with in-degree and out-degree 2
- h - c/2 + 1 approximation factor for tag set design problem
- [Salavatipour and Verstraete 05]
- Quasi-NP-hard to approximate within Ω(log^(1-ε) n)
- O(n^(1/2)) approximation algorithm
40. Cycle Packing Algorithm
- Construct c-token factor graph G
- T ← ∅
- For all cycles C defining periodic tags, in increasing order of cycle length,
- Add to T the tag defined by C
- Remove C from G
- Perform an alphabetic tree search and add to T tags consisting of unused c-tokens
- Return T
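The cycle-removal loop can be sketched on a generic directed graph. This is a hedged illustration only: it works on an arbitrary adjacency dictionary rather than the c-token factor graph, ignores whether a cycle actually defines a valid periodic tag, and approximates "increasing order of cycle length" by repeatedly extracting a currently shortest cycle.

```python
from collections import deque

def shortest_cycle(graph):
    """Shortest directed cycle in graph (dict: node -> set of successors)."""
    best = None
    for s in sorted(graph):
        parent = {s: None}
        queue = deque([s])
        last = None  # node closing the cycle back to s
        while queue and last is None:
            u = queue.popleft()
            for v in sorted(graph[u]):
                if v == s:
                    last = u
                    break
                if v in graph and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if last is not None:
            cycle = []
            while last is not None:
                cycle.append(last)
                last = parent[last]
            cycle.reverse()
            if best is None or len(cycle) < len(best):
                best = cycle
    return best

def greedy_cycle_packing(graph):
    """Repeatedly take a shortest cycle and delete its vertices."""
    g = {u: set(vs) for u, vs in graph.items()}
    packing = []
    while True:
        cycle = shortest_cycle(g)
        if cycle is None:
            return packing
        packing.append(cycle)
        for u in cycle:
            del g[u]
        for u in g:
            g[u] -= set(cycle)

graph = {1: {2}, 2: {1, 3}, 3: {4}, 4: {5}, 5: {3}}
print(greedy_cycle_packing(graph))  # [[1, 2], [3, 4, 5]]
```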
41. Experimental Results
42. More Hybridization Constraints
[Figure: tags t1, t2]
- Enforced during tag assignment by
- Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03]
- Exploiting availability of multiple primer candidates [MPT05]
43. Herpes B Gene Expression Assay
GenFlex Tags
                         500 tags        1000 tags       2000 tags
Tm   pools   Pool size   arrays  Util.   arrays  Util.   arrays  Util.
60   1446    1           4       82.26   3       65.35   2       57.05
60   1446    5           4       88.26   3       70.95   2       63.55
67   1560    1           4       86.33   3       69.70   2       61.15
67   1560    5           4       91.86   3       76.00   2       67.20
70   1522    1           4       88.46   3       73.65   2       65.40
70   1522    5           4       92.26   2       91.10   2       70.30
Periodic Tags
                         500 tags        1000 tags       2000 tags
Tm   pools   Pool size   arrays  Util.   arrays  Util.   arrays  Util.
60   1446    1           4       94.06   2       97.20   1       72.30
60   1446    5           4       96.13   2       100.00  1       72.30
67   1560    1           4       96.53   2       98.70   1       78.00
67   1560    5           4       98.00   2       99.90   1       78.00
70   1522    1           4       96.73   2       98.90   1       76.10
70   1522    5           4       97.80   2       99.80   1       76.10
44. Overview
- Physical design of DNA arrays
- DNA tag set design
- Digital microfluidic biochip testing
- Conclusions
45. Digital Microfluidic Biochips
[Srinivasan et al. 04]
- Electrodes typically arranged in rectangular grid
- Droplets moved by applying voltage to adjacent cell
- Can be used for analyses of DNA, proteins, metabolites
46. Design Challenges
- Testing
- High electrode failure rate, but can re-configure around failures
- Performed both after manufacturing and concurrent with chip operation
- Main objective is minimization of completion time
- Module placement
- Assay operations (mixing, amplification, etc.) can be mapped to overlapping areas of the chip if performed at different times
- Droplet routing
- When multiple droplets are routed simultaneously, must prevent accidental droplet merging or interference
[Figure: Merging, Interference]
47. Concurrent Testing Problem
- GIVEN
- Input/Output cells
- Position of obstacles (cells in use by ongoing reactions)
- FIND
- Trajectories for test droplets such that
- Every non-blocked cell is visited by at least one test droplet
- Droplet trajectories meet non-merging and non-interference constraints
- Completion time is minimized
Defect model: test droplet gets stuck at defective electrode
[Su et al. 04] ILP-based solution for single test droplet case; heuristic for multiple input-output pairs with single test droplet/pair
48. ILP Formulation for Unconstrained Number of Droplets
- Each cell (i,j) visited at least once
- Droplet conservation
- No droplet merging
- No droplet interference
- Minimize completion time
49. Special Case
- NxN chip
- I/O cells in opposite corners
- No obstacles
- ⇒ Single droplet solution needs N^2 cycles
50. Lower Bound
- Claim: Completion time is at least 4N - 4 cycles
Proof: In each cycle, each of the k droplets places 1 dollar in its current cell
≥ 3k(k-1)/2 dollars paid waiting to depart
≥ 1 dollar in each cell
≥ k dollars in each diagonal
≥ 3k(k-1)/2 dollars paid waiting for last droplet
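The slide gives only the accounting, so the following is one possible way to combine it, under the assumption that all four dollar counts are lower bounds and that k ≤ N. With completion time T the droplets pay kT dollars in total. A diagonal with L cells receives at least max(k, L) dollars, and the two waiting terms add 3k(k-1). Dividing by k and minimizing over k then reproduces the claimed bound.

```python
def dollars_lower_bound(n, k):
    """Lower bound on total dollars paid by k <= n droplets on an n x n chip."""
    diagonals = [min(d, 2 * n - d) for d in range(1, 2 * n)]  # lengths 1..n..1
    visits = sum(max(k, length) for length in diagonals)
    waiting = 2 * (3 * k * (k - 1) // 2)  # waiting to depart + waiting for last droplet
    return visits + waiting

def completion_time_bound(n):
    """Smallest value of (total dollars) / k over all droplet counts k <= n."""
    return min(dollars_lower_bound(n, k) / k for k in range(1, n + 1))

print(completion_time_bound(120))  # 476.0, i.e. 4*120 - 4
```

For each k the bound equals N^2/k + 4(k-1), which is smallest at k = N/2 where it evaluates to 4N - 4.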
51. Stripe Algorithm with N/3 Droplets
52. Stripe Algorithm with Obstacles of width Q
- Divide array into vertical stripes of width Q+1
- Use one droplet per stripe
- All droplets visit cells in assigned stripes in parallel
- In case of interference, droplet on left stripe waits for droplet in right stripe
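A rough sketch of the stripe decomposition only. It deliberately ignores obstacles, routing to and from the I/O cells, and the waiting rule for interference, and it assumes each droplet sweeps its stripe in a serpentine order. The function name is illustrative.

```python
def stripe_routes(n, width):
    """One serpentine route per vertical stripe of an n x n array.

    Returns a list of routes; each route is a list of (row, col) cells.
    """
    routes = []
    for left in range(0, n, width):
        cols = list(range(left, min(left + width, n)))
        route = []
        for r in range(n):
            ordered = cols if r % 2 == 0 else cols[::-1]  # alternate direction
            route.extend((r, c) for c in ordered)
        routes.append(route)
    return routes

routes = stripe_routes(n=6, width=3)
print(len(routes), [len(route) for route in routes])  # 2 [18, 18]
```

With obstacles of width Q, a stripe of width Q+1 always leaves at least one free column beside an obstacle, which is presumably why that width is used.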
53. Results for 120x120 Chip, 2x2 Obstacles
Average completion time (cycles)
Obstacle area (%)   k=1     k=12   k=20     k=30    k=40    k=40 vs. k=1 speed-up
0                   14400   1412   944      710     593     24x
1                   14256   1420   953.4    715.2   598.8   24x
5                   13680   1473   982.8    725     596.2   23x
10                  12960   1490   1010.8   734.8   592.6   22x
15                  12240   1501   1025.8   730.8   588.2   21x
20                  11520   1501   1046.8   738.4   580.8   20x
25                  10800   1501   1071     736.6   570     19x
20x decrease in completion time by using multiple droplets
54. Overview
- Physical design of DNA arrays
- DNA tag set design
- Digital microfluidic biochip testing
- Conclusions
55. Conclusions
- Biochip design is a fertile area of applications
- Combinatorial optimization techniques can yield significant improvements in assay quality/throughput
- Very dynamic area, driver applications and underlying technologies change rapidly
56. Acknowledgments
- Physical design of DNA arrays: A.B. Kahng, P. Pevzner, S. Reda, X. Xu, A. Zelikovsky
- Tag set design: D. Trinca
- Testing of digital microfluidic biochips: R. Garfinkel, B. Pasaniuc, A. Zelikovsky
- Financial support: UCONN Research Foundation, NSF awards 0546457 and 0543365
57. Questions?