Learn more at: https://dna.engr.uconn.edu


1
Design and Optimization of Universal DNA Arrays

Ion Mandoiu
Computer Science & Engineering Department
University of Connecticut
http://www.engr.uconn.edu/~ion/

2
DNA Arrays

Exploit Watson-Crick complementarity to simultaneously perform a large number of substring tests
Used in a variety of high-throughput genomic analyses
- Transcription (gene expression) analysis
- Single Nucleotide Polymorphism (SNP) genotyping
- Alternative splicing, ChIP-on-chip, tiling arrays, genomic-based species identification, point-of-service diagnosis, ...
Common array formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide

3
SNP Genotyping

Genome variation: 0.1% of the DNA is different from one individual to another
80% of the variation is represented by Single Nucleotide Polymorphisms (SNPs)
- 2 possible nucleotides (alleles) for each SNP
SNP genotyping: determining the alleles present at the SNP sites
Highest throughput for SNP genotyping is achieved by high-density DNA microarrays based on direct hybridization

4
SNP genotyping via direct hybridization

[Figure: labeled sample hybridized to an array with 2 probes/SNP; example SNP1 with alleles T/G and SNP2 with alleles A/G]

5
Universal DNA Arrays

"Programmable" arrays
- Array consists of application independent oligonucleotides
- Analysis carried out by a sequence of reactions involving application specific primers
Flexible AND cost effective
Universal array architectures: tag arrays, APEX arrays, SBE/SBH arrays, ...

6
Overview

Tag Array Design
- Tag Set Design
- Tag Assignment Algorithms
SBE/SBH Assays
- Decoding and Multiplexing Algorithms
Conclusions

7
SNP Genotyping with Tag Arrays

[Figure: reporter probes (tag + primer), extended bases, and antitags on the array]

1. Mix reporter probes with unlabeled genomic DNA
2. Solution phase hybridization
3. Single-Base Extension (SBE)
4. Solid phase hybridization

8
Tag Array Advantages

Cost effective
Same array used in many analyses → can be mass produced
Easy to customize
Only need to synthesize new set of reporter
probes
Reliable
Solution phase hybridization better understood
than hybridization on solid support

9
Tag Set Design Problem

[Figure: hybridization diagrams for tags t1 and t2, illustrating (H1)-(H3)]

(H1) Tags hybridize strongly to complementary
antitags
(H2) No tag hybridizes to a non-complementary
antitag
(H3) Tags do not cross-hybridize to each other

Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)

10
Hybridization Models

Melting temperature Tm: temperature at which 50% of duplexes are in hybridized state
2-4 rule
- Tm = 2 (#As and Ts) + 4 (#Cs and Gs)
More accurate models exist, e.g., the nearest-neighbor model
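
As a quick illustration of the 2-4 rule, the sketch below computes the estimate for a DNA string. The function name `two_four_rule_tm` is hypothetical, and the example sequence is the tag used on a later slide; the rule itself is only a rough estimate.

```python
def two_four_rule_tm(seq):
    """Estimate melting temperature with the 2-4 rule:
    2 degrees per A/T base plus 4 degrees per C/G base."""
    seq = seq.upper()
    weak = sum(seq.count(b) for b in "AT")    # A and T bases
    strong = sum(seq.count(b) for b in "CG")  # C and G bases
    return 2 * weak + 4 * strong


# CCAGATT has 4 weak bases (A, A, T, T) and 3 strong bases (C, C, G).
print(two_four_rule_tm("CCAGATT"))  # 2*4 + 4*3 = 20
```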

11
Hybridization Models (contd.)

Hamming distance model, e.g., [Marathe et al. 01]
- Models rigid DNA strands
LCS/edit distance model, e.g., [Torney et al. 03]
- Models infinitely elastic DNA strands
c-token model [Ben-Dor et al. 00]
- Duplex formation requires formation of nucleation complex between perfectly complementary substrings
- Nucleation complex must have weight ≥ c, where wt(A)=wt(T)=1, wt(C)=wt(G)=2 (2-4 rule)

12
c-h Code Problem

c-token: left-minimal DNA string of weight ≥ c, i.e.,
- w(x) ≥ c
- w(x') < c for every proper suffix x' of x
A set of tags is a c-h code if
(C1) Every tag has weight ≥ h
(C2) Every c-token is used at most once

c-h Code Problem [Ben-Dor et al. 00]: Given c and h, find maximum cardinality c-h code

13
Algorithms for c-h Code Problem

[Ben-Dor et al. 00]: approximation algorithm based on De Bruijn sequences
Alphabetic tree search algorithm
- Enumerate candidate tags in lexicographic order, save tags whose c-tokens are not used by previously selected tags
- Easily modified to handle various combinations of constraints
[MT 05, 06]: Optimum c-h codes can be computed in practical time for small values of c by using integer programming
- Practical runtime using Garg-Konemann approximation and LP-rounding
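
One possible reading of the alphabetic tree search is sketched below. It is an illustration under stated assumptions, not the authors' implementation: bases are tried in the order A, C, G, T, a tag is accepted as soon as its weight reaches h, and a branch is pruned when the newest c-token is already used. The names `token_ending_at` and `tree_search` are hypothetical.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def token_ending_at(s, end, c):
    """Shortest substring s[start:end] of weight >= c, or None if too light."""
    w = 0
    for start in range(end - 1, -1, -1):
        w += WEIGHT[s[start]]
        if w >= c:
            return s[start:end]
    return None


def tree_search(c, h):
    used, tags = set(), []

    def extend(prefix, prefix_tokens):
        if sum(WEIGHT[b] for b in prefix) >= h:
            tags.append(prefix)            # accept the tag (assumption: stop at weight >= h)
            used.update(prefix_tokens)
            return
        for base in "ACGT":                # lexicographic order
            if used & prefix_tokens:       # an accepted descendant consumed our tokens
                return
            cand = prefix + base
            tok = token_ending_at(cand, len(cand), c)
            if tok is not None and (tok in used or tok in prefix_tokens):
                continue                   # prune: c-token already taken
            new_tokens = prefix_tokens | {tok} if tok else prefix_tokens
            extend(cand, new_tokens)

    extend("", frozenset())
    return tags


tags = tree_search(c=4, h=8)
print(len(tags), tags[:3])
```

Extra constraints (for example, on complementary substrings) would fit in as additional pruning tests at the point where `tok` is checked.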

14
Token Content of a Tag

c=4
Tag: CCAGATT
c-tokens: CC, CCA, CAG, AGA, GAT, GATT

Tag → sequence of c-tokens
End pos:  2    3     4     5     6     7
c-token:  CC → CCA → CAG → AGA → GAT → GATT
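
The token content of a tag can be computed by scanning leftward from each end position until the accumulated weight reaches c. The sketch below reproduces the CCAGATT example for c=4; the function name `c_tokens` is a hypothetical helper, not something named in the slides.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def c_tokens(tag, c):
    """Return the c-token ending at each position of tag, in order.

    Positions whose prefix is lighter than c contribute no token.
    """
    tokens = []
    for end in range(1, len(tag) + 1):
        w = 0
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:                      # shortest suffix of weight >= c
                tokens.append(tag[start:end])
                break
    return tokens


print(c_tokens("CCAGATT", 4))
# ['CC', 'CCA', 'CAG', 'AGA', 'GAT', 'GATT']
```
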
15
Layered c-token graph for length-l tags
[Figure: layered c-token graph with source s, sink t, layers c/2, (c/2)+1, ..., l-1, l, and c-token vertices c1, ..., cN]
16
Integer Program Formulation [MPT05]

Maximum integer flow problem w/ set capacity constraints
O(hN) constraints and variables, where N = number of c-tokens

17
Number of c-tokens

W = A or T, S = C or G
G_n = number of strings of weight n
G_1 = 2, G_2 = 6, G_n = 2 G_{n-2} + 2 G_{n-1}
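
The recurrence can be checked numerically. The sketch below also counts c-tokens using a closed form that is derived here from the left-minimality definition and is not stated on the slides: a c-token either has weight exactly c, or has weight c+1 and starts with a strong base, giving G_c + 2 G_{c-1}. A brute-force enumeration is included to cross-check that assumption; all function names are hypothetical.

```python
from itertools import product

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def strings_of_weight(n):
    """G_n via G_n = 2*G_{n-2} + 2*G_{n-1}, with G_0 = 1 and G_1 = 2."""
    g = [1, 2]
    for k in range(2, n + 1):
        g.append(2 * g[k - 2] + 2 * g[k - 1])
    return g[n]


def count_c_tokens(c):
    """Assumed closed form: weight-c strings plus S-initial weight-(c+1) strings."""
    return strings_of_weight(c) + 2 * strings_of_weight(c - 1)


def brute_force_c_tokens(c):
    """Enumerate all strings of length <= c and test left-minimality directly."""
    count = 0
    for length in range(1, c + 1):
        for s in product("ACGT", repeat=length):
            w = sum(WEIGHT[b] for b in s)
            # Dropping the first base gives the heaviest proper suffix.
            if w >= c and w - WEIGHT[s[0]] < c:
                count += 1
    return count


print(strings_of_weight(2), count_c_tokens(4), brute_force_c_tokens(4))
```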

18
Number of c-tokens
19
Packing LP Formulation
20
Garg-Konemann Algorithm

1. x ← 0; y ← δ // y_i are variables of the dual LP
2. Find min weight s-t path p, where weight(v) = y_i for every v ∈ V_i
3. While weight(p) < 1 do
   M ← max_i |p ∩ V_i|
   x_p ← x_p + 1/M
   For every i, y_i ← y_i (1 + ε |p ∩ V_i| / M)
   Find min weight s-t path p, where weight(v) = y_i for v ∈ V_i
4. For every p, x_p ← x_p / (1 - log_{1+ε} δ)

[GK98] The algorithm computes a factor (1-ε)^2 approximation to the optimal LP solution with (N/ε) log_{1+ε} N shortest path computations
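
A minimal sketch of this multiplicative-weights loop follows, under several assumptions that are not from the slides. The shortest-path oracle is replaced by a brute-force minimum over an explicit list of candidate paths, each path is given as a Counter mapping a set index i to |p ∩ V_i| (and must use at least one set), and the starting value `delta` follows the usual choice from the Garg-Konemann analysis with the number of sets as the size parameter. `eps` plays the role of the accuracy parameter.

```python
import math
from collections import Counter


def garg_konemann(paths, num_sets, eps=0.1):
    """Approximately maximize sum_p x_p subject to
    sum_p |p ∩ V_i| * x_p <= 1 for every set i."""
    delta = (1 + eps) * ((1 + eps) * num_sets) ** (-1 / eps)  # assumed choice
    y = [delta] * num_sets            # dual variables, one per set V_i
    x = [0.0] * len(paths)            # primal variables, one per path

    def weight(p):
        return sum(cnt * y[i] for i, cnt in p.items())

    while True:
        # Stand-in for the min-weight s-t path computation.
        j = min(range(len(paths)), key=lambda k: weight(paths[k]))
        if weight(paths[j]) >= 1:
            break
        m = max(paths[j].values())
        x[j] += 1 / m
        for i, cnt in paths[j].items():
            y[i] *= 1 + eps * cnt / m
    scale = 1 - math.log(delta, 1 + eps)   # final scaling step
    return [v / scale for v in x]


# Two single-set paths and one path using both sets; the LP optimum is 2.
paths = [Counter({0: 1}), Counter({1: 1}), Counter({0: 1, 1: 1})]
x = garg_konemann(paths, num_sets=2, eps=0.1)
print(sum(x))
```

In the tag design setting, the explicit path list would be replaced by a shortest-path computation in the layered c-token graph.
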
21
LP Based Tag Set Design

Run Garg-Konemann and store the minimum weight
paths in a list
Traversing the list in reverse order, pick tags
corresponding to paths if they are feasible and
do not share c-tokens with already selected tags
Mark used c-tokens and run the alphabetic tree
search algorithm to select additional tags

22
Periodic Tags [MT05]

Key observation: the c-token uniqueness constraint in the c-h code formulation is too strong
- A c-token should not appear in two different tags, but can be repeated in a tag
A tag t is called periodic if it is the prefix of (α)^∞ for some "period" α
Periodic strings make best use of c-tokens
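
To see why periodic tags are economical, the sketch below builds a periodic tag from a period and lists its distinct c-tokens. The period "AC", the thresholds c=4 and h=12, and all function names are illustrative assumptions rather than values from the slides.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def periodic_tag(period, h):
    """Shortest prefix of the infinite repetition of period with weight >= h."""
    tag, w, i = [], 0, 0
    while w < h:
        base = period[i % len(period)]
        tag.append(base)
        w += WEIGHT[base]
        i += 1
    return "".join(tag)


def is_periodic(tag, period):
    """True if tag is a prefix of period repeated forever."""
    return all(b == period[i % len(period)] for i, b in enumerate(tag))


def c_tokens(tag, c):
    """c-token ending at each position (shortest suffix of weight >= c)."""
    tokens = []
    for end in range(1, len(tag) + 1):
        w = 0
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append(tag[start:end])
                break
    return tokens


tag = periodic_tag("AC", h=12)
tokens = c_tokens(tag, c=4)
print(tag, len(tokens), sorted(set(tokens)))
# ACACACAC 6 ['ACA', 'CAC']
```

The tag contains six c-token occurrences but consumes only two distinct c-tokens, leaving more tokens available for other tags.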

23
c-token factor graph, c=4 (incomplete)
[Figure: part of the c-token factor graph; vertices shown include CC, AAG, AAC, AAAA, AAAT]
24
Vertex-disjoint Cycle Packing Problem

Given directed graph G, find maximum number of vertex-disjoint directed cycles in G
[MT 05]: APX-hard even for regular directed graphs with in-degree and out-degree 2
- (h - c/2 + 1) approximation factor for tag set design problem
[Salavatipour and Verstraete 05]
- Quasi-NP-hard to approximate within Ω(log^{1-ε} n)
- O(n^{1/2}) approximation algorithm

25
Cycle Packing Algorithm

1. Construct c-token factor graph G
2. T ← ∅
3. For all cycles C defining periodic tags, in increasing order of cycle length,
   - Add to T the tag defined by C
   - Remove C from G
4. Perform an alphabetic tree search and add to T tags consisting of unused c-tokens
5. Return T

Gives an increase of over 40% in the number of tags compared to previous methods

26
Experimental Results
27
Antitag-to-Antitag Hybridization

Additional constraint: antitags do not cross-hybridize, including self
Formalization in c-token hybridization model
(C3) No two tags contain complementary substrings of weight ≥ c
Cycle packing and tree search extend easily

28
Results w/ Extended Constraints
29
More Hybridization Constraints
[Figure: hybridization example involving tags t1 and t2]

Enforced during tag assignment by
- Leaving some tags unassigned and distributing primers across multiple arrays [Ben-Dor et al. 03]
- Exploiting availability of multiple primer candidates [MPT05]

30
Assignable Primers

If primer p hybridizes to the complement of tag t, at most one of the assignments (p,t), (p,t') and (p',t) can be made

Set P of primers is assignable to a set T of tags if the condition above is satisfied for every p, p' and t, t'

31
Characterization of Assignable Sets

Conflict graph G = (T ∪ P, E), where (t,p) ∈ E iff t and p hybridize
X = number of primers adjacent to a degree 1 tag
Y = number of degree 0 tags

Example (from figure): X = 1, Y = 2

[Ben-Dor 04] Set P is assignable to T iff X + Y ≥ |P|
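
The characterization translates directly into a counting test on the conflict graph. The sketch below assumes the criterion reads X + Y ≥ |P| and represents edges as (tag, primer) pairs; the function name and the small instance are hypothetical, chosen so that X = 1 and Y = 2.

```python
def is_assignable(tags, primers, edges):
    """Check X + Y >= |P| on the tag/primer conflict graph.

    edges is an iterable of (tag, primer) pairs that hybridize.
    """
    edges = list(edges)
    degree = {t: 0 for t in tags}
    for t, _p in edges:
        degree[t] += 1
    # X: primers adjacent to at least one tag of degree 1
    x = len({p for t, p in edges if degree[t] == 1})
    # Y: tags with no conflicts at all
    y = sum(1 for t in tags if degree[t] == 0)
    return x + y >= len(primers)


tags = ["t1", "t2", "t3"]
primers = ["p1", "p2", "p3"]
edges = [("t1", "p1")]           # X = 1 (p1), Y = 2 (t2, t3)
print(is_assignable(tags, primers, edges))  # True, since 1 + 2 >= 3
```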

32
Finding Assignable Primer Sets
Multiplexing Problem: given primer set P and tag set T, find partition of P into minimum number of assignable sets
Maximum Assignable Primer Set Problem: given primer set P and tag set T, find a maximum size assignable subset of P

Both problems are NP-hard [Ben-Dor 04]

33
Integration with Primer Selection

In practice, several primer candidates with
equivalent functionality
In SNP genotyping, can pick primer from either forward or reverse strand
In gene expression/identification applications,
many primers have desired length, Tm, etc.

34
Pooled Array Multiplexing Problem
Pooled Multiplexing Problem: Given set of primer pools P and tag set T, find a primer from each pool and a partition of selected primers into minimum number of assignable sets
35
X+Y Characterization Fails for Pools
36
Pooled Multiplexing Algorithms

Primer-Del: greedy deletion for pools similar to [Ben-Dor et al 04]

Repeatedly delete primer of maximum potential until X + Y ≥ #pools, where
- Potential of tag t is 2^{-deg(t)}
- Potential of primer p is sum of potentials of conflicting tags
- Subtract ½ if primer adjacent to a tag of degree 1
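
The greedy deletion rule can be sketched for the simplified case in which every pool holds a single primer, so the stopping test is X + Y ≥ number of remaining primers. This simplification, the tie-breaking by primer name, and the function names are assumptions made for illustration; the pooled version on the slide needs extra bookkeeping.

```python
def x_plus_y(tags, primers, edges):
    """Return (X + Y, tag degrees) for the conflict graph restricted to primers."""
    deg = {t: 0 for t in tags}
    for t, p in edges:
        if p in primers:
            deg[t] += 1
    x = len({p for t, p in edges if p in primers and deg[t] == 1})
    y = sum(1 for t in tags if deg[t] == 0)
    return x + y, deg


def primer_del(tags, primers, edges):
    """Delete max-potential primers until the remaining set is assignable."""
    primers = set(primers)
    while True:
        total, deg = x_plus_y(tags, primers, edges)
        if total >= len(primers):
            return primers

        def potential(p):
            nbrs = [t for t, q in edges if q == p]
            pot = sum(2.0 ** -deg[t] for t in nbrs)   # tag potential 2^-deg(t)
            if any(deg[t] == 1 for t in nbrs):
                pot -= 0.5                            # primer touches a degree-1 tag
            return pot

        # sorted() makes tie-breaking deterministic (first name wins).
        primers.remove(max(sorted(primers), key=potential))


kept = primer_del(["t1"], ["p1", "p2"], {("t1", "p1"), ("t1", "p2")})
print(kept)  # {'p2'}
```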

37
Pooled Multiplexing Algorithms

Primer-Del: greedy deletion for pools similar to [Ben-Dor et al 04]
Primer-Del+: same but never delete last primer from pool unless no other choice
Min-Pot: select primer with min potential from each pool, then run Primer-Del
Min-Deg: select primer with min degree, then run Primer-Del
Iterative ILP: iteratively find a maximum assignable pool set using integer linear program

38
Integer Linear Program for MAPS

where z_pt = 1 iff primer p is assigned to tag t

39
Results: GenFlex Tags, c=8
40
Results: GenFlex Tags, c=7
41
Results: 213 [MPT05] Tags, c=7
42
Herpes B Gene Expression Assay
GenFlex Tags
Periodic Tags
43
Overview

Tag Array Design
- Tag Set Design
- Tag Assignment Algorithms
SBE/SBH Assays
- Decoding and Multiplexing Algorithms
Conclusions

44
SBE/SBH Assay [MP 06]
[Figure: primers with extension bases (A/T), single-base extension (SBE), and hybridization to k-mer array (SBH); sequences shown include TTGCA, CCATT, GATAA]

45
Some notations

P = set of primers, X = set of probes
E_p ⊆ {A,C,T,G}: the set of possible extensions for primer p
The spectrum of primer p, Spec_X(p), is the set of probes hybridizing with p
The extended spectrum of primer p with extension set E_p is denoted Spec_X(p, E_p)

46
Decodable primer sets

Four parallel single-color SBE/SBH experiments → one type of extension in each SBE experiment
- P is weakly decodable with respect to extension e if for every primer p
One SBE/SBH experiment with 4 colors (4 extensions)
- P is weakly decodable if for every primer p and every extension e ∈ E_p

47
Strongly r-decodable primer sets

Hybridization involving labeled nucleotide is less predictable
→ Informative probes should not rely on it
Signal from one SNP may obscure signal from another when read at the same probe due to differences in DNA amplification efficiency
→ Informative probes cannot be shared between SNPs
P is strongly r-decodable if for every primer p
where r = redundancy parameter

48
MPPP

A set of primer pools P = {P_1, ..., P_n} is strongly r-decodable iff there is a primer p_i in each pool P_i such that {p_1, ..., p_n} is strongly r-decodable.

Minimum Pool Partitioning Problem (MPPP)
Given
- primer pools set P and extension sets E_p, for every primer p
- probe set X
- redundancy r
Find
- partition of P into the min number of strongly r-decodable subsets

49
MDPSP

Maximum r-Decodable Pool Subset Problem (MDPSP)
Given
- primer pools set P and extension sets E_p, for every primer p
- probe set X
- redundancy r
Find
- strongly r-decodable subset of P of maximum size

50
Min-Greedy Algorithm for Maximum Induced Matching
in General Graphs

1. Pick a vertex u of min degree
2. Pick a vertex v of min degree from among u's neighbors
3. Add edge (u,v) to the matching
4. Delete all neighbors of u and v from the graph
5. Repeat the above steps until the graph becomes empty
[Duckworth 05]: (d-1) approximation factor for d-regular graphs
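
The greedy steps can be sketched on an adjacency-set representation. Two details are assumptions not spelled out on the slide: isolated vertices are discarded before each pick (they cannot be matched), and ties are broken by vertex order. The function name is hypothetical.

```python
def min_greedy_induced_matching(adj):
    """Greedy induced matching; adj maps each vertex to its set of neighbors."""
    g = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    matching = []

    def remove(vertices):
        for w in list(vertices):
            for z in g.pop(w, set()):
                if z in g:
                    g[z].discard(w)

    while True:
        remove([u for u in g if not g[u]])       # drop isolated vertices
        if not g:
            break
        u = min(sorted(g), key=lambda w: len(g[w]))        # min-degree vertex
        v = min(sorted(g[u]), key=lambda w: len(g[w]))     # min-degree neighbor
        matching.append((u, v))
        remove(g[u] | g[v])                      # neighbors of u and v, incl. u and v
    return matching


# Path 1-2-3-4-5-6
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
print(min_greedy_induced_matching(path))  # [(1, 2), (4, 5)]
```

Because every neighbor of a matched edge is deleted, no later edge can touch or be adjacent to an earlier one, which is what makes the matching induced.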

51
Min-Greedy Algorithms for MDPSP

Bipartite hybridization graph G
- Primers in left side, probes in right side
Two types of edges
- N+(p) = Spec_X(p)
- N-(p) = Spec_X(p, E_p) \ Spec_X(p)
Two algorithm variants
- MinPrimerGreedy: pick primer first
- MinProbeGreedy: pick probe first
Delete primer/probe if N+ degree drops below r/1

52
Experimental results for k-mers (|E_p| = 4, primer length = 20)
53
MDPSP Size vs Primer Length
k = 10
54
Experimental results for c-tokens (|E_p| = 4, primer length = 20)
55
MDPSP Size vs Primer Length
c = 13
56
Overview

Tag Array Design
- Tag Set Design
- Tag Assignment Algorithms
SBE/SBH Assays
- Decoding and Multiplexing Algorithms
Conclusions

57
Conclusions and Ongoing Work

Combinatorial algorithms yield significant
increases in multiplexing rates of universal
arrays
New SBE/SBH architecture particularly promising
based on preliminary simulation results
Ongoing work
Extend methods to more accurate hybridization
models, e.g., use NN melting temperature models
More complex (e.g., temperature dependent) DNA
tag set non-interaction requirements for DNA
self/mediated assembly
Probabilistic decoding in presence of
hybridization errors
Application to novel domains, e.g., DNA barcoding

58
Acknowledgments