Design and Optimization of Universal DNA Arrays

1
Design and Optimization of Universal DNA Arrays
  • Ion Mandoiu
  • Computer Science & Engineering Department
  • University of Connecticut
  • http://www.engr.uconn.edu/ion/

2
DNA Arrays
  • Exploit Watson-Crick complementarity to
    simultaneously perform a large number of
    substring tests
  • Used in a variety of high-throughput genomic
    analyses
  • Transcription (gene expression) analysis
  • Single Nucleotide Polymorphism (SNP) genotyping
  • Alternative splicing, ChIP-on-chip, tiling
    arrays, genomic-based species identification,
    point-of-service diagnosis, ...
  • Common array formats involve direct hybridization
    between labeled DNA/RNA sample and DNA probes
    attached to a glass slide

3
SNP Genotyping
  • Genome variation: 0.1% of the DNA is different from
    one individual to another
  • 80% of the variation is represented by Single
    Nucleotide Polymorphisms (SNPs)
  • 2 possible nucleotides (alleles) for each SNP
  • SNP genotyping: determining the alleles present
    at the SNP sites
  • Highest throughput for SNP genotyping is achieved
    by high-density DNA microarrays based on direct
    hybridization

4
SNP genotyping via direct hybridization
Labeled sample
  • SNP1 with alleles T/G
  • SNP2 with alleles A/G

Array with 2 probes/SNP
Hybridization
5
Universal DNA Arrays
  • Programmable arrays
  • Array consists of application-independent
    oligonucleotides
  • Analysis carried out by a sequence of reactions
    involving application-specific primers
  • Flexible AND cost effective
  • Universal array architectures: tag arrays, APEX
    arrays, SBE/SBH arrays, ...

6
Overview
  • Tag Array Design
  • - Tag Set Design
  • - Tag Assignment Algorithms
  • SBE/SBH Assays
  • - Decoding and Multiplexing Algorithms
  • Conclusions

7
SNP Genotyping with Tag Arrays
[Figure: tag and primer of a reporter probe, extension bases, and antitag on the array]
  • 2. Solution phase hybridization: mix reporter probes
    with unlabeled genomic DNA
  • 3. Single-Base Extension (SBE)
  • 4. Solid phase hybridization

8
Tag Array Advantages
  • Cost effective
  • Same array used in many analyses → can be mass
    produced
  • Easy to customize
  • Only need to synthesize new set of reporter
    probes
  • Reliable
  • Solution phase hybridization better understood
    than hybridization on solid support

9
Tag Set Design Problem
[Figure: tags t1 and t2 with their antitags]
  • (H1) Tags hybridize strongly to complementary
    antitags
  • (H2) No tag hybridizes to a non-complementary
    antitag
  • (H3) Tags do not cross-hybridize to each other

Tag Set Design Problem: Find a maximum
cardinality set of tags satisfying (H1)-(H3)
10
Hybridization Models
  • Melting temperature Tm: temperature at which 50%
    of duplexes are in hybridized state
  • 2-4 rule
  • Tm = 2 (#As and Ts) + 4 (#Cs and Gs)
  • More accurate models exist, e.g., the
    nearest-neighbor model
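
The 2-4 rule is simple enough to state as a few lines of code. The sketch below is only an illustration; the function name tm_2_4_rule is an assumption made here and does not come from the slides.

```python
def tm_2_4_rule(seq: str) -> int:
    """Estimate the melting temperature (degrees C) with the 2-4 rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("C") + seq.count("G")
    if at + gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    # 2 degrees per weak base (A/T), 4 degrees per strong base (C/G)
    return 2 * at + 4 * gc


print(tm_2_4_rule("CCAGATT"))  # 4 weak + 3 strong bases: 2*4 + 4*3 = 20
```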

11
Hybridization Models (contd.)
  • Hamming distance model, e.g., [Marathe et al. 01]
  • Models rigid DNA strands
  • LCS/edit distance model, e.g., [Torney et al. 03]
  • Models infinitely elastic DNA strands
  • c-token model [Ben-Dor et al. 00]
  • Duplex formation requires formation of
    nucleation complex between perfectly
    complementary substrings
  • Nucleation complex must have weight ≥ c, where
    wt(A) = wt(T) = 1, wt(C) = wt(G) = 2 (2-4 rule)

12
c-h Code Problem
  • c-token: left-minimal DNA string of weight ≥ c,
    i.e.,
  • w(x) ≥ c
  • w(x') < c for every proper suffix x' of x
  • A set of tags is a c-h code if
  • (C1) Every tag has weight ≥ h
  • (C2) Every c-token is used at most once

c-h Code Problem [Ben-Dor et al. 00]: Given c and
h, find a maximum cardinality c-h code
13
Algorithms for c-h Code Problem
  • [Ben-Dor et al. 00] approximation algorithm based
    on DeBruijn sequences
  • Alphabetic tree search algorithm
  • Enumerate candidate tags in lexicographic order,
    save tags whose c-tokens are not used by
    previously selected tags
  • Easily modified to handle various combinations
    of constraints
  • [MT 05, 06] Optimum c-h codes can be computed in
    practical time for small values of c by using
    integer programming
  • Practical runtime using Garg-Konemann
    approximation and LP-rounding

14
Token Content of a Tag
  • c = 4
  • CCAGATT
  • CC
  • CCA
  • CAG
  • AGA
  • GAT
  • GATT

Tag → sequence of c-tokens
End pos:  2    3     4     5     6     7
c-token:  CC → CCA → CAG → AGA → GAT → GATT
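
The token content of a tag can be computed directly from the definition: for each end position, take the shortest substring ending there whose weight is at least c. The following sketch (the helper name c_tokens is an assumption made for this illustration) reproduces the example above.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def c_tokens(tag, c):
    """Return (end position, c-token) pairs; positions are 1-based."""
    tokens = []
    for end in range(1, len(tag) + 1):
        w = 0
        # Extend leftwards until the weight first reaches c.
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.append((end, tag[start:end]))
                break
    return tokens


print(c_tokens("CCAGATT", 4))
# [(2, 'CC'), (3, 'CCA'), (4, 'CAG'), (5, 'AGA'), (6, 'GAT'), (7, 'GATT')]
```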
15
Layered c-token graph for length-l tags
[Figure: layered graph with source s and sink t, layers indexed c/2, (c/2)+1, ..., l-1, l, and c-token vertices c1, ..., cN]
16
Integer Program Formulation [MPT05]
  • Maximum integer flow problem w/ set capacity
    constraints
  • O(hN) constraints & variables, where N = #c-tokens

17
Number of c-tokens
  • W = A or T, S = C or G
  • G_n = number of strings of weight n
  • ⇒ G_1 = 2, G_2 = 6, G_n = 2 G_{n-2} + 2 G_{n-1}
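
The recurrence can be checked numerically. The sketch below also counts c-tokens by brute force and compares the result with G_c + 2 G_{c-1}; that closed form is a derivation made for this illustration (a c-token has weight c, or weight c+1 and starts with C or G), with G_0 = 1 for the empty string, and is not stated on the slides.

```python
from itertools import product

WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}


def strings_of_weight(n):
    """G_n from the recurrence G_n = 2 G_{n-2} + 2 G_{n-1}."""
    g = [1, 2]  # G_0 = 1 (empty string), G_1 = 2 (A, T)
    for k in range(2, n + 1):
        g.append(2 * g[k - 2] + 2 * g[k - 1])
    return g[n]


def count_c_tokens_brute_force(c):
    """Count strings with weight >= c whose longest proper suffix has weight < c."""
    count = 0
    for length in range(1, c + 1):  # a c-token has at most c letters
        for s in product("ACGT", repeat=length):
            w = [WEIGHT[b] for b in s]
            if sum(w) >= c and sum(w[1:]) < c:
                count += 1
    return count


for c in range(1, 7):
    closed_form = strings_of_weight(c) + 2 * strings_of_weight(c - 1)
    print(c, count_c_tokens_brute_force(c), closed_form)
```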

18
Number of c-tokens
19
Packing LP Formulation
20
Garg-Konemann Algorithm
  • 1. x ← 0; y ← δ // y_i are variables of the dual LP
  • 2. Find min weight s-t path p, where weight(v) = y_i
    for every v ∈ V_i
  • 3. While weight(p) < 1 do
  •    M ← max_i |p ∩ V_i|
  •    x_p ← x_p + 1/M
  •    For every i, y_i ← y_i (1 + ε |p ∩ V_i| / M)
  •    Find min weight s-t path p, where weight(v) =
       y_i for v ∈ V_i
  • 4. For every p, x_p ← x_p / (1 - log_{1+ε} δ)

[GK98] The algorithm computes a factor (1-ε)^2
approximation to the optimal LP solution with
(N/ε) log_{1+ε} N shortest path computations
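
The update loop above can be sketched on a toy packing LP. As a simplification, the shortest-path step is replaced here by a scan over an explicit list of candidate paths, each given as a dict mapping a token class i to |p ∩ V_i|. The choice of δ follows the usual Garg-Konemann analysis, and the function and parameter names are assumptions made for this illustration.

```python
import math


def garg_konemann(columns, n_classes, eps=0.1):
    """Approximately maximize sum_p x_p s.t. sum_p columns[p][i] * x_p <= 1 for all i."""
    delta = (1 + eps) * ((1 + eps) * n_classes) ** (-1 / eps)
    y = [delta] * n_classes  # dual variables
    x = [0.0] * len(columns)  # primal variables, one per path

    def path_weight(col):
        return sum(a * y[i] for i, a in col.items())

    while True:
        # Stand-in for the min weight s-t path computation.
        p = min(range(len(columns)), key=lambda j: path_weight(columns[j]))
        if path_weight(columns[p]) >= 1:
            break
        m = max(columns[p].values())
        x[p] += 1 / m
        for i, a in columns[p].items():
            y[i] *= 1 + eps * a / m
    scale = 1 - math.log(delta) / math.log(1 + eps)  # 1 - log_{1+eps} delta
    return [v / scale for v in x]


# Three paths, each using two of three token classes; the LP optimum is 1.5.
paths = [{0: 1, 1: 1}, {1: 1, 2: 1}, {0: 1, 2: 1}]
x = garg_konemann(paths, n_classes=3)
print(sum(x), x)
```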
21
LP Based Tag Set Design
  • Run Garg-Konemann and store the minimum weight
    paths in a list
  • Traversing the list in reverse order, pick tags
    corresponding to paths if they are feasible and
    do not share c-tokens with already selected tags
  • Mark used c-tokens and run the alphabetic tree
    search algorithm to select additional tags

22
Periodic Tags [MT05]
  • Key observation: c-token uniqueness constraint in
    c-h code formulation is too strong
  • A c-token should not appear in two different
    tags, but can be repeated in a tag
  • A tag t is called periodic if it is the prefix of
    (α)^ω for some period α
  • Periodic strings make best use of c-tokens

23
c-token factor graph, c = 4 (incomplete)
[Figure: partial factor graph showing vertices CC, AAG, AAC, AAAA, AAAT]
24
Vertex-disjoint Cycle Packing Problem
  • Given directed graph G, find maximum number of
    vertex disjoint directed cycles in G
  • [MT 05] APX-hard even for regular directed graphs
    with in-degree and out-degree 2
  • h - c/2 + 1 approximation factor for tag set design
    problem
  • [Salavatipour and Verstraete 05]
  • Quasi-NP-hard to approximate within Ω(log^{1-ε} n)
  • O(n^{1/2}) approximation algorithm

25
Cycle Packing Algorithm
  • Construct c-token factor graph G
  • T ← ∅
  • For all cycles C defining periodic tags, in
    increasing order of cycle length,
  • Add to T the tag defined by C
  • Remove C from G
  • Perform an alphabetic tree search and add to T
    tags consisting of unused c-tokens
  • Return T
  • Gives an increase of over 40% in the number of
    tags compared to previous methods

26
Experimental Results
27
Antitag-to-Antitag Hybridization
  • Additional constraint: antitags do not
    cross-hybridize, including self
  • Formalization in c-token hybridization model
  • (C3) No two tags contain complementary substrings
    of weight ≥ c
  • Cycle packing and tree search extend easily
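
One way to test (C3) is sketched below. It assumes "complementary" means Watson-Crick reverse complement and uses the fact that any substring of weight at least c ends with a c-token, so it is enough to look for the reverse complement of each c-token in every tag, including the tag itself. The function names are assumptions made for this illustration.

```python
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def c_tokens(tag, c):
    """Return the set of c-tokens occurring in tag."""
    tokens = set()
    for end in range(1, len(tag) + 1):
        w = 0
        for start in range(end - 1, -1, -1):
            w += WEIGHT[tag[start]]
            if w >= c:
                tokens.add(tag[start:end])
                break
    return tokens


def reverse_complement(s):
    return "".join(COMPLEMENT[b] for b in reversed(s))


def violates_c3(tags, c):
    """True if two tags (or one tag with itself) contain complementary
    substrings of weight >= c."""
    for tag in tags:
        for token in c_tokens(tag, c):
            rc = reverse_complement(token)
            if any(rc in other for other in tags):
                return True
    return False


print(violates_c3(["AAAA", "CCCC"], 4))  # False
print(violates_c3(["AAAA", "TTTT"], 4))  # True: AAAA pairs with TTTT
print(violates_c3(["ACGT"], 4))          # True: CG is its own reverse complement
```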

28
Results w/ Extended Constraints
29
More Hybridization Constraints
[Figure: tags t1 and t2]
  • Enforced during tag assignment by
  • - Leaving some tags unassigned and distributing
    primers across multiple arrays [Ben-Dor et al.
    03]
  • - Exploiting availability of multiple primer
    candidates [MPT05]

30
Assignable Primers
  • If primer p hybridizes to the complement of tag
    t, at most one of the assignments (p,t), (p,t)
    and (p,t) can be made
  • Set P of primers is assignable to a set T of tags
    if the condition above is satisfied for every
    p, p' and t, t'

31
Characterization of Assignable Sets
  • Conflict graph
  • G = (T ∪ P, E), where (t,p) ∈ E if t and p
    hybridize
  • X = number of primers adjacent to a degree 1 tag
  • Y = number of degree 0 tags

[Figure: example conflict graph with X = 1, Y = 2]
  • [Ben-Dor 04] Set P is assignable to T iff
  • X + Y ≥ |P|
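
The X + Y test is cheap to evaluate on a conflict graph. In this sketch the graph is given as a set of distinct (tag, primer) pairs; the function name and the toy instance are assumptions made for this illustration.

```python
def is_assignable(primers, tags, conflicts):
    """Return (assignable?, X, Y) for a conflict graph given as (tag, primer) pairs."""
    deg = {t: 0 for t in tags}
    for t, _ in conflicts:
        deg[t] += 1
    # X: primers adjacent to at least one tag of degree 1
    x = len({p for t, p in conflicts if deg[t] == 1})
    # Y: tags with no conflicts at all
    y = sum(1 for t in tags if deg[t] == 0)
    return x + y >= len(primers), x, y


primers = ["p1", "p2", "p3"]
conflicts = {("t1", "p1"), ("t2", "p2"), ("t2", "p3")}
print(is_assignable(primers, ["t1", "t2", "t3", "t4"], conflicts))  # (True, 1, 2)
print(is_assignable(primers, ["t1", "t2", "t3"], conflicts))        # (False, 1, 1)
```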

32
Finding Assignable Primer Sets
Multiplexing Problem: given primer set P and tag
set T, find partition of P into minimum number of
assignable sets
Maximum Assignable Primer Set Problem: given
primer set P and tag set T, find a maximum size
assignable subset of P
  • Both problems are NP-hard [Ben-Dor 04]

33
Integration with Primer Selection
  • In practice, several primer candidates with
    equivalent functionality
  • In SNP genotyping, can pick primer from either
    forward or reverse strand
  • In gene expression/identification applications,
    many primers have desired length, Tm, etc.

34
Pooled Array Multiplexing Problem
Pooled Multiplexing Problem: Given set of primer
pools P and tag set T, find a primer from each
pool and a partition of selected primers into
minimum number of assignable sets
35
X+Y Characterization Fails for Pools
36
Pooled Multiplexing Algorithms
  • Primer-Del: greedy deletion for pools similar to
    [Ben-Dor et al 04]
  • Repeatedly delete primer of maximum potential
    until X + Y ≥ #pools, where
  • Potential of tag t is 2^(-deg(t))
  • Potential of primer p is sum of potentials of
    conflicting tags
  • Subtract ½ if primer adjacent to a tag of degree 1
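
A simplified sketch of the greedy deletion rule follows. It treats every pool as a single primer (so the stopping test is X + Y ≥ number of remaining primers), breaks ties by primer name, and uses hypothetical names throughout; the pooled version on the slide additionally tracks which pool each primer belongs to.

```python
def primer_del(primers, tags, conflicts):
    """Greedily delete primers of maximum potential until the rest is assignable.

    conflicts is a set of (tag, primer) pairs of the conflict graph.
    """
    remaining = set(primers)
    while remaining:
        deg = {t: 0 for t in tags}
        for t, p in conflicts:
            if p in remaining:
                deg[t] += 1
        x = len({p for t, p in conflicts if p in remaining and deg[t] == 1})
        y = sum(1 for t in tags if deg[t] == 0)
        if x + y >= len(remaining):
            break
        potential = {p: 0.0 for p in remaining}
        next_to_degree_one = set()
        for t, p in conflicts:
            if p in remaining:
                potential[p] += 2.0 ** (-deg[t])  # potential of tag t
                if deg[t] == 1:
                    next_to_degree_one.add(p)
        for p in next_to_degree_one:
            potential[p] -= 0.5
        victim = max(sorted(remaining), key=lambda p: potential[p])
        remaining.remove(victim)
    return remaining


all_pairs = {(t, p) for t in ["t1", "t2"] for p in ["p1", "p2", "p3"]}
print(primer_del(["p1", "p2", "p3"], ["t1", "t2"], all_pairs))  # {'p3'}
```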

37
Pooled Multiplexing Algorithms
  • Primer-Del: greedy deletion for pools similar to
    [Ben-Dor et al 04]
  • Primer-Del+: same but never delete last primer
    from pool unless no other choice
  • Min-Pot: select primer with min potential from
    each pool, then run Primer-Del
  • Min-Deg: select primer with min degree, then run
    Primer-Del
  • Iterative ILP: iteratively find a maximum
    assignable pool set using integer linear program

38
Integer Linear Program for MAPS
  • where z_pt = 1 iff primer p is assigned to tag t

39
Results: GenFlex Tags, c = 8
40
Results: GenFlex Tags, c = 7
41
Results: 213 [MPT05] Tags, c = 7
42
Herpes B Gene Expression Assay
GenFlex Tags
Periodic Tags
43
Overview
  • Tag Array Design
  • - Tag Set Design
  • - Tag Assignment Algorithms
  • SBE/SBH Assays
  • - Decoding and Multiplexing Algorithms
  • Conclusions

44
SBE/SBH Assay [MP 06]
[Figure: primers (e.g. TTGCA, CCATT, GATAA) with single-base extensions T and A; labeled steps: single-base extension (SBE) and hybridization to k-mer array (SBH)]
45
Some notations
  • P = set of primers, X = set of probes
  • E_p ⊆ {A,C,T,G}: the set of possible extensions for
    primer p
  • The spectrum of primer p, Spec_X(p), is the set of
    probes hybridizing with p
  • The extended spectrum of primer p with extension
    set E_p is denoted Spec_X(p, E_p)

46
Decodable primer sets
  • Four parallel single-color SBE/SBH experiments →
    one type of extension in each SBE experiment
  • P is weakly decodable with respect to extension e
    if for every primer p
  • One SBE/SBH experiment with 4 colors (4
    extensions)
  • P is weakly decodable if for every primer p and
    every extension e ∈ E_p

47
Strongly r-decodable primer sets
  • Hybridization involving labeled nucleotide is
    less predictable
  • → Informative probes should not rely on it
  • Signal from one SNP may obscure signal from
    another when read at the same probe due to
    differences in DNA amplification efficiency
  • → Informative probes cannot be shared between SNPs
  • P is strongly r-decodable if for every primer p
  • where r is the redundancy parameter

48
MPPP
  • A set of primer pools P = {P_1, ..., P_n} is strongly
    r-decodable iff there is a primer p_i in each pool
    P_i such that {p_1, ..., p_n} is strongly r-decodable.
  • Minimum Pool Partitioning Problem (MPPP)
  • Given
  • set of primer pools P and extension sets E_p for
    every primer p
  • probe set X
  • redundancy r
  • Find
  • partition of P into the min number of strongly
    r-decodable subsets

49
MDPSP
  • Maximum r-Decodable Pool Subset Problem (MDPSP)
  • Given
  • set of primer pools P and extension sets E_p for
    every primer p
  • probe set X
  • redundancy r
  • Find
  • strongly r-decodable subset of P of maximum size

50
Min-Greedy Algorithm for Maximum Induced Matching
in General Graphs
  • Pick a vertex u of min degree
  • Pick a vertex v of min degree from among u's
    neighbors
  • Add edge (u,v) to the matching
  • Delete all neighbors of u and v from the graph
  • Repeat the above steps until the graph becomes
    empty
  • [Duckworth 05] d-1 approximation factor for
    d-regular graphs
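
The steps above can be sketched as follows. The graph is an adjacency dict, ties are broken by vertex name, and isolated vertices are simply discarded; these details and the function name are assumptions made for this illustration.

```python
def min_greedy_induced_matching(adjacency):
    """Greedy induced matching: repeatedly match a min-degree vertex with its
    min-degree neighbor, then delete both together with all their neighbors."""
    adj = {u: set(vs) for u, vs in adjacency.items()}  # work on a copy
    matching = []

    def remove(vertices):
        for w in vertices:
            for z in adj.pop(w, set()):
                if z in adj:
                    adj[z].discard(w)

    while adj:
        u = min(sorted(adj), key=lambda w: len(adj[w]))
        if not adj[u]:  # isolated vertex: cannot be matched
            remove([u])
            continue
        v = min(sorted(adj[u]), key=lambda w: len(adj[w]))
        matching.append((u, v))
        remove({u, v} | adj[u] | adj[v])
    return matching


# Path a-b-c-d-e-f
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
        "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"}}
print(min_greedy_induced_matching(path))  # [('a', 'b'), ('d', 'e')]
```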

51
Min-Greedy Algorithms for MDPSP
  • Bipartite hybridization graph G
  • Primers in left side, probes in right side
  • Two types of edges
  • N+(p) = Spec_X(p)
  • N-(p) = Spec_X(p,E_p) \ Spec_X(p)
  • Two algorithm variants
  • MinPrimerGreedy: pick primer first
  • MinProbeGreedy: pick probe first
  • Delete primer/probe if N+ degree drops below r/1

52
Experimental results for k-mers (|E_p| = 4, primer
length = 20)
53
MDPSP Size vs Primer Length
k = 10
54
Experimental results for c-tokens (|E_p| = 4, primer
length = 20)
55
MDPSP Size vs Primer Length
c = 13
56
Overview
  • Tag Array Design
  • - Tag Set Design
  • - Tag Assignment Algorithms
  • SBE/SBH Assays
  • - Decoding and Multiplexing Algorithms
  • Conclusions

57
Conclusions and Ongoing Work
  • Combinatorial algorithms yield significant
    increases in multiplexing rates of universal
    arrays
  • New SBE/SBH architecture particularly promising
    based on preliminary simulation results
  • Ongoing work
  • Extend methods to more accurate hybridization
    models, e.g., use NN melting temperature models
  • More complex (e.g., temperature dependent) DNA
    tag set non-interaction requirements for DNA
    self/mediated assembly
  • Probabilistic decoding in presence of
    hybridization errors
  • Application to novel domains, e.g., DNA barcoding

58
Acknowledgments
  • Claudia Prajescu and Dragos Trinca
  • Funding from NSF (Awards 0546457 and 0543365) and
    UCONN Research Foundation