High Performance Direct Pairwise Comparison of Genomic Sequences - PowerPoint PPT Presentation

1 / 20
About This Presentation
Title:

High Performance Direct Pairwise Comparison of Genomic Sequences

Description:

High Performance Direct Pairwise Comparison of Genomic Sequences ... if(i % freq) { // Pipeline stall! out = VecStore(sum); Save(out) ... – PowerPoint PPT presentation

Number of Views:94
Avg rating:3.0/5.0
Slides: 21
Provided by: chrism54
Category:

less

Transcript and Presenter's Notes

Title: High Performance Direct Pairwise Comparison of Genomic Sequences


1
High Performance Direct Pairwise Comparison of
Genomic Sequences
  • Christopher Mueller, Mehmet Dalkilic, Andrew
    Lumsdaine
  • HiCOMB
  • April 4, 2005
  • Denver, Colorado

2
Introduction
  • Goals
  • Generate data for large format visualization
  • Exploit parallel features present in commodity
    hardware
  • SIMD/vector processors
  • SMP/multiple processors per machine
  • Clusters
  • Genome Comparison
  • Dot plot is the only complete method for
    comparing genomes
  • Often ruled out due to quadratic running time
  • Size of data has an upper bound and modern
    hardware is approaching the point where this
    bound is (almost) within reach
  • Target Data
  • DNA sequences, one direction (5 to 3)
  • Target Platform
  • Apple dual processor G5, Altivec vector processor

3
Related Work
  • BLAST
  • Apple and Genentech (AGBLAST), 5x speedup using
    Altivec
  • Smith-Waterman
  • Rognes and Seeberg, 6x speedup using MMX
  • HMMER
  • Erik Lindahl, 30 improvement using Altivec
  • Hardware Solutions
  • Various commercial FPGA solutions exist for
    different algorithms (e.g., TimeLogics DeCypher
    platform offers BLAST, HMM, SW)

4
SIMD Overview
  • Single Instruction, Multiple Data
  • Perform the same operation on many data items at
    once
  • Vector registers can be divided according to the
    data type
  • The Altivec registers in the G5 are 128 bits
    wide.
  • Vector programming using gcc on Apple G5s is one
    step removed from assembly programming
  • Functions are thin wrappers around assembly calls
  • The optimizer does not cover vector operations
  • Memory loads and stores are handled by the
    programmer and must be properly byte aligned

Image from http//developer.apple.com/hardware/ve
5
The Dot Plot
qseq
NAÏVE_DOTPLOT(qseq, sseq, win, strig) // qseq
- column sequence // sseq - row sequence //
win - number of elements to compare //
for each point // strig - number of matches
required // for a point for each q
in qseq for each s in sseq
score 0 for each (q, s) in
(qseqqqwin,
ssswin) if q s
score 1 end if q end for each
(q,s) if score gt strig
AddDot(q, s) end if score end for
each s end for each q
sseq
win 3 strig 2
Dotplot comparing the human and fly mitochondrial
genomes (generated by DOTTER)
6
The Standard Algorithm
STD_DOTPLOT(qScores, s, win, strig) dotvec
zeros(len(q)) for each char c in s dotvec
shift(dotvec, 1) dotvec qScoresc
if index(c) gt win delchar sindex(c) -
win dotvec - shift(qScoresdelchar,
win) for each dot in dotvec
gt strig display(dot) end for each
dot end for i end DOTPLOT
7
Data Parallel Dot Plot
VECTOR_DOTPLOT(qScores, s, win, strig) //
Group diagonals by the upper and lower //
triangular sections of the martix for each
vector diagonal D runningScore vector(0)
for each char c in s score
VecLoad(qScoresc) runningScore
VecAdd(score, r_score) if index(c) gt win
delChar sindex(c) - win
delscore VecLoad(qScoresdelChar)
runningScore VecSub(score, delscore) if
VecAnyElementGte(runningScore, strig)
scores VectorUnpack(runningScore) for
each score in scores gt strig
Output(row(c), col(score), score) end for
each score end if VecGte() end for
each c end for each D end VECTOR_DOTPLOT
8
Coarse Grained Parallelism
  • Block Level Parallelism
  • Block the matrix into columns
  • Overlap by the number of characters in the window
  • Single Machine
  • Run one thread per processor
  • Create one memory mapped file per processor
  • Cluster
  • Run one instance per machine and one thread per
    processor.
  • Store results locally (e.g. /tmp)

9
Model-driven Implementation
Goal Break the algorithm into basic operations
that can be modeled independently to understand
the performance issues at each step.
Data Streams (data read speed)
Vector Operations (instruction throughput)
Sparse Matrix Format
Data output
(data write speed)
10
Data Stream Models
Data Stream Performance (Mops)
// Base case // S-sequence is one stream
pointer s // Q-sequence is four
streams uchar qScore4 // Option 1 Four
Pointers // Keep pointers to the current //
position in the score vectors qScore0 qScore
1 qScore2 qScore3 score
qScores // Option 2 Index // Index the
score vectors with // a counter i score
qScoresi
  • Single stream pointer is similar to indexing, but
    a little slower
  • For the four score streams, indexed 1/4 of the
    time, maintaining the pointers costs more than
    lookup
  • Pointer vs. Index numbers varied based on the
    compiler version

11
Vector Performance Models
// Model Variables uchar data randseq(),
out16 long i 0, l len(data) vector uchar
sum 0, value // VecAdd for(i 0 i lt l -
16 i) value VecLoad(datai) sum
VecAdd(value, sum) // StoreAll for(i 0 i lt
l - 16 i) value VecLoad(datai) sum
VecAdd(value, sum) out VecStore(sum)
Save(out) // StoreFreq int freq l
alpha for(i 0 i lt l - 16 i) value
VecLoad(datai) sum VecAdd(value, sum)
if(i freq) // Pipeline stall! out
VecStore(sum) Save(out)
Vector Model Performance (Mops)
  • Attempts to model infrequent write operations
    were unsuccessful
  • Storing all dots yields high performance, but
    this is not practical for large comparisons
  • StoreFreq provides a lower bound on performance

12
Pipeline Management
// Sequence of Vector Operations // score
VecLoad(qScoresc) score1 vec_ld(0, ptemp)
// unalgined score2 vec_ld(16, ptemp) //
loads vperm vec_lvsl(0, ptemp) score
vec_perm(score1, score2, vperm) runningScore
vec_add(score, r_score) // delscore
VecLoad(qScoresdelChar) score1 vec_ld(0,
ptemp) score2 vec_ld(16, ptemp) vperm
vec_lvsl(0, ptemp) delscore
vec_perm(score1, score2, vperm) runningScore
vec_sub(score, delscore) if(vec_any_ge(runningSc
ore, strig)) scores vec_st(runningScore)
// Main processor for(i 0 i lt 16 i)
if(hiti gt ustrig ) dm.AddDot(y, x i,
hiti)
Cycle-accurate Plots of the Instructions in Flight
Each line shows each cycle for one instruction.
Instructions are offset (x-axis) based on
starting time. Time flows from top to bottom
(y-axis). The left plot shows a series of
add/delete steps with no dots generated. The
bottom plot shows the pipeline being interrupted
when a dot is generated.
13
Sparse Matrix Format
Sparse Matrix Format Performance (Mops)
// Option 1 // stdvector CSR-eqse Sparse
Matrix struct Dot int col int
value struct Row int num vectorltDotgt
cols typedef vectorltRowgt DotMatrixVec //
Option 2 // Memory Mapped Coordinate-wise //
Sparse Matrix struct RowDot int row int
col int value RowDot out
(RowDot)mmap()
6.78x
3.85x
1.0x
  • Both approaches required some maintenance to
    avoid exhausting main memory
  • mmap avoids a second pass through the data during
    the save step

14
Data Location
Data Location Performance (Mops)
  • Large, shared data is often located on network
    drives
  • This adds a network hop for all disk I/O
  • Even for infrequent I/O, this can significantly
    affect performance

1.98x
1.35x
1.0x
1.0x
  • The stdvector sparse matrix had a slight
    benefit.
  • The mmap sparse matrix improved significantly
    with local data storage.

15
Traditional Manual Optimizations
  • Prefetch
  • G5 hardware prefetch is very good
  • Attempts to optimize had negative impact
  • Blocking
  • Slight negative impact due to burps in the stream
  • Unrolling
  • Complicated code very quickly
  • No measurable improvement

16
System Details
  • Apple Dual 2.0 GHz G5, 3.5 GB RAM
  • 100 Mbit network to file server
  • OS X 10.3.5 (Darwin Kernel Version 7.5.0)
  • g 3.3 (build 1620)
  • -O3
  • -fast (different compiler, aggressive
    optimizations)
  • -altivec (limited optimizations)
  • Upgrade from 1614 to 1620 improved DOTTERs
    performance by 30
  • Libraries
  • Boostthread
  • Data (from GenBank)
  • Mitochondrial genomes
  • E. Coli, Listeria bacterial genomes

17
Results
Final Results (Mops)
13.0x
  • Single Machine
  • Mitochondrial (20 kbp)
  • DOTTER vs. Data-parallel
  • Bacterial (4.5 Mbp)
  • Data-parallel only
  • Cluster
  • (16 dual processor 2.3 GHz G5s)
  • Bacterial Comparison
  • 92 min, 8 sec (1 node)
  • 5 min, 42 sec (16 nodes)

7.0x
1.0x
Scalability
Scalability (time/nodes)
18
Visualization
  • Results rendered to PDF
  • Target Displays
  • 2x4, 6400x2400 tiled display wall
  • IBM T221, 3840x2400, 204 dpi display
  • Magnifying glass required
  • High resolution formats
  • 600 dpi laser printer
  • 1200 dpi ink jet printer
  • High resolution, no interactivity

19
Conclusions
  • Modern commodity hardware is close to providing
    the performance necessary for large direct
    genomic comparisons.
  • 5,000,000 base pair sequences are realistic
    (bacteria)
  • 50,000,000 base pair sequences are possible
    (small human chromosomes)
  • It is important to take a careful, experimental
    approach to implementation and to test all
    assumptions.

20
Acknowledgements
  • Jeremiah Willcock helped develop the initial
    prototype
  • Eric Wernert, Craig Jacobs, and Charlie Moad from
    the UITS Advanced Visualization Lab at Indiana
    University provided visualization support
  • This work was supported by a grant from the Lilly
    Endowment
  • References
  • Apple Developers Connection, Velocity Engine
    and Xcode, from, Apple Developer Connection,
    Cupertino, CA, 2004. http//developer.apple.com/ha
    rdware/ve http//developer.apple.com/tools/xcode
  • A. J. Gibbs and G. A. McIntyre, The diagram, a
    method for comparing sequences. Its use with
    amino acid and nucleotide sequences, Eur J
    Biochem, 16 (1970), pp. 1-11.
  • E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix
    Program with Dynamic Threshold Control Suited for
    Genomic DNA and Protein-Sequence Analysis,
    Gene-Combis, 167 (1995), pp. 1-10.
Write a Comment
User Comments (0)
About PowerShow.com