Title: High Performance Direct Pairwise Comparison of Genomic Sequences
1High Performance Direct Pairwise Comparison of
Genomic Sequences
- Christopher Mueller, Mehmet Dalkilic, Andrew
Lumsdaine - HiCOMB
- April 4, 2005
- Denver, Colorado
2Introduction
- Goals
- Generate data for large format visualization
- Exploit parallel features present in commodity
hardware - SIMD/vector processors
- SMP/multiple processors per machine
- Clusters
- Genome Comparison
- Dot plot is the only complete method for
comparing genomes - Often ruled out due to quadratic running time
- Size of data has an upper bound and modern
hardware is approaching the point where this
bound is (almost) within reach - Target Data
- DNA sequences, one direction (5 to 3)
- Target Platform
- Apple dual processor G5, Altivec vector processor
3Related Work
- BLAST
- Apple and Genentech (AGBLAST), 5x speedup using
Altivec - Smith-Waterman
- Rognes and Seeberg, 6x speedup using MMX
- HMMER
- Erik Lindahl, 30 improvement using Altivec
- Hardware Solutions
- Various commercial FPGA solutions exist for
different algorithms (e.g., TimeLogics DeCypher
platform offers BLAST, HMM, SW)
4SIMD Overview
- Single Instruction, Multiple Data
- Perform the same operation on many data items at
once - Vector registers can be divided according to the
data type - The Altivec registers in the G5 are 128 bits
wide. - Vector programming using gcc on Apple G5s is one
step removed from assembly programming - Functions are thin wrappers around assembly calls
- The optimizer does not cover vector operations
- Memory loads and stores are handled by the
programmer and must be properly byte aligned
Image from http//developer.apple.com/hardware/ve
5The Dot Plot
qseq
NAÏVE_DOTPLOT(qseq, sseq, win, strig) // qseq
- column sequence // sseq - row sequence //
win - number of elements to compare //
for each point // strig - number of matches
required // for a point for each q
in qseq for each s in sseq
score 0 for each (q, s) in
(qseqqqwin,
ssswin) if q s
score 1 end if q end for each
(q,s) if score gt strig
AddDot(q, s) end if score end for
each s end for each q
sseq
win 3 strig 2
Dotplot comparing the human and fly mitochondrial
genomes (generated by DOTTER)
6The Standard Algorithm
STD_DOTPLOT(qScores, s, win, strig) dotvec
zeros(len(q)) for each char c in s dotvec
shift(dotvec, 1) dotvec qScoresc
if index(c) gt win delchar sindex(c) -
win dotvec - shift(qScoresdelchar,
win) for each dot in dotvec
gt strig display(dot) end for each
dot end for i end DOTPLOT
7Data Parallel Dot Plot
VECTOR_DOTPLOT(qScores, s, win, strig) //
Group diagonals by the upper and lower //
triangular sections of the martix for each
vector diagonal D runningScore vector(0)
for each char c in s score
VecLoad(qScoresc) runningScore
VecAdd(score, r_score) if index(c) gt win
delChar sindex(c) - win
delscore VecLoad(qScoresdelChar)
runningScore VecSub(score, delscore) if
VecAnyElementGte(runningScore, strig)
scores VectorUnpack(runningScore) for
each score in scores gt strig
Output(row(c), col(score), score) end for
each score end if VecGte() end for
each c end for each D end VECTOR_DOTPLOT
8Coarse Grained Parallelism
- Block Level Parallelism
- Block the matrix into columns
- Overlap by the number of characters in the window
- Single Machine
- Run one thread per processor
- Create one memory mapped file per processor
- Cluster
- Run one instance per machine and one thread per
processor. - Store results locally (e.g. /tmp)
9Model-driven Implementation
Goal Break the algorithm into basic operations
that can be modeled independently to understand
the performance issues at each step.
Data Streams (data read speed)
Vector Operations (instruction throughput)
Sparse Matrix Format
Data output
(data write speed)
10Data Stream Models
Data Stream Performance (Mops)
// Base case // S-sequence is one stream
pointer s // Q-sequence is four
streams uchar qScore4 // Option 1 Four
Pointers // Keep pointers to the current //
position in the score vectors qScore0 qScore
1 qScore2 qScore3 score
qScores // Option 2 Index // Index the
score vectors with // a counter i score
qScoresi
- Single stream pointer is similar to indexing, but
a little slower - For the four score streams, indexed 1/4 of the
time, maintaining the pointers costs more than
lookup - Pointer vs. Index numbers varied based on the
compiler version
11Vector Performance Models
// Model Variables uchar data randseq(),
out16 long i 0, l len(data) vector uchar
sum 0, value // VecAdd for(i 0 i lt l -
16 i) value VecLoad(datai) sum
VecAdd(value, sum) // StoreAll for(i 0 i lt
l - 16 i) value VecLoad(datai) sum
VecAdd(value, sum) out VecStore(sum)
Save(out) // StoreFreq int freq l
alpha for(i 0 i lt l - 16 i) value
VecLoad(datai) sum VecAdd(value, sum)
if(i freq) // Pipeline stall! out
VecStore(sum) Save(out)
Vector Model Performance (Mops)
- Attempts to model infrequent write operations
were unsuccessful - Storing all dots yields high performance, but
this is not practical for large comparisons - StoreFreq provides a lower bound on performance
12Pipeline Management
// Sequence of Vector Operations // score
VecLoad(qScoresc) score1 vec_ld(0, ptemp)
// unalgined score2 vec_ld(16, ptemp) //
loads vperm vec_lvsl(0, ptemp) score
vec_perm(score1, score2, vperm) runningScore
vec_add(score, r_score) // delscore
VecLoad(qScoresdelChar) score1 vec_ld(0,
ptemp) score2 vec_ld(16, ptemp) vperm
vec_lvsl(0, ptemp) delscore
vec_perm(score1, score2, vperm) runningScore
vec_sub(score, delscore) if(vec_any_ge(runningSc
ore, strig)) scores vec_st(runningScore)
// Main processor for(i 0 i lt 16 i)
if(hiti gt ustrig ) dm.AddDot(y, x i,
hiti)
Cycle-accurate Plots of the Instructions in Flight
Each line shows each cycle for one instruction.
Instructions are offset (x-axis) based on
starting time. Time flows from top to bottom
(y-axis). The left plot shows a series of
add/delete steps with no dots generated. The
bottom plot shows the pipeline being interrupted
when a dot is generated.
13Sparse Matrix Format
Sparse Matrix Format Performance (Mops)
// Option 1 // stdvector CSR-eqse Sparse
Matrix struct Dot int col int
value struct Row int num vectorltDotgt
cols typedef vectorltRowgt DotMatrixVec //
Option 2 // Memory Mapped Coordinate-wise //
Sparse Matrix struct RowDot int row int
col int value RowDot out
(RowDot)mmap()
6.78x
3.85x
1.0x
- Both approaches required some maintenance to
avoid exhausting main memory - mmap avoids a second pass through the data during
the save step
14Data Location
Data Location Performance (Mops)
- Large, shared data is often located on network
drives - This adds a network hop for all disk I/O
- Even for infrequent I/O, this can significantly
affect performance
1.98x
1.35x
1.0x
1.0x
- The stdvector sparse matrix had a slight
benefit. - The mmap sparse matrix improved significantly
with local data storage.
15Traditional Manual Optimizations
- Prefetch
- G5 hardware prefetch is very good
- Attempts to optimize had negative impact
- Blocking
- Slight negative impact due to burps in the stream
- Unrolling
- Complicated code very quickly
- No measurable improvement
16System Details
- Apple Dual 2.0 GHz G5, 3.5 GB RAM
- 100 Mbit network to file server
- OS X 10.3.5 (Darwin Kernel Version 7.5.0)
- g 3.3 (build 1620)
- -O3
- -fast (different compiler, aggressive
optimizations) - -altivec (limited optimizations)
- Upgrade from 1614 to 1620 improved DOTTERs
performance by 30 - Libraries
- Boostthread
- Data (from GenBank)
- Mitochondrial genomes
- E. Coli, Listeria bacterial genomes
17Results
Final Results (Mops)
13.0x
- Single Machine
- Mitochondrial (20 kbp)
- DOTTER vs. Data-parallel
- Bacterial (4.5 Mbp)
- Data-parallel only
- Cluster
- (16 dual processor 2.3 GHz G5s)
- Bacterial Comparison
- 92 min, 8 sec (1 node)
- 5 min, 42 sec (16 nodes)
7.0x
1.0x
Scalability
Scalability (time/nodes)
18Visualization
- Results rendered to PDF
- Target Displays
- 2x4, 6400x2400 tiled display wall
- IBM T221, 3840x2400, 204 dpi display
- Magnifying glass required
- High resolution formats
- 600 dpi laser printer
- 1200 dpi ink jet printer
- High resolution, no interactivity
19Conclusions
- Modern commodity hardware is close to providing
the performance necessary for large direct
genomic comparisons. - 5,000,000 base pair sequences are realistic
(bacteria) - 50,000,000 base pair sequences are possible
(small human chromosomes) - It is important to take a careful, experimental
approach to implementation and to test all
assumptions.
20Acknowledgements
- Jeremiah Willcock helped develop the initial
prototype - Eric Wernert, Craig Jacobs, and Charlie Moad from
the UITS Advanced Visualization Lab at Indiana
University provided visualization support - This work was supported by a grant from the Lilly
Endowment - References
- Apple Developers Connection, Velocity Engine
and Xcode, from, Apple Developer Connection,
Cupertino, CA, 2004. http//developer.apple.com/ha
rdware/ve http//developer.apple.com/tools/xcode -
- A. J. Gibbs and G. A. McIntyre, The diagram, a
method for comparing sequences. Its use with
amino acid and nucleotide sequences, Eur J
Biochem, 16 (1970), pp. 1-11. -
- E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix
Program with Dynamic Threshold Control Suited for
Genomic DNA and Protein-Sequence Analysis,
Gene-Combis, 167 (1995), pp. 1-10.