Title: Massively Parallel Solutions for Molecular Sequence Analysis
1Massively Parallel Solutions for Molecular
Sequence Analysis
Prabhakar R. Gudla CMSC 838T Presentation
2Outline
- Motivation
- Smith-Waterman Algorithm
- Parallelization
- High Performance Computing
- Hybrid Architecture
- Fuzion 150
- Performance Evaluation
- Conclusions and Comments
3Motivation
- Discovered sequences are analyzed by comparison with databases
- Complexity is proportional to the product of query size and database size
=> Analysis is too slow on sequential computers
4Sequence Alignment
- Two possible approaches
  - Heuristics, e.g. BLAST, FASTA; but the more efficient the heuristics, the worse the quality of the results
  - Parallel processing: get high-quality results in reasonable time
- BLAST, FASTA, Smith-Waterman (S-W)
[Figure labels: Smith-Waterman, FASTA, BLAST]
6Parallelization of S-W
- Matrix cells along a single diagonal are computed in parallel
- Comparison is performed in l1 + l2 - 1 steps on l1 PEs
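The diagonal-by-diagonal schedule can be sketched in Python. This is a minimal illustration, not the implementation used on the hardware: the function name, the simple match/mismatch scoring (+2/-1) and the linear gap penalty of 1 are assumptions made for the example, and the inner loop over i stands in for the PEs that would update one diagonal simultaneously.

```python
def sw_score_wavefront(s1, s2, match=2, mismatch=-1, gap=1):
    """Smith-Waterman score, filling the matrix one anti-diagonal at a time.

    Returns (best_score, steps), where steps counts the anti-diagonals.
    Scoring parameters are illustrative assumptions, not from the slides.
    """
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 hold the zero boundary values.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    steps = 0
    # Cells with i + j == d form one anti-diagonal. Each of them depends only
    # on diagonals d-1 and d-2, so they could all be updated at the same time.
    for d in range(2, l1 + l2 + 1):
        steps += 1
        for i in range(max(1, d - l2), min(l1, d - 1) + 1):  # one "PE" per i
            j = d - i
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # extend both sequences
                          H[i - 1][j] - gap,      # gap in s2
                          H[i][j - 1] - gap)      # gap in s1
            best = max(best, H[i][j])
    return best, steps


if __name__ == "__main__":
    print(sw_score_wavefront("ATCTCGTATGATG", "GTCTATCAC"))
```

For two non-empty sequences the outer loop runs l1 + l2 - 1 times, which matches the step count quoted above.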
7Parallel Architectures
- Embedded Massively Parallel Accelerators
- Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan
9Previous Applications
- Volume Visualization [Schmidt 00]
- Automatic Visual Quality Control (Automobile Industry)
- Computer Tomography [Schmidt, Schimmler, and Schröder 98]
- Video Compression [Schmidt and Schimmler 99]
- Range of Transforms (Fourier, Wavelet, Hough, Radon) [Schmidt, Schimmler, and Schröder 99]
- Image Processing [Schimmler and Lang 96; Lenders and Schröder 90; Jiang, Edirisinghe, and Schröder 97]
10Hybrid Architecture
- Combines the SIMD and MIMD paradigms within one parallel architecture => Hybrid Computer
11Architecture of Systola 1024
- Instruction Systolic Array
- 32 x 32 mesh of processing elements
- wavefront instruction execution
12Mapping onto Systola 1024
a = query sequence (length equal to 1024)
b = subject sequence
[Figure: mapping of the sequences onto the ISA; remaining labels: c1 c0, X]
- Efficient routing on the ISA: Row Ringshift and Broadcast
- Subject sequences can be pipelined with only one step delay => k steps for a subject sequence of length k
13Fuzion 150 Architecture
[Block diagram labels: Linear SIMD Array (1536 PEs, each with 2 Kbytes DRAM), SIMD Controller, Instruction Fetch, Local Memory, Host, AGP, Rambus, FUZION Bus, 1, 2 or 4 Channels (6.4 GB/s), 32-bit EPU (ARC), Video I/O, Display]
- 0.25-µm, single-chip, SIMD architecture
- 1536 PEs @ 200 MHz => 300 GOPS
- 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
- Multithreading (control units interact via semaphores)
- Developed by Clearspeed Technology (UK) for graphics and network processing
14Fuzion 150 Architecture
[Figure: the linear array shown as Block 0, Block 1, ..., Block 5, each holding PE (b,0), PE (b,1), ..., PE (b,255); other labels: Local Memory, Fuzion Bus]
15Mapping onto the Fuzion 150
a = query sequence (length equal to 1536)
b = subject sequence, input as bk ... b1 b0
[Figure: mapping onto Block 0, Block 1, ..., Block 5; remaining labels: c1 c0, X]
- No fast global communication => 2-step local communication
- Subject sequences can be pipelined with only one step delay
17Performance Evaluation
- Scan times in seconds for TrEMBL 14 (351834 protein sequences) for various query sequence lengths
- Parallel implementation scales linearly with sequence length
- Computing time dominates data transfer time
- Fuzion 150 is about 25 times faster than a single Systola 1024 (difference in CMOS technology: 0.25 µm vs 1.0 µm)
18Performance Evaluation
- Time comparisons for a 10 Mbase search on different parallel architectures with different query lengths
- 4x faster than 16K-PE MasPar
- 6x faster than Kestrel
- 5x faster than SAMBA (special-purpose 3-board architecture)
19Performance Evaluation
Legend:
- USparc: Sun Ultrasparc 140 MHz
- B-SYS: 470-PE ISA
- Alpha: DEC Alpha 433 MHz
- 1K MP2: 1K-PE MasPar
- Paragon: 32-node Paragon
- Decy-1: 1-board Decypher-II
- Merc1: 1-board Mercury
- Bcll-1: Biocellerator
- Samba: 2-board Samba
- 16-MP2: 16K-PE MasPar
- FDF-3: 5-Board Paracell FDF
- Kestrel: 1-board Kestrel
- Decy-15: 15-board Decypher-II
Category labels: (single purpose), (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
21Conclusions
- Demonstrated how fine-grained and hybrid parallel architectures can be applied efficiently for Comparative Genomics
- Significant runtime savings for full genome comparisons and database searching
- Same systems can be used for accelerating other bioinformatics applications, e.g. Hidden Markov Models
22Comments
- With hardware support, is S-W as fast as BLAST?
Comparative search speeds on a 600 MHz 21264A Alpha machine (comparable MCUPS to the Hybrid System and Fuzion 150)
Time taken for the search (seconds) against the Swiss-Prot DB, per sequence under test:
Search tool      ELVIS (5)   Metr (276)   Arp_arath (536)
FASTA 3.3        4.3         20.0         25.0
BLAST 2.2        1.0         4.0          10.0
SSearch (SW)     6.0         240.0        565.0
HWare Accl.      3.2         16.8         29.7
Source: Shane Sturrock, SCS, 2(1), April 2002
23Comments
- Is it feasible to use S-W as the default?
  - Currently offered as a default option at EBI (European Bioinformatics Institute), which handles 15K queries per month with a full implementation of S-W
  - Depends on the objectives of the search
- Just how much more accurate is S-W?
  - 5-10% more sensitive towards divergent matches than BLAST (Shpaer et al., Genomics 38, 179-191, 1996)
  - BLAST will retrieve most biologically significant similarities, but will miss a few and will include some chance similarities
24Comparison of S-W VS BLAST
- Source: Shpaer et al., Genomics 38(2), pp. 179-191, 1996
- Is there a real difference in the results?
  - YES
25Comparison of S-W, FASTA, and BLAST
Note: The numbers in the table show for how many protein superfamilies (SF) the method in the column performed better than the one in the row
26Acknowledgements
- Dr. Bertil Schmidt
- Dr. Chau-Wen Tseng
27Q&A
28Extra Slides
29Full Genome Comparison
- Related organisms, but Tuberculosis causes a disease => find common and different parts
- 16 x 10^6 pairwise sequence comparisons
30Smith-Waterman Algorithm
- Optimal local alignment of two sequences
- Performs an exhaustive search for the optimal local alignment
- Complexity O(n x m) for sequence lengths n and m
- Based on the 'dynamic programming' (DP) algorithm
  - Fill the DP matrix using a substitution (mutation) matrix
  - Find the maximal value (score) in the matrix
  - Trace back from the score until a 0 value is reached
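The three DP steps (fill, find the maximum, trace back) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a simple match/mismatch score (+2/-1) stands in for a real substitution matrix, gaps cost a flat 1 per position, and the function name is hypothetical.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (score, aligned_s1, aligned_s2) for the best local alignment.

    Assumed scoring: match/mismatch instead of a substitution matrix,
    and a linear gap penalty.
    """
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    # Step 1: fill the DP matrix; step 2: remember the maximal value.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Step 3: trace back from the maximum until a 0 value is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    print(smith_waterman("ATCTCGTATGATG", "GTCTATCAC"))
```

Both loops visit every cell once, which is where the O(n x m) cost comes from.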
31Smith-Waterman Algorithm
- Aligning S1 and S2 of lengths l1 and l2 using recurrences
- Calculate three possible ways to extend the alignment
  - by one amino acid (AA) in each sequence
  - by one AA in the first sequence, aligned with a gap in the second
  - by one AA in the second sequence, aligned with a gap in the first
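One way to write these three extensions as recurrences is sketched below. It uses the names H, E, F and Sbt that appear on the instruction-count slide later in the deck. The symbols α and β for the gap-opening and gap-extension penalties, and the zero boundary conditions, are assumptions of this sketch rather than something the transcript states.

```latex
\begin{aligned}
H(i,j) &= \max\{\,0,\; E(i,j),\; F(i,j),\; H(i-1,j-1) + Sbt(a_i, b_j)\,\} \\
E(i,j) &= \max\{\,H(i,j-1) - \alpha,\; E(i,j-1) - \beta\,\} \\
F(i,j) &= \max\{\,H(i-1,j) - \alpha,\; F(i-1,j) - \beta\,\} \\
H(i,0) &= E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0
\end{aligned}
```

Here the Sbt term extends the alignment by one AA in each sequence, F extends it by a_i aligned with a gap, and E extends it by b_j aligned with a gap, for 1 ≤ i ≤ l1 and 1 ≤ j ≤ l2.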
32Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG, S2 = GTCTATCAC
[Figure: DP matrix for S1 and S2 with α = 1, β = 1; the highest cell value shown is 10]
33Principles of the ISA
34Principles of the ISA
[Figure label: Communication Register]
35Interface Processors
[Figure: ISA with Interface Processors North and Interface Processors West]
36Instruction Systolic Array
- Wavefront instruction execution => fast accumulation operations (e.g. row sum, broadcast, ringshift)
37Advantage of ISAs Performing Aggregate Functions
C = CWEST
C = C + CWEST
C = CWEST, C = CEAST
38Data Transfer
- In Systola 1024:
  - input of a new character (bj) into the lower western IP, and
  - when l1 > 2048, the input of previously computed H, E, and F cells and the output of H, E, and F cells
- For Fuzion 150: during the computation of 16 new H-cells in each PE, one new character is input via the Fuzion bus
39Instruction Counts
- Instruction count (IC) to update 2 and 16 H-cells in Systola 1024 and Fuzion 150, respectively
Operations in each PE per iteration step                  Systola   Fuzion
Get H(i-1, j), F(i-1, j), bj, maxi-1 from neighbor        20        22
Compute t = max{0, H(i-1, j-1) + Sbt(ai, bj)}             20        576
Compute F(i, j) = max{H(i-1, j) - α, F(i-1, j) - β}       8         336
Compute E(i, j) = max{H(i, j-1) - α, E(i, j-1) - β}       8         448
Compute H(i, j) = max{t, E(i, j), F(i, j)}                8         368
Compute maxi = max{H(i, j), maxi-1}                       4         184
Sum                                                       68        1934
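Because the two machines update different numbers of H-cells per iteration step, the sums are easier to compare per cell. The short Python sketch below redoes that arithmetic from the table. The per-cell figures are derived here and are not stated on the slide; the dictionary layout and function name are illustrative.

```python
# Instruction counts per iteration step, copied from the table above,
# together with the number of H-cells each PE updates in that step.
COUNTS = {
    "Systola 1024": {"ops": [20, 20, 8, 8, 8, 4], "cells": 2},
    "Fuzion 150": {"ops": [22, 576, 336, 448, 368, 184], "cells": 16},
}


def instructions_per_cell(name):
    """Return (total instructions per step, instructions per H-cell)."""
    entry = COUNTS[name]
    total = sum(entry["ops"])
    return total, total / entry["cells"]


if __name__ == "__main__":
    for name in COUNTS:
        total, per_cell = instructions_per_cell(name)
        print(f"{name}: {total} per step, {per_cell:.1f} per H-cell")
```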
40Maximum Characters/PE
- The memory per PE on Systola is 32 (16-bit) registers
  - 2 characters per PE is the maximum possible
  - (2 chars x 20 AAs per substitution row x 8 bits per substitution value = 20 registers)
- The memory per PE on Fuzion is 2 Kbytes
  - maximum chars per PE is 16
  - restricted due to indirect addressing per PE
41Indirect Address
- An addressing mode found in many processors'
instruction sets where the instruction contains
the address of a memory location which contains
the address of the operand (the "effective
address") or specifies a register which contains
the effective address
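The difference between direct, memory-indirect and register-indirect addressing can be illustrated with a toy memory model in Python. Everything here (the list used as memory, the register dictionary, the function names) is a hypothetical illustration and not the instruction set of any machine discussed in the deck.

```python
def load_direct(memory, address):
    """Direct addressing: the instruction names the operand's location."""
    return memory[address]


def load_indirect(memory, address):
    """Memory-indirect: the named location holds the effective address."""
    effective_address = memory[address]
    return memory[effective_address]


def load_register_indirect(memory, registers, reg):
    """Register-indirect: a register holds the effective address."""
    return memory[registers[reg]]


# Toy state: location 3 stores an address, location 7 stores the operand.
memory = [0] * 16
memory[3] = 7
memory[7] = 42
registers = {"r1": 7}

if __name__ == "__main__":
    print(load_direct(memory, 3))
    print(load_indirect(memory, 3))
    print(load_register_indirect(memory, registers, "r1"))
```

In a SIMD array, indirect addressing lets each PE use its own locally held value as the address, for instance to look up its own entry in a substitution table.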
42Myrinet - Overview
- Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, or single-board computers
- Conventional networks (e.g., Ethernet) can be used to build clusters, but do not provide the performance/features required for HPC or high-availability clustering
43Myrinet - Characteristics
- Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports
- Flow control, error control, and "heartbeat" continuity monitoring on every link
- Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications
- Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts
- Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets
44lq ? processors Hybrid
- Length of query sequence = M, number of processors in ISA = N^2, assuming M = k x N
- k <= N: each k x N subarray computes the alignment of the same query sequence with different subject sequences
- k > N:
  - k/N = 2: load 2 chars per PE
  - k/N > 2: split the query sequence into k/(2N) passes and load 2N^2 chars in each pass
45lq ? processors Fuzion 150
- Length of query sequence = M, number of processors = 1536
- k x M = 1536: k alignments of the same query sequence with different subject sequences are carried out in parallel
- k x 1536 = M:
  - Split into k passes: requires I/O of intermediate results in each step
  - Data transfers can be minimized by assigning k/M chars per PE; currently 16 chars per PE is the limit
46Concept of true and false hits
- The following cases were distinguished:
  - true positives: alignments between proteins of similar structure that fall above a given threshold (defined by the sequence alignment method)
  - false positives: alignments between proteins of dissimilar structure that fall above a given threshold of the sequence alignment
  - true negatives: alignments between proteins of dissimilar structure that fall below a given threshold
  - false negatives: alignments between proteins of similar structure that fall below a given threshold
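The four cases reduce to two yes/no questions: is the structure similar, and is the alignment score above the threshold? A minimal Python sketch follows. The function names and the tie rule (a score exactly equal to the threshold counts as below) are assumptions made for illustration.

```python
def classify_hit(similar_structure, score, threshold):
    """Label one alignment as a true/false positive/negative."""
    # Assumption: a score equal to the threshold is treated as "below".
    above = score > threshold
    if above:
        return "true positive" if similar_structure else "false positive"
    return "false negative" if similar_structure else "true negative"


def count_hits(hits, threshold):
    """Tally the four cases over (similar_structure, score) pairs."""
    counts = {"true positive": 0, "false positive": 0,
              "true negative": 0, "false negative": 0}
    for similar, score in hits:
        counts[classify_hit(similar, score, threshold)] += 1
    return counts


if __name__ == "__main__":
    example = [(True, 50), (False, 50), (False, 10), (True, 10)]
    print(count_hits(example, threshold=30))
```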
47Guidelines
- When to use S-W?
  - if you are looking for a protein distantly related to your query sequence (e.g., you have a known protein sequence and you want to find possible distant homologues)
  - if you are looking for the protein encoded in your low-quality DNA query sequence (e.g., you have a badly sequenced cDNA clone)
  - if you are looking for a DNA sequence corresponding to your protein query sequence (e.g., you want to identify potential homologues of your protein in the EST databases)
- When to use BLAST?
  - if you are looking for close matches and you don't mind missing lower homology sequences
  - if you want a quick answer
48Performance Evaluation of SAMBA
Query sequence length           10     30     100    300    1000   3000   10000
Samba, time (s)                 25     25     26     30     40     77     210
DEC-Alpha 150 MHz, time (s)     57     120    350    1041   3468   11510  38450
  speed-up                      2.3    4.8    13.5   34.7   86.7   150    183
SUN-Sparc 5 110 MHz, time (s)   95     239    746    2215   7300   24269  80300
  speed-up                      3.8    9.5    28.6   74     183    315    382
DEC 5000/250 40 MHz, time (s)   182    548    1407   4054   12920  41169  131193
  speed-up                      7.3    22     54     135    323    534    625
Source: Jamet and Lavenier, CABIOS, 12(7), 609-615, 1997
=> The longer the query length, the better the speed-up
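Each speed-up figure is the host time divided by the SAMBA time for the same query length. The Python sketch below recomputes the SUN-Sparc 5 row from the times in the table. The variable and function names are illustrative, and results are rounded to one decimal, so they may differ slightly from the rounding used in the source.

```python
# Times in seconds, copied from the table above.
lengths = [10, 30, 100, 300, 1000, 3000, 10000]
samba_s = [25, 25, 26, 30, 40, 77, 210]
sparc_s = [95, 239, 746, 2215, 7300, 24269, 80300]


def speedups(host_times, accel_times):
    """Speed-up = host time / accelerator time, per query length."""
    return [round(h / a, 1) for h, a in zip(host_times, accel_times)]


sparc_speedup = speedups(sparc_s, samba_s)

if __name__ == "__main__":
    for n, s in zip(lengths, sparc_speedup):
        print(f"query length {n:>5}: speed-up {s}")
```

The ratios grow with the query length, in line with the remark under the table.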
49Performance Evaluation of Kestrel
Legend:
- USparc: Sun Ultrasparc 140 MHz
- B-SYS: 470-PE ISA
- Alpha: DEC Alpha 433 MHz
- 1K MP2: 1K-PE MasPar
- Paragon: 32-node Paragon
- Decy-1: 1-board Decypher-II
- Merc1: 1-board Mercury
- Bcll-1: Biocellerator
- Samba: 2-board Samba
- 16-MP2: 16K-PE MasPar
- FDF-3: 5-Board Paracell FDF
- Kestrel: 1-board Kestrel
- Decy-15: 15-board Decypher-II
Category labels: (single purpose), (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
50Performance Evaluation of Splash-2
Hardware          Specifics          MCUPS
Splash-2          Unidir 16 boards   43,000
Splash-2          Bidir 16 boards    34,000
Splash-2          Unidir 1 board     3,000
Splash-2          Bidir 1 board      2,100
Splash-1          Bidir 746 PEs      370
SPARC 10/30 GX    gcc O2             1.2
VAX 6620          VMS CC             1.0
SPARC-1           gcc O2             0.87
486DX-50 PC       DOS gcc O2         0.67
Source: Hoang, IEEE-CMM, 185-191, 1993