Massively Parallel Solutions for Molecular Sequence Analysis

1
Massively Parallel Solutions for Molecular
Sequence Analysis
Prabhakar R. Gudla CMSC 838T Presentation
2
Outline
  • Motivation
  • Smith-Waterman Algorithm
  • Parallelization
  • High Performance Computing
  • Hybrid Architecture
  • Fuzion 150
  • Performance Evaluation
  • Conclusions and Comments

3
Motivation
Discovered sequences are analyzed by comparison
with databases
Complexity is proportional to the product of
query size times database size
⇒ Analysis too slow on sequential computers
4
Sequence Alignment
  • Two possible approaches
  • Heuristics, e.g. BLAST, FASTA, but the more
    efficient the heuristics, the worse the quality
    of the results
  • Parallel Processing, get high-quality results in
    reasonable time
  • BLAST, FASTA, Smith-Waterman (S-W)

[Figure labels: Smith-Waterman, FASTA, BLAST]
5
Outline
  • Motivation
  • Smith-Waterman Algorithm
  • Parallelization
  • High Performance Computing
  • Hybrid Architecture
  • Fuzion 150
  • Performance Evaluation
  • Conclusion and Comments

6
Parallelization of S-W
  • matrix cells along a single diagonal are computed
    in parallel
  • comparison is performed in l1 + l2 − 1 steps on l1 PEs
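The diagonal-by-diagonal schedule can be sketched in plain Python. This is a minimal illustration, not the implementation from the talk: the function name, the simple linear gap penalty, and the scores (+2 match, −1 mismatch, gap 1) are assumptions. The inner loop stands in for what the PEs would do simultaneously, so the outer loop counts the parallel steps.

```python
def wavefront_sw_score(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (best local score, number of anti-diagonal steps)."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    steps = 0
    # Anti-diagonal d holds all cells with (i - 1) + (j - 1) == d.
    for d in range(l1 + l2 - 1):
        lo = max(1, d - l2 + 2)
        hi = min(l1, d + 1)
        # Cells on one diagonal depend only on the two previous diagonals,
        # so on a parallel machine PE i would handle row i in a single step.
        for i in range(lo, hi + 1):
            j = d - i + 2
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            best = max(best, H[i][j])
        steps += 1
    return best, steps


if __name__ == "__main__":
    print(wavefront_sw_score("ATCTCGTATGATG", "GTCTATCAC"))
```

For sequences of length 13 and 9 the loop runs 13 + 9 − 1 = 21 times, matching the step count stated above.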

7
Parallel Architectures
  • Embedded Massively Parallel Accelerators
  • Other accelerators: Decypher, Biocellerator,
    GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2,
    BioScan

8
Outline
  • Motivation
  • Smith-Waterman Algorithm
  • Parallelization
  • High Performance Computing
  • Hybrid Architecture
  • Fuzion 150
  • Performance Evaluation
  • Conclusion and Comments

9
Previous Applications
  • Volume Visualization [Schmidt 00]
  • Automatic Visual Quality Control (Automobile
    Industry)
  • Computer Tomography [Schmidt, Schimmler, and
    Schröder 98]
  • Video Compression [Schmidt and Schimmler 99]
  • Range of Transforms (Fourier, Wavelet, Hough,
    Radon) [Schmidt, Schimmler, and Schröder 99]
  • Image Processing [Schimmler and Lang 96; Lenders
    and Schröder 90; Jiang, Edirisinghe, and Schröder
    97]

10
Hybrid Architecture
  • combines the SIMD and MIMD paradigms within one
    parallel architecture ⇒ Hybrid Computer

11
Architecture of Systola 1024
  • Instruction Systolic Array
  • 32 × 32 mesh of processing elements
  • wavefront instruction execution

12
Mapping onto Systola 1024
[Figure labels: a = query sequence (length equal to 1024); b = subject sequence]
  • Efficient routing on the ISA: Row Ringshift and
    Broadcast
  • Subject sequences can be pipelined with only one step
    delay ⇒ k steps for a subject sequence of length k

13
Fuzion 150 Architecture
[Figure labels: Linear SIMD Array (1536 PEs, each with 2 Kbytes DRAM); SIMD Controller; Instruction Fetch; Local Memory; Host; AGP; Rambus; FUZION Bus; 1, 2 or 4 Channels (6.4 GB/s); 32-bit EPU (ARC); Video I/O; Display]
  • 0.25-µm, single-chip SIMD architecture
  • 1536 PEs @ 200 MHz ⇒ 300 GOPS
  • 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
  • multithreading (control units interact via
    semaphores)
  • developed by Clearspeed Technology (UK) for
    graphics and network processing

14
Fuzion 150 Architecture
[Figure labels: Local Memory; Fuzion Bus; Block 0 with PE (0,0), PE (0,1), ..., PE (0,255); Block 1 with PE (1,0), PE (1,1), ..., PE (1,255); ...; Block 5 with PE (5,0), PE (5,1), ..., PE (5,255)]
15
Mapping onto the Fuzion 150
[Figure labels: Block 0, Block 1, ..., Block 5; a = query sequence (length equal to 1536); b = subject sequence, entering as bk ... b1 b0]
  • No fast global communication ⇒ 2-step local
    communication
  • Subject sequence can be pipelined with only one step
    delay

16
Contents
  • Motivation
  • Smith-Waterman Algorithm
  • Parallelization
  • High Performance Computing
  • Hybrid Architecture
  • Fuzion 150
  • Performance Evaluation
  • Conclusion and Comments

17
Performance Evaluation
  • Scan times in seconds for TrEMBL 14 (351834
    Protein Sequences) for various query sequence
    lengths
  • Parallel implementation scales linearly with
    sequence length
  • Computing time dominates data transfer time
  • Fuzion 150 is ≈25 times faster than a single
    Systola 1024 (difference in CMOS technology:
    0.25 µm vs 1.0 µm)

18
Performance Evaluation
  • Time comparisons for a 10 Mbase search on
    different parallel architectures with different
    query length
  • 4× faster than 16K-PE MasPar
  • 6× faster than Kestrel
  • 5× faster than SAMBA (special-purpose 3-board
    architecture)

19
Performance Evaluation
Legend:
  USparc: Sun Ultrasparc 140 MHz
  B-SYS: 470-PE ISA
  Alpha: DEC Alpha 433 MHz
  1K MP2: 1K-PE MasPar
  Paragon: 32-node Paragon
  Decy-1: 1-board Decypher-II
  Merc1: 1-board Mercury
  Bcll-1: Biocellerator
  Samba: 2-board Samba
  16-MP2: 16K-PE MasPar
  FDF-3: 5-board Paracell FDF
  Kestrel: 1-board Kestrel
  Decy-15: 15-board Decypher-II
  Markers used in the figure: (single purpose), (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
20
Outline
  • Motivation
  • Smith-Waterman Algorithm
  • Parallelization
  • High Performance Computing
  • Hybrid Architecture
  • Fuzion 150
  • Performance Evaluation
  • Conclusions and Comments

21
Conclusions
  • Demonstrated how fine-grained and hybrid parallel
    architectures can be applied efficiently for
    Comparative Genomics
  • Significant runtime savings for full genome
    comparisons and database searching
  • Same systems can be used for accelerating other
    bioinformatics applications, e.g. Hidden Markov
    Models

22
Comments
  • With hardware support, is S-W as fast as BLAST?

Comparative search speeds on 600 MHz 21264A Alpha
machine (comparable MCUPS as Hybrid System and
Fuzion 150)
Searches against Swiss-Prot DB; time taken for the search in seconds

Search tool       Sequence under test
                  ELVIS (5)   Metr (276)   Arp_arath (536)
FASTA 3.3               4.3         20.0              25.0
BLAST 2.2               1.0          4.0              10.0
SSearch (SW)            6.0        240.0             565.0
HWare Accl.             3.2         16.8              29.7
Source: Shane Sturrock, SCS, 2(1), April 2002
23
Comments
  • Is it feasible to use S-W as the default?
  • Currently offered as a default option at EBI
    (European Bioinformatics Institute), handles 15K
    queries per month w/ full implementation of S-W
  • Depends on the objectives of the search
  • Just how much more accurate is S-W?
  • 5-10 more sensitive towards divergent matches
    than BLAST (Shpaer et al., Genomics 38, 179-191,
    1996)
  • BLAST will retrieve most biologically significant
    similarities, but will miss a few and will
    include some chance similarities

24
Comparison of S-W vs. BLAST
  • Source: Shpaer et al., Genomics 38(2),
    pp. 179-191, 1996
  • Is there a real difference in the results?
  • YES

25
Comparison of S-W, FASTA, and BLAST
Note: The numbers in the table show for how many
protein SF the method in the column performed
better than the one in the row
26
Acknowledgements
  • Dr. Bertil Schmidt
  • Dr. Chau-Wen Tseng

27
Q&A
28
Extra Slides
29
Full Genome Comparison
  • Related organisms, but Tuberculosis causes a
    disease ⇒ find common and different parts
  • 16 × 10^6 pairwise sequence comparisons

30
Smith-Waterman Algorithm
  • Optimal local alignment of two sequences
  • Performs an exhaustive search for the optimal
    local alignment
  • Complexity O(n × m) for sequence lengths n and m
  • Based on the 'dynamic programming' (DP) algorithm
  • Fill the DP matrix using a substitution
    (mutation) matrix
  • Find the maximal value (score) in the matrix
  • Trace back from the score until a 0 value is
    reached
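The three steps above (fill the matrix, find the maximum, trace back to a 0) can be sketched as follows. This is a minimal illustration under stated assumptions: a plain match/mismatch score stands in for a real substitution matrix, gaps cost a fixed penalty per position, and the function name and default values (+2, −1, 1) are chosen here for the example.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned part of s1, aligned part of s2)."""
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0

    # Step 1 and 2: fill the DP matrix and remember the maximal score.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # align s1[i-1] with s2[j-1]
                          H[i - 1][j] - gap,       # s1[i-1] against a gap
                          H[i][j - 1] - gap)       # s2[j-1] against a gap
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j

    # Step 3: trace back from the maximum until a 0 is reached.
    a1, a2 = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    print(smith_waterman("ATCTCGTATGATG", "GTCTATCAC"))
```

With these assumed scores, the two sequences from the alignment example slide give a best score of 10 and the local alignment TCGTATGA / TC-TATCA.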

31
Smith-Waterman Algorithm
  • Aligning S1 and S2 of lengths l1 and l2 using
    recurrences
  • Calculate three possible ways to extend the
    alignment:
      • by one amino acid (AA) in each sequence
      • by one AA in the first sequence, aligned with
        a gap in the second
      • by one AA in the second sequence, aligned
        with a gap in the first
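Written out, the three extension cases lead to recurrences of the following form. This is a sketch in the H, E, F and Sbt notation of the Instruction Counts slide, assuming affine gap costs; the symbol names α (gap opening penalty) and β (gap extension penalty) are assumptions made here, and all boundary values are taken to be 0.

```latex
\begin{align*}
H(i,j) &= \max\{\,0,\; H(i-1,j-1) + Sbt(a_i, b_j),\; E(i,j),\; F(i,j)\,\} \\
E(i,j) &= \max\{\,H(i,j-1) - \alpha,\; E(i,j-1) - \beta\,\} \\
F(i,j) &= \max\{\,H(i-1,j) - \alpha,\; F(i-1,j) - \beta\,\}
\end{align*}
```

The diagonal term extends the alignment by one character in each sequence, while E and F cover the two cases in which a character is aligned with a gap.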

32
Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG, S2 = GTCTATCAC
[Figure: DP matrix for S1 and S2 with gap penalties α = 1, β = 1; the highest score in the matrix is 10]
33
Principles of the ISA
34
Principles of the ISA
[Figure label: Communication Register]
35
Interface Processors
[Figure labels: Interface Processors North; Interface Processors West; ISA]
36
Instruction Systolic Array
  • wavefront instruction execution ⇒ fast
    accumulation operations (e.g. row sum, broadcast,
    ringshift)

37
Advantage of ISAs Performing Aggregate Functions
  • Row Broadcast

C = C[WEST]
  • Row Sum

C = C + C[WEST]
  • Row Ringshift

C = C[WEST]; C = C[EAST]
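The three aggregate functions above can be modelled in a few lines. This is a minimal sketch, assuming the instruction wavefront sweeps a row from west to east, so that each PE reads the already updated communication register C of its western neighbour. The function names are illustrative, and the ringshift is modelled directly as a cyclic shift by one position (its direction is an assumption).

```python
def row_broadcast(c):
    """Every PE ends up with the value of the westernmost PE."""
    c = list(c)
    for x in range(1, len(c)):      # wavefront moves west to east
        c[x] = c[x - 1]             # C = C[WEST]
    return c


def row_sum(c):
    """Running sums; the easternmost PE ends up with the row total."""
    c = list(c)
    for x in range(1, len(c)):
        c[x] = c[x] + c[x - 1]      # C = C + C[WEST]
    return c


def row_ringshift(c):
    """Cyclic shift by one PE; the last value wraps to the first PE."""
    c = list(c)
    return c[-1:] + c[:-1]


if __name__ == "__main__":
    print(row_broadcast([7, 1, 2, 3]))
    print(row_sum([1, 2, 3, 4]))
    print(row_ringshift([1, 2, 3, 4]))
```

Because the instruction reaches each PE one step after its western neighbour, a single instruction is enough to propagate or accumulate a value across the whole row.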
38
Data Transfer
  • In Systola 1024:
      • input of a new character (bj) into the lower
        western IP, and
      • when l1 > 2048, the input of previously computed
        H, E, and F cells and output of H, E, and F cells
  • For Fuzion 150, during the computation of the 16
    new H-cells in each PE, one new character is input
    via the Fuzion bus

39
Instruction Counts
  • Instruction Count (IC) to update 2 and 16 H-cells
    in Systola 1024 and Fuzion 150, respectively

Operations in each PE per iteration step                  Systola   Fuzion
Get H(i−1, j), F(i−1, j), bj, max(i−1) from neighbor           20       22
Compute t = max{0, H(i−1, j−1) + Sbt(ai, bj)}                  20      576
Compute F(i, j) = max{H(i−1, j) − α, F(i−1, j) − β}             8      336
Compute E(i, j) = max{H(i, j−1) − α, E(i, j−1) − β}             8      448
Compute H(i, j) = max{t, E(i, j), F(i, j)}                      8      368
Compute max(i) = max{H(i, j), max(i−1)}                         4      184
Sum                                                            68     1934
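Dividing each total by the number of H-cells updated per iteration gives the instruction cost per cell. A small sketch using only the counts listed in the table; the variable and function names are illustrative.

```python
# Instruction counts per operation, in table order.
SYSTOLA_COUNTS = [20, 20, 8, 8, 8, 4]
FUZION_COUNTS = [22, 576, 336, 448, 368, 184]

# H-cells updated per PE in one iteration step.
SYSTOLA_CELLS = 2
FUZION_CELLS = 16


def instructions_per_cell(counts, cells):
    """Total instruction count divided by the H-cells it updates."""
    return sum(counts) / cells


if __name__ == "__main__":
    print(sum(SYSTOLA_COUNTS), instructions_per_cell(SYSTOLA_COUNTS, SYSTOLA_CELLS))
    print(sum(FUZION_COUNTS), instructions_per_cell(FUZION_COUNTS, FUZION_CELLS))
```

This works out to 34 instructions per H-cell on Systola 1024 and about 121 on Fuzion 150.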
40
Maximum Characters/PE
  • The memory per PE on Systola is 32 (16-bit)
    registers
      • 2 characters per PE is the maximum possible
      • (2 chars × 20 AAs per substitution row × 8 bits
        per substitution value = 20 registers)
  • The memory per PE on Fuzion is 2 Kbytes
      • maximum chars per PE is 16
      • restricted due to indirect addressing per PE

41
Indirect Address
  • An addressing mode found in many processors'
    instruction sets where the instruction contains
    the address of a memory location which contains
    the address of the operand (the "effective
    address") or specifies a register which contains
    the effective address
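The difference between direct and indirect addressing can be shown with a toy memory. This is an illustrative sketch only: the memory layout and function names are made up, and the last function reflects one plausible use in this setting (an assumption, not stated on the slides), where each PE uses its current subject character as an index into its own substitution row.

```python
def load_direct(memory, address):
    """The instruction carries the operand's address itself."""
    return memory[address]


def load_indirect(memory, pointer_address):
    """The instruction carries the address of a cell that holds
    the effective address of the operand."""
    effective_address = memory[pointer_address]
    return memory[effective_address]


def per_pe_lookup(substitution_rows, char_index_per_pe):
    """Each PE reads a different entry of its own row, selected by
    a value held locally (hypothetical SIMD-style table lookup)."""
    return [row[idx] for row, idx in zip(substitution_rows, char_index_per_pe)]


if __name__ == "__main__":
    memory = [0] * 16
    memory[3] = 9       # cell 3 holds an address
    memory[9] = 42      # cell 9 holds the operand
    print(load_direct(memory, 3), load_indirect(memory, 3))
    print(per_pe_lookup([[5, -1, -2], [-1, 6, -3]], [2, 1]))
```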

42
Myrinet - Overview
  • Myrinet is a cost-effective, high-performance,
    packet-communication and switching technology
    that is widely used to interconnect clusters of
    workstations, PCs, servers, or single-board
    computers
  • Conventional networks (e.g., ethernet) can be
    used to build clusters, but do not provide the
    performance/features required for HPC or
    high-availability clustering

43
Myrinet - Characteristics
  • Full-duplex 2+2 Gigabit/second data rate links,
    switch ports, and interface ports
  • Flow control, error control, and "heartbeat"
    continuity monitoring on every link
  • Low-latency, cut-through, crossbar switches, with
    monitoring for high-availability applications
  • Switch networks that can scale to tens of
    thousands of hosts, and that can also provide
    alternative communication paths between hosts
  • Host interfaces that execute a control program to
    interact directly with host processes ("OS
    bypass") for low-latency communication, and
    directly with the network to send, receive, and
    buffer packets

44
lq ≠ processors: Hybrid
  • Query sequence length M, number of processors
    in ISA N², assuming M = k × N
  • k ≤ N: each k × N subarray computes the alignment
    of the same query sequence with different subject
    sequences
  • k > N:
      • k/N = 2: load 2 chars per PE
      • k/N > 2: split query sequence into k/2N passes
        and load 2N² chars in each pass

45
lq ≠ processors: Fuzion 150
  • Length of query sequence M, number
    of processors 1536
  • k × M = 1536: k alignments of the same query
    sequence w/ different subject sequences carried
    out in parallel
  • k × 1536 = M:
      • Split into k passes; requires I/O of
        intermediate results in each step
      • Data transfers can be minimized by assigning k/M
        chars per PE; currently 16 chars per PE is the
        limit

46
Concept of true and false hits
  • The following cases were distinguished
  • true positives, alignments between proteins of
    similar structure that fall above a given
    threshold (defined by the sequence alignment
    method)
  • false positives, alignments between proteins of
    dissimilar structure that fall above a given
    threshold of the sequence alignment
  • true negatives, alignments between proteins of
    dissimilar structure that fall below a given
    threshold
  • false negatives, alignments between proteins of
    similar structure that fall below a given
    threshold
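The four cases above translate directly into a small classification routine. A minimal sketch: the function names, the example scores, and the threshold value are hypothetical, and "above the threshold" is read here as strictly greater.

```python
from collections import Counter


def classify_hit(score, threshold, similar_structure):
    """Label one alignment according to the four cases."""
    above = score > threshold
    if similar_structure:
        return "true positive" if above else "false negative"
    return "false positive" if above else "true negative"


def tally(hits, threshold):
    """Count the four cases over (score, similar_structure) pairs."""
    return Counter(classify_hit(s, threshold, sim) for s, sim in hits)


if __name__ == "__main__":
    # Hypothetical alignment scores with known structural similarity.
    hits = [(120, True), (35, True), (90, False), (20, False)]
    print(tally(hits, threshold=50))
```

Counting the four labels over a benchmark set is what allows methods to be compared on how many true relationships they find at a given false positive rate.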

47
Guidelines
  • When to use S-W?
      • if you are looking for a protein distantly
        related to your query sequence (e.g., you have a
        known protein sequence and you want to find
        possible distant homologues)
      • if you are looking for the protein encoded in
        your low-quality DNA query sequence (e.g., you
        have a badly sequenced cDNA clone)
      • if you are looking for a DNA sequence
        corresponding to your protein query sequence
        (e.g., you want to identify potential homologues
        of your protein in the EST databases)
  • When to use BLAST?
      • if you are looking for close matches and you
        don't mind missing lower homology sequences
      • if you want a quick answer

48
Performance Evaluation of SAMBA
Query sequence length                 10      30     100     300    1000    3000   10000
Samba, time (s)                       25      25      26      30      40      77     210
DEC-Alpha 150 MHz, time (s)           57     120     350    1041    3468   11510   38450
  speed-up                           2.3     4.8    13.5    34.7    86.7     150     183
SUN-Sparc 5 110 MHz, time (s)         95     239     746    2215    7300   24269   80300
  speed-up                           3.8     9.5    28.6      74     183     315     382
DEC 5000/250 40 MHz, time (s)        182     548    1407    4054   12920   41169  131193
  speed-up                           7.3      22      54     135     323     534     625
Source: Jamet and Lavenier, CABIOS, 12(7),
609-615, 1997
⇒ The longer the query length, the better the speed-up
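The speed-up rows are the workstation time divided by the SAMBA time for the same query length. A short sketch that recomputes them from the times listed above; the variable and function names are illustrative.

```python
QUERY_LENGTHS = [10, 30, 100, 300, 1000, 3000, 10000]
SAMBA_SECONDS = [25, 25, 26, 30, 40, 77, 210]
WORKSTATION_SECONDS = {
    "DEC-Alpha 150 MHz": [57, 120, 350, 1041, 3468, 11510, 38450],
    "SUN-Sparc 5 110 MHz": [95, 239, 746, 2215, 7300, 24269, 80300],
    "DEC 5000/250 40 MHz": [182, 548, 1407, 4054, 12920, 41169, 131193],
}


def speedups(times, reference=SAMBA_SECONDS):
    """Speed-up of the reference machine over `times`, per query length."""
    return [round(t / r, 1) for t, r in zip(times, reference)]


if __name__ == "__main__":
    for name, times in WORKSTATION_SECONDS.items():
        print(name, dict(zip(QUERY_LENGTHS, speedups(times))))
```

For every workstation the recomputed speed-up grows with the query length, which is the trend noted under the table.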
49
Performance Evaluation of Kestrel
Legend:
  USparc: Sun Ultrasparc 140 MHz
  B-SYS: 470-PE ISA
  Alpha: DEC Alpha 433 MHz
  1K MP2: 1K-PE MasPar
  Paragon: 32-node Paragon
  Decy-1: 1-board Decypher-II
  Merc1: 1-board Mercury
  Bcll-1: Biocellerator
  Samba: 2-board Samba
  16-MP2: 16K-PE MasPar
  FDF-3: 5-board Paracell FDF
  Kestrel: 1-board Kestrel
  Decy-15: 15-board Decypher-II
  Markers used in the figure: (single purpose), (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
50
Performance Evaluation of Splash-2
Hardware            Specifics      MCUPS
Splash-2 Unidir     16 boards     43,000
Splash-2 Bidir      16 boards     34,000
Splash-2 Unidir     1 board        3,000
Splash-2 Bidir      1 board        2,100
Splash-1 Bidir      746 PEs          370
SPARC 10/30 GX      gcc O2           1.2
VAX 6620            VMS CC           1.0
SPARC-1             gcc O2          0.87
486DX-50 PC         DOS gcc O2      0.67
Source: Hoang, IEEE-CMM, 185-191, 1993
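MCUPS stands for million cell updates per second, where one cell update is one entry of the DP matrix. Since the number of cells is the product of query length and database size, the figure can be computed as sketched below; the function name and the example numbers are hypothetical and not taken from the table.

```python
def mcups(query_length, database_residues, seconds):
    """Million DP-matrix cell updates per second for one database scan."""
    cell_updates = query_length * database_residues
    return cell_updates / seconds / 1e6


if __name__ == "__main__":
    # Hypothetical scan: a 1000-residue query against 10 million residues
    # finishing in 100 seconds.
    print(mcups(1000, 10_000_000, 100))
```

Because the measure normalises by both query and database size, it allows machines to be compared even when they were benchmarked on different searches.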