Parallel Computation in Biological Sequence Analysis: ParAlign - PowerPoint PPT Presentation

1
Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST
Larissa Smelkov
2
Biological Sequence Alignment
  • Global
    • Goal: To identify conserved regions and differences
    • Algorithm: Needleman-Wunsch
    • Application: Comparing two genes with same function (human vs. mouse); comparing two proteins with similar function
  • Local
    • Goal: To see whether 2 strings have a common substring
    • Algorithm: Smith-Waterman
    • Application: Searching for local similarities in large sequences (newly sequenced genomes); looking for motifs in 2 proteins
3
Protein Responsible for Iron Transport
  • Human
  • MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGK
    STLLKMLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMT
    VRELVAIGRYPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGG
    ERQRAWIAMLVAQDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVI
    AVLHDINMAARYCDYLVALRGGEMIAQGTPAEIMRGETLEMIYGIPMGIL
    PHPAGAAPVSFVY

  • Chicken
  • MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNL
    RDLTQQERISLTCVQKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYK
    LKPIAAEIYEHTEGSTTSYYAVAVVKKGTEFTVNDLQGKTSCHTGLGRSA
    GWNIPIGTLLHWGAIEWEGIESGSVEQAVAKFFSASCVPGATIEQKLCRQ
    CKGDPKTKCARNAPYSGYSGAFHCLKDGKGDVAFVKHTTVNENAPDLNDE
    YELLCLDGSRQPVDNYKTCNWARVAAHAVVARDDNKVEDIWSFLSKAQSD
    FGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIMLKRVPSLMSQLYLGFEY
    YSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRWSVVSNGDVECTVV
    DETKDCIIKIMKGEADAV
5
Similar Substrings
  • DSLSGGERQRAWIAMLVAQDSRC

  • DQLSGSPRQNRIQWIAVLKAEKSKC

6
Talk Outline
  • Problem Description
  • Smith-Waterman Algorithm
  • BLAST
  • ParAlign
  • TurboBLAST
  • Comparison

7
Problems of Comparison of 2 Sequences
  • Evolution Factor
    • Additions
    • Deletions
    • Substitutions
  • Human Factor
    • Typos
    • Duplicates

8
Solution
  • Smith-Waterman Algorithm (S-W)
  • Score Matrix
  • Gap Penalty

9
Score Matrix: BLOSUM45
10
Pairwise Alignment Example
ELEPHANT
PANTHER
11
S-W Dynamic Programming Matrix
12
S-W Formula
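  • $T_{i,j} = \max\{\, T_{i-1,j-1} + \mathrm{score}(s_i, t_j),\; T_{i-1,j} - g,\; T_{i,j-1} - g,\; 0 \,\}$
  • $g$: gap penalty; $g = 8$ in our example
  • Diagram: cell $T_{i,j}$ is computed from its neighbours $T_{i-1,j-1}$, $T_{i-1,j}$, and $T_{i,j-1}$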
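
The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: `toy_score` (+5 for a match, -4 for a mismatch) is an assumed stand-in for BLOSUM45, and the gap penalty is treated as a positive value subtracted from the neighbouring cell.

```python
def toy_score(a, b):
    # Assumption: simple match/mismatch scores standing in for BLOSUM45.
    return 5 if a == b else -4


def smith_waterman(s, t, score=toy_score, g=8):
    """Fill the local-alignment matrix T and return (best score, T)."""
    rows, cols = len(s) + 1, len(t) + 1
    T = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay at 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            T[i][j] = max(
                T[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align s_i with t_j
                T[i - 1][j] - g,  # gap in t
                T[i][j - 1] - g,  # gap in s
                0,                # start a new local alignment
            )
            best = max(best, T[i][j])
    return best, T


best, T = smith_waterman("ELEPHANT", "PANTHER")
print(best)
```

The two nested loops visit every cell once, which is where the O(mn) cost quoted later in the talk comes from.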
13
S-W Dynamic Programming Matrix
14
S-W Dynamic Programming Matrix
15
S-W Dynamic Programming Matrix
16
S-W Dynamic Programming Matrix
17
S-W Result Alignment
ELEPHANT P ANTHER
18
S-W Summary
  • Uses
    • Score matrix
    • Gap penalties
  • Complexity
    • O(mn)
  • Sensitivity
    • High

19
Growth of GenBank
33 million sequences as of Feb. 14, 2004
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
20
BLAST: Basic Local Alignment Search Tool
21
BLAST Steps
  • Divide both sequences into words of length w
    • default: w = 3
  • Calculate score for each pair
  • Extend high-scoring pairs to increase score
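
The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the real BLAST implementation: `toy_score`, the `threshold` value, and the greedy extension rule (extend while the next pair of letters scores positively) are all hypothetical choices made for the example.

```python
def toy_score(a, b):
    # Assumption: simple match/mismatch scores instead of a real score matrix.
    return 5 if a == b else -4


def words(seq, w=3):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def word_score(u, v, score=toy_score):
    """Ungapped score of two equal-length words."""
    return sum(score(a, b) for a, b in zip(u, v))


def seed_pairs(s, t, w=3, threshold=11):
    """Score every word pair and keep the high-scoring ones, best first."""
    pairs = [(word_score(u, v), i, j) for i, u in words(s, w) for j, v in words(t, w)]
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)


def extend(s, t, i, j, w=3, score=toy_score):
    """Greedy ungapped extension of the seed at s[i:i+w], t[j:j+w]."""
    lo_i, lo_j, hi_i, hi_j = i, j, i + w, j + w
    total = word_score(s[i:i + w], t[j:j + w], score)
    # Extend to the right while the next letters add to the score.
    while hi_i < len(s) and hi_j < len(t) and score(s[hi_i], t[hi_j]) > 0:
        total += score(s[hi_i], t[hi_j])
        hi_i += 1
        hi_j += 1
    # Extend to the left in the same way.
    while lo_i > 0 and lo_j > 0 and score(s[lo_i - 1], t[lo_j - 1]) > 0:
        lo_i -= 1
        lo_j -= 1
        total += score(s[lo_i], t[lo_j])
    return total, s[lo_i:hi_i], t[lo_j:hi_j]


for seed_score, i, j in seed_pairs("ELEPHANT", "PANTHER"):
    print(extend("ELEPHANT", "PANTHER", i, j))
```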

22
BLAST Divide Sequences
23
BLAST Calculate Score
24
BLAST Sort Pairs on Score
25
BLAST Extension
26
BLAST Summary
  • Uses
    • Score matrix
    • Gap penalties
    • Heuristics to reduce computations
  • Complexity
    • O(m) with O(n) processors
  • Sensitivity
    • Low

27
Sensitivity
  • Task: Align 2 sequences, AXBXCXDXE and ABCDE
  • Smith-Waterman
    AXBXCXDXE
    A B C D E
  • BLAST
    Ø (no similar substrings)
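
A short check makes the BLAST result on this slide concrete. The sketch below assumes exact word matches of length w = 3 as the seeding rule, which is a simplification; `shared_words` is a hypothetical helper name.

```python
def shared_words(s, t, w=3):
    """Words of length w that occur in both sequences."""
    ws = {s[i:i + w] for i in range(len(s) - w + 1)}
    wt = {t[i:i + w] for i in range(len(t) - w + 1)}
    return ws & wt


print(shared_words("AXBXCXDXE", "ABCDE"))  # prints set()
```

With no shared word there is no seed to extend, so a word-based heuristic finds nothing here, while the full dynamic-programming search can still line up A, B, C, D and E across the gaps.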
28
S-W vs. BLAST
[Plot: Speed vs. Sensitivity, with BLAST and S-W marked]
29
S-W and BLAST
  • Using them now
    • Too costly
    • Inefficient
    • Time-consuming
  • Solution
    • More heuristics
    • More parallelism

30
ParAlign
31
ParAlign Steps
  • Find ungapped alignments
  • Calculate approximate alignment scores
  • Choose high-scored sequences
  • Apply S-W
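
The filter-then-refine shape of these steps can be sketched as below. This is not ParAlign's actual scoring: as a loudly labelled simplification, the best ungapped diagonal score is used directly as the approximate alignment score, and `toy_score`, `keep`, and all function names are assumptions made for the example.

```python
def toy_score(a, b):
    # Assumption: simple match/mismatch scores instead of a real score matrix.
    return 5 if a == b else -4


def best_ungapped(s, t, score=toy_score):
    """Best ungapped local alignment: highest-scoring run on any diagonal."""
    best = 0
    for d in range(-(len(t) - 1), len(s)):  # diagonal offset d = i - j
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < len(s) and j < len(t):
            run = max(0, run + score(s[i], t[j]))
            best = max(best, run)
            i += 1
            j += 1
    return best


def smith_waterman_score(s, t, score=toy_score, g=8):
    """Smith-Waterman score with a linear gap penalty, one row at a time."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, 1):
            cur.append(max(prev[j - 1] + score(a, b), prev[j] - g, cur[j - 1] - g, 0))
        best = max(best, max(cur))
        prev = cur
    return best


def filtered_search(query, database, keep=2):
    """Rank by the cheap ungapped score, then run S-W only on the top `keep`."""
    ranked = sorted(database, key=lambda seq: best_ungapped(query, seq), reverse=True)
    rescored = [(smith_waterman_score(query, seq), seq) for seq in ranked[:keep]]
    return sorted(rescored, reverse=True)


print(filtered_search("ELEPHANT", ["PANTHER", "ZZZZ", "ELEGANT"]))
```

The expensive O(mn) step runs only on the sequences that survive the cheap filter, which is the source of the speed-up.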

32
ParAlign Microparallelism
  • Divide wide registers into smaller units
  • Perform the same operation on different data
    sources
  • Modern microprocessors have this technology built
    in
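
The idea of splitting one wide register into smaller units can be imitated in plain Python. The sketch below packs eight 8-bit lanes into one 64-bit integer and adds all lanes at once; it only illustrates the lane-wise principle and is not ParAlign's code, which relies on the processor's own SIMD instructions. All names are hypothetical.

```python
LANES = 8
MASK64 = (1 << 64) - 1
HIGH = 0x8080808080808080  # top bit of every 8-bit lane
LOW = 0x7F7F7F7F7F7F7F7F   # lower 7 bits of every 8-bit lane


def pack(values):
    """Pack eight small integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def add_lanes(a, b):
    """Add all eight lanes at once; each lane wraps modulo 256."""
    # Adding only the low 7 bits cannot carry into the neighbouring lane.
    low = (a & LOW) + (b & LOW)
    # The top bit of each lane is then fixed up with XOR.
    return (low ^ ((a ^ b) & HIGH)) & MASK64


x = pack([1, 2, 3, 4, 5, 6, 7, 8])
y = pack([10] * 8)
print(unpack(add_lanes(x, y)))
```

One call to `add_lanes` does the work of eight separate additions, which is the same trade a SIMD instruction makes in hardware.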

33
ParAlign Calculate Scores in Parallel
34
ParAlign Estimate of Gaps
35
ParAlign Apply S-W in Parallel
36
ParAlign Summary
  • Uses
    • SIMD technology (single instruction, multiple data)
    • S-W Algorithm
    • Heuristics to reduce computations
  • Requirement for machine
    • Modern microprocessor
  • Speed
    • Fast
  • Sensitivity
    • Medium

37
TurboBLAST
38
TurboBLAST Steps
  • Divide the job
  • Parts of query against partition of database
  • Apply BLAST
  • Merge results
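
The divide, search, and merge steps can be sketched with Python's standard thread pool. This is an illustration only: `toy_search` is a hypothetical stand-in for a BLAST run (it counts shared 3-letter words), and the chunk sizes and merge rule (concatenate, then sort) are assumptions, not TurboBLAST's actual behaviour.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def shared_words(a, b, w=3):
    """Number of distinct words of length w that occur in both sequences."""
    wa = {a[i:i + w] for i in range(len(a) - w + 1)}
    wb = {b[i:i + w] for i in range(len(b) - w + 1)}
    return len(wa & wb)


def toy_search(queries, partition):
    """Hypothetical stand-in for BLAST: report (query, subject, score) hits."""
    hits = []
    for q in queries:
        for s in partition:
            n = shared_words(q, s)
            if n > 0:
                hits.append((q, s, n))
    return hits


def parallel_search(queries, database, query_chunk=2, db_chunk=2, workers=4):
    # Divide the job: every query chunk is paired with every database partition.
    tasks = list(product(chunks(queries, query_chunk), chunks(database, db_chunk)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda task: toy_search(*task), tasks)
        hits = [hit for part in partial for hit in part]
    # Merge: one list, grouped by query with the best hits first.
    return sorted(hits, key=lambda h: (h[0], -h[2], h[1]))


print(parallel_search(["ELEPHANT", "PANTHER"], ["ELEGANT", "ANTHEM", "ZZZZ"]))
```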

39
TurboBLAST Implementation
  • A three-tier system
  • Components
    • Client
    • Master
    • Workers

40
TurboBLAST Schema
  • Client
    • Divides job into tasks
    • Writes results to file
  • Master
    • Sets up tasks
    • Manages execution
    • Coordinates Workers
    • Provides VSM
  • Workers
    • Divide task
    • Schedule subtasks
    • Solve subtasks
    • Merge results
  • Arrows between Client and Master: job, tasks, results
  • Arrows between Turbo Hub and Workers: task, request task, results

Work is distributed not by pushing it out to the machines, but
by simply posting information about what work needs to be done
and letting the machines grab that work from their remote
locations.
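
The pull model described here can be sketched with a shared task queue from Python's standard library. This is a single-machine illustration under stated assumptions, not TurboHub itself: the `board` queue plays the role of the posted work list, and `run_pull_workers` and `solve` are hypothetical names.

```python
import queue
import threading


def run_pull_workers(tasks, solve, n_workers=3):
    """Post all tasks on a shared board and let workers pull them."""
    board = queue.Queue()
    for task in tasks:
        board.put(task)  # post what needs to be done; nothing is assigned

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = board.get_nowait()  # the worker requests the next task
            except queue.Empty:
                return  # no work left on the board
            outcome = solve(task)
            with lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


print(sorted(run_pull_workers(range(5), lambda x: x * x)))
```

Because each worker asks for a new task only when it is free, faster workers naturally take more tasks, and workers can join or leave without the poster having to reassign anything.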
41
TurboBLAST Client
  • Takes a BLAST job and divides it into a number of
    initial BLAST tasks.
  • Submits these tasks to the Master
  • Retrieves the results, and writes them to file.

42
TurboBLAST Master
  • Accepts tasks from Clients and sets them up
    for processing by the Workers
  • Includes TurboHub (the server portion of a
    parallel execution system)
  • Includes File Provider (Java application that
    manages the databases)

43
TurboBLAST Worker
  • Workers are processors
  • Run a Java application and perform the BLAST
    computations
  • Merge the results
  • Are responsible for scheduling

44
TurboHub
  • TurboHub is an execution engine for parallel and
    distributed Java applications
  • Scalable high performance
  • Wide range of computing environments
  • Manages the flow of data through the workflows
  • Schedules the components
  • Transforms data between components
  • Balances load
  • Handles errors

45
TurboBLAST TurboHub
  • Manages task execution
  • Coordinates the Workers
  • Provides a virtual shared memory
  • Supports dynamic changes in the set of Workers
  • Supports fault tolerance

46
TurboBLAST File Provider
  • Maintains a copy of each database
  • Delivers all or part of each database to Workers
    as they require them

47
TurboBLAST Advantages
  • Size of each task is optimal
    • processing is efficient on the processor that
      computes the task
  • Large set of tasks
    • no waste of time for processors
  • No algorithm change
  • Support for all flavors of BLAST
  • Easy to update
  • Applicable for different environments (PC,
    Macintosh)

48
TurboBLAST Experiment
  • Input data
    • 500 proteins
    • 200-400 amino acids in each
  • Database
    • 1,681,522,266 sequences
  • Hardware
    • IBM Linux cluster
    • 8 dual-processor workstations
    • 2 Pentium III processors, 996 MHz each
    • 2 Gbyte memory
    • 100 Mbit Ethernet

49
TurboBLAST Results of Experiment
50
TurboBLAST Results of Experiment
51
TurboBLAST Summary
  • Divide and Conquer
    • Use many copies of BLAST in parallel
  • Uses BLAST Algorithm
  • Requirement for each machine
    • Java VM
    • Local BLAST executable
  • Speed
    • Very fast
  • Sensitivity
    • Low

52
Comparison of Algorithms/Products
[Plot: Speed vs. Sensitivity, with TurboBLAST, ParAlign, BLAST, and S-W marked]
53
References
  • R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing.
    TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub.
    Intl. Parallel and Distributed Processing Symposium (IPDPS), 2002.
  • T. Rognes.
    ParAlign: a parallel sequence alignment algorithm for rapid and
    sensitive database searches.
    Oxford University Press, 2001.

54
Don't ask any questions, please
55
PS
  • Web site where you can donate your computer time
    to participate in the search for methods to cure
    cancer
  • http://www.the-optimists.org.uk