Title: Parallel Computation in Biological Sequence Analysis
1Parallel Computation in Biological Sequence Analysis
- Xue Wu
- CMSC 838 Presentation
2Motivation
- Scanning and analyzing biological sequences are common, repeated tasks in molecular biology
- Homologous sequence searching
- Based on pairwise alignment
- The task is to find similarities between a particular query sequence and all the sequences in a biosequence databank
- Multiple sequence alignment
- Simultaneous alignment of three or more nucleotide or amino acid sequences
- Problems with the sequential solution
- With the exponential growth of the biosequence banks, homologous sequence searching becomes time consuming
- The automatic generation of an accurate multiple alignment is computationally expensive
- A parallel solution can reduce computation time and provide more accurate results
3Talk Overview
- Overview of talk
- Motivation
- Techniques and Evaluations
- Similarity Sequence Searching
- Multiple Sequence Alignment
- Observations
4Techniques similarity sequence searching
- Two main parallel methods to search a sequence database
- Fine-grain approach for SIMD parallel computers
- Parallelize the comparison algorithm itself
- All processors cooperate to determine the similarity score
- Coarser-grain approach for MIMD parallel computers
- Parallelize the database search
- Each processor performs a selected number of comparisons
- Method used in the paper
- Parallelize similarity searching with the coarser-grain approach
- Workload balancing is the key to better parallelism
- Partition the database and combine the results of the sequential search on each piece
- Requires equal-sized pieces of the database for load balance
- Percentage of Load Imbalance (PLIB) as the metric for load imbalance
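The slides name PLIB but do not define it. The sketch below assumes PLIB is the gap between the heaviest and the lightest processor load, relative to the heaviest, expressed as a percentage. Both that formula and the function name `plib` are assumptions made for illustration, not something stated on the slide.

```python
def plib(loads):
    """Percentage of load imbalance for a list of per-processor loads.

    Assumed definition (not given in the slides):
        100 * (largest load - smallest load) / largest load
    """
    largest = max(loads)
    smallest = min(loads)
    if largest == 0:
        # No work anywhere: treat as perfectly balanced.
        return 0.0
    return 100.0 * (largest - smallest) / largest
```

Under this assumed definition, two processors holding 100 residues each give a PLIB of 0, and the value grows toward 100 as one processor's share shrinks toward nothing.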
5Techniques similarity sequence searching
- Splitting up the database
- Unsorted portion method: the first load-balancing technique
- Partition the database into a number of portions
- Portion_size = database_size / processors_number
- If assigning a sequence causes the sum of sequence lengths in portion P to exceed the ideal size by more than X percent, reassign the sequence to portion P+1
- Low communication overhead, but possibly high PLIB
- Sorted portion method: master-worker method
- Sequences are sorted in decreasing length order to minimize PLIB
- The master processor distributes the sequences to the worker processors dynamically
- Low PLIB, but high communication overhead
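The unsorted portion rule above can be sketched as follows. This is one reading of the slide, not the paper's code: sequences are taken in database order, the current portion is filled until the next sequence would push it more than X percent over the ideal size, and from then on sequences go to the next portion. The function name `portion_partition` and the choice to let the last portion absorb any overflow are assumptions.

```python
def portion_partition(lengths, processors, x_percent):
    """Split sequences (given by their lengths, in database order) into portions.

    Returns (portions, loads): portions[p] lists the sequence indices assigned
    to portion p, and loads[p] is the sum of their lengths.
    """
    ideal = sum(lengths) / processors          # Portion_size
    limit = ideal * (1 + x_percent / 100.0)    # ideal size plus X percent
    portions = [[] for _ in range(processors)]
    loads = [0] * processors
    p = 0
    for i, n in enumerate(lengths):
        # Assumption: the last portion has nowhere to spill, so it takes the rest.
        if p < processors - 1 and loads[p] + n > limit:
            p += 1
        portions[p].append(i)
        loads[p] += n
    return portions, loads


portions, loads = portion_partition([50, 40, 30, 30, 20, 20, 10], 2, 10)
```

Because the split follows database order, one long sequence arriving at the wrong moment can leave neighbouring portions noticeably uneven, which is the "possibly high PLIB" drawback noted on the slide.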
6Techniques similarity sequence searching
- Proposed bucket method
- Statically apply the sorted portion method
- Algorithm
- Sequences in the database are sorted in decreasing length order
- Starting from the longest sequence, place the sequences in N buckets. For each sequence,
- Find the sum of the sequence lengths in each bucket
- Find the bucket with the smallest sum
- Place the sequence in that bucket
- In the case of a tie, the lowest-numbered bucket is selected
- Each of the N processors performs the sequence search in its own bucket
- If only N/n processors are used, each processor searches n buckets
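The bucket algorithm above can be sketched in a few lines. This is a minimal illustration that works on sequence lengths only; the function name `bucket_partition` and the returned data layout are illustrative choices, not taken from the paper.

```python
def bucket_partition(lengths, n_buckets):
    """Greedy static placement of sequences into n_buckets.

    Returns (buckets, sums): buckets[b] lists the sequence indices placed in
    bucket b, and sums[b] is the total length held by that bucket.
    """
    # Sort sequence indices by decreasing length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    buckets = [[] for _ in range(n_buckets)]
    sums = [0] * n_buckets
    for i in order:
        # list.index returns the first match, so ties go to the
        # lowest-numbered bucket, as the slide specifies.
        b = sums.index(min(sums))
        buckets[b].append(i)
        sums[b] += lengths[i]
    return buckets, sums


buckets, sums = bucket_partition([50, 40, 30, 30, 20, 20, 10], 2)
```

For these seven lengths and two buckets, both buckets end up holding 100 residues. The placement is computed once, before the search starts, so no sequences need to be handed out while the search is running.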
7Techniques similarity sequence searching
- Evaluation and comparison
- Comparison of Bucket and Portion method
- Comparison of Bucket and Master-worker method
- Algorithms are implemented on the Intel iPSC/860
- Preprocessing is performed on a SPARCstation 2
- Data source is GenBank (release 86.0)
- Preprocessing overhead is added for Bucket method
8Techniques similarity sequence searching
- Evaluation and comparison continued
- Conclusions
- In all tested cases, the proposed Bucket method has
- Lower PLIB than the Portion method
- Higher speedup than the master-worker method
- The Bucket method has a clear advantage when
- Sequence lengths are relatively small
- A large number of processors is used
9Techniques multiple sequence alignment
- Sequential Berger-Munson algorithm
10Techniques multiple sequence alignment
- Sequential Berger-Munson algorithm
- Applies randomized techniques with optimization to iteratively improve the multiple sequence alignment
- Description
- Randomly partition the input sequences into two groups
- Align the two groups of sequences, instead of individual sequences, with the alignment score calculated by
- If the new alignment score is higher than the previous one, the alignment is accepted and the gaps are inserted into the sequences accordingly.
- The modified or unmodified alignment is used as the input for the next iteration. The process stops after q consecutive rejected iterations.
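The control flow of the iteration described above can be sketched as follows. Only the loop structure (random two-group partition, accept on a higher score, stop after q consecutive rejections) comes from the slide. The group-to-group alignment step and the scoring function are passed in as the hypothetical callables `realign` and `score`, and the `toy_` functions are deliberately trivial stand-ins so that the sketch runs; they are not the Berger-Munson scoring.

```python
import random


def random_bipartition(n, rng):
    """Randomly split indices 0..n-1 into two non-empty groups (needs n >= 2)."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = rng.randint(1, n - 1)
    return sorted(idx[:cut]), sorted(idx[cut:])


def iterative_improve(alignment, realign, score, q, rng):
    """Repeat random two-group realignment until q consecutive rejections."""
    best = score(alignment)
    rejected = 0
    while rejected < q:
        group_a, group_b = random_bipartition(len(alignment), rng)
        candidate = realign(alignment, group_a, group_b)
        s = score(candidate)
        if s > best:
            alignment, best = candidate, s   # accept and continue from it
            rejected = 0
        else:
            rejected += 1                    # keep the unmodified alignment
    return alignment, best


# Toy stand-ins (assumptions): an "alignment" is one integer offset per
# sequence, realigning shifts group A by one, and the score rewards offsets
# that agree with each other.
def toy_realign(offsets, group_a, group_b):
    return [o + 1 if i in group_a else o for i, o in enumerate(offsets)]


def toy_score(offsets):
    return -sum(abs(x - y) for i, x in enumerate(offsets) for y in offsets[i + 1:])


final, best = iterative_improve([0, 3, 1], toy_realign, toy_score, 10, random.Random(0))
```

The toy score is an integer bounded above by zero, so the loop is guaranteed to stop. With a real scoring function, termination relies on the q-rejection rule in the same way.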
11Techniques multiple sequence alignment
- Parallel Berger-Munson algorithm with speculative computation
- Consecutive rejected iterations do not depend on each other and can be done in parallel
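The speculative idea can be sketched as follows: several candidate iterations are started from the same current alignment at once, on the guess that they will all be rejected. The first improvement in iteration order is accepted, and any results after it are discarded because they were computed from an alignment that is now out of date. This is a sketch under assumptions, not the paper's implementation: threads stand in for processors, and `propose`, `toy_propose` and `toy_score` are hypothetical placeholders for the real alignment and scoring steps.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def speculative_improve(alignment, propose, score, q, workers, rng):
    """Run `workers` speculative iterations per round until q consecutive rejections."""

    def trial(base, seed):
        candidate = propose(base, seed)
        return candidate, score(candidate)

    best = score(alignment)
    rejected = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while rejected < q:
            # Seeds are drawn up front so each speculative trial is reproducible.
            seeds = [rng.randrange(2 ** 32) for _ in range(workers)]
            outcomes = list(pool.map(trial, [alignment] * workers, seeds))
            for candidate, s in outcomes:    # examined in iteration order
                if s > best:
                    alignment, best = candidate, s
                    rejected = 0
                    break                    # later results are stale: discard
                rejected += 1
                if rejected >= q:
                    break
    return alignment, best


# Toy stand-ins (assumptions): one integer offset per sequence.
def toy_propose(offsets, seed):
    r = random.Random(seed)
    return [o + r.choice((0, 1)) for o in offsets]


def toy_score(offsets):
    return -sum(abs(x - y) for i, x in enumerate(offsets) for y in offsets[i + 1:])


final, best = speculative_improve([0, 3, 1], toy_propose, toy_score, 8, 4, random.Random(0))
```

The payoff depends on how often the guess holds: when most iterations are rejected, nearly all speculative work counts toward the q-rejection test, while a run of frequent acceptances throws most of it away.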
12Techniques multiple sequence alignment
- Evaluation
- Method: improve the alignments generated by experts and by another program (CLUSTALV)
- Data source
- Three different groups of immunoglobulin sequences from the Kabat Database (Beta Release 5.0)
- The average sequence lengths of the three groups are similar
- The number of sequences differs: each group has twice as many sequences as the previous group
13Techniques multiple sequence alignment
- Evaluation continued
- Alignment score comparison
- The Berger-Munson method provides more accurate alignments, but at a cost in computation time that is not negligible
14Techniques multiple sequence alignment
- Parallel algorithm speedup factor
- Conclusion
- The original iterative method is a good tool for improving alignment results
- With the parallel speculative computation technique, it can
- Increase the alignment score
- Reduce the computation time
- It can achieve higher speedup factors when
- Processing a large-sized sequence group
- Processing sequences with a high alignment score
- It cannot be compared with the previous algorithm by Ishikawa et al.
15Observations
- Similarity Sequence Searching
- With the increasing size of biosequence databases and the growth of computation power, coarse-grain parallelism for sequence searching is simpler and more effective
- The time required to process any given sequence depends not only on the length of the sequence, but also on
- The composition of the sequence
- CPU power and CPU availability
- So dynamic load balancing is still necessary.
- To minimize communication and scheduling overhead,
- Distribute sequences in fixed- or variable-size blocks
- Apply a buffering strategy to reduce data starvation and hide scheduling latency
- Multiple Sequence Alignment
- With increasing computation power, parallelizing a single multiple sequence alignment is not necessary. However, using parallelism to increase alignment accuracy is still attractive.
- Trading computation time for alignment accuracy
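The block-wise dynamic distribution suggested under similarity searching can be sketched with a shared work queue. In this sketch the queue plays the master's role, worker threads stand in for processors, and each worker fetches a whole block at a time so that scheduling cost is paid per block rather than per sequence. The names `make_blocks`, `dynamic_search` and the `compare` callable are hypothetical, and a fixed block size is assumed for simplicity.

```python
import queue
import threading


def make_blocks(seqs, block_size):
    """Cut the sequence list into fixed-size blocks (the last may be shorter)."""
    return [seqs[i:i + block_size] for i in range(0, len(seqs), block_size)]


def dynamic_search(seqs, compare, workers, block_size):
    """Workers pull blocks from a shared queue until it is empty."""
    work = queue.Queue()
    for block in make_blocks(seqs, block_size):
        work.put(block)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                block = work.get_nowait()    # one scheduling step per block
            except queue.Empty:
                return
            scored = [(s, compare(s)) for s in block]
            with lock:
                results.extend(scored)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Usage: `len` stands in for a real comparison against the query sequence.
hits = dynamic_search(["ACGT", "AC", "ACGTAC", "A", "ACG"], len, 3, 2)
```

Larger blocks mean fewer scheduling steps but coarser balancing near the end of the run; a variable block size that shrinks as the queue drains is one way to trade between the two.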
16Thank you!