Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari - PowerPoint PPT Presentation

Provided by: csg3

Learn more at: https://www.cs.gsu.edu

Transcript and Presenter's Notes


1
Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari

Nargess Memarsadeghi
CMSC 838 Presentation

2
Talk Overview

Overview of talk
Motivation
Background
Techniques
Evaluation
Related work
Observations

3
Motivation: EST Clustering

Problem: EST Clustering
Cluster fragments of cDNA
Related to the fragment assembly problem
Detecting overlapping fragments
Overlaps can be computed
Pairwise alignment algorithm
Dynamic programming
Alternative
Approximate overlap detection algorithms
Dynamic programming

4
Motivation

Common Tools
Take too long
Days for 100,000 ESTs
Run out of memory
This paper
PaCE
Parallel Clustering of ESTs
Efficient parallel EST Clustering
Space efficient algorithm
Reduce total work
Reduce run-time

5
Background: EST Clustering Tools

Three traditional software packages
Originally designed for fragment assembly
TIGR Assembler
Phrap
CAP3
One parallel software package
UICLUSTER: assumes ESTs are from the 3' end

6
EST Clustering Tools

Basic approach
Find pairs of similar sequences
Align similar pairs
Dynamic programming
Quality of EST clustering
Phrap: fastest
Avoids dynamic programming
Relies on approximation, lower quality
CAP: least number of erroneous clusters

7
EST Clustering Tools: Performance

With 50,000 maize ESTs
Using a PC with dual Pentium 450 MHz processors and 512 MB RAM
TIGR Assembler: ran out of memory
Phrap: 40 min
CAP: > 24 hours
With 100,000 maize ESTs
All ran out of memory
CAP would require 4 days

8
Goal

Space efficient algorithm
Space requirement linear in the size of the input data set
Reduce total work
Without sacrificing quality of clustering
Reduce run-time and facilitate the clustering of large data sets
Through parallel processing
Scale memory with the number of processors

9
Approach

Expense
Pairwise alignment (time and memory)
Promising pairs
Share a common substring s of length w
Cost: if the common substring has length l > w, the pair is generated l - w + 1 times
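
A small sketch of the cost noted above: when promising pairs are detected through fixed-length windows of size w, a pair whose common substring has length l is reported once per window, that is l - w + 1 times. The function names and the two example sequences are made up for illustration.

```python
def times_generated(l, w):
    """Number of length-w windows inside a common substring of length l."""
    return max(0, l - w + 1)


def shared_windows(a, b, w):
    """Length-w substrings of a (left to right) that also occur in b."""
    windows_of_b = {b[i:i + w] for i in range(len(b) - w + 1)}
    return [a[i:i + w] for i in range(len(a) - w + 1)
            if a[i:i + w] in windows_of_b]


# Hypothetical sequences that share the substring "ACGTAC" (l = 6).
est_a = "TTACGTACGG"
est_b = "CCACGTACAA"
hits = shared_windows(est_a, est_b, w=4)
print(hits, times_generated(6, 4))
```

Each entry in `hits` would trigger the same pair again, which is the redundancy the approach on the following slides is meant to avoid.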

10
Approach (Cont ..)

Approach
Use trie structure
Identify promising pairs
Merge clusters with strong overlaps
Avoid storing/testing all similar pairs
Parallel EST Clustering Software
Generalized Suffix Tree (GST)
Multiple processors
One maintains and updates EST clusters
Others generate batches of promising pairs and perform alignment

11
Approach (Cont )
12
Tries

Index for each char
N leaves
Height N

13
Suffix Tries (Cont ..)

TRIM suffix trie

14
Suffix Tries (Cont ..)

Indices
Storage O(n), constant is high though
Common string
Longest common substring

15
Suffix Tries (Cont ..)

[Figure: suffix trie with edges labeled a and b and leaves numbered 1 to 5]
Given a pattern P = ab, we traverse the tree according to the pattern.
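
A minimal sketch of the traversal described above, using a naive (uncompacted) suffix trie built from nested dictionaries. The input string "abaab" is an assumption chosen for illustration; it is not necessarily the string drawn on the slide. Unlike the O(n) structure mentioned earlier, this toy version takes quadratic space.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; each node records the suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def find_occurrences(root, pattern):
    """Follow the pattern from the root; return start positions of all matches."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # fell off the trie: the pattern does not occur
    return node["starts"]


trie = build_suffix_trie("abaab")
print(find_occurrences(trie, "ab"))
```

The search cost depends only on the pattern length, not on the length of the indexed text.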
16
Parallel Generation of GST

GST: Generalized Suffix Tree
Compacted trie
Longest common prefix found in constant time
Used for on-demand pair generation
Sequential: O(nl)
Parallel: O(nl/p)

17
Parallel Generation of GST (Cont )

Previous implementations
CRCW/CREW PRAM model
Work-optimal
Involves alphabetical ordering of characters
Unrealistic assumptions
synchronous operation of processors
infinite network bandwidth
no memory contention
Not practically efficient

18
Parallel Generation of GST (Cont )

Paper's approach
ESTs equally distributed among processors
Each processor
Partitions suffixes of ESTs into buckets
Distributes buckets to the processors
All suffixes in a bucket allocated to the same processor
Total number of suffixes allocated to a processor: O(nl/p)
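
The bucketing step above can be sketched as follows. The slide does not say how suffixes are assigned to buckets or how buckets are balanced, so two assumptions are made here: suffixes are keyed by their first w characters, and buckets are handed out largest-first to the least-loaded processor. Both function names are hypothetical, and this runs in a single process rather than under MPI.

```python
import heapq
from collections import defaultdict


def bucket_suffixes(ests, w):
    """Group suffixes (est_id, position) by their first w characters (assumed key)."""
    buckets = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - w + 1):
            buckets[seq[pos:pos + w]].append((est_id, pos))
    return dict(buckets)


def assign_buckets(buckets, p):
    """Greedy balancing (an assumption): largest bucket to least-loaded processor."""
    heap = [(0, proc) for proc in range(p)]
    heapq.heapify(heap)
    owner = {}
    for key in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        load, proc = heapq.heappop(heap)
        owner[key] = proc  # every suffix in this bucket goes to one processor
        heapq.heappush(heap, (load + len(buckets[key]), proc))
    return owner


buckets = bucket_suffixes(["ACGT", "CGTA"], w=2)
print(assign_buckets(buckets, p=2))
```

Because a bucket is never split, each one can later be turned into a subtree of the GST by its owner without communication.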

19
Parallel Generation of GST (Cont )

Each bucket's processor
Computes compacted trie of all its suffixes
Cannot use sequential construction
Suffixes of a string
not in the same bucket
Each bucket
Subtree in the GST
Nodes
Depth first search traversal of the trie
Pointer to the right most child

20
On-demand Pair Generation

A pair should be generated if
The ESTs share a substring of length at least threshold
The shared substring is maximal
Leaves under a common node
Share a substring whose length is the depth of the node
Parallel algorithm
Each processor works with its trie if
Depth of its root in GST < threshold

21
On-demand Pair Generation

To process
Sort internal nodes
In decreasing order of depth
Lists of a node
Generated after the node is processed
Removed after its parent is processed
Limits space to O(nl)
Run time: number of pairs generated + cost of sorting
Rejected pairs increase run-time by a factor of 2
Eliminating duplicates reduces run-time
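
A simplified sketch of generating pairs in decreasing order of node depth. It is not the paper's algorithm: instead of a compacted GST with per-node lists, it enumerates every substring of length at least threshold (each acting as a trie node whose depth is the substring length), and it removes duplicates with an in-memory set, which the space-efficient method is designed to avoid. The function name and sequences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def promising_pairs(ests, threshold):
    """Yield (depth, i, j), deepest shared substring first, each pair only once."""
    nodes = defaultdict(set)  # substring (stand-in for a trie node) -> EST ids
    for est_id, seq in enumerate(ests):
        for start in range(len(seq)):
            for end in range(start + threshold, len(seq) + 1):
                nodes[seq[start:end]].add(est_id)
    seen = set()
    # Process "nodes" in decreasing order of depth (substring length).
    for label in sorted(nodes, key=lambda s: (-len(s), s)):
        for pair in combinations(sorted(nodes[label]), 2):
            if pair not in seen:
                seen.add(pair)
                yield len(label), pair[0], pair[1]


ests = ["ACGTAC", "CGTACC", "TTTTGT"]
print(list(promising_pairs(ests, threshold=3)))
```

Because deeper nodes are visited first, the pairs with the longest shared substrings, which are the most likely to align well, are produced earliest.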

22
Parallel Clustering

Master-Slave paradigm
Master processor
Maintains and updates clusters
Using union-find data structure
Receives messages from slave processors
A batch of next promising pairs generated by
slave
Results of the pairwise alignment
Determines which ones to explore
Determines if merging should occur
Slave processors
Generate pairs on demand
Perform pairwise alignments of pairs dispatched
by the master processor
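
A minimal sketch of the master's bookkeeping, assuming a standard union-find with path halving and union by size. The helper names `select_pairs_to_align` and `apply_alignment_results` are hypothetical, and the message passing between master and slaves is left out.

```python
class UnionFind:
    """Disjoint sets over EST ids 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx  # attach the smaller tree under the larger
        self.size[rx] += self.size[ry]
        return True


def select_pairs_to_align(clusters, batch):
    """Keep only pairs whose ESTs are still in different clusters."""
    return [(i, j) for i, j in batch if clusters.find(i) != clusters.find(j)]


def apply_alignment_results(clusters, results):
    """Merge clusters for every pair whose alignment was accepted."""
    for i, j, accepted in results:
        if accepted:
            clusters.union(i, j)


clusters = UnionFind(5)
apply_alignment_results(clusters, [(0, 1, True), (2, 3, False)])
print(select_pairs_to_align(clusters, [(0, 1), (1, 2), (2, 3)]))
```

Filtering a batch before dispatch is what lets the master skip alignments for pairs that are already known to belong together.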

23
Parallel Clustering (Cont)

[Figure: Organization of parallel clustering software. One Master P is connected to several Slave P nodes. Arrow labels: "batch of promising pairs generated + results of pairwise alignment" and "batchsize or fewer pairs, results of pairwise alignment on each pair".]

24
Parallel Clustering (Cont..)

To start
Slave P starts with 3 x batchsize pairs
Sends the 3rd batch to Master P
Starts alignment on the 1st batch
Sends results of the 1st batch and a newly generated batch
While waiting to receive results from Master P, aligns the 2nd batch
Processor always has the next batch to work on between
Submitting the results of the previous batch
Receiving another set of pairs

25
Parallel Clustering (Cont..)

Improve and control quality
Parameters
Match and mismatch scores
Gap penalties
Post processing
Detection of alternative splicing
Consulting protein databases
Organism specific
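
To show how the match and mismatch scores and the gap penalty listed above enter a pairwise alignment, here is a sketch of a plain Smith-Waterman local alignment score with a linear gap penalty. This is a textbook stand-in, not necessarily the alignment routine PaCE uses, and the default score values are arbitrary placeholders.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either restart (0), extend diagonally, or open a gap in a or b.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGT", "TTACGTTT"))
```

Raising the mismatch and gap penalties makes merging more conservative; lowering them accepts noisier overlaps.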

26
Experimental environment

Used C and MPI
Tested
Quality of software
Arabidopsis thaliana (due to availability of its genome)
Run-time behavior
50,000 maize ESTs with 32-processor IBM SP
Number of processors
Data size
(Number of promising pairs) vs data size
Batchsize vs (number of processors)
Number of clusters
Master processor's time

27
Quality Assessment

To assess quality
A data set and its correct clustering
ESTs from plant Arabidopsis thaliana
Splice program
Align ESTs to the genome
Discard ESTs that
Don't align
Align in multiple spots

28
Quality Assessment (Cont )

False negative
A pair in correct clustering is not paired in the output
5
False positive
A pair not in correct clustering appears in results
Negligible (< 0.04)
Due to conservative nature of algorithm
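
The two error types defined above can be counted by comparing the sets of co-clustered pairs. This is a sketch: the function names and the toy clusterings are invented, and how the slide's figures were normalized is not stated, so only raw counts are returned.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pair_error_counts(correct, output):
    """Count pairs missing from the output and pairs wrongly added by it."""
    truth = co_clustered_pairs(correct)
    found = co_clustered_pairs(output)
    return {
        "false_negatives": len(truth - found),
        "false_positives": len(found - truth),
        "true_pairs": len(truth),
    }


correct = [{1, 2, 3}, {4, 5}]
output = [{1, 2}, {3, 4, 5}]
print(pair_error_counts(correct, output))
```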

29
Quality Assessment
Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.
30
Quality Assessment (Cont..)
31
Run-time Assessment

Experiment with 50,000 maize ESTs
32-processor IBM SP-2
16 minutes

32
Run-time Assessment (Cont )
p    Preprocessing   Clustering   Total
4    273             102          375
8    119             50           169
16   61              26           87
32   38              15           53
64   29              10           39

Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p: number of processors.
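
As a worked reading of the table above, the sketch below computes speedup and efficiency relative to the 4-processor run, since no single-processor time is reported. The totals are copied from the table; the function names are hypothetical.

```python
# Total run-time in seconds for 20,000 ESTs, keyed by number of processors.
total_seconds = {4: 375, 8: 169, 16: 87, 32: 53, 64: 39}


def relative_speedup(times, base_p):
    """Speedup of each run relative to the run on base_p processors."""
    return {p: times[base_p] / t for p, t in times.items()}


def relative_efficiency(times, base_p):
    """Speedup divided by the increase in processor count over base_p."""
    return {p: s * base_p / p for p, s in relative_speedup(times, base_p).items()}


speedup = relative_speedup(total_seconds, base_p=4)
efficiency = relative_efficiency(total_seconds, base_p=4)
for p in sorted(total_seconds):
    print(p, round(speedup[p], 2), round(efficiency[p], 2))
```

Going from 4 to 64 processors (16 times as many) cuts the total from 375 s to 39 s, a speedup of roughly 9.6, so the returns diminish at the largest processor counts for this data size.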
33
Run-time Assessment (Cont ..)

Run-time as a function of batchsize
Small batchsize
Increase in communication overhead
Large batchsize
Slaves less responsive to the need of generating pairs
Slave does not use latest clustering results
Optimal batchsize
Determined by experiment
Master processor's time
Fixed batchsize, increase in number of processors
Gradual increase in Master P's time
With 32 processors, increase < 1
Using 1 master processor is not a bottleneck

34
Results

Space: linear in the size of the input data set
Reduced total work without sacrificing quality
Reduced run-time
Parallel processors
Eliminating pairs
Facilitate clustering
Scale memory with number of processors

35
Observations

PaCE approaches the EST clustering problem directly
Better than
CAP3
Phrap
TIGR Assembler
Compare time/quality
TIGICL (TIGR Indices Clustering Tool)
Support for PVM
MegaBlast
STACK
Large data sets
Lots of Processors
Can improve clustering time?
Clustering algorithm

36
References

http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf
A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347-365, 1988.
