Title: Parallel Computational Biochemistry
1Parallel Computational Biochemistry
2Proteins, DNA, etc.
DNA encodes the information necessary to
produce proteins
Proteins are the main molecular building blocks
of life (for example, structural proteins,
enzymes)
3Proteins, DNA, etc.
- Proteins are formed from a chain of molecules
called amino acids
4Proteins, DNA, etc.
- The DNA sequence encodes the amino acid sequence
that constitutes the protein
5Proteins, DNA, etc.
- There are twenty amino acids found in proteins,
denoted by A, C, D, E, F, G, H, I, ...
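To make the DNA-to-protein encoding concrete, here is a minimal sketch that translates codons (nucleotide triplets) into one-letter amino acid codes using the standard genetic code. The function name `translate` and the example DNA string are illustrative assumptions, not taken from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Translation stops at the first stop codon; trailing bases that do
    not form a complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example input: five codons followed by a stop codon.
print(translate("ATGTACTCGTTTCCGTAA"))
```

The table contains 64 codons that map onto the twenty amino acids plus the stop signal, which is why several codons share the same letter.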
6Multiple Sequence Alignment
7Databases of Biological Sequences
NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKW
VHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWS
RIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALNHYREIF
KDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEF
ARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELS
RRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDMEAVEMA
ENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTRTVVKRT
EKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLTKYWNRY
HLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGYLHWSLA
DNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAITDEIEH
LNSVPPVKPLRH
Swiss-Prot: 104,559 sequences; 38,460,707 residues
PDB: 17,175 structures
8Sequence comparison
- Compare one sequence (target) to many sequences (database search)
- Compare more than two sequences simultaneously
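A database search of this kind can be sketched as scoring the target against every database entry and ranking the results. The sketch below uses a Needleman-Wunsch global alignment score with a simple match/mismatch/linear-gap scheme; the scoring values, the function names, and the toy database are assumptions for illustration only (production tools use substitution matrices and more elaborate gap penalties).

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def database_search(target: str, database: dict) -> list:
    """Rank database entries (name -> sequence) by score, best first."""
    scored = [(global_alignment_score(target, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy database; the sequences are made up for illustration.
database = {
    "seq_a": "MYSFPNSF",
    "seq_b": "MYSFANSF",
    "seq_c": "GGGGGGGG",
}
for score, name in database_search("MYSFPNSF", database):
    print(name, score)
```

Each comparison costs time proportional to the product of the two sequence lengths, which is what makes searches over large databases expensive.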
9Applications
- Phylogenetic analysis
- Identification of conserved motifs and domains
- Structure prediction
11Phylogenetic Analysis
12Structure Prediction
>RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPN
TDWYKWVHDPENMAAGLVSG DLPENGPGYWGNYKTFHDNAQKMGLKIAR
LNVEWSRIFPNPLPRPQNFDE SKQDVTEVEINENELKRLDEYANKDALN
HYREIFKDLKSRGLYFILNMYH WPLPLWLHDPIRVRRGDFTGPSGWLST
RTVYEFARFSAYIAWKFDDLVDE YSTMNEPNVVGGLGYVGVKSGFPPGY
LSFELSRRHMYNIIQAHARAYDGI KSVSKKPVGIIYANSSFQPLTDKDM
EAVEMAENDNRWWFFDAIIRGEITR GNEKIVRDDLKGRLDWIGVNYYTR
TVVKRTEKGYVSLGGYGHGCERNSVS LAGLPTSDFGWEFFPEGLYDVLT
KYWNRYHLYMYVTENGIADDADYQRPY YLVSHVYQVHRAINSGADVRGY
LHWSLADNYEWASGFSMRFGLLKVDYNT KRLYWRPSALVYREIATNGAI
TDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures
Genomic sequences
13Our Contributions
- Parallel min vertex cover for improved sequence alignments
  - (to appear in Journal of Computer and System Sciences)
- Parallel Clustal W (ICCSA 2003)
- In progress: Clustal XP portal at http://cgm.dehne.net
14Clustal W
15Progressive Alignment
1. Do pairwise alignment of all sequences and calculate the distance matrix
   Scerevisiae  1
   Celegans     2  0.640
   Drosophila   3  0.634  0.327
   Human        4  0.630  0.408  0.420
   Mouse        5  0.619  0.405  0.469  0.289
2. Create a guide tree based on this pairwise distance matrix
3. Align progressively following the guide tree
   - start by aligning the most closely related pairs of sequences
   - at each step, align two sequences or one sequence to an existing subalignment
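Step 2 can be illustrated with a small sketch that turns the five-sequence distance matrix from this slide into a guide tree. It uses UPGMA (average-linkage clustering) as a simple stand-in; this is an assumption of the sketch and not necessarily the tree-building method used by Clustal W. The function name and the input layout (a dict keyed by index pairs) are hypothetical.

```python
def upgma_guide_tree(names, dist):
    """Build a guide tree by repeatedly merging the two closest clusters.

    names: list of sequence labels
    dist:  dict mapping (i, j) index pairs to the distance between
           names[i] and names[j] (one entry per unordered pair)
    Returns the tree as nested tuples of labels.
    """
    trees = {i: name for i, name in enumerate(names)}
    sizes = {i: 1 for i in trees}
    d = {frozenset(pair): value for pair, value in dist.items()}
    next_id = len(names)
    while len(trees) > 1:
        closest = min(d, key=d.get)
        a, b = sorted(closest)
        for c in trees:
            if c not in closest:
                # Size-weighted average of the distances to the merged pair.
                d[frozenset((next_id, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        # Drop all distances that involve the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        trees[next_id] = (trees.pop(a), trees.pop(b))
        sizes[next_id] = sizes.pop(a) + sizes.pop(b)
        next_id += 1
    return next(iter(trees.values()))


names = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
dist = {
    (1, 0): 0.640,
    (2, 0): 0.634, (2, 1): 0.327,
    (3, 0): 0.630, (3, 1): 0.408, (3, 2): 0.420,
    (4, 0): 0.619, (4, 1): 0.405, (4, 2): 0.469, (4, 3): 0.289,
}
print(upgma_guide_tree(names, dist))
```

With these distances the closest pair (Human, Mouse at 0.289) is merged first, so step 3 would begin by aligning those two sequences.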
16Parallel Clustal
- Parallel pairwise (PW) alignment matrix
- Parallel guide tree calculation
- Parallel progressive alignment
Scerevisiae  1
Celegans     2  0.640
Drosophila   3  0.634  0.327
Human        4  0.630  0.408  0.420
Mouse        5  0.619  0.405  0.469  0.289
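The pairwise stage parallelizes naturally because the n(n-1)/2 comparisons are independent of each other. The following is a minimal sketch under stated assumptions: `pair_distance` is a hypothetical stand-in built on Python's `difflib` (it is not the Clustal W distance), and a thread pool merely emulates the processors; a real implementation would distribute the pairs across processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def pair_distance(a: str, b: str) -> float:
    """Stand-in distance: 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def parallel_distance_matrix(seqs, workers=4):
    """Compute all pairwise distances, spreading the pairs over workers."""
    pairs = list(combinations(range(len(seqs)), 2))

    def work(pair):
        i, j = pair
        return pair_distance(seqs[i], seqs[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(work, pairs))

    # Assemble the symmetric matrix; the diagonal stays at 0.0.
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), value in zip(pairs, values):
        matrix[i][j] = matrix[j][i] = value
    return matrix


# Made-up sequences for illustration.
seqs = ["MYSFPNSF", "MYSFANSF", "GGGGGGGG"]
for row in parallel_distance_matrix(seqs):
    print(row)
```

Because no pair depends on another, the only coordination needed is collecting the results into the matrix at the end.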
17Relative Speedup
18Clustal XP vs. SGI
- SGI data taken from "Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts
19Parallel Clustal - Improvements
- Optimization of input parameters
  - scoring matrices, gap penalties
  - requires many repetitive Clustal W calculations with various input parameters
- Minimum Vertex Cover
  - use minimum vertex cover to remove erroneous sequences and identify clusters of highly similar sequences
20Minimum Vertex Cover
- TASK: remove the smallest number of gene sequences that eliminates all conflicts
- NP-complete
- Conflict Graph
  - vertex: sequence
  - edge: conflict (e.g. alignment with very poor score)
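The conflict-graph formulation can be sketched as follows: sequences become vertices, a poor pairwise score becomes an edge, and a minimum vertex cover is the smallest set of sequences whose removal leaves no conflicts. The scores, the threshold, its direction (lower is worse), and the function names are hypothetical; the brute-force search is only viable for tiny graphs and is shown purely to pin down the definition.

```python
from itertools import combinations


def conflict_graph(scores, threshold):
    """Return the set of edges joining pairs whose score is below threshold.

    scores: dict mapping (name_a, name_b) -> pairwise alignment score
    """
    return {frozenset(pair) for pair, score in scores.items()
            if score < threshold}


def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in cover."""
    return all(edge & cover for edge in edges)


def minimum_vertex_cover(vertices, edges):
    """Try all subsets in order of increasing size (exponential time)."""
    for size in range(len(vertices) + 1):
        for subset in combinations(sorted(vertices), size):
            if is_vertex_cover(edges, set(subset)):
                return set(subset)


# Hypothetical scores: s3 aligns poorly with every other sequence.
scores = {
    ("s1", "s2"): 80, ("s1", "s3"): 5, ("s2", "s3"): 7,
    ("s1", "s4"): 75, ("s2", "s4"): 82, ("s3", "s4"): 3,
}
edges = conflict_graph(scores, threshold=10)
print(minimum_vertex_cover({"s1", "s2", "s3", "s4"}, edges))
```

In this toy example a single outlier sequence is involved in every conflict, so removing it alone resolves them all.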
21FPT Algorithms
- Phase 1: Kernelization
  - Reduce problem to size f(k)
- Phase 2: Bounded Tree Search
  - Exhaustive tree search, exponential in f(k)
22Kernelization
- Buss's Algorithm for k-vertex cover
  - Let G = (V, E) and let S be the subset of vertices with degree greater than k.
  - Remove S and all incident edges
  - G -> G', k -> k' = k - |S|.
  - IF G' has more than k x k' edges
  - THEN no k-vertex cover exists
  - ELSE start bounded tree search on G'
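The kernelization step can be sketched in a few lines. This is a single-pass illustration that assumes the standard form of Buss's rule, in which vertices of degree greater than k are forced into the cover; the function name and the edge representation (a set of two-element frozensets) are assumptions of the sketch.

```python
def buss_kernel(edges, k):
    """Apply one pass of Buss's reduction for k-vertex cover.

    edges: iterable of 2-element frozensets
    Returns (forced, reduced_edges, k_prime), or None when no vertex
    cover of size at most k can exist.
    """
    edges = set(edges)
    degree = {}
    for edge in edges:
        for v in edge:
            degree[v] = degree.get(v, 0) + 1
    # A vertex of degree greater than k must be in every cover of size <= k,
    # otherwise all of its (more than k) neighbours would have to be.
    forced = {v for v, deg in degree.items() if deg > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None
    reduced = {edge for edge in edges if not (edge & forced)}
    # Each remaining vertex covers at most k edges, so k' vertices
    # can cover at most k * k' edges.
    if len(reduced) > k * k_prime:
        return None
    return forced, reduced, k_prime


# Hypothetical example: a star with centre "c" plus one separate edge.
star = {frozenset(("c", leaf)) for leaf in "abde"} | {frozenset(("x", "y"))}
print(buss_kernel(star, 2))
```

After the reduction the remaining graph has at most k x k' edges, which bounds the size of the instance handed to the tree search by a function of k alone.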
23Bounded Tree Search
24Case 1: simple path of length 3
remove selected vertices from G'; k' = k' - 2
25Case 2: 3-cycle
remove selected vertices from G'; k' = k' - 2
26Case 3: simple path of length 2
remove v1, v2 from G'; k' = k' - 1
27Case 4: simple path of length 1
remove v, v1 from G'; k' = k' - 1
28Sequential Tree Search
- Depth first search
  - backtrack when k' = 0 and G' is not empty ("dead end")
  - stop when a solution is found (G' is empty, k' >= 0)
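The depth-first search with its two termination conditions can be sketched as below. Note that this sketch swaps in the simplest branching rule (pick any remaining edge and try each endpoint in turn) rather than the four path and cycle cases listed on the preceding slides; the function name and edge representation are assumptions.

```python
def bounded_search(edges, k):
    """Return a vertex cover of size at most k as a set, or None.

    edges: set of 2-element frozensets
    """
    if not edges:
        return set()      # graph is empty: solution found
    if k == 0:
        return None       # edges remain but the budget is spent: dead end
    u, v = tuple(next(iter(edges)))
    # Any cover must contain u or v, so branch on both choices.
    for chosen in (u, v):
        remaining = {e for e in edges if chosen not in e}
        result = bounded_search(remaining, k - 1)
        if result is not None:
            return result | {chosen}
    return None           # both branches failed: backtrack


triangle = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c")]}
print(bounded_search(triangle, 1))
print(bounded_search(triangle, 2))
```

The search tree has depth at most k and two children per node, so its size is bounded by a function of k rather than of the graph size.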
29Parallel Tree Search
- Basic Idea
  - Build top log p levels of the search tree (T')
  - every proc. starts depth-first search at one leaf of T'
  - randomize depth-first search by selecting random child
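The basic idea can be sketched as follows, under several assumptions: branching uses the simple "either endpoint of an edge" rule rather than the case analysis from the earlier slides, a thread pool only emulates the p processors, and this version waits for every worker instead of stopping all of them as soon as one finds a solution. All function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def branch(edges, cover, k):
    """Children of a node: put either endpoint of one edge into the cover."""
    u, v = sorted(next(iter(edges)))
    return [({e for e in edges if w not in e}, cover | {w}, k - 1)
            for w in (u, v)]


def randomized_dfs(edges, cover, k, rng):
    """Depth-first search that visits the children in random order."""
    if not edges:
        return cover
    if k == 0:
        return None
    children = branch(edges, cover, k)
    rng.shuffle(children)
    for child in children:
        result = randomized_dfs(*child, rng)
        if result is not None:
            return result
    return None


def top_levels(edges, k, p):
    """Expand the tree level by level until it has at least p leaves (T')."""
    frontier = [(set(edges), set(), k)]
    while len(frontier) < p:
        next_frontier, expanded = [], False
        for sub_edges, cover, budget in frontier:
            if sub_edges and budget > 0:
                next_frontier.extend(branch(sub_edges, cover, budget))
                expanded = True
            else:
                next_frontier.append((sub_edges, cover, budget))
        frontier = next_frontier
        if not expanded:
            break
    return frontier


def parallel_search(edges, k, p=4, seed=0):
    """Start one randomized search per leaf of T'; return any cover found."""
    leaves = top_levels(edges, k, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [pool.submit(randomized_dfs, e, c, b, random.Random(seed + i))
                   for i, (e, c, b) in enumerate(leaves)]
        results = [f.result() for f in futures]
    return next((r for r in results if r is not None), None)


# Hypothetical example: a 5-cycle, whose minimum vertex cover has size 3.
cycle = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d"),
                                ("d", "e"), ("e", "a")]}
print(parallel_search(cycle, 3))
```

Starting the workers at different leaves, and randomizing the order in which each explores its subtree, spreads the searches over different regions of the tree.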
30Analysis Balls-in-bins
sequential depth-first search path: total length L, m solutions
expected sequential time (rand. distr.): L/(m+1)
parallel search path
expected parallel time (rand. distr.): p + L/(p(m+1))
expected speedup: p / (1 + (m+1)/L)
if m << L then expected speedup ≈ p
31Simulation Experiment
L = 1,000,000
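A simulation of this kind can be sketched as a small Monte Carlo experiment. The model below is an assumption of the sketch rather than a description of the experiment on the slide: the search path has L positions, m of them are solutions placed uniformly at random, and each of p processors scans one contiguous segment of length L/p from its start. The parameter values in the usage lines are illustrative.

```python
import random


def simulate(L, m, p, trials, seed=0):
    """Estimate the expected sequential and parallel search times.

    Returns (mean sequential time, mean parallel time), measured as the
    number of positions visited until the first solution is reached.
    """
    if L % p != 0:
        raise ValueError("L must be divisible by p in this sketch")
    rng = random.Random(seed)
    segment = L // p
    seq_total = par_total = 0
    for _ in range(trials):
        solutions = rng.sample(range(L), m)
        # Sequential: walk the whole path from position 0.
        seq_total += min(solutions) + 1
        # Parallel: every processor walks its own segment from its start,
        # so the first hit is the smallest offset within any segment.
        par_total += min(pos % segment for pos in solutions) + 1
    return seq_total / trials, par_total / trials


L, m, p = 1_000_000, 10, 8
seq, par = simulate(L, m, p, trials=2000)
print("sequential:", seq, "vs L/(m+1) =", L / (m + 1))
print("parallel:  ", par, "empirical speedup:", seq / par)
```

The sequential estimate can be compared directly against L/(m+1); the parallel estimate depends on the segment model assumed here.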
32Implementation
- test platform
  - 32 node HPCVL Beowulf cluster
  - each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
  - gcc and LAM/MPI on LINUX Redhat 7.2
- code-s: Sequential k-vertex cover
- code-p: Parallel k-vertex cover
33Test Data
- Protein sequences
  - Same protein from several hundred species
  - Each protein sequence a few hundred amino acid residues in length
  - Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
34Test Data
- Somatostatin
  - neuropeptide involved in the regulation of many functions in different organ systems
  - Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255
35Test Data
- WW
  - small protein domain that binds proline-rich sequences in other proteins and is involved in cellular signaling
  - Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318
36Test Data
- Kinase
  - large family of enzymes involved in cellular regulation
  - Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397
37Test Data
- SH2 (src-homology domain 2)
  - involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
  - Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397
38Test Data
- Thrombin
  - protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin
  - Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413
39Test Data
- PHD (pleckstrin homology domain)
  - involved in cellular signaling
  - Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603
40Test Data
- Random Graph
  - |V| = 220, |E| = 2155, k = 122, k' = 122
- Grid Graph
  - |V| = 289, |E| = 544, k = 145, k' = 145
41Test Data
42Sequential Times
Kinase, SH2, Thrombin: n/a
43Code-p on Virtual Proc.
44Parallel Times
45Speedup Somatostatin
46Speedup WW
47Speedup Rand. Graph
48Speedup Grid Graph
49Clustal XP
X = Extended, P = Parallel
in progress
50Clustal XP
http://cgm.dehne.net
51(No Transcript)