CyberBridges Protein Pattern Discovery

Slides: 15

Provided by: IBMU467

Transcript and Presenter's Notes

1
CyberBridges Protein Pattern Discovery

Tom Milledge
Giri Narasimhan
Bioinformatics Research Group (BioRG)
School of Computing and Information Sciences, FIU

2
Protein Pattern Discovery: Introduction

Goals
Implement unsupervised pattern discovery tools
for protein structure data by using the geometric
hashing technique
Create database of protein structure patterns
Create multiple 3-D structural alignments
Identify functional regions in proteins.

3
Molecular Biology Primer
Proteins: Hemoglobin, Immunoglobulin, Keratin,
Melanin, Insulin, etc.
RNA
Protein
4
Where does protein structure information come
from?
PDB (Protein Data Bank): a repository of 3-D
protein structures
5
Representing substructures as triangles
Largest common substructure (many linked
triangles) in query and target proteins
One triangle (3 atoms)

Length1  Length2  Length3  ID1  ID2  ID3
9.5      7.05     7.01     217  231  238
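
The triangle record above (three side lengths plus three atom IDs) could be produced along these lines. This is a minimal sketch, not the presenters' code: the function name `triangle_record`, the coordinates, and the rounding to two decimals are assumptions made for illustration; only the atom IDs echo the slide's example row.

```python
import math
from itertools import combinations


def triangle_record(atoms):
    """Build one triangle record from three (atom_id, (x, y, z)) tuples.

    Returns (lengths, ids): the three pairwise distances in Angstroms,
    rounded to two decimals, and the three atom IDs.
    """
    ids = tuple(atom_id for atom_id, _ in atoms)
    lengths = tuple(
        round(math.dist(p, q), 2)
        for (_, p), (_, q) in combinations(atoms, 2)
    )
    return lengths, ids


# Hypothetical coordinates; they do not come from any PDB entry.
atoms = [
    (217, (0.0, 0.0, 0.0)),
    (231, (3.0, 4.0, 0.0)),
    (238, (3.0, 0.0, 0.0)),
]
lengths, ids = triangle_record(atoms)
print(lengths, ids)
```

The distances are listed in the pair order 1-2, 1-3, 2-3, so each length can still be traced back to the two atoms it connects.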

6
Basic steps for triangle-based geometric hashing

Preprocessing phase
Extract triangle information from target (model)
proteins and store them in a hash table
Searching phase
For any given query protein, find the matching
triangles in the hash table
Extension phase
Find the largest matching substructures

7
Preprocessing phase: Create the hash table
Read PDB data
Extract triangles
Example side lengths: 7.06, 9.49, 7.01
Generate a hash key (based on the three lengths
and bin-size parameters) and enter record into a
hash table
Example hash key: 035035047
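
One way to derive a key from the three lengths is sketched below. The slide does not give the actual scheme, so the bin size of 0.2 Angstroms, the ascending sort, and the three-digit zero padding are all assumptions, chosen only because together they reproduce the slide's example key. The function name `hash_key`, the table layout, and the label `"target_protein"` are likewise hypothetical.

```python
from collections import defaultdict


def hash_key(lengths, bin_size=0.2, width=3):
    """Quantize each side length into a bin, sort the bins, and join them.

    bin_size and width are assumed values, not taken from the slides.
    """
    bins = sorted(int(length / bin_size) for length in lengths)
    return "".join(f"{b:0{width}d}" for b in bins)


key = hash_key((7.06, 9.49, 7.01))

# Enter the record into a hash table keyed by the quantized lengths.
# The protein label is a placeholder; the atom IDs echo the earlier example.
table = defaultdict(list)
table[key].append(("target_protein", (217, 231, 238)))
print(key)
```

Sorting the bins makes the key independent of the order in which the three atoms are visited, at the cost of losing the pairing between a length and its two atoms; a fuller implementation would keep that pairing in the stored record.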
8
Search phase: finding the matches
1. Decompose Query Protein

At the beginning of the search, the query protein is
decomposed into triangles, with the attribute
information stored in a separate table. The
query protein data is then copied to all nodes.
The hash table is split across cluster nodes by
protein, with protein attribute information
stored in a separate table. This data is
accessed via the atom ID foreign keys stored in
the hash table record.

2. Initial Search

The initial search entails matching the query
triangles with the database of (target)
triangles. The results are added to a new hash
table containing all the target matches. The
results table includes the query atom IDs for the
substructure building phase.
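
The initial search can be sketched on a single machine as a lookup of each query triangle's key in the target table. This is an illustration under stated assumptions, not the presenters' implementation: the key scheme (0.2 Angstrom bins, sorted, zero-padded) is assumed, the distribution across cluster nodes is not modeled, and every protein label, atom ID, and length below is made up.

```python
from collections import defaultdict


def hash_key(lengths, bin_size=0.2, width=3):
    """Assumed key scheme: quantize, sort, zero-pad, concatenate."""
    bins = sorted(int(length / bin_size) for length in lengths)
    return "".join(f"{b:0{width}d}" for b in bins)


def build_table(target_triangles):
    """Preprocessing: index (protein_id, atom_ids, lengths) records by key."""
    table = defaultdict(list)
    for protein_id, atom_ids, lengths in target_triangles:
        table[hash_key(lengths)].append((protein_id, atom_ids))
    return table


def initial_search(query_triangles, table):
    """Return a results table: protein_id -> [(query_ids, target_ids), ...]."""
    hits = defaultdict(list)
    for query_ids, lengths in query_triangles:
        for protein_id, target_ids in table.get(hash_key(lengths), []):
            hits[protein_id].append((query_ids, target_ids))
    return hits


# Hypothetical target and query triangles.
targets = [
    ("T1", (217, 231, 238), (7.06, 9.49, 7.01)),
    ("T2", (5, 9, 14), (4.13, 6.37, 8.85)),
]
query = [
    ((12, 20, 31), (9.5, 7.05, 7.01)),
    ((12, 20, 40), (3.05, 3.11, 3.27)),
]
hits = initial_search(query, build_table(targets))
print(dict(hits))
```

The results table keeps the query atom IDs next to the target atom IDs, which is what the substructure building phase needs to link neighbouring hits.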
9
Extension phase: building the substructures
A list of triangle hits
Build an adjacency structure
Every vertex of the tree is a triangle
Use a graph searching algorithm to find larger
substructures
Measure structural similarity (RMSD) between
every substructure in the query protein and every
substructure in the model protein
Output common substructure pairs
RMSD: root mean square distance
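
The adjacency and graph search steps might look like the sketch below. The slides do not say when two triangles count as adjacent or which search algorithm is used, so both choices here are assumptions: two hits are linked when they share at least two atoms (one edge), and a breadth-first search collects each connected group. The hits are hypothetical query-atom triples, and the RMSD check is left out.

```python
from collections import deque
from itertools import combinations


def build_adjacency(hits):
    """Link hit i and hit j when their triangles share at least two atoms."""
    adjacency = {i: set() for i in range(len(hits))}
    for i, j in combinations(range(len(hits)), 2):
        if len(set(hits[i]) & set(hits[j])) >= 2:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def connected_groups(adjacency):
    """Breadth-first search over the hit graph; returns lists of hit indices."""
    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(group)
    return groups


# Hypothetical triangle hits, each given by its three query atom IDs.
hits = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (10, 11, 12)]
groups = connected_groups(build_adjacency(hits))
largest = max(groups, key=len)
largest_atoms = sorted({atom for index in largest for atom in hits[index]})
print(largest_atoms)
```

Each vertex of the search tree is one triangle hit, so the largest group corresponds to the largest set of linked triangles found from these hits.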
10
Case study: Dehydrogenase superfamily
1B3R Hydrolase (Rat)
1CJC Reductase (Cow)
1CF2 Dehydrogenase (Bacteria)
11
Dehydrogenases: Shared structural element
Recurring substructure
12
Dehydrogenases: building the common substructure
Triangle from query protein (green) matches
triangle from target protein (pink)
Other overlapping triangle matches are extended
from the initial triangle to find the largest common
substructure
RMSD (root mean square distance) less than 1.0
Angstrom indicates a good match
RMSD is measured at each extension step to ensure
validity of the larger match
RMSD: 0.32 Angstroms
RMSD: 0.66 Angstroms
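
The RMSD check at each extension step can be illustrated as follows. This sketch assumes the matched atoms are already paired and superimposed, so it omits the optimal rotation and translation a full structural comparison would compute first. The function names and coordinates are hypothetical; the 1.0 Angstrom threshold is the one stated on the slide.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between paired 3-D points, in input units."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def accept_extension(query_coords, target_coords, threshold=1.0):
    """Keep an extended match only if its RMSD stays below the threshold."""
    return rmsd(query_coords, target_coords) < threshold


# Hypothetical matched triangle: the target is shifted 0.3 Angstroms in z.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
target = [(0.0, 0.0, 0.3), (1.0, 0.0, 0.3), (0.0, 1.0, 0.3)]
base_rmsd = rmsd(query, target)

# Extending with a poorly matching atom pushes the RMSD over the threshold.
extended_ok = accept_extension(
    query + [(5.0, 5.0, 5.0)], target + [(5.0, 5.0, 9.0)]
)
print(round(base_rmsd, 2), extended_ok)
```

Re-measuring after every added atom is what keeps a growing match from drifting: one bad pairing is enough to reject the extension.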
13
Results: Zinc finger protein family
DNA-binding substructure
Zinc-binding substructure
10 positions, RMSD: 0.46 Angstroms
4 positions, RMSD: 0.35 Angstroms
14
Conclusions and Future Work

Geometric hashing of proteins shows promise as an
important technique that fits many parallel
architectures well. Areas of future work
include:
Molecular docking: Identify potential drugs that
are least likely to cause side effects.
Function prediction: Create a database of
conserved substructures that indicate a specific
protein function.
Structure prediction: Use sequence patterns with
structural templates to predict the structure of
new sequences.