Title: CyberBridges Protein Pattern Discovery
1CyberBridges Protein Pattern Discovery
- Tom Milledge
- Giri Narasimhan
- Bioinformatics Research Group (BioRG)
- School of Computing and Information Sciences, FIU
2Protein Pattern Discovery Introduction
- Goals
- Implement unsupervised pattern discovery tools for protein structure data by using the geometric hashing technique
- Create database of protein structure patterns
- Create multiple 3-D structural alignments
- Identify functional regions in proteins.
3Molecular Biology Primer
Proteins: Hemoglobin, Immunoglobulin, Keratin, Melanin, Insulin, etc.
RNA
Protein
4Where does protein structure information come from?
PDB (Protein Data Bank): a repository of 3-D protein structures
5Representing substructures as triangles
Largest common substructure (many linked triangles) in query and target proteins
One triangle (3 atoms)
- Length1 Length2 Length3 ID1 ID2 ID3
- 9.5 7.05 7.01 217 231 238
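The triangle record above (three side lengths plus three atom IDs) can be sketched in a few lines of Python. This is only an illustration: the function name `triangle_record`, the input layout, the sort order of the fields, and the coordinates are all assumptions, not taken from the slides.

```python
import math
from itertools import combinations


def triangle_record(atoms):
    """Build one triangle record from three (atom_id, (x, y, z)) pairs.

    Returns (lengths, ids): the three side lengths in descending order
    and the three atom IDs in ascending order. Both orderings are
    assumed conventions chosen for this sketch.
    """
    if len(atoms) != 3:
        raise ValueError("a triangle needs exactly three atoms")
    lengths = sorted(
        (math.dist(p, q) for (_, p), (_, q) in combinations(atoms, 2)),
        reverse=True,
    )
    ids = sorted(atom_id for atom_id, _ in atoms)
    return tuple(lengths), tuple(ids)


# Made-up coordinates forming a 3-4-5 right triangle (not real PDB data).
example = [
    (217, (0.0, 0.0, 0.0)),
    (231, (3.0, 0.0, 0.0)),
    (238, (0.0, 4.0, 0.0)),
]
lengths, ids = triangle_record(example)
```

Because the record holds only distances and IDs, it does not change when the protein is rotated or translated, which is what makes triangles usable as hash keys.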
6Basic steps for triangle-based geometric hashing
- Preprocessing phase
- Extract triangle information from target (model) proteins and store them in a hash table
- Searching phase
- For any given query protein, find the matching triangles in the hash table
- Extension phase
- Find the largest matching substructures
7Preprocessing phase Create the hash table
Read PDB data
Extract triangles (example side lengths: 7.06, 9.49, 7.01)
Generate a hash key (based on the three lengths and bin-size parameters) and enter record into a hash table
Example hash key: 035035047
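One possible way to turn three side lengths into a key is sketched below. The slide does not state the bin size or the key layout; a bin size of 0.2, ascending bin order, and three zero-padded digits per bin are assumptions that happen to reproduce the example key 035035047 from the lengths 7.06, 9.49, 7.01. The names `hash_key` and `add_triangle` are hypothetical.

```python
from collections import defaultdict


def hash_key(lengths, bin_size=0.2, width=3):
    """Quantize each side length into a bin and concatenate the bins.

    bin_size and width are assumed values, not taken from the slides.
    """
    bins = sorted(int(length / bin_size) for length in lengths)
    return "".join(f"{b:0{width}d}" for b in bins)


def add_triangle(table, protein_id, lengths, atom_ids, bin_size=0.2):
    """Enter one triangle record into the hash table under its key."""
    key = hash_key(lengths, bin_size)
    table[key].append((protein_id, tuple(atom_ids)))


# "target_1" is a placeholder protein identifier.
table = defaultdict(list)
add_triangle(table, "target_1", (7.06, 9.49, 7.01), (217, 231, 238))
key = hash_key((7.06, 9.49, 7.01))
```

With exact binning, two nearly identical triangles whose lengths fall on opposite sides of a bin boundary get different keys, so a real implementation would likely need some tolerance, for example by probing neighboring bins.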
8Search phase finding the matches
1. Decompose Query Protein
At the beginning of the search, the query protein is decomposed into triangles with the attribute information stored in a separate table. The query protein data is then copied to all nodes.
2. Initial Search
The initial search entails matching the query triangles with the database of (target) triangles. The results are added to a new hash table containing all the target matches. The results table includes the query atom IDs for the substructure building phase.
The hash table is split across cluster nodes by protein, with protein attribute information stored in a separate table. This data is accessed via the atom ID foreign keys stored in the hash table record.
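The initial search can be sketched as a lookup of each query triangle's key in the target hash table. This is a single-process illustration only; the distribution across cluster nodes and the separate attribute tables are not modeled. The key scheme (bin size 0.2), the function names, and the sample data are assumptions.

```python
from collections import defaultdict


def hash_key(lengths, bin_size=0.2, width=3):
    """Assumed key scheme: sorted, zero-padded length bins."""
    bins = sorted(int(length / bin_size) for length in lengths)
    return "".join(f"{b:0{width}d}" for b in bins)


def initial_search(target_table, query_triangles, bin_size=0.2):
    """Match query triangles against the target hash table.

    target_table maps key -> list of (protein_id, target_atom_ids).
    query_triangles is a list of (lengths, query_atom_ids).
    Returns protein_id -> list of (query_atom_ids, target_atom_ids),
    so the query atom IDs are kept for the substructure building phase.
    """
    results = defaultdict(list)
    for lengths, query_ids in query_triangles:
        key = hash_key(lengths, bin_size)
        for protein_id, target_ids in target_table.get(key, []):
            results[protein_id].append((tuple(query_ids), tuple(target_ids)))
    return results


# Placeholder data: one target triangle and two query triangles.
target_table = {"035035047": [("target_1", (217, 231, 238))]}
query_triangles = [
    ((7.01, 7.05, 9.5), (12, 30, 41)),  # falls in the same bins as the target
    ((3.1, 4.1, 5.1), (5, 6, 7)),       # no matching key in the table
]
results = initial_search(target_table, query_triangles)
```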
9Extension phase building the substructures
Every vertex of the tree is a triangle
A list of triangle hits
Build an adjacency structure
Use a graph searching algorithm to find larger substructures
Measure structural similarity (RMSD) between every substructure in the query protein and every substructure in the model protein
Output common substructure pairs
RMSD: root mean square distance
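The adjacency and graph-search steps above could look roughly like the sketch below. The slides do not say when two triangle hits count as adjacent or which search algorithm is used; here two hits are assumed adjacent when they share at least two atoms in both the query and the target, and a breadth-first search collects the largest connected group of hits. RMSD checking is left out of this sketch.

```python
from collections import deque


def adjacent(hit_a, hit_b, shared=2):
    """Assumed rule: hits share >= `shared` atoms in query and in target."""
    (query_a, target_a), (query_b, target_b) = hit_a, hit_b
    return (
        len(set(query_a) & set(query_b)) >= shared
        and len(set(target_a) & set(target_b)) >= shared
    )


def build_adjacency(hits):
    """Map each hit index to the set of indices of adjacent hits."""
    adj = {i: set() for i in range(len(hits))}
    for i in range(len(hits)):
        for j in range(i + 1, len(hits)):
            if adjacent(hits[i], hits[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def largest_component(adj):
    """Breadth-first search; returns hit indices of the largest group."""
    seen, best = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(adj[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best


# Placeholder hits: (query atom IDs, target atom IDs).
hits = [
    ((1, 2, 3), (11, 12, 13)),
    ((2, 3, 4), (12, 13, 14)),
    ((3, 4, 5), (13, 14, 15)),
    ((7, 8, 9), (21, 22, 23)),
]
adj = build_adjacency(hits)
component = largest_component(adj)
```

The atoms covered by the hits in `component` form one candidate common substructure pair.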
10Case study Dehydrogenase superfamily
1B3R Hydrolase (Rat)
1CJC Reductase (Cow)
1CF2 Dehydrogenase (Bacteria)
11Dehydrogenases Shared structural element
Reoccurring substructure
12Dehydrogenases building the common substructure
Other overlapping triangle matches are extended
from initial triangle to find largest common
substructure
Triangle from query protein (green) matches
triangle from target protein (pink)
RMSD (Root Mean Square distance) less than 1.0
Angstrom indicates a good match
RMSD is measured at each extension step to ensure
validity of the larger match
RMSD 0.32 Angstroms
RMSD 0.66 Angstroms
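A minimal sketch of the RMSD check, assuming the two substructures are given as equal-length lists of paired atom coordinates that have already been superimposed (the superposition step itself is not shown). The 1.0 Angstrom cutoff comes from the slide; the function names and coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between paired 3-D points.

    Computes sqrt( (1/N) * sum_i |a_i - b_i|^2 ) and assumes the two
    coordinate sets are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def is_good_match(coords_a, coords_b, cutoff=1.0):
    """Apply the < 1.0 Angstrom rule of thumb from the slide."""
    return rmsd(coords_a, coords_b) < cutoff


# Made-up coordinates: the target is the query shifted by 0.3 along x.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
target = [(x + 0.3, y, z) for x, y, z in query]
value = rmsd(query, target)
```

Calling `is_good_match` after each extension step is one way to reject an added triangle that pushes the combined substructure past the cutoff.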
13Results Zinc finger protein family
DNA-binding substructure
Zinc-binding substructure
10 positions RMSD 0.46 angstroms
4 positions RMSD 0.35 angstroms
14Conclusions and Future Work
- Geometric hashing of proteins shows promise as an important technique with a very good fit to many parallel architectures. Areas of future work include:
- Molecular docking: Identify potential drugs that are least likely to cause side effects.
- Function prediction: Create a database of conserved substructures that indicate a specific protein function.
- Structure prediction: Use sequence patterns with structural templates to predict the structure of new sequences.