Title: A Geometric Framework for Robust Nearest Neighbor Analysis of Protein Structure and Function
1 Deepak Bandyopadhyay
- Department of Computer Science,
- University of North Carolina at Chapel Hill
2 Outline
Motivation
- Use geometric proximity (Voronoi / Delaunay) to analyze protein structure and gain insight into protein function
Methods
- Geometric proximity structures have problems with imprecise points. But we can fix this!
Applications
- Let's modify existing neighbor analyses of protein structure to make them robust, and design new ones!
- Briefly: SNAPP, packing differences, secondary structure, hinges
- In detail: structural fingerprints for function inference
3 Nearest Neighbors
4 Geometric structures on point sets
Delaunay triangulation / tessellation (DT)
- Input: Points
- Output: Neighbors
5 Delaunay tessellation of proteins
- Represent each amino acid by a point
  - Cα, side-chain centroid, Cβ, ...
- Delaunay tetrahedra correspond to nearest-neighbor quadruplets
6 Delaunay Tessellation Applications
- SNAPP, four-body statistical potential for hydrophobic core stability (Carter et al., 2001)
- Decoy discrimination (Krishnamoorthy and Tropsha, 2003)
- Scoring ligand-receptor binding affinity (Zhang et al., 2004)
- Mining frequent substructures in protein families (Huan et al., 2004, 2005)
- Structure-based function inference (Bandyopadhyay et al., 2005)
7 Outline
Motivation
Methods
- Geometric proximity structures have problems with imprecise points. But we can fix this!
Applications
8 Effect of Imprecision on Delaunay
- If point coordinates are imprecise...
- What happens to the Delaunay neighbors?
- Think of 4 nearly co-circular points in 2D: Delaunay edges may flip, so neighbors change.
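The edge flip described on this slide can be reproduced with the standard in-circle determinant test. The following is a minimal sketch, not code from the talk: the function names, the unit-square example, and the perturbation size `eps` are illustrative assumptions.

```python
def in_circle(a, b, c, d):
    """Positive if d lies inside the circle through a, b, c (given in
    counter-clockwise order), negative if outside, zero if co-circular."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    # 3x3 determinant expanded along the first row.
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))


def delaunay_diagonal(p0, p1, p2, p3):
    """For a convex quadrilateral p0, p1, p2, p3 in counter-clockwise order,
    return which diagonal is the Delaunay edge."""
    # If p3 falls inside the circumcircle of (p0, p1, p2), the diagonal p0-p2
    # is not Delaunay and the edge flips to p1-p3.
    return "p1-p3" if in_circle(p0, p1, p2, p3) > 0 else "p0-p2"


# Four corners of a unit square are exactly co-circular (hypothetical example).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
eps = 1e-3
outward = (0.0 - eps, 1.0 + eps)  # last corner nudged away from the circle centre
inward = (0.0 + eps, 1.0 - eps)   # last corner nudged towards the circle centre

print(delaunay_diagonal(*square[:3], outward))
print(delaunay_diagonal(*square[:3], inward))
```

A displacement of about one part in a thousand is enough to change which pair of points counts as Delaunay neighbors.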
9 Which applications are affected by instability of Delaunay?
- Frequent subgraphs: qualitative, discretized; worse affected
- Voronoi volumes: quantitative, continuous; less affected
- When people use Delaunay in analysis of protein structure, they assume it is robust to perturbations!
10 Method 1: Almost-Delaunay (AD) tetrahedra
- A 4-tuple of points is in AD(ε) if, by perturbing all points in the set by at most ε, its circumscribing sphere can become empty.
- The minimum perturbation required, ε, is the AD threshold.
Vertex can move within sphere of radius ε
Green: Delaunay, in AD(0). Red: in AD(ε)
11 AD tetrahedra for protein 2ACY, 98 residues, Cαs (colored by threshold; DT not shown, for clarity)
AD tetrahedra may overlap; they do not tile space
12 Computing AD thresholds (Bandyopadhyay and Snoeyink, 2004)
- Find the spherical shell of minimum width, using a result from computational metrology (Garcia-Lopez et al., 1998)
- Given a set of points P, a simplex t is AD(ε) iff its points are contained within 2 concentric spheres s.t.
  - difference in radii is 2ε, minimum over all such concentric spheres
  - inner sphere contains no points of P
2D Example
Code to compute AD edges, triangles, tetrahedra for 3D points, in C++/CGAL (with MATLAB interface and utilities) is available from http://www.cs.unc.edu/debug/software
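The concentric-sphere condition on this slide can be illustrated in 2D. For a fixed center, the thinnest admissible shell has half-width (outer radius minus inner radius) / 2, where the outer radius reaches the farthest simplex vertex and the inner radius reaches the nearest point of P. The sketch below minimizes that half-width by brute force over a grid of centers. It is not the minimum-width-shell algorithm cited above; the function names, grid bounds, and point set are illustrative assumptions, and a grid search only yields an upper bound on the true threshold.

```python
import math


def shell_half_width(center, simplex, points):
    """Half-width of the thinnest shell centred at `center` that contains all
    simplex vertices while its inner circle holds no point of `points`."""
    outer = max(math.dist(center, v) for v in simplex)
    inner = min(math.dist(center, p) for p in points)
    return (outer - inner) / 2.0


def ad_threshold_grid(simplex, points, lo=-1.0, hi=2.0, steps=60):
    """Brute-force 2D estimate: minimise the half-width over a grid of
    candidate centres (an upper bound on the AD threshold)."""
    best = math.inf
    for i in range(steps + 1):
        for j in range(steps + 1):
            c = (lo + (hi - lo) * i / steps, lo + (hi - lo) * j / steps)
            best = min(best, shell_half_width(c, simplex, points))
    return best


# Hypothetical point set: three corners of a unit square plus a displaced fourth.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.1)]
delaunay_tri = [points[0], points[1], points[2]]  # circumcircle is empty
other_tri = [points[0], points[1], points[3]]     # circumcircle contains (0, 1)

e_delaunay = shell_half_width((0.5, 0.5), delaunay_tri, points)
e_other = ad_threshold_grid(other_tri, points)
print(e_delaunay, e_other)
```

The Delaunay triangle gets a half-width of zero at its circumcenter, while the non-Delaunay triangle needs a small positive perturbation before its circumcircle can become empty.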
13 Method 2: Delaunay Probability
- AD(ε) captures worst-case deviation in coordinates
- Uncertainty in actual coordinates → probabilistic model
- Assume each point has a Gaussian p.d.f.
- $\mathrm{Prob}(\text{sphere empty of } p_i) = 1 - \int_{\text{sphere}} \mathrm{p.d.f.}(p_i)$
- Probability that tetrahedron abcd is Delaunay: integrate over all possible spheres defined by a, b, c, d
  - $\int \mathrm{prob}(\text{sphere}) \prod_{p \notin \{a,b,c,d\}} \mathrm{Prob}(\text{sphere empty of } p)$
- AD algorithm makes Delaunay Probability computation feasible
- Delaunay Probability is significant only for tetrahedra with low ε!
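A Monte Carlo estimate gives a feel for the Delaunay probability in 2D. This is a substitute technique for illustration, not the integral over spheres described on the slide: each point is jittered with an isotropic Gaussian and the fraction of trials in which the triangle's circumcircle stays empty is reported. The function names, `sigma`, the trial count, and the example points are assumptions.

```python
import random


def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circle(a, b, c, d):
    """Positive if d is inside the circle through counter-clockwise a, b, c."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    return (a1 * (b2 * c3 - b3 * c2)
            - a2 * (b1 * c3 - b3 * c1)
            + a3 * (b1 * c2 - b2 * c1))


def delaunay_probability(tri, others, sigma, trials=2000, seed=0):
    """Fraction of Gaussian-perturbed samples in which the circumcircle of
    `tri` contains none of the perturbed `others`."""
    rng = random.Random(seed)

    def jitter(p):
        return (p[0] + rng.gauss(0.0, sigma), p[1] + rng.gauss(0.0, sigma))

    hits = 0
    for _ in range(trials):
        a, b, c = (jitter(p) for p in tri)
        if orient(a, b, c) < 0:  # keep the triangle counter-clockwise
            b, c = c, b
        moved = [jitter(p) for p in others]
        if all(in_circle(a, b, c, q) <= 0 for q in moved):
            hits += 1
    return hits / trials


stable = delaunay_probability([(0, 0), (1, 0), (0, 1)], [(3, 3)], sigma=0.01)
fragile = delaunay_probability([(0, 0), (1, 0), (1, 1)], [(0, 1)], sigma=0.01)
print(stable, fragile)
```

A triangle whose circumcircle is far from every other point keeps a probability of 1, while the nearly co-circular configuration lands near one half.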
14 Summary of contributions
- Algorithmic
  - Theory and algorithm for the general framework
  - Fast and robust implementation for 3D points
- Application domain
  - Nearest neighbor analysis with imprecision
- Applications explored
  - Scoring protein packing with a statistical 4-body potential (SNAPP)
  - Quantifying packing differences between proteins and other structures
  - Assigning secondary structure from Cαs
  - Analyzing conformational changes and finding hinge residues
  - Finding local packing motifs specific to protein families, applied to structure classification and functional inference for structural genomics
15 Outline
Motivation
Methods
Applications
- Let's modify existing neighbor analyses of protein structure to make them robust, and design new ones!
- Briefly: SNAPP, packing differences, secondary structure, hinges
- In detail: structural fingerprints for function inference
16 Application 1: SNAPP
- Simplicial Neighborhood Analysis of Protein Packing
  - Carter et al., JMB99
- Residues represented by side-chain centroids
- Protein structure represented as an aggregate of space-filling, irregular tetrahedra
- Unique and objective recognition of nearest-neighbor residues in sets of four (quadruplets)
17 Likelihood Scores for 8724 Compositions
Tropsha A, Singh R, Vaisman I, Zheng W. Pac Symp Biocomput. 614-23 (1996).
Dunbrack, R. Culled PDB: http://www.fccc.edu/research/labs/dunbrack/culledpdb.html
18 Likelihood Mapped to hydrophobic core
19 Applications
- Applications
  - Decoy discrimination (Krishnamoorthy and Tropsha, 2003)
    - Weighting scheme based on tetrahedron sequence topology
  - Conformation change on ligand binding (Sherman et al., 2003)
  - Study of folding simulations (Krishnamoorthy and Tropsha, 2003)
  - Ligand-receptor binding affinity (Zhang, Golbraikh and Tropsha, 2004)
- Contribution of almost-Delaunay
  - How stable is the SNAPP score computed using Delaunay?
  - Compute variants of it using AD and Delaunay Probability
20 Results: scoring decoys
Decoy sets: 1. 4state_reduced 2. lattice_ssfit 3. semfold
- SNAPP with Delaunay probabilities distinguishes decoys from the native state as well as (even better than?) Delaunay-based SNAPP.
- Hence, the original Delaunay-based score is stable
21 Results: scoring CASP5 predictions
- SNAPP with Delaunay probabilities discriminates native structures from predictions as well as Delaunay-based SNAPP (usually even better).
- Hence, the original Delaunay-based score is stable
[Table: Z-score (Rank)]
22 Application 2: Packing Differences
- How does DT change as points are perturbed, for different point sets?
[Figure: sidechain centroids (2cro, 4state_reduced)]
23 Stability of the DT in Proteins
Right: Number of Delaunay and AD(0.3) tetrahedra for a sample of predictions submitted to CASP5. Notice that the native structures, colored green, have fewer AD tetrahedra for the same number of Delaunay tetrahedra.
Left: The average number of AD tetrahedra at low ε (< 0.5 Å) grows faster for random points than for proteins, as seen in this cumulative histogram. This suggests that the DT is stable under small perturbations in proteins.
24 Application 3: Secondary structure from Cα
- AD threshold histogram of α-helix has a unique signature that enables helix assignment from Cαs
25 AD secondary structure assignment
- Strong α-helix signal, weaker β-sheet and β-turn signals
- Better accuracy than previous work (Wako and Yamato, 1998)
- More tolerant to structural and H-bond imperfections than DSSP
  - 1bg5, irregular helix on right
- Applications
  - consensus assignment
  - structure prediction
[Figure panels: 1bg5 AD; 1bg5 DSSP]
Above: Visual comparison of α-helix, β-sheet and β-turn assignments in 1BG5, showing an irregular α-helix detected by AD and not by DSSP.
26 Application 4: Conformational Change and Hinges
- Analysis of conformational change and detection of hinges from a few unaligned conformations using AD tetrahedra
27 Neighbor Changes on Motion
- Motion: major rearrangements at a few key residues, the hinges
- Model as neighbor changes, rather than large dihedral angle changes
- DT contains no conformational change signal; AD tetrahedra do
- In the neighborhood of a hinge region, neighbor relationships change drastically (quantify by changes in AD tetrahedra thresholds)
- Ovotransferrin, threshold color: 0, 0.01-0.1, 0.1-0.5, 0.5-1, 1-2
- Hinge residues from hinge tetrahedra
1TFA SC: apo (open) form
1IEJ SC: holo (closed) form
28 Comparison with literature
Ovotransferrin: hinges from 3 conformations
TrpRS: 8 chains, preTS, MD sim
[Figure: sidechain centroids]
- ? hinge region
- ? isolated hinge residue
- Labeled residues are known from literature
29 Application 5: Family-Specific Fingerprints
- Find residue packing patterns specific to protein families, using graph representations with DT/AD edges.
- Use for family classification and functional annotation
30 Graph Representation
[Figure panels: Proteins; Small Molecules. Legend: peptide edge, proximity edge]
Node label: amino acid type, chemical properties, ...
Edge label: sequence adjacency or structure proximity, determined by distance
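A labeled residue graph of this kind can be sketched in a few lines. The version below uses a plain distance cutoff for proximity edges; the deck itself also uses Delaunay and almost-Delaunay edges, which are not computed here. The function name, the 5.0 cutoff, and the toy coordinates are illustrative assumptions.

```python
import math


def build_residue_graph(residues, cutoff):
    """residues: list of (amino_acid_type, (x, y, z)) in sequence order.
    Returns (node_labels, edges) with edges keyed by index pairs (i, j), i < j."""
    nodes = [aa for aa, _ in residues]
    edges = {}
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            if j == i + 1:
                # Consecutive residues are joined by the backbone.
                edges[(i, j)] = "peptide"
            elif math.dist(residues[i][1], residues[j][1]) <= cutoff:
                # Non-adjacent residues that are close in space.
                edges[(i, j)] = "proximity"
    return nodes, edges


# Hypothetical four-residue chain bent into a U shape.
toy_chain = [
    ("CYS", (0.0, 0.0, 0.0)),
    ("ALA", (3.8, 0.0, 0.0)),
    ("HIS", (3.8, 3.8, 0.0)),
    ("GLY", (0.0, 3.8, 0.0)),
]
nodes, edges = build_residue_graph(toy_chain, cutoff=5.0)
print(nodes)
print(edges)
```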
31 Graph Database Mining
- Input: database of labeled undirected graphs; threshold σ, 0 < σ ≤ 1
- Output: all (connected) frequent subgraphs from the graph database.
- Performance is critical
  - Number of patterns can grow exponentially for large and dense graphs
  - Subgraph isomorphism (NP-complete)
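The notion of support behind "frequent" can be shown on the smallest possible patterns, single labeled edges. This sketch counts, for each edge pattern, the fraction of database graphs containing it and keeps those at or above the threshold. It does not perform general subgraph mining or isomorphism testing; the data layout, names, and toy graphs are assumptions.

```python
from collections import Counter


def frequent_edge_patterns(graphs, sigma):
    """graphs: list of (node_labels, edges), edges being (i, j, edge_label) triples.
    Returns {pattern: support} for single-edge patterns with support >= sigma."""
    support = Counter()
    for labels, edges in graphs:
        patterns = set()
        for i, j, edge_label in edges:
            # Sort endpoint labels so undirected edges have one canonical form.
            a, b = sorted((labels[i], labels[j]))
            patterns.add((a, edge_label, b))
        support.update(patterns)  # each pattern counts once per graph
    n = len(graphs)
    return {p: c / n for p, c in support.items() if c / n >= sigma}


# Hypothetical three-graph database.
database = [
    (["C", "H", "A"], [(0, 1, "prox"), (1, 2, "pep")]),
    (["H", "C"], [(0, 1, "prox")]),
    (["A", "G"], [(0, 1, "pep")]),
]
frequent = frequent_edge_patterns(database, sigma=0.6)
print(frequent)
```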
32 Subgraph mining algorithms developed in our group
- Frequent Subgraph Mining (ICDM03)
  - Canonical Adjacency Matrix (CAM) tree
- Induced Subgraph Mining (RECOMB04)
  - Induced subgraphs are geometrically more rigid, superimposable
  - Miss many useful motifs embedded in a dense graph.
- Maximal frequent subgraph mining (SIGKDD04)
  - Mines only maximal frequent subgraphs (no supergraph is frequent)
  - Uses a spanning tree comparison algorithm
- CliqueHashing and CliqueHashing (ISMB05 demo)
  - Finding frequent cliques in linear time
33 Three Graph Representations
CD
E(DT) ⊆ E(AD) ⊆ E(CD)
34 Family Specific Fingerprints
- Frequent: occur in > 80% of family proteins
- Family-specific: occur in < 5% of background proteins
[Figure: subgraphs G1 and G2, with residue labels TRP141, GLY196, GLY197, CYS42, ALA55, HIS57, CYS58]
Subgraph G1: Not sequence conserved. Useful for the annotation of structural orphans.
Subgraph G2: Sequence-conserved motif C-x(12)-A-x-H-C. Useful for the annotation of both structural orphans and sequences.
Human Kallikrein 6 (1LO6), Serine Protease family
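The two selection criteria on this slide (frequent within the family, rare in the background) amount to a simple filter once pattern occurrences are known. The following is a minimal sketch under that assumption: the pattern names, protein identifiers, and data layout are hypothetical, and the default thresholds mirror the 80% and 5% figures above.

```python
def family_specific(patterns, family, background,
                    min_family=0.8, max_background=0.05):
    """patterns: dict mapping pattern name -> set of protein ids containing it.
    Keeps patterns found in more than `min_family` of the family and in fewer
    than `max_background` of the background."""
    selected = []
    for name, occurrences in patterns.items():
        fam_support = len(occurrences & family) / len(family)
        bg_support = len(occurrences & background) / len(background)
        if fam_support > min_family and bg_support < max_background:
            selected.append(name)
    return selected


# Hypothetical data: 10 family proteins, 100 background proteins.
family = {f"fam{i}" for i in range(10)}
background = {f"bg{i}" for i in range(100)}
patterns = {
    "P1": {f"fam{i}" for i in range(9)} | {"bg0", "bg1"},               # 90% / 2%
    "P2": set(family) | {f"bg{i}" for i in range(10)},                   # 100% / 10%
    "P3": {f"fam{i}" for i in range(5)},                                 # 50% / 0%
}
fingerprints = family_specific(patterns, family, background)
print(fingerprints)
```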
35 Largest Serine Protease Fingerprint
1LO6
Blue: His57-Asp102-Ser195 catalytic triad. Grey: others
36 Cyclin Dependent Protein Kinases (structure of PDB 1B6C)
- 6-residue motif is highlighted in red
- ASP(333) is part of the active site
- Conserved in 18 out of 29 PK proteins.
37 Applications of Family-Specific Fingerprints
- Functional family inference for Structural Genomics
- Functional family inference for predicted structures
- Functional neighbors and remote structural similarity
- Deriving sequence patterns from fingerprints
38 Motivation
- Hypothetical proteins from Structural Genomics
  - structure known, function unknown
- Function has to be inferred from structure
  - Overall fold similarity to a structure with known function
  - Local structure similarity to a structure with known function
- Overall fold similarity not necessary, sometimes misleading
- Existing local structure methods
  - Search for known functional sites
  - Derive templates by clique detection
39 Related work: function inference from local structure
- Detecting similarity to known functional sites
  - SiteEngine (Shulman-Peleg et al., 2003)
  - SURFACE (Ferre et al., 2004)
  - eF-site (Kinoshita and Nakamura, 2004)
  - PINTS-weekly (Stark, Shkumatov and Russell, 2004)
- Detecting functional sites derived from protein families
  - FoldMiner (Shapiro and Brutlag, 2003)
  - Phunctioner (Pazos and Sternberg, 2004)
  - DRESPAT (Wangikar et al., 2003)
  - Common structural cliques (Milik et al., 2003)
[Slide annotations: geom. hashing; surface patches]
40 Method for functional inference
- Pick families from SCOP, EC or other classifications
- Model protein structures by labeled graphs, with almost-Delaunay edges defining proximity
- Enumerate all frequent subgraphs within the family using a subgraph mining algorithm
- Pick frequent subgraphs that are infrequent in the background as family-specific fingerprints
- Search for fingerprints in the structure to be annotated
  - use an index of graph similarity to speed up Ullmann's algorithm
- Assign significance of family membership based on the fingerprints found.
41 Fast Graph Search Using Local Neighborhood Index
Hard case: search for 11 subgraphs in the 6500-protein background dataset (hydrophobic, average 60 occurrences per protein)
[Figure: query subgraph with nodes 1-5 and residue labels ASP 102, SER 214, ALA 196, ALA 55, HIS 57]
Intractable w/o index
42 Function Inference Using Fingerprints
- Given query structure q
- Given fingerprints X1 ... Xm for prospective family Fi
- Say Xq1 ... Xqn occur in q; is q in Fi?
- Simple: approximation based on fingerprints
  - P-value based on number of BG proteins with more fingerprints
- Accurate: Bayesian formula applied to family and background probabilities of Xq1 ... Xqn
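The simple P-value on this slide can be read as an empirical tail fraction over the background. The sketch below assumes that reading and counts ties as "more" to stay conservative; the function name and the background counts are hypothetical, and the Bayesian variant is not shown.

```python
def fingerprint_p_value(query_hits, background_hits):
    """Fraction of background proteins that contain at least as many of the
    family's fingerprints as the query structure does."""
    at_least = sum(1 for h in background_hits if h >= query_hits)
    return at_least / len(background_hits)


# Hypothetical fingerprint counts for ten background proteins.
background_hits = [0, 0, 1, 2, 5, 30, 0, 3, 1, 0]

p_strong = fingerprint_p_value(30, background_hits)  # query matches 30 fingerprints
p_weak = fingerprint_p_value(2, background_hits)     # query matches only 2
print(p_strong, p_weak)
```

A query that matches many fingerprints is rarely equaled by background proteins, so its P-value is small; a query with few matches is not distinguishable from the background.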
43 Advantages of our method
- Sequence similarity: not sensitive enough
- Global fold similarity: misleading
- Functional site similarity
  - Different functional families sometimes share functional sites
  - Exact matching may not be robust (distortion/mutation)
  - Clique methods sacrifice generality of patterns
- Subgraph fingerprints
  - Family-specific, few false positives by definition
  - Multiple fingerprints: consensus
  - Confidence of family membership
44 Cross-validation of fingerprints
- Four-fold CV
  - Splitting family members into training and test sets
  - Mining fingerprints from training sets
  - Report fraction of training set FPs found in test set
    - Eukaryotic Serine Proteases, 59 members, 824
    - Triosephosphate Isomerase, 12 members, 6016
    - Metallodependent Hydrolases, 17 members, 4912
  - Report false positives and false negatives
[Slide annotations: large, homogeneous families; small, diverse families]
45 Discriminating the TIM barrels
- Validation of method: all TIM barrel families are structurally very similar.
46 Annotations missed by SCOP 1.65
New serine protease annotations, based on the number of fingerprints found out of 79 Serine Protease fingerprints: 1op0A (73/79), 1os8A (73/79), 1p57B (73/79), 1s83 (73/79), 1ssx (46/79), 1md8 (45/79).
New Triosephosphate Isomerase (TIM): 1r2r, 1885/1920 fingerprints.
Verified in PDB file headers and literature. All the above except 1op0 have been classified in SCOP 1.67, Feb 2005.
47 Structural Genomics: Function inference I
Metallo-dependent hydrolase: 8-stranded βα (TIM) barrel fold, 17 members, 49 FP
Unknown function: 7-stranded barrel fold, 30 FP found
48 Residues hit by fingerprints
Figures made in VMD
Metallo-dependent hydrolase: 8-stranded βα (TIM) barrel fold, 17 members, 49 FP
Unknown function: 7-stranded barrel fold, 30 FP found
Color legend: Acidic, Basic, Polar, Hydrophobic
49 Function inference for predictions
- Check predicted structures against family of template
  - SNAPP (Fischer et al., 2004), SPREK (Taylor and Jonassen, 2004): not family-specific
  - well-packed predictions with wrong fold may score high
- Fingerprints infer the correct functional family, even if the template chosen is incorrect.
- E.g. CASP5 target T0147, PDB 1m65
  - rare (βα)8 fold, putative metallo-dependent hydrolase (MDH)
  - 107 predictions ranked 1
  - 50 predictions had 50% or more of 49 MDH FP
  - 51 other families had 4 preds with 50% FP
50 Functional Neighbors
- Finding families that share some fingerprints
  - Search for family fingerprints in the background
  - Cluster hits for significant enrichment in SCOP, GO hierarchy
- E.g. find local similarity between remotely related SCOP families
[Figure panels: 1lvl, 1lvl, 1kew]
SCOP: NAD(P)-binding Rossmann fold
SCOP: FAD/NAD-linked reductase
The DALI Z-score of the two structures is 4.5, which suggests that they are dissimilar at the fold level. The pairwise sequence identity is 16%, and there is no local sequence similarity in the region of the motif.
51 Sequence Patterns from Tertiary Packing
- Frequent DT quadruplets or subgraph motifs that are conserved in sequence order, mapped back to sequence → Sparse Sequence Signatures
- Evaluate precision/recall by querying SwissProt
- Overlap with / comparable to PROSITE patterns
- Joint work with Ruchir Shah
Sequence Motif: aa1, aa2, aa3, aa4, d12, d23, d34
Example: D, S, G, P, 2, 3, 7
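A sparse sequence signature of this form can be turned into a regular expression for scanning sequences. The sketch assumes that each d value is the difference in sequence position between consecutive motif residues, so a separation of d leaves d - 1 wildcard positions. That reading agrees with the fingerprint shown earlier in the deck, where CYS42, ALA55, HIS57, CYS58 corresponds to C-x(12)-A-x-H-C, but it remains an assumption, as are the function name and the test sequence.

```python
import re


def motif_to_regex(residues, separations):
    """residues: one-letter codes in sequence order, e.g. "DSGP".
    separations: position differences between consecutive residues."""
    parts = [residues[0]]
    for aa, d in zip(residues[1:], separations):
        gap = d - 1  # number of unconstrained residues in between
        if gap > 0:
            parts.append(".{%d}" % gap)
        parts.append(aa)
    return re.compile("".join(parts))


pattern = motif_to_regex("DSGP", [2, 3, 7])
serine_protease_like = motif_to_regex("CAHC", [13, 2, 1])

# Hypothetical sequence with D, S, G, P at positions 2, 4, 7, 14 (0-based).
match = pattern.search("AADASAAGAAAAAAPA")
print(pattern.pattern, match.start() if match else None)
```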
52 Future Work
- Biological validation of function inference
- Future applications in bioinformatics
  - Hierarchical family fingerprints to infer function for novel folds, with no putative family information
  - Tool for template verification in homology modeling/fold recognition
  - Augment domain classifications (SCOP) with motif-based functions
  - Augment structure neighbor searches (VAST) with functional neighbors
- Robust neighbor relation to accelerate MD, QM simulations
- Improve docking (graph matching, MD)
- Local similarity search
- Other geometric computations (Voronoi volumes/domains, alpha-shapes, ...)
53 Thanks to
- Thesis advisor: Dr. Jack Snoeyink (UNC CS)
- Collaborators in this work
  - Dr. Alexander Tropsha (UNC Pharmacy)
  - Jun (Luke) Huan, Dr. Wei Wang, Dr. Jan Prins (UNC CS)
  - Ruchir Shah (UNC Biomolecular Informatics)
  - Dr. Bala Krishnamoorthy (Washington State U. Pullman, Math)
  - Dr. Charlie Carter (UNC Biochemistry)
- Mother Nature, for her wonderful imprecision and complexity, which are an endless source of problems
54 References
- References to my publications
  - Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. ACM-SIAM Symposium on Discrete Algorithms (SODA04). http://www.cs.unc.edu/debug/papers/AlmDel
  - Bandyopadhyay, D. and J. Snoeyink (2004). Almost-Delaunay simplices: Robust nearest neighbor relations for imprecise points in CGAL. Second CGAL User Workshop, 2004. Software: http://www.cs.unc.edu/debug/software
  - Jun Huan, Wei Wang, Deepak Bandyopadhyay, Jack Snoeyink, Jan Prins, Alexander Tropsha (2004). Finding Protein Family-specific residue packing patterns in Protein Structure Graphs. RECOMB 2004. Invited to Journal of Computational Biology, 2005, in press.
  - Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. A Robust Score for Protein Packing using Almost-Delaunay Tetrahedra. 2005, in submission.
  - Bandyopadhyay, Deepak, Jun Huan, Jinze Liu, Jan Prins, Jack Snoeyink, Wei Wang, and Alexander Tropsha. Protein Functional Family Identification by Fast Subgraph Isomorphism Using Structure-Based Fingerprints Mined from SCOP and EC families. 2005, in submission. Poster presented at Triangle Biophysics Symposium, 2004.
  - Bandyopadhyay, Deepak, Jack Snoeyink, Alexander Tropsha and Charlie Carter. Analysis of Protein Conformational Change Using Almost-Delaunay Tetrahedra. Manuscript in preparation. Poster presented at Pacific Symposium on Biocomputing (PSB), Jan. 2005, Big Island of Hawaii.
  - Bandyopadhyay, Deepak, Alexander Tropsha and Jack Snoeyink. Analyzing Protein Structure using Almost-Delaunay Tetrahedra. UNC-CS Technical Report TR03-043, 2003. Poster presented at RECOMB 2004, March 2004, San Diego, CA.
55 References
- Computational geometry methods applied to protein structure analysis
  - Gerstein, M., J. Tsai, and M. Levitt (1995). The volume of atoms on the protein surface: Calculated from simulation, using Voronoi polyhedra. Journal of Molecular Biology 249(5), 955-966.
  - Tsai, J., R. Taylor, C. Chothia, and M. Gerstein (1999). The packing density in proteins: Standard radii and volumes. Journal of Molecular Biology 290(1), 253-266.
  - Angelov, B., J. Sadoc, R. Jullien, A. Soyer, J. Mornon, and J. Chomilier (2002). Nonatomic solvent-driven Voronoi tessellation of proteins: an open tool to analyze protein folds. Proteins 49(4), 446-456.
  - J. Pontius, J. Richelle and S.J. Wodak (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. Journal of Molecular Biology 264(1), 121-136.
  - H. Edelsbrunner and P. Koehl. The weighted-volume derivative of a space-filling diagram. PNAS, Mar 2003; 100: 2203-2208.
  - Liang, J. and K. A. Dill (2001). Are proteins well-packed? Biophys. J. 81(2), 751-766.
  - J. Liang, H. Edelsbrunner, P. Fu, P. Sudhakar, and S. Subramaniam. Analytical shape computing of macromolecules II: identification and computation of inaccessible cavities inside proteins. Proteins, 33:18-29, 1998.
  - H.L. Cheng. Algorithms for Smooth and Deformable Surfaces in 3D. Ph.D. Dissertation, University of Illinois at Urbana-Champaign, 2002.
  - Y.-E. Ban, H. Edelsbrunner and J. Rudolph. Interface surfaces for protein-protein complexes. Proc. RECOMB 2004.
  - Wernisch, L., M. Hunting, and S. Wodak (1999). Identification of structural domains in proteins by a graph heuristic. Proteins 35(3), 338-352.
  - Wako, H. and T. Yamato (1998). Novel method to detect a motif of local structures in different protein conformations. Protein Engineering 11, 981-990.
56 References
- SNAPP
  - C. W. Carter, B. C. LeFebvre, S. Cammer, A. Tropsha, and M. H. Edgell (2001). Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. Journal of Molecular Biology, 311(4):625-638.
  - B. Krishnamoorthy and A. Tropsha (2003). Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics, 19(12).
  - Tropsha, A., Carter, C., Cammer, S., Vaisman, I. (2003). Simplicial neighborhood analysis of protein packing (SNAPP): a computational geometry approach to studying proteins. Meth. Enzymol., 374, 509-544.
- Hinges
  - Krebs WG, Alexandrov V, Wilson CA, Echols N, Yu H, Gerstein M. (2002). Normal mode analysis of macromolecular motions in a database framework: developing mode concentration as a useful classifying statistic. Proteins. 2002 Sep 1; 48(4):682-95.
  - Jacobs DJ, Rader AJ, Kuhn LA, Thorpe MF (2001). Protein Flexibility Predictions using Graph Theory. Proteins 44, 150-165.
  - M.F. Thorpe, Ming Lei, A.J. Rader, Donald J. Jacobs, and Leslie A. Kuhn (2001). Protein Flexibility and Dynamics using Constraint Theory. J. Molecular Graphics and Modelling 19, 60-69.
- Secondary structure
  - Kabsch, W. and C. Sander (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577-2637.
- Family-specific motifs
  - Cammer, S. A. and A. Tropsha (2000). Identification of sequence specific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture Notes in Computational Science and Engineering. Springer Verlag, New York.
  - J. Huan, W. Wang, and J. Prins (2003). Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. International Conference on Data Mining 03.
  - Jun (Luke) Huan, Wei Wang, Anglinia Washington, Jan Prins, Ruchir Shah, Alexander Tropsha (2004). Accurate Classification of Protein Structural Families Based on Coherent Subgraph Mining. PSB 2004.
  - Huan, J., Wang, W., Prins, J., Yang, J. (2004b). SPIN: Mining maximal frequent subgraphs from graph databases. SIGKDD 2004.
57 Canonical Adjacency Matrix
- The Canonical Adjacency Matrix (CAM) of a graph G is the maximal adjacency matrix for G under a total ordering defined on adjacency matrices.
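The definition above can be made concrete for small graphs by brute force: enumerate every node ordering, read off the lower triangle of the adjacency matrix row by row (node labels on the diagonal, edge labels off it), and keep the largest code. The lexicographic comparison of those codes is an assumed total ordering chosen for illustration, as are the function name and the example graphs; a real miner avoids the factorial enumeration.

```python
from itertools import permutations


def canonical_code(labels, edges):
    """labels: list of node labels (strings).
    edges: dict mapping (i, j) index pairs to edge labels, undirected.
    Returns the maximal lower-triangle code over all node orderings."""
    n = len(labels)

    def entry(i, j):
        if i == j:
            return labels[i]  # diagonal holds the node label
        return edges.get((i, j)) or edges.get((j, i)) or "0"  # "0" = no edge

    best = None
    for perm in permutations(range(n)):
        code = tuple(entry(perm[r], perm[c])
                     for r in range(n) for c in range(r + 1))
        if best is None or code > best:
            best = code
    return best


# Two labelings of the same graph, plus one genuinely different graph.
g1 = canonical_code(["a", "b", "b"], {(0, 1): "y", (0, 2): "x"})
g2 = canonical_code(["b", "a", "b"], {(1, 0): "y", (1, 2): "x"})
g3 = canonical_code(["a", "b", "b"], {(0, 1): "y", (1, 2): "x"})
print(g1 == g2, g1 == g3)
```

Because isomorphic graphs share the same maximal code, comparing canonical codes replaces pairwise isomorphism tests between candidate patterns.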
58 CAM Tree
[Figure: CAM tree of adjacency matrices over node labels a, b, c, d and edge labels x, y, with 0 marking no edge]
59 Chemical Datasets
- Predictive Toxicology Evaluation Competition
  - Dataset: 337 compounds
  - Two class labels: positive (180) and negative (157)
  - Each chemical graph contains 27 nodes and 27 edges on average
- NIH DTP Anti-Viral Screen Test
  - Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM) and Confirmed Inactive (CI)
  - Dataset contains 423 CA and 1083 CM compounds
  - Each chemical graph contains 25 nodes and 27 edges on average
60 Performance (Chemical Datasets)
[Figure panels: PTE; DTP CA/CM]
FFSM and gSpan are currently the most efficient available frequent subgraph mining algorithms