Title: Structural Characterization of Proteins Using Residue Environments
1Structural Characterization of Proteins Using Residue Environments
- Sean D. Mooney, Mike Hsin-Ping Liang, Rob DeConde, and Russ B. Altman
- Stanford University, Indiana University
- PROTEINS: Structure, Function, and Bioinformatics
- 2005
2Outline
- Introduction
- Materials and Methods
- Results
- Discussions
- Conclusions
3Introduction
- A primary challenge for structural genomics is the automated functional characterization of protein structure.
- Current structural methods for identifying function rely on one of the following:
  - Phylogenetic trees derived from sequence similarity (evolutionary trace)
  - Hand-curated molecular fingerprints (template based)
  - Fold recognition and alignment methods (structure comparison)
- Sequence-based methods for functional characterization rely on identifying conserved residues within protein structures
4Introduction (cont)
- It is important to develop sequence-independent methods for identifying function to complement sequence-based methods when they are limited by:
  - Lack of sequence similarity
  - Small datasets
- Methods for identifying key functional residues, or molecular fingerprints, can classify function
5Introduction (cont)
- Sequence-independent structure-based methods for function assignment are challenging for several reasons:
  - Aligning local structure is a difficult computational task
  - Estimating the statistical significance of the results is challenging
  - Scanning through the entire Protein Data Bank (PDB) can be computationally demanding
  - Structural similarity and functional similarity are not always well correlated
6Goal
- Develop a method for unsupervised mining of structural datasets that automatically identifies local regions within protein structures that are statistically associated with a given annotation
- Define the most structurally significant residue environments for a given classification, based on the structural environments represented in that database
7Methods
- S-BLEST (Structure-Based Local Environment Search Tool)
  - Based on the FEATURE representation of a local environment
  - Rapidly searches databases of vectors of local structure properties
- The S-BLEST method relies on k-nearest-neighbor (KNN) searches using a Manhattan distance metric
- Significance score (z-score)
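As a rough illustration of the search described on this slide, the sketch below ranks database vectors by Manhattan distance to a query and converts each distance to a z-score against the distribution of all distances to that query. The function names, the toy 3-dimensional vectors, and the use of the population mean and standard deviation as the background are assumptions for illustration; the slide does not give the S-BLEST scoring formula.

```python
import statistics

def manhattan(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def knn_with_zscores(query, database, k):
    """Return the k nearest database entries as (name, distance, z_score).

    Assumption: z = (d - mean) / stdev over all distances to the query,
    so closer-than-average environments get negative z-scores.
    """
    distances = {name: manhattan(query, vec) for name, vec in database.items()}
    values = list(distances.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [
        (name, d, (d - mean) / stdev if stdev > 0 else 0.0)
        for name, d in ranked[:k]
    ]

# Toy database of hypothetical environment vectors (real vectors have 264 dimensions).
database = {
    "envA": [1.0, 0.0, 2.0],
    "envB": [1.0, 1.0, 2.0],
    "envC": [5.0, 4.0, 0.0],
    "envD": [6.0, 5.0, 1.0],
}
hits = knn_with_zscores([1.0, 0.0, 2.0], database, k=2)
for name, d, z in hits:
    print(f"{name}: distance={d:.1f}, z={z:.2f}")
```

In this toy run the two nearest environments both receive negative z-scores, which matches the sign convention of the thresholds quoted later in the slides (for example -5.5).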
8Local Environment
- Local structure environment
  - A set of concentric shells extending outward from the position of a residue's Cβ atom
- Glycine residues
  - The Cβ atom position was estimated by averaging the positions of Cβ atoms in serine protease 1DSU
  - Glycine is the simplest of the 20 standard amino acids; its side chain is a hydrogen atom
- Each shell contains 66 properties, including:
  - Number of atoms associated with a given residue type
  - Number of positively and negatively charged ions
  - The van der Waals volume of the shell
  - The solvent accessibility
- Four radial boundaries
  - 1.875, 3.75, 5.625, 7.5 Å
- Vector dimension: 264 (4 shells × 66 properties)
[Figures: glycine (1); serine protease 1DSU]
1 http://en.wikipedia.org/wiki/Glycine
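The vector layout on this slide (4 shells × 66 properties = 264 dimensions) can be sketched as follows. The helper names, the handling of distances that fall exactly on a boundary, and the shell-major ordering of the flattened vector are assumptions for illustration; the slide states only the shell boundaries and the per-shell property count.

```python
import bisect

# Outer radii of the four concentric shells, in angstroms (from the slide).
SHELL_BOUNDARIES = [1.875, 3.75, 5.625, 7.5]
PROPERTIES_PER_SHELL = 66
VECTOR_DIMENSION = len(SHELL_BOUNDARIES) * PROPERTIES_PER_SHELL

def shell_index(distance):
    """Return the 0-based shell for a distance from the C-beta atom.

    Returns None for distances beyond the outermost boundary.
    Assumption: a distance exactly on a boundary belongs to the inner shell.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    i = bisect.bisect_left(SHELL_BOUNDARIES, distance)
    return i if i < len(SHELL_BOUNDARIES) else None

def vector_position(shell, prop):
    """Flattened index of property `prop` in shell `shell` (assumed shell-major order)."""
    return shell * PROPERTIES_PER_SHELL + prop

print(VECTOR_DIMENSION)        # 264
print(shell_index(2.0))        # 1
print(vector_position(3, 65))  # 263
```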
9Materials
- Datasets
  - ASTRAL 40 non-redundant structure database (ASTRAL 1.65)
  - X-ray crystal structures only
- Steps for data cleaning
  - Remove all hetero-atoms (PDB HETATM records)
  - Normalization
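The hetero-atom removal step can be sketched as a filter over PDB-format records. The function name and the sample records are illustrative (the coordinates are made up); the normalization step is not shown because the slide does not specify it.

```python
def strip_hetero_atoms(pdb_lines):
    """Drop HETATM records from PDB-format lines, keeping everything else.

    In the PDB format the record name occupies columns 1-6, so HETATM
    records are the lines that start with "HETATM".
    """
    return [line for line in pdb_lines if not line.startswith("HETATM")]

# Hypothetical sample records for illustration only.
sample = [
    "ATOM      1  N   GLY A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  GLY A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "HETATM    3  O   HOH A 101       9.000   4.000  -2.000  1.00  0.00           O",
    "HETATM    4 ZN    ZN A 201       1.000   2.000   3.000  1.00  0.00          ZN",
    "END",
]
cleaned = strip_hetero_atoms(sample)
print(len(cleaned))  # 3
```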
10Identification of Residue Environments Associated With a Structural or Functional Annotation
- The performance of each residue can be determined by creating a receiver operating characteristic (ROC) plot of the ranking
  - TP: a protein structure that belongs to the same SCOP family as the query protein and has a z-score of greater magnitude than the threshold
  - FP: a protein structure that does not belong to the same SCOP family but has a z-score of greater magnitude than the threshold
- The area under the ROC curve (AUC) of a residue in a query structure of known function indicates how well the residue environment classifies the SCOP family of the structure and can range from 0.0 to 1.0.
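Given the TP and FP definitions above, the AUC for one query residue can be computed directly from the ranked z-scores. The sketch below uses the pairwise formulation of AUC (the fraction of positive/negative pairs ranked correctly), which equals the area under the ROC curve. The function name and the toy z-scores are hypothetical, and treating more negative z-scores as more significant is an assumption based on the negative thresholds quoted elsewhere in the slides.

```python
def roc_auc(scored_hits):
    """Area under the ROC curve for (z_score, is_same_family) pairs.

    More negative z-scores are treated as more significant. The AUC equals
    the fraction of (positive, negative) pairs in which the positive has the
    more negative z-score, counting ties as half.
    """
    positives = [z for z, same_family in scored_hits if same_family]
    negatives = [z for z, same_family in scored_hits if not same_family]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical z-scores of database structures against one query residue.
example_hits = [
    (-7.2, True), (-6.5, True), (-6.1, False),
    (-5.8, True), (-3.0, False), (-1.2, False),
]
print(round(roc_auc(example_hits), 3))  # 0.889
```

An AUC of 1.0 means every same-family structure outranks every other structure, and 0.5 corresponds to a random ranking.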
11Congruence Approach for Combining S-BLEST Searches
- Congruence approaches (Shotgun) are a useful way to combine several searches to increase statistical significance.
- If the input is a query with multiple residues (query chain):
  - Each residue in the query chain → most similar residues in each dataset chain (z-score threshold)
  - If there are n residues in the query chain, there are n residues (possibly redundant) in the dataset chain identified as most similar to each of the n residues in the query, each with a z-score.
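One way to read the per-residue matching step above is sketched below: for each query residue, keep the dataset-chain residue with the most negative z-score, then retain only the matches that pass the z-score threshold. The data layout, the function name, and the residue labels are assumptions for illustration; the slide does not specify how the per-residue z-scores are combined into a chain-level score, so that step is left out.

```python
def best_matches(pairwise_z, threshold):
    """For each query residue, pick the dataset residue with the most negative z-score.

    pairwise_z maps query residue -> {dataset residue: z_score}.
    Returns {query residue: (dataset residue, z_score)} for matches at or
    below the threshold; the same dataset residue may appear more than once.
    """
    matches = {}
    for query_res, candidates in pairwise_z.items():
        dataset_res, z = min(candidates.items(), key=lambda item: item[1])
        if z <= threshold:
            matches[query_res] = (dataset_res, z)
    return matches

# Hypothetical z-scores between three query residues and one dataset chain.
pairwise_z = {
    "Q1": {"D1": -6.4, "D3": -2.0},
    "Q2": {"D2": -5.9, "D1": -3.1},
    "Q3": {"D3": -1.5, "D2": -0.7},
}
matches = best_matches(pairwise_z, threshold=-5.5)
print(matches)
```

Here Q1 and Q2 each keep one match, while Q3 has no dataset residue that passes the threshold.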
12Identification of Structurally Similar Residue
Environments in ASTRAL 40 v1.65
- ASTRAL 40 v1.65 encoded 4129 crystallographically
determined structures
[Figure: distance histogram distribution (2TRXA)]
13Identification of the Residue Environments
Associated With a Structural Class
[Figure: 1DI9A, AUC]
Yellow: ATP binding site; gray: peptide binding channel; red: phosphorylated(???) residue
14ROC of the Ranked Chains Output by the Congruence Approach
Of the 27 members in our dataset, the first 25
chains ranked were true positives, whereas the
method failed to recognize 1KOA and 1FMK as
structurally similar (AUC is 0.935).
15Congruence Approach to Characterize Protein
Structures
- Goal: to show that S-BLEST finds structurally similar environments with potential implications for fold, family, and function.
- Select 100 random SCOP families in ASTRAL 40
  - X-ray crystallographic structures only
  - Z-score threshold for each protein is -5.5
16ASTRAL 40 v1.65 (100 random members)
[Figure axes: positive predictive value (PPV); z-score]
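The positive predictive value named on this slide is TP / (TP + FP) among the hits that pass a given z-score threshold. A minimal sketch, with a hypothetical function name and made-up scores, and assuming that a hit passes when its z-score is at or below the threshold:

```python
def ppv_at_threshold(scored_hits, threshold):
    """Positive predictive value TP / (TP + FP) among hits with z_score <= threshold.

    scored_hits is a list of (z_score, is_same_family) pairs.
    Returns None when no hit passes the threshold.
    """
    passing = [same_family for z, same_family in scored_hits if z <= threshold]
    if not passing:
        return None
    return sum(passing) / len(passing)

# Hypothetical hits for illustration only.
ppv_hits = [(-7.0, True), (-6.2, True), (-5.8, False), (-5.6, True), (-4.9, False)]
print(ppv_at_threshold(ppv_hits, -5.5))  # 0.75
print(ppv_at_threshold(ppv_hits, -6.0))  # 1.0
```

Tightening the threshold trades coverage for precision, which is the relationship a PPV versus z-score plot summarizes.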
17Analysis of Uncharacterized PDB Structures
- 86 structures
  - 86 of these structures had no significant hits when searched against the PDB using BLAST with an e-value cutoff of 1e-4
  - How these 86 structures were obtained: search the PDB for the phrase "unknown function"
18Hit Results from the 86 Structures with Unknown
Function (1/5)
True Positive
1VGYA
1LFWA
SCOP C.56.5.4
Residue pairs: ARG97/ARG115, HIS68/HIS87, ASP70/ASP89, GLY98/GLY112, GLU136/GLU154
z-score -6.36, e-value 3e-4
19Hit Results from the 86 Structures with Unknown
Function (2/5)
[Figure: 1VGYA, AUC]
20Hit Results from the 86 Structures with Unknown
Function (3/5)
Of the five true positives in our dataset, three
were the top hits, the fourth was in position
five, and the fifth was ranked 65th overall (AUC
is 0.995)
21Hit Results from the 86 Structures with Unknown
Function (4/5)
Questionable Significance
1B3UA
1OYZA
Clearly structurally related; the best residue matches occur between secondary structure elements (SSEs) and are often observed bridging the structural elements.
z-score -5.21
22Hit Results from the 86 Structures with Unknown
Function (5/5)
Possible Unknown Hit
1LJOA
1B34A
Proteins share the same fold, but their functional relationship is not known.
z-score -5.64, e-value 1e-7
23Discussions
- This method is intended to identify statistically significant environments in protein structures and will be complementary to both sequence-based methods such as BLAST or HMMs and fold recognition methods
- Analysis of a random member from each of 100 random families (SCOP)
  - S-BLEST (z-score threshold of -5.1) finds 28 SCOP family members that BLAST (E-value threshold of 1e-5) does not find
  - BLAST finds 89 family members that S-BLEST does not find
    - Local structural variability between the proteins
  - However, of 66 false family positives, all but 13 (that is, 53) share the superfamily of the query
  - For each BLAST hit, the degree of structural conservation of each residue environment can be easily determined using S-BLEST.
- Enzymes
  - Many residues that were annotated as being important for enzyme chemistry are not the ones that are most useful for recognizing structural similarities.
  - The method sometimes does not select the critical residues (such as the catalytic triad), likely because the environments around those residues are structurally variable between members.
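The family versus superfamily distinction used in the discussion above can be checked from SCOP concise classification strings (sccs), which have the form class.fold.superfamily.family, for example c.56.5.4. The sketch below reports the deepest level two such strings share; the function name and the comparison identifiers other than c.56.5.4 are illustrative.

```python
LEVELS = ["class", "fold", "superfamily", "family"]

def deepest_shared_level(sccs_a, sccs_b):
    """Return the deepest SCOP level shared by two sccs strings, or None.

    Comparison is case-insensitive and stops at the first level that differs.
    """
    shared = None
    parts_a = sccs_a.lower().split(".")
    parts_b = sccs_b.lower().split(".")
    for level, a, b in zip(LEVELS, parts_a, parts_b):
        if a != b:
            break
        shared = level
    return shared

print(deepest_shared_level("c.56.5.4", "C.56.5.4"))  # family
print(deepest_shared_level("c.56.5.4", "c.56.5.1"))  # superfamily
print(deepest_shared_level("c.56.5.4", "d.2.1.3"))   # None
```

Under this reading, a hit counts as a false family positive when the result is anything other than "family", and it still shares the query's superfamily when the result is "superfamily".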
24Conclusions
- We developed S-BLEST to meet a need for rapidly identifying structures similar to a query protein using local structural environments.
- S-BLEST identifies constellations of structurally similar residues between the query protein and the full database of known protein structures.
- We found that many of the structural environments in SCOP have statistically significant local environment neighbors.
- S-BLEST was able to associate 20 proteins with at least one local structure neighbor and identify the amino acid environments that are most similar between those neighbors.