Title: Extracting and Exploiting Structural Patterns in Proteins, especially Relating to Function
1Extracting and Exploiting Structural Patterns in
Proteins, especially Relating to Function
- Janet Thornton
- James Watson, Roman Laskowski - EBI
- Adel Golovin, Kim Henrick - EBI MSD
- David Leader, James Milner-White - Glasgow
- Andrzej Joachimiak, Aled Edwards - MCSG (Mid-West Centre for Structural Genomics)
2Outline
- Structural Motifs
- PDBsum
- MSDmotif
- Functional Motifs
- Catalytic Site Atlas
- DNA Binding Motifs
- Automated templates
- Reverse Templates
- From Structure to Function? - ProFunc
3Structural Motifs
- Structural motifs are commonly occurring small sections of proteins that are distinguished by:
- Sequence: Gly-X-Gly
- Conformation: φ,ψ angles
- Secondary structure: helix, βαβ unit
- Function: catalytic triad, calcium binding site
4Examples of Structural Motifs
AlphaBeta Motif
Beta Turn
Schellman Loop
Beta Bulge (classic)
Nest
Beta Bulge Loop
5Structural Motifs
- They may be continuous along the chain (e.g. GXG) or discontinuous (e.g. catalytic triad)
- Historically, motifs were identified and analysed in an effort to understand the relationship between protein sequence and structure, and to improve prediction methods. They are also used to assign function (Prosite).
- Many motifs can now be recognised automatically from coordinates, using programs such as DSSP and Promotif
- PDB files can be annotated with these structural motifs, e.g. in PDBsum
6 http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
Roman Laskowski
7Example page
8Protein detail
9MSD motif: http://www.ebi.ac.uk/msd-srv/msdmotif
- Adel Golovin
- Currently alpha test
- Full Release probably Oct 2005
PDB 1gci
10MSD motif
- Small 3D motifs from J.Milner-White search/view
- Secondary structure patterns (HTH) search/view
- ?,?,? based search/view
- Ligands and their environment search/view
- Catalytic sites search/view
- Blast sequence search/view
- Prosite compliant patterns search/view
- 3D multiple alignment
11MSDmotif options
12Small motifs
Alpha-Beta Motif
Nest
ST staple
11 motifs in total (Prof James Milner-White): http://doolittle.ibls.gla.ac.uk:9006/david/ProteinMotifDB.html
13Motifs In MSDmotif (1)
AlphaBeta Motif
Beta Turn
Schellman Loop
Beta Bulge (classic)
Nest
Beta Bulge Loop
14Motifs In MSDmotif (2)
Asx Motif
ST Motif
Asx Turn
ST Turn
ST Staple
15Statistics provided by MSDmotif: ST motif
- Amino acid occurrence at each position
- Correlation between side chain charge and residue position
- Motif parameter variation
16Hit List after clicking
17Small motifs: 3D alignment from different families
ST-staple
18MSDmotif options
19Secondary structure patterns
Where N binds sugar Man or Nag
20?,?,? search
PDB 1gci
Ideal for short loops search
21Example of a search using MSDmotif
PDB 1gci: Subtilases family
PDB 1f5p: Globins family
Phi/Psi Search using MSDmotif
Other Subtilases
Calcium binding site
22Sequence search
Zn binding pattern: CXXCXXXFXXXXXLXXHXXXH
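A pattern of this kind can be turned into a regular expression by treating each X as "any residue". The sketch below shows one way to do that in Python. It is an illustration only and not how MSDmotif itself runs its sequence search. The function names and the toy sequence are hypothetical.

```python
import re

ZN_PATTERN = "CXXCXXXFXXXXXLXXHXXXH"  # pattern quoted on the slide


def pattern_to_regex(pattern):
    """Compile a one-letter pattern, reading X as a wildcard for any residue."""
    return re.compile("".join("." if ch == "X" else re.escape(ch) for ch in pattern))


def find_matches(sequence, pattern=ZN_PATTERN):
    """Return (start, matched_segment) for every match, including overlapping ones."""
    regex = pattern_to_regex(pattern)
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        m = regex.match(sequence, start)
        if m:
            hits.append((start, m.group(0)))
    return hits


# Hypothetical sequence invented for this example (not a real protein).
toy = "MK" + "CAACAAAFAAAAALAAHAAAH" + "GG"
print(find_matches(toy))
```

Trying every start position keeps overlapping matches, which `re.finditer` would skip.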
233D alignment
24MSD motif
- Available in alpha version
- http://www.ebi.ac.uk/msd-srv/msdmotif
- Will be published later this year
- Incremental weekly update
- 20 GB disk space on Oracle DB, linear dependency
- 0.8 MB per PDB entry
- Web application server with J2EE servlet engine
- NCBI Blast
25Outline
- Structural Motifs
- PDBsum
- MSDmotif
- Functional Motifs
- Catalytic Site Atlas
- DNA Binding Motifs
- Automated templates
- Reverse Templates
- From Structure to Function? - ProFunc
26Catalytic Site Atlas
- Taken from primary literature
- β-lactamase Class A
- EC 3.5.2.6
- PDB 1btl
- Reaction: β-lactam + H2O → β-amino acid
- Active site residues S70, K73, S130, E166
- Plausible mechanism
27 - The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data.
- Craig T. Porter, Gail J. Bartlett, and Janet M. Thornton
- Nucl. Acids Res. 2004, 32, D129-D133.
- http://www.ebi.ac.uk/thornton-srv/databases/CSA
28- Annotates catalytic residues in the PDB
- Based on a dataset of 514 enzyme families
- Representative catalytic site for each family
- Homologues assigned by Psi-BLAST
- Limited substitution allowed.
- Homologues updated monthly.
- Literature references
- Data also available via MSDsite
- http://www.ebi.ac.uk/thornton-srv/databases/CSA
- http://www.ebi.ac.uk/msd-srv/msdsite
293-D templates
- Use 3D templates to describe the active site of the enzyme
- Analogous to 1-D sequence motifs such as PROSITE, but in 3-D
- Sequence position independent
- Captures essence of functional site in protein
30Pepsin
31Aspartic Proteinase - Active Site residues -
DTGx2
Eukaryotic Fungal Aspartic Proteinases
all-atom DTG-DTG Template
32Aspartic Proteases Active Site Template
Asp CO2
Gly Cα
A template of 8 atoms is sufficient to identify all Aspartic Proteinases
Asp Oδ
Gly Cα
Thr/Ser Oγ
Thr Oγ
33Aspartic Protease Template Search against all PDB
green = true, red = false
34TEmplate Search and Superposition TESS
Wallace et al., 1997
- defines a functional site as a sequence-independent set of atoms in 3-D space
- search a new structure for a functional site
- search a database of structures for similar clusters
e.g. serine proteinase, catalytic triad
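The idea of a sequence-independent template can be illustrated with a small distance-matching search. This is a simplified stand-in, not the TESS algorithm: it enumerates candidate atoms by residue and atom name and keeps the sets whose pairwise distances agree with the template within a tolerance. The coordinates, the tolerance and the function name are hypothetical, and a distance-only comparison cannot tell a site from its mirror image.

```python
import math
from itertools import combinations, product


def match_template(template, structure, tol=0.5):
    """Find atom sets in `structure` whose pairwise distances match `template`.

    template:  list of (res_name, atom_name, (x, y, z))
    structure: list of (res_id, res_name, atom_name, (x, y, z))
    Returns one tuple of residue ids per hit, in template order.
    """
    # Candidate structure atoms for each template atom, by residue and atom name.
    candidates = [
        [a for a in structure if a[1] == res and a[2] == atom]
        for res, atom, _ in template
    ]
    pairs = list(combinations(range(len(template)), 2))
    hits = []
    for combo in product(*candidates):
        # The same structure atom may not be used for two template atoms.
        if len({(a[0], a[2]) for a in combo}) < len(combo):
            continue
        if all(
            abs(math.dist(combo[i][3], combo[j][3])
                - math.dist(template[i][2], template[j][2])) <= tol
            for i, j in pairs
        ):
            hits.append(tuple(a[0] for a in combo))
    return hits


# Made-up coordinates for a three-atom Ser-His-Asp template (illustrative only).
template = [
    ("SER", "OG", (0.0, 0.0, 0.0)),
    ("HIS", "NE2", (3.0, 0.0, 0.0)),
    ("ASP", "OD2", (3.0, 4.0, 0.0)),
]
# Toy structure: the same geometry shifted in space, plus two decoy atoms.
structure = [
    (195, "SER", "OG", (10.0, 10.0, 10.0)),
    (57, "HIS", "NE2", (13.0, 10.0, 10.0)),
    (102, "ASP", "OD2", (13.0, 14.0, 10.0)),
    (214, "SER", "OG", (30.0, 30.0, 30.0)),
    (189, "ASP", "OD2", (50.0, 0.0, 0.0)),
]
print(match_template(template, structure))
```

Because only atom identities and distances are compared, the residues can appear in any order along the chain, which is how a single triad template can match both the trypsin and subtilisin folds.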
35Serine Proteinase templates
- A trypsin-based template of 7 atoms was able to identify almost all serine proteinases in the PDB, including subtilisin
- It also identified active sites of several other functionally distinct enzyme families: serine carboxypeptidase, acetylcholine esterase, lipase, dehalogenase
- The catalytic triad has evolved independently many times
36Active site convergence
Trypsin
Subtilisin
383D Templates to Characterise Functional Sites
Template searches
39Database of enzyme active site templates
189 templates
Carbamoylsarcosine amidohydrolase
Ser-His-Asp catalytic triad
Dihydrofolate reductase
40DNA
Protein
41DNA-binding Motifs
- Helix-Turn-Helix (HTH)
- Standard HTH
- Winged helix
- Beta Sheet
- Zinc-finger
42Prediction of DNA Binding Function using
Structural Motifs
- Predicting function from structure
- Structural motifs
- Helix-Turn-Helix (HTH)
- Bind in major groove
- Carboxyl terminal helix - DNA recognition
- 1/3 of DNA-binding protein families (16/54)
- Brennan and Matthews, 1989; Brennan, 1991
43HTH Motif Proteins
Catabolic activator protein (1ber)
Lambda repressor/operator complex (1lmb)
44HTH Motif Templates
3D template library (E.g. 1berA16-36)
45Predicting DNA binding function
- Scanning template library against 3D structures
- One template T (length n) is scanned against protein P (length m): the RMSD after optimal superposition is calculated at each of the m-n+1 possible positions in P
- The lowest RMSD over these positions is recorded
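The scan described above can be sketched as a sliding window over Cα coordinates, with the Kabsch (SVD) method standing in for the optimal superposition step. The slides do not say which superposition routine was used, so the Kabsch choice, the NumPy dependency and the function names are assumptions. The coordinates are random toy data.

```python
import numpy as np


def superposed_rmsd(a, b):
    """RMSD between two (n, 3) coordinate sets after optimal superposition (Kabsch)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean(axis=0)          # remove translation
    b = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(a.T @ b)
    # Negate the smallest singular value if the best fit would be a reflection.
    if np.linalg.det(u) * np.linalg.det(vt) < 0:
        s[-1] = -s[-1]
    msd = ((a ** 2).sum() + (b ** 2).sum() - 2.0 * s.sum()) / len(a)
    return float(np.sqrt(max(msd, 0.0)))


def scan_template(template, protein):
    """Slide an n-residue template along an m-residue protein.

    Returns (best_start, best_rmsd) over the m - n + 1 window positions.
    """
    n, m = len(template), len(protein)
    if n > m:
        raise ValueError("template is longer than the protein")
    rmsds = [superposed_rmsd(template, protein[i:i + n]) for i in range(m - n + 1)]
    best = int(np.argmin(rmsds))
    return best, rmsds[best]


# Toy data: a random 30-residue "protein" and a template cut from residues 5-11,
# then rotated and translated so that only superposition can recover it.
rng = np.random.default_rng(0)
protein = rng.normal(scale=5.0, size=(30, 3))
theta = np.deg2rad(40.0)
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta), np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
template = protein[5:12] @ rot.T + np.array([3.0, -2.0, 7.0])
print(scan_template(template, protein))
```

In a real run the best RMSD per template would then be compared against a threshold to call a hit.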
46Ideal RMSD distribution
47RMSD Distributions with HTH templates
1.2Å
RMSD
831/23,506 (3.5%) false positives
2/142 (1.4%) false negatives
48HTH Motif Extended Templates
- Extend templates by adding 2 residues to start and end
- 1berA16-36
- 1berA14-38
49RMSD Distributions with extended HTH templates
1.2Å
110/23,506 (0.5%) false positives
2/144 (1.4%) false negatives
50Comparison of RMSD Distributions
51HTH Accessible Surface Area
ASA threshold of 990Å² reduced false positives from 110 to 80
False positive rate of 0.3% (80/23,506)
52Summary
- Structural template library of 144 HTH motifs
- Minimum RMSD for optimal superpositions on whole protein structures, based on Cα coordinates
- Thresholds of 1.2Å RMSD and 990Å² ASA
- Hit rate of 98.6%, false positive rate of 0.3%
- Recognition across sequence families and fold families
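The percentages quoted for the HTH templates follow directly from the counts on the preceding slides. The short check below recomputes them. It reads the hit rate as (144 - 2)/144, which is an assumption about how the 98.6% figure was derived.

```python
def percent(count, total):
    """Return count/total as a percentage."""
    return 100.0 * count / total


# Counts as quoted on the slides.
rates = {
    "original templates, false positives": percent(831, 23506),
    "original templates, false negatives": percent(2, 142),
    "extended templates, false positives": percent(110, 23506),
    "extended templates, false negatives": percent(2, 144),
    "extended templates + ASA filter, false positives": percent(80, 23506),
    "extended templates, hit rate": percent(144 - 2, 144),
}
for label, value in rates.items():
    print(f"{label}: {value:.1f}%")
```

Rounded to one decimal place these give 3.5%, 1.4%, 0.5%, 1.4%, 0.3% and 98.6%.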
53Template databases
- HAND CURATED
- Enzyme active sites (PROCAT) 189 templates
- Currently being extended
- Metal-binding sites 600 templates
- AUTOMATED
- Ligand-binding sites 10,000 templates
- DNA-binding sites 800 templates
54Automatically generated templates
a. For each Het Group in the PDB extract a
non-homologous data set of proteins binding that
Het Group
b. Identify residues interacting with ligand (via
H-bonds or non-bonded contacts)
c. Templates generated from overlapping local
groups of 3-residue clusters
d. Gives over 10,000 ligand-binding templates
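Steps b and c can be sketched as follows. A plain distance cutoff stands in for the H-bond and non-bonded contact analysis, and "local groups" is read as triples of contact residues that all lie close to one another. Both cutoffs, the function names and the coordinates are hypothetical.

```python
import math
from itertools import combinations


def contact_residues(protein_atoms, ligand_atoms, cutoff=4.0):
    """Residue ids with at least one atom within `cutoff` Å of any ligand atom.

    protein_atoms: list of (res_id, (x, y, z)); ligand_atoms: list of (x, y, z).
    """
    return sorted({
        res_id
        for res_id, xyz in protein_atoms
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms)
    })


def three_residue_clusters(protein_atoms, residues, max_sep=8.0):
    """All triples of `residues` whose members lie within `max_sep` Å of each other."""
    def closest(r1, r2):
        # Shortest atom-atom distance between two residues.
        return min(
            math.dist(a, b)
            for i, a in protein_atoms if i == r1
            for j, b in protein_atoms if j == r2
        )

    return [
        triple
        for triple in combinations(residues, 3)
        if all(closest(p, q) <= max_sep for p, q in combinations(triple, 2))
    ]


# Toy data: one atom per residue, four residues around a one-atom ligand, one far away.
protein_atoms = [
    (1, (0.0, 0.0, 0.0)),
    (2, (3.0, 0.0, 0.0)),
    (3, (0.0, 3.0, 0.0)),
    (4, (3.0, 3.0, 0.0)),
    (5, (20.0, 20.0, 20.0)),
]
ligand_atoms = [(1.5, 1.5, 1.0)]

contacts = contact_residues(protein_atoms, ligand_atoms)
clusters = three_residue_clusters(protein_atoms, contacts)
print(contacts)
print(clusters)
```

Four contact residues already yield four overlapping triples, which suggests how the template count can grow to more than 10,000 across all Het Groups.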
55Automatically generated templates
a. Extract a non-homologous data set of
DNA/RNA-binding proteins from the PDB
b. Identify residues interacting with DNA/RNA
(via H-bonds or non-bonded contacts)
c. Templates generated from overlapping local
groups of 3-residue clusters
d. Gives over 800 DNA/RNA-binding templates
56Problems with automated template methods
- WITH A LARGE NUMBER OF TEMPLATES
- Too many hits (usually tens, and often hundreds)
- Use of rmsd rarely discriminates true from false positives
- Local distortion in structure may give a large rmsd
- Top hit rarely the correct hit, even in obvious cases
57An example
58Enzyme active site templates
Hits for 1hsk
102. E.C.1.1.1.158 2.19Å UDP-N-acetylmuramate dehydrogenase
59Comparison of template environments
Similar residues in neighbourhood
Template structure 1mbb
Target structure 1hsk
60Comparison of template environments
61Comparison of template environments
62Environment similarity score
Score equivalent grid-points using Dayhoff matrix
and taking voids into account
Total similarity score obtained from sum of all
grid-point scores
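A toy version of this scoring scheme is sketched below. The slides do not give the grid spacing, the matrix values or the treatment of voids, so everything numeric here is a placeholder. `TOY_SCORES` is not the Dayhoff matrix, and the void penalty and function names are hypothetical.

```python
# Placeholder substitution scores for three residue types (NOT real Dayhoff values).
TOY_SCORES = {
    frozenset("D"): 4, frozenset("E"): 4, frozenset("K"): 5,
    frozenset("DE"): 3, frozenset("DK"): 0, frozenset("EK"): 0,
}
VOID_MISMATCH = -1  # hypothetical penalty: residue on one side, empty space on the other


def grid_point_score(a, b):
    """Score one pair of equivalent grid points; None marks a void."""
    if a is None and b is None:
        return 0  # both empty: treated as neutral in this sketch
    if a is None or b is None:
        return VOID_MISMATCH
    return TOY_SCORES[frozenset((a, b))]


def environment_similarity(template_grid, target_grid):
    """Sum the grid-point scores over every point present in either grid."""
    points = set(template_grid) | set(target_grid)
    return sum(
        grid_point_score(template_grid.get(p), target_grid.get(p))
        for p in points
    )


# Toy grids: grid point -> one-letter residue code, or None for a void.
template_grid = {(0, 0, 0): "D", (1, 0, 0): "K", (0, 1, 0): None, (2, 0, 0): "E"}
target_grid = {(0, 0, 0): "E", (1, 0, 0): "K", (0, 1, 0): None, (2, 0, 0): None}
print(environment_similarity(template_grid, target_grid))
```

Unlike rmsd, a summed score of this kind rewards hits whose surroundings resemble the template environment, not just hits where a few atoms superpose well.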
63Results for 1hsk
Hit  E.C. number    Rmsd (Å)  Score  Enzyme
1.   E.C.1.1.1.158  2.08      209.1  UDP-N-acetylmuramate dehydrogenase
2.   E.C.3.2.1.14   2.13      146.0  Chitinase A (chitodextrinase, 1,4-beta-poly-N-acetylglucosaminidase, poly-beta-glucosaminidase)
3.   E.C.3.2.1.17   1.92      142.4  Turkey lysozyme
4.   E.C.3.2.1.17   1.89      138.7  Hen lysozyme
5.   E.C.3.5.1.26   1.47      132.3  Aspartylglucosylaminidase
6.   E.C.3.2.1.3    1.54      131.1  Glucan 1,4-alpha-glucosidase
64Residue conservation
65Residue conservation and cleft proximity
66Reverse templates
67Comparison of template environments
Identical residues in neighbourhood
Template structure 1mbb
Target structure 1hsk
68Reverse templates
- Typically get 20-40 templates from a single
structure
- Search each template vs PDB (or representative
subset)
- Non-homologous dataset of 2,500 protein chains
- Focused search (eg top DALI hits)
- Locate known PDB entries with closest local
similarity
- Program called the Protein SiteSeer
- Times for search vs 2,500 set
- JESS 30 minutes
- SiteSeer 3 hours
69Structure to Function
[Diagram linking 3D STRUCTURE to: FOLD, MULTIMERS, INTERACTIONS, SURFACE, MUTANTS & SNPs, ELECTROSTATICS, LIGANDS, CLUSTERS. Annotations: evolutionary relationships; biological multimeric state; ligand functional sites; enzyme active sites; catalytic clusters, mechanisms & motifs]
70Protein Function
- Protein function has many definitions
- Biochemical Function - The biochemical role of the protein, e.g. serine protease
- Biological Function - The role of the protein in the cell/organism, e.g. digestion, blood clotting, fertilisation
- The 3D structure usually only provides information about biochemical function
71250 structures solved to date by MCSG
Some examples
40 are hypothetical proteins
72From Gene To Biochemical Function
- Gene → Protein → 3D Structure → Function
- Identifying sequence or structural similarity (i.e. identifying an evolutionary relationship) is the most powerful route to function
- Identification
73From Gene To Biochemical Function
- Gene → Protein → 3D Structure → Function
- Given a protein structure
- Where is the functional site?
- Which ligands bind to the protein?
74Predicting function from 3D Structure
conservation
Residue conservation
- Conservation
- Valdar & Thornton
- Lichtarge et al.
- Aloy et al.
- Glaser et al.
- Etc...
75Predicting function from 3D structure binding
sites
Binding sites
- Binding site comparison
- Geometrical hashing
- eF-site (Nakamura et al.)
- PINTS (Russell)
- Pseudospheres (Klebe)
- pvSOAR (Binkowski et al.)
- etc
76Predicting function from 3D Structure templates
3D templates
77Predicting Binding Site
Binding-site analysis cutA
78Identifying Binding Site Function Using Motifs
- 3D enzyme active site structural motifs (Craig Porter)
- Catalytic Site Atlas - Identification of catalytic residues (Gail Bartlett, Alex Gutteridge)
- Metal binding sites (Malcolm MacArthur)
- Binding site features (Gareth Stockwell)
- Automatically generated templates of ligand-binding and DNA binding motifs (Sue Jones, Hugh Shanahan)
- Reverse templates (Roman Laskowski)
- JESS fast template search algorithm (Jonathan Barker)
79An example
Structure: Rossmann fold, hence many structural homologues
80PROCAT template search
One very strong hit
81ProFunc: function from 3D structure
Roman Laskowski
82http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/
83Goal Function Prediction from Structure
Roman Laskowski James Watson
84Goal Function Prediction from Structure
85MCSG Dataflow
86Functional Annotation
All MCSG structures are automatically run through ProFunc. The results are examined manually to try to estimate the most likely function. The most recent (Nov 2004) dataset contains 193 unique structures:
Some assignment possible: 102 (53%)
Function remains unknown: 23 (12%)
Prior function known: 68 (35%)
87Acknowledgements
- James Watson, Roman Laskowski - EBI
- Adel Golovin, Kim Henrick - EBI MSD
- David Leader, James Milner-White - Glasgow
- Andrzej Joachimiak, Aled Edwards - MCSG (Mid-West Centre for Structural Genomics)
http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
http://www.ebi.ac.uk/msd-srv/msdmotif
http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/