Title: The Structure Lectures
1The Structure Lectures
- Boris Steipe
- boris.steipe_at_utoronto.ca
- http://biochemistry.utoronto.ca/steipe
- Departments of Biochemistry and Molecular and Medical Genetics
- Program in Proteomics and Bioinformatics
- University of Toronto
2Lecture 9.0 Use of Protein Structure
- ( Some slides have been adapted from material by
Chris Hogue, Toronto, prepared for CBW in 2002)
3Concepts
- "Sequence" and "structure" are abstractions of
biopolymers.
- Structure can be determined experimentally.
- Structure abstractions can be stored, retrieved and visualized.
- Knowledge of structure allows mechanistic explanations.
- Structure is not arbitrary, but comes in units - motifs, helices, strands, domains and complexes.
- Domains are folding units, functional units and units of inheritance.
4Concept 1
- "Sequence" and "structure" are abstractions of
biopolymers.
5Physical Amino Acids and Amino Acid Abstractions
Formula: C9H9NO2
Smiles String: CH(NHR)(C(O)R) CH2-c1(cHcHc(cHcH1)OH)
Name: Tyrosine
3-Letter: Tyr
1-Letter: Y
ATOM   1091  N   TYR   145     -35.676 -13.136  50.622  1.00 10.36
ATOM   1092  CA  TYR   145     -36.931 -13.763  51.019  1.00 10.63
ATOM   1093  C   TYR   145     -37.676 -12.879  52.016  1.00 11.16
ATOM   1094  O   TYR   145     -37.061 -12.316  52.926  1.00 13.91
ATOM   1095  CB  TYR   145     -36.660 -15.140  51.638  1.00  9.52
ATOM   1096  CG  TYR   145     -37.845 -15.737  52.361  1.00  6.36
ATOM   1097  CD1 TYR   145     -38.144 -15.357  53.663  1.00  3.30
ATOM   1098  CD2 TYR   145     -38.691 -16.652  51.727  1.00  6.14
ATOM   1099  CE1 TYR   145     -39.248 -15.856  54.311  1.00  5.57
ATOM   1100  CE2 TYR   145     -39.804 -17.165  52.376  1.00  4.89
ATOM   1101  CZ  TYR   145     -40.076 -16.757  53.670  1.00  4.35
ATOM   1102  OH  TYR   145     -41.170 -17.231  54.345  1.00  4.44
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
6The Concept of Abstract Amino Acids Allows Highly
Compressed Information
Nucleophile
H-bond Donor
Bulky
Phospho-Acceptor
Hydrophobic
H-Bond Acceptor
Y
Aromatic
2 side chain rotational freedom
7The Concept of Abstract Amino Acid Similarity is
Lossy
Nucleophile (CDESTY)
H-bond Donor (CHKNQRSTWY)
Bulky (FILQRYW)
Phospho-Acceptor (STY)
Hydrophobic (FAMILYVW)
H-Bond Acceptor (DEHNQSTY)
Y
Aromatic (FWH)
2 side chain rotational freedom (CDFHSW)
8Structure Contextualizes Sequence
V V I Y T T G
(Tyr262 in 1ERQ.pdb)
9Structural Abstraction
- To store structures we need coordinate, topology, and chemical type information.
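[Figure: methionine (Met) drawn in an x, y, z coordinate system; atoms are labelled by chemical type (carbon, nitrogen, oxygen, sulphur) and by position (α, β, γ, δ, ε)]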
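The three kinds of information named on this slide can be held in a very small data structure. The following is a minimal sketch, not the layout of any particular database: the class names Atom and Residue and the method neighbours are hypothetical, and the coordinates are made-up placeholders rather than values from a real structure.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str      # position label within the residue, e.g. "CA", "SD"
    element: str   # chemical type
    xyz: tuple     # coordinates (x, y, z) in Angstrom


@dataclass
class Residue:
    name: str
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)  # topology: pairs of atom names

    def neighbours(self, atom_name):
        """Return the names of all atoms bonded to atom_name."""
        found = []
        for a, b in self.bonds:
            if a == atom_name:
                found.append(b)
            elif b == atom_name:
                found.append(a)
        return found


# Methionine; the coordinates are illustrative placeholders only.
met = Residue(
    name="MET",
    atoms=[
        Atom("N", "N", (0.0, 0.0, 0.0)),
        Atom("CA", "C", (1.5, 0.0, 0.0)),
        Atom("C", "C", (2.0, 1.4, 0.0)),
        Atom("O", "O", (1.3, 2.4, 0.0)),
        Atom("CB", "C", (2.0, -0.8, 1.2)),
        Atom("CG", "C", (3.5, -0.8, 1.3)),
        Atom("SD", "S", (4.1, -1.8, 2.7)),
        Atom("CE", "C", (5.9, -1.6, 2.5)),
    ],
    bonds=[("N", "CA"), ("CA", "C"), ("C", "O"), ("CA", "CB"),
           ("CB", "CG"), ("CG", "SD"), ("SD", "CE")],
)

print(met.neighbours("SD"))
```

Keeping the bond list separate from the coordinates makes the point of the slide explicit: topology is stored information and does not have to be guessed from distances.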
10Concept 2
- Structure can be determined experimentally.
11Experimental sources of structure
- Crystallization required
- Diffraction → data collection
- The phase problem: MAD, heavy metal isomorphous derivatives ...
- ... or "Molecular replacement" give phase approximations
- Model building in electron density maps
- Refinement
12Experimental sources of structure
Crystallization is limiting. Diffraction is not
imaging! Refinement is required.
X-ray NMR
Model
Data
http://www-structure.llnl.gov/Xray/101index.html
13Experimental sources of structure
- High concentration required (1 mM)
- Assignment of peaks ...
- ... determination of crosspeaks → distance constraints
- Calculation of models from distance constraints
- Refinement
14Experimental sources of structure
X-ray NMR
1DRO.PDB
Ensemble of structures that are compatible with
experimental distance constraints
Consensus model
Concentration/Solubility Assignment and
NOEs Refinement
15Assessing structure quality
- Metrics
- Resolution, R-factor and R-free
- Bond length and angle deviations
- Coordinate error can be estimated from diffraction data
http://www.sci.sdsu.edu/TFrey/Bio750/Bio750X-Ray.html
Programs Whatcheck and Procheck calculate quality metrics:
http://swift.cmbi.kun.nl/WIWWWI//fullcheck.html
http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html (also NMR)
Rules of thumb for "good structures": resolution about 2 Å or better, R-factor about 20% or lower, mean coordinate error about 0.2 Å, RMSD of bond lengths about 0.02 Å
16Concept 3
- Structure abstractions can be stored, retrieved
and visualized.
17The PDB
The PDB is the primary repository of protein
structure data.
http://www.rcsb.org/pdb
18What's in a Structure File?
- Population experiments
- X-ray, 1 structure
- NMR - sometimes many structures
- Incomplete - not all atoms are there
- Hydrogens, parts of the protein in motion
- Crystallographic space
- correct, but not always relevant
19The PDB format
- Flat file, column oriented
- Human readable
- Human editable
- Huge legacy problems
Flat file: a data file without indexing structure or hierarchy, in contrast to a relational database or a data grammar.
20Header
HEADER    IMMUNOGLOBULIN                          01-MAR-93   2IMM      2IMM   2
COMPND    IMMUNOGLOBULIN VL DOMAIN (VARIABLE DOMAIN OF KAPPA LIGHT      2IMM   3
COMPND   2 CHAIN) OF MCPC603                                            2IMM   4
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT SYNTHETIC M603 GENE          2IMM   5
AUTHOR    B.STEIPE,R.HUBER                                              2IMM   6
REVDAT   1   15-JUL-93 2IMM    0                                        2IMM   7
REMARK   1                                                              2IMM   8
REMARK   1 REFERENCE 1                                                  2IMM   9
REMARK   1  AUTH   B.STEIPE,A.PLUCKTHUN,R.HUBER                         2IMM  10
REMARK   1  TITL   REFINED CRYSTAL STRUCTURE OF A RECOMBINANT           2IMM  11
REMARK   1  TITL 2 IMMUNOGLOBULIN DOMAIN AND A                          2IMM  12
REMARK   1  TITL 3 COMPLEMENTARITY-DETERMINING REGION 1-GRAFTED MUTANT  2IMM  13
REMARK   1  REF    J.MOL.BIOL.                   V. 225   739 1992      2IMM  14
REMARK   1  REFN   ASTM JMOBAK  UK ISSN 0022-2836                  070  2IMM  15
...
REMARK   2                                                              2IMM  23
REMARK   2 RESOLUTION. 2.00 ANGSTROMS.                                  2IMM  24
REMARK   3                                                              2IMM  25
...
21Seqres
...
SEQRES   1    114  ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL  2IMM  35
SEQRES   2    114  SER ALA GLY GLU ARG VAL THR MET SER CYS LYS SER SER  2IMM  36
SEQRES   3    114  GLN SER LEU LEU ASN SER GLY ASN GLN LYS ASN PHE LEU  2IMM  37
SEQRES   4    114  ALA TRP TYR GLN GLN LYS PRO GLY GLN PRO PRO LYS LEU  2IMM  38
SEQRES   5    114  LEU ILE TYR GLY ALA SER THR ARG GLU SER GLY VAL PRO  2IMM  39
SEQRES   6    114  ASP ARG PHE THR GLY SER GLY SER GLY THR ASP PHE THR  2IMM  40
SEQRES   7    114  LEU THR ILE SER SER VAL GLN ALA GLU ASP LEU ALA VAL  2IMM  41
SEQRES   8    114  TYR TYR CYS GLN ASN ASP HIS SER TYR PRO LEU THR PHE  2IMM  42
SEQRES   9    114  GLY ALA GLY THR LYS LEU GLU LEU LYS ARG              2IMM  43
...
Explicit (above) and implicit sequence may differ!
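Extracting the explicit sequence from SEQRES records is a small exercise in reading the file. The sketch below is one possible way to do it, under a simplifying assumption that is not part of the PDB format: every whitespace-separated token that is a standard three-letter code is taken as a residue, and everything else on the line is skipped. The function name seqres_to_one_letter is hypothetical.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_one_letter(lines):
    """Concatenate the residues of all SEQRES records into one string.

    Assumption for this sketch: tokens that are not standard three-letter
    codes (record name, serial number, chain length, trailing entry
    labels, non-standard residues) are silently skipped.
    """
    sequence = []
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        for token in line.split()[1:]:
            if token in THREE_TO_ONE:
                sequence.append(THREE_TO_ONE[token])
    return "".join(sequence)


# First and last SEQRES record of 2IMM, as shown on the slide.
records = [
    "SEQRES   1    114  ASP ILE VAL MET THR GLN SER PRO SER SER LEU SER VAL  2IMM  35",
    "SEQRES   9    114  GLY ALA GLY THR LYS LEU GLU LEU LYS ARG              2IMM  43",
]
print(seqres_to_one_letter(records))
```

Running the same function on residue names collected from the ATOM records would give the implicit sequence, which can then be compared with the explicit one.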
22Atom
Pitfalls: The atom name is a mix of chemical element and bond topology: "CA.." ≠ ".CA." (calcium versus alpha carbon; the position within the four-character field matters). The sequence number is actually a string; chain and insertion code are required to make it unique (e.g. B 123A).
ATOM    119  CA  ARG    18       8.386  51.105  35.847  1.00  7.30      2IMM 179
Fields, left to right: record type, atom number, atom name, amino acid type, sequence number, X, Y, Z, occupancy (Occ), B (temperature factor).
PDB format is strictly column oriented!
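Because the format is column oriented, a record is best read by slicing at fixed positions rather than by splitting on whitespace. The following is a minimal sketch using the column positions of the standard PDB ATOM record; the function name parse_atom and the dictionary keys are hypothetical, and the example record is assembled piece by piece so that the columns are visible.

```python
# The ATOM record from the slide, assembled field by field.
RECORD = (
    "ATOM  "      # columns  1-6   record type
    "  119"       # columns  7-11  atom number
    " "           # column  12     blank
    " CA "        # columns 13-16  atom name (blanks are significant)
    " "           # column  17     alternate location
    "ARG"         # columns 18-20  amino acid type
    " "           # column  21     blank
    " "           # column  22     chain identifier (none here)
    "  18"        # columns 23-26  sequence number
    " "           # column  27     insertion code
    "   "         # columns 28-30  blank
    "   8.386"    # columns 31-38  X
    "  51.105"    # columns 39-46  Y
    "  35.847"    # columns 47-54  Z
    "  1.00"      # columns 55-60  occupancy
    "  7.30"      # columns 61-66  temperature factor
)


def parse_atom(line):
    """Slice one ATOM or HETATM record into its fields."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16],                         # kept unstripped on purpose
        "res_name": line[17:20],
        "chain": line[21],
        "res_seq": (line[22:26] + line[26]).strip(),  # a string, not a number
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


atom = parse_atom(RECORD)
print(atom["name"], atom["res_name"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

The atom name is deliberately not stripped, and the sequence number is joined with the insertion code and kept as a string, which reflects the two pitfalls named above.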
23Hetero Atoms
...
HETATM  877  O   HOH     1      -4.169  60.050  40.145  1.00  3.00      2IMM 937
...
http://xray.bmc.uu.se/hicup/
24The crystallographic asymmetric unit does not
necessarily contain a functional molecule
The contents of a crystal lattice unit cell can
be generated from the asymmetric unit by applying
the required symmetry operations for the
crystallographic space-group. But neither is this
trivial for the non-crystallographer, nor is it
obvious which of the symmetry replicates might
make physiological contacts.
1qpi.pdb Tet-repressor/operator complex
25... Biological Unit
PQS reasons automatically about how a monomer
might be correctly completed to a functional
biomolecular complex (and is often correct).
http://pqs.ebi.ac.uk/
26NCBI structure group
MMDB - very well integrated but somewhat
impenetrable.
27NDB
http://ndbserver.rutgers.edu/NDB/
urx035.pdb (Hammerhead Ribozyme)
28PDBsum - and "secondary" structure databases
http://www.biochem.ucl.ac.uk/bsm/pdbsum/
29PDBsum - Information
30Others
- Macromolecular Structure Database at EBI (Relibase, PQS ...)
- http://www.ebi.ac.uk/msd/
- Macromolecular structure related resources at the PDB
- http://www.rcsb.org/pdb/links.html
- Structure links at the Southwestern Biotechnology and Informatics Center
- http://www.swbic.org/links/1.19.2.5.php
- Molecular Models from Chemistry
- http://people.ouc.bc.ca/woodcock/molecule/molecule.html
- Molecular Library
- http://www.nyu.edu/pages/mathmol/library/
- ... many, many more.
31Concept 4
- Knowledge of structure allows mechanistic
explanations.
32Structure as an integrated map - Example questions
- Which part of my structure appears to be conserved?
- Are two functionally important residues possibly in contact?
- Where is Asn220 relative to the active site?
- May the mutation E123A possibly have something to do with protein stability?
- Is Leu234 on the surface, or in the core?
- I want to clone my protein into a yeast two-hybrid system: should I fuse the DNA binding domain to the N- or the C-terminus?
33Geometric relationships
- Bonds
- Angles, plain and dihedral
- Surfaces
- Chemical potential, amino acid functions
- Static and dynamic disorder
- Structural similarity
- Electrostatics
- Conservation patterns (structural and functional)
- Quaternary structure
- Posttranslational modification sites
- Unexpected homology
- ...
34Distances from coordinates
XYZ coordinates are vectors in an orthogonal
coordinate system, in Å.
All the rules of analytical geometry apply.
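...
ATOM    687  OH  TYR    86       7.415  62.584  32.900  1.00  3.37
...
ATOM    651  O   ASP    82       9.996  62.571  32.488  1.00  5.18
...
\begin{align*}
d &= \left[(9.996-7.415)^2 + (62.571-62.584)^2 + (32.488-32.900)^2\right]^{0.5} \\
  &= \left[(2.581)^2 + (-0.013)^2 + (-0.412)^2\right]^{0.5} \\
  &= \left[6.661561 + 0.000169 + 0.169744\right]^{0.5} \\
  &= \left[6.831474\right]^{0.5} \\
  &= 2.614\ \text{Å} = 0.2614\ \text{nm} = 2.614 \cdot 10^{-10}\ \text{m}
\end{align*}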
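The same distance can be computed directly from the two coordinate triples. This is a minimal sketch; the function name distance and the variable names are chosen here for illustration only.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z) in Angstrom."""
    return math.sqrt(sum((q - p) ** 2 for p, q in zip(a, b)))


tyr86_oh = (7.415, 62.584, 32.900)   # ATOM 687, OH of Tyr 86
asp82_o = (9.996, 62.571, 32.488)    # ATOM 651, O of Asp 82

d = distance(tyr86_oh, asp82_o)
print(f"{d:.3f} Angstrom")
```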
35Dihedral angles
Single bonds: freely rotatable, but constrained by steric overlap. Small energetic barrier, preference for staggered conformations.
Double bonds: constrained to planar geometry. Large energetic barrier to isomerization.
[Figure: dihedral angle φ defined by four consecutive atoms i, i+1, i+2 and i+3]
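A dihedral angle follows from the coordinates of four consecutive atoms. The sketch below uses one common vector formulation (project the two outer bonds onto the plane perpendicular to the central bond, then take the signed angle between the projections); it is an illustration, not necessarily the method used for the figure. The function names are hypothetical and the test points are artificial.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the bond p1-p2."""
    b0 = sub(p0, p1)            # first bond, pointing away from the axis
    b1 = sub(p2, p1)            # central bond (rotation axis)
    b2 = sub(p3, p2)            # last bond
    length = math.sqrt(dot(b1, b1))
    b1 = tuple(x / length for x in b1)
    # Components of b0 and b2 perpendicular to the central bond.
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Artificial test geometry: the central bond lies along the z axis.
p0, p1, p2 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(dihedral(p0, p1, p2, (1.0, 0.0, 1.0)))    # eclipsed (cis)
print(dihedral(p0, p1, p2, (0.0, 1.0, 1.0)))    # a quarter turn
print(dihedral(p0, p1, p2, (-1.0, 0.0, 1.0)))   # trans
```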
36Backbone dihedral angles: Ramachandran plots
Rotatable bonds in the backbone are named φ, ψ and ω.
Due to steric overlap, not all combinations of (φ, ψ) are allowed.
Allowed and forbidden regions of (φ, ψ) space are shown on the Ramachandran plot.
Observed (φ, ψ) values reflect the theoretical boundaries well.
37Sidechain rotamers
100 randomly chosen Phe-residues superimposed.
Ponder & Richards (1987) J. Mol. Biol. 193, 775-791
http://dunbrack.fccc.edu/bbdep/
38H-bond patterns
Tyr-Thr sidechain H-bond: despite canonical geometry, correct topology may be ambiguous!
Example TYR - Side Chain Donor: OH can donate a single hydrogen. (The OH-H bond is 1.00 Å long and lies in the plane of CE1, CE2, CZ and OH, forming an angle of 110 degrees with the CZ-OH bond.)
Distribution of H-bond counts in all and buried residues, D-A distances, H-A distances and D-H-A angles in Tyr sidechains.
McDonald & Thornton (1994) J. Mol. Biol. 238, 777-793
http://www.biochem.ucl.ac.uk/bsm/atlas/
39Molecular surface
Chain "A" of 1AON.PDB - GroEL/ES complex
Surface rendering of GroEL/ES complex (D.
Goodsell)
40Molecular surface
Surface provides a visual metaphor, and a useful
tool to map properties. But how can a molecular
surface be defined? Obviously, the hard-sphere
surface is chemically not very relevant.
Van der Waals surface
41Molecular surface
Probe !
Van der Waals surface
42Molecular surface
Contact surface
Accessible surface
"Accessible"
Van der Waals surface
Reentrant surface
"Buried"
43Calculating solvent accessible surfaces
- Draw a sphere around each atom, with a radius of (VdW + solvent probe).
- Erase all overlapping sphere surfaces.
- The remaining area is the accessible surface.
VdW radii: C 1.75 Å, N 1.55 Å, O 1.4 Å, H 1.17 Å
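The three steps above can be sketched numerically: place test points on each expanded sphere, discard those that fall inside a neighbouring expanded sphere, and scale the sphere area by the surviving fraction. This is a minimal illustration, not an optimized implementation. The probe radius of 1.4 Å, the number of points, the random seed and the function names are assumptions made for this sketch; only the VdW radii are taken from the slide.

```python
import math
import random


def unit_sphere_points(n, rng):
    """n random points, uniformly distributed on the unit sphere."""
    points = []
    for _ in range(n):
        theta = 2.0 * math.pi * rng.random()
        z = 2.0 * rng.random() - 1.0
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


def accessible_surface(atoms, probe=1.4, n_points=500, seed=0):
    """Accessible surface area per atom, in square Angstrom.

    atoms: list of (x, y, z, vdw_radius). probe, n_points and seed are
    assumed defaults for this sketch.
    """
    rng = random.Random(seed)
    unit = unit_sphere_points(n_points, rng)   # placed once, translated as needed
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe                     # VdW + solvent probe
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            # A point is erased if it lies inside any other expanded sphere.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2
                for j, (xj, yj, zj, rj) in enumerate(atoms) if j != i
            )
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * big_r ** 2 * exposed / n_points)
    return areas


lone_carbon = accessible_surface([(0.0, 0.0, 0.0, 1.75)])
c_o_pair = accessible_surface([(0.0, 0.0, 0.0, 1.75), (1.2, 0.0, 0.0, 1.4)])
print(lone_carbon, c_o_pair)
```

An isolated atom keeps all of its test points, so its area is simply that of the expanded sphere; any neighbour within reach reduces it.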
44Parameters and assumptions
Problem: Analytical solution inefficient. Solution: Numerical solution with probe points.
Problem: Regular placement of n probe points. Solution: Stochastic placement.
Problem: Stochastic placement quite irregular. Solution: Enforce minimum separation.
Problem: Efficiency. Solution: Place points only once, translate as needed.
Problem: What is a good value for n? Solution: Try different n, evaluate standard deviation.
Problem: Should n be constant per atom, or per area? Solution: dots/area - need to scale dots with r VdW.
Problem: Hydrogens - where to get united atom radii? Solution: Literature search.
Problem: Reference areas for relative SAA needed. Solution: Model explicitly, as tripeptides ...
\[ u, v \in (0, 1), \qquad \theta = 2\pi u, \qquad \phi = \cos^{-1}(2v - 1) \]
http://mathworld.wolfram.com/SpherePointPicking.html
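Stochastic placement with an enforced minimum separation can be sketched as follows, using the sphere point picking formula θ = 2πu, φ = cos⁻¹(2v − 1) with u and v drawn uniformly. The function name, the rejection strategy, and the parameters min_separation, seed and max_tries are assumptions made for this illustration.

```python
import math
import random


def random_sphere_points(n, min_separation=0.0, seed=None, max_tries=100000):
    """Place n points on the unit sphere, rejecting points that are too close.

    theta = 2*pi*u and phi = acos(2*v - 1) give a uniform distribution on
    the sphere; min_separation is a straight-line distance between points.
    """
    rng = random.Random(seed)
    points = []
    tries = 0
    while len(points) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all points; lower min_separation")
        u, v = rng.random(), rng.random()
        theta = 2.0 * math.pi * u
        phi = math.acos(2.0 * v - 1.0)
        p = (math.sin(phi) * math.cos(theta),
             math.sin(phi) * math.sin(theta),
             math.cos(phi))
        if all(math.dist(p, q) >= min_separation for q in points):
            points.append(p)
    return points


dots = random_sphere_points(50, min_separation=0.2, seed=1)
print(len(dots))
```

The choice of n and of the minimum separation are exactly the kind of hidden parameters the slide refers to; different values give different surface estimates.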
Even a straightforward algorithm has its hidden
parameters and assumptions. Results are
meaningful only in this context. Any comparison
is problematic.
45Mapping properties on surfaces
- Properties of atoms (B-factors)
- Ensemble properties of residues (hydrophobicity, conservation)
- Geometry (local curvature)
- Fields and potentials (isosurfaces, binding potential)
AChE (1ACL.PDB) color coded by electrostatic potential with GRASP (http://trantor.bioc.columbia.edu/grasp/).
46Concept 5
- Structure is not arbitrary, but contains
recurring units.
47Basic building blocks of structure
E.g. PROMOTIF, as used in PDBsum
But classical descriptions of structural
building blocks are as much based on idealized
concepts of geometry as on observations of
nature. An unbiased analysis may arrive at
significantly different classifications !
48Unbiased structure motifs alignment with added
value
Motif alignments ... Why are particular amino
acids conserved? What is essential in a sequence ?
A structure motif consensus sequence, compiled
from unrelated segments, averages out features of
conservation that are only due to incomplete
divergence (homology). A consensus sequence,
taken from different structural contexts,
averages out features of sequence that are due to
specific functional (binding, catalysis) or
non-local structural requirements (packing,
interaction). What remains is information about
sequence propensities of local structural
elements.
49A schematikon motif example: complex loop
Motif 1icf 215, Length 7, Support 7, Unique 7, Rank 399
50A schematikon motif example: strand N-cap
Motif 1whi 35, Length 4, Support 7, Unique 7, Rank 444
51Concept 6
- Domains are folding units, functional units, and units of inheritance.
52Domains are ubiquitous in proteins
Large proteins are composed of compact,
semi-independent units - domains.
Reason: modularity, folding efficiency
2MCP.PDB
53Domains in proteins
Number of domains in 787 representative proteins
used as the basis for the CATH database
Jones S et al. (1998) Protein Science 7:233
54Domains in proteins
Non-random relationship between domain number and
chain length in the 787 representative proteins
used as the basis for the CATH database
Jones S et al. (1998) Protein Science 7:233
55Domains in proteins
Domain size in the 787 representative proteins
used as the basis for the CATH database
Jones S et al. (1998) Protein Science 7:233
56There is no universal definition of "domains"
Possible definitions are based on independently
inherited (sub)sequences (sequence domain),
modular protein functions (functional domain),
folding unit or atomic contacts (structural
domain).
Domain: a part of structure that can fold
irrespective of the presence of other parts of
structure.
But what is measured is commonly sequence,
function, or structure - NOT FOLDING!
57Further complications
Analogous structure, Domain insertions, Circular
permutations, Domain swapping.
Domain insertion 1A2J.PDB Protein disulfide
isomerase
2TRX.PDB Thioredoxin
58Further complications
Analogous structure, Domain insertions, Circular
permutations, Domain swapping.
59Further complications
Analogous structure, Domain insertions, Circular
permutations, Domain swapping.
Domain swapping 11BG.PDB Bull seminal ribonuclease
60Domains can be elusive
The separation of a structure into domains
requires the arbitrary definition of thresholds
in a continuum of possibilities.
informed
61Why care ?
Function evolution works on sequence, but
selects function. Definition of domains in
structure can uncover functional units that may
evolve independently. Sequence searches,
alignments etc. with domains are much more
specific. Once structural domains have been
defined, sequence profiles, HMMs or other
computational procedures can be used to pick out
more members of the domain family from the
database. Domains can be defined from sequence
patterns, or from the analysis of structure.
62Automated (objective) domain definition -
Sequence (CDD)
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
CDD from Smart and Pfam; CDART from CDD and Genbank
63SemiAutomated consensus domain definition -
Structure (CATH)
Dihydrolipoamide dehydrogenase 1LPFA
Jones S et al. (1998) Domain assignment for
protein structures using a consensus approach:
Characterization and analysis. Protein Science
7:233-242
64SCOP CATH structural classification
The eight most frequent SCOP Superfolds
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.biochem.ucl.ac.uk/bsm/cath/
65CATH - Class
Class 2 Mainly Beta
Class 3 Mixed Alpha/Beta
Class 4 Few Secondary Structures
66CATH - Architecture
Super Roll
Barrel
2-Layer Sandwich
67CATH - Topology
Serine Protease
Aconitase, domain 4
TIM Barrel
68CATH - Homology
Dihydropteroate (DHP) synthetase
FMN dependent fluorescent proteins
7-stranded glycosidases
69CATH - Entry
(Example)
70IV Open Issues
- I Integration into processes, scriptable APIs
- II Sequence based identification of domains
- III Analysing domains in context
- IV Defining modular domain functions
71Bioinformaticians apparently do not like
structure !
- Sequence
- Discrete alphabet
- Easy to manipulate
- Well developed datastructures
- Well developed libraries
- Structure
- Continuous space
- Linear algebra, complicated energy functions
- Databases and datastructures are difficult
- Paucity of libraries
Meet the challenge !