Title: Protein Structure
1Protein Structure
- Ingo Ruczinski
- Department of Biostatistics, Johns Hopkins University
2Structural Proteins
3Membrane Proteins
4Globular Proteins
5Terminology
- Primary Structure
- Secondary Structure
- Tertiary Structure
- Quaternary Structure
- Supersecondary Structure
- Domain
- Fold
6Hierarchy of Protein Structure
7Helices
                    α-helix    π-helix    3₁₀-helix
Amino acids/turn    3.6        4.4        3.0
Frequency           97%        rare       3%
H-bonding           i, i+4     i, i+5     i, i+3
8α-helices
α-helices have handedness
α-helices have a dipole
9β-sheets
10β-sheets
Have a right-handed twist!
11β-sheets
Can form higher level structures!
12Super Secondary Structure Motifs
13What is a Domain?
- Richardson (1981):
- Within a single subunit polypeptide chain, contiguous portions of the polypeptide chain frequently fold into compact, local, semi-independent units called domains.
14More About Domains
- Independent folding units.
- Many contacts within a domain, few with the outside.
- Domains create their own hydrophobic core.
- Regions usually conserved during recombination.
- Different domains of the same protein can have different functions.
- Domains of the same protein may or may not interact.
15Two Very Small Domains
16Why Look for Domains?
Domains are the currency of protein function!
17Homology and Analogy
- Homology: similarity in characteristics resulting from shared ancestry.
- Analogy: similarity of structure between two species that are not closely related, attributable to convergent evolution.
Homologous structures can be divided into orthologues (resulting from changes in the same gene between different organisms, such as myoglobin) and paralogues (resulting from gene duplication and subsequent changes within an organism and its descendants, such as hemoglobin).
19The CATH Hierarchy
21DALI: Distance Matrix Alignment
- DALI generates alignments of structural fragments, and is able to find alignments involving chain reversals and different topologies.
- The algorithm uses distance matrices to represent each structure to be compared.
- Application of DALI to the entire PDB produces two classifications of structures: FSSP and DDD (3D).
Holm L and Sander C (1993)
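To make the distance-matrix representation concrete, here is a minimal sketch that builds the matrix of pairwise distances from a list of 3D coordinates (for example, one Cα position per residue). The function name `distance_matrix` and the toy coordinates are assumptions made for illustration. This is only the representation DALI compares, not the DALI alignment algorithm itself.

```python
import math

def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    dm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            dm[i][j] = dm[j][i] = d
    return dm

# Toy coordinates (hypothetical, not taken from a real structure).
toy_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
dm = distance_matrix(toy_ca)
```

Because the matrix depends only on distances between residues, it does not change when the structure is rotated or translated, which is what makes it convenient for comparing two structures without first superimposing them.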
22DALI
23DALI
24FSSP and DDD
- The families of structurally similar proteins (FSSP) is a database of structural alignments of proteins in the Protein Data Bank (PDB). It presents the results of applying DALI to (almost) all chains of proteins in the PDB.
- The DALI domain dictionary (DDD) is a corresponding classification of recurrent domains automatically extracted from known proteins.
25Other Algorithms for Domain Decomposition
- The Protein Domain Parser (PDP) uses compactness as a chief principle.
- http://123d.ncifcrf.gov/pdp.html
- DomainParser is graph theory based. The underlying principle used is that residue-residue contacts are denser within a domain than between domains.
- http://compbio.ornl.gov/structure/domainparser/
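The contact-density principle can be illustrated with a small sketch that counts residue-residue contacts within and between the parts of a proposed domain assignment. The function name `count_contacts`, the 8 Angstrom cutoff, and the toy coordinates are all assumptions for illustration. Neither PDP nor DomainParser is claimed to work this way internally.

```python
import math

def count_contacts(coords, labels, cutoff=8.0):
    """Count residue-residue contacts within and between domains.

    coords: one (x, y, z) position per residue.
    labels: proposed domain label per residue.
    cutoff: contact distance in Angstrom (an assumed value).
    """
    within = between = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                if labels[i] == labels[j]:
                    within += 1
                else:
                    between += 1
    return within, between

# Two hypothetical clusters of three residues each, 30 Angstrom apart.
coords = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (30, 0, 0), (34, 0, 0), (30, 4, 0)]
good = count_contacts(coords, ["A", "A", "A", "B", "B", "B"])
bad = count_contacts(coords, ["A", "B", "A", "B", "A", "B"])
```

A sensible decomposition (`good`) puts every contact inside a domain, while a poor one (`bad`) cuts through many contacts. A decomposition method can search for the assignment that keeps the between-domain count low.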
26Oh Dear
27Parsing Sequence into Domains
- Look for internal duplication.
- Look for low complexity segments.
- Look for transmembrane segments.
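As one possible way to look for low complexity segments, the sketch below computes the Shannon entropy of the residue composition in a sliding window and flags windows with low entropy. The window length of 12 and the threshold of 2.2 bits are illustrative assumptions, as are the function names. Published low-complexity filters are more elaborate.

```python
import math
from collections import Counter

def window_entropy(seq, window=12):
    """Shannon entropy (bits) of residue composition in each sliding window."""
    entropies = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        h = -sum((c / window) * math.log2(c / window) for c in counts.values())
        entropies.append(h)
    return entropies

def low_complexity_windows(seq, window=12, threshold=2.2):
    """Start positions of windows whose entropy falls below the threshold."""
    return [i for i, h in enumerate(window_entropy(seq, window)) if h < threshold]

# Hypothetical sequence: a poly-Q run followed by a more varied stretch.
seq = "QQQQQQQQQQQQ" + "GHWIATRGQLIR"
flagged = low_complexity_windows(seq)
```

A window made of a single repeated residue has entropy 0, while a window with many different residues approaches the maximum of log2(20) bits.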
28Why is That Important?
- Functional insights.
- Improved database searching.
- Fold recognition.
- Structure determination.
PRODOM http://protein.toulouse.inra.fr/prodom/current/html/home.php
PFAM http://www.sanger.ac.uk/Software/Pfam/
29Protein Structure Prediction
30Homology Modeling
31Fold Recognition
Sequence Known folds
32Ab Initio Structure Prediction
33Homology Modeling
- Align sequence to protein sequences with known structure.
- Construct and evaluate model of 3D structure from alignment.
- Requirement: close match to template sequences with known 3D structure (sequence similarity of at least 25%).
Note: about 25% of the protein sequences in the Swiss-Prot database have templates for at least part of the sequence!
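To check a target-template pair against a threshold like the one above, one simple measure is percent identity over the aligned columns. The sketch below assumes the two sequences are already aligned and use "-" for gaps. The function name `percent_identity` and the choice to ignore gapped columns are assumptions. Note that identity is only one convention, and similarity scores based on substitution matrices will give different numbers.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (one of several possible conventions)
        compared += 1
        matches += (a == b)
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical toy alignment.
identity = percent_identity("GHWIA-TR", "GHYIAQTK")
usable_template = identity >= 25.0
```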
34Threshold for Structural Homology
Rost B, Protein Engineering 12 (1999).
35Homology Modeling Approach
- Find set of sequences related to target sequence.
- Align target sequence to template sequences (key step).
- Construct 3D model for core (backbone)
- - Conserved regions → conserved structure / coordinates.
- - Structure diverges → use sequence similarity, secondary structure prediction, manual prediction, etc. to fill in gaps.
- Construct 3D models for loops
- - Search loop conformation library, limited protein folding.
- Model location of side chains
- - Search rotamer library, use molecular dynamics.
- Optimize / verify the model
- - Improve likelihood / ensure legality of model.
36Homology Modeling Web Pages
- MODELLER
- http://salilab.org/modeller/modeller.html
- SWISS-MODEL
- http://www.expasy.org/swissmod/SWISS-MODEL.html
37Quality Assessment
- Goal
- - Ensure predicted 3D structure is possible / probable in practice
- - Based on general knowledge of protein structures
- Criteria
- - Carbon backbone conformations allowed (Ramachandran map)
- - Legal bond lengths, angles, dihedrals
- - Peptide bonds are planar
- - Side chain conformations correspond to ones in rotamer library
- - Hydrogen-bonding of polar atoms if buried
- - Proper environments for hydrophobic / hydrophilic residues
- - No bad atom-atom contacts
- - No holes inside 3D structure
- - Solvent accessibility
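Several of the criteria above (Ramachandran map, legal dihedrals, planar peptide bonds) come down to computing a dihedral angle from four atom positions. The sketch below is one standard way to do that with an atan2 formula. For a residue i, φ uses the atoms C(i-1), N(i), Cα(i), C(i), ψ uses N(i), Cα(i), C(i), N(i+1), and ω uses Cα(i), C(i), N(i+1), Cα(i+1). The helper names and the 30 degree planarity tolerance are assumptions for illustration, not values taken from any of the quality assessment programs.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def is_planar_peptide(omega, tolerance=30.0):
    """True if omega is within `tolerance` degrees of trans (180) or cis (0)."""
    dev_trans = 180.0 - abs(omega)
    dev_cis = abs(omega)
    return min(dev_trans, dev_cis) <= tolerance

# Toy geometry: the last point is rotated about the z axis (the p1-p2 bond).
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

Plotting the (φ, ψ) pair of every residue gives the Ramachandran map. Which regions of that map count as allowed is left out here because it depends on the reference data used by the checking program.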
38Quality Assessment Programs
- VERIFY3D
- http://shannon.mbi.ucla.edu/DOE/Services/Verify_3D
- PROCHECK
- http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
- WHATIF
- http://www.cmbi.kun.nl/whatif/
39Fold Recognition
- The input sequence is threaded on different folds from a library of known folds.
- Using scoring functions, we get a score for the compatibility between the sequence and the structures.
Amino acids with different chemical properties
Library of known folds
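A heavily simplified sketch of the threading idea follows: each fold in the library is described by one structural environment per position, and a sequence is scored by summing a compatibility score for each residue-environment pair. Everything here is hypothetical, including the two-class residue grouping, the score table, the fold names, and the gapless alignment. Real fold recognition methods use much richer environments, statistically derived potentials, and alignments with gaps.

```python
# Hypothetical compatibility scores: residue class vs. structural environment.
SCORE = {
    ("hydrophobic", "buried"): 1.0,
    ("hydrophobic", "exposed"): -1.0,
    ("polar", "buried"): -1.0,
    ("polar", "exposed"): 1.0,
}
HYDROPHOBIC = set("AVLIMFWC")  # simplified, assumed classification

def residue_class(aa):
    return "hydrophobic" if aa in HYDROPHOBIC else "polar"

def thread_score(seq, environments):
    """Sum of per-position scores for a gapless threading of seq onto a fold."""
    if len(seq) != len(environments):
        raise ValueError("sequence and fold must have equal length")
    return sum(SCORE[(residue_class(a), e)] for a, e in zip(seq, environments))

def best_fold(seq, library):
    """Return the name of the library fold with the highest score."""
    return max(library, key=lambda name: thread_score(seq, library[name]))

# Hypothetical two-fold library.
library = {
    "fold_1": ["buried", "exposed", "buried", "exposed"],
    "fold_2": ["exposed", "exposed", "buried", "buried"],
}
winner = best_fold("LKVE", library)
```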
40Fold Recognition
Hydrogen donor
Hydrogen acceptor
Hydrophobic
Glycine
Good score!
41Web Sites for Fold Recognition
3D-PSSM http://www.bmm.icnet.uk/3dpssm
LIBRA I http://www.ddbj.nig.ac.jp/htmls/Email/libra/LIBRA_I.html
UCLA DOE http://www.doe-mbi.ucla.edu/people/frsvr/frsvr.html
123D http://www-Immb.ncifcrf.gov/nicka/123D.html
PROFIT http://lore.came.sbg.ac.at/home.html
42Ab Initio Methods
- Ab initio: from the beginning.
- Assumption 1: All the information about the structure of a protein is contained in its sequence of amino acids.
- Assumption 2: The structure that a (globular) protein folds into is the structure with the lowest free energy.
- Finding native-like conformations requires
- - A scoring function (potential).
- - A search strategy.
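The two ingredients, a scoring function and a search strategy, can be sketched with a generic Metropolis Monte Carlo loop. This is a toy under stated assumptions: the "conformation" is just a list of angles, `toy_score` is an invented quadratic function with no physical meaning, and `toy_move` perturbs one angle at random. It is not the scoring function or move set of any particular ab initio method.

```python
import math
import random

def metropolis_search(score, start, propose, steps=2000, temperature=1.0, seed=0):
    """Generic Metropolis Monte Carlo; returns the lowest-scoring state visited."""
    rng = random.Random(seed)
    current, current_e = start, score(start)
    best, best_e = current, current_e
    for _ in range(steps):
        candidate = propose(current, rng)
        cand_e = score(candidate)
        delta = cand_e - current_e
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = candidate, cand_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e

def toy_score(angles):
    """Invented score with its minimum when every angle equals -60."""
    return sum((a + 60.0) ** 2 for a in angles) / 100.0

def toy_move(angles, rng):
    """Perturb one randomly chosen angle by up to 10 degrees."""
    new = list(angles)
    i = rng.randrange(len(new))
    new[i] += rng.uniform(-10.0, 10.0)
    return new

start = [180.0] * 5
best, best_e = metropolis_search(toy_score, start, toy_move)
```

Accepting some moves that make the score worse is what lets the search escape local minima. Lowering `temperature` makes the search greedier.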
43Fragment Selection
44Hydrophobic Burial
45Residue Pair Interaction
46The Sequence Independent Term
47Strand Packing Helps!
Estimated φ-θ distribution
53Rosetta in CASP4
54CASP
55Hubbard Plot
56Hubbard Plots
58Functional Annotation
60Protein Design
61Protein Design
62Protein Secondary Structure Prediction
64Secondary Structure Assignment
- Eight states from DSSP
- - H: α-helix
- - G: 3₁₀-helix
- - I: π-helix
- - E: β-strand
- - B: bridge
- - T: β-turn
- - S: bend
- - C: coil
- CASP standard
- - H = (H, G, I), E = (E, B), C = (C, T, S).
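The reduction from eight states to three is a simple lookup. A minimal sketch follows, with `DSSP_TO_3` and `reduce_dssp` as hypothetical names and the mapping taken directly from the CASP standard listed above.

```python
# Eight-state to three-state mapping, as given by the CASP standard above.
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "C": "C", "T": "C", "S": "C",
}

def reduce_dssp(states):
    """Map a string of eight-state codes to the three states H, E, C."""
    return "".join(DSSP_TO_3[s] for s in states)

reduced = reduce_dssp("HGIEBTSC")
```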
65Secondary Structure Prediction
Given the sequence of amino acids of a protein,
what is its secondary structure?
GHWIATRGQLIREAYEDYRHFSSECPFIP
Primary structure
CEEEEECHHHHHHHHHHHCCCHHCCCCCC
Secondary structure
Notation: H = helix, E = strand, C = coil
66Secondary Structure Prediction
Helix
Edge strand
Buried strand
By eye!
67Conformational Preferences of Amino Acids
Helical Preference.
Strand Preference.
Turn Preference.
68Conformational Preferences of Amino Acids
Extended flexible side chains.
Bulky side chains, beta-branched.
Restricted conformations, side chain-main chain interactions.
69Secondary Structure Prediction
Given the sequence of amino acids of a protein,
what is its secondary structure?
GHWIATRGQLIREAYEDYRHFSSECPFIP
Primary structure
CEEEEECHHHHHHHHHHHCCCHHCCCCCC
Secondary structure
Notation: H = helix, E = strand, C = coil
70Measures for Prediction Accuracy
The standard measure for prediction accuracy is (still) the Q3 measure. It is simply the percentage of all amino acids whose predicted state matches the observed state, over the three states C, E, H. In recent years, the segment overlap measure (SOV) has been used more extensively. It aims to measure how well secondary structure elements, rather than individual residues, have been predicted.
Rost et al. (1994), JMB 235, pp 13-26.
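Q3 as described above is straightforward to compute from two equal-length strings of three-state labels. In the sketch below, the function name `q3` is an assumption, the observed string is the example from the secondary structure prediction slide, and the predicted string is invented for illustration.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state (H, E, C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

observed = "CEEEEECHHHHHHHHHHHCCCHHCCCCCC"
predicted = "CCEEEECHHHHHHHHHHCCCCCCCCCCCC"  # hypothetical prediction
accuracy = q3(predicted, observed)
```

Because Q3 scores each residue independently, a prediction that misses a short helix entirely, as the invented one does here, can still score well. This is the gap the SOV measure is meant to address.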
71Automated Methods
The availability of large families of homologous sequences together with advances in computing techniques has pushed the prediction accuracy well above 70%. Most methods are available as web servers. They include:
PHD http://www.embl-heidelberg.de/predictprotein/predictprotein.html
PSI-PRED http://bioinf.cs.ucl.ac.uk/psipred/
JPRED http://www.compbio.dundee.ac.uk/www-jpred/
72Consensus