Title: Protein databases and structures
1Lecture 3 Protein databases and structures
http://www.science.org.au/sats2004/images/mackay2.jpg
2Course Outline
3Protein Principles
- Proteins reflect millions of years of evolution.
- Most proteins belong to large evolutionary families.
- 3D structure is better conserved than sequence during evolution.
- Similarities between sequences or between structures may reveal information about the shared biological functions of a protein family.
4http://expasy.org/sprot/
5How can we determine the function of an uncharacterized protein sequence?
MGENDPPAVEAPFSFRSLFGLDDLKISPVAPDADAVAAQILSLLPLKFFP
IIVIGIIALILALAIGLGIHFDCSGKYRCRSSFKCIELIARCDGVSDCKD
GEDEYRCVRVGGQNAVLQVFTAASWKTMCSDDWKGHYANVACAQLGFPSY
VSSDNLRVSSLEGQFREEFVSIDHLLPDDKVTALHHSVYVREGCASGHVV
TLQCTACGHRRGYSSRIVGGNMSLLSQWPWQASLQFQGYHLCGGSVITPL
WIITAAHCVYDLYLPKSWTIQVGLVSLLDNPAPSHLVEKIVYHSKYKPKR
LGNDIALMKLAGPLTFNEMIQPVCLPNSEENFPDGKVCWTSGWGATEDGA
GDASPVLNHAAVPLISNKICNHRDVYGGIISPSMLCAGYLTGGVDSCQGD
SGGPLVCQERRLWKLVGATSFGIGCAEVNKPGVYTRVTSFLDWIHEQMER
DLKT
- Find homologues
- Predict conserved domains
- Predict structure
- Other
6Higher Level Structures: Motifs and Domains
A motif is a simple combination of a few secondary structure elements that appears in several different proteins in nature. A collection of motifs forms a domain.
A domain is a more complex combination of secondary structures. It has a very specific function (for example, it contains an active site). A protein may contain more than one domain.
7Grouping of Secondary Structure Elements: Super-secondary Structures or Motifs
- βαβ
- αα
- β-hairpin
- β-barrels
http://www.expasy.org/swissmod/course/text/chapter4.htm
8Motif Searching with PHI-BLAST (Pattern Hit Initiated BLAST)
Searches for particular patterns in protein queries.
Input: a protein query sequence and a pattern contained in that sequence.
From the BLASTP page, choose PHI-BLAST:
http://www.ncbi.nlm.nih.gov/BLAST/
http://www.ebi.ac.uk/blastpgp/
9PROSITE Patterns
- Consensus sequences and patterns are regular expressions that can be used like fingerprints.
E.g. PROSITE pattern PS00001 (N-glycosylation site): N-{P}-[ST]-{P}
MGENDPPAVEAPFSFRSLFGLDDLKISPVAPDADAVAAQILSLLPLKFFP
IIVIGIIALILALAIGLGIHFDCSGKYRCRSSFKCIELIARCDGVSDCKD
GEDEYRCVRVGGQNAVLQVFTAASWKTMCSDDWKGHYANVACAQLGFPSY
VSSDNLRVSSLEGQFREEFVSIDHLLPDDKVTALHHSVYVREGCASGHVV
TLQCTACGHRRGYSSRIVGGNMSLLSQWPWQASLQFQGYHLCGGSVITPL
WIITAAHCVYDLYLPKSWTIQVGLVSLLDNPAPSHLVEKIVYHSKYKPKR
LGNDIALMKLAGPLTFNEMIQPVCLPNSEENFPDGKVCWTSGWGATEDGA
GDASPVLNHAAVPLISNKICNHRDVYGGIISPSMLCAGYLTGGVDSCQGD
SGGPLVCQERRLWKLVGATSFGIGCAEVNKPGVYTRVTSFLDWIHEQMER
DLKT
10Example Calmodulin-Binding Motif
(calcium-binding proteins)
11http://www.expasy.ch/prosite/
PROSITE helps determine the function of an uncharacterized protein and the known protein family to which it belongs.
A pattern describes a group of amino acids that constitutes a usually short but characteristic motif within a protein sequence.
For example: the pattern [AC]-x-V-x(4)-{ED}. is interpreted as [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}.
Note: Search by full text.
12PROSITE Syntax
For example: the pattern [AC]-x-V-x(4)-{ED}. is interpreted as [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}.
- The standard one-letter code for amino acids is used.
- 'x': any amino acid.
- '[ ]': residues allowed at the position.
- '{ }': residues forbidden at the position.
- '( )': repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions.
- '-': separates each pattern element.
- '<': indicates an N-terminal restriction of the pattern.
- '>': indicates a C-terminal restriction of the pattern.
- '.': the period ends the pattern.
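Because PROSITE patterns are regular expressions in a different notation, the syntax rules above can be translated mechanically. The following is a minimal sketch in Python, not the ScanProsite implementation: the function names `prosite_to_regex` and `scan` are made up for illustration, and only the elements listed on this slide are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")      # the period ends the pattern
    parts = []
    for element in pattern.split("-"):         # '-' separates pattern elements
        element = element.strip()
        if not element:
            continue
        prefix = suffix = ""
        if element.startswith("<"):            # N-terminal restriction
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # C-terminal restriction
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (4) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."                         # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # forbidden residues
        # '[...]' and single letters are already valid regex syntax.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern: str, sequence: str):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Fifth line of the query sequence shown on the earlier slides.
FRAGMENT = "TLQCTACGHRRGYSSRIVGGNMSLLSQWPWQASLQFQGYHLCGGSVITPL"

if __name__ == "__main__":
    print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))
    print(scan("N-{P}-[ST]-{P}", FRAGMENT))
```

The lookahead in `scan` is a design choice so that overlapping hits are all reported; a plain `finditer` would skip a hit that starts inside the previous one.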
13PROSITE Scan on Expasy
http://www.expasy.org/tools/scanprosite/
14http://www.ebi.ac.uk/interpro/
http://www.ebi.ac.uk/InterProScan/
15http://pfam.sanger.ac.uk/
16http://eblocks.stanford.edu/eblocks/kwsearch.html
Logos: select display format (GIF, PDF, PostScript)
17http://weblogo.berkeley.edu//logo.cgi
18Protein Structures
- Primary: amino acid sequence.
- Secondary: alpha helices, beta sheets, loops.
- Tertiary: arrangement of secondary elements in 3D space.
- Quaternary: packing of several polypeptide chains.
Given an amino acid sequence, we are interested in its secondary structures and in how they are arranged into higher-level structures.
19Protein Structures
- Proteins are fundamental components of all living cells.
- The critical feature of a protein is its ability to adopt the right shape for carrying out a particular function.
- Identifying a protein's shape (structure) is key to understanding its biological function and its role in health and disease, and to finding the right cure.
- Amino acid chains can fold in a variety of ways. Only one of these folds allows a protein to function properly.
20How do Proteins Acquire Correct Conformation?
- The primary amino acid sequence is crucial in determining the final structure.
- In some cases, additional interactions may be required before a protein can attain its final conformation (for example, cofactors or one or more subunits).
- Proteins can change their shape and function depending on the environmental conditions in which they are found. The primary amino acid sequence does not change.
21A Major Challenge of Bioinformatics
The challenge: understand the relationship between amino acid sequence and the 3D structure of proteins, and predict 3D structure from sequence.
Unfortunately, the relationship between sequence and structure is very complicated. Current tools perform this task poorly.
Best performance (so far) is achieved using sequence homology to a known 3D structure that was determined experimentally (by X-ray crystallography or NMR).
22The Structural Prediction Problem
Given a protein sequence, compute its structure.
- Possible in principle.
- Astronomical, highly under-constrained search space.
- Biophysics is complex and incompletely understood.
- Next to impossible in practice.
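To get a feel for "astronomical", here is a back-of-the-envelope sketch. The numbers are assumptions chosen for illustration and do not come from the slides: 3 backbone conformations per residue and a hypothetical sampling rate of 10^13 conformations per second.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def conformations(n_residues: int, states_per_residue: int = 3) -> int:
    """Number of chain conformations if every residue picks a state independently."""
    return states_per_residue ** n_residues


def exhaustive_search_years(n_residues: int,
                            states_per_residue: int = 3,
                            rate_per_second: float = 1e13) -> float:
    """Years needed to enumerate every conformation at the assumed sampling rate."""
    seconds = conformations(n_residues, states_per_residue) / rate_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, conformations(n), f"{exhaustive_search_years(n):.3g} years")
```

Even with these modest assumptions, the count grows exponentially with chain length, which is why exhaustive enumeration is not a practical strategy.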
23How is the 3D structure determined?
1. Experimental methods (best approach)
- X-ray crystallography: requires a stable fold and good-quality crystals.
- NMR: requires a stable fold; not suitable for large molecules.
2. In-silico methods (partial solutions, based on similarity)
- Sequence or profile alignment: uses similar sequences, limited use of 3D information.
- Threading: needs a 3D structure, combinatorial complexity.
- Ab-initio structure prediction: not always successful.
http://www.idi.ntnu.no/grupper/KS-grp/microarray/slides/drablos/Fold_recognition/sld004.htm
24Predicting Protein Structure
Principle: look for the structure with minimum free energy.
Rule of thumb: hydrophobic amino acids tend to stay inside (conserved), while hydrophilic amino acids tend to be outside (less conserved), assuming water as the universal solvent in cells.
The main driving force for folding is to pack hydrophobic side chains into the interior of the molecule, thus creating a hydrophobic core.
Factors other than free energy: shape, size, polarity, strength of interactions, etc.
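The inside/outside rule of thumb can be made concrete with a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale, which the slides do not name; it is swapped in here as one common choice. The window length of 9 is likewise an assumption.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9):
    """Average hydropathy of every window of the given length along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# Second line of the query sequence shown on the earlier slides.
FRAGMENT = "IIVIGIIALILALAIGLGIHFDCSGKYRCRSSFKCIELIARCDGVSDCKD"

if __name__ == "__main__":
    profile = hydropathy_profile(FRAGMENT)
    for start, value in enumerate(profile, start=1):
        print(f"{start:3d} {FRAGMENT[start - 1:start + 8]} {value:6.2f}")
```

High averages suggest stretches that are likely buried (or membrane-embedded), low averages suggest exposed stretches. This is only a heuristic reading of the rule of thumb, not a structure prediction.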
25Conformation of Polypeptides
The Advent of Computational Modeling
Aim: develop procedures for predicting protein structure that are not so time-consuming and that are not hindered by size and solubility constraints.
Basic theory: proteins that share a similar sequence generally share the same basic structure. There is strong conservation of protein 3D shape across large evolutionary distances.
26Three Main Approaches for Structural Prediction
- Comparative (homology) modeling: requires a sequence that is similar to the sequence(s) of one or more proteins of known structure.
- Fold recognition (threading): requires a structure similar to a known structure (with little sequence similarity).
Both of the above are based on similarity.
- Ab-initio (based only on sequence): used when there is no similarity; based on first principles.
Example: a pathway for folding a 2-domain protein.
271. Comparative (Homology) Modeling
Principle: sequence homology usually implies 3D structural similarity.
Given a protein sequence, look for homologous sequences with a known structure. Suppose the structure of one or more homologues has already been determined. Then the structure of our original protein will be similar (high sequence identity, > 70%, is necessary).
Remark: the success of this approach depends on the number of different structures already determined (low success early on, improving as the PDB grows).
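The sequence-identity criterion can be checked directly on a pairwise alignment. This is a minimal sketch assuming the two sequences are already aligned (equal length, '-' for gaps) and that identity is counted over gap-free columns only; other definitions of the denominator exist. The function names are illustrative, and the 70% default simply mirrors the figure quoted on the slide.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def usable_template(aligned_query: str, aligned_template: str,
                    threshold: float = 70.0) -> bool:
    """True if the template passes the identity threshold used in the lecture."""
    return percent_identity(aligned_query, aligned_template) > threshold


if __name__ == "__main__":
    query = "IVGGNMSLLSQWPW"      # taken from the lecture's query sequence
    template = "IVGGNDSLL-QWPW"   # made-up template row for illustration
    print(percent_identity(query, template))
    print(usable_template(query, template))
```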
282. Protein Fold Recognition: Classifying Proteins by Folds
Goal: map regions of linear sequence to known folds in the PDB.
Fold: a collection of proteins that share a similar combination of secondary structures.
In human: the estimated number of proteins is 100,000; 700 folds have been discovered so far.
Nature has created complexity through the combination of a small number of simple elements, such as secondary structures.
29Fold Recognition
Fold recognition: given a sequence and a library of folds, thread the sequence through each fold. Take the one with the highest score.
Note: the method will fail if the new protein does not belong to any fold in the library. Experience shows that with the current library (700 folds) most new proteins do find a good fold.
The score of the threading is computed based on known physical chemistry properties and statistics of amino acids.
http://cmgm.stanford.edu/biochem218/16Threading.pdf
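The "thread through each fold, take the highest score" loop can be illustrated with a deliberately simplified toy. Everything here is an assumption made for the sketch: each fold is reduced to a string of buried ('b') and exposed ('e') positions, placements are gapless, and the score only rewards hydrophobic residues at buried sites and polar residues at exposed sites. Real threading programs use gapped alignments and statistically derived potentials.

```python
HYDROPHOBIC = set("AVLIMFWC")  # crude hydrophobic/polar split, for illustration only


def placement_score(segment: str, burial: str) -> int:
    """+1 for each position where residue type agrees with burial state, else -1."""
    score = 0
    for aa, site in zip(segment, burial):
        agrees = (aa in HYDROPHOBIC) == (site == "b")
        score += 1 if agrees else -1
    return score


def thread(sequence: str, burial: str):
    """Best score over all gapless placements of the fold along the sequence."""
    n = len(burial)
    if len(sequence) < n:
        return None  # the sequence is too short to cover this fold
    return max(placement_score(sequence[i:i + n], burial)
               for i in range(len(sequence) - n + 1))


def recognize_fold(sequence: str, library: dict):
    """Thread the sequence through every fold and return (best fold, score)."""
    scores = {name: thread(sequence, burial) for name, burial in library.items()}
    scores = {name: s for name, s in scores.items() if s is not None}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Hypothetical two-entry fold library with equal-length burial patterns.
    library = {"alternating": "bebebe", "buried_core": "bbbbbb"}
    print(recognize_fold("LKVEIS", library))
```

Scores are only comparable between folds of equal length in this toy; a real method would normalize them, for instance against scores of shuffled sequences.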
30Fold Recognition - Threading
Thick backbone: known structure. Thin lines: modeled structure. Some side chains are not positioned correctly, but some look good.
The similarity of structures is very high in core regions (helices and sheets). However, loops vary even in pairs of homologous structures with a high percentage of sequence similarity.
31http://www.pdg.cnb.uam.es/cursos/FVi2001DIA1/
32- Used when all else fails:
1. No homology found to any sequence with known structure.
2. All known folds give poor threading scores.
- Given only the sequence, try to predict the structure based on physical-chemistry properties (energy, hydrophobicity, size, charge, etc.).
- Some ab-initio programs try to simulate the process of protein folding in the cell (by molecular dynamics).
33Ab-Initio Prediction
- A good prediction method for 2D or 3D structures, but only for small, simple proteins.
- The method requires enormous computational resources.
- Despite substantial improvements, success is still very limited.
34http://isw3.aist-nara.ac.jp/IS/Bio-Info-Unit/gogroup/study-ja.html
35PDB Content Growth
As of Tuesday, Sep 04, 2007
http://www.rcsb.org/pdb/
36PDB - Databank of Protein Structures
Example: PDB code 1E4I (glucosidase)
http://www.rcsb.org/pdb/
PDB tutorial: http://www.rcsb.org/pdbstatic/tutorials/tutorial.html
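A PDB entry such as 1E4I is distributed as a text file in which each atom occupies one fixed-column ATOM record. The sketch below shows how the coordinates could be read with plain string slicing. The helper `make_atom_line` and the sample coordinates are made up for the demonstration and are not taken from 1E4I; for real work a dedicated parser is the safer choice.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-column ATOM record (hypothetical helper, demo data only)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line: str) -> dict:
    """Extract the main fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "atom": line[12:16].strip(),      # columns 13-16
        "residue": line[17:20].strip(),   # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


def ca_atoms(pdb_text: str):
    """Return the parsed alpha-carbon (CA) atoms of all ATOM records."""
    return [parse_atom_line(line) for line in pdb_text.splitlines()
            if line.startswith("ATOM") and line[12:16].strip() == "CA"]


if __name__ == "__main__":
    # Made-up coordinates, used only to exercise the parser.
    text = "\n".join([
        make_atom_line(1, " N  ", "ALA", "A", 12, 0.0, 1.0, 2.0),
        make_atom_line(2, " CA ", "ALA", "A", 12, 1.5, -2.25, 10.0),
    ])
    for atom in ca_atoms(text):
        print(atom)
```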
37http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
Choose the protein button for many links:
- Sequence conservation
- Active sites
- Residues interacting with ligand
38Ligands button, Cleft button
LIGPLOT of interactions involving ligand
39Viewing vs. Predicting
There are many programs for visualizing the 3D structure of a protein. They typically offer different viewpoints, each revealing a different aspect of the 3D structure. All these viewing programs (e.g. Rasmol or Cn3D) receive as input the 3D coordinates of the structure (e.g. PDB or MMDB entries). While they do a great visualization job, these programs have nothing to do with predicting the structure.
40(EMBL)
http://www.ebi.ac.uk/rost/predictprotein/submit_def.html
PP Help: http://www.predictprotein.org/docs.php
41What is PredictProtein (PP)?
PP is an automatic service for protein database searches and the prediction of aspects of protein structure. You send an amino acid sequence and PP returns:
1. Multiple sequence alignment (i.e. database search).
2. ProSite sequence motifs.
3. Low-complexity regions.
4. ProDom domain assignments.
5. Nuclear localization signals.
6. Predictions of:
   1. secondary structure (PHDsec).
   2. solvent accessibility (PHDacc).
   3. transmembrane helices.
   4. coiled-coil regions.
42PredictProtein (PP) - Results
PHD secondary structure prediction
PHD is a suite of programs predicting structure (secondary structure, solvent accessibility) from multiple sequence alignments.
PHD: Profile fed neural network systems from HeiDelberg.
PHD_sec: PHD predicted secondary structure. H = helix, E = extended (sheet), blank = other (loop).
43PredictProtein (PP) - Results cont.
AA: amino acids.
Rel_sec: reliability index for PHD_sec prediction (0 = low to 9 = high). Note: strong predictions are marked by '*'.
PHD_sec: PHD predicted secondary structure. H = helix, E = extended (sheet), blank = other (loop).
44PredictProtein (PROF predictions)
PROF sec: predicted secondary structure. H = helix, E = extended (sheet), blank = other (loop).
Rel sec: reliability index for PROFsec prediction (0 = low to 9 = high).
Solvent accessibility, by PHDacc. Relative accessibility: b = buried, i = intermediate, e = exposed.
45PredictProtein (PROF predictions)
pH_sec: 'probability' for assigning helix (1 = high, 0 = low).
pE_sec: 'probability' for assigning strand (1 = high, 0 = low).
pL_sec: 'probability' for assigning neither helix nor strand (1 = high, 0 = low).
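One way to read the three per-residue 'probabilities' is to assign each residue the state with the highest value and then collapse the result into segments. The sketch below illustrates that reading only; it is not how PredictProtein itself post-processes its output. The function names and input numbers are made up, and loops are written as 'L' instead of a blank so that they stay visible.

```python
from itertools import groupby


def assign_states(p_helix, p_strand, p_loop) -> str:
    """Pick the most probable state per residue (ties resolved as H, then E, then L)."""
    states = []
    for ph, pe, pl in zip(p_helix, p_strand, p_loop):
        label, _ = max((("H", ph), ("E", pe), ("L", pl)),
                       key=lambda item: item[1])
        states.append(label)
    return "".join(states)


def segments(states: str):
    """Collapse a state string into (state, first residue, last residue), 1-based."""
    result = []
    position = 1
    for label, run in groupby(states):
        length = len(list(run))
        result.append((label, position, position + length - 1))
        position += length
    return result


if __name__ == "__main__":
    # Made-up per-residue values for a four-residue stretch.
    p_h = [0.8, 0.7, 0.1, 0.1]
    p_e = [0.1, 0.2, 0.2, 0.8]
    p_l = [0.1, 0.1, 0.7, 0.1]
    predicted = assign_states(p_h, p_e, p_l)
    print(predicted)
    print(segments(predicted))
```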
46- PHD prediction of:
- Secondary structure by PHDsec.
- Solvent accessibility by PHDacc.
- Helical transmembrane regions by PHDhtm.
PHD htm: PHD predicted membrane helix. M = helical transmembrane region, blank = non-membrane.
PHD thtm: refined PHD prediction.
PiMohtm: PHD prediction of membrane topology. M = helical transmembrane region, i = inside of membrane, o = outside of membrane.
47http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
Note: use a non-commercial e-mail address.
48- Results
At the bottom of the prediction, choose the PDF view of the PSI-PRED results.
49Choose PHDsec
http://searchlauncher.bcm.tmc.edu/seq-search/struc-predict.html
50ConSeq Server: http://conseq.bioinfo.tau.ac.il/
Identification of functionally and structurally important residues in protein sequences.
Quick help: http://conseq.bioinfo.tau.ac.il/quick_help.htm
http://conseq.bioinfo.tau.ac.il/overviewnextversion.htm
51ConSurf Server: http://consurf.tau.ac.il/
1e4i chain A
Final Results: View ConSurf Results with FirstGlance in Jmol
52Bioinformatics Workshop: Modeling Protein Structures Using Nest
http://www.tau.ac.il/lifesci/bioinfo/teaching/2007-2008/Modeling_workshop.htm
53Expasy Tools: http://www.expasy.ch/tools/
Post-translational modification predictions
- SignalP: http://www.cbs.dtu.dk/services/SignalP/
  Output format: http://www.cbs.dtu.dk/services/SignalP-3.0/output.php
- The Sulfinator predicts tyrosine sulfation sites in protein sequences: http://www.expasy.ch/tools/sulfinator/
  Output format: http://www.expasy.ch/tools/sulfinator/sulfinator-doc.html
- SulfoSite: http://sulfosite.mbc.nctu.edu.tw/
- The NetNGlyc server predicts N-glycosylation sites in human proteins using artificial neural networks that examine the sequence context of Asn-Xaa-Ser/Thr sequons: http://www.cbs.dtu.dk/services/NetNGlyc/
  Output format: http://www.cbs.dtu.dk/services/NetNGlyc/output.php
54Expasy Tools: http://www.expasy.ch/tools/
- The NetPhos 2.0 server produces neural network predictions for serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins: http://www.cbs.dtu.dk/services/NetPhos/
  Output format: http://www.cbs.dtu.dk/services/NetPhos/output.php
- The NetPhosK 1.0 server produces neural network predictions of kinase-specific eukaryotic protein phosphorylation sites: http://www.cbs.dtu.dk/services/NetPhosK/
  Output format: http://www.cbs.dtu.dk/services/NetPhosK/output.php
- and many others
55Expasy Tools: http://www.expasy.ch/tools/
Topology prediction
- PSORTb v.2.0 (Gardy et al, 2004) (v.1.0: Gardy et al, 2003) for bacterial sequences
- WoLF PSORT (Horton et al, 2006), a recently updated version of PSORT II for the prediction of eukaryotic sequences
- PSORT II (Nakai and Horton, 1997) for eukaryotic sequences
- PSORT (Nakai and Kanehisa, 1991) for plant sequences
- iPSORT (Bannai et al, 2002) for classification of eukaryotic N-terminal sorting signals
http://www.psort.org/
http://us.expasy.org/tools/peptidecutter/
56EMBOSS Tools Tour
http://bioweb.pasteur.fr/seqanal/EMBOSS/
57STRING - Search Tool for the Retrieval of Interacting Proteins: http://string.embl.de/
58BioCarta: http://www.biocarta.com/genes/index.asp
59http://www.pathblast.org/
Try the examples!