Introduction to Bioinformatics: Lecture XI Computational Protein Structure Prediction


1
Introduction to Bioinformatics: Lecture XI
Computational Protein Structure Prediction

Jarek Meller
Division of Biomedical Informatics,
Children's Hospital Research Foundation
Department of Biomedical Engineering, UC

2
Outline of the lecture

Protein structure and complexity of conformational search: from similarity-based methods to de novo structure prediction
Multiple sequence alignment and family profiles
Secondary structure and solvent accessibility prediction
Matching sequences with known structures: threading and fold recognition
Ab initio folding simulations

3
Polypeptide chains: backbone and side-chains
[Figure labels: N-ter, C-ter]
4
Distinct chemical nature of amino acid side-chains
[Figure labels: N-ter, C-ter; PHE, CYS, VAL, GLU, ARG]
5
Hydrogen bonds and secondary structures
[Figure labels: β-strand, α-helix]
6
Tertiary structure and long-range contacts: annexin
7
Quaternary structure and protein-protein interactions: annexin hexamer
8
Domains, interactions, complexes: cyclin D and Cdk
Cyclin Box
9
Domains, interactions, complexes: VHL
10
Protein folding problem

The protein folding problem consists of predicting the three-dimensional structure of a protein from its amino acid sequence.
The hierarchical organization of protein structures helps to break the problem into secondary structure, tertiary structure and protein-protein interaction predictions.
Computational approaches to protein structure prediction: similarity-based and de novo methods.

11
Polypeptide chains: backbone and rotational degrees of freedom
        H     O         R2
        |     ||        |
+H3N -- Cα -- C -- N -- Cα -- C -- O-
        |          |    |     ||
        R1         H    H     O
The equilibrium length of the peptide bond (C -- N) is about 1.3 Å. The average Cα - Cα distance in a polypeptide chain is about 3.8 Å. The angle of rotation around the N - Cα bond is called φ, and the angle around the Cα - C bond is called ψ. These two angles define the overall conformation of a polypeptide chain. Simplifying, if each of these single bonds has three discrete states (rotations), each residue has 3 x 3 = 9 states, implying 9^N possible backbone conformations for a chain of N residues.
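To get a feel for how quickly this count grows, the following minimal Python sketch evaluates the estimate for a few chain lengths. The function names and the sampling rate of 10^12 conformations per second are illustrative assumptions, not part of the lecture.

```python
def backbone_conformations(n_residues: int, states_per_angle: int = 3) -> int:
    """Count discrete backbone conformations for a chain of n_residues.

    Each residue contributes two rotatable bonds, so it has
    states_per_angle ** 2 states (9 for three states per bond).
    """
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return (states_per_angle ** 2) ** n_residues


def enumeration_time_years(n_residues: int, rate_per_second: float = 1e12) -> float:
    """Years needed to visit every conformation at a hypothetical sampling rate."""
    seconds = backbone_conformations(n_residues) / rate_per_second
    return seconds / (365 * 24 * 3600)


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, f"{backbone_conformations(n):.2e}", f"{enumeration_time_years(n):.2e}")
```

Even at this generous assumed rate, exhaustive enumeration is out of reach for chains of realistic length, which is why the search has to be guided by a scoring function.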
12
Scoring alternative conformations with empirical force fields (folding potentials)
Ideally, each misfolded structure should have an energy higher than the native energy, i.e.
E_misfolded - E_native > 0
[Figure: energy (E) levels of the misfolded and native states]
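The requirement that every misfolded structure has a higher energy than the native one can be checked directly on a set of decoys. The sketch below is only an illustration: the function names are hypothetical, and the energies are assumed to come from whatever force field is being tested.

```python
def energy_gaps(native_energy, decoy_energies):
    """Return E_misfolded - E_native for every misfolded (decoy) structure."""
    return [e - native_energy for e in decoy_energies]


def native_is_recognized(native_energy, decoy_energies):
    """True if every decoy has a strictly higher energy than the native state."""
    return all(gap > 0 for gap in energy_gaps(native_energy, decoy_energies))


if __name__ == "__main__":
    # Made-up energies in arbitrary units, for illustration only.
    print(native_is_recognized(-10.0, [-8.0, -5.5, -9.9]))
    print(native_is_recognized(-10.0, [-8.0, -11.0]))
```

A force field that fails this test on known structures will mislead any search built on top of it, however thorough the sampling.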
13
Ab initio (or de novo) folding simulations

When dealing with a new fold, similarity-based methods cannot be applied.
Ab initio folding simulations consist of a conformational search with an empirical scoring function (force field) to be maximized (or minimized).
Computational bottleneck: exponential search space and sampling problem (global optimization!).
Fundamental problem: inaccuracy of empirical force fields.
Importance of mixed protocols, such as Rosetta by D. Baker and colleagues (more when Monte Carlo protocols for global optimization are introduced).
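As a rough illustration of a conformational search driven by a scoring function, here is a minimal Metropolis Monte Carlo sketch over discrete torsion states with a linear cooling schedule. Everything in it is an assumption made for the example: `toy_energy` is a made-up stand-in for a real force field, and the move set and schedule are the simplest possible choices. It is not the Rosetta protocol.

```python
import math
import random


def toy_energy(states):
    """Hypothetical energy: state 0 is favourable, unequal neighbours are penalized."""
    local = sum(1.0 for s in states if s != 0)
    coupling = sum(0.5 for a, b in zip(states, states[1:]) if a != b)
    return local + coupling


def metropolis_search(n_angles, n_steps=20000, temperature=1.0, n_states=3, seed=0):
    """Minimize toy_energy over n_angles discrete torsion states."""
    rng = random.Random(seed)
    current = [rng.randrange(n_states) for _ in range(n_angles)]
    e_cur = toy_energy(current)
    best, e_best = list(current), e_cur
    for step in range(n_steps):
        # Linear cooling; the small offset keeps the temperature positive.
        t = temperature * (1 - step / n_steps) + 1e-3
        i = rng.randrange(n_angles)
        old = current[i]
        current[i] = rng.randrange(n_states)  # propose a new state for one angle
        delta = toy_energy(current) - e_cur
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            e_cur += delta
            if e_cur < e_best:
                best, e_best = list(current), e_cur
        else:
            current[i] = old  # reject the move
    return best, e_best


if __name__ == "__main__":
    print(metropolis_search(20))
```

The toy landscape has no real barriers, so the search finds the minimum easily; with a realistic force field both the sampling problem and the accuracy of the energy function become limiting.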

14
Similarity-based approaches to structure prediction: from sequence alignment to fold recognition

High level of redundancy in biology: sequence similarity is often sufficient to use the guilt by association rule: if similar sequence, then similar structure and function.
Multiple alignments and family profiles can detect evolutionary relatedness at much lower sequence similarity, which is hard to detect with pairwise sequence alignments: PSI-BLAST by S. Altschul et al.
For sufficiently close proteins one may superimpose the backbones using sequence alignment and then perform a conformational search (with the backbone fixed) to find the optimal geometry of the side-chains (according to an atomistic empirical force field): homology modeling (e.g. Modeller by A. Sali et al.).
Many structures are already known (see PDB) and one can match sequences directly with structures to enhance structure recognition: fold recognition.
For both fold recognition and de novo simulation, prediction of intermediate attributes such as secondary structure or solvent accessibility helps to achieve better sensitivity and specificity.

15
Protein families and domains
The notion of protein family is derived from evolutionary considerations: members of the same family are related, perform the same function and are assumed to have diverged from the same ancestor. The notion of domain is derived from structural considerations: a domain is defined as an autonomous structural unit, or a reusable sequence unit that may be found in multiple protein contexts (Bateman et al.).
PFAM (7246 families as of April 2004): http://www.sanger.ac.uk/Software/Pfam/
PRODOM: http://prodes.toulouse.inra.fr/prodom/current/html/home.php
CDD: http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi
Check pfam00134.11, Cyclin_N
16
Multiple alignment and PSSM
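A position-specific scoring matrix (PSSM) can be derived from the columns of a multiple alignment. The sketch below is one simple way to do it, under assumptions that are not from the lecture: a uniform background frequency of 1/20, a flat pseudocount, gaps ignored, and log-odds scores in bits. Tools such as PSI-BLAST use more refined weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (bits) per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(length):
        # Count residues in this column; gap characters are skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, sequence):
    """Sum the column scores for an ungapped sequence of the same length."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence length must match the number of columns")
    return sum(column[a] for column, a in zip(pssm, sequence))


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "AGD"]  # made-up example
    profile = build_pssm(toy_alignment)
    print(score_sequence(profile, "ACD"), score_sequence(profile, "WWW"))
```

Residues that dominate a column receive positive scores and rarely seen residues receive negative ones, so a family member scores higher than an unrelated sequence.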
17
Multiple alignment, clustering and families

A DP search gives the optimal solution but scales exponentially with the number of sequences K, O(n^K), so it is not practical for more than 3 or 4 sequences.
Standard heuristics start from pairwise alignments (e.g. PSI-BLAST, ClustalW).
Hidden Markov Model approach to family profiles (profile HMM) as an alternative with pre-fixed parameters, trained separately for each family. Some initial multiple alignments are necessary for training (next lecture).
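The scaling argument is easiest to see in the pairwise case (K = 2), where the DP table has (n+1)^2 cells. The following sketch computes the optimal global alignment score of two sequences; the match, mismatch and gap values are arbitrary assumptions chosen for illustration, and `dp_table_cells` is a hypothetical helper that simply counts table cells for K sequences of length n.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of a aligned to gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of b aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]


def dp_table_cells(n, k):
    """Number of cells in the DP table for k sequences of length n."""
    return (n + 1) ** k


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    for k in (2, 3, 4, 5):
        print(k, dp_table_cells(300, k))
```

For sequences of a few hundred residues the table size grows by more than two orders of magnitude with each added sequence, which is why heuristics built on pairwise alignments are used in practice.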

18
Predicting 1D protein profiles from sequences: secondary structures and solvent accessibility
a) Multiple alignment and family profiles improve prediction of local structural propensities.
b) Use of advanced machine learning techniques, such as Neural Networks or Support Vector Machines, improves results as well.
B. Rost and C. Sander were the first to achieve more than 70% accuracy in three-state (H, E, C) classification, applying a) and b).
SABLE server: http://sable.cchmc.org
POLYVIEW server: http://polyview.cchmc.org
19
Predicting 1D protein profiles from sequences: secondary structures and solvent accessibility
20
Predicting transmembrane domains
21
Hydropathy profiles and membrane domains prediction
Problem: Design a simple algorithm for finding putative transmembrane regions based on hydropathy (or hydrophobicity) profiles. Consider an extension based on prototypes and k-NN.
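One possible answer to the first part of this problem is a sliding-window average over a hydropathy scale. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a threshold of 1.6; these are commonly quoted defaults, chosen here as assumptions rather than taken from the lecture. The function names are illustrative, and the input is assumed to contain only the 20 standard one-letter codes in upper case.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every window of the given length."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def find_tm_segments(sequence, window=19, threshold=1.6):
    """Return (start, end) residue ranges (end exclusive) covered by
    consecutive windows whose average hydropathy exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    segments = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold:
            if start is None:
                start = i  # first window of a new hydrophobic stretch
        elif start is not None:
            segments.append((start, i - 1 + window))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1 + window))
    return segments


if __name__ == "__main__":
    toy = "K" * 10 + "L" * 25 + "K" * 10  # made-up sequence with one hydrophobic core
    print(find_tm_segments(toy))
```

Because a range is reported as the union of all qualifying windows, it overshoots the hydrophobic core at both ends; reporting window centres instead is an equally reasonable design choice.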
22
Predicting transmembrane domains
23
Going beyond sequence similarity: threading and fold recognition
When sequence similarity is not detectable, use a library of known structures to match your query with target structures. As in the case of de novo folding, one needs a scoring function that measures compatibility between sequences and structures.
24
Why fold recognition?

Divergent (common ancestor) vs. convergent (no common ancestor) evolution
PDB: virtually all proteins with 30% sequence identity have similar structures; however, most of the similar structures share only up to 10% sequence identity!
www.columbia.edu/rost/Papers/1997_evolution/paper.html (B. Rost)
www.bioinfo.mbb.yale.edu/genome/foldfunc/ (H. Hegyi, M. Gerstein)

25
Simple contact model for protein structure
prediction
Each amino acid is represented by a point in 3D
space and two amino acids are said to be in
contact if their distance is smaller than a
cutoff distance, e.g. 7 Å.
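The contact definition translates almost directly into code. In this minimal sketch each residue is given as one (x, y, z) point, for example its Cα position, which is an assumption since the slide does not say which point represents a residue. The `min_separation` parameter is also an addition, useful for skipping chain neighbours, which are always within the cutoff.

```python
import math


def contact_list(coords, cutoff=7.0, min_separation=1):
    """Return index pairs (i, j), i < j, of residues closer than the cutoff.

    coords: one (x, y, z) point per residue, in the same units as cutoff.
    min_separation: minimal sequence distance j - i for a pair to count.
    """
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.append((i, j))
    return contacts


if __name__ == "__main__":
    # Made-up coordinates with consecutive points 3.8 apart.
    points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]
    print(contact_list(points))
    print(contact_list(points, min_separation=2))
```

The resulting list of pairs is the structural description that a contact-based scoring function operates on.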
26
Sequence-to-structure matching with contact models

A generalized string matching problem: aligning a string of amino acids against a string of structural sites characterized by other residues in contact.
Finding an optimal alignment with gaps using inter-residue pairwise models,
E = \sum_{k<l} \epsilon_{kl},
is NP-hard because of the non-local character of the scores at a given structural site (the identity of the interaction partners may change depending on the location of gaps in the alignment).
R.H. Lathrop, Protein Eng. 7 (1994)
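The pairwise energy is cheap to evaluate once the alignment is fixed; the hardness comes from optimizing over alignments with gaps. The sketch below scores a sequence placed on a set of structural sites and then tries only gapless placements, which sidesteps the hard part entirely. The H/P pair energies (-1 for an H-H contact, 0 otherwise) and all function names are assumptions made for illustration.

```python
def hp_pair_energy(a, b):
    """Assumed hydrophobic contact energy: -1 for an H-H contact, else 0."""
    return -1.0 if a == "H" and b == "H" else 0.0


def contact_energy(residues, contacts, pair_energy=hp_pair_energy):
    """Sum pair energies over all contacts (k, l), k < l, of a structure."""
    return sum(pair_energy(residues[k], residues[l]) for k, l in contacts)


def best_gapless_threading(sequence, n_sites, contacts, pair_energy=hp_pair_energy):
    """Slide the sequence along the structure without gaps.

    Returns (offset, energy) of the lowest-energy placement, or None if the
    sequence is longer than the structure. Contacts that involve a site not
    covered by the sequence are ignored.
    """
    best = None
    for offset in range(n_sites - len(sequence) + 1):
        placed = {offset + i: aa for i, aa in enumerate(sequence)}
        energy = sum(pair_energy(placed[k], placed[l])
                     for k, l in contacts if k in placed and l in placed)
        if best is None or energy < best[1]:
            best = (offset, energy)
    return best


if __name__ == "__main__":
    toy_contacts = [(0, 3), (1, 3), (2, 5)]  # made-up contact list on 6 sites
    print(contact_energy("HHPH", [(0, 1), (0, 3), (1, 2)]))
    print(best_gapless_threading("HPPH", 6, toy_contacts))
```

Once gaps are allowed, the residue occupying a partner site depends on choices made elsewhere in the alignment, so the score can no longer be decomposed site by site as ordinary sequence alignment requires.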

27
Hydrophobic contact model and sequence-to-structure alignment
[Figure labels: -, HPHPP]

Solutions to this yet another instance of the global optimization problem:
Heuristic (e.g. frozen environment approximation)
Profile or local scoring functions (folding potentials)

28
Using sequence similarity, predicted secondary structures and contact potentials: fold recognition protocols
In practice, fold recognition methods are often mixtures of sequence matching and threading, e.g., with compatibility between a sequence and a structure measured by contact potentials and with predicted secondary structures compared to the secondary structure of a template (D. Fischer and D. Eisenberg, Curr. Opinion in Struct. Biol. 1999, 9: 208).
29
Some fold recognition servers

PSI-BLAST (Altschul SF et al., Nucl. Acids Res. 25: 3389)
LiveBench evaluation (http://BioInfo.PL/LiveBench/1/)
FFAS (L. Rychlewski, L. Jaroszewski, W. Li, A. Godzik (2000), Protein Science 9: 232): sequence profile against profile
3D-PSSM (Kelley LA, MacCallum RM, Sternberg MJE, JMB 299: 499): 1D-3D profile combined with secondary structures and solvation potential
GenTHREADER (Jones DT, JMB 287: 797): sequence profile combined with pairwise interactions and solvation potential
LOOPP: annotations of remote homologs, http://www.tc.cornell.edu/CBIO/loopp