Protein Fold recognition - PowerPoint PPT Presentation

Center for Biological Sequence Analysis, Technical University of Denmark (DTU)

Transcript and Presenter's Notes

Title: Protein Fold recognition

1
Protein Fold recognition

Morten Nielsen,
CBS, BioCentrum,
DTU

2
Outline

Why model protein structure
Classification of protein structures
Fold, Superfamily, Family
Protein homology modeling
Template (fold) recognition
Alignment
Side chain modeling
Loop modeling

Reliability measures
Sequence identity (ID) bad, P-value good
Historical overview
BLAST (simple alignment)
PSI-BLAST (profiles)
Profile-profile alignment
Structural features
Recombinant or democratic homology modeling
Best methods

3
Why protein modeling?

Experimental effort to determine protein
structure is very large and costly
The gap between the size of the protein sequence
data and protein structure data is large and
increasing
Close to 50% of all new sequences can be homology
modeled

4
Swiss-Prot database
5
PDB New Fold Growth
[Figure legend: old folds, new PDB structures, new folds]

The number of unique folds in nature is fairly
small (possibly a few thousand)
90% of new structures submitted to PDB in the
past three years have similar structural folds in
PDB

6
Protein classification

Number of protein sequences grows exponentially
Number of solved structures grows exponentially
Number of new folds identified is very small (and
close to constant)
Protein classification can
Generate overview of structure types
Detect similarities (evolutionary relationships)
between protein sequences

7
Protein structure classification
8
Classification schemes

SCOP
Manual classification (A. Murzin)
CATH
Semi manual classification (C. Orengo)
FSSP
Automatic classification (L. Holm)

9
Levels in SCOP

Class                               Folds  Superfamilies  Families
All alpha proteins                    202            342       550
All beta proteins                     141            280       529
Alpha and beta proteins (a/b)         130            213       593
Alpha and beta proteins (a+b)         260            386       650
Multi-domain proteins                  40             40        55
Membrane and cell surface proteins     42             82        91
Small proteins                         72            104       162
Total                                 887           1447      2630

http://scop.berkeley.edu/count.html#scop-1.67
10
Major classes in SCOP

Classes
All alpha proteins
All beta proteins
Alpha and beta proteins (a/b)
Alpha and beta proteins (a+b)
Multi-domain proteins
Membrane and cell surface proteins
Small proteins

11
All α: Hemoglobin (1bab)
12
All β: Immunoglobulin (8fab)
13
α/β: Triosephosphate isomerase (1hti)
14
α+β: Lysozyme (1jsf)
15
Families

Proteins whose evolutionary relationship is
readily recognizable from the sequence (25%
sequence identity)
Families are further subdivided into Proteins
Proteins are divided into Species
The same protein may be found in several species

16
Superfamilies

Proteins which are (remotely) evolutionarily
related
Sequence similarity low
Share function
Share special structural features
Relationships between members of a superfamily
may not be readily recognizable from the sequence
alone

17
Folds

Proteins which have 50% of their secondary
structure elements arranged in the same order
in the protein chain and in three dimensions are
classified as having the same fold
No evolutionary relation between proteins is implied
Confusingly, also called fold classes

18
Links

PDB (protein structure database)
www.rcsb.org/pdb/
SCOP (protein classification database)
scop.berkeley.edu
CATH (protein classification database)
www.biochem.ucl.ac.uk/bsm/cath
FSSP (protein classification database)
www.ebi.ac.uk/dali/fssp/fssp.html

19
Superfamilies

Proteins which are (remotely) evolutionarily
related
Sequence similarity low
Share function
Share special structural features
Relationships between members of a superfamily
may not be readily recognizable from the sequence
alone

[Figure: classification hierarchy Fold > Superfamily > Family > Proteins]
20
Model accuracy. Swiss-Model: 1200 models sharing
25-95% sequence identity with the submitted
sequences (www.expasy.ch/swissmod)
21
Identification of fold

If sequence similarity is high, proteins share
structure (Safe zone)
If sequence similarity is low, proteins may share
structure (Twilight zone)
Most proteins do not have a partner with high
sequence homology

Rajesh Nair and Burkhard Rost, Protein Science,
2002, 11, 2836-47
22
Identification of correct fold

ID is a poor measure
Many evolutionarily related proteins share low
sequence homology
Alignment score even worse
Many sequences will score high against everything
(hydrophobic stretches)
P-value or E-value more reliable

23
P and E values
Score = 150; 10 hits with higher score (E = 10); 10000
hits in database; P = 10/10000 = 0.001

E-value
Number of expected hits in database with score
higher than match
Depends on database size
P-value
Probability that a random hit will have score
higher than match
Database size independent
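The counting definitions above can be reproduced with a few lines of Python. This is a minimal sketch that follows the slide's empirical definitions (E is the number of database hits scoring higher than the match, P is that number divided by the database size). The function name and the toy score list are illustrative assumptions; real search tools estimate these values from a fitted score distribution rather than by direct counting.

```python
def empirical_e_and_p(match_score, database_scores):
    """Return (E, P) for a match using the slide's counting definitions."""
    # E: number of database hits that score higher than the match.
    e_value = sum(1 for s in database_scores if s > match_score)
    # P: fraction of database hits that score higher than the match.
    p_value = e_value / len(database_scores)
    return e_value, p_value


# Toy database matching the slide's numbers: 10000 hits, 10 of them above 150.
toy_scores = [200] * 10 + [100] * 9990
e, p = empirical_e_and_p(150, toy_scores)
print(e, p)  # 10 0.001
```

Doubling the toy database with the same proportion of high scores doubles E but leaves P unchanged, which is the size dependence the slide points out.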

[Figure: score distribution, P(Score) plotted against Score]
24
Protein Homology modeling

Identify fold (template) for modeling
Find the structure in the PDB database that
resembles the unknown structure the most
Can be used to predict function

Align protein sequence to template
Simple alignment methods
Sequence profiles
Threading methods
Pseudo force fields
Model side chains and loops

25
Template identification

Simple sequence based methods
Align (BLAST) sequence against sequence of
proteins with known structure (PDB database)
Sequence profile based methods
Align sequence profile (Psi-BLAST) against
sequence of proteins with known structure (PDB)
Align sequence profile against profile of
proteins with known structure (FFAS)
Sequence and structure based methods
Align profile and predicted secondary structure
against proteins with known structure (3D-PSSM)

26
Template identification

Threading methods
Align sequence against structural environment of
proteins with known structure
Use biological information
Functional annotation in databases
Active sites

27
Sequence profiles

In conventional alignment, a scoring matrix
(BLOSUM62) gives the score for matching two amino
acids
In reality not all positions in a protein are
equally likely to mutate
Some amino acids (active sites) are highly
conserved, and the score for mismatch must be
very high
Other amino acids mutate almost for free, and
the score for mismatch is lower than the BLOSUM
score
Sequence profiles can capture this

28
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASK
ISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHW
HGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTV
TIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFR
GEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------
IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFV
PTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWAD
GPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVD
PMERNTAGVP
Matching anything but G gives a large negative score
Anything can match
29
Sequence profiles

1. Align (BLAST) sequence against large sequence
database (Swiss-Prot)
2. Select significant alignments and make profile
(weight matrix) using techniques for sequence
weighting and pseudo counts (see lecture on
HMMs)
3. Use weight matrix to align against sequence
database to find new significant hits
4. Repeat steps 2 and 3 until a stop criterion is met
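The weight-matrix step can be sketched in Python as follows. This is only an illustration under stated assumptions: a uniform background frequency, a flat pseudocount added to every amino acid, and no sequence weighting (which the procedure above does call for). The function names `build_pssm` and `score` are hypothetical, not taken from any particular tool.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Log-odds weight matrix from equal-length aligned sequences ('-' = gap)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned_seqs):
        # Count observed amino acids in this column, ignoring gaps.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, seq):
    """Sum of position-specific scores for an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


# Toy alignment: position 0 is fully conserved (G), position 1 is variable.
hits = ["GAK", "GSK", "GTR"]
pssm = build_pssm(hits)
print(pssm[0]["G"] > pssm[1]["A"] > 0 > pssm[1]["W"])  # True
```

In the toy matrix the conserved G scores higher than any residue at the variable position, which is the position-specific behaviour a fixed matrix such as BLOSUM62 cannot express.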

30
PDB-BLAST

Procedure
Build sequence profile by iterative PSI-BLAST
search against a sequence database
Use profile to search database of proteins with
known structure (PDB)

31
Transitive BLAST

Procedure
Find homologues to query (your) sequence
Find homologues to these homologues
Etc.
Can be implemented with e.g. BLAST or PSI-BLAST
Also known as Intermediate Sequence Search (ISS)
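The transitive procedure amounts to a breadth-first walk over homology hits. The sketch below assumes a caller-supplied function `find_homologues` (hypothetical; in practice it would wrap a BLAST or PSI-BLAST search and apply a significance cutoff) and limits the number of rounds so the search does not drift too far from the query.

```python
from collections import deque


def transitive_search(query, find_homologues, max_rounds=3):
    """Collect sequences reachable from the query through chains of homologues."""
    found = {query}
    frontier = deque([(query, 0)])
    while frontier:
        seq_id, depth = frontier.popleft()
        if depth == max_rounds:
            continue  # do not expand hits found in the last allowed round
        for hit in find_homologues(seq_id):
            if hit not in found:
                found.add(hit)
                frontier.append((hit, depth + 1))
    found.discard(query)
    return found


# Toy hit table standing in for real searches (illustrative data only).
toy_hits = {"query": ["A"], "A": ["query", "B"], "B": ["A", "template"]}


def lookup(seq_id):
    return toy_hits.get(seq_id, [])


print(sorted(transitive_search("query", lookup, max_rounds=1)))  # ['A']
print(sorted(transitive_search("query", lookup, max_rounds=3)))  # ['A', 'B', 'template']
```

In the toy data the template is only reached through the intermediates A and B, which is the situation the intermediate sequence search is meant to handle.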

32
Example: Sequence profiles

Alignment of protein sequences 1PLC._ and 1GYC.A
E-value 1000
Profile alignment
Align 1PLC._ against Swiss-Prot
Make position specific weight matrix from
alignment
Use this matrix to align 1PLC._ against 1GYC.A
E-value

33
Sequence profiles
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

1PLC._   3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
                  F         G            N                 G
1GYC.A  26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79

1PLC._  57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
                   A G  F          G              G  G   V
1GYC.A  80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126

Rmsd = 3.3 Å. Model: red. Template: blue.
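The identity and gap counts reported in a BLAST-style summary can be recomputed from the two aligned rows. The sketch below is a minimal illustration; the function name is an assumption, and the example uses only the second block of the alignment above (1PLC._ residues 57-98 against 1GYC.A residues 80-126), so its counts are not expected to match the totals for the whole alignment.

```python
def alignment_stats(row_a, row_b):
    """Return (identities, gap columns, alignment length) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = list(zip(row_a, row_b))
    # An identity is a column where both rows carry the same residue.
    identities = sum(1 for a, b in pairs if a == b and a != "-")
    # A gap column has a gap character in either row.
    gaps = sum(1 for a, b in pairs if a == "-" or b == "-")
    return identities, gaps, len(pairs)


query = "MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV"
template = "AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV"
print(alignment_stats(query, template))  # (7, 5, 47)
```

Seven identities in 47 columns is roughly 15%, well inside the twilight zone, even though the profile-based E-value for the full alignment is highly significant.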
34
Including structure

Sequences within a protein superfamily share only
remote sequence homology,
but they share high structural homology
Structure is known for template
Predict structural properties for query
Secondary structure
Surface exposure

35
Using structure

Sequence and structure profile-profile based
alignments
Template profiles
Multiple structure alignments
Sequence based profiles
Query profile
Sequence based profile
Predicted secondary structure
Position specific gap penalties derived from
secondary structure

36
Structure biased alignment (3D-PSSM)
http://www.sbg.bio.ic.ac.uk/3dpssm/
37
Threading
Alignment score from structural fitness (pair potential)
How well does K fit the environment at P6?
If P8 is acidic then fine; if P8 is basic then poor
[Figure: sequence .. A T N L Y K E T L .. threaded onto template positions 1-10, with deletions and an insertion marked]
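The idea of scoring a residue by its structural environment can be illustrated with a toy pair potential. Everything in this sketch is an assumption made for illustration: the potential only looks at charge (opposite charges score +1, like charges -1), and the contact list is invented. Real threading potentials are derived statistically from known structures and cover all residue pairs.

```python
# Toy charges: K and R are basic, D and E are acidic (illustrative potential only).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def pair_score(aa1, aa2):
    """+1 for opposite charges, -1 for like charges, 0 if either is uncharged."""
    return -CHARGE.get(aa1, 0) * CHARGE.get(aa2, 0)


def threading_score(assignment, contacts):
    """Sum pair scores over template positions that are in contact.

    assignment maps template position -> residue placed there;
    contacts is a list of (position, position) pairs from the template structure.
    """
    return sum(pair_score(assignment[p], assignment[q]) for p, q in contacts)


contacts = [(6, 8)]  # assumption: positions 6 and 8 touch in the template
print(threading_score({6: "K", 8: "E"}, contacts))  # 1
print(threading_score({6: "K", 8: "R"}, contacts))  # -1
```

Placing K at P6 scores well when the residue at P8 is acidic and poorly when it is basic, mirroring the example on the slide.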
38
Threading

Threading does not work
The average protein does not exist
Threading can be used in combination with
sequence profiles and local structural features to
improve alignment

39
CASP

CASP
Critical Assessment of Structure Predictions
Every second year
Sequences from about-to-be-solved-structures are
given to groups who submit their predictions
before the structure is published
Modelers make predictions
Meeting in December where correct answers are
revealed

40
CASP5 overview
41
Successful fold recognition groups at CASP5

3D-Jury (Leszek Rychlewski)
3D-CAM (Krzysztof Ginalski)
Template recombination (Paul Bates)
HMAP (Barry Honig)
PROSPECT (Ying Xu)
ATOME (Gilles Labesse)

42
Democratic homology modeling

Let the silent majority rule
The highest-scoring hit will often be wrong
Many prediction methods will have the correct
fold among the top 10-20 hits
If many different prediction methods all have
some fold among the top hits, this fold is
probably correct
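A simple vote over the top hits of several methods captures this idea. The sketch below is an assumption-laden illustration, not any server's actual scheme: each method contributes one vote to every fold among its top hits, and the method names and fold identifiers are invented.

```python
from collections import Counter


def consensus_fold(predictions, top_n=10):
    """Rank folds by how many methods list them among their top_n hits.

    predictions maps method name -> list of fold ids, best hit first.
    """
    votes = Counter()
    for ranked in predictions.values():
        votes.update(set(ranked[:top_n]))  # one vote per method per fold
    return votes.most_common()


# Toy predictions: no method ranks fold "f3" first, but all three list it.
toy_predictions = {
    "method1": ["f1", "f7", "f3"],
    "method2": ["f2", "f3", "f9"],
    "method3": ["f5", "f3", "f1"],
}
print(consensus_fold(toy_predictions)[0])  # ('f3', 3)
```

Here the consensus fold is never the top hit of any single method, which is the case where trusting the highest score alone would fail.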

43
3D-Jury (Rychlewski)

Inspired by Ab initio modeling methods
Average of frequently obtained low energy
structures is often closer to the native
structure than the lowest energy structure
Find most abundant high scoring model in a list
of predictions from several predictors
Use output from a set of servers
Superimpose all pairs of structures
Similarity score $S_{ij}$: number of Cα pairs within 3.5 Å
(if $S_{ij} > 40$, else $S_{ij} = 0$)
3D-Jury score $S_i = \sum_j S_{ij} / (N + 1)$
Similar methods developed by A Elofsson (Pcons)
and D Fischer (3D shotgun)
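The scoring step can be sketched in Python once the pairwise similarities are known. This is a hedged illustration, not the 3D-Jury implementation: the similarity matrix is assumed to be precomputed (number of Cα pairs within 3.5 Å after superposition), the cutoff of 40 is applied as a strict "greater than", and N is taken to be the number of other models, which are assumptions about details the slide does not spell out.

```python
def jury_scores(similarity, threshold=40):
    """Consensus score per model from a symmetric pairwise similarity matrix.

    similarity[i][j] is the number of C-alpha pairs within 3.5 A after
    superimposing models i and j (assumed precomputed elsewhere).
    """
    n_models = len(similarity)
    n_other = n_models - 1  # assumption: N counts the other models
    scores = []
    for i in range(n_models):
        total = sum(
            similarity[i][j]
            for j in range(n_models)
            if j != i and similarity[i][j] > threshold  # weak matches count as 0
        )
        scores.append(total / (n_other + 1))
    return scores


# Toy matrix for three models (illustrative numbers only).
toy_similarity = [
    [0, 80, 30],
    [80, 0, 60],
    [30, 60, 0],
]
scores = jury_scores(toy_similarity)
print(scores.index(max(scores)))  # 1
```

Model 1 wins because it is similar to both other models, not because any single server scored it highest.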

44
LiveBench

The LiveBench Project is a continuous
benchmarking program. Every week, sequences of
newly released PDB proteins are submitted
to participating fold recognition servers. The
results are collected and continuously evaluated
using automated model assessment programs. A
summary of the results is produced after several
months of data collection. The servers must delay
the updating of their structural template
libraries by one week to participate

45
Meta prediction server

Web interface to a list of public protein
structure prediction servers
Submit query sequence to all selected servers in
one go
http://bioinfo.pl/meta/

46
Meta Server
47
Meta Server
48
3D Jury
49
188 targets in total. Threshold for 5 false positives: 50 for 3D Jury
50
Links to fold recognition servers

Databases of links
http://bioinfo.pl/meta/servers.html
http://mmtsb.scripps.edu/cgi-bin/renderrelres?protmodel
Meta server
http://bioinfo.pl/meta/
3D-PSSM (good graphical output)
http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
GenTHREADER
http://bioinf.cs.ucl.ac.uk/psipred/
FUGUE2
http://www-cryst.bioc.cam.ac.uk/fugue/prfsearch.html
SAM
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
FOLD
http://fold.doe-mbi.ucla.edu/
FFAS/PDB-BLAST
http://bioinformatics.burnham-inst.org/

51
From fold to structure

Flying to the moon has not made man conquer
space
Finding the right fold does not allow you to make
accurate protein models
Can allow prediction of protein function
Alignment is still a very hard problem
Most protein interactions are determined by the
loops, and they are the least conserved parts of
a protein structure

52
Ab initio protein modeling: modeling of new-fold proteins

Only when everything else fails
Challenge
Close to impossible to model Nature's folding
potential
Example

53
Challenge. Folding potential

New folds are in general constructed from a set
of subunits, where each subunit is a part of a
known fold.
The subunits are small compared to the overall
fold of the protein. No objective function exists
to guide the global packing of the subunits.

Objective function
sij 120aa
dij 6Å
54
A way to solution

Glue structure piecewise from fragments.
Guide process by empirical potential (Potential
of mean force)

Fragments with correct local structure
Nature's potential
Empirical potential
55
Examples (Rosetta web server) www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
Rosetta prediction
Homology modeling
56
Take home message

Identifying the correct fold is only a small step
towards successful homology modeling
Do not trust ID or alignment score to identify
the fold. Use P-values
Use sequence profiles and local protein structure
to align sequences
Do not trust one single prediction method; use
consensus methods (3D Jury)
Only if everything else fails, use ab initio methods