Protein Fold recognition - PowerPoint PPT Presentation

Center for Biological Sequence Analysis, Technical University of Denmark (DTU)

1
Protein Fold recognition
  • Morten Nielsen,
  • CBS, BioCentrum,
  • DTU

2
Outline
  • Why model protein structure
  • Classification of protein structures
  • Fold, Superfamily, Family
  • Protein homology modeling
  • Template (fold) recognition
  • Alignment
  • Side chain modeling
  • Loop modeling
  • Reliability measures
  • Sequence identity (ID) bad, P-value good
  • Historical overview
  • BLAST (simple alignment)
  • PSI-BLAST (profiles)
  • Profile-profile alignment
  • Structural features
  • Recombinant or democratic homology modeling
  • Best methods

3
Why protein modeling?
  • The experimental effort needed to determine a
    protein structure is very large and costly
  • The gap between the size of the protein sequence
    data and protein structure data is large and
    increasing
  • Close to 50% of all new sequences can be homology
    modeled

4
Swiss-Prot database
5
PDB New Fold Growth
[Figure labels: Old folds; New PDB structures; New folds]
  • The number of unique folds in nature is fairly
    small (possibly a few thousand)
  • 90% of new structures submitted to PDB in the
    past three years have similar structural folds in
    PDB

6
Protein classification
  • The number of protein sequences grows
    exponentially
  • The number of solved structures grows
    exponentially
  • The number of new folds identified is very small
    (and close to constant)
  • Protein classification can:
  • Generate an overview of structure types
  • Detect similarities (evolutionary relationships)
    between protein sequences

7
Protein structure classification
8
Classification schemes
  • SCOP
  • Manual classification (A. Murzin)
  • CATH
  • Semi manual classification (C. Orengo)
  • FSSP
  • Automatic classification (L. Holm)

9
Levels in SCOP
  Class                               Folds  Superfamilies  Families
  All alpha proteins                    202            342       550
  All beta proteins                     141            280       529
  Alpha and beta proteins (a/b)         130            213       593
  Alpha and beta proteins (a+b)         260            386       650
  Multi-domain proteins                  40             40        55
  Membrane and cell surface proteins     42             82        91
  Small proteins                         72            104       162
  Total                                 887           1447      2630

http://scop.berkeley.edu/count.html#scop-1.67
10
Major classes in SCOP
  • Classes
  • All alpha proteins
  • All beta proteins
  • Alpha and beta proteins (a/b)
  • Alpha and beta proteins (a+b)
  • Multi-domain proteins
  • Membrane and cell surface proteins
  • Small proteins

11
All α: Hemoglobin (1bab)
12
All β: Immunoglobulin (8fab)
13
α/β: Triosephosphate isomerase (1hti)
14
α+β: Lysozyme (1jsf)
15
Families
  • Proteins whose evolutionary relationship is
    readily recognizable from the sequence (25%
    sequence identity or more)
  • Families are further subdivided into Proteins
  • Proteins are divided into Species
  • The same protein may be found in several species

16
Superfamilies
  • Proteins which are (remotely) evolutionarily
    related
  • Sequence similarity low
  • Share function
  • Share special structural features
  • Relationships between members of a superfamily
    may not be readily recognizable from the sequence
    alone

17
Folds
  • Proteins which have 50% or more of their
    secondary structure elements arranged in the same
    order in the protein chain and in three
    dimensions are classified as having the same fold
  • No evolutionary relation between proteins
  • Confusingly, also called fold classes

18
Links
  • PDB (protein structure database)
  • www.rcsb.org/pdb/
  • SCOP (protein classification database)
  • scop.berkeley.edu
  • CATH (protein classification database)
  • www.biochem.ucl.ac.uk/bsm/cath
  • FSSP (protein classification database)
  • www.ebi.ac.uk/dali/fssp/fssp.html

19
Superfamilies
  • Proteins which are (remotely) evolutionarily
    related
  • Sequence similarity low
  • Share function
  • Share special structural features
  • Relationships between members of a superfamily
    may not be readily recognizable from the sequence
    alone

[Figure: classification hierarchy, from top to bottom: Fold, Superfamily, Family, Proteins]
20
Model accuracy. Swiss-Model: 1200 models sharing
25-95% sequence identity with the submitted
sequences (www.expasy.ch/swissmod)
21
Identification of fold
  • If sequence similarity is high, proteins share
    structure (safe zone)
  • If sequence similarity is low, proteins may share
    structure (twilight zone)
  • Most proteins do not have a partner with high
    sequence similarity

Rajesh Nair & Burkhard Rost, Protein Science,
2002, 11, 2836-47
22
Identification of correct fold
  • Sequence identity (ID) is a poor measure
  • Many evolutionarily related proteins share low
    sequence identity
  • Alignment score is even worse
  • Many sequences will score high against everything
    (hydrophobic stretches)
  • P-value or E-value is more reliable
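
For reference, sequence identity is usually computed directly from a pairwise alignment. The sketch below assumes the BLAST-style convention of dividing identical columns by the full alignment length, gaps included; other tools divide by the shorter sequence or by the aligned columns only, which is one reason ID values are hard to compare. The function name and the toy alignment are illustrative only.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both rows.

    Assumes both rows come from one pairwise alignment (equal length,
    '-' for gaps) and divides by the full alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * identical / len(aligned_a)


# Toy alignment: 4 identical columns out of 6.
print(round(percent_identity("ACDE-G", "ACQEFG"), 1))
```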

23
P and E values
Score = 150; 10 hits with higher score (E = 10); 10000
hits in database; P = 10/10000 = 0.001
  • E-value
  • Number of expected hits in database with score
    higher than match
  • Depends on database size
  • P-value
  • Probability that a random hit will have score
    higher than match
  • Database size independent
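
The worked numbers on this slide can be reproduced with a few lines of code. This is a minimal sketch that follows the slide's counting definitions literally (E as the number of database hits scoring above the match, P as that count divided by the database size). Real search tools such as BLAST estimate these values from a fitted score distribution instead of counting, so the function name and toy data here are illustrative only.

```python
def empirical_e_and_p(database_scores, match_score):
    """Return (E, P) for match_score using simple counting.

    E: number of database hits scoring higher than the match.
    P: fraction of database hits scoring higher than the match.
    """
    if not database_scores:
        raise ValueError("database_scores must not be empty")
    e_value = sum(1 for s in database_scores if s > match_score)
    p_value = e_value / len(database_scores)
    return e_value, p_value


# Toy database matching the slide: 10000 hits, 10 of them above score 150.
scores = [100] * 9990 + [200] * 10
e, p = empirical_e_and_p(scores, 150)
print(e, p)  # E = 10, P = 0.001
```

Doubling the toy database with the same score distribution doubles E but leaves P unchanged, which is the database-size dependence listed above.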

[Figure: score distribution, P(Score) plotted against Score]
24
Protein Homology modeling
  • Identify fold (template) for modeling
  • Find the structure in the PDB database that
    resembles the unknown structure the most
  • Can be used to predict function
  • Align protein sequence to template
  • Simple alignment methods
  • Sequence profiles
  • Threading methods
  • Pseudo force fields
  • Model side chains and loops

25
Template identification
  • Simple sequence based methods
  • Align (BLAST) sequence against sequence of
    proteins with known structure (PDB database)
  • Sequence profile based methods
  • Align sequence profile (Psi-BLAST) against
    sequence of proteins with known structure (PDB)
  • Align sequence profile against profile of
    proteins with known structure (FFAS)
  • Sequence and structure based methods
  • Align profile and predicted secondary structure
    against proteins with known structure (3D-PSSM)

26
Template identification
  • Threading methods
  • Align sequence against structural environment of
    proteins with known structure
  • Use biological information
  • Functional annotation in databases
  • Active sites

27
Sequence profiles
  • In conventional alignment, a scoring matrix
    (BLOSUM62) gives the score for matching two amino
    acids
  • In reality not all positions in a protein are
    equally likely to mutate
  • Some amino acids (active sites) are highly
    conserved, and the penalty for a mismatch must be
    very high
  • Other amino acids can mutate almost for free, and
    the penalty for a mismatch is lower than the
    BLOSUM penalty
  • Sequence profiles can capture this
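
A position-specific scoring matrix makes this concrete. The sketch below builds log-odds scores per alignment column from residue counts plus a pseudocount. It assumes a uniform background frequency and no sequence weighting, both simplifications; the function name and toy alignment are illustrative only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(n_cols):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(
            row[col] for row in alignment if row[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Column 0 is a fully conserved G; column 1 varies freely.
toy_alignment = ["GAK", "GSR", "GTE", "GLD", "GVQ", "GIN", "GMH", "GFW"]
pssm = build_pssm(toy_alignment, pseudocount=0.1)
print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2), round(pssm[1]["A"], 2))
```

In the conserved column, G scores strongly positive and every other residue strongly negative. In the variable column, each observed residue gets a mildly positive score.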

28
Sequence profiles
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASK
ISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHW
HGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTV
TIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFR
GEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------
IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFV
PTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWAD
GPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVD
PMERNTAGVP
Matching anything but G: large negative score
Anything can match
29
Sequence profiles
  • Align (BLAST) sequence against large sequence
    database (Swiss-Prot)
  • Select significant alignments and make profile
    (weight matrix) using techniques for sequence
    weighting and pseudo counts (see lecture on
    HMMs)
  • Use weight matrix to align against sequence
    database to find new significant hits
  • Repeat the previous two steps until a stop
    criterion is met
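
The iteration can be sketched as a loop that rebuilds the profile from the current hits and searches again until the hit set stops changing. This is a toy under strong assumptions: sequences are ungapped and of equal length, the "alignment score" is just the mean profile frequency of the sequence's residues, and the threshold is an arbitrary number. None of this is PSI-BLAST's actual scoring; all names are hypothetical.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seqs, pseudocount=0.5):
    """Per-column residue frequencies with pseudocounts."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in ALPHABET}
        )
    return profile


def profile_score(profile, seq):
    """Toy score: mean profile frequency of the residues in seq."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq)) / len(profile)


def iterative_search(query, database, threshold, max_rounds=5):
    """Repeat 'build profile, search database' until the hits converge."""
    hits = {query}
    for _ in range(max_rounds):
        profile = build_profile(sorted(hits))
        new_hits = {s for s in database if profile_score(profile, s) >= threshold}
        new_hits.add(query)
        if new_hits == hits:  # stop criterion: no change in the hit set
            break
        hits = new_hits
    return hits


database = ["ACDEY", "ACDKY", "WWWWW"]
print(sorted(iterative_search("ACDEF", database, threshold=0.11)))
```

In this toy, ACDKY is missed in the first round and picked up only after ACDEY has been added to the profile, which is the point of iterating.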

30
PDB-BLAST
  • Procedure
  • Build sequence profile by iterative PSI-BLAST
    search against a sequence database
  • Use profile to search database of proteins with
    known structure (PDB)

31
Transitive BLAST
  • Procedure
  • Find homologues to query (your) sequence
  • Find homologues to these homologues
  • Etc.
  • Can be implemented with e.g. BLAST or PSI-BLAST
  • Also known as Intermediate Sequence Search (ISS)
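
The procedure amounts to a breadth-first walk over "is a homologue of" links. In the sketch below, `find_homologues` is a hypothetical callback standing in for a BLAST or PSI-BLAST search that returns significant hits; here it is backed by a hard-coded toy dictionary. The depth limit is an added assumption, used to keep the search from drifting to unrelated sequences.

```python
from collections import deque


def transitive_search(query, find_homologues, max_depth=2):
    """Return {sequence_id: number of search steps from the query}."""
    depth_of = {query: 0}
    queue = deque([query])
    while queue:
        current = queue.popleft()
        if depth_of[current] == max_depth:
            continue  # do not expand beyond the allowed number of steps
        for hit in find_homologues(current):
            if hit not in depth_of:
                depth_of[hit] = depth_of[current] + 1
                queue.append(hit)
    return {seq: d for seq, d in depth_of.items() if seq != query}


# Toy homology links; a real run would call a search tool instead.
toy_links = {"query": ["A", "B"], "A": ["C"], "C": ["D"]}
found = transitive_search("query", lambda s: toy_links.get(s, []))
print(found)
```

Here C is reachable only through the intermediate sequence A, and D is left out because it lies three steps from the query.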

32
Example: Sequence profiles
  • Alignment of protein sequences 1PLC._ and 1GYC.A
  • E-value 1000
  • Profile alignment
  • Align 1PLC._ against Swiss-Prot
  • Make position specific weight matrix from
    alignment
  • Use this matrix to align 1PLC._ against 1GYC.A
  • E-value

33
Sequence profiles
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)

1PLC._  3  ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS  56
1GYC.A 26  ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP  79
Midline (column positions not preserved): F G N G

1PLC._ 57  MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV  98
1GYC.A 80  AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Midline (column positions not preserved): A G F G G G V

Rmsd = 3.3 Å. Model: red. Template: blue.
34
Including structure
  • Sequences within a protein superfamily share only
    remote sequence homology, but they share high
    structural similarity
  • Structure is known for template
  • Predict structural properties for query
  • Secondary structure
  • Surface exposure

35
Using structure
  • Sequence and structure profile-profile based
    alignments
  • Template profiles
  • Multiple structure alignments
  • Sequence based profiles
  • Query profile
  • Sequence based profile
  • Predicted secondary structure
  • Position specific gap penalties derived from
    secondary structure

36
Structure biased alignment (3D-PSSM)
http://www.sbg.bio.ic.ac.uk/3dpssm/
37
Threading
Alignment score from structural fitness (pair
potential). How well does K fit the environment at P6?
If P8 is acidic then fine; if P8 is basic then poor.
[Figure: sequence .. A T N L Y K E T L .. threaded onto template positions 1-10, with deletions and an insertion marked]
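
The K-at-P6 question can be written as a small scoring function. The sketch below uses a hypothetical toy pair potential based only on charge (opposite charges favourable, like charges unfavourable, everything else neutral). Real threading potentials are derived statistically from known structures and cover all residue pairs. The contact list and position numbers are made up for illustration.

```python
ACIDIC = set("DE")
BASIC = set("KR")


def charge(residue):
    """Crude side-chain charge: -1 acidic, +1 basic, 0 otherwise."""
    if residue in ACIDIC:
        return -1
    if residue in BASIC:
        return 1
    return 0


def pair_energy(res_a, res_b):
    """Toy pair potential (lower is better); values are hypothetical."""
    return float(charge(res_a) * charge(res_b))


def threading_energy(residue_at, contacts):
    """Sum pair energies over template contacts.

    residue_at: {template position: residue placed there by the alignment}
    contacts:   iterable of (position, position) pairs close in 3D
    Positions left empty by the alignment (deletions) are skipped.
    """
    total = 0.0
    for p, q in contacts:
        if p in residue_at and q in residue_at:
            total += pair_energy(residue_at[p], residue_at[q])
    return total


contacts = [(6, 8)]  # assumption: P6 and P8 are neighbours in the template
print(threading_energy({6: "K", 8: "E"}, contacts))  # acidic partner
print(threading_energy({6: "K", 8: "R"}, contacts))  # basic partner
```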
38
Threading
  • Threading does not work
  • The average protein does not exist
  • Threading can be used in combination with
    sequence profiles and local structural features
    to improve alignment

39
CASP
  • CASP
  • Critical Assessment of Structure Prediction
  • Held every second year
  • Sequences from about-to-be-solved structures are
    given to groups, who submit their predictions
    before the structure is published
  • Modelers make predictions
  • Meeting in December where the correct answers are
    revealed

40
CASP5 overview
41
Successful fold recognition groups at CASP5
  • 3D-Jury (Leszek Rychlewski)
  • 3D-CAM (Krzysztof Ginalski)
  • Template recombination (Paul Bates)
  • HMAP (Barry Honig)
  • PROSPECT (Ying Xu)
  • ATOME (Gilles Labesse)

42
Democratic homology modeling
  • Let the silent majority rule
  • The highest-scoring hit will often be wrong
  • Many prediction methods will have the correct
    fold among the top 10-20 hits
  • If many different prediction methods all have
    the same fold among their top hits, this fold is
    probably correct
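
The voting idea fits in a few lines. The sketch below counts, for each fold, how many methods list it among their top hits; the method names, fold labels, and the cut-off are all made up for illustration, and real consensus servers use more refined scoring than a plain vote.

```python
from collections import Counter


def consensus_folds(ranked_hits_by_method, top_n=10):
    """Return (fold, votes) pairs, most supported fold first.

    ranked_hits_by_method: {method name: list of folds, best hit first}
    Each method gives at most one vote per fold.
    """
    votes = Counter()
    for hits in ranked_hits_by_method.values():
        votes.update(set(hits[:top_n]))
    return votes.most_common()


toy_results = {
    "method_A": ["fold1", "fold7", "fold3"],
    "method_B": ["fold2", "fold3", "fold9"],
    "method_C": ["fold5", "fold3", "fold1"],
}
print(consensus_folds(toy_results)[0])
```

In this toy, fold3 is never any method's top hit, yet it is the only fold all three methods agree on.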

43
3D-Jury (Rychlewski)
  • Inspired by ab initio modeling methods
  • The average of frequently obtained low-energy
    structures is often closer to the native
    structure than the lowest-energy structure
  • Find the most abundant high-scoring model in a
    list of predictions from several predictors
  • Use output from a set of servers
  • Superimpose all pairs of structures
  • Similarity score S_ij = number of Cα pairs within
    3.5 Å (S_ij = 0 if the count is below 40)
  • 3D-Jury score S_i = Σ_j S_ij / (N + 1)
  • Similar methods developed by A. Elofsson (Pcons)
    and D. Fischer (3D shotgun)
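
The scoring step can be sketched once the pairwise superpositions are done. The code below assumes the number of Cα pairs within 3.5 Å has already been computed for every pair of models and is passed in as a matrix; the superposition itself is not shown. It also assumes N is the number of other models compared against, and that counts below 40 are set to zero. The function name and toy counts are illustrative only.

```python
def jury_scores(pair_counts, min_pairs=40):
    """Return one consensus score per model.

    pair_counts[i][j]: number of C-alpha pairs within 3.5 A after
    superimposing model i on model j (assumed precomputed, symmetric).
    """
    n_models = len(pair_counts)
    scores = []
    for i in range(n_models):
        total = 0
        considered = 0
        for j in range(n_models):
            if i == j:
                continue
            considered += 1
            count = pair_counts[i][j]
            if count >= min_pairs:  # weaker matches contribute nothing
                total += count
        scores.append(total / (considered + 1))
    return scores


# Toy counts for three models; model 0 resembles both of the others.
toy_counts = [
    [0, 80, 60],
    [80, 0, 30],
    [60, 30, 0],
]
scores = jury_scores(toy_counts)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 1) for s in scores])
```

The model picked is the one most similar to the rest of the pool, not necessarily the one any single server ranked highest.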

44
LiveBench
  • The Live Bench Project is a continuous
    benchmarking program. Every week, sequences of
    newly released PDB proteins are submitted to
    participating fold recognition servers. The
    results are collected and continuously evaluated
    using automated model assessment programs. A
    summary of the results is produced after several
    months of data collection. The servers must delay
    the updating of their structural template
    libraries by one week to participate.

45
Meta prediction server
  • Web interface to a list of public protein
    structure prediction servers
  • Submit query sequence to all selected servers in
    one go
  • http://bioinfo.pl/meta/

46
Meta Server
47
Meta Server
48
3D Jury
49
188 targets in total. Threshold for 5 false positives: 50 for 3D Jury
50
Links to fold recognition servers
  • Databases of links
  • http://bioinfo.pl/meta/servers.html
  • http://mmtsb.scripps.edu/cgi-bin/renderrelres?protmodel
  • Meta server
  • http://bioinfo.pl/meta/
  • 3D-PSSM (good graphical output)
  • http://www.sbg.bio.ic.ac.uk/servers/3dpssm/
  • GenTHREADER
  • http://bioinf.cs.ucl.ac.uk/psipred/
  • FUGUE2
  • http://www-cryst.bioc.cam.ac.uk/fugue/prfsearch.html
  • SAM
  • http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99-query.html
  • FOLD
  • http://fold.doe-mbi.ucla.edu/
  • FFAS/PDBBLAST
  • http://bioinformatics.burnham-inst.org/

51
From fold to structure
  • Flying to the moon has not made man conquer
    space
  • Finding the right fold does not allow you to make
    accurate protein models
  • Can allow prediction of protein function
  • Alignment is still a very hard problem
  • Most protein interactions are determined by the
    loops, and they are the least conserved parts of
    a protein structure

52
Ab initio protein modeling: modeling of new-fold
proteins
  • Only when everything else fails
  • Challenge
  • Close to impossible to model Nature's folding
    potential
  • Example

53
Challenge. Folding potential
  • New folds are in general constructed from a set
    of subunits, where each subunit is a part of a
    known fold.
  • The subunits are small compared to the overall
    fold of the protein. No objective function exists
    to guide the global packing of the subunits.

[Figure: Objective function; labels: sij 120aa, dij 6Å]
54
A way to solution
  • Glue the structure together piecewise from
    fragments.
  • Guide process by empirical potential (Potential
    of mean force)

[Figure labels: Fragments with correct local structure; Nature's potential; Empirical potential]
55
Examples (Rosetta web server): www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
[Figure panels: Rosetta prediction; Homology modeling]
56
Take home message
  • Identifying the correct fold is only a small step
    towards successful homology modeling
  • Do not trust ID or alignment score to identify
    the fold. Use P-values
  • Use sequence profiles and local protein structure
    to align sequences
  • Do not trust one single prediction method; use
    consensus methods (3D Jury)
  • Only if everything else fails, use ab initio
    methods