Title: Evolution teaches to predict protein structure and function
1Evolution teaches to predict protein structure and function
- Burkhard Rost
- CUBIC Columbia University
- rost@columbia.edu
- http://www.columbia.edu/rost
- http://cubic.bioc.columbia.edu/
2Evolution teaches prediction
- Is Bioinformatics up to the data deluge?
- Sequence comparison: do we know what we do?
- conservation of structure and function
- Structure prediction: where are we today?
- How to learn from the evolutionary odyssey?
- secondary structure
- transmembrane proteins
- solvent accessibility
- Are 1D predictions useful?
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
3http://cubic.bioc.columbia.edu/
- Volker Eyrich
- Rajesh Nair
- Jinfeng Liu
- Dariusz Przybylski
- Yanay Ofran
- Henry Bigelow
- Kazimierz Wrzeszczynski
- Sven Mika
- Chien Peter Chen
- Burkhard Rost
- Miguel Andrade EMBL
- Sean O'Donoghue LION
- Andrej Sali, Marc Marti-Renom Rockefeller
- Alfonso Valencia, Florencio Pazos Madrid
- Michal Linial Jerusalem
- Claus Andersen Copenhagen
- Bastian Bruning Nijmegen
- Hepan Tan Columbia
- Trevor Siggers Columbia
- http://cubic.bioc.columbia.edu/
4CUBIC http://cubic.bioc.columbia.edu
Dariusz Przybylski
Trevor Siggers
Volker Eyrich
Murat Cokol
Jinfeng Liu
Hepan Tan
5The Data Deluge
Conclusion: Bioinformatics will have a hell of a problem
6Data Deluge: what do we want?
7Data Deluge: numbers
50 1.200.000 500.000 2000 17.000 800 35.000
8Data Deluge: what CAN we do?
9Data Deluge: what CAN we do?
Not much yet
10Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Sequence comparison: do we know what we do?
- conservation of structure and function
- Structure prediction: where are we today?
- How to learn from the evolutionary odyssey?
- secondary structure
- transmembrane proteins
- solvent accessibility
- Are 1D predictions useful?
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
11Dynamic programming: optimal alignment
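The slide names the technique without showing it. As a minimal sketch, assuming simple match/mismatch scores and a linear gap penalty (values picked for illustration, not taken from the talk), an optimal global alignment score can be computed by dynamic programming roughly like this:

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions; practical searches use
    substitution matrices and usually affine gap penalties.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("HEAGAWGHEE", "PAWHEAE"))
```

Every cell depends only on its three neighbours, so the best score is found in time proportional to the product of the two sequence lengths.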
12BLAST: fast matching of single words
13Profile-based comparison
14Zones
15Sequence -> Structure
- Sequence folds into unique structure S -> T
16Sequence -> Structure
- Sequence folds into unique structure S -> T
- Similar sequences fold into similar structures S S -> T
17Sequence -> Structure
- Sequence folds into unique structure S -> T
- Similar sequences fold into similar structures S S -> T
- Most sequences don't fold, at all S -> no T
18Twilight zone: false positives explode
B Rost 1999 Prot. Engin. 12, 85-94
19Significant sequence identity
B Rost 1999 Prot. Engin. 12, 85-94
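The curve behind this slide is not captured in the transcript. The sketch below writes down a length-dependent identity threshold as recalled from the cited paper (Rost 1999); the constants should be checked against the publication before use, and both function names are hypothetical.

```python
import math


def identity_threshold(length, n=0.0):
    """Percent identity required for an alignment over `length` residues.

    Constants as recalled from Rost 1999 (Prot. Engin. 12, 85-94); treat
    them as an assumption. `n` shifts the whole curve up or down.
    """
    if length <= 11:
        base = 100.0
    elif length > 450:
        base = 19.5
    else:
        base = 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return n + base


def percent_identity(aligned_a, aligned_b):
    """Return (percent identity, aligned length) over gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0, 0
    identical = sum(x == y for x, y in pairs)
    return 100.0 * identical / len(pairs), len(pairs)


if __name__ == "__main__":
    pid, length = percent_identity("AC-GT", "ACAGA")
    print(pid, length, pid >= identity_threshold(length))
```

Short alignments need far higher identity than long ones to count as significant, which is why a single fixed cut-off misleads.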
20Evolution did it!
B Rost 1999 Prot. Engin. 12, 85-94
21Similar sequence -> similar structure?
B Rost 1999 Prot. Engin. 12, 85-94
22Detecting true hits in Twilight zone
B Rost 1999 Prot. Engin. 12, 85-94
23Finding similar structures in Twilight zone
B Rost 1999 Prot. Engin. 12, 85-94
24Secure thresholds for BLAST
B Rost 1999 Prot. Engin. 12, 85-94
25Accuracy vs. coverage
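The slide's own definitions are not in the transcript. A common reading, assumed here, is that accuracy is the share of reported hits that are true and coverage is the share of all true relations that get reported. A sketch with hypothetical scored hits:

```python
def accuracy_and_coverage(hits, threshold, total_true):
    """Return (accuracy, coverage) in percent for hits scoring >= threshold.

    hits: list of (score, is_true) pairs; total_true: number of true
    relations that could have been found. Definitions are an assumption.
    """
    reported = [is_true for score, is_true in hits if score >= threshold]
    true_positives = sum(reported)
    accuracy = 100.0 * true_positives / len(reported) if reported else 0.0
    coverage = 100.0 * true_positives / total_true if total_true else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    # Hypothetical search results: (score, is the hit a true relation?)
    hits = [(90, True), (80, True), (70, False), (60, True), (50, False)]
    for threshold in (75, 55):
        print(threshold, accuracy_and_coverage(hits, threshold, total_true=4))
```

Lowering the threshold raises coverage and usually costs accuracy, which is the trade-off the plot traces.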
26BLAST is not enough ...
B Rost 1999 Prot. Engin. 12, 85-94
27Sequence Space Hopping
B Rost 1999 Prot. Engin. 12, 85-94
28Success through sequence space hopping
B Rost 1999 Prot. Engin. 12, 85-94
29Zones
30Profile-based database search
B Rost 2001 Structural Bioinformatics, in press
31Profile-based database search
32Profile-based database search
33Profile-based database search
34Profile-based database search
35Profile-based database search
36Zones
37Hypothetical distribution of similar structures
38FAKE DATA
39Midnight zone: real - random
AS Yang and B Honig 2000 J. Mol. Biol. 301, 679-689
B Rost 1997 Folding & Design 2, S19-S24
40Evolution into the Midnight zone
B Rost and S O'Donoghue 1998 EMBL preprint
41Protein structures evolved at random - almost
- average < 10%
- -> most pairs have random identity levels
- 3 - 4 anchor residues
- 4 billion years of evolution reached equilibrium
- rate of creating new structures slower than drift towards mean
- averages for convergent and divergent evolution similar
- convergent evolution may have been a major event
42Structure space
B Rost 1998 Structure 6, 259-263
43Gold-mine out of reach!
Percentage of pairs
44Conservation of function
Devos & Valencia 2000, Proteins, 41, pp. 98
45Conservation of EC number
46Conservation of EC number 2
47Conservation of EC number: BLAST
48Conservation in detail
49Accuracy vs. coverage: EC number
50Conservation of EC numbers
51Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Structure prediction: where are we today?
- How to learn from the evolutionary odyssey?
- secondary structure
- transmembrane proteins
- solvent accessibility
- Are 1D predictions useful?
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
52Notation: protein structure 1D, 2D, 3D
55Goal of structure prediction
Epstein & Anfinsen, 1961: sequence uniquely determines structure
INPUT: sequence
OUTPUT
56Protein structure prediction in reality
58Homology modelling/comparative modelling
- assumption: H and U have homologous 3D structures
- strategy: modelling of U based on H
59Protein structure prediction in reality
60Protein structure prediction in reality
Genome view
SWISS-PROT view
61Structure prediction for protein universe
62Improving prediction by waiting it out
1999
1995
1991
63Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- How to learn from the evolutionary odyssey?
- secondary structure
- transmembrane proteins
- solvent accessibility
- Are 1D predictions useful?
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
64Evolution did it!
B Rost 1999 Prot. Engin. 12, 85-94
68Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins
- solvent accessibility
- Are 1D predictions useful?
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
69Membrane prediction
70HTM prediction: waiting for database growth ...
1999
1996
1993
71Topology for membrane helical proteins
72PHDsec success on Poly-Valine
74Refine by dynamic programming on NN energy
75PHDhtm refine topology prediction
76PHDhtm on Poly-Valine
77Example IS representative
78To be or not to be (HTM)
79False positives: globular proteins
80Details PHDsec: Wrong alignment
- single sequences -> accuracy clearly lower
- sufficient information in multiple alignment
- many sequences
- diversity
- wrong alignment -> wrong prediction
ID           IDE  WSIM  IFIR  ILAS  JFIR  JLAS  LALI  NGAP  LGAP  LSEQ
ftsh_ecoli  1.00  1.00     1   644     1   644   644     0     0   644
ftsh_haein  0.76  0.84   256   635     1   380   380     0     0   381
ftsh_bacsu  0.50  0.62     3   630     6   637   623     6    14   637
ftsh_porpu  0.48  0.59     5   604     9   623   598     5    19   628
ftsh_lacla  0.46  0.57     1   638    12   695   635     7    52   695
ftsh_odosi  0.45  0.56     2   611     5   644   609     5    32   644
81Details PHDhtm: wrong for safe alignment
         ....,....1....,....2....,....
AA       MAKNLILWLVIAVVLMSVFQSFGPSESNG
OBS htm  HHHHHHHHHHHHHHHHHHHH
PHD htm
Rel htm  99999999999888889999999999999
82Details PHDhtm: correct for accurate alignment
         ....,....1....,....2....,....
AA       MAKNLILWLVIAVVLMSVFQSFGPSESNG
OBS htm  HHHHHHHHHHHHHHHHHHHH
PHD htm  HHHHHHHHHHH
Rel htm  88877651000000000001357899999
PHDRhtm  HHHHHHHHHHHHHHHHHH
PHDThtm  iiiiTTTTTTTTTTTTTTTTTTooooooo
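A per-residue state string such as the PHDThtm row is easier to use once collapsed into segments. The sketch below does that; reading i, T and o as inside, transmembrane and outside is an assumption about the output format, and the function name is hypothetical.

```python
from itertools import groupby


def segments(states):
    """Collapse a per-residue state string into (state, start, end) runs.

    Residue numbers are 1-based and inclusive.
    """
    result = []
    position = 1
    for state, run in groupby(states):
        length = len(list(run))
        result.append((state, position, position + length - 1))
        position += length
    return result


if __name__ == "__main__":
    # Topology row copied from the slide; the symbol meanings are assumed.
    topology = "iiiiTTTTTTTTTTTTTTTTTTooooooo"
    for state, start, end in segments(topology):
        print(state, start, end)
```

The same function works on the helix rows if the non-helix positions are kept as spaces or another filler character.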
83Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins: 10 -> 65, topo ok
- solvent accessibility
- Are 1D predictions useful?
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
84Defining residue solvent accessibility
86Evolution for accessibility prediction
- Detailed prediction problematic
- Significant gain by evolutionary information: in/out with > 75% accuracy!
87PHDacc: the un-g(l)ory details
- accuracy > 75% (two states: buried, exposed)
- distribution with 10
- stronger predictions more accurate
- WARNING: reliability index almost factor 2 too large for single sequences
- accuracy below average for intermediate state
- VERY dependent on alignment accuracy
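The two-state accuracy quoted on this slide can be made concrete with a small sketch. The 16% relative-accessibility cut-off between buried and exposed is an assumed, commonly used value that the slide does not state, and the function names are hypothetical.

```python
def to_two_states(relative_acc, cutoff=16.0):
    """Map relative accessibility values (percent) to 'b' or 'e'.

    'b' = buried, 'e' = exposed. The cutoff is an illustrative assumption.
    """
    return "".join("e" if acc >= cutoff else "b" for acc in relative_acc)


def two_state_accuracy(observed, predicted):
    """Percentage of residues whose predicted state equals the observed one."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    observed = to_two_states([0.0, 15.9, 16.0, 80.0])   # hypothetical values
    predicted = to_two_states([5.0, 10.0, 40.0, 12.0])  # hypothetical values
    print(observed, predicted, two_state_accuracy(observed, predicted))
```

Because the measure counts only the buried/exposed call, it hides how far a detailed accessibility value is off, which fits the slide's remark that detailed prediction is problematic.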
88Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins: 10 -> 65, topo ok
- solvent accessibility: 5 -> 75
- Are 1D predictions useful?
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
89Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins: 10 -> 65, topo ok
- solvent accessibility: 5 -> 75
- Are 1D predictions useful? Of course to experts
- sub-cellular localisation
- whole genomes
- 3D structure: threading
- floppy regions
95Shuttle into the nucleus
Cytoplasm
Nucleus
96How many NLS motifs in databases?
- ONE in PROSITE: bi-partite motif
Coverage
97Experimental NLS positive charges
98Experimental NLS more complicated
99In silico mutagenesis
100Increasing accuracy and coverage
Coverage
101Increasing accuracy and coverage
Coverage
102Increasing accuracy and coverage
Coverage
103Increasing accuracy and coverage
Coverage
104Increasing accuracy and coverage
Coverage
105Nuclear protein in proteomes
106Un-annotated nuclear proteins with NLS
- ATAXIN-1: GERGHGGG
- Breast Cancer type2 (Brc2): RIKKKQR
- Fibroblast Growth factor (fgf): KKRRRRR
- Brg1: ERKRRQ
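Scanning a sequence for motifs like these amounts to a substring search. The sketch below uses the four motif strings listed on the slide; the test sequence is made up for illustration and the function name is hypothetical.

```python
import re

# Motif strings as listed on the slide, keyed by the protein they came from.
NLS_MOTIFS = {
    "ATAXIN-1": "GERGHGGG",
    "Brc2": "RIKKKQR",
    "fgf": "KKRRRRR",
    "Brg1": "ERKRRQ",
}


def find_motif(sequence, motif):
    """Return 1-based start positions of all matches, overlaps included."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    return [m.start() + 1 for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    toy_sequence = "MAAKKRRRRRLLERKRRQ"  # invented sequence, not a real protein
    for name, motif in NLS_MOTIFS.items():
        print(name, motif, find_motif(toy_sequence, motif))
```

Exact matching is the simplest case; motifs that tolerate variation would need a regular expression or a profile in place of the literal string.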
107Using NLS to bind DNA
108DNA-binding predictions in proteomes
109Rotation @ CUBIC.bioc.columbia.edu
- want all cell-cycle proteins
- search in SWISS-PROT, PROSITE
- search literature
- build expert set of known
110Significant motifs
111Rotation @ CUBIC.bioc.columbia.edu
- want all cell-cycle proteins
- search in SWISS-PROT, PROSITE
- search literature
- build expert set of known
- choose unique subset
112Finding unique subsets of proteins
113Similar sequence -> similar structure?
B Rost 1999 Prot. Engin. 12, 85-94
114Rotation @ CUBIC.bioc.columbia.edu
- want all cell-cycle proteins
- search in SWISS-PROT, PROSITE
- search literature
- build expert set of known
- choose unique subset
- find motifs ... sorry, time ran out here!
115Retention signals in ER and Golgi
116Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins: 10 -> 65, topo ok
- solvent accessibility: 5 -> 75
- High-throughput success of predictions
- localisation: accessibility useful, but not enough!
- whole genomes
- 3D structure: threading
- floppy regions
118Family size
Axes: number of proteins in family, cumulative percentage of proteins
Series: Prokaryotes, Archaea, Aeropyrum pernix K1, Eukaryotes
119Structure prediction for protein universe
120Do we aim at getting one structure per fold?
- Structural proteomics: hunt for new folds
- Tough task for theory! -> Practice: Shrink complexes 14747 technicians!
- Can we avoid non-globular proteins?
- Can we prioritise aspects of function?
121Similar amino acid composition
122Inventory of life: membrane proteins
Eukaryotes
Prokaryotes
Archaea
123Number of membrane helices -> complexity?
124Membrane proteins: kingdoms invented different tricks
125The membrane LEGO
126Length of globular regions in membrane proteins
127Inventory of life: coiled-coil proteins
128Coiled-coil proteins: details
129Inventory of life: compartments
130Protein structure universe
131Distribution of protein length
132Bottleneck 5 money ...
- Goal: 500 in 5 years
- money: total of 25 M in 5 years
50,000,000,000 Lire
133What will we get?
- many new structures
- the machinery for structural genomics
- some weird structures ...
134Recipe to determine targets
- Is it a known structure?
- Is it similar to a known structure?
- Is it a membrane protein?
- Does it look like a known fold?
- Does it look like a globular protein?
- Is it a big family?
- Is it short (NMR)? Does it contain Met (MAD)?
135Alternative recipe to determine targets
- Do we have a crystal?
- Is it a known structure?
- Is it similar to a known structure?
136Reality check: the invaluable contribution of bioinformatics to target selection
137Target selection
138Priority classes
- Experimental feasibility
- Biophysical properties
- length
- presence of Methionine
- Bioinformatics criteria
- similarity to known structure
- family size
- functional annotation
- Functional genomics
139Target selection machinery
140Conclusions: Structural Genomics
- we get
- most major functional elements
- most structural scaffolds
- evolutionary links
- structure-based comparison
- high-throughput techniques
- we won't get
- complexes
- interaction between them
- particular structures
- when?
- 70% of the human genome by 2010-2015
- remainder: HTMs?
141Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins: 10 -> 65, topo ok
- solvent accessibility: 5 -> 75
- High-throughput success of predictions
- localisation: accessibility useful, but not enough!
- whole genomes: kingdoms differ in some respects!
- 3D structure: threading
- floppy regions
142Midnight zone STRONGLY populated
143What we are threading for
144Goals of fold recognition, threading, remote homology modelling
- Recognising similar fold(s) (entire proteins)
- Detecting remote homologies for fragments (part of protein)
- Align target and fold
- Remote homology modelling (prediction in 3D)
145Two paths to fold recognition
146TOPITS
147Prediction-based threading
148Example of remote sequence identity
149 30% correct first, better if stronger
150Other threading methods
- TOPITS is not the best!
- CASP: PredictionCenter.llnl.gov/content.html
- CAFASP: www.cs.bgu.ac.il/dfischer/CAFASP2/
- EVA: cubic.bioc.columbia.edu/eva/
- CUBIC links: cubic.bioc.columbia.edu/doc/links_index.html
151Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins: 10 -> 65, topo ok
- solvent accessibility: 5 -> 75
- High-throughput success of predictions
- localisation: accessibility useful, but not enough!
- whole genomes: kingdoms differ in some respects!
- threading: better than sequence alignment!
- floppy regions (NORS: no regular secondary structure)
152Long floppy regions
- less than 5% helix or strand over > 70 residues
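The criterion on this slide can be turned into a simple scan over a secondary-structure string. The sketch takes the helix/strand limit as a percentage of residues, which is an interpretation of the slide, and uses a window of 71 residues for "more than 70"; the function name and the H/E/L alphabet are assumptions.

```python
def find_nors(sec_str, window=71, max_fraction=0.05):
    """Find stretches with almost no regular secondary structure.

    Returns 1-based inclusive (start, end) regions formed by the union of
    all windows of `window` residues in which the fraction of helix (H) or
    strand (E) is below `max_fraction`.
    """
    n = len(sec_str)
    # prefix[i] = number of H or E residues in sec_str[:i]
    prefix = [0]
    for state in sec_str:
        prefix.append(prefix[-1] + (state in "HE"))
    regions = []  # [start, end) in 0-based coordinates
    for start in range(0, n - window + 1):
        regular = prefix[start + window] - prefix[start]
        if regular / window < max_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1][1] = max(regions[-1][1], end)  # extend region
            else:
                regions.append([start, end])
    return [(start + 1, end) for start, end in regions]


if __name__ == "__main__":
    # Invented example: helix, a long loop, then strand.
    example = "H" * 30 + "L" * 100 + "E" * 30
    print(find_nors(example))
```

Merging overlapping windows means a reported region can include a few helix or strand residues at its edges, as long as every contributing window stays under the limit.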
153Floppy loops between domains
Formate Dehydrogenase H (1aa6.pdb)
phiX174 virion (1al0F.pdb)
DNA-containing capsid of CPV (4dpv.pdb)
Isoamylase (1bf2.pdb)
154Floppy ends
Pyruvate:ferredoxin oxidoreductase (1b0pA.pdb)
Capsid protein of CPV (1b35C.pdb)
Hexon from adenovirus type 2 (1dhx.pdb)
Myeloperoxidase (1mhlA.pdb)
Aspartate aminotransferase (2aat.pdb)
Prothrombin fragment 2 (2hppP.pdb)
SH3 domain of PLC-gamma (1hsq.pdb)
Hydroxylase component of MMOH (1mtyB.pdb)
155Floppy-wrap
SH3 and adjacent ligand site (1awj.pdb)
Erythrocyte catalase (7cat.pdb)
GmDNV capsid protein (1dnx.pdb)
Cellulase (1tf4A.pdb)
Phosphoglycerate mutase (3pgm.pdb)
Carboxypeptidase T (1obr.pdb)
156Weirdoes
Extracellular domain of T beta RI (1tbi.pdb)
HIVZ2 Tat protein (1tac.pdb)
Plasminogen Kringle 4 (1krn.pdb)
Gene 5 DNA binding protein (2gn5.pdb)
Recombinant Kringle 5 domain (5hpg.pdb)
Aspartate Transcarbamoylase (9atc.pdb)
157Weirdoes are not alone!
158 10% of biomass weird!
159Length distribution of floppy regions
160Weirdoes functional!
161Yeast-2-hybrid interactions
162Evolution teaches prediction
- Bioinformatics up to the data deluge? NO, but work in progress!
- Know what we do? Some do, 30% over 100 residues!
- Where are we today? NO 3D prediction from sequence!
- Evolutionary odyssey applied
- secondary structure: 15 -> 76 10
- transmembrane proteins: 10 -> 65, topo ok
- solvent accessibility: 5 -> 75
- High-throughput success of predictions
- localisation: accessibility useful, but not enough!
- whole genomes: kingdoms differ in some respects!
- threading: better than sequence alignment!
- NORS: weirdoes not alone AND functional!
163Conclusions
- no prediction of 3D structure
- no prediction of function
- but quantum leap through using frozen knowledge from evolution and protein structures
- the data deluge floods bioinformatics
- the unsolved urgent problems are legion
- but it is still time to get it done
running BLAST is NOT all there is: the key is intelligent use of biological knowledge ...
164Thanksgiving
- Volker Eyrich Schrödinger, New York
- Chris Sander Whitehead, Boston
- Reinhard Schneider LION, Boston
- Alfonso Valencia CNB, Madrid
- Miguel Andrade EMBL, Heidelberg
- Séan O'Donoghue LION, Heidelberg
- Amos Bairoch SIB, Genève
- Michael Braxenthaler La Roche, New York
- Søren Brunak CBS, København
- Rita Casadio Univ. Bologna
- Antoine De Daruvar LION, Bordeaux
- David Eisenberg UCLA, Los Angeles
- Piero Fariselli Univ. Bologna
- Barry Honig Columbia, New York
- Tim Hubbard Sanger, Hinxton
- Michael Levitt Univ. Stanford
- Marc Marti-Renom Rockefeller, New York
- Andrej Sali Rockefeller, New York
- Michael Scharf Take 5, Heidelberg
.. in general
localisation
165Availability of methods
- email: PredictProtein@columbia.edu
- subject: HELP
- file
- WWW: http://cubic.bioc.columbia.edu/predictprotein/
- META: http://cubic.bioc.columbia.edu/predictprotein/submit_meta.html
- EVA: http://cubic.bioc.columbia.edu/eva
- CUBIC: http://cubic.bioc.columbia.edu/
Email address options protein name SEQWENCE