Title: How close are we to having seen all folds
1 How close are we to having seen all folds, or is ab initio structure prediction about to become irrelevant?
2 How many protein groups are out there?
3 Three levels of grouping
- overall structural similarity: fold in SCOP
- evolutionary clade of remote homologs: superfamily in SCOP
- the same species of proteins: cluster of orthologs
4 Published estimates vary a lot
            Low      High      Geo-mean
Folds       400      10,000    2,000
Families    1,000    50,000    7,000
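The geometric means quoted for the published estimates can be checked with a few lines of Python. This is only a sketch: the function name `geometric_mean` is ours, and the inputs are the lowest and highest estimates listed above.

```python
import math


def geometric_mean(values):
    """Return the geometric mean of a sequence of positive numbers."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs at least one positive value")
    # Average in log space, then transform back.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Lowest and highest published estimates, as listed on the slide.
folds = geometric_mean([400, 10_000])
families = geometric_mean([1_000, 50_000])

print(f"folds: {folds:.0f}")        # about 2000
print(f"families: {families:.0f}")  # about 7071, quoted as 7,000
```

The geometric mean suits estimates that span more than an order of magnitude, because it averages the exponents rather than the raw values.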
In 2000, we did our own estimate based on structure predictions for complete genomes and on finding a clustering threshold that gives the best random approximations to the data:
Wolf YI, Grishin NV, Koonin EV. 2000. Estimating the number of protein folds and families from complete genome data. J. Mol. Biol. 299(4): 897-905.
The method was PARAMETRIC, and it yielded 1000 folds and 5000 families.
5 Five years later, still interested in the question, we repeated the study:
Sadreyev RI, Grishin NV. 2006. Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds. BMC Struct. Biol. 6: 6.
This time we used a different, NON-PARAMETRIC, method.
6 Domain decomposition and structure prediction in the COG database
COG: Tatusov et al. (2000) NAR 28: 33. 4873 Clusters of Orthologous Groups of proteins from 43 complete genomes
We are interested in SCOP superfamilies and folds
Break into domains using ADDA: Heger, Holm (2003) JMB 328: 749
Structure prediction using PSI-BLAST, RPS-BLAST
SCOP: Murzin et al. (1995) JMB 247: 536
Sequence domain families: 13511 families
7 Yearly dynamics of superfamilies and folds
[Figure: number of superfamilies and folds in SCOP versus number of structurally characterized families; points labeled by year, 1995 to 2004]
8 ⟨m⟩ and ⟨n⟩ extrapolation to the total number of families
[Figure: ⟨m⟩ and ⟨n⟩ extrapolated to 13500 families; values marked on the plot: 2.4 and 2.6]
9 Estimates of the total numbers of superfamilies and folds in COG
[Figure: estimates marked at 4000 superfamilies and 1700 folds]
10 Estimates of the total numbers of superfamilies and folds in COG
[Figure: estimates marked at 4000 superfamilies and 1700 folds]
11 Families initially solved by SGI comprise an unbiased set
[Figure: frequency versus node degree (number of sequence connections a family forms with others). Series: all families (13511)]
12 Families initially solved by SGI comprise an unbiased set
[Figure: frequency versus node degree (number of sequence connections a family forms with others). Series: all families (13511); families solved by SGI (259)]
13 Families initially solved by SGI comprise an unbiased set
[Figure: frequency versus node degree (number of sequence connections a family forms with others). Series: all families (13511); families solved by SGI (259); families solved in 1995 (236); families solved in 2003 (341)]
14 A reasonable estimate is 1500 folds and 4000 superfamilies.
15 Linear extrapolation to the total number of families
[Figure: number of superfamilies and folds in SCOP versus number of structurally characterized families, extended linearly to 13500 families; points labeled 1995, 2004 and 2040]
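A linear extrapolation of this kind can be sketched as an ordinary least-squares line fit followed by evaluation at a target value. The function names `fit_line` and `extrapolate` are ours, and the numbers below are toy values chosen so the result is easy to verify by hand; they are not data from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(xs, ys, x_target):
    """Evaluate the fitted line at x_target."""
    slope, intercept = fit_line(xs, ys)
    return slope * x_target + intercept


# Toy points lying exactly on y = 2x + 1 (NOT values from the study).
toy_x = [1, 2, 3, 4]
toy_y = [3, 5, 7, 9]
print(extrapolate(toy_x, toy_y, 10))  # 21.0
```

In the setting of this slide, x would be the number of structurally characterized families, y the number of superfamilies or folds in SCOP at that point, and the target the total of 13500 families.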
17 Lessons from CASP targets
ShuoYong Shi, Lisa Kinch, Jimin Pei, Ruslan Sadreyev, and Nick V. Grishin
http://prodata.swmed.edu/CASP8
Howard Hughes Medical Institute, Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas
18 New fold: were there any?
NF (new fold): historic category in CASP
2008: where did the new folds go?
181 domains, 2 possibly new folds (about 1%)
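As a quick check of the proportion on this slide, 2 possibly new folds among 181 domains works out to roughly 1.1%. The helper name `percent` is ours.

```python
def percent(part, total):
    """Return part as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


DOMAINS = 181            # domains counted on the slide
POSSIBLY_NEW_FOLDS = 2   # possibly new folds counted on the slide

share = percent(POSSIBLY_NEW_FOLDS, DOMAINS)
print(f"{share:.1f}%")  # 1.1%
```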
19 New fold 1: N-domain of T0397
N-domain of T0397: 3d4r chain A, residues -7-82
20 First server models for T0397_1
First models for T0397_1: Gaussian kernel density estimation for GDT-TS scores of the first server models, plotted at various bandwidths (standard deviations). The GDT-TS scores are shown as a spectrum along the horizontal axis; each bar represents a first server model. The bars are colored green, gray and black for the top 10, the bottom 25 and the rest of the servers. The family of curves with varying bandwidth is shown. Bandwidth varies from 0.3 to 8.2 GDT-TS units with a step of 0.1, which corresponds to the color ramp from magenta through blue to cyan. Thicker curves (red, yellow-framed brown and black) correspond to bandwidths 1, 2 and 4, respectively.
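The family of curves described in this caption can be sketched in plain Python: one Gaussian kernel density estimate per bandwidth, with the bandwidth used as the kernel standard deviation. The function name `gaussian_kde`, the grid, and the scores below are our assumptions; the scores are made-up placeholders, not CASP results.

```python
import math


def gaussian_kde(scores, bandwidth, grid):
    """Density at each grid point: the average of Gaussian bumps
    (standard deviation = bandwidth) centered on the scores."""
    if not scores or bandwidth <= 0:
        raise ValueError("need at least one score and a positive bandwidth")
    norm = 1.0 / (len(scores) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in scores)
        for x in grid
    ]


# Bandwidths 0.3, 0.4, ..., 8.2 GDT-TS units, as stated in the caption.
bandwidths = [round(0.3 + 0.1 * i, 1) for i in range(80)]

# HYPOTHETICAL GDT-TS scores for illustration only.
scores = [12.5, 18.0, 21.3, 22.1, 25.7, 30.2, 41.9]

# GDT-TS ranges from 0 to 100; evaluate every 0.5 units.
step = 0.5
grid = [i * step for i in range(201)]

# One curve per bandwidth, keyed by the bandwidth value.
curves = {h: gaussian_kde(scores, h, grid) for h in bandwidths}
```

Small bandwidths resolve individual models as separate peaks, while large bandwidths merge them into broad groups, which is why plotting the whole family helps when judging how many clusters of models exist.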
21 Most similar: ferredoxin-like fold
Structure and topology diagrams of the ferredoxin fold, the fold closest to T0397_1
22 New fold 2: N-domain of T0496
N-domain of T0496: 3do9 chain A, residues 4-126
23 First server models for T0496_1
First models for T0496_1: Gaussian kernel density estimation for GDT-TS scores of the first server models, plotted at various bandwidths (standard deviations). The GDT-TS scores are shown as a spectrum along the horizontal axis; each bar represents a first server model. The bars are colored green, gray and black for the top 10, the bottom 25 and the rest of the servers. The family of curves with varying bandwidth is shown. Bandwidth varies from 0.3 to 8.2 GDT-TS units with a step of 0.1, which corresponds to the color ramp from magenta through blue to cyan. Thicker curves (red, yellow-framed brown and black) correspond to bandwidths 1, 2 and 4, respectively.
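For readers unfamiliar with the score on the horizontal axis, GDT-TS averages the percentages of residues whose model position lies within 1, 2, 4 and 8 Å of the target. The sketch below is a simplification: it assumes the per-residue distances come from one fixed superposition, whereas the real score searches over superpositions for each cutoff. The function name `gdt_ts` and the example distances are ours.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS from per-residue model-to-target distances (in Å).

    Assumes a single fixed superposition; residues missing from the
    model can be passed as float('inf') so they never count.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    # Count residues within each cutoff, then average the percentages.
    within = sum(sum(1 for d in distances if d <= c) for c in cutoffs)
    return 100.0 * within / (len(distances) * len(cutoffs))


# Made-up distances: one residue under each cutoff band, one beyond 8 Å.
example = [0.5, 1.5, 3.0, 6.0, 10.0]
print(gdt_ts(example))  # 50.0
```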
24 Most similar: RNase H fold
Structure and topology diagrams of the RNase H fold, the fold closest to T0496_1
25 Known folds: some predicted no better than new!
E.g. 1: T0460
First models for T0460: Gaussian kernel density estimation for GDT-TS scores of the first server models, plotted at various bandwidths (standard deviations). The GDT-TS scores are shown as a spectrum along the horizontal axis; each bar represents a first server model. The bars are colored green, gray and black for the top 10, the bottom 25 and the rest of the servers. The family of curves with varying bandwidth is shown. Bandwidth varies from 0.3 to 8.2 GDT-TS units with a step of 0.1, which corresponds to the color ramp from magenta through blue to cyan. Thicker curves (red, yellow-framed brown and black) correspond to bandwidths 1, 2 and 4, respectively.
26 T0460: very difficult target
Jumping through 20 NMR models of 2k4n
Cartoon diagram of T0460: 2k4n model 1, residues 1-52,67-10
27 T0460 is homologous to Nqo5
Cartoon diagram of NADH-quinone oxidoreductase: 2fug chain 5, residues 1-106
Cartoon diagram of T0460: 2k4n model 1, residues 1-52,67-10
28 Known folds: some predicted no better than new!
E.g. 2: C-domain of T0407
First models for T0407_2: Gaussian kernel density estimation for GDT-TS scores of the first server models, plotted at various bandwidths (standard deviations). The GDT-TS scores are shown as a spectrum along the horizontal axis; each bar represents a first server model. The bars are colored green, gray and black for the top 10, the bottom 25 and the rest of the servers. The family of curves with varying bandwidth is shown. Bandwidth varies from 0.3 to 8.2 GDT-TS units with a step of 0.1, which corresponds to the color ramp from magenta through blue to cyan. Thicker curves (red, yellow-framed brown and black) correspond to bandwidths 1, 2 and 4, respectively.
29 Date: Mon, 2 Jun 2008 23:56:39 -0500 (CDT)
From: Nick Grishin <grishin_at_chop.swmed.edu>
To: David Baker <dabaker_at_u.washington.edu>
Cc: Ruslan Sadreyev <sadreyev_at_chop.swmed.edu>, Robert M Vernon <rvernon_at_u.washington.edu>
Subject: Re: C-terminus of T0407
I liked IG because of 1) length, 2) 7 strands, 3) many IG are interaction domains in enzymes. These are very compelling reasons.
30 T0407_2 has an immunoglobulin fold
Cartoon diagram of T0407, C-domain: 3e38 chain A, residues 277-363
Cartoon diagram of VAP-A MSP homology domain: 3z9l
31 No server predicted the IG fold for T0407_2
Cartoon diagram of T0407, C-domain: 3e38 chain A, residues 277-363
Top GDT server model: Phyre_de_novo TS1
IG-based Baker model
32 Summary
1. New folds constitute less than 2% of newly solved non-redundant structures.
2. Many known folds cannot be predicted because templates are impossible to find.
3. A good way of finding a template is to do ab initio prediction and compare the results to known structures.
33 Acknowledgement
Our group: Shuoyong Shi, Jing Tong, Ruslan Sadreyev, Lisa Kinch, Jimin Pei, Ming Tang, Sasha Safronova, Yuan Qi, Hua Cheng, Jamie Wrabl, Indraneel Majumdar, Erik Nelson, Yong Wang, S. Sri Krishna, Bong-Hyun Kim, Dorothee Staber
Collaborators: David Baker (U. Washington), Kimmen Sjölander (UC Berkeley), William Noble (U. Washington)
HHMI, NIH, UTSW, The Welch Foundation