Title: Why Is Sequence Comparison Useful
1. Why Is Sequence Comparison Useful?
Lipman, David (NIH/NLM/NCBI)
2. Almost 100 Trillion BLAST comparisons per quarter (10/01)
3. Rapid similarity searches of nucleic acid and protein data banks.
Wilbur WJ, Lipman DJ. Proc Natl Acad Sci U S A 1983 Feb;80(3):726-30.
With
the development of large data banks of protein
and nucleic acid sequences, the need for
efficient methods of searching such banks for
sequences similar to a given sequence has become
evident. We present an algorithm for the global
comparison of sequences based on matching
k-tuples of sequence elements for a fixed k. The
method results in substantial reduction in the
time required to search a data bank when compared
with prior techniques of similarity analysis,
with minimal loss in sensitivity. The algorithm
has also been adapted, in a separate
implementation, to produce rigorous sequence
alignments. Currently, using the DEC KL-10
system, we can compare all sequences in the
entire Protein Data Bank of the National
Biomedical Research Foundation with a 350-residue
query sequence in less than 3 min and carry out a
similar analysis with a 500-base query sequence
against all eukaryotic sequences in the Los
Alamos Nucleic Acid Data Base in less than 2 min.
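
The k-tuple idea in the abstract can be illustrated with a small sketch. This is not the published algorithm, only a minimal version of its core step under simplifying assumptions: index every k-tuple of one sequence, look up the k-tuples of the other, and count matches per diagonal (offset between the two sequences). The function names and the choice k=2 are illustrative; the target string is a fragment of the v-sis sequence shown on the next slide.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map every k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_counts(query, target, k):
    """Count shared k-tuples on each diagonal (target offset minus query offset)."""
    index = ktuple_index(target, k)
    counts = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        for t_pos in index.get(query[q_pos:q_pos + k], ()):
            counts[t_pos - q_pos] += 1
    return dict(counts)


def best_diagonal(query, target, k=2):
    """Return (diagonal, count) for the diagonal with the most k-tuple matches."""
    counts = diagonal_counts(query, target, k)
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


query = "EPAMIAECK"
target = "RSLGSLSVAEPAMIAECKTRTEVF"
print(best_diagonal(query, target, k=2))
```

Because only exact k-tuple hits are looked up in a precomputed index, most of the comparison matrix is never visited, which is where the speed-up over full dynamic programming comes from.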
4. Cancer Gene Meets Its Match (NY Times, July 3, 1983): a serendipitous computer search
Waterfield MD et al., Nature 1983 Jul 7;304(5921):35-39
Doolittle RF et al., Science 1983 Jul 15;221(4607):275-277
v-sis   6 QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAELDLNMTRSHSGGELESLARGK 65
          QGDPIPEELY+MLS HSIRSFDDLQRLL GD G+EDGAELDLNMTRSHSGGELESLARG+
PDGF   10 QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSGGELESLARGR 69

v-sis  66 RSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 125
          RSLGSL++AEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ
PDGF   70 RSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 129

v-sis 126 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCEIVAAARAVTRSPGTSQEQR 185
          CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCE VAAAR VTRSPG SQEQR
PDGF  130 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQR 189

v-sis 186 AKTTQSRVTIRTVRVRRPPKGKHRKCKHTHDKTALKETLGA 226
          AKT Q+RVTIRTVRVRRPPKGKHRK KHTHDKTALKETLGA
PDGF  190 AKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA 230

V-sis and Platelet-Derived Growth Factor (PDGF)
5. An earlier, more subtle discovery
Viral src gene products are related to the catalytic chain of mammalian cAMP-dependent protein kinase.
Barker WC, Dayhoff MO. PNAS 1982 May;79(9):2836-2839

Query 113 YAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKR---VKGRTWT---LC 166
          Y+  +V    +LHS  +++ DLKP N+LI +Q   +++DFG +++   ++GR  +   +
Sbjct 125 YSLDVVNGLLFLHSQSILHLDLKPANILISEQDVCKISDFGCSQKLQDLRGRQASPPHIG 184

Query 167 GTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR 223
          GT  + APEI+  +      D ++ G+ +++M     P ++ +P  +   +V+  +R
Sbjct 185 GTYTHQAPEILKGEIATPKADIYSFGITLWQMTTREVP-YSGEPQYVQYAVVAYNLR 240

- Biology, not Algorithms
  - compare proteins, not DNA
  - must detect similar amino acids, not just identities
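
A rough sketch of what "similar, not just identical" means in practice: the comparison line under an alignment can mark identities with the residue letter and conservative substitutions with `+`. The residue groups below are an illustrative assumption, not the scoring scheme used for the slide; a real search would use a substitution matrix. The two fragments are taken from the kinase alignment above.

```python
# Illustrative residue groups (an assumption for this sketch);
# a real search would score pairs with a substitution matrix instead.
SIMILAR_GROUPS = [
    set("ILVM"),   # aliphatic / hydrophobic
    set("FYW"),    # aromatic
    set("KRH"),    # basic
    set("DE"),     # acidic
    set("STNQ"),   # polar
    set("AG"),     # small
]


def midline(a, b):
    """Build a comparison line: letter for identity, '+' for similar, ' ' otherwise."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for x, y in zip(a, b):
        if x == y and x != "-":
            out.append(x)
        elif any(x in g and y in g for g in SIMILAR_GROUPS):
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)


query = "DLKPENLLI"
sbjct = "DLKPANILI"
line = midline(query, sbjct)
identities = sum(c.isalpha() for c in line)
positives = identities + line.count("+")
print(line)
print(identities, positives)
```

Counting positives as well as identities is what lets a distant relationship stand out when strict identity alone is low.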
6. How often would one find matches?
- How many protein families would there be?
Unexpected similarities should be extremely rare.
7. Estimating number of protein families
8. Earliest Estimates of Number of Protein Families
- 1000
  - Zuckerkandl, E. (1974) Accomplissement et perspectives de la paleogenetique chimique. In Ecole de Roscoff 1974, p. 69. Paris: CNRS.
  - The appearance of new structures and functions in proteins during evolution, J. Mol. Evol. 7, 1-57 (1975).
  - Dayhoff, M.O. (1974) Federation Proceedings 33, 2314.
  - The origin and evolution of protein superfamilies, Fed. Proc. 35, 2132-2138 (1976).
9. Margaret Dayhoff
10. Atlas of Protein Sequence and Structure, Vol. 5, Supplement 3 (1978), p. 10
- It has been estimated that in humans there are
approximately 50,000 proteins of functional or
medical importance. A landmark of molecular
biology will occur when one member of each
superfamily has been elucidated. At the present
rate of 25 per year, this will take less than 15
years.
11. Hubris, the Genome Project, and Protein Families
- Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543-544.
- Green P, Lipman D, Hillier L, Waterston R, States D, and Claverie JM (1993). Ancient Conserved Regions in New Gene Sequences and the Protein Databases. Science, 259, 1711-1716.
ACR (Ancient Conserved Region): similarity detected between sequences from distantly related organisms
12. 1992: What new families do we get from the genome projects?
13. Cumulative growth in number of proteins and number of conserved domains (from Geer, L., Bryant, S., Ostell, J.)
[Figure: cumulative growth by year, 1960 to 2000. Axes: Number of Proteins (0 to 1.2 x 10^6), Families Hit (0 to 100). Series: Protein Sequences, Conserved Domain Families.]
14. Why so few families and why do they evolve slowly?
- Structural View
- Thermodynamics: Finkelstein AV, Why are the same protein folds used to perform different functions? FEBS Lett. 325, 23-28 (1993)
15. Constraints Due To Biological Function May Be More Important
- Compare pairs of sequences from related classes of proteins
- All sequences should at least share structural similarity
- Divergence times for all sequences should be approximately the same
- Sequences within a class share function, but sequences between classes have differing function
[Diagram labels: prokaryotes, eukaryotes]
The degree to which within-class similarity exceeds between-class similarity indicates the importance of constraints due to biological function.
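
The comparison described on this slide can be sketched as a small calculation: average the pairwise similarity inside each class, average it across classes, and compare the two. The sketch below uses percent identity over pre-aligned strings as a stand-in for whatever similarity measure was actually used, and the sequences are hypothetical toy data, not real proteins.

```python
from itertools import combinations, product


def percent_identity(a, b):
    """Percent of aligned positions (equal-length strings) with identical residues."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def within_vs_between(classes):
    """Return (mean within-class identity, mean between-class identity)."""
    within = [
        percent_identity(a, b)
        for members in classes.values()
        for a, b in combinations(members, 2)
    ]
    between = [
        percent_identity(a, b)
        for (_, m1), (_, m2) in combinations(classes.items(), 2)
        for a, b in product(m1, m2)
    ]
    return mean(within), mean(between)


# Hypothetical toy data: e.g. one prokaryotic and one eukaryotic member per class.
toy = {
    "class_A": ["MKVLAGDE", "MKVLSGDE"],
    "class_B": ["MRILTGNQ", "MRILTGNK"],
}
w, b = within_vs_between(toy)
print(w, b)
```

If all sequences share a fold and similar divergence times, a large gap between the two averages points to functional constraint rather than structure alone.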
16. Example from the Aminoacyl-tRNA synthetases (aaRS) (from E. Koonin & Y. Wolf): essential enzymes responsible for incorporation of amino acids into proteins
- Two unrelated classes of aaRS, each includes 10
aaRS related to each other
- The last universal common ancestor (LUCA) of
modern life forms already had at least 17 aaRS
- The duplication leading to aaRS of different
specificities must have occurred during a
relatively short period of early evolution
- The post-LUCA evolution of aaRS took much longer
than the early phase when the specificities were
established. However, the changes that occurred
after the aaRS were locked in their specificities
are small compared to the changes traced to the
early phase
17. Orthologs (from S. Bryant)
18. Paralogs (from S. Bryant)
19. Example from the Aminoacyl-tRNA Synthetases (aaRS) (from E. Koonin & Y. Wolf)
Exceptions: glutamine/glutamate, asparagine/aspartate, tryptophan/tyrosine
20. How many human genes?
- 80,000: Antequera F & Bird A, Number of CpG islands and genes in human and mouse, PNAS 90, 11995-11999 (1993).
- 120,000: Liang F et al., Gene Index analysis of the human genome estimates approximately 120,000 genes, Nat. Genet. 25, 239-240 (2000)
- 35,000: Ewing B & Green P, Analysis of expressed sequence tags indicates 35,000 human genes, Nat. Genet. 25, 232-234 (2000)
- 28,000-34,000: Roest Crollius H et al., Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence, Nat. Genet. 25, 235-238 (2000).
- 41,000-45,000: Das M et al., Assessment of the Total Number of Human Transcription Units, Genomics 77, 71-78 (2001)
21. How many human genes with ACRs? (from S. Resenchuk, T. Tatusova, L. Wagner, A. Souvorov)
12,245 characterized mRNAs from RefSeq
78% have ACR, i.e., hit outside vertebrates at E < 10e-6 (9,496/12,245)
90% of these have corresponding GenomeScan predictions which also have ACR (8,501/9,496)
20,245 GS models for entire human genome have ACR
15,573 GS models after correction for splitting (20,245/1.3)
17,300 estimated human genes with ACRs (15,573/0.9)
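
The chain of corrections on this slide is plain arithmetic and can be checked step by step. The sketch below only reproduces the slide's numbers; the function and variable names are illustrative, and reading 0.9 as the rounded GenomeScan recovery rate is an interpretation of the slide, not something it states explicitly.

```python
def estimate_genes_with_acr(
    refseq_mrnas=12_245,         # characterized mRNAs from RefSeq
    refseq_with_acr=9_496,       # of those, hit outside vertebrates
    gs_confirmed=8_501,          # of those, GenomeScan prediction also has an ACR
    gs_models_with_acr=20_245,   # GenomeScan models with ACR, whole genome
    split_factor=1.3,            # slide's correction for split gene models
):
    """Reproduce the slide's chain of corrections step by step."""
    frac_with_acr = refseq_with_acr / refseq_mrnas
    gs_recovery = gs_confirmed / refseq_with_acr
    merged_models = gs_models_with_acr / split_factor
    # The slide divides by 0.9, i.e. the recovery rate rounded to one decimal
    # (assumed interpretation).
    genes_with_acr = merged_models / round(gs_recovery, 1)
    return {
        "frac_with_acr": frac_with_acr,
        "gs_recovery": gs_recovery,
        "merged_models": merged_models,
        "genes_with_acr": genes_with_acr,
    }


result = estimate_genes_with_acr()
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Rounded, the four values agree with the 78%, 90%, 15,573 and roughly 17,300 quoted on the slide.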
22. How many human genes?
17,303 estimated human genes with ACRs
Now use comparative genomics
17,303/0.55 ≈ 31,500 total human genes
More complicated than that!
23. Conservation, expression level, protein length, exon number
23,600 revised est. human genes with ACRs (15,573/0.66)
43,000 upper bound on est. total human genes (23,600/0.55)
35,000 is a more reasonable bound with this approach
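
The two whole-genome estimates differ only in the divisor applied to the 15,573 merged GenomeScan models (0.9 on the earlier slide, 0.66 here) before scaling by 0.55. A short sketch makes the sensitivity to that one number visible. Variable names are illustrative, and treating 0.55 as the fraction of genes that have ACRs is an assumption about what the slide's divisor means.

```python
# Slide's divisor; interpreted here (an assumption) as the share of genes with ACRs.
ACR_FRACTION = 0.55


def total_genes(genes_with_acr, acr_fraction=ACR_FRACTION):
    """Scale a count of ACR-containing genes up to a whole-genome estimate."""
    return genes_with_acr / acr_fraction


merged_models = 15_573                          # GS models after the splitting correction
first_pass = total_genes(merged_models / 0.9)   # first estimate, divisor 0.9
revised_with_acr = merged_models / 0.66         # revised divisor from this slide
upper_bound = total_genes(revised_with_acr)

print(round(first_pass, -2), round(revised_with_acr, -2), round(upper_bound, -3))
```

Lowering the divisor from 0.9 to 0.66 moves the total from about 31,500 to about 43,000, which is why the slide treats the larger figure as an upper bound rather than a point estimate.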
24. The relationship of protein conservation and sequence length
- Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA. BMC Evol Biol. 2002; 2:20
25. Salmonella Set
4279 proteins
26. Archaeoglobus fulgidus
2420 proteins
[Figure: histogram of Number (0 to 100) against Length (0 to 1000).]
27. [Figure legend: conserved, nonconserved; label: Structural domains]
28. [Figure legend: conserved, nonconserved; label: Structural domains; axis: Length]
29. Human
14538 proteins
[Figure: histogram of Number (0 to 300) against Length (0 to 1000); legend: conserved, nonconserved; label: Structural domains]
30. [Figure: panels A and B; legend: conserved, nonconserved]
31. [Figure labels: Archaeoglobus fulgidus, Escherichia coli; Contact density]
32. Acknowledgements
all my colleagues at NCBI and NIH