Why Is Sequence Comparison Useful - PowerPoint PPT Presentation

1
Why Is Sequence Comparison Useful?
Lipman, David (NIH/NLM/NCBI)
2
Almost 100 Trillion BLAST comparisons per quarter
(10/01)
3
Rapid similarity searches of nucleic acid and
protein data banks. Wilbur WJ, Lipman DJ. Proc
Natl Acad Sci U S A 1983 Feb;80(3):726-30.
With the development of large data banks of protein
and nucleic acid sequences, the need for
efficient methods of searching such banks for
sequences similar to a given sequence has become
evident. We present an algorithm for the global
comparison of sequences based on matching
k-tuples of sequence elements for a fixed k. The
method results in substantial reduction in the
time required to search a data bank when compared
with prior techniques of similarity analysis,
with minimal loss in sensitivity. The algorithm
has also been adapted, in a separate
implementation, to produce rigorous sequence
alignments. Currently, using the DEC KL-10
system, we can compare all sequences in the
entire Protein Data Bank of the National
Biomedical Research Foundation with a 350-residue
query sequence in less than 3 min and carry out a
similar analysis with a 500-base query sequence
against all eukaryotic sequences in the Los
Alamos Nucleic Acid Data Base in less than 2 min.
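The k-tuple idea in the abstract above can be illustrated with a minimal Python sketch. This is not the published implementation: it only counts exact k-tuple matches per diagonal (query position minus subject position), and the function names `ktuple_diagonal_hits` and `best_diagonal` and the default `k=2` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_hits(query, subject, k=2):
    """Count exact k-tuple matches on each diagonal (query_pos - subject_pos)."""
    # Index every k-tuple of the subject by its start positions.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    # Look up each query k-tuple and tally the diagonal of every hit.
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return diagonals


def best_diagonal(query, subject, k=2):
    """Return (offset, hit_count) for the diagonal with the most k-tuple hits."""
    diagonals = ktuple_diagonal_hits(query, subject, k)
    if not diagonals:
        return None
    return diagonals.most_common(1)[0]


if __name__ == "__main__":
    # Toy example: the subject carries the query shifted right by two positions.
    print(best_diagonal("ABCDEF", "XXABCDEF", k=2))
```

A diagonal with many hits marks a region worth aligning in detail; looking up k-tuples in an index avoids comparing every query position with every subject position, which is where the time saving comes from.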
4
Cancer Gene Meets Its Match (NY Times, July 3,
1983): a serendipitous computer search
Waterfield MD et al., Nature 1983 Jul 7;304(5921):35-39
Doolittle RF et al., Science 1983 Jul 15;221(4607):275-277

v-sis   6 QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAELDLNMTRSHSGGELESLARGK 65
          QGDPIPEELY MLS HSIRSFDDLQRLL GD G EDGAELDLNMTRSHSGGELESLARG
PDGF   10 QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSGGELESLARGR 69

v-sis  66 RSLGSLSVAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 125
          RSLGSL  AEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ
PDGF   70 RSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQ 129

v-sis 126 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCEIVAAARAVTRSPGTSQEQR 185
          CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCE VAAAR VTRSPG SQEQR
PDGF  130 CRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQR 189

v-sis 186 AKTTQSRVTIRTVRVRRPPKGKHRKCKHTHDKTALKETLGA 226
          AKT Q RVTIRTVRVRRPPKGKHRK KHTHDKTALKETLGA
PDGF  190 AKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA 230

V-sis and Platelet-Derived Growth Factor (PDGF)
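To put a number on how close the v-sis and PDGF sequences are, the short Python sketch below counts identical columns in the first alignment block from this slide (v-sis 6-65 against PDGF 10-69). The helper name `count_identities` is an assumption for illustration; the two sequences are copied from the slide.

```python
def count_identities(seq_a, seq_b):
    """Return (identical_columns, total_columns) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return identical, len(seq_a)


# First alignment block from the slide (no gaps in either sequence).
V_SIS_6_65 = (
    "QGDPIPEELYKMLSGHSIRSFDDLQRLLQGDSGKEDGAE"
    "LDLNMTRSHSGGELESLARGK"
)
PDGF_10_69 = (
    "QGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDGAELDLNMTRSHSG"
    "GELESLARGR"
)

if __name__ == "__main__":
    identical, total = count_identities(V_SIS_6_65, PDGF_10_69)
    print(f"{identical}/{total} identical ({100.0 * identical / total:.1f}%)")
```

Identity at this level over 60 residues is why the match was recognized at once; the src/kinase comparison on the next slide is far weaker and needs similarity scoring rather than identity counting.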
5
An earlier, more subtle discovery
Viral src gene products are related to the
catalytic chain of mammalian cAMP-dependent
protein kinase. Barker WC, Dayhoff MO. PNAS 1982
May;79(9):2836-2839

Query 113 YAAQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKR---VKGRTWT---LC 166
          Y    V     LHS      DLKP N LI  Q      DFG         GR
Sbjct 125 YSLDVVNGLLFLHSQSILHLDLKPANILISEQDVCKISDFGCSQKLQDLRGRQASPPHIG 184

Query 167 GTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVR 223
          GT    APEI          D    G     M     P     P       V    R
Sbjct 185 GTYTHQAPEILKGEIATPKADIYSFGITLWQMTTREVP-YSGEPQYVQYAVVAYNLR 240

  • Biology not Algorithms
  • - compare proteins, not DNA
  • must detect similar amino acids not just
    identities

6
How often would one find matches?
  • How many protein families would there be?

Unexpected similarities should be extremely rare.
7
Estimating number of protein families
8
Earliest Estimates of Number of Protein Families
- 1000
  • Zuckerkandl, E. (1974) Accomplissement et
    perspectives de la paleogenetique chimique. In:
    Ecole de Roscoff 1974, p. 69. Paris: CNRS.
  • The appearance of new structures and
    functions in proteins during evolution, J. Mol.
    Evol. 7, 1-57 (1975).
  • Dayhoff, M.O. (1974) Federation Proceedings 33,
    2314.
  • The origin and evolution of protein
    superfamilies, Fed.Proc. 35, 2132-2138 (1976).

9
Margaret Dayhoff
10
Atlas of Protein Sequence and Structure, Vol. 5,
Supplement 3 (1978) pg. 10
  • It has been estimated that in humans there are
    approximately 50,000 proteins of functional or
    medical importance. A landmark of molecular
    biology will occur when one member of each
    superfamily has been elucidated. At the present
    rate of 25 per year, this will take less than 15
    years.

11
Hubris, the Genome Project, and Protein Families
  • Chothia, C. (1992). One thousand families for the
    molecular biologist. Nature, 357, 543-544.

Green P, Lipman D, Hillier L, Waterston R,
States D, and Claverie JM (1993). Ancient
Conserved Regions in New Gene Sequences and the
Protein Databases. Science, 259, 1711-1716.
ACR: similarity detected between sequences from
distantly related organisms
12
1992: What new families do we get from the genome
projects?
13
Cumulative growth in number of proteins & number
of conserved domains (from Geer, L., Bryant, S.,
Ostell, J.)
[Figure: Protein Sequences (axis: Number of Proteins, 0 to 1.2 x 10^6) and Conserved Domain Families (axis: Families Hit, 0 to 100), plotted for 1960 to 2000]
14
Why so few families and why do they evolve
slowly?
  • Structural View
  • Thermodynamics: Finkelstein, AV, Why are the
    same protein folds used to perform different
    functions? FEBS Lett. 325, pp. 23-28 (1993)

15
Constraints Due To Biological Function May Be
More Important
  • Compare pairs of sequences from related classes
    of proteins
  • All sequences should at least share structural
    similarity
  • Divergence times for all sequences should be
    approximately the same
  • Sequences within a class share function but
    sequences between classes have differing function

[Diagram labels: prokaryotes, eukaryotes]
Degree of within-class similarity > between-class
similarity indicates the importance of constraints
due to biological function.
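The comparison described on this slide can be sketched in Python as follows. It is a toy illustration under stated assumptions: similarity is taken as plain fraction identity over pre-aligned sequences of equal length, and the names `identity` and `within_vs_between`, the class labels and the four short sequences are all hypothetical, not data from the talk.

```python
from itertools import combinations
from statistics import mean


def identity(seq_a, seq_b):
    """Fraction of identical positions in two equal-length aligned sequences."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def within_vs_between(sequences):
    """sequences maps name -> (class_label, aligned_sequence).

    Returns (mean within-class identity, mean between-class identity).
    """
    within, between = [], []
    for (_, (cls_a, seq_a)), (_, (cls_b, seq_b)) in combinations(sequences.items(), 2):
        score = identity(seq_a, seq_b)
        (within if cls_a == cls_b else between).append(score)
    return mean(within), mean(between)


# Hypothetical toy input: two classes, one prokaryotic and one eukaryotic
# member each. These sequences are made up for illustration only.
TOY = {
    "classA_prokaryote": ("A", "MKTAYIAKQR"),
    "classA_eukaryote": ("A", "MKTAYLAKQR"),
    "classB_prokaryote": ("B", "MRSGYIPEQK"),
    "classB_eukaryote": ("B", "MRSGYVPEQK"),
}

if __name__ == "__main__":
    w, b = within_vs_between(TOY)
    print(f"within-class: {w:.3f}  between-class: {b:.3f}")
```

If within-class similarity clearly exceeds between-class similarity although all pairs diverged at about the same time and share a fold, the extra conservation can be attributed to function rather than structure.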
16
Example from the Aminoacyl-tRNA synthetases
(aaRS) (from E. Koonin & Y. Wolf): essential
enzymes responsible for incorporation of amino
acids into proteins
  • Two unrelated classes of aaRS, each including 10
    aaRS related to each other
  • The last universal common ancestor (LUCA) of
    modern life forms already had at least 17 aaRS
  • The duplication leading to aaRS of different
    specificities must have occurred during a
    relatively short period of early evolution
  • The post-LUCA evolution of aaRS took much longer
    than the early phase when the specificities were
    established. However, the changes that occurred
    after the aaRS were locked in their specificities
    are small compared to the changes traced to the
    early phase

17
Orthologs (from S. Bryant)

18
Paralogs (from S. Bryant)

19
Example from the Aminoacyl-tRNA Synthetases
(aaRS) (from E. Koonin & Y. Wolf)
Exceptions: glutamine/glutamate, asparagine/aspartate,
tryptophan/tyrosine
20
How many human genes?
  • 80,000: Antequera F, Bird A, Number of CpG
    islands and genes in human and mouse, PNAS 90,
    11995-11999 (1993).
  • 120,000: Liang F et al., Gene Index analysis of
    the human genome estimates approximately 120,000
    genes, Nat. Genet. 25, 239-240 (2000).
  • 35,000: Ewing B, Green P, Analysis of expressed
    sequence tags indicates 35,000 human genes,
    Nat. Genet. 25, 232-234 (2000).
  • 28,000-34,000: Roest Crollius, H. et al.,
    Estimate of human gene number provided by
    genome-wide analysis using Tetraodon nigroviridis
    DNA sequence, Nat. Genet. 25, 235-238 (2000).
  • 41,000-45,000: Das M et al., Assessment of the
    Total Number of Human Transcription Units,
    Genomics 77, 71-78 (2001).
21
How many human genes with ACRs? (from S.
Resenchuk, T. Tatusov, L. Wagner, A. Souvorov)
12,245 characterized mRNAs from RefSeq
78% have ACR, i.e., hit outside vertebrates at E
< 10e-6 (9,496/12,245)
90% of these have corresponding GenomeScan
predictions which also have ACR (8,501/9,496)
20,245 GS models for entire human genome have ACR
15,573 GS models after correction for splitting
(20,245/1.3)
17,300 estimated human genes with ACRs
(15,573/.9)
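The arithmetic on this slide can be checked with a short Python sketch. All counts come from the slide; the function and parameter names are assumptions, as is the reading that the divisor .9 is the roughly 90% GenomeScan recovery rate from the line above it.

```python
def acr_gene_estimate(
    refseq_total=12_245,        # characterized mRNAs from RefSeq
    refseq_with_acr=9_496,      # of those, hits outside vertebrates
    genomescan_with_acr=8_501,  # of those, GenomeScan prediction also has ACR
    gs_models_with_acr=20_245,  # GenomeScan models with ACR, whole genome
    split_factor=1.3,           # correction for genes split into several models
    recovery_rate=0.9,          # assumed meaning of the slide's ".9"
):
    """Reproduce the slide's step-by-step estimate of human genes with ACRs."""
    merged_models = gs_models_with_acr / split_factor
    return {
        "fraction_refseq_with_acr": refseq_with_acr / refseq_total,
        "fraction_recovered_by_genomescan": genomescan_with_acr / refseq_with_acr,
        "models_after_split_correction": merged_models,
        "estimated_genes_with_acr": merged_models / recovery_rate,
    }


if __name__ == "__main__":
    for name, value in acr_gene_estimate().items():
        print(f"{name}: {value:,.3f}")
```

Dividing by the recovery rate scales the corrected model count up to account for genes that GenomeScan is assumed to miss.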
22
How many human genes?
17,303 estimated human genes with ACRs
Now use comparative genomics
17,303/.55 ≈ 31,500 Total Human Genes
More complicated than that!
23
Conservation, expression level, protein length,
exon number
23,600 revised est. human genes with ACRs
(15,573/.66)
43,000 upper bound on est. total human genes
(23,600/.55)
35,000 is a more reasonable bound with this approach
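The extrapolations on the last two slides follow one pattern: divide an observed count by the fraction of the whole it is believed to represent. A minimal Python sketch, assuming that .55 is the fraction of genes carrying an ACR and that .66 is a revised recovery rate replacing .9 (both readings are assumptions; the slides give only the divisors):

```python
def scale_up(observed_count, fraction_observed):
    """Estimate a total from a count that covers only a fraction of it."""
    if not 0 < fraction_observed <= 1:
        raise ValueError("fraction_observed must be in (0, 1]")
    return observed_count / fraction_observed


# First pass: genes with ACRs scaled to all genes.
first_pass_total = scale_up(17_303, 0.55)

# Revised: a lower assumed recovery rate raises the ACR gene count,
# and the rounded figure from the slide is scaled to all genes.
revised_acr_genes = scale_up(15_573, 0.66)
upper_bound_total = scale_up(23_600, 0.55)

if __name__ == "__main__":
    print(f"first-pass total genes:  {first_pass_total:,.0f}")
    print(f"revised genes with ACRs: {revised_acr_genes:,.0f}")
    print(f"upper-bound total genes: {upper_bound_total:,.0f}")
```

The results round to the slide figures of 31,500, 23,600 and 43,000, and show how strongly the total depends on the two assumed fractions.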
24
The relationship of protein conservation and
sequence length
  • Lipman DJ, Souvorov A, Koonin EV, Panchenko AR,
    Tatusova TA
  • BMC Evol Biol. 2002;2:20

25
Salmonella Set: 4279 proteins
26
[Figure: Archaeoglobus fulgidus, 2420 proteins. Axes: Number (0 to 100) against Length (0 to 1000)]
27
[Figure: legend: conserved, nonconserved, structural domains]
28
[Figure: legend: conserved, nonconserved, structural domains. Axis: Length]
29
[Figure: Human, 14538 proteins; legend: conserved, nonconserved, structural domains. Axes: Number (0 to 300) against Length (0 to 1000)]
30
[Figure: panels A and B; legend: conserved, nonconserved]
31
[Figure: Archaeoglobus fulgidus and Escherichia coli, contact density]
32
Acknowledgements
all my colleagues at NCBI and NIH