Title: Computational Analysis of Genome Sequences
1 Computational Analysis of Genome Sequences
Steven Salzberg, The Institute for Genomic Research (TIGR) and The Johns Hopkins University
2 The Genomics Revolution
- 1995: 1st genome (H. influenzae, TIGR)
- 1996: 1st eukaryote (S. cerevisiae)
- 2000: 29 complete microbial genomes
  - 22 in progress at TIGR
  - 50 in progress worldwide
- 3 complete eukaryotes
  - yeast, nematode, fruit fly
- 2 major projects in 2000
  - Human (3.3 billion bp)
  - Arabidopsis thaliana (125 million bp)
3 Genomes Completed at TIGR
Organism (genome size)                 Reference
Haemophilus influenzae (1.83 Mb)       Fleischmann et al., Science 269, 496-512 (1995).
Mycoplasma genitalium (0.58 Mb)        Fraser et al., Science 270, 397-403 (1995).
Methanococcus jannaschii (1.7 Mb)      Bult et al., Science 273, 1058-73 (1996).
Helicobacter pylori (1.6 Mb)           Tomb et al., Nature 388, 539-47 (1997).
Archaeoglobus fulgidus (2.1 Mb)        Klenk et al., Nature 390, 364-70 (1997).
Borrelia burgdorferi (1.5 Mb)          Fraser et al., Nature 390, 580-6 (1997).
Treponema pallidum (1.1 Mb)            Fraser et al., Science 281, 375-88 (1998).
Plasmodium falciparum chr2 (1 Mb)      Gardner et al., Science 282, 1126-32 (1998).
Thermotoga maritima (1.8 Mb)           Nelson et al., Nature 399, 323-9 (1999).
Deinococcus radiodurans (3.3 Mb)       White et al., Science 286, 1571-7 (1999).
Arabidopsis thaliana chr2 (19 Mb)      Lin et al., Nature 402, 761-8 (1999).
Neisseria meningitidis (2.3 Mb)        Tettelin et al., Science 287, 1809-15 (2000).
Chlamydia pneumoniae (1.2 Mb)          Read et al., Nucleic Acids Res 28, 1397-406 (2000).
Chlamydia trachomatis (1.0 Mb)         Read et al., Nucleic Acids Res 28, 1397-406 (2000).
Vibrio cholerae (4.0 Mb)               Heidelberg et al., Nature, in press.
Mycobacterium tuberculosis (4.4 Mb)    Fleischmann et al., manuscript in preparation
Streptococcus pneumoniae (2.2 Mb)      Tettelin et al., manuscript in preparation
Caulobacter crescentus (4.0 Mb)        Nierman et al., manuscript in preparation
Chlorobium tepidum (2.1 Mb)            Eisen et al., manuscript in preparation
Porphyromonas gingivalis (2.2 Mb)      Fleischmann et al., manuscript in preparation
4 Genomes in progress at TIGR
Organism (genome size)                     Funding source
Plasmodium falciparum chr 14 (3.4 Mb)      BWF/DoD
Plasmodium falciparum chr 10,11 (4 Mb)     NIAID/DoD
Trypanosoma brucei chr 2 (1 Mb)            NIAID
Enterococcus faecalis (3.0 Mb)             NIAID
Mycobacterium avium (4.4 Mb)               NIAID
Pseudomonas putida (6.2 Mb)                DOE
Shewanella putrefaciens (4.5 Mb)           DOE
Staphylococcus aureus (2.8 Mb)             NIAID, MGRI
Dehalococcoides ethenogenes (1.5 Mb)       DOE
Desulfovibrio vulgaris (3.2 Mb)            DOE
Thiobacillus ferrooxidans (2.9 Mb)         DOE
Chlamydia psittaci GPIC (1.2 Mb)           NIAID
Bacillus anthracis (5.0 Mb)                ONR/DOE/NIAID
Treponema denticola (3.0 Mb)               NIDR
C. hydrogenoformans (2.0 Mb)               DOE
Methylococcus capsulatus (4.6 Mb)          DOE
Geobacter sulfurreducens (4.0 Mb)          DOE
Wolbachia sp (Drosophila) (1.4 Mb)         NIH
Colwellia sp (1.0 Mb)                      DOE
Mycobacterium smegmatis (4.0 Mb)           NIAID
Staphylococcus epidermidis (2.5 Mb)        NIAID
Theileria parva (10 Mb)                    ILRI/TIGR
5 A Microbial Genome Sequencing Project
Random sequencing
Genome Assembly
Annotation
Data Release
Publication www.tigr.org
Sample tracking
6 Gene Finding
- Gene finding plays an ever-larger role in high-speed DNA sequencing projects
- There's no time for much else!
- 1000s of genes generated each month at a high-throughput sequencing facility
- Separate gene finders are needed for every organism
- Training on organism X, finding genes on Y, generates inferior results
- Bootstrapping problem: training data is hard to find
7 Open Reading Frames: 6 possibilities
Frame 1: TCG TAC GTA GCT AGC TAG CTA AGC ATG CAT CGA TCG ATC GAT
Frame 2: T CGT ACG TAG CTA GCT AGC TAA GCA TGC ATC GAT CGA TCG AT
Frame 3: TC GTA CGT AGC TAG CTA GCT AAG CAT GCA TCG ATC GAT CGA T
(identical sequence, read at three different offsets)
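The six possibilities are the three codon offsets on the given strand plus the three offsets on its reverse complement. A minimal sketch of that enumeration follows, using the example sequence from this slide. The function names are illustrative and are not taken from any TIGR software.

```python
# Illustrative sketch: enumerate the six reading frames of a DNA sequence.
# Function names are hypothetical, not part of GLIMMER.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete codons starting at offset 0, 1, or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Map frame labels (+1..+3 forward, -1..-3 reverse) to codon lists."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = codons(seq, offset)
        frames["-%d" % (offset + 1)] = codons(rc, offset)
    return frames


if __name__ == "__main__":
    example = "TCGTACGTAGCTAGCTAGCTAAGCATGCATCGATCGATCGAT"
    for label, frame in six_frames(example).items():
        print(label, " ".join(frame))
```

An ORF finder would then scan each codon list for a start codon followed by an in-frame stop codon.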
8 GLIMMER: A Microbial Gene Finder
- GLIMMER 2.0 released late 1999
- > 200 site licenses worldwide
- Works on bacteria, archaea, viruses too
- Malaria (eukaryotic) version: GLIMMERM
- Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999
- Web site and code
  - http://www.tigr.org/
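9 Uniform Markov Models
- Use conditional probability of a sequence position given previous k positions in the sequence.
- Fixed, kth-order model: bigger k's yield better models (as long as data is sufficient).
- Probability (score) of sequence s_1 s_2 s_3 ... s_n is
  $P(s_1 \ldots s_n) = \prod_{i=1}^{n} P(s_i \mid s_{i-k} \ldots s_{i-1})$
  (with shorter contexts for the first k positions)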
10 Uniform Markov Models
- Advantages
  - Easy to train. Count frequencies of (k+1)-mers in training data.
  - Easy to assign a score to a sequence.
- Disadvantages
  - (k+1)-mers can be undersampled, i.e., occur too infrequently in training data.
  - Models sequence as fixed-length chunks, which may not be the best model of biology.
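The "easy to train, easy to score" points can be made concrete with a short sketch: training counts (k+1)-mers, and scoring sums log conditional probabilities. The function names and the pseudocount smoothing are assumptions for illustration and do not describe GLIMMER's implementation.

```python
# Illustrative sketch of a fixed, kth-order Markov model for DNA.
# Names and the pseudocount smoothing are assumptions, not GLIMMER's code.
import math
from collections import Counter


def train(sequences, k):
    """Count every (k+1)-mer and its k-mer context in the training data."""
    kplus1 = Counter()
    context = Counter()
    for seq in sequences:
        for i in range(len(seq) - k):
            kplus1[seq[i:i + k + 1]] += 1
            context[seq[i:i + k]] += 1
    return kplus1, context


def log_score(seq, k, kplus1, context, pseudocount=1.0):
    """Sum of log P(s_i | previous k bases), skipping the first k positions."""
    total = 0.0
    for i in range(k, len(seq)):
        ctx = seq[i - k:i]
        numerator = kplus1[ctx + seq[i]] + pseudocount
        denominator = context[ctx] + 4 * pseudocount
        total += math.log(numerator / denominator)
    return total
```

With large k, many (k+1)-mers have counts near zero and the pseudocount dominates the estimate, which is the undersampling problem listed above.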
11 Interpolated Markov Models
- Use a linear combination of 8 different Markov chains, for example:
  $c_8 P(g \mid atcagtta) + c_7 P(g \mid tcagtta) + \cdots + c_1 P(g \mid a) + c_0 P(g)$
  where $c_0 + c_1 + \cdots + c_8 = 1$
- Equivalent to interpolating the results of multiple Markov chains
- Score of a sequence is the product of interpolated probabilities of bases in the sequence
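The linear combination can be sketched as a weighted sum of estimates that use progressively longer suffixes of the context. In this sketch the weights are fixed constants supplied by the caller and the add-one smoothing is an assumption; an actual IMM sets the weights per context from the training data.

```python
# Illustrative sketch: interpolate fixed-order estimates of P(base | context).
# The constant weights and the add-one smoothing are assumptions.
from collections import Counter


def count_kmers(sequences, max_order):
    """Count all substrings of length 1 .. max_order + 1."""
    counts = Counter()
    for seq in sequences:
        for length in range(1, max_order + 2):
            for i in range(len(seq) - length + 1):
                counts[seq[i:i + length]] += 1
    return counts


def order_probability(counts, context, base):
    """Smoothed estimate of P(base | context) from raw k-mer counts."""
    total = sum(counts[context + b] for b in "ACGT")
    return (counts[context + base] + 1) / (total + 4)


def interpolated_probability(counts, context, base, weights):
    """Weighted sum over context lengths 0 .. len(weights) - 1.

    weights[j] multiplies the estimate that uses the last j bases of context.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    prob = 0.0
    for j, weight in enumerate(weights):
        suffix = context[-j:] if j else ""
        prob += weight * order_probability(counts, suffix, base)
    return prob
```

Because each fixed-order estimate sums to 1 over the four bases and the weights sum to 1, the interpolated values also form a probability distribution.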
12 IMMs vs. Fixed-Order Models
- Performance
  - IMM should always do at least as well as fixed-order.
  - E.g., even if kth-order model is correct, it can be simulated by (k+1)st-order
  - Our results support this.
- IMM result can be used as fixed-order model.
- IMM slightly harder to train and uses more memory.
13 IMM Training
- Problem: How to determine the weights of all the thousands of k-mers?
- Traditionally done with E-M algorithm using cross-validation (deleted estimation).
- Slow.
- Overtraining can be a problem.
14 GLIMMER IMM Training
- Our approach assumes
  - Longer context is always better
  - Only reason not to use it is undersampling in training data.
- If sequence occurs frequently enough in training data, use it, i.e., λ = 1
- Otherwise, use frequency and χ² significance to set λ.
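The rule on this slide can be sketched as a function that returns λ for one context. The threshold of 400 occurrences, the 0.5 confidence cutoff, and the scaling in the final step are assumptions made for illustration; the GLIMMER papers cited earlier are the authority for the actual values. The χ² test compares the next-base counts after this context with the prediction from the next-shorter context.

```python
# Illustrative sketch of a GLIMMER-style weight (lambda) for one context.
# The threshold, cutoff, and scaling rule are assumptions for demonstration.
import math


def chi2_cdf_3df(x):
    """CDF of the chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 0.0
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def chi2_statistic(observed, expected_probs):
    """Pearson chi-square of observed next-base counts vs. expected frequencies."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_probs) if p > 0)


def context_weight(observed, shorter_probs, threshold=400):
    """Weight given to a context.

    observed: counts of A, C, G, T following this context.
    shorter_probs: next-base probabilities predicted by the shorter context.
    """
    n = sum(observed)
    if n >= threshold:
        return 1.0  # sampled often enough to trust fully
    if n == 0:
        return 0.0  # never seen: fall back to the shorter context
    confidence = chi2_cdf_3df(chi2_statistic(observed, shorter_probs))
    if confidence < 0.5:
        return 0.0  # counts are consistent with the shorter context
    return confidence * n / threshold
```

A context seen 40 times with a strongly skewed next-base distribution therefore receives a small but nonzero weight, while one whose counts match the shorter context receives none.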
15 How GLIMMER Works
- Three separate programs
  - long-orfs: automatically extracts long open reading frames that do not overlap other long orfs.
  - IMM model builder. Takes any kind of sequence data.
  - Gene predictor. Takes genome sequence and finds all the genes.
16 Gene Predictor
- Finds and scores entire ORFs.
- Uses 7 competing models: 6 reading frames plus random model.
- Score for an ORF is the probability that the right model generated it.
- 3-periodic Markov model
- High-scoring ORFs are then checked for overlaps.
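One way to read "the probability that the right model generated it" is as a normalized likelihood across the seven competing models. The sketch below assumes equal prior weight on every model and takes the per-model log-likelihoods as given; both are assumptions for illustration, not a description of GLIMMER's code.

```python
# Illustrative sketch: convert per-model log-likelihoods into a score for
# the in-frame model. Equal priors on all seven models are an assumption.
import math


def in_frame_probability(log_likelihoods, in_frame_index=0):
    """Posterior probability of one model given all models' log-likelihoods.

    log_likelihoods: seven values, e.g. six reading-frame models followed by
    a random (background) model, all evaluated on the same ORF.
    """
    peak = max(log_likelihoods)  # subtract the maximum for numerical stability
    weights = [math.exp(ll - peak) for ll in log_likelihoods]
    return weights[in_frame_index] / sum(weights)
```

Working in log space and subtracting the maximum avoids underflow, since the likelihood of a long ORF is a product of many small probabilities.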
17 Glimmer 2.0 IMM design
[Figure: context tree for the example window ATGCATGATCGAG (12 bp of context); each node tests one context position (Pos -1, Pos -2, Pos -3, Pos -4, ...) and branches on a, t, c, g; the tree is 8 levels deep]
18 Better Overlap Resolution
19 Better Overlap Resolution
20 GLIMMER 2.0's Performance
Organism          Genes Annotated   Annotated Genes Found   Additional Genes
H. influenzae     1738              1720 (99.0%)            250 (14%)
M. genitalium     483               480 (99.4%)             81 (17%)
M. jannaschii     1727              1721 (99.7%)            221 (13%)
H. pylori         1590              1550 (97.5%)            293 (18%)
E. coli           4269              4158 (97.4%)            824 (19%)
B. subtilis       4100              4030 (98.3%)            586 (14%)
A. fulgidus       2437              2404 (98.6%)            274 (11%)
B. burgdorferi    853               843 (99.3%)             62 (7%)
T. pallidum       1039              1014 (97.6%)            180 (17%)
T. maritima       1877              1854 (98.8%)            190 (10%)
21 GLIMMER 2.0 on known genes
Organism          Genes Annotated   Known Genes   Correct Predictions
H. influenzae     1738              1501          1496 (99.7%)
M. genitalium     483               478           476 (99.6%)
M. jannaschii     1727              1259          1256 (99.8%)
H. pylori         1590              1092          1084 (99.3%)
E. coli           4269              2656          2632 (99.1%)
B. subtilis       4100              1249          1231 (98.6%)
A. fulgidus       2437              1799          1786 (99.3%)
B. burgdorferi    853               601           600 (99.8%)
T. pallidum       1039              755           747 (98.9%)
T. maritima       1877              1504          1493 (99.3%)
Average                                           (99.3%)
22
- Speed
  - Training for 2 Megabase genome: < 1 minute (on a Pentium-450)
  - Find all genes in 2 Mb genome: < 1 minute
- Impact: GLIMMER was used for
  - B. burgdorferi (Lyme disease), T. pallidum (syphilis) (TIGR)
  - C. trachomatis (blindness, STD) (Berkeley/Stanford)
  - C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF)
  - T. maritima, D. radiodurans, M. tuberculosis, V. cholerae, S. pneumoniae, C. trachomatis, C. pneumoniae, N. meningitidis (TIGR)
  - X. fastidiosa (Brazilian consortium)
  - Plasmodium falciparum (malaria): GlimmerM
  - Arabidopsis thaliana (model plant): GlimmerM
  - Others: viruses, simple eukaryotes, more bacteria
23 Self-Similarity Scans
- Idea: analyze a whole genome by counting 3-mers in all 6 frames
- Analyze small windows (2000 bp, 10000 bp) using the same statistic
- Algorithm
  - Build model of entire sequence
  - Apply the χ² statistic to compare windows to the genome itself
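The two algorithm steps can be sketched as follows. Pooling the six frames into a single trinucleotide table (all offsets on both strands) and using non-overlapping windows are simplifying assumptions, and the function names are hypothetical.

```python
# Illustrative sketch of a self-similarity scan. Pooling the six frames into
# one 3-mer table and using non-overlapping windows are assumptions.
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trimer_counts(seq):
    """Count 3-mers at every offset on both strands (six frames pooled)."""
    counts = Counter()
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        for i in range(len(strand) - 2):
            counts[strand[i:i + 3]] += 1
    return counts


def chi2_against_genome(window_counts, genome_counts):
    """Pearson chi-square of a window's 3-mer counts vs. genome frequencies."""
    window_total = sum(window_counts.values())
    genome_total = sum(genome_counts.values())
    chi2 = 0.0
    for trimer, genome_count in genome_counts.items():
        expected = window_total * genome_count / genome_total
        chi2 += (window_counts[trimer] - expected) ** 2 / expected
    return chi2


def scan(genome, window=2000):
    """Return (start, chi2) for consecutive windows along the genome."""
    genome_counts = trimer_counts(genome)
    results = []
    for start in range(0, len(genome) - window + 1, window):
        counts = trimer_counts(genome[start:start + window])
        results.append((start, chi2_against_genome(counts, genome_counts)))
    return results
```

Windows whose χ² value stands far above the rest have a trinucleotide composition unlike the genome as a whole, which is the kind of signal used to flag candidate horizontally acquired regions.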
24 Haemophilus influenzae (meningitis)
[Figure: plots labeled χ² and GC]
25 Thermotoga maritima (hyperthermophile)
26 Vibrio cholerae (cholera)
27 On the other side of CTXφ prophage is a region
encoding an RTX toxin (rtxA) and its activator
(rtxC) and transporters (rtxBD). A third
transporter gene has been identified that is a
paralog of rtxB, and is transcribed in the same
direction as rtxBD. Downstream of this gene are
two genes encoding a sensor histidine kinase and
response regulator. Trinucleotide composition
analysis suggests that the RTX region was
horizontally acquired along with the sensor
histidine kinase/response regulator, suggesting
these regulators effect expression of the closely
linked RTX transcriptional units. --Heidelberg et
al., Nature, in press.
28 MUMmer
- Aligns 2 complete genomes
- Maximal Unique Matches
- Suffix trees
- Very fast alignment of very long DNA sequences
- Ref: Delcher et al., Nucl. Acids Res., 1999
- Software at
  - http://www.tigr.org/softlab
29 The Problem
- Efficiently compute alignments between long sequences to identify biologically interesting features.
  - E.g., two strains of M. tuberculosis, each 4.4 Mb
  - E.g., two versions of a genome at different stages of closure
- Compute alignment in less than 2 minutes
30 Maximal Unique Sequences
Sequences in genomes A and B that
- Occur exactly once in A and in B
- Are not contained in any larger such sequence
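The two conditions above can be checked directly on short strings. The brute-force sketch below is for illustration only: it runs in roughly cubic time, whereas MUMmer relies on suffix trees. The function names and the `min_length` parameter are assumptions.

```python
# Illustrative brute-force sketch for short sequences only. MUMmer itself
# uses suffix trees; this version just makes the definition concrete.


def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_length=3):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could still be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_length:
                continue
            match = a[i:i + length]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums
```

A match that cannot be extended in either direction and is unique in both sequences cannot sit inside a longer unique match, so it satisfies both conditions.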
31 Select the longest consistent set of MUMs
- Occur in the same order in A and B
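Selecting MUMs that occur in the same order in both genomes can be sketched as a longest increasing subsequence problem: sort the MUMs by position in A, then find the longest run whose positions in B also increase. Maximizing the number of MUMs (rather than, for example, their total length) is an assumption of this sketch.

```python
# Illustrative sketch: pick MUMs that keep the same order in A and B via a
# longest increasing subsequence. Maximizing the MUM count is an assumption.
from bisect import bisect_left


def longest_consistent_set(mums):
    """mums: list of (pos_in_a, pos_in_b) pairs with distinct positions."""
    ordered = sorted(mums)   # sort by position in A
    tails = []               # tails[k]: index of the MUM ending the best chain of length k + 1
    tail_values = []         # B positions of those MUMs, kept sorted for bisect
    previous = [None] * len(ordered)
    for index, (_, pos_b) in enumerate(ordered):
        k = bisect_left(tail_values, pos_b)
        previous[index] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(index)
            tail_values.append(pos_b)
        else:
            tails[k] = index
            tail_values[k] = pos_b
    # Walk back from the end of the longest chain.
    chain = []
    index = tails[-1] if tails else None
    while index is not None:
        chain.append(ordered[index])
        index = previous[index]
    return chain[::-1]
```

MUMs left out of the chain are the inconsistent ones, which may mark rearrangements or large inserts.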
32 Suffix Trees
- A tree with edges labelled by strings
  - Labels of child edges of a node begin with distinct letters
  - Each leaf L represents a sequence: the labels on the path to L from the root
- Holds all suffixes of a set of sequences
  - A suffix is a subsequence that extends to the end of its sequence
- The suffix tree for sequences A and B
  - Contains fewer than 2(|A| + |B|) nodes.
  - Can be constructed in O(|A| + |B|) time!
- Still need lots of RAM
  - All the analyses here were run on a desktop PC
33
- Analyze the gaps between adjacent MUMs
  - Small gaps can be aligned with Smith-Waterman algorithm
  - Large gaps can be aligned recursively
  - Large inserts can be searched for separately. Many will be inconsistent MUMs
  - Overlapping MUMs indicate variation in copy number of small repeats
34 M. tuberculosis CSU93 vs. H37Rv
       A     C     G     T
A      -    66   164     9
C     48     -    81   169
G    164    89     -    44
T     11   159    61     -
35 M. genitalium vs. M. pneumoniae
36 H. pylori 26695 vs. J99
37 V. cholerae (forward) vs. E. coli
[Figure label: Origin]
38 V. cholerae (reverse) vs. E. coli
39 V. cholerae (both strands) vs. E. coli: a puzzle?
40 V. cholerae vs. itself
41 S. pyogenes vs. S. pneumoniae
42 S. pyogenes vs. itself
43 M. leprae vs. M. tuberculosis
[Figure axes: M. tuberculosis, M. leprae]
44 X-alignments: how?
[Figure: diagrams of segments numbered 1 to 6 arranged around the origin (Ori)]
45 Chr 2 vs. Chr 4 of Arabidopsis thaliana: discovery of a 4 Mb duplication
1100 genes, 430 (39%) duplicated
46 Acknowledgements
- GLIMMER, GLIMMERM
  - Arthur Delcher, Simon Kasif, Owen White, Mihaela Pertea
- MUMmer
  - Arthur Delcher, Simon Kasif, Jeremy Peterson, Rob Fleischmann, Owen White
- Analyses
  - Numerous TIGR faculty and staff, including Jonathan Eisen, Owen White, Rob Fleischmann, Hervé Tettelin, Tim Read, Maria Ermolaeva, John Heidelberg, Ian Paulsen, Malcolm Gardner, Claire Fraser, Clyde Hutchison, ...
- Supported by
  - National Institutes of Health (NHGRI, NLM)
  - National Science Foundation (CISE, BIO)