Title: Nothing in (computational) biology makes
1Comparative genomics, genome context and genome
annotation
Nothing in (computational) biology makes sense
except in the light of evolution
after Theodosius Dobzhansky (1973)
2Genome context analysis and genome annotation
Using information other than homologous
relationships between individual genes/proteins
for functional prediction (guilt by association)
Types of context analysis
- phyletic patterns
- domain fusion (Rosetta Stone proteins)
- gene order conservation
- co-expression
- .
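Of the context types listed above, domain fusion is perhaps the easiest to illustrate in code. The following is a minimal sketch, assuming every protein has already been reduced to a tuple of domain-family labels. The function name `rosetta_stone_links` and all genome, protein and domain names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def rosetta_stone_links(genomes):
    """Link domain families that are stand-alone somewhere and fused elsewhere.

    genomes: dict of genome name -> dict of protein name -> tuple of
             domain-family labels (an assumed, pre-computed representation).
    Returns a dict mapping a pair of families to the fused proteins
    ("Rosetta Stone" proteins) that support the link.
    """
    # Families observed as single-domain proteins in at least one genome.
    standalone = set()
    for proteins in genomes.values():
        for domains in proteins.values():
            if len(domains) == 1:
                standalone.add(domains[0])

    links = {}
    for genome, proteins in genomes.items():
        for name, domains in proteins.items():
            # Every pair of distinct families inside one protein is a fusion.
            for a, b in combinations(sorted(set(domains)), 2):
                if a in standalone and b in standalone:
                    links.setdefault((a, b), []).append((genome, name))
    return links


# Toy input: domX and domY are separate in genome_A and fused in genome_B.
toy_genomes = {
    "genome_A": {"a1": ("domX",), "a2": ("domY",), "a3": ("domZ",)},
    "genome_B": {"b1": ("domX", "domY"), "b2": ("domZ",)},
}
links = rosetta_stone_links(toy_genomes)
print(links)
```

In the toy data the fused protein `b1` suggests a functional link between the stand-alone families `domX` and `domY`, which is the guilt-by-association argument in its simplest form.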
3(No Transcript)
4(No Transcript)
5(No Transcript)
6Goals
- Using gene sets from complete genomes, delineate families of orthologs and paralogs: Clusters of Orthologous Groups (of genes) (COGs)
- Using COGs, develop an engine for functional annotation of new genomes
- Apply COGs for analysis of phylogenetic patterns
7COG - group of homologous proteins such that
all proteins from different species are orthologs
(all proteins from the same species in a COG are
paralogs)
8CONSTRUCTION OF COGs FOR 8 COMPLETE GENOMES
Complete set of proteins from the analyzed genomes
1. FULL SELF-COMPARISON (BLASTPGP, no cut-off)
2. Collapse obvious paralogs
3. Detect all interspecies Best Hits (BeTs) between individual proteins or groups of paralogs
4. Detect all triangles of consistent BeTs
5. Merge triangles with common edges
6. Detect groups with multidomain proteins and isolate domains
REPEAT STEPS 3-5
COGs
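The triangle detection and merging steps of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the original COG software: BeTs are treated as undirected links between proteins of different genomes, a triangle requires three proteins from three different genomes, and triangles are merged when they share two proteins (a common edge). The function names and the toy protein names are hypothetical.

```python
from itertools import combinations


def find_triangles(bets, genome_of):
    """Return all triangles of BeTs as frozensets of three proteins.

    bets: iterable of (protein_a, protein_b) best-hit pairs.
    genome_of: dict mapping each protein to its genome.
    """
    neighbors = {}
    for a, b in bets:
        if genome_of[a] != genome_of[b]:  # only interspecies hits count
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    triangles = set()
    for a, nbrs in neighbors.items():
        for b, c in combinations(sorted(nbrs), 2):
            three_genomes = len({genome_of[a], genome_of[b], genome_of[c]}) == 3
            if c in neighbors[b] and three_genomes:
                triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share an edge; return sorted protein clusters."""
    tris = sorted(triangles, key=sorted)
    parent = list(range(len(tris)))

    def find(i):  # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}
    for i, tri in enumerate(tris):
        for edge in combinations(sorted(tri), 2):
            if edge in edge_owner:
                parent[find(i)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = i

    clusters = {}
    for i, tri in enumerate(tris):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(c) for c in clusters.values())


# Toy data: two triangles sharing the edge b1-c1, plus an isolated pair.
toy_genome_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "D", "a2": "A", "b2": "B"}
toy_bets = [("a1", "b1"), ("b1", "c1"), ("a1", "c1"),
            ("b1", "d1"), ("c1", "d1"), ("a2", "b2")]
toy_cogs = merge_triangles(find_triangles(toy_bets, toy_genome_of))
print(toy_cogs)
```

The pair `a2`-`b2` never forms a triangle, so it is left out, while the two triangles with the common edge `b1`-`c1` merge into one four-member cluster.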
9A TRIANGLE OF BeTs IS A MINIMAL, ELEMENTARY COG
10A RELATIVELY SIMPLE COG PRODUCED BY MERGING
ADJACENT TRIANGLES
11A COMPLEX COG WITH MULTIPLE PARALOGS
12Current status of the COGs
Prokaryotes
11 Archaea + 1 unicellular eukaryote + 46 bacteria = 58 complete genomes
149,321 proteins
105,861 proteins in 4075 COGs (71%)
Eukaryotes
4 animals + 1 plant + 2 fungi + 1 microsporidium = 8 complete genomes
142,498 proteins
74,093 proteins in 4822 COGs (52%)
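The parenthesized figures on this slide agree with the share of proteins that fall into COGs, expressed as a rounded percentage, and the genome counts add up to the stated totals. A short check using only the numbers above (the helper name `coverage_percent` is ours):

```python
def coverage_percent(proteins_in_cogs, total_proteins):
    """Share of proteins assigned to COGs, as a rounded percentage."""
    return round(100 * proteins_in_cogs / total_proteins)


prokaryotic_genomes = 11 + 1 + 46        # archaea + unicellular eukaryote + bacteria
eukaryotic_genomes = 4 + 1 + 2 + 1       # animals + plant + fungi + microsporidium

prokaryotic_coverage = coverage_percent(105_861, 149_321)
eukaryotic_coverage = coverage_percent(74_093, 142_498)

print(prokaryotic_genomes, prokaryotic_coverage)  # 58 71
print(eukaryotic_genomes, eukaryotic_coverage)    # 8 52
```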
13COGnitor...
14IN ACTION
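The slides do not spell out the assignment rule used by COGnitor. A plausible sketch, assuming a query protein is placed in a COG when its genome-specific best hits from enough different genomes belong to that same COG, is given below. The threshold `min_genomes`, the function name `assign_cog`, and the COG and protein identifiers are illustrative assumptions.

```python
from collections import Counter


def assign_cog(best_hits, cog_of, min_genomes=3):
    """Assign a query protein to a COG by majority of genome-specific best hits.

    best_hits: dict of genome -> the query's best-hit protein in that genome.
    cog_of: dict of protein -> COG identifier (proteins outside COGs are absent).
    Returns the COG supported by at least `min_genomes` genomes, else None.
    """
    votes = Counter(cog_of[hit] for hit in best_hits.values() if hit in cog_of)
    if not votes:
        return None
    cog, support = votes.most_common(1)[0]
    return cog if support >= min_genomes else None


# Toy data: three of four best hits point to the same (hypothetical) COG.
toy_cog_of = {"e1": "COG_X", "h1": "COG_X", "b1": "COG_X", "m1": "COG_Y"}
toy_best_hits = {"genome_1": "e1", "genome_2": "h1",
                 "genome_3": "b1", "genome_4": "m1"}

print(assign_cog(toy_best_hits, toy_cog_of))                  # COG_X
print(assign_cog(toy_best_hits, toy_cog_of, min_genomes=4))   # None
```

Requiring agreement between several genomes, rather than trusting a single best hit, is what makes this kind of assignment robust to spurious similarity.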
15(No Transcript)
16(No Transcript)
17(No Transcript)
18The Universal COGs
19Search for genomic determinants of
hyperthermophily
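A search of this kind can be phrased as a query over phyletic patterns. The sketch below assumes each COG is stored as the set of genomes in which it is represented, and selects COGs present in every genome of one group and absent from every genome of another. The function name `select_by_pattern` and all genome and COG names are made up for the example.

```python
def select_by_pattern(patterns, present_in, absent_from=()):
    """Return COGs found in all `present_in` genomes and in no `absent_from` genome.

    patterns: dict of COG identifier -> set of genomes that contain the COG.
    """
    present_in, absent_from = set(present_in), set(absent_from)
    return sorted(
        cog for cog, genomes in patterns.items()
        if present_in <= genomes and not (absent_from & genomes)
    )


# Toy phyletic patterns over two hyperthermophiles and two mesophiles.
toy_patterns = {
    "COG_A": {"thermo1", "thermo2", "meso1", "meso2"},
    "COG_B": {"thermo1", "thermo2"},
    "COG_C": {"thermo1", "meso1"},
    "COG_D": {"meso1", "meso2"},
}
hyperthermophiles = {"thermo1", "thermo2"}
mesophiles = {"meso1", "meso2"}

universal = select_by_pattern(toy_patterns, hyperthermophiles | mesophiles)
thermo_specific = select_by_pattern(toy_patterns, hyperthermophiles, mesophiles)
print(universal)        # ['COG_A']
print(thermo_specific)  # ['COG_B']
```

With an empty `absent_from` group and all genomes in `present_in`, the same query returns the universal COGs.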
20(No Transcript)
21Search for unique archaeo-eukaryotic genes
22(No Transcript)
23A complementary pattern search for unique
bacterial genes
24(No Transcript)
25Essential function but holes in the
phyletic pattern
Strict complementary pattern
26(No Transcript)
27Relaxed complementary pattern
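The slides do not define "strict" and "relaxed" formally. One reading, sketched below as an assumption, is that two COGs are strictly complementary when every genome contains exactly one of them, and relaxed complementarity tolerates a small number of genomes that contain both or neither. The names `complementarity`, `is_complementary` and `max_exceptions` are ours.

```python
def complementarity(pattern_a, pattern_b, genomes):
    """Return (genomes with both COGs, genomes with neither COG)."""
    genomes = set(genomes)
    both = pattern_a & pattern_b & genomes
    neither = genomes - (pattern_a | pattern_b)
    return both, neither


def is_complementary(pattern_a, pattern_b, genomes, max_exceptions=0):
    """Strict complementarity when max_exceptions is 0, relaxed otherwise."""
    both, neither = complementarity(pattern_a, pattern_b, genomes)
    return len(both) + len(neither) <= max_exceptions


# Toy patterns over five genomes.
toy_all = {"g1", "g2", "g3", "g4", "g5"}
cog_p = {"g1", "g2", "g3"}
cog_q = {"g4", "g5"}          # fills exactly the holes left by cog_p
cog_r = {"g3", "g4", "g5"}    # overlaps cog_p in g3

print(is_complementary(cog_p, cog_q, toy_all))                    # True
print(is_complementary(cog_p, cog_r, toy_all))                    # False
print(is_complementary(cog_p, cog_r, toy_all, max_exceptions=1))  # True
```

A pair that passes such a test is a candidate for two unrelated proteins performing the same essential function in different genomes, which would explain the holes in each individual pattern.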
28(No Transcript)
29Relaxed complementary pattern with extra
restrictions
30(No Transcript)
31(No Transcript)
32(No Transcript)
33(No Transcript)
34Conservation of gene order in bacterial species
of the same genus
M. genitalium vs M. pneumoniae
35Conservation of gene order in closely related
bacterial genera
C. trachomatis vs C. pneumoniae
36Lack of gene order conservation - even in
closely related bacteria of the same
Proteobacterial subdivision
P. aeruginosa vs E. coli
37Genome Alignments - Method
Protein sets from complete genomes
BLAST cross-comparison
Table of Hits
Pairwise Genome Alignment: local alignment algorithm Lamarck (gap opening penalty, gap extension penalty); statistics with Monte Carlo simulations
Template-Anchored Genome Alignment
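The pairwise step can be illustrated with a generic local alignment over gene orders. The sketch below is not the Lamarck program. It is a standard Smith-Waterman style recursion with affine gaps, assuming two genes "match" when the pair appears in the table of hits, followed by a simple Monte Carlo estimate obtained by shuffling one gene order. All scores, penalties, function names and gene names are illustrative assumptions.

```python
import random


def local_gene_order_alignment(genome1, genome2, hits, match=1.0,
                               mismatch=-1.0, gap_open=-1.0, gap_extend=-0.5):
    """Best local alignment score between two gene orders (affine gaps).

    genome1, genome2: lists of gene identifiers in chromosomal order.
    hits: set of (gene1, gene2) pairs taken from the table of hits.
    """
    n, m = len(genome1), len(genome2)
    neg_inf = float("-inf")
    # H: best score ending at (i, j); E and F: best score ending in a gap.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            pair = (genome1[i - 1], genome2[j - 1])
            score = match if pair in hits else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + score, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def monte_carlo_p_value(genome1, genome2, hits, trials=200, seed=0):
    """Estimate how often a shuffled gene order scores at least as well."""
    rng = random.Random(seed)
    observed = local_gene_order_alignment(genome1, genome2, hits)
    shuffled = list(genome2)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if local_gene_order_alignment(genome1, shuffled, hits) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (trials + 1)


# Toy data: a conserved string of three genes flanked by unrelated genes.
toy_g1 = ["a1", "a2", "a3", "a4", "a5"]
toy_g2 = ["x", "b1", "b2", "b3", "y"]
toy_hits = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

toy_score = local_gene_order_alignment(toy_g1, toy_g2, toy_hits)
toy_observed, toy_p = monte_carlo_p_value(toy_g1, toy_g2, toy_hits)
print(toy_score)
```

This sketch only scores collinear strings on one strand. Handling inverted strings and tracing back the aligned genes would be needed for a real comparison.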
38(No Transcript)
39Genome Alignments - Statistics
Distribution of conserved gene string lengths
40Genome Alignments - Statistics
Pairwise alignments   No. strings   No. genes   % in Gen1   % in Gen2
All homologs
  ecoli-hinf          138           566         13          33
  ecoli-bsub          89            322         8           8
  ecoli-mjan          10            30          1           2
Probable orthologs
  ecoli-hinf          105           482         11          28
  ecoli-bsub          34            168         4           4
  ecoli-mjan          12            33          1           2
41Genome Alignments - Statistics
Breakdown of genes in the genome
42Genome Alignments - Statistics
Fraction of the genome in conserved gene strings - from template-anchored alignments
Minimum
Synechocystis sp. 5%
Aquifex aeolicus 10%
Archaeoglobus fulgidus 13%
Escherichia coli 14%
Treponema pallidum 17%
Maximum
Thermotoga maritima 23%
Mycoplasma genitalium 24%
43Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0536)
Figure labels: L21, L27, GTPase?
GTP-binding translation factor
44Context-Based Prediction of Protein Functions
A Novel Translation Factor (COG0012)
Figure labels: TGS domain-containing GTPase?, Peptidyl-tRNA hydrolase
GTP-binding translation factor
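The reasoning behind the two predictions above, an uncharacterized gene that repeatedly sits next to translation-related genes, can be sketched as a count of conserved neighbors. The code assumes each genome is given as a list of COG or gene labels in chromosomal order. The gene orders below are toy data invented for the example, not real genome maps, and the function name, `window` and `min_genomes` are our own choices.

```python
from collections import Counter


def conserved_neighbors(genomes, target, window=2, min_genomes=2):
    """Count, per neighbor label, the genomes where it lies near `target`.

    genomes: dict of genome name -> list of labels in chromosomal order.
    window: how many genes on each side count as the neighborhood.
    Returns only neighbors seen in at least `min_genomes` genomes.
    """
    counts = Counter()
    for order in genomes.values():
        near = set()
        for i, label in enumerate(order):
            if label == target:
                near.update(order[max(0, i - window):i])
                near.update(order[i + 1:i + 1 + window])
        near.discard(target)
        counts.update(near)  # each genome votes at most once per neighbor
    return {label: n for label, n in counts.items() if n >= min_genomes}


# Toy gene orders (invented): the target keeps ribosomal protein neighbors.
toy_orders = {
    "genome_1": ["X1", "L21", "L27", "COG0536", "X2"],
    "genome_2": ["L21", "L27", "COG0536", "X3"],
    "genome_3": ["X4", "COG0536", "X5", "L27"],
}
toy_neighbors = conserved_neighbors(toy_orders, "COG0536")
print(toy_neighbors)
```

Neighbors that recur across several genomes, here the ribosomal protein labels, are the ones that carry functional information. Neighbors seen in a single genome are filtered out as likely coincidence.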
45(No Transcript)
46(No Transcript)