Title: Molecular biology in silico
1Molecular biology in silico
- Mikhail Gelfand
- Research and Training Center Bioinformatics, Institute for Information Transmission Problems, RAS
- AlBio06, Moscow, July 2006
2Propaganda
red: papers (experiments); blue: sequence fragments
3Complete genomes
GOLD db. (III.2006)
- 361 complete genomes
- Incomplete (in the process): 952 bacteria, 58 archaea, 607 eukaryotes (incl. ESTs), 46 metagenomes
4More propaganda
- Most genes will never be studied in experiment
- Even in E. coli, only 20-30 new genes per year (hundreds are still uncharacterized)
- Bioinformatics: molecular biology in silico
- 2% of all recent papers in biological journals
- Essential component of biological research
- Make predictions about function and regulation of genes (many quite reliable!)
- Metabolic reconstruction and prediction of phenotype given genome
- Identify really interesting cases, fill gaps in knowledge
- Universally missing genes: not a single known gene even for 10 reactions of central metabolism. No genes for >40 reactions overall
- Conserved hypothetical genes (5-15% of any bacterial genome): essential, but unknown function
5Haemophilus influenzae, 1995
6Vibrio cholerae, 2000
7How? Similarity to known proteins
- Useful for many purposes (allows one to annotate 50-75% of genes in a bacterial genome)
- Necessary first step
- May be automated
- to some extent
- in particular, care is needed to avoid too specific predictions
- Problem: propagation of annotation errors
- Boring (nothing new)
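The warning about too specific predictions can be illustrated with a minimal sketch of cautious annotation transfer. Everything here is an assumption made for illustration: the `Hit` fields, the function name `transfer_annotation`, and the identity thresholds (50% and 30%) are hypothetical and are not taken from the talk. The example values for MJ1319 come from the record on the next slide; treating its best hit as not experimentally characterized is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best database hit for a query protein (all fields are hypothetical)."""
    product: str        # specific annotation carried by the hit
    family: str         # general, family-level description
    identity: float     # percent identity over the alignment
    experimental: bool  # True if the hit's function was shown in experiment


def transfer_annotation(hit, specific_min=50.0, family_min=30.0):
    """Return a hedged annotation; the thresholds are assumed, not canonical."""
    # Copy the specific function only from a close, experimentally studied hit.
    if hit.identity >= specific_min and hit.experimental:
        return hit.product
    # Otherwise fall back to a general, family-level statement.
    if hit.identity >= family_min:
        return "hypothetical " + hit.family
    return "hypothetical protein"


mj1319 = Hit(
    product="sodium-dependent noradrenaline transporter",
    family="sodium-dependent transporter",
    identity=38.5,
    experimental=False,  # assumption for this example
)
print(transfer_annotation(mj1319))  # hypothetical sodium-dependent transporter
```

Refusing to copy a specific product from a hit that was itself annotated by similarity is one simple way to limit the propagation of annotation errors.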
8Noradrenaline transporter in an archaeon?
SOURCE      Methanococcus jannaschii.
  ORGANISM  Methanococcus jannaschii
            Archaea; Euryarchaeota; Methanococcales; Methanococcaceae; Methanococcus.
FEATURES             Location/Qualifiers
     source          1..492
                     /organism="Methanococcus jannaschii"
                     /db_xref="taxon:2190"
     Protein         1..492
                     /product="sodium-dependent noradrenaline transporter"
     CDS             1..492
                     /gene="MJ1319"
                     /note="similar to EGAD:HI0736 percent identity: 38.5; identified by sequence similarity; putative"
                     /coded_by="U67572:71..1549"
                     /transl_table=11
- Now corrected: Hypothetical sodium-dependent transporter MJ1319.
9Similarity to hypothetical proteins: somebody else's errors
The correct annotation
10Genes with curious functional assignments
- C75604: Probable head morphogenesis protein, Deinococcus radiodurans
- O05360: Automembrane protein H, Yersinia enterocolitica
- Q8TID9: Benzodiazepine (valium) receptor TspO, Methanosarcina acetivorans
- NP_069403: DR-beta chain MHC class II, Archaeoglobus fulgidus
11Errors in experimental papers
SwissProt
DEFINITION  Hypothetical 43.6 kDa protein.
ACCESSION   P48012
...
KEYWORDS    Hypothetical protein.
SOURCE      Debaryomyces occidentalis
  ORGANISM  Debaryomyces occidentalis
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Debaryomyces.
CAUTION     Was originally (Ref.1) thought to be 3-isopropylmalate dehydrogenase (LEU2).
PIR
DEFINITION  3-isopropylmalate dehydrogenase (EC 1.1.1.85) - yeast (Schwanniomyces occidentalis).
ACCESSION   S55845
KEYWORDS    oxidoreductase.
12SwissProt entry DSDX_ECOLI
-!- CAUTION: An ORF called dsdC was originally
(Ref.3) assigned to the wrong DNA strand and
thought to be a D-serine deaminase activator, it
was then resequenced by Ref.2 and still thought
to be "dsdC", but this time to function as a
D-serine permease. It is Ref.1 that showed that
dsdC is another gene and that this sequence
should be called dsdX. It should also be noted
that the C-terminal part of dsdX (from 338
onward) was also sequenced (Ref.6 and Ref.7) and
was thought to be a separate ORF (don't worry, we
also had difficulties understanding what
happened!).
13Positional clustering
- Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem
- mainly in prokaryotes, very weak in eukaryotes
- caused by operon structure, but not only
- horizontal transfer of loci containing several functionally linked operons
- compartmentalisation of products in the cytoplasm
- very weak evidence
- stronger if observed in many unrelated genomes
- May be measured
- e.g. the STRING database/server (P. Bork, EMBL)
- and other sources
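A minimal sketch of how positional clustering might be measured: count, for each pair of ortholog groups, the number of genomes in which the two genes are neighbours. This is not the STRING scoring scheme. The function names, the `window` parameter, and the toy gene orders are hypothetical, and gene names stand in for ortholog-group identifiers.

```python
from collections import Counter


def neighbor_pairs(gene_order, window=1):
    """Unordered pairs of ortholog groups lying within `window` positions."""
    pairs = set()
    for i, a in enumerate(gene_order):
        for b in gene_order[i + 1:i + 1 + window]:
            if a != b:
                pairs.add(frozenset((a, b)))
    return pairs


def clustering_support(genomes, window=1):
    """Count in how many genomes each pair of ortholog groups is adjacent."""
    support = Counter()
    for order in genomes.values():
        support.update(neighbor_pairs(order, window))
    return support


# Toy gene orders (invented for illustration only).
genomes = {
    "genome_A": ["trpA", "trpB", "trpC", "xyzK"],
    "genome_B": ["abcD", "trpB", "trpA"],
    "genome_C": ["trpC", "abcD", "trpA", "trpB"],
}
support = clustering_support(genomes)
print(support[frozenset(("trpA", "trpB"))])  # 3
print(support[frozenset(("trpB", "trpC"))])  # 1
```

A raw count like this treats all genomes equally; a realistic score would down-weight closely related genomes, since the evidence is stronger when the clustering is seen in unrelated ones.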
14STRING: trpB positional clusters
15Functionally dependent genes tend to cluster on
chromosomes in many different organisms
Vertical axis: number of gene pairs with association score exceeding a threshold.
Control: same graph, random re-labeling of vertices
16More genomes (stronger links) => highly significant clustering
17Especially in linear pathways (right)
18Fusions
- If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related
- Very useful for the analysis of eukaryotes
- Sometimes useful for the analysis of prokaryotes
19STRING: trpB fusions
20Phyletic patterns
- Functionally linked genes tend to occur together
- Enzymes with the same function (isozymes) have
complementary phyletic profiles
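Both statements can be expressed as simple set operations on phyletic profiles. The sketch below represents a profile as the set of genomes containing the gene; the function names and the toy profiles are hypothetical, and the Jaccard index is used here only as one convenient co-occurrence measure, not as the score used by any particular server.

```python
def jaccard(a, b):
    """Co-occurrence of two phyletic profiles given as sets of genome names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_complementary(a, b, genomes_with_function):
    """True if two profiles never overlap but together cover all genomes."""
    return not (a & b) and (a | b) == genomes_with_function


# Toy profiles (invented for illustration only).
pathway_gene = {"g1", "g2", "g3", "g4"}
partner = {"g1", "g2", "g3"}
isozyme_1 = {"g1", "g2"}
isozyme_2 = {"g3", "g4"}

print(jaccard(pathway_gene, partner))                        # 0.75
print(jaccard(isozyme_1, isozyme_2))                         # 0.0
print(is_complementary(isozyme_1, isozyme_2, pathway_gene))  # True
```

Functionally linked genes give a high co-occurrence value, whereas alternative enzymes for the same reaction give a value near zero while jointly covering every genome that has the pathway.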
21STRING: trpB co-occurrence (phyletic profiles)
22Phyletic profiles in the Phe/Tyr pathway
shikimate kinase
23Archaeal shikimate-kinase
Chorismate biosynthesis pathway (E. coli)
24Arithmetics of phyletic patterns
Shikimate dehydrogenase (EC 1.1.1.25), AroE, COG0169: aompkzyqvdrlbcefghsnuj-i--
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19), AroA, COG0128: aompkzyqvdrlbcefghsnuj-i--
Chorismate synthase (EC 4.2.3.5), AroC, COG0082: aompkzyqvdrlbcefghsnuj-i--
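The "arithmetics" can be sketched as position-wise operations on COG-style pattern strings, where a letter means the gene is present in that genome and a dash means it is absent. The AroE pattern below is copied from the slide; the kinase pattern is hypothetical, built only to show the subtraction, and the function names are assumptions.

```python
def present(pattern):
    """Positions at which a COG-style pattern has a letter (gene present)."""
    return {i for i, ch in enumerate(pattern) if ch != "-"}


def subtract(pattern, other):
    """Keep the letters of `pattern` only where `other` has a dash."""
    return "".join(ch if o == "-" else "-" for ch, o in zip(pattern, other))


aro_e = "aompkzyqvdrlbcefghsnuj-i--"  # from the slide
# Hypothetical pattern: an enzyme missing from the first five genomes.
kinase = "-----" + aro_e[5:]

missing = subtract(aro_e, kinase)
print(missing)                # aompk followed by 21 dashes
print(len(present(missing)))  # 5
```

Genomes that have the rest of the pathway but lack the known enzyme (the letters left after subtraction) are where a non-orthologous replacement should be sought, and a candidate gene is expected to have exactly this residual pattern.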
25Distribution of association scores (monotonic
for subunits, bimodal for isozymes)
26E.g. transporters
- Transporters of end products of metabolic pathways may substitute for the entire pathway
- Transporters of compounds for catabolic pathways co-occur with pathways
- Transporters for intermediates substitute for upstream parts of pathways
27Example: bioY
28Other approaches to phyletic patterns
- Gene signatures of lifestyles
- e.g. thermophily: reverse gyrase is the only gene specific to all hyperthermophiles (bacterial and archaeal)
- see COGs
- Regulators and signals
29Example: bioR (gene: black arrow; candidate site: red dot)
30Comparative analysis of regulation
- Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions
- Consistency filtering: regulons (sets of co-regulated genes) are conserved =>
- true sites occur upstream of orthologous genes
- false sites are scattered at random
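Consistency filtering can be sketched as follows: a candidate site is retained only if candidate sites are also found upstream of orthologous genes in enough other genomes. The data structures, the function name `consistent_sites`, and the `min_genomes` cut-off are hypothetical choices for illustration; the gene and genome labels are invented.

```python
from collections import defaultdict


def consistent_sites(candidate_sites, orthologs, min_genomes=2):
    """Keep sites whose downstream gene has orthologs with sites elsewhere.

    candidate_sites: iterable of (genome, gene) pairs with a predicted site.
    orthologs: dict mapping (genome, gene) to an ortholog-group identifier.
    """
    genomes_per_group = defaultdict(set)
    for genome, gene in candidate_sites:
        group = orthologs.get((genome, gene))
        if group is not None:
            genomes_per_group[group].add(genome)
    kept = set()
    for genome, gene in candidate_sites:
        group = orthologs.get((genome, gene))
        # Sites without a known ortholog group are treated as unsupported.
        if len(genomes_per_group.get(group, ())) >= min_genomes:
            kept.add((genome, gene))
    return kept


orthologs = {
    ("A", "ribD"): "OG1", ("B", "ribD"): "OG1", ("C", "ribD"): "OG1",
    ("A", "yyy"): "OG2", ("B", "zzz"): "OG3",
}
candidates = [("A", "ribD"), ("B", "ribD"), ("C", "ribD"),
              ("A", "yyy"), ("B", "zzz")]
print(sorted(consistent_sites(candidates, orthologs)))
```

In this toy input the three sites upstream of the OG1 orthologs support each other and are kept, while the isolated sites upstream of `yyy` and `zzz` are discarded as likely false positives.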
31Enzymes
- Identification of a gap in a pathway (universal, taxon-specific, or in individual genomes)
- Search for candidates assigned to the pathway by co-localization and co-regulation (in many genomes)
- Prediction of general biochemical function from (distant) similarity and functional patterns
- Tentative filling of the gap
- Verification by analysis of phylogenetic patterns
- Absence in genomes without this pathway
- Complementary distribution with known enzymes for the same function
32Transporters
- Identification of candidates assigned to the pathway by co-localization and co-regulation (in many genomes)
- Prediction of general function by analysis of transmembrane segments and similarity
- Prediction of specificity by analysis of phylogenetic patterns
- End product if present in genomes lacking this pathway (substituting the biosynthetic pathway for an essential compound)
- Input metabolite if absent in genomes without the pathway (catabolic, also precursors in biosynthetic pathways)
- Entry point in the middle if substituting an upper or side part of the pathway in some genomes
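The three specificity rules above can be written as a small decision function. This is only a sketch of the reasoning: the function name, the state labels ('complete', 'absent', 'lower_only'), and the order in which the rules are checked are assumptions. The input lists only genomes that carry the candidate transporter.

```python
def classify_transporter(pathway_state):
    """Guess the transported compound from phyletic evidence.

    pathway_state maps each genome that HAS the transporter to the state of
    the pathway there: 'complete', 'absent', or 'lower_only' (upper part
    missing). The labels are hypothetical conventions for this sketch.
    """
    states = set(pathway_state.values())
    if "absent" in states:
        # The transporter replaces the whole biosynthetic pathway.
        return "end product"
    if "lower_only" in states:
        # The transporter replaces the upper part of the pathway.
        return "intermediate"
    if states == {"complete"}:
        # The transporter never occurs without the pathway.
        return "input metabolite"
    return "undetermined"


# Example taken from the YpaA slide: genomes with ypaA but no riboflavin pathway.
ypaA = {"S. pyogenes": "absent", "E. faecalis": "absent", "Listeria sp.": "absent"}
print(classify_transporter(ypaA))  # end product
```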
33 5' UTR regions of riboflavin genes from bacteria
34Conserved secondary structure of the RFN-element
Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Dashes and stars: obligatory and facultative base pairs.
Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U; N = any nucleotide; X = any nucleotide or deletion.
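The degenerate codes in the legend map directly onto regular-expression character classes, which gives a quick way to scan sequences for a consensus. The sketch below handles only the sequence part (not the base-pairing constraints), ignores the distinction between invariant and strongly conserved positions, and uses a made-up consensus string; the function name is an assumption.

```python
import re

# Degenerate RNA codes as defined in the legend.
CODES = {
    "R": "[AG]", "Y": "[CU]", "K": "[GU]",
    "B": "[CGU]",    # not A
    "V": "[ACG]",    # not U
    "N": "[ACGU]",   # any nucleotide
    "X": "[ACGU]?",  # any nucleotide or deletion
}


def consensus_to_regex(consensus):
    """Translate a consensus string into a compiled regular expression."""
    parts = []
    for ch in consensus:
        up = ch.upper()  # case (invariant vs. conserved) is ignored here
        if up in "ACGU":
            parts.append(up)
        elif up in CODES:
            parts.append(CODES[up])
        else:
            raise ValueError(f"unknown symbol: {ch!r}")
    return re.compile("".join(parts))


pattern = consensus_to_regex("GRYNXC")  # hypothetical consensus
print(bool(pattern.fullmatch("GACUGC")))  # True
print(bool(pattern.fullmatch("GACUC")))   # True (X matched as a deletion)
print(bool(pattern.fullmatch("GCCUC")))   # False (C is not R)
```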
35RFN: the mechanism of regulation
- Transcription attenuation
36Early observation: an uncharacterized gene (ypaA) with an upstream RFN element
37Phylogenetic tree of RFN-elements (regulation of
riboflavin biosynthesis)
no riboflavin biosynthesis
duplications
no riboflavin biosynthesis
38YpaA: riboflavin (vitamin B2) transporter in Gram-positive bacteria
- 5 predicted transmembrane segments => a transporter
- Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor
- S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin
- Prediction: YpaA is riboflavin transporter (Gelfand et al., 1999)
- Validation
- YpaA transports flavins (riboflavin, FMN, FAD) (by genetic analysis, Kreneva et al., 2000)
- ypaA is regulated by riboflavin (by microarray expression study, Lee et al., 2001)
- via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)
39A new family of nickel/cobalt transporters
- No experimental data
- No structural data
- Specificity predicted by comparative genomics
- and then validated in experiment
- Mutational analysis under way
40Conserved signal upstream of nrd genes
41Identification of the candidate regulator by the
analysis of phyletic patterns
- COG1327: the only COG with exactly the same phylogenetic pattern as the signal
- large scale: on the level of major taxa
- small scale: within major taxa
- absent in small parasites among alpha- and gamma-proteobacteria
- absent in Desulfovibrio spp. among delta-proteobacteria
- absent in Nostoc sp. among cyanobacteria
- absent in Oenococcus and Leuconostoc among Firmicutes
- present only in Treponema denticola among four spirochetes
42COG1327: Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains
Regulator of the riboflavin pathway?
43Additional evidence
- sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA
- candidate signals upstream of other replication-related genes
- dNTP salvage
- topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II
- experimental confirmation in Streptomyces (Borovok et al., 2004)
44Multiple sites (nrd genes): FNR, DnaA, NrdR
45Mode of regulation
- Repressor (overlaps with promoters)
- Co-operative binding
- most sites occur in tandem (> 90 cases)
- the distance between the copies (centers of palindromes) equals an integer number of DNA turns
- mainly (94%) 30-33 bp, in 84% 31-32 bp: 3 turns
- 21 bp (2 turns) in Vibrio spp.
- 41-42 bp (4 turns) in some Firmicutes
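The arithmetic behind "an integer number of DNA turns" can be checked with a short calculation. The sketch assumes the commonly cited average of about 10.5 bp per helical turn of B-form DNA; that value, the tolerance, and the function name are assumptions, not figures from the slide.

```python
BP_PER_TURN = 10.5  # assumed average helical repeat of B-form DNA


def helical_turns(distance_bp, tolerance=0.15):
    """Return (nearest whole number of turns, whether the distance fits it)."""
    turns = distance_bp / BP_PER_TURN
    nearest = round(turns)
    return nearest, abs(turns - nearest) <= tolerance


for distance in (21, 31, 32, 41, 42, 26):
    print(distance, helical_turns(distance))
# 21 -> (2, True); 31 and 32 -> (3, True); 41 and 42 -> (4, True); 26 -> (2, False)
```

Distances of 21, 31-32, and 41-42 bp all place the two site centers on the same face of the helix, which is what co-operative binding of two repressor copies would require; a spacing such as 26 bp, about 2.5 turns, would put them on opposite faces.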
46Combined regulatory network for iron homeostasis genes in alpha-proteobacteria.
[Figure: regulatory network diagram; node labels include Fe and FeS status of cell]
The connecting lines denote regulatory interactions, with the thickness reflecting the frequency of the interaction in the analyzed genomes. The suggested negative or positive mode of operation is shown by a dead end or an arrow end of the line.
47 Distribution of Irr, Fur/Mur, MntR, RirA, and IscR regulons in alpha-proteobacteria
'?' in the RirA column denotes the absence of the rirA gene in an unfinished genomic sequence and the presence of candidate RirA-binding sites upstream of the iron uptake genes.
48Phylogenetic tree of the Fur family of transcription factors in alpha-proteobacteria - I
Fur in gamma- and beta-proteobacteria
Fur in epsilon-proteobacteria
Fur in Firmicutes
Regulator of manganese uptake genes (sit, mntH) in alpha-proteobacteria
Regulator of iron uptake and metabolism genes in alpha-proteobacteria
49Sequence logos for the identified Fur-binding sites in the D group of alpha-proteobacteria
Species shown: Erythrobacter litoralis, Caulobacter crescentus, Novosphingobium aromaticivorans, Zymomonas mobilis, Sphingopyxis alaskensis, Oceanicaulis alexandrii, Rhodospirillum rubrum, Gluconobacter oxydans, Magnetospirillum magneticum, Parvularcula bermudensis
Identified Mur-binding sites: the A, B, and C groups of alpha-proteobacteria - Mur
Sequence logos for the known Fur-binding sites in Escherichia coli and Bacillus subtilis
50Phylogenetic tree of the Fur family of transcription factors in alpha-proteobacteria - II
Fur in gamma- and beta-proteobacteria
Fur in epsilon-proteobacteria
Fur in Firmicutes
alpha-proteobacteria
Irr in alpha-proteobacteria: regulator of iron homeostasis
51Sequence logos for the identified Irr binding sites in alpha-proteobacteria.
The A group (8 species) - Irr
The B group (4 species) - Irr
The C group (12 species) - Irr
52Phylogenetic tree of the Rrf2 family of transcription factors in alpha-proteobacteria
Nitrite/NO-sensing regulator NsrR (Nitrosomonas europaea, Escherichia coli)
Positional clustering of rrf2-like genes with:
- iron uptake and storage genes
- Fe-S cluster synthesis operons
- genes involved in nitrosative stress protection
- sulfate uptake/assimilation genes
- thioredoxin reductase
- carboxymuconolactone decarboxylase-family genes
- hmc cytochrome operon
Iron repressor RirA (Rhizobium leguminosarum)
Cysteine metabolism repressor CymR (Bacillus subtilis)
Cytochrome complex regulator Rrf2 (Desulfovibrio vulgaris)
Iron-sulfur cluster synthesis repressor IscR (Escherichia coli)
Legend: proteins with the conserved C-X(6-9)-C(4-6)-C motif within effector-responsive domain; proteins without a cysteine triad motif
53Sequence logos for the identified RirA-binding sites in alpha-proteobacteria
The A group (8 species) - RirA
The C group (12 species) - RirA
54Distribution of the conserved members of the Fe- and Mn-responsive regulons and the predicted RirA, Fur/Mur, Irr, and DtxR binding sites in alpha-proteobacteria
Table headers: Genes; Functions
Functions: Iron uptake; Iron storage; FeS synthesis; Iron usage; Heme biosynthesis; Regulatory genes; Manganese uptake
55An attempt to reconstruct the history
56Acknowledgements
- Dmitry Rodionov (comparative genomics)
- Andrei Mironov (software)
- Alexei Vitreschak (riboswitches)
- Slides
- Michael Galperin (NCBI, Bethesda)
- Andrei Osterman (Burnham Institute, San Diego)
- Collaboration
- Thomas Eitinger (Humboldt University, Berlin): Co/Ni transporters
- Andy Johnston (University of East Anglia): Fe in alphas
- Funding
- Howard Hughes Medical Institute
- Russian Fund of Basic Research
- RAS, program Molecular and Cellular Biology
- INTAS