Title: Recognition of regulatory signals
1Recognition of regulatory signals
- Mikhail S. Gelfand
- IntegratedGenomics-Moscow
- NATO ASI School, October 2001
2Why?
- Additional annotation tool (e.g. specificity of transporters and enzymes from large families)
- Important for practice (in addition to metabolic reconstruction)
- Interesting from the evolutionary point of view
3Overview
- 0. Biological introduction
- 1. Algorithms
- Representation of signals
- Deriving the signal
- Site recognition
- 2. Comparative genomics
- Phylogenetic footprinting
- Consistency filtering
4Some biology
- Transcription (DNA → RNA)
- Splicing (pre-mRNA → mRNA)
- Translation (mRNA → protein)
- Regulation of transcription in prokaryotes
- and eukaryotes
- Initiation of translation
5Transcription and translation in prokaryotes
6Initiation of transcription (bacteria)
7Translation in prokaryotes
8Translation (details)
9Splicing (eukaryotes)
10Regulation of transcription in prokaryotes
11Structure of DNA-binding domain. Example 1
12Structure of DNA-binding domain. Example 2
13Protein-DNA interactions
14Regulation of transcription in eukaryotes
15Representation of signals
- Consensus
- Pattern (consensus with degenerate positions)
- Positional weight matrix (PWM, or profile)
- Logical rules
- RNA signals
16Consensus
- codB CCCACGAAAACGATTGCTTTTT
- purE GCCACGCAACCGTTTTCCTTGC
- pyrD GTTCGGAAAACGTTTGCGTTTT
- purT CACACGCAAACGTTTTCGTTTA
- cvpA CCTACGCAAACGTTTTCTTTTT
- purC GATACGCAAACGTGTGCGTCTG
- purM GTCTCGCAAACGTTTGCTTTCC
- purH GTTGCGCAAACGTTTTCGTTAC
- purL TCTACGCAAACGGTTTCGTCGG
- consensus ACGCAAACGTTTTCGT
17Pattern
- codB CCCACGAAAACGATTGCTTTTT
- purE GCCACGCAACCGTTTTCCTTGC
- pyrD GTTCGGAAAACGTTTGCGTTTT
- purT CACACGCAAACGTTTTCGTTTA
- cvpA CCTACGCAAACGTTTTCTTTTT
- purC GATACGCAAACGTGTGCGTCTG
- purM GTCTCGCAAACGTTTGCTTTCC
- purH GTTGCGCAAACGTTTTCGTTAC
- purL TCTACGCAAACGGTTTCGTCGG
- consensus ACGCAAACGTTTTCGT
- pattern aCGmAAACGtTTkCkT
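The step from aligned sites to a consensus and a degenerate pattern can be sketched as follows. This is a minimal illustration, not the procedure used for the slide: the rule "keep every base seen at least `min_count` times in a column" and the function names are assumptions, and the slide's upper/lower-case convention is not reproduced. Degenerate positions use the standard IUPAC codes (for example M = A or C, K = G or T).

```python
from collections import Counter

# IUPAC nucleotide codes for every non-empty subset of {A, C, G, T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

# The nine aligned sites from the slide.
PUR_SITES = [
    "CCCACGAAAACGATTGCTTTTT",  # codB
    "GCCACGCAACCGTTTTCCTTGC",  # purE
    "GTTCGGAAAACGTTTGCGTTTT",  # pyrD
    "CACACGCAAACGTTTTCGTTTA",  # purT
    "CCTACGCAAACGTTTTCTTTTT",  # cvpA
    "GATACGCAAACGTGTGCGTCTG",  # purC
    "GTCTCGCAAACGTTTGCTTTCC",  # purM
    "GTTGCGCAAACGTTTTCGTTAC",  # purH
    "TCTACGCAAACGGTTTCGTCGG",  # purL
]


def consensus(sites):
    """Most frequent nucleotide in each column of equal-length aligned sites."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))


def pattern(sites, min_count=2):
    """Degenerate pattern: keep every base seen at least min_count times (assumed rule)."""
    out = []
    for col in zip(*sites):
        counts = Counter(col)
        kept = [base for base, n in counts.items() if n >= min_count]
        if not kept:  # no base is frequent enough: fall back to the top one
            kept = [counts.most_common(1)[0][0]]
        out.append(IUPAC["".join(sorted(kept))])
    return "".join(out)


# Columns 3..18 (0-based) are the 16 positions covered by the slide's consensus.
print(consensus(PUR_SITES)[3:19])
print(pattern(PUR_SITES)[3:19])
```

The threshold `min_count` controls how degenerate the pattern becomes: a higher value pushes the pattern back towards the plain consensus.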
18Frequency matrix
- Information content: $I = \sum_j \sum_b f(b,j)\,\log\frac{f(b,j)}{p(b)}$
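As a small worked example, the information content of a set of aligned sites can be computed directly from the column frequencies f(b,j). The sketch below assumes a uniform background p(b) = 0.25 and a base-2 logarithm (so the result is in bits). Neither choice is fixed by the slide, and the function name is illustrative.

```python
import math
from collections import Counter


def information_content(sites, background=None):
    """I = sum_j sum_b f(b,j) * log2(f(b,j) / p(b)) for equal-length aligned sites."""
    if background is None:
        background = {b: 0.25 for b in "ACGT"}  # assumption: uniform p(b)
    n = len(sites)
    total = 0.0
    for col in zip(*sites):
        # Only observed bases contribute; f * log f tends to 0 as f tends to 0.
        for base, count in Counter(col).items():
            f = count / n
            total += f * math.log2(f / background[base])
    return total


# A fully conserved column contributes 2 bits; a uniformly mixed one contributes 0.
print(information_content(["ACGT", "ACGT"]))
print(information_content(["A", "C", "G", "T"]))
```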
19Sequence logo
20Positional weight matrix (PWM)
21- Probabilistic motivation: log-likelihood (up to a linear transformation)
- More probabilistic motivation: z-score (with the suitable base of the logarithm)
- Thermodynamical motivation: free energy (assuming independence of positions, up to a linear transformation)
- Pseudocounts
22Logical rules, trees etc.
23Compilation of samples
- Initial sample
- GenBank
- specialized databases
- literature (reviews)
- literature (original papers)
- Correction of GenBank errors
- Checking the literature
- removal of predicted sites
- Removal of duplicates
24Re-alignment approaches
- Initial alignment by a biological landmark
- start of transcription for promoters
- start codon for ribosome binding sites
- exon-intron boundary for splicing sites
- Deriving the signal within a sliding window
- Re-alignment
- etc. etc. until convergence
25Gene starts of Bacillus subtilis
- dnaN ACATTATCCGTTAGGAGGATAAAAATG
- gyrA GTGATACTTCAGGGAGGTTTTTTAATG
- serS TCAATAAAAAAAGGAGTGTTTCGCATG
- bofA CAAGCGAAGGAGATGAGAAGATTCATG
- csfB GCTAACTGTACGGAGGTGGAGAAGATG
- xpaC ATAGACACAGGAGTCGATTATCTCATG
- metS ACATTCTGATTAGGAGGTTTCAAGATG
- gcaD AAAAGGGATATTGGAGGCCAATAAATG
- spoVC TATGTGACTAAGGGAGGATTCGCCATG
- ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
- pabB AAAGAAAATAGAGGAATGATACAAATG
- rplJ CAAGAATCTACAGGAGGTGTAACCATG
- tufA AAAGCTCTTAAGGAGGATTTTAGAATG
- rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
- rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
- rplM AGATCATTTAGGAGGGGAAATTCAATG
26- dnaN ACATTATCCGTTAGGAGGATAAAAATG
- gyrA GTGATACTTCAGGGAGGTTTTTTAATG
- serS TCAATAAAAAAAGGAGTGTTTCGCATG
- bofA CAAGCGAAGGAGATGAGAAGATTCATG
- csfB GCTAACTGTACGGAGGTGGAGAAGATG
- xpaC ATAGACACAGGAGTCGATTATCTCATG
- metS ACATTCTGATTAGGAGGTTTCAAGATG
- gcaD AAAAGGGATATTGGAGGCCAATAAATG
- spoVC TATGTGACTAAGGGAGGATTCGCCATG
- ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
- pabB AAAGAAAATAGAGGAATGATACAAATG
- rplJ CAAGAATCTACAGGAGGTGTAACCATG
- tufA AAAGCTCTTAAGGAGGATTTTAGAATG
- rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
- rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
- rplM AGATCATTTAGGAGGGGAAATTCAATG
- cons. aaagtatataagggagggttaataATG
- num. 001000000000110110000000111
- 760666658967228106888659666
27- dnaN ACATTATCCGTTAGGAGGATAAAAATG
- gyrA GTGATACTTCAGGGAGGTTTTTTAATG
- serS TCAATAAAAAAAGGAGTGTTTCGCATG
- bofA CAAGCGAAGGAGATGAGAAGATTCATG
- csfB GCTAACTGTACGGAGGTGGAGAAGATG
- xpaC ATAGACACAGGAGTCGATTATCTCATG
- metS ACATTCTGATTAGGAGGTTTCAAGATG
- gcaD AAAAGGGATATTGGAGGCCAATAAATG
- spoVC TATGTGACTAAGGGAGGATTCGCCATG
- ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
- pabB AAAGAAAATAGAGGAATGATACAAATG
- rplJ CAAGAATCTACAGGAGGTGTAACCATG
- tufA AAAGCTCTTAAGGAGGATTTTAGAATG
- rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
- rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
- rplM AGATCATTTAGGAGGGGAAATTCAATG
- cons. tacataaaggaggtttaaaaat
- num. 0000000111111000000001
- 5755779156663678679890
28Positional information content before and after
re-alignment
29Positional nucleotide frequencies after
re-alignment (aGGAGG pattern)
30Enhancement of a weak signal
31Deriving the signal ab initio
- Discrete (pattern-driven) approaches: word counting
- Continuous (profile-driven) approaches: optimization
32Word counting. Short words
- Consider all k-mers
- For each k-mer compute the number of sequences containing this k-mer
- (maybe with some mismatches)
- Select the most frequent k-mer
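The three steps above can be sketched as an exhaustive search over all 4^k words, which is feasible only for short k. The function names and the choice to return all tied winners are illustrative assumptions.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def sequence_support(word, sequences, max_mismatches=0):
    """Number of sequences containing the word with at most max_mismatches mismatches."""
    k = len(word)
    return sum(
        any(hamming(word, s[i:i + k]) <= max_mismatches
            for i in range(len(s) - k + 1))
        for s in sequences
    )


def most_frequent_kmers(sequences, k, max_mismatches=0):
    """Enumerate all 4**k words and return (list of best words, their support)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    scored = [(sequence_support(w, sequences, max_mismatches), w) for w in words]
    best = max(score for score, _ in scored)
    return [w for score, w in scored if score == best], best


# Toy sample (made up for illustration): ACGT is the only 4-mer shared by all three.
print(most_frequent_kmers(["TTACGTAA", "CCACGTGG", "GACGTCAG"], k=4))
```

The cost grows as 4^k times the total sequence length, which is the reason a complete search is limited to short words.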
33- Problem: complete search is possible only for short words
- Assumption: if a long word is over-represented, its subwords are also over-represented
- Solution: select a set of over-represented words and combine them into longer words
34Word counting. Long words
- Consider some k-mers
- For each k-mer compute the number of sequences containing this k-mer
- (maybe with some mismatches)
- Select the most frequent k-mer
35- Problem: what k-tuples to start with?
- 1st attempt: those actually occurring in the sample.
- But the correct signal (the consensus word) may not be among them.
36- 2nd attempt: those actually occurring in the sample and some neighborhood.
- But
- again, the correct signal (the consensus word) may not be among them
- the size of the neighborhood grows exponentially
37Graph approach
- Each k-mer in each sequence corresponds to a vertex. Two k-mers are linked by an arc if they differ in at most h positions (h << k).
- Thus we obtain an n-partite graph (n is the number of sequences).
- A signal corresponds to a clique (a complete subgraph), or at least a dense subgraph, with vertices in each part.
38A simple algorithm
- Remove vertices that cannot be extended to complete subgraphs
- that is, do not have arcs to all parts of the graph
- Remove pairs that cannot be extended
- that is, do not form triangles with the third vertex in all parts of the graph
- Etc.
- (will not work as is for dense subgraphs)
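The first pruning step (removing vertices that have no arc into some part of the graph) can be sketched as below. It covers only the vertex-removal rule, repeated until nothing changes, not the pair/triangle step. The graph is kept implicit: arcs are recomputed from Hamming distances rather than stored. All names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def prune_vertices(sequences, k, h):
    """Iteratively drop k-mers that lack a neighbour (<= h mismatches) in some other sequence.

    Returns one set of surviving start positions per sequence (one set per graph part).
    """
    alive = [set(range(len(s) - k + 1)) for s in sequences]
    changed = True
    while changed:
        changed = False
        for a, seq_a in enumerate(sequences):
            for i in sorted(alive[a]):
                word = seq_a[i:i + k]
                has_arc_to_every_part = all(
                    any(hamming(word, seq_b[j:j + k]) <= h for j in alive[b])
                    for b, seq_b in enumerate(sequences) if b != a
                )
                if not has_arc_to_every_part:
                    alive[a].discard(i)
                    changed = True
    return alive


# Toy sample (made up): variants of ACGTAC planted at positions 3, 2 and 4.
seqs = ["TTTACGTACTT", "GGACGTTCGGG", "CCCCACCTACC"]
print(prune_vertices(seqs, k=6, h=2))
```

Vertices that belong to a clique spanning all parts always survive this filter, but the survivors are not guaranteed to form a clique, so the result is a reduced candidate set rather than the signal itself.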
39Optimization. EM algorithms
- Generate an initial set of profiles (e.g. seed with all k-mers)
- For each profile
- find the best (highest scoring) representative in each sequence
- update the profile
- Iterate until convergence
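For a single seed, the loop above can be sketched as follows. This is the "best representative per sequence" iteration as listed on the slide, not a full probabilistic EM implementation. The log-frequency scoring, the pseudocount of 0.5 and the function names are assumptions.

```python
import math
from collections import Counter


def build_profile(sites, pseudocount=0.5):
    """Column-wise log-frequencies with pseudocounts (assumed value 0.5)."""
    n = len(sites)
    profile = []
    for col in zip(*sites):
        counts = Counter(col)
        profile.append({b: math.log((counts[b] + pseudocount) / (n + 4 * pseudocount))
                        for b in "ACGT"})
    return profile


def best_site(profile, seq):
    """Start position of the highest-scoring window in seq."""
    k = len(profile)
    scores = [sum(col[b] for col, b in zip(profile, seq[i:i + k]))
              for i in range(len(seq) - k + 1)]
    return max(range(len(scores)), key=scores.__getitem__)


def refine(seed, sequences, max_iter=50):
    """Iterate 'best representative per sequence, then update profile' until stable."""
    k = len(seed)
    profile = build_profile([seed])
    positions = None
    for _ in range(max_iter):
        new_positions = [best_site(profile, s) for s in sequences]
        if new_positions == positions:  # converged: the chosen sites no longer change
            break
        positions = new_positions
        profile = build_profile([s[p:p + k] for s, p in zip(sequences, positions)])
    return positions


# Toy sample (made up): AGGAGG planted at positions 4, 1 and 8.
print(refine("AGGAGG", ["TTTTAGGAGGTTTT", "TAGGAGGTTTTTTT", "TTTTTTTTAGGAGG"]))
```

Running `refine` once per seed k-mer and keeping the profile with the highest information content would correspond to the "seed with all k-mers" variant.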
40- This algorithm converges.
- However, it cannot leave the basin of attraction.
- Thus, if the initial approximation is bad, it will converge to nonsense.
- Solution: stochastic optimization.
41Simulated annealing
- Goal: maximize the information content I
- $I = \sum_j \sum_b f(b,j)\,\log\frac{f(b,j)}{p(b)}$
- or any other measure of homogeneity of the sites
42- Let A be the current signal (set of candidate sites), and let I(A) be the corresponding information content.
- Let B be a set of sites obtained by randomly choosing a different site in one sequence, and let I(B) be its information content.
- If I(B) ≥ I(A), B is accepted.
- If I(B) < I(A), B is accepted with probability
- $P = \exp\big((I(B) - I(A)) / T\big)$
- The temperature T decreases exponentially, but slowly; the initial temperature is chosen such that almost all changes are accepted.
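A compact sketch of this acceptance rule is given below. The starting temperature, the cooling factor, the number of steps and the uniform background in `info_content` are illustrative assumptions, not values from the slides. The sketch also remembers the best configuration seen, which is a convenience added here.

```python
import math
import random
from collections import Counter


def info_content(sites):
    """Information content in bits, assuming a uniform background p(b) = 0.25."""
    n = len(sites)
    return sum((c / n) * math.log2((c / n) / 0.25)
               for col in zip(*sites) for c in Counter(col).values())


def anneal(sequences, k, t_start=2.0, cooling=0.999, steps=5000, seed=0):
    """Simulated annealing over site positions; returns (best positions, best I)."""
    rng = random.Random(seed)

    def sites_at(positions):
        return [s[i:i + k] for s, i in zip(sequences, positions)]

    pos = [rng.randrange(len(s) - k + 1) for s in sequences]
    current = info_content(sites_at(pos))
    best, best_pos = current, list(pos)
    temp = t_start
    for _ in range(steps):
        # Candidate B: re-draw the site in one randomly chosen sequence.
        cand = list(pos)
        idx = rng.randrange(len(sequences))
        cand[idx] = rng.randrange(len(sequences[idx]) - k + 1)
        value = info_content(sites_at(cand))
        # Accept improvements always, deteriorations with P = exp((I(B) - I(A)) / T).
        if value >= current or rng.random() < math.exp((value - current) / temp):
            pos, current = cand, value
            if current > best:
                best, best_pos = current, list(pos)
        temp *= cooling  # exponential, slow cooling
    return best_pos, best


seqs = ["TTTTAGGAGGTTTT", "TAGGAGGTTTTTTT", "TTTTTTTTAGGAGG"]  # toy sample (made up)
print(anneal(seqs, k=6))
```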
43Gibbs sampler
- Again, A is a signal (set of sites), and I(A) is its information content.
- At each step a new site is selected in one sequence with probability
- $P \propto \exp(I(A_{new}))$
- For each candidate site the total time of occupation is computed.
- (Note that the signal changes all the time.)
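The sampling step can be sketched as follows, under the assumption that the selection probability is proportional to exp(I(A_new)), where A_new is the current set of sites with the candidate substituted in. The uniform background, the step count and all names are illustrative. Published Gibbs samplers for motif finding typically score the candidate against a profile built from the other sequences, which this sketch does not do.

```python
import math
import random
from collections import Counter


def info_content(sites):
    """Information content in bits, assuming a uniform background p(b) = 0.25."""
    n = len(sites)
    return sum((c / n) * math.log2((c / n) / 0.25)
               for col in zip(*sites) for c in Counter(col).values())


def gibbs(sequences, k, steps=2000, seed=0):
    """Return, per sequence, a Counter mapping start position to occupation time."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in sequences]
    occupancy = [Counter() for _ in sequences]
    for _ in range(steps):
        idx = rng.randrange(len(sequences))
        starts = list(range(len(sequences[idx]) - k + 1))
        weights = []
        for start in starts:
            trial = list(pos)
            trial[idx] = start
            sites = [s[i:i + k] for s, i in zip(sequences, trial)]
            weights.append(math.exp(info_content(sites)))  # assumed: P proportional to exp(I)
        pos[idx] = rng.choices(starts, weights=weights)[0]
        # The signal keeps changing, so sites are ranked by time spent occupied.
        for n, p in enumerate(pos):
            occupancy[n][p] += 1
    return occupancy


seqs = ["TTTTAGGAGGTTTT", "TAGGAGGTTTTTTT", "TTTTTTTTAGGAGG"]  # toy sample (made up)
for occ in gibbs(seqs, k=6):
    print(occ.most_common(1))
```

Because the sampler never settles on one configuration, the output is the occupation profile, from which the most frequently occupied site in each sequence can be read off.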
44Use of symmetry
- DNA-binding factors and their signals
- Co-operative homogeneous
- Palindromes
- Repeats
- Co-operative non-homogeneous
- Cassettes
- Others
- RNA signals
45Recognition PWM/profiles
- The simplest technique: positional nucleotide weights are
- $W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
- Score of a candidate site $b_1 \dots b_k$ is the sum of the corresponding positional nucleotide weights
- $S(b_1 \dots b_k) = \sum_{j=1}^{k} W(b_j, j)$
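The weight and score formulas translate almost line by line into code. In the sketch below N(b,j) is the count of nucleotide b at position j of the training sites. The function names and the toy training set are illustrative.

```python
import math
from collections import Counter


def positional_weights(sites):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    weights = []
    for col in zip(*sites):
        counts = Counter(col)
        logs = {b: math.log(counts[b] + 0.5) for b in "ACGT"}
        mean = 0.25 * sum(logs.values())
        weights.append({b: logs[b] - mean for b in "ACGT"})
    return weights


def score(weights, word):
    """S(b_1 ... b_k) = sum over j of W(b_j, j)."""
    return sum(col[b] for col, b in zip(weights, word))


def scan(weights, sequence):
    """Score every window of the sequence; returns (start, score) pairs."""
    k = len(weights)
    return [(i, score(weights, sequence[i:i + k]))
            for i in range(len(sequence) - k + 1)]


w = positional_weights(["AGGAGG", "AGGAGG", "AGGTGG"])  # toy training sites (made up)
print(max(scan(w, "CCAGGAGGCC"), key=lambda hit: hit[1]))
```

Subtracting the column mean makes the four weights at each position sum to zero, so a score near zero corresponds to a window that is neither favoured nor disfavoured by the profile.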
46Distribution of RBS profile scores on sites
(green) and non-sites (red)
47Pattern recognition
- Linear discriminant analysis
- Logical rules
- Syntactic analysis
- Context-sensitive grammars
- Perceptron
- Neural networks
48Neural networks architecture
- 4×k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position
- OR 2×k neurons (one discriminates between purines and pyrimidines, the other between AT and GC)
- One or more layers of hidden neurons
- One output neuron
49- Each neuron is connected to all neurons of the next layer
- Each connection is ascribed a numerical weight
- A neuron
- Sums the signals at incoming connections
- Compares the total with the threshold (or transforms it according to a fixed function)
- If the threshold is passed, excites the outgoing connections (resp. sends the modified value)
50Training
- Sites and non-sites from the training sample are presented one by one.
- The output neuron produces the prediction.
- The connection weights and thresholds are modified if the prediction is incorrect.
- Networks differ by architecture, particulars of the signal processing, the training schedule
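The simplest instance of this scheme is a single-neuron perceptron with 4×k binary inputs and no hidden layer. The sketch below uses the classical perceptron update rule. The learning rate, epoch limit, function names and toy training data are assumptions made for illustration.

```python
def one_hot(word):
    """4 x k binary inputs: one sensor per nucleotide per position."""
    return [1.0 if base == b else 0.0 for base in word for b in "ACGT"]


def train_perceptron(sites, non_sites, epochs=100, rate=1.0):
    """Present samples one by one; adjust weights and threshold only on errors."""
    samples = [(one_hot(w), 1) for w in sites] + [(one_hot(w), 0) for w in non_sites]
    weights = [0.0] * len(samples[0][0])
    threshold = 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in samples:
            fired = int(sum(w * v for w, v in zip(weights, x)) > threshold)
            if fired != label:
                errors += 1
                delta = rate * (label - fired)
                weights = [w + delta * v for w, v in zip(weights, x)]
                threshold -= delta
        if errors == 0:  # every training sample is classified correctly
            break
    return weights, threshold


def predict(weights, threshold, word):
    """True if the output neuron fires for this candidate site."""
    return sum(w * v for w, v in zip(weights, one_hot(word))) > threshold


sites = ["AGGAGG", "AGGTGG", "TGGAGG"]      # toy positives (made up)
non_sites = ["TTTTTT", "CCCCCC", "ACACAC"]  # toy negatives (made up)
weights, threshold = train_perceptron(sites, non_sites)
print([predict(weights, threshold, w) for w in sites + non_sites])
```

A single neuron can only separate classes linearly in the one-hot space, so it cannot capture dependencies between positions. Hidden layers are what lift that restriction.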
51Use of sequence context
- Presence of multiple co-operative sites
- ArgR (E. coli), purine regulator (Pyrococcus)
- XylR-CRP, CytR-CRP (E. coli)
- MEF-MyoD in muscle-specific promoters (mammals)
- Location relative to promoters
- repressors vs. activators
52Benchmarking
- Difficult, because
- Different algorithms are optimized for different performance parameters
- Incompatible training sets
- Difficult to construct a homogeneous and unambiguous testing set
- Unobserved sites
- Competition between closely located sites
- Activation in specific conditions
- Non-specific binding (52 out of 54 candidate HNF-1 binding sites do bind the factor)
53Promoters of E. coli
- PWM at a false positive rate of 1 per 2000 bp:
- 25% of all promoters,
- 60% of constitutive (non-activated) promoters
- PWMs perform as well as neural networks
54Eukaryotic promoters
55Ribosome binding sites
- Information content of the profile predicts the
average reliability of predictions
56CRP (E. coli)
57Comparative approach to the analysis of regulation
- Making good predictions
- with bad rules
58Regulation of transcription in prokaryotes
- Difficult
- Small sample size
- Weak signals (or we do not know what features are
relevant, maybe the DNA structure)
59CRP (E. coli)
60GenBank entry for the E. coli genome
61- Many genomes are available => comparative approach
- Basic assumption
- Regulons (sets of co-regulated genes) are conserved
- well, in some cases
- in fact, in many cases
62Corollary: the consistency check
- True sites occur upstream of orthologous genes
- False sites are scattered at random
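The consistency check can be sketched as a filter over candidate sites. The data structures, the ortholog-group identifiers and the `min_genomes` cut-off below are hypothetical, and the gene names in the example are placeholders rather than results.

```python
from collections import defaultdict


def consistent_sites(candidates, ortholog_group, min_genomes=2):
    """Keep genes whose ortholog group has candidate sites in >= min_genomes genomes.

    candidates:     {genome: set of genes with a candidate site upstream}
    ortholog_group: {(genome, gene): group id}; genes without an entry are dropped
    """
    genomes_per_group = defaultdict(set)
    for genome, genes in candidates.items():
        for gene in genes:
            group = ortholog_group.get((genome, gene))
            if group is not None:
                genomes_per_group[group].add(genome)

    def supported(genome, gene):
        group = ortholog_group.get((genome, gene))
        return len(genomes_per_group.get(group, ())) >= min_genomes

    return {genome: {g for g in genes if supported(genome, g)}
            for genome, genes in candidates.items()}


# Hypothetical input: 'geneX' and 'geneY' stand for hits with no counterpart elsewhere.
candidates = {"genome1": {"purC", "geneX"}, "genome2": {"purC", "geneY"}}
groups = {("genome1", "purC"): 1, ("genome2", "purC"): 1,
          ("genome1", "geneX"): 2, ("genome2", "geneY"): 3}
print(consistent_sites(candidates, groups))
```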
63Orthologs
- Orthologous genes
- diverged by speciation
- retain cellular role
- Paralogous genes
- diverged by duplication
- retain biochemical function only
64Orthology (definition)
[Figure: two genomes drawn as black pipes; a duplication followed by speciation gives genes A1 and B1 in Genome 1 and genes A2 and B2 in Genome 2]
- Genomes are shown as black pipes
- 1st event: duplication
- 2nd event: speciation
- Genes of the same color are orthologous
- Genes of different color are paralogous
- A1 and A2 are orthologs, B1 and B2 are orthologs, all other pairs are paralogs
65Search for orthologs (fast and dirty)
66The basic procedure
[Figure: the procedure applied to Genome 1, Genome 2, ..., Genome N]
67Accounting for the operon structure
68Checklist
- Presence of orthologous transcription factors
- Really orthologous (BETs, COGs etc. are not sufficient)
- Conservation of the DNA-binding domain
- Conservation of the core pathway
69Purine regulons of E. coli and H. influenzae
70Predicted purine transporters
71Changes in the operon structure: more examples
- glnK-amtB loci of methanogenic archaebacteria
72Tryptophan operons
73Heat shock (HrcA) regulons / CIRCE elements
74Closely related genomes: phylogenetic footprinting
- Regulatory sites are more conserved than
non-coding regions in general and are often seen
as conserved islands in alignments of gene
upstream regions.
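A minimal way to look for such conserved islands, assuming the upstream regions have already been aligned to equal length (gaps written as "-"), is to report runs of identical columns. The minimum island length of 6 and the toy alignment are illustrative assumptions. Real footprinting would normally tolerate some mismatches and take phylogenetic distances into account.

```python
def conserved_islands(aligned, min_length=6):
    """Return (start, end) column ranges where all aligned sequences agree (no gaps).

    'end' is exclusive, so aligned[0][start:end] is the conserved segment.
    """
    islands = []
    start = None
    for j, col in enumerate(zip(*aligned)):
        same = len(set(col)) == 1 and col[0] != "-"
        if same and start is None:
            start = j
        elif not same and start is not None:
            if j - start >= min_length:
                islands.append((start, j))
            start = None
    # Close an island that runs to the end of the alignment.
    if start is not None and len(aligned[0]) - start >= min_length:
        islands.append((start, len(aligned[0])))
    return islands


# Toy alignment (made up): only TTGACA in columns 3..8 is identical in all rows.
alignment = ["ACGTTGACAACGTA", "CATTTGACACATGC", "GTATTGACAGTCAG"]
for start, end in conserved_islands(alignment):
    print(start, end, alignment[0][start:end])
```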
75High conservation
76Low conservation
77Degeneration of sites
78Problems and solutions
- Unique members of regulons may be lost: use of additional genomes decreases the number of orphan regulon members.
- Closely related factors may have similar sites: careful analysis of function and analysis of particular sites is usually sufficient to resolve ambiguities.
- Too many genomes and regulons: apply preliminary automated screening.
79Modification: ubiquitous regulators
- Present in many genomes
- Only core regulon is conserved
- Mode of regulation may vary
- Signals may be slightly different
80Arginine repressor ArgR/AhrC
81ABC transporters (periplasmic components)
82Modification: horizontal transfer
- Impossible to resolve the orthology relationships: a homologous regulated gene is sufficient for corroboration
- Often regulate large loci (several adjacent operons)
- Signals are mainly conserved
83New signals
- Select a group of related genomes
- In each genome select metabolically related genes
- Add possibly co-transcribed genes
- Compare upstream regions for each genome independently
- Construct profiles
- Compare constructed profiles: if similar, then relevant
84The purine regulon of Pyrococcus spp.
- Use functional annotation and COGs to select genes encoding enzymes from the purine pathway: purA, purB, purC, purF, purD, purE, purL-I, purL-II, purT, guaA.
- Construct profiles for each genome. The quality of profiles is weak (< 1 bit/position).
- However, the profiles are almost identical.
- There is no significant similarity of upstream regions (outside sites). Thus the profiles are probably correct.
- Low specificity of profiles, thus >300 candidate genes in each genome.
- Observation: in upstream regions of all genes from the initial sample the candidate sites occur twice with a 22 bp spacer.
- The new rule is absolutely specific: only one additional gene in each genome.
85[Figure: tree diagram with numeric branch labels (e.g. 996, 714, 1000). Leaf labels: 2895752_Ef, PyrP_Bc, UraA_Ec, PyrP_Bs, UAPA_En, UraA_Hi, UAPC_En, YcdG_Ec, YgfO, YicE, YtiP_Bs, PH, YgfU, 2239289_Bs, PA A, 2635740_Bs, PF, 2314333_Hp, 2635741_Bs, 2689889_Bb, YjcD_Hi, Y326_Mj, PbuX_Bs, 2689890_Bb, YjcD, YgfQ, YieG, YicO]
86Sources
- G. Stormo
- J. Fickett
- W. Miller
- I. Dubchak
- Yuh et al. (1998)
- Tronche et al. (1997)
- textbooks
87Discussions and collaboration
- Farid Chetouani (Institut Pasteur)
- Eugene Koonin (NCBI)
- Yuri Kozlov (Ajinomoto)
- Leonid Mirny (Harvard - MIT)
- Alexander Mironov (GosNIIGenetika)
- Vasily Lybetsky (Inst. Probl. Inform. Trans.)
- Andrey Osterman (IntegratedGenomics)
- Danila Perumov (Inst. Nucl. Phys.)
- Pavel Pevzner (UC San Diego)
- Michael Roytberg (Inst. Math. Probl. Biol.)
88Collaborators
- Andrey A. Mironov
- A. B. Rakhmaninova
- Vadim Brodyansky
- Lyudmila Danilova
- Anna Gerasimova
- Alexey Kazakov
- Ekaterina Kotelnikova
- Olga Laikova
- Pavel Novichkov
- Ekaterina Panina
- Elya Permina
- Dmitry Ravcheev
- Dmitry Rodionov
- Natalya Sadovskaya
- Alexey Vitreschak