Recognition of regulatory signals presentation

About This Presentation

Title:

Recognition of regulatory signals

Description:

Important for practice (in addition to metabolic reconstruction) ... Heat shock (HrcA) regulons / CIRCE elements. Closely related genomes: Phylogenetic footprinting ...

Slides: 89

Provided by: Gelf6

Transcript and Presenter's Notes

Title: Recognition of regulatory signals

1
Recognition of regulatory signals

Mikhail S. Gelfand
IntegratedGenomics-Moscow
NATO ASI School, October 2001

2
Why?

Additional annotation tool (e.g. specificity of
transporters and enzymes from large families)
Important for practice (in addition to metabolic
reconstruction)
Interesting from the evolutionary point of view

3
Overview

0. Biological introduction
1. Algorithms
Representation of signals
Deriving the signal
Site recognition
2. Comparative genomics
Phylogenetic footprinting
Consistency filtering

4
Some biology

Transcription (DNA → RNA)
Splicing (pre-mRNA → mRNA)
Translation (mRNA → protein)
Regulation of transcription in prokaryotes
and eukaryotes
Initiation of translation

5
Transcription and translation in prokaryotes
6
Initiation of transcription (bacteria)
7
Translation in prokaryotes
8
Translation (details)
9
Splicing (eukaryotes)
10
Regulation of transcription in prokaryotes
11
Structure of DNA-binding domain. Example 1
12
Structure of DNA-binding domain. Example 2
13
Protein-DNA interactions
14
Regulation of transcription in eukaryotes
15
Representation of signals

Consensus
Pattern (consensus with degenerate positions)
Positional weight matrix (PWM, or profile)
Logical rules
RNA signals

16
Consensus

codB CCCACGAAAACGATTGCTTTTT
purE GCCACGCAACCGTTTTCCTTGC
pyrD GTTCGGAAAACGTTTGCGTTTT
purT CACACGCAAACGTTTTCGTTTA
cvpA CCTACGCAAACGTTTTCTTTTT
purC GATACGCAAACGTGTGCGTCTG
purM GTCTCGCAAACGTTTGCTTTCC
purH GTTGCGCAAACGTTTTCGTTAC
purL TCTACGCAAACGGTTTCGTCGG
consensus ACGCAAACGTTTTCGT
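
A minimal Python sketch of how such a consensus could be derived: take the most frequent nucleotide in each column of the alignment. The sequences are copied from the slide. The slice `[3:19]`, which trims the 22-nt alignment to the 16-nt core, is an assumption inferred from the alignment, not something the slide states.

```python
from collections import Counter

# Aligned sites copied from the slide (22 nt each).
SITES = [
    "CCCACGAAAACGATTGCTTTTT",  # codB
    "GCCACGCAACCGTTTTCCTTGC",  # purE
    "GTTCGGAAAACGTTTGCGTTTT",  # pyrD
    "CACACGCAAACGTTTTCGTTTA",  # purT
    "CCTACGCAAACGTTTTCTTTTT",  # cvpA
    "GATACGCAAACGTGTGCGTCTG",  # purC
    "GTCTCGCAAACGTTTGCTTTCC",  # purM
    "GTTGCGCAAACGTTTTCGTTAC",  # purH
    "TCTACGCAAACGGTTTCGTCGG",  # purL
]

def consensus(seqs):
    """Return the most frequent nucleotide of each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

# Assumed core boundaries (columns 3..18), chosen to match the slide.
core = consensus(SITES)[3:19]
print(core)
```

Flanking columns outside the core are only weakly conserved, so their majority nucleotide is not a reliable part of the signal.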

17
Pattern

codB CCCACGAAAACGATTGCTTTTT
purE GCCACGCAACCGTTTTCCTTGC
pyrD GTTCGGAAAACGTTTGCGTTTT
purT CACACGCAAACGTTTTCGTTTA
cvpA CCTACGCAAACGTTTTCTTTTT
purC GATACGCAAACGTGTGCGTCTG
purM GTCTCGCAAACGTTTGCTTTCC
purH GTTGCGCAAACGTTTTCGTTAC
purL TCTACGCAAACGGTTTCGTCGG
consensus ACGCAAACGTTTTCGT
pattern aCGmAAACGtTTkCkT
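
One possible rule that turns the alignment into a degenerate pattern, as a hedged sketch: uppercase when one nucleotide occurs in at least 8 of the 9 sites, a lowercase IUPAC two-letter code when two nucleotides together cover at least 8 sites and the second occurs at least twice, otherwise the lowercase majority nucleotide. The thresholds `strong` and `minor` and the core slice `[3:19]` are assumptions chosen for this example; the slide does not say how its pattern was derived.

```python
from collections import Counter

# Aligned sites copied from the slide (22 nt each).
SITES = [
    "CCCACGAAAACGATTGCTTTTT", "GCCACGCAACCGTTTTCCTTGC",
    "GTTCGGAAAACGTTTGCGTTTT", "CACACGCAAACGTTTTCGTTTA",
    "CCTACGCAAACGTTTTCTTTTT", "GATACGCAAACGTGTGCGTCTG",
    "GTCTCGCAAACGTTTGCTTTCC", "GTTGCGCAAACGTTTTCGTTAC",
    "TCTACGCAAACGGTTTCGTCGG",
]

# IUPAC codes for two-nucleotide ambiguity (written in lowercase).
IUPAC2 = {
    frozenset("AC"): "m", frozenset("AG"): "r", frozenset("AT"): "w",
    frozenset("CG"): "s", frozenset("CT"): "y", frozenset("GT"): "k",
}

def pattern(seqs, strong=8, minor=2):
    """Build a degenerate pattern; `strong` and `minor` are assumed count thresholds."""
    out = []
    for col in zip(*seqs):
        ranked = Counter(col).most_common()
        b1, n1 = ranked[0]
        b2, n2 = ranked[1] if len(ranked) > 1 else (None, 0)
        if n1 >= strong:
            out.append(b1)                             # strongly conserved
        elif n2 >= minor and n1 + n2 >= strong:
            out.append(IUPAC2[frozenset((b1, b2))])    # two-nucleotide position
        else:
            out.append(b1.lower())                     # weak majority
    return "".join(out)

core_pattern = pattern([s[3:19] for s in SITES])
print(core_pattern)
```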

18
Frequency matrix
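Information content
$I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
where f(b,j) is the frequency of nucleotide b at position j and p(b) is its background frequency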
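
A short Python sketch of the information content computed per position from a set of aligned sites. It assumes a uniform background p(b) = 0.25 unless another one is passed, uses log base 2 (bits), and treats terms with f(b,j) = 0 as zero. The four toy sites are made up for illustration.

```python
import math

def column_information(column, background=None):
    """Information content (bits) of one alignment column."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(column)
    info = 0.0
    for b in set(column):            # nucleotides with f = 0 contribute nothing
        f = column.count(b) / n
        info += f * math.log2(f / background[b])
    return info

def information_content(sites, background=None):
    """Per-position information content of equally long aligned sites."""
    return [column_information(col, background) for col in zip(*sites)]

per_position = information_content(["ACGT", "ACGA", "ACCT", "ACGT"])
total = sum(per_position)
print(per_position, total)
```

With a uniform background a fully conserved position carries 2 bits and a position with all four nucleotides equally frequent carries 0 bits.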
19
Sequence logo
20
Positional weight matrix (PWM)
21

Probabilistic motivation: log-likelihood (up to a
linear transformation)
More probabilistic motivation: z-score (with the
suitable base of the logarithm)
Thermodynamical motivation: free energy (assuming
independence of positions, up to a linear
transformation)
Pseudocounts

22
Logical rules, trees etc.
23
Compilation of samples

Initial sample
GenBank
specialized databases
literature (reviews)
literature (original papers)
Correction of GenBank errors
Checking the literature
removal of predicted sites
Removal of duplicates

24
Re-alignment approaches

Initial alignment by a biological landmark
start of transcription for promoters
start codon for ribosome binding sites
exon-intron boundary for splicing sites
Deriving the signal within a sliding window
Re-alignment
etc. etc. until convergence

25
Gene starts of Bacillus subtilis

dnaN ACATTATCCGTTAGGAGGATAAAAATG
gyrA GTGATACTTCAGGGAGGTTTTTTAATG
serS TCAATAAAAAAAGGAGTGTTTCGCATG
bofA CAAGCGAAGGAGATGAGAAGATTCATG
csfB GCTAACTGTACGGAGGTGGAGAAGATG
xpaC ATAGACACAGGAGTCGATTATCTCATG
metS ACATTCTGATTAGGAGGTTTCAAGATG
gcaD AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
pabB AAAGAAAATAGAGGAATGATACAAATG
rplJ CAAGAATCTACAGGAGGTGTAACCATG
tufA AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM AGATCATTTAGGAGGGGAAATTCAATG

dnaN ACATTATCCGTTAGGAGGATAAAAATG
gyrA GTGATACTTCAGGGAGGTTTTTTAATG
serS TCAATAAAAAAAGGAGTGTTTCGCATG
bofA CAAGCGAAGGAGATGAGAAGATTCATG
csfB GCTAACTGTACGGAGGTGGAGAAGATG
xpaC ATAGACACAGGAGTCGATTATCTCATG
metS ACATTCTGATTAGGAGGTTTCAAGATG
gcaD AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
pabB AAAGAAAATAGAGGAATGATACAAATG
rplJ CAAGAATCTACAGGAGGTGTAACCATG
tufA AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM AGATCATTTAGGAGGGGAAATTCAATG
cons. aaagtatataagggagggttaataATG
num. 001000000000110110000000111
760666658967228106888659666

dnaN ACATTATCCGTTAGGAGGATAAAAATG
gyrA GTGATACTTCAGGGAGGTTTTTTAATG
serS TCAATAAAAAAAGGAGTGTTTCGCATG
bofA CAAGCGAAGGAGATGAGAAGATTCATG
csfB GCTAACTGTACGGAGGTGGAGAAGATG
xpaC ATAGACACAGGAGTCGATTATCTCATG
metS ACATTCTGATTAGGAGGTTTCAAGATG
gcaD AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
pabB AAAGAAAATAGAGGAATGATACAAATG
rplJ CAAGAATCTACAGGAGGTGTAACCATG
tufA AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM AGATCATTTAGGAGGGGAAATTCAATG
cons. tacataaaggaggtttaaaaat
num. 0000000111111000000001
5755779156663678679890

28
Positional information content before and after
re-alignment
29
Positional nucleotide frequencies after
re-alignment (aGGAGG pattern)
30
Enhancement of a weak signal
31
Deriving the signal ab initio

Discrete (pattern-driven) approaches: word
counting
Continuous (profile-driven) approaches:
optimization

32
Word counting. Short words

Consider all k-mers
For each k-mer compute the number of sequences
containing this k-mer
(maybe with some mismatches)
Select the most frequent k-mer

Problem: complete search is possible only for
short words
Assumption: if a long word is over-represented,
its subwords are also over-represented
Solution: select a set of over-represented words
and combine them into longer words
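
The short-word counting step could look like this in Python. This is a sketch with exact matches only (no mismatches), and the four toy sequences are invented for illustration.

```python
from collections import Counter

def kmer_sequence_counts(seqs, k):
    """For every k-mer, count the number of sequences that contain it."""
    counts = Counter()
    for seq in seqs:
        # A set, so that each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

# Hypothetical upstream fragments with a shared word.
SEQS = ["TTAGGAGGTC", "CAGGAGGATA", "GGAGGTTTTA", "ACCTAGGAGG"]

counts = kmer_sequence_counts(SEQS, 5)
best = counts.most_common(1)[0]
print(best)
```

Counting sequences rather than occurrences keeps a word repeated many times in one sequence from dominating the ranking.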

34
Word counting. Long words

Consider some k-mers
For each k-mer compute the number of sequences
containing this k-mer
(maybe with some mismatches)
Select the most frequent k-mer

Problem: which k-mers to start with?
1st attempt: those actually occurring in the
sample.
But the correct signal (the consensus word) may
not be among them.

2nd attempt: those actually occurring in the
sample and some neighborhood.
But
again, the correct signal (the consensus word)
may not be among them
the size of the neighborhood grows exponentially

37
Graph approach

Each k-mer in each sequence corresponds to a
vertex. Two k-mers are linked by an arc, if they
differ in at most h positions (h << k).
Thus we obtain an n-partite graph (n is the
number of sequences).
A signal corresponds to a clique (a complete
subgraph) or at least a dense subgraph with
vertices in each part.

38
A simple algorithm

Remove vertices that cannot be extended to
complete subgraphs
that is, do not have arcs to all parts of the
graph
Remove pairs that cannot be extended
that is, do not form triangles with the third
vertex in all parts of the graph
Etc.
(will not work as is for dense subgraphs)
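
A sketch of the first pruning step in Python, under the assumption that vertices are (sequence index, position) pairs and arcs join k-mers from different sequences that differ in at most h positions. Only the vertex-removal step is shown; the pair and triangle steps are omitted. The toy sequences are invented.

```python
def hamming(a, b):
    """Number of mismatching positions between two equally long words."""
    return sum(x != y for x, y in zip(a, b))

def prune_vertices(seqs, k, h):
    """Repeatedly drop k-mers that lack a neighbour in some other sequence."""
    n = len(seqs)
    alive = {(i, p) for i, s in enumerate(seqs) for p in range(len(s) - k + 1)}

    def word(v):
        return seqs[v[0]][v[1]:v[1] + k]

    changed = True
    while changed:
        changed = False
        for v in sorted(alive):
            # Parts (sequences) in which v still has at least one neighbour.
            parts = {u[0] for u in alive
                     if u[0] != v[0] and hamming(word(v), word(u)) <= h}
            if len(parts) < n - 1:
                alive.discard(v)
                changed = True
    return alive

# Hypothetical sequences with a planted word (GGAGG, one copy mutated).
SEQS = ["TTTTGGAGGTTTT", "CCCGGAGGCCCCC", "AAGGTGGAAAAAA"]
survivors = prune_vertices(SEQS, k=5, h=1)
print(sorted(survivors))
```

Surviving vertices are only candidates: overlapping shifted copies of the signal typically survive as well, which is why the further steps are needed.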

39
Optimization. EM algorithms

Generate an initial set of profiles (e.g. seed
with all k-mers)
For each profile
find the best (highest scoring) representative in
each sequence
update the profile
Iterate until convergence

This algorithm converges.
However, it cannot leave the basin of attraction.
Thus, if the initial approximation is bad, it
will converge to nonsense.
Solution: stochastic optimization.
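
The iteration described above (best representative in each sequence, update the profile, repeat) can be sketched in Python as follows. This is the hard-assignment variant exactly as listed on the slide, not a full expectation-maximization with weighted sites. The pseudocount 0.5, the single seed word and the toy sequences are assumptions for illustration.

```python
import math

def build_profile(sites, pseudo=0.5):
    """Log-frequency profile with pseudocounts from equally long sites."""
    profile = []
    for col in zip(*sites):
        total = len(col) + 4 * pseudo
        profile.append({b: math.log((col.count(b) + pseudo) / total) for b in "ACGT"})
    return profile

def best_site(seq, profile):
    """Start position of the highest-scoring window (leftmost on ties)."""
    k = len(profile)
    def score(p):
        return sum(profile[j][seq[p + j]] for j in range(k))
    return max(range(len(seq) - k + 1), key=score)

def refine(seqs, seed, max_iter=50):
    """Iterate site selection and profile update until the sites stop changing."""
    k = len(seed)
    profile = build_profile([seed])
    positions = None
    for _ in range(max_iter):
        new_positions = [best_site(s, profile) for s in seqs]
        if new_positions == positions:
            break
        positions = new_positions
        profile = build_profile([s[p:p + k] for s, p in zip(seqs, positions)])
    return positions, profile

SEQS = ["TTAGGAGGTC", "CAGGAGGATA", "GGAGGTTTTA", "ACCTAGGAGG"]
positions, profile = refine(SEQS, seed="GGAGG")
print(positions)
```

Running `refine` from every k-mer as a seed and keeping the best-scoring result is one way to reduce, though not remove, the dependence on the starting point.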

41
Simulated annealing

Goal: maximize the information content I
$I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
or any other measure of homogeneity of the sites

Let A be the current signal (set of candidate
sites), and let I(A) be the corresponding
information content.
Let B be a set of sites obtained by randomly
choosing a different site in one sequence, and
let I(B) be its information content.
if $I(B) \ge I(A)$, B is accepted
if $I(B) < I(A)$, B is accepted with probability
$P = \exp\left((I(B) - I(A)) / T\right)$
The temperature T decreases exponentially, but
slowly; the initial temperature is chosen such
that almost all changes are accepted.
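
A compact Python sketch of this annealing scheme. The starting temperature `t0`, the cooling factor, the step count and the uniform background are assumed values for illustration, and the random move may occasionally re-pick the current site.

```python
import math
import random

def information_content(sites):
    """Total information content (bits), uniform background p(b) = 0.25."""
    n = len(sites)
    total = 0.0
    for col in zip(*sites):
        for b in set(col):
            f = col.count(b) / n
            total += f * math.log2(f / 0.25)
    return total

def accept_probability(i_new, i_old, temperature):
    """Acceptance rule: always accept improvements, otherwise exp((I(B) - I(A)) / T)."""
    if i_new >= i_old:
        return 1.0
    return math.exp((i_new - i_old) / temperature)

def anneal(seqs, k, t0=5.0, cooling=0.99, steps=2000, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]

    def score(p):
        return information_content([s[q:q + k] for s, q in zip(seqs, p)])

    current, t = score(pos), t0
    for _ in range(steps):
        i = rng.randrange(len(seqs))              # pick one sequence
        cand = list(pos)
        cand[i] = rng.randrange(len(seqs[i]) - k + 1)
        new = score(cand)
        if rng.random() < accept_probability(new, current, t):
            pos, current = cand, new
        t *= cooling                              # exponential cooling
    return pos, current

SEQS = ["TTAGGAGGTC", "CAGGAGGATA", "GGAGGTTTTA", "ACCTAGGAGG"]
positions, final_info = anneal(SEQS, k=5)
print(positions, final_info)
```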

43
Gibbs sampler

Again, A is a signal (set of sites), and I(A) is
its information content.
At each step a new site is selected in one
sequence with probability
$P \propto \exp(I(A_{new}))$
For each candidate site the total time of
occupation is computed.
(Note that the signal changes all the time)
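
A Python sketch of such a sampler. The formula on the slide is cut off, so sampling with weights exp(I(A_new)) is this sketch's reading of it and should be treated as an assumption. The round-robin choice of sequence, the number of sweeps and the toy data are assumptions too.

```python
import math
import random

def information_content(sites):
    """Total information content (bits), uniform background p(b) = 0.25."""
    n = len(sites)
    total = 0.0
    for col in zip(*sites):
        for b in set(col):
            f = col.count(b) / n
            total += f * math.log2(f / 0.25)
    return total

def gibbs(seqs, k, sweeps=50, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    # occupancy[i][p]: number of steps during which sequence i held site p
    occupancy = [[0] * (len(s) - k + 1) for s in seqs]
    for step in range(sweeps * len(seqs)):
        i = step % len(seqs)                      # assumed: round-robin choice
        candidates = list(range(len(seqs[i]) - k + 1))
        weights = []
        for p in candidates:
            trial = pos[:i] + [p] + pos[i + 1:]
            sites = [s[q:q + k] for s, q in zip(seqs, trial)]
            weights.append(math.exp(information_content(sites)))
        pos[i] = rng.choices(candidates, weights=weights)[0]
        for j, q in enumerate(pos):
            occupancy[j][q] += 1
    return pos, occupancy

SEQS = ["TTAGGAGGTC", "CAGGAGGATA", "GGAGGTTTTA", "ACCTAGGAGG"]
positions, occupancy = gibbs(SEQS, k=5)
print([max(range(len(row)), key=row.__getitem__) for row in occupancy])
```

The printed values are the most occupied site per sequence, which is the natural prediction to report, since the current set of sites keeps changing.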

44
Use of symmetry

DNA-binding factors and their signals
Co-operative homogeneous
Palindromes
Repeats
Co-operative non-homogeneous
Cassettes
Others
RNA signals
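
For palindromic signals, symmetry can be checked by comparing a word with its reverse complement. A small Python sketch, applied to the purine consensus from the earlier slide; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(word):
    """Reverse complement of a DNA word."""
    return "".join(COMPLEMENT[b] for b in reversed(word))

def palindrome_mismatches(word):
    """Positions at which a word differs from its reverse complement."""
    return sum(a != b for a, b in zip(word, reverse_complement(word)))

consensus = "ACGCAAACGTTTTCGT"
mismatches = palindrome_mismatches(consensus)
print(reverse_complement(consensus), mismatches)
```

A small mismatch count relative to the word length suggests a palindromic signal. In that case each site can be used together with its reverse complement when deriving the profile.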

45
Recognition PWM/profiles

The simplest technique: positional nucleotide
weights are
$W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
Score of a candidate site $b_1 \dots b_k$ is the sum of the
corresponding positional nucleotide weights
$S(b_1 \dots b_k) = \sum_{j=1}^{k} W(b_j, j)$
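
These weights and the site score translate directly into Python. In this sketch N(b,j) is the count of nucleotide b at position j in the training sites. The training sites, the scanned sequence and the threshold are toy values.

```python
import math

def weight_matrix(sites):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    matrix = []
    for col in zip(*sites):
        logs = {b: math.log(col.count(b) + 0.5) for b in "ACGT"}
        mean = 0.25 * sum(logs.values())
        matrix.append({b: logs[b] - mean for b in "ACGT"})
    return matrix

def score(site, matrix):
    """Sum of positional weights of a candidate site."""
    return sum(matrix[j][b] for j, b in enumerate(site))

def scan(seq, matrix, threshold):
    """All (position, score) pairs whose score reaches the threshold."""
    k = len(matrix)
    hits = []
    for p in range(len(seq) - k + 1):
        s = score(seq[p:p + k], matrix)
        if s >= threshold:
            hits.append((p, s))
    return hits

# Hypothetical training sites.
TRAINING = ["GGAGG", "GGAGG", "GGTGG", "GAAGG"]
W = weight_matrix(TRAINING)
hits = scan("TTAGGAGGTC", W, threshold=score("GGTGG", W))
print(hits)
```

By construction the weights in each column sum to zero, so a position with no preference contributes nothing to the score.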

46
Distribution of RBS profile scores on sites
(green) and non-sites (red)
47
Pattern recognition

Linear discriminant analysis
Logical rules
Syntactic analysis
Context-sensitive grammars
Perceptron
Neural networks

48
Neural networks architecture

4×k input neurons (sensors), each responsible for
observing a particular nucleotide at a particular
position
OR 2×k neurons (one discriminates between purines
and pyrimidines, the other, between AT and GC)
One or more layers of hidden neurons
One output neuron

Each neuron is connected to all neurons of the
next layer
Each connection is ascribed a numerical weight
A neuron
Sums the signals at incoming connections
Compares the total with the threshold (or
transforms it according to a fixed function)
If the threshold is passed, excites the outgoing
connections (resp. sends the modified value)
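
The 4×k input encoding and the behaviour of a threshold neuron can be illustrated with a tiny hand-wired network in Python. The weights and thresholds are invented for the example (a network that fires only on the dinucleotide GG); nothing here is trained.

```python
def one_hot(site):
    """4*k inputs: one sensor per (position, nucleotide) pair, order A, C, G, T."""
    return [1.0 if b == x else 0.0 for b in site for x in "ACGT"]

def neuron(inputs, weights, threshold):
    """Sum the weighted incoming signals and compare the total with the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 if total > threshold else 0.0

def forward(site, hidden, output):
    """One hidden layer, fully connected to a single output neuron."""
    x = one_hot(site)
    h = [neuron(x, w, t) for w, t in hidden]
    w_out, t_out = output
    return neuron(h, w_out, t_out)

# Hand-set (hypothetical) weights for k = 2.
HIDDEN = [
    ([0, 0, 1, 0, 0, 0, 0, 0], 0.5),   # fires if position 1 is G
    ([0, 0, 0, 0, 0, 0, 1, 0], 0.5),   # fires if position 2 is G
]
OUTPUT = ([1.0, 1.0], 1.5)             # fires only if both hidden neurons fire

print(forward("GG", HIDDEN, OUTPUT), forward("GA", HIDDEN, OUTPUT))
```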

50
Training

Sites and non-sites from the training sample are
presented one by one.
The output neuron produces the prediction.
The connection weights and thresholds are
modified if the prediction is incorrect.
Networks differ by architecture, particulars of
the signal processing, the training schedule

51
Use of sequence context

Presence of multiple co-operative sites
ArgR (E. coli), purine regulator (Pyrococcus)
XylR-CRP, CytR-CRP (E. coli)
MEF-MyoD in muscle-specific promoters (mammals)
Location relative to promoters
repressors vs. activators

52
Benchmarking

Difficult, because
Different algorithms are optimized for different
performance parameters
Incompatible training sets
Difficult to construct a homogeneous and
unambiguous testing set
Unobserved sites
Competition between closely located sites
Activation in specific conditions
Non-specific binding (52 out of 54 candidate
HNF-1 binding sites do bind the factor)

53
Promoters of E. coli

PWM at a false positive rate of 1 per 2000 bp recognizes
25% of all promoters,
60% of constitutive (non-activated) promoters
PWMs perform as well as neural networks

54
Eukaryotic promoters
55
Ribosome binding sites

Information content of the profile predicts the
average reliability of predictions

56
CRP (E. coli)
57
Comparative approach to the analysis of regulation

Making good predictions
with bad rules

58
Regulation of transcription in prokaryotes

Difficult
Small sample size
Weak signals (or we do not know what features are
relevant, maybe the DNA structure)

59
CRP (E. coli)
60
GenBank entry for the E. coli genome
61

Many genomes are available
=> comparative approach
Basic assumption
Regulons (sets of co-regulated genes) are
conserved
well in some cases
in fact, in many cases

62
Corollary: the consistency check

True sites occur upstream of orthologous genes
False sites are scattered at random
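
A minimal Python sketch of the consistency check, assuming candidate sites have already been mapped to groups of orthologous genes. The genome names, group identifiers and the `min_genomes` cutoff are hypothetical.

```python
def consistent_groups(hits_by_genome, min_genomes):
    """Ortholog groups with a candidate site upstream in at least `min_genomes` genomes."""
    counts = {}
    for groups in hits_by_genome.values():
        for g in groups:
            counts[g] = counts.get(g, 0) + 1
    return sorted(g for g, c in counts.items() if c >= min_genomes)

# Hypothetical scan results: genome -> ortholog groups with an upstream candidate site.
HITS = {
    "genome1": {"grpA", "grpB", "grpX"},
    "genome2": {"grpA", "grpB", "grpY"},
    "genome3": {"grpA", "grpZ"},
}

kept = consistent_groups(HITS, min_genomes=2)
print(kept)
```

Groups hit in a single genome (`grpX`, `grpY`, `grpZ` here) behave like randomly scattered false positives and are filtered out.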

63
Orthologs

Orthologous genes
diverged by speciation
retain cellular role
Paralogous genes
diverged by duplication
retain biochemical function only

64
Orthology (definition)

duplication

Genomes are shown as black pipes
1st event: duplication
2nd event: speciation
Genes of the same color are orthologous
Genes of different color are paralogous

A1 B1
A2 B2
Genome 1 Genome 2
A1 and A2 are orthologs, B1 and B2 are
orthologs, all other pairs are paralogs
65
Search for orthologs (fast and dirty)
66
The basic procedure
Genome 1
Genome 2
Genome N
67
Accounting for the operon structure
68
Checklist

Presence of orthologous transcription factors
Really orthologous (BETs, COGs etc. are not
sufficient)
Conservation of the DNA-binding domain
Conservation of the core pathway

69
Purine regulons of E. coli and H. influenzae
70
Predicted purine transporters
71
Changes in the operon structure: more examples

glnK-amtB loci of methanogenic archaebacteria

72
Tryptophan operons
73
Heat shock (HrcA) regulons / CIRCE elements
74
Closely related genomes: Phylogenetic footprinting

Regulatory sites are more conserved than
non-coding regions in general and are often seen
as conserved islands in alignments of gene
upstream regions.

75
High conservation
76
Low conservation
77
Degeneration of sites
78
Problems and solutions

Unique members of regulons may be lost: use of
additional genomes decreases the number of
orphan regulon members.
Closely related factors may have similar sites:
careful analysis of function and analysis of
particular sites is usually sufficient to resolve
ambiguities.
Too many genomes and regulons: apply preliminary
automated screening.

79
Modification: ubiquitous regulators

Present in many genomes
Only core regulon is conserved
Mode of regulation may vary
Signals may be slightly different

80
Arginine repressor ArgR/AhrC
81
ABC transporters (periplasmic components)
82
Modification: horizontal transfer

Impossible to resolve the orthology
relationships: a homologous regulated gene is
sufficient for corroboration
Often regulate large loci (several adjacent
operons)
Signals are mainly conserved

83
New signals

Select a group of related genomes
In each genome select metabolically related genes
Add possibly co-transcribed genes
Compare upstream regions for each genome
independently
Construct profiles
Compare constructed profiles: if similar, then
relevant

84
The purine regulon of Pyrococcus spp.

Use functional annotation and COGs to select
genes encoding enzymes from the purine pathway: purA,
purB, purC, purF, purD, purE, purL-I, purL-II,
purT, guaA.
Construct profiles for each genome. The quality
of profiles is low (< 1 bit/position).
However, the profiles are almost identical.
There is no significant similarity of upstream
regions (outside sites). Thus the profiles are
probably correct.
Low specificity of profiles, thus > 300 candidate
genes in each genome.
Observation: in upstream regions of all genes
from the initial sample the candidate sites occur
twice with a 22 bp spacer.
The new rule is absolutely specific: only one
additional gene in each genome.

85
[Figure: phylogenetic tree with numeric branch support values. Leaf labels: 2895752_Ef, PyrP_Bc, UraA_Ec, PyrP_Bs, UAPA_En, UraA_Hi, UAPC_En, YcdG_Ec, YgfO, YicE, YtiP_Bs, PH, YgfU, 2239289_Bs, PA A, 2635740_Bs, PF, 2314333_Hp, 2635741_Bs, 2689889_Bb, YjcD_Hi, Y326_Mj, PbuX_Bs, 2689890_Bb, YjcD, YgfQ, YieG, YicO]
86
Sources