Recognition of regulatory signals - PowerPoint PPT Presentation

Slides: 89
Provided by: Gelf6
1
Recognition of regulatory signals
  • Mikhail S. Gelfand
  • IntegratedGenomics-Moscow
  • NATO ASI School, October 2001

2
Why?
  • Additional annotation tool (e.g. specificity of
    transporters and enzymes from large families)
  • Important for practice (in addition to metabolic
    reconstruction)
  • Interesting from the evolutionary point of view

3
Overview
  • 0. Biological introduction
  • 1. Algorithms
  • Representation of signals
  • Deriving the signal
  • Site recognition
  • 2. Comparative genomics
  • Phylogenetic footprinting
  • Consistency filtering

4
Some biology
  • Transcription (DNA → RNA)
  • Splicing (pre-mRNA → mRNA)
  • Translation (mRNA → protein)
  • Regulation of transcription in prokaryotes
  • and eukaryotes
  • Initiation of translation

5
Transcription and translation in prokaryotes
6
Initiation of transcription (bacteria)
7
Translation in prokaryotes
8
Translation (details)
9
Splicing (eukaryotes)
10
Regulation of transcription in prokaryotes
11
Structure of DNA-binding domain. Example 1
12
Structure of DNA-binding domain. Example 2
13
Protein-DNA interactions
14
Regulation of transcription in eukaryotes
15
Representation of signals
  • Consensus
  • Pattern (consensus with degenerate positions)
  • Positional weight matrix (PWM, or profile)
  • Logical rules
  • RNA signals

16
Consensus
  • codB CCCACGAAAACGATTGCTTTTT
  • purE GCCACGCAACCGTTTTCCTTGC
  • pyrD GTTCGGAAAACGTTTGCGTTTT
  • purT CACACGCAAACGTTTTCGTTTA
  • cvpA CCTACGCAAACGTTTTCTTTTT
  • purC GATACGCAAACGTGTGCGTCTG
  • purM GTCTCGCAAACGTTTGCTTTCC
  • purH GTTGCGCAAACGTTTTCGTTAC
  • purL TCTACGCAAACGGTTTCGTCGG
  • consensus ACGCAAACGTTTTCGT

17
Pattern
  • codB CCCACGAAAACGATTGCTTTTT
  • purE GCCACGCAACCGTTTTCCTTGC
  • pyrD GTTCGGAAAACGTTTGCGTTTT
  • purT CACACGCAAACGTTTTCGTTTA
  • cvpA CCTACGCAAACGTTTTCTTTTT
  • purC GATACGCAAACGTGTGCGTCTG
  • purM GTCTCGCAAACGTTTGCTTTCC
  • purH GTTGCGCAAACGTTTTCGTTAC
  • purL TCTACGCAAACGGTTTCGTCGG
  • consensus ACGCAAACGTTTTCGT
  • pattern aCGmAAACGtTTkCkT
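
The consensus and pattern rows above can be reproduced with a short sketch. The column-majority rule for the consensus and the `min_count=2` threshold for degenerate positions are assumptions chosen for illustration, as are the function names; the slide does not state how its pattern was derived or why some letters are lower-case.

```python
from collections import Counter

# Aligned sites copied from the slide (codB ... purL).
SITES = [
    "CCCACGAAAACGATTGCTTTTT", "GCCACGCAACCGTTTTCCTTGC",
    "GTTCGGAAAACGTTTGCGTTTT", "CACACGCAAACGTTTTCGTTTA",
    "CCTACGCAAACGTTTTCTTTTT", "GATACGCAAACGTGTGCGTCTG",
    "GTCTCGCAAACGTTTGCTTTCC", "GTTGCGCAAACGTTTTCGTTAC",
    "TCTACGCAAACGGTTTCGTCGG",
]

# IUPAC codes keyed by the sorted set of allowed nucleotides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def consensus(sites):
    """Most frequent nucleotide in each aligned column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def pattern(sites, min_count=2):
    """Degenerate consensus: keep every base seen at least min_count times."""
    out = []
    for col in zip(*sites):
        counts = Counter(col)
        kept = "".join(sorted(b for b, n in counts.items() if n >= min_count))
        # Fall back to the majority base if no base reaches the threshold.
        out.append(IUPAC[kept or counts.most_common(1)[0][0]])
    return "".join(out)

# Columns 3..18 correspond to the 16-bp core shown on the slide.
print(consensus(SITES)[3:19])  # ACGCAAACGTTTTCGT
print(pattern(SITES)[3:19])    # ACGMAAACGTTTKCKT
```

With these assumptions the upper-cased output agrees with the slide's pattern row (m = A or C, k = G or T).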

18
Frequency matrix
Information content: $I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
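
A minimal sketch of the information content of a set of aligned sites, reading the slide's formula as a sum over positions j and nucleotides b of f(b,j) log[f(b,j)/p(b)]. The base-2 logarithm (bits), the uniform default background and the function name are assumptions.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """Sum over columns j and bases b of f(b,j) * log2(f(b,j) / p(b))."""
    background = background or {b: 0.25 for b in "ACGT"}
    total = 0.0
    for col in zip(*sites):
        for base, n in Counter(col).items():
            f = n / len(col)  # observed frequency f(b,j)
            total += f * math.log2(f / background[base])
    return total

# A fully conserved column contributes 2 bits under a uniform background.
print(information_content(["ACGT", "ACGT"]))  # 8.0
```

Bases that never occur in a column contribute nothing, so no pseudocount is needed here.
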
19
Sequence logo
20
Positional weight matrix (PWM)
21
  • Probabilistic motivation: log-likelihood (up to a
    linear transformation)
  • More probabilistic motivation: z-score (with the
    suitable base of the logarithm)
  • Thermodynamical motivation: free energy (assuming
    independence of positions, up to a linear
    transformation)
  • Pseudocounts
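
One common way to realize the log-likelihood motivation with pseudocounts is a log-odds matrix. The sketch below is an illustration only: the pseudocount of 1.0 split according to the background, the base-2 logarithm and the name `log_odds_pwm` are assumptions, not values taken from the slides.

```python
import math

def log_odds_pwm(sites, pseudocount=1.0, background=None):
    """Per-position log2 odds of each base against the background."""
    background = background or {b: 0.25 for b in "ACGT"}
    n = len(sites)
    pwm = []
    for col in zip(*sites):
        pwm.append({
            # Pseudocount mass is distributed according to the background,
            # so unseen bases get a finite negative weight.
            b: math.log2(
                (col.count(b) + pseudocount * background[b])
                / (n + pseudocount) / background[b]
            )
            for b in "ACGT"
        })
    return pwm

pwm = log_odds_pwm(["AAA", "AAA"])
print(pwm[0]["A"] > 0 > pwm[0]["C"])  # True
```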

22
Logical rules, trees etc.
23
Compilation of samples
  • Initial sample
  • GenBank
  • specialized databases
  • literature (reviews)
  • literature (original papers)
  • Correction of GenBank errors
  • Checking the literature
  • removal of predicted sites
  • Removal of duplicates

24
Re-alignment approaches
  • Initial alignment by a biological landmark
  • start of transcription for promoters
  • start codon for ribosome binding sites
  • exon-intron boundary for splicing sites
  • Deriving the signal within a sliding window
  • Re-alignment
  • etc. etc. until convergence
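
The re-alignment loop can be sketched as follows, assuming all sequences are first aligned by the landmark so that the same `start` index applies to each. The window width, the allowed shift range, the 0.5 pseudocount and all function names are hypothetical choices for illustration.

```python
import math
from collections import Counter

def build_weights(windows):
    """Log-frequency weights per column, with a 0.5 pseudocount per base."""
    weights = []
    for col in zip(*windows):
        counts = Counter(col)
        total = len(col) + 2.0  # four bases times 0.5
        weights.append({b: math.log((counts[b] + 0.5) / total) for b in "ACGT"})
    return weights

def score(window, weights):
    return sum(w[b] for b, w in zip(window, weights))

def realign(seqs, start, width, max_shift, max_iter=50):
    """Shift each sequence's window to best match the current profile.

    Requires start - max_shift >= 0 and
    start + max_shift + width <= len(seq) for every sequence.
    """
    shifts = [0] * len(seqs)
    for _ in range(max_iter):
        windows = [s[start + d:start + d + width] for s, d in zip(seqs, shifts)]
        weights = build_weights(windows)
        new_shifts = [
            max(range(-max_shift, max_shift + 1),
                key=lambda d: score(s[start + d:start + d + width], weights))
            for s in seqs
        ]
        if new_shifts == shifts:  # converged
            break
        shifts = new_shifts
    return shifts

toy = ["TTGGAGGTTTT", "TTTGGAGGTTT", "TTGGAGGTTTT", "TTGGAGGTTTT"]
print(realign(toy, start=2, width=5, max_shift=2))  # [0, 1, 0, 0]
```

In the toy input the second sequence carries the motif one position to the right, and the loop recovers that shift.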

25
Gene starts of Bacillus subtilis
  • dnaN ACATTATCCGTTAGGAGGATAAAAATG
  • gyrA GTGATACTTCAGGGAGGTTTTTTAATG
  • serS TCAATAAAAAAAGGAGTGTTTCGCATG
  • bofA CAAGCGAAGGAGATGAGAAGATTCATG
  • csfB GCTAACTGTACGGAGGTGGAGAAGATG
  • xpaC ATAGACACAGGAGTCGATTATCTCATG
  • metS ACATTCTGATTAGGAGGTTTCAAGATG
  • gcaD AAAAGGGATATTGGAGGCCAATAAATG
  • spoVC TATGTGACTAAGGGAGGATTCGCCATG
  • ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
  • pabB AAAGAAAATAGAGGAATGATACAAATG
  • rplJ CAAGAATCTACAGGAGGTGTAACCATG
  • tufA AAAGCTCTTAAGGAGGATTTTAGAATG
  • rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
  • rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
  • rplM AGATCATTTAGGAGGGGAAATTCAATG

26
  • dnaN ACATTATCCGTTAGGAGGATAAAAATG
  • gyrA GTGATACTTCAGGGAGGTTTTTTAATG
  • serS TCAATAAAAAAAGGAGTGTTTCGCATG
  • bofA CAAGCGAAGGAGATGAGAAGATTCATG
  • csfB GCTAACTGTACGGAGGTGGAGAAGATG
  • xpaC ATAGACACAGGAGTCGATTATCTCATG
  • metS ACATTCTGATTAGGAGGTTTCAAGATG
  • gcaD AAAAGGGATATTGGAGGCCAATAAATG
  • spoVC TATGTGACTAAGGGAGGATTCGCCATG
  • ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
  • pabB AAAGAAAATAGAGGAATGATACAAATG
  • rplJ CAAGAATCTACAGGAGGTGTAACCATG
  • tufA AAAGCTCTTAAGGAGGATTTTAGAATG
  • rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
  • rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
  • rplM AGATCATTTAGGAGGGGAAATTCAATG
  • cons. aaagtatataagggagggttaataATG
  • num. 001000000000110110000000111
  • 760666658967228106888659666

27
  • dnaN ACATTATCCGTTAGGAGGATAAAAATG
  • gyrA GTGATACTTCAGGGAGGTTTTTTAATG
  • serS TCAATAAAAAAAGGAGTGTTTCGCATG
  • bofA CAAGCGAAGGAGATGAGAAGATTCATG
  • csfB GCTAACTGTACGGAGGTGGAGAAGATG
  • xpaC ATAGACACAGGAGTCGATTATCTCATG
  • metS ACATTCTGATTAGGAGGTTTCAAGATG
  • gcaD AAAAGGGATATTGGAGGCCAATAAATG
  • spoVC TATGTGACTAAGGGAGGATTCGCCATG
  • ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
  • pabB AAAGAAAATAGAGGAATGATACAAATG
  • rplJ CAAGAATCTACAGGAGGTGTAACCATG
  • tufA AAAGCTCTTAAGGAGGATTTTAGAATG
  • rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
  • rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
  • rplM AGATCATTTAGGAGGGGAAATTCAATG
  • cons. tacataaaggaggtttaaaaat
  • num. 0000000111111000000001
  • 5755779156663678679890

28
Positional information content before and after
re-alignment
29
Positional nucleotide frequencies after
re-alignment (aGGAGG pattern)
30
Enhancement of a weak signal
31
Deriving the signal ab initio
  • Discrete (pattern-driven) approaches: word
    counting
  • Continuous (profile-driven) approaches:
    optimization

32
Word counting. Short words
  • Consider all k-mers
  • For each k-mer compute the number of sequences
    containing this k-mer
  • (maybe with some mismatches)
  • Select the most frequent k-mer
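
A minimal sketch of word counting for short words, restricted to exact matches (the optional mismatches mentioned above are left out). The function name and the toy sequences are made up for illustration.

```python
from collections import Counter

def most_frequent_kmer(seqs, k):
    """Return (k-mer, number of sequences containing it) for the top k-mer."""
    support = Counter()
    for s in seqs:
        # A set makes each sequence vote at most once per k-mer.
        support.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return support.most_common(1)[0]

print(most_frequent_kmer(["AAGGAGGT", "CGGAGGCC", "TTGGAGGA"], 5))
# ('GGAGG', 3)
```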

33
  • Problem: complete search is possible only for
    short words
  • Assumption: if a long word is over-represented,
    its subwords are also over-represented
  • Solution: select a set of over-represented words
    and combine them into longer words

34
Word counting. Long words
  • Consider some k-mers
  • For each k-mer compute the number of sequences
    containing this k-mer
  • (maybe with some mismatches)
  • Select the most frequent k-mer

35
  • Problem: what k-tuples to start with?
  • 1st attempt: those actually occurring in the
    sample.
  • But the correct signal (the consensus word) may
    not be among them.

36
  • 2nd attempt: those actually occurring in the
    sample and some neighborhood.
  • But:
  • again, the correct signal (the consensus word)
    may not be among them
  • the size of the neighborhood grows exponentially

37
Graph approach
  • Each k-mer in each sequence corresponds to a
    vertex. Two k-mers are linked by an arc, if they
    differ in at most h positions (h ≪ k).
  • Thus we obtain an n-partite graph (n is the
    number of sequences).
  • A signal corresponds to a clique (a complete
    subgraph) or at least a dense subgraph with
    vertices in each part.

38
A simple algorithm
  • Remove vertices that cannot be extended to
    complete subgraphs
  • that is, do not have arcs to all parts of the
    graph
  • Remove pairs that cannot be extended
  • that is, do not form triangles with the third
    vertex in all parts of the graph
  • Etc.
  • (will not work as is for dense subgraphs)
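
The graph construction and the first pruning step (removing vertices without arcs to all other parts) might look like the sketch below. It covers only vertex pruning, not the pair or triangle steps, and the names `build_graph` and `prune` are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_graph(seqs, k, h):
    """Vertices are (sequence index, position); arcs join k-mers from
    different sequences that differ in at most h positions."""
    verts = [(i, p) for i, s in enumerate(seqs) for p in range(len(s) - k + 1)]
    adj = {v: set() for v in verts}
    for u, v in combinations(verts, 2):
        if u[0] != v[0]:
            a = seqs[u[0]][u[1]:u[1] + k]
            b = seqs[v[0]][v[1]:v[1] + k]
            if hamming(a, b) <= h:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def prune(adj, n_parts):
    """Repeatedly drop vertices that lack a neighbour in some other part."""
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            parts = {u[0] for u in adj[v]}
            if len(parts) < n_parts - 1:
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return adj

seqs = ["AAGGAGGT", "CGGAGGCC", "TTGGAGGA"]
print(sorted(prune(build_graph(seqs, k=5, h=0), n_parts=3)))
# [(0, 2), (1, 1), (2, 2)]
```

The all-pairs comparison is quadratic in the number of k-mers, which is acceptable only for a small illustration.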

39
Optimization. EM algorithms
  • Generate an initial set of profiles (e.g. seed
    with all k-mers)
  • For each profile
  • find the best (highest scoring) representative in
    each sequence
  • update the profile
  • Iterate until convergence

40
  • This algorithm converges.
  • However, it cannot leave the basin of attraction.
  • Thus, if the initial approximation is bad, it
    will converge to nonsense.
  • Solution: stochastic optimization.

41
Simulated annealing
  • Goal: maximize the information content I
  • $I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
  • or any other measure of homogeneity of the sites

42
  • Let A be the current signal (set of candidate
    sites), and let I(A) be the corresponding
    information content.
  • Let B be a set of sites obtained by randomly
    choosing a different site in one sequence, and
    let I(B) be its information content.
  • if I(B) ≥ I(A), B is accepted
  • if I(B) < I(A), B is accepted with probability
  • $P = \exp\left( (I(B) - I(A)) / T \right)$
  • The temperature T decreases exponentially, but
    slowly; the initial temperature is chosen such
    that almost all changes are accepted.
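
A minimal sketch of this annealing scheme, assuming the acceptance probability reads P = exp((I(B) - I(A)) / T) and using information content in bits against a uniform background. The initial temperature, cooling factor, step count and function names are illustrative assumptions; for simplicity the proposed site may occasionally coincide with the current one.

```python
import math
import random
from collections import Counter

def info_content(sites):
    """Information content in bits against a uniform background."""
    total = 0.0
    for col in zip(*sites):
        for n in Counter(col).values():
            f = n / len(col)
            total += f * math.log2(f / 0.25)
    return total

def anneal(seqs, k, t0=2.0, cooling=0.999, steps=5000, seed=0):
    rng = random.Random(seed)

    def sites(positions):
        return [s[i:i + k] for s, i in zip(seqs, positions)]

    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    cur = info_content(sites(pos))
    best, best_pos = cur, list(pos)
    t = t0
    for _ in range(steps):
        # Propose B: re-pick the site in one randomly chosen sequence.
        i = rng.randrange(len(seqs))
        cand = list(pos)
        cand[i] = rng.randrange(len(seqs[i]) - k + 1)
        new = info_content(sites(cand))
        # Always accept improvements; accept worse sets with P = exp(dI / T).
        if new >= cur or rng.random() < math.exp((new - cur) / t):
            pos, cur = cand, new
            if cur > best:
                best, best_pos = cur, list(pos)
        t *= cooling  # slow exponential cooling
    return best_pos, best
```

The function returns the best set of positions seen and its information content, which can differ from the final state of the chain.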

43
Gibbs sampler
  • Again, A is a signal (set of sites), and I(A) is
    its information content.
  • At each step a new site is selected in one
    sequence with probability
  • $P \propto \exp(I(A_{new}))$
  • For each candidate site the total time of
    occupation is computed.
  • (Note that the signal changes all the time)

44
Use of symmetry
  • DNA-binding factors and their signals
  • Co-operative homogeneous
  • Palindromes
  • Repeats
  • Co-operative non-homogeneous
  • Cassettes
  • Others
  • RNA signals

45
Recognition PWM/profiles
  • The simplest technique: positional nucleotide
    weights are
  • $W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
  • Score of a candidate site $b_1 \ldots b_k$ is the sum of the
    corresponding positional nucleotide weights
  • $S(b_1 \ldots b_k) = \sum_{j=1,\ldots,k} W(b_j, j)$
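
A sketch of these weights and the site score, assuming the garbled transcript formula reads W(b,j) = ln(N(b,j) + 0.5) - 0.25 Σ_i ln(N(i,j) + 0.5), where N(b,j) is the count of nucleotide b at position j in the training sample. The function names and the toy data are hypothetical.

```python
import math

def positional_weights(sites):
    """W(b,j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    weights = []
    for col in zip(*sites):
        logs = {b: math.log(col.count(b) + 0.5) for b in "ACGT"}
        mean = 0.25 * sum(logs.values())
        weights.append({b: logs[b] - mean for b in "ACGT"})
    return weights

def site_score(site, weights):
    """S(b_1...b_k) = sum over j of W(b_j, j)."""
    return sum(w[b] for b, w in zip(site, weights))

def scan(seq, weights):
    """Score every window of the profile length along seq."""
    k = len(weights)
    return [(i, site_score(seq[i:i + k], weights)) for i in range(len(seq) - k + 1)]

w = positional_weights(["GGAGG"] * 4)
print(max(scan("TTGGAGGTT", w), key=lambda hit: hit[1])[0])  # 2
```

Subtracting the column mean makes the four weights at each position sum to zero, so scores from different positions are on a comparable scale.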

46
Distribution of RBS profile scores on sites
(green) and non-sites (red)
47
Pattern recognition
  • Linear discriminant analysis
  • Logical rules
  • Syntactic analysis
  • Context-sensitive grammars
  • Perceptron
  • Neural networks

48
Neural networks architecture
  • 4×k input neurons (sensors), each responsible for
    observing a particular nucleotide at a particular
    position
  • OR 2×k neurons (one discriminates between purines
    and pyrimidines, the other, between AT and GC)
  • One or more layers of hidden neurons
  • One output neuron

49
  • Each neuron is connected to all neurons of the
    next layer
  • Each connection is ascribed a numerical weight
  • A neuron
  • Sums the signals at incoming connections
  • Compares the total with the threshold (or
    transforms it according to a fixed function)
  • If the threshold is passed, excites the outgoing
    connections (resp. sends the modified value)

50
Training
  • Sites and non-sites from the training sample are
    presented one by one.
  • The output neuron produces the prediction.
  • The connection weights and thresholds are
    modified if the prediction is incorrect.
  • Networks differ by architecture, particulars of
    the signal processing, the training schedule
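
The simplest instance of this scheme is a single-layer perceptron with the 4×k sensor encoding and no hidden layer. The sketch below uses the classic perceptron update rule; the learning rate, epoch limit, function names and toy sequences are assumptions for illustration, not details given in the slides.

```python
def one_hot(site):
    """4 x k binary inputs: one sensor per nucleotide per position."""
    return [1.0 if base == b else 0.0 for base in site for b in "ACGT"]

def train_perceptron(sites, non_sites, epochs=100, rate=0.1):
    k = len(sites[0])
    weights = [0.0] * (4 * k)
    bias = 0.0  # plays the role of the (negated) threshold
    data = [(one_hot(s), 1) for s in sites] + [(one_hot(s), 0) for s in non_sites]
    for _ in range(epochs):
        errors = 0
        for x, label in data:
            pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            if pred != label:
                # Modify weights and threshold only on an incorrect prediction.
                delta = rate * (label - pred)
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta
                errors += 1
        if errors == 0:
            break
    return weights, bias

def predict(site, weights, bias):
    return sum(w * xi for w, xi in zip(weights, one_hot(site))) + bias > 0

w, b = train_perceptron(["GGAGG", "AGAGG", "GGAGA"], ["TTTTT", "CCCCC", "ACTCT"])
print(predict("GGAGG", w, b), predict("TTTTT", w, b))  # True False
```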

51
Use of sequence context
  • Presence of multiple co-operative sites
  • ArgR (E. coli), purine regulator (Pyrococcus)
  • XylR+CRP, CytR+CRP (E. coli)
  • MEF+MyoD in muscle-specific promoters (mammals)
  • Location relative to promoters
  • repressors vs. activators

52
Benchmarking
  • Difficult, because
  • Different algorithms are optimized for different
    performance parameters
  • Incompatible training sets
  • Difficult to construct a homogeneous and
    unambiguous testing set
  • Unobserved sites
  • Competition between closely located sites
  • Activation in specific conditions
  • non-specific binding (52 out of 54 candidate
    HNF-1 binding sites do bind the factor)

53
Promoters of E. coli
  • PWM at false positive rate 1 per 2000 bp:
  • 25% of all promoters,
  • 60% of constitutive (non-activated) promoters
  • PWMs perform as well as neural networks

54
Eukaryotic promoters
55
Ribosome binding sites
  • Information content of the profile predicts the
    average reliability of predictions

56
CRP (E. coli)
57
Comparative approach to the analysis of regulation
  • Making good predictions
  • with bad rules

58
Regulation of transcription in prokaryotes
  • Difficult
  • Small sample size
  • Weak signals (or we do not know what features are
    relevant, maybe the DNA structure)

59
CRP (E. coli)
60
GenBank entry for the E. coli genome
61
  • Many genomes are available ⇒ comparative approach
  • Basic assumption
  • Regulons (sets of co-regulated genes) are
    conserved
  • well in some cases
  • in fact, in many cases

62
Corollary: the consistency check
  • True sites occur upstream of orthologous genes
  • False sites are scattered at random
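
The consistency check can be sketched as a vote across genomes: a candidate site is kept when sites are also found upstream of orthologs of the same gene in other genomes. The input layout, the `min_genomes` threshold and the gene identifiers in the example are made up for illustration.

```python
from collections import Counter

def consistent_groups(hits_by_genome, orthologs, min_genomes=2):
    """hits_by_genome: {genome: set of genes with a candidate site upstream}
    orthologs: {gene: ortholog group id}
    Returns {group: number of genomes with a site upstream of a member}."""
    support = Counter()
    for genes in hits_by_genome.values():
        # Each genome votes at most once per ortholog group.
        support.update({orthologs[g] for g in genes if g in orthologs})
    return {grp: n for grp, n in support.items() if n >= min_genomes}

# Toy input with hypothetical gene identifiers.
hits = {"Ec": {"purE_Ec", "yfoo_Ec"}, "Hi": {"purE_Hi"}}
groups = {"purE_Ec": "purE", "purE_Hi": "purE", "yfoo_Ec": "yfoo"}
print(consistent_groups(hits, groups))  # {'purE': 2}
```

The site upstream of the hypothetical `yfoo_Ec` is dropped because no other genome supports it.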

63
Orthologs
  • Orthologous genes
  • diverged by speciation
  • retain cellular role
  • Paralogous genes
  • diverged by duplication
  • retain biochemical function only

64
Orthology (definition)
  • Genomes are shown as black pipes
  • 1st event: duplication
  • 2nd event: speciation
  • Genes of the same color are orthologous
  • Genes of different color are paralogous
[Figure: duplication followed by speciation, giving genes A1, B1, A2, B2 in Genome 1 and Genome 2]
A1 and A2 are orthologs, B1 and B2 are
orthologs, all other pairs are paralogs
65
Search for orthologs (fast and dirty)
66
The basic procedure
Genome 1
Genome 2
Genome N
67
Accounting for the operon structure
68
Checklist
  • Presence of orthologous transcription factors
  • Really orthologous (BETs, COGs etc. are not
    sufficient)
  • Conservation of the DNA-binding domain
  • Conservation of the core pathway

69
Purine regulons of E. coli and H. influenzae
70
Predicted purine transporters
71
Changes in the operon structure: more examples
  • glnK-amtB loci of methanogenic archaebacteria

72
Tryptophan operons
73
Heat shock (HrcA) regulons / CIRCE elements
74
Closely related genomes: phylogenetic footprinting
  • Regulatory sites are more conserved than
    non-coding regions in general and are often seen
    as conserved islands in alignments of gene
    upstream regions.
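
A minimal sketch of finding conserved islands, assuming the upstream regions have already been aligned (rows of equal length, "-" for gaps) and taking an island to be a run of identical, gap-free columns. The `min_len` threshold, the function name and the toy alignment are illustrative assumptions.

```python
def conserved_islands(aligned, min_len=6):
    """Return half-open (start, end) column ranges that are identical and
    gap-free in every aligned sequence and at least min_len columns long."""
    islands, start = [], None
    for j, col in enumerate(zip(*aligned)):
        same = len(set(col)) == 1 and col[0] != "-"
        if same and start is None:
            start = j
        elif not same and start is not None:
            if j - start >= min_len:
                islands.append((start, j))
            start = None
    n = len(aligned[0])
    if start is not None and n - start >= min_len:
        islands.append((start, n))  # island running to the end
    return islands

print(conserved_islands(["AATTGACAGG", "CGTTGACATC", "GCTTGACACA"]))
# [(2, 8)]
```

Requiring perfect identity is the strictest possible criterion; a real analysis would likely tolerate some mismatches.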

75
High conservation
76
Low conservation
77
Degeneration of sites
78
Problems and solutions
  • Unique members of regulons may be lost: use of
    additional genomes decreases the number of
    orphan regulon members.
  • Closely related factors may have similar sites:
    careful analysis of function and analysis of
    particular sites is usually sufficient to resolve
    ambiguities.
  • Too many genomes and regulons: apply preliminary
    automated screening.

79
Modification: ubiquitous regulators
  • Present in many genomes
  • Only core regulon is conserved
  • Mode of regulation may vary
  • Signals may be slightly different

80
Arginine repressor ArgR/AhrC
81
ABC transporters (periplasmic components)
82
Modification: horizontal transfer
  • Impossible to resolve the orthology
    relationships: a homologous regulated gene is
    sufficient for corroboration
  • Often regulate large loci (several adjacent
    operons)
  • Signals are mainly conserved

83
New signals
  • Select a group of related genomes
  • In each genome select metabolically related genes
  • Add possibly co-transcribed genes
  • Compare upstream regions for each genome
    independently
  • Construct profiles
  • Compare constructed profiles: if similar, then
    relevant

84
The purine regulon of Pyrococcus spp.
  • Use functional annotation and COGs to select
    genes encoding enzymes from the purine pathway: purA,
    purB, purC, purF, purD, purE, purL-I, purL-II,
    purT, guaA.
  • Construct profiles for each genome. The quality
    of profiles is weak (< 1 bit/position).
  • However, the profiles are almost identical.
  • There is no significant similarity of upstream
    regions (outside sites). Thus the profiles are
    probably correct.
  • Low specificity of profiles, thus >300 candidate
    genes in each genome.
  • Observation: in upstream regions of all genes
    from the initial sample the candidate sites occur
    twice with a 22 bp spacer.
  • The new rule is absolutely specific: only one
    additional gene in each genome.
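
The paired-site rule can be sketched as a filter over profile hits in one upstream region. The slides do not say how the spacer is measured, so the sketch assumes it is the gap between the end of the first site and the start of the second; the site length of 16 in the example is a placeholder, not the length of the actual Pyrococcus site.

```python
def paired_hits(hit_positions, site_len, spacer=22):
    """Return (first, second) start positions of hits separated by exactly
    `spacer` bp between the end of one site and the start of the next."""
    hits = set(hit_positions)
    step = site_len + spacer  # distance between the two site starts
    return sorted((p, p + step) for p in hits if p + step in hits)

# Toy hit positions; only 10 and 48 are 16 + 22 = 38 bp apart.
print(paired_hits([10, 40, 48, 100], site_len=16))  # [(10, 48)]
```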

85
[Figure: phylogenetic tree; numeric node labels omitted. Leaf labels: 2895752_Ef, PyrP_Bc, UraA_Ec, PyrP_Bs, UAPA_En, UraA_Hi, UAPC_En, YcdG_Ec, YgfO, YicE, YtiP_Bs, PH, YgfU, 2239289_Bs, PA A, 2635740_Bs, PF, 2314333_Hp, 2635741_Bs, 2689889_Bb, YjcD_Hi, Y326_Mj, PbuX_Bs, 2689890_Bb, YjcD, YgfQ, YieG, YicO]
86
Sources
  • G. Stormo
  • J. Fickett
  • W. Miller
  • I. Dubchak
  • Yuh et al. (1998)
  • Tronche et al. (1997)
  • textbooks

87
Discussions and collaboration
  • Farid Chetouani (Institute Pasteur)
  • Eugene Koonin (NCBI)
  • Yuri Kozlov (Ajinomoto)
  • Leonid Mirny (Harvard - MIT)
  • Alexander Mironov (GosNIIGenetika)
  • Vasily Lyubetsky (Inst. Probl. Inform. Trans.)
  • Andrey Osterman (IntegratedGenomics)
  • Danila Perumov (Inst. Nucl. Phys.)
  • Pavel Pevzner (UC San Diego)
  • Michael Roytberg (Inst. Math. Probl. Biol.)

88
Collaborators
  • Andrey A. Mironov
  • A. B. Rakhmaninova
  • Vadim Brodyansky
  • Lyudmila Danilova
  • Anna Gerasimova
  • Alexey Kazakov
  • Ekaterina Kotelnikova
  • Olga Laikova
  • Pavel Novichkov
  • Ekaterina Panina
  • Elya Permina
  • Dmitry Ravcheev
  • Dmitry Rodionov
  • Natalya Sadovskaya
  • Alexey Vitreschak