Title: Comparative genomics to identify DNA binding motifs
1Comparative genomics to identify DNA binding
motifs
- Saurabh Sinha
- Dept. of Computer Science
- University of Illinois, Urbana-Champaign
2Outline
- Binding sites and motifs
- The motif finding problem in one species
- Comparative genomics and alignment
- The motif finding problem with comparative
genomics
3Motif finding in multiple species
- Footprinter: the approach without alignments
- PhyloCon: the use of alignments
- PhyME, PhyloGibbs: the use of alignments and an evolutionary model
- MCS: genome-wide motif finding from multiple species
4Binding sites and motifs
5Binding sites
- A few binding sites of transcription factor
Bicoid in the Drosophila (fruitfly) genome,
collected experimentally
6http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
7http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
8W = T or A; N = A, C, G, or T
Consensus string
http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
9Motif
- Common sequence pattern in the binding sites of a transcription factor
- A succinct way of capturing variability among the binding sites
10Alternative way to represent motif
1 1 9 9 0 0 0 1 A
6 0 0 0 0 9 8 7 C
1 0 0 0 1 0 0 1 G
1 8 0 0 8 0 1 0 T
Position weight matrix (PWM), or simply, weight matrix
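As a minimal sketch, the count matrix above can be turned into per-position base frequencies and a consensus string. The names `COUNTS`, `to_frequencies`, and `consensus` are illustrative choices, not part of the original slides.

```python
# Count matrix copied from the slide: one row per base, one column per motif position.
COUNTS = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}


def to_frequencies(counts):
    """Normalize each column of a count matrix so that it sums to 1."""
    length = len(counts["A"])
    freqs = {base: [] for base in counts}
    for pos in range(length):
        total = sum(counts[base][pos] for base in counts)
        for base in counts:
            freqs[base].append(counts[base][pos] / total)
    return freqs


def consensus(counts):
    """Most frequent base per column (ties go to the alphabetically first base)."""
    length = len(counts["A"])
    return "".join(
        max(sorted(counts), key=lambda base: counts[base][pos])
        for pos in range(length)
    )


FREQS = to_frequencies(COUNTS)
print(consensus(COUNTS))  # prints CTAATCCC
```

Every column of this matrix sums to 9, so it summarizes nine aligned sites.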
11Motif representation
- Consensus string
- May allow degenerate symbols in string, e.g., N = A/C/G/T, W = A/T, S = C/G, R = A/G, Y = T/C, etc.
- Tractable search space, enumerative algorithms
- Position weight matrix
- More powerful representation
- Probabilistic treatment, algorithms
- More popular
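To illustrate the consensus-string representation, here is a small sketch that scans a sequence for matches to a degenerate consensus. The symbol table covers only the codes listed above; `IUPAC` and `find_sites` are hypothetical names, and translating the consensus into a regular expression is one convenient choice, not the only one.

```python
import re

# Degenerate symbols from the slide, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
}


def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    body = "".join("[" + IUPAC[symbol] + "]" for symbol in consensus)
    # A zero-width lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=" + body + ")")
    return [match.start() for match in pattern.finditer(sequence)]


print(find_sites("TAATCY", "GGTAATCCATAATCTAA"))  # prints [2, 9]
```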
12The motif finding problem (in one species)
- Suppose a transcription factor (TF) regulates five different genes
- Each of the five genes should have binding sites for TF in its promoter region
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Binding sites for TF
13The motif finding problem
- Now suppose we are given the promoter regions of the five genes G1, G2, ..., G5
- Can we find the binding sites of TF, without knowing about them a priori?
- Binding sites are similar to each other, but not necessarily identical
- This is the motif finding problem
- To find a motif that represents binding sites of an unknown TF
14Motif finding algorithms
- Version 1: Given promoter regions of co-regulated genes, find the motif
- Existing algorithms
- Gibbs sampling (MCMC): Lawrence et al. 1993
- MEME (Expectation-Maximization): Bailey & Elkan 94
- CONSENSUS (greedy local search, beam search): Hertz & Stormo
- Word enumeration methods (with emphasis on statistical accuracy): van Helden et al. 1998, Sinha & Tompa 2000
- And a hundred others
15Comparative Genomics
16More Data
- Genomes of multiple species available
17Using multiple genomes
- Functional parts of the genome evolve more slowly than non-functional parts
- Identify conserved parts by sequence alignment algorithms
- Look for functional features in conserved regions; this improves the signal
Popular Paradigm in Computational Biology
18Multiple sequence alignment
- Comparative genomics relies upon the ability to detect similar (evolutionarily related) regions in different genomes
- The problem of multiple species alignment
- A hard computational problem (NP-hard)
- Several fast heuristics exist (MLAGAN, TBA)
- Assume this functionality exists
19Back to motif finding
20Motif finding from multiple species data
- Version 2: Given promoter regions of the same gene from multiple species, find the motif
Species 1
Species 2
Gene G
Species 3
Species 4
Species 5
Binding sites for TF
21One approach
- Do multiple sequence alignment of upstream
regions of gene
Species 1
Species 2
Gene G
Species 3
Species 4
Species 5
- Look for recurring motifs in conserved blocks
22Another approach (alignment-free)
- What if binding sites are not entirely within
conserved blocks?
Species 1
Species 2
Gene G
Species 3
Species 4
Species 5
- Look for recurring motifs in entire upstream
regions
23Footprinter (Blanchette et al.): The method without alignments
24Footprinter
- The input sequences are promoter regions of the same gene, but from multiple species.
- Such sequences are said to be orthologous to each other.
25Footprinter
Input sequences
Related by an evolutionary tree
Find motif
26A side note: Parsimony
- A guiding principle in cross-species comparison
- If the data can be explained in multiple ways, prefer the one with the fewest events (be parsimonious)
- Parsimony score: number of evolutionary events (e.g., substitutions) on the tree
- Maximum parsimony principle: minimize parsimony score
27Phylogenetic footprinting: formally speaking
- Given
- phylogenetic tree T,
- set of orthologous sequences at leaves of T,
- length k of motif
- threshold d
- Problem
- Find set S of k-mers, one k-mer from each leaf,
such that the parsimony score of S in T is at
most d.
28Small Example
Size of motif sought: k = 4
29Solution
Parsimony score: 1 mutation
30An Exact Algorithm (Blanchette's algorithm)
$W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s.
31Recurrence
32Running Time
$O(k \cdot 4^{2k})$ time per node
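The recurrence itself did not survive in this transcript. A minimal sketch, assuming the standard dynamic program in which the table entry for node u and string s is the sum, over the children v of u, of the minimum over all k-mers t of (entry for v and t, plus the Hamming distance between s and t), might look like the code below. Leaves score 0 for k-mers they contain and infinity otherwise. The tree encoding (nested tuples with sequences at the leaves) and all function names are assumptions made for illustration. Each internal node compares every pair of k-mers, in line with the per-node cost quoted above.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def parsimony_table(node, k, kmers):
    """Map each k-mer s to the best parsimony score of the subtree if its root is labeled s."""
    if isinstance(node, str):  # leaf: an orthologous sequence
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else float("inf")) for s in kmers}
    child_tables = [parsimony_table(child, k, kmers) for child in node]
    return {
        s: sum(min(table[t] + hamming(s, t) for t in kmers) for table in child_tables)
        for s in kmers
    }


def best_parsimony_score(tree, k):
    """Smallest parsimony score over all choices of one k-mer per leaf."""
    kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    return min(parsimony_table(tree, k, kmers).values())


# Toy tree: two leaves contain ACGT, the third only the one-mismatch variant ACGA.
tree = (("TTACGT", "GGACGTA"), "CACGAT")
print(best_parsimony_score(tree, 4))  # prints 1
```

A motif is then reported when this score is at most the threshold d; recovering the actual k-mers would need a traceback, which is omitted here.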
33Footprinter features
- One of the earliest motif-finding algorithms based on comparative genomics
- Simple formulation of motif score, algorithm efficient in practice
- Cannot combine evolutionary conservation information with overrepresentation information
- Example: two motifs, equally conserved, but one occurs in many co-regulated genes (promoters)
34PhyloCon (Stormo lab): The method with alignments
35The underlying single-species algorithm CONSENSUS
Final goal: Find a set of substrings, one in each input sequence
The set of substrings defines a PWM. Goal: This PWM should have high information content.
High information content means that the motif stands out.
36The underlying single-species algorithm CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
37The underlying single-species algorithm CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
38The underlying single-species algorithm CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Consider every substring in the next sequence,
try adding it to current motif and scoring
resulting motif
39The underlying single-species algorithm CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Pick the best one.
40The underlying single-species algorithm CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Pick the best one, and repeat.
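The incremental procedure on these slides can be sketched as follows. This is a simplified illustration, not the CONSENSUS program itself: it keeps only the single best extension at each step (the real algorithm is described earlier as a beam search), processes the sequences in the given order, and scores motifs by information content against a uniform background with a pseudocount. The names and the pseudocount value are assumptions.

```python
import math

BASES = "ACGT"


def information_content(sites, background=0.25, pseudocount=0.5):
    """Sum over columns of the relative entropy (bits) against a uniform background."""
    total = 0.0
    for column in zip(*sites):
        denominator = len(column) + 4 * pseudocount
        for base in BASES:
            freq = (column.count(base) + pseudocount) / denominator
            total += freq * math.log2(freq / background)
    return total


def greedy_consensus(sequences, width):
    """Seed with each window of the first sequence, then greedily add one window
    from each remaining sequence; return the best final set and its score."""
    def windows(seq):
        return [seq[i:i + width] for i in range(len(seq) - width + 1)]

    best_sites, best_score = None, -math.inf
    for seed in windows(sequences[0]):
        sites = [seed]
        for seq in sequences[1:]:
            # Try every substring of the next sequence and keep the best one.
            sites.append(max(windows(seq), key=lambda w: information_content(sites + [w])))
        score = information_content(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score


sites, score = greedy_consensus(["GGTAATCCGA", "ACTAATCCTT", "TTGCTAATCC"], 6)
print(sites)  # prints ['TAATCC', 'TAATCC', 'TAATCC']
```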
41The key: Scoring a motif
The current motif.
Scoring a motif
42The key: Scoring a motif
1 1 9 9 0 0 0 1 A
6 0 0 0 0 9 8 7 C
1 0 0 0 1 0 0 1 G
1 8 0 0 8 0 1 0 T
The current motif.
Scoring a motif
Build a PWM
Compute information content of PWM:
- For each column, compute relative entropy relative to a background distribution
- Sum over all columns
Key: to align the sites of a motif, and score the alignment
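The scoring step can be illustrated on the count matrix shown on this slide. The sketch below assumes a uniform background and base-2 logarithms (so scores are in bits), and treats zero counts as contributing nothing; none of these choices is stated in the slides.

```python
import math

COUNTS = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background


def column_relative_entropy(column_counts, background):
    """Relative entropy of one column's base frequencies versus the background."""
    total = sum(column_counts.values())
    score = 0.0
    for base, count in column_counts.items():
        if count == 0:
            continue  # the limit of f * log(f) as f -> 0 is 0
        freq = count / total
        score += freq * math.log2(freq / background[base])
    return score


def information_content(counts, background):
    """Sum of per-column relative entropies."""
    length = len(counts["A"])
    return sum(
        column_relative_entropy({b: counts[b][pos] for b in counts}, background)
        for pos in range(length)
    )


IC = information_content(COUNTS, UNIFORM)
print(round(IC, 2))
```

Under a uniform background a fully conserved column (such as the third column, all A) contributes 2 bits, the maximum, while a column matching the background contributes 0.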
43Extending CONSENSUS to multiple species
Final goal: Find a set of substrings, one in each input sequence
44Extending CONSENSUS to multiple species
Final goal: Find a set of profiles, one in each set of orthologous input sequences
45Extending CONSENSUS to multiple species
Profiles
46Extending CONSENSUS to multiple species
Profiles
47Extending CONSENSUS to multiple species
48Aligning two profiles
- Compare two profiles column by column
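- Each column of a profile is $(n_A, n_C, n_G, n_T)$, and equivalently, $(f_A, f_C, f_G, f_T)$
- Probabilistic score to capture if two columns $\{n_{bi}, f_{bi}\}_b$ and $\{n_{bj}, f_{bj}\}_b$ are from the same distribution (and different from background)
- ALLR: Avg. Log Likelihood Ratio
$$\mathrm{ALLR} = \frac{\sum_b n_{bj} \ln(f_{bi}/p_b) + \sum_b n_{bi} \ln(f_{bj}/p_b)}{\sum_b (n_{bi} + n_{bj})}$$
where $p_b$ is the background frequency of base b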
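A small sketch of the ALLR between two profile columns, following the definition in Wang and Stormo (2003): each column's counts are weighted by the log-likelihood ratio of the other column's frequencies against the background, and the sum is divided by the total count. The background-weighted pseudocount used to avoid taking the logarithm of zero is an assumption of this sketch, as are the names `allr` and `UNIFORM`.

```python
import math

BASES = "ACGT"
UNIFORM = {base: 0.25 for base in BASES}  # assumed background frequencies


def allr(counts_i, counts_j, background, pseudocount=1.0):
    """Average log likelihood ratio between two columns given as base -> count dicts."""
    def frequencies(counts):
        total = sum(counts[b] + pseudocount * background[b] for b in BASES)
        return {b: (counts[b] + pseudocount * background[b]) / total for b in BASES}

    f_i, f_j = frequencies(counts_i), frequencies(counts_j)
    numerator = sum(
        counts_j[b] * math.log(f_i[b] / background[b])
        + counts_i[b] * math.log(f_j[b] / background[b])
        for b in BASES
    )
    return numerator / sum(counts_i[b] + counts_j[b] for b in BASES)


all_a = {"A": 5, "C": 0, "G": 0, "T": 0}
all_c = {"A": 0, "C": 5, "G": 0, "T": 0}
print(allr(all_a, all_a, UNIFORM) > 0)  # matching conserved columns score positive
print(allr(all_a, all_c, UNIFORM) < 0)  # conflicting columns score negative
```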
49One cool feature of ALLR
- Expected value is negative, which means very long profiles will not automatically give large ALLR scores
- Therefore, can automatically detect the right motif length
50PhyloCon features
- One of the first algorithms to find motifs that are conserved across species and occur in multiple co-regulated gene promoters
- Does not consider the evolutionary relationships among species (all species weighted equally)
51PhyME (Sinha et al.): A method with alignments and an evolutionary model
52- A set of promoters with many matches to an unknown motif W
- For some promoters, sequence from other species is also given
54Step 1: Use alignment program (LAGAN) to find ungapped blocks of conservation
55For each promoter, maximize Pr(promoter, orthologs | model with motif as parameter)
Model: Hidden Markov Model
Find motif (parameter) that maximizes the likelihood.
We'll study the model in detail this evening.
56A key component of the likelihood computation: Pr(site s | motif W)
Evolutionarily unrelated sites
57A key component of the likelihood computation: Pr(site s | motif W)
Given by evolutionary model
58Evolutionary model
- Two species, sites $s_1$ and $s_2$ in a conserved block
- $\Pr(s_1, s_2 \mid W)$
- Short time limit ($t \to 0$)
- $a = s_1 = s_2$
- $\Pr(s_1, s_2 \mid W) = \Pr(a \mid W)$
- Long time limit ($t \to \infty$)
- $\Pr(s_1, s_2 \mid W) = \Pr(s_1 \mid W) \times \Pr(s_2 \mid W)$
- Interpolate between these two limits
59Model of binding site evolution
- Evolving binding site must bind the same protein
- $\Pr(s_1, s_2 \mid W) = \sum_a \Pr(a \mid W) \prod_i \Pr(s_i \mid a, W, t)$
- Can be generalized to more than two species (recursively)
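One simple way to realize this interpolation is sketched below. It assumes that each position evolves independently and that, along each branch, a base is kept with probability 1 - mu and otherwise redrawn from the motif column, where mu in [0, 1] stands in for the evolutionary time t. This particular substitution model and all names are assumptions of the sketch; the slides do not spell out the exact form.

```python
BASES = "ACGT"


def site_prob(site, motif):
    """Pr(site | W): product of the motif's per-position base frequencies."""
    prob = 1.0
    for pos, base in enumerate(site):
        prob *= motif[pos][base]
    return prob


def transition(observed, ancestor, column, mu):
    """Pr(observed base | ancestral base, W, mu) under the assumed model."""
    keep = (1.0 - mu) if observed == ancestor else 0.0
    return keep + mu * column[observed]


def pair_prob(s1, s2, motif, mu):
    """Pr(s1, s2 | W): sum over the ancestral base, independently per position."""
    prob = 1.0
    for pos, column in enumerate(motif):
        prob *= sum(
            column[a]
            * transition(s1[pos], a, column, mu)
            * transition(s2[pos], a, column, mu)
            for a in BASES
        )
    return prob


# Toy two-column motif (hypothetical frequencies).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
print(pair_prob("AT", "AT", MOTIF, 0.0))  # short-time limit: equals Pr("AT" | W)
print(pair_prob("AT", "CT", MOTIF, 1.0))  # long-time limit: Pr("AT" | W) * Pr("CT" | W)
```

With mu = 0 two differing sites get probability 0, and with mu = 1 the two sites are independent draws from the motif, matching the two limits on the previous slide.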
60Training the motif
- Given a motif, we can compute
- Pr(promoter, orthologs | model with motif)
- But we have to find the motif that maximizes this probability
- Expectation-maximization algorithm
- Local search, not guaranteed to find global maximum
- More on E-M in evening lecture
61PhyloGibbs (Siddharthan et al.)
- Problem formulation very similar to PhyME (alignments, evolutionary model)
- Gibbs sampling approach to find motif
- A special MCMC strategy
- E-M (PhyME) prone to local optima
- Can find multiple motifs simultaneously
62PhyME & PhyloGibbs
- Algorithms that consider the phylogenetic tree relating the species
- Another algorithm of the same genre: MONKEY (Moses et al. 2004)
- Allow binding sites to occur in conserved (aligned) as well as unconserved regions
- Designed to find motifs in sets of co-regulated genes (and their orthologs)
- Not designed to find motifs from whole-genome analysis
63MCS (Kellis lab): Genome-wide motif finding from multiple species
64Algorithm
- Align four mammalian species genomes
- human, mouse, rat, dog
- Focus on all promoter regions and 3' UTRs
- For every possible motif (consensus string model)
- Count the number of occurrences in the human genome
- Count how many of these are completely conserved in all four species (obvious from alignment)
- Evaluate statistical significance
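The two counting steps can be sketched on a toy alignment. The sketch assumes the alignment is given as equal-length strings with '-' for gaps, that the reference is stored under the key "human", and that the motif is a plain string without degenerate symbols; these are simplifications for illustration.

```python
def count_conserved(motif, aligned):
    """Return (n, k): occurrences in the reference and fully conserved occurrences.

    aligned maps a species name to its aligned sequence; all sequences have the
    same length and '-' marks a gap. The key "human" is the reference (assumption).
    """
    width = len(motif)
    reference = aligned["human"]
    n = k = 0
    for start in range(len(reference) - width + 1):
        if reference[start:start + width] != motif:
            continue
        n += 1
        # Conserved means every species shows the identical string at this position.
        if all(seq[start:start + width] == motif for seq in aligned.values()):
            k += 1
    return n, k


ALIGNED = {
    "human": "TTACGTGGACGTAA",
    "mouse": "TTACGTGGACCTAA",
    "rat":   "CTACGTGGAC-TAA",
    "dog":   "TTACGTGGTCGTAA",
}
print(count_conserved("ACGT", ALIGNED))  # prints (2, 1)
```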
65Statistical Significance
- k = # of conserved occurrences of motif
- n = # of occurrences of motif
- p = k/n = conservation rate of motif
- For 100 random motifs of the same type, compute average conservation rate p0
- Compute the z score: z = (k - np0) / sqrt(np0(1-p0))
- Given n occurrences and rate p0 of being conserved, how significant are k conserved occurrences?
- Exact p-value: Binomial(n, p0, k)
- Binomial mean: np0, variance: np0(1-p0)
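Using the binomial mean and variance above, the significance of k conserved occurrences can be computed as a z score (normal approximation) or as an exact upper-tail p-value. The sketch below assumes the p-value of interest is Pr[X >= k] for X ~ Binomial(n, p0); the function names and the example numbers are illustrative only.

```python
import math


def z_score(k, n, p0):
    """Standard deviations by which k exceeds the binomial mean n * p0."""
    mean = n * p0
    std_dev = math.sqrt(n * p0 * (1.0 - p0))
    return (k - mean) / std_dev


def binomial_upper_tail(k, n, p0):
    """Exact Pr[X >= k] for X ~ Binomial(n, p0)."""
    return sum(
        math.comb(n, i) * p0 ** i * (1.0 - p0) ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical motif: 100 occurrences, 50 conserved, random-motif rate 0.2.
print(z_score(50, 100, 0.2))  # mean 20, standard deviation 4, so about 7.5
print(binomial_upper_tail(50, 100, 0.2))
```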
66MCS score
- The z score is called the MCS score
- Output all motifs with MCS > 6
- Post-process this list of motifs, to remove similar looking motifs (clustering)
- A final list of 174 motifs from promoters
- 69 match known motifs
- 105 potential new regulatory motifs
67Conclusion
- Comparative genomics has infused new life into the motif-finding community
- A variety of algorithms geared towards various assumptions
- Footprinter: no alignments (2000)
- PhyloCon: alignments, but no tree (2003)
- PhyME: alignments, tree, and evolutionary model (2004)
- MCS: genome-wide motif discovery from very closely related species (2005)
68QUESTIONS ?
69References: single species motif finding
- Bailey TL, Elkan C. Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning. 1995 Oct;21(1-2):51-80.
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993 Oct;262:208-214.
- Hertz GZ, Hartzell GW 3rd, Stormo GD. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. CABIOS (now Bioinformatics). 1990;6(2):81-92.
- van Helden J, Andre B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 1998;281:827-842.
- Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2000;8:344-354.
70References: multiple species motif finding
- Blanchette M, Schwikowski B, Tompa M. An exact algorithm to identify motifs in orthologous sequences from multiple species. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000). 2000; pp. 37-45.
- Wang T, Stormo GD. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics. 2003 Dec 12;19(18):2369-80.
- Sinha S, Blanchette M, Tompa M. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics. 2004 Oct 28;5:170.
- Siddharthan R, Siggia ED, van Nimwegen E. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol. 2005 Dec;1(7):e67. Epub 2005 Dec 9.
- Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 2004;5(12):R98. Epub 2004 Nov 30.
- Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature. 2005 Mar 17;434(7031):338-45. Epub 2005 Feb 27.