Comparative genomics to identify DNA binding motifs

1
Comparative genomics to identify DNA binding
motifs

Saurabh Sinha
Dept. of Computer Science
University of Illinois, Urbana-Champaign

2
Outline

Binding sites and motifs
The motif finding problem in one species
Comparative genomics and alignment
The motif finding problem with comparative
genomics

3
Motif finding in multiple species

Footprinter: the approach without alignments
PhyloCon: the use of alignments
PhyME & PhyloGibbs: the use of alignments and an
evolutionary model
MCS: genome-wide motif finding from multiple
species

4
Binding sites and motifs
5
Binding sites

A few binding sites of transcription factor
Bicoid in the Drosophila (fruitfly) genome,
collected experimentally

6
http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
8
Consensus string
W = T or A; N = A, C, G, or T
http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
9
Motif

Common sequence pattern in the binding sites of
a transcription factor
A succinct way of capturing variability among the
binding sites

10
Alternative way to represent a motif
Position weight matrix (PWM), or simply weight
matrix

A  1 1 9 9 0 0 0 1
C  6 0 0 0 0 9 8 7
G  1 0 0 0 1 0 0 1
T  1 8 0 0 8 0 1 0
11
Motif representation

Consensus string
May allow degenerate symbols in string, e.g.,
N = A/C/G/T, W = A/T, S = C/G, R = A/G, Y = T/C,
etc.
Tractable search space, enumerative algorithms
Position weight matrix
More powerful representation
Probabilistic treatment, algorithms
More popular
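
The two representations can be related with a small sketch. It builds a count matrix from a set of aligned sites and then collapses each column into a consensus symbol. The function names, the toy sites, and the `min_fraction` threshold rule are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter

# Degenerate symbols keyed by the set of bases they stand for
# (the symbols on the slide, plus the standard K and M).
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GT"): "K", frozenset("AC"): "M",
}

def count_matrix(sites):
    """Return one Counter of base counts per column of equal-length sites."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must have equal length"
    return [Counter(s[i] for s in sites) for i in range(length)]

def consensus(matrix, min_fraction=0.75):
    """Per column, take the most common bases until they cover min_fraction
    of the sites (an assumed rule), then map that set to one symbol."""
    out = []
    for col in matrix:
        total = sum(col.values())
        chosen, covered = set(), 0
        for base, n in col.most_common():
            chosen.add(base)
            covered += n
            if covered >= min_fraction * total:
                break
        out.append(IUPAC.get(frozenset(chosen), "N"))
    return "".join(out)

# Toy, hypothetical sites (not the experimental Bicoid data).
sites = ["TAATCC", "TAATCC", "AAATCC", "TAAGCC"]
matrix = count_matrix(sites)
```

With the default threshold the toy sites collapse to a plain string; a stricter threshold such as `consensus(matrix, 0.9)` forces degenerate symbols into the variable columns, which shows how much information the consensus string discards relative to the matrix.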

12
The motif finding problem (in one species)

Suppose a transcription factor (TF) regulates
five different genes
Each of the five genes should have binding sites
for TF in its promoter region

[Figure: Genes 1-5, each with binding sites for TF in its promoter]
13
The motif finding problem

Now suppose we are given the promoter regions of
the five genes G1, G2, ..., G5
Can we find the binding sites of TF, without
knowing about them a priori?
Binding sites are similar to each other, but not
necessarily identical
This is the motif finding problem: to find a
motif that represents the binding sites of an
unknown TF

14
Motif finding algorithms

Version 1: Given promoter regions of co-regulated
genes, find the motif
Existing algorithms
Gibbs sampling (MCMC): Lawrence et al. 1993
MEME (Expectation-Maximization): Bailey & Elkan
'94
CONSENSUS (greedy local search, beam search):
Hertz & Stormo
Word enumeration methods (with emphasis on
statistical accuracy):
van Helden et al. 1998, Sinha & Tompa 2000
And a hundred others
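
As a rough illustration of the word enumeration idea, the sketch below counts, for every k-mer, how many promoters contain it at least once. This is only the counting step: the cited methods go on to assess statistical significance against a background model, which is omitted here. The function name and the toy promoters are hypothetical.

```python
from collections import Counter

def kmers_by_sequence_support(promoters, k):
    """Rank k-mers by the number of promoters that contain them."""
    support = Counter()
    for seq in promoters:
        # A set, so each promoter contributes at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support.most_common()

# Toy, hypothetical promoter fragments sharing one planted 6-mer.
promoters = ["ACGTTAATCCGA", "GGTAATCCTT", "CTAATCCAAG"]
ranking = kmers_by_sequence_support(promoters, 6)
```

Here `ranking[0]` is the planted word, since it is the only 6-mer present in all three toy sequences.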

15
Comparative Genomics
16
More Data

Genomes of multiple species available

17
Using multiple genomes

Functional parts of the genome evolve more slowly
than non-functional parts
Identify conserved parts by sequence alignment
algorithms
Look for functional features in conserved
regions; this improves the signal

Popular Paradigm in Computational Biology
18
Multiple sequence alignment

Comparative genomics relies upon the ability to
detect similar (evolutionarily related) regions
in different genomes
This is the problem of multiple species alignment
A hard computational problem (NP-hard)
Several fast heuristics exist (e.g., MLAGAN, TBA)
Assume this functionality exists

19
Back to motif finding
20
Motif finding from multiple species data

Version 2: Given promoter regions of the same gene
from multiple species, find the motif

[Figure: promoter regions of gene G in species 1-5, with binding sites for TF marked]
21
One approach

Do multiple sequence alignment of the upstream
regions of the gene

[Figure: aligned upstream regions of gene G in species 1-5]

Look for recurring motifs in conserved blocks

22
Another approach (alignment-free)

What if binding sites are not entirely within
conserved blocks?

[Figure: upstream regions of gene G in species 1-5]

Look for recurring motifs in entire upstream
regions

23
Footprinter (Blanchette et al.): the method
without alignments
24
Footprinter

The input sequences are promoter regions of the
same gene, but from multiple species.
Such sequences are said to be orthologous to
each other.

25
Footprinter
Input sequences
Related by an evolutionary tree
Find motif
26
A side note: parsimony

A guiding principle in cross-species comparison
If the data can be explained in multiple ways,
prefer the one with the fewest events
(be parsimonious)
Parsimony score: number of evolutionary events
(e.g., substitutions) on the tree
Maximum parsimony principle: minimize the
parsimony score

27
Phylogenetic footprinting, formally speaking

Given:
phylogenetic tree T,
set of orthologous sequences at leaves of T,
length k of motif,
threshold d
Problem:
Find set S of k-mers, one k-mer from each leaf,
such that the parsimony score of S in T is at
most d.

28
Small Example
Size of motif sought: k = 4
29
Solution
Parsimony score: 1 mutation
30
An exact algorithm (Blanchette's algorithm)
$W_u[s]$ = best parsimony score for the subtree
rooted at node u, if u is labeled with string s.
31
Recurrence
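
The recurrence itself did not survive in this transcript. A plausible reading, assumed here and consistent with the definition of W_u[s] on the previous slide, is the standard dynamic program: at a leaf, W_u[s] is 0 if s occurs in the leaf's sequence and infinite otherwise; at an internal node, W_u[s] sums over the children v the minimum over strings t of W_v[t] + d(s, t), with d the Hamming distance. The sketch below enumerates all 4^k strings, so it is only practical for small k; the tree encoding and sequences are hypothetical.

```python
import math
from itertools import product

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def footprint_table(node, k, kmers):
    """Return {s: W[s]} for the subtree rooted at node.
    A leaf is a sequence string (length >= k); an internal node is a tuple of children."""
    if isinstance(node, str):
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else math.inf) for s in kmers}
    table = dict.fromkeys(kmers, 0)
    for child in node:
        child_table = footprint_table(child, k, kmers)
        # Only labels with a finite score can be the minimizer.
        finite = [(t, w) for t, w in child_table.items() if w < math.inf]
        for s in kmers:
            table[s] += min(w + hamming(s, t) for t, w in finite)
    return table

def best_parsimony(tree, k):
    """Smallest parsimony score over all choices of one k-mer per leaf."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return min(footprint_table(tree, k, kmers).values())

# Hypothetical orthologous sequences; ACGT occurs in three, ACGA in the fourth.
tree = (("AACGTA", "TACGTT"), ("GGACGA", "CACGTC"))
best = best_parsimony(tree, 4)
```

Comparing `best` with the threshold d answers the decision problem; recovering the k-mers themselves would require keeping the minimizing t at each step, which this sketch leaves out.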
32
Running Time
$O(k \cdot 4^{2k})$ time per node
33
Footprinter features

One of the earliest motif-finding algorithms
based on comparative genomics
Simple formulation of motif score, algorithm
efficient in practice
Cannot combine evolutionary conservation
information with overrepresentation information;
e.g., it cannot tell apart two motifs that are
equally conserved when only one of them occurs in
many co-regulated genes (promoters)

34
PhyloCon (Stormo lab): the method with alignments
35
The underlying single-species algorithm: CONSENSUS
Final goal: find a set of substrings, one in
each input sequence
The set of substrings defines a PWM. Goal: this
PWM should have high information content.
High information content means that the motif
stands out.
38
The underlying single-species algorithm: CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Consider every substring in the next sequence,
try adding it to the current motif, and score the
resulting motif
40
The underlying single-species algorithm: CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Pick the best one ... and repeat
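
The greedy loop described on these slides can be sketched as follows. This is a simplified illustration, not the CONSENSUS program: it keeps a single partial alignment per starting substring (the real method is described earlier in the deck as a beam search), assumes a uniform background, and uses hypothetical function names and toy sequences.

```python
import math
from collections import Counter

BACKGROUND = {b: 0.25 for b in "ACGT"}  # assumed uniform background

def information_content(sites):
    """Relative entropy (in bits) of the aligned sites, summed over columns."""
    total = 0.0
    for column in zip(*sites):
        for base, n in Counter(column).items():
            f = n / len(sites)
            total += f * math.log2(f / BACKGROUND[base])
    return total

def greedy_consensus(sequences, k):
    """Try every k-mer of the first sequence as a seed; extend it greedily by
    adding, from each later sequence, the k-mer that scores best."""
    best_sites, best_score = None, -math.inf
    first = sequences[0]
    for i in range(len(first) - k + 1):
        sites = [first[i:i + k]]
        for seq in sequences[1:]:
            candidates = [seq[j:j + k] for j in range(len(seq) - k + 1)]
            sites.append(max(candidates,
                             key=lambda c: information_content(sites + [c])))
        score = information_content(sites)
        if score > best_score:
            best_sites, best_score = sites, score
    return best_sites, best_score

# Toy, hypothetical promoters sharing one planted 6-mer.
sequences = ["ACGTTAATCCGA", "GGTAATCCTT", "CTAATCCAAG"]
found_sites, found_score = greedy_consensus(sequences, 6)
```

On the toy input the planted word is recovered with the maximum possible score of 2 bits per column. Keeping only one candidate per step makes the sketch short but also makes it easier to get trapped by an early bad choice than a beam search would be.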
42
The key: scoring a motif
The current motif:

A  1 1 9 9 0 0 0 1
C  6 0 0 0 0 9 8 7
G  1 0 0 0 1 0 0 1
T  1 8 0 0 8 0 1 0

Scoring a motif
Build a PWM
Compute information content of the PWM: for each
column, compute the relative entropy relative to
a background distribution, then sum over all
columns
Key: align the sites of a motif, and score the
alignment
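
As a worked example, the sketch below computes the per-column relative entropy for the count matrix shown on this slide. The uniform background and the choice of base-2 logarithms (bits) are assumptions; the slide does not fix either. Zero counts contribute nothing, following the usual convention that 0 log 0 = 0.

```python
import math

# Count matrix from the slide: one list of 8 column counts per base.
COUNTS = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}
BACKGROUND = {b: 0.25 for b in "ACGT"}  # assumed uniform background

def column_relative_entropy(counts, background):
    """Sum over bases of f_b * log2(f_b / p_b), skipping zero counts."""
    total = sum(counts.values())
    return sum((n / total) * math.log2((n / total) / background[b])
               for b, n in counts.items() if n > 0)

def information_content(matrix, background):
    """Return the list of per-column relative entropies."""
    width = len(matrix["A"])
    return [column_relative_entropy({b: matrix[b][i] for b in matrix}, background)
            for i in range(width)]

per_column = information_content(COUNTS, BACKGROUND)
motif_score = sum(per_column)
```

Columns 3, 4 and 6 contain a single base in all nine sites, so under a uniform background they each reach the 2-bit maximum; the more variable columns score lower.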
43
Extending CONSENSUS to multiple species
Final goal: find a set of substrings, one in
each input sequence
44
Extending CONSENSUS to multiple species
Final goal: find a set of profiles, one in
each set of orthologous input sequences
45
Extending CONSENSUS to multiple species
Profiles
47
Extending CONSENSUS to multiple species
48
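Aligning two profiles

Compare two profiles column by column
Each column of a profile is $(n_A, n_C, n_G, n_T)$,
or equivalently $(f_A, f_C, f_G, f_T)$
Probabilistic score to capture whether two columns
i and j, with counts and frequencies
$\{n_{bi}, f_{bi}\}_b$ and $\{n_{bj}, f_{bj}\}_b$, are from the
same distribution (and different from background)
ALLR: Average Log Likelihood Ratio (Wang & Stormo 2003)

$$\mathrm{ALLR} = \frac{\sum_b n_{bj} \ln\frac{f_{bi}}{p_b} + \sum_b n_{bi} \ln\frac{f_{bj}}{p_b}}{\sum_b (n_{bi} + n_{bj})}$$

where $p_b$ is the background frequency of base b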
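
A minimal sketch of the column comparison, assuming the ALLR definition from Wang & Stormo (2003): the counts of each column are scored against the log-ratio of the other column's frequencies to the background, and the sum is divided by the total count. The pseudocount used to keep frequencies non-zero and the function name are assumptions for illustration.

```python
import math

def allr(counts_i, counts_j, background, pseudocount=1.0):
    """Average log likelihood ratio between two profile columns.
    counts_* map each base to its count; background maps each base to p_b."""
    def freqs(counts):
        # Pseudocounts (an assumption) avoid log(0) for unseen bases.
        total = sum(counts.values()) + pseudocount
        return {b: (counts[b] + pseudocount * background[b]) / total
                for b in background}

    f_i, f_j = freqs(counts_i), freqs(counts_j)
    numerator = sum(
        counts_j[b] * math.log(f_i[b] / background[b])
        + counts_i[b] * math.log(f_j[b] / background[b])
        for b in background
    )
    return numerator / (sum(counts_i.values()) + sum(counts_j.values()))

background = {b: 0.25 for b in "ACGT"}
col_a = {"A": 9, "C": 0, "G": 0, "T": 0}
col_a2 = {"A": 8, "C": 1, "G": 0, "T": 0}
col_c = {"A": 0, "C": 9, "G": 0, "T": 0}
```

Two A-rich columns such as `col_a` and `col_a2` score positive, while `col_a` against `col_c` scores negative, which is the behaviour the slide asks of the score.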
49
One cool feature of ALLR

Expected value is negative, so very long
profiles will not automatically give large ALLR
scores
Therefore, the right motif length can be detected
automatically

50
PhyloCon features

One of the first algorithms to find motifs that
are conserved across species and occur in
multiple co-regulated gene promoters
Does not consider the evolutionary relationships
among species (all species weighted equally)

51
PhyME (Sinha et al.): a method with alignments
and an evolutionary model
52

Input:

A set of promoters with many matches to an unknown
motif W
For some promoters, sequence from other species
is also given

Output: the motif W

54
Step 1: Use an alignment program (LAGAN) to find
ungapped blocks of conservation
55
For each promoter, maximize Pr(promoter,
orthologs | model with motif as parameter)
Model: Hidden Markov Model
Find the motif (parameter) that maximizes the
likelihood.
We'll study the model in detail this evening
56
A key component of the likelihood computation:
Pr(site s | motif W)
Evolutionarily unrelated sites
57
A key component of the likelihood computation:
Pr(site s | motif W)
Given by evolutionary model
58
Evolutionary model

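Two species, sites $s_1$ and $s_2$ in a conserved block
$\Pr(s_1, s_2 \mid W)$
Short time limit ($t \to 0$):
$a = s_1 = s_2$, where a is the common ancestral site
$\Pr(s_1, s_2 \mid W) = \Pr(a \mid W)$
Long time limit ($t \to \infty$):
$\Pr(s_1, s_2 \mid W) = \Pr(s_1 \mid W) \cdot \Pr(s_2 \mid W)$
Interpolate between these two limits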

59
Model of binding site evolution

Evolving binding site must bind the same protein
$\Pr(s_1, s_2 \mid W) = \sum_a \Pr(a \mid W) \prod_i \Pr(s_i \mid a, W, t)$
Can be generalized to more than two species
(recursively)
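
The sum over ancestral sites can be sketched position by position. The slides do not spell out Pr(s_i | a, W, t), so the code assumes a simple substitution model: with probability 1 - mu a position keeps the ancestral base, and with probability mu it is redrawn from the corresponding column of W. Here mu is a hypothetical stand-in for the effect of the branch length t, and the motif W is a made-up two-column example.

```python
def site_probability(site, motif):
    """Pr(site | W): product of the column probabilities of its bases."""
    prob = 1.0
    for col, base in zip(motif, site):
        prob *= col[base]
    return prob

def pair_probability(s1, s2, motif, mu):
    """Pr(s1, s2 | W) for two aligned sites with a common ancestor,
    summing over the ancestral base independently at each position."""
    prob = 1.0
    for col, (x1, x2) in zip(motif, zip(s1, s2)):
        total = 0.0
        for a, w_a in col.items():
            # Assumed model: keep the ancestral base, or redraw from W's column.
            p1 = (1 - mu) * (x1 == a) + mu * col[x1]
            p2 = (1 - mu) * (x2 == a) + mu * col[x2]
            total += w_a * p1 * p2
        prob *= total
    return prob

# Hypothetical two-column motif W (each column sums to 1).
W = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
```

Under this assumed model, mu = 0 reproduces the short time limit (identical sites get Pr(a | W), differing sites get probability zero) and mu = 1 reproduces the long time limit (the two sites are independent draws from W), with intermediate values interpolating between them.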

60
Training the motif

Given a motif, we can compute
Pr(promoter, orthologs | model with motif)
But we have to find the motif that maximizes this
probability
Expectation-maximization algorithm
Local search, not guaranteed to find global
maximum
More on E-M in evening lecture

61
PhyloGibbs (Siddharthan et al.)

Problem formulation very similar to PhyME
(alignments, evolutionary model)
Gibbs sampling approach to find motif
A special MCMC strategy
E-M (PhyME) prone to local optima
Can find multiple motifs simultaneously

62
PhyME & PhyloGibbs

Algorithms that consider the phylogenetic tree
relating the species
Another algorithm of the same genre: MONKEY (Moses
et al. 2004)
Allow binding sites to occur in conserved
(aligned) as well as unconserved regions
Designed to find motifs in sets of co-regulated
genes (and their orthologs)
Not designed to find motifs from whole-genome
analysis

63
MCS (Kellis lab): genome-wide motif finding from
multiple species
64
Algorithm

Align four mammalian genomes: human, mouse, rat,
dog
Focus on all promoter regions and 3' UTRs
For every possible motif (consensus string model):
Count the number of occurrences in the human
genome
Count how many of these are completely conserved
in all four species (obvious from alignment)
Evaluate statistical significance
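
The two counting steps can be illustrated on a toy gap-free alignment. The sketch treats an occurrence as conserved when every aligned species also matches the consensus string in the same columns; that reading of "completely conserved", the absence of gaps, and the sequences themselves are assumptions made for illustration.

```python
import re

# Consensus-string symbols mapped to regular-expression character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def conservation_counts(motif, aligned):
    """Return (n, k): occurrences in the reference (first) sequence, and how
    many of them are matched in the same columns of every other sequence."""
    pattern = re.compile("".join(IUPAC[c] for c in motif))
    width = len(motif)
    reference = aligned[0]
    n = k = 0
    for i in range(len(reference) - width + 1):
        if pattern.fullmatch(reference[i:i + width]):
            n += 1
            if all(pattern.fullmatch(seq[i:i + width]) for seq in aligned[1:]):
                k += 1
    return n, k

# Toy, hypothetical alignment: reference first, then three other species.
aligned = [
    "TTAATCCGTAATCC",
    "TTAATCCGTAATGC",
    "CTAATCCGTAAACC",
    "TTAATCCATAATCC",
]
n, k = conservation_counts("TAATCC", aligned)
```

In the toy alignment the motif occurs twice in the reference, but only the first occurrence is intact in all four rows.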

65
Statistical Significance

k = # of conserved occurrences of motif
n = # of occurrences of motif
p = k/n: conservation rate of motif
For 100 random motifs of the same type, compute
the average conservation rate p0
Compute the significance of k conserved
occurrences, given n occurrences and rate p0 of
being conserved
Exact p-value: Binomial(n, p0, k)
Binomial mean: n p0; variance: n p0 (1 - p0)
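
Given the binomial mean and variance above, the z score follows as (k - n p0) / sqrt(n p0 (1 - p0)). The sketch below computes it alongside the exact upper-tail probability. The function names are hypothetical, and treating the exact p-value as Pr(X >= k) is an assumption about what Binomial(n, p0, k) denotes on the slide.

```python
import math

def conservation_z_score(k, n, p0):
    """Standardize k against a Binomial(n, p0) null: (k - mean) / sd."""
    return (k - n * p0) / math.sqrt(n * p0 * (1 - p0))

def binomial_tail(k, n, p0):
    """Exact Pr(X >= k) for X ~ Binomial(n, p0)."""
    return sum(math.comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical counts: 60 of 100 occurrences conserved, null rate 0.5.
z = conservation_z_score(60, 100, 0.5)
p_value = binomial_tail(60, 100, 0.5)
```

The z score is cheap to compute for every candidate motif, which matters when all possible consensus strings are enumerated; the exact tail is more accurate when n is small or p0 is far from 0.5.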

66
MCS score

The z score is called the MCS score
Output all motifs with MCS > 6
Post-process this list of motifs, to remove
similar looking motifs (clustering)
A final list of 174 motifs from promoters
69 match known motifs
105 potential new regulatory motifs

67
Conclusion

Comparative genomics has infused new life into
the motif-finding community
A variety of algorithms geared towards various
assumptions
Footprinter: no alignments (2000)
PhyloCon: alignments, but no tree (2003)
PhyME: alignments, tree, and evolutionary model
(2004)
MCS: genome-wide motif discovery from very
closely related species (2005)

68
QUESTIONS ?
69
References: single-species motif finding

Timothy L. Bailey and Charles Elkan,
"Unsupervised Learning of Multiple Motifs in
Biopolymers using EM", Machine Learning,
21(1-2):51-80, October 1995.
Lawrence, C. E., S. F. Altschul, M. S. Boguski,
J. S. Liu, A. F. Neuwald, and J. C. Wootton
(1993, October). Detecting subtle sequence
signals: a Gibbs sampling strategy for multiple
alignment. Science 262, 208-214.
Hertz GZ, Hartzell GW 3rd, Stormo GD.
Identification of consensus patterns in
unaligned DNA sequences known to be functionally
related. CABIOS (now Bioinformatics) 1990;
6(2):81-92.
van Helden, J., Andre, B. and Collado-Vides, J.
(1998) Extracting regulatory sites from the
upstream region of yeast genes by computational
analysis of oligonucleotide frequencies. J. Mol.
Biol., 281, 827-842.
Sinha, S. and Tompa, M., A statistical method for
finding transcription factor binding sites, Proc.
Int. Conf. Intell. Syst. Mol. Biol., 8:344-354,
2000.

70
References: multiple-species motif finding

Blanchette, M., Schwikowski, B. and Tompa, M.
(2000). An exact algorithm to identify motifs in
orthologous sequences from multiple species.
Proceedings of the Eighth International
Conference on Intelligent Systems for Molecular
Biology (ISMB 2000), pp. 37-45.
Wang T, Stormo GD. Combining phylogenetic data
with co-regulated genes to identify regulatory
motifs. Bioinformatics. 2003 Dec 12;19(18):2369-80.
Sinha S, Blanchette M, Tompa M. PhyME: a
probabilistic algorithm for finding motifs in
sets of orthologous sequences. BMC Bioinformatics.
2004 Oct 28;5:170.
Siddharthan R, Siggia ED, van Nimwegen E.
PhyloGibbs: a Gibbs sampling motif finder that
incorporates phylogeny. PLoS Comput Biol.
2005 Dec;1(7):e67. Epub 2005 Dec 9.
Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen
MB. MONKEY: identifying conserved
transcription-factor binding sites in multiple
alignments using a binding site-specific
evolutionary model. Genome Biol. 2004;5(12):R98.
Epub 2004 Nov 30.
Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V,
Lindblad-Toh K, Lander ES, Kellis M. Systematic
discovery of regulatory motifs in human promoters
and 3' UTRs by comparison of several mammals.
Nature. 2005 Mar 17;434(7031):338-45.
Epub 2005 Feb 27.