1
Comparative genomics to identify DNA binding
motifs
  • Saurabh Sinha
  • Dept. of Computer Science
  • University of Illinois, Urbana-Champaign

2
Outline
  • Binding sites and motifs
  • The motif finding problem in one species
  • Comparative genomics and alignment
  • The motif finding problem with comparative
    genomics

3
Motif finding in multiple species
  • Footprinter: the approach without alignments
  • PhyloCon: the use of alignments
  • PhyME & PhyloGibbs: the use of alignments and an
    evolutionary model
  • MCS: genome-wide motif finding from multiple
    species

4
Binding sites and motifs
5
Binding sites
  • A few binding sites of transcription factor
    Bicoid in the Drosophila (fruitfly) genome,
    collected experimentally

6
http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
7
http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
8
W = T or A; N = A, C, G, or T
Consensus string
http://webdisk.berkeley.edu/dap5/data_04/motifs/bicoid.gif
9
Motif
  • Common sequence pattern in the binding sites of
    a transcription factor
  • A succinct way of capturing variability among the
    binding sites

10
Alternative way to represent motif
Pos   1 2 3 4 5 6 7 8
A     1 1 9 9 0 0 0 1
C     6 0 0 0 0 9 8 7
G     1 0 0 0 1 0 0 1
T     1 8 0 0 8 0 1 0
Position weight matrix (PWM), or simply, weight
matrix
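As a small illustration of how such a matrix is used, the sketch below turns the counts shown above into per-position frequencies and reads off the most frequent base in each column. The function names and the optional pseudocount are assumptions made for this example, not part of the talk.

```python
# Count matrix copied from the slide: one row per base, one column per position.
COUNTS = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}


def to_frequencies(counts, pseudocount=0.0):
    """Convert per-position counts into per-position base frequencies."""
    bases = sorted(counts)
    width = len(counts[bases[0]])
    freqs = {b: [] for b in bases}
    for j in range(width):
        total = sum(counts[b][j] + pseudocount for b in bases)
        for b in bases:
            freqs[b].append((counts[b][j] + pseudocount) / total)
    return freqs


def consensus(counts):
    """Most frequent base at each position (ties broken alphabetically)."""
    bases = sorted(counts)
    width = len(counts[bases[0]])
    return "".join(max(bases, key=lambda b: counts[b][j]) for j in range(width))


FREQS = to_frequencies(COUNTS)
CONSENSUS = consensus(COUNTS)
```

Every column of the slide's matrix sums to 9, so each frequency is simply the count divided by 9.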
11
Motif representation
  • Consensus string
  • May allow degenerate symbols in string, e.g.,
    N = A/C/G/T, W = A/T, S = C/G, R = A/G, Y = T/C,
    etc.
  • Tractable search space, enumerative algorithms
  • Position weight matrix
  • More powerful representation
  • Probabilistic treatment, algorithms
  • More popular
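A consensus string with degenerate symbols can be searched for directly. The sketch below is one possible way to do it, by translating the symbols listed above into a regular expression; the function names and the example pattern `TAATCY` are illustrative assumptions, not taken from the slides.

```python
import re

# Degenerate symbols from the slide, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
}


def consensus_to_regex(consensus):
    """Translate a consensus string into a compiled regular expression."""
    parts = []
    for sym in consensus.upper():
        allowed = IUPAC[sym]
        parts.append(allowed if len(allowed) == 1 else "[" + allowed + "]")
    return re.compile("".join(parts))


def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead keeps each match zero-width, so overlapping sites are reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus).pattern + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

For example, `find_sites("TAATCY", "GGTAATCCAATAATCTG")` reports the two positions where the pattern occurs, once with Y = C and once with Y = T.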

12
The motif finding problem (in one species)
  • Suppose a transcription factor (TF) regulates
    five different genes
  • Each of the five genes should have binding sites
    for TF in their promoter region

Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Binding sites for TF
13
The motif finding problem
  • Now suppose we are given the promoter regions of
    the five genes G1, G2, ..., G5
  • Can we find the binding sites of TF, without
    knowing about them a priori?
  • Binding sites are similar to each other, but not
    necessarily identical
  • This is the motif finding problem
  • To find a motif that represents binding sites of
    an unknown TF

14
Motif finding algorithms
  • Version 1: Given promoter regions of co-regulated
    genes, find the motif
  • Existing algorithms
  • Gibbs sampling (MCMC): Lawrence et al. 1993
  • MEME (Expectation-Maximization): Bailey & Elkan
    '94
  • CONSENSUS (Greedy local search, beam search):
    Hertz & Stormo
  • Word enumeration methods (with emphasis on
    statistical accuracy)
  • van Helden et al. 1998, Sinha & Tompa 2000
  • And a hundred others

15
Comparative Genomics
16
More Data
  • Genomes of multiple species available

17
Using multiple genomes
  • Functional parts of the genome evolve more slowly
    than non-functional parts
  • Identify conserved parts by sequence alignment
    algorithms
  • Look for functional features in conserved regions;
    this improves the signal

Popular Paradigm in Computational Biology
18
Multiple sequence alignment
  • Comparative genomics relies upon the ability to
    detect similar (evolutionarily related) regions
    in different genomes
  • The problem of multiple species alignment
  • A hard computational problem (NP-hard)
  • Several fast heuristics exist (MLAGAN, TBA)
  • Assume this functionality exists

19
Back to motif finding
20
Motif finding from multiple species data
  • Version 2: Given promoter regions of the same gene
    from multiple species, find the motif

Species 1
Species 2
Gene G
Species 3
Species 4
Species 5
Binding sites for TF
21
One approach
  • Do multiple sequence alignment of upstream
    regions of gene

Species 1
Species 2
Gene G
Species 3
Species 4
Species 5
  • Look for recurring motifs in conserved blocks

22
Another approach (alignment-free)
  • What if binding sites are not entirely within
    conserved blocks?

Species 1
Species 2
Gene G
Species 3
Species 4
Species 5
  • Look for recurring motifs in entire upstream
    regions

23
Footprinter (Blanchette et al.): The method
without alignments
24
Footprinter
  • The input sequences are promoter regions of the
    same gene, but from multiple species.
  • Such sequences are said to be orthologous to
    each other.

25
Footprinter
Input sequences
Related by an evolutionary tree
Find motif
26
A side note: Parsimony
  • A guiding principle in cross-species comparison
  • If the data can be explained in multiple ways,
    prefer the one with the fewest events
    (be parsimonious)
  • Parsimony score: number of evolutionary events
    (e.g., substitutions) on the tree
  • Maximum parsimony principle: minimize the parsimony
    score
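To make the parsimony score concrete, here is a minimal sketch that counts the fewest substitutions needed to explain equal-length sequences at the leaves of a tree. It uses Fitch's small-parsimony algorithm, which is a standard technique swapped in for illustration and is not named in the talk. The nested-tuple tree encoding and the function names are assumptions, and the sketch handles binary trees only.

```python
def fitch(tree):
    """Fitch's algorithm for one site.

    tree is either a single base (a leaf) or a 2-tuple of subtrees.
    Returns (possible ancestral states, minimum number of substitutions).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    lset, lcost = fitch(left)
    rset, rcost = fitch(right)
    common = lset & rset
    if common:
        return common, lcost + rcost
    # No shared state: one substitution is needed on this part of the tree.
    return lset | rset, lcost + rcost + 1


def leaves(tree):
    """List the leaf sequences of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def column_tree(tree, j):
    """Project a tree of sequences onto alignment column j."""
    if isinstance(tree, str):
        return tree[j]
    return tuple(column_tree(child, j) for child in tree)


def parsimony_score(tree):
    """Total substitutions over all columns of equal-length leaf sequences."""
    length = len(leaves(tree)[0])
    return sum(fitch(column_tree(tree, j))[1] for j in range(length))
```

With four leaves where only one sequence differs from the others at a single position, the score is one substitution.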

27
Phylogenetic footprinting: formally speaking
  • Given
  • phylogenetic tree T,
  • set of orthologous sequences at leaves of T,
  • length k of motif
  • threshold d
  • Problem
  • Find set S of k-mers, one k-mer from each leaf,
    such that the parsimony score of S in T is at
    most d.

28
Small Example
Size of motif sought: k = 4
29
Solution
Parsimony score: 1 mutation
30
An Exact Algorithm (Blanchette's algorithm)
W_u[s] = best parsimony score for subtree rooted
at node u, if u is labeled with string s.
31
Recurrence
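Using the W_u[s] definition from the previous slide, the recurrence can be written as W_u[s] = sum over children v of u of min over k-mers t of (W_v[t] + d(s, t)), where d is the number of mismatches between s and t. At a leaf, W_u[s] is 0 if s occurs in the leaf's sequence and infinite otherwise. The sketch below is a direct, unoptimized rendering of that dynamic program. The nested-tuple tree encoding and the function names are assumptions for illustration, and this is not the Footprinter implementation.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def footprint_score(tree, k):
    """Best parsimony score over all choices of one k-mer per leaf.

    tree: nested tuples; each leaf is the DNA sequence of one species.
    Enumerates all 4**k labels, so it is only practical for small k.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def table(node):
        if isinstance(node, str):
            # Leaf: score 0 for k-mers that occur in the sequence, else infinity.
            present = {node[i:i + k] for i in range(len(node) - k + 1)}
            return {s: (0 if s in present else float("inf")) for s in kmers}
        child_tables = [table(child) for child in node]
        # Internal node: for each label s, pick the best label t on every child.
        return {
            s: sum(min(ct[t] + hamming(s, t) for t in kmers) for ct in child_tables)
            for s in kmers
        }

    return min(table(tree).values())
```

A solution exists for threshold d exactly when the returned score is at most d.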
32
Running Time
O(k · 4^(2k)) time per node
33
Footprinter features
  • One of the earliest motif-finding algorithms
    based on comparative genomics
  • Simple formulation of motif score, algorithm
    efficient in practice
  • Cannot combine evolutionary conservation
    information with overrepresentation information
  • two motifs, equally conserved, but one occurs in
    many co-regulated genes (promoters)

34
PhyloCon (Stormo lab): The method with alignments
35
The underlying single-species algorithm: CONSENSUS
Final goal: Find a set of substrings, one in
each input sequence
The set of substrings defines a PWM. Goal: This PWM
should have high information content.
High information content means that the motif
stands out.
The underlying single-species algorithm: CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
37
The underlying single-species algorithm: CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
38
The underlying single-species algorithm: CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Consider every substring in the next sequence,
try adding it to current motif and scoring
resulting motif
39
The underlying single-species algorithm: CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Pick the best one.
40
The underlying single-species algorithm: CONSENSUS
Start with a substring in one input sequence
Build the set of substrings incrementally, adding
one substring at a time
The current set of substrings.
The current motif.
Pick the best one,
and repeat.
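The incremental procedure on these slides can be sketched in a few lines. The version below is a simplification under stated assumptions: it tries every substring of the first sequence as a seed and keeps only the single best extension at each step, whereas CONSENSUS keeps several partial solutions (beam search). The uniform background, the pseudocount of 0.5, and all function names are choices made for this example.

```python
import math

BASES = "ACGT"


def information_content(sites, pseudocount=0.5):
    """Sum over columns of relative entropy (bits) against a uniform background."""
    total = 0.0
    denom = len(sites) + pseudocount * len(BASES)
    for j in range(len(sites[0])):
        column = [site[j] for site in sites]
        for b in BASES:
            f = (column.count(b) + pseudocount) / denom
            total += f * math.log2(f / 0.25)
    return total


def greedy_consensus(sequences, width):
    """Return (score, sites): one substring per sequence, chosen greedily."""
    best = None
    first = sequences[0]
    for start in range(len(first) - width + 1):
        chosen = [first[start:start + width]]  # seed from the first sequence
        for seq in sequences[1:]:
            candidates = [seq[i:i + width] for i in range(len(seq) - width + 1)]
            # Try adding every substring and keep the one that scores best.
            chosen.append(
                max(candidates, key=lambda c: information_content(chosen + [c]))
            )
        score = information_content(chosen)
        if best is None or score > best[0]:
            best = (score, chosen)
    return best
```

On three short sequences that share the hexamer TAATCC and no other, the sketch recovers that hexamer from each sequence.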
41
The key: Scoring a motif
The current motif.
Scoring a motif
42
The key: Scoring a motif
Pos   1 2 3 4 5 6 7 8
A     1 1 9 9 0 0 0 1
C     6 0 0 0 0 9 8 7
G     1 0 0 0 1 0 0 1
T     1 8 0 0 8 0 1 0
The current motif.
Scoring a motif
  • Build a PWM
  • Compute information content of PWM
  • For each column, compute relative entropy
    relative to a background distribution
  • Sum over all columns
Key: align the sites of a motif, and score the
alignment
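The scoring steps above can be sketched as follows, using the count matrix from this slide. The uniform default background, the use of base-2 logarithms, and the function name are assumptions for illustration; the talk does not fix these details.

```python
import math

# Count matrix copied from the slide.
COUNTS = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}


def information_content(counts, background=None):
    """Sum over columns of the relative entropy (in bits) to the background."""
    bases = sorted(counts)
    if background is None:
        background = {b: 1.0 / len(bases) for b in bases}  # assumed uniform
    width = len(counts[bases[0]])
    total = 0.0
    for j in range(width):
        col_total = sum(counts[b][j] for b in bases)
        for b in bases:
            f = counts[b][j] / col_total
            if f > 0:  # a zero frequency contributes nothing to the sum
                total += f * math.log2(f / background[b])
    return total


IC = information_content(COUNTS)
```

A column in which all sites agree contributes 2 bits under a uniform background, and a column with equal counts for all four bases contributes 0.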
43
Extending CONSENSUS to multiple species
Final goal: Find a set of substrings, one in
each input sequence
44
Extending CONSENSUS to multiple species
Final goal: Find a set of profiles, one in
each set of orthologous input sequences
45
Extending CONSENSUS to multiple species
Profiles
46
Extending CONSENSUS to multiple species
Profiles
47
Extending CONSENSUS to multiple species
48
Aligning two profiles
  • Compare two profiles column by column
  • Each column of a profile is (nA, nC, nG, nT), or
    equivalently, (fA, fC, fG, fT)
  • Probabilistic score to capture whether two columns
    (n_bi, f_bi) and (n_bj, f_bj) are from the same
    distribution (and different from background)
  • ALLR: Avg. Log Likelihood Ratio

where p_b is the background frequency of base b
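For reference, the ALLR score of two columns i and j, as defined in the PhyloCon paper (Wang and Stormo 2003), is [sum_b n_bj ln(f_bi / p_b) + sum_b n_bi ln(f_bj / p_b)] / sum_b (n_bi + n_bj). A minimal sketch follows. The uniform pseudocount used to avoid taking the log of zero and the function name are assumptions added for this example.

```python
import math


def allr(col_i, col_j, background, pseudocount=0.5):
    """Average log likelihood ratio of two profile columns.

    col_i, col_j: dicts mapping base -> count (n_bi and n_bj).
    background:   dict mapping base -> background frequency p_b.
    """
    bases = sorted(background)

    def freqs(col):
        # Frequencies f_b with a small pseudocount so that no log argument is 0.
        total = sum(col[b] + pseudocount for b in bases)
        return {b: (col[b] + pseudocount) / total for b in bases}

    f_i, f_j = freqs(col_i), freqs(col_j)
    numerator = sum(
        col_j[b] * math.log(f_i[b] / background[b])
        + col_i[b] * math.log(f_j[b] / background[b])
        for b in bases
    )
    denominator = sum(col_i[b] + col_j[b] for b in bases)
    return numerator / denominator
```

Two columns that agree on a strongly preferred base score positive, while two columns that prefer different bases score negative.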
49
One cool feature of ALLR
  • Expected value is negative, which means very long
    profiles will not automatically give large ALLR
    scores
  • Therefore, can automatically detect the right
    motif length

50
PhyloCon features
  • One of the first algorithms to find motifs that
    are conserved across species and occur in
    multiple co-regulated gene promoters
  • Does not consider the evolutionary relationships
    among species (all species weighted equally)

51
PhyME (Sinha et al.): A method with alignments
and an evolutionary model
52
  • Input
  • A set of promoters with many matches to unknown
    motif W
  • For some promoters, sequence from other species
    also given
  • Output: The motif W

53
(No Transcript)
54
Step 1: Use alignment program (LAGAN) to find
ungapped blocks of conservation
55
For each promoter, maximize Pr(promoter,
orthologs | Model with motif as parameter)
Model: Hidden Markov Model
Find motif (parameter) that maximizes the
likelihood.
We'll study the model in detail this evening
56
A key component of the likelihood computation:
Pr(site s | motif W)
Evolutionarily unrelated sites
57
A key component of the likelihood computation:
Pr(site s | motif W)
Given by evolutionary model
58
Evolutionary model
  • Two species, sites s1 and s2 in a conserved block
  • Pr(s1, s2 | W)
  • Short time limit (t → 0)
  • a = s1 = s2
  • Pr(s1, s2 | W) = Pr(a | W)
  • Long time limit (t → ∞)
  • Pr(s1, s2 | W) = Pr(s1 | W) × Pr(s2 | W)
  • Interpolate between these two limits

59
Model of binding site evolution
  • Evolving binding site must bind the same protein
  • Pr(s1, s2 | W) = Σ_a Pr(a | W) Π_i Pr(s_i | a, W, t)
  • Can be generalized to more than two species
    (recursively)
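The sum over ancestral sites can be evaluated position by position when positions evolve independently. The sketch below assumes one simple substitution rule that reproduces both limits described earlier: each descendant base keeps the ancestral base with probability 1 - mu and is otherwise redrawn from the motif column, with mu standing in for elapsed time. That rule, the parameter name `mu`, and the function names are assumptions for illustration. The talk does not spell out the exact form of Pr(s_i | a, W, t).

```python
def site_prob(site, pwm):
    """Pr(site | W) for a PWM given as a list of {base: probability} columns."""
    p = 1.0
    for j, base in enumerate(site):
        p *= pwm[j][base]
    return p


def pair_prob(s1, s2, pwm, mu):
    """Pr(s1, s2 | W), summing over the unknown ancestral base at each position.

    mu = 0 corresponds to the short-time limit, mu = 1 to the long-time limit.
    """
    total = 1.0
    for j, (x, y) in enumerate(zip(s1, s2)):
        col = pwm[j]
        col_sum = 0.0
        for a, wa in col.items():
            # Assumed rule: keep ancestor a with prob 1 - mu, else redraw from W.
            px = (1 - mu) * (x == a) + mu * col[x]
            py = (1 - mu) * (y == a) + mu * col[y]
            col_sum += wa * px * py
        total *= col_sum
    return total
```

With `mu = 0` identical sites get probability Pr(a | W) and differing sites get 0; with `mu = 1` the result factors into Pr(s1 | W) × Pr(s2 | W).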

60
Training the motif
  • Given a motif, we can compute
  • Pr(promoter, orthologs | model with motif)
  • But we have to find the motif that maximizes this
    probability
  • Expectation-maximization algorithm
  • Local search, not guaranteed to find global
    maximum
  • More on E-M in evening lecture

61
PhyloGibbs (Siddharthan et al.)
  • Problem formulation very similar to PhyME
    (alignments, evolutionary model)
  • Gibbs sampling approach to find motif
  • A special MCMC strategy
  • E-M (PhyME) prone to local optima
  • Can find multiple motifs simultaneously

62
PhyME & PhyloGibbs
  • Algorithms that consider the phylogenetic tree
    relating the species
  • Another algorithm of the same genre: MONKEY (Moses
    et al. 2004)
  • Allow binding sites to occur in conserved
    (aligned) as well as unconserved regions
  • Designed to find motifs in sets of co-regulated
    genes (and their orthologs)
  • Not designed to find motifs from whole-genome
    analysis

63
MCS (Kellis lab): Genome-wide motif finding from
multiple species
64
Algorithm
  • Align the genomes of four mammalian species
  • human, mouse, rat, dog
  • Focus on all promoter regions and 3' UTRs
  • For every possible motif (consensus string model)
  • Count the number of occurrences in the human
    genome
  • Count how many of these are completely conserved
    in all four species (obvious from alignment)
  • Evaluate statistical significance
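The two counting steps can be sketched as below for an exact-match motif on a toy alignment. The dictionary layout, the species key `"human"` as reference, and the function name are assumptions for illustration. The sketch also assumes the aligned rows have equal length and ignores occurrences interrupted by gap characters, which a real pipeline would need to handle.

```python
def count_conserved(motif, aligned, reference="human"):
    """Return (n, k): occurrences in the reference and fully conserved ones.

    aligned: dict mapping species name -> aligned sequence (equal lengths).
    An occurrence is conserved if every species has the same motif at the
    same alignment columns.
    """
    ref = aligned[reference]
    width = len(motif)
    n = k = 0
    for i in range(len(ref) - width + 1):
        if ref[i:i + width] == motif:
            n += 1
            if all(seq[i:i + width] == motif for seq in aligned.values()):
                k += 1
    return n, k
```

For a degenerate consensus string, the equality tests would be replaced by a match against the allowed bases at each position.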

65
Statistical Significance
  • k = # of conserved occurrences of motif
  • n = # of occurrences of motif
  • p = k/n: conservation rate of motif
  • For 100 random motifs of the same type, compute
    average conservation rate p0
  • Compute
  • n occurrences, p0 rate of being conserved:
    significance of k conserved occurrences?
  • Exact p-value: Binomial(n, p0, k)
  • Binomial mean np0, variance np0(1-p0)
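From the binomial mean and variance listed above, the natural z score is z = (k - n·p0) / sqrt(n·p0·(1 - p0)). The sketch below computes that z score and, for comparison, the exact upper-tail binomial probability. Reading "Binomial(n, p0, k)" as the probability of at least k conserved occurrences is an assumption, as are the function names.

```python
import math


def z_score(k, n, p0):
    """Standardized excess of conserved occurrences over the binomial mean."""
    mean = n * p0
    sd = math.sqrt(n * p0 * (1 - p0))
    return (k - mean) / sd


def binomial_tail(k, n, p0):
    """Exact P[X >= k] for X ~ Binomial(n, p0)."""
    return sum(
        math.comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1)
    )
```

For example, with n = 100 occurrences and p0 = 0.5, observing k = 60 conserved occurrences is two standard deviations above the mean.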

66
MCS score
  • The z score is called the MCS score
  • Output all motifs with MCS > 6
  • Post-process this list of motifs, to remove
    similar looking motifs (clustering)
  • A final list of 174 motifs from promoters
  • 69 match known motifs
  • 105 potential new regulatory motifs

67
Conclusion
  • Comparative genomics has infused new life into
    the motif-finding community
  • A variety of algorithms geared towards various
    assumptions
  • Footprinter: no alignments (2000)
  • PhyloCon: alignments, but no tree (2003)
  • PhyME: alignments, tree, and evolutionary model
    (2004)
  • MCS: genome-wide motif discovery from very
    closely related species (2005)

68
QUESTIONS ?
69
References: single species motif finding
  1. Timothy L. Bailey and Charles Elkan,
    "Unsupervised Learning of Multiple Motifs in
    Biopolymers using EM", Machine Learning,
    21(1-2):51-80, October 1995.
  2. Lawrence, C. E., S. F. Altschul, M. S. Boguski,
    J. S. Liu, A. F. Neuwald, and J. C. Wootton
    (1993, October). Detecting subtle sequence
    signals: a Gibbs sampling strategy for multiple
    alignment. Science 262, 208-214.
  3. Hertz GZ, Hartzell GW 3rd, Stormo GD.
    Identification of consensus patterns in
    unaligned DNA sequences known to be functionally
    related. CABIOS (now Bioinformatics) 1990;
    6(2):81-92.
  4. van Helden, J., Andre, B. and Collado-Vides, J.
    (1998) Extracting regulatory sites from the
    upstream region of yeast genes by computational
    analysis of oligonucleotide frequencies. J. Mol.
    Biol., 281, 827-842.
  5. Sinha, S. and Tompa, M., A statistical method for
    finding transcription factor binding sites, Proc.
    Int. Conf. Intell. Syst. Mol. Biol., 8:344-354,
    2000.

70
References: multiple species motif finding
  1. Blanchette, M., Schwikowski, B. and Tompa, M.
    (2000). An exact algorithm to identify motifs in
    orthologous sequences from multiple
    species. Proceedings of the Eighth International
    Conference on Intelligent Systems for Molecular
    Biology (ISMB 2000), pp. 37-45.
  2. Wang T, Stormo GD. Combining phylogenetic data
    with co-regulated genes to identify regulatory
    motifs. Bioinformatics. 2003 Dec
    12;19(18):2369-80.
  3. Sinha S, Blanchette M, Tompa M. PhyME: a
    probabilistic algorithm for finding motifs in
    sets of orthologous sequences. BMC Bioinformatics.
    2004 Oct 28;5:170.
  4. Siddharthan R, Siggia ED, van Nimwegen E.
    PhyloGibbs: a Gibbs sampling motif finder that
    incorporates phylogeny. PLoS Comput Biol. 2005
    Dec;1(7):e67. Epub 2005 Dec 9.
  5. Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen
    MB. MONKEY: identifying conserved
    transcription-factor binding sites in multiple
    alignments using a binding site-specific
    evolutionary model. Genome Biol. 2004;5(12):R98.
    Epub 2004 Nov 30.
  6. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V,
    Lindblad-Toh K, Lander ES, Kellis M. Systematic
    discovery of regulatory motifs in human promoters
    and 3' UTRs by comparison of several
    mammals. Nature. 2005 Mar 17;434(7031):338-45.
    Epub 2005 Feb 27.