Motif Discovery in Heterogeneous Sequence Data

1
Motif Discovery in Heterogeneous Sequence Data
  • Amol Prakash, U. Washington, Seattle
  • Mathieu Blanchette, McGill U., Montreal
  • Saurabh Sinha, Rockefeller U., New York
  • Martin Tompa, U. Washington, Seattle
2
Outline
  • What is a motif?
  • Homogeneous vs. Heterogeneous
  • What makes our approach unique
  • Algorithm description
  • Results
  • Conclusion

3
Predicting Regulatory Elements
CAGTGTTAGTCTCGACGTGAGTGGTATGAACTGGAGTTTTAGTATGATGG
TCGTACAGTGTTTCGACATGGGAAG
  • Functionally important binding site for a
    protein that regulates gene expression
  • Near the gene
  • Short: typically 6-20 nucleotides
  • How can you possibly predict them?

4
Homogeneous Sequence Data I
  • Input: DNA sequences near co-regulated genes from
    a single organism
  • Tools: MEME, Consensus, Gibbs sampler,
    Projection, YMF, and many others.

CAR2:
AGTCTCGACGTGAGTTTGCCTTAGGTGGTAGTTTTAAACA
GTCTCGACTAGTCTCGATCGTACAGTGTTTAGTCTTTCGACATG
ARG5,6:
TTTTTTCCATTAGGTGGAGTTTTTTAGGTCTCGACAGTCTCGACTC
GTTAGTCTCGAATACAGTTTAGTCTCGAGTTTCGACATG
CAR1:
TCTCGACAGTTTTCACTTAGCGTTTTATCTCGAGACGTGAGTATGCCATT
AGCTGGACATG
5
Homogeneous Sequence Data II
  • Input: DNA sequences near orthologous genes
  • Tools:
  • Multiple alignment (ClustalW, etc.), then find
    highly conserved aligned regions
  • FootPrinter

CCTTGGACCAAGTCCAGCACCCTCGGGGTCGAGGAAAACAGGTAGGGTAT
AAAAAGGGCATGCAAGGACCTGCAGCCAAGCTTGCAGGTAGGGTATAA
AAAGGGCACGCAAGGGACCCCAAAAAAAGAAACTGCTCAGAGTCCTGTGG
ACAGATCACTGCTTGGCAAGAAGTGATAGATGGGGCCAGGGTATAAAA
AGGGCCCAACTCCCCGAACCACTCAGGGTCCTGTGGACAGCTCACCTAGC
TGCAAGAGGGCCCCAAAGCGCTCAGGGTCCTGTGGACAAGGGACCAGG
GTATAAAGAGGGCCCGCACAGCTGGCTCACCCCGGCTGCG
6
Heterogeneous Sequence Data
  • Co-regulated genes from one species, and their
    orthologs from other species.

Rat     Mouse   Human
g1.rn   g1.mm   g1.hs
g2.rn   g2.mm   g2.hs
g3.rn   g3.mm   g3.hs
g4.rn   g4.mm   g4.hs
7
Heterogeneous Data: Approach 1
  • Pool everything together
  • Search for statistical overrepresentation

Pooled sequences:
g3.mm  g2.hs  g1.mm  g2.rn  g1.rn  g4.hs
g4.rn  g4.mm  g1.hs  g3.hs  g3.rn  g2.mm
Gelfand et al. 2000, McGuire et al. 2000
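
As a rough illustration of the pooling idea, the sketch below counts, for every k-mer, how many of the pooled sequences contain it. The function name `kmer_sequence_support`, the toy sequences, and the choice of k are assumptions made for this example. The cited methods use statistical models of overrepresentation, not raw counts like these.

```python
from collections import Counter

def kmer_sequence_support(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # Use a set so a k-mer repeated within one sequence counts only once.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return support

# Toy pooled input (hypothetical sequences, not taken from the slides).
pooled = ["ACGTTCTCGAC", "GGTCTCGACTT", "TCTCGACAAAC", "AAAAAAAAAAA"]
support = kmer_sequence_support(pooled, k=7)
best_kmer, best_count = support.most_common(1)[0]
print(best_kmer, best_count)
```

Because everything is pooled, a k-mer shared by three orthologs of one gene scores the same as one shared by three different co-regulated genes, which is the distinction the later approaches try to make.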
8
Heterogeneous Data: Approach 2
  • Filter: well-conserved orthologous regions
  • Search: for overrepresentation in one species

g1.rn   g1.mm   g1.hs
g2.rn   g2.mm   g2.hs
g3.rn   g3.mm   g3.hs
g4.rn   g4.mm   g4.hs
Wasserman et al. 2000, Kellis et al. 2003,
Cliften et al. 2003, Wang and Stormo 2003
9
Heterogeneous Data: Approach 3
  • Filter: overrepresentation in co-regulated
    regions
  • Search: for well-conserved orthologous regions

g1.mm   g1.rn   g1.hs
g2.mm   g2.rn   g2.hs
g3.mm   g3.rn   g3.hs
g4.mm   g4.rn   g4.hs
GuhaThakurta et al. 2002
10
OrthoMEME: Our Approach
  • An integrated approach: no filtering step
  • Treats orthology and co-regulation differently.
  • Based on Expectation-Maximization
  • Does not use global alignment, which can fail on
    diverged sequences.
  • Focuses on the two-species case

11
OrthoMEME Algorithm
  • Maximization of Expected Likelihood
  • Model:
  • Like MEME, uses a profile to model the motifs in
    one genome
  • A second, phylogenetic profile models the motifs in
    orthologous regions.
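
To make the expectation step concrete, here is a minimal sketch of a generic MEME-style E-step for a single sequence, assuming exactly one motif occurrence per sequence. It is not OrthoMEME's actual update, which would also involve the phylogenetic profile. The function name `motif_position_posteriors`, the toy profile, and the uniform background are assumptions made for this example.

```python
def motif_position_posteriors(seq, profile, background):
    """Posterior probability of each motif start position in one sequence.

    profile: list of dicts, one per motif position, mapping base -> probability.
    background: dict mapping base -> background probability.
    Assumes exactly one motif occurrence and a uniform prior over positions.
    """
    width = len(profile)
    ratios = []
    for start in range(len(seq) - width + 1):
        # Likelihood ratio of "motif starts here" versus background.
        ratio = 1.0
        for j in range(width):
            base = seq[start + j]
            ratio *= profile[j][base] / background[base]
        ratios.append(ratio)
    total = sum(ratios)
    return [r / total for r in ratios]

# Toy two-position profile favouring "AC" (hypothetical values).
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
posteriors = motif_position_posteriors("GGACTT", profile, background)
best_start = max(range(len(posteriors)), key=posteriors.__getitem__)
print(best_start, round(posteriors[best_start], 3))
```

In a full EM loop, these posteriors would weight the base counts used to re-estimate the profile in the maximization step.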

12
OrthoMEME Profile
Profile (one motif position):
A     C     G  T
0.75  0.25  0  0

Sequences (Rat, Human)   Base
g1.rn, g1.hs             A...
g2.rn, g2.hs             C
g3.rn, g3.hs             A
g4.rn, g4.hs             A
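
The profile column on this slide (0.75 0.25 0 0) matches the relative frequencies of the four bases listed (A, C, A, A). The sketch below reproduces that arithmetic. The function name `profile_column` is an assumption made for this example, and no pseudocounts are applied, although a real implementation would probably use them.

```python
from collections import Counter

ALPHABET = "ACGT"

def profile_column(bases):
    """Relative frequency of each base at one motif position (no pseudocounts)."""
    counts = Counter(bases)
    total = len(bases)
    return [counts[b] / total for b in ALPHABET]

# The four bases shown on the slide for g1..g4 at this position.
column = profile_column(["A", "C", "A", "A"])
print(column)  # [0.75, 0.25, 0.0, 0.0]
```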
13
Phylogenetic Profile
Phylogenetic profile:
     A     C     G  T
A    0.67  0.33  0  0
C    0     1     0  0
G    0     0     0  0
T    0     0     0  0

Profile (one motif position):
A     C     G  T
0.75  0.25  0  0

Sequences (Rat, Human)   Bases
g1.rn, g1.hs             A..., A...
g2.rn, g2.hs             C, C
g3.rn, g3.hs             C, A
g4.rn, g4.hs             A, A
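
The non-zero rows of the phylogenetic profile are consistent with conditional frequencies computed from the four base pairs on this slide: among the three pairs whose profiled base is A, the orthologous base is A twice and C once (0.67, 0.33), and the single C pairs with C. The sketch below reproduces this under stated assumptions. The transcript does not say which species is conditioned on, so each pair is written as (profiled base, orthologous base), and the function name `phylogenetic_profile` is hypothetical. Rows with no observations are left at zero, as on the slide.

```python
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def phylogenetic_profile(pairs):
    """Conditional frequency of the orthologous base given the profiled base."""
    counts = defaultdict(Counter)
    for base, ortholog_base in pairs:
        counts[base][ortholog_base] += 1
    table = {}
    for base in ALPHABET:
        total = sum(counts[base].values())
        # Rows with no observations stay at zero (no pseudocounts).
        table[base] = [counts[base][o] / total if total else 0.0 for o in ALPHABET]
    return table

# (profiled base, orthologous base) for g1..g4; the assignment of species
# to each side of the pair is an assumption.
pairs = [("A", "A"), ("C", "C"), ("A", "C"), ("A", "A")]
table = phylogenetic_profile(pairs)
for base in ALPHABET:
    print(base, [round(p, 2) for p in table[base]])
```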
14
Experimental Results
  • Implemented and tested on various pairs of
    species
  • Compared to MEME
  • run on single-species data
  • with the same parameters
  • Results from the top 3 motifs are reported.

15
Result 1: Mammals
  • SRF motif
  • OrthoMEME missed 2 occurrences
  • MEME found none

16
Result 2: Yeast
  • HAP2/HAP3/HAP4 motif
  • OrthoMEME missed 2 occurrences
  • MEME missed 4 occurrences

17
Result 3: Worm
  • DAF-19 motif
  • OrthoMEME missed no occurrences
  • MEME missed no occurrences

18
Conclusion
  • First integrated algorithm to handle
    heterogeneous sequence data.
  • Current focus is on the two-species case
  • Future work: improve the algorithm for multiple species.
  • More experiments will help us improve the
    tool/parameters.