Title: cisGreedy Motif Finder for Cistematic
1cisGreedy Motif Finder for Cistematic
- Sarah Aerni
- Mentors: Ali Mortazavi, Barbara Wold
2cisGreedy
- De novo motif finder that implements a greedy algorithm similar to the Consensus motif finder
- Goal: to provide an efficient algorithm, included in the Cistematic package, that performs similarly to Consensus and MEME
3Cistematic
- Integrates visualization and refinement of motifs, and improves the performance of multiple motif finders, in a single package
(Mortazavi, 2006)
4Cistematic
- cisGreedy becomes part of Bottom Tier
- Motif finder would be included in the Cistematic
package (prevents need for complicated
installations)
Image: Ali Mortazavi
5What is a Motif?
- cis-Regulatory elements
- Transcription Factor Binding Sites (TFBS)
- Binding by transcription factors may increase or
decrease transcription of genes
6What is a Motif?
- GAL4 in yeast
- Activator of galactose-induced genes (convert galactose to glucose)
- Protein structure determines motif
- DNA-protein interactions require certain bases at specified locations
- Motif reflects homodimer structure
7What is a Motif?
- cis-Regulatory elements
- Transcription Factor Binding Sites (TFBS)
- Binding by transcription factors may increase or decrease transcription of genes
- Gene regulation is believed to be a major source of complexity
- Plants may have more genes or larger genomes than humans. Are they more complex?
- Identification of cis-regulatory elements will help us understand gene regulatory networks (the bigger picture)
8How do we find motifs?
- Hard to identify
  - Relatively short sequences (as small as 6 bases)
  - Many positions not well conserved
- Factors improving identification
  - Usually localized within a certain proximity of a gene (search within 3 kb upstream)
  - Some positions highly conserved
  - Use other data (microarray?)
9Motif Finders
- Greedy
- Maximizes similarity of motifs from sequences through a greedy approach
- Eliminates background modeling by using Cistematic package preprocessing steps
- Improves speed
- Prevents false negatives
- Implements multiple models (ZOOPS, OOPS, TCM)
10Consensus Scoring
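- Uses an equation similar to a log likelihood, called Information Content:
$$I = \sum_{j=1}^{L} \sum_{i \in A} f_{ij} \ln \frac{f_{ij}}{p_i}$$
$L$: number of columns in the matrix
$A = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$: the alphabet
$f_{ij}$: frequency of each letter $i$ at each position $j$
$p_i$: a priori probability of letter $i$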
Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15.7-8 (1999): 563-577.
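As an illustration of the Information Content score described on this slide, here is a minimal Python sketch. The function name `information_content`, the uniform default prior, and the convention that a zero frequency contributes nothing to the sum are assumptions made for the example; they are not taken from Consensus or Cistematic.

```python
import math
from collections import Counter

def information_content(windows, background=None):
    """Sum over columns j and letters i of f_ij * ln(f_ij / p_i).

    `windows` is a list of equal-length aligned strings over A, C, G, T.
    `background` maps each letter to its a priori probability p_i.
    """
    alphabet = "ACGT"
    if background is None:
        # Assumption for this sketch: a uniform prior over the four bases.
        background = {base: 0.25 for base in alphabet}
    n = len(windows)
    total = 0.0
    for column in zip(*windows):          # one column per motif position j
        counts = Counter(column)
        for base in alphabet:
            f = counts[base] / n          # observed frequency f_ij
            if f > 0:                     # treat 0 * ln(0) as 0
                total += f * math.log(f / background[base])
    return total
```

With a uniform prior, a perfectly conserved column contributes ln 4 and a column in which all four bases are equally frequent contributes 0.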
11Removing Background
- Goal of a background model: differentiate noise from signal
- Issues with background
- What background should be used?
- Whole genome? Conserved regions?
- Selective pressures maintain conserved regions
- Arguably, searching in conserved regions guarantees there is little noise (it has been maintained)
- Solution
- Search in conserved regions
- Use simple repeat masking
- Sequences which reoccur are likely TFBS
12cisGreedy scoring
- Scoring focuses on maximizing the number of identical bases
- Percent identity is dependent on the number of deviations from the strict consensus
- Background adds complexity that may lead to false negatives
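The relationship between percent identity and deviations from the strict consensus can be sketched as follows. This is an illustrative reading of the slide, not cisGreedy's actual code: the helper names `strict_consensus` and `percent_identity` are hypothetical, and the strict consensus is assumed to be the most frequent base in each column.

```python
from collections import Counter

def strict_consensus(windows):
    """Most frequent base in each column of the aligned windows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*windows))

def percent_identity(windows):
    """Percentage of bases that agree with the strict consensus."""
    consensus = strict_consensus(windows)
    total = len(windows) * len(consensus)
    matches = sum(
        base == cons
        for window in windows
        for base, cons in zip(window, consensus)
    )
    return 100.0 * matches / total
```

Each deviation from the consensus lowers the result by 100 / (number of windows × motif width) percentage points.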
13cisGreedy
- Input sequences are analyzed
- Randomly select 2 sequences to be compared
14cisGreedy
- The two selected sequences are analyzed
independently of the remaining sequences
15cisGreedy
- The two selected sequences are analyzed independently of the remaining sequences
- Windows of motif size are scanned starting at the beginning of each sequence
16cisGreedy
- Sequences are scanned in an attempt to locate the highest scoring alignment
- Alignments are ungapped
- Score is established as the number of sequences containing the most frequently occurring base at each position
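The scan over the first two sequences can be sketched in Python as below. This is a minimal sketch under stated assumptions: the per-position counts are summed into one alignment score, every pair of windows is compared exhaustively, and ties keep the first pair found. The names `alignment_score`, `windows_of` and `best_pair` are hypothetical and do not come from the cisGreedy source.

```python
from collections import Counter
from itertools import product

def alignment_score(windows):
    """Sum, over positions, of the count of the most frequent base."""
    return sum(max(Counter(col).values()) for col in zip(*windows))

def windows_of(seq, width):
    """Yield (start, window) for every ungapped window of the motif width."""
    for start in range(len(seq) - width + 1):
        yield start, seq[start:start + width]

def best_pair(seq_a, seq_b, width):
    """Return (score, start_a, start_b) for the top-scoring window pair."""
    best = None
    for (i, wa), (j, wb) in product(windows_of(seq_a, width),
                                    windows_of(seq_b, width)):
        score = alignment_score([wa, wb])
        if best is None or score > best[0]:   # ties keep the first pair seen
            best = (score, i, j)
    return best
```

For example, `best_pair("GGGACGTACGG", "TTACGTACTTT", 6)` locates the shared window `ACGTAC` at offsets 3 and 2.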
17cisGreedy
- Reverse Complements are analyzed (user specified)
- Once start locations are established with a top
alignment score, these are left unchanged (Greedy)
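A small sketch of how reverse complements might be included in the scan when the user requests them. The helper names and the `use_reverse` flag are hypothetical stand-ins for the user-specified option mentioned on the slide.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def oriented_windows(seq, width, use_reverse=True):
    """Yield (start, strand, window); strand is '+' or '-'.

    When use_reverse is True, each window is also offered as its
    reverse complement so that either strand can be aligned.
    """
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield start, "+", window
        if use_reverse:
            yield start, "-", reverse_complement(window)
```

As a check, `reverse_complement("TTTGAGGAAGCCAATTT")` gives `AAATTGGCTTCCTCAAA`.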
18cisGreedy
- Select an additional sequence in which to identify the location of the motif
- Windows in the additional sequence are aligned to previously established windows (Greedy)
19cisGreedy
- Additional sequence scanned as before, including the reverse complement (user specified)
- Alignment score established as before
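The greedy extension step can be sketched as follows: windows already chosen stay fixed, and each additional sequence contributes the one window that maximizes the alignment score. This sketch assumes forward-strand scanning only, a summed per-position score, and sequences at least as long as the motif width. The function names are hypothetical.

```python
from collections import Counter

def alignment_score(windows):
    """Sum, over positions, of the count of the most frequent base."""
    return sum(max(Counter(col).values()) for col in zip(*windows))

def add_sequence(fixed_windows, new_seq, width):
    """Return (score, start, window) for the best window in new_seq.

    The previously established windows are never moved (the greedy choice).
    """
    best = None
    for start in range(len(new_seq) - width + 1):
        candidate = new_seq[start:start + width]
        score = alignment_score(fixed_windows + [candidate])
        if best is None or score > best[0]:
            best = (score, start, candidate)
    return best

def greedy_extend(seed_windows, remaining_seqs, width):
    """Add one window per remaining sequence, in the order given."""
    aligned = list(seed_windows)
    starts = []
    for seq in remaining_seqs:
        _, start, window = add_sequence(aligned, seq, width)
        aligned.append(window)
        starts.append(start)
    return aligned, starts
```

Because earlier choices are never revisited, the result depends on the order in which sequences are added.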
20cisGreedy
- Final motif locations are used to build position specific frequency matrices (PSFM)
- Reverse complement sequence used in building the PSFM if used
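Building a position specific frequency matrix from the final windows can be sketched like this. The representation (a list of per-position dictionaries) and the names `build_psfm` and `consensus` are choices made for the example, not the Cistematic data structures. Windows taken from the reverse strand are assumed to have been reverse-complemented already.

```python
def build_psfm(windows, alphabet="ACGT"):
    """Return one {base: frequency} dict per motif position."""
    n = len(windows)
    psfm = []
    for column in zip(*windows):      # column is a tuple of bases
        psfm.append({base: column.count(base) / n for base in alphabet})
    return psfm

def consensus(psfm):
    """Most frequent base at each position of the matrix."""
    return "".join(max(col, key=col.get) for col in psfm)
```

Each column's frequencies sum to 1 as long as the windows contain only letters from the alphabet.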
21Testing cis-Greedy
- AIY
- 16 bp cis-regulatory motif drives expression
- Experimentally verified
- Gene battery consists of a set of genes bound by AIY
- Orthologous genes contain highly specified binding sites
- Individual binding sites of battery genes within a single species can vary considerably (Wenick and Hobert 759)
22Cistematic Results for AIY
[Figure: regions of conservation in orthologous genes; hen-1]
23Results for AIY
- AIY (identified): AAATTGGCTTCCTCAAA
- cisGreedy: TTTGAGGAAGCCAATTT
- cisGreedy (reverse complement): AAATTGGCTTCCTCAAA
- MEME: AAATTGGCTTCCTCAAA
AIY battery consensus
24Cistematic Results for AIY
[Figure: hen-1]
25Results for AIY
[Figure: hen-1]
26Tompa Bakeoff
- 3 benchmark datasets
  - Real
  - Markov Chain
  - Generic
- 4 organisms
  - Human
  - Mouse
  - Fruitfly
  - Yeast
- Each dataset contains 0-1 motifs
- Each sequence can have 0 or multiple motifs
- Report 0-1 motif per dataset and the locations of the motifs
- Use statistical tools provided by the bakeoff to analyze runs
27Bakeoff example (hm03)
- Identify the most reasonable motif in each dataset independently
28Real
Interesting pattern appears between 3 of 10
sequences
29Markov
30Generic
31Bakeoff example (hm03)
- Identify the most reasonable motif in each dataset independently
- Determine which motif appears most reasonable across the 3 benchmarks and map the motif in the sequences using Cistematic
- Compare results to the actual locations (provided in the bakeoff package)
32Solution
33Real
34Solution
35Markov
36Bakeoff results
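- Correlation Coefficient
  - $\mathit{nCC} = \dfrac{\mathit{nTP} \cdot \mathit{nTN} - \mathit{nFN} \cdot \mathit{nFP}}{\sqrt{(\mathit{nTP} + \mathit{nFN})(\mathit{nTN} + \mathit{nFP})(\mathit{nTP} + \mathit{nFP})(\mathit{nTN} + \mathit{nFN})}}$
- Sensitivity (fraction of known sites that are predicted)
  - $\mathit{sSn} = \dfrac{\mathit{sTP}}{\mathit{sTP} + \mathit{sFN}}$
- Positive Predictive Value (fraction of predicted sites that are known)
  - $\mathit{sPPV} = \dfrac{\mathit{sTP}}{\mathit{sTP} + \mathit{sFP}}$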
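The three statistics on this slide can be computed from raw counts as in the sketch below. The function names are hypothetical, and returning `None` when a denominator is zero is an assumption made here to mark the statistic as undefined. The bakeoff's own tools may handle that case differently.

```python
import math

def ncc(ntp, ntn, nfp, nfn):
    """Nucleotide-level correlation coefficient from TP/TN/FP/FN counts."""
    denom = math.sqrt((ntp + nfn) * (ntn + nfp) * (ntp + nfp) * (ntn + nfn))
    if denom == 0:
        return None                      # undefined for degenerate counts
    return (ntp * ntn - nfn * nfp) / denom

def sensitivity(stp, sfn):
    """Fraction of known sites that are predicted."""
    return stp / (stp + sfn) if (stp + sfn) else None

def ppv(stp, sfp):
    """Fraction of predicted sites that are known."""
    return stp / (stp + sfp) if (stp + sfp) else None
```

A perfect prediction gives nCC = 1 and a fully inverted prediction gives nCC = -1. The counts used to exercise these functions are illustrative only and are not bakeoff results.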
37Bakeoff results
- cisGreedy overall 7th best performer (excluding those with no data)
- Overall top performer in fly
- Worst performer in yeast
- 3rd worst performer in mouse
- 4th best performer in human
38Bakeoff results
Adapted from Tompa, 2005
When running programs in parallel, correlation of
motif finder results to true binding sites
improves
39Future goals
- Complete analysis of results for cisGreedy using benchmarks established by the Tompa paper (Nature Biotech, 2005)
- Document results and algorithm development
- Continue improving cisGreedy
40References
- Bioalgorithms.info
- Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
- Hertz, Gerald Z., and Gary D. Stormo. "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15.7-8 (1999): 563-577.
- Tompa, Martin, et al. "Assessing computational tools for the discovery of transcription factor binding sites." Nature Biotechnology 23 (January 2005): 137-144.
- Wenick, Adam S., and Oliver Hobert. "Genomic cis-Regulatory Architecture and trans-Acting Regulators of a Single Interneuron-Specific Gene Battery in C. elegans." Developmental Cell 6 (2004): 757-770.
- http://cistematic.caltech.edu
41Acknowledgements
- Ali Mortazavi
- Barbara Wold
- Wold Lab funding provided by DOE and NASA
- Additional funding by NSF and NIH
- SoCalBSI faculty, staff and fellow students