Title: Motif Discovery: Algorithm and Application
1 Motif Discovery: Algorithm and Application
- Dan Scanfeld
- Hong Xue
- Sumeet Gupta
- Varun Aggarwal
2 Objective: Motif discovery and use for deriving biological information
- Get sequences bound and unbound by the TF Nanog in human ES cells
- Find a motif using a motif-finding algorithm
- Genome-wide functional analysis using the motif to find biological patterns
3 Why Nanog: Relevance to ES Cells
- Activates certain genes essential for cell growth
- Represses a key set of genes needed for an embryo to develop.
- This key set of repressed genes activates entire networks for generating many different specialized cells and tissues.
1 genome, 1 cell
>200 phenotypes, 10^13 cells
5 Location Analysis (ChIP-chip) in Human ES Cells (Boyer et al., Cell 122: 947-956)
- Crosslink
- Fragment
- Enrich for Nanog
- Differentially label
- Array: 44k 10 Set (Agilent)
6 ChIP-chip Data Analysis
- Obtain intensities using GenePix
- Negative control-subtracted
- Perform median normalization
- Set-normalized
- Sequences (500 bp), May 2004 genome release
[Figure: IP signal plotted against WCE signal]
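The normalization step can be illustrated with a minimal sketch. It assumes per-probe IP and whole-cell-extract (WCE) intensities are already available as lists. The function name, the background arguments, and the `floor` guard are hypothetical, and this is a simplified stand-in rather than the error model actually used for the slides.

```python
import math
from statistics import median


def median_normalized_log_ratios(ip, wce, ip_background=0.0,
                                 wce_background=0.0, floor=1.0):
    """Return per-probe log2(IP / WCE) after background subtraction
    and median scaling of the IP channel to the WCE channel."""
    # Subtract the negative-control (background) signal; clamp at `floor`
    # so that the logarithm below is always defined (assumption).
    ip_sub = [max(x - ip_background, floor) for x in ip]
    wce_sub = [max(x - wce_background, floor) for x in wce]
    # Median normalization: rescale IP so both channels share a median.
    scale = median(wce_sub) / median(ip_sub)
    return [math.log2(i * scale / w) for i, w in zip(ip_sub, wce_sub)]


ratios = median_normalized_log_ratios([200, 400, 1600], [100, 200, 200])
```

Probes with a large positive log ratio after normalization would be the candidates for the bound set.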
7 Objective: Motif discovery and use for deriving biological information
- Get sequences bound and unbound by the TF Nanog in human ES cells
- Find a motif (Nanog) using a motif-finding algorithm (state of the art)
- Genome-wide functional analysis using the motif to find biological patterns
8 Motif Finding Algorithm (MacIsaac et al., 2006)
- Use a structural prior (database, MacIsaac et al.)
- Refinement: Expectation-Maximization (ZOOPS)
- Score of found motifs: classification on unseen data
- Significance testing on score: use of empirical p-value
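An empirical p-value compares the observed motif score with scores obtained under a null model. The sketch below assumes the null scores have already been generated (how they are produced is not specified in the slides) and uses the common add-one correction, which is also an assumption. The function name is hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score,
    with an add-one correction so the estimate is never exactly zero."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


# One of four null scores reaches the observed score of 0.8.
p = empirical_p_value(0.8, [0.5, 0.6, 0.9, 0.55])
```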
9 Refinement: Expectation-Maximization
- Differences from EM in Lab 1
- Use of structural prior (beta: strength of prior)
- ZOOPS (Zero or One Occurrence per Sequence) model
- 5th-order Markov model for background, trained over unbound sequences
- SVM for hypothesis testing
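A higher-order Markov background model conditions each base on the bases that precede it. The following is a minimal sketch, not the MATLAB implementation used for the slides. The function names, the pseudocount, and the uniform fallback for unseen contexts and for the first few bases are all assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_background(sequences, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from unbound sequences."""
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1.0
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {b: (base_counts[b] + pseudocount) / total
                          for b in ALPHABET}
    return model


def background_log_prob(seq, model, order=5):
    """Log-probability of `seq` under the background model."""
    uniform = 1.0 / len(ALPHABET)
    seq = seq.upper()
    logp = 0.0
    for i in range(len(seq)):
        context = seq[i - order:i] if i >= order else None
        probs = model.get(context) if context is not None else None
        # Fall back to a uniform distribution where no estimate exists.
        known = probs is not None and seq[i] in probs
        logp += math.log(probs[seq[i]] if known else uniform)
    return logp
```

With `order=5` there are up to 4^5 = 1024 contexts, so the pseudocount matters when the unbound training set is small.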
10 ZOOPS Model (Bailey & Elkan, 1994)
- $B$: background model; $M$: motif model
- $\lambda$: fraction of bound sequences (mixture model parameter)
- Sequences are drawn from the mixture distribution
- $P(S) = \lambda\, P(S \mid M) + (1-\lambda)\, P(S \mid B)$
- Hidden variable for EM: $Z_{ij} = 1$ if position $j$ in sequence $i$ is bound by the TF, $0$ otherwise
- E-step
- $\mathrm{Prob}(Z_{ij}=1) = \dfrac{\lambda\, P(S_i \text{ bound at } j \mid M)}{(1-\lambda)\, P(S_i \mid B) + \lambda \sum_{j'} P(S_i \text{ bound at } j' \mid M)}$
- This is the posterior $P(M \text{ bound at } j \mid S_i)$; the denominator is $P(S_i)$.
- M-step (same as before)
- Updating $M$ (motif model): for each position $p$ of the motif model and each base $b$ (A, C, G or T)
- $\mathrm{Base}_{ip}$: base at position $p$ of the $i$th sequence
- $\mathrm{PWM}(p,b) \propto \sum_i \sum_j \mathrm{Prob}(Z_{i,\,j-p+1}=1)\,[\mathrm{Base}_{ij} = b] + \text{pseudocounts}$, then normalize over $b$
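One ZOOPS EM iteration can be sketched as follows. This is a simplified illustration under several assumptions: a zero-order background `bg` replaces the 5th-order Markov model, the structural prior is omitted, every start position is given a uniform prior of 1/(number of sites), and the mixture weight `lam` (the fraction of bound sequences) is re-estimated each step. Dividing the numerator and denominator of the E-step by P(S_i | B) lets the code work with likelihood ratios, which avoids underflow on long sequences. All names are hypothetical.

```python
ALPHABET = "ACGT"


def zoops_em_step(seqs, pwm, bg, lam, pseudocount=0.1):
    """One EM iteration; `pwm` is a list of {base: prob} columns."""
    width = len(pwm)
    counts = [dict.fromkeys(ALPHABET, pseudocount) for _ in range(width)]
    bound_mass = 0.0
    for seq in seqs:
        n_sites = len(seq) - width + 1
        # E-step: ratio of "bound at j" to "all background" per start j.
        ratios = []
        for j in range(n_sites):
            r = 1.0 / n_sites  # uniform prior over start positions
            for p in range(width):
                base = seq[j + p]
                r *= pwm[p][base] / bg[base]
            ratios.append(r)
        denom = (1.0 - lam) + lam * sum(ratios)
        z = [lam * r / denom for r in ratios]  # Prob(Z_ij = 1)
        bound_mass += sum(z)
        # M-step accumulation: expected base counts per motif column.
        for j, z_ij in enumerate(z):
            for p in range(width):
                counts[p][seq[j + p]] += z_ij
    new_pwm = [{b: col[b] / sum(col.values()) for b in ALPHABET}
               for col in counts]
    new_lam = bound_mass / len(seqs)
    return new_pwm, new_lam


# Toy usage: three sequences carry TAAT, one does not.
toy = ["GCGCTAATGCGC", "CGTAATCGCGCG", "GGCCGGTAATCC", "CCGGCCGGCCGG"]
motif = [{b: (0.4 if b == c else 0.2) for b in ALPHABET} for c in "TAAT"]
uniform_bg = {b: 0.25 for b in ALPHABET}
weight = 0.5
for _ in range(20):
    motif, weight = zoops_em_step(toy, motif, uniform_bg, weight)
```

Because each sequence contributes at most one expected occurrence, a sequence with no good site adds little to the counts, which is the point of the zero-or-one model.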
11 Hypothesis Testing
- Get motifs from EM
- Use 2 sets of bound and unbound sequences (train and test)
- Train a linear SVM on the train set.
- Find the classification error on the test set
- Error = misclassifications / total samples
- Score = 1 - error
- SVM input: $P(S \mid M) / P(S \mid B)$; output: bound (B) or unbound (UB)
[Figure: bound sequences (B) feed EM, which yields the motif (M); a train set of B and UB sequences trains the classifier and a test set of B and UB sequences tests it]
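The scoring step can be sketched with a small linear SVM on a single feature per sequence. This is a stand-in trained by plain subgradient descent on the hinge loss, not the MATLAB SVM toolbox used for the slides. It assumes the feature is the log of the likelihood ratio, and the names, learning rate, and epoch count are hypothetical.

```python
def train_linear_svm(xs, ys, c=1.0, lr=0.01, epochs=1000):
    """Fit w, b for labels ys in {+1, -1} by subgradient descent on
    0.5 * w^2 + c * sum(max(0, 1 - y * (w * x + b)))."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w, grad_b = w, 0.0
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1.0:  # point violates the margin
                grad_w -= c * y * x
                grad_b -= c * y
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def classification_score(w, b, xs, ys):
    """Return (score, error) with error = misclassified / total."""
    wrong = sum(1 for x, y in zip(xs, ys)
                if (1 if w * x + b >= 0 else -1) != y)
    error = wrong / len(xs)
    return 1.0 - error, error


# Toy log-likelihood ratios: +1 = bound, -1 = unbound.
w, b = train_linear_svm([2.0, 3.0, 2.5, -2.0, -1.5, -3.0],
                        [1, 1, 1, -1, -1, -1])
score, error = classification_score(w, b, [1.8, 2.2, -1.0, -2.5],
                                    [1, 1, -1, -1])
```

The parameter `c` plays the role of the budget for misclassifications: a larger value penalizes margin violations more heavily.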
12 Expectation-Maximization
- When to stop? Will it overtrain?
- Rules of thumb (stop when the likelihood increases very slowly)
- Second derivative is negative for a given number of times
- Euclidean distance is less than a given value
- Over-training to the given sequences
- EM maximizes the likelihood of the motif in the given sequences and disregards its likelihood in unbound sequences
- Find the test classification error at each EM step using SVMs.
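The two rule-of-thumb stopping tests can be sketched as below. The sketch assumes that "a given number of times" means consecutive iterations and that the Euclidean distance is taken between the PWMs of successive iterations. Both readings, and all names and defaults, are assumptions.

```python
import math


def second_differences(values):
    """Discrete second derivative of a per-iteration curve."""
    return [values[i + 1] - 2.0 * values[i] + values[i - 1]
            for i in range(1, len(values) - 1)]


def should_stop(log_likelihoods, min_iterations=60, patience=5):
    """Stop once the curve has been concave for `patience` consecutive
    iterations, but never before `min_iterations`."""
    if len(log_likelihoods) < max(min_iterations, patience + 2):
        return False
    recent = second_differences(log_likelihoods)[-patience:]
    return all(d < 0.0 for d in recent)


def pwm_distance(pwm_a, pwm_b):
    """Euclidean distance between two PWMs given as lists of
    {base: prob} columns of equal length."""
    return math.sqrt(sum((col_a[b] - col_b[b]) ** 2
                         for col_a, col_b in zip(pwm_a, pwm_b)
                         for b in "ACGT"))
```

Neither test looks at unbound sequences, so both can still let EM over-train; they only detect that the likelihood has flattened.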
13 Expectation-Maximization
- A different methodology
- 4 sets of data
- Bound (for EM)
- Bound and unbound (train SVM)
- Bound and unbound (test SVM)
- Bound and unbound (validation)
- At each EM iteration, train the SVM and find the test error.
- Use two kinds of motifs
- Best-test-error motif
- EM last-iteration motif
- Choose the 10 best hypotheses
- Use a larger validation set
[Figure labels: Initial Points, Final Motif, SVM Error]
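Keeping both the best-test-error motif and the last-iteration motif from each EM run can be sketched as follows. The data layout (a list of `(motif, test_error)` pairs per run, and candidate dictionaries with a `test_error` key) and the function names are hypothetical.

```python
def select_candidates(trajectory):
    """trajectory: list of (motif, test_error), one entry per EM iteration.
    Returns the best-of-iteration and end-of-iteration entries with
    their iteration indices."""
    best_index = min(range(len(trajectory)),
                     key=lambda t: trajectory[t][1])
    last_index = len(trajectory) - 1
    return {
        "best_of_iteration": (best_index, trajectory[best_index]),
        "end_of_iteration": (last_index, trajectory[last_index]),
    }


def top_hypotheses(candidates, k=10):
    """Keep the k candidates with the lowest test error; these would
    then be re-scored on the larger validation set."""
    return sorted(candidates, key=lambda c: c["test_error"])[:k]


run = [("m0", 0.50), ("m1", 0.40), ("m2", 0.35), ("m3", 0.42)]
picked = select_candidates(run)
```

Re-scoring on a separate validation set matters because the best-of-iteration motif was selected using the test error and is therefore optimistically biased on that set.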
14 Expectation-Maximization
- Details of run
- Transcription factor: Nanog
- Beta (strength of prior): 0, 0.2, 0.35, 0.5, 0.6, 0.7, 1
- 5 motifs per beta, obtained by masking motifs
- Motif length: 8
- 25 bound sequences for EM
- 500 base pairs in each sequence
- 150 total train sequences (SVM), low noise
- 150 total test sequences (SVM), low noise
- 500 total validation sequences
- c: 1e-3, 0.05, 100.0 (SVM budget for misclassifications)
- EM for a minimum of 60 iterations; stop once the second derivative is negative for five iterations
15 Expectation-Maximization
- Representative score graphs during EM iterations
[Figure: four panels, for beta = 0.0, 0.35, 0.6 and 0.7; x-axis: EM iteration; y-axis: score of motif]
16 Expectation-Maximization
- Test and validation error of refined motifs
[Figure: two panels, test classification score and validation classification score; x-axis: beta value; y-axis: score of motif; legend: end-of-iteration EM result and best-of-iteration (o)]
17 Expectation-Maximization
- When is it the best-of-iteration?
[Figure: iteration plotted per run; series: total iterations and iteration of the best-of-iteration motif]
18 Expectation-Maximization
- Results
- 6 out of 7 top-ranking motifs were best-of-iteration and 1 was end-of-iteration (6 out of 10 as well)
- Best motif: validation error over a set of 500
- Score 61.2%, error 38.8%
- Position weight matrix (columns are motif positions 1 to 8):
  Base  1         2         3         4         5         6         7         8
  A     0.003392  0.764554  0.995187  0.072268  0.063644  0.459349  0.000033  0.088069
  C     0.268216  0.050266  0.000149  0.000022  0.303880  0.003363  0.472214  0.201074
  G     0.039865  0.000023  0.002015  0.205620  0.105970  0.537248  0.446827  0.228689
  T     0.688527  0.185157  0.002648  0.722090  0.526506  0.000040  0.080927  0.482167
- Consensus: T A A T T (A or G) (C or G) T
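The consensus can be read off the matrix mechanically. The sketch below lists, for each position, every base whose probability reaches a threshold, falling back to the single most probable base. The threshold of 0.4 is an assumption chosen for illustration, and the function name is hypothetical.

```python
PWM = {
    "A": [0.003392, 0.764554, 0.995187, 0.072268,
          0.063644, 0.459349, 0.000033, 0.088069],
    "C": [0.268216, 0.050266, 0.000149, 0.000022,
          0.303880, 0.003363, 0.472214, 0.201074],
    "G": [0.039865, 0.000023, 0.002015, 0.205620,
          0.105970, 0.537248, 0.446827, 0.228689],
    "T": [0.688527, 0.185157, 0.002648, 0.722090,
          0.526506, 0.000040, 0.080927, 0.482167],
}


def consensus(pwm, threshold=0.4):
    """Per position, the bases with probability >= threshold, most
    probable first; falls back to the single best base."""
    width = len(next(iter(pwm.values())))
    result = []
    for p in range(width):
        strong = sorted((b for b in pwm if pwm[b][p] >= threshold),
                        key=lambda b: -pwm[b][p])
        if not strong:
            strong = [max(pwm, key=lambda b: pwm[b][p])]
        result.append("/".join(strong))
    return result


print(" ".join(consensus(PWM)))  # T A A T T G/A C/G T
```

Positions 6 and 7 each have two bases above the threshold, which is why the consensus is ambiguous there.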
19 Assumptions and Caveats
- Random baseline: end-of-run motif in EM
- Low number of sequences for test error
- Bound sets may not actually be bound. Better to use highly probable sequences as bound.
- All runs (including beta = 0) used the structural prior as the starting point.
21 GSEA (Subramanian et al., 2005)
- Gene Set Enrichment Analysis (GSEA) determines whether an a priori defined set of genes shows statistically significant differences between two biological states.
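The core of GSEA is a running-sum statistic over a ranked gene list. The sketch below implements only the unweighted form, where every hit counts equally. The GSEA software by default weights hits by their ranking metric and assesses significance by permutation, neither of which is shown here. The function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score: step up at genes in the
    set, step down otherwise, and return the largest deviation from 0."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list only partially")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_es = enrichment_score(ranked, {"g1", "g2"})     # set at the top
bottom_es = enrichment_score(ranked, {"g5", "g6"})  # set at the bottom
```

A score near +1 means the set is concentrated at the top of the ranking, and a score near -1 means it is concentrated at the bottom.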
22 GSEA Output
- Enrichment Plot
- Gene List
- Gene Set Information
23 GSEA Ranked List
- Set of promoter sequences for every human gene.
- 2000 bp upstream and 200 bp downstream of the transcription initiation site.
- Score each promoter for likelihood of the motif.
- Input this ranked list into GSEA.
- Search for gene sets enriched in the ranked list.
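Building the ranked list can be sketched as scoring each promoter by its best motif match and sorting. The slides do not say how the promoter score was defined, so the maximum log-odds over all windows is an assumption, as are the forward-strand-only scan, the zero-order background, and all names.

```python
import math


def best_site_log_odds(seq, pwm, bg):
    """Highest log-odds score of any window; `pwm` is a list of
    {base: prob} columns and `bg` maps each base to its frequency."""
    width = len(pwm)
    best = -math.inf
    for j in range(len(seq) - width + 1):
        window = seq[j:j + width]
        if any(base not in bg for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(math.log(pwm[p][base] / bg[base])
                    for p, base in enumerate(window))
        best = max(best, score)
    return best


def rank_genes(promoters, pwm, bg):
    """Return gene names sorted from best to worst motif score."""
    scores = {gene: best_site_log_odds(seq, pwm, bg)
              for gene, seq in promoters.items()}
    return sorted(scores, key=scores.get, reverse=True)


toy_pwm = [{b: (0.7 if b == c else 0.1) for b in "ACGT"} for c in "TA"]
toy_bg = {b: 0.25 for b in "ACGT"}
order = rank_genes({"geneA": "GGGTAGG", "geneB": "GGGGGGG",
                    "geneC": "GGTGGGG"}, toy_pwm, toy_bg)
```

The resulting order is what a pre-ranked GSEA run would take as input.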
24 Results
- Human embryonic stem cell genes OCT4, NANOG, STELLAR, and GDF3 are expressed in both seminoma and breast carcinoma (Ezeh et al., 2006).
- Breast cancer gene set found at p-value 0.008
25 Implementation Details
- Young Lab error model for ChIP-chip data analysis
- Motif-finding algorithm in MATLAB
- Implemented Markov model
- Implemented ZOOPS model
- Integrated SVM Toolbox (by S. R. Gunn) with code
- Used structural prior from MacIsaac et al., 2006
- Used GSEA software for functional analysis.
26 Future Directions
- Algorithm
- Better use of classification error.
- Maximize likelihood in bound sequences and minimize likelihood in unbound sequences (multi-objective optimization using GAs)
- Biological information: distance from transcription site, conservation
- Integrating expression data
- Cross-species motif search and functional analysis, maybe using GO terms
- Scoring
- Sequence length
27 Acknowledgments
- Fraenkel Lab
- Young Lab
- Kenzie D. MacIsaac
- Dr. David Gifford (CSAIL)
- Dr. Richard Young (WIBR)
- Dr. Tommi Jaakkola (CSAIL)