Motif Discovery: Algorithm and Application

Provided by: sumeet2

1
Motif Discovery: Algorithm and Application

Dan Scanfeld
Hong Xue
Sumeet Gupta
Varun Aggarwal

2
Objective: Motif discovery and its use for deriving biological information
Get sequences bound and unbound by the TF Nanog in human ES cells
Find a motif using a motif-finding algorithm
Genome-wide functional analysis using the motif to find biological patterns
3
Why Nanog? Relevance to ES Cells

Activates certain genes essential for cell growth
Represses a key set of genes needed for an embryo to develop.
This key set of repressed genes activates entire networks for generating many different specialized cells and tissues.

1 Genome, 1 Cell
>200 Phenotypes, 10^13 Cells
4
Objective: Motif discovery and its use for deriving biological information
Get sequences bound and unbound by the TF Nanog in human ES cells
Find a motif (Nanog) using a motif-finding algorithm
Genome-wide functional analysis using the motif to find biological signals
5
Location Analysis (ChIP-chip) in Human ES Cells (Boyer et al., Cell 122: 947-956)
Crosslink
Fragment
Enrich for Nanog
Differentially label
44k 10 Set, Agilent
6
ChIP-chip Data Analysis
Obtain intensities using GenePix
Negative control-subtracted
Perform median normalization
Set-normalized
Sequences (500 bp), May 2004 genome release
[Plot: IP signal vs. WCE signal]
7
Objective: Motif discovery and its use for deriving biological information
Get sequences bound and unbound by the TF Nanog in human ES cells
Find a motif (Nanog) using a motif-finding algorithm (state-of-the-art)
Genome-wide functional analysis using the motif to find biological patterns
8
Motif Finding Algorithm (MacIsaac et al., 2006)
Use structural prior (database, MacIsaac et al.)
Refinement: Expectation-Maximization (ZOOPS)
Score of found motifs: classification on unseen data
Significance testing on score: use of empirical p-value
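
As a hedged illustration of the significance-testing step, the sketch below computes an empirical p-value for a motif's classification score. The slides do not say how the null scores are generated, so `null_scores` is an assumption (for example, scores of random or shuffled motifs), and the add-one correction is a common convention rather than something stated in the source.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator keeps the estimate above zero
    when no null score reaches the observed one (assumed convention).
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)


# Hypothetical numbers, for illustration only.
null_scores = [0.48, 0.52, 0.55, 0.50, 0.61, 0.47, 0.53, 0.49, 0.51]
p = empirical_p_value(0.60, null_scores)
```

With more null scores the smallest reachable p-value shrinks, so the size of the null set bounds how significant a motif can appear.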
9
Refinement: Expectation-Maximization

Differences from EM in Lab 1
Use of structural prior (beta = strength of prior)
ZOOPS (Zero or One Per Sequence) model
5th-order Markov model for background, trained over unbound sequences
SVM for hypothesis testing
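
A minimal sketch of a k-th order Markov background model, assuming uppercase A/C/G/T sequences only. The function names, the pseudocount, and the uniform fallback for unseen contexts are illustrative choices, not details from the slides (the original implementation was in MATLAB).

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(seqs, order=5, pseudo=1.0):
    """Estimate P(base | previous `order` bases) from sequences of A/C/G/T."""
    counts = defaultdict(lambda: {b: pseudo for b in BASES})
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {
        ctx: {b: c[b] / sum(c.values()) for b in BASES}
        for ctx, c in counts.items()
    }


def background_log_prob(seq, model, order=5):
    """log P(seq | B), conditioning on the first `order` bases.

    Contexts never seen in training fall back to a uniform 0.25.
    """
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        total += math.log(probs[seq[i]] if probs else 0.25)
    return total


# Toy "unbound" sequences (hypothetical) and a low order so contexts repeat.
unbound = ["ACGTACGTACGTACGT", "TTGACGTAACGTTGCA"]
model = train_markov(unbound, order=2)
lp = background_log_prob("ACGTACGT", model, order=2)
```

A 5th-order model has up to 4^5 = 1024 contexts, so it needs far more training sequence than this toy example to be estimated reliably.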

10
ZOOPS Model (Bailey & Elkan, 1994)

B = background model, M = motif model
$\lambda$ = fraction of bound sequences (mixture model parameter)
Sequences are drawn from the distribution
$$P(S) = \lambda\,P(S \mid M) + (1-\lambda)\,P(S \mid B)$$
Hidden variable for EM: $Z_{ij} \in \{0, 1\}$; position $j$ in sequence $i$ is bound by the TF (1) or not (0)
E-step
$$\mathrm{Prob}(Z_{ij} = 1) = P(M \text{ bound at } j \mid S_i) = \frac{\lambda\,P(S_i \text{ bound at } j \mid M)}{(1-\lambda)\,P(S_i \mid B) + \lambda \sum_{j'} P(S_i \text{ bound at } j' \mid M)}$$
The denominator is $P(S_i)$.
M-step (same as before)
Updating M (motif model): for each position $p$ of the motif model and each base $b$ (A, C, G, or T), with $\mathrm{Base}_{ij}$ the base at position $j$ of the $i$th sequence,
$$\mathrm{PWM}(p, b) \propto \sum_i \sum_j \mathrm{Prob}(Z_{i,\,j-p+1} = 1)\,[\mathrm{Base}_{ij} = b] + \text{pseudocounts}$$
then normalize over $b$.
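
The E-step and M-step above can be sketched as follows. This is an illustration under simplifying assumptions, not the presenters' code: the background is order-0 (so dividing by P(S_i | B) leaves a per-window likelihood ratio), the prior over start positions is uniform, and there is no structural prior. The mixture parameter is called `lam`, and its update is the usual ZOOPS re-estimate, which the slide does not show.

```python
BASES = "ACGT"


def site_ratio(seq, j, pwm, bg):
    """Likelihood ratio motif/background for the window starting at j."""
    ratio = 1.0
    for p, column in enumerate(pwm):
        base = seq[j + p]
        ratio *= column[base] / bg[base]
    return ratio


def e_step(seqs, pwm, bg, lam):
    """Posterior Z[i][j] that sequence i is bound with the motif starting at j."""
    width = len(pwm)
    posteriors = []
    for seq in seqs:
        n_starts = len(seq) - width + 1
        # Numerator and denominator are both divided by P(S_i | B),
        # which turns each term into a likelihood ratio.
        terms = [lam / n_starts * site_ratio(seq, j, pwm, bg)
                 for j in range(n_starts)]
        denom = (1.0 - lam) + sum(terms)
        posteriors.append([t / denom for t in terms])
    return posteriors


def m_step(seqs, posteriors, width, pseudo=0.5):
    """Re-estimate the PWM from expected base counts, then normalize."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for seq, z in zip(seqs, posteriors):
        for j, z_ij in enumerate(z):
            for p in range(width):
                counts[p][seq[j + p]] += z_ij
    pwm = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    # Usual ZOOPS update for the bound fraction (not shown on the slide).
    lam = sum(sum(z) for z in posteriors) / len(seqs)
    return pwm, lam


# Toy data (hypothetical): three sequences carry TAATG, one does not.
seqs = ["ACGTTAATGGCA", "TTAATGCCGTAC", "GGCATTAATGAC", "CCCCGGGGCCCC"]
bg = {b: 0.25 for b in BASES}
pwm = [{b: (0.7 if b == c else 0.1) for b in BASES} for c in "TAATG"]
lam = 0.5
first_z = e_step(seqs, pwm, bg, lam)
for _ in range(10):
    z = e_step(seqs, pwm, bg, lam)
    pwm, lam = m_step(seqs, z, width=5)
```

Each row of posteriors sums to at most 1, which is the "zero or one per sequence" constraint: the remaining mass is the probability that the sequence is unbound.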
11
Hypothesis testing

Get motifs from EM
Use 2 sets of bound and unbound sequences (train and test)
Train a linear SVM on the train set.
Find the classification error on the test set
Error = Misclassifications / Total Samples
Score = 1 - Error

SVM input: P(S|M)/P(S|B); output: bound (B) or unbound (UB)
[Diagram: bound (B) sequences go through EM to give the motif (M); the B and UB train set trains the classifier; the B and UB test set tests the classifier]
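
The scoring procedure above can be sketched in a few lines. To stay dependency-free, this sketch swaps the linear SVM for a plain threshold on the single feature, which is the simplest linear classifier in one dimension; it is not the SVM Toolbox used in the original work. The feature values are hypothetical log likelihood ratios, log P(S|M)/P(S|B), and labels are True for bound.

```python
def fit_threshold(features, labels):
    """Pick the cut on a 1-D feature that minimizes training misclassifications."""
    xs = sorted(set(features))
    candidates = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])

    def errors(t):
        return sum((x > t) != y for x, y in zip(features, labels))

    return min(candidates, key=errors)


def classification_score(threshold, features, labels):
    """Score = 1 - error, with error = misclassifications / total samples."""
    mis = sum((x > threshold) != y for x, y in zip(features, labels))
    return 1.0 - mis / len(labels)


# Hypothetical train and test sets.
train_x = [2.1, 1.7, 0.3, -0.5, -1.2, 0.9]
train_y = [True, True, False, False, False, True]
test_x = [1.5, 0.2, -0.8, -0.3]
test_y = [True, True, False, False]

threshold = fit_threshold(train_x, train_y)
score = classification_score(threshold, test_x, test_y)
```

Scoring on held-out sequences, rather than on the sequences used to fit the classifier, is what makes the score a measure of generalization.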
12
Expectation-Maximization

When to stop? Will it overtrain?
Rules of thumb (stop when the likelihood increases very slowly)
Second derivative is negative for a given number of times
Euclidean distance is less than a given value
Over-training to the given sequences
EM maximizes the likelihood of the motif in the given sequences.
It disregards the motif's likelihood in unbound sequences
Find the test classification error at each EM step using SVMs.

13
Expectation-Maximization

A different methodology
4 sets of data:
Bound (for EM)
B and U.B. (train SVM)
B and U.B. (test SVM)
B and U.B. (validation)
At each EM iteration, train the SVM and find the test error.
Use two kinds of motifs:
Best-test-error motif
EM last-iteration motif
Choose the 10 best hypotheses
Use a larger validation set
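
A small sketch of keeping both kinds of motif from one EM run, assuming the run has been recorded as a list of (motif, test score) pairs, one per iteration. The `history` structure and the motif labels are hypothetical.

```python
def select_motifs(history):
    """Return the best-of-iteration and end-of-iteration motifs of one EM run.

    history: list of (motif, test_score) tuples, one per EM iteration.
    Ties on score go to the earliest iteration.
    """
    best_iter = max(range(len(history)), key=lambda k: history[k][1])
    return {
        "best_of_iteration": history[best_iter][0],
        "best_iteration_index": best_iter,
        "end_of_iteration": history[-1][0],
    }


# Hypothetical run: the test score peaks before EM finishes.
history = [("motif_iter0", 0.55), ("motif_iter1", 0.62), ("motif_iter2", 0.60)]
chosen = select_motifs(history)
```

Because the best-of-iteration motif is picked using the test set, its test score is optimistically biased, which is why a separate validation set is needed for the final comparison.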

[Figure labels: Initial Points, SVM Error (at each EM step), Final Motif]
14
Expectation-Maximization

Details of run
Transcription factor: Nanog
Beta: 0, 0.2, 0.35, 0.5, 0.6, 0.7, 1 (strength of prior)
5 motifs per beta, by masking motifs
Motif length: 8
25 bound sequences for EM
500 base pairs in each sequence
150 total train sequences (SVM), Low Noisy
150 total test sequences (SVM), Low Noisy
500 total validation sequences
c: 1e-3, 0.05, 100.0 (SVM budget for misclassifications)
EM for a minimum of 60 iterations, then until the second derivative is negative for five iterations
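
One possible reading of the stopping rule in the run details, sketched under two assumptions that the slides do not spell out: the "second derivative" is the discrete second difference of the log-likelihood curve, and the five negative values must be consecutive and most recent.

```python
def should_stop(log_likelihoods, min_iters=60, patience=5):
    """True once at least `min_iters` iterations have run and the last
    `patience` second differences of the curve are all negative."""
    n = len(log_likelihoods)
    if n < max(min_iters, patience + 2):
        return False
    recent = log_likelihoods[-(patience + 2):]
    second_diffs = [recent[k + 2] - 2 * recent[k + 1] + recent[k]
                    for k in range(patience)]
    return all(d < 0 for d in second_diffs)


# Hypothetical curves: one flattening out (concave), one rising linearly.
flattening = [-100.0 / (k + 1) for k in range(70)]
linear = [float(k) for k in range(70)]
```

Here `should_stop(flattening)` is true, while `should_stop(flattening[:30])` is false because fewer than 60 iterations have run, and `should_stop(linear)` is false because the second differences are zero rather than negative.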

15
Expectation-Maximization

Representative score graphs during EM iterations
[Figure: four panels, Beta = 0.0, 0.35, 0.7, and 0.6. X-axis: EM iteration; Y-axis: score of motif]
16
Expectation-Maximization
Test and validation error of refined motifs
[Figure: two panels, test classification score and validation classification score. X-axis: beta value; Y-axis: score of motif. Markers distinguish the end-of-iteration EM result from the best-of-iteration result (o)]
17
Expectation-Maximization

When is it the best-of-iteration?

[Figure: axes labeled iteration and runs; series: total iterations, and iterations for best-of-iteration]
18
Expectation-Maximization

Results
6 of the 7 top-ranking motifs were best-of-iteration and 1 was end-of-iteration (6 of the top 10 as well)
Best motif, validation over a set of 500: Score 61.2%, Error 38.8%

Position  1         2         3         4         5         6         7         8
A         0.003392  0.764554  0.995187  0.072268  0.063644  0.459349  0.000033  0.088069
C         0.268216  0.050266  0.000149  0.000022  0.303880  0.003363  0.472214  0.201074
G         0.039865  0.000023  0.002015  0.205620  0.105970  0.537248  0.446827  0.228689
T         0.688527  0.185157  0.002648  0.722090  0.526506  0.000040  0.080927  0.482167

Consensus: T A A T T (A or G) (C or G) T
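
As a check on the consensus line, the sketch below derives a consensus string from the position weight matrix on this slide. The 0.4 inclusion threshold and the use of IUPAC ambiguity codes (R = A or G, S = C or G) are assumptions made for illustration; the slides do not state how the consensus was called.

```python
# Two-base IUPAC ambiguity codes, keyed by bases in A, C, G, T order.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M"}

# PWM values copied from the slide: one list of 8 positions per base.
PWM = {
    "A": [0.003392, 0.764554, 0.995187, 0.072268,
          0.063644, 0.459349, 0.000033, 0.088069],
    "C": [0.268216, 0.050266, 0.000149, 0.000022,
          0.303880, 0.003363, 0.472214, 0.201074],
    "G": [0.039865, 0.000023, 0.002015, 0.205620,
          0.105970, 0.537248, 0.446827, 0.228689],
    "T": [0.688527, 0.185157, 0.002648, 0.722090,
          0.526506, 0.000040, 0.080927, 0.482167],
}


def consensus(pwm, threshold=0.4):
    """Call each position by the bases whose probability reaches `threshold`.

    Positions where no base (or an unlisted combination) qualifies become N.
    """
    width = len(pwm["A"])
    calls = []
    for p in range(width):
        strong = "".join(b for b in "ACGT" if pwm[b][p] >= threshold)
        calls.append(IUPAC.get(strong, "N"))
    return "".join(calls)


motif = consensus(PWM)  # "TAATTRST"
```

The TAAT core is what one would expect for a homeodomain protein such as Nanog; positions 6 and 7 are where the matrix is genuinely split between two bases.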

19
Assumptions and Caveats

Random baseline: end-of-run motif in EM
Low number of sequences for test error
Bound sets may not actually be bound. Better to use highly probable sequences as bound.
All runs (including beta = 0) used the structural prior as the starting point.

20
Objective: Motif discovery and its use for deriving biological information
Get sequences bound and unbound by the TF Nanog in human ES cells
Find a motif (Nanog) using a motif-finding algorithm
Genome-wide functional analysis using the motif to find biological patterns
21
GSEA (Subramanian et al., 2005)

Gene Set Enrichment Analysis (GSEA) determines whether an a priori defined set of genes shows statistically significant differences between two biological states.
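
To make the idea concrete, here is a simplified running-sum enrichment score. It is the unweighted variant (every hit counts equally), not the correlation-weighted statistic or the permutation-based significance testing of the GSEA software, and the gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up by 1/n_hit at each gene in the
    set and down by 1/n_miss otherwise; return the largest deviation from 0.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    extreme = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Placeholder gene names, ranked from highest to lowest score.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_heavy = enrichment_score(ranked, {"g1", "g2"})
bottom_heavy = enrichment_score(ranked, {"g5", "g6"})
```

A set concentrated at the top of the ranking gives a score near +1, one concentrated at the bottom gives a score near -1, and a set scattered evenly stays close to 0.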

22
GSEA Output

Enrichment Plot
Gene List
Gene Set Information

23
GSEA Ranked List

Set of promoter sequences for every human gene.
2000 bp upstream and 200 bp downstream of the transcription initiation site.
Score each promoter for likelihood of the motif.
Input this ranked list into GSEA.
Search for gene sets enriched in the ranked list.
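
The scoring and ranking steps above might look like the following sketch. It assumes each promoter is scored by its best single-site log-odds against a uniform background on the given strand only; the slides do not specify the scoring function, and the toy PWM, gene names, and sequences are hypothetical.

```python
import math


def best_site_score(seq, pwm, bg=0.25):
    """Highest log-odds score of any window in `seq` under the PWM."""
    width = len(pwm)
    best = -math.inf
    for j in range(len(seq) - width + 1):
        score = sum(math.log(pwm[p][seq[j + p]] / bg) for p in range(width))
        best = max(best, score)
    return best


def rank_genes(promoters, pwm):
    """Return (gene, score) pairs sorted from most to least motif-like."""
    scored = [(gene, best_site_score(seq, pwm))
              for gene, seq in promoters.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy PWM favoring TAAT, and three made-up promoters.
toy_pwm = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "TAAT"]
promoters = {
    "geneA": "GGGTAATGGG",
    "geneB": "GGGGGGGGGG",
    "geneC": "GGTAGTGGGG",
}
ranking = rank_genes(promoters, toy_pwm)
ranked_names = [gene for gene, _ in ranking]
```

The resulting ordered gene list is the kind of ranked input that is then tested for gene set enrichment.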

24
Results

Human embryonic stem cell genes OCT4, NANOG, STELLAR, and GDF3 are expressed in both seminoma and breast carcinoma (Ezeh et al., 2006).
Breast cancer gene set found at p-value 0.008

25
Implementation Details

Young Lab error model for ChIP-chip data analysis
Motif-finding algorithm in MATLAB
Implemented Markov model
Implemented ZOOPS model
Integrated SVM Toolbox (by S. R. Gunn) with code
Used structural prior from MacIsaac et al., 2006
Used GSEA software for functional analysis.

26
Future Directions

Algorithm
Better use of classification error.
Maximize likelihood in bound and minimize likelihood in unbound (multi-objective optimization using GAs)
Biological information: distance from transcription site, conservation
Integrating expression data
Cross-species motif search and functional analysis, maybe using GO terms
Scoring
Sequence length

27
Acknowledgments