Motif Discovery: Algorithm and Application

Slides: 28
Provided by: sumeet2
1
Motif Discovery: Algorithm and Application
  • Dan Scanfeld
  • Hong Xue
  • Sumeet Gupta
  • Varun Aggarwal

2
Objective: Motif discovery and its use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif using a motif-finding algorithm
  • Genome-wide functional analysis using the motif to find biological patterns
3
Why Nanog? Relevance to ES Cells
  • Activates certain genes essential for cell growth
  • Represses a key set of genes needed for an embryo
    to develop.
  • This key set of repressed genes activates entire
    networks for generating many different
    specialized cells and tissues.

1 Genome, 1 Cell
>200 Phenotypes, 10^13 Cells
4
Objective: Motif discovery and its use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif (Nanog) using a motif-finding algorithm
  • Genome-wide functional analysis using the motif to find biological signals
5
Location Analysis (ChIP-chip) in Human ES Cells
(Boyer et al., Cell 122: 947-956)
[Figure: ChIP-chip workflow. Labels: Crosslink, Fragment, Enrich for Nanog, Differentially label; Agilent 44k arrays, set of 10]
6
ChIP-chip Data Analysis
  • Negative-control-subtracted
  • Perform median normalization
  • Set-normalized
  • Obtain intensities using GenePix
  • Sequences (500 bp) from the May 2004 genome release
[Figure: plot of IP signal against WCE signal]
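
The slide does not spell out the normalization itself, so the following is only a minimal sketch of one common form of median normalization: the IP channel is rescaled so that its median matches the WCE median, and a log2 ratio is taken per probe. The function names and the choice to rescale the IP channel are assumptions for illustration; the Young Lab error model mentioned later in the deck is not reproduced here.

```python
import math
from statistics import median

def median_normalize(ip, wce):
    """Rescale IP intensities so the IP median equals the WCE median (assumed scheme)."""
    scale = median(wce) / median(ip)
    return [x * scale for x in ip]

def log_ratios(ip, wce):
    """Per-probe log2(IP / WCE) after median normalization; intensities must be positive."""
    ip_n = median_normalize(ip, wce)
    return [math.log2(a / b) for a, b in zip(ip_n, wce)]
```

Probes with a clearly positive log ratio would be candidates for the bound set; where to place that cut-off is not stated in the slides.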
7
Objective: Motif discovery and its use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif (Nanog) using a motif-finding algorithm (state-of-the-art)
  • Genome-wide functional analysis using the motif to find biological patterns
8
Motif Finding Algorithm (MacIsaac et al., 2006)
  • Use structural prior (database, MacIsaac et al.)
  • Refinement: Expectation-Maximization (ZOOPS)
  • Score of found motifs: classification on unseen data
  • Significance testing on score: use of empirical p-value
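
As a rough illustration of the last step, an empirical p-value can be computed by comparing the observed motif score with scores obtained under a null. How the null scores were generated is not stated on the slide, so `null_scores` is a hypothetical input, and the add-one correction is a common convention rather than something the deck confirms.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator is an assumed convention that
    keeps the estimate away from exactly zero.
    """
    at_least = sum(1 for s in null_scores if s >= observed)
    return (at_least + 1) / (len(null_scores) + 1)
```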
9
Refinement: Expectation-Maximization
  • Differences from EM in Lab 1
  • Use of structural prior (beta: strength of
    prior)
  • ZOOPS (Zero or One Occurrence Per Sequence) model
  • 5th-order Markov model for background, trained
    over unbound sequences
  • SVM for hypothesis testing
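
A 5th-order Markov background can be sketched as a table of next-base probabilities conditioned on the previous five bases. The pseudocount, the uniform fallback for unseen contexts, and the choice to skip the first five bases when scoring are assumptions made for this sketch, not details taken from the presentation.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudocount))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: v / total for b, v in c.items()}
    return model

def log_prob(seq, model, order=5):
    """Log-probability of seq under the background, skipping the first `order` bases."""
    lp = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        # Unseen context: fall back to a uniform distribution (assumption).
        lp += math.log(probs[seq[i]] if probs else 0.25)
    return lp
```

In the setting described above, `seqs` would be the unbound sequences.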

10
ZOOPS Model (Bailey & Elkan, 1994)
  • $B$: background model, $M$: motif model
  • $\lambda$: fraction of bound sequences (mixture model
    parameter)
  • Sequences are drawn from the distribution
  • $P(S) = \lambda\, P(S \mid M) + (1 - \lambda)\, P(S \mid B)$
  • Hidden variable for EM: $Z_{ij} = 1$ or $0$; position $j$
    in sequence $i$ is bound by the TF (1) or not (0)
  • E-step
  • $P(Z_{ij} = 1) = P(M \text{ bound at } j \mid S_i) = \dfrac{\lambda\, P(S_i \text{ bound at } j \mid M)}{(1 - \lambda)\, P(S_i \mid B) + \lambda \sum_{j'} P(S_i \text{ bound at } j' \mid M)}$,
    where the denominator is $P(S_i)$
  • M-step
  • (Same as before)
  • Updating $M$ (motif model): for position $p$ on the
    motif model and each base $b$ (A, C, T or G)
  • $\text{Base}_{ip}$: base at position $p$ of the $i$th sequence
  • $\text{PWM}(p, b) = \sum_i \sum_j P(Z_{i,\,j-p+1} = 1)\, [\text{Base}_{ij} = b] + \text{pseudocounts}$,
    then normalize over $b$
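
The E-step and M-step above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the authors' MATLAB code: the background is a fixed per-base distribution `bg` rather than the 5th-order Markov model, the mixture parameter `lam` is held fixed, and no structural prior is applied. With a position-independent background, dividing numerator and denominator of the E-step by P(S_i | B) leaves a per-window likelihood ratio, which is what `window_ratio` computes.

```python
BASES = "ACGT"

def window_ratio(seq, j, pwm, bg):
    """P(window | motif) / P(window | background) for a motif starting at j."""
    r = 1.0
    for p, col in enumerate(pwm):
        b = seq[j + p]
        r *= col[b] / bg[b]
    return r

def e_step(seqs, pwm, bg, lam):
    """z[i][j] = posterior that sequence i is bound with the motif starting at j."""
    w = len(pwm)
    z = []
    for s in seqs:
        lr = [window_ratio(s, j, pwm, bg) for j in range(len(s) - w + 1)]
        denom = (1.0 - lam) + lam * sum(lr)
        z.append([lam * r / denom for r in lr])
    return z

def m_step(seqs, z, w, pseudocount=0.1):
    """Re-estimate the PWM from posterior-weighted base counts, then normalize."""
    counts = [{b: pseudocount for b in BASES} for _ in range(w)]
    for s, zi in zip(seqs, z):
        for j, zij in enumerate(zi):
            for p in range(w):
                counts[p][s[j + p]] += zij
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({b: c / total for b, c in col.items()})
    return pwm

def refine(seqs, pwm, bg, lam=0.5, iterations=10):
    """Alternate E and M steps; returns the refined PWM and final posteriors."""
    for _ in range(iterations):
        z = e_step(seqs, pwm, bg, lam)
        pwm = m_step(seqs, z, len(pwm))
    return pwm, e_step(seqs, pwm, bg, lam)
```

Here the PWM is a list of per-position dictionaries, so `pwm[p][b]` plays the role of PWM(p, b). Summing `z[i]` over positions gives the posterior probability that sequence i is bound at all, which stays below 1 because the "zero occurrences" case keeps weight (1 - lam).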
11
Hypothesis testing
[Diagram: bound sequences (B) → EM → Motif (M)]
  • Get motifs from EM
  • Use 2 sets of bound and unbound sequences (train and
    test)
  • Train a linear SVM on the train set.
  • Find the classification error on the test set
  • Error = Misclassifications / Total Samples
  • Score = 1 - Error

SVM input: P(S|M)/P(S|B); output: bound (B) or unbound (UB)
[Diagram: train set (B, UB) → train classifier; test set (B, UB) → test classifier]
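
The scoring arithmetic can be illustrated without an SVM library. In the sketch below a simple threshold on the one-dimensional input P(S|M)/P(S|B) stands in for the linear SVM; it is a deliberately swapped-in classifier, chosen only so the example is self-contained, and the function names are hypothetical.

```python
def fit_threshold(xs, labels):
    """Pick the cut on a 1-D feature that minimizes training misclassifications.

    labels are booleans: True = bound, False = unbound.
    """
    pts = sorted(set(xs))
    mids = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    candidates = [pts[0] - 1.0] + mids + [pts[-1] + 1.0]

    def errors(t):
        return sum((x > t) != y for x, y in zip(xs, labels))

    return min(candidates, key=errors)

def score(threshold, xs, labels):
    """Score = 1 - Error, with Error = misclassifications / total samples."""
    wrong = sum((x > threshold) != y for x, y in zip(xs, labels))
    return 1.0 - wrong / len(xs)
```

The threshold is fitted on the train set and `score` is then evaluated on the held-out test set, mirroring the train/test split on the slide.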
12
Expectation-Maximization
  • When to stop? Will it overtrain?
  • Rules of thumb (stop when the likelihood increases very
    slowly)
  • Second derivative is negative for a given number of
    iterations
  • Euclidean distance is less than a given value
  • Over-training to the given sequences
  • EM maximizes the likelihood of the motif in the given sequences
    and disregards its likelihood in unbound sequences
  • Find the test classification error at each EM step
    using SVMs.
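
The two rules of thumb can be written as small checks. This is a sketch under assumptions: the "second derivative" is taken as the second finite difference of the log-likelihood trace, "a given number of iterations" is read as that many consecutive most recent iterations, and the Euclidean distance is assumed to be between successive PWMs, which the slide does not state. The defaults of 60 and 5 follow the run details given later in the deck.

```python
import math

def should_stop(log_likelihoods, min_iters=60, patience=5):
    """True once the last `patience` second differences are all negative."""
    ll = log_likelihoods
    if len(ll) < max(min_iters, patience + 2):
        return False
    second = [ll[k] - 2 * ll[k - 1] + ll[k - 2] for k in range(2, len(ll))]
    return all(d < 0 for d in second[-patience:])

def pwm_distance(a, b):
    """Euclidean distance between two PWMs given as equal-shaped nested lists."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))
```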

13
Expectation-Maximization
  • A different methodology
  • 4 sets of data:
  • Bound (for EM)
  • Bound and unbound (train SVM)
  • Bound and unbound (test SVM)
  • Bound and unbound (validation)
  • At each EM iteration, train the SVM and find the test
    error.
  • Use two kinds of motifs:
  • Best-test-error motif
  • EM last-iteration motif
  • Choose the 10 best hypotheses
  • Use a larger validation set

[Figure labels: Initial Points, SVM Error, Final Motif]
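
The bookkeeping behind "best-of-iteration" versus "end-of-iteration" motifs might look like the sketch below. It assumes that each EM run records a `(motif, test_score)` pair per iteration; that record layout and both function names are hypothetical.

```python
def select_motifs(history):
    """history: list of (motif, test_score), one entry per EM iteration.

    Returns the iteration index and motif for the best test score and for
    the final iteration.
    """
    best_iter = max(range(len(history)), key=lambda k: history[k][1])
    return {
        "best_of_iteration": (best_iter, history[best_iter][0]),
        "end_of_iteration": (len(history) - 1, history[-1][0]),
    }

def top_hypotheses(candidates, n=10):
    """Keep the n candidates with the highest test score for validation."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
```

The motifs kept by `top_hypotheses` would then be re-scored on the larger validation set, which was not used for either EM or model selection.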
14
Expectation-Maximization
  • Details of run
  • Transcription factor: Nanog
  • Beta (strength of prior): 0, 0.2, 0.35, 0.5, 0.6, 0.7, 1
  • 5 motifs per beta, obtained by masking motifs
  • Motif length: 8
  • 25 bound sequences for EM
  • 500 base pairs in each sequence
  • 150 total train sequences (SVM), Low Noisy
  • 150 total test sequences (SVM), Low Noisy
  • 500 total validation sequences
  • C: 1e-3, 0.05, 100.0 (SVM budget for
    misclassifications)
  • EM run for a minimum of 60 iterations, until the second
    derivative is negative for five iterations

15
Expectation-Maximization
  • Representative score graphs during EM iterations

[Figure: score of motif (y-axis) vs. EM iteration (x-axis); panels for beta = 0.0, 0.35, 0.7 and 0.6]
16
Expectation-Maximization
Test and validation error of refined motifs
[Figure: score of motif (y-axis) vs. beta value (x-axis); one panel for test classification score and one for validation classification score, each marking the end-of-iteration EM result and the best-of-iteration result (o)]
17
Expectation-Maximization
  • When is it the best-of-iteration?

[Figure: axes labeled "iteration" and "runs"; series: total iterations, and iteration at which best-of-iteration occurred]
18
Expectation-Maximization
  • Results
  • 6 out of 7 top-ranking motifs were
    best-of-iteration and 1 was end-of-iteration (6
    out of 10 as well)
  • Best motif: validation error over a set of 500
  • Score 61.2%, Error 38.8%

  Position   1         2         3         4         5         6         7         8
  A          0.003392  0.764554  0.995187  0.072268  0.063644  0.459349  0.000033  0.088069
  C          0.268216  0.050266  0.000149  0.000022  0.303880  0.003363  0.472214  0.201074
  G          0.039865  0.000023  0.002015  0.205620  0.105970  0.537248  0.446827  0.228689
  T          0.688527  0.185157  0.002648  0.722090  0.526506  0.000040  0.080927  0.482167
  Consensus  T         A         A         T         T         A or G    C or G    T
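
The consensus row can be re-derived from the matrix with a short script. The probabilities are copied from the slide; the cut-off of 0.4 for reporting a base is an assumption chosen for illustration, and two-base positions are written with the standard IUPAC codes (R = A or G, S = C or G).

```python
PWM = {
    "A": [0.003392, 0.764554, 0.995187, 0.072268, 0.063644, 0.459349, 0.000033, 0.088069],
    "C": [0.268216, 0.050266, 0.000149, 0.000022, 0.303880, 0.003363, 0.472214, 0.201074],
    "G": [0.039865, 0.000023, 0.002015, 0.205620, 0.105970, 0.537248, 0.446827, 0.228689],
    "T": [0.688527, 0.185157, 0.002648, 0.722090, 0.526506, 0.000040, 0.080927, 0.482167],
}

# IUPAC codes for one or two bases; keys list bases in A, C, G, T order.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M"}

def consensus(pwm, threshold=0.4):
    """Report every base whose probability reaches the (assumed) threshold."""
    out = []
    for p in range(len(pwm["A"])):
        kept = "".join(b for b in "ACGT" if pwm[b][p] >= threshold)
        out.append(IUPAC.get(kept, "N"))
    return "".join(out)

print(consensus(PWM))  # TAATTRST
```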

19
Assumptions and Caveats
  • Random baseline End-of-run motif in EM
  • Low number of sequences for test error
  • Bound sets may actually not be bound. Better to
    use highly probable sequences as bound.
  • All runs (incl. beta = 0) used the structural prior as the
    starting point.

20
Objective: Motif discovery and its use for deriving biological information
  • Get sequences bound and unbound by the TF Nanog in human ES cells
  • Find a motif (Nanog) using a motif-finding algorithm
  • Genome-wide functional analysis using the motif to find biological patterns
21
GSEA (Subramanian et al 2005)
  • Gene Set Enrichment Analysis (GSEA) determines
    whether an a priori defined set of genes shows
    statistically significant differences between two
    biological states.
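
To give a feel for the statistic, here is a minimal sketch of the unweighted running-sum enrichment score, which corresponds to the classic Kolmogorov-Smirnov-like form. GSEA's default statistic additionally weights each hit by its ranking metric and assesses significance by permutation; neither is shown here, and the function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of a running sum over the ranked list.

    The sum steps up by 1/n_hit at genes in the set and down by 1/n_miss
    at genes outside it.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for h in hits:
        running += 1.0 / n_hit if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 means the set is concentrated at the top of the ranked list, and a score near -1 means it is concentrated at the bottom.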

22
GSEA Output
  • Enrichment Plot
  • Gene List
  • Gene Set Information

23
GSEA Ranked List
  • Set of promoter sequences for every human gene.
  • 2000 bp upstream and 200 bp downstream of
    Transcription initiation site.
  • Score each promoter for likelihood of the motif.
  • Input this ranked list into GSEA.
  • Search for gene sets enriched in the ranked list.
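
The ranking step might be sketched as follows. The slide only says each promoter is scored "for likelihood of the motif", so the scoring rule here, the best log-odds window against a uniform background, is an assumption, as are the function names and the toy data layout (`promoters` maps gene name to promoter sequence).

```python
import math

def promoter_score(seq, pwm, bg=0.25):
    """Best log-odds score of any window in the promoter (assumed scoring rule).

    pwm is a list of per-position dicts mapping base -> probability.
    """
    w = len(pwm)
    best = -math.inf
    for j in range(len(seq) - w + 1):
        s = sum(math.log(pwm[p][seq[j + p]] / bg) for p in range(w))
        best = max(best, s)
    return best

def rank_genes(promoters, pwm):
    """Gene names sorted from highest to lowest motif score."""
    return sorted(promoters,
                  key=lambda g: promoter_score(promoters[g], pwm),
                  reverse=True)
```

The resulting ordered gene list is what would be handed to GSEA as the ranked list.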

24
Results
  • Human embryonic stem cell genes OCT4, NANOG,
    STELLAR, and GDF3 are expressed in both seminoma
    and breast carcinoma (Ezeh et al., 2006).
  • Breast cancer gene set found at p-value 0.008

25
Implementation Details
  • Young Lab error model for ChIP-chip data analysis
  • Motif-finding algorithm in MATLAB
  • Implemented Markov model
  • Implemented ZOOPS model
  • Integrated SVM Toolbox (by S. R. Gunn) with
    code
  • Used structural prior from MacIsaac et al., 2006
  • Used GSEA software for functional analysis.

26
Future Directions
  • Algorithm
  • Better use of classification error.
  • Maximize likelihood in bound sequences while minimizing
    likelihood in unbound sequences (multi-objective
    optimization using GAs)
  • Biological information: distance from
    transcription site, conservation
  • Integrating expression data
  • Cross-species Motif search and functional
    analysis, maybe using GO Terms
  • Scoring
  • Sequence length

27
Acknowledgments
  • Fraenkel Lab
  • Young Lab
  • Kenzie D. MacIsaac
  • Dr. David Gifford (CSAIL)
  • Dr. Richard Young (WIBR)
  • Dr. Tommi Jaakkola (CSAIL)