Random Projection Approach to Motif Finding

Slides: 21
Provided by: BenRa2

Transcript and Presenter's Notes

1
Random Projection Approach to Motif Finding
  • Adapted from
    http://genome.ucsd.edu/classes/be202/ppt/FindingSignals-RandomProjections.ppt

2
daf-19 Binding Sites in C. elegans (Peter Swoboda)
  • GTTGTCATGGTGAC
  • GTTTCCATGGAAAC
  • GCTACCATGGCAAC
  • GTTACCATAGTAAC
  • GTTTCCATGGTAAC
  • [Figure: genes che-2, daf-19, osm-1, osm-6, F02D8.3; position axis
    from -150 to -1]

3
Algorithmic Techniques
  • MEME (Expectation Maximization)
  • GibbsDNA (Gibbs Sampling)
  • CONSENSUS (greedy multiple alignment)
  • WINNOWER (Clique finding in graphs)
  • SP-STAR (Sum of pairs scoring)
  • MITRA (Mismatch trees to prune exhaustive search
    space)

4
The (l,d) Planted Motif Problem (Sagot 1998;
Pevzner & Sze 2000)
  • Generate a random consensus sequence C of length l.
  • Generate 20 instances, each differing from C by d
    random mutations.
  • Plant one instance at a random position in each of
    N = 20 random sequences of length n = 600.
  • Can you find the planted instances?
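
A problem instance like the one described above can be generated with a short script. This is only a sketch under assumptions the slide does not state: the function names are hypothetical, the default values of l and d are example values, each "mutation" is taken to be a substitution at a distinct position, and planting overwrites background letters so that every sequence keeps length n.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d positions substituted."""
    chars = list(consensus)
    for pos in rng.sample(range(len(chars)), d):
        # Pick a base different from the current one so the change is real.
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)

def planted_instance(l=15, d=4, N=20, n=600, seed=0):
    """Build N random sequences of length n, each hiding one mutated copy of C."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, starts = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        # Assumption: the instance overwrites the background at this position.
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        starts.append(start)
    return consensus, sequences, starts
```

Returning the start positions alongside the sequences makes it possible to check afterwards whether a motif finder recovered the planted instances.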

5
Planted Motifs
  • AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
  • ATGATAGCATCAACCTAACCCTAGATATGGGAT
  • TTTTGGGATATATCGCCCCTACACTGGATGACT
  • GGATATACATGAACACGGTGGGAAAACCCTGAC
  • Each instance differs from ACAGGATCA by 2
    mutations
  • Remaining sequence random

6
Random Projection Algorithm
  • Buhler and Tompa (2001)
  • Guiding principle: some instances of a motif
    agree on a subset of positions.
  • Use information from multiple motif instances to
    construct model.

7
k-Projections
  • Choose k positions in string of length l.
  • Concatenate nucleotides at chosen k positions to
    form k-tuple.
  • In l-dimensional Hamming space, this is a projection
    onto a k-dimensional subspace.

Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC projects to TGCTGAT
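
The slide's example can be reproduced in a few lines. The sketch below assumes positions are 1-based, as in the example; the function names are hypothetical.

```python
import random

def random_projection(l, k, rng=random):
    """Choose k distinct 1-based positions out of l, uniformly at random."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

def project(s, positions):
    """Concatenate the letters of s at the given 1-based positions."""
    return "".join(s[p - 1] for p in positions)

# The example from the slide: l = 15, k = 7.
print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # prints TGCTGAT
```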
8
Random Projection Algorithm
  • Choose a projection by selecting k positions
    uniformly at random.
  • For each l-tuple in input sequences, hash into
    bucket based on letters at k selected positions.
  • Recover motif from bucket containing multiple
    l-tuples.

Input sequence x(i): TCAATGCACCTAT...
9
Example
  • l = 7 (motif size), k = 4 (projection size)
  • Choose projection (1,2,5,7)

Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
The l-tuple GCCTTAC goes into bucket GCTC (its letters at positions
1, 2, 5, 7).
10
Hashing and Buckets
  • Hash function h(x) obtained from k positions of
    projection.
  • Buckets are labeled by values of h(x).
  • Enriched buckets contain at least s l-tuples,
    for some parameter s.
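
Hashing and bucket enrichment can be sketched with a dictionary keyed by h(x). The names below are hypothetical, and storing (sequence index, start offset) pairs rather than the l-tuples themselves is a design choice of this sketch, not something the slides specify.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Map each bucket label h(x) to the (sequence index, start) of its l-tuples."""
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            lmer = seq[j:j + l]
            # h(x): letters of the l-tuple at the projection's 1-based positions.
            key = "".join(lmer[p - 1] for p in positions)
            buckets[key].append((i, j))
    return buckets

def enriched(buckets, s):
    """Keep only buckets that contain at least s l-tuples."""
    return {key: hits for key, hits in buckets.items() if len(hits) >= s}

buckets = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], l=7, positions=(1, 2, 5, 7))
print((0, 14) in buckets["GCTC"])  # GCCTTAC starts at offset 14; prints True
```

Keeping offsets instead of strings lets a later refinement step look up the full l-tuple, or its flanking sequence, in the original input.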

11
Motif Refinement
  • How do we recover the motif from the sequences in
    the enriched buckets?
  • k nucleotides are known from hash value of
    bucket.
  • Use information in the other l-k positions as a
    starting point for a local refinement scheme, e.g.
    EM or a Gibbs sampler.

[Figure: local refinement algorithm producing candidate motif ATGCGTC]
12
Frequency Matrix Model from Bucket
[Figure: frequency matrix W → EM algorithm → refined matrix W*]
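
One way to form an initial frequency matrix W from the l-tuples in a bucket is to count letters column by column. The slides do not say how zero counts are handled; the pseudocount used here is an assumption of this sketch, added so that no letter has probability zero when the matrix is later used in a likelihood ratio.

```python
ALPHABET = "ACGT"

def frequency_matrix(lmers, pseudocount=1.0):
    """Column-wise letter frequencies of equal-length l-tuples.

    Returns a list with one dict per motif position, mapping each base
    to its (pseudocount-smoothed) frequency at that position.
    """
    l = len(lmers[0])
    W = []
    for pos in range(l):
        counts = {b: pseudocount for b in ALPHABET}
        for lmer in lmers:
            counts[lmer[pos]] += 1
        total = sum(counts.values())
        W.append({b: counts[b] / total for b in ALPHABET})
    return W

W = frequency_matrix(["GCCTTAC", "GCATTGC", "GCGTTCC"])
print(W[0]["G"])  # (3 + 1) / (3 + 4)
```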
13
Motif Finding as Global Optimization
  • Scoring function (Hamming distance, likelihood
    ratio, etc.)
  • Many existing algorithms (MEME, GibbsDNA) are
    good local optimization routines.
  • Random projection is a procedure for finding good
    starting points.

14
EM Motif Refinement
  • For each bucket h containing at least s
    sequences, form weight matrix W_h.
  • Use the EM algorithm with starting point W_h to obtain
    a refined weight matrix model W_h*.
  • For each input sequence x(i), return the l-tuple y(i)
    that maximizes the likelihood ratio
    Pr(y(i) | W_h*) / Pr(y(i) | P0).
  • T = {y(1), y(2), ..., y(N)}
  • C(T) = consensus string of T
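
The last three steps (pick the best l-tuple per sequence, collect T, take the consensus) can be sketched as follows; the EM step itself is not shown here. The weight matrix is assumed to be a list of per-position dicts of strictly positive probabilities, P0 is assumed to be a single per-base background distribution, and all function names are hypothetical.

```python
import math
from collections import Counter

def log_likelihood_ratio(lmer, W, background):
    """log of Pr(lmer | W) / Pr(lmer | P0), treating positions as independent."""
    return sum(math.log(W[i][b] / background[b]) for i, b in enumerate(lmer))

def best_lmer(seq, W, background):
    """The l-tuple y of seq that maximizes the likelihood ratio."""
    l = len(W)
    candidates = (seq[j:j + l] for j in range(len(seq) - l + 1))
    return max(candidates, key=lambda y: log_likelihood_ratio(y, W, background))

def consensus(lmers):
    """Most frequent letter in each column of the aligned l-tuples."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*lmers))

# Toy model that favours ACG, with a uniform background.
W = [{b: (0.7 if b == m else 0.1) for b in "ACGT"} for m in "ACG"]
P0 = {b: 0.25 for b in "ACGT"}
T = [best_lmer(x, W, P0) for x in ["TTACGTT", "GGACTCC", "CAACGAT"]]
print(T, consensus(T))
```

Working with the log of the ratio gives the same maximizer and avoids underflow for longer motifs.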

15
Expectation Maximization (EM)
  • S = {x(1), ..., x(N)}: set of input sequences
  • Given
  • W: an initial probabilistic motif model
  • P0: background probability distribution
  • Find the value Wmax that maximizes the likelihood ratio.
  • EM is a local optimization scheme; it requires a
    starting value W.
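
A single EM iteration might look like the sketch below. It is not taken from the slides: it assumes exactly one motif occurrence per sequence with a uniform prior over start positions (the model used by MEME's one-occurrence-per-sequence mode), a fixed per-base background P0, and a small pseudocount in the M-step. The function name is hypothetical.

```python
import math

ALPHABET = "ACGT"

def em_step(sequences, W, background, pseudocount=0.1):
    """One EM iteration; returns the updated weight matrix."""
    l = len(W)
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(l)]
    for seq in sequences:
        lmers = [seq[j:j + l] for j in range(len(seq) - l + 1)]
        # E-step: the posterior weight of each start position is proportional
        # to the likelihood ratio of the l-tuple that starts there.
        ratios = [math.prod(W[i][b] / background[b] for i, b in enumerate(y))
                  for y in lmers]
        total = sum(ratios)
        # M-step (accumulation): add each l-tuple's letters, weighted by
        # its posterior probability of being the motif occurrence.
        for y, r in zip(lmers, ratios):
            for i, b in enumerate(y):
                counts[i][b] += r / total
    # M-step (normalization): turn expected counts into probabilities.
    return [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]
```

Calling `em_step` repeatedly until W stops changing gives a local optimum that depends on the starting matrix, which is why a good starting point matters.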

16
A Single Iteration
  • Choose a random k-projection.
  • Hash each l-mer x in input sequence into bucket
    labelled by h(x).
  • From each bucket B with at least s sequences,
    form weight matrix model, and perform EM/Gibbs
    sampler refinement.
  • Candidate motif is the best one found from
    refinement of all enriched buckets.

17
What is the best motif?
  • Compute score S for each motif
  • Generate W, an initial PSSM from the returned
    l-mers y(1), y(2), ..., y(N)
  • Return motif with maximal score

18
Parameter Selection
  • Projection size k
  • Choose k small so several motif instances hash to the
    same bucket (k < l - d).
  • Choose k large to avoid contamination by spurious
    l-mers: E > N(n - l + 1) / 4^k, where N(n - l + 1) / 4^k
    is the average number of l-mers per bucket.
  • Bucket threshold s (s = 3, s = 4)
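
The two constraints on k can be checked numerically. In this sketch E is treated as a user-chosen bound on the average bucket load, with a default of 1; that reading of E, the default, and the function name are assumptions, as is the choice of l = 15 and d = 4 in the usage line.

```python
def choose_k(l, d, N, n, E=1.0):
    """Smallest k with N(n - l + 1) / 4^k < E, provided it also satisfies k < l - d."""
    total_lmers = N * (n - l + 1)
    k = 1
    while total_lmers / 4 ** k >= E:
        k += 1
    if k >= l - d:
        raise ValueError("no k satisfies both constraints for these parameters")
    return k

print(choose_k(l=15, d=4, N=20, n=600))  # prints 7
```

With N = 20 and n = 600 there are 20 × 586 = 11720 l-mers, so k = 7 is the first value for which 4^k (16384) exceeds that count, and 7 is still below l - d = 11.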

19
How Many Iterations?
  • Planted bucket: the bucket with hash value h(M),
    where M is the motif.
  • Choose m, the number of iterations, such that
    Pr(planted bucket contains at least s sequences in at
    least one of m iterations) >= 0.95.
  • The probability is readily computable since the
    iterations form a sequence of independent
    Bernoulli trials.
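
The calculation can be sketched as follows, under assumptions the slide leaves implicit: every planted instance has exactly d mutated positions, an instance lands in the planted bucket exactly when none of the k projected positions is mutated, and the N instances behave independently. The function name is hypothetical.

```python
from math import ceil, comb, log

def iterations_needed(l, d, k, N, s, q=0.95):
    """Smallest m such that the planted bucket is enriched in at least one
    of m independent iterations with probability at least q."""
    # Probability that a random k-projection avoids all d mutated positions.
    p_hit = comb(l - d, k) / comb(l, k)
    # Probability that fewer than s of the N instances reach the planted
    # bucket in one iteration (binomial tail).
    p_fail = sum(comb(N, i) * p_hit ** i * (1 - p_hit) ** (N - i)
                 for i in range(s))
    # m iterations all fail with probability p_fail ** m; require <= 1 - q.
    return ceil(log(1 - q) / log(p_fail))

print(iterations_needed(l=15, d=4, k=7, N=20, s=4))
```

Because the failure probability per iteration is raised to the power m, the required number of iterations grows quickly when p_fail is close to 1, for example when k is large or s is high.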

20
Examples
K = set of nucleotides in the motif instances. P = set of nucleotides
in the positions predicted by the algorithm.