Random Projection Approach to Motif Finding - PowerPoint PPT Presentation

Provided by: BenRa2

Transcript and Presenter's Notes

1
Random Projection Approach to Motif Finding

Adapted from http://genome.ucsd.edu/classes/be202/ppt/FindingSignals-RandomProjections.ppt

2
daf-19 Binding Sites in C. elegans (Peter Swoboda)

GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
Genes: che-2, daf-19, osm-1, osm-6, F02D8.3
Positions -150 to -1 (axis of the original figure)
3
Algorithmic Techniques

MEME (Expectation Maximization)
GibbsDNA (Gibbs Sampling)
CONSENSUS (greedy multiple alignment)
WINNOWER (Clique finding in graphs)
SP-STAR (Sum of pairs scoring)
MITRA (Mismatch trees to prune exhaustive search
space)

4
The (l,d) Planted Motif Problem (Sagot 1998,
Pevzner & Sze 2000)

Generate a random length l consensus sequence C.
Generate 20 instances, each differing from C by d
random mutations.
Plant one at a random position in each of N = 20
random sequences of length n = 600.
Can you find the planted instances?
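
A minimal sketch of how such a test instance could be generated. The function names, the default values l = 15 and d = 4, and the seed are illustrative assumptions, not part of the slides. "Mutation" is read here as a substitution at d distinct positions, and the instance overwrites part of the random sequence so that its length stays n.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d positions substituted."""
    chars = list(consensus)
    for pos in rng.sample(range(len(chars)), d):
        # Pick a different letter so the position really changes.
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)

def plant_motif(l=15, d=4, N=20, n=600, seed=0):
    """Generate an (l, d) planted-motif instance.

    Returns (consensus, sequences, positions); positions[i] is the
    0-based start of the planted instance in sequences[i].
    """
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions
```

The returned positions are the ground truth against which a motif finder's predictions can be compared.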

5
Planted Motifs

AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
ATGATAGCATCAACCTAACCCTAGATATGGGAT
TTTTGGGATATATCGCCCCTACACTGGATGACT
GGATATACATGAACACGGTGGGAAAACCCTGAC
Each instance differs from ACAGGATCA by 2
mutations
Remaining sequence random

6
Random Projection Algorithm

Buhler and Tompa (2001)
Guiding principle: some instances of a motif
agree on a subset of positions.
Use information from multiple motif instances to
construct model.

7
k-Projections

Choose k positions in string of length l.
Concatenate nucleotides at chosen k positions to
form k-tuple.
In l-dimensional Hamming space, this is a projection onto a
k-dimensional subspace.

k = 7, l = 15
P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC -> TGCTGAT
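
The projection in this example can be reproduced with a few lines of code. This is only a sketch: the names project and random_projection are made up for illustration, and positions are 1-based to match the slide.

```python
import random

def project(ltuple, positions):
    """Concatenate the letters of ltuple at the given 1-based positions."""
    return "".join(ltuple[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT
```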
8
Random Projection Algorithm

Choose a projection by selecting k positions
uniformly at random.
For each l-tuple in input sequences, hash into
bucket based on letters at k selected positions.
Recover motif from bucket containing multiple
l-tuples.

Input sequence x(i): TCAATGCACCTAT...
9
Example

l = 7 (motif size), k = 4 (projection size)
Choose projection (1,2,5,7)

Input sequence
...TAGACATCCGACTTGCCTTACTAC...
Bucket
The l-tuple GCCTTAC hashes to the bucket labeled GCTC.
10
Hashing and Buckets

Hash function h(x) obtained from k positions of
projection.
Buckets are labeled by values of h(x).
Enriched buckets contain at least s l-tuples,
for some parameter s.
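
The hashing step and the enrichment filter might look roughly like this. The function names and the (sequence index, offset) bookkeeping are assumptions made for the sketch; the bucket label is the string of letters at the projected positions (1-based).

```python
from collections import defaultdict

def hash_ltuples(sequences, l, positions):
    """Map each bucket label h(x) to the (sequence index, offset) of its l-tuples."""
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            x = seq[j:j + l]
            label = "".join(x[p - 1] for p in positions)
            buckets[label].append((i, j))
    return buckets

def enriched(buckets, s):
    """Keep only the buckets that contain at least s l-tuples."""
    return {label: members for label, members in buckets.items()
            if len(members) >= s}
```

Storing offsets rather than the l-tuples themselves keeps the table small and lets the refinement step look the strings up again when needed.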

11
Motif Refinement

How do we recover the motif from the sequences in
the enriched buckets?
k nucleotides are known from hash value of
bucket.
Use information in other l-k positions as
starting point for local refinement scheme, e.g.
EM or Gibbs sampler

Local refinement algorithm
ATGCGTC
Candidate motif
12
Frequency Matrix Model from Bucket
Frequency matrix W
EM algorithm
Refined matrix W
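
One plausible way to form the initial frequency matrix W from the l-tuples in a bucket is to count letters column by column. The pseudocount is an assumption added here so that no entry is zero; the slides do not say how W is smoothed.

```python
ALPHABET = "ACGT"

def frequency_matrix(ltuples, pseudocount=1.0):
    """Column-wise letter frequencies of equal-length l-tuples.

    Returns a list of l dicts; W[j][c] estimates Pr(letter c at position j).
    """
    l = len(ltuples[0])
    W = []
    for j in range(l):
        counts = {c: pseudocount for c in ALPHABET}
        for x in ltuples:
            counts[x[j]] += 1
        total = sum(counts.values())
        W.append({c: counts[c] / total for c in ALPHABET})
    return W
```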
13
Motif Finding as Global Optimization

Scoring function (Hamming distance, likelihood
ratio, etc.)
Many existing algorithms (MEME, GibbsDNA) are
good local optimization routines.
Random projection is a procedure for finding good
starting points.

14
EM Motif Refinement

For each bucket h containing at least s
sequences, form weight matrix Wh.
Use EM algorithm with starting point Wh to obtain
refined weight matrix model Wh*.
For each input sequence x(i), return the l-tuple y(i)
which maximizes the likelihood ratio
Pr(y(i) | Wh*) / Pr(y(i) | P0).
T = {y(1), y(2), ..., y(N)}
C(T) = consensus string of T
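
A sketch of the last two steps: picking the best-scoring l-tuple from each sequence and reading off a consensus. It assumes W is a list of per-position dicts with strictly positive entries and P0 is a dict of background probabilities. The helper names are invented for illustration, and ties are broken by first occurrence.

```python
import math

def log_likelihood_ratio(y, W, P0):
    """log of Pr(y | W) / Pr(y | P0) for an l-tuple y."""
    return sum(math.log(W[j][c] / P0[c]) for j, c in enumerate(y))

def best_ltuples(sequences, W, P0):
    """For each sequence, return the l-tuple with the highest likelihood ratio."""
    l = len(W)
    T = []
    for seq in sequences:
        candidates = [seq[j:j + l] for j in range(len(seq) - l + 1)]
        T.append(max(candidates, key=lambda y: log_likelihood_ratio(y, W, P0)))
    return T

def consensus(T):
    """Most frequent letter in each column of the l-tuples in T."""
    return "".join(
        max("ACGT", key=lambda c: sum(y[j] == c for y in T))
        for j in range(len(T[0]))
    )
```

Working with the logarithm of the ratio does not change which l-tuple wins, and it avoids underflow for long motifs.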

15
Expectation Maximization (EM)

Given:
S = {x(1), ..., x(N)}, a set of input sequences
W, an initial probabilistic motif model
P0, a background probability distribution
Find the value Wmax that maximizes the likelihood ratio.

EM is a local optimization scheme. It requires a
starting value W.
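
To make the iteration concrete, here is a sketch of a single EM update. It assumes exactly one motif occurrence per sequence with a uniform prior over start positions, and it adds a small pseudocount; both are simplifying assumptions for illustration, not a description of how MEME or the original slides implement EM.

```python
ALPHABET = "ACGT"

def em_step(sequences, W, P0, pseudocount=0.1):
    """One EM update of the motif model W (list of per-position dicts)."""
    l = len(W)
    counts = [{c: pseudocount for c in ALPHABET} for _ in range(l)]
    for seq in sequences:
        # E-step: weight of each start offset, proportional to the
        # likelihood ratio of the l-tuple starting there.
        weights = []
        for j in range(len(seq) - l + 1):
            ratio = 1.0
            for t in range(l):
                c = seq[j + t]
                ratio *= W[t][c] / P0[c]
            weights.append(ratio)
        total = sum(weights)
        # M-step accumulation: expected letter counts per motif column.
        for j, w in enumerate(weights):
            for t in range(l):
                counts[t][seq[j + t]] += w / total
    return [{c: col[c] / sum(col.values()) for c in ALPHABET} for col in counts]
```

Calling em_step repeatedly, starting from the matrix built from an enriched bucket, moves W toward a local optimum, which is why the quality of the starting point matters.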

16
A Single Iteration

Choose a random k-projection.
Hash each l-mer x in input sequence into bucket
labelled by h(x).
From each bucket B with at least s sequences,
form weight matrix model, and perform EM/Gibbs
sampler refinement.
Candidate motif is the best one found from
refinement of all enriched buckets.

17
What is the best motif?

Compute score S for each motif.
Generate W, an initial PSSM, from the returned
l-mers y(1), y(2), ..., y(N).
Return motif with maximal score.

18
Parameter Selection

Projection size k
Choose k small so several motif instances hash to
same bucket (k < l - d).
Choose k large to avoid contamination by spurious
l-mers: E > N (n - l + 1) / 4^k.
Bucket threshold s (s = 3 or s = 4)
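
The two constraints on k can be checked mechanically. The sketch below returns the smallest k below l - d for which the average number of l-mers per bucket, N (n - l + 1) / 4^k, falls under a chosen bound E. Taking the smallest such k and defaulting to E = 1 are assumptions made for illustration.

```python
def choose_k(l, d, N, n, E=1.0):
    """Smallest k < l - d with N * (n - l + 1) / 4**k < E, or None if none exists."""
    total_lmers = N * (n - l + 1)
    for k in range(1, l - d):
        if total_lmers / 4 ** k < E:
            return k
    return None

print(choose_k(l=15, d=4, N=20, n=600))  # 7
```

When the function returns None, no projection size satisfies both constraints for the given E, so one of them has to be relaxed.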

19
How Many Iterations?

Planted bucket: the bucket with hash value h(M),
where M is the motif.
Choose m, the number of iterations, such that
Pr(planted bucket contains at least s sequences in at
least one of m iterations) >= 0.95.
The probability is readily computable since
iterations form a sequence of independent
Bernoulli trials.
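
A sketch of that computation, under the assumption that each of the N motif instances carries exactly d mutations. An instance then lands in the planted bucket when none of the k projected positions is mutated, which happens with probability C(l - d, k) / C(l, k). The function name and the parameter q for the 0.95 target are illustrative.

```python
from math import ceil, comb, log

def iterations_needed(l, d, k, N, s, q=0.95):
    """Smallest m with Pr(planted bucket holds >= s instances in some trial) >= q."""
    if not (0 < k <= l - d and 0 < s <= N):
        raise ValueError("need 0 < k <= l - d and 0 < s <= N")
    # Probability that one instance hashes to the planted bucket.
    p = comb(l - d, k) / comb(l, k)
    # Probability that fewer than s of the N instances do so in one trial.
    fewer_than_s = sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(s))
    if fewer_than_s == 0:
        return 1
    # m independent trials all fail with probability fewer_than_s ** m.
    return ceil(log(1 - q) / log(fewer_than_s))
```

Raising s makes enriched buckets cleaner but increases the number of iterations needed, so the two parameters have to be balanced against each other.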

20
Examples
K = set of nucleotides in motif instances. P = set of nucleotides
in positions predicted by algorithm.
