Title: Random Projection Approach to Motif Finding
1Random Projection Approach to Motif Finding
- Adapted from http://genome.ucsd.edu/classes/be202/ppt/FindingSignals-RandomProjections.ppt
2daf-19 Binding Sites in C. elegans (Peter Swoboda)
- GTTGTCATGGTGAC
- GTTTCCATGGAAAC
- GCTACCATGGCAAC
- GTTACCATAGTAAC
- GTTTCCATGGTAAC
- che-2
- daf-19
- osm-1
- osm-6
- F02D8.3
- Figure axis: upstream positions -150 to -1
3Algorithmic Techniques
- MEME (Expectation Maximization)
- GibbsDNA (Gibbs Sampling)
- CONSENSUS (greedy multiple alignment)
- WINNOWER (Clique finding in graphs)
- SP-STAR (Sum of pairs scoring)
- MITRA (Mismatch trees to prune exhaustive search space)
4The (l,d) Planted Motif Problem (Sagot 1998, Pevzner & Sze 2000)
- Generate a random consensus sequence C of length l.
- Generate 20 instances, each differing from C by d random mutations.
- Plant one instance at a random position in each of N = 20 random sequences of length n = 600.
- Can you find the planted instances?
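The construction above can be sketched in Python. This is a minimal illustration, not the benchmark generator used by the cited authors: the function names, the uniform A/C/G/T background, and the choice to overwrite (rather than insert into) the background sequence are all assumptions.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d positions changed."""
    chars = list(consensus)
    for pos in rng.sample(range(len(chars)), d):
        # Pick a different letter so the position really is a mismatch.
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)

def plant_motif(l=15, d=4, N=20, n=600, seed=0):
    """Build one (l, d) planted-motif instance (hypothetical helper).

    Returns the consensus, the N sequences, and the 0-based planted starts.
    """
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        # Assumption: the instance overwrites l background letters,
        # so every sequence keeps length n.
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions
```

Fixing the seed makes a generated instance reproducible, which helps when comparing motif finders on the same input.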
5Planted Motifs
- AGTTATCGCGGCACAGGCTCCTTCTTTATAGCC
- ATGATAGCATCAACCTAACCCTAGATATGGGAT
- TTTTGGGATATATCGCCCCTACACTGGATGACT
- GGATATACATGAACACGGTGGGAAAACCCTGAC
- Each instance differs from ACAGGATCA by 2 mutations.
- The remaining sequence is random.
6Random Projection Algorithm
- Buhler and Tompa (2001)
- Guiding principle: some instances of a motif agree on a subset of positions.
- Use information from multiple motif instances to construct a model.
7k-Projections
- Choose k positions in a string of length l.
- Concatenate the nucleotides at the chosen k positions to form a k-tuple.
- In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.
Example: l = 15, k = 7
P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC -> TGCTGAT
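A k-projection is just letter selection at fixed positions. The sketch below reproduces the example on this slide; the function names are hypothetical, and positions are taken to be 1-based as on the slide.

```python
import random

def project(lmer, positions):
    """Concatenate the letters of lmer at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Pick k distinct positions from 1..l uniformly at random, sorted."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # TGCTGAT
```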
8Random Projection Algorithm
- Choose a projection by selecting k positions uniformly at random.
- For each l-tuple in the input sequences, hash it into a bucket based on the letters at the k selected positions.
- Recover the motif from a bucket containing multiple l-tuples.
Input sequence x(i): TCAATGCACCTAT...
9Example
- l = 7 (motif size), k = 4 (projection size)
- Choose projection (1, 2, 5, 7)
Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
Bucket contents: GCCTTAC
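The example can be checked by hashing every 7-mer of the visible part of the input sequence. This is an illustrative sketch with a hypothetical helper name; it treats the projection positions as 1-based, under which GCCTTAC hashes to the key GCTC.

```python
from collections import defaultdict

def hash_lmers(sequence, l, positions):
    """Group every l-mer of sequence by its letters at the 1-based positions."""
    buckets = defaultdict(list)
    for start in range(len(sequence) - l + 1):
        lmer = sequence[start:start + l]
        key = "".join(lmer[p - 1] for p in positions)
        buckets[key].append(lmer)
    return buckets

buckets = hash_lmers("TAGACATCCGACTTGCCTTACTAC", l=7, positions=(1, 2, 5, 7))
print(buckets["GCTC"])  # ['GCCTTAC']
```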
10Hashing and Buckets
- The hash function h(x) is obtained from the k positions of the projection.
- Buckets are labeled by values of h(x).
- Enriched buckets contain at least s l-tuples, for some parameter s.
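A minimal sketch of selecting enriched buckets across several input sequences follows. The function name and the (sequence index, start offset) bookkeeping are assumptions added for illustration; the threshold test matches the "at least s" definition above.

```python
from collections import defaultdict

def enriched_buckets(sequences, l, positions, s):
    """Return {hash value: [(sequence index, 0-based start), ...]} for
    every bucket that holds at least s l-mers."""
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            # h(x): letters of the l-mer at the 1-based projection positions.
            key = "".join(seq[start + p - 1] for p in positions)
            buckets[key].append((i, start))
    return {h: members for h, members in buckets.items() if len(members) >= s}

toy = ["AAGCCTTACAA", "TTGCATTGCTT", "CCGCGGTACGG"]
print(enriched_buckets(toy, l=7, positions=(1, 2, 5, 7), s=3))
```

Storing offsets rather than the l-mers themselves keeps track of where each candidate instance came from, which a later refinement step may need.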
11Motif Refinement
- How do we recover the motif from the sequences in the enriched buckets?
- k nucleotides are known from the hash value of the bucket.
- Use the information in the other l - k positions as the starting point for a local refinement scheme, e.g. EM or a Gibbs sampler.
Figure: local refinement algorithm yields candidate motif ATGCGTC
12Frequency Matrix Model from Bucket
Frequency matrix W -> EM algorithm -> refined matrix W*
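One way to form a frequency matrix from the l-mers in a bucket is to count letters column by column. The sketch below is an assumption about the details: the slides do not specify the matrix layout or any smoothing, so the list-of-dicts representation and the pseudocount are illustrative choices.

```python
def frequency_matrix(lmers, pseudocount=1.0):
    """Column-wise letter frequencies of equal-length l-mers.

    Returns one dict per motif position mapping A/C/G/T to a probability.
    The pseudocount (an assumption) keeps every entry nonzero, which
    likelihood-ratio computations need.
    """
    l = len(lmers[0])
    matrix = []
    for j in range(l):
        counts = {c: pseudocount for c in "ACGT"}
        for lmer in lmers:
            counts[lmer[j]] += 1
        total = sum(counts.values())
        matrix.append({c: counts[c] / total for c in "ACGT"})
    return matrix

W = frequency_matrix(["GCCTTAC", "GCATTGC", "GCGGTAC"])
```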
13Motif Finding as Global Optimization
- Scoring function (Hamming distance, likelihood ratio, etc.)
- Many existing algorithms (MEME, GibbsDNA) are good local optimization routines.
- Random projection is a procedure for finding good starting points.
14EM Motif Refinement
- For each bucket h containing at least s sequences, form the weight matrix Wh.
- Use the EM algorithm with starting point Wh to obtain the refined weight matrix model Wh*.
- For each input sequence x(i), return the l-tuple y(i) that maximizes the likelihood ratio Pr(y(i) | Wh*) / Pr(y(i) | P0).
- T = {y(1), y(2), ..., y(N)}
- C(T) = consensus string of T
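The last three steps can be sketched as follows. The representation is an assumption: the weight matrix is a list of per-position dicts with strictly positive entries, P0 is a dict of background letter probabilities, and the consensus is taken as the most common letter in each column (ties broken by first occurrence). All names are hypothetical.

```python
import math
from collections import Counter

def log_ratio(lmer, W, P0):
    """log of Pr(lmer | W) / Pr(lmer | P0); W entries must be nonzero."""
    return sum(math.log(W[j][c] / P0[c]) for j, c in enumerate(lmer))

def best_lmers(sequences, W, P0):
    """For each sequence, the l-mer with the highest likelihood ratio."""
    l = len(W)
    return [max((seq[i:i + l] for i in range(len(seq) - l + 1)),
                key=lambda y: log_ratio(y, W, P0))
            for seq in sequences]

def consensus(lmers):
    """Most common letter in each column of the aligned l-mers."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*lmers))

# Toy model that favors A, then C, then G.
W = [{c: (0.7 if c == top else 0.1) for c in "ACGT"} for top in "ACG"]
P0 = {c: 0.25 for c in "ACGT"}
T = best_lmers(["TTACGTT", "GGACGGG", "ACTTTTT"], W, P0)
print(T, consensus(T))  # ['ACG', 'ACG', 'ACT'] ACG
```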
15Expectation Maximization (EM)
- S = {x(1), ..., x(N)}: set of input sequences
- Given:
- W: an initial probabilistic motif model
- P0: background probability distribution
- Find the value Wmax that maximizes the likelihood ratio.
- EM is a local optimization scheme; it requires a starting value W.
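To make the role of the starting value concrete, here is a minimal EM sketch. It is not MEME or the refinement code of the cited authors. It assumes exactly one motif occurrence per sequence, a fixed background P0, a strictly positive starting matrix W (list of per-position dicts), and a small pseudocount; the function name and parameters are hypothetical.

```python
import math

def em_refine(sequences, W, P0, iterations=20, pseudocount=0.1):
    """Refine weight matrix W by EM under a one-occurrence-per-sequence model."""
    l = len(W)
    for _ in range(iterations):
        counts = [{c: pseudocount for c in "ACGT"} for _ in range(l)]
        for seq in sequences:
            # E-step: the posterior weight of each start position is
            # proportional to Pr(y | W) / Pr(y | P0) for the l-mer y there.
            ratios = [
                math.prod(W[j][c] / P0[c]
                          for j, c in enumerate(seq[i:i + l]))
                for i in range(len(seq) - l + 1)
            ]
            total = sum(ratios)
            # Accumulate expected letter counts for each motif column.
            for i, r in enumerate(ratios):
                for j in range(l):
                    counts[j][seq[i + j]] += r / total
        # M-step: renormalize the expected counts into probabilities.
        W = [{c: col[c] / sum(col.values()) for c in "ACGT"} for col in counts]
    return W

W0 = [{c: (0.7 if c == top else 0.1) for c in "ACGT"} for top in "ACG"]
P0 = {c: 0.25 for c in "ACGT"}
refined = em_refine(["TTACGTT", "GGACGGG", "CCACGCC", "TAACGAT"], W0, P0)
```

Because EM only climbs to a nearby local optimum, a different starting W can converge to a different matrix, which is why a good starting point matters.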
16A Single Iteration
- Choose a random k-projection.
- Hash each l-mer x in the input sequences into the bucket labelled by h(x).
- From each bucket B with at least s sequences, form a weight matrix model and perform EM/Gibbs sampler refinement.
- The candidate motif is the best one found from refinement of all enriched buckets.
17What is the best motif?
- Compute a score S for each motif.
- Generate W, an initial PSSM, from the returned l-mers y(1), y(2), ..., y(N).
- Return the motif with maximal score.
18Parameter Selection
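- Projection size k
- Choose k small so that several motif instances hash to the same bucket (k < l - d).
- Choose k large to avoid contamination by spurious l-mers: E > N(n - l + 1) / 4^k, i.e. the average number of l-mers per bucket, N(n - l + 1) / 4^k, stays below the bound E.
- Bucket threshold s (s = 3, s = 4)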
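The two constraints on k can be checked numerically. The sketch below uses hypothetical function names and assumes E = 1 as the bound on the average bucket load; the problem sizes in the usage line (N = 20, n = 600, l = 15, d = 4) are example values, not ones prescribed by this slide.

```python
def expected_bucket_load(N, n, l, k):
    """Average number of l-mers per bucket: N(n - l + 1) / 4^k."""
    return N * (n - l + 1) / 4 ** k

def candidate_k(N, n, l, d, E=1.0):
    """All k with k < l - d whose average bucket load is below E."""
    return [k for k in range(1, l - d)
            if expected_bucket_load(N, n, l, k) < E]

print(candidate_k(N=20, n=600, l=15, d=4))  # [7, 8, 9, 10]
```

With these values there are 20 * 586 = 11720 l-mers in total, so k = 7 is the smallest projection size for which 4^k exceeds that count.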
19How Many Iterations?
- Planted bucket: the bucket with hash value h(M), where M is the motif.
- Choose m, the number of iterations, such that Pr(planted bucket contains at least s sequences in at least one of m iterations) ≥ 0.95.
- The probability is readily computable, since the iterations form a sequence of independent Bernoulli trials.
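One way to carry out this computation is sketched below, under simplifying assumptions that are not stated on the slide: every instance has exactly d mutated positions, an instance reaches the planted bucket only if all k projected positions avoid its mutations, and background l-mers are ignored. The function name is hypothetical.

```python
import math

def iterations_needed(l, d, k, s, N, q=0.95):
    """Smallest m with Pr(planted bucket is enriched in >= 1 of m trials) >= q."""
    # Chance that one instance hashes to the planted bucket: the k sampled
    # positions must all fall among the l - d unmutated positions.
    p = math.comb(l - d, k) / math.comb(l, k)
    # Chance that fewer than s of the N instances land there in one trial.
    miss = sum(math.comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(s))
    if miss >= 1.0:
        raise ValueError("planted bucket can never be enriched (is k > l - d?)")
    if miss <= 0.0:
        return 1
    # Independent trials: all m fail with probability miss^m <= 1 - q.
    return math.ceil(math.log(1 - q) / math.log(miss))

m = iterations_needed(l=15, d=4, k=7, s=4, N=20)
```

Raising s or k makes each trial less likely to succeed, so the required m grows; this is the trade-off behind the parameter choices on the previous slide.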
20Examples
K = set of nucleotides in the motif instances. P = set of nucleotides in the positions predicted by the algorithm.