Title: Motif Finding
1Motif Finding
- Yueyi Irene Liu
- CS374 Lecture
- Oct. 17, 2002
2Outline
- Background biology
- Motif-finding methods
- Word enumeration
- Gibbs sampling
- Random projection
- Phylogenetic footprinting
- Reducer
4Regulation of Gene Expression
- Chromatin structure
- Transcription initiation
- Transcript processing and modification
- RNA transport
- Transcript stability
- Translation initiation
- Post-Translational Modification
- Protein Transport
- Control of Protein Stability
5Typical Structure of a Eukaryotic mRNA Gene
6Control of Transcription Initiation
7Motif
- A conserved pattern that is found in two or more sequences
- Can be found in
- DNA (e.g., transcription factor binding sites)
- Protein
- RNA
8Models for Representing Motifs
- Regular expression
- Consensus
- TGACGCA
- Degenerate
- WGACRCA
- Position Specific Matrix
TGACGCA TGACGCA AGACGCA TGACACA AGACGCA
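A minimal sketch of how the five example sites above could be turned into a position-specific matrix. The function names are illustrative (not from the lecture), and the matrix here holds plain column frequencies without pseudocounts.

```python
from collections import Counter

SITES = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]
BASES = "ACGT"

def position_frequency_matrix(sites):
    """Return one {base: frequency} dict per motif position."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):  # iterate over aligned columns
        counts = Counter(column)
        matrix.append({b: counts[b] / len(sites) for b in BASES})
    return matrix

def consensus(matrix):
    """Most frequent base at each position."""
    return "".join(max(col, key=col.get) for col in matrix)

pfm = position_frequency_matrix(SITES)
for i, col in enumerate(pfm, start=1):
    print(i, col)
print(consensus(pfm))
```

Positions 1 and 5 are the only columns with more than one base, which is what the degenerate string WGACRCA (W = A or T, R = A or G) summarizes.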
9Where to look for motifs?
- Gene families: sets of genes controlled by a common transcription factor or a common environmental stimulus
- How do you construct gene families?
- Microarray experiments
10Microarrays
11Motif-finding Methods
- Goal: look for motifs (5-15 bp) in the data set
- Methods
- Word enumeration method
- Gibbs sampling
- Random projection
- Phylogenetic footprinting
- Reducer
12Word Enumeration
- For every word w, calculate
- Expected frequency based on the entire upstream region of the yeast genome
- E.g., $P(\mathrm{ATTGA}) = (0.4)^4 (0.1)^1$, given $P(A) = P(T) = 0.4$, $P(G) = P(C) = 0.1$
- Expected number of occurrences of ATTGA: $E = n \, P(\mathrm{ATTGA})$
- Observed frequency in the data set
- Statistical significance of enrichment
- $Z = (O - E) / \sqrt{n p (1 - p)} \sim N(0, 1)$
- Disadvantage: only considers exact words
- E.g., YCTGCA covers both TCTGCA and CCTGCA
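The enrichment statistic can be sketched as below. This is an illustration under assumptions: the word probability comes from independent base frequencies (as in the ATTGA example) rather than from counts over the yeast upstream regions, and n is taken to be the number of positions at which the word could start.

```python
import math

def word_probability(word, base_probs):
    """Probability of a word under independent base frequencies."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p

def count_occurrences(word, sequences):
    """Count exact (possibly overlapping) occurrences of word."""
    total = 0
    for seq in sequences:
        for i in range(len(seq) - len(word) + 1):
            if seq[i:i + len(word)] == word:
                total += 1
    return total

def z_score(word, sequences, base_probs):
    """Z = (O - E) / sqrt(n p (1 - p)), with n = number of start positions."""
    n = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    p = word_probability(word, base_probs)
    observed = count_occurrences(word, sequences)
    expected = n * p
    return (observed - expected) / math.sqrt(n * p * (1 - p))

probs = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}
print(word_probability("ATTGA", probs))
print(z_score("ATTGA", ["ATTGAATTGA", "CCATTGACC"], probs))
```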
13Gibbs Sampling
- Matrix to capture a motif
- Goal: find the best positions a1, ..., ak to maximize the difference between motif and background base distribution.
[Figure: positions a1, a2, a3, a4, ..., ak]
Liu, X
14Gibbs Sampling (Lawrence, et al, 1993)
- Step 1: Pick random start positions, compute current motif matrix
- Step 2: Iterative update
- Take one sequence out, update motif matrix
- Calculate fitness score of each position of the left-out sequence
- Pick start position in the left-out sequence based on weight Ax
- Take out another sequence, and repeat until convergence
- Step 3: Reset starting position
Liu, X
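Steps 1 and 2 can be sketched as a small site sampler. This is an illustration rather than the published implementation: it assumes exactly one site per sequence, adds a pseudocount of 1 to every matrix cell, runs a fixed number of sweeps instead of testing for convergence, and omits Step 3. All function names are hypothetical.

```python
import random

BASES = "ACGT"

def build_matrix(sites, pseudocount=1.0):
    """Column-wise base probabilities of the aligned sites, with pseudocounts."""
    matrix = []
    for j in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        matrix.append({b: c / total for b, c in counts.items()})
    return matrix

def fitness(x, matrix, background):
    """A_x = Q_x / P_x for a candidate subsequence x."""
    q = p = 1.0
    for j, base in enumerate(x):
        q *= matrix[j][base]
        p *= background[base]
    return q / p

def gibbs_sample(seqs, width, background, iterations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: pick a random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Step 2: leave sequence i out and rebuild the motif matrix.
            sites = [s[a:a + width]
                     for j, (s, a) in enumerate(zip(seqs, starts)) if j != i]
            matrix = build_matrix(sites)
            # Score every candidate start in the left-out sequence.
            weights = [fitness(seq[a:a + width], matrix, background)
                       for a in range(len(seq) - width + 1)]
            # Sample the new start position in proportion to its weight.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Sampling the new position, rather than always taking the highest-scoring one, is what lets the procedure move away from a poor initial alignment.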
15Gibbs Sampling Initialization: Pick random start position, compute motif matrix
Liu, X
16Gibbs Sampling Iteration Steps: 1) Take out one sequence, calculate the fitness score of every subsequence relative to the current motif
[Figure: positions a1', a2', a3', a4', ..., ak']
Liu, X
17Fitness Score
Current Motif
- Ax = Qx / Px
- Qx: probability of generating subsequence x from current motif
- Px: probability of generating subsequence x from background
Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
x = GGA: Qx = ? Px = ?
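A worked version of the GGA exercise, as a sketch. The background frequencies are the ones given above, so Px = 0.1 x 0.1 x 0.4 = 0.004. The motif matrix on the slide is not in the transcript, so the three-column matrix below is a made-up stand-in; Qx and Ax depend entirely on it.

```python
BACKGROUND = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

# Hypothetical motif matrix (one dict per position); NOT the one on the slide.
MOTIF = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.2, "C": 0.1, "G": 0.6, "T": 0.1},
    {"A": 0.8, "C": 0.05, "G": 0.05, "T": 0.1},
]

def motif_probability(x, motif):
    """Q_x: product of the per-position motif probabilities."""
    q = 1.0
    for column, base in zip(motif, x):
        q *= column[base]
    return q

def background_probability(x, background):
    """P_x: product of the background base probabilities."""
    p = 1.0
    for base in x:
        p *= background[base]
    return p

x = "GGA"
q_x = motif_probability(x, MOTIF)
p_x = background_probability(x, BACKGROUND)
print(q_x, p_x, q_x / p_x)
```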
18Gibbs Sampling Iteration Steps: 2) Pick new start position by sampling from fitness score
[Figure: positions a1'', a2', a3', a4', ..., ak']
Liu, X
19Recent Development
- Random Projection
- Phylogenetic Footprinting
- Reducer
20Random Projection (Buhler, 2002)
- (l, d)-motif problem
- M is an (unknown) motif of length l
- Each occurrence of M is corrupted by exactly d point substitutions in random positions
- No known biological motifs are of the (l, d)-motif form
CCcaAG CCcgAG CCgcAG CCtaAG CCtgAG
CtATgG CCctAc tCtTAG CaAcAG CCAgAa
21Random Projection Algorithm
- Guiding principle: some instances of a motif agree on a subset of positions.
- Use information from multiple motif instances to construct the model.
Buhler, J
22k-Projections
- Choose k positions in a string of length l.
- Concatenate nucleotides at the chosen k positions to form a k-tuple.
- In l-dimensional Hamming space, this is a projection onto a k-dimensional subspace.
Example: l = 15, k = 7, P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC -> TGCTGAT
Buhler, J
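A small sketch of a k-projection, using the example string and the positions P = (2, 4, 5, 7, 11, 12, 13) from this slide. Positions are 1-based as on the slide; the function names are illustrative.

```python
import random

def project(ltuple, positions):
    """Concatenate the letters of ltuple at the given 1-based positions."""
    return "".join(ltuple[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random, in increasing order."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))  # TGCTGAT
print(random_projection(15, 7, random.Random(0)))
```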
23Random Projection Algorithm
- Choose a projection by selecting k positions uniformly at random.
- For each l-tuple in the input sequences, hash into a bucket based on the letters at the k selected positions.
- Recover the motif from a bucket containing multiple l-tuples.
Input sequence x(i): TCAATGCACCTAT...
Buhler, J
24Example
- l = 7 (motif size), k = 4 (projection size)
- Choose projection (1, 2, 5, 7)
Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
Buckets: the l-tuple GCCTTAC hashes to bucket GCTC
Buhler, J
25Hashing and Buckets
- Hash function h(x) is obtained from the k positions of the projection.
- Buckets are labeled by values of h(x).
- Enriched buckets contain more than s l-tuples, for some parameter s.
Buhler, J
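The hashing step might look like the sketch below. It assumes the bucket label h(x) is simply the projected k-tuple and applies the "more than s" rule stated above; the function names are illustrative.

```python
from collections import defaultdict

def hash_ltuples(sequences, l, positions):
    """Group every l-tuple by the letters at the 1-based projection positions."""
    buckets = defaultdict(list)
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            ltuple = seq[i:i + l]
            key = "".join(ltuple[p - 1] for p in positions)  # h(x)
            buckets[key].append(ltuple)
    return buckets

def enriched_buckets(buckets, s):
    """Keep only buckets holding more than s l-tuples."""
    return {key: tuples for key, tuples in buckets.items() if len(tuples) > s}

buckets = hash_ltuples(["TAGACATCCGACTTGCCTTACTAC"], 7, (1, 2, 5, 7))
print(buckets["GCTC"])
```

With the input fragment from the example slide, the l-tuple GCCTTAC lands in the bucket labeled GCTC.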
26Motif Refinement
- How do we recover the motif from the sequences in the enriched buckets?
- k nucleotides are known from the hash value of the bucket.
- Use information in the other l - k positions as the starting point for a local refinement scheme, e.g., EM or Gibbs sampler
[Figure: local refinement algorithm; candidate motif ATGCGTC]
Buhler, J
27Parameter Selection
- Projection size k
- Choose k small so several motif instances hash to the same bucket ($k < l - d$)
- Choose k large to avoid contamination by spurious l-mers ($4^k > t(n - l + 1)$)
- Bucket threshold s (s = 3, s = 4)
Buhler, J
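The two constraints on k can be checked mechanically. The sketch below assumes t is the number of input sequences and n their common length (the slide does not define them), and the values l = 15, d = 4, t = 20, n = 600 are illustrative inputs, not taken from the slide.

```python
def valid_projection_sizes(l, d, t, n):
    """Return every k with k < l - d and 4**k > t * (n - l + 1)."""
    return [k for k in range(1, l + 1)
            if k < l - d and 4 ** k > t * (n - l + 1)]

# 20 * (600 - 15 + 1) = 11720 l-mers; 4**6 = 4096 is too few buckets, 4**7 = 16384 is enough.
print(valid_projection_sizes(l=15, d=4, t=20, n=600))  # [7, 8, 9, 10]
```

Within the valid range, a smaller k keeps more motif instances together in one bucket, which is the first of the two considerations above.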
28Recent Development
- Random Projection
- Phylogenetic Footprinting
- Reducer
29Conservation of Regulatory Elements Upstream of the ApoAI Gene
[Figure labels: TATA box]
31Substring Parsimony Problem
- Given
- orthologous upstream sequences S1, ..., Sn
- phylogenetic tree T of the n species
- size k of the motif, threshold d
- Problem
- Find all sets of substrings s1, ..., sn of S1, ..., Sn, each of size k, such that the parsimony score of s1, ..., sn on T is at most d
Blanchette, M
32Parsimony Score
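[Figure: tree T with node labels s1, s2, s34, s3, s4, s5, s6]
- Parsimony score: minimum, over all possible labelings of internal nodes, of $\sum_{(u,v) \in E(T)} d(l(u), l(v))$
- l(v): label of node v
- d(l1, l2): Hamming distance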
Blanchette, M
33String Parsimony Problem
S1 = AAAGCATTC, S2 = TACGCACCC, S3 = GAAGCAGGG
k = 5, d = 1
[Figure: tree with leaves S1, S2, S3]
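The example can be solved by brute force, as sketched below. The tree drawn on the slide is not in the transcript, so this assumes a star tree (three leaves joined at a single internal node). On a star tree with Hamming distance, the best internal label takes the majority base in each column, so the score is the number of leaf letters that disagree with that majority.

```python
from itertools import product

def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def star_parsimony(strings):
    """Parsimony score of equal-length strings at the leaves of a star tree."""
    score = 0
    for column in zip(*strings):
        # Leaves that differ from the most common base each cost one substitution.
        score += len(column) - max(column.count(b) for b in set(column))
    return score

def substring_parsimony_star(seqs, k, d):
    """All choices of one k-mer per sequence with parsimony score at most d."""
    return [combo for combo in product(*(kmers(s, k) for s in seqs))
            if star_parsimony(combo) <= d]

hits = substring_parsimony_star(["AAAGCATTC", "TACGCACCC", "GAAGCAGGG"], 5, 1)
print(hits)  # [('AAGCA', 'ACGCA', 'AAGCA')]
```

Enumerating every combination is only feasible for tiny inputs, which is the motivation for the table-based algorithms on the following slides.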
34Algorithm version I
- Root the tree at an arbitrary internal node r
- Compute table $W_u$ of size $4^k$ for each node u, where $W_u[s]$ = best parsimony score for the subtree rooted at u when u is labeled with s
- Direct implementation of this recursion gives $O(nk(4^{2k} + l))$, where l = average sequence length
Blanchette, M
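The recursion itself is not in the transcript. The sketch below assumes the usual form: at a leaf, W_u[s] is 0 if s occurs in the sequence and infinite otherwise; at an internal node, W_u[s] is the sum over children v of min over t of (W_v[t] + d(s, t)). The nested-list tree encoding is an illustrative choice, and every leaf sequence is assumed to have length at least k.

```python
import math
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def parsimony_table(node, k, alphabet="ACGT"):
    """W_u for a node given as a leaf sequence (str) or a list of child nodes."""
    words = ["".join(w) for w in product(alphabet, repeat=k)]
    if isinstance(node, str):
        # Leaf: only k-mers that occur in the sequence are allowed labels.
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: 0 if s in present else math.inf for s in words}
    table = dict.fromkeys(words, 0)
    for child in node:
        child_table = parsimony_table(child, k, alphabet)
        # Skip infinite entries; they can never give the minimum.
        finite = [(t, c) for t, c in child_table.items() if c < math.inf]
        for s in words:
            table[s] += min(c + hamming(s, t) for t, c in finite)
    return table

# Star tree over the three example sequences, rooted at its internal node.
root = ["AAAGCATTC", "TACGCACCC", "GAAGCAGGG"]
W = parsimony_table(root, 5)
print(min(W.values()), W["AAGCA"])
```

The double loop over all label pairs at internal nodes is the source of the 4^(2k) term in the running time quoted above.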
35Algorithm version II
- Define $X(u, v)[s]$ = best parsimony score for the subtree consisting of edge (u, v) and the subtree rooted at v, when u is labeled s
[Figure: nodes u (labeled s), w, and v]
Blanchette, M
36Algorithm version II (continued)
- Update X(u, v) in phases: in phase p, maintain the set $B_p$ of sequences t such that $X(u, v)[t] = p$
- Define
- $R_a = \{ s : W_v[s] = a \}$
- $N(s) = \{ t \in \Sigma^k : d(s, t) = 1 \}$
- Start in phase m and let $B_m = R_m$
- Update
- Computation of X(u, v) takes $O(k 4^k)$
Blanchette, M
37Improvements
- Reduce the size of $B_p$ when sequences contribute to X(u, v) greater than threshold d
- In phase p, only care for sequence X(u, v)[s] if
- Leads to significant reductions in stages d/2 to d
- Reduce the number of substrings inserted in W at the leaves
- For substring s of Si, if its best match against any Sj has Hamming distance at least d, s can be discarded
Blanchette, M
38Results
- Practical limit on k: 10
- There appeared to be a threshold d0 with very few solutions below and many above
- Algorithm found 80 known binding sites
- Performed better than ClustalW, MEME, Consensus
Blanchette, M
39Recent Development
- Random Projection
- Phylogenetic Footprinting
- Reducer
40Reducer (Bussemaker, et al 2001)
- Links motif finding to expression level
- $A_g = C + \sum_{u=1}^{M} F_u N_{ug}$
- $A_g$: gene expression level (logarithm of expression ratio)
- M: number of significant motifs
- $N_{ug}$: number of occurrences of motif u in gene g
- C: baseline expression level (same for all genes)
- $F_u$: increase/decrease of expression level caused by presence of motif u
41Reducer (Contd)
Liu, X
42Reducer (Contd)
- Normalize expression (A) and motif (n) vectors
- Linear regression between the A vector and every n vector to find the best-fit n to A
- Step-wise regression to combine effects of motifs
- Subtract the effect of one motif
- Find the next best motif
Liu, X
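The regression steps above can be sketched in plain Python. This is an illustration, not the published procedure: it fits one motif at a time by ordinary least squares (the intercept does the centering), picks the motif with the smallest residual sum of squares, and omits any significance test. The function names and the toy data are hypothetical.

```python
def fit_single_motif(a, n):
    """Least-squares fit of a[g] = C + F * n[g]; returns (C, F, residual SS)."""
    genes = len(a)
    mean_a = sum(a) / genes
    mean_n = sum(n) / genes
    var_n = sum((x - mean_n) ** 2 for x in n)
    if var_n == 0:  # motif count is constant across genes: nothing to fit
        return mean_a, 0.0, sum((y - mean_a) ** 2 for y in a)
    f = sum((x - mean_n) * (y - mean_a) for x, y in zip(n, a)) / var_n
    c = mean_a - f * mean_n
    rss = sum((y - c - f * x) ** 2 for x, y in zip(n, a))
    return c, f, rss

def stepwise_motifs(a, counts, max_motifs):
    """counts maps motif -> per-gene occurrence counts; returns [(motif, F), ...]."""
    residual = list(a)
    remaining = dict(counts)
    chosen = []
    for _ in range(min(max_motifs, len(remaining))):
        # Find the motif that best explains what is left of the expression vector.
        best = min(remaining,
                   key=lambda m: fit_single_motif(residual, remaining[m])[2])
        n = remaining.pop(best)
        c, f, _ = fit_single_motif(residual, n)
        chosen.append((best, f))
        # Subtract the effect of this motif before looking for the next one.
        residual = [y - c - f * x for x, y in zip(n, residual)]
    return chosen

counts = {"AAA": [0, 1, 2, 0, 1], "CCC": [1, 0, 0, 2, 1]}
expression = [-0.5, 2.5, 4.5, -1.5, 1.5]  # toy data: 0.5 + 2*AAA - 1*CCC
print(stepwise_motifs(expression, counts, max_motifs=2))
```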
43Acknowledgement
- People from whom I borrowed slides
- Xiaole Liu (Reducer)
- Olga Troyanskaya (Microarray)
- Jeremy Buhler (Random projections)
- Mathieu Blanchette (Phylogenetic footprinting)
- Various web sources
45[Figure: microarray experiment workflow]
Labels: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); mRNA (target); hybridise target to microarray; excitation (laser 1, laser 2); emission; scanning; overlay images and normalise; analysis
46Information Content of Motifs
- Uncertainty
- Information = $H_{\text{before}} - H_{\text{after}}$
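The uncertainty formula is not in the transcript. The sketch below assumes it is Shannon entropy in bits, H = -sum of p log2 p, with a uniform base distribution before a site is observed (H = 2 bits for DNA); both are assumptions, as are the function names.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def column_information(column_probs, background=(0.25, 0.25, 0.25, 0.25)):
    """Information = H_before - H_after for one motif position."""
    return entropy(background) - entropy(column_probs)

print(column_information([1.0, 0.0, 0.0, 0.0]))      # fully conserved position
print(column_information([0.5, 0.5, 0.0, 0.0]))      # two equally likely bases
print(column_information([0.25, 0.25, 0.25, 0.25]))  # no preference
```

Summing this quantity over all positions gives the information content of the whole motif; the per-position value is the column height in a sequence logo.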
47Improvement on Original Gibbs sampler
- 0 to n copies of sites in each sequence
- Iterative masking to find multiple motifs
- Use higher-order Markov models to improve motif specificity
48Clinical Importance of Defects in Regulatory Elements
Burkitt's lymphoma
49Statistical Methods
- Expectation Maximization (EM)
- MEME
- Gibbs sampling
- BioProspector
- AlignACE
50Motifs are not limited to DNA
- RNA motifs
- RNA-RNA interaction motifs, e.g., intron-exon splice sites
- RNA-protein interaction motifs, e.g., binding of proteins to the RNA poly(A) tail
- Protein motifs
- E.g., Helix-turn-helix motif
51Sequence Logo
52Why is this Problem Hard?
- Motif information content low
- Hamming distance between each motif instance high