Motif Finding - PowerPoint PPT Presentation

About This Presentation
Title:

Motif Finding

Description:

Calculate fitness score of each position of the left-out sequence ... Fitness Score. Ax = Qx / Px. Qx: probability of generating subsequence x from current motif ...

Slides: 53
Provided by: iliu

Transcript and Presenter's Notes


1
Motif Finding
  • Yueyi Irene Liu
  • CS374 Lecture
  • Oct. 17, 2002

2
Outline
  • Background biology
  • Motif-finding methods
  • Word enumeration
  • Gibbs sampling
  • Random projection
  • Phylogenetic footprinting
  • Reducer

3
(No Transcript)
4
Regulation of Gene Expression
  • Chromatin structure
  • Transcription initiation
  • Transcript processing and modification
  • RNA transport
  • Transcript stability
  • Translation initiation
  • Post-Translational Modification
  • Protein Transport
  • Control of Protein Stability

5
Typical Structure of a Eukaryotic mRNA Gene
6
Control of Transcription Initiation
7
Motif
  • A conserved pattern that is found in two or more
    sequences
  • Can be found in
  • DNA (e.g., transcription factor binding sites)
  • Protein
  • RNA

8
Models for Representing Motifs
  • Regular expression
  • Consensus: TGACGCA
  • Degenerate: WGACRCA
  • Position Specific Matrix

TGACGCA TGACGCA AGACGCA TGACACA AGACGCA
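
A minimal sketch of how the five example sites above could be turned into a position-specific matrix. The function names and the `pseudocount` parameter are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# The five example sites listed on the slide.
SITES = ["TGACGCA", "TGACGCA", "AGACGCA", "TGACACA", "AGACGCA"]

def position_frequency_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Return one {base: frequency} dict per motif column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return matrix

def consensus(matrix):
    """Most frequent base in each column."""
    return "".join(max(column, key=column.get) for column in matrix)

pfm = position_frequency_matrix(SITES)
print(consensus(pfm))   # most frequent base per column
print(pfm[0])           # column 1 mixes A and T (the W in WGACRCA)
```

Unlike the consensus or degenerate string, the matrix keeps the relative frequency of each base, so column 1 records that T is seen more often than A.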
9
Where to look for motifs?
  • Gene families: sets of genes controlled by a
    common transcription factor or common
    environmental stimulus
  • How do you construct gene families?
  • Microarray experiments

10
Microarrays
11
Motif-finding Methods
  • Goal: look for motifs (5-15 bp) in the data set
  • Methods
  • Word enumeration method
  • Gibbs sampling
  • Random projection
  • Phylogenetic footprinting
  • Reducer

12
Word Enumeration
  • For every word w, calculate
  • Expected frequency based on entire upstream
    region of the yeast genome
  • E.g., $P(\mathrm{ATTGA}) = (0.4)^4 (0.1)^1$, given
    $P(A) = P(T) = 0.4$ and $P(G) = P(C) = 0.1$
  • Expected number of occurrences of ATTGA:
    $E = n \cdot P(\mathrm{ATTGA})$
  • Observed frequency in the data set
  • Statistical significance of enrichment
  • $Z = (O - E) / \sqrt{n p (1 - p)} \sim N(0, 1)$
  • Disadvantage: only considers exact words
  • E.g., YCTGCA = TCTGCA and CCTGCA
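
The calculation above can be sketched as follows. This is a simplified illustration, not the original implementation: it takes the expected frequency from independent base probabilities (as in the ATTGA example) rather than from the genome-wide upstream regions, and it assumes n is the total number of word start positions in the data set. All function names are hypothetical.

```python
import math

# Background base probabilities from the slide's example.
BACKGROUND = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

def word_probability(word, background=BACKGROUND):
    """P(word) under an independent-base background model."""
    p = 1.0
    for base in word:
        p *= background[base]
    return p

def count_occurrences(word, sequences):
    """Observed count O, allowing overlapping occurrences."""
    w = len(word)
    return sum(
        1
        for seq in sequences
        for i in range(len(seq) - w + 1)
        if seq[i:i + w] == word
    )

def z_score(word, sequences, background=BACKGROUND):
    """Z = (O - E) / sqrt(n p (1 - p)), with E = n p."""
    p = word_probability(word, background)
    # Assumption: n counts every position where the word could start.
    n = sum(max(len(seq) - len(word) + 1, 0) for seq in sequences)
    observed = count_occurrences(word, sequences)
    return (observed - n * p) / math.sqrt(n * p * (1 - p))

print(word_probability("ATTGA"))          # (0.4)^4 * (0.1)^1
print(z_score("ATTGA", ["ATTGATTGA"]))    # strongly enriched in this toy input
```

Handling a degenerate word such as YCTGCA would require summing the counts of TCTGCA and CCTGCA, which this exact-match sketch does not do.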

13
Gibbs Sampling
  • Matrix to capture a motif
  • Goal: find the best start positions $a_k$ to maximize
    the difference between motif and background base
    distribution.

[Figure: sequences with motif start positions a1, a2, a3, a4, ..., ak]
Liu, X
14
Gibbs Sampling (Lawrence, et al, 1993)
  • Step 1: Pick random start positions, compute
    current motif matrix
  • Step 2: Iterative update
  • Take one sequence out, update motif matrix
  • Calculate fitness score of each position of the
    left-out sequence
  • Pick start position in the left-out sequence
    based on weight $A_x$
  • Take out another sequence, and repeat until
    convergence
  • Step 3: Reset starting position
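
Steps 1 and 2 can be sketched as below. This is a simplified illustration rather than the algorithm of Lawrence et al. as published: it assumes exactly one site per sequence, runs a fixed number of sweeps instead of testing convergence, adds a pseudocount to avoid zero probabilities, and omits Step 3. All names and default values are hypothetical.

```python
import random

def build_matrix(sequences, starts, width, skip, pseudocount=1.0):
    """Column-wise base frequencies from every site except sequence `skip`."""
    counts = [{b: pseudocount for b in "ACGT"} for _ in range(width)]
    for idx, (seq, start) in enumerate(zip(sequences, starts)):
        if idx == skip:
            continue
        for j in range(width):
            counts[j][seq[start + j]] += 1
    matrix = []
    for column in counts:
        total = sum(column.values())
        matrix.append({b: c / total for b, c in column.items()})
    return matrix

def fitness(subseq, matrix, background):
    """A_x = Q_x / P_x for one candidate subsequence."""
    q = p = 1.0
    for j, base in enumerate(subseq):
        q *= matrix[j][base]
        p *= background[base]
    return q / p

def gibbs_sample(sequences, width, background, iterations=200, seed=0):
    rng = random.Random(seed)
    # Step 1: random start position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    # Step 2: repeatedly take one sequence out and resample its start.
    for _ in range(iterations):
        for out in range(len(sequences)):
            matrix = build_matrix(sequences, starts, width, skip=out)
            seq = sequences[out]
            weights = [fitness(seq[x:x + width], matrix, background)
                       for x in range(len(seq) - width + 1)]
            # Sample (rather than take the maximum) in proportion to A_x.
            starts[out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

seqs = ["ACGTACGTTGCA", "TTGACGTACCAA", "GGACGTTTACGA"]
uniform = {b: 0.25 for b in "ACGT"}
print(gibbs_sample(seqs, width=4, background=uniform, iterations=50))
```

Sampling the new start position instead of always taking the best-scoring one is what lets the procedure escape poor initial alignments.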

Liu, X
15
Gibbs Sampling Initialization: pick random start
positions, compute motif matrix
Liu, X
16
Gibbs Sampling Iteration Step 1: Take out one
sequence, calculate the fitness score of every
subsequence relative to the current motif
[Figure: start positions a1', a2', a3', a4', ..., ak'; candidate positions in the left-out sequence marked "?"]
Liu, X
17
Fitness Score
Current Motif
  • $A_x = Q_x / P_x$
  • $Q_x$: probability of generating subsequence x from
    current motif
  • $P_x$: probability of generating subsequence x from
    background

Background: P(A) = P(T) = 0.4, P(G) = P(C) = 0.1
x = GGA: $Q_x$ = ?, $P_x$ = ?
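
A small worked sketch of the GGA exercise. The background probabilities are the ones given above. The current motif matrix did not survive in the transcript, so the `MOTIF` values below are purely hypothetical and only illustrate the arithmetic.

```python
# Background from the slide.
BACKGROUND = {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}

# HYPOTHETICAL 3-column motif matrix (the slide's own matrix is not available).
MOTIF = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},
    {"A": 0.60, "C": 0.10, "G": 0.10, "T": 0.20},
]

def background_probability(x, background=BACKGROUND):
    """P_x: product of background base probabilities."""
    p = 1.0
    for base in x:
        p *= background[base]
    return p

def motif_probability(x, motif=MOTIF):
    """Q_x: product of the column-specific probabilities."""
    q = 1.0
    for column, base in zip(motif, x):
        q *= column[base]
    return q

x = "GGA"
p_x = background_probability(x)   # 0.1 * 0.1 * 0.4
q_x = motif_probability(x)        # 0.7 * 0.8 * 0.6 with the assumed matrix
print(p_x, q_x, q_x / p_x)
```

Only $P_x$ is fixed by the slide; $Q_x$ and $A_x$ depend entirely on the assumed matrix.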
18
Gibbs Sampling Iteration Step 2: Pick new start
position by sampling from the fitness scores
[Figure: updated start position a1''; other start positions a2', a3', a4', ..., ak']
Liu, X
19
Recent Development
  • Random Projection
  • Phylogenetic Footprinting
  • Reducer

20
Random Projection (Buhler, 2002)
  • (l, d)-motif problem
  • M is an (unknown) motif of length l
  • Each occurrence of M is corrupted by exactly d
    point substitutions in random positions
  • No known biological motifs are of
    (l, d)-motif form

CCcaAG CCcgAG CCgcAG CCtaAG CCtgAG
CtATgG CCctAc tCtTAG CaAcAG CCAgAa
21
Random Projection Algorithm
  • Guiding principle: some instances of a motif
    agree on a subset of positions.
  • Use information from multiple motif instances to
    construct model.

Buhler, J
22
k-Projections
  • Choose k positions in string of length l.
  • Concatenate nucleotides at chosen k positions to
    form k-tuple.
  • In l-dimensional Hamming space, projection onto a
    k-dimensional subspace.

k = 7, l = 15
P = (2, 4, 5, 7, 11, 12, 13)
ATGGCATTCAGATTC -> TGCTGAT
Buhler, J
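
The projection in this example can be reproduced with a few lines. This is only a sketch; the function names are illustrative, and positions are 1-based to match the slide.

```python
import random

def project(lmer, positions):
    """Concatenate the letters of `lmer` at the given 1-based positions."""
    return "".join(lmer[p - 1] for p in positions)

def random_projection(l, k, rng=random):
    """Choose k of the l positions uniformly at random (sorted, 1-based)."""
    return tuple(sorted(rng.sample(range(1, l + 1), k)))

P = (2, 4, 5, 7, 11, 12, 13)
print(project("ATGGCATTCAGATTC", P))   # the slide's 7-tuple for this l-mer
print(random_projection(15, 7))        # a fresh random choice of positions
```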
23
Random Projection Algorithm
  • Choose a projection by selecting k positions
    uniformly at random.
  • For each l-tuple in input sequences, hash into
    bucket based on letters at k selected positions.
  • Recover motif from bucket containing multiple
    l-tuples.

Input sequence x(i): TCAATGCACCTAT...
Buhler, J
24
Example
  • l = 7 (motif size), k = 4 (projection size)
  • Choose projection (1, 2, 5, 7)

Input sequence: ...TAGACATCCGACTTGCCTTACTAC...
Buckets: the l-mer GCCTTAC hashes to bucket GCTC
Buhler, J
25
Hashing and Buckets
  • Hash function h(x) obtained from k positions of
    projection.
  • Buckets are labeled by values of h(x).
  • Enriched buckets contain more than s l-tuples,
    for some parameter s.
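
A rough sketch of the hashing step, using the earlier example (l = 7, projection (1, 2, 5, 7)) as input. The function names are hypothetical, and the bucket key is simply the projected k-tuple.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Map each projected k-tuple to the list of (sequence index, start, l-mer)."""
    buckets = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p - 1] for p in positions)  # 1-based positions
            buckets[key].append((seq_idx, start, lmer))
    return buckets

def enriched_buckets(buckets, s):
    """Keep buckets holding more than s l-tuples, as stated on the slide."""
    return {key: hits for key, hits in buckets.items() if len(hits) > s}

buckets = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], l=7, positions=(1, 2, 5, 7))
print(buckets["GCTC"])               # contains the l-mer GCCTTAC
print(enriched_buckets(buckets, 1))  # buckets with more than one l-mer
```

In practice the procedure is repeated with several independent random projections, since a single projection may by chance include positions where the motif instances disagree.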

Buhler, J
26
Motif Refinement
  • How do we recover the motif from the sequences in
    the enriched buckets?
  • k nucleotides are known from hash value of
    bucket.
  • Use information in other l-k positions as
    starting point for local refinement scheme, e.g.
    EM or Gibbs sampler

Local refinement algorithm
ATGCGTC
Candidate motif
Buhler, J
27
Parameter Selection
  • Projection size k
  • Choose k small so several motif instances hash to
    same bucket ($k < l - d$).
  • Choose k large to avoid contamination by spurious
    l-mers ($4^k > t(n - l + 1)$).
  • Bucket threshold s (s = 3, s = 4)
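
The two constraints on k can be combined into a simple range check. This sketch assumes that t is the number of input sequences and n the length of each sequence, which the slide does not define. The example values below are illustrative and not taken from the slides.

```python
def candidate_projection_sizes(l, d, t, n):
    """Return every k with k < l - d and 4**k > t * (n - l + 1)."""
    total_lmers = t * (n - l + 1)   # number of l-mers hashed per projection
    return [k for k in range(1, l - d) if 4 ** k > total_lmers]

# Illustrative values only: a (15, 4)-motif in 20 sequences of length 600.
print(candidate_projection_sizes(l=15, d=4, t=20, n=600))
```

An empty result means that no k satisfies both bounds, so one of the two requirements has to be relaxed.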

Buhler, J
28
Recent Development
  • Random Projection
  • Phylogenetic Footprinting
  • Reducer

29
Conservation of Regulatory Elements in Upstream
of ApoAI Gene
TATA box
TATA box
30
(No Transcript)
31
Substring Parsimony Problem
  • Given
  • orthologous upstream sequences S1, ..., Sn
  • phylogenetic tree T of the n species
  • size k of the motif, threshold d
  • Problem
  • Find all sets of substrings s1, ..., sn of
    S1, ..., Sn, each of size k, such that the
    parsimony score of s1, ..., sn on T is at most d

Blanchette, M
32
Parsimony Score
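[Figure: tree T with node labels s1, s2, s34, s3, s4, s5, s6]
Parsimony score = $\min \sum_{(u, v) \in T} d(l(u), l(v))$, where the minimum is taken over all possible labelings of internal nodes
  • $l(v)$: label of node v
  • $d(l_1, l_2)$: Hamming distance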
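
A minimal sketch of computing the parsimony score for fixed leaf strings. Because the Hamming distance is a sum over positions, each column can be minimized independently with a small dynamic program over the four bases. The nested-tuple tree encoding and the function names are assumptions made for this example, and the tree shape in the usage lines is invented.

```python
def column_costs(node, pos):
    """Map each base to the minimum substitutions in the subtree
    if `node` carries that base at position `pos`."""
    if isinstance(node, str):   # leaf: its own base is free, others impossible
        return {b: (0 if node[pos] == b else float("inf")) for b in "ACGT"}
    costs = {b: 0 for b in "ACGT"}
    for child in node:          # internal node: a tuple of child subtrees
        child_costs = column_costs(child, pos)
        for b in "ACGT":
            # Either keep the same base on the edge (cost 0) or change it (cost 1).
            costs[b] += min(child_costs[c] + (b != c) for c in "ACGT")
    return costs

def parsimony_score(tree, k):
    """Minimum total Hamming distance over all labelings of internal nodes."""
    return sum(min(column_costs(tree, pos).values()) for pos in range(k))

# Hypothetical tree ((s1, s2), s3) with three length-5 leaf strings.
tree = (("AAGCA", "ACGCA"), "AAGCA")
print(parsimony_score(tree, 5))
```

This only scores one fixed choice of substrings. The substring parsimony problem additionally searches over all choices of substrings, which is what the table-based algorithms address.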

Blanchette, M
33
String Parsimony Problem
S1 = AAAGCATTC, S2 = TACGCACCC, S3 = GAAGCAGGG
k = 5, d = 1
[Figure: tree with leaves S1, S2, S3]
34
Algorithm version I
  • Root the tree at arbitrary internal node r
  • Compute table $W_u$ of size $4^k$ for each node u,
    where $W_u[s]$ = best parsimony score for subtree
    rooted at u when u is labeled with s
  • Direct implementation of this recursion gives
    $O(nk(4^{2k} + l))$, where l = average sequence
    length

Blanchette, M
35
Algorithm version II
  • Define $X(u, v)[s]$ = best parsimony score for
    subtree consisting of edge (u, v) and the subtree
    rooted at v

[Figure: tree fragment showing nodes u (labeled s), w, and v]
Blanchette, M
36
Algorithm version II (continued)
  • Update X(u, v) in phases: in phase p, maintain the
    set $B_p$ of sequences t such that $X(u, v)[t] = p$
  • Define
  • $R_a = \{s : W_v[s] = a\}$
  • $N(s) = \{t \in \Sigma^k : d(s, t) = 1\}$
  • Start in phase m and let $B_m = R_m$
  • Update
  • Computation of X(u, v) takes $O(k 4^k)$

Blanchette, M
37
Improvements
  • Reduce the size of Bp when sequences contribute
    to X(u, v) greater than threshold d
  • In phase p, only care for sequence X(u, v) s if
  • Leads to significant reductions in stages d/2 d
  • Reduce the number of substrings inserted in W at
    the leaves
  • For substring s of Si, if its best match against
    any Sj, has Hamming distance at least d, s can be
    discarded

Blanchette, M
38
Results
  • Practical limit on k 10
  • There appeared to be a threshold d0 with very few
    solutions below and many above
  • Algorithm found 80 known binding sites
  • Performed better than ClustalW, MEME, Consensus

Blanchette, M
39
Recent Development
  • Random Projection
  • Phylogenetic Footprinting
  • Reducer

40
Reducer (Bussemaker, et al 2001)
  • Links motif finding to expression level
  • $A_g = C + \sum_{u=1}^{M} F_u N_{ug}$
  • $A_g$: gene expression level (logarithm of
    expression ratio)
  • M: number of significant motifs
  • $N_{ug}$: number of occurrences of motif u in gene g
  • C: baseline expression level (same for all genes)
  • $F_u$: increase/decrease of expression level caused
    by presence of motif u

41
Reducer (Contd)
Liu, X
42
Reducer (Contd)
  • Normalize expression (A) and motif (n) vectors
  • Linear regression between A vector and every n
    vector to find the best fit n to A
  • Step-wise regression to combine effects of motifs
  • Subtract the effect of one motif
  • Find the next best motif
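
The step-wise procedure can be sketched in plain Python as below. This is a simplified illustration, not the published method: instead of normalizing the vectors it fits an intercept for each motif, it ranks motifs by residual sum of squares, and it stops after a fixed `max_motifs` rather than applying a significance test. All names are hypothetical.

```python
def fit_single_motif(expression, counts):
    """Least-squares fit A_g ~ C + F * N_g; returns (F, C, residual sum of squares)."""
    a_mean = sum(expression) / len(expression)
    n_mean = sum(counts) / len(counts)
    sxx = sum((n - n_mean) ** 2 for n in counts)
    sxy = sum((n - n_mean) * (a - a_mean) for n, a in zip(counts, expression))
    slope = sxy / sxx if sxx else 0.0
    intercept = a_mean - slope * n_mean
    rss = sum((a - intercept - slope * n) ** 2 for n, a in zip(counts, expression))
    return slope, intercept, rss

def stepwise_fit(expression, motif_counts, max_motifs=2):
    """Greedily pick the best-fitting motif, subtract its effect, and repeat."""
    residual = list(expression)
    remaining = dict(motif_counts)
    chosen = []
    while remaining and len(chosen) < max_motifs:
        best = min(remaining,
                   key=lambda u: fit_single_motif(residual, remaining[u])[2])
        slope, intercept, _ = fit_single_motif(residual, remaining.pop(best))
        # Subtract the effect of the chosen motif before the next round.
        residual = [a - intercept - slope * n
                    for a, n in zip(residual, motif_counts[best])]
        chosen.append((best, slope))
    return chosen

counts = {"m1": [0, 1, 2, 3], "m2": [1, 0, 1, 0]}   # toy occurrence counts per gene
expression = [1.0, 3.0, 5.0, 7.0]                   # toy log expression ratios
print(stepwise_fit(expression, counts))
```

In this toy data the expression values were built as 1 + 2 × (count of m1), so m1 is picked first with a fitted effect close to 2.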

Liu, X
43
Acknowledgement
  • People from whom I borrowed slides
  • Xiaole Liu (Reducer)
  • Olga Troyanskaya (Microarray)
  • Jeremy Buhler (Random projections)
  • Mathieu Blanchette (Phylogenetic footprinting)
  • Various web sources

44
(No Transcript)
45
[Figure: cDNA microarray workflow. Labels: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); mRNA (target); hybridise target to microarray; scanning with laser 1 and laser 2 (excitation, emission); overlay images and normalise; analysis]
46
Information Content of Motifs
  • Uncertainty
  • Information = $H_{before} - H_{after}$
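
A short sketch of this calculation for a single motif column. The uncertainty formula did not survive in the transcript, so this assumes the standard Shannon entropy $H = -\sum_b p_b \log_2 p_b$, with $H_{before}$ taken from the background distribution and $H_{after}$ from the motif column. Function names are illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def column_information(column, background=None):
    """H_before - H_after for one motif column, in bits."""
    if background is None:
        # Assumption: uniform background over the column's alphabet.
        background = {b: 1.0 / len(column) for b in column}
    return entropy(background.values()) - entropy(column.values())

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(column_information(conserved))   # fully conserved column
print(column_information(uniform))     # column indistinguishable from background
```

Summing the per-column values gives the information content of the whole motif, which is also what the letter heights in a sequence logo represent.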

47
Improvement on Original Gibbs sampler
  • 0 to n copies of sites in each sequence
  • Iterative masking to find multiple motifs
  • Use higher order Markov models to improve motif
    specificity

48
Clinical Importance of Defects in Regulatory
Elements
Burkitt's Lymphoma
49
Statistical Methods
  • Expectation Maximization (EM)
  • MEME
  • Gibbs sampling
  • BioProspector
  • AlignACE

50
Motifs are not limited to DNA
  • RNA motifs
  • RNA-RNA interaction motifs, e.g., intron-exon
    splice sites
  • RNA-protein interaction motifs, e.g., binding
    of proteins to RNA polyA tail
  • Protein motifs
  • E.g., Helix-turn-helix motif

51
Sequence Logo
52
Why is this Problem Hard?
  • Motif information content low
  • Hamming distance between each motif instance high