Title: Finding Regulatory Binding Motifs in Genomic Sequences
1Finding Regulatory Binding Motifs in Genomic
Sequences
- Jun Liu
- Department of Statistics
- Harvard University
- Email: jliu_at_stat.Harvard.edu
- http://www.fas.Harvard.edu/junliu
2Outline
- Background
- The basic motif model and early attempts
- The EM-method
- A progressive search method
- The Gibbs sampling approach
- Motif sampler and mixture model
- Threshold sampler
- Further modeling efforts
- Applications
3Genome Sequence in Year 2000
DNA (A,T,G,C) -> RNA (A,U,G,C) -> Protein (A,R,N,D,C,E,Q,G,H,I,L,K, ...)
- Human genome first draft in year 2000?
- Three billion bases of DNA
- Blueprint for human species
- Already > 20 smaller genomes complete
- E. coli, H. influenzae,
- Archaeoglobus fulgidus,
- C. elegans, S. cerevisiae
4Other High Throughput Data
- Gene expression arrays
- Single nucleotide point mutations (SNPs)
- In 80,000 human genes
- More coming
- Focus: genomic comparisons to reveal patterns -- learn about nature's words
5Get the Complete Sequence
By courtesy of Xiaole Liu
- 3-billion letters long
- 140 thousand genes
- Alphabet is A, T, G and C
- No punctuation, no spaces
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
6Each Cell Is Like a Chef
By courtesy of Xiaole Liu
7Information in DNA
By courtesy of Xiaole Liu
Between genes: 97%
Junk?
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
8Information in DNA
By courtesy of Xiaole Liu
- Between genes: 97%; genes: 3%
- Regulation: when, where, amount, other conditions, etc.
- ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
- ATTTACCACATCGCATCACTACGACGGATAGACACGGACG
- GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
- TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
- CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT
Milk -> Yogurt
Egg -> Omelet
Fish -> Sushi
Flour -> Cake
Beef -> Burger
9Search for Regulatory Binding Sites
- Gene Transcription and Regulation
- Transcription is initiated by RNA polymerase binding at the so-called promoter region (TATA-box or -10, -35)
- Regulated by (regulatory) proteins binding to a segment of the genome near the promoter region
- These binding sites on DNA are often similar in composition
(Figure: a gene and its upstream region drawn from 5' to 3', labeling RNA polymerase, enhancers and repressors, the promoter region, the 5' UTR, the translation start, and the AUG start codon.)
12Decode Gene Regulation
By courtesy of Xiaole Liu
- GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
- GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
- TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
- CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
signal to turn on/off switches
Co-regulated genes
The same regulatory protein is often used for
co-expressed genes
Transcription start
13Decode Gene Regulation
By courtesy of Xiaole Liu
- GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
- GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
- TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
- CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT
Look at genes always expressed together
Upstream Regions
Co-expressed Genes
Scrambled Egg
Bacon
Cereal
Hash Brown
Orange Juice
14The Particular Dataset
- 18 DNA segments, each of length 105 bps.
- There is at least one CRP binding site, known experimentally, in each sequence.
- The binding sites are about 16-19 base pairs long, with considerable variability in their contents.
- Interested in seeing if we can find these sites computationally.
15The Test Data Set
16Truth?
17Finding Motifs in Multiple Sequences
(Figure: K sequences, the k-th of length nk, each with a motif site of width w starting at position ak.)
Alignment variable: $A = \{a_1, a_2, \ldots, a_K\}$
Objective: find the best common patterns.
18Statistical Models
- How do we describe patterns?
- Frequencies of amino acid types. Theoretical basis?
- Multinomial distribution --- more generally, a model
TTTGATCGTTTTCACAAAA
TTTGCACGGCGTCACACTT
TGTGAGCATGGTCATATTT
TGCAAAGGACGTCACATTA
TGTTAAATTGATCACGTTT
TTTGAACCAGATCGCATTA
TGTAAACGATTCCACTAAT
CGTGATCAACCCCTCAATT
TGTGAGTTAGCTCACTCAT
TGTAACAGAGATCACACAA
A typical aligned motif
19Multinomial Distribution
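A total of k sequences
(Figure: the aligned motif, with parameter vectors p1, p2, ..., p6 attached to its columns.)
Model Mi for the i-th column:
$(k_{i,1}, k_{i,2}, \ldots, k_{i,4}) \sim \mathrm{Multinom}(k, p_i)$,
where $p_i = (p_{i,1}, \ldots, p_{i,4})$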
20 How to estimate?
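- The maximum likelihood estimate: $\hat{p}_{i,j} = k_{i,j} / k$
- Bayesian estimate
- Prior: $p_i \sim \mathrm{Dirichlet}(a_{i,1}, \ldots, a_{i,4})$, with the $a_{i,j}$ acting as pseudo-counts
- Posterior: $p_i \mid \text{obs} \sim \mathrm{Dirichlet}(a_{i,1}+k_{i,1}, \ldots, a_{i,4}+k_{i,4})$
- Posterior mean: $E[p_{i,j} \mid \text{obs}] = (a_{i,j}+k_{i,j}) / \sum_{l=1}^{4} (a_{i,l}+k_{i,l})$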
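The two estimates can be compared on the ten aligned sites from the "Statistical Models" slide. This is a minimal sketch, not the speaker's code: the function names and the symmetric pseudo-count of 1 per base are assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"

def column_counts(sites):
    """Count each base in every column of equal-length aligned sites."""
    width = len(sites[0])
    return [Counter(site[i] for site in sites) for i in range(width)]

def mle(counts, k):
    """Maximum likelihood estimate: observed column frequencies."""
    return [{b: c[b] / k for b in ALPHABET} for c in counts]

def posterior_mean(counts, k, pseudo=1.0):
    """Posterior mean under a symmetric Dirichlet(pseudo, ..., pseudo) prior
    (the value of pseudo is an assumption, not taken from the slides)."""
    total = k + pseudo * len(ALPHABET)
    return [{b: (c[b] + pseudo) / total for b in ALPHABET} for c in counts]

# The ten aligned sites shown on the "Statistical Models" slide.
sites = [
    "TTTGATCGTTTTCACAAAA", "TTTGCACGGCGTCACACTT",
    "TGTGAGCATGGTCATATTT", "TGCAAAGGACGTCACATTA",
    "TGTTAAATTGATCACGTTT", "TTTGAACCAGATCGCATTA",
    "TGTAAACGATTCCACTAAT", "CGTGATCAACCCCTCAATT",
    "TGTGAGTTAGCTCACTCAT", "TGTAACAGAGATCACACAA",
]
counts = column_counts(sites)
p_mle = mle(counts, len(sites))
p_bayes = posterior_mean(counts, len(sites))
```

In the first column, bases that never occur get probability zero under the MLE but a small positive probability under the posterior mean, which is the practical reason for pseudo-counts.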
21Motif Alignment Model
(Figure: K sequences, the k-th of length nk, each with a motif site of width w starting at position ak.)
Alignment variable: $A = \{a_1, a_2, \ldots, a_K\}$
- Every non-site position follows a common multinomial with $p_0 = (p_{0,1}, \ldots, p_{0,20})$
- Every position i in the motif element follows a probability distribution $p_i = (p_{i,1}, \ldots, p_{i,20})$
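Under this model, moving the site within a sequence changes the likelihood only through the ratio of motif to background probabilities over the w covered positions. The sketch below scores every start position by that log ratio. It uses a DNA alphabet and a toy width-3 motif; both, along with the function names, are illustrative assumptions.

```python
import math

def site_log_odds(seq, start, motif, background):
    """Log of prod_i p_i(r) / p_0(r) over the w positions covered by the site."""
    return sum(
        math.log(motif[i][seq[start + i]] / background[seq[start + i]])
        for i in range(len(motif))
    )

def best_site(seq, motif, background):
    """Return (start, score) of the highest-scoring site in seq."""
    w = len(motif)
    scores = [site_log_odds(seq, a, motif, background)
              for a in range(len(seq) - w + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

def soft_column(base, hit=0.85, miss=0.05):
    """Toy motif column that strongly prefers one base (hypothetical values)."""
    return {b: (hit if b == base else miss) for b in "ACGT"}

motif = [soft_column(b) for b in "TGA"]
background = {b: 0.25 for b in "ACGT"}
start, score = best_site("CCTGACC", motif, background)
```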
22Handling missing data --- the EM method
- This is a standard missing data problem
- Given $A$, $\Theta$ is easy; given $\Theta$, $A$ is easy
- The EM approach
- E-step: compute the predictive probability that the j-th position in sequence k is a site.
- M-step: multinomial with fractional counts.
Lawrence and Reilly (1990)
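The E-step and M-step above can be sketched for the simplest case of exactly one site per sequence. This is an illustrative simplification, not the published algorithm: the background is fixed at the overall base frequencies, the motif is seeded from a given word, and the toy sequences, the pseudo-count, and all names are assumptions.

```python
from collections import Counter

ALPHABET = "ACGT"

def em_motif(seqs, w, seed, n_iter=20, pseudo=0.5):
    """EM for a width-w motif with one site per sequence; len(seed) must equal w."""
    # Background: overall base frequencies, held fixed (a simplification).
    total = Counter("".join(seqs))
    n = sum(total.values())
    bg = {b: total[b] / n for b in ALPHABET}
    # Initial motif: a soft version of the seed word.
    motif = [{b: (0.7 if b == s else 0.1) for b in ALPHABET} for s in seed]
    for _ in range(n_iter):
        # E-step: predictive probability that each position starts the site.
        weights = []
        for seq in seqs:
            ratios = []
            for a in range(len(seq) - w + 1):
                r = 1.0
                for i in range(w):
                    r *= motif[i][seq[a + i]] / bg[seq[a + i]]
                ratios.append(r)
            z = sum(ratios)
            weights.append([r / z for r in ratios])
        # M-step: multinomial estimates from fractional counts.
        counts = [{b: pseudo for b in ALPHABET} for _ in range(w)]
        for seq, wts in zip(seqs, weights):
            for a, wt in enumerate(wts):
                for i in range(w):
                    counts[i][seq[a + i]] += wt
        motif = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    return motif, weights

# Toy data: the word TGTGA is planted once in each sequence.
seqs = ["ACCGTTGTGACCA", "TGTGAGGCATCCA", "GGCACTCTGTGAC", "CAGGCTGTGATCC"]
motif, weights = em_motif(seqs, 5, "TGTGA")
starts = [max(range(len(wt)), key=wt.__getitem__) for wt in weights]
```

EM only climbs to a local optimum, so in practice the result depends on the seed; that sensitivity is one motivation for the stochastic methods on the later slides.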
23A sequential updating method
(Stormo and Hartzell 1989).
- Start with the first 2 sequences; find the top N (50, say) best patterns based on pair-wise comparison
- Consider a new sequence at each step
- There are $(L_k - W + 1)$ possible locations for the pattern
- Each possible location is compared with the N (50) current patterns, resulting in $N \times (L_k - W + 1)$ new patterns
- Choose the best N among these new patterns
- Continue until the last sequence
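The progressive search can be sketched as a beam search over partial alignments. The scoring function below (the number of bases agreeing with each column's most common base) is a stand-in chosen for brevity and is an assumption; the slides do not specify the score. The function names and toy sequences are likewise illustrative.

```python
from collections import Counter
from itertools import product

def score(sites):
    """Consensus agreement: per column, the count of the most common base."""
    return sum(max(Counter(col).values()) for col in zip(*sites))

def windows(seq, w):
    """All length-w words of seq; there are len(seq) - w + 1 of them."""
    return [seq[a:a + w] for a in range(len(seq) - w + 1)]

def progressive_search(seqs, w, n_keep=50):
    # Start with the first two sequences: score every pair of windows.
    patterns = list(product(windows(seqs[0], w), windows(seqs[1], w)))
    patterns = sorted(patterns, key=score, reverse=True)[:n_keep]
    # Add one sequence at a time, extending each kept pattern by each window.
    for seq in seqs[2:]:
        candidates = [p + (x,) for p in patterns for x in windows(seq, w)]
        patterns = sorted(candidates, key=score, reverse=True)[:n_keep]
    return patterns

seqs = ["ACCGTTGTGACCA", "TGTGAGGCATCCA", "GGCACTCTGTGAC", "CAGGCTGTGATCC"]
best = progressive_search(seqs, 5)[0]
```

Because only N patterns survive each step, the result can depend on the order in which sequences are added.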
24Connection with sequential imputation
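- Missing data $Z = (z_1, \ldots, z_K)$; observed data $Y = (y_1, \ldots, y_K)$; parameter $\theta$.
- We can impute multiple copies of $z_j$ by sampling from its predictive distribution given the observations and the imputations made so far ($z_1, z_2, z_3, \ldots$ before $z_j$).
- An incremental weight is computed at each step.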
25Gibbs sampling approach
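- Let $\Theta = (\theta_0, \theta_1, \ldots, \theta_w)$ be the parameter and $A = \{a_1, a_2, \ldots, a_K\}$ the alignment
- Iterative sampling: $P(\Theta \mid A, \text{Data})$, $P(A \mid \Theta, \text{Data})$
- Draw from $[\Theta \mid A, \text{Data}]$, then draw from $[A \mid \Theta, \text{Data}]$
- Predictive updating: pretend that K-1 sequences have been aligned. We stochastically predict for the K-th sequence!!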
26The Algorithm
- Initialize by choosing random starting positions
- Iterate the following steps many times
- Randomly or systematically choose a sequence, say, sequence k, to exclude
- Carry out the predictive-updating step to update ak
- Stop when no changes are observed, or some criterion is met
27The PU-step
1. Compute predictive frequencies for each position i in the motif:
   cij = count of amino acid type j at position i;
   c0j = count of amino acid type j in all non-site positions;
   $q_{ij} = (c_{ij} + b_j) / (K - 1 + B)$, where the $b_j$ are pseudo-counts and $B = \sum_j b_j$ is their total.
2. Sample from the predictive distribution of ak.
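A minimal sketch of the sampler with the predictive-update step, for DNA and one site per sequence. The motif frequencies follow the formula q_ij = (c_ij + b_j)/(K - 1 + B) from the slide. Weighting each candidate start by the ratio of motif to background probabilities is an assumption about how the predictive distribution is formed, and the names, pseudo-count value, and toy data are illustrative.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def predictive_update(seqs, starts, k, w, pseudo=0.5, rng=random):
    """Sample a new start for sequence k given the sites in the other K-1."""
    others = [(s, a) for j, (s, a) in enumerate(zip(seqs, starts)) if j != k]
    B = pseudo * len(ALPHABET)
    # q_ij = (c_ij + b_j) / (K - 1 + B) for each motif position i.
    q = []
    for i in range(w):
        c = Counter(s[a + i] for s, a in others)
        q.append({b: (c[b] + pseudo) / (len(others) + B) for b in ALPHABET})
    # Background frequencies from all non-site positions (c_0j).
    c0 = Counter()
    for s, a in others:
        c0.update(s[:a] + s[a + w:])
    n0 = sum(c0.values())
    q0 = {b: (c0[b] + pseudo) / (n0 + B) for b in ALPHABET}
    # Weight each candidate start by its motif/background probability ratio.
    seq = seqs[k]
    weights = []
    for a in range(len(seq) - w + 1):
        r = 1.0
        for i in range(w):
            r *= q[i][seq[a + i]] / q0[seq[a + i]]
        weights.append(r)
    return rng.choices(range(len(weights)), weights=weights)[0]

def gibbs_site_sampler(seqs, w, n_sweeps=200, seed=0):
    rng = random.Random(seed)
    # Initialize with random starting positions.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_sweeps):
        for k in range(len(seqs)):  # systematic scan over sequences
            starts[k] = predictive_update(seqs, starts, k, w, rng=rng)
    return starts

seqs = ["ACCGTTGTGACCA", "TGTGAGGCATCCA", "GGCACTCTGTGAC", "CAGGCTGTGATCC"]
starts = gibbs_site_sampler(seqs, 5)
```

The output is a random draw, so different seeds can give different alignments; in practice one would track a score across sweeps and keep the best alignment seen.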
28Why Does It Work? --- An Analogy
Crystallize
29How to determine the set of co-regulated genes?
(how to derive the dataset)
By courtesy of Xiaole Liu
- Assemble the dataset based on scientific knowledge
- Observe gene expression profiles at different cell states
(Approach taken by Church et al.)
GeneChip detects expression of every gene at a
certain cell state (tastes the food and tells the
kind and amount of dishes cooked at situations)
30Or Cross-species Comparison
- Find similar genes (homologs)
- They may correspond to similar control mechanisms
- Analogy: English breakfast, French breakfast, Italian breakfast, Chinese breakfast
- We compared 8 bacterial genomes and found > 2000 binding motifs. Estimated 80% accuracy.
31Repeated Motifs: Gibbs Motif Sampler
- Some sequences have multiple repeats of several motifs
- Some sequences do not have any motif site at all
- Sequences are put together mostly based on certain derived and imprecise information
32Idea 1: Mixture modeling --- Motif sampler
- View the dataset as a long sequence with m motif types
- Idea: partition the input sequence into segments that correspond to different (unknown) motif models.
- It is a mixture model (unsupervised learning).
- Implement a predictive updating scheme.
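33Special Case: Bernoulli Sampler
- Sequence data: $R = r_1 r_2 r_3 \ldots r_N$
- Indicator variable: $D = d_1 d_2 d_3 \ldots d_N$, where $d_i = 1$ if position i is the start of an element and $d_i = 0$ if not
- Likelihood: $p(R, D \mid \Theta, \varepsilon)$, where $\Theta$ is the parameter for the motif model and $\varepsilon$ is the prior probability for $d_i = 1$
- Predictive update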
34Idea 2: Threshold Sampler
- Use the usual iterative updating
- Give two cutoff values c1 < c2: add all positions whose predictive probability is > c2; sample those between c1 and c2
- Advantage: less susceptible to sequence correlations and low complexity features.
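The two-cutoff rule can be sketched as follows. Including a position between the cutoffs with probability equal to its predictive probability is an assumption about what "sample" means here; the function name and the example values are hypothetical.

```python
import random

def threshold_select(probs, c1, c2, rng=random):
    """Pick site positions from predictive probabilities using cutoffs c1 < c2."""
    if not c1 < c2:
        raise ValueError("cutoffs must satisfy c1 < c2")
    chosen = []
    for pos, p in enumerate(probs):
        if p > c2:
            chosen.append(pos)      # above c2: always added
        elif p >= c1 and rng.random() < p:
            chosen.append(pos)      # between c1 and c2: sampled
        # below c1: never added
    return chosen

probs = [0.05, 0.95, 0.5, 0.99, 0.0]
chosen = threshold_select(probs, c1=0.2, c2=0.9, rng=random.Random(0))
```

With these values, positions 1 and 3 are always selected, positions 0 and 4 never are, and position 2 is included about half the time.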
35Further Modeling Efforts
By courtesy of Xiaole Liu
- DNA can be read
- in two directions
- GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
- CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
- TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
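Reading DNA in two directions means a site can appear either as the motif itself or as its reverse complement. A small sketch using the first two sequences on this slide, where ATTTACC appears on the forward strand in one and as GGTAAAT in the other. The function names are illustrative, and exact matching stands in for the probabilistic scoring a sampler would use.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_both_strands(seq, motif):
    """Return (start, strand) for each exact match of motif on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for a in range(len(seq) - len(motif) + 1):
        window = seq[a:a + len(motif)]
        if window == motif:
            hits.append((a, "+"))
        elif window == rc:
            hits.append((a, "-"))
    return hits

forward_hits = find_both_strands("GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC", "ATTTACC")
reverse_hits = find_both_strands("CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC", "ATTTACC")
```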
36Further Modeling Efforts
By courtesy of Xiaole Liu
- Some sequence patterns have multiple parts
- GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
- CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
- GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
- TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
- CGCTAGCCATTTACCGATCTTTGGCGTTCTCGAGAATTGCCTAT
Sample jointly
37Some are palindromic
TGTGAtcaaccTCACA
Spacing information
D
Translation start
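In the example site, the two outer half-sites are reverse complements of each other while the spacer between them is not. A small check, with a hypothetical helper name:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def half_sites_are_palindromic(site, half_width):
    """True if the first half_width bases equal the reverse complement
    of the last half_width bases."""
    s = site.upper()
    return s[:half_width] == reverse_complement(s[-half_width:])

site = "TGTGAtcaaccTCACA"
outer_match = half_sites_are_palindromic(site, 5)   # TGTGA vs TCACA
whole_match = half_sites_are_palindromic(site, 8)   # includes the spacer
```

A sampler can exploit this by tying the parameters of the two half-sites together, which halves the number of free motif columns.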
38Examples
- Bacterial regulation
- Several proteobacteria genomes (9)
- Find orthologs of each E. coli ORF (2113) in other genomes
- Examine the 5' upstream UTR for these orthologs to find motifs
- We predicted 2097 sites; some new findings confirmed
- Human-mouse comparison
- Yeast study
- S. cerevisiae telomere-binding protein Rap1
- 60 noncoding sequences interacted with Rap1
- lengths range 163-1339
- Some experimentally determined sites with known
pattern
40The Hidden Markov Model
- For given $z_s$, $y_s \sim f(y_s \mid z_s, \theta)$, and the $z_s$ follow a Markov process with transition $p_s(z_s \mid z_{s-1}, \theta)$.
The State Space Model
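Written out, the two ingredients above give the joint density of the observations and hidden states as a product over positions. In this sketch, S (the sequence length) and the convention that the first transition term stands for the initial-state distribution are notational assumptions not stated on the slide.

```latex
p(y_1, \ldots, y_S,\; z_1, \ldots, z_S \mid \theta)
  = \prod_{s=1}^{S} f(y_s \mid z_s, \theta)\; p_s(z_s \mid z_{s-1}, \theta),
\qquad
p_1(z_1 \mid z_0, \theta) \equiv p_1(z_1 \mid \theta).
```

The likelihood of the observed sequence alone is obtained by summing this product over all hidden paths.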
41What Are Hidden in Sequence Alignment?
- HMM architecture: transition diagram for the underlying Markov chain.
42Future Work
- Modeling regulatory modules and developing a sampling strategy for their discovery.
- Combining gene expression data with sequence data for studying gene regulation.
- Hierarchical protein classification models.
- Motif classification.
43Young, yet Promising
44References (Self-serving)
- Liu, X., Brutlag, D. and Liu, J.S. (2000). BioProspector: a new motif finding algorithm.
- Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1999). Markovian structures in biological sequence alignments. J. Amer. Statist. Assoc., 1-15.
- Neuwald, A.F., Liu, J.S., Lipman, D.J., and Lawrence, C.E. (1997). Extracting protein alignment models from the sequence database. Nucleic Acids Res. 25, 1665-1677.
- Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Statist. Assoc. 90, 1156-1170.
- Neuwald, A.F., Liu, J.S., and Lawrence, C.E. (1995). Gibbs motif sampling: detection of .. Protein Science 4, 1618-1632.
- Lawrence, et al. (1993). Science 262, 208-214.