Finding Regulatory Binding Motifs in Genomic Sequences

1
Finding Regulatory Binding Motifs in Genomic Sequences

Jun Liu
Department of Statistics
Harvard University
Email: jliu_at_stat.Harvard.edu
http://www.fas.Harvard.edu/junliu

2
Outline

Background
The basic motif model and early attempts
The EM-method
A progressive search method
The Gibbs sampling approach
Motif sampler and mixture model
Threshold sampler
Further modeling efforts
Applications

3
Genome Sequence in Year 2000
DNA (A, T, G, C) → RNA (A, U, G, C) → Protein
(A, R, N, D, C, E, Q, G, H, I, L, K, ...)

Human genome first draft in year 2000?
Three billion bases of DNA
Blueprint for human species
Already > 20 smaller genomes complete
E. coli, H. influenzae,
Archaeoglobus fulgidus,
C. elegans, S. cerevisiae

4
Other High Throughput Data

Gene expression arrays
Single nucleotide point mutations (SNPs)
In 80,000 human genes
More coming
Focus: genomic comparisons to reveal patterns --
learn about nature's words

5
Get the Complete Sequence
By courtesy of Xiaole Liu

3-billion letters long
140 thousand genes
Alphabet is A, T, G and C
No punctuation, no spaces

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
6
Each Cell Is Like a Chef
By courtesy of Xiaole Liu
7
Information in DNA
By courtesy of Xiaole Liu
Between genes: 97%
Junk?
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
8
Information in DNA
By courtesy of Xiaole Liu

Between genes: 97%; genes: 3%
Regulation: when, where,
amount, other conditions, etc.
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
ATTTACCACATCGCATCACTACGACGGATAGACACGGACG
GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Milk -> Yogurt
Egg -> Omelet
Fish -> Sushi
Flour -> Cake
Beef -> Burger
9
Search for Regulatory Binding Sites

Gene Transcription and Regulation
Transcription initiated by RNA polymerase binding
at the so-called promoter region (TATA-box or
-10, -35)
Regulated by (regulatory) proteins binding to a
segment of the genome near the promoter
region
These binding sites on DNA are often similar in
composition

[Figure: gene schematic, 5' to 3', labeling RNA polymerase, enhancers and repressors, the promoter region, the 5' UTR, the translation start, and the start codon AUG.]
12
Decode Gene Regulation
By courtesy of Xiaole Liu

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

signal to turn on/off switches
Co-regulated genes
The same regulatory protein is often used for
co-expressed genes
Transcription start
13
Decode Gene Regulation
By courtesy of Xiaole Liu

GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together:
Upstream Regions, Co-expressed Genes
Scrambled Egg
Bacon
Cereal
Hash Brown
Orange Juice
14
The Particular Dataset

18 DNA segments, each of length 105 bps.
There is at least one CRP binding site, known
experimentally, in each sequence.
The binding sites are about 16-19 base pairs
long, with considerable variability in their
contents.
Interested in seeing if we can find these sites
computationally.

15
The Test Data Set
16
Truth?
17
Finding Motifs in Multiple Sequences
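[Figure: sequences 1, ..., k, where sequence k has length n_k and contains a motif site of width w starting at position a_k.]
Alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$
Objective: find the best common patterns.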
18
Statistical Models

How do we describe patterns?
frequencies of amino acid types. theoretical
basis?
multinomial distribution --- more generally a
model

TTTGATCGTTTTCACAAAA
TTTGCACGGCGTCACACTT
TGTGAGCATGGTCATATTT
TGCAAAGGACGTCACATTA
TGTTAAATTGATCACGTTT
TTTGAACCAGATCGCATTA
TGTAAACGATTCCACTAAT
CGTGATCAACCCCTCAATT
TGTGAGTTAGCTCACTCAT
TGTAACAGAGATCACACAA
A typical aligned motif
19
Multinomial Distribution
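A total of k sequences
[Figure: aligned motif columns with column probabilities $p_1$, $p_2$, ..., $p_6$.]
Model $M_i$ for the i-th column:
$(k_{i,1}, k_{i,2}, \ldots, k_{i,4}) \sim \mathrm{Multinom}(k, p_i)$,
where $p_i = (p_{i,1}, \ldots, p_{i,4})$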
20
How to estimate?

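Maximum likelihood: $\hat{p}_{i,j} = k_{i,j} / k$
Bayesian estimate
Prior: $p_i \sim \mathrm{Dirichlet}(a_{i,1}, \ldots, a_{i,4})$, where the $a_{i,j}$ act as pseudo-counts
Posterior: $p_i \mid \text{obs} \sim \mathrm{Dirichlet}(a_{i,1} + k_{i,1}, \ldots, a_{i,4} + k_{i,4})$
Posterior mean: $E[p_{i,j} \mid \text{obs}] = (a_{i,j} + k_{i,j}) / (a_{i,1} + \cdots + a_{i,4} + k)$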
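
The two estimators on this slide can be compared on a single motif column. The sketch below is a minimal illustration, assuming a DNA alphabet and a flat Dirichlet prior with one pseudo-count per base; the function name `column_estimates` and the prior values are choices made here, not taken from the slides. The column is the first letter of each of the ten aligned sites on the "Statistical Models" slide.

```python
from collections import Counter

ALPHABET = "ACGT"

def column_estimates(column, pseudo=(1.0, 1.0, 1.0, 1.0)):
    """Return (mle, posterior_mean) for one aligned motif column.

    column: string holding one base per aligned site.
    pseudo: Dirichlet parameters (pseudo-counts), one per base in ALPHABET.
    """
    counts = Counter(column)
    k = len(column)                      # number of aligned sequences
    total_pseudo = sum(pseudo)
    mle = {b: counts[b] / k for b in ALPHABET}
    post = {
        b: (counts[b] + a) / (k + total_pseudo)
        for b, a in zip(ALPHABET, pseudo)
    }
    return mle, post

# First column of the ten aligned sites: nine T's and one C.
column = "TTTTTTTCTT"
mle, post = column_estimates(column)
```

The maximum likelihood estimate assigns probability zero to any base that never appears in the column, while the posterior mean keeps every base possible, which matters once the estimates are multiplied together to score new windows.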

21
Motif Alignment Model
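[Figure: sequences 1, ..., k, where sequence k has length n_k and contains a motif site of width w starting at position a_k.]
Alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$

Every non-site position follows a common multinomial distribution
with $p_0 = (p_{0,1}, \ldots, p_{0,20})$.
Every position i in the motif element follows the probability
distribution $p_i = (p_{i,1}, \ldots, p_{i,20})$.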
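
Under this model, a candidate window can be scored by comparing its probability under the motif columns with its probability under the background. The sketch below is one way to write that log-odds score, assuming a 4-letter DNA alphabet rather than the 20 residue types on the slide; `window_log_odds` and the toy probabilities are illustrative assumptions.

```python
import math

def window_log_odds(window, motif_probs, background):
    """log [ P(window | motif columns) / P(window | background) ].

    motif_probs: list of dicts, one per motif position (p_1 ... p_w).
    background:  dict of probabilities for non-site positions (p_0).
    """
    return sum(
        math.log(motif_probs[i][base]) - math.log(background[base])
        for i, base in enumerate(window)
    )

# Toy parameters: uniform background, a width-3 motif preferring T, G, A.
background = {b: 0.25 for b in "ACGT"}
motif_probs = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
score_site = window_log_odds("TGA", motif_probs, background)
score_other = window_log_odds("CCC", motif_probs, background)
```

A positive score means the window is better explained by the motif than by the background; the same ratio reappears in the EM and Gibbs sampling slides.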

22
Handling missing data --- the EM method

This is a standard missing data problem
Given $A$, estimating $\Theta$ (the motif parameters) is easy; given $\Theta$, inferring $A$ is easy
The EM approach
E-step: compute the predictive probability that the j-th
position in sequence k is a site.
M-step: multinomial estimates with fractional counts.

Lawrence and Reilly (1990)
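
The E-step and M-step above can be sketched as follows. This is a simplified illustration, not the published implementation: it assumes exactly one site per sequence, a uniform prior over start positions, a fixed background, and a small pseudo-count to avoid zeros. The function name `em_motif`, the toy sequences, and the starting guess are all assumptions made for the example.

```python
ALPHABET = "ACGT"

def em_motif(seqs, w, theta, background, n_iter=20, pseudo=0.1):
    """EM for a one-site-per-sequence motif model.

    theta: list of w dicts of column probabilities (the starting point).
    Returns (theta, post), where post[k][j] is the final E-step probability
    that the site in sequence k starts at position j.
    """
    for _ in range(n_iter):
        # E-step: predictive probability of each start position.
        post = []
        for s in seqs:
            weights = []
            for j in range(len(s) - w + 1):
                ratio = 1.0
                for i in range(w):
                    ratio *= theta[i][s[j + i]] / background[s[j + i]]
                weights.append(ratio)
            total = sum(weights)
            post.append([x / total for x in weights])
        # M-step: multinomial estimates from fractional counts.
        counts = [{b: pseudo for b in ALPHABET} for _ in range(w)]
        for s, p in zip(seqs, post):
            for j, pj in enumerate(p):
                for i in range(w):
                    counts[i][s[j + i]] += pj
        theta = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    return theta, post

# Toy data: each sequence contains TTGACA once.
seqs = ["ACGTTGACAG", "TTGACACCGT", "GGCATTGACA"]
background = {b: 0.25 for b in ALPHABET}
# Rough starting guess that mildly favours TTGACA (an assumption for the demo).
theta0 = [{b: (0.4 if b == c else 0.2) for b in ALPHABET} for c in "TTGACA"]
theta, post = em_motif(seqs, 6, theta0, background)
starts = [max(range(len(p)), key=p.__getitem__) for p in post]
```

EM only climbs to a local maximum, so the result depends on the starting guess; that sensitivity is part of the motivation for the stochastic methods later in the talk.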
23
A sequential updating method
(Stormo and Hartzell 1989).

Start with the first 2 sequences find the top N
(50, say) best patterns based on pair-wise
comparison
Consider a new sequence at each step
There are $(L_k - W + 1)$ possible locations for the
pattern
Each possible location is compared with the N
(50) current patterns, resulting in $N \times (L_k - W + 1)$
new patterns
Choose the best N among these new patterns
Continue until the last sequence
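
The progressive search can be read as a beam search over partial alignments. The sketch below follows the steps listed above, but the scoring function is an assumption: it uses the information content of the aligned windows against a uniform background, since the slide does not say how "best" is measured. The names `info_content` and `progressive_search` and the toy sequences are illustrative.

```python
import math
from itertools import product

def info_content(sites, background=0.25):
    """Relative entropy of the aligned sites against a uniform background."""
    n = len(sites)
    total = 0.0
    for col in zip(*sites):
        for b in set(col):
            f = col.count(b) / n
            total += f * math.log(f / background)
    return total

def windows(seq, w):
    """All (L - w + 1) windows of width w in seq."""
    return [seq[j:j + w] for j in range(len(seq) - w + 1)]

def progressive_search(seqs, w, n_keep=50):
    """Keep the best n_keep partial alignments, adding one sequence at a time."""
    # Start with every pair of windows from the first two sequences.
    pairs = [[a, b] for a, b in product(windows(seqs[0], w), windows(seqs[1], w))]
    beam = sorted(pairs, key=info_content, reverse=True)[:n_keep]
    for seq in seqs[2:]:
        # Each window of the new sequence extends each current pattern.
        candidates = [p + [win] for p in beam for win in windows(seq, w)]
        beam = sorted(candidates, key=info_content, reverse=True)[:n_keep]
    return beam[0]

seqs = ["ACGTTGACAG", "TTGACACCGT", "GGCATTGACA"]
best = progressive_search(seqs, 6, n_keep=10)
```

Because only the top N patterns survive each step, the result can depend on the order in which sequences are added.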

24
Connection with sequential imputation

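Missing data: $Z = (z_1, \ldots, z_K)$; observed data: $Y = (y_1, \ldots, y_K)$; plus a parameter.
We can impute multiple copies of each $z_j$ by sampling from its predictive distribution given everything processed so far.
Each imputation carries an incremental weight.

[Figure: $z_1$, $z_2$, $z_3$ already imputed; $z_j$ to be imputed next.]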
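
The formulas on this slide did not survive in the transcript. For orientation, the usual sequential-imputation recursion has the form below; it is offered as the standard textbook form, on the assumption that it matches what the slide showed. Here $z_j^{*}$ is the imputed value, $u_j$ the incremental weight, and the parameter is integrated out.

```latex
z_j^{*} \sim p\bigl(z_j \mid y_1,\ldots,y_j,\; z_1^{*},\ldots,z_{j-1}^{*}\bigr),
\qquad
u_j = p\bigl(y_j \mid y_1,\ldots,y_{j-1},\; z_1^{*},\ldots,z_{j-1}^{*}\bigr),
\qquad
w = \prod_{j=1}^{K} u_j .
```

Repeating the recursion independently gives several weighted imputations of $Z$, which parallels keeping N candidate patterns in the progressive search.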
25
Gibbs sampling approach

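Let $\Theta = (\theta_0, \theta_1, \ldots, \theta_w)$ be the parameter and $A = \{a_1, a_2, \ldots, a_K\}$ the alignment variable
Iterative sampling: $P(\Theta \mid A, \text{Data})$ and $P(A \mid \Theta, \text{Data})$
Draw from $[\Theta \mid A, \text{Data}]$, then draw from $[A \mid \Theta, \text{Data}]$
Predictive updating: pretend that K-1 sequences
have been aligned. We stochastically predict for
the K-th sequence!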

26
The Algorithm

Initialized by choosing random starting positions
Iterate the following steps many times
Randomly or systematically choose a sequence,
say, sequence k, to exclude
Carry out the predictive-updating step to update
ak
Stop when no changes observed, or some criterion
met

27
The PU-step
1. Compute predictive frequencies for each
position i in the motif:
$c_{ij}$ = count of amino acid type j at position i;
$c_{0j}$ = count of amino acid type j in all non-site positions;
$q_{ij} = (c_{ij} + b_j) / (K - 1 + B)$, where the $b_j$ are pseudo-counts and $B$ is their sum over residue types.
2. Sample from the predictive distribution of $a_k$.
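
The algorithm and its PU-step can be sketched together as a small site sampler. This is a minimal illustration under stated assumptions: a DNA alphabet instead of amino acids, one site per sequence, equal pseudo-counts for every base, a systematic scan, and a fixed number of sweeps in place of a convergence criterion. The names `predictive_update` and `gibbs_site_sampler` are chosen here.

```python
import random

ALPHABET = "ACGT"

def predictive_update(seqs, starts, k, w, rng, pseudo=0.5):
    """Resample the site start of sequence k given the other K-1 aligned sites."""
    others = [i for i in range(len(seqs)) if i != k]
    c = [{b: 0 for b in ALPHABET} for _ in range(w)]  # c[i][j]: motif counts
    c0 = {b: 0 for b in ALPHABET}                     # non-site counts
    for i in others:
        s, a = seqs[i], starts[i]
        for pos, b in enumerate(s):
            if a <= pos < a + w:
                c[pos - a][b] += 1
            else:
                c0[b] += 1
    B = pseudo * len(ALPHABET)                        # total pseudo-count
    q = [
        {b: (c[i][b] + pseudo) / (len(others) + B) for b in ALPHABET}
        for i in range(w)
    ]
    n0 = sum(c0.values())
    q0 = {b: (c0[b] + pseudo) / (n0 + B) for b in ALPHABET}
    # Weight of each candidate start: motif likelihood over background.
    s = seqs[k]
    weights = []
    for j in range(len(s) - w + 1):
        ratio = 1.0
        for i in range(w):
            ratio *= q[i][s[j + i]] / q0[s[j + i]]
        weights.append(ratio)
    # Sample (rather than maximise) a new start in proportion to the weights.
    return rng.choices(range(len(weights)), weights=weights)[0]

def gibbs_site_sampler(seqs, w, n_sweeps=200, seed=0):
    """Random initial positions, then repeated predictive updates."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_sweeps):
        for k in range(len(seqs)):   # systematic scan over sequences
            starts[k] = predictive_update(seqs, starts, k, w, rng)
    return starts

seqs = ["ACGTTGACAG", "TTGACACCGT", "GGCATTGACA"]
sampled_starts = gibbs_site_sampler(seqs, 6)
```

The returned positions are one draw from the sampler, not a guaranteed optimum; in practice one would track the best-scoring alignment seen across sweeps or across several seeds.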
28
Why Does It work? ---An Analogy
Crystalize
29
How to determine the set of co-regulated genes?
(how to derive the dataset)
By courtesy of Xiaole Liu

Assemble the dataset based on scientific
knowledge
Observe gene expression profiles at different
cell states

(Approach taken by Church et al.)
GeneChip detects the expression of every gene at a
certain cell state (like tasting the food to tell which
dishes were cooked, and how much, in each situation)
30
Or Cross-species Comparison

Find similar genes (homologs)
They may correspond to similar control mechanisms
Analogy: English breakfast, French breakfast,
Italian breakfast, Chinese breakfast
We compared 8 bacterial genomes and found > 2000
binding motifs. Estimated accuracy: 80%

31
Repeated Motifs: Gibbs Motif Sampler

Some sequences have multiple repeats of several
motifs
Some sequences do not have any motif site at all
Sequences are put together mostly based on
certain derived and imprecise information

32
Idea 1: Mixture modeling --- Motif sampler

View the dataset as a long sequence with m motif
types
Idea: partition the input sequence into segments
that correspond to different (unknown) motif
models.
It is a mixture model (unsupervised learning).
Implement a predictive updating scheme.

33
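Special Case: Bernoulli Sampler

Sequence data: $R = r_1 r_2 r_3 \cdots r_N$
Indicator variable: $D = d_1 d_2 d_3 \cdots d_N$, with $d_i = 1$ if position i is the start of an element and $d_i = 0$ if not
Likelihood: $p(R, D \mid \Theta, \epsilon)$, where $\Theta$ is the parameter for the motif model and $\epsilon$ is the prior
probability that $d_i = 1$
Predictive update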
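
The predictive-update formula itself is missing from the transcript. As a rough guide to what such an update computes, the sketch below gives the posterior probability that a position starts an element, using plain posterior odds: the prior odds times the motif-to-background likelihood ratio of the segment. It assumes the motif parameters are held fixed and that elements do not overlap, which simplifies the actual predictive update; `site_probability`, `theta`, `theta0`, and `eps` are names chosen for this example.

```python
def site_probability(segment, theta, theta0, eps):
    """P(d_i = 1 | segment, Theta, eps) for the w-long segment starting at i.

    theta:  list of w dicts, the motif column probabilities.
    theta0: dict of background probabilities.
    eps:    prior probability that any given position starts an element.
    """
    odds = eps / (1.0 - eps)                 # prior odds
    for col, r in zip(theta, segment):
        odds *= col[r] / theta0[r]           # likelihood ratio, column by column
    return odds / (1.0 + odds)

theta0 = {b: 0.25 for b in "ACGT"}
theta = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
p_match = site_probability("TG", theta, theta0, eps=0.1)
p_mismatch = site_probability("CC", theta, theta0, eps=0.1)
```

Each $d_i$ would then be drawn as a Bernoulli variable with this probability, which is where the sampler gets its name.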
34
Idea 2: Threshold Sampler

Use the usual iterative updating
Given two cutoff values $c_1 < c_2$: add all positions
whose predictive probability > $c_2$; sample those between
$c_1$ and $c_2$
Advantage: less susceptible to sequence
correlations and low-complexity features.
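
The two-cutoff rule can be written in a few lines. In this sketch, positions above `c2` are always accepted, positions below `c1` are always rejected, and positions in between are accepted with probability equal to their predictive probability. That last choice is an assumption, since the slide only says "sample"; the function name `threshold_select` is also illustrative.

```python
import random

def threshold_select(probs, c1, c2, rng=None):
    """Pick site positions from predictive probabilities using two cutoffs."""
    if rng is None:
        rng = random.Random(0)
    chosen = []
    for pos, p in enumerate(probs):
        if p > c2:
            chosen.append(pos)               # confident: always add
        elif p >= c1 and rng.random() < p:
            chosen.append(pos)               # uncertain: sample
        # p < c1: never add
    return chosen

probs = [0.05, 0.95, 0.5, 0.99, 0.0]
selected = threshold_select(probs, c1=0.2, c2=0.9)
```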

35
Further Modeling Efforts
By courtesy of Xiaole Liu

DNA can be read
in two directions
GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
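
Reading DNA in both directions means that a site can appear either as the motif itself or as its reverse complement. The sketch below scans for exact matches on both strands, using the first two sequences on this slide; exact matching stands in for the probabilistic scoring a real sampler would use, and the function names are chosen here.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_both_strands(seq, motif):
    """Return (position, strand) for every exact occurrence on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for j in range(len(seq) - len(motif) + 1):
        window = seq[j:j + len(motif)]
        if window == motif:
            hits.append((j, "+"))
        if window == rc:
            hits.append((j, "-"))
    return hits

forward_hits = find_both_strands("GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC", "ATTTACC")
reverse_hits = find_both_strands("CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC", "ATTTACC")
```

The second sequence carries GGTAAAT, the reverse complement of ATTTACC, so it is reported on the "-" strand.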

36
Further Modeling Efforts
By courtesy of Xiaole Liu

Some sequence patterns have multiple parts
GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
CGCTAGCCATTTACCGATCTTTGGCGTTCTCGAGAATTGCCTAT

Sample jointly
37
Some are palindromic
TGTGAtcaaccTCACA
Spacing information
D
Translation start
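
A palindromic site reads the same on both strands, so its left arm is the reverse complement of its right arm. The sketch below measures how many outer bases of a site pair up in this way, applied to the example site above; the function name `palindromic_arm_length` is chosen for illustration.

```python
def palindromic_arm_length(site):
    """Number of outer bases whose reverse complement matches the other end."""
    site = site.upper()
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    n = 0
    # Compare base n from the left with base n from the right.
    while n < len(site) // 2 and comp[site[n]] == site[-1 - n]:
        n += 1
    return n

arm = palindromic_arm_length("TGTGAtcaaccTCACA")
```

For this site the matching arms are TGTGA and TCACA, separated by the lowercase spacer. A motif model can exploit that symmetry by tying the parameters of paired columns.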
38
Examples

Bacterial regulation
Several proteobacteria genomes (9)
Find orthologs of each E. coli ORF (2113) in other
genomes
Examine the 5' upstream UTR of these orthologs to
find motifs
We predicted 2097 sites; some new findings
confirmed
Human-mouse comparison
Yeast study
S. cerevisiae telomere-binding protein Rap1
60 noncoding sequences interacted with Rap1
lengths range 163-1339
Some experimentally determined sites with known
pattern

40
The Hidden Markov Model

Given $z_s$, $y_s \sim f(y_s \mid z_s, \theta)$, and the $z_s$
follow a Markov process with transition $p_s(z_s \mid z_{s-1}, \theta)$.
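
With hidden states following a Markov chain and each observation depending only on its state, the likelihood of a sequence can be computed by the forward recursion. The sketch below is a generic discrete-HMM version, assuming a transition matrix that does not change with position (the slide allows it to depend on s); the two-state "bg"/"site" example and all its numbers are made up for illustration.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(y_1 ... y_S) for a discrete HMM, summing over all hidden paths.

    init[z]:      probability that the first hidden state is z.
    trans[zp][z]: probability of moving from state zp to state z.
    emit[z][y]:   probability that state z emits symbol y.
    """
    states = list(init)
    alpha = {z: init[z] * emit[z][obs[0]] for z in states}
    for y in obs[1:]:
        alpha = {
            z: emit[z][y] * sum(alpha[zp] * trans[zp][z] for zp in states)
            for z in states
        }
    return sum(alpha.values())

init = {"bg": 0.9, "site": 0.1}
trans = {
    "bg": {"bg": 0.9, "site": 0.1},
    "site": {"bg": 0.3, "site": 0.7},
}
emit = {
    "bg": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "site": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
}
likelihood = forward_likelihood("TTAT", init, trans, emit)
```

The recursion costs time linear in the sequence length, compared with the exponential cost of enumerating hidden paths directly.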

The State Space Model
41
What Are Hidden in Sequence Alignment?

HMM architecture: transition diagram for the
underlying Markov chain.

42
Future Work

Modeling regulatory modules and developing a
sampling strategy for their discovery.
Combining gene expression data with sequence data
for studying gene regulation.
Hierarchical protein classification models.
Motif classification.

43
Young, yet Promising
44
References (Self-serving)

Liu, X., Brutlag, D. and Liu, J.S. (2000). BioProspector: a new motif finding algorithm.
Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1999). Markovian structures in biological sequence alignments. J. Amer. Statist. Assoc., 1-15.
Neuwald, A.F., Liu, J.S., Lipman, D.J., and Lawrence, C.E. (1997). Extracting protein alignment models from the sequence database. Nucleic Acids Res. 25, 1665-1677.
Liu, J.S., Neuwald, A.F., and Lawrence, C.E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Statist. Assoc. 90, 1156-1170.
Neuwald, A.F., Liu, J.S., and Lawrence, C.E. (1995). Gibbs motif sampling: detection of .. Protein Science 4, 1618-1632.
Lawrence, et al. (1993). Science 262, 208-214.