Title: Finding Regulatory Binding Motifs in Genomic Sequences


1
Finding Regulatory Binding Motifs in Genomic Sequences
  • Jun Liu
  • Department of Statistics
  • Harvard University
  • Email: jliu_at_stat.Harvard.edu
  • http://www.fas.Harvard.edu/junliu

2
Outline
  • Background
  • The basic motif model and early attempts
  • The EM-method
  • A progressive search method
  • The Gibbs sampling approach
  • Motif sampler and mixture model
  • Threshold sampler
  • Further modeling efforts
  • Applications

3
Genome Sequence in Year 2000
DNA (A,T,G,C) → RNA (A,U,G,C) → Protein (A,R,N,D,C,E,Q,G,H,I,L,K, ...)
  • Human genome first draft in year 2000?
  • Three billion bases of DNA
  • Blueprint for human species
  • Already > 20 smaller genomes complete
  • E. coli, H. influenzae,
  • Archaeoglobus fulgidus,
  • C. elegans, S. cerevisiae

4
Other High Throughput Data
  • Gene expression arrays
  • Single nucleotide point mutations (SNPs)
  • In 80,000 human genes
  • More coming
  • Focus: genomic comparisons to reveal patterns --
    learn about nature's words

5
Get the Complete Sequence
By courtesy of Xiaole Liu
  • 3-billion letters long
  • 140 thousand genes
  • Alphabet is A, T, G and C
  • No punctuation, no spaces

ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
6
Each Cell Is Like a Chef
By courtesy of Xiaole Liu
7
Information in DNA
By courtesy of Xiaole Liu
Between genes: 97%
Junk?
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC ATTTACCA
CATCGCATCACCAGTTCAGGATAGACACGGACG GCCTCGATTGACGGTG
GTACAGTTCAATGACAACCTGACTA TCTCGTTAGGACCCATGCGTACGA
CCCGTTTAAATCGAGAG CGCTAGCCCGTCATCGATCTTGTTCGAATCGC
GAATTGCCT
8
Information in DNA
By courtesy of Xiaole Liu
  • Between genes: 97%; Genes: 3%
  • Regulation: When, Where, Amount, Other Conditions, etc.
  • ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTCTC
  • ATTTACCACATCGCATCACTACGACGGATAGACACGGACG
  • GCCTCGATTGACGGTGGTACAGTTCAATGACAACCTGACTA
  • TCTCGTTAGGACCCATGCGTACGACCCGTTTAAATCGAGAG
  • CGCTAGGTCATCCCAGATCTTGTTCGAATCGCGAATTGCCT

Milk -> Yogurt
Egg -> Omelet
Fish -> Sushi
Flour -> Cake
Beef -> Burger
9
Search for Regulatory Binding Sites
  • Gene Transcription and Regulation
  • Transcription initiated by RNA polymerase binding
    at the so-called promoter region (TATA-box or
    -10, -35)
  • Regulated by (regulatory) proteins binding to a
    segment of the genome near the promoter
    region
  • These binding sites on DNA are often similar in
    composition

[Figure: gene diagram oriented 5' to 3', showing RNA polymerase, enhancers and repressors, the promoter region, the 5' UTR, the start codon (AUG), and the translation start]
10
(No Transcript)
11
(No Transcript)
12
Decode Gene Regulation
By courtesy of Xiaole Liu
  • GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
  • GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
  • TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
  • CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

signal to turn on/off switches
Co-regulated genes
The same regulatory protein is often used for
co-expressed genes
Transcription start
13
Decode Gene Regulation
By courtesy of Xiaole Liu
  • GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATATTTACCACCAGTTCAGACACGGACGGC
  • GCCTCGATTTACCGTGGTACAGTTCAAACCTGACTAAA
  • TCTCGTTAGGACCATATTTACCACCCACATCGAGAGCG
  • CGCTAGCCATTTACCGATCTTGTTCGAGAATTGCCTAT

Look at genes always expressed together: Upstream Regions, Co-expressed Genes
Scrambled Egg
Bacon
Cereal
Hash Brown
Orange Juice
14
The Particular Dataset
  • 18 DNA segments, each 105 bp long.
  • There is at least one experimentally known CRP
    binding site in each sequence.
  • The binding sites are about 16-19 base pairs
    long, with considerable variability in their
    contents.
  • Interested in seeing if we can find these sites
    computationally.

15
The Test Data Set
16
Truth?
17
Finding Motifs in Multiple Sequences
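[Figure: a set of sequences; sequence k has length n_k and contains a motif of width w starting at position a_k]
Alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$
Objective: find the best common patterns.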
18
Statistical Models
  • How do we describe patterns?
  • Frequencies of residue types (nucleotides or
    amino acids). Theoretical basis?
  • Multinomial distribution --- more generally, a
    model

TTTGATCGTTTTCACAAAA
TTTGCACGGCGTCACACTT
TGTGAGCATGGTCATATTT
TGCAAAGGACGTCACATTA
TGTTAAATTGATCACGTTT
TTTGAACCAGATCGCATTA
TGTAAACGATTCCACTAAT
CGTGATCAACCCCTCAATT
TGTGAGTTAGCTCACTCAT
TGTAACAGAGATCACACAA
A typical aligned motif
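To make the idea of describing a pattern by column-wise frequencies concrete, here is a minimal Python sketch that tabulates base counts per column for the ten aligned sites above. It assumes the fragment split across two lines in the transcript (TGTAAACG + ATTCCACTAAT) is a single 19-base site; the function names `count_matrix` and `consensus` are illustrative and do not come from the slides.

```python
from collections import Counter

ALPHABET = "ACGT"

# The ten aligned sites from the slide, each 19 bases wide.
# Assumption: the split fragment "TGTAAACG" + "ATTCCACTAAT" is one site.
SITES = [
    "TTTGATCGTTTTCACAAAA", "TTTGCACGGCGTCACACTT",
    "TGTGAGCATGGTCATATTT", "TGCAAAGGACGTCACATTA",
    "TGTTAAATTGATCACGTTT", "TTTGAACCAGATCGCATTA",
    "TGTAAACGATTCCACTAAT", "CGTGATCAACCCCTCAATT",
    "TGTGAGTTAGCTCACTCAT", "TGTAACAGAGATCACACAA",
]


def count_matrix(sites):
    """Return one dict of base counts per alignment column."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all aligned sites must have the same width")
    columns = []
    for i in range(width):
        counts = Counter(s[i] for s in sites)
        columns.append({b: counts.get(b, 0) for b in ALPHABET})
    return columns


def consensus(columns):
    """Most frequent base per column (ties broken by ALPHABET order)."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in columns)


counts = count_matrix(SITES)
for i, col in enumerate(counts, start=1):
    print(i, col)
print("consensus:", consensus(counts))
```

Each row of the printed table is the count vector that a per-column multinomial model would be fitted to.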
19
Multinomial Distribution
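A total of k sequences
[Figure: aligned motif columns labeled p1, p2, ..., p6]
Model M_i for the i-th column:
$(k_{i,1}, k_{i,2}, \ldots, k_{i,4}) \sim \text{Multinom}(k, p_i)$, where $p_i = (p_{i,1}, \ldots, p_{i,4})$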
20
How to estimate?
  • The maximum likelihood
  • Bayesian estimate
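  • Prior: $p_i \sim \text{Dirichlet}(a_{i,1}, \ldots, a_{i,4})$,
    where the $a_{i,j}$ act as pseudo-counts
  • Posterior: $p_i \mid \text{obs} \sim \text{Dirichlet}(a_{i,1} + k_{i,1}, \ldots, a_{i,4} + k_{i,4})$
  • Posterior mean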
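The formula for the posterior mean did not survive in the transcript. For a Dirichlet prior and multinomial counts the standard result is that component j has posterior mean (a_ij + k_ij) / (sum of all a_ij + k). The sketch below, with an illustrative function name `posterior_mean`, compares it with the maximum likelihood estimate for one column.

```python
def posterior_mean(counts, pseudo):
    """Posterior mean of multinomial probabilities under a Dirichlet prior.

    counts: observed counts (k_i1, ..., k_i4) for one motif column
    pseudo: Dirichlet parameters (a_i1, ..., a_i4), i.e. pseudo-counts
    """
    if len(counts) != len(pseudo):
        raise ValueError("counts and pseudo-counts must have equal length")
    total = sum(counts) + sum(pseudo)
    return [(k + a) / total for k, a in zip(counts, pseudo)]


# Example column with counts for A, C, G, T out of k = 10 sequences.
column = [0, 1, 0, 9]
mle = [k / sum(column) for k in column]           # maximum likelihood
bayes = posterior_mean(column, [0.5, 0.5, 0.5, 0.5])
print("MLE:  ", mle)
print("Bayes:", bayes)
```

The pseudo-counts keep unobserved bases from receiving probability exactly zero, which matters once these estimates are multiplied together to score candidate sites.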

21
Motif Alignment Model
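[Figure: a set of sequences; sequence k has length n_k and contains a motif of width w starting at position a_k]
Alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$
  • Every non-site position follows a common
    multinomial with $p_0 = (p_{0,1}, \ldots, p_{0,20})$
  • Every position i in the motif element follows
    the probability distribution
    $p_i = (p_{i,1}, \ldots, p_{i,20})$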

22
Handling missing data --- the EM method
  • This is a standard missing data problem
  • Given $A$, $\Theta$ is easy; given $\Theta$, $A$ is easy
  • The EM approach
  • E-step: compute the predictive probability that
    the j-th position in sequence k is a site.
  • M-step: multinomial with fractional counts.

Lawrence and Reilly (1990)
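The E-step and M-step above can be sketched as follows. This is a simplified illustration rather than the published Lawrence and Reilly procedure: it assumes exactly one site per sequence, holds the background distribution fixed at the overall base frequencies, and starts from a user-supplied guess `init_site` (a hypothetical parameter). Sequences are assumed to be uppercase A/C/G/T and at least as long as the motif.

```python
ALPHABET = "ACGT"


def em_motif(seqs, init_site, n_iter=20, pseudo=0.1):
    """EM for a one-site-per-sequence motif model (illustrative sketch)."""
    w, K = len(init_site), len(seqs)
    total = sum(len(s) for s in seqs)
    # Background p0: overall base frequencies, held fixed (a simplification).
    bg = {b: (sum(s.count(b) for s in seqs) + pseudo) / (total + 4 * pseudo)
          for b in ALPHABET}
    # Initial motif columns favour the bases of the starting guess.
    theta = [{b: (0.7 if b == c else 0.1) for b in ALPHABET}
             for c in init_site]
    post = []
    for _ in range(n_iter):
        # E-step: probability that each position starts the site.
        post = []
        for s in seqs:
            weights = []
            for p in range(len(s) - w + 1):
                ratio = 1.0
                for i in range(w):
                    ratio *= theta[i][s[p + i]] / bg[s[p + i]]
                weights.append(ratio)
            norm = sum(weights)
            post.append([x / norm for x in weights])
        # M-step: multinomial estimates from fractional counts.
        counts = [{b: pseudo for b in ALPHABET} for _ in range(w)]
        for s, probs in zip(seqs, post):
            for p, pr in enumerate(probs):
                for i in range(w):
                    counts[i][s[p + i]] += pr
        theta = [{b: col[b] / (K + 4 * pseudo) for b in ALPHABET}
                 for col in counts]
    return theta, post


toy = ["CCCCTGTGAACCC", "CCTGTGACCCCCC", "CCCCCCCTGTGAG", "TGTGATCCCCCCC"]
theta, post = em_motif(toy, "TGTGA")
best = [max(range(len(r)), key=r.__getitem__) for r in post]
print(best)
```

Like any EM procedure, this converges to a local optimum, so the result depends on the starting guess; that sensitivity is one motivation for the stochastic methods on the later slides.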
23
A sequential updating method
(Stormo and Hartzell 1989).
  • Start with the first 2 sequences: find the top N
    (50, say) best patterns based on pair-wise
    comparison
  • Consider a new sequence at each step
  • There are $(L_k - W + 1)$ possible locations for the
    pattern
  • Each possible location is compared with the N
    (50) current patterns, resulting in
    $N \times (L_k - W + 1)$ new patterns
  • Choose the best N among these new patterns
  • Continue until the last sequence
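The steps above amount to a beam search over alignments. The sketch below is one possible reading, not the original program: it assumes patterns are ranked by information content against a uniform background with a small pseudo-count, and the names `info_content`, `progressive_search` and `beam` are illustrative.

```python
import math

ALPHABET = "ACGT"


def info_content(sites, pseudo=0.5):
    """Information content (bits) of aligned sites, uniform background."""
    n = len(sites)
    total = 0.0
    for col in zip(*sites):
        for b in ALPHABET:
            f = (col.count(b) + pseudo) / (n + 4 * pseudo)
            total += f * math.log2(f / 0.25)
    return total


def windows(seq, w):
    """All width-w windows of seq, in order of start position."""
    return [seq[p:p + w] for p in range(len(seq) - w + 1)]


def progressive_search(seqs, w, beam=50):
    """Add one sequence at a time, keeping the `beam` best patterns."""
    # Seeding with every window of sequence 1 and extending by sequence 2
    # scores all pairs from the first two sequences, as on the slide.
    current = [([p], [site]) for p, site in enumerate(windows(seqs[0], w))]
    for s in seqs[1:]:
        candidates = []
        for starts, sites in current:
            for p, site in enumerate(windows(s, w)):
                candidates.append((starts + [p], sites + [site]))
        candidates.sort(key=lambda c: info_content(c[1]), reverse=True)
        current = candidates[:beam]
    return current[0][0]


toy = ["CCCCTGTGAACCC", "CCTGTGACCCCCC", "CCCCCCCTGTGAG", "TGTGATCCCCCCC"]
starts = progressive_search(toy, w=5, beam=10)
print(starts)
```

The result can depend on the order in which sequences are presented, since a pattern pruned early cannot be recovered later.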

24
Connection with sequential imputation
  • Missing data $Z = (z_1, \ldots, z_K)$; observed data
    $Y = (y_1, \ldots, y_K)$; parameter $\theta$.
  • We can impute multiple of zj by sampling from
  • The incremental weight is computed as

[Figure: z1, z2, z3, ..., zj ?]
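The two formulas on this slide are missing from the transcript. In the standard sequential imputation scheme, which is assumed here to be what the slide showed, each missing item is drawn in turn from its predictive distribution given everything imputed so far, and the weight accumulates the predictive probability of each new observation. The symbols $z_j^{*}$, $u_j$ and $w$ are notation chosen for this sketch.

```latex
% Imputation step: draw z_j given the data seen so far and earlier draws
z_j^{*} \sim p\left(z_j \mid y_1, \ldots, y_j,\; z_1^{*}, \ldots, z_{j-1}^{*}\right),
\qquad j = 1, \ldots, K

% Incremental weight for step j and the overall importance weight
u_j = p\left(y_j \mid y_1, \ldots, y_{j-1},\; z_1^{*}, \ldots, z_{j-1}^{*}\right),
\qquad
w = \prod_{j=1}^{K} u_j
```

Both distributions are taken with the parameter integrated out, which is what makes the scheme sequential in the data rather than iterative in the parameter.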
25
Gibbs sampling approach
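  • Let $\Theta = (\theta_0, \theta_1, \ldots, \theta_w)$ be the
    parameter and $A = \{a_1, a_2, \ldots, a_K\}$ the
    alignment variable
  • Iterative sampling: $P(\Theta \mid A, \text{Data})$,
    $P(A \mid \Theta, \text{Data})$
  • Draw from $[\Theta \mid A, \text{Data}]$, then draw from
    $[A \mid \Theta, \text{Data}]$
  • Predictive updating: pretend that K-1 sequences
    have been aligned, then stochastically predict
    the alignment of the K-th sequence.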

26
The Algorithm
  • Initialized by choosing random starting positions
  • Iterate the following steps many times
  • Randomly or systematically choose a sequence,
    say, sequence k, to exclude
  • Carry out the predictive-updating step to update
    ak
  • Stop when no changes observed, or some criterion
    met

27
The PU-step
1. Compute the predictive frequencies for each position i in the motif:
   $c_{ij}$ = count of residue type j at position i;
   $c_{0j}$ = count of residue type j in all non-site positions;
   $q_{ij} = (c_{ij} + b_j)/(K - 1 + B)$, where $B = \sum_j b_j$ is the total of the pseudo-counts.
2. Sample from the predictive distribution of $a_k$.
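Putting the algorithm and the PU-step together, a minimal site sampler might look like the sketch below. It assumes DNA sequences in uppercase A/C/G/T, exactly one site per sequence, equal pseudo-counts b_j for every base, and a fixed number of sweeps instead of a convergence test; the function name and parameters are illustrative.

```python
import random

ALPHABET = "ACGT"


def gibbs_site_sampler(seqs, w, n_sweeps=200, pseudo=0.5, seed=0):
    """Gibbs site sampler with predictive updating (illustrative sketch)."""
    rng = random.Random(seed)
    K = len(seqs)
    B = 4 * pseudo  # total pseudo-count, B = sum_j b_j
    # Initialise by choosing random starting positions a_k.
    a = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_sweeps):
        for k in range(K):  # systematically exclude sequence k
            # Step 1: counts from the other K-1 sequences, plus pseudo-counts.
            c = [{b: pseudo for b in ALPHABET} for _ in range(w)]
            c0 = {b: pseudo for b in ALPHABET}
            for m, s in enumerate(seqs):
                if m == k:
                    continue
                for pos, base in enumerate(s):
                    i = pos - a[m]
                    if 0 <= i < w:
                        c[i][base] += 1   # inside the current site
                    else:
                        c0[base] += 1     # non-site position
            q = [{b: col[b] / (K - 1 + B) for b in ALPHABET} for col in c]
            n0 = sum(c0.values())
            q0 = {b: c0[b] / n0 for b in ALPHABET}
            # Step 2: sample a_k in proportion to the motif/background ratio.
            s = seqs[k]
            weights = []
            for p in range(len(s) - w + 1):
                ratio = 1.0
                for i in range(w):
                    ratio *= q[i][s[p + i]] / q0[s[p + i]]
                weights.append(ratio)
            a[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return a


toy = ["CCCCTGTGAACCC", "CCTGTGACCCCCC", "CCCCCCCTGTGAG", "TGTGATCCCCCCC"]
positions = gibbs_site_sampler(toy, w=5)
print(positions)
```

Because each update is a random draw, the returned positions are one sample from the chain; in practice one would track the best-scoring alignment seen or average over many sweeps.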
28
Why Does It Work? --- An Analogy
Crystallize
29
How to determine the set of co-regulated genes?
(how to derive the dataset)
By courtesy of Xiaole Liu
  • Assemble the dataset based on scientific
    knowledge
  • Observe gene expression profiles at different
    cell states

(Approach taken by Church et al.)
GeneChip detects expression of every gene at a
certain cell state (tastes the food and tells the
kind and amount of dishes cooked at situations)
30
Or Cross-species Comparison
  • Find similar genes (homologs)
  • They may correspond to similar control mechanisms
  • Analogy: English breakfast, French breakfast,
    Italian breakfast, Chinese breakfast
  • We compared 8 bacterial genomes and found > 2000
    binding motifs. Estimated accuracy: 80%

31
Repeated Motifs: Gibbs Motif Sampler
  • Some sequences have multiple repeats of several
    motifs
  • Some sequences do not have any motif site at all
  • Sequences are put together mostly based on
    certain derived and imprecise information

32
Idea 1: Mixture modeling --- Motif sampler
  • View the dataset as a long sequence with m motif
    types
  • Idea: partition the input sequence into segments
    that correspond to different (unknown) motif
    models.
  • It is a mixture model (unsupervised learning).
  • Implement a predictive updating scheme.

33
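Special Case: Bernoulli Sampler
  • Sequence data: $R = r_1 r_2 r_3 \ldots r_N$
  • Indicator variable: $\Delta = \delta_1 \delta_2 \delta_3 \ldots \delta_N$
  • Likelihood: $p(R, \Delta \mid \Theta, \epsilon)$, where $\epsilon$ is the
    prior probability that $\delta_i = 1$
  • Predictive update

$\delta_i = 1$ if position i is the start of an element, $\delta_i = 0$ if not; $\Theta$ is the parameter for the motif model.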
34
Idea 2: Threshold Sampler
  • Use the usual iterative updating
  • Given two cutoff values $c_1 < c_2$: add all
    positions whose predictive probability is
    > $c_2$; sample those between $c_1$ and $c_2$
  • Advantage: less susceptible to sequence
    correlations and low-complexity features.
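The cutoff rule can be sketched in a few lines. Two details are assumptions here, since the slide does not spell them out: positions below c1 are excluded outright, and positions between the cutoffs are accepted with probability equal to their predictive probability. The name `threshold_update` is illustrative.

```python
import random


def threshold_update(pred_probs, c1, c2, rng=None):
    """Return a 0/1 site indicator for each predictive probability."""
    if not 0.0 <= c1 < c2 <= 1.0:
        raise ValueError("cutoffs must satisfy 0 <= c1 < c2 <= 1")
    rng = rng or random.Random()
    chosen = []
    for p in pred_probs:
        if p > c2:
            chosen.append(1)                       # confident: always add
        elif p >= c1:
            chosen.append(1 if rng.random() < p else 0)  # uncertain: sample
        else:
            chosen.append(0)                       # assumed: always exclude
    return chosen


flags = threshold_update([0.95, 0.01, 0.99, 0.5], c1=0.1, c2=0.9,
                         rng=random.Random(1))
print(flags)
```

Only the middle band is random, so strong sites are never dropped by an unlucky draw.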

35
Further Modeling Efforts
By courtesy of Xiaole Liu
  • DNA can be read in two directions
  • GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC
  • CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC
  • TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG
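In the three sequences above, the site ATTTACC appears directly in the first and as its reverse complement GGTAAAT in the other two. A small sketch of scanning both strands for an exact match (function names are illustrative; a real motif finder would score windows against a probability model rather than require exact matches):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_both_strands(seq, motif):
    """Return (start, strand) for exact matches on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for p in range(len(seq) - len(motif) + 1):
        window = seq[p:p + len(motif)]
        if window == motif:
            hits.append((p, "+"))
        elif window == rc:
            hits.append((p, "-"))
    return hits


seqs = [
    "GATGGCTGCACATTTACCTATGCCCTACGACCTCTCGC",
    "CACATCGCATGGTAAATACCAGTTCAGACACGGACGGC",
    "TCTCAGGTAAATCAGTCATACTACCCACATCGAGAGCG",
]
for s in seqs:
    print(find_both_strands(s, "ATTTACC"))
```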

36
Further Modeling Efforts
By courtesy of Xiaole Liu
  • Some sequence patterns have multiple parts
  • GACACATTTACCTATGC TGGCCCTACGACCTCTCGC
  • CACAATTTACCACCA TGGCGTGATCTCAGACACGGACGGC
  • GCCTCGATTTACCGTGGTATGGCTAGTTCTCAAACCTGACTAAA
  • TCTCGTTAGATTTACCACCCA TGGCCGTATCGAGAGCG
  • CGCTAGCCATTTACCGATCTTTGGCGTTCTCGAGAATTGCCTAT

Sample jointly
37
Some are palindromic
TGTGAtcaaccTCACA
Spacing information
D
Translation start
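The example site TGTGAtcaaccTCACA is palindromic in the DNA sense: its left arm equals the reverse complement of its right arm, with a spacer in between. A short check, using the illustrative helper name `palindromic_arm_length`:

```python
_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(_COMP)[::-1]


def palindromic_arm_length(site):
    """Longest h such that the first h bases equal the reverse
    complement of the last h bases."""
    s = site.upper()
    best = 0
    for h in range(1, len(s) // 2 + 1):
        if s[:h] != revcomp(s[-h:]):
            break  # a longer arm cannot match once a shorter one fails
        best = h
    return best


arm = palindromic_arm_length("TGTGAtcaaccTCACA")
print(arm)
```

A motif model can exploit this by tying the parameters of mirrored columns, which halves the number of free parameters for such sites.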
38
Examples
  • Bacterial regulation
  • Several proteobacteria genomes (9)
  • Find orthologs of each E. coli ORF (2113) in other
    genomes
  • Examine the 5' upstream UTR of these orthologs to
    find motifs
  • We predicted 2097 sites; some new findings
    confirmed
  • Human-mouse comparison
  • Yeast study
  • S. cerevisiae telomere-binding protein Rap1
  • 60 noncoding sequences interacted with Rap1
  • lengths range 1631339
  • Some experimentally determined sites with known
    pattern

39
(No Transcript)
40
The Hidden Markov Model
  • For given $z_s$, $y_s \sim f(y_s \mid z_s, \theta)$, and the $z_s$
    follow a Markov process with transition
    $p_s(z_s \mid z_{s-1}, \theta)$.

The State Space Model
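For a discrete hidden state, the likelihood of an observed sequence under this model can be computed with the forward recursion. The sketch below assumes time-homogeneous transitions and a two-state toy model ("bg" and "site") whose probabilities are made up for illustration.

```python
import math


def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of obs under a discrete HMM (scaled forward pass).

    init[z]      : probability that the first hidden state is z
    trans[u][z]  : transition probability from state u to state z
    emit[z][y]   : probability that state z emits symbol y
    """
    states = list(init)
    alpha = {z: init[z] * emit[z][obs[0]] for z in states}
    loglik = 0.0
    for y in obs[1:]:
        norm = sum(alpha.values())
        loglik += math.log(norm)  # rescale to avoid numerical underflow
        alpha = {
            z: sum(alpha[u] / norm * trans[u][z] for u in states) * emit[z][y]
            for z in states
        }
    return loglik + math.log(sum(alpha.values()))


# Hypothetical two-state model: background versus an A/T-rich "site" state.
init = {"bg": 0.9, "site": 0.1}
trans = {"bg": {"bg": 0.9, "site": 0.1}, "site": {"bg": 0.2, "site": 0.8}}
emit = {
    "bg": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "site": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
print(forward_loglik("ATTTACC", init, trans, emit))
```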
41
What Are Hidden in Sequence Alignment?
  • HMM architecture: transition diagram for the
    underlying Markov chain.

42
Future Work
  • Modeling regulatory modules and developing a
    sampling strategy for their discovery.
  • Combining gene expression data with sequence data
    for studying gene regulation.
  • Hierarchical protein classification models.
  • Motif classification.

43
Young, yet Promising
44
References (Self-serving)
  • Liu, X., Brutlag, D. and Liu, J.S. (2000).
    BioProspector: a new motif finding algorithm.
  • Liu, J.S., Neuwald, A.F., and Lawrence, C.E.
    (1999). Markovian structures in biological
    sequence alignments. J. Amer. Statist. Assoc.,
    1-15.
  • Neuwald, A.F., Liu, J.S., Lipman, D.J., and
    Lawrence, C.E. (1997). Extracting protein
    alignment models from the sequence database.
    Nucleic Acids Res. 25, 1665-1677.
  • Liu, J.S., Neuwald, A.F., and Lawrence, C.E.
    (1995). Bayesian models for multiple local
    sequence alignment and Gibbs sampling strategies.
    J. Amer. Statist. Assoc. 90, 1156-1170.
  • Neuwald, A.F., Liu, J.S., and Lawrence, C.E.
    (1995). Gibbs motif sampling: detection of ..
    Protein Science 4, 1618-1632.
  • Lawrence, et al. (1993). Science 262, 208-214.