Title: Biological Motif Discovery
1Biological Motif Discovery
- Concepts
- Motif Modeling and Motif Information
- EM and Gibbs Sampling
- Comparative Motif Prediction
- Applications
- Transcription Factor Binding Site Prediction
- Epitope Prediction
- Lab Practical
- DNA Motif Discovery with MEME and AlignACE
- Co-regulated genes from TB Boshoff data set
2Regulatory Motifs
Find promoter motifs associated with co-regulated
or functionally related genes
3Transcription Factor Binding Sites
- Very Small
- Highly Variable
- Constant Size
- Often repeated
- Low-complexity-ish
Slide Credit S. Batzoglou
4Other Motifs
- Splicing Signals
- Splice junctions
- Exonic Splicing Enhancers (ESE)
- Exonic Splicing Silencers (ESS)
- Protein Domains
- Glycosylation sites
- Kinase targets
- Targeting signals
- Protein Epitopes
- MHC binding specificities
5Essential Tasks
- Modeling Motifs
- How to computationally represent motifs
- Visualizing Motifs
- Motif Information
- Predicting Motif Instances
- Using the model to classify new sequences
- Learning Motif Structure
- Finding new motifs, assessing their quality
6Modeling Motifs
7Consensus Sequences
- Useful for publication
- IUPAC symbols for degenerate sites
- Not very amenable to computation
Nature Biotechnology 24, 423 - 425 (2006)
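A consensus sequence with IUPAC degenerate symbols can still be used for simple matching by expanding each symbol into the set of bases it stands for. The sketch below is a minimal illustration; the function name `consensus_to_regex` is hypothetical, and the example consensus is chosen only for demonstration.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Turn an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join("[" + IUPAC[s] + "]" for s in consensus.upper()))


pattern = consensus_to_regex("TGAYRTCA")
print(bool(pattern.fullmatch("TGACGTCA")))  # True: C matches Y, G matches R
print(bool(pattern.fullmatch("TGAGGTCA")))  # False: G does not match Y
```

A match here is all-or-nothing, which illustrates why consensus sequences are limited for computation: they carry no information about how strongly each base is preferred.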
8Probabilistic Model
Position Frequency Matrix (PFM): P_k(S | M), the probability of base S at motif position k (positions M1 ... MK)
   M1  M2  M3  M4  M5  MK
A  .1  .1  .4  .1  .2  .1
C  .5  .1  .2  .2  .2  .2
G  .2  .1  .2  .4  .5  .4
T  .2  .7  .2  .2  .1  .3
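9Scoring A Sequence
To score a sequence, we compare the motif model (PFM) to a null model
Null model: background DNA (B)
Log likelihood ratio: Score(S) = log [ P(S | PFM) / P(S | B) ] = sum over k of log [ P_k(S_k | M) / P(S_k | B) ]
Position Weight Matrix (PWM): the per-position, per-base entries log [ P_k(b | M) / P(b | B) ]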
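The log likelihood ratio score can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the three-position PFM is made up, the background is assumed uniform, the logarithm is taken base 2, and the names `pwm_from_pfm` and `score` are hypothetical.

```python
import math

# Hypothetical 3-position PFM: one dict per motif position, each summing to 1.
pfm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
# Assumed uniform background model B.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pwm_from_pfm(pfm, background):
    """Convert frequencies to log2 odds: log2(P_k(b | M) / P(b | B))."""
    return [
        {b: math.log2(col[b] / background[b]) for b in col}
        for col in pfm
    ]


def score(seq, pwm):
    """Log likelihood ratio of seq under the motif versus the background."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(col[base] for base, col in zip(seq, pwm))


pwm = pwm_from_pfm(pfm, background)
print(score("AGT", pwm))  # positive: more motif-like than background-like
print(score("CCT", pwm))  # negative: more background-like than motif-like
```

Because the score is a sum of per-position terms, the PWM can be precomputed once and reused for every candidate site.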
10Scoring a Sequence
Common threshold: 60% of maximum score
MacIsaac & Fraenkel (2006) PLoS Comp Bio
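A threshold expressed as a fraction of the maximum score can be applied while sliding the matrix along a sequence. The sketch below assumes a small made-up PWM of log2 odds whose maximum score is positive; `max_score` and `hits` are hypothetical names, and the 0.6 default simply mirrors the threshold quoted above.

```python
# Hypothetical 2-position PWM (log2 odds); the values are made up.
pwm = [
    {"A": 1.5, "C": -1.3, "G": -1.3, "T": -1.3},
    {"A": -1.3, "C": -1.3, "G": 1.5, "T": 0.0},
]


def max_score(pwm):
    """Best achievable score: the highest entry in each column, summed."""
    return sum(max(col.values()) for col in pwm)


def hits(seq, pwm, fraction=0.6):
    """Return (start, score) for every window scoring >= fraction * max."""
    cutoff = fraction * max_score(pwm)
    k = len(pwm)
    found = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        s = sum(col[base] for base, col in zip(window, pwm))
        if s >= cutoff:
            found.append((start, s))
    return found


print(hits("CCAGTTAT", pwm))  # [(2, 3.0)]
```

Here the maximum score is 3.0, so the cutoff is 1.8; the window "AG" at offset 2 passes, while "AT" at offset 6 scores 1.5 and is rejected.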
11Visualizing Motifs: Motif Logos
Represent both base frequency and conservation at
each position
Height of letter proportional to frequency of
base at that position
Height of stack proportional to conservation at
that position
12Motif Information
- The height of a stack is often called the motif
information at that position measured in bits
Information
Why is this a measure of information?
13Uncertainty and probability
Uncertainty is related to our surprise at an event
The sun will rise tomorrow
Not surprising (p ≈ 1)
The sun will not rise tomorrow
Very surprising (p << 1)
Uncertainty is inversely related to probability
of event
14Average Uncertainty
Two possible outcomes for sun rising
A: The sun will rise tomorrow
P(A) = p1
B: The sun will not rise tomorrow
P(B) = p2
What is our average uncertainty about the sun rising?
Entropy: H = -p1 log p1 - p2 log p2
15Entropy
- Entropy measures average uncertainty
- Entropy measures randomness
H(X) = -sum over x of P(x) log P(x)
If log is base 2, then the units are called bits
16Entropy versus randomness
Entropy is maximum at maximum randomness
Example: coin toss
P(heads) = 0.1: not very random, H(X) = 0.47 bits
P(heads) = 0.5: completely random, H(X) = 1 bit
[Plot: entropy H(X) as a function of P(heads)]
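The coin-toss numbers can be checked directly from the definition of entropy. The helper name `entropy` below is hypothetical; the sketch assumes base-2 logarithms so the result is in bits.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(round(entropy([0.1, 0.9]), 2))  # 0.47 (biased coin)
print(entropy([0.5, 0.5]))            # 1.0  (fair coin)
print(entropy([0.25] * 4))            # 2.0  (uniform over A, C, G, T)
```

The last line shows where the 2-bit ceiling for DNA comes from: four equally likely bases give the maximum entropy of log2(4) = 2 bits.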
17Entropy Examples
18Information Content
- Information is a decrease in uncertainty
Once I tell you the sun will rise, your uncertainty about the event decreases
Information = Hbefore(X) - Hafter(X)
Information is the difference in entropy before and after receiving information
19Motif Information
Motif Position Information = Hbackground(X) - Hmotif_i(X) = 2 - Hmotif_i(X)
Hbackground(X): prior uncertainty about the nucleotide
Hmotif_i(X): uncertainty after learning it is position i in a motif
Example: the background P(x) gives H(X) = 2 bits; the motif position P(x) gives H(X) = 0.63 bits
Uncertainty at this position has been reduced by 1.37 bits
20Motif Logo
Conserved residue: reduction of uncertainty of 2 bits
Little conservation: minimal reduction of uncertainty
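The two extremes of a logo stack can be reproduced numerically. The sketch below assumes a uniform background (so the prior uncertainty is 2 bits) and draws each letter with height equal to its frequency times the stack height, which is one common logo convention; the names `position_information` and `letter_heights` are hypothetical.

```python
import math


def position_information(column, alphabet_size=4):
    """Stack height in bits: log2(alphabet size) minus the column entropy."""
    h = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(alphabet_size) - h


def letter_heights(column):
    """Height of each letter: its frequency times the stack height."""
    info = position_information(column)
    return {base: p * info for base, p in column.items()}


conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
unconserved = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(position_information(conserved))    # 2.0 bits
print(position_information(unconserved))  # 0.0 bits
print(letter_heights(conserved)["A"])     # 2.0: the whole stack is one letter
```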
21Background DNA Frequency
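The definition of information assumes a uniform background DNA nucleotide frequency. What if the background frequency is not uniform?
Motif Position Information = Hbackground(X) - Hmotif_i(X), where Hbackground(X) is now less than 2 bits
Example: a motif position with information of -0.2 bits
Some motifs could have negative information!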
22A Different Measure
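- Relative entropy or Kullback-Leibler (KL) divergence
- Divergence between a true distribution and another
DKL(Pmotif || Pbackground) = sum over x of Pmotif(x) log2 [ Pmotif(x) / Pbackground(x) ]
True distribution: Pmotif
Other distribution: Pbackground
Properties
- DKL >= 0, so it is never negative
- DKL is larger the more different Pmotif is from Pbackground
- Same as Information if Pbackground is uniform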
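These properties of relative entropy can be verified on a single motif column. The sketch below uses made-up distributions; `kl_divergence` is a hypothetical helper, and the AT-rich background is an assumed example with 20% GC content.

```python
import math


def kl_divergence(p_motif, p_background):
    """D_KL(P_motif || P_background) in bits."""
    return sum(
        p * math.log2(p / p_background[base])
        for base, p in p_motif.items()
        if p > 0
    )


column = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # assumed 20% GC background

column_entropy = -sum(p * math.log2(p) for p in column.values())
# With a uniform background, KL divergence equals 2 - H(column).
print(kl_divergence(column, uniform), 2 - column_entropy)
# With a skewed background the value changes but stays non-negative.
print(kl_divergence(column, at_rich))
# A column identical to the background carries no information.
print(kl_divergence(at_rich, at_rich))  # 0.0
```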
23Comparing Both Methods
Information: assuming uniform background DNA
KL distance: assuming 20% GC content (e.g. Plasmodium)
24Online Logo Generation
http://weblogo.berkeley.edu/
http://biodev.hgen.pitt.edu/cgi-bin/enologos/enologos.cgi
25Finding New Motifs
26A Promoter Model
Length K
Motif
P_k(S | M)
The same motif model in all promoters
27Probability of a Sequence
Given a sequence (or sequences), a motif model, and a motif location
[Figure: a sequence spanning positions 1 to 100, with the motif A T A T G C at positions 60 to 65]
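When the motif location is known, the probability of a sequence factors into background terms outside the motif window and motif-model terms inside it. The sketch below works in log space and uses a short made-up sequence and a two-position model; `log_prob_sequence` is a hypothetical name, and starts are 0-based here, unlike the 1-based positions in the figure.

```python
import math


def log_prob_sequence(seq, pfm, start, background):
    """log P(seq | motif model, motif start).

    Bases inside the window [start, start + K) are scored with the motif
    columns; all other bases are scored with the background model.
    """
    k = len(pfm)
    total = 0.0
    for i, base in enumerate(seq):
        if start <= i < start + k:
            total += math.log(pfm[i - start][base])
        else:
            total += math.log(background[base])
    return total


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical 2-position motif model favouring "AT".
pfm = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
seq = "GGATCC"
print(log_prob_sequence(seq, pfm, 2, background))  # motif placed on "AT"
print(log_prob_sequence(seq, pfm, 0, background))  # motif placed on "GG"
```

Placing the motif on "AT" gives the higher log probability, which is the basis for choosing among candidate locations.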
28Parameterizing the Motif Model
Given multiple sequences and motif locations but no motif model
Motif instances: AATGCG, ATATGG, ATATCG, GATGCA
Count frequencies, then add pseudocounts
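Counting frequencies and adding pseudocounts can be sketched as follows, using the four motif instances listed on the slide. The function name `estimate_pfm` is hypothetical, and a pseudocount of 1 per base is an assumption; the slides do not state a value.

```python
def estimate_pfm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Column-wise base frequencies from aligned motif instances."""
    k = len(sites[0])
    pfm = []
    for pos in range(k):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pfm.append({base: counts[base] / total for base in alphabet})
    return pfm


sites = ["AATGCG", "ATATGG", "ATATCG", "GATGCA"]  # instances from the slide
pfm = estimate_pfm(sites)
print(pfm[0])  # {'A': 0.5, 'C': 0.125, 'G': 0.25, 'T': 0.125}
```

In the first column the observed bases are A, A, A, G; with one pseudocount per base the counts become 4, 1, 2, 1 out of 8. Without pseudocounts, C and T would get probability zero and any site containing them would be ruled out entirely.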
29Finding Known Motifs
Given multiple sequences and a motif model but no motif locations
Calculate P(Seq_window | Motif) for every starting location
Choose the best starting location in each sequence
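Scanning every window and keeping the best one per sequence can be sketched as below. The three-position model and the test sequences are made up, and the names `window_log_prob` and `best_start` are hypothetical; starts are 0-based.

```python
import math


def window_log_prob(window, pfm):
    """log P(window | motif): sum of per-position log probabilities."""
    return sum(math.log(col[base]) for base, col in zip(window, pfm))


def best_start(seq, pfm):
    """Return (start, log probability) of the best-scoring window in seq."""
    k = len(pfm)
    scores = [
        window_log_prob(seq[i:i + k], pfm) for i in range(len(seq) - k + 1)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical motif model favouring "ATG".
pfm = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]
for seq in ["CCATGCC", "ATGTTTT", "GGGGATG"]:
    print(seq, best_start(seq, pfm)[0])  # starts 2, 0, 4
```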
30Discovering Motifs
Given a set of co-regulated genes, we need to discover the motif from the sequences alone
We have neither a motif model nor motif locations; we need to discover both
How can we approach this problem? (Hint: start with a random motif model)
31Expectation Maximization (EM)
- Remember the basic idea!
- Use model to estimate (distribution of) missing data
- Use estimate to update model
- Repeat until convergence
The model is the motif model; the missing data are the motif locations
32EM for Motif Discovery
- Start with random motif model
- E step: estimate probability of motif positions for each sequence
- M step: use estimate to update motif model
- Iterate (to convergence)
At each iteration, P(Sequences | Model) is guaranteed not to decrease
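The E and M steps can be sketched as a small Python program. This is a simplified illustration, not MEME's implementation: it assumes exactly one motif occurrence per sequence, a fixed uniform background, a uniform prior over start positions, pseudocounts in the M step, and a fixed number of iterations in place of a convergence test. All function names and the toy sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"


def random_pfm(k, rng):
    """Random starting motif model: each column is a normalised random vector."""
    pfm = []
    for _ in range(k):
        weights = [rng.random() + 0.1 for _ in ALPHABET]
        total = sum(weights)
        pfm.append({b: w / total for b, w in zip(ALPHABET, weights)})
    return pfm


def e_step(seqs, pfm, background):
    """Posterior probability of each motif start position in each sequence."""
    k = len(pfm)
    posteriors = []
    for seq in seqs:
        weights = []
        for start in range(len(seq) - k + 1):
            # Motif-versus-background ratio for this window. Background
            # terms outside the window are identical for every start and
            # cancel when the weights are normalised.
            w = 1.0
            for offset in range(k):
                base = seq[start + offset]
                w *= pfm[offset][base] / background[base]
            weights.append(w)
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors


def m_step(seqs, posteriors, k, pseudocount=0.5):
    """Re-estimate the motif model from expected (posterior-weighted) counts."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(k)]
    for seq, post in zip(seqs, posteriors):
        for start, z in enumerate(post):
            for offset in range(k):
                counts[offset][seq[start + offset]] += z
    pfm = []
    for col in counts:
        total = sum(col.values())
        pfm.append({b: c / total for b, c in col.items()})
    return pfm


def em_motif(seqs, k, iterations=50, seed=0):
    """Alternate E and M steps from a random start; return model and posteriors."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}
    pfm = random_pfm(k, rng)
    for _ in range(iterations):
        posteriors = e_step(seqs, pfm, background)
        pfm = m_step(seqs, posteriors, k)
    return pfm, e_step(seqs, pfm, background)


# Toy sequences, each containing the planted 6-mer TATGCA.
seqs = ["CCGTATGCAGG", "TATGCACCGGT", "GGCCTATGCAT", "ACTATGCAGCC"]
pfm, posteriors = em_motif(seqs, k=6)
print("".join(max(col, key=col.get) for col in pfm))  # consensus of the model
```

The result depends on the random start, so the printed consensus is not guaranteed to be the planted motif; changing `seed` is a quick way to see how sensitive EM is to its starting point.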
33MEME
- MEME: implements EM for motif discovery in DNA and proteins
- MAST: searches sequences for motifs given a model
http://meme.sdsc.edu/meme/
34P(SeqModel) Landscape
EM searches for parameters to increase P(seqs | parameters)
Useful to think of P(seqs | parameters) as a function of parameters
[Figure: surface of P(Sequences | params1, params2) over Parameter1 and Parameter2]
Where EM starts can make a big difference
35Search from Many Different Starts
To minimize the effects of local maxima, you should search multiple times from different starting points
MEME uses this idea:
- Start at many points
- Run for one iteration
- Choose the starting point that got the highest P(Sequences | Model) and continue
[Figure: surface of P(Sequences | params1, params2) over Parameter1 and Parameter2]
36Gibbs Sampling
A stochastic version of EM that differs from deterministic EM in two key ways
- 1. At each iteration, we only update the motif position of a single sequence
- 2. We may update a motif position to a suboptimal new position
37Gibbs Sampling
- Start with random motif locations and calculate a motif model
- Randomly select a sequence, remove its motif and recalculate a temporary model
- With the temporary model, calculate the probability of the motif at each position on the sequence
- Select a new position based on this distribution
- Update the model and iterate
[Figure labels: Best Location, New Location]
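The sampling loop described above can be sketched in Python. This is a bare-bones illustration rather than AlignACE's algorithm: it assumes one motif occurrence per sequence, a fixed uniform background, pseudocounts when building the temporary model, and a fixed number of iterations. The names `build_pfm` and `gibbs_motif` and the toy sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"


def build_pfm(seqs, starts, k, skip=None, pseudocount=0.5):
    """Motif model from the current motif locations, optionally leaving one out."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(k)]
    for idx, (seq, start) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for offset in range(k):
            counts[offset][seq[start + offset]] += 1
    pfm = []
    for col in counts:
        total = sum(col.values())
        pfm.append({b: c / total for b, c in col.items()})
    return pfm


def gibbs_motif(seqs, k, iterations=500, seed=0):
    """Sample motif locations one sequence at a time."""
    rng = random.Random(seed)
    background = {b: 0.25 for b in ALPHABET}
    # Start with random motif locations.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        # Remove one randomly chosen sequence and build a temporary model.
        held_out = rng.randrange(len(seqs))
        pfm = build_pfm(seqs, starts, k, skip=held_out)
        seq = seqs[held_out]
        # Weight of the motif at each position of the held-out sequence.
        weights = []
        for start in range(len(seq) - k + 1):
            w = 1.0
            for offset in range(k):
                base = seq[start + offset]
                w *= pfm[offset][base] / background[base]
            weights.append(w)
        # Sample (rather than maximise) the new location.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, build_pfm(seqs, starts, k)


seqs = ["CCGTATGCAGG", "TATGCACCGGT", "GGCCTATGCAT", "ACTATGCAGCC"]
starts, pfm = gibbs_motif(seqs, k=6)
print(starts)
print("".join(max(col, key=col.get) for col in pfm))
```

The call to `rng.choices` is the step that distinguishes this from the deterministic update: the new location is drawn in proportion to its weight, so a suboptimal position is sometimes chosen.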
38Gibbs Sampling and Climbing
Because Gibbs sampling does not always choose the best new location, it can move to another place that is not directly uphill
[Figure: surface of P(Sequences | params1, params2) over Parameter1 and Parameter2]
In theory, Gibbs sampling is less likely to get stuck in a local maximum
39AlignACE
- Implements Gibbs sampling for motif discovery
- Several enhancements
- ScanACE: look for motifs in a sequence given a model
- CompareACE: calculate similarity between two motifs (e.g. for clustering motifs)
http://atlas.med.harvard.edu/cgi-bin/alignace.pl
40Assessing Motif Quality
Scan the genome for all instances and associate with nearest genes
- Category Enrichment: look for association between motif and sets of genes. Score using the hypergeometric distribution
- Functional Category
- Gene Families
- Protein Complexes
- Group Specificity: how restricted are motif instances to the promoter sequences used to find the motif?
- Positional Bias: do motif instances cluster at a certain distance from ATG?
- Orientation Bias: do motifs appear preferentially upstream or downstream of genes?
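Category enrichment with the hypergeometric distribution asks how likely it is to see at least the observed overlap between motif-associated genes and a gene category by chance. The sketch below computes that tail probability exactly with integer arithmetic; the name `hypergeom_pvalue` and all gene counts are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def hypergeom_pvalue(N, K, n, x):
    """P(X >= x) for a hypergeometric draw.

    N: total genes, K: genes in the category,
    n: genes with the motif, x: genes in both sets.
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return tail / comb(N, n)


# Tiny case that can be checked by hand: all 3 motif genes fall in a
# 3-gene category out of 10 genes, which has probability 1 / C(10, 3).
print(hypergeom_pvalue(10, 3, 3, 3))  # 1/120, about 0.0083

# Hypothetical genome-scale example: 6000 genes, 40 in the category,
# 100 with the motif, 10 in both.
print(hypergeom_pvalue(6000, 40, 100, 10))
```

A small tail probability indicates that the motif is associated with the category more often than expected by chance; in practice the threshold would also need to account for the number of categories tested.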
41Comparative Motif Prediction
42Kellis et al. (2003) Nature
43Conservation of Motifs
GAL10
Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA
Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG
Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT
Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT
Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG
Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA
Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT
Scer TTAA-CGTCAAGGA---GAAAAAACTATA
Spar TTAT-CGTCAAGGAAA-GAACAAACTATA
Smik TCGTTCATCAAGAA----AAAAAACTA..
Sbay TTATCCCAAAAAAACAACAACAACATATA
GAL1
slide credits M. Kellis
44Genome-wide conservation
Evaluate conservation within
(1) All intergenic regions
A signature for regulatory motifs
45Finding Motifs in Yeast Genomes (M. Kellis PhD Thesis)
- Enumerate all mini-motifs
- Apply three tests
- Look for motifs conserved in intergenic regions
- Look for motifs more conserved intergenically than in genes
- Look for motifs preferentially conserved upstream or downstream of genes
[Figure: example mini-motif; letters shown: N C T A C G A]
slide credits M. Kellis
46Test 1 Intergenic conservation
Conserved count
Total count
slide credits M. Kellis
47Test 2 Intergenic vs. Coding
Intergenic Conservation
Coding Conservation
slide credits M. Kellis
48Test 3 Upstream vs. Downstream
Upstream Conservation
Downstream Conservation
slide credits M. Kellis
49Constructing full motifs
[Figure: Test 1, Test 2, Test 3; 2,000 mini-motifs; example motif letters: C T A C G A R R]
slide credits M. Kellis
50Results
Rank Discovered Motif Known TF motif Tissue Enrichment Distance bias
1 RCGCAnGCGY NRF-1 Yes Yes
2 CACGTG MYC Yes Yes
3 SCGGAAGY ELK-1 Yes Yes
4 ACTAYRnnnCCCR Yes Yes
5 GATTGGY NF-Y Yes Yes
6 GGGCGGR SP1 Yes Yes
7 TGAnTCA AP-1 Yes
8 TMTCGCGAnR Yes Yes
9 TGAYRTCA ATF3 Yes Yes
10 GCCATnTTG YY1 Yes
11 MGGAAGTG GABP Yes Yes
12 CAGGTG E12 Yes
13 CTTTGT LEF1 Yes
14 TGACGTCA ATF3 Yes Yes
15 CAGCTG AP-4 Yes
16 RYTTCCTG C-ETS-2 Yes Yes
17 AACTTT IRF1() Yes
18 TCAnnTGAY SREBP-1 Yes Yes
19 GKCGCn(7)TGAYG Yes Yes
- 174 promoter motifs
- 70 match known TF motifs
- 60 show positional bias
- => 75 have evidence
- Control sequences
- < 2 match known TF motifs
- < 3 show positional bias
- => < 7 false positives
slide credits M. Kellis
51Antigen Epitope Prediction
52Genome to Immunome
Pathogen genome sequences define all proteins that could elicit an immune response
- Looking for a needle
- Only a small number of epitopes are typically antigenic
- in a very big haystack
- Vaccinia virus (258 ORFs): 175,716 potential epitopes (8-, 9-, and 10-mers)
- M. tuberculosis (4K genes): 433,206 potential epitopes
- A. nidulans (9K genes): 1,579,000 potential epitopes
Can computational approaches predict all
antigenic epitopes from a genome?
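The size of the haystack follows from simple counting: a protein of length L contains L - k + 1 overlapping peptides of length k. The sketch below counts and enumerates candidate 8-, 9-, and 10-mers; the function names and the toy protein are hypothetical, and the count includes duplicate peptides, which may differ from how the figures above were derived.

```python
def count_candidate_peptides(protein_lengths, ks=(8, 9, 10)):
    """Number of overlapping k-mer windows summed over proteins and lengths."""
    return sum(
        max(length - k + 1, 0) for length in protein_lengths for k in ks
    )


def enumerate_peptides(protein, ks=(8, 9, 10)):
    """Yield every overlapping peptide of each length in ks."""
    for k in ks:
        for start in range(len(protein) - k + 1):
            yield protein[start:start + k]


toy_protein = "ACDEFGHIKLMN"  # 12 residues, made up for illustration
print(count_candidate_peptides([len(toy_protein)]))  # 5 + 4 + 3 = 12
print(list(enumerate_peptides(toy_protein, ks=(10,))))
```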
53Antigen Processing Presentation
54Modeling MHC Epitopes
- Have a set of peptides that have been associated with a particular MHC allele
- Want to discover the motif within the peptides bound by the MHC allele
- Use motif to predict other potential epitopes
55Motifs Bound by MHCs
- MHC 1
- Groove has closed ends
- Peptides 8-10 AAs in length
- Motif is the peptide
- MHC 2
- Groove has open ends
- Peptides have broad length distribution: 10-30 AAs
- Need to find binding motif within peptides
56MHC 2 Motif Discovery
Use Gibbs Sampling!
462 peptides known to bind to MHC II HLA-DR4 (B1*0401)
9-30 residues in length
Goal: identify a common length-9 binding motif
Nielsen et al. (2004) Bioinf
57Vaccinia Epitope Prediction
Moutaftsi et al. (2006) Nat. Biotech.
- Predict MHC1 binding peptides
- Using 4 matrices for H-2 Kb and Db
- Top 1% of predictions experimentally validated
49 validated epitopes accounting for 95% of immune response