Title: Ab initio motif finding
2Ab initio motif finding
3Agenda
- Background / motivation
- Paper 1
- Paper 2
- Conclusion
4Central Dogma
DNA (A, C, G, T) -> Transcription -> mRNA (A, C, G, U) -> Translation -> Amino acids -> Folding -> Protein
5Impacts of gene regulation
- Functioning of an organism
- Development of an organism
- Evolution of organisms
6Transcription
- Process in which mRNA is made using DNA as a template
- Only genes are transcribed
- Regulated by transcription factors
7Transcription movie
8Binding Site
- Region on a protein, DNA, or RNA to which ligands
attach
9Motif
- Common sequence pattern in the binding sites of a transcription factor
- A succinct way of capturing variability among the binding sites
10Motif representation
XTCATCAX
- Position Specific Scoring Matrix
PSSM graph
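A minimal sketch of how a Position Specific Scoring Matrix could be built from aligned binding sites and used to score a candidate site. The aligned sites, the pseudocount, and the uniform background of 0.25 are assumptions made for illustration; they are not taken from the slides.

```python
import math

ALPHABET = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per motif position mapping nucleotide -> log2-odds score."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        scores = {}
        for nt in ALPHABET:
            freq = (column.count(nt) + pseudocount) / total
            scores[nt] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, kmer):
    """Sum the per-position log-odds scores of a candidate site."""
    return sum(column[nt] for column, nt in zip(pssm, kmer))

# Hypothetical aligned binding sites (made-up data, for illustration only).
sites = ["TCATCA", "TCATCA", "TCTTCA", "TCATGA"]
pssm = build_pssm(sites)
print(score(pssm, "TCATCA") > score(pssm, "GGGGGG"))
```

A site resembling the aligned examples scores higher than an unrelated one, which is the sense in which the matrix captures variability among binding sites.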
11Ab initio Motif finding
- Say a transcription factor (TF) controls five different genes
- Each of the five genes will have binding sites for the TF in their promoter region
12Ab initio Motif finding
- GIVEN: promoter regions of the five genes G1, ..., G5
- FIND: binding sites of the TF, without prior knowledge
13Agenda
- Background / motivation
- Paper 1
- Paper 2
- Conclusion
14Paper 1
- Ab Initio Prediction of Transcription Factor Targets Using Structural Knowledge
- Tommy Kaplan, Nir Friedman, Hanah Margalit
15Overview
Known binding sites of other proteins in the same protein family
Identify binding sites of new proteins (of that family)
Same family => same binding specificity of residues
Prediction
16Application Cys2His2 Zinc Finger
Largest known DNA-binding family in multicellular organisms
Extensively studied
Has a stringent binding model
17Curation of Zinc Finger sequences and their
binding sites
Input: Cys2His2 proteins in the TRANSFAC database
Curate: 61 canonical fingers, 455 protein-binding site pairs
Train: profile HMM on 31 experimentally determined canonical domains
Output: fingers classified as canonical / non-canonical
18Identification of DNA-binding residues
Canonical binding model of the solved protein-DNA complex of Egr-1
PROSITE motif pattern: CX(2-4)CX(11-13)HX(3-5)H
19Estimating DNA Recognition Preferences
- INPUT: set of pairs of transcription factors and their target DNA sequences
TF
Target DNA sequence
20Probabilistic model of binding preferences
A: set of interacting residues at the 4 positions p of each of the k fingers
E.g. A1,2: the interacting residue of finger 1 at position 2
Conditional probability of interaction with the DNA subsequence starting at the jth position in the DNA
Pp(N|A): conditional probability of nucleotide N given amino acid A at position p
N1, ..., NL: target DNA sequence
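A rough sketch of how such a model could be evaluated, assuming the interaction probability is a product of the per-position terms Pp(N|A) over all fingers. The `contacts` map (which residue position reads which nucleotide of a finger's triplet), the layout of one triplet per finger, and all numbers in `pref` are hypothetical; residue positions missing from `contacts` are simply skipped.

```python
def binding_probability(dna, j, fingers, pref, contacts):
    """Probability of interaction with the DNA subsequence starting at index j.

    dna      -- target DNA sequence N1..NL as a string
    j        -- 0-based start of the candidate subsequence
    fingers  -- list of k dicts {residue position p: amino acid}
    pref     -- pref[p][amino_acid][nucleotide] = Pp(N|A)
    contacts -- contacts[p] = nucleotide offset inside a finger's triplet (assumed)
    """
    prob = 1.0
    for i, residues in enumerate(fingers):
        for p, amino_acid in residues.items():
            if p not in contacts:
                continue  # no contact assumed for this residue position
            nucleotide = dna[j + 3 * i + contacts[p]]
            prob *= pref[p][amino_acid][nucleotide]
    return prob

# Hypothetical single finger and made-up recognition preferences.
fingers = [{-1: "R", 3: "E", 6: "R"}]
contacts = {6: 0, 3: 1, -1: 2}
arg_pref = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
glu_pref = {"A": 0.2, "C": 0.6, "G": 0.1, "T": 0.1}
pref = {-1: {"R": arg_pref}, 3: {"E": glu_pref}, 6: {"R": arg_pref}}

p_match = binding_probability("AGCGT", 1, fingers, pref, contacts)  # triplet GCG
p_other = binding_probability("AGCGT", 0, fingers, pref, contacts)  # triplet AGC
print(p_match > p_other)
```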
21Where did the P2 term go?
22Estimating DNA Recognition Preferences
- Apply Expectation Maximization
Identify binding locations
Optimize recognition preferences
23Expectation Maximization algorithm
Initial guess of DNA recognition preferences
E-step: compute the expected posterior probability of binding locations for all protein-DNA pairs
M-step: update the recognition preferences to maximize the likelihood of the binding positions for all protein-DNA pairs, based on the distribution of possible binding locations
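The E and M steps can be illustrated with a generic motif-finding EM, sketched below. This is a simplification and not the method of the paper: it learns a plain position-specific nucleotide distribution rather than amino-acid-conditioned recognition preferences, and it assumes exactly one binding site per sequence. The sequences, the initial guess, and the pseudocount are made up.

```python
ALPHABET = "ACGT"

def e_step(seqs, pwm, width):
    """Posterior probability of each possible binding start in each sequence."""
    posteriors = []
    for seq in seqs:
        likelihoods = []
        for start in range(len(seq) - width + 1):
            like = 1.0
            for offset in range(width):
                like *= pwm[offset][seq[start + offset]]
            likelihoods.append(like)
        total = sum(likelihoods)
        posteriors.append([x / total for x in likelihoods])
    return posteriors

def m_step(seqs, posteriors, width, pseudocount=0.1):
    """Re-estimate the preferences from posterior-weighted nucleotide counts."""
    counts = [{nt: pseudocount for nt in ALPHABET} for _ in range(width)]
    for seq, post in zip(seqs, posteriors):
        for start, weight in enumerate(post):
            for offset in range(width):
                counts[offset][seq[start + offset]] += weight
    return [{nt: c[nt] / sum(c.values()) for nt in ALPHABET} for c in counts]

def run_em(seqs, init_site, iterations=50):
    """Alternate E and M steps, starting from a guess biased toward init_site."""
    width = len(init_site)
    pwm = [{nt: (0.7 if nt == ch else 0.1) for nt in ALPHABET} for ch in init_site]
    posteriors = e_step(seqs, pwm, width)
    for _ in range(iterations):
        pwm = m_step(seqs, posteriors, width)
        posteriors = e_step(seqs, pwm, width)
    return pwm, posteriors

# Hypothetical sequences, each with one planted copy of TCATCA.
seqs = ["GGATCATCAGG", "TCATCAGGGCC", "CCGGTCATCAA"]
pwm, posteriors = run_em(seqs, "TCATCA")
best_starts = [post.index(max(post)) for post in posteriors]
print(best_starts)
```

As with any EM procedure, the result depends on the initial guess and may be a local optimum.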
24Probabilistic model output
Total height = information content; relative height = probability; color intensity = confidence
(Phenylalanine, Cytosine): prevalent in position 2
(Lysine, Guanine): irrespective of position
25Ab initio genome wide prediction
16,201 putative gene products
29 canonical fingers
DNA recognition preference
Binding site model
Scan the promoter regions
image
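Scanning a promoter region with a binding site model can be sketched as a sliding window. The model below is a hypothetical list of per-position score tables and the threshold is arbitrary; a real scan would typically also examine the reverse strand, which this sketch omits.

```python
def scan_promoter(promoter, model, threshold):
    """Return (start, score) for every window scoring at least threshold."""
    width = len(model)
    hits = []
    for start in range(len(promoter) - width + 1):
        window = promoter[start:start + width]
        window_score = sum(column[nt] for column, nt in zip(model, window))
        if window_score >= threshold:
            hits.append((start, window_score))
    return hits

# Hypothetical 3-position model that favors the triplet GCG.
model = [
    {"A": -1, "C": -1, "G": 2, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
hits = scan_promoter("ATGCGTTGCGA", model, threshold=5)
print(hits)  # [(2, 6), (7, 6)]
```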
26Results: D. melanogaster
Transcription Factors
Match prior biological knowledge!!
GO Terms
Blue cells: significant enrichment of GO terms
27Results: D. melanogaster
Transcription Factors
Embryogenesis phase
At least one significant embryogenesis experiment
28Agenda
- Background / motivation
- Paper 1
- Paper 2
- Conclusion
29Paper 2
- MotifCut: regulatory motif finding with maximum density subgraphs
- Eugene Fratkin, Brian Naughton, Douglas Brutlag, Serafim Batzoglou
30Drawbacks of existing methods
- As an optimization problem
- Intractable
- Relies on EM or local heuristic search
31Drawbacks of existing methods
WRONG!
Independence assumption is biologically unrealistic
32Overview
- Nodes: k-mers of the input sequence
- Edges: pairwise k-mer similarity
- Motif search -> maximum density subgraph
33MotifCut Algorithm
- Convert sequence into a collection of k-mers
- Each overlap/duplicate considered distinct
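A small sketch of the conversion step, assuming each node is identified by its sequence index and start position so that overlapping and duplicate k-mers stay distinct. The function name and the example sequences are hypothetical.

```python
def extract_kmers(sequences, k):
    """Return one (sequence index, start, k-mer) node per window."""
    nodes = []
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            nodes.append((seq_index, start, seq[start:start + k]))
    return nodes

nodes = extract_kmers(["ACGTA", "ACG"], k=3)
print(nodes)
# [(0, 0, 'ACG'), (0, 1, 'CGT'), (0, 2, 'GTA'), (1, 0, 'ACG')]
```

The k-mer ACG occurs twice and yields two separate nodes.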
34MotifCut Algorithm
- For every pair of vertices (vi, vj) create an edge with weight wij
- wij = f(number of mismatches between the k-mers in vi and vj)
M: k-mers of binding sites; B: background k-mers
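A sketch of the graph construction with the weight as a function of the mismatch count. The exponential decay used for f is a placeholder chosen for illustration, not the weighting function used by MotifCut, and the k-mers are made up.

```python
import math
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def edge_weight(mismatches, decay=1.0):
    """Placeholder f: weight shrinks as the number of mismatches grows."""
    return math.exp(-decay * mismatches)

def build_graph(kmers):
    """Return {(i, j): weight} for every pair of vertices with i < j."""
    weights = {}
    for i, j in combinations(range(len(kmers)), 2):
        weights[(i, j)] = edge_weight(hamming(kmers[i], kmers[j]))
    return weights

kmers = ["TCATCA", "TCATGA", "GGGGGG"]
weights = build_graph(kmers)
print(weights[(0, 1)] > weights[(0, 2)])
```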
35Resulting graph
Note: should be maximally connected!
36MotifCut Algorithm
- Find the maximum density subgraph
- Parametric flow algorithm (Gallo et al, 1989)
- A type of fractional programming
- Iteratively apply push/relabel to find max-flow and min-cut
- O(VE log(V^2/E)) -> too slow!
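To make the objective concrete, the sketch below uses greedy peeling, a simple approximation heuristic, in place of the parametric flow algorithm named above. Density is taken here to be total edge weight divided by number of vertices, and the example graph is made up; the result is not guaranteed to be the exact optimum in general.

```python
def greedy_densest_subgraph(n, weights):
    """Greedy peeling: repeatedly drop the vertex of lowest weighted degree.

    weights -- {(i, j): w} with vertices numbered 0..n-1
    Returns (vertex set, density) of the densest subgraph seen while peeling.
    """
    alive = set(range(n))
    degree = [0.0] * n
    for (i, j), w in weights.items():
        degree[i] += w
        degree[j] += w
    total = sum(weights.values())
    best_set, best_density = set(alive), total / n
    while len(alive) > 1:
        v = min(alive, key=lambda u: degree[u])
        alive.remove(v)
        total -= degree[v]  # remove all edges from v to surviving vertices
        for (i, j), w in weights.items():
            if i == v and j in alive:
                degree[j] -= w
            elif j == v and i in alive:
                degree[i] -= w
        density = total / len(alive)
        if density > best_density:
            best_set, best_density = set(alive), density
    return best_set, best_density

# Hypothetical graph: a tight triangle (0, 1, 2) plus two weakly attached vertices.
example = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (0, 3): 0.1, (0, 4): 0.1}
best_set, best_density = greedy_densest_subgraph(5, example)
print(sorted(best_set))
```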
37MDS optimization
Pick a center of neighborhood
Discard edges with weight < w
Re-introduce all edges in neighborhood
Run MDS in neighborhood
Repeat for every vertex
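One possible reading of the neighborhood steps above, sketched under assumptions: the neighborhood of a center is taken to be the vertices still joined to it once edges with weight below w are discarded, after which every original edge among those vertices is restored. The function name, the threshold, and the graph are hypothetical, and the maximum density subgraph routine to run on the result is left out.

```python
def neighborhood_subgraph(center, weights, w_min):
    """Return (members, edges) of the neighborhood around center.

    weights -- {(i, j): w} for the full graph
    w_min   -- edges below this weight are ignored when choosing members
    """
    members = {center}
    for (i, j), w in weights.items():
        if w >= w_min:
            if i == center:
                members.add(j)
            elif j == center:
                members.add(i)
    # Re-introduce all edges among members, including those below w_min.
    edges = {(i, j): w for (i, j), w in weights.items()
             if i in members and j in members}
    return members, edges

graph = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.2, (0, 3): 0.1, (2, 3): 0.5}
members, edges = neighborhood_subgraph(0, graph, w_min=0.5)
print(sorted(members), sorted(edges))
```

Restricting the search to one small neighborhood at a time keeps each maximum density subgraph computation small, at the cost of repeating it for every vertex.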
38Results
- Synthetic Data
- vs MEME (Bailey et al, 1995)
- vs AlignAce (Hughes et al, 2000)
- vs BioProspector (Liu et al, 2001)
- Yeast Data
39Synthetic benchmark results
40Results Running time and yeast data
41Agenda
- Background / motivation
- Paper 1
- Paper 2
- Conclusion
42Conclusion
- Ab initio motif finding
- Use of structural knowledge
- Graph representation of motifs