Ab initio motif finding - PowerPoint PPT Presentation

1
(No Transcript)
2
Ab initio motif finding

Ryo Shimizu

3
Agenda

Background / motivation
Paper 1
Paper 2
Conclusion

4
Central Dogma
DNA (A, C, G, T) -> Transcription -> mRNA (A, C, G, U) -> Translation -> Amino acids -> Folding -> Protein
5
Impacts of gene regulation

Functioning of an organism
Development of an organism
Evolution of organisms

6
Transcription

Process in which mRNA is made using DNA as a
template
Only genes are transcribed
Regulated by transcription factors

7
Transcription movie
8
Binding Site

Region on a protein, DNA, or RNA to which ligands
attach

9
Motif

Common sequence pattern in the binding sites of
a transcription factor
A succinct way of capturing variability among the
binding sites

10
Motif representation

Consensus Sequence

XTCATCAX

Position Specific Scoring Matrix

A graph

PSSM graph
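
The two representations above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the aligned sites are made up, and the pseudocount and the uniform background of 0.25 are assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Return one {nucleotide: log-odds score} dict per motif column."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(length):
        counts = Counter(site[col] for site in sites)
        column = {}
        for nt in ALPHABET:
            freq = (counts[nt] + pseudocount) / total
            # Log-odds of the smoothed frequency against the background.
            column[nt] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def consensus(pssm):
    """Pick the highest-scoring nucleotide in every column."""
    return "".join(max(col, key=col.get) for col in pssm)

def score(pssm, kmer):
    """Sum the column scores of a candidate site of the same length."""
    return sum(col[nt] for col, nt in zip(pssm, kmer))

# Hypothetical aligned binding sites, invented for illustration only.
sites = ["TCATCA", "TCATCA", "TCGTCA", "TCATGA"]
pssm = build_pssm(sites)
motif_consensus = consensus(pssm)
```

The consensus string keeps only the most frequent letter per column, while the PSSM retains how strongly each position prefers each nucleotide, which is why it scores near-matches more gracefully.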
11
Ab initio Motif finding

Say a transcription factor (TF) controls five
different genes
Each of the five genes will have binding sites
for the TF in their promoter region

12
Ab initio Motif finding

GIVEN: promoter regions of the five genes G1, ..., G5
FIND: binding sites of the TF, without prior
knowledge
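
One simple way to make this problem statement concrete is an exhaustive "median string" search: try every possible k-mer and keep the one whose best match in each promoter has the fewest total mismatches. The sketch below is a toy formulation chosen for illustration, not the method of either paper; the promoter strings and the motif width k are made up.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_match(pattern, seq):
    """Return (mismatches, start) of the closest window of seq to pattern."""
    k = len(pattern)
    return min((hamming(pattern, seq[i:i + k]), i)
               for i in range(len(seq) - k + 1))

def median_string(seqs, k):
    """Exhaustively search all 4^k patterns; return (total, pattern, starts)."""
    best = None
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        matches = [best_match(pattern, s) for s in seqs]
        total = sum(d for d, _ in matches)
        if best is None or total < best[0]:
            best = (total, pattern, [i for _, i in matches])
    return best

# Hypothetical promoter regions G1..G5, each with a planted copy of TCATCA.
promoters = [
    "AGGTCATCAGT",
    "TCATCACCGTA",
    "CCGATTCATCA",
    "GATCATCATGC",
    "ACGTCATCAAC",
]
total, motif, starts = median_string(promoters, 6)
```

The search is exponential in k, which is one reason practical tools fall back on EM, sampling, or graph-based heuristics.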

13
Agenda

Background / motivation
Paper 1
Paper 2
Conclusion

14
Paper 1

Ab Initio Prediction of Transcription Factor
Targets Using Structural Knowledge
Tommy Kaplan, Nir Friedman, Hanah Margalit

15
Overview
Given: known binding sites of other proteins in the same protein
family
Goal: identify binding sites of new proteins (of that
family)
Same family => same binding specificity of residues
Prediction
16
Application: Cys2His2 Zinc Finger
Largest known DNA-binding family in multicellular
organisms
Extensively studied
Has a stringent binding model
17
Curation of Zinc Finger sequences and their
binding sites
31 experimentally determined canonical domains -> train -> Profile HMM
Cys2His2 proteins in the TRANSFAC database -> input -> Profile HMM
Profile HMM -> output -> fingers classified as Canonical / Non-canonical
Curate -> 61 canonical fingers, 455 protein-binding site
pairs
18
Identification of DNA-binding residues
Canonical binding model from the solved protein-DNA
complex of Egr-1
PROSITE motif pattern: C-X(2-4)-C-X(11-13)-H-X(3-5)-H
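
Reading the slide's pattern as "C, 2 to 4 residues, C, 11 to 13 residues, H, 3 to 5 residues, H", it translates directly into a regular expression. The sketch below is only a transcription of that simplified pattern; the peptide is a hypothetical sequence constructed to fit it, and `find_fingers` is an illustrative helper name.

```python
import re

# X(a-b) in the pattern becomes .{a,b} in regex syntax.
ZF_C2H2 = re.compile(r"C.{2,4}C.{11,13}H.{3,5}H")

def find_fingers(protein):
    """Return (start, matched text) for every non-overlapping pattern hit."""
    return [(m.start(), m.group()) for m in ZF_C2H2.finditer(protein)]

# Hypothetical peptide, built by hand so that one finger matches.
protein = "MAYACPVESCDRRFSRSDELTRHIRIHTGQ"
hits = find_fingers(protein)
```

Note that `finditer` reports non-overlapping matches only, so tandem fingers are found one after another but overlapping candidates are not enumerated.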
19
Estimating DNA Recognition Preferences

INPUT: a set of pairs of transcription factors and
their target DNA sequences.

TF
Target DNA sequence
20
Probabilistic model of binding preferences
A = set of interacting residues at the 4 positions, p,
of the k fingers
E.g. A_{1,2} = the interacting residue of finger
1 at position 2
Conditional probability of interaction with the DNA subsequence starting
at the j-th position in the DNA
P_p(N|A) = conditional probability of nucleotide N given amino
acid A at position p
N_1, ..., N_L = target DNA sequence
21
Where did the P2 term go?
22
Estimating DNA Recognition Preferences

Apply Expectation Maximization

Identify binding locations
Optimize recognition preferences
23
Expectation Maximization algorithm
Initial guess of DNA recognition preferences
E-step: compute the expected posterior probability of binding
locations for all protein-DNA pairs.
M-step: maximize the likelihood of the current binding
positions for all protein-DNA pairs, based on the
distribution of possible binding locations
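
The E/M alternation can be illustrated with a generic motif EM. The sketch below is a swapped-in simplification, not the residue-based model of the paper: it learns a plain position weight matrix, assumes exactly one site per sequence, a uniform background, and a hypothetical seed string as the initial guess.

```python
ALPHABET = "ACGT"

def window_likelihood(pwm, window):
    """Probability of a window under the motif model."""
    p = 1.0
    for col, nt in zip(pwm, window):
        p *= col[nt]
    return p

def e_step(seqs, pwm, w):
    """Posterior over start positions per sequence (uniform prior and
    uniform background, so the background term cancels)."""
    posteriors = []
    for s in seqs:
        lik = [window_likelihood(pwm, s[j:j + w])
               for j in range(len(s) - w + 1)]
        z = sum(lik)
        posteriors.append([x / z for x in lik])
    return posteriors

def m_step(seqs, posteriors, w, pseudocount=0.1):
    """Re-estimate the matrix from posterior-weighted nucleotide counts."""
    counts = [{nt: pseudocount for nt in ALPHABET} for _ in range(w)]
    for s, post in zip(seqs, posteriors):
        for j, weight in enumerate(post):
            for i in range(w):
                counts[i][s[j + i]] += weight
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({nt: c / total for nt, c in col.items()})
    return pwm

def init_pwm(seed, strength=0.7):
    """Initial guess: favor the seed letter in each column."""
    rest = (1.0 - strength) / 3
    return [{nt: (strength if nt == ch else rest) for nt in ALPHABET}
            for ch in seed]

def run_em(seqs, seed, iterations=20):
    w = len(seed)
    pwm = init_pwm(seed)
    for _ in range(iterations):
        posteriors = e_step(seqs, pwm, w)
        pwm = m_step(seqs, posteriors, w)
    posteriors = e_step(seqs, pwm, w)
    starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return pwm, starts

# Hypothetical target sequences; the seed differs from the planted site
# TCATCA in its first letter.
targets = ["AGGTCATCAGT", "TCATCACCGTA", "CCGATTCATCA",
           "GATCATCATGC", "ACGTCATCAAC"]
learned_pwm, learned_starts = run_em(targets, "ACATCA")
learned_consensus = "".join(max(c, key=c.get) for c in learned_pwm)
```

EM only climbs to a local optimum, so the result depends on the initial guess; in practice several seeds are usually tried.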
24
Probabilistic model output
Total height = information content; relative height =
probability; color intensity = confidence
(Phenylalanine, Cytosine): prevalent in position 2
(Lysine, Guanine): irrespective of position
25
Ab initio genome wide prediction
16,201 putative gene products
29 canonical fingers
DNA recognition preference
Binding site model
Scan the promoter regions
26
Results: D. melanogaster
Transcription Factors
Match prior biological knowledge!!
GO Terms
Blue cells: significant enrichment of GO terms
27
Results: D. melanogaster
Transcription Factors
Embryogenesis phase
At least one significant embryogenesis experiment
28
Agenda

Background / motivation
Paper 1
Paper 2
Conclusion

29
Paper 2

MotifCut: regulatory motif finding with maximum
density subgraphs
Eugene Fratkin, Brian Naughton, Douglas Brutlag,
Serafim Batzoglou

30
Drawbacks of existing methods

As an optimization problem
Intractable
Relies on EM or local heuristic search

31
Drawbacks of existing methods
WRONG!
Independence assumption: biologically unrealistic
32
Overview

Nodes: k-mers of the input sequence
Edges: pairwise k-mer similarity
Motif search -> maximum density subgraph

33
MotifCut Algorithm

Convert sequence into a collection of k-mers
Each overlap/duplicate considered distinct
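
A minimal sketch of this step, assuming the input is a list of plain strings. Each node keeps its sequence index and start offset, so overlapping and duplicate k-mers stay distinct; `extract_kmers` is an illustrative name.

```python
def extract_kmers(sequences, k):
    """Return one (sequence index, start, k-mer) node per window."""
    nodes = []
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            nodes.append((seq_id, start, seq[start:start + k]))
    return nodes

# Three overlapping windows, two of which carry the same text.
example_nodes = extract_kmers(["AAAC"], 2)
```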

34
MotifCut Algorithm

For every pair of vertices (vi, vj) create an
edge with weight wij
wij = f(number of mismatches between the k-mers in vi, vj)

M = k-mers of binding sites; B = background k-mers
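
The graph construction can be sketched as follows. The slide only states that the weight is a function of the mismatch count, so the exponential decay used here is a placeholder assumption, as are the `decay` parameter and the function names.

```python
import math
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def edge_weight(kmer_a, kmer_b, decay=1.0):
    """Placeholder f: weight shrinks exponentially with mismatches."""
    return math.exp(-decay * hamming(kmer_a, kmer_b))

def build_graph(kmers):
    """Complete weighted graph: one edge per unordered pair of vertices."""
    return {(i, j): edge_weight(kmers[i], kmers[j])
            for i, j in combinations(range(len(kmers)), 2)}

# Two copies of a hypothetical site and one unrelated background k-mer.
graph = build_graph(["TCATCA", "TCATCA", "GGGGGG"])
```

With n k-mers this produces n(n-1)/2 edges, which is what makes the later subgraph search expensive on long inputs.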
35
Resulting graph
Note: should be maximally connected!
36
MotifCut Algorithm

Find the maximum density subgraph
Parametric flow algorithm (Gallo et al., 1989)
A type of fractional programming
Iteratively apply push/relabel to find max-flow
and min-cut
O(VE log(V^2/E)) -> too slow!
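
To show what "maximum density subgraph" means in code, the sketch below uses greedy peeling (Charikar's 2-approximation) instead of the parametric flow algorithm named on the slide. It is a swapped-in, simpler technique: it repeatedly removes the vertex with the smallest weighted degree and remembers the densest intermediate subgraph, where density is total edge weight divided by vertex count. The example graph is made up.

```python
def densest_subgraph_greedy(n, weights):
    """weights: {(i, j): w} with i < j over vertices 0..n-1.
    Returns (best density, best vertex set) found by greedy peeling."""
    adj = {v: {} for v in range(n)}
    for (i, j), w in weights.items():
        adj[i][j] = w
        adj[j][i] = w
    degree = {v: sum(adj[v].values()) for v in adj}
    total = sum(weights.values())
    alive = set(range(n))
    best_density, best_set = total / n, set(alive)
    while len(alive) > 1:
        # Remove the vertex with the smallest weighted degree (ties by id).
        v = min(alive, key=lambda u: (degree[u], u))
        alive.remove(v)
        total -= degree[v]
        for u, w in adj[v].items():
            if u in alive:
                degree[u] -= w
        density = total / len(alive)
        if density > best_density:
            best_density, best_set = density, set(alive)
    return best_density, best_set

# Hypothetical graph: a tight triangle (0, 1, 2) plus a weakly attached vertex 3.
toy_weights = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (0, 3): 0.1}
toy_density, toy_set = densest_subgraph_greedy(4, toy_weights)
```

Peeling gives up exactness for speed: the subgraph it returns has at least half the optimal density, whereas the flow-based approach finds the exact optimum.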

37
MDS optimization
Pick a center of neighborhood
Discard edges with weight < w
Re-introduce all edges in neighborhood
Run MDS in neighborhood
Repeat for every vertex
38
Results