Modeling Dependencies in Protein-DNA Binding Sites

1
Modeling Dependencies in Protein-DNA Binding Sites
  • Yoseph Barash 1
  • Gal Elidan 1
  • Nir Friedman 1
  • Tommy Kaplan 1,2

1 School of Computer Science & Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel
2
Dependent positions in binding sites
[Figure: a gene with its promoter and binding site; nucleotide labels A, C, T]
Most approaches assume position independence
To model or not to model dependencies? (Man & Stormo 2001; Bulyk et al., 2002; Benos et al., 2002)
Pros: Biology suggests dependencies
  • A single amino acid interacts with two nucleotides
  • Change in conformation of protein or DNA
Cons: Modeling dependencies is harder
  • Additional parameters
  • Requires more data, not as robust

3
Data driven approach
  • Can we learn dependencies from available genomic data? Yes
  • Do dependency models perform better? Yes
Outline
  • Flexible models of dependencies
  • Learning from (un)aligned sequences
  • Systematic evaluation
  • Biological insights

4
How to model binding sites?
Represent a distribution of binding sites:
  • Profile: independence model
  • Tree: direct dependencies
  • Mixture of Profiles: global dependencies
  • Mixture of Trees: both types of dependencies
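A minimal sketch of how the first two model types differ when scoring a site. The function names, data layout, and toy probabilities are assumptions made for illustration, not taken from the slides: a profile multiplies independent per-position probabilities, while a tree lets each position depend on at most one parent position.

```python
import math

ALPHABET = "ACGT"

def profile_log_likelihood(site, profile):
    """log P(site) under a profile: positions are independent.

    profile[i][x] is the probability of nucleotide x at position i.
    """
    return sum(math.log(profile[i][x]) for i, x in enumerate(site))

def tree_log_likelihood(site, parents, root_probs, cond_probs):
    """log P(site) under a tree: each position has at most one parent.

    parents[i] is the parent position of i, or None for a root.
    root_probs[i][x] = P(X_i = x) for parentless positions.
    cond_probs[i][y][x] = P(X_i = x | X_parent = y) otherwise.
    """
    total = 0.0
    for i, x in enumerate(site):
        p = parents[i]
        if p is None:
            total += math.log(root_probs[i][x])
        else:
            total += math.log(cond_probs[i][site[p]][x])
    return total

# Toy 2-position example (made-up numbers): position 1 tends to copy
# position 0, which a profile cannot express.
uniform = {x: 0.25 for x in ALPHABET}
profile = [uniform, uniform]
parents = [None, 0]
root_probs = {0: uniform}
cond_probs = {
    1: {y: {x: (0.7 if x == y else 0.1) for x in ALPHABET} for y in ALPHABET}
}

ll_profile = profile_log_likelihood("AA", profile)
ll_tree = tree_log_likelihood("AA", parents, root_probs, cond_probs)
```

In this toy case the tree assigns the site "AA" a higher log-likelihood than the uniform profile, because it captures the pairing between the two positions. A mixture model would average several such components, weighted by mixing probabilities.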
5
Learning models: aligned binding sites
Learning machinery: select the maximum likelihood model
  • Learning based on methods for probabilistic
    graphical models (Bayesian networks)
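As an illustration of maximum-likelihood structure learning for a tree, the sketch below uses the classic Chow-Liu procedure: weight every pair of positions by empirical mutual information, then take a maximum-weight spanning tree. The slides do not say which procedure or scoring function the authors used, so this stands in as an assumption. The function names and toy sites are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(sites, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(sites)
    ci = Counter(s[i] for s in sites)
    cj = Counter(s[j] for s in sites)
    cij = Counter((s[i], s[j]) for s in sites)
    return sum(
        (c / n) * math.log(c * n / (ci[a] * cj[b]))
        for (a, b), c in cij.items()
    )

def chow_liu_parents(sites):
    """Return parents[i] for a maximum-weight spanning tree rooted at position 0."""
    length = len(sites[0])
    weights = {
        (i, j): mutual_information(sites, i, j)
        for i, j in combinations(range(length), 2)
    }
    parents = [None] * length
    in_tree = {0}
    while len(in_tree) < length:
        # Prim's algorithm: attach the outside position with the strongest link.
        i, j = max(
            ((a, b) for a in in_tree for b in range(length) if b not in in_tree),
            key=lambda e: weights[min(e), max(e)],
        )
        parents[j] = i
        in_tree.add(j)
    return parents

# Toy aligned sites: columns 0 and 1 always agree, column 2 varies freely.
sites = ["AAC", "AAG", "TTC", "TTG", "AAC", "TTG"]
parents = chow_liu_parents(sites)
```

Once the parent of each position is fixed, the conditional probability tables can be estimated from counts. In practice some smoothing or a penalized score is usually needed with few sites.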

6
Evaluation using aligned data
95 TFs with 20 binding sites from the TRANSFAC database (Wingender et al., 2001)
  • Estimate the generalization of each model
  • Test: how probable is a held-out site given the model?

Cross-validation
[Figure: the data set of aligned sites (e.g. ATGGGGCGGGGC, GTGGGGCGGGGC, GCGGGGCCGGGC) is split into a training set and held-out test sites; each test site gets a test log-likelihood under the model learned from the training set (e.g. -20.34, -23.03, -21.31)]
Test avg. LL: -20.77
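The evaluation loop can be sketched as follows for the simplest model, a profile. The number of folds, the pseudocount, and the function names are assumptions made for illustration. The example sites are the ones legible in the slide residue.

```python
import math

ALPHABET = "ACGT"

def fit_profile(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies, smoothed with a pseudocount."""
    length = len(sites[0])
    profile = []
    for i in range(length):
        counts = {x: pseudocount for x in ALPHABET}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        profile.append({x: c / total for x, c in counts.items()})
    return profile

def log_likelihood(site, profile):
    """log P(site) under an independent-position profile."""
    return sum(math.log(profile[i][x]) for i, x in enumerate(site))

def cv_test_log_likelihood(sites, k=5):
    """Average held-out log-likelihood per site over k folds."""
    scores = []
    for fold in range(k):
        test = sites[fold::k]
        train = [s for idx, s in enumerate(sites) if idx % k != fold]
        profile = fit_profile(train)
        scores.extend(log_likelihood(s, profile) for s in test)
    return sum(scores) / len(scores)

sites = [
    "ATGGGGCGGGGC", "GTGGGGCGGGGC", "GCGGGGCCGGGC", "TGGGGGCGGGGT",
    "AGGGGGCGGGGG", "TAGGGGCCGGGC", "TGGGGGCCGGGC", "CCGGGGCGGTCC",
    "GAGGGGACGAGT", "ATGGGGCGGGGC",
]
avg_ll = cv_test_log_likelihood(sites)
```

Running the same loop with a dependency model in place of `fit_profile` and `log_likelihood` gives a second average, and the difference between the two averages is the quantity compared across models.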
7
Arabidopsis ABA binding factor 1
Profile
Test LL per instance: -19.93
Test LL per instance: -18.70 (+1.23) (improvement in likelihood > 2-fold)
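The "> 2-fold" figure follows from the difference between the two test log-likelihoods. The slides do not state the base of the logarithm, so the sketch below works it out for both base 2 and base e. The variable names are hypothetical, and the second value is assumed to belong to a dependency model whose label did not survive in the transcript.

```python
import math

ll_profile = -19.93     # test LL per instance, profile
ll_dependency = -18.70  # test LL per instance, second model on the slide

delta = ll_dependency - ll_profile   # difference in log-likelihood
fold_if_log2 = 2 ** delta            # fold-change if LL is in bits
fold_if_ln = math.exp(delta)         # fold-change if LL is in nats
```

Under either base, a gain of 1.23 in log-likelihood corresponds to more than a 2-fold increase in the probability assigned to a held-out site.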
8
Likelihood improvement over profiles
TRANSFAC: 95 aligned data sets
[Figure: fold-change in likelihood over the profile (log scale, 0.5 to 128) for each of the 95 data sets; points are marked as significant (paired t-test) or not significant]
Significant improvement in generalization ⇒ Data often exhibits dependencies
9
Evaluation for unaligned data
  • Motif finding problem
  • Input: A set of potentially co-regulated genes
  • Output: A common motif in their promoters
  • Sources of data
  • Gene annotation (e.g. Hughes et al., 2000)
  • Gene expression (e.g. Spellman et al., 1998;
    Tavazoie et al., 2000)
  • ChIP (e.g. Simon et al., 2001; Lee et al., 2002)

10
Learning models: unaligned data
  • Use the EM algorithm to simultaneously
  • Identify binding site positions
  • Learn a dependency model

[Figure: the EM algorithm on unaligned data, alternating between learning a model and identifying binding sites]
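The alternation between the two steps can be sketched with a deliberately simplified EM. It assumes exactly one site per sequence, a uniform background, and a plain profile as the motif model, whereas the talk learns dependency models. None of these choices, nor the function name or toy sequences, come from the slides.

```python
import math
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, iters=50, pseudocount=0.5, seed=0):
    """EM for a profile motif with one site per sequence (simplifying assumption)."""
    rng = random.Random(seed)
    # Start from one randomly chosen window per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    post = [
        [1.0 if p == st else 0.0 for p in range(len(s) - width + 1)]
        for s, st in zip(seqs, starts)
    ]
    profile = None
    for _ in range(iters):
        # M-step ("learn a model"): expected counts weighted by site posteriors.
        counts = [{x: pseudocount for x in ALPHABET} for _ in range(width)]
        for s, w in zip(seqs, post):
            for p, wp in enumerate(w):
                if wp > 0.0:
                    for i in range(width):
                        counts[i][s[p + i]] += wp
        profile = [
            {x: c / sum(col.values()) for x, c in col.items()} for col in counts
        ]
        # E-step ("identify binding sites"): posterior over the site position.
        # With a uniform background this is proportional to the motif vs.
        # background likelihood ratio of each window.
        post = []
        for s in seqs:
            ratios = [
                math.prod(profile[i][s[p + i]] / 0.25 for i in range(width))
                for p in range(len(s) - width + 1)
            ]
            z = sum(ratios)
            post.append([r / z for r in ratios])
    best = [max(range(len(w)), key=w.__getitem__) for w in post]
    return profile, best

# Toy promoters (made up), each containing a copy of TGAC.
seqs = ["ACGTTGACGG", "TTGACCATCA", "GGCATTGACA", "CATGACTTAG"]
profile, best = em_motif(seqs, width=4, iters=20)
```

EM only finds a local optimum, so the result depends on the initial windows. Practical motif finders restart from several initializations and keep the highest-likelihood solution.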
11
ChIP location analysis (Lee et al., 2002)
  • Yeast genome-wide location experiments
  • Target genes for 106 TFs in 146 experiments

[Figure: table of 6000 yeast genes (e.g. YAL001C, YAL002W, YAL003W, ..., YPR201W) with the target genes of each TF marked, e.g. ABF1 targets and ZAP1 targets]
12
Example: Models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1
Known profile (from TRANSFAC)
13
Evaluating Performance
  • Detect target genes on a genomic scale

[Figure: a promoter sequence (ACGTAT..AGGGATGC GAGC) spanning -1000 to 0 relative to the gene, with position -473 marked]
14
Evaluating Performance
Detect target genes on a genomic scale
Biologically verified site
Gal4 regulates Gal80
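One way to turn a learned model into genome-wide predictions is to scan each promoter and keep its best-scoring window. The sketch below is an assumption about how such a scan could look: it uses a profile with a log-odds score against a uniform background, and the function names and toy motif are hypothetical.

```python
import math

def best_site(promoter, profile, background=0.25):
    """Return (score, offset) of the best log-odds window in a promoter."""
    width = len(profile)
    best_score, best_offset = -math.inf, None
    for p in range(len(promoter) - width + 1):
        score = sum(
            math.log(profile[i][promoter[p + i]] / background)
            for i in range(width)
        )
        if score > best_score:
            best_score, best_offset = score, p
    return best_score, best_offset

def column(preferred):
    """Toy profile column that strongly prefers one nucleotide."""
    return {x: (0.85 if x == preferred else 0.05) for x in "ACGT"}

profile = [column("G"), column("G"), column("A")]  # made-up motif "GGA"
promoter = "ACGTGGATTT"                            # made-up promoter
score, offset = best_site(promoter, profile)
# Coordinate relative to the gene start, in the style of the slide (e.g. -473).
position = offset - len(promoter)
```

A gene would then be called a target when its best score exceeds some threshold. Sweeping that threshold trades false positives against false negatives.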
15
Evaluation using ChIP location data (Lee et al., 2002)
  • Evaluate using a 5-fold cross-validation test

[Figure: genes in the data set compared with the model's predictions; example gene names: YAL001C, YAL002W, YAL003W, YAL005C, ..., YPR201W]
16
Evaluation using ChIP location data (Lee et al., 2002)
  • Evaluate using a 5-fold cross-validation test

[Figure: the same comparison of data set and prediction, with each gene marked as correct, false negative (FN), or false positive (FP)]
17
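Example: ROC curve of HSF1
[Figure: ROC curves for HSF1 (curves include Mixture of Trees, Tree, and Mixture of Profiles); annotation: ~60 FP]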
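A ROC curve of this kind can be traced by ranking genes by score and counting true and false positives as the threshold is lowered. The sketch below uses raw counts on both axes, which matches an annotation such as "60 FP". The function name and toy data are assumptions, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return [(false_positives, true_positives)] as the threshold is lowered."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0, 0)]
    fp = tp = 0
    for _, is_target in ranked:
        if is_target:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points

# Toy example: four genes with made-up scores; True marks a ChIP target.
points = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
```

Comparing two models at a fixed number of false positives then amounts to reading off the true-positive count from each curve at that point.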
18
Improvement in sensitivity & specificity
105 unaligned data sets from Lee et al.
[Figure: overlap of the True and Predicted target sets; TP is their intersection]
Sensitivity = TP / True
Specificity = TP / Predicted
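The two ratios, as defined on this slide, can be computed directly from sets of gene names. Note that "specificity" here is TP / Predicted, which is often called precision elsewhere. The function name and the example gene sets are hypothetical.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Sensitivity = TP / True and Specificity = TP / Predicted, as on the slide."""
    true_targets = set(true_targets)
    predicted_targets = set(predicted_targets)
    tp = len(true_targets & predicted_targets)
    sensitivity = tp / len(true_targets) if true_targets else 0.0
    specificity = tp / len(predicted_targets) if predicted_targets else 0.0
    return sensitivity, specificity

# Made-up example: 3 true targets, 4 predictions, 2 of them correct.
true_targets = {"YAL001C", "YAL002W", "YAL003W"}
predicted = {"YAL001C", "YAL002W", "YAL005C", "YAL007C"}
sens, spec = sensitivity_specificity(true_targets, predicted)
```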
19
Improvement in sensitivity & specificity
105 unaligned data sets from Lee et al.
Mixture of Profiles vs. Profile
[Figure: Δ specificity against Δ sensitivity for each data set, where Sensitivity = TP / True and Specificity = TP / Predicted]
20
Improvement in sensitivity & specificity
105 unaligned data sets from Lee et al.
Mixture of Trees vs. Profile
[Figure: Δ specificity against Δ sensitivity for each data set, where Sensitivity = TP / True and Specificity = TP / Predicted]
21
Is it worthwhile to model dependencies?
Evaluation clearly supports this
What about the underlying biology?
(with Prof. Hanah Margalit, Hadassah Medical School)
22
Distance between dependent positions
Tree models learned from the aligned data sets
< 1/3 of the dependencies
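Distances of this kind could be read off a learned tree as sketched below. The sketch assumes the tree is stored as a list giving each position's parent, which is a hypothetical representation, and the example tree is made up.

```python
from collections import Counter

def dependency_distances(parents):
    """Count |i - parent(i)| over all tree edges; roots (None) are skipped."""
    return Counter(abs(i - p) for i, p in enumerate(parents) if p is not None)

# Made-up tree over 5 positions: 1 depends on 0, 2 on 1, 3 on 0, 4 on 3.
parents = [None, 0, 1, 0, 3]
distances = dependency_distances(parents)
adjacent_fraction = distances[1] / sum(distances.values())
```

Pooling these counts over the trees learned for all data sets gives a histogram of how far apart dependent positions are.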
23
Structural families
Dependency models vs. Profile on aligned data sets
24
Conclusions
  • Flexible framework for learning dependencies
  • Dependencies are found in many cases
  • It is worthwhile to model them: better learning
    and binding site prediction

Future work
  • Link to the underlying structural biology
  • Incorporate as part of other regulatory mechanism
    models

http://compbio.cs.huji.ac.il/TFBN