Title: Modeling Dependencies in Protein-DNA Binding Sites
1 Modeling Dependencies in Protein-DNA Binding Sites
- Yoseph Barash 1
- Gal Elidan 1
- Nir Friedman 1
- Tommy Kaplan 1,2
1 School of Computer Science & Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel
2 Dependent positions in binding sites
[Figure: a gene with its promoter and a binding site; nucleotide labels A, C and T shown at binding-site positions.]
Most approaches assume position independence.
To model or not to model dependencies? (Man & Stormo, 2001; Bulyk et al., 2002; Benos et al., 2002)
- Pros: Biology suggests dependencies
  - A single amino acid interacts with two nucleotides
  - Change in conformation of protein or DNA
- Cons: Modeling dependencies is harder
  - Additional parameters
  - Requires more data, not as robust
3 Data-driven approach
- Can we learn dependencies from available genomic data?
- Do dependency models perform better?
Outline:
- Flexible models of dependencies
- Learning from (un)aligned sequences
- Systematic evaluation
- → Biological insights
4 How to model binding sites?
Represent a distribution of binding sites:
- Profile: independence model
- Tree: direct dependencies
- Mixture of Profiles: global dependencies
- Mixture of Trees: both types of dependencies
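The model families listed above can be sketched as small scoring functions. This is a minimal illustration, not the authors' implementation: the function names (`profile_log_prob`, `tree_log_prob`, `mixture_log_prob`), the data layout and all toy probabilities are assumptions made for the example.

```python
import math

def profile_log_prob(site, profile):
    """Profile: every position is independent of the others."""
    return sum(math.log(profile[i][base]) for i, base in enumerate(site))

def tree_log_prob(site, parent, cpds):
    """Tree: each position depends directly on at most one parent position.
    parent[i] is the parent index of position i (None for the root);
    cpds[i][parent_base][base] is the conditional probability table."""
    total = 0.0
    for i, base in enumerate(site):
        parent_base = None if parent[i] is None else site[parent[i]]
        total += math.log(cpds[i][parent_base][base])
    return total

def mixture_log_prob(site, weights, profiles):
    """Mixture of profiles: a hidden component selects which profile is used,
    which couples all positions globally."""
    terms = [math.log(w) + profile_log_prob(site, p)
             for w, p in zip(weights, profiles)]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Toy two-position models (hypothetical numbers, for illustration only).
p1 = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
      {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
p2 = [{"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
      {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
parent = [None, 0]  # position 1 depends on position 0
cpds = [
    {None: {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}},
    {a: ({"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05} if a == "A"
         else {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
     for a in "ACGT"},
]

print(profile_log_prob("AG", p1))
print(tree_log_prob("AG", parent, cpds))
print(mixture_log_prob("AG", [0.5, 0.5], [p1, p2]))
```

A mixture of trees would combine the last two ideas, replacing `profile_log_prob` inside the mixture with `tree_log_prob` for each component.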
5 Learning models: aligned binding sites
Learning machinery: select the maximum likelihood model
- Learning based on methods for probabilistic graphical models (Bayesian networks)
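For the tree family, one standard way to select a maximum likelihood structure from aligned sites is the Chow-Liu procedure: weight each pair of positions by its mutual information and take a maximum-weight spanning tree. The slides do not say which procedure was used, so the sketch below is an assumption; the function names, the pseudocount value and the toy sites are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

BASES = "ACGT"

def mutual_information(sites, i, j, pseudo=0.5):
    """Mutual information (in nats) between positions i and j,
    estimated from aligned sites with pseudocounts (assumed value)."""
    n = len(sites) + pseudo * len(BASES) ** 2
    joint = Counter((s[i], s[j]) for s in sites)
    pij = {(a, b): (joint[(a, b)] + pseudo) / n for a in BASES for b in BASES}
    pi = {a: sum(pij[(a, b)] for b in BASES) for a in BASES}
    pj = {b: sum(pij[(a, b)] for a in BASES) for b in BASES}
    return sum(p * math.log(p / (pi[a] * pj[b])) for (a, b), p in pij.items())

def chow_liu_edges(sites):
    """Maximum-weight spanning tree over positions (Prim's algorithm)."""
    length = len(sites[0])
    weight = {(i, j): mutual_information(sites, i, j)
              for i, j in combinations(range(length), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < length:
        # Pick the heaviest edge that connects the tree to a new position.
        i, j = max(((a, b) for (a, b) in weight
                    if (a in in_tree) != (b in in_tree)),
                   key=lambda e: weight[e])
        edges.append((i, j))
        in_tree.update((i, j))
    return edges

# Toy aligned sites: positions 0 and 2 always agree, position 1 varies freely.
toy_sites = ["AAA", "ACA", "AGA", "ATA", "CAC", "CCC", "CGC", "CTC"]
print(chow_liu_edges(toy_sites))
```

In the toy data the strongest dependency is between positions 0 and 2, so that edge is selected first; the remaining edge only serves to connect position 1.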
6 Evaluation using aligned data
95 TFs with 20 binding sites from the TRANSFAC database (Wingender et al., 2001)
- Estimate generalization of each model
- Test: how probable is the site given the model? → Cross-validation
[Figure: cross-validation schematic. The data set of aligned sites (e.g. ATGGGGCGGGGC, GTGGGGCGGGGC, GAGGGGACGAGT, CCGGGGCGGTCC, GCGGGGCCGGGC, TGGGGGCGGGGT, AGGGGGCGGGGG, TAGGGGCCGGGC, TGGGGGCCGGGC) is split into a training set and held-out test sites; each test site receives a test log-likelihood (e.g. -20.34, -23.03, -21.31), and the average test LL is -20.77.]
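The cross-validation procedure can be sketched for the simplest model, a profile. This is an illustrative sketch only: the number of folds, the pseudocount and the function names are assumptions, and the sites are the legible ones from the slide, so the printed value is not expected to reproduce the slide's numbers.

```python
import math

BASES = "ACGT"

def fit_profile(sites, pseudo=1.0):
    """Maximum likelihood profile with pseudocounts (assumed smoothing)."""
    profile = []
    for i in range(len(sites[0])):
        counts = {b: pseudo for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        profile.append({b: c / total for b, c in counts.items()})
    return profile

def log_prob(site, profile):
    return sum(math.log(profile[i][b]) for i, b in enumerate(site))

def cv_avg_test_ll(sites, folds=5):
    """Average log-likelihood of held-out sites over all folds."""
    scores = []
    for k in range(folds):
        test = sites[k::folds]
        train = [s for idx, s in enumerate(sites) if idx % folds != k]
        profile = fit_profile(train)
        scores.extend(log_prob(s, profile) for s in test)
    return sum(scores) / len(scores)

SITES = ["ATGGGGCGGGGC", "GTGGGGCGGGGC", "GAGGGGACGAGT",
         "CCGGGGCGGTCC", "GCGGGGCCGGGC", "TGGGGGCGGGGT",
         "AGGGGGCGGGGG", "TAGGGGCCGGGC", "TGGGGGCCGGGC"]

print(cv_avg_test_ll(SITES, folds=3))
```

The same loop applies to any of the dependency models: only `fit_profile` and `log_prob` would be swapped for the corresponding learning and scoring routines.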
7 Arabidopsis ABA binding factor 1
Profile: test LL per instance -19.93
Test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)
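A gain in log-likelihood converts to a fold-change in likelihood by exponentiation. The slide does not state the base of the logarithm, so the sketch below tries both base 2 and base e; the function name `fold_change` is hypothetical.

```python
import math

def fold_change(delta_ll, base):
    """Likelihood ratio implied by a log-likelihood difference in the given base."""
    return base ** delta_ll

delta = -18.70 - (-19.93)  # per-instance gain reported on the slide
print(fold_change(delta, 2), fold_change(delta, math.e))
```

Under either base the ratio exceeds 2, which is consistent with the "> 2-fold" statement.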
8 Likelihood improvement over profiles
TRANSFAC: 95 aligned data sets
[Figure: fold-change in likelihood relative to the profile model for each data set (vertical axis on a log scale, 0.5 to 128; horizontal axis ticks 10 to 90), with data sets marked as significant (paired t-test) or not significant.]
Significant improvement in generalization → Data often exhibits dependencies
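The significance marking relies on a paired t-test. A minimal sketch of the test statistic follows, assuming the pairs are test log-likelihoods of the same held-out sites under two models; the slide does not specify the pairing, and the function name and toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(ll_model, ll_profile):
    """t statistic for paired differences; it has len(diffs) - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(ll_model, ll_profile)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Toy values only, not data from the study.
t = paired_t_statistic([1.0, 2.0, 4.0], [0.0, 0.0, 1.0])
print(t)
```

The statistic would then be compared against a t distribution with the stated degrees of freedom to obtain a p-value.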
9 Evaluation for unaligned data
- Motif finding problem
  - Input: a set of potentially co-regulated genes
  - Output: a common motif in their promoters
- Sources of data
  - Gene annotation (e.g. Hughes et al., 2000)
  - Gene expression (e.g. Spellman et al., 1998; Tavazoie et al., 2000)
  - ChIP (e.g. Simon et al., 2001; Lee et al., 2002)
10 Learning models: unaligned data
- Use the EM algorithm to simultaneously
  - Identify binding site positions
  - Learn a dependency model
[Figure: EM algorithm loop over the unaligned data, alternating between learning a model and identifying binding sites.]
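The alternation between identifying sites and learning a model can be sketched with EM for the simplest case. The sketch below assumes exactly one site per sequence, a uniform background and a profile motif rather than a dependency model; the seeding strategy, parameter values and names are hypothetical simplifications, not the authors' method.

```python
import math

BASES = "ACGT"

def em_motif(seqs, width, start=0, iters=20, pseudo=0.1):
    """EM for a one-site-per-sequence model with a profile motif."""
    # Seed the profile from one window of the first sequence (assumption;
    # in practice several seeds or restarts would be tried).
    seed = seqs[0][start:start + width]
    profile = [{b: (0.7 if b == seed[i] else 0.1) for b in BASES}
               for i in range(width)]
    posteriors = []
    for _ in range(iters):
        counts = [{b: pseudo for b in BASES} for _ in range(width)]
        posteriors = []
        for s in seqs:
            # E-step: posterior over the site's start position. With a uniform
            # background the background term is the same for every start.
            logp = [sum(math.log(profile[i][s[p + i]]) for i in range(width))
                    for p in range(len(s) - width + 1)]
            top = max(logp)
            w = [math.exp(x - top) for x in logp]
            z = sum(w)
            w = [x / z for x in w]
            posteriors.append(w)
            for p, wp in enumerate(w):
                for i in range(width):
                    counts[i][s[p + i]] += wp  # expected counts
        # M-step: re-estimate the profile from expected counts.
        profile = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    sites = [max(range(len(w)), key=w.__getitem__) for w in posteriors]
    return profile, sites

# Toy sequences with GGGCGG planted at offsets 0, 2, 4 and 1.
toy_seqs = ["GGGCGGATCA", "TTGGGCGGAT", "CATAGGGCGG", "AGGGCGGTTA"]
profile, sites = em_motif(toy_seqs, width=6)
print(sites)
print("".join(max(col, key=col.get) for col in profile))
```

Replacing the profile with a tree or mixture model changes only the scoring in the E-step and the parameter (and structure) estimation in the M-step.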
11 ChIP location analysis (Lee et al., 2002)
- Yeast genome-wide location experiments
- Target genes for 106 TFs in 146 experiments
[Figure: table over the 6000 yeast genes (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, ..., YPR201W) with one target column per TF, e.g. ABF1 Targets and ZAP1 Targets.]
12 Example: Models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1
Known profile (from TRANSFAC)
13 Evaluating Performance
- Detect target genes on a genomic scale
[Figure: promoter sequence (ACGTAT..AGGGATGC GAGC) on a coordinate axis from -1000 to 0, with position -473 marked.]
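Detecting targets on a genomic scale amounts to scanning each promoter for its best-scoring window. The sketch below uses a log-odds score of a profile against a uniform background; the scoring rule, the function name `best_site`, the toy profile and the toy promoter are assumptions for illustration.

```python
import math

def best_site(promoter, profile, background=0.25):
    """Return (score, position) of the best window; position is counted
    back from the end of the promoter, which is taken as coordinate 0."""
    width = len(profile)
    best = None
    for p in range(len(promoter) - width + 1):
        score = sum(math.log(profile[i][promoter[p + i]] / background)
                    for i in range(width))
        if best is None or score > best[0]:
            best = (score, p - len(promoter))
    return best

# Toy profile favouring GGG and a toy promoter (hypothetical values).
toy_profile = [{"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}] * 3
score, position = best_site("ACGTATAGGGATGC", toy_profile)
print(score, position)
```

A gene would then be called a target when its best score exceeds a threshold; varying that threshold trades sensitivity against false positives.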
14 Evaluating Performance
- Detect target genes on a genomic scale
Biologically verified site
Gal4 regulates Gal80
15 Evaluation using ChIP location data (Lee et al., 2002)
- Evaluate using a 5-fold cross-validation test
[Figure: cross-validation schematic labelled "Data set" and "Prediction", over genes YAL001C, YAL002W, YAL003W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W.]
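16 Evaluation using ChIP location data (Lee et al., 2002)
- Evaluate using a 5-fold cross-validation test
[Figure: cross-validation schematic labelled "Data set" and "Prediction", over genes YAL001C, YAL002W, YAL003W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W; each gene's prediction is marked as correct (✓), false negative (FN) or false positive (FP): ✓ ✓ ✓ ✓ FN ✓ ✓ ✓ FP ✓ ✓.]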
17 Example: ROC curve of HSF1
[Figure: ROC curve for HSF1, with an annotation at 60 FP.]
18 Improvement in sensitivity and specificity
105 unaligned data sets from Lee et al.
[Figure: overlap between the True and Predicted target sets; the intersection is TP.]
Sensitivity = TP / True; Specificity = TP / Predicted
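The two measures can be computed directly from the true and predicted target sets. Note that specificity as defined on the slide (TP / Predicted) is the quantity often called precision. The function name is hypothetical, and the gene assignments below are toy values using gene names that appear in the slides.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Sensitivity = TP / True; Specificity = TP / Predicted (slide definitions)."""
    true_targets = set(true_targets)
    predicted_targets = set(predicted_targets)
    tp = len(true_targets & predicted_targets)
    return tp / len(true_targets), tp / len(predicted_targets)

# Toy assignment of genes to the two sets (not results from the study).
true_set = {"YAL001C", "YAL002W", "YAL003W"}
predicted_set = {"YAL002W", "YAL003W", "YAL005C", "YAL007C"}
sens, spec = sensitivity_specificity(true_set, predicted_set)
print(sens, spec)
```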
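19 Improvement in sensitivity and specificity
105 unaligned data sets from Lee et al.
Mixture of Profiles vs. Profile
[Figure: change in specificity (Δ specificity) plotted against change in sensitivity (Δ sensitivity) across the data sets; inset showing the True and Predicted sets with their overlap TP.]
Sensitivity = TP / True; Specificity = TP / Predicted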
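20 Improvement in sensitivity and specificity
105 unaligned data sets from Lee et al.
Mixture of Trees vs. Profile
[Figure: change in specificity (Δ specificity) plotted against change in sensitivity (Δ sensitivity) across the data sets; inset showing the True and Predicted sets with their overlap TP.]
Sensitivity = TP / True; Specificity = TP / Predicted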
21 Is it worthwhile to model dependencies?
Evaluation clearly supports this.
What about the underlying biology? (with Prof. Hanah Margalit, Hadassah Medical School)
22 Distance between dependent positions
Tree models learned from the aligned data sets
< 1/3 of the dependencies
23 Structural families
Dependency models vs. Profile on aligned data sets
24 Conclusions
- Flexible framework for learning dependencies
- Dependencies are found in many cases
- It is worthwhile to model them:
  - Better learning and binding site prediction
Future work
- Link to the underlying structural biology
- Incorporate as part of other regulatory mechanism models
http://compbio.cs.huji.ac.il/TFBN