Modeling Dependencies in Protein-DNA Binding Sites

Slides: 25

Provided by: tommyk1

Transcript and Presenter's Notes

1
Modeling Dependencies in Protein-DNA Binding Sites

Yoseph Barash 1
Gal Elidan 1
Nir Friedman 1
Tommy Kaplan 1,2

1 School of Computer Science and Engineering
2 Hadassah Medical School
The Hebrew University, Jerusalem, Israel
2
Dependent positions in binding sites
[Figure: schematic of a gene, its promoter, and a binding site, with nucleotide labels A, C, and T]
Most approaches assume position independence
To model or not to model dependencies? (Man & Stormo, 2001; Bulyk et al., 2002; Benos et al., 2002)

Pros: Biology suggests dependencies
  A single amino acid interacts with two nucleotides
  Change in conformation of the protein or DNA
Cons: Modeling dependencies is harder
  Additional parameters
  Requires more data, not as robust

3
Data-driven approach

Can we learn dependencies from available genomic data?
Do dependency models perform better?
Outline
  Flexible models of dependencies
  Learning from (un)aligned sequences
  Systematic evaluation
  → Biological insights

4
How to model binding sites?
Represent a distribution of binding sites
  Profile: independence model
  Tree: direct dependencies
  Mixture of Profiles: global dependencies
  Mixture of Trees: both types of dependencies
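The four model families can be read as different factorizations of the probability of a site x = (x_1, ..., x_L). The slides name the models but give no formulas, so the notation below is an assumption introduced for illustration: pa(i) is the parent position of position i in the tree (empty for the root), and T is a hidden mixture component with K values.

```latex
\begin{align*}
\text{Profile:} \quad
  & P(x_1,\dots,x_L) = \prod_{i=1}^{L} P_i(x_i) \\
\text{Tree:} \quad
  & P(x_1,\dots,x_L) = \prod_{i=1}^{L} P_i\bigl(x_i \mid x_{\mathrm{pa}(i)}\bigr) \\
\text{Mixture of Profiles:} \quad
  & P(x_1,\dots,x_L) = \sum_{k=1}^{K} P(T=k) \prod_{i=1}^{L} P_i(x_i \mid T=k) \\
\text{Mixture of Trees:} \quad
  & P(x_1,\dots,x_L) = \sum_{k=1}^{K} P(T=k) \prod_{i=1}^{L} P_i\bigl(x_i \mid x_{\mathrm{pa}(i)}, T=k\bigr)
\end{align*}
```

With K = 1 the mixtures reduce to the plain profile and tree, and a tree with no edges reduces to the profile.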
5
Learning models: aligned binding sites
Learning machinery: select the maximum-likelihood model

Learning based on methods for probabilistic
graphical models (Bayesian networks)
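For the tree model, one standard way to pick a maximum-likelihood structure is the Chow-Liu procedure: weight each pair of positions by its empirical mutual information and take a maximum-weight spanning tree. The slides do not say which procedure was used, so treat this as a minimal sketch under that assumption. The function names and the pseudocount value are hypothetical, and the pseudocount means the result is a smoothed estimate rather than the exact maximum-likelihood tree.

```python
import math
from collections import Counter
from itertools import combinations

ALPHABET = "ACGT"

def mutual_information(sites, i, j, pseudo=0.5):
    """Smoothed mutual information (nats) between positions i and j.

    pseudo must be positive so that no probability is zero.
    """
    n = len(sites) + pseudo * len(ALPHABET) ** 2
    pair = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for a in ALPHABET:
        for b in ALPHABET:
            p_ab = (pair[(a, b)] + pseudo) / n
            # Marginals are taken from the same smoothed joint table.
            p_a = sum(pair[(a, c)] + pseudo for c in ALPHABET) / n
            p_b = sum(pair[(c, b)] + pseudo for c in ALPHABET) / n
            mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(sites):
    """Edges (i, j) of a maximum-weight spanning tree over site positions."""
    length = len(sites[0])
    edges = sorted(
        ((mutual_information(sites, i, j), i, j)
         for i, j in combinations(range(length), 2)),
        reverse=True,
    )
    parent = list(range(length))  # union-find forest for Kruskal's algorithm

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy aligned sites (made up): positions 0 and 2 always change together.
toy_sites = [a + b + ("T" if a == "A" else "G") for a in "AC" for b in "ACGT"]
print(chow_liu_tree(toy_sites))
```

Once the edges are fixed, the conditional probability tables along each edge can be estimated from counts in the same way as the columns of a profile.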

6
Evaluation using aligned data
95 TFs with 20 binding sites from the TRANSFAC database (Wingender et al., 2001)

Estimate the generalization of each model
Test: how probable is each site given the model?

→ Cross-validation
[Figure: the data set of aligned sites (e.g. ATGGGGCGGGGC, GCGGGGCCGGGC, TGGGGGCGGGGT) is split into a training set and test sites; each test site is scored by its log-likelihood under the learned model, with values ranging from -18.07 to -23.54]
Test avg. LL: -20.77
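The cross-validation scheme on this slide can be sketched as follows. To keep the example short it uses the simplest model, a profile, and the number of folds, the pseudocount, and the base-2 logarithm are all assumptions, since the slide does not state them. The function names are hypothetical. The example sites are taken from the slide.

```python
import math

ALPHABET = "ACGT"

def learn_profile(sites, pseudo=1.0):
    """Per-position nucleotide probabilities estimated from aligned sites."""
    profile = []
    for i in range(len(sites[0])):
        counts = {a: pseudo for a in ALPHABET}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        profile.append({a: c / total for a, c in counts.items()})
    return profile

def log_likelihood(profile, site):
    """Log2-probability of one site under the profile (positions independent)."""
    return sum(math.log2(col[x]) for col, x in zip(profile, site))

def cross_validated_ll(sites, k=5):
    """Average test log-likelihood per site over k folds."""
    folds = [sites[i::k] for i in range(k)]
    scores = []
    for f in range(k):
        # Train on every fold except the held-out one.
        train = [s for g, fold in enumerate(folds) if g != f for s in fold]
        profile = learn_profile(train)
        scores.extend(log_likelihood(profile, s) for s in folds[f])
    return sum(scores) / len(scores)

sites = [
    "ATGGGGCGGGGC", "GTGGGGCGGGGC", "GCGGGGCCGGGC",
    "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC",
    "TGGGGGCCGGGC", "GAGGGGACGAGT", "CCGGGGCGGTCC",
]
print(cross_validated_ll(sites, k=3))
```

Replacing learn_profile and log_likelihood with the corresponding functions for a tree or mixture model gives the comparison the slide describes: the model with the higher average test log-likelihood generalizes better.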
7
Arabidopsis ABA binding factor 1
Profile
Test LL per instance: -19.93
Test LL per instance: -18.70 (+1.23; improvement in likelihood > 2-fold)
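The link between the log-likelihood difference and the quoted fold change is a one-line calculation: a gain of d in log-likelihood multiplies the likelihood by base^d. The slide does not state the base of the logarithm, so the sketch below checks both base 2 and base e; the function name is hypothetical.

```python
import math

def fold_change(delta_ll, base=2.0):
    """Likelihood ratio implied by a log-likelihood gain of delta_ll."""
    return base ** delta_ll

delta = -18.70 - (-19.93)
print(round(delta, 2))  # 1.23
print(fold_change(delta, 2.0) > 2, fold_change(delta, math.e) > 2)  # True True
```

Under either base, a gain of 1.23 per instance corresponds to more than a 2-fold improvement in likelihood.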
8
Likelihood improvement over profiles
TRANSFAC: 95 aligned data sets
[Figure: fold-change in likelihood (log scale, 0.5 to 128) for each data set (x-axis 10 to 90); points marked as significant (paired t-test) or not significant]
Significant improvement in generalization → Data often exhibits dependencies
9
Evaluation for unaligned data

Motif finding problem
Input: a set of potentially co-regulated genes
Output: a common motif in their promoters

Sources of data
Gene annotation (e.g. Hughes et al., 2000)
Gene expression (e.g. Spellman et al., 1998; Tavazoie et al., 2000)
ChIP (e.g. Simon et al., 2001; Lee et al., 2002)

10
Learning models: unaligned data

Use the EM algorithm to simultaneously
  Identify binding site positions
  Learn a dependency model

[Figure: EM algorithm loop over the unaligned data, alternating between learning a model and identifying binding sites]
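The alternation between identifying sites and learning a model can be sketched with a small EM loop. This is an illustration under simplifying assumptions that are not in the slides: the motif is a plain profile rather than a dependency model, every sequence contains exactly one site, the background is uniform, and the motif width is known. All names and the toy sequences are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, iterations=50, pseudo=0.5, seed=0):
    """EM for a profile motif with one site per sequence (an assumption)."""
    rng = random.Random(seed)
    # Random soft initialization of the profile columns.
    profile = []
    for _ in range(width):
        raw = [rng.random() + 0.5 for _ in ALPHABET]
        profile.append({a: r / sum(raw) for a, r in zip(ALPHABET, raw)})
    posteriors = []
    for _ in range(iterations):
        # E-step: posterior over the site start position in each sequence.
        posteriors = []
        for s in seqs:
            logs = [
                sum(math.log(profile[i][s[start + i]] / 0.25)
                    for i in range(width))
                for start in range(len(s) - width + 1)
            ]
            top = max(logs)  # subtract the max for numerical stability
            weights = [math.exp(v - top) for v in logs]
            z = sum(weights)
            posteriors.append([w / z for w in weights])
        # M-step: re-estimate the profile from expected nucleotide counts.
        counts = [{a: pseudo for a in ALPHABET} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for start, p in enumerate(post):
                for i in range(width):
                    counts[i][s[start + i]] += p
        profile = [
            {a: c / sum(col.values()) for a, c in col.items()}
            for col in counts
        ]
    # Most probable start position in each sequence.
    sites = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return profile, sites

# Toy promoters (made up) that share the word TGACTC.
toy = ["AATTGACTCGA", "TGACTCCCGTA", "GGCATTGACTC", "CATGACTCTTG"]
profile, starts = em_motif(toy, width=6)
print(starts)
```

EM only finds a local optimum, so in practice it is usually restarted from several initializations. Swapping the M-step for a tree or mixture learner would give the dependency-model variant described on the slide.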
11
ChIP location analysis (Lee et al., 2002)

Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments

[Figure: table of 6000 yeast genes (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, YPR201W) with the targets of each TF marked, e.g. ABF1 targets and ZAP1 targets]
12
Example: models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1
Known profile (from TRANSFAC)
13
Evaluating Performance

Detect target genes on a genomic scale

[Figure: promoter sequence (ACGTAT..AGGGATGC ... GAGC) spanning positions -1000 to 0, with position -473 marked]
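Detecting target genes comes down to scanning each promoter with the learned model and reporting the best-scoring window. The sketch below does this for a profile, scoring each window by its log-odds against a uniform background. The scoring rule, the coordinate convention (position 0 at the end of the promoter, so reported positions are negative, as in the figure), and all names are assumptions for illustration.

```python
import math

def best_site(promoter, profile, background=0.25):
    """Return (score, position) of the best window, or None if none fits.

    position is relative to the end of the promoter, so a 1000 bp promoter
    yields positions from -1000 upward.
    """
    width = len(profile)
    best = None
    for start in range(len(promoter) - width + 1):
        score = sum(
            math.log2(profile[i][promoter[start + i]] / background)
            for i in range(width)
        )
        if best is None or score > best[0]:
            best = (score, start - len(promoter))
    return best

# Toy profile (made up) that strongly prefers the word ACG.
toy_profile = [
    {x: (0.97 if x == c else 0.01) for x in "ACGT"} for c in "ACG"
]
print(best_site("TTTTACGTT", toy_profile))
```

A gene would then be called a target when its best score exceeds a threshold; varying that threshold trades false positives against false negatives.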
14
Evaluating Performance
Detect target genes on a genomic scale
Biologically verified site: Gal4 regulates Gal80
15
Evaluation using ChIP location data (Lee et al., 2002)

Evaluate using a 5-fold cross-validation test

[Figure: the data set of genes is split into two groups (YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W and YAL001C, YAL002W, YAL003W), followed by a prediction step]

16
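Evaluation using ChIP location data (Lee et al., 2002)

Evaluate using a 5-fold cross-validation test

[Figure: the data set of genes is split into two groups (YAL001C, YAL002W, YAL003W and YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W); each prediction is checked against the data, with most marked correct, one false negative (FN), and one false positive (FP)]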
17
Example: ROC curve of HSF1
[Figure: ROC curve for HSF1, annotated "~60 FP"]
18
Improvement in sensitivity and specificity
105 unaligned data sets from Lee et al.
[Figure: diagram of the True and Predicted target sets and their overlap, TP]
Sensitivity = TP / True
Specificity = TP / Predicted
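The two measures defined on this slide are simple set ratios. Note that the slide's "specificity" (TP / Predicted) is the quantity more commonly called precision or positive predictive value. The sketch below follows the slide's definitions; the function name and the example gene sets are hypothetical.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Sensitivity = TP / True and specificity = TP / Predicted."""
    true_set, pred_set = set(true_targets), set(predicted_targets)
    tp = len(true_set & pred_set)  # genes that are both true and predicted
    sensitivity = tp / len(true_set) if true_set else 0.0
    specificity = tp / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity

# Made-up example: two of three true targets are recovered,
# and two of three predictions are correct.
true_targets = ["YAL001C", "YAL002W", "YAL003W"]
predicted = ["YAL001C", "YAL002W", "YAL005C"]
print(sensitivity_specificity(true_targets, predicted))
```

Computing both values for a dependency model and for the profile on the same data set gives the differences in sensitivity and specificity that the following slides plot.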
19
Improvement in sensitivity and specificity
105 unaligned data sets from Lee et al.
Mixture of Profiles vs. Profile
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: Δ sensitivity plotted against Δ specificity for each data set]
20
Improvement in sensitivity and specificity
105 unaligned data sets from Lee et al.
Mixture of Trees vs. Profile
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: Δ sensitivity plotted against Δ specificity for each data set]
21
Is it worthwhile to model dependencies?
Evaluation clearly supports this
What about the underlying biology?
(with Prof. Hanah Margalit, Hadassah Medical School)
22
Distance between dependent positions
Tree models learned from the aligned data sets
< 1/3 of the dependencies
23
Structural families
Dependency models vs. Profile on aligned data sets
24
Conclusions