Title: Computational Analysis of Protein-DNA Interactions
1Computational Analysis of Protein-DNA Interactions
- Changhui (Charles) Yan
- Department of Computer Science
- Utah State University
2Problem I
- Identifying amino acid residues involved in
protein-DNA interactions from sequence
3Materials And Methods
- 56 double-stranded DNA-binding proteins
previously used in the study of Jones et al.
(2003)
- Encoding
4Materials And Methods
5Naïve Bayes Classifier
- Leave-one-out cross-validation
Naïve Bayes
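The slide names the classifier and the evaluation protocol but gives no details. The following is a minimal sketch, assuming each residue is described by a window of amino acid identities centred on it and that each protein is held out in turn. The window size, the Laplace smoothing, and the toy data are illustrative assumptions, not taken from the slides.

```python
import math
from collections import defaultdict

WINDOW = 2   # residues on each side of the target; assumed, not from the slides
PAD = "-"    # padding symbol for window positions beyond the sequence ends
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + PAD


def windows(seq, labels):
    """Yield (window, label) for every residue of one protein."""
    padded = PAD * WINDOW + seq + PAD * WINDOW
    for i, label in enumerate(labels):
        yield padded[i:i + 2 * WINDOW + 1], label


def train(proteins):
    """Count class frequencies and per-position residue frequencies."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)  # (label, position, residue) -> count
    for seq, labels in proteins:
        for win, label in windows(seq, labels):
            class_counts[label] += 1
            for pos, aa in enumerate(win):
                feature_counts[(label, pos, aa)] += 1
    return class_counts, feature_counts


def predict(model, win):
    """Return the label with the highest log posterior (Laplace-smoothed)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        for pos, aa in enumerate(win):
            count = feature_counts.get((label, pos, aa), 0)
            score += math.log((count + 1) / (n + len(ALPHABET)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out(proteins):
    """Hold out each protein once; return (actual, predicted) label pairs."""
    results = []
    for k, (seq, labels) in enumerate(proteins):
        model = train(proteins[:k] + proteins[k + 1:])
        for win, label in windows(seq, labels):
            results.append((label, predict(model, win)))
    return results


# Toy data: (sequence, per-residue labels), 1 = interface residue. Invented.
toy = [
    ("MKRKAL", [0, 1, 1, 1, 0, 0]),
    ("ALKRRG", [0, 0, 1, 1, 1, 0]),
    ("GKRLLA", [0, 1, 1, 0, 0, 0]),
]
pairs = leave_one_out(toy)
```

Holding out whole proteins rather than single residues keeps windows from the same chain out of both the training and test sets; whether the original study did this is not stated on the slide.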
7Leave-One-Out Cross-Validations
                          Sequence-based                    Sequence/structure-based
                          Identities (ID)   ID + entropy    ID + rASA   ID + rASA + entropy
Correlation coefficient   0.25              0.29            0.28        0.30
Accuracy (%)              77                75              76          77
Specificity (%)           37                37              36          39
Sensitivity (%)           43                53              51          52
8Predictions in The Context of 3-D Structures
[Figure panels: Actual, Predicted]
- Pit-1, PDB 1au7
- TP 30
- FP 16
- TN 86
- FN 14
- CC 0.51 (2nd)
- Accuracy 79%
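The accuracy and correlation coefficient can be recomputed from the four counts listed for Pit-1. The sketch below assumes that "CC" is the Matthews correlation coefficient, which the slides do not state explicitly.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of residues whose label is predicted correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def correlation_coefficient(tp, fp, tn, fn):
    """Matthews correlation coefficient (assumed definition of 'CC')."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts reported for Pit-1 (PDB 1au7).
pit1 = dict(tp=30, fp=16, tn=86, fn=14)
print(f"accuracy = {accuracy(**pit1):.2f}")
print(f"CC       = {correlation_coefficient(**pit1):.2f}")
```

With these counts the accuracy comes out at about 0.79, matching the slide, and the coefficient at about 0.52, close to the reported 0.51. The small gap may reflect rounding or a slightly different definition.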
9Predictions in The Context of 3-D Structures
[Figure panels: Predicted, Actual]
- λ-Cro, PDB 6cro
- TP 10
- FP 5
- TN 34
- FN 10
- CC 0.37 (19th)
- Accuracy 73%
10Predictions Compared With PROSITE Motifs
- Predicted binding sites substantially overlap
with 34 of the 37 DNA-binding PROSITE motifs
- In 52 of the 56 proteins, the predictor
identifies at least 20% of the DNA-binding
residues
- 28 of the 56 proteins contain no PROSITE motifs
that are annotated as DNA-binding
11Comparison With Previous Study
Method                    Naïve Bayes classifier   Ahmad and Sarai method
Correlation coefficient   0.26                     0.23
Accuracy (%)              80                       66
Specificity (%)           29                       21
Sensitivity (%)           48                       68
Ahmad, S. and Sarai, A. (2005) PSSM-based
prediction of DNA binding sites in proteins. BMC
Bioinformatics, 6, 33.
12Summary
- A simple sequence-based Naïve Bayes classifier
predicts interface residues in DNA-binding
proteins with 75% accuracy, 37% specificity, 53%
sensitivity and a correlation coefficient of 0.29
- Predicted binding sites
- correctly indicate the locations of actual
binding sites
- substantially overlap with known PROSITE motifs
13Problem II
- Identification of Helix-Turn-Helix (HTH)
DNA-binding motifs
14HTH Motifs
- Sequences sharing low similarity can fold into
a similar HTH structure
- Identifying HTH motifs from sequence is extremely
challenging
15Trick 1
- Including more information
- Amino acid sequence
- Secondary structure
16Hidden Markov Model (HMM)
LQQITHIANQL-GLE----KDVVRVWF
17Hidden Markov Model (HMM_AA_SS)
LQQITHIANQL-GLE----KDVVRVWF
HHHEEHEEEHMHE----HHEEMMEH
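How HMM_AA_SS combines the two tracks is not spelled out on the slide. One possible encoding, shown here only as a sketch, pairs each aligned residue with its secondary-structure state and estimates per-column emission frequencies over the paired symbols. The toy rows and the H/E/C state letters are invented for illustration and are not the slide's alignment.

```python
from collections import Counter

GAP = "-"


def pair_symbols(aa_row, ss_row):
    """Zip an aligned residue row with its secondary-structure row."""
    if len(aa_row) != len(ss_row):
        raise ValueError("rows must have the same aligned length")
    return [GAP if aa == GAP else aa + ss for aa, ss in zip(aa_row, ss_row)]


def column_emissions(rows):
    """Per-column relative frequencies of paired symbols, ignoring gaps."""
    emissions = []
    for column in zip(*rows):
        counts = Counter(sym for sym in column if sym != GAP)
        total = sum(counts.values())
        emissions.append(
            {sym: c / total for sym, c in counts.items()} if total else {}
        )
    return emissions


# Toy aligned rows (invented; H = helix, E = strand, C = coil).
aligned = [
    ("LQQIT-H", "HHHHC-H"),
    ("LEEVTGH", "HHHHCCH"),
]
rows = [pair_symbols(aa, ss) for aa, ss in aligned]
emissions = column_emissions(rows)
```

Under this encoding a match state emits symbols such as `LH` (leucine in a helix), so two sequences with different residues but the same structural state still share part of the signal.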
18Trick 2
- There are similarities among the 20 naturally
occurring amino acids
- Reduced alphabets
19Reduced Alphabets
- Schemes for reducing amino acid alphabet based on
the BLOSUM50 matrix by Henikoff and Henikoff
(1992) derived by grouping and averaging the
similarity matrix elements as described in the
text. (Murphy et al. 2000)
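As a sketch of how a reduced alphabet is applied, the snippet below rewrites a sequence using a 10-letter grouping. The groups are the form commonly quoted for Murphy et al. (2000) and should be checked against the paper; representing each group by its first letter is an arbitrary choice made here.

```python
# 10-letter grouping as commonly quoted from Murphy et al. (2000);
# check against the paper before relying on it.
MURPHY_10_GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]

# Map every residue to the first letter of its group.
REDUCE = {aa: group[0] for group in MURPHY_10_GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a sequence in the reduced alphabet; other symbols pass through."""
    return "".join(REDUCE.get(ch, ch) for ch in seq)


print(reduce_sequence("LQQITHIANQL-GLE"))  # prints LEELSHLAEEL-GLE
```

Gap characters are left untouched, so the reduced sequence stays aligned with the original.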
20Cross-Families Evaluations
Model                         True positives [1]   False positives [2]
HMM_AA                        3                    0
HMM_AA_SS (20 letters) [3]    227                  0
HMM_AA_SS (Murphy_15) [3]     474                  0
HMM_AA_SS (Murphy_10) [3]     470                  3
HMM_AA_SS (Murphy_8) [3]      431                  5
[1] True positive: HTH motifs that are correctly
identified as such.
[2] False positive: non-HTH motifs that are
identified as HTH motifs.
[3] The alphabet used to encode amino acid sequences.
21Questions
22Within-family Three-Fold Cross-Validations
Family (number of HTH motifs in the family)   HMM_AA   HMM_AA_SS (Murphy_15)
PF00126 (1635)                                1594     1622
PF00165 (90)                                  63       80
PF00196 (30)                                  26       30
PF04545 (164)                                 137      164
PF01022 (42)                                  39       39
PF00046 (189)                                 176      188
PF03965 (48)                                  48       48
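The within-family protocol can be sketched as a three-way split in which every motif is tested exactly once by a model trained on the other two thirds. The interleaved assignment to folds and the motif identifiers below are assumptions; the slides do not say how folds were drawn.

```python
def three_fold_splits(items):
    """Yield (train, test) lists; each item is tested exactly once."""
    folds = [items[i::3] for i in range(3)]
    for k in range(3):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, folds[k]


# Hypothetical identifiers standing in for the 30 motifs of PF00196.
motifs = [f"motif_{i}" for i in range(30)]
for train, test in three_fold_splits(motifs):
    # A model would be trained on `train` and scored on `test` here.
    print(len(train), len(test))
```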
23Comparisons of HMM_AA_SS with FFAS03 in Cross-Family Evaluations
Total HTH motifs                           563
Recognized by both FFAS03 and HMM_AA_SS    135
Recognized by FFAS03 only                  24
Recognized by HMM_AA_SS only               71
24Putative HTH motifs in Ureaplasma parvum
Protein                 Location   Annotation from UniProt
sp|Q9PQE5|SCPB_UREPA    176-214    Participates in chromosomal partition during cell division
sp|Q9PQV6|RPOB_UREPA    540-587    DNA-directed RNA polymerase
sp|Q9PR27|SYY_UREPA     340-380    Tyrosyl-tRNA synthetase
sp|Q9PQC2|SYA_UREPA     217-265    Alanyl-tRNA synthetase
sp|Q9PQ74|DPO3A_UREPA   365-400    DNA polymerase III subunit alpha
sp|Q9PQX7|Y166_UREPA    507-553    Hypothetical protein