Machine Learning Algorithms for Protein Structure Prediction

1
Machine Learning Algorithms for Protein Structure
Prediction
  • Jianlin Cheng
  • Institute for Genomics and Bioinformatics
  • School of Information and Computer Sciences
  • University of California Irvine
  • 2006

2
Outline
  1. Introduction
  2. 1D Prediction
  3. 2D Prediction (Beta-Sheet Topology)
  4. 3D Prediction (Fold Recognition)
  5. Publications and Bioinformatics Tools

3
Importance of Protein Structure Prediction
AGCWY
Cell
Sequence Structure
Function
4
Four Levels of Protein Structure
Primary Structure (a directional sequence of
amino acids/residues)
N
C

Residue1
Residue2
Peptide bond
Secondary Structure (helix, strand, coil)
Alpha Helix
Beta Strand / Sheet
Coil
5
Four Levels of Protein Structure
Quaternary Structure (complex)
Tertiary Structure
G Protein Complex
6
1D Secondary Structure Prediction
MWLKKFGINLLIGQSV
Helix
Neural Networks Alignments
Coil
CCCCHHHHHCCCSSSSS Accuracy: 78%
Strand
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids
Research, 2005
7
1D Solvent Accessibility Prediction
Exposed
MWLKKFGINLLIGQSV
Neural Networks Alignments
eeeeeeebbbbbbbbeeeebbb Accuracy: 79%
Buried
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids
Research, 2005
8
1D Disordered Region Prediction Using Neural
Networks
MWLKKFGINLLIGQSV
Disordered Region
1D-RNN
OOOOODDDDOOOOO 93% TP at 5% FP
Cheng, Sweredoski, Baldi. Data Mining and
Knowledge Discovery, 2005
9
1D Protein Domain Prediction Using Neural
Networks
MWLKKFGINLLIGQSV
Boundary
SS and SA
1D-RNN
NNNNNNNBBBBBNNNN
Inference/Cut
HIV capsid protein
Domain 1
Domain 2
Domains
Top ab-initio domain predictor in CAFASP4
Cheng, Sweredoski, Baldi. Data Mining and
Knowledge Discovery, 2006.
10
1D Predict Single-Site Mutation From Sequence
Using Support Vector Machine
Correlation 0.76
Support Vector Machine
MWLAVFILINLK
  • First method to predict energy changes from
    sequence accurately
  • Useful for protein engineering, protein design,
    and mutagenesis analysis

Cheng, Randall, and Baldi. Proteins, 2006
11
2D Contact Map Prediction
2D Contact Map
3D Structure
1 2 ....j...
..n
1 2 3 . . . . i . . . . . . . n
Distance threshold: 8 Å
Cheng, Randall, Sweredoski, Baldi. Nucleic Acids
Research, 2005
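A minimal sketch of how a contact map can be derived from a known 3D structure using the 8 Å threshold named on this slide. The function name `contact_map` and the choice of one representative atom per residue (for example C-alpha) are assumptions for illustration; the slide does not say which atoms are used.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return an n x n 0/1 matrix: 1 where two residues lie closer than `threshold` angstroms.

    `coords` holds one (x, y, z) point per residue (assumed here to be one
    representative atom per residue, which is an illustrative choice).
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]

# Toy coordinates for three residues on a line (not a real protein).
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.0, 0.0, 0.0)]
cmap = contact_map(toy)
```

The matrix is symmetric, and residues 1 and 3 in the toy example are not in contact because they are 10 Å apart.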
12
2D Disulfide Bond Prediction
Cysteine i
Support Vector Machine
yes
2D-RNN
Disulfide Bond
Graph Matching
Cysteine j
[1] Baldi, Cheng, Vullo. NIPS, 2004. [2] Cheng,
Saigo, Baldi. Proteins, 2005
13
2D Prediction of Beta-Sheet Topology
N terminus
  • Ab-Initio Structure Prediction
  • Fold Recognition
  • Protein Design
  • Protein Folding

Beta Sheet
Beta Strand
Cheng and Baldi, Bioinformatics, 2005
C terminus
Beta Residue Pair
14
An Example of Beta-Sheet Topology
Level 1
4 5
2 1 3 6 7
Structure of Protein 1VJG
Beta Sheets
15
An Example of Beta-Sheet Topology
Level 1
Level 2
4 5
Antiparallel
2 1 3 6 7
Parallel
Strand Strand Pair Strand Alignment Pairing
Direction
Structure of Protein 1VJG
Beta Sheets
16
An Example of Beta-Sheet Topology
Level 1
Level 2
Level 3
4 5
Antiparallel
H-bond
2 1 3 6 7
Parallel
Strand Strand Pair Strand Alignment Pairing
Direction
Structure of Protein 1VJG
Beta Sheets
Beta Residue Residue Pair
17
Three-Stage Prediction of Beta-Sheets
  • Stage 1: Predict beta-residue pairing probabilities
    using 2D-Recursive Neural Networks (2D-RNN, Baldi
    and Pollastri, 2003)
  • Stage 2: Use beta-residue pairing probabilities to
    align beta-strands
  • Stage 3: Predict beta-strand pairs and beta-sheet
    topology using graph algorithms

18
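Stage 1: Prediction of Beta-Residue Pairings
Using 2D-Recursive Neural Networks
Input matrix I (m × m); output / target matrix (m × m)
2D-RNN: O = f(I)
I_ij: input for residue pair (i, j)
O_ij: pairing probability; T_ij: target, 0 or 1
AHYHCKRWQNEDGHTPRKDECLIELMQDAQRMRK.
Inputs per residue: 20 for residues, 3 for SS (secondary structure), 2 for SA (solvent accessibility)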
19
An Example (Target)
Protein 1VJG
Beta-Residue Pairing Map (Target Matrix), strands 1-7 along the axes
20
An Example (Target)
Protein 1VJG
Beta-Residue Pairing Map (Target Matrix), strands 1-7 along the axes,
with antiparallel and parallel pairings marked
21
An Example (Prediction)
22
Stage 2 Beta-Strand Alignment
  • Use output probability matrix as scoring matrix
  • Dynamic programming
  • Disallow gaps and use the simplified search
    algorithm

Antiparallel: one strand runs 1..m, the other n..1
Parallel: one strand runs 1..m, the other 1..n
Total number of alignments: 2(m + n - 1)
23
Strand Alignment and Pairing Matrix
  • The alignment score is the sum of the pairing
    probabilities of the aligned residues
  • The best alignment is the alignment with the
    maximum score
  • Strand Pairing Matrix

Strand Pairing Matrix of 1VJG
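The ungapped strand alignment described on the two slides above can be sketched as follows. This is an illustration under stated assumptions, not the BETApro implementation: the function names, the representation of a strand as a list of residue indices, and the toy probability matrix are all hypothetical. It enumerates every shift in both directions, which gives 2(m + n - 1) candidates, and scores each as the sum of the pairing probabilities of the aligned residues.

```python
def ungapped_alignments(strand_a, strand_b):
    """Yield (direction, aligned residue pairs) for every gap-free alignment."""
    m, n = len(strand_a), len(strand_b)
    for direction, b in (("parallel", strand_b), ("antiparallel", strand_b[::-1])):
        # shift slides strand b along strand a; m + n - 1 shifts overlap by >= 1 residue
        for shift in range(-(n - 1), m):
            pairs = [(strand_a[i], b[i - shift]) for i in range(m) if 0 <= i - shift < n]
            yield direction, pairs

def best_alignment(prob, strand_a, strand_b):
    """Return (score, direction, pairs) for the alignment with the maximum score."""
    best = None
    for direction, pairs in ungapped_alignments(strand_a, strand_b):
        score = sum(prob[i][j] for i, j in pairs)
        if best is None or score > best[0]:
            best = (score, direction, pairs)
    return best

# Toy 10-residue pairing-probability matrix (made-up values).
prob = [[0.0] * 10 for _ in range(10)]
for i, j in [(0, 9), (1, 8), (2, 7)]:
    prob[i][j] = 0.9

strand_a, strand_b = [0, 1, 2], [7, 8, 9]
score, direction, pairs = best_alignment(prob, strand_a, strand_b)
n_candidates = len(list(ungapped_alignments(strand_a, strand_b)))
```

The best score for each pair of strands is what would fill one cell of a strand pairing matrix.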
24
Stage 3 Prediction of Beta-Strand Pairings and
Beta-Sheet Topology
(a) Seven strands of protein 1VJG in sequence
order
(b) Beta-sheet topology of protein 1VJG
25
Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
(a) Complete SPG
Strand Pairing Matrix
26
Minimum Spanning Tree Like Algorithm
Strand Pairing Graph (SPG)
(b) True Weighted SPG
(a) Complete SPG
Strand Pairing Matrix
Goal: Find a set of connected subgraphs that
maximize the sum of the alignment scores
and satisfy the constraints
Algorithm: Minimum Spanning Tree Like Algorithm
27
An Example of MST Like Algorithm
Step 1: Pair strands 4 and 5
Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1-7)
     1    2    3    4    5    6    7
1    0
2   1.3   0
3   .94  .37   0
4   .02  .02  .04   0
5   .02  .02  .03  1.9   0
6   .10  .05  .74  .04  .04   0
7   .02  .02  .03  .02  .02  .20   0
Pairs so far: 4-5
28
An Example of MST Like Algorithm
Step 2: Pair strands 1 and 2
Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1-7)
     1    2    3    4    5    6    7
1    0
2   1.3   0
3   .94  .37   0
4   .02  .02  .04   0
5   .02  .02  .03  1.9   0
6   .10  .05  .74  .04  .04   0
7   .02  .02  .03  .02  .02  .20   0
Pairs so far: 4-5, 2-1
29
An Example of MST Like Algorithm
Step 3: Pair strands 1 and 3
Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1-7)
     1    2    3    4    5    6    7
1    0
2   1.3   0
3   .94  .37   0
4   .02  .02  .04   0
5   .02  .02  .03  1.9   0
6   .10  .05  .74  .04  .04   0
7   .02  .02  .03  .02  .02  .20   0
Pairs so far: 4-5, 2-1, 1-3
30
An Example of MST Like Algorithm
Step 4: Pair strands 3 and 6
Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1-7)
     1    2    3    4    5    6    7
1    0
2   1.3   0
3   .94  .37   0
4   .02  .02  .04   0
5   .02  .02  .03  1.9   0
6   .10  .05  .74  .04  .04   0
7   .02  .02  .03  .02  .02  .20   0
Pairs so far: 4-5, 2-1, 1-3, 3-6
31
An Example of MST Like Algorithm
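Step 5: Pair strands 6 and 7
Strand Pairing Matrix of 1VJG (lower triangle; rows and columns are strands 1-7)
     1    2    3    4    5    6    7
1    0
2   1.3   0
3   .94  .37   0
4   .02  .02  .04   0
5   .02  .02  .03  1.9   0
6   .10  .05  .74  .04  .04   0
7   .02  .02  .03  .02  .02  .20   0
Pairs so far: 4-5, 2-1, 1-3, 3-6, 6-7
Resulting strand order: 4 5 and 2 1 3 6 7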
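The five steps above can be reproduced with a small greedy sketch over the 1VJG strand pairing matrix. The constraints used here are assumptions chosen for illustration, since the slides only say "the constraints": each strand takes at most two partners, no cycles are allowed, and selection stops once every strand has a partner. The function name `mst_like_pairing` is hypothetical.

```python
# Lower triangle of the strand pairing matrix of 1VJG, as shown on the slides.
LOWER = [
    [],
    [1.3],
    [.94, .37],
    [.02, .02, .04],
    [.02, .02, .03, 1.9],
    [.10, .05, .74, .04, .04],
    [.02, .02, .03, .02, .02, .20],
]
SCORES = {(j + 1, i + 1): LOWER[i][j] for i in range(7) for j in range(i)}

def mst_like_pairing(scores, max_partners=2):
    """Greedily accept the highest-scoring strand pairs that satisfy the assumed constraints."""
    strands = sorted({s for pair in scores for s in pair})
    parent = {s: s for s in strands}          # union-find forest for cycle detection
    degree = {s: 0 for s in strands}          # number of partners per strand

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    chosen = []
    for (a, b), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if all(degree[s] > 0 for s in strands):
            break                             # assumed stopping rule: every strand is paired
        if degree[a] >= max_partners or degree[b] >= max_partners:
            continue                          # assumed limit on partners per strand
        if find(a) == find(b):
            continue                          # would close a cycle
        parent[find(a)] = find(b)
        degree[a] += 1
        degree[b] += 1
        chosen.append((a, b))
    return chosen

pairs = mst_like_pairing(SCORES)
```

Under these assumptions the pairs come out in the same order as Steps 1 to 5, and the candidate pair 2-3 (score .37) is skipped because strand 3 already has two partners.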
32
1. Beta Residue Pairing
Method                                Specificity/Sensitivity (%)   Ratio of Improvement
BetaPairing                           41                            17.8
CMAPpro (Pollastri and Baldi, 2002)   27                            11.7

2. Beta Strand Alignment
Method                                            Alignment Accuracy (%)   Pairing Direction (%)
BetaPairing                                       66                       84
Statistical Potential (Hubbard, 1994)             40                       X
Pseudo-energy (Zhu and Braun, 1999)               35                       X
Information Theory (Steward and Thornton, 2002)   37                       X

3. Beta Strand Pairing
Method     Specificity (%)   Sensitivity (%)   % of non-local pairs
MST Like   53                59                20
33
3D Structure Prediction
MWLKKFGINLLIGQSV
  • Ab-Initio Structure Prediction
    Simulation: physical force field, protein folding; contact map reconstruction
    Select structure with minimum free energy
  • Template-Based Structure Prediction
    Query protein (MWLKKFGINKH): fold recognition, then alignment to a
    template from the Protein Data Bank
34
A Machine Learning Information Retrieval
Framework for Fold Recognition
Fold Recognition
Cheng and Baldi, Bioinformatics, 2006
Query Protein
Alignment
MWLKKFGIN
Template
Protein Data Bank
Machine Learning Ranking
35
Classic Fold Recognition Approaches
Sequence - Sequence Alignment (Needleman and
Wunsch, 1970. Smith and Waterman, 1981)
Query
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
Template
ITAKPQWLKTSE------------SVTFLSFLLPQTQGLYHL
Alignment (similarity) score
Works for >40% sequence identity (close homologs
in a protein family)
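For reference, a minimal sketch of the global alignment score in the style of Needleman and Wunsch (1970). The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for the example; practical tools use substitution matrices such as BLOSUM and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

identical = needleman_wunsch_score("MWLKKFGIN", "MWLKKFGIN")
one_gap = needleman_wunsch_score("ABC", "AC")
```

Only one row of the dynamic programming table is kept at a time, which is enough when the score, and not the alignment itself, is needed.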
36
Classic Fold Recognition Approaches
Profile - Sequence Alignment (Altschul et al.,
1997)
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Average Score
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
More sensitive for distant homologs in a
superfamily (>25% identity)
37
Classic Fold Recognition Approaches
Profile - Sequence Alignment (Altschul et al.,
1997)
Profile of the query family (columns: positions 1, 2, ..., n; rows: amino acids)
     1    2   ...  n
A   0.4
C   0.1
...
W   0.5
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Position-Specific Scoring Matrix or Hidden Markov
Model
Template:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
More sensitive for distant homologs in a
superfamily (>25% identity)
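A minimal sketch of how a position-specific profile can be built from the aligned query family shown on this slide. It computes plain column frequencies only; real position-specific scoring matrices (as in PSI-BLAST) use log-odds scores with pseudocounts and sequence weighting, which are omitted here. The function names and the simple averaging score are assumptions for illustration.

```python
from collections import Counter

FAMILY = [
    "ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL",
    "ITAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL",
    "ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL",
]

def column_frequencies(aligned):
    """Return one {amino acid: frequency} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({aa: c / len(aligned) for aa, c in counts.items()})
    return profile

def profile_score(profile, sequence):
    """Average profile frequency of the residues in an ungapped, same-frame sequence."""
    total = sum(column.get(aa, 0.0) for column, aa in zip(profile, sequence))
    return total / len(profile)

profile = column_frequencies(FAMILY)
member_score = profile_score(profile, FAMILY[0])
unrelated_score = profile_score(profile, "W" * 37)
```

Position 6 of the family, for example, holds A, E, A, Q, so its column is A 0.5, E 0.25, Q 0.25.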
38
Classic Fold Recognition Approaches
Profile - Profile Alignment (Rychlewski et al.,
2000)
Profile of the query family (columns: positions 1, 2, ..., n; rows: amino acids)
     1    2   ...  n
A   0.1
C   0.4
...
W   0.5
Query Family:
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPEKTPTSPREQAIGLSVTFLEFLLPAGWVLYHL
ILAKPAKTPTSPKEEAIGLSVTFLSFLLPAGWVLYHL
ITAKPQKTPTSLKEQAIGLSVTFLSFLLPAGWALYHL
Profile of the template family (columns: positions 1, 2, ..., m; rows: amino acids)
     1    2   ...  m
A   0.3
C   0.5
...
W   0.2
Template Family:
ITAKPQWLKTSERSTEWQSVTFLSFLLPQTQGLYHN
IPARPQWLKTSKRSTEWQSVTFLSFLLPYTQGLYHN
IGAKPQWLWTSERSTEWHSVTFLSFLLPQTQGLYHM
More sensitive for very distant homologs (>15%
identity)
39
Classic Fold Recognition Approaches
Sequence - Structure Alignment (Threading) (Bowie
et al., 1991. Jones et al., 1992. Godzik,
Skolnick, 1992. Lathrop, 1994)
Fit
Query
Fitness Score
MWLKKFGINLLIGQS.
Template Structure
Useful for recognizing similar folds without
sequence similarity. (no evolutionary
relationship)
40
Integration of Complementary Approaches
FR Server1
Query
Meta Server
FR server2
Consensus
(Lundstrom et al.,2001. Fischer, 2003)
FR server3
Internet
  1. Reliability depends on availability of external
    servers
  2. Makes decisions on a handful of candidates

41
Machine Learning Classification Approach
Class 1
Support Vector Machine (SVM)
Class 2
Proteins
Class m
Classify individual proteins into several or dozens
of structure classes (Jaakkola et al., 2000;
Leslie et al., 2002; Saigo et al., 2004)
Problem 1: can't scale up to thousands of protein
classes
Problem 2: doesn't provide templates for
structure modeling
42
Machine Learning Information Retrieval Framework
Query-Template Pairs
Relevance Function (e.g., SVM)
Score 1, Score 2, ..., Score n
Rank
  • Extract pairwise features
  • Comparison of two pairs (four proteins)
  • Relevant or not (one score) vs. many classes
  • Ranking of templates (retrieval)
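The retrieval idea in the bullets above can be sketched as a ranking loop: extract features for each query-template pair, map them to one relevance score, and sort templates by that score. Everything concrete here is a hypothetical stand-in: `identity_features` and `toy_relevance` are placeholders for the real pairwise features and the trained relevance function, and the template names are made up.

```python
def identity_features(query, template):
    """Placeholder pairwise features: fraction of identical positions and length ratio."""
    overlap = min(len(query), len(template))
    same = sum(1 for a, b in zip(query, template) if a == b)
    return [same / overlap, overlap / max(len(query), len(template))]

def toy_relevance(features):
    """Placeholder relevance function; a trained model (e.g., an SVM) would go here."""
    return sum(features)

def rank_templates(query, templates, extract=identity_features, relevance=toy_relevance):
    """Return template names sorted from most to least relevant for the query."""
    scored = [(relevance(extract(query, seq)), name) for name, seq in templates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

templates = {"t1": "MWLKKFGIN", "t2": "MWLAVFILI", "t3": "AAAAAAAAA"}
ranking = rank_templates("MWLKKFGIN", templates)
```

Because the score is a single number per pair, the same function ranks any number of templates, which is the contrast with the many-class formulation on the previous slide.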

43
Pairwise Feature Extraction
  • Sequence / Family Information Features
  • Cosine, correlation, and Gaussian kernel
  • Sequence Sequence Alignment Features
  • Palign, ClustalW
  • Sequence Profile Alignment Features
  • PSI-BLAST, IMPALA, HMMer, RPS-BLAST
  • Profile Profile Alignment Features
  • ClustalW, HHSearch, Lobster, Compass, PRC-HMM
  • Structural Features
  • Secondary structure, solvent accessibility,
    contact map, beta-sheet topology
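As an illustration of the first feature group, the sketch below computes cosine, correlation, and Gaussian kernel values between two vectors. Applying them to amino acid composition vectors is an assumption made for this example; the slide names the three measures but not the vectors they are applied to. The helper names and the `sigma` parameter are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 amino acids in the sequence."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def correlation(u, v):
    """Pearson correlation: cosine of the mean-centered vectors (undefined for constant vectors)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def gaussian_kernel(u, v, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def pairwise_features(query, template):
    u, v = composition(query), composition(template)
    return [cosine(u, v), correlation(u, v), gaussian_kernel(u, v)]

same = pairwise_features("MWLKKFGINLLIGQSV", "MWLKKFGINLLIGQSV")
different = pairwise_features("MWLKKFGINLLIGQSV", "MWLAVFILINLK")
```

Each query-template pair yields one fixed-length feature vector, regardless of the lengths of the two sequences.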

44
Pairwise Feature Extraction
45
Relevance Function: Support Vector Machine
Learning
Feature Space
Positive Pairs (Same Folds)
Support Vector Machine
Negative Pairs (Different Folds)
Training/Learning
Hyperplane
Training Data Set
46
Relevance Function: Support Vector Machine
Learning
Decision function f(x), with a margin on each side of the hyperplane
K is a Gaussian kernel
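The equations on this slide are not preserved in the transcript. As a reference point, the standard kernel SVM decision function is f(x) = sum_i alpha_i y_i K(x, x_i) + b, and a minimal sketch of it with a Gaussian kernel follows. The support vectors, coefficients, and `gamma` value below are made-up toy numbers, not values from the talk.

```python
import math

def gaussian_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_function(x, support_vectors, alphas, labels, bias, gamma=0.5):
    """f(x) = sum_i alpha_i * y_i * K(x, x_i) + b; the sign gives the predicted class."""
    return sum(
        alpha * y * gaussian_kernel(x, sv, gamma)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    ) + bias

# Toy model: one positive and one negative support vector.
svs = [(0.0, 0.0), (2.0, 0.0)]
alphas = [1.0, 1.0]
labels = [+1, -1]

near_positive = decision_function((0.0, 0.0), svs, alphas, labels, bias=0.0)
near_negative = decision_function((2.0, 0.0), svs, alphas, labels, bias=0.0)
midpoint = decision_function((1.0, 0.0), svs, alphas, labels, bias=0.0)
```

In a retrieval setting the raw value of f(x), and not only its sign, can serve as the relevance score used for ranking.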
47
Training and Cross-Validation
  • Standard benchmark (Lindahl's dataset, 976
    proteins)
  • 976 x 975 query-template pairs (about 7,468
    positives)

Queries 1, 2, 3, ..., 976; each query has 975 query-template pairs
Train / Learn (90%): queries 1-878
Test (10%): queries 879-976
Rank 975 templates for each query
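A small sketch of the split described above, assuming a ten-fold scheme in which queries are divided into contiguous blocks and all 975 pairs of a query stay in the same fold. The function names are hypothetical; the last fold reproduces the 1-878 / 879-976 split shown on the slide.

```python
def query_folds(n_queries=976, n_folds=10):
    """Yield (train_queries, test_queries) with contiguous, near-equal test blocks."""
    # Rounded block boundaries, computed with integer arithmetic.
    bounds = [(k * n_queries + n_folds // 2) // n_folds for k in range(n_folds + 1)]
    queries = list(range(1, n_queries + 1))
    for k in range(n_folds):
        test = queries[bounds[k]:bounds[k + 1]]
        train = queries[:bounds[k]] + queries[bounds[k + 1]:]
        yield train, test

def pairs_for(query, n_proteins=976):
    """All query-template pairs for one query: every other protein is a template."""
    return [(query, t) for t in range(1, n_proteins + 1) if t != query]

folds = list(query_folds())
train, test = folds[-1]
total_pairs = 976 * len(pairs_for(1))
```

Splitting by query and not by pair keeps every pair of a test query out of training.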
48
Results for Top Five Ranked Templates
Method      Family (%)   Superfamily (%)   Fold (%)
PSI-BLAST   72.3         27.9              4.7
HMMER       73.5         31.3              14.6
SAM-T98     75.4         38.9              18.7
BLASTLINK   78.9         4.06              16.5
SSEARCH     75.5         32.5              15.6
SSHMM       71.7         31.6              24
THREADER    58.9         24.7              37.7
FUGUE       85.8         53.2              26.8
RAPTOR      77.8         50                45.1
SPARKS3     86.8         67.7              47.4
FOLDpro     89.9         70.0              48.3
  • Family: close homologs, more identity
  • Superfamily: distant homologs, less identity
  • Fold: no evolutionary relation, no identity

49
Specificity-Sensitivity Plot (Family)
50
Specificity-Sensitivity Plot (Superfamily)
51
Specificity-Sensitivity Plot (Fold)
52
Advantages of MLIR Framework
  • Integration
  • Accuracy
  • Extensibility
  • Simplicity
  • Reliability
  • Completeness
  • Potentials

Disadvantages
Slower than some alignment methods
53
A CASP7 Example: T0290
Query sequence (173 residues):
RPRCFFDIAINNQPAGRVVFELFSDVCPKTCENFRCLCTGEKGTGKSTQKPLHYKSCLFHRVVKDFMVQGGDFSEGNGRGGESIYGGFFEDESFAVKHNAAFLLSMANRGKDTNGSQFFITKPTPHLDGHHVVFGQVISGQEVVREIENQKTDAASKPFAEVRILSCGELIP
FOLDpro
Predicted Structure
Compared with the experimental structure: RMSD 1 Å
54
Publications and Bioinformatics Tools
1. P. Baldi, J. Cheng, and A. Vullo. Large-Scale Prediction of Disulphide Bond Connectivity. NIPS 2004. Tool: DIpro 1.0
2. J. Cheng, H. Saigo, and P. Baldi. Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching. Proteins, 2006. Tool: DIpro 2.0
3. J. Cheng and P. Baldi. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics, 2005. Tool: BETApro
4. J. Cheng, A. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, 2005. Tools: SSpro 4 / ACCpro 4 / CMAPpro 2
5. J. Cheng, M. Sweredoski, and P. Baldi. Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data. Data Mining and Knowledge Discovery, 2005. Tool: DISpro
55
Publications and Bioinformatics Tools
6. J. Cheng, L. Scharenbroich, P. Baldi, and E. Mjolsness. Sigmoid: Towards a Generative, Scalable, Software Infrastructure for Pathway Bioinformatics and Systems Biology. IEEE Intelligent Systems, 2005. Tool: Sigmoid
7. J. Cheng, A. Randall, and P. Baldi. Prediction of Protein Stability Changes for Single Site Mutations Using Support Vector Machines. Proteins, 2006. Tool: MUpro
8. S. A. Danziger, S. J. Swamidass, J. Zeng, L. R. Dearth, Q. Lu, J. H. Chen, J. Cheng, V. P. Hoang, H. Saigo, R. Luo, P. Baldi, R. K. Brachmann, and R. H. Lathrop. Functional Census of Mutation Sequence Spaces: The Example of p53 Cancer Rescue Mutants. IEEE Transactions on Computational Biology and Bioinformatics, 2006.
9. J. Cheng, M. Sweredoski, and P. Baldi. DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks. Data Mining and Knowledge Discovery, 2006. Tool: DOMpro
10. J. Cheng and P. Baldi. A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics, 2006. Tool: FOLDpro
56
Acknowledgements
  • Pierre Baldi
  • G. Wesley Hatfield, Eric Mjolsness, Hal Stern,
    Dennis Decoste, Suzanne Sandmeyer, Richard
    Lathrop, Gianluca Pollastri, Chin-Rang Yang
  • Mike Sweredoski, Arlo Randall, Liza Larsen, Sam
    Danziger, Trent Su, Hiroto Saigo, Alessandro
    Vullo, Lucas Scharenbroich

57
(No Transcript)
58
Markov Models
59
(No Transcript)
60
(No Transcript)
61
1D-Recursive Neural Network
62
2D-Recursive Neural Network
63
(No Transcript)
64
2D-RNNs
65
2D RNNs