Title: Protein Tertiary and Quaternary Fold Recognition: A ML Approach
1. Protein Tertiary and Quaternary Fold Recognition: A ML Approach
- Jaime Carbonell
- Joint work with Yan Liu (IBM), Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT)
- Language Technologies Institute
- Carnegie Mellon University
- Machine Learning Lunch, 11-April-2007
2. Snapshot of Cell Biology
3. PROTEINS: Sequence → Structure → Function (borrowed from Judith Klein-Seetharaman)
Primary Sequence
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML
AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA
VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG
GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT
WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN
ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES
ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG
SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT
LCCGKNPLGD DEASTTVSKT ETSQVAPA
Folding
3D Structure
Complex function within network of proteins
Normal
5. Example Protein Structures
- Triple beta-spiral fold in the adenovirus fiber shaft
- Virus capsid
6. Predicting Protein Structures
- Protein structure is a key determinant of protein function
- Resolving protein structures experimentally in vitro by crystallography is very expensive, and NMR can only resolve very small proteins
- The gap between known protein sequences and structures
  - 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
- Therefore we need to predict structures in silico
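The 1.2% figure follows directly from the two counts on the slide. A quick check, with variable names chosen here for illustration:

```python
# Counts quoted on the slide (2007 figures).
known_sequences = 3_023_461
resolved_structures = 36_247

# Share of known sequences that have an experimentally resolved structure.
coverage = resolved_structures / known_sequences
coverage_percent = round(100 * coverage, 1)

print(f"{coverage_percent}% of known sequences have a resolved structure")
```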
7. Quaternary Folds and Alignments
- Protein fold
  - Identifiable regular arrangement of secondary structural elements
  - Thus far, only a limited number of protein folds have been discovered (about 1000)
- Very little research on quaternary folds
  - Complex structures and few labeled data
- Quaternary fold recognition
Biology task | AI task
Protein fold | Pattern to be induced
Membership and non-membership proteins | Training data (seq-struc pairs, physics)
Will the protein take the fold? | Does the pattern appear in the testing sequence?
8. Previous Work
- Sequence similarity perspective
  - Sequence similarity searches, e.g. PSI-BLAST (Altschul et al., 1997)
  - Profile HMMs, e.g. HMMER (Durbin et al., 1998) and SAM (Karplus et al., 1998)
  - Window-based methods, e.g. PSIPRED (Jones, 2001)
  - Limitation: fail to capture structural properties and long-range dependencies
- Physical forces perspective
  - Homology modeling or threading, e.g. Threader (Jones, 1998)
  - Limitation: generative models based on a rough approximation of free energy; perform very poorly on complex structures
- Structural biology perspective
  - Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix (Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001)
  - Limitation: very hard to generalize due to built-in constants and fixed features
9. Conditional Random Fields
- Hidden Markov model (HMM) (Rabiner, 1989)
- Conditional random fields (CRFs) (Lafferty et al., 2001)
  - Model the conditional probability directly (discriminative models, directly optimizable)
  - Allow arbitrary dependencies in the observation
  - Adaptive to different loss functions and regularizers
  - Promising results in multiple applications
  - But need to scale up (computationally) and extend to long-distance dependencies
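To make "model the conditional probability directly" concrete, here is a minimal sketch of a standard linear-chain CRF, the baseline that the talk goes on to extend. The label set, the weight tables and the function names are assumptions made for illustration and are not taken from the talk.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(y, x, node_w, edge_w):
    """Unnormalized log-score of labeling y: node terms plus transition terms."""
    score = sum(node_w.get((lab, obs), 0.0) for lab, obs in zip(y, x))
    score += sum(edge_w.get((a, b), 0.0) for a, b in zip(y, y[1:]))
    return score

def log_partition(x, labels, node_w, edge_w):
    """log Z(x) computed with the forward recursion in log space."""
    alpha = {lab: node_w.get((lab, x[0]), 0.0) for lab in labels}
    for obs in x[1:]:
        alpha = {
            lab: node_w.get((lab, obs), 0.0)
            + logsumexp([alpha[prev] + edge_w.get((prev, lab), 0.0) for prev in labels])
            for lab in labels
        }
    return logsumexp(list(alpha.values()))

def conditional_prob(y, x, labels, node_w, edge_w):
    """P(y | x) = exp(score(y, x)) / Z(x)."""
    return math.exp(
        sequence_score(y, x, node_w, edge_w) - log_partition(x, labels, node_w, edge_w)
    )

# Toy example: H = helix, E = strand, C = coil. All weights are made up.
LABELS = ("H", "E", "C")
NODE_W = {("H", "A"): 1.2, ("E", "V"): 0.9, ("C", "G"): 1.5, ("H", "L"): 0.7}
EDGE_W = {("H", "H"): 0.8, ("E", "E"): 0.8, ("C", "C"): 0.3, ("H", "E"): -1.0}
X = "AVLG"

# Summing P(y | x) over every possible labeling should give 1.
total = sum(
    conditional_prob(y, X, LABELS, NODE_W, EDGE_W)
    for y in itertools.product(LABELS, repeat=len(X))
)
print(round(total, 6))
```

The forward recursion only works because each label depends on its immediate neighbor; the long-distance dependencies mentioned in the last bullet break that chain structure, which is why exact inference stops scaling.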
10. Our Solution: Conditional Graphical Models
Long-range dependency
Local dependency
- Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
- Feature definition
  - Node feature
  - Local interaction feature
  - Long-range interaction feature
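One way to picture the segmentation output and the three feature types is sketched below. It assumes that M is the number of segments and that p_i, q_i and s_i are the start position, end position and state of segment i; the slide text does not spell this out. The state labels and the three example features are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    start: int  # p_i: first residue index of the segment (0-based, inclusive); assumed meaning
    end: int    # q_i: last residue index of the segment (inclusive); assumed meaning
    state: str  # s_i: structural state label of the segment; assumed meaning

def is_valid_segmentation(segments: List[Segment], seq_len: int) -> bool:
    """True if the segments tile residues 0..seq_len-1 with no gaps or overlaps."""
    if not segments or segments[0].start != 0 or segments[-1].end != seq_len - 1:
        return False
    if any(seg.end < seg.start for seg in segments):
        return False
    return all(a.end + 1 == b.start for a, b in zip(segments, segments[1:]))

def node_feature(x: str, seg: Segment) -> float:
    """Example node feature: depends on a single segment (here, its length)."""
    return float(seg.end - seg.start + 1)

def local_feature(x: str, left: Segment, right: Segment) -> float:
    """Example local interaction feature: 1 if adjacent segments differ in state."""
    return 1.0 if left.state != right.state else 0.0

def long_range_feature(x: str, a: Segment, b: Segment) -> float:
    """Example long-range feature: 1 if two non-adjacent segments have equal length."""
    return 1.0 if (a.end - a.start) == (b.end - b.start) else 0.0

# First ten residues of the sequence shown earlier; "B"/"N" are hypothetical states.
x = "MNGTEGPNFY"
W = [Segment(0, 2, "N"), Segment(3, 6, "B"), Segment(7, 9, "N")]
M = len(W)
print(M, is_valid_segmentation(W, len(x)))
```

Because M is part of the output, two candidate labelings of the same sequence can have different numbers of nodes, which matters later for inference.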
11. Linked Segmentation CRF
- Nodes: secondary structure elements and/or simple folds
- Edges: local interactions and long-range inter-chain and intra-chain interactions
- L-SCRF defines the conditional probability of y given x over this graph of nodes and edges
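The slide's formula is not reproduced in this text. As a sketch only, a log-linear form consistent with the node and edge description above would look like the following, where V and E are the node and edge sets of the graph, f_k and g_l are node and edge feature functions, and the weights θ are the model parameters. All of these symbols are assumptions introduced here, not the authors' notation.

```latex
S(\mathbf{x}, \mathbf{y}) =
    \sum_{v \in V} \sum_{k} \theta^{(1)}_{k}\, f_{k}(\mathbf{x}, y_{v})
  + \sum_{(u,v) \in E} \sum_{l} \theta^{(2)}_{l}\, g_{l}(\mathbf{x}, y_{u}, y_{v}),
\qquad
P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp S(\mathbf{x}, \mathbf{y})}{Z(\mathbf{x})},
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp S(\mathbf{x}, \mathbf{y}')
```

The sum in Z(x) runs over all segmentations of all chains, including segmentations with different numbers of segments.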
12. Linked Segmentation CRF (II)
- Classification
- Training: learn the model parameters θ
  - Minimize the regularized negative log-likelihood
  - Iterative search algorithms that seek the direction in which the empirical feature values agree with their expectations
- Complex graphs result in huge computational complexity
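The "empirical values agree with the expectation" condition is the usual gradient of a log-linear model: expected feature counts minus observed feature counts, plus the regularizer term. The sketch below shows this on a toy label sequence with brute-force enumeration. The two features, the L2 regularizer, the step size and plain gradient descent are all assumptions for illustration; the talk does not specify the optimizer.

```python
import itertools
import math

LABELS = ("B", "N")  # hypothetical states: in-fold / not-in-fold

def features(y):
    """Toy feature vector: number of 'B' labels and number of 'B'->'B' transitions."""
    return [sum(lab == "B" for lab in y), sum(a == b == "B" for a, b in zip(y, y[1:]))]

def expected_features(theta, length):
    """Model expectation of the features, enumerating every labeling."""
    ys = list(itertools.product(LABELS, repeat=length))
    scores = [sum(t * f for t, f in zip(theta, features(y))) for y in ys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    exp_f = [0.0] * len(theta)
    for y, w in zip(ys, weights):
        for k, f in enumerate(features(y)):
            exp_f[k] += (w / z) * f
    return exp_f

def gradient(theta, observed, reg=0.1):
    """Gradient of the L2-regularized negative log-likelihood for one example."""
    emp = features(observed)
    exp_f = expected_features(theta, len(observed))
    return [e - o + reg * t for e, o, t in zip(exp_f, emp, theta)]

def train(observed, steps=2000, lr=0.2, reg=0.1):
    """Plain gradient descent from theta = 0."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        g = gradient(theta, observed, reg)
        theta = [t - lr * gk for t, gk in zip(theta, g)]
    return theta

observed = ("B", "B", "N", "B")
theta = train(observed)
print("theta:", [round(t, 3) for t in theta])
print("expected:", [round(e, 3) for e in expected_features(theta, len(observed))])
print("empirical:", features(observed))
```

Enumerating every labeling is only feasible for toy inputs like this one; on the complex graphs mentioned in the last bullet the expectation has to be approximated.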
13. Approximate Inference of L-SCRF
- Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so:
- Reversible jump MCMC sampling (Green, 1995; Schmidler et al., 2001) with four types of Metropolis operators
  - State switching
  - Position switching
  - Segment split
  - Segment merge
- Simulated annealing reversible jump MCMC (Andrieu et al., 2000)
  - Replace the sample with RJ MCMC
  - Theoretically converges on the global optimum
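The four operators can be sketched as moves on a list of (start, end, state) segments, two of which change the number of segments. What follows is a simplified annealed Metropolis-style search, not a full reversible-jump sampler: the proposal-ratio terms that a correct RJ MCMC acceptance rule requires are omitted, and the state set, cooling schedule and toy scoring function are assumptions for illustration.

```python
import math
import random

STATES = ("B", "N")  # hypothetical segment states

def state_switch(segs, rng):
    """Change the state of one segment."""
    i = rng.randrange(len(segs))
    s, e, st = segs[i]
    new_state = rng.choice([x for x in STATES if x != st])
    return segs[:i] + [(s, e, new_state)] + segs[i + 1:]

def position_switch(segs, rng):
    """Move the boundary between two adjacent segments by one residue."""
    if len(segs) < 2:
        return None
    i = rng.randrange(len(segs) - 1)
    (s1, e1, st1), (_, e2, st2) = segs[i], segs[i + 1]
    new_e1 = e1 + rng.choice((-1, 1))
    if new_e1 < s1 or new_e1 >= e2:  # would leave one segment empty
        return None
    return segs[:i] + [(s1, new_e1, st1), (new_e1 + 1, e2, st2)] + segs[i + 2:]

def segment_split(segs, rng):
    """Split one segment in two; the right part gets a random state."""
    candidates = [i for i, (s, e, _) in enumerate(segs) if e > s]
    if not candidates:
        return None
    i = rng.choice(candidates)
    s, e, st = segs[i]
    cut = rng.randrange(s, e)  # left part ends at cut, with s <= cut < e
    return segs[:i] + [(s, cut, st), (cut + 1, e, rng.choice(STATES))] + segs[i + 1:]

def segment_merge(segs, rng):
    """Merge two adjacent segments, keeping the left segment's state."""
    if len(segs) < 2:
        return None
    i = rng.randrange(len(segs) - 1)
    (s1, _, st1), (_, e2, _) = segs[i], segs[i + 1]
    return segs[:i] + [(s1, e2, st1)] + segs[i + 2:]

MOVES = (state_switch, position_switch, segment_split, segment_merge)

def anneal(score, n, steps=5000, t0=2.0, seed=0):
    """Annealed search over segmentations of residues 0..n-1."""
    rng = random.Random(seed)
    current = [(0, n - 1, STATES[0])]
    best = current
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-3  # linear cooling, kept above zero
        proposal = rng.choice(MOVES)(current, rng)
        if proposal is None:
            continue
        delta = score(proposal) - score(current)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = proposal
            if score(current) > score(best):
                best = current
    return best

# Toy score: reward agreement with a hidden labeling, penalize extra segments.
TRUTH = "NNNBBBBNNN"

def toy_score(segs):
    matches = sum(TRUTH[i] == st for s, e, st in segs for i in range(s, e + 1))
    return matches - 0.5 * len(segs)

best = anneal(toy_score, len(TRUTH))
print(best, toy_score(best))
```

Split and merge are the moves that let the search change the number of nodes, which is the requirement stated in the first bullet of this slide.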
14. Features for Protein Fold Recognition
15. Tertiary Fold Recognition: β-Helix Fold
- Histogram and ranks for known β-helices against the PDB-minus dataset
- The chain graph model reduces the real running time of the SCRF model by a factor of around 50
16. Fold Alignment Prediction: β-Helix
- Predicted alignment for known β-helices on cross-family validation
17. Discovery of New Potential β-Helices
- Run the structural predictor to seek potential β-helices in the UniProt (structurally unresolved) databases
  - Full list (98 new predictions) can be accessed at www.cs.cmu.edu/yanliu/SCRF.html
- Verification on 3 proteins with later experimentally resolved structures from different organisms
  - 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  - 1PXZ: The Major Allergen From Cedar Pollen
  - GP14 of Shigella bacteriophage as a β-helix protein
- Not a single false positive!
18. Experiments: Target Quaternary Fold
- Triple beta-spirals (van Raaij et al., Nature, 1999)
  - Virus fibers in adenovirus, reovirus and PRD1
- Double barrel trimer (Benson et al., 2004)
  - Coat protein of adenovirus, PRD1, STIV, PBCV
19. Experiment Results: Fold Recognition
Triple beta-spirals
20. Experiment Results: Alignment Prediction
21. Experiment Results: Discovery of New Membership Proteins
- Predicted membership proteins of triple beta-spirals can be accessed at http://www.cs.cmu.edu/yanliu/swissprot_list.xls
- Membership proteins of the double barrel trimer suggested by biologists (Benson, 2005) compared with L-SCRF predictions
22. Concluding Remarks
- Conditional graphical models for protein structure prediction
  - Effective representation of protein structural properties
  - Feasibility of incorporating different kinds of informative features
  - Efficient inference algorithms for large-scale applications
- A major extension compared with previous work
  - Knowledge representation through graphical models
  - Ability to handle long-range interactions within one chain and between chains
- Future work
  - Automatic learning of graph topology
  - Active learning including minority-class discovery