Protein Tertiary and Quaternary Fold Recognition: A ML Approach

1
Protein Tertiary and Quaternary Fold Recognition
A ML Approach
  • Jaime Carbonell
  • Joint work with
  • Yan Liu (IBM), Vanathi Gopalakrishnan (U Pitt),
    Peter Weigele (MIT)
  • Language Technologies Institute
  • Carnegie Mellon University
  • Machine Learning Lunch 11-April-2007

2
Snapshot of Cell Biology
3
(Borrowed from Judith Klein-Seetharaman)
PROTEINS: Sequence → Structure → Function
Primary Sequence
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML
AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA
VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG
GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT
WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN
ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES
ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG
SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT
LCCGKNPLGD DEASTTVSKT ETSQVAPA
Folding
3D Structure
Complex function within network of proteins
Normal
5
Example Protein Structures
Triple beta-spiral fold in Adenovirus Fiber Shaft
Adenovirus Fibre Shaft
Virus Capsid
6
Predicting Protein Structures
  • Protein structure is a key determinant of protein
    function
  • Crystallography to resolve protein structures
    experimentally in vitro is very expensive; NMR
    can only resolve very small proteins
  • There is a large gap between the number of known
    protein sequences and resolved structures
  • 3,023,461 sequences vs. 36,247 resolved
    structures (1.2%)
  • Therefore we need to predict structures in silico
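
The size of the gap can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two counts quoted on the slide; the variable names are illustrative.

```python
# Counts quoted on the slide (2007 figures).
known_sequences = 3_023_461
resolved_structures = 36_247

# Fraction of known sequences that have an experimentally resolved structure.
coverage = resolved_structures / known_sequences
print(f"Resolved fraction: {coverage:.1%}")

# Equivalent view: how many sequences exist per resolved structure.
print(f"Sequences per structure: {known_sequences / resolved_structures:.0f}")
```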

7
Quaternary Folds and Alignments
  • Protein fold
  • Identifiable regular arrangement of secondary
    structural elements
  • Thus far, a limited number of protein folds have
    been discovered (1000)
  • Very little research on quaternary folds
  • Complex structures and little labeled data
  • Quaternary fold recognition

Biology task | Protein fold          | Membership and non-membership proteins   | Will the protein take the fold?
AI task      | Pattern to be induced | Training data (seq-struc pairs, physics) | Does the pattern appear in the testing sequence?
8
Previous Work
  • Sequence similarity perspective
  • Sequence similarity searches, e.g. PSI-BLAST
    [Altschul et al., 1997]
  • Profile HMMs, e.g. HMMER [Durbin et al., 1998] and
    SAM [Karplus et al., 1998]
  • Window-based methods, e.g. PSI_pred [Jones, 2001]
  • Physical forces perspective
  • Homology modeling or threading, e.g. Threader
    [Jones, 1998]
  • Structural biology perspective
  • Painstakingly hand-engineered methods for
    specific structures, e.g. αα- and ββ-hairpins,
    β-turn and β-helix [Efimov, 1991; Wilmot and
    Thornton, 1990; Bradley et al., 2001]

Fail to capture the structure properties and long-range dependencies
Generative models based on rough approximations of free energy perform very poorly on complex structures
Very hard to generalize due to built-in constants and fixed features
9
Conditional Random Fields
  • Hidden Markov model (HMM) [Rabiner, 1989]
  • Conditional random fields (CRFs) [Lafferty et al.,
    2001]
  • Model conditional probability directly
    (discriminative models, directly optimizable)
  • Allow arbitrary dependencies on the observations
  • Adaptive to different loss functions and
    regularizers
  • Promising results in multiple applications
  • But they need to scale up (computationally) and
    extend to long-distance dependencies
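
To make "model conditional probability directly" concrete, here is a minimal sketch of a plain linear-chain CRF, not the models introduced later in the talk. It assumes the feature-weighted scores have already been computed from the whole observation sequence; `node_scores`, `trans_scores`, and the two-label setup are hypothetical illustration values.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(labels, node_scores, trans_scores):
    """Unnormalised log-score of one label sequence."""
    total = sum(node_scores[t][y] for t, y in enumerate(labels))
    total += sum(trans_scores[a][b] for a, b in zip(labels, labels[1:]))
    return total

def log_partition(node_scores, trans_scores):
    """log Z(x) via the forward recursion, O(T * K^2)."""
    n_labels = len(node_scores[0])
    alpha = list(node_scores[0])
    for t in range(1, len(node_scores)):
        alpha = [
            node_scores[t][b]
            + logsumexp([alpha[a] + trans_scores[a][b] for a in range(n_labels)])
            for b in range(n_labels)
        ]
    return logsumexp(alpha)

def conditional_log_prob(labels, node_scores, trans_scores):
    """log P(y | x) = score(y, x) - log Z(x)."""
    return sequence_score(labels, node_scores, trans_scores) - log_partition(
        node_scores, trans_scores
    )

# Hypothetical scores: 3 positions, 2 labels. Each node score may depend on
# any part of the observation x, which is what a CRF permits.
node_scores = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
trans_scores = [[0.6, -0.4], [-0.4, 0.6]]

# Brute-force check: probabilities over all label sequences sum to one.
total = sum(
    math.exp(conditional_log_prob(y, node_scores, trans_scores))
    for y in itertools.product(range(2), repeat=3)
)
print(total)
```

The forward recursion only works because the graph is a chain with adjacent-position edges, which is why long-distance dependencies need a different model and approximate inference.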

10
Our Solution: Conditional Graphical Models
Long-range dependency
Local dependency
  • Outputs: Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
  • Feature definition
  • Node feature
  • Local interaction feature
  • Long-range interaction feature
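
The output notation can be read as a segmentation of the sequence. The sketch below assumes that M is the number of segments and that p_i, q_i, and s_i are the start position, end position, and state of segment i; the class names and the 0-based inclusive indexing are illustrative choices, not taken from the slides.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    start: int  # p_i (assumed: 0-based, inclusive)
    end: int    # q_i (assumed: inclusive)
    state: str  # s_i (assumed: a structural state label)

@dataclass
class Segmentation:
    segments: List[Segment]

    @property
    def num_segments(self) -> int:
        """M, the number of segments."""
        return len(self.segments)

    def is_valid(self, seq_len: int) -> bool:
        """Segments must tile the sequence with no gaps or overlaps."""
        expected_start = 0
        for seg in self.segments:
            if seg.start != expected_start or seg.end < seg.start:
                return False
            expected_start = seg.end + 1
        return expected_start == seq_len

# Hypothetical labelling of a 10-residue sequence.
y = Segmentation([Segment(0, 3, "strand"), Segment(4, 5, "turn"), Segment(6, 9, "strand")])
print(y.num_segments, y.is_valid(10))
```

Because M itself is part of the output, the number of nodes in the graph changes from one candidate labelling to the next.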

11
Linked Segmentation CRF
  • Nodes: secondary structure elements and/or simple
    folds
  • Edges: local interactions and long-range
    inter-chain and intra-chain interactions
  • L-SCRF: the conditional probability of y given x is
    defined as (equation not captured in this
    transcript)
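
A generic log-linear form consistent with the node and edge description can be written as follows. This is an assumed sketch, not the formula from the slide: the symbols θ, f_k, g_l, V (nodes), and E (edges) are introduced here for illustration only.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Bigg(
      \sum_{v \in V} \sum_{k} \theta_{k}\, f_{k}(\mathbf{x}, y_{v})
      + \sum_{(u,v) \in E} \sum_{l} \theta_{l}\, g_{l}(\mathbf{x}, y_{u}, y_{v})
    \Bigg),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp(\cdot)
```

Here the first sum runs over node features and the second over edge features, where E would contain both local and long-range (intra-chain and inter-chain) edges, and Z(x) sums the same exponential over all candidate labellings y'.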

12
Linked Segmentation CRF (II)
  • Classification
  • Training: learn the model parameters
  • Minimizing the regularized negative log loss
  • Iterative search algorithms seek the direction in
    which empirical feature values agree with their
    expectations
  • Complex graphs result in huge computational
    complexity
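
The training rule can be illustrated on a toy log-linear model whose outputs are few enough to enumerate. This is a sketch under stated assumptions: the parameter name `theta`, the L2 regularizer, plain gradient descent, and the three-candidate example are all illustrative and are not the talk's training setup.

```python
import math

def feature_expectations(theta, candidates, features):
    """Model expectation of each feature over an enumerable candidate set."""
    scores = [sum(t * f for t, f in zip(theta, features(y))) for y in candidates]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    expected = [0.0] * len(theta)
    for w, y in zip(weights, candidates):
        for k, f in enumerate(features(y)):
            expected[k] += (w / z) * f
    return expected

def gradient(theta, observed, candidates, features, l2=0.1):
    """Gradient of the L2-regularised negative log-likelihood (one example):
    expected features minus empirical features, plus the regulariser term."""
    empirical = features(observed)
    expected = feature_expectations(theta, candidates, features)
    return [e - o + l2 * t for e, o, t in zip(expected, empirical, theta)]

def train(observed, candidates, features, n_features, steps=500, lr=0.5, l2=0.1):
    """Plain gradient descent on the regularised negative log-likelihood."""
    theta = [0.0] * n_features
    for _ in range(steps):
        g = gradient(theta, observed, candidates, features, l2)
        theta = [t - lr * gk for t, gk in zip(theta, g)]
    return theta

# Hypothetical toy problem: three candidate outputs, two indicator features.
candidates = ["A", "B", "C"]

def features(y):
    return [1.0 if y == "A" else 0.0, 1.0 if y == "B" else 0.0]

theta = train("A", candidates, features, n_features=2)
final_gradient = gradient(theta, "A", candidates, features)
print(theta, final_gradient)
```

At the optimum the gradient vanishes, so the empirical and expected feature values agree up to the regularizer. In a model with long-range edges and a variable number of segments the expectation cannot be enumerated like this, which is where the computational cost arises.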

13
Approximate Inference of L-SCRF
  • Most approximation algorithms cannot handle a
    variable number of nodes in the graph, but we
    need variable graph topologies, so we use
  • Reversible jump MCMC sampling [Green, 1995;
    Schmidler et al., 2001] with four types of
    Metropolis operators
  • State switching
  • Position switching
  • Segment split
  • Segment merge
  • Simulated annealing reversible jump MCMC [Andrieu
    et al., 2000]
  • Replace the sample with RJ MCMC
  • Theoretically converges to the global optimum
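
The four operators can be sketched as moves on a segmentation, combined with an annealed Metropolis acceptance rule. This is a simplified illustration, not a faithful reversible jump sampler: it omits the proposal-ratio terms that RJ MCMC requires for dimension-changing moves. The toy scoring function, the two states, the geometric cooling schedule, and the rule that a merged segment keeps its left state are all assumptions.

```python
import math
import random

STATES = ["A", "B"]  # hypothetical segment states

def propose(bounds, states, rng):
    """Apply one random move; return new (bounds, states), or None if not applicable.
    bounds = [0, b1, ..., n] are segment boundaries; states[i] labels segment i."""
    bounds, states = list(bounds), list(states)
    m = len(states)
    move = rng.choice(["state", "position", "split", "merge"])
    if move == "state":  # state switching
        i = rng.randrange(m)
        states[i] = rng.choice([s for s in STATES if s != states[i]])
    elif move == "position":  # position switching of an internal boundary
        movable = [j for j in range(1, m) if bounds[j + 1] - bounds[j - 1] > 2]
        if not movable:
            return None
        j = rng.choice(movable)
        options = [p for p in range(bounds[j - 1] + 1, bounds[j + 1]) if p != bounds[j]]
        bounds[j] = rng.choice(options)
    elif move == "split":  # segment split
        splittable = [i for i in range(m) if bounds[i + 1] - bounds[i] >= 2]
        if not splittable:
            return None
        i = rng.choice(splittable)
        bounds.insert(i + 1, rng.randrange(bounds[i] + 1, bounds[i + 1]))
        states.insert(i + 1, rng.choice(STATES))
    else:  # segment merge (assumed: merged segment keeps the left state)
        if m < 2:
            return None
        i = rng.randrange(m - 1)
        del bounds[i + 1]
        del states[i + 1]
    return bounds, states

def score(bounds, states, target):
    """Toy objective: per-position matches to a target, minus a segment penalty."""
    labels = "".join(s * (bounds[i + 1] - bounds[i]) for i, s in enumerate(states))
    return sum(a == b for a, b in zip(labels, target)) - 0.5 * len(states)

def anneal(target, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    current = ([0, len(target)], [STATES[0]])
    current_score = score(*current, target)
    best, best_score = current, current_score
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))  # geometric cooling
        candidate = propose(*current, rng)
        if candidate is None:
            continue
        delta = score(*candidate, target) - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

target = "AAABBBBAA"
(best_bounds, best_states), best_score = anneal(target)
print(best_bounds, best_states, best_score)
```

Split and merge change the number of segments, which is the variable-dimension aspect that fixed-topology approximation algorithms cannot handle.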

14
Features for Protein Fold Recognition
15
Tertiary Fold Recognition: β-Helix Fold
  • Histogram and ranks for known β-helices against
    PDB-minus dataset

The chain graph model reduces the real running time
of the SCRF model by a factor of about 50
16
Fold Alignment Prediction: β-Helix
  • Predicted alignment for known β-helices on
    cross-family validation

17
Discovery of New Potential β-Helices
  • Run structural predictor seeking potential
    β-helices from Uniprot (structurally unresolved)
    databases
  • Full list (98 new predictions) can be accessed at
    www.cs.cmu.edu/yanliu/SCRF.html
  • Verification on 3 proteins with later
    experimentally resolved structures from different
    organisms
  • 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  • 1PXZ: The Major Allergen From Cedar Pollen
  • GP14 of Shigella bacteriophage as a β-helix
    protein
  • Not a single false positive!

18
Experiments: Target Quaternary Fold
  • Triple beta-spirals [van Raaij et al., Nature
    1999]
  • Virus fibers in adenovirus, reovirus and PRD1
  • Double barrel trimer [Benson et al., 2004]
  • Coat protein of adenovirus, PRD1, STIV, PBCV

19
Experiment Results: Fold Recognition
  • Double barrel-trimer

Triple beta-spirals
20
Experiment Results: Alignment Prediction
21
Experiment Results: Discovery of New Membership
Proteins
  • Predicted membership proteins of triple
    beta-spirals can be accessed at
  • http://www.cs.cmu.edu/yanliu/swissprot_list.xls
  • Membership proteins of double barrel-trimer
    suggested by biologists [Benson, 2005] compared
    with L-SCRF predictions

22
Concluding Remarks
  • Conditional graphical models for protein
    structure prediction
  • Effective representation for protein structural
    properties
  • Feasibility to incorporate different kinds of
    informative features
  • Efficient inference algorithms for large-scale
    applications
  • A major extension compared with previous work
  • Knowledge representation through graphical models
  • Ability to handle long-range interactions within
    one chain and between chains
  • Future work
  • Automatic learning of graph topology
  • Active learning including minority-class
    discovery