Predicting Protein Function Using Machine-Learned Hierarchical Classifiers - PowerPoint PPT Presentation

Title: Predicting Protein Function Using Machine-Learned Hierarchical Classifiers


1
Predicting Protein Function Using Machine-Learned
Hierarchical Classifiers
  • Roman Eisner
  • Supervisors: Duane Szafron and Paul Lu

2
Outline
  • Introduction
  • Predictors
  • Evaluation in a Hierarchy
  • Local Predictor Design
  • Experimental Results
  • Conclusion

4
Proteins
  • Functional Units in the cell
  • Perform a Variety of Functions
  • e.g. Catalysis of reactions, Structural and
    mechanical roles, transport of other molecules
  • Can take years to study a single protein
  • Any good leads would be helpful!

5
Protein Function Prediction and Protein Function
Determination
  • Prediction
  • An estimate of what function a protein performs
  • Determination
  • Work in a laboratory to observe and discover what
    function a protein performs
  • Prediction complements determination

6
Proteins
  • Chain of amino acids
  • 20 Amino Acids
  • FastA Format

>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI

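
A minimal sketch of reading FastA-format text, assuming each record starts with a header line beginning with `>` and that the remaining lines hold the (possibly wrapped) amino acid sequence. The function name `parse_fasta` is hypothetical and not part of the presentation.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; drop any internal whitespace.
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLK
IEGVYARDETEFYLGKR
"""
records = parse_fasta(example)
```

Each record keeps the header text after `>` and the sequence joined into one string, which is the form a sequence-based predictor would take as input.
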
7
Ontologies
  • Standardized Vocabularies (Common Language)
  • In biological literature, different terms can be
    used to describe the same function
  • e.g. peroxiredoxin activity and thioredoxin peroxidase activity
  • Can be structured in a hierarchy to show
    relationships

8
Gene Ontology
  • Directed Acyclic Graph (DAG)
  • Always changing
  • Describes 3 aspects of protein annotations
  • Molecular Function
  • Biological Process
  • Cellular Component

10
Hierarchical Ontologies
  • Can help to represent a large number of classes
  • Represent General and Specific data
  • Some data is incomplete and could become more
    specific in the future

11
Incomplete Annotations
12
Goal
  • To predict the function of proteins given their
    sequence

13
Data Set
  • Protein Sequences
  • UniProt database
  • Ontology
  • Gene Ontology Molecular Function aspect
  • Experimental Annotations
  • Gene Ontology Annotation project at EBI
  • Pruned Ontology: 406 nodes (out of 7,399) with
    at least 20 proteins
  • Final Data Set: 14,362 proteins
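
A rough sketch of how an ontology could be pruned to nodes with enough annotated proteins. It assumes (this is not stated on the slide) that a protein counts toward a node when it is annotated at that node or at any descendant, and that the cut-off is 20 proteins. The names `ancestors_or_self` and `prune_ontology` are hypothetical.

```python
def ancestors_or_self(node, parents):
    """Return the node together with all of its ancestors in the DAG."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def prune_ontology(annotations, parents, min_proteins=20):
    """Keep only nodes with at least `min_proteins` annotated proteins.

    annotations: protein id -> iterable of annotated nodes
    parents:     node -> list of parent nodes (a DAG, so several are allowed)
    """
    counts = {}
    for nodes in annotations.values():
        implied = set()
        for node in nodes:
            implied |= ancestors_or_self(node, parents)
        # Count each protein once per node, however many paths reach it.
        for node in implied:
            counts[node] = counts.get(node, 0) + 1
    return {node for node, n in counts.items() if n >= min_proteins}
```

With a toy threshold of 2, three proteins annotated at `T`, `T`, and `S` (both children of `A1`) would keep `T` and its ancestors and drop `S`.
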

14
Outline
  • Introduction
  • Predictors
  • Evaluation in a Hierarchy
  • Local Predictor Design
  • Experimental Results
  • Conclusion

15
Predictors
  • Global
  • BLAST NN
  • Local
  • PA-SVM
  • PFAM-SVM
  • Probabilistic Suffix Trees

16
Predictors
  • Global
  • BLAST NN
  • Local
  • PA-SVM
  • PFAM-SVM
  • Probabilistic Suffix Trees

  • Linear

17
Why Linear SVMs?
  • Accurate
  • Explainability
  • Each term in the dot product is meaningful
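
To illustrate the explainability point, here is a small sketch of how the score of a linear classifier decomposes into one term per feature. The feature names and weights are invented for illustration, and `explain_linear_score` is a hypothetical helper, not the Proteome Analyst implementation.

```python
def explain_linear_score(weights, features, bias=0.0):
    """Split a linear decision score w.x + b into per-feature contributions."""
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    score = sum(contributions.values()) + bias
    # Largest absolute contribution first: these features drive the decision.
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return score, ranked


# Hypothetical weights and feature values, chosen only for the example.
weights = {"kinase_domain": 1.5, "signal_peptide": -0.5, "zinc_finger": 0.25}
features = {"kinase_domain": 1.0, "signal_peptide": 1.0, "zinc_finger": 0.0}
score, ranked = explain_linear_score(weights, features, bias=-0.25)
```

Because the score is a plain sum, each entry in `ranked` can be read as evidence for or against the predicted function.
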

18
PA-SVM
  • Proteome Analyst

19
PFAM-SVM
  • Hidden Markov Models

20
PST
  • Probabilistic Suffix Trees
  • Efficient Markov chains
  • Model the protein sequences directly
  • Prediction

21
BLAST
  • Protein Sequence Alignment for a query protein
    against any set of protein sequences

22
BLAST
23
Outline
  • Introduction
  • Predictors
  • Evaluation in a Hierarchy
  • Local Predictor Design
  • Experimental Results
  • Conclusion

24
Evaluating Predictions in a Hierarchy
  • Not all errors are equivalent
  • An error to a sibling is different from an error to an
    unrelated part of the hierarchy
  • Proteins can perform more than one function
  • Need to combine predictions of multiple functions
    into a single measure

25
Evaluating Predictions in a Hierarchy
  • Semantics of the hierarchy: the True Path Rule
  • Protein labeled with
  • T -> T, A1, A2
  • Predicted functions
  • S -> S, A1, A2
  • Precision: 2/3 = 67%
  • Recall: 2/3 = 67%

26
Evaluating Predictions in a Hierarchy
  • Protein labeled with
  • T -> T, A1, A2
  • Predicted
  • C1 -> C1, T, A1, A2
  • Precision: 3/4 = 75%
  • Recall: 3/3 = 100%
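
The two worked examples can be reproduced with a short sketch. It assumes a toy hierarchy in which `T` and `S` are siblings under `A1`, `A1` sits under `A2`, and `C1` is a child of `T`; the slides only list the expanded label sets, so this parent map is an assumption. The function names are hypothetical.

```python
def with_ancestors(labels, parents):
    """Expand a label set with all ancestors, as the True Path Rule implies."""
    seen, stack = set(), list(labels)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def hierarchical_precision_recall(true_labels, predicted_labels, parents):
    """Precision and recall computed on ancestor-expanded label sets."""
    true_set = with_ancestors(true_labels, parents)
    pred_set = with_ancestors(predicted_labels, parents)
    overlap = len(true_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


# Assumed toy hierarchy: node -> list of parents.
parents = {"A1": ["A2"], "T": ["A1"], "S": ["A1"], "C1": ["T"]}

sibling_error = hierarchical_precision_recall({"T"}, {"S"}, parents)
too_specific = hierarchical_precision_recall({"T"}, {"C1"}, parents)
```

Predicting the sibling `S` shares two of three nodes in each direction, while predicting the child `C1` recovers every true node but adds one extra, matching the 2/3 and 3/4 precision figures on the slides.
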

27
Supervised Learning
28
Cross-Validation
  • Used to estimate performance of classification
    system on future data
  • 5-fold cross-validation

29
Outline
  • Introduction
  • Predictors
  • Evaluation in a Hierarchy
  • Local Predictor Design
  • Experimental Results
  • Conclusion

30
Inclusive vs Exclusive Local Predictors
  • In a system of local predictors, how should each
    local predictor behave?
  • Two extremes
  • A local predictor predicts positive only for
    those proteins that belong exactly at that node
  • A local predictor predicts positive for those
    proteins that belong at or below them in the
    hierarchy
  • No a priori reason to choose either
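
The two extremes can be written down as the set of proteins each local predictor is expected to call positive. This is a sketch under the assumption that `annotations` maps each protein to its most specific annotated nodes; `positive_proteins` and `descendants_or_self` are hypothetical names.

```python
def descendants_or_self(node, children):
    """Return the node together with every node below it in the hierarchy."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def positive_proteins(node, annotations, children, inclusive):
    """Proteins a local predictor for `node` should label positive.

    exclusive: only proteins annotated exactly at `node`
    inclusive: proteins annotated at `node` or anywhere below it
    """
    targets = descendants_or_self(node, children) if inclusive else {node}
    return {p for p, nodes in annotations.items() if targets & set(nodes)}


# Assumed toy hierarchy (node -> children) and annotations.
children = {"A1": ["T", "S"], "T": ["C1"]}
annotations = {"p1": ["A1"], "p2": ["T"], "p3": ["C1"], "p4": ["S"]}

exclusive_T = positive_proteins("T", annotations, children, inclusive=False)
inclusive_T = positive_proteins("T", annotations, children, inclusive=True)
```

For node `T`, the exclusive definition selects only the protein annotated exactly at `T`, while the inclusive definition also selects the protein annotated at its child `C1`.
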

31
Exclusive Local Predictors
32
Inclusive Local Predictors
33
Training Set Design
  • Proteins in the current fold's training set can
    be used in any way
  • Need to select for each local predictor
  • Positive training examples
  • Negative training examples

34
Training Set Design
35
Training Set Design
36
Training Set Design
37
Training Set Design
38
Training Set Design
39
Comparing Training Set Design Schemes
  • Using PA-SVM

40
Exclusive have more exceptions
41
Lowering the Cost of Local Predictors
  • Top-Down
  • Compute local predictors top to bottom until a
    negative prediction is reached
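
A minimal sketch of the top-down idea, assuming each local predictor is a function that returns True or False for a protein and that a node is visited once any of its parents is predicted positive (the slide does not say how nodes with several parents are handled). `top_down_predict` is a hypothetical name.

```python
def top_down_predict(roots, children, local_predictors, protein):
    """Evaluate local predictors from the roots down, pruning below negatives.

    Returns the set of predicted nodes and the number of predictors evaluated.
    """
    predicted, visited = set(), set()
    evaluated = 0
    frontier = list(roots)
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue
        visited.add(node)
        evaluated += 1
        if local_predictors[node](protein):
            predicted.add(node)
            # Only descend below nodes predicted positive.
            frontier.extend(children.get(node, ()))
    return predicted, evaluated


# Toy hierarchy and toy predictors (positive exactly on the protein's nodes).
children = {"A2": ["A1", "B1"], "A1": ["T", "S"], "T": ["C1"], "B1": ["B2"]}
all_nodes = ["A2", "A1", "B1", "T", "S", "C1", "B2"]
local_predictors = {n: (lambda protein, n=n: n in protein) for n in all_nodes}
predicted, evaluated = top_down_predict(
    ["A2"], children, local_predictors, {"A2", "A1", "T"}
)
```

In this toy run the predictor for `B2` is never computed because its parent `B1` was predicted negative, which is where the cost saving comes from.
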

44
Top-Down Search
45
Outline
  • Introduction
  • Predictors
  • Evaluation in a Hierarchy
  • Local Predictor Design
  • Experimental Results
  • Conclusion

46
Predictor Results
47
Similar and Dissimilar Proteins
  • 89% of proteins: at least one good BLAST hit
  • Proteins which are similar (often homologous) to
    the set of well-studied proteins
  • 11% of proteins: no good BLAST hit
  • Proteins which are not similar to the set of
    well-studied proteins

48
Coverage
  • Coverage: percentage of proteins for which a
    prediction is made
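
As a small worked example of this definition, coverage can be computed from a mapping of proteins to predicted label sets, assuming an empty set means that no prediction was made. The function name `coverage` is hypothetical.

```python
def coverage(predictions):
    """Percentage of proteins that received at least one predicted label."""
    if not predictions:
        return 0.0
    covered = sum(1 for labels in predictions.values() if labels)
    return 100.0 * covered / len(predictions)


# Toy predictions: two of four proteins receive a prediction.
predictions = {"p1": {"T"}, "p2": set(), "p3": {"S", "A1"}, "p4": set()}
pct = coverage(predictions)
```
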

49
Similar Proteins Exploiting BLAST
  • BLAST is fast and accurate when a good hit is
    found
  • Can exploit this to lower the cost of local
    predictors
  • Generate candidate nodes
  • Only compute local predictors for candidate nodes
  • Candidate node set should have
  • High Recall
  • Minimal Size

50
Similar Proteins Exploiting BLAST
  • Methods for generating candidate nodes
  • Searching outward from a BLAST hit
  • Taking the union of the annotations of more
    than one BLAST hit
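
Both methods can be combined in one sketch: start from the union of the annotations of several BLAST hits, then grow the set outward through the hierarchy. It assumes "outward" means following parent and child links for a fixed number of steps, which the slide does not specify; `candidate_nodes` and `radius` are hypothetical names.

```python
def candidate_nodes(hit_annotations, parents, children, radius=1):
    """Build a candidate node set from the annotations of BLAST hits.

    hit_annotations: one iterable of annotated nodes per BLAST hit
    radius:          number of parent/child steps to expand outward
    """
    # Union of the annotations of all hits.
    candidates = set()
    for nodes in hit_annotations:
        candidates |= set(nodes)
    # Expand outward one step at a time.
    frontier = set(candidates)
    for _ in range(radius):
        next_frontier = set()
        for node in frontier:
            neighbours = set(parents.get(node, ())) | set(children.get(node, ()))
            next_frontier |= neighbours - candidates
        candidates |= next_frontier
        frontier = next_frontier
    return candidates
```

A larger `radius` raises the recall of the candidate set at the price of more local predictors to compute, which is the trade-off between high recall and minimal size.
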

51
Similar Proteins Exploiting BLAST
52
Dissimilar Proteins
  • The more interesting case

53
Comparison to Protfun
  • On a pruned ontology (9 Gene Ontology classes)
  • On 1,637 proteins with no good BLAST hit

54
Future Work
  • Try the other two ontologies: biological process
    and cellular component
  • Use other local predictors
  • More parameter tuning
  • Predictor cost

55
Conclusion
  • Protein Function Prediction provides good leads
    for Protein Function Determination
  • Hierarchical ontologies can represent incomplete
    data, allowing the prediction of more functions
  • Considering the hierarchy
  • More accurate, less computationally intensive
  • Methods presented have a higher coverage than
    BLAST alone
  • Results accepted to IEEE CIBCB 2005

56
Thanks to
  • Duane Szafron and Paul Lu
  • Brett Poulin and Russ Greiner
  • Everyone in the Proteome Analyst research group

57
Incomplete Data Prediction
  • Inclusive avoids using ambiguous (incomplete)
    training data
  • Does this help?
  • To test
  • Train on more Incomplete Data
  • Choose X% of proteins, and move one annotation up
  • Evaluate predictions on complete data
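
The degradation step can be sketched as follows, assuming that moving an annotation up means replacing one annotated node with one of its parents and that the affected proteins are drawn at random. The name `degrade_annotations` and the `seed` parameter are hypothetical.

```python
import random


def degrade_annotations(annotations, parents, fraction, seed=0):
    """Make a fraction of proteins less specific by moving one annotation up."""
    rng = random.Random(seed)
    proteins = sorted(annotations)
    chosen = rng.sample(proteins, round(fraction * len(proteins)))
    degraded = {p: set(nodes) for p, nodes in annotations.items()}
    for protein in chosen:
        # Only annotations that have a parent can be moved up.
        movable = sorted(n for n in degraded[protein] if parents.get(n))
        if not movable:
            continue
        node = rng.choice(movable)
        degraded[protein].discard(node)
        degraded[protein].add(rng.choice(sorted(parents[node])))
    return degraded
```

Training would then use the degraded annotations while evaluation keeps the original, complete ones.
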

58
Robustness to Incomplete Data
59
Local vs Global Cross-Validation
  • Some node predictors have as few as 20
    positive examples
  • How to do cross-validation to make sure each
    predictor has enough positive training examples?

60
Local vs Global Cross-Validation
  • Local cross-validation is invalid
  • Predictions must be consistent
  • Need fold isolation
  • A single global split
  • global cross-validation
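
A minimal sketch of a single global split, assuming proteins are assigned to folds once and every local predictor then trains and tests on the same partition. The function names and the `seed` parameter are hypothetical.

```python
import random


def global_folds(proteins, k=5, seed=0):
    """Assign every protein to exactly one of k folds, once, for all predictors."""
    shuffled = sorted(proteins)
    random.Random(seed).shuffle(shuffled)
    # Striding keeps fold sizes within one protein of each other.
    return [shuffled[i::k] for i in range(k)]


def train_test_split(folds, test_index):
    """Use one fold for testing and the remaining folds for training."""
    test = list(folds[test_index])
    train = [p for i, fold in enumerate(folds) if i != test_index for p in fold]
    return train, test


proteins = [f"p{i}" for i in range(23)]
folds = global_folds(proteins)
train, test = train_test_split(folds, 0)
```

Because the folds are fixed before any node is considered, a protein held out for testing is never seen in training by any local predictor, which is the fold isolation the slide asks for.
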