Predicting Protein Function Using Machine-Learned Hierarchical Classifiers


1
Predicting Protein Function Using Machine-Learned
Hierarchical Classifiers

Roman Eisner
Supervisors: Duane Szafron and Paul Lu

2
Outline

Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion

4
Proteins

Functional Units in the cell
Perform a Variety of Functions
e.g. Catalysis of reactions, Structural and
mechanical roles, transport of other molecules
Can take years to study a single protein
Any good leads would be helpful!

5
Protein Function Prediction and Protein Function
Determination

Prediction
An estimate of what function a protein performs
Determination
Work in a laboratory to observe and discover what
function a protein performs
Prediction complements determination

6
Proteins

Chain of amino acids
20 Amino Acids
FASTA format

>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKRCAYVYKAKNNTVTP
GGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNLPAKAIGHRIRVMLYPSRI
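
A FASTA record is a header line starting with `>` followed by one or more lines of sequence. The following is a minimal sketch of reading such records in Python. The function name `parse_fasta` and the shortened example sequence are illustrative assumptions, not part of the original presentation.

```python
from typing import Dict

# The 20 standard amino acids, as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; the header is everything after '>'.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; join them back together.
            records[header] += line
    return records


# Shortened example: only the start of the sequence shown on the slide.
example = """>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLK
IEGVYARDETEFYLGKR
"""
records = parse_fasta(example)
sequence = records["P18077 R35A_HUMAN"]
```

Joining the wrapped lines gives one string per protein, which is the form a sequence-based predictor would typically take as input.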
7
Ontologies

Standardized Vocabularies (Common Language)
In biological literature, different terms can be
used to describe the same function
e.g. peroxiredoxin activity and
thioredoxin peroxidase activity
Can be structured in a hierarchy to show
relationships

8
Gene Ontology

Directed Acyclic Graph (DAG)
Always changing
Describes 3 aspects of protein annotations
Molecular Function
Biological Process
Cellular Component
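
To illustrate what a DAG means here: unlike a tree, a term may have more than one parent. The sketch below stores parent links in a dictionary and collects all ancestors of a term. The term names (`root`, `A`, `B`, `C`) are hypothetical placeholders, not real Gene Ontology terms.

```python
from typing import Dict, List, Set

# Hypothetical four-term ontology. "C" has two parents, which a DAG
# allows and a tree does not.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


ancestors_of_c = ancestors("C", PARENTS)
```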

10
Hierarchical Ontologies

Can help to represent a large number of classes
Represent General and Specific data
Some data is incomplete and could become more
specific in the future

11
Incomplete Annotations
12
Goal

To predict the function of proteins given their
sequence

13
Data Set

Protein Sequences: UniProt database
Ontology: Gene Ontology, Molecular Function aspect
Experimental Annotations: Gene Ontology Annotation project at EBI
Pruned Ontology: 406 nodes (out of 7,399), each with
at least 20 proteins
Final Data Set: 14,362 proteins

14
Outline

Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion

15
Predictors

Global
BLAST NN
Local
PA-SVM
PFAM-SVM
Probabilistic Suffix Trees

16
Predictors

Global
BLAST NN
Local
PA-SVM
PFAM-SVM
Probabilistic Suffix Trees

Linear

17
Why Linear SVMs?

Accurate
Explainability
Each term in the dot product is meaningful
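
The explainability point can be sketched as follows: a linear SVM's decision score is a bias plus a sum of weight times feature value, so each product can be read as one feature's contribution to the prediction. The feature names and weights below are made up for illustration and do not come from PA-SVM or PFAM-SVM.

```python
from typing import Dict, Tuple


def explain_linear_score(
    weights: Dict[str, float],
    features: Dict[str, float],
    bias: float = 0.0,
) -> Tuple[float, Dict[str, float]]:
    """Return the decision score and each feature's term in the dot product."""
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    score = bias + sum(contributions.values())
    return score, contributions


# Hypothetical weights and features (assumptions for this example only).
weights = {"kinase": 1.5, "membrane": -0.5, "binding": 0.25}
features = {"kinase": 1.0, "membrane": 1.0}

score, contributions = explain_linear_score(weights, features, bias=-0.5)
# A positive score means a positive prediction; `contributions` shows
# which features pushed the score up or down.
```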

18
PA-SVM

Proteome Analyst

19
PFAM-SVM

Hidden Markov Models

20
PST

Probabilistic Suffix Trees
Efficient Markov chains
Model the protein sequences directly
Prediction

21
BLAST

Protein Sequence Alignment for a query protein
against any set of protein sequences

22
BLAST
23
Outline

Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion

24
Evaluating Predictions in a Hierarchy

Not all errors are equivalent
Error to sibling different than error to
unrelated part of hierarchy
Proteins can perform more than one function
Need to combine predictions of multiple functions
into a single measure

25
Evaluating Predictions in a Hierarchy

Semantics of the hierarchy: True Path Rule
Protein labeled with
T -> {T, A1, A2}
Predicted functions
S -> {S, A1, A2}
Precision: 2/3 = 67%
Recall: 2/3 = 67%

26
Evaluating Predictions in a Hierarchy

Protein labeled with
T -> {T, A1, A2}
Predicted
C1 -> {C1, T, A1, A2}
Precision: 3/4 = 75%
Recall: 3/3 = 100%
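
The two worked examples can be reproduced with a short sketch: expand both the true and the predicted terms to include all their ancestors (the True Path Rule), then compute precision and recall on the expanded sets. The hierarchy shape below (A2 above A1, T and S as children of A1, C1 as a child of T) is an assumption inferred from the ancestor sets on the slides, and the function names are illustrative.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumed hierarchy: A2 -> A1 -> {T, S}, and T -> C1.
PARENT: Dict[str, Optional[str]] = {
    "A2": None,
    "A1": "A2",
    "T": "A1",
    "S": "A1",
    "C1": "T",
}


def with_ancestors(terms: Iterable[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Apply the True Path Rule: a term implies all of its ancestors."""
    closed: Set[str] = set()
    for term in terms:
        node: Optional[str] = term
        while node is not None:
            closed.add(node)
            node = parent[node]
    return closed


def hierarchical_precision_recall(
    true_terms: Iterable[str],
    predicted_terms: Iterable[str],
    parent: Dict[str, Optional[str]],
) -> Tuple[float, float]:
    """Precision and recall over ancestor-expanded term sets."""
    true_set = with_ancestors(true_terms, parent)
    pred_set = with_ancestors(predicted_terms, parent)
    overlap = len(true_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


sibling_error = hierarchical_precision_recall({"T"}, {"S"}, PARENT)
too_specific = hierarchical_precision_recall({"T"}, {"C1"}, PARENT)
```

Predicting the sibling S still earns credit for the shared ancestors A1 and A2, while predicting the child C1 recovers every true term but is penalised in precision for the one extra term.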

27
Supervised Learning
28
Cross-Validation

Used to estimate performance of classification
system on future data
5-fold cross-validation
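
As a rough sketch of 5-fold cross-validation: the data is shuffled and split into five folds, and each fold serves once as the test set while the other four are used for training. The function name `k_fold_splits` and the fixed seed are assumptions for this example; the presentation does not say how the folds were drawn.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def k_fold_splits(
    items: Iterable[int], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) pairs, one per fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Example with 23 protein indices.
splits = list(k_fold_splits(range(23), k=5))
```

Averaging a measure such as precision or recall over the five test folds gives the estimate of performance on future data.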

29
Outline

Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion

30
Inclusive vs Exclusive Local Predictors

In a system of local predictors, how should each
local predictor behave?
Two extremes
Exclusive: a local predictor predicts positive only for
those proteins that belong exactly at that node
Inclusive: a local predictor predicts positive for those
proteins that belong at or below that node in the
hierarchy
No a priori reason to choose either
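
The two extremes can be written as two target functions for the local predictor at a given node. This is a sketch under assumptions: the small hierarchy is hypothetical, each protein is represented by a single most-specific annotation, and the function names are illustrative.

```python
from typing import Dict, Optional

# Hypothetical hierarchy: A2 -> A1 -> {T, S}, and T -> C1.
PARENT: Dict[str, Optional[str]] = {
    "A2": None,
    "A1": "A2",
    "T": "A1",
    "S": "A1",
    "C1": "T",
}


def is_at_or_below(term: Optional[str], node: str, parent: Dict[str, Optional[str]]) -> bool:
    """True if `term` equals `node` or is one of its descendants."""
    while term is not None:
        if term == node:
            return True
        term = parent[term]
    return False


def exclusive_target(annotation: str, node: str) -> bool:
    """Positive only for proteins annotated exactly at `node`."""
    return annotation == node


def inclusive_target(annotation: str, node: str, parent: Dict[str, Optional[str]]) -> bool:
    """Positive for proteins annotated at `node` or anywhere below it."""
    return is_at_or_below(annotation, node, parent)


# A protein annotated at C1, seen by the local predictor at node T.
exclusive_at_t = exclusive_target("C1", "T")
inclusive_at_t = inclusive_target("C1", "T", PARENT)
```

Under the exclusive scheme the protein at C1 is not a positive for T, whereas under the inclusive scheme it is.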

31
Exclusive Local Predictors
32
Inclusive Local Predictors
33
Training Set Design

Proteins in the current fold's training set can
be used in any way
Need to select for each local predictor
Positive training examples
Negative training examples

34
Training Set Design
35
Training Set Design
36
Training Set Design
37
Training Set Design
38
Training Set Design
39
Comparing Training Set Design Schemes

Using PA-SVM

40
Exclusive predictors have more exceptions
41
Lowering the Cost of Local Predictors

Top-Down
Compute local predictors top to bottom until a
negative prediction is reached
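
A minimal sketch of the top-down idea: start at the root, evaluate a node's local predictor, and only descend to its children if the prediction is positive. The hierarchy, the stand-in predictors, and the rule that a node is visited when any one parent is positive are assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical hierarchy, stored as parent -> children.
CHILDREN: Dict[str, List[str]] = {
    "A2": ["A1", "B1"],
    "A1": ["T", "S"],
    "T": ["C1"],
    "B1": ["B2"],
}
ALL_NODES = ["A2", "A1", "B1", "T", "S", "C1", "B2"]

# Stand-in local predictors: positive only for nodes in POSITIVE.
POSITIVE = {"A2", "A1", "T"}
PREDICTORS: Dict[str, Callable[[str], bool]] = {
    node: (lambda protein, n=node: n in POSITIVE) for node in ALL_NODES
}


def top_down_predict(
    protein: str,
    roots: Iterable[str],
    children: Dict[str, List[str]],
    predictors: Dict[str, Callable[[str], bool]],
) -> Tuple[Set[str], int]:
    """Return the predicted nodes and how many predictors were evaluated."""
    predicted: Set[str] = set()
    visited: Set[str] = set()
    frontier = list(roots)
    evaluated = 0
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue
        visited.add(node)
        evaluated += 1
        if predictors[node](protein):
            predicted.add(node)
            # Descend only below positive predictions.
            frontier.extend(children.get(node, []))
        # A negative prediction prunes the whole subtree below this node.
    return predicted, evaluated


predicted, evaluated = top_down_predict("query", ["A2"], CHILDREN, PREDICTORS)
```

In this toy case six of the seven predictors are evaluated, because the negative prediction at B1 means B2 is never computed. In a hierarchy of several hundred nodes, pruning whole subtrees this way is where the cost saving would come from.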

44
Top-Down Search
45
Outline

Introduction
Predictors
Evaluation in a Hierarchy
Local Predictor Design
Experimental Results
Conclusion

46
Predictor Results
47
Similar and Dissimilar Proteins

89% of proteins: at least one good BLAST hit
Proteins which are similar (often homologous) to
the set of well studied proteins
11% of proteins: no good BLAST hit
Proteins which are not similar to the set of well
studied proteins

48
Coverage

Coverage: percentage of proteins for which a
prediction is made

49
Similar Proteins: Exploiting BLAST

BLAST is fast and accurate when a good hit is
found
Can exploit this to lower the cost of local
predictors
Generate candidate nodes
Only compute local predictors for candidate nodes
Candidate node set should have
High Recall
Minimal Size

50
Similar Proteins: Exploiting BLAST

Methods for generating candidate nodes
Searching outward from the BLAST hit
Taking the union of the annotations of more than
one BLAST hit
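
One possible reading of the two methods, combined in a single sketch: take the union of the annotations of the top BLAST hits as seed nodes, then expand outward by a fixed number of edges to parents and children. The `radius` parameter, the function name, and the toy hierarchy are assumptions; the presentation does not specify how far or in which direction the search goes.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical hierarchy, stored in both directions.
PARENTS: Dict[str, List[str]] = {
    "A2": [],
    "A1": ["A2"],
    "T": ["A1"],
    "S": ["A1"],
    "C1": ["T"],
}
CHILDREN: Dict[str, List[str]] = {}
for child, parent_list in PARENTS.items():
    for parent in parent_list:
        CHILDREN.setdefault(parent, []).append(child)


def candidate_nodes(
    hit_annotations: Iterable[Set[str]],
    parents: Dict[str, List[str]],
    children: Dict[str, List[str]],
    radius: int = 1,
) -> Set[str]:
    """Union the hits' annotations, then search outward `radius` edges."""
    seeds: Set[str] = set().union(*hit_annotations)
    candidates = set(seeds)
    frontier = set(seeds)
    for _ in range(radius):
        next_frontier: Set[str] = set()
        for node in frontier:
            neighbours = set(parents.get(node, [])) | set(children.get(node, []))
            next_frontier |= neighbours - candidates
        candidates |= next_frontier
        frontier = next_frontier
    return candidates


# Two BLAST hits, annotated with T and S respectively.
hits = [{"T"}, {"S"}]
union_only = candidate_nodes(hits, PARENTS, CHILDREN, radius=0)
one_step = candidate_nodes(hits, PARENTS, CHILDREN, radius=1)
```

A larger radius raises the chance that the true node is among the candidates (recall) but also enlarges the set of local predictors that must be computed, which is the trade-off between high recall and minimal size.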

51
Similar Proteins: Exploiting BLAST
52
Dissimilar Proteins

The more interesting case

53
Comparison to Protfun

On a pruned ontology (9 Gene Ontology classes)
On 1,637 proteins with no good BLAST hit

54
Future Work

Try the other two ontologies: biological process and
cellular component
Use other local predictors
More parameter tuning
Predictor cost

55
Conclusion

Protein Function Prediction provides good leads
for Protein Function Determination
Hierarchical ontologies can represent incomplete
data allowing the prediction of more functions
Considering the hierarchy is
more accurate and less computationally intensive
Methods presented have a higher coverage than
BLAST alone
Results accepted to IEEE CIBCB 2005

56
Thanks to

Duane Szafron and Paul Lu
Brett Poulin and Russ Greiner
Everyone in the Proteome Analyst research group

57
Incomplete Data Prediction

Inclusive avoids using ambiguous (incomplete)
training data
Does this help?
To test:
Train on more incomplete data
Choose X% of proteins, and move one annotation up
Evaluate predictions on complete data
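
The test procedure can be sketched as a function that degrades a training set: pick X% of the proteins and, for each, replace one annotation with its parent term. The hierarchy, the protein names, the random choice of which annotation to move, and the function name are assumptions for illustration.

```python
import random
from typing import Dict, Optional, Set

# Hypothetical hierarchy: A2 -> A1 -> {T, S}, and T -> C1.
PARENT: Dict[str, Optional[str]] = {
    "A2": None,
    "A1": "A2",
    "T": "A1",
    "S": "A1",
    "C1": "T",
}


def degrade_annotations(
    annotations: Dict[str, Set[str]],
    parent: Dict[str, Optional[str]],
    percent: float,
    seed: int = 0,
) -> Dict[str, Set[str]]:
    """Return a copy in which `percent`% of proteins have one annotation moved up."""
    rng = random.Random(seed)
    proteins = sorted(annotations)
    n_to_change = round(len(proteins) * percent / 100)
    chosen = rng.sample(proteins, n_to_change)
    degraded = {p: set(terms) for p, terms in annotations.items()}
    for protein in chosen:
        # Root terms have no parent and cannot be moved up.
        movable = sorted(t for t in degraded[protein] if parent.get(t) is not None)
        if not movable:
            continue
        term = rng.choice(movable)
        degraded[protein].remove(term)
        degraded[protein].add(parent[term])
    return degraded


annotations = {"p1": {"C1"}, "p2": {"T"}, "p3": {"S"}, "p4": {"A2"}}
fully_degraded = degrade_annotations(annotations, PARENT, percent=100)
```

Training on the degraded annotations and evaluating against the original, complete ones would show how much each training scheme suffers as the data becomes less specific.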

58
Robustness to Incomplete Data
59
Local vs Global Cross-Validation

Some node predictors have as few as 20
positive examples
How to do cross-validation to make sure each
predictor has enough positive training examples?

60
Local vs Global Cross-Validation