Title: Predicting Protein Function Using Machine-Learned Hierarchical Classifiers
1Predicting Protein Function Using Machine-Learned
Hierarchical Classifiers
- Roman Eisner
- Supervisors: Duane Szafron and Paul Lu
2Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion
4Proteins
- Functional Units in the cell
- Perform a Variety of Functions
- e.g. catalysis of reactions, structural and
mechanical roles, transport of other molecules
- Can take years to study a single protein
- Any good leads would be helpful!
5Protein Function Prediction and Protein Function
Determination
- Prediction
- An estimate of what function a protein performs
- Determination
- Work in a laboratory to observe and discover what
function a protein performs
- Prediction complements determination
6Proteins
- Chain of amino acids
- 20 Amino Acids
- FastA Format
>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI
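A FASTA record is a header line starting with `>` followed by one or more sequence lines. As a minimal sketch of reading such a record, assuming plain-text input (the function name `parse_fasta` is hypothetical and not part of the presented system):

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated without separators.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>P18077 R35A_HUMAN
MSGRLWSKAIFAGYKRGLRNQREHTALLKIEGVYARDETEFYLGKR
CAYVYKAKNNTVTPGGKPNKTRVIWGKVTRAHGNSGMVRAKFRSNL
PAKAIGHRIRVMLYPSRI"""
records = parse_fasta(example)
```

Here `records` holds one pair: the header text without the `>` and the sequence joined into a single string.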
7Ontologies
- Standardized Vocabularies (Common Language)
- In biological literature, different terms can be
used to describe the same function
- e.g. peroxiredoxin activity and
thioredoxin peroxidase activity
- Can be structured in a hierarchy to show
relationships
8Gene Ontology
- Directed Acyclic Graph (DAG)
- Always changing
- Describes 3 aspects of protein annotations
- Molecular Function
- Biological Process
- Cellular Component
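Because the ontology is a DAG rather than a tree, a term may have more than one parent. A small sketch of how such a structure could be stored and queried for ancestors; the term names and the `PARENTS` table are made up for illustration and are not real Gene Ontology terms:

```python
# Toy ontology: each term maps to the list of its parents (hypothetical names).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic": ["root"],
    "kinase": ["catalytic"],
    "atp_binding_kinase": ["kinase", "binding"],  # two parents: a DAG, not a tree
}


def ancestors(term, parents):
    """Return the set of all ancestors of `term`, excluding the term itself."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


result = ancestors("atp_binding_kinase", PARENTS)
```

The `seen` set keeps a shared ancestor such as `root` from being visited twice when it is reachable along several paths.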
10Hierarchical Ontologies
- Can help to represent a large number of classes
- Represent General and Specific data
- Some data is incomplete and could become more
specific in the future
11Incomplete Annotations
12Goal
- To predict the function of proteins given their
sequence
13Data Set
- Protein Sequences
- UniProt database
- Ontology
- Gene Ontology Molecular Function aspect
- Experimental Annotations
- Gene Ontology Annotation project at EBI
- Pruned Ontology: 406 nodes (out of 7,399) with
at least 20 proteins each
- Final Data Set: 14,362 proteins
14Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion
15Predictors
- Global
- BLAST NN
- Local
- PA-SVM
- PFAM-SVM
- Probabilistic Suffix Trees
17Why Linear SVMs?
- Accurate
- Explainability
- Each term in the dot product is meaningful
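The explainability point can be illustrated with a short sketch: a linear classifier's score is a sum of per-feature terms, so each term can be reported on its own. The feature names, weights, and the helper `explain_linear_score` are hypothetical; the actual PA-SVM and PFAM-SVM features are not shown on these slides.

```python
def explain_linear_score(weights, features, bias=0.0):
    """Return (score, contributions) for a linear model score = bias + sum(w_i * x_i)."""
    contributions = {
        name: weights.get(name, 0.0) * value  # features without a weight contribute 0
        for name, value in features.items()
    }
    score = bias + sum(contributions.values())
    return score, contributions


# Hypothetical weights and one protein's feature vector.
weights = {"f_kinase_domain": 2.0, "f_signal_peptide": -0.5}
features = {"f_kinase_domain": 1.0, "f_signal_peptide": 1.0, "f_unseen": 3.0}
score, contributions = explain_linear_score(weights, features, bias=-1.0)
```

Sorting `contributions` by absolute value would show which features pushed the prediction toward positive or negative.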
18PA-SVM
19PFAM-SVM
20PST
- Probabilistic Suffix Trees
- Efficient Markov chains
- Model the protein sequences directly
- Prediction
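Probabilistic suffix trees are variable-order Markov models. As a simplified stand-in (not a PST), the sketch below trains a fixed-order Markov chain on sequences and scores a new sequence by log-likelihood. The function names, the fixed order, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_markov(sequences, order=2):
    """Count how often each residue follows each context of length `order`."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_likelihood(seq, counts, order=2):
    """Log-probability of `seq` under the model, with add-one smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(seq[i - order:i], {})
        seen = sum(context.values())
        total += math.log((context.get(seq[i], 0) + 1) / (seen + len(AMINO_ACIDS)))
    return total


model = train_markov(["MKMKMKMK"], order=1)
familiar = log_likelihood("MKMK", model, order=1)
unfamiliar = log_likelihood("MWMW", model, order=1)
```

A sequence resembling the training data (`familiar`) scores higher than one that does not (`unfamiliar`); a classifier could compare such scores against a threshold.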
21BLAST
- Protein Sequence Alignment for a query protein
against any set of protein sequences
22BLAST
23Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion
24Evaluating Predictions in a Hierarchy
- Not all errors are equivalent
- An error to a sibling is different from an error to an
unrelated part of the hierarchy
- Proteins can perform more than one function
- Need to combine predictions of multiple functions
into a single measure
25Evaluating Predictions in a Hierarchy
- Semantics of the hierarchy: True Path Rule
- Protein labeled with
- T -> T, A1, A2
- Predicted functions
- S -> S, A1, A2
- Precision: 2/3 = 67%
- Recall: 2/3 = 67%
26Evaluating Predictions in a Hierarchy
- Protein labeled with
- T -> T, A1, A2
- Predicted
- C1 -> C1, T, A1, A2
- Precision: 3/4 = 75%
- Recall: 3/3 = 100%
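The two worked examples can be reproduced with a short sketch: expand both the true and the predicted terms with their ancestors (the True Path Rule), then compute precision and recall on the expanded sets. The hierarchy layout below (A2 above A1, A1 above T and S, T above C1) is an assumption chosen to be consistent with the ancestor sets in the examples; the function names are hypothetical.

```python
# Assumed toy hierarchy: term -> list of parents.
PARENTS = {"A2": [], "A1": ["A2"], "T": ["A1"], "S": ["A1"], "C1": ["T"]}


def with_ancestors(terms, parents):
    """Return the terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents[node])
    return closed


def hierarchical_precision_recall(true_terms, predicted_terms, parents):
    """Precision and recall computed on ancestor-expanded term sets."""
    true_set = with_ancestors(true_terms, parents)
    pred_set = with_ancestors(predicted_terms, parents)
    overlap = len(true_set & pred_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    return precision, recall


sibling = hierarchical_precision_recall(["T"], ["S"], PARENTS)    # predicted a sibling
too_deep = hierarchical_precision_recall(["T"], ["C1"], PARENTS)  # predicted a child
```

With this layout, `sibling` gives 2/3 for both measures and `too_deep` gives precision 3/4 and recall 3/3, matching the slides.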
27Supervised Learning
28Cross-Validation
- Used to estimate performance of classification
system on future data
- 5-Fold Cross-Validation
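A minimal sketch of how a 5-fold split could be produced in plain Python; the shuffling seed and the helper name `k_fold_splits` are assumptions, and any standard library routine for k-fold splitting would serve the same purpose.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists; each item appears in exactly one test fold."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


splits = list(k_fold_splits(range(10), k=5))
```

Performance is then estimated by training on each `train` part, evaluating on the matching `test` part, and averaging over the five folds.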
29Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion
30Inclusive vs Exclusive Local Predictors
- In a system of local predictors, how should each
local predictor behave?
- Two extremes
- Exclusive: a local predictor predicts positive only for
those proteins that belong exactly at that node
- Inclusive: a local predictor predicts positive for those
proteins that belong at or below that node in the
hierarchy
- No a priori reason to choose either
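The two extremes can be written as two labeling functions for a given node. This is a sketch under assumptions: the toy hierarchy, the function names, and the convention that a protein's annotations are its most specific terms are all hypothetical.

```python
# Assumed toy hierarchy: term -> list of parents.
PARENTS = {"A2": [], "A1": ["A2"], "T": ["A1"], "S": ["A1"], "C1": ["T"]}


def is_at_or_below(term, node, parents):
    """True if `term` equals `node` or has `node` among its ancestors."""
    stack, seen = [term], set()
    while stack:
        current = stack.pop()
        if current == node:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return False


def exclusive_label(annotations, node):
    """Positive only if the protein is annotated exactly at `node`."""
    return node in annotations


def inclusive_label(annotations, node, parents):
    """Positive if the protein is annotated at `node` or anywhere below it."""
    return any(is_at_or_below(term, node, parents) for term in annotations)


protein = {"C1"}  # annotated at the most specific term C1
```

For this protein, the predictor at node T should answer negative under the exclusive definition and positive under the inclusive one.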
31Exclusive Local Predictors
32Inclusive Local Predictors
33Training Set Design
- Proteins in the current fold's training set can
be used in any way
- Need to select for each local predictor
- Positive training examples
- Negative training examples
34Training Set Design
35Training Set Design
36Training Set Design
37Training Set Design
38Training Set Design
39Comparing Training Set Design Schemes
40Exclusive have more exceptions
41Lowering the Cost of Local Predictors
- Top-Down
- Compute local predictors top to bottom until a
negative prediction is reached
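The top-down idea can be sketched as a traversal that only descends below nodes whose local predictor answers positive. The toy hierarchy, the stub predictor, and the function name `top_down_predict` are assumptions; in a DAG, this sketch evaluates a node as soon as any one of its parents is positive, which is only one possible choice.

```python
# Assumed toy hierarchy: node -> list of children.
CHILDREN = {"A2": ["A1"], "A1": ["T", "S"], "T": ["C1"]}


def top_down_predict(protein, roots, children, predict):
    """Evaluate local predictors from the roots down, pruning below negatives.

    `predict(node, protein)` stands in for a trained local predictor.
    Returns (positive nodes, number of predictors evaluated).
    """
    positives, evaluated = [], 0
    visited = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        evaluated += 1
        if predict(node, protein):
            positives.append(node)
            stack.extend(children.get(node, []))  # descend only below positives
    return positives, evaluated


def stub_predict(node, protein):
    # Hypothetical predictor outputs for one protein.
    return node in {"A2", "A1", "S"}


positives, evaluated = top_down_predict("some_protein", ["A2"], CHILDREN, stub_predict)
```

In this toy run the predictor at C1 is never computed, because its parent T was predicted negative.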
44Top-Down Search
45Outline
- Introduction
- Predictors
- Evaluation in a Hierarchy
- Local Predictor Design
- Experimental Results
- Conclusion
46Predictor Results
47Similar and Dissimilar Proteins
- 89% of proteins: at least one good BLAST hit
- Proteins which are similar (often homologous) to
the set of well studied proteins
- 11% of proteins: no good BLAST hit
- Proteins which are not similar to the set of well
studied proteins
48Coverage
- Coverage: percentage of proteins for which a
prediction is made
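As a small illustration of this definition, coverage can be computed from a mapping of proteins to predicted term sets. The data structure and the function name `coverage` are assumptions.

```python
def coverage(predictions):
    """Percentage of proteins with at least one predicted function."""
    if not predictions:
        return 0.0
    covered = sum(1 for terms in predictions.values() if terms)
    return 100.0 * covered / len(predictions)


# Hypothetical predictions: protein id -> set of predicted terms.
predictions = {"p1": {"T"}, "p2": set(), "p3": {"S"}, "p4": {"A1"}}
value = coverage(predictions)
```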
49Similar Proteins Exploiting BLAST
- BLAST is fast and accurate when a good hit is
found
- Can exploit this to lower the cost of local
predictors
- Generate candidate nodes
- Only compute local predictors for candidate nodes
- Candidate node set should have
- High Recall
- Minimal Size
50Similar Proteins Exploiting BLAST
- Candidate node generating methods
- Searching outward from BLAST hit
- Taking the union of the annotations of more than
one BLAST hit
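Both generating methods can be combined in one sketch: start from the union of the annotations of the BLAST hits, then search outward a fixed number of steps along parent and child links. The toy hierarchy, the `radius` parameter, and the function name `candidate_nodes` are assumptions; the BLAST hits are taken as already computed.

```python
from collections import deque

# Assumed toy hierarchy: term -> list of parents.
PARENTS = {"A2": [], "A1": ["A2"], "T": ["A1"], "S": ["A1"], "C1": ["T"]}


def candidate_nodes(hit_annotations, parents, radius=1):
    """Union the hits' annotations, then expand `radius` steps up and down."""
    # Undirected neighbor map built from the parent links.
    neighbors = {}
    for node, node_parents in parents.items():
        neighbors.setdefault(node, set()).update(node_parents)
        for parent in node_parents:
            neighbors.setdefault(parent, set()).add(node)

    start = set().union(*hit_annotations)
    distance = {node: 0 for node in start}
    queue = deque(start)
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand past the radius
        for neighbor in neighbors.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


one_hit = candidate_nodes([{"T"}], PARENTS, radius=1)
two_hits = candidate_nodes([{"T"}, {"S"}], PARENTS, radius=0)
```

A larger radius or more hits raises recall of the candidate set at the price of more local predictors to compute, which is the trade-off between high recall and minimal size.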
51Similar Proteins Exploiting BLAST
52Dissimilar Proteins
- The more interesting case
53Comparison to Protfun
- On a pruned ontology (9 Gene Ontology classes)
- On 1,637 proteins with no good BLAST hit
54Future Work
- Try the other two ontologies: biological process and
cellular component
- Use other local predictors
- More parameter tuning
- Predictor cost
55Conclusion
- Protein Function Prediction provides good leads
for Protein Function Determination
- Hierarchical ontologies can represent incomplete
data, allowing the prediction of more functions
- Considering the hierarchy
- More accurate, less computationally intensive
- Methods presented have a higher coverage than
BLAST alone
- Results accepted to IEEE CIBCB 2005
56Thanks to
- Duane Szafron and Paul Lu
- Brett Poulin and Russ Greiner
- Everyone in the Proteome Analyst research group
57Incomplete Data Prediction
- Inclusive avoids using ambiguous (incomplete)
training data
- Does this help?
- To test
- Train on more incomplete data
- Choose X% of proteins, and move one annotation up
- Evaluate predictions on complete data
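The test procedure can be sketched as a function that makes a copy of the training annotations less specific: for a chosen fraction of proteins, one annotation is replaced by one of its parents. The toy hierarchy, the random choices, and the function name `make_incomplete` are assumptions about details the slide does not give.

```python
import random

# Assumed toy hierarchy: term -> list of parents.
PARENTS = {"A2": [], "A1": ["A2"], "T": ["A1"], "S": ["A1"], "C1": ["T"]}


def make_incomplete(annotations, parents, fraction, seed=0):
    """Return a copy where `fraction` of proteins have one annotation moved up."""
    rng = random.Random(seed)
    result = {pid: set(terms) for pid, terms in annotations.items()}
    ids = sorted(result)
    chosen = rng.sample(ids, round(fraction * len(ids)))
    for pid in chosen:
        movable = sorted(t for t in result[pid] if parents[t])  # roots cannot move up
        if not movable:
            continue
        term = rng.choice(movable)
        result[pid].discard(term)
        result[pid].add(rng.choice(sorted(parents[term])))
    return result


complete = {"p1": {"C1"}, "p2": {"S"}}
degraded = make_incomplete(complete, PARENTS, fraction=1.0)
```

Predictors would then be trained on `degraded` and evaluated against `complete`, which is left unchanged.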
58Robustness to Incomplete Data
59Local vs Global Cross-Validation
- Some node predictors have as few as 20
positive examples
- How to do cross-validation to make sure each
predictor has enough positive training examples?
60Local vs Global Cross-Validation
- Local cross-validation is invalid
- Predictions must be consistent
- Need fold isolation
- A single global split
- global cross-validation
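Fold isolation can be sketched by assigning every protein to a fold once, globally, and having every node predictor draw its training examples from the same assignment. The helper names and the round-robin assignment are assumptions.

```python
import random


def assign_global_folds(protein_ids, k=5, seed=0):
    """Assign each protein to one of k folds, once for the whole hierarchy."""
    ids = sorted(protein_ids)
    random.Random(seed).shuffle(ids)
    return {pid: i % k for i, pid in enumerate(ids)}


def node_training_positives(node_members, folds, held_out_fold):
    """Positives for one node predictor, excluding the held-out fold entirely."""
    return {pid for pid in node_members if folds[pid] != held_out_fold}


proteins = [f"p{i}" for i in range(10)]
folds = assign_global_folds(proteins, k=5)
# Two hypothetical nodes sharing some proteins.
node_a = {"p0", "p1", "p2", "p3"}
node_b = {"p2", "p3", "p4", "p5"}
train_a = node_training_positives(node_a, folds, held_out_fold=0)
train_b = node_training_positives(node_b, folds, held_out_fold=0)
```

Because both nodes use the same `folds` mapping, no protein in the held-out fold is seen by any local predictor during training, so the predictions combined across nodes stay consistent.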