Title: T. M. Murali
1The State of Gene Function Prediction in
Arabidopsis thaliana
- T. M. Murali
- Department of Computer Science
- Virginia Tech
- Slides prepared by Arjun Krishnan
- Introduction to Computational Biology and
Bioinformatics (CS 3824), October 11 and 13, 2011
2How a cell is wired
Small molecules
DNA
mRNA
Protein
The dynamics of such interactions emerge as
cellular processes and functions
3Molecular interaction networks
How do the genes and their products interact to
collectively perform a function?
4Molecular interaction networks
- A network containing genes connected to each
other whenever they physically or functionally
interact
- Proteins that interact/co-complex (ribosomal,
polymerase, etc.)
- Transcription factors and their targets
- Enzymes catalyzing different steps in the same
metabolic pathway
- Genes with correlated expression
- Genes with similar phylogenetic profiles
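One of the edge types listed above, correlation in expression, can be sketched in a few lines. This is a minimal illustration, not how any particular network was built: the gene names, the expression values, and the 0.9 threshold are all hypothetical, and Pearson correlation is assumed as the similarity measure.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, threshold=0.9):
    """Connect every gene pair whose correlation reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r >= threshold:
            edges[(g1, g2)] = r  # the correlation doubles as the edge weight
    return edges


# Hypothetical expression profiles over four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
edges = coexpression_edges(expression)
```

Here only geneA and geneB rise and fall together, so they are the only pair that ends up connected.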
5Arabidopsis is the primary model organism for
plants
- Complex organization from molecular to whole
organism level.
- A key challenge: understanding the cellular
machinery that sustains this complexity.
- In the current post-genomic era, a main aspect
of this challenge is gene function prediction:
identification of the functions of all the (30,000)
genes in the genome.
6Extent of gene annotations in Arabidopsis
Total of 30,000 genes in the genome
Ashburner et al. (2000) Nat. Genet.; Swarbreck et al.
(2008) Nucleic Acids Res.
7Exploit high-throughput data
- Integrating functional genomic data could lead to
network models of gene interactions that resemble
the underlying cellular map.
- Typically these networks contain functional
interactions between genes, connecting pairs of
genes that participate in the same biological
processes.
- In such a network, the position of a gene
establishes the functional context of that gene.
- Guilt by association: genes of unknown
function can be assigned the functions
of their annotated neighbors.
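Guilt by association can be sketched as a simple weighted neighbor vote. This is only one of many possible scoring schemes and is not claimed to be any of the algorithms evaluated in this study; the toy network, gene names, and the `neighbor_vote` helper are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn {(u, v): weight} into a symmetric adjacency map."""
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    return adj


def neighbor_vote(adj, annotated, gene):
    """Fraction of a gene's total edge weight that leads to annotated genes."""
    neighbors = adj.get(gene, {})
    total = sum(neighbors.values())
    if total == 0:
        return 0.0  # an isolated gene gets no evidence either way
    support = sum(w for n, w in neighbors.items() if n in annotated)
    return support / total


# Hypothetical weighted functional network; g1 and g2 are annotated.
edges = {
    ("g1", "g2"): 1.0,
    ("g1", "g3"): 0.5,
    ("g2", "g3"): 1.0,
    ("g3", "g4"): 0.5,
    ("g4", "g5"): 1.0,
}
annotated = {"g1", "g2"}

adj = build_adjacency(edges)
scores = {g: neighbor_vote(adj, annotated, g) for g in adj if g not in annotated}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The unannotated gene g3 has most of its edge weight on annotated neighbors, so it tops the ranked list of candidate predictions.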
8Functional interaction networks
- Functional interaction network models have been
developed for Arabidopsis.
- Lee et al. (2010) Rational association of genes
with traits using a genome-scale gene network for
Arabidopsis thaliana.
- Very comprehensive in using and integrating
datasets from other organisms for application
in plants.
- Integrated 24 datasets: 5 from Arabidopsis
and the rest from other model organisms.
- AraNet: 19,647 genes, 1,062,222 interactions.
9Goal of this study
- We examine the state of network-based gene
function prediction in Arabidopsis.
- Evaluate the performance of multiple prediction
algorithms on AraNet.
- Assess the influence of the number of genes
annotated to a function and the source of
annotation evidence.
- Compute the correlation of prediction performance
with network properties.
- Evaluate prediction performance for
plant-specific functions.
10Network-based gene function prediction algorithms
11Network-based gene function prediction
12Network-based gene function prediction
13In this study
14Performance of different algorithms
- Computational gene function prediction precedes
and guides experimental validation
- What we get is a ranked list of novel predictions
- An experimenter would choose a manageable number
of top-scoring predictions to pursue
- What matters is precision at the top of the
prediction list
- We choose precision at 20% recall (P20R) as the
measure of performance
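P20R can be computed by walking down the ranked list until 20% of the known positives have been recovered and reporting the precision at that rank. The sketch below assumes this non-interpolated definition and ignores ties in prediction scores; the ranked labels are hypothetical.

```python
def precision_at_recall(ranked_labels, recall_level=0.2):
    """Precision at the first rank where recall reaches recall_level.

    ranked_labels lists the held-out genes from best to worst score;
    True marks a gene that truly has the function.
    """
    total_positives = sum(ranked_labels)
    if total_positives == 0:
        raise ValueError("need at least one positive to compute recall")
    true_positives = 0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            true_positives += 1
            if true_positives / total_positives >= recall_level:
                return true_positives / rank
    raise ValueError("recall_level must be at most 1")


# Hypothetical ranked prediction list: 20 genes, 10 true positives.
ranked_labels = [bool(v) for v in
                 [1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
                  0, 1, 0, 1, 0, 0, 1, 0, 1, 0]]
p20r = precision_at_recall(ranked_labels)
```

With 10 positives, 20% recall is reached at the second positive, which sits at rank 3, so the P20R of this list is 2/3.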
15Performance of different algorithms
SinkSource (SS) seems to be better than the other
algorithms
What about the influence of the number of genes
annotated to a function?
16Performance of different algorithms
Functions are grouped by the number of annotated
genes, each group containing 125 functions
[Figure axes: number of functions; number of genes
annotated with a function]
17Performance of different algorithms
For small functions, the algorithm does not
matter! Using just experimental annotations is
better when you know little about a function.
For medium functions, SS is a little better and
the benefit of electronic evidence is mixed.
For large functions, SS is clearly the best and
using all annotations is better.
18Performance of different algorithms
Wilcoxon test: SS vs. other algorithms
All evidence codes (ECs)
Without IEA/ISS (electronically inferred and
sequence-similarity-based annotations)
Overall, SinkSource appears to be the best algorithm.
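A comparison of this kind can be sketched with a paired Wilcoxon signed-rank test over per-function P20R values. The slides do not say which Wilcoxon variant was used, so the paired form is an assumption here, as is the normal approximation without tie correction; the P20R values are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, two-sided p) using the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return w_plus, w_minus, erfc(abs(z) / sqrt(2))


# Hypothetical P20R values for the same 10 functions under two algorithms.
ss_p20r = [0.80, 0.65, 0.90, 0.55, 0.70, 0.85, 0.60, 0.75, 0.95, 0.50]
other_p20r = [0.70, 0.60, 0.70, 0.56, 0.55, 0.60, 0.57, 0.45, 0.60, 0.48]
w_plus, w_minus, p_value = wilcoxon_signed_rank(ss_p20r, other_p20r)
```

For small samples an exact test is preferable to the normal approximation; a statistics library would normally handle that choice.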
19Correlation of performance with network
properties
- Performance on a particular function might depend
on how its genes are organized / connected among
themselves in the network.
- Number of nodes
- Number of components
- Fraction of nodes in the largest connected
component
- Total edge weight
- Weighted density
- Average weighted degree
- Average segregation
20Correlation of performance with network
properties
21Correlation of performance with network
properties
22Correlation of performance with network
properties
- Number of nodes: 9
- Number of components: 3
- Fraction of nodes in the largest connected
component: 4/9
- Total edge weight: 8
- Weighted density: 8/36
- Average weighted degree: 16/9
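The values above can be reproduced on a small graph. The graph below is hypothetical (the slide's figure is not available) and was chosen only so that it has 9 nodes, 3 components, and 8 unit-weight edges. Weighted density is assumed to be total edge weight divided by the number of possible node pairs, and average weighted degree to be twice the total edge weight divided by the number of nodes; both readings are consistent with 8/36 and 16/9.

```python
from fractions import Fraction


def connected_components(nodes, edges):
    """Return the connected components as a list of node sets."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


def topology(nodes, edges):
    """Topological properties of the subnetwork induced by one function."""
    n = len(nodes)
    components = connected_components(nodes, edges)
    total_weight = sum(edges.values())
    return {
        "nodes": n,
        "components": len(components),
        "largest_component_fraction": Fraction(max(len(c) for c in components), n),
        "total_edge_weight": total_weight,
        "weighted_density": Fraction(total_weight, n * (n - 1) // 2),
        "average_weighted_degree": Fraction(2 * total_weight, n),
    }


# Hypothetical graph: a 4-cycle, a triangle, and a single edge.
nodes = list("abcdefghi")
edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "a"): 1,
    ("e", "f"): 1, ("f", "g"): 1, ("g", "e"): 1,
    ("h", "i"): 1,
}
props = topology(nodes, edges)
```

Exact fractions are used so the results can be compared directly with the values listed on the slide.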
23Correlation of performance with network
properties
Functional modularity: average segregation
24Correlation of performance with network
properties
Functional modularity: average segregation
25Correlation of performance with network
properties
- We have
- A vector of SS P20R values, one per function
- A vector of values of a particular topological
property, one per function
- We compute the Spearman rank correlation between
the two vectors
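The Spearman rank correlation of the two vectors is the Pearson correlation of their ranks. The sketch below assumes tied values share their average rank; the P20R and property values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-function values: SS P20R and one topological property.
p20r = [0.9, 0.4, 0.7, 0.2, 0.6]
topological_property = [0.8, 0.3, 0.9, 0.1, 0.5]
rho = spearman(p20r, topological_property)
```

Because only ranks matter, the correlation is unaffected by the scale of the topological property, which makes properties with very different ranges comparable.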
26Correlation of performance with network
properties
27Performance on plant-specific functions
- The underlying network is built using data
from multiple non-plant species
- For plant-specific functions
- Performance is much worse than for conserved
functions
- Using only experimental annotations is better
- For conserved functions
- Performance is better than that for all functions
- Using all annotations is better
28Most predictable conserved functions
- protein folding
- nucleotide transport
- innate immunity
- cytoskeleton organization, and
- cell cycle
29Least predictable conserved functions
Specialized functions
30Most predictable plant-specific functions
Contribution from Arabidopsis datasets
- cell wall modification
- auxin/cytokinin signaling, and
- photosynthesis
31Least predictable plant-specific functions
- development, morphogenesis
- pattern formation
- phase transitions of various tissues, organs /
growth stages
32Conclusions
- Evaluated the performance of various prediction
algorithms on AraNet.
- SinkSource is the overall best prediction
algorithm.
- Measured the influence of the number of genes
annotated to a function and the source of
annotation evidence.
- All algorithms perform poorly when only a small
number of genes are known or when annotating
very specific functions.
- When only a small number of genes are known,
use only experimentally verified annotations to
make new predictions.
- When a considerable number of genes are known,
use all annotations to make new predictions.
33Conclusions
- Measured the correlation of performance with
network properties
- Several topological properties correlate well
with performance.
- Average segregation has the strongest
correlation.
34Conclusions
- Assessed performance on conserved/plant-specific
functions
- Performance on basic conserved functions is
better than that for all the functions.
- Specialized conserved functions are hard to
predict.
- Performance on plant-specific functions is very
poor.
- This is also a consequence of the fact that
plant-specific functions generally have a small
number of annotations.
35Conclusions
- Avenues for improvement in functional interaction
networks
- Build functional interaction networks that are
based on a larger collection of plant datasets.
- Where feasible, rely as little as possible on data
from other species.
- Avenues for future experimental work
- Plant-specific functions and
- Specialized conserved functions.
36Acknowledgements
- Arjun Krishnan
- Brett Tyler
- Andy Pereira