Title: ComPath Comparative Metabolic Pathway Analysis Tool
1 ComPath: Comparative Metabolic Pathway Analysis Tool
- Kwangmin Choi and Sun Kim
- School of Informatics
- Indiana University
2 Contents
- Introduction
- System Components
- Current Implementation
- Experiment Result
- Future Plan
3 INTRODUCTION / SYSTEM COMPONENTS
4 Introduction
- ComPath is a web-based sequence analysis system built upon
  - KEGG (Kyoto Encyclopedia of Genes and Genomes)
  - PLATCOM (A Platform for Computational Comparative Genomics)
5 KEGG: Kyoto Encyclopedia of Genes and Genomes
- Four databases
  - PATHWAY: 32,657 pathways generated from 262 reference pathways
  - GENES: 1,213,035 genes in 32 eukaryotes, 260 bacteria, and 24 archaea
  - LIGAND: 13,387 compounds, 2,543 drugs, 11,161 glycans, 6,446 reactions
  - BRITE: 7,817 KO (KEGG Orthology) groups
- KEGG adopts the EC enzyme classification system
6 EC system 0/2
- An old system, but still universally accepted by biochemists
- The EC system was developed long before protein sequence or structure information was available, so it focuses on the reaction, not on sequence homology or structure
- Many biochemists and structural biologists try to harmonize newly available chemical, sequence, and structural data with the traditional understanding of enzyme function.
7 Problems in EC system 1/2
- Inconsistency in the EC hierarchy
  - For each of the six top-level EC classes, subclasses and sub-subclasses may have different meanings.
  - e.g. EC 1 is divided by substrate type, but EC 5 by general isomerase type
- Problem with multi-functional enzymes and multiple subunits involved in a function
  - EC presumes only a 1:1:1 relationship between gene, protein, and reaction.
- Different sequence/structure, but similar EC
  - Two enzymes with low sequence identity sometimes belong to the same or a very similar EC.
  - e.g. o-succinylbenzoate synthases from several bacteria have below 40% sequence identity
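To make the hierarchy discussion concrete, here is a minimal Python sketch that splits EC numbers into their four fields and counts how many leading levels two EC numbers share. The function names `parse_ec` and `shared_ec_depth` are hypothetical and are not part of ComPath or KEGG; the example EC numbers are illustrative only.

```python
from typing import Tuple


def parse_ec(ec: str) -> Tuple[str, ...]:
    """Split an EC number such as '1.1.1.1' (or 'EC 1.1.1.1') into four fields."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC fields, got {ec!r}")
    return tuple(parts)


def shared_ec_depth(a: str, b: str) -> int:
    """Count the leading EC levels (0 to 4) on which two EC numbers agree.

    A '-' placeholder (unassigned field) is treated as not matching anything.
    """
    depth = 0
    for x, y in zip(parse_ec(a), parse_ec(b)):
        if x != y or x == "-":
            break
        depth += 1
    return depth


# Illustrative comparison: the two numbers differ only in the fourth digit.
print(shared_ec_depth("1.1.1.1", "1.1.1.2"))
```

A depth of 4 means the two enzymes carry the same EC number; a depth of 3 means they differ only in the fourth digit.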
8 Problems in EC system 2/2
- Similar sequence/structure, but different EC
  - Even variation in the fourth digit of the EC number is rare above a sequence identity threshold of 40%.
  - However, exceptions to this rule are prevalent.
  - e.g. melamine deaminase and atrazine chlorohydrolase are 98% identical, but belong to different EC numbers.
- No information on the sequence/structure-mechanism relationship
  - The EC system considers only the overall transformation
  - Similarity among sequences is strongly correlated with similarity at the level of a common (structural domain-related) partial reaction, rather than of the overall transformation
  - How to combine enzyme structure data with partial reaction data?
- Research Goal
  - We provide a computational environment for enzyme analysis via genome comparison
  - It will be built on the PLATCOM system
9 Our Research Goal
- We provide a computational environment for enzyme analysis via multiple genome comparison
- It will be built on the PLATCOM system
10 PLATCOM: A Platform for Comparative Genomics
- Provides a platform for comparative genomics ON THE WEB
- Comparative analysis system in which users freely select any set of genomes
- Scalable system that interactively combines high-performance sequence analysis tools
11 CURRENT IMPLEMENTATION
12 ComPath
- ComPath = KEGG + PLATCOM
  - Not just for retrieving information from a database, but focused on analyzing enzymes using the enzyme-genome table
- Easy to use
  - Optional: upload a user sequence and/or saved enzyme-genome table data
  - Select a metabolic pathway
  - Select any combination of genomes in KEGG
  - Create an enzyme-genome table
  - Then use the table for various enzyme sequence analysis tasks
13 Screenshot: Pathway Selection
- 11 categories
- 123 pathways
- Users can upload a previously saved enzyme-genome table to continue analysis
14 Screenshot: Genome Selection
- 250 genomes from the KEGG database
- Users can select genomes in taxonomic or alphabetical order
15 Enzyme-Genome Table
- An enzyme-genome table allows testing whether a certain enzyme in a given pathway is present or missing, using sequence analysis techniques.
- Information in this table can easily be saved, uploaded, and transferred.
- Users can also upload their own sequence set, e.g., an entire set of predicted proteins in a newly sequenced genome, and annotate the sequences in terms of KEGG pathways.
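The idea of an enzyme-genome table can be sketched in Python as a presence/absence matrix keyed by enzyme and genome. This is only an illustration under stated assumptions: the function names, the genome labels `genomeA` and `genomeB`, and the input annotations are hypothetical, and ComPath's actual table format is not described in these slides.

```python
from typing import Dict, List, Set

# enzyme (EC number) -> genome label -> present?
EnzymeGenomeTable = Dict[str, Dict[str, bool]]


def build_table(pathway_enzymes: List[str],
                annotations: Dict[str, Set[str]]) -> EnzymeGenomeTable:
    """Mark, for every pathway enzyme, whether each genome is annotated with it."""
    return {
        ec: {genome: ec in ecs for genome, ecs in annotations.items()}
        for ec in pathway_enzymes
    }


def missing_enzymes(table: EnzymeGenomeTable, genome: str) -> List[str]:
    """List pathway enzymes that have no entry for the given genome."""
    return [ec for ec, row in table.items() if not row.get(genome, False)]


# Hypothetical input: three glycolysis enzymes and two made-up genomes.
pathway = ["2.7.1.1", "2.7.1.11", "2.7.1.40"]
annotations = {
    "genomeA": {"2.7.1.1", "2.7.1.11", "2.7.1.40"},
    "genomeB": {"2.7.1.11"},
}
table = build_table(pathway, annotations)
print(missing_enzymes(table, "genomeB"))
```

The enzymes reported as missing are the natural candidates for a follow-up sequence search in that genome.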
16 Screenshot: KEGG's Ortholog Table (STATIC!)
17 Screenshot: ComPath Enzyme-Genome Table (INTERACTIVE!)
18 Screenshot: Upload Query Genome and Table Editing Functions
19 Sequence Analyses
- Missing enzyme search
  - Pairwise (FASTA) and multiple sequence alignment (CLUSTALW)
  - Domain search using SCOPEC/SUPERFAMILY and PDB domains
  - Domain-based analysis using hidden Markov models (HMM)
  - Contextual sequence analysis (currently not available)
- Sequence analysis for further investigation
  - Phylogenetic analysis of enzymes in selected genomes
  - Gibbs motif sampler
  - BAG clustering
  - Contextual sequence analysis (currently not available)
20 Screenshot: Sequence Analysis Functions
21 TEST
22 Experiments: Genomes, Queries, Pathways
- Selected genomes
  - B. subtilis, B. halodurans, E. coli
  - H. influenzae, H. pylori, M. genitalium, Y. pestis KIM
- Query genomes
  - M. tuberculosis
  - A. aeolicus
  - B. anthracis
- Metabolic pathways
  - 00010 (glycolysis/gluconeogenesis)
  - 00020 (TCA cycle)
23 Experiments: Comparison of Sequence Analysis Methods
- Four methods (abbr.)
  - HMMer: HMM search using the whole sequence
  - CSR: HMM search using common shared regions generated by the BAG program
  - SCOPEC: domain search using the SCOP/SUPERFAMILY and PDB databases
  - FASTA: simple FASTA search
- E-value cutoff
  - 1e-10, 1e-20, 1e-30
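As a rough illustration of how the three cutoffs change which enzymes count as detected, the Python sketch below filters search hits by E-value. It assumes a hit is accepted when its E-value is at or below the cutoff; the hit list and function names are hypothetical and do not come from the ComPath experiments.

```python
from typing import Dict, Iterable, Set, Tuple

# E-value cutoffs listed on the slide.
CUTOFFS = (1e-10, 1e-20, 1e-30)


def detected_enzymes(hits: Iterable[Tuple[str, float]], cutoff: float) -> Set[str]:
    """Return enzymes with at least one hit whose E-value is at or below the cutoff."""
    return {enzyme for enzyme, evalue in hits if evalue <= cutoff}


def detected_by_cutoff(hits: Iterable[Tuple[str, float]],
                       cutoffs: Tuple[float, ...] = CUTOFFS) -> Dict[float, Set[str]]:
    """Apply every cutoff to the same hit list."""
    hit_list = list(hits)  # allow the input to be a one-shot iterator
    return {cutoff: detected_enzymes(hit_list, cutoff) for cutoff in cutoffs}


# Hypothetical hits: (enzyme label, E-value of its best match in the query genome).
example_hits = [("2.7.1.1", 1e-35), ("2.7.1.11", 1e-15), ("2.7.1.40", 1e-5)]
result = detected_by_cutoff(example_hits)
for cutoff in CUTOFFS:
    print(cutoff, sorted(result[cutoff]))
```

A stricter (smaller) cutoff keeps fewer hits, which is the usual trade-off between sensitivity and specificity.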
24 Experiments: Overall Design
25 Screenshot: ComPath Enzyme-Genome Table (INTERACTIVE!)
26 Experiment Results (e.g.)

Query Genome     Pathway     Method  Sensitivity  Specificity  E-value
M. tuberculosis  Path 00010  HMMer   0.596491228  0.454545455  1.00E-30
                             CSR     0.666666667  0.454545455  1.00E-30
                             SCOPEC  0.614035088  0.348484848  1.00E-30
                             FASTA   0.649122807  0.378787879  1.00E-30
                             HMMer   0.623188406  0.524590164  1.00E-10
                             CSR     0.739130435  0.360655738  1.00E-10
                             SCOPEC  0.652173913  0.418032787  1.00E-10
                             FASTA   0.811594203  0.204918033  1.00E-10

Query Genome     Pathway     Method  Sensitivity  Specificity  E-value
M. tuberculosis  Path 00020  HMMer   0.535714286  0.769230769  1.00E-30
                             CSR     0.642857143  0.846153846  1.00E-30
                             SCOPEC  0.535714286  0.769230769  1.00E-30
                             FASTA   0.678571429  0.615384615  1.00E-30
                             HMMer   0.516129032  0.777777778  1.00E-10
                             CSR     0.709677419  0.666666667  1.00E-10
                             SCOPEC  0.548387097  0.777777778  1.00E-10
                             FASTA   0.741935484  0.333333333  1.00E-10
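The slides do not state how sensitivity and specificity were computed. The Python sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), applied to sets of predicted and true enzyme assignments. The function names and the toy data are hypothetical.

```python
from typing import Set, Tuple


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true enzymes that were detected: TP / (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no positive cases")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives that were correctly rejected: TN / (TN + FP)."""
    if tn + fp == 0:
        raise ValueError("no negative cases")
    return tn / (tn + fp)


def evaluate(predicted: Set[str], actual: Set[str],
             universe: Set[str]) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) over a fixed universe of candidates."""
    tp = len(predicted & actual)
    fn = len(actual - predicted)
    fp = len(predicted - actual)
    tn = len(universe - predicted - actual)
    return sensitivity(tp, fn), specificity(tn, fp)


# Toy data: five candidates, three truly present, three predicted.
universe = {"a", "b", "c", "d", "e"}
actual = {"a", "b", "c"}
predicted = {"a", "b", "d"}
sens, spec = evaluate(predicted, actual, universe)
print(round(sens, 3), round(spec, 3))
```

Under these definitions, a looser cutoff tends to raise sensitivity and lower specificity, which matches the trend of the FASTA rows in the tables above.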
27 A. aeolicus
28 B. anthracis
29 M. tuberculosis
30 FUTURE PLAN
31 Future Plan: More Resources
- ComPath is being extended to incorporate more resources, including
  - KEGG LIGAND: a composite database consisting of compound, glycan, reaction, etc.
  - ProRule: a new database containing functional and structural information on PROSITE profiles
  - SFLD: Structure-Function Linkage Database
- We are also developing databases and algorithms for enzyme analysis, e.g. classifiers using a database of enzyme-specific HMMs.
- ComPath is in an early stage of system development, and we solicit feedback and suggestions from the biology and bioinformatics communities.
32 Future Plan: More Algorithms and Tools
- More integrative understanding of biochemical network evolution
- Algorithms to handle the isozyme problem
- Algorithms to computationally reconstruct alternative pathways
- Algorithms to combine sequence, structure, chemical reaction, and contextual information for better enzyme annotation
- Etc.