Title: PathoLogic Pathway Predictor
1PathoLogic Pathway Predictor
2Inference of Metabolic Pathways
- Inputs: Annotated Genomic Sequence, Multi-organism Pathway Database (MetaCyc)
- PathoLogic Software: integrates genome and pathway data to identify putative metabolic networks
- Output: Pathway/Genome Database
- Pathways
- Reactions
- Compounds
- Gene Products
- Genes
- Genomic Map
3PathoLogic Functionality
- Initialize schema for new PGDB
- Transform existing genome to PGDB form
- Infer metabolic pathways and store in PGDB
- Infer operons and store in PGDB
- Assemble Overview diagram
- Assist user with manual tasks
- Assign enzymes to reactions they catalyze
- Identify false-positive pathway predictions
- Build protein complexes from monomers
- Infer transport reactions
4PathoLogic Input/Output
- Inputs
- File listing genetic elements
- http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
- Files containing DNA sequence for each genetic element
- Files containing annotation for each genetic element
- MetaCyc database
- Output
- Pathway/genome database for the subject organism
- Reports that summarize
- Evidence contained in the input genome for the presence of reference pathways
- Reactions missing from inferred pathways
5PathoLogic Analysis Phases
- Trial parsing of input data files (few days)
- Initialize schema of new PGDB (3 min)
- Create DB objects for replicons, genes, proteins (5 min)
- Assign enzymes to reactions they catalyze (10 min / 1 week)
- [Diagram: the enzymes ferrochelatase, glutamate 1-semialdehyde 2,1-aminomutase, and porphobilinogen deaminase assigned to reactions in a pathway; node labels E1, E2, B, D, E, F]
6PathoLogic Analysis Phases
- From assigned reactions, infer what pathways are present (5 min / few days)
- Define metabolic overview diagram (30 min)
- Define protein complexes (few days)
7genetic-elements.dat
- ID TEST-CHROM-1
- NAME Chromosome 1
- TYPE CHRSM
- CIRCULAR? N
- ANNOT-FILE chrom1.pf
- SEQ-FILE chrom1.fsa
- //
- ID TEST-CHROM-2
- NAME Chromosome 2
- CIRCULAR? N
- ANNOT-FILE /mydata/chrom2.gbk
- SEQ-FILE /mydata/chrom2.fna
- //
8File Naming Conventions
- One pair of sequence and annotation files for each genetic element
- Sequence files: FASTA format
- suffix .fsa or .fna
- Annotation file
- GenBank format: suffix .gbk
- PathoLogic format: suffix .pf
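The suffix conventions above can be checked mechanically before a trial parse. The following is a minimal sketch, not part of Pathway Tools; the function name `classify_input_file` and the returned labels are assumptions made for illustration.

```python
from pathlib import Path

# Suffix-to-format mapping taken from the naming conventions on this slide.
SEQUENCE_SUFFIXES = {".fsa", ".fna"}
ANNOTATION_FORMATS = {".gbk": "GenBank", ".pf": "PathoLogic"}


def classify_input_file(filename):
    """Return (role, format) for a file name, or raise ValueError.

    Hypothetical helper: the role/format labels are illustrative only.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in SEQUENCE_SUFFIXES:
        return ("sequence", "FASTA")
    if suffix in ANNOTATION_FORMATS:
        return ("annotation", ANNOTATION_FORMATS[suffix])
    raise ValueError(f"unrecognized suffix {suffix!r} in {filename!r}")
```

Running this over the ANNOT-FILE and SEQ-FILE entries of genetic-elements.dat would flag a misnamed file before PathoLogic sees it.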
9Typical Problems Using GenBank Files With PathoLogic
- Wrong qualifier names used: read the PathoLogic documentation!
- Extraneous information in a given qualifier
- Check results of trial parse carefully
10GenBank File Format
- Accepted feature types
- CDS, tRNA, rRNA, misc_RNA
- Accepted qualifiers
- /locus_tag: unique ID (recommended)
- /gene: gene name (required)
- /product (required)
- /EC_number (recommended)
- /product_comment (optional)
- /gene_comment (optional)
- /alt_name: synonyms (optional)
- /pseudo: gene is a pseudogene (optional)
- For multifunctional proteins, put each function in a separate /product line
11PathoLogic File Format
- Each record starts with a line containing an ID attribute
- Tab delimited
- Each record ends with a line containing //
- One attribute-value pair is allowed per line
- Use multiple FUNCTION lines for multifunctional proteins
- Lines starting with ; are comment lines
- Valid attributes are
- ID, NAME, SYNONYM
- STARTBASE, ENDBASE, GENE-COMMENT
- FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
- DBLINK
- INTRON
12PathoLogic File Format
- ID TP0734
- NAME deoD
- STARTBASE 799084
- ENDBASE 799785
- FUNCTION purine nucleoside phosphorylase
- DBLINK PIDg3323039
- PRODUCT-TYPE P
- GENE-COMMENT similar to GP1638807 percent identity 57.51 identified by sequence similarity putative
- //
- ID TP0735
- NAME gltA
- STARTBASE 799867
- ENDBASE 801423
- FUNCTION glutamate synthase
- DBLINK PIDg3323040
- PRODUCT-TYPE P
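Records like the two above are simple enough to read with a few lines of code. The sketch below is not the PathoLogic parser. It assumes that the real file is tab-delimited (as the format slide states), that `//` closes a record, and that a comment prefix of `;` applies. The comment prefix is an assumption, so it is exposed as a parameter. The function name `parse_pf` is hypothetical.

```python
def parse_pf(text, comment_prefix=";"):
    """Parse PathoLogic-format text into a list of records.

    Each record is a dict mapping an attribute name to a list of values,
    so repeated attributes such as FUNCTION are preserved.
    comment_prefix is an assumption, not confirmed by the slides.
    """
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_prefix):
            continue  # skip blank lines and comment lines
        if line.strip() == "//":
            if current:
                records.append(current)
            current = {}
            continue
        # Split on the first tab only: values may contain spaces.
        attribute, _, value = line.partition("\t")
        current.setdefault(attribute.strip(), []).append(value.strip())
    if current:  # tolerate a final record without a closing '//'
        records.append(current)
    return records
```

Keeping every value in a list means a multifunctional protein with several FUNCTION lines needs no special handling.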
13Before you start: What to do when an error occurs
- Most Navigator errors are automatically trapped; debugging information is saved to the error.tmp file.
- All other errors (including most PathoLogic errors) will cause the software to drop into the Lisp debugger
- Unix: error message will show up in the original terminal window from which you started Pathway Tools.
- Windows: error message will show up in the Lisp console. The Lisp console usually starts out iconified; its icon is a blue bust of Franz Liszt
- 2 goals when an error occurs
- Try to continue working
- Obtain enough information for a bug report to send to the Pathway Tools support team.
14The Lisp Debugger
- Sample error (details and number of restart actions differ for each case)
- Error: Received signal number 2 (Keyboard interrupt)
- Restart actions (select using :continue)
- 0: continue computation
- 1: Return to command level
- 2: Pathway Tools version 10.0 top level
- 3: Exit Pathway Tools version 10.0
- [1c] EC(2):
- To generate debugging information (stack backtrace)
- :zoom :count :all
- To continue from error, find a restart that takes you to the top level; in this case, number 2
- :cont 2
- To exit Pathway Tools
- :exit
15How to report an error
- Determine if problem is reproducible, and how to reproduce it (make sure you have all the latest patches installed)
- Send email to ptools-support_at_ai.sri.com containing
- Pathway Tools version number and platform
- Description of exactly what you were doing (which command you invoked, what you typed, etc.) or instructions for how to reproduce the problem
- error.tmp file, if one was generated
- If software breaks into the Lisp debugger, the complete error message and stack backtrace (obtained using the command :zoom :count :all, as described on previous slide)
16Using the PPP GUI to Create a Pathway/Genome Database
- Input Project Information
- Organism -> Create New
17Input Project Information
18PathoLogic Command Menus
- Organism
- Select
- Create New
- Save KB
- Revert KB
- Reinitialize KB
- Specify Reference PGDB(s)
- Exit
- Build
- Trial Parse
- Automated Build
- Refine
- Assign Probable Enzymes
- Assign Modified Proteins
- Create Protein Complexes
- Re-run Name Matcher
- Rescore Pathways
- Predict transcription units
- Transport Identification Parser
- Update Overview
- Pathway Hole Filler
19Next Steps
- Trial Parse
- Build -> Trial Parse
- Fix any errors in input files
- Build pathway/genome database
- Build -> Automated Build
20PathoLogic Parser Output
21Assign Enzymes to Reactions
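- Example gene product: UDP-glucose-4-epimerase (EC 5.1.3.2; reaction UDP-D-glucose <=> UDP-galactose)
- Does the gene product name match an enzyme in MetaCyc?
- yes: Assign
- no: Is it a probable enzyme (name contains -ase)?
- no: Not a metabolic enzyme
- yes: Manually search
- Reaction found: Assign
- Reaction not found: Can't Assign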
22Enzyme Name Matcher
- Matches on full enzyme name
- Match is case-insensitive and removes the punctuation characters -_()',
- Also matches after removal of prefixes and suffixes such as
- Putative, Hypothetical, etc.
- alpha|beta|catalytic|inducible chain|subunit|component
- Parenthetical gene name
23Enzyme Name Matcher
- For names that do not match, software identifies probable metabolic enzymes as those
- Containing "ase"
- Not containing keywords such as
- sensor kinase
- topoisomerase
- protein kinase
- peptidase
- Etc.
- Research unknown enzymes
- MetaCyc, Swiss-Prot, PubMed
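The two name-matcher slides can be illustrated with a short sketch. This is not the PathoLogic name matcher: the prefix, suffix, and keyword lists below contain only the examples shown on the slides (the real lists are longer, per the "etc." entries), punctuation is replaced by spaces rather than deleted so that hyphenated and spaced variants compare equal (a design choice of this sketch), and the function names are hypothetical.

```python
import re

PUNCTUATION = "-_()',"                    # characters listed on the slide
PREFIXES = ("putative", "hypothetical")   # slide examples only
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase",
                     "protein kinase", "peptidase")  # slide examples only
SUFFIX_RE = re.compile(
    r"\s+(alpha|beta|catalytic|inducible)?\s*(chain|subunit|component)$")


def normalize_name(name):
    """Reduce an enzyme name to a comparable form (illustrative only)."""
    name = name.lower()
    # Drop a trailing parenthetical gene name, e.g. "(galE)".
    name = re.sub(r"\s*\([^)]*\)\s*$", "", name)
    # Replace the listed punctuation with spaces, then collapse whitespace.
    name = name.translate(str.maketrans(PUNCTUATION, " " * len(PUNCTUATION)))
    words = name.split()
    while words and words[0] in PREFIXES:
        words.pop(0)
    return SUFFIX_RE.sub("", " ".join(words))


def is_probable_enzyme(name):
    """True if an unmatched name contains 'ase' and no excluded keyword."""
    lowered = name.lower()
    return "ase" in lowered and not any(k in lowered for k in EXCLUDED_KEYWORDS)
```

Two names are treated as a match when their normalized forms are equal; names that fail to match but pass `is_probable_enzyme` are the ones left for manual research.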
24Enzyme Name to Reaction Mapping
See also file PTools Tutorial/PathoLogic
Reports/name-matching-report.txt
25Manual Polishing
- Refine -> Assign Probable Enzymes (do this first)
- Refine -> Rescore Pathways (redo after assigning enzymes)
- Refine -> Create Protein Complexes (can be done at any time)
- Refine -> Assign Modified Proteins (can be done at any time)
- Refine -> Transport Identification Parser (can be done at any time)
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Update Overview (do this last, and repeat after any material changes to PGDB)
26Assign Probable Enzymes
27How to find reactions for probable enzymes
- First, verify that enzyme name describes a specific, metabolic function
- Search for fragment of name in MetaCyc; you may be able to find a match that PathoLogic missed
- Look up protein in Swiss-Prot or other DBs
- Search for gene name in PGDB for related organism (bear in mind that gene names are not reliable indicators of function, so check carefully)
- Search for function name in PubMed
- Other
28Manual Polishing
- Refine -> Assign Probable Enzymes
- Refine -> Rescore Pathways
- Refine -> Create Protein Complexes
- Refine -> Assign Modified Proteins
- Refine -> Transport Identification Parser
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Run Consistency Checker
- Refine -> Update Overview
29Automated Pathway Inference
- All pathways in MetaCyc for which there is at least one enzyme identified in the target organism are considered for possible inclusion.
- Algorithm errs on side of inclusivity: it is easier to manually delete a pathway from an organism than to find a pathway that should have been predicted but wasn't.
30Considerations taken into account when deciding whether or not a pathway should be inferred
- Is there a unique enzyme, that is, an enzyme not involved in any other pathway?
- Does the organism fall in the expected taxonomic domain of the pathway?
- Is this pathway part of a variant set, and, if so, is there more evidence for some other variant?
- If there is no unique enzyme
- Is there evidence for more than one enzyme?
- If a biosynthetic pathway, is there evidence for final reaction(s)?
- If a degradation pathway, is there evidence for initial reaction(s)?
- If an energy metabolism pathway, is there evidence for more than half the reactions?
31Assigning Evidence Scores to Predicted Pathways
- X|Y|Z denotes the score for pathway P in organism O
- where
- X = total number of reactions in P
- Y = number of reactions in P for which there is evidence in O of a catalyzing enzyme
- Z = number of the Y reactions that are used in other pathways in O
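The three numbers in the evidence score can be computed with plain set operations. The sketch below follows the definitions of X, Y, and Z on this slide; the function name `pathway_score` and the representation of pathways as sets of reaction IDs are assumptions for illustration, not the Pathway Tools data model.

```python
def pathway_score(pathway_reactions, reactions_with_enzymes, other_pathways):
    """Return (X, Y, Z) for one pathway P in organism O.

    pathway_reactions: set of reaction IDs in P
    reactions_with_enzymes: set of reaction IDs with enzyme evidence in O
    other_pathways: list of reaction-ID sets for the other pathways in O
    """
    x = len(pathway_reactions)
    evidenced = pathway_reactions & reactions_with_enzymes
    y = len(evidenced)
    # Reactions that appear in at least one other pathway of the organism.
    used_elsewhere = set().union(*other_pathways)
    z = len(evidenced & used_elsewhere)
    return (x, y, z)


# Usage with made-up reaction IDs:
score = pathway_score({"r1", "r2", "r3", "r4"}, {"r1", "r2", "r9"}, [{"r2", "r7"}])
print("|".join(str(n) for n in score))
```

A pathway whose Y is small, or whose Z is close to Y, has little evidence of its own, which is what the manual pruning phases look for.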
32Manual Pruning of Pathways
- Use pathway evidence report
- Coloring scheme aids in assessing pathway evidence
- Phase I: Prune extra variant pathways
- Rescore pathways, re-generate pathway evidence report
- Phase II: Prune pathways unlikely to be present
- No/few unique enzymes
- Most pathway steps present because they are used in another pathway
- Pathway very unlikely to be present in this organism
- Nonspecific enzyme name assigned to a pathway step
33Caveats
- Cannot predict pathways not present in MetaCyc
- Evidence for short pathways is hard to interpret
- Since many reactions occur in multiple pathways, some predicted pathways are false positives
34Output from PPP
- Pathway/genome database
- Summary pages
- Pathway evidence page
- Click Summary of Organisms, then click organism name, then click Pathway Evidence, then click Save Pathway Report
- Missing enzymes report
- Directory tree containing sequence files, reports, etc.
35Resulting Directory Structure
- ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
- input
- organism.dat
- organism-init.dat
- genetic-elements.dat
- annotation files
- sequence files
- reports
- name-matching-report.txt
- trial-parse-report.txt
- kb
- ORGIDbase.ocelot
- data
- overview.graph
- released -> VERSION
36Manual Polishing
- Refine -> Assign Probable Enzymes
- Refine -> Rescore Pathways
- Refine -> Create Protein Complexes
- Refine -> Assign Modified Proteins
- Refine -> Transport Identification Parser
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Run Consistency Checker
- Refine -> Update Overview
37Creating Protein Complexes
38Complex Subunits Stoichiometries
39Manual Polishing
- Refine -> Assign Probable Enzymes
- Refine -> Re-run Name Matcher
- Refine -> Create Protein Complexes
- Refine -> Assign Modified Proteins
- Refine -> Transport Identification Parser
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Run Consistency Checker
- Refine -> Update Overview
40Proteins as Reaction Substrates
41Manual polishing
- Refine -> Assign Probable Enzymes
- Refine -> Re-run Name Matcher
- Refine -> Create Protein Complexes
- Refine -> Assign Modified Proteins
- Refine -> Transport Identification Parser
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Run Consistency Checker
- Refine -> Update Overview
42What are pathway holes?
At least one reaction in the pathway has an
enzyme assigned. The reactions in the pathway
without enzymes assigned are holes.
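[Figure: example pathway with holes. L-aspartate -> iminoaspartate (EC 1.4.3.-) -> quinolinate (no EC) -> nicotinate nucleotide (n.n. pyrophosphorylase: nadC, RV1596) -> deamido-NAD (EC 2.7.7.18) -> NAD (EC 6.3.1.5, 6.3.5.1). Reactions without an assigned enzyme are labeled as holes.]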
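The definition of a pathway hole translates directly into code. This is a minimal sketch under the definition given on this slide; the function name `find_pathway_holes`, the reaction IDs, and the dictionary layout are hypothetical.

```python
def find_pathway_holes(pathway_reactions, enzymes_by_reaction):
    """Return the reactions of a pathway that have no enzyme assigned.

    pathway_reactions: ordered list of reaction IDs in the pathway
    enzymes_by_reaction: dict mapping reaction ID -> list of assigned enzymes
    Holes are only defined when at least one reaction has an enzyme.
    """
    assigned = [r for r in pathway_reactions if enzymes_by_reaction.get(r)]
    if not assigned:
        return []  # no enzyme anywhere in the pathway: not a hole situation
    return [r for r in pathway_reactions if not enzymes_by_reaction.get(r)]


# Usage with made-up IDs: only the second of three reactions has an enzyme.
holes = find_pathway_holes(["rxn1", "rxn2", "rxn3"], {"rxn2": ["nadC"]})
```

The early return mirrors the first sentence of the definition: a pathway with no assigned enzyme at all has no holes in this sense.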
43Algorithm for identifying candidates and consolidating data
- Step I: Collect query isozymes of function A (enzyme A from organisms 1 through 8)
- Step II: BLAST against target genome
- Steps III & IV: Consolidate hits and evaluate evidence using a Bayes classifier
- 3 queries have low-scoring hits to sequence X (gene X): resulting P(has-function) is low
- 8 queries have high-scoring hits to sequence Y (gene Y): resulting P(has-function) is high
- 5 queries have low-scoring hits to sequence Z (gene Z): resulting P(has-function) is low
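The consolidation step groups BLAST hits by the target-genome gene they land on, so that each candidate gene can be evaluated once. The sketch below shows only that grouping and three simple per-candidate summaries; it is not the Pathway Hole Filler code, it does not run BLAST, and the tuple layout and function name are assumptions made for illustration.

```python
from collections import defaultdict


def consolidate_hits(hits):
    """Group BLAST hits by target gene and summarize each candidate.

    hits: iterable of (query_id, target_gene, e_value, rank) tuples, where
    rank is the position of the target in that query's hit list.
    Returns {target_gene: {"num_queries", "best_e_value", "avg_rank"}}.
    """
    by_gene = defaultdict(list)
    for query_id, target_gene, e_value, rank in hits:
        by_gene[target_gene].append((query_id, e_value, rank))

    summary = {}
    for gene, gene_hits in by_gene.items():
        summary[gene] = {
            "num_queries": len({q for q, _, _ in gene_hits}),
            "best_e_value": min(e for _, e, _ in gene_hits),
            "avg_rank": sum(r for _, _, r in gene_hits) / len(gene_hits),
        }
    return summary
```

Summaries like these are the kind of per-candidate evidence a classifier can then turn into P(has-function).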
44Reference for the Pathway Hole Filler
- Green, ML and Karp, PD. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics 2004, 5:76.
45Features used to calculate the probability that a protein has the desired function
- Best E-value
- Avg. rank
- Avg. % aligned
- Number of query sequences aligned
- Potential operon? Candidate is in a contiguous set of genes transcribed in one direction with another gene in the pathway
- Adjacent reactions? Candidate is adjacent to the gene assigned to an adjacent reaction in the pathway
46Navigating to the Pathway Hole Filler
47Steps that must be completed before running the Pathway Hole Filler
- Install BLAST executable (should already be installed on training room machines)
- Prepare BLAST protein db
- Need FASTA format genome nucleotide sequence (see instructor if you have something different, like ESTs, or have no sequence data file)
- In general, the more pathways in your PGDB, the more the pathway hole filler will have to search for
48Steps for operating the pathway hole filler
- Prepare training data for Bayes classifier
- Collect feature data for known rxns in PGDB
- Calculate probability distributions for classifier
- Identify and evaluate candidates
- Collect feature data for each candidate
- Use classifier to determine P(has-function)
- Choose holes to fill in KB
- Either select all above a cutoff or manually review candidates
49Step 1 Prepare Training Data
- Calculate training data from your organism or use existing training data
- Once Step 1 has been completed, the training data are saved and can be reused (even in another Pathway Tools session).
- If using existing data from E. coli, the training data are based on data from the literature.
50Step 2: Identify & Evaluate Candidates
51Step 2: Identify & Evaluate Candidates
A list of all pathway holes in the PGDB
A list of all pathways in the PGDB with holes
52Modes of operation
- Fully automatic
- No interaction required from user
- All default values used
- Prepare training data: all known rxns in KB
- Identify and evaluate candidates: all pathways with pathway holes
- Choose holes to fill in KB: all holes with P > 0.9 filled
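The automatic "fill everything above a cutoff" rule can be sketched as a filter over scored candidates. The 0.9 default comes from this slide; keeping only the highest-probability candidate per hole is an assumption of this sketch, and the function name and tuple layout are hypothetical.

```python
def select_fills(candidates, cutoff=0.9):
    """Pick one candidate gene per hole among those with P > cutoff.

    candidates: iterable of (reaction_id, gene_id, probability) tuples
    Returns {reaction_id: (gene_id, probability)}.
    Assumption: when several candidates pass the cutoff for the same hole,
    the one with the highest probability is kept.
    """
    best = {}
    for reaction_id, gene_id, probability in candidates:
        if probability <= cutoff:
            continue
        if reaction_id not in best or probability > best[reaction_id][1]:
            best[reaction_id] = (gene_id, probability)
    return best
```

Candidates at or below the cutoff are simply left for manual review, which matches the alternative offered in Step 3.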
53Modes of operation
- Wizard
- Wizard prompts user for training data source and for which holes to make predictions. Wizard runs Steps 1 & 2, then prompts user to complete Step 3.
- Power-user mode
- User must proceed through each step in order. Program still prompts user for required parameters, but each step must be completed before advancing to next step.
54Step 3 Choose Holes to Fill in KB
55Step 3 Choose Holes to Fill in KB
59Output from Pathway Hole Filler: from Prepare Training Data step
- ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/data/
- (e.g., ROOT/aic-export/ecocyc/caulocyc/1.0/data/)
- rxn-list: data retrieved from ORGID for calculating training data
- priors/: directory containing training data that is loaded when using existing data from ORGID
- These files contain the training data computed in Step 1. If either file is available, the user may use existing training data in Step 1.
- Each file is overwritten each time you run this step.
60Output from Pathway Hole Filler: from Identify and Evaluate Candidates step
- ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/reports/
- (e.g., ROOT/aic-export/ecocyc/caulocyc/1.0/reports/)
- ORGIDholesX-Y.html (e.g., CAULOholes0-10.html)
- ORGID_filled-holes.html: the list of holes that user selected to fill in the KB in Step 3.
- blasterrors.log: log of each rxn describing whether or not any candidates were found
- hole-data: file containing data (in a Lisp structure) found for each rxn, used to generate list in Choose holes to fill in KB dialogue. If this file is available, Step 3 can be initiated without repeating Step 2.
- Each file is overwritten each time you run this step.
61Manual polishing
- Refine -> Assign Probable Enzymes
- Refine -> Rescore Pathways
- Refine -> Create Protein Complexes
- Refine -> Assign Modified Proteins
- Refine -> Transport Identification Parser
- Refine -> Pathway Hole Filler
- Refine -> Predict Transcription Units
- Refine -> Run Consistency Checker
- Refine -> Update Overview
62Nomenclature
- WO pair: pair of genes within an operon
- TUB pair: pair of genes at a transcription unit boundary (delineates operons)
63Operation of the operon predictor
- For each contiguous gene pair, predict whether the two genes are within the same operon or at a transcription unit boundary
- Use pairwise predictions to identify potential operons
- AB: TUB pair
- BC: WO pair (operon = BCD)
- CD: WO pair
- DE: TUB pair
- [Diagram: genes A, B, C, D, E in order along the chromosome]
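Turning pairwise WO/TUB predictions into operons is a single left-to-right pass over the gene order. The sketch below reproduces the A through E example from this slide; the function name `group_operons` and the list-based input are assumptions for illustration, and strand and directon handling are left out.

```python
def group_operons(genes, pair_labels):
    """Group an ordered gene list into transcription units.

    genes: genes in chromosomal order, e.g. ["A", "B", "C", "D", "E"]
    pair_labels: one label per adjacent pair, "WO" (within operon) or
                 "TUB" (transcription unit boundary)
    Returns a list of groups; groups with more than one gene are operons.
    """
    if not genes:
        return []
    if len(pair_labels) != len(genes) - 1:
        raise ValueError("need exactly one label per adjacent gene pair")
    groups = [[genes[0]]]
    for gene, label in zip(genes[1:], pair_labels):
        if label == "WO":
            groups[-1].append(gene)   # same operon as the previous gene
        else:
            groups.append([gene])     # boundary: start a new unit
    return groups


units = group_operons(list("ABCDE"), ["TUB", "WO", "WO", "TUB"])
operons = [u for u in units if len(u) > 1]
```

With the labels from the slide, the only multi-gene unit is B, C, D, matching the operon BCD shown there.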
64Operon predictor
- Predicts operon gene pairs based on
- intergenic distance between genes
- genes in the same functional class
- Typically used for operon prediction
- We use method from Salgado et al, PNAS (2000) as a starting point.
- Uses E. coli experimentally verified data as a training set.
- Compute log likelihood of two genes being WO or TUB pair based on intergenic distance.
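A distance-based log likelihood can be sketched as the log of a ratio of binned frequencies: how often training WO pairs fall in the same distance bin as the query pair, versus how often training TUB pairs do. The code below is an illustrative simplification, not the procedure from the cited paper or from Pathway Tools. The bin width, the pseudocount used to avoid log(0), and the function name are all assumptions.

```python
import math


def log_likelihood_ratio(distance, wo_distances, tub_distances,
                         bin_width=10, pseudocount=1.0):
    """Return log[P(bin | WO) / P(bin | TUB)] for the bin holding `distance`.

    distance: intergenic distance (bp) of the gene pair being scored
    wo_distances, tub_distances: intergenic distances of training pairs
    bin_width, pseudocount: assumed values chosen for this sketch
    Positive values favor a WO pair, negative values a TUB pair.
    """
    target_bin = distance // bin_width

    def bin_probability(sample):
        in_bin = sum(1 for d in sample if d // bin_width == target_bin)
        # Crude smoothing so an empty bin does not give log(0).
        return (in_bin + pseudocount) / (len(sample) + pseudocount)

    return math.log(bin_probability(wo_distances) / bin_probability(tub_distances))
```

Thresholding this value at zero gives the pairwise WO or TUB call that the grouping step consumes; the additional PGDB-derived features would adjust that call.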
65Operon predictor
- Additional features easily computed from a PGDB
- (1) both gene products are enzymes in the same metabolic pathway
- (2) both gene products are monomers in the same protein complex
- (3) one gene product transports a substrate for a metabolic pathway in which the other gene product is involved as an enzyme
- (4) a gene upstream or downstream from the gene pair (and within the same directon) is related to either one of the genes in the pair as per features 1, 2 and 3 above.