The EcoCyc and MetaCyc Pathway/Genome Databases

1
The EcoCyc and MetaCyc Pathway/Genome Databases
  • Peter D. Karp, Ph.D.
  • Bioinformatics Research Group
  • SRI International
  • pkarp@ai.sri.com
  • http://www.ai.sri.com/~pkarp/
  • http://EcoCyc.org/

2
Overview
  • Motivations and terminology
  • Pathway/genome databases
  • BioCyc collection
  • EcoCyc, MetaCyc
  • Pathway Tools software
  • Bioinformatics Database Warehouse project

3
(No Transcript)
4
(No Transcript)
5
What to Do When Theories Become Larger than Minds
Can Grasp?
  • Example: E. coli metabolic network
    • 160 pathways involving 744 reactions and 791
      substrates
  • Example: E. coli genetic network
    • Control by 97 transcription factors of 1174 genes
      in 630 transcription units
  • Past solutions
    • Partition theories across multiple minds
    • Encode theories in natural-language text
  • We cannot compute with theories in those forms
    • Evaluate theories for consistency with new data:
      microarrays
    • Refine theories with respect to new data
    • Compare theories describing different organisms

6
Solution: Biological Knowledge Bases
  • Store biological knowledge and theories in
    computers in a declarative form
  • Amenable to computational analysis and generative
    user interfaces
  • Establish ongoing efforts to curate (maintain,
    refine, embellish) these knowledge bases
  • It is accepted practice to store data in
    computers, but not knowledge
  • Such knowledge bases are an integral part of the
    scientific enterprise

7
Pathway Definition
  • Chemical reactions interconvert chemical
    compounds
  • An enzyme is a protein that accelerates chemical
    reactions
  • A pathway is a linked set of reactions
  • Often regulated as a unit
  • A conceptual unit of the cell's biochemical machine

(Diagram: example reaction chains over compounds A, B, C, D and A, C, E)
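
The definition on this slide (a pathway is a linked set of reactions) can be illustrated with a minimal Python sketch. The `Reaction` class, the `is_linked` helper, and the compound and enzyme names are all hypothetical, and treating a pathway as a simple linear chain is a simplifying assumption; this is not how Pathway Tools represents pathways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """One chemical reaction: substrates are converted into products."""
    name: str
    substrates: frozenset
    products: frozenset
    enzyme: str = ""  # protein that accelerates the reaction, if known


def is_linked(pathway):
    """Return True if each reaction consumes a product of the one before it."""
    return all(
        prev.products & nxt.substrates
        for prev, nxt in zip(pathway, pathway[1:])
    )


# Hypothetical three-step chain A -> B -> C -> D.
r1 = Reaction("r1", frozenset({"A"}), frozenset({"B"}), enzyme="enzyme-1")
r2 = Reaction("r2", frozenset({"B"}), frozenset({"C"}), enzyme="enzyme-2")
r3 = Reaction("r3", frozenset({"C"}), frozenset({"D"}), enzyme="enzyme-3")

print(is_linked([r1, r2, r3]))  # consecutive reactions share a compound
print(is_linked([r1, r3]))      # r3 does not consume anything r1 produces
```

Real pathways branch and merge, so a graph structure would be a more faithful model than the list used here.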
8
Terminology
  • Model Organism Database (MOD): DB describing
    genome and other information about an organism
  • Pathway/Genome Database (PGDB): MOD that
    combines information about
    • Pathways, reactions, substrates
    • Enzymes, transporters
    • Genes, replicons
    • Transcription factors, promoters, operons, DNA
      binding sites
  • BioCyc: Collection of 15 PGDBs at BioCyc.org
    • EcoCyc, AgroCyc, YeastCyc

9
BioCyc Collection of Pathway/Genome DBs
  • Computationally Derived Datasets
    • Agrobacterium tumefaciens
    • Caulobacter crescentus
    • Chlamydia trachomatis
    • Bacillus subtilis
    • Helicobacter pylori
    • Haemophilus influenzae
    • Mycobacterium tuberculosis H37Rv
    • Mycobacterium tuberculosis CDC1551
    • Mycoplasma pneumoniae
    • Pseudomonas aeruginosa
    • Saccharomyces cerevisiae
    • Treponema pallidum
    • Vibrio cholerae
  • Literature-based Datasets
    • MetaCyc
    • Escherichia coli (EcoCyc)
  • Legend: yellow = Open Database

http://BioCyc.org/
10
Terminology: Pathway Tools Software
  • PathoLogic
    • Prediction of metabolic network from genome
    • Computational creation of new Pathway/Genome
      Databases
  • Pathway/Genome Editors
    • Distributed curation of PGDBs
    • Distributed object database system, interactive
      editing tools
  • Pathway/Genome Navigator
    • WWW publishing of PGDBs
    • Querying, visualization of pathways, chromosomes,
      operons
    • Analysis operations
      • Pathway visualization of gene-expression data
      • Global comparisons of metabolic networks
  • Bioinformatics 18:S225 2002

11
Pathway Tools Algorithms
  • Query, visualization and editing tools for these
    datatypes
  • Full Metabolic Map
    • Paint gene expression data on metabolic network;
      compare metabolic networks
  • Pathways
    • Pathway prediction
  • Reactions
    • Balance checker
  • Compounds
    • Chemical substructure comparison
  • Enzymes, Transporters, Transcription Factors
  • Genes
    • BLAST search
  • Chromosomes
  • Operons
    • Operon prediction

12
Model Organism Databases
  • DBs that describe the genome and other
    information about an organism
  • Every sequenced organism with an active
    experimental community requires a MOD
  • Integrate genome data with information about the
    biochemical and genetic network of the organism
  • MODs are platforms for global analyses of an
    organism
  • Interpret gene expression data in a pathway
    context
  • Characterize systems properties of metabolic and
    genetic networks
  • Determine consistency of metabolic and transport
    networks
  • In silico prediction of essential genes

13
EcoCyc Project: EcoCyc.org
  • E. coli Encyclopedia
    • Model-Organism Database for E. coli
    • Computational symbolic theory of E. coli
    • Electronic review article for E. coli: over 3500
      literature citations
    • Tracks the evolving annotation of the E. coli
      genome
  • Collaborative development via Internet
    • Karp (SRI) -- Bioinformatics architect
    • John Ingraham -- Advisor
    • (SRI) -- Metabolic pathways
    • Saier (UCSD) and Paulsen (TIGR) -- Transport
    • Collado (UNAM) -- Regulation of gene expression
  • Database content: 18,000 objects

14
EcoCyc E. coli Dataset
Pathway/Genome Navigator
  • Pathways: 165
  • Reactions: 2,760
  • Compounds: 774
  • Enzymes: 914
  • Transporters: 162
  • Promoters: 812
  • TransFac Sites: 956
  • Citations: 3,508
  • Proteins: 4,273
  • Transcription Units: 724
  • Transcription Factors: 110
  • Genes: 4,393
http://EcoCyc.org/
15
EcoCyc Procedures
  • All DB updates by 5 staff curators
  • Information gathered from biomedical literature
  • Corrections solicited from E. coli researchers
  • Review-level database
  • Four releases per year
  • Available through WWW site, as data files, as
    downloadable application
  • Quality assurance of data and software
  • Evaluate database consistency constraints
  • Perform element balancing of reactions
  • Run other checking programs
  • Display every DB object
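
The "element balancing of reactions" check listed above can be sketched as follows. This is an illustrative Python version, not the EcoCyc checker itself: the function names are made up, and it assumes simple formulas such as `C6H12O6` with no parentheses, charges, or isotopes.

```python
import re
from collections import Counter


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C6H12O6'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(side):
    """Sum atom counts over a list of (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total


def imbalance(left, right):
    """Return {element: left - right} for every element that does not balance."""
    l, r = side_counts(left), side_counts(right)
    return {e: l[e] - r[e] for e in set(l) | set(r) if l[e] != r[e]}


# Glucose + 6 O2 -> 6 CO2 + 6 H2O is balanced, so nothing is reported.
balanced = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (6, "H2O")])
# With the water coefficient set to 1, hydrogen and oxygen no longer balance.
broken = imbalance([(1, "C6H12O6"), (6, "O2")], [(6, "CO2"), (1, "H2O")])
print(balanced)
print(broken)
```

An empty result means every element appears the same number of times on both sides; a non-empty result points a curator at the elements to inspect.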

16
MetaCyc: Metabolic Encyclopedia
  • Nonredundant metabolic pathway database
  • Describe a representative sample of every
    experimentally determined metabolic pathway
  • Literature-based DB with extensive references and
    commentary
  • Pathways, reactions, enzymes, substrates
  • 460 pathways, 1267 enzymes, 4294 reactions
  • 172 E. coli pathways, 2735 citations
  • Nucleic Acids Research 30:59-61 2002.
  • Jointly developed by SRI and Carnegie Institution
  • New focus on plant pathways

17
Family of Pathway/Genome Databases
18
Pathway Tools Implementation Details
  • Allegro Common Lisp
  • Sun and PC platforms
  • Ocelot object database
  • 250,000 lines of code
  • Lisp-based WWW server at BioCyc.org
  • Manages 15 PGDBs

19
Pathway Tools Architecture
Pathway Genome Navigator
Object DBMS
20
Ocelot Knowledge Server Architecture
  • Frame data model
  • Classes, instances, inheritance
  • Frames have slots that define their properties,
    attributes, relationships
  • A slot has one or more values
  • Each value can be any Lisp datatype
  • Slotunits define metadata about slots
  • Domain, range, inverse
  • Collection type, number of values, value
    constraints
  • Transaction logging facility
  • Schema evolution
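
The frame data model described above (frames with slots, inheritance between classes and instances, and slot metadata such as inverses) can be sketched in a few lines of Python. Ocelot itself is written in Common Lisp; the `Frame` class, the `link` helper, the `INVERSES` table, and every frame and slot name below are hypothetical and only illustrate the idea.

```python
class Frame:
    """A frame with named slots; each slot holds a list of values."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # frames this frame inherits from
        self.slots = {}               # slot name -> list of values

    def put(self, slot, *values):
        self.slots.setdefault(slot, []).extend(values)

    def get(self, slot):
        """Return local values, else the first values found by walking up parents."""
        if slot in self.slots:
            return self.slots[slot]
        for parent in self.parents:
            values = parent.get(slot)
            if values:
                return values
        return []


# Slot metadata in the spirit of a slotunit: which slot is the inverse of which.
INVERSES = {"catalyzes": "catalyzed-by", "catalyzed-by": "catalyzes"}


def link(source, slot, target):
    """Store a relationship and maintain its inverse slot on the target frame."""
    source.put(slot, target.name)
    inverse = INVERSES.get(slot)
    if inverse:
        target.put(inverse, source.name)


proteins = Frame("Proteins")
proteins.put("polymer-of", "amino acids")
enzymes = Frame("Enzymes", parents=[proteins])
enzyme_x = Frame("enzyme-x", parents=[enzymes])
reaction_y = Frame("reaction-y")
link(enzyme_x, "catalyzes", reaction_y)

print(enzyme_x.get("polymer-of"))      # value inherited from Proteins
print(reaction_y.get("catalyzed-by"))  # inverse slot filled in by link()
```

Keeping the inverse table separate from the frames mirrors the slide's point that slot metadata lives in its own objects rather than in each frame.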

21
Ocelot Storage System Architecture
  • Persistent storage via disk files, Oracle DBMS
    • Concurrent development: Oracle
    • Single-user development: disk files
    • Read-only delivery: bundle data into binary
      program
  • Oracle storage
    • DBMS is submerged within Ocelot, invisible to
      users
    • Relational schema is domain independent, supports
      multiple KBs simultaneously
    • Frames transferred from DBMS to Ocelot
      • On demand
      • By background prefetcher
    • Memory cache
    • Persistent disk cache to speed performance via
      Internet
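
A rough Python sketch of the storage ideas on this slide: a domain-independent relational table shared by several KBs, frames fetched on demand, and a memory cache in front of the DBMS. SQLite stands in for Oracle here, and the table layout, the `FrameStore` class, and the KB and frame names are assumptions made for illustration, not Ocelot's actual schema or API.

```python
import sqlite3


class FrameStore:
    """Loads frames from a relational table on demand and caches them in memory."""

    def __init__(self, connection, kb):
        self.connection = connection
        self.kb = kb
        self.cache = {}    # frame name -> {slot: [values]}
        self.db_reads = 0  # counts database round trips

    def get_frame(self, name):
        if name not in self.cache:
            self.db_reads += 1
            rows = self.connection.execute(
                "SELECT slot, value FROM slot_values WHERE kb = ? AND frame = ?",
                (self.kb, name),
            ).fetchall()
            frame = {}
            for slot, value in rows:
                frame.setdefault(slot, []).append(value)
            self.cache[name] = frame
        return self.cache[name]

    def prefetch(self, names):
        """Warm the cache; a real prefetcher would run in the background."""
        for name in names:
            self.get_frame(name)


# One generic table holds every KB, so the schema never mentions biology.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE slot_values (kb TEXT, frame TEXT, slot TEXT, value TEXT)")
db.executemany(
    "INSERT INTO slot_values VALUES (?, ?, ?, ?)",
    [
        ("kb-one", "frame-a", "common-name", "example A"),
        ("kb-one", "frame-a", "synonym", "A1"),
        ("kb-two", "frame-a", "common-name", "other A"),
    ],
)

store = FrameStore(db, "kb-one")
store.get_frame("frame-a")
store.get_frame("frame-a")  # served from memory, no second query
print(store.get_frame("frame-a"), store.db_reads)
```

Callers only ever see `get_frame`, which is one way to read the slide's remark that the DBMS is invisible to users.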

22
The Common Lisp Programming Environment
  • Gat studied Lisp and Java implementations of 16
    programs by 14 programmers (Intelligence 11:21
    2000)

23
EcoCyc WWW Server
24
Pathway/Genome DBs Created by External Users
  • Plasmodium falciparum, Stanford University
    • plasmocyc.stanford.edu
  • Mycobacterium tuberculosis, Stanford University
    • BioCyc.org
  • Arabidopsis thaliana and Synechocystis, Carnegie
    Institution of Washington
    • Arabidopsis.org:1555
  • Methanococcus jannaschii, EBI
    • Maine.ebi.ac.uk:1555
  • Other PGDBs in progress by 24 other users
  • Software freely available
  • Each PGDB owned by its creator

25
Global Consistency Checking of Biochemical Network
  • Given:
    • A PGDB for an organism
    • A set of initial metabolites
  • Infer:
    • What set of products can be synthesized by the
      small-molecule metabolism of the organism
  • Can known growth medium yield known essential
    compounds?
  • Pacific Symposium on Biocomputing p. 471 2001

26
Algorithm: Forward Propagation
(Diagram labels: Nutrient set, Transport, Metabolite set, Reactants, PGDB reaction pool, Fire reactions, Products)
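
Forward propagation, as named on this slide, can be sketched in Python as a fixed-point loop: start from the nutrient set, fire every reaction whose reactants are all available, add its products to the metabolite set, and repeat until nothing changes. The function name, the toy reaction pool, and the compound names are made up for illustration, and transport is left out for brevity.

```python
def forward_propagate(nutrients, reactions):
    """Fire reactions until no new metabolite appears.

    reactions maps a reaction name to (reactants, products), both sets.
    Returns (metabolites reachable from the nutrients, names of fired reactions).
    """
    metabolites = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (reactants, products) in reactions.items():
            if name not in fired and reactants <= metabolites:
                fired.add(name)
                metabolites |= products
                changed = True
    return metabolites, fired


# Toy reaction pool; r3 can never fire because Z is never available.
pool = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"B", "C"}, {"D"}),
    "r3": ({"Z"}, {"E"}),
}
essential = {"D", "E"}
reachable, fired = forward_propagate({"A", "C"}, pool)

print(sorted(essential & reachable))  # essential compounds produced
print(sorted(essential - reachable))  # essential compounds not produced
print(sorted(fired))                  # reactions that fired
```

Essential compounds left unproduced point either at gaps in the database or at nutrients missing from the starting set.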
27
Results
  • Phase I: Forward propagation
    • 21 initial compounds yielded only half of 38
      essential compounds for E. coli
  • Phase II: Manually identify
    • Bugs in EcoCyc (e.g., two objects for tryptophan)
    • Missing initial protein substrates (e.g., ACP)
    • Missing pathways in EcoCyc
  • Phase III: Forward propagation with 11 more
    initial metabolites
    • Yielded all 38 essential compounds

28
Nutrient-Related Analysis: Validation of the
EcoCyc Database
  • Results on EcoCyc
  • Phase I
    • Essential compounds
      • Produced: 19
      • Not produced: 19
    • Total compounds
      • Produced (28%)
    • Reactions
      • Fired (31%)

29
Missing Essential Compounds Due To
  • Bugs in EcoCyc
  • Narrow conceptualization of the problem
  • Protein substrates
  • Incomplete biochemical knowledge

30
Nutrient-Related Analysis: Validation of the
EcoCyc Database
  • Results on EcoCyc
  • Phase II (After adding 11 extra metabolites)
    • Essential compounds
      • Produced: 38
      • Not produced: 0
    • Total compounds
      • Produced (49%)
      • Not produced (51%)
    • Reactions
      • Fired (58%)
      • Not fired (42%)

31
Pathway Tools Misconceptions
  • PathoLogic
  • Does not re-annotate genomes
  • Pathway Tools does not handle quantitative
    information
  • Pathway/Genome Editors do not work through the
    web

32
HumanCyc: Human Metabolic Pathway Database
Consortium
  • Construct DB of human metabolic pathways using
    PathoLogic
  • Link to human genome web sites
  • Hire one curator to refine and curate with
    respect to literature over a 2-year period
    • Remove false-positive predictions
    • Insert known pathways missed by PathoLogic
    • Add comments and citations from pathways and
      enzymes to the literature
    • Add enzyme activators, inhibitors, cofactors,
      tissue information
  • Available as flatfiles and with Pathway/Genome
    Navigator
  • New versions to be released every 6 months

33
Summary
  • Pathway/Genome Databases
    • MetaCyc: non-redundant DB of literature-derived
      pathways
    • 14 organism-specific PGDBs available through SRI
      at BioCyc.org
    • Computational theories of biochemical machinery
  • Pathway Tools software
    • Extract pathways from genomes
    • Morph annotated genome into structured ontology
    • Distributed curation tools for MODs
    • Query, visualization, WWW publishing

34
BioCyc and Pathway Tools Availability
  • WWW BioCyc freely available to all
  • BioCyc.org
  • Six BioCyc DBs openly available to all
  • BioCyc DBs freely available to non-profits
  • Flatfiles downloadable from BioCyc.org
  • Binary executable
  • Sun UltraSparc-170 w/ 64MB memory
  • PC, 400MHz CPU, 64MB memory, Windows-98 or newer
  • PerlCyc API
  • Pathway Tools freely available to non-profits

35
Acknowledgements
  • SRI
  • Suzanne Paley, Pedro Romero, John Pick, Cindy
    Krieger, Martha Arnaud
  • EcoCyc Project
  • Julio Collado-Vides, Ian Paulsen, Monica Riley,
    Milton Saier
  • MetaCyc Project
  • Sue Rhee, Lukas Mueller, Peifen Zhang, Chris
    Somerville
  • Stanford
  • Gary Schoolnik, Harley McAdams, Lucy Shapiro,
    Russ Altman, Iwei Yeh
  • Funding sources
  • NIH National Center for Research Resources
  • NIH National Institute of General Medical
    Sciences
  • NIH National Human Genome Research Institute
  • Department of Energy Microbial Cell Project
  • DARPA BioSpice, UPC

BioCyc.org