BioWarehouse: A Bioinformatics Database Warehouse - PowerPoint PPT Presentation

About This Presentation
Title:

BioWarehouse: A Bioinformatics Database Warehouse

Description:

A Bioinformatics Database Warehouse Peter D. Karp, Thomas J. Lee, Valerie Wagner BioCyc BioPAX BioCyc ENZYME CMR Genbank BioWarehouse Eco2DBase Oracle (10g) or – PowerPoint PPT presentation

Number of Views:191
Avg rating:3.0/5.0
Slides: 32
Provided by: sric7
Category:

less

Transcript and Presenter's Notes

Title: BioWarehouse: A Bioinformatics Database Warehouse


1
BioWarehouse A Bioinformatics Database Warehouse
  • Peter D. Karp, Thomas J. Lee, Valerie Wagner

2
Overview
  • Motivations for BioWarehouse
  • Facile programmatic access to individual DBs
  • Capture locally produced data
  • Database integration
  • BioWarehouse technical approach
  • Loaders
  • Schema overview
  • Applications of BioWarehouse
  • Join the BioWarehouse project

3
Motivations Computing with Individual Databases
  • Most bioinformatics DBs are not queryable via a
    database management system
  • Via Internet or locally installable
  • Having relational database versions of individual
    bioinformatics DBs facilitates complex queries
    against individual DBs
  • What is the alternative? Perl scripts? Awkward
    to program, slow to execute

4
MotivationsManage/Integrate Locally Produced
Data
  • Need schema to capture locally produced data
  • Integrate locally produced data with public
    databases

5
Why is the Multidatabase Approach Alone Not
Sufficient?
  • Multidatabase query approaches assume databases
    are in a queryable DBMS
  • Most sites that do operate DBMSs do not allow
    remote query access because of security and
    loading concerns
  • Users want to control data stability
  • Users want to control speed of their queries
  • Multidatabase query systems limited by Internet
    bandwidth and by the speed of the slowest data
    source that they query
  • Users need to capture, integrate and publish
    locally produced data of different types
  • Multidatabase and Warehouse approaches
    complementary

6
Key Challenges / Results for BioWarehouse
  • Design schema that accurately captures the
    contents of source DBs
  • Design schema that is understandable and scalable
  • Address poorly-specified syntax semantics of
    source DBs
  • Balancing the preservation of source data with
    mapping into common semantics
  • Clearly document data mappings performed by
    loaders

7
Technical Approach
  • Multi-platform support Oracle (10G) and MySQL
  • Schema support for multitude of bioinformatics
    datatypes
  • Create loaders for public bioinformatics DBs
  • Parse file format of the source DB
  • Some loaders parse interchange formats (BioPAX)
  • Semantic transformations
  • Insert DB contents into warehouse tables

BMC Bioinformatics 7170 2006 http//bioinformatic
s.ai.sri.com/biowarehouse/
8
Technical Approach
  • Provide Warehouse query access mechanisms
  • SQL queries via ODBC, JDBC, OAA
  • High quality documentation for schema and loader
    transformations
  • No graphical query interface yet

9
How to Use BioWarehouse?
  • Create your own local instance of BioWarehouse
  • Query an existing BioWarehouse instance, such as
    publichouse

10
PublicHouse Server
  • Publicly queryable BioWarehouse server operated
    by SRI
  • Manages a set of biological DBs constructed using
    BioWarehouse
  • CMR
  • BioCyc Pathway/Genome DBs
  • ENZYME
  • NCBI Taxonomy
  • Will be transitioning publichouse to contain
  • BioCyc
  • E. coli gene expression, proteomics, and
    ChIP-chip datasets
  • See http//bioinformatics.ai.sri.com/biowarehous
    e/publichouse.html
  • Note publichouse will become a BioCyc/EcoliHub
    BioWarehouse server

Host publichouse.sri.com Port 3306 Database
biospice
11
BioWarehouse Schema
  • Manages many bioinformatics datatypes
    simultaneously
  • Pathways, Reactions, Chemicals
  • Proteins, Genes, Replicons
  • Sequences, Sequence Features
  • Gene expression data
  • Protein expression data
  • Flow cytometry data
  • Organisms, Taxonomic relationships
  • Computations (sequence matches)
  • Citations, Controlled vocabularies
  • Links to external databases
  • Each type of warehouse object implemented through
    one or more relational tables

12
BioWarehouse Schema
  • Manages multiple datasets simultaneously
  • Dataset Single version of a database
  • Version comparison
  • Multiple software tools or experiments that
    require access to different versions
  • Each dataset is a warehouse entity
  • Every warehouse object is registered in a dataset

13
BioWarehouse Schema
  • Different databases storing the same biological
    datatypes are coerced into same warehouse tables
  • Design of most datatypes inspired by multiple
    databases
  • Representational tricks to decrease schema bloat
  • Single space of primary keys
  • Single set of satellite tables such as for
    synonyms, citations, comments, etc.
  • Schema size
  • Core schema 70 tables
  • Gene expression schema 109 tables

14
BioWarehouse Loaders
Database Loader Language Input Format Comments
BioCyc C BioCyc attribute-value Pathway/Genome Databases
BioPAX Java BioPAX format Protein interactions data
CMR C CMR column-delimited Comprehensive Microbial Resource 350 microbial genomes
Eco2Dbase Java Relational table dumps E. coli 2-D gel data
ENZYME Java ENZYME attribute-value Enzyme Commission set of reactions
Genbank Java XML derived from ASN.1 Bacterial subset of Genbank
Gene Ontology Java OBO XML Hierarchical controlled vocabulary
KEGG C KEGG format Metabolic pathway data
MAGE-ML Java MAGE-ML format Microarray gene expression data
NCBI Taxonomy C Taxonomy format Organism taxonomy
UniProt Java UniProt XML SWISS-PROT and TrEMBL
15
BioWarehouse Schema Overview
  • Schema manages many bioinformatics datatypes
  • including links to external databases
  • Main biological objects
  • Each type of warehouse object implemented through
    one or more relational tables (70)

16
Pathway Data
  • BioCyc
  • KEGG
  • BioPAX format
  • Physical interaction data only
  • ENZYME
  • Populates these tables Reaction, Protein,
    Chemical

17
Pathway Schema Neighborhood
Pathway
Product
PathwayReaction
Chemical
Reaction
Substrate
EnzymaticReaction
Protein
18
Pathway Data BioCyc Loader
  • Each BioCyc DB can be loaded into separate
    BioWarehouse dataset, or one common dataset
  • Loads data from 13 BioCyc source files
  • pubs.dat not present for all BioCyc PDDBs
  • compounds.dat
  • proteins.dat
  • protseq.fasta not present for all BioCyc PGDBs
  • transunits.dat not present for all BioCyc PGDBs
  • genes.dat
  • promoters.dat not present for the MetaCyc PGDB
  • terminators.dat not present for the MetaCyc
    PGDB
  • dnabindsites.dat not present for the MetaCyc
    PGDB
  • reactions.dat
  • enzrxns.dat
  • regulation.dat
  • pathways.dat
  • http//biowarehouse.ai.sri.com/repos/biocyc-loader
    /flatfile/doc/index.html

19
BioCyc Loader Chemical Compounds
20
BioCyc Loader Reactions
21
BioCyc Loader Products
22
(No Transcript)
23
Comparative Analysis with BioWarehouseCompare
MetaCyc to KEGG
  • KEGG pathways are larger than MetaCyc pathways
  • MetaCyc has a larger number of pathways
  • Which database has a larger collection of pathway
    data?
  • Prior result KEGG pathways are on average 4.2
    times larger than MetaCyc pathways

The outcomes of pathway database computations
depend on pathway ontology Green and Karp,
Nucleic Acids Research 200634 3687-97
24
MetaCyc contains 5.1 times as many pathways as
does KEGG
25
MetaCyc contains 1.4 times as many reactions
within its pathways as does KEGG
26
Gene Expression Data inBioWarehouse
  • Goals
  • Experimentalist loads locally produced data into
    BioWarehouse
  • Computational biologist loads remotely downloaded
    data into BioWarehouse
  • For processing and/or integration with other data
  • Source data format MAGE-ML
  • http//mged.org/
  • BioWarehouse and ArrayExpress are only MAGE-ML
    compliant data models we could find

27
Our Approach
  • Translate MAGE-OM into a relational database
    schema
  • One class ? One table gives too large a schema
    (ArrayExpress)
  • Instead, one table per inheritance hierarchy
    reduces table count by half
  • Use MAGE SDK tool for XML-gtObject use Castor for
    Object-gtRelational mapping
  • Merge the resulting schema into BioWarehouse
    schema to eliminate redundancy
  • Result 109 tables

28
ChIP-Chip Data
  • Current project to extend MAGE-ML loader and
    BioWarehouse to accommodate ChIP-chip data
  • Meta data, gene expression data, transcription
    factor(s), antibody(s)

29
Protein Interactions Data
  • Schema support
  • Load via BioPAX

30
Contribute to BioWarehouse
  • Open Source project
  • Ways to contribute
  • Maintain/update an existing loader
  • Implement a new loader
  • Port to new compiler or platform or DBMS
  • support_at_biowarehouse.org

31
Acknowledgments
  • Funded by
  • NIH/NIGMS EcoliHub project
  • NIH/NIGMS BioCyc project
  • DARPA Bio-SPICE program
  • SRI Colleagues
  • Valerie Wagner, Tom Lee, Tomer Altman
  • Learn more
  • http//bioinformatics.ai.sri.com/biowarehouse/
  • BMC Bioinformatics 7170 2006
Write a Comment
User Comments (0)
About PowerShow.com