Title: Overview of Genome Databases
1Overview of Genome Databases
- Peter D. Karp, Ph.D.
- SRI International
- pkarp_at_ai.sri.com
- www-db.stanford.edu/dbseminar/seminar.html
2Talk Overview
- Definition of bioinformatics
- Motivations for genome databases
- Issues in building genome databases
3Definition of Bioinformatics
- Computational techniques for management and analysis of biological data and knowledge
- Methods for disseminating, archiving, interpreting, and mining scientific information
- Computational theories of biology
- Genome Databases is a subfield of bioinformatics
4Motivations for Bioinformatics
- Growth in molecular-biology knowledge (literature)
- Genomics
- Study of genomes through DNA sequencing
- Industrial Biology
5Example Genomics Datatypes
- Genome sequences
- DOE Joint Genome Institute
- 511M bases in Dec 2001
- 11.97G bases since Mar 1999
- Gene and protein expression data
- Protein-protein interaction data
- Protein 3-D structures
6Genome Databases
- Experimental data
- Archive experimental datasets
- Retrieving past experimental results should be faster than repeating the experiment
- Capture alternative analyses
- Lots of data, simpler semantics
- Computational symbolic theories
- Complex theories become too large to be grasped by a single mind
- The database is the theory
- Biology is very much concerned with qualitative relationships
- Less data, more complex semantics
7Bioinformatics
- Distinct intellectual field at the intersection of CS and molecular biology
- Distinct field because researchers in the field must know CS, biology, and bioinformatics
- Spectrum from CS research to biology service
- Rich source of challenging CS problems
- Large, noisy, complex data-sets and knowledge-sets
- Biologists and funding agencies demand working solutions
8Bioinformatics Research
- algorithms + data structures = programs
- algorithms + databases = discoveries
- Combine sophisticated algorithms with the right content
- Properly structured
- Carefully curated
- Relevant data fields
- Proper amount of data
9Reference on Major Genome Databases
- Nucleic Acids Research Database Issue
- http://nar.oupjournals.org/content/vol30/issue1/
- 112 databases
10Questions to Ask of a New Genome Database
11What are Database Goals and Requirements?
- What problems will database be used to solve?
- Who are the users and what is their expertise?
12What is its Organizing Principle?
- Different DBs partition the space of genome information in different dimensions
- Experimental methods (Genbank, PDB)
- Organism (EcoCyc, Flybase)
13What is its Level of Interpretation?
- Laboratory data
- Primary literature (Genbank)
- Review (SwissProt, MetaCyc)
- Does DB model disagreement?
14What are its Semantics and Content?
- What entities and relationships does it model?
- How does its content overlap with similar DBs?
- How many entities of each type are present?
- Sparseness of attributes and statistics on
attribute values
15What are Sources of its Data?
- Potential information sources
- Laboratory instruments
- Scientific literature
- Manual entry
- Natural-language text mining
- Direct submission from the scientific community
- Genbank
- Modification policy
- DB staff only
- Submission of new entries by scientific community
- Update access by scientific community
16What DBMS is Employed?
- None
- Relational
- Object oriented
- Frame knowledge representation system
17Distribution / User Access
- Multiple distribution forms enhance access
- Browsing access with visualization tools
- API
- Portability
18What Validation Approaches are Employed?
- None
- Declarative consistency constraints
- Programmatic consistency checking
- Internal vs external consistency checking
- What types of systematic errors might DB contain?
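The difference between declarative and programmatic checking can be sketched with Python's standard sqlite3 module. This is a minimal illustration, not the approach of any particular genome database: the cut-down Protein table, the helper names, and the rule that sequences use only the 20 standard amino-acid letters are all assumptions made for the example.

```python
import sqlite3

# Assumed rule for the programmatic check: the 20 standard amino-acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def make_db():
    """Create a toy Protein table with declarative constraints (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Protein (
            WID        INTEGER PRIMARY KEY,
            Name       TEXT NOT NULL,
            AASequence TEXT,
            Fragment   TEXT CHECK (Fragment IN ('T', 'F'))
        )""")
    return conn

def rejected_by_constraint(conn, row):
    """Declarative check: the DBMS itself refuses rows that violate a constraint."""
    try:
        conn.execute(
            "INSERT INTO Protein (Name, AASequence, Fragment) VALUES (?, ?, ?)", row)
        return False
    except sqlite3.IntegrityError:
        return True

def bad_sequences(conn):
    """Programmatic check: a separate pass over data that is already stored."""
    rows = conn.execute("SELECT Name, AASequence FROM Protein ORDER BY WID")
    return [name for name, seq in rows if seq and not set(seq) <= AMINO_ACIDS]
```

The constraint stops a bad Fragment flag at insert time, while the sequence check can only report problems after loading. That is one reason a database may want both kinds of checking.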
19Database Documentation
- Schema and its semantics
- Format
- API
- Data acquisition techniques
- Validation techniques
- Size of different classes
- Coverage of subject matter
- Sparseness of attributes
- Error rates
- Update frequency
20Relationship of Database Field to Bioinformatics
- Scientists generally unaware of basic DB principles
- Complex queries vs click-at-a-time access
- Data model
- Defined semantics for DB fields
- Controlled vocabularies
- Regular syntax for flatfiles
- Automated consistency checking
- Most biologists take one programming class
- Evolution of typical genome database
- Finer points of DB research off their radar screen
- Handful of DB researchers work in bioinformatics
21Database Field
- For many years, the majority of bioinformatics DBs did not employ a DBMS
- Flatfiles were the rule
- Scientists want to see the data directly
- Commercial DBMSs too expensive, too complex
- DBAs too expensive
- Most scientists do not understand
- Differences between BA, MS, PhD in CS
- CS research vs applications
- Implications for project planning, funding,
bioinformatics research
22Recommendation
- Teaching scientists programming is not enough
- Teaching scientists how to build a DBMS is irrelevant
- Teach scientists basic aspects of databases and symbolic computing
- Database requirements analysis
- Data models, schema design
- Knowledge representation, ontologies
- Formal grammars
- Complex queries
- Database interoperability
23BioSPICE Bioinformatics Database Warehouse
- Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal Sonmez
- SRI International
- http://www.BioSPICE.org/
24Project Goal
- Create a toolkit for constructing bioinformatics
database warehouses that collect together a set
of bioinformatics databases into one physical DBMS
25Motivations
- Important bioinformatics problems require access to multiple bioinformatics databases
- Hundreds of bioinformatics databases exist
- Nucleic Acids Research 30(1) 2002 DB issue
- Nucleic Acids Research DB list: 350 DBs at http://www3.oup.co.uk/nar/database/a/
- Different problems require different sets of databases
26Motivations
- Combining multiple databases allows for data verification and complementation
- Simulation problems require access to data on pathways, enzymes, reactions, genetic regulation
27Why is the Multidatabase Approach Not Sufficient?
- Multidatabase query approaches assume databases are in a DBMS
- Internet bandwidth limits query throughput
- Most sites that do operate DBMSs do not allow remote SQL access because of security and loading concerns
- Control data stability
- Need to capture, integrate and publish locally produced data of different types
- Multidatabase and Warehouse approaches complementary
28Scenario 1
- BioSPICE scientist wants to model multiple metabolic pathways in a given organism
- Enumerate pathways and reactions
- What enzymes catalyze each reaction?
- What genes code for each enzyme?
- What control regions regulate each gene?
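Once the data sit in one DBMS, the questions in this scenario reduce to joins. The sketch below uses Python's sqlite3 module with a deliberately simplified, hypothetical set of tables and made-up placeholder rows; the real warehouse tables and column names are not given in this talk, and control regions are left out to keep the example short.

```python
import sqlite3

# Hypothetical, simplified tables (assumption: not the actual warehouse schema).
SCHEMA = """
CREATE TABLE Pathway  (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Reaction (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Protein  (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Gene     (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE PathwayReaction (PathwayWID INTEGER, ReactionWID INTEGER);
CREATE TABLE Catalyzes       (ProteinWID INTEGER, ReactionWID INTEGER);
CREATE TABLE EncodedBy       (ProteinWID INTEGER, GeneWID INTEGER);
"""

def build_demo_db():
    """Load placeholder rows: one pathway, two reactions, one enzyme, one gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Pathway VALUES (1, 'demo-pathway')")
    conn.executemany("INSERT INTO Reaction VALUES (?, ?)",
                     [(10, "rxn-1"), (11, "rxn-2")])
    conn.executemany("INSERT INTO PathwayReaction VALUES (?, ?)",
                     [(1, 10), (1, 11)])
    conn.execute("INSERT INTO Protein VALUES (20, 'enzyme-A')")
    conn.execute("INSERT INTO Catalyzes VALUES (20, 10)")
    conn.execute("INSERT INTO Gene VALUES (30, 'gene-a')")
    conn.execute("INSERT INTO EncodedBy VALUES (20, 30)")
    return conn

def enzymes_and_genes(conn, pathway_name):
    """Return (reaction, enzyme, gene) rows for every reaction in one pathway."""
    query = """
        SELECT r.Name, p.Name, g.Name
        FROM Pathway pw
        JOIN PathwayReaction pr ON pr.PathwayWID = pw.WID
        JOIN Reaction r         ON r.WID = pr.ReactionWID
        LEFT JOIN Catalyzes c   ON c.ReactionWID = r.WID
        LEFT JOIN Protein p     ON p.WID = c.ProteinWID
        LEFT JOIN EncodedBy e   ON e.ProteinWID = p.WID
        LEFT JOIN Gene g        ON g.WID = e.GeneWID
        WHERE pw.Name = ?
        ORDER BY r.WID
    """
    return conn.execute(query, (pathway_name,)).fetchall()
```

The LEFT JOINs keep reactions that have no known enzyme or gene, so gaps in the data show up as empty values instead of silently dropping out of the model.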
29Approach
- Oracle and MySQL implementations
- Warehouse schema defines many bioinformatics datatypes
- Create loaders for public bioinformatics DBs
- Parse file format for the DB
- Semantic transformations
- Insert database into warehouse tables
- Warehouse query access mechanisms
- SQL queries via Perl, ODBC, OAA
30Example Swiss-Prot DB
- Version 40.0 describes 101K proteins in a 320MB file
- Each protein described as one block of records (an entry) in a large text file
- Loader tool parses file one entry at a time
- Creates new entries in a set of warehouse tables
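Entry-at-a-time parsing can be sketched in a few lines of Python. This is not the project's loader (which the talk describes as ANTLR-based and written in Java); it assumes the usual Swiss-Prot flat-file layout, where each line starts with a two-letter line code and each entry ends with a `//` line. The function name and the second sample entry are made up for illustration.

```python
import io
from collections import defaultdict

# Toy input. The first entry reuses fields from the talk's 1A11_CUCMA example;
# the second entry is made up purely for illustration.
SAMPLE = """\
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
GN   ACS1 OR ACCW.
//
ID   DEMO_ENTRY     STANDARD;      PRT;    10 AA.
AC   X00000;
//
"""

def iter_entries(handle):
    """Yield one entry at a time as {line_code: [content, ...]}.

    Only one entry is held in memory at once, which matters for a
    file of several hundred megabytes.
    """
    entry = defaultdict(list)
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("//"):            # terminator line closes the entry
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
        elif len(line) > 2 and line[:2].strip():
            # Lines without a line code (such as sequence data) are skipped here.
            entry[line[:2]].append(line[2:].strip())
    if entry:                                # tolerate a missing final terminator
        yield dict(entry)

if __name__ == "__main__":
    for e in iter_entries(io.StringIO(SAMPLE)):
        print(e["ID"][0].split()[0], e["AC"])
```

A real loader would then map each parsed entry onto rows in the warehouse tables. The generator shape keeps that mapping step separate from the file format.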
31Warehouse Schema
- Manages many bioinformatics datatypes simultaneously
- Pathways, Reactions, Chemicals
- Proteins, Genes, Replicons
- Citations, Organisms
- Links to external databases
- Each type of warehouse object implemented through
one or more relational tables (currently 43)
32Warehouse Schema
- Databases on our wish list
- Genbank (nucleotide sequences)
- Protein expression database
- Protein-protein interactions database
- Gene expression database
- NCBI Taxonomy database
- Gene Ontology
- CMR
33Warehouse Schema
- Manages multiple datasets simultaneously
- Dataset: single version of a database
- Support alternative measurements and viewpoints
- Version comparison
- Multiple software tools or experiments that require access to different versions
- Each dataset is a warehouse entity
- Every warehouse object is registered in a dataset
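One way to picture dataset registration and version comparison is the sketch below, written against Python's sqlite3 module. The DataSet and Protein tables are cut down to a few assumed columns, and the database name, version strings, and protein names are placeholders, not warehouse contents.

```python
import sqlite3

def make_db():
    """Create toy DataSet and Protein tables (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
        CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT,
                              DataSetWID INTEGER REFERENCES DataSet(WID));
    """)
    return conn

def load_dataset(conn, name, version, protein_names):
    """Register one version of a database, then tag each object with it."""
    cur = conn.execute(
        "INSERT INTO DataSet (Name, Version) VALUES (?, ?)", (name, version))
    dataset_wid = cur.lastrowid
    conn.executemany(
        "INSERT INTO Protein (Name, DataSetWID) VALUES (?, ?)",
        [(n, dataset_wid) for n in protein_names])
    return dataset_wid

def names_added(conn, old_wid, new_wid):
    """Version comparison: protein names present in new_wid but not in old_wid."""
    rows = conn.execute("""
        SELECT Name FROM Protein WHERE DataSetWID = ?
        EXCEPT
        SELECT Name FROM Protein WHERE DataSetWID = ?
    """, (new_wid, old_wid))
    return sorted(r[0] for r in rows)
```

Because both versions live side by side, a tool pinned to the older dataset keeps working while another tool queries the newer one.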
34Warehouse Schema
- Different databases storing the same biological types are coerced into same warehouse tables
- Design of most datatypes inspired by multiple databases
- Representational tricks to decrease schema bloat
- Single space of primary keys
- Single set of satellite tables such as for
synonyms, citations, comments, etc.
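The two tricks can be illustrated with a small sqlite3 sketch. It assumes that a shared Entry table hands out every warehouse ID and that one synonym table serves all object types. The column names and the helpers `new_object`, `add_synonym`, and `synonyms` are invented for the example and may differ from the actual schema.

```python
import sqlite3

OBJECT_TABLES = ("Protein", "Pathway")   # datatypes covered by this toy schema

def make_db():
    """Toy schema: Entry allocates all WIDs; SynonymTable is shared by all types."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Entry   (WID INTEGER PRIMARY KEY, Kind TEXT);
        CREATE TABLE Protein (WID INTEGER PRIMARY KEY REFERENCES Entry(WID), Name TEXT);
        CREATE TABLE Pathway (WID INTEGER PRIMARY KEY REFERENCES Entry(WID), Name TEXT);
        CREATE TABLE SynonymTable (OtherWID INTEGER REFERENCES Entry(WID), Syn TEXT);
    """)
    return conn

def new_object(conn, table, name):
    """Allocate a WID from the single key space, then create the typed row."""
    if table not in OBJECT_TABLES:
        raise ValueError(f"unknown object table: {table}")
    wid = conn.execute("INSERT INTO Entry (Kind) VALUES (?)", (table,)).lastrowid
    conn.execute(f"INSERT INTO {table} (WID, Name) VALUES (?, ?)", (wid, name))
    return wid

def add_synonym(conn, wid, synonym):
    """One satellite table works for any object, since WIDs never collide."""
    conn.execute("INSERT INTO SynonymTable VALUES (?, ?)", (wid, synonym))

def synonyms(conn, wid):
    rows = conn.execute(
        "SELECT Syn FROM SynonymTable WHERE OtherWID = ? ORDER BY Syn", (wid,))
    return [r[0] for r in rows]
```

Without the shared key space, each datatype would need its own synonym, citation, and comment tables, which is the schema bloat this design avoids.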
35Warehouse Schema
- Examples
- Protein data from Swiss-Prot, TrEMBL, KEGG, and EcoCyc all loaded into same relational tables
- Pathway data from MetaCyc and KEGG are loaded into the same relational tables
36Example Swiss-Prot DB
ID   1A11_CUCMA     STANDARD      PRT   493 AA.
AC   P23599
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.
37How Swiss-Prot is Loaded into the Warehouse
- Register Swiss-Prot in Datasets table
- Create entry in Entry and Protein tables for each Swiss-Prot protein
- Satellite tables store
- Protein synonyms, citations, comments, accession numbers, organism, sequence features, subunits/complexes, DB links
38Protein Table
CREATE TABLE Protein (
  WID                 NUMBER,         -- The warehouse ID of this protein
  Name                VARCHAR2(500),  -- Common name of the protein
  AASequence          VARCHAR2(4000), -- Amino-acid sequence for this protein
  Charge              NUMBER,         -- Charge of the chemical
  Fragment            CHAR(1),        -- Is this protein a fragment or not, T or F
  MolecularWeightCalc NUMBER,         -- Molecular weight calculated from sequence. Units: Daltons.
  MolecularWeightExp  NUMBER,         -- Molecular weight determined through experimentation. Units: Daltons.
  PICalc              VARCHAR2(50),   -- pI calculated from its sequence.
  PIExp               VARCHAR2(50),   -- pI value determined through experimentation.
  DataSetWID          NUMBER          -- Reference to the data set from which the entity came
);
39Database Loaders
- Loader tool defined for each DB to be loaded into Warehouse
- Example loaders available in several languages
- Loaders
- KEGG (C)
- BioCyc collection of 15 pathway DBs (C)
- Swiss-Prot (Java)
- ENZYME (Java)
40Terminology
- Model Organism Database (MOD): DB describing genome and other information about an organism
- Pathway/Genome Database (PGDB): MOD that combines information about
- Pathways, reactions, substrates
- Enzymes, transporters
- Genes, replicons
- Transcription factors, promoters, operons, DNA binding sites
- BioCyc: Collection of 15 PGDBs at BioCyc.org
- EcoCyc, AgroCyc, YeastCyc
41Loader Architecture
- Diagram components: Grammar for Swiss-Prot, ANTLR Parser Generator, Parser for Swiss-Prot, Swiss-Prot Datafile, Oracle Loadable File, SQL Insert Commands
42Current Warehouse Contents
Datatype               KEGG  ENZYME  SwissProt  BsubCyc  Warehouse Total
Chemicals             7,284   2,952          0      576           10,812
Genes                 5,714       0     88,605    4,221           98,540
Organisms                60       0    103,807        1          103,868
Proteins              3,829   3,870    101,602    4,150          113,451
Enzymatic Reactions   3,509       0          0      717            4,226
Pathways              4,517       0          0      138            4,655
Pathway Reactions    36,271       0          0      530           36,801
43Example Warehouse Uses
- Check completeness of data sources
- Count reactions in ENZYME database with (and without) associated protein sequences in SWISS-PROT database
- 3870 reactions in ENZYME
- 1662 reactions (43%) with a sequence in SWISS-PROT
- 2208 reactions (57%) without a sequence in SWISS-PROT
- Count of distinct non-partial EC numbers in SWISS-PROT
- 1554 distinct EC numbers in SWISS-PROT (non-partial)
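The completeness check is a set comparison followed by a percentage. The Python sketch below shows that logic on a toy input and re-derives the 43 and 57 figures, read as percentages, from the reported counts. The function names and the toy EC lists are assumptions made for illustration; only the counts 3870, 1662, and 2208 come from the slide.

```python
def coverage(enzyme_ecs, swissprot_ecs):
    """Count ENZYME EC numbers with and without a match in SWISS-PROT."""
    enzyme_ecs = set(enzyme_ecs)
    covered = enzyme_ecs & set(swissprot_ecs)
    total = len(enzyme_ecs)
    return {"total": total, "with": len(covered), "without": total - len(covered)}

def percent(part, total):
    """Share of total, rounded to the nearest whole percent."""
    return round(100 * part / total)

if __name__ == "__main__":
    # Toy input (illustrative EC lists, not warehouse data).
    print(coverage(["4.4.1.14", "1.1.1.1", "2.7.1.1"], ["4.4.1.14", "9.9.9.9"]))

    # Counts reported on the slide.
    total, with_seq, without_seq = 3870, 1662, 2208
    assert with_seq + without_seq == total
    print(percent(with_seq, total), percent(without_seq, total))
```

In the warehouse itself the same comparison would presumably be a SQL query over the loaded ENZYME and SWISS-PROT datasets. The set arithmetic here only mirrors what such a query computes.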