Title: Overview of Genome Databases

1
Overview of Genome Databases

Peter D. Karp, Ph.D.
SRI International
pkarp_at_ai.sri.com
www-db.stanford.edu/dbseminar/seminar.html

2
Talk Overview

Definition of bioinformatics
Motivations for genome databases
Issues in building genome databases

3
Definition of Bioinformatics

Computational techniques for management and
analysis of biological data and knowledge
Methods for disseminating, archiving,
interpreting, and mining scientific information
Computational theories of biology
Genome databases are a subfield of bioinformatics

4
Motivations for Bioinformatics

Growth in molecular-biology knowledge
(literature)
Genomics
Study of genomes through DNA sequencing
Industrial Biology

5
Example Genomics Datatypes

Genome sequences
DOE Joint Genome Institute
511M bases in Dec 2001
11.97G bases since Mar 1999
Gene and protein expression data
Protein-protein interaction data
Protein 3-D structures

6
Genome Databases

Experimental data
Archive experimental datasets
Retrieving past experimental results should be
faster than repeating the experiment
Capture alternative analyses
Lots of data, simpler semantics
Computational symbolic theories
Complex theories become too large to be grasped
by a single mind
The database is the theory
Biology is very much concerned with qualitative
relationships
Less data, more complex semantics

7
Bioinformatics

Distinct intellectual field at the intersection
of CS and molecular biology
Distinct field because researchers in the field
must know CS, biology, and bioinformatics
Spectrum from CS research to biology service
Rich source of challenging CS problems
Large, noisy, complex data-sets and
knowledge-sets
Biologists and funding agencies demand working
solutions

8
Bioinformatics Research

algorithms + data structures = programs
algorithms + databases = discoveries
Combine sophisticated algorithms with the right
content
Properly structured
Carefully curated
Relevant data fields
Proper amount of data

9
Reference on Major Genome Databases

Nucleic Acids Research Database Issue
http://nar.oupjournals.org/content/vol30/issue1/
112 databases

10
Questions to Ask of a New Genome Database

11
What are Database Goals and Requirements?

What problems will database be used to solve?
Who are the users and what is their expertise?

12
What is its Organizing Principle?

Different DBs partition the space of genome
information in different dimensions
Experimental methods (Genbank, PDB)
Organism (EcoCyc, Flybase)

13
What is its Level of Interpretation?

Laboratory data
Primary literature (Genbank)
Review (SwissProt, MetaCyc)
Does DB model disagreement?

14
What are its Semantics and Content?

What entities and relationships does it model?
How does its content overlap with similar DBs?
How many entities of each type are present?
Sparseness of attributes and statistics on
attribute values

15
What are Sources of its Data?

Potential information sources
Laboratory instruments
Scientific literature
Manual entry
Natural-language text mining
Direct submission from the scientific community
Genbank
Modification policy
DB staff only
Submission of new entries by scientific community
Update access by scientific community

16
What DBMS is Employed?

None
Relational
Object oriented
Frame knowledge representation system

17
Distribution / User Access

Multiple distribution forms enhance access
Browsing access with visualization tools
API
Portability

18
What Validation Approaches are Employed?

None
Declarative consistency constraints
Programmatic consistency checking
Internal vs external consistency checking
What types of systematic errors might DB contain?
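
The contrast between declarative constraints and programmatic checking can be sketched with Python's built-in sqlite3 module. This is only an illustration: the tables and columns (Gene, Protein, GeneWID) are hypothetical and not taken from any particular genome database.

```python
import sqlite3

def build_db():
    """Create a toy in-memory schema; all names here are illustrative."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    con.executescript("""
        CREATE TABLE Gene (
            WID  INTEGER PRIMARY KEY,
            Name TEXT NOT NULL
        );
        CREATE TABLE Protein (
            WID      INTEGER PRIMARY KEY,
            Name     TEXT NOT NULL,
            Fragment TEXT CHECK (Fragment IN ('T', 'F')),  -- declarative constraint
            GeneWID  INTEGER REFERENCES Gene(WID)          -- declarative constraint
        );
    """)
    return con

def try_insert(con, sql, params):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with con:  # commit on success, roll back on error
            con.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

def proteins_without_gene(con):
    """Programmatic check: a rule the schema does not state, run as a query."""
    rows = con.execute(
        "SELECT Name FROM Protein WHERE GeneWID IS NULL ORDER BY Name"
    )
    return [name for (name,) in rows]
```

Declarative constraints reject bad rows at insert time, while a programmatic check such as proteins_without_gene has to be run explicitly and only reports what is already stored.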

19
Database Documentation

Schema and its semantics
Format
API
Data acquisition techniques
Validation techniques
Size of different classes
Coverage of subject matter
Sparseness of attributes
Error rates
Update frequency

20
Relationship of Database Field to Bioinformatics

Scientists generally unaware of basic DB
principles
Complex queries vs click-at-a-time access
Data model
Defined semantics for DB fields
Controlled vocabularies
Regular syntax for flatfiles
Automated consistency checking
Most biologists take one programming class
Evolution of typical genome database
Finer points of DB research off their radar
screen
Handful of DB researchers work in bioinformatics

21
Database Field

For many years, the majority of bioinformatics
DBs did not employ a DBMS
Flatfiles were the rule
Scientists want to see the data directly
Commercial DBMSs too expensive, too complex
DBAs too expensive
Most scientists do not understand
Differences between BA, MS, PhD in CS
CS research vs applications
Implications for project planning, funding,
bioinformatics research

22
Recommendation

Teaching scientists programming is not enough
Teaching scientists how to build a DBMS is
irrelevant
Teach scientists basic aspects of databases and
symbolic computing
Database requirements analysis
Data models, schema design
Knowledge representation, ontologies
Formal grammars
Complex queries
Database interoperability

23
BioSPICE Bioinformatics Database Warehouse

Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal
Sonmez
SRI International
http://www.BioSPICE.org/

24
Project Goal

Create a toolkit for constructing bioinformatics
database warehouses that collect together a set
of bioinformatics databases into one physical DBMS

25
Motivations

Important bioinformatics problems require access
to multiple bioinformatics databases
Hundreds of bioinformatics databases exist
Nucleic Acids Research 30(1) 2002 DB issue
Nucleic Acids Research DB list: 350 DBs at
http://www3.oup.co.uk/nar/database/a/
Different problems require different sets of
databases

26
Motivations

Combining multiple databases allows for data
verification and complementation
Simulation problems require access to data on
pathways, enzymes, reactions, genetic regulation

27
Why is the Multidatabase Approach Not Sufficient?

Multidatabase query approaches assume databases
are in a DBMS
Internet bandwidth limits query throughput
Most sites that do operate DBMSs do not allow
remote SQL access because of security and loading
concerns
Control data stability
Need to capture, integrate and publish locally
produced data of different types
Multidatabase and Warehouse approaches
complementary

28
Scenario 1

BioSPICE scientist wants to model multiple
metabolic pathways in a given organism
Enumerate pathways and reactions
What enzymes catalyze each reaction?
What genes code for each enzyme?
What control regions regulate each gene?
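
Once the relevant databases sit in one DBMS, the first three questions of this scenario reduce to a single join. The sketch below uses Python's sqlite3 module with a deliberately simplified, hypothetical schema (one pathway per reaction, one reaction per enzyme, one enzyme per gene) and placeholder data. It is not the actual warehouse schema, and control regions are left out.

```python
import sqlite3

# Hypothetical, simplified tables; real data would need many-to-many links.
SCHEMA = """
CREATE TABLE Pathway  (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Reaction (WID INTEGER PRIMARY KEY, Name TEXT, PathwayWID INTEGER);
CREATE TABLE Enzyme   (WID INTEGER PRIMARY KEY, Name TEXT, ReactionWID INTEGER);
CREATE TABLE Gene     (WID INTEGER PRIMARY KEY, Name TEXT, EnzymeWID INTEGER);
"""

# LEFT JOINs keep reactions with no known enzyme, and enzymes with no known gene.
QUERY = """
SELECT p.Name, r.Name, e.Name, g.Name
FROM Pathway p
JOIN Reaction r ON r.PathwayWID = p.WID
LEFT JOIN Enzyme e ON e.ReactionWID = r.WID
LEFT JOIN Gene g ON g.EnzymeWID = e.WID
ORDER BY p.Name, r.Name, e.Name, g.Name
"""

def build_toy_warehouse():
    """Load placeholder rows into an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO Pathway VALUES (?, ?)", [(1, "pathway A")])
    con.executemany("INSERT INTO Reaction VALUES (?, ?, ?)",
                    [(10, "reaction 1", 1), (11, "reaction 2", 1)])
    con.executemany("INSERT INTO Enzyme VALUES (?, ?, ?)", [(20, "enzyme X", 10)])
    con.executemany("INSERT INTO Gene VALUES (?, ?, ?)", [(30, "gene x", 20)])
    return con

def pathway_report(con):
    """Return (pathway, reaction, enzyme, gene) rows; missing links are None."""
    return con.execute(QUERY).fetchall()
```

A row with None in the enzyme or gene position marks a gap in the data, which is exactly what a modeler needs to see before simulating the pathway.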

29
Approach

Oracle and MySQL implementations
Warehouse schema defines many bioinformatics
datatypes
Create loaders for public bioinformatics DBs
Parse file format for the DB
Semantic transformations
Insert database into warehouse tables
Warehouse query access mechanisms
SQL queries via Perl, ODBC, OAA

30
Example Swiss-Prot DB

Version 40.0 describes 101K proteins in a 320MB
file
Each protein described as one block of records
(an entry) in a large text file
Loader tool parses file one entry at a time
Creates new entries in a set of warehouse tables
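
Parsing one entry at a time keeps memory use flat even for a file of several hundred megabytes. A minimal Python sketch of that reading loop follows. It assumes the Swiss-Prot flat-file convention that each entry ends with a line containing only "//". The function name is made up for illustration and this is not the loader described here.

```python
from typing import Iterable, Iterator, List

def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one entry (a list of lines) at a time from a Swiss-Prot style file.

    Assumes every entry is terminated by a line containing only '//'.
    """
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)
    if entry:  # tolerate a final entry that lacks a terminator
        yield entry
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, for example `for entry in iter_entries(open(path)):`, and only the current entry is held in memory.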

31
Warehouse Schema

Manages many bioinformatics datatypes
simultaneously
Pathways, Reactions, Chemicals
Proteins, Genes, Replicons
Citations, Organisms
Links to external databases
Each type of warehouse object implemented through
one or more relational tables (currently 43)

32
Warehouse Schema

Databases on our wish list
Genbank (nucleotide sequences)
Protein expression database
Protein-protein interactions database
Gene expression database
NCBI Taxonomy database
Gene Ontology
CMR

33
Warehouse Schema

Manages multiple datasets simultaneously
Dataset: single version of a database
Support alternative measurements and viewpoints
Version comparison
Multiple software tools or experiments that
require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset

34
Warehouse Schema

Different databases storing the same biological
types are coerced into same warehouse tables
Design of most datatypes inspired by multiple
databases
Representational tricks to decrease schema bloat
Single space of primary keys
Single set of satellite tables such as for
synonyms, citations, comments, etc.
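
The two representational tricks can be sketched together: if every object draws its ID from one warehouse-wide counter, a single satellite table can refer to objects of any type without recording which table they live in. The Python sketch below uses sqlite3. The SynonymTable name, its columns, and the helper functions are assumptions made for illustration.

```python
import itertools
import sqlite3

OBJECT_TABLES = ("Protein", "Pathway")  # illustrative subset of object tables

def build_warehouse():
    """Create toy tables that share one key space and one synonym table."""
    con = sqlite3.connect(":memory:")
    for table in OBJECT_TABLES:
        con.execute(f"CREATE TABLE {table} (WID INTEGER PRIMARY KEY, Name TEXT)")
    # One satellite table for all object types; OtherWID can refer to any of them.
    con.execute("CREATE TABLE SynonymTable (OtherWID INTEGER, Syn TEXT)")
    return con

def add_object(con, wids, table, name):
    """Insert a row using the next warehouse-wide ID and return that ID."""
    if table not in OBJECT_TABLES:
        raise ValueError(f"unknown table: {table}")
    wid = next(wids)
    con.execute(f"INSERT INTO {table} (WID, Name) VALUES (?, ?)", (wid, name))
    return wid

def add_synonym(con, wid, synonym):
    """Attach a synonym to any warehouse object, whatever its type."""
    con.execute("INSERT INTO SynonymTable VALUES (?, ?)", (wid, synonym))

def synonyms_of(con, wid):
    """Return the synonyms recorded for one warehouse ID."""
    rows = con.execute(
        "SELECT Syn FROM SynonymTable WHERE OtherWID = ? ORDER BY Syn", (wid,)
    )
    return [syn for (syn,) in rows]
```

Here `wids` would be a shared counter such as `itertools.count(1)`. Without the single key space, each object type would need its own synonym, citation, and comment tables, which is the schema bloat being avoided.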

35
Warehouse Schema

Examples
Protein data from Swiss-Prot, TrEMBL, KEGG, and
EcoCyc all loaded into same relational tables
Pathway data from MetaCyc and KEGG are loaded
into the same relational tables

36
Example Swiss-Prot DB

ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.

37
How Swiss-Prot is Loaded into the Warehouse

Register Swiss-Prot in Datasets table
Create entry in Entry and Protein tables for each
Swiss-Prot protein
Satellite tables store
Protein synonyms, citations, comments, accession
numbers, organism, sequence features,
subunits/complexes, DB links
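
Before anything can be inserted into the Entry, Protein, and satellite tables, the values have to be pulled out of the entry's line codes. The Python sketch below does this for the ID, AC, DE, and GN lines of the 1A11_CUCMA example. It assumes the Swiss-Prot flat-file layout (a two-letter line code followed by the data). The function names and the shape of the returned dictionary are illustrative, not the actual loader's.

```python
from collections import defaultdict

SAMPLE_ENTRY = """\
ID   1A11_CUCMA     STANDARD;      PRT;   493 AA.
AC   P23599;
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.
"""

def group_by_line_code(entry_text):
    """Map each two-letter line code to the list of its data strings."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code, _, rest = line.partition(" ")
        if rest.strip():
            fields[code].append(rest.strip())
    return fields

def extract_values(entry_text):
    """Collect the values a loader might route to Protein and satellite tables."""
    fields = group_by_line_code(entry_text)
    accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
    genes = " ".join(fields["GN"]).rstrip(".").split(" OR ")
    return {
        "entry_name": fields["ID"][0].split()[0],
        "accessions": accessions,
        "description": " ".join(fields["DE"]).rstrip("."),  # continuation lines joined
        "gene_names": [g.strip() for g in genes if g.strip()],
    }
```

In this sketch the entry name and description would go to the main tables, while the accession numbers and gene names are the kind of repeating values that belong in satellite tables.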

38
Protein Table

CREATE TABLE Protein (
  WID                  NUMBER,          --The warehouse ID of this protein
  Name                 VARCHAR2(500),   --Common name of the protein
  AASequence           VARCHAR2(4000),  --Amino-acid sequence for this protein
  Charge               NUMBER,          --Charge of the chemical
  Fragment             CHAR(1),         --Is this protein a fragment or not, T or F
  MolecularWeightCalc  NUMBER,          --Molecular weight calculated from sequence. Units: Daltons.
  MolecularWeightExp   NUMBER,          --Molecular weight determined through experimentation. Units: Daltons.
  PICalc               VARCHAR2(50),    --pI calculated from its sequence.
  PIExp                VARCHAR2(50),    --pI value determined through experimentation.
  DataSetWID           NUMBER           --Reference to the data set from which the entity came
)

39
Database Loaders

Loader tool defined for each DB to be loaded into
Warehouse
Example loaders available in several languages
Loaders
KEGG (C)
BioCyc collection of 15 pathway DBs (C)
Swiss-Prot (Java)
ENZYME (Java)

40
Terminology

Model Organism Database (MOD): DB describing
genome and other information about an organism
Pathway/Genome Database (PGDB): MOD that
combines information about
Pathways, reactions, substrates
Enzymes, transporters
Genes, replicons
Transcription factors, promoters, operons, DNA
binding sites
BioCyc: collection of 15 PGDBs at BioCyc.org
EcoCyc, AgroCyc, YeastCyc

41
Loader Architecture

[Diagram: the ANTLR Parser Generator turns the Grammar for Swiss-Prot into the Parser for SwissProt, which reads the Swiss-Prot Datafile and emits an Oracle Loadable File or SQL Insert Commands]

42
Current Warehouse Contents

                     KEGG     ENZYME   SwissProt   BsubCyc   Warehouse Total
Chemicals            7,284    2,952    0           576       10,812
Genes                5,714    0        88,605      4,221     98,540
Organisms            60       0        103,807     1         103,868
Proteins             3,829    3,870    101,602     4,150     113,451
Enzymatic Reactions  3,509    0        0           717       4,226
Pathways             4,517    0        0           138       4,655
Pathway Reactions    36,271   0        0           530       36,801

43
Example Warehouse Uses

Check completeness of data sources

Count reactions in ENZYME database with (and
without) associated protein sequences in
SWISS-PROT database
3870 reactions in ENZYME
1662 reactions (43%) with a sequence in SWISS-PROT
2208 reactions (57%) without a sequence in
SWISS-PROT
Count of distinct non-partial EC numbers in
SWISS-PROT
1554 distinct EC numbers in SWISS-PROT (non-partial)
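
The completeness check amounts to a set comparison on EC numbers. The Python sketch below shows the idea on in-memory lists rather than warehouse tables. The function names are hypothetical, and it assumes that a partial EC number is one with a dash in place of a numeric field, such as 4.4.1.-.

```python
import re

# A complete (non-partial) EC number has four numeric fields, e.g. 4.4.1.14.
COMPLETE_EC = re.compile(r"^\d+\.\d+\.\d+\.\d+$")

def is_complete_ec(ec):
    """True for EC numbers with all four fields given as digits."""
    return bool(COMPLETE_EC.match(ec))

def coverage(enzyme_ecs, swissprot_ecs):
    """Count ENZYME reactions with and without a sequence in Swiss-Prot.

    Both arguments are iterables of EC number strings; partial EC numbers
    on the Swiss-Prot side are ignored.
    """
    enzyme = set(enzyme_ecs)
    sequenced = {ec for ec in swissprot_ecs if is_complete_ec(ec)}
    return len(enzyme & sequenced), len(enzyme - sequenced)

def percent(part, whole):
    """Share of `part` in `whole`, rounded to the nearest whole percent."""
    return round(100 * part / whole)
```

Applied to the counts on this slide, `percent(1662, 3870)` gives 43 and `percent(2208, 3870)` gives 57, matching the reported split.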
