Title: Biological Database Systems
1Biological Database Systems
- Denis Shestakov,
- University of Turku/Tampere
2Course Information
- Contact info
- Email
- Office B6019, ICT
- Course URL: http://users.utu.fi/denshe/biodb/index.html
- Course blog (will be updated occasionally):
http://biodb.wordpress.com/
3Course Information
- Course structure
- Lectures (topics): approx. 12 (plus today's intro
and a review lecture at the end of the course)
- Project work: details will be given next time
- Exam: easy to pass if all assignments are
completed
4Course Information
- Project work
- Build a database that combines data from several
sources
- Please send me your suggestions for the project
(if you have any); otherwise I will suggest one
- Design part
- E-R schema, relational schema, XML schema
- Deploying part
- Data converters
- Building a relational db (& XML-based db) based on
the corresponding schemas
- Performance comparison: relational database vs.
XML-based db vs. file storage system
- Invoking web services
5Course Information: Literature
- Slides
- References at the end of the slides
- Books
- Database System Concepts, 5th edition, by
Silberschatz, Korth & Sudarshan, McGraw-Hill, 2005,
ISBN-10: 0072958863
- Bioinformatics: Managing Scientific Data by
Lacroix & Critchlow, Morgan Kaufmann, 2003,
ISBN-10: 155860829X
- Articles
- Biological database design and implementation by
Birney & Clamp (the Ensembl project), Briefings
in Bioinformatics, 5(1):31-38, 2004
6Biological Database Systems
- 1.1. Course Content
- 1.2. Course Objectives
- 1.3. Database and DBMS
- 1.4. Biological Databases
7Course content: main topics
- Database concepts, overview of database design
process
- Entity-relationship (ER) data model
- Relational data model
- Introduction to SQL
- XML and XML Schema
- Design of biological database systems
8Course content: main topics
- Entity-attribute-value (EAV) modeling
- Model organism databases
- Web services
- Integration of biological data
- Analysis workflows
9Course focus
- Database issues
- Biology-specific
- Representation of biological data
- Design of biological databases
- NOT about
- Usage of existing databases
- Accessing/retrieving data from bio-databases
10Course goal
- Give basic knowledge of biological database
design
- for molecular biology
11Do you need to know this?
- Work in wet laboratory
- One bioinformatician and many biologists
- Likely to be IT guru for others
- Expect to answer IT-related questions
(database-related too)
- Work in bioinformatics lab
- Many bioinformaticians
- Group may maintain several dbs
- Basics are helpful
- Interested in creating/maintaining biological
databases
- Start learning!
- Ask for more information
12Database?
From the Merriam-Webster dictionary
(http://www.merriam-webster.com/dictionary/database)
13Database?
- A collection of data
- structured
- searchable (i.e., indexable)
- updated
- cross-referenced
- Objective
- Transform meaningless raw data into useful
information which can be accessed and analyzed in
the best way
- Database Management System (DBMS)
- software designed for the purpose of managing
databases (access, insert, delete, update, etc.)
14DBMS: database management system
- A set of tools that
- Store data
- Extract data
- Modify data
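As a rough illustration of the store/extract/modify operations listed above, the sketch below uses Python's built-in sqlite3 module as a stand-in DBMS. The table name `gene`, its columns, and the sample records are hypothetical and chosen only for illustration; they do not come from any real biological database.

```python
import sqlite3


def store_extract_modify_demo():
    """Run the three basic DBMS operations on a tiny, made-up gene table."""
    # In-memory database: nothing is written to disk.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene ("
        "symbol TEXT PRIMARY KEY, organism TEXT, length_bp INTEGER)"
    )

    # Store: insert a few hypothetical records.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("geneA", "organism X", 1200),
            ("geneB", "organism X", 850),
            ("geneC", "organism Y", 2300),
        ],
    )

    # Extract: query the symbols of all genes of one organism.
    rows = conn.execute(
        "SELECT symbol FROM gene WHERE organism = ? ORDER BY symbol",
        ("organism X",),
    ).fetchall()
    symbols = [row[0] for row in rows]

    # Modify: update one record and delete another.
    conn.execute("UPDATE gene SET length_bp = ? WHERE symbol = ?", (900, "geneB"))
    conn.execute("DELETE FROM gene WHERE symbol = ?", ("geneC",))

    remaining = conn.execute(
        "SELECT symbol, length_bp FROM gene ORDER BY symbol"
    ).fetchall()
    conn.close()
    return symbols, remaining
```

Calling `store_extract_modify_demo()` returns the symbols found by the query and the rows left after the update and delete. A full DBMS adds much more on top of this (concurrency control, recovery, access rights), which the sketch does not attempt to show.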
15Biological Databases?
- Explosive growth in biological data
- E.g., tremendous increase in nucleotide sequences
(first increase in data due to the development of
the polymerase chain reaction (PCR) technique in
1983)
- 1980: 80 genes fully sequenced
16Biological Databases?
Total nucleotides (Nov 07): 188,490,792,445
Number of entries (Nov 07): 106,144,026
17Biological Databases?
- Database systems are crucial for managing large
and very large collections of data
- Data (genomic sequences, 3D structures, 2D gel
analysis, microarrays, etc.) directly submitted to
databases
- Essential tools for biological research, like
reading relevant literature
18Biological Databases: History
- 1965
- Margaret Dayhoff et al. publish Atlas of Protein
Sequence and Structure
- 1982
- EMBL initiates DNA sequence databases, followed
within a year by GenBank and in 1984 by the DNA
Data Bank of Japan
- 1988
- EMBL/GenBank/DDBJ agree on common format for data
elements
19Biological Databases: some statistics
- More than 1000 different databases
- 968 databases reported in The Molecular Biology
Database Collection: 2007 update by Galperin,
Nucleic Acids Research, 2007, Vol. 35, Database
issue, D3-D4
- Metabase: database of biological databases,
http://biodatabase.org/index.php/Main_Page
- Database sizes: <100 kB to >100 GB (EMBL >500 GB)
- DNA: >100 GB
- Protein: 1 GB
- 3D structure: 5 GB
- Update (adding new data) frequency: daily to
annually
- Freely accessible (as a rule)
20Some databases in the field of molecular biology
- AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP,
DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD,
DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc,
EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB,
ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS,
GenBank, GeneCards, Genline, GenLink, GENOTK,
GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb,
GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb,
HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN,
HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT,
Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb,
MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, ...
Find more at http://biodatabase.org
21Categories of Biological Databases
- Nucleotide sequences
- Genomics (information on gene chromosomal
location and nomenclature; links to sequence
databases)
- Mutation/polymorphism (sequence variations linked
or not to genetic diseases)
- Protein sequences
- Protein domain/family
- Proteomics (2D gel, MS)
22Categories of Biological Databases
- Microarray (high-dimensional data: profiles of
thousands of genes depending on
hundreds/thousands of various conditions)
- Organism-specific
- 3D structure
- Metabolism (e.g., metabolic pathways: graph
data)
- Bibliography
- Others
23Biological Databases: specific features
- Sub-class of scientific databases
- Autonomous: many independent maintainers
- Heterogeneous: e.g., various data formats for
the same data entities; various types of
biological data (genomic, microarray,
proteomic, ...)
- Dynamic: frequent and continuous changes in data
content (and, more importantly, in data schema)
- Broad domain knowledge
- Workflow-oriented databases: rich set of
analysis tools
- Information integration is essential: data
aggregation from several databases
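To make the heterogeneity and integration points above more concrete, here is a minimal sketch of aggregating records that describe the same entities but arrive in two different formats. Everything in it is assumed for illustration: the two source formats (comma-separated with a header, and tab-delimited without one), the accession values, and the field names are hypothetical and do not correspond to any real database export.

```python
import csv
import io

# Hypothetical export of source A: comma-separated, with a header row.
SOURCE_A = "accession,organism,length\nX0001,organism X,1200\nX0002,organism Y,850\n"

# Hypothetical export of source B: tab-delimited "accession<TAB>description".
SOURCE_B = "X0001\tkinase-like protein\nX0003\thypothetical protein\n"


def parse_source_a(text):
    """Parse the comma-separated format into {accession: fields}."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["accession"]: {
            "organism": row["organism"],
            "length": int(row["length"]),
        }
        for row in reader
    }


def parse_source_b(text):
    """Parse the tab-delimited format into {accession: fields}."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        accession, description = line.split("\t", 1)
        records[accession] = {"description": description}
    return records


def integrate(*sources):
    """Merge per-source records that share the same accession key."""
    merged = {}
    for source in sources:
        for accession, fields in source.items():
            merged.setdefault(accession, {}).update(fields)
    return merged
```

For example, `integrate(parse_source_a(SOURCE_A), parse_source_b(SOURCE_B))` yields one combined record per accession, with entries that appear in only one source keeping just that source's fields. The sketch assumes both sources use the same identifier for the same entity, which is often not the case in practice and is one reason integration is hard.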
24Biological Databases: integration
Figure is taken from Bioinformatics: Managing
Scientific Data by Lacroix & Critchlow, p. 20