Title: Biological Databases
1Biological Databases
- By Lim Yun Ping, E-mail: yunping_at_chitre.net
- National University of Singapore
2Overview
- Introduction
- What is a database
- What type of databases can we access
- What roles do they play
- What type of information can we get from them
- How do we access this information
3What is a database?
- A convenient method of organizing a vast amount of information
- Allows for proper storing, searching and retrieving of data
- Before analyzing data, we need to assemble it into central, shareable resources
4Why databases?
- A means to handle and share large volumes of biological data
- Support large-scale analysis efforts
- Make data access easy and keep data up to date
- Link knowledge obtained from various fields of biology and medicine
5Different Database Types
- The database type depends on:
  - the nature of the information stored (sequences, 2D gel or 3D structure images)
  - the manner of storage (flat files, tables in a relational database, etc.)
- In this course we are concerned more with the different types of databases than with the particular storage method
6Features
- Most of the databases have a web interface to search for data
- The most common way to search is by keywords
- Users can choose to view the data or save it to their computer
- Cross-references help to navigate easily from one database to another
7Biological Databases
8Types Of Biological Databases Accessible
- There are many different types of databases, but for routine sequence analysis the following are initially the most important:
  - Primary databases
  - Secondary databases
  - Composite databases
9Primary databases
- Contain sequence data such as nucleic acid or protein sequences
- Examples of primary databases include:
  - Protein Databases
    - SWISS-PROT
    - TrEMBL
    - PIR
  - Nucleic Acid Databases
    - EMBL
    - GenBank
    - DDBJ
10Secondary databases
- Sometimes known as pattern databases
- Contain results from the analysis of the sequences in the primary databases
- Examples of secondary databases include:
  - PROSITE
  - Pfam
  - BLOCKS
  - PRINTS
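As a rough illustration of what a "pattern" in a pattern database looks like, the sketch below translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The function name `prosite_to_regex` and the test sequence are invented for this example, and only the basic syntax elements (`x`, `[..]`, `{..}`, repeat counts, `<` and `>` anchors) are handled. It is a minimal sketch and not how PROSITE itself is implemented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' and '>' anchor the pattern to the sequence start or end
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as x(2) or x(2,4)
        repeat = ""
        if "(" in element:
            element, count = element.split("(", 1)
            repeat = "{" + count.rstrip(")") + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # one residue or [ABC] alternatives
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# N-glycosylation style motif: N, not P, S or T, not P
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
sequence = "MKNGSAANPTL"  # invented sequence, for illustration only
hits = [match.start() for match in re.finditer(regex, sequence)]
print(regex, hits)
```

The second candidate site in the invented sequence (`NPTL`) is rejected because proline follows the asparagine, which is the kind of rule a plain keyword search cannot express.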
11Composite databases
- Combine different primary database sources
- Make querying and searching efficient, without the need to go to each of the primary databases
- Examples of composite databases include:
  - NRDB (Non-Redundant DataBase)
  - OWL
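The idea of a non-redundant composite database can be sketched as merging records from several sources and keeping one entry per distinct sequence. The code below is only an exact-match illustration with invented identifiers and sequences; real composite databases such as NRDB and OWL apply their own, more elaborate rules, which are not described in these slides.

```python
def merge_non_redundant(*sources):
    """Merge (database, identifier, sequence) records, one entry per distinct sequence."""
    merged = {}
    for records in sources:
        for database, identifier, sequence in records:
            key = sequence.upper()
            # Identical sequences collapse into one entry that remembers every source
            entry = merged.setdefault(key, {"sequence": key, "ids": []})
            entry["ids"].append(f"{database}:{identifier}")
    return list(merged.values())


# Invented toy records; identifiers and sequences are not real database entries
swissprot_records = [("SWISS-PROT", "SP_EXAMPLE_1", "MKTAYIAK")]
pir_records = [
    ("PIR", "PIR_EXAMPLE_1", "MKTAYIAK"),   # same sequence as the SWISS-PROT record
    ("PIR", "PIR_EXAMPLE_2", "MSLLTEVE"),
]

composite = merge_non_redundant(swissprot_records, pir_records)
for entry in composite:
    print(entry["sequence"], entry["ids"])
```

Keeping the list of source identifiers on each merged entry preserves the cross-references back to the primary databases.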
12Nucleic acid Databases
- NCBI http://www.ncbi.nlm.nih.gov/ NCBI, at the NIH campus, USA
- EMBL http://www.embl-heidelberg.de/ European Molecular Biology Laboratory (its sequence database is maintained at the EBI, UK)
- DDBJ http://www.ddbj.nig.ac.jp DNA Data Bank of Japan
13The International Sequence Database Collaboration
GenBank
EMBL
DDBJ
14The International Sequence Database Collaboration
- These three databases have collaborated since 1982. Each database collects and processes new sequence data and relevant biological information from scientists in its region, e.g. EMBL collects from Europe, GenBank from the USA.
- These databases automatically update each other every 24 hours with the new sequences collected from each region. The result is that they contain exactly the same information, except for any sequences that have been added in the last 24 hours.
- This is an important consideration in your choice of database. If you need accurate and up-to-date information, you must search an up-to-date database.
15Amount Of Data Grows Rapidly
- As of June 2003, there were 32,528,249,295 bases in 25,592,865 sequences
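To put the two totals in perspective, a quick calculation gives the average length of a stored sequence. This is plain arithmetic on the figures quoted above; the variable names are chosen for this example.

```python
total_bases = 32_528_249_295      # figure quoted on the slide (June 2003)
total_sequences = 25_592_865      # figure quoted on the slide (June 2003)

average_length = total_bases / total_sequences
print(f"Average sequence length: {average_length:.0f} bases")
```

The result is on the order of a thousand bases per entry, so the collection is dominated by many relatively short records.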
16How to access them
- Main Sites
  - NCBI http://www.ncbi.nlm.nih.gov/
  - EMBL http://www.embl-heidelberg.de/
  - DDBJ http://www.ddbj.nig.ac.jp
- full release every two months
- incremental and cumulative updates daily
- available only through the internet: ftp://ftp.ncbi.nih.gov/genbank/
- 66.3 Gigabytes of data
17The Internet and WWW
18- NCBI http://www.ncbi.nlm.nih.gov/ NCBI, a division of NLM at the NIH campus, USA
- ExPASy http://www.expasy.org Swiss Institute of Bioinformatics
- KEGG http://www.genome.ad.jp/kegg/kegg2.html Kyoto Encyclopedia of Genes and Genomes
19- National Center for Biotechnology Information (NCBI)
- Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of molecular processes affecting human health and disease.
- http://www.ncbi.nlm.nih.gov/
21Entrez
- Entrez is a search and retrieval system that
integrates information from databases at NCBI.
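Besides the web interface, Entrez can also be queried programmatically through NCBI's E-utilities, which are not covered in these slides. The sketch below only builds an ESearch request URL and sends nothing over the network. The helper name `build_esearch_url` is invented for this example, and the search term is just a keyword, as one might type into the Entrez search box.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumption: not mentioned in the slides)
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that asks one Entrez database for matching record IDs."""
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("nucleotide", "BNIP")
print(url)
```

Changing the `database` argument (for example to `protein`) points the same keyword at a different Entrez database, which mirrors how the web interface lets one query be run across databases.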
23BNIP
25- Brief description of the sequence
- Accession Number: unique identifier
- Source: organism's common name
- Formal scientific name
- Contains information on the publications, such as the authors and titles of the journal articles that discuss the data reported in the record
- Contains the contact information of the submitter
- Contains information about the genes, gene products and regions of biological significance reported in the sequence
- Length of sequence
- Scientific name of the source organism
- Taxon ID number, map location
26- Coding sequence: the region of nucleotides that corresponds to the sequence of amino acids. This is also the region that contains the start and stop codons.
- Region of biological interest
- The amino acid translation corresponding to the nucleotide coding sequence
27How to understand the output
- Unique identifiers: each entry in a database must have a unique identifier
  - EMBL: Identifier (ID)
  - GenBank: Accession Number (AC)
- Other information is stored along with the sequence. Each piece of information is written on its own line, with a code defining the line, for example DE (description), OS (organism species), AC (accession number).
- Relevant biological information is usually described in the feature table (FT).
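The line-code layout described above lends itself to very simple parsing: read the two-letter code at the start of each line and file the rest of the line under it. The sketch below does this for a small, invented EMBL-style record. The record content and the function name `parse_line_codes` are made up for illustration, and only the codes named on this slide are handled.

```python
# Line codes named on the slide and their meaning
LINE_CODES = {
    "ID": "identifier",
    "AC": "accession number",
    "DE": "description",
    "OS": "organism species",
    "FT": "feature table",
}


def parse_line_codes(text: str) -> dict:
    """Group the lines of a line-coded record by their two-letter code."""
    record = {}
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code in LINE_CODES:
            # A code may occur on several lines, so collect values in a list
            record.setdefault(code, []).append(value)
    return record


# Invented record, for illustration only (not a real database entry)
SAMPLE = """\
ID   EXAMPLE1
AC   X00000;
DE   Hypothetical example mRNA
OS   Homo sapiens
FT   CDS             1..60
//
"""

record = parse_line_codes(SAMPLE)
for code, values in record.items():
    print(f"{code} ({LINE_CODES[code]}): {values}")
```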
28GenBank Flat File Format
- Refer to the Summary Description of the GenBank Flat File Format
- Or http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
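In the GenBank flat file, the annotations discussed on the earlier slides (description, accession number, source organism) appear as keyword lines, with the keyword in the first 12 columns and continuation lines indented underneath. The sketch below reads those header keywords from an invented record. It assumes this fixed-column layout, stops at the feature table, and is not a substitute for a full parser.

```python
def parse_genbank_header(text: str) -> dict:
    """Collect the header keywords of a GenBank-style record into a dictionary."""
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith(("FEATURES", "ORIGIN")):
            break                                   # header ends at the feature table
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword                       # new keyword such as DEFINITION
            fields[current] = value
        elif current is not None and value:
            fields[current] += " " + value          # continuation of the previous keyword
    return fields


# Invented record, for illustration only (not a real GenBank entry)
SAMPLE = """\
LOCUS       EXAMPLE1                  60 bp    mRNA    linear   PRI 01-JAN-2003
DEFINITION  Hypothetical example mRNA, used only to illustrate
            the layout.
ACCESSION   X00000
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
"""

fields = parse_genbank_header(SAMPLE)
for keyword, value in fields.items():
    print(f"{keyword}: {value}")
```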
29ExPASy
- Expert Protein Analysis System: proteomics server of the Swiss Institute of Bioinformatics (SIB)
- Dedicated to the analysis of protein sequences and structures
- http://www.expasy.org/
30Databases on the ExPASy server
- SWISS-PROT and TrEMBL - Protein knowledgebase
- PROSITE - Protein families and domains
- SWISS-2DPAGE - Two-dimensional polyacrylamide gel electrophoresis
- ENZYME - Enzyme nomenclature
- SWISS-3DIMAGE - 3D images of proteins and other biological macromolecules
- SWISS-MODEL Repository - Automatically generated protein models
31SWISS-PROT
- A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases
- http://tw.expasy.org/sprot/
32TrEMBL
- Computer-annotated supplement to SWISS-PROT
33ENZYME
- Enzyme nomenclature database
- http://tw.expasy.org/enzyme/
34ENZYME Database
- A repository of information relating to the nomenclature of enzymes
- Describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided
35Access to ENZYME
- by EC number
- by enzyme class
- by description (official name) or alternative name(s)
- by chemical compound
- by cofactor
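Access by EC number and by enzyme class are closely related, because an EC number is a four-level code whose first digit is the enzyme class. The sketch below splits an EC number into its levels. The function name `parse_ec_number` is invented for this example, and the class table is the standard Enzyme Commission top-level classification rather than something taken from these slides.

```python
# Top-level Enzyme Commission classes (class 7 was added to the nomenclature in 2018)
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec: str) -> dict:
    """Split an EC number such as 'EC 1.1.1.1' into its four hierarchical levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    levels = text.split(".")
    if len(levels) != 4:
        raise ValueError(f"expected four dot-separated levels, got {ec!r}")
    return {
        "class": EC_CLASSES.get(int(levels[0]), "unknown"),
        "subclass": ".".join(levels[:2]),
        "sub_subclass": ".".join(levels[:3]),
        "serial": levels[3],
    }


# EC 1.1.1.1 is alcohol dehydrogenase
info = parse_ec_number("EC 1.1.1.1")
print(info)
```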
38KEGG
- Kyoto Encyclopedia of Genes and Genomes
- http://www.genome.ad.jp/kegg/kegg2.html
39- A structured database containing information about metabolic pathways in many organisms.
40KEGG
- Part of the GenomeNet database system
- Linked to all accessible databases by search engines
- LIGAND
- BRITE
43Link to other pathways
Enzyme
Compound
45Summary
- Biological databases represent an invaluable resource in support of biological research.
- We can learn much about a particular molecule by searching databases and using available analysis tools.
- A large number of databases are available for that task. Some databases are very general while some are very specialised. For best results we often need to access multiple databases.
46- Common database search methods include keyword matching, sequence similarity, motif searching, and class searching.
- The problems with using biological databases include incomplete information, data spread over multiple databases, redundant information, various errors, sometimes incorrect links, and constant change.
47- Database standards, nomenclature, and naming conventions are not clearly defined for many aspects of biological information. This makes information extraction more difficult.
- Retrieval systems help extract rich information from multiple databases. Examples include Entrez and SRS.
- Formulating queries is a serious issue in biological databases. Often the quality of results depends on the quality of the queries.
- Access to biological databases is so important that today virtually every molecular biology project starts and ends with querying biological databases.
48The End