Biological Databases - PowerPoint PPT Presentation

1
Biological Databases

By Lim Yun Ping
E-mail: yunping_at_chitre.net
National University of Singapore

2
Overview

Introduction
What is a database?
What types of databases can we access?
What roles do they play?
What type of information can we get from them?
How do we access this information?

3
What is a database?

A convenient way of handling vast amounts of information
Allows for proper storing, searching and retrieving
of data
Before analyzing the data, we need to assemble them
into central, shareable resources

4
Why databases?

A means to handle and share large volumes of
biological data
Support large-scale analysis efforts
Make data access easy and keep data up to date
Link knowledge obtained from various fields of
biology and medicine

5
Different Database Types

The type depends on
the nature of the information stored
(sequences, 2D gel or 3D structure images)
the manner of storage (flat files, tables in a
relational database, etc.)
In this course we are concerned more with the
different types of databases than with the
particular storage method

6
Features

Most of the databases have a web interface to
search for data
The common mode of searching is by keywords
Users can choose to view the data or save it to
their computer
Cross-references help to navigate from one
database to another easily

7
Biological Databases
8
Types Of Biological Databases Accessible

There are many different types of databases, but
for routine sequence analysis the following are
initially the most important:
Primary databases
Secondary databases
Composite databases

9
Primary databases

Contain sequence data, such as nucleic acid or
protein sequences
Examples of primary databases include

Protein Databases
SWISS-PROT
TrEMBL
PIR

Nucleic Acid Databases
EMBL
GenBank
DDBJ

10
Secondary databases

Sometimes also known as pattern databases
Contain results from the analysis of the
sequences in the primary databases
Examples of secondary databases include

PROSITE
Pfam
BLOCKS
PRINTS
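
To illustrate what a "pattern" in a secondary database can look like, here is a minimal sketch that translates a PROSITE-style pattern into a regular expression and scans a sequence with it. The function names and the test sequence are hypothetical, and only the basic pattern syntax is handled. The pattern N-{P}-[ST]-{P} is the commonly cited N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a regular expression.

    Handles only the basic syntax: '-' separators, 'x' wildcards,
    [..] allowed residues, {..} forbidden residues, and (n) or (n,m) repeats.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            regex = core                       # one residue or a [..] set
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Return (start index, matched text) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Hypothetical protein fragment, used only for illustration.
hits = find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTQ")
print(hits)
```

The second N in the fragment is followed by P, which the pattern forbids, so only the first site is reported.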

11
Composite databases

Combine data from different primary databases
Make querying and searching efficient, without
the need to go to each of the primary databases
Examples of composite databases include

NRDB (Non-Redundant DataBase)
OWL
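
The idea of a non-redundant composite database can be sketched as follows. This is only an illustration under simple assumptions (entries count as redundant when their sequences are identical). The identifiers and sequences are invented, and real composite databases apply their own merging rules.

```python
def build_nonredundant(sources: dict) -> dict:
    """Merge entries from several primary databases.

    Keeps one record per unique sequence and remembers every
    source identifier that shares that sequence.
    """
    merged = {}
    for db_name, entries in sources.items():
        for entry_id, sequence in entries:
            key = sequence.upper()  # treat upper/lower case as the same sequence
            merged.setdefault(key, []).append(f"{db_name}:{entry_id}")
    return merged


# Invented entries, for illustration only.
sources = {
    "SWISS-PROT": [("SP_1", "MKTAYIAK"), ("SP_2", "MSSLLE")],
    "PIR": [("PIR_1", "mktayiak"), ("PIR_2", "MGHHW")],
}

nonredundant = build_nonredundant(sources)
for sequence, ids in nonredundant.items():
    print(sequence, ids)
```

Four source entries collapse into three records, so a single search covers both sources without returning the shared sequence twice.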

12
Nucleic acid Databases
NCBI http://www.ncbi.nlm.nih.gov/ NCBI, at the
NIH campus, USA
EMBL http://www.embl-heidelberg.de/ European
Molecular Biology Laboratory, UK
DDBJ http://www.ddbj.nig.ac.jp DNA Databank of
Japan
13
The International Sequence Database Collaboration
GenBank
EMBL
DDBJ
14
The International Sequence Database Collaboration

These three databases have collaborated since
1982. Each database collects and processes new
sequence data and relevant biological information
from scientists in its region, e.g. EMBL
collects from Europe, GenBank from the USA.
The databases automatically update each other
with the new sequences collected from each
region every 24 hours. The result is that they
contain exactly the same information, except for
any sequences that have been added in the last 24
hours.
This is an important consideration in your choice
of database. If you need accurate and up-to-date
information, you must search an up-to-date
database.

15
Amount Of Data Grows Rapidly

As of June 2003, there were 32,528,249,295 bases
in 25,592,865 sequences
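
As a quick piece of arithmetic on the figures above, the average entry length can be worked out as follows (a small sketch; the variable names are illustrative):

```python
# Figures quoted on the slide for June 2003.
bases = 32_528_249_295
sequences = 25_592_865

average_length = bases / sequences
print(f"About {average_length:.0f} bases per sequence on average")
```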

16
How to access them

Main Sites
NCBI http://www.ncbi.nlm.nih.gov/
EMBL http://www.embl-heidelberg.de/
DDBJ http://www.ddbj.nig.ac.jp

Full release every two months
Incremental and cumulative updates daily
Available only through the Internet
ftp://ftp.ncbi.nih.gov/genbank/
66.3 gigabytes of data

17
The Internet and WWW
18
NCBI http://www.ncbi.nlm.nih.gov/ NCBI, a
division of NLM at the NIH campus, USA
ExPASy http://www.expasy.org Swiss Institute
of Bioinformatics
Kyoto Encyclopedia of Genes and
Genomes http://www.genome.ad.jp/kegg/kegg2.html
19

National Center for Biotechnology Information
Established in 1988 as a national resource for
molecular biology information, NCBI creates
public databases, conducts research in
computational biology, develops software tools
for analyzing genome data, and disseminates
biomedical information, all for the better
understanding of molecular processes affecting
human health and disease.
http://www.ncbi.nlm.nih.gov/

21
Entrez

Entrez is a search and retrieval system that
integrates information from databases at NCBI.
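
Besides the web interface, Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds a search URL and does not contact the server. The helper name, the default result limit and the example query are assumptions made for illustration (the BNIP keyword is borrowed from a later slide title).

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build an ESearch URL for a keyword query against one Entrez database."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Keyword search restricted to one organism, in the nucleotide database.
url = build_esearch_url("nucleotide", "BNIP AND Homo sapiens[Organism]", 5)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching record identifiers as XML.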

23
BNIP
25
Brief description of the sequence
Accession number: unique identifier
Source: organism's common name
Formal scientific name of the organism
Contains information on the publications, such as
the authors and titles of the journal articles
that discuss the data reported in the record
Contains the contact information of the submitter

Contains information about the genes, gene
products and regions of biological significance
reported in the sequence
Length of the sequence,
scientific name of the source organism,
Taxon ID number, map location

26
Coding sequence (the region of nucleotides that
corresponds to the sequence of amino acids). Its
location includes the start and stop codons.
Region of biological interest
The amino acid translation corresponding to the
nucleotide coding sequence
27
How to understand the output
Unique identifiers: each entry in a database
must have a unique identifier
EMBL: Identifier (ID)
GenBank: Accession Number (AC)
Other information is stored along with the
sequence. Each piece of information is written on
its own line, with a code defining the line. For
example: DE, description; OS, organism species;
AC, accession number. Relevant biological
information is usually described in the feature
table (FT).
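
The line-code layout described above can be read with a few lines of code. This is a minimal sketch, assuming each line starts with its code followed by spaces. The sample entry is invented for illustration and is not a real database record.

```python
from collections import defaultdict


def parse_flat_file_entry(text: str) -> dict:
    """Group the lines of one EMBL-style entry by their line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        code, _, content = line.partition(" ")
        fields[code].append(content.strip())
    # Lines that repeat the same code are continuation lines; join them.
    return {code: " ".join(parts) for code, parts in fields.items()}


# Invented entry, for illustration only.
SAMPLE_ENTRY = """\
ID   EXAMPLE1; standard; mRNA; HUM; 1200 BP.
AC   X00000;
DE   Hypothetical example mRNA for
DE   an illustrative protein.
OS   Homo sapiens (human)
//
"""

entry = parse_flat_file_entry(SAMPLE_ENTRY)
print(entry["AC"])
print(entry["DE"])
```

Because every line carries its own code, the two DE lines are recognised as one description and joined.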
28
GenBank Flat File Format

Refer to Summary Description of the GenBank
Flat File Format
Or
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

29
ExPASy

Expert Protein Analysis System: proteomics server
of the Swiss Institute of Bioinformatics (SIB),
dedicated to the analysis of protein sequences
and structures
http://www.expasy.org/

30
Databases on the ExPASy server

SWISS-PROT and TrEMBL - Protein knowledgebase
PROSITE - Protein families and domains
SWISS-2DPAGE - Two-dimensional polyacrylamide gel
electrophoresis
ENZYME - Enzyme nomenclature
SWISS-3DIMAGE - 3D images of proteins and other
biological macromolecules
SWISS-MODEL Repository - Automatically generated
protein models

31
SWISS-PROT

A curated protein sequence database which
strives to provide a high level of annotation
(such as the description of the function of a
protein, its domain structure,
post-translational modifications, variants,
etc.), a minimal level of redundancy and a high
level of integration with other databases
http://tw.expasy.org/sprot/

32
TrEMBL

Computer-annotated supplement to SWISS-PROT

33
ENZYME

Enzyme nomenclature database
http://tw.expasy.org/enzyme/

34
ENZYME Database

A repository of information relative to the
nomenclature of enzymes
Describes each type of characterized enzyme for
which an EC (Enzyme Commission) number has been
provided

35
Access to ENZYME

by EC number
by enzyme class
by description (official name) or alternative
name(s)
by chemical compound
by cofactor
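
Access by EC number or by enzyme class relies on the hierarchical structure of EC numbers, where the first of the four levels gives the top-level class. The sketch below shows this, with a hypothetical helper name. Class 7 (translocases) was introduced after these slides were written and is included only for completeness.

```python
# Top-level enzyme classes, indexed by the first level of the EC number.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec: str) -> dict:
    """Split an EC number into its four levels and name its top-level class."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()  # accept both "EC 1.1.1.1" and "1.1.1.1"
    levels = text.split(".")
    if len(levels) != 4 or not levels[0].isdigit():
        raise ValueError(f"Not a four-level EC number: {ec!r}")
    top_level = int(levels[0])
    return {
        "ec": text,
        "class": EC_CLASSES.get(top_level, "Unknown"),
        "levels": levels,
    }


print(parse_ec_number("EC 1.1.1.1"))  # alcohol dehydrogenase
print(parse_ec_number("3.4.21.4"))    # trypsin
```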

38
KEGG

Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/kegg2.html

39
A structured database containing information
about metabolic pathways in many organisms.
40
KEGG

Part of the GenomeNet database system
Linked to all accessible databases by search
engines
LIGAND
BRITE

43
Link to other pathways
Enzyme
Compound
45
Summary

Biological databases represent an invaluable
resource in support of biological research.
We can learn much about a particular molecule by
searching databases and using available analysis
tools.
A large number of databases are available for
that task. Some databases are very general while
some are very specialised. For best results we
often need to access multiple databases.

Common database search methods include keyword
matching, sequence similarity, motif searching,
and class searching.
The problems with using biological databases
include incomplete information, data spread over
multiple databases, redundant information,
various errors, sometimes incorrect links, and
constant change.

Database standards, nomenclature, and naming
conventions are not clearly defined for many
aspects of biological information. This makes
information extraction more difficult.
Retrieval systems help extract rich information
from multiple databases. Examples include Entrez
and SRS.
Formulating queries is a serious issue in
biological databases. Often the quality of
results depends on the quality of the queries.
Access to biological databases is so important
that today virtually every molecular biological
project starts and ends with querying biological
databases.