Databases in Biology - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Databases in Biology

1
Databases in Biology

Support and Service, Bioinformatics Lab., CRS4
November 2006

Sergio Contrino
2
outline

- Why biological databases?
- Why RDBMS?
- examples

3
Why biological databases?
www.downtowngrassvalley.com/
4
Biological databases
As of 2006, there are over 1,000 public and
commercial biological databases. These biological
databases usually contain genomics and proteomics
data, but databases are also used in taxonomy.
The data are nucleotide sequences of genes or
amino acid sequences of proteins. Furthermore
information about function, structure,
localisation on chromosome, clinical effects of
mutations as well as similarities of biological
sequences can be found.
http://en.wikipedia.org/wiki/Biological_database
5
Biological databases
http://nar.oxfordjournals.org/
6
NAR 2006
http://nar.oxfordjournals.org/
13th edition: 856 databases (139 more than in 2005)
7
NAR 2006
Database usefulness: 5 most cited (Science Citation Index)

- Pfam (375 in < 2 years)
- GO
- UniProt
- SMART
- KEGG
8
Biological databases because ...

- a way of organising vast and complex data
- producing scientific results
9
Why (R)DBMS?
www.net.dcs.hull.ac.uk/aboutus/findingus.htm
10
(R)DBMS

Relational
Data
Base
Management
System

11
The alternative
ID   U87107     standard; DNA; SYN; 8840 BP.
XX
AC   U87107;
XX
SV   U87107.1
XX
DT   15-OCT-1997 (Rel. 52, Created)
DT   15-OCT-1997 (Rel. 52, Last updated, Version 4)
XX
DE   Cloning vector pAL-F insertion sequence IS1 galactokinase (galK),
DE   aminoglycoside 3'-phosphotransferase (kn), beta-galactosidase (lacZ), small
DE   ribosomal protein and beta-lactamase (Ap) genes, complete cds.
XX
KW   .
XX
OS   Cloning vector pAL-F
OC   artificial sequence; vectors.
XX
RN   [1]
RP   1-8840
RA   Ahmed A., Podemski L.;
RT   "Use of ordered deletions in genome sequencing";
RL   Gene 197:367-373(1997).
XX
RN   [2]
RP   1-8840
RA   Ahmed A.;
RT   ;
RL   Submitted (27-JAN-1997) to the EMBL/GenBank/DDBJ databases.
RL   Biological Sciences, University of Alberta, Edmonton, Alberta, T6G 2E9,
RL   Canada
XX
DR   REMTREMBL; AAC53713; AAC53713.
..
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..8840
FT                   /db_xref="taxon:56954"
FT                   /organism="Cloning vector pAL-F"
FT                   /insertion_seq="IS1"
FT                   /specific_host="Escherichia coli"
FT   CDS             complement(933..2081)
FT                   /codon_start=1
FT                   /db_xref="REMTREMBL:AAC53713"
FT                   /transl_table=11
FT                   /gene="galK"
FT                   /product="galactokinase"
FT                   /protein_id="AAC53713.1"
FT                   /translation="MSLKEKTQSLFANAFGYPATHTIQAPGRVNLIGEHTDYNDGFVLP
FT                   CAIDYQTVISCAPRDDRKVRVMAADYENQLDEFSLDAPIVAHENYQWANYVRGVVKHLQ
FT                   LRNNSFGGVDMVDHGNVPQGAGLSSSASLEVAVGTVLQQLYHLPLDGAQIALNGQEAEN
FT                   QFVGCNCGIMDQLISALGKKDHALLIDCRSLGTKAVSMPKGVAVVIINSNFKRTLVGSE
FT                   YNTRREQCETGARFFQQPALRDVTIEEFNAVAHELDPIVAKRVRHILTENARTVEAASA
FT                   LEQGDLKRMGELMAESHASMRDDFEITVPQIDTLVEIVKAVIGDKGGVRMTGGGFGGCI
FT                   VALIPEELVPAVQQRVAEQYEAKTGIKETFYVCKPSQGAGQC"
..
XX
SQ   Sequence 8840 BP; 2068 A; 2288 C; 2319 G; 2165 T; 0 other;
     caattactgc aatgccctcg taattaagtg aatttacaat atcgtcctgt tcggagggaa        60
     gaacgcggga tgttcattct tcatcacttt taattgatgt atatgctctc ttttctgacg       120
..
     agtaagttgg cagcatcacc                                                  8840
//
EMBL flat file - structured information -
parsable
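
Because every line of the flat file starts with a two-letter line code, a short program can pull the fields apart. The following is a minimal sketch in Python, not an official EMBL parser: the function name `parse_embl_entry` and the dictionary layout are assumptions made for illustration, and the sample is a shortened excerpt of entry U87107 written in the standard EMBL layout (only the first sequence line is included).

```python
from collections import defaultdict

# Shortened excerpt of the U87107 entry (only the first sequence line is kept).
SAMPLE = """\
ID   U87107     standard; DNA; SYN; 8840 BP.
XX
AC   U87107;
XX
DE   Cloning vector pAL-F insertion sequence IS1 galactokinase (galK),
DE   aminoglycoside 3'-phosphotransferase (kn), beta-galactosidase (lacZ), small
XX
SQ   Sequence 8840 BP; 2068 A; 2288 C; 2319 G; 2165 T; 0 other;
     caattactgc aatgccctcg taattaagtg aatttacaat atcgtcctgt tcggagggaa        60
//
"""


def parse_embl_entry(text):
    """Group the lines of one entry by line code (hypothetical helper)."""
    fields = defaultdict(list)
    bases = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):      # terminator line: end of the entry
            break
        if in_sequence:
            # Sequence lines: keep the base groups, drop the position counter.
            bases.extend(tok for tok in line.split() if tok.isalpha())
            continue
        code, data = line[:2], line[5:].strip()
        if code == "XX" or not code.strip():
            continue                   # spacer lines carry no data
        fields[code].append(data)
        if code == "SQ":               # everything after SQ is sequence data
            in_sequence = True
    record = {code: " ".join(values) for code, values in fields.items()}
    record["sequence"] = "".join(bases)
    return record


record = parse_embl_entry(SAMPLE)
print(record["AC"], len(record["sequence"]))
```

The sketch shows why the format counts as parsable, and also what it costs: every program that needs a single field has to read and interpret the whole entry first.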
12
The alternative
ID   U87107     standard; DNA; SYN; 8840 BP.
XX
AC   U87107;
XX
SV   U87107.1
XX
DT   15-OCT-1997 (Rel. 52, Created)
DT   15-OCT-1997 (Rel. 52, Last updated, Version 4)
XX
DE   Cloning vector pAL-F insertion sequence IS1 galactokinase (galK),
DE   aminoglycoside 3'-phosphotransferase (kn), beta-galactosidase (lacZ), small
DE   ribosomal protein and beta-lactamase (Ap) genes, complete cds.
XX
KW   .
XX
OS   Cloning vector pAL-F
OC   artificial sequence; vectors.
XX
RN   [1]
RP   1-8840
RA   Ahmed A., Podemski L.;
RT   "Use of ordered deletions in genome sequencing";
RL   Gene 197:367-373(1997).
XX
RN   [2]
RP   1-8840
RA   Ahmed A.;
RT   ;
RL   Submitted (27-JAN-1997) to the EMBL/GenBank/DDBJ databases.
RL   Biological Sciences, University of Alberta, Edmonton, Alberta, T6G 2E9,
RL   Canada
XX
DR   REMTREMBL; AAC53713; AAC53713.
..
EMBL flat file (upper)
13
The alternative
..
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..8840
FT                   /db_xref="taxon:56954"
FT                   /organism="Cloning vector pAL-F"
FT                   /insertion_seq="IS1"
FT                   /specific_host="Escherichia coli"
FT   CDS             complement(933..2081)
FT                   /codon_start=1
FT                   /db_xref="REMTREMBL:AAC53713"
FT                   /transl_table=11
FT                   /gene="galK"
FT                   /product="galactokinase"
FT                   /protein_id="AAC53713.1"
FT                   /translation="MSLKEKTQSLFANAFGYPATHTIQAPGRVNLIGEHTDYNDGFVLP
FT                   CAIDYQTVISCAPRDDRKVRVMAADYENQLDEFSLDAPIVAHENYQWANYVRGVVKHLQ
FT                   LRNNSFGGVDMVDHGNVPQGAGLSSSASLEVAVGTVLQQLYHLPLDGAQIALNGQEAEN
FT                   QFVGCNCGIMDQLISALGKKDHALLIDCRSLGTKAVSMPKGVAVVIINSNFKRTLVGSE
FT                   YNTRREQCETGARFFQQPALRDVTIEEFNAVAHELDPIVAKRVRHILTENARTVEAASA
FT                   LEQGDLKRMGELMAESHASMRDDFEITVPQIDTLVEIVKAVIGDKGGVRMTGGGFGGCI
FT                   VALIPEELVPAVQQRVAEQYEAKTGIKETFYVCKPSQGAGQC"
..
XX
SQ   Sequence 8840 BP; 2068 A; 2288 C; 2319 G; 2165 T; 0 other;
     caattactgc aatgccctcg taattaagtg aatttacaat atcgtcctgt tcggagggaa        60
..
     agtaagttgg cagcatcacc                                                  8840
//
EMBL flat file (lower)
14
EMBL flat file legend (upper)

- The ID (IDentification) line is always the first line of an entry. The general form of the ID line is: ID entryname dataclass; molecule; division; sequencelength BP. E.g. ID U87107 standard; DNA; SYN; 8840 BP.
- The XX line (<XX>) contains no data or comments. It is used instead of blank lines to avoid confusion with sequence data lines.
- The AC (Accession Number) line lists the accession numbers associated with this entry.
- The SV (Sequence Version) line contains the new format of the nucleotide sequence identifier.
- The DT (DaTe) line shows when an entry first appeared in the database and when it was last updated.
- The DE (DEscription) lines contain general descriptive information about the sequence stored.
- The KW (KeyWord) lines provide information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The keywords chosen for each entry serve as a subject reference for the sequence, and will be expanded as work with the database continues. Often several KW lines are necessary for a single entry.
- The OS (Organism Species) line specifies the preferred scientific name of the organism which was the source of the stored sequence.
- The OC (Organism Classification) lines contain the taxonomic classification of the source organism.
- The RN (Reference Number) line gives a unique number to each reference citation within an entry.
- The RC (Reference Comment) line type is an optional line type which appears if the reference has a comment.
15
- The RP (Reference Position) line type is an optional line type which appears if one or more contiguous base spans of the presented sequence can be attributed to the reference in question.
- The RX (Reference Cross-reference) line type is an optional line type which contains a cross-reference to an external citation or abstract database.
- The RA (Reference Author) lines list the authors of the paper (or other work) cited.
- The RT (Reference Title) lines give the title of the paper (or other work).
- The RL (Reference Location) line contains the conventional citation information for the reference.
- The DR (Database Cross-Reference) line cross-references other databases which contain information related to the entry in which the DR line appears.
- The CC lines are free text comments about the entry, and may be used to convey any sort of information thought to be useful.
- The FH (Feature Header) lines are present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs.
- The FT (Feature Table) lines provide a mechanism for the annotation of the sequence data. Regions or sites in the sequence which are of interest are listed in the table.
- The SQ (SeQuence header) line marks the beginning of the sequence data and gives a summary of its content.
- The sequence data lines have a line code of two blanks. The sequence is written 60 bases per line, in groups of 10 bases separated by a blank character, beginning in position 6 of the line. The direction listed is always 5' to 3'.
- The // (terminator) line also contains no data or comments. It designates the end of an entry.

EMBL flat file legend (lower)
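
The ID line form and the sequence line layout described in the legend can be turned into code fairly directly. The sketch below is illustrative only: `parse_id_line` and `format_sequence_lines` are hypothetical names, the ID line is assumed to use the semicolon-separated layout of the EMBL format of that period, and right-justifying the base counter so that each line is 80 characters wide is an assumption about the layout rather than something stated on the slide.

```python
def parse_id_line(line):
    """Split an ID line into its parts (hypothetical helper)."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Data starts in position 6; the entry name comes first.
    entryname, rest = line[5:].split(None, 1)
    dataclass, molecule, division, length = (
        part.strip() for part in rest.rstrip(".").split(";")
    )
    return {
        "entryname": entryname,
        "dataclass": dataclass,
        "molecule": molecule,
        "division": division,
        "length_bp": int(length.split()[0]),  # "8840 BP" -> 8840
    }


def format_sequence_lines(seq):
    """Write a sequence 60 bases per line, in groups of 10, from position 6."""
    lines = []
    for start in range(0, len(seq), 60):
        chunk = seq[start:start + 60]
        groups = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        body = "     " + " ".join(groups)   # five blanks, then the groups
        end = start + len(chunk)            # number of the last base on the line
        lines.append(f"{body:<70}{end:>10}")
    return lines


id_info = parse_id_line("ID   U87107     standard; DNA; SYN; 8840 BP.")
seq_lines = format_sequence_lines("acgt" * 17 + "ac")  # 70 made-up bases
print(id_info)
print("\n".join(seq_lines))
```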
16
SWISS-PROT flat file
17
Why (R)DBMS?

- data independence
- improved consistency
- data integrity
- data security
- uniform data administration
- concurrent access
- redundancy control
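
Two of the points above, data integrity and redundancy control, can be illustrated with a very small schema. This is a sketch using SQLite through Python's `sqlite3` module, not one of the Oracle or MySQL schemas discussed in the talk; the table and column names are invented for illustration, and the accessions starting with DEMO are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE organism (              -- each organism is stored once
    taxon_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE entry (                 -- entries point at an organism by id
    accession TEXT PRIMARY KEY,
    length_bp INTEGER NOT NULL CHECK (length_bp > 0),
    taxon_id  INTEGER NOT NULL REFERENCES organism(taxon_id)
);
""")


def try_insert(sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:                   # commit on success, roll back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


INSERT_ENTRY = "INSERT INTO entry VALUES (?, ?, ?)"
ok_organism = try_insert("INSERT INTO organism VALUES (?, ?)",
                         (56954, "Cloning vector pAL-F"))
ok_entry = try_insert(INSERT_ENTRY, ("U87107", 8840, 56954))
duplicate = try_insert(INSERT_ENTRY, ("U87107", 8840, 56954))  # same accession twice
orphan = try_insert(INSERT_ENTRY, ("DEMO0001", 100, 1))        # unknown organism
bad_length = try_insert(INSERT_ENTRY, ("DEMO0002", 0, 56954))  # violates the CHECK

entry_count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(ok_organism, ok_entry, duplicate, orphan, bad_length, entry_count)
```

The organism name is stored once and referenced by id, which is the redundancy control; the database itself refuses the three bad rows, which is the integrity part.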

18
Why (R)DBMS?

AAA / CIA / ACID
Authentication, Authorization, Accountability
Confidentiality, Integrity, Availability
Atomicity, Consistency, Isolation, Durability
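
Atomicity, the first letter of ACID, means that a group of changes is applied completely or not at all. The sketch below shows the idea with SQLite via Python's `sqlite3` module; the two tables and the `load_entry` helper are hypothetical and much simpler than a real sequence database, and the accession DEMO0001 is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    accession   TEXT PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE feature (
    accession TEXT NOT NULL,
    key       TEXT NOT NULL,
    location  TEXT NOT NULL
);
""")


def load_entry(accession, description, features):
    """Insert an entry and all its features as one transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("INSERT INTO entry VALUES (?, ?)",
                         (accession, description))
            conn.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                             [(accession, key, loc) for key, loc in features])
        return True
    except sqlite3.IntegrityError:
        return False


good = load_entry("U87107", "Cloning vector pAL-F",
                  [("source", "1..8840"), ("CDS", "complement(933..2081)")])
# The second load fails on its feature (location is NULL), so the entry row
# inserted just before it is rolled back as well.
bad = load_entry("DEMO0001", "incomplete record", [("CDS", None)])

entries = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
features = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
print(good, bad, entries, features)
```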

19
Why (R)DBMS?

information
- data management

20
Biological databases
http://mia.sdsc.edu/mia/html/bioDBsmap.html
21

In Oracle
30 tables
241365 + 3368791 = 3.6 M entries (Rel. 9.1, 14/11/06)
10G AA

22
In Oracle
23
Each field is independent
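
When each field lives in its own column, it can be searched or updated without reading the rest of the record. A minimal sketch with SQLite through Python's `sqlite3` module follows; the single-table layout is an assumption for illustration (it is not the Oracle schema shown on the slide), and the two DEMO rows are made-up data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE entry (
    accession   TEXT PRIMARY KEY,
    molecule    TEXT,
    division    TEXT,
    length_bp   INTEGER,
    description TEXT
)
""")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?, ?, ?)",
    [
        ("U87107", "DNA", "SYN", 8840, "Cloning vector pAL-F"),
        ("DEMO0001", "DNA", "SYN", 1200, "made-up short vector"),
        ("DEMO0002", "RNA", "HUM", 9500, "made-up transcript"),
    ],
)

# Search on two fields without touching the others.
long_synthetic = [
    row[0]
    for row in conn.execute(
        "SELECT accession FROM entry "
        "WHERE division = ? AND length_bp > ? ORDER BY accession",
        ("SYN", 5000),
    )
]

# Update one field; the remaining columns of the row stay as they were.
conn.execute(
    "UPDATE entry SET description = ? WHERE accession = ?",
    ("Cloning vector pAL-F, complete cds", "U87107"),
)
length_after = conn.execute(
    "SELECT length_bp FROM entry WHERE accession = ?", ("U87107",)
).fetchone()[0]
print(long_synthetic, length_after)
```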
24
Ensembl

comprehensive and integrated source of
annotation of large genome sequences
In MySQL
65 tables (Rel. 41)
19 genomes

ArrayExpress is a public repository for
microarray data, which is aimed at storing well
annotated data in accordance with MGED
recommendations.

In Oracle 9.2
Tables 219 Columns 670
db size 1337133 MB (1.3 TB)
Nr of db rows 1010989277 (1G)
October 2006

EXPERIMENTS 1739
ARRAYS 1202
PROTOCOLS 8251
HYBRIDIZATIONS 52610
October 2006

The PDB is the single worldwide repository for
the processing and distribution of 3-D structure
data of large molecules of proteins and nucleic
acids.

In MySQL (5.0)
Tables 461 Columns 5294
Nr of db rows 819872661 (800M)
PDB2004

30
References
http://nar.oxfordjournals.org/
http://www.expasy.uniprot.org/
http://www.ensembl.org/
http://www.genome.org/
http://www.ebi.ac.uk/arrayexpress-old/
http://www.pdb.org/pdb
