Title: Databases in Biology
1Databases in Biology
- Support and Service, Bioinformatics Lab., CRS4
- November 2006
Sergio Contrino
2outline
- - Why biological databases?
- - Why RDBMS?
- - examples
3Why biological databases?
www.downtowngrassvalley.com/
4Biological databases
As of 2006, there are over 1,000 public and
commercial biological databases. These biological
databases usually contain genomics and proteomics
data, but databases are also used in taxonomy.
The data are nucleotide sequences of genes or
amino acid sequences of proteins. Furthermore
information about function, structure,
localisation on chromosome, clinical effects of
mutations as well as similarities of biological
sequences can be found.
http://en.wikipedia.org/wiki/Biological_database
5Biological databases
http://nar.oxfordjournals.org/
6NAR 2006
http://nar.oxfordjournals.org/
13th edition: 856 databases (139 more than in 2005)
7NAR 2006
Database usefulness: the 5 most cited (Science Citation Index)
- Pfam (375 in < 2 years)
- GO
- Uniprot
- SMART
- KEGG
8Biological databases because ..
- a way of organising vast and complex data
- producing scientific results
9Why (R)DBMS?
www.net.dcs.hull.ac.uk/aboutus/findingus.htm
10(R)DBMS
- Relational
- Data
- Base
- Management
- System
11The alternative
ID   U87107     standard; DNA; SYN; 8840 BP.
XX
AC   U87107;
XX
SV   U87107.1
XX
DT   15-OCT-1997 (Rel. 52, Created)
DT   15-OCT-1997 (Rel. 52, Last updated, Version 4)
XX
DE   Cloning vector pAL-F insertion sequence IS1 galactokinase (galK),
DE   aminoglycoside 3'-phosphotransferase (kn), beta-galactosidase (lacZ), small
DE   ribosomal protein and beta-lactamase (Ap) genes, complete cds.
XX
KW   .
XX
OS   Cloning vector pAL-F
OC   artificial sequence; vectors.
XX
RN   [1]
RP   1-8840
RA   Ahmed A., Podemski L.;
RT   "Use of ordered deletions in genome sequencing";
RL   Gene 197:367-373(1997).
XX
RN   [2]
RP   1-8840
RA   Ahmed A.;
RT   ;
RL   Submitted (27-JAN-1997) to the EMBL/GenBank/DDBJ databases.
RL   Biological Sciences, University of Alberta, Edmonton, Alberta, T6G 2E9,
RL   Canada
XX
DR   REMTREMBL; AAC53713; AAC53713.
..
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..8840
FT                   /db_xref="taxon:56954"
FT                   /organism="Cloning vector pAL-F"
FT                   /insertion_seq="IS1"
FT                   /specific_host="Escherichia coli"
FT   CDS             complement(933..2081)
FT                   /codon_start=1
FT                   /db_xref="REMTREMBL:AAC53713"
FT                   /transl_table=11
FT                   /gene="galK"
FT                   /product="galactokinase"
FT                   /protein_id="AAC53713.1"
FT                   /translation="MSLKEKTQSLFANAFGYPATHTIQAPGRVNLIGEHTDYNDGFVLP
FT                   CAIDYQTVISCAPRDDRKVRVMAADYENQLDEFSLDAPIVAHENYQWANYVRGVVKHLQ
FT                   LRNNSFGGVDMVDHGNVPQGAGLSSSASLEVAVGTVLQQLYHLPLDGAQIALNGQEAEN
FT                   QFVGCNCGIMDQLISALGKKDHALLIDCRSLGTKAVSMPKGVAVVIINSNFKRTLVGSE
FT                   YNTRREQCETGARFFQQPALRDVTIEEFNAVAHELDPIVAKRVRHILTENARTVEAASA
FT                   LEQGDLKRMGELMAESHASMRDDFEITVPQIDTLVEIVKAVIGDKGGVRMTGGGFGGCI
FT                   VALIPEELVPAVQQRVAEQYEAKTGIKETFYVCKPSQGAGQC"
..
XX
SQ   Sequence 8840 BP; 2068 A; 2288 C; 2319 G; 2165 T; 0 other;
     caattactgc aatgccctcg taattaagtg aatttacaat atcgtcctgt tcggagggaa        60
     gaacgcggga tgttcattct tcatcacttt taattgatgt atatgctctc ttttctgacg       120
     ..
     agtaagttgg cagcatcacc                                                  8840
//
EMBL flat file
- structured information
- parsable
14EMBL flat file legend (upper)
- The ID (IDentification) line is always the first line of an entry. The general form of the ID line is: ID entryname dataclass; molecule; division; sequencelength BP. e.g. ID U87107 standard; DNA; SYN; 8840 BP.
- The XX line contains no data or comments. It is used instead of blank lines to avoid confusion with sequence data lines.
- The AC (Accession Number) line lists the accession numbers associated with this entry.
- The SV (Sequence Version) line contains the new format of the nucleotide sequence identifier.
- The DT (DaTe) line shows when an entry first appeared in the database and when it was last updated.
- The DE (DEscription) lines contain general descriptive information about the sequence stored.
- The KW (KeyWord) lines provide information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The keywords chosen for each entry serve as a subject reference for the sequence, and will be expanded as work with the database continues. Often several KW lines are necessary for a single entry.
- The OS (Organism Species) line specifies the preferred scientific name of the organism which was the source of the stored sequence.
- The OC (Organism Classification) lines contain the taxonomic classification of the source organism.
- The RN (Reference Number) line gives a unique number to each reference citation within an entry.
- The RC (Reference Comment) line is an optional line type which appears if the reference has a comment.
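Because every line starts with a line code, an entry can be split into fields with very little code. The following is a minimal sketch, not an official EMBL parser: it assumes the two-letter code sits in columns 1-2 and the data starts at column 6, and the function name `group_by_line_code` and the abridged sample text are illustrative only.

```python
from collections import defaultdict

def group_by_line_code(entry_text):
    """Group the lines of one EMBL entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        if code == "//":           # terminator line: end of the entry
            break
        if code == "XX":           # spacer line: carries no data
            continue
        if not code.strip():       # blank line or sequence data line (two blanks)
            continue
        fields[code].append(line[5:].strip())   # assumption: data starts at column 6
    return dict(fields)

# Abridged sample based on the U87107 entry shown earlier.
SAMPLE = """\
ID   U87107     standard; DNA; SYN; 8840 BP.
XX
AC   U87107;
XX
DE   Cloning vector pAL-F insertion sequence IS1 galactokinase (galK),
DE   aminoglycoside 3'-phosphotransferase (kn), beta-galactosidase (lacZ), small
XX
OS   Cloning vector pAL-F
//
"""

fields = group_by_line_code(SAMPLE)
print(fields["OS"])
```

Multi-line fields such as DE come back as a list with one item per line, so the caller decides how to join them.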
15EMBL flat file legend (lower)
- The RP (Reference Position) line is an optional line type which appears if one or more contiguous base spans of the presented sequence can be attributed to the reference in question.
- The RX (Reference Cross-reference) line is an optional line type which contains a cross-reference to an external citation or abstract database.
- The RA (Reference Author) lines list the authors of the paper (or other work) cited.
- The RT (Reference Title) lines give the title of the paper (or other work).
- The RL (Reference Location) line contains the conventional citation information for the reference.
- The DR (Database Cross-Reference) line cross-references other databases which contain information related to the entry in which the DR line appears.
- The CC lines are free text comments about the entry, and may be used to convey any sort of information thought to be useful.
- The FH (Feature Header) lines are present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs.
- The FT (Feature Table) lines provide a mechanism for the annotation of the sequence data. Regions or sites in the sequence which are of interest are listed in the table.
- The SQ (SeQuence header) line marks the beginning of the sequence data and gives a summary of its content.
- The sequence data lines have a line code consisting of two blanks. The sequence is written 60 bases per line, in groups of 10 bases separated by a blank character, beginning in position 6 of the line. The direction listed is always 5' to 3'.
- The // (terminator) line also contains no data or comments. It designates the end of an entry.
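The layout of the sequence data lines (two leading blanks, groups of 10 bases, a position counter at the end of the line) can be read back with a few lines of code. This is a sketch under assumptions: the function name `read_sequence` is illustrative, and the sample is a shortened toy entry built from the first two sequence lines of U87107, so its SQ header is made up for the example.

```python
def read_sequence(entry_text):
    """Return the sequence of one EMBL entry as a single string."""
    chunks = []
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("SQ"):        # SQ header: sequence data follows
            in_sequence = True
        elif line.startswith("//"):      # terminator: end of the entry
            break
        elif in_sequence and line.startswith("  "):
            # Keep the groups of bases, drop the trailing position counter.
            groups = [token for token in line.split() if token.isalpha()]
            chunks.append("".join(groups))
    return "".join(chunks)

# Toy entry: only the first 120 bases, with a hypothetical SQ header.
SAMPLE = """\
SQ   Sequence 120 BP;
     caattactgc aatgccctcg taattaagtg aatttacaat atcgtcctgt tcggagggaa        60
     gaacgcggga tgttcattct tcatcacttt taattgatgt atatgctctc ttttctgacg       120
//
"""

sequence = read_sequence(SAMPLE)
print(len(sequence))
```

Filtering on alphabetic tokens is a shortcut that works because the only non-base token on a sequence line is the numeric counter.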
16SWISS-PROT flat file
17Why (R)DBMS?
- - data independence
- - improved consistency
- - data integrity
- - data security
- - uniform data administration
- - concurrent access
- - redundancy control
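Data integrity and redundancy control can be illustrated with a small sketch. It uses SQLite through Python's standard `sqlite3` module as a stand-in for a full RDBMS. The two-table schema is an assumption made for the example, and the accession `X00000` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only when enabled
conn.execute("""
    CREATE TABLE entry (
        accession   TEXT PRIMARY KEY,      -- one row per accession: no duplicates
        description TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE feature (
        id        INTEGER PRIMARY KEY,
        accession TEXT NOT NULL REFERENCES entry(accession),
        gene      TEXT
    )""")
conn.execute("INSERT INTO entry VALUES ('U87107', 'Cloning vector pAL-F')")
conn.execute("INSERT INTO feature (accession, gene) VALUES ('U87107', 'galK')")

def try_insert(sql):
    """Run one INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

# Same accession twice: refused by the primary key.
duplicate = try_insert("INSERT INTO entry VALUES ('U87107', 'same accession again')")
# Feature pointing to an entry that does not exist (hypothetical accession).
orphan = try_insert("INSERT INTO feature (accession, gene) VALUES ('X00000', 'lacZ')")
entry_count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(duplicate, orphan, entry_count)
```

The checks live in the schema, so every program that writes to the database is held to the same rules. With flat files, each script has to enforce them itself.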
18Why (R)DBMS?
- AAA / CIA / ACID
- Authentication, Authorization, Accountability
- Confidentiality, Integrity, Availability
- Atomicity, Consistency, Isolation, Durability
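Atomicity, the first letter of ACID, can be shown in a few lines: a batch of changes is applied completely or not at all. This is a sketch using SQLite via Python's `sqlite3` module, not the Oracle or MySQL systems mentioned elsewhere in these slides. The table layout, the helper `load_batch` and the accession `A00001` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (accession TEXT PRIMARY KEY, version INTEGER NOT NULL)")
with conn:
    conn.execute("INSERT INTO entry VALUES ('U87107', 1)")

def load_batch(connection, rows):
    """Insert all rows in one transaction; return True only if all succeed."""
    try:
        with connection:   # commits on success, rolls back if an exception escapes
            connection.executemany("INSERT INTO entry VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False

# The second row reuses an existing accession, so the whole batch is rejected,
# including the first row that was valid on its own.
loaded = load_batch(conn, [("A00001", 1), ("U87107", 2)])
row_count = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(loaded, row_count)
```

After the failed batch the table still holds only the original row, so a reader never sees a half-loaded release.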
19Why (R)DBMS?
- information
- - data management
20Biological databases
http://mia.sdsc.edu/mia/html/bioDBsmap.html
21- In Oracle
- 30 tables
- 241365 + 3368791 = 3.6 M entries (Rel. 9.1, 14/11/06)
- 10G AA
22In Oracle
23Each field is independent
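Once each field of a flat-file entry is stored in its own column, it can be queried on its own without parsing any text. The sketch below uses SQLite via Python's `sqlite3` module. The single-table layout and column names are assumptions for illustration, not the schema of the Oracle database on this slide, and the second row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entry (
        accession TEXT PRIMARY KEY,
        molecule  TEXT,
        division  TEXT,
        length_bp INTEGER,
        organism  TEXT
    )""")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?, ?, ?)",
    [
        ("U87107", "DNA", "SYN", 8840, "Cloning vector pAL-F"),
        ("X00000", "DNA", "SYN", 1200, "Hypothetical vector"),   # made-up row
    ],
)

# Select on two fields at once: division and sequence length.
long_synthetic = [
    row[0]
    for row in conn.execute(
        "SELECT accession FROM entry "
        "WHERE division = ? AND length_bp > ? ORDER BY accession",
        ("SYN", 5000),
    )
]
print(long_synthetic)
```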
24Ensembl
- comprehensive and integrated source of annotation of large genome sequences
- In MySQL
- 65 tables (Rel. 41)
- 19 genomes
25- ArrayExpress is a public repository for
microarray data, which is aimed at storing well
annotated data in accordance with MGED
recommendations.
26- In Oracle 9.2
- Tables 219 Columns 670
- db size 1337133 MB (1.3 TB)
- Nr of db rows 1010989277 (1G)
- October 2006
27- EXPERIMENTS 1739
- ARRAYS 1202
- PROTOCOLS 8251
- HYBRIDIZATIONS 52610
- October 2006
28- The PDB is the single worldwide repository for
the processing and distribution of 3-D structure
data of large molecules of proteins and nucleic
acids.
29- In MySQL (5.0)
- Tables 461 Columns 5294
- Nr of db rows 819872661 (800M)
- PDB2004
30References
http://nar.oxfordjournals.org/
http://www.expasy.uniprot.org/
http://www.ensembl.org/
http://www.genome.org/
http://www.ebi.ac.uk/arrayexpress-old/
http://www.pdb.org/pdb