Title: Biological Databases
1Biological Databases
- Bioinformatics Databases for the Molecular
Biologist 8.9.2003 - Lorenza Bordoli
2Overview
- Introduction to Biological Databases
- Sequences DNA and Protein
- Species Specific (Genomics)
- Protein families and domains
- Mutation and polymorphisms
- Proteomics
- 3D structures
- Conclusions
3Introduction
4What is a database ?
- A collection of data that is
- structured
- searchable (index) -> table of contents
- updated periodically (release) -> new edition
- cross-referenced (hyperlinks) -> links with other db
- Also includes the associated tools (software) necessary for db access/query, db updating, db information insertion and db information deletion.
- Data storage/resource management: flat files, relational databases, object-oriented databases, ...
5Why biological databases ?
- Exponential growth in biological data.
- Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, microarrays, ...) are no longer published in a conventional manner, but directly submitted to databases.
- Essential tools for biological research.
6Some statistics
- More than 1000 different biological databases
- Variable size: <100 Kb to >10 Gb
- DNA: > 10 Gb
- Protein: 1 Gb
- 3D structure: 5 Gb
- Other: smaller
- Update frequency: daily to annually
- Usually accessible through the web (free!?)
- Amos links: www.expasy.org/alinks.html
- BioHunt: http://www.expasy.org/BioHunt/
- Google: http://www.google.com/
7ExPASy Server
ExPASy Web Server: ExPASy = Expert Protein Analysis System
http://www.expasy.org/
8Some databases in the field of molecular biology
- AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, ...
9Categories of databases for Life Sciences
- Sequences (DNA, protein) (primary db)
- Genomics (Species Specific)
- Mutation/polymorphism
- Protein domain/family (-> tools)
- Proteomics (2D gel, Mass Spectrometry)
- 3D structure
- Metabolism
- Bibliography
- Others (microarrays, protein-protein interactions, ...)
10Sequence Databases
11Ideal minimal content of a sequence database entry
- Sequences !!
- Accession number (AC) (unique identifier)
- Taxonomic data
- References
- ANNOTATION/CURATION (-> not always the case!)
- Keywords
- Cross-references
- Documentation
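The minimal content listed above can be pictured as a simple record type. The following is only an illustrative sketch: the class name `SequenceEntry`, its field names and the placeholder sequence are hypothetical and do not correspond to any real database schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SequenceEntry:
    """Hypothetical container mirroring the 'ideal minimal content' list."""
    accession: str                                   # AC: unique identifier
    sequence: str                                    # the sequence itself
    taxonomy: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    annotation: str = ""                             # may be empty: not every db is curated
    keywords: List[str] = field(default_factory=list)
    cross_references: Dict[str, str] = field(default_factory=dict)

    def is_annotated(self) -> bool:
        """True when the entry carries any annotation text."""
        return bool(self.annotation.strip())


entry = SequenceEntry(
    accession="X02158",
    sequence="ACGT",  # placeholder, not the real 3398 bp sequence
    taxonomy=["Eukaryota", "Metazoa", "Homo"],
    keywords=["erythropoietin"],
    cross_references={"SWISS-PROT": "P01588"},
)
print(entry.accession, entry.is_annotated())
```

Making the annotation optional reflects the remark above that annotation and curation are not always present.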
12Sequence database example
SWISS-PROT (protein db) (flat file)
Accession number
Taxonomy
Reference
Annotations (comments)
Cross-references
Keywords
13Sequence database example (cont.)
Annotations (features)
Sequence
14Sequence Databases
15Sequence Database 1. nucleotide sequences
- The 3 main nucleic acid sequence databases (DNA) are EMBL (Europe) / GenBank (USA) / DDBJ (Japan)
- EMBL: since 1982
- Specialized databases for the different types of RNA (e.g. tRNA, rRNA, tmRNA, uRNA, ...)
- 3D structure (DNA and RNA) -> PDB
- Others: Aberrant splicing db, Eukaryotic promoter db (EPD), RNA editing sites, Multimedia Telomere Resource, ...
16Nucleotide and associated topics databases (AMOSlinks)
- EMBL: EMBL Nucleotide sequence db (EBI)
- GenBank: GenBank Nucleotide Sequence db (NCBI)
- DDBJ: DNA Data Bank of Japan
- dbEST: dbEST (Expressed Sequence Tags) db (NCBI)
- dbSTS: dbSTS (Sequence Tagged Sites) db (NCBI)
- NDB: Nucleic Acid Databank (3D structures)
- BNASDB: Nucleic acid structure db from University of Pune
- AsDb: Aberrant Splicing db
- ACUTS: Ancient conserved untranslated DNA sequences db
- Codon Usage Db
- EPD: Eukaryotic Promoter db
- HOVERGEN: Homologous Vertebrate Genes db
- IMGT: ImMunoGeneTics db (mirror at EBI)
- ISIS: Intron Sequence and Information System
- RDP: Ribosomal db Project
- gRNAs db: Guide RNA db
- PLACE: Plant cis-acting regulatory DNA elements db
- PlantCARE: Plant cis-acting regulatory DNA elements db
- sRNA db: Small RNA db
- ssu rRNA: Small ribosomal subunit db
- lsu rRNA: Large ribosomal subunit db
- 5S rRNA: 5S ribosomal RNA db
- tmRNA Website
- tmRDB: tmRNA db
- tRNA: tRNA compilation from the University of Bayreuth
- uRNADB: uRNA db
- RNA editing: RNA editing site
- RNAmod db: RNA modification db
- SOS-DGBD: Db of Drosophila DNA sequences annotated with regulatory binding sites
- TelDB: Multimedia Telomere Resource
- TRADAT: TRAnscription Databases and Analysis Tools
- Subviral RNA db: Small circular RNAs db (viroid and viroid-like)
- MPDB: Molecular probe db
- OPD: Oligonucleotide probe db
- VectorDB: Vector sequence db (seems dead!)
17Sequence Database 1. DNA EMBL/GenBank/DDBJ
- These 3 db contain mainly the same information within 2-3 days (few differences in format and syntax)
- Contribution: EMBL 10%, GenBank 73%, DDBJ 17%
- Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from
- Genome projects (> 80% of entries)
- Sequencing centers
- Individual scientists (15% of entries)
- Patent offices (e.g. European Patent Office, EPO)
- Currently 18 x 10^6 sequences, 30 x 10^9 bp
- Sequences from > 50,000 different species
18The tremendous increase in nucleotide sequences
[Figure: growth of EMBL data over time. Labels: first increase in data due to the development of PCR; high throughput genomes (HTG); human, mouse, rat. 1980: 80 genes fully sequenced!]
19EMBL/GenBank/DDBJ
- Heterogeneous sequence qualities and lengths: ESTs, genomes, variants, fragments
- Sequence sizes
- max 350,000 bp/entry (! genomic sequences, overlapping)
- min 10 bp/entry
- Archive: nothing goes out -> highly redundant!
- Full of errors: in sequences, in annotations, in CDS attribution.
- No consistency of annotations: most annotations are done by the submitters, hence heterogeneity in the quality, completeness and updating of the information
- Entries contain only the assembly data
21EMBL/GenBank/DDBJ
- Unexpected information you can find in these db:
- FT source 1..124
- FT /db_xref="taxon:4097"
- FT /organelle="plastid:chloroplast"
- FT /organism="Nicotiana tabacum"
- FT /isolate="Cuban cahibo cigar, gift from President Fidel
- FT Castro"
- Or:
- FT source 1..17084
- FT /chromosome="complete mitochondrial genome"
- FT /db_xref="taxon:9267"
- FT /organelle="mitochondrion"
- FT /organism="Didelphis virginiana"
- FT /dev_stage="adult"
- FT /isolate="fresh road killed individual"
- FT /tissue_type="liver"
22EMBL entry example
- ID HSERPG standard; DNA; HUM; 3398 BP.
- XX
- AC X02158;
- XX
- SV X02158.1
- XX
- DT 13-JUN-1985 (Rel. 06, Created)
- DT 22-JUN-1993 (Rel. 36, Last updated, Version 2)
- XX
- DE Human gene for erythropoietin
- XX
- KW erythropoietin; glycoprotein hormone; hormone; signal peptide.
- XX
- OS Homo sapiens (human)
- OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
- OC Eutheria; Primates; Catarrhini; Hominidae; Homo.
- XX
- RN [1]
- RP 1-3398
keyword
taxonomy
references
Cross-references
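Because every line of an EMBL flat file starts with a two-letter line code (ID, AC, DE, OS, OC, ...), the header of an entry can be read with very little code. The sketch below assumes the usual flat-file layout (line code, spaces, then semicolon-separated values, without the slide's bullet dashes); the function name `parse_embl_header` and the shortened `SAMPLE` text are illustrative only and this is not an official EMBL parser.

```python
from collections import defaultdict


def parse_embl_header(text):
    """Group the content of an EMBL-style flat file by two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        if len(code) != 2 or code == "XX":  # XX is only a spacer line
            continue
        fields[code].append(value.strip())
    return dict(fields)


# Shortened header of the erythropoietin entry shown on the slide.
SAMPLE = """\
ID   HSERPG     standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
DE   Human gene for erythropoietin
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
"""

header = parse_embl_header(SAMPLE)
accession = header["AC"][0].rstrip(";")
# The taxonomy (OC) may span several lines, so join them before splitting.
lineage = [t.strip(" .") for t in " ".join(header["OC"]).split(";") if t.strip(" .")]
print(accession, header["OS"][0], lineage[-1])
```

Keeping a list per line code matters because codes such as OC, DT or FT legitimately occur more than once in an entry.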
23EMBL entry (cont.)
- CC Data kindly reviewed (24-FEB-1986) by K. Jacobs
- FH Key Location/Qualifiers
- FH
- FT source 1..3398
- FT /db_xref="taxon:9606"
- FT /organism="Homo sapiens"
- FT mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
- FT CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
- FT /db_xref="SWISS-PROT:P01588"
- FT /product="erythropoietin"
- FT /protein_id="CAA26095.1"
- FT /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
- FT AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
- FT QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
- FT TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
- FT mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2763)
- FT /product="erythropoietin"
- FT sig_peptide join(615..627,1194..1261)
- FT exon 397..627
CDS Coding sequence
annotation
sequence
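The `join(...)` locations in the feature table describe how exons are spliced together. As a small sketch (the helper names `parse_join` and `total_length` are made up for illustration, and only simple `start..end` segments are handled), the length of the erythropoietin CDS can be recomputed from its location string:

```python
import re


def parse_join(location):
    """Return (start, end) pairs from a location such as 'join(1..3,7..9)'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def total_length(segments):
    """Sum segment lengths; coordinates are 1-based and inclusive, hence +1."""
    return sum(end - start + 1 for start, end in segments)


CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
segments = parse_join(CDS)
cds_length = total_length(segments)
print(len(segments), cds_length, cds_length // 3)  # prints: 5 582 194
```

The 582 bases correspond to 194 codons, which is consistent with the 193-residue /translation shown in the entry plus one stop codon. Real feature locations can also contain operators such as `complement(...)` or partial-end markers, which this sketch ignores.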
24EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/
25Nucleotide databases and associated genomic
projects/databases
- Problem
- Redundancy makes BLAST searches of the complete databases useless for detecting anything beyond the closest homologs.
- Solutions
- assemblies of genomic sequence data (contigs) and corresponding RNA and protein sequences -> dataset of genomic contigs, RNAs and proteins
- annotation of genes, RNAs, proteins, variation (SNPs), STS markers, gene prediction, nomenclature and chromosomal location.
- computed connections to other resources (cross-references)
- Examples: RefSeq/LocusLink (Drosophila, human, mouse, rat and zebrafish), TIGR (bacteria and plants), Ensembl (Eukaryota)
26Nucleotide databases and associated genomic
projects/databases
- LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/)
- Focal point for genes and associated information (C. elegans, cow, fruit fly, human, human HIV type 1, mouse, rat and zebrafish)
- RefSeq
- Reference mRNAs and proteins for human, mouse, rat
- UniGene
- UniGene clusters, expression data
- Ensembl
- Provides a bioinformatics framework to organise biology around the sequences of large genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae, ...
27Nucleotide databases and associated genomic
projects/databases
- LocusLink: from gene loci to curated sequences and descriptive information.
28 LocusLink
29 LocusLink
30Nucleotide databases and associated genomic
projects/databases
- RefSeq: reference sequences of genomic contigs, mRNAs, and proteins.
31Nucleotide databases and associated genomic
projects/databases
- UniGene: an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location.
32Nucleotide databases and associated genomic
projects/databases
- Ensembl: provides a bioinformatics framework to organise biology around the sequences of large genomes. Available now are human, mouse, rat, fugu, zebrafish, mosquito, Drosophila, C. elegans, and C. briggsae, ...
33Sequence Databases
34Sequence Database 2. Proteins
- SWISS-PROT: created in 1986 (A. Bairoch), http://www.expasy.org/sprot/
- TrEMBL: created in 1996; complement to SWISS-PROT, derived from EMBL CDS translations ("proteomic" version of EMBL)
- PIR-PSD: Protein Information Resource, http://pir.georgetown.edu/
- GenPept: "proteomic" version of GenBank
- Many specialized protein databases for specific families or groups of proteins (M. Primig)
- Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system), YPD (yeast), ...
35The first protein db
36Swiss-Prot
- Collaboration between the SIB (Geneva) and EMBL/EBI (UK)
- Fully manually annotated, non-redundant, cross-referenced, documented protein sequence database.
- 113,000 sequences from more than 6,800 different species; 70,000 references (publications); 550,000 cross-references (databases); 200 Mb of annotations
- Weekly releases; available from about 50 servers across the world, the main source being ExPASy
37TrEMBL (Translation of EMBL)
- It is impossible to cope with the quantity of newly generated data AND to maintain the high quality of SWISS-PROT -> TrEMBL, created in 1996.
- TrEMBL is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools.
- Contains everything that is not in SWISS-PROT.
- SWISS-PROT + TrEMBL = all known protein sequences.
- Well-structured SWISS-PROT-like resource.
38The simplified story of a Swiss-Prot entry
- cDNAs, genomes, ... -> EMBLnew -> EMBL (some data are not submitted to the public databases!! (delayed or cancelled))
- EMBL CDS -> TrEMBLnew -> TrEMBL, automated steps:
- Redundancy check (merge)
- Family attribution (InterPro)
- Annotation (computer)
- TrEMBL -> Swiss-Prot, manual steps:
- Redundancy (merge, conflicts)
- Annotation (manual)
- Swiss-Prot tools (macros)
- Swiss-Prot documentation
- Medline
- Databases (MIM, MGD, ...)
- Brain storming
- Once in Swiss-Prot, the entry is no longer in TrEMBL, but is still in EMBL (archive)
39Some nomenclature: example of SRS6 at the Sanger Center
http://www.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-page+top
40SWISS-PROT + (SP)TrEMBL + TrEMBLnew (SWALL, SPTR)
(Standard) (Preliminary)
- TrEMBL = SPTrEMBL + REMTrEMBL
- SPTrEMBL contains TrEMBL entries which will be integrated into SWISS-PROT.
- REMTrEMBL contains TrEMBL entries which will never be integrated into SWISS-PROT (immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins)
- TrEMBLnew contains entries which have not yet been integrated into TrEMBL (weekly update to TrEMBL)
- SPTR (SWall) = SWISS-PROT + (SP)TrEMBL + TrEMBLnew
- ! Usually what we call TrEMBL is (SP)TrEMBL and does not include REMTrEMBL !
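The nomenclature above is essentially a statement about set unions. A toy sketch, in which every accession number except P01588 is invented purely for illustration, makes the relationships explicit:

```python
# Hypothetical accession sets; real releases contain far more entries.
swiss_prot = {"P01588"}
sp_trembl = {"Q00001", "Q00002"}   # will eventually be integrated into SWISS-PROT
rem_trembl = {"Q90001"}            # never integrated (e.g. patent application sequences)
trembl_new = {"Q99001"}            # not yet part of a TrEMBL release

# TrEMBL = SPTrEMBL + REMTrEMBL
trembl = sp_trembl | rem_trembl

# SPTR (SWall) = SWISS-PROT + (SP)TrEMBL + TrEMBLnew
sptr = swiss_prot | sp_trembl | trembl_new

print(sorted(trembl))
print(sorted(sptr))
```

Note that REMTrEMBL entries are part of TrEMBL in the strict sense but never appear in SPTR, which is the point of the warning on the slide.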
41 a Swiss-Prot entry overview
42Protein name Gene name
45Cross-references
46Keywords
49TrEMBL example
Original TrEMBL entry which has been integrated
into the SWISS-PROT EPO_HUMAN entry and thus
which is not found in TrEMBL anymore.
51Swiss-Prot / TrEMBL: minimal redundancy
- Swiss-Prot and TrEMBL introduce some degree of redundancy
- Only 100% identical sequences are automatically merged between SWISS-PROT and TrEMBL
- Complete sequences or fragments with 1-3 conflicts will be automatically merged soon (genome projects: check for chromosomal location and gene names)
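Merging only sequences that are 100% identical amounts to grouping entries by their exact sequence string. The sketch below shows that idea with invented identifiers and short made-up peptides; it is not the actual Swiss-Prot/TrEMBL procedure, which the slides do not describe in detail.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group entry identifiers that share exactly the same sequence."""
    by_sequence = defaultdict(list)
    for identifier, sequence in entries:
        by_sequence[sequence.upper()].append(identifier)
    return list(by_sequence.values())


# Hypothetical identifiers and sequences, for illustration only.
entries = [
    ("SP_A", "MGVHECPAWL"),
    ("TR_B", "MGVHECPAWL"),   # identical to SP_A -> merged
    ("TR_C", "MGVHECPAWV"),   # one conflict -> kept separate
]
merged = merge_identical(entries)
print(merged)  # prints: [['SP_A', 'TR_B'], ['TR_C']]
```

A single conflicting residue is enough to keep two entries apart here, which is why some redundancy remains until near-identical sequences are handled as well.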
52Swiss-Prot / TrEMBL: minimal redundancy
Human EPO: Blastp results
53Swiss-Prot and the cross-references (X-ref)
- SWISS-PROT was the 1st database with X-ref.
- Explicitly X-referenced to 36 databases
- X-ref to DNA (EMBL/GenBank/DDBJ), 3D-structure (PDB), literature (Medline), genomic (MIM, MGD, FlyBase, SGD, SubtiList, etc.), 2D-gel (SWISS-2DPAGE), specialized db (PROSITE, TRANSFAC)
- Implicitly X-referenced to 17 additional db added by the ExPASy servers on the WWW (e.g. GeneCards, PRODOM, HUGE, etc.)
- Gasteiger et al., Curr. Issues Mol. Biol. (2001), 3(3): 47-55
54[Diagram: Swiss-Prot at the centre, cross-referenced to the following groups of databases]
- Domains, functional sites, protein families: PROSITE, InterPro, Pfam, PRINTS, SMART, Mendel-GFDb
- Human diseases: MIM
- Protein-specific dbs: GCRDb, MEROPS, REBASE, TRANSFAC
- 2D and 3D structural dbs: HSSP, PDB
- Organism-specific dbs: DictyDb, EcoGene, FlyBase, HIV, MaizeDB, MGD, SGD, StyGene, SubtiList, TIGR, TubercuList, WormPep, Zebrafish
- PTM: CarbBank, GlycoSuiteDB
- 2D-gel protein databases: SWISS-2DPAGE, ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent, MAIZE-2DPAGE
- Nucleotide sequence db: EMBL, GenBank, DDBJ
55- http://pir.georgetown.edu/
56UniProt
- United Protein Databases
- Swiss-Prot + TrEMBL + PIR
- Born in October 2002
- "NIH pledges cash for global protein database": The United States is turning to European bioinformatics facilities to help it meet its researchers' future needs for databases of protein sequences. European institutions are set to be the main recipients of a $15-million, three-year grant from the US National Institutes of Health (NIH), to set up a global database of information on protein sequence and function known as the United Protein Databases, or UniProt (Nature 419, 101 (2002))
57Species Specific Databases
58Species Specific Databases
- Contain information on gene chromosomal location (mapping) and nomenclature, and provide links to sequence databases; usually do not contain sequences themselves (but cross-link to them)
- All species whose genome has been sequenced, annotated and published come with their own species genome database that enables scientists to retrieve information about the DNA and its gene products (see also the NAR Database Issue 2003 at http://nar.oupjournals.org/content/vol31/issue1/ and the new NAR Web Server Issue 2003 at http://nar.oupjournals.org/content/vol31/issue13)
- AMOS links: http://www.expasy.org/alinks.html#Organisms (species-specific db section)
59Species Specific DBs examples
- Human
- GDB: The Genome Database is the official central repository for genomic mapping data resulting from the Human Genome Initiative. Although GDB has historically focused on gene mapping, as the Genome Project moves from mapping to sequence to functional analysis, GDB's focus will be broadened. Extensions are under development in the representation of sequence-level genome content, including sequence variations, along with richer descriptions of function and phenotype.
- GeneCards: db integrating information on human genes
- GeneLynx: portal to the human genome
60GeneCards
an electronic encyclopedia of biological and
medical information
61Gene Lynx
62Gene Lynx
63Species Specific DBs examples
- Mouse
- MGI: Mouse Genome Informatics provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse
- C. elegans
- WormBase: an international consortium of biologists and computer scientists dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and some related nematodes.
- Yeast
- SGD: database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae
- Arabidopsis
- TAIR: The Arabidopsis Information Resource provides a comprehensive resource for the scientific community working with Arabidopsis thaliana, a widely used model plant. TAIR consists of a searchable relational database, which includes many different datatypes.
- Drosophila
- FlyBase: a database of genetic and molecular data for Drosophila. FlyBase includes data on all species from the family Drosophilidae; the primary species represented is Drosophila melanogaster.
64Protein families domains DB
65Protein families domains DB
- Most proteins have modular structures
- Estimation: 3 domains / protein
- Motifs: conserved regions within a domain; can be identified by multiple sequence alignments
- Protein motifs can be defined by different methods (descriptors):
- Patterns
- Profiles
- HMMs
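Of the three descriptor types, patterns are the simplest: a PROSITE-style pattern is close to a regular expression. The sketch below converts the basic pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats) into a Python regex. The function name `prosite_to_regex` is hypothetical, only this core syntax is handled, and the example uses the well-known N-glycosylation site pattern `N-{P}-[ST]-{P}` on a fragment of the erythropoietin translation from the EMBL entry example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")  # sequence ends
        element = element.replace("x", ".")                    # any residue
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
fragment = "AKEAENITTGCAEHCSLNENITVPDTKVN"
starts = [m.start() for m in re.finditer(regex, fragment)]
print(regex, starts)  # prints: N[^P][ST][^P] [5, 19]
```

`re.finditer` reports non-overlapping matches only, so a real scanner would need a lookahead to report overlapping sites. Profiles and HMMs score every position instead of requiring an exact match, which is why they are more sensitive than patterns.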
66Protein families domains DB
- Contain biologically significant motif descriptors (patterns / profiles / HMMs) formulated in such a way that, with appropriate computational tools, one can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs.
- Used as a tool to identify the function of uncharacterized proteins translated from genomic or cDNA sequences ("functional diagnostic")
- Either manually curated (e.g. PROSITE, Pfam, etc.) or automatically generated (e.g. ProDom, DOMO)
67Protein families domains DB
InterPro
- PROSITE: Patterns / Profiles
- ProDom: Aligned motifs (PSI-BLAST) (Pfam B)
- PRINTS: Aligned motifs
- Pfam: HMM (Hidden Markov Models)
- SMART: HMM
- TIGRfam: HMM
- DOMO: Aligned motifs
- BLOCKS: Aligned motifs (PSI-BLAST)
- CDD (CDART): PSI-BLAST (PSSM) of Pfam and SMART
68Protein families domains DB PROSITE
- Contains fully annotated functional domains, based on two methods: patterns and profiles
- Entries are deposited in PROSITE in two distinct files:
- Pattern/profiles with the list of all matches in Swiss-Prot
- Documentation
69PROSITE Documentation
70PROSITE entry
Diagnostic performance
List of matches
71PROSITE entry
Diagnostic performance
List of matches
72PROSITE access
Search for an entry
Search for the occurrence of a domain in your protein
73Protein families domains DB Pfam
74Protein families domains DB InterPro
- Composite DB: direct access to different protein family DBs
75Protein families domains DB InterPro
76Mutations Polymorphisms DB
77Mutations Polymorphisms DB
- Contain information on sequence variations, linked or not to genetic diseases
- Mainly human, but also OMIA (Online Mendelian Inheritance in Animals)
- General db
- OMIM
- HGMD: Human Gene Mutation db
- SVD: Sequence variation db
- HGBASE: Human Genic Bi-Allelic Sequences db
- dbSNP: Human single nucleotide polymorphism (SNP) db
- Disease-specific db: most of these databases are either linked to a single gene or to a single disease
- p53 mutation db
- ADB: Albinism db (mutations in human genes causing albinism)
- Asthma and Allergy gene db
78Mutations Polymorphisms Definitions
- SNPs: single nucleotide polymorphisms; occur approximately once every 100 to 300 bases (distinction between sequencing error and polymorphism!)
- c-SNPs: coding single nucleotide polymorphisms (SNPs within cDNA sequences)
- SAPs: single amino-acid polymorphisms
- Missense mutation -> SAP
- Nonsense mutation -> STOP
- Insertion/deletion of nucleotides -> frameshift
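The three consequences listed above can be told apart by comparing a reference coding sequence with its variant. The following is a simplified sketch: the function names and the nine-base example sequences are invented, the standard genetic code is assumed, and the "silent" and "in-frame" categories are added here for completeness (they are not on the slide).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons only; trailing bases are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify(ref, alt):
    """Classify a variant coding sequence relative to the reference."""
    if (len(alt) - len(ref)) % 3 != 0:
        return "frameshift"
    if len(alt) != len(ref):
        return "in-frame insertion/deletion"
    for r, a in zip(translate(ref), translate(alt)):
        if r != a:
            return "nonsense" if a == "*" else "missense"
    return "silent"


REF = "ATGGAAAAA"  # hypothetical coding sequence: M E K
print(classify(REF, "ATGGACAAA"))  # E -> D: missense (a SAP)
print(classify(REF, "ATGTAAAAA"))  # E -> stop: nonsense
print(classify(REF, "ATGGAAAA"))   # one base deleted: frameshift
```

Only the first amino-acid difference is inspected, which is enough for single-nucleotide changes such as SNPs.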
79Mutations Polymorphisms DB examples
- OMIM: Online Mendelian Inheritance in Man
- catalog of human genes and genetic disorders
- contains a summary of literature and reference information, plus links to publications and sequence information.
- TSC: The SNP Consortium
- Public/private collaboration: Bayer, Roche, IBM, Pfizer, Novartis, Motorola, ...
- SNPs: dbSNP at NCBI
- Collaboration between the National Human Genome Research Institute and the National Center for Biotechnology Information (NCBI)
- Chromosome 21 dbSNP
- A joint project between the Division of Medical Genetics of the University of Geneva Medical School and the SIB
81Mutations Polymorphisms DB
- Generally of modest size; the lack of coordination and standards in these databases makes it difficult to access the data.
- There are initiatives to unify these databases:
- SVD: Sequence Variation Database project at EBI (HMutDB) (http://www2.ebi.ac.uk/mutations/)
- HUGO Mutation Database Initiative (MDI), Human Genome Variation Society (http://www.genomic.unimelb.edu.au/mdi/dblist/dblist.html)
82Proteomics Databases: SWISS-2DPAGE
83Proteomics DB
- Contain information obtained by 2D-PAGE: images of master gels and description of identified proteins
- Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE, Sub2D, Cyano2DBase, etc.
- Composed of image and text files
84Proteomics DB SWISS-2DPAGE
85Proteomics DB SWISS-2DPAGE
86Proteomics DB SWISS-2DPAGE
873D structures Databases: PDB
883D structures Databases PDB
- Worldwide repository for the processing and distribution of 3-D biological macromolecular structure data
- Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes)
- http://www.pdb.org
- Contains protein structures solved experimentally (X-ray, NMR, EM)
- Provides
- Coordinates (often also structure factors, NOEs, other experimental data), stored as pdb or mmCIF files
- Images
- Links to derived data, e.g. similar structures, fold families, etc.
893D structures Databases PDB
- SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
- SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
- SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
- SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
- SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
- SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
- SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
- SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
- TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
- TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
- TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
- TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
- TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
- TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
- CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
- ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
- ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
- ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
- SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
Coordinates of each atom
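Each line of a PDB file begins with a record name (SHEET, TURN, CRYST1, ...) that tells a program how to read the rest of the line. As a rough sketch, the unit-cell parameters can be pulled out of the CRYST1 record shown above. The function name `parse_cryst1` is made up, and splitting on whitespace is a simplification: the PDB format is column-based, and in real files adjacent fields can run together.

```python
def parse_cryst1(record):
    """Extract unit-cell lengths (in angstroms) and angles (in degrees)."""
    tokens = record.split()
    if not tokens or tokens[0] != "CRYST1":
        raise ValueError("not a CRYST1 record")
    a, b, c, alpha, beta, gamma = map(float, tokens[1:7])
    return {"a": a, "b": b, "c": c, "alpha": alpha, "beta": beta, "gamma": gamma}


# CRYST1 record of entry 12CA as it appears in the excerpt above.
cell = parse_cryst1("CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21")
print(cell["a"], cell["b"], cell["c"], cell["beta"])
```

For production use, a column-aware reader or an established structure-parsing library would be the safer choice.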
903D structures Databases PDB
91Conclusions
92Database retrieval tools
- Query tools associated with the Databases
- Sequence Retrieval System (SRS, Europe): allows any flat-file db to be indexed to any other; allows queries to be formulated across a wide range of different db types via a single interface, without any worry about data structures or query languages
- Entrez (NCBI): less flexible than SRS, but exploits the concept of "neighbouring", which allows related articles in different db to be linked together, whether or not they are cross-referenced directly
- ATLAS: specific for macromolecular sequence db (e.g. NRL-3D)
- ...
93Where and what ?
- Swiss Institute of Bioinformatics (Geneva, Lausanne, Basel)
- Swiss-Prot
- PROSITE
- TrEMBL
- ExPASy server
- EMBnet server: BLAST
http://www.isb-sib.ch
- European Bioinformatics Institute (UK/D)
http://www.ebi.ac.uk/embl/
- National Center for Biotechnology Information (USA)
- BLAST
- GenBank
- LocusLink
- RefSeq
http://www.ncbi.nlm.nih.gov/
- The Wellcome Trust Sanger Institute (UK)
http://www.sanger.ac.uk/