Title: Sequence databases and retrieval systems
1 Sequence databases and retrieval systems
- Guy Perrière
- replaced by Manolo Gouy
- Pôle Bio-Informatique Lyonnais
- Laboratoire de Biométrie et Biologie Évolutive
- UMR CNRS n° 5558
- Université Claude Bernard Lyon 1
2 In the beginning
- First paper compilation in 1965 (Atlas of Protein Sequences).
- Development of real databanks at the beginning of the 1980s:
- Fast access.
- Make possible analyses that require a lot of data:
- Codon usage.
- Molecular phylogeny.
3 General databanks
- Nucleotide sequences
- EMBL/GenBank/DDBJ.
- Protein sequences
- Simple translations of coding regions
- GenPept (from GenBank).
- TrEMBL (from EMBL).
- Systems containing additional data
- SWISS-PROT.
- PIR.
4 EMBL
- Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.
- Maintained since 1994 at the European Bioinformatics Institute (EBI) near Cambridge.
- Web server
- http://www.ebi.ac.uk/embl
5 GenBank
- Set up in 1979 at the Los Alamos National Laboratory in New Mexico, US.
- Maintained since 1992 at the National Center for Biotechnology Information (NCBI) in Bethesda.
- Web server
- http://www.ncbi.nlm.nih.gov/Genbank/index.html
6 DDBJ
- Active since 1984 at the National Institute of Genetics (NIG) in Mishima, Japan.
- Web server
- http://www.ddbj.nig.ac.jp
7 EMBL / GenBank / DDBJ
- The International Nucleotide Sequence Database Collaboration: EMBL / GenBank / DDBJ.
- New sequences are exchanged daily between the three centers.
- As a result, the three banks have identical content.
- Data mainly provided by direct submissions from the authors through the Internet:
- Web forms.
- Email.
8 Data growth
[Figure: data growth; y-axis: log (number of residues)]
9 GenBank/EMBL size (April 2003)
- 31 × 10^9 nucleotides.
- 24 × 10^6 sequences.
- 1.8 million genes (proteins and RNA).
- 313,000 bibliographic references.
- 100 gigabytes on disk.
- Growth of 63% in 12 months.
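As a rough worked example, the yearly growth figure can be turned into a doubling time. The sketch below assumes that growth is exponential at a constant rate, which is a simplification and not something the slide states; the function name is illustrative.

```python
import math


def doubling_time_months(annual_growth: float) -> float:
    """Months needed for the data volume to double.

    Assumes constant exponential growth (a simplification).
    annual_growth is a fraction: 0.63 means +63% in 12 months.
    """
    return 12 * math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # Growth of 63% in 12 months, as quoted for April 2003.
    print(f"{doubling_time_months(0.63):.1f} months")
```

Under that assumption the banks double in size in roughly 17 months.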
10 Taxonomic sampling (April 2003)
- There are 135,560 species for which at least one sequence is available.
- Nine species (0.007%) correspond to 62% of the total.
- 77,900 species are represented by only one sequence!
The nine most represented species in GenBank/EMBL
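The proportions behind these figures can be checked with a few lines of arithmetic. This is only a sketch using the three counts quoted above; the variable names are illustrative.

```python
TOTAL_SPECIES = 135_560            # species with at least one sequence
TOP_SPECIES = 9                    # the most represented species
SINGLE_SEQUENCE_SPECIES = 77_900   # species with exactly one sequence


def percentage(part: int, whole: int) -> float:
    """Share of `part` in `whole`, expressed in percent."""
    return 100 * part / whole


top_share = percentage(TOP_SPECIES, TOTAL_SPECIES)
singleton_share = percentage(SINGLE_SEQUENCE_SPECIES, TOTAL_SPECIES)

print(f"{top_share:.3f}% of the species are in the top nine")
print(f"{singleton_share:.1f}% of the species have a single sequence")
```

More than half of the sampled species are therefore known from a single sequence, while a handful of model organisms account for most of the data.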
11 Distribution format
- The banks are distributed as a set of text files called divisions (292 for EMBL).
- A division contains sequences related to:
- A taxon (e.g., bacteria, invertebrates, mammals).
- A class of sequences (EST, HTG, GSS).
- Within a division, each sequence is called an entry.
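A division file can be cut back into entries with very little code. The sketch below assumes that every entry ends with a line containing only `//`, the terminator used in EMBL and GenBank flat files; the two toy entries are made up for illustration.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each entry of a division as a list of lines.

    Assumes entries are terminated by a line holding only '//'.
    """
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)
    if entry:  # tolerate a missing final terminator
        yield entry


# Hypothetical two-entry division, for illustration only.
DIVISION = """\
ID   ENTRY1
SQ   Sequence 4 BP;
     acgt
//
ID   ENTRY2
SQ   Sequence 2 BP;
     ac
//
"""

entries = list(iter_entries(DIVISION.splitlines()))
print(len(entries), [entry[0] for entry in entries])
```

Because the function is a generator, it can also be fed an open file object, so a large division never has to be held in memory as a whole.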
12 Entry structure
- Information is introduced in structured fields.
- The format differs in its form between EMBL and GenBank/DDBJ, but not in substance.
13 ID, AC, SV and DT fields
Contain identifiers and the creation and last modification dates for the entries.
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
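Since every line starts with a two-letter code, the fields of an entry can be grouped by that code. The sketch below is a minimal reader, not a full EMBL parser; the semicolons in the sample follow the usual EMBL flat-file layout and are an assumption here, and the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, List

SAMPLE = """\
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547;
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
"""


def group_by_line_code(text: str) -> Dict[str, List[str]]:
    """Map each two-letter line code to the list of its values."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code and code != "XX":  # XX lines are only spacers
            fields[code].append(value)
    return fields


fields = group_by_line_code(SAMPLE)
# Accession numbers are separated by semicolons and/or blanks.
accessions = " ".join(fields["AC"]).replace(";", " ").split()
print(accessions, fields["SV"], fields["DT"])
```

Keeping a list per code matters because codes such as DT, KW or OC may span several lines.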
14 DE, KW, OS and OC fields
Definition, Keywords, Taxonomy.
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.
The NCBI maintains a unified taxonomy, largely based on sequence information.
15 RN, RX, RA and RT fields
Contain bibliographic information.
RN   [1]
RP   1-2680
RX   MEDLINE; 83143299.
RA   Yang M., Galizzi A., Henner D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11:237-249(1983).
16 FT field
Contains the descriptions of functional regions.
FH   Key             Location and qualifiers
FT   promoter        369..374
FT                   /note="put. promoter sequence P2 3 (amyR1)"
FT   RBS             414..419
FT                   /note="rRNA-binding site rbs-1 3"
FT   CDS             498..2480
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number="3.2.1.1"
FT                   /protein_id="CAA23437.1"
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG ...
17 Intron/exon structure
FT   CDS             join(242..610,3397..3542,5100..5351)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin"
...
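The `join(...)` location lists the exons of the CDS as 1-based, inclusive intervals. A minimal sketch of how such a location can be read and used to splice a sequence follows; it deliberately ignores `complement(...)`, partial-end markers (`<`, `>`) and references to other entries, and the function names are illustrative.

```python
import re
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based and inclusive


def parse_location(location: str) -> List[Segment]:
    """Extract the 'start..end' intervals of a simple or join() location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments: List[Segment]) -> int:
    """Total number of nucleotides covered by the segments."""
    return sum(end - start + 1 for start, end in segments)


def splice(sequence: str, segments: List[Segment]) -> str:
    """Concatenate the segments of `sequence` in the order given."""
    return "".join(sequence[start - 1:end] for start, end in segments)


exons = parse_location("join(242..610,3397..3542,5100..5351)")
print(exons, spliced_length(exons))
```

Direct access to the spliced CDS rather than to the whole entry is what makes analyses such as codon usage practical on genes with introns.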
18 SQ field
Contains the sequence itself.
SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
//
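The SQ header carries its own consistency check: the base counts should add up to the announced length. The sketch below parses the header shown above and verifies that; the regular expressions are an assumption about the layout, written so that they work with or without the semicolons.

```python
import re
from typing import Dict, Tuple

SQ_LINE = "SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;"


def parse_sq_header(line: str) -> Tuple[int, Dict[str, int]]:
    """Return (declared length, base counts) from an SQ header line."""
    length = int(re.search(r"Sequence\s+(\d+)\s+BP", line).group(1))
    counts = {
        base: int(number)
        for number, base in re.findall(r"(\d+)\s+(A|C|G|T|other)\b", line)
    }
    return length, counts


length, counts = parse_sq_header(SQ_LINE)
is_consistent = sum(counts.values()) == length
gc_content = (counts["G"] + counts["C"]) / length

print(is_consistent, f"G+C = {gc_content:.1%}")
```

A mismatch between the counts and the declared length is a cheap way to detect a truncated or corrupted entry before any further analysis.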
19 Errors in databanks
- There are a lot of errors in the nucleotide sequence databanks:
- In annotations
- Inaccuracies, omissions, and even mistakes.
- Inconsistencies between entries.
- In the sequences themselves
- Sequencing errors.
- Cloning vectors inserted.
20 Redundancy
- Another major problem is redundancy.
- A lot of entries are partially or entirely duplicated:
- 20% of vertebrate sequences in GenBank.
- Duplicated entries are often different in their sequence.
21 Variations in duplicates
- It is often impossible to decide whether a difference between two duplicates is due to:
- Polymorphism.
- Sequencing error.
- True gene duplication.
- And what to do when annotations differ or are even contradictory?
22 Protein sequence databases
- Translation of Coding DNA Sequences (CDS) from EMBL/GenBank/DDBJ.
- Consultation of publications or patents.
- Very small number of direct protein sequence submissions by authors.
- In SWISS-PROT and PIR: additional annotations.
23 SWISS-PROT
- Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva.
- Maintained by the Swiss Institute of Bioinformatics (SIB) and funded by GeneBio and, very recently, by NIH.
- Web server
- http://www.expasy.ch/sprot/sprot-top.html
24 SWISS-PROT characteristics
- Almost no redundancy.
- Cross-references with 60 other databanks.
- High-quality annotations
- Systematic control by a team of annotators.
- Help from a set of > 200 volunteer experts.
- Embedded in ExPASy, a WWW proteomics server (http://www.expasy.org).
25 Annotations
- Protein function.
- Post-translational modifications.
- Structural or functional domains.
- Secondary and quaternary structures.
- Similarities with other proteins.
- Conflicts between positions for CDS.
- Disease-related mutations
26 Associated databanks
- TrEMBL, built using only annotated CDS from the EMBL data library.
- ENZYME, for the international enzyme nomenclature.
- PROSITE, for biologically significant sites, patterns and profiles.
- SWISS-2DPAGE, for two-dimensional polyacrylamide gel electrophoresis maps.
27 PIR
- PIR (The Protein Information Resource) was created by Margaret Dayhoff in 1965.
- Aims
- To provide exhaustive and non-redundant protein sequence data.
- To give a classification using taxonomic and similarity data:
- Entries grouped in superfamilies, families and subfamilies.
28 Data maintenance
- Three organizations collect and organize the data entered in PIR:
- The National Biomedical Research Foundation (NBRF) in the United States.
- The Martinsried Institute for Protein Sequences (MIPS) in Germany.
- The Japan International Protein Sequence Information Database (JIPID) in Japan.
29 Results
- Coverage is no better than that of SWISS-PROT + TrEMBL.
- Still contains redundancy.
- Less comprehensive annotation.
- Low number of cross-references.
- PIR has recently joined forces with EBI and SIB to establish UniProt (United Protein Databases), the central resource of protein sequence and function.
30 Specialized databanks
- A lot of specialized databanks have been developed, which are devoted to:
- Complete genomes.
- Families of homologous genes.
- Non-sequence data.
- These systems are under the responsibility of curators:
- Data quality and homogeneity control.
31 Complete genomes
- There is a large number of databanks devoted to specific organisms.
- These banks are associated with sequencing or mapping projects.
- For some model organisms there are often several competing systems.
32 Examples
33 Gene family databanks
- Built with automated procedures
- Similarity search between sets of proteins (BLASTP, FASTP, Smith-Waterman).
- Clustering into homologous families using similarity criteria.
- Include various data
- Protein (and sometimes nucleotide) sequences.
- Multiple sequence alignments and trees.
- Taxonomy.
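The clustering step can be illustrated with a small single-linkage procedure: any two proteins whose similarity reaches a threshold end up in the same family. This is only a sketch under stated assumptions. The slides do not say which clustering rule each databank uses, single linkage is chosen here for simplicity, and the protein names and similarity values are made up.

```python
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str, float]  # (protein A, protein B, % similarity)


def cluster_families(pairs: Iterable[Pair], threshold: float) -> List[List[str]]:
    """Single-linkage clustering of proteins using a union-find structure."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        root_a, root_b = find(a), find(b)
        if similarity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    families: Dict[str, List[str]] = {}
    for protein in list(parent):
        families.setdefault(find(protein), []).append(protein)
    return sorted(sorted(members) for members in families.values())


# Hypothetical pairwise similarities, e.g. derived from BLASTP hits.
PAIRS = [
    ("P1", "P2", 72.0),
    ("P2", "P3", 55.0),
    ("P3", "P4", 30.0),
    ("P5", "P6", 90.0),
]

print(cluster_families(PAIRS, threshold=50.0))
```

Note that P1 and P3 land in the same family without ever being compared directly: single linkage is transitive, which is convenient but can chain unrelated multi-domain proteins together.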
34 ProtFam
- Developed at MIPS.
- Built with PIR sequences.
- Includes four levels of classification:
- Superfamilies (based on function and similarity criteria).
- Families (50% similarity).
- Subfamilies (80% similarity).
- Entries (95% similarity).
35 ProtFam characteristics
- Allows visualization of alignments and dendrograms for the families.
- Integrates Pfam domains.
- Allows users to classify their own protein sequences.
- Web server
- http://mips.gsf.de
36 ProtoMap
- Initially developed at the Hebrew University of Jerusalem, now hosted at Cornell University.
- Built with SWISS-PROT + TrEMBL sequences.
- Combines 3 sequence similarity measures (BLASTP, FASTA and Smith-Waterman).
37 ProtoMap characteristics
- Alignments and trees are visualized with Java applets.
- Users can submit sequences and classify them.
- Web server
- http://protomap.cornell.edu/index.html
38 Specialized systems
- HOVERGEN (Homologous Vertebrate Genes Database)
- Based on GenBank CDS.
- HOBACGEN (Homologous Bacterial Genes Database) for prokaryotes and yeast
- Based on SWISS-PROT/TrEMBL.
- HOBACGEN-CG for completely sequenced genomes
- Based on SWISS-PROT/TrEMBL.
39 Other specialized systems
- COG (Clusters of Orthologous Groups), also for complete genomes
- Based on GenBank CDS.
- NuReBase (Nuclear Receptors Database) for mammalian nuclear receptors
- Based on EMBL CDS.
- RTKdb (Tyrosine Kinase Receptors)
- Based on EMBL CDS.
40 Are COGs real orthologs?
[Figure: glutamate synthase large subunit in Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae and Synechocystis sp.]
41 Beyond protein families
ProtFam, HOVERGEN, HOBACGEN and COGs gather protein sequences that are homologous over their whole length.
Patterns, profiles and domains are covered in Terry Attwood's lecture.
42 HOBACGEN
- Integrates protein and nucleotide sequences as well as multiple alignments and trees.
- Is based upon a client/server architecture.
- Client software is distributed as well as the server structure (including all sequences).
- Web server
- http://pbil.univ-lyon1.fr/databases/hobacgen.html
43 Similarity search
[Figure: similarity search among SWISS-PROT/TrEMBL sequences]
44 Segments selection
45 Families assembly
46 Alignments and trees
Rooting by mid-point
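Mid-point rooting places the root halfway along the longest path between two leaves of the unrooted tree. The sketch below shows the idea on a toy tree stored as a weighted adjacency dictionary; the data structure, the function names and the branch lengths are illustrative assumptions, not the procedure actually implemented in HOBACGEN.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, Dict[str, float]]  # node -> {neighbour: branch length}


def add_branch(tree: Tree, a: str, b: str, length: float) -> None:
    """Add an undirected branch between nodes a and b."""
    tree.setdefault(a, {})[b] = length
    tree.setdefault(b, {})[a] = length


def farthest(tree: Tree, start: str) -> Tuple[str, float, List[str]]:
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, previous, distance, path = stack.pop()
        if distance > best[1]:
            best = (node, distance, path)
        for neighbour, length in tree[node].items():
            if neighbour != previous:
                stack.append((neighbour, node, distance + length, path + [neighbour]))
    return best


def midpoint_root(tree: Tree) -> Optional[Tuple[str, str, float]]:
    """Locate the mid-point of the longest leaf-to-leaf path.

    Returns (u, v, offset): the root lies on branch u-v, at distance
    `offset` from u. Returns None for a tree without branches.
    """
    first_end, _, _ = farthest(tree, next(iter(tree)))
    _, longest, path = farthest(tree, first_end)
    half, walked = longest / 2, 0.0
    for u, v in zip(path, path[1:]):
        if walked + tree[u][v] >= half:
            return u, v, half - walked
        walked += tree[u][v]
    return None


# Toy unrooted tree: leaves A, B, C, D and internal nodes X, Y.
TREE: Tree = {}
for a, b, length in [("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 1.0),
                     ("C", "Y", 2.0), ("D", "Y", 6.0)]:
    add_branch(TREE, a, b, length)

print(midpoint_root(TREE))
```

In this toy tree the longest path joins B and D (length 9), so the root falls on the branch leading to D, 4.5 units from either end. Mid-point rooting is only reliable when evolutionary rates are roughly similar across lineages.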
47 Domains and Families
Proteins can be made of very different sets of domains.
48 Site, Motif, Domain
- Simple motifs: patterns (PROSITE).
- Complex motifs: fingerprints, i.e. series of aligned motifs (PRINTS); ungapped alignments of segments (BLOCKS).
- Alignments of whole domains: profiles (PROSITE), HMMs (Pfam).
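A PROSITE pattern is the simplest of these descriptors and maps almost directly onto a regular expression. The sketch below converts the common subset of the pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends); it is not a complete implementation of the syntax, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression.

    Covers only the common subset of the syntax described above.
    """
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        regex += prefix + core + suffix
    return regex


# PS00001, the N-glycosylation site pattern.
N_GLYCOSYLATION = "N-{P}-[ST]-{P}"
regex = prosite_to_regex(N_GLYCOSYLATION)

# Made-up peptide: 'NKSA' matches, 'NPTG' is excluded by {P}.
hits = re.findall(regex, "AANKSAPNPTG")
print(regex, hits)
```

Such short patterns match by chance very often, which is one reason why profiles and HMMs are preferred for whole domains.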
49 ProDom: defining domain structure
[Figure: ProDom domains for the 6PGD family, shown for 6PG1_YEAST, 6PGD_CANAL, 6PGD_SOYBN, 6PG2_BACSU, O32911_MYCLR, P95165_MYCTU, 6PGD_CERCA, Q40311_MEDSA, Y770_MYCTU and Y229_SYNY3]
50 InterPro
InterPro unifies PROSITE, PRINTS, Profile, ProDom, Pfam, SMART, and TIGRFam.
http://www.ebi.ac.uk/interpro
51 An InterPro entry
Accession    IPR001425
Name         Bacterial rhodopsin
Type         Family
Dates        08-OCT-1999 (created), 28-FEB-2000 (last modified)
Signatures   PROSITE PS00327 BACTERIAL_OPSIN_RET
             PROSITE PS00950 BACTERIAL_OPSIN_1
             PRINTS BACTRLOPSIN
             PFAM PF01036 Bac_rhodopsin
Abstract     The bacterial opsins are retinal-binding proteins that provide light-dependent ion transport and sensory functions to a family of halophilic bacteria [1, 2]. They are integral membrane proteins believed to contain seven transmembrane (TM) domains, the last of which contains the attachment point for retinal (a conserved lysine). ...
Examples     Q48315 BACH_HALHP Halorhodopsin
             Q53496 BACR_HALSR Cruxrhodopsin
             P15647 BACH_NATPH
             P96787 BAC3_HALSD Archaearhodopsin
52 Non-sequence data
53 Sequence data retrieval
- Made mainly through Internet access:
- With client software (e.g., Entrez, HobacFetch).
- By remote connections to servers providing on-line access to the banks (INFOBIOGEN).
- Using World-Wide Web servers and browsers.
54 Advantages and limitations
- Users do not have to cope with the usual database problems:
- Storing of large amounts of data.
- Daily updates.
- Software upgrades.
- Simplicity of use.
- Net access is sometimes very slow at peak hours:
- Consider using other servers besides NCBI.
55 The ACNUC retrieval system
- Direct access to functional regions described in feature tables (CDS, tRNA, rRNA).
- Selection of entries using various criteria:
- Sequence names and accession numbers.
- Bibliographic criteria.
- Keywords.
- Taxonomy.
- Organelle.
- Developed at Lyon University
56 ACNUC possible accesses
- Graphical interface distributed along with the databases themselves.
- http://pbil.univ-lyon1.fr/databases/acnuc.html
- Web access at Pôle Bio-Informatique Lyonnais (PBIL)
- http://pbil.univ-lyon1.fr/search/query.html
57 ACNUC characteristics
- Allows querying of any bank in PIR, SWISS-PROT, EMBL, or GenBank formats.
- Keywords and species browsing.
- Complex queries.
- Links with sequence analysis programs on the Web server (alignment, codon usage).
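One way to picture a complex query is as a combination of simple selections, each returning a set of entries, joined with AND, OR and NOT. The sketch below is a toy model of that idea only: it does not reproduce ACNUC's query language or data structures, and every entry except BSAMYL is made up for illustration.

```python
from typing import Callable, Dict, Set

# Toy annotation table; TOY0001 and TOY0002 are hypothetical entries.
ENTRIES: Dict[str, dict] = {
    "BSAMYL": {"species": "Bacillus subtilis",
               "keywords": {"amylase", "signal peptide"}},
    "TOY0001": {"species": "Bacillus subtilis",
                "keywords": {"protease"}},
    "TOY0002": {"species": "Escherichia coli",
                "keywords": {"amylase"}},
}


def select(predicate: Callable[[dict], bool]) -> Set[str]:
    """Names of the entries whose annotations satisfy the predicate."""
    return {name for name, entry in ENTRIES.items() if predicate(entry)}


by_species = select(lambda entry: entry["species"] == "Bacillus subtilis")
by_keyword = select(lambda entry: "amylase" in entry["keywords"])

both = by_species & by_keyword          # species AND keyword
either = by_species | by_keyword        # species OR keyword
species_only = by_species - by_keyword  # species AND NOT keyword

print(sorted(both), sorted(either), sorted(species_only))
```

Keeping each simple selection as a named set makes it easy to reuse intermediate results when a query is refined step by step.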
59 The Query form
60 Building queries to the sequence databases
65 Retrieving sequences
Save the received sequence data locally.
66 Browsing the species trees
69 HOVERGEN: Families of homologous vertebrate genes
70 Access to family members
Download tree or alignment
72 SRS
- Public version developed at EMBL by Etzold and Argos (1993).
- Presently available on the different Web servers belonging to EMBnet:
- EBI (England).
- INFOBIOGEN (France).
- DKFZ (Germany).
73 Characteristics
- Database index built with the use of ODD (Object Design and Definition).
- More than 250 databanks have been indexed and are accessible through 35 SRS servers.
- Allows queries to operate simultaneously on different banks.
74 Databanks interconnection
75 Entrez
- Developed by Schuler et al. (1996) at NCBI.
- Allows querying of several US-made databases:
- GenBank, GenPept, NR, MMDB, MEDLINE.
- Access through client software (Unix, Mac or Windows) or Web server
- http://www.ncbi.nlm.nih.gov
76 Characteristics
- Introduces the concept of neighbours between sequences, references and structures.
- Sequence neighbours are established using similarity criteria.
- No access to multiple alignments.
77 NAR 2003 database issue
http://nar.oupjournals.org/content/vol31/issue1/