Title: Genome, Protein and Model Organism Databases
1Genome, Protein and Model Organism Databases
Anne Estreicher, Swiss-Prot Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
Anne.Estreicher_at_isb-sib.ch
Bioinformatic and Comparative Genome Analysis Course, HKU-Pasteur Research Centre, Hong Kong, China, August 17 - August 29, 2009
2- Outline
- Introduction (definitions, history)
- From DNA sequence to genomic tools
- The flow of information from DNA to proteins
- Protein sequence databases
- MODs at a glance
3What is a database ?
- A collection of related data, which are
- structured
- searchable
- updated periodically
- cross-referenced
- Includes also associated tools necessary for
access/query, download, etc.
4- Why do we need databases ?
- Data need to be stored, curated and made available for analysis and knowledge discovery
- Efficient way of sharing data, independently of regular publications
- Essential resources for both experimental and computational biologists
5Databases in biology not a new issue
- 1954: First protein sequence (insulin, by F. Sanger)
- 1965: Atlas of Protein Sequence and Structure (65 proteins)
6The first protein sequence "database" by
Margaret Dayhoff (1965) contained 65 proteins
7Databases not a new issue
- 1954: First protein sequence (insulin, by F. Sanger)
- 1965: Atlas of Protein Sequence and Structure (65 proteins)
- Mid 70s: Improvements in DNA sequencing
- 1979: Los Alamos Sequence Library (Walter Goad)
- 1980: 80 genes fully sequenced
- -> Need to store the data and to make them available for analysis (in a format acceptable for human eyes and machines)
- -> ARCHIVE
- -> RACE for the central position in life sciences
And the winner is...
8Databases not a new issue
- EMBL-Bank - Europe, 1980
- GenBank - USA, 1982
- DDBJ - Asia, 1986
leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data
9www.insdc.org
10- EMBL-Bank - GenBank - DDBJ
- Main resources for DNA and RNA sequences
- Sequences used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications
- Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.
  - 1. True for nucleic acid, not for protein sequences
  - 2. Not always put into practice
  - -> Sequences that are not submitted are LOST!!!
- Archives (primary databases)
  - data belong to submitters
11EMBL-Bank - GenBank - DDBJ: archives (primary databases) -> data belong to the submitter
- Minimal checks, such as vector contamination
- Annotation by the submitters
13Databases not a new issue
- 1954: First protein sequence (insulin, by F. Sanger)
- 1965: Atlas of Protein Sequence and Structure (65 proteins)
- 1979: Los Alamos Sequence Library (Walter Goad) - DNA
- 1982: EMBL-Bank - DNA
- 1982: GenBank - DNA
- 1986: DDBJ - DNA
- -> ARCHIVES (primary databases) may not be sufficient
- -> Need to annotate the data to produce KNOWLEDGE
- 1986: Swiss-Prot - protein sequences: a paradigm for annotated (secondary) databases
14The Swiss-Prot concept
- Non-redundant: protein products of 1 gene / 1 species -> 1 entry
- Manually annotated (-> curator judgement on data!)
- Highly cross-referenced (1st life-science database to provide cross-references; links to > 130 databases from www.uniprot.org)
15Databases not a new issue
- 1954: First protein sequence (insulin, by F. Sanger)
- 1965: Atlas of Protein Sequence and Structure (65 proteins)
- 1979: Los Alamos Sequence Library (Walter Goad) - DNA
- 1982: EMBL-Bank - DNA
- 1982: GenBank - DNA
- 1984: Protein Information Resource (PIR) - protein sequences
- 1986: DDBJ - DNA
- 1986: Swiss-Prot - protein sequences
- 1996: TrEMBL (Translated EMBL) - protein sequences
  - Complement of Swiss-Prot to cope with the increasing amount of new sequences: AUTOMATIC ANNOTATION!
16UniProtKB/Swiss-Prot growth
[Figure: number of entries vs. release number]
- 1986: 3939 entries
- 1996 (creation of TrEMBL): Swiss-Prot 52205 entries, TrEMBL 61137 entries
- Swiss-Prot rel. 57.5 (07-Jul-2009): 470369 entries
17UniProtKB growth
[Figure: number of entries vs. release number, 1986 - 1996 - 2009; TrEMBL: automated curation, Swiss-Prot: manual curation]
- TrEMBL rel. 40.5 (07-Jul-2009): 8594382 entries
- Swiss-Prot rel. 57.5 (07-Jul-2009): 470369 entries
- TrEMBL growth (sequences/day)
  - 2004: ~1500
  - 2006-2007: ~3500
  - > 5000
  - ~8000
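The two release figures above can be put side by side with a little arithmetic. The sketch below only uses the entry counts quoted on the slide; the variable names are illustrative.

```python
# Entry counts quoted on the slide for the 07-Jul-2009 releases.
SWISS_PROT_ENTRIES = 470_369   # Swiss-Prot rel. 57.5, manual curation
TREMBL_ENTRIES = 8_594_382     # TrEMBL rel. 40.5, automated curation

total = SWISS_PROT_ENTRIES + TREMBL_ENTRIES
manual_share = SWISS_PROT_ENTRIES / total
trembl_per_swissprot = TREMBL_ENTRIES / SWISS_PROT_ENTRIES

print(f"Total entries: {total}")
print(f"Manually curated share: {manual_share:.1%}")
print(f"TrEMBL entries per Swiss-Prot entry: {trembl_per_swissprot:.1f}")
```

With these counts, manually curated entries make up roughly one entry in twenty.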
18- New challenge
- Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery
19(R)evolution of these last 20 years
- Life sciences used to be rich in hypotheses, well-off in knowledge and poor in data
- Today they are very rich in data, not so well-off in knowledge and very poor in hypotheses.
20Science (1993) 262, 502
21EMBL Database Growth http://www.ebi.ac.uk/embl/Services/DBStats/
22http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
In 4 months: 374 new genomes and 77 completed -> about 100 genomes/month (in 2008 -> 50 genomes/month)
+ 2360 viral (+ viroid) genomes -> Total 5600 genomes
23http://genomesonline.org/index2.htm
24http://www.genomesonline.org/gold.cgi
27Metagenomics: study of genetic material recovered directly from environmental samples
- Global Ocean Sampling (C. Venter)
- Whale fall
- Soil, sand beach, New York air, ...
- Human fluids, mouse gut
[Photo: Venter's Sorcerer II]
28- Flood in the world of proteins
- 1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)
- July 2009: 20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/)
- UniParc
  - non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from the EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome Database (SGD), The Arabidopsis Information Resource (TAIR), TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).
29- New challenge
- Flood of data
- Flood of databases
30NAR (Nucleic Acids Research): the 1st issue of the year is always dedicated to databases -> "clean" list of databases provided (not exhaustive!)
31The NAR Online Molecular Biology Database Collection in 2009: a total of 1170 databases (19 obsolete removed) http://www.oxfordjournals.org/nar/database/a/
32NAR "clean" list of databases: http://www.oxfordjournals.org/nar/database/a/
33Most recent NAR paper about the database (not available for all dbs, some are described in other journals)
34A "clean" list of databases can be found in the NAR Online Molecular Biology Database Collection: http://www.oxfordjournals.org/nar/database/a/
36BIOLOGICAL DATABASE CATEGORIES
- Databases of nucleic acid sequences (RNA, DNA)
- Databases of protein sequences
- Databases of protein motifs and protein domains
- Databases of structures
- Databases of genomes
- Databases of genes
- Databases of expression profiles
- Databases of SNPs and mutations
- Databases of metabolic pathways
- Databases of protein interactions
- Databases of taxonomy
-
Databases containing sequences or data directly
derived from sequences.
37DNA sequences What ? Where ? How ? genomic
tools NCBI UCSC
38GenBank entry AF415175 http://www.ncbi.nlm.nih.gov/nuccore/16589063
Highlighted fields: accession number, molecule type, date of submission, definition, nucleotide sequence
41Fields of the GenBank entry
- Accession number, molecule type, date of submission, definition
- Taxonomy
- References
- Features: information provided by the submitter; may include annotation of the sequence
  - Organism, molecule type, chromosomal location, tissue type
  - Gene name
  - CDS annotation -> protein sequence
  - Protein IDentifier (PID: stable identifier and version number)
- Nucleotide sequence
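To make the field layout concrete, here is a minimal sketch of reading a GenBank-style flat file. The record below is hypothetical and heavily shortened (it is not the content of AF415175), and the parser is a toy that keeps only single-line top-level keywords and the sequence; a real project would normally rely on an established parsing library.

```python
# Hypothetical, shortened GenBank-style record (invented for illustration).
RECORD = """\
LOCUS       XX000001                  12 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example record, not a real entry.
ACCESSION   XX000001
SOURCE      Homo sapiens (human)
FEATURES             Location/Qualifiers
     CDS             1..12
                     /gene="EXAMPLE"
                     /translation="MKL"
ORIGIN
        1 atgaaactgt aa
//
"""

def parse_flatfile(text):
    """Collect top-level keyword lines and the nucleotide sequence."""
    fields = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            sequence.append("".join(line.split()[1:]))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():         # keyword in the first column
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        # Indented lines (the feature table) are skipped by this toy parser.
    fields["SEQUENCE"] = "".join(sequence)
    return fields

record = parse_flatfile(RECORD)
print(record["ACCESSION"], record["SOURCE"], record["SEQUENCE"])
```

The feature table, where the submitter's CDS annotation lives, is deliberately ignored here to keep the sketch short.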
42Protein sequence
43"Features" may provide much more information, depending upon the sequence and the submitter
Example: 3' end of chromosome Y, EMBL AJ271736
44Very similar view, links and options from the 3 sites: EMBL-Bank - GenBank - DDBJ
http://www.ddbj.nig.ac.jp/
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/
45How to find a DNA sequence at the NCBI
46http://www.ncbi.nlm.nih.gov/
47Databases @ NCBI http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others -> maximal interconnectivity
49Simple search with a EMBL-Bank/GenBank/DDBJ
accession number
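The same accession-number lookup can also be done programmatically. The slides only show the web interface; the sketch below assumes NCBI's E-utilities `efetch` endpoint (not mentioned in the lecture) and merely builds the request URL, without sending it. The function name and default parameters are illustrative choices.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities efetch endpoint (not part of the original slides).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build (but do not send) a request URL for one nucleotide accession."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_EFETCH}?{urlencode(params)}"

# AF415175 is the GenBank accession used as an example in the slides.
url = efetch_url("AF415175")
print(url)
```

Opening the resulting URL should return the flat-file record as plain text; check NCBI's current usage guidelines before sending requests in bulk.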
52Searching from a bibliographic reference
54Search results 2 and 3 -> accession numbers provided by the authors in the article -> GenBank records
Search result 1 -> corresponds to the RefSeq database
55- RefSeq (Reference Sequence)
- Provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins
- Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences)
- Some entries based on predictions (accession prefixes XM_, XR_, XP_, ZP_)
- Currently, 8'665 species represented
- Annotation
  - Manual annotation (only in entries tagged as "reviewed")
  - Collaboration
  - Propagation from other sources
  - Computation.
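Because the prediction-based entries are recognisable from their accession prefix, a quick check can be scripted. This is a minimal sketch using only the four prefixes listed above; the function name is illustrative and `XP_000000` is a made-up accession.

```python
# Prefixes that the slide lists for prediction-based RefSeq entries.
PREDICTED_PREFIXES = ("XM_", "XR_", "XP_", "ZP_")

def is_predicted_refseq(accession):
    """True if the accession starts with one of the prediction prefixes."""
    return accession.strip().upper().startswith(PREDICTED_PREFIXES)

# NM_015595 is the SGEF mRNA entry shown in the slides; XP_000000 is invented.
for acc in ("NM_015595", "XP_000000"):
    label = "predicted" if is_predicted_refseq(acc) else "not flagged as predicted"
    print(acc, label)
```

A prefix test says nothing about the curation status of the other entries; that information is in the status code of each record.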
56RefSeq (Reference Sequence)
Status code          Curation
GENOME ANNOTATION    No
INFERRED             No
MODEL                No
PREDICTED            No
PROVISIONAL          No
REVIEWED             Yes (sequence, functional information and features)
VALIDATED            Yes (initial sequence)
WGS                  No
57RefSeq entry NM_015595 SGEF mRNA
Accession number Definition Taxonomy List of
references
58RefSeq entry NM_015595 SGEF mRNA
Gene name Exon annotation CDS annotation and
sequence
59RefSeq entry NM_015595 SGEF mRNA
Sequence
60Searching with the gene name
62Refseq
63- NCBI Entrez system
- Looks for the request in all NCBI databases
- Cannot be ignored -> no simple way to search only in your favourite NCBI database
64Searching using BLAST
73UniSTS62643 maps to multiple loci in Homo
sapiens
77Mapping of RNA (EMBL/GenBank/DDBJ, RefSeq)
UniGene
Mapping of RefSeq RNA
Mapping of known genes
This default view can be customized
78Customizing the view:
1. Choose the desired option
2. Add it (and/or remove undesired ones)
3. Apply the new display
81Map Viewer: 110 organisms represented in the Genome database (www.ncbi.nlm.nih.gov/sites/entrez?db=genome)
82Genomic tools on the UCSC server BLAT search
84Genome browser @ UCSC
Feb. 2009 assembly: not all data implemented yet! It may be better to use the former assembly for the time being.
http://genome.ucsc.edu/cgi-bin/hgBlat
87Chromosomal location
gDNA sequence
Consensus CDS; other sequences from reliable resources
88Annotation of genes is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical.
CCDS database goal: provide a standard set of gene annotations.
Collaborative project involving teams (manual and automated annotation) from:
- European Bioinformatics Institute (EBI)
- National Center for Biotechnology Information (NCBI)
- Wellcome Trust Sanger Institute (WTSI)
- University of California, Santa Cruz (UCSC)
Currently available only for the human and mouse genomes (July 2009):
- 20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
- 17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes
http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi
89Chromosomal location
gDNA sequence
Consensus CDS; other sequences from reliable resources
All sequences can be retrieved
(Human) mRNAs
(Human) spliced ESTs
(Human) ESTs (including unspliced)
90The view can be completely customized
91including with various tools allowing
comparative genomics
92http://genome.ucsc.edu/
and including your own data!
93Back to the BLAT viewer
94Arrows (>>>>) show the direction of transcription
95Two transcripts from the same locus: BDNF (Brain-Derived Neurotrophic Factor) and BDNFOS (BDNF Opposite Strand)
97View of alternative exons
Alternative exons
98Interested in this exon?
Just zoom in
100Genome browser @ UCSC has many great options, give it a try! http://genome.ucsc.edu/
101Typical problems or Why wonderful tools will
never replace the brain of a life scientist !
103 Once upon a time, there was a gene on
chromosome 11
104Two essential genome resources are missing from this lecture:
- Ensembl (http://www.ensembl.org/index.html): automated annotation of many genomes
- Vega (http://vega.sanger.ac.uk/index.html): high quality manual annotation of genomes (currently Homo sapiens, Mus musculus, Danio rerio, Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis familiaris)
Please go and visit them!
105The flow of information From DNA
sequences to protein sequences A little
biology and A few databases
106From genome to proteome: the example of human
[Figure: genome -> transcriptome -> proteome]
- Genome: 20500 human protein-encoding genes
- Transcriptome: alternative promoter usage, alternative splicing, trans-splicing, mRNA editing
- Increase in complexity: 5-10 x
- Proteome: ~1'000'000 human proteins
- Post-translational modifications (PTMs): most PTMs cannot be predicted from DNA sequences
107The hectic life of a protein sequence
[Figure: cDNAs, ESTs, genomes, ... -> nucleic acid databases (DDBJ, EMBL, GenBank)]
- International Nucleotide Sequence Database Collaboration: www.insdc.org
- Data not submitted to public databases, delayed or cancelled
108!!!! 99 % of the protein sequences found in databases come from the translation of nucleotide sequences -> experimental evidence may be lacking!
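"Translation of nucleotide sequences" here means conceptual translation: reading an annotated coding sequence codon by codon. Below is a minimal sketch using the standard genetic code; the example CDS is invented, and real pipelines also handle alternative genetic codes, partial CDS and ambiguous bases, which this sketch does not.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Conceptually translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Invented 12-nucleotide CDS: ATG AAA CTG TAA
print(translate("ATGAAACTGTAA"))
```

The output is only as good as the CDS annotation it starts from, which is the point made on the slide: a translated sequence is a prediction until the protein has been observed experimentally.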
109EMBL (DNA)
A similar pipeline is used at the NCBI to go from
GenBank to GenPept
110!!!! The quality of UniProtKB/TrEMBL (and GenPept) entries depends upon the quality of the submission in the original EMBL-Bank/GenBank/DDBJ entry.
113EMBL (DNA)
114Splice variants
Sequence
Sequence features
Ontologies
Annotations
References
Nomenclature
115Evidence for protein existence: annotation in UniProtKB
5 levels of evidence:
1. evidence at protein level
2. evidence at transcript level
3. inferred by homology
4. predicted
5. uncertain
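The five levels form an ordered scale, which makes them easy to use as a filter. The sketch below models them as an ordered enumeration; the class name, the helper function and the entries are hypothetical and do not reflect how UniProtKB stores this annotation.

```python
from enum import IntEnum

class ProteinExistence(IntEnum):
    """The five evidence levels listed above; a lower value means stronger evidence."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5

# Hypothetical entries (identifiers and levels invented for illustration).
entries = {
    "EXAMPLE_1": ProteinExistence.PROTEIN_LEVEL,
    "EXAMPLE_2": ProteinExistence.PREDICTED,
    "EXAMPLE_3": ProteinExistence.TRANSCRIPT_LEVEL,
    "EXAMPLE_4": ProteinExistence.UNCERTAIN,
}

def at_least(entries, weakest_accepted):
    """Return identifiers whose evidence is as strong as, or stronger than, the cut-off."""
    return sorted(name for name, level in entries.items() if level <= weakest_accepted)

print(at_least(entries, ProteinExistence.TRANSCRIPT_LEVEL))
```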
116http://www.uniprot.org/uniprot/P35613
118http://www.uniprot.org/uniprot/Q9Y471
119http://www.uniprot.org/uniprot/Q9Y471
120UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!
- Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
- Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
- Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
- Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq
- Proteomic dbs: PeptideAtlas, PRIDE, ProMEX
- Phylogenomic dbs: HOGENOM, HOVERGEN, OMA
- Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline
- Polymorphism dbs: dbSNP
- 2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
- Ontologies: GO
- Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
- 3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR
- Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
- PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite
- Protein-protein interaction dbs: DIP, IntAct
- Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
122The UniProt consortium
123UniProt mission Provide a comprehensive
high-quality and freely accessible resource of
protein sequence and functional annotation.
125Update frequency: a crucial issue!!
- Sometimes very difficult, or even impossible, to find
- Crucial not only for the database itself, but also for tools using databases.
126Update frequency
128http://www.matrixscience.com/search_intro.html
129The Mascot MS/MS identification tool is fine, but it cannot be used from this website! Solution: download the database of interest and make sure you work with an up-to-date version.
130Never hesitate to ask for an update
132- UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9'232'223 entries)
- UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20'070'606 entries)
133UniParc entry contains all records for a unique
sequence in major publicly available databases.
134- UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9'232'223 entries)
- UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20'070'606 entries)
- UniRef: 3 sets of clusters of protein sequences with 100 %, 90 % and 50 % similarity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100: 8'474'689 entries; UniRef90: 5'668'669 entries; UniRef50: 2'729'565 entries)
135UniRef100, 90 and 50
- One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl, RefSeq).
- One UniRef90 entry -> sequences that have at least 90 % identity. Built from UniRef100.
- One UniRef50 entry -> sequences that are at least 50 % identical. Built from UniRef100.
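The idea of clustering at decreasing identity thresholds can be illustrated with a toy example. This is a simplified sketch and not the procedure used to build UniRef: it compares equal-length, already aligned sequences position by position and assigns each sequence to the first cluster whose representative is similar enough. The sequences and function names are invented.

```python
def identity(a, b):
    """Fraction of identical positions (toy measure for equal-length sequences)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first cluster whose representative
    (its first member) reaches the identity threshold; else open a new cluster."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# Invented 10-residue sequences.
sequences = [
    "MKLVTAGHWE",
    "MKLVTAGHWE",  # identical to the first
    "MKLVTAGHWD",  # 90 % identical to the first
    "MKLVTCCCCC",  # 50 % identical to the first
    "QQQQQQQQQQ",  # unrelated
]

for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(sequences, threshold)), "clusters")
```

Lowering the threshold merges more sequences per cluster, which is why the lower-identity sets are smaller and faster to search.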
137- UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (7'097'874 entries)
- UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (17'646'564 entries)
- UniRef: 3 sets of clusters of protein sequences with 100 %, 90 % and 50 % similarity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100: 6'652'983 entries; UniRef90: 4'438'653 entries; UniRef50: 2'104'702 entries)
- UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (UniMES: 6'028'191 entries)
138What is "Non-Redundancy" ?
- UniParc
  - One UniParc entry for all entries corresponding to 100 % identical sequences (100 % identity over the entire length) (from many different databases).
- UniRef
  - One UniRef100 entry for all entries corresponding to 100 % identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.
- UniProtKB/Swiss-Prot
  - One Swiss-Prot entry for all the protein products of one gene, including fragments, variations/polymorphisms, splice variants, sequencing errors
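The UniParc notion of non-redundancy (one entry per sequence that is 100 % identical over its entire length) amounts to grouping records by exact sequence. The sketch below shows the idea with invented records; the database names, identifiers and function name are hypothetical, and this is not a description of UniParc's actual implementation.

```python
from collections import defaultdict

# Hypothetical records: (source database, identifier, protein sequence).
records = [
    ("DB_A", "A0001", "MKLVTAGHWE"),
    ("DB_B", "B0042", "MKLVTAGHWE"),
    ("DB_C", "C0007", "mklvtaghwe"),   # same sequence, different letter case
    ("DB_A", "A0002", "MKLVTAGHWD"),   # differs by one residue -> separate entry
]

def merge_identical(records):
    """Group source records by exact (case-insensitive) full-length sequence."""
    by_sequence = defaultdict(list)
    for database, identifier, sequence in records:
        by_sequence[sequence.upper()].append((database, identifier))
    return dict(by_sequence)

archive = merge_identical(records)
for sequence, sources in archive.items():
    print(sequence, sources)
```

Each key plays the role of one archive entry, and the list of sources corresponds to its cross-links to the databases where the sequence is found.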
139Comparing searches NCBI and UniProt
140Search for the human Toll-like receptor 4 Entrez
Protein (NCBI)
141Search for the human Toll-like receptor 4 in
UniProtKB
Swiss-Prot
142Sequences retrieved in Entrez Protein: O00206, AAF05316, CAH72618, CAH72619, BAG55035, AAI17423, AAF89753, NP_612564, AAC34135
Based on A126770, BC117422, AL160272 and AA598398
143Major protein sequence resources
- UniProtKB (Swiss-Prot + TrEMBL): resources integrated in the entries (PIR, PDB, PRF)
- Entrez Protein: resources integrated in the search engine (Swiss-Prot, GenPept, PIR, PDB, PRF, RefSeq)
- UniProtKB/Swiss-Prot: manually annotated protein sequences (12000 species)
- UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation (202000 species)
- GenPept: submitted CDS (GenBank)
- PIR: Protein Information Resource; archive since 2003, integrated into UniProtKB
- PDB: Protein Data Bank; 3D data and associated sequences
- PRF: journal scan of published peptide sequences
- RefSeq: Reference Sequence for DNA, RNA, protein; gene prediction + some manual annotation
144Model Organism Databases (MODs) at a glance
145Model organism: species extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.
Model organism                    MOD              URL
Mus musculus                      MGI              http://www.informatics.jax.org/
Rattus norvegicus                 RGD              http://rgd.mcw.edu/
Oryza sativa                      RAP-DB           http://rapdb.dna.affrc.go.jp/
Arabidopsis thaliana              TAIR             http://www.arabidopsis.org/
Drosophila melanogaster           FlyBase          http://flybase.org/
Schizosaccharomyces pombe         S. pombe GeneDB  http://www.genedb.org/genedb/pombe/
Saccharomyces cerevisiae          SGD              http://www.yeastgenome.org/
Caenorhabditis elegans            WormBase         http://www.wormbase.org/
Dictyostelium discoideum          dictyBase        http://dictybase.org/
Bacillus subtilis                 SubtiList        http://genolist.pasteur.fr/SubtiList/
Escherichia coli                  EcoGene          http://ecogene.org/
Danio rerio (zebrafish)           ZFIN             http://zfin.org/
Methanocaldococcus jannaschii     -> no MOD
Just a few examples, not an exhaustive list!
146Model organism databases (MODs)
- Genome annotation
- Gene models
- Gene mapping
- Official nomenclature
- Gene expression
- Functional annotation
- Interactions
- Information about mutants/knockout/transgenic animals
- Phenotypes
- (Cross-)references
- Species-specific reagents
Key resources for information on a given organism
Service provided to/from a given community
153http://gmod.org/wiki/Main_Page
154The world of databases is a jungle
155- A few points to remember when using databases
- Content
  - Primary / secondary / meta-databases
  - Curated / non-curated
  - Manual / automated curation
  - Redundant / non-redundant
- Update frequency
- Stable identifiers
- Strategy
- Dataflow
- Collaborations between databases
156Test a few genomic databases and tools
157Genomes and genomic tools: a few sites
- NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
- EBI: http://www.ebi.ac.uk/genomes/
- TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi
Genome annotation and analysis tools
- http://www.ensembl.org/index.html
- http://vega.sanger.ac.uk/index.html
- http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, ...
- http://www.jgi.doe.gov/software/ -> Genome portal, Integrated Microbial Genomes (IMG) and other tools
- Generic Model Organism Database: http://gmod.org/wiki/Main_Page
158Genomes and genomic tools: Hands-on
- Find your favorite (completely sequenced) organism in a genome db
- Follow the links to see the options on different sites
- Find the sequences
- Look at the annotation of your favorite gene
- Compare the entries corresponding to this gene across sites
- Test search engines (restrict searches, compare results, ...)
- Whenever possible use on-line tutorials, such as http://www.ensembl.org/info/website/tutorials/index.html
- Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)
- Play around with the BLAT search, customize display, follow the links, ...
159Genomes and genomic tools: Hands-on
- Go and visit the databases cited in this lecture
- The databases/tools that should be "familiar" to all are:
  - http://genome.ucsc.edu/cgi-bin/hgBlat
  - http://www.ensembl.org/index.html
  - gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/
- If none of the databases are of interest to you, go to the NAR database collection (http://www.oxfordjournals.org/nar/database/a/) and find databases that are closest to your interests
- Play around
Hands-on: protein sequence databases and UniProt
http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)
160Thank You !