Genome, Protein and Model Organism Databases

1
Genome, Protein and Model Organism Databases
Anne Estreicher, Swiss-Prot Group, Swiss Institute of Bioinformatics, Geneva, Switzerland (Anne.Estreicher@isb-sib.ch)
Bioinformatic and Comparative Genome Analysis Course, HKU-Pasteur Research Centre, Hong Kong, China, August 17 - August 29, 2009
2

Outline
Introduction (definitions, history)
From DNA sequence to genomic tools
The flow of information from DNA to proteins
Protein sequence databases
MODs at a glance

3
What is a database ?

A collection of related data, which are
structured
searchable
updated periodically
cross-referenced
Includes also associated tools necessary for
access/query, download, etc.

Why do we need databases ?
Data need to be stored, curated and made
available for analysis and knowledge discovery
Efficient way of sharing data, independently of
regular publications
Essential resources for both experimental and
computational biologists

5
Databases in biology not a new issue

1954 First protein sequence (insulin by F.
Sanger)
1965 Atlas of Protein Sequence and Structure (65
proteins)

6
The first protein sequence "database" by
Margaret Dayhoff (1965) contained 65 proteins
7
Databases not a new issue

1954 First protein sequence (insulin by F.
Sanger)
1965 Atlas of Protein Sequence and Structure (65
proteins)
Mid 70s Improvements in DNA sequencing
1979 Los Alamos Sequence Library (Walter Goad)
1980 80 genes fully sequenced
-> Need to store the data and to make them available for analysis (in a format acceptable for human eyes and machines)
-> ARCHIVE
-> RACE for the central position in life sciences

And the winner is
8
Databases not a new issue
EMBL-Bank - Europe 1980
GenBank - USA 1982
DDBJ - Asia 1986
leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data
9
www.insdc.org
10

EMBL-BANK - GenBank - DDBJ
Main resources for DNA and RNA sequences
Used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications
Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.
1. True for nucleic acid, not for protein sequences
2. Not always put into practice
=> Sequences that are not submitted are LOST!
Archives (primary databases): data belong to submitters

11
EMBL-Bank - GenBank - DDBJ: archives (primary databases) => data belong to the submitter

Minimal checks, such as vector contamination
Annotation by the submitters

12
Databases not a new issue

1954 First protein sequence (insulin by F.
Sanger)
1965 Atlas of Protein Sequence and Structure (65
proteins)
1979 Los Alamos Sequence Library (Walter Goad)
DNA
1982 EMBL-Bank - DNA
1984 GenBank DNA
1986 DDBJ - DNA

13
Databases not a new issue

1954 First protein sequence (insulin by F.
Sanger)
1965 Atlas of Protein Sequence and Structure (65
proteins)
1979 Los Alamos Sequence Library (Walter Goad)
DNA
1982 EMBL-Bank - DNA
1984 GenBank DNA
1986 DDBJ - DNA
-> ARCHIVES (primary databases) may not be sufficient
-> need to annotate the data to produce KNOWLEDGE

1986 Swiss-Prot (protein sequences): a paradigm for annotated (secondary) databases

14
The Swiss-Prot concept

Non-redundant: protein products of 1 gene / 1 species -> 1 entry
Manually annotated (=> curator judgement on data!)
Highly cross-referenced (1st life-science database to provide cross-references; links to > 130 databases from www.uniprot.org)

15
Databases not a new issue

1954 First protein sequence (insulin by F.
Sanger)
1965 Atlas of Protein Sequence and Structure (65
proteins)
1979 Los Alamos Sequence Library (Walter Goad)
DNA
1982 EMBL-Bank - DNA
1984 GenBank - DNA
     Protein Information Resource (PIR) - Protein sequences
1986 DDBJ - DNA
     Swiss-Prot - Protein sequences
1996 TrEMBL (Translated EMBL) - Protein sequences
     Complement of Swiss-Prot to cope with the increasing number of new sequences: AUTOMATIC ANNOTATION!

16
UniProtKB/Swiss-Prot growth
[Figure: number of entries vs. release number]
1986: 3939 entries
1996 (creation of TrEMBL): Swiss-Prot 52205 entries, TrEMBL 61137 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470369 entries
17
UniProtKB growth
TrEMBL rel. 40.5 (07-Jul-2009): 8594382 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470369 entries

TrEMBL growth (sequences/day)
2004 ? 1500
2006-2007 ? 3500
? >5000
? 8000

[Figure: number of entries vs. release number, 1986 to 2009; TrEMBL = automated curation, Swiss-Prot = manual curation]
18

New challenge
Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery

19
(R)evolution of these last 20 years

Life sciences used to be rich in hypotheses,
well-off in knowledge and poor in data
Today they are very rich in data, not so well-off
in knowledge and very poor in hypotheses.

20
Science (1993) 262, 502
21
EMBL Database Growth: http://www.ebi.ac.uk/embl/Services/DBStats/
22
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
In 4 months: 374 new genomes, and 77 were completed, i.e. 100 genomes/month (in 2008 -> 50 genomes/month)
2360 viral (and viroid) genomes => Total: 5600 genomes
23
http://genomesonline.org/index2.htm
24
http://www.genomesonline.org/gold.cgi
26
http://www.genomesonline.org/gold.cgi
27
Metagenomics: study of genetic material recovered directly from environmental samples

Global Ocean Sampling (C. Venter)
Whale fall
Soil, sand beach, New York air, ...
Human fluids, mouse gut

Venter's Sorcerer II
28

Flood in the world of proteins
1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)
July 2009: 20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/)
UniParc
Non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from the EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome Database (SGD), TAIR (The Arabidopsis Information Resource), TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).

New challenge
Flood of data
Flood of databases

30
NAR: the 1st issue of the year is always dedicated to databases. A "clean" list of databases is provided (not exhaustive!)
31
The NAR Online Molecular Biology Database Collection in 2009: a total of 1170 databases (19 obsolete removed) http://www.oxfordjournals.org/nar/database/a/
32
NAR "clean" list of databases: http://www.oxfordjournals.org/nar/database/a/
33
Most recent NAR paper about the database (not available for all databases; some are described in other journals)
34
A "clean" list of databases can be found in the NAR online molecular biology database collection: http://www.oxfordjournals.org/nar/database/a/
36
BIOLOGICAL DATABASE CATEGORIES

Databases of nucleic acid sequences (RNA, DNA)
Databases of protein sequences
Databases of protein motifs and protein domains
Databases of structures
Databases of genomes
Databases of genes
Databases of expression profiles
Databases of SNPs and mutations
Databases of metabolic pathways
Databases of protein interactions
Databases of taxonomy

Databases containing sequences or data directly
derived from sequences.
37
DNA sequences: What? Where? How?
Genomic tools: NCBI, UCSC
38
GenBank entry AF415175: http://www.ncbi.nlm.nih.gov/nuccore/16589063
Accession number, molecule type, date of submission, definition
Nucleotide sequence
39
Accession number Molecule type Date of
submission Definition
Taxonomy
Nucleotide sequence
40
Accession number Molecule type Date of
submission Definition
Taxonomy
References
Nucleotide sequence
41
Accession number, molecule type, date of submission, definition
Taxonomy
References
Features: information provided by the submitter; may include annotation of the sequence
- Organism, molecule type, chromosomal location, tissue type
- Gene name
- CDS annotation => protein sequence
- Protein IDentifier (PID: stable identifier and version number)
Nucleotide sequence
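The record layout listed above can be illustrated with a toy parser. This is only a sketch: the record is invented for illustration (TOY000001 is not a real accession), the function name is hypothetical, and real entries carry many more keywords and multi-line fields than this handles.

```python
# Toy parser for the top-level keywords of a GenBank-style flat file.
# SAMPLE_RECORD is an invented record, not a real database entry.
SAMPLE_RECORD = """\
LOCUS       TOY000001                 12 bp    mRNA    linear   PRI 01-JAN-2009
DEFINITION  Hypothetical example mRNA, complete cds.
ACCESSION   TOY000001
VERSION     TOY000001.1
SOURCE      Homo sapiens (human)
FEATURES             Location/Qualifiers
     CDS             1..12
                     /protein_id="TOY00001.1"
ORIGIN
        1 atggcctgat aa
//
"""


def parse_flat_file(text):
    """Return (fields, sequence) for one GenBank-style record."""
    fields = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            sequence.append("".join(line.split()[1:]))
        elif line.startswith("ORIGIN"):  # the sequence starts on the next line
            in_origin = True
        elif line[:1].isalpha():         # top-level keyword in column 1
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    return fields, "".join(sequence)


fields, sequence = parse_flat_file(SAMPLE_RECORD)
print(fields["ACCESSION"], fields["VERSION"], sequence)
```

Indented lines (feature qualifiers) are skipped on purpose; a production parser would handle the feature table, for example to pull out the CDS and its protein identifier.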
42
Protein sequence
43
"Features" may provide much more information, depending upon the sequence and the submitter
3' end of chromosome Y: EMBL AJ271736
44
Very similar view, links and options from the 3 sites: EMBL-Bank - GenBank - DDBJ
http://www.ddbj.nig.ac.jp/
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/
45
How to find a DNA sequence at the NCBI
46
http://www.ncbi.nlm.nih.gov/
47
Databases @ NCBI: http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others => maximal interconnectivity
48
Databases @ NCBI: http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
49
Simple search with an EMBL-Bank/GenBank/DDBJ accession number
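The same accession lookup can also be scripted. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service, which is not mentioned in the slides and is used here as an assumption about how one might automate the search; the helper name `efetch_url` is hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build an efetch URL returning a flat-file record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


# AF415175 is the GenBank entry used as an example earlier in the lecture.
url = efetch_url("AF415175")
print(url)
```

Opening the printed URL in a browser should return the record in GenBank format; for anything beyond occasional use, NCBI's usage guidelines for E-utilities apply.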
52
Searching from a bibliographic reference
54
Search results 2 and 3 -> accession numbers provided by the authors in the article -> GenBank records
Search result 1 -> corresponds to the RefSeq database
55

RefSeq (Reference Sequence)
Provides a comprehensive, integrated,
non-redundant, well-annotated set of sequences,
including genomic DNA, transcripts, and proteins
Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences)
Some entries based on predictions (accession prefixes XM_, XR_, XP_, ZP_)
Currently, 8'665 species represented
Annotation
Manual annotation (only in entries tagged as
"reviewed")
Collaboration
Propagation from other sources
Computation.
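The accession prefix alone already tells whether a RefSeq record is prediction-based. A minimal sketch, assuming only a handful of prefixes (the four predicted ones listed above plus NM_, NR_ and NP_); the table is not exhaustive and the function name is hypothetical.

```python
# Partial table: RefSeq accession prefix -> (molecule type, prediction-based?).
# Only a few prefixes are covered; RefSeq defines more than these.
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", False),
    "NR_": ("non-coding RNA", False),
    "NP_": ("protein", False),
    "XM_": ("mRNA", True),
    "XR_": ("non-coding RNA", True),
    "XP_": ("protein", True),
    "ZP_": ("protein", True),
}


def describe_refseq(accession):
    """Return (molecule type, is_predicted), or None for an unknown prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(describe_refseq("NM_015595"))   # the SGEF mRNA entry shown in the lecture
print(describe_refseq("AF415175"))    # a GenBank accession: no RefSeq prefix
```

Checking the prefix is a cheap first filter before looking at the record's status tag (for example "reviewed").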

56
RefSeq (Reference Sequence)
STATUS                CURATION
GENOME ANNOTATION     No
INFERRED              No
MODEL                 No
PREDICTED             No
PROVISIONAL           No
REVIEWED              Yes (sequence, functional information and features)
VALIDATED             Yes (initial sequence)
WGS                   No
57
RefSeq entry NM_015595 SGEF mRNA
Accession number Definition Taxonomy List of
references
58
RefSeq entry NM_015595 SGEF mRNA
Gene name Exon annotation CDS annotation and
sequence
59
RefSeq entry NM_015595 SGEF mRNA
Sequence
60
Searching with the gene name
62
RefSeq
63

NCBI Entrez system
Looks for the request in all NCBI databases
Cannot be ignored -> no simple way to search only in your favourite NCBI database

64
Searching using BLAST
73
UniSTS62643 maps to multiple loci in Homo
sapiens
74
UniGene
Mapping of known genes
75
Mapping of RNA (EMBL/GenBank/DDBJ RefSeq)
UniGene
Mapping of known genes
76
Mapping of RNA (EMBL/GenBank/DDBJ RefSeq)
UniGene
Mapping of RefSeq RNA
Mapping of known genes
77
Mapping of RNA (EMBL/GenBank/DDBJ RefSeq)
UniGene
Mapping of RefSeq RNA
Mapping of known genes
This default view can be customized
78
1. Choose the desired option
2. Add it (and/or remove undesired ones)
3. Apply the new display
81
Map Viewer: 110 organisms represented in the Genome database (www.ncbi.nlm.nih.gov/sites/entrez?db=genome)
82
Genomic tools on the UCSC server: BLAT search
84
Genome browser @ UCSC
Feb. 2009 assembly: not all data implemented! It may be better to use the former assembly for the time being.
http://genome.ucsc.edu/cgi-bin/hgBlat
87
Chromosomal location
gDNA sequence
Consensus CDS and other sequences from reliable resources
88
Annotation of genes is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical.
CCDS database goal: provide a standard set of gene annotations.
Collaborative project involving teams (manual and automated annotation):
- European Bioinformatics Institute (EBI)
- National Center for Biotechnology Information (NCBI)
- Wellcome Trust Sanger Institute (WTSI)
- University of California, Santa Cruz (UCSC)
Currently available only for human and mouse genomes (July 2009):
- 20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
- 17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes
http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi
89
Chromosomal location
gDNA sequence
Consensus CDS and other sequences from reliable resources
All sequences can be retrieved
(Human) mRNAs
(Human) spliced ESTs
(Human) ESTs (including unspliced)
90
The view can be completely customized
91
including with various tools allowing
comparative genomics
92
http://genome.ucsc.edu/
and including your own data !
93
Back to the Blat viewer
94
Arrows (>>>>) show the direction of transcription
95
2 transcripts from the same locus: BDNF (Brain-Derived Neurotrophic Factor) and BDNFOS (BDNF Opposite Strand)
97
View of alternative exons
Alternative exons
98
Interested in this exon?
Just zoom in
100
Genome browser @ UCSC has many great options, give it a try! http://genome.ucsc.edu/
101
Typical problems, or why wonderful tools will never replace the brain of a life scientist!
103
Once upon a time, there was a gene on
chromosome 11
104
2 essential genome resources are missing from this lecture:
Ensembl (http://www.ensembl.org/index.html): automated annotation of many genomes
Vega (http://vega.sanger.ac.uk/index.html): high-quality manual annotation of genomes (currently Homo sapiens, Mus musculus, Danio rerio, Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis familiaris).
Please go and visit them!
105
The flow of information: from DNA sequences to protein sequences
A little biology and a few databases
106
From genome to proteome: the example of human
Genome: 20500 human protein-encoding genes
Transcriptome: alternative promoter usage, alternative splicing, trans-splicing, mRNA editing (increase in complexity: 5-10 x)
Proteome: 1'000'000 human proteins; post-translational modifications (PTMs). Most PTMs cannot be predicted from DNA sequences
107
The hectic life of a protein sequence
cDNAs, ESTs, genomes, ... -> nucleic acid databases (DDBJ, EMBL, GenBank)
International Nucleotide Sequence Database Collaboration: www.insdc.org
Data not submitted to public databases, delayed or cancelled
108
!!!! 99% of the protein sequences found in databases come from the translation of nucleotide sequences => experimental evidence may be lacking!
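What "translation of nucleotide sequences" means in practice can be sketched in a few lines. This is an illustration only, assuming the standard genetic code and a CDS that is already in frame; the function name and the example sequences are invented, and it is not the pipeline used by the databases.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Conceptually translate an in-frame CDS, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCTGA"))  # invented 9-nt CDS
```

The output is a purely conceptual translation: nothing in it shows that the protein is actually made, which is exactly the caveat raised on this slide.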
109
EMBL (DNA)
A similar pipeline is used at the NCBI to go from
GenBank to GenPept
110
!!!! The quality of UniProtKB/TrEMBL (and GenPept) entries depends upon the quality of the submissions in the original EMBL-Bank/GenBank/DDBJ entry.
113
EMBL (DNA)
114
Splice variants
Sequence
Sequence features
Ontologies
Annotations
References
Nomenclature
115
Evidence for protein existence: annotation in UniProtKB
5 levels of evidence:
1. evidence at protein level
2. evidence at transcript level
3. inferred by homology
4. predicted
5. uncertain
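The five levels form an ordered scale, so a data set can be filtered on them. A minimal sketch, assuming a hypothetical list of (accession, level) pairs; the accessions and the class and function names are invented for illustration.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The 5 levels of evidence; a lower value means stronger evidence."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def keep_at_least(entries, weakest_allowed):
    """Keep accessions whose evidence is at least as strong as `weakest_allowed`."""
    return [acc for acc, level in entries if level <= weakest_allowed]


# Invented accessions, used only to exercise the filter.
entries = [
    ("ENTRY_A", ProteinExistence.PROTEIN_LEVEL),
    ("ENTRY_B", ProteinExistence.HOMOLOGY),
    ("ENTRY_C", ProteinExistence.PREDICTED),
    ("ENTRY_D", ProteinExistence.TRANSCRIPT_LEVEL),
]
supported = keep_at_least(entries, ProteinExistence.TRANSCRIPT_LEVEL)
print(supported)
```

Filtering at level 2 keeps only entries backed by protein-level or transcript-level evidence and drops those inferred by homology or predicted.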
116
http://www.uniprot.org/uniprot/P35613
118
http://www.uniprot.org/uniprot/Q9Y471
119
http://www.uniprot.org/uniprot/Q9Y471
120
UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!
Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq
Proteomic dbs: PeptideAtlas, PRIDE, ProMEX
Phylogenomic dbs: HOGENOM, HOVERGEN, OMA
Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline
Polymorphism dbs: dbSNP
2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Ontologies: GO
Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR
Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite
Protein-protein interaction dbs: DIP, IntAct
122
The UniProt consortium
123
UniProt mission: provide a comprehensive, high-quality and freely accessible resource of protein sequence and functional annotation.
125
Update frequency: a crucial issue!

Sometimes very difficult, or even impossible, to find
Crucial not only for the database itself, but also for the tools using databases.

126
Update frequency
128
http://www.matrixscience.com/search_intro.html
129
The Mascot MS/MS identification tool is fine, but it cannot be used from this website! Solution: download the database of interest and make sure you work with an up-to-date version.
130
Never hesitate to ask for an update
132
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9232223 entries)
UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20070606 entries)
133
UniParc entry contains all records for a unique
sequence in major publicly available databases.
134
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9232223 entries)
UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20070606 entries)
UniRef: 3 clusters of protein sequences with 100%, 90% and 50% similarity, useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 8'474'689 entries; UniRef90: 5'668'669 entries; UniRef50: 2'729'565 entries)
135
UniRef100, 90 and 50

One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl, RefSeq).
One UniRef90 entry -> sequences that have at least 90% identity. Built from UniRef100.
One UniRef50 entry -> sequences that are at least 50% identical. Built from UniRef100.
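The effect of the identity threshold on the number of clusters can be shown with a toy greedy clustering. This is a sketch under strong assumptions: identity is computed position by position without alignment, subfragments are not merged, and the sequences and function names are invented; it is not the procedure actually used to build UniRef.

```python
def identity(a, b):
    """Fraction of identical positions, ungapped, relative to the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first member is the representative
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no similar representative: start a new cluster
    return clusters


# Invented 10-residue sequences: the 2nd differs from the 1st at 1 position,
# the 3rd at 3 positions, the 4th is unrelated.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLLKQK", "GSHMLEDPVA"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(seqs, threshold)))
```

Lowering the threshold merges more sequences into fewer clusters, which is why searching a smaller representative set can speed up similarity searches.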

137
UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (7097874 entries)
UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (17646564 entries)
UniRef: 3 clusters of protein sequences with 100%, 90% and 50% similarity, useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 6,652,983 entries; UniRef90: 4438653 entries; UniRef50: 2104702 entries)
UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (6'028'191 entries)
138
What is "Non-Redundancy" ?

UniParc
One UniParc entry for all entries corresponding to 100% identical sequences (100% identity over the entire length) (from many different databases).
UniRef
One UniRef100 entry for all entries corresponding to 100% identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.
UniProtKB/Swiss-Prot
One Swiss-Prot entry for all the protein products of one gene, including fragments, variations/polymorphisms, splice variants, sequencing errors
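The difference between these definitions can be made concrete by grouping the same records in two ways. A minimal sketch with invented records (database names, accessions, gene name and sequences are all hypothetical): grouping by exact sequence mimics the UniParc notion, grouping by gene and species mimics the Swiss-Prot notion.

```python
from collections import defaultdict

# Invented records: (source database, accession, gene, species, sequence).
records = [
    ("DB_A", "A0001", "GENE1", "Homo sapiens", "MKTAYIAKQR"),
    ("DB_B", "B0001", "GENE1", "Homo sapiens", "MKTAYIAKQR"),  # identical sequence
    ("DB_A", "A0002", "GENE1", "Homo sapiens", "MKTAYIAKQK"),  # variant, same gene
    ("DB_C", "C0001", "GENE1", "Mus musculus", "MKTAYLAKQR"),  # other species
]


def group_by(records, key):
    """Group accessions by the value `key` computes for each record."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec[1])
    return dict(groups)


# UniParc-like: one entry per 100% identical sequence.
by_sequence = group_by(records, key=lambda r: r[4])
# Swiss-Prot-like: one entry per gene and species.
by_gene_species = group_by(records, key=lambda r: (r[2], r[3]))

print(len(by_sequence), "sequence-level entries")
print(len(by_gene_species), "gene-level entries")
```

The four records collapse to three entries at the sequence level but to two at the gene level, because the human variant is merged with the other human records.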

139
Comparing searches: NCBI and UniProt
140
Search for the human Toll-like receptor 4: Entrez Protein (NCBI)
141
Search for the human Toll-like receptor 4 in UniProtKB
Swiss-Prot
142
Sequences retrieved in Entrez Protein: O00206, AAF05316, CAH72618, CAH72619, BAG55035, AAI17423, AAF89753, NP_612564, AAC34135
Based on A126770, BC117422, AL160272 and AA598398
143
Major protein sequence resources
UniProtKB (Swiss-Prot + TrEMBL), resources integrated in the entries: PIR, PDB, PRF
Entrez Protein, resources integrated in the search engine: Swiss-Prot, GenPept, PIR, PDB, PRF, RefSeq
UniProtKB/Swiss-Prot: manually annotated protein sequences (12000 species)
UniProtKB/TrEMBL: submitted CDS (EMBL), automated annotation (202000 species)
GenPept: submitted CDS (GenBank)
PIR: Protein Information Resource; archive since 2003, integrated into UniProtKB
PDB: Protein Data Bank; 3D data and associated sequences
PRF: journal scan of published peptide sequences
RefSeq: Reference Sequence for DNA, RNA, protein; gene prediction, some manual annotation
144
Model Organism Databases (MODs) at a glance
145
Model organism: species extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.
Model organisms and their MODs (just a few examples, not an exhaustive list!):
Mus musculus: MGI http://www.informatics.jax.org/
Rattus norvegicus: RGD http://rgd.mcw.edu/
Oryza sativa: RAP-DB http://rapdb.dna.affrc.go.jp/
Arabidopsis thaliana: TAIR http://www.arabidopsis.org/
Drosophila melanogaster: FlyBase http://flybase.org/
Schizosaccharomyces pombe: S. pombe GeneDB http://www.genedb.org/genedb/pombe/
Saccharomyces cerevisiae: SGD http://www.yeastgenome.org/
Caenorhabditis elegans: WormBase http://www.wormbase.org/
Dictyostelium discoideum: dictyBase http://dictybase.org/
Bacillus subtilis: SubtiList http://genolist.pasteur.fr/SubtiList/
Escherichia coli: EcoGene http://ecogene.org/
Danio rerio (zebrafish): ZFIN http://zfin.org/
Methanocaldococcus jannaschii -> no MOD

146
Model organism databases (MODs)
- Genome annotation
- Gene models
- Gene mapping
- Official nomenclature
- Gene expression
- Functional annotation
- Interactions
- Information about mutants/knockout/transgenic animals
- Phenotypes
- (Cross-)references
- Species-specific reagents
Key resources for information on a given organism
Service provided to/from a given community
153
http://gmod.org/wiki/Main_Page
154
The world of databases is a jungle
155

A few points to remember
when using databases
Content
- Primary / secondary / meta-databases
- Curated / non-curated
- manual / automated curation
- Redundant / non-redundant.
Update frequency
Stable identifiers
Strategy
Dataflow
Collaborations between databases.

156
Test a few genomic databases and tools
157
Genomes and genomic tools: a few sites
NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
EBI: http://www.ebi.ac.uk/genomes/
TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi
Genome annotation and analysis tools:
http://www.ensembl.org/index.html
http://vega.sanger.ac.uk/index.html
http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, ...
http://www.jgi.doe.gov/software/ -> Genome portal, Integrated Microbial Genomes (IMG) and other tools
Generic Model Organism Database: http://gmod.org/wiki/Main_Page
158
Genomes and genomic tools: hands-on
- Find your favorite (completely sequenced) organism in a genome db
- Follow the links to see the options on different sites
- Find the sequences
- Look at the annotation of your favorite gene
- Compare the entries corresponding to this gene across sites
- Test search engines (restrict searches, compare results, ...)
- Whenever possible use on-line tutorials, such as http://www.ensembl.org/info/website/tutorials/index.html
- Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)
- Play around with the BLAT search, customize the display, follow the links, ...
159
Genomes and genomic tools: hands-on
Go and visit the databases cited in this lecture
The databases/tools that should be "familiar" to all are:
- http://genome.ucsc.edu/cgi-bin/hgBlat
- http://www.ensembl.org/index.html
- gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/
If none of the databases are of interest to you, go to the NAR database collection (http://www.oxfordjournals.org/nar/database/a/) and find the databases that are closest to your interests
Play around
Hands-on: protein sequence databases and UniProt
http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)
160
Thank You !
