Title: Nikolaj Blom
1Resources of Biomolecular Data: Sequences, Structures and Functionality
PhD course 27803
Nikolaj Blom
Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark
nikob_at_cbs.dtu.dk
2Outline
- Magnitudes and Scales
- Resources: Data Sources and Tools
- Primary DNA sources
- Sequence Repositories
- Structure Repositories
- Functional Categorization
- Integration of Databases
- The Human Genome
- Genome Browsers
- Prediction Tools
- Evaluation of Prediction Servers
- Starting points
- Link collections
3Learning Objectives
- The student should be able to
- Describe differences between sequence repositories and curated databases
- Describe the challenges of maintaining genome-wide biological databases
- List two entry points for getting an overview of my gene of interest
- Describe how prediction servers may be evaluated
4Resources: Sources and Tools
- There are A LOT OF biomolecular databases/sources
- A LOT OF overlap of information/redundancy
- A LOT OF TOOLS
- Personal picks/preferences
- User-friendliness
- Update intervals
- Curation efforts / error correction
- Linkage to other DBs
5Faster than Moore's law...
6Faster than Moore's law...
7Human Genome Published
- HUGO: Nature, 15.feb.2001
- Celera: Science, 16.feb.2001
8Magnitudes and Scales
- Human genome: 3,200,000,000 bp
- Single basepair → full genome is 9 orders of magnitude
- Genome = football field: 3 billion leaves of grass
- Single base A T G C (or SNP) = 1 leaf of grass
- Genome browsing
- Zooming from whole stadium to single leaf
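The scale argument on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `zoom_steps` and the choice of a tenfold zoom factor are assumptions made here, not part of the lecture.

```python
import math

GENOME_BP = 3_200_000_000  # human genome size quoted on the slide

# Orders of magnitude between a single base pair and the full genome
orders = math.log10(GENOME_BP)
print(f"{orders:.1f} orders of magnitude")


def zoom_steps(total_bp: int, factor: int = 10) -> int:
    """Count how many zoom-ins by `factor` take a window from total_bp down to 1 bp."""
    steps = 0
    window = total_bp
    while window > 1:
        window = math.ceil(window / factor)
        steps += 1
    return steps


print(zoom_steps(GENOME_BP), "tenfold zoom steps from whole genome to single base")
```

The logarithm lands between 9 and 10, which is the "9 orders of magnitude" a genome browser has to span when zooming from the whole stadium to a single leaf.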
9How we got the sequence
- Sanger chain termination method
10Primary DNA sources
- Trace file repositories
- Single read: 500-1000 bp (golf ball size / jigsaw puzzle)
- Variable quality
- WashU-Merck Human EST Project / Trace files
- Base-calling non-trivial
G, C or nothing?
11Assembly is Non-trivial!
12Sequence repositories - GenBank et al.
- GenBank / EMBL / DDBJ
- Highly redundant (many versions of same gene)
- Cross-updated daily
- Version history is recorded
- Previous sequence records can be retrieved
- Contigs/HTGS (100-200 kb) at different stages of finishing
- Draft → Finished
- Includes genomic DNA, cDNA, ESTs, translated peptides
13Non-redundant and Curated databases
- Non-redundant
- Manual or automatic curation
- DNA
- RefSeq (NCBI semi-automated)
- Ensembl gene index (automated)
- Protein
- RefSeq (NCBI semi-automated)
- TrEMBL (EMBL automated)
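To make the contrast between a redundant repository and a non-redundant set concrete, here is a minimal sketch that collapses records with identical sequences. The record IDs and sequences are invented for illustration, and real resources such as RefSeq apply far more than exact-match merging (curation, versioning, evidence review), so this shows only the simplest possible form of redundancy removal.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group record IDs by identical sequence (case-insensitive).

    `records` is an iterable of (record_id, sequence) pairs.
    Returns a dict mapping each distinct sequence to the IDs that carry it.
    """
    groups = defaultdict(list)
    for rec_id, seq in records:
        groups[seq.upper()].append(rec_id)
    return dict(groups)


# Hypothetical entries: two submissions of the same sequence plus one variant
records = [
    ("entry_A", "ATGGCC"),
    ("entry_B", "atggcc"),
    ("entry_C", "ATGGCA"),
]

nonredundant = collapse_identical(records)
for seq, ids in nonredundant.items():
    print(seq, ids)
```

Keeping the list of merged IDs, rather than discarding duplicates, preserves the link back to the original repository entries.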
14Curated database UniProt/SwissProt
- SIB - Swiss Institute of Bioinformatics
- Protein Knowledgebase / Sequence Database
- Highly curated
- Experimental evidence evaluated (e.g. modifications)
- All 80,000 entries checked by Amos Bairoch himself :-)
- ExPASy - Expert Protein Analysis System
- Proteomics tools, links, local servers
15Structure databases / Protein Data Bank (PDB)
- X-ray, NMR biomolecular structures
- Protein Data Bank (PDB)
- http://www.rcsb.org/pdb/
16Structure databases / Protein Data Bank (PDB)
17Functional Categorization
- Gene Ontology (GO)
- Hierarchical
- Controlled vocabulary
18Functional Categorization
- Gene Ontology (GO) http://www.geneontology.org/
- Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase
- Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
- Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex
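The "hierarchical, controlled vocabulary" idea can be sketched as a small parent lookup. The term names and parent links below are a toy hierarchy invented for illustration and are not taken from the actual GO files; the point is only that annotating a gene product with a specific term implies all of its more general ancestors.

```python
# Toy hierarchy (hypothetical, not the real GO graph): child term -> parent terms
PARENTS = {
    "DNA helicase": ["helicase"],
    "helicase": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("DNA helicase", PARENTS)))
```

Because each term may list several parents, the same traversal works when the vocabulary is a graph with multiple inheritance rather than a strict tree.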
19Integration of databases - Webs of web-sites
- Links, links, links...
- SRS: Sequence Retrieval System
- Powerful, complex query language
- BioDAS: Distributed Annotation System
http://srs.ebi.ac.uk/
20For my gene, how do I
- Get an overview of the sequence information known? (GeneCards, OMIM)
- Examine the Genome Neighbourhood? (Genome Browsers)
- Predict protein post-translational modifications (PTMs)? (Prediction servers)
- (Evaluate the value of predicted features)
21GeneCards http://nciarray.nci.nih.gov/cards/
22GeneCards-II
23GeneCards-III
24GeneCards-IV
25GeneCards-V
26Genetic/Medical Information
- OMIM, Online Mendelian Inheritance in Man (NCBI)
- The OMIM database is a catalog of human genes and genetic disorders
- >16,000 entries (April, 2006)
- Examples: cystic fibrosis, prions, amyloid precursor protein
- Condensed, highly curated descriptions of genetics/disease/animal models/references
27OMIM-I (http://www3.ncbi.nlm.nih.gov/Omim/)
28OMIM-II
29OMIM-III
30For my gene, how do I
- Get an overview of the sequence information known? (GeneCards, OMIM)
- Examine the Genome Neighbourhood? (Genome Browsers)
- Predict protein post-translational modifications (PTMs)? (Prediction servers)
- (Evaluate the value of predicted features)
31Genome Browsing
- Three public
- Open access
- Use same genome build/assembly
- NCBI (U.S.)
- UCSC (Santa Cruz, U.S.)
- EnsEmbl (EBI, EU)
- (One private)
- (Restricted, commercial; closed 2005)
32Celera Discovery System Database
33Genome Browsers - Portals to the Genomic World
- UCSC: Univ. California Santa Cruz (U.S.)
- http://genome.ucsc.edu/
- NCBI: National Center for Biotechnology Information (U.S.)
- http://www.ncbi.nlm.nih.gov/Genomes/index.html
- EnsEmbl: European Molecular Biology Laboratory (E.U.)
- http://www.ensembl.org/
34UCSC Genome Browser
35UCSC Genome Browser II
36NCBI
37NCBI
39EnsEmbl Genome Browser
40EnsEmbl Genome Browser
41EnsEmbl Genome Browser
42EnsEmbl Genome Browser
43EnsEmbl Genome Browser
44EnsEmbl Genome Browser
45For my gene, how do I
- Get an overview of the sequence information known? (GeneCards)
- Examine the Genome Neighbourhood? (Genome Browsers)
- Predict protein post-translational modifications (PTMs) or Gene Structure? (Prediction servers)
- ...and evaluate the reliability of prediction methods
46CBS Services/Toolbox http://www.cbs.dtu.dk/services/
49NetPhos: a prediction server
http://www.cbs.dtu.dk/services/NetPhos/
50NetPhos: a prediction server
51Evaluating Prediction Servers
- Performance on independent/cross-validated data presented?
- Published in peer-reviewed journal?
- Cited by others?
- Science Citation Index
- Linked to from credible web sites?
- Google Page-rank
- link:URL search
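When a server does report performance on independent data, the numbers are usually derived from a confusion matrix. The slide does not name specific measures, so the choice of sensitivity, specificity and the Matthews correlation coefficient below is an assumption, and the prediction/annotation lists are made-up illustration data rather than output from any real server.

```python
import math


def confusion_counts(predicted, actual):
    """Return (TP, FP, TN, FN) for paired boolean predictions and annotations."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and not a:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true sites that were predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-sites that were correctly left unpredicted."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical held-out test set: server prediction vs. experimental annotation
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]

counts = confusion_counts(predicted, actual)
tp, fp, tn, fn = counts
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("MCC:", mcc(tp, fp, tn, fn))
```

These figures are only meaningful if the test examples were not used to train the method, which is why the first question on the slide asks about independent or cross-validated data.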
52Evaluating Prediction Servers
532can Bioinformatics Education
- At EBI, European Bioinformatics Institute
- http://www.ebi.ac.uk/2can/index.html
- Tutorials, resource links, etc.
54EnsEMBL Bioinformatics Education
55Starting Points
- General Bioinformatics
- NCBI, National Center for Biotechnology Information, U.S.
- EBI, European Bioinformatics Institute
- Prediction Tools
- CBS, DK
- Expasy (Protein analysis), Switzerland
56Dynamic Resources
- Pros
- Includes most recent developments
- Updated regularly
- User interface improves (usually)
- Cons
- Difficult to keep pace
- Tutorials and lectures hard to recycle :-(
- Difficult to use at irregular intervals
57Genome Browsers - Portals to the Genomic World
- Three main entry points
- NCBI, UCSC, EnsEmbl
- Essentially contain the same information
- High degree of linking to secondary databases
- Advisable to become familiar with only one genome browser
- Learn to navigate and make queries
- GeneCards and OMIM
- Well suited for getting a quick overview of a gene of interest
58Prediction Servers
- Evaluate scientific soundness
- Look for indications of quality (citations, etc.)
- Remember that prediction servers provide... well, predictions!
59Learning Objectives
- The student should be able to
- Describe differences between sequence repositories and curated databases
- Describe the challenges of maintaining genome-wide biological databases
- List two entry points for getting an overview of my gene of interest
- Describe how prediction servers may be evaluated
60Immediate Feedback
Title: Resources of Biomolecular Data: Sequences, Structures and Functionality
- Did the lecture live up to your expectations?
- Did you expect to learn about resources that were not covered during this lecture?
- NB! You can also provide input at the general course evaluation
61The End
25,000?