Title: The Ensembl Database
1. The Ensembl Database
Erin Pleasance, Steven Jones
Canada's Michael Smith Genome Sciences Centre, Vancouver
2. www.ensembl.org
3. What is Ensembl?
- Public annotation of mammalian and other genomes
- Open source software
- Relational database system
- The future of genomic bioinformatics?
4. The Ensembl Project
Ensembl is a joint project between the EMBL European
Bioinformatics Institute and the Sanger
Institute to develop a software system that
produces and maintains automatic annotation of
eukaryotic genomes. Ensembl is primarily funded
by the Wellcome Trust.
5. The Ensembl Project
"The main aim of this campaign is to encourage
scientists across the world - in academia,
pharmaceutical companies, and the biotechnology
and computer industries - to use this free
information."
- Dr. Mike Dexter, Director of the Wellcome Trust
6. Goal: An Accessible, Annotated Genome
[Figure: ContigView display, shown as the intended end result]
7. Ensembl Software System
- Makes extensive use of BioPerl (www.bioperl.org)
- Uses the free MySQL database
- Entire Ensembl code base is freely available
under an Apache open source license
- Mainly written in Perl, with extensions in C. Some
viewers have been written in Java (e.g. Apollo).
8. Ensembl Genome Annotation
- Utilizes raw DNA sequence data from public
sources
- Creates a tracking database (the Ensembl
database)
- Joins the sequences, based on a sequence
scaffold or Golden Path
- Automatically finds genes and other features of
the sequence
- Associates sequence and features with data from
other sources
- Provides a publicly accessible web-based
interface to the database
9. The Genome Problem
- "The problem with the genome (particularly human)
is that it is large, complicated, and opaque to
analysis" (Ewan Birney, Ensembl)
- Genome features to identify include:
  - Genes: protein coding, RNA, pseudogenes
  - Regulatory elements
  - SNPs, repeats, etc.
10. DNA sequence in Ensembl
- Sequences are determined in fragments (contigs)
- Features cross boundaries between fragments
- Entire sequence too large and changes too much
(constantly updated and reassembled) to be stored
as one long database entry
11. DNA sequence in Ensembl
- Core design feature is the virtual contig (VC)
object
- Allows genome sequence to be accessed as a single
large contiguous sequence even though it is
stored as a collection of fragments
- VC object handles reading and writing features to
the DNA sequence
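The virtual contig idea can be illustrated with a minimal Python sketch. This is not the Ensembl implementation (which is written in Perl); the class names `Fragment` and `VirtualContig`, the 0-based half-open coordinates, and the assumption that fragments are non-overlapping and gap-free are all illustrative choices made here.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Fragment:
    """One stored contig placed on the assembled sequence (hypothetical layout)."""
    start: int      # 0-based offset of this fragment in the virtual sequence
    sequence: str   # bases stored for this fragment


class VirtualContig:
    """Presents non-overlapping, gap-free fragments as one contiguous sequence."""

    def __init__(self, fragments):
        self.fragments = sorted(fragments, key=lambda f: f.start)
        self.starts = [f.start for f in self.fragments]

    def __len__(self):
        last = self.fragments[-1]
        return last.start + len(last.sequence)

    def subseq(self, start, end):
        """Return bases in the half-open virtual range [start, end)."""
        if not 0 <= start <= end <= len(self):
            raise ValueError("range outside the virtual contig")
        pieces = []
        # Locate the fragment containing `start`, then walk to the right.
        i = max(bisect_right(self.starts, start) - 1, 0)
        while i < len(self.fragments) and self.fragments[i].start < end:
            frag = self.fragments[i]
            lo = max(start - frag.start, 0)
            hi = min(end - frag.start, len(frag.sequence))
            pieces.append(frag.sequence[lo:hi])
            i += 1
        return "".join(pieces)


vc = VirtualContig([Fragment(0, "ACGTAC"), Fragment(6, "GGTTAA"), Fragment(12, "CCAT")])
print(vc.subseq(4, 14))  # spans all three fragments: ACGGTTAACC
```

The caller asks for a range in virtual coordinates and never needs to know where the fragment boundaries fall, which is the property the slide describes.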
12. Ensembl Gene Build System
- Three-part gene build system
  - Best-in-genome matches for known genes
  - Alignment of homologous genes
  - Ab initio gene finding
- Genes predicted on repeat-masked DNA
- All genes predicted based on experimental
(available sequence) evidence
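As a small illustration of what "repeat-masked DNA" means, the sketch below replaces annotated repeat intervals with N so that downstream gene prediction ignores them. The function name `mask_repeats` and the 0-based half-open interval convention are assumptions for this example; in practice the repeat coordinates would come from a dedicated repeat-finding program.

```python
def mask_repeats(sequence, repeats, mask_char="N"):
    """Return `sequence` with every (start, end) repeat interval hard-masked.

    Intervals are 0-based and half-open; parts falling outside the
    sequence are ignored.
    """
    bases = list(sequence)
    for start, end in repeats:
        start = max(start, 0)
        end = min(end, len(bases))
        if start < end:
            bases[start:end] = mask_char * (end - start)
    return "".join(bases)


print(mask_repeats("ACGTACGTAC", [(2, 5), (7, 9)]))  # ACNNNCGNNC
```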
13. Best in genome predictions
- Find known proteins from SPTREMBL on genome using
pmatch
- Incorporate cDNAs using exonerate and EST_genome
- Align with gaps placed preferentially at splice
consensus sites
- Allows prediction of 5' and 3' UTRs
- Refine predictions using genewise
14. Best in genome predictions
- Alignments shown in ContigView
[Figure: ContigView of best in genome gene with associated
evidence. Labels: known gene (p53), UTRs predicted, proteins
aligned, UniGene clusters aligned, cDNAs aligned]
15. Homology predictions
- Align homologous proteins using BLAST, genewise
  - Paralogs (from same organism)
  - Orthologs (from closely related organisms)
- Assemble novel genes
16. Ab initio gene predictions
- Use Genscan to identify novel exons
- Confirm exons by BLAST to known proteins, mRNAs,
UniGene clusters
- Based on ab initio predictions but require
homology evidence
[Figure: ContigView of homology gene with associated
evidence. Labels: UniGene clusters aligned, proteins aligned,
novel gene, Genscan predictions]
17. Pseudogenes
- Many pseudogenes also predicted
18. Ensembl Gene Build System
- Resulting Ensembl genes are highly accurate
with low false positive rates
- Ensembl human gene identifiers are 95% stable
between builds
[Figure: snapshot or stats on genes]
19. Ensembl EST genes
- ESTs are not accurate enough to produce Ensembl
genes, but are important, especially for identifying
alternative transcripts
- Create an independent set of EST genes
[Figure labels: known gene, EST genes, UniGene clusters aligned]
20. Ensembl EST genes
- Map ESTs to genome using Exonerate, BLAST, and
EST2Genome
- Define transcripts by merging redundant ends,
setting splice sites to common ends
- Finds splice sites and defines UTRs
- Alternative transcript predicted if at least one
alternatively spliced EST exists
- Process transcripts with Genomewise to find
longest ORF for each
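The last step, finding the longest ORF in each transcript, can be sketched in a few lines of Python. This is a plain three-frame scan on the forward strand only, not Genomewise; the function name `longest_orf` and the coordinate convention (0-based, half-open, stop codon included) are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return (start, end, orf) for the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and half-open, and the stop codon is included.
    Returns None when no complete ORF exists.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        orf_start = None  # position of the first ATG since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if orf_start is None and codon == "ATG":
                orf_start = pos
            elif orf_start is not None and codon in STOP_CODONS:
                end = pos + 3
                if best is None or end - orf_start > best[1] - best[0]:
                    best = (orf_start, end, seq[orf_start:end])
                orf_start = None
    return best


print(longest_orf("CCATGAAATGGGCTAACATGTGA"))
# (2, 23, 'ATGAAATGGGCTAACATGTGA')
```

Whatever lies outside the chosen ORF on the transcript is then treated as 5' and 3' UTR.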
21. Ensembl EST genes
- Evidence for genes shown (ExonView)
22. Manual gene annotation: Otter
- Manual annotation done with applications, e.g.
Apollo
- Otter database/server allows manual annotations
to be integrated with automated annotations
23. Manually curated genes: VEGA
- Chromosomes 6, 7, 13, 14, 20 and 22 contain manually
curated genes from the VEGA database
24. Gene information in Ensembl: GeneView
25. Transcript information in Ensembl: TransView
26. Protein information in Ensembl: ProteinView
27. Comparative genomics in Ensembl
- Gene orthologue pairs
  - Human <-> Mouse <-> Rat <-> Fugu <-> Zebrafish
  - C. elegans <-> C. briggsae
  - Fly <-> Mosquito
- DNA homology
  - Human <-> Mouse <-> Rat
28. Comparative genomics in Ensembl: Gene orthologs
- Gene ortholog pairs shown in GeneView
- Calculated by BLAST (reciprocal best BLAST hits,
or BLAST plus synteny)
- dN/dS: ratio of nonsynonymous to synonymous change (measure
of selection)
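The reciprocal best hit criterion can be sketched as follows. The input format, (query, subject, bit_score) tuples, the function names, and the gene labels and scores in the example are all invented for illustration; this is not the Ensembl pipeline code.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a and b are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up example hits between two species.
human_vs_mouse = [
    ("humanA", "mouseA", 700), ("humanA", "mouseB", 300),
    ("humanB", "mouseB", 900), ("humanC", "mouseB", 400),
]
mouse_vs_human = [
    ("mouseA", "humanA", 690),
    ("mouseB", "humanB", 880), ("mouseB", "humanC", 390),
]
print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
# [('humanA', 'mouseA'), ('humanB', 'mouseB')]
```

Here humanC's best hit is mouseB, but mouseB's best hit is humanB, so humanC is not reported as an ortholog.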
29. Comparative genomics in Ensembl: DNA homology
- DNA homology shown in ContigView
[Figure label: mouse and rat homology]
30. Comparative genomics in Ensembl: Synteny
- Large-scale homology shown in SyntenyView
- Synteny: homologous sequence blocks, in same
order and orientation
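A simplified way to picture "same order and orientation" is to chain ortholog pairs into collinear runs. The sketch below assumes each pair is given as (rank in genome A, rank in genome B, relative orientation), all on a single chromosome pair, and requires strictly adjacent ranks; real synteny detection tolerates gaps and is considerably more involved. The function name `synteny_blocks` and the `min_genes` parameter are hypothetical.

```python
def synteny_blocks(pairs, min_genes=2):
    """Group ortholog pairs into collinear blocks.

    `pairs` holds (index_a, index_b, orientation) tuples, where the indices are
    gene ranks along one chromosome in each genome and orientation is +1 or -1.
    """
    blocks, current = [], []
    for pair in sorted(pairs):
        if current:
            prev = current[-1]
            adjacent = pair[0] - prev[0] == 1
            # Forward blocks step +1 in genome B, inverted blocks step -1.
            collinear = pair[2] == prev[2] and pair[1] - prev[1] == pair[2]
            if not (adjacent and collinear):
                if len(current) >= min_genes:
                    blocks.append(current)
                current = []
        current.append(pair)
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


pairs = [(0, 10, 1), (1, 11, 1), (2, 12, 1), (3, 40, -1), (4, 39, -1), (5, 20, 1)]
for block in synteny_blocks(pairs):
    print(block)
# [(0, 10, 1), (1, 11, 1), (2, 12, 1)]
# [(3, 40, -1), (4, 39, -1)]
```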
31. Other features in Ensembl
- Menus provide other feature options
- Features, e.g. SNPs and markers, have special views
32. Other data sources in Ensembl
- Ensembl incorporates gene and feature info from
many other data sources
[Figure labels: OMIM, SwissProt]
33. Other data sources in Ensembl: Link out
34. The Distributed Annotation System
- Allows viewing third-party annotation of the
genomic scaffold
- Users can choose the annotation they are
interested in
- Features are viewed in a consistent user
interface/display
- Allows specialized feature annotation and the
comparison of different methodologies
35. DAS: Selecting data
36. GeneDAS
- GeneDAS allows exchange of annotations at the gene
level
- e.g. access to SwissProt annotations from GeneView
37. DAS: Add your own annotations
- Anyone can add data and upload it to a DAS server
for others to view
38. Sequence similarity searching
- Two search methods
  - SSAHA: very fast, good for identifying near-exact
  DNA-DNA matches
  - BLAST: slower but more accurate, can do DNA or
  protein searches
- Can search against any species
- Can search against genomic sequence, cDNAs
(Ensembl or Genscan), or protein sequences
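The reason a hash-based method is fast for near-exact DNA matches can be shown with a toy k-mer index. This is a sketch in the spirit of hash-based search, not the SSAHA algorithm itself; the function names, the choice to index every overlapping k-mer, and the example sequences are assumptions made here.

```python
from collections import Counter, defaultdict


def build_index(subject, k):
    """Map every k-mer in `subject` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def best_diagonal(query, index, k):
    """Return (offset, hits) for the subject offset supported by the most k-mers."""
    votes = Counter()
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            votes[i - j] += 1  # k-mers from one exact match share an offset
    return votes.most_common(1)[0] if votes else None


index = build_index("TTGACCGATAGGCTAACGT", 4)
print(best_diagonal("CGATAGGC", index, 4))  # (5, 5): query matches at offset 5
```

The index is built once per genome, so each query costs only a handful of dictionary lookups. A mismatch destroys every k-mer that overlaps it, which is why this style of search suits near-exact matches and more divergent sequences are left to BLAST.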
40. Hits relative to genome
Show alignment (A), sequence (S), or ContigView (C)
41. BLAST results
42. Data Mining with EnsMart
- EnsMart organizes data from Ensembl into a
query-optimized database
- Allows very fast, cross-data source querying
- Accessible from:
  - Ensembl website (MartView)
  - Stand-alone application (MartExplorer)
  - Command-line interface (MartShell)
- Extremely powerful for data mining
43. Data mining with EnsMart
44.
- Choose focus: gene set or SNPs
- Choose organism (any species in Ensembl)
45.
- Filter genes based on info about:
  - Region
  - Genes
  - Diseases
  - Expression patterns
  - Multi-species comparisons
  - Protein domains and families
  - SNPs
46.
- Choose output type
  - Features (genes with associated info)
  - SNPs
  - Structures (of genes, e.g. exons)
  - Sequences
- Choose what information to output
47. Multiple programming interfaces now exist for
Ensembl
48. Another example of how to utilize the Ensembl
database: Sockeye
www.bcgsc.ca/bioinfo/software
49. Apollo Java viewer
www.ensembl.org/apollo
50. Ensembl updates
- Monthly
- Include:
  - Changes in genome builds (with new annotations)
  - Changes in code or database schema
  - Additional views and tools on website
51. Pre-Ensembl
- Full annotation can take weeks
- Pre-Ensembl site provides in-progress annotation
  - Placement of known proteins
  - Ab initio gene predictions
  - Repeat masking
  - BLAST and SSAHA searching
52. Ensembl Software System
- Software can be accessed by FTP
- Can also be accessed through CVS (Concurrent
Versions System)
- Possible to set up a mirror of the entire Ensembl
system
53. Further Information
- The Ensembl Project: www.ensembl.org
- VEGA: vega.sanger.ac.uk
- EnsMart: www.ensembl.org/EnsMart/
- Distributed Annotation System: www.biodas.org
- Human Genome Central Resources:
www.ensembl.org/genome/central
- References
  - Ensembl
    - Hubbard et al., 2002. NAR 30 (1), 38-41.
    - Clamp et al., 2003. NAR 31 (1), 38-42.
    - Birney et al., 2004. NAR 32, D468-D470.
  - EnsMart: Kasprzyk et al., 2004. Genome Res. 14,
  160-169.