Title: Ensembl Database and Web Browser (www.ensembl.org)
1 Ensembl Database and Web Browser, www.ensembl.org
Stephen Baird
Apoptosis Research Centre, Children's Hospital of Eastern Ontario
sbaird_at_arc.cheo.ca
2
- Focus on vertebrates
- No fungi/plants
- Brassica/Arabidopsis genome browser is at http://ensembl.warwick.ac.uk/
3 What is Ensembl?
- Joint project of EBI and Sanger
- Automated annotation of eukaryotic genomes
- Open source software
- Relational database system
- Web interface
The main aim of this campaign is to encourage
scientists across the world - in academia,
pharmaceutical companies, and the biotechnology
and computer industries - to use this free
information.
- Dr. Mike Dexter, Director of the Wellcome Trust
4 Ensembl components
Search tools
Data
Chromosomes (FeatureView, KaryoView, CytoView, MapView)
SNPs and Haplotypes (SNPView, GeneSNPView, HaploView, LDView)
Sequence Similarity (BLAST, SSAHA)
Diseases (DiseaseView)
Genome Sequence (ContigView)
Genes (GeneView, TransView, ExonView, GeneSeqView)
Markers (MarkerView)
Functions (GOView)
Text (TextView)
Other Annotations
Protein (ProtView, DomainView, FamilyView)
Anything (BioMart/MartView)
Comparative Genomics (ContigView,
MultiContigView, SyntenyView, GeneView)
5 Ensembl Gene Annotation
- Basis for initial analysis and publication of most vertebrate genomes
- Genome assembly from NCBI
- Gene build system
- Targeted gene builds predict known genes
- Similarity gene builds predict novel genes
6 Curwen et al., Genome Res 14:942-950, 2004
7 Targeted gene build
- Align known proteins with pmatch and BLAST
- Incorporate aligned cDNA sequences to find splice
sites, UTRs with genewise
ContigView of best-in-genome gene with associated evidence
UTRs predicted
Known gene (p53)
Proteins aligned
Unigene clusters aligned
cDNAs aligned
8 Similarity gene build
- Identify novel exons ab initio using Genscan
- Confirm exons by BLAST to known proteins, mRNAs,
UniGene clusters
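The confirmation step can be pictured as an interval-overlap check: an ab initio exon is kept only if some aligned evidence covers enough of it. The following is a minimal Python sketch of that idea, not the Ensembl pipeline itself. The function names, the `min_fraction` threshold and all coordinates are illustrative assumptions.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two intervals (1-based, inclusive ends)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def confirmed_exons(predicted, evidence, min_fraction=0.5):
    """Keep predicted exons supported by at least one evidence alignment.

    predicted, evidence: lists of (start, end) tuples on the same sequence.
    min_fraction: hypothetical threshold, the fraction of the exon that a
    single alignment must cover for the exon to count as confirmed.
    """
    kept = []
    for start, end in predicted:
        length = end - start + 1
        if any(overlap(start, end, e_start, e_end) / length >= min_fraction
               for e_start, e_end in evidence):
            kept.append((start, end))
    return kept


# Made-up coordinates for illustration only.
genscan_exons = [(100, 200), (500, 650), (900, 950)]
blast_hits = [(120, 210), (700, 800), (890, 960)]
print(confirmed_exons(genscan_exons, blast_hits))
```

With these toy values the middle exon has no supporting alignment and is dropped, while the other two are retained.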
9 Ensembl Gene Annotation
- Resulting Ensembl genes are highly accurate with low false positive rates
- Ensembl human gene identifiers are 95% stable between builds
- Ensembl and RefSeq differ in 8-12% of the genes
- The Consensus CDS (CCDS) project is a collaborative effort between Ensembl/EBI, UCSC and NCBI to identify a core set of human protein coding regions that are consistently annotated and of high quality (13,000 genes).
10 Manually curated genes: VEGA
- Some chromosomes contain manually curated genes from the VEGA database
- The Otter manual annotation system allows integration of automatic and manual annotations (e.g. from Apollo) into Ensembl by the Human and Vertebrate Analysis and Annotation (HAVANA) group annotators at the Sanger Institute
VEGA gene
11 Ensembl EST genes
- ESTs are not accurate enough to produce Ensembl genes, but are important for identifying alternative transcripts
- ESTs are aligned to the genome and merged to create an independent set of EST genes
Known gene
EST genes
Unigene clusters aligned
12 Pseudogenes
- Processed pseudogenes in annotation identified
(lack of introns, frameshifts, presence of
multi-exon version elsewhere in genome, etc.)
Pseudogene
13 Noncoding RNA Genes
- Functional genes with no ORFs (tRNAs, rRNAs, miRNAs)
- 7220 annotations from Sean Eddy and Tom Jones
miRNAs
Coding gene
14 Example 1: Exploring Caspase-3
- Aim: to demonstrate basic browsing and views
- Caspase-3 is a gene involved in apoptosis (cell suicide)
- We will look at:
- Gene annotation
- SNPs
- Orthologs and genome alignments
- Alternative transcripts and EST genes
- Protein Structure
15 Text Search
Species-specific homepage
caspase-3
Gene
16 GeneView
GeneSpliceView
GeneRegulationView
ContigView
GeneSNPView
ExonView
TransView of transcript
ProteinView
ExportView
Orthologs predicted by sequence similarity and
synteny
17 GeneView
DAS - Distributed Annotation System - external
annotation of splicing, transcripts, array
expression, pubmed links, associated phenotypes,
Protonet, Reactome, UniProt.
Information for each Transcript - similarity
matches, links to RefSeq, OMIM, PDB, Array
probes, GO, InterPro, Protein FamilyView,
transcript structure, protein properties.
18 GeneView
GeneSNPView
19 GeneSNPView
20 Other SNP/Haplotype tools
- SNPView: info on a single SNP
- ProteinView: protein sequence with SNP markup
- LDView: view linkage disequilibrium (only limited regions)
- HaploView: view haplotypes (only limited regions)
21 GeneView
Click Back to
22 ContigView
Chromosome and bands
Sequence contigs
To Detailed view
23 ContigView: Detailed View
See other tracks, options in menus
Genscan predictions
Targeted gene predictions (2 alternative transcripts)
Gene annotations
EST genes
Other tracks: aligned sequences etc.
Base View Region
24 ContigView: Features menu
Export image (ps, pdf, svg) or fasta file
Click on close menu
25 MultiContigView
Conserved regions
Rat ortholog
26 Other Comparative Genomics Tools
- Up to 6 genome alignments with MLAGAN in AlignSliceView
- Other view is SyntenyView
- Also access comparative genomics through EnsMart
27 DAS: Distributed Annotation System
28 Data Mining with BioMart
- Allows very fast, cross-data source querying
- Search for genes (features, sequences, etc.) or SNPs based on position, function, domains, similarity, expression, etc.
- Accessible from the Ensembl website (MartView) as well as stand-alone
- Extremely powerful for data mining
29 Example 2: BioMart
- A new disease locus has been mapped between
markers D21S1991 and D21S171. It may be that the
gene involved has already been identified as
having a role in another disease. What candidates
are in this region?
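Besides the MartView web forms, a question like this can in principle be written down as a BioMart XML query. The Python sketch below only builds and prints such a query; it does not contact any server. The dataset name, the filter names (`chromosome_name`, `marker_start`, `marker_end`) and the attribute names are assumptions for illustration and should be checked against the names listed in the BioMart configuration being used.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Return a BioMart-style XML query as a string.

    dataset: dataset name; filters: dict of filter name -> value;
    attributes: list of output column names.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


query_xml = build_query(
    dataset="hsapiens_gene_ensembl",      # assumed dataset name
    filters={
        "chromosome_name": "21",          # assumed filter names
        "marker_start": "D21S1991",
        "marker_end": "D21S171",
    },
    attributes=["ensembl_gene_id", "external_gene_name"],  # assumed
)
print(query_xml)
```

Adding a disease-related attribute to the output list would then show which genes in the interval already have a known disease association.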
30 BioMart: Choosing your dataset
31 BioMart: Filtering
21
32 BioMart: Output
Note: you can output different types of information
33 BioMart: Output
34 Sequence Similarity Searching
- Use SSAHA for exact matches (fast)
- Use BLAST for more distant similarity (slow)
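The speed difference comes from how exact-match search can be organised: the target sequence is indexed once by short words (k-mers), so a query is located by dictionary lookup instead of by scanning and scoring alignments. The Python sketch below is a deliberately simplified illustration of that hashing idea on a toy sequence. It is not the actual SSAHA algorithm, and the function names and the choice of `k` are assumptions.

```python
from collections import defaultdict


def build_index(sequence, k):
    """Map every k-mer in `sequence` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_exact(query, sequence, index, k):
    """Return 0-based start positions where `query` matches `sequence` exactly.

    The first k-mer of the query is looked up in the index, then each
    candidate position is verified by direct string comparison.
    """
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    hits = []
    for pos in index.get(query[:k], []):
        if sequence[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits


genome = "ACGTACGTTAGCACGTTAGG"   # toy sequence, not real data
k = 4
index = build_index(genome, k)
print(find_exact("ACGTTAG", genome, index, k))
```

A query with even one mismatch in its first k-mer finds nothing here, which is why a tool like BLAST remains the choice for more distant similarity.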
35 Looking for Help?
36 DAS: Getting your Own Data in Ensembl
- DAS (Distributed Annotation System)
- Anyone can load data into Ensembl and allow others to view it in the same view (e.g. ContigView) as other Ensembl annotations
- Click on Manage sources in the DAS dropdown menu
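Behind the menu, a DAS client fetches annotations from each source with plain HTTP requests for a region of a reference sequence. The Python sketch below builds such a "features" request URL in the DAS 1.x style (`/das/<source>/features?segment=<ref>:<start>,<stop>`). The server address, source name and coordinates are hypothetical placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode


def das_features_url(server, source, reference, start, stop):
    """Build a DAS 'features' request URL for one segment."""
    segment = f"{reference}:{start},{stop}"
    # safe=":," keeps the segment readable instead of percent-encoding it
    query = urlencode({"segment": segment}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Hypothetical server and source name, for illustration only.
url = das_features_url("http://das.example.org", "my_annotations",
                       "21", 1000000, 1100000)
print(url)
```

A source that answers requests of this form for the same coordinate system can then be displayed alongside the built-in tracks.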
37 Other Ways to Access Ensembl
- MySQL database directly accessible
- APIs for Perl and Java
- Other software
- Apollo: Java genome annotation viewer/editor
- Sockeye: Java viewer
- You can get your own local version of Ensembl; software and data are freely available
- http://www.ensembl.org/Docs/
Sockeye
38 Exercises
- Ex 1. Homologues of human genes are often present in Fugu rubripes in more condensed form (with shorter introns). Is this true for the gene PTEN, a tumor suppressor often mutated in advanced cancers?
- Try MultiContigView. Can you think of another way to get this information as well?
- Ex 2. The microRNA bantam regulates the Drosophila (fruitfly) gene hid by binding the 3' UTR. Hid is involved in apoptosis, and it is possible that binding sites for bantam could be found in the 3' UTR of other apoptosis genes as well. Obtain the 3' UTR sequence of all Drosophila genes known to be involved in apoptosis.
- Use BioMart; the GO term for apoptosis is GO:0006915, evidence code TAS
- Ex 3. The file PCR_product.txt on the webserver contains the sequence of a PCR product amplified from a mouse cDNA library. What gene does the product correspond to? Does it contain the complete coding sequence of that gene?
- Would it be better to use BLAST or SSAHA?