The Ensembl Database - PowerPoint PPT Presentation

1
The Ensembl Database
Erin Pleasance, Steven Jones
Canada's Michael Smith Genome Sciences Centre, Vancouver
2
www.ensembl.org
3
What is Ensembl?
  • Public annotation of mammalian and other genomes
  • Open source software
  • Relational database system
  • The future of genomic bioinformatics?

4
The Ensembl Project
Ensembl is a joint project between the EMBL
European Bioinformatics Institute and the Sanger
Institute to develop a software system which
produces and maintains automatic annotation on
eukaryotic genomes. Ensembl is primarily funded
by the Wellcome Trust.
5
The Ensembl Project
"The main aim of this campaign is to encourage
scientists across the world - in academia,
pharmaceutical companies, and the biotechnology
and computer industries - to use this free
information."
- Dr. Mike Dexter, Director of the Wellcome Trust
6
Goal: An Accessible, Annotated Genome
[Figure: ContigView display, illustrating the end goal]
7
Ensembl Software System
  • Makes extensive use of BioPerl (www.bioperl.org)
  • Built on the free MySQL database
  • Entire Ensembl code base is freely available
    under an Apache open source license
  • Mainly written in Perl, with extensions in C. Some
    viewers have been written in Java (e.g. Apollo).

8
Ensembl Genome Annotation
  • Utilizes raw DNA sequence data from public
    sources
  • Creates a tracking database (The Ensembl
    database)
  • Joins the sequences based on a sequence
    scaffold, or "Golden Path"
  • Automatically finds genes and other features of
    the sequence
  • Associates sequence and features with data from
    other sources
  • Provides a publicly accessible web based
    interface to the database

9
The Genome Problem
  • "The problem with the genome (particularly human)
    is that it is large, complicated, and opaque to
    analysis" (Ewan Birney, Ensembl)
  • Genome features to identify include:
    • Genes: protein coding, RNA, pseudogenes
    • Regulatory elements
    • SNPs, repeats, etc.

10
DNA sequence in Ensembl
  • Sequences are determined in fragments (contigs)
  • Features cross boundaries between fragments
  • Entire sequence too large and changes too much
    (constantly updated and reassembled) to be stored
    as one long database entry

11
DNA sequence in Ensembl
  • The core design feature is the virtual contig
    (VC) object
  • Allows genome sequence to be accessed as a single
    large contiguous sequence even though it is
    stored as a collection of fragments
  • The VC object handles reading and writing features
    to the DNA sequence
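
The virtual contig idea can be illustrated with a small sketch. This is not the Ensembl implementation (which is written in Perl); the class names, fields, and contig identifiers below are hypothetical, and the fragments are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One stored contig and where it sits on the assembled sequence."""
    name: str       # hypothetical contig identifier
    start: int      # 1-based start position on the assembly
    sequence: str   # DNA stored for this fragment


class VirtualContig:
    """Presents a set of non-overlapping fragments as one sequence."""

    def __init__(self, fragments):
        self.fragments = sorted(fragments, key=lambda f: f.start)

    def subseq(self, start, end):
        """Return assembly bases start..end (1-based, inclusive).

        Positions not covered by any fragment are returned as 'N'.
        """
        bases = ["N"] * (end - start + 1)
        for frag in self.fragments:
            frag_end = frag.start + len(frag.sequence) - 1
            lo = max(start, frag.start)
            hi = min(end, frag_end)
            for pos in range(lo, hi + 1):
                bases[pos - start] = frag.sequence[pos - frag.start]
        return "".join(bases)

    def to_fragment(self, pos):
        """Map an assembly position to (fragment name, local position)."""
        for frag in self.fragments:
            if frag.start <= pos < frag.start + len(frag.sequence):
                return frag.name, pos - frag.start + 1
        return None


vc = VirtualContig([
    Fragment("contig_A", 1, "ACGTAC"),
    Fragment("contig_B", 9, "GGCCTT"),  # leaves a gap at positions 7-8
])
print(vc.subseq(5, 10))    # spans both fragments and the gap
print(vc.to_fragment(10))  # assembly coordinate -> fragment coordinate
```

The caller only ever works in assembly coordinates; the mapping back to the stored fragment is what would let a feature placed on the long sequence be written to the fragment that holds it.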

12
Ensembl Gene Build System
  • Three-part gene build system:
    • Best in genome matches for known genes
    • Alignment of homologous genes
    • Ab initio gene finding
  • Genes predicted on repeat-masked DNA
  • All genes predicted based on experimental
    (available sequence) evidence

13
Best in genome predictions
  • Find known proteins from SPTREMBL on genome using
    pmatch
  • Incorporate cDNAs using exonerate and EST_genome
  • Align with gaps placed preferentially at splice
    consensus sites
  • Allows prediction of 5' and 3' UTRs
  • Refine predictions using genewise

14
Best in genome predictions
  • Alignments shown in ContigView

[Figure: ContigView of a best in genome gene (known
gene p53) with associated evidence: predicted UTRs,
aligned proteins, aligned UniGene clusters, aligned
cDNAs]
15
Homology predictions
  • Align homologous proteins using BLAST, genewise
    • Paralogs (from same organism)
    • Orthologs (from closely related organisms)
  • Assemble novel genes

16
Ab initio gene predictions
  • Use Genscan to identify novel exons
  • Confirm exons by BLAST to known proteins, mRNAs,
    UniGene clusters
  • Resulting genes are based on ab initio predictions
    but require homology evidence
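
The confirmation step can be sketched as an interval-overlap filter. This is an assumption-laden illustration, not the Ensembl pipeline: exons and evidence hits are reduced to hypothetical (start, end) coordinate pairs, and "confirmed" is taken to mean "overlaps at least one evidence hit".

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirm_exons(predicted, evidence):
    """Keep only predicted exons that overlap some evidence alignment."""
    return [exon for exon in predicted
            if any(overlaps(exon, hit) for hit in evidence)]


# Hypothetical ab initio exon predictions and evidence alignments.
predicted_exons = [(100, 200), (300, 400), (500, 600)]
evidence_hits = [(150, 180), (590, 700)]

confirmed = confirm_exons(predicted_exons, evidence_hits)
print(confirmed)
```

A real pipeline would also weigh alignment scores and strand; the sketch only shows why an unsupported ab initio exon drops out.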

[Figure: ContigView of a homology gene (novel gene)
with associated evidence: aligned UniGene clusters,
aligned proteins, Genscan predictions]
17
Pseudogenes
  • Many pseudogenes also predicted

18
Ensembl Gene Build System
  • Resulting Ensembl genes are highly accurate
    with low false positive rates
  • Ensembl human gene identifiers are 95% stable
    between builds

[Figure: snapshot of gene statistics]
19
Ensembl EST genes
  • ESTs are not accurate enough to produce Ensembl
    genes, but are important, especially for
    identifying alternative transcripts
  • Create an independent set of EST genes

[Figure: known gene shown with EST genes and aligned
UniGene clusters]
20
Ensembl EST genes
  • Map ESTs to genome using Exonerate, BLAST, and
    EST2Genome
  • Define transcripts by merging redundant ends,
    setting splice sites to common ends
  • Finds splice sites and defines UTRs
  • Alternative transcript predicted if at least one
    alternatively spliced EST exists
  • Process transcripts with Genomewise to find
    longest ORF for each
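
The last step, finding the longest ORF of a transcript, can be sketched as follows. This is a simplified stand-in and not Genomewise itself: it scans only the forward strand, treats an ORF as ATG through the first in-frame stop codon, and uses a made-up transcript sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest forward-strand ORF, or None.

    Coordinates are 0-based and half-open, so seq[start:end] begins
    with ATG and ends with a stop codon.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                candidate = (orf_start, i + 3)
                if best is None or (candidate[1] - candidate[0]
                                    > best[1] - best[0]):
                    best = candidate
                orf_start = None
    return best


# Made-up transcript with a short ORF and a longer one in another frame.
transcript = "CC" + "ATGAAATAG" + "G" + "ATGCCCGGGTTTTGA" + "CC"
orf = longest_orf(transcript)
print(orf, transcript[orf[0]:orf[1]])
```

Once the ORF is fixed, the sequence before its start and after its stop is what would be reported as the 5' and 3' UTR of that transcript.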

21
Ensembl EST genes
  • Evidence for genes shown (ExonView)

22
Manual gene annotation: Otter
  • Manual annotation is done with applications such
    as Apollo
  • Otter database/server allows manual annotations
    to be integrated with automated annotations

23
Manually curated genes: VEGA
  • Chromosomes 6, 7, 13, 14, 20 and 22 contain
    manually curated genes from the VEGA database

24
Gene information in Ensembl: GeneView
25
Transcript information in Ensembl: TransView
26
Protein information in Ensembl: ProteinView
27
Comparative genomics in Ensembl
  • Gene orthologue pairs
    • Human <-> Mouse <-> Rat <-> Fugu <-> Zebrafish
    • C. elegans <-> C. briggsae
    • Fly <-> Mosquito
  • DNA homology
    • Human <-> Mouse <-> Rat

28
Comparative genomics in Ensembl: Gene orthologs
  • Gene ortholog pairs shown in GeneView
  • Calculated by BLAST (reciprocal best BLAST hits,
    or BLAST hits combined with synteny)
  • dN/dS: ratio of nonsynonymous to synonymous
    changes (a measure of selection)
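
The reciprocal best hit criterion can be sketched in a few lines. The gene names and scores below are invented for illustration, and the input is assumed to be pre-computed BLAST scores keyed by (query, target); running BLAST itself is outside the sketch.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: dict of (query, target) -> alignment score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented scores: hC's best mouse hit is mB, but mB prefers hB.
human_vs_mouse = {("hA", "mA"): 950, ("hA", "mB"): 400,
                  ("hB", "mB"): 700, ("hC", "mB"): 650}
mouse_vs_human = {("mA", "hA"): 940, ("mB", "hB"): 710,
                  ("mB", "hC"): 640}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)
```

hC is left unpaired because its best hit is not reciprocated; cases like this are where additional evidence such as synteny becomes useful.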

29
Comparative genomics in Ensembl: DNA homology
  • DNA homology shown in ContigView

[Figure: ContigView showing mouse and rat homology]
30
Comparative genomics in Ensembl: Synteny
  • Large-scale homology shown in SyntenyView
  • Synteny: homologous sequence blocks in the same
    order and orientation
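
The definition above can be turned into a small block-finding sketch. It is an illustration under stated assumptions, not the method behind SyntenyView: homologous anchors are listed in their order along genome A, each anchor records a hypothetical (chromosome, rank, strand) in genome B, and a block is a run of anchors on the same chromosome and strand whose ranks are consecutive.

```python
def synteny_blocks(anchors, min_size=2):
    """Group anchors into runs that keep order and orientation.

    anchors: list of (b_chrom, b_rank, strand) tuples, ordered along
    genome A. strand is +1 or -1; on the -1 strand the rank in genome B
    is expected to decrease by one per step.
    """
    blocks, current = [], []
    for anchor in anchors:
        if current:
            prev = current[-1]
            same_run = (anchor[0] == prev[0]
                        and anchor[2] == prev[2]
                        and anchor[1] - prev[1] == anchor[2])
            if not same_run:
                if len(current) >= min_size:
                    blocks.append(current)
                current = []
        current.append(anchor)
    if len(current) >= min_size:
        blocks.append(current)
    return blocks


# Hypothetical anchors: a forward block, an inverted block, a singleton.
anchors = [("chr1", 1, 1), ("chr1", 2, 1), ("chr1", 3, 1),
           ("chr4", 9, -1), ("chr4", 8, -1),
           ("chr2", 5, 1)]
blocks = synteny_blocks(anchors)
print([len(block) for block in blocks])
```

Real data would need tolerance for small gaps and local rearrangements; the strict "consecutive rank" rule is used here only to keep the example short.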

31
Other features in Ensembl
  • Menus provide other feature options
  • Features such as SNPs and markers have special
    views

32
Other data sources in Ensembl
  • Ensembl incorporates gene and feature info from
    many other data sources

[Figure: examples of linked data from OMIM and SwissProt]
33
Other data sources in Ensembl: Link out
34
The Distributed Annotation System
  • Allows viewing third-party annotation of the
    genomic scaffold
  • Users can choose the annotation they are
    interested in
  • Features are viewed in consistent user
    interface/display
  • Allows specialized feature annotation and the
    comparison of different methodologies

35
DAS: Selecting data
36
GeneDAS
  • GeneDAS allows exchange of annotations at the
    gene level
  • e.g. access to SwissProt annotations from GeneView

37
DAS: Add your own annotations
  • Anyone can add data and upload it to DAS server
    for others to view

38
Sequence similarity searching
  • Two search methods:
    • SSAHA: very fast, good for identifying
      near-exact DNA-DNA matches
    • BLAST: slower but more accurate, can do DNA or
      protein searches
  • Can search against any species
  • Can search against genomic sequence, cDNAs
    (Ensembl or Genscan), or protein sequences
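
Why a hashing search is fast for near-exact matches can be seen in a toy sketch. This is written in the spirit of SSAHA rather than as a reimplementation: it assumes the subject is indexed by non-overlapping k-mers, every k-mer of the query is looked up, and hits are grouped by the offset they imply. The sequences and the value of k are made up.

```python
from collections import defaultdict


def build_index(subject, k):
    """Hash each non-overlapping k-mer of the subject to its positions."""
    index = defaultdict(list)
    for pos in range(0, len(subject) - k + 1, k):
        index[subject[pos:pos + k]].append(pos)
    return index


def search(query, index, k):
    """Rank candidate subject offsets by the number of k-mer hits.

    Returns a list of (offset, hit_count), best supported first, where
    offset is the subject position aligned to the start of the query.
    """
    votes = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], ()):
            votes[spos - qpos] += 1
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


subject = "TTGACCGATAGCTAGGCTAACGTT"   # made-up "genome"
query = subject[4:16]                   # exact 12-base substring
index = build_index(subject, k=4)
hits = search(query, index, k=4)
print(hits)
```

The index is built once and each query k-mer costs one dictionary lookup, so no alignment is computed until candidate offsets have been found. Mismatches break k-mer hits, which is consistent with the method suiting near-exact matches better than distant homology.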

39
(No Transcript)
40
Hits relative to genome
Show alignment [A], sequence [S], or ContigView [C]
41
BLAST results
42
Data Mining with EnsMart
  • EnsMart - organizes data from Ensembl into a
    query-optimized database
  • Allows very fast, cross-data source querying
  • Accessible from:
    • Ensembl website (MartView)
    • Stand-alone application (MartExplorer)
    • Command-line interface (MartShell)
  • Extremely powerful for data mining

43
Data Mining with EnsMart
44
  • Choose focus: gene set or SNPs
  • Choose organism (any species in Ensembl)

45
  • Filter genes based on info about:
    • Region
    • Genes
    • Diseases
    • Expression patterns
    • Multi-species comparisons
    • Protein domains and families
    • SNPs

46
  • Choose output type:
    • Features (genes with associated info)
    • SNPs
    • Structures (of genes, e.g. exons)
    • Sequences
  • Choose what information to output
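
The focus, filter, output sequence on the last three slides amounts to a query with a record set, a list of filters, and a list of attributes. The sketch below mimics that shape on in-memory data only. It does not use any EnsMart interface, and every record, field name, and helper is hypothetical.

```python
# Hypothetical gene records; all identifiers and values are invented.
GENES = [
    {"gene_id": "GENE_A", "chromosome": "17", "start": 7_500_000,
     "domains": ["domain_X"], "has_snp": True},
    {"gene_id": "GENE_B", "chromosome": "17", "start": 41_000_000,
     "domains": ["domain_Y"], "has_snp": False},
    {"gene_id": "GENE_C", "chromosome": "6", "start": 31_000_000,
     "domains": ["domain_X", "domain_Y"], "has_snp": True},
]


def in_region(chromosome, start, end):
    """Filter: gene starts within the given chromosomal region."""
    return lambda g: g["chromosome"] == chromosome and start <= g["start"] <= end


def has_domain(name):
    """Filter: gene product contains the named protein domain."""
    return lambda g: name in g["domains"]


def run_query(records, filters, attributes):
    """Apply every filter, then return only the requested attributes."""
    selected = [r for r in records if all(f(r) for f in filters)]
    return [tuple(r[a] for a in attributes) for r in selected]


result = run_query(
    GENES,                                           # focus: gene set
    [in_region("17", 1, 50_000_000), has_domain("domain_X")],
    ["gene_id", "start"],                            # output attributes
)
print(result)
```

Filters combine with a logical AND, and choosing attributes separately from filters is what lets the same selection be exported as features, structures, or sequences.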

47
Multiple Programming Interfaces now exist for
Ensembl
48
Another example of how to utilize the Ensembl
database: Sockeye
www.bcgsc.ca/bioinfo/software
49
Apollo Java viewer
www.ensembl.org/apollo
50
Ensembl updates
  • Monthly
  • Include:
    • Changes in genome builds (with new annotations)
    • Changes in code or database schema
    • Additional views and tools on website

51
Pre-Ensembl
  • Full annotation can take weeks
  • Pre-Ensembl site provides in-progress annotation:
    • Placement of known proteins
    • Ab initio gene predictions
    • Repeat masking
    • BLAST and SSAHA searching

52
Ensembl Software System
  • Software can be accessed by FTP
  • Can also be accessed through CVS (concurrent
    versions system)
  • Possible to set up a mirror of the entire Ensembl
    system.

53
Further Information
  • The Ensembl Project: www.ensembl.org
  • VEGA: vega.sanger.ac.uk
  • EnsMart: www.ensembl.org/EnsMart/
  • Distributed Annotation System: www.biodas.org
  • Human Genome Central Resources:
    www.ensembl.org/genome/central
  • References
    • Ensembl
      • Hubbard et al., 2002. NAR 30 (1), 38-41.
      • Clamp et al., 2003. NAR 31 (1), 38-42.
      • Birney et al., 2004. NAR 32, D468-D470.
    • EnsMart: Kasprzyk et al., 2004. Genome Res. 14,
      160-169.