Title: Investigating Genomes with Ensembl
1Investigating Genomes with Ensembl
- Drs. Bert Overduin and Giulietta Spudich
2Overview of the day
- Introduction and website walk-through
- Hands-on exercises (the browser)
- Tea/Coffee
- Introduction to BioMart
- Hands-on exercises (BioMart)
- Lunch
- Determining the gene set
- Hands-on exercises (gene set)
- Tea/Coffee
- Variations presentation and hands-on
3Introducing
- Genome browsing: a comparison
- Consensus genes
- Ensembl annotation and software
- How to find help
4Sequencing the genome
5What can we learn about genomes?
- Within one genome: regulatory elements, gene order, chromatin structure
- Through comparative studies: evolution, conserved regions, rearrangements
- Gene quality and prediction
6Genome Browsers Today
- Ensembl Genome Browser
- http://www.ensembl.org
- NCBI Map Viewer
- http://www.ncbi.nlm.nih.gov/mapview/
- UCSC Genome Browser
- http://genome.ucsc.edu
7Ensembl Genome Browser
8NCBI Map Viewer
9UCSC Genome Browser
10What Distinguishes Ensembl from the UCSC and NCBI Browsers?
- The gene set: automatic annotation based on mRNA and protein information
- Programmatic access via the Perl API (open source)
- BioMart
- Integration with other databases (DAS)
- Comparative analysis (gene trees)
11Challenges of genome browsers
- Increasing sequence information
198,879,188,987 nt (Aug 2007)
12Challenges of genome browsers
- Increasing annotation: ENCODE
- Pilot project completed in 2007: 1% of the human genome
- Discovered promoter elements are on either side of the transcription start site
13To meet a challenge
- Ensembl's aim: to provide annotation for the biological community that is freely available and of high quality
- Started in 1999
- Joint project between EBI and Sanger
- Funded primarily by the Wellcome Trust, with additional funding from EMBL, NIH-NIAID, EU, BBSRC and MRC
- Team of ca. 40 people, led by Ewan Birney (EBI) and Tim Hubbard (Sanger)
14The Ensembl gene set
- All Ensembl genes start from a known protein or mRNA
- [Diagram: sequence assembly, mRNAs and proteins, Ensembl gene set]
- An initial alignment of protein and mRNA to the genome begins the Genebuild.
15Have you heard of
- Ensembl: strives for the best possible gene set
- www.ensembl.org
- Havana (VEGA): same goal
- http://vega.sanger.ac.uk
- HGNC: a unique name and symbol for every gene in human
- http://www.genenames.org/
- UniProt: focus on proteins and functional information
- www.uniprot.org
16Ensembl vs Havana annotation
- All genes at once (Ensembl Genebuild)
- Quick, keeps current
- Consistent annotation
- Can apply rules to more species
- Gene by gene (Havana/VEGA)
- Flexible, can deal with inconsistencies
- Consult publications as well as databases
- Out-of-the-ordinary biology
- However: slow, expensive
17Merging sets
- Havana transcripts are incorporated into Ensembl
- UniProt proteins are aligned to the genome in the Ensembl genebuild
- UniProt imports Ensembl peptides for human
- HGNC moved to Hinxton: coordination
18Consensus across genome browsers: the CCDS set
http://www.ensembl.org/info/about/docs/ccds.html
- A protein is deposited into the Consensus CDS protein set (CCDS set) if
- NCBI
- UCSC
- Havana
- Ensembl
- have determined the same sequence.
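
The consensus rule above can be pictured as a set intersection. The following is only a minimal sketch under stated assumptions: the function name `consensus_cds`, the dictionary layout and the toy sequences are all hypothetical, and the real CCDS process may well involve more than an exact match.

```python
def consensus_cds(calls_by_group):
    """Return the CDS sequences that every annotation group reported identically.

    calls_by_group maps a group name to the collection of coding sequences
    that group has annotated (a hypothetical input format for illustration).
    """
    call_sets = [set(calls) for calls in calls_by_group.values()]
    if not call_sets:
        return set()
    # A sequence qualifies only if it appears in every group's set.
    return set.intersection(*call_sets)


# Toy, made-up sequences; these are not real annotations.
toy_calls = {
    "NCBI": {"ATGGCCTAA", "ATGTTTTGA"},
    "UCSC": {"ATGGCCTAA", "ATGAAATAG"},
    "Havana": {"ATGGCCTAA"},
    "Ensembl": {"ATGGCCTAA", "ATGTTTTGA"},
}

print(consensus_cds(toy_calls))  # only the sequence shared by all four groups
```

A sequence missing from even one group's set drops out, which mirrors the "all four have determined the same sequence" condition on the slide.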
19More about Ensembl
- Genome browsing: a comparison
- Consensus genes
- Ensembl annotation and software
- How to find help
20Ensembl Genes biological basis
All Ensembl gene predictions are based on proteins and mRNAs in:
- UniProt/Swiss-Prot (manually curated)
- UniProt/TrEMBL
- NCBI RefSeq (manually curated)
[Diagram: Protein/mRNA, Sequence Assembly, Ensembl Genes]
21Genes and Transcripts in Ensembl
- Ensembl known genes or transcripts
- Ensembl novel genes or transcripts
- Ensembl EST genes or transcripts
- Non-Ensembl genes
- Imports for yeast, C. elegans, fly, mosquito, Takifugu and Tetraodon
22Names in Ensembl
- ENSG Ensembl Gene ID
- ENST Ensembl Transcript ID
- ENSP Ensembl Peptide ID
- ENSE Ensembl Exon ID
- For species other than human, a species code is added to the ENS prefix:
- MUS (Mus musculus) for mouse: ENSMUSG
- DAR (Danio rerio) for zebrafish: ENSDARG, etc.
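
The naming scheme above can be sketched as a small parser. This is an illustration only: the regular expression is an assumption inferred from the examples on this slide (ENS, an optional uppercase species code, one type letter, then digits), the function name `parse_stable_id` is hypothetical, and the example IDs are made up.

```python
import re

# Type letters as listed on the slide: G, T, P, E.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Assumed layout: "ENS" + optional species code + type letter + digits.
STABLE_ID = re.compile(r"^ENS([A-Z]*)([GTPE])(\d+)$")


def parse_stable_id(stable_id):
    """Split an Ensembl-style ID into (species_code, feature_type, number).

    An empty species code means human, which carries no species code.
    """
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not an Ensembl-style stable ID: {stable_id!r}")
    species_code, type_letter, number = match.groups()
    return species_code, FEATURE_TYPES[type_letter], number


# Made-up IDs for illustration only.
for example in ("ENSG00000000001", "ENSMUSG00000000001", "ENSDART00000000001"):
    print(example, parse_stable_id(example))
```

The regex relies on backtracking so that the final letter before the digits is always read as the feature type, whether or not a species code precedes it.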
23Gene Structure in Ensembl
- Calmodulin, chicken: no UTRs
- Calmodulin, human: UTRs annotated
24What annotation is available?
- Gene/transcript/peptide models (coding and noncoding (ncRNAs))
- IDs in other databases
- Mapped cDNAs, peptides, microarray probes, BAC clones etc.
- Cytogenetic bands, markers, repeats etc.
- Comparative data
- orthologues and paralogues, protein families, whole genome alignments, syntenic regions
- Variation data
- Single Nucleotide Polymorphisms (SNPs)
- Regulatory data
- best guess set of regulatory elements from ENCODE
- Data from external sources (DAS)
25Specific data sources
- Microarrays (Affymetrix, Illumina, Agilent)
- GO (Gene Ontology functional classes)
- http://www.geneontology.org/
- OMIM (human diseases and phenotypes)
- http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM
- Identifiers in Entrez, UniProt, RefSeq, etc.
- PDB, MSD (structural databases)
- http://www.rcsb.org/pdb/
- http://www.ebi.ac.uk/msd/
26Interpro
Collection of protein data: sequences, motifs, structures
http://www.ebi.ac.uk/interpro/
27How is this information organised?
- Ensembl Views (Website)
- Ensembl Database (open source)
- (Perl API, FTP site)
- BioMart data mining tool
28Ensembl Open Source
- Data and software freely available
- More than 50 installs worldwide
- Academia and industry
- Local or available via the web
- Mirrors with Ensembl data, e.g.
- http://ensembl.genome.tugraz.at/index.html
- http://ensembl.genomics.org.cn/
- or user projects with own data
29Powered by Ensembl
30Help and Information
- Use our helpdesk!
- helpdesk@ensembl.org
- View our help pages!
- (the "using Ensembl" link)
- View our animated tutorials
- http://www.ensembl.org/common/Workshops_Online
- Mailing lists
- ensembl-announce@ebi.ac.uk
- Come visit our blog!
- http://ensembl.blogspot.com/
31Ensembl Team