Title: The Ensembl Gene set: The Genebuild
1The Ensembl Gene set: The Genebuild
21 April 2008
2Outline
- The GeneBuild
- (determining the Ensembl gene set)
- What it means for the scientist?
- annotation pipeline vs manual curation
- Pseudogenes
- ncRNAs
- The CCDS project
3Introduction
- What is available?
- I) Sequence Assemblies from genome sequencing
efforts
4Gene Sequencing- the Assembly
This generates clones, vs new sequencing methods
http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html
5Clones Available
- Human
- (Tilepath- used in the assembly)
6ContigView Clones and Contigs
Contigs
Clones (Plate/well numbers)
Ensembl Transcripts
7Task
View the tilepath clone in ContigView for the
region containing the human BRCA2 gene.
Hint: Start with a search for the BRCA2 gene.
8The Ensembl Geneset
- How does Ensembl use mRNA and protein information
along with the sequence assembly to define
distinct genes on the genome?
Protein
Sequence Assembly
Ensembl Geneset
9Once the Assembly is Imported
- Proteins/mRNAs are aligned.
- These have been submitted to databases such as
- UniProt (manually curated) and
- RefSeq (partially manually curated)
10The Biological Evidence
All Ensembl gene predictions are based on
experimental evidence
- UniProt/Swiss-Prot
- A manually curated database and therefore of
highest accuracy
- NCBI RefSeq
- A partially manually curated database
- UniProt/TrEMBL
- Automatically annotated translations of EMBL
coding sequence (CDS) features
- EMBL / GenBank / DDBJ
- Primary nucleotide sequence repository
11Database Relationship
NCBI RefSeq
EMBL-Bank DDBJ GenBank
Individual Labs Submission
UniProt
Swiss-Prot
TrEMBL
12Genebuild
Sequence (Assembly)
Manual annotation (HAVANA)
EMBL-Bank GenBank DDBJ
Proteins (e.g. Swiss-Prot)
Ensembl
mRNA
EST genes
13Why do I want to know?
- Ensembl genes may be based on multiple
proteins/mRNAs
- What is an Ensembl gene based on?
14Task
- Look at the evidence for the human EPO gene.
- What was this gene based on?
- Hint: Go to Exon Information from the GeneView
page
15EPO gene supporting evidence
16Species-Specific GeneBuilds
- Pan troglodytes genes are built by projection
from human genes.
- Zebrafish has many gene duplications.
- Homo sapiens genes must have protein evidence,
not just mRNA.
17Task
- When was the chimpanzee (Pan troglodytes)
Genebuild performed?
- Can you find information as to how genes were
annotated?
- Hint: Look on the chimpanzee index page
18External Gene Set VEGA/Havana
- Human, zebrafish, mouse and dog
- Havana transcripts in blue or gold
- What are Havana transcripts?
19Automatic vs Manual Annotation
- Automatic Annotation
- (Ensembl Genebuild)
- Quick
- Use unfinished sequence or shotgun assembly
- Consistent annotation
- Manual Annotation (Havana)
- Flexible, can deal with inconsistencies
- Most rules have exceptions
- Consult publications as well as databases
- Out of the Ordinary Biology
- However: slow
- Need finished sequence
20Havana and Ensembl match
When Havana (manual curation) and Ensembl
(automatic methods) predict the same transcript,
base pair for base pair, the transcripts are merged
and coloured gold.
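The merge rule above can be sketched in a few lines. This is only an illustration, not the Ensembl pipeline: it assumes that "the same transcript, base pair for base pair" can be approximated by identical chromosome, strand and exon coordinates, and the dictionary layout and the names `transcript_key` and `label_transcripts` are hypothetical.

```python
def transcript_key(transcript):
    """Identify a transcript structure by chromosome, strand and exon coordinates.

    Assumption: two transcripts match exactly when these values are identical.
    """
    return (
        transcript["chrom"],
        transcript["strand"],
        tuple(sorted(transcript["exons"])),
    )


def label_transcripts(havana, ensembl):
    """Label each distinct structure as merged, Havana-only or Ensembl-only."""
    havana_keys = {transcript_key(t) for t in havana}
    ensembl_keys = {transcript_key(t) for t in ensembl}
    labels = {}
    for key in havana_keys | ensembl_keys:
        if key in havana_keys and key in ensembl_keys:
            labels[key] = "merged (gold)"
        elif key in havana_keys:
            labels[key] = "havana only"
        else:
            labels[key] = "ensembl only"
    return labels


# Made-up coordinates, for illustration only.
shared = {"chrom": "7", "strand": 1, "exons": [(100, 200), (300, 400)]}
manual_only = {"chrom": "7", "strand": 1, "exons": [(100, 200), (300, 450)]}
auto_only = {"chrom": "7", "strand": 1, "exons": [(100, 220), (300, 400)]}

labels = label_transcripts([shared, manual_only], [dict(shared), auto_only])
```

Note that a difference of a single base at one exon boundary is enough to keep two transcripts separate under this rule.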
21Manually-curated gene sets in Ensembl
- Vega (Havana)
- Homo sapiens, Danio rerio,
- Mus musculus and Canis familiaris
- WormBase
- Caenorhabditis elegans
- FlyBase
- Drosophila melanogaster
- SGD
- Saccharomyces cerevisiae
22Consensus coding sequences (CCDS)
- Collaboration between NCBI, UCSC, Ensembl and
Havana to agree on a coding sequence for a
transcript.
- The long term aim is to have a single gene set
for human
- http://www.ncbi.nlm.nih.gov/CCDS/
- The genebuild pipeline has been modified to
retain these CDSs
23What Can Go Wrong?
- I) A gap in the assembly
- Gene might not be found in Ensembl
- II) Fused genes
- Gene might be associated with two names
BLAST hit (SwissProt entry)
24Outline
- The genome sequence
- The Genebuild
- manual curation by Havana
- Other EST gene set
- Pseudogenes
- ncRNAs
25Expressed Sequence Tags vs cDNA
- ESTs are annotated separately. Why?
- mRNA and cDNA are used in the GeneBuild
- Sequenced to a high standard, often complete.
- EST: lower-quality sequence.
- One-shot sequencing of cDNA from the 5' and 3'
ends creates the EST sequence.
- ESTs are only 500-800 nucleotides long
- Low-quality fragment: sequence error of 2%.
- BUT confers useful expression information
- discovery of new genes, esp. in diseased organisms
- Tissue type
- Timing/developmental stage
- Samples more transcripts, variants
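A rough arithmetic sketch of why ESTs are kept out of the main GeneBuild. It assumes the error figure on the slide is a per-base rate of 2% and that errors are independent; both are simplifying assumptions, and the function names are hypothetical.

```python
ERROR_RATE = 0.02  # assumed per-base error rate (2%)


def expected_errors(read_length, error_rate=ERROR_RATE):
    """Expected number of miscalled bases in one read."""
    return read_length * error_rate


def prob_error_free(read_length, error_rate=ERROR_RATE):
    """Probability that a read contains no errors, assuming independence."""
    return (1.0 - error_rate) ** read_length


for length in (500, 800):
    print(length, expected_errors(length), prob_error_free(length))
```

Under these assumptions a 500-800 nucleotide EST carries on the order of 10 to 16 miscalled bases, so an error-free EST is very unlikely, which is consistent with treating ESTs as expression evidence in a separate gene set.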
26Where Can I See This EST Geneset? ContigView
Choose EST genes
EST track
27Pseudogenes: False Genes
28ncRNAs (non coding RNAs)
- What types are in Ensembl?
- tRNA (transfer RNA)
- rRNA (ribosomal RNA)
- scRNA (small cytoplasmic)
- snRNA (small nuclear)
- snoRNA (small nucleolar)
- miRNA (microRNA)
29ncRNAs (2 types)
- I) RNA with low homology can be identified
through conserved secondary structure (search genome
using Rfam pattern)
- II) High sequence conservation (miRNA)
- BLAST alignment
- RNA fold applied to make sure
- sequences can fold (hairpin)
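The hairpin check in step II can be illustrated with a toy example. This is not what a real folding program does (those compute minimum free energy structures); it is a naive sketch that pairs the two ends of a candidate sequence inwards and counts Watson-Crick and G-U wobble pairs. The names `paired_fraction` and `looks_like_hairpin`, the minimum loop size and the 0.8 threshold are all assumptions made for illustration.

```python
# Allowed base pairs: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_fraction(sequence, min_loop=3):
    """Fraction of end-to-end positions that can pair, leaving a loop unpaired."""
    rna = sequence.upper().replace("T", "U")
    i, j = 0, len(rna) - 1
    paired = total = 0
    # Walk inwards from both ends until only the loop remains.
    while j - i - 1 >= min_loop:
        total += 1
        if (rna[i], rna[j]) in PAIRS:
            paired += 1
        i += 1
        j -= 1
    return paired / total if total else 0.0


def looks_like_hairpin(sequence, threshold=0.8):
    """Crude filter: keep candidates whose two arms are mostly complementary."""
    return paired_fraction(sequence) >= threshold


stem_loop = "GGGGAAAACCCC"   # four G-C pairs around an AAAA loop
no_structure = "AAAAAAAAAAAA"
```

A candidate from the BLAST step that fails such a check would be discarded; a real pipeline would also have to handle bulges and mismatches, which this sketch ignores.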
30ncRNAs: where can I see them?
- Find them in ContigView
- or use BioMart.
31Summary Ensembl Genes
- All Ensembl genes are based on biological
evidence (protein and mRNA)
- One Ensembl gene may come from proteins and mRNAs
in various databases.
- Havana (manually curated) genes are incorporated
into the Ensembl geneset, merged for human.
- The CCDS set strives for consensus coding
sequences across databases.
- Pseudogenes and RNAs are annotated, along with a
separate EST gene set.
32For more on GeneBuild
- Help and Documentation
- (About Ensembl)
http://www.ensembl.org/info/about/docs/genome_annotation.html