1Evaluating Genes and Transcripts (Genebuild)
2Outline
- Ensembl gene set
- EST genes and ab initio predictions
- Manual curation (Vega / Havana)
- Ensembl / Havana merged gene set
- CCDS project
3Biological Evidence
All Ensembl gene predictions are based on experimental evidence
- UniProt/Swiss-Prot
- A manually curated database and therefore of highest accuracy
- NCBI RefSeq
- A partially manually curated database
- UniProt/TrEMBL
- Automatically annotated translations of EMBL coding sequence (CDS) features
- EMBL / GenBank / DDBJ
- Primary nucleotide sequence repository
4The Ensembl Genebuild
Genome assembly
Experimental evidence
Ensembl Genes
Computer programs
5The Ensembl Genebuild
A new release of Ensembl does not contain a new
genebuild for each species! New genebuilds are
only done if there is
- a new genome assembly
- a lot of new supporting evidence
6Genome Assemblies
- Genome assemblies are not created by Ensembl, but provided by other institutes / consortia
- NCBI: human, mouse
- Rat Genome Sequencing Consortium: rat
- Sanger: zebrafish
- Broad Institute: mammals
- Baylor College: cow
- Washington University: chicken
- etc.
7The Ensembl Genebuild
- Targeted build
- Align species-specific proteins to the genome to create transcripts
- Similarity build
- Align proteins from closely related species to locate additional transcripts
- Add UTRs using mRNA evidence
- Eliminate redundant transcripts and create genes
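The last step above, eliminating redundant transcripts and grouping the rest into genes, can be sketched as follows. This is a toy illustration, not the Ensembl pipeline code: the `Transcript` class, the rule that transcripts with overlapping exons on the same strand form one gene, and all coordinates are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model: strand plus (start, end) exon pairs."""
    strand: int
    exons: tuple  # e.g. ((100, 200), (300, 400)), inclusive coordinates


def remove_redundant(transcripts):
    """Keep one copy of each transcript with an identical exon structure."""
    seen = set()
    unique = []
    for t in transcripts:
        key = (t.strand, t.exons)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def exons_overlap(a, b):
    """True if two transcripts on the same strand have overlapping exons."""
    if a.strand != b.strand:
        return False
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def cluster_into_genes(transcripts):
    """Assumed rule: transcripts with overlapping exons belong to one gene."""
    genes = []
    for t in transcripts:
        merged = [t]
        remaining = []
        for gene in genes:
            if any(exons_overlap(t, member) for member in gene):
                merged.extend(gene)      # t links to this gene, absorb it
            else:
                remaining.append(gene)   # unrelated gene, keep as is
        genes = remaining + [merged]
    return genes


transcripts = [
    Transcript(1, ((100, 200), (300, 400))),
    Transcript(1, ((100, 200), (300, 400))),   # exact duplicate
    Transcript(1, ((150, 200), (300, 450))),   # alternative form, same gene
    Transcript(-1, ((150, 200),)),             # opposite strand, separate gene
]
genes = cluster_into_genes(remove_redundant(transcripts))
print(len(genes), "genes from", len(transcripts), "input transcripts")
```

The duplicate is dropped first, so the two remaining forward-strand transcripts end up as alternative transcripts of one gene and the reverse-strand transcript forms its own gene.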
8Special cases
- Pseudogenes
- Non-coding RNA genes
- sequences from the Rfam and miRBase databases, using Infernal
- hand-checked set
- Ig Segment Genes (Immunoglobulin and T-cell receptor segments)
- sequences from the IMGT database, using Exonerate
9Classification of Transcripts
- Ensembl Transcripts and Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries
- Genes that map to species-specific protein/mRNA records are classified as known
- Genes that do not map to species-specific protein/mRNA records are classified as novel
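The known / novel rule above can be written down in a few lines. This is a minimal sketch, assuming each mapping is available as a `(database, accession, species)` tuple; that layout, the function name and the accessions are made up for illustration and are not the Ensembl schema.

```python
def classify_gene(mapped_records, species):
    """Return 'known' if any mapped record comes from the gene's own species.

    mapped_records: list of (database, accession, record_species) tuples
    (hypothetical layout). Genes with no same-species record are 'novel'.
    """
    for _database, _accession, record_species in mapped_records:
        if record_species == species:
            return "known"
    return "novel"


# Made-up accessions, for illustration only.
same_species = [("Swiss-Prot", "ACC0001", "human")]
other_species = [("TrEMBL", "ACC0002", "mouse")]

print(classify_gene(same_species, "human"))
print(classify_gene(other_species, "human"))
print(classify_gene([], "human"))
```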
10Names and Descriptions
- Transcript names are inferred from mapped transcripts and proteins
- Swiss-Prot > RefSeq > TrEMBL ID
- Novel transcripts have only Ensembl identifiers
- Genes are assigned the official gene symbol if available
- HGNC (HUGO) symbol for human genes
- Species-specific nomenclature committees (MGI, ZFIN etc.)
- Otherwise Swiss-Prot > RefSeq > TrEMBL ID
- Gene description is inferred from mapped database entries; the source is always given
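The naming priority above amounts to a simple fallback chain. The sketch below assumes the mapped identifiers are held in a dictionary keyed by database name; the function names and all identifiers are placeholders, not real accessions or Ensembl code.

```python
# Priority order taken from the slide: Swiss-Prot > RefSeq > TrEMBL.
PRIORITY = ("Swiss-Prot", "RefSeq", "TrEMBL")


def pick_transcript_name(xrefs, ensembl_id):
    """Use the highest-priority mapped identifier, else the Ensembl ID.

    xrefs: dict of database name -> identifier (hypothetical layout).
    """
    for database in PRIORITY:
        if database in xrefs:
            return xrefs[database]
    return ensembl_id  # novel transcripts have only Ensembl identifiers


def pick_gene_name(official_symbol, xrefs, ensembl_id):
    """Prefer the official symbol (e.g. from HGNC, MGI, ZFIN) if available."""
    if official_symbol:
        return official_symbol
    return pick_transcript_name(xrefs, ensembl_id)


# Placeholder identifiers, for illustration only.
xrefs = {"RefSeq": "REFSEQ_EXAMPLE", "TrEMBL": "TREMBL_EXAMPLE"}
print(pick_gene_name("SYMBOL1", xrefs, "ENS_EXAMPLE"))
print(pick_gene_name(None, xrefs, "ENS_EXAMPLE"))
print(pick_gene_name(None, {}, "ENS_EXAMPLE"))
```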
11Projections
- The name and description of a gene are sometimes projected from its human or mouse orthologue
- Done for
- human to most other mammals
- mouse to rat
- human to fish
- Gene classified as known_by_projection
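A projection step of this kind might look like the sketch below. The slide does not say when a projection is applied, so the condition used here (the target gene has no name of its own) is an assumption, as are the dictionary layout, the small species table and the gene names.

```python
# Source species per target species. Rat and fish follow the slide;
# dog stands in for "most other mammals" and is an assumption.
PROJECTION_SOURCE = {"rat": "mouse", "dog": "human", "zebrafish": "human"}


def project_name(gene, orthologues):
    """Copy name and description from the orthologue in the source species.

    gene: dict with 'species', 'name', 'description', 'status' keys.
    orthologues: dict of species -> orthologous gene dict.
    Assumed rule: only genes without a name of their own are projected.
    """
    if gene.get("name"):
        return gene
    source = PROJECTION_SOURCE.get(gene["species"])
    orthologue = orthologues.get(source)
    if orthologue and orthologue.get("name"):
        return {
            **gene,
            "name": orthologue["name"],
            "description": orthologue["description"],
            "status": "known_by_projection",
        }
    return gene


# Made-up gene records, for illustration only.
rat_gene = {"species": "rat", "name": None, "description": None, "status": "novel"}
orthologues = {"mouse": {"name": "Gene1", "description": "example gene"}}
projected = project_name(rat_gene, orthologues)
print(projected["name"], projected["status"])
```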
12
- Find the supporting evidence for the transcript of the Ensembl GALP (Galanin-like peptide precursor) gene of human.
- On how many pieces of evidence was this transcript built?
- Why do two pieces of evidence not support the first exon of the transcript?
- From what source did Ensembl get the name for the gene? And from where did it get the description?
13More Supporting evidence
ContigView
14Configuring the Genebuild
- Genebuild configured for each species
- Data availability
- Targeted build most useful in human, mouse
- Similarity build most useful in C. intestinalis, mosquito
- Structural issues
- Zebrafish
- Many duplications
- Genome from different haplotypes
- Mosquito
- Many single-exon genes
- Genes within genes
15Low Coverage Genomes
- Low coverage genomes (2x) come in lots of scaffolds, so the classic genebuild will result in many partial and fragmented genes
- Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into gene-scaffolds that contain complete gene(s)
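The idea of piecing scaffolds together can be illustrated with a toy example. It assumes the whole genome alignment has already told us where each scaffold lands on the reference, that all scaffolds are in the same orientation, and that joined scaffolds are separated by a run of Ns; the function name, the gap length and the sequences are made up for the sketch.

```python
def build_gene_scaffold(alignments, scaffold_seqs, gap_length=6):
    """Join scaffolds in reference order, padded with Ns.

    alignments: list of (scaffold_id, reference_start) pairs, i.e. where
        each scaffold aligns on the annotated reference (hypothetical input).
    scaffold_seqs: dict of scaffold_id -> sequence.
    Orientation handling is left out to keep the example short.
    """
    ordered = sorted(alignments, key=lambda pair: pair[1])
    gap = "N" * gap_length
    return gap.join(scaffold_seqs[scaffold_id] for scaffold_id, _ in ordered)


# Two short scaffolds that each carry part of one gene (made-up sequences).
scaffold_seqs = {"scaffold_A": "ATGGCC", "scaffold_B": "GGTTAA"}
alignments = [("scaffold_B", 500), ("scaffold_A", 100)]

gene_scaffold = build_gene_scaffold(alignments, scaffold_seqs)
print(gene_scaffold)
```

The scaffolds are reordered by their position on the reference, so the gene-scaffold reads first part, N padding, second part, even though the input listed them the other way round.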
16Low Coverage Genomes
reference assembly
NNNNNN
gene-scaffold
17EST Gene Set
- ESTs (Expressed Sequence Tags) are single reads with a high chance of sequencing mistakes
- EST libraries are regularly contaminated with genomic DNA
- Generally 400 bp, so unlikely to cover a whole gene
- THEREFORE
- EST gene predictions are less reliable and thus kept separate from the core Ensembl Gene Set
18EST Gene Set
ContigView
EST
EST gene
19Ab initio Predictions
- Predict translatable transcript structures solely on the basis of genome sequence
- No validation with biological expression information
- GENSCAN for vertebrate genomes
- SNAP better for invertebrates
- NB: both programs over-predict transcript structures
20Ab initio Predictions
ContigView
GENSCAN prediction
21Automatic vs Manual Annotation
- Automatic Annotation
- Quick
- Use unfinished sequence or shotgun assembly
- Consistent annotation
- Manual Annotation
- Slow
- Need finished sequence
- Flexible, can deal with inconsistencies
- Most rules have exceptions
- Consult publications as well as databases
22Annotation that Causes Problems for Ensembl
- Multiple variants
- UTRs
- Pseudogenes
- Non-coding genes (ncRNAs)
- Overlapping genes, anti-sense genes
- Gene duplication events
23Manually Curated Gene Sets
- Whole genome
- FlyBase: fruitfly
- WormBase: C. elegans
- SGD: yeast
- Part of the genome
- Vega: human, zebrafish, mouse, dog
24Vega Genome Browser
http://vega.sanger.ac.uk
25Vega Transcripts
Vega Havana: transcripts annotated by the Havana team at Sanger
Vega External: transcripts annotated by other Vega teams
26Ensembl / Havana Merge
Full-length protein-coding transcripts annotated
by the Sanger Havana team (part of Vega) are
merged with the human and mouse Ensembl
transcript sets
- Transcripts
- Ensembl/Havana: gold
- Ensembl: red / black
- Havana: blue
- Genes
- Ensembl/Havana: gold
- Ensembl: red / black
- Havana: blue
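The colour scheme can be tied to the merge with a small sketch. The matching rule used here (two transcripts merge when their exon structures are identical) is a simplifying assumption, not the documented merge criterion, and the identifiers and coordinates are invented.

```python
def merge_transcript_sets(ensembl, havana):
    """Label transcripts as merged, Ensembl-only or Havana-only.

    ensembl, havana: dicts of transcript id -> exon structure, where a
    structure is a tuple of (start, end) pairs (hypothetical layout).
    Returns a list of (ids, source, display colour) tuples.
    """
    havana_by_structure = {structure: hid for hid, structure in havana.items()}
    matched_havana = set()
    result = []
    for eid, structure in ensembl.items():
        hid = havana_by_structure.get(structure)
        if hid is not None:
            result.append(((eid, hid), "Ensembl/Havana", "gold"))
            matched_havana.add(hid)
        else:
            result.append(((eid,), "Ensembl", "red / black"))
    for hid in havana:
        if hid not in matched_havana:
            result.append(((hid,), "Havana", "blue"))
    return result


# Invented identifiers and coordinates, for illustration only.
ensembl = {"E1": ((100, 200), (300, 400)), "E2": ((100, 200), (350, 400))}
havana = {"H1": ((100, 200), (300, 400)), "H2": ((100, 250),)}

merged = merge_transcript_sets(ensembl, havana)
for ids, source, colour in merged:
    print(ids, source, colour)
```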
27
- Find the Ensembl Epc1 (enhancer of polycomb homolog 1) gene of mouse.
- How many transcripts has Ensembl annotated for this gene? And Havana?
- On how many transcripts do the Ensembl and Havana annotations agree?
28CCDS (Consensus Coding Sequences)
- Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse
- Long term aim is to get to a single gene set for human and mouse
- The genebuild pipeline has been modified to retain these blessed CDSs (stored in a database for incorporation in the build)
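What "complete (ATG->stop)" means for a coding sequence can be checked mechanically. The sketch below is one possible check, assuming the standard genetic code and a spliced CDS given as a plain DNA string; it is not the CCDS project's own quality control, and the function name is made up.

```python
# Stop codons of the standard genetic code (assumed here).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq):
    """True if seq starts with ATG, ends with a stop codon, is a whole
    number of codons long and has no stop codon before the last one."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    if not seq.startswith("ATG"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[-1] not in STOP_CODONS:
        return False
    # An earlier in-frame stop would truncate the protein.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


print(is_complete_cds("ATGGCCTAA"))     # start, one codon, stop
print(is_complete_cds("GCCGCCTAA"))     # no ATG start
print(is_complete_cds("ATGTAAGCCTAA"))  # premature in-frame stop
```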
29Q & A
Questions and Answers