Provided by: bertov

1
Evaluating Genes and Transcripts (Genebuild)
2
Outline
  • Ensembl gene set
  • EST genes and ab initio predictions
  • Manual curation (Vega / Havana)
  • Ensembl / Havana merged gene set
  • CCDS project

3
Biological Evidence
All Ensembl gene predictions are based on
experimental evidence
  • UniProt/Swiss-Prot: a manually curated database and therefore of
    highest accuracy
  • NCBI RefSeq: a partially manually curated database
  • UniProt/TrEMBL: automatically annotated translations of EMBL
    coding sequence (CDS) features
  • EMBL / GenBank / DDBJ: primary nucleotide sequence repository

4
The Ensembl Genebuild
Genome assembly

Experimental evidence
Ensembl Genes

Computer programs
5
The Ensembl Genebuild
A new release of Ensembl doesn't contain a new
genebuild for each species! New genebuilds are
only done if there is
  • a new genome assembly
  • a lot of new supporting evidence

6
Genome Assemblies
  • Genome assemblies are not created by Ensembl, but provided by
    other institutes / consortia
  • NCBI: human, mouse
  • Rat Genome Sequencing Consortium: rat
  • Sanger: zebrafish
  • Broad Institute: mammals
  • Baylor College: cow
  • Washington University: chicken
  • etc.

7
The Ensembl Genebuild
  • Targeted build: align species-specific proteins to the genome to
    create transcripts
  • Similarity build: align proteins from closely related species to
    locate additional transcripts
  • Add UTRs using mRNA evidence
  • Eliminate redundant transcripts and create genes
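
The final step above (eliminating redundant transcripts and creating genes) can be illustrated with a short Python sketch. It is not the Ensembl pipeline: the tuple-based transcript representation, the function names, and the rule that transcripts on the same strand with overlapping exons belong to one gene are all assumptions made for this example.

```python
from typing import List, Tuple

Exon = Tuple[int, int]                      # (start, end), inclusive; hypothetical representation
Transcript = Tuple[str, Tuple[Exon, ...]]   # (strand, exons)


def remove_redundant(transcripts: List[Transcript]) -> List[Transcript]:
    """Drop transcripts whose strand and exon structure duplicate an earlier one."""
    seen = set()
    unique = []
    for t in transcripts:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


def exons_overlap(a: Transcript, b: Transcript) -> bool:
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a[1] for s2, e2 in b[1])


def cluster_into_genes(transcripts: List[Transcript]) -> List[List[Transcript]]:
    """Group transcripts that share a strand and an overlapping exon (assumed rule)."""
    genes: List[List[Transcript]] = []
    for t in transcripts:
        hits, rest = [], []
        for gene in genes:
            linked = any(t[0] == u[0] and exons_overlap(t, u) for u in gene)
            (hits if linked else rest).append(gene)
        # Merge the new transcript with every gene it links to.
        merged = [t] + [u for gene in hits for u in gene]
        genes = rest + [merged]
    return genes


transcripts = [
    ("+", ((100, 200), (300, 400))),
    ("+", ((100, 200), (300, 400))),   # exact duplicate, removed
    ("+", ((150, 200), (300, 450))),   # overlaps the first, same gene
    ("-", ((100, 200),)),              # opposite strand, separate gene
    ("+", ((1000, 1100),)),            # no overlap, separate gene
]
unique = remove_redundant(transcripts)
genes = cluster_into_genes(unique)
```

With this toy input, the five transcripts reduce to four unique structures grouped into three genes.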

8
Special cases
  • Pseudogenes
  • Non-coding RNA genes
  • sequences from RFAM and miRBase dbs and Infernal
  • hand-checked set
  • Ig Segment Genes (Immunoglobulin and T-cell
    receptor segments)
  • sequences from IMGT db and Exonerate

9
Classification of Transcripts
  • Ensembl Transcripts and Proteins are mapped to
    UniProt/Swiss-Prot, NCBI RefSeq and
    UniProt/TrEMBL entries
  • Genes that map to species-specific protein/mRNA
    records are classified as known
  • Genes that do not map to species-specific
    protein/mRNA records are classified as novel

10
Names and Descriptions
  • Transcript names are inferred from mapped
    transcripts and proteins
  • Swiss-Prot > RefSeq > TrEMBL ID
  • Novel transcripts have only Ensembl identifiers
  • Genes are assigned the official gene symbol if
    available
  • HGNC (HUGO) symbol for human genes
  • Species-specific nomenclature committees (MGI,
    ZFIN etc.)
  • Otherwise Swiss-Prot > RefSeq > TrEMBL ID
  • Gene description is inferred from mapped database
    entries, the source is always given
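
The naming priority described above can be sketched as a simple lookup. The function name, the dictionary layout, and the placeholder identifiers are assumptions for illustration only. The official-symbol argument applies to genes; for transcripts it would be left as `None`.

```python
from typing import Dict, Optional, Tuple

# Priority order taken from the slide: Swiss-Prot > RefSeq > TrEMBL.
SOURCE_PRIORITY = ("Swiss-Prot", "RefSeq", "TrEMBL")


def pick_name(
    mappings: Dict[str, str],
    ensembl_id: str,
    official_symbol: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (name, source) following the priority order on the slide."""
    if official_symbol:
        return official_symbol, "official symbol"
    for source in SOURCE_PRIORITY:
        if source in mappings:
            return mappings[source], source
    # Novel entries have only the Ensembl identifier.
    return ensembl_id, "Ensembl"


name_1 = pick_name({"TrEMBL": "T_ID", "RefSeq": "R_ID"}, "ENS_PLACEHOLDER")
name_2 = pick_name({"TrEMBL": "T_ID"}, "ENS_PLACEHOLDER", official_symbol="SYMBOL1")
name_3 = pick_name({}, "ENS_PLACEHOLDER")
```

Returning the source alongside the name mirrors the point that the source is always given.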

11
Projections
  • Name and description of gene is sometimes
    projected from its human or mouse orthologue
  • Done for
  • human to most other mammals
  • mouse to rat
  • human to fish
  • Gene classified as known_by_projection

12
  • Find the supporting evidence for the transcript
    of the Ensembl GALP (Galanin-like peptide
    precursor) gene of human.
  • On how many pieces of evidence was this
    transcript built?
  • Why do two pieces of evidence not support the
    first exon of the transcript?
  • From what source did Ensembl get the name for the
    gene? And from where did it get the description?

13
More Supporting evidence
ContigView
14
Configuring the Genebuild
  • Genebuild configured for each species
  • Data availability
  • Targeted build most useful in human, mouse
  • Similarity build most useful in C. intestinalis,
    mosquito
  • Structural issues
  • Zebrafish: many duplications, genome from different haplotypes
  • Mosquito: many single-exon genes, genes within genes

15
Low Coverage Genomes
  • Low coverage genomes (2x) come in lots of
    scaffolds, so a classic genebuild will result in
    many partial and fragmented genes
  • Whole Genome Alignment (WGA) to an annotated
    reference genome: this method reduces
    fragmentation by piecing together scaffolds into
    gene-scaffolds that contain complete gene(s)
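
As a rough illustration of piecing scaffolds together, the sketch below orders scaffolds by where they align on the reference and joins them with runs of N. The input format, the function names, and the gap length are assumptions; the real method works from whole genome alignments and is considerably more involved.

```python
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_gene_scaffold(
    scaffolds: Dict[str, str],
    alignments: List[Tuple[str, int, str]],
    gap_length: int = 100,
) -> str:
    """Join scaffolds in reference order, padding the joins with N.

    alignments: (scaffold_name, reference_start, strand) tuples, a
    hypothetical summary of where each scaffold aligns on the reference.
    """
    ordered = sorted(alignments, key=lambda a: a[1])
    pieces = []
    for name, _ref_start, strand in ordered:
        seq = scaffolds[name]
        pieces.append(reverse_complement(seq) if strand == "-" else seq)
    return ("N" * gap_length).join(pieces)


scaffolds = {"s1": "AAAC", "s2": "GGT"}
alignments = [("s2", 500, "-"), ("s1", 100, "+")]
gene_scaffold = build_gene_scaffold(scaffolds, alignments, gap_length=3)
```

Here `s1` aligns first on the reference, so it comes first, followed by the N padding and the reverse complement of `s2`.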

16
Low Coverage Genomes
reference assembly
NNNNNN
gene-scaffold
17
EST Gene Set
  • ESTs (Expressed Sequence Tags) are single reads,
    high chance of sequencing mistakes
  • EST libraries are regularly contaminated with
    genomic DNA
  • Generally 400 bp, so unlikely to cover a whole
    gene
  • THEREFORE
  • EST gene predictions are less reliable and thus
    kept separate from the core Ensembl Gene Set

18
EST Gene Set
ContigView
EST
EST gene
19
Ab initio Predictions
  • Predict translatable transcript structures solely
    on the basis of genome sequence
  • No validation with biological expression
    information
  • GENSCAN for vertebrate genomes
  • SNAP better for invertebrates
  • NB: Both programs over-predict transcript
    structures

20
Ab initio Predictions
ContigView
GENSCAN prediction
21
Automatic vs Manual Annotation
  • Automatic Annotation
  • Quick
  • Use unfinished sequence or shotgun assembly
  • Consistent annotation
  • Manual Annotation
  • Slow
  • Need finished sequence
  • Flexible, can deal with inconsistencies
  • Most rules have exceptions
  • Consult publications as well as databases

22
Annotation that Causes Problems for Ensembl
  • Multiple variants
  • UTRs
  • Pseudogenes
  • Non-coding genes (ncRNAs)
  • Overlapping genes, anti-sense genes
  • Gene duplication events

23
Manually Curated Gene Sets
  • Whole genome
  • FlyBase: fruitfly
  • WormBase: C. elegans
  • SGD: yeast
  • Part of the genome
  • Vega: human, zebrafish, mouse, dog

24
Vega Genome Browser
http://vega.sanger.ac.uk
25
Vega Transcripts
Vega Havana: transcripts annotated by the Havana
team at Sanger
Vega External: transcripts annotated by other
Vega teams
26
Ensembl / Havana Merge
Full-length protein-coding transcripts annotated
by the Sanger Havana team (part of Vega) are
merged with the human and mouse Ensembl
transcript sets
  • Transcripts
  • Ensembl/Havana: gold
  • Ensembl: red / black
  • Havana: blue
  • Genes
  • Ensembl/Havana: gold
  • Ensembl: red / black
  • Havana: blue
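
The three labels above can be illustrated with a small set-based sketch. The slides do not state how a Havana transcript is matched to an Ensembl one, so exact identity of exon structure is used here purely as an assumption; the function name and data layout are also hypothetical.

```python
from typing import Dict, Hashable, Iterable


def label_merged_set(
    ensembl: Iterable[Hashable], havana: Iterable[Hashable]
) -> Dict[Hashable, str]:
    """Label each distinct transcript structure by the set(s) containing it.

    Structures are hashable exon descriptions, e.g. tuples of (start, end).
    Exact structural identity is an assumed matching rule for this sketch.
    """
    ensembl_set, havana_set = set(ensembl), set(havana)
    labels = {}
    for structure in ensembl_set | havana_set:
        if structure in ensembl_set and structure in havana_set:
            labels[structure] = "Ensembl/Havana"   # displayed in gold
        elif structure in ensembl_set:
            labels[structure] = "Ensembl"          # displayed in red / black
        else:
            labels[structure] = "Havana"           # displayed in blue
    return labels


shared = ((100, 200), (300, 400))
ensembl_only = ((100, 200), (300, 450))
havana_only = ((120, 200), (300, 400))

labels = label_merged_set([shared, ensembl_only], [shared, havana_only])
```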

27
  • Find the Ensembl Epc1 (enhancer of polycomb
    homolog 1) gene of mouse.
  • How many transcripts has Ensembl annotated for
    this gene? And Havana?
  • On how many transcripts do the Ensembl and
    Havana annotations agree?

28
CCDS (Consensus Coding Sequences)
  • Collaboration between NCBI, UCSC, Ensembl and
    Havana to produce a set of stable, reliable,
    complete (ATG->stop) CDS structures for human and
    mouse
  • Long term aim is to get to a single gene set for
    human and mouse
  • The genebuild pipeline has been modified to
    retain these blessed CDSs (stored in a database
    for incorporation in the build)
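
What "complete (ATG->stop)" means for a CDS can be shown with a minimal check. This sketch assumes the standard genetic code's stop codons and a spliced CDS sequence given as a plain string; it is not the CCDS project's own validation procedure, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code (assumption)


def is_complete_cds(cds: str) -> bool:
    """True if the CDS runs from ATG to a stop codon, in frame, with no internal stop."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    starts_with_atg = codons[0] == "ATG"
    ends_with_stop = codons[-1] in STOP_CODONS
    has_internal_stop = any(c in STOP_CODONS for c in codons[:-1])
    return starts_with_atg and ends_with_stop and not has_internal_stop


complete = is_complete_cds("ATGAAATAA")          # ATG, one codon, stop
internal_stop = is_complete_cds("ATGTAAAAATAA")  # stop codon before the end
no_start = is_complete_cds("AAATAA")             # does not begin with ATG
out_of_frame = is_complete_cds("ATGAAATA")       # length not a multiple of 3
```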

29
Questions & Answers