The Ensembl Gene Set: The Genebuild

1
The Ensembl Gene Set: The Genebuild
21 April 2008
2
Outline
  • The GeneBuild
  • (determining the Ensembl gene set)
  • What does it mean for the scientist?
  • annotation pipeline vs manual curation
  • Pseudogenes
  • ncRNAs
  • The CCDS project

3
Introduction
  • What is available?
  • I) Sequence Assemblies from genome sequencing
    efforts

4
Gene Sequencing: the Assembly
This generates clones, vs new sequencing methods
http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html
5
Clones Available
  • Human
  • (Tilepath: used in the assembly)

6
ContigView: Clones and Contigs
Contigs
Clones (Plate/well numbers)
Ensembl Transcripts
7
Task
View the tilepath clone in ContigView for the
region containing the human BRCA2 gene.
Hint: Start with a search for the BRCA2 gene.
8
The Ensembl Geneset
  • How does Ensembl use mRNA and protein information
    along with the sequence assembly to define
    distinct genes on the genome?

Protein
Sequence Assembly
Ensembl Geneset
9
Once the Assembly is Imported
  • Proteins/mRNAs are aligned to the assembly.
  • These have been submitted to databases such as
  • UniProt/Swiss-Prot (manually curated) and
  • RefSeq (partially manually curated)

10
The Biological Evidence
All Ensembl gene predictions are based on
experimental evidence
  • UniProt/Swiss-Prot
  • A manually curated database and therefore of
    highest accuracy
  • NCBI RefSeq
  • A partially manually curated database
  • UniProt/TrEMBL
  • Automatically annotated translations of EMBL
    coding sequence (CDS) features
  • EMBL / GenBank / DDBJ
  • Primary nucleotide sequence repository

11
Database Relationship
NCBI RefSeq
EMBL-Bank DDBJ GenBank
Individual Labs Submission
UniProt
Swiss-Prot
TrEMBL
12
Genebuild
Sequence (Assembly)
Manual annotation (HAVANA)
EMBL-Bank GenBank DDBJ
Proteins (e.g. Swiss-Prot)
Ensembl
mRNA
EST genes
13
Why do I want to know?
  • Ensembl genes may be based on multiple
    protein/mRNAs
  • What is an Ensembl gene based on?

14
Task
  • Look at the evidence for the human EPO gene.
  • What was this gene based on?
  • Hint: Go to Exon Information from the GeneView
    page

15
EPO gene supporting evidence
16
Species-Specific GeneBuilds
  • Pan troglodytes genes are built by projection
    from human genes.
  • Zebrafish has many gene duplications.

Homo sapiens genes must have protein evidence,
not just mRNA.
17
Task
  • When was the chimpanzee (Pan troglodytes)
    Genebuild performed?
  • Can you find information as to how genes were
    annotated?
  • Hint: Look on the chimpanzee index page

18
External Gene Set: VEGA/Havana
  • Human, zebrafish, mouse and dog
  • Havana transcripts in blue or gold
  • What are Havana transcripts?

19
Automatic vs Manual Annotation
  • Automatic Annotation
  • (Ensembl Genebuild)
  • Quick
  • Use unfinished sequence or shotgun assembly
  • Consistent annotation

  • Manual Annotation
  • (Havana)
  • Flexible, can deal with inconsistencies
  • Most rules have exceptions
  • Consult publications as well as databases
  • Out-of-the-ordinary biology
  • However: slow
  • Need finished sequence
20
Havana and Ensembl match
When Havana (manual curation) and Ensembl
(automatic methods) predict the same transcript,
base pair for base pair, the transcripts are merged
and coloured gold.
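
The merge rule above can be sketched in a few lines. This is only an illustration, not the Ensembl pipeline itself: the exon-coordinate representation and the names `same_structure` and `label_transcript` are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: a transcript model is a list of
# (start, end) exon coordinates on one chromosome and strand.
Exon = Tuple[int, int]


def same_structure(exons_a: List[Exon], exons_b: List[Exon]) -> bool:
    """True only if both exon lists match base pair for base pair."""
    return sorted(exons_a) == sorted(exons_b)


def label_transcript(havana: List[Exon], ensembl: List[Exon]) -> str:
    """Return 'merged' for identical models, otherwise 'separate'."""
    return "merged" if same_structure(havana, ensembl) else "separate"


havana_model = [(100, 250), (400, 520), (900, 1100)]
ensembl_same = [(100, 250), (400, 520), (900, 1100)]
ensembl_diff = [(100, 250), (400, 523), (900, 1100)]  # one exon end differs by 3 bp

print(label_transcript(havana_model, ensembl_same))
print(label_transcript(havana_model, ensembl_diff))
```

A single differing exon boundary is enough to keep the two models separate in this sketch, which matches the "base pair for base pair" wording on the slide.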
21
Manually-curated gene sets in Ensembl
  • Vega (Havana)
  • Homo sapiens, Danio rerio,
  • Mus musculus and Canis familiaris
  • WormBase
  • Caenorhabditis elegans
  • FlyBase
  • Drosophila melanogaster
  • SGD
  • Saccharomyces cerevisiae

22
Consensus coding sequences (CCDS)
  • Collaboration between NCBI, UCSC, Ensembl and
    Havana to agree on a coding sequence for a
    transcript.
  • The long term aim is to have a single gene set
    for human
  • http://www.ncbi.nlm.nih.gov/CCDS/
  • The genebuild pipeline has been modified to
    retain these CDSs

23
What Can Go Wrong?
  • I) A gap in the assembly
  • Gene might not be found in Ensembl
  • II) Fused genes
  • BLAST hit (Swiss-Prot entry)
  • Gene might be associated with two names
24
Outline
  • The genome sequence
  • The Genebuild
  • manual curation by Havana
  • Other: EST gene set
  • Pseudogenes
  • ncRNAs

25
Expressed Sequence Tags vs cDNA
  • ESTs are annotated separately. Why?
  • mRNA and cDNA used in the GeneBuild
  • Sequenced to high standard, often complete.
  • ESTs: lower quality sequence.
  • One-shot sequencing of cDNA from the 5' and 3'
    ends creates the EST sequence.
  • ESTs are only 500-800 nucleotides long
  • Low quality fragment: sequence error of 2%.
  • BUT confers useful expression information
  • Discovery of new genes, especially in diseased
    organisms
  • Tissue type
  • Timing/developmental stage
  • Samples more transcripts, variants
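
To put the quality figures above in perspective, here is a small worked calculation. It assumes the slide's error figure means 2% of bases; the function name `expected_errors` is made up for the example.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in a read of the given length."""
    return read_length * error_rate


ERROR_RATE = 0.02  # assumption: the slide's error figure read as 2% per base

# Read lengths taken from the 500-800 nucleotide range on the slide.
for length in (500, 800):
    print(length, "nt:", expected_errors(length, ERROR_RATE), "expected errors")
```

Under that assumption a single EST carries roughly 10 to 16 wrong bases, which is one reason ESTs are kept in a separate gene set rather than mixed into the main GeneBuild evidence.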

26
Where Can I See This EST Geneset? ContigView
Choose EST genes
EST track
27
Pseudogenes: False Genes
28
ncRNAs (non coding RNAs)
  • What types are in Ensembl?
  • tRNA (transfer RNA)
  • rRNA (ribosomal RNA)
  • scRNA (small cytoplasmic)
  • snRNA (small nuclear)
  • snoRNA (small nucleolar)
  • miRNA (microRNA)

29
ncRNAs (2 types)
  • I) RNA with low sequence homology can be
    identified through conserved secondary structure
    (search genome using Rfam pattern)
  • II) High sequence conservation (miRNA)
  • BLAST alignment
  • RNAfold applied to check that
  • sequences can fold (hairpin)
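
The idea behind the hairpin check can be illustrated with a toy test. This is not RNAfold and does no energy calculation: it only asks whether a short stem is followed, after a loop, by its reverse complement. The function names, the stem length, and the minimum loop size are all assumptions for the example, and the input is assumed to contain only A, C, G and U.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def can_form_hairpin(rna: str, stem_len: int = 4, min_loop: int = 3) -> bool:
    """True if some stem_len-mer is followed, after at least min_loop
    bases, by its reverse complement somewhere downstream."""
    rna = rna.upper()
    last_start = len(rna) - 2 * stem_len - min_loop
    for i in range(last_start + 1):
        stem = rna[i:i + stem_len]
        partner = reverse_complement(stem)
        # The partner must start after the stem plus the minimum loop.
        if rna.find(partner, i + stem_len + min_loop) != -1:
            return True
    return False


print(can_form_hairpin("GGGGAAAACCCC"))
print(can_form_hairpin("AAAAAAAAAAAA"))
```

A real structure predictor also allows G-U pairs, bulges and mismatches, so this sketch only conveys why a candidate miRNA sequence has to be able to pair with itself.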

30
ncRNAs: where can I see them?
  • Find them in ContigView
  • or use BioMart.

31
Summary: Ensembl Genes
  • All Ensembl genes are based on biological
    evidence (protein and mRNA)
  • One Ensembl gene may come from proteins and mRNAs
    in various databases.
  • Havana (manually curated) genes are incorporated
    into the Ensembl geneset, merged for human.
  • The CCDS set strives for consensus coding
    sequences across databases.
  • Pseudogenes and RNAs are annotated, along with a
    separate EST gene set.

32
For more on GeneBuild
  • Help and Documentation
  • (About Ensembl)

http://www.ensembl.org/info/about/docs/genome_annotation.html