Evaluating genes and transcripts in Ensembl - PowerPoint PPT Presentation

Provided by: xosmfr
1
Evaluating genes and transcripts in Ensembl
November 2004
2
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

3
Overview
manual curation
Ensembl transcript predictions
evidence
other groups' models
4
Annotation process
  • Automated analysis
    • Repeat masking
      • RepeatMasker (Smit), tandem, inverted
    • Gene prediction
      • Genscan (Burge), FGENESH (Solovyev)
    • Database searches
      • initial protein and DNA matches using BLAST
      • refined protein matches using Genewise (wise2)
      • refined EST matches using EST2GENOME, spangle
      • Pfam annotation using halfwise (wise2)
  • Integrate results, display, annotate
    • ACEDB, web-based tools (spangle)
  • Investigate gene predictions experimentally
  • Submit to EMBL

5
Strategies for analysis of unfinished sequence
Unfinished sequence to predicted genes, by two routes:
  • Slow: full analysis at DNA level
  • Fast: predicted peptides (Genscan), then analysis at protein level
6
The challenge
  • 4.3Gb DNA (480,000 sequences)
  • 16 analysis types
  • Automatic submission of jobs
  • Dependencies between them
  • Track their progress
  • Retry failed jobs
  • Need access to large file based databases
  • Store output
  • Easy to include new analysis types
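
The requirements listed here (automatic submission, dependencies between jobs, progress tracking, retrying failures) can be illustrated with a small dependency-aware job runner. This is only a sketch under stated assumptions: the function `run_pipeline`, the job names and the retry limit are hypothetical and are not taken from the Ensembl pipeline code.

```python
from graphlib import TopologicalSorter


def run_pipeline(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying failures.

    jobs: dict of name -> zero-argument callable (raises on failure)
    deps: dict of name -> set of job names that must finish first
          (every name in deps is assumed to be a key of jobs)
    Returns a dict of name -> "done", "failed" or "skipped".
    """
    status = {}
    graph = {name: deps.get(name, set()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        # A job cannot run if any prerequisite did not finish.
        if any(status.get(d) != "done" for d in graph[name]):
            status[name] = "skipped"
            continue
        for _attempt in range(1 + max_retries):
            try:
                jobs[name]()
                status[name] = "done"
                break
            except Exception:
                # Remember the failure; the loop retries if attempts remain.
                status[name] = "failed"
    return status
```

The returned status table is what a tracking database would record per job; adding a new analysis type means adding one entry to `jobs` and, if needed, one to `deps`.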

7
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

8
Ensembl gene set
  • Place all available species-specific proteins to
    make transcripts
  • Place similar proteins to make transcripts
  • Use mRNA data to add UTRs
  • Build transcripts using cDNA evidence
  • Build additional transcripts using Genscan
    homology evidence
  • Combine annotations to make genes with
    alternative transcripts

10
Using BLAST - how not to do it
  • 3.5Gb DNA
  • BLAST against all known proteins/cDNA
  • Realign using Genewise and est2genome
  • 12 h x 35,000 = 17,500 days
  • 24 h x 35,000 = 35,000 days
  • Total 52,500 days CPU

11
The trouble with BLAST
BLAST is good for finding possible exon
positions in large genomic sequences.
12
BLAST replacements
  • Exonerate (Guy Slater)
    • Fast gapped DNA-DNA matcher
    • 10,000 x faster than BLAST
  • Pmatch (Richard Durbin)
    • Fast exact protein-DNA matcher
    • >10,000 x faster than BLAST

13
A better way
  • 3.5Gb DNA
  • Fast exact match against known proteins/cDNAs
  • Realign using minigenewise and miniest2genome
  • 26 days
  • 42 days
  • Total 68 days
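
The CPU-time figures for the two approaches can be checked with a little arithmetic. The sketch below assumes that the BLAST-based figures are per-unit run times in hours (12 and 24) multiplied by 35,000 units of sequence, which is the only reading under which the quoted day totals work out. The units are not stated on the slide, and the assignment of 12 h to the BLAST step and 24 h to the realignment step is also an assumption.

```python
UNITS = 35_000        # assumed: number of sequence units in 3.5Gb
HOURS_PER_DAY = 24


def cpu_days(hours_per_unit, units=UNITS):
    """Convert a per-unit run time in hours into total CPU days."""
    return hours_per_unit * units / HOURS_PER_DAY


blast_days = cpu_days(12)      # assumed: initial BLAST search
realign_days = cpu_days(24)    # assumed: Genewise / est2genome realignment
slow_total = blast_days + realign_days

fast_total = 26 + 42           # days quoted for the exact-match approach
speedup = slow_total / fast_total

print(f"BLAST-based: {slow_total:,.0f} CPU days")
print(f"Exact-match: {fast_total} CPU days ({speedup:.0f}x less)")
```

Under these assumptions the exact-match route needs several hundred times less CPU than the BLAST-based route.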

14
Protein Based Gene Prediction
15
Gene build is protein based
  • DNA-DNA alignments don't give us translatable
    genes
  • Essential to align at the protein level allowing
    for frameshifts and splice sites
  • Genewise (Ewan Birney)
    • Protein genomic alignment
    • Has splice site model
    • Penalises stop codons
    • Allows for frameshifts

E Birney et al. Genome Research 14 (2004)
16
Genes from proteins


Other protein sequences Swall
blast v. assembly
blast and Miniseq
R. Durbin, unpublished
17
Miniseq - the need for speed
Minigenomic 1kb on either side run Genewise
Map back to genomic
Spliced alignment
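
The miniseq idea (cut out the hit regions plus flanking sequence, align against that short sequence, then map the result back) can be sketched as coordinate bookkeeping. This is an illustration only: the function names, the 0-based half-open coordinates and the merging of overlapping padded regions are assumptions, not the Ensembl implementation.

```python
def build_miniseq(hits, genome_length, padding=1000):
    """Merge padded hit regions into blocks of a mini-genomic sequence.

    hits: list of (start, end) genomic intervals, 0-based, end exclusive
    Returns a list of (mini_start, genomic_start, length) blocks.
    """
    padded = sorted((max(0, s - padding), min(genome_length, e + padding))
                    for s, e in hits)
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    blocks, offset = [], 0
    for start, end in merged:
        blocks.append((offset, start, end - start))
        offset += end - start
    return blocks


def mini_to_genomic(pos, blocks):
    """Map a position in the mini sequence back to genomic coordinates."""
    for mini_start, genomic_start, length in blocks:
        if mini_start <= pos < mini_start + length:
            return genomic_start + (pos - mini_start)
    raise ValueError("position outside the mini sequence")
```

The slow aligner then only sees the concatenated blocks, and each exon boundary it reports is converted with `mini_to_genomic`.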
18
The basic rule
Fast
> 1Mb DNA: Pmatch/Blat/Exonerate/SSAHA
100kb - 1Mb DNA: Blast and miniseq
< 100kb: Genewise/est2genome/sim4
Slow
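
Read as a lookup, the rule picks a class of tool from the length of the target sequence. A minimal sketch, in which the function name is hypothetical and the handling of the exact boundary values (100kb and 1Mb) is an assumption:

```python
def choose_aligner(sequence_length_bp):
    """Suggest an alignment tool class for a target of the given length."""
    if sequence_length_bp > 1_000_000:
        return "Pmatch/Blat/Exonerate/SSAHA"   # fast, for the largest targets
    if sequence_length_bp >= 100_000:
        return "Blast and miniseq"
    return "Genewise/est2genome/sim4"          # slow, for short targets
```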
19
UTR Addition
20
Adding UTRs

protein - Genewise (phases, no UTRs)
cDNA - exonerate (UTRs, no phases)
Genewise prediction
Combined prediction
21
The Genebuilder
22
Make genes
  • Combine results from Genewise and ab initio
    predictions
  • Group transcripts which share exons
  • Reject non-translating transcripts
  • Remove duplicate exons
  • Attach supporting evidence
  • Write genes to database
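
The grouping step can be sketched as clustering: transcripts that share an exon end up in the same gene, transitively. The sketch below uses a union-find structure and treats "sharing" as having an exon with identical coordinates. That definition, the function name and the data layout are assumptions for illustration; the real Genebuilder may use a different criterion such as exon overlap.

```python
def group_transcripts(transcripts):
    """Cluster transcripts into genes when they share at least one exon.

    transcripts: dict of transcript id -> list of (start, end) exons
    Returns a list of genes, each a sorted list of transcript ids.
    """
    parent = {tid: tid for tid in transcripts}

    def find(tid):
        while parent[tid] != tid:
            parent[tid] = parent[parent[tid]]   # path halving
            tid = parent[tid]
        return tid

    first_owner = {}
    for tid, exons in transcripts.items():
        for exon in set(exons):                 # set() drops duplicate exons
            if exon in first_owner:
                parent[find(tid)] = find(first_owner[exon])
            else:
                first_owner[exon] = tid

    genes = {}
    for tid in transcripts:
        genes.setdefault(find(tid), []).append(tid)
    return sorted(sorted(members) for members in genes.values())
```

For example, if T1 and T2 share one exon and T2 and T3 share another, all three are placed in one gene even though T1 and T3 have no exon in common.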

23
Addition of cDNA transcripts
Genewise genes with UTRs
24
GeneBuild Summary
25
Pseudogene Analysis
Transcripts
  • Single Exon transcripts with only frameshifts
  • Single Exon transcripts whose evidence is
    spliced elsewhere
  • Transcripts whose introns are > 80% repeat
    sequence

A gene is labeled as a pseudogene if all
transcripts in that gene are labeled as pseudogenes
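
Two of these rules are easy to express in code: the repeat content of a transcript's introns, and the gene-level label. The sketch below reads the gene-level rule as "a gene is a pseudogene if all of its transcripts are flagged", and the threshold as 80% of intron bases. The function names and the interval representation are hypothetical.

```python
def intron_repeat_fraction(introns, repeats):
    """Fraction of intron bases covered by repeats.

    introns, repeats: lists of (start, end), 0-based, end exclusive.
    Intervals within each list are assumed not to overlap each other.
    """
    intron_bp = sum(end - start for start, end in introns)
    if intron_bp == 0:
        return 0.0
    covered = sum(max(0, min(i_end, r_end) - max(i_start, r_start))
                  for i_start, i_end in introns
                  for r_start, r_end in repeats)
    return covered / intron_bp


def is_pseudogene(transcript_flags):
    """True only if the gene has transcripts and all are flagged pseudo."""
    return bool(transcript_flags) and all(transcript_flags)
```

A transcript would be flagged when `intron_repeat_fraction(...) > 0.8`, and the per-transcript flags are then passed to `is_pseudogene`.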
26
Latest Full Human Build
  • NCBI 35 assembly, released May 2004
  • Ensembl genes: 24,195
  • Ensembl coding transcripts: 35,913
  • (plus 1,974 pseudogenes)
  • Ensembl exons: 245,938
  • Input human seqs: 48,176 proteins, 86,918 cDNAs
  • Transcripts made from
    • Human proteins with (without) UTRs: 73% (12%)
    • Non-human proteins with (without) UTRs: 2% (12%)

27
Gene building summary
  • Initial location of possible genes using Genscan
    peptides and BLAST.
  • Reblasting of all high scoring proteins with
    BLAST to find regions Genscan has missed
  • Realignment of proteins using Genewise
  • mRNA/EST genes built using Genscan exons.

28
Comparison to manual annotation
Genes
  • Sensitivity: 90% of manual genes are in Ensembl
  • Specificity: 75% of Ensembl genes are in the
    manual sets
Exon bps
  • Sensitivity: 70% of manual bps are in Ensembl
    exons (90% of coding bps)
  • Specificity: 80% of Ensembl bps are in
    manual exons
Alternative transcripts per gene
  • manual 3, Ensembl 1.3
Figures are for the gene build on NCBI 33 (human)
and manual annotation for chr 6, 13 and 14
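
Base-pair sensitivity and specificity, as the terms are used in this comparison, can be computed from two sets of exon intervals. A minimal sketch, assuming 0-based half-open intervals on a single sequence; the function names are hypothetical, and the set-based approach is chosen for clarity rather than speed.

```python
def covered_positions(intervals):
    """Set of base positions covered by (start, end) intervals, end exclusive."""
    return {pos for start, end in intervals for pos in range(start, end)}


def exon_bp_accuracy(predicted_exons, manual_exons):
    """Return (sensitivity, specificity) at the base-pair level.

    sensitivity: fraction of manually annotated bases that are predicted
    specificity: fraction of predicted bases that are manually annotated
    Both inputs must cover at least one base.
    """
    predicted = covered_positions(predicted_exons)
    manual = covered_positions(manual_exons)
    shared = len(predicted & manual)
    return shared / len(manual), shared / len(predicted)
```

Note that "specificity" in this sense is the fraction of predictions that are confirmed, which other fields often call precision.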
29
Configuring the Genebuild
Data availability
  • Targeted build most useful in mouse, rat, human
  • Similarity build more important in other species
Structural issues
  • Zebrafish: many similar genes near each other,
    genome from different haplotypes
  • C. briggsae: very dense genome, short introns
  • Mosquito: many single-exon genes, genes within
    genes
Configuration files provide flexibility
30
Current releases
We also display T. nigroviridis, D. melanogaster
and C. elegans
31
Evaluating genes and transcripts
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

32
EST GeneBuild Summary
33
Initial EST Alignment
Human ESTs
Exonerate
Aligned ESTs
34
EST analysis

Map ESTs using Exonerate (determine coverage,
identity and location in genome)
Filter on identity and depth (of 5.5 million ESTs
from dbEST, we map about 1/3)
35
EST GeneBuild Summary
Exonerate
ClusterMerge
36
Alt-Splicing from ESTs
Isoforms
37
Display of ESTs and EST genes
Latest full human build: NCBI 34 assembly,
released Jul 04
EST genes: 24,980
EST transcripts: 43,710
EST transcripts
Human ESTs
Display limited to 7 at any one point; full data
accessible in the databases
38
Evaluating genes and transcripts
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

39
Ab initio Genscan predictions
Genscan prediction
Evidence supporting Genscan exons
40
Evaluating genes and transcripts
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

41
Manual curation (human)
  • Manual annotation of finished clones
  • New Vega system for storing data
    http://vega.sanger.ac.uk
  • Currently has chromosomes 6, 9, 10, 13, 20, 22
    and X from Sanger, 14 from Genoscope and 7 from
    Washington University.
  • Other groups will also contribute to Vega
  • Displayed in Ensembl when available

42
Manual curation (other species)
  • C. elegans / Drosophila: main gene set in
    Ensembl is the WormBase / FlyBase
    manually-curated set
  • Vega system includes some manually-curated
    finished clones for mouse and zebrafish

43
Vega genes
Vega manual curation
44
Evaluating genes and transcripts
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

45
Gene models from other groups
Turn on DAS sources
Other models as DAS sources
FASTAView display
46
Evaluating genes and transcripts
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

47
Known v. novel transcripts
  • Naming takes place after the gene build is
    completed
  • Transcripts/proteins mapped to SwissProt, RefSeq
    and SPTrEMBL entries
  • If mapped: known; if not: novel
  • Require high sequence similarity, but allow
    incomplete coverage
  • Note
    • Difficult for families of closely-related genes
    • Wrongly annotated pseudogenes may also cause
      problems
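
The known/novel decision can be sketched as a simple test over a transcript's mappings to external entries. The identity threshold of 90% below is purely hypothetical (no cut-off is given here), as are the function name and the record layout. Coverage is not checked, in line with the rule that incomplete coverage is allowed.

```python
def classify_transcript(mappings, min_identity=90.0):
    """Return "known" if any external entry maps with high identity.

    mappings: list of dicts with "db", "accession" and "identity" (percent)
    min_identity: hypothetical threshold, not taken from the source
    """
    for mapping in mappings:
        if mapping["identity"] >= min_identity:
            return "known"
    return "novel"
```

With closely related gene families, several genes can pass this test against the same external entry, which is the difficulty noted in the list.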

48
Evaluating genes and transcripts
  • Ensembl gene set
  • Ensembl EST genes
  • Ab initio predictions
  • Manual curation (Vega / Sanger)
  • Gene models from other groups
  • Known v. novel genes
  • Gene names and descriptions

49
Names and descriptions
  • Names taken from mapped database entries
  • Official HGNC (HUGO) name used if available (or
    equivalent for other species)
  • Otherwise Swiss-Prot > RefSeq > SPTrEMBL
  • Novel transcripts have only Ensembl stable ids
  • Genes named after best-named transcript
  • Gene description taken from mapped database
    entries (source given)
  • Hints
    • Orthology can provide useful confirmation
    • If no description, check for any Family
      description
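
The naming priority can be written as an ordered lookup: take the HGNC name if one is mapped, otherwise fall through Swiss-Prot, RefSeq and SPTrEMBL, and use the Ensembl stable id for novel transcripts. A minimal sketch in which the function name, the source labels and the example identifiers are illustrative placeholders rather than real database records:

```python
SOURCE_PRIORITY = ["HGNC", "Swiss-Prot", "RefSeq", "SPTrEMBL"]


def pick_name(xrefs, stable_id):
    """Choose a display name from mapped entries, else the stable id.

    xrefs: list of (source, name) pairs mapped to the transcript
    stable_id: Ensembl stable id, used when nothing is mapped
    """
    for source in SOURCE_PRIORITY:
        for xref_source, name in xrefs:
            if xref_source == source:
                return name
    return stable_id


# Placeholder values for illustration only.
print(pick_name([("RefSeq", "REFSEQ_NAME"), ("Swiss-Prot", "SWISSPROT_NAME")],
                "STABLE_ID_EXAMPLE"))
```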

50
GeneView
ExonView
Gene name and description
Alternative transcripts
Links to putative orthologues
links to ExonView
Transcript name
Mapping to external databases
51
Evidence tracks in ContigView
Expanded tracks
Compressed tracks
52
Outlook
  • Improved pseudogene annotation, for all species
  • Upstream regulatory elements - using CpG islands,
    Eponine predictions, motifs to aid in
    prediction of transcription start sites
  • Improve use of cDNAs - can already use to add
    alternatively spliced transcripts
  • Improve UTR extension
  • Make use of comparative data
  • New species: Xenopus, cow and opossum are on the
    horizon

53
Acknowledgements
Genebuilders: Dan Andrews, Martin Hammond, Kevin
Howe, Vivek Iyer, Kerstin Jekosch, Jan-Hinnerk
Vogel, Mario Caccamo, Simon White, Laura Clarke,
Guy Slater, Steve Searle, Val Curwen
And the rest of the Ensembl team!