Title: Evaluating genes and transcripts in Ensembl
1Evaluating genes and transcripts in Ensembl
November 2004
2- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
3Overview
manual curation
Ensembl transcript predictions
evidence
other groups models
4Annotation process
- Automated analysis
- Repeat masking
- RepeatMasker (Smit), tandem, inverted
- Gene prediction
- Genscan (Burge), FGENESH (Solovyev)
- Database searches
- initial protein and DNA matches using BLAST
- refined protein matches using Genewise (wise2)
- refined EST matches using EST2GENOME, spangle
- Pfam annotation using halfwise (wise2)
- Integrate results, display, annotate
- ACEDB, web-based tools (spangle)
- Investigate gene predictions experimentally
- Submit to EMBL
5Strategies for analysis of unfinished sequence
Full analysis at DNA level
Slow
Unfinished Sequence
Predicted Genes
Fast
Predicted peptides (Genscan)
Analysis at Protein level
6The challenge
- 4.3Gb DNA (480,000 sequences)
- 16 analysis types
- Automatic submission of jobs
- Dependencies between them
- Track their progress
- Retry failed jobs
- Need access to large file based databases
- Store output
- Easy to include new analysis types
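The requirements above (automatic submission, dependencies between jobs, progress tracking, retrying failures) can be illustrated with a minimal in-process sketch. This is not the Ensembl pipeline code; the function name `run_pipeline`, the status strings and the retry count are assumptions made for illustration only.

```python
from collections import deque

def run_pipeline(jobs, deps, max_retries=2):
    """Run jobs in dependency order, retrying failures.

    jobs: dict of job name -> zero-argument callable (hypothetical analyses)
    deps: dict of job name -> list of job names that must finish first
    Returns a dict of job name -> "done", "failed" or "skipped".
    """
    status = {}
    pending = deque(jobs)
    stalled = 0  # consecutive jobs deferred because dependencies were not ready
    while pending and stalled < len(pending):
        name = pending.popleft()
        needed = deps.get(name, [])
        if any(status.get(d) in ("failed", "skipped") for d in needed):
            status[name] = "skipped"  # an upstream job failed for good
            stalled = 0
            continue
        if not all(status.get(d) == "done" for d in needed):
            pending.append(name)  # dependencies not finished yet, try later
            stalled += 1
            continue
        stalled = 0
        for _attempt in range(1 + max_retries):
            try:
                jobs[name]()
                status[name] = "done"
                break
            except Exception:
                status[name] = "failed"  # overwritten if a retry succeeds
    for name in pending:
        status[name] = "skipped"  # unresolved (e.g. circular) dependencies
    return status
```

A new analysis type is added by putting one more entry in `jobs` and, if needed, in `deps`; the returned status table is the progress record.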
7- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
8Ensembl gene set
- Place all available species-specific proteins to make transcripts
- Place similar proteins to make transcripts
- Use mRNA data to add UTRs
- Build transcripts using cDNA evidence
- Build additional transcripts using Genscan homology evidence
- Combine annotations to make genes with alternative transcripts
10Using BLAST - how not to do it
- 3.5Gb DNA
- BLAST against all known proteins/cDNA
- Realign using Genewise and est2genome
- 12 x 35000 = 17,500 days
- 24 x 35000 = 35,000 days
- Total 52,500 days CPU
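The CPU estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes that 12 and 24 are CPU hours per piece and that 35000 is the number of pieces (3.5 Gb cut into 100 kb pieces); the slide does not state the units, but these are the ones under which the figures add up.

```python
# Assumption: 12 and 24 are CPU hours per piece; 35000 is the number of
# pieces obtained by cutting 3.5 Gb into 100 kb pieces.
GENOME_BP = 3_500_000_000
N_PIECES = 35_000
HOURS_PER_DAY = 24

def cpu_days(hours_per_piece, n_pieces=N_PIECES):
    """Total CPU days when every piece costs hours_per_piece CPU hours."""
    return hours_per_piece * n_pieces // HOURS_PER_DAY

piece_size_bp = GENOME_BP // N_PIECES   # size of one piece in bp
blast_days = cpu_days(12)               # BLAST step
realign_days = cpu_days(24)             # Genewise / est2genome step
total_days = blast_days + realign_days

print(piece_size_bp, blast_days, realign_days, total_days)
# prints: 100000 17500 35000 52500
```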
11The trouble with BLAST
BLAST is good for finding possible exon positions in large genomic sequences.
12BLAST replacements
- Exonerate (Guy Slater)
- Fast gapped DNA-DNA matcher
- 10,000 x faster than BLAST
- Pmatch (Richard Durbin)
- Fast exact protein-DNA matcher
- >10,000 x faster than BLAST
13A better way
- 3.5Gb DNA
- Fast exact match against known proteins/cDNAs
- Realign using minigenewise and miniest2genome
- 26 days
- 42 days
- Total 68 days
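For scale, the 68 days here can be set against the 52,500 CPU days of the BLAST-based approach on the earlier slide. The short calculation below uses only those figures from the slides.

```python
# Figures taken from the two slides; nothing here is measured.
naive_days = 17_500 + 35_000   # BLAST + Genewise/est2genome realignment
fast_days = 26 + 42            # exact matching + mini realignment
speedup = naive_days / fast_days

print(f"{naive_days} days -> {fast_days} days, about {speedup:.0f}x less CPU time")
# prints: 52500 days -> 68 days, about 772x less CPU time
```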
14Protein Based Gene Prediction
15Gene build is protein based
- DNA-DNA alignments don't give us translatable genes
- Essential to align at the protein level allowing for frameshifts and splice sites
- Genewise (Ewan Birney)
- Protein genomic alignment
- Has splice site model
- Penalises stop codons
- Allows for frameshifts
E Birney et al. Genome Research 14 (2004)
16Genes from proteins
Other protein sequences Swall
blast v. assembly
blast and Miniseq
R. Durbin, unpublished
17Miniseq - the need for speed
Minigenomic 1kb on either side run Genewise
Map back to genomic
Spliced alignment
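The miniseq idea (cut out the hit regions with 1 kb on either side, align against that short sequence, then map the result back) can be sketched as follows. This is an illustrative reconstruction, not the Ensembl implementation: the function names and the 0-based half-open coordinate convention are assumptions.

```python
def build_miniseq(genomic, hits, pad=1000):
    """Concatenate padded hit regions into a short 'minigenomic' sequence.

    hits: list of (start, end) hit coordinates on genomic, 0-based half-open.
    Returns (miniseq, segments), where each segment is
    (mini_start, genomic_start, length).
    """
    # Pad every hit by `pad` bases and clip to the sequence ends.
    padded = sorted((max(0, s - pad), min(len(genomic), e + pad)) for s, e in hits)
    merged = []
    for s, e in padded:
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)  # overlapping regions fuse
        else:
            merged.append([s, e])
    pieces, segments, offset = [], [], 0
    for s, e in merged:
        pieces.append(genomic[s:e])
        segments.append((offset, s, e - s))
        offset += e - s
    return "".join(pieces), segments

def mini_to_genomic(pos, segments):
    """Map a position on the miniseq back to genomic coordinates."""
    for mini_start, genomic_start, length in segments:
        if mini_start <= pos < mini_start + length:
            return genomic_start + (pos - mini_start)
    raise ValueError("position outside the miniseq")
```

The slow aligner (Genewise on the slide) then only sees the short concatenated sequence, and exon coordinates from its output are converted with `mini_to_genomic`.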
18The basic rule
Fast
> 1Mb DNA
Pmatch/Blat/Exonerate/SSAHA
100kb - 1Mb DNA
Blast and miniseq
< 100kb
Genewise/est2genome/sim4
Slow
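The rule reads as a simple dispatch on sequence length. A minimal sketch, assuming the boundaries are inclusive as coded below (the slide does not say which side the exact 100 kb and 1 Mb values fall on):

```python
def choose_aligner(seq_length_bp):
    """Pick the class of tool suggested by the slide for a given DNA length."""
    if seq_length_bp > 1_000_000:
        return "Pmatch/Blat/Exonerate/SSAHA"   # fast methods for > 1 Mb
    if seq_length_bp >= 100_000:
        return "Blast and miniseq"             # 100 kb to 1 Mb
    return "Genewise/est2genome/sim4"          # slow, careful methods for < 100 kb
```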
19UTR Addition
20Adding UTRs
protein - Genewise (phases, no UTRs)
cDNA - exonerate (UTRs, no phases)
Combined prediction
Genewise prediction
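One way to picture the combination step: the protein-based model supplies the coding region, the cDNA alignment supplies the exon structure including UTRs, and the two are merged only when their introns agree across the coding region. The sketch below is a simplified illustration under that assumption, for forward-strand models with 0-based half-open exon coordinates; it is not the Ensembl algorithm, and the function names are made up here.

```python
def introns(exons):
    """Introns implied by a sorted exon list, as (exon_end, next_exon_start)."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]

def add_utrs(cds_exons, cdna_exons):
    """Combine a coding model (no UTRs) with a cDNA alignment (has UTRs).

    Returns a dict with the transcript exons, the coding span and a flag
    saying whether the cDNA structure was used.
    """
    cds_start, cds_end = cds_exons[0][0], cds_exons[-1][1]
    covers = cdna_exons[0][0] <= cds_start and cdna_exons[-1][1] >= cds_end
    # cDNA introns that overlap the coding span must match the coding model.
    inner = [i for i in introns(cdna_exons) if i[0] < cds_end and i[1] > cds_start]
    if not covers or inner != introns(cds_exons):
        return {"exons": list(cds_exons), "cds": (cds_start, cds_end), "utr_added": False}
    return {"exons": list(cdna_exons), "cds": (cds_start, cds_end), "utr_added": True}
```

When the structures disagree, the coding model is returned unchanged, which corresponds to a prediction without UTRs.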
21The Genebuilder
22Make genes
- Combine results from Genewise and ab initio predictions
- Group transcripts which share exons
- Reject non-translating transcripts
- Remove duplicate exons
- Attach supporting evidence
- Write genes to database
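The grouping step can be sketched as a clustering problem: transcripts that share an exon, directly or through a chain of other transcripts, end up in the same gene. The example below assumes that "share exons" means identical exon coordinates, which the slide does not specify, and uses a small union-find; names are illustrative.

```python
def cluster_transcripts(transcripts):
    """Group transcripts into genes when they share an identical exon.

    transcripts: dict of transcript id -> list of (start, end) exons.
    Returns a sorted list of genes, each a sorted list of transcript ids.
    """
    parent = {t: t for t in transcripts}

    def find(t):
        # Follow parent links to the cluster representative (with path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    first_seen = {}  # exon -> first transcript that contained it
    for tid, exons in transcripts.items():
        for exon in exons:
            if exon in first_seen:
                parent[find(tid)] = find(first_seen[exon])  # merge clusters
            else:
                first_seen[exon] = tid

    genes = {}
    for tid in transcripts:
        genes.setdefault(find(tid), []).append(tid)
    return sorted(sorted(g) for g in genes.values())
```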
23Addition of cDNA transcripts
Genewise genes with UTRs
24GeneBuild Summary
25Pseudogene Analysis
Transcripts
- Single-exon transcripts with only frameshifts
- Single-exon transcripts whose evidence is spliced elsewhere
- Transcripts whose introns are > 80% repeat sequence
A gene is labeled as a pseudogene if all transcripts in that gene are labeled as pseudo
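These rules translate into a small rule-based classifier. The sketch below is one reading of the slide, not the Ensembl code: the field names are hypothetical, and "with only frameshifts" is interpreted simply as a single-exon transcript that contains frameshifts.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    # All field names are illustrative assumptions.
    n_exons: int
    has_frameshifts: bool = False
    evidence_spliced_elsewhere: bool = False
    intron_repeat_fraction: float = 0.0  # fraction of intron bases in repeats

def is_pseudo_transcript(t):
    """Apply the three transcript-level rules from the slide."""
    single_exon = t.n_exons == 1
    if single_exon and t.has_frameshifts:
        return True
    if single_exon and t.evidence_spliced_elsewhere:
        return True
    if t.n_exons > 1 and t.intron_repeat_fraction > 0.80:
        return True
    return False

def is_pseudogene(transcripts):
    """A gene is a pseudogene only if every one of its transcripts is pseudo."""
    return bool(transcripts) and all(is_pseudo_transcript(t) for t in transcripts)
```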
26Latest Full Human Build
- NCBI 35 assembly, released May 2004
- Ensembl genes: 24,195
- Ensembl coding transcripts: 35,913
- (plus 1,974 pseudogenes)
- Ensembl exons: 245,938
- Input human seqs: 48,176 proteins, 86,918 cDNAs
- Transcripts made from
- Human proteins with (without) UTRs: 73% (12%)
- Non-human proteins with (without) UTRs: 2% (12%)
27Gene building summary
- Initial location of possible genes using Genscan peptides and BLAST
- Reblasting of all high scoring proteins with BLAST to find regions Genscan has missed
- Realignment of proteins using Genewise
- mRNA/EST genes built using Genscan exons
28Comparison to manual annotation
Genes
- Sensitivity: 90% of manual genes are in the gene build
- Specificity: 75% of genes are in the manual sets
Exon bps
- Sensitivity: 70% of manual bps are in exons (90% of coding bps)
- Specificity: 80% of bps are in manual exons
Alternative transcripts per gene
- manual 3, gene build 1.3
Figures are for the gene build on NCBI 33 (human) and manual annotation for chr 6, 13 & 14
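The base-pair figures can be computed from two exon sets as shown in this sketch. It follows the slide's usage, where sensitivity is the fraction of manually annotated bases recovered and specificity is the fraction of predicted bases that fall in manual exons. Function names and the 0-based half-open coordinates are assumptions; a set of positions is fine for an illustration but would be too slow for a whole chromosome.

```python
def covered_bases(exons):
    """Set of base positions covered by a list of (start, end) exons."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end))
    return bases

def bp_sensitivity_specificity(predicted_exons, manual_exons):
    """Return (sensitivity, specificity) at the base-pair level."""
    predicted = covered_bases(predicted_exons)
    manual = covered_bases(manual_exons)
    shared = len(predicted & manual)
    sensitivity = shared / len(manual)      # manual bases found by the build
    specificity = shared / len(predicted)   # predicted bases confirmed manually
    return sensitivity, specificity
```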
29Configuring the Genebuild
Data availability
- Targeted build most useful in mouse, rat, human
- Similarity build more important in other species
Structural issues
- Zebrafish: many similar genes near each other; genome from different haplotypes
- C. briggsae: very dense genome; short introns
- Mosquito: many single-exon genes; genes within genes
Configuration files provide flexibility
30Current releases
We also display T. nigroviridis, D. melanogaster
and C. elegans
31Evaluating genes and transcripts
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
32EST GeneBuild Summary
33Initial EST Alignment
Human ESTs
Exonerate
Aligned ESTs
34EST analysis
Map ESTs using Exonerate (determine coverage, identity and location in genome)
Filter on identity and depth (of 5.5 million ESTs from dbEST, we map about 1/3)
35EST GeneBuild Summary
Exonerate
ClusterMerge
36Alt-Splicing from ESTs
Isoforms
37Display of ESTs and EST genes
Latest Full Human Build: NCBI 34 assembly, released Jul 04
EST Genes: 24,980
EST Transcripts: 43,710
EST transcripts
Human ESTs
Display limited to 7 at any one point; full data accessible in the databases
38Evaluating genes and transcripts
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
39Ab initio Genscan predictions
Genscan prediction
Evidence supporting Genscan exons
40Evaluating genes and transcripts
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
41Manual curation (human)
- Manual annotation of finished clones
- New Vega system for storing data: http://vega.sanger.ac.uk
- Currently has 6, 9, 10, 13, 20, 22 and X from Sanger, 14 from Genoscope and 7 from Washington University
- Other groups will also contribute to Vega
- Displayed in Ensembl when available
42Manual curation (other species)
- C. elegans / Drosophila
- main gene set in Ensembl is the WormBase / FlyBase manually-curated set
- Vega system includes some manually-curated finished clones for mouse and zebrafish
43Vega genes
Vega manual curation
44Evaluating genes and transcripts
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
45Gene models from other groups
Turn on DAS sources
Other models as DAS sources
FASTAView display
46Evaluating genes and transcripts
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
47Known v. novel transcripts
- Naming takes place after the gene build is completed
- Transcripts/proteins mapped to SwissProt, RefSeq and SPTrEMBL entries
- If mapped, known; if not, novel
- Require high sequence similarity, but allow incomplete coverage
- Note
- Difficult for families of closely-related genes
- Wrongly annotated pseudogenes may also cause problems
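The known/novel decision reduces to asking whether any sufficiently similar entry from the three databases maps to the transcript. In the sketch below the 0.95 identity cutoff and the tuple layout are assumptions for illustration; coverage is carried along but deliberately not thresholded, matching "allow incomplete coverage".

```python
KNOWN_SOURCES = {"SwissProt", "RefSeq", "SPTrEMBL"}

def classify_transcript(mappings, min_identity=0.95):
    """Return "known" if any database entry maps well enough, else "novel".

    mappings: list of (database, accession, identity, coverage) tuples,
    with identity and coverage as fractions between 0 and 1.
    min_identity is an illustrative cutoff, not a value from the slides.
    """
    for database, _accession, identity, _coverage in mappings:
        # High similarity is required; partial coverage is accepted.
        if database in KNOWN_SOURCES and identity >= min_identity:
            return "known"
    return "novel"
```

Closely related gene families are the hard case for a rule like this, because one database entry can pass the identity cutoff against several paralogous transcripts.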
48Evaluating genes and transcripts
- Ensembl gene set
- Ensembl EST genes
- Ab initio predictions
- Manual curation (Vega / Sanger)
- Gene models from other groups
- Known v. novel genes
- Gene names & descriptions
49Names and descriptions
- Names taken from mapped database entries
- Official HGNC (HUGO) name used if available (or equivalent for other species)
- Otherwise Swiss-Prot > RefSeq > SPTrEMBL
- Novel transcripts have only Ensembl stable ids
- Genes named after best-named transcript
- Gene description taken from mapped database entries (source given)
- Hints
- Orthology can provide useful confirmation
- If no description, check for any Family description
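The naming precedence can be written as an ordered lookup. This sketch assumes each transcript comes with a dict of names keyed by source; the function names are illustrative, and the identifiers in the usage note are placeholders rather than real stable ids.

```python
# Precedence from the slide: HGNC first, then Swiss-Prot > RefSeq > SPTrEMBL.
PRIORITY = ["HGNC", "Swiss-Prot", "RefSeq", "SPTrEMBL"]

def name_transcript(mapped_names, stable_id):
    """Return (name, source) for a transcript.

    mapped_names: dict of source -> name from mapped database entries.
    Novel transcripts (no mapped names) keep only their stable id.
    """
    for source in PRIORITY:
        if source in mapped_names:
            return mapped_names[source], source
    return stable_id, "Ensembl"

def name_gene(transcript_names, gene_stable_id):
    """Name a gene after its best-named transcript.

    transcript_names: list of (name, source) pairs from name_transcript.
    """
    rank = {source: i for i, source in enumerate(PRIORITY)}
    named = [t for t in transcript_names if t[1] in rank]
    if not named:
        return gene_stable_id
    return min(named, key=lambda t: rank[t[1]])[0]
```

For example, `name_transcript({"RefSeq": "NAME_A", "HGNC": "NAME_B"}, "TRANSCRIPT_ID")` returns `("NAME_B", "HGNC")`.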
50GeneView
ExonView
Gene name & description
Alternative transcripts
Links to putative orthologues
links to ExonView
Transcript name
Mapping to external databases
51Evidence tracks in ContigView
Expanded tracks
Compressed tracks
52Outlook
- Improved pseudogene annotation, for all species
- Upstream regulatory elements - using CpG islands, Eponine predictions, motifs to aid in prediction of transcription start sites
- Improve use of cDNAs - can already use to add alternatively spliced transcripts
- Improve UTR extension
- Make use of comparative data
- New species: Xenopus, Cow and Opossum are on the horizon
53Acknowledgements
Genebuilders: Dan Andrews, Martin Hammond, Kevin Howe, Vivek Iyer, Kerstin Jekosch, Jan-Hinnerk Vogel, Mario Caccamo, Simon White, Laura Clarke, Guy Slater, Steve Searle, Val Curwen
And the rest of the Ensembl team!