Evaluating genes and transcripts in Ensembl

1
Evaluating genes and transcripts in Ensembl
November 2004
2

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

3
Overview
manual curation
Ensembl transcript predictions
evidence
other groups models
4
Annotation process

Automated analysis
Repeat masking
RepeatMasker (Smit), tandem, inverted
Gene prediction
Genscan (Burge), FGENESH (Solovyev)
Database searches
initial protein and DNA matches using BLAST
refined protein matches using Genewise (wise2)
refined EST matches using EST2GENOME, spangle
Pfam annotation using halfwise (wise2)
Integrate results, display, annotate
ACEDB, web-based tools (spangle)
Investigate gene predictions experimentally
Submit to EMBL

5
Strategies for analysis of unfinished sequence
Full analysis at DNA level
Slow
Unfinished Sequence
Predicted Genes
Fast
Predicted peptides (Genscan)
Analysis at Protein level
6
The challenge

4.3Gb DNA (480,000 sequences)
16 analysis types
Automatic submission of jobs
Dependencies between them
Track their progress
Retry failed jobs
Need access to large file based databases
Store output
Easy to include new analysis types

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

8
Ensembl gene set

Place all available species-specific proteins to
make transcripts
Place similar proteins to make transcripts
Use mRNA data to add UTRs
Build transcripts using cDNA evidence
Build additional transcripts using Genscan
homology evidence
Combine annotations to make genes with
alternative transcripts

10
Using BLAST - how not to do it
12 h x 35,000 = 17,500 days
24 h x 35,000 = 35,000 days
Total 52,500 days CPU

3.5Gb DNA
BLAST against all known proteins/cDNA
Realign using Genewise and est2genome
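
The CPU estimate on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the figures 12 and 24 are hours per job and that 35,000 is the number of jobs; the slide does not state the units, but those assumptions reproduce its totals. The function name `cpu_days` is made up for illustration.

```python
HOURS_PER_DAY = 24


def cpu_days(hours_per_job, n_jobs):
    """Convert a per-job runtime in hours into total CPU days (assumed units)."""
    return hours_per_job * n_jobs / HOURS_PER_DAY


first_step = cpu_days(12, 35_000)   # assumed: 12 h per job
second_step = cpu_days(24, 35_000)  # assumed: 24 h per job
total = first_step + second_step

print(first_step, second_step, total)
```

Under the same assumptions, the totals match the 17,500, 35,000 and 52,500 days quoted on the slide.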

11
The trouble with BLAST
BLAST is good for finding possible exon
positions in large genomic sequences.
12
BLAST replacements

Exonerate (Guy Slater)
Fast gapped DNA-DNA matcher
10,000 x faster than BLAST
Pmatch (Richard Durbin)
Fast exact protein-dna matcher
> 10,000 x faster than BLAST

13
A better way

3.5Gb DNA
Fast exact match against known proteins/cDNAs
Realign using minigenewise and miniest2genome

26 days
42 days
Total 68 days

14
Protein Based Gene Prediction
15
Gene build is protein based

DNA-DNA alignments don't give us translatable
genes
Essential to align at the protein level allowing
for frameshifts and splice sites
Genewise (Ewan Birney)
Protein genomic alignment
Has splice site model
Penalises stop codons
Allows for frameshifts

E Birney et al. Genome Research 14 (2004)
16
Genes from proteins

Other protein sequences Swall
blast v. assembly
blast and Miniseq
R. Durbin, unpublished
17
Miniseq - the need for speed
Minigenomic 1kb on either side run Genewise
Map back to genomic
Spliced alignment
18
The basic rule
Fast
> 1 Mb DNA: Pmatch/Blat/Exonerate/SSAHA
100 kb - 1 Mb DNA: Blast and miniseq
< 100 kb DNA: Genewise/est2genome/sim4
Slow
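
The size thresholds on this slide amount to a simple lookup from sequence length to tool class. The sketch below is illustrative only: the function name `choose_aligners` and the handling of the exact boundary values (100 kb and 1 Mb) are assumptions, since the slide gives only the three ranges.

```python
KB = 1_000
MB = 1_000_000


def choose_aligners(length_bp):
    """Return the tools the slide lists for a target sequence of this length."""
    if length_bp > MB:
        # Fast methods for large targets
        return ["Pmatch", "Blat", "Exonerate", "SSAHA"]
    if length_bp >= 100 * KB:
        # Intermediate range (boundary handling is an assumption)
        return ["Blast and miniseq"]
    # Slow, sensitive methods for short targets
    return ["Genewise", "est2genome", "sim4"]


for length in (50 * KB, 500 * KB, 5 * MB):
    print(length, choose_aligners(length))
```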
19
UTR Addition
20
Adding UTRs

protein - Genewise (phases, no UTRs)
cDNA - exonerate (UTRs, no phases)
Genewise prediction
Combined prediction
21
The Genebuilder
22
Make genes

Combine results Genewise and ab initio
predictions
Group transcripts which share exons
Reject non-translating transcripts
Remove duplicate exons
Attach supporting evidence
Write genes to database
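
The grouping steps listed on this slide can be sketched in a few lines. This is not the Ensembl Genebuilder code; it is a toy model under stated assumptions: transcripts are plain dicts, "share exons" is taken to mean an identical (start, end) pair, strand is ignored, and the `translates` flag is supplied by the caller. The name `build_genes` is hypothetical.

```python
from collections import defaultdict


def build_genes(transcripts):
    """transcripts: dict name -> {"exons": [(start, end), ...], "translates": bool}."""
    # Reject non-translating transcripts
    kept = {n: t for n, t in transcripts.items() if t["translates"]}

    # Union-find over transcript names
    parent = {n: n for n in kept}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Group transcripts which share at least one identical exon
    exon_owner = {}
    for name, t in kept.items():
        for exon in t["exons"]:
            if exon in exon_owner:
                parent[find(name)] = find(exon_owner[exon])
            else:
                exon_owner[exon] = name

    clusters = defaultdict(list)
    for name in kept:
        clusters[find(name)].append(name)

    genes = []
    for members in clusters.values():
        # Remove duplicate exons: each distinct exon is stored once per gene
        exons = sorted({e for m in members for e in kept[m]["exons"]})
        genes.append({"transcripts": sorted(members), "exons": exons})
    return sorted(genes, key=lambda g: g["transcripts"])


example = {
    "t1": {"exons": [(1, 10), (20, 30)], "translates": True},
    "t2": {"exons": [(20, 30), (40, 50)], "translates": True},
    "t3": {"exons": [(100, 110)], "translates": True},
    "t4": {"exons": [(1, 10)], "translates": False},
}
genes = build_genes(example)
print(genes)
```

In this example t1 and t2 share the exon (20, 30) and end up in one gene, t3 forms its own gene, and t4 is dropped because it does not translate.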

23
Addition of cDNA transcripts
Genewise genes with UTRs
24
GeneBuild Summary
25
Pseudogene Analysis
Transcripts

Single Exon transcripts with only frameshifts
Single Exon transcripts whose evidence is
spliced elsewhere
Transcripts whose introns are > 80% repeat
sequence

A gene is labeled as a pseudogene if all
transcripts in that gene are labeled as pseudo
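
The rules on this slide can be written as two small predicates. This is a sketch, not the Ensembl pseudogene code: the `Transcript` fields are hypothetical stand-ins for whatever the real analysis measures, the repeat cutoff reads the slide's "80" as a fraction of 0.8, and an empty gene is treated as not a pseudogene by assumption.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    n_exons: int
    only_frameshifts: bool = False            # hypothetical flag
    evidence_spliced_elsewhere: bool = False  # hypothetical flag
    intron_repeat_fraction: float = 0.0       # fraction of intron bases in repeats


def is_pseudo_transcript(t, repeat_cutoff=0.8):
    """Apply the three transcript-level rules from the slide."""
    single_exon = t.n_exons == 1
    return (
        (single_exon and t.only_frameshifts)
        or (single_exon and t.evidence_spliced_elsewhere)
        or t.intron_repeat_fraction > repeat_cutoff
    )


def is_pseudogene(transcripts):
    """A gene is a pseudogene only if every one of its transcripts is pseudo."""
    return bool(transcripts) and all(is_pseudo_transcript(t) for t in transcripts)


retro_copy = Transcript(n_exons=1, evidence_spliced_elsewhere=True)
repeat_rich = Transcript(n_exons=4, intron_repeat_fraction=0.9)
normal = Transcript(n_exons=6, intron_repeat_fraction=0.3)

print(is_pseudogene([retro_copy, repeat_rich]))
print(is_pseudogene([retro_copy, normal]))
```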
26
Latest Full Human Build

NCBI 35 assembly, released May 2004
Ensembl genes: 24,195
Ensembl coding transcripts: 35,913
(plus 1,974 pseudogenes)
Ensembl exons: 245,938
Input human seqs: 48,176 proteins, 86,918 cDNAs
Transcripts made from
Human proteins with (without) UTRs: 73% (12%)
Non-human proteins with (without) UTRs: 2% (12%)

27
Gene building summary

Initial location of possible genes using Genscan
peptides and BLAST.
Reblasting of all high scoring proteins with
BLAST to find regions Genscan has missed
Realignment of proteins using Genewise
mRNA/EST genes built using Genscan exons.

28
Comparison to manual annotation
Genes
Sensitivity: 90% of manual genes are in
Specificity: 75% of genes are in the manual sets
Exon bps
Sensitivity: 70% of manual bps are in exons (90% of coding bps)
Specificity: 80% of bps are in manual exons
Alternative transcripts per gene: manual 3, Ensembl 1.3
Figures are for the gene build on NCBI 33 (human)
and manual annotation for chr 6, 13 and 14
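
The base-pair measures on this slide can be illustrated with a toy calculation. The sketch assumes the definitions the slide implies: sensitivity is the share of manually annotated exon bases that the prediction also covers, and specificity is the share of predicted exon bases that fall in manual exons. Coordinates are inclusive, and the function names are hypothetical.

```python
def covered_bases(exons):
    """Return the set of base positions covered by inclusive (start, end) exons."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def bp_sensitivity_specificity(predicted_exons, manual_exons):
    """Compare predicted exons to manual exons at the base-pair level."""
    predicted = covered_bases(predicted_exons)
    manual = covered_bases(manual_exons)
    shared = len(predicted & manual)
    sensitivity = shared / len(manual)      # manual bases recovered
    specificity = shared / len(predicted)   # predicted bases confirmed
    return sensitivity, specificity


sens, spec = bp_sensitivity_specificity([(1, 10)], [(6, 25)])
print(sens, spec)
```

A set of positions is fine for a toy example; real genome-scale comparisons would use interval arithmetic instead.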
29
Configuring the Genebuild
Data availability
Targeted build most useful in mouse, rat, human
Similarity build more important in other species
Structural issues
Zebrafish
Many similar genes near each other
Genome from different haplotypes
C. briggsae
Very dense genome
Short introns
Mosquito
Many single-exon genes
Genes within genes
Configuration files provide flexibility
30
Current releases
We also display T. nigroviridis, D. melanogaster
and C. elegans
31
Evaluating genes and transcripts

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

32
EST GeneBuild Summary
33
Initial EST Alignment
Human ESTs
Exonerate
Aligned ESTs
34
EST analysis

Map ESTs using Exonerate (determine coverage,
identity and location in genome)
Filter on identity and depth (of 5.5 million ESTs
from dbEST we map about 1/3)
35
EST GeneBuild Summary
Exonerate
ClusterMerge
36
Alt-Splicing from ESTs
Isoforms
37
Display of ESTs and EST genes
Latest Full Human Build: NCBI 34 assembly,
released Jul 04
EST genes: 24,980
EST transcripts: 43,710
EST transcripts
Human ESTs
Display limited to 7 at any one point; full data
accessible in the databases
38
Evaluating genes and transcripts

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

39
Ab initio Genscan predictions
Genscan prediction
Evidence supporting Genscan exons
40
Evaluating genes and transcripts

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

41
Manual curation (human)

Manual annotation of finished clones
New Vega system for storing data
http://vega.sanger.ac.uk
Currently has chromosomes 6, 9, 10, 13, 20, 22 and X from
Sanger, 14 from Genoscope and 7 from Washington
University.
Other groups will also contribute to Vega
Displayed in Ensembl when available

42
Manual curation (other species)

C. elegans / Drosophila
main gene set in Ensembl is the
WormBase / FlyBase manually-curated set
Vega system includes some manually-curated
finished clones for mouse and zebrafish

43
Vega genes
Vega manual curation
44
Evaluating genes and transcripts

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

45
Gene models from other groups
Turn on DAS sources
Other models as DAS sources
FASTAView display
46
Evaluating genes and transcripts

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

47
Known v. novel transcripts

Naming takes place after the gene build is
completed
Transcripts/proteins mapped to SwissProt, RefSeq
and SPTrEMBL entries
If mapped: known; if not: novel
Require high sequence similarity, but allow
incomplete coverage
Note
Difficult for families of closely-related genes
Wrongly annotated pseudogenes may also cause
problems
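
The known/novel rule on this slide can be sketched as a filter over mapping results. The identity cutoff of 0.95 and the tuple layout are assumptions made for illustration; the slide says only that high similarity is required and that incomplete coverage is allowed, which is why coverage is carried along but not tested.

```python
KNOWN_DATABASES = {"SwissProt", "RefSeq", "SPTrEMBL"}


def classify(mappings, min_identity=0.95):
    """mappings: list of (database, identity, coverage) tuples for one transcript.

    min_identity is a hypothetical cutoff, not a value given on the slide.
    """
    for database, identity, _coverage in mappings:
        # Coverage is deliberately ignored: partial matches still count
        if database in KNOWN_DATABASES and identity >= min_identity:
            return "known"
    return "novel"


print(classify([("RefSeq", 0.99, 0.60)]))
print(classify([("RefSeq", 0.70, 1.00)]))
print(classify([]))
```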

48
Evaluating genes and transcripts

Ensembl gene set
Ensembl EST genes
Ab initio predictions
Manual curation (Vega / Sanger)
Gene models from other groups
Known v. novel genes
Gene names and descriptions

49
Names and descriptions

Names taken from mapped database entries
Official HGNC (HUGO) name used if available (or
equivalent for other species)
Otherwise Swiss-Prot > RefSeq > SPTrEMBL
Novel transcripts have only Ensembl stable ids
Genes named after best-named transcript
Gene description taken from mapped database
entries (source given)
Hints
Orthology can provide useful confirmation
If no description, check for any Family
description
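
The naming order on this slide can be expressed as a priority list. This is a minimal sketch: the dict layout, the function names and the example identifiers are all hypothetical, and only the order HGNC, then Swiss-Prot, then RefSeq, then SPTrEMBL, with the Ensembl stable id as fallback, comes from the slide.

```python
PRIORITY = ["HGNC", "Swiss-Prot", "RefSeq", "SPTrEMBL"]


def name_transcript(stable_id, mappings):
    """mappings: dict source -> name. Returns (name, rank); lower rank is better."""
    for rank, source in enumerate(PRIORITY):
        if source in mappings:
            return mappings[source], rank
    # Novel transcripts have only their Ensembl stable id
    return stable_id, len(PRIORITY)


def name_gene(gene_stable_id, transcripts):
    """transcripts: dict transcript stable id -> mappings dict.

    The gene takes the name of its best-named transcript.
    """
    named = [name_transcript(tid, m) for tid, m in transcripts.items()]
    if not named:
        return gene_stable_id
    best_name, best_rank = min(named, key=lambda pair: pair[1])
    return best_name if best_rank < len(PRIORITY) else gene_stable_id


gene_name = name_gene(
    "GENE_ID_1",
    {
        "TRANSCRIPT_ID_1": {"RefSeq": "refseq_name"},
        "TRANSCRIPT_ID_2": {"Swiss-Prot": "swissprot_name"},
    },
)
print(gene_name)
```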

50
GeneView
ExonView
Gene name and description
Alternative transcripts
Links to putative orthologues
links to ExonView
Transcript name
Mapping to external databases
51
Evidence tracks in ContigView
Expanded tracks
Compressed tracks
52
Outlook

Improved pseudogene annotation, for all species
Upstream regulatory elements - using CpG islands,
Eponine predictions, motifs to aid in
prediction of
transcription start sites
Improve use of cDNAs - can already use to add
alternatively spliced transcripts
Improve UTR extension
Make use of comparative data
New species: Xenopus, cow and opossum are on the
horizon

53
Acknowledgements
Genebuilders: Dan Andrews, Martin Hammond, Kevin
Howe, Vivek Iyer, Kerstin Jekosch, Jan-Hinnerk
Vogel, Mario Caccamo, Simon White, Laura Clarke,
Guy Slater, Steve Searle, Val Curwen
And the rest of the Ensembl team!
