Title: Short comparison GASP 99 - EGASP 05
1 Short comparison GASP 99 - EGASP 05
- Martin Reese (mreese_at_omicia.com)
- Omicia Inc.
- 5980 Horton Street
- Emeryville, CA 94602
2 The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster
- Martin G. Reese (mgreese_at_lbl.gov)
- Nomi L. Harris (nlharris_at_lbl.gov)
- George Hartzell (hartzell_at_cs.berkeley.edu)
- Suzanna E. Lewis (suzi_at_fruitfly.berkeley.edu)
- Later added: Josep Abril
- Drosophila Genome Center, Department of Molecular and Cell Biology, 539 Life Sciences Addition, University of California, Berkeley
3 The genome annotation experiment: GASP 1999
- Annotation of 2.9 Mb of Drosophila melanogaster genomic DNA (EGASP: 44 separate regions)
- Open to everybody, announced on several mailing lists
- Participants can use any analysis methods they like (gene finding programs, homology searches, by-eye assessment, combination methods, etc.) and should disclose their methods.
- CASP-like
- 12 participating groups (EGASP: at least 20 groups)
4 URL: http://www-hgc.lbl.gov/homes/reese/genome-annotation
5 Goals of the experiment
- Compare and contrast various genome annotation methods
- Objective assessment of the state of the art in gene finding and functional site prediction
- Identify outstanding problems in computational methods for the annotation process
6 Adh contig
- 2.9 Mb contiguous Drosophila sequence from the Adh region, one of the best studied genomic regions
- From chromosome 2L (34D-36A)
- Ashburner et al. (to appear in Genetics)
- 222 gene annotations (as of July 22, 1999) (EGASP: 450 genes)
- 375,585 bases are coding (12.95%) (EGASP: ENCODE region, 30 Mb)
- We chose the Adh region because it was thought to be typical: a representative test bed for evaluating annotation techniques.
7 Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
8 Submissions
- MAGPIE Team: T. Gaasterland et al.
- Computational Genomics Group, The Sanger Centre: V. Solovyev
- University of Erlangen: U. Ohler
- Genome Annotation Group, The Sanger Centre: E. Birney
- Oak Ridge Nat. Laboratory, GRAIL: R. Mural et al.
- CBS, Technical University of Denmark, HMMGene: A. Krogh
- Georgia Institute of Technology, GeneMark.hmm: M. Borodovsky
- IMIM, Spain, GeneID: Roderic Guigó et al.
- Fred Hutchinson Cancer Center, BLOCKS: Henikoff & Henikoff
- GSF, Neuherberg, Germany: M. Scherf
- Mount Sinai School of Medicine: Gary Benson
- UCB/UC Santa Cruz/Neomorphic, Genie: M. Reese and D. Kulp
9 Submission classes
10 Submission classes (cont.)
11 Measuring success
- By nucleotide
  - Sensitivity/Specificity (Sn/Sp)
- By exon
  - Sn/Sp
  - Missed exons (ME), wrong exons (WE)
- By gene
  - Sn/Sp
  - Missed genes (MG), wrong genes (WG)
  - Average overlap statistics
- Based on Burset and Guigó (1996), Evaluation of gene structure prediction programs. Genomics, 34(3), 353-367.
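The nucleotide-level and exon-level measures listed above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the experiment. It assumes exons are given as 1-based inclusive (start, end) pairs on a single strand, and it uses the gene-finding convention in which Sp = TP / (TP + FP). The function names and the toy coordinates are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Assumption: an exon is a 1-based inclusive (start, end) pair on one strand.
Exon = Tuple[int, int]


def covered_bases(exons: Iterable[Exon]) -> Set[int]:
    """Return the set of base positions covered by the given exons."""
    bases: Set[int] = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def nucleotide_sn_sp(actual: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Base-level Sn = TP / actual coding bases, Sp = TP / predicted coding bases."""
    a, p = covered_bases(actual), covered_bases(predicted)
    tp = len(a & p)
    sn = tp / len(a) if a else 0.0
    sp = tp / len(p) if p else 0.0
    return sn, sp


def overlaps(x: Exon, y: Exon) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return x[0] <= y[1] and y[0] <= x[1]


def exon_stats(actual: List[Exon], predicted: List[Exon]) -> Dict[str, float]:
    """Exon-level Sn/Sp (both boundaries must match) plus ME and WE fractions."""
    a, p = set(actual), set(predicted)
    exact = a & p  # partial overlaps do not count as correct
    missed = [e for e in a if not any(overlaps(e, q) for q in p)]
    wrong = [q for q in p if not any(overlaps(q, e) for e in a)]
    return {
        "Sn": len(exact) / len(a) if a else 0.0,
        "Sp": len(exact) / len(p) if p else 0.0,
        "ME": len(missed) / len(a) if a else 0.0,
        "WE": len(wrong) / len(p) if p else 0.0,
    }


# Toy example (made-up coordinates): one exact exon, one partial, one missed, one wrong.
actual_exons = [(1, 10), (21, 30), (41, 50)]
predicted_exons = [(1, 10), (21, 28), (61, 70)]
print(nucleotide_sn_sp(actual_exons, predicted_exons))
print(exon_stats(actual_exons, predicted_exons))
```

In the toy example the partially overlapping exon (21, 28) contributes to base-level accuracy but counts as neither correct, missed, nor wrong at the exon level, which is why exon-level scores are typically lower than base-level scores.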
12 Definition: Joined and split genes
JG = (Actual genes that overlap predicted genes) / (Predicted genes that overlap one or more actual genes)
SG = (Predicted genes that overlap actual genes) / (Actual genes that overlap one or more predicted genes)
- JG > 1: tendency to join multiple actual genes into one prediction
- SG > 1: tendency to split actual genes into separate gene predictions
Inspired by Hayes and Guigó (1999), unpublished.
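The JG and SG ratios can be illustrated with a small sketch. The slide does not say exactly how the numerators are counted, so this version makes an assumption: each numerator counts overlapping (actual, predicted) gene pairs, which keeps the two scores from being simple reciprocals of each other. Genes are reduced to hypothetical (start, end) spans on one strand, and the function name is illustrative only.

```python
from typing import List, Tuple

# Assumption: a gene is summarised by its 1-based inclusive (start, end) span.
Span = Tuple[int, int]


def overlaps(x: Span, y: Span) -> bool:
    """True if the two inclusive spans share at least one base."""
    return x[0] <= y[1] and y[0] <= x[1]


def joined_split_scores(actual: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Return (JG, SG) under the pair-counting assumption described above."""
    # For each predicted gene, how many actual genes does it overlap?
    actual_per_pred = [sum(overlaps(p, a) for a in actual) for p in predicted]
    # For each actual gene, how many predicted genes overlap it?
    pred_per_actual = [sum(overlaps(a, p) for p in predicted) for a in actual]

    pairs = sum(actual_per_pred)  # total overlapping (actual, predicted) pairs
    n_pred_hit = sum(1 for n in actual_per_pred if n > 0)
    n_actual_hit = sum(1 for n in pred_per_actual if n > 0)

    jg = pairs / n_pred_hit if n_pred_hit else 0.0
    sg = pairs / n_actual_hit if n_actual_hit else 0.0
    return jg, sg


# One prediction spanning two actual genes: JG rises above 1, SG stays at 1.
print(joined_split_scores(actual=[(1, 100), (200, 300)], predicted=[(1, 300)]))
# Two predictions inside one actual gene: SG rises above 1, JG stays at 1.
print(joined_split_scores(actual=[(400, 600)], predicted=[(400, 480), (520, 600)]))
```

With a one-to-one correspondence between actual and predicted genes both scores equal 1, which matches the reading of the two bullets above.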
13 Results: Base level
- Sensitivity
  - Low variability among predictors
  - 95% coverage of the proteome
- Specificity
  - 90%
  - Programs that are more like Genscan (used for original annotation) might do better?
Sn 93 9_101_1 Sp 92 20_79_1
14 Results: Exon level
- Higher variability among predictors
- Up to 75% sensitivity (both exon boundaries correct)
- 55% specificity
- Low specificity because partial exon overlaps do not count
- Missing exons below 5%
- Many wrong exons (20%)
Sn 89.8 14_87_3 Sp 88 20_78_3
15 Results: Gene level
Sn 71 36_46_1 Sp 66 34_55_3
16 Results: Gene level
- 60% of actual genes predicted completely correct
- Specificity only 30-40%
- 5-10% missed genes (comparable to Sanger Center)
- 40% wrong genes, a lot of short genes overpredicted (possibly not annotated in Standard 3)
- Splitting genes is a bigger problem than joining genes
Sn 71 36_46_1 Sp 66 34_55_3
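For readers who want to see how gene-level figures of this kind are obtained, here is a hedged sketch. It assumes a gene is represented as a set of coding exons, that a predicted gene is "completely correct" only if its exon set matches an actual gene exactly, and that missed and wrong genes are those with no exon overlap at all with the other set. These criteria and all names are assumptions for illustration; the data are made up and do not reproduce the numbers on the slide.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a gene is a frozenset of 1-based inclusive (start, end) coding exons.
Exon = Tuple[int, int]
Gene = FrozenSet[Exon]


def exons_overlap(x: Exon, y: Exon) -> bool:
    """True if the two inclusive exon intervals share at least one base."""
    return x[0] <= y[1] and y[0] <= x[1]


def genes_overlap(g: Gene, h: Gene) -> bool:
    """True if any exon of g overlaps any exon of h."""
    return any(exons_overlap(x, y) for x in g for y in h)


def gene_level_stats(actual: List[Gene], predicted: List[Gene]) -> Dict[str, float]:
    """Gene-level Sn/Sp (exact exon-set match) plus MG and WG fractions."""
    a, p = set(actual), set(predicted)
    correct = a & p  # every exon boundary of the gene must match
    missed = [g for g in a if not any(genes_overlap(g, h) for h in p)]
    wrong = [h for h in p if not any(genes_overlap(h, g) for g in a)]
    return {
        "Sn": len(correct) / len(a) if a else 0.0,
        "Sp": len(correct) / len(p) if p else 0.0,
        "MG": len(missed) / len(a) if a else 0.0,
        "WG": len(wrong) / len(p) if p else 0.0,
    }


# Toy data: one exact gene, one partially predicted gene, one missed, two wrong.
actual_genes = [
    frozenset({(1, 10), (21, 30)}),
    frozenset({(101, 150)}),
    frozenset({(201, 250)}),
]
predicted_genes = [
    frozenset({(1, 10), (21, 30)}),
    frozenset({(101, 140)}),
    frozenset({(301, 350)}),
    frozenset({(401, 450)}),
]
print(gene_level_stats(actual_genes, predicted_genes))
```

Because a single wrong exon boundary makes the whole gene incorrect, gene-level Sn and Sp are the strictest of the three levels, and extra short predictions lower Sp while raising WG.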
17 DRO Human comparison
18 Results (protein homology): Gene level
19 Discussion
- Good predictive improvements
- Expression improves predictions
- Gene finders became automatic annotation tools
- Gene sensitivity/specificity at roughly 70% is excellent
- No correct answer/real gold standard (like CASP)
- Superb community
20 Open questions
- How many protein coding genes/loci were missed?
- How many total human protein coding loci are there? (Dro < 14,500)
- How many array-detected transcripts are there, and what is their function (coding or non-coding)?
- Can we get an exhaustive alternative splicing gold standard?