Short comparison GASP 99 EGASP 05 - PowerPoint PPT Presentation

Slides: 21
Provided by: martin409
Transcript and Presenter's Notes

Title: Short comparison GASP 99 EGASP 05


1
Short comparison GASP 99 - EGASP 05
  • Martin Reese (mreese_at_omicia.com)
  • Omicia Inc.
  • 5980 Horton Street
  • Emeryville, CA 94602

2
The challenge of annotating a complete eukaryotic
genome: A case study in Drosophila melanogaster
  • Martin G. Reese (mgreese_at_lbl.gov)
  • Nomi L. Harris (nlharris_at_lbl.gov)
  • George Hartzell (hartzell_at_cs.berkeley.edu)
  • Suzanna E. Lewis (suzi_at_fruitfly.berkeley.edu)
  • Later added: Josep Abril
  • Drosophila Genome Center, Department of Molecular
    and Cell Biology, 539 Life Sciences
    Addition, University of California, Berkeley

3
The genome annotation experiment: GASP 1999
  • Annotation of 2.9 Mb of Drosophila melanogaster
    genomic DNA (EGASP: 44 separate regions)
  • Open to everybody, announced on several mailing
    lists
  • Participants can use any analysis methods they
    like (gene finding programs, homology searches,
    by-eye assessment, combination methods, etc.) and
    should disclose their methods.
  • CASP-like
  • 12 participating groups (EGASP: at least 20 groups)

4
URL: http://www-hgc.lbl.gov/homes/reese/genome-annotation
5
Goals of the experiment
  • Compare and contrast various genome annotation
    methods
  • Objective assessment of the state of the art in
    gene finding and functional site prediction
  • Identify outstanding problems in computational
    methods for the annotation process

6
Adh contig
  • 2.9 Mb contiguous Drosophila sequence from the
    Adh region, one of the best studied genomic
    regions
  • From chromosome 2L (34D-36A)
  • Ashburner et al., (to appear in Genetics)
  • 222 gene annotations (as of July 22, 1999);
    450 genes
  • 375,585 bases are coding (12.95%); ENCODE region:
    30 Mb
  • We chose the Adh region because it was thought to
    be typical: a representative test bed for evaluating
    annotation techniques.

7
Adh paper (to appear in Genetics)
URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
8
Submissions
  • MAGPIE Team: T. Gaasterland et al.
  • Computational Genomics Group, The Sanger Centre:
    V. Solovyev
  • University of Erlangen: U. Ohler
  • Genome Annotation Group, The Sanger Centre:
    E. Birney
  • Oak Ridge National Laboratory, GRAIL: R. Mural et al.
  • CBS, Technical University of Denmark, HMMGene:
    A. Krogh
  • Georgia Institute of Technology, GeneMark.hmm:
    M. Borodovsky
  • IMIM, Spain, GeneID: Roderic Guigó et al.
  • Fred Hutchinson Cancer Center, BLOCKS: Henikoff &
    Henikoff
  • GSF, Neuherberg, Germany: M. Scherf
  • Mount Sinai School of Medicine: Gary Benson
  • UCB/UC Santa Cruz/Neomorphic, Genie: M. Reese
    and D. Kulp

9
Submission classes
10
Submission classes (cont.)
11
Measuring success
  • By nucleotide
  • Sensitivity/Specificity (Sn/Sp)
  • By exon
  • Sn/Sp
  • Missed exons (ME), wrong exons (WE)
  • By gene
  • Sn/Sp
  • Missed genes (MG), wrong genes (WG)
  • Average overlap statistics
  • Based on Burset and Guigó (1996), Evaluation of
    gene structure prediction programs. Genomics,
    34(3), 353-367.
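
A rough sketch of how the nucleotide-level and exon-level measures listed above could be computed. This is not the evaluation code used in GASP. It assumes the convention common in gene-finding evaluations, Sn = TP/(TP+FN) and Sp = TP/(TP+FP), counts an exon as correct only when both boundaries match, and counts missed or wrong exons as those with no overlap at all. The coordinate representation, the function names, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates (assumed representation)


def coding_bases(exons: List[Exon]) -> Set[int]:
    """Return the set of positions covered by at least one exon."""
    covered: Set[int] = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def nucleotide_sn_sp(actual: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Base-level sensitivity and specificity."""
    a, p = coding_bases(actual), coding_bases(predicted)
    tp = len(a & p)  # coding bases that were predicted as coding
    sn = tp / len(a) if a else 0.0
    sp = tp / len(p) if p else 0.0
    return sn, sp


def exon_stats(actual: List[Exon], predicted: List[Exon]) -> Dict[str, float]:
    """Exon-level Sn/Sp (exact boundary match) plus missed (ME) and wrong (WE) exons."""
    act, pred = set(actual), set(predicted)
    exact = len(act & pred)  # both boundaries correct
    missed = sum(1 for a in act if not any(overlaps(a, p) for p in pred))
    wrong = sum(1 for p in pred if not any(overlaps(p, a) for a in act))
    return {
        "Sn": exact / len(act) if act else 0.0,
        "Sp": exact / len(pred) if pred else 0.0,
        "ME": missed / len(act) if act else 0.0,
        "WE": wrong / len(pred) if pred else 0.0,
    }


if __name__ == "__main__":
    # Toy data: one exact exon, one partial overlap, one missed, one wrong.
    actual = [(100, 200), (300, 400), (500, 600)]
    predicted = [(100, 200), (310, 400), (700, 750)]
    print(nucleotide_sn_sp(actual, predicted))
    print(exon_stats(actual, predicted))
```

In the toy data the partially overlapping exon (310-400) contributes to base-level Sn/Sp but not to exon-level Sn/Sp, which is why exon-level figures tend to be lower than base-level ones.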

12
Definition: Joined and split genes

JG = (Actual genes that overlap predicted genes) /
     (Predicted genes that overlap one or more actual genes)

SG = (Predicted genes that overlap actual genes) /
     (Actual genes that overlap one or more predicted genes)

  • JG > 1: tendency to join multiple actual genes
    into one prediction
  • SG > 1: tendency to split actual genes into
    separate gene predictions

Inspired by Hayes and Guigó (1999), unpublished.
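
A minimal sketch of the two ratios, under one loudly stated assumption: the numerators are counted with multiplicity, that is, once per overlapping (actual, predicted) pair, so that both JG and SG are at least 1 whenever any overlap exists. The slide does not spell this out, and a plain count of distinct genes would make the two ratios reciprocals of each other. Genes are represented here simply as (start, end) spans; the representation and the function names are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[int, int]  # (start, end) span of a gene, inclusive (assumed representation)


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def joined_split(actual: List[Gene], predicted: List[Gene]) -> Tuple[float, float]:
    """Return (JG, SG), counting overlaps per (actual, predicted) pair (assumption)."""
    pairs = [(a, p) for a in actual for p in predicted if overlaps(a, p)]
    actual_hit = {a for a, _ in pairs}      # actual genes overlapping >= 1 prediction
    predicted_hit = {p for _, p in pairs}   # predictions overlapping >= 1 actual gene
    jg = len(pairs) / len(predicted_hit) if predicted_hit else 0.0
    sg = len(pairs) / len(actual_hit) if actual_hit else 0.0
    return jg, sg


if __name__ == "__main__":
    # One prediction spanning two actual genes: joining.
    print(joined_split([(100, 500), (600, 900)], [(100, 900)]))
    # Two predictions inside one actual gene: splitting.
    print(joined_split([(1500, 2500)], [(1500, 1900), (2000, 2500)]))
```

With this counting, the first toy case gives JG = 2 and SG = 1, and the second gives JG = 1 and SG = 2, matching the reading of JG > 1 as joining and SG > 1 as splitting.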
13
Results: Base level
  • Sensitivity
  • Low variability among predictors
  • 95% coverage of the proteome
  • Specificity
  • 90%
  • Programs that are more like Genscan (used for
    original annotation) might do better?

Sn 93% 9_101_1, Sp 92% 20_79_1

14
Results: Exon level
  • Higher variability among predictors
  • Up to 75% sensitivity (both exon boundaries
    correct)
  • 55% specificity
  • Low specificity because partial exon overlaps do
    not count
  • Missing exons below 5%
  • Many wrong exons (20%)

Sn 89.8% 14_87_3, Sp 88% 20_78_3

15
Results: Gene level
Sn 71% 36_46_1, Sp 66% 34_55_3
16
Results: Gene level
  • 60% of actual genes predicted completely correctly
  • Specificity only 30-40%
  • 5-10% missed genes (comparable to Sanger Centre)
  • 40% wrong genes, a lot of short genes
    overpredicted (possibly not annotated in Standard
    3)
  • Splitting genes is a bigger problem than joining
    genes

Sn 71% 36_46_1, Sp 66% 34_55_3
17
DRO Human comparison
18
Results (protein homology): Gene level
19
Discussion
  • Good predictive improvements
  • Expression data improve predictions
  • Gene finders have become automatic annotation
    tools
  • Gene sensitivity/specificity at roughly 70% is
    excellent
  • No correct answer/real gold standard (like
    CASP)
  • Superb community

20
Open questions
  • How many protein-coding genes/loci were missed?
  • How many total human protein-coding loci are
    there? (Dro < 14,500)
  • How many array-detected transcripts are there, and
    what is their function (coding or non-coding)?
  • Can we get an exhaustive alternative splicing
    gold standard?