JIGSAW: a better way to combine predictions

1
JIGSAW: a better way to combine predictions
J. E. Allen, W. H. Majoros, M. Pertea, and S. L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl 1):S9, 2006.
J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18):3596-3603, 2005.
J. E. Allen, M. Pertea, and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research 14(1), 2004.
2
Collecting gene structure evidence for JIGSAW
Figure 1. Evidence from the UCSC genome browser
used as input to JIGSAW. Evidence includes
computational gene finders, alignments from gene
expression evidence and evidence of
cross-species sequence conservation.
3
Representing gene structure evidence in JIGSAW
  • Each evidence source can predict up to six gene
    features:
      • Start codon
      • Stop codon
      • Intron
      • Protein coding nucleotides
      • Donor site
      • Acceptor site

4
Figure 3. Four evidence sources mapped to
sequence S: a gene prediction with no confidence
score (GP1), a gene prediction with confidence
score 0.65 (GP2), a cDNA aligned with 86% identity
and an EST aligned with 95% identity. Examples of
the different feature vector types are shown:
start codon (sta), stop codon (stp), donor site
(don), acceptor site (acc), intron (inr) and
amino acid codon (cod). Each element in the
feature vector is an evidence source's prediction
for that feature type. The possible exon
boundaries are k0, k1, ..., k6.
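As a rough illustration of the feature vectors described in the Figure 3 caption, the sketch below stores one entry per evidence source for a single feature. The class name, the `position` field, the source ordering and the choice to encode an unscored prediction as 1.0 are all assumptions made for this example; they are not taken from the JIGSAW implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Feature type abbreviations as listed in the Figure 3 caption.
FEATURE_TYPES = ("sta", "stp", "don", "acc", "inr", "cod")

# Evidence sources in a fixed order (ordering is an assumption of this sketch).
SOURCES = ("GP1", "GP2", "cDNA", "EST")


@dataclass(frozen=True)
class FeatureVector:
    feature_type: str                       # one of FEATURE_TYPES
    position: int                           # hypothetical coordinate on sequence S
    predictions: Tuple[Optional[float], ...]  # one entry per source; None = no prediction

    def __post_init__(self):
        if self.feature_type not in FEATURE_TYPES:
            raise ValueError(f"unknown feature type: {self.feature_type}")
        if len(self.predictions) != len(SOURCES):
            raise ValueError("expected one prediction per evidence source")


def supporting_sources(vector):
    """Return the names of the sources that predict this feature."""
    return [name for name, p in zip(SOURCES, vector.predictions) if p is not None]


# Illustrative donor-site vector: GP1 has no confidence score, so its
# prediction is encoded as 1.0 (an assumption); GP2 reports 0.65; the cDNA
# alignment has 86% identity; the EST does not cover this site.
don = FeatureVector("don", 120, (1.0, 0.65, 0.86, None))
print(supporting_sources(don))
```

Keeping the sources in a fixed order means vectors for the same feature type can be compared element by element across training examples.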
5
Training
[Figure: evidence tracks (Gene pred. 1, Gene pred. 2, cDNA, EST alignment) shown over training sequences S1, S2, ..., Sm with their exon types (initial, internal, terminal, single), together with example feature vectors for start sites, stop sites, donor sites, acceptor sites, coding regions and introns.]
Schematic of the JIGSAW training procedure. Known
genes are used to evaluate the accuracy of the
different combinations of evidence. Prediction
accuracy for each feature type (start codon, stop
codon, acceptor, donor, amino acid codon and
intron) is measured separately.
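The idea of measuring accuracy separately for each feature type and each combination of evidence can be sketched as a simple count over training examples. This is a deliberate simplification: the deck indicates that JIGSAW partitions the vector space with decision trees (Figure 4b), whereas the sketch below uses an exact-match lookup table. The function name and the input format are assumptions for illustration.

```python
from collections import defaultdict


def estimate_accuracy(examples):
    """Estimate accuracy per (feature type, evidence combination).

    examples: iterable of (feature_type, vector, is_correct), where
    vector is a tuple with one entry per evidence source and is_correct
    says whether the feature agrees with the known gene annotation.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for feature_type, vector, is_correct in examples:
        key = (feature_type, tuple(vector))
        counts[key][1] += 1
        if is_correct:
            counts[key][0] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Toy training data: two evidence sources, 1 = predicts the feature.
training = [
    ("don", (1, 1), True),
    ("don", (1, 1), True),
    ("don", (1, 0), True),
    ("don", (1, 0), False),
    ("acc", (0, 1), False),
]
accuracy = estimate_accuracy(training)
for key in sorted(accuracy):
    print(key, accuracy[key])
```

Because the feature type is part of the key, a combination of evidence that is reliable for donor sites does not influence the estimate for acceptor sites or any other feature type.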
6
Figure 4a. The plot shows the accuracy of
predictions based on alignments to non-human
sequences that overlap a gene finder's
predictions. Each point is a pair of alignments
observed in training and their percent identity
to the genomic sequence. One set of points is
labeled accurate and the x points are labeled
inaccurate. The two lines correspond to the
non-leaf nodes in the decision tree.
7
Figure 4b. Decision tree used to partition the
feature vector space from Figure 4a into the
three sub-regions V1, V2 and V3. The decision
tree identifies the pattern that non-human cDNA
alignments with > 95% identity to the human
sequence (region V1) are accurate protein
coding predictors in the training set.
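A tree with two non-leaf nodes and three leaves can be written as two nested tests. In the sketch below only the first test (non-human cDNA identity above 95% leads to V1) comes from the caption. The second test, its threshold `second_threshold`, and the assignment of the remaining points to V2 versus V3 are hypothetical placeholders, since the deck does not state them.

```python
def classify_region(cdna_identity, other_identity, second_threshold=80.0):
    """Map a pair of alignment identities (in percent) to a region.

    cdna_identity:    percent identity of the non-human cDNA alignment
    other_identity:   percent identity of the second alignment in the pair
    second_threshold: hypothetical split value for the second non-leaf node
    """
    # First non-leaf node: the rule stated in the Figure 4b caption.
    if cdna_identity > 95.0:
        return "V1"
    # Second non-leaf node: placeholder split, not taken from the source.
    if other_identity > second_threshold:
        return "V2"
    return "V3"


for pair in [(97.0, 50.0), (90.0, 85.0), (90.0, 70.0)]:
    print(pair, classify_region(*pair))
```

Each leaf would then carry the accuracy observed for the training points that fall into it, so that new evidence landing in V1 is trusted more than evidence landing in a region dominated by inaccurate points.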
8
JIGSAW parse
JIGSAW gene model
9
Evidence types for JIGSAW experiments on human DNA
  • cDNA from human genes
  • UniGene transcripts
  • GenBank cDNAs matching SwissProt proteins with at
    least 98% identity
  • RefSeq genes from non-human species
  • TIGR Gene Index (human and other)
  • Ab initio gene finders
      • Genscan
      • GeneID
      • GeneZilla
      • GlimmerHMM
  • Alignment-based gene finders
      • Twinscan
      • SGP
  • Predicted conserved elements from phylogenetic
    analysis (PhastCons)

10
Effects of different evidence sources
Figure 6. JIGSAW prediction performance using
different combinations of evidence.
Gene finders: ab initio gene finders only.
Non-human EST: gene finders plus non-human expression evidence.
Human mRNA: gene finders plus human mRNA.
Curated cDNA: gene finders plus KnownGene.
All: all evidence.
KnownGene: cDNA evidence from curated proteins
(from UCSC) without using JIGSAW.
11
Comparison of JIGSAW and other methods on human
ENCODE regions
Sensitivity (Sn) = % of exons correctly predicted
Specificity (Sp) = % of exon predictions that are correct
F-score = (2 x Sn x Sp) / (Sn + Sp)
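The three measures can be computed directly from sets of exons. The sketch below assumes that an exon is a (start, end) pair and that a prediction counts as correct only when both boundaries match an annotated exon exactly. The function name and the toy coordinates are illustrative only.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp, F) for exon-level predictions.

    predicted, annotated: iterables of (start, end) exon coordinates.
    A predicted exon is correct if the same (start, end) pair is annotated.
    """
    predicted = set(predicted)
    annotated = set(annotated)
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0  # share of annotated exons found
    sp = correct / len(predicted) if predicted else 0.0  # share of predictions that are right
    f = (2 * sn * sp) / (sn + sp) if (sn + sp) > 0 else 0.0
    return sn, sp, f


annotated_exons = [(1, 10), (20, 30), (40, 50), (60, 70)]
predicted_exons = [(1, 10), (20, 30), (41, 50)]  # last exon misses one boundary
sn, sp, f = exon_accuracy(predicted_exons, annotated_exons)
print(f"Sn={sn:.3f} Sp={sp:.3f} F={f:.3f}")
```

The F-score is the harmonic mean of Sn and Sp, so a program cannot score well by maximizing one measure at the expense of the other.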
12
Gene Prediction Accuracy at the exon level
Sensitivity versus specificity. Top panel:
dotplot of sensitivity versus specificity at the
exon level for CDS evaluation. Each dot
represents the overall value for each program on
the 31 test sequences. Fig. 6 from Guigo et al.,
Genome Biology 2006, 7(Suppl 1):S2.
13
Gene Prediction Accuracy at the exon level
Sensitivity versus specificity. Bottom panel:
boxplots of the average sensitivity and
specificity for each program. Each dot
corresponds to the average in each of the test
sequences for which GENCODE annotation existed.
Fig. 6 from Guigo et al., Genome Biology 2006,
7(Suppl 1):S2.
14
EGASP results: gene level accuracy
15
JIGSAW on other species