Title: JIGSAW: a better way to combine predictions
1. JIGSAW: a better way to combine predictions
J.E. Allen, W.H. Majoros, M. Pertea, and S.L.
Salzberg. JIGSAW, GeneZilla, and GlimmerHMM:
puzzling out the features of human genes in the
ENCODE regions. Genome Biology 2007,
7(Suppl):S9.
J.E. Allen and S.L. Salzberg. JIGSAW: integration
of multiple sources of evidence for gene
prediction. Bioinformatics 21(18):3596-3603, 2005.
J.E. Allen, M. Pertea and S.L. Salzberg.
Computational gene prediction using multiple
sources of evidence. Genome Research, 14(1), 2004.
2. Collecting gene structure evidence for JIGSAW
Figure 1. Evidence from the UCSC genome browser
used as input to JIGSAW. Evidence includes
computational gene finders, alignments from gene
expression evidence and evidence of
cross-species sequence conservation.
3. Representing gene structure evidence in JIGSAW
- Each evidence source can predict up to six gene
features:
  - Start codon
  - Stop codon
  - Intron
  - Protein coding nucleotides
  - Donor site
  - Acceptor site
4. Figure 3. Four evidence sources mapped to
sequence S: a gene prediction (GP1) with no
confidence score, a gene prediction with confidence
score 0.65 (GP2), a cDNA aligned with 86% identity
and an EST aligned with 95% identity. Examples of
the different feature vector types are shown:
start codon (sta), stop codon (stp), donor site
(don), acceptor site (acc), intron (inr) and
amino acid codon (cod). Each element in the
feature vector is an evidence source's prediction
for that feature type. The possible exon
boundaries are k0, k1, ..., k6.
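The feature vectors described in the caption can be sketched as a small data structure. This is an illustrative sketch only, not JIGSAW's actual encoding: the `Evidence` class, the `feature_vector` helper, the sequence positions, and the choice to record 1.0 for a source without a confidence score and 0.0 for a source that makes no prediction are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Feature type abbreviations taken from the Figure 3 caption.
FEATURE_TYPES = ("sta", "stp", "don", "acc", "inr", "cod")


@dataclass
class Evidence:
    """One evidence source mapped to the sequence (hypothetical layout)."""
    name: str
    # feature type -> positions at which this source predicts that feature
    predictions: Dict[str, Set[int]]
    # confidence score or fractional identity; None if the source has none
    score: Optional[float] = None


def feature_vector(sources, feature_type, position):
    """Return one element per evidence source for a feature at a position.

    Assumed encoding: the source's score if it predicts the feature here
    (1.0 when it carries no score), otherwise 0.0.
    """
    vector = []
    for src in sources:
        if position in src.predictions.get(feature_type, set()):
            vector.append(1.0 if src.score is None else src.score)
        else:
            vector.append(0.0)
    return tuple(vector)


# The four sources from the caption; the positions are invented.
sources = [
    Evidence("GP1", {"sta": {100}, "don": {250}}),
    Evidence("GP2", {"sta": {100}, "don": {262}}, 0.65),
    Evidence("cDNA", {"don": {250}}, 0.86),
    Evidence("EST", {"don": {250}}, 0.95),
]

donor_vector = feature_vector(sources, "don", 250)   # (1.0, 0.0, 0.86, 0.95)
start_vector = feature_vector(sources, "sta", 100)   # (1.0, 0.65, 0.0, 0.0)
```

Keeping one vector per feature type and position means every candidate exon boundary can be scored from the same evidence, whichever sources happen to cover it.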
5. Training
[Figure: training schematic. Evidence tracks (Gene
pred. 1, Gene pred. 2, cDNA, EST alignment) with
example scores are aligned to training sequences
S1, S2, ..., Sm containing initial, internal,
terminal and single exons. Panels show example
start site, stop site, donor site, acceptor site,
coding and intron feature vectors.]
Schematic of the JIGSAW training procedure. Known
genes are used to evaluate the accuracy of the
different combinations of evidence. Prediction
accuracy for each feature type (start codon, stop
codon, acceptor, donor, amino acid codon and
intron) is measured separately.
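As a rough illustration of measuring accuracy separately per feature type, the sketch below tabulates how often each observed combination of evidence agreed with the known genes. The function name `train_accuracy` and the input layout are hypothetical, and a plain frequency table is an assumption made for simplicity; the later slides indicate that JIGSAW partitions the feature vector space with decision trees.

```python
from collections import defaultdict


def train_accuracy(observations):
    """Estimate accuracy per (feature type, feature vector) combination.

    observations: iterable of (feature_type, feature_vector, is_correct),
    where is_correct says whether the prediction matched a known gene.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for feature_type, vector, is_correct in observations:
        entry = counts[(feature_type, vector)]
        entry[0] += int(is_correct)
        entry[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Invented training observations for three evidence sources.
observations = [
    ("don", (1, 0, 1), True),
    ("don", (1, 0, 1), True),
    ("don", (1, 0, 1), False),
    ("acc", (1, 0, 1), True),
    ("don", (0, 1, 0), False),
]
accuracy = train_accuracy(observations)
```

The same evidence combination is keyed separately for donor and acceptor sites, so a source that is reliable for one feature type does not inflate the estimate for another.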
6. Fig 4a. The plot shows the accuracy of
predictions based on alignments to non-human
sequences that overlap a gene finder's
predictions. Each point is a pair of alignments
observed in training and their percent identity
to the genomic sequence. One class of points is
labeled "accurate" and the x points are labeled
"inaccurate". The two lines correspond to the
non-leaf nodes in the decision tree.
7. Figure 4b. Decision tree used to partition the
feature vector space from Figure 4a into three
sub-regions. This decision tree indicates that
non-human cDNA alignments with > 95% identity to
the human sequence (region V1) are accurate
protein coding predictors.
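A tree with two non-leaf nodes and three leaf regions can be sketched as two nested threshold tests. Only the > 95% cDNA identity split and the name V1 come from the slide; the second threshold (`SECOND_SPLIT`), the region names V2 and V3, the meaning of the second axis, and the training points are hypothetical placeholders.

```python
from collections import Counter

CDNA_SPLIT = 95.0    # from the slide: > 95% identity defines region V1
SECOND_SPLIT = 80.0  # hypothetical value for the second non-leaf node


def region(cdna_identity, second_identity):
    """Assign a pair of percent identities to one of three sub-regions."""
    if cdna_identity > CDNA_SPLIT:
        return "V1"
    if second_identity > SECOND_SPLIT:
        return "V2"  # hypothetical region name
    return "V3"      # hypothetical region name


def region_accuracy(points):
    """Fraction of training points labeled accurate in each region.

    points: iterable of (cdna_identity, second_identity, is_accurate).
    """
    correct, total = Counter(), Counter()
    for cdna_identity, second_identity, is_accurate in points:
        name = region(cdna_identity, second_identity)
        total[name] += 1
        correct[name] += int(is_accurate)
    return {name: correct[name] / total[name] for name in total}


# Invented training points: (cDNA identity, second identity, accurate?)
points = [
    (97.0, 60.0, True),
    (99.0, 90.0, True),
    (90.0, 85.0, True),
    (90.0, 88.0, False),
    (70.0, 50.0, False),
]
per_region = region_accuracy(points)
```

Each leaf then carries a single accuracy estimate that can be looked up for any new feature vector falling in that region.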
8. JIGSAW dynamic programming
- Dynamic programming algorithm
  - At the end of each interval (e0, for example),
store the score of the best parse ending at that
location
- Modification: store scores for every parse type
ending at e0
  - Types are start, stop, coding, intron, donor,
acceptor
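The modification described above, keeping a best score for every parse type at each interval end rather than one overall best, can be sketched as a Viterbi-style recurrence. This is a simplified illustration and not JIGSAW's implementation: the `ALLOWED` transition table, the one-label-per-interval model, the log-score inputs, and the requirement that a parse begins with a start and ends with a stop are all assumptions.

```python
import math

TYPES = ("start", "coding", "donor", "intron", "acceptor", "stop")

# Hypothetical grammar: which parse type may follow which.
ALLOWED = {
    "start": ("coding",),
    "coding": ("coding", "donor", "stop"),
    "donor": ("intron",),
    "intron": ("intron", "acceptor"),
    "acceptor": ("coding",),
    "stop": (),
}


def best_parse(scores):
    """scores[i][t] is the log-score of labeling interval i with type t.

    Returns (path, score) for the best parse from start to stop, or
    (None, -inf) if no valid parse exists.
    """
    n = len(scores)
    # best[i][t]: score of the best parse ending at interval i with type t
    best = [{t: -math.inf for t in TYPES} for _ in range(n)]
    back = [{t: None for t in TYPES} for _ in range(n)]
    best[0]["start"] = scores[0].get("start", -math.inf)
    for i in range(1, n):
        for prev in TYPES:
            if best[i - 1][prev] == -math.inf:
                continue
            for cur in ALLOWED[prev]:
                candidate = best[i - 1][prev] + scores[i].get(cur, -math.inf)
                if candidate > best[i][cur]:
                    best[i][cur] = candidate
                    back[i][cur] = prev
    if best[-1]["stop"] == -math.inf:
        return None, -math.inf
    # Trace back from the stop at the final interval.
    path = ["stop"]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, best[-1]["stop"]


# Invented log-scores for seven intervals.
scores = [
    {"start": 0.0},
    {"coding": -0.1},
    {"donor": -0.2, "coding": -0.5, "stop": -3.0},
    {"intron": -0.1, "coding": -2.0},
    {"acceptor": -0.2, "intron": -1.0},
    {"coding": -0.1},
    {"stop": -0.1},
]
path, total = best_parse(scores)
```

Storing one score per type is what lets a locally weaker label, such as the donor at the third interval, survive long enough to be part of the best full parse.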
9. JIGSAW GHMM gene model
10. Evidence types for JIGSAW experiments on human DNA
- cDNA from human genes
  - UniGene transcripts
  - GenBank cDNAs matching SwissProt proteins w/at
least 98% identity
- RefSeq genes from non-human species
- TIGR Gene Index (human and other)
- Ab initio gene finders
  - Genscan, GeneID, GeneZilla, GlimmerHMM
  - NOTE: JIGSAW allows you to use the same gene
finder as multiple lines of evidence, e.g.,
GlimmerHMM with different parameter settings
- Alignment-based gene finders
  - Twinscan
  - SGP
- Predicted conserved elements from phylogenetic
analysis (PhastCons)
11. Effects of different evidence sources
Figure 6. JIGSAW prediction performance using
different combinations of evidence. Gene finders =
ab initio gene finders only; non-human EST =
gene finders + non-human expression evidence;
human mRNA = gene finders + human mRNA; curated
cDNA = gene finders + KnownGene; All = all
evidence. KnownGene = cDNA evidence from curated
proteins (from UCSC) without using JIGSAW.
12. Comparison of JIGSAW and other methods on human
ENCODE regions
Sensitivity (Sn) = % of exons correctly predicted
Specificity (Sp) = % of exon predictions that are
correct
F-score = (2 x Sn x Sp) / (Sn + Sp)
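The three measures can be computed directly from sets of exons. In this minimal sketch an exon is a (start, end) coordinate pair and a prediction counts as correct only when both boundaries match exactly; the matching rule, the function name `exon_accuracy`, and the coordinates are assumptions for illustration.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp, F) at the exon level as fractions between 0 and 1."""
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    f_score = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f_score


# Invented coordinates: 4 annotated exons, 5 predictions, 3 exact matches.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (500, 600), (650, 680), (900, 950)]
sn, sp, f_score = exon_accuracy(predicted, annotated)  # 0.75, 0.6, 0.667
```

Because the F-score is a harmonic mean, it stays close to the lower of Sn and Sp, so a method cannot score well by over-predicting exons.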
13. Gene Prediction Accuracy at the exon level
Sensitivity versus specificity. Top panel:
dotplot for sensitivity versus specificity at the
exon level for CDS evaluation. Each dot
represents the overall value for each program on
the 31 test sequences. Fig. 6 from Guigo et al.,
Genome Biology 2006, 7(Suppl 1):S2.
14. Gene Prediction Accuracy at the exon level
Sensitivity versus specificity. Bottom panel:
boxplots of the average sensitivity and
specificity for each program. Each dot
corresponds to the average in each of the test
sequences for which GENCODE annotation existed.
Fig. 6 from Guigo et al., Genome Biology 2006,
7(Suppl 1):S2.
15. EGASP results: gene level accuracy
16. JIGSAW on other species