Title: Gene Structure Prediction
1. Gene Structure Prediction
- Eukaryotic gene structures
- Statement of the gene prediction problem
- Overview of the GENSCAN program
- Interpolated Markov chains
- Performance evaluation of gene prediction programs
EPFL Bioinformatics I 12 Dec 2005
3. Variation in human gene structure
4. Determination of gene structure
- Experimental: isolation and sequencing of RNA sequences.
  - Public data: assembled cDNA sequences and single-read EST sequences.
  - Limitations: rare transcripts, tissue and cell type specificity, conditional expression. It may not be possible to extract RNA from all cell types, at all developmental stages, and under all possible environmental conditions from a complex organism.
- Computational: ab initio prediction based on sequence features.
- Hybrid: ab initio prediction with constraints derived from partial RNA sequences or matches with known homologous genes from other organisms (protein or DNA sequence similarity).
5. Gene prediction: statement of the problem
- Given a genome sequence
  - Predict the structure of all transcripts.
  - Predict the structure of the coding part of all transcripts.
- Performance criteria
  - Correctly predicted/missed genes
  - Correctly predicted/missed exons
  - Correctly predicted/missed coding nucleotides
- Further complications
  - It is not known in advance how many genes a sequence contains.
  - The sequence may start or end in the middle of a gene.
6. GENSCAN, an example of a gene finding algorithm
- Principle: GENSCAN finds the optimal parse of a sequence.
- A parse is a succession of
  - intergenic regions
  - 5' UTRs (untranslated regions)
  - exons
  - introns
  - 3' UTRs
- Evaluation of alternative parses with the aid of
  - Weight matrices or similar models for sites: promoters, translation starts, splice donors and acceptors, translation stops, polyadenylation sites.
  - Interpolated Markov chains (3-periodic HMMs), and length distributions for exons, introns, 5' and 3' UTRs, and intergenic regions.
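As a rough illustration of how a weight matrix can score a candidate site, here is a minimal Python sketch. It is not GENSCAN's own code: the three-position frequency table `SITE_FREQS`, the uniform `BACKGROUND`, and the helper names are all hypothetical, chosen only to show log-odds scoring of a sequence window.

```python
import math

# Hypothetical position-specific base frequencies for a 3-position site.
# These numbers are invented for illustration; they are not GENSCAN parameters.
SITE_FREQS = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
# Assumed uniform background composition.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_weight_matrix(site_freqs, background):
    """Turn per-position frequencies into log2-odds weights."""
    return [
        {base: math.log2(column[base] / background[base]) for base in "ACGT"}
        for column in site_freqs
    ]


def score_site(matrix, window):
    """Sum the weights of the bases observed at each matrix position."""
    if len(window) != len(matrix):
        raise ValueError("window length must equal matrix width")
    return sum(column[base] for column, base in zip(matrix, window))


def best_site(matrix, seq):
    """Return (score, start index) of the highest-scoring window in seq."""
    width = len(matrix)
    return max(
        (score_site(matrix, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


matrix = build_weight_matrix(SITE_FREQS, BACKGROUND)
score, start = best_site(matrix, "ACAGGTAC")
print(start, round(score, 2))
```

A gene finder would combine such site scores with the content and length scores of the segments between sites when comparing alternative parses.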
7. Interpolated Markov chains
There are three scoring tables for hexamers starting at the 1st, 2nd, and 3rd codon position, respectively. The scores are computed as log-odds scores from the observed hexamer frequencies. The score of an exonic region is computed as the sum of the scores of its overlapping hexamers plus a score for its length.

Hypothetical example for the sequence ACTTGCAGAAC...

hexamer   frame 1   frame 2   frame 3
ACTTGC       -2        -1         0
CTTGCA       -7        -3         2
TTGCAG        1        -1         5
TGCAGA       -3        -2         4

On the slide, scores shown in red, green, and blue are from tables 1, 2, and 3, respectively. The total number of parameters of the exon model is 3 × 4096 (three tables with one score for each of the 4^6 possible hexamers).
8. GENSCAN model (variant HMM)
10. Evaluation of performance of GENSCAN
Performance measures
- Sensitivity: TP/AP
- Specificity: TP/PP
- Correlation coefficient
- Approximate correlation
Abbreviations: TP true positives, TN true negatives, FP false positives, FN false negatives, AP actual positives, AN actual negatives, PP predicted positives, PN predicted negatives.
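The four measures can be computed from nucleotide-level counts as in the sketch below. Sensitivity and specificity follow the definitions given above. The formulas for the correlation coefficient and the approximate correlation did not survive on the slide, so the ones used here are an assumption: they are the forms commonly used in gene-prediction benchmarks. The function name `evaluate` is illustrative.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Compute accuracy measures from confusion counts.

    Assumes all four marginals (AP, AN, PP, PN) are non-zero;
    otherwise a ZeroDivisionError is raised.
    """
    ap = tp + fn  # actual positives
    an = tn + fp  # actual negatives
    pp = tp + fp  # predicted positives
    pn = tn + fn  # predicted negatives

    sensitivity = tp / ap
    specificity = tp / pp  # as defined on the slide (TP/PP)

    # Assumed formula: Pearson correlation of actual vs. predicted labels.
    cc = (tp * tn - fp * fn) / math.sqrt(ap * an * pp * pn)

    # Assumed formula: average of four conditional probabilities,
    # rescaled from [0, 1] to [-1, 1].
    ac = 0.5 * (tp / ap + tp / pp + tn / an + tn / pn) - 1

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "correlation_coefficient": cc,
        "approximate_correlation": ac,
    }


print(evaluate(tp=80, tn=900, fp=20, fn=20))
```

The approximate correlation is useful because each of its terms can be handled separately when a marginal is zero, whereas the correlation coefficient is undefined in that case; this sketch does not handle those edge cases.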