Gene Structure Prediction - PowerPoint PPT Presentation

Provided by: isrecI


Transcript and Presenter's Notes

Title: Gene Structure Prediction

1
Gene Structure Prediction

Eukaryotic gene structures
Statement of the gene prediction problem
Overview of the GENSCAN program
Interpolated Markov chains
Performance evaluation of gene prediction programs

EPFL Bioinformatics I, 12 Dec 2005
2
3
Variation in human gene structure
4
Determination of gene structure
Experimental: isolation and sequencing of RNA.
Public data: assembled cDNA sequences and single-read EST sequences.
Limitations: rare transcripts, tissue and cell type specificity, conditional expression. It may not be possible to extract RNA from all cell types, at all developmental stages, and under all possible environmental conditions from a complex organism.
Computational: ab initio prediction based on sequence features.
Hybrid: ab initio prediction with constraints derived from partial RNA sequences or matches with known homologous genes from other organisms (protein or DNA sequence similarity).
5
Gene prediction: statement of the problem

Given a genome sequence:
Predict the structure of all transcripts.
Predict the structure of the coding part of all transcripts.
Performance criteria:
Number of correctly predicted/missed genes
Number of correctly predicted/missed exons
Number of correctly predicted/missed coding nucleotides
Further complications:
It is not known in advance how many genes a sequence contains.
The sequence may start or end in the middle of a gene.
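The exon-level criterion can be illustrated with a minimal sketch. It assumes exons are given as (start, end) coordinate pairs and that an exon only counts as correctly predicted when both boundaries match exactly; the function name and the coordinates are hypothetical and not taken from the slides.

```python
def exon_level_counts(actual_exons, predicted_exons):
    """Compare annotated and predicted exons given as (start, end) pairs.

    Assumption: an exon is "correct" only if both boundaries match exactly.
    """
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    return {
        "correct": len(actual & predicted),   # predicted with exact boundaries
        "missed": len(actual - predicted),    # annotated but not predicted
        "wrong": len(predicted - actual),     # predicted but not annotated
    }


# Made-up coordinates, for illustration only.
annotated = [(100, 250), (400, 520), (900, 1010)]
predicted = [(100, 250), (410, 520), (900, 1010), (1500, 1600)]
counts = exon_level_counts(annotated, predicted)
print(counts)
```

In this toy case the second exon is predicted with a shifted start, so it is counted once as missed and once as wrong. Gene-level and nucleotide-level counts follow the same idea with a different unit of comparison.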

6
GENSCAN, an example of a gene-finding algorithm

Principle: GENSCAN finds the optimal parse of a sequence.
A parse is a succession of:
Intergenic regions
5' UTRs (untranslated regions)
Exons
Introns
3' UTRs
Alternative parses are evaluated with the aid of:
Weight matrices or similar models for sites: promoters, translation starts, splice donors and acceptors, translation stops, polyadenylation sites.
Interpolated Markov chains (3-periodic HMMs), and length distributions for exons, introns, 5' and 3' UTRs, and intergenic regions.
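To make the idea of a weight matrix for a site concrete, here is a minimal sketch of log-odds scoring with a position-specific frequency matrix. It is not GENSCAN's actual model: the three-position matrix, its frequencies, the uniform background, and the function names are all invented for illustration.

```python
import math

# Toy frequencies for a 3-nucleotide site (loosely "GT" followed by a purine).
# These numbers are invented; real matrices are estimated from known sites.
SITE_FREQS = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform background frequency


def site_score(window, freqs=SITE_FREQS, background=BACKGROUND):
    """Log-odds score (in bits) of one window against the weight matrix."""
    if len(window) != len(freqs):
        raise ValueError("window length must equal matrix width")
    return sum(math.log2(col[base] / background)
               for base, col in zip(window, freqs))


def best_site(seq, freqs=SITE_FREQS):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(freqs)
    return max((site_score(seq[i:i + width], freqs), i)
               for i in range(len(seq) - width + 1))


score, position = best_site("CCAGGTAAGT")
print(position, round(score, 2))
```

A gene finder would combine such site scores with content scores and length distributions when comparing alternative parses; this sketch only covers the site-scoring part.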

7
Interpolated Markov chains
There are three scoring tables for hexamers starting at the 1st, 2nd, and 3rd codon position, respectively. The scores are computed as log-odds scores from the observed hexamer frequencies. The score of an exonic region is computed as the sum of the scores of its overlapping hexamers plus a score for its length.
Hypothetical example
Sequence: ACTTGCAGAAC...

Hexamer   frame 1   frame 2   frame 3
ACTTGC       -2        -1         0
CTTGCA       -7        -3         2
TTGCAG        1        -1         5
TGCAGA       -3        -2         4

Red, green, and blue scores are from tables 1, 2 and 3, respectively.
The total number of parameters of the exon model is 3 × 4096 = 12,288.
8
GENSCAN model (variant HMM)
9
10
Evaluation of performance of GENSCAN
Performance measures:
Sensitivity = TP/AP
Specificity = TP/PP
Correlation coefficient
Approximate correlation
Abbreviations: TP = true positives, TN = true negatives, FP = false positives, FN = false negatives, AP = actual positives, AN = actual negatives, PP = predicted positives, PN = predicted negatives
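The transcript does not preserve the formulas for the correlation coefficient and the approximate correlation. The sketch below uses the definitions commonly applied in gene-finder evaluation, expressed with the abbreviations above; that these are the exact formulas shown on the original slide is an assumption, and the counts in the example are made up.

```python
import math


def nucleotide_level_measures(tp, tn, fp, fn):
    """Accuracy measures from nucleotide-level counts.

    Assumes all four marginal totals (AP, AN, PP, PN) are non-zero.
    """
    ap, an = tp + fn, tn + fp  # actual positives / actual negatives
    pp, pn = tp + fp, tn + fn  # predicted positives / predicted negatives
    return {
        "sensitivity": tp / ap,
        "specificity": tp / pp,
        # Commonly used correlation coefficient (assumed definition).
        "cc": (tp * tn - fp * fn) / math.sqrt(ap * an * pp * pn),
        # Commonly used approximate correlation (assumed definition).
        "ac": 0.5 * (tp / ap + tp / pp + tn / an + tn / pn) - 1,
    }


# Made-up counts, for illustration only.
measures = nucleotide_level_measures(tp=80, tn=880, fp=20, fn=20)
print(measures)
```

Note that specificity is defined here as TP/PP, following the slide, which is the quantity called precision or positive predictive value in other fields.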
11
