Title: Gene Prediction in Eukaryotes Simplified
1Gene Prediction in Eukaryotes Simplified
- For highly conserved proteins
- Translate DNA sequence in all 6 reading frames
- BLASTX or FASTX to compare the sequence to a protein sequence database
- Or
- Protein compared against a nucleic acid database, including genomic sequence that is translated in all six possible reading frames by the TBLASTN and TFASTX/TFASTY programs
- Note: approximation of the gene structure only.
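The six-frame translation step can be sketched in a few lines of Python. This is only an illustration of the idea, not how BLASTX or TBLASTN are implemented; the function names are made up for this example, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels +1..+3 and -1..-3 to their translations."""
    dna = dna.upper()
    frames = {}
    for strand, s in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six translations would then be compared against a protein database by the search program.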
2- Transcript-based prediction
- How it works
- Align transcript data to genomic sequence using a pair-wise sequence comparison
[Figure labels: Gene Model, EST, cDNA]
3- Transcript-based gene prediction algorithm
- BLAST (Altschul) (36 hours)
  - Widely used and understood
  - HSPs often have ragged ends, so they extend to the ends of the introns
- EST_GENOME (Mott) (3 days)
  - Dynamic programming post-process of BLAST
  - Slow and sometimes cryptic
- BLAT (Kent) (1/2 hour)
  - Next generation of alignment algorithm
  - Designed for looking at nearly identical sequences
  - Faster and more accurate than BLAST
4- Peptide-based gene prediction algorithm
- BLAST (Altschul)
  - Widely used and understood
- Smith-Waterman
  - Preliminary to further processing
- Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation
5Gene prediction in eukaryotes
- When assessing a gene prediction program, two criteria are used
  - Sensitivity: proportion of true sites (e.g. exon or donor splice sites) predicted correctly
  - Specificity: proportion of predicted sites that are correct
- Most gene prediction programs concentrate on the prediction of protein-coding exons and distinguish 4 types of exons
  - 1. Initial exon: initiation codon to first 5' splice junction
  - 2. Internal exon: 3' splice site to 5' splice site
  - 3. Terminal exon: 3' splice site to stop codon
  - 4. Single exon: intronless gene
- The ideal gene finding program would perfectly mimic the cell's transcription, splicing and translation machinery. This is not yet possible, but a number of biologically important signal sequences inform today's algorithms
6Unfortunately, only 70% of human promoters actually contain the core signal sequences cited above. Moreover, the AATAAA polyadenylation signal is absent from 50% of 3' untranslated regions. Hence it is hard to determine the beginning and the end of a gene.
Translational Signals
The following signals are important here:
1. Start codon (ATG).
2. The optimal context for initiation of translation in vertebrate mRNA is GCCACCatgG. This is sometimes referred to as the Kozak signal.
3. Termination codons: TGA, TAA, TAG. In-frame termination codons should be absent from coding exons.
Splicing Signals
Nuclear pre-mRNA introns are excised by spliceosomes. These are large ribonucleoprotein complexes that recognize three kinds of sites:
1. 5' donor site: GT
2. 3' acceptor site: AG
3. Branch point: internal site
In addition, upstream of the acceptor site there is a bias towards pyrimidines (T, C). The rules about the donor and acceptor sites are almost universal. However, splice-site usage is often influenced by exonic and intronic signals that are located away from the splice junctions.
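As a rough illustration of the translational signal, the sketch below counts how many of the seven flanking positions around each ATG agree with the GCCACCatgG context. The scoring scheme (a plain match count) and the function name are assumptions made for this example; real gene finders use weighted models instead.

```python
KOZAK = "GCCACCATGG"   # consensus context quoted above, upper-cased
ATG_OFFSET = 6         # index of the A of ATG inside the consensus


def kozak_scores(seq):
    """Return (ATG position, number of matching flanking bases out of 7).

    Only ATGs with a complete 10-base context window are scored.
    """
    seq = seq.upper()
    results = []
    for start in range(ATG_OFFSET, len(seq) - 3):
        if seq[start:start + 3] != "ATG":
            continue
        window = seq[start - ATG_OFFSET:start + 4]
        matches = sum(a == b for a, b in zip(window, KOZAK)) - 3  # ignore the ATG itself
        results.append((start, matches))
    return results


scores = kozak_scores("TTGCCACCATGGCATGA")
print(scores)
```

In this toy sequence the first ATG sits in a perfect context, while the second matches the consensus at only one flanking position.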
7Gene Finding Challenges
- Need the correct reading frame
- Introns can interrupt an exon in mid-codon
- There is no hard and fast rule for identifying donor and acceptor splice sites
- Signals are very weak
9Overpredicting Genes
- Easy to predict all exons
- Report all sequences flanked by ..AG and GT.. as exons
- Sensitivity: 100%
- Specificity: approximately 0%
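A minimal sketch of such an over-predicting "exon finder" is shown below. The function name and the half-open coordinate convention are choices made for this example. It reports every interval that starts right after an AG and ends right before a GT, which is why nearly all of its predictions are false.

```python
def naive_internal_exons(seq, min_len=1):
    """Report every interval flanked by ..AG on the left and GT.. on the right.

    Intervals are half-open (start, end) indices into seq.
    """
    seq = seq.upper()
    # A candidate exon starts immediately after an acceptor AG ...
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    # ... and ends immediately before a donor GT.
    ends = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]


candidates = naive_internal_exons("CCAGATGGTCCAGTTGTAA")
print(candidates)
```

Even this 19-base toy sequence yields four candidate exons; on genomic DNA the number of candidates grows roughly with the product of the AG and GT counts.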
10Methods for Gene Identification
- Homology based (e.g. Procrustes)
  - sequence similarity with known proteins (need close homologs)
  - coding regions fairly well conserved
  - average identity at AA level of human and mouse > 85%
  - TBLASTX used to find exons
  - does not attempt to find complete gene structure (i.e., doesn't effectively find actual splice boundaries)
  - Similarity searches miss some genes!
- HMMs (GenScan, HMMgene, VEIL)
  - probabilistic model
  - uses description of gene structure (e.g. splice junctions, coding regions, start/stop codons)
  - mixed HMMs and other probabilistic models
- Neural nets (GRAIL; NetGene2 for splice sites)
11HMMgene 1.1
http://www.cbs.dtu.dk/services/HMMgene/
The methods used are described in the paper: A. Krogh: Two methods for improving performance of an HMM and their application for gene finding. In Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186.
- The program predicts whole genes, so the predicted exons always splice correctly.
- It can predict several whole or partial genes in one sequence, so it can be used on whole cosmids or even longer sequences. HMMgene can also be used to predict splice sites and start/stop codons.
- The program is based on a hidden Markov model, which is a probabilistic model of the gene structure.
- Apart from reporting the best prediction, HMMgene can also report the N best gene predictions for a sequence. This is useful if there are several equally likely gene structures and may even indicate alternative splicing.
- HMMgene takes an input file with one or more DNA sequences in FASTA format.
- It also has a few options for changing the default behavior of the program.
- The output is a prediction of partial or complete genes in the sequences.
- The output is in a standardized format that is easily read by other programs, which specifies the location of all the predicted genes and their coding regions.
SEQ1  HMMgene1.1  firstex  692    702    0.347  2  bestparsecds_1
SEQ1  HMMgene1.1  exon_1   2473   2711   0.421  1  bestparsecds_1
SEQ1  HMMgene1.1  exon_2   2897   3081   0.544  0  bestparsecds_1
SEQ1  HMMgene1.1  exon_3   10376  10563  0.861  2  bestparsecds_1
SEQ1  HMMgene1.1  exon_4   11841  11891  0.857  2  bestparsecds_1
SEQ1  HMMgene1.1  exon_5   12387  12483  0.993  0  bestparsecds_1
SEQ1  HMMgene1.1  exon_6   13076  13211  0.970  1  bestparsecds_1
SEQ1  HMMgene1.1  exon_7   13332  13415  0.926  1  bestparsecds_1
SEQ1  HMMgene1.1  exon_8   13515  13603  1.000  0  bestparsecds_1
SEQ1  HMMgene1.1  exon_9   14180  14235  1.000  2  bestparsecds_1
SEQ1  HMMgene1.1  exon_10  14321  14408  0.999  0  bestparsecds_1
SEQ1  HMMgene1.1  exon_11  14483  14579  0.877  1  bestparsecds_1
SEQ1  HMMgene1.1  exon_12  14697  14764  0.639  0  bestparsecds_1
SEQ1  HMMgene1.1  exon_13  14901  15030  0.835  1  bestparsecds_1
SEQ1  HMMgene1.1  lastex   15643  15704  0.987  0  bestparsecds_1
SEQ1  HMMgene1.1  CDS      692    15704  0.132  .  bestparsecds_1
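Because the output is whitespace-separated, it is easy to read with a short script. The sketch below assumes the eight columns exactly as they appear in this transcript (sequence, program, feature, start, end, score, frame, group). The column names are guesses made for this example, and the real HMMgene output may carry further columns.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    seqname: str
    program: str
    feature: str
    start: int
    end: int
    score: float
    frame: str
    group: str


SAMPLE = """\
SEQ1 HMMgene1.1 firstex 692 702 0.347 2 bestparsecds_1
SEQ1 HMMgene1.1 exon_1 2473 2711 0.421 1 bestparsecds_1
SEQ1 HMMgene1.1 exon_2 2897 3081 0.544 0 bestparsecds_1
SEQ1 HMMgene1.1 CDS 692 15704 0.132 . bestparsecds_1
"""


def parse_predictions(text):
    """Parse whitespace-separated prediction lines into Feature records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        seqname, program, feature, start, end, score, frame, group = fields
        records.append(Feature(seqname, program, feature,
                               int(start), int(end), float(score), frame, group))
    return records


records = parse_predictions(SAMPLE)
exons = [r for r in records if r.feature != "CDS"]
coding_length = sum(r.end - r.start + 1 for r in exons)
print(len(exons), "exons,", coding_length, "coding bases")
```

Summing the exon lengths is a quick sanity check: for a complete gene the total coding length should be a multiple of three.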
13- GENSCAN
- differs from the majority of gene finding algorithms, as it can identify complete, partial and multiple genes on both DNA strands.
- The program is based on a probabilistic model of gene structure/compositional properties and does not make use of protein sequence homology information.
- The program is suitable for vertebrate, maize and Arabidopsis sequences.
- The vertebrate version also works fairly well for Drosophila sequences.
http://genome.dkfz-heidelberg.de/cgi-bin/GENSCAN/genscan.cgi
14http://genes.mit.edu/GENSCAN.html
17.
TIGRscan has now been replaced by a new gene finder called GeneZilla:
www.genezilla.org
18GenLang
- GenLang is a syntactic pattern recognition system, which uses the tools and techniques of computational linguistics to find genes and other higher-order features in biological sequence data.
- Patterns are specified by means of rule sets called grammars, and a general purpose parser, implemented in the logic programming language Prolog, then performs the search.
http://arete.ibb.waw.pl/PL/html/gene_lang.html
19VEIL (Viterbi Exon-Intron Locator) (Henderson et al.)
- Used Expectation Maximization (E-M) to train the model
- Viterbi algorithm (dynamic programming) to align new sequences, i.e. finds the most likely sequence of states
EXPERIMENTAL RESULTS
- Correctly located both ends of 53% of coding exons
- 49% of exons that VEIL predicted were exactly correct
http://www.tigr.org/salzberg/veil.html
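To make the Viterbi step concrete, here is a toy two-state example (E for exon, I for intron). The states, the probabilities and the GC-rich versus AT-rich emission assumption are invented purely for illustration; VEIL's actual model has many more states and trained parameters.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for obs, computed in log space."""
    column = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_column, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states, key=lambda p: column[p] + math.log(trans_p[p][s]))
            new_column[s] = (column[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        column = new_column
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: column[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # toy assumption: exons GC-rich
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # toy assumption: introns AT-rich
}

path = "".join(viterbi("GCGCATATAT", STATES, START, TRANS, EMIT))
print(path)
```

The returned string labels each base with its most likely state, which is the sense in which the Viterbi algorithm "aligns" a new sequence to the model.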
20Exon and Stop Codon models in VEIL
2 blank states on either side can output any base
(allow alignment to proper reading frame)
21Intron model (VEIL)
22Neural Networks
23- GrailEXP is a software package that predicts exons, genes, promoters, polyAs, CpG islands, EST similarities, and repetitive elements within DNA sequence.
- GrailEXP is used by the Computational Biosciences Section at Oak Ridge National Laboratory to annotate the entire known portion of the human genome (including both finished and draft data).
GrailPro
http://compbio.ornl.gov/grailexp/
Not in our package
[Figure: neural network. Input layer: score of 6-mers in candidate, score of 6-mers in flanks, Markov model score, flanks GC, candidate GC, score for splicing/acceptor. Hidden layer. Output: exon score. Also labeled: GC-rich regions preference score correction.]
26GeneParser
Snyder, E. E., Stormo, G. D. (1995) Identification of Coding Regions in Genomic DNA. J. Mol. Biol. 248: 1-18.
- The program scores all subintervals in a sequence for content statistics indicative of introns and exons and for sites which identify their boundaries.
- This information is weighted by a neural network to approximate the log-likelihood that each subinterval exactly represents an intron or exon (first, internal or last).
- A dynamic programming (DP) algorithm is then applied to this data to find the combination of introns and exons which maximizes the likelihood function.
[Figure: Display of suboptimal solutions for the human growth hormone gene.]
27Integrated Systems
Adding Homology
- Can try to include information from databases of known proteins to help decide whether an exon is coding
- For each candidate exon, increase the score if there is homology with a known protein
- This approach is used by Genie, GeneID, GeneParser3, Grail
Adding ESTs
- Can try to include information from EST databases
- EST (Expressed Sequence Tag) databases show sequences that are known to be present in mRNA (cDNA)
- For each candidate exon, increase the score if it matches an EST
- Used by AAT, Grail
Drawbacks
- Using homology or ESTs may bias results toward genes similar to known genes (homology) or highly expressed genes (ESTs)
28Homology-based gene prediction
TWAIN is a new syntenic genefinder which employs a Generalized Pair Hidden Markov Model (GPHMM) to predict genes in two closely related eukaryotic genomes simultaneously.
http://www.tigr.org/software/pirate/twain/twain.html
29GeneSeqer (Brendel et al.)
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Spliced Alignment Algorithm
- Perform pairwise alignment with large gaps in one sequence (introns)
- Align genomic DNA with cDNA, EST or protein
- Score semi-conserved sequences at splice junctions
- Score coding constraints in translated exons
[Figure: workflow with the boxes Genomic Sequence, EST or protein database, Fast Search, Spliced Alignment, Assembly, Output]
33Evaluation of Gene Prediction Methods
- What to consider when comparing
  - type of analysis (neural network, linear discriminant etc.)
  - types of sequences used for training and test
  - Also, parameters affect the predictions.
- An ideal method should use
  - A known set of gene structures (training)
  - A different set for test
- Evaluation is more stringent when
  - Test set includes a gene and neighboring sequence, rather than only the sequence between the first and the last exons
34Evaluation of Gene Prediction
Am I finding the things that I'm supposed to find?
What fraction of my predictions are true?
35Ideal Distribution of Scores
More Realistically
36- # of actual positives: AP = TP + FN
- # of actual negatives: AN = FP + TN
- # of predicted positives: PP = TP + FP
- # of predicted negatives: PN = TN + FN
- Sensitivity: SN = TP/AP = TP/(TP + FN)
- Specificity: SP = TP/PP = TP/(TP + FP)
- Correlation coefficient: ranges over [-1, 1]
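These measures are straightforward to compute from the four counts. The slide gives only the range of the correlation coefficient; the sketch below assumes the usual definition, CC = (TP*TN - FP*FN) / sqrt(AP*AN*PP*PN), and the function name and example counts are made up for illustration.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Specificity follows the gene-finding convention used here, TP/PP.
    The correlation coefficient is None when it is undefined (zero denominator).
    """
    ap, an = tp + fn, fp + tn      # actual positives / negatives
    pp, pn = tp + fp, tn + fn      # predicted positives / negatives
    sn = tp / ap
    sp = tp / pp
    denom = math.sqrt(ap * an * pp * pn)
    cc = (tp * tn - fp * fn) / denom if denom else None
    return sn, sp, cc


sn, sp, cc = evaluate(tp=80, fp=20, tn=880, fn=20)
print(f"SN={sn:.2f} SP={sp:.2f} CC={cc:.3f}")
```

A predictor that calls every site positive has TN = FN = 0, so SN is 1 while the correlation coefficient is undefined, which is one reason sensitivity alone is a poor summary.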
37- In a later study (Zhang 97)
- Programs including protein sequence DB searches (GeneID, GeneParser3) achieved substantially greater accuracy (Burset 96)
- Gene prediction programs reliably locate genomic regions, but provide only an approximation of gene structure
38- 2001: Rogic redid the comparison with a better test set of 195 genes
- http://www.cse.ucsc.edu/rogic/evaluation.html
REFERENCES
A. Krogh. 1998. Gene finding: putting the parts together. http://www.cbs.dtu.dk/krogh/publications/ps/Krogh98b.pdf
D. Haussler. 1998. Computational genefinding. http://www.cse.ucsc.edu/haussler/grpaper.pdf
39Exons Predicted in an Arabidopsis Genomic Sequence
Note: Arabidopsis UVH1 gene (with approx. 250 bp upstream from the first exon and 200 bp downstream from the last exon) used. NOT to be taken as a measure of reliability of these programs.
x not predicted includes the termination
codon
40About GeneZilla
http://www.genezilla.org/
About TWAIN
http://www.tigr.org/software/pirate/twain/twain.html
41- Some of the best programs
- GenScan http://genes.mit.edu/GENSCAN.html
- GeneMark http://opal.biology.gatech.edu/GeneMark/
- Other programs
- AAT
- EcoParse
- Fexeh
- Fgeneh
- Fgenes
- Finex
- GeneHacker
- GeneID-3
- GeneParser 2
- GeneScope
- Genie
- GenLang
- Glimmer, GlimmerM
- Grail II
42BCM GeneFinder, Baylor College of Medicine, Houston, TX
http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
http://www.genefinding.org/software.html
http://www.tigr.org/software/
http://www.fruitfly.org/seq_tools/genie.html
INDEX SITES
http://restools.sdsc.edu/biotools/biotools16.html
http://www.bioinformatics.vg/index.shtml
43PROMOTER PREDICTION IN EUKARYOTES
45- Transcriptional Regulation in Eukaryotes
- Transcription involves the interaction of TFs (transcription factors, protein complexes) with
  - Each other
  - DNA-binding sites in the promoter region
- Degree of expression of a gene is influenced by
  - the region upstream from the transcription start point
  - the region downstream
- A TATA box is present in most eukaryotic promoters (75% in vertebrates)
- A TATA box HMM trained for vertebrates has the consensus sequence TATAWDR starting at 17 bp from TSS
  - W = A/T, D = not C, R = G/A
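The degenerate consensus can be searched for directly by expanding the IUPAC codes into a regular expression. This is a plain pattern match, far cruder than the HMM mentioned above. The helper names are invented for this sketch, and only the codes needed for TATAWDR are included.

```python
import re

# IUPAC nucleotide codes needed for this consensus (W = A/T, D = not C, R = G/A).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "D": "[AGT]", "R": "[AG]"}


def consensus_to_regex(consensus):
    """Compile a consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[code] for code in consensus)
    return re.compile(f"(?={body})")


def find_consensus(seq, consensus="TATAWDR"):
    """Return the 0-based start positions of all matches in seq."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(seq.upper())]


hits = find_consensus("GGCTATAAAAGGC")
print(hits)
```

A match found this way is only a candidate: as noted elsewhere in these slides, such signals are weak and occur by chance throughout a genome.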
46- INR also influences the start position of transcription.
  - a loosely defined sequence around TSS
  - may be recognized by other protein subunits of TFIID (a TF that recognizes and binds to the promoter DNA)
- CCAAT and GC boxes also discovered around TSS (at variable distances)
- Many different TFs may be involved in the regulation of a particular eukaryotic gene. DNA-binding sites for many of these TFs are unknown, which limits promoter prediction.
47- Gene expression is also influenced by the region upstream of the core promoter and other enhancer sites.
- Eukaryotic sequences show variation not only between species but also among genes within a species. Hence, a set of promoters in an organism that share a common regulatory response is analyzed.
- The programs can predict 13-54% of the TSSs correctly, but each program also predicted a number of false-positive TSSs.
48Finding Less-conserved Binding Sites
- In E. coli the sequences could be aligned by TSS, -10 and -35 regions. In many cases, it is not possible to find a conserved binding site by aligning the sequences.
- Similar to finding patterns common to a set of protein sequences that cannot be aligned. However, more difficult.
- Methods
  - Expectation maximization
    - Guess an initial scoring matrix of estimated length.
    - Scan each sequence, calculate probability of matches, update (sequence pos. x probability) scoring matrix, then repeat until no change.
  - Hidden Markov Models
  - Statistical Method of Finding Patterns
    - A dinucleotide analysis is performed to reduce background noise. A Gibbs sampling method considering inverted repeats (e.g. for lexA) is applied.
  - Hertz, Stormo and Hartzell Method
49Hertz, Stormo and Hartzell Method (for
DNA-binding Sites)
- Objective: find the 4-mer in each sequence that matches, as nearly as possible, a pattern found in ALL sequences
Information content
Consensus sequence
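The two quantities named on this slide can be computed for any candidate set of aligned 4-mers. The sketch below assumes the common definition of information content, the sum over positions and bases of f * log2(f / p), with a uniform background p = 0.25. It scores one given alignment and does not implement the search over candidate alignments that the method itself performs.

```python
import math
from collections import Counter


def information_content(sites, background=0.25):
    """Sum over columns of f * log2(f / background) for each observed base."""
    total = 0.0
    for column in zip(*sites):
        for count in Counter(column).values():
            f = count / len(sites)
            total += f * math.log2(f / background)
    return total


def consensus(sites):
    """Most frequent base in each column of the aligned sites."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sites))


sites = ["TATA", "TATA", "TACA", "TATA"]   # toy alignment, one 4-mer per sequence
ic = information_content(sites)
print(consensus(sites), round(ic, 3))
```

A fully conserved column contributes 2 bits under this background, so the three conserved columns here give 6 bits and the mixed third column adds a little over 1 bit.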
50- Methods
- Neural network trained on TATA and Inr sites, allowing a variable spacing between sites. NN-GA approach to identify conserved patterns in RNA Pol II promoters and conserved spacing among them (PROMOTER2.0).
- TATA box recognition using weight matrix and density analysis of TF sites.
- Usage of linear (TSSD and TSSW) / quadratic (CorePromoter) discriminant function. The function is based on
  - TATA box score
  - Base-pair frequencies around TSS (triplet)
  - Frequencies in consecutive 100-bp upstream regions
  - TF binding site prediction
- Searches of weight matrices for different organisms against a test sequence (TFSearch/TESS). MatInspector and ConInspector allow user-provided limits on type of weight matrix, generation of new matrices etc.
- Testing for presence of clustered groups (or modules) of TF binding sites which are characteristic of a given pattern of gene regulation.
51[Figure labels: Brain tissue; Functional promoters; Scoring matrices; TEST and Selection; Logit value of the promoter (0-1)]
Neural Networks (PROMOTER 2.0)
http://www.cbs.dtu.dk/services/promoter/
Density of TF from EPD (PromoterScan)
http://bimas.dcrt.nih.gov/molbio/proscan/
exercise
Searches of weight matrices against a test sequence (TFSearch/TESS)
http://www.cbil.upenn.edu/cgi-bin/tess/tess
52Promoter Databases
TRANSFAC is a database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human. (Biological Databases/Biologische Datenbanken GmbH)
In release 4.0, it contains 8415 entries, 4504 of them referring to sites within 1078 eukaryotic genes, the species of which range from yeast to human. Additionally, this table comprises 3494 artificial sequences which resulted from mutagenesis studies, in vitro selection procedures starting from random oligonucleotide mixtures, or from specific theoretical considerations. And finally, there are 417 entries with consensus binding sequences given in the IUPAC code.
http://www.gene-regulation.com/
Free registration
MatInspector: Search for potential transcription factor binding sites in your own sequences with the matrix search program MatInspector using the TRANSFAC 4.0 matrices.
FastM: A program for the generation of models for regulatory regions in DNA sequences. FastM uses the TRANSFAC 3.4 matrices.
PatSearch: Search for potential transcription factor binding sites in your own sequences with the pattern search program using TRANSFAC 3.5 and TRRD 3.5 sites.
FunSiteP: Run FunSiteP interactively. Recognition and classification of eukaryotic promoters by searching transcription factor binding sites using a collection of transcription factor consensus sequences.
53TESS: Transcription Element Search System
Computational Biology and Informatics Laboratory, School of Medicine, University of Pennsylvania, 1997
http://www.cbil.upenn.edu/cgi-bin/tess/tess33?WELCOME
Eukaryotic Promoter Database, Swiss Institute for Experimental Cancer Research
- The Eukaryotic Promoter Database is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally.
- The annotation part of an entry includes description of the initiation site mapping data, cross-references to other databases, and bibliographic references.
- EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis.
- EPDEX is a complementary database which allows users to view available gene expression data for human EPD promoters.
- EPDEX is also accessible from the ISREC-TRADAT database entry server.
http://www.epd.isb-sib.ch/
AliBaba: Prediction of transcription factor binding sites by constructing matrices on the fly from TRANSFAC 4.0 sites.
http://darwin.nmsu.edu/molb470/fall2003/Projects/solorz/
http://www.epd.isb-sib.ch/TRADAT.html
54McPromoter MMII -- The Markov Chain Promoter Prediction Server
Massachusetts Institute of Technology
http://genes.mit.edu/McPromoter.html
59Predicting Genes - Basic steps
- Obtain genomic DNA sequence
- Translate in all 6 reading frames
- Compare with protein sequence database
- Also perform database similarity search with EST and cDNA databases, if available
- Use gene prediction programs to locate genes
- Analyze gene regulatory sequences
60AACGGACTTCCACTGAGCGATGTGAAAACGTTACAGGTTCAGTACTTCCA
AAGGAAGAAACCTCCAAACCCAAAAAAGAATAAA TATGAATTTGTATTT
TTGAAGAATGTGAAATAATGGTGTTTGCTTAATTGCTCATTTTGTATAAA
CTTAATATTGTACTTTAAAATATCTGCTAAAAAGTGAAAATTTAACTTTT
TGGAATTGAAAAAGCAATATTAAATACTAATGAAATCCTAATTAAATGCT
TATTTAAATCTGGTAGTATCTGTGGCATTTCTTACCAACCCTGCCCATAG
TTGACATTTTTCCACCACTCCCCCCTTCCCAGCCATCAGTCTTGGAGAGG
GGACAGAAAGGAAACGTCGGTCACCAGGAGAGTCTGCAGGTTTCCTTTTA
ATCAAGGCTCTACTGAAGGTGTTTTGTGGGGCTAAAAGCCCCCAAAACAT
GAAATGGACATGTAACACCACCTGGATCCCCCATAGCAGGCCAGACCACT
CTGGCGAGCACTGCTGGTCTGCCCAAATCTGGGTAATCAGACTGGGTATT
CATTGGCTGCATTTCAAAGCACAGCACTGCTTTCAGCCAGGATGAAGTGG
GAGTGAACCCAGCTGCTAGCAGAGCTGCCACTCCAGGCTGAGAGCCAAGT
ACCAGCCACTGCCAGTGAAGACTGGCCCCTTTACTGAAGGGAGTTGTTCA
GAGTCCAGCCACCGGCCCTGGGGAGGGAGAGAAGTCAGGGTATTCTGCTC
GGGGATGGTCAGGGCTCCGCAGCTCCATCGCCAGCATCCTTTGGAAAGCC
GCCTCTGGCGGAGACAGCCGGCTGGGGGGGCGCTCCAGGTTTGGCTGAGA
CGTTCTAGTTGGAACAGAAAGGAAAAAAGTGAGGCTGGGAGGCAAGGCCT
TGGATTAGGCCCCACAAGGATGTGGCCATTTGGCATTTGGATAGTATTAA
CTTTTTCGAAACCTCTCACCAGATCAAAGGAGGTTAGGGATAAAGCGGCG
GAGACATACTTCCCCCCTCCAGGGTAAGCTAGGGCTTGGCCAGCCTAGCC
AGTGGGCAGACCCCACCCCACCCCAGCCCAGCCCAGGGTGGGCACTAACC
CCGCCACCAGCCGGCTCCGGGCGCCGGCGGCCCAGCTGCCGTAACATCTC
CTCGCAGGCTGCGATGGTGTCCAGGAGCTGCCGCTGCCGCTGCTCCACCG
CGTCCAGCAGCTGCTGGGCGCGCTCCTCCCGGGGCGGCTGTGGGGGTGGC
CTCCCGCCGAGCCCCAGCCCCGCCTTCCCGCGGTCCACGCCGGCAGCCTC
CCTGCCCGGGAGAGAGCGAGAGACAGACGGTCAGGGCCGGCGCTTGCGCG
GGGCCAAGCCCCTTCCTCCCGCCCCGACGGGCCCCCTCTCACCCCCGTGA
CCAGTCTGAGCCCGGGCCCCATTCCATCTCCGCTTGCGCGGCCCGACCAC
CGCCCCCCTTTCGGCCGCCCCCCTCCCCAGCGCTGCGTTAGGGCTTCGCA
AGGCTGCGCCCCGCCCCGTCCCCACCGGTCTCCTTCAATCCTCCTGGGGG
TCGTGGTCCCTTTAAGCTGCCCGGCGCAGAGGCGGGGCCGAGTCTCCTGG
ACCGGAAGCTGGCTGGGAGCGTCACTTCCTCCCGGAAGCGGGCCTGGGCG
G