Title: Gene Prediction
1 Gene Prediction
- Chuong Huynh
- NIH/NLM/NCBI
- July 18, 2002
- huynh_at_ncbi.nlm.nih.gov
Acknowledgement: Daniel Lawson, Neil Hall
2 Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence databases
2. Perform database similarity search of the expressed sequence tag (EST) database of the same organism, or of cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
3 Why is gene prediction important?
- Increased volume of genome data generated
- Paradigm shift from gene-by-gene sequencing (small scale) to large-scale genome sequencing.
- No more one gene at a time. A lot of data.
- Foundation for all further investigation. Knowledge of the protein-coding regions underpins functional genomics.
Note: this presentation covers only the prediction of genes that encode proteins. It does not cover promoter prediction, i.e. finding the sequences that regulate the activity of protein-encoding genes.
4 What is gene prediction?
- Detecting meaningful signals in uncharacterised DNA sequences.
- Knowledge of the interesting information in DNA.
- Sorting the wheat from the chaff
- Gene prediction is recognising protein-coding regions in genomic sequence
8 Artemis: Free Genome Visualization/Annotation Workbench
9 Knowing what to look for
- What is a gene?
- Not a full transcript with control regions
- The coding sequence (ATG -> STOP)
10 ORF Finding in Prokaryotes
- Simplest method of finding DNA sequences that encode proteins: search for open reading frames (ORFs)
- An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
- Six possible reading frames
- Good for prokaryotic systems (no/little post-transcriptional modification such as splicing)
- Runs from Met (AUG) on the mRNA -> stop codon TER (UAA, UAG, UGA)
- http://www.ncbi.nih.gov/gorf/ (NCBI ORF Finder)
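The ORF definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the NCBI ORF Finder algorithm: the function names and the `min_codons` threshold are assumptions made for the example, and the standard stop codons are used.

```python
# Minimal six-frame ORF scan (illustrative sketch; names are hypothetical).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) for every ATG..stop ORF.

    start/end are 0-based offsets on the scanned strand; end is exclusive
    and includes the stop codon. min_codons counts the stop codon too.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("ATGAAATAG"))
```

In practice a much larger `min_codons` (for example 100) would be used, since short ORFs occur frequently by chance.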
11 Annotation of eukaryotic genomes
[Diagram: Genomic DNA -> transcription -> Unprocessed RNA -> RNA processing -> Mature mRNA (Gm3 ... AAAAAAA) -> translation -> Nascent polypeptide -> folding -> Active enzyme -> Function (Reactant A -> Product B)]
[Diagram labels: ab initio gene prediction (w/o prior knowledge); Comparative gene prediction (use other biological data); Functional identification]
12 Gene finding Issues
- Issues regarding gene finding in general
- Genome size
- (larger genome more genes, but )
- Genome composition
- Genome complexity
- (more complexity -> less coding density, fewer genes per kb)
- cis-splicing (processing mRNA in eukaryotes)
- trans-splicing (in kinetoplastids)
- alternate splicing (e.g. in different tissues of higher organisms)
- Variation of genetic code from the universal code
13 Gene finding genome
- Genome composition
- Long ORFs tend to be coding
- Presence of more putative ORFs in GC-rich genomes (the stop codons UAA, UAG and UGA are AT-rich, so they occur less often by chance)
- Genome complexity
- Simple repetitive sequences (e.g. dinucleotide) and dispersed repeats tend to be non-coding
- May need to mask sequence prior to gene prediction
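Masking simple repeats before gene prediction can be sketched as follows. This is only a toy regular-expression approach, assuming repeat units of 1 to 3 bases and a hypothetical `min_repeats` threshold; dedicated repeat-masking tools also handle dispersed repeats, which this sketch does not.

```python
import re


def mask_simple_repeats(seq, min_repeats=6):
    """Replace tandem repeats of a 1-3 bp unit with N.

    A run is masked when the unit occurs at least min_repeats times in a
    row. Note that mononucleotide runs (e.g. AAAAAA) are masked as well.
    """
    # ([ACGT]{1,3}?) captures the shortest unit; \1{n,} demands n more copies.
    pattern = re.compile(r"([ACGT]{1,3}?)\1{%d,}" % (min_repeats - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq.upper())


print(mask_simple_repeats("GGC" + "AT" * 6 + "GGC"))
```

Masking with N keeps sequence coordinates unchanged, so gene predictions made on the masked sequence map directly back to the original.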
14 Gene finding coding density
- As the coding/non-coding length ratio decreases,
exon prediction becomes more complex
[Figure: example regions from Human, Fugu, worm and E. coli]
15 Gene finding splicing
- cis-splicing of genes
- Finding multiple (short) exons is harder than
finding a single (long) exon.
- trans-splicing of genes
- A trans-splice acceptor is no different to a
normal splice acceptor
[Figure: examples from worm and E. coli]
16 Gene finding alternate splicing
- Alternatively spliced isoforms are very difficult to predict.
[Figure: isoforms Human A, Human B, Human C]
17 ab initio prediction
- What is ab initio gene prediction?
- Prediction from first principles using the raw
DNA sequence only.
Requires training sets of known gene structures
to generate statistical tests for the likelihood
of a prediction being real.
18 Gene finding ab initio
- What features of an ORF can we use?
- Size - large open reading frames
- DNA composition - codon usage / 3rd position codon bias
- Kozak sequence CCGCCAUGG
- Ribosome binding sites
- Termination signal (stops)
- Splice junction boundaries (acceptor/donor)
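Two of the compositional features listed above, codon usage and third-position bias, are easy to compute. The sketch below is illustrative only: the function names are hypothetical, and GC content at the third codon position (GC3) is used here as one simple measure of third-position bias.

```python
from collections import Counter


def codon_usage(cds):
    """Count each codon in a coding sequence (trailing partial codon ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def gc3(cds):
    """Fraction of complete codons with G or C at the third position."""
    thirds = cds.upper()[2::3]  # indices 2, 5, 8, ... are third positions
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


example = "ATGGCCGCCTAA"
print(codon_usage(example))
print(gc3(example))
```

A predictor would compare such statistics for a candidate ORF against values from a training set of known genes of the same organism.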
19 Gene finding features
- Think of a CDS gene prediction as a linear
series of sequence features
Initiation codon
Coding sequence (exon)
Repeated N times:
  Splice donor (5')
  Non-coding sequence (intron)
  Splice acceptor (3')
  Coding sequence (exon)
Termination codon
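The linear series of features above can be checked mechanically for a proposed gene model. The sketch below is an assumption-laden illustration: exons are given as hypothetical 0-based, end-exclusive coordinates on the forward strand, and only the canonical GT donor / AG acceptor dinucleotides are accepted, although a minority of real introns use other signals.

```python
STOPS = {"TAA", "TAG", "TGA"}


def spliced_cds(genomic, exons):
    """Join exon intervals (start, end), end exclusive, in 5'->3' order."""
    return "".join(genomic[start:end] for start, end in exons)


def check_gene_model(genomic, exons):
    """Return a list of problems found in a forward-strand gene model."""
    genomic = genomic.upper()
    cds = spliced_cds(genomic, exons)
    problems = []
    if not cds.startswith("ATG"):
        problems.append("no initiation codon")
    if cds[-3:] not in STOPS:
        problems.append("no termination codon")
    if len(cds) % 3:
        problems.append("CDS length not a multiple of 3")
    # Each intron lies between the end of one exon and the start of the next.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = genomic[end1:start2]
        if not intron.startswith("GT"):
            problems.append(f"intron at {end1} lacks GT donor")
        if not intron.endswith("AG"):
            problems.append(f"intron at {end1} lacks AG acceptor")
    return problems


genomic = "ATGGCC" + "GTAAAAAG" + "GGATAA"
print(check_gene_model(genomic, [(0, 6), (14, 20)]))
```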
20 A model ab initio predictor
- Locate and score all sequence features used in gene models
- Dynamic programming to make the highest-scoring model from available features.
- e.g. Genefinder (Green)
- Running a 5' -> 3' pass of the sequence through a Markov model based on a typical gene model
- e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg)
- Running a 5' -> 3' pass of the sequence through a neural net trained with confirmed gene models
- e.g. GRAIL (Oak Ridge)
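The dynamic programming idea in the first approach can be illustrated with a toy problem: given candidate exons that already carry scores, pick the non-overlapping set with the highest total score. This is not Genefinder's actual algorithm; it is a standard weighted-interval chaining sketch, and it ignores reading-frame compatibility and splice-site scores. The function name and the input format are assumptions.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Pick the highest-scoring set of non-overlapping exons.

    candidates: list of (start, end, score), end exclusive.
    Returns (total_score, chain), with the chain in 5'->3' order.
    """
    exons = sorted(candidates, key=lambda e: e[1])  # sort by end position
    ends = [e[1] for e in exons]
    # best[i] holds the optimum using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this start
        j = bisect_right(ends, start, 0, i)
        take = best[j][0] + score
        if take > best[i][0]:
            best.append((take, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]


candidates = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (210, 300, 3.0)]
print(best_exon_chain(candidates))
```

Here the single best exon (score 6.0) is dropped because the two exons that overlap it together score higher, which is the kind of global trade-off dynamic programming resolves.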
21 Ab initio Gene finding programs
- Most gene finding software packages use some variant of Hidden Markov Models (HMM).
- Predict coding, intergenic, and intron sequences
- Need to be trained on a specific organism.
- Never perfect!
22 What is an HMM?
- A statistical model that represents a gene.
- Similar to a weight matrix that can recognise gaps and treat them in a systematic way.
- Has different states that represent introns, exons, and intergenic regions.
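To make the idea of states concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. All probabilities are invented for illustration (an AT-rich "intergenic" state and a GC-rich "exon" state); real gene finders use many more states, including introns, and parameters estimated from training data.

```python
import math

# Hypothetical parameters, chosen only to make the example readable.
STATES = ("intergenic", "exon")
START = {"intergenic": 0.9, "exon": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon": {"intergenic": 0.1, "exon": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "exon": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for a DNA string (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATAT" + "GCGCGCGC" + "ATATATAT"))
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale input.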
23 Malaria Gene Prediction Tool
- Hexamer: ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
- Genefinder: email colin_at_u.washington.edu
- GlimmerM: http://www.tigr.org/softlab/glimmerm
- Phat: http://www.stat.berkeley.edu/users/scawley/Phat
- Already trained for malaria! The more experimentally derived genes used to train a gene prediction tool, the more reliable the gene predictor.
24 GlimmerM: Salzberg et al. (1999) Genomics 59: 24-31
- Adaptation of the prokaryotic genefinder Glimmer.
- Delcher et al. (1999) NAR 27: 4636-4641
- Based on an interpolated Markov model (IMM).
- Uses only short chains of bases (Markov chains) to generate probabilities.
- Trained identically to Phat
25 An end to ab initio prediction
- ab initio gene prediction is inaccurate
- High false positive rates for most predictors
- Exon prediction sensitivity can be good
- Rarely used as a final product
- Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
- Used as a starting point for refinement/verification
- Predictions need correction and validation
- Why not just build gene models by comparative means?
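Scoring exons by how many predictors agree on them can be sketched as a simple vote. The predictor names below are placeholders, and requiring identical coordinates is a simplifying assumption; a real pipeline would usually tolerate small boundary differences.

```python
from collections import Counter


def score_exons(predictions):
    """Count how many predictors report each exact (start, end) exon.

    predictions: dict mapping predictor name -> iterable of (start, end).
    """
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))  # one vote per predictor per exon
    return votes


def consensus_exons(predictions, min_votes=2):
    """Return exons supported by at least min_votes predictors, sorted."""
    votes = score_exons(predictions)
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


predictions = {
    "predictor_A": [(10, 50), (100, 200)],
    "predictor_B": [(10, 50), (120, 200)],
    "predictor_C": [(10, 50), (100, 200), (300, 350)],
}
print(consensus_exons(predictions))
```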
26 PAUSE (continue)
27 Annotation of eukaryotic genomes
[Diagram: Genomic DNA -> transcription -> Unprocessed RNA -> RNA processing -> Mature mRNA (Gm3 ... AAAAAAA) -> translation -> Nascent polypeptide -> folding -> Active enzyme -> Function (Reactant A -> Product B)]
[Diagram labels: ab initio gene prediction (w/o prior knowledge); Comparative gene prediction (use other biological data); Functional identification]
28 If a cell was human?
- The cell knows how to splice a gene together.
- We know some of these signals but not all and not all of the time
- So compare with known examples from the species and others
Central dogma for molecular biology: Genome -> Transcriptome -> Proteome
29 When a human looks at a cell
- Compare with the rest of the genome/transcriptome/proteome data
30 comparative gene prediction
- Use knowledge of known coding sequences to identify regions of genomic DNA by similarity
- transcriptome - transcribed DNA sequence
- proteome - peptide sequence
- genome - related genomic sequence
31 Transcript-based prediction datasets
- Generation of large numbers of Expressed Sequence Tags (ESTs)
- Quick, cheap but random
- Subtractive hybridisation to find rare transcripts
- Use multiple libraries for different life-stages/conditions
- Single-pass sequence prone to errors
- Generation of small number of full length cDNA sequences
- Slow and laborious but focused
- Large-scale sequencing of (presumed) full length cDNAs
- Systematic, multiplexed cloning/sequencing of CDS
- Expensive and only viable if part of bigger project
32 Gene Prediction in Eukaryotes Simplified
- For highly conserved proteins
- Translate DNA sequence in all 6 reading frames
- BLASTX or FASTX to compare the sequence to a protein sequence database
- Or
- Protein compared against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN, TFASTX/TFASTY programs.
- Note: approximation of the gene structure only.
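The six-frame translation step that BLASTX and TBLASTN perform internally can be sketched as below. This assumes the standard genetic code (organisms with variant codes need a different table), and the function names and frame labels are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return {'+1': ..., '+2': ..., '+3': ..., '-1': ..., '-2': ..., '-3': ...}."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, s in (("+", seq), ("-", rev)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each translated frame would then be compared against a protein database; hits mark candidate coding regions but, as the slide notes, give only an approximate gene structure.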
33 Transcript-based prediction How it works
- Align transcript data to genomic sequence using
a pair-wise sequence comparison
[Figure: tracks labelled Gene Model, EST and cDNA]
34 Transcript-based gene prediction algorithm
- BLAST (Altschul) (36 hours)
- Widely used and understood
- HSPs often have ragged ends, so alignments extend into the ends of the introns
- EST_GENOME (Mott) (3 days)
- Dynamic programming post-process of BLAST
- Slow and sometimes cryptic
- BLAT (Kent) (1/2 hour)
- Next generation of alignment algorithm
- Designed for looking at nearly identical sequences
- Faster and more accurate than BLAST
35 Peptide-based gene prediction algorithm
- BLAST (Altschul)
- Widely used and understood
- Smith-Waterman
- Preliminary to further processing
- Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation
36 Genomic-based gene prediction algorithm
- BLAST (Altschul)
- Can be used in TBLASTX mode
- BLAT (Kent)
- Can be used in a translated DNA vs translated DNA mode
- Significantly faster than BLAST
- WABA (Kent)
- Designed to allow for 3rd position codon wobble
- Slow with some outstanding problems
- Only really used in C.elegans v C.briggsae
analysis
37 Comparative gene predictors
- This can be viewed as an extension of the ab initio prediction tools where coding exons are defined by similarities and not codon bias
- GAZE (Howe) is an extension of Phil Green's Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C.elegans project.
- GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.
38 Comparative gene predictors
- A new generation of comparative gene prediction tools is being developed to utilise the large amount of genomic sequence available.
- Twinscan (WashU) attempts to predict genes using related genomic sequences.
- Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict 2 orthologous CDSs from genomic regions pre-defined as matching.
- Both of these predictors are in development and will be used for the C.elegans v C.briggsae match and the Mouse v Human match later this year.
39 Summary
- Genes are complex structures that are difficult to predict with the required level of accuracy/confidence
- We can predict stops better than starts
- We can only give gross confidence levels to predictions (i.e. confirmed, partially confirmed or predicted)
- Gene prediction is only part of the annotation procedure
- Movement from ab initio to comparative methodology as sequence data becomes available/affordable
- Curation of gene models is an active process: the set of gene models for a genome is fluid and WILL change over time.
40 The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
41 Annotation Process
42 Artemis
- Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.
- http://www.sanger.ac.uk/Software/Artemis/
43atcttttacttttttcatcatctatacaaaaaatcatagaatattcatca
tgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacatt
atgttatcttcctcagtgtttttcattattatttgcatgtacagtttatc
a tttttatgtaccaaactatatcttatattaaatggatctctacttata
aagttaaaatctttttttaattttttcttttcacttccaattttatattc
cg cagtacatcgaattctaaaaaaaaaaataaataatatataatatata
ataaataatatataataaataatatataatatataataaataatatataa
tat ataatatataataaataatatataatatataatatataataaataa
tatataataaataatatataatatataatatataatactttggaaagatt
attt atatgaatatatacacctttaataggatacacacatcatatttat
atatatacatataaatattccataaatatttatacaacctcaaataaaat
aaaca tacatatatatatataaatatatacatatatgtatcattacgta
aaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggt
attagg agatatatttactgattcctcatttttataaatgttaaaatta
ttatccctagtccaaatatccacatttattaaattcacttgaatattgtt
ttttaaa ttgctagatatattaatttgagatttaaaattctgacctata
taaacctttcgagaatttataggtagacttaaacttatttcatttgataa
actaatat tatcatttatgtccttatcaaaatttattttctccatttca
gttattttaaacatattccaaatattgttattaaacaagggcggacttaa
acgaagtaa ttcaatcttaactccctccttcacttcactcattttatat
attccttaatttttactatgtttattaaattaacatatatataaacaaat
atgtcactaa taatatatatatatatatatatatatatatatattataa
atgttttactctattttcacatcttgtccttttttttttaaaaatcccaa
ttcttattcat taaataataatgtatttttttttttttttttttttttt
attaattattatgttactgttttattatatacactcttaatcatatatat
atatttatatat atatatatatatatatatatattattcccttttcatg
ttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatattt
ttataacatatgt attattaaaatgtatatataaaaatatatattccat
ttattattatttttttatatacattgttataagagtatcttctcccttct
ggtttatattacta ccatttcactttgaacttttcataaaaattaatag
aatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaata tatatatatatatatatacatataatatatattt
catctaatcatttaaaattattattatatattttttaaaaaatatattta
tgataacataaaaaga atttaattttaattaaatatatataattacata
catctaatattattatatatatataataagttttccaaatagaatactta
tatattatatatatata tatatatatatatattcttccataaaaagaat
aaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacatt
gaatatatagttgtattt ataaaattaaagaaaaagcataaagttacca
tttaatagtggagattagtaacattttcttcattatcaaaaatatttatt
tcctaattttttttttttg taaaatatatttaaaaatgtaatagattat
gtattaaataatataaatatagcaaaatgttcaattttagaaatttgcct
ctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataa
agtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatg
acatgttataatataatataa taaataaaaattatgtaatatatcataa
tcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaat
acatatataactaacattcata tctttatttttgtagatgatataaaaa
attttataaactcttatgaagggatatatttttcatcatccaataaattt
ataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaa
atgttattacaataaatacagatctgtatgtagttgatttcctttttaat
gagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagat
ttatatttttatttatttattatatattattttttaatttttcttttata
tatttattttatttagtgtataaaa tgatatcctttatatttatattta
catgggatattcaaataataacaaaaatgagtatacacatatatatatat
atatatatatatatgtatattttttt tttttttttatgttcctatagga
aagggaagaattcactgatttgtagtgtttacaatattagggaatgcaac
tttacacttttgaaaaaaattcagtta agcaaaaatattaataacatta
aaaagacactgatagcaaaatgtaatgaatatataataacattagaaaat
aagaaaattactttttatttcttaaata aagattatagtataaatcaaa
gtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtca
aaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatatacc
aattagatattaaaaattcccatattagttatacacttattgatagtttc
aatttaaatttatcctacctcagagaatct ataaataataaaaaaaagc
atataaataaaataaatgatgtatcaaataatgacccaaaaaaggataat
aatgaaaaaaatacttcatctaataatataa cacataacaattataatg
acatatcaaataataataataataataataatattaatggggtgaaagac
catataaataataacactctggaaaataatga tgaaccaatcttatcta
tatataatgaagatcttaatgttttatatatatgccaaaatatgtataac
gtcctttttgttttgaatttaaataacctaagt
44 DNA in Artemis
[Screenshot labels: GC content; black bar = stop codon; forward translations; reverse translations; DNA and amino acids]
45 Extra Slides
46 Gene prediction
- What is gene prediction?
- Why is gene prediction important?
- Ab initio gene prediction (w/o prior knowledge)
- Comparative gene prediction (use other biological data)
- Summary
47 Genome annotation is central to functional genomics
ORFeome based functional genomics
Gene Knockout
RNAi phenotypes
Expression Microarray
48 Gene finding
- Artemis genome viewer
- Coding sequence vs non coding sequence
- Gene finding software
- Homology between species
- ESTs
50 Pretty Handy Annotation Tool (PHAT)
- Based on a generalised hidden Markov model (GHMM)
- Free, easily installed and run.
- Is good at predicting multi-exon genes, but will in some cases miss genes altogether and will over-predict.
- Cawley et al. (2001) Mol. Bio. Para. 118 p167
- http://www.stat.berkeley.edu/users/scawley/Phat/
51 Phat
http://linkage.rockefeller.edu/wli/gene/krogh98.pdf
52 GlimmerM
- Under-predicts splicing
- Hardly ever misses a gene completely.
- Does over-predict.
- Free with TIGR license
53 Comparison Of Gene Finders