Title: Gene Prediction and Annotation techniques Basics
1Gene Prediction and Annotation techniques Basics
- Chuong Huynh
- NIH/NLM/NCBI
- Sept 30, 2004
- huynh_at_ncbi.nlm.nih.gov
Acknowledgement: Daniel Lawson, Neil Hall
2What is gene prediction?
Detecting meaningful signals in uncharacterised
DNA sequences. Knowledge of the interesting
information in DNA. Sorting the wheat from the
chaff
- Gene prediction is recognising protein-coding
regions in genomic sequence
3Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence databases
2. Perform a database similarity search against the expressed sequence tag (EST) database of the same organism, or against cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
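Step 1 of the flow chart (translation in all six reading frames) can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and an input containing only A, C, G and T; the function names are made up for this example and do not come from any particular package.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, peptide in six_frame_translation("ATGGCCTAA").items():
        print(label, peptide)
```

Each of the six peptides would then be compared to a protein database; stop codons appear as "*" in the output.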
4ACEDB View
5Why is gene prediction important?
- Increased volume of genome data generated
- Paradigm shift from gene-by-gene sequencing (small scale) to large-scale genome sequencing.
- No more one gene at a time. A lot of data.
- Foundation for all further investigation.
Knowledge of the protein-coding regions underpins
functional genomics.
Note: this presentation covers the prediction of
protein-encoding genes only, not the prediction of
promoters or other sequences that regulate the
activity of protein-encoding genes
7Map Viewer
Genome Scan Models
Contig
GenBank
Genes
Mouse EST hits
Human EST hits
9Artemis Free Genome Visualization/Annotation
Workbench
10Genome WorkBench
11Knowing what to look for
What is a gene? Not a full transcript with
control regions: the coding sequence (ATG ->
STOP)
12ORF Finding in Prokaryotes
- The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs)
- An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
- Six possible reading frames
- Good for prokaryotic systems (little or no post-transcriptional processing such as splicing)
- Runs from Met (AUG) on the mRNA -> stop codon TER (UAA, UAG, UGA)
- NCBI ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/
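The ORF search described above can be sketched in a few lines. This is a simplified illustration, not the NCBI ORF Finder algorithm: it assumes the standard stop codons, takes the first ATG in each frame as the start, and uses a hypothetical `min_codons` cut-off.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=30):
    """Scan all six frames for ATG...stop stretches.

    Returns tuples (strand, frame, start, end, n_codons). Coordinates are
    0-based, end-exclusive, and relative to the strand that was scanned
    (for "-" that is the reverse complement). n_codons excludes the stop.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
    print(find_orfs(toy, min_codons=3))
```

The length cut-off matters because short ORFs occur frequently by chance; a real tool would also consider alternative start codons and ORFs that run off the end of the sequence.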
13ORF Finder (Open Reading Frame Finder)
14Annotation of eukaryotic genomes
Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (5' cap, poly-A tail AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B)
Annotation steps shown alongside: ab initio gene prediction (w/o prior knowledge), comparative gene prediction (use other biological data), functional identification
15Two Classes of Sequence Information
- Signal terms: short sequence motifs (such as splice sites, branch points, polypyrimidine tracts, start codons, and stop codons)
- Content terms: patterns of codon usage that are unique to a species and allow coding sequences to be distinguished from surrounding noncoding sequences by a statistical detection algorithm
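A signal term is often scored with a position weight matrix built from aligned examples of the motif. The sketch below is one common way to do this, not the method of any specific program named in these slides; it assumes a uniform 0.25 background, a pseudocount of 1, and sites containing only A, C, G and T. The example sites in the demo are toy data invented for illustration.

```python
import math


def build_pwm(aligned_sites, pseudocount=1.0):
    """Build a log-odds matrix (one dict per position) from equal-length sites."""
    length = len(aligned_sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in "ACGT"}
        for site in aligned_sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        # log-odds of each base against a uniform background
        pwm.append({b: math.log((counts[b] / total) / 0.25) for b in "ACGT"})
    return pwm


def score_site(pwm, candidate):
    """Sum the per-position log-odds for one candidate site."""
    return sum(col[base] for col, base in zip(pwm, candidate.upper()))


def best_hit(pwm, sequence):
    """Return (score, position) of the best-scoring window in the sequence."""
    width = len(pwm)
    hits = [
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)


if __name__ == "__main__":
    toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]  # invented examples
    pwm = build_pwm(toy_sites)
    print(best_hit(pwm, "CCCCGTAAGTCCCC"))
```

Content terms need a different kind of statistic, because they describe the composition of a whole region rather than a short fixed-width motif.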
16Problem Using Codon Usage
- The program must be taught what the codon usage patterns look like by presenting it with a TRAINING SET of known coding sequences.
- Different programs search for different patterns.
- A NEW training set is needed for each species
- Untranslated regions (UTRs) at the ends of the genes cannot be detected, but most programs can identify polyadenylation sites
- Non-protein-coding RNA genes cannot be detected (a few specialized programs attempt this)
- None of these programs can detect alternatively spliced transcripts
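The role of the training set can be illustrated with a simple codon-usage score. This is a sketch under stated assumptions (uniform background over the 64 codons, pseudocount of 1, in-frame windows) and is not how any specific gene finder in these slides is implemented.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(training_set, pseudocount=1.0):
    """Estimate codon frequencies from known in-frame coding sequences."""
    counts = Counter({c: pseudocount for c in CODONS})
    for cds in training_set:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in counts:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {c: counts[c] / total for c in CODONS}


def coding_score(window, freqs):
    """Log-odds of the window's in-frame codons against a uniform background."""
    background = 1.0 / len(CODONS)
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in freqs:
            score += math.log(freqs[codon] / background)
    return score


if __name__ == "__main__":
    freqs = codon_frequencies(["ATGGCTGCTGCTTAA"])  # toy training set
    print(coding_score("GCTGCT", freqs), coding_score("CCCCCC", freqs))
```

Because the frequencies come entirely from the training sequences, a table trained on one species can score another species' genes poorly, which is why a new training set is needed for each species.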
17Explanation of False Positive/Negative in Gene
Prediction Programs
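False positives and false negatives are commonly counted per nucleotide by comparing predicted coding positions with a trusted annotation. The sketch below assumes exons are given as 0-based, end-exclusive (start, end) pairs; the function name is illustrative. Note that the gene-finding literature often uses "specificity" for TP / (TP + FP), which is called precision here to avoid ambiguity.

```python
def nucleotide_level_accuracy(true_exons, predicted_exons):
    """Compare predicted and true coding positions, base by base."""
    true_pos = {p for start, end in true_exons for p in range(start, end)}
    pred_pos = {p for start, end in predicted_exons for p in range(start, end)}
    tp = len(true_pos & pred_pos)   # coding and predicted coding
    fp = len(pred_pos - true_pos)   # predicted coding but not coding
    fn = len(true_pos - pred_pos)   # coding but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "sensitivity": sensitivity,
        "precision": precision,
    }


if __name__ == "__main__":
    print(nucleotide_level_accuracy([(0, 100)], [(50, 150)]))
```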
18Gene finding Issues
- Issues regarding gene finding in general
- Genome size
- (larger genome, more genes, but ...)
- Genome composition
- Genome complexity
- (more complexity -> lower coding density, fewer genes per kb)
- cis-splicing (processing of mRNA in eukaryotes)
- trans-splicing (in kinetoplastids)
- alternate splicing (e.g. in different tissues of higher organisms)
- Variation of the genetic code from the universal code
19Gene finding genome
- Genome composition
- Long ORFs tend to be coding
- More putative ORFs are present in GC-rich genomes, because the stop codons (UAA, UAG, UGA) are AT-rich
- Genome complexity
- Simple repetitive sequences (e.g. dinucleotide repeats) and dispersed repeats tend to be non-coding
- May need to mask the sequence prior to gene prediction
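The effect of GC content on chance ORFs can be worked through with a small calculation. It assumes independent bases with P(A) = P(T) and P(G) = P(C), which is a simplification of real genomes, and a GC fraction strictly below 1.

```python
def stop_codon_probability(gc_fraction):
    """Probability that a random codon is TAA, TAG or TGA."""
    a = t = (1.0 - gc_fraction) / 2.0
    g = gc_fraction / 2.0
    return t * a * a + t * a * g + t * g * a  # TAA + TAG + TGA


def expected_random_orf_codons(gc_fraction):
    """Mean number of codons read up to and including the first stop."""
    return 1.0 / stop_codon_probability(gc_fraction)


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        print(gc, round(expected_random_orf_codons(gc), 1))
```

Under these assumptions a stop codon turns up by chance far less often at 70% GC than at 30% GC, so long open stretches that are not genes are more common in GC-rich sequence.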
20Gene finding coding density
As the coding/non-coding length ratio decreases,
exon prediction becomes more complex
Human
Fugu
worm
E.coli
21Gene finding splicing
- cis-splicing of genes
- Finding multiple (short) exons is harder than
finding a single (long) exon.
- trans-splicing of genes
- A trans-splice acceptor is no different to a
normal splice acceptor
worm
E.coli
22 Gene finding alternate splicing
- Alternatively spliced isoforms are very difficult
to predict.
Human A
Human B
Human C
23ab initio prediction
- What is ab initio gene prediction?
- Prediction from first principles using the raw
DNA sequence only.
Requires training sets of known gene structures
to generate statistical tests for the likelihood
of a prediction being real.
24Gene finding ab initio
- What features of an ORF can we use?
- Size - large open reading frames
- DNA composition - codon usage / 3rd position codon bias
- Kozak sequence CCGCCAUGG
- Ribosome binding sites
- Termination signal (stops)
- Splice junction boundaries (acceptor/donor)
25Gene finding features
Think of a CDS gene prediction as a linear series
of sequence features
Initiation codon
Coding sequence (exon)
Repeated N times: Splice donor (5') -> Non-coding sequence (intron) -> Splice acceptor (3') -> Coding sequence (exon)
Termination codon
26 A model ab initio predictor
- Locate and score all sequence features used in gene models
- Dynamic programming to make the highest-scoring model from the available features.
- e.g. Genefinder (Green)
- Running a 5' -> 3' pass of the sequence through a Markov model based on a typical gene model
- e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg)
- Running a 5' -> 3' pass of the sequence through a neural net trained with confirmed gene models
- e.g. GRAIL (Oak Ridge)
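The dynamic programming idea can be illustrated with a deliberately reduced problem: given scored candidate exons, pick the non-overlapping set with the highest total score. This is a sketch of the general technique (weighted interval chaining), not Genefinder's actual algorithm; real gene finders also enforce reading-frame compatibility and score splice sites, introns and start/stop signals.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Pick the highest-scoring set of non-overlapping candidate exons.

    candidates: list of (start, end, score), 0-based, end-exclusive.
    Returns (total_score, chain) with the chain in left-to-right order.
    """
    exons = sorted(candidates, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the best (score, chain) using only the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this start
        j = bisect_right(ends, start, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])  # skipping this exon is at least as good
    return best[-1]


if __name__ == "__main__":
    toy = [(0, 100, 5.0), (50, 150, 8.0), (150, 300, 4.0), (100, 140, 2.0)]
    print(best_exon_chain(toy))
```

Sorting by end coordinate means every sub-problem is solved once, so the best model is found without enumerating all combinations of features.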
27Ab initio Gene finding programs
- Most gene finding software packages use some variant of Hidden Markov Models (HMMs).
- Predict coding, intergenic, and intron sequences
- Need to be trained on a specific organism.
- Never perfect!
28What is an HMM?
- A statistical model that represents a gene.
- Similar to a weight matrix that can recognise gaps and treat them in a systematic way.
- Has different states that represent introns, exons, and intergenic regions.
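How the states of an HMM label a sequence can be shown with a two-state toy model and the Viterbi algorithm. The states, probabilities and the assumption that "coding" sequence is GC-rich are all invented for illustration; a real gene finder has many more states (exons by phase, introns, splice sites, intergenic) and parameters trained on known genes. The sketch assumes a non-empty sequence of A, C, G and T.

```python
import math

STATES = ("noncoding", "coding")

# Toy parameters, invented for this example only.
TOY_START = {"noncoding": 0.5, "coding": 0.5}
TOY_TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
TOY_EMIT = {
    "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "coding": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(sequence, start_p, trans_p, emit_p):
    """Return the most probable state path for the sequence (log space)."""
    seq = sequence.upper()
    # best log-probability of any path ending in each state at position 0
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("AAAAAAAAGGGGGGGG", TOY_START, TOY_TRANS, TOY_EMIT))
```

The transition probabilities are what make the labelling systematic: switching state carries a cost, so a single odd base does not flip the prediction.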
29Malaria Gene Prediction Tool
- Hexamer: ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
- Genefinder: email colin_at_u.washington.edu
- GlimmerM: http://www.tigr.org/softlab/glimmerm
- Phat: http://www.stat.berkeley.edu/users/scawley/Phat
- Already trained for malaria! The more experimentally derived genes are used for training the gene prediction tool, the more reliable the gene predictor.
30GlimmerM: Salzberg et al. (1999) Genomics 59: 24-31
- Adaptation of the prokaryotic gene finder Glimmer.
- Delcher et al. (1999) NAR 27: 4636-4641
- Based on an interpolated Markov model (IMM).
- Only uses short chains of bases (Markov chains) to generate probabilities.
- Trained identically to Phat
31An end to ab initio prediction
- ab initio gene prediction is inaccurate
- Most predictors have high false positive rates but low false negative rates
- Incorporating similarity information is meant to reduce the false positive rate, but at the same time it also increases the false negative rate.
- The biggest determinant of false positive/negative rates is gene size.
- Exon prediction sensitivity can be good
- Rarely used as a final product
- Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
- Used as a starting point for refinement/verification
- Predictions need correction and validation
- Why not just build gene models by comparative means?
32PAUSE(continue)
33Annotation of eukaryotic genomes
Genomic DNA -> (transcription) -> Unprocessed RNA -> (RNA processing) -> Mature mRNA (5' cap, poly-A tail AAAAAAA) -> (translation) -> Nascent polypeptide -> (folding) -> Active enzyme -> Function (Reactant A -> Product B)
Annotation steps shown alongside: ab initio gene prediction (w/o prior knowledge), comparative gene prediction (use other biological data), functional identification
34 If a cell was human?
- The cell knows how to splice a gene together.
- We know some of these signals, but not all of them and not all of the time
- So compare with known examples from the species and others
Central dogma of molecular biology: Genome -> Transcriptome -> Proteome
35 When a human looks at a cell
- Compare with the rest of the genome/transcriptome/proteome data
36 comparative gene prediction
- Use knowledge of known coding sequences to identify coding regions of genomic DNA by similarity
- transcriptome - transcribed DNA sequence
- proteome - peptide sequence
- genome - related genomic sequence
37 Transcript-based prediction datasets
- Generation of large numbers of Expressed Sequence Tags (ESTs)
- Quick, cheap but random
- Subtractive hybridisation to find rare transcripts
- Use multiple libraries for different life-stages/conditions
- Single-pass sequence prone to errors
- Generation of small number of full length cDNA sequences
- Slow and laborious but focused
- Large-scale sequencing of (presumed) full length cDNAs
- Systematic, multiplexed cloning/sequencing of CDS
- Expensive and only viable if part of bigger project
38Gene Prediction in Eukaryotes Simplified
- For highly conserved proteins
- Translate DNA sequence in all 6 reading frames
- Use BLASTX or FASTX to compare the sequence to a protein sequence database
- Or
- Compare a protein against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN or TFASTX/TFASTY programs.
- Note: approximation of the gene structure only.
39 Transcript-based prediction How it works
- Align transcript data to genomic sequence using
a pair-wise sequence comparison
Gene Model
EST
cDNA
40 Transcript-based gene prediction algorithm
- BLAST (Altschul) (36 hours)
- Widely used and understood
- HSPs often have ragged ends, so matches can run past the exon boundary into the intron
- EST_GENOME (Mott) (3 days)
- Dynamic programming post-process of BLAST
- Slow and sometimes cryptic
- BLAT (Kent) (1/2 hour)
- Next generation of alignment algorithm
- Designed for looking at nearly identical sequences
- Faster and more accurate than BLAST
41 Peptide-based gene prediction algorithm
- BLAST (Altschul)
- Widely used and understood
- Smith-Waterman
- Preliminary to further processing
- Used in preference to DNA-based similarities for
evolutionarily diverged species, as peptide
conservation is significantly higher than
nucleotide conservation
42 Genomic-based gene prediction algorithm
- BLAST (Altschul)
- Can be used in TBLASTX mode
- BLAT (Kent)
- Can be used in a translated DNA vs translated DNA mode
- Significantly faster than BLAST
- WABA (Kent)
- Designed to allow for 3rd position codon wobble
- Slow with some outstanding problems
- Only really used in C.elegans v C.briggsae analysis
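Why third-position wobble matters when comparing coding DNA between species can be seen by counting mismatches per codon position in an aligned, in-frame pair of sequences. This is only an illustration of the signal, not WABA's algorithm; it assumes an ungapped alignment that starts at codon position 1, and the two toy sequences are invented.

```python
def mismatches_by_codon_position(seq_a, seq_b):
    """Count mismatches at codon positions 1, 2 and 3 of an in-frame alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a != b:
            counts[i % 3] += 1
    return counts


if __name__ == "__main__":
    # Both toy sequences encode Ala-Gly-Lys; they differ only at position 3.
    print(mismatches_by_codon_position("GCTGGAAAA", "GCCGGGAAG"))
```

A mismatch pattern concentrated at every third base is a hint that a conserved region is coding, which a plain nucleotide aligner would penalise like any other mismatch.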
43 Comparative gene predictors
- This can be viewed as an extension of the ab initio prediction tools where coding exons are defined by similarities and not codon bias
- GAZE (Howe) is an extension of Phil Green's Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C.elegans project.
- GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.
44 Comparative gene predictors
- A new generation of comparative gene prediction tools is being developed to utilise the large amount of genomic sequence available.
- Twinscan (WashU) attempts to predict genes using related genomic sequences.
- Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict 2 orthologous CDSs from genomic regions pre-defined as matching.
- Both of these predictors are in development and will be used for the C.elegans v C.briggsae match and the Mouse v Human match later this year.
45 Summary
- Genes are complex structures that are difficult to predict with the required level of accuracy/confidence
- We can predict stops better than starts
- We can only give gross confidence levels to predictions (i.e. confirmed, partially confirmed or predicted)
- Gene prediction is only part of the annotation procedure
- Movement from ab initio to comparative methodology as sequence data becomes available/affordable
- Curation of gene models is an active process: the set of gene models for a genome is fluid and WILL change over time.
46The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
47Annotation Process
48Artemis
- Artemis is a free DNA sequence viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.
- http://www.sanger.ac.uk/Software/Artemis/
49atcttttacttttttcatcatctatacaaaaaatcatagaatattcatca
tgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacatt
atgttatcttcctcagtgtttttcattattatttgcatgtacagtttatc
a tttttatgtaccaaactatatcttatattaaatggatctctacttata
aagttaaaatctttttttaattttttcttttcacttccaattttatattc
cg cagtacatcgaattctaaaaaaaaaaataaataatatataatatata
ataaataatatataataaataatatataatatataataaataatatataa
tat ataatatataataaataatatataatatataatatataataaataa
tatataataaataatatataatatataatatataatactttggaaagatt
attt atatgaatatatacacctttaataggatacacacatcatatttat
atatatacatataaatattccataaatatttatacaacctcaaataaaat
aaaca tacatatatatatataaatatatacatatatgtatcattacgta
aaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggt
attagg agatatatttactgattcctcatttttataaatgttaaaatta
ttatccctagtccaaatatccacatttattaaattcacttgaatattgtt
ttttaaa ttgctagatatattaatttgagatttaaaattctgacctata
taaacctttcgagaatttataggtagacttaaacttatttcatttgataa
actaatat tatcatttatgtccttatcaaaatttattttctccatttca
gttattttaaacatattccaaatattgttattaaacaagggcggacttaa
acgaagtaa ttcaatcttaactccctccttcacttcactcattttatat
attccttaatttttactatgtttattaaattaacatatatataaacaaat
atgtcactaa taatatatatatatatatatatatatatatatattataa
atgttttactctattttcacatcttgtccttttttttttaaaaatcccaa
ttcttattcat taaataataatgtatttttttttttttttttttttttt
attaattattatgttactgttttattatatacactcttaatcatatatat
atatttatatat atatatatatatatatatatattattcccttttcatg
ttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatattt
ttataacatatgt attattaaaatgtatatataaaaatatatattccat
ttattattatttttttatatacattgttataagagtatcttctcccttct
ggtttatattacta ccatttcactttgaacttttcataaaaattaatag
aatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaata tatatatatatatatatacatataatatatattt
catctaatcatttaaaattattattatatattttttaaaaaatatattta
tgataacataaaaaga atttaattttaattaaatatatataattacata
catctaatattattatatatatataataagttttccaaatagaatactta
tatattatatatatata tatatatatatatattcttccataaaaagaat
aaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacatt
gaatatatagttgtattt ataaaattaaagaaaaagcataaagttacca
tttaatagtggagattagtaacattttcttcattatcaaaaatatttatt
tcctaattttttttttttg taaaatatatttaaaaatgtaatagattat
gtattaaataatataaatatagcaaaatgttcaattttagaaatttgcct
ctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataa
agtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatg
acatgttataatataatataa taaataaaaattatgtaatatatcataa
tcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaat
acatatataactaacattcata tctttatttttgtagatgatataaaaa
attttataaactcttatgaagggatatatttttcatcatccaataaattt
ataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaa
atgttattacaataaatacagatctgtatgtagttgatttcctttttaat
gagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagat
ttatatttttatttatttattatatattattttttaatttttcttttata
tatttattttatttagtgtataaaa tgatatcctttatatttatattta
catgggatattcaaataataacaaaaatgagtatacacatatatatatat
atatatatatatatgtatattttttt tttttttttatgttcctatagga
aagggaagaattcactgatttgtagtgtttacaatattagggaatgcaac
tttacacttttgaaaaaaattcagtta agcaaaaatattaataacatta
aaaagacactgatagcaaaatgtaatgaatatataataacattagaaaat
aagaaaattactttttatttcttaaata aagattatagtataaatcaaa
gtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtca
aaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatatacc
aattagatattaaaaattcccatattagttatacacttattgatagtttc
aatttaaatttatcctacctcagagaatct ataaataataaaaaaaagc
atataaataaaataaatgatgtatcaaataatgacccaaaaaaggataat
aatgaaaaaaatacttcatctaataatataa cacataacaattataatg
acatatcaaataataataataataataataatattaatggggtgaaagac
catataaataataacactctggaaaataatga tgaaccaatcttatcta
tatataatgaagatcttaatgttttatatatatgccaaaatatgtataac
gtcctttttgttttgaatttaaataacctaagt
50DNA in Artemis
GC content
Black bar stop codon
Forward translations
Reverse Translations
DNA and amino acids
51Extra Slides
52 Gene prediction
- What is gene prediction?
- Why is gene prediction important?
- Ab initio gene prediction (w/o prior knowledge)
- Comparative gene prediction (use other biological data)
- Summary
53Genome annotation is central to functional
genomics
ORFeome based functional genomics
Gene Knockout
RNAi phenotypes
Expression Microarray
54Gene finding
- Artemis genome viewer
- Coding sequence vs non coding sequence
- Gene finding software
- Homology between species
- ESTs
56Pretty Handy Annotation Tool (PHAT)
- Based on a generalised hidden Markov model (GHMM)
- Free, easily installed and run.
- Is good at predicting multi-exon genes, but will in some cases miss genes altogether and will over-predict.
- Cawley et al. (2001) Mol. Biochem. Parasitol. 118, p167
- http://www.stat.berkeley.edu/users/scawley/Phat/
57Phat
http://linkage.rockefeller.edu/wli/gene/krogh98.pdf
58GlimmerM
- Under-predicts splicing
- Hardly ever misses a gene completely.
- Does over-predict.
- Free with TIGR license
59Comparison Of Gene Finders