Gene Prediction and Annotation techniques Basics

1
Gene Prediction and Annotation techniques Basics

Chuong Huynh
NIH/NLM/NCBI
Sept 30, 2004
huynh_at_ncbi.nlm.nih.gov

Acknowledgement: Daniel Lawson, Neil Hall
2
What is gene prediction?
Detecting meaningful signals in uncharacterised DNA sequences.
Gaining knowledge of the interesting information in DNA.
Sorting the wheat from the chaff.

Gene prediction is recognising protein-coding
regions in genomic sequence

3
Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence databases
2. Perform database similarity search of the expressed sequence tag (EST) database of the same organism, or of cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
4
ACEDB View
5
Why is gene prediction important?

Increased volume of genome data generated
Paradigm shift from gene by gene sequencing
(small scale) to large-scale genome sequencing.
No more one gene at a time. A lot of data.
Foundation for all further investigation.
Knowledge of the protein-coding regions underpins
functional genomics.

Note: this presentation covers the prediction of protein-coding genes only. It does not cover the prediction of promoters or other sequences that regulate the activity of protein-coding genes.
6
(No Transcript)
7
Map Viewer
Genome Scan Models
Contig
GenBank
Genes
Mouse EST hits
Human EST hits
8
(No Transcript)
9
Artemis Free Genome Visualization/Annotation
Workbench
10
Genome WorkBench
11
Knowing what to look for
What is a gene?
Not a full transcript with control regions
The coding sequence (ATG -> STOP)
12
ORF Finding in Prokaryotes

The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs)
An ORF is a DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
Six possible reading frames
Good for prokaryotic systems (little or no post-transcriptional processing such as splicing)
Runs from Met (AUG) on the mRNA -> stop codon TER (UAA, UAG, UGA)
http://www.ncbi.nlm.nih.gov/gorf/ NCBI ORF Finder
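
The ORF search described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the NCBI ORF Finder algorithm: it works on DNA (ATG rather than AUG), uses the universal stop codons only, and the function names and the `min_codons` cut-off are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and measured on the strand that was scanned.
    The stop codon is included in the ORF and in the codon count.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG", min_codons=3)` reports a single forward-strand ORF in the third frame. A realistic cut-off would be much larger than 3 codons.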

13
ORF Finder (Open Reading Frame Finder)
14
Annotation of eukaryotic genomes
Diagram: Genomic DNA -> transcription -> Unprocessed RNA -> RNA processing -> Mature mRNA (Gm3 ... AAAAAAA) -> translation -> Nascent polypeptide -> folding -> Active enzyme -> Function (Reactant A -> Product B)
Annotation approaches shown alongside: ab initio gene prediction (w/o prior knowledge), comparative gene prediction (use other biological data), functional identification
15
Two Classes of Sequence Information

Signal terms: short sequence motifs (such as splice sites, branch points, polypyrimidine tracts, start codons, and stop codons)
Content terms: patterns of codon usage that are unique to a species and allow coding sequences to be distinguished from surrounding noncoding sequences by a statistical detection algorithm

16
Problem Using Codon Usage

Program must be taught what the codon usage
patterns look like by presenting the program with
a TRAINING SET of known coding sequences.
Different programs search for different patterns.
A NEW training set is needed for each species
Untranslated regions (UTR) at the ends of the
genes cannot be detected, but most programs can
identify polyadenylation sites
Non-protein-coding RNA genes cannot be detected (a few specialized programs attempt detection)
None of these programs can detect alternatively spliced transcripts
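
To make the training-set idea concrete, here is a minimal sketch of a codon-usage "content" score. It is not the method of any named program: the pseudocount, the uniform 1/64 background and the function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(cds_list, pseudocount=1.0):
    """Estimate codon frequencies from a training set of in-frame CDSs."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(window, freqs):
    """Mean log-odds per codon against a uniform (1/64) background.

    Positive values mean the window uses codons the training set favours.
    """
    window = window.upper()
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    codons = [c for c in codons if c in freqs]
    if not codons:
        return 0.0
    return sum(math.log(freqs[c] * 64) for c in codons) / len(codons)
```

Because the frequencies come entirely from the training sequences, a table built for one species says little about another, which is why a new training set is needed each time.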

17
Explanation of False Positive/Negative in Gene
Prediction Programs
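
The figure for this slide is not in the transcript. As a hedged illustration of how false positives and false negatives are usually counted, the sketch below compares predicted and true coding positions nucleotide by nucleotide. It assumes the convention common in gene-finding evaluations where specificity is TP / (TP + FP); the slide may have used a different definition, and the function names are hypothetical.

```python
def coding_positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end)}


def nucleotide_accuracy(predicted_exons, true_exons):
    """Count TP/FP/FN coding nucleotides and derive Sn and Sp."""
    pred = coding_positions(predicted_exons)
    true = coding_positions(true_exons)
    tp = len(pred & true)   # predicted coding, really coding
    fp = len(pred - true)   # predicted coding, really non-coding
    fn = len(true - pred)   # really coding, but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "Sn": sensitivity, "Sp": specificity}
```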
18
Gene finding Issues

Issues regarding gene finding in general
Genome size
(larger genome more genes, but )
Genome composition
Genome complexity
(more complexity -> less coding density, fewer genes per kb)
cis-splicing (processing of mRNA in eukaryotes)
trans-splicing (in kinetoplastids)
alternative splicing (e.g. in different tissues of higher organisms)
Variation of genetic code from the universal code

19
Gene finding: genome

Genome composition
Long ORFs tend to be coding
GC-rich genomes contain more putative ORFs, because the stop codons (UAA, UAG, UGA) are AT-rich
Genome complexity
Simple repetitive sequences (e.g. dinucleotide repeats) and dispersed repeats tend to be non-coding
May need to mask sequence prior to gene prediction
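
The link between GC content and the number of chance ORFs can be checked with a small calculation. The sketch assumes a deliberately crude model in which bases are independent, with G = C and A = T; real genomes deviate from this, so the numbers are only indicative.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA.

    Assumes independent bases with G = C = gc/2 and A = T = (1 - gc)/2.
    """
    at_base = (1.0 - gc) / 2.0
    gc_base = gc / 2.0
    p_taa = at_base ** 3
    p_tag = at_base * at_base * gc_base
    p_tga = at_base * gc_base * at_base
    return p_taa + p_tag + p_tga


def expected_random_orf_codons(gc):
    """Mean number of codons read up to and including the first stop."""
    return 1.0 / stop_codon_probability(gc)
```

Under this model a 50% GC sequence hits a stop about once every 21 codons, while a 70% GC sequence runs for roughly 52 codons, so long open frames arise by chance more often in GC-rich DNA.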

20
Gene finding: coding density
As the coding/non-coding length ratio decreases,
exon prediction becomes more complex
Human
Fugu
worm
E.coli
21
Gene finding: splicing

cis-splicing of genes
Finding multiple (short) exons is harder than
finding a single (long) exon.

trans-splicing of genes
A trans-splice acceptor is no different from a normal splice acceptor

worm
E.coli
22
Gene finding: alternative splicing

Alternatively spliced transcripts (isoforms) are very difficult to predict.

Human A
Human B
Human C
23
ab initio prediction

What is ab initio gene prediction?
Prediction from first principles using the raw
DNA sequence only.

Requires training sets of known gene structures
to generate statistical tests for the likelihood
of a prediction being real.
24
Gene finding: ab initio

What features of an ORF can we use?
Size - large open reading frames
DNA composition - codon usage / 3rd position
codon bias
Kozak sequence CCGCCAUGG
Ribosome binding sites
Termination signal (stops)
Splice junction boundaries (acceptor/donor)

25
Gene finding: features
Think of a CDS gene prediction as a linear series of sequence features
Initiation codon
Coding sequence (exon)
Splice donor (5')
N times
Non-coding sequence (intron)
Splice acceptor (3')
Coding sequence (exon)
Termination codon
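
The linear series of features listed above can be turned into a simple consistency check for a candidate gene model. This is a sketch under stated assumptions: forward strand only, universal stop codons, and the canonical GT...AG intron boundaries; the function name and message strings are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_model(genome, exons):
    """Check a forward-strand CDS model given as half-open (start, end) exons.

    Returns a list of problems; an empty list means the model is consistent
    with the feature series: ATG, exons, GT...AG introns, stop codon.
    """
    genome = genome.upper()
    problems = []
    cds = "".join(genome[s:e] for s, e in exons)
    if not cds.startswith("ATG"):
        problems.append("no initiation codon")
    if cds[-3:] not in STOP_CODONS:
        problems.append("no termination codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length not a multiple of 3")
    # Every codon except the last must be a sense codon.
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append("in-frame stop at CDS offset %d" % i)
    # Each intron lies between the end of one exon and the start of the next.
    for (_, donor), (acceptor, _) in zip(exons, exons[1:]):
        intron = genome[donor:acceptor]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append("intron %d-%d lacks GT...AG" % (donor, acceptor))
    return problems
```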
26
A model ab initio predictor

Locate and score all sequence features used in gene models
Use dynamic programming to build the highest-scoring model from the available features.
e.g. Genefinder (Green)
Run a 5' -> 3' pass of the sequence through a Markov model based on a typical gene model
e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER (Salzberg)
Run a 5' -> 3' pass of the sequence through a neural net trained with confirmed gene models
e.g. GRAIL (Oak Ridge)
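
The dynamic programming step can be illustrated with a much-simplified stand-in: given scored candidate exons, pick the non-overlapping set with the highest total score. This is not Genefinder's algorithm. It ignores reading-frame compatibility and splice-site scores, and the function name and input format are assumptions.

```python
from bisect import bisect_right


def best_exon_chain(candidates):
    """Choose the best-scoring set of non-overlapping candidate exons.

    candidates: list of (start, end, score) with half-open coordinates.
    Returns (total_score, chain), with the chain ordered 5' -> 3'.
    """
    exons = sorted(candidates, key=lambda e: e[1])  # order by end position
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon starts
        j = bisect_right(ends, start, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]
```

With candidates `(0, 100, 5.0)`, `(50, 150, 8.0)`, `(120, 200, 3.0)` and `(150, 250, 4.0)`, the best chain is the second and fourth exons, for a total score of 12.0.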

27
Ab initio Gene finding programs

Most gene finding software packages use some variant of Hidden Markov Models (HMM).
Predict coding, intergenic, and intron sequences
Need to be trained on a specific organism.
Never perfect!

28
What is an HMM?

A statistical model that represents a gene.
Similar to a weight matrix that can recognise
gaps and treat them in a systematic way.
Has different states that represent introns,
exons, and intergenic regions.
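
A toy example may help show what "states" means here. The sketch below is a two-state HMM (exon and intergenic, with no intron state) decoded with the Viterbi algorithm. All probabilities are invented for illustration and are not taken from any trained gene finder.

```python
import math

STATES = ("exon", "intergenic")
START = {"exon": 0.5, "intergenic": 0.5}
# Hypothetical transition probabilities: states tend to persist.
TRANS = {"exon": {"exon": 0.9, "intergenic": 0.1},
         "intergenic": {"exon": 0.1, "intergenic": 0.9}}
# Hypothetical emissions: exons GC-rich, intergenic DNA AT-rich.
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}


def viterbi(seq):
    """Return the most probable state for each base of an ACGT sequence."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[k][s] = best predecessor of state s at position k + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Running `viterbi` on an AT-rich stretch, then a GC-rich stretch, then another AT-rich stretch labels the middle block as exon. Real gene finders add intron, splice-site and frame-aware states on the same principle.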

29
Malaria Gene Prediction Tool

Hexamer: ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
Genefinder: email colin_at_u.washington.edu
GlimmerM: http://www.tigr.org/softlab/glimmerm
Phat: http://www.stat.berkeley.edu/users/scawley/Phat
Already trained for malaria! The more experimentally derived genes used for training a gene prediction tool, the more reliable the predictor.

30
GlimmerM: Salzberg et al. (1999) Genomics 59: 24-31

Adaptation of the prokaryotic gene finder Glimmer. Delcher et al. (1999) NAR 27: 4636-4641
Based on an interpolated HMM (IHMM).
Only uses short chains of bases (Markov chains) to generate probabilities.
Trained identically to Phat

31
An end to ab initio prediction

ab initio gene prediction is inaccurate
Most predictors have high false positive rates but low false negative rates
Incorporating similarity information is meant to reduce the false positive rate, but at the same time it also increases the false negative rate.
The biggest determinant of false positives/negatives is gene size.
Exon prediction sensitivity can be good
Rarely used as a final product
Human annotation runs multiple algorithms and scores exons predicted by multiple predictors.
Used as a starting point for refinement/verification
Predictions need correction and validation
-- Why not just build gene models by comparative means?
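
Scoring exons by how many predictors agree on them can be sketched as a simple vote. This is only an illustration of the idea: it requires exact coordinate matches, and the predictor names, input format and `min_support` threshold are hypothetical.

```python
from collections import Counter


def exon_support(predictions):
    """Count how many predictors report each exact (start, end) exon.

    predictions: dict mapping predictor name -> iterable of (start, end).
    """
    support = Counter()
    for exons in predictions.values():
        for exon in set(exons):  # each predictor votes once per exon
            support[exon] += 1
    return dict(support)


def consensus_exons(predictions, min_support=2):
    """Return exons reported by at least min_support predictors, sorted."""
    support = exon_support(predictions)
    return sorted(e for e, n in support.items() if n >= min_support)
```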

32
PAUSE(continue)
33
Annotation of eukaryotic genomes
Diagram: Genomic DNA -> transcription -> Unprocessed RNA -> RNA processing -> Mature mRNA (Gm3 ... AAAAAAA) -> translation -> Nascent polypeptide -> folding -> Active enzyme -> Function (Reactant A -> Product B)
Annotation approaches shown alongside: ab initio gene prediction (w/o prior knowledge), comparative gene prediction (use other biological data), functional identification
34
If a cell was human?

The cell knows how to splice a gene together.
We know some of these signals, but not all of them and not all of the time
So compare with known examples from the species and others

Central dogma of molecular biology: Genome -> Transcriptome -> Proteome
35
When a human looks at a cell

Compare with the rest of the genome/transcriptome/proteome data

36
comparative gene prediction

Use knowledge of known coding sequences to identify coding regions of genomic DNA by similarity
transcriptome - transcribed DNA sequence
proteome - peptide sequence
genome - related genomic sequence

37
Transcript-based prediction datasets

Generation of large numbers of Expressed
Sequence Tags (ESTs)
Quick, cheap but random
Subtractive hybridisation to find rare
transcripts
Use multiple libraries for different
life-stages/conditions
Single-pass sequence prone to errors
Generation of small number of full length cDNA
sequences
Slow and laborious but focused
Large-scale sequencing of (presumed) full length
cDNAs
Systematic, multiplexed cloning/sequencing of
CDS
Expensive and only viable if part of bigger
project

38
Gene Prediction in Eukaryotes Simplified

For highly conserved proteins
Translate DNA sequence in all 6 reading frames
Use BLASTX or FASTX to compare the sequence to a protein sequence database
Or
Compare a protein against a nucleic acid database, including genomic sequence, that is translated in all six possible reading frames by the TBLASTN or TFASTX/TFASTY programs.
Note: approximation of the gene structure only.
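
The six-frame translation that BLASTX-style searches perform internally can be sketched as follows. The code assumes the universal genetic code and labels frames "+1" to "+3" and "-1" to "-3"; that labelling and the function names are choices made for this example, not the tools' own output format.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Universal genetic code, codons enumerated in TCAG order; '*' marks a stop.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return a dict of frame label -> peptide for all six reading frames."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames
```

Each of the six peptides can then be compared to a protein database. Because introns are translated along with exons, the hits only approximate the gene structure, as the note on this slide says.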

39
Transcript-based prediction: How it works

Align transcript data to genomic sequence using
a pair-wise sequence comparison

Gene Model
EST
cDNA
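
Once ESTs and cDNAs have been aligned to the genome, their aligned blocks can be combined into a gene model. The sketch below assumes the alignment step has already produced half-open genomic coordinates for each block; it simply merges overlapping blocks into exons and reads the introns off the gaps. Merging like this can wrongly fuse alternative isoforms, so treat it as an illustration only.

```python
def merge_blocks(blocks):
    """Merge overlapping or touching (start, end) alignment blocks into exons."""
    merged = []
    for start, end in sorted(blocks):
        if merged and start <= merged[-1][1]:
            # Extend the previous exon to cover this block.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def introns_between(exons):
    """Return the gaps between consecutive exons as (start, end) introns."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]
```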
40
Transcript-based gene prediction algorithm

BLAST (Altschul) (36 hours)
Widely used and understood
HSPs often have ragged ends, so matches can run on into the introns
EST_GENOME (Mott) (3 days)
Dynamic programming post-process of BLAST
Slow and sometimes cryptic
BLAT (Kent) (1/2 hour)
Next generation of alignment algorithm
Designed for looking at nearly identical sequences
Faster and more accurate than BLAST

41
Peptide-based gene prediction algorithm

BLAST (Altschul)
Widely used and understood
Smith-Waterman
Preliminary to further processing
Used in preference to DNA-based similarities for evolutionarily diverged species, as peptide conservation is significantly higher than nucleotide conservation

42
Genomic-based gene prediction algorithm

BLAST (Altschul)
Can be used in TBLASTX mode
BLAT (Kent)
Can be used in a translated DNA vs translated
DNA mode
Significantly faster than BLAST
WABA (Kent)
Designed to allow for 3rd position codon wobble
Slow with some outstanding problems
Only really used in C.elegans v C.briggsae
analysis

43
Comparative gene predictors

This can be viewed as an extension of the ab
initio prediction tools where coding exons are
defined by similarities and not codon bias
GAZE (Howe) is an extension of Phil Green's Genefinder in which transcript data is used to define coding exons. Other features are scored as in the original Genefinder implementation. This is being evaluated and used in the C.elegans project.
GENEWISE (Birney) is an HMM-based gene predictor which attempts to predict the closest CDS to a supplied peptide sequence. This is the workhorse predictor for the ENSEMBL project.

44
Comparative gene predictors

A new generation of comparative gene prediction
tools is being developed to utilise the large
amount of genomic sequence available.
Twinscan (WashU) attempts to predict genes using
related genomic sequences.
Doublescan (Sanger) is an HMM-based gene predictor which attempts to predict 2 orthologous CDSs from genomic regions pre-defined as matching.
Both of these predictors are in development and
will be used for the C.elegans v C.briggsae match
and the Mouse v Human match later this year.

45
Summary

Genes are complex structures that are difficult to predict with the required level of accuracy/confidence
We can predict stops better than starts
We can only give gross confidence levels to
predictions (i.e. confirmed, partially confirmed
or predicted)
Gene prediction is only part of the annotation
procedure
Movement from ab initio to comparative
methodology as sequence data becomes
available/affordable
Curation of gene models is an active process: the set of gene models for a genome is fluid and WILL change over time.
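
The three gross confidence levels named above could be assigned mechanically from transcript evidence. The rule used here is an assumption made for illustration (every exon supported, some supported, none supported); the slides do not define the levels, and annotation projects apply their own criteria.

```python
def confidence_level(model_exons, supported_exons):
    """Classify a gene model by how many of its exons have transcript support.

    model_exons: list of (start, end) exons in the predicted model.
    supported_exons: iterable of (start, end) exons confirmed by EST/cDNA data.
    """
    supported = set(supported_exons)
    hits = sum(1 for exon in model_exons if exon in supported)
    if model_exons and hits == len(model_exons):
        return "confirmed"
    if hits > 0:
        return "partially confirmed"
    return "predicted"
```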

46
The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
47
Annotation Process
48
Artemis

Artemis is a free DNA sequence viewer and
annotation tool that allows visualization of
sequence features and the results of analyses
within the context of the sequence, and its
six-frame translation.
http://www.sanger.ac.uk/Software/Artemis/

49
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatca
tgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacatt
atgttatcttcctcagtgtttttcattattatttgcatgtacagtttatc
a tttttatgtaccaaactatatcttatattaaatggatctctacttata
aagttaaaatctttttttaattttttcttttcacttccaattttatattc
cg cagtacatcgaattctaaaaaaaaaaataaataatatataatatata
ataaataatatataataaataatatataatatataataaataatatataa
tat ataatatataataaataatatataatatataatatataataaataa
tatataataaataatatataatatataatatataatactttggaaagatt
attt atatgaatatatacacctttaataggatacacacatcatatttat
atatatacatataaatattccataaatatttatacaacctcaaataaaat
aaaca tacatatatatatataaatatatacatatatgtatcattacgta
aaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggt
attagg agatatatttactgattcctcatttttataaatgttaaaatta
ttatccctagtccaaatatccacatttattaaattcacttgaatattgtt
ttttaaa ttgctagatatattaatttgagatttaaaattctgacctata
taaacctttcgagaatttataggtagacttaaacttatttcatttgataa
actaatat tatcatttatgtccttatcaaaatttattttctccatttca
gttattttaaacatattccaaatattgttattaaacaagggcggacttaa
acgaagtaa ttcaatcttaactccctccttcacttcactcattttatat
attccttaatttttactatgtttattaaattaacatatatataaacaaat
atgtcactaa taatatatatatatatatatatatatatatatattataa
atgttttactctattttcacatcttgtccttttttttttaaaaatcccaa
ttcttattcat taaataataatgtatttttttttttttttttttttttt
attaattattatgttactgttttattatatacactcttaatcatatatat
atatttatatat atatatatatatatatatatattattcccttttcatg
ttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatattt
ttataacatatgt attattaaaatgtatatataaaaatatatattccat
ttattattatttttttatatacattgttataagagtatcttctcccttct
ggtttatattacta ccatttcactttgaacttttcataaaaattaatag
aatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaata tatatatatatatatatacatataatatatattt
catctaatcatttaaaattattattatatattttttaaaaaatatattta
tgataacataaaaaga atttaattttaattaaatatatataattacata
catctaatattattatatatatataataagttttccaaatagaatactta
tatattatatatatata tatatatatatatattcttccataaaaagaat
aaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacatt
gaatatatagttgtattt ataaaattaaagaaaaagcataaagttacca
tttaatagtggagattagtaacattttcttcattatcaaaaatatttatt
tcctaattttttttttttg taaaatatatttaaaaatgtaatagattat
gtattaaataatataaatatagcaaaatgttcaattttagaaatttgcct
ctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataa
agtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatg
acatgttataatataatataa taaataaaaattatgtaatatatcataa
tcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaat
acatatataactaacattcata tctttatttttgtagatgatataaaaa
attttataaactcttatgaagggatatatttttcatcatccaataaattt
ataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaa
atgttattacaataaatacagatctgtatgtagttgatttcctttttaat
gagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagat
ttatatttttatttatttattatatattattttttaatttttcttttata
tatttattttatttagtgtataaaa tgatatcctttatatttatattta
catgggatattcaaataataacaaaaatgagtatacacatatatatatat
atatatatatatatgtatattttttt tttttttttatgttcctatagga
aagggaagaattcactgatttgtagtgtttacaatattagggaatgcaac
tttacacttttgaaaaaaattcagtta agcaaaaatattaataacatta
aaaagacactgatagcaaaatgtaatgaatatataataacattagaaaat
aagaaaattactttttatttcttaaata aagattatagtataaatcaaa
gtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtca
aaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatatacc
aattagatattaaaaattcccatattagttatacacttattgatagtttc
aatttaaatttatcctacctcagagaatct ataaataataaaaaaaagc
atataaataaaataaatgatgtatcaaataatgacccaaaaaaggataat
aatgaaaaaaatacttcatctaataatataa cacataacaattataatg
acatatcaaataataataataataataataatattaatggggtgaaagac
catataaataataacactctggaaaataatga tgaaccaatcttatcta
tatataatgaagatcttaatgttttatatatatgccaaaatatgtataac
gtcctttttgttttgaatttaaataacctaagt
50
DNA in Artemis
GC content
Black bar: stop codon
Forward translations
Reverse Translations
DNA and amino acids
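
A GC content track like the one labelled in this view is a sliding-window calculation. The sketch below is a generic version and is not taken from Artemis; the window and step sizes are arbitrary assumptions.

```python
def gc_content_windows(seq, window=100, step=10):
    """Return the GC fraction of each window along the sequence.

    Windows start every `step` bases; a trailing partial window is skipped.
    """
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        values.append((chunk.count("G") + chunk.count("C")) / window)
    return values
```

Plotted against position, dips in such a track often coincide with AT-rich non-coding regions in a genome like the one shown on the previous slide.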
51
Extra Slides
52
Gene prediction

What is gene prediction?
Why is gene prediction important?
Ab initio gene prediction (w/o prior knowledge)
Comparative gene prediction (use other
biological data)
Summary

53
Genome annotation is central to functional
genomics
ORFeome based functional genomics
Gene Knockout
RNAi phenotypes
Expression Microarray
54
Gene finding

Artemis genome viewer
Coding sequence vs non coding sequence
Gene finding software
Homology between species
ESTs

55
(No Transcript)
56
Pretty Handy Annotation Tool (PHAT)

Based on a generalised hidden Markov model (GHMM)
Free, easily installed and run.
Is good at predicting multi-exon genes but will in some cases miss genes altogether and will over-predict.
Cawley et al. (2001) Mol. Biochem. Parasitol. 118: 167
http://www.stat.berkeley.edu/users/scawley/Phat/

57
Phat
http://linkage.rockefeller.edu/wli/gene/krogh98.pdf
58
GlimmerM