1
Gene Prediction and Annotation techniques Basics
  • Chuong Huynh
  • NIH/NLM/NCBI
  • Sept 30, 2004
  • huynh_at_ncbi.nlm.nih.gov

Acknowledgement: Daniel Lawson, Neil Hall
2
What is gene prediction?
Detecting meaningful signals in uncharacterised
DNA sequences. Knowledge of the interesting
information in DNA. Sorting the chaff from the
wheat
  • Gene prediction is recognising protein-coding
    regions in genomic sequence

3
Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and
compare to protein sequence databases
2. Perform database similarity search of the
expressed sequence tag (EST) database of the same
organism, or cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
4
ACEDB View
5
Why is gene prediction important?
  • Increased volume of genome data generated
  • Paradigm shift from gene by gene sequencing
    (small scale) to large-scale genome sequencing.
  • No more one gene at a time. A lot of data.
  • Foundation for all further investigation.
    Knowledge of the protein-coding regions underpins
    functional genomics.

Note: this presentation covers the prediction of
protein-coding genes only. It does not cover
promoter prediction, i.e. the sequences that
regulate the activity of protein-coding genes.
6
(No Transcript)
7
Map Viewer
Genome Scan Models
Contig
GenBank
Genes
Mouse EST hits
Human EST hits
8
(No Transcript)
9
Artemis Free Genome Visualization/Annotation
Workbench
10
Genome WorkBench
11
Knowing what to look for
What is a gene? Not a full transcript with
control regions. The coding sequence (ATG ->
STOP)
12
ORF Finding in Prokaryotes
  • Simplest method of finding DNA sequences that
    encode proteins by searching for open reading
    frames
  • An ORF is a DNA sequence that contains a
    contiguous set of codons, each of which
    specifies an amino acid
  • Six possible reading frames
  • Good for prokaryotic systems (no/little
    post-transcriptional processing such as splicing)
  • Runs from Met (AUG) on mRNA to a stop codon, TER
    (UAA, UAG, UGA)
  • http://www.ncbi.nlm.nih.gov/gorf/ NCBI ORF Finder
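
The ORF definition above can be sketched in a few lines of Python. This is a toy scan and not the NCBI ORF Finder algorithm: it assumes an uppercase or lowercase DNA string, the standard stop codons, ATG as the only start codon, and a hypothetical `min_codons` cut-off chosen for illustration.

```python
# Toy six-frame ORF scan (illustrative only; not the NCBI ORF Finder algorithm).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ATG..stop ORFs on one strand.

    Coordinates are 0-based on the strand scanned; end is exclusive and
    includes the stop codon. min_codons counts codons before the stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None


def find_orfs(seq, min_codons=2):
    """Scan all six reading frames: three forward, three on the reverse strand."""
    seq = seq.upper()
    found = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    found += [("-", f, s, e) for f, s, e in orfs_in_strand(rc, min_codons)]
    return found
```

Reverse-strand coordinates are reported on the reverse-complemented string; a real tool would map them back to forward-strand positions.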

13
ORF Finder (Open Reading Frame Finder)
14
Annotation of eukaryotic genomes
Genomic DNA -> (transcription) -> Unprocessed RNA
-> (RNA processing) -> Mature mRNA (Gm3 ... AAAAAAA)
-> (translation) -> Nascent polypeptide
-> (folding) -> Active enzyme
-> Function (Reactant A -> Product B)
Annotation stages: ab initio gene prediction (w/o
prior knowledge), comparative gene prediction (use
other biological data), functional identification
15
Two Classes of Sequence Information
  • Signal terms: short sequence motifs (such as
    splice sites, branch points, polypyrimidine
    tracts, start codons, and stop codons)
  • Content terms: patterns of codon usage that are
    unique to a species and allow coding sequences to
    be distinguished from surrounding noncoding
    sequences by a statistical detection algorithm

16
Problem Using Codon Usage
  • Program must be taught what the codon usage
    patterns look like by presenting the program with
    a TRAINING SET of known coding sequences.
  • Different programs search for different patterns.
  • A NEW training set is needed for each species
  • Untranslated regions (UTR) at the ends of the
    genes cannot be detected, but most programs can
    identify polyadenylation sites
  • Non-protein-coding RNA genes cannot be detected
    (a few specialized programs attempt detection)
  • None of these programs can detect alternatively
    spliced transcripts
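
As a rough illustration of the training-set idea above, the sketch below learns codon frequencies from known coding sequences and scores a window by log-odds against a uniform background. The function names, the pseudocount, and the uniform 1/64 background are assumptions made for this example; real gene finders use richer models.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(training_cds, pseudocount=1.0):
    """Estimate codon frequencies from in-frame training coding sequences."""
    counts = Counter()
    for cds in training_cds:
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(window, freqs):
    """Sum of log-odds of each in-frame codon versus a uniform 1/64 background."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in freqs:  # skip codons containing ambiguous bases
            score += math.log(freqs[codon] * 64)
    return score
```

A window built from codons that are common in the training set scores above zero, and one built from rare codons scores below zero. This is why a new training set is needed for each species.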

17
Explanation of False Positive/Negative in Gene
Prediction Programs
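
The body of this slide was not transcribed. As a hedged sketch, false positives and false negatives are often summarised at the nucleotide level as below. The definitions assume the convention common in gene-finder evaluations, where "specificity" means TP / (TP + FP); the function name and the set-of-positions input format are chosen for illustration.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Compare two sets of genomic positions labelled as coding.

    TP: coding and predicted coding; FP: predicted but not coding;
    FN: coding but missed. Sn = TP/(TP+FN); Sp = TP/(TP+FP).
    """
    tp = len(true_coding & predicted_coding)
    fp = len(predicted_coding - true_coding)
    fn = len(true_coding - predicted_coding)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "Sn": sensitivity, "Sp": specificity}
```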
18
Gene finding Issues
  • Issues regarding gene finding in general
  • Genome size
  • (larger genome more genes, but )
  • Genome composition
  • Genome complexity
  • (more complexity -> less coding density, fewer
    genes per kb)
  • cis-splicing (processing mRNA in eukaryotes)
  • trans-splicing (in kinetoplastids)
  • alternate splicing (e.g. in different tissues
    of higher organisms)
  • Variation of genetic code from the universal code

19
Gene finding genome
  • Genome composition
  • Long ORFs tend to be coding
  • Presence of more putative ORFs in GC-rich
    genomes (the stop codons UAA, UAG and UGA are
    AT-rich, so they occur less often by chance)
  • Genome complexity
  • Simple repetitive sequences (e.g. dinucleotide)
    and dispersed repeats tend to be anti-coding
  • May need to mask sequence prior to gene prediction
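
A minimal sketch of the masking step, assuming that replacing a repeat with N is acceptable to the downstream gene finder. It handles only tandem dinucleotide repeats; the `min_units` threshold is an arbitrary illustrative value, and dedicated repeat-masking tools handle dispersed repeats that this toy does not.

```python
import re


def mask_dinucleotide_repeats(seq, min_units=5):
    """Replace runs of at least min_units tandem copies of a dinucleotide with N."""
    # ([ACGT]{2}) captures one unit; \1{n,} requires n further copies of it.
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_units - 1), re.IGNORECASE)
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)
```

Because a unit such as AA also matches the pattern, long homopolymer runs are masked as well.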

20
Gene finding coding density
As the coding/non-coding length ratio decreases,
exon prediction becomes more complex
Human
Fugu
worm
E.coli
21
Gene finding splicing
  • cis-splicing of genes
  • Finding multiple (short) exons is harder than
    finding a single (long) exon.
  • trans-splicing of genes
  • A trans-splice acceptor is no different to a
    normal splice acceptor

worm
E.coli
22
Gene finding alternate splicing
  • Alternatively spliced transcripts (isoforms) are
    very difficult to predict.

Human A
Human B
Human C
23
ab initio prediction
  • What is ab initio gene prediction?
  • Prediction from first principles using the raw
    DNA sequence only.

Requires training sets of known gene structures
to generate statistical tests for the likelihood
of a prediction being real.
24
Gene finding ab initio
  • What features of an ORF can we use?
  • Size - large open reading frames
  • DNA composition - codon usage / 3rd position
    codon bias
  • Kozak sequence CCGCCAUGG
  • Ribosome binding sites
  • Termination signal (stops)
  • Splice junction boundaries (acceptor/donor)

25
Gene finding features
Think of a CDS gene prediction as a linear series
of sequence features
Initiation codon
Coding sequence (exon)
Splice donor (5')
N times
Non-coding sequence (intron)
Splice acceptor (3')
Coding sequence (exon)
Termination codon
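
The linear series of features above can be checked mechanically for a candidate gene model. The sketch below is an illustration only: it assumes a forward-strand model, 0-based exon coordinates with exclusive ends, and the canonical GT...AG intron boundaries, which the slide does not state. The function name is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def check_gene_model(genome, exons):
    """Return a list of problems found in a forward-strand CDS model.

    exons: sorted list of (start, end) pairs, 0-based, end exclusive.
    """
    problems = []
    cds = "".join(genome[s:e] for s, e in exons)
    if not cds.startswith("ATG"):
        problems.append("no initiation codon")
    if cds[-3:] not in STOPS:
        problems.append("no termination codon")
    if len(cds) % 3:
        problems.append("CDS length not a multiple of 3")
    if any(cds[i:i + 3] in STOPS for i in range(0, len(cds) - 3, 3)):
        problems.append("in-frame stop codon")
    # Each intron lies between the end of one exon and the start of the next.
    for (_, donor), (acceptor, _) in zip(exons, exons[1:]):
        intron = genome[donor:acceptor]
        if not intron.startswith("GT"):
            problems.append("intron at %d lacks GT donor" % donor)
        if not intron.endswith("AG"):
            problems.append("intron at %d lacks AG acceptor" % donor)
    return problems
```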
26
A model ab initio predictor
  • Locate and score all sequence features used in
    gene models
  • Use dynamic programming to build the
    highest-scoring model from the available features.
  • e.g. Genefinder (Green)
  • Run a 5' -> 3' pass of the sequence through a
    Markov model based on a typical gene model
  • e.g. TBparse (Krogh), GENSCAN (Burge) or GLIMMER
    (Salzberg)
  • Run a 5' -> 3' pass of the sequence through a
    neural net trained with confirmed gene models
  • e.g. GRAIL (Oak Ridge)

27
Ab initio Gene finding programs
  • Most gene finding software packages use some
    variant of Hidden Markov Models (HMM).
  • Predict coding, intergenic, and intron sequences
  • Need to be trained on a specific organism.
  • Never perfect!

28
What is an HMM?
  • A statistical model that represents a gene.
  • Similar to a weight matrix that can recognise
    gaps and treat them in a systematic way.
  • Has different states that represent introns,
    exons, and intergenic regions.
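
To make the idea of states concrete, here is a deliberately tiny two-state HMM decoded with the Viterbi algorithm. Every probability below is invented for illustration, and the two states (coding, noncoding) stand in for the intron, exon and intergenic states of a real gene finder, which would also be trained on known genes rather than hand-set.

```python
import math

STATES = ("coding", "noncoding")
# All numbers are made-up illustrative values, not trained parameters.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}


def viterbi(seq):
    """Return the most probable state path for a non-empty ACGT string."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this base.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters an AT-rich stretch followed by a GC-rich stretch is labelled noncoding and then coding, with a single switch between them.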

29
Malaria Gene Prediction Tool
  • Hexamer ftp://ftp.sanger.ac.uk/pub/pathogens/software/hexamer/
  • Genefinder email colin_at_u.washington.edu
  • GlimmerM http://www.tigr.org/softlab/glimmerm
  • Phat http://www.stat.berkeley.edu/users/scawley/Phat
  • Already trained for malaria! The more
    experimentally derived genes used for training a
    gene prediction tool, the more reliable the gene
    predictor.

30
GlimmerM: Salzberg et al. (1999) Genomics 59: 24-31
  • Adaptation of the prokaryotic genefinder Glimmer.
  • Delcher et al. (1999) NAR 27: 4636-4641
  • Based on an interpolated HMM (IHMM).
  • Only uses short chains of bases (Markov chains)
    to generate probabilities.
  • Trained identically to Phat

31
An end to ab initio prediction
  • ab initio gene prediction is inaccurate
  • Have high false positive rates, but also low
    false negative rates for most predictors
  • Incorporating similarity info is meant to reduce
    the false positive rate, but at the same time it
    also increases the false negative rate.
  • Biggest determinant of false positive/negative is
    gene size.
  • Exon prediction sensitivity can be good
  • Rarely used as a final product
  • Human annotation runs multiple algorithms and
    scores exons predicted by multiple predictors.
  • Used as a starting point for
    refinement/verification
  • Predictions need correction and validation
  • -- Why not just build gene models by comparative
    means?
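
The point above about scoring exons predicted by multiple predictors can be sketched as a simple vote. This is an assumed, simplified scheme: exons count as the same only if their coordinates match exactly, and the predictor names in the usage example are placeholders rather than the output of real tools.

```python
from collections import Counter


def consensus_exons(predictions, min_support=2):
    """Count how many predictors report each exact (start, end) exon.

    predictions: dict mapping a predictor name to a list of (start, end) exons.
    Returns a sorted list of ((start, end), votes) with votes >= min_support.
    """
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))  # each predictor votes once per exon
    return sorted((exon, n) for exon, n in votes.items() if n >= min_support)
```

For example, with three hypothetical predictors:

```python
predictions = {
    "predictor_a": [(10, 50), (100, 150)],
    "predictor_b": [(10, 50), (100, 160)],
    "predictor_c": [(10, 50), (100, 150)],
}
supported = consensus_exons(predictions)
```

Exons supported by only one predictor, such as (100, 160) here, are left for manual review rather than discarded outright.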

32
PAUSE(continue)
33
Annotation of eukaryotic genomes
Genomic DNA
ab initio gene prediction (w/o prior knowledge)
transcription
Unprocessed RNA
RNA processing
AAAAAAA
Gm3
Mature mRNA
Comparative gene prediction(use other biological
data)
translation
Nascent polypeptide
folding
Active enzyme
Functional identification
Reactant A
Product B
Function
34
If a cell was human?
  • The cell knows how to splice a gene together.
  • We know some of these signals but not all and
    not all of the time
  • So compare with known examples from the species
    and others

Central dogma of molecular biology: Genome ->
Transcriptome -> Proteome
35
When a human looks at a cell
  • Compare with the rest of the
    genome/transcriptome/proteome data

36
comparative gene prediction
  • Use knowledge of known coding sequences to
    identify regions of genomic DNA by similarity
  • transcriptome - transcribed DNA sequence
  • proteome - peptide sequence
  • genome - related genomic sequence

37
Transcript-based prediction datasets
  • Generation of large numbers of Expressed
    Sequence Tags (ESTs)
  • Quick, cheap but random
  • Subtractive hybridisation to find rare
    transcripts
  • Use multiple libraries for different
    life-stages/conditions
  • Single-pass sequence prone to errors
  • Generation of small number of full length cDNA
    sequences
  • Slow and laborious but focused
  • Large-scale sequencing of (presumed) full length
    cDNAs
  • Systematic, multiplexed cloning/sequencing of
    CDS
  • Expensive and only viable if part of bigger
    project

38
Gene Prediction in Eukaryotes Simplified
  • For highly conserved proteins
  • Translate DNA sequence in all 6 reading frames
  • BLASTX or FASTX to compare the sequence to a
    protein sequence database
  • Or
  • Protein compared against nucleic acid database
    including genomic sequence that is translated in
    all six possible reading frames by the TBLASTN or
    TFASTX/TFASTY programs.
  • Note: approximation of the gene structure only.

39
Transcript-based prediction How it works
  • Align transcript data to genomic sequence using
    a pair-wise sequence comparison
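
As a toy illustration of aligning a transcript to genomic sequence, the sketch below places an error-free transcript on the genome by greedy exact matching and reports the matched blocks as candidate exons. This is not how BLAST, EST_GENOME or BLAT work; it assumes no sequencing errors, a forward-strand transcript, and a hypothetical `seed` length, and it can produce ragged block ends when an intron happens to start with the same base as the next exon.

```python
def align_transcript(genome, transcript, seed=8):
    """Greedy exact-match placement of a transcript on a genome.

    Returns a list of (genome_start, genome_end) blocks (0-based, end
    exclusive), or None if a seed cannot be found downstream.
    """
    blocks = []
    t = 0  # current position in the transcript
    g = 0  # leftmost genome position still allowed
    while t < len(transcript):
        hit = genome.find(transcript[t:t + seed], g)
        if hit < 0:
            return None
        # Extend the match base by base until the first mismatch.
        length = 0
        while (t + length < len(transcript)
               and hit + length < len(genome)
               and transcript[t + length] == genome[hit + length]):
            length += 1
        blocks.append((hit, hit + length))
        t += length
        g = hit + length
    return blocks
```

The gaps between consecutive blocks correspond to the introns of the inferred gene model.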

Gene Model
EST
cDNA
40
Transcript-based gene prediction algorithm
  • BLAST (Altschul) (36 hours)
  • Widely used and understood
  • HSPs often have ragged ends, so alignments can
    run past exon boundaries into the introns
  • EST_GENOME (Mott) (3 days)
  • Dynamic programming post-process of BLAST
  • Slow and sometimes cryptic
  • BLAT (Kent) (1/2 hour)
  • Next generation of alignment algorithm
  • Designed for looking at nearly identical sequences
  • Faster and more accurate than BLAST

41
Peptide-based gene prediction algorithm
  • BLAST (Altschul)
  • Widely used and understood
  • Smith-Waterman
  • Preliminary to further processing
  • Used in preference to DNA-based similarities for
    evolutionarily diverged species, as peptide
    conservation is significantly higher than
    nucleotide conservation

42
Genomic-based gene prediction algorithm
  • BLAST (Altschul)
  • Can be used in TBLASTX mode
  • BLAT (Kent)
  • Can be used in a translated DNA vs translated
    DNA mode
  • Significantly faster than BLAST
  • WABA (Kent)
  • Designed to allow for 3rd position codon wobble
  • Slow with some outstanding problems
  • Only really used in C.elegans v C.briggsae
    analysis

43
Comparative gene predictors
  • This can be viewed as an extension of the ab
    initio prediction tools where coding exons are
    defined by similarities and not codon bias
  • GAZE (Howe) is an extension of Phil Green's
    Genefinder in which transcript data is used to
    define coding exons. Other features are scored as
    in the original Genefinder implementation. This
    is being evaluated and used in the C.elegans
    project.
  • GENEWISE (Birney) is an HMM-based gene predictor
    which attempts to predict the closest CDS to a
    supplied peptide sequence. This is the workhorse
    predictor for the ENSEMBL project.

44
Comparative gene predictors
  • A new generation of comparative gene prediction
    tools is being developed to utilise the large
    amount of genomic sequence available.
  • Twinscan (WashU) attempts to predict genes using
    related genomic sequences.
  • Doublescan (Sanger) is an HMM-based gene
    predictor which attempts to predict 2 orthologous
    CDSs from genomic regions pre-defined as
    matching.
  • Both of these predictors are in development and
    will be used for the C.elegans v C.briggsae match
    and the Mouse v Human match later this year.

45
Summary
  • Genes are complex structures that are difficult
    to predict with the required level of
    accuracy/confidence
  • We can predict stops better than starts
  • We can only give gross confidence levels to
    predictions (i.e. confirmed, partially confirmed
    or predicted)
  • Gene prediction is only part of the annotation
    procedure
  • Movement from ab initio to comparative
    methodology as sequence data becomes
    available/affordable
  • Curation of gene models is an active process:
    the set of gene models for a genome is fluid and
    WILL change over time.

46
The Annotation Process
DNA SEQUENCE
Useful Information
ANALYSIS SOFTWARE
Annotator
47
Annotation Process
48
Artemis
  • Artemis is a free DNA sequence viewer and
    annotation tool that allows visualization of
    sequence features and the results of analyses
    within the context of the sequence, and its
    six-frame translation.
  • http://www.sanger.ac.uk/Software/Artemis/

49
atcttttacttttttcatcatctatacaaaaaatcatagaatattcatca
tgttgtttaaaataatgtattccattatgaactttattacaaccctcgtt
tttaattaattcacattttatatctttaagtataatatcatttaacatt
atgttatcttcctcagtgtttttcattattatttgcatgtacagtttatc
a tttttatgtaccaaactatatcttatattaaatggatctctacttata
aagttaaaatctttttttaattttttcttttcacttccaattttatattc
cg cagtacatcgaattctaaaaaaaaaaataaataatatataatatata
ataaataatatataataaataatatataatatataataaataatatataa
tat ataatatataataaataatatataatatataatatataataaataa
tatataataaataatatataatatataatatataatactttggaaagatt
attt atatgaatatatacacctttaataggatacacacatcatatttat
atatatacatataaatattccataaatatttatacaacctcaaataaaat
aaaca tacatatatatatataaatatatacatatatgtatcattacgta
aaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggt
attagg agatatatttactgattcctcatttttataaatgttaaaatta
ttatccctagtccaaatatccacatttattaaattcacttgaatattgtt
ttttaaa ttgctagatatattaatttgagatttaaaattctgacctata
taaacctttcgagaatttataggtagacttaaacttatttcatttgataa
actaatat tatcatttatgtccttatcaaaatttattttctccatttca
gttattttaaacatattccaaatattgttattaaacaagggcggacttaa
acgaagtaa ttcaatcttaactccctccttcacttcactcattttatat
attccttaatttttactatgtttattaaattaacatatatataaacaaat
atgtcactaa taatatatatatatatatatatatatatatatattataa
atgttttactctattttcacatcttgtccttttttttttaaaaatcccaa
ttcttattcat taaataataatgtatttttttttttttttttttttttt
attaattattatgttactgttttattatatacactcttaatcatatatat
atatttatatat atatatatatatatatatatattattcccttttcatg
ttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatattt
ttataacatatgt attattaaaatgtatatataaaaatatatattccat
ttattattatttttttatatacattgttataagagtatcttctcccttct
ggtttatattacta ccatttcactttgaacttttcataaaaattaatag
aatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaata tatatatatatatatatacatataatatatattt
catctaatcatttaaaattattattatatattttttaaaaaatatattta
tgataacataaaaaga atttaattttaattaaatatatataattacata
catctaatattattatatatatataataagttttccaaatagaatactta
tatattatatatatata tatatatatatatattcttccataaaaagaat
aaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacatt
gaatatatagttgtattt ataaaattaaagaaaaagcataaagttacca
tttaatagtggagattagtaacattttcttcattatcaaaaatatttatt
tcctaattttttttttttg taaaatatatttaaaaatgtaatagattat
gtattaaataatataaatatagcaaaatgttcaattttagaaatttgcct
ctttttgacaaggataattc aaaagatacaggtaaaaaaaaaaaaataa
agtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatg
acatgttataatataatataa taaataaaaattatgtaatatatcataa
tcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaat
acatatataactaacattcata tctttatttttgtagatgatataaaaa
attttataaactcttatgaagggatatatttttcatcatccaataaattt
ataaatgtatttctagacaaaat tctgatcattgatccgtcttccttaa
atgttattacaataaatacagatctgtatgtagttgatttcctttttaat
gagaaaaataagaatcttattgtt ttagggtaatgaaatatatatagat
ttatatttttatttatttattatatattattttttaatttttcttttata
tatttattttatttagtgtataaaa tgatatcctttatatttatattta
catgggatattcaaataataacaaaaatgagtatacacatatatatatat
atatatatatatatgtatattttttt tttttttttatgttcctatagga
aagggaagaattcactgatttgtagtgtttacaatattagggaatgcaac
tttacacttttgaaaaaaattcagtta agcaaaaatattaataacatta
aaaagacactgatagcaaaatgtaatgaatatataataacattagaaaat
aagaaaattactttttatttcttaaata aagattatagtataaatcaaa
gtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtca
aaaaatcatatcttgttagtaataaaaaa ttcatatgtatatatatacc
aattagatattaaaaattcccatattagttatacacttattgatagtttc
aatttaaatttatcctacctcagagaatct ataaataataaaaaaaagc
atataaataaaataaatgatgtatcaaataatgacccaaaaaaggataat
aatgaaaaaaatacttcatctaataatataa cacataacaattataatg
acatatcaaataataataataataataataatattaatggggtgaaagac
catataaataataacactctggaaaataatga tgaaccaatcttatcta
tatataatgaagatcttaatgttttatatatatgccaaaatatgtataac
gtcctttttgttttgaatttaaataacctaagt
50
DNA in Artemis
GC content
Black bar stop codon
Forward translations
Reverse Translations
DNA and amino acids
51
Extra Slides
52
Gene prediction
  • What is gene prediction?
  • Why is gene prediction important?
  • Ab initio gene prediction (w/o prior knowledge)
  • Comparative gene prediction (use other
    biological data)
  • Summary

53
Genome annotation is central to functional
genomics
ORFeome based functional genomics
Gene Knockout
RNAi phenotypes
Expression Microarray
54
Gene finding
  • Artemis genome viewer
  • Coding sequence vs non coding sequence
  • Gene finding software
  • Homology between species
  • ESTs

55
(No Transcript)
56
Pretty Handy Annotation Tool (PHAT)
  • Based on a generalised hidden Markov model (GHMM)
  • Free; easily installed and run.
  • Is good at predicting multiexon genes but will in
    some cases miss genes altogether and will
    overpredict.
  • Cawley et al. (2001) Mol. Biochem. Parasitol. 118
    p167
  • http://www.stat.berkeley.edu/users/scawley/Phat/

57
Phat
http://linkage.rockefeller.edu/wli/gene/krogh98.pdf
58
GlimmerM
  • Under predicts splicing
  • Hardly ever misses a gene completely.
  • Does over predict.
  • Free with TIGR license

59
Comparison Of Gene Finders