Gene Prediction Methods - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Gene Prediction Methods


1
Gene Prediction Methods
G P S Raghava
2
Prokaryotic gene structure
ORF (open reading frame)
TATA box
Stop codon
Start codon
ATGACAGATTACAGATTACAGATTACAGGATAG
Frame 1
Frame 2
Frame 3
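The three forward reading frames named on this slide can be illustrated with a short Python sketch. The function name reading_frames is hypothetical and chosen only for this example; the sequence is the one shown on the slide.

```python
def reading_frames(seq):
    """Split a DNA string into codons for forward frames 1, 2 and 3."""
    seq = seq.upper()
    frames = []
    for offset in range(3):
        # Take complete triplets only; trailing 1-2 bases are dropped.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


example = "ATGACAGATTACAGATTACAGATTACAGGATAG"
for number, codons in enumerate(reading_frames(example), start=1):
    print(f"Frame {number}: {' '.join(codons)}")
```

In Frame 1 the sequence reads from the start codon ATG through to the stop codon TAG, which is what makes it an open reading frame.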
3
Prokaryotes
  • Advantages
    • Simple gene structure
    • Small genomes (0.5 to 10 million bp)
    • No introns
    • Genes are called Open Reading Frames (ORFs)
    • High coding density (>90%)
  • Disadvantages
    • Some genes overlap (nested)
    • Some genes are quite short (<60 bp)

4
Gene finding approaches
  1. Rule-based (e.g., start and stop codons)
  2. Content-based (e.g., codon bias, promoter sites)
  3. Similarity-based (e.g., orthologs)
  4. Pattern-based (e.g., machine-learning)
  5. Ab-initio methods (FFT)

5
Simple rule-based gene finding
  • Look for putative start codon (ATG)
  • Staying in same frame, scan in groups of three
    until a stop codon is found
  • If # of codons > 50, assume it's a gene
  • If # of codons < 50, go back to last start codon,
    increment by 1 and start again
  • At end of chromosome, repeat process for reverse
    complement
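The rules above can be sketched in Python roughly as follows. This is a minimal illustration, not a production gene finder. Several details are assumptions the slide does not settle: the codon count excludes the stop codon, scanning resumes just after an accepted ORF, an ORF that runs off the end of the sequence without a stop codon is rejected, and the names scan_strand and find_orfs are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons):
    """Return (start, end) index pairs of ORFs found on one strand."""
    orfs = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)  # look for a putative start codon
        if start == -1:
            break
        end = None
        # Stay in the same frame and scan in groups of three.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                end = i + 3
                break
        # Assumption: count codons from ATG up to, not including, the stop.
        if end is not None and (end - 3 - start) // 3 > min_codons:
            orfs.append((start, end))
            pos = end  # assumption: resume after the accepted ORF
        else:
            pos = start + 1  # go back to the last start codon, move on by 1
    return orfs


def find_orfs(seq, min_codons=50):
    """Scan the forward strand, then the reverse complement."""
    seq = seq.upper()
    hits = [("+", s, e) for s, e in scan_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    # Reverse-strand coordinates refer to the reverse-complemented string.
    hits += [("-", s, e) for s, e in scan_strand(rc, min_codons)]
    return hits
```

Calling find_orfs(genome) returns a list of (strand, start, end) tuples. Lowering min_codons makes the scan more sensitive to short genes at the cost of more false positives.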

6
Example ORF
7
Content based gene prediction method
  • RNA polymerase promoter site (-10, -35 site or
    TATA box)
  • Shine-Dalgarno sequence (+10, Ribosome Binding
    Site) to initiate protein translation
  • Codon biases
  • High GC content
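Two of the content signals listed above, GC content and codon bias, are simple to compute. The sketch below is only an illustration. The function names gc_content and codon_usage are hypothetical, and a real content-based predictor would compare such statistics against values learned from known genes of the organism.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(orf):
    """Relative frequency of each codon in an in-frame sequence."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


print(gc_content("GGCCAATT"))
print(codon_usage("ATGGCTGCTTAA"))
```

A candidate ORF whose codon frequencies resemble those of known genes is more likely to be a real gene than one with an unusual codon profile.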

8
Similarity-based gene finding
  • Take all known genes from a related genome and
    compare them to the query genome via BLAST
  • Disadvantages
    • Orthologs/paralogs sometimes lose function and
      become pseudogenes
    • Not all genes will always be known in the
      comparison genome (big circularity problem)
    • The best species for comparison isn't always
      obvious
  • Summary: Similarity comparisons are good
    supporting evidence for prediction validity

9
Machine Learning Techniques
  • Hidden Markov Model
  • ANN-based method
  • Bayesian Networks

10
Ab-initio Methods
  • Fast Fourier Transform based methods
  • Poor performance
  • Able to identify new genes
  • FTG method
  • http://www.imtech.res.in/raghava/ftg/
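Fourier-based gene finding commonly relies on the fact that protein-coding DNA tends to show a periodicity of three bases. The sketch below illustrates that general idea only and is not claimed to be the FTG algorithm. It evaluates the discrete Fourier transform at the single frequency 1/3 directly instead of running a full FFT, and the function name period3_power is hypothetical.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four bases."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT of the 0/1 indicator sequence for this base at frequency 1/3.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, ch in enumerate(seq)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total / n  # normalise by sequence length


print(period3_power("ACG" * 20))   # strongly 3-periodic sequence
print(period3_power("ACGT" * 15))  # 4-periodic sequence, no period-3 signal
```

In practice this measure would be computed in a sliding window along the genome, with peaks suggesting candidate coding regions.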

11
Eukaryotic genes
12
Eukaryotes
  • Complex gene structure
  • Large genomes (0.1 to 3 billion bases)
  • Exons and Introns (interrupted)
  • Low coding density (<30%)
    • 3% in humans, 25% in Fugu, 60% in yeast
  • Alternative splicing (40-60% of all genes)
  • Considerable number of pseudogenes

13
Finding Eukaryotic Genes Computationally
  • Rule-based
    • Not as applicable: too many false positives
  • Content-based Methods
    • CpG islands, GC content, hexamer repeats,
      composition statistics, codon frequencies
  • Feature-based Methods
    • Donor sites, acceptor sites, promoter sites,
      start/stop codons, polyA signals, feature lengths
  • Similarity-based Methods
    • Sequence homology, EST searches
  • Pattern-based
    • HMMs, Artificial Neural Networks
  • Most effective is a combination of all the above
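As one example of a content-based statistic from the list above, CpG islands are usually scored with the observed/expected CpG ratio: (number of CG dinucleotides × sequence length) / (number of C × number of G). The sketch below computes that ratio for a single window. The function name cpg_observed_expected is hypothetical, and any threshold applied to the result would be a separate choice.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio for one sequence window."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio is undefined without both C and G; report 0
    cpg_count = seq.count("CG")
    return cpg_count * len(seq) / (c_count * g_count)


print(cpg_observed_expected("CGCGCGCG"))
print(cpg_observed_expected("CCGG"))
```

A ratio near or above 1 means CG dinucleotides occur about as often as base composition predicts, which is unusual in most vertebrate DNA and is one hint of a nearby gene promoter.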

14
Gene prediction programs
  • Rule-based programs
    • Use explicit set of rules to make decisions.
    • Example: GeneFinder
  • Neural Network-based programs
    • Use data set to build rules.
    • Examples: Grail, GrailEXP
  • Hidden Markov Model-based programs
    • Use probabilities of states and transitions
      between these states to predict features.
    • Examples: Genscan, GenomeScan
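To make "probabilities of states and transitions" concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a sketch only. The states, the start, transition and emission probabilities are all invented for illustration and are not taken from Genscan or any other program, and the input is assumed to contain only A, C, G and T.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are illustrative assumptions, not trained values.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][ch]))
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("AAAAAAAAAAGGGGGGGGGG"))
```

Real gene-finding HMMs use many more states (exons in each phase, introns, splice sites, intergenic regions) and probabilities estimated from annotated genomes, but the decoding step follows the same pattern.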

15
Combined Methods
  • GRAIL (http://compbio.ornl.gov/Grail-1.3/)
  • FGENEH (http://www.bioscience.org/urllists/genefind.htm)
  • HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
  • GENSCAN (http://genes.mit.edu/GENSCAN.html)
  • GenomeScan (http://genes.mit.edu/genomescan.html)
  • Twinscan (http://ardor.wustl.edu/query.html)

16
Egpred: Prediction of Eukaryotic Genes
http://www.imtech.res.in/raghava/ (Genome
Research 14:1756-66)
  • Similarity Search
    • First BLASTX against RefSeq database
    • Second BLASTX against sequences from first BLAST
    • Detection of significant exons from BLASTX output
    • BLASTN against Introns to filter exons
  • Prediction using ab-initio programs
    • NNSPLICE used to compute splice sites
  • Combined method

17
Thank you