Genomics: Gene Prediction and Annotations. Kishor K. Shende, Information Officer, Bioinformatics Center, Barkatullah University, Bhopal

1
Genomics: Gene Prediction and Annotations
Kishor K. Shende, Information Officer, Bioinformatics
Center, Barkatullah University, Bhopal
2
Gene Prediction Strategies
(Figure: Prokaryote gene architecture. Labels:
promoter (-10 and -35), initiation codon ATG,
Protein 1, Protein 2, Protein 3, termination
codons TAA, TAG, TGA.)
(Figure: Eukaryote gene architecture. Labels:
regulatory sequence, promoter, gene, initiation
codon ATG, Exon-1, Intron-1, Exon-2, splicing
sites, termination codons TAA, TAG, TGA.)
3
Codon Usage Tables
  • Each amino acid can be encoded by several codons
  • Each organism has a characteristic pattern of
    codon usage
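
A codon usage table can be tallied directly from known coding sequences. The following is a minimal sketch, assuming the input is an in-frame coding sequence; the function name codon_usage and the example sequence are illustrative and not from the presentation.

```python
from collections import Counter

def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    # Split into consecutive triplets, dropping any incomplete trailing codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Illustrative sequence: leucine is encoded by both CTG and TTA here.
usage = codon_usage("ATGCTGCTGTTACTGTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 3))
```

In practice the table would be built from many genes of one organism, so that the synonymous-codon preferences reflect the organism rather than a single sequence.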

4
Problems in Gene Prediction
  • Distinguishing pseudogenes from genes
  • Exon-intron structure in eukaryotes: exon
    flanking regions are not very well conserved
  • Alternative splicing: shuffling of exons
  • Genes can overlap each other and occur on
    different strands of DNA

5
Gene Identification
  • 1. Homology Based Gene Prediction
  • Sequence similarity search against gene databases
    using BLAST and FASTA searching tools
  • EST (Expressed Sequence Tags) similarity search
  • 2. Ab initio Gene Prediction
  • Prokaryotes
  • - ORF finding
  • Eukaryotes
  • - Promoter prediction
  • - Start-stop codon prediction
  • - Splice site prediction (exon-intron and
    intron-exon)
  • - PolyA signal prediction

6
ORF Finding in Prokaryotes
  • Easier due to:
  • Small genomes with high gene density (Haemophilus
    influenzae: 85% genic)
  • No introns or few introns
  • Operons
  • - One transcript, many genes
  • Open Reading Frames (ORF)
  • - Contiguous set of codons, starts with a Met
    codon, ends with a stop codon

7
  • 1. ORF Finding
  • Simplest method
  • An ORF is a length of DNA sequence that contains
    a contiguous set of codons, each of which
    specifies an amino acid
  • Six possible reading frames

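(Figure: the six reading frames of double-stranded
DNA. Sense strand 5' A T G C C A T C A G 3' with
reading frames 1, 2, 3 and a start codon; antisense
strand 3' T G C C A T T G T A 5' with reading
frames at positions 3, 2, 1 and a start codon.)
Central Dogma: DNA -> mRNA -> Protein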
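
The six reading frames can be enumerated by reading codons at offsets 0, 1 and 2 on the given strand and on its reverse complement. This is a minimal sketch using the sense-strand sequence ATGCCATCAG from the slide; the helper names and the "+1" to "-3" frame labels are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Return the codon lists for the three forward and three reverse frames."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Read complete triplets only, starting at this offset.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames

for name, codons in six_frames("ATGCCATCAG").items():
    print(name, " ".join(codons))
```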
8
ORF Prediction Based on Position of Start Codon
(Figure: an ORF runs from a start codon (AUG) to a
stop codon (UGA, UAA or UAG). A frame with a start
codon and one terminal stop codon is a protein
coding region and codes for protein; a frame with
many in-frame stop codons gives no protein.)
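
The start-to-stop rule above can be sketched as a scan of each forward frame for ATG followed by the first in-frame stop codon. This is a minimal illustration on DNA (T in place of U), not the algorithm of any particular tool; the function name find_orfs, the min_codons parameter and the example sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Find forward-strand ORFs: ATG up to the first in-frame stop codon.

    Returns (start, end, frame) tuples with a 0-based start, an exclusive end,
    and a frame number 1-3. min_codons counts the start and stop codons too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # close it at the first stop
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
```

To cover all six frames, the same scan would be repeated on the reverse complement of the sequence.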
9
Example of ORF
There are six possible reading frames in each
sequence, three for each direction of transcription.
10
  • Difficulty in ORF Prediction:
  • Prokaryotes and viruses: presence of multiple
    genes on one mRNA, and overlapping genes in which
    two different proteins may be encoded in
    different reading frames of the same mRNA
  • Eukaryotes: a protein-coding region (exon) is
    followed by a non-coding region (intron)
  • Differential mRNA splicing creates different
    mRNAs, hence different proteins
  • Variation of the genetic code from the universal
    code
  • Reliability of ORF Prediction: characteristics of
    ORF regions
  • An ordered list of specific codons that reflects
    the evolutionary origin of the gene and
    constraints associated with gene expression
  • A characteristic pattern of use of synonymous
    codons, i.e. codons that stand for the same
    amino acid
  • In eukaryotes, strong preferences for codon pairs
    at intron-exon or exon-intron junctions
  • Genomes with high GC content have a strong bias
    toward G and C in the third codon position
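
The third-position GC bias mentioned in the last bullet can be measured as the fraction of codons whose third base is G or C (often called GC3). This is a minimal sketch, assuming an in-frame coding sequence; the function name gc3_content and the example sequence are illustrative.

```python
def gc3_content(cds):
    """Fraction of codons in an in-frame coding sequence whose third base is G or C."""
    cds = cds.upper()
    # Every third base starting at index 2 is a third codon position.
    third_bases = cds[2::3]
    if not third_bases:
        return 0.0
    return sum(base in "GC" for base in third_bases) / len(third_bases)

# Codons: ATG GCC CTG AAG TAA, third bases G C G G A
print(gc3_content("ATGGCCCTGAAGTAA"))
```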

11
Three Tests of an ORF
Three tests have been devised to verify that a
predicted ORF is in fact likely to encode a protein.
  • First test: it is based on an unusual type of
    sequence variation that is found in ORFs.
  • Second test: the ORF is analyzed to determine
    whether its codons correspond to those used in
    other genes of the same organism.
  • Third test: the ORF may be translated into an
    amino acid sequence and the resulting sequence
    then compared to the databases of existing
    sequences.
12
Repeated Sequence Elements and Nucleosome Structure
  1. Eukaryotic DNA is wrapped around histone-protein
    complexes.
  2. Some base pairs in the major or minor grooves of
    the DNA molecule face the nucleosome surface.
  3. Other base pairs face the outside of the
    structure.
  4. Nucleosomes located in the promoter regions are
    remodeled in a manner that can influence the
    availability of binding sites for regulatory
    proteins, making them more or less available.
Hidden Markov Model (HMM) of Eukaryotic Internal Exon
Computational background: repeated patterns of
sequence have been found in the introns and exons
and near the start site of transcription of
eukaryotic genes.
  • Bending pattern: bending is influenced by
  • a repeated pattern, i.e. not T, A or G, G
  • AA/TT dinucleotides

13
Ab initio gene prediction
  • Predictions are based on the observation that
    gene DNA sequence is not random
  • - Gene-coding sequence has start and stop
    codons.
  • Each species has a characteristic pattern of
    synonymous codon usage.
  • Non-coding ORFs are very short.
  • Gene would correspond to the longest ORF.
  • These methods look for the characteristic
    features of genes and score them high.

14
Ab initio gene prediction methods
  • GeneScan: Fourier transform of DNA sequence to
    find characteristic patterns.
  • GeneParser: predicts the most likely combination
    of exons/introns. Dynamic programming.
  • GeneMark: mostly for prokaryotes, Hidden Markov
    Models. Also for eukaryotes.
  • Grail II: predicts exons, promoters, Poly(A)
    sites. Neural network plus dynamic programming.
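
One characteristic pattern that a Fourier transform can expose is the three-base periodicity often seen in coding DNA. The sketch below only illustrates that general idea and is not the GeneScan implementation; the function name period3_power and the test sequences are assumptions. It evaluates the discrete Fourier transform of each base's 0/1 indicator sequence at frequency 1/3 and sums the squared magnitudes.

```python
import cmath

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over A, C, G, T and divided by length."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    power = 0.0
    for base in "ACGT":
        # Fourier coefficient of this base's 0/1 indicator sequence at frequency 1/3.
        coeff = sum(cmath.exp(-2j * cmath.pi * pos / 3)
                    for pos, b in enumerate(seq) if b == base)
        power += abs(coeff) ** 2
    return power / n

print(period3_power("ATG" * 40))    # strongly period-3 sequence
print(period3_power("ACGT" * 30))   # period-4 sequence, no period-3 signal
```

A real scan would compute this value in a sliding window along the genome and look for windows where it stands out from the background.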

15
Gene Preference Score
An important indicator of a coding region.
Observation: the frequencies of codons and codon
pairs in coding and non-coding regions are
different.
Given a sequence of codons C and assuming
independence, the gene preference score compares
the probability of finding C in a coding region
with the probability of finding C in non-coding
regions.
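
The transcript does not preserve the formulas for this slide. One common formulation, assumed here, takes P(C) as the product of per-codon frequencies in coding regions, P0(C) as the same product with non-coding frequencies, and the score as log(P(C) / P0(C)). The sketch below follows that assumption; the function name and all frequency values are hypothetical.

```python
import math

def preference_score(codons, coding_freq, noncoding_freq):
    """Log-odds score log(P(C) / P0(C)), treating codons as independent."""
    score = 0.0
    for codon in codons:
        # Independence turns the ratio of products into a sum of log-ratios.
        score += math.log(coding_freq[codon] / noncoding_freq[codon])
    return score

# Hypothetical frequencies, for illustration only.
coding_freq = {"CTG": 0.04, "TTA": 0.01}
noncoding_freq = {"CTG": 0.02, "TTA": 0.02}
print(preference_score(["CTG", "CTG", "TTA"], coding_freq, noncoding_freq))
```

Under this formulation a positive score means the codon sequence is more typical of coding regions than of non-coding regions.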

16
Confirming gene location using EST libraries
  • Expressed Sequence Tags (ESTs) are short
    sequenced segments of cDNA. They are organized
    in the database UniGene.
  • If a region matches ESTs with high statistical
    significance, then it is a gene or a pseudogene.

17
Gene prediction accuracy
  • True positives (TP): nucleotides that are
    correctly predicted to be within the gene.
  • Actual positives (AP): nucleotides that are
    located within the actual gene.
  • Predicted positives (PP): nucleotides that are
    predicted to be in the gene.
  • Sensitivity = TP / AP
  • Specificity = TP / PP
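
These two ratios can be computed from the sets of nucleotide positions in the actual and predicted genes. This is a minimal sketch using the definitions on this slide; the function name nucleotide_accuracy and the coordinates are hypothetical.

```python
def nucleotide_accuracy(actual, predicted):
    """Sensitivity (TP/AP) and specificity (TP/PP) from sets of genic positions."""
    tp = len(actual & predicted)   # positions both annotated and predicted as genic
    sensitivity = tp / len(actual) if actual else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity

actual = set(range(100, 200))      # hypothetical true gene: positions 100-199
predicted = set(range(150, 230))   # hypothetical prediction: positions 150-229
print(nucleotide_accuracy(actual, predicted))
```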
18
Gene prediction accuracy
19
Common Difficulties of Gene Prediction
  • First and last exons are difficult to annotate
    because they contain UTRs.
  • Smaller genes are not statistically significant,
    so they are thrown out.
  • Algorithms are trained with sequences from known
    genes, which biases them against genes about
    which nothing is known.

20
Genome Analysis for Gene Prediction
Genome analysis
Genome: the sum of the genes and intergenic
sequences of a haploid cell.
The value of genome sequences lies in their
annotation.
  • Annotation: characterizing genomic features
    using computational and experimental methods
  • Genes: levels of annotation
  • Gene prediction: where are the genes?
  • What do they encode?
  • What proteins/pathways are they involved in?

21
Flowchart: Gene Prediction Process
Genomic DNA sequence
  1. Translate in all six reading frames and compare
    to a protein sequence database
  2. Perform a database similarity search of the EST
    database of the same organism
Use a gene prediction program to locate genes
Analyze the regulatory sequences in the gene
22
ORF Finding
Try this first using BLAST and FASTA
PSI-BLAST, PHI-BLAST, other BLAST/FASTA
programs; EST and cDNA database search
Promoter, splicing site, poly-A tail, 5' UTR,
3' UTR
Compare with the genome of another organism
23
Let's have some practice on gene finding using
some gene finding programs
  1. GeneMark (http://exon.gatech.edu/GeneMark/)
  2. Genscan (http://genes.mit.edu/GENSCAN.html)
  3. Grail II (http://compbio.ornl.gov/Grail-1.3/)
  4. Gene Finder in GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html)

24
  • HMMgene - Prediction of genes in vertebrate
    and C. elegans
  • Gene Discovery Page
  • FramePlot - protein-coding region prediction tool
    for high GC-content bacteria
  • tRNAscan-SE - Search for transfer RNA genes in
    genomic sequence
  • NETGENE - Predict splice sites in human genes
  • ORF Finder
  • BCM Gene Finder
  • Grail
  • Genemark
  • Genie - A Gene Finder Based on Generalized
    Hidden Markov Models
  • GENSCAN - predict complete gene structures
  • Splice Site Prediction by Neural Network
  • Procrustes
  • GenePrimer
  • GenLang
  • MZEF Gene Finder
  • Webgene - Tools for prediction and analysis of
    protein-coding gene structure
  • MAR-Finder - Nuclear matrix attachment region
    prediction
  • Glimmer - bacterial/archaeal gene finder
25
Promoter Region, Transcription Factor and Signals
  • TRANSFAC - Transcription Factor database
  • TFD - Transcription Factor Database
  • TransTerm - A Translational Signal Database
  • PLACE - a database of plant cis-acting
    regulatory DNA elements
  • NNPP - Promoter Prediction by Neural Network
  • FastM/ModelInspector
  • TFSEARCH
  • MatInd and MatInspector
  • Transcription Element Search Software (TESS)
  • CorePromoter (Core-Promoter Prediction Program)
  • Gene Express - analysis of genomic regulatory
    sequences
  • Signal Scan
  • PromoterInspector
  • Promoter Scan II
  • Pol3scan
  • TargetFinder - finds DNA-binding proteins.

26
Overview GENE PREDICTION TOOLS
27
GeneMark™ (http://exon.gatech.edu/GeneMark/)
Mark Borodovsky's Bioinformatics Group at the
Georgia Institute of Technology, Atlanta, Georgia

28
GeneMark.hmm for Prokaryotes (Version 2.4)
Reference: Lukashin A. and Borodovsky M., GeneMark.hmm: new solutions for gene finding, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115
  • Bacterial and archaeal gene prediction: you can
    use the parallel combination of the GeneMark and
    GeneMark.hmm programs.
  • Heuristic approach for gene prediction in
    prokaryotes: if the DNA sequence of interest
    belongs to a species whose name is not in the
    list of available models, use the Heuristic
    models option.
  • Self-training program GeneMarkS: if the sequence
    is longer than 1 Mb, generate models with the
    self-training program GeneMarkS.
30
Gene Prediction in Eukaryotes
Eukaryotic gene prediction: use the parallel
combination of the GeneMark and GeneMark.hmm
programs
31
Select the Related Organisms from this list
32
Gene Prediction in EST and cDNA
To analyze ESTs and cDNAs
34
Gene Prediction in Viruses
Viral gene prediction through virus database
VIOLIN
36
GeneMark Output
37
GeneMark Output
38
New GENSCAN Web Server at MIT
40
GENSCAN Output
43
GrailEXP
  1. Locate protein coding genes within DNA sequence,
  2. Locate EST/mRNA alignments,
  3. Locate certain types of promoters,
    polyadenylation sites, CpG islands, and
    repetitive elements.
  • GrailEXP is a gene finder,
  • an EST alignment utility,
  • an exon prediction program,
  • a promoter/polyA recognizer,
  • a CpG island finder,
  • a repeat masker.

44
GrailEXP
Predicts exons, genes, promoters, polyA sites, CpG
islands, EST similarities, and repetitive
elements within a DNA sequence
46
GlimmerM http://www.tigr.org/tdb/glimmerm/glmr_form.html
A system for finding genes in microbial DNA,
especially the genomes of bacteria and
archaea. Glimmer (Gene Locator and Interpolated
Markov Modeler) uses interpolated Markov models
(IMMs) to identify the coding regions and
distinguish them from noncoding DNA.
GlimmerHMM: for eukaryotic organisms
GeneSplicer: fast, flexible system for detecting
splice sites in the genomic DNA of various
eukaryotes.
47
GlimmerM Gene Finder
49
THANK YOU
Kishor K. Shende, Information Officer, Bioinformatics
Center, Barkatullah University, Bhopal