Gene Prediction in silico - PowerPoint PPT Presentation

About This Presentation
Title:

Gene Prediction in silico

Description:

The ultimate goal of molecular cell biology is to understand ... Smith-Waterman (SW) algorithm - dynamic programming algorithm. Limitations of Homology Search ... – PowerPoint PPT presentation

Slides: 56
Provided by: rek33

Transcript and Presenter's Notes


1
Gene Prediction in silico
Nita Parekh, BIRC, IIIT, Hyderabad
2
Goal
  • The ultimate goal of molecular cell biology is
    to understand the physiology of living cells in
    terms of the information that is encoded in the
    genome of the cell
  • How can computational approaches help in
    achieving this goal?

3
A gene codes for a protein
DNA: CCTGAGCCAACTATTGATGAA
mRNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
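
The DNA to mRNA to protein example above can be reproduced with a short sketch. The codon table below is deliberately partial: it holds only the seven standard-code codons that occur in this example, and the function names are illustrative, not taken from the slides.

```python
# Partial codon table (standard genetic code), covering only the
# codons in the slide's example; a full table has 64 entries.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}

def transcribe(dna):
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate codon by codon; codons missing from the table become 'X'."""
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```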
4
What is Computational Gene Finding?
  • Given an uncharacterized DNA sequence, find
  • Which region codes for a protein?
  • Which DNA strand is used to encode the gene?
  • Which reading frame is used in that strand?
  • Where does the gene start and end?
  • Where are the exon-intron boundaries in
    eukaryotes?
  • (optionally) Where are the regulatory sequences
    for that gene?
  • Search space: 2-5% of genomic DNA
    (about 100-1000 Mbp)

5
  • Need for Computational Gene Prediction
  • It is the first step towards getting at the
    function of a protein.
  • It also helps accelerate the annotation of
    genomes.

6
  • Deoxyribonucleic acid (DNA) is a blueprint of
    the cell
  • Composed of four basic units called nucleotides
  • Each nucleotide contains a sugar, a phosphate
    and one of the 4 bases: Adenine (A), Thymine (T),
    Guanine (G), Cytosine (C)

7
  • For all computational purposes, a DNA sequence
    is considered to be a string over the 4-letter
    alphabet {A, T, G, C}
  • e.g., ACGCTGAATAGC
  • The aim is to find the grammar and syntax rules
    of the DNA language based on this 4-letter
    alphabet, similar to the rules of English grammar
    used to form meaningful sentences

8
Biological Sequences
  • The order of occurrence of bases is not
    completely random
  • Different regions of the genome exhibit
    different patterns of the four bases A, T, G, C,
    e.g., protein coding regions, regulatory
    regions, intron/exon boundaries, repeat regions,
    etc.
  • Aim: identify these various patterns to infer
    their functional roles

9
  • Assumption in biological sequence analysis:
    strings carrying information will be different
    from random strings
  • If a hidden pattern can be identified in a
    string, it must be carrying some functional
    information

10
Example
  • This is a lecture on bioinformatics
  • asjd lkjfl jdjd sjftye nvcrow nzcdjhspu

11
Frequency of letters in English text (%)
  Letter  %       Letter  %
  A       7.3     N       7.8
  B       0.9     O       7.4
  C       3.0     P       2.7
  D       4.4     Q       0.3
  E       13.0    R       7.7
  F       2.8     S       6.3
  G       1.6     T       9.3
  H       3.5     U       2.7
  I       7.4     V       1.3
  J       0.2     W       1.6
  K       0.3     X       0.5
  L       3.5     Y       1.9
  M       2.5     Z       0.1
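
A frequency table like the one above can be computed from any text with a few lines of code. This is a minimal sketch; the function name is illustrative, and the percentages from a single short sentence will naturally differ from those of a large English corpus.

```python
from collections import Counter

def letter_frequencies(text):
    """Percentage of each letter among the alphabetic characters of text."""
    letters = [ch.upper() for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: 100.0 * n / total for ch, n in counts.items()}

freqs = letter_frequencies("This is a lecture on bioinformatics")
# Print the three most frequent letters of the example sentence.
for ch, pct in sorted(freqs.items(), key=lambda kv: -kv[1])[:3]:
    print(ch, round(pct, 1))
```

The same counting idea carries over directly to DNA, where the alphabet has four letters instead of twenty-six.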

12
Other statistics
  • Frequencies of the most common first letter of a
    word, last letter of a word, doublets, triplets
    etc.
  • 20 most used words in written English:
    the of to in and a for was is that on at he with
    by be it an as his
  • 20 most used words in spoken English:
    the and I to of a you that in it is yes was this
    but on well he have for

13
Parallels in DNA language
  • ATGGTGGTCATGGCGCCCCGAACCCTCTTCCTGCTGCTCTCGGGGGCCC
    TGACCCTGACCGAGACCTGGGCGGGTGAGTGCGGGGTCAGGAGGGAAACA
    GCCCCTGCGCGGAGGAGGGAGGGGCCGGCCCGGCGGG
  • GTCTCAACCCCTCCTCGCCCCCAGGCTCCCACTCCATGAGGTATTTCAG
    CGCCGCCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCATGG
    GCTACGTGGACGACACGCAGTTCGTGCGGTTC

14
Parallels in DNA language
  • ATG GTG GTC ATG GCG CCC CGA ACC CTC TTC CTG CTG
    CTC TCG GGG GCC CTG ACC CTG ACC GAG ACC TGG GCG
    GGT GAG TGC GGG GTC AGG AGG GAA ACA GCC CCT GCG
    CGG AGG AGG GAG GGG CCG GCC CGG CGG
  • GTC TCA ACC CCT CCT CGC CCC CAG GCT CCC ACT CCA
    TGA GGT ATT TCA GCG CCG CCG TGT CCC GGC CCG GCC
    GCG GGG AGC CCC GCT TCA TCG CCA TGG GCT ACG TGG
    ACG ACA CGC AGT TCG TGC GGT TC
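
Splitting a sequence into triplets, as done above, and counting the resulting "words" is the DNA analogue of word statistics in English. The sketch below assumes the reading frame is given as an offset of 0, 1 or 2; the function names are illustrative.

```python
from collections import Counter

def split_codons(seq, frame=0):
    """Split seq into complete triplets starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def codon_counts(seq, frame=0):
    """Count how often each triplet occurs in the chosen frame."""
    return Counter(split_codons(seq, frame))

# First 15 bases of the first sequence shown above.
print(split_codons("ATGGTGGTCATGGCG"))
print(codon_counts("ATGGTGGTCATGGCG").most_common(1))
```

Trailing bases that do not fill a complete triplet are dropped, which matches how the slides leave an incomplete final group.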

15
  • This task needs to be automated because of the
    large genome sizes
  • Among the smallest genomes: Mycoplasma
    genitalium, about 0.5 x 10^6 bp
  • Human genome: 3 x 10^9 bp (not the largest)

16
Finding genes in Prokaryotes
  • Each gene is one continuous stretch of bases
  • Most of the DNA sequence codes for protein
    (70% of the Haemophilus influenzae genome is
    coding)

17
Gene Structure in Prokaryotes
  • Starts with the promoter region, which is
    followed by a transcription start site
  • Non-coding region called the 5' untranslated
    region (5' UTR)
  • The coding region (CDS) is a continuous region
    that starts with a start codon (ATG) and
    terminates with a stop codon (TAA/TAG/TGA)
  • Followed by another non-coding region called the
    3' untranslated region (3' UTR)
  • At the end there is a transcription termination
    signal (polyadenylation signals are
    characteristic of eukaryotic genes)

18
Finding genes in Prokaryotes
  • Gene prediction in prokaryotes is comparatively
    simple and involves
  • identifying long open reading frames
  • using codon frequencies

19
Finding genes in Eukaryotes
  • the coding region is usually discontinuous
  • composed of alternating stretches of exons and
    introns
  • Only 2-3% of the human genome (3 x 10^9 bp)
    codes for proteins

20
Gene Structure in Eukaryotes
  • Starts with a promoter region, which is followed
    by a transcription start site
  • Non-coding region called the 5' untranslated
    region (5' UTR)
  • The initial exon, which contains the start codon
    (ATG; AUG in the mRNA)
  • An alternating sequence of introns and internal
    exons.
  • The terminating exon, which contains the stop
    codon (TAA, TAG, TGA)

21
Gene Structure in Eukaryotes
  • Non-coding region called the 3' untranslated
    region (3' UTR)
  • At the end there is a polyadenylation signal
  • Exon-intron boundaries, called splice sites, are
    signaled by certain short sequences
  • donor sites (GT): 5' end of an intron (3' end of
    the preceding exon)
  • acceptor sites (AG): 3' end of an intron (5' end
    of the following exon)
  • Note: some eukaryotic genes are single-exon
    (intron-less) genes

22
Finding genes in Eukaryotes
  • The gene finding problem is more complicated
    because
  • exons and introns are interleaved, and stop
    codons may occur in intronic regions, making it
    difficult to identify the correct ORF
  • a gene region may encode many proteins due to
    alternative splicing
  • exon length need not be a multiple of three,
    resulting in frameshifts between exons
  • a gene may be intron-less (single-exon gene)
  • gene density is relatively low: only 2-5% of the
    human genome codes for proteins

23
Methods for Identifying Coding Regions
  • Finding Open Reading Frames (ORFs)
  • Homology Search
  • DNA vs. Protein Searches
  • Content-based methods
  • Coding statistics, viz., codon usage bias,
    periodicity in base occurrence, etc.
  • Signal-based methods
  • CpG islands
  • Start/Stop signals, promoters, poly-A sites,
    intron/exon boundaries, etc.
  • Integration of these methods

24
Finding Open Reading Frames (ORF)
  • Once a gene has been sequenced it is important to
    determine the correct open reading frame (ORF).
  • Every region of DNA has six possible reading
    frames, three in each direction
  • The reading frame that is used determines which
    amino acids will be encoded by a gene.
  • Typically only one reading frame is used in
    translating a gene, and this is often the longest
    open reading frame

25
Finding Open Reading Frames (ORF)
  • A relatively long sequence free of stop codons
    suggests a coding region
  • An open reading frame starts with a start codon
    (ATG in most species) and ends with a stop codon
    (TAA, TAG or TGA)
  • Once the open reading frame is known, the DNA
    sequence can be translated into its corresponding
    amino acid sequence using the genetic code
  • Codons are triplets of bases

26
The Genetic Code
27
Finding Open Reading Frames (ORF)
  • Consider the following sequence of DNA
  • 5' TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3'
    (forward strand)
  • Its complementary strand is
  • 3' AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5'
    (reverse strand)
  • The DNA sequence can be read in six reading
    frames - three in the forward and three in the
    reverse direction depending on the start position

28
Finding Open Reading Frames (ORF)
  • 5' TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT 3'
  • Three reading frames in the forward direction
  • TCA ATG TAA CGC GCT ACC CGG AGC TCT GGG CCC AAA
    TTT CAT CCA CT 
  • CAA TGT AAC GCG CTA CCC GGA GCT CTG GGC CCA AAT
    TTC ATC CAC T
  • AAT GTA ACG CGC TAC CCG GAG CTC TGG GCC CAA ATT
    TCA TCC ACT

Start codon
29
Finding Open Reading Frames (ORF)
  • 3' AGTTACATTGCGCGATGGGCCTCGAGACCCGGGTTTAAAGTAGGTGA 5'
  • Three reading frames in the reverse direction
  • AG TTA CAT TGC GCG ATG GGC CTC GAG ACC CGG GTT
    TAA AGT AGG TGA
  • A GTT ACA TTG CGC GAT GGG CCT CGA GAC CCG GGT TTA
    AAG TAG GTG
  • AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT
    AAA GTA GGT

Start codon
stop codon
30
Finding Open Reading Frames (ORF)
  • In this case the longest open reading frame
    (ORF) is the 3rd reading frame of the
    complementary strand
  • AGT TAC ATT GCG CGA TGG GCC TCG AGA CCC GGG TTT
    AAA GTA
  • When read 5' to 3', the longest ORF is
  • ATG AAA TTT GGG CCC AGA GCT CCG GGT AGC GCG TTA
    CAT TGA
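
The result above can be checked with a small ORF scanner. This is a simplified sketch, not the method of any particular tool: it assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, ignores ORFs that run off the end of the sequence, and uses illustrative function names.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def orfs_in_strand(strand):
    """Yield ORFs (ATG through stop codon, inclusive) in all three frames."""
    for offset in range(3):
        start = None
        for i in range(offset, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new ORF
            elif start is not None and codon in STOPS:
                yield strand[start:i + 3]      # close it at the stop codon
                start = None

def longest_orf(seq):
    """Return (length, strand, orf) for the longest ORF, or None."""
    candidates = []
    for sign, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for orf in orfs_in_strand(strand):
            candidates.append((len(orf), sign, orf))
    return max(candidates, default=None)

dna = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT"
print(longest_orf(dna))
# (42, '-', 'ATGAAATTTGGGCCCAGAGCTCCGGGTAGCGCGTTACATTGA')
```

On the forward strand the only ORF is the two-codon ATG TAA, so the 42-base ORF on the reverse strand is reported, in agreement with the slide.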

31
Finding Long ORFs
  • First step to distinguish between a coding and a
    non-coding region is to look at the frequency of
    stop codons
  • Sequence similarity search (database search)
  • When no sequence similarity is found, an ORF can
    still be considered gene-like according to some
    statistical features
  • the three-base periodicity
  • higher GC content
  • signal sequence patterns
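
Two of the simplest statistics mentioned above, stop codon frequency and GC content, can be sketched as follows. In a random sequence with equal base frequencies, 3 of the 64 codons are stops (about 4.7%), so a long stretch with no in-frame stop is unlikely to arise by chance. Function names are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def stop_codon_fraction(seq, frame=0):
    """Fraction of complete codons in the given frame that are stop codons."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    return sum(codon in STOPS for codon in codons) / len(codons)

orf = "ATGAAATTTGGGCCCAGAGCTCCGGGTAGCGCGTTACATTGA"
print(round(gc_content(orf), 2), round(stop_codon_fraction(orf), 3))
```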

32
Finding Long ORFs
  • Once a long ORF, or all ORFs above a certain
    length threshold, have been identified:
  • these ORF sequences are called putative coding
    sequences
  • translate each ORF using the universal genetic
    code to obtain its amino acid sequence
  • search the protein database for homologs

33
Finding genes in Prokaryotes
  • Drawbacks
  • The addition or deletion of one or more bases
    causes all the codons scanned downstream to be
    different, so the method is sensitive to
    frameshift errors
  • Fails to identify very small coding regions
  • Fails to identify the occurrence of overlapping
    long ORFs on opposite DNA strands (genes and
    shadow genes)

34
Web-based tools
  • ORF Finder (NCBI)
  • http://www.ncbi.nih.gov/gorf/gorf.html
  • EMBOSS
  • getorf - finds and extracts open reading frames
  • plotorf - plots potential open reading frames
  • sixpack - displays a DNA sequence with 6-frame
    translation and ORFs
  • http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/getorf.html

35
Homology Search
  • This involves Sequence-based Database Searching
  • DNA Database searching  
  • Protein Database searching

36
Homology Search
  • Why search databases?
  • When one obtains a new DNA sequence, one needs
    to know
  • whether it already exists in the databanks
  • whether it has any homologous sequences (i.e.,
    sequences derived from a common ancestor) in the
    databases
  • Given a putative coding ORF, search for
    homologous proteins, i.e., proteins similar in
    their folding, structure or function

37
Homology Search
  • DNA vs. Protein Searches
  • Use protein for database similarity searches
    whenever possible

38
Homology Search
  • Three main search tools used for database
    search
  • BLAST - algorithm by Karlin & Altschul
  • http://www.ncbi.nlm.nih.gov/BLAST/
  • FastA - algorithm by Pearson & Lipman
  • http://www.ebi.ac.uk/fasta33/
  • Smith-Waterman (SW) algorithm - a dynamic
    programming algorithm
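
To make the dynamic programming idea concrete, here is a minimal sketch of Smith-Waterman local alignment. The scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration; real searches use substitution matrices and affine gap penalties, which this sketch omits.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix; scores are floored at 0 so alignments can restart.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("TTACGTTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```

The time and memory cost is proportional to the product of the two sequence lengths, which is why heuristic tools such as BLAST and FastA are preferred for large database searches.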

39
Limitations of Homology Search
  • Only limited number of genes are available in
    various databases.
  • Currently only 50% of the sequences are found to
    be similar to previously known sequences.
  • It should always be kept in mind that
    similarity-based methods are only as reliable as
    the databases that are searched, and apparent
    homology can be misleading at times

40
Content-based Methods
At the core of all gene identification programs
there exist one or more coding measures. A
coding statistic is a function that computes the
likelihood that a sequence codes for a
protein. A good knowledge of core coding
statistics is important to understand how gene
identification programs work.
41
 Classification of Coding Measures
  • Coding statistics measure
  • base compositional bias
  • periodicity in base occurrence
  • codon usage bias
  • The main distinction is between
  • measures dependent on a model of coding DNA
  • measures independent of such a model

42
  • Model dependent coding statistics capture the
    specific features of coding DNA
  • Unequal usage of codons in the coding regions -
    a universal feature of the genomes
  • Dependencies between nucleotide positions
  • Base compositional bias between codon positions
  • - requires a representative sample of coding
    DNA from the species under consideration to
    estimate the model's parameters

43
(No Transcript)
44
Markov Models
Dependencies between nucleotide positions in
coding regions can be explicitly described by
means of Markov models. In a Markov model, the
probability of a nucleotide at a particular
codon position depends on the nucleotide(s)
preceding it. The probability of a DNA sequence of
length L is then expressed in terms of
transition probabilities.
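
As an illustration of how transition probabilities give a sequence probability, the sketch below uses the simplest case, a homogeneous first-order Markov chain, where P(S) = P(s_1) x P(s_2 | s_1) x ... x P(s_L | s_(L-1)). This is a simplification of the codon-position-dependent model described on the slide; the pseudocount, the uniform initial probability and the function names are assumptions made for the example, and input is assumed to contain only A, C, G, T.

```python
import math

BASES = "ACGT"

def train_first_order(seqs, pseudocount=1.0):
    """Estimate P(current base | previous base) from training sequences."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for s in seqs:
        s = s.upper()
        for prev, cur in zip(s, s[1:]):
            if prev in counts and cur in counts[prev]:
                counts[prev][cur] += 1
    # Normalise each row so the transition probabilities sum to 1.
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}

def log_probability(seq, trans, initial=0.25):
    """Log-probability of a non-empty ACGT sequence under the chain."""
    seq = seq.upper()
    logp = math.log(initial)
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

trans = train_first_order(["ATGAAATTTGGGCCCAGAGCTCCGGGTAGCGCGTTACATTGA"])
print(log_probability("ATGGCC", trans))
```

A coding statistic can then be formed by comparing the log-probability under a model trained on coding DNA with that under a model trained on non-coding DNA.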
45
Markov Models
Table III: Probabilities of the four nucleotides
at the different codon positions, conditioned on
the nucleotide in the preceding codon position
46
  • Model independent coding statistics capture
    only the universal features of coding DNA
  • Position Asymmetry: how asymmetric the
    distribution of nucleotides is at the 3 triplet
    positions
  • Periodic Correlation: correlations between
    nucleotide positions
  • They do not require a sample of coding DNA
47
Signal-based Methods

Signal: a string of DNA recognized by the
cellular machinery, e.g., the GT (donor) and AG
(acceptor) dinucleotides at splice sites
48
  • Signals for gene identification
  • There are many signals associated with genes,
    each of which suggests but does not prove the
    existence of a gene
  • Most of these signals can be modeled using
    weight matrices

49
  • Signals for gene identification
  • CpG Islands: help identify the 2% of the genome
    that codes for proteins
  • Start & Stop Codons: signify the start and end
    of a coding region
  • Transcription Start Site: identifies the start
    of the transcribed region
  • Donor & Acceptor Sites: signify the start and
    end of intronic regions
  • Cap Site: found in the 5' UTR
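
As an example of scoring one of these signals, a CpG island is commonly detected through the observed/expected ratio of CG dinucleotides in a window. The sketch below computes that ratio as (number of CG x window length) / (number of C x number of G). A commonly cited rule of thumb (window of at least 200 bp, GC content above 50%, ratio above 0.6) is not applied here; the function name is illustrative.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio of a sequence window (0.0 if undefined)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")   # CG cannot overlap itself, so count() is exact
    if c_count == 0 or g_count == 0:
        return 0.0
    return cpg_count * len(seq) / (c_count * g_count)

print(cpg_observed_expected("CGCGCG"))  # 2.0
print(cpg_observed_expected("CCGG"))    # 1.0
```

In practice the ratio is evaluated in a sliding window along the genome, and windows exceeding the chosen thresholds are merged into candidate islands.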

50
  • Signals for gene identification
  • Promoters: initiate transcription (found in the
    5' region upstream of the gene)
  • Enhancers: regulate gene expression (found in
    5' or 3' UTR regions, intronic regions, or up to
    a few Kb away from the gene)
  • Transcription Factor Binding Sites: short DNA
    sequences where proteins bind to initiate
    the transcription/translation process
  • Poly-A Site: identifies the end of the coding
    region (found in the 3' UTR region)

51
Promoter Detection
  • Not all ORFs are genes
  • True coding regions have specific sequences
    upstream of the start site known as promoters
    where the RNA polymerase binds to initiate
    transcription, e.g., in E. coli
  • No two patterns are identical
  • Not all genes have these patterns

Consensus patterns
52
Promoter Detection
  • Signals are short and variable
  • Pattern search using positional frequencies
  • Compute f(b,i), the frequency of nucleotide b at
    position i
  • The probability that sequence S = b_1 ... b_6 is
    a promoter is $\prod_{i=1}^{6} f(b_i, i)$
  • The probability that S is not a promoter is
    $\prod_{i=1}^{6} f(b_i)$, where f(b) is the
    expected frequency of b
  • Find the odds ratio of S being a promoter to not
    being a promoter
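
The positional-frequency scoring described above can be sketched as follows. The four training hexamers are made-up examples resembling a -10 box, not data from the slides; the pseudocount and the uniform background frequencies are likewise assumptions. The score is the logarithm of the odds ratio, so positive values favour "promoter".

```python
import math

BASES = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """f(b, i): frequency of base b at position i across aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    freqs = []
    for i in range(length):
        column = [site[i] for site in sites]
        freqs.append({b: (column.count(b) + pseudocount) / total
                      for b in BASES})
    return freqs

def log_odds(seq, freqs, background=None):
    """log( prod_i f(b_i, i) / prod_i f(b_i) ) for a candidate site."""
    background = background or {b: 0.25 for b in BASES}
    return sum(math.log(freqs[i][b] / background[b])
               for i, b in enumerate(seq))

# Hypothetical aligned promoter hexamers, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
freqs = position_frequencies(sites)
print(log_odds("TATAAT", freqs) > 0)   # True: resembles the training sites
print(log_odds("GCGCGC", freqs) < 0)   # True: does not resemble them
```

The list of per-position frequency dictionaries is, in effect, a positional weight matrix.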

53
Positional Weight Matrix for TATA box

54
Complications in Gene Prediction
The problem of gene identification is further
complicated in the case of eukaryotes by the vast
variation that is found in the structure of
genes. On average, a vertebrate gene is 30 Kb
long. Of this, the coding region is only about
1 Kb. The coding region typically consists of 6
exons, each about 150 bp long. These are average
statistics.
55
Complications in Gene Prediction
Huge variations from the average are observed.
The biggest human gene, dystrophin, is 2.4 Mb long.
The human blood coagulation factor VIII gene is
186 Kb long. It has 26 exons with sizes varying
from 69 bp to 3106 bp, and its 25 introns range in
size from 207 to 32,400 bp. An average 5' UTR is
750 bp long, but it can be longer and span several
exons (e.g., in the MAGE family). On average,
the 3' UTR is about 450 bp long, but in the case
of the gene for Kallmann syndrome, for example,
the length exceeds 4 Kb.
56
Some facts about human genes
  • Comprise about 3% of the genome
  • Average gene length: 8,000 bp
  • Average of 5-6 exons/gene
  • Average exon length: 200 bp
  • Average intron length: 2,000 bp
  • 8% of genes have a single exon
  • Some exons can be as small as 1 or 3 bp

57
Complications in Gene Prediction
  • In higher eukaryotes gene finding becomes far
    more difficult because
  • It is now necessary to combine multiple ORFs to
    obtain a spliced coding region
  • Alternative splicing is not uncommon
  • Exons can be very short, and introns can be
    very long
  • Given the nature of genomic sequence in humans,
    where large introns are known to exist, there is
    definitely a need for highly specific gene
    finding algorithms.

58
Gene prediction programs
GENSCAN
http://genes.mit.edu/GENSCAN.html
Probabilistic model based on HMM. The different
states of the model correspond to different
functional units on a gene, e.g., promoter region,
exon, intron, etc. It uses a homogeneous 5th order
Markov model for non-coding regions and a
3-periodic (inhomogeneous) 5th order Markov model
for coding regions. Signals are modeled by weight
matrices, weight arrays and maximal dependence
decomposition techniques.
Species: Vertebrates, Maize, Arabidopsis. Trained
on human genes; accuracy is lower for
non-vertebrates.
59
Gene prediction programs
Fgene
Uses dynamic programming to find the optimal
combination of exons, promoters, and polyA sites
detected by a pattern recognition algorithm.
Species: Human, Drosophila, Nematode, Yeast, Plant
http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
GrailExp
Uses HMM as the underlying computational technique
to determine its genomic predictions. Uses a
neural network to combine the information from
various gene finding signals.
Species: Human, Mouse
http://compbio.ornl.gov/grailexp/