Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b

1
Gene prediction and HMM
Computational Genomics 2005/6, Lecture 9b
  • Slides taken from (and rapidly mixed) Larry
    Hunter, Tom Madej, William Stafford Noble, and
    Eyal Pribman. Partially modified by Benny Chor.

2
Annotation of Genomic Sequence
  • Given the sequence of an organism's genome, we
    would like to be able to identify:
  • Genes
  • Exon boundaries, splice sites
  • Beginning and end of translation
  • Alternative splicings
  • Regulatory elements (e.g. promoters)
  • The only certain way to do this is
    experimentally, but it is time consuming and
    expensive. Computational methods can achieve
    reasonable accuracy quickly, and help direct
    experimental approaches.

primary goals
secondary goals
3
Prokaryotic Gene Structure
Promoter CDS Terminator
Genomic DNA

transcription
mRNA
  • Most bacterial promoters contain the Pribnow
    box, at about position -10, which has the
    consensus sequence 5'-TATAAT-3'.
  • The terminator is a signal at the end of the
    coding sequence that terminates the transcription
    of RNA.
  • The coding sequence is composed of nucleotide
    triplets. Each triplet codes for an amino acid.
    Amino acids are the building blocks of proteins.

4
Pieces of a (Eukaryotic) Gene (on the genome)
  • exons (CDS and UTR): about 10^2-10^3 bp
  • introns: about 10^2-10^5 bp
5
What is it about genes that we can measure (and
model)?
  • Most of our knowledge is biased towards
    protein-coding characteristics
      • ORF (Open Reading Frame): a sequence defined by
        an in-frame AUG and stop codon, which in turn
        defines a putative amino acid sequence.
      • Codon usage: most frequently measured by CAI
        (Codon Adaptation Index)
  • Other phenomena
      • Nucleotide frequencies and correlations:
        value and structure
      • Functional sites: splice sites, promoters,
        UTRs, polyadenylation sites
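The ORF definition above (an in-frame start codon followed by the first in-frame stop codon) can be sketched in a few lines of Python. This is only an illustration, not the method of any particular gene finder: the function name find_orfs, the min_codons parameter, and the choice to scan only the forward strand are assumptions made here. On DNA the start codon AUG is written ATG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for each forward-strand ORF in dna.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the codons before the stop, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

Raising min_codons gives the simplest possible filter on ORF length; a real annotation pipeline would also scan the reverse complement.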

6
A simple measure: ORF length
Comparison of Annotation and Spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J. Genome Research
1997; 7:768-771
7
Codon Adaptation Index (CAI)
  • Parameters are empirically determined by
    examining a large set of example genes
  • This is not perfect
  • Genes sometimes have unusual codons for a reason
  • The predictive power is dependent on length of
    sequence
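As a rough sketch of how the empirically determined parameters are used: in the usual formulation of CAI, each codon gets a weight w equal to its count divided by the count of the most used synonymous codon in a reference gene set, and the CAI of a sequence is the geometric mean of w over its codons. The reference counts below are hypothetical and cover only two amino acids to keep the example short; the function names are also assumptions.

```python
import math

# Hypothetical codon counts from a reference set of example genes.
REFERENCE_COUNTS = {
    "K": {"AAA": 20, "AAG": 80},
    "F": {"TTT": 30, "TTC": 60},
}

def relative_adaptiveness(reference_counts):
    """w(codon) = count(codon) / count(most used synonymous codon)."""
    weights = {}
    for codon_counts in reference_counts.values():
        best = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / best
    return weights

def cai(cds, weights):
    """Geometric mean of w over the codons of cds that have a weight."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

Because the score is an average over codons, a short sequence gives a noisy estimate, which is one way to see why predictive power depends on sequence length. A codon with a zero count in the reference set would need special handling (log of zero); this sketch assumes all counts are positive.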

8
CAI Example: counts per 1000 codons
9
Splice signals (mice): GT, AG
10
General Things to Remember about (Protein-coding)
Gene Prediction Software
  • It is, in general, organism-specific
  • It works best on genes that are reasonably
    similar to something seen previously
  • It finds protein coding regions far better than
    non-coding regions
  • In the absence of external (direct) information,
    alternative forms will not be identified
  • It is imperfect! (It's biology, after all)

11
Rest of Lecture Outline
  • Eukaryotic gene structure
  • Modeling gene structure
  • Using the model to make predictions
  • Improving the model topology
  • Modeling fixed-length signals

12
A eukaryotic gene
  • This is the human p53 tumor suppressor gene on
    chromosome 17.
  • Genscan is one of the most popular gene
    prediction algorithms.

13
A eukaryotic gene
Introns
Final exon
Initial exon
3' untranslated region
Internal exons
This particular gene lies on the reverse strand.
14
An Intron
revcomp(CT) = AG
revcomp(AC) = GT
GT signals start of intron; AG signals end of intron
3' splice site
5' splice site
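Because this gene lies on the reverse strand, its intron boundaries appear on the forward strand as CT ... AC. A small sketch of the reverse-complement relation used on the slide follows; the helper names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(dna):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def looks_like_reverse_strand_intron(forward_seq):
    """True if the reverse complement starts with GT and ends with AG,
    i.e. the forward-strand text starts with CT and ends with AC."""
    intron = revcomp(forward_seq)
    return intron.startswith("GT") and intron.endswith("AG")
```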
15
Signals vs contents
  • In gene finding, a small pattern within the
    genomic DNA is referred to as a signal, whereas a
    region of genomic DNA is a content.
  • Examples of signals: splice sites, starts and
    ends of transcription or translation, branch
    points, transcription factor binding sites
  • Examples of contents: exons, introns, UTRs,
    promoter regions

16
Prior knowledge
  • We want to build a probabilistic model of a gene
    that incorporates our prior knowledge.
  • E.g., the translated region must have a length
    that is a multiple of 3.

17
Prior knowledge
  • The translated region must have a length that is
    a multiple of 3.
  • Some codons are more common than others.
  • Exons are usually shorter than introns.
  • The translated region begins with a start signal
    and ends with a stop codon.
  • 5' splice sites (exon to intron) are usually GT.
  • 3' splice sites (intron to exon) are usually AG.
  • The distribution of nucleotides and dinucleotides
    is usually different in introns and exons.

18
A simple gene model
[Figure: state diagram. Boxes: Start, Transcription start, Transcription stop, End. Arrows: four Intergenic arrows and one Gene arrow.]
19
A probabilistic gene model
Pr(TACAGTAGATATGA) = 0.0001
Pr(AACAGT) = 0.001
Pr(AACAGTAC) = 0.002
[Figure: state diagram. Boxes: Start, Transcription start, Transcription stop, End. Arrows: four Intergenic arrows and one Gene arrow, with transition probabilities 0.25, 0.67, 1.00, 0.75, and 0.33 on the arrows.]
Every box stores transition probabilities for
outgoing arrows. Every arrow stores emission
probabilities for emitted nucleotides.
20
Parse
S: ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCG
P: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG
S: TATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P: GGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
  • For a given sequence, a parse is an assignment of
    gene structure to that sequence.
  • In a parse, every base is labeled with the
    content it is predicted to belong to.
  • In our simple model, the parse contains only I
    (intergenic) and G (gene).
  • A more complete model would contain, e.g., "-"
    for intergenic, E for exon and I for intron.
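To make the idea of a parse concrete, here is a small sketch that turns a per-base label string into labeled segments. The sequence is the one from this slide; the function name parse_segments is an assumption introduced for illustration.

```python
from itertools import groupby

def parse_segments(sequence, parse):
    """Split sequence into (label, subsequence) runs according to parse."""
    if len(sequence) != len(parse):
        raise ValueError("sequence and parse must have the same length")
    segments = []
    pos = 0
    for label, run in groupby(parse):
        length = len(list(run))
        segments.append((label, sequence[pos:pos + length]))
        pos += length
    return segments

# The example sequence from the slide, split into its three parts.
S = ("ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT"   # intergenic
     "ATGCGTATGTTTTGA"                           # gene
     "ACTGACTATGCGATCTACGACTCGACTAGCTAC")        # intergenic
P = "I" * 39 + "G" * 15 + "I" * 33

for label, subseq in parse_segments(S, P):
    print(label, len(subseq), subseq)
```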

21
The probability of a parse
Pr(ATGCGTATGTTTTGA) = 0.00000000142
Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT) = 0.0000543
Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789
[Figure: state diagram. Boxes: Start, Transcription start, Transcription stop, End. Arrows: four Intergenic arrows and one Gene arrow, with transition probabilities 0.25, 0.67, 1.00, 0.75, and 0.33 on the arrows.]
S: ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTT
P: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGG
S: TTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P: GGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Pr(parse P | sequence S, model M) = 0.67 × 0.0000543 × 1.00 × 0.00000000142 × 0.75 × 0.0000789 = 3.057 × 10^-18
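The arithmetic on this slide can be checked with a short sketch: the score of a parse is the product of the transition probabilities along its path and the emission probabilities of its segments. Strictly speaking this product is the joint probability of parse and sequence under the model, which for a fixed sequence is proportional to the conditional probability written on the slide, so the ranking of parses is the same. The function names are assumptions; the numbers are the ones from the slide.

```python
import math

# Values read off the slide, in path order.
transitions = [0.67, 1.00, 0.75]
emissions = [0.0000543, 0.00000000142, 0.0000789]

def parse_probability(transitions, emissions):
    """Product of all transition and emission probabilities on the path."""
    return math.prod(transitions) * math.prod(emissions)

def parse_log_probability(transitions, emissions):
    """Same quantity in log space, which avoids underflow on long sequences."""
    return sum(math.log(x) for x in transitions + emissions)

print(parse_probability(transitions, emissions))  # about 3.057e-18
```

Products of many small probabilities underflow quickly, so practical implementations typically work with sums of logarithms.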
22
Finding the best parse
  • For a given sequence S, the model M assigns a
    probability Pr(P | S, M) to every parse P.
  • We want to find the parse P that receives the
    highest probability.

23
An analogy
  • Given two sequences, S1 and S2, and a scoring
    system M, find the alignment A that maximizes
    the score.
  • Solution: dynamic programming

24
Dynamic programming
A C A C C T T G T A C G T
Intergenic 1, Intergenic 2, Intergenic 3, Intergenic 4, Gene
[Figure: state diagram. Boxes: Start, Transcription start, Transcription stop, End. Arrows: Intergenic 1, Intergenic 2, Intergenic 3, Intergenic 4, and Gene.]
25
DP matrix
Columns: A C A A T G T G A C
Rows: I1, I2, I3, I4, G
S: ACAATGTGAC
P: IIIGGGGGGI
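The dynamic program that fills such a matrix is the Viterbi algorithm. The sketch below is a simplification and not the lecture's exact model: it uses a standard HMM in which states (rather than arrows) emit nucleotides, with only two states, I (intergenic) and G (gene). All probabilities in the usage example are hypothetical.

```python
import math

def viterbi(sequence, states, start_p, trans_p, emit_p):
    """Return (best_path, log_probability) for a state-emitting HMM.

    score[k][s] is the best log probability of any path that emits
    sequence[:k + 1] and ends in state s. sequence must be non-empty.
    """
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][sequence[0]])
              for s in states}]
    back = [{}]
    for k in range(1, len(sequence)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at position k.
            prev = max(states,
                       key=lambda r: score[k - 1][r] + math.log(trans_p[r][s]))
            score[k][s] = (score[k - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s][sequence[k]]))
            back[k][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for k in range(len(sequence) - 1, 0, -1):
        path.append(back[k][path[-1]])
    return "".join(reversed(path)), score[-1][last]

# Hypothetical parameters: AT-rich intergenic regions, GC-rich genes.
STATES = "IG"
START = {"I": 0.9, "G": 0.1}
TRANS = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
EMIT = {"I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

path, logp = viterbi("AAAAAGGGGGAAAAA", STATES, START, TRANS, EMIT)
print(path)
```

The running time is proportional to the sequence length times the square of the number of states, the same flavor of table filling as in sequence alignment.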
26
Beyond Simplest Model
  • Improving the gene model topology
  • Fixed-length signals
      • PSSMs
      • Dependencies between positions
  • Variable-length contents
      • Using HMMs
      • Semi-Markov models
  • Parsing algorithms
      • Viterbi
      • Posterior decoding
  • Including other types of data
      • Expressed sequence tags
      • Orthology

27
Improved model topology
[Figure: state diagram. Boxes: Start, Transcription start, Transcription stop, End. Arrows: Intergenic 1, Intergenic 2, Intergenic 3, Intergenic 4, and Gene.]
  • Draw a model that includes introns

28
Improved model topology
Start
Transcription start
5' splice site
3' splice site
Transcription stop
End
29
Improved model topology
Start
Transcription start
5' splice site
3' splice site
4 intergenics, 1 intron, 4 exons
Transcription stop
End
30
Improved model topology
Start
Transcription start
Single exon
Initial exon
Internal exon
5' splice site
3' splice site
Intron
Final exon
Transcription stop
End
31
Modeling the 5' splice site
5' splice site
3' splice site
Intron
GT
  • Most introns begin with the letters GT.
  • We can add this signal to the model.

32
Modeling the 5' splice site
5' splice site
3' splice site
Intron
G: Pr(A)=0 Pr(C)=0 Pr(G)=1 Pr(T)=0
T: Pr(A)=0 Pr(C)=0 Pr(G)=0 Pr(T)=1
  • Most introns begin with the letters GT.
  • We can add this signal to the model.
  • Indeed, we can model each nucleotide with its own
    arrow.

33
Modeling the 5' splice site
5' splice site
3' splice site
Intron
G: Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.97 Pr(T)=0.01
T: Pr(A)=0.01 Pr(C)=0.01 Pr(G)=0.01 Pr(T)=0.97
  • Like most biological phenomena, the splice site
    signal admits exceptions.
  • The resulting model of the 5' splice site is a
    length-2 PSSM.
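A length-2 PSSM is simply one probability column per position. The sketch below uses the probabilities from this slide; the function names and the uniform background of 0.25 used for the log-odds score are assumptions added for illustration.

```python
import math

# One column per position: the first favors G, the second favors T.
PSSM_5_SPLICE = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]

def pssm_probability(pssm, site):
    """Probability of site under the PSSM (positions are independent)."""
    if len(site) != len(pssm):
        raise ValueError("site length must match PSSM length")
    p = 1.0
    for column, base in zip(pssm, site):
        p *= column[base]
    return p

def log_odds(pssm, site, background=0.25):
    """log2 ratio of the PSSM probability to a uniform background."""
    return sum(math.log2(column[base] / background)
               for column, base in zip(pssm, site))

print(pssm_probability(PSSM_5_SPLICE, "GT"))  # 0.97 * 0.97
print(pssm_probability(PSSM_5_SPLICE, "GC"))  # 0.97 * 0.01
```

Extending the signal to more conserved positions only means adding more columns to the list.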

34
Real splice sites
  • Real splice sites show some conservation at
    positions beyond the first two.
  • We can add additional arrows to model these
    states.

weblogo.berkeley.edu
35
Modeling the 5' splice site
5' splice site
3' splice site
Intron