Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b

Provided by: Nob49

1
Gene prediction and HMM
Computational Genomics 2005/6, Lecture 9b

Slides taken from (and rapidly mixed) Larry
Hunter, Tom Madej, William Stafford Noble, and
Eyal Pribman. Partially modified by Benny Chor.

2
Annotation of Genomic Sequence

Given the sequence of an organism's genome, we
would like to be able to identify:
Genes
Exon boundaries and splice sites
Beginning and end of translation
Alternative splicings
Regulatory elements (e.g. promoters)
The only certain way to do this is
experimentally,
but it is time consuming and expensive.
Computational methods can achieve reasonable
accuracy quickly, and help direct experimental
approaches.

primary goals
secondary goals
3
Prokaryotic Gene Structure
Promoter CDS Terminator
Genomic DNA

transcription
mRNA

Most bacterial promoters contain the
Pribnow box, a signal at about -10 that has the
consensus sequence 5'-TATAAT-3'.
The terminator: a signal at the end of the
coding sequence that terminates the transcription
of RNA.
The coding sequence is composed of nucleotide
triplets. Each triplet codes for an amino acid.
The AAs are the building blocks of proteins.

4
Pieces of a (Eukaryotic) Gene(on the genome)
exons (CDS and UTR) (~10^2-10^3 bp) / introns (~10^2-10^5 bp)
5
What is it about genes that we can measure (and
model)?

Most of our knowledge is biased towards
protein-coding characteristics
ORF (Open Reading Frame): a sequence defined by
an in-frame AUG and stop codon, which in turn
defines a putative amino acid sequence.
Codon usage: most frequently measured by CAI
(Codon Adaptation Index)
Other phenomena:
Nucleotide frequencies and correlations:
value and structure
Functional sites:
splice sites, promoters, UTRs, polyadenylation
sites
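
The ORF definition above can be sketched in Python as a scan for an in-frame start codon followed by the first in-frame stop codon. This is only a minimal sketch under stated assumptions: it works on DNA (ATG rather than AUG), looks at the forward strand only, and reports just the first ATG in each open stretch. The names find_orfs and min_codons are illustrative and do not come from the lecture.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs (end exclusive).

    An ORF here runs from an ATG to the first in-frame stop codon,
    including both. min_codons is a hypothetical length filter.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # the three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return orfs
```

For the reverse strand one would run the same scan on the reverse complement of the sequence. Longer ORFs are less likely to arise by chance, which is why ORF length is a useful first measure.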

6
A simple measure: ORF length
Comparison of Annotation and Spurious ORFs in S. cerevisiae
Basrai MA, Hieter P, and Boeke J. Genome Research
1997, 7:768-771
7
Codon Adaptation Index (CAI)

Parameters are empirically determined by
examining a large set of example genes
This is not perfect
Genes sometimes have unusual codons for a reason
The predictive power is dependent on length of
sequence
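
As a rough illustration of how the CAI parameters are estimated from example genes, the sketch below uses the usual definition: each codon gets a relative adaptiveness w (its count in the reference set divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of w over its codons. The function names, the pseudocount of 0.5 used to avoid log(0), and the choice to skip Met, Trp and stop codons are assumptions of this sketch, not details given in the lecture.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(seq):
    """Split an in-frame coding sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """w[codon] = count / count of the most used synonymous codon."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    w = {}
    for codon, aa in CODON_TABLE.items():
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        best = max(counts[c] + pseudocount for c in family)
        w[codon] = (counts[codon] + pseudocount) / best
    return w

def cai(gene, w):
    """Geometric mean of w over the informative codons of the gene."""
    scored = [c for c in codons(gene)
              if CODON_TABLE.get(c) not in (None, "*", "M", "W")]
    if not scored:
        raise ValueError("no codons to score")
    return math.exp(sum(math.log(w[c]) for c in scored) / len(scored))
```

Because the score is an average over codons, short sequences give noisy estimates, which matches the caveat above about dependence on sequence length.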

8
CAI Example: Counts per 1000 codons
9
Splice signals (mice): GT, AG
10
General Things to Remember about (Protein-coding)
Gene Prediction Software

It is, in general, organism-specific
It works best on genes that are reasonably
similar to something seen previously
It finds protein coding regions far better than
non-coding regions
In the absence of external (direct) information,
alternative forms will not be identified
It is imperfect! (It's biology, after all)

11
Rest of Lecture Outline

Eukaryotic gene structure
Modeling gene structure
Using the model to make predictions
Improving the model topology
Modeling fixed-length signals

12
A eukaryotic gene

This is the human p53 tumor suppressor gene on
chromosome 17.
Genscan is one of the most popular gene
prediction algorithms.

13
A eukaryotic gene
Introns
Final exon
Initial exon
3' untranslated region
Internal exons
This particular gene lies on the reverse strand.
14
An Intron
revcomp(CT) = AG
revcomp(AC) = GT
GT signals the start of an intron; AG signals the end of an
intron
3' splice site
5' splice site
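
Since this gene lies on the reverse strand, the GT and AG splice signals appear in the forward-strand sequence as their reverse complements. A minimal sketch of that operation follows; the name revcomp mirrors the slide's notation, while the translation-table approach is just one convenient way to write it.

```python
# Map each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]
```

With this, revcomp("CT") gives "AG" and revcomp("AC") gives "GT", so an intron of a reverse-strand gene shows up as CT...AC when read along the forward strand.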
15
Signals vs contents

In gene finding, a small pattern within the
genomic DNA is referred to as a signal, whereas a
region of genomic DNA is a content.
Examples of signals: splice sites, starts and
ends of transcription or translation, branch
points, transcription factor binding sites
Examples of contents: exons, introns, UTRs,
promoter regions

16
Prior knowledge

We want to build a probabilistic model of a gene
that incorporates our prior knowledge.
E.g., the translated region must have a length
that is a multiple of 3.

17
Prior knowledge

The translated region must have a length that is
a multiple of 3.
Some codons are more common than others.
Exons are usually shorter than introns.
The translated region begins with a start signal
and ends with a stop codon.
5' splice sites (exon to intron) are usually GT;
3' splice sites (intron to exon) are usually AG.
The distribution of nucleotides and dinucleotides
is usually different in introns and exons.
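
The last point can be made concrete by tabulating dinucleotide frequencies for sequences of each type and comparing them. The following is a minimal sketch; the function name dinucleotide_freqs is illustrative, and overlapping pairs are counted, which is one common convention.

```python
from collections import Counter

def dinucleotide_freqs(seq):
    """Relative frequency of each overlapping dinucleotide in seq."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}
```

Running this separately on known exons and known introns gives two frequency tables, and such tables are the kind of content statistics a probabilistic gene model can use.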

18
A simple gene model
Intergenic
Intergenic
Intergenic
Gene
Transcription stop
Transcription start
Start
End
Intergenic
19
A probabilistic gene model
Pr(TACAGTAGATATGA) = 0.0001, Pr(AACAGT) = 0.001,
Pr(AACAGTAC) = 0.002
Intergenic
Intergenic
0.25
Intergenic
Gene
Transcription stop
Transcription start
0.67
1.00
0.75
Start
End
0.33
Intergenic
Every box stores transition probabilities for
outgoing arrows. Every arrow stores emission
probabilities for emitted nucleotides.
20
Parse
S: ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCG
P: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG
S: TATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P: GGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

For a given sequence, a parse is an assignment of
gene structure to that sequence.
In a parse, every base is labeled with the
content it is predicted to belong to.
In our simple model, the parse contains only I
(intergenic) and G (gene).
A more complete model would contain, e.g., -
for intergenic, E for exon and I for intron.
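
A parse written as a label string can be converted into a list of segments, which is usually the more convenient form for reporting predicted genes. This is a small sketch; the function name parse_segments and the (label, start, end) tuple layout with an exclusive end are assumptions made for illustration.

```python
from itertools import groupby

def parse_segments(labels):
    """Collapse a label string such as 'IIIGGGI' into (label, start, end) runs."""
    segments, pos = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, pos, pos + length))  # end index is exclusive
        pos += length
    return segments
```

For example, the parse IIIGGGGGGI becomes an intergenic run over positions 0-2, a gene over positions 3-8, and an intergenic run at position 9.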

21
The probability of a parse
Pr(ATGCGTATGTTTTGA) = 0.00000000142
Pr(ACTGACTACTACGACTACGATCTACTACGGGCGCGACCT)
= 0.0000543
Pr(ACTGACTATGCGATCTACGACTCGACTAGCTAC) = 0.0000789
Intergenic
Intergenic
0.25
Intergenic
Gene
Transcription stop
Transcription start
0.67
1.00
0.75
Start
End
0.33
Intergenic
S: ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCGTATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
P: IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGGGGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
Pr(parse P | sequence S, model M) = 0.67 × 0.0000543 × 1.00 × 0.00000000142 × 0.75 × 0.0000789
= 3.057 × 10^-18
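
The arithmetic on this slide can be checked with a few lines of Python. The numbers are taken from the slide in the order they appear in the product; the function names are illustrative. The log-space variant is included because products of many small probabilities underflow quickly on longer sequences, which is why implementations generally sum logarithms instead.

```python
import math

def parse_probability(transition_probs, segment_probs):
    """Multiply the transition probabilities and segment probabilities of a parse."""
    return math.prod(transition_probs) * math.prod(segment_probs)

def parse_log10_probability(transition_probs, segment_probs):
    """Same quantity as a sum of base-10 logs, which is safer against underflow."""
    return sum(math.log10(p) for p in list(transition_probs) + list(segment_probs))

# Values from the slide: three transitions and three emitted segments.
transitions = [0.67, 1.00, 0.75]
segments = [0.0000543, 0.00000000142, 0.0000789]
p = parse_probability(transitions, segments)   # about 3.057e-18
```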
22
Finding the best parse

For a given sequence S, the model M assigns a
probability Pr(P | S, M) to every parse P.
We want to find the parse P that receives the
highest probability.
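
One standard way to find the highest-probability parse is the Viterbi algorithm, a dynamic program over positions and states. The sketch below is a simplification and not the lecture's exact model: it uses a plain two-state HMM (I for intergenic, G for gene) with emissions attached to states rather than arrows, and every parameter value in it is invented for illustration.

```python
import math

# Hypothetical parameters: the gene state is GC-rich, intergenic is uniform.
STATES = "IG"
START = {"I": 0.5, "G": 0.5}
TRANS = {"I": {"I": 0.9, "G": 0.1}, "G": {"I": 0.1, "G": 0.9}}
EMIT = {"I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "G": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}}

def viterbi(seq, states=STATES, start_p=START, trans_p=TRANS, emit_p=EMIT):
    """Return the most probable state path for a non-empty sequence."""
    # V[t][s]: best log-probability of any path that ends in state s at position t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            row[s] = V[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][x])
            ptr[s] = prev                       # remember where the best path came from
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))
```

The table V plays the same role as the DP matrix in sequence alignment: each cell depends only on the previous column, so the best parse is found in time proportional to the sequence length times the square of the number of states.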

23
An analogy

Given two sequences, S1 and S2, and a scoring
system M, find the alignment A that maximizes
the score.
Solution: dynamic programming

24
Dynamic programming
A C A C C T T G T A C G T
Intergenic 1, Intergenic 2, Intergenic 3, Intergenic 4, Gene
Intergenic 2
Intergenic 4
Intergenic 1
Gene
Transcription stop
Transcription start
Start
End
Intergenic 3
25
DP matrix
A C A A T G T G A C

I1
I2
I3
I4
G
S: ACAATGTGAC
P: IIIGGGGGGI
26
Beyond Simplest Model

Improving the gene model topology
Fixed-length signals
PSSMs
Dependencies between positions
Variable-length contents
Using HMMs
Semi-Markov models
Parsing algorithms
Viterbi
Posterior decoding
Including other types of data
Expressed sequence tags
Orthology

27
Improved model topology
Intergenic 2
Intergenic 4
Intergenic 1
Gene
Transcription stop
Transcription start
Start
End
Intergenic 3

Draw a model that includes introns

28
Improved model topology
Start
Transcription start
5' splice site
3' splice site
Transcription stop
End
29
Improved model topology
Start
Transcription start
5' splice site
3' splice site
4 intergenics, 1 intron, 4 exons
Transcription stop
End
30
Improved model topology
Start
Transcription start
Single exon
Initial exon
Internal exon
5' splice site
3' splice site
Intron
Final exon
Transcription stop
End
31
Modeling the 5' splice site
5' splice site
3' splice site
Intron
GT