Title: PowerPoint Presentation
Slide 1: THEME 1 - Statistical properties of DNA and RNA: why and how
1) Basic analysis of genomes
   - basic statistical features of genomes [1]
   - basic probabilistic models for genomes [2]
2) Case study
   - genomic data
   - statistical analysis of viral genomes
   - models for IR distribution in viral genomes
[1] M.S. Waterman, Mathematical Methods for DNA Sequences, CRC Press, Inc., Boca Raton, Florida (1989), ISBN 0-8493-6664-X, chap. 5
[2] G. Reinert, S. Schbath, M.S. Waterman, Probabilistic and Statistical Properties of Words: An Overview, J. Comp. Biol., 7, 1-46 (2000)
Slide 2: 1.1 Basic statistical features of genomes
1) Codon preference
CODON = triplet of bases lying on the mRNA. There are 4^3 = 64 possible codons: 61 codons encode an amino acid and 3 codons are termination signals. There are only 20 amino acids => codons are degenerate, i.e. the same amino acid (and hence the same protein) can be encoded by different codons. The amino acid glutamine is encoded by the codons CAG or CAA; the CAG codon occurs more frequently [114].
[114] T. Maniatis, E. F. Fritsch, J. Sambrook, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY (1982)
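The codon counts quoted above can be checked with a short sketch. It assumes the standard genetic code, written over the DNA alphabet (T in place of U) as a compact string in TCAG order; the names `CODON_TABLE` and `codons_for` are illustrative and not part of the original slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every 3-letter word over the alphabet to its one-letter amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return all codons (DNA alphabet) that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


n_codons = len(CODON_TABLE)                                  # 4**3 words
n_stop = sum(1 for aa in CODON_TABLE.values() if aa == "*")  # termination signals
n_amino_acids = len(set(CODON_TABLE.values()) - {"*"})       # distinct amino acids
```

Here `codons_for("Q")` lists the two glutamine codons mentioned on the slide; which of them is preferred depends on the organism and has to be estimated from data.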
Slide 3: 1.1 Basic statistical features of genomes
1) Maniatis [114] suggests that this has practical importance. Suppose one is looking for a rare mRNA (i.e. nucleotide region) corresponding to a desired gene product, and that one partially knows the amino acid sequence of that product, i.e. one partially knows the sequence of codons. How does one isolate the mRNA that encodes this product? Because of the degeneracy mentioned above, even if the amino acid sequence is short, the pieces of mRNA that code for it may be numerous. However, one can take advantage of codon preference to reduce the number of sequences likely to specify a given protein sequence and make the mRNA screening process manageable.
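A minimal sketch of the counting argument above: the number of nucleotide sequences that can encode a peptide is the product of the number of synonymous codons of each residue, and restricting some residues to their preferred codons shrinks that product. The standard genetic code is assumed; `count_encodings` and the `preferred` argument are hypothetical names introduced only for illustration.

```python
from itertools import product
from math import prod

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Group the 64 codons by the amino acid they encode.
CODONS_BY_AA = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_BY_AA.setdefault(aa, []).append("".join(codon))


def count_encodings(peptide, preferred=None):
    """Number of distinct coding sequences for `peptide` (one-letter codes).

    `preferred` optionally maps an amino acid to the codons considered likely;
    residues not listed there may use any of their synonymous codons.
    """
    preferred = preferred or {}
    return prod(len(preferred.get(aa, CODONS_BY_AA[aa])) for aa in peptide)
```

For the toy peptide `"MQLW"`, all synonymous codons give 1 x 2 x 6 x 1 candidates; assuming glutamine is always CAG, as in `count_encodings("MQLW", {"Q": ["CAG"]})`, halves that number.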
Slide 4: 1.1 Basic statistical features of genomes
2) Suppose one is looking for the possible function of a certain DNA sequence. One usually wants to know whether the nucleotides are part of a coding region (CR) or a non-coding region (NCR). Staden [119] suggests that codon preference is useful here: if one finds an abundance of triplets known to be preferred codons, then the sequence is likely to be a CR. Alternatively, one should look for ribosome binding sites or for the absence of stop codons, which is much more complicated.
[119] R. Staden, Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes, Nucl. Acids Res., 12, 551 (1984)
Slide 5: 1.1 Basic statistical features of genomes
2) Oligonucleotide repetitions
Codons are used with unequal frequency in coding sequences. Fickett [120] noted that oligonucleotides tend to be repeated with a periodicity of three in CR. Interestingly, this periodicity is absent in NCR.
[120] J. W. Fickett, Recognition of protein coding regions in DNA sequences, Nucl. Acids Res., 10, 5303 (1982)
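One simple way to look for the period-three signal is to ask how often the same base recurs at a given lag. The sketch below is a deliberately simplified illustration at the single-base level; it is not Fickett's method [120], and the function names and the `max_lag` default are assumptions made here.

```python
def match_fraction(seq, lag):
    """Fraction of positions i for which seq[i] == seq[i + lag]."""
    n = len(seq) - lag
    if n <= 0:
        raise ValueError("sequence must be longer than the lag")
    return sum(seq[i] == seq[i + lag] for i in range(n)) / n


def period3_contrast(seq, max_lag=12):
    """Mean match fraction at lags divisible by 3 minus the mean at all other lags.

    Values well above zero suggest a period-three pattern.
    """
    lags = range(1, max_lag + 1)
    in_phase = [match_fraction(seq, d) for d in lags if d % 3 == 0]
    off_phase = [match_fraction(seq, d) for d in lags if d % 3 != 0]
    return sum(in_phase) / len(in_phase) - sum(off_phase) / len(off_phase)
```

On real data the contrast would be computed separately on annotated CR and NCR stretches and compared between the two.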
Slide 6: 1.1 Basic statistical features of genomes
3) Nearest neighbours I: dinucleotide frequencies
This is a very basic analysis [127]. It shows:
- CG rarity in eukaryotes
- Pu-Pu pairs, Pu = {a, g}
- Py-Py pairs, Py = {t, c}
Such effects seem to be due to structural considerations (minimization of steric effects). Therefore they may be relevant in molecular evolution.
[127] R. Nussinov, Strong doublet preferences in nucleotide sequences and DNA geometry, J. Mol. Evol., 20, 111 (1984)
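A common way to quantify dinucleotide preferences such as the CG rarity is the observed/expected ratio f(xy) / (f(x) f(y)), which equals 1 when neighbouring bases are independent. The sketch below is an illustration under that assumption; the function name is hypothetical and the slides do not specify which statistic was used in [127].

```python
from collections import Counter


def dinucleotide_odds(seq):
    """Return {xy: f(xy) / (f(x) * f(y))} for every dinucleotide observed in seq."""
    seq = seq.upper()
    # Single-base frequencies f(x).
    base_freq = {b: n / len(seq) for b, n in Counter(seq).items()}
    # Overlapping dinucleotide counts.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total_pairs = sum(pairs.values())
    return {
        xy: (n / total_pairs) / (base_freq[xy[0]] * base_freq[xy[1]])
        for xy, n in pairs.items()
    }
```

Dinucleotides that never occur are absent from the returned dictionary, so `dinucleotide_odds(seq).get("CG", 0.0)` gives the CG ratio with 0 as the fallback.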
Slide 7: 1.1 Basic statistical features of genomes
4) Nearest neighbours II: Poly A
1) Long chains of A (mainly triplets) are found more frequently than expected under a random distribution hypothesis [129].
2) G, C, CC, GG are found more frequently than expected under a random distribution hypothesis.
3) Long clusters of C and G are found less frequently than expected under a random distribution hypothesis.
4) G-C pairs have more hydrogen bonds than A-T pairs and are therefore more strongly bound. AAAA clustering thus facilitates the unzipping of DNA and expedites processes such as replication, transcription and/or translation.
[129] R. Nussinov, Strong adenine clustering in nucleotide sequences, J. Theor. Biol., 85, 285 (1980)
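"More frequently than expected under a random distribution hypothesis" can be made concrete by comparing the observed count of a word with its expected count when bases are drawn independently with the composition of the sequence. This is a minimal sketch under that assumption; the function name is hypothetical and [129] may use a different null model.

```python
from collections import Counter


def observed_and_expected(seq, word):
    """Overlapping occurrences of `word` in seq, and the expected count
    for independent bases with the same composition as seq."""
    n, w = len(seq), len(word)
    positions = max(n - w + 1, 0)  # number of possible start positions
    observed = sum(seq.startswith(word, i) for i in range(positions))
    freq = {b: c / n for b, c in Counter(seq).items()}
    p_word = 1.0
    for base in word:
        p_word *= freq.get(base, 0.0)
    return observed, positions * p_word
```

For example, `observed_and_expected(genome, "AAA")` returns the pair to compare for A triplets, where `genome` stands for any sequence string.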
Slide 8: 1.1 Basic statistical features of genomes
5) Nearest neighbours III: transcription initiation sites
In DNA sequences, characteristic patterns occur near transcription initiation sites [130]:
1) In eukaryotes, transcription factors recognize and bind to certain promoter regions, such as the CCAAT and TATA boxes, which are located upstream.
2) There are also TAT/ATA triplets and ATAT quartets occurring 275 bps upstream. They are twice as frequent as CAAT.
[130] R. Nussinov, Compilation of eukaryotic sequences around transcription initiation sites, J. Theor. Biol., 120, 479 (1986)
Slide 9: 1.1 Basic statistical features of genomes
6) GC skew
[3] J.R. Lobry, Asymmetric substitution patterns in the two DNA strands of bacteria, Mol. Biol. Evol., 13, 660 (1996)
Slide 10: 1.2 Basic probabilistic models for genomes
Words in Biology. The naive idea is the following: a word may be significantly rare in a DNA sequence because it disrupts replication or gene expression, whereas a significantly frequent word may have a fundamental activity with regard to genome stability.
Basic Ingredients
Alphabet: A = {a1, a2, ..., am}
m = size of the alphabet
m = 4 for genomic sequences, A = {a, c, g, t}
Probability Distributions
Slide 11: 1.2 Basic probabilistic models for genomes
Markov Model of order k = 0 - M0 model
Such a model is completely specified if we give the 1-point probabilities $\mu(a)$, $a \in A$.
m = 4 for genomic sequences, A = {a, c, g, t}. The k = 0 hypothesis corresponds to the hypothesis of uniformity of base composition.
Markov Model of order k = 1 - M1 model
Such a model is completely specified if we give the 1-point (marginal) probabilities $\mu(a)$ and the 2-point (joint) probabilities $\mu(a, b)$.
Slide 12: 1.2 Basic probabilistic models for genomes
Markov Model of order k (modulo 3) - Mk-3 model
A coding DNA sequence is naturally read as successive non-overlapping 3-letter words (codons). Letters may have different importance depending on their position with respect to the codon partition. To distinguish the letter probabilities according to their position modulo 3 in the coding DNA sequence, one considers a stationary Markov chain with three distinct transition matrices $\Pi_1, \Pi_2, \Pi_3$. Details can be found in Reference [2].
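Slide 13: 1.2 Basic probabilistic models for genomes
Markov Model of order k = 1 - Estimation of parameters
Assume $(X_1, \dots, X_N)$ is a Markov chain on the alphabet A with transition matrix $\Pi = (\pi(a, b))$ and stationary distribution (marginals) $\mu(a)$.
The likelihood is
$$L(\Pi) = \mu(X_1) \prod_{a, b \in A} \pi(a, b)^{N(ab)},$$
where $N(ab)$ is the number of occurrences of the 2-letter word $ab$.
As a result, the best estimate for $\pi(a, b)$, which maximizes $\log L$, is given by
$$\hat{\pi}(a, b) = \frac{N(ab)}{N(a\,\cdot)}, \qquad N(a\,\cdot) = \sum_{b \in A} N(ab).$$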
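The estimator amounts to counting 2-letter words and normalizing each row of counts. The following is a minimal sketch; the function names are illustrative, and the log-likelihood deliberately leaves out the initial-state term, which does not depend on the transition matrix.

```python
from collections import Counter, defaultdict
from math import log


def estimate_transitions(seq):
    """Maximum-likelihood estimate pi(a, b) = N(ab) / N(a.) from 2-letter word counts."""
    pair_counts = Counter(zip(seq, seq[1:]))   # N(ab)
    row_totals = Counter()                     # N(a.)
    for (a, _), n in pair_counts.items():
        row_totals[a] += n
    pi = defaultdict(dict)
    for (a, b), n in pair_counts.items():
        pi[a][b] = n / row_totals[a]
    return dict(pi)


def log_likelihood(seq, pi):
    """Sum over words ab of N(ab) * log pi(a, b); the initial-state term is ignored."""
    pair_counts = Counter(zip(seq, seq[1:]))
    return sum(n * log(pi[a][b]) for (a, b), n in pair_counts.items())
```

Transitions that never occur in the sequence are simply missing from the nested dictionary, which corresponds to an estimated probability of zero.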
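Slide 14: 1.2 Basic probabilistic models for genomes
Markov Model of order k - Estimation of parameters
Assume $(X_1, \dots, X_N)$ is a Markov chain on the alphabet A with transition matrix $\Pi = (\pi(a_1 \dots a_k, b))$ and stationary distribution (marginals) $\mu(a)$.
The likelihood is
$$L(\Pi) = \mu(X_1 \dots X_k) \prod_{a_1, \dots, a_k, b \in A} \pi(a_1 \dots a_k, b)^{N(a_1 \dots a_k b)},$$
where $N(a_1 \dots a_k b)$ is the number of occurrences of the word $a_1 \dots a_k b$ with k+1 letters.
As a result, the best estimate for $\pi(a_1 \dots a_k, b)$, which maximizes $\log L$, is given by
$$\hat{\pi}(a_1 \dots a_k, b) = \frac{N(a_1 \dots a_k b)}{N(a_1 \dots a_k\,\cdot)}, \qquad N(a_1 \dots a_k\,\cdot) = \sum_{b \in A} N(a_1 \dots a_k b).$$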
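For general order k the same counting applies to words of k+1 letters, with the first k letters acting as the context. A minimal sketch follows (the function name is an assumption); note that a full model over a 4-letter alphabet has 4^k contexts with 3 free probabilities each, so long sequences are needed for large k.

```python
from collections import Counter, defaultdict


def estimate_order_k(seq, k):
    """Maximum-likelihood estimate of pi(a1...ak, b) = N(a1...ak b) / N(a1...ak .)
    for a Markov chain of order k >= 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Counts of all overlapping (k+1)-letter words.
    word_counts = Counter(seq[i:i + k + 1] for i in range(len(seq) - k))
    # Total count per k-letter context.
    context_totals = Counter()
    for word, n in word_counts.items():
        context_totals[word[:k]] += n
    pi = defaultdict(dict)
    for word, n in word_counts.items():
        pi[word[:k]][word[k]] = n / context_totals[word[:k]]
    return dict(pi)
```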
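Slide 15: 1.2 Basic probabilistic models for genomes
Markov Model of order k = 0 - Test for the order
The most straightforward test is a $\chi^2$-test under the null hypothesis of independence, i.e.
$$H_0:\ \mu(a, b) = \mu(a, \cdot)\,\mu(\cdot, b),$$
where $\mu(a, b)$ is the 2-point (joint) probability. Under $H_0$, the best estimate for $\mu(a, b)$ is
$$\hat{\mu}_0(a, b) = \frac{N(a\,\cdot)\,N(\cdot\,b)}{N^2},$$
where N is the length of the considered sequence. This estimate must be compared with
$$\hat{\mu}(a, b) = \frac{N(ab)}{N}.$$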
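Slide 16: 1.2 Basic probabilistic models for genomes
Markov Model of order k = 0 - Test for the order
The $\chi^2$-test becomes
$$\chi^2 = \sum_{a, b \in A} \frac{\left(N(ab) - N(a\,\cdot)\,N(\cdot\,b)/N\right)^2}{N(a\,\cdot)\,N(\cdot\,b)/N}.$$
Typical significance level: 5%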
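A minimal sketch of the independence test: it builds the table of 2-letter word counts and computes the Pearson statistic against the product of its margins. As an assumption made here, the total is taken as the number of 2-letter words (N - 1) rather than N, and the 5% critical value 16.919 is the one for (4 - 1)^2 = 9 degrees of freedom, which applies only when all four bases occur. The names are illustrative.

```python
from collections import Counter

# 95th percentile of the chi-square distribution with 9 degrees of freedom.
CRITICAL_5PCT_DF9 = 16.919


def chi2_independence(seq):
    """Pearson chi-square statistic comparing N(ab) with N(a.) * N(.b) / n."""
    pairs = Counter(zip(seq, seq[1:]))         # N(ab)
    n = sum(pairs.values())                    # number of 2-letter words
    first, second = Counter(), Counter()       # N(a.) and N(.b)
    for (a, b), count in pairs.items():
        first[a] += count
        second[b] += count
    stat = 0.0
    for a in first:
        for b in second:
            expected = first[a] * second[b] / n
            stat += (pairs[(a, b)] - expected) ** 2 / expected
    return stat
```

The hypothesis of independence (order 0) is rejected at the 5% level when `chi2_independence(seq) > CRITICAL_5PCT_DF9`.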
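Slide 17: 1.2 Basic probabilistic models for genomes
Markov Model of order k = 1 - Test for the order
The most straightforward test is a $\chi^2$-test under the null hypothesis
$$H_0:\ \mu(ab, c) = \mu(a, b)\,\pi(b, c),$$
where $\mu(ab, c)$ is the 3-point (joint) probability. Under $H_0$, the best estimate for $\mu(ab, c)$ is
$$\hat{\mu}_0(ab, c) = \frac{N(ab)\,N(bc)}{N\,N(b)},$$
where N is the length of the considered sequence. This estimate must be compared with
$$\hat{\mu}(ab, c) = \frac{N(abc)}{N}.$$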
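Slide 18: 1.2 Basic probabilistic models for genomes
Markov Model of order k = 1 - Test for the order
The $\chi^2$-test becomes
$$\chi^2 = \sum_{a, b, c \in A} \frac{\left(N(abc) - N(ab)\,N(bc)/N(b)\right)^2}{N(ab)\,N(bc)/N(b)}.$$
Typical significance level: 5%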
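The order-1 test can be sketched in the same way from 3-letter word counts: under the null hypothesis the expected count of abc is N(ab) N(bc) / N(b). In this illustration all margins are taken from the 3-letter word table so that they are mutually consistent, which is an assumption made here; the function name is hypothetical. For a 4-letter alphabet with all words possible, the statistic has 4 x (4 - 1)^2 = 36 degrees of freedom.

```python
from collections import Counter


def chi2_order1_vs_order2(seq):
    """Pearson chi-square statistic comparing N(abc) with N(ab.) * N(.bc) / N(.b.)."""
    triples = Counter(zip(seq, seq[1:], seq[2:]))      # N(abc)
    left, right, middle = Counter(), Counter(), Counter()
    for (a, b, c), n in triples.items():
        left[(a, b)] += n     # N(ab.)
        right[(b, c)] += n    # N(.bc)
        middle[b] += n        # N(.b.)
    stat = 0.0
    for (a, b), n_ab in left.items():
        for (b2, c), n_bc in right.items():
            if b2 != b:
                continue      # the two 2-letter words must share the middle letter
            expected = n_ab * n_bc / middle[b]
            stat += (triples[(a, b, c)] - expected) ** 2 / expected
    return stat
```

A sequence in which each letter is fully determined by its predecessor gives a statistic of zero, while a sequence such as `"AACC" * 25`, where the next letter depends on the two previous ones, gives a positive value.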
Slide 19: 1.2 Basic probabilistic models for genomes
Example: Bacteriophage λ [1]
Results
Order k = 1 dependence in regions:
- Silent
- Early 1
- Early 2
- Control
Order k = 2 dependence in region:
- Late
Coding Regions
The low order of dependence may be due to inhomogeneity in the sequences.
Example: φX174 confirms these results.
Slide 20: 1.2 Basic probabilistic models for genomes
Example: Bacteriophage λ [1]
The low order of dependence may be due to inhomogeneity in the sequences.
The Lysis region shows an order k = 0. This is probably due to different patterns of codon usage among the genes.
Slide 21: Master in Bioinformatics - Probabilistic Models
2.1 Where do we get data from?
[Diagram labels: probabilistic model; data; describe, predict, interpret; probability theory and statistics; empirical and phenomenological analysis; "Explain!?"]
Slide 22: 2.1 Where do we get data from? GenBank
http://www.ncbi.nlm.nih.gov
GenBank
Complete genomes
Slides 23-24: 2.1 Where do we get data from? NC_001807.gbk
Slide 25: 2.2 Viral Genomes
There are 746 completely sequenced viral genomes.
They are classified as:
- satellite: 40
- retroid: 76
- ss RNA (+): 272
- ss RNA (-): 33
- ds RNA: 20
- ss DNA: 99
- ds DNA: 198
- others: 8
Genome lengths range from a minimum of 220 bp to a maximum of 335593 bp, for a total of about 18 million bp.
Slide 26: 2.2 Viral Genomes statistics - GC versus BP
Slide 27: 2.2 Viral Genomes statistics - GC distribution
Slide 28: 2.3 Inverted Repeats in Viral Genomes
5'... AGACCCCCACTGCTAAATATAGTGGGTGGGTG ...3'
(stem - loop - stem)
A single-stranded RNA (or a double-stranded DNA) can contain a nucleotide sequence, called an inverted repeat (IR), that allows the formation of a hairpin structure (in RNA) or of a cruciform structure (in DNA).
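An inverted repeat with a given stem and loop length can be located by checking, at every position, whether the stretch after the loop is the reverse complement of the stem. This is a minimal sketch assuming strict Watson-Crick pairing over the DNA alphabet and no mismatches; the function names are illustrative, and the example string is the sequence shown on the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq, stem, loop):
    """Start positions i (0-based) where seq[i:i+stem] can pair with the
    stretch of the same length located `loop` bases downstream."""
    hits = []
    span = 2 * stem + loop
    for i in range(len(seq) - span + 1):
        left = seq[i:i + stem]
        right = seq[i + stem + loop:i + span]
        if right == reverse_complement(left):
            hits.append(i)
    return hits


example = "AGACCCCCACTGCTAAATATAGTGGGTGGGTG"
```

With `stem=6` and `loop=9` the search on `example` reports the stem CCCACT pairing with AGTGGG around the loop GCTAAATAT. The function does not check whether a stem could be extended, so a longer stem is also reported at shorter stem lengths.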
Slide 29: 2.3 Inverted Repeats in Viral Genomes
Rho-independent termination - IRs may form hairpin structures in mRNA and play the role of Rho-independent transcription terminators (bacteria).
Regulatory function - Depending on the tryptophan concentration, IR formation may be enhanced or prevented. This results in complete or incomplete transcription of mRNA in E. coli.
Virus-Induced Gene Silencing - Technology that exploits RNA-mediated antiviral defense mechanisms. In plants infected with modified viruses, such a mechanism can be targeted against specific genes of the plant itself.
Slide 30: 2.3 Inverted Repeats in Viral Genomes statistics
Slide 31: 2.3 Inverted Repeats in Viral Genomes statistics
0.40 < GC < 0.45
0.35 < GC < 0.40
In most cases IRs are more abundant than expected under the null hypothesis. The number of IRs depends on m!
Slide 32: 2.3 Inverted Repeats in Viral Genomes - χ² test I
We are interested in understanding whether or not IRs are randomly distributed within a (viral) genome.
A null hypothesis for the expected number n_ex of IRs with stem length l and loop length m can be formulated by assuming a Bernoullian DNA, i.e. independent bases with fixed probabilities.
The starting point!
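The formula used on the slides for n_ex is not preserved in this text. As a hedged illustration of what a Bernoullian null model implies, the sketch below estimates the expected number of positions carrying a stem of at least l paired bases: two independent bases are complementary with probability 2 (p_A p_T + p_C p_G), and a stem needs l such pairs. The function name and the choice of counting stems of at least length l are assumptions made here.

```python
from collections import Counter


def expected_inverted_repeats(seq, stem, loop):
    """Expected number of start positions with a perfect stem of `stem` pairs
    separated by `loop` bases, for independent bases with the composition of seq."""
    n = len(seq)
    freq = {b: c / n for b, c in Counter(seq.upper()).items()}

    def p(base):
        return freq.get(base, 0.0)

    # Probability that two independently drawn bases are Watson-Crick complementary.
    p_pair = 2 * (p("A") * p("T") + p("C") * p("G"))
    # Number of start positions where the whole stem-loop-stem window fits.
    positions = max(n - (2 * stem + loop) + 1, 0)
    return positions * p_pair ** stem
```

In this simple model the loop length enters only through the number of available windows, so any stronger dependence on m observed in real genomes is a departure from it.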
Slide 33: 2.3 Inverted Repeats in Viral Genomes - χ² test I
Slide 34: 2.3 Inverted Repeats in Viral Genomes - χ² test I
We perform a χ² test to evaluate whether the number of detected IRs is compatible with the null hypothesis of Bernoullian DNA.
p-values are less than 0.05 in 57% of genomes!
Slide 35: 2.3 Inverted Repeats in Viral Genomes - χ² test II
We are interested in understanding whether or not IRs are uniformly distributed between the coding and non-coding regions of a (viral) genome.
The null hypothesis for the expected number n_ex of IRs with stem length l and loop length m can be formulated by assuming that the number of IRs in CR and in NCR is proportional to the fractions of the genome occupied by CR and NCR.
The starting point!
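Under this null hypothesis the expected IR counts in CR and NCR are the total count split in proportion to the two region lengths, and the comparison is a χ² test with one degree of freedom (5% critical value 3.841). The sketch below is an illustration only; the function name is hypothetical and the numbers in the usage note are invented for the example.

```python
# 95th percentile of the chi-square distribution with 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841


def chi2_cr_vs_ncr(n_cr, n_ncr, len_cr, len_ncr):
    """Chi-square statistic for IR counts in coding (CR) and non-coding (NCR)
    regions against counts proportional to the region lengths."""
    total = n_cr + n_ncr
    frac_cr = len_cr / (len_cr + len_ncr)
    expected_cr = total * frac_cr
    expected_ncr = total * (1 - frac_cr)
    return ((n_cr - expected_cr) ** 2 / expected_cr
            + (n_ncr - expected_ncr) ** 2 / expected_ncr)
```

For instance, 60 IRs in CR and 40 in NCR for two regions of equal length give a statistic of 4.0, just above the 5% threshold.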
Slide 36: 2.3 Inverted Repeats in Viral Genomes - χ² test II
Slide 37: 2.3 Inverted Repeats in Viral Genomes - χ² test II
IRs with longer stems seem to lie farther from the unit line.
Look at the error bars!
Slide 38: 2.3 Inverted Repeats in Viral Genomes - χ² test II
Look at the error bars!
Slide 39: 2.3 Inverted Repeats in Viral Genomes - χ² test II
Slide 40: 2.3 Inverted Repeats in Viral Genomes - χ² test II
IRs with longer stems seem to lie farther from the unit line. The picture above is misleading if one does not consider the error bars!