Title: A mathematical model of the genetic code: structure and applications
1 A mathematical model of the genetic code: structure and applications
- Antonino Sciarrino
- Università di Napoli Federico II; INFN, Sezione di Napoli
- TAG 2006, Annecy-le-Vieux, 9 November 2006
2 Mathematical Model of the Genetic Code
Work in collaboration with Luc FRAPPAT, Paul SORBA, Diego COCURULLO
3 SUMMARY
- Introduction
- Description of the model
- Applications: codon usage frequencies; DNA dimers free energy
- Work in progress
4 It is amazing that the complex biochemical relations between DNA and proteins were very quickly reduced to a mathematical model. Just a few months after the Watson-Crick discovery, G. Gamow proposed the diamond code.
5 Gamow diamond code
Gamow, Nature (1954)
- Nucleotides are denoted by the numbers 1, 2, 3, 4
- Amino acids fit the rhomb-shaped holes formed by the 4 nucleotides
→ 20 a.a.!
6 Since 1954 many mathematical models of the genetic code have been proposed (based on information, thermodynamic, symmetry or topology arguments).
Weak point of the models: often poor explanatory and/or predictive power.
7 The genetic code
8 Crystal basis model of the genetic code
L. Frappat, A. Sciarrino, P. Sorba, Phys. Lett. A (1998)
- The 4 bases C, U/T (pyrimidines) and G, A (purines) are identified by a couple of "spin" labels ($+ \equiv 1/2$, $- \equiv -1/2$).
- Mathematically: C, U/T, G, A transform as the 4 basis vectors of the irrep. $(1/2, 1/2)$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$.
9 Crystal basis model of the genetic code
- Dinucleotides are composite states (i.e. the 16 basis vectors of $(1/2, 1/2)^{\otimes 2}$) belonging to sets identified by two integer numbers $J_H$, $J_V$.
- In each set the dinucleotide is identified by two labels: $-J_H \le J_{H,3} \le J_H$ and $-J_V \le J_{V,3} \le J_V$.
- Ex. CU = $(+,+) \otimes (-,+)$, with labels ($J_H = 0$, $J_{H,3} = 0$; $J_V = 1$, $J_{V,3} = 1$)
- → Follows from the properties of $U_{q \to 0}(sl(2))$
10 DINUCLEOTIDE
Representation Content
11 Crystal basis model of the genetic code
- Codons are composite states (i.e. the 64 basis vectors of $(1/2, 1/2)^{\otimes 3}$) belonging to sets identified by half-integer $J_H$, $J_V$ (set $\equiv$ irreducible representation, irrep.).
- Ex. CUA = $(+,+) \otimes (-,+) \otimes (-,-)$, with labels ($J_H = 1/2$, $J_{H,3} = -1/2$; $J_V = 1/2$, $J_{V,3} = 1/2$)
- → Follows from the properties of $U_{q \to 0}(sl(2))$
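The labelling used in the dinucleotide and codon examples can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code. It assumes the assignment C = (+,+), U/T = (-,+), A = (-,-) read off the examples, with G = (+,-) as the remaining combination. It also assumes the usual bracketing rule for tensor products of spin-1/2 crystals: a "+" followed later by an uncancelled "-" forms a singlet and drops out. The names `SPIN`, `sl2_labels` and `crystal_labels` are illustrative.

```python
from fractions import Fraction

# Spin labels (H, V) per nucleotide: +1 stands for +1/2, -1 for -1/2.
# C, U/T and A follow the slide examples; G = (+, -) is assumed as the
# remaining combination.
SPIN = {"C": (1, 1), "U": (-1, 1), "T": (-1, 1), "G": (1, -1), "A": (-1, -1)}


def sl2_labels(signs):
    """Return (J, J3) for a sequence of +1/-1 under the bracketing rule."""
    unpaired_plus = 0   # '+' not yet cancelled by a later '-'
    unpaired_minus = 0  # '-' with no free '+' on its left
    for s in signs:
        if s > 0:
            unpaired_plus += 1
        elif unpaired_plus > 0:
            unpaired_plus -= 1  # a '+ ... -' pair forms a singlet and drops out
        else:
            unpaired_minus += 1
    j = Fraction(unpaired_plus + unpaired_minus, 2)
    j3 = Fraction(unpaired_plus - unpaired_minus, 2)
    return j, j3


def crystal_labels(seq):
    """Return (J_H, J_H3, J_V, J_V3) for a nucleotide string of any length."""
    h = [SPIN[n][0] for n in seq.upper()]
    v = [SPIN[n][1] for n in seq.upper()]
    return sl2_labels(h) + sl2_labels(v)


if __name__ == "__main__":
    for seq in ("CU", "CA", "CCC", "CUA"):
        print(seq, [str(x) for x in crystal_labels(seq)])
```

Under these assumptions `crystal_labels("CU")` evaluates to (0, 0, 1, 1), i.e. integer labels for a dinucleotide, and the 16 dinucleotides split into sets of 9, 3, 3 and 1 elements.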
12 Codons in the crystal basis
13 Codon usage frequency
- Synonymous codons are not used uniformly (codon bias)
- Codon bias (not fully understood) is ascribed to evolutionary-selective effects
- Codon bias depends on:
  - biological species (b.sp.)
  - sequence analysed
  - amino acid (a.a.) encoded
  - structure of the considered multiplet
  - nature of the codon XYZ
  - ...
14 Codon usage in Homo sap.
15 Our analysis deals with global codon usage, i.e. computed over all the coding sequences (exonic region) for the b.sp. of the considered specimen
→ The aim is to bring out possible general features of the standard eukaryotic genetic code ascribable to its organisation and its evolution
16 Let us define the codon usage probability for the codon XZN (X, Z, N ∈ {A, C, G, U}; U is T in DNA):
$$P(XZN) = \lim_{n \to \infty} \frac{n_{XZN}}{N_{tot}}$$
- $n_{XZN}$: number of times the codon XZN is used in the processes
- $N_{tot}$: total number of codons in the same processes
For fixed XZ, normalization: $\sum_N P(XZN) = 1$
Note: sextets are considered as quartets + doublets → 8 quartets
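A finite-sample estimate of this probability can be sketched as follows. The sketch assumes that counts are normalised within the quartet XZ, so that the four probabilities sum to 1 as required by the normalization condition. The function name and the toy codon list are hypothetical.

```python
from collections import Counter


def quartet_probabilities(codons, xz):
    """Estimate P(XZN), N in C, U, G, A, from a list of codons for fixed XZ."""
    counts = Counter(c for c in codons if c[:2] == xz)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codon starting with " + xz)
    # Counter returns 0 for codons that never occur
    return {n: counts[xz + n] / total for n in "CUGA"}


if __name__ == "__main__":
    toy = ["CCC", "CCU", "CCC", "CCA", "GGG"]  # hypothetical data
    print(quartet_probabilities(toy, "CC"))
```

With real data the list of codons would come from all coding sequences of one biological species.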
17 Def.: correlation coefficient $r_{XY}$ for two variables $X \equiv P(..X)$, $Y \equiv P(..Y)$
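The defining formula is not preserved in this transcript. The sketch below assumes the standard Pearson correlation coefficient, computed over the biological species of the specimen; `pearson_r` is an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return cov / math.sqrt(vx * vy)
```

Here `xs` would hold, for example, P(..A) for each species and `ys` the matching P(..C).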
18 Specimen (GenBank Release 149.0, 09/2005; $N_{codons} > 100{,}000$)
- 26 VERTEBRATES
- 28 INVERTEBRATES
- 38 PLANTS
- TOTAL: 92 biological species
19 Correlation coefficient VERTEBRATES
20 Correlation coefficient PLANTS
21 Correlation coefficient INVERTEBRATES
22 Averaged value of P(..N)
23 Averaged value of P(..N)
24 Averaged value of the sum of two correlated P(N)
25 Ratios of $\sigma^2_{obs}(X+Y)$ and $\sigma^2_{th}(X+Y) = \sigma^2_{obs}(X) + \sigma^2_{obs}(Y)$, averaged over the 8 a.a., for the sum of two codon probabilities
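Why this ratio is informative follows from the general variance identity, written here for reference. It is a standard result, not taken from the slide, and it reads the "theoretical" variance as the one expected for uncorrelated X and Y:

```latex
\sigma^2(X+Y) = \sigma^2(X) + \sigma^2(Y) + 2\, r_{XY}\, \sigma(X)\, \sigma(Y)
\quad\Longrightarrow\quad
\frac{\sigma^2_{obs}(X+Y)}{\sigma^2_{obs}(X) + \sigma^2_{obs}(Y)}
  = 1 + \frac{2\, r_{XY}\, \sigma(X)\, \sigma(Y)}{\sigma^2(X) + \sigma^2(Y)}
```

A ratio well below 1 therefore signals anti-correlated probabilities ($r_{XY} < 0$), while a ratio close to 1 signals uncorrelated ones.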
26 → Indication of a correlation between the codon usage probabilities P(A) and P(C) (and hence between P(U) and P(G)) for quartets.
27 Correlation between codon probabilities for different a.a.
- Correlation coefficients for the 28 couples P(XZN), P(X'Z'N), where XZ (X'Z') specify the 8 quartets. The following pattern emerges for the whole eukaryote specimen (n = 92)
28 The set of 8 quartets splits into 3 subsets
- 4 a.a. with correlated codon usage (Ser, Pro, Ala, Thr)
- 2 a.a. with correlated codon usage (Leu, Val)
- 2 a.a. with generally uncorrelated codon usage (Arg, Gly)
29 Statistical analysis →
- Correlation for P(XZA)-P(XZC), XZ ∈ quartets
- Correlation for P(N) between Ser, Pro, Thr, Ala and between Leu, Val
The observed correlations fit well in the mathematical scheme of the crystal basis model of the genetic code
30 In the crystal basis model P(XYZ) can be written as a function of
31 ASSUMPTION
32 → SUM RULES
K independent of the b.sp.
XZ ∈ quartets
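The formula on this slide is not preserved. As a sketch only, assume the sum rule takes the form suggested by the A-C correlation reported earlier in the talk, with a constant $K_{XZ}$ that does not depend on the biological species. Its consequences then read:

```latex
P(XZA) + P(XZC) = K_{XZ}
\quad\Longrightarrow\quad
P(XZU) + P(XZG) = 1 - K_{XZ},
\qquad
\sigma^2\!\big(P(XZA) + P(XZC)\big) = 0
```

The second relation uses only the normalization $\sum_N P(XZN) = 1$. In this idealised case P(XZA) and P(XZC) are perfectly anti-correlated across species, and so are P(XZU) and P(XZG).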
33 SUM RULES → Theoretical correlation matrix
XZ = NC, CG, GG, CU, GU
34 Observed averaged value of the correlation matrix; the theoretical value is in red
36 Shannon Entropy
Let us define the Shannon entropy for the amino acid specified by the first two nucleotides XZ (8 quartets)
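The defining formula is not preserved in this transcript. The sketch below assumes the usual form $S_{XZ} = -\sum_N P(XZN) \log_2 P(XZN)$; the base of the logarithm is an assumption and can be changed through the `base` parameter.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Shannon entropy of a normalised probability list (0 log 0 taken as 0)."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)


if __name__ == "__main__":
    # hypothetical quartet usage, order C, U, G, A
    print(shannon_entropy([0.4, 0.3, 0.2, 0.1]))
```

For a quartet the value ranges from 0 (a single codon used) to 2 bits (uniform usage).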
37 Shannon Entropy
Using the previous expression for P(XZN) we get
$\rho_N \equiv \rho(XZN)$, $H_{bs,N} \equiv H_{bs}(XZN)$, $P_N \equiv P(XZN)$
→ $S_{XZ}$ largely independent of the b.sp.
38 Shannon Entropy
39 DNA dinucleotide free energy
Free energy for a pair of nucleotides, e.g. GC, lying on one strand of DNA, coupled with the complementary pair, CG, on the other strand. CG read from 5' → 3' is correlated with GC read from 3' → 5'
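The pairing of a dimer with its partner on the other strand can be sketched as follows. Read 3' → 5', the partner is the plain complement; read 5' → 3', it is the reverse complement. Grouping each dimer with its reverse complement gives the 10 distinct stacked pairs for which free energies are usually tabulated. The function names are illustrative and no free-energy values are included.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Partner strand read 3' -> 5' (same direction as the input)."""
    return "".join(COMPLEMENT[b] for b in seq)


def reverse_complement(seq):
    """Partner strand read 5' -> 3'."""
    return complement(seq)[::-1]


def dimer_classes():
    """Group the 16 dimers into classes {dimer, reverse complement}."""
    classes = set()
    for a in "ACGT":
        for b in "ACGT":
            d = a + b
            classes.add(tuple(sorted({d, reverse_complement(d)})))
    return sorted(classes)


if __name__ == "__main__":
    for cls in dimer_classes():
        print("/".join(cls))
```

Four dimers (AT, TA, CG, GC) are their own reverse complement, which is why 16 dimers reduce to 10 classes.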
40 DINUCLEOTIDE
Representation Content
42 SUM RULES for FREE ENERGY
43 Comparison with exp. data
$\Delta G$ in kcal/mol
44 DINUCLEOTIDE Distribution
46 Comparison with experimental data
47 Work in progress and future perspectives
From the correspondence {C, U/T, G, A} → I.R. $(1/2, 1/2)$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$
⇒
Any ordered N-nucleotide sequence → vector of an I.R. $\subset (1/2, 1/2)^{\otimes N}$ of $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$
⇒
New parametrization of nucleotide sequences
48 Spin parametrisation
49 Algorithm for the spin parametrisation of an ordered n-nucleotide sequence
50 From this parametrisation
- Alternative construction of a mutation model, where the mutation intensity does not depend on the Hamming distance between the sequences, but on the change of the labels of the sets.
  C. Minichini, A.S., Biosystems (2006)
- Characterization of particular sequences (exons, introns, promoters, 5' or 3' UTR sequences, ...)
  L. Frappat, P. Sorba, A.S., L. Vuillon, in progress
51 For each gene of Homo sap. (total 28,000 genes)
- Consider the N-nucleotide coding sequence (CDS)
- Compute the labels $J_H$, $J_{3H}$, $J_V$, $J_{3V}$ for any n-nucleotide subsequence ($1 \le n \le N$)
- → Plot the labels versus n
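The steps above can be sketched as a single left-to-right pass over the CDS. The sketch rests on three assumptions: the "n-nucleotide subsequence" is the first n nucleotides; the spin assignment is C = (+,+), U/T = (-,+), G = (+,-), A = (-,-); and labels follow the bracketing rule for spin-1/2 crystals, where a "+" followed later by an uncancelled "-" drops out as a singlet. `prefix_labels` is an illustrative name and plotting is left out.

```python
from fractions import Fraction

# (H, V) spin signs per nucleotide; +1 stands for +1/2, -1 for -1/2 (assumed)
SPIN = {"C": (1, 1), "U": (-1, 1), "T": (-1, 1), "G": (1, -1), "A": (-1, -1)}


def prefix_labels(cds):
    """Return rows (n, J_H, J_3H, J_V, J_3V) for every prefix of length n."""
    plus = [0, 0]   # uncancelled '+' in the H and V words
    minus = [0, 0]  # uncancelled '-' in the H and V words
    rows = []
    for n, nt in enumerate(cds.upper(), start=1):
        for k, s in enumerate(SPIN[nt]):
            if s > 0:
                plus[k] += 1
            elif plus[k] > 0:
                plus[k] -= 1  # '+ ... -' pair cancels
            else:
                minus[k] += 1
        j = [Fraction(plus[k] + minus[k], 2) for k in (0, 1)]
        j3 = [Fraction(plus[k] - minus[k], 2) for k in (0, 1)]
        rows.append((n, j[0], j3[0], j[1], j3[1]))
    return rows


if __name__ == "__main__":
    for row in prefix_labels("AUGGCC"):  # toy sequence, not a real gene
        print(*row)
```

Because the counters are updated incrementally, the cost is linear in N, so all prefixes of a long CDS can be labelled in one pass before plotting.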
52 Red: $J_H$; Green: $J_{3H}$; Blue: $J_V$; Black: $J_{3V}$
53 Red: $J_H$; Green: $J_{3H}$; Blue: $J_V$; Black: $J_{3V}$
54 Red: $J_H$; Green: $J_{3H}$; Blue: $J_V$; Black: $J_{3V}$
55 Red: $J_H$; Green: $J_{3H}$; Blue: $J_V$; Black: $J_{3V}$
56 Numerical estimator
- Define for any sequence of length N
- Plot the number of CDS with the same value of Diff (Sum) versus Diff (Sum)
- Compute Diff (Sum) for 28,000 random sequences ($300 < N < 4300$) with uniform probability for each nucleotide
- Comparison: number of CDS versus random sequences
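The random control set can be sketched as below. The transcript does not say how the lengths are distributed, so drawing them uniformly from the integers between 301 and 4299 is an assumption. The definitions of Diff and Sum are not preserved, so they are not computed here. The function name and the fixed seed are illustrative.

```python
import random


def random_sequences(count, min_len=301, max_len=4299, seed=0):
    """Generate `count` sequences with uniform nucleotide probabilities."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    seqs = []
    for _ in range(count):
        n = rng.randint(min_len, max_len)  # both bounds inclusive
        seqs.append("".join(rng.choices("ACGT", k=n)))
    return seqs


if __name__ == "__main__":
    control = random_sequences(28000)
    print(len(control), min(map(len, control)), max(map(len, control)))
```

Any estimator defined on a sequence can then be evaluated on `control` and on the real CDS set, and the two histograms compared.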
59 Conclusions
- Correlations in codon usage frequencies computed over the whole exonic region fit well in the mathematical scheme of the crystal basis model of the genetic code. An explanation for the correlations is still missing.
- The formalism of the crystal basis model is useful to parametrize the free energy of DNA dimers.
- More generally, the $U_{q \to 0}(sl(2)_H \oplus sl(2)_V)$ mathematical structure may be useful to describe sequences of nucleotides.