1. DNA CODES BASED ON HAMMING STEM SIMILARITIES
A.G. Dyachkov, A.N. Voronina
Dept. of Probability Theory, MechMath., Moscow State University, Russia
2. OUTLINE
- DNA background
- Modeling the hybridization energy
- DNA codes
- Example of code construction
- Bounds on the rate of DNA codes
- On sphere sizes
- Further generalizations
- Bibliography
3. DNA STRANDS
Single DNA strand
- DNA strands consist of nucleotides, each composed of a sugar-phosphate backbone unit and one base
- There are 4 types of bases: adenine (A), guanine (G), cytosine (C), thymine (T)
- Base A is said to be complementary to T, and C to G
- DNA strands are oriented (from the 5' end to the 3' end). Thus, for example, strand AATG is different from strand GTAA
- Two oppositely directed strands containing complementary bases at corresponding positions are called reverse-complement strands
[Figure: a single DNA strand with labeled bases, nucleotide, sugar-phosphate backbone, 5' end and 3' end, and an example of two reverse-complement strands with different directions]
4. HYBRIDIZATION
- Two oppositely directed DNA strands are capable of coalescing into a duplex, or double helix
- The process of forming a duplex is referred to as hybridization
- The basis of this process is the formation of hydrogen bonds between complementary bases
- A duplex formed of reverse-complement strands is called a Watson-Crick duplex
[Figure: example of a Watson-Crick duplex]
5. CROSS-HYBRIDIZATION AND ENERGY OF HYBRIDIZATION
- However, hybridization is not a perfect process, and non-complementary strands can also hybridize
[Figure: an example of cross-hybridization; two marked base pairs are not complementary]
- The indicator of the strength, or stability, of a formed duplex is its energy of hybridization. Its value depends on the total number of bonds formed
- Thus, the greatest hybridization energy is obtained when a Watson-Crick duplex is formed rather than in the case of cross-hybridization
6. LONE BONDS AND PAIRWISE METRIC
- If a pair of bases is bonded but neither of its neighboring pairs forms a bond, then it is called a lone bond
[Figure: a duplex with hybridization energy 3. A lone bond does not contribute to the hybridization energy; a pair of adjacent bonds adds 1 to the total hybridization energy; a triplet is counted as 2 adjacent pairs]
- A lone bond is too weak to form a strong connection, so it does not contribute much to the total energy of hybridization
- Moreover, the energy of hybridization depends not on the number of bonds formed, but on the number of pairs of adjacent bonds
- Thus, if we suppose that the hybridization energy is equal to the number of such pairs, then in the example above it is equal to 3, not 5 or 6
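The counting rule on this slide can be sketched in a few lines of Python. The function name `hybridization_energy` and the 0/1 bond pattern are illustrative assumptions, not taken from the slides; the pattern is a hypothetical duplex with a triplet of bonds, a pair of bonds, and one lone bond, chosen to match the counts quoted above.

```python
def hybridization_energy(bonds):
    """Count pairs of adjacent bonded positions in a 0/1 bond pattern."""
    return sum(1 for left, right in zip(bonds, bonds[1:]) if left and right)


# Hypothetical duplex: a triplet of bonds, a pair of bonds, and a lone bond.
pattern = [1, 1, 1, 0, 1, 1, 0, 1]

print(sum(pattern))                   # total number of bonds: 6
print(hybridization_energy(pattern))  # pairs of adjacent bonds: 3
```

The triplet contributes 2, the pair contributes 1, and the lone bond contributes nothing, so the energy under this model is 3, although 6 bonds are formed (5 of them not lone).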
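8. NOTATIONS
- General notations
  - Let $q = 2, 4, \dots$ be an arbitrary even integer
  - Denote by $A = \{0, 1, \dots, q-1\}$ the standard alphabet of size $q$
  - Denote by $\lfloor u \rfloor$ ($\lceil u \rceil$) the largest (smallest) integer $\le u$ ($\ge u$)
- Reverse-complementation
  - For any letter $a \in A$, define $\bar{a} = (q-1) - a$, the complement of the letter $a$
  - For any q-ary sequence $x = (x_1, x_2, \dots, x_n) \in A^n$, define its reverse complement $\tilde{x} = (\bar{x}_n, \bar{x}_{n-1}, \dots, \bar{x}_1)$
  - Note that if $y = \tilde{x}$, then $x = \tilde{y}$ for any $x \in A^n$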
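A minimal sketch of reverse-complementation over the alphabet {0, 1, ..., q-1}. It assumes that the complement of a letter a is (q-1) - a, which is consistent with the q = 4 example table later in the deck; the function names and the sample sequence are illustrative only.

```python
def complement(letter, q):
    """Assumed complement rule: a -> (q - 1) - a."""
    return (q - 1) - letter


def reverse_complement(seq, q):
    """Reverse the sequence and complement every letter."""
    return tuple(complement(a, q) for a in reversed(seq))


x = (0, 1, 1, 3)
y = reverse_complement(x, 4)
print(y)                              # (0, 2, 2, 3)
print(reverse_complement(y, 4) == x)  # True: the operation is an involution
```

With q = 4 and a labeling such as A = 0, C = 1, G = 2, T = 3 (again an assumption), this rule pairs A with T and C with G.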
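9. STEM HAMMING SIMILARITY
- For two q-ary sequences of length $n$, $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the stem Hamming similarity is equal to
  $S(x, y) = \sum_{i=1}^{n-1} s_i(x, y)$, where $s_i(x, y) = 1$ if $x_i = y_i$ and $x_{i+1} = y_{i+1}$, and $s_i(x, y) = 0$ otherwise
- $S(x, y)$ is equal to the total number of common 2-blocks containing adjacent symbols in the longest common Hamming subsequence of $x$ and $y$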
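The definition on this slide can be sketched as follows: count the positions i at which both the i-th and the (i+1)-th symbols of the two sequences coincide. The function name and the sample sequences are illustrative assumptions.

```python
def stem_similarity(x, y):
    """Number of positions i with x[i] == y[i] and x[i+1] == y[i+1]."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for i in range(len(x) - 1)
        if x[i] == y[i] and x[i + 1] == y[i + 1]
    )


x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)

# The sequences agree in positions 1, 2, 3, 5, 6 (1-based), so the common
# adjacent pairs are (1,2), (2,3) and (5,6).
print(stem_similarity(x, y))  # 3
```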
10. HAMMING VS. STEM HAMMING
- Hamming similarity is element-wise, while stem Hamming similarity is pair-wise (though still additive)
- Re-ordering the elements of the sequences does not influence Hamming similarity, but may change stem Hamming similarity
- Example: [...]
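As a small illustration of the re-ordering remark, the sketch below applies the same permutation of positions to both sequences. The sequences and the permutation are made up for this illustration; they are not the example from the original slide.

```python
def hamming_similarity(x, y):
    """Number of positions where the two sequences agree."""
    return sum(a == b for a, b in zip(x, y))


def stem_similarity(x, y):
    """Number of adjacent position pairs where the two sequences agree."""
    return sum(x[i] == y[i] and x[i + 1] == y[i + 1] for i in range(len(x) - 1))


def permute(seq, order):
    """Re-order the elements of seq according to a permutation of positions."""
    return tuple(seq[i] for i in order)


x, y = (0, 1, 0, 1), (0, 1, 1, 0)
order = (0, 2, 1, 3)  # swap the two middle positions in both sequences
px, py = permute(x, order), permute(y, order)

print(hamming_similarity(x, y), stem_similarity(x, y))      # 2 1
print(hamming_similarity(px, py), stem_similarity(px, py))  # 2 0
```

The two matching positions are adjacent before the permutation and separated after it, so the Hamming similarity stays at 2 while the stem Hamming similarity drops from 1 to 0.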
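11. STEM HAMMING DISTANCE
- Note that $0 \le S(x, y) \le S(x, x) = n - 1$, and $S(x, y) = S(x, x)$ if and only if $x = y$
- Stem Hamming distance between $x$ and $y$ is $D(x, y) = S(x, x) - S(x, y) = n - 1 - S(x, y)$
- Example
  - Let [...] and [...]
  - The longest common Hamming subsequence is [...]
  - Stem Hamming similarity is equal to [...]
  - Stem Hamming distance is equal to [...]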
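The sketch below computes the stem Hamming distance under the assumption that it is defined as D(x, y) = S(x, x) - S(x, y) = (n - 1) - S(x, y), and checks the metric axioms by brute force on all binary sequences of length 4. The sample sequences are illustrative and are not the ones from the original slide.

```python
from itertools import product


def stem_similarity(x, y):
    """Number of positions i with x[i] == y[i] and x[i+1] == y[i+1]."""
    return sum(x[i] == y[i] and x[i + 1] == y[i + 1] for i in range(len(x) - 1))


def stem_distance(x, y):
    """Assumed definition: D(x, y) = S(x, x) - S(x, y) = (n - 1) - S(x, y)."""
    return (len(x) - 1) - stem_similarity(x, y)


x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)
print(stem_similarity(x, y), stem_distance(x, y))  # 3 2

# Brute-force check of the metric axioms on all binary sequences of length 4.
seqs = list(product(range(2), repeat=4))
is_metric = all(
    (stem_distance(a, b) == 0) == (a == b)
    and stem_distance(a, b) == stem_distance(b, a)
    and all(
        stem_distance(a, c) <= stem_distance(a, b) + stem_distance(b, c)
        for c in seqs
    )
    for a in seqs
    for b in seqs
)
print(is_metric)  # True
```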
13. MOTIVATION
- The study of DNA codes was motivated by the needs of DNA computing and biomolecular nanotechnology
- In these applications, one must form a collection of DNA strands that will serve as markers, while the collection of DNA strands reverse-complement to them will be utilized for reading, or recognition
- Collection of mutually reverse-complement pairs
- No self-reverse complement words
- No cross-hybridization

Coding strands for ligation:
TACGCGACTTTC ATCAAACGATGC TGTGTGCTCGTC ATTTTTGCGTTA CACTAAATACAA GAAAAAGAAGAA
Probing complement strands for reading:
GAAAGTCGCGTA GCATCGTTTGAT GACGAGCACACA TAACGCAAAAAT TTGTATTTAGTG TTCTTCTTTTTC
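The first two requirements can be checked mechanically on the strands listed on this slide. The sketch below is only an illustration: the function and variable names are assumptions, and the fourth pair of strands is read with the trailing letter that the slide layout wrapped onto the next line.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Reverse the strand and replace each base by its complementary base."""
    return "".join(PAIR[base] for base in reversed(strand))


coding = [
    "TACGCGACTTTC", "ATCAAACGATGC", "TGTGTGCTCGTC",
    "ATTTTTGCGTTA", "CACTAAATACAA", "GAAAAAGAAGAA",
]
probing = [
    "GAAAGTCGCGTA", "GCATCGTTTGAT", "GACGAGCACACA",
    "TAACGCAAAAAT", "TTGTATTTAGTG", "TTCTTCTTTTTC",
]

# Every probing strand is the reverse complement of its coding strand.
print(all(reverse_complement(c) == p for c, p in zip(coding, probing)))  # True

# No strand in the collection is its own reverse complement.
print(all(reverse_complement(s) != s for s in coding + probing))  # True
```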
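14. DNA CODE
- $X = \{x(1), x(2), \dots, x(N)\}$ is a code of length $n$ and size $N$, where $x(j) \in A^n$, $j = 1, \dots, N$, are the codewords of code $X$
- $X$ is called a DNA $(n, D)$-code based on stem Hamming similarity if the following 2 conditions are fulfilled
  - For any $x \in X$, there exists $x' \in X$, $x' \ne x$, such that $x'$ is the reverse complement of $x$
  - For any $x, x' \in X$, $x \ne x'$, the stem Hamming similarity satisfies $S(x, x') \le n - D - 1$
- Let $N_q(n, D)$ be the maximal size of DNA $(n, D)$-codes
- [...] is called the rate of DNA codes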
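The two conditions can be sketched as a brute-force checker. The formulas on this slide were lost, so the sketch makes explicit assumptions: the complement of a letter a is (q-1) - a, the first condition requires the reverse complement of every codeword to be a different codeword of the code, and the second condition bounds the stem Hamming similarity of any two distinct codewords by n - 1 - D. All names are hypothetical.

```python
from itertools import combinations


def reverse_complement(x, q):
    """Assumed rule: reverse the word and map each letter a to (q - 1) - a."""
    return tuple((q - 1) - a for a in reversed(x))


def stem_similarity(x, y):
    """Number of positions i with x[i] == y[i] and x[i+1] == y[i+1]."""
    return sum(x[i] == y[i] and x[i + 1] == y[i + 1] for i in range(len(x) - 1))


def is_dna_code(code, D, q):
    """Check both (assumed) conditions for a non-empty code of equal-length words."""
    words = set(map(tuple, code))
    n = len(next(iter(words)))
    # Condition 1: the reverse complement of each word is another word of the code.
    for x in words:
        rc = reverse_complement(x, q)
        if rc == x or rc not in words:
            return False
    # Condition 2: any two distinct words have similarity at most n - 1 - D.
    return all(stem_similarity(x, y) <= n - 1 - D for x, y in combinations(words, 2))


small = [(0, 2, 0, 2), (1, 3, 1, 3), (0, 0, 0, 0), (3, 3, 3, 3)]
print(is_dna_code(small, D=3, q=4))                   # True
print(is_dna_code(small + [(0, 1, 2, 3)], D=3, q=4))  # False
```

The second call fails because (0, 1, 2, 3) is its own reverse complement under the assumed rule.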
16. Q-ARY REED-MULLER CODES
- q-ary Reed-Muller code: let [...]
- Define the mapping [...], with [...]
- The Reed-Muller code of order [...] is the image [...]
- The Reed-Muller code of order 1 satisfies the condition of reverse-complementarity
- It may contain self-reverse complement words, which should be excluded from the final construction
17. EXAMPLE OF CODE
Let q = 4 and m = 1
Mutually reverse-complement
0 1 2 3
0 0 0 0
0 1 2 3
0 2 0 2
0 3 2 1
1 1 1 1
1 2 3 0
1 3 1 3
1 0 3 2
2 2 2 2
2 3 0 1
2 0 2 0
2 1 0 3
3 3 3 3
3 0 1 2
3 1 3 1
3 2 1 0
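The 16 distinct codewords of this table can be checked directly. The sketch assumes the complement rule a -> 3 - a. The observation that the table coincides with the words (a + b*t mod 4, t = 0..3) is read off the table itself and is not a statement of the construction on the previous slide, whose formulas are not available here.

```python
from itertools import combinations

Q = 4
TABLE = """
0000 0123 0202 0321
1111 1230 1313 1032
2222 2301 2020 2103
3333 3012 3131 3210
"""
code = [tuple(int(c) for c in word) for word in TABLE.split()]


def reverse_complement(x, q=Q):
    """Assumed rule: reverse the word and map each letter a to (q - 1) - a."""
    return tuple((q - 1) - a for a in reversed(x))


def stem_similarity(x, y):
    """Number of positions i with x[i] == y[i] and x[i+1] == y[i+1]."""
    return sum(x[i] == y[i] and x[i + 1] == y[i + 1] for i in range(len(x) - 1))


word_set = set(code)
closed = all(reverse_complement(x) in word_set for x in code)
self_rc = [x for x in code if reverse_complement(x) == x]
affine = {tuple((a + b * t) % Q for t in range(Q)) for a in range(Q) for b in range(Q)}
remaining = [x for x in code if x not in self_rc]
max_sim = max(stem_similarity(x, y) for x, y in combinations(remaining, 2))

print(closed)                   # True: the table is closed under reverse complement
print(self_rc)                  # [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]
print(word_set == affine)       # True
print(len(remaining), max_sim)  # 12 0
```

After the four self-reverse complement words are excluded, 12 words remain in 6 mutually reverse-complement pairs, and no two of them share a common adjacent pair.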
18. OUTLINE
- DNA background
- Modeling the hybridization energy
- DNA codes
- Example of DNA codes
- Bounds on the rate of DNA codes
  - Lower Gilbert-Varshamov bound
  - Upper bounds
  - Graphs
- On sphere sizes
- Possible generalizations
- Bibliography
19. RANDOM CODING
- $x$ and $y$ are independent identically distributed random sequences with uniform distribution on $A^n$
- Define [...]
- Probability distribution of [...]
- Sum of [...]
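For small parameters, the distribution of the stem Hamming similarity of two independent uniformly distributed sequences can be computed by exhaustive enumeration. This is only a sketch under the assumption that both sequences are uniform on all q-ary sequences of length n; it does not reproduce the quantities defined on this slide, which are not available here. The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def stem_similarity(x, y):
    """Number of positions i with x[i] == y[i] and x[i+1] == y[i+1]."""
    return sum(x[i] == y[i] and x[i + 1] == y[i + 1] for i in range(len(x) - 1))


def similarity_distribution(n, q):
    """Exact distribution of S(x, y) for independent uniform x, y of length n."""
    seqs = list(product(range(q), repeat=n))
    counts = Counter(stem_similarity(x, y) for x in seqs for y in seqs)
    total = len(seqs) ** 2
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


dist = similarity_distribution(n=4, q=2)
mean = sum(s * p for s, p in dist.items())
print(mean)  # 3/4
```

The mean equals (n - 1) / q^2, because each of the n - 1 adjacent pairs is common with probability 1 / q^2. The summands for neighboring positions are dependent, so the distribution itself is not binomial.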
20. GILBERT-VARSHAMOV BOUND
- Let [...]. Introduce [...]
- We construct a random code as a collection of [...] independent variables [...] and their reverse-complements. This leads to the necessity of a special random coding technique for DNA codes
- One can check that [...]
- Random coding bound (Gilbert-Varshamov bound): if [...], then [...]
21. CALCULATION OF THE BOUND
- [...] are dependent variables and both depend on [...] and [...]
- [...] do not constitute a Markov chain
- [...] vs. [...]
- [...] are deterministic functions of a Markov chain
- [...] and [...]
- We cannot apply the standard technique used in the case of Hamming similarity
- We have to use the Large Deviations Principle for Markov chains for [...]
22. GILBERT-VARSHAMOV BOUND
- Introduce [...]
- Gilbert-Varshamov lower bound on the rate: if [...], then [...], where [...]
- [...] and [...] is a decreasing [...]-convex function with [...]
24. UPPER BOUNDS
- Plotkin upper bound
  - If [...], then [...] and [...]
  - [...] if [...]
- Elias upper bound: if [...], then [...], where [...] is presented by the parametric equation [...]
- The Elias bound improves the Plotkin bound for small values of [...]. We calculated [...] and [...].
26. BOUNDS ON THE RATE (Q = 2)
[Figure: bounds on the rate of DNA codes, q = 2; marked value 0.75]
27. BOUNDS ON THE RATE (Q = 4)
[Figure: bounds on the rate of DNA codes, q = 4; marked value 0.9375]
29. FIBONACCI NUMBERS
- q-ary Fibonacci numbers [...] are defined by the recurrence equation [...] with initial conditions [...]
- q-ary Fibonacci numbers may also be calculated as the sum [...]
- The q-ary Fibonacci number [...] may be interpreted as the number of q-ary sequences of length [...] which do not contain 2-stems of the form (0,0)
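The combinatorial interpretation can be checked numerically. The recurrence and indexing used below, F(0) = 1, F(1) = q, F(n) = (q - 1)(F(n-1) + F(n-2)), are an assumption derived from that interpretation (a sequence without the pair (0,0) ends either in a nonzero letter, or in a zero preceded by a nonzero letter); the indexing on the original slide may differ. Function names are illustrative.

```python
from itertools import product


def count_without_zero_pair(n, q):
    """Brute-force count of q-ary sequences of length n with no adjacent (0, 0)."""
    return sum(
        1
        for s in product(range(q), repeat=n)
        if all((s[i], s[i + 1]) != (0, 0) for i in range(n - 1))
    )


def fib_q(n, q):
    """Assumed recurrence: F(0) = 1, F(1) = q, F(n) = (q - 1) * (F(n-1) + F(n-2))."""
    a, b = 1, q
    for _ in range(n):
        a, b = b, (q - 1) * (a + b)
    return a


print([fib_q(n, 2) for n in range(6)])  # [1, 2, 3, 5, 8, 13]
print([fib_q(n, 4) for n in range(4)])  # [1, 4, 15, 57]
print(all(
    fib_q(n, q) == count_without_zero_pair(n, q)
    for q in (2, 3, 4)
    for n in range(6)
))  # True
```

For q = 2 the assumed recurrence reduces to the classical Fibonacci recurrence.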
30. COMBINATORIAL CALCULATION
- The space [...] with metric [...] is homogeneous, i.e., the volume of a sphere does not depend on its center
- Define [...]
- [...] for any [...]
- Consider a sphere with center [...]. Any sequence [...] must have no common 2-stems (pairs) with [...]. In other words, it must have no 2-stems of type (0,0). Thus, [...]
- Sphere sizes for other [...] may be obtained using the same technique with corresponding modifications
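Both claims of this slide can be illustrated by brute force for small parameters. The sketch assumes the stem Hamming distance D(x, y) = (n - 1) - S(x, y) and takes the all-zero sequence as the center in the second check; the names and the parameter choice q = 2, n = 5 are illustrative.

```python
from collections import Counter
from itertools import product


def stem_distance(x, y):
    """Assumed definition: (n - 1) minus the number of common adjacent pairs."""
    n = len(x)
    common = sum(x[i] == y[i] and x[i + 1] == y[i + 1] for i in range(n - 1))
    return (n - 1) - common


def distance_profile(center, q):
    """Number of sequences at each stem distance from the given center."""
    return Counter(
        stem_distance(center, y) for y in product(range(q), repeat=len(center))
    )


q, n = 2, 5
zero = (0,) * n

# Homogeneity: the profile is the same for every center.
profiles = [distance_profile(c, q) for c in product(range(q), repeat=n)]
print(all(p == profiles[0] for p in profiles))  # True

# Sequences at the maximal distance n - 1 from the all-zero center are exactly
# the sequences without a 2-stem of type (0, 0).
no_zero_pair = sum(
    1
    for y in product(range(q), repeat=n)
    if all((y[i], y[i + 1]) != (0, 0) for i in range(n - 1))
)
print(distance_profile(zero, q)[n - 1], no_zero_pair)  # 13 13
```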
31. GRAPH OF PROBABILITIES
[Figure: probability distribution]
33. B-STEM HAMMING SIMILARITY
- b-stem Hamming similarity, instead of counting the number of 2-stems (pairs), counts the number of b-stems
- [...] where [...]
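One natural reading of this generalization is to count the windows of b consecutive positions in which the two sequences coincide. The sketch below implements that reading as an assumption, since the formula on the slide is not available; the function name and the sample sequences are illustrative.

```python
def b_stem_similarity(x, y, b):
    """Number of windows of b consecutive positions where x and y coincide."""
    x, y = tuple(x), tuple(y)
    return sum(1 for i in range(len(x) - b + 1) if x[i:i + b] == y[i:i + b])


x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)

print(b_stem_similarity(x, y, 1))  # 5: ordinary Hamming similarity
print(b_stem_similarity(x, y, 2))  # 3: stem Hamming similarity
print(b_stem_similarity(x, y, 3))  # 1: only the first three positions form a common 3-stem
```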
34. WEIGHTED STEM HAMMING SIMILARITY
- Weighted stem Hamming similarity assigns a weight to each type of q-ary pair and takes it into account while calculating the sum
- Let [...] be a weight function such that [...]
- Similarity is defined as follows: [...], where [...]
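A sketch of the weighted variant: each common adjacent pair contributes the weight of its pair type instead of 1. The weight table below is invented purely for illustration (letters 1 and 2 are treated as "strong"), and the condition imposed on the weight function on this slide is not reproduced here; all names are hypothetical.

```python
from itertools import product


def weighted_stem_similarity(x, y, weight):
    """Sum of weight[(a, b)] over common adjacent pairs (a, b) of x and y."""
    return sum(
        weight[(x[i], x[i + 1])]
        for i in range(len(x) - 1)
        if x[i] == y[i] and x[i + 1] == y[i + 1]
    )


# Hypothetical weights for q = 4: one unit per pair plus one per "strong" letter.
strong = {1, 2}
weight = {
    (a, b): 1 + (a in strong) + (b in strong)
    for a, b in product(range(4), repeat=2)
}
unit = {pair: 1 for pair in weight}

x = (0, 1, 2, 3, 0, 1)
y = (0, 1, 2, 0, 0, 1)

print(weighted_stem_similarity(x, y, unit))    # 3: reduces to stem Hamming similarity
print(weighted_stem_similarity(x, y, weight))  # 7 = 2 + 3 + 2
```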
35. INSERTION-DELETION STEM SIMILARITY
- Insertion-deletion stem similarity allows loops and shifts in the DNA duplex
- [...] is a common block subsequence between $x$ and $y$ if [...] is an ordered collection of non-overlapping common ($x$, $y$)-blocks of length [...]
- A common ($x$, $y$)-block of length [...] is a subsequence of $x$ and $y$ consisting of consecutive elements of $x$ and $y$
- [...] is the set of all common block subsequences between $x$ and $y$
- [...] is the minimal number of blocks of consecutive elements of $x$ and $y$ in the given subsequence
- Similarity is defined as follows: [...]
[Figure: DNA duplex with a shift and a loop]
37. BIBLIOGRAPHY
- Probability theory and Large Deviation Principle
  - V.N. Tutubalin, The Theory of Probability and Random Processes. Moscow: Publishing House of Moscow State University, 1992 (in Russian).
  - A. Dembo, O. Zeitouni, Large Deviations Techniques and Applications. Boston, MA: Jones and Bartlett, 1993.
- DNA codes
  - A.G. D'yachkov, A.J. Macula, D.C. Torney, P.A. Vilenkin, P.S. White, I.K. Ismagilov, R.S. Sarbayev, On DNA Codes. Problemy Peredachi Informatsii, 2005, V. 41, N. 4, P. 57-77 (in Russian). English translation: Problems of Information Transmission, 2005, V. 41, N. 4, P. 349-367.
  - M.A. Bishop, A.G. D'yachkov, A.J. Macula, T.E. Renz, V.V. Rykov, Free Energy Gap and Statistical Thermodynamic Fidelity of DNA Codes. Journal of Computational Biology, 2007, V. 14, N. 8, P. 1088-1104.
  - A. Dyachkov, A. Macula, T. Renz, V. Rykov, Random Coding Bounds for DNA Codes Based on Fibonacci Ensembles of DNA Sequences. Proc. of 2008 IEEE International Symposium on Information Theory, Toronto, Canada, 2008, in print.