Title: Poster Madrid ECCB 2005
Rényi entropic profiles of DNA sequences
and statistical significance of motifs
Susana Vinga(a,b), Jonas S Almeida(a,c)
a) Biomathematics Group ITQB/UNL Instituto de
Tecnologia Química e Biológica, Universidade Nova
de Lisboa - Oeiras, Portugal
b) INESC-ID Instituto de Engenharia de Sistemas e
Computadores Investigação Desenvolvimento -
Lisboa, Portugal
c) Dept. Biostatistics, Bioinformatics and
Epidemiology - Medical Univ. South Carolina -
Charleston SC 29425, USA
1. Abstract
In a recent report [1] the authors presented a new measure, the Rényi
continuous entropy of DNA sequences, which allows the estimation of their
randomness level. The definition explored there was based on the Rényi
entropy of the probability density function (pdf) estimated with the Parzen
window method and applied to Chaos Game Representation/Universal Sequence
Maps (CGR/USM). This work extends those concepts of continuous entropy by
defining DNA sequence entropic profiles from the pdf estimates obtained.
These profiles are applied to a dataset of artificial and real DNA
sequences, and a new fractal kernel function, better suited to the
estimation, is explored in place of the Gaussian functions previously used.
The results show that the entropic profiles are directly related to the
statistical significance of motifs, allowing the study of under- and
over-representation of substrings. Furthermore, by spanning the parameters
of the fractal kernel function, it is possible to extract important
information about the scale of each DNA region, which may have future
applications in the recognition of biologically significant segments of the
genome.
Keywords: Rényi entropy, DNA, Information Theory, kernel functions,
CGR/USM.
2. Methods and Algorithms: Rényi continuous entropy of DNA sequences
The DNA entropy is defined from the CGR/USM representation using the Parzen
window method with parameter σ², the variance of the Gaussian function
used.
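To make the CGR/USM coordinates concrete, here is a minimal sketch of the chaos game iteration for a DNA string. The corner assignment, the starting point at the centre of the unit square, and the function name `cgr_coordinates` are assumptions chosen for illustration; the poster does not state which convention it uses.

```python
import numpy as np

# Assumed corner convention (a common one for CGR); not taken from the poster.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(seq, start=(0.5, 0.5)):
    """Return an (N, 2) array with one CGR point per symbol of seq."""
    points = np.empty((len(seq), 2))
    current = np.asarray(start, dtype=float)
    for i, base in enumerate(seq.upper()):
        # Move halfway from the current point towards the corner of this base.
        current = (current + np.asarray(CORNERS[base])) / 2.0
        points[i] = current
    return points


if __name__ == "__main__":
    print(cgr_coordinates("ATC"))
```

Under this iteration, two positions whose preceding L symbols coincide land in the same sub-square of side 2^-L, which is why a density estimate over these points carries information about repeated substrings.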
Simplification: the integral becomes a sum, because the convolution of two
Gaussians is a Gaussian.
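The equations themselves are not reproduced in this text. The following is a sketch of the standard argument, assuming the quadratic Rényi entropy H2 and an isotropic Gaussian Parzen kernel g in the two-dimensional CGR plane. The symbols N (number of points), g, and d_ij are notational assumptions, and the poster's own normalisation may differ.

```latex
\begin{align}
H_2 &= -\ln \int \hat f(x)^2 \, dx ,
\qquad
\hat f(x) = \frac{1}{N} \sum_{i=1}^{N} g\!\left(x - x_i;\ \sigma^2 I\right) \\
\int \hat f(x)^2 \, dx
 &= \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \int g\!\left(x - x_i;\ \sigma^2 I\right) g\!\left(x - x_j;\ \sigma^2 I\right) dx
  = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} g\!\left(x_i - x_j;\ 2\sigma^2 I\right) \\
H_2 &= -\ln \left[ \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N}
      \frac{1}{4 \pi \sigma^2} \exp\!\left( -\frac{d_{ij}^2}{4 \sigma^2} \right) \right],
\qquad d_{ij}^2 = \lVert x_i - x_j \rVert^2
\end{align}
```

The middle step is where the convolution property enters: the integral of a product of two Gaussians with variance σ² centred at x_i and x_j equals a single Gaussian with variance 2σ² evaluated at x_i - x_j.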
The simplified expression uses all pairwise squared Euclidean distances
between the CGR/USM coordinates x_i.
[Figure: CGR/USM density estimation; the motif -ATC- is detected]
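As an illustration of how the pairwise squared distances enter the computation, the sketch below evaluates the quadratic entropy from a set of points with a Gaussian kernel of variance 2σ². It is a minimal sketch under the assumption that the estimate takes the usual closed form for Gaussian Parzen windows. The function name `renyi_quadratic_entropy` is hypothetical, and the double sum includes the i = j terms.

```python
import numpy as np


def renyi_quadratic_entropy(points, sigma2):
    """Return H2 = -ln( mean_ij g(x_i - x_j; 2*sigma2*I) ) for (N, d) points."""
    points = np.asarray(points, dtype=float)
    _, d = points.shape
    # All pairwise squared Euclidean distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Convolving two Gaussians of variance sigma2 gives variance 2*sigma2.
    var = 2.0 * sigma2
    kernel = np.exp(-sq_dist / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)
    return -np.log(kernel.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    uniform_points = rng.random((200, 2))
    print(renyi_quadratic_entropy(uniform_points, sigma2=0.01))
```

The N x N distance matrix makes memory and time grow quadratically with sequence length, so this direct form only suits short sequences.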
http://bioinformatics.musc.edu/renyi
svinga_at_itqb.unl.pt
3. Results
[Figure: example of Rényi entropic profiles for the DNA test set]
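The profiles shown on the poster use a fractal kernel whose form is not given in this text. As a rough illustration of the idea of a local profile, the sketch below substitutes a Gaussian kernel and evaluates the Parzen density estimate at each sequence position's own CGR point. The function name `gaussian_profile` and the choice of kernel are assumptions, not the authors' method.

```python
import numpy as np


def gaussian_profile(points, sigma2):
    """Evaluate a 2-D Gaussian Parzen density estimate at every input point."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Isotropic 2-D Gaussian kernel with variance sigma2 (self term included).
    kernel = np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2)
    # Row i is the estimated density at point i, averaged over all points.
    return kernel.mean(axis=1)


if __name__ == "__main__":
    # Two coincident points (a repeated context) and one isolated point.
    demo = [[0.1, 0.1], [0.1, 0.1], [0.9, 0.9]]
    print(gaussian_profile(demo, sigma2=0.01))
```

Positions whose CGR points fall in crowded regions receive higher values, which is the sense in which such a profile can flag over-represented substrings; smaller σ² corresponds to longer matching contexts.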
4. Conclusions and Future work
- Rényi entropic profiles provide local information about motifs and their
  statistical significance
- Continuous quadratic entropy H2 is a good measure of DNA sequence
  randomness
- Method provides new tools for the study of motifs and repeatability in
  biological sequences
- Explore theoretical properties of the entropic profiles
- Optimize algorithm to accommodate longer sequences
Acknowledgments: S. Vinga and J. S. Almeida gratefully acknowledge the
financial support of grants SFRH/BPD/24254/2005 and POCTI/BIO/48333/2002
from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese
Ministério da Ciência, Tecnologia e Ensino Superior.
References
[1] Vinga, S. and Almeida, J. S. (2004) Rényi continuous entropy of DNA
sequences. J Theor Biol, 231(3):377-388.