Localising regulatory elements using statistical analysis and shortest unique substrings of DNA
Nora Pierstorff (1), Rodrigo Nunes de Fonseca (2), Thomas Wiehe (1)
1 - Institute for Genetics, University of Cologne, Germany. Email: nora.pierstorff_at_uni-koeln.de
2 - Institute for Developmental Biology, University of Cologne
ABSTRACT
Several computational methods for predicting regulatory regions have been developed in the last few years. Most of the available methods require transcription factor binding site matrices to achieve reasonable results. To avoid the need for such biological information, we developed a program named SHUREG that predicts regulatory regions without any extrinsic information, using only the sequence itself. By calculating shustrings (shortest unique substrings) we find statistically overrepresented motifs, which are assumed to be indicators of regulatory elements [3].
RESULTS
Figure 1a SHUREG prediction in the giant region
Figure 1b AHAB prediction in the giant region
INTRODUCTION
To localize regulatory regions, three basic computational approaches have been followed:
- Search for binding sites of known transcription factors using position weight matrices (PWMs) [1].
- Search for conserved motifs in upstream regions of homologous or coregulated genes [2].
- Search for statistically overrepresented motifs [3].
Our program SHUREG follows the third approach, which is supported by two hypotheses:
- Degenerate binding sites lead the transcription factor to the binding site.
- New binding sites can be created easily from degenerate binding sites through a few mutations, which lets the organism adapt to environmental changes.
Figure 2a SHUREG prediction in the hairy region
Figure 2b Ahab prediction in the hairy region
We applied our program to different well-explored regions of the Drosophila melanogaster genome. Our dataset includes segmentation and dorsal-ventral genes. We compare our predictions to the results of Ahab [1], a program that uses PWMs. Figure 1 shows two predictions for the giant region: 1a is computed using SHUREG, and 1b is the result of Ahab applied to the same sequence. Figure 2a shows the SHUREG prediction for the regulatory regions of the hairy gene; 2b shows the corresponding Ahab prediction. Figure 3 comprises three predictions. Figure 3a is the SHUREG prediction for the Dorsal-regulated enhancer of the sog gene. Figure 3b shows the Ahab prediction using only the PWM of the Dorsal binding site. Figure 3c shows the Ahab prediction using all known PWMs, for the hypothetical case in which we do not know the actual factors responsible for regulating this gene.
Figure 3a SHUREG prediction in the sog region
WHY SHORTEST UNIQUE SUBSTRINGS?
Analyzing the human (mouse) genome, we found 255 (293) global shustrings of length 11 bp [4]. Of these, 29 (22) are located in 1000 bp upstream regions. The probability of this distribution is 3.3 x 10^-24 (5.0 x 10^-18).
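The poster does not state which null model produced these probabilities. As a minimal sketch of how such a figure could be obtained, assume each global shustring falls into an upstream region independently with a fixed probability equal to the fraction of the genome covered by upstream regions; the chance of seeing at least the observed count is then a binomial tail. The function name `binomial_tail` and the value of `p_upstream` are illustrative assumptions, not values from the poster, so the printed number is not expected to reproduce the probabilities above.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Observed counts for the human genome, taken from the text above.
n_shustrings = 255
n_upstream = 29

# HYPOTHETICAL: fraction of the genome lying in 1000 bp upstream regions.
# The poster does not give this value; 0.01 is a placeholder.
p_upstream = 0.01

print(binomial_tail(n_shustrings, n_upstream, p_upstream))
```

With a small upstream fraction, the expected number of hits is only a few, so observing 29 yields a very small tail probability; the exact value depends entirely on the assumed fraction.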
SHUREG ALGORITHM
- Calculation of shustrings (shortest unique substrings) at every position, relative to a surrounding window, on the forward and reverse strands.
- Counting of neighbours (exact repeats in the surrounding region).
- Calculation of P-values for each shustring.
- Smoothing of P-values.
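The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SHUREG implementation: the function names, the `flank` parameter, the separate treatment of the two strands, and the moving-average smoother are all hypothetical choices. The P-value step is left out because the poster does not specify the statistical model.

```python
from collections import Counter


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def shustring_lengths(seq):
    """Length of the shortest unique substring starting at each position.

    A position gets None when every substring starting there also occurs
    elsewhere in seq (this can happen near the end of the sequence).
    Naive implementation for illustration; a suffix tree would be faster.
    """
    n = len(seq)
    lengths = [None] * n
    unresolved = list(range(n))
    k = 1
    while unresolved and k <= n:
        counts = Counter(seq[i:i + k] for i in range(n - k + 1))
        remaining = []
        for i in unresolved:
            if i + k > n:
                continue  # ran off the end without becoming unique
            if counts[seq[i:i + k]] == 1:
                lengths[i] = k
            else:
                remaining.append(i)
        unresolved = remaining
        k += 1
    return lengths


def count_neighbours(seq, start, length, flank):
    """Count exact (possibly overlapping) repeats of seq[start:start+length]
    within `flank` bases on either side, excluding the occurrence itself.
    ASSUMPTION: neighbours are counted in a region wider than the window
    in which the word is unique."""
    word = seq[start:start + length]
    region = seq[max(0, start - flank):min(len(seq), start + length + flank)]
    count = 0
    pos = region.find(word)
    while pos != -1:
        count += 1
        pos = region.find(word, pos + 1)
    return count - 1


def smooth(values, radius):
    """Moving average with a window of up to 2 * radius + 1 values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius):i + radius + 1]
        out.append(sum(window) / len(window))
    return out


window = "ACGTACGA"
print(shustring_lengths(window))  # [4, 3, 2, 1, 4, 3, 2, None]
print(shustring_lengths(reverse_complement(window)))  # reverse strand
```

In this sketch the reverse strand is handled by running the same function on the reverse complement of the window; whether SHUREG treats the two strands separately or jointly is not stated on the poster.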
Figure 3c AHAB prediction in the sog region
using all known PWMs
DISCUSSION
Localizing regulatory regions without any extrinsic information is a hard problem. Using the number of overrepresented patterns in a region as an indicator of regulatory regions is a reasonable measure and can lead to reasonable results. However, it also produces many false-positive predictions, because we find additional overrepresented patterns that cannot be correlated with binding sites. To improve the predictions of our method, we need to find more features that distinguish true-positive from false-positive predictions. We are currently investigating the conservation of overrepresented motifs between species.
References
[1] Rajewsky, N., Vergassola, M., Gaul, U., Siggia, E. D. (2002). Computational detection of genomic cis-regulatory modules, applied to body patterning in the early Drosophila embryo. BMC Bioinformatics, 3:30.
[2] Bussemaker, H., Li, H., Siggia, E. (2000). Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis. PNAS, Aug 2000, 97.
[3] Nazina, A., Papatsenko, D. (2003). Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics, 4:65 (1471-2105/4/65).
[4] Haubold, B., Pierstorff, N., Moeller, F., Wiehe, T. (2005). Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics, 6:123.