A new way of seeing the genome
Combining sequence- and signal-based genome
analyses
Maik Friedel, Thomas Wilhelm, Jürgen Sühnel
FLI-Jena, Germany
http://www.fli-leibniz.de/tsb
Introduction
During the last 10 years, a large number of
complete genomes have been sequenced.
Having these data at hand, the basic aim is now
to convert this information into biological
knowledge. This requires the identification of
biologically meaningful motifs in genomic data.
Computational motif discovery has been used with
some success in simple organisms such as yeast.
For higher organisms with more complex genomes,
more sensitive methods are required. There is
also a growing awareness that not single motifs
but motif combinations, usually called modules,
may be relevant to biological function. We
describe here a new type of
GenomeBrowser that offers user-friendly genome
analysis tools for the statistical analysis of
single and multiple sequences as well as for the
visual exploration of single sequences. A
distinctive feature is that, in addition to the
standard sequence representation in terms of the
bases A, T, G and C, the user can adopt a reduced
representation by purine/pyrimidine or AT/GC
characteristics, or a representation in terms of
a large number of dinucleotide parameters that
can encode, for example, geometrical information
on DNA structure. All of these coding schemes can be
converted into a signal representation that
allows for very effective visual motif
discovery. Analyses can be performed for the +
and - strands as well as for the double strand. Combining
these sequence- and signal-based representations
offers a new approach for the detection of new
regulatory elements. The functionalities
described make the GenomeBrowser a unique tool
for the identification and analysis of functional
motifs in genomes.
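The reduced coding schemes mentioned above can be
sketched in a few lines of Python. This is only an
illustration, not the GenomeBrowser's own code: the
function names are hypothetical, and the one-letter
symbols R/Y (purine/pyrimidine) and W/S (AT/GC) are
the standard IUPAC codes, chosen here as an
assumption about how such a recoding might be
written.

```python
# Illustrative recoding of a DNA sequence into reduced alphabets.
# All names below are hypothetical; the poster does not describe
# how the GenomeBrowser implements these representations.

PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
AT_GC = {"A": "W", "T": "W", "G": "S", "C": "S"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def _recode(seq, mapping):
    """Map every base through `mapping`; reject anything but A, C, G, T."""
    try:
        return "".join(mapping[base] for base in seq.upper())
    except KeyError as err:
        raise ValueError(f"unsupported base: {err.args[0]!r}") from None


def to_purine_pyrimidine(seq):
    """R for purines (A, G), Y for pyrimidines (C, T)."""
    return _recode(seq, PURINE_PYRIMIDINE)


def to_at_gc(seq):
    """W for A/T, S for G/C."""
    return _recode(seq, AT_GC)


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return _recode(seq, COMPLEMENT)[::-1]


if __name__ == "__main__":
    seq = "GATTACA"
    print(to_purine_pyrimidine(seq))
    print(to_at_gc(seq))
    print(reverse_complement(seq))
```

Applying the same recoding to the reverse complement
gives the representation of the opposite strand.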
Implemented tools
1. Repeat finder: Tool for searching for any type
of simple repeat in the sequence or signal
representation
2. Motif finder: Tool for searching for DNA motifs
in the sequence or signal representation
3. Average statistic: Tool for calculating the
average of any type of DNA feature over selected
DNA fragments
4. Showing underlying DNA sequence: Feature that
shows the underlying DNA sequence of a selected
part of the signal representation
5. Property editor: Tool for searching, filtering
and selecting all types of features indicated in
the GenBank file
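As a rough illustration of what sequence-level
repeat and motif searches involve, the sketch below
uses exact k-mer matching and plain substring
search. Both are stand-in techniques chosen for
brevity; the poster does not state which algorithms
the GenomeBrowser tools use, and the function names
are hypothetical.

```python
from collections import defaultdict


def find_exact_repeats(seq, k):
    """Return every k-mer that occurs more than once, with its start positions.

    Exact matching only; this is an assumed stand-in, not the tool's method.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def find_motif(seq, motif):
    """Return the start positions of all (possibly overlapping) motif hits."""
    if not motif:
        raise ValueError("motif must not be empty")
    hits = []
    start = seq.find(motif)
    while start != -1:
        hits.append(start)
        start = seq.find(motif, start + 1)
    return hits


if __name__ == "__main__":
    print(find_exact_repeats("ACGTTTACGT", 4))
    print(find_motif("AAAA", "AA"))
```

Positions are 0-based. Searching the opposite strand
would require running the same functions on the
reverse complement of the sequence.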
Main window
The main window of the GenomeBrowser consists of
three panels. The first (1) is the control panel,
which allows uploading and manipulation of
sequence and coding parameter information. The
main field (2) shows the signal curve, and the
third panel (3) shows the position information of
the currently displayed sequence range. The DNA
sequence in FASTA or GenBank format is converted
into a signal representation by applying
dinucleotide parameters and smoothed using a
sliding window technique. All sequence features
included in the GenBank file can be selected and
shown in different colors.
Parameters
To visualize biochemical and biophysical
properties of a DNA strand we have included about
40 different dinucleotide properties. All
parameters are available for the complete set of
16 dinucleotide combinations.
Dinucleotide  Free energy change (kcal/mol)
AA            -1.20
AC            -1.50
AG            -1.50
AT            -0.90
CA            -1.70
CC            -2.10
CG            -2.80
CT            -1.50
GA            -1.50
GC            -2.30
GG            -2.10
GT            -1.50
TA            -0.90
TC            -1.50
TG            -1.70
TT            -1.20
The table shows, as an example, the free energy
change (B-DNA) in kcal/mol for the set of all 16
dinucleotides (M. Aida, J. Theor. Biol. 130,
327-335 (1988)).
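The conversion of a sequence into a signal can be
sketched with the free energy values from the table.
The code below is a minimal illustration under
stated assumptions: each overlapping dinucleotide is
replaced by its table value, and smoothing is done
with an unweighted moving average. The poster only
mentions a window technique, so the averaging scheme,
the window size and all function names are
hypothetical.

```python
# Free energy change (B-DNA, kcal/mol) per dinucleotide, copied from the table.
FREE_ENERGY = {
    "AA": -1.20, "AC": -1.50, "AG": -1.50, "AT": -0.90,
    "CA": -1.70, "CC": -2.10, "CG": -2.80, "CT": -1.50,
    "GA": -1.50, "GC": -2.30, "GG": -2.10, "GT": -1.50,
    "TA": -0.90, "TC": -1.50, "TG": -1.70, "TT": -1.20,
}


def dinucleotide_signal(seq, table=FREE_ENERGY):
    """One value per overlapping dinucleotide step (len(seq) - 1 values)."""
    seq = seq.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def smooth(values, window):
    """Unweighted moving average (assumed smoothing scheme)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    raw = dinucleotide_signal("ACGT")
    print(raw)
    print(smooth(raw, 2))
    # Two sequences that differ at every position but share one signal:
    print(dinucleotide_signal("ACAG") == dinucleotide_signal("CTGA"))
```

Because several dinucleotides share a value in the
table (AC, AG, CT, GA, GT and TC are all -1.50),
different sequences can produce identical signals.
In this sketch, ACAG and CTGA differ at every
position yet both map to -1.50, -1.70, -1.50.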
First applications
1. Visualization of evolutionary events: The
GenomeBrowser can be used to distinguish between
3 types of rRNA gene clusters in chloroplast
genomes. The patterns are seen best when the free
energy change measure for the DNA double strand is
applied.
1.) Inverted Repeats (25 kb): 79 of 88 genomes
2.) Inverted Repeat Lacking Clade: 7 of 88 genomes
3.) 3 Directed Repeats: 2 of 88 genomes (subclass
Euglenozoa)
2. Visualization of gene and exon/intron
organization: With the help of the GenomeBrowser
it can be shown that genes tend to be
purine-rich. In both pictures the positive strand
is encoded by its pyrimidine content. On the left
side all genes of one strand and on the right
side all genes of the opposite strand are shown
in red. The exon (red) and intron (green)
structure of a given gene can be seen by adopting
a GC content representation. Exons tend to have a
higher GC content than introns.
3. Repeats which cannot be found by standard
repeat search methods: We have shown this by
hiding DNA sequence repeats in an artificial
sequence with only 50% alignment identity. The
new sequence contains the same repeats, but they
are visible only in the signal representation.
1.) original sequence repeats
2.) the same repeats hidden in an artificial
sequence with only 50% sequence identity
Conclusion
The GenomeBrowser is a powerful new tool for motif
discovery in genomes. In addition to the standard
sequence representation, the DNA is also analysed
with respect to biochemical and biophysical
dinucleotide properties. This makes it possible
to identify and visualize a broad range of both
known and unknown genome patterns. The new way of
seeing the genome can lead to a better
understanding of its organisation and function.