Title: A Model Conditioned Data Compression-Based
Similarity Measure
Authors: Daniele Cerra (2), Mihai Datcu (1,2)
(1) GET/Télécom Paris, 46 rue Barrault, 75013
Paris, France
(2) German Aerospace Center (DLR),
Oberpfaffenhofen, 82234 Weßling, Germany
Frame
Earth Observation applications are seldom usable
across different kinds of data, because they
depend strongly on the characteristics of the
sensor used (i.e. the spatial, spectral and
radiometric resolutions of the data), on the
models adopted and on a priori assumptions. Data
compression-based techniques provide a
parameter-free, model-independent methodology for
image classification and indexing. They therefore
have the major advantage of being universally
applicable, and can be a powerful and reliable
instrument for discovering similarities between
heterogeneous kinds of data.
Goals
Fusing a pattern recognition methodology with a
well-known similarity metric, both based on data
compression, yields a new, more robust
methodology for discovering similarities in data:
the Model Conditioned Data Compression-Based
Similarity Measure (McDCSM).
Model Conditioned Data Compression-Based
Similarity Measure
At first sight NCD and PRDC are quite different:
the former is a direct metric, while the latter is
a methodology which computes a compression
distance with an intermediate step of encoding
files into texts. In spite of this, it is
possible to show that PRDC, too, may be regarded
as based on estimates of Kolmogorov complexities.
This results in the definition of a new measure,
the Model Conditioned Data Compression-Based
Similarity Measure (McDCSM), which is a modified
version of PRDC.
Pattern Recognition using Data Compression (PRDC)
PRDC is a methodology for classification of
general data. In the case of satellite data,
PRDC is performed by encoding the local gradients
of an image and arranging them as edges in an
undirected graph, where each pixel of the image
represents a node. By removing steep gradients
one preserves homogeneous areas that can be
segmented while keeping part of the spatial
features. For classification, one can then encode
the segments into text strings that can be
compressed with dictionaries extracted from the
various classes of interest.
Link McDCSM -> PRDC
It is straightforward to notice that McDCSM is a
normalized version of PRDC.
Link McDCSM -> NCD
The definition of McDCSM may be regarded as
[equation], iff [condition].
The condition is ensured by selecting small
dictionaries for the class x, which give the best
results. This is explained by the a priori
probability of a model, which is higher for
simple models under the universal distribution
defined by Solomonoff.
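For reference, the universal distribution is commonly stated in the form below. Which variant is meant here is an assumption; U denotes a universal prefix machine, |p| the length of program p, and K(x) the Kolmogorov complexity of x. The second relation (the coding theorem) is what makes short, simple models a priori more probable.

```latex
m(x) = \sum_{p \,:\, U(p) = x} 2^{-|p|},
\qquad
-\log_2 m(x) = K(x) + O(1)
```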
PRDC Workflow. Workflow to compute the distance
between a general input file I and a class C,
first encoding the files into strings using an
alphabet A, which differs according to the kind
of data being encoded. Each distance becomes an
element of the compression ratio vector CV for
the input file I.
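The workflow above can be illustrated with a minimal sketch. It rests on several assumptions: zlib's preset-dictionary feature stands in for the class dictionaries (PRDC itself does not prescribe zlib), each ratio is taken as compressed length over original length, and the names `compressed_size` and `compression_ratio_vector` are hypothetical helpers introduced only for this illustration.

```python
import zlib


def compressed_size(data: bytes, dictionary: bytes = b"") -> int:
    """Length of `data` after DEFLATE, optionally primed with a preset dictionary."""
    if dictionary:
        # Assumption: a zlib preset dictionary plays the role of a class dictionary.
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def compression_ratio_vector(text: bytes, class_dicts: dict) -> dict:
    """One compression ratio per class: compressed length / original length."""
    return {
        name: compressed_size(text, dictionary) / len(text)
        for name, dictionary in class_dicts.items()
    }


# Toy "encoded strings"; real inputs would come from the encoding step with alphabet A.
class_dicts = {
    "sea": b"wave crest trough foam swell " * 20,
    "forest": b"pine birch canopy trunk moss " * 20,
}
text = b"foam swell wave crest trough foam wave swell crest"
cv = compression_ratio_vector(text, class_dicts)
best_class = min(cv, key=cv.get)  # smallest ratio = best-matching dictionary
```

In this sketch the input is assigned to the class whose dictionary compresses it best; how the vector CV is used downstream is left open.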
- Kolmogorov Complexity
- K(x) is the length of the shortest program q
that outputs the string x and halts on a
universal Turing machine.
- Mutual Information
- Defined between two objects x and y in terms of
their Kolmogorov complexities.
NCD
- Normalized Information Distance (NID)
- From the definition of algorithmic mutual
information it is possible to derive a similarity
metric, the NID, between two datasets x and y.
- Normalized Compression Distance (NCD)
- An approximation of the Normalized Information
Distance, where the uncomputable Kolmogorov
complexities are approximated by the lengths of
compressed files.
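The quantities listed above are usually written as follows. These are the standard textbook forms, given as a sketch rather than a transcription of the original formulas: K(x|y) is the conditional Kolmogorov complexity, C(.) is the length of the output of a lossless compressor, xy is the concatenation of x and y, and the mutual-information identity holds up to an additive logarithmic term.

```latex
I(x : y) = K(x) - K(x \mid y) \approx K(y) - K(y \mid x)

\mathrm{NID}(x, y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}

\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```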
McDCSM
PRDC
Image indexing with the three methods.
Hierarchical clustering using NCD, PRDC and
McDCSM distances on 60 SPOT images of 64x64
pixels belonging to 6 different classes. The
classes are well separated, with the exception of
a sea image, which is generally considered to be
closer to the classes "Clouds" and "Desert". In
PRDC there is also an additional false alarm: a
forest image within the class "Sea". This false
alarm is removed when passing from the PRDC
indices to McDCSM. False alarms are circled in
red.
Normalized Compression Distance schema. In this
sketch, the coder represents a general lossless
compressor. The lengths of the compressed files
C(x), C(y) and C(xy) are used with the function f
to compute the information distance between two
objects x and y, where xy represents the two
objects concatenated into one.
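The schema can be sketched in a few lines. The choice of zlib as the lossless coder is an assumption made for illustration (any lossless compressor can take its place), and `compressed_len` and `ncd` are hypothetical helper names; the function f is taken to be the usual NCD ratio.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """C(.): length of the losslessly compressed data (zlib is an assumed coder)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance computed from C(x), C(y) and C(xy)."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # xy: the two objects concatenated
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy data: a repetitive text and an unrelated byte pattern.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = bytes((i * 7 + 3) % 251 for i in range(880))

d_same = ncd(a, a)  # an object compared with itself: expected to be small
d_diff = ncd(a, b)  # unrelated objects: expected to be close to 1
```

With a real compressor the distance of an object to itself is small but not exactly zero, since the coder still spends a few bytes on the repeated copy.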
Putting together PRDC and NCD
Classification Methodology (PRDC)
Improved Classification Methodology (McDCSM)
Compression-based Similarity Metric (NCD)