Distances between Data Sets Based on Summary Statistics - PowerPoint PPT Presentation

About This Presentation
Title:

Distances between Data Sets Based on Summary Statistics

Description:

Define a dissimilarity measure, the constrained minimum (CM) distance, between ... Bible, Addresses, Beatles, 20Newsgroups, TopGenres, TopDecades, Abstract ...

Slides: 14
Provided by: Yut7

Transcript and Presenter's Notes


1
Distances between Data Sets Based on Summary
Statistics
Machine Learning Paper Reading Series
  • Nikolaj Tatti, JMLR, 01/2007

Presented by Yuting Qi, ECE Dept., Duke
Univ., 02/02/2007
2
Introduction
  • Goal
  • Define a dissimilarity measure, the constrained
    minimum (CM) distance, between two data sets D1
    and D2 by comparing summary statistics of
    datasets.
  • Requirements
  • It should be a metric.
  • It should consider the statistical nature of
    data.
  • It should be evaluated quickly.

3
The Constrained Minimum (CM) Distance 1/5
  • Definition
  • Basic notations
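  • D: data set, a finite collection of samples in $\Omega$.
  • $\Omega$: finite sample space; $|\Omega|$ is the number of elements
    in $\Omega$.
  • S: feature function, $S : \Omega \to \mathbb{R}^N$, known or
    learned.
  • $\theta$: frequency, $\theta \in \mathbb{R}^N$, the average value of S
    over D,

$\theta = S(D) = \frac{1}{|D|} \sum_{\omega \in D} S(\omega)$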
Example: $\Omega = \{A, B, C\}$, $D_1 = (C, C, C, A)$,
$D_2 = (C, A, B, A)$. The only feature of interest is the
proportion of C in the data set, so the feature
function S is the indicator of C, giving $S(D_1) = 3/4$, $S(D_2) = 1/4$.
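
The frequency in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the helper names `frequency` and `is_c` are illustrative and do not come from the slides.

```python
from fractions import Fraction

def frequency(data, feature):
    """Average value of `feature` over the samples in `data`."""
    return sum(Fraction(feature(x)) for x in data) / len(data)

def is_c(sample):
    # Indicator feature: 1 if the sample is "C", otherwise 0.
    return 1 if sample == "C" else 0

D1 = ["C", "C", "C", "A"]
D2 = ["C", "A", "B", "A"]

theta1 = frequency(D1, is_c)  # proportion of C in D1
theta2 = frequency(D2, is_c)  # proportion of C in D2
print(theta1, theta2)         # prints: 3/4 1/4
```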
4
The Constrained Minimum (CM) Distance 2/5
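  • Constrained set of distributions
    $C_+(S, \theta) = \{p \in P : E_p[S] = \theta\}$
  • An alternative definition of $C_+(S, \theta)$, with distributions written as vectors
    $C_+(S, \theta) = \{u \in \mathbb{R}^{|\Omega|} : \sum_i S(i) u_i = \theta, \; \sum_i u_i = 1, \; u \ge 0\}$
  • Constrained space (the non-negativity condition is dropped)
    $C(S, \theta) = \{u \in \mathbb{R}^{|\Omega|} : \sum_i S(i) u_i = \theta, \; \sum_i u_i = 1\}$

P is the set of all distributions defined on $\Omega$;
$\theta$ is calculated from the given data sets.
We estimate statistics from a given data set, then
examine the distributions that can produce those
statistics.
If we take $\Omega = \{1, 2, \ldots, |\Omega|\}$, P is the set of vectors u
in $\mathbb{R}^{|\Omega|}$ with non-negative elements
summing to 1, where $u_i = p(i)$.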
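
As a rough illustration of the vector view, the sketch below checks whether a vector u reproduces a given frequency. It assumes (this is a reading of the slide, not something it spells out) that the constrained set requires non-negative entries while the constrained space only requires the entries to sum to 1. The function names are hypothetical.

```python
from fractions import Fraction

OMEGA = ["A", "B", "C"]

def feature(sample):
    # Indicator of "C", as in the running example.
    return 1 if sample == "C" else 0

def in_constrained_space(u, theta):
    """True if u sums to 1 and its expected feature value equals theta."""
    expected = sum(ui * feature(w) for ui, w in zip(u, OMEGA))
    return sum(u) == 1 and expected == theta

def in_constrained_set(u, theta):
    """Assumed definition: in the constrained space and a valid distribution."""
    return in_constrained_space(u, theta) and all(ui >= 0 for ui in u)

def empirical_distribution(data):
    """Vector u with u_i = relative frequency of the i-th element of OMEGA."""
    return [Fraction(data.count(w), len(data)) for w in OMEGA]

p1 = empirical_distribution(["C", "C", "C", "A"])    # [1/4, 0, 3/4]
u = [Fraction(1, 2), Fraction(-1, 4), Fraction(3, 4)]  # sums to 1, one negative entry

print(in_constrained_set(p1, Fraction(3, 4)))    # True
print(in_constrained_space(u, Fraction(3, 4)))   # True
print(in_constrained_set(u, Fraction(3, 4)))     # False
```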
5
The Constrained Minimum (CM) Distance 3/5
  • Illustration

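Example: $\Omega = \{A, B, C\}$, $D_1 = (C, C, C, A)$,
$D_2 = (C, A, B, A)$; the feature function S
gives $S(D_1) = 0.75$, $S(D_2) = 0.25$. P is the triangle, and
the set of vectors in $\mathbb{R}^3$ whose elements sum to 1 is a plane.
Then the constrained spaces $C(S, 0.75)$ and $C(S, 0.25)$ are parallel lines.
The constrained sets of distributions $C_+(S, 0.75)$ and $C_+(S, 0.25)$ are the
segments of those lines inside the triangle.
Motivation: a natural way to measure the
distance between two parallel spaces is to find the
shortest length between two points, one from each space.
[Figure: the triangle P with vertices A, B, C]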
6
The Constrained Minimum (CM) Distance 4/5
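  • CM Distance
  • Pick a vector from each constrained space, $u_1 \in C(S, \theta_1)$ and
    $u_2 \in C(S, \theta_2)$, so that $\|u_1 - u_2\|_2$ is as small as possible
  • CM distance between D1 and D2 is
    $d_{CM}(D_1, D_2 \mid S) = \sqrt{|\Omega|} \, \|u_1 - u_2\|_2$
  • Theorem 1: if $\mathrm{Cov}[S]$, taken with respect to the uniform distribution on $\Omega$, is invertible, then
    $d_{CM}(D_1, D_2 \mid S)^2 = (\theta_1 - \theta_2)^T \, \mathrm{Cov}^{-1}[S] \, (\theta_1 - \theta_2)$
  • Computation time
  • $|\Omega|$ could be very large, but $O(N^3)$ time is feasible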
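
The following is a hedged sketch of how the distance could be evaluated without searching over the constrained spaces. It assumes the closed form usually quoted for Theorem 1 of the paper, namely that the squared CM distance equals $(\theta_1 - \theta_2)^T \mathrm{Cov}^{-1}[S] (\theta_1 - \theta_2)$ with the covariance taken under the uniform distribution on the sample space. All function names are illustrative, and the small Gauss-Jordan solver stands in for whatever linear algebra routine one would normally use; solving an N x N system is where the $O(N^3)$ cost comes from.

```python
from fractions import Fraction
from math import sqrt

def covariance(omega, feature):
    """Covariance matrix of the feature vector under the uniform distribution."""
    vals = [[Fraction(x) for x in feature(w)] for w in omega]
    n, k = len(vals), len(vals[0])
    mu = [sum(v[j] for v in vals) / n for j in range(k)]
    return [[sum((v[i] - mu[i]) * (v[j] - mu[j]) for v in vals) / n
             for j in range(k)] for i in range(k)]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination; a must be invertible."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(k):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][k] for i in range(k)]

def cm_distance_squared(theta1, theta2, cov):
    """(theta1 - theta2)^T Cov^{-1} (theta1 - theta2), the assumed closed form."""
    diff = [Fraction(a) - Fraction(b) for a, b in zip(theta1, theta2)]
    x = solve(cov, diff)
    return sum(d * xi for d, xi in zip(diff, x))

# Running example: sample space {A, B, C}, single feature "is C".
omega = ["A", "B", "C"]
feature = lambda w: [1 if w == "C" else 0]

cov = covariance(omega, feature)  # [[2/9]]
d2 = cm_distance_squared([Fraction(3, 4)], [Fraction(1, 4)], cov)
print(d2, sqrt(d2))  # squared distance is 9/8
```

As a sanity check on the running example, and assuming the distance is defined as the shortest Euclidean distance between the two constrained spaces scaled by the square root of the sample-space size: the closest points on the two parallel lines differ by (-1/4, -1/4, 1/2), whose squared norm is 3/8, and multiplying by 3 gives the same 9/8.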

7
The Constrained Minimum (CM) Distance 5/5
  • Properties

8
CM Distance and Binary Data Sets 1/2
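  • Basic definitions
  • Sample space $\Omega = \{0, 1\}^K$
  • Itemset $B \subseteq \{a_1, \ldots, a_K\}$, where $a_i$ corresponds to the ith
    dimension.
  • Boolean formula $S : \Omega \to \{0, 1\}$
  • Conjunction function $S_B$
  • $S_B(\omega) = \omega_{i_1} \wedge \omega_{i_2} \wedge \cdots \wedge \omega_{i_L}$, given itemset $B = \{a_{i_1}, \ldots,
    a_{i_L}\}$
  • Parity function $T_B$
  • $T_B(\omega) = \omega_{i_1} \oplus \omega_{i_2} \oplus \cdots \oplus \omega_{i_L}$ ($\oplus$ is XOR)
  • Given a collection of itemsets $F = \{B_1, \ldots, B_N\}$, we
    have $S_F = [S_{B_1}, \ldots, S_{B_N}]^T$ and $T_F = [T_{B_1}, \ldots, T_{B_N}]^T$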
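
To make the two kinds of feature concrete, here is a small sketch that evaluates conjunction and parity functions on binary rows and averages them over a data set. Itemsets are represented as tuples of column indices, which is a choice made for this example only; the data and the family of itemsets are made up.

```python
from functools import reduce
from operator import xor

def conjunction(itemset, row):
    """S_B: 1 if every attribute in `itemset` is set in `row`, else 0."""
    return int(all(row[i] for i in itemset))

def parity(itemset, row):
    """T_B: XOR of the attributes in `itemset`."""
    return reduce(xor, (row[i] for i in itemset), 0)

def frequencies(func, family, data):
    """Average of func(B, row) over the data set, one entry per itemset B."""
    return [sum(func(b, row) for row in data) / len(data) for b in family]

# Hypothetical data set with three binary attributes a_1, a_2, a_3.
data = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0)]
family = [(0,), (0, 1), (0, 1, 2)]  # itemsets as column indices

print(frequencies(conjunction, family, data))  # [0.75, 0.5, 0.25]
print(frequencies(parity, family, data))       # [0.75, 0.5, 0.5]
```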

9
CM Distance and Binary Data Sets 2/2
  • CM distance can be calculated in O(N) time,
    assuming the frequencies $\theta_1$ and $\theta_2$ are known.
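
One way to see why linear time is plausible: under the uniform distribution on $\{0, 1\}^K$, parity functions of distinct non-empty itemsets each have variance 1/4 and are pairwise uncorrelated, so their covariance matrix is I/4. If one assumes the closed form $(\theta_1 - \theta_2)^T \mathrm{Cov}^{-1}[S] (\theta_1 - \theta_2)$ for the squared CM distance, it then reduces to $4 \|\theta_1 - \theta_2\|_2^2$. The sketch below implements that reduced form; the function name is hypothetical and the inputs are assumed to be parity frequencies.

```python
from math import sqrt

def cm_distance_parity(theta1, theta2):
    """2 * Euclidean distance between two vectors of parity frequencies.

    Runs in time linear in the number of itemsets, because no
    covariance matrix has to be built or inverted.
    """
    return 2 * sqrt(sum((a - b) ** 2 for a, b in zip(theta1, theta2)))

# Two made-up parity-frequency vectors that differ in one coordinate.
print(cm_distance_parity([0.75, 0.5], [0.25, 0.5]))  # 1.0
```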

10
CM Distance and Event Sequences 1/1
  • Transform a sequence s into a binary data set D
  • Given a window length k, pick a window in s
    and transform it into a binary vector whose length
    is the size of the alphabet, setting an element to 1 if the
    corresponding symbol occurs in the window (s → D).
  • Define a way F to represent the statistics of
    sequence s; a popular choice is episodes.
  • Given the transformed data sets D1 and D2 and F, the CM
    distance between s1 and s2 is the CM distance between D1 and D2.
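
The window transformation might be sketched as follows. The slide only says to "pick a window"; taking every contiguous window of length k is an assumption made here, and the function name is illustrative.

```python
def sequence_to_binary_dataset(sequence, alphabet, k):
    """Turn each length-k window of `sequence` into a binary row.

    Entry j of a row is 1 if the j-th symbol of `alphabet` occurs
    anywhere in the window, otherwise 0.
    """
    rows = []
    for start in range(len(sequence) - k + 1):
        window = set(sequence[start:start + k])
        rows.append(tuple(int(symbol in window) for symbol in alphabet))
    return rows

# Windows of "abcab" with k = 2 are: ab, bc, ca, ab.
rows = sequence_to_binary_dataset("abcab", "abc", 2)
print(rows)  # [(1, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

The resulting rows form an ordinary binary data set, so frequencies of itemset-based features can be computed on them in the same way as for any other binary data.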

11
Empirical Tests
  • 7 datasets
  • Bible, Addresses, Beatles, 20Newsgroups,
    TopGenres, TopDecades, Abstract
  • Compare CM distance to a base distance
  • Clustering experiments using different algorithms
    based on CM distance.

12
Empirical Tests
13
Conclusions & Discussion
  • CM distance has nice statistical properties and
    can be evaluated efficiently
  • It properly takes into account the correlations
    between features
  • For many types of feature functions, the
    CM distance can be computed quickly.
  • The performance of CM distance depends heavily on
    the data set.