Title: Indexing and Data Mining in Multimedia Databases
1Indexing and Data Mining in Multimedia Databases
- Christos Faloutsos
- CMU
- www.cs.cmu.edu/~christos
2Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
3Problem
- Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
- Allow fast, approximate queries, and
- Find rules/patterns
4Sample queries
- Similarity search
- Find pairs of branches with similar sales patterns
- Find medical cases similar to Smith's
- Find pairs of sensor series that move in sync
5Sample queries contd
- Rule discovery
- Clusters (of patients; of customers; ...)
- Forecasting (total sales for next year?)
- Outliers (e.g., fraud detection)
6Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
7Indexing - Multimedia
- Problem
- given a set of (multimedia) objects,
- find the ones similar to a desired query object (quickly!)
8distance function by expert
9GEMINI - Pictorially
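[Figure: each sequence S1 ... Sn (value vs. day, days 1-365) is mapped to a point F(S1) ... F(Sn) in feature space (axes e.g. avg and std); the feature points are indexed with off-the-shelf S.A.M.s (Spatial Access Methods)]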
10GEMINI
- fast; correct (no false dismissals)
- used for
- images (e.g., QBIC) (2x, 10x faster)
- shapes (27x faster)
- video (e.g., InforMedia)
- time sequences (Rafiei and Mendelzon, ...)
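A minimal sketch of the filter-and-refine idea behind GEMINI, assuming Euclidean distance on equal-length sequences and the two illustrative features from the picture (avg and std). Scaling both features by sqrt(n) makes the feature-space distance a lower bound of the true distance, which is what rules out false dismissals. The function names and the synthetic random-walk data are hypothetical, and the linear scan in the filter step stands in for a spatial access method.

```python
import numpy as np

def features(seqs):
    """Map each length-n sequence to sqrt(n) * (avg, std).

    With this scaling, Euclidean distance between feature points is a
    lower bound on Euclidean distance between the raw sequences.
    """
    seqs = np.atleast_2d(seqs)
    n = seqs.shape[1]
    return np.sqrt(n) * np.column_stack([seqs.mean(axis=1), seqs.std(axis=1)])

def range_query(seqs, feats, query, eps):
    """Return (indices within eps of query, number of candidates checked)."""
    q_feat = features(query)[0]
    # Filter: cheap test in 2-d feature space (a linear scan here; a SAM
    # would answer this step in a real system).
    feat_dist = np.linalg.norm(feats - q_feat, axis=1)
    candidates = np.flatnonzero(feat_dist <= eps + 1e-9)
    # Refine: compute the expensive true distance only for the candidates.
    true_dist = np.linalg.norm(seqs[candidates] - query, axis=1)
    return candidates[true_dist <= eps], len(candidates)

# Synthetic data (an assumption for illustration): 200 random walks, 365 days each.
rng = np.random.default_rng(0)
seqs = rng.normal(size=(200, 365)).cumsum(axis=1)
feats = features(seqs)
query = seqs[0] + rng.normal(scale=0.1, size=365)

hits, n_candidates = range_query(seqs, feats, query, eps=50.0)
brute_force = np.flatnonzero(np.linalg.norm(seqs - query, axis=1) <= 50.0)
```

Because the feature distance never exceeds the true distance, `hits` should match `brute_force` exactly, while only `n_candidates` sequences need the full distance computation.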
11Remaining issues
- how to extract features automatically?
- how to merge similarity scores from different media?
12Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- Visualization: FastMap
- Relevance feedback: FALCON
- Data Mining / Fractals
- Conclusions
13FastMap
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
??
14FastMap
- Multi-dimensional scaling (MDS) can do that, but in O(N^2) time
- We want a linear algorithm: FastMap [SIGMOD95]
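A minimal sketch of the FastMap projection, applied to the 5-object distance matrix from the previous slide. For brevity it assumes the full distance matrix is given, which by itself costs O(N^2) space; the linear-time behaviour relies on calling the distance function only against the pivots. The function name and the simplified pivot heuristic are assumptions made for illustration.

```python
import numpy as np

def fastmap(D, k):
    """Map N objects to k-d points using only their pairwise distances."""
    D2 = np.asarray(D, dtype=float) ** 2      # squared distances, updated per axis
    X = np.zeros((D2.shape[0], k))
    for axis in range(k):
        # Pivot heuristic: from object 0 hop to its farthest object a,
        # then to the object b farthest from a.
        a = int(np.argmax(D2[0]))
        b = int(np.argmax(D2[a]))
        dab2 = D2[a, b]
        if dab2 <= 0:                          # all remaining distances are zero
            break
        # Cosine-law projection of every object onto the line (a, b).
        x = (D2[a] + dab2 - D2[b]) / (2.0 * np.sqrt(dab2))
        X[:, axis] = x
        # Residual squared distances on the hyperplane orthogonal to (a, b).
        D2 = np.maximum(D2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return X

D = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]
X = fastmap(D, 2)
```

On this matrix the first coordinate alone should already separate {O1, O2, O3} from {O4, O5}.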
15Applications time sequences
- given n co-evolving time sequences
- visualize them; find rules [ICDE00]
[Figure: exchange rate vs. time for DEM, JPY, HKD]
16Applications - financial
- currency exchange rates [ICDE00]
[Figure: plot with labeled points FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
17Applications - financial
- currency exchange rates [ICDE00]
[Figure: plot with labeled points USD(t), USD(t-5)]
18Application VideoTrails
19VideoTrails - usage
- scene-cut detection (about 10 errors)
- scene classification (e.g., dialogue vs. action)
20Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- Visualization: FastMap
- Relevance feedback: FALCON
- Data Mining / Fractals
- Conclusions
21Merging similarity scores
- e.g., video: text, color, motion, audio
- weights change with the query!
- solution 1: user specifies weights
- solution 2: user gives examples
- and we learn what he/she wants: rel. feedback (Rocchio, MARS, MindReader)
- but how about disjunctive queries?
22DEMO
23FALCON
[Figure: sequences shaped like inverted Vs and like Vs]
Trader wants only unstable stocks
24FALCON
[Figure: sequences shaped like inverted Vs and like Vs]
average is flat!
25Single query point methods
[Figure: feature space (avg vs. std) with a single query point x; method: Rocchio]
26Single query point methods
[Figure: single query points x chosen by Rocchio, MindReader, MARS]
The averaging effect in action...
27Main idea: FALCON Contours
(Wu, VLDB 2000)
[Figure: contours in feature space; axes: feature1 (e.g., avg), feature2 (e.g., std)]
28A Aggregate Dissimilarity
[Figure: query point x and good examples g1, g2]
- α: parameter (α = -5 ~ soft OR)
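A minimal sketch of an aggregate dissimilarity with soft-OR behaviour. The formula is not legible in the slide text, so the form used here, D(x) = (mean over i of d(x, g_i)^α)^(1/α) with Euclidean d and a negative exponent α, is an assumption, as are the function name and the toy points. With α = -5 a candidate close to any one good example scores as similar, whereas a single averaged query point would prefer the empty middle.

```python
import numpy as np

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Soft-OR dissimilarity of x to a set of good examples (assumed form)."""
    d = np.linalg.norm(np.asarray(good, dtype=float) - np.asarray(x, dtype=float), axis=1)
    if alpha < 0 and np.any(d == 0):
        return 0.0                       # x coincides with a good example
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

good = [[0.0, 0.0], [10.0, 0.0]]         # g1, g2: two far-apart good examples
near_g1 = [0.5, 0.0]                     # close to g1 only
middle = [5.0, 0.0]                      # the centroid of g1 and g2

d_near = aggregate_dissimilarity(near_g1, good)
d_middle = aggregate_dissimilarity(middle, good)

# For contrast: distance to the single averaged query point.
centroid = np.mean(good, axis=0)
c_near = float(np.linalg.norm(np.array(near_g1) - centroid))
c_middle = float(np.linalg.norm(np.array(middle) - centroid))
```

Here `d_near` is smaller than `d_middle`, while the centroid-based distances rank the two candidates the other way round.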
29FALCON
- converges quickly (5 iterations)
- good precision/recall
- is fast (can use off-the-shelf spatial/metric access methods)
30Conclusions for indexing, visualization
- GEMINI: fast indexing, exploiting off-the-shelf SAMs
- FastMap: automatic feature extraction in O(N) time
- FALCON: relevance feedback for disjunctive queries
31Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
32Data mining fractals Road map
- Motivation: problems / case study
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
33Problem 1 - spatial d.m.
- Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
  - spiral and elliptical galaxies
- (stores and households; healthy and ill subjects)
  - patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??
34Problem2 dim. reduction
- given attributes x1, ... xn
- possibly, non-linearly correlated
- drop the useless ones
- (Q: why? A: to avoid the dimensionality curse)
[Figure: scatter plot of mpg vs. engine size]
35Answer
- Fractals / self-similarities / power laws
36What is a fractal?
- self-similar point set, e.g., Sierpinski triangle
- zero area; infinite length!
37Definitions (contd)
- Paradox: infinite perimeter; zero area!
- dimensionality between 1 and 2
- actually $\log 3 / \log 2 \approx 1.58$ (long story)
38Intrinsic (fractal) dimension
E.g.: cylinders; miles / gallon
- Q: fractal dimension of a line?
x  y
5  1
4  2
3  3
2  4
39Intrinsic (fractal) dimension
- Q: fractal dimension of a line?
- A: $nn(\le r) \propto r^1$
40Intrinsic (fractal) dimension
- Q: fractal dimension of a line?
- A: $nn(\le r) \propto r^1$
- Q: fd of a plane?
- A: $nn(\le r) \propto r^2$
- fd = slope of (log(nn) vs. log(r))
41Sierpinski triangle
correlation integral
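A minimal sketch of estimating the fractal dimension as the slope of log(number of pairs within r) vs. log(r). The helper names, the radius range, the sample size, and the chaos-game sampler for the Sierpinski triangle are assumptions chosen for illustration; the estimate is a plain least-squares fit and will fluctuate with the sample.

```python
import numpy as np

def pair_distances(points):
    """All pairwise Euclidean distances, each unordered pair once."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(points), k=1)]

def correlation_dimension(points, radii):
    """Slope of log(#pairs within r) vs. log(r), by least squares."""
    d = pair_distances(points)
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _intercept = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

def sierpinski(n, rng):
    """Sample n points of the Sierpinski triangle with the chaos game."""
    corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    p = np.array([0.25, 0.25])
    out = np.empty((n, 2))
    for i in range(n + 20):              # the first 20 steps are burn-in
        p = (p + corners[rng.integers(3)]) / 2
        if i >= 20:
            out[i - 20] = p
    return out

rng = np.random.default_rng(0)
radii = np.logspace(-2, -1, 8)
t = rng.random(1500)
line = np.column_stack([t, 1 - t])       # points on a straight line

fd_line = correlation_dimension(line, radii)
fd_sierpinski = correlation_dimension(sierpinski(1500, rng), radii)
```

The line should come out close to 1 and the Sierpinski sample close to log(3)/log(2), up to sampling noise and edge effects.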
42Observations
- self-similarity ->
- <=> fractals
- <=> scale-free
- <=> power-laws ($y = x^a$, $F = C r^{-2}$)
[Figure: Sierpinski triangle correlation integral; log(pairs within <= r) vs. log(r), slope 1.58]
43Road map
- Motivation: problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
- Conclusions
44Solution1 spatial d.m.
- Galaxies (Sloan Digital Sky Survey, w/ B. Nichol; BOPS plot, SIGMOD 2000)
- clusters?
- separable?
- attraction/repulsion?
- data scrubbing: duplicates?
45Solution1 spatial d.m.
- 1.8 slope
- plateau!
- repulsion!
[Figure: log(pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
46Solution1 spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
- 1.8 slope
- plateau!
- repulsion!
[Figure: log(pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
47spatial d.m.
Heuristic for choosing the number of clusters
48Solution1 spatial d.m.
- 1.8 slope
- plateau!
- repulsion!
[Figure: log(pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
49Solution1 spatial d.m.
- 1.8 slope
- plateau!
- repulsion!!
- duplicates
[Figure: log(pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
50Problem 2 Dim. reduction
51Solution
- drop the attributes that don't increase the partial f.d. (PFD)
- dfn: the PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
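A minimal sketch of the idea, assuming the correlation fractal dimension (slope of log(pairs within r) vs. log(r)) as the f.d. estimate. The greedy forward pass, the `min_gain` threshold, the function names, and the synthetic data are all assumptions for illustration and are not necessarily the procedure of the cited paper. In the toy data the second attribute is a non-linear function of the first, so it adds nothing to the PFD.

```python
import numpy as np

def fractal_dim(points, radii):
    """Correlation fractal dimension: slope of log(#pairs within r) vs. log(r)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))[np.triu_indices(len(points), k=1)]
    counts = np.array([(dist <= r).sum() for r in radii])
    return np.polyfit(np.log(radii), np.log(counts), 1)[0]

def select_attributes(data, radii, min_gain=0.3):
    """Greedy forward pass: keep an attribute only if it raises the PFD."""
    kept, current_fd = [], 0.0
    for j in range(data.shape[1]):
        fd = fractal_dim(data[:, kept + [j]], radii)   # PFD of kept + candidate
        if fd - current_fd > min_gain:
            kept.append(j)
            current_fd = fd
    return kept

# Synthetic data (an assumption): attribute 1 = (attribute 0)^2, attribute 2 independent.
rng = np.random.default_rng(0)
x = rng.random(1000)
z = rng.random(1000)
data = np.column_stack([x, x ** 2, z])

kept = select_attributes(data, np.logspace(-2, -1, 8))
```

Attribute 1 is dropped even though it is only non-linearly related to attribute 0, which is the case the slide's Q/A on non-linear correlation has in mind.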
52Problem 2 dim. reduction
[Figure: point sets with global FD=1; per-attribute labels PFD=1, PFD=1, PFD=0, PFD=1, PFD=1]
53Problem 2 dim. reduction
[Figure: point sets with global FD=1; per-attribute labels PFD=1, PFD=1, PFD=0, PFD=1, PFD=1]
Notice: max variance would fail here
54Problem 2 dim. reduction
[Figure: point sets with global FD=1; per-attribute labels PFD=1, PFD=1, PFD=0, PFD=1, PFD=1]
Notice: SVD would fail here
55Currency dataset
56self-similar?
[Figure: correlation-integral plots; currency: fd=1.98; eigenfaces: fd=4.25]
57FDR on the currency dataset
58FDR on the currency dataset
- HKD: useless
- >1.98 axes are needed
59Road map
- Motivation: problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
- Conclusions
60App. traffic
- disk traces: self-similar (also web traffic, comm. errors, etc.)
61More apps Brain scans
62More fractals
- stock prices (LYCOS) - random walks: 1.5
63More fractals
- coast-lines: 1.1-1.2 (up to 1.58)
65Examples: MG county
- Montgomery County of MD (road end-points)
66Examples: LB county
- Long Beach county of CA (road end-points)
67More power laws: Zipf's law
- Bible: rank vs. frequency (log-log)
[Figure: log(freq) vs. log(rank); labeled words: "the", "a"]
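A minimal sketch of how a rank-frequency plot is built and how the power-law exponent can be read off as the slope in log-log scales. The helper names are hypothetical, and the fitted data are synthetic values that follow freq = 1000 / rank exactly; they are not the Bible counts behind the slide.

```python
from collections import Counter

import numpy as np

def rank_frequency(words):
    """Return (ranks, frequencies), most frequent word first."""
    counts = sorted(Counter(words).values(), reverse=True)
    return np.arange(1, len(counts) + 1), np.array(counts, dtype=float)

def loglog_slope(x, y):
    """Least-squares slope and intercept of log(y) vs. log(x)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope, intercept

# Tiny hand-made word list, only to show the rank-frequency step.
ranks, freqs = rank_frequency("the a the of the a".split())

# Synthetic Zipf-like data (an assumption): freq = 1000 / rank.
synthetic_rank = np.arange(1, 101)
synthetic_freq = 1000.0 / synthetic_rank
slope, intercept = loglog_slope(synthetic_rank, synthetic_freq)
```

On real text the points scatter around the fitted line, so the slope is only an estimate of the exponent.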
68More power laws
- Freq. distr. of first names; last names (Mandelbrot)
69Internet
- Internet routers: how many neighbors within h hops?
U of Alberta
70Internet topology
- Internet routers: how many neighbors within h hops? [SIGCOMM 99]
Reachability function: number of neighbors within r hops, vs. r (log-log). Mbone routers, 1995
[Figure: log(pairs) vs. log(hops)]
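A minimal sketch of computing the reachability counts behind such a plot: for each hop limit h, the number of node pairs at most h hops apart, found by breadth-first search from every node. The function name, the adjacency-dictionary format, and the 4-node path graph are hypothetical; counting unordered pairs without self-pairs is an assumption about the convention.

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """For h = 1..max_hops, count unordered node pairs at most h hops apart."""
    totals = [0] * (max_hops + 1)
    for source in adj:
        # Breadth-first search from source, cut off at max_hops.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # A node at distance d is reachable within every h >= d.
        for node, d in dist.items():
            if node != source:
                for h in range(d, max_hops + 1):
                    totals[h] += 1
    # Each unordered pair was counted once from each endpoint.
    return [t // 2 for t in totals[1:]]

# Toy undirected graph (hypothetical): a path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reach = pairs_within_hops(path, 3)   # [3, 5, 6]
```

Plotting the logarithm of these counts against log(h) gives the kind of curve whose slope is read off as a power-law exponent.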
71More power laws: areas; Korcak's law
(ICDE99, w/ Proietti)
Scandinavian lakes
72More power laws: areas; Korcak's law
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Figure: log(count(> area)) vs. log(area)]
73Olympic medals
[Figure: log(medals) vs. log(rank)]
74More power laws
- Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]
75Even more power laws
- Income distribution (Pareto's law)
- sales distributions
- duration of UNIX jobs
- Distribution of UNIX file sizes
- publication counts (Lotka's law)
76Even more power laws
- web hit frequencies (Huberman)
- hyper-link distribution (Barabasi, ...)
77Overall Conclusions
- Find similar/interesting things in multimedia databases
- Indexing: feature extraction (GEMINI)
- automatic feature extraction: FastMap
- Relevance feedback: FALCON
78Conclusions - contd
- New tools for Data Mining: Fractals/power laws
- appear everywhere
- lead to skewed distributions (as opposed to Gaussian, Poisson, uniformity, independence)
- correlation integral for separability/cluster detection
- PFD for dimensionality reduction
79Conclusions - contd
- can model bursty time sequences (buffering/prefetching)
- selectivity estimation (how many neighbors within x km?)
- dim. curse diagnosis (it's the fractal dim. that matters! ICDE2000)
80Resources
- Software and papers
- http://www.cs.cmu.edu/~christos
- Fractal dimension (FracDim)
- Separability (sigmod 2000)
- Relevance feedback for query by content (FALCON, VLDB 2000)