Title: Indexing and Data Mining in Multimedia Databases
1Indexing and Data Mining in Multimedia Databases
- Christos Faloutsos
- CMU
- www.cs.cmu.edu/christos
2Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
3Problem
- Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
- Allow fast, approximate queries, and
- Find rules/patterns
4Sample queries
- Similarity search
- Find pairs of branches with similar sales patterns
- Find medical cases similar to Smith's
- Find pairs of sensor series that move in sync
- Find shapes like a spark-plug
5Sample queries contd
- Rule discovery
- Clusters (of branches; of sensor data; ...)
- Forecasting (total sales for next year?)
- Outliers (e.g., unexpected part failures; fraud detection)
6Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Related projects at CMU and resources
7Indexing - Multimedia
- Problem
- given a set of (multimedia) objects,
- find the ones similar to a desirable query object
8distance function by expert
9GEMINI - Pictorially
[Figure: each sequence S1 ... Sn (one value per day, days 1 to 365) is mapped to a point F(S1) ... F(Sn) in feature space; example features: avg and std]
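A minimal sketch of the mapping pictured on this slide, assuming Euclidean distance between equal-length sequences and the two example features (avg, std). The names `features` and `range_query` and the toy data are illustrative, not from the talk. Scaling the feature-space distance by sqrt(n) keeps it at or below the true distance, so the cheap filter step cannot discard a true match.

```python
import math
import statistics

def features(seq):
    """Map a sequence to the 2-d feature vector (avg, std)."""
    return (statistics.fmean(seq), statistics.pstdev(seq))

def range_query(sequences, query, eps):
    """Indices of sequences within Euclidean distance eps of query.

    Filter step: compare the (avg, std) features only (cheap).
    Refine step: compute the full distance for the survivors.
    """
    n = len(query)
    fq = features(query)
    hits = []
    for i, seq in enumerate(sequences):
        # sqrt(n) * feature distance never exceeds the true distance,
        # so skipping here cannot cause a false dismissal.
        if math.sqrt(n) * math.dist(features(seq), fq) > eps:
            continue
        if math.dist(seq, query) <= eps:
            hits.append(i)
    return hits

# Toy data (hypothetical values): three short sequences.
sequences = [
    [10, 12, 11, 13],
    [10, 12, 11, 14],
    [50, 10, 90, 20],
]
print(range_query(sequences, [10, 12, 11, 13], eps=2.0))
```

In a real system the feature vectors would be stored in a spatial index rather than scanned in a loop; the loop here only shows the filter-and-refine logic.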
10Remaining issues
- how to extract features automatically?
- how to merge similarity scores from different media?
11Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- Visualization: FastMap
- Relevance feedback: FALCON
- Data Mining / Fractals
- Conclusions
12FastMap
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
??
13FastMap
- Multi-dimensional scaling (MDS) can do that, but in O(N^2) time
- We want a linear algorithm: FastMap [SIGMOD95]
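A simplified sketch of the FastMap idea, applied to the 5 x 5 distance matrix of the previous slide. Each axis is defined by two pivot objects, and every other object is projected onto the line through them with the cosine law. The function name `fastmap`, the one-pass pivot heuristic, and the use of a full matrix as input are choices made for this illustration.

```python
import math

def fastmap(dist, k):
    """Embed N objects into k dimensions from an N x N distance matrix.

    Per axis, only the distances to the two pivots are needed,
    which is what keeps the cost linear in N.
    """
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, axis):
        # Squared distance left over after the first `axis` coordinates.
        rest = dist[i][j] ** 2
        for c in range(axis):
            rest -= (coords[i][c] - coords[j][c]) ** 2
        return max(rest, 0.0)

    for axis in range(k):
        # Pivot heuristic: farthest object from object 0, then farthest from that.
        b = max(range(n), key=lambda j: d2(0, j, axis))
        a = max(range(n), key=lambda j: d2(b, j, axis))
        dab2 = d2(a, b, axis)
        if dab2 == 0.0:
            break  # all remaining distances are zero
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the line from a to b.
            coords[i][axis] = (d2(a, i, axis) + dab2 - d2(b, i, axis)) / (2 * dab)
    return coords

# Distance matrix from the slide: O1..O3 are close, O4..O5 are close.
D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
for name, point in zip(["O1", "O2", "O3", "O4", "O5"], fastmap(D, 2)):
    print(name, [round(v, 3) for v in point])
```

On the first axis, O1 to O3 land near one end and O4 and O5 near the other, which is the two-group structure visible in the matrix.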
14Applications time sequences
- given n co-evolving time sequences
- visualize them; find rules [ICDE00]
[Figure: exchange rate vs. time, for DEM, JPY, HKD]
15Applications - financial
- currency exchange rates [ICDE00]
[Figure: points labeled FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
16Applications - financial
- currency exchange rates [ICDE00]
[Figure: points labeled USD(t), USD(t-5)]
17Application VideoTrails
18VideoTrails - usage
- scene-cut detection (about 10 errors)
- scene classification (e.g., dialogue vs. action)
19Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- Visualization: FastMap
- Relevance feedback: FALCON
- Data Mining / Fractals
- Conclusions
20Merging similarity scores
- e.g., video: text, color, motion, audio
- weights change with the query!
- solution 1: user specifies weights
- solution 2: user gives examples
- and we learn what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
- but how about disjunctive queries?
21FALCON
[Figure: example stock-price shapes labeled "Vs" and "Inverted Vs"]
Trader wants only unstable stocks
22Single query point methods
[Figure: query point x, labeled Rocchio]
23Single query point methods
[Figure: query points x, labeled Rocchio, MindReader, MARS]
The averaging effect in action...
24Main idea FALCON Contours
[Wu, VLDB 2000]
[Figure: contours in feature space; axes: feature1 (e.g., temperature) and feature2 (e.g., frequency)]
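A small sketch of why a single query point struggles with disjunctive queries, and how scoring against all good examples at once can help. The power-mean aggregate with a negative exponent `alpha` is an assumption made for this illustration (it is not spelled out in these slides), and the function names and toy points are hypothetical.

```python
import math

def single_point_score(x, good):
    """Distance from x to the centroid of the good examples (one query point)."""
    dim = len(good[0])
    centroid = [sum(g[c] for g in good) / len(good) for c in range(dim)]
    return math.dist(x, centroid)

def aggregate_score(x, good, alpha=-5.0):
    """Power mean of the distances from x to every good example.

    With alpha < 0 the smallest distance dominates, so being close to
    ANY good example gives a low score (a soft OR).
    """
    dists = [math.dist(x, g) for g in good]
    if any(d == 0.0 for d in dists):
        return 0.0
    return (sum(d ** alpha for d in dists) / len(dists)) ** (1.0 / alpha)

# Good examples form two separate groups (a disjunctive query).
good = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5)]
near_group = (9.5, 10.0)   # close to the second group
in_between = (5.0, 5.0)    # close to neither group

for name, point in [("near_group", near_group), ("in_between", in_between)]:
    print(name,
          round(single_point_score(point, good), 2),
          round(aggregate_score(point, good), 2))
```

Lower scores mean "more similar". The centroid-based score prefers the in-between point, which resembles none of the examples; the aggregate score prefers the point next to one of the groups.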
25Conclusions for indexing and visualization
- GEMINI: fast indexing, exploiting off-the-shelf SAMs (spatial access methods)
- FastMap: automatic feature extraction in O(N) time
- FALCON: relevance feedback for disjunctive queries
26Outline
- Goal: Find similar / interesting things
- Problem - Applications
- Indexing - similarity search
- New tools for Data Mining: Fractals
- Conclusions
- Resources
27Data mining / fractals: Road map
- Motivation: problems / case study
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
28Problem 1 - spatial d.m.
- Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
- - spiral and elliptical galaxies
- (stores and households; mpg and MTBF...)
- - patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??
29Problem2 dim. reduction
- given attributes x1, ... xn
- possibly, non-linearly correlated
- drop the useless ones
- (Q: why?
- A: to avoid the dimensionality curse)
30Answer
- Fractals / self-similarities / power laws
31What is a fractal?
- self-similar point set, e.g., Sierpinski triangle
- zero area; infinite length!
...
32Definitions (contd)
- Paradox: infinite perimeter; zero area!
- dimensionality: between 1 and 2
- actually: log(3)/log(2) = 1.58 (long story)
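The short version of the "long story" is the standard self-similarity argument, sketched here with symbols N, s, and D that are introduced only for this note: halving the scale splits the Sierpinski triangle into 3 smaller copies of itself. The same formula gives D = 1 for a line segment (2 copies at half scale) and D = 2 for a filled square (4 copies at half scale).

```latex
% N = number of self-similar copies, s = scaling factor of each copy
N = \left(\frac{1}{s}\right)^{D}
\quad\Longrightarrow\quad
D = \frac{\log N}{\log (1/s)}

% Sierpinski triangle: N = 3 copies, each scaled by s = 1/2
D = \frac{\log 3}{\log 2} \approx 1.585
```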
33Intrinsic (fractal) dimension
E.g.: #cylinders; miles / gallon
- Q: fractal dimension of a line?
x y
5 1
4 2
3 3
2 4
34Intrinsic (fractal) dimension
- Q: fractal dimension of a line?
- A: nn(<= r) ~ r^1 (nn = number of neighbors within distance r)
- (power law: y = x^a)
35Intrinsic (fractal) dimension
- Q: fractal dimension of a line?
- A: nn(<= r) ~ r^1
- (power law: y = x^a)
- Q: fd of a plane?
- A: nn(<= r) ~ r^2
- fd = slope of (log(nn) vs. log(r))
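The slope definition above can be checked numerically. The sketch below counts pairs of points within distance r (equivalent to neighbor counts, up to a constant factor), and fits a least-squares line to log(count) vs. log(r). The function names, the choice of radii, and the synthetic point sets are assumptions for illustration; the brute-force pair loop is quadratic and only meant for small samples.

```python
import math
import random

def pair_counts(points, radii):
    """For each radius r, count the pairs of points at distance <= r."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts

def fractal_dimension(points, radii):
    """Slope of log(pair count) vs. log(r), by a least-squares line fit."""
    xs, ys = [], []
    for r, c in zip(radii, pair_counts(points, radii)):
        if c > 0:  # log(0) is undefined; skip empty radii
            xs.append(math.log(r))
            ys.append(math.log(c))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)
radii = [0.02, 0.03, 0.05, 0.07, 0.1]
# Points on a line segment, and points spread over the unit square.
line = [(t, 1.0 - t) for t in (rng.random() for _ in range(500))]
plane = [(rng.random(), rng.random()) for _ in range(500)]

fd_line = fractal_dimension(line, radii)
fd_plane = fractal_dimension(plane, radii)
print(round(fd_line, 2), round(fd_plane, 2))
```

The two estimates should come out close to 1 and 2 respectively; they fall slightly short because points near the boundary have fewer neighbors.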
36Sierpinski triangle
correlation integral
37Road map
- Motivation: problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
- Conclusions
38Solution1 spatial d.m.
- Galaxies (Sloan Digital Sky Survey, w/ B. Nichol; BOPS plot, SIGMOD 2000)
- clusters?
- separable?
- attraction/repulsion?
- data scrubbing: duplicates?
39Solution1 spatial d.m.
- 1.8 slope
- plateau!
- repulsion!
[Figure: log(pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
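A minimal sketch of the quantity on the vertical axis when two different point sets are compared (as in the spi-ell curve): the number of cross pairs within distance r. The function name and the toy grids are hypothetical. In this toy case the two sets keep at least 5 units apart, so the cross-pair count stays at zero for all smaller radii.

```python
import math

def cross_pair_counts(set_a, set_b, radii):
    """For each radius r, count pairs (a, b) with a in set_a, b in set_b, distance <= r."""
    return [
        sum(1 for a in set_a for b in set_b if math.dist(a, b) <= r)
        for r in radii
    ]

# Hypothetical toy data: two 3 x 3 grids whose nearest points are 5 units apart.
grid_a = [(x, y) for x in range(3) for y in range(3)]
grid_b = [(x + 7, y) for x in range(3) for y in range(3)]
radii = [1, 2, 4, 8, 16]

cross = cross_pair_counts(grid_a, grid_b, radii)
print(cross)
```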
40Solution1 spatial d.m.
w/ Seeger, Traina, Traina, SIGMOD00
- 1.8 slope
- plateau!
- repulsion!
[Figure: log(pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
41spatial d.m.
Heuristic for choosing the number of clusters
43Solution1 spatial d.m.
- 1.8 slope
- plateau!
- repulsion!!
- duplicates
[Figure: log(pairs within <= r) vs. log(r); curves: ell-ell, spi-spi, spi-ell]
44Problem 2 Dim. reduction
45Solution
- drop the attributes that don't increase the partial f.d. (PFD)
- dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
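A rough sketch of the rule "drop the attributes that don't increase the PFD". The greedy forward pass, the threshold `min_gain`, the pair-counting estimator, and the synthetic data are all assumptions made for this illustration and are not the published algorithm. An attribute is kept only if adding it to the attributes kept so far raises the fractal dimension of the projected points.

```python
import math
import random

def fractal_dim(points, radii=(0.02, 0.03, 0.05, 0.07, 0.1)):
    """Slope of log(pairs within r) vs. log(r), by brute-force pair counting."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    xs = [math.log(r) for r, c in zip(radii, counts) if c > 0]
    ys = [math.log(c) for c in counts if c > 0]
    if len(xs) < 2:
        return 0.0
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)

def select_attributes(rows, min_gain=0.3):
    """Greedy forward pass: keep an attribute only if it raises the PFD by min_gain."""
    kept, current = [], 0.0
    for attr in range(len(rows[0])):
        trial = kept + [attr]
        pfd = fractal_dim([[row[a] for a in trial] for row in rows])
        if pfd - current >= min_gain:
            kept, current = trial, pfd
    return kept

# Synthetic rows: t, a non-linear function of t, a constant, an independent value.
rng = random.Random(1)
rows = []
for _ in range(400):
    t = rng.random()
    rows.append((t, t * t, 0.5, rng.random()))

selected = select_attributes(rows)
print(selected)
```

Here the second attribute is a non-linear function of the first and the third is constant, so neither adds to the fractal dimension, while the fourth is independent and does. Note that a greedy pass like this depends on the order in which attributes are visited.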
46Problem 2 dim. reduction
[Figure: example point sets with global FD = 1; panel labels: PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
47Problem 2 dim. reduction
[Figure: example point sets with global FD = 1; panel labels: PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
Notice: max variance would fail here
48Problem 2 dim. reduction
[Figure: example point sets with global FD = 1; panel labels: PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
Notice: SVD would fail here
49Road map
- Motivation: problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
- fractals
- power laws
- Conclusions
50disk traffic
- Not Poisson, not(?) i.i.d., BUT self-similar
- How to model it?
51traffic
- disk traces (80-20 law: multifractal) [ICDE02]
[Figure: bytes vs. time]
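One way to read "80-20 law, multifractal" is as a recursive split: 80% of the traffic falls in one half of the time interval and 20% in the other, and the same rule is applied again inside each half. The sketch below generates a synthetic bursty trace that way; the function name `b_model`, its parameters, and the random choice of which half gets the larger share are assumptions for illustration.

```python
import random

def b_model(total, levels, b=0.8, seed=0):
    """Split `total` over 2**levels time slots using a recursive b / (1 - b) rule."""
    rng = random.Random(seed)
    slots = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in slots:
            big, small = volume * b, volume * (1.0 - b)
            # Randomly decide which half of the interval gets the larger share.
            nxt.extend((big, small) if rng.random() < 0.5 else (small, big))
        slots = nxt
    return slots

trace = b_model(total=1_000_000, levels=10)
print(len(trace), round(max(trace)), round(min(trace), 4))
```

With b = 0.5 every slot receives the same volume; moving b toward 1 makes the trace burstier.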
52Traffic
- Many other time-sequences are bursty/clustered (such as?)
53Tape accesses
Number of tapes needed to retrieve n records? (Also: number of days down, due to failures / hurricanes / communication noise...)
54Tape accesses
[Figure: axes: qual. records and tapes retrieved; curves: 50-50 Poisson, real]
55More apps Brain scans
56GIS points
- Cross-roads of Montgomery county
- any rules?
57GIS
- A: self-similarity
- intrinsic dim. = 1.51
- avg neighbors(<= r) ~ r^D
[Figure: log(pairs within <= r) vs. log(r)]
58Examples: LB county
- Long Beach county of CA (road end-points)
59More fractals
- cardiovascular system: 3 (!)
- stock prices (LYCOS), random walks: 1.5
- Coastlines: 1.2-1.58 (?)
61Road map
- Motivation: problems / case studies
- Definition of fractals and power laws
- Solutions to posed problems
- More examples
- fractals
- power laws
- Conclusions
62Fractals <=> Power laws
- self-similarity ->
- <=> fractals
- <=> scale-free
- <=> power laws (y = x^a, F = C * r^(-2))
[Figure: log(pairs within <= r) vs. log(r); slope 1.58]
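One standard way to see why power laws are called scale-free (the constant c is introduced here for illustration): rescaling x changes a power law only by a constant factor, so the curve has the same shape, and the same log-log slope a, at every scale.

```latex
y = f(x) = x^{a}
\quad\Longrightarrow\quad
f(c\,x) = (c\,x)^{a} = c^{a}\, f(x)

% On log-log axes this is a straight line with slope a:
\log y = a \log x
```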
63Zipf's law
[Figure: Bible rank-frequency plot (log-log scales); axes: log(rank) and log(freq); top-ranked words: "the", "and"]
Zipf's (first) Law
64Zipf's law
- similarly for first names (slope -1)
- last names (-0.7)
- etc.
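A short sketch of how such a slope can be measured: rank the distinct items by frequency and fit a line to log(frequency) vs. log(rank). The function name and the synthetic vocabulary are hypothetical; the data is built so that the item of rank r occurs about 1000 / r times, which should give a slope near -1.

```python
import math
from collections import Counter

def rank_frequency_slope(words):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)

# Synthetic vocabulary: the word of rank r occurs about 1000 / r times.
words = [f"w{r}" for r in range(1, 51) for _ in range(round(1000 / r))]
slope = rank_frequency_slope(words)
print(round(slope, 2))
```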
65More power laws
- Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]
66Clickstream data
<url, u-id, ....>
67Lotka's law
- library science (Lotka's law of publication count), and citation counts (citeseer.nj.nec.com, 6/2001)
[Figure: axes: log(citations) and log(count); labeled point: J. Ullman]
68Korcak's law
[Figure: Scandinavian lakes, area vs. complementary cumulative count (log-log axes); axes: log(area) and log(count(> area))]
69More power laws: Korcak
[Figure: Japan islands, area vs. cumulative count (log-log axes); axes: log(area) and log(count(> area))]
70(Korcak's law: Aegean islands)
71Olympic medals
[Figure: axes: log rank and log(medals); labeled points: Russia, China, USA]
72SALES data store96
[Figure: axes: units sold and count of products]
73TELCO data
[Figure: axes: number of service units and count of customers]
74More power laws on the Internet
[Figure: degree vs. rank for Internet domains, log-log (SIGCOMM99); axes: log(rank) and log(degree)]
75Even more power laws
- Income distribution (Pareto's law)
- Duration of UNIX jobs [Harchol-Balter]
- Distribution of UNIX file sizes
- Web graph [CLEVER-IBM; Barabasi]
76Overall Conclusions
- Find similar/interesting things in multimedia databases
- Indexing: feature extraction (GEMINI)
- automatic feature extraction: FastMap
- Relevance feedback: FALCON
77Conclusions - contd
- New tools for Data Mining: Fractals/power laws
- appear everywhere
- lead to skewed distributions (not Gaussian, Poisson, uniformity, or independence)
- correlation integral for separability/cluster detection
- PFD for dimensionality reduction
78Resources
- Software and papers
- www.cs.cmu.edu/christos
- Fractal dimension (FracDim)
- Separability (SIGMOD 2000, KDD 2001)
- Relevance feedback for query by content (FALCON, VLDB 2000)
79Resources
- Manfred Schroeder: Fractals, Chaos, Power Laws