Indexing and Data Mining in Multimedia Databases

Provided by: christosf

1
Indexing and Data Mining in Multimedia Databases
  • Christos Faloutsos
  • CMU
  • www.cs.cmu.edu/~christos

2
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • New tools for Data Mining: Fractals
  • Conclusions
  • Resources

3
Problem
  • Given a large collection of (multimedia) records,
    find similar/interesting things, i.e.:
  • Allow fast, approximate queries, and
  • Find rules/patterns

4
Sample queries
  • Similarity search
  • Find pairs of branches with similar sales
    patterns
  • Find medical cases similar to Smith's
  • Find pairs of sensor series that move in sync

5
Sample queries (cont'd)
  • Rule discovery
  • Clusters (of patients; of customers; ...)
  • Forecasting (total sales for next year?)
  • Outliers (e.g., fraud detection)

6
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • New tools for Data Mining: Fractals
  • Conclusions
  • Resources

7
Indexing - Multimedia
  • Problem
  • given a set of (multimedia) objects,
  • find the ones similar to a desirable query object
    (quickly!)

8
distance function by expert
9
GEMINI - Pictorially
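[Figure: each sequence S1 ... Sn (values over days 1-365) is mapped to a feature point F(S1) ... F(Sn), with features e.g. avg and std, and indexed with off-the-shelf S.A.M.s (Spatial Access Methods)]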
10
GEMINI
  • fast; correct (no false dismissals)
  • used for:
  • images (e.g., QBIC) (2x, 10x faster)
  • shapes (27x faster)
  • video (e.g., InforMedia)
  • time sequences (Rafiei & Mendelzon)
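
The "no false dismissals" claim rests on the feature-space distance never exceeding the true distance. A minimal sketch of that filter-and-refine idea follows, assuming Euclidean distance between equal-length sequences and the (avg, std) features shown on the previous slide, scaled by sqrt(n) so that the lower bound holds. A linear scan stands in for the spatial access method, and the names `features` and `range_query` are illustrative, not from the slides.

```python
import numpy as np

def features(series):
    """Map a sequence to a 2-d point whose Euclidean distances
    lower-bound the Euclidean distances between the sequences."""
    s = np.asarray(series, dtype=float)
    return np.sqrt(len(s)) * np.array([s.mean(), s.std()])

def range_query(query, collection, eps):
    """Return indices of all sequences within eps of the query."""
    fq = features(query)
    answers = []
    for i, s in enumerate(collection):
        # Filter step: cheap test in feature space (a SAM would do this part).
        if np.linalg.norm(features(s) - fq) > eps:
            continue  # safe to discard: the true distance is at least as large
        # Refine step: remove false alarms using the true distance.
        if np.linalg.norm(np.asarray(s) - np.asarray(query)) <= eps:
            answers.append(i)
    return answers

rng = np.random.default_rng(0)
collection = rng.normal(size=(200, 365)).cumsum(axis=1)  # 200 synthetic random walks
query = collection[7] + rng.normal(scale=0.1, size=365)  # noisy copy of sequence 7
hits = range_query(query, collection, eps=5.0)
```

The filter may let false alarms through, which the refine step removes, but it cannot drop a true answer as long as the lower bound holds.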

11
Remaining issues
  • how to extract features automatically?
  • how to merge similarity scores from different
    media?

12
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • Visualization: FastMap
  • Relevance feedback: FALCON
  • Data Mining / Fractals
  • Conclusions

13
FastMap
       O1   O2   O3   O4   O5
O1      0    1    1  100  100
O2      1    0    1  100  100
O3      1    1    0  100  100
O4    100  100  100    0    1
O5    100  100  100    1    0
??
14
FastMap
  • Multi-dimensional scaling (MDS) can do that, but
    in O(N^2) time
  • We want a linear algorithm: FastMap [SIGMOD95]
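
A minimal sketch of the FastMap projection, applied to the 5 x 5 distance matrix from the previous slide. It assumes the usual formulation (pick two distant pivots, project every object onto the line through them with the cosine law, then repeat on the residual distances). For readability it updates the whole distance matrix, so this sketch is not linear; a linear version would compute only the distances to the pivots.

```python
import numpy as np

def fastmap(dist, k):
    """Embed N objects in k dimensions, given an N x N distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2   # squared distances, updated per axis
    n = d2.shape[0]
    coords = np.zeros((n, k))
    for axis in range(k):
        # Pivot heuristic: start at object 0 and hop to the farthest object twice.
        a = int(np.argmax(d2[0]))
        b = int(np.argmax(d2[a]))
        dab2 = d2[a, b]
        if dab2 == 0:                         # all residual distances are zero
            break
        # Cosine-law projection of every object onto the line through a and b.
        x = (d2[a] + dab2 - d2[b]) / (2 * np.sqrt(dab2))
        coords[:, axis] = x
        # Residual squared distances in the hyperplane perpendicular to that line.
        d2 = np.maximum(d2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return coords

D = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]
xy = fastmap(D, 2)
```

On this matrix the first coordinate already separates {O1, O2, O3} from {O4, O5} by about 100 units.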

15
Applications: time sequences
  • given n co-evolving time sequences
  • visualize them; find rules [ICDE00]

[Figure: exchange rate vs. time for DEM, JPY, HKD]
16
Applications - financial
  • currency exchange rates [ICDE00]

[Figure labels: FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
17
Applications - financial
  • currency exchange rates [ICDE00]

[Figure labels: USD(t), USD(t-5)]
18
Application: VideoTrails
  • [ACM MM97]

19
VideoTrails - usage
  • scene-cut detection (about 10 errors)
  • scene classification (e.g., dialogue vs action)

20
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • Visualization: FastMap
  • Relevance feedback: FALCON
  • Data Mining / Fractals
  • Conclusions

21
Merging similarity scores
  • e.g., video: text, color, motion, audio
  • weights change with the query!
  • solution 1: user specifies weights
  • solution 2: user gives examples
  • and we learn what he/she wants: relevance feedback
    (Rocchio, MARS, MindReader)
  • but how about disjunctive queries?

22
DEMO
[Figure: demo server]
23
FALCON
[Figure: sequences shaped like Vs and inverted Vs]
Trader wants only unstable stocks
24
FALCON
[Figure: sequences shaped like Vs and inverted Vs]
average is flat!
25
Single query point methods
[Figure: avg vs. std feature plane with a single query point x; method: Rocchio]
26
Single query point methods
[Figure: three panels, each with a single query point x: Rocchio, MindReader, MARS]
The averaging effect in action...
27
Main idea: FALCON Contours
[Wu, VLDB 2000]
[Figure: contours in the plane of feature1 (e.g., avg) vs. feature2 (e.g., std)]
28
A: Aggregate Dissimilarity
[Figure: candidate point x and good examples g1, g2]
  • alpha: parameter (alpha = -5: soft OR)
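
The slide names the aggregate dissimilarity and its parameter, but the formula did not survive in the transcript. A minimal sketch, assuming a generalized-mean form (the mean of the distances raised to a power, then raised to the inverse power); that form, the name `alpha`, and the example points are assumptions for illustration, not taken from the slide. With a negative exponent the smallest distance dominates, which behaves like a soft OR over the good examples g1, g2.

```python
import numpy as np

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Generalized mean of the distances from x to every 'good' example."""
    g = np.asarray(good, dtype=float)
    d = np.linalg.norm(g - np.asarray(x, dtype=float), axis=1)
    if alpha < 0 and np.any(d == 0):
        return 0.0                        # x coincides with a good example
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

good = [(0.0, 0.0), (10.0, 0.0)]          # g1, g2: two distant good examples
near_g1, midpoint = (0.1, 0.0), (5.0, 0.0)

soft_or_near = aggregate_dissimilarity(near_g1, good)            # small: close to g1
soft_or_mid = aggregate_dissimilarity(midpoint, good)            # large: far from both
avg_near = aggregate_dissimilarity(near_g1, good, alpha=1.0)     # plain average
avg_mid = aggregate_dissimilarity(midpoint, good, alpha=1.0)     # same plain average
```

With the plain average (exponent 1) the point next to g1 and the midpoint score the same, which is the averaging effect of single-query-point methods; with the negative exponent the point next to g1 scores far better.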

29
FALCON
  • converges quickly (5 iterations)
  • good precision/recall
  • is fast (can use off-the-shelf spatial/metric
    access methods)

30
Conclusions for indexing / visualization
  • GEMINI: fast indexing, exploiting off-the-shelf
    SAMs
  • FastMap: automatic feature extraction in O(N)
    time
  • FALCON: relevance feedback for disjunctive queries

31
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • New tools for Data Mining: Fractals
  • Conclusions
  • Resources

32
Data mining & fractals - Road map
  • Motivation problems / case study
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples

33
Problem 1 - spatial d.m.
  • Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
  • - spiral and elliptical galaxies
  • (stores and households; healthy and ill subjects)
  • - patterns? (not Gaussian; not uniform)
  • attraction/repulsion?
  • separability??

34
Problem 2 - dim. reduction
  • given attributes x1, ... xn
  • possibly, non-linearly correlated
  • drop the useless ones
  • (Q: why?
  • A: to avoid the dimensionality curse)

[Figure: mpg vs. engine size]
35
Answer
  • Fractals / self-similarities / power laws

36
What is a fractal?
  • self-similar point set, e.g., Sierpinski
    triangle

zero area; infinite length!
37
Definitions (cont'd)
  • Paradox: infinite perimeter; zero area!
  • dimensionality: between 1 and 2
  • actually: log(3)/log(2) = 1.58 (long story)

38
Intrinsic (fractal) dimension
E.g.: cylinders; miles / gallon
  • Q: fractal dimension of a line?

x y
5 1
4 2
3 3
2 4
39
Intrinsic (fractal) dimension
  • Q: fractal dimension of a line?
  • A: nn(<= r) ~ r^1

40
Intrinsic (fractal) dimension
  • Q: fractal dimension of a line?
  • A: nn(<= r) ~ r^1
  • Q: fd of a plane?
  • A: nn(<= r) ~ r^2
  • fd = slope of (log(nn) vs log(r))
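
The slope definition above can be checked numerically. A minimal sketch, assuming synthetic points spread uniformly on a line and on the unit square, and counting pairs within radius r rather than neighbors per point (the two differ only by a constant factor, so the log-log slope is the same). The function name and the radius range are illustrative choices.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Slope of log(number of pairs within r) versus log(r), by least squares."""
    p = np.asarray(points, dtype=float)
    diff = p[:, None, :] - p[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist = dist[np.triu_indices(len(p), k=1)]        # count each pair once
    counts = np.array([(dist <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
n = 1000
t = rng.random(n)
line = np.column_stack([t, 1.0 - t])     # points on a line embedded in 2-d
plane = rng.random((n, 2))               # points filling the unit square
radii = np.geomspace(0.01, 0.1, 10)

fd_line = correlation_dimension(line, radii)     # expected to be close to 1
fd_plane = correlation_dimension(plane, radii)   # expected to be close to 2
```

Both estimates tend to come out slightly below the ideal values because points near the boundary have fewer neighbors.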

41
Sierpinski triangle
correlation integral
42
Observations
  • self-similarity ->
  • <=> fractals
  • <=> scale-free
  • <=> power-laws (y = x^a, F = C * r^(-2))

[Figure: log(pairs within <= r) vs. log(r); slope 1.58]
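
A minimal sketch of how the 1.58 slope can be reproduced, assuming the Sierpinski triangle is sampled with the "chaos game" (jump halfway towards a randomly chosen corner), which is a swapped-in way to generate the point set and not something the slides describe. The sample size and radius range are arbitrary choices, and the estimate is only approximate.

```python
import numpy as np

def sierpinski(n, rng, burn_in=20):
    """Chaos game: repeatedly jump halfway towards a randomly chosen corner."""
    corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    p = np.array([0.3, 0.3])
    out = np.empty((n, 2))
    for i in range(-burn_in, n):          # discard the first burn_in points
        p = (p + corners[rng.integers(3)]) / 2
        if i >= 0:
            out[i] = p
    return out

def pair_counts(points, radii):
    """Number of point pairs within each radius (the correlation integral)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))[np.triu_indices(len(points), k=1)]
    return np.array([(dist <= r).sum() for r in radii])

rng = np.random.default_rng(1)
pts = sierpinski(1500, rng)
radii = np.geomspace(0.01, 0.2, 12)
slope, _ = np.polyfit(np.log(radii), np.log(pair_counts(pts, radii)), 1)
print(f"estimated fd = {slope:.2f}; log(3)/log(2) = {np.log(3) / np.log(2):.2f}")
```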
43
Road map
  • Motivation problems / case studies
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples
  • Conclusions

44
Solution 1 - spatial d.m.
  • Galaxies (Sloan Digital Sky Survey w/ B. Nichol -
    BOPS plot - SIGMOD 2000)
  • clusters?
  • separable?
  • attraction/repulsion?
  • data scrubbing - duplicates?

45
Solution 1 - spatial d.m.
[Figure: log(pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotations: 1.8 slope; plateau!; repulsion!]
46
Solution 1 - spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotations: 1.8 slope; plateau!; repulsion!]
47
spatial d.m.
Heuristic on choosing # of clusters
48
Solution 1 - spatial d.m.
[Figure: log(pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotations: 1.8 slope; plateau!; repulsion!]
49
Solution 1 - spatial d.m.
  • - 1.8 slope
  • - plateau!
  • repulsion!!

[Figure: log(pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotation: duplicates]
50
Problem 2 - dim. reduction
51
Solution
  • drop the attributes that don't increase the
    partial f.d. (PFD)
  • dfn: PFD of attribute set A is the f.d. of the
    projected cloud of points [w/ Traina, Traina, Wu,
    SBBD00]
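
A minimal sketch of the "drop what does not raise the PFD" idea, assuming a simple greedy forward pass over the attributes and a pair-count estimate of the fractal dimension. This is an illustration only, not the authors' algorithm: the threshold `min_gain`, the function names, and the synthetic attributes are all assumptions, and the result of a greedy pass depends on attribute order.

```python
import numpy as np

def fractal_dim(points, radii=np.geomspace(0.01, 0.1, 10)):
    """Pair-count estimate of the fractal dimension of a point cloud."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))[np.triu_indices(len(points), k=1)]
    counts = np.array([(dist <= r).sum() for r in radii])
    return np.polyfit(np.log(radii), np.log(counts), 1)[0]

def select_attributes(data, min_gain=0.3):
    """Greedy forward pass: keep an attribute only if it raises the PFD."""
    kept, pfd = [], 0.0
    for j in range(data.shape[1]):
        candidate = fractal_dim(data[:, kept + [j]])   # PFD of the projection
        if candidate - pfd > min_gain:
            kept.append(j)
            pfd = candidate
    return kept, pfd

rng = np.random.default_rng(0)
a = rng.random(1000)
b = a ** 2                    # non-linear function of a: adds no new dimension
c = rng.random(1000)          # independent attribute: adds one dimension
kept, pfd = select_attributes(np.column_stack([a, b, c]))
```

Here attribute b is dropped even though it is not linearly related to a, because the projected cloud (a, b) is still a curve with fractal dimension near 1.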

52
Problem 2 - dim. reduction
[Figure labels: global FD = 1; PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
53
Problem 2 - dim. reduction
Notice: max variance would fail here
[Figure labels: global FD = 1; PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
54
Problem 2 - dim. reduction
Notice: SVD would fail here
[Figure labels: global FD = 1; PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
55
Currency dataset
56
self-similar?
[Figure labels: eigenfaces, currency; fd = 1.98, fd = 4.25]
57
FDR on the currency dataset
58
FDR on the currency dataset
  • HKD: useless
  • >1.98 axes are needed

59
Road map
  • Motivation problems / case studies
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples
  • Conclusions

60
App.: traffic
  • disk traces: self-similar (also web traffic,
    comm. errors, etc.)

61
More apps: brain scans
  • Oct-trees; brain-scans

62
More fractals
  • stock prices (LYCOS) - random walks: 1.5

63
More fractals
  • coast-lines: 1.1-1.2 (up to 1.58)

64
(No Transcript)
65
Examples: MG county
  • Montgomery County of MD (road end-points)

66
Examples: LB county
  • Long Beach county of CA (road end-points)

67
More power laws: Zipf's law
  • Bible - rank vs frequency (log-log)

[Figure: log(freq) vs. log(rank); labeled words: "a", "the"]
68
More power laws
  • Freq. distr. of first names; last names
    (Mandelbrot)

69
Internet
  • Internet routers: how many neighbors within h
    hops?

U of Alberta
70
Internet topology
  • Internet routers: how many neighbors within h
    hops? [SIGCOMM 99]

[Figure: reachability function: number of neighbors within r hops, vs. r (log-log axes: log(pairs) vs. log(hops)); Mbone routers, 1995]
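
A minimal sketch of how such a reachability function can be computed, assuming an undirected graph stored as an adjacency dictionary and a breadth-first search from every node. The five-router chain is a hypothetical toy topology, not the Mbone data.

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """Return a list whose entry h-1 is the number of node pairs at most h hops apart."""
    counts = [0] * max_hops
    for source in adj:
        hops = {source: 0}
        queue = deque([source])
        while queue:                        # breadth-first search from source
            u = queue.popleft()
            if hops[u] == max_hops:
                continue                    # do not explore beyond max_hops
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        for node, h in hops.items():
            if node != source:
                for k in range(h - 1, max_hops):
                    counts[k] += 1          # reachable within h, h+1, ... hops
    return [c // 2 for c in counts]         # each unordered pair was seen twice

# Hypothetical toy topology: five routers in a chain 0-1-2-3-4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
reach = pairs_within_hops(chain, 4)
print(reach)  # [4, 7, 9, 10]
```

Plotting the logarithm of these counts against the logarithm of the hop count gives the kind of log-log plot the slide describes.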
71
More power laws: areas - Korcak's law
[ICDE99, w/ Proietti]
Scandinavian lakes
72
More power laws: areas - Korcak's law
[Figure: Scandinavian lakes: area vs. complementary cumulative count, i.e., log(count(> area)) vs. log(area)]
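
A minimal sketch of fitting the slope of such a complementary cumulative count on log-log axes. The lake data are not available here, so the areas are synthetic Pareto samples with an assumed exponent; `ccdf_slope` and `tail_cut` are illustrative names.

```python
import numpy as np

def ccdf_slope(values, tail_cut=0.01):
    """Fit the log-log slope of count(> area) versus area."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    count_greater = n - np.arange(1, n + 1)     # items strictly larger than v[i]
    keep = count_greater >= tail_cut * n        # skip the sparse, noisy tail
    slope, _ = np.polyfit(np.log(v[keep]), np.log(count_greater[keep]), 1)
    return slope

rng = np.random.default_rng(0)
exponent = 1.2                                   # assumed, for illustration only
areas = rng.pareto(exponent, size=20000) + 1.0   # synthetic "areas", all >= 1
slope = ccdf_slope(areas)                        # expected to be close to -exponent
```

A straight line on these axes, with slope equal to minus the exponent, is the signature of a power law.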
73
Olympic medals
[Figure: log(medals) vs. log(rank)]
74
More power laws
  • Energy of earthquakes (Gutenberg-Richter law)
    [simscience.org]

[Figure labels: log(count), amplitude, magnitude, day]
75
Even more power laws
  • Income distribution (Pareto's law)
  • sales distributions
  • duration of UNIX jobs
  • Distribution of UNIX file sizes
  • publication counts (Lotka's law)

76
Even more power laws
  • web hit frequencies (Huberman)
  • hyper-link distribution (Barabasi)

77
Overall Conclusions
  • Find similar/interesting things in multimedia
    databases
  • Indexing: feature extraction (GEMINI)
  • automatic feature extraction: FastMap
  • Relevance feedback: FALCON

78
Conclusions (cont'd)
  • New tools for Data Mining: Fractals/power laws
  • appear everywhere
  • lead to skewed distributions (not Gaussian, Poisson,
    uniform, or independent)
  • correlation integral for separability/cluster
    detection
  • PFD for dimensionality reduction

79
Conclusions (cont'd)
  • can model bursty time sequences
    (buffering/prefetching)
  • selectivity estimation (how many neighbors
    within x km?)
  • dim. curse diagnosis (it's the fractal dim. that
    matters! [ICDE2000])

80
Resources
  • Software and papers
  • http://www.cs.cmu.edu/~christos
  • Fractal dimension (FracDim)
  • Separability (SIGMOD 2000)
  • Relevance feedback for query by content (FALCON,
    VLDB 2000)