Multimedia Databases - PowerPoint PPT Presentation

About This Presentation

Title:

Multimedia Databases

Slides: 45

Provided by: GeorgeK159

Learn more at: https://www.cs.bu.edu

Transcript and Presenter's Notes

1
Multimedia Databases

Text II

2
Outline

Spatial Databases
Temporal Databases
Spatio-temporal Databases
Multimedia Databases
Text databases
Image and video databases
Time Series databases
Data Mining

3
Text - Detailed outline

Text databases
problem
full text scanning
inversion
signature files
clustering
information filtering and LSI

4
Vector Space Model and Clustering

keyword queries (vs Boolean)
each document -> vector (HOW?)
each query -> vector
search for similar vectors

5
Vector Space Model and Clustering

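main idea: each document becomes a vector with V entries (V = vocabulary size), one entry per vocabulary term ('aaron', ..., 'data', ..., 'indexing', ..., 'zoo')

[Figure: a document containing '...data...' mapped to its V-dimensional term vector]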
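
The mapping from a document to a vector can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and raw term counts as weights; the function names and the toy documents are hypothetical and not taken from the slides.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in documents for term in doc.lower().split()})

def to_vector(text, vocabulary):
    """Map a text to a list of V term counts, one slot per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical toy documents (assumption, for illustration only).
docs = ["data indexing data", "zoo data"]
vocab = build_vocabulary(docs)
doc_vectors = [to_vector(d, vocab) for d in docs]

# A keyword query is mapped into the same V-dimensional space.
query_vector = to_vector("indexing zoo", vocab)
```

Because queries and documents share one vector space, searching reduces to finding document vectors similar to the query vector.
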
6
Vector Space Model and Clustering

Then, group nearby vectors together
Q1: cluster search?
Q2: cluster generation?
Two significant contributions:
ranked output
relevance feedback

7
Vector Space Model and Clustering

cluster search: visit the (k) closest superclusters; continue recursively

[Figure: clusters labeled 'MD TRs' and 'CS TRs']
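
The recursive descent described on this slide might look like the sketch below. The tree layout (dictionaries with "centroid", "children" and "docs" keys), the use of cosine similarity to pick the closest superclusters, and the document names are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_search(node, query, k=1):
    """Descend into the k children whose centroids are closest to the query."""
    if "docs" in node:                      # leaf cluster: return its documents
        return list(node["docs"])
    closest = sorted(node["children"],
                     key=lambda child: cosine(child["centroid"], query),
                     reverse=True)[:k]
    found = []
    for child in closest:                   # continue recursively
        found.extend(cluster_search(child, query, k))
    return found

# Hypothetical two-cluster hierarchy over a 3-term vocabulary.
tree = {"children": [
    {"centroid": [1.0, 0.1, 0.0], "docs": ["cs_tr_1", "cs_tr_2"]},
    {"centroid": [0.0, 0.2, 1.0], "docs": ["md_tr_1"]},
]}
hits = cluster_search(tree, [1, 0, 0], k=1)
```

A larger k visits more clusters, which trades speed for a lower chance of missing relevant documents.
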
8
Vector Space Model and Clustering

ranked output: easy!

MD TRs
CS TRs
10
Vector Space Model and Clustering

relevance feedback (brilliant idea) [Rocchio, 1971]
How?

MD TRs
CS TRs
11
Vector Space Model and Clustering

How? A: by adding the 'good' (relevant) vectors to the query vector and subtracting the 'bad' (non-relevant) ones

[Figure: clusters labeled 'MD TRs' and 'CS TRs']
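
A minimal sketch of this add-and-subtract update is shown below. The weights alpha, beta and gamma, the averaging of each group of vectors, and the clipping of negative weights to zero are assumptions chosen for illustration; the slides only state the general idea.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant vectors and away from non-relevant ones."""
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    good, bad = mean(relevant), mean(non_relevant)
    updated = [alpha * q + beta * g - gamma * b
               for q, g, b in zip(query, good, bad)]
    # Assumption: negative term weights are clipped to zero.
    return [max(0.0, w) for w in updated]

new_query = rocchio(query=[1, 0, 0],
                    relevant=[[1, 1, 0]],
                    non_relevant=[[0, 0, 1]])
```

The reformulated query is then run again, so terms from documents the user marked as good gain weight in the next round.
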
12
Outline - detailed

main idea
cluster search
cluster generation
evaluation

13
Cluster generation

Problem: given N points in V dimensions, group them

15
Cluster generation

We need:
Q1: document-to-document similarity
Q2: document-to-cluster similarity

16
Cluster generation

Q1: document-to-document similarity
(recall the 'bag of words' representation)
D1: {data, retrieval, system}
D2: {lung, pulmonary, system}
distance/similarity functions?

17
Cluster generation

A1: # of words in common
A2: # of words in common, normalized by the vocabulary sizes
A3: ... etc.
All give about the same performance; the prevailing one is cosine similarity
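
The first two answers can be tried on D1 and D2 from the previous slide. The slide does not say which normalization is meant, so dividing by the size of the union (the Jaccard coefficient) is an assumption here, used as one plausible choice.

```python
def common_words(d1, d2):
    """A1: number of distinct words the two documents share."""
    return len(set(d1) & set(d2))

def jaccard(d1, d2):
    """A2 (one possible normalization): shared words divided by all distinct words."""
    union = set(d1) | set(d2)
    return len(set(d1) & set(d2)) / len(union) if union else 0.0

D1 = ["data", "retrieval", "system"]
D2 = ["lung", "pulmonary", "system"]

shared = common_words(D1, D2)      # only 'system' is shared
overlap = jaccard(D1, D2)          # 1 shared word out of 5 distinct words
```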

18
Cluster generation

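cosine similarity
$\mathrm{sim}(D_1, D_2) = \cos\theta = \dfrac{\sum_i v_{1,i}\, v_{2,i}}{\mathrm{len}(v_1)\,\mathrm{len}(v_2)}$

[Figure: vectors D1 and D2 with the angle θ between them]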
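
The formula above can be checked on the two example documents. This sketch assumes binary term weights and an alphabetically ordered five-term vocabulary; both choices are illustrative, not taken from the slides.

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    len1 = math.sqrt(sum(a * a for a in v1))
    len2 = math.sqrt(sum(b * b for b in v2))
    if len1 == 0 or len2 == 0:
        return 0.0          # assumption: an all-zero vector matches nothing
    return dot / (len1 * len2)

# Vocabulary order: data, lung, pulmonary, retrieval, system
v1 = [1, 0, 0, 1, 1]        # D1: data, retrieval, system
v2 = [0, 1, 1, 0, 1]        # D2: lung, pulmonary, system
sim = cosine_similarity(v1, v2)
```

With one shared term out of three in each document, the similarity is 1/3; identical directions give 1 and documents with no shared terms give 0.
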
19
Cluster generation

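cosine similarity - observations
related to the Euclidean distance: for unit-length vectors, $\lVert v_1 - v_2 \rVert^2 = 2\,(1 - \cos\theta)$
weights $v_{i,j}$ are set according to tf/idf

[Figure: vectors D1 and D2 with the angle θ between them]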
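
The link to the Euclidean distance can be verified numerically. The sketch below assumes both vectors are first scaled to unit length, in which case the squared Euclidean distance equals 2 (1 - cos θ); the example vectors are the binary encodings of D1 and D2 and are illustrative.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_of_unit_vectors(a, b):
    """For unit-length vectors the cosine is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

u1 = normalize([1, 0, 0, 1, 1])
u2 = normalize([0, 1, 1, 0, 1])

lhs = squared_euclidean(u1, u2)
rhs = 2 * (1 - cosine_of_unit_vectors(u1, u2))
```

Since the two quantities move together, ranking by decreasing cosine and ranking by increasing Euclidean distance agree once vectors are normalized.
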
20
Cluster generation

tf (term frequency):
high if the term appears very often in this document
idf (inverse document frequency):
penalizes common words that appear in almost every document
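
One common way to combine the two factors is weight = tf * log(N / df), where N is the number of documents and df the number of documents containing the term. The slides do not give a formula, so this variant and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # df: in how many documents each term occurs at least once
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection; 'system' occurs in every document.
docs = [["data", "data", "system"],
        ["lung", "system"],
        ["retrieval", "system"]]
w = tf_idf(docs)
```

Here 'system' gets weight 0 everywhere because it appears in all three documents, while 'data' is boosted for appearing twice in a single document.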

21
Cluster generation

We need:
Q1: document-to-document similarity
Q2: document-to-cluster similarity

23
Cluster generation

A1: min distance (single-link): leads to elongated clusters
A2: max distance (all-link, i.e., complete-link): many small, tight clusters
A3: avg distance: in between the above
A4: distance to centroid: fast to compute
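
The four options can be compared side by side. This sketch uses Euclidean distance between a document and the cluster members as an assumption; any document-to-document distance (for example 1 - cosine similarity) could be substituted. The function names and the toy points are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_link(doc, cluster):
    """A1: distance to the nearest member."""
    return min(euclidean(doc, member) for member in cluster)

def all_link(doc, cluster):
    """A2: distance to the farthest member."""
    return max(euclidean(doc, member) for member in cluster)

def average_link(doc, cluster):
    """A3: mean distance over all members."""
    return sum(euclidean(doc, member) for member in cluster) / len(cluster)

def centroid_distance(doc, cluster):
    """A4: distance to the cluster centroid (one distance computation per cluster)."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return euclidean(doc, centroid)

cluster = [[0.0, 0.0], [4.0, 0.0]]
doc = [0.0, 3.0]
```

For this document the four measures give 3, 5, 4 and about 3.61, which shows how the choice changes which cluster looks closest.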

24
Cluster generation

We have:
document-to-document similarity
document-to-cluster similarity
Q: How to group documents into 'natural' clusters?

25
Cluster generation

A: many, many algorithms, in two groups [Van Rijsbergen, 1979]:
theoretically sound (O(N^2)):
independent of the insertion order
iterative (O(N) or O(N log N))
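
As an example of the iterative group, the sketch below makes a single pass over the documents and assigns each one to the most similar existing centroid, or starts a new cluster. The slides do not name a specific algorithm, so this single-pass scheme, the cosine measure and the threshold value are all assumptions. Unlike the theoretically sound methods, its result depends on the order in which documents arrive.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass_clustering(vectors, threshold=0.5):
    """Return clusters as lists of vector indices, built in one pass."""
    clusters = []   # each entry: {"members": [indices], "centroid": [...]}
    for index, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:                       # nothing similar enough: new cluster
            clusters.append({"members": [index], "centroid": list(vec)})
        else:                                  # join and update the running mean
            best["members"].append(index)
            n = len(best["members"])
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], vec)]
    return [c["members"] for c in clusters]

vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
groups = single_pass_clustering(vectors, threshold=0.8)
```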

26
Outline - detailed

main idea
cluster search
cluster generation
evaluation

27
Evaluation

Q: How to measure the 'goodness' of one distance function vs. another?
A: ground truth (by humans), and precision and recall

28
Evaluation

precision = (retrieved and relevant) / retrieved
100% precision -> no false alarms
recall = (retrieved and relevant) / relevant
100% recall -> no false dismissals
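
Both measures follow directly from set intersections. In this sketch the document identifiers are hypothetical, and returning 0.0 when a denominator is empty is an assumption, since conventions for that edge case vary.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the ground truth."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# d3 and d4 are false alarms; d5 is a false dismissal.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d2", "d5"])
```

Here precision is 2/4 and recall is 2/3: two of the four returned documents are relevant, and two of the three relevant documents were found.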

29
Text - Detailed outline

text
problem
full text scanning
inversion
signature files
clustering
information filtering and LSI

30
LSI - Detailed outline

LSI
problem definition
main idea
experiments

31
Information Filtering LSI

[Foltz and Dumais, 1992] Goal:
users specify interests (= keywords)
system alerts them about suitable news documents
Major contribution: LSI = Latent Semantic Indexing
latent ('hidden') concepts

32
Information Filtering LSI

Main idea:
map each document into some 'concepts'
map each term into some 'concepts'
Concept: a set of terms, with weights, e.g.
'data' (0.8), 'system' (0.5), 'retrieval' (0.6) -> DBMS_concept

33
Information Filtering LSI

Pictorially: term-document matrix (BEFORE)

34
Information Filtering LSI

Pictorially: concept-document matrix and ...

35
Information Filtering LSI

... and concept-term matrix

36
Information Filtering LSI

Q: How to search, e.g., for 'system'?

37
Information Filtering LSI

A: find the corresponding concept(s), and then the corresponding documents
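
The two-step lookup (term to concepts, then concepts to documents) can be sketched with small hand-written tables. Only the DBMS_concept term weights come from the slides; the second concept, the document names and every concept-document strength are hypothetical. In real LSI these matrices come out of the SVD rather than being written by hand.

```python
# Concept-term weights; the DBMS_concept row reuses the weights quoted earlier.
concept_term = {
    "DBMS_concept":    {"data": 0.8, "system": 0.5, "retrieval": 0.6},
    "medical_concept": {"lung": 0.7, "pulmonary": 0.7},   # hypothetical
}
# Concept-document strengths (all hypothetical).
concept_doc = {
    "DBMS_concept":    {"doc1": 0.9, "doc2": 0.7, "doc3": 0.0},
    "medical_concept": {"doc1": 0.0, "doc2": 0.1, "doc3": 0.8},
}

def search(term):
    """Rank documents by (term-to-concept weight) * (concept-to-document strength)."""
    scores = {}
    for concept, term_weights in concept_term.items():
        weight = term_weights.get(term, 0.0)
        if weight == 0.0:
            continue                       # the term does not belong to this concept
        for doc, strength in concept_doc[concept].items():
            scores[doc] = scores.get(doc, 0.0) + weight * strength
    ranked = [(d, s) for d, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

results = search("system")
```

A document that is strong in DBMS_concept is returned for 'system' whether or not it literally contains that word, because the match happens at the concept level.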

39
Information Filtering LSI

Thus it works like an (automatically constructed) thesaurus:
we may retrieve documents that DON'T have the term 'system', but contain almost everything else ('data', 'retrieval')

40
LSI - Discussion - Conclusions

Great idea:
to derive 'concepts' from documents
to build a 'statistical thesaurus' automatically
to reduce dimensionality
Often leads to better precision/recall
but:
needs a 'training' set of documents
'concept' vectors are not sparse anymore

41
LSI - Discussion - Conclusions

Observations
Bellcore (-> Telcordia) has a patent
used for multi-lingual retrieval
How exactly does SVD work?

42
Indexing - Detailed outline

primary key indexing
secondary key / multi-key indexing
spatial access methods
fractals
text
SVD: a powerful tool
multimedia
...

43
References

Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12): 51-60.
Can, F. and E. A. Ozkarahan (Dec. 1990). "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases." ACM TODS 15(4): 483-517.
Rocchio, J. J. (1971). "Relevance Feedback in Information Retrieval." In The SMART Retrieval System - Experiments in Automatic Document Processing. G. Salton (ed.). Englewood Cliffs, New Jersey, Prentice-Hall Inc.

44
References - contd

Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey, Prentice-Hall Inc.
Salton, G. and M. J. McGill (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
Van Rijsbergen, C. J. (1979). Information Retrieval. London, England, Butterworths.
Zahn, C. T. (Jan. 1971). "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers C-20(1): 68-86.