Multimedia Databases - PowerPoint PPT Presentation

Provided by: GeorgeK159
Learn more at: https://www.cs.bu.edu

1
Multimedia Databases
  • Text II

2
Outline
  • Spatial Databases
  • Temporal Databases
  • Spatio-temporal Databases
  • Multimedia Databases
  • Text databases
  • Image and video databases
  • Time Series databases
  • Data Mining

3
Text - Detailed outline
  • Text databases
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

4
Vector Space Model and Clustering
  • keyword queries (vs Boolean)
  • each document -> vector (HOW?)
  • each query -> vector
  • search for similar vectors
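
As a minimal sketch of the "document -> vector" step, assuming raw term counts as weights and whitespace tokenization (both are simplifications, and the toy documents and function names are made up for illustration):

```python
from collections import Counter

def build_vocabulary(documents):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in documents for term in doc.lower().split()})

def to_vector(text, vocabulary):
    """Map a text to a V-dimensional vector of raw term counts."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical toy collection, for illustration only.
documents = ["data indexing data", "zoo data", "aaron zoo"]
vocabulary = build_vocabulary(documents)
doc_vectors = [to_vector(d, vocabulary) for d in documents]

# A keyword query is mapped into the same V-dimensional space.
query_vector = to_vector("data zoo", vocabulary)
```

Documents and queries end up in the same space, so "search" reduces to finding document vectors close to the query vector.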

5
Vector Space Model and Clustering
  • main idea

[Figure: a document containing "...data..." mapped to a vector with one entry per vocabulary term (aaron, ..., data, ..., indexing, ..., zoo); V = vocabulary size]

6
Vector Space Model and Clustering
  • Then, group nearby vectors together
  • Q1: cluster search?
  • Q2: cluster generation?
  • Two significant contributions:
  • ranked output
  • relevance feedback

7
Vector Space Model and Clustering
  • cluster search: visit the (k) closest
    superclusters; continue recursively

[Figure: clusters labeled "MD TRs" and "CS TRs"]

8
Vector Space Model and Clustering
  • ranked output: easy!

[Figure: clusters labeled "MD TRs" and "CS TRs"]

9
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, 1971]

[Figure: clusters labeled "MD TRs" and "CS TRs"]

10
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio, 1971]
  • How?

[Figure: clusters labeled "MD TRs" and "CS TRs"]

11
Vector Space Model and Clustering
  • How? A: by adding the good vectors and
    subtracting the bad ones
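
A rough sketch of that update, in the commonly quoted weighted form (the weights alpha, beta, gamma and the function name are illustrative assumptions, not values from these slides):

```python
def rocchio_update(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query toward the mean of the good vectors and away from the bad ones."""
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    good_mean, bad_mean = mean(good_docs), mean(bad_docs)
    updated = [alpha * q + beta * g - gamma * b
               for q, g, b in zip(query, good_mean, bad_mean)]
    # Negative term weights are commonly clipped to zero.
    return [max(0.0, w) for w in updated]

# Hypothetical 3-term vocabulary; the user marked two documents good, one bad.
query = [1.0, 0.0, 0.0]
good = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
bad = [[0.0, 0.0, 1.0]]
new_query = rocchio_update(query, good, bad)
```

The reformulated query picks up weight on the second term (present in the good documents) and none on the third (present only in the bad one).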

[Figure: clusters labeled "MD TRs" and "CS TRs"]

12
Outline - detailed
  • main idea
  • cluster search
  • cluster generation
  • evaluation

13
Cluster generation
  • Problem:
  • given N points in V dimensions,
  • group them

15
Cluster generation
  • We need:
  • Q1: document-to-document similarity
  • Q2: document-to-cluster similarity

16
Cluster generation
  • Q1: document-to-document similarity
  • (recall the "bag of words" representation)
  • D1: {data, retrieval, system}
  • D2: {lung, pulmonary, system}
  • distance/similarity functions?

17
Cluster generation
  • A1: # of words in common
  • A2: # of words in common, normalized by the
    vocabulary sizes
  • A3: ... etc.
  • About the same performance; the prevailing one:
  • cosine similarity

18
Cluster generation
  • cosine similarity
  • $\mathrm{sim}(D_1, D_2) = \cos\theta
    = \dfrac{\sum_i v_{1,i}\, v_{2,i}}{\mathrm{len}(v_1)\,\mathrm{len}(v_2)}$
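
A small sketch of the formula, applied to the D1 and D2 example from the earlier slide. Binary (0/1) weights and the vocabulary order are assumptions made only for this illustration:

```python
import math

def cosine_similarity(v1, v2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    len1 = math.sqrt(sum(a * a for a in v1))
    len2 = math.sqrt(sum(b * b for b in v2))
    if len1 == 0 or len2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (len1 * len2)

# Assumed vocabulary order: data, lung, pulmonary, retrieval, system
d1 = [1, 0, 0, 1, 1]  # D1: data, retrieval, system
d2 = [0, 1, 1, 0, 1]  # D2: lung, pulmonary, system
similarity = cosine_similarity(d1, d2)
```

The two documents share only "system", so the similarity is low but not zero.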

[Figure: document vectors D1 and D2, separated by the angle θ]

19
Cluster generation
  • cosine similarity - observations
  • related to the Euclidean distance
  • weights $v_{i,j}$ according to tf/idf
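
One way to see the relation to the Euclidean distance, under the assumption that both vectors are normalized to unit length:

```latex
\[
\lVert v_1 - v_2 \rVert^2
  = \lVert v_1 \rVert^2 + \lVert v_2 \rVert^2 - 2\, v_1 \cdot v_2
  = 2 - 2\cos\theta
\]
```

So for unit-length vectors, ranking by decreasing cosine similarity gives the same order as ranking by increasing Euclidean distance.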

[Figure: document vectors D1 and D2, separated by the angle θ]

20
Cluster generation
  • tf (term frequency): high if the term appears
    very often in this document
  • idf (inverse document frequency): penalizes
    common words that appear in almost every
    document
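
A minimal sketch of one common tf/idf weighting, tf * log(N / df). The slides do not fix the exact formula, so this particular variant and the toy documents are assumptions:

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with weights tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

# Hypothetical toy collection: "system" appears in every document.
documents = ["data retrieval system", "lung pulmonary system", "data data system"]
vocabulary, vectors = tf_idf_vectors(documents)
```

Here "system" gets weight zero everywhere because it occurs in all documents, while "data" weighs more in the third document, where it occurs twice.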

21
Cluster generation
  • We need:
  • Q1: document-to-document similarity
  • Q2: document-to-cluster similarity

22
Cluster generation
  • A1: min distance (single-link)
  • A2: max distance (all-link)
  • A3: avg distance
  • A4: distance to centroid

23
Cluster generation
  • A1: min distance (single-link): leads to
    elongated clusters
  • A2: max distance (all-link): many, small, tight
    clusters
  • A3: avg distance: in between the above
  • A4: distance to centroid: fast to compute
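
The four options can be sketched side by side. Euclidean distance is used here purely as a stand-in for whichever document-to-document function is chosen, and the function names are illustrative:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_distance(doc, cluster):
    """A1, single-link: distance to the closest member."""
    return min(euclidean(doc, member) for member in cluster)

def max_distance(doc, cluster):
    """A2, all-link: distance to the farthest member."""
    return max(euclidean(doc, member) for member in cluster)

def avg_distance(doc, cluster):
    """A3: average distance over all members."""
    return sum(euclidean(doc, member) for member in cluster) / len(cluster)

def centroid_distance(doc, cluster):
    """A4: a single distance computation once the centroid is known."""
    centroid = [sum(column) / len(cluster) for column in zip(*cluster)]
    return euclidean(doc, centroid)

# Hypothetical 2-d example.
cluster = [[0.0, 0.0], [4.0, 0.0]]
doc = [0.0, 3.0]
```

A1 to A3 each compare the document against every cluster member; A4 compares it against a single point, which is why it is fast when the centroid is maintained incrementally.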

24
Cluster generation
  • We have:
  • document-to-document similarity
  • document-to-cluster similarity
  • Q: How to group documents into natural clusters?

25
Cluster generation
  • A: many, many algorithms, in two groups
    [Van Rijsbergen, 1979]:
  • theoretically sound (O(N^2))
  • independent of the insertion order
  • iterative (O(N), O(N log N))
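
As an illustration of the iterative group only, here is a sketch of a single-pass scheme: each document joins the most similar existing cluster if the similarity reaches a threshold, otherwise it starts a new cluster. This is one simple member of that family, not a method the slides single out, and the threshold value is an arbitrary assumption:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass_clusters(vectors, threshold=0.5):
    """Assign each vector, in input order, to the best cluster or open a new one."""
    clusters = []  # each cluster keeps its member indices and the sum of its vectors
    for index, vector in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            # Cosine to the vector sum equals cosine to the centroid.
            sim = cosine(vector, cluster["sum"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [index], "sum": list(vector)})
        else:
            best["members"].append(index)
            best["sum"] = [s + x for s, x in zip(best["sum"], vector)]
    return [cluster["members"] for cluster in clusters]

# Hypothetical 2-d vectors forming two obvious groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = single_pass_clusters(vectors)
```

Each document is visited once and compared only against the current clusters. The price is that the result can depend on the order in which documents arrive, unlike the methods in the first group.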

26
Outline - detailed
  • main idea
  • cluster search
  • cluster generation
  • evaluation

27
Evaluation
  • Q: how to measure goodness of one distance
    function vs another?
  • A: ground truth (by humans) and
  • precision and recall

28
Evaluation
  • precision = (retrieved and relevant) / retrieved
  • 100% precision -> no false alarms
  • recall = (retrieved and relevant) / relevant
  • 100% recall -> no false dismissals
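
Both measures are easy to compute from two sets of document ids. The sketch below uses made-up ids, and returning 0.0 for an empty set is a convention chosen here:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result and human-provided ground truth.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
```

In this example d2 and d4 are false alarms (they lower precision) and d5 is a false dismissal (it lowers recall).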

29
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

30
LSI - Detailed outline
  • LSI
  • problem definition
  • main idea
  • experiments

31
Information Filtering LSI
  • [Foltz and Dumais, 1992] Goal:
  • users specify interests (= keywords)
  • system alerts them to suitable news documents
  • Major contribution: LSI = Latent Semantic
    Indexing
  • latent (hidden) concepts

32
Information Filtering LSI
  • Main idea:
  • map each document into some concepts
  • map each term into some concepts
  • Concept: a set of terms, with weights, e.g.
  • data (0.8), system (0.5), retrieval (0.6)
    -> DBMS_concept

33
Information Filtering LSI
  • Pictorially: term-document matrix (BEFORE)

34
Information Filtering LSI
  • Pictorially: concept-document matrix and ...

35
Information Filtering LSI
  • ... and concept-term matrix

36
Information Filtering LSI
  • Q: How to search, e.g., for "system"?

37
Information Filtering LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

39
Information Filtering LSI
  • Thus it works like an (automatically constructed)
    thesaurus
  • we may retrieve documents that DON'T have the
    term "system", but contain almost everything
    else ("data", "retrieval")
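
The thesaurus effect can be sketched with hand-written concept weights. The DBMS_concept weights are the ones quoted on the earlier slide; everything else (the second concept, the documents, the function names) is made up for illustration, and no SVD is computed here, whereas real LSI would derive the weights from the term-document matrix:

```python
# Concept-term weights. Only the DBMS_concept row comes from the slides;
# the medical_concept row is a made-up example.
concept_terms = {
    "DBMS_concept": {"data": 0.8, "system": 0.5, "retrieval": 0.6},
    "medical_concept": {"lung": 0.7, "pulmonary": 0.7},
}

def to_concepts(terms):
    """Map a bag of terms to a strength per concept."""
    return {concept: sum(weights.get(term, 0.0) for term in terms)
            for concept, weights in concept_terms.items()}

def search(query_terms, documents):
    """Rank documents by concept-space dot product with the query."""
    query = to_concepts(query_terms)
    scores = {}
    for name, terms in documents.items():
        doc = to_concepts(terms)
        scores[name] = sum(query[c] * doc[c] for c in query)
    matches = [name for name, score in scores.items() if score > 0]
    return sorted(matches, key=lambda name: scores[name], reverse=True)

# Neither hypothetical document contains the term "system".
documents = {
    "doc_a": ["data", "retrieval"],
    "doc_b": ["lung", "pulmonary"],
}
results = search(["system"], documents)
```

The query term maps to DBMS_concept, and doc_a is retrieved through that concept even though the literal term never occurs in it.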

40
LSI - Discussion - Conclusions
  • Great idea:
  • to derive concepts from documents
  • to build a statistical thesaurus automatically
  • to reduce dimensionality
  • Often leads to better precision/recall
  • but:
  • needs a training set of documents
  • concept vectors are not sparse anymore

41
LSI - Discussion - Conclusions
  • Observations
  • Bellcore (-> Telcordia) has a patent
  • used for multi-lingual retrieval
  • How exactly does SVD work?

42
Indexing - Detailed outline
  • primary key indexing
  • secondary key / multi-key indexing
  • spatial access methods
  • fractals
  • text
  • SVD: a powerful tool
  • multimedia
  • ...

43
References
  • Foltz, P. W. and S. T. Dumais (Dec. 1992).
    "Personalized Information Delivery: An Analysis
    of Information Filtering Methods." Comm. of ACM
    (CACM) 35(12): 51-60.
  • Can, F. and E. A. Ozkarahan (Dec. 1990).
    "Concepts and Effectiveness of the
    Cover-Coefficient-Based Clustering Methodology
    for Text Databases." ACM TODS 15(4): 483-517.
  • Rocchio, J. J. (1971). "Relevance Feedback in
    Information Retrieval." In The SMART Retrieval
    System - Experiments in Automatic Document
    Processing, G. Salton (ed.). Englewood Cliffs,
    New Jersey, Prentice-Hall Inc.

44
References - contd
  • Salton, G. (1971). The SMART Retrieval System -
    Experiments in Automatic Document Processing.
    Englewood Cliffs, New Jersey, Prentice-Hall Inc.
  • Salton, G. and M. J. McGill (1983). Introduction
    to Modern Information Retrieval, McGraw-Hill.
  • Van-Rijsbergen, C. J. (1979). Information
    Retrieval. London, England, Butterworths.
  • Zahn, C. T. (Jan. 1971). "Graph-Theoretical
    Methods for Detecting and Describing Gestalt
    Clusters." IEEE Trans. on Computers C-20(1):
    68-86.