CIS750 - PowerPoint PPT Presentation

Learn more at: https://cis.temple.edu

1
CIS750 Seminar in Advanced Topics in Computer Science
Advanced topics in databases
Multimedia Databases
  • V. Megalooikonomou
  • Text Databases
  • (some slides are based on notes by C. Faloutsos)

2
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

4
Problem - Motivation
  • E.g., find documents containing 'data',
    'retrieval'
  • Applications
  • Web
  • law + patent offices
  • digital libraries
  • information filtering

6
Problem - Motivation
  • Types of queries
  • boolean ('data' AND 'retrieval' AND NOT ...)
  • additional features ('data' ADJACENT 'retrieval')
  • keyword queries ('data', 'retrieval')
  • How to search a large collection of documents?

7
Full-text scanning
  • Build an FSA; scan

[Figure: FSA diagram with transitions labeled a, c, t]
9
Full-text scanning
  • for single term
  • (naive: O(N*M))
  • Knuth, Morris and Pratt ('77)
  • build a small FSA; visit every text letter once
    only, by carefully shifting more than one step

text: ABRACADABRA
pattern: CAB
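A minimal sketch of the Knuth-Morris-Pratt idea, assuming the usual "failure table" formulation of the small FSA; the function names are illustrative and not taken from the slides. Each text letter is visited once, and on a mismatch the pattern is shifted using the table instead of restarting.

```python
def build_failure(pattern):
    """failure[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (this table plays the role of the small FSA)."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    failure = build_failure(pattern)
    matches = []
    k = 0  # number of pattern letters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # shift the pattern, never move back in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]
    return matches
```

Note that the slide's pattern CAB does not actually occur in ABRACADABRA, so the search reports no match for it.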
10
Full-text scanning
text: ABRACADABRA
pattern: CAB
[Figure: the pattern CAB shown at successive alignments under the text]
11
Full-text scanning
  • for single term
  • (naive: O(N*M))
  • Knuth, Morris and Pratt ('77)
  • Boyer and Moore ('77)
  • preprocess pattern; start from right to left
    and skip!

text: ABRACADABRA
pattern: CAB
12
Full-text scanning
text: ABRACADABRA
pattern: CAB
[Figure: the pattern CAB shown at successive alignments under the text]
13
Full-text scanning
text: ABRACADABRA
pattern: OMINOUS
[Figure: the pattern OMINOUS aligned under the text]
Boyer-Moore: fastest in practice; Sunday ('90): some improvements
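The right-to-left comparison and the skipping can be sketched as follows. This is the simplified Horspool variant (bad-character shifts only), not the full Boyer-Moore algorithm with its second shift table; the function name is illustrative.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text,
    comparing right to left and skipping with a bad-character table."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # For each letter of the pattern (except the last), the distance from its
    # last occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1  # compare from the right end of the pattern
        if j < 0:
            matches.append(pos)
        # Letters that do not occur in the pattern allow a skip of m positions.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

For the pattern OMINOUS, no letter of ABRACADABRA occurs in the pattern, so every mismatch skips a full pattern length.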
14
Full-text scanning
  • For multiple terms (w/o "don't care" characters):
    Aho-Corasick ('75)
  • again, build a simplified FSA in O(M) time
  • Probabilistic algorithms: 'fingerprints' (Karp and
    Rabin, '87)
  • approximate match: 'agrep' (Wu and Manber,
    Baeza-Yates, '92)

15
Full-text scanning
  • Approximate matching - string editing distance
  • d('survey', 'surgery') = 2
  • = min. # of insertions, deletions,
    substitutions to transform the first string
    into the second
  • SURVEY
  • SURGERY

17
Full-text scanning
  • string editing distance - how to compute?
  • A: dynamic programming
  • cost(i, j) = cost to match the prefix of
    length i of the first string s with the prefix of
    length j of the second string t

18
Full-text scanning
  • if s[i] = t[j] then
  •   cost(i, j) = cost(i-1, j-1)
  • else
  •   cost(i, j) = min(
  •     1 + cost(i, j-1),   // deletion
  •     1 + cost(i-1, j-1), // substitution
  •     1 + cost(i-1, j)    // insertion
  •   )

19
Full-text scanning
  • Complexity: O(M*N) (when using a matrix to
    memoize partial results)
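The recurrence can be sketched with an explicit (M+1) x (N+1) matrix. The boundary row and column (matching a prefix against the empty string) are not spelled out on the slides and are filled in here as the standard assumption.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to transform s into t."""
    rows, cols = len(s) + 1, len(t) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i  # match a prefix of s against the empty string
    for j in range(cols):
        cost[0][j] = j  # match the empty string against a prefix of t
    for i in range(1, rows):
        for j in range(1, cols):
            if s[i - 1] == t[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(
                    cost[i][j - 1],      # skip a letter of t
                    cost[i - 1][j - 1],  # substitution
                    cost[i - 1][j],      # skip a letter of s
                )
    return cost[-1][-1]
```

Every cell is filled exactly once, which gives the O(M*N) time and space stated above.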

20
Full-text scanning
  • Conclusions
  • Full text scanning needs no space overhead, but
    is slow for large datasets

21
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

22
Text - Inversion
23
Text - Inversion
Q: space overhead?
24
Text - Inversion
A: mainly, the postings lists
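As a rough illustration of inversion, the sketch below keeps a dictionary that maps each term to its postings list (the ids of the documents containing it) and answers a boolean AND query by intersecting lists. The whitespace tokenizer, the in-memory dict and the sample documents are simplifying assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # doc ids arrive in increasing order
    return index


def boolean_and(index, *terms):
    """Ids of the documents that contain every one of the given terms."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = [
    "data retrieval systems",
    "lung and pulmonary systems",
    "retrieval of data",
]
index = build_inverted_index(docs)
hits = boolean_and(index, "data", "retrieval")
```

The dictionary is small compared with the postings lists, which is why the lists dominate the space overhead.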
25
Text - Inversion
  • how to organize dictionary?
  • stemming Y/N?
  • insertions?

26
Text - Inversion
  • how to organize dictionary?
  • B-tree, hashing, TRIEs, PATRICIA trees, ...
  • stemming Y/N?
  • insertions?

27
Text - Inversion
  • newer topics
  • Parallelism [Tomasic '93]
  • Insertions [Tomasic '94], [Brown]
  • Zipf distributions
  • Approximate searching ('glimpse' [Wu])

28
Text - Inversion
  • postings lists: more on the Zipf distr., e.g.,
    rank-frequency plot of the Bible

[Figure: log(freq) vs. log(rank) plot; freq ~ 1 / (rank * ln(1.78 V))]
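A small numeric check of the formula, reading freq as the fraction of all word occurrences taken by the word of a given rank (an assumption about the slide's notation). The constant 1.78 is approximately e raised to Euler's constant, so the frequencies over a vocabulary of size V sum to about 1.

```python
import math


def zipf_frequency(rank, vocabulary_size):
    """Relative frequency of the word with the given rank (1 = most frequent)."""
    return 1.0 / (rank * math.log(1.78 * vocabulary_size))


V = 10_000  # hypothetical vocabulary size
total = sum(zipf_frequency(rank, V) for rank in range(1, V + 1))
top_share = zipf_frequency(1, V)
```

On log-log axes this law is a straight line of slope -1, and the most frequent words have by far the longest postings lists.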
29
Text - Inversion
  • postings lists
  • [Cutting and Pedersen]
  • (keep first 4 in B-tree leaves)
  • how to allocate space [Faloutsos '92]
  • geometric progression
  • compression (Elias codes) [Zobel]: down to 2%
    overhead!
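A minimal sketch of postings-list compression with the Elias gamma code, assuming the list is stored as gaps between successive document ids (a common arrangement, not necessarily the exact scheme of the cited work). Bits are kept as a string of '0'/'1' characters for readability.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer: (len - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_postings(doc_ids):
    """Encode a strictly increasing list of positive doc ids as gamma-coded gaps."""
    bits, previous = [], 0
    for doc_id in doc_ids:
        bits.append(elias_gamma(doc_id - previous))
        previous = doc_id
    return "".join(bits)


def decode_postings(bits):
    """Inverse of encode_postings."""
    doc_ids, pos, previous = [], 0, 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # unary length prefix
            zeros += 1
            pos += 1
        previous += int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        doc_ids.append(previous)
    return doc_ids
```

Small gaps, which dominate the long postings lists of frequent words, take only a few bits each.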

30
Conclusions
  • Conclusions: inversion needs space overhead
    (2%-300%), but it is the fastest

31
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

33
Signature files
  • idea: 'quick & dirty' filter
  • then, do seq. scan on sign. file and discard
    'false alarms'
  • Adv.: easy insertions; faster than seq. scan
  • Disadv.: O(N) search (with small constant)
  • Q: how to extract signatures?

34
Signature files
  • A: superimposed coding!! [Mooers '49], ...

m (= 4 bits/word): 4 bits set to 1 and the rest left as 0
F (= 12 bits sign. size)
the bit patterns are OR-ed to form the document signature
35
Signature files
  • query word 'data': actual match
36
Signature files
  • query word 'retrieval': actual dismissal
37
Signature files
  • query word 'nucleotic': false alarm ('false drop')
38
Signature files
  • YES is MAYBE; NO is NO
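Superimposed coding can be sketched as below, using the slide's F = 12 and m = 4. How a word is mapped to its m bit positions is not given on the slides; hashing with SHA-256 is an assumption made only to keep the example deterministic.

```python
import hashlib

F = 12  # signature size in bits (value from the slide)
M = 4   # bits set to 1 per word (value from the slide)


def word_signature(word, f=F, m=M):
    """Set m distinct bit positions out of f, chosen by hashing the word."""
    positions = set()
    counter = 0
    while len(positions) < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % f)
        counter += 1
    return sum(1 << p for p in positions)


def document_signature(words):
    """OR the word signatures together."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature


def may_contain(doc_signature, word):
    """False means certainly absent; True means maybe present (could be a false drop)."""
    w = word_signature(word)
    return (doc_signature & w) == w


doc_words = ["data", "base", "management", "system"]
doc_sig = document_signature(doc_words)
```

A word of the document always passes the test, so there are no false dismissals; a word that is not in the document passes only if all of its bits happen to be set by other words.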
39
Signature files
  • Q1: How to choose F and m?
  • Q2: Why is it called 'false drop'?
  • Q3: other apps of signature files?

41
Signature files
  • Q1: How to choose F and m?
  • A: so that doc. signature is 50% full

m (= 4 bits/word); F (= 12 bits sign. size)
43
Signature files
  • Q2: Why is it called 'false drop'?
  • Old, but fascinating story [1949]
  • how to find qualifying books (by title word,
    and/or author, and/or keyword)
  • in O(1) time?
  • without computers

44
Signature files
  • Solution: edge-notched cards

[Figure: card with hole positions 1, 2, ..., 40 along its edge]
  • each title word is mapped to m numbers (how?)
  • and the corresponding holes are cut out

45
Signature files
  • Solution: edge-notched cards

[Figure: card with hole positions 1, 2, ..., 40; 'data' -> #1, #39]
46
Signature files
  • Search, e.g., for 'data': activate needles #1,
    #39, and shake the stack of cards!

[Figure: card with hole positions 1, 2, ..., 40; 'data' -> #1, #39]
47
Signature files
  • Also known as 'zatocoding', from the 'Zator' company.

49
Signature files
  • Q3: other apps of signature files?
  • A: anything that has to do with membership
    testing: does 'data' belong to the set of words
    of the document?

50
Signature files
  • UNIX's early 'spell' system [McIlroy]
  • Bloom-joins in System R* [Mackert] and 'active
    disks' [Riedel '99]
  • differential files [Severance and Lohman]

51
Signature files - conclusions
  • easy insertions; slower than inversion
  • brilliant idea of 'quick and dirty' filter:
    quickly discard the vast majority of
    non-qualifying elements, and focus on the rest.

52
References
  • Aho, A. V. and M. J. Corasick (June 1975). "Fast
    Pattern Matching: An Aid to Bibliographic
    Search." CACM 18(6): 333-340.
  • Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast
    String Searching Algorithm." CACM 20(10):
    762-772.
  • Brown, E. W., J. P. Callan, et al. (March 1994).
    Supporting Full-Text Information Retrieval with a
    Persistent Object Store. Proc. of EDBT
    conference, Cambridge, U.K., Springer Verlag.

53
References - cont'd
  • Faloutsos, C. and H. V. Jagadish (Aug. 23-27,
    1992). On B-tree Indices for Skewed
    Distributions. 18th VLDB Conference, Vancouver,
    British Columbia.
  • Karp, R. M. and M. O. Rabin (March 1987).
    "Efficient Randomized Pattern-Matching
    Algorithms." IBM Journal of Research and
    Development 31(2): 249-260.
  • Knuth, D. E., J. H. Morris, et al. (June 1977).
    "Fast Pattern Matching in Strings." SIAM J.
    Comput. 6(2): 323-350.

54
References - cont'd
  • Mackert, L. M. and G. M. Lohman (August 1986). R*
    Optimizer Validation and Performance Evaluation
    for Distributed Queries. Proc. of 12th Int. Conf.
    on Very Large Data Bases (VLDB), Kyoto, Japan.
  • Manber, U. and S. Wu (1994). GLIMPSE: A Tool to
    Search Through Entire File Systems. Proc. of
    USENIX Techn. Conf.
  • McIlroy, M. D. (Jan. 1982). "Development of a
    Spelling List." IEEE Trans. on Communications
    COM-30(1): 91-99.

55
References - cont'd
  • Mooers, C. (1949). Application of Random Codes to
    the Gathering of Statistical Information.
    Bulletin 31. Cambridge, Mass., Zator Co.
  • Cutting, D. and J. Pedersen (1990). Optimizations
    for dynamic inverted index maintenance. ACM SIGIR.
  • Riedel, E. (1999). Active Disks: Remote Execution
    for Network Attached Storage. ECE, CMU.
    Pittsburgh, PA.

56
References - cont'd
  • Severance, D. G. and G. M. Lohman (Sept. 1976).
    "Differential Files: Their Application to the
    Maintenance of Large Databases." ACM TODS 1(3):
    256-267.
  • Tomasic, A. and H. Garcia-Molina (1993).
    Performance of Inverted Indices in Distributed
    Text Document Retrieval Systems. PDIS.
  • Tomasic, A., H. Garcia-Molina, et al. (May 24-27,
    1994). Incremental Updates of Inverted Lists for
    Text Document Retrieval. ACM SIGMOD, Minneapolis,
    MN.

57
References - cont'd
  • Wu, S. and U. Manber (1992). "AGREP - A Fast
    Approximate Pattern-Matching Tool."
  • Zobel, J., A. Moffat, et al. (Aug. 23-27, 1992).
    An Efficient Indexing Technique for Full-Text
    Database Systems. VLDB, Vancouver, B.C., Canada.

58
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

59
Vector Space Model and Clustering
  • keyword queries (vs. Boolean)
  • each document -> vector (HOW?)
  • each query -> vector
  • search for 'similar' vectors

60
Vector Space Model and Clustering
  • main idea

[Figure: a document ("...data...") mapped to a vector with one entry per vocabulary term (aaron, ..., data, ..., indexing, ..., zoo); V = vocabulary size]
61
Vector Space Model and Clustering
  • Then, group nearby vectors together
  • Q1: cluster search?
  • Q2: cluster generation?
  • Two significant contributions:
  • ranked output
  • relevance feedback

62
Vector Space Model and Clustering
  • cluster search: visit the (k) closest
    superclusters; continue recursively

[Figure: clusters labeled "TU TRs" and "CS TRs"]
63
Vector Space Model and Clustering
  • ranked output: easy!

65
Vector Space Model and Clustering
  • relevance feedback (brilliant idea) [Rocchio '73]
  • How?

66
Vector Space Model and Clustering
  • How? A: by adding the 'good' vectors and
    subtracting the 'bad' ones

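Relevance feedback along these lines can be sketched as a query update: add the centroid of the documents the user marked as good and subtract the centroid of the bad ones. The weights alpha, beta and gamma and their default values are assumptions for illustration; the slides only state the add/subtract idea.

```python
def rocchio_update(query, good_docs, bad_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the 'good' vectors and away from the 'bad' ones."""

    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    good = centroid(good_docs)
    bad = centroid(bad_docs)
    return [alpha * q + beta * g - gamma * b for q, g, b in zip(query, good, bad)]


# Two-term vocabulary; the user liked a document about term 2 and disliked one about term 1.
new_query = rocchio_update([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]],
                           alpha=1.0, beta=1.0, gamma=1.0)
```

The updated query is then run again, and the loop can repeat until the user is satisfied.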
67
Outline - detailed
  • main idea
  • cluster search
  • cluster generation
  • evaluation

68
Cluster generation
  • Problem
  • given N points in V dimensions,
  • group them

70
Cluster generation
  • We need
  • Q1: document-to-document similarity
  • Q2: document-to-cluster similarity

71
Cluster generation
  • Q1: document-to-document similarity
  • (recall the 'bag of words' representation)
  • D1: 'data', 'retrieval', 'system'
  • D2: 'lung', 'pulmonary', 'system'
  • distance/similarity functions?

72
Cluster generation
  • A1: # of words in common
  • A2: ........ normalized by the vocabulary sizes
  • A3: .... etc.
  • About the same performance - prevailing one:
  • cosine similarity

73
Cluster generation
  • cosine similarity:
  • similarity(D1, D2) = cos(θ) =
  • sum_i(v1,i * v2,i) / (len(v1) * len(v2))

[Figure: vectors D1 and D2 separated by the angle θ]
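A small worked example of the formula on the two bag-of-words documents from the earlier slide, assuming binary term weights (1 if the word is present, 0 otherwise).

```python
import math


def to_vector(words, vocabulary):
    """Binary bag-of-words vector over a fixed vocabulary order."""
    return [1.0 if term in words else 0.0 for term in vocabulary]


def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    len1 = math.sqrt(sum(a * a for a in v1))
    len2 = math.sqrt(sum(b * b for b in v2))
    if len1 == 0 or len2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (len1 * len2)


d1 = {"data", "retrieval", "system"}
d2 = {"lung", "pulmonary", "system"}
vocabulary = sorted(d1 | d2)
similarity = cosine_similarity(to_vector(d1, vocabulary), to_vector(d2, vocabulary))
```

The two documents share one word out of three each, so the similarity is 1 / (sqrt(3) * sqrt(3)) = 1/3.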
74
Cluster generation
  • cosine similarity - observations
  • related to the Euclidean distance
  • weights v_i,j: according to tf/idf

[Figure: vectors D1 and D2 separated by the angle θ]
75
Cluster generation
  • tf (term frequency)
  • high, if the term appears very often in this
    document.
  • idf (inverse document frequency)
  • penalizes common words, that appear in almost
    every document
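One common way to combine the two factors is weight = tf * log(N / df), where N is the number of documents and df the number of documents containing the term. The slides do not fix a particular formula, so this variant is an assumption.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """One {term: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: tf[term] * math.log(n / df[term]) for term in tf})
    return vectors


docs = [
    ["data", "retrieval", "system", "data"],
    ["lung", "pulmonary", "system"],
]
weights = tf_idf_vectors(docs)
```

Here 'system' appears in every document, so its idf, and therefore its weight, is zero, while 'data' (twice in the first document) weighs twice as much as 'retrieval'.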

76
Cluster generation
  • We need
  • Q1: document-to-document similarity
  • Q2: document-to-cluster similarity

78
Cluster generation
  • A1: min distance ('single-link')
  • leads to elongated clusters
  • A2: max distance ('all-link')
  • many, small, tight clusters
  • A3: avg distance
  • in between the above
  • A4: distance to centroid
  • fast to compute
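The four document-to-cluster measures can be written side by side. Euclidean distance between vectors is used here only for illustration; any document-to-document distance could be plugged in.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_link(doc, cluster):
    """A1: distance to the closest member."""
    return min(euclidean(doc, member) for member in cluster)


def all_link(doc, cluster):
    """A2: distance to the farthest member."""
    return max(euclidean(doc, member) for member in cluster)


def average_link(doc, cluster):
    """A3: average distance to the members."""
    return sum(euclidean(doc, member) for member in cluster) / len(cluster)


def centroid_distance(doc, cluster):
    """A4: distance to the cluster centroid (one distance computation per cluster)."""
    centroid = [sum(column) / len(cluster) for column in zip(*cluster)]
    return euclidean(doc, centroid)


cluster = [(0.0, 0.0), (4.0, 0.0)]
doc = (0.0, 3.0)
```

A1 to A3 need one distance computation per cluster member, whereas A4 needs a single one once the centroid is known.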

79
Cluster generation
  • We have
  • document-to-document similarity
  • document-to-cluster similarity
  • Q: How to group documents into 'natural' clusters?

80
Cluster generation
  • A: many-many algorithms - in two groups
    [Van Rijsbergen]
  • theoretically sound (O(N^2))
  • independent of the insertion order
  • iterative (O(N), O(N log(N)))

81
Cluster generation - sound methods
  • Approach 1: dendrograms - create a hierarchy
    (bottom-up or top-down) - choose a cut-off (how?)
    and cut

[Figure: dendrogram over cat, tiger, horse, cow, with levels 0.1, 0.3, 0.8]
82
Cluster generation - sound methods
  • Approach 2: min. some statistical criterion (e.g.,
    sum of squares from cluster centers)
  • like 'k-means'
  • but how to decide 'k'?

83
Cluster generation - sound methods
  • Approach 3: graph theoretic [Zahn]
  • build MST
  • delete edges longer than 2.5 std of the local
    average

84
Cluster generation - sound methods
  • Result
  • variations
  • Complexity?

85
Cluster generation - iterative methods
  • general outline
  • Choose seeds (how?)
  • assign each vector to its closest seed (possibly
    adjusting cluster centroid)
  • possibly, re-assign some vectors to improve
    clusters
  • Fast and practical, but unpredictable
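The general outline can be sketched in a k-means-like form: seeds are supplied by the caller, each vector goes to its closest centroid, centroids are recomputed, and vectors are re-assigned until nothing changes. The squared Euclidean distance and the cap on the number of rounds are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def iterative_clustering(vectors, seeds, max_rounds=10):
    """Return (assignment, centroids); assignment[i] is the cluster index of vectors[i]."""
    centroids = [list(seed) for seed in seeds]
    assignment = None
    for _ in range(max_rounds):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # no vector changed cluster
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids = iterative_clustering(vectors, seeds=[vectors[0], vectors[2]])
```

The result depends on the chosen seeds, which is one source of the unpredictability noted above.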

87
Cluster generation
  • one way to estimate the # of clusters k: the 'cover
    coefficient' [Can] ~ SVD

88
Outline - detailed
  • main idea
  • cluster search
  • cluster generation
  • evaluation

89
Evaluation
  • Q: how to measure 'goodness' of one distance
    function vs. another?
  • A: ground truth (by humans) and
  • 'precision' and 'recall'

90
Evaluation
  • precision = (retrieved & relevant) / retrieved
  • 100% precision -> no false alarms
  • recall = (retrieved & relevant) / relevant
  • 100% recall -> no false dismissals
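Both measures follow directly from the set of retrieved documents and the human-judged set of relevant ones. The document ids below are made up for illustration, and returning 0.0 for an empty set is a convention chosen here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 3 truly relevant, 2 in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Documents 1 and 3 are false alarms (precision 2/4), and document 5 is a false dismissal (recall 2/3).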

91
References
  • Can, F. and E. A. Ozkarahan (Dec. 1990).
    "Concepts and Effectiveness of the
    Cover-Coefficient-Based Clustering Methodology
    for Text Databases." ACM TODS 15(4): 483-517.
  • Noreault, T., M. McGill, et al. (1983). A
    Performance Evaluation of Similarity Measures,
    Document Term Weighting Schemes and
    Representation in a Boolean Environment.
    Information Retrieval Research, Butterworths.
  • Rocchio, J. J. (1971). Relevance Feedback in
    Information Retrieval. The SMART Retrieval System
    - Experiments in Automatic Document Processing.
    G. Salton. Englewood Cliffs, New Jersey,
    Prentice-Hall Inc.

92
References - cont'd
  • Salton, G. (1971). The SMART Retrieval System -
    Experiments in Automatic Document Processing.
    Englewood Cliffs, New Jersey, Prentice-Hall Inc.
  • Salton, G. and M. J. McGill (1983). Introduction
    to Modern Information Retrieval, McGraw-Hill.
  • Van Rijsbergen, C. J. (1979). Information
    Retrieval. London, England, Butterworths.
  • Zahn, C. T. (Jan. 1971). "Graph-Theoretical
    Methods for Detecting and Describing Gestalt
    Clusters." IEEE Trans. on Computers C-20(1):
    68-86.

93
Text - Detailed outline
  • text
  • problem
  • full text scanning
  • inversion
  • signature files
  • clustering
  • information filtering and LSI

94
LSI - Detailed outline
  • LSI
  • problem definition
  • main idea
  • experiments

95
Information Filtering + LSI
  • [Foltz '92] Goal:
  • users specify interests (= keywords)
  • system alerts them on suitable news-documents
  • Major contribution: LSI = Latent Semantic
    Indexing
  • latent ('hidden') concepts

96
Information Filtering + LSI
  • Main idea
  • map each document into some 'concepts'
  • map each term into some 'concepts'
  • 'Concept': a set of terms, with weights, e.g.
  • 'data' (0.8), 'system' (0.5), 'retrieval' (0.6)
    -> DBMS_concept

97
Information Filtering + LSI
  • Pictorially: term-document matrix (BEFORE)

98
Information Filtering + LSI
  • Pictorially: concept-document matrix and...

99
Information Filtering + LSI
  • ... and concept-term matrix

100
Information Filtering + LSI
  • Q: How to search, e.g., for 'system'?

101
Information Filtering + LSI
  • A: find the corresponding concept(s) and the
    corresponding documents

103
Information Filtering + LSI
  • Thus it works like an (automatically constructed)
    thesaurus:
  • we may retrieve documents that DON'T have the
    term 'system', but they contain almost everything
    else ('data', 'retrieval')
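The search path term -> concept(s) -> documents can be illustrated with two small hand-written matrices. In actual LSI both matrices would be derived from the term-document matrix (via SVD); here every weight is hypothetical, except that the DBMS_concept row reuses the example weights from the earlier slide.

```python
# Concept-term matrix (hypothetical, apart from the DBMS_concept weights).
concept_term = {
    "DBMS_concept": {"data": 0.8, "system": 0.5, "retrieval": 0.6},
    "medical_concept": {"lung": 0.7, "pulmonary": 0.7},
}
# Concept-document matrix (hypothetical strengths).
concept_document = {
    "DBMS_concept": {"doc1": 0.9, "doc2": 0.8},
    "medical_concept": {"doc3": 0.9},
}
# The original term lists, kept only to show which terms each document has.
documents = {
    "doc1": ["data", "retrieval", "system"],
    "doc2": ["data", "retrieval"],
    "doc3": ["lung", "pulmonary"],
}


def search(term):
    """Rank documents by going term -> concepts -> documents."""
    scores = {}
    for concept, terms in concept_term.items():
        weight = terms.get(term, 0.0)
        if weight == 0.0:
            continue
        for doc, strength in concept_document[concept].items():
            scores[doc] = scores.get(doc, 0.0) + weight * strength
    return sorted(scores, key=scores.get, reverse=True)


results = search("system")
```

The query 'system' also returns doc2, which never mentions the term but belongs to the same concept.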

104
LSI - Detailed outline
  • LSI
  • problem definition
  • main idea
  • experiments

105
LSI - Experiments
  • 150 Tech Memos (TM) / month
  • 34 users submitted profiles (6-66 words per
    profile)
  • 100-300 concepts

106
LSI - Experiments
  • four methods, cross-product of
  • vector-space or LSI, for similarity scoring
  • keywords or document-sample, for profile
    specification
  • measured precision/recall

107
LSI - Experiments
  • LSI, with document-based profiles, was better

[Figure: precision-recall plot with marked points (0.25, 0.65), (0.50, 0.45), (0.75, 0.30)]
108
LSI - Discussion - Conclusions
  • Great idea,
  • to derive concepts from documents
  • to build a statistical thesaurus automatically
  • to reduce dimensionality
  • Often leads to better precision/recall
  • but
  • Needs training set of documents
  • concept vectors are not sparse anymore

109
LSI - Discussion - Conclusions
  • Observations
  • Bellcore (-> Telcordia) has a patent
  • used for multi-lingual retrieval
  • How exactly does SVD work?

110
Indexing - Detailed outline
  • primary key indexing
  • secondary key / multi-key indexing
  • spatial access methods
  • fractals
  • text
  • SVD: a powerful tool
  • multimedia
  • ...

111
References
  • Foltz, P. W. and S. T. Dumais (Dec. 1992).
    "Personalized Information Delivery: An Analysis
    of Information Filtering Methods." Comm. of ACM
    (CACM) 35(12): 51-60.