Title: CIS750
1CIS750 - Seminar in Advanced Topics in Computer Science
Advanced topics in databases - Multimedia Databases
- V. Megalooikonomou
- Text Databases
- (some slides are based on notes by C. Faloutsos)
2Text - Detailed outline
- text
- problem
- full text scanning
- inversion
- signature files
- clustering
- information filtering and LSI
4Problem - Motivation
- E.g., find documents containing 'data', 'retrieval'
- Applications:
  - Web
  - law + patent offices
  - digital libraries
  - information filtering
6Problem - Motivation
- Types of queries:
  - boolean ('data' AND 'retrieval' AND NOT ...)
  - additional features ('data' ADJACENT 'retrieval')
  - keyword queries ('data', 'retrieval')
- How to search a large collection of documents?
9Full-text scanning
- for single term:
  - (naive: O(N*M))
  - Knuth, Morris and Pratt ('77)
    - build a small FSA; visit every text letter once only, by carefully shifting more than one step
text: ABRACADABRA
pattern: CAB
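A minimal sketch of the Knuth-Morris-Pratt idea, for illustration only. The function names `failure_table` and `kmp_search` are assumptions of this sketch, not names from the slides. The failure table plays the role of the small FSA: on a mismatch the pattern is shifted using the table, and the text pointer never moves backwards.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (the 'FSA' built from the pattern)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):  # every text letter is visited once
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # shift the pattern, possibly by several steps
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits
```

With the slide's example, `kmp_search("ABRACADABRA", "CAB")` finds no occurrence, since the text contains 'CAD' but not 'CAB'.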
10Full-text scanning
text: ABRACADABRA
pattern: CAB
[Figure: the pattern CAB is slid along the text, one alignment after another]
11Full-text scanning
- for single term:
  - (naive: O(N*M))
  - Knuth, Morris and Pratt ('77)
  - Boyer and Moore ('77)
    - preprocess pattern; start from right to left and skip!
text: ABRACADABRA
pattern: CAB
12Full-text scanning
text: ABRACADABRA
pattern: CAB
[Figure: the pattern CAB is compared right to left and skipped ahead along the text]
13Full-text scanning
text: ABRACADABRA
pattern: OMINOUS
[Figure: the pattern OMINOUS is skipped ahead along the text]
- Boyer+Moore: fastest in practice
- Sunday ('90): some improvements
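The right-to-left comparison and the skip can be sketched as follows. Note that this is Horspool's simplification of Boyer-Moore (bad-character shifts only), swapped in here because it is short; it is not the full Boyer-Moore algorithm, nor Sunday's variant. The name `horspool_search` is an assumption of this sketch.

```python
def horspool_search(text, pattern):
    """Return the first index of pattern in text, or -1 if absent.

    Simplified Boyer-Moore (Horspool): compare right to left, then skip
    according to the text character aligned with the pattern's last position.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Preprocess the pattern: for each character except the last one,
    # the distance from its rightmost occurrence to the pattern's end.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos + m <= n:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1  # start from the right and move left
        if j < 0:
            return pos
        # Characters that do not occur in the pattern allow a full skip of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

For the pattern 'OMINOUS' over 'ABRACADABRA', the first window ends in 'D', which does not occur in the pattern, so the search skips 7 positions at once and terminates after a single window.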
14Full-text scanning
- For multiple terms (w/o "don't care" characters): Aho+Corasick ('75)
  - again, build a simplified FSA in O(M) time
- Probabilistic algorithms: 'fingerprints' (Karp + Rabin '87)
- approximate match: 'agrep' [Wu+Manber, Baeza-Yates+, '92]
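A hedged sketch of the fingerprint idea: hash the pattern, keep a rolling hash of the current text window, and compare the two numbers instead of the strings. The function name and the choices `base=256` and `mod=1_000_003` are assumptions of this sketch. Equal fingerprints are verified character by character here, so collisions cannot produce wrong answers.

```python
def karp_rabin_search(text, pattern, base=256, mod=1_000_003):
    """Return the start positions of pattern in text, using rolling fingerprints."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leftmost character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for start in range(n - m + 1):
        # Equal fingerprints only suggest a match; verify to rule out collisions.
        if w_hash == p_hash and text[start:start + m] == pattern:
            hits.append(start)
        if start + m < n:
            # Roll the window: drop text[start], append text[start + m].
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return hits
```

Passing a tiny modulus such as `mod=7` forces many fingerprint collisions; the result stays correct because of the verification step, only slower.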
15Full-text scanning
- Approximate matching - string editing distance:
  - d('survey', 'surgery') = 2
  - = min # of insertions, deletions, substitutions to transform the first string into the second
SURVEY
SURGERY
17Full-text scanning
- string editing distance - how to compute?
- A: dynamic programming
  - cost(i, j) = cost to match the prefix of length i of the first string s with the prefix of length j of the second string t
18Full-text scanning
- base cases: cost(i, 0) = i, cost(0, j) = j
- if s[i] = t[j] then
  - cost(i, j) = cost(i-1, j-1)
- else
  - cost(i, j) = min(
    - 1 + cost(i, j-1)   // insertion (of t[j])
    - 1 + cost(i-1, j-1) // substitution
    - 1 + cost(i-1, j)   // deletion (of s[i])
  - )
19Full-text scanning
- Complexity: O(M*N) (when using a matrix to 'memoize' partial results)
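The recurrence can be sketched directly as a table-filling loop. The function name `edit_distance` and the base cases (matching a prefix against the empty string costs its length) are additions of this sketch; the inner update follows the cost(i, j) definition above.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to transform s into t."""
    m, n = len(s), len(t)
    # cost[i][j] = distance between s[:i] and t[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        cost[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        cost[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]
            else:
                cost[i][j] = 1 + min(
                    cost[i][j - 1],      # insertion
                    cost[i - 1][j - 1],  # substitution
                    cost[i - 1][j],      # deletion
                )
    return cost[m][n]
```

The table has (M+1) x (N+1) cells and each is filled in constant time, which gives the O(M*N) bound. For the slide's example, `edit_distance("survey", "surgery")` is 2: substitute 'v' by 'g', then insert 'r'.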
20Full-text scanning
- Conclusions
- Full text scanning needs no space overhead, but
is slow for large datasets
21Text - Detailed outline
- text
- problem
- full text scanning
- inversion
- signature files
- clustering
- information filtering and LSI
22Text - Inversion
23Text - Inversion
- Q: space overhead?
24Text - Inversion
- A: mainly, the postings lists
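A minimal sketch of inversion, assuming whitespace tokenization and a plain Python dict as the dictionary (both are simplifications of this sketch; the function names are hypothetical). Each dictionary entry points to a postings list of document ids, and those lists account for most of the space.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to its postings list: the sorted ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return index


def boolean_and(index, *terms):
    """Answer a boolean AND query by intersecting postings lists."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = [
    "data retrieval systems",
    "lung and pulmonary data",
    "information retrieval",
]
index = build_inverted_index(docs)
matches = boolean_and(index, "data", "retrieval")  # only document 0 has both
```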
26Text - Inversion
- how to organize dictionary?
- B-tree, hashing, TRIEs, PATRICIA trees, ...
- stemming Y/N?
- insertions?
27Text - Inversion
- newer topics:
  - Parallelism [Tomasic+, 93]
  - Insertions [Tomasic+94], [Brown+]
  - Zipf distributions
  - Approximate searching ('glimpse' [Wu+])
28Text - Inversion
- postings lists: more on the Zipf distr., e.g., rank-frequency plot of the Bible
[Plot: log(freq) vs. log(rank)]
freq ~ 1 / (rank * ln(1.78 V))
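A small numeric check of the formula, under the assumption that V is the vocabulary size and freq is a relative frequency. The constant 1.78 is close to e^0.5772 (Euler's constant in the exponent), so ln(1.78 V) approximates the harmonic sum 1 + 1/2 + ... + 1/V, and the frequencies over all ranks add up to roughly 1. The function name and the value V = 10000 are choices of this sketch.

```python
import math


def zipf_relative_freq(rank, vocab_size):
    """Relative frequency of the word with the given rank (1 = most frequent)."""
    return 1.0 / (rank * math.log(1.78 * vocab_size))


def zipf_total(vocab_size):
    """Sum of the relative frequencies over the whole vocabulary."""
    return sum(zipf_relative_freq(r, vocab_size) for r in range(1, vocab_size + 1))


total = zipf_total(10_000)  # close to 1, as expected for relative frequencies
```

On a log-log plot the formula is a straight line of slope -1, which matches the shape of a rank-frequency plot.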
29Text - Inversion
- postings lists:
  - [Cutting+Pedersen]
    - (keep first 4 in B-tree leaves)
  - how to allocate space: [Faloutsos92]
    - geometric progression
  - compression (Elias codes) [Zobel+]: down to 2% overhead!
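A sketch of how an Elias code can compress a postings list. It uses the Elias gamma code and stores the gaps between consecutive document ids rather than the ids themselves; gap encoding, the string-of-bits representation, and the function names are assumptions of this sketch, not details taken from the slides.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer: (len - 1) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_postings(doc_ids):
    """Encode strictly increasing, positive doc ids as gamma-coded gaps."""
    bits = []
    prev = 0
    for doc_id in doc_ids:
        bits.append(elias_gamma(doc_id - prev))
        prev = doc_id
    return "".join(bits)


def decode_postings(bits):
    """Inverse of encode_postings."""
    doc_ids = []
    pos = 0
    prev = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # count the leading zeros of the next code
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap
        doc_ids.append(prev)
    return doc_ids
```

Small gaps get short codes, so the long postings lists of frequent words, where gaps are small, compress well.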
30Conclusions
- Conclusions: inversion needs space overhead (2%-300%), but it is the fastest
31Text - Detailed outline
- text
- problem
- full text scanning
- inversion
- signature files
- clustering
- information filtering and LSI
32Signature files
33Signature files
- idea: 'quick & dirty' filter
- then, do seq. scan on sign. file and discard 'false alarms'
- Adv.: easy insertions; faster than seq. scan
- Disadv.: O(N) search (with small constant)
- Q: how to extract signatures?
34Signature files
- A: superimposed coding!! [Mooers49], ...
  - m (= 4 bits/word): each word sets 4 bits to 1 and leaves the rest as 0
  - F (= 12 bits sign. size)
  - the bit patterns are OR-ed to form the document signature
35Signature files
- A: superimposed coding!! [Mooers49], ...
- searching for 'data': actual match
36Signature files
- searching for 'retrieval': actual dismissal
37Signature files
- searching for 'nucleotic': false alarm ('false drop')
38Signature files
- a YES from the signature is a MAYBE; a NO is a NO
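A minimal sketch of superimposed coding with the slide's parameters F = 12 and m = 4. How a word is mapped to its m bit positions is not specified in the slides; hashing with SHA-256 is an assumption of this sketch, as are the function names. The sketch requires m <= F.

```python
import hashlib


def word_signature(word, F=12, m=4):
    """F-bit signature of a word with exactly m bits set (positions chosen by hashing)."""
    positions = set()
    counter = 0
    while len(positions) < m:  # re-hash until m distinct positions are found
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % F)
        counter += 1
    signature = 0
    for p in positions:
        signature |= 1 << p
    return signature


def doc_signature(words, F=12, m=4):
    """OR the word signatures together to form the document signature."""
    signature = 0
    for word in words:
        signature |= word_signature(word, F, m)
    return signature


def may_contain(doc_sig, word, F=12, m=4):
    """True means MAYBE (could be a false drop); False means definitely NO."""
    w = word_signature(word, F, m)
    return (doc_sig & w) == w
```

A word that is in the document always passes the test, because all of its bits were OR-ed into the document signature. A word that is not in the document may still pass if other words happen to cover its bits, which is why every YES has to be checked against the actual text.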
39Signature files
- Q1: How to choose F and m?
- Q2: Why is it called 'false drop'?
- Q3: other apps of signature files?
41Signature files
- Q1: How to choose F and m?
- A: so that the doc. signature is 50% full
m (= 4 bits/word), F (= 12 bits sign. size)
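A back-of-the-envelope sketch of the 50%-full rule. It assumes a document with D distinct words (D is a symbol introduced here, not in the slides) and models each of the m*D bit choices as independent and uniform over the F positions. Under that model the expected fraction of 1-bits is 1 - (1 - 1/F)^(m*D), and it reaches one half when F is about m*D / ln 2.

```python
import math


def expected_fill(F, m, D):
    """Expected fraction of 1-bits in a document signature (independence model)."""
    return 1.0 - (1.0 - 1.0 / F) ** (m * D)


def signature_size_for_half_full(m, D):
    """Smallest F (in bits) for which the signature is expected to be about 50% full."""
    return math.ceil(m * D / math.log(2))


F = signature_size_for_half_full(m=4, D=40)  # 231 bits for 40 words at 4 bits/word
fill = expected_fill(F, m=4, D=40)           # close to 0.5
```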
42Signature files
- Q1: How to choose F and m?
- Q2: Why is it called 'false drop'?
- Q3: other apps of signature files?
43Signature files
- Q2: Why is it called 'false drop'?
- Old, but fascinating story [1949]
  - how to find qualifying books (by title word, and/or author, and/or keyword)
  - in O(1) time?
  - without computers
44Signature files
- Solution: edge-notched cards
[Figure: a card with hole positions 1, 2, ..., 40 along its edge]
- each title word is mapped to m numbers (how?)
- and the corresponding holes are cut out
45Signature files
- Solution: edge-notched cards
[Figure: a card with hole positions 1, 2, ..., 40, notched for the word 'data']
'data' -> 1, 39
46Signature files
- Search, e.g., for 'data': activate needles 1 and 39, and shake the stack of cards!
[Figure: a card with hole positions 1, 2, ..., 40, notched for the word 'data']
'data' -> 1, 39
47Signature files
- Also known as zatocoding, from Zator company.
48Signature files
- Q1: How to choose F and m?
- Q2: Why is it called 'false drop'?
- Q3: other apps of signature files?
49Signature files
- Q3: other apps of signature files?
- A: anything that has to do with 'membership testing': does 'data' belong to the set of words of the document?
50Signature files
- UNIX's early 'spell' system [McIlroy]
- Bloom-joins in System R* [Mackert+] and 'active disks' [Riedel99]
- differential files [Severance+Lohman]
51Signature files - conclusions
- easy insertions; slower than inversion
- brilliant idea of 'quick and dirty' filter: quickly discard the vast majority of non-qualifying elements, and focus on the rest.
52References
- Aho, A. V. and M. J. Corasick (June 1975). "Efficient String Matching: An Aid to Bibliographic Search." CACM 18(6): 333-340.
- Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast String Searching Algorithm." CACM 20(10): 762-772.
- Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. Proc. of EDBT conference, Cambridge, U.K., Springer Verlag.
53References - contd
- Faloutsos, C. and H. V. Jagadish (Aug. 23-27, 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia.
- Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): 249-260.
- Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2): 323-350.
54References - contd
- Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan.
- Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf.
- McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM-30(1): 91-99.
55References - contd
- Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information - Bulletin 31. Cambridge, Mass, Zator Co.
- Cutting, D. and J. Pedersen (1990). Optimizations for dynamic inverted index maintenance. ACM SIGIR.
- Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.
56References - contd
- Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): 256-267.
- Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS.
- Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.
57References - contd
- Wu, S. and U. Manber (1992). "AGREP - A Fast Approximate Pattern-Matching Tool."
- Zobel, J., A. Moffat, et al. (Aug. 23-27, 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.
58Text - Detailed outline
- text
- problem
- full text scanning
- inversion
- signature files
- clustering
- information filtering and LSI
59Vector Space Model and Clustering
- keyword queries (vs Boolean)
- each document -> vector (HOW?)
- each query -> vector
- search for 'similar' vectors
60Vector Space Model and Clustering
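[Figure: a document ('...data...') mapped to a vector with one position per vocabulary word, from 'aaron' through 'data' and 'indexing' to 'zoo'; V = vocabulary size]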
61Vector Space Model and Clustering
- Then, group nearby vectors together
  - Q1: cluster search?
  - Q2: cluster generation?
- Two significant contributions:
  - ranked output
  - relevance feedback
62Vector Space Model and Clustering
- cluster search: visit the (k) closest superclusters; continue recursively
[Figure: document clusters, labeled 'TU TRs' and 'CS TRs']
64Vector Space Model and Clustering
- relevance feedback (brilliant idea) [Rocchio73]
- How?
- A: by adding the 'good' vectors and subtracting the 'bad' ones
[Figure: document clusters, labeled 'TU TRs' and 'CS TRs']
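A hedged sketch of relevance feedback as "add the good vectors, subtract the bad ones". Averaging over each group and the weights `alpha`, `beta`, `gamma` (all 1.0 by default) are assumptions of this sketch; the slides do not give a specific formula.

```python
def rocchio_update(query, good_docs, bad_docs, alpha=1.0, beta=1.0, gamma=1.0):
    """Move the query vector towards the 'good' documents and away from the 'bad' ones.

    All vectors are equal-length lists of term weights.
    """
    new_query = [alpha * q for q in query]
    for docs, weight in ((good_docs, beta), (bad_docs, -gamma)):
        for doc in docs:
            for i, w in enumerate(doc):
                new_query[i] += weight * w / len(docs)  # average over the group
    return new_query


# The user marks one retrieved document as relevant and one as irrelevant.
updated = rocchio_update(
    query=[1.0, 0.0, 0.0],
    good_docs=[[1.0, 1.0, 0.0]],
    bad_docs=[[0.0, 0.0, 1.0]],
)
```

The updated query gains weight on the term it shares with the relevant document, picks up that document's second term, and gets a negative weight on the term that only the irrelevant document contains.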
67Outline - detailed
- main idea
- cluster search
- cluster generation
- evaluation
68Cluster generation
- Problem
- given N points in V dimensions,
- group them
70Cluster generation
- We need:
  - Q1: document-to-document similarity
  - Q2: document-to-cluster similarity
71Cluster generation
- Q1: document-to-document similarity
- (recall: 'bag of words' representation)
  - D1: {'data', 'retrieval', 'system'}
  - D2: {'lung', 'pulmonary', 'system'}
- distance/similarity functions?
72Cluster generation
- A1: # of words in common
- A2: ........ normalized by the vocabulary sizes
- A3: .... etc
- About the same performance - prevailing one: cosine similarity
73Cluster generation
- cosine similarity:
  - similarity(D1, D2) = cos(θ) = sum_i(v1,i * v2,i) / (len(v1) * len(v2))
[Figure: vectors D1 and D2, with the angle θ between them]
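A short sketch of the cosine formula applied to the two example documents D1 and D2. The binary (0/1) term weights and the alphabetical ordering of the five-word vocabulary are assumptions made here for illustration.

```python
import math


def cosine_similarity(v1, v2):
    """cos(theta) = sum_i(v1[i] * v2[i]) / (len(v1) * len(v2)), len = Euclidean length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    len1 = math.sqrt(sum(a * a for a in v1))
    len2 = math.sqrt(sum(b * b for b in v2))
    if len1 == 0 or len2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (len1 * len2)


# Vocabulary order: data, lung, pulmonary, retrieval, system
d1 = [1, 0, 0, 1, 1]  # D1: data, retrieval, system
d2 = [0, 1, 1, 0, 1]  # D2: lung, pulmonary, system
similarity = cosine_similarity(d1, d2)  # one shared word out of three each: 1/3
```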
74Cluster generation
- cosine similarity - observations:
  - related to the Euclidean distance
  - weights v_i,j: according to tf/idf
[Figure: vectors D1 and D2, with the angle θ between them]
75Cluster generation
- tf (term frequency)
  - high, if the term appears very often in this document.
- idf (inverse document frequency)
  - penalizes 'common' words, that appear in almost every document
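A minimal sketch of tf/idf weighting. The slides do not fix a formula; the choice weight = tf * log(N / df), with N the number of documents and df the number of documents containing the term, is one common variant and an assumption of this sketch, as is the function name.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


vectors = tf_idf_vectors([
    ["data", "retrieval", "system"],
    ["lung", "pulmonary", "system"],
])
```

Here 'system' appears in every document, so its idf is log(1) = 0 and it contributes nothing to the similarity, while 'data' keeps a positive weight.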
76Cluster generation
- We need:
  - Q1: document-to-document similarity
  - Q2: document-to-cluster similarity
78Cluster generation
- A1: min distance ('single-link')
  - leads to elongated clusters
- A2: max distance ('all-link')
  - many, small, tight clusters
- A3: avg distance
  - in between the above
- A4: distance to centroid
  - fast to compute
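The four document-to-cluster measures can be sketched side by side. Euclidean distance between vectors is an assumption of this sketch (any document-to-document distance could be plugged in), and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_distance(doc, cluster):
    """A1, single-link: distance to the closest member."""
    return min(euclidean(doc, member) for member in cluster)


def max_distance(doc, cluster):
    """A2, all-link: distance to the farthest member."""
    return max(euclidean(doc, member) for member in cluster)


def avg_distance(doc, cluster):
    """A3: average distance to all members."""
    return sum(euclidean(doc, member) for member in cluster) / len(cluster)


def centroid_distance(doc, cluster):
    """A4: distance to the cluster centroid (one distance computation per cluster)."""
    centroid = [sum(column) / len(cluster) for column in zip(*cluster)]
    return euclidean(doc, centroid)


cluster = [[0.0, 0.0], [4.0, 0.0]]
doc = [0.0, 3.0]
```

For this toy cluster the four measures give 3, 5, 4 and sqrt(13) respectively. Once the centroid is stored, A4 needs a single distance computation, whereas A1 to A3 touch every member.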
79Cluster generation
- We have:
  - document-to-document similarity
  - document-to-cluster similarity
- Q: How to group documents into 'natural' clusters?
80Cluster generation
- A: *many-many* algorithms - in two groups [VanRijsbergen]:
  - theoretically sound (O(N^2))
    - independent of the insertion order
  - iterative (O(N), O(N log(N)))
81Cluster generation - sound methods
- Approach#1: dendrograms - create a hierarchy (bottom up or top-down)
- choose a cut-off (how?) and cut
[Figure: dendrogram over 'cat', 'tiger', 'horse', 'cow', with levels marked 0.1, 0.3, 0.8]
82Cluster generation - sound methods
- Approach#2: minimize some statistical criterion (e.g., sum of squares from cluster centers)
  - like 'k-means'
  - but how to decide 'k'?
83Cluster generation - sound methods
- Approach#3: Graph theoretic [Zahn]:
  - build MST
  - delete edges longer than 2.5 * std of the local average
84Cluster generation - sound methods
86Cluster generation - iterative methods
- general outline:
  - Choose 'seeds' (how?)
  - assign each vector to its closest seed (possibly adjusting cluster centroid)
  - possibly, re-assign some vectors to improve clusters
- Fast and practical, but 'unpredictable'
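The general outline can be sketched as a k-means-style loop. Taking the first k vectors as seeds, using squared Euclidean distance, and capping the number of passes are assumptions of this sketch; the slides leave the choice of seeds open.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def iterative_clustering(vectors, k, max_passes=10):
    """Seed-based clustering: assign to the closest seed, adjust centroids, repeat."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: first k vectors are the seeds
    assignment = [None] * len(vectors)
    for _ in range(max_passes):
        changed = False
        # (Re-)assign each vector to its closest centroid.
        for idx, vec in enumerate(vectors):
            best = min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            if assignment[idx] != best:
                assignment[idx] = best
                changed = True
        # Adjust each centroid to the mean of its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    return assignment, centroids


points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0]]
assignment, centroids = iterative_clustering(points, k=2)
```

Each pass is linear in the number of vectors, which is what makes this family fast. The result depends on the seeds, so a different choice of seeds can produce different clusters.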
87Cluster generation
- one way to estimate the # of clusters k: the 'cover coefficient' [Can+] ~ SVD
88Outline - detailed
- main idea
- cluster search
- cluster generation
- evaluation
89Evaluation
- Q: how to measure 'goodness' of one distance function vs another?
- A: ground truth (by humans) and
  - 'precision' and 'recall'
90Evaluation
- precision = (retrieved & relevant) / retrieved
  - 100% precision -> no false alarms
- recall = (retrieved & relevant) / relevant
  - 100% recall -> no false dismissals
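A short sketch of the two measures over sets of document ids. The function name and the convention of returning 0.0 for an empty denominator are choices made here, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the ground truth."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 3 truly relevant, 2 in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
```

In this example half of the retrieved documents are false alarms (precision 0.5), and one relevant document out of three is a false dismissal (recall 2/3).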
91References
- Can, F. and E. A. Ozkarahan (Dec. 1990). "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases." ACM TODS 15(4): 483-517.
- Noreault, T., M. McGill, et al. (1983). A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representation in a Boolean Environment. Information Retrieval Research, Butterworths.
- Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. The SMART Retrieval System - Experiments in Automatic Document Processing. G. Salton. Englewood Cliffs, New Jersey, Prentice-Hall Inc.
92References - contd
- Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey, Prentice-Hall Inc.
- Salton, G. and M. J. McGill (1983). Introduction to Modern Information Retrieval, McGraw-Hill.
- Van-Rijsbergen, C. J. (1979). Information Retrieval. London, England, Butterworths.
- Zahn, C. T. (Jan. 1971). "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers C-20(1): 68-86.
93Text - Detailed outline
- text
- problem
- full text scanning
- inversion
- signature files
- clustering
- information filtering and LSI
94LSI - Detailed outline
- LSI
- problem definition
- main idea
- experiments
95Information Filtering LSI
- [Foltz+, 92] Goal:
  - users specify interests (= keywords)
  - system alerts them, on suitable news-documents
- Major contribution: LSI = Latent Semantic Indexing
  - latent ('hidden') concepts
96Information Filtering LSI
- Main idea:
  - map each document into some 'concepts'
  - map each term into some 'concepts'
- 'Concept': a set of terms, with weights, e.g.
  - 'data' (0.8), 'system' (0.5), 'retrieval' (0.6) -> DBMS_concept
97Information Filtering LSI
- Pictorially: term-document matrix (BEFORE)
98Information Filtering LSI
- Pictorially: concept-document matrix and...
99Information Filtering LSI
- ... and concept-term matrix
100Information Filtering LSI
- Q: How to search, e.g., for 'system'?
101Information Filtering LSI
- A: find the corresponding concept(s), and the corresponding documents
103Information Filtering LSI
- Thus it works like an (automatically constructed) thesaurus:
  - we may retrieve documents that DON'T have the term 'system', but they contain almost everything else ('data', 'retrieval')
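A toy sketch of searching through concepts. In LSI the term-to-concept weights are derived automatically from the term-document matrix; here they are written by hand, reusing the 0.8 / 0.5 / 0.6 weights of the DBMS concept from the earlier example, and the 'medical' concept with its weights is purely hypothetical. The function names are also assumptions of this sketch.

```python
# Hand-made term-to-concept weights (in LSI these would be computed, not typed in).
TERM_TO_CONCEPT = {
    "data": {"dbms": 0.8},
    "system": {"dbms": 0.5},
    "retrieval": {"dbms": 0.6},
    "lung": {"medical": 0.7},       # hypothetical weight
    "pulmonary": {"medical": 0.7},  # hypothetical weight
}


def to_concepts(terms):
    """Map a bag of terms (a document or a query) to a concept vector."""
    vector = {}
    for term in terms:
        for concept, weight in TERM_TO_CONCEPT.get(term, {}).items():
            vector[concept] = vector.get(concept, 0.0) + weight
    return vector


def concept_score(query_terms, doc_terms):
    """Dot product of the query and the document in concept space."""
    query = to_concepts(query_terms)
    doc = to_concepts(doc_terms)
    return sum(weight * doc.get(concept, 0.0) for concept, weight in query.items())


db_doc_score = concept_score(["system"], ["data", "retrieval"])    # no 'system' in doc
medical_doc_score = concept_score(["system"], ["lung", "pulmonary"])
```

The first document never mentions 'system', yet it scores 0.5 * (0.8 + 0.6) = 0.7 because it shares the DBMS concept with the query; the medical document scores 0.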
104LSI - Detailed outline
- LSI
- problem definition
- main idea
- experiments
105LSI - Experiments
- 150 Tech Memos (TM) / month
- 34 users submitted 'profiles' (6-66 words per profile)
- 100-300 concepts
106LSI - Experiments
- four methods, cross-product of:
  - vector-space or LSI, for similarity scoring
  - keywords or document-sample, for profile specification
- measured: precision/recall
107LSI - Experiments
- LSI, with document-based profiles, were better
[Plot: precision vs. recall, with labeled points (0.25, 0.65), (0.50, 0.45), (0.75, 0.30)]
108LSI - Discussion - Conclusions
- Great idea,
- to derive concepts from documents
- to build a statistical thesaurus automatically
- to reduce dimensionality
- Often leads to better precision/recall
- but
- Needs training set of documents
- concept vectors are not sparse anymore
109LSI - Discussion - Conclusions
- Observations
- Bellcore (-> Telcordia) has a patent
- used for multi-lingual retrieval
- How exactly does SVD work?
110Indexing - Detailed outline
- primary key indexing
- secondary key / multi-key indexing
- spatial access methods
- fractals
- text
- SVD: a powerful tool
- multimedia
- ...
111References
- Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12): 51-60.