Efficient Search in Very Large Text Collections, Databases, and Ontologies presentation

About This Presentation

Title:

Efficient Search in Very Large Text Collections, Databases, and Ontologies

Description:

large variety of challenging algorithmic problems with high practical relevance ... Violin Sonata No. 5 ..., according to Einstein's Mozart: His Character, His Work. ... –

Slides: 58

Provided by: holge2

Transcript and Presenter's Notes

Title: Efficient Search in Very Large Text Collections, Databases, and Ontologies

1
Efficient Search in Very Large Text Collections,
Databases, and Ontologies
DFG Priority Programme Algorithm
Engineering Kickoff Meeting in Karlsruhe,
December 2-3, 2007

Holger Bast
Max-Planck-Institut für Informatik
Saarbrücken, Germany

2
General theme of this project

Search engines
large variety of challenging algorithmic problems
with high practical relevance
algorithm engineering is absolutely essential
Focus on scalability
terabytes of data, hundreds of millions of
documents
query times in a fraction of a second
Focus on advanced queries
beyond Google-style keyword search
but still as efficient in time and space

efficiency is often a secondary issue in DB, AI,
CL, or ML research
Fancy Searches, yet Fast
3
Problems encountered in this project

Indexing: fast queries, succinct index, fast
construction
Index structures for advanced queries (beyond
keyword search)
How to build them fast
Learning from text: scalable, yet effective
large-scale spelling correction
large-scale synonymy detection
large-scale entity annotation
Basic Toolbox (for search)
fast intersection of (sorted) sequences
efficient (de)compression

algorythm → algorithm
web ≈ internet
Einstein → the physicist? the
physical unit? the musicologist?
possible synergies with Peter Sanders' project
I will give a few glimpses in the following
4
Prefix Completion

Fundamental search problem
definition on next slide
many notoriously difficult search problems can be
reduced to it
for example, faceted search
for, say, an article by Peter Sanders that
appeared in WEA 2007, add
author:Peter Sanders   Doc. 17
venue:WEA   Doc. 17
year:2007   Doc. 17

5
Prefix Completion Problem Definition
Data is given as
documents containing words
documents have ids (D1, D2, ...)
words have ids (A, B, C, ...)
Query
given a sorted list of doc ids
and a range of word ids

Example collection (from the figure)
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query
doc ids: D13 D17 D88
word range: C D E F G
6
Prefix Completion Problem Definition
Data is given as
documents containing words
documents have ids (D1, D2, ...)
words have ids (A, B, C, ...)
Query
given a sorted list of doc ids
and a range of word ids
Answer
all matching word-in-doc pairs
with scores
and positions

[Figure: the example collection and query from the previous slide, with the queried documents D13, D17, and D88 highlighted]
8
Prefix Completion via the Inverted Index

For example, algor eng
given the documents D13, D17, D88, ... (ids
of hits for algor)
and the word range C D E F G
(ids for eng)
Iterate over all words from the given range
C (engage): D8, D23, D291, ...
D (engel): D24, D36, D165, ...
E (engine): D13, D24, D88, ...
F (engines): D56, D129, D251, ...
G (engineering): D3, D15, D88, ...
Intersect each list with the given one and merge
the results
doc ids: D13 D88 D88
word ids: E E G

running time: |D| · |W| + log |W| · merge volume
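The procedure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the toy index contains just the doc ids that are visible on the slide (the slide's lists continue with "..."), and a plain sort stands in for a proper multi-way merge.

```python
from typing import Dict, List, Tuple


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted lists of doc ids with a linear two-pointer scan."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def complete_with_inverted_index(
    index: Dict[str, List[int]], docs: List[int], word_range: List[str]
) -> List[Tuple[int, str]]:
    """Intersect the list of every word in the range with `docs`, then merge."""
    pairs: List[Tuple[int, str]] = []
    for word in word_range:  # one intersection per word: |W| passes over D
        for doc in intersect_sorted(index.get(word, []), docs):
            pairs.append((doc, word))
    pairs.sort()  # stand-in for the merge step (assumption, not the talk's code)
    return pairs


# Toy data: only the doc ids shown on the slide.
toy_index = {
    "engage": [8, 23, 291],
    "engel": [24, 36, 165],
    "engine": [13, 24, 88],
    "engines": [56, 129, 251],
    "engineering": [3, 15, 88],
}
hits_for_algor = [13, 17, 88]
eng_range = ["engage", "engel", "engine", "engines", "engineering"]
result = complete_with_inverted_index(toy_index, hits_for_algor, eng_range)
print(result)
```

On the toy data this yields the pairs (D13, engine), (D88, engine), (D88, engineering), which agrees with the result row on the slide. The loop makes the cost visible: the given doc list is scanned once per word in the range, regardless of how few pairs are reported.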
9
Prefix Completion Status Quo Problems

The inverted index
highly compressible
perfect locality of access (T operations → T /
block size IOs)
but quadratic worst-case complexity
AutoTree [Bast, Weber, Mortensen, SPIRE06]
output-sensitive (query time linear in size of
output)
but poor locality of access (heavy use of bit
rank operations)
The half-inverted index [Bast, Weber, SIGIR06]
highly compressible, perfect locality of access
query time linear in the number of docs, with
small constant

Note: time for 100 disk seeks ≈ time for reading
200 MB of compressed data
99% correlation with actual running times
perfect prediction of time and space consumption
Major open problem: output-sensitive and
IO-efficient
10
Error-Tolerant Search

With prefix search available, this reduces to the
following
Problem: Given a set of distinct words (lexicon),
find all clusters of words that are spelling
variants of each other

machine
logarithm
algorithm
alogrithm
maschine
algorytm
logaythm
mahcine

Challenges
find appropriate measure of distance between
words
algorithm that scales in theory as well as in
practice
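To make the two challenges concrete, here is a naive baseline, not the method of the talk or of the thesis mentioned on this slide: plain Levenshtein distance as the distance measure, and single-link clustering over all word pairs. The function names and the threshold of 2 are assumptions chosen for illustration.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def cluster_variants(words: List[str], max_dist: int) -> List[List[str]]:
    """Single-link clustering: words within max_dist end up in one cluster."""
    parent: Dict[str, str] = {w: w for w in words}

    def find(w: str) -> str:
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, u in enumerate(words):  # all pairs: quadratic in the lexicon size
        for v in words[i + 1:]:
            if levenshtein(u, v) <= max_dist:
                parent[find(u)] = find(v)

    groups: Dict[str, List[str]] = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())


lexicon = ["machine", "logarithm", "algorithm", "alogrithm",
           "maschine", "algorytm", "logaythm", "mahcine"]
clusters = cluster_variants(lexicon, max_dist=2)
print(clusters)
```

On the slide's eight words this baseline shows both problems at once. The all-pairs loop is quadratic in the lexicon size, so it does not scale. The distance measure is also too coarse: "alogrithm" is within distance 2 of both "algorithm" and "logarithm", so the variants of the two different words are merged into a single cluster, while the "machine" variants form the other.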

possible synergies with Ernst Mayr's project
Master's thesis of Marjan Celikik (talk on
Wednesday)
11
Semantic Search Problems

Problem 1: how to index
previous engines built on top of a DBMS (e.g.,
Oracle)
DBMSs are hard to control (opposite of algorithm
engineering)
ongoing work: reduction to prefix search and join
Problem 2: integrate an ontology
relate words / phrases in text to entities from
the ontology
no time for deep parsing, reasoning etc.
learn from neighboring words
numerous algorithmic and engineering problems to
make it scale to something like Wikipedia (>
10,000,000,000 words)

DBMS = Database Management System
12
Semantic Search Entity Recognition

Recognize entities by looking at neighboring words

... Quantum inequalities ... Einstein's theory of General
Relativity amounts to a description ...
Albert Einstein, the physicist: is a physicist,
mathematician, vegetarian, person, entity, ...;
born in 1879
Alfred Einstein, the musicologist: is a
musicologist, scholar, intellectual, person,
entity, ...; born in 1880
... Violin Sonata No. 5 ..., according to Einstein's
Mozart: His Character, His Work. ...
13
Software

Enhance our prototype
improve source code, documentation,
integrate our results into the system
Make available to others
public demonstrators
as a platform for experimentation
as a fancy search engine construction toolkit

Thank you!
15
General theme of this project

Project title
Efficient Search in Very Large Text Collections,
Databases, and Ontologies
In short
Fancy searches, yet fast
advanced search, yet highly scalable
quality is an issue
but must not sacrifice performance
(as often happens in AI, CL, ML)

General
Search engines are a fascinating, multi-faceted
field of research giving rise to a multitude of
challenging algorithmic problems with a strong
algorithm engineering component and of high
practical relevance.

16
Overview (just for myself, not for the talk)

An index for prefix search
inverted index, our open problem, top-k
Building such an index
INV: sorting, HYB: semi-sorting
Error-tolerant search
reduce to spelling variants clustering, define
problem
Semantic Search
point out entity annotation problem

17
Prefix Search

Show demo
first explain prefix search
then how to use it for faceted search
use DBLP: show dblp.mpi-inf.mpg.de
Explain inverted index
show for example prefix query
point out IO-efficiency
point out compressibility
but quadratic worst-case complexity

18
Problems encountered in this project

Indexing: fast queries, succinct index, fast
construction
Index structures for advanced queries (beyond
keyword search)
How to build them fast
Learning from text: scalable, yet effective
large-scale spelling correction
large-scale synonymy detection
large-scale entity annotation
Fundamental problems
fast intersection of (sorted) sequences
efficient (de)compression

algorythm → algorithm
web ≈ internet
Einstein → the physicist? the
physical unit? the musicologist?
I will explain each of these in detail in the
following
19
Problems encountered in this project

Indexing: fast queries, succinct index, fast
construction
Index structures for advanced queries (beyond
keyword search)
How to build them fast
Learning from text: scalable, yet effective
large-scale spelling correction
large-scale synonymy detection
large-scale entity annotation
Fundamental problems
fast intersection of (sorted) sequences
efficient (de)compression

algorythm → algorithm
web ≈ internet
Einstein → the physicist? the
physical unit? the musicologist?
just kidding
20
Problems encountered in this project

Indexing: fast queries, succinct index, fast
construction
Index structures for advanced queries (beyond
keyword search)
How to build them fast
Learning from text: scalable, yet effective
large-scale spelling correction
large-scale synonymy detection
large-scale entity annotation
Fundamental problems
fast intersection of (sorted) sequences
efficient (de)compression

Example prefix search
algorythm → algorithm
Demo problem definition
web ≈ internet
Demo
Einstein → the physicist? the
physical unit? the musicologist?
I will give you a glimpse of some of these in the
following
22
Overview

Part 1
Definition of our prefix search problem
Applications
Demos of our search engine
Part 2
Problem definition again
One way to solve it
Another way to solve it
Your way to solve it

23
Part 1

Definition, Applications, Demos

24
Problem Definition Formal

Context-Sensitive Prefix Search
Preprocess
a given collection of text documents such that
queries of the following kind can be processed
efficiently
Given
an arbitrary set of documents D
and a range of words W
Compute
all word-in-document pairs (w, d) such that w ∈
W and d ∈ D
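As a reference point for the definition, a brute-force version that skips the preprocessing entirely might look as follows. It is only a specification-style sketch: the function name and the forward representation of the collection (doc id mapped to its word list) are assumptions, and the three-document toy collection uses single letters as word ids.

```python
from typing import Dict, Iterable, List, Tuple


def prefix_search_bruteforce(
    collection: Dict[int, List[str]],
    docs: Iterable[int],
    w_lo: str,
    w_hi: str,
) -> List[Tuple[int, str, int]]:
    """Return (doc, word, position) for every word in [w_lo, w_hi] in a doc of D."""
    out: List[Tuple[int, str, int]] = []
    for d in sorted(docs):
        for pos, w in enumerate(collection.get(d, [])):
            if w_lo <= w <= w_hi:  # word ids form a contiguous range
                out.append((d, w, pos))
    return out


# Hypothetical toy collection with letters as word ids.
toy_collection = {
    2: ["B", "F", "A"],
    13: ["A", "O", "E", "W", "H"],
    17: ["B", "W", "U", "K", "A"],
    88: ["P", "A", "E", "G", "Q"],
}
answer = prefix_search_bruteforce(toy_collection, [13, 17, 88], "C", "G")
print(answer)
```

This scans every word of every document in D, which is exactly the cost an index is supposed to avoid; it is useful mainly as a correctness check for faster solutions.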

25
Problem Definition Visual
Data is given as
documents containing words
documents have ids (D1, D2, ...)
words have ids (A, B, C, ...)
Query
given a sorted list of doc ids
and a range of word ids

Example collection (from the figure)
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query
doc ids: D13 D17 D88
word range: C D E F G
26
Problem Definition Visual
Data is given as
documents containing words
documents have ids (D1, D2, ...)
words have ids (A, B, C, ...)
Query
given a sorted list of doc ids
and a range of word ids
Answer
all matching word-in-doc pairs
with scores
and positions

[Figure: the example collection and query from the previous slide, with the queried documents D13, D17, and D88 highlighted]
28
Application 1 Autocompletion

After each keystroke
display completions of the last query word that
lead to the best hits, together with the best
such hits
e.g., for the query "google amp", display
"amphitheatre" and the corresponding hits

29
Application 2 Error Correction

As before, but also
display spelling variants of completions that
would lead to a hit
e.g., for the query "probabilistic algorithm", also
consider a document containing "probalistic
aigorithm"
Implementation
if, say, aigorithm occurs as a misspelling of
algorithm, then for every occurrence of aigorithm
in the index
aigorithm   Doc. 17
also add
algorithm:aigorithm   Doc. 17

30
Application 3 Query Expansion

As before, but also
display words related to completions that would
lead to a hit
e.g., for the query "russia metal", also consider
documents containing "russia aluminium"
Implementation
for, say, every occurrence of aluminium in the
index
aluminium   Doc. 17
also add (once for every occurrence)
s:67:aluminium   Doc. 17
and (only once for the whole collection)
s:aluminium:67   Doc. 00

31
Application 4 Faceted Search

As before, but also
along with the completions and hits, display a
breakdown of the result set by various categories
e.g., for the query "algorithm", show (prominent)
authors of articles containing these words
Implementation
for, say, an article by Thomas Hofmann that
appeared in NIPS 2004, add
author:Thomas_Hofmann   Doc. 17
venue:NIPS   Doc. 17
year:2004   Doc. 17
also add
thomas:author:Thomas_Hofmann   Doc. 17
hofmann:author:Thomas_Hofmann   Doc. 17
etc.
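The way such artificial index entries could be generated from article metadata is sketched below. The function name, the record layout, and the ":" separator are assumptions (the separator character did not survive in this transcript); only the overall scheme of facet words plus name-part-prefixed copies is taken from the slide.

```python
from typing import List, Tuple


def facet_entries(
    doc_id: int, authors: List[str], venue: str, year: int, sep: str = ":"
) -> List[Tuple[str, int]]:
    """Build the artificial (word, doc) entries that make facets searchable."""
    entries: List[Tuple[str, int]] = []
    for name in authors:
        facet = f"author{sep}{name.replace(' ', '_')}"
        entries.append((facet, doc_id))
        # One extra entry per name part, so that typing a prefix of
        # "thomas" or "hofmann" completes to the full author facet.
        for part in name.lower().split():
            entries.append((f"{part}{sep}{facet}", doc_id))
    entries.append((f"venue{sep}{venue}", doc_id))
    entries.append((f"year{sep}{year}", doc_id))
    return entries


entries = facet_entries(17, ["Thomas Hofmann"], "NIPS", 2004)
for word, doc in entries:
    print(word, "Doc.", doc)
```

With these entries in the index, a breakdown by author is just a prefix completion with the word range of all words starting with the author facet prefix, restricted to the documents matching the rest of the query.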

32
Application 5 Semantic Search

As before, but also
display semantic completions
e.g., for the query "beatles musician", display
instances of the class musician that occur
together with the word beatles
Implementation
cannot simply duplicate index entries of an
entity for each category it belongs to, e.g. John
Lennon is a
singer, songwriter, person, human being,
organism, guitarist, pacifist, vegetarian,
entertainer, musician, ...
tricky combination of completions and joins →
SIGIR07

and still more applications
33
Part 2

Solutions and Open Problem

34
Solution 1 Inverted Index

For example, probab alg
given the documents D13, D17, D88, ... (ids
of hits for probab)
and the word range C D E F G
(ids for alg)
Iterate over all words from the given range
C (algae): D8, D23, D291, ...
D (algarve): D24, D36, D165, ...
E (algebra): D13, D24, D88, ...
F (algol): D56, D129, D251, ...
G (algorithm): D3, D15, D88, ...
Intersect each list with the given one and merge
the results
doc ids: D13 D88 D88
word ids: E E G

running time: |D| · |W| + log |W| · merge volume
35
A General Idea

Precompute inverted lists for ranges of words

list for A-D

Note
each prefix corresponds to a word range
ideally precompute list for each possible prefix
too much space
but lots of redundancy

36
Solution 2 AutoTree
SPIRE06 / JIR07

Trick 1: Relative bit vectors
the i-th bit of the root node corresponds to the
i-th doc
the i-th bit of any other node corresponds to the
i-th set bit of its parent node

aachen-zyskowski: 1111111111111
(marked bit corresponds to doc 5)
maakeb-zyskowski: 1001000111101
(marked bit corresponds to doc 5)
maakeb-stream: 1001110
(marked bit corresponds to doc 10)
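The mapping from a bit in a relative bit vector back to a document can be sketched as follows. This is an illustration only: `select` and `doc_of_bit` are hypothetical helper names, bit vectors are plain strings, positions are 1-based, and the linear-time select stands in for the constant-time rank/select structures a real implementation would need.

```python
from typing import List


def select(bits: str, k: int) -> int:
    """Return the 1-based position of the k-th set bit in `bits`."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k set bits")


def doc_of_bit(ancestors: List[str], i: int) -> int:
    """Map the i-th bit of a node to a doc number.

    `ancestors` holds the bit vectors on the path from the root down to the
    node's parent. The i-th bit of a node is the i-th set bit of its parent,
    so we walk upwards with one select per level; at the root, bit i is doc i.
    """
    for parent_bits in reversed(ancestors):
        i = select(parent_bits, i)
    return i


root = "1111111111111"   # aachen-zyskowski
child = "1001000111101"  # maakeb-zyskowski
# The node maakeb-stream has the relative vector 1001110 (7 bits, one per
# set bit of its parent).
print(doc_of_bit([], 5))             # 5th bit of the root
print(doc_of_bit([root], 5))         # 5th bit of maakeb-zyskowski
print(doc_of_bit([root, child], 5))  # 5th bit of maakeb-stream
```

On the slide's vectors, the fifth bit of the root and of maakeb-zyskowski both map to doc 5, while the fifth bit of maakeb-stream maps to doc 10, in line with the annotations above. The repeated select calls also hint at why this layout has poor locality of access.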
37
Solution 2 AutoTree
SPIRE06 / JIR07

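Trick 2: Push up the words
For each node, by each set bit, store the
leftmost word of that doc that is not already
stored by a parent node

[Figure: example query with D = 5, 7, 10 and W = max, processed top-down; words listed in extraction order]
Root node
bit vector: 1 1 1 1 1 1 1 1 1 1
words: algorithm, advance, advance, advance, advance, aachen, aachen, aachen, algol, art
Second node
bit vector: 1 0 0 0 1 0 0 1 1 1
words: maximum, manning, maximal, maximal, manner
D = 5, 10 (→ 2, 5): report maximum
Third node
bit vector: 1 0 0 1 1
words: mazza, maple, middle
D = 5: report Ø → STOP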
38
Solution 2 AutoTree
SPIRE06 / JIR07

Trick 3: divide into blocks
and build a tree over each block as shown before

40
Solution 2 AutoTree
SPIRE06 / JIR07

Trick 3: divide into blocks
and build a tree over each block as shown before

Theorem
query processing time O(|D| + |output|)
uses no more space than an inverted index
AutoTree Summary
output-sensitive
not IO-efficient (heavy use of bit-rank
operations)
compression not optimal

99% correlation with actual running times
41
Parenthesis

Despite its quadratic worst-case complexity, the
inverted index is hard to beat in practice
very simple code
lists are highly compressible
perfect locality of access
Number of operations is a deceptive measure
100 disk seeks take about half a second
in that time one can read 200 MB of contiguous
data (if stored compressed)
main memory: 100 non-local accesses ≈ 10 KB
data block
42
Solution 3 HYB
SIGIR06 / IR07

Flat division of word range into blocks

list for A-D
list for E-J
list for K-N
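A flat division into blocks can be sketched as follows. This is a toy model and not the actual HYB data structure: the function names are hypothetical, blocks are uncompressed Python lists, and the fourth range O-Z is an assumption added so that every letter falls into some block.

```python
from typing import Dict, List, Tuple

Block = Tuple[str, str, List[Tuple[int, str]]]


def build_blocks(
    collection: Dict[int, List[str]], ranges: List[Tuple[str, str]]
) -> List[Block]:
    """One list of (doc, word) pairs per word range, sorted by doc id."""
    blocks: List[Block] = []
    for lo, hi in ranges:
        pairs = sorted(
            (d, w)
            for d, words in collection.items()
            for w in words
            if lo <= w <= hi
        )
        blocks.append((lo, hi, pairs))
    return blocks


def block_query(
    blocks: List[Block], docs: List[int], w_lo: str, w_hi: str
) -> List[Tuple[int, str]]:
    """Scan every block overlapping the word range; filter by doc and word."""
    wanted = set(docs)
    out: List[Tuple[int, str]] = []
    for lo, hi, pairs in blocks:
        if hi < w_lo or lo > w_hi:
            continue  # block does not overlap the queried range
        for d, w in pairs:  # a plain sequential scan of the block
            if d in wanted and w_lo <= w <= w_hi:
                out.append((d, w))
    return sorted(out)


toy_collection = {
    2: ["B", "F", "A"],
    13: ["A", "O", "E", "W", "H"],
    17: ["B", "W", "U", "K", "A"],
    88: ["P", "A", "E", "G", "Q"],
}
blocks = build_blocks(
    toy_collection, [("A", "D"), ("E", "J"), ("K", "N"), ("O", "Z")]
)
print(block_query(blocks, [13, 17, 88], "C", "G"))
```

The query C-G touches only the blocks A-D and E-J and reads each of them front to back. That is friendly to disk and cache, but the work depends on the block sizes rather than on the number of reported pairs.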
43
Solution 3 HYB
SIGIR06 / IR07

Flat division of word range into blocks

Replace doc ids by gaps and words by frequency
ranks

Encode both gaps and ranks such that x → log2 x
bits

gaps: 0 → 0, 1 → 10, 2 → 110
ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111,
4th (B) → 110
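The two transformations and one possible bit code can be sketched like this. Note the substitution: the slide does not name its code, so Elias-gamma is used here purely as a well-known stand-in that spends about 2·log2 x bits on x. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Replace sorted doc ids by differences to the previous id."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def frequency_ranks(words: List[str]) -> Dict[str, int]:
    """Rank words by decreasing frequency (1 = most frequent)."""
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}


def elias_gamma(x: int) -> str:
    """Elias-gamma code of x >= 1: unary length prefix, then binary digits."""
    if x < 1:
        raise ValueError("Elias-gamma is defined for x >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


# Within a block the same doc can occur repeatedly, so gaps may be 0;
# shifting by one before encoding keeps every value >= 1.
gaps = to_gaps([13, 13, 88])
codes = [elias_gamma(g + 1) for g in gaps]
ranks = frequency_ranks(list("AAAACCCDDB"))
print(gaps, codes, ranks)
```

Small gaps and small ranks get short codes, which is where the compression comes from: frequent words get low ranks, and dense lists have small gaps.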

An actual block of HYB

44
Solution 3 HYB
SIGIR06 / IR07

Flat division of word range into blocks

Theorem
Let n = number of documents, m = number of words
If blocks are chosen of equal volume $\varepsilon n$
Then query time is linear in n and the empirical entropy is
$H_{HYB} \le (1 + \varepsilon) \cdot H_{INV}$

HYB Summary
IO-efficient (mere scans of data)
very good compression
not output-sensitive

experimental results match perfectly
45
Conclusion

Context-sensitive prefix search
core mechanism of the CompleteSearch engine
simple enough to allow efficient realization
powerful enough to support many advanced search
features
Open problems
solution which is both output-sensitive and
IO-efficient
implement the whole thing using MapReduce
support yet more features

Thank you!
47
Processing the query beatles musician
Example documents (words annotated with positions)
Gitanes: ... legend says that John Lennon
entity:john_lennon of the Beatles smoked Gitanes
to deepen his voice ...
John Lennon: 0 entity:john_lennon, 1
relation:is_a, 2 class:musician, 2
class:singer, ...
Two prefix queries
beatles entity:
→ entity:john_lennon entity:1964 entity:liverpool etc.
entity: . relation:is_a . class:musician
→ entity:wolfang_amadeus_mozart entity:johann_sebastian_bach entity:john_lennon etc.
One join
→ entity:john_lennon etc.
48
Processing the query beatles musician
Example documents (words annotated with positions)
Gitanes: ... legend says that John Lennon
entity:john_lennon of the Beatles smoked Gitanes
to deepen his voice ...
John Lennon: 0 entity:john_lennon, 1
relation:is_a, 2 class:musician, 2
class:singer, ...
Query parts
beatles entity:
entity: . relation:is_a . class:musician

Problem: entity: has a huge number of
occurrences
200 million for Wikipedia, which is 20% of
all occurrences
prefix search efficient only for up to 1%
(explanation follows)

Solution: frontier classes
classes at appropriate level in the hierarchy
e.g. artist, believer, worker, vegetable,
animal, ...

49
Processing the query beatles musician
Example documents (words annotated with positions)
Gitanes: ... legend says that John Lennon
artist:john_lennon believer:john_lennon of the
Beatles smoked ...
John Lennon: 0 artist:john_lennon, 0
believer:john_lennon, 1 relation:is_a, 2
class:musician, ...
Two prefix queries
beatles artist:
→ artist:john_lennon artist:graham_greene artist:pete_best etc.
artist: . relation:is_a . class:musician
→ artist:wolfang_amadeus_mozart artist:johann_sebastian_bach artist:john_lennon etc.
One join
first figure out musician → artist (easy)
→ artist:john_lennon etc.
50
INV vs. HYB Space Consumption
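Theorem: The empirical entropy of INV is
$H_{INV} = \sum_i n_i \cdot \left( 1/\ln 2 + \log_2(n/n_i) \right)$
Theorem: The empirical entropy of HYB with block
size $\varepsilon n$ is
$H_{HYB} = \sum_i n_i \cdot \left( (1+\varepsilon)/\ln 2 + \log_2(n/n_i) \right)$
where $n_i$ = number of documents containing the i-th word, $n$ =
number of documents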
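The two bounds are easy to compare numerically. The sketch below reads each formula as a sum over words of n_i times (constant / ln 2 + log2(n / n_i)); the "+" inside the parentheses is a reconstruction, since the operator is missing in this transcript. The document frequencies and the value of ε are made up for illustration.

```python
import math
from typing import List


def entropy_inv(n: int, doc_freqs: List[int]) -> float:
    """Sum over words of n_i * (1/ln 2 + log2(n/n_i)), in bits."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(n: int, doc_freqs: List[int], eps: float) -> float:
    """Same sum with (1 + eps)/ln 2 in place of 1/ln 2, in bits."""
    return sum(
        ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs
    )


# Hypothetical collection: 1,000,000 docs and four words of varying frequency.
n = 1_000_000
doc_freqs = [500_000, 20_000, 300, 7]
eps = 0.1
h_inv = entropy_inv(n, doc_freqs)
h_hyb = entropy_hyb(n, doc_freqs, eps)
print(h_inv, h_hyb, h_hyb / h_inv)
```

Under this reading the two quantities differ by exactly ε / ln 2 bits per word occurrence, so the ratio stays below 1 + ε; for rare words the log2(n / n_i) term dominates and the relative overhead becomes small.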
Nice match of theory and practice
51
INV vs. HYB Query Time