Title: Efficient Search in Very Large Text Collections, Databases, and Ontologies
1Efficient Search in Very Large Text Collections, Databases, and Ontologies
DFG Priority Programme Algorithm Engineering
Kickoff Meeting in Karlsruhe, December 2-3, 2007
- Holger Bast
- Max-Planck-Institut für Informatik
- Saarbrücken, Germany
2General theme of this project
- Search engines
- large variety of challenging algorithmic problems with high practical relevance
- algorithm engineering is absolutely essential
- Focus on scalability
- terabytes of data, hundreds of millions of documents
- query times in a fraction of a second
- Focus on advanced queries
- beyond Google-style keyword search
- but still as efficient in time and space
efficiency is often a secondary issue in DB, AI, CL, or ML research
Fancy Searches, yet Fast
3Problems encountered in this project
- Indexing: fast queries, succinct index, fast construction
- Index structures for advanced queries (beyond keyword search)
- How to build them fast
- Learning from text: scalable, yet effective
- large-scale spelling correction
- large-scale synonymy detection
- large-scale entity annotation
- Basic Toolbox (for search)
- fast intersection of (sorted) sequences
- efficient (de)compression
algorythm → algorithm
web ≈ internet
Einstein → the physicist? the physical unit? the musicologist?
possible synergies with Peter Sanders' project
I will give a few glimpses in the following
4Prefix Completion
- Fundamental search problem
- definition on next slide
- many notoriously difficult search problems can be reduced to it
- for example, faceted search
- for, say, an article by Peter Sanders that appeared in WEA 2007, add
- author:Peter Sanders Doc. 17
- venue:WEA Doc. 17
- year:2007 Doc. 17
5Prefix Completion Problem Definition
- Data is given as
- documents containing words
- documents have ids (D1, D2, ...)
- words have ids (A, B, C, ...)
- Query
- given a sorted list of doc ids
- and a range of word ids
Example collection (from the figure):
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
6Prefix Completion Problem Definition
- Data is given as
- documents containing words
- documents have ids (D1, D2, ...)
- words have ids (A, B, C, ...)
- Query
- given a sorted list of doc ids
- and a range of word ids
- Answer
- all matching word-in-doc pairs
- with scores
- and positions
Example collection (from the figure):
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
8Prefix Completion via the Inverted Index
- For example, algor eng
- given the documents D13, D17, D88, ... (ids of hits for algor)
- and the word range C D E F G (ids for eng)
- Iterate over all words from the given range
- C (engage) D8, D23, D291, ...
- D (engel) D24, D36, D165, ...
- E (engine) D13, D24, D88, ...
- F (engines) D56, D129, D251, ...
- G (engineering) D3, D15, D88, ...
- Intersect each list with the given one and merge the results
- (D13, E), (D88, E), (D88, G), ...
running time: |D| · |W| + log |W| · merge volume
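The iterate, intersect, and merge steps on this slide can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's code: the function names are made up, the posting lists are truncated to the three ids shown on the slide, and a plain sort stands in for the merge step.

```python
def intersect(a, b):
    """Intersect two sorted lists of doc ids with a linear two-pointer scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def complete_with_inverted_index(doc_ids, word_range, inverted):
    """Return all (doc id, word id) pairs with word in word_range and doc in doc_ids."""
    pairs = []
    for word in word_range:  # one intersection per word in the range
        for doc in intersect(doc_ids, inverted.get(word, [])):
            pairs.append((doc, word))
    pairs.sort()  # simplification: a sort stands in for the merge step
    return pairs


# Posting lists from the slide, truncated to the ids shown there.
inverted = {
    "C": [8, 23, 291],
    "D": [24, 36, 165],
    "E": [13, 24, 88],
    "F": [56, 129, 251],
    "G": [3, 15, 88],
}
hits = [13, 17, 88]
result = complete_with_inverted_index(hits, "CDEFG", inverted)
print(result)
```

On this toy input the result is the three pairs shown on the slide. The loop performs one intersection per word in the range, which is where the |D| · |W| behaviour comes from when the range is large.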
9Prefix Completion Status Quo Problems
- The inverted index
- highly compressible
- perfect locality of access (T operations → T / block size IOs)
- but quadratic worst-case complexity
- AutoTree (Bast, Weber, Mortensen, SPIRE'06)
- output-sensitive (query time linear in size of output)
- but poor locality of access (heavy use of bit rank operations)
- The half-inverted index (Bast, Weber, SIGIR'06)
- highly compressible, perfect locality of access
- query time linear in the number of docs, with small constant
Note: time for 100 disk seeks ≈ time for reading 200 MB of compressed data
99% correlation with actual running times
perfect prediction of time and space consumption
Major open problem: output-sensitive and IO-efficient
10Error-Tolerant Search
- With prefix search available, reduces to the following
- Problem: Given a set of distinct words (lexicon), find all clusters of words that are spelling variants of each other
machine
logarithm
algorithm
alogrithm
maschine
algorytm
logaythm
mahcine
- Challenges
- find appropriate measure of distance between words
- algorithm that scales in theory as well as in practice
possible synergies with Ernst Mayr's project
Master's thesis of Marjan Celikik (talk on Wednesday)
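To make the clustering problem concrete, here is a small Python sketch. It is not the method from the project or the thesis mentioned above: the distance measure (optimal string alignment, i.e. edit distance with adjacent transpositions), the threshold, and the single-linkage grouping are all assumptions chosen for illustration, and the all-pairs loop is quadratic in the lexicon size, so it does not scale.

```python
def osa_distance(a, b):
    """Edit distance with insert, delete, substitute, and adjacent transposition."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def cluster(words, max_dist=1):
    """Group words connected by chains of pairs within max_dist (single linkage)."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:  # quadratic all-pairs comparison
            if osa_distance(a, b) <= max_dist:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())


lexicon = ["machine", "logarithm", "algorithm", "alogrithm",
           "maschine", "algorytm", "logaythm", "mahcine"]
for group in cluster(lexicon, max_dist=1):
    print(group)
```

With a threshold of 1 this sketch groups the three machine variants and pairs algorithm with alogrithm, but leaves algorytm and logaythm on their own. With a threshold of 2 it merges the algorithm and logarithm variants into one cluster, because alogrithm is two edits away from both. This illustrates why the choice of distance measure is listed as a challenge.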
11Semantic Search Problems
- Problem 1: how to index
- previous engines built on top of DBMS (e.g., Oracle)
- DBMSs are hard to control (opposite of algorithm engineering)
- ongoing work: reduction to prefix search and join
- Problem 2: integrate an ontology
- relate words / phrases in text to entities from ontology
- no time for deep parsing, reasoning etc.
- learn from neighboring words
- numerous algorithmic and engineering problems to make it scale to something like Wikipedia (> 10,000,000,000 words)
DBMS = Data Base Management System
12Semantic Search Entity Recognition
- Recognize entities by looking at neighboring words
Quantum inequalities ... Einstein's theory of General Relativity amounts to a description ...
Albert Einstein, the physicist: is a physicist, mathematician, vegetarian, person, entity, born in 1879
Alfred Einstein, the musicologist: is a musicologist, scholar, intellectual, person, entity, born in 1880
Violin Sonata No. 5 ..., according to Einstein's Mozart: His Character, His Work.
13Software
- Enhance our prototype
- improve source code, documentation, ...
- integrate our results into the system
- Make available to others
- public demonstrators
- as a platform for experimentation
- as a fancy search engine construction toolkit
Thank you!
15General theme of this project
- Project title
- Efficient Search in Very Large Text Collections, Databases, and Ontologies
- In short
- Fancy searches, yet fast
- advanced search, yet highly scalable
- quality is an issue
- but must not sacrifice performance
- (as often happens in AI, CL, ML)
- General
- Search engines are a fascinating, multi-faceted
field of research giving rise to a multitude of
challenging algorithmic problems with a strong
algorithm engineering component and of high
practical relevance.
16Overview just for myself not for the talk
- An Index for prefix search
- inverted index, our open problem, top-k
- Building such an index
- INV: sorting, HYB: semi-sorting
- Error-tolerant search
- reduce to spelling variants clustering, define problem
- Semantic Search
- point out entity annotation problem
17Prefix Search
- Show demo
- first explain prefix search
- then how to use it for faceted search
- use DBLP: show dblp.mpi-inf.mpg.de
- Explain inverted index
- show for example prefix query
- point out IO-efficiency
- point out compressibility
- but quadratic worst-case complexity
18Problems encountered in this project
- Indexing: fast queries, succinct index, fast construction
- Index structures for advanced queries (beyond keyword search)
- How to build them fast
- Learning from text: scalable, yet effective
- large-scale spelling correction
- large-scale synonymy detection
- large-scale entity annotation
- Fundamental problems
- fast intersection of (sorted) sequences
- efficient (de)compression
algorythm → algorithm
web ≈ internet
Einstein → the physicist? the physical unit? the musicologist?
I will explain each of these in detail in the
following
19Problems encountered in this project
- Indexing: fast queries, succinct index, fast construction
- Index structures for advanced queries (beyond keyword search)
- How to build them fast
- Learning from text: scalable, yet effective
- large-scale spelling correction
- large-scale synonymy detection
- large-scale entity annotation
- Fundamental problems
- fast intersection of (sorted) sequences
- efficient (de)compression
algorythm → algorithm
web ≈ internet
Einstein → the physicist? the physical unit? the musicologist?
just kidding
20Problems encountered in this project
- Indexing: fast queries, succinct index, fast construction
- Index structures for advanced queries (beyond keyword search)
- How to build them fast
- Learning from text: scalable, yet effective
- large-scale spelling correction
- large-scale synonymy detection
- large-scale entity annotation
- Fundamental problems
- fast intersection of (sorted) sequences
- efficient (de)compression
Example prefix search
algorythm → algorithm
Demo problem definition
web ≈ internet
Demo
Einstein → the physicist? the physical unit? the musicologist?
I will give you a glimpse of some of these in the
following
22Overview
- Part 1
- Definition of our prefix search problem
- Applications
- Demos of our search engine
- Part 2
- Problem definition again
- One way to solve it
- Another way to solve it
- Your way to solve it
23Part 1
- Definition, Applications, Demos
24Problem Definition Formal
- Context-Sensitive Prefix Search
- Preprocess
- a given collection of text documents such that queries of the following kind can be processed efficiently
- Given
- an arbitrary set of documents D
- and a range of words W
- Compute
- all word-in-document pairs (w, d) such that w ∈ W and d ∈ D
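As a reference point for the definition above, here is a brute-force Python sketch that scans the given documents directly, with no index at all. The function name and the tuple layout are assumptions made for illustration. The sample documents are taken from the visual example in this deck, where D17 is read as B W U K A (an assumption about the garbled source). Scores are omitted.

```python
def prefix_search(collection, doc_ids, word_lo, word_hi):
    """Return (word, doc id, position) for every word in [word_lo, word_hi]
    that occurs in one of the given documents."""
    matches = []
    for doc_id in doc_ids:  # doc ids are given in sorted order
        for position, word in enumerate(collection.get(doc_id, [])):
            if word_lo <= word <= word_hi:
                matches.append((word, doc_id, position))
    return matches


# A few documents from the visual example; word ids are single letters.
collection = {
    1: list("AOEWH"),
    9: list("EER"),
    13: list("AOEWH"),
    17: list("BWUKA"),  # assumed reading of the garbled "B WU K A"
    88: list("PAEGQ"),
}
answer = prefix_search(collection, [13, 17, 88], "C", "G")
print(answer)
```

Any index structure for this problem has to return the same pairs as this scan; the interesting question is how fast and in how much space.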
25Problem Definition Visual
- Data is given as
- documents containing words
- documents have ids (D1, D2, ...)
- words have ids (A, B, C, ...)
- Query
- given a sorted list of doc ids
- and a range of word ids
Example collection (from the figure):
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
26Problem Definition Visual
- Data is given as
- documents containing words
- documents have ids (D1, D2, ...)
- words have ids (A, B, C, ...)
- Query
- given a sorted list of doc ids
- and a range of word ids
- Answer
- all matching word-in-doc pairs
- with scores
- and positions
Example collection (from the figure):
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
28Application 1 Autocompletion
- After each keystroke
- display completions of the last query word that lead to the best hits, together with the best such hits
- e.g., for the query google amp display amphitheatre and the corresponding hits
29Application 2 Error Correction
- As before, but also
- display spelling variants of completions that would lead to a hit
- e.g., for the query probabilistic algorithm also consider a document containing probalistic aigorithm
- Implementation
- if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index
- aigorithm Doc. 17
- also add
- algorithmaigorithm Doc. 17
30Application 3 Query Expansion
- As before, but also
- display words related to completions that would lead to a hit
- e.g., for the query russia metal also consider documents containing russia aluminium
- Implementation
- for, say, every occurrence of aluminium in the index
- aluminium Doc. 17
- also add (once for every occurrence)
- s67aluminium Doc. 17
- and (only once for the whole collection)
- saluminium67 Doc. 00
31Application 4 Faceted Search
- As before, but also
- along with the completions and hits, display a breakdown of the result set by various categories
- e.g., for the query algorithm show (prominent) authors of articles containing these words
- Implementation
- for, say, an article by Thomas Hofmann that appeared in NIPS 2004, add
- author:Thomas_Hofmann Doc. 17
- venue:NIPS Doc. 17
- year:2004 Doc. 17
- also add
- thomas:author:Thomas_Hofmann Doc. 17
- hofmann:author:Thomas_Hofmann Doc. 17
- etc.
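The facet trick can be sketched in a few lines of Python: artificial words are added to the index next to the ordinary ones, and a facet breakdown is then a prefix query restricted to the current hits. Everything here is an illustrative assumption rather than the engine's actual format: the ":" separator (the separator is not legible in this transcript), the function names, the second document id, and the text words.

```python
from collections import Counter


def add_article(index, doc_id, words, author, venue, year):
    """Add ordinary words plus artificial facet words for one article."""
    facet_words = [f"author:{author}", f"venue:{venue}", f"year:{year}"]
    # Name parts pointing to the full author, so typing "thomas" can complete to it.
    facet_words += [f"{part}:author:{author}" for part in author.lower().split("_")]
    for word in list(words) + facet_words:
        index.setdefault(word, set()).add(doc_id)


def facet_breakdown(index, hits, prefix):
    """Count, for each index word starting with prefix, the hits that contain it."""
    counts = Counter()
    for word, docs in index.items():
        if word.startswith(prefix):
            n = len(docs & hits)
            if n:
                counts[word] = n
    return counts


index = {}
# Doc 17 follows the slide; doc 18 and all text words are made up.
add_article(index, 17, ["probabilistic", "algorithm"], "Thomas_Hofmann", "NIPS", 2004)
add_article(index, 18, ["algorithm", "engineering"], "Peter_Sanders", "WEA", 2007)

hits = index["algorithm"]
authors = facet_breakdown(index, hits, "author:")
print(authors)
```

The linear scan over all index words is only for brevity; in the setting of this talk the words starting with a prefix form a contiguous range of word ids, so the breakdown is one prefix query.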
32Application 5 Semantic Search
- As before, but also
- display semantic completions
- e.g., for the query beatles musician display instances of the class musician that occur together with the word beatles
- Implementation
- cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a
- singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, ...
- tricky combination of completions and joins → SIGIR'07
and still more applications
33Part 2
- Solutions and Open Problem
34Solution 1 Inverted Index
- For example, probab alg
- given the documents D13, D17, D88, ... (ids of hits for probab)
- and the word range C D E F G (ids for alg)
- Iterate over all words from the given range
- C (algae) D8, D23, D291, ...
- D (algarve) D24, D36, D165, ...
- E (algebra) D13, D24, D88, ...
- F (algol) D56, D129, D251, ...
- G (algorithm) D3, D15, D88, ...
- Intersect each list with the given one and merge the results
- (D13, E), (D88, E), (D88, G), ...
running time: |D| · |W| + log |W| · merge volume
35A General Idea
- Precompute inverted lists for ranges of words
list for A-D
- Note
- each prefix corresponds to a word range
- ideally precompute list for each possible prefix
- too much space
- but lots of redundancy
36Solution 2 AutoTree
SPIRE06 / JIR07
- Trick 1: Relative bit vectors
- the i-th bit of the root node corresponds to the i-th doc
- the i-th bit of any other node corresponds to the i-th set bit of its parent node
aachen-zyskowski 1111111111111
corresponds to doc 5
maakeb-zyskowski 1001000111101
corresponds to doc 5
maakeb-stream 1001110
corresponds to doc 10
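The relative numbering can be made concrete with a short Python sketch that translates a bit position in any node back to an absolute document number by walking up to the root. The function names are made up, bit vectors are plain strings for readability, and `select` is a linear scan here, whereas a real implementation would use a rank/select structure.

```python
def select(bits, k):
    """Return the 1-based position of the k-th set bit in a bit string."""
    count = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            count += 1
            if count == k:
                return pos
    raise ValueError("fewer than k set bits")


def absolute_doc(path, i):
    """path: bit vectors from the root down to a node; i: 1-based bit index
    in the last node. Returns the document number that bit stands for."""
    for parent in reversed(path[:-1]):
        i = select(parent, i)  # i-th bit of a node = i-th set bit of its parent
    return i


# The three nodes from the slide.
root = "1111111111111"    # aachen-zyskowski
child = "1001000111101"   # maakeb-zyskowski
grand = "1001110"         # maakeb-stream

print(absolute_doc([root, child], 5))          # bit 5 of the second node
print(absolute_doc([root, child, grand], 5))   # bit 5 of the third node
```

On the slide's example this reproduces the two annotations: bit 5 of the second node stands for doc 5, and bit 5 of the third node stands for doc 10. Each child has exactly as many bits as its parent has set bits (13, then 7).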
37Solution 2 AutoTree
SPIRE06 / JIR07
- Trick 2: Push up the words
- for each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node
[Figure: example query D = 5, 7, 10 and W = max, processed top-down on a three-level tree]
Level 1: bits 1 1 1 1 1 1 1 1 1 1; words algorithm, advance, aachen, algol, art
Level 2: bits 1 0 0 0 1 0 0 1 1 1; words maximum, manning, maximal, manner; D = 5, 10 (→ 2, 5), report maximum
Level 3: bits 1 0 0 1 1; words mazza, maple, middle; D = 5, report Ø → STOP
38Solution 2 AutoTree
SPIRE06 / JIR07
- Trick 3: divide into blocks
- and build a tree over each block as shown before
40Solution 2 AutoTree
SPIRE06 / JIR07
- Trick 3: divide into blocks
- and build a tree over each block as shown before
- Theorem
- query processing time O(|D| + |output|)
- uses no more space than an inverted index
- AutoTree Summary
- output-sensitive
- not IO-efficient (heavy use of bit-rank operations)
- compression not optimal
99% correlation with actual running times
41Parenthesis
- Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice
- very simple code
- lists are highly compressible
- perfect locality of access
- Number of operations is a deceptive measure
- 100 disk seeks take about half a second
- in that time can read 200 MB of contiguous data (if stored compressed)
- main memory: 100 non-local accesses ≈ 10 KB data block
42Solution 3 HYB
SIGIR06 / IR07
- Flat division of word range into blocks
list for A-D
list for E-J
list for K-N
43Solution 3 HYB
SIGIR06 / IR07
- Flat division of word range into blocks
- Replace doc ids by gaps and words by frequency ranks
- Encode both gaps and ranks such that x → log2 x bits
gaps: 0 → 0, 1 → 10, 2 → 110
ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
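The gap and rank transformation can be illustrated with a short Python sketch. The slide only gives a few sample codes, so the concrete code used here is a swapped-in stand-in: Elias-gamma, which spends about 2 log2 x bits on x. Gaps are shifted by one because a block can contain the same doc id several times (gap 0) and Elias-gamma needs values of at least 1. All function names are made up.

```python
from collections import Counter


def to_gaps(doc_ids):
    """Replace sorted doc ids by differences to the previous id."""
    return [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]


def from_gaps(gaps):
    """Invert to_gaps by taking prefix sums."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out


def frequency_ranks(words):
    """Map each word to its rank by decreasing frequency (1 = most frequent)."""
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}


def gamma_encode(x):
    """Elias-gamma code of an integer x >= 1, as a string of '0' and '1'."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of Elias-gamma codes."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


# A toy block: (doc id, word) pairs sorted by doc id.
doc_ids = [13, 88, 88]
words = ["E", "E", "G"]

gaps = to_gaps(doc_ids)
ranks = [frequency_ranks(words)[w] for w in words]
encoded_gaps = "".join(gamma_encode(g + 1) for g in gaps)   # +1: gaps may be 0
encoded_ranks = "".join(gamma_encode(r) for r in ranks)
print(gaps, ranks, len(encoded_gaps) + len(encoded_ranks), "bits")
```

Small gaps and small ranks get short codes, which is the point of both transformations: frequent words receive low ranks, and dense lists have small gaps.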
44Solution 3 HYB
SIGIR06 / IR07
- Flat division of word range into blocks
- Theorem
- Let $n$ = number of documents, $m$ = number of words
- If blocks are chosen of equal volume $\varepsilon n$
- Then query time $\sim n$ and empirical entropy $H_{HYB} \sim (1+\varepsilon) \cdot H_{INV}$
- HYB Summary
- IO-efficient (mere scans of data)
- very good compression
- not output-sensitive
experimental results match perfectly
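The block idea behind HYB can be sketched in Python as follows. This is an uncompressed toy version under stated assumptions: the function names and the last block boundary (O-Z) are made up, each block is a list of (doc id, word) pairs sorted by doc id, and a set lookup replaces the merge with the sorted list of doc ids.

```python
def build_blocks(collection, boundaries):
    """boundaries: list of (lo, hi) word ranges. Returns (lo, hi, postings)
    triples, where postings are (doc id, word) pairs sorted by doc id."""
    blocks = []
    for lo, hi in boundaries:
        postings = sorted(
            (doc_id, word)
            for doc_id, doc_words in collection.items()
            for word in doc_words
            if lo <= word <= hi
        )
        blocks.append((lo, hi, postings))
    return blocks


def hyb_query(blocks, doc_ids, word_lo, word_hi):
    """Scan every block overlapping the word range and filter by doc and word."""
    wanted = set(doc_ids)
    result = []
    for lo, hi, postings in blocks:
        if hi < word_lo or lo > word_hi:
            continue  # block does not overlap the query range
        for doc_id, word in postings:  # one sequential scan of the block
            if doc_id in wanted and word_lo <= word <= word_hi:
                result.append((doc_id, word))
    return sorted(result)


collection = {
    9: list("EER"),
    13: list("AOEWH"),
    17: list("BWUKA"),
    88: list("PAEGQ"),
}
blocks = build_blocks(collection, [("A", "D"), ("E", "J"), ("K", "N"), ("O", "Z")])
answer = hyb_query(blocks, [13, 17, 88], "C", "G")
print(answer)
```

The query touches the blocks A-D and E-J and reads each of them front to back, regardless of how many pairs match. That is the trade-off summarized above: only scans, but not output-sensitive.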
45Conclusion
- Context-sensitive prefix search
- core mechanism of the CompleteSearch engine
- simple enough to allow efficient realization
- powerful enough to support many advanced search features
- Open problems
- solution which is both output-sensitive and IO-efficient
- implement the whole thing using MapReduce
- support yet more features
Thank you!
47Processing the query beatles musician
position
Gitanes legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice
John Lennon 0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer
beatles entity
entity . relation:is_a . class:musician
two prefix queries
entity:john_lennon entity:1964 entity:liverpool etc.
entity:wolfgang_amadeus_mozart entity:johann_sebastian_bach entity:john_lennon etc.
one join
entity:john_lennon etc.
48Processing the query beatles musician
position
Gitanes legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice
John Lennon 0 entity:john_lennon 1 relation:is_a 2 class:musician 2 class:singer
beatles entity
entity . relation:is_a . class:musician
- Problem: entity has a huge number of occurrences
- 200 million for Wikipedia, which is 20% of all occurrences
- prefix search efficient only for up to 1% (explanation follows)
- Solution: frontier classes
- classes at appropriate level in the hierarchy
- e.g. artist, believer, worker, vegetable, animal, ...
49Processing the query beatles musician
position
Gitanes legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked
John Lennon 0 artist:john_lennon 0 believer:john_lennon 1 relation:is_a 2 class:musician
beatles artist
artist . relation:is_a . class:musician
two prefix queries
artist:john_lennon artist:graham_greene artist:pete_best etc.
artist:wolfgang_amadeus_mozart artist:johann_sebastian_bach artist:john_lennon etc.
one join
first figure out musician → artist (easy)
artist:john_lennon etc.
50INV vs. HYB Space Consumption
Theorem: The empirical entropy of INV is $\sum_i n_i \cdot (1/\ln 2 + \log_2(n/n_i))$
Theorem: The empirical entropy of HYB with block size $\varepsilon n$ is $\sum_i n_i \cdot ((1+\varepsilon)/\ln 2 + \log_2(n/n_i))$
where $n_i$ = number of documents containing the i-th word, $n$ = number of documents
Nice match of theory and practice
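The two entropy bounds are easy to evaluate for a given collection. The Python sketch below assumes the formulas read as a sum over words of n_i · (1/ln 2 + log2(n/n_i)) for INV, and the same with (1 + ε)/ln 2 for HYB; the plus signs are an assumption, since they are not legible in this transcript. The function names and the toy frequencies are made up.

```python
import math


def entropy_inv(doc_freqs, n):
    """Empirical entropy (in bits) of INV under the assumed formula."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(doc_freqs, n, eps):
    """Empirical entropy (in bits) of HYB with block size eps * n, assumed formula."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


# Toy collection: n documents, doc_freqs[i] = number of docs containing word i.
n = 8
doc_freqs = [8, 4, 1]
eps = 0.1

h_inv = entropy_inv(doc_freqs, n)
h_hyb = entropy_hyb(doc_freqs, n, eps)
print(f"INV: {h_inv:.2f} bits, HYB: {h_hyb:.2f} bits, ratio {h_hyb / h_inv:.3f}")
```

Under this reading the difference between the two is exactly ε · Σ n_i / ln 2, so HYB never needs more than a factor of (1 + ε) over INV.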
51INV vs. HYB Query Time
- Experiment type: ordinary queries from left to right
- db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, ...
INV
HYB
HYB beats INV by an order of magnitude
52Engineering
- With HYB, every query is essentially one block scan
- perfect locality of access, no sorting or merging, etc.
- balanced ratio of read, decompression, processing, etc.
- Careful implementation in C++
- Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)
57System Design High Level View
Compute Server: C++
Web Server: PHP
User Client: JavaScript
Debugging such an application is hell!