Efficient Search in Very Large Text Collections, Databases, and Ontologies

1
Efficient Search in Very Large Text Collections,
Databases, and Ontologies
DFG Priority Programme Algorithm Engineering,
Kickoff Meeting in Karlsruhe,
December 2-3, 2007
  • Holger Bast
  • Max-Planck-Institut für Informatik
  • Saarbrücken, Germany

2
General theme of this project
  • Search engines
  • large variety of challenging algorithmic problems
    with high practical relevance
  • algorithm engineering is absolutely essential
  • Focus on scalability
  • terabytes of data, hundreds of millions of
    documents
  • query times in a fraction of a second
  • Focus on advanced queries
  • beyond Google-style keyword search
  • but still as efficient in time and space

efficiency is often a secondary issue in DB, AI,
CL, or ML research
Fancy Searches, yet Fast
3
Problems encountered in this project
  • Indexing: fast queries, succinct index, fast
    construction
  • Index structures for advanced queries (beyond
    keyword search)
  • How to build them fast
  • Learning from text: scalable, yet effective
  • large-scale spelling correction
  • large-scale synonymy detection
  • large-scale entity annotation
  • Basic Toolbox (for search)
  • fast intersection of (sorted) sequences
  • efficient (de)compression

algorythm → algorithm
web ↔ internet
Einstein → the physicist? the
physical unit? the musicologist?
possible synergies with Peter Sanders' project
I will give a few glimpses in the following
4
Prefix Completion
  • Fundamental search problem
  • definition on next slide
  • many notoriously difficult search problems can be
    reduced to it
  • for example, faceted search
  • for, say, an article by Peter Sanders that
    appeared in WEA 2007, add
  • author:Peter Sanders → Doc. 17, venue:WEA →
    Doc. 17, year:2007 → Doc. 17

5
Prefix Completion Problem Definition
  • Data is given as
  • documents containing words
  • documents have ids (D1, D2, ...)
  • words have ids (A, B, C, ...)
  • Query
  • given a sorted list of doc ids
  • and a range of word ids

Example collection (doc id: word ids):
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
6
Prefix Completion Problem Definition
  • Data is given as
  • documents containing words
  • documents have ids (D1, D2, ...)
  • words have ids (A, B, C, ...)
  • Query
  • given a sorted list of doc ids
  • and a range of word ids
  • Answer
  • all matching word-in-doc pairs
  • with scores
  • and positions

Example collection as on the previous slide, with the query documents highlighted:
D17: B W U K A
D88: P A E G Q
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
8
Prefix Completion via the Inverted Index
  • For example, algor eng
  • given the documents D13, D17, D88, ... (ids
    of hits for algor)
  • and the word range C D E F G
    (ids for eng)
  • Iterate over all words from the given range
  • C (engage): D8, D23, D291, ...
  • D (engel): D24, D36, D165, ...
  • E (engine): D13, D24, D88, ...
  • F (engines): D56, D129, D251, ...
  • G (engineering): D3, D15, D88, ...
  • Intersect each list with the given one and merge
    the results
  • (D13, E), (D88, E), (D88, G)

running time: $|D| \cdot |W| + \log |W| \cdot$ merge volume
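The procedure on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the function names are made up for the example, and the inverted lists contain just the doc ids shown on the slide (the "..." parts are left out).

```python
import heapq

def intersect_sorted(a, b):
    """Intersect two sorted lists of doc ids with a linear two-pointer scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def prefix_completion_inv(inverted_index, docs, word_range):
    """For each word in the range, intersect its list with docs, then merge."""
    per_word = []
    for word in word_range:
        hits = intersect_sorted(docs, inverted_index.get(word, []))
        per_word.append([(doc, word) for doc in hits])
    # k-way merge of the per-word results, ordered by doc id
    return list(heapq.merge(*per_word))

# Word ids C..G with the (truncated) inverted lists from the slide.
inverted_index = {
    "C": [8, 23, 291],
    "D": [24, 36, 165],
    "E": [13, 24, 88],
    "F": [56, 129, 251],
    "G": [3, 15, 88],
}
result = prefix_completion_inv(inverted_index, [13, 17, 88], "CDEFG")
print(result)  # [(13, 'E'), (88, 'E'), (88, 'G')]
```

Every word in the range costs one pass over the given doc list, even if it contributes nothing to the output, which is where the quadratic worst case comes from.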
9
Prefix Completion Status Quo Problems
  • The inverted index
  • highly compressible
  • perfect locality of access (T operations → T /
    block size IOs)
  • but quadratic worst-case complexity
  • AutoTree [Bast, Weber, Mortensen, SPIRE'06]
  • output-sensitive (query time linear in size of
    output)
  • but poor locality of access (heavy use of bit
    rank operations)
  • The half-inverted index [Bast, Weber, SIGIR'06]
  • highly compressible, perfect locality of access
  • query time linear in the number of docs, with
    small constant

Note: time for 100 disk seeks ≈ time for reading
200 MB of compressed data
99% correlation with actual running times
perfect prediction of time and space consumption
Major open problem: output-sensitive and
IO-efficient
10
Error-Tolerant Search
  • With prefix search available, reduces to the
    following
  • Problem: Given a set of distinct words (lexicon),
    find all clusters of words that are spelling
    variants of each other

machine
logarithm
algorithm
alogrithm
maschine
algorytm
logaythm
mahcine
  • Challenges
  • find appropriate measure of distance between
    words
  • algorithm that scales in theory as well as in
    practice
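A naive baseline for the clustering problem, as a sketch only: plain Levenshtein distance with a fixed threshold, and single-linkage clustering via union-find over all word pairs. Both the distance measure and the threshold of 2 are assumptions for this example, and the all-pairs loop is exactly the kind of approach that does not scale.

```python
from itertools import combinations

def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def cluster_variants(lexicon, max_dist=2):
    """Link every pair within max_dist and return the connected components."""
    parent = {w: w for w in lexicon}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(lexicon, 2):  # quadratic in the lexicon size
        if edit_distance(a, b) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for w in lexicon:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())

lexicon = ["machine", "logarithm", "algorithm", "alogrithm",
           "maschine", "algorytm", "logaythm", "mahcine"]
clusters = cluster_variants(lexicon)
```

On the slide's word list this yields two clusters: the machine variants, and one cluster that merges algorithm and logarithm, because alogrithm is within distance 2 of both. That is a small instance of the first challenge: the distance measure has to be chosen with care.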

possible synergies with Ernst Mayr's project
Master's thesis of Marjan Celikik (talk on
Wednesday)
11
Semantic Search Problems
  • Problem 1: how to index
  • previous engines built on top of DBMS (e.g.,
    Oracle)
  • DBMSs are hard to control (opposite of algorithm
    engineering)
  • ongoing work: reduction to prefix search and join
  • Problem 2: integrate an ontology
  • relate words / phrases in text to entities from
    ontology
  • no time for deep parsing, reasoning etc.
  • learn from neighboring words
  • numerous algorithmic and engineering problems to
    make it scale to something like Wikipedia (>
    10,000,000,000 words)

DBMS = Database Management System
12
Semantic Search Entity Recognition
  • Recognize entities by looking at neighboring words

Text: Quantum inequalities ... Einstein's theory of General
Relativity amounts to a description ...
Entity: Albert Einstein, the physicist; is a physicist,
mathematician, vegetarian, person, entity, ...;
born in 1879
Text: ... Violin Sonata No. 5 ..., according to Einstein's
Mozart: His Character, His Work. ...
Entity: Alfred Einstein, the musicologist; is a
musicologist, scholar, intellectual, person,
entity, ...; born in 1880
13
Software
  • Enhance our prototype
  • improve source code, documentation, ...
  • integrate our results into the system
  • Make available to others
  • public demonstrators
  • as a platform for experimentation
  • as a fancy search engine construction toolkit

Thank you!
15
General theme of this project
  • Project title
  • Efficient Search in Very Large Text Collections,
    Databases, and Ontologies
  • In short
  • Fancy searches, yet fast
  • advanced search, yet highly scalable
  • quality is an issue
  • but must not sacrifice performance
  • (as often happens in AI, CL, ML)
  • General
  • Search engines are a fascinating, multi-faceted
    field of research giving rise to a multitude of
    challenging algorithmic problems with a strong
    algorithm engineering component and of high
    practical relevance.

16
Overview (just for myself, not for the talk)
  • An Index for prefix search
  • inverted index, our open problem, top-k
  • Building such an index
  • INV: sorting, HYB: semi-sorting
  • Error-tolerant search
  • reduce to spelling variants clustering, define
    problem
  • Semantic Search
  • point out entity annotation problem

17
Prefix Search
  • Show demo
  • first explain prefix search
  • then how to use it for faceted search
  • use DBLP: show dblp.mpi-inf.mpg.de
  • Explain inverted index
  • show for example prefix query
  • point out IO-efficiency
  • point out compressibility
  • but quadratic worst-case complexity

18
Problems encountered in this project
  • Indexing: fast queries, succinct index, fast
    construction
  • Index structures for advanced queries (beyond
    keyword search)
  • How to build them fast
  • Learning from text: scalable, yet effective
  • large-scale spelling correction
  • large-scale synonymy detection
  • large-scale entity annotation
  • Fundamental problems
  • fast intersection of (sorted) sequences
  • efficient (de)compression

algorythm → algorithm
web ↔ internet
Einstein → the physicist? the
physical unit? the musicologist?
I will explain each of these in detail in the
following
19
Problems encountered in this project
  • Indexing: fast queries, succinct index, fast
    construction
  • Index structures for advanced queries (beyond
    keyword search)
  • How to build them fast
  • Learning from text: scalable, yet effective
  • large-scale spelling correction
  • large-scale synonymy detection
  • large-scale entity annotation
  • Fundamental problems
  • fast intersection of (sorted) sequences
  • efficient (de)compression

algorythm → algorithm
web ↔ internet
Einstein → the physicist? the
physical unit? the musicologist?
just kidding
20
Problems encountered in this project
  • Indexing: fast queries, succinct index, fast
    construction
  • Index structures for advanced queries (beyond
    keyword search)
  • How to build them fast
  • Learning from text: scalable, yet effective
  • large-scale spelling correction
  • large-scale synonymy detection
  • large-scale entity annotation
  • Fundamental problems
  • fast intersection of (sorted) sequences
  • efficient (de)compression

Example: prefix search
algorythm → algorithm
Demo: problem definition
web ↔ internet
Demo
Einstein → the physicist? the
physical unit? the musicologist?
I will give you a glimpse of some of these in the
following
22
Overview
  • Part 1
  • Definition of our prefix search problem
  • Applications
  • Demos of our search engine
  • Part 2
  • Problem definition again
  • One way to solve it
  • Another way to solve it
  • Your way to solve it

23
Part 1
  • Definition, Applications, Demos

24
Problem Definition Formal
  • Context-Sensitive Prefix Search
  • Preprocess
  • a given collection of text documents such that
    queries of the following kind can be processed
    efficiently
  • Given
  • an arbitrary set of documents D
  • and a range of words W
  • Compute
  • all word-in-document pairs (w, d) such that w ∈
    W and d ∈ D
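As a reference point, the definition can be written down directly as a brute-force scan. This is a sketch of the specification only, with no preprocessing at all; the function name is hypothetical, the word range is taken as a closed interval of word ids, and the small collection reuses documents from the visual version of the definition (reading D17 as B W U K A).

```python
def prefix_search_bruteforce(collection, docs, word_range):
    """Return all (w, d) with d in docs, w occurring in d, and lo <= w <= hi."""
    lo, hi = word_range
    return sorted(
        (w, d)
        for d in docs
        for w in set(collection.get(d, ()))  # each word once per document
        if lo <= w <= hi
    )

collection = {
    13: ["A", "O", "E", "W", "H"],
    17: ["B", "W", "U", "K", "A"],
    88: ["P", "A", "E", "G", "Q"],
    9: ["E", "E", "R"],          # contains E, but is not in the query set D
}
pairs = prefix_search_bruteforce(collection, [13, 17, 88], ("C", "G"))
print(pairs)  # [('E', 13), ('E', 88), ('G', 88)]
```

Any index structure for the problem has to return the same pairs; the interesting part is doing so without touching every word of every document in D.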

25
Problem Definition Visual
  • Data is given as
  • documents containing words
  • documents have ids (D1, D2, ...)
  • words have ids (A, B, C, ...)
  • Query
  • given a sorted list of doc ids
  • and a range of word ids

Example collection (doc id: word ids):
D74: J W Q
D3: Q D A
D17: B W U K A
D43: D Q
D92: P U D E M
D1: A O E W H
D78: K L S
D53: J D E A
D27: K L D F
D9: E E R
D4: K L K A B
D88: P A E G Q
D2: B F A
D32: I L S D H
D98: E B A S
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
26
Problem Definition Visual
  • Data is given as
  • documents containing words
  • documents have ids (D1, D2, ...)
  • words have ids (A, B, C, ...)
  • Query
  • given a sorted list of doc ids
  • and a range of word ids
  • Answer
  • all matching word-in-doc pairs
  • with scores
  • and positions

Example collection as on the previous slide, with the query documents highlighted:
D17: B W U K A
D88: P A E G Q
D13: A O E W H
Example query: doc ids D13 D17 D88, word range C D E F G
28
Application 1 Autocompletion
  • After each keystroke
  • display completions of the last query word that
    lead to the best hits, together with the best
    such hits
  • e.g., for the query "google amp", display
    "amphitheatre" and the corresponding hits

29
Application 2 Error Correction
  • As before, but also
  • display spelling variants of completions that
    would lead to a hit
  • e.g., for the query "probabilistic algorithm", also
    consider a document containing "probalistic
    aigorithm"
  • Implementation
  • if, say, aigorithm occurs as a misspelling of
    algorithm, then for every occurrence of aigorithm
    in the index
  • aigorithm → Doc. 17
  • also add
  • algorithm:aigorithm → Doc. 17

30
Application 3 Query Expansion
  • As before, but also
  • display words related to completions that would
    lead to a hit
  • e.g., for the query "russia metal", also consider
    documents containing "russia aluminium"
  • Implementation
  • for, say, every occurrence of aluminium in the
    index
  • aluminium → Doc. 17
  • also add (once for every occurrence)
  • s:67:aluminium → Doc. 17
  • and (only once for the whole collection)
  • s:aluminium:67 → Doc. 00

31
Application 4 Faceted Search
  • As before, but also
  • along with the completions and hits, display a
    breakdown of the result set by various categories
  • e.g., for the query "algorithm", show (prominent)
    authors of articles containing these words
  • Implementation
  • for, say, an article by Thomas Hofmann that
    appeared in NIPS 2004, add
  • author:Thomas_Hofmann → Doc. 17,
    venue:NIPS → Doc. 17, year:2004 → Doc. 17
  • also add
  • thomas:author:Thomas_Hofmann → Doc. 17,
    hofmann:author:Thomas_Hofmann → Doc. 17, etc.
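A toy version of the faceted-search trick, to make the reduction concrete. It is a sketch under assumptions: the ":" separator in the artificial words, the function names, and the second article (Doc. 18) are made up for the example, and the facet breakdown scans the whole dictionary instead of running a real prefix query.

```python
from collections import defaultdict

def add_article(index, doc_id, words, facets):
    """Index the ordinary words plus one artificial word per facet value."""
    for word in words:
        index[word.lower()].add(doc_id)
    for facet, value in facets.items():
        index[f"{facet}:{value}"].add(doc_id)   # e.g. author:Thomas_Hofmann

def facet_breakdown(index, hit_docs, facet):
    """Count, per facet value, how many of the hit documents carry it.

    This is the prefix query "<facet>:" restricted to hit_docs.
    """
    prefix = facet + ":"
    counts = {}
    for word, docs in index.items():
        if word.startswith(prefix):
            n = len(docs & hit_docs)
            if n:
                counts[word[len(prefix):]] = n
    return counts

index = defaultdict(set)
add_article(index, 17, ["probabilistic", "algorithm"],
            {"author": "Thomas_Hofmann", "venue": "NIPS", "year": "2004"})
add_article(index, 18, ["algorithm", "engineering"],
            {"author": "Peter_Sanders", "venue": "WEA", "year": "2007"})

hits = index["algorithm"]                      # docs matching the query
authors = facet_breakdown(index, hits, "author")
print(authors)
```

The point of the encoding is that the author breakdown for a result set is itself a prefix search: doc set = hits, word range = all words starting with "author:".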

32
Application 5 Semantic Search
  • As before, but also
  • display semantic completions
  • e.g., for the query "beatles musician", display
    instances of the class musician that occur
    together with the word beatles
  • Implementation
  • cannot simply duplicate index entries of an
    entity for each category it belongs to, e.g. John
    Lennon is a
  • singer, songwriter, person, human being,
    organism, guitarist, pacifist, vegetarian,
    entertainer, musician, ...
  • tricky combination of completions and joins →
    SIGIR'07

and still more applications
33
Part 2
  • Solutions and Open Problem

34
Solution 1 Inverted Index
  • For example, probab alg
  • given the documents D13, D17, D88, ... (ids
    of hits for probab)
  • and the word range C D E F G
    (ids for alg)
  • Iterate over all words from the given range
  • C (algae): D8, D23, D291, ...
  • D (algarve): D24, D36, D165, ...
  • E (algebra): D13, D24, D88, ...
  • F (algol): D56, D129, D251, ...
  • G (algorithm): D3, D15, D88, ...
  • Intersect each list with the given one and merge
    the results
  • (D13, E), (D88, E), (D88, G)

running time: $|D| \cdot |W| + \log |W| \cdot$ merge volume
35
A General Idea
  • Precompute inverted lists for ranges of words

list for A-D
  • Note
  • each prefix corresponds to a word range
  • ideally precompute list for each possible prefix
  • too much space
  • but lots of redundancy
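The note that each prefix corresponds to a word range can be made concrete with two binary searches over a sorted lexicon. A minimal sketch: the function name is hypothetical, and it assumes no word contains the sentinel character U+FFFF.

```python
import bisect

def prefix_to_word_range(sorted_lexicon, prefix):
    """Return the half-open index range [lo, hi) of words starting with prefix."""
    lo = bisect.bisect_left(sorted_lexicon, prefix)
    # Assumption: U+FFFF sorts after every character that occurs in a word.
    hi = bisect.bisect_left(sorted_lexicon, prefix + "\uffff")
    return lo, hi

lexicon = sorted(["algae", "algarve", "algebra", "algol", "algorithm",
                  "engine", "probabilistic"])
lo, hi = prefix_to_word_range(lexicon, "alg")
print(lexicon[lo:hi])  # ['algae', 'algarve', 'algebra', 'algol', 'algorithm']
```

With word ids assigned in lexicographic order, the pair (lo, hi) is exactly the word range W of the query.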

36
Solution 2 AutoTree
SPIRE'06 / JIR'07
  • Trick 1: Relative bit vectors
  • the i-th bit of the root node corresponds to the
    i-th doc
  • the i-th bit of any other node corresponds to the
    i-th set bit of its parent node

Node aachen-zyskowski (root): 1111111111111
  5th bit corresponds to doc 5
Node maakeb-zyskowski: 1001000111101
  5th bit corresponds to doc 5
Node maakeb-stream: 1001110
  5th bit corresponds to doc 10
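The mapping from a bit position in a node back to a doc id follows directly from the rule above: walk up the tree and replace the position by the position of the corresponding set bit in the parent. A small sketch using the three bit vectors from the slide; the function names are made up, and the linear-scan select stands in for the constant-time rank/select structures a real implementation would use.

```python
def select1(bits, k):
    """Return the 1-based position of the k-th set bit in a bit string."""
    seen = 0
    for pos, bit in enumerate(bits, 1):
        if bit == "1":
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k set bits")

def to_doc_id(path, position):
    """Map a 1-based bit position in the last node of path to a doc id.

    path lists the bit vectors from the root down to the node in question.
    """
    for level in range(len(path) - 1, 0, -1):
        # the i-th bit of a node corresponds to the i-th set bit of its parent
        position = select1(path[level - 1], position)
    return position  # at the root, the i-th bit is the i-th doc

path = ["1111111111111",   # aachen-zyskowski (root)
        "1001000111101",   # maakeb-zyskowski
        "1001110"]         # maakeb-stream
print(to_doc_id(path, 5))  # 10
```

The same position 5 resolves to doc 5 in the first two nodes and to doc 10 in the third, matching the annotations on the slide.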
37
Solution 2 AutoTree
SPIRE'06 / JIR'07
  • Trick 2: Push up the words
  • For each node, by each set bit, store the
    leftmost word of that doc that is not already
    stored by a parent node

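[Figure: query processing in the AutoTree for D = 5, 7, 10 and W = max]
Root node, bits 1 1 1 1 1 1 1 1 1 1
  words stored: algorithm, advance, aachen, algol, art
  D = 5, 7, 10
Second node, bits 1 0 0 0 1 0 0 1 1 1
  words stored: maximum, manning, maximal, manner
  D = 5, 10 (→ 2, 5): report maximum
Third node, bits 1 0 0 1 1
  words stored: mazza, maple, middle
  D = 5: report Ø → STOP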
38
Solution 2 AutoTree
SPIRE'06 / JIR'07
  • Trick 3: divide into blocks
  • and build a tree over each block as shown before

40
Solution 2 AutoTree
SPIRE'06 / JIR'07
  • Trick 3: divide into blocks
  • and build a tree over each block as shown before
  • Theorem
  • query processing time $O(|D| + |\text{output}|)$
  • uses no more space than an inverted index
  • AutoTree Summary
  • output-sensitive
  • not IO-efficient (heavy use of bit-rank
    operations)
  • compression not optimal

99% correlation with actual running times
41
Parenthesis
  • Despite its quadratic worst-case complexity, the
    inverted index is hard to beat in practice
  • very simple code
  • lists are highly compressible
  • perfect locality of access
  • Number of operations is a deceptive measure
  • 100 disk seeks take about half a second
  • in that time can read 200 MB of contiguous
    data (if stored compressed)
  • main memory: 100 non-local accesses ≈ 10 KB
    data block

42
Solution 3 HYB
SIGIR'06 / IR'07
  • Flat division of word range into blocks

list for A-D
list for E-J
list for K-N
43
Solution 3 HYB
SIGIR'06 / IR'07
  • Flat division of word range into blocks
  • Replace doc ids by gaps and words by frequency
    ranks
  • Encode both gaps and ranks such that x → log2 x
    bits

gaps: 0 → 0, 1 → 10, 2 → 110
ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111,
4th (B) → 110
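The two transformations (doc ids to gaps, words to frequency ranks) can be sketched as follows. This only shows the transformation and its inverse, not the bit-level code; the function names and the sample block for the word range A-D are made up for the example.

```python
from collections import Counter

def encode_block(pairs):
    """pairs: (doc_id, word) tuples sorted by doc_id.

    Returns (gaps, ranks, words_by_rank): doc ids become differences to the
    previous doc id, words become their 1-based frequency rank in the block.
    """
    freq = Counter(word for _, word in pairs)
    words_by_rank = sorted(freq, key=lambda w: (-freq[w], w))
    rank_of = {w: r for r, w in enumerate(words_by_rank, 1)}
    gaps, ranks, prev = [], [], 0
    for doc_id, word in pairs:
        gaps.append(doc_id - prev)   # 0 if the same doc occurs again
        ranks.append(rank_of[word])  # frequent words get small numbers
        prev = doc_id
    return gaps, ranks, words_by_rank

def decode_block(gaps, ranks, words_by_rank):
    """Invert encode_block by prefix-summing the gaps and looking up ranks."""
    pairs, doc_id = [], 0
    for gap, rank in zip(gaps, ranks):
        doc_id += gap
        pairs.append((doc_id, words_by_rank[rank - 1]))
    return pairs

block = [(1, "A"), (2, "A"), (2, "B"), (3, "A"), (3, "D"),
         (4, "A"), (4, "B"), (13, "A"), (27, "D"), (43, "D")]
gaps, ranks, words_by_rank = encode_block(block)
print(gaps)   # [1, 1, 0, 1, 0, 1, 0, 9, 14, 16]
print(ranks)  # [1, 1, 3, 1, 2, 1, 3, 1, 2, 2]
```

Both sequences consist mostly of small numbers, which is what makes a code with short codewords for small values effective.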
  • An actual block of HYB

44
Solution 3 HYB
SIGIR'06 / IR'07
  • Flat division of word range into blocks
  • Theorem
  • Let n = number of documents, m = number of words
  • If blocks are chosen of equal volume $\varepsilon \cdot n$
  • Then query time $\sim \varepsilon \cdot n$ and empirical entropy
    $H_{\mathrm{HYB}} \sim (1 + \varepsilon) \cdot H_{\mathrm{INV}}$
  • HYB Summary
  • IO-efficient (mere scans of data)
  • very good compression
  • not output-sensitive

experimental results match perfectly
45
Conclusion
  • Context-sensitive prefix search
  • core mechanism of the CompleteSearch engine
  • simple enough to allow efficient realization
  • powerful enough to support many advanced search
    features
  • Open problems
  • solution which is both output-sensitive and
    IO-efficient
  • implement the whole thing using MapReduce
  • support yet more features

Thank you!
47
Processing the query beatles musician
Text document (words with entity annotation):
Gitanes legend says that John Lennon
entity:john_lennon of the Beatles smoked Gitanes
to deepen his voice
Ontology document "John Lennon" (position: word):
0: entity:john_lennon
1: relation:is_a
2: class:musician
2: class:singer
Two prefix queries:
beatles entity:* → entity:john_lennon, entity:1964, entity:liverpool, etc.
entity:* . relation:is_a . class:musician → entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
One join:
entity:john_lennon, etc.
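The final step, joining the results of the two prefix queries, amounts to keeping the entities that occur in both completion lists. A sketch with the completions listed on the slide; the function name is hypothetical and the ":" separator in the entity words is an assumption.

```python
def join_on_entity(near_query_word, instances_of_class):
    """Keep the entities that appear in both lists of completions."""
    wanted = set(instances_of_class)
    return sorted({entity for entity in near_query_word if entity in wanted})

# Completions of the entity prefix co-occurring with "beatles".
near_beatles = ["entity:john_lennon", "entity:1964", "entity:liverpool"]
# Completions of the entity prefix in ontology documents for class musician.
musicians = ["entity:wolfgang_amadeus_mozart",
             "entity:johann_sebastian_bach",
             "entity:john_lennon"]

answer = join_on_entity(near_beatles, musicians)
print(answer)  # ['entity:john_lennon']
```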
48
Processing the query beatles musician
Text document (words with entity annotation):
Gitanes legend says that John Lennon
entity:john_lennon of the Beatles smoked Gitanes
to deepen his voice
Ontology document "John Lennon" (position: word):
0: entity:john_lennon
1: relation:is_a
2: class:musician
2: class:singer
Prefix queries:
beatles entity:*
entity:* . relation:is_a . class:musician
  • Problem: entity:* has a huge number of
    occurrences
  • 200 million for Wikipedia, which is 20% of
    all occurrences
  • prefix search efficient only for up to 1%
    (explanation follows)
  • Solution: frontier classes
  • classes at appropriate level in the hierarchy
  • e.g. artist, believer, worker, vegetable,
    animal, ...

49
Processing the query beatles musician
Text document (words with frontier-class annotation):
Gitanes legend says that John Lennon
artist:john_lennon believer:john_lennon of the
Beatles smoked
Ontology document "John Lennon" (position: word):
0: artist:john_lennon
0: believer:john_lennon
1: relation:is_a
2: class:musician
Two prefix queries:
beatles artist:* → artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
artist:* . relation:is_a . class:musician → artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join:
first figure out musician → artist (easy)
artist:john_lennon, etc.
50
INV vs. HYB Space Consumption
Theorem: The empirical entropy of INV is
$\sum_i n_i \cdot (1/\ln 2 + \log_2(n/n_i))$
Theorem: The empirical entropy of HYB with block
size $\varepsilon n$ is $\sum_i n_i \cdot ((1+\varepsilon)/\ln 2 + \log_2(n/n_i))$
$n_i$ = number of documents containing the $i$-th word, $n$ =
number of documents
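To get a feeling for the two bounds, they can be evaluated numerically. A sketch under assumptions: the formulas are read as sums over all words of n_i times (1/ln 2 + log2(n/n_i)) for INV, with the factor (1 + ε) on the 1/ln 2 term for HYB; the document frequencies and ε = 0.1 are made-up inputs, not measurements.

```python
import math

def entropy_inv(n, doc_freqs):
    """Sum over words of n_i * (1/ln 2 + log2(n / n_i)), in bits."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)

def entropy_hyb(n, doc_freqs, eps):
    """Same sum with (1 + eps)/ln 2 in place of 1/ln 2."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni))
               for ni in doc_freqs)

n = 1000                      # number of documents (hypothetical)
doc_freqs = [500, 100, 10]    # n_i for three words (hypothetical)
h_inv = entropy_inv(n, doc_freqs)
h_hyb = entropy_hyb(n, doc_freqs, eps=0.1)
print(round(h_inv), round(h_hyb), round(h_hyb / h_inv, 3))
```

Under this reading, the overhead of HYB over INV is exactly ε / ln 2 bits per index entry, independent of how the document frequencies are distributed.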
Nice match of theory and practice
51
INV vs. HYB Query Time
  • Experiment: type ordinary queries from left to
    right
  • db , dbl , dblp , dblp un , dblp uni , dblp
    univ , dblp unive , ...

INV
HYB
HYB beats INV by an order of magnitude
52
Engineering
  • With HYB, every query is essentially one block
    scan
  • perfect locality of access, no sorting or
    merging, etc.
  • balanced ratio of read, decompression,
    processing, etc.
  • Careful implementation in C
  • Experiment: sum over array of 10 million 4-byte
    integers (on a Linux PC with an approx. 2 GB/sec
    memory bandwidth)

57
System Design High Level View
Compute Server: C
Web Server: PHP
User Client: JavaScript
Debugging such an application is hell!