Title: ESTER: Efficient Search on Text, Entities, and Relations
1 ESTER: Efficient Search on Text, Entities, and Relations
- Holger Bast
- Max-Planck-Institut für Informatik
- Saarbrücken, Germany
- joint work with
- Alexandru Chitea, Fabian Suchanek, Ingmar Weber
Talk at SIGIR07 in Amsterdam, July 26th
3 ESTER: It's about Fast Semantic Search
4 Keyword Search vs. Semantic Search
- Keyword search
  - Query: john lennon
  - Answer: documents containing the words john and lennon
- Semantic search
  - Query: musician
  - Answer: documents containing an instance of musician
- Combined search
  - Query: beatles musician
  - Answer: documents containing the word beatles and an instance of musician
Useful by itself or as a component of a QA system
11 Semantic Search: Challenges and Our System
- 1. Entity recognition
  - approach 1: let users annotate (semantic web)
  - approach 2: annotate (semi-)automatically
  - our system: uses Wikipedia links and learns from them
- 2. Query processing
  - build a space-efficient index
  - which enables fast query answers
  - our system: as compact and fast as a standard full-text engine
- 3. User interface
  - easy to use
  - yet powerful query capabilities
  - our system: standard interface with interactive suggestions
Challenge 2 (query processing) is the focus of the paper and of this talk
13 In the Rest of this Talk
- Efficiency
  - three simple ideas (which all fail)
  - our approach (which works)
- Queries supported
  - essentially all SPARQL queries, and
  - seamless integration with ordinary full-text search
- Experiments
  - efficiency (great)
  - quality (not so great yet)
- Conclusions
  - lots of interesting and challenging open problems
14 Efficiency: Simple Idea 1
- Add semantic tags to the document
  - e.g., add the special word tag:musician before every occurrence of a musician in a document
- Problem 1: Index blowup
  - e.g., John Lennon is a Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, ... (28 classes)
- Problem 2: Limited querying capabilities
  - e.g., could not produce the list of musicians that occur in documents that also contain the word beatles
  - in particular, could not do all SPARQL queries (more on that later)
15 Efficiency: Simple Idea 2
- Query expansion
  - e.g., replace the query word musician by the disjunction
    musician:aaron_copland OR ... OR musician:zarah_leander
    (7,593 musicians in Wikipedia)
- Problem: Inefficient query processing
  - one intersection per element of the disjunction needed
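The cost of query expansion can be illustrated on a toy inverted index. This is a minimal sketch, not code from the talk: the index contents, the `:` separator in the artificial words, and the names `INDEX` and `expanded_query` are assumptions made for illustration.

```python
# Toy inverted index: word -> set of document ids (made-up data).
INDEX = {
    "beatles": {1, 2, 5},
    "musician:john_lennon": {1, 3},
    "musician:ringo_starr": {2},
    "musician:zarah_leander": {4},
}

def expanded_query(word, class_prefix):
    """Answer `word <class>` by expanding the class into all its instances.

    Returns the matching document ids and the number of list
    intersections that were needed.
    """
    base = INDEX[word]
    hits, intersections = set(), 0
    for term, docs in INDEX.items():
        if term.startswith(class_prefix):
            hits |= base & docs      # one intersection per disjunct
            intersections += 1
    return hits, intersections

docs, cost = expanded_query("beatles", "musician:")
print(sorted(docs), cost)
```

With three instances the query already needs three intersections; with the 7,593 musicians mentioned on the slide it would need 7,593.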
16 Efficiency: Simple Idea 3
- Use a database
  - map semantic queries to SQL queries on suitably constructed tables
  - that's what the Artificial-Intelligence / Semantic-Web people usually do
- Problem: Inefficient and lack of control
  - building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both
  - very limited control regarding efficiency aspects
17 Efficiency: Our Approach
- Two basic operations
  - prefix search of a special kind (will be explained by example)
  - join (will be explained by example)
- An index data structure
  - which supports these two operations efficiently
- Artificial words in the documents
  - such that a large class of semantic queries reduces to a combination of (few of) these operations
18 Processing the query "beatles musician"
Document text with artificial words:
  Gitanes ... legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice ...
Ontology document for John Lennon (position: words):
  0: entity:john_lennon
  1: relation:is_a
  2: class:musician, class:singer, ...
Two prefix queries:
- beatles entity:*
  completions: entity:john_lennon, entity:1964, entity:liverpool, etc.
- entity:* . relation:is_a . class:musician
  completions: entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
One join:
- entity:john_lennon, etc.
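The two prefix queries and the join from this slide can be mimicked on a few toy documents. This is only a sketch under stated assumptions: documents are modelled as lists of (position, word) pairs, the `:` separator in the artificial words is assumed, and `DOCS`, `cooccurring`, and `phrase` are hypothetical names. It scans documents directly and says nothing about how ESTER's index answers these queries efficiently.

```python
# Each document is a list of (position, word) pairs; all data is made up.
DOCS = {
    "text1": [(0, "john"), (1, "lennon"), (1, "entity:john_lennon"),
              (2, "beatles"), (3, "liverpool"), (3, "entity:liverpool")],
    "text2": [(0, "mozart"), (0, "entity:wolfgang_amadeus_mozart"),
              (1, "vienna")],
    "onto_lennon": [(0, "entity:john_lennon"), (1, "relation:is_a"),
                    (2, "class:musician"), (2, "class:singer")],
    "onto_mozart": [(0, "entity:wolfgang_amadeus_mozart"),
                    (1, "relation:is_a"), (2, "class:musician")],
    "onto_liverpool": [(0, "entity:liverpool"), (1, "relation:is_a"),
                       (2, "class:city")],
}

def cooccurring(word, prefix):
    """Prefix query `word prefix*`: completions of prefix in docs containing word."""
    return {w for doc in DOCS.values() if any(x == word for _, x in doc)
            for _, w in doc if w.startswith(prefix)}

def phrase(prefix, *rest):
    """Prefix query `prefix* . rest[0] . rest[1] ...` on adjacent positions."""
    out = set()
    for doc in DOCS.values():
        at = {}
        for pos, w in doc:
            at.setdefault(pos, set()).add(w)
        for pos, w in doc:
            if w.startswith(prefix) and all(
                    r in at.get(pos + i + 1, ()) for i, r in enumerate(rest)):
                out.add(w)
    return out

near_beatles = cooccurring("beatles", "entity:")
musicians = phrase("entity:", "relation:is_a", "class:musician")
print(sorted(near_beatles & musicians))   # the join is a set intersection
```

Here the join is a plain intersection of the two completion sets, which leaves only the entities that are both musicians and mentioned together with "beatles".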
19 Processing the query "beatles musician"
Same document and ontology entry as on the previous slide; the two prefix queries are
- beatles entity:*
- entity:* . relation:is_a . class:musician
- Problem: entity:* has a huge number of occurrences
  - 200 million for Wikipedia, which is 20% of all occurrences
  - prefix search is efficient only for up to 1% (explanation follows)
- Solution: frontier classes
  - classes at an appropriate level in the hierarchy
  - e.g., artist, believer, worker, vegetable, animal, ...
20 Processing the query "beatles musician"
Document text with artificial words:
  Gitanes ... legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked ...
Ontology document for John Lennon (position: words):
  0: artist:john_lennon, believer:john_lennon
  1: relation:is_a
  2: class:musician
First figure out the frontier class: musician → artist (easy)
Two prefix queries:
- beatles artist:*
  completions: artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
- artist:* . relation:is_a . class:musician
  completions: artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join:
- artist:john_lennon, etc.
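21 The HYB Index [Bast/Weber, SIGIR'06]
- Maintains lists for word ranges (not words)
  - e.g., one list for the range abl-abt, containing the occurrences of able, ablaze, abroad, abnormal, ...
- Looks like this for person:*
  - one list containing the occurrences of person:graham_greene, person:john_lennon, person:ringo_starr, person:john_lennon, ...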
22 The HYB Index [Bast/Weber, SIGIR'06]
- Maintains lists for word ranges (not words)
  - e.g., one list for the range abl-abt, containing the occurrences of able, ablaze, abroad, abnormal, ...
- Provably efficient
  - no more space than an inverted index (on the same data)
  - each query = scan of a moderate number of (compressed) items
- Extremely versatile
  - can do all kinds of things an inverted index cannot do (efficiently)
  - autocompletion, faceted search, query expansion, error correction, select and join, ...
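The idea of keeping one list per word range can be sketched as follows. This is a simplified illustration, not the HYB implementation: there is no compression, the range boundaries in `BOUNDS` are chosen by hand, and all names (`BOUNDS`, `block_of`, `build`, `prefix_query`) are hypothetical.

```python
import bisect
from collections import defaultdict

# Hand-picked range starts: block 0 = ["", "abl"), block 1 = ["abl", "abu"),
# block 2 = ["abu", ...). Real boundaries would be derived from the data.
BOUNDS = ["", "abl", "abu"]

def block_of(word):
    """Index of the word range that contains `word`."""
    return bisect.bisect_right(BOUNDS, word) - 1

def build(postings):
    """Group (doc_id, word) postings into one list per word range."""
    blocks = defaultdict(list)
    for doc_id, word in sorted(postings):
        blocks[block_of(word)].append((doc_id, word))
    return blocks

def prefix_query(blocks, candidate_docs, prefix):
    """Scan the block(s) covering `prefix`, restricted to candidate documents.

    Returns the matching documents and the completions of the prefix.
    """
    first, last = block_of(prefix), block_of(prefix + "\uffff")
    docs, completions = set(), set()
    for b in range(first, last + 1):
        for doc_id, word in blocks.get(b, []):
            if doc_id in candidate_docs and word.startswith(prefix):
                docs.add(doc_id)
                completions.add(word)
    return docs, completions

blocks = build([(1, "able"), (1, "ablaze"), (2, "abroad"),
                (3, "abnormal"), (3, "able"), (4, "abut")])
print(prefix_query(blocks, {1, 3}, "abl"))
```

A prefix query touches only the block or blocks whose range overlaps the prefix, and it returns both the matching documents and the completed words in a single scan, which is what the prefix queries on the earlier slides rely on.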
23 Queries we can handle
SPARQL = SPARQL Protocol And RDF Query Language (yes, it's recursive)
- We prove the following theorem
  - Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations
  - Example (musicians born in the same year as John Lennon):
    SELECT ?who WHERE { ?who is_a Musician . ?who born_in_year ?when . John_Lennon born_in_year ?when }
- ESTER achieves seamless integration with full-text search
  - SPARQL has no means for dealing with full-text search
  - XQuery can handle full-text search, but is not really suitable for semantic search
More about supported queries in the paper
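To make the example query concrete, here is how its three edges can be evaluated with simple lookups and joins on a toy list of facts. This is a sketch for illustration only: it is not the reduction to prefix and join operations proved in the paper, and `TRIPLES`, `subjects`, and `objects` are hypothetical names.

```python
# Tiny fact list as (subject, predicate, object) triples.
TRIPLES = [
    ("John_Lennon", "is_a", "Musician"),
    ("John_Lennon", "born_in_year", "1940"),
    ("Ringo_Starr", "is_a", "Musician"),
    ("Ringo_Starr", "born_in_year", "1940"),
    ("Paul_McCartney", "is_a", "Musician"),
    ("Paul_McCartney", "born_in_year", "1942"),
    ("Bruce_Lee", "is_a", "Actor"),
    ("Bruce_Lee", "born_in_year", "1940"),
]

def subjects(pred, obj):
    """All ?s with (?s, pred, obj)."""
    return {s for s, p, o in TRIPLES if p == pred and o == obj}

def objects(subj, pred):
    """All ?o with (subj, pred, ?o)."""
    return {o for s, p, o in TRIPLES if s == subj and p == pred}

# Edge 1: ?who is_a Musician
musicians = subjects("is_a", "Musician")
# Edge 2: John_Lennon born_in_year ?when
years = objects("John_Lennon", "born_in_year")
# Edge 3: ?who born_in_year ?when, joined with edge 2 on ?when
same_year = {s for y in years for s in subjects("born_in_year", y)}
# Final join on ?who
print(sorted(musicians & same_year))
```

Note that the result includes John_Lennon himself, since the query as written does not exclude him.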
24 Experiments: Corpus, Ontology, Index
- Corpus: English Wikipedia (XML dump from Nov. 2006)
  - 8 GB raw XML
  - 2.8 million documents
  - 1 billion words
- Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW'07)
  - 2.5 million facts
  - derived from a clever combination of Wikipedia and WordNet (entities from Wikipedia, taxonomy from WordNet)
- Our index
  - 1.5 billion words (original + artificial)
  - 3.3 GB total index size; ontology-only is a mere 100 MB
Note: our system works for an arbitrary corpus and ontology
25 Experiments: Efficiency, What Baseline?
- SPARQL engines
  - can't do text search
  - and slow for ontology-only too (on Wikipedia: seconds)
- XQuery engines
  - extremely slow for text search (on Wikipedia: minutes)
  - and slow for ontology-only too (on Wikipedia: seconds)
- Other prototypes which do semantic + full-text search
  - efficiency is hardly considered
  - e.g., the system of Castells/Fernandez/Vallet (TKDE'07): "average informally observed response time on a standard professional desktop computer of below 30 seconds on 145,316 documents and an ontology with 465,848 facts"
  - our system: 100 ms, 2.8 million documents, 2.5 million facts
26 Experiments: Efficiency, Stress Test 1
- Compare to ontology-only system
  - the YAGO engine from WWW'07
- Onto Simple: when was <person> born (1000 queries)
- Onto Advanced: list all people from <profession> (1000 queries)
- Onto Hard: when did people die who were born in the same year as <person> (1000 queries)
- Note: comparison very unfair (for our system)
  - index sizes: 4 GB vs. 100 MB
27 Experiments: Efficiency, Stress Test 2
- Compare to text-only search engine
  - state-of-the-art system from SIGIR'06
- OntoText Easy: counties in <US state> (50 queries)
- OntoText Hard: computer scientists <nationality> (50 queries)
- Full-text query: e.g., german computer scientists
  - Note: hardly finds relevant documents
- Note: comparison extremely unfair (for our system)
28 Experiments: Quality, Entity Recognition
- Use Wikipedia links as hints
  - "... following [[John Lennon|Lennon]] and Paul McCartney, two of the Beatles, ..."
  - "... The southern terminus is located south of the town of [[Lennon, Michigan|Lennon]] ..."
- Learn other links
  - use words in the neighborhood as features
- Accuracy
29 Experiments: Quality, Relevance
- 2 query sets
  - People associated with <American university> (100 queries)
  - Counties of <American state> (50 queries)
- Ground truth
  - Wikipedia has corresponding lists
  - e.g., List of Carnegie Mellon University People
- Precision and Recall
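For reference, precision and recall against a ground-truth list are computed as follows. This is a generic sketch with placeholder item names; it does not reproduce the measurements from the talk.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / |retrieved|, recall = hits / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Placeholder items: "a".."e" stand for entities returned by the system
# and entities on the corresponding Wikipedia list.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r)
```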
30 Conclusions
- Semantic retrieval system ESTER
  - fast and scalable via reduction to prefix search and join
  - can handle all basic SPARQL queries
  - seamless integration with full-text search
  - standard user interface with (semantic) suggestions
- Lots of interesting and challenging problems
  - simultaneous ranking of entities and documents
  - proper snippet generation and highlighting
  - search result quality
Dank je wel! (Thank you!)
33 Context-Sensitive Prefix Search
- Compute completions of the last query word
  - which together with the previous part of the query would lead to a hit
  - DEMO: show a live example
- Extremely useful
  - autocompletion search
  - faceted search
  - error correction, synonym search, ...
  - category search
    - for example, add place:amsterdam
    - then the query place:* finds all instances of a place
Formal definition in the paper
Isn't the last idea enough for semantic search?
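Context-sensitive prefix search, as described on this slide, can be sketched on a toy collection. The documents, the `:` separator, and the name `complete` are assumptions for illustration; the sketch scans every document and so shows only the semantics, not an efficient implementation.

```python
# Toy documents as word lists (made-up data).
DOCS = {
    1: ["beatles", "musician:john_lennon", "liverpool"],
    2: ["beatles", "music", "musician:pete_best"],
    3: ["mozart", "musician:wolfgang_amadeus_mozart", "museum"],
}

def complete(query):
    """Completions of the last query word that lead to a hit together
    with the preceding words; also returns the matching documents."""
    *context, last = query.split()
    # Documents that contain every word of the context.
    hits = {d for d, words in DOCS.items() if all(c in words for c in context)}
    completions = sorted({w for d in hits for w in DOCS[d]
                          if w.startswith(last)})
    hit_docs = sorted(d for d in hits
                      if any(w.startswith(last) for w in DOCS[d]))
    return completions, hit_docs

print(complete("beatles musi"))
```

The word "museum" is not offered as a completion of "musi" here, because it does not occur in any document that also contains "beatles".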
34 DEMO
- Do the following queries, live or recorded
  - beatles
  - beatles musi
  - beatles musicia
  - beatles musician:john_lennon (or beatles entity:john_lennon)
35 Processing the query "beatles musician"
Document text with artificial words:
  Liverpool ... one of many documents mentioning John Lennon ... in honor of the late Beatle entity:john_lennon ...
Ontology document for John Lennon (position: words):
  0: entity:john_lennon
  1: r:is_a
  2: class:musician, class:singer
Two prefix queries:
- beatles entity:*
- entity:* r:is_a class:musician
- Problem: entity:* has a huge number of occurrences
  - 200 million for Wikipedia, which is 20% of all occurrences
  - prefix search efficient only up to XXX
- Solution: frontier set
  - classes high up in the hierarchy (explain more)
  - e.g., person, animal, substance, abstraction, ...
36 Processing the query "beatles musician"
Document text with artificial words:
  Liverpool ... one of many documents mentioning John Lennon ... in honour of the late Beatle person:john_lennon ...
Ontology document for John Lennon (position: words):
  0: person:john_lennon
  1: is_a
  2: class:musician, class:singer
Two prefix queries:
- beatles person:*
  completions: person:john_lennon, person:the_queen, person:pete_best, etc.
- person:* r:is_a class:musician
  completions: person:wolfgang_amadeus_mozart, person:johann_sebastian_bach, person:john_lennon, etc.
One join:
- entity:john_lennon, etc.
37 Our Solution, Version 1
Some document about Albert Einstein: entity:einstein
Albert Einstein: entity:albert_einstein:scientist:vegetarian:intellectual
- Combination of prefix search and join
  - Query 1: beatles entity (entities co-occurring with beatles)
  - Query 2: musician entity (entities which are musicians)
  - Join the completions from 1 and 2: musicians co-occurring with beatles
But unspecific prefixes (entity) are hard
38 Our Solution, Version 2
Some document mentioning John Lennon: musician:john_lennon xyz:john_lennon
John Lennon: musician:john_lennon xyz:john_lennon
Special doc: TRANSLATE:singer:musician
- Combination of prefix search and join
  - Query 1: translate:singer (tells us that a singer is a musician)
  - Query 2: beatles musician (musicians co-occurring with beatles)
  - Query 3: physicist scientist (musicians which are singers)
  - Join the completions from 1 and 2: singers co-occurring with beatles
39 Processing the query "beatles musician"
John Lennon at the Royal Variety Show in 1963, in the presence of members of the British royalty: "Those of you in the cheaper seats can clap your hands. The rest of you, if you'll just rattle your jewellery."
Document text with artificial words:
  Gitanes ... legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked ...
Ontology document for John Lennon (position: words):
  0: artist:john_lennon, believer:john_lennon
  1: relation:is_a
  2: class:musician
Two prefix queries:
- beatles artist:*
  completions: artist:john_lennon, artist:queen_elisabeth, artist:pete_best, etc.
- artist:* . relation:is_a . class:musician
  completions: artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join:
- person:john_lennon, etc.