ESTER: Efficient Search on Text, Entities, and Relations

1
ESTER: Efficient Search on Text, Entities, and Relations

Holger Bast
Max-Planck-Institut für Informatik
Saarbrücken, Germany
joint work with
Alexandru Chitea, Fabian Suchanek, Ingmar Weber

Talk at SIGIR'07 in Amsterdam, July 26th
3
ESTER: It's about Fast Semantic Search

4
Keyword Search vs. Semantic Search

Keyword search
  Query: john lennon
  Answer: documents containing the words john and lennon
Semantic search
  Query: musician
  Answer: documents containing an instance of musician
Combined search
  Query: beatles musician
  Answer: documents containing the word beatles and an instance of musician

Useful by itself or as a component of a QA system
11
Semantic Search: Challenges and Our System

1. Entity recognition
  approach 1: let users annotate (semantic web)
  approach 2: annotate (semi-)automatically
  our system: uses Wikipedia links and learns from them
2. Query processing
  build a space-efficient index
  which enables fast query answers
  our system: as compact and fast as a standard full-text engine
3. User interface
  easy to use
  yet powerful query capabilities
  our system: standard interface with interactive suggestions

focus of the paper and of this talk
13
In the Rest of this Talk

Efficiency
  three simple ideas (which all fail)
  our approach (which works)
Queries supported
  essentially all SPARQL queries, and
  seamless integration with ordinary full-text search
Experiments
  efficiency (great)
  quality (not so great yet)
Conclusions
  lots of interesting and challenging open problems

14
Efficiency: Simple Idea 1

Add semantic tags to the document
  e.g., add the special word tag:musician before every occurrence of a musician in a document

Problem 1: index blowup
  e.g., John Lennon is a Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, ... (28 classes)
Problem 2: limited querying capabilities
  e.g., could not produce the list of musicians that occur in documents that also contain the word beatles
  in particular, could not do all SPARQL queries (more on that later)

15
Efficiency: Simple Idea 2

Query expansion
  e.g., replace the query word musician by the disjunction
  musician:aaron_copland OR ... OR musician:zarah_leander
  (7,593 musicians in Wikipedia)
Problem: inefficient query processing
  one intersection per element of the disjunction needed
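
To make the cost argument concrete, here is a minimal sketch of query expansion over a toy inverted index. The index contents, the document ids, and the function name `expanded_query` are all hypothetical; the point is only that the work grows with the number of instances in the disjunction.

```python
from typing import Dict, List, Set, Tuple

# Toy inverted index (hypothetical data): word -> set of document ids.
INDEX: Dict[str, Set[int]] = {
    "beatles": {1, 2, 5},
    "musician:john_lennon": {1, 3},
    "musician:aaron_copland": {4},
    "musician:zarah_leander": {5, 6},
}


def expanded_query(word: str, instances: List[str]) -> Tuple[Set[int], int]:
    """Answer `word AND (inst_1 OR ... OR inst_k)`.

    Returns the matching document ids and the number of list
    intersections performed (one per instance of the class).
    """
    hits: Set[int] = set()
    intersections = 0
    for inst in instances:
        hits |= INDEX[word] & INDEX.get(inst, set())
        intersections += 1
    return hits, intersections


musicians = [w for w in INDEX if w.startswith("musician:")]
docs, cost = expanded_query("beatles", musicians)
print(docs, cost)
```

With 7,593 musicians instead of three, the loop above would run 7,593 times for a single query word.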

16
Efficiency: Simple Idea 3

Use a database
  map semantic queries to SQL queries on suitably constructed tables
  that's what the Artificial-Intelligence / Semantic-Web people usually do
Problem: inefficient + lack of control
  building a search engine on top of an off-the-shelf database is orders of magnitude slower or uses orders of magnitude more space, or both
  very limited control regarding efficiency aspects

17
Efficiency: Our Approach

Two basic operations
  prefix search of a special kind (will be explained by example)
  join (will be explained by example)
An index data structure
  which supports these two operations efficiently
Artificial words in the documents
  such that a large class of semantic queries reduces to a combination of (few of) these operations

18
Processing the query: beatles musician

Document "Gitanes":
  ... legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice ...
Document "John Lennon" (position: words):
  0: entity:john_lennon
  1: relation:is_a
  2: class:musician class:singer

Two prefix queries
  beatles entity:*
    completions: entity:john_lennon, entity:1964, entity:liverpool, etc.
  entity:* . relation:is_a . class:musician
    completions: entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
One join
  entity:john_lennon, etc.
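
The two prefix queries and the join can be sketched on a toy corpus as follows. This is an illustration under simplifying assumptions, not the ESTER implementation: documents are plain word lists, positions and the proximity operator are ignored, and all names (`DOCS`, `docs_with`, `prefix_search`, `beatles_musicians`) are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Toy corpus (hypothetical): document id -> words, artificial words included.
DOCS: Dict[str, List[str]] = {
    "gitanes": ["john", "lennon", "entity:john_lennon", "of", "the",
                "beatles", "smoked", "gitanes"],
    "liverpool": ["liverpool", "entity:liverpool", "home", "of", "the",
                  "beatles"],
    "onto_lennon": ["entity:john_lennon", "relation:is_a",
                    "class:musician", "class:singer"],
    "onto_mozart": ["entity:wolfgang_amadeus_mozart", "relation:is_a",
                    "class:musician"],
}


def docs_with(words: Iterable[str]) -> Set[str]:
    """Documents that contain every one of the given words."""
    words = list(words)
    return {d for d, ws in DOCS.items() if all(w in ws for w in words)}


def prefix_search(context_docs: Set[str], prefix: str) -> Set[str]:
    """Words starting with `prefix` that occur in the context documents."""
    return {w for d in context_docs for w in DOCS[d] if w.startswith(prefix)}


def beatles_musicians() -> Set[str]:
    # Prefix query 1: beatles entity:*
    in_text = prefix_search(docs_with(["beatles"]), "entity:")
    # Prefix query 2: entity:* relation:is_a class:musician
    musicians = prefix_search(
        docs_with(["relation:is_a", "class:musician"]), "entity:")
    # Join: entities that appear in both completion sets.
    return in_text & musicians


print(beatles_musicians())
```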
19
Processing the query: beatles musician

Prefix queries: beatles entity:* and entity:* . relation:is_a . class:musician

Problem: entity:* has a huge number of occurrences
  200 million for Wikipedia, which is 20% of all occurrences
  prefix search is efficient only for up to 1% (explanation follows)

Solution: frontier classes
  classes at an appropriate level in the hierarchy
  e.g., artist, believer, worker, vegetable, animal, ...

20
Processing the query: beatles musician

Document "Gitanes":
  ... legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked ...
Document "John Lennon" (position: words):
  0: artist:john_lennon believer:john_lennon
  1: relation:is_a
  2: class:musician

Two prefix queries
  beatles artist:*
    completions: artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
  artist:* . relation:is_a . class:musician
    completions: artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join
  first figure out musician → artist (easy)
  artist:john_lennon, etc.
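
The "figure out musician → artist" step can be pictured as a walk up the class hierarchy until a frontier class is reached. The sketch below assumes a tiny tree-shaped hierarchy and a hand-picked frontier set; both are made up for illustration (a real taxonomy allows several superclasses per class), and the function names are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical fragment of a class hierarchy: class -> superclass.
PARENT: Dict[str, str] = {
    "singer": "musician",
    "musician": "artist",
    "artist": "person",
    "priest": "believer",
    "believer": "person",
}

# Hypothetical frontier set: classes at an "appropriate level".
FRONTIER: Set[str] = {"artist", "believer"}


def frontier_class(cls: str) -> Optional[str]:
    """Walk upwards from `cls` and return the first frontier class found."""
    current: Optional[str] = cls
    while current is not None:
        if current in FRONTIER:
            return current
        current = PARENT.get(current)
    return None


def query_prefix(cls: str) -> Optional[str]:
    """Prefix to use in the prefix queries for a class, e.g. 'artist:'."""
    frontier = frontier_class(cls)
    return None if frontier is None else frontier + ":"


print(query_prefix("musician"))
```

Choosing the frontier lower in the hierarchy makes each prefix more selective, at the price of adding more artificial words per entity.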
21
The HYB Index [Bast/Weber, SIGIR'06]

Maintains lists for word ranges (not words)
  e.g., one list for the range abl-abt, containing able, ablaze, abroad, abnormal

Looks like this for person:*
  person:graham_greene, person:john_lennon, person:ringo_starr, person:john_lennon, ...
22
The HYB Index [Bast/Weber, SIGIR'06]

Provably efficient
  no more space than an inverted index (on the same data)
  each query: a scan of a moderate number of (compressed) items

Extremely versatile
  can do all kinds of things an inverted index cannot do (efficiently)
  autocompletion, faceted search, query expansion, error correction, select and join, ...
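
A rough sketch of why one list per word range helps with prefix queries: a single scan over the block yields all matching words for a set of candidate documents. The block contents and the name `scan_block` are hypothetical, and compression and word ids are left out, so this only conveys the idea and is not the HYB data structure itself.

```python
from typing import List, Set, Tuple

# One block for the word range abl-abt (hypothetical data):
# (document id, word) pairs, sorted by document id.
BLOCK_ABL_ABT: List[Tuple[int, str]] = [
    (1, "able"),
    (2, "ablaze"),
    (2, "abroad"),
    (3, "abnormal"),
    (5, "able"),
]


def scan_block(block: List[Tuple[int, str]],
               candidate_docs: Set[int],
               prefix: str) -> List[Tuple[int, str]]:
    """Single pass over a block: keep the postings whose document is a
    candidate and whose word starts with the given prefix."""
    return [(doc, word) for doc, word in block
            if doc in candidate_docs and word.startswith(prefix)]


print(scan_block(BLOCK_ABL_ABT, {2, 5}, "abl"))
```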

23
Queries we can handle

SPARQL Protocol And RDF Query Language (yes, it's recursive)

We prove the following theorem
  Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations
  Example (musicians born in the same year as John Lennon):
    SELECT ?who WHERE {
      ?who is_a Musician .
      ?who born_in_year ?when .
      John_Lennon born_in_year ?when
    }
ESTER achieves seamless integration with full-text search
  SPARQL has no means for dealing with full-text search
  XQuery can handle full-text search, but is not really suitable for semantic search

more about supported queries in the paper
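
To show what the example query asks for, here is a small sketch that evaluates its three edges one at a time over a toy fact table. It illustrates the meaning of the query only; it is not the reduction to prefix and join operations proved in the paper, and the fact table and function name are made up for this example.

```python
from typing import Set, Tuple

# Toy fact table (for illustration): (subject, predicate, object) triples.
FACTS: Set[Tuple[str, str, str]] = {
    ("John_Lennon", "is_a", "Musician"),
    ("John_Lennon", "born_in_year", "1940"),
    ("Ringo_Starr", "is_a", "Musician"),
    ("Ringo_Starr", "born_in_year", "1940"),
    ("Paul_McCartney", "is_a", "Musician"),
    ("Paul_McCartney", "born_in_year", "1942"),
    ("Al_Pacino", "is_a", "Actor"),
    ("Al_Pacino", "born_in_year", "1940"),
}


def same_year_musicians(person: str = "John_Lennon") -> Set[str]:
    # Edge 1: ?who is_a Musician
    musicians = {s for s, p, o in FACTS if p == "is_a" and o == "Musician"}
    # Edge 3: <person> born_in_year ?when
    years = {o for s, p, o in FACTS if s == person and p == "born_in_year"}
    # Edge 2: ?who born_in_year ?when, restricted to the years found above
    born_then = {s for s, p, o in FACTS
                 if p == "born_in_year" and o in years}
    # Combine the bindings for ?who.
    return musicians & born_then


print(sorted(same_year_musicians()))
```

As in the query itself, John Lennon is part of the result, since nothing excludes him from binding to ?who.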
24
Experiments: Corpus, Ontology, Index

Corpus: English Wikipedia (XML dump from Nov. 2006)
  8 GB raw XML
  2.8 million documents
  1 billion words
Ontology: YAGO (Suchanek/Kasneci/Weikum, WWW'07)
  2.5 million facts
  derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet)
Our index
  1.5 billion words (original + artificial)
  3.3 GB total index size; ontology-only is a mere 100 MB

Note: our system works for an arbitrary corpus + ontology
25
Experiments, Efficiency: What Baseline?

SPARQL engines
  can't do text search
  and slow for ontology-only too (on Wikipedia: seconds)
XQuery engines
  extremely slow for text search (on Wikipedia: minutes)
  and slow for ontology-only too (on Wikipedia: seconds)
Other prototypes which do semantic full-text search
  efficiency is hardly considered
  e.g., the system of Castells/Fernandez/Vallet (TKDE'07):
    average informally observed response time on a standard professional desktop computer of below 30 seconds on 145,316 documents and an ontology with 465,848 facts
  our system: 100 ms, 2.8 million documents, 2.5 million facts

26
Experiments, Efficiency: Stress Test 1

Compare to ontology-only system
  the YAGO engine from WWW'07
  Onto Simple: when was <person> born (1000 queries)
  Onto Advanced: list all people from <profession> (1000 queries)
  Onto Hard: when did people die who were born in the same year as <person> (1000 queries)
Note: comparison very unfair (for our system)

4 GB index
100 MB index
27
Experiments, Efficiency: Stress Test 2

Compare to text-only search engine
  state-of-the-art system from SIGIR'06
  Onto+Text Easy: counties in <US state> (50 queries)
  Onto+Text Hard: computer scientists <nationality> (50 queries)
  Full-text query, e.g., german computer scientists
    Note: hardly finds relevant documents
Note: comparison extremely unfair (for our system)

28
Experiments, Quality: Entity Recognition

Use Wikipedia links as hints
  ... following [[John Lennon|Lennon]] and Paul McCartney, two of the Beatles, ...
  ... The southern terminus is located south of the town of [[Lennon, Michigan|Lennon]] ...
Learn other links
  use words in neighborhood as features
Accuracy

29
Experiments, Quality: Relevance

2 query sets
  People associated with <American university> (100 queries)
  Counties of <American state> (50 queries)
Ground truth
  Wikipedia has corresponding lists
  e.g., List of Carnegie Mellon University People
Precision and Recall
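
For reference, precision and recall against such a ground-truth list can be computed as in the sketch below. The function name and the sample entity names are hypothetical placeholders, not results from the experiments.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision = |retrieved ∩ relevant| / |retrieved|,
    recall = |retrieved ∩ relevant| / |relevant|.
    Both are defined as 0.0 when their denominator is empty."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Placeholder names: four entities returned, three on the ground-truth list.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r)
```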

30
Conclusions

Semantic retrieval system ESTER
  fast and scalable via reduction to prefix search and join
  can handle all basic SPARQL queries
  seamless integration with full-text search
  standard user interface with (semantic) suggestions
Lots of interesting and challenging problems
  simultaneous ranking of entities and documents
  proper snippet generation and highlighting
  search result quality

Dank je wel! (Thank you!)
33
Context-Sensitive Prefix-Search

Compute completions of the last query word
  which together with the previous part of the query would lead to a hit
  DEMO: show a live example
Extremely useful
  autocompletion search
  faceted search
  error correction, synonym search, ...
  category search
    for example, add place:amsterdam
    then the query place:* finds all instances of a place

formal definition in the paper
Isn't the last idea enough for semantic search?
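
A minimal sketch of context-sensitive prefix search as described on this slide: the last query word is treated as a prefix, and only completions occurring in documents that match the preceding words are returned. The toy documents and the function name `complete` are hypothetical, and the query is assumed to contain at least one word.

```python
from collections import Counter
from typing import Dict, Set

# Toy corpus (hypothetical): document id -> set of words.
DOCS: Dict[int, Set[str]] = {
    1: {"beatles", "musician:john_lennon", "music"},
    2: {"beatles", "museum"},
    3: {"stones", "musician:mick_jagger"},
    4: {"canal", "place:amsterdam"},
}


def complete(query: str) -> Counter:
    """Completions of the last query word, counted by the number of
    documents that contain them together with all preceding words."""
    *context, prefix = query.split()
    hits = [d for d, words in DOCS.items()
            if all(w in words for w in context)]
    return Counter(w for d in hits for w in DOCS[d] if w.startswith(prefix))


print(complete("beatles musi"))
print(complete("place:"))
```

The second call mirrors the category-search example: with the artificial word place:amsterdam added, the bare prefix place: lists the instances of a place.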
34
DEMO

Do the following queries live or recorded
  beatles
  beatles musi
  beatles musicia
  beatles musician:john_lennon (or beatles entity:john_lennon)

35
Processing the query: beatles musician

Document "Liverpool" (one of many documents mentioning John Lennon):
  ... in honor of the late Beatle entity:john_lennon ...
Document "John Lennon" (position: words):
  0: entity:john_lennon
  1: r:is_a
  2: class:musician class:singer
Prefix queries: beatles entity:* and entity:* r:is_a class:musician

Problem: entity:* has a huge number of occurrences
  200 million for Wikipedia, 20% of all occurrences
  prefix search efficient only up to XXX

Solution: frontier set
  classes high up in the hierarchy (explain more)
  e.g., person, animal, substance, abstraction, ...

36
Processing the query: beatles musician

Document "Liverpool" (one of many documents mentioning John Lennon):
  ... in honour of the late Beatle person:john_lennon ...
Document "John Lennon" (position: words):
  0: person:john_lennon
  1: is_a
  2: class:musician class:singer

Two prefix queries
  beatles person:*
    completions: person:john_lennon, person:the_queen, person:pete_best, etc.
  person:* r:is_a class:musician
    completions: person:wolfgang_amadeus_mozart, person:johann_sebastian_bach, person:john_lennon, etc.
One join
  entity:john_lennon, etc.
37
Our Solution, Version 1

Some document about Albert Einstein:
  ... entity:einstein ...
Special document "Albert Einstein":
  entity:albert_einstein scientist vegetarian intellectual

Combination of prefix search + join
  Query 1: beatles entity:* (entities co-occurring with beatles)
  Query 2: musician entity:* (entities which are musicians)
  Join the completions from 1 and 2 (musicians co-occurring with beatles)

But: unspecific prefixes (entity:*) are hard
38
Our Solution, Version 2

Some document mentioning John Lennon:
  ... musician:john_lennon xyz:john_lennon ...
Special document "John Lennon":
  musician:john_lennon xyz:john_lennon
Special document:
  TRANSLATE:singer:musician

Combination of prefix search + join
  Query 1: translate:singer:* (tells us that a singer is a musician)
  Query 2: beatles musician:* (musicians co-occurring with beatles)
  Query 3: physicist scientist (musicians which are singers)
  Join the completions from 1 and 2 (singers co-occurring with beatles)

39
Processing the query: beatles musician

John Lennon at the Royal Variety Show in 1963, in the presence of members of the British royalty:
  "Those of you in the cheaper seats can clap your hands. The rest of you, if you'll just rattle your jewellery."

Document "Gitanes":
  ... legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked ...
Document "John Lennon" (position: words):
  0: artist:john_lennon believer:john_lennon
  1: relation:is_a
  2: class:musician

Two prefix queries
  beatles artist:*
    completions: artist:john_lennon, artist:queen_elisabeth, artist:pete_best, etc.
  artist:* . relation:is_a . class:musician
    completions: artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join
  person:john_lennon, etc.
