ESTER: Efficient Search on Text, Entities, and Relations (PowerPoint Presentation Transcript)

1
ESTER: Efficient Search on Text, Entities, and Relations
  • Holger Bast
  • Max-Planck-Institut für Informatik
  • Saarbrücken, Germany
  • joint work with
  • Alexandru Chitea, Fabian Suchanek, Ingmar Weber

Talk at SIGIR'07 in Amsterdam, July 26th
3
ESTER: It's about Fast Semantic Search
  • Holger Bast
  • Max-Planck-Institut für Informatik
  • Saarbrücken, Germany
  • joint work with
  • Alexandru Chitea, Fabian Suchanek, Ingmar Weber

Talk at SIGIR'07 in Amsterdam, July 26th
4
Keyword Search vs. Semantic Search
  • Keyword search
  • Query: john lennon
  • Answer: documents containing the words john and lennon
  • Semantic search
  • Query: musician
  • Answer: documents containing an instance of musician
  • Combined search
  • Query: beatles musician
  • Answer: documents containing the word beatles and an instance of musician

Useful by itself or as a component of a QA system
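The three query types above can be sketched with a toy annotated mini-corpus (all names, texts, and the "class:musician" annotation scheme below are invented for illustration; ESTER's actual artificial words differ):

```python
# A toy mini-corpus (all names, texts, and the "class:musician" annotation
# scheme are invented) illustrating keyword, semantic, and combined search.

DOCS = {
    1: "john lennon class:musician of the beatles",
    2: "the beatles formed in liverpool",
    3: "bach class:musician wrote fugues",
}

def search(*terms):
    """Ids of documents containing every query term."""
    return sorted(d for d, text in DOCS.items()
                  if all(t in text.split() for t in terms))

print(search("john", "lennon"))             # keyword search  -> [1]
print(search("class:musician"))             # semantic search -> [1, 3]
print(search("beatles", "class:musician"))  # combined search -> [1]
```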
11
Semantic Search: Challenges & Our System
  • 1. Entity recognition
  • approach 1: let users annotate (semantic web)
  • approach 2: annotate (semi-)automatically
  • our system: uses Wikipedia links and learns from them
  • 2. Query processing
  • build a space-efficient index
  • which enables fast query answers
  • our system: as compact and fast as a standard full-text engine
  • 3. User interface
  • easy to use
  • yet powerful query capabilities
  • our system: standard interface with interactive suggestions

12
Semantic Search: Challenges & Our System

focus of the paper and of this talk
13
In the Rest of this Talk
  • Efficiency
  • three simple ideas (which all fail)
  • our approach (which works)
  • Queries supported
  • essentially all SPARQL queries, and
  • seamless integration with ordinary full-text
    search
  • Experiments
  • efficiency (great)
  • quality (not so great yet)
  • Conclusions
  • lots of interesting, challenging open problems

14
Efficiency: Simple Idea 1
  • Add semantic tags to the document
  • e.g., add the special word tag:musician before every occurrence of a musician in a document
  • Problem 1: Index blowup
  • e.g., John Lennon is a Musician, Singer, Composer, Artist, Vegetarian, Person, Pacifist, ... (28 classes)
  • Problem 2: Limited querying capabilities
  • e.g., could not produce the list of musicians that occur in documents that also contain the word beatles
  • in particular, could not do all SPARQL queries (more on that later)

15
Efficiency: Simple Idea 2
  • Query expansion
  • e.g., replace the query word musician by a disjunction
  • musician:aaron_copland OR ... OR musician:zarah_leander
  • (7,593 musicians in Wikipedia)
  • Problem: Inefficient query processing
  • one intersection per element of the disjunction needed

16
Efficiency: Simple Idea 3
  • Use a database
  • map semantic queries to SQL queries on suitably constructed tables
  • that's what the Artificial-Intelligence / Semantic-Web people usually do
  • Problem: Inefficiency and lack of control
  • building a search engine on top of an off-the-shelf database is orders of magnitude slower, or uses orders of magnitude more space, or both
  • very limited control regarding efficiency aspects

17
Efficiency: Our Approach
  • Two basic operations
  • prefix search of a special kind (will be explained by example)
  • join (will be explained by example)
  • An index data structure
  • which supports these two operations efficiently
  • Artificial words in the documents
  • such that a large class of semantic queries reduces to a combination of (few of) these operations

18
Processing the query "beatles musician"
Document, with artificial words inserted at their positions:
  Gitanes legend says that John Lennon [entity:john_lennon] of the Beatles smoked Gitanes to deepen his voice
Indexed items with positions: John Lennon 0, entity:john_lennon 1, relation:is_a 2, class:musician 2, class:singer
Two prefix queries:
  beatles entity:* → entity:john_lennon, entity:1964, entity:liverpool, etc.
  entity:* . relation:is_a . class:musician → entity:wolfang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
One join:
  entity:john_lennon, etc.
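The two basic operations above can be sketched on toy data (all document ids, the colon notation, and the document-level simplification are assumptions; the real index works on positional, compressed lists):

```python
# A toy sketch of ESTER's two basic operations, prefix search and join
# (document ids and the doc-level simplification are invented; ESTER's
# actual index operates on positional, compressed lists).

# Inverted index: word -> sorted list of ids of documents containing it.
# Docs 1-4 are text documents; docs 10-11 hold ontology facts encoded as
# artificial words (e.g. "entity:X relation:is_a class:musician").
INDEX = {
    "beatles":                      [1, 4],
    "entity:john_lennon":           [1, 11],
    "entity:liverpool":             [1],
    "entity:1964":                  [4],
    "entity:johann_sebastian_bach": [10],
    "relation:is_a":                [10, 11],
    "class:musician":               [10, 11],
}

def prefix_search(prefix, doc_filter=None):
    """All words starting with `prefix` that occur in a doc of `doc_filter`."""
    return {word for word, docs in INDEX.items()
            if word.startswith(prefix)
            and (doc_filter is None or any(d in doc_filter for d in docs))}

# Prefix query 1: entities co-occurring with "beatles".
completions_1 = prefix_search("entity:", doc_filter=set(INDEX["beatles"]))

# Prefix query 2: entities that the ontology says are musicians.
musician_docs = set(INDEX["relation:is_a"]) & set(INDEX["class:musician"])
completions_2 = prefix_search("entity:", doc_filter=musician_docs)

# One join: intersect the two completion sets.
result = completions_1 & completions_2
print(sorted(result))   # ['entity:john_lennon']
```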
19
Processing the query "beatles musician"
Document, with artificial words inserted at their positions:
  Gitanes legend says that John Lennon [entity:john_lennon] of the Beatles smoked Gitanes to deepen his voice
Indexed items with positions: John Lennon 0, entity:john_lennon 1, relation:is_a 2, class:musician 2, class:singer
Queries: beatles entity:* and entity:* . relation:is_a . class:musician
  • Problem: entity:* has a huge number of occurrences
  • 200 million for Wikipedia, which is 20% of all occurrences
  • prefix search is efficient only for up to 1% (explanation follows)
  • Solution: frontier classes
  • classes at the appropriate level in the hierarchy
  • e.g., artist, believer, worker, vegetable, animal, ...

20
Processing the query "beatles musician"
Document, with artificial words inserted at their positions:
  Gitanes legend says that John Lennon [artist:john_lennon] [believer:john_lennon] of the Beatles smoked ...
Indexed items with positions: John Lennon 0, artist:john_lennon 0, believer:john_lennon 1, relation:is_a 2, class:musician
Two prefix queries:
  beatles artist:* → artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
  artist:* . relation:is_a . class:musician → artist:wolfang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join:
  first figure out that musician is subsumed by artist (easy)
  artist:john_lennon, etc.
21
The HYB Index [Bast/Weber, SIGIR'06]
  • Maintains lists for word ranges (not words)
  • e.g., one list for the range abl-abt, covering able, ablaze, abnormal, abroad, ...
  • Looks like this for person*:
  person:graham_greene, person:john_lennon, person:ringo_starr, person:john_lennon, ...
22
The HYB Index [Bast/Weber, SIGIR'06]
  • Maintains lists for word ranges (not words)
  • Provably efficient
  • no more space than an inverted index (on the same data)
  • each query = a scan of a moderate number of (compressed) items
  • Extremely versatile
  • can do all kinds of things an inverted index cannot do (efficiently)
  • autocompletion, faceted search, query expansion, error correction, select and join, ...
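The range-list idea can be sketched as follows (block boundaries, doc ids, and the sentinel-based range test are invented for illustration; the real HYB index stores compressed lists and word ids):

```python
# A minimal sketch of the HYB idea under simplifying assumptions: the
# vocabulary is split into contiguous word ranges, with one postings list
# per range (doc id stored together with the word itself), so a prefix
# query only scans the few blocks whose range can overlap the prefix.

# Each block: (first_word, last_word, postings as (doc_id, word) pairs).
BLOCKS = [
    ("able", "abt", [(1, "able"), (2, "ablaze"), (2, "abnormal"), (3, "abroad")]),
    ("person:a", "person:z", [(1, "person:graham_greene"),
                              (2, "person:john_lennon"),
                              (3, "person:ringo_starr"),
                              (4, "person:john_lennon")]),
]

def prefix_query(prefix):
    """Scan only blocks whose range may contain words with this prefix."""
    out = []
    for first, last, postings in BLOCKS:
        if last < prefix or first > prefix + "\uffff":  # range disjoint
            continue
        out.extend((d, w) for d, w in postings if w.startswith(prefix))
    return out

print(prefix_query("abn"))       # [(2, 'abnormal')] -- scans one block only
print(prefix_query("person:j"))  # both John Lennon postings
```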

23
Queries we can handle
SPARQL = SPARQL Protocol And RDF Query Language (yes, it's recursive)
  • We prove the following theorem:
  • Any basic SPARQL graph query with m edges can be reduced to at most 2m prefix / join operations
  • SELECT ?who WHERE {
      ?who is_a Musician .
      ?who born_in_year ?when .
      John_Lennon born_in_year ?when }
  • ESTER achieves seamless integration with full-text search
  • SPARQL has no means for dealing with full-text search
  • XQuery can handle full-text search, but is not really suitable for semantic search

musicians born in the same year as John Lennon
more about supported queries in the paper
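The reduction claimed above can be sketched in miniature (invented facts; `match` stands in for one prefix-search-style operation over artificial fact words, and the set intersections play the role of joins):

```python
# Invented mini-ontology illustrating the reduction: each triple pattern
# is answered by one prefix-search-like operation (`match`), and shared
# variables are resolved by joins (set intersections).  Facts are toy data.

FACTS = [
    ("john_lennon", "is_a", "musician"),
    ("john_lennon", "born_in_year", "1940"),
    ("ringo_starr", "is_a", "musician"),
    ("ringo_starr", "born_in_year", "1940"),
    ("bach",        "is_a", "musician"),
    ("bach",        "born_in_year", "1685"),
]

def match(s=None, p=None, o=None):
    """All facts matching one (possibly partial) triple pattern."""
    return [(fs, fp, fo) for fs, fp, fo in FACTS
            if (s is None or fs == s)
            and (p is None or fp == p)
            and (o is None or fo == o)]

# SELECT ?who WHERE { ?who is_a musician . ?who born_in_year ?when .
#                     john_lennon born_in_year ?when }
musicians    = {s for s, _, _ in match(p="is_a", o="musician")}
lennon_years = {o for _, _, o in match(s="john_lennon", p="born_in_year")}
who = {s for s, _, o in match(p="born_in_year")
       if s in musicians and o in lennon_years}   # two joins
print(sorted(who))   # ['john_lennon', 'ringo_starr']
```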
24
Experiments: Corpus, Ontology, Index
  • Corpus: English Wikipedia (XML dump from Nov. 2006)
  • 8 GB raw XML
  • 2.8 million documents
  • 1 billion words
  • Ontology: YAGO [Suchanek/Kasneci/Weikum, WWW'07]
  • 2.5 million facts
  • derived from a clever combination of Wikipedia + WordNet (entities from Wikipedia, taxonomy from WordNet)
  • Our index
  • 1.5 billion words (original + artificial)
  • 3.3 GB total index size; the ontology-only part is a mere 100 MB

Note: our system works for an arbitrary corpus + ontology
25
Experiments: Efficiency: What Baseline?
  • SPARQL engines
  • can't do text search
  • and slow for ontology-only queries too (on Wikipedia: seconds)
  • XQuery engines
  • extremely slow for text search (on Wikipedia: minutes)
  • and slow for ontology-only queries too (on Wikipedia: seconds)
  • Other prototypes which do semantic full-text search
  • efficiency is hardly considered
  • e.g., the system of Castells/Fernandez/Vallet (TKDE'07)
  • "average informally observed response time on a standard professional desktop computer of below 30 seconds", on 145,316 documents and an ontology with 465,848 facts
  • our system: 100 ms, 2.8 million documents, 2.5 million facts

26
Experiments: Efficiency: Stress Test 1
  • Compare to an ontology-only system
  • the YAGO engine from WWW'07
  • Onto Simple: when was <person> born (1000 queries)
  • Onto Advanced: list all people from <profession> (1000 queries)
  • Onto Hard: when did people die who were born in the same year as <person> (1000 queries)
  • Note: comparison very unfair (for our system)

4 GB index
100 MB index
27
Experiments: Efficiency: Stress Test 2
  • Compare to a text-only search engine
  • the state-of-the-art system from SIGIR'06
  • Onto+Text Easy: counties in <US state> (50 queries)
  • Onto+Text Hard: computer scientists + <nationality> (50 queries)
  • Full-text query, e.g., german computer scientists: note that it hardly finds relevant documents
  • Note: comparison extremely unfair (for our system)

28
Experiments: Quality: Entity Recognition
  • Use Wikipedia links as hints
  • e.g., the mention Lennon in "Lennon and Paul McCartney, two of the Beatles, ..." links to John Lennon
  • the mention Lennon in "The southern terminus is located south of the town of Lennon, Michigan" links to Lennon, Michigan
  • Learn other links
  • use words in the neighborhood as features
  • Accuracy

29
Experiments: Quality: Relevance
  • 2 query sets
  • people associated with <american university> (100 queries)
  • counties of <american state> (50 queries)
  • Ground truth
  • Wikipedia has corresponding lists
  • e.g., List of Carnegie Mellon University People
  • Precision and Recall

30
Conclusions
  • Semantic Retrieval System ESTER
  • fast and scalable via reduction to prefix search
    and join
  • can handle all basic SPARQL queries
  • seamless integration with full-text search
  • standard user interface with (semantic)
    suggestions
  • Lots of interesting and challenging problems
  • simultaneous ranking of entities and documents
  • proper snippet generation and highlighting
  • search result quality

Dank je wel! (Thank you!)
33
Context-Sensitive Prefix Search
  • Compute completions of the last query word
  • which, together with the previous part of the query, would lead to a hit
  • DEMO: show a live example
  • Extremely useful
  • autocompletion search
  • faceted search
  • error correction, synonym search, ...
  • category search
  • for example, add place:amsterdam
  • then the query place* finds all instances of a place

formal definition in the paper
Isn't the last idea enough for semantic search?
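Context-sensitive prefix search can be sketched on toy data (documents and word names below are invented):

```python
# Toy sketch of context-sensitive prefix search (documents and word names
# are invented): completions of the half-typed last query word that lead
# to a hit together with the rest of the query.

DOCS = {
    1: ["beatles", "musician:john_lennon", "liverpool"],
    2: ["beatles", "place:abbey_road"],
    3: ["musician:johann_sebastian_bach", "leipzig"],
}

def completions(context_words, prefix):
    """Words starting with `prefix` in docs containing all context words."""
    hits = set()
    for words in DOCS.values():
        if all(w in words for w in context_words):
            hits.update(w for w in words if w.startswith(prefix))
    return sorted(hits)

print(completions(["beatles"], "musi"))  # ['musician:john_lennon']
print(completions([], "place"))          # category search: all place words
```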
34
DEMO
  • Do the following queries (live or recorded)
  • beatles
  • beatles musi
  • beatles musicia
  • beatles musician:john_lennon (or beatles entity:john_lennon)

35
Processing the query "beatles musician"
Document, with artificial words inserted at their positions:
  Liverpool ... one of many documents mentioning John Lennon ... in honor of the late Beatle [entity:john_lennon]
Indexed items with positions: John Lennon 0, entity:john_lennon 1, r:is_a 2, class:musician 2, class:singer
Queries: beatles entity:* and entity:* r:is_a class:musician
  • Problem: entity:* has a huge number of occurrences
  • 200 million for Wikipedia; 20% of all occurrences
  • prefix search efficient only up to XXX
  • Solution: frontier set
  • classes high up in the hierarchy (explain more)
  • e.g., person, animal, substance, abstraction, ...

36
Processing the query "beatles musician"
Document, with artificial words inserted at their positions:
  Liverpool ... one of many documents mentioning John Lennon ... in honour of the late Beatle [person:john_lennon]
Indexed items with positions: John Lennon 0, person:john_lennon 1, is_a 2, class:musician 2, class:singer
Two prefix queries:
  beatles person:* → person:john_lennon, person:the_queen, person:pete_best, etc.
  person:* r:is_a class:musician → person:wolfang_amadeus_mozart, person:johann_sebastian_bach, person:john_lennon, etc.
One join:
  entity:john_lennon, etc.
37
Our Solution, Version 1
Some document about Albert Einstein:
  Albert Einstein → entity:albert_einstein, scientist, vegetarian, intellectual
  • Combination of prefix search + join
  • Query 1: beatles entity:* → entities co-occurring with beatles
  • Query 2: musician entity:* → entities which are musicians
  • Join the completions from 1 + 2 → musicians co-occurring with beatles

But unspecific prefixes (entity:*) are hard
38
Our Solution, Version 2
Some document mentioning John Lennon:
  John Lennon → musician:john_lennon, xyz:john_lennon
Special Doc: TRANSLATE singer musician
  • Combination of prefix search + join
  • Query 1: translate:singer → tells us that a singer is a musician
  • Query 2: beatles musician:* → musicians co-occurring with beatles
  • Query 3: physicist scientist → musicians which are singers
  • Join the completions from 1 + 2 → singers co-occurring with beatles

39
Processing the query "beatles musician"
John Lennon at the Royal Variety Show in 1963, in the presence of members of the British royalty: "Those of you in the cheaper seats can clap your hands. The rest of you, if you'll just rattle your jewellery."
Document, with artificial words inserted at their positions:
  Gitanes legend says that John Lennon [artist:john_lennon] [believer:john_lennon] of the Beatles smoked ...
Indexed items with positions: John Lennon 0, artist:john_lennon 0, believer:john_lennon 1, relation:is_a 2, class:musician
Two prefix queries:
  beatles artist:* → artist:john_lennon, artist:queen_elisabeth, artist:pete_best, etc.
  artist:* . relation:is_a . class:musician → artist:wolfang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join:
  person:john_lennon, etc.