Title: NAGA: Searching and Ranking Knowledge
1NAGA Searching and Ranking Knowledge
- Gjergji Kasneci
- Joint work with Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum
2Motivation
- Example queries
- Which politicians are also scientists?
- Which gods do the Maya and the Greeks have in common?
- Keyword queries are too weak to express advanced user intentions such as
- concepts,
- entity properties,
- relationships between entities.
- Data is not knowledge.
- Data extraction and organization needed
6Query and Results
- Query: x isA Politician; x isA Scientist
- Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel
7Query and Results
- Query: x type Greek god; x type z; y type Mayan god; y type z
- Results: Thunder god, Wisdom god, Agricultural god
9Systems
- Web → Universal knowledge
- Question answering and ranking: NAGA
- Question answering: START (Katz et al. TREC 2005)
- Information extraction: TextRunner (Banko et al. IJCAI 2007), Ex DBMS (Cafarella et al. CIDR 2007), ALICE (Banko et al. K-CAP 2007), KnowItAll (Etzioni et al. WWW 2004), YAGO (Suchanek et al. WWW 2007)
- Entity (keyword, proximity) search and ranking: BLINKS (He et al. SIGMOD 2007), Entity Search (Cheng et al. CIDR 2007), Libra (Nie et al. WWW 2007), DISCOVER (Hristidis et al. VLDB 2002), BANKS (Bhalotia et al. ICDE 2002)
- Querying: semantic databases (relational database, XML (XLinks), RDF)
10Outline
- Framework
- Data model
- Query language
- Ranking model
- Evaluation
- Setting
- Metrics
- Results
11Framework (Data model)
- Entity-relationship (ER) graph
- Node label: entity
- Edge label: relation
- Edge weight: relation strength
- Fact
- Represented by an edge
- Evidence pages for a fact f
- Web pages from which f was derived
- Computation of fact confidence (i.e. edge weights)
- Figure: excerpt from YAGO (Suchanek et al. WWW 2007) showing the edge Max Planck Institute locatedIn Germany and three type edges
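A minimal sketch of the data model above, assuming a fact is a labeled edge that carries a confidence weight and a set of evidence pages. The class names, the confidence value 0.95, and the evidence URL are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One labeled edge of the ER graph: subject --relation--> obj."""
    subject: str
    relation: str
    obj: str


@dataclass
class ERGraph:
    # Fact -> edge weight (fact confidence)
    confidence: dict = field(default_factory=dict)
    # Fact -> set of evidence pages the fact was derived from
    evidence: dict = field(default_factory=lambda: defaultdict(set))

    def add_fact(self, fact, confidence, pages=()):
        self.confidence[fact] = confidence
        self.evidence[fact].update(pages)

    def facts_from(self, subject):
        """All facts whose source node is `subject`."""
        return [f for f in self.confidence if f.subject == subject]


graph = ERGraph()
mpi_in_germany = Fact("Max Planck Institute", "locatedIn", "Germany")
# Hypothetical confidence and evidence page, for illustration only.
graph.add_fact(mpi_in_germany, 0.95, {"http://example.org/evidence-page"})
```

How the confidence is computed from the evidence pages is not specified on the slide, so the sketch simply stores it.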
16Framework (Query language)
- R: set of relationship labels
- RegEx(R): set of regular expressions over R-labels
- E: set of entity labels
- V: set of variables
- Definition (fact template)
- A fact template is a triple <e1 r e2> where e1, e2 ∈ E ∪ V and r ∈ RegEx(R) ∪ V.
- Examples
- <Liu (givenNameOf | familyNameOf) x>
- <Albert Einstein x Mileva Maric>
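The definition can be illustrated with a small matcher. This is only a sketch under stated assumptions: variables are written with a leading "$" (a convention chosen here, not taken from the slides), the relation slot is treated as a regular expression over a single edge label, and regular expressions that span multi-edge paths are not handled. The entity "Liu Hui" and the relation label "marriedTo" are hypothetical test data.

```python
import re


def is_variable(term):
    # Assumption: variables are written with a leading "$", e.g. "$x".
    return term.startswith("$")


def match_template(template, fact):
    """Return the variable bindings if `fact` matches `template`, else None.

    Both arguments are (subject, relation, object) triples of strings.
    """
    bindings = {}
    for position, (term, value) in enumerate(zip(template, fact)):
        if is_variable(term):
            # A variable must be bound consistently within the template.
            if bindings.get(term, value) != value:
                return None
            bindings[term] = value
        elif position == 1:
            # Relation slot: a regular expression over one relation label.
            if re.fullmatch(term, value) is None:
                return None
        elif term != value:
            return None
    return bindings


name_template = ("Liu", "givenNameOf|familyNameOf", "$x")
relation_template = ("Albert Einstein", "$r", "Mileva Maric")
```

With these templates, `match_template(name_template, ("Liu", "familyNameOf", "Liu Hui"))` binds `$x`, and `match_template(relation_template, ("Albert Einstein", "marriedTo", "Mileva Maric"))` binds the relation variable `$r`.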
17Framework (Query language)
- Definition (NAGA query)
- A NAGA query is a connected directed graph in which each edge represents a fact template.
- Examples
1) Which physicist was born in the same year as Max Planck?
- x isA Physicist; Max Planck isA Physicist; x bornInYear y; Max Planck bornInYear y
2) Which politician is also a scientist?
- x isA Politician; x isA Scientist
3) Which scientists are called Liu?
- Liu (givenNameOf | familyNameOf) x; x isA Scientist
4) Which mountain is located in Africa?
- x isA Mountain; x locatedIn Africa
5) What connects Einstein and Bohr?
- Query graph over the nodes Albert Einstein and Niels Bohr
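The connectivity requirement in the definition can be checked mechanically. The sketch below assumes that "connected" means weakly connected (edge direction is ignored) and that a query is given as a list of (subject, relation, object) templates with "$"-prefixed variables; both are assumptions made for illustration.

```python
from collections import defaultdict


def is_connected(templates):
    """Check weak connectivity of the graph whose edges are the fact templates."""
    neighbours = defaultdict(set)
    for subject, _relation, obj in templates:
        neighbours[subject].add(obj)
        neighbours[obj].add(subject)
    if not neighbours:
        return False
    # Depth-first traversal from an arbitrary node.
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for other in neighbours[node] - seen:
            seen.add(other)
            stack.append(other)
    return len(seen) == len(neighbours)


same_birth_year = [
    ("$x", "isA", "Physicist"),
    ("$x", "bornInYear", "$y"),
    ("Max Planck", "bornInYear", "$y"),
]
two_unrelated_parts = [
    ("$x", "isA", "Scientist"),
    ("$y", "isA", "Mountain"),
]
```

Under this reading, `same_birth_year` is a valid query graph, while `two_unrelated_parts` is not, because its two templates share no node.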
19Framework (Query language)
- Definition (NAGA answer)
- A NAGA answer is a subgraph of the underlying ER graph that matches the query graph.
- Examples (edge weights on the slide range from 0.95 to 0.98)
1) Which physicist was born in the same year as Max Planck?
- Query: x isA Physicist; Max Planck isA Physicist; x bornInYear y; Max Planck bornInYear y
- Answer: Mihajlo Pupin isA Physicist; Max Planck isA Physicist; Mihajlo Pupin bornInYear 1858; Max Planck bornInYear 1858
2) Which mountain is located in Africa?
- Query: x isA Mountain; x locatedIn Africa
- Answer: Kilimanjaro isA Mountain; Kilimanjaro locatedIn Tanzania; Tanzania locatedIn Africa
3) What connects Einstein and Bohr?
- Answer: Albert Einstein hasWonPrize Nobel Prize; Niels Bohr hasWonPrize Nobel Prize
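One simple way to enumerate answers is backtracking over variable bindings. The sketch below is not the NAGA implementation: it assumes "$"-prefixed variables, matches relation labels exactly (no regular expressions or multi-edge paths), and ignores edge weights. The facts are the ones shown in the first example.

```python
def find_answers(query, facts, bindings=None):
    """Yield one binding dict per way of matching every template in `query`."""
    bindings = bindings or {}
    if not query:
        yield dict(bindings)
        return
    first, rest = query[0], query[1:]
    for fact in facts:
        extended = dict(bindings)
        matches = True
        for term, value in zip(first, fact):
            if term.startswith("$"):  # assumption: "$" marks a variable
                # Bind the variable, or check an earlier binding.
                if extended.setdefault(term, value) != value:
                    matches = False
                    break
            elif term != value:
                matches = False
                break
        if matches:
            yield from find_answers(rest, facts, extended)


facts = [
    ("Max Planck", "isA", "Physicist"),
    ("Mihajlo Pupin", "isA", "Physicist"),
    ("Max Planck", "bornInYear", "1858"),
    ("Mihajlo Pupin", "bornInYear", "1858"),
]
query = [
    ("$x", "isA", "Physicist"),
    ("$x", "bornInYear", "$y"),
    ("Max Planck", "bornInYear", "$y"),
]
answers = list(find_answers(query, facts))
```

Note that this naive matcher also returns the trivial binding of `$x` to Max Planck himself; a real system would need to filter or rank such matches.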
21Framework (Ranking model)
- Question
- How to rank multiple matches to the same query?
- Ranking desiderata
- Confidence
- Correct answers
- Certainty of IE
- Trust/Authority of source
- "Max Planck born in Kiel": bornIn(Max_Planck, Kiel) (Source: Wikipedia)
- "They believe Elvis hides on Mars": livesIn(Elvis_Presley, Mars) (Source: The One and Only Kings Blog)
- Informativeness
- Prominent results preferred
- Frequency of facts in the corpus
- Compactness
- Prefer tightly connected answers
- Size of the answer graph
- Figure: example graph with the nodes Einstein, vegetarian, Tom Cruise, Bohr, Nobel Prize, and 1962 and the edges isA, bornInYear, hasWonPrize, and diedInYear
- NAGA exploits language models for ranking
27Framework (Ranking model)
- Statistical Language Models for Document IR (Maron/Kuhns 1960, Ponte/Croft 1998, Lafferty/Zhai 2001)
- Each doc has an LM: a generative prob. distr. with parameters θ
- Query q viewed as a sample
- Estimate the likelihood that q is a sample of the LM of doc d
- Rank by descending likelihoods (best explanation of q)
- Figure: documents d1 and d2 with language models LM(θ1) and LM(θ2), and the query q
- MLE suffers from sparseness: use a mixture model with a background model (smoothing)
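The query-likelihood idea can be sketched for plain text documents. This example assumes unigram language models and a linear mixture with the corpus-wide background model (often called Jelinek-Mercer smoothing); the mixing weight `lam = 0.5` and the two toy documents are arbitrary choices for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, corpus_counts, corpus_size, lam=0.5):
    """Log P(q | d) under a unigram LM mixed with a background model."""
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)       # maximum-likelihood estimate
        p_bg = corpus_counts[term] / corpus_size  # background (corpus) model
        # The mixture keeps the probability non-zero for terms missing from d,
        # as long as the term occurs somewhere in the corpus.
        score += math.log((1 - lam) * p_doc + lam * p_bg)
    return score


docs = {
    "d1": "max planck was a physicist born in kiel".split(),
    "d2": "the kilimanjaro is a mountain in tanzania".split(),
}
corpus_counts = Counter(term for doc in docs.values() for term in doc)
corpus_size = sum(corpus_counts.values())

query = "physicist born".split()
ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], corpus_counts, corpus_size),
    reverse=True,
)
```

Without the background term, `d2` would receive probability zero for this query, which is the sparseness problem the mixture model addresses.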
28Framework (Ranking model)
- Scoring answers
- Query q with templates q1 q2 … qn, e.g. <Albert Einstein isA x>
- Given g with facts g1 g2 … gn, e.g. <Albert Einstein isA Physicist>
- We use generative mixture models to compute P(q | g)
- Annotations on the scoring formula:
- using generative mixture model
- estimated using knowledge base graph structure
- based on IE accuracy and authority analysis
- estimated by correlation statistics
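The scoring formula itself is not preserved in this transcript. The sketch below shows one plausible shape for such a mixture, purely as an assumption: each fact contributes a blend of a confidence term and an informativeness term, smoothed with a background probability, and the per-fact values are multiplied. The parameter names `alpha` and `beta`, their default values, and the dictionary keys are hypothetical.

```python
def answer_score(facts, alpha=0.2, beta=0.5):
    """Product over facts of a smoothed mixture of confidence and informativeness.

    Each fact is a dict with the keys "confidence", "informativeness", and
    "background", all assumed to be probabilities in [0, 1].
    """
    score = 1.0
    for fact in facts:
        # Blend of the two per-fact components (assumed form).
        per_fact = beta * fact["confidence"] + (1 - beta) * fact["informativeness"]
        # Smoothing with a background probability, as in a mixture model.
        score *= (1 - alpha) * per_fact + alpha * fact["background"]
    return score


one_fact = [{"confidence": 0.96, "informativeness": 0.5, "background": 0.1}]
two_facts = one_fact + [{"confidence": 0.96, "informativeness": 0.5, "background": 0.1}]
```

Because every factor is at most 1 under these assumptions, an answer with more facts cannot score higher than the same answer with fewer facts, which is one way a product form can favor compact answers.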
29Framework (Ranking model)
- Consider the query <Albert Einstein isA x>
- Possible results
- NAGA Ranking (Informativeness)
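As an illustration of informativeness based on the frequency of facts, the sketch below scores a fact by its relative frequency among all facts that share its subject and relation. This estimator is an assumption chosen for the example, and the witness counts are made-up numbers, not data from the evaluation.

```python
def informativeness(witness_counts, fact):
    """Relative frequency of `fact` among facts with the same subject and relation."""
    subject, relation, _obj = fact
    total = sum(
        count
        for (s, r, _o), count in witness_counts.items()
        if (s, r) == (subject, relation)
    )
    return witness_counts.get(fact, 0) / total if total else 0.0


# Hypothetical counts of how often each fact is observed in the corpus.
witness_counts = {
    ("Albert Einstein", "isA", "Physicist"): 90,
    ("Albert Einstein", "isA", "Vegetarian"): 10,
}
physicist = informativeness(witness_counts, ("Albert Einstein", "isA", "Physicist"))
vegetarian = informativeness(witness_counts, ("Albert Einstein", "isA", "Vegetarian"))
```

With these counts, the frequently stated fact ranks above the rarely stated one for the query <Albert Einstein isA x>.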
30Framework (Ranking model)
- Consider the query <Albert Einstein isA x>
- Possible results
- BANKS Ranking (Bhalotia et al. ICDE 2002)
- Relies only on the underlying graph structure
- Importance of an entity is proportional to its degree
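A purely structural importance score of the kind described above can be sketched as follows. The example assumes that "degree" counts incoming and outgoing edges together and normalizes by the total number of edge endpoints; the exact definition used by BANKS may differ. The facts form a toy graph.

```python
from collections import Counter


def degree_importance(facts):
    """Score each entity by its share of all edge endpoints (in- plus out-degree)."""
    degree = Counter()
    for subject, _relation, obj in facts:
        degree[subject] += 1
        degree[obj] += 1
    total = sum(degree.values())
    return {entity: count / total for entity, count in degree.items()}


# Toy graph for illustration.
facts = [
    ("Albert Einstein", "isA", "Physicist"),
    ("Max Planck", "isA", "Physicist"),
    ("Niels Bohr", "isA", "Physicist"),
    ("Albert Einstein", "isA", "Vegetarian"),
]
importance = degree_importance(facts)
```

Such a score depends only on how many edges touch a node in the graph, not on how often a fact is stated in the corpus.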
31Evaluation (Setting)
- Knowledge graph YAGO (Suchanek et al. WWW 2007)
- 16 Million facts
- 85 NAGA queries
- 55 queries from TREC 2005/2006
- 12 queries from the work on SphereSearch (Graupmann et al. VLDB 2005)
- We provided 18 regular expression queries
33Evaluation (Setting)
- The queries were issued to
- Google,
- Yahoo! Answers,
- START (http://start.csail.mit.edu/),
- NAGA (BANKS scoring)
- relies only on the structure of the underlying graph (see Bhalotia et al. ICDE 2002)
- NAGA (NAGA scoring)
- Top-10 answers assessed by 20 human judges as relevant (2), less relevant (1), and irrelevant (0).
34Evaluation (Metrics and Results)
- NDCG (normalized discounted cumulative gain)
- rewards result lists in which relevant results are ranked higher than less relevant ones
- useful when comparing result lists of different lengths
- P@1
- measures how satisfied the user was on average with the first answer of the search engine
- We report the Wilson confidence intervals at α = 0.95
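The two less familiar quantities can be sketched in a few lines. Assumptions: gains use the 2/1/0 relevance scale, the discount is 1/log2(rank + 1), the ideal ordering is obtained by sorting the judged list itself, and z = 1.96 approximates a 95% confidence level. The evaluation may have used different variants.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalized by the DCG of the best possible ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


well_ordered = ndcg([2, 1, 0])   # relevant result first
badly_ordered = ndcg([0, 1, 2])  # relevant result last
low, high = wilson_interval(8, 10)
```

A list that places the relevant result first reaches the maximum NDCG, while the reversed list scores lower; the Wilson interval for 8 successes out of 10 brackets the observed proportion of 0.8.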
37Summary
- NAGA is a search engine for
- Advanced querying of information in ER graphs
- NAGA queries
- NAGA answers
- We propose a novel scoring mechanism
- based on generative language models,
- applied to the specific and unexplored setting of ER graphs,
- incorporating confidence, informativeness, and compactness.
- Viability of the approach demonstrated in comparison to state-of-the-art search engines and QA systems.
40Thank you
- NAGA: http://www.mpi-inf.mpg.de/kasneci/naga
- YAGO: http://www.mpi-inf.mpg.de/suchanek/yago