Title: Joint work with
1Joint work with Georgiana Ifrim, Gjergji
Kasneci, Thomas Neumann, Maya Ramanath, Fabian
Suchanek
2Vision
Opportunity: Turn the Web (and Web 2.0 and
Web 3.0 ...) into the world's most comprehensive
knowledge base
- Approach
- 1) harvest and combine
- hand-crafted knowledge sources
- (Semantic Web, ontologies)
- automatic knowledge extraction
- (Statistical Web, text mining)
- social communities and human computing
- (Social Web, Web 2.0)
- 2) express knowledge queries, search, and rank
- 3) everything efficient and scalable
3Why Google and Wikipedia Are Not Enough
Answer knowledge queries such as:
- proteins that inhibit proteases and other human enzymes
- connection between Thomas Mann and Goethe
- German Nobel prize winner who survived both world wars and all of his four children
- politicians who are also scientists
4Why Google and Wikipedia Are Not Enough
Which politicians are also scientists?
- What is lacking?
- Information is not Knowledge.
- Knowledge is not Wisdom.
- Wisdom is not Truth.
- Truth is not Beauty.
- Beauty is not Music.
- Music is the best.
- (Frank Zappa)
- extract facts from Web pages
- capture user intention by
- concepts, entities, relations
5Related Work
Cimple DBlife
Libra
TextRunner
START
Avatar
Answers
information extraction ontology building
Web entity search QA
UIMA
Hakia
Powerset
Freebase
Cyc
EntityRank
DBpedia
semistructured IR graph search
TopX
XQ-FT
Yago Naga
Tijah
SPARQL
DBexplorer
Banks
SWSE
6Outline
✓ Motivation
Information Extraction & Knowledge Harvesting
(YAGO)
Ranking for Search over Entity-Relation Graphs
(NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
7High-Quality Knowledge Sources
General-purpose ontologies and thesauri: WordNet family
- 200 000 concepts and relations
- can be cast into
- description logics or
- graph, with weights for relation strengths
- (derived from co-occurrence statistics)
scientist, man of science (a person with advanced knowledge)
> cosmographer, cosmographist
> biologist, life scientist
> chemist
> cognitive scientist
> computer scientist
...
> principal investigator, PI
HAS INSTANCE > Bacon, Roger Bacon
8Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
Infobox_Scientist
name = Max Planck
birth_date = April 23, 1858
birth_place = Kiel, Germany
death_date = October 4, 1947
death_place = Göttingen, Germany
residence = Germany
nationality = Germany|German
field = Physicist
work_institution = University of Kiel, Humboldt-Universität zu Berlin, Georg-August-Universität Göttingen
alma_mater = Ludwig-Maximilians-Universität München
doctoral_advisor = Philipp von Jolly
doctoral_students = Gustav Ludwig Hertz
known_for = Planck's constant, Quantum mechanics|quantum theory
prizes = Nobel Prize in Physics (1918)
10YAGO: Yet Another Great Ontology [F. Suchanek et al., WWW'07]
- Turn Wikipedia into explicit knowledge base (semantic DB)
- keep source pages as witnesses
- Exploit hand-crafted categories and infobox templates
- Represent facts as explicit knowledge triples
- relation (entity1, entity2)
- (in FOL, compatible with RDF, OWL-lite, XML, etc.)
- Map (and disambiguate) relations into WordNet concept DAG
Examples
- bornIn (Max_Planck, Kiel)
- isInstanceOf (Kiel, City)
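The triple form relation (entity1, entity2) can be sketched in a few lines of Python. This is only an illustration of the data model on the slide, not YAGO's actual storage format; the names `Fact` and `objects_of` are made up for this example.

```python
from collections import namedtuple

# A fact is a triple relation(entity1, entity2), as on the slide.
# `Fact` and `objects_of` are illustrative names, not part of YAGO.
Fact = namedtuple("Fact", ["relation", "entity1", "entity2"])

facts = [
    Fact("bornIn", "Max_Planck", "Kiel"),
    Fact("isInstanceOf", "Kiel", "City"),
]

def objects_of(facts, relation, entity1):
    """Return every entity2 that stands in `relation` to `entity1`."""
    return [f.entity2 for f in facts
            if f.relation == relation and f.entity1 == entity1]

print(objects_of(facts, "bornIn", "Max_Planck"))  # ['Kiel']
```

Keeping every fact in the same three-field shape is what makes the knowledge base compatible with RDF-style tooling later in the talk.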
11YAGO Knowledge Base [F. Suchanek et al., WWW'07]
Accuracy ≈ 95%
[Figure: excerpt of the YAGO graph in three layers
- concepts: Entity, Person, Location, Scientist, City, Country, Biologist, Physicist (edges: subclass)
- individuals: Max_Planck, Erwin_Planck, Kiel, Nobel Prize, April 23, 1858, October 4, 1947 (edges: instanceOf, bornIn, hasWon, FatherOf, bornOn, diedOn)
- words: "Dr. Planck", "Max Karl Ernst Ludwig Planck", "Max Planck" (edges: means)]
Online access and download at http://www.mpi-inf.mpg.de/suchanek/yago/
12Wikipedia Harvesting: Difficulties & Solutions
- instanceOf relation: misleading and difficult category names
- (disputed articles, particle physics, American Music of the 20th Century,
- Nobel laureates in physics, naturalized citizens of the United States, ...)
- → noun group parser: ignore when head word in singular
- isA relation: mapping categories onto WordNet classes
- Nobel laureates in physics → Nobel_laureates, people from Kiel → person
- → map to (singular of) head; exploit synsets and statistics
- Entity name ambiguities
- St. Petersburg, Saint Petersburg, M31, NGC224 → means ...
- → exploit Wikipedia redirects & disambiguations, WN synsets
- type checking for scrutinizing candidates
- accept fact candidate only if arguments have proper classes
- marriedTo (Max Planck, quantum physics) ∉ Person × Person
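The type-checking step can be illustrated with a minimal sketch. It assumes each relation has a declared (domain, range) signature and that classes form a simple subclass chain; the class `Topic`, the tables, and the function names are hypothetical and only serve the example.

```python
# Toy class hierarchy following the YAGO excerpt; "Topic" is a hypothetical
# class added so that quantum_physics has a type at all.
SUBCLASS_OF = {
    "Physicist": "Scientist",
    "Scientist": "Person",
    "City": "Location",
    "Person": "Entity",
    "Location": "Entity",
    "Topic": "Entity",
}
INSTANCE_OF = {
    "Max_Planck": "Physicist",
    "Kiel": "City",
    "quantum_physics": "Topic",
}
# Assumed relation signatures: (class of entity1, class of entity2).
SIGNATURES = {
    "bornIn": ("Person", "Location"),
    "marriedTo": ("Person", "Person"),
}

def is_a(entity, cls):
    """Walk up the subclass chain from the entity's direct class."""
    current = INSTANCE_OF.get(entity)
    while current is not None:
        if current == cls:
            return True
        current = SUBCLASS_OF.get(current)
    return False

def accept(relation, entity1, entity2):
    """Accept a fact candidate only if both arguments have proper classes."""
    domain, range_ = SIGNATURES[relation]
    return is_a(entity1, domain) and is_a(entity2, range_)

print(accept("bornIn", "Max_Planck", "Kiel"))                # True
print(accept("marriedTo", "Max_Planck", "quantum_physics"))  # False
```

The second candidate is rejected because quantum_physics never reaches Person in the class hierarchy, which mirrors the marriedTo example on the slide.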
13Higher-Order Facts in YAGO
CapitalOf (Bonn, Germany)
CapitalOf (Berlin, Germany)
14Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting
(YAGO)
Ranking for Search over Entity-Relation Graphs
(NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
15NAGA: Graph Search [G. Kasneci et al., ICDE'08]
Graph-based search on YAGO-style knowledge bases
with built-in ranking based on confidence and
informativeness
- discovery queries (figure: x isa politician, x isa scientist)
- connectedness queries (figure: nodes Thomas Mann, German novelist, Goethe; edge label isa)
- complex queries with regular expressions (figure: nodes x, p, u, scientist, computer science, university, Switzerland; edge labels isa, wonPrize, inField, worksAt|graduatedFrom, locatedIn)
- queries over reified facts (figure: nodes c, city, Germany, 1988; edge labels isa, capitalOf, validIn)
16Search Results Without Ranking
q: Fisher isa scientist, Fisher isa x
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = alumnus_109165182
@Fisher = Irving_Fisher, @scientist = scientist_109871938, X = social_scientist_109927304
@Fisher = James_Fisher, @scientist = scientist_109871938, X = ornithologist_109711173
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = theorist_110008610
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = colleague_109301221
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = organism_100003226
Facts behind the first result:
mathematician_109635652 subClassOf> scientist_109871938
Alumni_of_Gonville_and_Caius_College,_Cambridge subClassOf> alumnus_109165182
"Fisher" familyNameOf> Ronald_Fisher
Ronald_Fisher type> Alumni_of_Gonville_and_Caius_College,_Cambridge
Ronald_Fisher type> 20th_century_mathematicians
"scientist" means> scientist_109871938
17Ranking with Statistical Language Model
q: Fisher isa scientist, Fisher isa x
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = mathematician_109635652
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = statistician_109958989
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = president_109787431
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = geneticist_109475749
@Fisher = Ronald_Fisher, @scientist = scientist_109871938, X = scientist_109871938
Score: 7.184462521168058E-13
mathematician_109635652 subClassOf> scientist_109871938
"Fisher" familyNameOf> Ronald_Fisher
Ronald_Fisher type> 20th_century_mathematicians
"scientist" means> scientist_109871938
20th_century_mathematicians subClassOf> mathematician_109635652
→ statistical language model for result graphs
Online access at http://www.mpi-inf.mpg.de/kasneci/naga/
18Ranking Factors
- Confidence
- Prefer results that are likely to be correct
- Certainty of IE
- Authenticity and Authority of Sources
bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
- Informativeness
- Prefer results that are likely important
- May prefer results that are likely new to user
- Frequency in answer
- Frequency in corpus (e.g. Web)
- Frequency in query log
q: isa (Einstein, y)
isa (Einstein, scientist)
isa (Einstein, vegetarian)
q: isa (x, vegetarian)
isa (Einstein, vegetarian)
isa (Al Nobody, vegetarian)
- Compactness
- Prefer results that are tightly connected
- Size of answer graph
[Figure: example answer graphs with nodes vegetarian, Tom Cruise, Einstein, Bohr, Nobel Prize, 1962 and edge labels isa, bornIn, won, diedIn]
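How confidence and informativeness can combine into one score is sketched below. This is a deliberately simplified stand-in, not NAGA's scoring model (which the talk describes as a statistical language model over result graphs): informativeness is taken as a fact's relative frequency among all facts matching the query pattern. All counts and confidence values are invented for illustration.

```python
# Hypothetical witness counts (how often a fact was observed) and
# hypothetical extraction confidences; none of these numbers are from NAGA.
WITNESS_COUNT = {
    ("Einstein", "isa", "scientist"): 900,
    ("Einstein", "isa", "vegetarian"): 100,
    ("Al Nobody", "isa", "vegetarian"): 5,
}
CONFIDENCE = {
    ("Einstein", "isa", "scientist"): 0.95,
    ("Einstein", "isa", "vegetarian"): 0.80,
    ("Al Nobody", "isa", "vegetarian"): 0.90,
}

def matches(fact, pattern):
    """A pattern is a triple in which None marks a query variable."""
    return all(p is None or p == f for f, p in zip(fact, pattern))

def informativeness(fact, pattern):
    """Relative frequency of `fact` among all facts matching `pattern`."""
    total = sum(c for f, c in WITNESS_COUNT.items() if matches(f, pattern))
    return WITNESS_COUNT[fact] / total

def rank(pattern):
    """Order matching facts by confidence times informativeness."""
    answers = [f for f in WITNESS_COUNT if matches(f, pattern)]
    return sorted(
        answers,
        key=lambda f: CONFIDENCE[f] * informativeness(f, pattern),
        reverse=True,
    )

print([f[2] for f in rank(("Einstein", "isa", None))])  # ['scientist', 'vegetarian']
print([f[0] for f in rank((None, "isa", "vegetarian"))])  # ['Einstein', 'Al Nobody']
```

Note that the same fact, isa (Einstein, vegetarian), ranks low for the first query and high for the second, because informativeness depends on which positions the query binds.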
19NAGA Example
Query: x isa politician, x isa scientist
Results: Benjamin Franklin, Paul Wolfowitz, Angela Merkel
20Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting
(YAGO)
✓ Ranking for Search over Entity-Relation Graphs
(NAGA)
Efficient Query Processing (RDF-3X)
Conclusion
21Why RDF? Why a New DB Engine?
[Figure: YAGO graph excerpt with concepts Person, Location, Scientist, City, Physicist (edges: subclass) and individuals Max_Planck, Erwin_Planck, Kiel, Nobel Prize, Apr 23, 1858, Oct 4, 1947 (edges: instanceOf, bornIn, FatherOf, hasWon, bornOn, diedOn)]
The same data as triples:
(id1, Name, Max Planck), (id1, bornOn, 23 Apr 1858),
(id1, bornIn, id2), (id2, Name, Kiel),
(id2, locatedIn, id3), (id3, Name, Germany),
(id1, FatherOf, id4), (id4, Name, Erwin Planck),
- RDF triples (subject, property/predicate, value/object)
- pay-as-you-go: schema-agnostic or schema later
- RDF triples form fine-grained ER graph
- queries bound to need many star-joins and long chain-joins
- physical design critical, but hardly predictable workload
22SPARQL Query Language
SPJ combinations of triple patterns
Example: Select ?c Where { ?p isa scientist . ?p bornIn ?t . ?p hasWon ?a . ?t inCountry ?c . ?a Name NobelPrize }
options for filter predicates, duplicate handling, wildcard join, etc.
Example: Select Distinct ?c Where { ?p ?r1 ?t . ?t ?r2 ?c . ?c isa <country> . ?p bornOn ?b . Filter (?b > 1945) }
support for RDFS types
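To make "SPJ combinations of triple patterns" concrete, here is a minimal evaluator that joins triple patterns by extending variable bindings one pattern at a time. It is a naive nested-loop sketch, not how any SPARQL engine is implemented; the toy triples (including `prize1`, `Al_Nobody`, and `Nowhere`) and the function names are assumptions for the example.

```python
# Toy data; prize1, Al_Nobody and Nowhere are made-up identifiers.
TRIPLES = [
    ("Max_Planck", "isa", "scientist"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "hasWon", "prize1"),
    ("Kiel", "inCountry", "Germany"),
    ("prize1", "Name", "NobelPrize"),
    ("Al_Nobody", "isa", "scientist"),
    ("Al_Nobody", "bornIn", "Nowhere"),
]

def is_var(term):
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, else return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def evaluate(patterns, triples):
    """Join all patterns by extending bindings pattern by pattern."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                new = match_pattern(pattern, triple, binding)
                if new is not None:
                    extended.append(new)
        bindings = extended
    return bindings

# The first example query from the slide, as a list of triple patterns.
query = [
    ("?p", "isa", "scientist"),
    ("?p", "bornIn", "?t"),
    ("?p", "hasWon", "?a"),
    ("?t", "inCountry", "?c"),
    ("?a", "Name", "NobelPrize"),
]
print({b["?c"] for b in evaluate(query, TRIPLES)})  # {'Germany'}
```

Shared variables such as ?p and ?t act as the join conditions: ?p ties the first three patterns into a star join, while ?t and ?a continue it as chains.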
23RDF-3X a RISC-style Engine
- Design rationale
- RDF-specific engine
- Simplify operations
- Reduce implementation choices
- Optimize for common case
- Eliminate tuning knobs
- Key principles
- Mapping dictionary for encoding all literals into ids
- Exhaustive indexing of id triples
- Index-only store, high compression
- QP mostly merge joins with order-preservation
- Very fast DP-based query optimizer
- Frequent-paths synopses, property-value histograms
Benchmarks on > 50 Mio. triples: << 100 ms response times for queries with > 10 joins
Three important things in DBS: performance, performance, performance!
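Three of the key principles (mapping dictionary, exhaustive indexing, merge joins) can be sketched in memory as follows. This is an assumption-laden toy, not RDF-3X code: sorted Python lists stand in for the engine's compressed indexes, the optimizer and synopses are not modeled, and all class and function names are invented here.

```python
from bisect import bisect_left
from itertools import permutations

class TripleStore:
    """Toy store: dictionary-encoded triples, indexed in all six orderings."""

    def __init__(self, triples):
        self.ids = {}    # term -> integer id (the mapping dictionary)
        self.terms = []  # integer id -> term
        encoded = {tuple(self._encode(t) for t in triple) for triple in triples}
        # One sorted list per permutation of (subject, predicate, object).
        self.indexes = {
            order: sorted(tuple(tr[i] for i in order) for tr in encoded)
            for order in permutations((0, 1, 2))
        }

    def _encode(self, term):
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def scan(self, order, prefix):
        """Yield entries of index `order` that start with the id tuple `prefix`."""
        index = self.indexes[order]
        i = bisect_left(index, prefix)
        while i < len(index) and index[i][:len(prefix)] == prefix:
            yield index[i]
            i += 1

def subjects(store, predicate, obj):
    """Sorted subject ids for a (?s, predicate, obj) pattern via the POS index."""
    prefix = (store.ids[predicate], store.ids[obj])
    return [entry[2] for entry in store.scan((1, 2, 0), prefix)]

def merge_join(left, right):
    """Intersect two ascending id lists in a single pass."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out

# Toy data; Al_Nobody is a made-up entity.
store = TripleStore([
    ("Max_Planck", "isa", "scientist"),
    ("Max_Planck", "hasWon", "Nobel_Prize"),
    ("Max_Planck", "bornIn", "Kiel"),
    ("Al_Nobody", "isa", "scientist"),
])
joined = merge_join(subjects(store, "isa", "scientist"),
                    subjects(store, "hasWon", "Nobel_Prize"))
print([store.terms[i] for i in joined])  # ['Max_Planck']
```

Because every ordering is indexed, any triple pattern can be answered by a prefix scan that already delivers ids in sorted order, which is what lets the join run as a merge without a separate sort step.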
24Outline
✓ Motivation
✓ Information Extraction & Knowledge Harvesting
(YAGO)
✓ Ranking for Search over Entity-Relation Graphs
(NAGA)
✓ Efficient Query Processing (RDF-3X)
Conclusion
25Large-Scale Knowledge Gathering
Turn Web (2.0, 3.0, ...) into world's most
comprehensive knowledge base
- Semantic Web: ontologies, encyclopedia
- Statistical Web: info extraction, text mining
- Social Web: Web 2.0 communities, human computing
26Thank You !