Title: Robust and Scalable Harvesting
1Knowledge on the Web
Towards Robust and Scalable Harvesting of Entity-Relationship Facts
Gerhard Weikum, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/weikum/
2Acknowledgements
3Vision: Turn Web into Knowledge Base
- comprehensive DB of human knowledge
- everything that Wikipedia knows
- machine-readable
- capturing entities, classes, relationships
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
4Knowledge as Enabling Technology
- entity recognition & disambiguation
- understanding natural language & speech
- knowledge services & reasoning for semantic apps
- semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)
German chancellor when Angela Merkel was born?
Japanese computer science institutes?
Politicians who are also scientists?
Enzymes that inhibit HIV?
Influenza drugs for pregnant women?
...
5Knowledge Search on the Web (1)
Query: sushi ingredients?
Results: Nori seaweed, Ginger, Tuna, Sashimi, ..., Unagi
http://www.google.com/squared/
6Knowledge Search on the Web (1)
Query: Japanese computer science ?
Query: Japanese computer science institutes ?
Query: Japanese computers ?
http://www.google.com/squared/
7Knowledge Search on the Web (2)
Query: politicians who are also scientists ?
?x isa politician . ?x isa scientist
Results: Benjamin Franklin, Zbigniew Brzezinski, Alan Greenspan, Angela Merkel
http://www.mpi-inf.mpg.de/yago-naga/
8Knowledge Search on the Web (2)
Query: politicians who are married to scientists ?
?x isa politician . ?x isMarriedTo ?y . ?y isa scientist
Results (3):
(Adrienne Clarkson, Stephen Clarkson)
(Raúl Castro, Vilma Espín)
(Jeannemarie Devolites Davis, Thomas M. Davis)
http://www.mpi-inf.mpg.de/yago-naga/
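Both queries are conjunctions of triple patterns that share variables. The following is a minimal sketch of how such patterns can be matched against a set of triples. The toy facts and the `match` helper are illustrative assumptions and not the NAGA engine, which additionally ranks its answers.

```python
# Toy triple store; the facts are illustrative and only loosely mirror the slides.
TRIPLES = {
    ("Angela_Merkel", "isa", "politician"),
    ("Angela_Merkel", "isa", "scientist"),
    ("Benjamin_Franklin", "isa", "politician"),
    ("Benjamin_Franklin", "isa", "scientist"),
    ("Raul_Castro", "isa", "politician"),
    ("Vilma_Espin", "isa", "scientist"),
    ("Raul_Castro", "isMarriedTo", "Vilma_Espin"),
    ("Max_Planck", "isa", "scientist"),
}


def match(patterns, binding=None):
    """Yield every variable binding that satisfies all triple patterns.

    Terms starting with '?' are variables; all other terms are constants.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, new)


both = sorted(
    b["?x"] for b in match([("?x", "isa", "politician"), ("?x", "isa", "scientist")])
)
couples = sorted(
    (b["?x"], b["?y"])
    for b in match([
        ("?x", "isa", "politician"),
        ("?x", "isMarriedTo", "?y"),
        ("?y", "isa", "scientist"),
    ])
)
print(both)
print(couples)
```

The shared variable `?x` (and `?y` in the second query) is what turns independent lookups into a join over the graph.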
9Knowledge Search on the Web (3)
http://www-tsujii.is.s.u-tokyo.ac.jp/medie/
10Take-Home Message
Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best.
(Frank Zappa, jazz-rock musician, 1940-1993)
If music was invented 20 years ago when the Web was created, we'd all be playing one-string instruments.
(Udi Manber, VP Engineering, Google)
- extract facts from Web sources
- organize them in an automatically built knowledge base
- answer questions in terms of entities and relations
11Related Work
Yago-Naga
Text2Onto
Kylin KOG
Powerset
ReadTheWeb
Avatar
Hakia
ontologies & entity search
fact extraction & statist. ranking
Cyc
UIMA
kosmix
(Semantic Web)
KnowItAll
(Statistical Web)
TextRunner
WolframAlpha
StatSnowball EntityCube
SWSE
online communities & question answering
sig.ma
DBpedia
Cimple DBlife
(Social Web)
TrueKnowledge
GoogleSquared
Freebase
Answers
START
12Outline
✓ What and Why
Building a Large Knowledge Base
Consistent Growth of the Knowledge Base
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...
13Information Extraction (IE): Text to Relations
bornOn (Max Planck, 23 April 1858)
bornIn (Max Planck, Kiel)
type (Max Planck, physicist)
Max Karl Ernst Ludwig Planck was born in Kiel,
Germany, on April 23, 1858, the son of Julius
Wilhelm and Emma (née Patzig) Planck. Planck
studied at the Universities of Munich and Berlin,
where his teachers included Kirchhoff and
Helmholtz, and received his doctorate of
philosophy at Munich in 1879. He was
Privatdozent in Munich from 1880 to 1885, then
Associate Professor of Theoretical Physics at
Kiel until 1889, in which year he succeeded
Kirchhoff as Professor at Berlin University,
where he remained until his retirement in 1926.
Afterwards he became President of the Kaiser
Wilhelm Society for the Promotion of Science, a
post he held until 1937. He was also a gifted
pianist and is said to have at one time
considered music as a career. Planck was twice
married. Upon his appointment, in 1885, to
Associate Professor in his native town Kiel he
married a friend of his childhood, Marie Merck,
who died in 1909. He remarried her cousin Marga
von Hösslin. Three of his children died young,
leaving him with two sons.
advisor (Max Planck, Kirchhoff)
advisor (Max Planck, Helmholtz)
AlmaMater (Max Planck, TU Munich)
plays (Max Planck, piano)
spouse (Max Planck, Marie Merck)
spouse (Max Planck, Marga Hösslin)
- IE builds data space (with uncertain data)
- confidence < 1 (sometimes << 1)
- knowledge base from many sources
- high computational cost
IE: combine NLP, pattern matching, statistical learning
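The pattern-matching ingredient of IE can be illustrated with a single hand-written rule. This is a minimal sketch: the regular expression, the `extract` helper and the fixed confidence of 0.8 are assumptions made for illustration, not part of any system named in the talk.

```python
import re

TEXT = (
    "Max Karl Ernst Ludwig Planck was born in Kiel, Germany, "
    "on April 23, 1858, the son of Julius Wilhelm and Emma Planck."
)

# Hypothetical pattern: "<Name> was born in <Place>[, <Region>]*, on <Date>".
BORN = re.compile(
    r"(?P<name>(?:[A-Z][a-z]+ )+)was born in (?P<place>[A-Z][a-z]+)"
    r"(?:, [A-Z][a-z]+)*, on (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"
)


def extract(text, confidence=0.8):
    """Return (relation, subject, object, confidence) tuples found in text."""
    facts = []
    for m in BORN.finditer(text):
        name = m.group("name").strip()
        facts.append(("bornIn", name, m.group("place"), confidence))
        facts.append(("bornOn", name, m.group("date"), confidence))
    return facts


for fact in extract(TEXT):
    print(fact)
```

Every extracted fact carries a confidence below 1, which is why the resulting data space has to be treated as uncertain.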
14IE for Knowledge Harvesting
- YAGO knowledge base from Wikipedia infoboxes & categories and integration with WordNet taxonomy
- NAGA search on RDF graph with entity-relationship LM for ranking
{{Infobox scientist
| name = Max Planck
| birth_date = {{birth date|1858|4|23|mf=y}}
| birth_place = Kiel, Holstein
| death_date = {{death date and age|mf=yes|1947|10|4}}
| death_place = Göttingen, West Germany
| nationality = [[Germany|German]]
| field = Physics
| alma_mater = Ludwig-Maximilians-Universität München
| work_institutions = University of Kiel<br /> [[Humboldt University of Berlin|University of Berlin]]<br /> University of Göttingen<br /> Kaiser-Wilhelm-Gesellschaft<br />
| doctoral_advisor = Alexander von Brill
| doctoral_students = Gustav Ludwig Hertz<br />
| known_for = Planck constant<br /> Planck postulate<br /> Planck's law of black body radiation
}}
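Infobox attributes like those above map fairly directly to facts. The following is a rough sketch of that mapping, assuming a simplified infobox syntax; the attribute-to-relation table and the helper names are hypothetical, and real infobox harvesting has to handle far messier markup.

```python
import re

# Simplified infobox text; real Wikipedia markup is much less regular.
INFOBOX = """{{Infobox scientist
| name = Max Planck
| birth_date = {{birth date|1858|4|23|mf=y}}
| birth_place = [[Kiel]], Holstein
| field = [[Physics]]
}}"""

# Hypothetical mapping from infobox attributes to relation names.
ATTRIBUTE_TO_RELATION = {
    "birth_date": "bornOn",
    "birth_place": "bornIn",
    "field": "hasField",
}


def clean(value):
    """Turn a raw attribute value into a plain string."""
    date = re.match(r"\{\{birth date\|(\d+)\|(\d+)\|(\d+)", value)
    if date:
        year, month, day = (int(g) for g in date.groups())
        return f"{year:04d}-{month:02d}-{day:02d}"
    # [[target|label]] -> target, [[target]] -> target
    value = re.sub(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", r"\1", value)
    return value.strip()


def infobox_to_triples(text):
    """Parse '| key = value' lines and emit (subject, relation, object) triples."""
    attrs = {}
    for line in text.splitlines():
        m = re.match(r"\|\s*(\w+)\s*=\s*(.+)", line)
        if m:
            attrs[m.group(1)] = m.group(2)
    subject = attrs["name"].replace(" ", "_")
    return [
        (subject, relation, clean(attrs[attribute]))
        for attribute, relation in ATTRIBUTE_TO_RELATION.items()
        if attribute in attrs
    ]


for triple in infobox_to_triples(INFOBOX):
    print(triple)
```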
15YAGO Knowledge Base (F. Suchanek et al. WWW07)
40 Mio. RDF triples (entity1-relation-entity2, subject-predicate-object)
Accuracy ≈ 95%
[Figure: excerpt of the YAGO graph.
Classes (linked by subclass edges): Entity, Person, Location, Organization, Country, State, City, Scientist, Politician, Biologist, Physicist.
Entities (linked to classes by instanceOf edges): Max_Planck, Erwin_Planck, Angela Merkel, Max_Planck Society, Nobel Prize, Kiel, Schleswig-Holstein, Germany.
Relation edges: locatedIn, bornIn, bornOn, diedOn, hasWon, citizenOf, FatherOf.
Dates: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944.
Names linked to entities by weighted means edges, e.g. means(0.9), means(0.1): "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Dorothea Merkel", "Angela Merkel".]
16Leveraging YAGO for Entity Extraction
Existing knowledge base boosts entity detection & disambiguation
(similarity of string-in-context to target entity-in-context)
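One simple way to realize "similarity of string-in-context to entity-in-context" is to compare bag-of-words vectors and mix the result with a name prior such as the weighted means edges. The sketch below is written under those assumptions: the entity contexts, the mixing weight `alpha` and the scoring formula are illustrative choices, not the method used with YAGO.

```python
import math
from collections import Counter

# Hypothetical entity contexts (e.g. salient words from each entity's article).
ENTITY_CONTEXT = {
    "Max_Planck": "physicist quantum theory Kiel Nobel Prize physics",
    "Max_Planck_Society": "research organization institutes Germany Munich society",
}
# Name priors in the spirit of means(0.9) / means(0.1) for the string "Max Planck".
MEANS_PRIOR = {"Max_Planck": 0.9, "Max_Planck_Society": 0.1}


def cosine(a, b):
    """Cosine similarity between the bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(mention_context, alpha=0.2):
    """Pick the candidate with the best mix of name prior and context similarity."""
    scores = {
        entity: alpha * MEANS_PRIOR[entity] + (1 - alpha) * cosine(mention_context, ctx)
        for entity, ctx in ENTITY_CONTEXT.items()
    }
    return max(scores, key=scores.get), scores


print(disambiguate("Max Planck funds research institutes in Germany")[0])
print(disambiguate("Planck won the Nobel Prize in physics for quantum theory")[0])
```

With a strong prior alone, every "Max Planck" would resolve to the physicist; the context term is what lets the less likely reading win when the surrounding words support it.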
17Outline
✓ What and Why
✓ Building a Large Knowledge Base
Consistent Growth of the Knowledge Base
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...
18Growing the Knowledge Base
WordNet
Wikipedia
YAGO Core Extractors
YAGO Core Checker
YAGO Core
Growing
19Pattern-Based Harvesting
(Dipre, Snowball, Text2Onto, Leila, StatSnowball, etc.)
Facts
Patterns
Fact Candidates
- good for recall
- noisy, drifting
- not robust enough
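The facts-to-patterns-to-fact-candidates loop can be sketched in a few lines. This is a toy bootstrapping loop in the spirit of Dipre and Snowball under strong simplifying assumptions: entities are single capitalized words, a pattern is the literal text between two arguments, and no pattern scoring is done. The sentences and helper names are invented for illustration.

```python
import re

SENTENCES = [
    "Hillary and her husband Bill attended the dinner.",
    "Carla and her husband Nicolas arrived in Paris.",
    "Victoria and her husband David moved to Madrid.",
    "Rebecca has been dating David since May.",
]

NAME = r"([A-Z][a-z]+)"


def learn_patterns(facts, sentences):
    """Turn the text between the two arguments of a known fact into a pattern."""
    patterns = set()
    for x, y in facts:
        for s in sentences:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), s)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(patterns, sentences):
    """Collect fact candidates wherever a learned pattern links two names."""
    candidates = set()
    for p in patterns:
        regex = re.compile(NAME + re.escape(p) + NAME)
        for s in sentences:
            for x, y in regex.findall(s):
                candidates.add((x, y))
    return candidates


def harvest(seeds, sentences, rounds=2):
    """Alternate between learning patterns and harvesting new facts."""
    facts = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(facts, sentences)
        facts |= apply_patterns(patterns, sentences)
    return facts


facts = harvest({("Hillary", "Bill")}, SENTENCES)
print(sorted(facts))
```

Because every harvested candidate is accepted as a fact and fed back into pattern learning, one bad pattern would propagate in later rounds, which is the drift problem named in the bullets.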
20SOFIE: Self-Organizing Framework for IE
(F. Suchanek et al. WWW09)
- Integrate methods
  - textual/linguistic pattern-based IE with statistics
    seeds → patterns → facts → patterns → ...
    (Hillary, Bill) → X and her husband Y → (Carla, Nicolas), (Carla, Mick) → ...
  - declarative rule-based IE with constraints
    functional dependencies: marriedTo is a function
    inclusion dependencies: presidentOf ⊆ citizenOf
- Address problems
  - pattern selection (and her husband, has been dating, ...)
  - reasoning on mutual consistency of facts
  - entity disambiguation (Merkel → AngelaMerkel, MaxMerkel, ...; MPI → MaxPlanckInstitute, MessagePassingInterface)
Unified solution by Weighted Max-Sat solver (high accuracy and much faster than MCMC for prob. graphical models)
21SOFIE Example
100 40 60 20 10
Patterns:
occurs (X and her husband Y, Hillary, Bill)
occurs (X Y and their children, Hillary, Bill)
occurs (X and her husband Y, Victoria, David)
occurs (X dating with Y, Rebecca, David)
occurs (X dating with Y, Victoria, Tom)
Facts:
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Clauses:
Spouse (Victoria, David) ⇒ ¬Spouse (Rebecca, David)
Spouse (Victoria, David) ⇒ ¬Spouse (Victoria, Tom)
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
occurs (husband, Victoria, David) ∧ Spouse (Victoria, David) ⇒ expresses (husband, Spouse)
∀ x,y,z,w: R(x,y) ∧ R(x,z) ⇒ y=z (alt.: ¬R(x,y) ∨ ¬R(x,z))
∀ x,y,z,w: R(x,y) ∧ R(w,y) ⇒ x=w (alt.: ¬R(x,y) ∨ ¬R(w,y))
...
∀ x,y: R(x,y) ⇒ R(y,x)
∀ p,x,y: occurs (p, x, y) ∧ expresses (p, R) ⇒ R (x, y)
∀ p,x,y: occurs (p, x, y) ∧ R (x, y) ⇒ expresses (p, R)
22Reasoning on Hypotheses by Weighted Max-Sat Solver
- Clauses (propositional logic formulae consisting of conjunctions of disjunctions of positive or negative literals) connect facts, patterns, hypotheses, constraints
- Treat hypotheses (literals) as variables, facts as constants:
  (¬1 ∨ ¬A ∨ 1), (¬1 ∨ ¬A ∨ B), (¬1 ∨ ¬C), (¬D ∨ E), (¬D ∨ F), ...
- Clauses can be weighted by pattern statistics
- Solve weighted Max-Sat problem: assign truth values to variables s.t. total weight of satisfied clauses is max!
  ⇒ NP-hard, but good approximation algorithms
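To make the objective concrete, here is a tiny weighted Max-Sat instance solved by exhaustive search. It is a sketch only: the variables borrow names from the Spouse example, the weights and the hard-constraint weight of 1000 are illustrative assumptions, and brute force is exponential in the number of variables, so it is not the approximation algorithm the slide refers to.

```python
from itertools import product

# Hypotheses treated as Boolean variables.
A = "Spouse(Victoria,David)"
B = "Spouse(Rebecca,David)"
C = "Spouse(Victoria,Tom)"
VARS = [A, B, C]

# Each clause is (weight, [(variable, wanted truth value), ...]),
# i.e. a weighted disjunction of literals. Weights are illustrative.
CLAUSES = [
    (60, [(A, True)]),                  # pattern evidence supporting A
    (20, [(B, True)]),                  # pattern evidence supporting B
    (10, [(C, True)]),                  # pattern evidence supporting C
    (1000, [(A, False), (B, False)]),   # constraint: not both A and B
    (1000, [(A, False), (C, False)]),   # constraint: not both A and C
]


def weight(assignment):
    """Total weight of the clauses satisfied by a truth assignment."""
    return sum(
        w for w, literals in CLAUSES
        if any(assignment[var] == wanted for var, wanted in literals)
    )


def solve():
    """Try all 2^n assignments and return the one with maximal weight."""
    assignments = (
        dict(zip(VARS, values))
        for values in product([False, True], repeat=len(VARS))
    )
    best = max(assignments, key=weight)
    return best, weight(best)


best, total = solve()
print(best, total)
```

The heavily weighted constraint clauses make it unprofitable to accept conflicting spouse hypotheses, so the strongest evidence wins and the competing hypotheses are rejected.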
23SOFIE Example
100 40 60 20 10
occurs (X and her husband Y, Hillary, Bill)
occurs (X Y and their children, Hillary, Bill)
occurs (X and her husband Y, Victoria, David)
occurs (X dating with Y, Rebecca, David)
occurs (X dating with Y, Victoria, Tom)
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Spouse (Victoria, David) ⇒ ¬Spouse (Rebecca, David)
Spouse (Victoria, David) ⇒ ¬Spouse (Victoria, Tom)
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
Wanted: truth assignment for A, B, C with maximal total weight of satisfied clauses
24Consistent Growth of Knowledge
- SOFIE: self-organizing framework for scrutinizing hypotheses about new facts, enabling automated growth of the knowledge base
- unifies pattern-based IE, consistency reasoning, and entity disambiguation
- highly related to methods based on Markov Logic Networks and to joint learning with constraints
- but SOFIE does not compute a joint probability distribution and is much faster than Markov-Chain Monte-Carlo methods
25Outline
✓ What and Why
✓ Building a Large Knowledge Base
✓ Consistent Growth of the Knowledge Base
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...
26What's Wrong With This?
27Multimodal Knowledge
type (MPI, ScientificOrganization)
fullName (MPI, Max Planck Institute for Informatics)
inField (MPI, Computer Science)
partOf (MPI, Max Planck Society)
foundingDirector (MPI, Kurt Mehlhorn)
28K2 (Knowledge Kaleidoscope): Photos of Named Entities
Challenges:
- Long Tail: non-famous but notable entities
- Diversity: variety of different views, different ages, etc.
- Scale: all entities with Wikipedia article (known to YAGO), all entities mentioned in Wikipedia articles
29Gathering & Ranking Photos by Image Search Engines
30Knowledge-based Photo Harvesting
(Bilyana Taneva et al. WSDM 2010)
- generate expanded queries q_i for entity e using affiliation, knownFor, wonAward, etc., e.g. Kitsuregawa University Tokyo, Kitsuregawa Hash Join, Kitsuregawa Sigmod Award, etc.
- run queries and retrieve photos p from top-k results (k=100)
- combine results by rank-based weighted voting (learn weights w_i from training entities)
- consider visual similarities (using SIFT)
- rank results, cluster by similarity
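The voting step can be sketched as follows. The scoring rule (a photo at rank r in the list of query i receives w_i * (k - r + 1) votes) is an assumption chosen for illustration, as are the photo ids and weights; the slides do not spell out the exact formula, and the visual-similarity step is omitted.

```python
from collections import defaultdict


def weighted_vote(result_lists, weights, k=100):
    """Combine ranked result lists by rank-based weighted voting.

    result_lists: one ranked list of photo ids per expanded query
    weights:      one weight per query, in the same order
    A photo at 1-based rank r in list i gets weights[i] * (k - r + 1) votes.
    """
    scores = defaultdict(float)
    for photos, w in zip(result_lists, weights):
        for rank, photo in enumerate(photos[:k], start=1):
            scores[photo] += w * (k - rank + 1)
    # Highest score first; ties are broken by photo id for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))


results = [
    ["p1", "p2", "p3"],  # e.g. results for "<name> <affiliation>"
    ["p2", "p4", "p1"],  # e.g. results for "<name> <knownFor>"
    ["p5", "p2"],        # e.g. results for "<name> <wonAward>"
]
ranking = weighted_vote(results, weights=[3, 2, 1], k=3)
print(ranking)
```

A photo that appears in several expanded queries, even at middling ranks, can overtake a photo that tops only one list, which is the intended effect of combining knowledge-based queries.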
31David Patterson
David Patterson Berkeley
David Patterson RISC
David Patterson ACM
our method
32Outline
✓ What and Why
✓ Building a Large Knowledge Base
✓ Consistent Growth of the Knowledge Base
✓ Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...
33Challenges: Scope, Scale, Robustness
- Temporal Knowledge: temporal validity of all facts (spouses, CEOs, etc.)
- Multilingual Knowledge: via cross-lingual Wikipedia links etc.
  Rome ↔ Roma ↔ Rom ↔ Rím ↔ ...
  Moment (Stochastik) ↔ Moment (math) ↔ Momento estándar
- Multimodal Knowledge: photos & videos of entities (people, landmarks, etc.) and facts (weddings, award ceremonies, soccer matches, etc.)
- Active Knowledge: on-demand coupling with Web Services for live facts (ratings, charts, sports feeds, etc.)
- Diverse Knowledge: diversity of facts/facets/views of entities
- Scalable Knowledge Gathering: high-quality extraction at the rate at which news, publications, Wikipedia updates are produced
34Scale: Benchmark Proposal
for all people in Wikipedia (100,000s) gather all spouses, incl. divorced & widowed, and corresponding time periods!
>95% accuracy, >95% coverage, in one night
35Robustness: Patterns & Reasoning
- Easy to optimize either one of recall or precision alone
  - recall → pattern-based harvesting (fast & furious IE)
  - precision → rigorous consistency reasoning
Challenge lies in reconciling both recall & precision
- Some ideas
  - richer patterns, richer pattern statistics
  - negative seed facts
  - more and richer constraints
  - efficiency & scalability: (map-reduce) parallelism (some parts embarrassingly parallel, others very difficult)
36Scope: Temporal Knowledge
- different resolutions
- missing dates
- relative dates
- adverbial phrases
- vague time periods
- temporal refinement
extracting, aggregating, and reasoning on temporal scopes of facts from many sources is a major challenge
37Summary
Information is not Knowledge. Knowledge is not Wisdom. Wisdom is not Truth. Truth is not Beauty. Beauty is not Music. Music is the best.
(Frank Zappa, 1940-1993)
- Distill entities & relations from Web pages to automatically build a large knowledge base
- knowledge (base) enables more (& better) knowledge
38Outlook: Knowledge Harvesting at Web Scale
- Grand Challenge
  - as literature, news & blogs are being produced, read everything, detect entities, extract relations, confirm old knowledge & obtain new knowledge
    - new facts
    - new relation types
    - temporal evolution of entities & facts
    - opinionated statements & diversity
    - multimodal footage
- Grand Opportunities
  - machine-processable, comprehensive KB can enable or boost
    - semantic Web search: precise answers
    - context-sensitive machine translation
    - situation-aware human-computer dialogs
    - machine reasoning and value-added knowledge services
39Domo Arigato Gozaimasu!