Robust and Scalable Harvesting - PowerPoint PPT Presentation

Title: Slide 1 Author: weikum Last modified by: weikum Created Date: 5/14/2005 4:14:09 PM Document presentation format: On-screen Show (4:3) Other titles – PowerPoint PPT presentation

Slides: 40
Provided by: wei97


1
Knowledge on the Web
Towards Robust and Scalable Harvesting of Entity-Relationship Facts
Gerhard Weikum, Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de/weikum/
2
Acknowledgements
3
Vision: Turn Web into Knowledge Base
  • comprehensive DB of human knowledge
  • everything that Wikipedia knows
  • machine-readable
  • capturing entities, classes, relationships

Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
4
Knowledge as Enabling Technology
  • entity recognition & disambiguation
  • understanding natural language & speech
  • knowledge services & reasoning for semantic apps
  • semantic search: precise answers to advanced queries
    (by scientists, students, journalists, analysts, etc.)

German chancellor when Angela Merkel was born?
Japanese computer science institutes?
Politicians who are also scientists?
Enzymes that inhibit HIV? Influenza drugs for
pregnant women?
...
5
Knowledge Search on the Web (1)
Query: sushi ingredients?
Results: Nori seaweed, Ginger, Tuna, Sashimi, ..., Unagi
http://www.google.com/squared/
6
Knowledge Search on the Web (1)
Query: Japanese computer science institutes?
Query: Japanese computers?
http://www.google.com/squared/
7
Knowledge Search on the Web (2)
Query: politicians who are also scientists?
  ?x isa politician . ?x isa scientist
Results: Benjamin Franklin, Zbigniew Brzezinski, Alan Greenspan, Angela Merkel
http://www.mpi-inf.mpg.de/yago-naga/
8
Knowledge Search on the Web (2)
Query: politicians who are married to scientists?
  ?x isa politician . ?x isMarriedTo ?y . ?y isa scientist
Results (3):
  Adrienne Clarkson, Stephen Clarkson
  Raúl Castro, Vilma Espín
  Jeannemarie Devolites Davis, Thomas M. Davis
http://www.mpi-inf.mpg.de/yago-naga/
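The two queries above are conjunctions of triple patterns with shared variables. A minimal sketch of how such a conjunction can be evaluated over a small triple store follows. The triples are toy data made up for illustration, and the names `TRIPLES`, `match`, and `query` are assumptions of this sketch, not part of YAGO or NAGA.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy triples for illustration only (not an excerpt of the real knowledge base).
TRIPLES: List[Triple] = [
    ("Angela_Merkel", "isa", "politician"),
    ("Angela_Merkel", "isa", "scientist"),
    ("Benjamin_Franklin", "isa", "politician"),
    ("Benjamin_Franklin", "isa", "scientist"),
    ("Max_Planck", "isa", "scientist"),
    ("Raul_Castro", "isa", "politician"),
    ("Raul_Castro", "isMarriedTo", "Vilma_Espin"),
    ("Vilma_Espin", "isa", "scientist"),
]


def match(pattern: Triple, triple: Triple,
          binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `binding` so that `pattern` equals `triple`; None if impossible."""
    extended = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):          # variable: bind it or check the old binding
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:                   # constant: must match exactly
            return None
    return extended


def query(patterns: List[Triple], triples: List[Triple]) -> List[Dict[str, str]]:
    """Evaluate a conjunction of triple patterns with nested-loop joins."""
    bindings: List[Dict[str, str]] = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


both = query([("?x", "isa", "politician"), ("?x", "isa", "scientist")], TRIPLES)
married = query(
    [("?x", "isa", "politician"), ("?x", "isMarriedTo", "?y"), ("?y", "isa", "scientist")],
    TRIPLES,
)
```

A real engine would use indexes and ranking instead of nested loops; the sketch only shows how shared variables tie the patterns together.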
9
Knowledge Search on the Web (3)
http://www-tsujii.is.s.u-tokyo.ac.jp/medie/
10
Take-Home Message
Information is not Knowledge. Knowledge is not
Wisdom. Wisdom is not Truth. Truth is not
Beauty. Beauty is not Music. Music is the best.
(Frank Zappa, jazz-rock musician, 1940-1993)
If music was invented 20 years ago when the
Web was created, we'd all be playing
one-string instruments.
(Udi Manber, VP Engineering, Google)
  • extract facts from Web sources
  • organize them in an automatically built
    knowledge base
  • answer questions in terms of entities and
    relations

11
Related Work
  • (Semantic Web) ontologies & entity search
  • (Statistical Web) fact extraction & statist. ranking
  • (Social Web) online communities & question answering
Systems shown: Yago-Naga, Text2Onto, Kylin & KOG, Powerset, ReadTheWeb,
Avatar, Hakia, Cyc, UIMA, kosmix, KnowItAll, TextRunner, WolframAlpha,
StatSnowball & EntityCube, SWSE, sig.ma, DBpedia, Cimple & DBlife,
TrueKnowledge, GoogleSquared, Freebase, Answers, START
12
Outline
What and Why ✓
Building a Large Knowledge Base
Consistent Growth of the Knowledge Base
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...
13
Information Extraction (IE): Text to Relations
bornOn (Max Planck, 23 April 1858)
bornIn (Max Planck, Kiel)
type (Max Planck, physicist)
Max Karl Ernst Ludwig Planck was born in Kiel,
Germany, on April 23, 1858, the son of Julius
Wilhelm and Emma (née Patzig) Planck. Planck
studied at the Universities of Munich and Berlin,
where his teachers included Kirchhoff and
Helmholtz, and received his doctorate of
philosophy at Munich in 1879. He was
Privatdozent in Munich from 1880 to 1885, then
Associate Professor of Theoretical Physics at
Kiel until 1889, in which year he succeeded
Kirchhoff as Professor at Berlin University,
where he remained until his retirement in 1926.
Afterwards he became President of the Kaiser
Wilhelm Society for the Promotion of Science, a
post he held until 1937. He was also a gifted
pianist and is said to have at one time
considered music as a career. Planck was twice
married. Upon his appointment, in 1885, to
Associate Professor in his native town Kiel he
married a friend of his childhood, Marie Merck,
who died in 1909. He remarried her cousin Marga
von Hösslin. Three of his children died young,
leaving him with two sons.
advisor (Max Planck, Kirchhoff)
advisor (Max Planck, Helmholtz)
AlmaMater (Max Planck, TU Munich)
plays (Max Planck, piano)
spouse (Max Planck, Marie Merck)
spouse (Max Planck, Marga Hösslin)
  • IE builds data space (with uncertain data)
  • confidence < 1 (sometimes << 1)
  • knowledge base from many sources
  • high computational cost

IE: combine NLP, pattern matching, statistical
learning
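As a rough illustration of the pattern-matching ingredient, the sketch below pulls bornIn and bornOn facts out of the first sentence of the passage with one hand-written regular expression. The regex and the function name `extract_birth_facts` are assumptions made for this example; a real extractor would combine many patterns with NLP and statistics and attach a confidence to every fact.

```python
import re

TEXT = (
    "Max Karl Ernst Ludwig Planck was born in Kiel, Germany, "
    "on April 23, 1858, the son of Julius Wilhelm and Emma Planck."
)

# One hand-written surface pattern: "<Person> was born in <Place>[, <Country>], on <Date>".
BORN = re.compile(
    r"(?P<person>(?:[A-Z][a-z]+ )+[A-Z][a-z]+) was born in (?P<place>[A-Z][a-z]+)"
    r"(?:, [A-Z][a-z]+)?, on (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"
)


def extract_birth_facts(text):
    """Return (relation, subject, object) tuples found by the BORN pattern."""
    facts = []
    for m in BORN.finditer(text):
        facts.append(("bornIn", m.group("person"), m.group("place")))
        facts.append(("bornOn", m.group("person"), m.group("date")))
    return facts


facts = extract_birth_facts(TEXT)
```

The extracted subject is the surface string "Max Karl Ernst Ludwig Planck"; mapping it to the entity Max Planck is a separate disambiguation step.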
14
IE for Knowledge Harvesting
  • YAGO: knowledge base from Wikipedia infoboxes & categories
    and integration with WordNet taxonomy
  • NAGA: search on RDF graph
    with entity-relationship LM for ranking

Infobox scientist
  name = Max Planck
  birth_date = {{birth date|1858|4|23|mf=y}}
  birth_place = Kiel, Holstein
  death_date = {{death date and age|mf=yes|1947|10|4}}
  death_place = Göttingen, West Germany
  nationality = [[Germany|German]]
  field = Physics
  alma_mater = Ludwig-Maximilians-Universität München
  work_institutions = University of Kiel<br />
    [[Humboldt University of Berlin|University of Berlin]]<br />
    University of Göttingen<br />
    Kaiser-Wilhelm-Gesellschaft<br />
  doctoral_advisor = Alexander von Brill
  doctoral_students = Gustav Ludwig Hertz<br />
  known_for = Planck constant<br />
    Planck postulate<br />
    Planck's law of black body radiation
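Infobox attributes map almost directly to facts about the article's entity. The sketch below shows the idea on a simplified `key = value` rendering of a few of the fields above. The attribute-to-relation table and the relation names `worksInField` and `hasAdvisor` are hypothetical and chosen only for illustration; they are not YAGO's actual extraction rules.

```python
# Simplified key = value rendering of a few infobox fields (wiki markup removed).
INFOBOX = """\
name = Max Planck
birth_place = Kiel, Holstein
field = Physics
doctoral_advisor = Alexander von Brill
"""

# Hypothetical mapping from infobox attributes to relation names.
ATTRIBUTE_TO_RELATION = {
    "birth_place": "bornIn",
    "field": "worksInField",
    "doctoral_advisor": "hasAdvisor",
}


def infobox_to_triples(infobox):
    """Turn 'key = value' lines into (subject, relation, object) triples."""
    fields = {}
    for line in infobox.splitlines():
        if "=" not in line:
            continue  # skip lines that are not attribute assignments
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    subject = fields.get("name")
    if subject is None:
        return []  # without a name there is no entity to attach facts to
    return [
        (subject, ATTRIBUTE_TO_RELATION[key], value)
        for key, value in fields.items()
        if key in ATTRIBUTE_TO_RELATION
    ]


triples = infobox_to_triples(INFOBOX)
```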
15
YAGO Knowledge Base (F. Suchanek et al., WWW07)
40 Mio. RDF triples (entity1-relation-entity2, subject-predicate-object)
Accuracy ≈ 95%
[Figure: excerpt of the YAGO graph.
 Classes: Entity, Person, Location, Organization, Country, State, City,
 Scientist, Politician, Biologist, Physicist.
 Instances: Max_Planck, Erwin_Planck, Angela Merkel, Max_Planck Society,
 Germany, Kiel, Schleswig-Holstein, Nobel Prize.
 Relations: subclass, instanceOf, locatedIn, bornIn, bornOn, diedOn,
 FatherOf, hasWon, citizenOf, means.
 Dates: Apr 23, 1858; Oct 4, 1947; Oct 23, 1944.
 Names attached by means edges (weights 0.9 and 0.1 shown): "Max Planck",
 "Max Karl Ernst Ludwig Planck", "Angela Dorothea Merkel", "Angela Merkel".]
16
Leveraging YAGO for Entity Extraction
Existing knowledge base boosts entity detection &
disambiguation (similarity of string-in-context to
target entity-in-context)
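One simple way to read "similarity of string-in-context to entity-in-context" is a bag-of-words comparison between the words around a mention and words known to be associated with each candidate entity. The sketch below uses cosine similarity. The entity context strings are invented for the example, and the slides do not say which similarity measure is actually used.

```python
import math
from collections import Counter

# Hypothetical context keywords per candidate entity (in practice these could
# be derived from the knowledge base, e.g. from related entities and classes).
ENTITY_CONTEXT = {
    "MaxPlanckInstitute": "research institute science society germany informatics",
    "MessagePassingInterface": "parallel computing library processes cluster communication",
}


def bag(text):
    """Lower-cased bag of words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def disambiguate(mention_context, candidates=ENTITY_CONTEXT):
    """Pick the candidate entity whose context is most similar to the mention's."""
    mention = bag(mention_context)
    return max(candidates, key=lambda e: cosine(mention, bag(candidates[e])))


first = disambiguate("MPI programs exchange data between processes on a cluster")
second = disambiguate("The MPI in Saarbruecken is a research institute in Germany")
```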
17
Outline
What and Why ✓
Building a Large Knowledge Base ✓
Consistent Growth of the Knowledge Base
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...
18
Growing the Knowledge Base

[Diagram labels: WordNet, Wikipedia, YAGO Core Extractors, YAGO Core Checker, YAGO Core, Growing]
19
Pattern-Based Harvesting
(Dipre, Snowball, Text2Onto, Leila, StatSnowball,
etc.)
Facts → Patterns → Fact Candidates → Facts (iterated)
  • good for recall
  • noisy, drifting
  • not robust enough
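The loop behind pattern-based harvesting can be sketched in a few lines: known facts yield textual patterns, and the patterns yield new fact candidates, which feed the next round. The corpus, the seed pair, and the function names below are assumptions for illustration. Systems such as Dipre or Snowball add pattern scoring and filtering, which this sketch omits on purpose.

```python
import re

CORPUS = [
    "Hillary and her husband Bill attended the dinner.",
    "Carla and her husband Nicolas visited Berlin.",
    "Victoria and her husband David moved to London.",
]
SEEDS = {("Hillary", "Bill")}


def learn_patterns(corpus, facts):
    """Collect the text between X and Y for every sentence mentioning a known pair."""
    patterns = set()
    for sentence in corpus:
        for x, y in facts:
            m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(corpus, patterns):
    """Find new (X, Y) candidates: capitalized words around a learned infix."""
    candidates = set()
    for sentence in corpus:
        for infix in patterns:
            regex = r"([A-Z][a-z]+)" + re.escape(infix) + r"([A-Z][a-z]+)"
            for m in re.finditer(regex, sentence):
                candidates.add((m.group(1), m.group(2)))
    return candidates


def harvest(corpus, seeds, rounds=2):
    """Alternate between learning patterns and applying them."""
    facts = set(seeds)
    for _ in range(rounds):
        facts |= apply_patterns(corpus, learn_patterns(corpus, facts))
    return facts


harvested = harvest(CORPUS, SEEDS)
```

Every candidate is accepted unconditionally here, which is exactly why the approach is good for recall but drifts: one bad pattern contaminates all later rounds.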

20
SOFIE: Self-Organizing Framework for IE
(F. Suchanek et al., WWW09)
  • Integrate methods
      • textual/linguistic pattern-based IE with statistics
        seeds → patterns → facts → patterns → ...
        (Hillary, Bill) → X and her husband Y → (Carla, Nicolas), (Carla, Mick) → ...
      • declarative rule-based IE with constraints
        functional dependencies: marriedTo is a function
        inclusion dependencies: presidentOf ⇒ citizenOf
  • Address problems
      • pattern selection (and her husband, has been dating, ...)
      • reasoning on mutual consistency of facts
      • entity disambiguation (Merkel → AngelaMerkel, MaxMerkel, ...;
        MPI → MaxPlanckInstitute, MessagePassingInterface)

Unified solution by Weighted Max-Sat solver (high
accuracy and much faster than MCMC for prob.
graphical models)
21
SOFIE Example
Weights shown on the slide: 100, 40, 60, 20, 10
Patterns
occurs (X and her husband Y, Hillary, Bill)
occurs (X Y and their children, Hillary, Bill)
occurs (X and her husband Y, Victoria, David)
occurs (X dating with Y, Rebecca, David)
occurs (X dating with Y, Victoria, Tom)
Facts
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Clauses
Spouse (Victoria, David) ⇒ ¬Spouse (Rebecca, David)
Spouse (Victoria, David) ⇒ ¬Spouse (Victoria, Tom)
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
occurs (husband, Victoria, David) ∧ Spouse (Victoria, David) ⇒ expresses (husband, Spouse)
Rules
∀ x,y,z,w: R(x,y) ∧ R(x,z) ⇒ y=z (alt. ¬R(x,y) ∨ ¬R(x,z))
∀ x,y,z,w: R(x,y) ∧ R(w,y) ⇒ x=w (alt. ¬R(x,y) ∨ ¬R(w,y))
...
∀ x,y: R(x,y) ⇒ R(y,x)
∀ p,x,y: occurs (p, x, y) ∧ expresses (p, R) ⇒ R (x, y)
∀ p,x,y: occurs (p, x, y) ∧ R (x, y) ⇒ expresses (p, R)
22
Reasoning on Hypotheses by Weighted Max-Sat Solver
  • Clauses (propositional logic formulae consisting of
    conjunctions of disjunctions of positive or negative literals)
    connect facts, patterns, hypotheses, constraints
  • Treat hypotheses (literals) as variables, facts as constants
    (¬1 ∨ ¬A ∨ 1), (¬1 ∨ ¬A ∨ B), (¬1 ∨ ¬C), (¬D ∨ E), (¬D ∨ F), ...
  • Clauses can be weighted by pattern statistics
  • Solve weighted Max-Sat problem:
    assign truth values to variables s.t.
    total weight of satisfied clauses is max!
  • ⇒ NP-hard, but good approximation algorithms
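To make the objective concrete, the sketch below solves a tiny weighted Max-Sat instance by exhaustive search. The variables loosely follow the Spouse example (S_VD stands for Spouse(Victoria, David), E_husband for expresses(husband, Spouse), and so on), but the clause set and all weights are assumptions chosen for illustration, not the numbers from the slides. Exhaustive search only works for a handful of variables; SOFIE relies on an approximation algorithm instead.

```python
from itertools import product

VARIABLES = ["S_VD", "S_RD", "S_VT", "E_husband", "E_dating"]

# Each clause is (weight, [(variable, required_truth_value), ...]) and is
# satisfied if at least one literal holds. All weights are illustrative assumptions.
CLAUSES = [
    (1000, [("S_VD", False), ("S_RD", False)]),   # David has at most one spouse
    (1000, [("S_VD", False), ("S_VT", False)]),   # Victoria has at most one spouse
    (100, [("E_husband", True)]),                 # seed fact supports the husband pattern
    (60, [("E_husband", False), ("S_VD", True)]), # husband pattern occurs with (Victoria, David)
    (20, [("E_dating", False), ("S_RD", True)]),  # dating pattern occurs with (Rebecca, David)
    (10, [("E_dating", False), ("S_VT", True)]),  # dating pattern occurs with (Victoria, Tom)
    (5, [("E_dating", True)]),                    # weak evidence for the dating pattern
]


def satisfied(clause, assignment):
    """A clause holds if any of its literals has the required truth value."""
    return any(assignment[var] == value for var, value in clause)


def solve(variables, clauses):
    """Try every truth assignment and keep the one with maximal satisfied weight."""
    best_weight, best_assignment = -1, None
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(w for w, clause in clauses if satisfied(clause, assignment))
        if weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment


weight, assignment = solve(VARIABLES, CLAUSES)
```

With these weights the best assignment accepts the husband pattern and Spouse(Victoria, David), and rejects the dating pattern together with the two competing spouse hypotheses.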

23
SOFIE Example
Weights shown on the slide: 100, 40, 60, 20, 10
occurs (X and her husband Y, Hillary, Bill)
occurs (X Y and their children, Hillary, Bill)
occurs (X and her husband Y, Victoria, David)
occurs (X dating with Y, Rebecca, David)
occurs (X dating with Y, Victoria, Tom)
Spouse (HillaryClinton, BillClinton)
Spouse (CarlaBruni, NicolasSarkozy)
Spouse (Victoria, David) ⇒ ¬Spouse (Rebecca, David)
Spouse (Victoria, David) ⇒ ¬Spouse (Victoria, Tom)
occurs (husband, Victoria, David) ∧ expresses (husband, Spouse) ⇒ Spouse (Victoria, David)
occurs (dating, Rebecca, David) ∧ expresses (dating, Spouse) ⇒ Spouse (Rebecca, David)
Wanted: truth assignment for A, B, C with
maximal total weight of satisfied clauses
24
Consistent Growth of Knowledge
  • SOFIE: self-organizing framework for
    scrutinizing hypotheses about new facts,
    enabling automated growth of the knowledge base
  • unifies pattern-based IE, consistency reasoning,
    and entity disambiguation
  • highly related to methods based on Markov Logic
    Networks and joint learning with constraints
  • but SOFIE does not compute a joint probability
    distribution and is much faster than Markov Chain
    Monte Carlo (MCMC) methods

25
Outline
What and Why ✓
Building a Large Knowledge Base ✓
Consistent Growth of the Knowledge Base ✓
Adding Multimodal Knowledge
Challenges: Scope, Scale, Robustness
...
26
What's Wrong With This?
27
Multimodal Knowledge
type (MPI, ScientificOrganization)
fullName (MPI, Max Planck Institute for Informatics)
inField (MPI, Computer Science)
partOf (MPI, Max Planck Society)
foundingDirector (MPI, Kurt Mehlhorn)
28
K2 (Knowledge Kaleidoscope): Photos of Named Entities
Challenges
  • Long Tail: non-famous but notable entities
  • Diversity: variety of different views, different ages, etc.
  • Scale: all entities with Wikipedia article (known to YAGO),
    all entities mentioned in Wikipedia articles
29
Gathering & Ranking Photos by Image Search Engines
30
Knowledge-based Photo Harvesting
(Bilyana Taneva et al., WSDM 2010)
  • generate expanded queries qi for entity e
    using affiliation, knownFor, wonAward, etc.
    e.g. Kitsuregawa University Tokyo, Kitsuregawa Hash Join,
    Kitsuregawa Sigmod Award, etc.
  • run queries and retrieve photos p from top-k
    results (k=100)
  • combine results by rank-based weighted voting
    (learn weights wi from training entities)
  • consider visual similarities (using SIFT)
  • rank results, cluster by similarity
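The voting step can be sketched as follows: each expanded query votes for the photos it returns, with a vote that shrinks with rank and is scaled by the query's weight. The reciprocal-rank vote, the example weights, and the file names are assumptions of this sketch; the slides only state that the voting is rank-based and that the weights are learned from training entities. Visual similarity (SIFT) and clustering are left out.

```python
from collections import defaultdict


def weighted_vote(result_lists, weights, k=100):
    """Combine ranked result lists into one ranking by weighted rank-based voting."""
    scores = defaultdict(float)
    for query, photos in result_lists.items():
        for rank, photo in enumerate(photos[:k], start=1):
            # Assumed vote: query weight divided by rank (reciprocal-rank voting).
            scores[photo] += weights[query] / rank
    # Highest score first; ties broken by name to keep the output deterministic.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical result lists for the expanded queries of one entity.
results = {
    "Kitsuregawa": ["a.jpg", "b.jpg", "c.jpg"],
    "Kitsuregawa University Tokyo": ["b.jpg", "a.jpg"],
    "Kitsuregawa Sigmod Award": ["b.jpg", "d.jpg"],
}
# Hypothetical weights; in the described method these are learned.
weights = {
    "Kitsuregawa": 1.0,
    "Kitsuregawa University Tokyo": 2.0,
    "Kitsuregawa Sigmod Award": 1.5,
}

ranking = weighted_vote(results, weights)
```

Photos returned by several expanded queries rise to the top, which is the intended effect for ambiguous or long-tail names.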

31
[Figure: photo results for the queries "David Patterson", "David Patterson Berkeley", "David Patterson RISC", "David Patterson ACM", compared with our method]
32
Outline
What and Why ✓
Building a Large Knowledge Base ✓
Consistent Growth of the Knowledge Base ✓
Adding Multimodal Knowledge ✓
Challenges: Scope, Scale, Robustness
...
33
Challenges: Scope, Scale, Robustness
  • Temporal Knowledge:
    temporal validity of all facts (spouses, CEOs, etc.)
  • Multilingual Knowledge: via cross-lingual Wikipedia links etc.
      • Rome ↔ Roma ↔ Rom ↔ Rím ↔ ...
      • Moment (Stochastik) ↔ Moment (math) ↔ Momento estándar
  • Multimodal Knowledge: photos & videos of
    entities (people, landmarks, etc.) and
    facts (weddings, award ceremonies, soccer matches, etc.)
  • Active Knowledge: on-demand coupling with Web Services
    for live facts (ratings, charts, sports feeds, etc.)
  • Diverse Knowledge: diversity of
    facts/facets/views of entities
  • Scalable Knowledge Gathering:
    high-quality extraction at the rate at which
    news, publications, Wikipedia updates are produced

34
Scale: Benchmark Proposal
for all people in Wikipedia (100,000s) gather
all spouses, incl. divorced & widowed, and
corresponding time periods! >95% accuracy,
>95% coverage, in one night
35
Robustness: Patterns & Reasoning
  • Easy to optimize either one of recall or
    precision alone
      • recall → pattern-based harvesting (fast &
        furious IE)
      • precision → rigorous consistency reasoning

Challenge lies in reconciling both recall &
precision
  • Some ideas
      • richer patterns, richer pattern statistics
      • negative seed facts
      • more and richer constraints
      • efficiency & scalability: (map-reduce)
        parallelism
        (some parts embarrassingly parallel, others very
        difficult)

36
Scope: Temporal Knowledge
  • different resolutions
  • missing dates
  • relative dates
  • adverbial phrases
  • vague time periods
  • temporal refinement

extracting, aggregating, and reasoning
on temporal scopes of facts from many sources is
a major challenge
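One small piece of this problem, dates at different resolutions, can be handled by treating every date as an interval and intersecting the intervals reported by different sources. The sketch below assumes ISO-like strings ("1858", "1858-04", "1858-04-23"); the function names are made up for this example, and relative dates, adverbial phrases, and vague periods are not covered.

```python
import calendar
from datetime import date


def to_interval(text):
    """Map a year, year-month, or full date string to a (first_day, last_day) interval."""
    parts = [int(p) for p in text.split("-")]
    if len(parts) == 1:                       # year resolution
        return date(parts[0], 1, 1), date(parts[0], 12, 31)
    if len(parts) == 2:                       # month resolution
        last_day = calendar.monthrange(parts[0], parts[1])[1]
        return date(parts[0], parts[1], 1), date(parts[0], parts[1], last_day)
    day = date(*parts)                        # day resolution
    return day, day


def intersect(a, b):
    """Refine two intervals to their overlap, or None if they contradict each other."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


refined = intersect(to_interval("1858"), to_interval("1858-04"))
exact = intersect(refined, to_interval("1858-04-23"))
conflict = intersect(to_interval("1858"), to_interval("1947"))
```

A coarse statement is refined by a finer one as long as they are compatible, and an empty intersection flags a conflict that needs further reasoning.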
37
Summary
Information is not Knowledge. Knowledge is not
Wisdom. Wisdom is not Truth. Truth is not
Beauty. Beauty is not Music. Music is the best.
(Frank Zappa, 1940-1993)
  • Distill entities & relations from Web pages to
    automatically build a large knowledge base
  • knowledge (base) enables
    more (& better) knowledge

38
Outlook: Knowledge Harvesting at Web Scale
  • Grand Challenge:
    as literature, news & blogs are being produced,
    read everything, detect entities, extract relations,
    confirm old knowledge & obtain new knowledge
      • new facts
      • new relation types
      • temporal evolution of entities & facts
      • opinionated statements & diversity
      • multimodal footage
  • Grand Opportunities:
    machine-processable, comprehensive KB can enable or boost
      • semantic Web search: precise answers
      • context-sensitive machine translation
      • situation-aware human-computer dialogs
      • machine reasoning and value-added knowledge services

39
Domo Arigato Gozaimasu!