Title: Wikitology: Wikipedia as an Ontology
1Wikitology: Wikipedia as an Ontology
Tim Finin, UMBC
- Zareen Syed and Anupam Joshi
- University of Maryland, Baltimore County
- James Mayfield, Paul McNamee and Christine Piatko
- JHU Human Language Technology Center of Excellence
2Overview
- Introduction
- Wikipedia as an ontology
- Applications
- Discussion
- Conclusion
introduction ? wikitology ? applications ?
discussion ? conclusion
3Wikis and Knowledge
- Wikis are a great way to collaborate on knowledge encoding
- Wikipedia is an archetype for this, but there are many examples
- Ongoing research is exploring how to integrate this with structured knowledge
- DBpedia, Semantic MediaWiki, Freebase, etc.
- I'll describe an approach we've taken and experiments in using it
- We came at this from an IR/HLT perspective
4Wikipedia data in RDF
5Populating Freebase KB
6Populating Powerset's KB
7AskWiki uses Wikipedia for QA
8With sometimes surprising results
9TrueKnowledge mines Wikipedia
10Wikipedia pages as tags
11Wikitology
- We are exploring an approach to deriving an
ontology from Wikipedia that is useful in a
variety of language processing tasks
12Our original problem (2006)
- Problem: describe what an analyst has been working on to support collaboration
- Idea: track documents she reads and map these to terms in an ontology, aggregate to produce a short list of topics
- Approach: use Wikipedia articles as ontology terms, use document-article similarity for the mapping, and spreading activation for aggregation
13What's a document about?
- Two common approaches
- (1) Select words and phrases using TF-IDF that characterize the document
- (2) Map document to a list of terms from a controlled vocabulary or ontology
- (1) is flexible and does not require creating and maintaining an ontology
- (2) can tie documents to a rich knowledge base
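Approach (1) can be illustrated with a minimal sketch. It assumes whitespace tokenization and a plain tf * log(N/df) weighting; the function name `tfidf_keywords` and the toy documents are illustrative and not taken from the talk.

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    target = tokenized[doc_index]
    tf = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

docs = [
    "crop rotation improves soil",
    "soil needs water",
    "water is wet",
]
keywords = tfidf_keywords(docs, 0)
```

Terms that occur in fewer documents score higher, so "soil", which appears in two of the three toy documents, ranks below the terms unique to the first one.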
14Wikitology!
- Using Wikipedia as an ontology offers the best of both approaches
- each article (3M) is a concept in the ontology
- terms linked via Wikipedia's category system (200k) and inter-article links
- Lots of structured and semi-structured data
- It's a consensus ontology created and maintained by a diverse community
- Broad coverage, multilingual, very current
- Overall content quality is high
15Wikitology features
- Terms have unique IDs (URLs) and are self-describing for people
- Underlying graphs provide structure and associations: categories, article links, disambiguation, aliases (redirects), ...
- Article history contains useful meta-data for trust, provenance, controversy, ...
- External sources provide more info (e.g., Google's PageRank)
- Annotated with structured data from DBpedia, Freebase, Geonames and LOD
16Problems as an Ontology
- Treating Wikipedia as an ontology reveals many problems
- Uncategorized and miscategorized articles
- Single document in too many categories
- George W. Bush is included in about 30 categories
- Links between articles belonging to very different categories
- John F. Kennedy has a link to coincidence theory, which belongs to Mathematical Analysis / Topology / Fixed Points
17Problems as an Ontology
- Article links in text are not typed
- Uneven category articulation
- Some categories are underrepresented whereas others have many articles
- Administrative categories, e.g.
- Clean up from Sep 2006
- Articles with unsourced statements
- Over-linking, e.g.
- A mention of United States linked to the page United_states
- Mentions of 1949 linked to the year 1949
18Problems as an Ontology
- Wikipedia's infobox templates have great potential but have several problems
- Multiple templates for same class
- Multiple attribute names for same property
- E.g., six attributes for a person's birth date
- Attributes lack domains or datatypes
- E.g., value can be string or link
19Wikitology 1, 2, 3
- We've addressed some of these problems in developing Wikitology
- The development has been driven by several use cases and applications
20Wikitology Use Cases
- Identifying user context in a collaboration system from documents viewed (2006)
- Improve IR accuracy by adding Wikitology tags to documents (2007)
- Cross-document co-reference resolution for named entities in text (2008)
- Knowledge Base population from text (2009)
- Improve Web search engine by tagging documents and queries (2009)
21Wikitology 1.0 (2007)
- Structured Data
- Specialized concepts (article titles)
- Generalized concepts (category titles)
- Inter-category and -article links as relations between concepts
- Article-category links as relations between specialized and generalized concepts
- Un-Structured Data
- Article text
- Algorithms to remove useless categories and links, infer categories, and select, rank and aggregate concepts using the hybrid knowledge base
[Diagram labels: text, graphs, human input and editing]
22Experiments
- Goal: given one or more documents, compute a ranked list of the top Wikipedia articles and/or categories that describe it
- Basic metric: document similarity between Wikipedia article and document(s)
- Variations: role of categories, eliminating uninteresting articles, use of spreading activation, using similarity scores, weighting links, number of spreading activation pulses, individual or set of query documents, etc.
23Method 1
Using Wikipedia article text and categories to predict concepts
[Diagram: input query doc(s) are matched to similar Wikipedia articles by cosine similarity; example scores on the slide: 0.8, 0.2, 0.1, 0.2]
24Method 1
Using Wikipedia article text and categories to predict concepts
[Diagram: input query doc(s) are matched to similar Wikipedia articles by cosine similarity, and the articles link into the Wikipedia category graph; example scores on the slide: 0.8, 0.2, 0.1, 0.2, 0.3]
25Method 1
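Using Wikipedia article text and categories to predict concepts
Output
- Rank Categories
- Links
- Cosine similarity
[Diagram: input query doc(s) are matched to similar Wikipedia articles by cosine similarity, and the articles link into the Wikipedia category graph, where categories are ranked; example values on the slide: 0.9, 3, 0.8, 0.2, 0.1, 0.2, 0.3]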
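A rough sketch of the Method 1 idea follows. It is not the system's actual code, which used a Lucene index: here plain term-count vectors stand in for the index, and the scoring rule (sum the cosine similarities of the top-N similar articles that link to each category) is an assumed way to combine links and cosine similarity. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_categories(query_text, articles, article_categories, top_n=2):
    """Rank categories by the summed similarity of their top-N similar articles."""
    query = Counter(query_text.lower().split())
    sims = {
        title: cosine(query, Counter(text.lower().split()))
        for title, text in articles.items()
    }
    # Keep only the top_n most similar articles.
    top = sorted(sims, key=lambda t: (-sims[t], t))[:top_n]
    scores = defaultdict(float)
    for title in top:
        for category in article_categories.get(title, []):
            scores[category] += sims[title]  # assumed aggregation rule
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data, invented for illustration only.
articles = {
    "Crop_rotation": "crop rotation soil farming",
    "Green_manure": "green manure soil crop",
    "Turing_test": "machine intelligence test",
}
article_categories = {
    "Crop_rotation": ["Agriculture", "Crops"],
    "Green_manure": ["Agriculture"],
    "Turing_test": ["Artificial_intelligence"],
}
ranked = rank_categories("soil crop rotation", articles, article_categories)
```

A category reached from several similar articles accumulates more score, which is one simple way to reflect both the number of links and the similarity values.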
26Method 2
Using spreading activation on category link graph to get aggregated concepts
[Diagram: input query doc(s) are matched to similar articles by cosine similarity (example scores on the slide: 0.8, 0.2, 0.1, 0.2, 0.3); the scores enter the Wikipedia category graph through an input function, spreading activation runs, and an output function yields the output: ranked concepts based on final activation score]
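Spreading activation over a link graph can be sketched as below. The slides do not give the input and output functions, so this version assumes the simplest choices: a node's input is the sum of incoming activation, and its output is its newly received activation times a decay factor, split evenly over its out-links. The graph, seed scores, and the `decay` parameter are hypothetical.

```python
from collections import defaultdict

def spread_activation(graph, initial, pulses=2, decay=0.5):
    """graph maps a node to the nodes it links to (e.g. category -> parents).

    initial maps seed nodes to starting activation (e.g. cosine scores).
    Returns (node, activation) pairs sorted by final activation.
    """
    activation = defaultdict(float, initial)
    frontier = dict(initial)  # activation newly received in the last pulse
    for _ in range(pulses):
        incoming = defaultdict(float)
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            for nb in neighbours:
                # Assumed output function: decayed, evenly split over out-links.
                incoming[nb] += decay * value / len(neighbours)
        for node, value in incoming.items():
            # Assumed input function: sum of incoming activation.
            activation[node] += value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy category graph, invented for illustration only.
graph = {
    "Organic_farming": ["Agriculture"],
    "Crops": ["Agriculture", "Plants"],
    "Agriculture": ["Applied_sciences"],
}
seeds = {"Organic_farming": 0.8, "Crops": 0.2}
ranked_concepts = spread_activation(graph, seeds, pulses=2)
```

With two pulses, activation from both seeds converges on the shared parent and then reaches its parent, which is how more general concepts can surface in the ranking.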
27Method 3
Using spreading activation on article link graph
Threshold: ignore spreading activation to articles with less than 0.4 cosine similarity score
Edge weights: cosine similarity between linked articles
[Diagram: input query doc(s) are matched to similar articles in the Wikipedia article links graph; spreading activation with a node input function and a node output function produces the output: ranked concepts based on final activation score]
28Evaluation
- An initial informal evaluation compared results against our own judgments
- Used to select promising combinations of ideas and parameter settings
- Formal evaluation
- Selected Wikipedia articles for testing and removed them from the Lucene index and graphs
- For each, use methods to predict categories and linked articles
- Compare results using precision and recall to known categories and linked articles
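The comparison step can be written down directly. This is a small sketch of set-based precision and recall for one held-out article; the function name and the example category lists are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall of predicted labels against known ones."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical prediction for one held-out article.
predicted = ["Agriculture", "Crops", "Skills"]
known = ["Agriculture", "Crops", "Agronomy", "Organic_farming"]
p, r = precision_recall(predicted, known)
```

Here two of the three predictions are correct and two of the four known categories are found.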
29Example
Prediction for Set of Test Documents
Test Document Titles in the Set (Wikipedia Articles): Crop_rotation, Permaculture, Beneficial_insects, Neem, Lady_Bird, Principles_of_Organic_Agriculture, Rhizobia, Biointensive, Intercropping, Green_manure
Legend: Concept not in the Category Hierarchy
Method 1 (Ranking Categories Directly): Agriculture, Sustainable_technologies, Crops, Agronomy, Permaculture
Method 2 (2 pulses, Spreading Activation on Category Links Graph): Skills, Applied_sciences, Land_management, Food_industry, Agriculture
Method 3 (2 pulses, Spreading Activation on Article Links Graph): Organic_farming, Sustainable_agriculture, Organic_gardening, Agriculture, Companion_planting
30Category prediction evaluation
- Spreading activation with two pulses worked best
- Only considering articles with similarity > 0.5 was a good threshold
31Article prediction evaluation
- Spreading activation with one pulse worked best
- Only considering articles with similarity > 0.5 was a good threshold
32Improving IR performance (2008-09)
- Improving IR performance for a collection by adding semantic terms to documents
- Query with blind relevance feedback may benefit from the semantic terms
- Initial evaluation with NIST TREC 2005 collection in collaboration with Paul McNamee, JHU HLTCOE
- Ongoing integration into RiverGlass MORAG search engine
33Improving IR performance
Doc FT921-4598 (3/9/92)
... Alan Turing, described as a brilliant
mathematician and a key figure in the breaking of
the Nazis' Enigma codes. Prof IJ Good says it is
as well that British security was unaware of
Turing's homosexuality, otherwise he might have
been fired 'and we might have lost the war'. In
1950 Turing wrote the seminal paper 'Computing
Machinery And Intelligence', but in 1954 killed
himself ...
Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recursion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence
34Evaluation
- Mixed results on NIST evaluation
- Slightly worse on mean average precision
- Slightly better for precision at 10
               MAP     P_at_10
Base           0.2076  0.4207
Base + rf      0.2470  0.4480
Concepts + rf  0.2400  0.4553
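For readers less familiar with the two metrics in the table, here is a minimal sketch of how they are usually defined. It is not the NIST scoring tool, and the rankings and relevance judgments below are invented; only the definitions are standard.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["a", "b"], {"b"}),
]
map_score = mean_average_precision(runs)
p_at_2 = precision_at_k(runs[0][0], runs[0][1], k=2)
```

MAP rewards placing every relevant document high in the ranking, while precision at 10 only looks at the first page of results, so the two can move in opposite directions as they do in the table.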
35Information Extraction
- Problem: resolve entities found by a named entity recognition system across documents to KB entries
- ACE 2008: the NIST-run Automatic Content Extraction conference is focused on this task
- We were part of a team led by JHU Human Language Technology Center of Excellence
- Use Wikitology to map document entities to KB entities
36Wikitology 2.0 (2008)
[Diagram labels: RDF, text, graphs, Freebase KB, Yago, WordNet, databases, human input and editing]
37Named Entity Recognition
- Timothy F. Geithner, who as president of the New
York Federal Reserve Bank oversaw many of the
nation's most powerful financial institutions,
stunned the group with the audacity of his
answer. He proposed asking Congress to give the
president broad power to guarantee all the debt
in the banking system, according to two
participants, including Michele Davis, then an
assistant Treasury secretary.
38Named Entity Recognition
- Timothy F. Geithner, who as president of the New
York Federal Reserve Bank oversaw many of the
nation's most powerful financial institutions,
stunned the group with the audacity of his
answer. He proposed asking Congress to give the
president broad power to guarantee all the debt
in the banking system, according to two
participants, including Michele Davis, then an
assistant Treasury secretary.
39Open Calais
Free NER service that returns results in RDF
40Global Coreference Task
- Start with entities and relations produced by a within-document extraction system
- Produce global clusters for PERSON and ORGANIZATION entities
- Only evaluate over instances of entities with a name
- Challenges
- Very limited development data
- ACE released 49 files in English, none in Arabic
- MITRE released English ACE05 corpus, but annotation is noisy and data has few ambiguous entities
- Within-document mistakes are propagated to cross-document system
- 10K document evaluation set required work on scalability of approaches
Abu Abbas aka Muhammad Zaydan aka Muhammad Abbas
William Wallace (living British Lord)
William Wallace (of Braveheart fame)
41Global Coreference Resolution Approach
- SERIF for intra-document processing
- Entity Filtering
- Collect all pairs of SERIF entities
- Filter entity pairs with heuristics (e.g., string similarity of mentions) to get a high-recall set of pairs significantly smaller than the n^2 possible pairs
- Feature generation
- Training
- Train SVM to identify coreferent pairs
- Entity Clustering
- Cluster predicted pairs
- Each connected component forms a global entity
- Relation Identification
- Every pair of SERIF-identified relations whose types are identical and whose endpoints are coreferent are deemed to be coreferent
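The entity filtering step might look like the sketch below. The slides only say the heuristics include string similarity of mentions; this version assumes a Dice coefficient over character bigram sets and an arbitrary cutoff of 0.5. The function names, the threshold, and the entity ids are hypothetical, and a real system would avoid enumerating all pairs on a 10K-document set.

```python
from itertools import combinations

def bigrams(s):
    """Set of character bigrams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient over character bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

def candidate_pairs(entities, threshold=0.5):
    """Keep entity pairs whose best mention-to-mention Dice score passes the cutoff.

    entities maps an entity id to its list of name mentions.
    """
    pairs = []
    for (i, mentions_i), (j, mentions_j) in combinations(sorted(entities.items()), 2):
        best = max((dice(a, b) for a in mentions_i for b in mentions_j), default=0.0)
        if best >= threshold:
            pairs.append((i, j))
    return pairs

# Hypothetical within-document entities.
entities = {
    "e1": ["William Wallace"],
    "e2": ["Wallace"],
    "e3": ["Abu Abbas"],
}
pairs = candidate_pairs(entities)
```

Only the surviving pairs are passed on to feature generation and the SVM, which keeps the later stages well below quadratic cost in practice.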
42Wikitology tagging
- Using SERIF's output, we produced an entity document for each entity
- Included the entity's name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions
- We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity
- We used the vectors to compute features measuring entity pair similarity/dissimilarity
43Entity Document Tags
<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER
Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NOM: "Mr . " "friend" "income"
PRO: "he" "him" "his"
, . abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
</TEXT>
</DOC>
Wikitology article tag vector
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222
Wikitology category tag vector
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167
44Wikitology derived features
- Seven features measured entity similarity using cosine similarity of various length article or category vectors
- Five features measured entity dissimilarity
- two PER entities match different Wikitology persons
- two entities match Wikitology tags in a disambiguation set
- two ORG entities match different Wikitology organizations
- two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2)
- two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2)
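Two of these features can be sketched on sparse tag vectors. The cosine function is standard; `different_top_match` is one assumed reading of the weighted dissimilarity feature (the entities' top-scoring tags differ, weighted by 1-abs(score1-score2)) and it skips the PER/ORG type check. The vectors are hypothetical, with weights loosely modelled on the tag vector format shown earlier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: tag -> weight)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def different_top_match(tags_a, tags_b):
    """Assumed dissimilarity feature: nonzero only when the top tags differ."""
    top_a = max(tags_a, key=tags_a.get)
    top_b = max(tags_b, key=tags_b.get)
    if top_a == top_b:
        return 0.0
    return 1 - abs(tags_a[top_a] - tags_b[top_b])

# Hypothetical article tag vectors for three entity documents.
entity_a = {"Webster_Hubbell": 1.0, "Whitewater_controversy": 0.222}
entity_b = {"Webster_Hubbell": 0.9, "United_States_v._Hubbell": 0.377}
entity_c = {"Hubbell_Center": 0.6}

similarity_ab = cosine(entity_a, entity_b)
dissimilarity_ab = different_top_match(entity_a, entity_b)
dissimilarity_ac = different_top_match(entity_a, entity_c)
```

Under this reading, two entities that both match their different top tags with similar confidence produce a weight close to 1, which is strong evidence that they are not coreferent.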
45COE Features
- Character-level features
- Exact Match of NAM mentions
- Longest mention exact match
- Some mention exact match
- Multiple mention exact match
- All mention exact match
- Partial Match
- Dice score, character bigrams
- Dice score, longest mention character bigrams
- Last word of longest string match
- Matching nominals and pronominals
- Exact match
- Multiple exact match
- All match
- Dice score of mention strings
- Document-level features
- Words
- Dice score, words in document
- Dice score, words around mentions
- Cosine score, words in document
- Cosine score, words around mentions
- Entities
- Dice score, entities in document
- Dice score, entities around mentions
- Metadata features
- Speech/text
- News/non-news
- Same document
- Social context features
- Heuristic
- Probabilistic
46More COE Features
- KB features - ontology
- Wikitology
- Top Wikitology category matches
- Top Wikitology article matches
- Different top Wikitology person
- Different top Wikitology organization
- Top Wikitology categories in disambiguation set
- Reuters topics
- Cosine score, words in document
- Cosine score, words around mentions
- Thesaurus concepts
- Cosine score, words in document
- Cosine score, words around mentions
- KB features - instances
- Known alias
- Also derived aliases from test collection
- BBN name match
- Famous singleton
- KB features - semantic match
- Entity type match
- Sex match
- Number match
- Occupation match
- Fuzzy occupation match
- Nationality match
- Spouse match
- Parent match
- Sibling match
47Clustering
- Approach
- Assign score to each entity pair (SVM or heuristic)
- Eliminate pairs whose score does not exceed threshold (0.95 for SVM runs)
- Identify connected components in resulting graph
- Large clusters
- AP (good)
- Clinton (bad: conflates William and Hillary)
- Sources of large clusters varied
- Connected components clustering
- SERIF errors
- Insufficient features to distinguish separate entities
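The clustering approach (score, threshold, connected components) can be sketched with a small union-find. The pair scores below are invented; only the 0.95 threshold comes from the slide, and the function name is hypothetical.

```python
def cluster_entities(pair_scores, threshold=0.95):
    """Group entities into connected components over pairs scoring above threshold.

    pair_scores maps (entity_a, entity_b) to a coreference score.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        root_a, root_b = find(a), find(b)  # registers both entities
        if score > threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical pair scores.
scores = {
    ("Bill_Clinton_1", "Bill_Clinton_2"): 0.99,
    ("Bill_Clinton_2", "W_Clinton"): 0.97,
    ("Bill_Clinton_1", "Hillary_Clinton"): 0.40,
}
clusters = cluster_entities(scores)
```

Because components are transitive, one spurious high-scoring pair is enough to merge two otherwise separate clusters, which is consistent with connected components clustering being listed as a source of large clusters.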
48Features with High F1 scores
- Recall that F1 = 2PR/(P+R)
- Variants of exact name match, in general, especially a name mention in one entity exactly matches one in the other (83.1)
- Cosine similarity of the vectors of top Wikitology article matches (75.1)
- Top Wikitology article for the two entities matched (38.1)
- An entity contained a mention that was a known alias of a mention found in the other (47.5)
49Feature Ablation
- A post hoc feature ablation evaluation showed contribution of KB features
50High Precision Features
- High precision/low recall features are useful when applicable
- Features with precision > 95% include
- A name mentioned by each entity matches exactly one person in Wikipedia
- The entities have the same parent
- The entities have the same spouse
- All name mentions have an exact match across the two entities
- Longest named mention has exact match
51Knowledge Base Population
- The 2009 NIST Text Analysis Conference (TAC) will include a new Knowledge Base Population track
- Goal: discover information about named entities (people, organizations, places) and incorporate it into a KB
- TAC KBP has two related tasks
- Entity linking: doc. entity mention -> KB entity
- Slot filling: given a document entity mention, find missing slot values in large corpus
52KBs and IE are Symbiotic
KB info helps interpret text
Knowledge Base
Information Extraction from Text
IE helps populate KBs
53Planned Extensions
- Make greater use of data from Linked Open Data (LOD) resources: DBpedia, Geonames, Freebase
- Replace ad hoc processing of RDF data in Lucene with a triple store
- Add additional graphs (e.g., derived from infobox links) and develop algorithms to exploit them
- Develop better hybrid query creation tools
54Wikitology 3.0 (2009)
[Diagram labels: articles (IR collection), category links graph, infobox graph, page link graph, Wikitology code, application-specific algorithms, RDF reasoner, relational database, triple store, linked Semantic Web data and ontologies]
55Challenges
- Wikitology tagging is expensive
- 3 seconds/document
- ACE English: 150K entities (24 hr on Bluegrit)
- A spreading activation algorithm on the underlying graphs improves accuracy at even more cost
- Exploit the RDF metadata and data and the underlying graphs
- requires reasoning and graph processing
- Extract entities from Wiki text to find more relations
- More graph processing
56Wikipedia's social network
- Wikipedia has an implicit social network that can help disambiguate PER mentions
- Resolving PER mentions in a short document to KB people who are linked in the KB is good
- The same can be done for the network of ORG and GPE entities
57WSN Data
- We extracted 213K people from DBpedia's Infobox dataset, 30K of which participate in an infobox link to another person
- We extracted 875K people from Freebase, 616K of which were linked to Wikipedia pages, 431K of which are in one of 4.8M person-person article links
- Consider a document that mentions two people: George Bush and Mr. Quayle
58Which Bush and which Quayle?
Six George Bushes
Nine Male Quayles
59A simple closeness metric
- Let S_i = the two-hop neighbors of entity i
- C_ij = |intersection(S_i, S_j)| / |union(S_i, S_j)|
- C_ij > 0 for six of the 56 possible pairs
- 0.43 George_H._W._Bush -- Dan_Quayle
- 0.24 George_W._Bush -- Dan_Quayle
- 0.18 George_Bush_(biblical_scholar) -- Dan_Quayle
- 0.02 George_Bush_(biblical_scholar) -- James_C._Quayle
- 0.02 George_H._W._Bush -- Anthony_Quayle
- 0.01 George_H._W._Bush -- James_C._Quayle
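The closeness metric is a Jaccard coefficient over two-hop neighborhoods and can be sketched as follows. Whether a node counts as its own neighbor is not stated on the slide; this version assumes it does not. The graph is a toy undirected adjacency list with placeholder node names, not the extracted person network.

```python
def two_hop_neighbours(graph, node):
    """Nodes reachable from node in one or two hops, excluding node itself."""
    one_hop = set(graph.get(node, ()))
    reached = set(one_hop)
    for nb in one_hop:
        reached.update(graph.get(nb, ()))
    reached.discard(node)  # assumption: a node is not its own neighbour
    return reached

def closeness(graph, a, b):
    """Jaccard overlap of the two-hop neighbourhoods of a and b."""
    s_a = two_hop_neighbours(graph, a)
    s_b = two_hop_neighbours(graph, b)
    union = s_a | s_b
    return len(s_a & s_b) / len(union) if union else 0.0

# Toy undirected link graph with placeholder node names.
graph = {
    "A": ["X", "Y"],
    "B": ["Y", "Z"],
    "X": ["A"],
    "Y": ["A", "B"],
    "Z": ["B"],
}
c_ab = closeness(graph, "A", "B")
```

To disambiguate a document's mentions, one would compute this score for every candidate pair and prefer the pair with the highest value, as in the ranked list above.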
60Application to TAC KBP
- Using entity network data extracted from DBpedia and Wikipedia provides evidence to support KBP tasks
- Mapping document mentions into infobox entities
- Mapping potential slot fillers into infobox entities
- Evaluating the coherence of entities as potential slot fillers
61Next Steps
- Construct a Web-based API and demo system to facilitate experimentation
- Process Wikitology updates in real-time
- Exploit machine learning to classify pages and improve performance
- Better use of cluster using Hadoop, etc.
- Exploit cell technology for spreading activation and other graph-based algorithms
- e.g., recognize people by the graph of relations they are part of
62DBpedia ontology
- DBpedia 3.2 (Nov 2008) added a manually constructed ontology with
- 170 classes in a subsumption hierarchy
- 880K instances
- 940 properties with domain and range
- A partial, manual mapping was constructed from infobox attributes to these terms
- Current domain and range constraints are loose
- Namespace: http://dbpedia.org/ontology/
Place 248,000
Person 214,000
Work 193,000
Species 90,000
Org. 76,000
Building 23,000
63Person
56 properties
64Organisation
50 properties
65Place
110 properties
66Exploiting Linked Data
67Conclusion
- Our initial applications show that the Wikitology idea has merit
- Wikipedia is increasingly being used as a knowledge source of choice
- Easily extendable to other wikis and collaborative KBs, e.g., Intellipedia
- Serious use may require exploiting cluster machines and cell processing
- We need to move beyond Wikipedia to exploit the LOD cloud