Title: Wikification
1Wikification
- CSE 6339 (Section 002)
- Abhijit Tendulkar
2- Wikify! Linking Documents to Encyclopedic Knowledge. R. Mihalcea and A. Csomai
- Learning to Link with Wikipedia. D. Milne and I. H. Witten
3What is Wikification
- Automatic keyword extraction
- Word sense disambiguation
- Automatic cross-referencing of documents (unstructured text) with Wikipedia.
4Wikify! - Introduction
- Introduces annotation of documents by linking them to Wikipedia.
- Applications include the semantic web, educational applications, and a number of text processing problems.
- Previous similar work: Microsoft Smart Tags and Google AutoLink, which are based merely on word or phrase lookup (no keyword extraction or disambiguation).
5Wikify! - Text Wikification
6Wikify! - Keyword Extraction
- Recommendations from the Wikipedia style manual: link terms that provide a deeper understanding of the topic, avoid linking unrelated terms, and select a proper number of keywords.
- Unsupervised algorithms involve two steps:
- Candidate extraction: extract all possible n-grams.
- Keyword ranking: assign a numeric value to each candidate. Three methods were used: tf-idf, χ², and keyphraseness.
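The two steps above can be sketched in a few lines. This is only an illustration, assuming keyphraseness is estimated as the fraction of documents containing a term in which that term is already used as a link; the function names and the two count dictionaries (`docs_with_link`, `docs_with_term`) are hypothetical and are not taken from the Wikify! system.

```python
def extract_ngrams(tokens, max_n=3):
    """Candidate extraction: every contiguous n-gram for n = 1..max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def keyphraseness(term, docs_with_link, docs_with_term):
    """Share of the documents containing `term` in which it is a link."""
    appears = docs_with_term.get(term, 0)
    return docs_with_link.get(term, 0) / appears if appears else 0.0


def rank_candidates(tokens, docs_with_link, docs_with_term, max_n=3):
    """Keyword ranking: score each candidate, highest score first."""
    candidates = set(extract_ngrams(tokens, max_n))
    scored = [(keyphraseness(c, docs_with_link, docs_with_term), c)
              for c in candidates]
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Candidates that never appear in the count dictionaries receive a score of 0.0, so unseen n-grams sink to the bottom of the ranking.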
7Wikify! - Evaluation of Keyword Extraction
8Wikify! - Word Sense Disambiguation
- Ambiguity is inherent to human language.
- Disambiguation algorithms:
- Knowledge-based: rely exclusively on knowledge derived from dictionaries.
- Data-driven: based on probabilities collected from sense-annotated data.
- Here a voting scheme is used, which seeks agreement between both.
- Wikify! provides highly precise annotation even if recall is lower.
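A voting scheme of this kind can be sketched as follows. It is a simplified illustration, assuming each disambiguator returns either a sense label or nothing; the function and parameter names are hypothetical. Abstaining on disagreement is what trades recall for precision.

```python
from typing import Optional


def vote(knowledge_based_sense: Optional[str],
         data_driven_sense: Optional[str]) -> Optional[str]:
    """Return a sense only when both disambiguators propose the same one."""
    if (knowledge_based_sense is not None
            and knowledge_based_sense == data_driven_sense):
        return knowledge_based_sense
    # Disagreement or a missing answer: abstain, so no annotation is made.
    return None
```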
9Wikify! - Disambiguation Evaluation
- Word sense disambiguation results: total number of attempted (A) and correct (C) word senses, together with precision (P), recall (R), and F-measure (F) evaluations.
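For reference, the three measures can be derived from the counts as sketched below. The sketch assumes the usual definitions (P = C / A, R = C / total, F as the harmonic mean of P and R); `total`, the number of ambiguous words in the gold standard, is a name introduced here and does not appear on the slide.

```python
def evaluate(attempted: int, correct: int, total: int):
    """Return (precision, recall, f_measure) from raw counts."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

A system that attempts few words but gets most of them right scores high on P and low on R, which matches the high-precision, lower-recall behaviour described for Wikify!.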
10Wikify! - Overall Evaluation and Conclusion
- Wikify! lets the user upload a text file or supply the URL of a webpage, processes the document provided by the user, and finally returns the wikified version of the document.
- The user also has the option of specifying the density of keywords in the range 2-10%, the default being 6%.
- When it was evaluated by human evaluators (20 users evaluating 10 documents each), only 57% of the cases were identified accurately (50% would be the ideal case, i.e. automatic annotations indistinguishable from manual ones).
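One way the density option could act on a ranked keyword list is sketched below. It assumes the density is a percentage of the words in the document; the function name and the rounding rule are illustrative choices, not details of the actual system.

```python
def select_by_density(ranked_keywords, word_count, density_percent=6.0):
    """Keep the top-ranked keywords so that links cover about
    density_percent of the document's words."""
    if not 2.0 <= density_percent <= 10.0:
        raise ValueError("density must lie in the range 2-10")
    keep = round(word_count * density_percent / 100)
    return ranked_keywords[:keep]
```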
11Learning to Link with Wikipedia
- Machine learning approach to identify significant terms within unstructured text.
- It can provide structured knowledge about any unstructured text.
- Uses Wikipedia articles as training data, which improves recall and precision.
12Snapshot of Wikified document
13Learning to Disambiguate Links
- Uses disambiguation to inform detection.
- Features such as commonness and relatedness of the term are used as measures to resolve ambiguity.
- The commonness of a sense is defined by the number of times it is used as a destination by Wikipedia articles.
- Commonness = (No. of times the term links to the sense) / (No. of times the term is used as a link)
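A minimal sketch of commonness as a per-sense share, following the definition of the commonness of a sense as how often it is the destination of a term's links. The input, a list with one destination article per occurrence of the term as a link, is a hypothetical data layout chosen for the example.

```python
from collections import Counter


def commonness(link_destinations):
    """Map each sense to the share of the term's links that point to it."""
    counts = Counter(link_destinations)
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}
```

For instance, if a term were linked four times, three of them to one article, that article would get a commonness of 0.75 for the term.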
14Disambiguation (Continued)
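- Relatedness is given by the following formula:
$$\text{relatedness}(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))}$$
where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.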
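The measure can be computed directly from the sets A and B and the size of W defined above. The sketch below assumes the log-based form given in Milne and Witten's paper (spelled out in the docstring); the function name, the argument names, and the handling of an empty overlap are choices made for this example. With this form, identical link sets give 0, so smaller values indicate more closely related articles.

```python
import math


def relatedness(links_to_a: set, links_to_b: set, wikipedia_size: int) -> float:
    """(log(max(|A|,|B|)) - log(|A & B|)) / (log(|W|) - log(min(|A|,|B|)))"""
    overlap = len(links_to_a & links_to_b)
    if overlap == 0:
        # log(0) is undefined; treat the pair as maximally distant (assumption).
        return math.inf
    larger = max(len(links_to_a), len(links_to_b))
    smaller = min(len(links_to_a), len(links_to_b))
    return ((math.log(larger) - math.log(overlap))
            / (math.log(wikipedia_size) - math.log(smaller)))
```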
15Disambiguation (Continued)
- Commonness and Relatedness
16Disambiguation (Continued)
- Not all context terms are equally useful, so a weight is assigned to each context term, which is the average of its link probability (how often the term is used as a link when it appears) and its relatedness.
- The above features are combined, and the context quality feature is defined as the sum of the weights previously assigned to each context term.
- These features are used to train the classifier.
- To configure the classifier, a parameter specifying the minimum probability of a sense is used.
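The weighting and the context quality feature described above amount to an average followed by a sum. A small sketch, assuming each context term is represented by a hypothetical (link probability, relatedness) pair with both values in [0, 1]:

```python
def context_weight(link_probability: float, relatedness: float) -> float:
    """Weight of one context term: mean of its two scores."""
    return (link_probability + relatedness) / 2


def context_quality(context_terms) -> float:
    """Sum of the weights of all context terms.

    context_terms: iterable of (link_probability, relatedness) pairs.
    """
    return sum(context_weight(lp, rel) for lp, rel in context_terms)
```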
17Disambiguation Evaluation
- The disambiguation classifier was trained on 500 articles (instead of the entire Wikipedia) on a modest desktop with a 3 GHz dual-core processor and 4 GB of RAM.
- The classifier was configured using 100 Wikipedia articles.
- It was trained in 13 minutes and tested in 4 minutes; another 3 minutes were required to load the required summaries of Wikipedia's link structure and anchor statistics into memory.
- To evaluate the classifier, 11,000 anchors were gathered from 100 random articles.
18Disambiguation Evaluation
19Learning to Detect Links
- The central difference between Wikify's link detection approach and this new link detector: Wikify relies exclusively on link probability, whereas the new approach also takes the context surrounding the terms into consideration.
- This link detector discards only terms with very low link probability, so that nonsense phrases and stop words are removed.
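The discarding step can be sketched as a simple filter. The sketch assumes link probability is the number of times a term is used as a link divided by the number of times it is mentioned; the function names, the `stats` layout, and the threshold value in the usage note are hypothetical.

```python
def link_probability(times_linked: int, times_mentioned: int) -> float:
    """How often a term is a link, out of all its mentions."""
    return times_linked / times_mentioned if times_mentioned else 0.0


def discard_unlikely(stats, min_link_probability):
    """Keep only terms whose link probability reaches the minimum.

    stats: dict mapping term -> (times_linked, times_mentioned).
    Returns a dict mapping each surviving term to its link probability.
    """
    kept = {}
    for term, (linked, mentioned) in stats.items():
        probability = link_probability(linked, mentioned)
        if probability >= min_link_probability:
            kept[term] = probability
    return kept
```

With a low threshold such as 0.01, a stop word that is almost never linked is dropped, while the surviving terms go on to be judged by the context-aware features.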
20Features used for Link Detection
- Link probability: the average link probability is considered.
- Relatedness: semantic relatedness, i.e. the average relatedness between each topic and all other candidates.
- Disambiguation confidence
- Generality
- Location and spread
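The slide only names the last feature. As an illustration, the sketch below assumes that location means the first and last occurrence of a topic, normalized by document length, and that spread is the distance between them; the function name and the token-position input are hypothetical.

```python
def location_and_spread(positions, doc_length):
    """Return (first, last, spread) for a topic's mentions.

    positions: token offsets at which the topic is mentioned.
    doc_length: number of tokens in the document.
    """
    if not positions or doc_length <= 0:
        raise ValueError("need at least one position and a positive length")
    first = min(positions) / doc_length
    last = max(positions) / doc_length
    return first, last, last - first
```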
21Link Detector
22Link Detector Performance
- The same dataset as for the disambiguation classifier was used for training, configuration, and evaluation.
- A link probability threshold of 6.5% was set, since recall and precision balance at that point.
- The link detector was trained on unambiguous terms.
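Choosing the threshold at which recall and precision balance can be sketched as a search over measured points. The function name and the (threshold, precision, recall) row layout are assumptions for this example, and the rows in the usage note are made-up values, not results from the paper.

```python
def balanced_threshold(curve):
    """Return the threshold whose precision and recall are closest.

    curve: list of (threshold, precision, recall) rows.
    """
    if not curve:
        raise ValueError("curve must contain at least one row")
    best_row = min(curve, key=lambda row: abs(row[1] - row[2]))
    return best_row[0]
```

For hypothetical rows `[(0.02, 0.40, 0.80), (0.05, 0.55, 0.60), (0.10, 0.70, 0.35)]`, the function picks 0.05, the row where the gap between the two measures is smallest.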
23Link Detector Performance (Continued)
24Wikification in the Wild
- This system was tested on news articles instead of Wikipedia articles, and it achieved 76.4% accuracy in link detection.
25Conclusions
- This system resolves ambiguity as well as polysemy.
- A common hurdle in all such applications: they must somehow move from unstructured text to a collection of relevant Wikipedia articles.
- This paper has contributed a proven method for extracting key concepts from plain text.
- Finally, these are attempts to explain and organize the sum total of human knowledge.
26Application on itself
27Questions