Wikification - PowerPoint PPT Presentation

Provided by: Nikol90
Learn more at: https://crystal.uta.edu

1
Wikification
  • CSE 6339 (Section 002)
  • Abhijit Tendulkar

2
  • Wikify! Linking Documents to Encyclopedic
    Knowledge. R. Mihalcea and A. Csomai
  • Learning to Link with Wikipedia. D. Milne and I.
    H. Witten

3
What is Wikification
  • Automatic keyword extraction
  • Word sense disambiguation
  • Automatic cross-referencing of documents
    (unstructured text) with Wikipedia.

4
Wikify! - Introduction
  • Introduces annotation of documents by linking
    them to Wikipedia.
  • Possible applications: the semantic web,
    educational applications, and a number of text
    processing problems.
  • Previous similar work: Microsoft Smart Tags and
    Google AutoLink, which were based merely on word
    or phrase lookup (no keyword extraction or
    disambiguation).

5
Wikify! - Text Wikification
6
Wikify! - Keyword Extraction
  • Recommendations from the Wikipedia style manual:
    link terms that provide a deeper understanding of
    the topic, avoid linking unrelated terms, and
    select a proper number of keywords.
  • Unsupervised algorithms involve two steps:
  • Candidate extraction: extract all possible
    n-grams.
  • Keyword ranking: assign a numeric value to each
    candidate. Three methods were used: tf-idf, χ²,
    and keyphraseness.
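
The two steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes keyphraseness is the fraction of documents containing a candidate in which that candidate is a link, and the function names and the count dictionaries (`link_docs`, `total_docs`) are hypothetical.

```python
def ngrams(tokens, max_n=3):
    """Candidate extraction: yield every n-gram of length 1..max_n as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def keyphraseness(candidate, link_docs, total_docs):
    """Assumed definition: docs where the candidate is a link / docs where it appears."""
    total = total_docs.get(candidate, 0)
    return link_docs.get(candidate, 0) / total if total else 0.0


def rank_keywords(tokens, link_docs, total_docs, max_n=3):
    """Keyword ranking: score every candidate and sort by descending score."""
    candidates = set(ngrams(tokens, max_n))
    scored = {c: keyphraseness(c, link_docs, total_docs) for c in candidates}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical counts, for illustration only.
tokens = "the semantic web".split()
link_docs = {("semantic", "web"): 80, ("the",): 1}
total_docs = {("semantic", "web"): 100, ("the",): 10000}
ranked = rank_keywords(tokens, link_docs, total_docs)
```

With these made-up counts, the bigram "semantic web" ranks first, while the stop word "the" scores close to zero.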

7
Wikify! - Evaluation of Keyword Extraction
8
Wikify! - Word Sense Disambiguation
  • Ambiguity is inherent to human language.
  • Disambiguation algorithms:
  • Knowledge-based: rely exclusively on knowledge
    derived from dictionaries.
  • Data-driven: based on probabilities collected
    from sense-annotated data.
  • Here a voting scheme is used, which seeks
    agreement between the two.
  • Wikify! provides highly precise annotations, even
    if recall is lower.
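
A voting scheme of this kind might look like the sketch below. It assumes the simplest form of agreement (annotate only when both disambiguators return the same sense, otherwise abstain); the function names and the example senses are hypothetical.

```python
def vote(knowledge_based_sense, data_driven_sense):
    """Return the sense only when both disambiguators agree, else None (abstain)."""
    if knowledge_based_sense is not None and knowledge_based_sense == data_driven_sense:
        return knowledge_based_sense
    return None


def annotate(words, kb_senses, dd_senses):
    """Map each word to its agreed sense, skipping words where the two disagree."""
    annotations = {}
    for word in words:
        sense = vote(kb_senses.get(word), dd_senses.get(word))
        if sense is not None:
            annotations[word] = sense
    return annotations


# Hypothetical outputs of the two disambiguators.
kb = {"bar": "Bar (law)", "plant": "Plant"}
dd = {"bar": "Bar (law)", "plant": "Factory"}
result = annotate(["bar", "plant"], kb, dd)
```

Abstaining on disagreement is what trades recall for precision: "plant" is left unannotated here because the two methods disagree.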

9
Wikify! - Disambiguation Evaluation
  • Word sense disambiguation results: total number
    of attempted (A) and correct (C) word senses,
    together with precision (P), recall (R) and
    F-measure (F) evaluations.
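
For reference, the three measures can be derived from A and C as in this sketch. It assumes the usual definitions (P = C/A, R = C/total, F as the harmonic mean of P and R); the results table is not in this transcript, so the numbers below are hypothetical.

```python
def precision_recall_f(attempted, correct, total):
    """Compute (P, R, F) from attempted, correct and total word-sense counts."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 100 ambiguous words, 80 attempted, 72 correct.
p, r, f = precision_recall_f(attempted=80, correct=72, total=100)
```

A system that abstains often has A well below the total, so precision can stay high while recall drops.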

10
Wikify! - Overall Evaluation and Conclusion
  • Wikify! lets the user upload a text file or
    supply the URL of a web page, processes the
    document, and returns its wikified version.
  • The user can also set the density of keywords,
    in the range 2-10% (default 6%).
  • In a test with human evaluators (20 users
    evaluating 10 documents each), the human-annotated
    version was correctly identified in only 57% of
    the cases (50% would be the ideal case, meaning
    the automatic links are indistinguishable from
    manual ones).

11
Learning to Link with Wikipedia
  • Machine learning approach to identify significant
    terms within unstructured text.
  • It can provide structured knowledge about any
    unstructured text.
  • Uses Wikipedia articles as training data, which
    improves recall and precision.

12
Snapshot of Wikified document
13
Learning to Disambiguate Links
  • Uses disambiguation to inform detection.
  • Features such as commonness and relatedness of
    the term are used to resolve ambiguity.
  • The commonness of a sense is defined by the
    number of times it is used by Wikipedia articles
    as a link destination.
  • Link probability = (no. of times the term is
    used as a link) / (no. of times the term appears
    in Wikipedia articles)
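
The two counts on this slide can be sketched as below. Commonness is normalised here to a fraction of the term's links, which is an assumption (the slide only speaks of a count). The ratio of linked occurrences to all occurrences is what Milne and Witten's paper calls link probability. All names and counts are hypothetical.

```python
def commonness(sense_counts, sense):
    """Share of a term's links whose destination is the given sense."""
    total = sum(sense_counts.values())
    return sense_counts.get(sense, 0) / total if total else 0.0


def link_probability(times_linked, times_appearing):
    """Linked occurrences of a term divided by all of its occurrences."""
    return times_linked / times_appearing if times_appearing else 0.0


# Hypothetical destination counts for the anchor text "tree".
tree_senses = {"Tree": 93, "Tree (data structure)": 3, "Tree (graph theory)": 4}
c = commonness(tree_senses, "Tree")
lp = link_probability(times_linked=100, times_appearing=2500)
```

Commonness alone would always pick the most frequent sense, which is why it is paired with relatedness to the surrounding context.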

14
Disambiguation (Continued)
  • Relatedness is given by the following formula:

Where a and b are the two articles of interest, A
and B are the sets of all articles that link to a
and b respectively, and W is the set of all
articles in Wikipedia.
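
The formula was an image on the slide and is not part of this transcript. Assuming it is the link-based measure printed in Milne and Witten's paper, it would read as follows in terms of the symbols defined above:

```latex
\[
\mathrm{relatedness}(a, b) =
  \frac{\log\bigl(\max(|A|, |B|)\bigr) - \log\bigl(|A \cap B|\bigr)}
       {\log\bigl(|W|\bigr) - \log\bigl(\min(|A|, |B|)\bigr)}
\]
```

Read literally, this expression is 0 when A and B are identical and grows as the overlap shrinks, so it behaves like a distance.
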
15
Disambiguation (Continued)
  • Commonness and Relatedness
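
A small sketch of computing the link-based measure from sets of incoming links is given below. It assumes the expression (log max(|A|,|B|) − log |A∩B|) / (log |W| − log min(|A|,|B|)) from Milne and Witten's paper. The function name, and the choice to return `None` when the measure is undefined, are illustrative.

```python
import math


def link_measure(links_to_a, links_to_b, total_articles):
    """Evaluate the link-overlap expression; None when it is undefined."""
    overlap = len(links_to_a & links_to_b)
    if overlap == 0:
        return None  # log(0) is undefined: the articles share no incoming links
    larger = max(len(links_to_a), len(links_to_b))
    smaller = min(len(links_to_a), len(links_to_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator == 0:
        return None  # every article links to both a and b
    return (math.log(larger) - math.log(overlap)) / denominator


# Hypothetical article ids standing in for the sets A and B, with |W| = 16.
value = link_measure({1, 2, 3, 4}, {3, 4}, total_articles=16)
```

Read literally, lower values mean more shared links. A similarity could be obtained as one minus this value clipped to [0, 1], but that conversion is an assumption, not something stated on the slides.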

16
Disambiguation (Continued)
  • Not all context terms are equally useful, so
    each context term is assigned a weight: the
    average of its link probability and its
    relatedness.
  • A further feature, context quality, is defined
    as the sum of the weights assigned to the context
    terms.
  • These features are used to train the classifier.
  • To configure the classifier, a parameter
    specifying the minimum probability of a sense is
    used.
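
The weighting described above can be sketched as follows. The helper names are hypothetical. Treating the minimum-probability parameter as a cut-off that drops unlikely senses before classification is an assumption about how it is applied.

```python
def context_weight(link_probability, relatedness):
    """Weight of one context term: average of its link probability and relatedness."""
    return (link_probability + relatedness) / 2.0


def context_quality(context_terms):
    """Sum of the weights of all context terms, given (link_probability, relatedness) pairs."""
    return sum(context_weight(lp, rel) for lp, rel in context_terms)


def plausible_senses(sense_probabilities, min_probability):
    """Keep only senses whose probability reaches the configured minimum (assumed usage)."""
    return {s: p for s, p in sense_probabilities.items() if p >= min_probability}


# Hypothetical values for two context terms and three candidate senses.
quality = context_quality([(0.8, 0.6), (0.2, 0.4)])
kept = plausible_senses({"Tree": 0.93, "Tree (graph theory)": 0.04,
                         "Tree (film)": 0.001}, min_probability=0.02)
```

A high context quality suggests the surrounding terms are informative, so relatedness can be trusted more. With a low value the classifier has reason to lean on commonness.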

17
Disambiguation Evaluation
  • The disambiguation classifier was trained on 500
    articles (instead of the entire Wikipedia) on a
    modest desktop with a 3 GHz dual-core processor
    and 4 GB of RAM.
  • The classifier was configured using 100 Wikipedia
    articles.
  • It was trained in 13 minutes and tested in 4
    minutes; another 3 minutes were required to load
    the required summaries of Wikipedia's link
    structure and anchor statistics into memory.
  • To evaluate the classifier, 11,000 anchors were
    gathered from 100 random articles.

18
Disambiguation Evaluation
19
Learning to Detect Links
  • Central difference between Wikify's link
    detection approach and this new link detector:
    Wikify relies exclusively on link probability,
    whereas the new approach also takes the context
    surrounding the terms into consideration.
  • This link detector discards only terms with a
    very low link probability, so that nonsense
    phrases and stop words are removed.
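
The pruning step in the last bullet could be sketched as below. The function name, the threshold value and the probabilities are hypothetical; the slide only says the cut-off is very low.

```python
def prune_candidates(link_probabilities, min_link_probability):
    """Drop candidate terms whose link probability falls below the cut-off."""
    return {term: p for term, p in link_probabilities.items()
            if p >= min_link_probability}


# Hypothetical link probabilities for three candidate phrases.
candidates = {"the": 0.0001, "of the": 0.0, "semantic web": 0.45}
survivors = prune_candidates(candidates, min_link_probability=0.01)
```

Everything that survives this filter is passed on to the context-aware detector, rather than being linked outright as in a pure threshold approach.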

20
Features used for Link Detection
  • Link probability: the average link probability
    is considered.
  • Relatedness: semantic relatedness, taken as the
    average relatedness between each topic and all
    other candidates.
  • Disambiguation Confidence
  • Generality
  • Location and Spread
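
One way to assemble these features for a single candidate topic is sketched below. The record layout and helper are hypothetical. Location and spread are assumed to be the first and last occurrence positions, and the distance between them, normalised by document length. Generality is passed in as a plain number because the slide does not say how it is measured.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TopicFeatures:
    avg_link_probability: float
    avg_relatedness: float
    disambiguation_confidence: float
    generality: float
    first_occurrence: float
    last_occurrence: float
    spread: float


def build_features(link_probs, relatedness_to_others, confidence,
                   generality, positions, doc_length):
    """Combine per-occurrence statistics into one feature record for a topic."""
    first = min(positions) / doc_length
    last = max(positions) / doc_length
    return TopicFeatures(
        avg_link_probability=mean(link_probs),
        avg_relatedness=mean(relatedness_to_others),
        disambiguation_confidence=confidence,
        generality=generality,
        first_occurrence=first,
        last_occurrence=last,
        spread=last - first,
    )


# Hypothetical values for a topic mentioned at word offsets 10, 50 and 90.
features = build_features([0.2, 0.4], [0.5, 0.7], 0.9, 3, [10, 50, 90], 100)
```

A record like this would be the input row for the link detection classifier, one row per candidate topic.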

21
Link Detector
22
Link Detector Performance
  • The same dataset as for the disambiguation
    classifier was used for training, configuration
    and evaluation.
  • The minimum link probability was set to 6.5%,
    since recall and precision balance at that point.
  • The link detector was trained on unambiguous
    terms.

23
Link Detector Performance (Continued)
24
Wikification in the Wild
  • This system was tested on news articles instead
    of Wikipedia articles and achieved 76.4% accuracy
    in link detection.

25
Conclusions
  • This system resolves ambiguity as well as
    polysemy.
  • A common hurdle in all such applications: they
    must somehow move from unstructured text to a
    collection of relevant Wikipedia articles.
  • This paper has contributed a proven method for
    extracting key concepts from plain text.
  • Finally, these are attempts to explain and
    organize the sum total of human knowledge.

26
Application on itself
27
Questions
  • ?