1
Using Encyclopedic Knowledge for Named Entity
Disambiguation
Razvan Bunescu
Marius Pasca
Machine Learning Group, Department of Computer
Sciences, University of Texas at Austin
Google Inc., 1600 Amphitheatre Parkway, Mountain
View, CA
razvan@cs.utexas.edu
mars@google.com
2
Introduction: Disambiguation
  • Some names denote multiple entities:
  • John Williams and the Boston Pops conducted a
    summer Star Wars concert at Tanglewood.
  • John Williams → John Williams (composer)
  • John Williams lost a Taipei death match against
    his brother, Axl Rotten.
  • John Williams → John Williams (wrestler)
  • John Williams won a Victoria Cross for his
    actions at the Battle of Rorke's Drift.
  • John Williams → John Williams (VC)

3
Introduction: Normalization
  • Some entities have multiple names:
  • John Williams (composer) → John Williams
  • John Williams (composer) → John Towner Williams
  • John Williams (wrestler) → John Williams
  • John Williams (wrestler) → Ian Rotten
  • Venus (planet) → Venus
  • Venus (planet) → Morning Star
  • Venus (planet) → Evening Star

4
Introduction: Motivation
  • Web searches
  • Queries about Named Entities (NEs) constitute a
    significant portion of popular web queries.
  • Ideally, search results are clustered such that:
  • in each cluster, the queried name denotes the
    same entity;
  • each cluster is enriched by querying the web with
    alternative names of the corresponding entity.
  • Web-based Information Extraction (IE)
  • Aggregating extractions from multiple web pages
    can lead to improved accuracy in IE tasks (e.g.
    extracting relationships between NEs).
  • Named entity disambiguation is essential for
    performing a meaningful aggregation.

5
Introduction: Approach
  • Build a dictionary D of named entities:
  • Use information from a large-coverage
    encyclopedia: Wikipedia.
  • Each name d ∈ D is mapped to d.E, the set of
    entities that d can refer to in Wikipedia.
  • Design a method that takes as input a proper name
    in its document context, and can be trained to:
  • Detect when a proper name refers to an entity
    from D (detection).
  • Find the named entity referred to in that context
    (disambiguation).

6
Introduction: Example

Dictionary
John Williams        → John Williams (composer), John Williams (VC),
                       John Williams (wrestler), John Williams (other)
John Towner Williams → John Williams (composer)
Ian Rotten           → John Williams (wrestler)

Document
... this past weekend. John Williams and the
Boston Pops conducted a summer Star Wars concert
at Tanglewood ...
7
Outline
  • Introduction
  • Wikipedia Structures
  • Named Entity Dictionary
  • Disambiguation Dataset
  • Disambiguation & Detection
  • Experimental Evaluation
  • Future Work
  • Conclusions

8
Wikipedia: A Wiki Encyclopedia
  • Wikipedia is a free online encyclopedia written
    collaboratively by volunteers, using wiki
    software.
  • 200 language editions, with varying levels of
    coverage.
  • A very dynamic and quickly growing resource:
  • May 2005: 577,860 articles
  • Sep. 2005: 751,666 articles

9
Wikipedia: Articles & Titles
  • Each article describes a specific entity or
    concept.
  • An article is uniquely identified by its title.
  • Usually, the title is the most common name used
    to denote the entity described in the article.
  • If the title name is ambiguous, it may be
    qualified with an expression between parentheses.
  • Example: John Williams (composer)
  • Notation:
  • E = the set of all named entities from Wikipedia.
  • e ∈ E = an arbitrary named entity.
  • e.title = the title name.
  • e.T = the text of the article.

10
Wikipedia Structures
  • In general, there is a many-to-many relationship
    between names and entities, captured in Wikipedia
    through:
  • Redirect articles.
  • Disambiguation articles.
  • Hyperlinks: an article may contain links to other
    articles in Wikipedia.
  • Categories: each article belongs to at least one
    Wikipedia category.

11
Redirect Articles
  • A redirect article exists for each alternative
    name used to refer to an entity in Wikipedia.
  • Example The article titled John Towner Williams
    consists in a pointer to the article John
    Williams (composer).
  • Notation
  • e.R ? the set of all names that redirect to e.
  • Example
  • e.title ? United States.
  • e.R ? USA, US, Estados Unidos, Untied States,
    Yankee Land, .

12
Disambiguation Articles
  • A disambiguation article lists all Wikipedia
    entities (articles) that may be denoted by an
    ambiguous name.
  • Example The article titled John Williams
    (disambiguation) list 22 entities (articles).
  • Notation
  • e.D ? the set of names whose disambiguation pages
    contain a link to e.
  • Example
  • e.title ? Venus (planet).
  • e.D ? Venus, Morning Star, Evening Star.

13
Named Entity Dictionary
  • Named Entities = entities with a proper name
    title.
  • All Wikipedia titles begin with a capital letter
    ⇒ 3 heuristics for detecting proper name titles
    (sketched below):
  • If e.title is a multiword title, then e is a
    named entity only if all content words are
    capitalized (e.g. The Witches of Eastwick).
  • If e.title is a one-word title that contains at
    least two capital letters, then e is a named
    entity (e.g. NATO).
  • If at least 75% of the title occurrences inside
    the article are capitalized, then e is a named
    entity.
  • Notation:
  • d ∈ D is a proper name entry in the dictionary D
    (≈500K entries).
  • d.E is the set of entities that may be denoted by
    d in Wikipedia:
  • e ∈ d.E ⇔ d = e.name ∨ d ∈ e.R ∨ d ∈ e.D

(e.name = e.title without the expression between
parentheses)
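
A minimal Python sketch of the three heuristics and the dictionary
membership test. The stop-word list, the function names, and the
entity objects with fields name, R, and D are illustrative
assumptions, not the authors' implementation.

    import re

    # Assumed minimal stop-word list, standing in for the "content word" test.
    STOP_WORDS = {"the", "of", "a", "an", "and", "in", "on"}

    def is_proper_name_title(title, article_text):
        """Apply the 3 heuristics; `title` has any parenthetical qualifier removed."""
        words = title.split()
        if len(words) > 1:
            # Heuristic 1: multiword title, all content words capitalized,
            # e.g. "The Witches of Eastwick".
            content = [w for w in words if w.lower() not in STOP_WORDS]
            return all(w[:1].isupper() for w in content)
        if sum(c.isupper() for c in title) >= 2:
            # Heuristic 2: one-word title with at least two capitals, e.g. "NATO".
            return True
        # Heuristic 3: at least 75% of in-article occurrences are capitalized.
        hits = re.findall(re.escape(title), article_text, flags=re.IGNORECASE)
        return bool(hits) and sum(h[:1].isupper() for h in hits) / len(hits) >= 0.75

    def entities_for(d, entities):
        """d.E: e is a candidate for d iff d = e.name, d ∈ e.R, or d ∈ e.D."""
        return {e for e in entities if d == e.name or d in e.R or d in e.D}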
14
Hyperlinks
  • Mentions of entities in Wikipedia articles are
    often linked to their corresponding article,
    using links or piped links (a link parser is
    sketched below).

Wiki source:
The [[Vatican City|Vatican]] is now an enclave
surrounded by [[Rome]].
  ([[Vatican City|Vatican]] is a piped link; [[Rome]] is a link.)
Display string:
The Vatican is now an enclave surrounded by Rome.
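
A small sketch of extracting links and piped links from wiki markup;
the regex and function name are assumptions for illustration.

    import re

    # Matches [[Title]] or [[Title|display text]].
    WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

    def parse_links(wiki_source):
        """Yield (title, display string) pairs from wiki markup."""
        for m in WIKI_LINK.finditer(wiki_source):
            yield m.group(1), m.group(2) or m.group(1)

    src = "The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]]."
    print(list(parse_links(src)))
    # [('Vatican City', 'Vatican'), ('Rome', 'Rome')]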
15
Disambiguation Dataset
  • Hyperlinks in Wikipedia provide disambiguated
    named entity queries q.

The [[Vatican City|Vatican]] is now an enclave
surrounded by [[Rome]].
  (q1: display name Vatican, title Vatican City;
   here display name ≠ title.
   q2: display name Rome, title Rome.)
  • Notation:
  • q.E = the set of entities that are associated in
    the dictionary D with the display name from the
    link.
  • q.e ∈ q.E = the true entity associated with the
    query, given by the title included in the link.
  • q.T = the text contained in a window of 55 words
    [Gooi & Allan, 2004] centered on the link
    (sketched below).
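
A sketch of collecting q.T, assuming the article is already tokenized
and `i` is the position of the linked mention; both names are
assumptions for illustration.

    def query_context(tokens, i, window=55):
        """q.T: the window of `window` words centered on the link at position i."""
        half = window // 2
        return tokens[max(0, i - half): i + half + 1]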

16
Disambiguation Dataset
  • Every entity ek ∈ q.E contributes a disambiguation
    example, labeled 1 if and only if ek = q.e (a
    generator for these examples is sketched below).

q:
... this past weekend. John Williams and the
Boston Pops conducted a summer Star Wars concert
at Tanglewood ...

Label  Query Text (q.T)                          Entity Title (ek.title)
1      Boston Pops conducted concert Star Wars   e1 = John Williams (composer)
0      Boston Pops conducted concert Star Wars   e2 = John Williams (wrestler)
0      Boston Pops conducted concert Star Wars   e3 = John Williams (VC)

1,783,868 queries in total.
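
A sketch of generating these labeled examples, assuming a query record
with display_name, true_title, and context fields and the dictionary
built earlier; the field names are hypothetical.

    def disambiguation_examples(q, dictionary):
        """One example per candidate ek in q.E, labeled 1 iff ek = q.e."""
        return [(int(e.title == q.true_title),  # label
                 q.context,                     # q.T
                 e.title)                       # ek.title
                for e in dictionary[q.display_name]]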
17
Categories
  • Each article in Wikipedia is required to be
    associated with at least one category.
  • Categories form a directed acyclic graph, which
    allows multiple categorization schemes to
    co-exist.
  • 59,759 categories in the Wikipedia taxonomy.
  • Notation:
  • e.C = the set of categories to which e belongs
    (ancestors included).
  • Example:
  • e.title = Venus (planet).
  • e.C = {Venus, Planets of the Solar System,
    Planets, Solar System}.

18
Outline
  • Introduction
  • Wikipedia Structures
  • Named Entity Dictionary
  • Disambiguation Dataset
  • Disambiguation Detection
  • Experimental Evaluation
  • Future Work
  • Conclusions

19
NE Disambiguation: Two Approaches
  • Classification:
  • Train a classifier for each proper name in the
    dictionary D.
  • Not feasible: 500K proper names ⇒ need 500K
    classifiers!
  • Ranking:
  • Design a scoring function score(q, ek) that
    computes the compatibility between the context of
    the proper name occurring in a query q, and any
    of the entities ek ∈ q.E that may be referred to
    by that proper name.
  • For a given named entity query q, select the
    highest-ranking entity:

    ê = argmax score(q, ek) over ek ∈ q.E

20
Context-Article Similarity
  • NE disambiguation as a ranking problem.
  • Use the cosine similarity between the query
    context and the article text, in the tf·idf
    formulation (a baseline sketch follows):

    score(q, ek) = cos(q.T, ek.T)
                 = (q.T · ek.T) / (‖q.T‖ ‖ek.T‖)
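
A sketch of this tf·idf cosine baseline using scikit-learn. Fitting
the vectorizer on just the candidate articles, rather than the whole
collection, is a simplification assumed here for brevity.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_by_cosine(query_text, candidate_articles):
        """Score each candidate article ek.T against q.T; highest cosine wins."""
        vec = TfidfVectorizer()
        X = vec.fit_transform(candidate_articles)   # tf-idf vectors for each ek.T
        q = vec.transform([query_text])             # q.T in the same space
        scores = cosine_similarity(q, X).ravel()
        return scores.argmax(), scores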

21
Word-Category Correlations
  • Problem: in many cases, given a query q, the true
    entity q.e fails to rank first because cue words
    from the query context do not occur in q.e's
    article.
  • The article may be too short or incomplete.
  • Relevant concepts from the query context are
    captured in the article through synonymous words
    or phrases.
  • Approach: use correlations between words in the
    query context (w ∈ q.T) and categories to which
    the named entity belongs (c ∈ e.C).

22
Word-Category Correlations
[Diagram: two category paths.
People by occupation → Musicians → Composers →
Film score composers → John Williams (composer);
People known in connection with sports and hobbies →
Wrestlers → Professional wrestlers → John Williams (wrestler).
Context words such as "conducted" in "John Williams and the
Boston Pops conducted a summer Star Wars concert at Tanglewood."
correlate with the composer's categories.]
23
Ranking Formulation
  • Redefine q.E = the set of named entities from D
    that may be denoted by the display name in the
    query, plus an out-of-Wikipedia entity eout.
  • Use a linear ranking function (a feature sketch
    follows this slide):

    score(q, ek) = w · Φ(q, ek)

    where Φ contains a cosine feature Φcos = cos(q.T, ek.T),
    one indicator feature Φw,c per pair of word w ∈ q.T and
    category c ∈ ek.C, and an indicator feature Φout for eout.
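
A sketch of the feature map Φ and the linear score. The sparse-dict
representation, the eout sentinel, and the cosine helper are
assumptions for illustration, not the authors' implementation.

    OUT_OF_WIKIPEDIA = object()  # sentinel for eout

    def features(q, e):
        """Phi(q, ek): a cosine feature, (word, category) indicators,
        and an out-of-Wikipedia indicator, as a sparse dict."""
        if e is OUT_OF_WIKIPEDIA:
            return {"out": 1.0}
        phi = {"cos": cosine(q.T, e.T)}        # assumed tf-idf cosine helper
        for w in set(q.T):
            for c in e.C:
                phi[("w-c", w, c)] = 1.0       # word-category indicator
        return phi

    def score(weights, phi):
        """Linear ranking function: score(q, ek) = w . Phi(q, ek)."""
        return sum(weights.get(f, 0.0) * v for f, v in phi.items())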
24
Ranking Formulation: Example
q.T = {past, weekend, Boston, Pops, conducted,
summer, Star, Wars, concert, Tanglewood, ...}
e1.C = {Film score composers, Composers, Musicians,
People by occupation, ...}
eout.C = ∅
25
NE Disambiguation Overview (1)

[Diagram: Wikipedia data structures — redirect pages, disambiguation
pages, and hyperlinks — feed the NE dictionary and the disambiguation
dataset.]
26
NE Disambiguation Overview (2)

[Diagram.]
27
Outline
  • Introduction
  • Wikipedia Structures
  • Named Entity Dictionary
  • Disambiguation & Detection
  • Experimental Evaluation
  • Future Work
  • Conclusions

28
Experimental Evaluation
  • The normalized ranking kernel is trained and
    evaluated against cosine similarity in 4
    scenarios:
  • S1: disambiguation between entities with
    different categories in the set of 110 top-level
    categories under People by Occupation.
  • S2: disambiguation between entities with
    different categories in the set of 540 most
    popular (size > 200) categories under People by
    Occupation.
  • S3: disambiguation between entities with
    different categories in the set of 2847 most
    popular (size > 20) categories under People by
    Occupation.
  • S4: detection combined with disambiguation,
    between entities with different categories in the
    set of 540 most popular (size > 200) categories
    under People by Occupation.
  • Use SVMlight with the max-margin ranking approach
    from [Joachims, 2002] (an export sketch follows
    this slide).
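
A sketch of exporting the examples in SVM-light's ranking format (one
qid per query; the ranking mode is svm_learn -z p). The feature
indices are assumed to come from a precomputed feature-to-index map.

    def write_svmlight_ranking(path, queries):
        """queries: iterable of [(label, {feature_index: value}), ...] per query."""
        with open(path, "w") as f:
            for qid, examples in enumerate(queries, start=1):
                for label, phi in examples:
                    feats = " ".join(f"{i}:{v}" for i, v in sorted(phi.items()))
                    f.write(f"{label} qid:{qid} {feats}\n")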

29
Experimental Evaluation: S2
  • The set of Wikipedia categories is restricted to:
  • C2 = the 540 categories under People by
    Occupation that have at least 200 articles.
  • Train & test only on ambiguous queries ⟨q, ek⟩
    such that:
  • ek.C ∩ C2 ≠ ∅ (i.e. matching entities have
    categories in C2)
  • ek.C ∩ C2 ≠ q.e.C ∩ C2 (i.e. the true entity does
    not have exactly the same categories as other
    matching entities)
  • Statistics & Results:

        Training dataset              Test dataset        Test Accuracy
Cat     Queries   Pairs    Constr.    Queries   Pairs     Kernel   Cosine
540     17,970    55,452   37,482     70,468    235,290   68.4%    55.8%
30
Experimental Evaluation: S4
  • The set of Wikipedia categories is restricted to:
  • C4 = the 540 categories under People by
    Occupation that have at least 200 articles.
  • Train & test:
  • Consider out-of-Wikipedia all entities that are
    not under People by Occupation.
  • Randomly select queries such that 10% have their
    true answer out-of-Wikipedia.
  • Statistics & Results:

        Training dataset              Test dataset        Test Accuracy
Cat     Queries   Pairs    Constr.    Queries   Pairs     Kernel   Cosine
540     38,726    102,553  63,827     80,386    191,227   84.8%    82.3%
31
Future Work
  • Use the weight vector w explicitly; reduce its
    dimensionality by considering only features
    occurring frequently in the training data.
  • Augment the article text with context from
    hyperlinks that point to it.
  • Use correlations between categories and
    traditional WSD features, such as (syntactic)
    bigrams and trigrams centered on the ambiguous
    proper name.

32
Conclusion
  • A novel approach to Named Entity Disambiguation,
    based on knowledge encoded in Wikipedia.
  • Learned correlations between Wikipedia categories
    and context words substantially improve
    disambiguation accuracy.
  • Potential applications:
  • Clustering the results of web searches for
    popular named entities.
  • NE disambiguation is essential for aggregating
    corpus-level results in Information Extraction.

33
Questions?
34
Ranking Kernel
  • Since the ranking function is linear in Φ, the
    corresponding kernel is the dot product of the
    feature vectors:

    K(x, x') = Φ(x) · Φ(x')

  • The normalized version (sketched below):

    K'(x, x') = K(x, x') / √(K(x, x) · K(x', x'))
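
A sketch of the kernel and its normalization over the sparse feature
dicts assumed above.

    import math

    def kernel(phi1, phi2):
        """K(x, x') = Phi(x) . Phi(x')."""
        if len(phi1) > len(phi2):                 # iterate over the smaller dict
            phi1, phi2 = phi2, phi1
        return sum(v * phi2.get(f, 0.0) for f, v in phi1.items())

    def normalized_kernel(phi1, phi2):
        """K'(x, x') = K(x, x') / sqrt(K(x, x) * K(x', x'))."""
        d = math.sqrt(kernel(phi1, phi1) * kernel(phi2, phi2))
        return kernel(phi1, phi2) / d if d else 0.0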

35
Experimental Evaluation: S1
  • The set of Wikipedia categories is restricted to:
  • C1 = the 110 top-level categories under People
    by Occupation.
  • Train & test only on ambiguous queries ⟨q, ek⟩
    such that:
  • ek.C ∩ C1 ≠ ∅ (i.e. matching entities have
    categories in C1)
  • ek.C ∩ C1 ≠ q.e.C ∩ C1 (i.e. the true entity does
    not have exactly the same categories as other
    matching entities)
  • Statistics & Results:

        Training dataset              Test dataset        Test Accuracy
Cat     Queries   Pairs    Constr.    Queries   Pairs     Kernel   Cosine
110     12,288    39,880   27,592     48,661    147,165   77.2%    61.5%
36
Experimental Evaluation: S3
  • The set of Wikipedia categories is restricted to:
  • C3 = the 2847 categories under People by
    Occupation that have at least 20 articles.
  • Train & test only on ambiguous queries ⟨q, ek⟩
    such that:
  • ek.C ∩ C3 ≠ ∅ (i.e. matching entities have
    categories in C3)
  • ek.C ∩ C3 ≠ q.e.C ∩ C3 (i.e. the true entity does
    not have exactly the same categories as other
    matching entities)
  • Statistics & Results:

        Training dataset              Test dataset        Test Accuracy
Cat     Queries   Pairs    Constr.    Queries   Pairs     Kernel   Cosine
2847    21,185    64,560   43,375     75,190    261,723   68.0%    55.4%