Title: Robust Reading of Ambiguous Names in Texts
1Robust Reading of Ambiguous Names in Texts
- Xin Li, Paul Morie and Dan Roth
- Dept. of Computer Science
- University of Illinois at Urbana-Champaign
2The problem
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester, England in 1959. Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
4Robust Reading of Ambiguous Names
We identify the different entities mentioned in text
and map each mention, within and across documents,
to the corresponding entity.
5Problems
- Entity Identity
- Do mentions A and B refer to the same entity?
- Name Expansion
- Given one writing of a name, find other likely writings of the same entity.
- Prominence
- "What's Bush's foreign policy?"
- Find the most prominent Bush.
6Why is this problem important?
- Intelligent access to textual information requires identifying entities and discovering knowledge about entities from text.
- Information Extraction: extract knowledge about entities.
- Question Answering: answer English questions automatically.
- Most research in NLP is still done with individual mentions of an entity.
- We would like to start moving from mentions to concepts and treat mentions as a whole based on the real-world entities they refer to.
7Our solution
- We developed machine learning techniques for this problem.
- They are based on a natural generation process of documents.
- Data Collection: New York Times news articles and Yahoo movie databases.
8A generative model
- A natural process of how documents are generated.
- A probabilistic view of how documents are generated and how "mentions" of entities are "sprinkled" into them.
- Entity identification is done through inference once the model is learned. Learning is unsupervised.
9Generate a document d
The Justice Department has officially ended its inquiry into the assassinations of President John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
President Kennedy, JFK
10At the beginning, we have a set of entities in mind
A set of entities E
The Justice Department
Dallas
The House Assassinations Committee
David Kennedy
11First Step: Select a subset of entities for a
document d. Underlying probability
distribution: $P(E_d)$.
The Justice Department
Dallas
The House Assassinations Committee
$E_d$: the set of entities in document d
12Second Step: For each entity e, select a
representative r. Underlying probability
distributions: $P(r \mid e)$, and $P(R_d \mid E_d)$, which is determined by the per-entity terms $P(r \mid e)$.
$R_d$: the set of representatives in document d
13Third Step: For each representative r, select a
set of mentions m. Underlying probability
distributions: $P(m \mid r)$, and $P(M_d \mid R_d)$, which is determined by the per-representative terms $P(m \mid r)$.
Example mentions: Kennedy, JFK, President Kennedy, President John F. Kennedy
$M_d$: the set of actual mentions in document d
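The three steps above can be sketched as a small sampling procedure. This is only an illustration under stated assumptions: the probability tables are made-up toy values rather than learned parameters, P(E_d) is simplified to an independent coin flip per entity, and the names `ENTITY_PROB`, `REP_GIVEN_ENTITY`, `MENTION_GIVEN_REP` and `generate_document` are hypothetical.

```python
import random

# Hypothetical toy tables; all probabilities are illustrative, not learned.
# Step 1 simplification (an assumption): each entity enters a document independently.
ENTITY_PROB = {"John F. Kennedy": 0.7, "David Kennedy": 0.1, "Dallas": 0.4}

# P(r | e): how an entity chooses its representative in a document.
REP_GIVEN_ENTITY = {
    "John F. Kennedy": {"President John F. Kennedy": 0.7, "John F. Kennedy": 0.3},
    "David Kennedy": {"David Kennedy": 1.0},
    "Dallas": {"Dallas": 1.0},
}

# P(m | r): how a representative is written as an actual mention.
MENTION_GIVEN_REP = {
    "President John F. Kennedy": {
        "President John F. Kennedy": 0.4, "President Kennedy": 0.2,
        "Kennedy": 0.3, "JFK": 0.1,
    },
    "John F. Kennedy": {"John F. Kennedy": 0.6, "Kennedy": 0.3, "JFK": 0.1},
    "David Kennedy": {"David Kennedy": 0.6, "Kennedy": 0.4},
    "Dallas": {"Dallas": 1.0},
}


def sample(dist, rng):
    """Draw one key from a {value: probability} table."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]


def generate_document(rng, mentions_per_rep=3):
    """Sample (E_d, R_d, M_d) following the three steps."""
    # Step 1: select the subset of entities E_d.
    entities = [e for e, p in ENTITY_PROB.items() if rng.random() < p]
    # Step 2: select one representative per entity, giving R_d.
    reps = {e: sample(REP_GIVEN_ENTITY[e], rng) for e in entities}
    # Step 3: select the mentions of each representative, giving M_d.
    mentions = {
        r: [sample(MENTION_GIVEN_REP[r], rng) for _ in range(mentions_per_rep)]
        for r in reps.values()
    }
    return entities, reps, mentions


if __name__ == "__main__":
    print(generate_document(random.Random(0)))
```

Note that the ambiguous mention "Kennedy" can be produced by more than one entity in these tables, which is exactly the ambiguity the model has to resolve at inference time.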
15Robust Reading
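- Assuming we have the model, the fundamental problem is to decide which entities are mentioned in a given document and which entity each mention most likely refers to (Li, Morie and Roth, HLT-NAACL 2004).
- $(E_d, R_d) = \arg\max_{(E_d, R_d)} P(E_d, R_d \mid M_d, \theta)$
- $= \arg\max_{(E_d, R_d)} P(E_d, R_d, M_d \mid \theta)$, since $P(M_d \mid \theta)$ does not depend on $(E_d, R_d)$; here $\theta$ denotes the model parameters.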
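The argmax above can be illustrated with a brute-force search over a toy model. This is a minimal sketch and not the authors' inference procedure: it collapses the representative layer into a direct P(m | e) table, assumes the prior over an entity set is a product of per-entity weights, and uses made-up probabilities. The names `ENTITY_PRIOR`, `MENTION_GIVEN_ENTITY`, `joint_score` and `most_likely_entities` are hypothetical.

```python
from itertools import product
from math import prod

# Hypothetical toy parameters (illustrative values, not learned from data).
ENTITY_PRIOR = {
    "John F. Kennedy": 0.6,
    "David Kennedy": 0.1,
    "Lee Harvey Oswald": 0.3,
}
# Simplification (an assumption): representatives are folded into P(m | e).
MENTION_GIVEN_ENTITY = {
    "John F. Kennedy": {"John F. Kennedy": 0.5, "Kennedy": 0.3, "JFK": 0.2},
    "David Kennedy": {"David Kennedy": 0.6, "Kennedy": 0.4},
    "Lee Harvey Oswald": {"Lee Harvey Oswald": 0.7, "Oswald": 0.3},
}


def joint_score(mentions, assignment):
    """Toy joint probability of an entity set and the observed mentions."""
    # Each distinct entity pays its prior once, however often it is mentioned.
    prior = prod(ENTITY_PRIOR[e] for e in set(assignment))
    likelihood = prod(
        MENTION_GIVEN_ENTITY[e].get(m, 0.0) for m, e in zip(mentions, assignment)
    )
    return prior * likelihood


def most_likely_entities(mentions):
    """Exhaustively search all mention-to-entity assignments for the argmax."""
    candidates = list(ENTITY_PRIOR)
    best = max(
        product(candidates, repeat=len(mentions)),
        key=lambda assignment: joint_score(mentions, assignment),
    )
    return list(zip(mentions, best))


if __name__ == "__main__":
    print(most_likely_entities(["John F. Kennedy", "Kennedy", "Oswald"]))
    print(most_likely_entities(["David Kennedy", "Kennedy"]))
```

Because the score is computed for the document as a whole, the bare mention "Kennedy" resolves to John F. Kennedy in the first call and to David Kennedy in the second. Exhaustive search grows exponentially with the number of mentions, so it is only practical for tiny examples like this one.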
16Significant Applications
- Based on this work, we can implement an analysis tool that can be used to browse, retrieve and track information about entities using textual resources. The tool can:
- Automatically retrieve knowledge about specific persons and locations,
- Automatically extract relations between them,
- Automatically build and unify important databases.