Title: Entity-oriented filtering of large streams
1Entity-oriented filtering of large streams
John R. Frank jrf_at_mit.edu Ian Soboroff ian.soboroff_at_nist.gov
Max Kleiman-Weiner maxkw_at_mit.edu Dan A. Roberts drob_at_mit.edu
http://trec-kba.org
2- Date: Tue, 13 Mar 2012 02:45:40 +0000
- From: Google Alerts <googlealerts-noreply_at_google.com>
- Subject: Google Alert - "John R. Frank"
- Web - 2 new results for "John R. Frank"
- John R. Frank
- SPOKANE, Wash. - John R. Frank, 55, died March 4, 2012, in Coeur d'Alene, Idaho. Survivors include his wife, Miki; daughter, Patricia Frank ...
- <http://www.hutchnews.com/obituaries/Frank--John-CP>
- In Memory of John R Frank
- Biography. John R. Frank, age 55, passed away at Sacred Heart Medical Center in Spokane, WA, on March 4, 2012. John was born in Hutchison, KS, ...
- <http://www.englishfuneralchapel.com/sitemaker/sites/Englis1/obit.cgi?user583335Frank>
32012 Task: Filtering to Recommend Citations
Entities in Wikipedia or another Knowledge Base
- Initialize with a target WP entity
- state of WP from Jan 2012
- Iterate over stream of text items
- Oct-Dec 2011 train on labels
- For each, output relevance in [0, 1]
- Jan-Apr 2012 labels hidden
Automatically recommend new edits
- Content Stream
- 462M texts, 40% English
- 4,973 hourly chunks of ~10^5 docs/hour
- News, blogs, forums, and link shortening
Your KBA System
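The task loop on this slide can be sketched as follows. This is a minimal illustration, not the track's reference system: `StreamItem`, `relevance`, and `filter_stream` are hypothetical names, and the token-overlap score is only a stand-in for whatever a real KBA system would compute.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class StreamItem:
    """Hypothetical stand-in for one text item from the stream."""
    doc_id: str
    text: str


def relevance(target_names: List[str], item: StreamItem) -> float:
    """Toy score in [0, 1]: fraction of the target's name tokens found in the text."""
    text = item.text.lower()
    tokens = {tok for name in target_names for tok in name.lower().split()}
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in text)
    return hits / len(tokens)


def filter_stream(
    target_names: List[str], stream: Iterable[StreamItem]
) -> Iterator[Tuple[str, float]]:
    """Iterate over the stream once and emit (doc_id, relevance) for each item."""
    for item in stream:
        yield item.doc_id, relevance(target_names, item)


# Usage: initialize with a target entity, then score each item as it arrives.
docs = [
    StreamItem("d1", "Masaru Emoto is a Japanese photographer."),
    StreamItem("d2", "For decades, Masaru took pictures of frozen crystals."),
    StreamItem("d3", "Water covers most of the planet."),
]
scores = list(filter_stream(["Masaru Emoto"], docs))
```

Because the generator touches each item exactly once, the cost stays linear in the number of stream items, which matters at the corpus sizes quoted above.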
4s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/
5(No Transcript)
6Accelerate?
rate of assimilation << stream size
editors << entities << mentions
(definition of a large KB)
7How many days must a news article wait before
being cited in Wikipedia?
8Complex entity with many relationships and
attributes.
9Has many interests, including trying to take over
UK soccer teams. His empire includes many
entities
Note: Usmanov not mentioned in this text!
Elaborate link trails
Citation 18
10Example KBA Rating Task
Published March 31, 2012. Impact of Thoughts on
Water, by Denis Gorce-Bourge. Water covers 70%
of our Blue planet and our body is made of about
70% water. Masaru Emoto is a Japanese
Photographer and scientist. He is known over the
world for his remarkable work on water and its
deep connection with individual and collective
consciousness. For decades, Masaru took pictures
of frozen crystals of water and tested the direct
influence of the environment on the quality of
those crystals. Pollution has a direct impact on
the beauty of a frozen crystal but as well words,
music and thoughts. He tested the quality of
water crystals by exposing it to various
conditions to written words like hate and
violence and Love and gratitude. The results were
just astonishing. The crystal exposed to Love and
gratitude was beautiful and perfectly formed
where the other one was severely degraded. He
demonstrated as well the impact of Heavy Metal
music versus Mozart or Beethoven and how the
vibration of music impacts water. The very shape
of water crystals is modified by violence,
aggression, and negative words.
12(No Transcript)
13Interannotator Agreement
97.6% +/- 1.4% (N=5365) coref
69.5% +/- 2.7% (N=1352) central
70.9% +/- 2.0% (N=2403) relevant
58.4% +/- 3.4% (N=884) neutral
84.9% +/- 2.0% (N=2599) garbage
82.6% +/- 1.8% (N=3200) central + relevant
89.0% +/- 1.7% (N=3551) central + relevant + neutral
14TRECing the continental divide between NLP and IR
- IR
- User task centric
- Variation in interpretation
- Scores -> cascading lists
- Constructionist, emergence
- NLP
- Data parsing centric
- Universal annotation
- Scores -> probabilities
- Reductionist
15(No Transcript)
16string matching
task generator: 91% recall, 15% precision, 26% F1
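The F1 figure follows from the quoted recall and precision as their harmonic mean. A short check, where the function name `f1_score` is ours:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# String-matching baseline from the slide: 91% recall, 15% precision.
f1 = f1_score(precision=0.15, recall=0.91)
f1_percent = round(100 * f1)  # 26
```

The harmonic mean is dominated by the smaller of the two values, so the high recall does little to offset the low precision.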
17(No Transcript)
18KBA 2013
- More entity types with an emphasis on temporality
in the stream.
Target Entities | KB | Centrally Relevant | Training Data | Annotation
People and Organizations | Wikipedia or maybe Freebase | Citation worthy | Judgments from early stream | High recall on all mentioning docs.
Pharmaceutical Compounds | Merck KB? | Reporting of Adverse Drug Reaction (ADR) (an event) | (same) | Focus recall on first person reporting negative reactions?
Event-type Entities: defined by a cluster of entities and possibly a Type-of-Event from a taxonomy | WP/FB for cluster of entities, possibly also event itself. | Provides causality info | Judgments on docs for that Type-of-Event but different specific event. Find training data from TDT? Use citations in Category:Current_events? | Judge post-hoc?
19KBX
Pool top-K filtered docs, or use each KBA run as
separate KBP input. (1000x filter)
Cold Start queries focused on nil entities
related to target cluster and/or causality of
event
Output KB
KBP
KBA
Must coordinate choice of KBA target entities
with desired content of KBs for Cold Start
queries.
Clusters of related entities and/or event-type
entities
- KBA Stream Corpus 2012 (or the new Stream Corpus 2013)
- 462M texts, 40% English
- 4,973 hourly chunks of ~10^5 docs/hour
- News, blogs, forums, and link shortening
20Sponsors Thank You.
Diffeo
21Thanks for your time.
- John R. Frank
- jrf_at_mit.edu
- http://trec-kba.org
22Lessons Learned (in progress)
- 97% coref, but 70% rating agreement -> hard
- one-in-twenty WP citations non-mentioning
- Definition of citable varies across entities
- Tension between IR rating and NLP labeling
- Lost >1 teams from challenges of crunching big data
- Learn from Kaggle: score a run every day
- AWS is really useful.
- KB feature mining + ML beat string matching
- Too much training data?
- Must exercise temporality in the stream: spikes and events.
23KBP
depth
Future of Large Knowledge Bases
NLP derives logical structures from
messages. Might leverage O(n^2) and more to
explore the problem space. Limits corpus size to
10^6 docs.
IR filters relevant messages from large
streams. Limit algorithmic complexity to O(n) and
simpler by forcing large corpora, 10^8 docs and
higher.
KBA
As NLP and IR converge, relevance concepts are
gaining structure and logical inference is
spreading across documents.
volume
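A rough back-of-the-envelope sketch of the depth versus volume trade-off on this slide, assuming unit cost per operation and using the corpus sizes quoted above. The helper names are ours, and real systems differ by large constant factors.

```python
def pairwise_ops(n_docs: int) -> int:
    """Operation count for an O(n^2) method, e.g. comparing every pair of docs."""
    return n_docs ** 2


def linear_ops(n_docs: int) -> int:
    """Operation count for an O(n) method, e.g. one pass over the stream."""
    return n_docs


nlp_corpus = 10 ** 6  # corpus size the slide associates with NLP
ir_corpus = 10 ** 8   # corpus size the slide associates with IR

nlp_cost = pairwise_ops(nlp_corpus)          # 10^12 operations
ir_cost = linear_ops(ir_corpus)              # 10^8 operations
quadratic_at_ir_scale = pairwise_ops(ir_corpus)  # 10^16 operations

# Running the quadratic method at IR scale costs 10^4 times more
# than running it at NLP scale.
blowup = quadratic_at_ir_scale // nlp_cost
```

Under these assumptions, a 100x larger corpus makes the quadratic method 10,000x more expensive, which is why large streams push toward O(n) or simpler.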
24one-shot NLP pipeline (traditional approach)
Users query the tagged text and occasionally
create more structure
Traditional Doc DB
docs
pipeline of NLP taggers add metadata to doc
DB create structure once
25Paradigm Shift: persistent, adaptive NLP in the
database
End users and adaptive tagging algorithms access
the same content in the DB
Doc DB
docs
26Methods in the Madness
mean edit interval (days)
mean mention interval (hours)
27Has many interests, including trying to take over
UK soccer teams. His empire includes many
entities
Elaborate link trails