Entity-oriented filtering of large streams

Learn more at: https://tac.nist.gov
1
Entity-oriented filtering of large streams
John R. Frank jrf_at_mit.edu Ian Soboroff ian.soboroff_at_nist.gov
Max Kleiman-Weiner maxkw_at_mit.edu Dan A. Roberts drob_at_mit.edu
http://trec-kba.org
2
  • Date: Tue, 13 Mar 2012 02:45:40 +0000
  • From: Google Alerts <googlealerts-noreply_at_google.com>
  • Subject: Google Alert - "John R. Frank"
  • Web - 2 new results for "John R. Frank"
  • John R. Frank
  • SPOKANE, Wash. - John R. Frank, 55, died March 4,
    2012, in Coeur d' Alene,
  • Idaho. Survivors include his wife, Miki;
    daughter, Patricia Frank ...
  • <http://www.hutchnews.com/obituaries/Frank--John-CP>
  • In Memory of John R Frank
  • Biography. John R. Frank, age 55, passed away at
    Sacred Heart Medical
  • Center in Spokane, WA, on March 4, 2012. John was
    born in Hutchison, KS, ...
  • <http://www.englishfuneralchapel.com/sitemaker/sites/Englis1/obit.cgi?user583335Frank>

3
2012 Task: Filtering to Recommend Citations
Entities in Wikipedia or another Knowledge Base
  • Initialize with a target WP entity
  • state of WP from Jan 2012
  • Iterate over stream of text items
  • Oct-Dec 2011: train on labels
  • For each, output relevance between 0 and 1
  • Jan-Apr 2012: labels hidden

Automatically recommend new edits
  • Content Stream
  • 462M texts, 40% English
  • 4,973 hourly chunks of ~10^5 docs/hour
  • News, blogs, forums, and link shortening

Your KBA System
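The task loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the track's reference system: `StreamItem`, `score_item`, `filter_stream`, and the token-overlap scoring rule are hypothetical names invented here. Only the overall shape comes from the slide: one target entity, a stream of text items, and one relevance score between 0 and 1 per item.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class StreamItem:
    """Hypothetical stand-in for one text item in the stream."""
    doc_id: str
    text: str


def score_item(entity_name: str, item: StreamItem) -> float:
    """Toy relevance in [0, 1]: fraction of the entity's name tokens found in the text.

    This rule is an assumption for illustration; a real system would use richer features.
    """
    tokens = entity_name.lower().split()
    if not tokens:
        return 0.0
    text = item.text.lower()
    hits = sum(1 for token in tokens if token in text)
    return hits / len(tokens)


def filter_stream(entity_name: str,
                  stream: Iterable[StreamItem]) -> Iterator[Tuple[str, float]]:
    """Visit items in stream order and emit (doc_id, relevance) for each one."""
    for item in stream:
        yield item.doc_id, score_item(entity_name, item)


# Two made-up items standing in for one hourly chunk.
stream = [
    StreamItem("doc-1", "Masaru Emoto is a Japanese photographer."),
    StreamItem("doc-2", "Water covers most of the planet."),
]
results = list(filter_stream("Masaru Emoto", stream))
```

Because `filter_stream` is a generator, it never holds more than one item in memory, which matters for a stream of this size.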
4
s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/
5
(No Transcript)
6
Accelerate?
rate of assimilation << stream size
editors << entities << mentions
(definition of a large KB)
7
How many days must a news article wait before
being cited in Wikipedia?
8
Complex entity with many relationships and
attributes.
9
Has many interests, including trying to take over
UK soccer teams. His empire includes many
entities.
Note: Usmanov not mentioned in this text!
Elaborate link trails
Citation 18
10
Example KBA Rating Task
Published: March 31, 2012
Impact of Thoughts on Water
By Denis Gorce-Bourge
Water covers 70% of our Blue planet and our body
is made of about 70% water. Masaru Emoto is a Japanese
Photographer and scientist. He is known over the
world for his remarkable work on water and its
deep connection with individual and collective
consciousness. For decades, Masaru took pictures
of frozen crystals of water and tested the direct
influence of the environment on the quality of
those crystals. Pollution has a direct impact on
the beauty of a frozen crystal but as well words,
music and thoughts. He tested the quality of
water crystals by exposing it to various
conditions to written words like hate and
violence and Love and gratitude. The results were
just astonishing. The crystal exposed to Love and
gratitude was beautiful and perfectly formed
where the other one was severely degraded. He
demonstrated as well the impact of Heavy Metal
music versus Mozart or Beethoven and how the
vibration of music impacts water. The very shape
of water crystals is modified by violence,
aggression, and negative words.
11
Example KBA Rating Task
12
(No Transcript)
13
Interannotator Agreement
97.6% +/- 1.4% (N=5365) coref
69.5% +/- 2.7% (N=1352) central
70.9% +/- 2.0% (N=2403) relevant
58.4% +/- 3.4% (N=884) neutral
84.9% +/- 2.0% (N=2599) garbage
82.6% +/- 1.8% (N=3200) central + relevant
89.0% +/- 1.7% (N=3551) central + relevant + neutral
14
TRECing the continental divide between NLP and IR
IR
  • User task centric
  • Variation in interpretation
  • Scores → cascading lists
  • Constructionist, emergence
NLP
  • Data parsing centric
  • Universal annotation
  • Scores → probabilities
  • Reductionist

15
(No Transcript)
16
string matching
task generator: 91% recall, 15% precision, 26% F1
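The F1 figure on this slide follows from the precision and recall beside it, assuming the standard definition of F1 as their harmonic mean. The function name `f1_score` is ours, used for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures from the slide: 15% precision, 91% recall.
baseline_f1 = f1_score(precision=0.15, recall=0.91)
print(f"{baseline_f1:.1%}")  # a little under 26%, which rounds to the slide's 26
```

The harmonic mean is pulled toward the smaller of the two numbers, so the very high recall does little to offset the low precision.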
17
(No Transcript)
18
KBA 2013
  • More entity types with an emphasis on temporality
    in the stream.

Target Entities | KB | Centrally Relevant | Training Data | Annotation
People and Organizations | Wikipedia or maybe Freebase | Citation worthy | Judgments from early stream | High recall on all mentioning docs.
Pharmaceutical Compounds | Merck KB? | Reporting of Adverse Drug Reaction (ADR) (an event) | (same) | Focus recall on first person reporting negative reactions?
Event-type Entities: Defined by a cluster of entities and possibly a Type-of-Event from a taxonomy | WP/FB for cluster of entities, possibly also event itself. | Provides causality info | Judgments on docs for that Type-of-Event but different specific event. Find training data from TDT? | Use citations in Category:Current_events? Judge post-hoc?
19
KBX
Pool top-K filtered docs, or use each KBA run as
separate KBP input. (1000x filter)
Cold Start queries focused on nil entities
related to target cluster and/or causality of
event
Output KB
KBP
KBA
Must coordinate choice of KBA target entities
with desired content of KBs for Cold Start
queries.
Clusters of related entities and/or event-type
entities
  • KBA Stream Corpus 2012 (or the new Stream Corpus
    2013)
  • 462M texts, 40% English
  • 4,973 hourly chunks of ~10^5 docs/hour
  • News, blogs, forums, and link shortening
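The "pool top-K filtered docs" option on this slide could look roughly like the sketch below. It assumes each KBA run is a mapping from document id to relevance score. That representation, the names `top_k` and `pool_runs`, and the sample scores are all hypothetical. The alternative on the slide (feeding each run to KBP separately) would simply skip the union step.

```python
import heapq
from typing import Dict, List, Set

Run = Dict[str, float]  # hypothetical run format: doc_id -> relevance score


def top_k(run: Run, k: int) -> List[str]:
    """Return the k highest-scoring doc ids from one run."""
    best = heapq.nlargest(k, run.items(), key=lambda pair: pair[1])
    return [doc_id for doc_id, _ in best]


def pool_runs(runs: List[Run], k: int) -> Set[str]:
    """Union of the top-k docs of every run: the pooled input for the next stage."""
    pooled: Set[str] = set()
    for run in runs:
        pooled.update(top_k(run, k))
    return pooled


# Made-up scores for two runs.
run_a = {"a": 0.9, "b": 0.2, "c": 0.5, "e": 0.1}
run_b = {"b": 0.8, "d": 0.7, "a": 0.1}
pool = pool_runs([run_a, run_b], k=2)
```

Here `pool` contains a, b, c, and d. Document e is dropped because it is not in the top 2 of either run.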

20
Sponsors Thank You.
Diffeo
21
Thanks for your time.
  • John R. Frank
  • jrf_at_mit.edu
  • http://trec-kba.org

22
Lessons Learned (in progress)
  • 97% coref, but 70% rating agreement → hard
  • one-in-twenty WP citations non-mentioning
  • Definition of citable varies across entities
  • Tension between IR rating and NLP labeling
  • Lost >1 teams from challenges of crunching big
    data
  • Learn from Kaggle: score a run every day
  • AWS is really useful.
  • KB feature mining + ML beat string matching
  • Too much training data?
  • Must exercise temporality in the stream spikes
    events.

23
KBP
depth
Future of Large Knowledge Bases
NLP derives logical structures from
messages. Might leverage O(n^2) and more to
explore the problem space. Limits corpus size to
10^6 docs.
IR filters relevant messages from large
streams. Limit algorithmic complexity to O(n) and
simpler by forcing large corpora, 10^8 docs and
higher.
KBA
As NLP and IR converge, relevance concepts are
gaining structure and logical inference is
spreading across documents.
volume
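A rough operation count shows why the corpus sizes on this slide pair with those complexity classes. The sketch counts one unit of work per document (linear) or per document pair (quadratic). That unit-cost model and the function names are simplifying assumptions for illustration, not measurements.

```python
def linear_ops(n_docs: int) -> int:
    """Work for one pass over the corpus: on the order of n."""
    return n_docs


def pairwise_ops(n_docs: int) -> int:
    """Work for relating every document to every other: on the order of n^2."""
    return n_docs * n_docs


nlp_corpus = 10**6  # corpus size the slide associates with NLP
ir_corpus = 10**8   # corpus size the slide associates with IR

nlp_cost = pairwise_ops(nlp_corpus)           # 10^12 units
ir_cost = linear_ops(ir_corpus)               # 10^8 units
blowup = pairwise_ops(ir_corpus) // nlp_cost  # quadratic cost grows 10^4 times
```

Under this model, running the quadratic method on the IR-scale corpus costs 10^16 units. That is ten thousand times the 10^12 it costs at NLP scale, while a linear pass over the same 10^8 docs costs only 10^8.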
24
one-shot NLP pipeline (traditional approach)
Users query the tagged text and occasionally
create more structure
Traditional Doc DB
docs
pipeline of NLP taggers add metadata to doc
DB create structure once
25
Paradigm Shift: persistent, adaptive NLP in the
database
End users and adaptive tagging algorithms access
the same content in the DB
Doc DB
docs
26
Methods in the Madness
mean edit interval (days)
mean mention interval (hours)
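The two axis quantities can be read as the average gap between consecutive events for an entity: edits to its article, or mentions of it in the stream. The sketch below is one plausible way to compute them, assuming that reading. The slide does not say how the plotted values were derived, and `mean_interval` and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Sequence


def mean_interval(timestamps: Sequence[datetime]) -> timedelta:
    """Average gap between consecutive timestamps, after sorting them."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Made-up event times for a single entity.
edits = [datetime(2012, 1, 1), datetime(2012, 1, 4), datetime(2012, 1, 11)]
mentions = [
    datetime(2012, 1, 1, 0),
    datetime(2012, 1, 1, 2),
    datetime(2012, 1, 1, 6),
]

mean_edit_days = mean_interval(edits) / timedelta(days=1)         # gaps of 3 and 7 days
mean_mention_hours = mean_interval(mentions) / timedelta(hours=1)  # gaps of 2 and 4 hours
```

Dividing one `timedelta` by another gives a plain float, which is how the result is expressed in days or hours.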
27
Has many interests, including trying to take over
UK soccer teams. His empire includes many
entities
Elaborate link trails