Automatic Discovery of Useful Facet Terms

Wisam Dakka (Columbia University), Rishabh Dayal (Columbia University), Panagiotis G. Ipeirotis (NYU)


1
Automatic Discovery of Useful Facet Terms
  • Wisam Dakka Columbia University
  • Rishabh Dayal Columbia University
  • Panagiotis G. Ipeirotis NYU

2
Searching the NYT Archive for Book Research
3
Motivation: News Archive
  • Accessing and searching is not an easy task
  • Researchers and reporters spend a large amount of
    time going through their long query results
  • News archives are huge and span tens of
    years
  • Many relevant results
  • Results in the first page are not more relevant
    than the results in the 5th or the 10th page (NYT
    archive)
  • Search engines for news archives mainly follow one
    paradigm:
  • Search, skim through long results, modify the
    query, and search again
  • Goal: Multifaceted Interfaces (MI) over the news
    archive of Newsblaster
  • Newsblaster archive
  • About 6 years of news from 24 news sources
  • Stories are clustered daily into hierarchies of
    topics and events
  • Events are threaded over time, summarized, and
    classified

4
Motivation: MI for Newsblaster Archive
  • Our multifaceted interfaces work (CIKM 2005) has
    some limitations
  • Supervised learning: the algorithm can only
    identify facets that appear in the training set
  • WordNet hypernyms
  • WordNet has rather poor coverage of named
    entities
  • Free text collections
  • The quality of the hierarchies built on top of
    news stories was low

5
Challenge: Automatic Extraction of Useful
Facets from a News Archive
  • Automatically discover, in an unsupervised
    manner, a set of candidate facet terms from free
    text
  • Automatically group together facet terms that
    belong to the same facet
  • Build the appropriate browsing structure for each
    facet

6
Intuition: Look for Facet Terms Elsewhere
  • Pilot study: 100 stories from The New York Times
  • Common facets: Location, Institutes, History,
    People, Social Phenomenon, Markets, Nature, and
    Event
  • Sub-facets: Leaders under People, Corporations
    under Markets
  • Clear phenomenon: the terms for the useful facets
    do not usually appear in the news stories
  • A journalist writing a story about Jacques Chirac
    will not necessarily use the terms Political
    Leader, Europe, or France. Such missing terms are
    tremendously useful for identifying the
    appropriate facets for the story
  • We will look for these terms elsewhere
  • Target terms that are infrequent in the original
    collection but frequent in the expanded documents

7
Context-Aware Expansion
Murkowski made the announcement three days after
BP said it would shut down a Prudhoe Bay oil
field after a small leak was found. Energy
officials have said pipeline repairs are likely
to take months, curtailing Alaskan production
into next year
Wiki
Yahoo Term Extractor
Named Entities
8
Useful Facet Terms Are Elsewhere
[Figure: Original Collection vs. Context-aware Collection; Infrequent Terms (t_i)]
9
Term Frequency Analysis
  • Frequency-based shifting
  • Due to the Zipfian nature of term frequencies,
    this favors terms that already have high
    frequencies (inverse problem)
  • Rank-shifting
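
The slide names the two shift measures without defining them. The sketch below shows one plausible reading, stated here as an assumption: the frequency shift of a term is its document frequency in the expanded collection minus its document frequency in the original collection, and the rank shift is how many positions it climbs in the frequency ranking. The function names, the tie-breaking rule, and the rank given to terms absent from the original collection are illustrative choices, not taken from the presentation.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df


def ranks(df):
    """Map each term to its rank (1 = most frequent); ties broken alphabetically."""
    ordered = sorted(df, key=lambda t: (-df[t], t))
    return {t: i + 1 for i, t in enumerate(ordered)}


def shifts(original_docs, expanded_docs):
    """Return {term: (frequency_shift, rank_shift)} for every expanded-collection term."""
    df_o = document_frequencies(original_docs)
    df_e = document_frequencies(expanded_docs)
    rank_o = ranks(df_o)
    rank_e = ranks(df_e)
    # Assumption: a term missing from the original collection is ranked
    # just below the least frequent original term.
    worst_rank = len(df_o) + 1
    result = {}
    for t in df_e:
        freq_shift = df_e[t] - df_o.get(t, 0)
        rank_shift = rank_o.get(t, worst_rank) - rank_e[t]
        result[t] = (freq_shift, rank_shift)
    return result


# Toy collections, invented for illustration only.
original = [["chirac", "paris"], ["chirac", "election"], ["bp", "alaska"]]
expanded = [
    ["chirac", "paris", "france", "political leader"],
    ["chirac", "election", "france", "political leader"],
    ["bp", "alaska", "oil"],
]
for term, (f_shift, r_shift) in sorted(shifts(original, expanded).items()):
    print(term, f_shift, r_shift)
```

In this toy data, "france" never appears in the original stories but moves near the top of the expanded ranking, which is the kind of term the slides describe as a candidate facet term.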

10
Summary: Candidate Facet Terms
  • For each document in the database, identify the
    important terms that are useful to characterize
    the contents of the document
  • For each term in the original database, query the
    external resource and retrieve the terms that
    appear in the results. Add the retrieved terms to
    the original document to create an expanded,
    context-aware document
  • Analyze the frequency of the terms, in both the
    original and the expanded database and identify
    the candidate facet terms
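
The first two steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the external resource is replaced by a small hard-coded dictionary, the term extractor is a toy substring match, and every name in the code (EXTERNAL_RESOURCE, extract_important_terms, expand_document) is hypothetical. The Jacques Chirac entry reuses the terms from the Intuition slide; the BP entry is invented for illustration.

```python
# Hypothetical stand-in for an external resource such as an encyclopedia
# lookup. The entries are illustrative, not real query results.
EXTERNAL_RESOURCE = {
    "Jacques Chirac": ["Political Leader", "France", "Europe"],
    "BP": ["Corporation", "Energy"],
}


def extract_important_terms(text, known_terms):
    """Toy extractor: return the known terms that occur verbatim in the text."""
    return [t for t in known_terms if t in text]


def expand_document(text, resource):
    """Return (important_terms, context_terms) for one document.

    context_terms are the terms retrieved from the resource for each
    important term, in first-seen order and without duplicates.
    """
    important = extract_important_terms(text, resource.keys())
    context = []
    for term in important:
        for related in resource.get(term, []):
            if related not in context:
                context.append(related)
    return important, context


story = "Jacques Chirac met BP executives."
important, context = expand_document(story, EXTERNAL_RESOURCE)
print(important)
print(context)
```

The expanded, context-aware document is then the original terms plus the context terms; comparing term frequencies between the two collections is the third step and is not covered in this sketch.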

11
Indicative
12
Research in Progress
  • Cleaning and filtering
  • Grouping similar facet terms under one facet
  • Evaluation
  • The resulting candidate terms
  • The resulting hierarchies