On Burstiness-Aware Search for Document Sequences

1
On Burstiness-Aware Search for Document Sequences
  • Theodoros Lappas Benjamin Arai
  • Manolis Platakis Dimitrios Kotsakos
  • Dimitrios Gunopulos
  • SIGKDD 2009

2
Outline
  • The Problem: How to effectively search through
    large document sequences (e.g. newspapers)
  • Previous Work
  • Using Bursty Terms to identify Events
  • Modeling Burstiness using Discrepancy Theory
  • Our Search Framework
  • Experiments

3
The Problem
  • Given a large sequence of documents (e.g. a
    daily newspaper) and a query of terms, find
    documents that discuss major events relevant to
    the query.
  • Consider the San Francisco Call, a daily
    newspaper from the 1900s.
  • We are given the query <theater, disaster>.
  • Two candidate events are relevant to the query:
  • The disastrous fire of 1903 in the Iroquois
    Theater in Chicago
  • A disastrous performance given by an actor in a
    local theater
  • Clearly the first event is far more influential:
    articles on this event should be ranked higher!

4
Previous Work
  • Burstiness has been explored in different domains:
  • Burst Detection - Kleinberg 2002
  • Stream Clustering - He et al. 2007
  • Graph Evolution - Kumar et al. 2003
  • Event Detection - Fung et al. 2005
  • No previous work on burstiness-aware search
  • Standard information retrieval techniques do not
    consider the underlying events discussed in the
    collection.
  • Event detection techniques do not consider user
    input.

5
Burstiness
  • Major Events are discussed in numerous articles
    for an extended timeframe.
  • The event's keywords exhibit high-frequency
    bursts during the timeframe.
  • Figure: Frequency of the term "earthquake", as it
    appeared in the SF Call (1908-1909).
  • Bursty periods: periods of unusually high
    frequency
  • Unusual? Deviating from an expected baseline.

6
Modeling Burstiness using Discrepancy Theory
  • Discrepancy: used to express and quantify the
    deviation from the norm
  • In our case: find intervals on the timeline where
    the observed frequency differs the most from the
    expected frequency
  • Maximal Interval: one that does not include, and
    is not included in, an interval of higher score.
  • MAX-1: a linear-time algorithm for maximal
    interval extraction.
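
As a rough illustration of the idea on this slide, the sketch below scores each time step by how far the observed frequency lies from the expected one, then extracts maximal-scoring intervals using the procedure of Ruzzo and Tompa for finding all maximal scoring subsequences. Whether MAX-1 works exactly this way is an assumption. The per-step score (observed minus expected) and the function names are hypothetical, and the right-to-left search is written in its simple form, without the extra bookkeeping needed for a strict linear-time bound.

```python
def discrepancy_scores(observed, expected):
    """Per-step deviation from the baseline (assumed form: observed - expected)."""
    return [o - e for o, e in zip(observed, expected)]


def maximal_intervals(scores):
    """Return (start, end, score) for every maximal-scoring interval.

    Indices are inclusive. Each candidate keeps L (the cumulative sum just
    before its start) and R (the cumulative sum at its end).
    """
    segments = []  # entries are [start, end, L, R]
    total = 0
    for i, s in enumerate(scores):
        before = total
        total += s
        if s <= 0:
            continue  # non-positive steps never start or end an interval
        current = [i, i, before, total]
        while True:
            # Rightmost earlier candidate with a strictly smaller L.
            j = len(segments) - 1
            while j >= 0 and segments[j][2] >= current[2]:
                j -= 1
            if j < 0 or segments[j][3] >= current[3]:
                segments.append(current)
                break
            # Extend left over candidate j and drop everything after it.
            current = [segments[j][0], current[1], segments[j][2], current[3]]
            del segments[j:]
    return [(a, b, r - l) for a, b, l, r in segments]
```

With an observed sequence of `[2, 3, 2, 9, 12, 8, 2, 3]` and a flat expected frequency of 3, the only interval reported covers steps 3 to 5, where the term runs well above its baseline.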

7
Baseline - Discussion
  • The baseline can be dynamic:
  • Frequency sequence(s) from previous year(s)
  • Time series decomposition to extract seasonal,
    trend and irregular components
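
Two simple ways to obtain such a baseline are sketched below: a position-wise average over frequency sequences from earlier years, and a centered moving average as a stand-in for the trend component. Both functions and their parameters are hypothetical, and a full decomposition into seasonal, trend and irregular parts is not attempted here.

```python
def baseline_from_previous_years(histories):
    """Position-wise mean of equally long frequency sequences from earlier years."""
    length = len(histories[0])
    return [sum(h[i] for h in histories) / len(histories) for i in range(length)]


def moving_average_trend(freq, window=3):
    """Centered moving average; edges fall back to a shorter window.

    `window` is assumed to be odd so the average is centered on each step.
    """
    half = window // 2
    trend = []
    for i in range(len(freq)):
        lo, hi = max(0, i - half), min(len(freq), i + half + 1)
        trend.append(sum(freq[lo:hi]) / (hi - lo))
    return trend
```

Either output can serve as the expected frequency against which the observed sequence of the current year is compared.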

8
A Diagram of our framework

9
Phase 1: Preprocessing
  • The input is a raw document sequence.
  • The output is the set of terms to be monitored.
  • Preprocessing methods:
  • Stemming, synonym matching, etc.
  • Stopword removal
  • Frequency pruning for rare words
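
A minimal sketch of such a preprocessing pass follows. The stopword list, the crude suffix-stripping "stemmer" and the `min_doc_freq` cutoff are illustrative assumptions, not the components used by the authors; a real pipeline would substitute a proper stemmer and a full stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


def monitored_terms(documents, min_doc_freq=2):
    """Terms that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(tokenize(text)))
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}
```

Pruning by document frequency keeps the monitored set small, since a term that appears in only one article cannot form a burst.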

10
Phase 2: Retrieval of Bursty Intervals
  • Input: A term
  • Output: Set of non-overlapping intervals and
    their burstiness scores
  • Create the frequency sequence for the term.
  • Extract bursty intervals using the MAX-1
    algorithm.
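
The first step, building the frequency sequence, might look like the sketch below. It assumes each document has already been reduced to a time-step index and a set of tokens, and that frequency means the number of documents per time step containing the term; the slides do not say whether documents or raw occurrences are counted.

```python
from collections import Counter


def frequency_sequence(term, dated_docs, num_steps):
    """Number of documents containing `term` at each time step.

    `dated_docs` is an iterable of (step, tokens) pairs, where `step` is an
    integer index on the timeline (for example, days since the first issue).
    """
    counts = Counter(step for step, tokens in dated_docs if term in tokens)
    return [counts.get(step, 0) for step in range(num_steps)]
```

For example, with two matching documents at step 1 and one at step 3, the sequence over four steps is `[0, 2, 0, 1]`. This sequence is what the interval extraction step consumes.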

11
Phase 3: Interval Indexing
  • Input: Set of bursty intervals for each term
  • Output: An index of intervals
  • Simple, easily updatable structure
  • Needs to support multi-term queries
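
One plausible shape for such an index is a map from each term to its bursty intervals, kept sorted by burstiness score. The class and method names below are hypothetical, and the actual structure used by the authors may differ.

```python
from collections import defaultdict


class IntervalIndex:
    """Maps a term to its bursty intervals, highest score first."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, term, start, end, score):
        """Insert one interval and keep the term's list sorted by score."""
        self._postings[term].append((start, end, score))
        self._postings[term].sort(key=lambda interval: interval[2], reverse=True)

    def intervals(self, term):
        """Intervals of a single term; empty list if the term is unknown."""
        return list(self._postings.get(term, []))

    def lookup(self, terms):
        """Fetch the interval lists for every term of a multi-term query."""
        return {term: self.intervals(term) for term in terms}
```

Keeping each list in descending score order is what lets a top-k procedure read the most promising intervals first.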

12
Inverted Interval Index
  • Up Next: Query Evaluation

13
Phase 4: Top-k Evaluation for Multi-Term Queries
  • Customized version of the Threshold Algorithm
    (TA) for top-k evaluation.
  • Standard version:
  • Terms-to-Documents
  • Each document either appears in a term's list or
    does not.
  • Our version (TA):
  • Terms-to-Intervals
  • A bursty interval of a term t1 may overlap
    multiple intervals of a term t2.
  • Up Next: Experiments
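
To make the standard terms-to-documents version from Phase 4 concrete, here is a minimal sketch of the classic Threshold Algorithm with summed scores. It is not the authors' customized interval variant, which additionally has to deal with one interval overlapping several intervals of another term. The function name, the input layout and the assumption of non-negative scores are illustrative.

```python
import heapq


def threshold_algorithm(lists, k):
    """Top-k documents by summed score.

    `lists` maps each query term to [(doc_id, score), ...] sorted by score in
    descending order. Scores are assumed non-negative; a document missing
    from a term's list contributes 0 for that term.
    """
    lookup = {term: dict(entries) for term, entries in lists.items()}  # random access
    seen = {}
    max_depth = max((len(entries) for entries in lists.values()), default=0)
    for depth in range(max_depth):
        threshold = 0
        for term, entries in lists.items():
            if depth >= len(entries):
                continue  # exhausted list: unseen documents score 0 here
            doc, score = entries[depth]  # sorted access
            threshold += score
            if doc not in seen:
                seen[doc] = sum(lookup[t].get(doc, 0) for t in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        # Stop once k seen documents score at least as high as any unseen one could.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])
```

The early stop is the point of the algorithm: the threshold bounds the best score any not-yet-seen document could reach, so the lists rarely need to be read to the end.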

14
Empirical Evaluation
  • San Francisco Call: a daily newspaper with
    publication dates between 1900 and 1909; 400,000
    articles
  • List of major events from 1900-1909 (from
    Wikipedia), with a query for each event

15
Major Events List

16
Experiment 1 - Query Expansion
  • Submit the respective query for each event in the
    Major Events List.
  • Get the top interval.
  • Report the 10 terms that appear in the most
    document titles within the interval.
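
The reporting step of this experiment can be sketched as a count over titles. The input layout, a list of (time step, title tokens) pairs, and the function name are assumptions made for illustration.

```python
from collections import Counter


def top_title_terms(titled_docs, interval, n=10):
    """Terms that appear in the most document titles within `interval`.

    `interval` is an inclusive (start, end) pair of time steps. Each title
    counts a term at most once.
    """
    start, end = interval
    counts = Counter()
    for step, title_tokens in titled_docs:
        if start <= step <= end:
            counts.update(set(title_tokens))
    return [term for term, _ in counts.most_common(n)]
```

Counting titles rather than occurrences keeps one long headline from dominating the list.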

17
Example 1
  • Event: King Umberto I of Italy is assassinated by
    Italian-born anarchist Gaetano Bressi.
  • Query: king assassination
  • Reported terms: Umberto, july, state, anarchist,
    italy, unit, Rome, Bressi, general, police

18
Example 2
  • Event: Louis Bleriot is the first man to fly
    across the English Channel in an aircraft.
  • Query: English channel
  • Reported terms: flight, july, miles, cross,
    aviator, attempt, return, Bleriot, condition,
    machine

19
Experiment 2 - Burst Detection
  • Submit the respective query for each event in the
    Major Events List.
  • Get the top reported interval.
  • Compare with the actual event date.
  • We use MAX-1 and MAX-2 to extract bursty
    intervals.
  • MAX-2:
  • Re-run MAX-1 on each interval
  • Obtain a nested structure
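
The nesting idea behind MAX-2 can be illustrated as follows. This is a sketch under strong assumptions: a single best-interval search (Kadane's maximum-sum scan) stands in for MAX-1, and each re-run scores the interval against its own mean frequency, which is a guess at how a second pass could find a tighter burst inside the first. All names are hypothetical.

```python
def best_interval(scores):
    """Highest-scoring contiguous interval as (start, end, score), inclusive.

    Returns (0, -1, 0.0) when no interval has a positive score.
    """
    best = (0, -1, 0.0)
    run_start, run_sum = 0, 0.0
    for i, s in enumerate(scores):
        if run_sum <= 0:
            run_start, run_sum = i, s  # restart the run at this step
        else:
            run_sum += s
        if run_sum > best[2]:
            best = (run_start, i, run_sum)
    return best


def nested_bursts(freq, lo, hi, max_depth=2):
    """Burst inside freq[lo..hi], with a nested burst inside it, and so on."""
    window = freq[lo:hi + 1]
    mean = sum(window) / len(window)  # assumed local baseline
    start, end, score = best_interval([f - mean for f in window])
    if score <= 0 or max_depth == 0:
        return None
    node = {"interval": (lo + start, lo + end), "score": score, "child": None}
    if end > start:  # a one-step interval cannot be refined further
        node["child"] = nested_bursts(freq, lo + start, lo + end, max_depth - 1)
    return node
```

On the sequence `[1, 1, 1, 5, 9, 5, 1, 1]` the outer burst spans steps 3 to 5 and the nested burst narrows to step 4, the peak, which is the kind of refinement that helps when comparing against a single event date.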

20
Examples
  • Event: A fire at the Iroquois Theater in Chicago
    kills 600.
  • Query: <theater, disaster>
  • Event: A fire aboard the steamboat General Slocum
    in New York City's East River kills 1,021.
  • Query: <steamboat, disaster>

21
Conclusion
  • The first efficient end-to-end framework for
    burstiness-aware search in document sequences.
  • Future work:
  • Evaluate on even larger corpora
  • Evaluate on more types of text

22
Thank you!!!