Title: On Burstiness-Aware Search for Document Sequences
1. On Burstiness-Aware Search for Document Sequences
- Theodoros Lappas, Benjamin Arai
- Manolis Platakis, Dimitrios Kotsakos
- Dimitrios Gunopulos
- SIGKDD 2009
2. Outline
- The Problem: how to effectively search through large document sequences (e.g. newspapers)
- Previous Work
- Using Bursty Terms to identify Events
- Modeling Burstiness using Discrepancy Theory
- Our Search Framework
- Experiments
3. The Problem
- Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query.
- Consider the San Francisco Call, a daily newspaper from the 1900s.
- We are given the query <theater, disaster>.
- Two candidate events are relevant to the query:
  - The disastrous fire of 1903 in the Iroquois Theater in Chicago
  - A disastrous performance given by an actor in a local theater
- Clearly the first event is far more influential: articles on this event should be ranked higher!
4. Previous Work
- Burstiness has been explored in different domains:
  - Burst Detection - Kleinberg 2002
  - Stream Clustering - He et al. 2007
  - Graph Evolution - Kumar et al. 2003
  - Event Detection - Fung et al. 2005
- Nothing on Burstiness-aware Search:
  - Standard Information Retrieval techniques do not consider the underlying events discussed in the collection.
  - Event Detection techniques do not consider user input.
5. Burstiness
- Major events are discussed in numerous articles for an extended timeframe.
- The event's keywords exhibit high-frequency bursts during the timeframe.
- Frequency of the term "earthquake", as it appeared in the SF Call (1908-1909).
- Bursty periods: periods of unusually high frequency.
- "Unusual" means deviating from an expected baseline.
6. Modeling Burstiness using Discrepancy Theory
- Discrepancy: used to express and quantify the deviation from the norm.
- In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency.
- Maximal Interval: one that does not include, and is not included in, an interval of higher score.
- MAX-1: linear-time algorithm for Maximal Interval extraction.
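To make the idea of a highest-discrepancy interval concrete, here is a minimal sketch. It is not the MAX-1 algorithm from the talk: it assumes the score of an interval is simply the sum of (observed - expected) over its time steps, and it uses a standard Kadane-style linear scan to find the single best interval. The function name `max_discrepancy_interval` is hypothetical.

```python
from typing import List, Tuple


def max_discrepancy_interval(
    observed: List[float], expected: List[float]
) -> Tuple[int, int, float]:
    """Return (start, end, score) of the contiguous interval (inclusive
    indices) whose summed discrepancy, observed - expected, is largest.

    Assumption: interval score = sum of per-step discrepancies.
    This is a Kadane-style scan, not the MAX-1 algorithm itself.
    """
    if not observed or len(observed) != len(expected):
        raise ValueError("sequences must be non-empty and of equal length")

    best_start, best_end = 0, 0
    best_score = observed[0] - expected[0]
    run_start, run_score = 0, 0.0

    for i, (obs, exp) in enumerate(zip(observed, expected)):
        # A running sum that is not positive cannot help a later interval,
        # so start a fresh candidate interval at position i.
        if run_score <= 0:
            run_start, run_score = i, 0.0
        run_score += obs - exp
        if run_score > best_score:
            best_start, best_end, best_score = run_start, i, run_score

    return best_start, best_end, best_score


# Toy frequency sequence with a burst at positions 2-3, flat baseline of 2.
print(max_discrepancy_interval([1, 2, 9, 8, 1, 2], [2] * 6))  # (2, 3, 13.0)
```

The scan visits each time step once, so it runs in linear time. Extracting all maximal intervals rather than only the top one needs additional bookkeeping that this sketch omits.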
7. Baseline - Discussion
- Baseline can be dynamic:
  - Frequency sequence(s) from previous year(s)
  - Time Series Decomposition to extract Seasonal, Trend and Irregular components
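As an illustration of the first option, the sketch below builds an expected-frequency baseline by averaging aligned frequency sequences from earlier years, then computes the per-step discrepancy against it. Averaging is an assumption made here for illustration, since the slides do not say how previous years are combined. The names `baseline_from_history` and `discrepancy` are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def baseline_from_history(previous_years: Sequence[Sequence[float]]) -> List[float]:
    """Average several aligned frequency sequences position by position.

    Assumption: each inner sequence covers the same calendar positions
    (e.g. day 1..N of a year), so index i means the same day in every year.
    """
    if not previous_years:
        raise ValueError("need at least one previous year")
    length = len(previous_years[0])
    if any(len(year) != length for year in previous_years):
        raise ValueError("all yearly sequences must have the same length")
    return [mean(values) for values in zip(*previous_years)]


def discrepancy(observed: Sequence[float], expected: Sequence[float]) -> List[float]:
    """Per-step deviation of the observed frequency from the baseline."""
    return [obs - exp for obs, exp in zip(observed, expected)]


expected = baseline_from_history([[2, 4, 6], [4, 4, 0]])
print(discrepancy([5, 4, 10], expected))
```

The seasonal/trend/irregular decomposition mentioned on the slide would replace `baseline_from_history` with a decomposition step and is not shown here.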
8. A Diagram of our framework
9. Phase 1: Preprocessing
- The input is a raw document sequence.
- The output is the set of terms to be monitored.
- Preprocessing methods:
  - Stemming, synonym matching, etc.
  - Stopword removal
  - Frequency pruning for rare words
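A minimal sketch of this phase, covering only stopword removal and frequency pruning (stemming and synonym matching are left out). The stopword list, the `min_doc_freq` threshold, and the function names are illustrative assumptions, not values from the talk.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def monitored_terms(documents: Iterable[str], min_doc_freq: int = 2) -> Set[str]:
    """Return the terms to monitor: non-stopwords that occur in at least
    `min_doc_freq` documents (rarer terms are pruned)."""
    doc_freq: Counter = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(tokenize(text)) - STOPWORDS)
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}


docs = [
    "The fire in the theater",
    "Theater fire kills many",
    "A quiet day in town",
]
print(sorted(monitored_terms(docs)))  # ['fire', 'theater']
```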
10. Phase 2: Retrieval of Bursty Intervals
- Input: a term
- Output: set of non-overlapping intervals with their burstiness scores
- Create the frequency sequence for the term.
- Extract bursty intervals using the MAX-1 algorithm.
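The first step, building a term's frequency sequence, can be sketched as follows. The sketch assumes a daily granularity and defines frequency as the number of documents per day that contain the term as a whitespace-separated token. Both choices, and the name `frequency_sequence`, are assumptions for illustration.

```python
from datetime import date
from typing import Iterable, List, Tuple


def frequency_sequence(
    docs: Iterable[Tuple[date, str]], term: str, start: date, num_days: int
) -> List[int]:
    """Count, for each day in [start, start + num_days), the documents
    whose text contains `term` as a whole lowercase token."""
    counts = [0] * num_days
    for day, text in docs:
        offset = (day - start).days
        if 0 <= offset < num_days and term in text.lower().split():
            counts[offset] += 1
    return counts


# Made-up toy documents, for illustration only.
docs = [
    (date(1906, 4, 18), "Earthquake strikes city"),
    (date(1906, 4, 19), "earthquake relief"),
    (date(1906, 4, 19), "fire and earthquake damage"),
    (date(1906, 4, 20), "baseball scores"),
]
print(frequency_sequence(docs, "earthquake", date(1906, 4, 18), 3))  # [1, 2, 0]
```

The resulting sequence is what a burst-extraction step would compare against the expected baseline.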
11. Phase 3: Interval Indexing
- Input: set of bursty intervals for each term
- Output: an index of intervals
- Simple, easily updatable structure
- Needs to support multi-term queries
12. Inverted Interval Index
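The slide's diagram is not recoverable from the text, so the following is only one plausible shape for such an index: a mapping from each term to a posting list of its bursty intervals, kept sorted by descending score so that a threshold-style top-k scan can read the best intervals first. The class and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class BurstyInterval(NamedTuple):
    start: int    # first time step of the burst (inclusive)
    end: int      # last time step of the burst (inclusive)
    score: float  # burstiness score of the interval


class IntervalIndex:
    """Maps each term to its bursty intervals, best score first."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[BurstyInterval]] = defaultdict(list)

    def add(self, term: str, interval: BurstyInterval) -> None:
        """Insert one interval and keep the posting list sorted by score."""
        postings = self._postings[term]
        postings.append(interval)
        postings.sort(key=lambda iv: iv.score, reverse=True)

    def lookup(self, term: str) -> List[BurstyInterval]:
        """Return a copy of the term's posting list (empty if unknown)."""
        return list(self._postings.get(term, []))


index = IntervalIndex()
index.add("theater", BurstyInterval(10, 20, 3.0))
index.add("theater", BurstyInterval(40, 45, 7.5))
print(index.lookup("theater")[0])
```

Adding a newly detected interval touches only that term's posting list, which is one way to read "easily updatable".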
13. Phase 4: Top-k Evaluation for Multi-Term Queries
- Customized version of the Threshold Algorithm (TA) for top-k evaluation.
- Standard version:
  - Terms-to-Documents
  - Each document either appears in a term's list or not
- Our version (TA):
  - Terms-to-Intervals
  - A bursty interval of a term t1 may overlap multiple intervals of a term t2.
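The overlap issue can be seen in a brute-force sketch. This is not the customized Threshold Algorithm: it enumerates every combination of one interval per query term, keeps the combinations whose intervals share a common time span, and scores that span by the sum of the contributing burstiness scores. The scoring rule and the name `top_k_overlaps` are assumptions for illustration.

```python
import heapq
from itertools import product
from typing import Dict, List, Sequence, Tuple

# (start, end, burstiness score), with inclusive endpoints.
Interval = Tuple[int, int, float]


def top_k_overlaps(
    postings: Dict[str, List[Interval]], query: Sequence[str], k: int
) -> List[Interval]:
    """Brute-force top-k: intersect one interval per query term and rank
    the non-empty intersections by summed score (assumed scoring rule)."""
    lists = [postings.get(term, []) for term in query]
    if not lists or any(not lst for lst in lists):
        return []

    candidates: List[Interval] = []
    for combo in product(*lists):
        start = max(iv[0] for iv in combo)
        end = min(iv[1] for iv in combo)
        if start <= end:  # all chosen intervals share [start, end]
            candidates.append((start, end, sum(iv[2] for iv in combo)))
    return heapq.nlargest(k, candidates, key=lambda c: c[2])


postings = {
    "theater": [(0, 30, 2.0), (100, 120, 5.0)],
    # The second "disaster" interval overlaps both "theater" intervals.
    "disaster": [(20, 40, 1.0), (25, 110, 4.0)],
}
print(top_k_overlaps(postings, ["theater", "disaster"], k=2))
# [(100, 110, 9.0), (25, 30, 6.0)]
```

Because one interval can contribute to several candidates, the "each item appears in a list or not" assumption of the standard TA no longer holds. A threshold-based version would avoid enumerating all combinations.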
14. Empirical Evaluation
- San Francisco Call: a daily newspaper with publication dates between 1900 and 1909; 400,000 articles.
- List of Major Events from 1900-1909 (from Wikipedia); one query for each event.
15. Major Events List
16. Experiment 1 - Query Expansion
- Submit the respective query for each event in the Major Events List.
- Get the top interval.
- Report the 10 terms that appear in the most document titles within the interval.
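The reporting step can be sketched as a document-frequency count over titles. The sketch assumes each document carries an integer time step and a title, tokenizes titles by whitespace, and does no stopword removal or stemming. The name `expansion_terms` and the toy titles are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def expansion_terms(
    titled_docs: Iterable[Tuple[int, str]], start: int, end: int, n: int = 10
) -> List[str]:
    """Return the n terms that occur in the most document titles among
    documents dated within [start, end] (inclusive)."""
    title_counts: Counter = Counter()
    for day, title in titled_docs:
        if start <= day <= end:
            # Count each term at most once per title.
            title_counts.update(set(title.lower().split()))
    return [term for term, _ in title_counts.most_common(n)]


# Made-up titles; the top interval is assumed to be days 1-3.
docs = [
    (1, "King Umberto shot"),
    (2, "Umberto funeral in Rome"),
    (3, "Anarchist held Umberto mourned"),
    (9, "Harvest news"),
]
print(expansion_terms(docs, start=1, end=3, n=1))  # ['umberto']
```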
17. Example 1
Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi.
Query: king assassination
Top 10 terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police
18. Example 2
Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft.
Query: English channel
Top 10 terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine
19. Experiment 2 - Burst Detection
- Submit the respective query for each event in the Major Events List.
- Get the top reported interval.
- Compare with the actual event date.
- We use MAX-1 and MAX-2 to extract bursty intervals.
- MAX-2:
  - Re-run MAX-1 on each interval
  - Obtain a nested structure
20. Examples
- Event: A fire at the Iroquois Theater in Chicago kills 600.
- Query: <theater, disaster>
- Event: A fire aboard the steamboat General Slocum in New York City's East River kills 1,021.
- Query: <steamboat, disaster>
21. Conclusion
- The first efficient end-to-end framework for burstiness-aware search in document sequences.
- Future Work:
  - Evaluate on even larger corpora
  - Evaluate on more types of text
22. Thank you!!!