Title: Trend Analysis
1Trend Analysis
Analysis of Social Media
- Yi-Chia Wang
- LTI 2nd year Master student
2Introduction
- Document streams
- Arrive continuously over time
- E-mail, news articles, search engine query logs, etc.
- Identify topics in document streams
- Topic detection and tracking
- Text mining
- Visualization
- Is there a better organizing principle for the enormous archives of document streams?
- Temporal information in document streams
Trausan-Matu et al., 2007
3Burst of activity
- Topics appear, grow in intensity for a period of time, and then fade away.
- Bursts correspond to points at which the intensity of message arrivals increases sharply
- Problems with naive identification of bursts
- Easily identifying large numbers of short bursts
- Fragmenting long bursts into many smaller ones
- Goal: identifying bursts only when they have sufficient intensity
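The fragmentation problem can be illustrated with a minimal sketch of a naive detector. The function name `naive_bursts`, the fixed gap threshold, and the sample gaps are all hypothetical, chosen only to show the effect; they are not taken from the original work.

```python
def naive_bursts(gaps, threshold):
    """Return (start, end) index pairs of maximal runs of gaps below threshold.

    This is the naive rule: a burst is any stretch of consecutive
    inter-arrival gaps that are shorter than a fixed threshold.
    """
    bursts, start = [], None
    for i, gap in enumerate(gaps):
        if gap < threshold:
            if start is None:
                start = i          # a run of short gaps begins
        elif start is not None:
            bursts.append((start, i - 1))  # a longer gap ends the run
            start = None
    if start is not None:
        bursts.append((start, len(gaps) - 1))
    return bursts


# One busy period interrupted by two slightly longer gaps (value 3).
gaps = [10, 1, 1, 3, 1, 1, 3, 1, 1, 10]
print(naive_bursts(gaps, threshold=2))
```

With this threshold, the single busy period is reported as three separate short bursts, which is the behavior the automaton model is meant to avoid.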
4Jon Kleinberg, Department of Computer Science, Cornell University, SIGKDD 02
Bursty and Hierarchical Structure in Streams
5Two-state Automaton (A) Model
- Idea: periods of lower message intensity interleave with periods of higher message intensity
- A begins in state q0
- A changes state with probability p
- When in state q0, messages are emitted at a slow rate; when in state q1, messages are emitted at a faster rate
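The generative view of the automaton can be sketched as a small simulation. This is an illustration under stated assumptions: the function name `simulate_two_state`, the parameter names, and the choice to draw gaps from an exponential distribution are hypothetical conveniences, and the state change is assumed to be decided before each gap.

```python
import random


def simulate_two_state(n, p, slow_rate, fast_rate, seed=0):
    """Simulate n inter-arrival gaps from a two-state automaton.

    State 0 (q0) emits at slow_rate, state 1 (q1) at fast_rate.
    Before each gap the automaton changes state with probability p.
    """
    rng = random.Random(seed)
    rates = (slow_rate, fast_rate)
    state = 0  # the automaton begins in q0
    states, gaps = [], []
    for _ in range(n):
        if rng.random() < p:
            state = 1 - state  # switch between q0 and q1
        states.append(state)
        # Gap lengths are exponentially distributed with the state's rate.
        gaps.append(rng.expovariate(rates[state]))
    return states, gaps


states, gaps = simulate_two_state(n=20, p=0.1, slow_rate=1.0, fast_rate=10.0)
print(states)
```

Burst detection then amounts to the inverse problem: given only the gaps, recover a plausible state sequence.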
6Exponential Distribution
- Modeling the message emission rate
- Modeling the time gap between consecutive messages
- Modeling by an exponential distribution whose parameter is the rate of message arrivals
Wikipedia
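For reference, the exponential density for a gap of length x can be written as follows. The symbols \alpha, \alpha_0 and \alpha_1 are notation assumed here for the arrival rates of a generic state, q0 and q1 respectively.

```latex
f(x) = \alpha e^{-\alpha x}, \quad x \ge 0,
\qquad \mathbb{E}[x] = \frac{1}{\alpha}

f_0(x) = \alpha_0 e^{-\alpha_0 x}, \qquad
f_1(x) = \alpha_1 e^{-\alpha_1 x}, \qquad
\alpha_1 > \alpha_0
```

A higher rate means a shorter expected gap, so q1 (rate \alpha_1) explains runs of short gaps better than q0.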
7Two-state Automaton (A) Model
- Formally, given
- n+1 messages with specified arrival times
- n inter-arrival gaps x = (x1, ..., xn)
- We want to determine the conditional probability of a state sequence q given the gaps x
8Two-state Automaton (A) Model
- Finding a state sequence q maximizing the probability
- Equivalently, minimizing the following cost function
- One term favors state sequences that conform well to the sequence x of gap values
- The other term favors sequences with a small number of state transitions
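One form of the posterior and the cost function consistent with the bullets above is sketched here. It assumes exponential gap densities, a switching probability p, and notation introduced for illustration: b is the number of state transitions in q, n is the number of gaps, and i_t is the index of the state occupied during gap x_t.

```latex
\Pr[q \mid x] \;\propto\; p^{\,b} (1-p)^{\,n-b} \prod_{t=1}^{n} f_{i_t}(x_t)

c(q \mid x) \;=\; b \,\ln\!\left(\frac{1-p}{p}\right)
            \;+\; \sum_{t=1}^{n} -\ln f_{i_t}(x_t)
```

Taking the negative logarithm of the posterior and dropping the terms that do not depend on q (the normalizing constant and n ln(1-p)) gives the cost. The first term penalizes state transitions (for p < 1/2) and the second rewards state sequences whose rates fit the observed gaps.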
9Infinite-state Automata Model
10Computing a minimum-cost state sequence
- THEOREM: If q is an optimal state sequence in the infinite-state automaton, then it is also an optimal state sequence in a finite-state automaton with sufficiently many states
- Dynamic programming is used to search for an optimal state sequence
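The dynamic-programming search can be sketched as a Viterbi-style pass over a k-state automaton. The parameterization below is an assumption made for illustration: state i emits at rate (n/T) * s**i, moving up from state i to state j costs (j - i) * gamma * ln(n), and moving down is free. The function name `min_cost_states` and its defaults are hypothetical.

```python
import math


def min_cost_states(gaps, k=3, s=2.0, gamma=1.0):
    """Return a minimum-cost state sequence (one state per gap)."""
    n = len(gaps)
    total = sum(gaps)
    # Assumed rates: base rate n/T, scaled by s for each higher state.
    rates = [(n / total) * s**i for i in range(k)]

    def emit_cost(i, x):
        # Negative log of the exponential density rate * exp(-rate * x).
        return rates[i] * x - math.log(rates[i])

    def trans_cost(i, j):
        # Assumed: moving up is penalized per level, moving down is free.
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    # cost[j]: cheapest cost of any sequence ending in state j so far.
    # The automaton starts in state 0.
    cost = [0.0 if j == 0 else math.inf for j in range(k)]
    back = []
    for x in gaps:
        new_cost, pointers = [], []
        for j in range(k):
            best = min(range(k), key=lambda i: cost[i] + trans_cost(i, j))
            new_cost.append(cost[best] + trans_cost(best, j) + emit_cost(j, x))
            pointers.append(best)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final state.
    state = min(range(k), key=lambda j: cost[j])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


gaps = [10.0] * 5 + [1.0] * 10 + [10.0] * 5
print(min_cost_states(gaps))
```

The running time of this sketch is O(n * k^2); the transition penalty is what keeps a long burst from being fragmented by an occasional longer gap.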
11Bursts exhibit a natural nested structure
- A burst of intensity j is a maximal interval over which the state sequence stays in states of index j or higher
- Bursts can also be represented as a tree, in which each burst is a node
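The definition of a burst of intensity j translates directly into code. The sketch below assumes the state sequence is given as a list of integer state indices; the function name `extract_bursts` and the sample sequence are hypothetical.

```python
def extract_bursts(states):
    """Return (intensity, start, end) for every burst in a state sequence.

    A burst of intensity j is a maximal interval of positions whose
    state index is j or higher.
    """
    bursts = []
    for level in range(1, max(states, default=0) + 1):
        start = None
        for t, s in enumerate(states):
            if s >= level and start is None:
                start = t                      # interval at this level opens
            elif s < level and start is not None:
                bursts.append((level, start, t - 1))  # interval closes
                start = None
        if start is not None:
            bursts.append((level, start, len(states) - 1))
    return bursts


states = [0, 1, 1, 2, 2, 1, 2, 1, 0, 0, 1, 0]
for burst in extract_bursts(states):
    print(burst)
```

Every burst of intensity j+1 lies inside exactly one burst of intensity j, which is why the bursts nest and can be arranged as a tree.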
12Experiments
- The model makes sense for many datasets (of an analogous flavor)
- Email
- Titles of conference papers
- U.S. Presidential State of the Union Addresses
- Web clickstreams
13Email Dataset
- Do messages containing particular words exhibit a burst in the vicinity of significant times such as deadlines?
- Author's own collection of email
- June 9, 1997 to August 23, 2001
- 34344 messages (41.7 MB)
- Focusing on the response set
14Results for the Word - ITR
- ITR is the name of a large NSF program
- The author wrote 2 proposals for it in 1999-2000: one small and one large
- The intervals are annotated with the first and last dates of the messages
- The first subtree splits further into 2 subtrees
- For the 2nd subtree, there is no further burst, since the author did not continue the submission
15Results for the Word - prelim
- Prelim is the term used at Cornell for non-final exams
- The author taught courses in 4 of the 8 semesters covered by the collection of email, and each of these courses had 2 prelims
- For the first of these courses, there was a special course email account
- For the remaining 3 courses, each corresponds to a long burst and 2 shorter, more intense bursts for the particular prelims
The 2 structures suggest how a large folder of
email might naturally be divided into a
hierarchical set of sub-folders around certain
key events, based only on the rate of message
arrivals
16Titles of Conference Papers
- Goal: extracting bursts in term usage from the titles of conference papers over the past several decades
- Problem: conference papers arrive in discrete batches every half year or year, so there are no message inter-arrival gaps
- Modified automaton model
- Generating batched arrivals
- For each state, there is an expected fraction of relevant documents
- A burst is identified when the fraction of relevant documents increases
17Titles of Conference Papers
- Cost function for each arrival batch
- The weight of a burst: the improvement in cost from using state q1 rather than state q0
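A minimal sketch of the per-batch cost and the burst weight follows. It assumes a binomial model: a batch contains d documents, r of them relevant, and state q_i expects a fraction p_i of relevant documents. The binomial form, the function names `batch_cost` and `burst_weight`, and the sample numbers are assumptions made for illustration.

```python
import math


def batch_cost(p, r, d):
    """Negative log-probability of r relevant documents out of d.

    Assumes each document is relevant independently with probability p.
    """
    log_binom = math.lgamma(d + 1) - math.lgamma(r + 1) - math.lgamma(d - r + 1)
    return -(log_binom + r * math.log(p) + (d - r) * math.log(1 - p))


def burst_weight(batches, p0, p1):
    """Improvement in cost from using q1 (fraction p1) instead of q0 (p0).

    batches is a list of (relevant, total) pairs covering the burst interval.
    """
    return sum(batch_cost(p0, r, d) - batch_cost(p1, r, d) for r, d in batches)


# Two batches in which the term appears far more often than the base rate.
print(burst_weight([(30, 100), (40, 100)], p0=0.1, p1=0.2))
```

A positive weight means the higher-fraction state explains the interval better; ranking bursts by this weight gives lists such as "the 30 bursts of highest weight".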
18SIGMOD and VLDB, 1975-2001
- Considering each word in paper titles
- The 30 bursts of highest weight
- Bursts with no ending date: the interval extends to the most recent conference
- These bursty words are different from a list of common words
- The bursts pick up trends in language use
19STOC and FOCS, 1969-2001
- The 30 bursts of highest weight
- Particular titling conventions that were in fashion for certain periods
- Example: "How to construct random functions"
20U.S. Presidential State of the Union Addresses
Kleinberg, SIGKDD 02
21Web usage data clickstreams
- Settings
- 80 undergraduate students
- Two and a half months in Spring 2000
- For every URL w, all bursts in the stream of visits to w are determined
- Focusing on high-weighted bursts as well as those that involve at least 10 distinct users
- Results
- High-ranked bursts involve the URLs of the online
class reading assignments, centered on intervals
shortly before and during the weekly sessions at
which they were discussed
22Conclusions
- Modeling streams using an infinite-state automaton
- State transitions lead to bursts
- First story detection: the single message on which the associated state transition occurred
- The model offers a means of structuring the information from our patterns of interacting and communicating
- Document streams have a strong temporal character
- In many domains, we are accumulating detailed
records of our own communication and behavior