Trend Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Trend Analysis

Description:

Trend Analysis. Yi-Chia Wang. LTI 2nd year Master student. Analysis of Social Media ... The bursts are picking up trend in language use. Oct-30. Analysis of ... – PowerPoint PPT presentation

Slides: 23
Provided by: yichi3
Learn more at: https://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Trend Analysis


1
Trend Analysis
Analysis of Social Media
  • Yi-Chia Wang
  • LTI 2nd year Master student

2
Introduction
  • Document streams
  • Arrive continuously over time
  • Examples: e-mail, news articles, search engine query logs
  • Identify topics in document streams
  • Topic detection and tracking
  • Text mining
  • Visualization
  • Is there a better organizing principle for the
    enormous archives of document streams?
  • Temporal information in document streams

Trausan-Matu et al., 2007
3
Burst of activity
  • Topics appear, grow in intensity for a period of
    time, and then fade away.
  • Bursts correspond to points at which the
    intensity of message arrivals increases sharply
  • Problems with naive identification of bursts
  • It easily identifies large numbers of short bursts
  • It fragments a long burst into many smaller ones
  • Goal: identify bursts only when they have
    sufficient intensity
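
The fragmentation problem can be seen in a small sketch. The detector below is a hypothetical naive baseline (the function name and the max_gap threshold are illustrative assumptions, not part of the presentation): it treats any run of gaps shorter than a fixed threshold as a burst.

```python
def naive_bursts(arrival_times, max_gap):
    """Return (start, end) index pairs of runs whose gaps are all below max_gap."""
    bursts = []
    start = None
    for i in range(1, len(arrival_times)):
        gap = arrival_times[i] - arrival_times[i - 1]
        if gap < max_gap:
            if start is None:
                start = i - 1  # a new run of short gaps begins
        elif start is not None:
            bursts.append((start, i - 1))  # one long gap ends the run
            start = None
    if start is not None:
        bursts.append((start, len(arrival_times) - 1))
    return bursts


# A single slightly longer gap (3 -> 8) splits one busy period in two.
times = [0, 1, 2, 3, 8, 9, 10, 11, 30]
print(naive_bursts(times, max_gap=5))  # [(0, 3), (4, 7)]
```

A fixed threshold has no notion of the cost of switching in and out of a burst, which is what the automaton model on the following slides adds.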

4
Bursty and Hierarchical Structure in Streams
Jon Kleinberg, Department of Computer Science,
Cornell University, SIGKDD 02
5
Two-state Automaton (A) Model
  • Idea: periods of lower message intensity
    interleave with periods of higher message
    intensity
  • A begins in state q0
  • Before each message, A changes state with
    probability p
  • In state q0, messages are emitted at a slow
    rate; in state q1, messages are emitted at a
    faster rate
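
As a rough illustration of the generative process described above, the following sketch simulates the two-state automaton. The function name, the parameter names, and the choice to draw gaps from an exponential distribution at each state's rate are assumptions made for this example.

```python
import random


def simulate_two_state(n, p, slow_rate, fast_rate, seed=0):
    """Simulate n messages; return the state sequence and the gaps."""
    rng = random.Random(seed)
    rates = (slow_rate, fast_rate)  # index 0 is q0, index 1 is q1
    state = 0                       # the automaton begins in q0
    states, gaps = [], []
    for _ in range(n):
        if rng.random() < p:        # change state with probability p
            state = 1 - state
        states.append(state)
        # Gap until the next message, drawn at the current state's rate.
        gaps.append(rng.expovariate(rates[state]))
    return states, gaps


states, gaps = simulate_two_state(20, p=0.1, slow_rate=0.5, fast_rate=5.0)
print(states)
```

With a small p the state sequence shows long runs of 0s and 1s, and the gaps emitted during runs of 1s are on average ten times shorter for these example rates.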

6
Exponential Distribution
  • Modeling the message emission rate
  • Modeling the time gap between consecutive
    messages
  • Modeling by an exponential distribution whose
    parameter is the rate of message arrivals
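
For reference, the standard exponential density is written out below. The symbols x (gap length) and \alpha (arrival rate) are notation chosen here, since the slide's own symbols did not survive in the transcript.

```latex
f(x) = \alpha e^{-\alpha x}, \quad x \ge 0,
\qquad
\mathbb{E}[x] = \frac{1}{\alpha}
```

A larger rate \alpha therefore means a shorter expected gap between messages.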

Wikipedia
7
Two-state Automaton (A) Model
  • Formally, given
  • n + 1 messages with specified arrival times
  • n inter-arrival gaps x = (x1, ..., xn)
  • We want to determine the conditional probability
    of a state sequence q given x

8
Two-state Automaton (A) Model
  • Finding a state sequence q maximizing the
    probability
  • Equivalently, minimizing the following cost
    function

The cost function favors state sequences that
conform well to the sequence x of gap values, and
it favors sequences with a small number of state
transitions
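
The formula itself is missing from the transcript. As recalled from Kleinberg's paper (so the notation below is an assumption, not something the slide text confirms), the two-state cost has one term per effect described above: b counts the state transitions in q, and f_{i_t} is the exponential density of the state occupied at step t.

```latex
c(q \mid x) = b \ln\!\left(\frac{1-p}{p}\right)
            + \sum_{t=1}^{n} -\ln f_{i_t}(x_t),
\qquad
f_i(x) = \alpha_i e^{-\alpha_i x}
```

The first term grows with the number of transitions (for p < 1/2), and the second term is small when each gap x_t is likely under the state chosen for it.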
9
Infinite-state Automata Model
  • Cost Function
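
The cost function is not preserved in the transcript. The version below is recalled from Kleinberg's paper and should be read as a sketch under that assumption: s > 1 scales the rates between states, \gamma scales the transition cost, T is the total time span, and i_0 = 0.

```latex
\alpha_i = \frac{n}{T}\, s^{i},
\qquad
\tau(i, j) =
\begin{cases}
(j - i)\,\gamma \ln n & \text{if } j > i \\
0 & \text{otherwise}
\end{cases}
\qquad
c(q \mid x) = \sum_{t=0}^{n-1} \tau(i_t, i_{t+1})
            + \sum_{t=1}^{n} -\ln f_{i_t}(x_t)
```

Moving up to a more intense state costs an amount proportional to the number of levels climbed, while dropping back to a lower state is free.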

10
Computing a minimum-cost state sequence
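  • Theorem: if q is an optimal state sequence in the
    automaton truncated to a sufficiently large finite
    number of states, then it is also an optimal state
    sequence in the infinite-state automaton
  • Dynamic programming is used to search for an
    optimal state sequence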
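
A minimal sketch of that dynamic program is given below, as a Viterbi-style search over a truncated automaton with k states. The function name, the default values of s and gamma, and the specific rate and transition-cost formulas (rates scaled by s, upward moves costing gamma * ln n per level) are assumptions based on Kleinberg's paper, not details preserved in the transcript.

```python
import math


def optimal_states(gaps, k, s=2.0, gamma=1.0):
    """Return a minimum-cost state sequence for the given gaps."""
    n = len(gaps)
    base_rate = n / sum(gaps)
    rates = [base_rate * s**i for i in range(k)]

    def emit_cost(i, x):
        # -ln of the exponential density rates[i] * exp(-rates[i] * x)
        return rates[i] * x - math.log(rates[i])

    def trans_cost(i, j):
        # Moving up costs per level climbed; moving down is free.
        return (j - i) * gamma * math.log(n) if j > i else 0.0

    cost = [0.0] + [math.inf] * (k - 1)  # the automaton starts in state 0
    back = []
    for x in gaps:
        new_cost, pointers = [], []
        for j in range(k):
            prev = min(range(k), key=lambda i: cost[i] + trans_cost(i, j))
            new_cost.append(cost[prev] + trans_cost(prev, j) + emit_cost(j, x))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final state.
    state = min(range(k), key=lambda j: cost[j])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


gaps = [10] * 5 + [0.1] * 10 + [10] * 5
print(optimal_states(gaps, k=5))
```

On this toy input the long gaps at both ends are assigned state 0 and the ten short gaps in the middle are assigned a higher state. The running time is O(n k^2) for n gaps and k states.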

11
Bursts exhibit a natural nested structure
A burst of intensity j is a maximal interval over
which the optimal state sequence stays in states
of index j or higher
Bursts can also be represented as a tree. Each
burst is a node in the tree
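
The definition above can be turned into a short extraction routine. This is an illustrative sketch (the function name and the (level, start, end) tuple layout are assumptions): given a state sequence, it lists every maximal interval spent at or above each level.

```python
def bursts_from_states(states):
    """Return (level, start, end) for each maximal run with state >= level."""
    bursts = []
    for level in range(1, max(states, default=0) + 1):
        start = None
        for t, q in enumerate(states):
            if q >= level and start is None:
                start = t                           # run begins
            elif q < level and start is not None:
                bursts.append((level, start, t - 1))  # run ends
                start = None
        if start is not None:
            bursts.append((level, start, len(states) - 1))
    return bursts


print(bursts_from_states([0, 1, 1, 2, 2, 1, 0, 1, 0]))
# [(1, 1, 5), (1, 7, 7), (2, 3, 4)]
```

Every level-2 interval lies inside some level-1 interval, so the bursts nest; making each burst the child of the enclosing burst one level down gives the tree representation.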
12
Experiments
  • The model makes sense for many datasets (of an
    analogous flavor)
  • Email
  • Titles of conference papers
  • U.S. Presidential State of the Union Addresses
  • Web clickstreams

13
Email Dataset
  • Does the appearance of messages containing
    particular words exhibit a burst in the vicinity
    of significant times such as deadlines?
  • Author's own collection of email
  • June 9, 1997 to August 23, 2001
  • 34344 messages (41.7 MB)
  • Focusing on the response set

14
Results for the Word - ITR
  • ITR is the name of a large NSF program
  • The author wrote 2 proposals for it in 1999-2000:
    one a small proposal and the other a large
    one
  • The intervals are annotated with the first and
    last dates of the messages
  • The first subtree splits further into 2 subtrees
  • For the 2nd subtree, there is no burst since the
    author did not continue the submission

15
Results for the Word - prelim
  • Prelim is the term used at Cornell for non-final
    exams
  • The author taught courses in 4 of the 8 semesters
    covered by the collection of email, and each of
    these courses had 2 prelims
  • For the first of these courses, there was a
    special course email account
  • Each of the remaining 3 courses corresponds to a
    long burst and 2 shorter, more intense bursts for
    the particular prelims

The 2 structures suggest how a large folder of
email might naturally be divided into a
hierarchical set of sub-folders around certain
key events, based only on the rate of message
arrivals
16
Titles of Conference Papers
  • Goal: extracting bursts in term usage from the
    titles of conference papers over the past several
    decades
  • Problem: conference papers arrive in discrete
    batches every six months or every year, so there
    are no message inter-arrival gaps
  • Modified automaton model
  • Generating batched arrivals
  • For each state, there is an expected fraction of
    relevant documents
  • A burst is identified when the fraction of
    relevant documents increases

17
Titles of Conference Papers
  • Cost function for each arrival batch
  • The weight of a burst: the improvement in cost
    from using state q1 rather than state q0
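
The per-batch cost formula is not preserved in the transcript. The sketch below assumes the binomial form recalled from Kleinberg's paper: a batch of d documents, r of them relevant, costs the negative log probability of seeing r relevant documents when each is relevant with probability p. The function names and the example values of p0 and p1 are illustrative assumptions.

```python
import math


def batch_cost(p, r, d):
    """-ln of the binomial probability of r relevant documents out of d."""
    log_prob = (math.log(math.comb(d, r))
                + r * math.log(p)
                + (d - r) * math.log(1 - p))
    return -log_prob


def burst_weight(batches, p0, p1):
    """Total cost saved over the batches by using state q1 instead of q0."""
    return sum(batch_cost(p0, r, d) - batch_cost(p1, r, d)
               for r, d in batches)


# (relevant, total) per batch; q0 expects 5% relevant, q1 expects 10%.
print(burst_weight([(10, 100), (12, 100)], p0=0.05, p1=0.10) > 0)  # True
```

The weight is positive when the observed fraction of relevant documents is closer to what q1 expects than to what q0 expects, and negative otherwise.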

18
SIGMOD and VLDB, 1975-2001
  • Considering each word in paper titles
  • The 30 bursts of highest weight
  • Bursts with no ending date: the interval
    extends to the most recent conference
  • These bursty words are different from a list of
    common words
  • The bursts pick up trends in language use

19
STOC and FOCS, 1969-2001
  • The 30 bursts of highest weight
  • Particular titling conventions that were in
    fashion for certain periods
  • Example title: How to construct random functions

20
U.S. Presidential State of the Union Addresses
Kleinberg, SIGKDD 02
21
Web usage data: clickstreams
  • Settings
  • 80 undergraduate students
  • Two and a half months in Spring 2000
  • For every URL w, all bursts in the stream of
    visits to w are determined
  • Focusing on high-weighted bursts as well as those
    that involve at least 10 distinct users
  • Results
  • High-ranked bursts involve the URLs of the online
    class reading assignments, centered on intervals
    shortly before and during the weekly sessions at
    which they were discussed
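
The filtering step in the settings above might look like the following sketch. The data layout is hypothetical: bursts are (url, start, end, weight) tuples and visits are (url, time, user) tuples, neither of which is specified in the presentation.

```python
def rank_bursts(bursts, visits, min_users=10):
    """Keep bursts seen by at least min_users distinct users, by weight."""
    kept = []
    for url, start, end, weight in bursts:
        # Distinct users who visited this URL inside the burst interval.
        users = {user for v_url, t, user in visits
                 if v_url == url and start <= t <= end}
        if len(users) >= min_users:
            kept.append((url, start, end, weight))
    return sorted(kept, key=lambda b: b[3], reverse=True)


visits = [("a", t, f"u{t}") for t in range(10)]
visits += [("b", 1, "u1"), ("b", 2, "u2")]
bursts = [("a", 0, 9, 3.0), ("b", 0, 5, 7.0)]
print(rank_bursts(bursts, visits))  # [('a', 0, 9, 3.0)]
```

The burst on "b" has the higher weight but is dropped because only two distinct users contributed to it.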

22
Conclusions
  • Modeling streams using an infinite-state
    automaton
  • State transitions lead to bursts
  • First story detection: a single message on which
    the associated state transition occurred
  • The model offers a means of structuring the
    information from our patterns of interacting and
    communicating
  • Document streams have a strong temporal character
  • In many domains, we are accumulating detailed
    records of our own communication and behavior