Transcript and Presenter's Notes

Title: Topics Detection and Tracking


1
Topics Detection and Tracking
  • Presented by CHU Huei-Ming 2004/03/17

2
Reference
  • Pattern Recognition in Speech and Language
    Processing
  • Chap. 12 Modeling Topics for Detection and
    Tracking
  • James Allan
  • University of Massachusetts Amherst
  • Publisher: CRC Press, published 2003/02
  • UMass at TDT 2004
  • Margaret Connell, Ao Feng, Giridhar Kumaran, Hema
    Raghavan, Chirag Shah, James Allan
  • University of Massachusetts Amherst
  • TDT 2004 workshop

3
Topic Detection and Tracking (1/6)
  • The goal of TDT research is to organize news
    stories by the events that they describe.
  • The TDT research program began in 1996 as a
    collaboration between Carnegie Mellon University,
    Dragon Systems, the University of Massachusetts
    and DARPA
  • To find out how well classic IR technologies
    addressed TDT, they created a small collection of
    news stories and identified some topics within
    them

4
Topic Detection and Tracking (2/6)
  • Event
  • something that happens at some specific time and
    place, along with all necessary preconditions and
    unavoidable consequences
  • Topic
  • captures the larger set of happenings that are
    related to some triggering event
  • By forcing the additional events to be directly
    related, the topic is prevented from spreading
    out to include too much news

5
Topic Detection and Tracking (3/6)
  • TDT Tasks
  • Segmentation
  • Break an audio track into discrete stories, each
    on a single topic
  • Cluster Detection (Detection)
  • Place all arriving news stories into groups based
    on their topics
  • If no existing group fits, the system must decide
    whether to create a new topic
  • Each story is placed in precisely one cluster
  • Tracking
  • Starts with a small set of news stories that a
    user has identified as being on the same topic
  • The system must monitor the stream of arriving
    news to find all additional stories on the same
    topic

6
Topic Detection and Tracking (4/6)
  • New Event Detection (first story detection)
  • Focuses on the cluster creation aspect of cluster
    detection
  • Evaluated on its ability to decide when a new
    topic (event) appears
  • Link Detection
  • Determine whether or not two randomly presented
    stories discuss the same topic
  • The solution of this task could be used to solve
    new event detection

7
Topic Detection and Tracking (5/6)
  • Corpora
  • TDT-2: in 2002 it is being augmented with some
    Arabic news from the same time period
  • TDT-3: created for the 1999 evaluation; stories
    from four Arabic sources are being added during
    2002

8
Topic Detection and Tracking (6/6)
  • Evaluation
  • P(target) is the prior probability that a story
    will be on topic
  • Cx are the user-specified values that reflect the
    cost associated with each error
  • P(miss) and P(fa) are the actual system error
    rates
  • Within TDT evaluations, Cmiss = 10 and Cfa = 1
  • P(target) = 1 - P(off-target) = 0.02 (derived from
    training data)
  • These components combine into the detection cost
    shown below
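The cost function referenced above is the standard TDT detection cost as
defined in the TDT evaluation plans; the normalized form divides by the
better of the two trivial systems (always answer YES or always answer NO):

    C_{Det} = C_{Miss} \cdot P(Miss) \cdot P(target)
              + C_{FA} \cdot P(FA) \cdot \bigl(1 - P(target)\bigr)

    C_{Det,norm} = \frac{C_{Det}}{\min\bigl(C_{Miss} \cdot P(target),\;
                   C_{FA} \cdot (1 - P(target))\bigr)}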

9
Basic Topic Model
  • Vector Space
  • Represent items (stories or topics) as vectors in
    a high-dimensional space
  • The most common comparison function is the cosine
    of the angle between the two vectors
  • Language Models
  • A topic is represented as a probability
    distribution of words
  • The initial probability estimates come from the
    maximum likelihood estimate based on the document
  • Use of topic model
  • See how likely the particular story could be
    generated by the model
  • Compare them directly with a symmetric version of
    the Kullback-Leibler divergence (both comparisons
    are sketched below)
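A minimal sketch (in Python) of the two comparisons named on this slide:
cosine similarity between term-frequency vectors, and a symmetric
Kullback-Leibler divergence between smoothed unigram language models. The
smoothing constant eps is an illustrative assumption, not a value from the
chapter.

    import math
    from collections import Counter

    def cosine(story_a, story_b):
        """Cosine of the angle between two bag-of-words vectors."""
        va, vb = Counter(story_a), Counter(story_b)
        dot = sum(va[w] * vb[w] for w in va if w in vb)
        norm = math.sqrt(sum(c * c for c in va.values())) * \
               math.sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    def unigram_model(story, vocab, eps=1e-6):
        """Maximum-likelihood unigram estimates, lightly smoothed over vocab."""
        counts = Counter(story)
        total = sum(counts.values()) + eps * len(vocab)
        return {w: (counts[w] + eps) / total for w in vocab}

    def symmetric_kl(story_a, story_b):
        """KL(a||b) + KL(b||a) over the two stories' smoothed unigram models."""
        vocab = set(story_a) | set(story_b)
        pa, pb = unigram_model(story_a, vocab), unigram_model(story_b, vocab)
        kl_ab = sum(pa[w] * math.log(pa[w] / pb[w]) for w in vocab)
        kl_ba = sum(pb[w] * math.log(pb[w] / pa[w]) for w in vocab)
        return kl_ab + kl_ba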

10
Implementing the Models (1/3)
  • Named Entities
  • News is usually about people, so it seems
    reasonable that their names could be treated
    specially
  • Treat the named entities as a separate part of the
    model and then merge the parts
  • Boost the weight of any words in the stories that
    come from names, give them a larger contribution
    to the similarity when the names are in common
  • Improves the results slightly, but no strong gains
    so far (a weighting sketch follows below)
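An illustrative sketch of the weight-boosting idea: multiply the counts of
named-entity terms before the cosine comparison. The boost factor and the
externally supplied set of entity terms are assumptions for illustration
only; the chapter reports just a slight improvement from such treatment.

    from collections import Counter

    def boosted_vector(story_terms, entity_terms, boost=2.0):
        """Term-frequency vector where named-entity terms get extra weight."""
        vec = Counter(story_terms)
        for term in vec:
            if term in entity_terms:
                vec[term] *= boost
        return vec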

11
Implementing the Models (2/3)
  • Document Expansion
  • In the segmentation task, a possible segmentation
    boundary could be checked by comparing the models
    generated by text on either side
  • The text could be used as a query to retrieve a
    few dozen related stories and then the most
    frequently occurring words from those stories
    could be used for the comparison
  • Relevance models result in substantial
    improvements in the link detection task
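A rough sketch of document expansion as described above: treat the story as
a query, retrieve a few dozen related stories, and add their most frequent
terms to the story's representation. The retrieve callable is a placeholder
for whatever retrieval engine is available; the counts 30 and 50 are
illustrative assumptions.

    from collections import Counter

    def expand_story(story_terms, retrieve, k_docs=30, k_terms=50):
        """Augment a story's terms with frequent terms from related stories."""
        related = retrieve(story_terms, k_docs)   # placeholder retrieval call
        pooled = Counter()
        for doc in related:                       # each doc is a list of terms
            pooled.update(doc)
        expansion = [t for t, _ in pooled.most_common(k_terms)]
        return list(story_terms) + expansion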

12
Implementing the Models (3/3)
  • Time decay
  • The likelihood that two stories discuss the same
    topic diminishes as the stories are further
    separated in time
  • In a vector space model, the cosine similarity
    function can be changed so that it includes a time
    decay (a decay sketch follows below)
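One way to fold time decay into the vector-space comparison is to scale the
cosine score by an exponential decay in the number of days separating the
two stories. The exact functional form and the half-life parameter below are
illustrative assumptions, not necessarily the form used in the chapter.

    import math

    def time_decayed_similarity(cos_sim, days_apart, half_life=30.0):
        """Cosine similarity damped as stories grow further apart in time."""
        decay = math.exp(-math.log(2) * days_apart / half_life)
        return cos_sim * decay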

13
Comparing Models (1/3)
  • Nearest Neighbors
  • In the vector space model, a topic might be
    represented as a single vector
  • To determine whether or not that story is on any
    of the existing topics we consider the distance
    between the story's vector and the closest topic
    vector
  • If it falls outside the specified distance, the
    story is likely to be the seed of a new topic and
    a new vector can be formed

14
Comparing Models (2/3)
  • Decision Trees
  • The best place for decision trees within TDT may
    be the segmentation task
  • There are numerous training instances
    (hand-segmented stories)
  • Finding features that are indicative of a story
    boundary is possible and achieves good quality

15
Comparing Models (3/3)
  • Model-to-Model
  • Direct comparison of statistical language models
    that represent topics
  • Kullback-Leibler divergence
  • To finesse the measure, calculate it both ways and
    add the results together
  • One approach that has been used to incorporate
    that notion penalizes the comparison if the models
    are too much like background news
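A sketch of the background-penalty idea in the spirit described above: score
a pair of topic models by how much better each one predicts the other than
the general-news background model does, so that pairs sharing mostly
background vocabulary get little credit. This clarity-style adjustment is
one formulation of the idea; the chapter's exact formula may differ.

    import math

    def clarity_adjusted_similarity(p, q, background):
        """p, q, background: word -> probability dicts over the same vocabulary.
           Scores near zero or negative mean the models mostly look like
           background news."""
        def kl(a, b):
            return sum(a[w] * math.log(a[w] / b[w]) for w in a)
        return (kl(p, background) - kl(p, q)) + (kl(q, background) - kl(q, p))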

16
Miscellaneous Issues (1/3)
  • Deferral
  • All of the tasks are envisioned as on-line tasks
  • The decision about a story is expected before the
    next story is presented
  • In fact, TDT provides a moderate amount of look
    ahead for the tasks
  • First, stories are always presented to the system
    grouped into files that correspond to about a
    half hour of news
  • Second, the formal TDT evaluation incorporates a
    notion of deferral that allows a system to
    explore the advantage of deferring decisions
    until several files have passed.

17
Miscellaneous Issues (2/3)
  • Multi-modal Issues
  • The sources TDT systems must deal with are either
    written text (newswire) or read text (audio)
  • Speech recognizers make numerous mistakes,
    inserting, deleting, and even completely
    transforming words into other words
  • One difference between the two modes shows up in
    score normalization
  • The score distribution differs depending on which
    sources the pair of stories is drawn from, so to
    make scores comparable a system needs to normalize
    them according to those modes
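An illustrative sketch of per-mode score normalization: standardize a raw
similarity score against the mean and standard deviation observed on
training data for story pairs drawn from the same combination of sources
(e.g. newswire vs. ASR output). The table of statistics is assumed to be
estimated beforehand; this is not the exact scheme of any particular system.

    def normalize_score(raw_score, source_pair, stats):
        """stats maps a (source_a, source_b) pair to (mean, std) of raw scores."""
        mean, std = stats[source_pair]
        return (raw_score - mean) / std if std else 0.0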

18
Miscellaneous Issues (3/3)
  • Multi-lingual Issues
  • The TDT research program has strong interest in
    evaluating the tasks across multiple languages
  • 1999-2001: sites were required to handle English
    and Chinese news stories
  • 2002: sites will be incorporating Arabic as a
    third language

19
Using TDT Interactively (1/2)
  • Demonstrations
  • Lighthouse is a prototype system that visually
    portrays inter-document similarities to help the
    user find relevant material more quickly

20
Using TDT Interactively (2/2)
  • Timelines
  • Using a timeline to show not only what the topics
    are, but how they occur in time
  • Using a chi-square (X²) measure to determine
    whether or not a feature is occurring on that day
    in an unusual way
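A small sketch of the chi-square test mentioned above, using the standard
2x2 contingency-table statistic: occurrences of the feature on the given day
versus all other days, against the remaining token counts. A large value
suggests the feature occurs unusually often that day. The cell labels are
assumptions about how the counts would be gathered.

    def chi_square_2x2(a, b, c, d):
        """a: feature count on the day,     b: other tokens on the day,
           c: feature count on other days,  d: other tokens on other days."""
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0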

21
UMass at TDT 2004
  • Hierarchical Topic Detection
  • Topic Tracking
  • New Event Detection
  • Link Detection

22
Hierarchical Topic Detection: Model Description (1/8)
  • This task replaces Topic Detection in previous
    TDT evaluations
  • Used a vector space model as the baseline
  • Bounded clustering to reduce time complexity, with
    some simple parameter tuning
  • Stories in the same event tend to be close in
    time, so we only need to compare a story to nearby
    stories instead of the whole collection
  • Two steps
  • Bounded 1-NN for event formation
  • Bounded agglomerative clustering for building the
    hierarchy

23
Hierarchical Topic Detection: Model Description (2/8)
  • Bounded 1-NN for event formation
  • All stories in the same original language and
    from the same source are taken out and time
    ordered
  • Stories are processed one by one and each
    incoming story is compared to a certain number of
    stories (100 for the baseline) before it
  • If the similarity between the current story and
    the most similar previous story is larger than a
    given threshold (0.3 for the baseline), the
    current story is assigned to the event that the
    most similar previous story belongs to; otherwise,
    a new event is created
  • There is a list of events for each
    source/language class
  • The events within each class are sorted by time
    according to the time stamp of their first story
    (a sketch of this step follows below)
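A minimal sketch of the bounded 1-NN event formation step for a single
source/language stream. The similarity callable stands in for the baseline
vector-space cosine; the window of 100 stories and the 0.3 threshold are the
baseline values quoted on this slide.

    def bounded_1nn_events(stories, similarity, nstory=100, threshold=0.3):
        """stories: time-ordered list for one source/language class.
           Returns a parallel list of event ids."""
        event_of = []       # event id assigned to each story
        next_event = 0
        for i, story in enumerate(stories):
            best_j, best_sim = None, -1.0
            for j in range(max(0, i - nstory), i):   # only look NSTORY back
                sim = similarity(story, stories[j])
                if sim > best_sim:
                    best_j, best_sim = j, sim
            if best_j is not None and best_sim > threshold:
                event_of.append(event_of[best_j])    # join nearest story's event
            else:
                event_of.append(next_event)          # seed a new event
                next_event += 1
        return event_of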

24
Hierarchical Topic Detection: Model Description (3/8)
  • Bounded 1-NN for event formation

(Diagram: stories S1, S2, S3, ... arranged in separate time-ordered streams
for Language A and Language B)
25
Hierarchical Topic Detection: Model Description (4/8)
  • Each source is segmented into several parts, and
    sorted by time according to the time stamp of the
    first story
  • Sorted event list

26
Hierarchical Topic Detection: Model Description (5/8)
  • Bounded agglomerative clustering for building the
    hierarchy
  • Take a certain number of events (the number is
    called WSIZE; the default is 120) from the sorted
    event list
  • At each iteration, find the closest event pair
    and merge the later event into the earlier one

27
Hierarchical Topic Detection: Model Description (6/8)
  • Each iteration finds the closest event pair and
    merges the later event into the earlier one

(Diagram: successive merge iterations I1, I2, I3, ..., Ir-1, Ir)
28
Hierarchical Topic Detection: Model Description (7/8)
  • Bounded agglomerative clustering for building the
    hierarchy
  • This continues for (BRANCH-1)*WSIZE/BRANCH
    iterations, so the number of clusters left is
    WSIZE/BRANCH
  • Take the first half out, get WSIZE/2 new events,
    and continue agglomerative clustering until
    WSIZE/BRANCH clusters are left
  • The optimal value is around 3; BRANCH = 3 is used
    as the baseline (a sketch of this step follows
    below)
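A sketch of the bounded agglomerative step under the parameters described on
these slides: take a window of WSIZE events and repeatedly merge the closest
pair, folding the later event into the earlier one, until WSIZE/BRANCH
clusters remain. The real system also records each merge to build the
hierarchy and slides the window over the sorted event list; that bookkeeping
is omitted here. The similarity callable is assumed to compare cluster
centroids.

    def bounded_agglomerative(events, similarity, wsize=120, branch=3):
        """events: time-sorted list of events (each a list of stories)."""
        clusters = [list(e) for e in events[:wsize]]     # one window of events
        target = max(1, wsize // branch)
        while len(clusters) > target:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    sim = similarity(clusters[i], clusters[j])
                    if best is None or sim > best[0]:
                        best = (sim, i, j)
            _, i, j = best
            clusters[i].extend(clusters[j])              # later folds into earlier
            del clusters[j]
        return clusters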

29
Hierarchical Topic Detection: Model Description (8/8)
  • Then all clusters in the same language but from
    different sources are combined
  • Finally clusters from all languages are mixed and
    clustered until only one cluster is left, which
    becomes the root
  • Used machine translation for Arabic and Mandarin
    stories to simplify the similarity calculation

30
Hierarchical Topic Detection: Training (1/4)
  • Training corpus: TDT4 newswire and broadcast
    stories; testing corpus: TDT5 newswire only
  • The newswire stories taken from the TDT4 corpus
    include NYT, APW, ANN, ALH, AFP, ZBN, XIN
    (420,000 stories)

TDT-4 Corpus Overview
31
Hierarchical Topic Detection: Training (2/4)
32
Hierarchical Topic Detection: Training (3/4)
  • Parameters
  • BRANCH: the average branching factor in the
    bounded agglomerative clustering algorithm
  • Threshold: used in event formation to decide
    whether a new event will be created
  • STOP: in each source, clustering stops once the
    number of clusters is smaller than the square
    root of the number of stories
  • WSIZE: the maximum window size in agglomerative
    clustering
  • NSTORY: each story will be compared to at most
    NSTORY stories before it in the 1-NN event
    clustering; the idea comes from the time locality
    in event threading (the baseline values are
    collected below)
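The baseline settings for these parameters, collected from the slides, as a
simple configuration sketch:

    BASELINE_PARAMS = {
        "BRANCH": 3,       # average branching factor in agglomerative clustering
        "THRESHOLD": 0.3,  # similarity threshold for creating a new event
        "WSIZE": 120,      # maximum window size in agglomerative clustering
        "NSTORY": 100,     # preceding stories a new story is compared against
        # STOP has no single numeric value: per source, clustering stops once
        # the number of clusters falls below the square root of the story count
    }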

33
Hierarchical Topic Detection: Training (4/4)
  • Among the clusters very close to the root node,
    some contain thousands of stories
  • Both 1-NN and agglomerative clustering algorithms
    favor large clusters
  • Modified the similarity calculation to give
    smaller clusters more chances
  • Sim(v1,v2) is the similarity of the cluster
    centroids
  • cluster1 is the number of stories in the first
    cluster
  • a is a constant to control how much favor smaller
    clusters can get
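The modified formula itself is not reproduced in this transcript, so the
function below is only a hypothetical stand-in that combines the ingredients
listed above: the centroid similarity Sim(v1,v2), the size of the first
cluster, and a constant a controlling how strongly smaller clusters are
favored. The logarithmic penalty is an assumption for illustration.

    import math

    def size_adjusted_similarity(centroid_sim, cluster1_size, a=0.1):
        """Damp the similarity of large clusters so smaller clusters get more
        chances (hypothetical form, not the formula from the UMass system)."""
        return centroid_sim / (1.0 + a * math.log(1.0 + cluster1_size))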

34
Hierarchical Topic Detection: Results (1/2)
  • Three runs for each condition: UMASSv1, UMASSv12
    and UMASSv19

35
Hierarchical Topic Detection: Results (2/2)
  • A small branching factor can reduce both the
    detection cost and the travel cost
  • With a small branching factor, there are more
    clusters with different granularities
  • The assumption of temporal locality is useful in
    event threading; more experiments after the
    submission show that a larger window size can
    improve performance

36
Conclusion
  • Discussed several of the techniques that systems
    have used to build or enhance topic models and
    listed the merits of many of them
  • TDT research explores the extent to which IR
    technology can be used to solve TDT problems