Title: Topic Detection and Tracking
1 Topic Detection and Tracking
- Presented by CHU Huei-Ming 2004/03/17
2 Reference
- Pattern Recognition in Speech and Language Processing, Chap. 12: "Modeling Topics for Detection and Tracking", James Allan, University of Massachusetts Amherst. Publisher: CRC Press, 2003/02.
- "UMass at TDT 2004", Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chirag Shah, James Allan, University of Massachusetts Amherst. TDT 2004 workshop.
3 Topic Detection and Tracking (1/6)
- The goal of TDT research is to organize news stories by the events that they describe.
- The TDT research program began in 1996 as a collaboration between Carnegie Mellon University, Dragon Systems, the University of Massachusetts, and DARPA.
- To find out how well classic IR technologies addressed TDT, they created a small collection of news stories and identified some topics within them.
4 Topic Detection and Tracking (2/6)
- Event
- Something that happens at some specific time and place, along with all necessary preconditions and unavoidable consequences.
- Topic
- Captures the larger set of happenings that are related to some triggering event.
- By forcing the additional events to be directly related, the topic is prevented from spreading out to include too much news.
5 Topic Detection and Tracking (3/6)
- TDT Tasks
- Segmentation
- Break an audio track into discrete stories, each on a single topic.
- Cluster Detection (Detection)
- Place all arriving news stories into groups based on their topics.
- If no existing group fits, the system must decide whether to create a new topic.
- Each story is placed in precisely one cluster.
- Tracking
- Starts with a small set of news stories that a user has identified as being on the same topic.
- The system must monitor the stream of arriving news to find all additional stories on the same topic.
6 Topic Detection and Tracking (4/6)
- New Event Detection (first story detection)
- Focuses on the cluster-creation aspect of cluster detection.
- Evaluated on its ability to decide when a new topic (event) appears.
- Link Detection
- Determine whether or not two randomly presented stories discuss the same topic.
- A solution to this task could be used to solve new event detection.
7 Topic Detection and Tracking (5/6)
- Corpora
- TDT-2: being augmented in 2002 with some Arabic news from the same time period.
- TDT-3: created for the 1999 evaluation; stories from four Arabic sources are being added during 2002.
8 Topic Detection and Tracking (6/6)
- Evaluation
- P(target) is the prior probability that a story will be on topic.
- The Cx are user-specified values that reflect the cost associated with each kind of error.
- P(miss) and P(fa) are the actual system error rates.
- Within TDT evaluations, C_miss = 1.0 and C_fa = 0.1.
- P(target) = 1 - P(off-target) = 0.02 (derived from training data).
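These quantities plug into the standard TDT detection cost; the formula itself does not appear on the slide, so it is restated here from the TDT evaluation definition:

```latex
C_{det} = C_{miss}\, P(\text{miss})\, P(\text{target})
        + C_{fa}\, P(\text{fa})\, \bigl(1 - P(\text{target})\bigr)
```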
9 Basic Topic Model
- Vector Space
- Represent items (stories or topics) as vectors in a high-dimensional space.
- The most common comparison function is the cosine of the angle between the two vectors (a sketch follows the list).
- Language Models
- A topic is represented as a probability distribution over words.
- The initial probability estimates come from the maximum likelihood estimate based on the document.
- Use of the topic model
- See how likely it is that a particular story could be generated by the model.
- Or compare two models directly with a symmetric version of the Kullback-Leibler divergence.
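As a concrete illustration of the vector-space comparison, a minimal sketch assuming raw term-frequency vectors and whitespace tokenization; a real system would use tf-idf weighting and proper tokenization:

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector for a story (naive whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```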
10 Implementing the Models (1/3)
- Named Entities
- News is usually about people, so it seems reasonable that their names could be treated specially.
- Treat the named entities as a separate part of the model and then merge the parts.
- Or boost the weight of any words in the stories that come from names, giving them a larger contribution to the similarity when the names are in common (see the sketch below).
- This improves results slightly; no strong gains so far.
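A minimal sketch of the boosting variant, assuming a named-entity tagger has already marked which tokens are names; the boost factor of 2.0 is an illustrative assumption, not a value from the chapter:

```python
from collections import Counter

def boost_named_entities(vec, name_terms, boost=2.0):
    """Return a copy of a term vector with name terms upweighted.

    name_terms: tokens an NE tagger flagged as names (the tagger itself
    is outside this sketch). Boosted terms contribute more to any
    cosine similarity later computed on the vector.
    """
    boosted = Counter(vec)
    for term in name_terms:
        if term in boosted:
            boosted[term] *= boost
    return boosted
```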
11 Implementing the Models (2/3)
- Document Expansion
- In the segmentation task, a possible segmentation boundary could be checked by comparing the models generated from the text on either side.
- The text could be used as a query to retrieve a few dozen related stories, and then the most frequently occurring words from those stories could be used for the comparison (a sketch follows).
- Relevance models yield substantial improvements in the link detection task.
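A sketch of the expansion step; `retrieve` here is a hypothetical search function over the story collection, standing in for whatever retrieval engine a real system would use:

```python
from collections import Counter

def expand_document(text, retrieve, n_docs=30, n_terms=50):
    """Expand a story with frequent terms from related stories.

    retrieve(text, n_docs) is assumed to return the n_docs stories
    most similar to the text; the expansion terms are the most
    frequent words across those stories.
    """
    related = retrieve(text, n_docs)
    counts = Counter()
    for story in related:
        counts.update(story.lower().split())
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return text.split() + expansion
```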
12 Implementing the Models (3/3)
- Time decay
- The likelihood that two stories discuss the same topic diminishes as the stories are further separated in time.
- In a vector space model, the cosine similarity function can be changed so that it includes a time decay (one possible form is sketched below).
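One plausible shape for the decayed comparison, reusing `cosine` from the earlier sketch; the exponential form and the 30-day half-life are assumptions, since the chapter only says the function includes a time decay:

```python
def decayed_cosine(v1, v2, t1, t2, half_life=30.0):
    """Cosine similarity damped by the time gap between two stories.

    t1, t2: story timestamps in days; the similarity is halved for
    every half_life days of separation.
    """
    decay = 0.5 ** (abs(t1 - t2) / half_life)
    return cosine(v1, v2) * decay  # cosine() as defined in the earlier sketch
```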
13 Comparing Models (1/3)
- Nearest Neighbors
- In the vector space model, a topic might be represented as a single vector.
- To determine whether or not a story is on any of the existing topics, we consider the distance between the story's vector and the closest topic vector.
- If it falls outside the specified distance, the story is likely to be the seed of a new topic, and a new vector can be formed (see the sketch below).
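A sketch of that decision rule, reusing `cosine` from the earlier sketch; the 0.3 threshold is borrowed from the baseline value quoted later in the deck:

```python
def assign_story(story_vec, topic_vecs, threshold=0.3):
    """Match a story against existing topic vectors.

    Returns the index of the closest topic when its similarity clears
    the threshold, or None to signal the seed of a new topic.
    """
    best_idx, best_sim = None, 0.0
    for idx, topic_vec in enumerate(topic_vecs):
        sim = cosine(story_vec, topic_vec)  # cosine() from the earlier sketch
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx if best_sim >= threshold else None
```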
14 Comparing Models (2/3)
- Decision Trees
- The best place for decision trees within TDT may be the segmentation task.
- There are numerous training instances (hand-segmented stories).
- Finding features that are indicative of a story boundary is possible and achieves good quality.
15 Comparing Models (3/3)
- Model-to-Model
- Direct comparison of the statistical language models that represent topics.
- Kullback-Leibler divergence
- To finesse the measure's asymmetry, calculate it both ways and add the results together (written out below).
- One approach that has been used to incorporate that notion penalizes the comparison if the models are too much like background news.
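Written out, the symmetric variant is simply the sum of the two directed divergences:

```latex
\mathrm{KL}_{sym}(p, q) = \sum_w p(w)\log\frac{p(w)}{q(w)}
                        + \sum_w q(w)\log\frac{q(w)}{p(w)}
```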
16 Miscellaneous Issues (1/3)
- Deferral
- All of the tasks are envisioned as on-line tasks.
- The decision about a story is expected before the next story is presented.
- In fact, TDT provides a moderate amount of look-ahead for the tasks.
- First, stories are always presented to the system grouped into files that correspond to about a half hour of news.
- Second, the formal TDT evaluation incorporates a notion of deferral that allows a system to explore the advantage of deferring decisions until several files have passed.
17 Miscellaneous Issues (2/3)
- Multi-modal Issues
- The stories TDT systems must deal with are either written text (newswire) or read text (audio).
- Speech recognizers make numerous mistakes: inserting, deleting, and even completely transforming words into other words.
- A consequence of the two modes is the need for score normalization.
- For pairs of stories drawn from different sources, the score distributions differ; for scores to be comparable, a system needs to normalize them depending on the modes involved (a sketch follows).
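A minimal sketch of one common choice, z-normalizing scores within each source-pair condition; the chapter only says normalization should depend on the modes, so the Gaussian form is an assumption:

```python
import statistics

def normalize_scores(scores_by_condition):
    """Z-normalize similarity scores within each source-pair condition.

    scores_by_condition maps a condition label, e.g.
    ("newswire", "broadcast"), to the raw scores observed for story
    pairs of that kind, making scores comparable across conditions.
    """
    normalized = {}
    for condition, scores in scores_by_condition.items():
        mu = statistics.mean(scores)
        sigma = statistics.stdev(scores) if len(scores) > 1 else 1.0
        normalized[condition] = [(s - mu) / (sigma or 1.0) for s in scores]
    return normalized
```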
18 Miscellaneous Issues (3/3)
- Multi-lingual Issues
- The TDT research program has a strong interest in evaluating the tasks across multiple languages.
- In 1999-2001, sites were required to handle English and Chinese news stories.
- In 2002, sites will be incorporating Arabic as a third language.
19 Using TDT Interactively (1/2)
- Demonstrations
- Lighthouse is a prototype system that visually portrays inter-document similarities to help the user find relevant material more quickly.
20 Using TDT Interactively (2/2)
- Timelines
- Use a timeline to show not only what the topics are, but how they occur in time.
- Use a χ² measure to determine whether or not a feature is occurring on a given day in an unusual way (a sketch follows).
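A sketch of the test, framed as a 2x2 contingency table of term occurrences on the day in question versus all other days; the 2x2 framing is an assumption, since the slide only names the measure:

```python
def chi_square_burst(term_on_day, others_on_day, term_other_days, others_other_days):
    """Chi-square statistic for a 2x2 contingency table:

                        term    all other terms
        the day         a       b
        all other days  c       d

    Large values suggest the term occurs unusually often that day.
    Uses the standard closed form for 2x2 tables.
    """
    a, b = term_on_day, others_on_day
    c, d = term_other_days, others_other_days
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```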
21 UMass at TDT 2004
- Hierarchical Topic Detection
- Topic Tracking
- New Event Detection
- Link Detection
22 Hierarchical Topic Detection: Model Description (1/8)
- This task replaces Topic Detection in previous TDT evaluations.
- Used a vector space model as the baseline.
- Bounded clustering reduces time complexity, with some simple parameter tuning.
- Stories in the same event tend to be close in time, so we only need to compare a story to stories near it in time instead of to the whole collection.
- Two steps:
- Bounded 1-NN for event formation
- Bounded agglomerative clustering for building the hierarchy
23 Hierarchical Topic Detection: Model Description (2/8)
- Bounded 1-NN for event formation
- All stories in the same original language and from the same source are taken out and time-ordered.
- Stories are processed one by one, and each incoming story is compared to a certain number of stories (100 for the baseline) before it.
- If the similarity between the current story and the most similar previous story is larger than a given threshold (0.3 for the baseline), the current story is assigned to the event that the most similar previous story belongs to; otherwise, a new event is created (see the sketch below).
- There is a list of events for each source/language class.
- The events within each class are sorted by time according to the time stamp of their first story.
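A sketch of the event-formation pass for a single source/language stream, reusing `cosine` from the earlier sketch and the baseline settings quoted above:

```python
def bounded_1nn_events(stories, n_story=100, threshold=0.3):
    """Bounded 1-NN event formation over one time-ordered stream.

    stories: term vectors in time order. Each story is compared only
    to the n_story stories before it (NSTORY=100, threshold=0.3 in
    the baseline).
    """
    event_of = []        # event id assigned to each story
    next_event = 0
    for i, vec in enumerate(stories):
        best_j, best_sim = None, 0.0
        for j in range(max(0, i - n_story), i):
            sim = cosine(vec, stories[j])  # cosine() from the earlier sketch
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim > threshold:
            event_of.append(event_of[best_j])  # join the nearest story's event
        else:
            event_of.append(next_event)        # seed a new event
            next_event += 1
    return event_of
```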
24 Hierarchical Topic Detection: Model Description (3/8)
- Bounded 1-NN for event formation
(Diagram: two per-language streams, Language A and Language B, with stories S1, S2, S3 assigned to events in time order.)
25 Hierarchical Topic Detection: Model Description (4/8)
- Each source is segmented into several parts, and these are sorted by time according to the time stamp of the first story.
- Sorted event list
26 Hierarchical Topic Detection: Model Description (5/8)
- Bounded agglomerative clustering for building the hierarchy
- Take a certain number of events (the number is called WSIZE; the default is 120) from the sorted event list.
- At each iteration, find the closest event pair and combine the later event into the earlier one.
27 Hierarchical Topic Detection: Model Description (6/8)
- Each iteration finds the closest event pair and combines the later event into the earlier one.
(Diagram: the merge process over items I1, I2, I3, ..., Ir-1, Ir in the clustering window.)
28 Hierarchical Topic Detection: Model Description (7/8)
- Bounded agglomerative clustering for building the hierarchy
- This continues for (BRANCH-1)*WSIZE/BRANCH iterations, so the number of clusters left is WSIZE/BRANCH.
- Then take the first half out, pull in WSIZE/2 new events, and continue agglomerative clustering until WSIZE/BRANCH clusters are left.
- The optimal value is around 3; BRANCH = 3 is the baseline (a sketch of the procedure follows).
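A simplified sketch of the windowed clustering under the reading above, reusing `cosine` from the earlier sketch; the exact sliding-window bookkeeping in the UMass system may differ, and `centroid` here just sums term vectors (proportional to the mean as far as cosine is concerned):

```python
from collections import Counter

def centroid(cluster):
    """Summed term vector of a cluster of event vectors."""
    total = Counter()
    for vec in cluster:
        total.update(vec)
    return total

def bounded_agglomerative(events, wsize=120, branch=3):
    """Windowed agglomerative clustering over a time-sorted event list.

    Fill a window of wsize events, merge the closest pair (later into
    earlier) until wsize/branch clusters remain, then pull in the next
    batch of events and repeat until the list is exhausted.
    """
    window = [[e] for e in events[:wsize]]
    pos = min(wsize, len(events))
    while True:
        while len(window) > max(1, wsize // branch):
            # find the closest pair of clusters by centroid similarity
            pairs = [(i, j) for i in range(len(window))
                            for j in range(i + 1, len(window))]
            i, j = max(pairs, key=lambda ij: cosine(centroid(window[ij[0]]),
                                                    centroid(window[ij[1]])))
            window[i].extend(window.pop(j))  # merge later into earlier
        if pos >= len(events):
            return window
        batch = events[pos:pos + (wsize - len(window))]
        window.extend([[e] for e in batch])
        pos += len(batch)
```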
29 Hierarchical Topic Detection: Model Description (8/8)
- Then all clusters in the same language but from different sources are combined.
- Finally, clusters from all languages are mixed and clustered until only one cluster is left, which becomes the root.
- Machine translation was used for Arabic and Mandarin stories to simplify the similarity calculation.
30 Hierarchical Topic Detection: Training (1/4)
- Training corpus: TDT4 newswire and broadcast stories. Testing corpus: TDT5 newswire only.
- The newswire stories taken from the TDT4 corpus include NYT, APW, ANN, ALH, AFP, ZBN, XIN (420,000 stories).
(Table: TDT-4 Corpus Overview.)
31 Hierarchical Topic Detection: Training (2/4)
32 Hierarchical Topic Detection: Training (3/4)
- Parameters
- BRANCH: the average branching factor in the bounded agglomerative clustering algorithm.
- Threshold: used in event formation to decide whether a new event will be created.
- STOP: within each source, stop when the number of clusters is smaller than the square root of the number of stories.
- WSIZE: the maximum window size in agglomerative clustering.
- NSTORY: each story is compared to at most NSTORY stories before it in the 1-NN event clustering; the idea comes from the time locality in event threading.
33 Hierarchical Topic Detection: Training (4/4)
- Among the clusters very close to the root node, some contain thousands of stories.
- Both the 1-NN and agglomerative clustering algorithms favor large clusters.
- The similarity calculation was modified to give smaller clusters more of a chance; in the modified formula:
- Sim(v1, v2) is the similarity of the cluster centroids.
- |cluster1| is the number of stories in the first cluster.
- a is a constant that controls how much of an advantage smaller clusters get.
34 Hierarchical Topic Detection: Results (1/2)
- Three runs for each condition: UMASSv1, UMASSv12, and UMASSv19.
35 Hierarchical Topic Detection: Results (2/2)
- A small branching factor can reduce both detection cost and travel cost.
- With a small branching factor, there are more clusters with different granularities.
- The assumption of temporal locality is useful in event threading; experiments after the submission show that a larger window size can improve performance.
36 Conclusion
- Discussed several of the techniques that systems have used to build or enhance topic models, and listed the merits of many of them.
- TDT research explores the extent to which IR technology can be used to solve TDT problems.