Title: Building Topic/Trend Detection System based on Slow Intelligence
1Building Topic/Trend Detection System based on
Slow Intelligence
- Chia-Chun Shih, Ting-Chun Peng
- Institute for Information Industry
- Taipei, Taiwan
Presented at the DMS10 special session on Slow
Intelligence Systems
2Agenda
- Introduction
- Topic/Trend Detection System
- Topic/Trend Detection System with Slow Intelligence
- Conclusion
3Introduction
4Introduction
Facebook Users
Twitter Posts
Blog Posts
- Social media is prevalent
- Social media is a reflection of the real world
- An experiment from HP Social Computing Lab shows
  - Twitter-rate time series can accurately predict box-office movie sales, with an adjusted R² of 0.973 (amazing!!)
- The emerging market for social media monitoring services
  - E.g., Nielsen Buzzmetrics, Radian6
5Introduction
(contd)
- Topic Detection and Tracking (TDT)
  - Initiated by DARPA in 1996
  - Aims to discover the topical structure in unsegmented streams of news reporting as it appears across multiple media
  - Tasks
    - Topic Detection
    - Topic Tracking
    - First Story Detection
    - Story Segmentation
    - Link Detection
6Introduction
(contd)
- Slow Intelligence provides a software development
framework that lets systems with insufficient computing
resources gradually adapt to their environments in order to
handle complexity
Figure: Slow Intelligence System. Labels: Environment, Knowledge-based Controller, Problem, Solution, and four numbered components (Enumerator, Adaptor, Eliminator, Concentrator).
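As a rough illustration of how the four components might cooperate, the sketch below repeats an enumerate / adapt / eliminate / concentrate cycle on a toy problem. The function name `slow_intelligence_search`, its parameters, and the way each phase is mapped to code are assumptions made for illustration; the slides name the components but do not specify their interfaces.

```python
import random

def slow_intelligence_search(score, seed_candidates, rounds=20, keep=3, step=1.0, rng=None):
    """Toy cycle: enumerate -> adapt -> eliminate -> concentrate (hypothetical mapping)."""
    rng = rng or random.Random(0)
    pool = list(seed_candidates)
    for _ in range(rounds):
        # Enumerate: propose variations around every candidate currently in the pool.
        enumerated = pool + [c + rng.uniform(-step, step) for c in pool for _ in range(4)]
        # Adapt: rank candidates by how well they score in the current environment.
        ranked = sorted(enumerated, key=score, reverse=True)
        # Eliminate: discard all but the best few.
        pool = ranked[:keep]
        # Concentrate: narrow the search around the survivors.
        step *= 0.8
    return pool[0]

# Toy environment (an assumption): the best parameter value is 7.0.
best = slow_intelligence_search(lambda x: -(x - 7.0) ** 2, seed_candidates=[0.0, 5.0, 10.0])
print(round(best, 2))
```

Because the survivors of each round are carried into the next, the best candidate never gets worse; the system improves slowly rather than searching everything at once.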
7Introduction
(contd)
- In this paper, we propose a design for an online
topic/trend detection system for social media
that takes advantage of Slow Intelligence.
- Four complexities of designing online topic/trend
detection systems are identified, along with
corresponding Slow Intelligence solutions.
8Topic/Trend Detection System
9Topic/Trend Detection System
- Objective
  - Detect current hot topics and predict future hot topics based on data collected from social media
- Three components
  - Crawler Extractor: collects data and extracts information from social media
  - Topic Extractor: detects hot topics from a set of text documents
  - Trend Detector: detects trends (future hot topics) based on currently available data
Figure: data flows from Social Media through the Crawler Extractor, Topic Extractor, and Trend Detector; the outputs are current hot topics and future hot topics.
10Topic/Trend Detection System
(contd)
Social Media
HTML documents
Users' Keywords of Interest
Web Crawler
Text documents
Web data DB
Topic Extractor
Information Extractor: extracts articles and metadata (title, author,
content, etc.) from semi-structured web content
Crawler Extractor
11Topic/Trend Detection System
(contd)
Web data DB
Topic Word Extraction
- Apply a TF-IDF scheme to generate the Top-N topic
words for each document
Topic Word Clustering
- Apply a clustering algorithm to cluster topic
words into topic groups. The topic groups are
treated as topics
Current topics
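The topic word extraction step can be sketched as follows. This is a minimal example, assuming length-normalized term frequency and a logarithmic inverse document frequency; the slides do not say which TF-IDF variant is used, and the function name `top_n_topic_words` is hypothetical. The clustering step is not shown.

```python
import math
import re
from collections import Counter

def top_n_topic_words(documents, n=3):
    """Return the n highest-scoring TF-IDF words for each document."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    total_docs = len(documents)
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(total_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by score (descending), then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:n])
    return results

docs = ["apple banana apple", "banana cherry", "banana date date"]
print(top_n_topic_words(docs, n=1))
```

A word that appears in every document (here "banana") gets an IDF of zero, so it never outranks words that are specific to one document.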
Hot topic extraction
- Apply aging theory to find hot topics
Current Hot topics
Topic Extractor
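The slides give no formulas for the aging step. The sketch below assumes one simplified reading of aging theory, in which each topic carries an energy value that grows with newly arriving documents and decays over time, and a topic counts as hot while its energy is above a threshold. The names `update_energy` and `hot_topics`, and the `gain`, `decay`, and `threshold` parameters, are hypothetical.

```python
def update_energy(energy, new_documents, gain=1.0, decay=0.5):
    """One time step: time removes a fraction of the energy, new documents add to it."""
    return max(0.0, energy * (1.0 - decay) + gain * new_documents)

def hot_topics(document_counts, threshold=5.0, gain=1.0, decay=0.5):
    """document_counts maps topic -> list of per-period document counts (oldest first)."""
    hot = []
    for topic, counts in document_counts.items():
        energy = 0.0
        for count in counts:
            energy = update_energy(energy, count, gain, decay)
        if energy >= threshold:
            hot.append((topic, energy))
    # Hottest topic first.
    return sorted(hot, key=lambda item: item[1], reverse=True)

counts = {"old story": [10, 0, 0, 0], "rising story": [1, 2, 4, 8]}
print(hot_topics(counts))
```

In this example the old story had more documents in total, but its energy has decayed below the threshold, while the rising story is still gaining energy.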
12Topic/Trend Detection System
(contd)
Trend Detector
- The Trend Estimation Algorithm is a black box
for now; a suitable algorithm will be found once Slow
Intelligence is involved in the system
13Topic/Trend Detection System with Slow Intelligence
14T/TD System with Slow Intelligence
- Four complexities of designing online topic/trend
detection systems
- 1. Collecting all web data is infeasible with a
limited amount of computing resources. The
system needs to develop data collection
strategies that concentrate limited
resources on collecting important web data.
Crawler Extractor
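One simple way to concentrate limited resources is to split a fixed fetch budget across sources in proportion to an importance score. The sketch below assumes such scores already exist; how they would be learned is not described in the slides, and the function name `allocate_crawl_budget` is hypothetical.

```python
def allocate_crawl_budget(importance, budget):
    """Split a page-fetch budget across sources in proportion to importance."""
    total = sum(importance.values())
    if total <= 0:
        return {source: 0 for source in importance}
    # Give each source its proportional share, rounded down.
    allocation = {s: int(budget * w / total) for s, w in importance.items()}
    # Hand out the pages lost to rounding, most important source first.
    leftover = budget - sum(allocation.values())
    for source in sorted(importance, key=importance.get, reverse=True)[:leftover]:
        allocation[source] += 1
    return allocation

print(allocate_crawl_budget({"forum": 1, "blog": 3, "news": 6}, budget=100))
```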
15T/TD System with Slow Intelligence
(contd)
- 2. Many computation methods are available for
estimating trends. If parameter settings are also
taken into account, there are too many
combinations to choose from. Furthermore, the Internet is
a changing environment, which means the current best
solution may not perform well in the future. The
system needs to automatically (or at least
quasi-automatically) find the best solution among many
alternatives in a changing environment.
Trend Detector
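To make the selection problem concrete, the sketch below enumerates a few method and parameter combinations and picks the one with the lowest one-step-ahead error on recent data. The two forecasting methods (moving average and exponential smoothing), the parameter values, and all function names are illustrative assumptions; the slides do not name the candidate algorithms.

```python
def moving_average(window):
    def forecast(history):
        recent = history[-window:]
        return sum(recent) / len(recent)
    return forecast

def exponential_smoothing(alpha):
    def forecast(history):
        level = history[0]
        for value in history[1:]:
            level = alpha * value + (1 - alpha) * level
        return level
    return forecast

def backtest_error(forecast, series, warmup=3):
    """Mean absolute one-step-ahead error, predicting series[t] from series[:t]."""
    errors = [abs(forecast(series[:t]) - series[t]) for t in range(warmup, len(series))]
    return sum(errors) / len(errors)

def select_estimator(candidates, series, warmup=3):
    """Return the name of the candidate with the lowest backtest error."""
    return min(candidates, key=lambda name: backtest_error(candidates[name], series, warmup))

# Each method/parameter combination is one alternative.
candidates = {
    "ma(window=2)": moving_average(2),
    "ma(window=5)": moving_average(5),
    "es(alpha=0.3)": exponential_smoothing(0.3),
    "es(alpha=0.9)": exponential_smoothing(0.9),
}

rising = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
alternating = [5, 9, 5, 9, 5, 9, 5, 9, 5, 9]
print(select_estimator(candidates, rising))
print(select_estimator(candidates, alternating))
```

The two series select different winners, which is the point of the slide: the best alternative depends on the current environment, so the selection has to be repeated as the data changes.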
16T/TD System with Slow Intelligence
(contd)
- 3. The crawler needs to revisit websites to
collect up-to-date data at hourly or daily
intervals. Each site has a different amount of
data to be updated and a different policy for
restricting frequent access, and both are unknown
beforehand. The system needs to find a feasible
data collection schedule based on past experience.
Crawler Extractor
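A schedule learned from past experience could, for example, adjust each site's revisit interval after every visit. The sketch below is one hypothetical rule: back off when the site blocked the last visit, visit more often when much new content was waiting, and relax when nothing changed. The function name `next_interval`, the multipliers, and the thresholds are assumptions; only the hourly-to-daily range comes from the slide.

```python
def next_interval(current_hours, new_items, blocked,
                  min_hours=1.0, max_hours=24.0, busy_threshold=20):
    """Suggest the next revisit interval (in hours) for one site."""
    if blocked:
        # The site refused or throttled the last visit: back off.
        proposed = current_hours * 2.0
    elif new_items >= busy_threshold:
        # Much new content was waiting: visit more often.
        proposed = current_hours / 2.0
    elif new_items == 0:
        # Nothing changed: relax the schedule a little.
        proposed = current_hours * 1.5
    else:
        proposed = current_hours
    # Stay within the hourly-to-daily range.
    return min(max_hours, max(min_hours, proposed))

print(next_interval(4.0, new_items=50, blocked=False))
```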
17T/TD System with Slow Intelligence
(contd)
- 4. Any change in web pages may disrupt the
Extractors. When many websites are being
monitored, the Extractors need an automatic repair
mechanism. The repair mechanism needs to detect
Extractor errors, find alternatives, and
choose the best alternative to fix
the disrupted Extractors.
Crawler Extractor
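The detect / find alternatives / choose cycle can be sketched for a single field. The example below assumes that an Extractor is a regular expression for the article title and that alternative patterns are supplied by hand; both are simplifications, and the names `extract_with` and `repair_extractor` are hypothetical.

```python
import re

def extract_with(pattern, html):
    """Apply one candidate title pattern; return the stripped match or None."""
    match = re.search(pattern, html, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

def repair_extractor(current_pattern, alternatives, sample_pages):
    """Keep the current pattern if it still works; otherwise pick the best alternative."""
    if not sample_pages:
        return current_pattern

    def success_rate(pattern):
        hits = sum(1 for page in sample_pages if extract_with(pattern, page))
        return hits / len(sample_pages)

    # Detect errors: the extractor counts as broken if it misses any sample page.
    if success_rate(current_pattern) == 1.0:
        return current_pattern
    # Choose the candidate that succeeds on the most sample pages.
    return max([current_pattern] + alternatives, key=success_rate)

old = r'<h1 class="title">(.*?)</h1>'
alternatives = [r'<h2 class="headline">(.*?)</h2>', r'<title>(.*?)</title>']
pages = [
    '<html><title>Site</title><h2 class="headline">Storm hits coast</h2></html>',
    '<html><h2 class="headline">Phone launch</h2></html>',
]
print(repair_extractor(old, alternatives, pages))
```

Here the site has switched from an `h1` title to an `h2` headline, so the old pattern fails on every sample page and the first alternative, which succeeds on both, replaces it.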
18T/TD System with Slow Intelligence
(contd)
- 1. SIS to help restrict the range of data
collection
Knowledge of data
Knowledge of algorithm
19T/TD System with Slow Intelligence
(contd)
- 2. SIS to help select and adapt trend detection
algorithms
20T/TD System with Slow Intelligence
(contd)
- 3. SIS to help schedule the Crawler
21T/TD System with Slow Intelligence
(contd)
- 4. SIS to help adapt Extractors
22Conclusion
23Conclusion
- An online trend detection system requires careful
resource allocation and automatic algorithm
adaptation to process a huge volume of heterogeneous
data.
- This research adopts Slow Intelligence, which
provides a framework for systems with
insufficient computing resources to gradually
adapt to their environments, to respond to these
challenges.
- Four Slow Intelligence subsystems are proposed,
and each subsystem targets one challenge in
designing online topic/trend detection systems.
24If you have any questions, please e-mail us
- chiachun_at_iii.org.tw (Chia-Chun Shih)
- markpeng_at_iii.org.tw (Ting-Chun Peng)