Title: Extending DSMS for Data Stream Mining
1Extending DSMS for Data Stream Mining
- CS240B Notes
- by
- Carlo Zaniolo
- UCLA CSD
2Data Streams
- Continuous, unbounded, rapid, time-varying streams of data elements
- Occur in a variety of modern applications
- Network monitoring and traffic engineering
- Sensor networks, RFID tags
- Telecom call records
- Financial applications
- Web logs and click-streams
- Manufacturing processes
- DSMS: Data Stream Management System
3Many Research Projects
- Amazon/Cougar (Cornell) sensors
- Aurora (Brown/MIT) sensor monitoring, dataflow
- Hancock (ATT) Telecom streams
- Niagara (OGI/Wisconsin) Internet DBs XML
- OpenCQ (Georgia) triggers, view maintenance
- Stream (Stanford) general-purpose DSMS
- Tapestry (Xerox) publish/subscribe filtering
- Telegraph (Berkeley) adaptive engine for sensors
- Gigascope (ATT Labs) network monitoring
- Stream Mill (UCLA) power and extensibility
4Technology Challenges
- Data Models
- Relational streams, but XML streams are important too
- Tuple Time-Stamping
- Order is important
- Windows
- Query Languages: Extensions of SQL or XQUERY
- To support continuous (i.e., persistent) queries on transient data: a reversal of roles.
- Blocking operators excluded
- Query Plans
- New execution models (main memory oriented)
- Optimized scheduling for response time or memory
- Quality of Service (QoS) and Approximation
- Synopses
- Sampling
- Load shedding.
5Commercial Developments
- Several Startups
- Streambase,
- Coral8,
- Apama, and
- Truviso.
- Oracle and DBMS companies
- Publish/subscribe
- Complex Event Processing (CEP)
- Limitations: only simple applications, e.g., continuous queries expressed in SQL
- No support for data stream mining queries.
6Data Stream Mining
- Many applications: click stream analysis, intrusion detection, ...
- Many fast and light algorithms developed for stream mining.
- Ensembles, Moment, SWIM, etc.
- Analyst should be able to focus on high-level mining tasks.
- Leaving QoS and lower-level issues to the system.
- Integration of mining methods into Data Stream Management Systems (DSMS) is required
- Many research challenges.
- Stream Mill Miner (SMM) is the first DSMS designed for that.
7Data Stream Management Systems (DSMS)
- Data stream mining applications have so far been ignored by DSMS, although
- A. DSMS technology is required for data stream mining
- QoS, query scheduling, synopses, sampling, windows, ...
- B. But supporting DM applications is difficult since current DSMS only support simple query languages based on SQL.
- Conclusion: either a shotgun wedding ... or a research breakthrough is needed here!
8A Difficult Problem: the Inductive DBMS Experience
- Initial attempts to support mining queries in relational DBMS were unsuccessful
- OR-DBMS do not fare much better [Sarawagi 98].
- In 1996, the high-road approach by Imielinski and Mannila, who called for a quantum leap in functionality
- High-level declarative languages for DM.
- Extensions for query processing and optimization.
- The research area of Inductive DBMS was thus born
- Inspired DMQL, Mine Rule, MSQL, etc.
- These suffer from limited generality and performance issues.
9DBMS Vendors
- Vendors have taken a low-road approach.
- A library of mining functions using a cache-mining approach
- IBM DB2 Intelligent Miner
- Oracle Data Miner
- MS OLE DB for DM: mining models
- Closed systems,
- Lacking in coverage and user-extensibility.
- Not as popular as dedicated, stand-alone mining
systems, such as Weka
10Weka
- A comprehensive set of mining algorithms and tools.
- Generic algorithms over arbitrary data sets.
- Independent of the number of columns in tables.
- Open and extensible system based on Java.
- These are the features that we want in our SMM, starting from SQL rather than Java!
- Not an easy task ... why?
11SMM Contributions
- Build on Stream Mill DSMS and its SQL-based continuous query language and enabling technology.
- Language and System Extensions
- Genericity,
- Extensibility, and
- Performance
- A suite of stream mining algorithms.
- Existing ones and
- Newly developed in this project, e.g., SWIM.
- High-level mining model for better
- Usability
- Control of mining process.
12From SQL to Online Mining in SMM: step by step
- Naïve Bayesian Classifier (NBC).
- Important and frequently used.
- Schema-specific NBC: simple to express in SQL by count and sum aggregates. But a generic NBC is still preferable.
- Genericity: one function, independent of the number of columns involved.
- Schema independence in SQL?
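A minimal Python sketch (not ESL or SQL) of why a schema-specific NBC reduces to count aggregates. The two-attribute schema, the sample rows, and the names `train_counts` and `classify` are all hypothetical, chosen only for illustration; add-one smoothing is likewise an assumption.

```python
from collections import Counter

# Hypothetical training tuples with a fixed schema: (outlook, wind, decision).
ROWS = [
    ("sunny", "weak", "no"),
    ("sunny", "strong", "no"),
    ("overcast", "weak", "yes"),
    ("rain", "weak", "yes"),
    ("rain", "strong", "no"),
    ("overcast", "strong", "yes"),
]

def train_counts(rows):
    """Plays the role of one GROUP BY ... COUNT(*) query per attribute."""
    class_count = Counter(dec for _, _, dec in rows)
    outlook_count = Counter((outlook, dec) for outlook, _, dec in rows)
    wind_count = Counter((wind, dec) for _, wind, dec in rows)
    return class_count, outlook_count, wind_count

def classify(outlook, wind, model):
    """Pick the decision maximizing P(dec) * P(outlook | dec) * P(wind | dec)."""
    class_count, outlook_count, wind_count = model
    total = sum(class_count.values())
    best, best_score = None, -1.0
    for dec, n in class_count.items():
        # Add-one smoothing; 3 and 2 are the numbers of distinct
        # outlook and wind values in this toy schema.
        p_outlook = (outlook_count[(outlook, dec)] + 1) / (n + 3)
        p_wind = (wind_count[(wind, dec)] + 1) / (n + 2)
        score = (n / total) * p_outlook * p_wind
        if score > best_score:
            best, best_score = dec, score
    return best

MODEL = train_counts(ROWS)
```

Note that every column is named in the code, so adding an attribute means rewriting both functions. That is the limitation genericity is meant to remove.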
13Genericity
- Weka
- Arrays of type real.
- SMM
- Verticalization.
- Similar arrays, but in tables.
- Built-in table function to reduce any table to this form.
- Thus, generic UDAs work with this schema.
- And further improvements are also supported in
SMM
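Verticalization can be illustrated with a short Python sketch. This is an illustration of the idea only, not SMM's built-in table function: the tuple layout `(row_id, column, value, decision)` and the names `verticalize` and `learn` are assumptions.

```python
from collections import Counter

def verticalize(row_id, values, decision):
    """Turn one wide tuple into (row_id, column, value, decision) tuples."""
    return [(row_id, col, val, decision)
            for col, val in enumerate(values, start=1)]

def learn(vertical_rows):
    """Generic NBC training: a single counter keyed by (column, value, decision).

    The function never mentions a column name, so the same code
    serves tables with any number of columns.
    """
    counts = Counter()
    decision_of_row = {}
    for row_id, col, val, dec in vertical_rows:
        counts[(col, val, dec)] += 1
        decision_of_row[row_id] = dec
    class_count = Counter(decision_of_row.values())
    return counts, class_count
```

For example, `verticalize(7, ["a", "b", "c"], "x")` yields three tuples, one per column, and `learn` accepts the result whether the source table had two columns or twenty.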
14Extensibility?
- Most mining tasks cannot be implemented in SQL.
- Solution: define complex functions by User Defined Aggregates (UDAs)
- Complex mining tasks can be viewed as aggregates
- UDAs natively defined in SQL make the language computationally complete [Wang 04]
- Turing-complete over static data
- Non-blocking complete over data streams
- Natural extensions to support windows and delta computations for data streams [Bai 06]
- UDAs can be defined in a PL, for better performance
15Windowed UDA Example: Continuous Sum
    WINDOW AGGREGATE sum(val REAL) : REAL {
        TABLE state (tot REAL);
        INITIALIZE : {
            INSERT INTO state VALUES (val);
        }
        ITERATE : {
            UPDATE state SET tot = tot + val;
        }
        EXPIRE : {
            UPDATE state SET tot = tot - oldest().val;
        }
        /* No TERMINATE state */
    }
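The same INITIALIZE / ITERATE / EXPIRE life cycle can be sketched in Python. This is a hedged analogy, not Stream Mill code: the class `WindowedSum`, its method names, and the choice to expire the oldest tuple right after the new one is added are assumptions made for illustration.

```python
from collections import deque

class WindowedSum:
    """Sliding-window sum over the last `size` tuples, updated incrementally."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # tuples currently inside the window
        self.tot = None        # plays the role of the `state` table

    def _initialize(self, val):
        # INITIALIZE: the first tuple seeds the state
        self.tot = val

    def _iterate(self, val):
        # ITERATE: each later tuple is added to the running total
        self.tot += val

    def _expire(self):
        # EXPIRE: subtract the tuple leaving the window, like oldest().val
        self.tot -= self.window.popleft()

    def push(self, val):
        """Feed one tuple and return the current windowed sum."""
        if self.tot is None:
            self._initialize(val)
        else:
            self._iterate(val)
        self.window.append(val)
        if len(self.window) > self.size:
            self._expire()
        return self.tot
```

Each arrival costs constant work, because the total is adjusted by the entering and leaving tuples instead of being recomputed over the whole window. No final step is needed, since a continuous query never reaches the end of its input.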
16Online Mining in SMM
- UDAs are invoked with the standard SQL:2003 syntax of OLAP functions.
    SELECT learn(ts.Column, ts.Value, t.dec)
        OVER (ROWS 1000 PRECEDING)
    FROM trainingstream AS t,
        TABLE (verticalize(Outlook, Temp, Humidity, Wind)) AS ts
- Powerful framework
- Concept drifts/shifts
- Association rule mining
17The Slide Construct
- A window can be divided into panes (each called a slide)
- Tumbling windows: when the size of the slide is equal to or larger than that of the window
- The slide/window combination is great for data stream mining.
- Simple construct added to support slides in UDAs
- Allowed us to build a flexible and efficient library of data stream mining UDAs
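A small Python sketch of the window/slide relationship, assuming count-based windows. The function name `windows_with_slides` and the decision to emit nothing until a full window is available are illustrative choices, not SMM behavior.

```python
def windows_with_slides(stream, window_size, slide_size):
    """Yield the current window each time one more full slide has arrived."""
    buffer = []
    for count, item in enumerate(stream, start=1):
        buffer.append(item)
        if count % slide_size == 0 and count >= window_size:
            # Keep only the tuples that this or a later window can still need.
            buffer = buffer[-window_size:]
            yield list(buffer)
```

With `window_size=4` and `slide_size=2`, consecutive windows overlap by one slide. With `slide_size` equal to `window_size`, the windows tumble and share no tuples.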
19Association Rule Mining
- SWIM [Mozafari 08]: maintaining frequent patterns over large windows with slides.
- Differentially computes frequent patterns as slides enter (or expire out of) the window.
- Uses efficient verifiers based on conditional counting.
- Trade-off between delay and performance
- Performance gain over existing algorithms.
20SWIM (Sliding Window Incremental Miner)
- If a pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep a union of the frequent patterns of all slides (PT)
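[Figure: window W4 (slides S4, S5, S6) advances to W5 (slides S5, S6, S7). The expired slide S4 is dropped, the new slide S7 is mined by the mining algorithm, and PT is pruned, going from PT = F4 U F5 U F6 to PT = F5 U F6 U F7.]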
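The union-of-slides idea can be sketched in Python as follows. This is a deliberately simplified illustration and not the SWIM algorithm of [Mozafari 08]: it uses a brute-force miner limited to itemsets of at most two items, and it recounts every candidate over the whole window instead of using verifiers and differential counts. All names (`frequent_itemsets`, `count_in`, `SlidingMiner`) are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=2):
    """Brute-force miner: itemsets of up to max_len items whose count
    reaches min_support * len(transactions)."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    threshold = min_support * len(transactions)
    return {p for p, c in counts.items() if c >= threshold}

def count_in(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if set(pattern) <= set(t))

class SlidingMiner:
    def __init__(self, slides_per_window, min_support):
        self.n = slides_per_window
        self.min_support = min_support
        self.slides = deque()  # one (transactions, F_i) pair per slide

    def add_slide(self, transactions):
        """Mine the new slide, expire the oldest, return window-frequent patterns."""
        f_new = frequent_itemsets(transactions, self.min_support)
        self.slides.append((transactions, f_new))
        if len(self.slides) > self.n:
            self.slides.popleft()  # the oldest slide expires
        # PT: union of the frequent patterns of the slides still in the window.
        pt = set().union(*(f for _, f in self.slides))
        window = [t for txs, _ in self.slides for t in txs]
        threshold = self.min_support * len(window)
        # Simplification: full recount of each candidate over the window.
        return {p for p in pt if count_in(p, window) >= threshold}
```

Because a pattern frequent in the window must be frequent in some slide, PT is a complete candidate set, so only the members of PT ever need to be counted against the window.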
21Concept Drifts/Shifts: Complex Processes
- Ensemble-based methods.
- Weighted bagging [Wang 03], adaptive boosting [Chu 04], inductive transfer [Forman 06].
- Generic support, e.g. adaptive boosting (below).
22Built-in Online Mining Algorithms In SMM
- Online classifiers
- Naïve Bayesian
- Decision Tree
- K-nearest Neighbor
- Online clustering
- DBScan [Ester 96]
- IncDBScan
- Windowed K-means
- DenStream [Cao 06]
- CluStream
- Association rule mining
- Approximate frequent items
- SWIM [Mozafari 08]
- Moment [Chi 04]
- AFPIM
- Time series/sequence queries
- SQL-TS [Sadri 01]
- Many more
24Usability?
- Complex SQL queries to invoke built-in and user-defined mining algorithms.
- An open and extensible system
- Most analysts would prefer using a high-level mining language that
- supports uniform invocation of built-in and user-defined mining algorithms (no SQL required)
- describes the workflow of the mining process
- is also open and extensible to incorporate newly defined mining algorithms.
25Example: Defining a Mining Model
    CREATE MODEL TYPE NaiveBayesianClassifier
        SHAREDTABLES (DescriptorTbl),
        Learn (UDA LearnNaiveBayesian,
            WINDOW TRUE,
            PARTABLES(),   /* names of param tables required by the method */
            PARAMETERS()   /* additional parameters to be specified for input */
        ),
        Classify (UDA ClassifyNaiveBayesian,
            WINDOW TRUE,
            PARTABLES(),
            PARAMETERS()
        )
26Example: Using a Mining Model
- Creating an instance
    CREATE MODEL INSTANCE NaiveBayesianInstance
        AS NaiveBayesianClassifier
- Uniform invocation of mining tasks
    RUN NaiveBayesianInstance.Learn WITH TrainingSet
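As a rough analogy in Python, a model type can be seen as a named bundle of tasks, and an instance as that bundle plus its own shared state. Everything here is hypothetical (`ModelType`, `ModelInstance`, and the toy majority-vote tasks stand in for real UDAs); it only illustrates how one `run` entry point gives uniform invocation.

```python
class ModelType:
    """A named bundle of tasks, each bound to the function implementing it."""

    def __init__(self, name, **tasks):
        self.name = name
        self.tasks = tasks

class ModelInstance:
    """One instance of a model type, with its own shared state."""

    def __init__(self, model_type):
        self.model_type = model_type
        self.shared = {}  # stands in for the shared tables of the model

    def run(self, task, data):
        # Uniform invocation: every task is called the same way.
        return self.model_type.tasks[task](self.shared, data)

def learn_majority(shared, rows):
    """Toy 'Learn' task: remember the most common decision."""
    labels = [dec for _, dec in rows]
    shared["majority"] = max(set(labels), key=labels.count)
    return shared["majority"]

def classify_majority(shared, rows):
    """Toy 'Classify' task: predict the remembered decision for every row."""
    return [shared["majority"] for _ in rows]

MajorityClassifier = ModelType(
    "MajorityClassifier", Learn=learn_majority, Classify=classify_majority
)
instance = ModelInstance(MajorityClassifier)
```

A call such as `instance.run("Learn", training_rows)` mirrors the RUN statement: the caller names the instance and the task, and never touches the function that implements it.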
27Performance
- SMM vs. Weka
- NBC and decision tree classifier
- Datasets: UCI [UCI-MLR]
- Iris: 5 attributes
- Heart disease: 13 attributes
- Overhead of integrating algorithms into SMM
- The SWIM algorithm: standalone vs. integrated
- Dataset: IBM Quest
- Trans len 20, pattern len 5, tuples 50K
28Comparison with Weka: NBC-Iris
29Comparison with Weka: NBC-HD
30Comparison with Weka: Decision Tree - Iris
31Integration Overhead: Integrated SWIM vs. Standalone SWIM
32The Stream Mill System
- One server, multiple clients
- Server (on Linux) hosts the ESL language and manages storage and continuous queries
- Client (Java-based GUI) allows the user to specify streams, queries, etc.
33Conclusion
- SMM integrates new solutions for several difficult problems
- Usability by high-level mining models
- Extensibility by user-defined mining models that call on UDAs with windows
- Suite of built-in data stream mining UDAs
- Generic mining UDAs by verticalization and other techniques
- Performance
- SMM is the first of its kind: more and better systems will follow in its footsteps.
34Future Work
- Faster and lighter mining algorithms
- E.g. online algorithms for clustering
- Integration of other mining algorithms
- Data flow in mining models
- Similar solution for databases
36References
- [Arasu 04] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336-347, 2004.
- [Babcock 02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
- [Bai 06] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A data stream language and system designed for power and extensibility. In CIKM, pages 337-346, 2006.
- [Cao 06] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. To appear in Proceedings of SIAM 2006.
- [Chi 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM04), November 2004.
- [Chu 04] F. Chu and C. Zaniolo. Fast and light boosting for adaptive mining of data streams. In PAKDD, volume 3056, 2004.
- [Ester 96] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
- [Forman 06] George Forman. Tackling concept drift by temporal inductive transfer. In SIGIR, pages 252-259, 2006.
37References
- [Imielinski 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58-64, 1996.
- [Law 04] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Data models and query language for data streams. In VLDB, pages 492-503, 2004.
- [Mozafari 08] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In International Conference on Data Engineering (ICDE), 2008.
- [Sadri 01] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001.
- [Sarawagi 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998.
- [UCI-MLR] http://archive.ics.uci.edu/ml/datasets.html
- [Wang 03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003.