Classification and Novel Class Detection in Data Streams - PowerPoint PPT Presentation

About This Presentation
Title:

Classification and Novel Class Detection in Data Streams

Description:

Data streams are continuous flows of data. Examples: network traffic, sensor data, call center records.


Transcript and Presenter's Notes

Title: Classification and Novel Class Detection in Data Streams


1
Classification and Novel Class Detection in Data
Streams
  • Mehedy Masud1, Latifur Khan1, Jing Gao2,
  • Jiawei Han2, and Bhavani Thuraisingham1
  • 1Department of Computer Science, University of
    Texas at Dallas
  • 2Department of Computer Science, University of
    Illinois at Urbana-Champaign

This work was funded in part by
2
Presentation Overview
  • Stream Mining Background
  • Novel Class Detection and Concept-Evolution

3
Data Streams
  • Data streams are continuous flows of data
  • Examples
    Network traffic
    Sensor data
    Call center records
4
Data Stream Classification
  • Uses past labeled data to build classification
    model
  • Predicts the labels of future instances using the
    model
  • Helps decision making (a minimal sketch follows below)
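As a minimal illustrative sketch of this workflow (not the authors' implementation), the snippet below trains a model on past labeled data and predicts labels for future instances; the scikit-learn decision tree and the synthetic data are assumptions.

```python
# Minimal sketch of data stream classification (illustrative only):
# build a model from past labeled data, then predict labels of future instances.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Past labeled data (features X, labels y), a synthetic stand-in for a labeled chunk.
X_past = rng.normal(size=(500, 2))
y_past = (X_past[:, 0] + X_past[:, 1] > 0).astype(int)

model = DecisionTreeClassifier().fit(X_past, y_past)

# Future (unlabeled) instances arriving on the stream.
X_future = rng.normal(size=(5, 2))
print(model.predict(X_future))  # predicted labels support downstream decisions
```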

5
Challenges
Introduction
  • Infinite length
  • Concept-drift
  • Concept-evolution (emergence of novel class)
  • Recurring (seasonal) classes

6
Infinite Length
  • Impractical to store and use all historical data
  • Doing so would require infinite storage
    and running time

7
Concept-Drift
[Figure: a data chunk with positive and negative instances; some instances are victims of concept-drift]
8
Concept-Evolution
[Figure: a two-dimensional feature space (x, y) divided by thresholds x1, y1, y2 into regions A, B, C, D containing + and - instances; instances of a novel class later appear in one of these regions]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances (see the sketch below)
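To make this concrete, here is a small illustrative Python sketch of the two rules (the threshold values, and the + label for R1, which is cut off on the slide, are assumptions): a rule set learned only from the existing classes has no "unknown" outcome, so any instance of a newly emerging class is forced into + or -.

```python
# Illustrative sketch: a fixed rule-based model has no notion of a novel class.
X1, Y1, Y2 = 0.5, 0.3, 0.7   # assumed threshold values, for illustration only

def classify(x, y):
    """Apply rules R1/R2 from the slide; every point receives + or -."""
    if (x > X1 and y < Y2) or (x < X1 and y < Y1):
        return "+"   # R1
    return "-"       # R2 covers the remaining regions

# An instance of a novel class (e.g. one appearing in region D of the figure)
# is still assigned an existing label, i.e. it is silently misclassified.
print(classify(0.2, 0.9))  # prints "-", even if this point belongs to a new class
```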
9
Background: Ensemble of Classifiers
[Figure: an input instance x with unknown label is classified by individual classifiers C1, C2, C3; their individual outputs (+ and -) are combined by voting to produce the ensemble output]
10
Background: Ensemble Classification of Data Streams
  • Divide the data stream into equal sized chunks
  • Train a classifier from each data chunk
  • Keep the best L such classifiers as the ensemble
  • Example for L = 3
Note: Di may contain data points from different classes
[Figure: data chunks D4, D5 (labeled) and D6 (unlabeled); classifiers C4, C5 are trained from the labeled chunks, and the best L of the classifiers C1..C5 form the ensemble]
Addresses infinite length and concept-drift (a minimal sketch follows below)
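A minimal sketch of this background scheme, assuming scikit-learn decision trees and, as the "best L" criterion (which the slide does not specify), accuracy on the newest labeled chunk:

```python
# Sketch of ensemble classification over a data stream (L = 3 models kept).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

L = 3
rng = np.random.default_rng(1)

def train_chunk(X, y):
    return DecisionTreeClassifier(max_depth=5).fit(X, y)

def update_ensemble(ensemble, X_labeled, y_labeled):
    """Train on the newest labeled chunk, then keep the best L models,
    scored here (an assumption) by accuracy on that same chunk."""
    candidates = ensemble + [train_chunk(X_labeled, y_labeled)]
    candidates.sort(key=lambda m: m.score(X_labeled, y_labeled), reverse=True)
    return candidates[:L]

def predict(ensemble, X):
    """Majority vote of the individual outputs."""
    votes = np.stack([m.predict(X) for m in ensemble])
    return np.round(votes.mean(axis=0)).astype(int)

ensemble = []
# Simulate a few labeled chunks followed by an unlabeled chunk.
for _ in range(4):
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] > 0).astype(int)
    ensemble = update_ensemble(ensemble, X, y)

X_unlabeled = rng.normal(size=(5, 2))
print(predict(ensemble, X_unlabeled))
```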
11
Examples of Recurrence and Novel Classes
Introduction
  • Twitter Stream: a stream of messages
  • Each message may be given a category or class
    based on the topic
  • Examples
  • Election 2012, London Olympics, Halloween,
    Christmas, Hurricane Sandy, etc.
  • Among these
  • Election 2012 or Hurricane Sandy are novel
    classes because they are new events.
  • Also
  • Halloween is a recurring class because it
    recurs every year.

12
Concept-Evolution and Feature Space
Introduction
[Same feature-space figure and classification rules R1, R2 as in slide 8]
Existing classification models misclassify novel class instances
13
Novel Class Detection: Prior Work
Prior work
  • Three steps
  • Training and building decision boundary
  • Outlier detection and filtering
  • Computing cohesion and separation

14
Training: Creating the Decision Boundary
Prior work
  • Training is done chunk-by-chunk (one classifier
    per chunk)
  • An ensemble of classifiers is used for
    classification

[Figure: raw training data in the (x, y) feature space; clusters are created from the data, and the cluster boundaries form the classifier's decision boundary]
Addresses the infinite length problem (a minimal sketch follows below)
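The slide shows the idea pictorially; a minimal sketch follows, assuming K-means is used to summarize each chunk into clusters whose centroid/radius pairs serve as the decision boundary (the cluster count and parameters are illustrative assumptions):

```python
# Sketch: summarize a training chunk into clusters whose centroids and radii
# form the decision boundary of that chunk's classifier.
import numpy as np
from sklearn.cluster import KMeans

def build_decision_boundary(X_chunk, n_clusters=3):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_chunk)
    boundaries = []
    for c in range(n_clusters):
        points = X_chunk[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        radius = np.max(np.linalg.norm(points - centroid, axis=1))  # hypersphere radius
        boundaries.append((centroid, radius))
    return boundaries  # the raw data can now be discarded, giving bounded storage

rng = np.random.default_rng(2)
chunk = rng.normal(size=(300, 2))
for centroid, radius in build_decision_boundary(chunk):
    print(np.round(centroid, 2), round(float(radius), 2))
```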
15
Outlier Detection and Filtering
Prior work
[Flowchart: a test instance inside the decision boundary is not an outlier and is classified as an existing class instance; a test instance outside the decision boundary is a raw outlier, or Routlier; when the outlier condition holds jointly (the AND in the figure), the instance is a filtered outlier (Foutlier), a potential novel class instance]
Routliers may appear as a result of a novel class,
concept-drift, or noise. Therefore, they are
filtered to reduce noise as much as possible (a minimal sketch follows below).
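A minimal sketch of the filtering step, reusing the (centroid, radius) boundary summaries from the previous sketch; treating an instance as a filtered outlier only when it is an Routlier with respect to every model in the ensemble is one reading of the figure, and the helper names are illustrative:

```python
# Sketch: raw outlier (Routlier) = outside a model's decision boundary;
# filtered outlier (Foutlier) = outside the boundaries of ALL ensemble models.
import numpy as np

def is_routlier(x, boundaries):
    """True if x lies outside every hypersphere of one model's boundary."""
    return all(np.linalg.norm(x - c) > r for c, r in boundaries)

def is_foutlier(x, ensemble_boundaries):
    """True if x is an Routlier with respect to every model in the ensemble."""
    return all(is_routlier(x, b) for b in ensemble_boundaries)

# Toy ensemble of two models, each with one cluster (centroid, radius).
ensemble_boundaries = [
    [(np.array([0.0, 0.0]), 1.0)],
    [(np.array([0.5, 0.5]), 1.0)],
]
print(is_foutlier(np.array([0.2, 0.1]), ensemble_boundaries))  # False: existing class
print(is_foutlier(np.array([5.0, 5.0]), ensemble_boundaries))  # True: potential novel class
```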
16
Computing Cohesion and Separation
Prior work
[Figure: an Foutlier x, its q-nearest Foutlier neighborhood λo,q(x), and the q-nearest neighborhoods λ+,q(x) and λ-,q(x) of the existing classes; a(x) is the mean distance from x to λo,q(x), and b+(x), b-(x) are the mean distances to the class neighborhoods]
  • a(x): mean distance from an Foutlier x to the
    instances in λo,q(x)
  • bmin(x): minimum among all bc(x) (e.g. b(x) in the
    figure)
  • q-Neighborhood Silhouette Coefficient (q-NSC)
  • If q-NSC(x) is positive, it means x is closer to
    the Foutliers than to any other class (a sketch of
    the computation follows below).
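The slide omits the formula itself; in the cited prior work (ECSMiner [2]) the coefficient is defined as q-NSC(x) = (bmin(x) - a(x)) / max(a(x), bmin(x)), which lies in [-1, 1]. A minimal sketch of the computation, with illustrative data and helper names:

```python
# Sketch of the q-NSC computation for a single Foutlier x (names illustrative).
import numpy as np

def qnsc(x, foutliers, class_instances, q=5):
    """q-NSC(x) = (b_min(x) - a(x)) / max(a(x), b_min(x)).

    a(x): mean distance from x to its q nearest other Foutliers.
    b_c(x): mean distance from x to the q nearest instances of class c.
    b_min(x): minimum b_c(x) over all existing classes c.
    """
    def mean_qnn_dist(points):
        d = np.sort(np.linalg.norm(points - x, axis=1))
        return d[:q].mean()

    a = mean_qnn_dist(foutliers)                      # cohesion among Foutliers
    b_min = min(mean_qnn_dist(pts) for pts in class_instances.values())
    return (b_min - a) / max(a, b_min)                # positive: closer to Foutliers

rng = np.random.default_rng(3)
x = np.array([5.0, 5.0])
foutliers = x + rng.normal(scale=0.3, size=(10, 2))  # a tight novel-class cluster
classes = {"+": rng.normal(size=(50, 2)), "-": rng.normal(loc=2.0, size=(50, 2))}
print(round(float(qnsc(x, foutliers, classes)), 2))   # positive, novel-class evidence
```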

17
Limitation: Recurring Classes
Prior work
18
Why Are Recurring Classes Forgotten?
Prior work
  • Divide the data stream into equal sized chunks
  • Train a classifier from the whole data chunk
  • Keep the best L such classifiers as the ensemble
  • Example for L = 3
  • Therefore, old models are discarded
  • Old classes are forgotten after a while

[Figure: the same chunk/classifier/ensemble diagram as in slide 10]
Addresses infinite length and concept-drift
19
CLAM: The Proposed Approach
Proposed method
CLAss Based Micro-Classifier Ensemble
[Flowchart: the latest labeled chunk from the stream is used to train a new model, which updates the ensemble M (M keeps all classes); each latest unlabeled instance goes through outlier detection against M; if it is not an outlier, it is classified using M as an existing class instance; if it is an outlier, it is sent to buffering and novel class detection]
20
Training and Updating
Proposed method
  • Each chunk is first separated into different
    classes
  • A micro-classifier is trained from each class's
    data
  • Each micro-classifier replaces one existing
    micro-classifier
  • A total of L micro-classifiers make a
    Micro-Classifier Ensemble (MCE)
  • C such MCEs (one per class) constitute the whole
    ensemble, E (a minimal sketch follows below)
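A minimal sketch of this training/update step, assuming K-means summaries as micro-classifiers and an oldest-first replacement policy (the replacement rule and parameters are assumptions):

```python
# Sketch of CLAM-style training: one micro-classifier per class per chunk,
# collected into per-class micro-classifier ensembles (MCEs) of size L.
import numpy as np
from collections import defaultdict, deque
from sklearn.cluster import KMeans

L = 3
ensemble = defaultdict(lambda: deque(maxlen=L))   # class label -> MCE of micro-classifiers

def train_micro_classifier(X_class, n_clusters=2):
    """Summarize one class's chunk data as (centroid, radius) hyperspheres."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_class)
    return [(km.cluster_centers_[c],
             np.max(np.linalg.norm(X_class[km.labels_ == c] - km.cluster_centers_[c], axis=1)))
            for c in range(n_clusters)]

def update(ensemble, X_chunk, y_chunk):
    """Split the labeled chunk by class; each new micro-classifier replaces the
    oldest one in that class's MCE (the deque with maxlen=L handles replacement)."""
    for label in np.unique(y_chunk):
        ensemble[label].append(train_micro_classifier(X_chunk[y_chunk == label]))

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
update(ensemble, X, y)
print({k: len(v) for k, v in ensemble.items()})   # one micro-classifier per class so far
```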

21
CLAM: The Proposed Approach
Proposed method
CLAss-Based Micro-Classifier Ensemble
[Flowchart repeated from slide 19: labeled chunks train new models that update the ensemble M, and each unlabeled instance is either classified by M as an existing class or buffered for novel class detection]
22
Outlier Detection and Classification
Proposed method
  • A test instance x is first classified with each
    micro-classifier ensemble
  • Each micro-classifier ensemble gives a partial
    output (Yr) and an outlier flag (boolean)
  • If all ensembles flag x as an outlier, then it is
    buffered and sent to the novel class detector
  • Otherwise, the partial outputs are combined and a
    class label is predicted (a minimal sketch follows below)
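A minimal sketch of this test-time logic, using the same per-class micro-classifier structure as the previous sketch; combining the partial outputs by nearest boundary distance is an assumption for illustration:

```python
# Sketch: classify x with each micro-classifier ensemble (MCE); if every MCE
# flags x as an outlier, buffer it for novel class detection, otherwise
# combine the partial outputs into a predicted label.
import numpy as np

def mce_output(x, mce):
    """Partial output of one class's MCE: is x inside any of its hyperspheres,
    and how far is x from the nearest boundary (signed distance)?"""
    dists = [np.linalg.norm(x - c) - r for micro in mce for c, r in micro]
    return (min(dists) <= 0), min(dists)

def classify_or_buffer(x, ensemble, buffer):
    results = {label: mce_output(x, mce) for label, mce in ensemble.items()}
    if not any(inside for inside, _ in results.values()):
        buffer.append(x)                          # outlier w.r.t. all MCEs
        return None                               # handed to the novel class detector
    return min(results, key=lambda lbl: results[lbl][1])   # combined prediction

# Toy ensemble: one micro-classifier (one hypersphere) per class.
ensemble = {0: [[(np.array([0.0, 0.0]), 1.0)]], 1: [[(np.array([3.0, 3.0]), 1.0)]]}
buffer = []
print(classify_or_buffer(np.array([0.2, 0.1]), ensemble, buffer))  # 0 (existing class)
print(classify_or_buffer(np.array([9.0, 9.0]), ensemble, buffer))  # None (buffered)
print(len(buffer))                                                 # 1
```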

23
Evaluation
Evaluation
  • Competitors
  • CLAM (CL): proposed work
  • SCANR (SC) [1]: prior work
  • ECSMiner (EM) [2]: prior work
  • OLINDDA [3]-WCE [4] (OW): another baseline
  • Datasets: Synthetic, KDD Cup 1999, Forest
    covertype

1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. Detecting recurring and novel classes in concept-drifting data streams. In Proc. ICDM '11, Dec. 2011, pp. 1176-1181.
2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6): 859-874, 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM Symposium on Applied Computing, pp. 976-980, 2008.
4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226-235, Washington, DC, USA, Aug. 2003.
24
Overall Error
Evaluation
[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD]
25
Number of Recurring Classes vs Error
Evaluation
26
Error vs Drift and Chunk Size
Evaluation
27
Summary Table
Evaluation
28
Conclusion
  • Detects recurring classes
  • Improved accuracy
  • Running time
  • Reduced human interaction
  • Future work: use other base learners

29
  • Questions?

30
  • Thanks