Title: Classification and Novel Class Detection in Data Streams
1 Classification and Novel Class Detection in Data Streams
- Mehedy Masud (1), Latifur Khan (1), Jing Gao (2), Jiawei Han (2), and Bhavani Thuraisingham (1)
- (1) Department of Computer Science, University of Texas at Dallas
- (2) Department of Computer Science, University of Illinois at Urbana-Champaign
This work was funded in part by
2 Presentation Overview
- Novel Class Detection and Concept-Evolution
3 Data Streams
- Network traffic
- Sensor data
- Call center records
4 Data Stream Classification
- Uses past labeled data to build a classification model
- Predicts the labels of future instances using the model
- Helps decision making
5 Challenges
Introduction
- Infinite length
- Concept-drift
- Concept-evolution (emergence of novel classes)
- Recurring (seasonal) classes
6 Infinite Length
- Impractical to store and use all historical data
- Would require unbounded storage and running time
7 Concept-Drift
[Figure: a data chunk containing positive and negative instances; some instances fall victim to concept-drift as the class boundary shifts]
8 Concept-Evolution
[Figure: a two-dimensional feature space (axes x and y, thresholds x1, y1, y2) with existing class regions A, B, and C; a novel class D emerges within the space]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = ...
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = ...
Existing classification models misclassify novel class instances.
9 Background: Ensemble of Classifiers
[Figure: an unlabeled instance (x, ?) is input to classifiers C1, C2, and C3; their individual outputs are combined by voting to produce the ensemble output]
10 Background: Ensemble Classification of Data Streams
- Divide the data stream into equal-sized chunks
- Train a classifier from each data chunk
- Keep the best L such classifiers as the ensemble
- Example for L = 3
Note: Di may contain data points from different classes.
[Figure: incoming data chunks D4, D5, D6 (labeled and unlabeled); classifiers C4 and C5 are trained from the labeled chunks, and the best L of the classifiers C1-C5 form the ensemble]
Addresses infinite length and concept-drift.
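The keep-the-best-L update above can be sketched as follows, under the assumption that "best" means most accurate on the latest labeled chunk (the helper names are hypothetical):

```python
# Sketch of chunk-based ensemble maintenance: train one classifier per
# labeled chunk, keep only the L most accurate on the newest chunk.

def update_ensemble(ensemble, new_clf, labeled_chunk, L=3):
    """Add the classifier trained on the newest chunk, then keep the L
    candidates with the highest accuracy on that chunk."""
    def accuracy(clf):
        return sum(clf(x) == y for x, y in labeled_chunk) / len(labeled_chunk)
    candidates = ensemble + [new_clf]
    return sorted(candidates, key=accuracy, reverse=True)[:L]
```

Because accuracy is always measured on the latest chunk, this update also adapts to concept-drift: classifiers trained on outdated concepts score poorly and fall out of the ensemble.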
11 Examples of Recurring and Novel Classes
Introduction
- Twitter stream: a stream of messages
- Each message may be given a category or class based on its topic
- Examples: Election 2012, London Olympics, Halloween, Christmas, Hurricane Sandy, etc.
- Among these, Election 2012 and Hurricane Sandy are novel classes because they are new events.
- Halloween is a recurring class because it recurs every year.
12 Concept-Evolution and Feature Space
Introduction
[Figure: the same two-dimensional feature space (axes x and y, thresholds x1, y1, y2) with existing class regions A, B, and C; a novel class D emerges within the space]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = ...
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = ...
Existing classification models misclassify novel class instances.
13 Novel Class Detection: Prior Work
Prior work
- Three steps:
  - Training and building the decision boundary
  - Outlier detection and filtering
  - Computing cohesion and separation
14 Training: Creating the Decision Boundary
Prior work
- Training is done chunk-by-chunk (one classifier per chunk)
- An ensemble of classifiers is used for classification
[Figure: raw training data in the (x, y) feature space with class regions A, B, C, D are grouped into clusters; the cluster boundaries together form the decision boundary]
Addresses the infinite length problem.
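The decision boundary construction above can be sketched as follows, assuming (as the figure suggests) that the boundary is the union of hyperspheres obtained by clustering a chunk's raw training data; the bare-bones k-means here is a stand-in for whatever clustering the prior work uses:

```python
# Sketch: cluster a chunk, summarize each cluster as a hypersphere
# (centroid + radius to farthest member), and treat the union of
# hyperspheres as the decision boundary.
import math
import random

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

def build_boundary(chunk, k=2):
    """Summarize each non-empty cluster as (centroid, radius)."""
    centers, groups = kmeans(chunk, k)
    return [(c, max(math.dist(c, p) for p in g))
            for c, g in zip(centers, groups) if g]

def inside(boundary, x):
    """A test instance inside any hypersphere is not a raw outlier."""
    return any(math.dist(c, x) <= r for c, r in boundary)
```

Storing only (centroid, radius) summaries instead of raw instances is what lets this approach cope with the infinite-length problem.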
15 Outlier Detection and Filtering
Prior work
[Figure: a test instance inside the decision boundary is not an outlier; a test instance outside the decision boundary is a raw outlier (Routlier). If x is a Routlier AND not an existing class instance, x becomes a filtered outlier (Foutlier), a potential novel class instance]
Routliers may appear as a result of novel classes, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
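The filtering step can be sketched as below, under the assumption (one common rule in this line of work, matching the AND gate in the flowchart) that x is kept as a Foutlier only when every model in the ensemble flags it as a raw outlier:

```python
# Sketch of Routlier -> Foutlier filtering over hypersphere boundaries.
import math

def is_routlier(boundary, x):
    """Raw outlier: x falls outside every hypersphere of one model's
    decision boundary (a list of (center, radius) pairs)."""
    return all(math.dist(c, x) > r for c, r in boundary)

def is_foutlier(model_boundaries, x):
    """Filtered outlier: x is a raw outlier for every model, which
    makes noise or mild drift much less likely as the explanation."""
    return all(is_routlier(b, x) for b in model_boundaries)
```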
16 Computing Cohesion and Separation
Prior work
[Figure: a Foutlier x, its q-neighborhood of Foutliers λo,q(x) giving a(x), and the q-neighborhoods of the existing classes giving b+(x) and b-(x)]
- a(x): mean distance from a Foutlier x to the instances in λo,q(x)
- bc(x): mean distance from x to the q nearest instances of existing class c
- bmin(x): minimum among all bc(x) (e.g., b(x) in the figure)
- q-Neighborhood Silhouette Coefficient: q-NSC(x) = (bmin(x) - a(x)) / max(bmin(x), a(x))
- If q-NSC(x) is positive, it means x is closer to the Foutliers than to any existing class.
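The q-NSC computation follows directly from the definitions above; the data layout here (lists of 2-D points, a dict of class instances) is illustrative:

```python
# Sketch of the q-neighborhood silhouette coefficient (q-NSC).
import math

def mean_dist_to_q_nearest(x, points, q):
    """Mean distance from x to its (at most) q nearest points."""
    dists = sorted(math.dist(x, p) for p in points)[:q]
    return sum(dists) / len(dists)

def q_nsc(x, foutliers, class_instances, q):
    """q-NSC(x) = (bmin(x) - a(x)) / max(bmin(x), a(x)).
    Positive: x is closer to the other Foutliers (cohesion) than to
    any existing class (separation), suggesting a novel class."""
    a = mean_dist_to_q_nearest(x, foutliers, q)
    b_min = min(mean_dist_to_q_nearest(x, pts, q)
                for pts in class_instances.values())
    return (b_min - a) / max(b_min, a)
```

By construction q-NSC lies in [-1, 1], so a fixed threshold (e.g., 0) can be applied uniformly across Foutliers.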
17 Limitation: Recurring Classes
Prior work
18 Why Are Recurring Classes Forgotten?
Prior work
- Divide the data stream into equal-sized chunks
- Train a classifier from each whole data chunk
- Keep the best L such classifiers as the ensemble
- Example for L = 3
- Therefore, old models are discarded
- Old classes are forgotten after a while
[Figure: data chunks D4, D5, D6 (labeled and unlabeled) yield classifiers C4 and C5; only the best L of C1-C5 remain in the ensemble]
Addresses infinite length and concept-drift.
19 CLAM: The Proposed Approach
Proposed method
CLAss-based Micro-classifier ensemble (CLAM)
[Figure: the latest labeled chunk of the stream trains a new model, which updates the ensemble M; M keeps all classes. Each latest unlabeled instance goes through outlier detection against M. If it is not an outlier, it is classified using M as an existing class instance; if it is an outlier, it goes to buffering and novel class detection]
20 Training and Updating
Proposed method
- Each chunk is first separated into its constituent classes
- A micro-classifier is trained from each class's data
- Each new micro-classifier replaces one existing micro-classifier
- A total of L micro-classifiers make a micro-classifier ensemble (MCE)
- C such MCEs (one per class) constitute the whole ensemble, E
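The class-based update can be sketched as follows (the helper names are hypothetical). The talk only says that one existing micro-classifier is replaced; this sketch assumes the oldest one is dropped:

```python
# Sketch of CLAM-style training: split a labeled chunk by class, train
# one micro-classifier per class, and slot it into that class's
# fixed-size micro-classifier ensemble (MCE).
from collections import defaultdict

def train_chunk(chunk, ensemble, train_micro, L=3):
    """chunk: list of (x, label); ensemble: dict label -> list of at
    most L micro-classifiers, newest last."""
    per_class = defaultdict(list)
    for x, y in chunk:
        per_class[y].append(x)
    for y, xs in per_class.items():
        ensemble.setdefault(y, []).append(train_micro(xs))
        if len(ensemble[y]) > L:
            ensemble[y].pop(0)  # assumption: replace the oldest
    return ensemble
```

Because each class keeps its own MCE, a class's models survive even when that class is absent from recent chunks, which is what lets CLAM recognize recurring classes.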
21 CLAM: The Proposed Approach
Proposed method
CLAss-based Micro-classifier ensemble (CLAM)
[Figure: the latest labeled chunk of the stream trains a new model, which updates the ensemble M; M keeps all classes. Each latest unlabeled instance goes through outlier detection against M. If it is not an outlier, it is classified using M as an existing class instance; if it is an outlier, it goes to buffering and novel class detection]
22 Outlier Detection and Classification
Proposed method
- A test instance x is first classified with each micro-classifier ensemble
- Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
- If all ensembles flag x as an outlier, it is buffered and sent to the novel class detector
- Otherwise, the partial outputs are combined and a class label is predicted
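The test-time flow above can be sketched as below; each micro-classifier ensemble is stood in for by a callable returning a (label, confidence, is_outlier) triple, a format assumed here for illustration:

```python
# Sketch of CLAM's test-time decision: buffer for novel class detection
# only if every micro-classifier ensemble flags x as an outlier.
def classify_or_buffer(mces, x):
    results = [mce(x) for mce in mces]
    if all(is_outlier for _, _, is_outlier in results):
        return ("buffer", None)  # potential novel class instance
    # combine the partial outputs of the non-flagging ensembles
    label, _ = max(((lbl, conf) for lbl, conf, flag in results if not flag),
                   key=lambda t: t[1])
    return ("classify", label)
```

Requiring unanimous outlier flags is deliberately conservative: a single ensemble's drift or noise cannot divert x away from normal classification.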
23 Evaluation
Evaluation
- Competitors
  - CLAM (CL): proposed work
  - SCANR (SC) [1]: prior work
  - ECSMiner (EM) [2]: prior work
  - OLINDDA [3] + WCE [4] (OW): another baseline
- Datasets: Synthetic, KDD Cup 1999, Forest Covertype
1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. Detecting recurring and novel classes in concept-drifting data streams. In Proc. ICDM '11, Dec. 2011, pp. 1176-1181.
2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6):859-874, 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM Symposium on Applied Computing, pp. 976-980, 2008.
4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 226-235, Washington, DC, USA, Aug. 2003.
24 Overall Error
Evaluation
Error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD
25 Number of Recurring Classes vs. Error
Evaluation
26 Error vs. Drift and Chunk Size
Evaluation
27 Summary Table
Evaluation
28 Conclusion
- Detects recurring classes
- Improved accuracy
- Running time
- Reduced human interaction
- Future work: use other base learners