Title: Towards Data Mining Without Information on Knowledge Structure
1 Towards Data Mining Without Information on Knowledge Structure
Alexandre Vautier, Marie-Odile Cordier and René Quiniou
Université de Rennes 1, INRIA Rennes - Bretagne Atlantique
Wednesday, September 19th 2007
2 Usual KD Process
- User needs
- A data mining task
- Domain knowledge
[Figure: Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Models -> (Interpretation/Evaluation) -> Knowledge]
3 Usual KD Process
What can a user extract from data without domain knowledge?
4 Application context: Network Alarms
- Represent network alarms
- Understand network behavior
- Detect new DDoS attacks
- An alarm is composed of
  - A directed link between two IP addresses
  - A date
  - A severity (low, med, high), related to the link rate
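The alarm structure listed above can be pictured as a small record type. This is only a minimal sketch: the names `Alarm`, `Severity` and `parse_alarm` are hypothetical, and the ISO date format is an assumption (the slides do not say how dates are encoded).

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from ipaddress import IPv4Address


class Severity(Enum):
    """The three severity levels named on the slide."""
    LOW = "low"
    MED = "med"
    HIGH = "high"


@dataclass(frozen=True)
class Alarm:
    """One network alarm: a directed link, a date and a severity."""
    source: IPv4Address       # origin of the directed link
    destination: IPv4Address  # target of the directed link
    day: date
    severity: Severity


def parse_alarm(src: str, dst: str, day: str, severity: str) -> Alarm:
    """Build an Alarm from plain strings; `day` is assumed to be YYYY-MM-DD."""
    return Alarm(IPv4Address(src), IPv4Address(dst),
                 date.fromisoformat(day), Severity(severity))


alarm = parse_alarm("192.168.2.1", "192.168.2.5", "2005-11-01", "low")
```

Keeping the record immutable (`frozen=True`) makes alarms hashable, which is convenient when they are later stored in sets.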
6 Application context: Network Alarms
[Figure: Alarms -> Data Mining Algorithms -> Models]
- Generalized links: M1 = {192.168.2.1 -> *, * -> 192.168.2.5, ...}
- Sequences: M2 = {1.5.5.* -> 2.2.3.* > 2.2.3.* -> 1.2.3.4, ...}
- Clustering on date and severity: M3 = {(11/01/05 - 11/03/05, low), (11/07/05 - 11/15/05, high)}
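To make the generalized-link model concrete, here is a hedged sketch of how such a pattern could be checked against a concrete link. It assumes that the gaps in the generalized-link examples above stand for a wildcard, written `*` here; the functions `matches` and `covers` are hypothetical names, not taken from the slides.

```python
from typing import Tuple

Link = Tuple[str, str]  # (source address, destination address)


def matches(pattern: str, address: str) -> bool:
    """True if `address` fits `pattern`; '*' alone or a trailing '.*' is a wildcard."""
    if pattern == "*":
        return True
    if pattern.endswith(".*"):
        # "1.5.5.*" accepts every address starting with "1.5.5."
        return address.startswith(pattern[:-1])
    return pattern == address


def covers(pattern: Link, link: Link) -> bool:
    """True if both ends of the concrete link fit the generalized link."""
    return all(matches(p, a) for p, a in zip(pattern, link))


print(covers(("192.168.2.1", "*"), ("192.168.2.1", "10.0.0.1")))
```

A generalized link thus stands for the set of all concrete links it covers.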
7 Objectives
- Goal: search for models that fit the given data
- Current assumption: the user has sufficient knowledge to
  - define the type of model
  - choose the relevant DM algorithm
- Our proposition: relax this assumption by
  - automatically executing DM algorithms to extract models from data
  - evaluating the resulting models in a generic manner, to propose the best-suited model(s) to the user
8 Framework
- DM algorithm specifications
- Data specification
- Unification of specifications
- Model extraction
- Generic evaluation
- Model ranking
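The last three steps of the framework (model extraction, generic evaluation, model ranking) can be sketched as a loop over algorithms followed by a sort. This is a toy illustration under stated assumptions: the unification step is skipped, and `modal_value`, `value_range` and `toy_score` are hypothetical stand-ins for real DM algorithms and for the generic evaluation.

```python
import math
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List, Tuple

Data = List[int]
Algorithm = Callable[[Data], Iterable[Any]]
ScoreFn = Callable[[Data, Any], float]


def rank_models(data: Data, algorithms: Dict[str, Algorithm],
                score: ScoreFn) -> List[Tuple[float, str, Any]]:
    """Run every algorithm, score every extracted model, return best (lowest) first."""
    ranked = [(score(data, model), name, model)
              for name, algorithm in algorithms.items()
              for model in algorithm(data)]
    ranked.sort(key=lambda entry: entry[0])
    return ranked


def modal_value(data: Data) -> Iterable[Any]:
    """Toy algorithm: the model is the most frequent value."""
    return [("value", Counter(data).most_common(1)[0][0])]


def value_range(data: Data) -> Iterable[Any]:
    """Toy algorithm: the model is the interval [min, max]."""
    return [("range", min(data), max(data))]


def toy_score(data: Data, model: Any) -> float:
    """Toy cost: number of model parameters plus a cost for the data given the model."""
    if model[0] == "value":
        return 1 + sum(1 for x in data if x != model[1])
    width = model[2] - model[1] + 1
    return 2 + len(data) * math.log2(width)


ranking = rank_models([5, 5, 5, 5, 6],
                      {"modal_value": modal_value, "value_range": value_range},
                      toy_score)
```

Because every model receives a score on the same scale, models of different kinds end up in a single ranking.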
9 Schemas for specification
- Enhanced algebraic specifications (types, operations and equations)
  - Category theory (Mac Lane, 1942)
  - Sketch (Ehresmann, 1965)
- Use specification inheritance
10 Data specification: Network Alarm Schema
- Node: a type
- Edge:
  - A function
  - A relation
- Green dotted edge (projection): Cartesian product
- Red dashed edge (inclusion): union
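The legend above describes a schema as a typed graph. The sketch below shows one possible encoding. The class names and the concrete alarm schema (an alarm as a product of link, date and severity; a link as a product of two IP addresses; severity as a union of three levels) are assumptions inferred from the alarm description, not a transcription of the figure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class EdgeKind(Enum):
    FUNCTION = "function"
    RELATION = "relation"
    PROJECTION = "projection"  # green dotted: source is a Cartesian product
    INCLUSION = "inclusion"    # red dashed: target is a union


@dataclass(frozen=True)
class Edge:
    name: str
    source: str
    target: str
    kind: EdgeKind


@dataclass
class Schema:
    nodes: Set[str] = field(default_factory=set)    # each node is a type
    edges: List[Edge] = field(default_factory=list)

    def add_edge(self, name: str, source: str, target: str, kind: EdgeKind) -> None:
        self.nodes.update((source, target))
        self.edges.append(Edge(name, source, target, kind))

    def product_components(self, node: str) -> List[str]:
        """Types reached from `node` through projection edges."""
        return [e.target for e in self.edges
                if e.source == node and e.kind is EdgeKind.PROJECTION]


alarm_schema = Schema()
alarm_schema.add_edge("link", "Alarm", "Link", EdgeKind.PROJECTION)
alarm_schema.add_edge("date", "Alarm", "Date", EdgeKind.PROJECTION)
alarm_schema.add_edge("severity", "Alarm", "Severity", EdgeKind.PROJECTION)
alarm_schema.add_edge("source", "Link", "IP", EdgeKind.PROJECTION)
alarm_schema.add_edge("destination", "Link", "IP", EdgeKind.PROJECTION)
for level in ("Low", "Med", "High"):
    alarm_schema.add_edge(level.lower(), level, "Severity", EdgeKind.INCLUSION)
```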
12 DM Algorithm specification: Generalized edges
[Figure: schema of the generalized-edges algorithm, with three annotated parts]
- DM algorithm
- Model type
- Covering relation
14 Schema unification
[Figure: how can the Data Type of the data schema be unified with the Abstract Data Type of a DM algorithm schema?]
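One simple way to picture unification is as a search for mappings from the abstract types of an algorithm schema to the concrete types of the data schema, such that every abstract edge lands on a data edge. The brute-force sketch below is only an illustration of that idea and is not claimed to be the authors' algorithm; the node names `Edge`, `Vertex`, `Alarm`, `Link`, `IP` and `Date` are assumptions.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

Arrow = Tuple[str, str]  # (source type, target type)


def unify(abstract_nodes: List[str], abstract_edges: List[Arrow],
          data_nodes: List[str], data_edges: List[Arrow]) -> Iterator[Dict[str, str]]:
    """Yield each node mapping that sends every abstract edge onto a data edge."""
    data_edge_set = set(data_edges)
    # Every assignment is tried: len(data_nodes) ** len(abstract_nodes) candidates.
    for image in product(data_nodes, repeat=len(abstract_nodes)):
        mapping = dict(zip(abstract_nodes, image))
        if all((mapping[s], mapping[t]) in data_edge_set for s, t in abstract_edges):
            yield mapping


matches = list(unify(
    ["Edge", "Vertex"], [("Edge", "Vertex")],
    ["Alarm", "Link", "IP", "Date"],
    [("Alarm", "Link"), ("Alarm", "Date"), ("Link", "IP")],
))
```

Several mappings can succeed, and each one is a different way of feeding the data to the algorithm. The exhaustive enumeration grows exponentially with the number of abstract nodes.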
18 Generic evaluation
- Compare different kinds of model
- Inspired by Kolmogorov complexity: the complexity of an object x is the size s(p) of the shortest program p that outputs x when executed on a universal machine f
  - $C_f(x) = \min \{\, s(p) \mid f(p) = x \,\}$
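Kolmogorov complexity itself is not computable, so any practical evaluation has to rely on a computable stand-in. As an illustration only (this is not the measure used in the slides), the length of a compressed encoding gives an upper-bound proxy: regular data compresses well, while data without structure does not. The function name `compressed_size` is hypothetical.

```python
import random
import zlib


def compressed_size(x: bytes) -> int:
    """Size in bytes of the zlib-compressed form of x, a rough complexity proxy."""
    return len(zlib.compress(x, 9))


regular = b"ab" * 500                 # 1000 bytes with an obvious pattern
rng = random.Random(0)                # fixed seed so the example is repeatable
noisy = bytes(rng.getrandbits(8) for _ in range(1000))  # 1000 pseudo-random bytes

print(compressed_size(regular), compressed_size(noisy))
```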
19 Generic evaluation
- Complexity of data d in a schema S relative to a model m, with covering relation $c : M \to D$
- $K(d, m, S)$: complexity of
  - $k(M)$: the model structure
  - $k(D)$: the data structure
  - $k(c)$: the covering relation
  - $k(m \mid M)$: the model
  - $k(d \mid m, c, D)$: the data knowing the model, the covering relation and the data structure
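The five components listed above can be gathered in one record. The sketch below assumes that the overall complexity is the sum of the components, in the style of a two-part code; the slide lists the terms without stating how they are combined. The class name and the numbers are arbitrary illustrations, not results.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ComplexityTerms:
    model_structure: float    # k(M)
    data_structure: float     # k(D)
    covering_relation: float  # k(c)
    model: float              # k(m | M)
    data_given_model: float   # k(d | m, c, D)

    def total(self) -> float:
        """Assumed combination: the plain sum of all terms."""
        return sum(getattr(self, f.name) for f in fields(self))


# Arbitrary illustrative values: a small model that explains the data poorly
# versus a larger model that explains it well.
small_model = ComplexityTerms(10, 20, 5, 30, 200)
large_model = ComplexityTerms(10, 20, 8, 120, 90)
best = min((small_model, large_model), key=ComplexityTerms.total)
```

Under this assumption, a richer model is preferred only when the saving on the data term outweighs the extra cost of describing the model.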
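22 Path Indexing: Covering Relation Decomposition
Null decomposition
- Covering relation $c : M \to D$, with the model m in M, the data d in D, and c(m) the part of D covered by m
- $k(d \mid m, c, D) = k(d \mid c(m)) + k(d \setminus c(m) \mid D)$
Decomposition relying on relation composition
- $c = s \circ t : M \to D$, with $t : M \to A$ and $s : A \to D$
- $c(m) = s \circ t(m)$, with an intermediate a in t(m) and s(a) in D
- $k(d \mid m, s \circ t, D) = k(a \mid t(m)) + k(d \mid s(a)) + k(d \setminus s(a) \mid D)$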
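The two decompositions can be tried out on a toy example. The sketch below assumes a uniform code: pointing at one element inside a set of n elements costs log2(n) bits. That cost model and the function names are assumptions made for illustration; the slides do not specify how each term is computed.

```python
import math
from typing import Dict, Set


def index_cost(items: Set, universe: Set) -> float:
    """Bits needed to index every item inside `universe` with a uniform code."""
    if not items:
        return 0.0
    return len(items) * math.log2(len(universe))


def data_cost(d: Set[int], cover: Set[int], D: Set[int]) -> float:
    """Null decomposition: k(d | c(m)) + k(d \\ c(m) | D)."""
    return index_cost(d & cover, cover) + index_cost(d - cover, D)


def composed_data_cost(d: Set[int], a: Set[str], t_m: Set[str],
                       s: Dict[str, Set[int]], D: Set[int]) -> float:
    """Composition: k(a | t(m)) + k(d | s(a)) + k(d \\ s(a) | D)."""
    s_a = set().union(*(s[x] for x in a))
    return index_cost(a, t_m) + index_cost(d & s_a, s_a) + index_cost(d - s_a, D)


D = set(range(1024))
d = {1, 2, 3, 4}
# Hypothetical intermediate type A: four groups of eight data elements each.
s = {"g%d" % i: set(range(8 * i, 8 * i + 8)) for i in range(4)}
t_m = set(s)                          # t(m): the groups selected by the model
c_m = set().union(*s.values())        # c(m) = s(t(m)): 32 data elements

direct = data_cost(d, c_m, D)                           # 4 * log2(32)
composed = composed_data_cost(d, {"g0"}, t_m, s, D)     # log2(4) + 4 * log2(8)
```

In this example, going through the intermediate element first narrows the part of the data space to index, so the composed cost is lower than the direct one.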
23 Experiments
- Extraction of clusters, generalized edges, and sequences
- Dataset: 10,000 alarms
- Duration: 400 seconds (excluding the run time of the DM algorithms)
- 6 operational algorithms
- Experiments on datasets generated by models
- Network alarms from a real network
24 Discussion
- Unification
  - Exponential in time with respect to the number of nodes in a schema
- Generic evaluation
  - Linear in time and space
- Adapt the evaluation method
  - User defined
  - According to a model visualization
  - According to local data instead of global data
25 What do schemas bring to Data Mining?
- Describe data and DM algorithms with a common language
- Allow the data structure to be unified with the input of DM algorithms
- Provide a way to compute the model complexity relative to a type in a schema
- Provide a way to compute the data complexity relative to
  - A model
  - A covering relation and its decomposition
- Can be implemented efficiently