Title: The Data Mining Visual Environment
1The Data Mining Visual Environment
Motivation  Major problems with existing DM
systems They are based on non-extensible
frameworks. They provide a non-uniform mining
environment - the user is presented with totally
different interface(s) across implementations of
different DM techniques. Major needs An overall
framework that can support the entire Knowledge
Discovery (KD) process (accommodate and integrate
all KD phases seamlessly). Â Placing the user at
the center of the entire KD process/in the
framework. In fact the corresponding system
should provide a consistent, uniform and flexible
visual interaction environment that supports the
user throughout the entire discovery process.
2The Data Mining Visual Environment
- Primary layers
- User layer
- Engine layer
- Data layer
- Main features
- Open
- Modular with
- well defined
- modification/extension points
- Possible integration of
- different tasks
- eg output reuse by another task
- User flexibility and enablement to
- process data and knowledge,
- drive and guide the entire KD
- process
System Architecture
At present, there is a partial prototype,
complete implementation is underway.
3The Data Mining Visual Environment
- Primary layers
- User Interface/GUI Container
- (interacts with specific DM
- visual environments)
- Developed using Java
- Abstract DM Engine/Wrapper of
- DM algorithms (interacts with
- specific DM algorithms)
- Developed using Java
- Note
- The DM algorithms may be
- implemented by third parties
- in possibly any language.
- DM methods (but not limited to)
- MQs, ARs, and clustering
System Architecture The Prototype
Stephen
Stefano
4The Data Mining Visual Environment
Visual Environment A consistent, uniform,
flexible and intuitive GUI, with support
throughout the whole DM process. The principal
focus is to support the user in Visual
construction of the task relevant dataset The
user directly interacts with data. For this task,
there are two intuitive interaction
spaces. Visual construction of the mining query
The user directly interacts with data and other
parameters (e.g. threshold values) in making
queries e.g., in the Metaquery Environment, the
user can suggest patterns by linking attributes,
while the Association Rule Environment offers
visual baskets. Visual output presentation and
interaction Exploiting relevant effective
visualizations and where necessary, we have
designed novel visualizations. Planning E.g.,
advertising relevant prior knowledge. Â Handling
the non-static nature of users quest E.g.,
enabling user to adjust.
5The Data Mining Visual Environment
Visual Environment Overall
6The Data Mining Visual Environment
Visual Environment Tree View (Progress
Companion)
Before user settings
After user settings
After DM results
7The Data Mining Visual Environment
Visual Environment Clustering - Input
8The Data Mining Visual Environment
Visual Environment Clustering - Output
... for more on the prototype, demo
9The Data Mining Visual Environment
Usability Usability heuristics Done, but
regular reference to the same will go
on. Â Mock-up tests Done with DM experts. The
experts gave an encouraging feedback and even
suggestions on how to improve the
interface. (These tests were done at the end of
2001.) Questionnaire experiments The
experiments involved the application simulation,
a case study, data schema and user tasks
corresponding to the case study, and a
questionnaire. Positive interface features
consistency, layout/organization, visual
exploration. Negative interface features size of
some visual elements small/big (These tests were
done in July 2002.) Formal usability tests In
the pipeline.
10The Data Mining Visual Environment
- The Clustering Engine
- Clustering method Generalizations of three
techniques homogeneity, separation, density. - Clustering based on homogeneity/separation
Homogeneity (separation) is a global measure of
the similarity between points belonging to the
same cluster (to different clusters) - Clustering based on density Clusters are regions
of the object space where objects are located
most frequently - Clustering based The system selects the best
clustering according to a cost function - For homogeneity/separation-based clustering the
cost function is computed by evaluating
pointwise, clusterwise, and partitionwise
similarity/dissimilarity - For density-based clustering, the cost function
is derived from an estimated density function
11The Data Mining Visual Environment
- Formal Semantics of the Input Environment
- Visual language abstract syntax semantics
- Abstract syntax defined in terms of multi-graphs
- Visual components are vertices of the multi-graph
- Spatial relations between visual components are
edges of the multi-graph - Semantics
- Clustering defined by a mapping between
multi-graphs and cost functions and predicates
expressing optimality - Metaqueries/association rules defined by a
mapping between multi-graphs and rules - Â
12The Data Mining Visual Environment
- Operational Specification
- Concrete, high-level syntax of the tasks proposed
in the usability tests - Describes legal click-streams allowed to occur
during operation - Standard grammar notation
- Communication protocol between the abstract
clustering engine and the data mining engines - XML DTDs based on PMML 2.0
- Extension of PMML 2.0 to
- Specification of input
- Broader spectrum of clustering methods
- Concrete semantics of the clustering task by
mapping on symbols in the tasks grammar to
structures of the communication protocol - Interpretation function recursively defined on
the grammar rules of the high-level syntax of
tasks - The interpretation of a legal click-stream is an
XML document satisfying the DTD of the input
specification