Title: GridMiner: Vision, Design and Underlying Grid Technology
1GridMiner: Vision, Design and Underlying Grid Technology
- Ivan Janciak, Peter Brezany and A Min Tjoa
- Vienna University of Technology
- Institute of Scientific Computing
- email janciak_at_par.univie.ac.at
2Motivation
[Figure: the CRISP-DM process cycle (Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment) around the Data, annotated with GridMiner, three Service Providers and a Data provider. Source: CRISP-DM, SPSS]
3Outline
- Motivation
- Data Mining and The Grid
- GridMiner Architecture
- Workflow Engine
- Knowledge Base
- Data Mining Services
- Decision Trees
- OLAP
- GridMiner The Movie
4Data Mining
- Data Mining process
- Data understanding
- Statistics
- Metadata exploration
- Data preparation
- Integration, Selection, Transformation
- Data cleaning
- Modeling / Evaluation
- Methods / Algorithms selection
- Tuning data mining task parameters
- Reporting
- Visualization
5The Grid
- Large Datasets
- Distributed data
- Access to data, data transfer
- Heterogeneity
- Semantic issues
- Distributed Computing Resources
- Security
- Different policies
- Access rights
6Requirements of data mining system on the Grid
- Analyze huge and distributed datasets
- Sophisticated data access system
- Data Mediator
- Interactive workflow management
- Workflow language and engine
- Data Mining services
- Parallel or distributed versions
- Knowledge Base
- Documents (Metadata, Models, Workflows, Rules, ...)
- Graphical user interface
- Secure system
7Basic Components
- Grid Layer
- Grid Services
- Grid Data services OGSA-DAI
- Data Mining services OGSA/OGSI
- Web Layer
- Web applications JSP/JavaBeans
- Service Configuration
- Data mining task configuration
- User <-> Service Interaction
- Documents
- XML/XSLT/XSD documents
- Workflow description, Metadata, Models (PMML), Perform Document, Mediation Schema
8Architecture Overview
Graphical User Interface
Web
Knowledge Base
Service Task configuration
DSCE Client
Visualization
Data Exploration
Grid
Dynamic service control engine
Grid Data Service
Data mining service
9Workflows
User
DSCL
- Dynamic Service Control Engine (DSCE)
- processes the workflow according to DSCL
- Dynamic Service Control Language (DSCL)
- based on XML
- easy to use
DSCE Client
DSCE
Service A
Service B
Service D
Service C
10Workflow language (DSCL)
User's view: act1, followed by act2.1 and act2.2 in parallel
Conversion to XML:
dscl
  variables
  composition
    sequence
      createService activityID=act1
      parallel
        invoke activityID=act2.1
        invoke activityID=act2.2
11DSCL and OGSA-DAI Perform document
<dscl>
  <variables>
    <variable name="PERFORM_DOCUMENT">
      <value>
        <gds:gridDataServicePerform>
          <gds:sqlQueryStatement name="myQuery">
            <gds:expression>select * from test</gds:expression>
            <gds:webRowSetStream name="myQueryOutput"/>
          </gds:sqlQueryStatement>
        </gds:gridDataServicePerform>
      </value>
    </variable>
    <variable name="PERFORM_RESULTS"/>
  </variables>
  <composition>
    <sequence>
      <createService activityID="START" factory-gsh="http://localhost:89/ogsa/services/ogsadai/GDSF"/>
      <invoke activityID="DAI001" operation="perform">
        <parameter variable="PERFORM_DOCUMENT"/>
        <result variable="PERFORM_RESULTS"/>
      </invoke>
    </sequence>
  </composition>
</dscl>
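As a rough illustration of how an engine might process such a document, the sketch below walks a DSCL-like composition with sequence and parallel blocks. It is not the DSCE implementation: the element names follow the slides, but the sample document, the `run` and `execute` functions, and the stand-in "service call" (which only records the activity ID) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical DSCL-like document modeled on the slides (not a real GridMiner file).
DSCL = """
<dscl>
  <variables>
    <variable name="PERFORM_DOCUMENT"><value>select * from test</value></variable>
    <variable name="PERFORM_RESULTS"/>
  </variables>
  <composition>
    <sequence>
      <createService activityID="act1"/>
      <parallel>
        <invoke activityID="act2.1" operation="perform">
          <parameter variable="PERFORM_DOCUMENT"/>
          <result variable="PERFORM_RESULTS"/>
        </invoke>
        <invoke activityID="act2.2" operation="perform"/>
      </parallel>
    </sequence>
  </composition>
</dscl>
"""


def run(node, variables, log):
    """Execute one composition node; `log` collects the executed activity IDs."""
    if node.tag == "sequence":
        for child in node:  # children run one after another
            run(child, variables, log)
    elif node.tag == "parallel":
        with ThreadPoolExecutor() as pool:  # children run concurrently
            list(pool.map(lambda child: run(child, variables, log), node))
    elif node.tag in ("createService", "invoke"):
        # Assumption: a real engine would contact a Grid service here.
        log.append(node.get("activityID"))
        result = node.find("result")
        if result is not None:
            param = node.find("parameter")
            arg = variables.get(param.get("variable")) if param is not None else None
            variables[result.get("variable")] = f"result of {arg}"
    else:
        raise ValueError(f"unsupported element: {node.tag}")


def execute(dscl_text):
    """Parse the document, load its variables, then run the composition."""
    root = ET.fromstring(dscl_text)
    variables = {}
    for var in root.find("variables"):
        value = var.find("value")
        variables[var.get("name")] = value.text if value is not None else None
    log = []
    for child in root.find("composition"):
        run(child, variables, log)
    return variables, log
```

Calling `execute(DSCL)` returns the final variable table and the list of executed activities; `act1` always comes first, while the order of `act2.1` and `act2.2` may vary because they run in parallel.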
12Workflow client (DSCE client)
- Implemented as Web application
- Finalize DSCL Document
- Receives notifications from DSCE
- Delivers results to KB
- Workflow optimization
- caching
- Interaction with DSCE engine
- Start
- Stop
- Pause
- Resume
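The four control commands suggest a small state machine. The sketch below is one possible reading, not the DSCE client's actual logic: the state names, the transition table and the `WorkflowControl` class are assumptions introduced for illustration.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"


# Assumed transition table: (command, current state) -> next state.
TRANSITIONS = {
    ("start", State.CREATED): State.RUNNING,
    ("pause", State.RUNNING): State.PAUSED,
    ("resume", State.PAUSED): State.RUNNING,
    ("stop", State.RUNNING): State.STOPPED,
    ("stop", State.PAUSED): State.STOPPED,
}


class WorkflowControl:
    """Tracks the state of one workflow and rejects invalid commands."""

    def __init__(self):
        self.state = State.CREATED

    def send(self, command):
        key = (command, self.state)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {command} while {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state
```

Keeping the allowed transitions in a single table makes it easy to see, for example, that a stopped workflow cannot be resumed in this sketch.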
13Knowledge Base
- XML Database (Xindice)
- Store and share documents
- DSCL, PMML, Mapping schemas, XSLT, Perform documents
- Rules / Facts: SWRL
- Models: PMML MiningModel
- Ontologies: OWL/OWL-S
- Metadata: PMML DataDictionary
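To make the metadata entry more concrete, the sketch below reads field descriptions from a PMML DataDictionary using the Python standard library. The sample document and the `read_data_dictionary` helper are assumptions for this example; real PMML files also declare a version-specific XML namespace, which is omitted here for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free PMML fragment used only for illustration.
PMML = """
<PMML version="3.0">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="gender" optype="categorical" dataType="string">
      <Value value="female"/>
      <Value value="male"/>
    </DataField>
    <DataField name="outcome" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>
"""


def read_data_dictionary(pmml_text):
    """Return {field name: {optype, dataType, values}} from a DataDictionary."""
    root = ET.fromstring(pmml_text)
    fields = {}
    for field in root.find("DataDictionary").findall("DataField"):
        fields[field.get("name")] = {
            "optype": field.get("optype"),
            "dataType": field.get("dataType"),
            # Enumerated category values, if the field lists any.
            "values": [v.get("value") for v in field.findall("Value")],
        }
    return fields
```

A data mining service could use such a table to decide which attributes are categorical and which are continuous before building a model.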
14Components and Documents interactions
Web Applications
Knowledge Base
Services
Visualization
PMML
XSLT
Data mining service
Service/Task Configuration
DSCL
DSCE Engine
DSCE Client
Perform Document / Mapping Schema
Grid Data Service
Data Exploration
15Data Mining Service provided by GridMiner
- Kernel Services
- Decision Trees (distributed version)
- SPRINT algorithm
- Pruning
- Tree evaluation
- OLAP (parallel version)
- Sequences (sequential version)
- SPADE
- Clustering (sequential version)
- SimpleKMeans
- Ongoing work
- Text Mining - classification
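SPRINT chooses split points by scanning sorted attribute lists and scoring candidate thresholds with the Gini index. The sketch below shows that scoring step for a single numeric attribute on one machine. It is a simplified illustration, not the GridMiner service: the function names and the toy data in the test are assumptions, and the distributed bookkeeping of SPRINT is left out.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class histogram holding `total` records."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_numeric_split(values, labels):
    """Return (weighted Gini, threshold) of the best split `value <= threshold`.

    Scans the attribute list once in sorted order, moving records from the
    `above` histogram to the `below` histogram. Returns (inf, None) when all
    values are equal and no split exists.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    below = Counter()
    above = Counter(label for _, label in pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = pairs[i]
        below[label] += 1
        above[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue  # a threshold can only fall between distinct values
        n_below, n_above = i + 1, n - (i + 1)
        score = (n_below * gini(below, n_below) + n_above * gini(above, n_above)) / n
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best
```

A tree builder would run this for every attribute, split on the attribute with the lowest score, and recurse on the two partitions.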
16Decision Tree Service - DT
[Figure: decision tree (DT) service layout. Labels: Master with DT, Data, Model (XML); Slave 1 with DT; Slave 2 with DT]
17Decision Tree Service cont.
18Decision Tree Service cont.
19Decision Tree Service - Test
Test dataset: XML file (WebRowSet), 6 attributes
[Chart: execution time (ms, axis from 50k to 130k) versus dataset size (250 000 and 500 000 records) for 1, 2 and 4 nodes]
20OLAP Service
[Figure: OLAP service layout. Labels: Master with Virtual Cube, Data, query and answer (XML); Index Service with Indexes; Slave 1, Slave 2 and Slave 3, each with a Sub Cube]
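The master/slave layout of the OLAP service can be pictured as partial aggregation: each slave aggregates its own sub cube and the master merges the partial answers. The sketch below illustrates only that idea with threads in one process. The toy rows, `slave_aggregate` and `master_query` are assumptions for this example, and the Index Service is not modeled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sub cubes: rows of (region, year, sales), one list per slave.
SUB_CUBES = [
    [("north", 2003, 10.0), ("south", 2003, 5.0)],
    [("north", 2004, 7.0)],
    [("south", 2004, 3.0), ("north", 2003, 1.0)],
]


def slave_aggregate(sub_cube, dim_index):
    """Sum the measure (last column) per value of one dimension."""
    partial = defaultdict(float)
    for row in sub_cube:
        partial[row[dim_index]] += row[-1]
    return dict(partial)


def master_query(sub_cubes, dim_index):
    """Send the query to every slave and merge the partial sums."""
    total = defaultdict(float)
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda cube: slave_aggregate(cube, dim_index), sub_cubes)
        for partial in partials:
            for key, value in partial.items():
                total[key] += value
    return dict(total)
```

For example, `master_query(SUB_CUBES, 0)` groups sales by region and `master_query(SUB_CUBES, 1)` groups them by year. Sums merge cleanly across slaves; measures such as averages would need each slave to return both a sum and a count.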
21Graphical User Interface
- Service Configuration
- Services selection
- Services Configuration
- Data Mining Task Configuration
- Setting parameters of methods / algorithms
- Data Processing
- Data Access
- Data Integration
- Data Selection
- Data Statistics, Histograms
- Workflow Execution
- Interaction with DSCE Client
- Results Visualization
22Visualization
PMML Document transformed to SVG
23Summary
- GridMiner: the first project addressing all facets of knowledge discovery on the Grid
- Running prototype available
- Ongoing work on
- Semantic data integration
- Knowledge management
- Service performance optimization
- Grid intelligence
24GridMiner Group Members