Title: Session 28: Distributed Data Mining Research using Grids and Web Services
1Session 28 Distributed Data Mining Research
using Grids and Web Services
11 July
- Author/Presenter: Peter Brezany
- University of Vienna, Austria
2Motivation
Medicine
Scientific experiments
Data and data exploration
cloud
Business
Simulations
Earth observations
3Outline
- Motivation
- Selected projects
- Data mining model
- Towards high productivity analytics
- Parallel and distributed data mining and OLAP in
the GridMiner/ADMIRE projects
- Future developments
4Selected Projects
5A Long-Term Biodiversity, Ecosystem and Awareness
Research Network ALTER-Net
Author: Kathi Schleidt
Common Ontology
6China-Austria Data Grid (CADGrid)
- Main idea: Medical Meridian Measurement Grid
(M3G) for On-Line Diagnosis
- The diabetic domain is the first domain to profit
substantially from the project results
7Motivation
- Meridian theory is an important part of
Traditional Chinese Medicine (TCM)
- Clinical practices of TCM (esp. acupuncture) have
been guided by meridian theory for thousands of
years
- More than 4000 years of experience
- Knowledge that we should not only use but also
support with modern high-tech measurement and IT
technologies
3-Dec-07
CADGrid
8Meridian-Theory Basics (1)
- According to TCM, the human body has 14
acupuncture meridians
- Secret to our biological and medical knowledge
- Each meridian has its main points, called source
points
9Meridian-Theory Basics (2)
- Using data mining techniques, correlations
between these points can be identified, e.g.
- start-end point correlation
- symmetric point correlation
- If there is pain at one place along a
meridian, a good effect can be achieved by
treating another place on the same line
10Meridian-Theory Basics (3)
- Meridians can transport
- physical, medical, biological material and
information
- The characteristics (weaker or stronger output,
time delay, etc.) gained by the analysis of
electro-signals sensed from meridians have a
strong relationship with the human body organs
(heart, lung, brain, etc.)
11Meridian Measurement Methods
1. Active
2. Passive
12Active Measurement
Up-flow point: lower electrical potential
Down-flow point: higher electrical potential
Fingers and toes: zero potential
13Passive Measurement
14- Application 1
- Non-invasive Glucose Measurement (NIGM)
- Meridian Measurement Instrument
15The First Prototype
16NIGM Workflow
17M3G Services for Diabetics: NIGM-Service Model Setup
18M3G Services for Diabetics: NIGM-Service Use Model
19M3G Services for Diabetics: NIGM-Service Maintain Model
20CADGrid Framework
Balatonfüred, Hungary - 6th-18th July 2008
21Intelligence Base
22Future Work
- Extension to other domains
- Brain Informatics domain
23CRISP-DM
Data Understanding
Business Understanding
Data Preparation
Data
Deployment
Modelling
Evaluation
24Towards High Productivity Analytics
Motivation
A Project Sponsored by
25High Productivity Analytics
- Our definition
- A high-productivity analytics system is one that
delivers a high level of performance and guarantees
a high level of accuracy of the analytics models and
other results extracted from the analyzed data sets,
while scoring equally well on other aspects, such as
usability, robustness, system management, and
ease of programming.
26High Productivity Analytics Research Agenda
- High-performance services developed with
high-productivity languages and tools
- Efficient workflow management (building and
execution)
- Advanced GUI
- Illustration on the GridMiner system
27GridMiner Data Mining Model
Service Provider
Business understanding
Data understanding
Service Provider
Data provider
Data Preparation
Data
GridMiner
Deployment
Service Provider
Modeling
Evaluation
CRISP-DM, SPSS
28GridMiner Conceptual Architecture
Data and functional resources can be
geographically distributed; the focus is on
virtualization and large-scale data mining.
29Motivation for large-scale data mining
Figure: accuracy (axis up to 100) versus data size, with the sampled data size and the available data size marked.
q_i: data quality; m_i: modeling method
30Service Parallelism Levels
Inter-Service Parallelism
Intra-Service Parallelism
31Hybrid Programming Model
- SPMD: Single Program Multiple Data (used for
programming multiprocessor architectures)
- SSMD: Single Service Multiple Data (introduced
by us for programming service-oriented
architectures)
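The slides name SSMD without detailing it. As a rough illustration only, the sketch below assumes it means invoking one and the same service on several data partitions at once and merging the partial results. The names `count_service`, `merge_counts` and `run_ssmd` are hypothetical, and a local thread pool stands in for remote service calls.

```python
from concurrent.futures import ThreadPoolExecutor


def count_service(partition):
    """Hypothetical stand-in for a remote service: count class labels in one partition."""
    counts = {}
    for label in partition:
        counts[label] = counts.get(label, 0) + 1
    return counts


def merge_counts(partials):
    """Combine the partial results returned by the individual service invocations."""
    total = {}
    for part in partials:
        for label, count in part.items():
            total[label] = total.get(label, 0) + count
    return total


def run_ssmd(service, partitions):
    """Apply the same service to every partition concurrently; results keep partition order."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(service, partitions))


# Illustrative data only: three partitions of class labels.
partitions = [["yes", "no", "yes"], ["no", "no"], ["yes"]]
result = merge_counts(run_ssmd(count_service, partitions))
```

In a service-oriented setting each `service(partition)` call would presumably go to a different service instance located near its data partition.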
32 1. Construction of Decision Trees - SPRINT
SPRINT: Scalable PaRallelizable INduction of decision
Trees
Out-of-Core Algorithm
categorical
continuous
class
Splitting Attributes
The splitting attribute at a node is determined
by the Gini index.
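To make the Gini criterion concrete, here is a minimal in-memory sketch. It is not the out-of-core SPRINT implementation. It scans one continuous attribute in sorted order and picks the threshold with the lowest weighted Gini index. The function names and the toy age/risk data are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def best_continuous_split(values, labels):
    """Try the midpoint between each pair of distinct neighbouring sorted values.

    Returns (threshold, gini_split) for the best candidate, or (None, inf) if
    no split is possible.
    """
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = gini_split(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data (illustrative only): age as continuous attribute, risk as class.
ages = [17, 20, 23, 32, 43, 68]
risk = ["high", "high", "high", "low", "high", "low"]
threshold, score = best_continuous_split(ages, risk)
```

On this toy data the lowest weighted Gini index is reached at the threshold 27.5, which separates the three youngest records (all "high") from the rest.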
33 Phase 1 - Preparation
34Phase 2 - Execution
35 2. Construction of Neural Networks
Node
36Parallel Algorithm
- Challenges
- Training real NNs is extremely computationally
intensive.
- Many practical NN applications (e.g., speech and
face recognition) involve a large number of
adjustable parameters and training patterns to
achieve the needed accuracy.
- Solution
- Parallel training algorithms
- Development of services running in high
performance hardware and software environments
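The slides do not say which parallel training scheme is used. As one common possibility, the sketch below assumes data-parallel batch gradient descent: each worker computes the gradient on its own share of the training patterns, and the partial gradients are summed before the weights are updated. A single linear neuron and the names `gradient` and `parallel_step` are simplifications chosen for illustration.

```python
def gradient(weights, batch):
    """Summed squared-error gradient of a single linear neuron over one batch.

    batch is a list of (inputs, target) pairs.
    """
    grad = [0.0] * len(weights)
    for inputs, target in batch:
        error = sum(w * x for w, x in zip(weights, inputs)) - target
        for j, x in enumerate(inputs):
            grad[j] += error * x
    return grad


def parallel_step(weights, partitions, learning_rate):
    """One weight update from per-partition gradients.

    The partial gradients are computed sequentially here; in a parallel
    setting each call would run on its own worker.
    """
    partials = [gradient(weights, part) for part in partitions]
    n = sum(len(part) for part in partitions)
    total = [sum(g[j] for g in partials) for j in range(len(weights))]
    return [w - learning_rate * g / n for w, g in zip(weights, total)]


# Illustrative training patterns only.
data = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([1.0, 1.0], 3.0), ([0.0, 1.0], 2.0)]
start = [0.0, 0.0]
single = parallel_step(start, [data], 0.1)                 # one worker, all data
split = parallel_step(start, [data[:2], data[2:]], 0.1)    # two workers, half each
```

Because the gradient is a sum over training patterns, the partitioned update matches the single-worker update up to floating-point rounding.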
37Programming Environment Titanium
- The goals: performance, safety, and
expressiveness.
- A language that gives its users access to modern
program structuring through the use of
object-oriented technology and enables them
to write explicitly parallel code.
- Based on a parallel SPMD model of computation
with a global address space.
- Titanium uses Java as its base but is not a strict
extension of Java.
- Compiler: Titanium → C communication
38Overview of Distributed Solution
Master
Sub-master 0
Sub-master 1
Slave0
Slave1
Slave2
Slave0
Slave1
Data Distribution Scheme 1
Data Distribution Scheme 2
Training Data for Sub-master 1
Training Data for Sub-master 0
39The Parallel Implementation
VGE Vienna Grid Environment
VGE Client
VGE Server
40The Distributed Parallel Implementation
VGE Client
VGE Server
413. On-Line Analytical Processing (OLAP)
a three-dimensional data cube
42Distributed OLAP Aggregation of Compute and
Storage Resources
Tuple Stream
43OLAP Service
Master
query
Data
Virtual Cube
XML
answer
Slave 1
Index Service
Indexes
Sub Cube
Slave 3
Slave 2
Sub Cube
Sub Cube
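The master/slave layout above can be sketched roughly as follows, assuming each slave holds a sub cube of a three-dimensional cube and the master merges partial sums into the answer for a group-by query. The dimensions (product, region, year), the sales measure and both function names are hypothetical; the index service and the XML query and answer formats are left out.

```python
from collections import defaultdict


def aggregate_sub_cube(sub_cube, group_by):
    """Slave side: sum the measure of one sub cube per group-by key.

    sub_cube maps (product, region, year) cells to a sales value;
    group_by is a tuple of dimension positions to keep.
    """
    partial = defaultdict(float)
    for cell, measure in sub_cube.items():
        key = tuple(cell[i] for i in group_by)
        partial[key] += measure
    return dict(partial)


def master_query(sub_cubes, group_by):
    """Master side: merge the partial aggregates of all slaves.

    This works because SUM is distributive: partial sums can simply be added.
    """
    merged = defaultdict(float)
    for sub_cube in sub_cubes:
        for key, value in aggregate_sub_cube(sub_cube, group_by).items():
            merged[key] += value
    return dict(merged)


# Illustrative sub cubes only, one per slave.
sub_cubes = [
    {("pen", "EU", 2007): 10.0, ("pen", "US", 2007): 5.0},
    {("ink", "EU", 2007): 12.0},
    {("pen", "EU", 2008): 8.0},
]
sales_by_region = master_query(sub_cubes, (1,))
```

Measures that are not distributive (a median, for example) could not be merged this simply and would need a different scheme.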
44Workflow Composition Approaches
45GridMiner Workflow Composition Editor
46Towards Next-Generation Grids
Next-Generation Grid
Knowledge Technologies
Evolution of the Web
Mobile Services
Evolution of HPCN
Current Grids