Session 28: Distributed Data Mining Research using Grids and Web Services


1
Session 28: Distributed Data Mining Research
using Grids and Web Services
11 July

Author/Presenter: Peter Brezany
University of Vienna, Austria

2
Motivation
Medicine
Scientific experiments
Data and data exploration cloud
Business
Simulations
Earth observations
3
Outline

Motivation
Selected projects
Data mining model
Towards high productivity analytics
Parallel and distributed data mining and OLAP in
GridMiner/ADMIRE projects
Future developments

4
Selected Projects
5
A Long-Term Biodiversity, Ecosystem and Awareness
Research Network (ALTER-Net)
Author: Kathi Schleidt
Common Ontology
6
China-Austria Data Grid (CADGrid)

Main idea: Medical Meridian Measurement Grid
(M3G) for on-line diagnosis
The diabetic domain is the first domain to profit
substantially from the project results

7
Motivation

Meridian-Theory is an important part of
Traditional Chinese Medicine (TCM)
Clinical practices of TCM (esp. acupuncture) have
been guided by meridian theory for thousands of
years
More than 4000 years of experience
Knowledge that we should not only use but also
support by modern high-tech measurement and IT
technologies

3-Dec-07
CADGrid
8
Meridian-Theory Basics (1)

According to TCM, the human body has 14
acupuncture meridians
Secret to our biological and medical knowledge
Each meridian has its main points, called source
points

9
Meridian-Theory Basics (2)

Using data mining techniques, correlations
between these points can be identified, e.g.:
start-end point correlation
symmetric point correlation
If there is pain at one place along a
meridian, a good effect can be achieved by
treating another place on the same line

10
Meridian-Theory Basics (3)

Meridians can transport
physical, medical, biological material and
information
The characteristics (weaker or stronger output,
time delay, ...) gained by the analysis of
electro-signals sensed from meridians have a
strong relationship with the human body organs
(heart, lung, brain, ...)

11
Meridian Measurement Methods: (1) Active, (2) Passive
12
Active Measurement
Up-flow point: lower electrical potential
Down-flow point: higher electrical potential
Fingers and toes: zero potential
13
Passive Measurement
14

Application 1
Non-invasive Glucose Measurement (NIGM)
Meridian Measurement Instrument

15
The First Prototype
16
NIGM Workflow

17
M3G Services for Diabetics: NIGM-Service Model
Setup
18
M3G Services for Diabetics: NIGM-Service Use
Model
19
M3G Services for Diabetics: NIGM-Service
Maintain Model
20
CADGrid Framework
Balatonfüred, Hungary - 6th-18th July 2008
21
Intelligence Base
22
Future Work

Extension to other domains
Brain Informatics domain

23
CRISP-DM
Phases (a cycle around the data):
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
24
Towards High Productivity Analytics
Motivation
A Project Sponsored by






25
High Productivity Analytics

Our definition
A high-productivity analytics system is one that
delivers a high level of performance and guarantees
a high level of accuracy of the analytics models and
other results extracted from the analyzed data sets,
while scoring equally well on other aspects, such as
usability, robustness, system management, and
ease of programming.

26
High Productivity Analytics Research Agenda

High-performance services developed with
high-productivity languages and tools
Efficient workflow management (building and
execution)
Advanced GUI
Illustration on the GridMiner system

27
GridMiner Data Mining Model
[Figure: the CRISP-DM cycle (business understanding, data understanding, data preparation, modeling, evaluation, deployment) drawn around the data and GridMiner, with the phases annotated by service providers and a data provider. Source: CRISP-DM, SPSS]
28
GridMiner Conceptual Architecture
Data and functional resources can be
geographically distributed; the focus is on
virtualization and large-scale data mining.
29
Motivation for large-scale data mining
[Figure: accuracy (axis up to 100) plotted against data size, with the sampled data size and the available data size marked; qi = data quality, mi = modeling method]
30
Service Parallelism Levels
Inter-Service and Intra-Service Parallelism
31
Hybrid Programming Model

SPMD: Single Program Multiple Data (used for
programming multiprocessor architectures)
SSMD: Single Service Multiple Data (introduced
by us for programming service-oriented
architectures)
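
The SSMD idea can be sketched roughly as follows: several instances of the same service each process their own data partition, and the partial results are merged. This is only an illustration under stated assumptions, not GridMiner code. The class `HistogramService`, the helper `run_ssmd`, and the round-robin partitioning are hypothetical, and local objects driven by a thread pool stand in for remotely deployed services.

```python
from concurrent.futures import ThreadPoolExecutor


class HistogramService:
    """Hypothetical stand-in for one deployed instance of a service."""

    def __init__(self, name):
        self.name = name

    def process(self, partition):
        # Every instance runs the same logic on its own partition.
        counts = {}
        for item in partition:
            counts[item] = counts.get(item, 0) + 1
        return counts


def split_round_robin(data, parts):
    """Assumed partitioning scheme: element i goes to partition i % parts."""
    return [data[i::parts] for i in range(parts)]


def run_ssmd(data, n_instances):
    """Run one service instance per partition and merge the partial results."""
    services = [HistogramService(f"instance-{i}") for i in range(n_instances)]
    chunks = split_round_robin(data, n_instances)
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        partials = list(
            pool.map(lambda pair: pair[0].process(pair[1]), zip(services, chunks))
        )
    merged = {}
    for partial in partials:
        for key, count in partial.items():
            merged[key] = merged.get(key, 0) + count
    return merged
```

For example, `run_ssmd(list("abracadabra"), 3)` counts the letters with three instances working on disjoint partitions. Inside each instance, an SPMD program could in turn parallelize the work on a multiprocessor, which is the hybrid aspect of the model.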

32
1. Construction of Decision Trees - SPRINT
SPRINT: Scalable PaRallelizable INduction of decision
Trees
Out-of-core algorithm
[Figure: example training set with categorical and continuous attributes and a class column]
Splitting Attributes
The splitting attribute at a node is determined
by the Gini index.
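
As a rough illustration of the Gini criterion, the sketch below scores candidate split points of one continuous attribute and keeps the one with the lowest weighted Gini index. It is a minimal in-memory version: the function names and the toy data in the usage note are hypothetical, and SPRINT itself works on pre-sorted attribute lists kept out of core, which is not modelled here.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_continuous_split(values, labels):
    """Return (threshold, score) of the best binary split on one attribute.

    Candidate thresholds are the midpoints between consecutive distinct
    values of the sorted attribute.
    """
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = gini_split(left, right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

With the toy values `[23, 17, 43, 68, 32, 20]` and labels `["High", "High", "High", "Low", "Low", "High"]`, the best threshold is 27.5: the left side is pure and the right side has a Gini index of 4/9, giving a weighted score of 2/9. Running this for every attribute and taking the overall minimum yields the splitting attribute for the node.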
33
Phase 1 - Preparation
34
Phase 2 - Execution
35
2. Construction of Neural Networks
Node
36
Parallel Algorithm

Challenges
Training a real NN is extremely computationally
intensive.
Many practical NN applications (e.g., speech and
face recognition) involve a large number of
adjustable parameters and training patterns to
achieve the needed accuracy.
Solution
Parallel training algorithms
Development of services running in high
performance hardware and software environments
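
One common form of parallel training is data parallelism: each worker computes the gradient over its share of the training patterns, and the partial gradients are summed before the weights are updated. The sketch below shows this for a single linear neuron with squared error, in plain Python rather than Titanium. The single-neuron model, the function names, and the data are assumptions made for illustration, and the workers are simulated by a sequential loop.

```python
def partial_gradient(weights, samples):
    """Summed squared-error gradient over one worker's training patterns."""
    grad = [0.0] * len(weights)
    for inputs, target in samples:
        output = sum(w * x for w, x in zip(weights, inputs))
        error = output - target
        for j, x in enumerate(inputs):
            grad[j] += error * x
    return grad


def parallel_step(weights, shards, learning_rate):
    """One gradient-descent step using per-shard partial gradients.

    In a real deployment each call to partial_gradient would run on a
    different worker; here the calls run one after another.
    """
    partials = [partial_gradient(weights, shard) for shard in shards]
    total = [sum(column) for column in zip(*partials)]
    n_patterns = sum(len(shard) for shard in shards)
    return [w - learning_rate * g / n_patterns for w, g in zip(weights, total)]
```

Because the gradient is a sum over training patterns, splitting the patterns across workers gives the same update as processing them in one place, up to floating-point rounding.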

37
Programming Environment Titanium

Goals: performance, safety, and
expressiveness.
A language that gives its users access to modern
program structuring through the use of
object-oriented technology and enables them
to write explicitly parallel code.
Based on a parallel SPMD model of computation
with a global address space.
Titanium uses Java as its base but is not a strict
extension of Java.
Compiler: Titanium to C, plus communication

38
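Overview of Distributed Solution
[Figure: a master coordinates Sub-master 0 and Sub-master 1; each sub-master has its own group of slaves (one group shown with Slave0 to Slave2, the other with Slave0 and Slave1) and its own training data. Two data distribution schemes are shown: Data Distribution Scheme 1 and Data Distribution Scheme 2.]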
39
The Parallel Implementation
VGE: Vienna Grid Environment
VGE Client
VGE Server
40
The Distributed Parallel Implementation
VGE Client
VGE Server
41
3. On-Line Analytical Processing (OLAP)
a three-dimensional data cube
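
A three-dimensional data cube can be pictured as a mapping from triples of dimension values to an aggregated measure. The sketch below builds such a cube from fact tuples and rolls it up by summing out dimensions. The function names and the dimension layout (product, region, year, sales) used in the note are hypothetical and not taken from GridMiner.

```python
from collections import defaultdict


def build_cube(facts):
    """Aggregate (d1, d2, d3, measure) tuples into cells keyed by (d1, d2, d3)."""
    cube = defaultdict(float)
    for d1, d2, d3, measure in facts:
        cube[(d1, d2, d3)] += measure
    return dict(cube)


def roll_up(cube, keep):
    """Sum out every dimension whose index is not listed in `keep`."""
    result = defaultdict(float)
    for key, value in cube.items():
        result[tuple(key[i] for i in keep)] += value
    return dict(result)
```

For instance, with facts of the form `("p1", "east", 2007, 10.0)`, `roll_up(cube, (0,))` gives totals per product and `roll_up(cube, ())` gives the grand total under the empty key `()`.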
42
Distributed OLAP: Aggregation of Compute and
Storage Resources
Tuple Stream
43
OLAP Service
[Figure labels: query and answer (XML) exchanged with the Master, which holds the Virtual Cube over the Data; Slave 1, Slave 2 and Slave 3 each hold a Sub Cube; an Index Service holds the Indexes]
44
Workflow Composition Approaches
45
GridMiner Workflow Composition Editor
46
Towards Next-Generation Grids
[Figure labels: Next-Generation Grid, Knowledge Technologies, Evolution of the Web, Mobile Services, Evolution of HPCN, Current Grids]
