Session 28: Distributed Data Mining Research using Grids and Web Services

1
Session 28: Distributed Data Mining Research using Grids and Web Services
11 July
  • Author/Presenter: Peter Brezany
  • University of Vienna, Austria

2
Motivation
[Diagram: a "data and data exploration" cloud surrounded by Medicine, Scientific experiments, Business, Simulations, and Earth observations]
3
Outline
  • Motivation
  • Selected projects
  • Data mining model
  • Towards high productivity analytics
  • Parallel and distributed data mining and OLAP in
    GridMiner/ADMIRE projects
  • Future developments

4
Selected Projects
5
A Long-Term Biodiversity, Ecosystem and Awareness Research Network (ALTER-Net)
Author: Kathi Schleidt
Common Ontology
6
China-Austria Data Grid (CADGrid)
  • Main idea: Medical Meridian Measurement Grid
    (M3G) for on-line diagnosis
  • The diabetes domain is the first domain to
    profit strongly from the project results

7
Motivation
  • Meridian-Theory is an important part of
    Traditional Chinese Medicine (TCM)
  • Clinical practices of TCM (esp. acupuncture) have
    been guided by meridian theory for thousands of
    years
  • More than 4000 years of experience
  • Knowledge that we should not only use but also
    support by modern high-tech measurement and IT
    technologies

3-Dec-07
CADGrid
8
Meridian-Theory Basics (1)
  • According to TCM our human body has 14
    acupuncture meridians
  • Secret to our biological and medical knowledge
  • Each meridian has its main points, called source
    points

9
Meridian-Theory Basics (2)
  • Using data mining techniques, correlations
    between these points can be identified, e.g.:
  • start-end point correlation
  • symmetric point correlation
  • If there is pain at one place along a
    meridian, a good effect can be achieved by
    treating another place on the same line

10
Meridian-Theory Basics (3)
  • Meridians can transport physical, medical, and
    biological material and information
  • The characteristics (weaker or stronger output,
    time delay, ...) gained by the analysis of
    electro-signals sensed from meridians have a
    strong relationship with the human body organs
    (heart, lung, brain, ...)

11
Meridian Measurement Methods
  • 1. Active
  • 2. Passive
12
Active Measurement
  • Up-flow point: lower electrical potential
  • Down-flow point: higher electrical potential
  • Fingers and toes: zero potential
13
Passive Measurement
14
  • Application 1: Non-invasive Glucose Measurement (NIGM)
  • Meridian Measurement Instrument

15
The First Prototype
16
NIGM Workflow

17
M3G Services for Diabetics: NIGM-Service Model Setup
18
M3G Services for Diabetics: NIGM-Service Use Model
19
M3G Services for Diabetics: NIGM-Service Maintain Model
20
CADGrid Framework
Balatonfüred, Hungary - 6th-18th July 2008
21
Intelligence Base
22
Future Work
  • Extension to other domains
  • Brain Informatics domain

23
CRISP-DM
[Diagram: the CRISP-DM cycle around the Data: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment]
24
Towards High Productivity Analytics
Motivation
A Project Sponsored by [sponsor logo]
25
High Productivity Analytics
  • Our definition
  • A high-productivity analytics system is one that
    delivers a high level of performance and guarantees
    a high level of accuracy of analytics models and
    other results extracted from the analyzed data sets,
    while scoring equally well on other aspects, such as
    usability, robustness, system management, and
    ease of programming.

26
High Productivity Analytics Research Agenda
  • High-performance services developed with
    high-productivity languages and tools
  • Efficient workflow management (building and
    execution)
  • Advanced GUI
  • Illustration on the GridMiner system

27
GridMiner Data Mining Model
[Diagram: the CRISP-DM phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment) arranged around the Data, annotated with GridMiner, Service Provider (three times), and Data Provider. Source: CRISP-DM, SPSS]
28
GridMiner Conceptual Architecture
Data and functional resources can be geographically distributed; the
focus is on virtualization and large-scale data mining.
29
Motivation for large-scale data mining
[Plot: accuracy (up to 100) against sampled data size, bounded by the available data size; qi = data quality, mi = modeling method]
30
Service Parallelism Levels
  • Inter-service parallelism
  • Intra-service parallelism
31
Hybrid Programming Model
  • SPMD: Single Program, Multiple Data (used for
    programming multiprocessor architectures)
  • SSMD: Single Service, Multiple Data (introduced
    by us for programming service-oriented
    architectures)
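
The SPMD idea named above can be illustrated with a minimal sketch: every worker runs the same function and uses its rank to pick its own share of the data. The names `spmd_worker` and `run_spmd`, the use of threads, and the sum-of-squares task are illustrative assumptions, not part of the talk. The SSMD model would presumably replace the workers with instances of one service, but the slides do not spell that out.

```python
from concurrent.futures import ThreadPoolExecutor


def spmd_worker(rank, size, data):
    """Same program for every worker; only the rank differs."""
    # Hypothetical partitioning rule: worker `rank` takes every size-th item.
    my_part = data[rank::size]
    return sum(x * x for x in my_part)


def run_spmd(data, size=4):
    """Start `size` identical workers and combine their partial results."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        partials = list(pool.map(lambda r: spmd_worker(r, size, data), range(size)))
    return sum(partials)


TOTAL = run_spmd(list(range(10)), size=3)  # sum of squares of 0..9
```

Threads are used here only to keep the sketch self-contained; on a real multiprocessor the same structure would be expressed with processes or a message-passing library.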

32
1. Construction of Decision Trees - SPRINT
  • SPRINT: Scalable PaRallelizable INduction of decision Trees
  • Out-of-core algorithm
  • Example data columns: categorical, continuous, class
  • Splitting attributes: the splitting attribute at a node is
    determined by the Gini index.
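
As a rough illustration of Gini-based splitting, the sketch below computes the Gini index of a label set, gini(S) = 1 - sum of p_j squared, and scans the sorted values of one continuous attribute for the threshold with the lowest weighted Gini. The function names and the small age/risk data set are hypothetical. The scan rebuilds the partitions at each step for clarity, whereas an out-of-core implementation would keep running class counts instead.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_j ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_numeric_split(values, labels):
    """Return (score, threshold) of the best split on a continuous attribute."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = gini_split(left, right)
        if score < best[0]:
            best = (score, threshold)
    return best


# Hypothetical training records: age (continuous) and risk (class).
AGES = [23, 17, 43, 68, 32, 20]
RISK = ["high", "high", "high", "low", "low", "high"]
SCORE, THRESHOLD = best_numeric_split(AGES, RISK)
```

For this toy data the best threshold falls between ages 23 and 32, where the left partition contains only one class.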
33
Phase 1 - Preparation
34
Phase 2 - Execution
35
2. Construction of Neural Networks
Node
36
Parallel Algorithm
  • Challenges
  • Training real neural networks (NNs) is extremely
    computationally intensive.
  • Many practical NN applications (e.g., speech and
    face recognition) involve a large number of
    adjustable parameters and training patterns to
    achieve the needed accuracy.
  • Solution
  • Parallel training algorithms
  • Development of services running in high
    performance hardware and software environments

37
Programming Environment Titanium
  • Goals: performance, safety, and
    expressiveness.
  • A language that gives its users access to modern
    program structuring through the use of
    object-oriented technology and enables them
    to write explicitly parallel code.
  • Based on a parallel SPMD model of computation
    with a global address space.
  • Titanium uses Java as its base but is not a strict
    extension of Java.
  • Compiler: Titanium → C communication

38
Overview of Distributed Solution
[Diagram: a Master above Sub-master 0 and Sub-master 1, with two slave groups (Slave0, Slave1, Slave2 and Slave0, Slave1). Data Distribution Scheme 1 and Data Distribution Scheme 2 deliver the training data for Sub-master 0 and for Sub-master 1.]
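
One way to read the master, sub-master, and slave roles on this slide is as two-level data-parallel training: the training set is split across sub-masters, each sub-master splits its share across its slaves, and partial gradients are summed on the way back up. The sketch below simulates that structure in a single process for one linear neuron with squared error. All function names, the round-robin split, and the assignment of three and two slaves to the sub-masters are assumptions, not the GridMiner implementation.

```python
def gradient(weights, samples):
    """Sum of squared-error gradients of one linear neuron over `samples`."""
    grad = [0.0] * len(weights)
    for x, target in samples:
        err = sum(w * xi for w, xi in zip(weights, x)) - target
        for j, xi in enumerate(x):
            grad[j] += err * xi
    return grad


def add(a, b):
    return [u + v for u, v in zip(a, b)]


def split(items, parts):
    """Round-robin partitioning (a stand-in for a data distribution scheme)."""
    return [items[i::parts] for i in range(parts)]


def sub_master(weights, samples, n_slaves):
    """Distribute the samples to the slaves and sum their partial gradients."""
    total = [0.0] * len(weights)
    for part in split(samples, n_slaves):  # each part would go to one slave
        total = add(total, gradient(weights, part))
    return total


def master_step(weights, samples, slaves_per_sub_master, lr=0.01):
    """One gradient-descent step using the two-level hierarchy."""
    total = [0.0] * len(weights)
    parts = split(samples, len(slaves_per_sub_master))
    for part, n_slaves in zip(parts, slaves_per_sub_master):
        total = add(total, sub_master(weights, part, n_slaves))
    return [w - lr * g / len(samples) for w, g in zip(weights, total)]


# Hypothetical data for y = 2x + 1; the inputs carry a constant bias term.
SAMPLES = [([float(x), 1.0], 2.0 * x + 1.0) for x in range(6)]
NEW_WEIGHTS = master_step([0.0, 0.0], SAMPLES, slaves_per_sub_master=[3, 2])
```

Because the gradient is a sum over samples, the hierarchical result matches a sequential pass over the full data set up to floating-point rounding.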
39
The Parallel Implementation
VGE: Vienna Grid Environment
VGE Client
VGE Server
40
The Distributed Parallel Implementation
VGE Client
VGE Server
41
3. On-Line Analytical Processing (OLAP)
a three-dimensional data cube
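
A three-dimensional data cube can be sketched as a mapping from (dimension 1, dimension 2, dimension 3) coordinates to an aggregated measure. The example below is a minimal in-memory illustration: the dimensions (product, region, year), the fact records, and the function names are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def build_cube(facts):
    """Aggregate (d1, d2, d3, measure) records into cube cells."""
    cube = defaultdict(int)
    for d1, d2, d3, measure in facts:
        cube[(d1, d2, d3)] += measure
    return dict(cube)


def roll_up(cube, keep):
    """Sum the cube over all dimensions whose index is not listed in `keep`."""
    result = defaultdict(int)
    for key, value in cube.items():
        result[tuple(key[i] for i in keep)] += value
    return dict(result)


def slice_cube(cube, dim, member):
    """Keep only the cells where dimension `dim` equals `member`."""
    return {k: v for k, v in cube.items() if k[dim] == member}


# Hypothetical fact records: (product, region, year, units sold).
FACTS = [
    ("laptop", "EU", 2007, 3),
    ("laptop", "US", 2007, 5),
    ("phone", "EU", 2007, 2),
    ("phone", "EU", 2008, 4),
    ("laptop", "EU", 2007, 1),
]
CUBE = build_cube(FACTS)
BY_REGION = roll_up(CUBE, keep=(1,))
BY_PRODUCT_YEAR = roll_up(CUBE, keep=(0, 2))
```

Rolling up to the region dimension collapses the other two, while `slice_cube` fixes one dimension member and leaves the rest of the cube untouched.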
42
Distributed OLAP: Aggregation of Compute and Storage Resources
Tuple Stream
43
OLAP Service
[Diagram components: Master (query, Virtual Cube, XML answer); Slave 1, Slave 2, Slave 3 (one Sub Cube each); Index Service (Indexes); Data]
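
The master and slave roles in this service can be sketched as partial aggregation: each slave aggregates its own sub cube, and the master merges the partial results and returns the answer as XML. This is a single-process illustration under stated assumptions. The sub-cube contents, the function names, and the XML element and attribute names are hypothetical, and the index service is left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def slave_aggregate(sub_cube, dim):
    """Partial aggregate of one sub cube along dimension index `dim`."""
    partial = Counter()
    for key, value in sub_cube.items():
        partial[key[dim]] += value
    return partial


def master_query(sub_cubes, dim):
    """Send the same query to every slave and merge the partial answers."""
    total = Counter()
    for sub_cube in sub_cubes:  # each call would run on a different slave
        total.update(slave_aggregate(sub_cube, dim))
    return dict(total)


def to_xml(result):
    """Serialize an aggregate as XML (element names are illustrative)."""
    root = ET.Element("answer")
    for member, value in sorted(result.items()):
        ET.SubElement(root, "cell", member=str(member), value=str(value))
    return ET.tostring(root, encoding="unicode")


# Hypothetical sub cubes keyed by (product, region, year).
SUB_CUBES = [
    {("laptop", "EU", 2007): 4},
    {("laptop", "US", 2007): 5, ("phone", "EU", 2007): 2},
    {("phone", "EU", 2008): 4},
]
RESULT = master_query(SUB_CUBES, dim=1)
ANSWER_XML = to_xml(RESULT)
```

Merging works here because sums are distributive; measures such as averages would need each slave to return both a sum and a count.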
44
Workflow Composition Approaches
45
GridMiner Workflow Composition Editor
46
Towards Next-Generation Grids
[Diagram: Current Grids, Evolution of the Web, Evolution of HPCN, Knowledge Technologies, and Mobile Services leading towards the Next-Generation Grid]