Title: Data Mining Engineering
1Grids, Grid Technologies and Data Mining
Peter Brezany, Institut für Softwarewissenschaft, Universität Wien
E-mail: brezany_at_par.univie.ac.at
2Grid and Grid Technologies
Grid computing has emerged as an important field,
distinguished from conventional distributed
computing by its focus on large-scale resource
sharing, innovative applications, and, in
some cases, high-performance orientation. The Grid
itself is meant to connect computing resources
over a wide area network. Internet computing
and Grid technologies promise to change the way
we tackle complex problems. Harnessing these new
technologies effectively will transform
scientific disciplines ranging from high-energy
physics to the life sciences.
The Grid research field can be further divided into 2 subdomains:
- Computational Grid: a natural extension of the former cluster computer.
- Data Grid: efficient management, placement, and replication
of large amounts of data. Once data are in place,
computational tasks can be run.
3Data Mining on (Data) Grids
- Data mining on the Grid (DMG): finding data
patterns in an environment with geographically
distributed data and computation, that is, an environment
with special data management, data placement,
and data replication.
- A good DMG algorithm analyzes data in a
distributed fashion with modest data
communication overhead.
- A typical DMG algorithm involves local data
analysis followed by the generation of a global
data model.
- Huge data volumes are involved: high-performance
I/O is needed.
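The "local analysis, then global model" pattern can be illustrated with a small sketch. It is not a specific DMG algorithm from these slides: the function names, the example data, and the choice of item-pair support as the "model" are all assumptions made for illustration. Each site sends only a compact summary (pair counts and a row count), never its raw records.

```python
from collections import Counter
from itertools import combinations


def local_pair_counts(transactions):
    """Local analysis at one site: count co-occurring item pairs."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts, len(transactions)


def global_model(local_results, min_support):
    """Global step: merge the small local summaries, not the raw data."""
    total_counts = Counter()
    total_rows = 0
    for counts, rows in local_results:
        total_counts.update(counts)
        total_rows += rows
    return {
        pair: n / total_rows
        for pair, n in total_counts.items()
        if n / total_rows >= min_support
    }


# Hypothetical records held at two different sites.
site_a = [["fever", "rain"], ["fever", "rain", "cold"], ["cold"]]
site_b = [["fever", "rain"], ["rain"]]

results = [local_pair_counts(site_a), local_pair_counts(site_b)]
model = global_model(results, min_support=0.5)
print(model)
```

Only the two summaries cross the network, which is what keeps the communication overhead modest in this toy setting.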
4Application Examples
- Finding out the dependency of the emergence of
hepatitis C on weather patterns: this requires access to a
large hepatitis C DB at one location and an
environmental DB at another location.
- 2 major financial organizations want to
cooperate. They need to share data patterns
relevant to the data mining task, but they do not
want to share the data since it is sensitive, so
combining the databases may not be feasible.
- A major multinational corporation wants to
analyze its customer transaction records to
quickly develop successful business
strategies. It has thousands of establishments
throughout the world, and collecting all the data
in a centralized data warehouse, followed by
analysis using existing commercial data mining
software, takes too long.
- Telemedical applications: see the next 2 slides.
5Components of Telemedical Applications
[Diagram labels: Database, Raw Medical Data, Derived Medical Data, Database, Reconstructed Medical Data, Web]
6Telemedical Collaboration - Example
A patient living in a remote village has a heart
problem. An ECG is taken by the local doctor and
all the patient's details are stored in the
doctor's PC-based telemedical system. MRI and CT
scans are taken in different departments of
a general hospital and stored in the telemedical
DB. A consultant compiles a report and saves it
in the DB. If necessary, a 3D ultrasound scan is
taken in a specialized clinic and a further report
is compiled. If complicated surgery is required, an
external specialist using Virtual Reality
techniques defines how the surgery should be
planned. The resulting operation is recorded on
video, e.g., for education.
→ Data mining support/assistance is needed.
7Motivations and History
8Grid Computing Concept
- Enable communities (virtual organizations)
to share geographically distributed resources as
they pursue common goals, in the absence of
central control, omniscience, and trust relationships.
9Grid Computing Concept (2)
The term "the Grid" was coined in the mid-1990s
to denote a proposed distributed computing
infrastructure for science and engineering. The
aim is coordinated resource sharing and problem
solving in dynamic, multi-institutional virtual
organizations.
Resources: computers, files, data, sensors, networks,
laboratory equipment, etc.
Sharing is highly controlled,
with resource providers and consumers defining
clearly and carefully just what is shared, who
is allowed to share, and the conditions under which
sharing occurs. A set of individuals and/or
institutions defined by such sharing rules forms a
virtual organization (VO).
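The three things a provider defines (what is shared, who may share, and under which conditions) can be pictured as a small data structure. The sketch below is only an illustration: the class names, the `max_cpu_hours` condition, and the member and resource names are all hypothetical and do not come from any Grid toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class SharingRule:
    resource: str          # what is shared
    allowed_members: set   # who is allowed to share
    max_cpu_hours: float   # condition under which sharing occurs (assumed)


@dataclass
class VirtualOrganization:
    name: str
    rules: dict = field(default_factory=dict)

    def publish(self, rule):
        """A resource provider announces a sharing rule to the VO."""
        self.rules[rule.resource] = rule

    def may_use(self, member, resource, cpu_hours):
        """Check a request against the provider's rule."""
        rule = self.rules.get(resource)
        if rule is None:
            return False  # nothing is shared unless a provider says so
        return member in rule.allowed_members and cpu_hours <= rule.max_cpu_hours


vo = VirtualOrganization("hypothetical-medical-vo")
vo.publish(SharingRule("cluster-1", {"alice", "bob"}, max_cpu_hours=100.0))

print(vo.may_use("alice", "cluster-1", 10.0))   # within the rule
print(vo.may_use("carol", "cluster-1", 10.0))   # not a permitted member
print(vo.may_use("alice", "cluster-1", 500.0))  # violates the condition
```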
10Grid Computing Concept (3)
Grid technologies complement rather than compete
with existing distributed computing
technologies. For example, CORBA focuses on
enabling resource sharing within a single
organization. Grid technologies focus on
dynamic, cross-organizational sharing.
11Grid Communities and Applications: Home Computers Evaluate AIDS Drugs
- Community:
- - 1000s of home computer users
- - Philanthropic computing vendor (Entropia)
- - Research group (Scripps)
- Common goal: advance AIDS research
12The Nature of Grid Architecture
A Grid architecture identifies fundamental system
components, specifies the purpose and function of
these components, and indicates how these
components interact with one another.
Interoperability is the central issue to be addressed. In a
network environment, interoperability means
common protocols. The Grid architecture is first
and foremost a protocol architecture, with
protocols defining the basic mechanisms by which
VO users and resources negotiate, establish,
manage, and exploit sharing relationships.
Standard protocols make it easy to define standard
services that provide enhanced capabilities and
to construct Application Programming Interfaces and
Software Development Kits.
13The Nature of Grid Architecture (2)
Just as the Web revolutionized information
sharing by providing a universal protocol and
syntax (HTTP and HTML) for information exchange,
so we require standard protocols and syntaxes for
general resource sharing.
A Grid protocol definition specifies:
- how distributed system elements interact with one another
in order to achieve a specified behavior, and
- the structure of the information exchanged during
this interaction.
14The Nature of Grid Architecture (3)
A Grid service is defined solely by the protocol
that it speaks and the behaviors that it
implements. There are standard Grid services for:
- access to computation
- access to data
- resource discovery
- coscheduling (mechanisms for coordinating operations across
multiple resources)
- data replication, etc.
The definition of the above services allows
us to enhance the services offered to VO
participants and also to abstract away
resource-specific details.
15The Nature of Grid Architecture (4)
Why do we also consider Application Programming
Interfaces (APIs) and Software Development Kits
(SDKs)? There is more to VOs than
interoperability, protocols, and
services. Developers must be able to develop
sophisticated applications in complex and dynamic
execution environments. Users must be able to
operate these applications. Standard
abstractions, APIs, and SDKs can accelerate code
development, enable code sharing, and enhance
application portability.
Summary: identification and definition of
1. protocols → 2. services → 3. APIs and SDKs.
16Grid Architecture
The architecture is organized into layers (see
the next slide). Components within each layer share
common characteristics but can build on
capabilities and behaviors provided by any
lower layer. Resource and Connectivity protocols
facilitate the sharing of individual resources.
They are designed so that they can be implemented
on top of a diverse range of resource types,
defined at the Fabric layer, and can in turn be
used to construct a wide range of global services
and application-specific behaviors at
the Collective layer.
17Layered Grid Architecture (By Analogy to Internet Architecture)
Application
18Fabric: Interface to Local Control
The Grid Fabric layer provides the resources to
which shared access is mediated. Fabric
components implement the local resource-specific
operations that occur as a result of sharing
operations at higher levels. At a minimum,
resources should implement enquiry mechanisms
that permit discovery of their structure and
state, and resource management mechanisms that
provide some control of delivered quality of
service.
19Fabric: Interface to Local Control (2)
- A resource-specific characterization of
capabilities:
- - Computational resources: mechanisms for starting
programs and for monitoring and controlling the
execution of the resulting processes.
- - Storage resources: mechanisms for putting and
getting files. Enquiry functions for determining
hardware and software characteristics and
information about available space utilization.
- - Network resources: mechanisms that provide
control over the resources allocated to network
transfers. Enquiry functions to determine network
characteristics and load.
- - Code repositories: managing versioned source and
object code.
- - Catalogs: catalog query and update operations.
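As a rough picture of what a Fabric-level storage resource exposes, the sketch below models the put/get mechanisms and an enquiry function for space utilization. It is a toy in-memory stand-in: the class name, method names, and the fixed byte capacity are assumptions made for illustration, not part of any real Grid fabric interface.

```python
class StorageResource:
    """Toy in-memory storage resource (hypothetical interface)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self._files = {}

    def _used_bytes(self):
        return sum(len(data) for data in self._files.values())

    def put(self, name, data):
        """Management mechanism: store a file if space allows."""
        replaced = len(self._files.get(name, b""))
        if self._used_bytes() - replaced + len(data) > self.capacity_bytes:
            raise IOError("not enough free space")
        self._files[name] = data

    def get(self, name):
        """Management mechanism: retrieve a stored file."""
        return self._files[name]

    def enquire(self):
        """Enquiry mechanism: report structure and state."""
        used = self._used_bytes()
        return {
            "capacity": self.capacity_bytes,
            "used": used,
            "free": self.capacity_bytes - used,
            "files": sorted(self._files),
        }


store = StorageResource(capacity_bytes=10)
store.put("scan.raw", b"12345")
state = store.enquire()
print(state)
```

Higher layers would only ever call `put`, `get`, and `enquire`; how the bytes are kept locally stays hidden behind the resource.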
20Connectivity: Communicating Easily and Securely
- The Connectivity layer defines core communication
and authentication protocols required for
Grid-specific network transactions.
- Communication protocols enable the exchange of
data between Fabric layer resources.
- Authentication protocols build on communication
services to provide cryptographically secure
mechanisms for verifying the identity of users
and resources.
21Connectivity (2)
- Authentication solutions for VO environments
should have the following characteristics:
- - Single sign-on: users must be able to log on
(authenticate) just once and then have access to
multiple Grid resources defined by the Fabric
layer, without further user intervention.
- - Delegation: a user must be able to endow a
program with the ability to run on that user's
behalf, so that the program is able to access the
resources on which the user is authorized.
- - Integration with various local security
solutions: Grid security solutions must be able
to interoperate with various local security
solutions.
- - User-based trust relationships: if a user has the
right to use sites A and B, the user should be
able to use sites A and B together without
requiring that A's and B's security administrators
interact.
22Resource: Sharing Single Resources
- The Resource layer defines protocols (and APIs
and SDKs) for secure initiation, monitoring, and
control of sharing operations on individual
resources.
- The primary classes of Resource layer protocols:
- - Information protocols are used to obtain
information about the structure and state of a
resource, e.g., its configuration, current load,
and usage policy.
- - Management protocols are used to negotiate access
to a shared resource, specifying, for example,
resource requirements and the operations to be
performed, such as process creation or data
access. A protocol may support monitoring the
status of an operation and controlling (e.g.,
terminating) the operation.
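The split between information and management protocols can be sketched as a single resource answering two kinds of requests. The message format below (plain dictionaries with an `op` field) and every operation name are invented for this illustration; real Resource layer protocols define their own message syntax.

```python
import itertools


class ComputeResource:
    """Toy resource answering information and management requests."""

    def __init__(self, cpus, usage_policy):
        self.cpus = cpus
        self.usage_policy = usage_policy
        self._jobs = {}
        self._next_id = itertools.count(1)

    def _running(self):
        return sum(1 for state in self._jobs.values() if state == "running")

    def handle(self, request):
        op = request.get("op")
        # Information protocol: configuration, current load, usage policy.
        if op == "info":
            return {
                "cpus": self.cpus,
                "load": self._running() / self.cpus,
                "usage_policy": self.usage_policy,
            }
        # Management protocol: create, monitor, and terminate operations.
        if op == "create_process":
            if self._running() >= self.cpus:
                return {"error": "no free CPU"}
            job_id = next(self._next_id)
            self._jobs[job_id] = "running"
            return {"job_id": job_id}
        if op == "status":
            return {"status": self._jobs.get(request["job_id"], "unknown")}
        if op == "terminate":
            job_id = request["job_id"]
            if job_id in self._jobs:
                self._jobs[job_id] = "terminated"
            return {"status": self._jobs.get(job_id, "unknown")}
        return {"error": "unsupported operation"}


node = ComputeResource(cpus=2, usage_policy="VO members only")
job = node.handle({"op": "create_process", "command": "mine.sh"})
info = node.handle({"op": "info"})
stopped = node.handle({"op": "terminate", "job_id": job["job_id"]})
print(job, info, stopped)
```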
23Collective: Coordinating Multiple Resources
- The Collective layer contains protocols and services
(and APIs and SDKs) that are not associated with
any one specific resource but rather are global
in nature and capture interactions across
collections of resources. This layer can, e.g.,
implement:
- - Directory services allow VO participants to
discover the existence and/or properties of VO
resources.
- - Co-allocation, scheduling, and brokering services
allow VO participants to request the allocation of
one or more resources for a specific purpose and
the scheduling of tasks on the appropriate
resources.
- - Monitoring and diagnostics services support the
monitoring of VO resources for failure,
adversarial attack (intrusion detection),
overload, and so forth.
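A directory lookup followed by an all-or-nothing co-allocation can be sketched in a few lines. This is a simplified illustration under stated assumptions: the directory is a plain dictionary, the only property tracked is `free_cpus`, the site names are made up, and the greedy "largest first" choice is just one possible brokering strategy.

```python
def discover(directory, min_free_cpus):
    """Directory service: find resources by a property."""
    return sorted(
        name for name, props in directory.items()
        if props["free_cpus"] >= min_free_cpus
    )


def co_allocate(directory, cpus_needed):
    """Broker: reserve CPUs across several resources, all or nothing."""
    plan = {}
    remaining = cpus_needed
    by_size = sorted(directory.items(), key=lambda kv: -kv[1]["free_cpus"])
    for name, props in by_size:
        if remaining == 0:
            break
        take = min(props["free_cpus"], remaining)
        if take > 0:
            plan[name] = take
            remaining -= take
    if remaining > 0:
        return None  # request cannot be satisfied; reserve nothing
    for name, take in plan.items():
        directory[name]["free_cpus"] -= take
    return plan


# Hypothetical VO directory.
directory = {
    "site_a": {"free_cpus": 4},
    "site_b": {"free_cpus": 2},
    "site_c": {"free_cpus": 8},
}

big_sites = discover(directory, min_free_cpus=4)
plan = co_allocate(directory, cpus_needed=10)
too_much = co_allocate(directory, cpus_needed=5)
print(big_sites, plan, too_much)
```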
24Collective (2)
- Data replication services support the management
of VO storage (and perhaps also network and
computing) resources to maximize data access
performance with respect to metrics such as
response time, reliability, and cost.
- Grid-enabled programming systems enable familiar
programming models to be used in Grid
environments, e.g., a Grid-enabled
implementation of the Message Passing Interface
(MPI).
- Software discovery services discover and select
the best software implementation and execution
platform based on the parameters of the problem
being solved.
- Community authorization servers enforce community
policies governing resource access.
- Collaboratory services support the coordinated
exchange of information within potentially large
user communities.
25Applications
- Applications are constructed in terms of, and by
calling upon, services defined at any layer.
- Effective application development can often
benefit from the use of higher-level languages
and frameworks (e.g., the Common Component
Architecture, CORBA, etc.). These higher-level
systems can build on protocols, services, and
APIs provided within the Grid architecture.
26Protocols, Services, and Interfaces Occur at Each Level
Applications
Languages/Frameworks
Collective Service APIs and SDKs
Collective Service Protocols
Collective Services
Resource APIs and SDKs
Resource Service Protocols
Resource Services
Connectivity APIs
Connectivity Protocols
Local Access APIs and Protocols
Fabric Layer
27Data Grid
The need for Data Grids stems from the fact that
scientific applications like data analysis in
High Energy Physics, climate modeling, or earth
observation are very data intensive, and a large
community of researchers all around the globe
wants to have fast access to the data.
Future Data Grid applications: Medical Grids and
E-Business Grids.
Grid Data Warehousing and Grid
Data Mining: a new, challenging field.
28Storage Model
- 2 different kinds of files:
- - Master files (owned by their creators)
- - Replica files. There may be many replicas of a
master file.
- Replicas are owned by, managed by, and may be
deleted by the Grid.
- The notion of replicas is new, and critical in a
Grid environment. Example:
- - Before a DataGrid job can run at site A, data at
site B may need to be copied to site A.
- - This data may then be used by subsequent jobs at
site A, or may be needed by jobs at site C, which
has a better network connection to site A than
to site B. For this reason, the data should be kept
at site A as long as possible.
- The ReplicaManager keeps track of all replica
data so that the replica selection service can
select the optimal replica to use for a given
job, or request the creation of a new replica.
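The site A / B / C example can be replayed with a small replica catalog and a cost-based selection function. This is only a sketch of the idea, not the ReplicaManager interface: the function names, the logical file name `scan-001`, and the transfer-cost table are all assumed for illustration.

```python
def register_replica(catalog, logical_name, site):
    """Record that a site holds a copy of a logical file."""
    catalog.setdefault(logical_name, set()).add(site)


def select_replica(catalog, logical_name, job_site, transfer_cost):
    """Pick the replica that is cheapest to reach from the job's site."""
    sites = catalog.get(logical_name)
    if not sites:
        raise KeyError(f"no replica registered for {logical_name}")

    def cost(site):
        # A local replica costs nothing; otherwise look up the link cost.
        return 0.0 if site == job_site else transfer_cost[(site, job_site)]

    return min(sorted(sites), key=cost)


# Assumed link costs: C is better connected to A than to B.
transfer_cost = {("B", "A"): 5.0, ("B", "C"): 5.0, ("A", "C"): 1.0}

catalog = {}
register_replica(catalog, "scan-001", "B")    # master copy lives at B

first = select_replica(catalog, "scan-001", "A", transfer_cost)
register_replica(catalog, "scan-001", "A")    # job at A copied the data

for_c = select_replica(catalog, "scan-001", "C", transfer_cost)
for_a = select_replica(catalog, "scan-001", "A", transfer_cost)
print(first, for_c, for_a)
```

Once the copy at A is registered, later jobs at A and at C both resolve to it, which is why keeping it at A pays off.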
29SQLDatabaseService
This service makes it possible to efficiently store, retrieve,
and query very large amounts of metadata held in
any type of local or remote RDBMS. The database
can be used for the implementation of catalogs.
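To make the "database as catalog" idea concrete, the sketch below keeps replica metadata in a relational table. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for a local or remote RDBMS; the table layout, column names, and the helper `sites_holding` are assumptions, not the SQLDatabaseService interface.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE replica_catalog (
        logical_name TEXT NOT NULL,
        site         TEXT NOT NULL,
        size_bytes   INTEGER NOT NULL,
        PRIMARY KEY (logical_name, site)
    )
    """
)
conn.executemany(
    "INSERT INTO replica_catalog VALUES (?, ?, ?)",
    [
        ("scan-001", "B", 2048),
        ("scan-001", "A", 2048),
        ("scan-002", "C", 4096),
    ],
)
conn.commit()


def sites_holding(conn, logical_name):
    """Catalog query: which sites hold a replica of this file?"""
    rows = conn.execute(
        "SELECT site FROM replica_catalog WHERE logical_name = ? ORDER BY site",
        (logical_name,),
    )
    return [site for (site,) in rows]


print(sites_holding(conn, "scan-001"))
```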
30GridMiner: A Framework for Data Mining on Grids
31Architecture of a Data Mining System
Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Filtering
Data cleaning, data integration
Data warehouse
32Decomposition of a Knowledge Discovery Process
- Preprocessing
- - data cleaning
- - data transformation
- - data reduction
- Data mining (e.g., association rules)
- - find frequent itemsets
- - generate association rules
- Evaluation of discovered patterns
- Graphical User Interface
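The two data mining sub-steps named above (find frequent itemsets, then generate association rules) can be sketched as follows. This is a simplified level-wise search in the spirit of Apriori, without the usual candidate-pruning optimizations; the function names, thresholds, and shopping-basket data are assumptions for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose support meets the threshold."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]
    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in items]
    frequent = {}
    while current:
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates.
        current = list({
            a | b for a, b in combinations(list(level), 2)
            if len(a | b) == len(a) + 1
        })
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with confidence = supp(itemset) / supp(lhs)."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules


# Hypothetical transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
print(rules)
```

The rule step only needs the supports computed in the first step, so it never touches the transactions again.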
33Our Philosophy
- Data mining systems can be decomposed into a set
of communicating components → distributed
component architecture.
- Placement of data-processing functionalities
is critical.
- Grid data mining research is tightly coupled to the
ongoing work on parallel I/O for Grids (e.g., the
Armada project at Dartmouth College, USA).
34Basic Grid Data Mining Models
1. Local data analysis followed by the generation
of a global data model, adapting distributed
data mining techniques. No data replication.
2. Data mining system components are optimally
located on the Grid. No dynamic data replication.
3. Data mining system components are optimally
located on the Grid. Dynamic data replication
is considered.
35Data Storage and the Components
[Diagram labels: Sites A, B, C, and D, each with Preprocessing and Local DM; Construction of the Global Model; GUI; Site E]