Data Mining Engineering

Data Mining. Peter Brezany, Institut für Softwarewissenschaft, Universität Wien. E-mail: brezany_at_par.univie.ac.at

Slides: 36
Provided by: PeterB258
Transcript and Presenter's Notes

1
Grids, Grid Technologies and Data Mining
Peter Brezany, Institut für Softwarewissenschaft,
Universität Wien. E-mail: brezany_at_par.univie.ac.at
2
Grid and Grid Technologies
Grid computing has emerged as an important field,
distinguished from conventional distributed
computing by its focus on large-scale resource
sharing, innovative applications, and, in
some cases, high-performance orientation. The Grid
itself is meant to connect computing resources
over a wide area network. Internet computing
and Grid technologies promise to change the way
we tackle complex problems. Harnessing these new
technologies effectively will transform
scientific disciplines ranging from high-energy
physics to the life sciences. The Grid research
field can be further divided into 2 subdomains:
- Computational Grid: a natural extension of
  the former cluster computer.
- Data Grid: efficient management, placement, and
  replication of large amounts of data. Once data
  are in place, computational tasks can be run.
3
Data Mining on (Data) Grids
  • Data mining on the Grid (DMG): finding data
    patterns in an environment with geographically
    distributed data and computation, that is, an
    environment with special data management, data
    placement, and data replication.
  • A good DMG algorithm analyzes data in a
    distributed fashion with modest data
    communication overhead.
  • A typical DMG algorithm involves local data
    analysis followed by the generation of a global
    data model.
  • Huge data volumes are involved, so
    high-performance I/O is needed.
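
The "local data analysis followed by a global data model" pattern can be sketched as follows. This is a minimal illustration, not the method used in the presentation: it assumes the global model is just a mean and variance, and the names `LocalSummary`, `analyze_locally`, and `build_global_model` are hypothetical. Each site ships three numbers instead of its raw data, which keeps the communication overhead modest.

```python
from dataclasses import dataclass


@dataclass
class LocalSummary:
    """Sufficient statistics computed at one site (hypothetical structure)."""
    n: int
    total: float
    total_sq: float


def analyze_locally(values):
    """Local step: scan the site's data once and keep only three numbers."""
    return LocalSummary(
        n=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def build_global_model(summaries):
    """Global step: merge the per-site summaries into a global mean and variance."""
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    mean = sum(s.total for s in summaries) / n
    # Population variance: E[x^2] - (E[x])^2, clamped against rounding error.
    variance = sum(s.total_sq for s in summaries) / n - mean * mean
    return {"n": n, "mean": mean, "variance": max(variance, 0.0)}


# Toy data standing in for three geographically distributed databases.
sites = {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0], "C": [6.0]}
summaries = [analyze_locally(values) for values in sites.values()]
model = build_global_model(summaries)
print(model)
```

Only the summaries cross the network, so the same structure also fits the case where sites are unwilling to share the underlying records.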

4
Application Examples
  • Finding out the dependency of the emergence of
    hepatitis C on weather patterns: requires access
    to a large hepatitis C DB at one location and an
    environmental DB at another location.
  • 2 major financial organizations want to
    cooperate. They need to share data patterns
    relevant to the data mining task, but they do not
    want to share the data since it is sensitive, so
    combining the databases may not be feasible.
  • A major multinational corporation wants to
    analyze its customer transaction records to
    quickly develop successful business
    strategies. It has thousands of establishments
    throughout the world, and collecting all the data
    in a centralized data warehouse, followed by
    analysis using existing commercial data mining
    software, takes too long.
  • Telemedical applications: see the next 2 slides.

5
Components of Telemedical Applications
[Figure: components of a telemedical application. Labels: Database, Raw Medical Data, Derived Medical Data, Database, Reconstructed Medical Data, Web.]
6
Telemedical Collaboration - Example
A patient living in a remote village has a heart
problem. An EEG is taken by the local doctor and
all the patient's details are stored in the
doctor's PC-based telemedical system. MRI and CT
scans are taken in different departments of
a general hospital and stored in the telemedical
DB. A consultant compiles a report and saves it
in the DB. If necessary, a 3D ultrasound scan is
taken in a specialized clinic and a further report
is compiled. If complicated surgery is required,
an external specialist using Virtual Reality
techniques defines how the surgery should be
planned. The resulting operation is recorded on
video for, e.g., education. -> Data mining
support/assistance is needed.
7
Motivations and History
8
Grid Computing Concept
  • Enable communities (virtual organizations)
    to share geographically distributed resources as
    they pursue common goals, in the absence of
    central control, omniscience, and trust
    relationships.

9
Grid Computing Concept (2)
The term "the Grid" was coined in the mid
1990s to denote a proposed distributed computing
infrastructure for science and engineering. The
aim is coordinated resource sharing and problem
solving in dynamic, multi-institutional virtual
organizations. Resources: computers, files, data,
sensors, networks, laboratory
equipment, etc. Sharing is highly controlled,
with resource providers and consumers defining
clearly and carefully just what is shared, who
is allowed to share, and the conditions under which
sharing occurs. A set of individuals and/or
institutions defined by such sharing forms a
virtual organization (VO).
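
A sharing rule of this kind (what is shared, who may share it, under which conditions) can be sketched as a small data structure. This is only an illustration under stated assumptions: the classes `SharingRule` and `VirtualOrganization`, the field names, and the CPU-hour condition are all hypothetical and not part of any Grid toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingRule:
    """One provider-defined rule (all field names are hypothetical)."""
    resource: str        # what is shared
    allowed: frozenset   # who is allowed to share it
    max_cpu_hours: int   # condition under which sharing occurs


class VirtualOrganization:
    """A VO is defined here simply by the set of its sharing rules."""

    def __init__(self, name):
        self.name = name
        self.rules = []

    def add_rule(self, rule):
        self.rules.append(rule)

    def members(self):
        """Everyone named in at least one sharing rule."""
        found = set()
        for rule in self.rules:
            found |= rule.allowed
        return found

    def may_use(self, who, resource, cpu_hours):
        """True if some rule covers this user, resource, and request size."""
        return any(
            rule.resource == resource
            and who in rule.allowed
            and cpu_hours <= rule.max_cpu_hours
            for rule in self.rules
        )


vo = VirtualOrganization("demo-vo")
vo.add_rule(SharingRule("cluster-1", frozenset({"alice", "bob"}), max_cpu_hours=100))
vo.add_rule(SharingRule("sensor-feed", frozenset({"alice"}), max_cpu_hours=0))

print(vo.may_use("bob", "cluster-1", 50))    # within the rule's limit
print(vo.may_use("bob", "sensor-feed", 0))   # bob is not named in that rule
```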
10
Grid Computing Concept (3)
Grid technologies complement rather than compete
with existing distributed computing
technologies. For example, CORBA focuses on
enabling resource sharing within a single
organization. Grid technologies focus on
dynamic, cross-organizational sharing.
11
Grid Communities and Applications: Home Computers
Evaluate AIDS Drugs
  • Community:
    - 1000s of home computer users
    - Philanthropic computing vendor (Entropia)
    - Research group (Scripps)
  • Common goal: advance AIDS research

12
The Nature of Grid Architecture
A Grid architecture identifies fundamental system
components, specifies the purpose and function of
these components, and indicates how these
components interact with one another.
Interoperability is the central issue to be
addressed. In a network environment,
interoperability means common protocols. The Grid
architecture is first and foremost a protocol
architecture, with protocols defining the basic
mechanisms by which VO users and resources
negotiate, establish, manage, and exploit sharing
relationships. Standard protocols make it easy to
define standard services that provide enhanced
capabilities and to construct Application
Programming Interfaces and Software Development
Kits.
13
The Nature of Grid Architecture (2)
Just as the Web revolutionized information
sharing by providing a universal protocol and
syntax (HTTP and HTML) for information exchange,
so we require standard protocols and syntaxes for
general resource sharing. A Grid protocol
definition specifies:
- how distributed system elements interact with
  one another in order to achieve a specified
  behavior, and
- the structure of the information exchanged during
  this interaction.
14
The Nature of Grid Architecture (3)
A Grid service is defined solely by the protocol
that it speaks and the behaviors that it
implements. There are standard Grid services
for:
- access to computation
- access to data
- resource discovery
- coscheduling (mechanisms for coordinating
  operations across multiple resources)
- data replication, etc.
The definition of the above services allows
us to enhance the services offered to VO
participants and also to abstract away
resource-specific details.
15
The Nature of Grid Architecture (4)
Why do we also consider Application Programming
Interfaces (APIs) and Software Development Kits
(SDKs)? There is more to VOs than
interoperability, protocols, and
services. Developers must be able to develop
sophisticated applications in complex and dynamic
execution environments. Users must be able to
operate these applications. Standard
abstractions, APIs, and SDKs can accelerate code
development, enable code sharing, and enhance
application portability. Summary:
identification and definition of 1. protocols ->
2. services -> 3. APIs and SDKs.
16
Grid Architecture
The architecture is organized into layers (see
the next slide). Components within each layer share
common characteristics but can build on
capabilities and behaviors provided by any
lower layer. Resource and Connectivity protocols
facilitate the sharing of individual resources.
They are designed so that they can be implemented
on top of a diverse range of resource types,
defined at the Fabric layer, and can in turn be
used to construct a wide range of global services
and application-specific behaviors at
the Collective layer.
17
Layered Grid Architecture (By Analogy to Internet
Architecture)
[Figure: layered Grid architecture. Surviving label: Application.]
18
Fabric: Interface to Local Control
The Grid Fabric layer provides the resources to
which shared access is mediated. Fabric
components implement the local resource-specific
operations that occur as a result of sharing
operations at higher levels. At a minimum,
resources should implement enquiry mechanisms
that permit discovery of their structure and
state, and resource management mechanisms that
provide some control of delivered quality of
service.
19
Fabric: Interface to Local Control (2)
  • A resource-specific characterization of
    capabilities:
  • Computational resources: mechanisms for starting
    programs and for monitoring and controlling the
    execution of the resulting processes.
  • Storage resources: mechanisms for putting and
    getting files. Enquiry functions for determining
    hardware and software characteristics and
    information about available space and utilization.
  • Network resources: mechanisms that provide
    control over the resources allocated to network
    transfers. Enquiry functions to determine network
    characteristics and load.
  • Code repositories: managing versioned source and
    object code.
  • Catalogs: catalog query and update operations.
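
The storage-resource capabilities listed above (put, get, enquiry) can be sketched as an abstract interface with one toy implementation. This is a minimal sketch under stated assumptions: `StorageResource`, `InMemoryStorage`, and the keys returned by `enquire` are hypothetical names chosen for illustration, not an existing Fabric API.

```python
from abc import ABC, abstractmethod


class StorageResource(ABC):
    """Hypothetical Fabric-level view of a storage resource."""

    @abstractmethod
    def put(self, name: str, data: bytes) -> None:
        """Store a file under the given name."""

    @abstractmethod
    def get(self, name: str) -> bytes:
        """Return the contents of a stored file."""

    @abstractmethod
    def enquire(self) -> dict:
        """Report structure and state, e.g. capacity and free space."""


class InMemoryStorage(StorageResource):
    """Toy implementation that keeps files in a dictionary."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self._files = {}

    def _used(self) -> int:
        return sum(len(data) for data in self._files.values())

    def put(self, name, data):
        # Overwriting a file releases its old space before the check.
        projected = self._used() - len(self._files.get(name, b"")) + len(data)
        if projected > self.capacity_bytes:
            raise OSError("not enough space on resource")
        self._files[name] = bytes(data)

    def get(self, name):
        return self._files[name]

    def enquire(self):
        used = self._used()
        return {
            "capacity_bytes": self.capacity_bytes,
            "used_bytes": used,
            "free_bytes": self.capacity_bytes - used,
        }


store = InMemoryStorage(capacity_bytes=10)
store.put("a.dat", b"12345")
print(store.enquire())
```

Higher layers would program against `StorageResource` only, so a disk array or tape archive could be swapped in without changing their code.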

20
Connectivity: Communicating Easily and Securely
  • The Connectivity layer defines core communication
    and authentication protocols required for
    Grid-specific network transactions.
  • Communication protocols enable the exchange of
    data between Fabric layer resources.
  • Authentication protocols build on communication
    services to provide cryptographically secure
    mechanisms for verifying the identity of users
    and resources.

21
Connectivity (2)
  • Authentication solutions for VO environments
    should have the following characteristics:
  • Single sign-on: users must be able to log on
    (authenticate) just once and then have access to
    multiple Grid resources defined by the Fabric
    layer, without further user intervention.
  • Delegation: a user must be able to endow a
    program with the ability to run on that user's
    behalf, so that the program is able to access the
    resources on which the user is authorized.
  • Integration with various local security
    solutions: Grid security solutions must be able
    to interoperate with various local security
    solutions.
  • User-based trust relationships: if a user has the
    right to use sites A and B, the user should be
    able to use sites A and B together without
    requiring that A's and B's security administrators
    interact.

22
Resource: Sharing Single Resources
  • The Resource layer defines protocols (and APIs
    and SDKs) for secure initiation, monitoring, and
    control of sharing operations on individual
    resources.
  • The primary classes of Resource layer protocols:
  • Information protocols are used to obtain
    information about the structure and state of a
    resource, e.g., its configuration, current load,
    and usage policy.
  • Management protocols are used to negotiate access
    to a shared resource, specifying, for example,
    resource requirements and the operations to be
    performed, such as process creation, or data
    access. A protocol may support monitoring the
    status of an operation and controlling (e.g.,
    terminating) the operation.

23
Collective: Coordinating Multiple Resources
  • The Collective layer contains protocols and
    services (and APIs and SDKs) that are not
    associated with any one specific resource but
    rather are global in nature and capture
    interactions across collections of resources.
    This layer can, e.g., implement:
  • Directory services allow VO participants to
    discover the existence and/or properties of VO
    resources.
  • Co-allocation, scheduling, and brokering services
    allow VO participants to request the allocation of
    one or more resources for a specific purpose and
    the scheduling of tasks on the appropriate
    resources.
  • Monitoring and diagnostics services support the
    monitoring of VO resources for failure,
    adversarial attack (intrusion detection),
    overload, and so forth.
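
How a directory service and a co-allocation broker might fit together can be sketched as follows. This is an illustrative sketch only: `ResourceEntry`, `DirectoryService`, and `co_allocate` are hypothetical names, and the greedy "most free CPUs first" policy is an assumption made for the example, not a policy described in the slides.

```python
from dataclasses import dataclass


@dataclass
class ResourceEntry:
    """Directory record for one compute resource (hypothetical fields)."""
    name: str
    site: str
    cpus: int
    busy_cpus: int

    @property
    def free_cpus(self) -> int:
        return self.cpus - self.busy_cpus


class DirectoryService:
    """Lets VO participants discover resources and their properties."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: ResourceEntry) -> None:
        self._entries[entry.name] = entry

    def discover(self, min_free_cpus: int = 1):
        return [e for e in self._entries.values() if e.free_cpus >= min_free_cpus]


def co_allocate(directory: DirectoryService, cpus_needed: int):
    """Greedy broker: take CPUs from the resources with the most free CPUs.

    Returns a list of (resource name, CPUs taken). It only plans the
    allocation; it does not reserve anything in the directory.
    """
    if cpus_needed <= 0:
        raise ValueError("cpus_needed must be positive")
    plan = []
    remaining = cpus_needed
    candidates = sorted(directory.discover(), key=lambda e: (-e.free_cpus, e.name))
    for entry in candidates:
        take = min(entry.free_cpus, remaining)
        plan.append((entry.name, take))
        remaining -= take
        if remaining == 0:
            return plan
    raise RuntimeError("not enough free CPUs in the VO")


directory = DirectoryService()
directory.register(ResourceEntry("node-a", "A", cpus=8, busy_cpus=4))
directory.register(ResourceEntry("node-b", "B", cpus=4, busy_cpus=0))
directory.register(ResourceEntry("node-c", "C", cpus=16, busy_cpus=16))

plan = co_allocate(directory, cpus_needed=6)
print(plan)
```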

24
Collective (2)
  • Data replication services support the management
    of VO storage (and perhaps also network and
    computing) resources to maximize data access
    performance with respect to metrics such as
    response time, reliability, and cost.
  • Grid-enabled programming systems enable familiar
    programming models to be used in Grid
    environments, e.g., a Grid-enabled
    implementation of the Message Passing Interface
    (MPI).
  • Software discovery services discover and select
    the best software implementation and execution
    platform based on the parameters of the problem
    being solved.
  • Community authorization servers enforce community
    policies governing resource access.
  • Collaboratory services support the coordinated
    exchange of information within potentially large
    user communities.

25
Applications
  • Applications are constructed in terms of, and by
    calling upon, services defined at any layer.
  • Effective application development can often
    benefit from the use of higher-level languages
    and frameworks (e.g., the Common Component
    Architecture, CORBA, etc.). These higher-level
    systems can build on protocols, services, and
    APIs provided within the Grid architecture.

26
Protocols, Services, and Interfaces Occur at Each
Level
Applications
Languages/Frameworks
Collective Service APIs and SDKs
Collective Service Protocols
Collective Services
Resource APIs and SDKs
Resource Service Protocols
Resource Services
Connectivity APIs
Connectivity Protocols
Local Access APIs and Protocols
Fabric Layer
27
Data Grid
The need for Data Grids stems from the fact that
scientific applications such as data analysis in
High Energy Physics, climate modeling, or earth
observation are very data intensive, and a large
community of researchers all around the globe
wants to have fast access to the data. Future
Data Grid applications: Medical Grids and
E-Business Grids. Grid Data Warehousing and Grid
Data Mining: a new challenging field.
28
Storage Model
  • 2 different kinds of files:
  • Master files (owned by their creators).
  • Replica files. There may be many replicas of a
    master file.
  • Replicas are owned by, managed by, and may be
    deleted by the Grid.
  • The notion of replicas is new, and critical in a
    Grid environment. Example:
  • Before a DataGrid job can run at site A, data at
    site B may need to be copied to site A.
  • This data may then be used by subsequent jobs at
    site A, or may be needed by jobs at site C, which
    has a better network connection to site A than
    to site B. For this reason, the data should be
    kept at site A as long as possible.
  • The ReplicaManager keeps track of all replica
    data so that the replica selection service can
    select the optimal replica to use for a given
    job, or request the creation of a new replica.

29
SQLDatabaseService
This service allows very large amounts of metadata
held in any type of local or remote RDBMS to be
stored, retrieved, and queried efficiently. The
database can be used for the implementation of
catalogs.
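
A catalog backed by a relational database can be sketched with Python's built-in `sqlite3` module. This is not the SQLDatabaseService interface, which the slides do not show; the table `file_metadata`, its columns, and the helper functions are hypothetical, and SQLite stands in for whatever RDBMS a site actually runs.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) a small metadata catalog; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS file_metadata (
            logical_name TEXT NOT NULL,
            site         TEXT NOT NULL,
            size_bytes   INTEGER NOT NULL,
            PRIMARY KEY (logical_name, site)
        )
        """
    )
    return conn


def store(conn, logical_name, site, size_bytes):
    """Insert or update one catalog entry inside a transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO file_metadata VALUES (?, ?, ?)",
            (logical_name, site, size_bytes),
        )


def sites_holding(conn, logical_name):
    """Retrieve: which sites hold a copy of this file?"""
    rows = conn.execute(
        "SELECT site FROM file_metadata WHERE logical_name = ? ORDER BY site",
        (logical_name,),
    )
    return [row[0] for row in rows]


def bytes_per_site(conn):
    """Query: total catalogued bytes at each site."""
    rows = conn.execute(
        "SELECT site, SUM(size_bytes) FROM file_metadata GROUP BY site ORDER BY site"
    )
    return dict(rows.fetchall())


catalog = open_catalog()
store(catalog, "run1.dat", "A", 100)
store(catalog, "run1.dat", "B", 100)
store(catalog, "run2.dat", "A", 50)

print(sites_holding(catalog, "run1.dat"))
print(bytes_per_site(catalog))
```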
30
GridMiner: A Framework for Data Mining on Grids
  • A new research field

31
Architecture of a Data Mining System
Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Filtering
Data cleaning, data integration
Data warehouse
32
Decomposition of a Knowledge Discovery Process
  • Preprocessing
    - data cleaning
    - data transformation
    - data reduction
  • Data mining (e.g., association rules)
    - find frequent itemsets
    - generate association rules
  • Evaluation of discovered patterns
  • Graphical User Interface
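
The two data mining sub-steps named above (find frequent itemsets, then generate association rules) can be sketched on a toy data set. This is a plain level-wise search written for readability, not an optimized Apriori implementation and not code from GridMiner; the function names, thresholds, and transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    if not transactions:
        raise ValueError("no transactions")
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result = {}
    while candidates:
        level = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        result.update(level)
        # Join frequent k-itemsets pairwise into (k+1)-item candidates.
        candidates = list(
            {a | b for a, b in combinations(list(level), 2) if len(a | b) == len(a) + 1}
        )
    return result


def association_rules(itemsets, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    rules = []
    for itemset, sup in itemsets.items():
        for size in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), size):
                lhs = frozenset(lhs_items)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = sup / itemsets[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, sup, confidence))
    return rules


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.9)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), sup, conf)
```

The evaluation step would then rank or filter such rules before they are shown in the user interface.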

33
Our Philosophy
  • Data mining systems can be decomposed into a set
    of communicating components -> distributed
    component architecture.
  • Placement of data-processing functionalities
    is critical.
  • Grid data mining research is tightly coupled to
    the ongoing work on parallel I/O for Grids (e.g.,
    the Armada project at Dartmouth College, USA).

34
Basic Grid Data Mining Models
  • 1. Local data analysis followed by the generation
    of a global data model, adapting distributed
    data mining techniques. No data replication.
  • 2. Data mining system components are optimally
    located on the Grid. No dynamic data
    replication.
  • 3. Data mining system components are optimally
    located on the Grid. Dynamic data replication
    is considered.

35
Data Storage and the Components
[Figure: data storage and components. Labels: Site A, Site B, Site C, and Site D, each with Preprocessing and Local DM; Construction of the Global Model; GUI; Site E.]