Title: Decentralized Data Management Framework for Distributed Environments
1. Decentralized Data Management Framework for Distributed Environments
- Houda Lamehamedi
- Computer Science Department
- Rensselaer Polytechnic Institute
2. Data in Large Scale Computing
- In an increasing number of scientific disciplines, large data collections are emerging as important community resources
  - data produced and collected at experiment sites, e.g., high energy physics, climate modeling
  - processed data and analysis results
- The geographical distribution of the compute and storage resources results in complex and stringent performance demands
3. Data Management Requirements
- Scientific collaborations in distributed environments generate queries involving access to large data sets
- Efficient execution of these queries requires
  - careful management of large data caches,
  - gigabit-scale data transfers over wide area networks,
  - creation, management, and strategic placement of replicas
4. Data Replication
- The Globus Toolkit is a standard set of services supporting resource-sharing applications
- Data management services offered:
  - GridFTP offers secure, efficient data transfer in Grid and distributed environments
  - The Replica Catalog allows users to register files
  - The Replica Location Service allows users to locate replicas
- The system only provides users with tools to statically replicate data files
5. Data Management Issues
- Existing data management frameworks, such as Data Grids, demand extensive administrative oversight and incur high management overhead
- Missing support for dynamic and intermittent participation in the Data Grid hinders scalable growth of collaborative research
- Limited support for replication: data is statically replicated under user guidelines
6. Our Approach
- To address these issues we introduce a decentralized, performance-driven, adaptive replica management middleware that
  - uses an overlay network to organize participating nodes,
  - dynamically adapts replica placement to changing user and network needs and behavior,
  - dynamically evaluates data access costs vs. performance gains before creating a new replica
7. Major Components
- A theoretical model of data transfer cost and access performance
  - parameterized by the changing computing environment
- Data monitoring tools that feed current values of resource consumption to the cost function
- Dynamic replica management services
  - offer transparent replication using the cost function
  - manage replica placement and discovery
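The cost-guided replication decision described above can be sketched as follows. This is a minimal illustration, not the middleware's actual model: the function names, weights, and the storage-pressure penalty are all assumptions on our part.

```python
# Hypothetical sketch of a cost-guided replica-creation check.
# All parameter names, units, and weights are illustrative assumptions.

def replication_gain(file_size_mb, bandwidth_mbps, access_rate, horizon_s):
    """Estimated remote-fetch time saved over the horizon if a local replica exists."""
    remote_fetch_s = (file_size_mb * 8) / bandwidth_mbps  # seconds per remote access
    return access_rate * horizon_s * remote_fetch_s

def replication_cost(file_size_mb, bandwidth_mbps, storage_pressure):
    """One-time transfer cost, inflated when local storage is scarce."""
    transfer_s = (file_size_mb * 8) / bandwidth_mbps
    return transfer_s * (1.0 + storage_pressure)

def should_replicate(file_size_mb, bandwidth_mbps, access_rate,
                     horizon_s=3600.0, storage_pressure=0.2):
    """Create a replica only when the expected gain outweighs the cost."""
    gain = replication_gain(file_size_mb, bandwidth_mbps, access_rate, horizon_s)
    cost = replication_cost(file_size_mb, bandwidth_mbps, storage_pressure)
    return gain > cost

# A 500 MB file accessed 0.01 times/s over a 100 Mbps link is worth replicating:
print(should_replicate(500, 100, 0.01))
```

The monitoring tools described above would supply the current bandwidth, access rate, and storage pressure values at decision time.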
8. Middleware Architecture
- The Replica Management Layer supports the management and transfer of data between Grid nodes and the creation of new replicas. It uses input from the lower layers to track users' access patterns and monitor data popularity.
- The Resource Access Layer provides access to available resources and monitors their usage and availability. It includes a Replica Catalog to support transparent access to data at each Grid node for local and remote users.
- The Communication Layer consists of the data transfer and authentication protocols used to ensure security, verify users' identities, and maintain data integrity. It also provides support for the overlay network structure.
9. Framework
- Services offered by the middleware:
  - Resource Monitoring service: monitors resource availability and access frequency
  - Replica Creation service: creates replicas based on cost evaluation
  - Replica Location service: manages the local Replica Catalog
  - Resource Allocation service: allocates space for newly created replicas
  - Routing and Connectivity service: routes outgoing messages
10. Catalog Management
[Diagram: each node keeps a Local Catalog mapping file IDs (keys) to replica locations in local storage (e.g., Node80.cs.rpi.edu/../file1) and a Remote Catalog of replicas held at other nodes. Catalog entries are updated on file registration, file creation, replica creation, file deletion, and replica deletion, and catalog contents are replicated across nodes.]
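The catalog operations in the diagram above can be sketched as a mapping from file IDs to sets of replica locations. The `ReplicaCatalog` class and its method names are illustrative assumptions, not the middleware's actual interface.

```python
# Hypothetical sketch of a per-node replica catalog: file IDs map to the
# set of locations (local paths or remote node/path strings) holding a replica.

class ReplicaCatalog:
    def __init__(self):
        self.entries = {}  # file ID -> set of replica locations

    def register(self, file_id, location):
        """Record a new file or replica location under its file ID."""
        self.entries.setdefault(file_id, set()).add(location)

    def unregister(self, file_id, location):
        """Remove one replica; drop the entry when no replicas remain."""
        locs = self.entries.get(file_id)
        if locs:
            locs.discard(location)
            if not locs:
                del self.entries[file_id]

    def locate(self, file_id):
        """Return all known replica locations for a file ID."""
        return self.entries.get(file_id, set())

# Usage: register a local copy and a remote replica (paths are made up),
# then look the file up by its ID.
catalog = ReplicaCatalog()
catalog.register("fileID1", "/local/data/file1")
catalog.register("fileID1", "Node80.cs.rpi.edu/data/file1")
print(sorted(catalog.locate("fileID1")))
```

File registration, replica creation, and file/replica deletion in the diagram each reduce to a `register` or `unregister` call against this structure.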
11. Data Search and Replica Location
[Diagram: a data access request initiates the data location process; the local catalog/database is queried, the response is processed, and a list of file locations is returned (e.g., a local path plus replicas at Node2.cs.rpi.edu and Node80.cs.rpi.edu).]
12. Data Model Construction
- We use a combination of spanning tree and ring topologies
- Grid Node Insertion
  - When joining the grid, a node is added through an existing grid node by attaching to it as a child node or a sibling
- Node Removal
  - When a node leaves the tree, it sends a notification message to its parent, siblings, and children
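The join/leave protocol above can be sketched as follows. The `OverlayNode` class and its fields are assumptions: siblings are modeled as the children list of a common parent (the ring), and a leaving node reattaches its children to its parent, which the slides do not specify.

```python
# Hypothetical sketch of the tree-plus-ring overlay: children of a node
# form the sibling ring; joining attaches as a child, and leaving
# notifies the parent, siblings, and children before detaching.

class OverlayNode:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []  # doubles as the sibling ring for its members

    def join_via(self, contact):
        """Attach to the overlay as a child of an existing grid node."""
        self.parent = contact
        contact.children.append(self)

    def leave(self):
        """Notify parent, siblings, and children, then detach."""
        notified = []
        if self.parent:
            notified.append(self.parent.name)
            notified += [c.name for c in self.parent.children if c is not self]
            self.parent.children.remove(self)
        for child in self.children:  # reattach orphans to the grandparent
            child.parent = self.parent
            if self.parent:
                self.parent.children.append(child)
            notified.append(child.name)
        return notified

# Usage: build a small tree, then remove an interior node.
root = OverlayNode("root")
a, b, c = OverlayNode("a"), OverlayNode("b"), OverlayNode("c")
a.join_via(root); b.join_via(root); c.join_via(a)
print(a.leave())  # parent, sibling, and child are all notified
```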
13. Node Addition / Data Model Construction
- After a node joins, it starts developing a list of preferred neighbors
[Diagram: requests flow and data flow through the overlay network]
14. Middleware Deployment
- We used the two hierarchical distribution models that represent the most popular and commonly used models:
  - Bottom-Up: multiple collection sites
  - Top-Down: single collection site
- Experiments were conducted on a cluster of 40 Linux machines and a cluster of 20 FreeBSD workstations
15. Access Patterns
- Data access requests are based on patterns commonly observed in scientific and data-sharing environments
  - Files are of similar sizes within an application
  - Spikes are generated by new "interesting" files
- Users' social organization and interests guide the overlay construction
  - Interest-based adaptive clustering of users
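Workloads like the one above are often modeled with a skewed file-popularity distribution plus arrival spikes for newly published files. The sketch below uses a Zipf-like distribution with a popularity boost for one new file; the distribution choice and every parameter are assumptions on our part, not values from the slides.

```python
import random

# Hypothetical access-trace generator: Zipf-like file popularity (an
# assumption, not stated in the slides) plus a spike for a newly
# introduced "interesting" file.

def zipf_weights(n_files, skew=1.0):
    """Unnormalized Zipf weights: rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, n_files + 1)]

def generate_accesses(n_files, n_requests, new_file=None,
                      spike_boost=50.0, seed=0):
    """Draw file indices to access; optionally boost one new file's popularity."""
    rng = random.Random(seed)
    weights = zipf_weights(n_files)
    if new_file is not None:
        weights[new_file] *= spike_boost  # spike for the new interesting file
    return rng.choices(range(n_files), weights=weights, k=n_requests)

# Usage: the newly introduced file 99 draws far more requests than a
# mid-rank file, mimicking an access spike.
trace = generate_accesses(100, 10_000, new_file=99)
print(trace.count(99) > trace.count(50))
```

Feeding such a trace to the replica manager is one way to exercise the interest-driven clustering and cost evaluation under controlled conditions.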
16. Top-Down Model
17. Bottom-Up Model
18. Top-Down Experiment Results
19. Bottom-Up Experiment Results
20. Access Performance Evaluation
21. Conclusions
- Cost-guided dynamic replication improves data access performance by up to 30%, and by a minimum of 10%, compared to static user-initiated replication
- The combination of parameter selection for cost evaluation and resource availability plays a key role in influencing the performance of the system
- Lower storage availability might lead to race conditions where popular data files compete for storage space
- The results also show that popular data files benefit the most from dynamic replication