Title: Decentralized Data Management Framework for Distributed Environments
1. Decentralized Data Management Framework for Distributed Environments
- Houda Lamehamedi
- Computer Science Department
- Rensselaer Polytechnic Institute
2. Data in Large Scale Computing
- In an increasing number of scientific disciplines, large data collections are emerging as important community resources
  - data produced and collected at experiment sites, e.g., high energy physics, climate modeling
  - processed data and analysis results
- The geographical distribution of the compute and storage resources results in complex and stringent performance demands
3. Data Management Requirements
- Scientific collaborations in distributed environments generate queries involving access to large data sets
- Efficient execution of these queries requires
  - careful management of large data caches,
  - gigabit-scale data transfers over wide area networks,
  - creation, management, and strategic placement of replicas
4. Data Replication
- The Globus Toolkit is a standard set of services supporting resource-sharing applications
- Data management services offered:
  - GridFTP offers secure, efficient data transfer in Grid and distributed environments
  - The Replica Catalog allows users to register files
  - The Replica Location Service allows users to locate replicas
- The system only provides users with tools to statically replicate data files
5. Data Management Issues
- Existing data management frameworks, such as Data Grids, demand extensive administrative oversight and incur high management overhead
- Missing support for dynamic and intermittent participation in the Data Grid hinders scalable growth of collaborative research
- Limited support for replication: data is statically replicated under user guidelines
6. Our Approach
- To address these issues we introduce a decentralized, performance-driven, adaptive replica management middleware that
  - uses an overlay network to organize participating nodes,
  - dynamically adapts replica placement to changing user and network needs and behavior,
  - dynamically evaluates data access costs vs. performance gains before creating a new replica
7. Major Components
- A theoretical model of data transfer cost and access performance
  - parameterized by the changing computing environment
- Data monitoring tools that feed current values of resource consumption to the cost function
- Dynamic replica management services
  - offer transparent replication using the cost function
  - manage replica placement and discovery
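The cost-guided replication decision described above can be sketched as follows. This is a minimal illustration, not the middleware's actual model: the function names, weights, and the storage-pressure penalty are all assumptions on our part.

```python
# Hypothetical sketch of a cost-guided replica-creation check.
# All parameter names, units, and weights are illustrative assumptions.

def replication_gain(file_size_mb, bandwidth_mbps, access_rate, horizon_s):
    """Estimated remote-fetch time saved over the horizon if a local replica exists."""
    remote_fetch_s = (file_size_mb * 8) / bandwidth_mbps  # seconds per remote access
    return access_rate * horizon_s * remote_fetch_s

def replication_cost(file_size_mb, bandwidth_mbps, storage_pressure):
    """One-time transfer cost, inflated when local storage is scarce."""
    transfer_s = (file_size_mb * 8) / bandwidth_mbps
    return transfer_s * (1.0 + storage_pressure)

def should_replicate(file_size_mb, bandwidth_mbps, access_rate,
                     horizon_s=3600.0, storage_pressure=0.2):
    """Create a replica only when the expected gain outweighs the cost."""
    gain = replication_gain(file_size_mb, bandwidth_mbps, access_rate, horizon_s)
    cost = replication_cost(file_size_mb, bandwidth_mbps, storage_pressure)
    return gain > cost

# A 500 MB file accessed 0.01 times/s over a 100 Mbps link is worth replicating:
print(should_replicate(500, 100, 0.01))
```

The monitoring tools described above would supply the current bandwidth, access rate, and storage pressure values at decision time.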
8. Middleware Architecture
- The Replica Management Layer supports the management and transfer of data between Grid nodes and the creation of new replicas. It uses input from the lower layers to track users' access patterns and monitor data popularity.
- The Resource Access Layer provides access to available resources and monitors their usage and availability. It includes a Replica Catalog to support transparent access to data at each Grid node for local and remote users.
- The Communication Layer consists of the data transfer and authentication protocols used to ensure security, verify users' identities, and maintain data integrity. It also provides support for the overlay network structure.
9. Framework
- Services offered by the middleware:
  - Resource Monitoring service: monitors resource availability and access frequency
  - Replica Creation service: creates replicas based on cost evaluation
  - Replica Location service: manages the local Replica Catalog
  - Resource Allocation service: allocates space for newly created replicas
  - Routing and Connectivity service: routes outgoing messages
10. Catalog Management
[Diagram: each node keeps a Local Catalog mapping file IDs (keys) to replica locations in local storage (e.g., Node80.cs.rpi.edu/../file1) and a Remote Catalog of replicas held at other nodes. Catalog entries are updated on file registration, file creation, replica creation, file deletion, and replica deletion, and catalog contents are replicated across nodes.]
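The catalog operations in the diagram above can be sketched as a mapping from file IDs to sets of replica locations. The `ReplicaCatalog` class and its method names are illustrative assumptions, not the middleware's actual interface.

```python
# Hypothetical sketch of a per-node replica catalog: file IDs map to the
# set of locations (local paths or remote node/path strings) holding a replica.

class ReplicaCatalog:
    def __init__(self):
        self.entries = {}  # file ID -> set of replica locations

    def register(self, file_id, location):
        """Record a new file or replica location under its file ID."""
        self.entries.setdefault(file_id, set()).add(location)

    def unregister(self, file_id, location):
        """Remove one replica; drop the entry when no replicas remain."""
        locs = self.entries.get(file_id)
        if locs:
            locs.discard(location)
            if not locs:
                del self.entries[file_id]

    def locate(self, file_id):
        """Return all known replica locations for a file ID."""
        return self.entries.get(file_id, set())

# Usage: register a local copy and a remote replica (paths are made up),
# then look the file up by its ID.
catalog = ReplicaCatalog()
catalog.register("fileID1", "/local/data/file1")
catalog.register("fileID1", "Node80.cs.rpi.edu/data/file1")
print(sorted(catalog.locate("fileID1")))
```

File registration, replica creation, and file/replica deletion in the diagram each reduce to a `register` or `unregister` call against this structure.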
11. Data Search and Replica Location
[Diagram: a data access request initiates the data location process; the local catalog/database is queried, the response is processed, and a list of file locations is returned (e.g., a local path plus replicas at Node2.cs.rpi.edu and Node80.cs.rpi.edu).]
12. Data Model Construction
- We use a combination of spanning tree and ring topologies
- Grid Node Insertion
  - When joining the grid, a node is added through an existing grid node by attaching to it as a child node or a sibling
- Node Removal
  - When a node leaves the tree, it sends a notification message to its parent, siblings, and children
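The join/leave protocol above can be sketched as follows. The `OverlayNode` class and its fields are assumptions: siblings are modeled as the children list of a common parent (the ring), and a leaving node reattaches its children to its parent, which the slides do not specify.

```python
# Hypothetical sketch of the tree-plus-ring overlay: children of a node
# form the sibling ring; joining attaches as a child, and leaving
# notifies the parent, siblings, and children before detaching.

class OverlayNode:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []  # doubles as the sibling ring for its members

    def join_via(self, contact):
        """Attach to the overlay as a child of an existing grid node."""
        self.parent = contact
        contact.children.append(self)

    def leave(self):
        """Notify parent, siblings, and children, then detach."""
        notified = []
        if self.parent:
            notified.append(self.parent.name)
            notified += [c.name for c in self.parent.children if c is not self]
            self.parent.children.remove(self)
        for child in self.children:  # reattach orphans to the grandparent
            child.parent = self.parent
            if self.parent:
                self.parent.children.append(child)
            notified.append(child.name)
        return notified

# Usage: build a small tree, then remove an interior node.
root = OverlayNode("root")
a, b, c = OverlayNode("a"), OverlayNode("b"), OverlayNode("c")
a.join_via(root); b.join_via(root); c.join_via(a)
print(a.leave())  # parent, sibling, and child are all notified
```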
13. Node Addition / Data Model Construction
- After a node joins, it starts developing a list of preferred neighbors
[Diagram: requests flow and data flow through the overlay network]
14. Middleware Deployment
- We used the two hierarchical distribution models that represent the most popular and commonly used models:
  - Bottom-Up: multiple collection sites
  - Top-Down: single collection site
- Experiments were conducted on a cluster of 40 Linux machines and a cluster of 20 FreeBSD workstations
15. Access Patterns
- Data access requests are based on patterns commonly observed in scientific and data-sharing environments
  - Files are of similar sizes within an application
  - Spikes are generated by new "interesting" files
- Users' social organization and interests guide the overlay construction
  - Interest-based adaptive clustering of users
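Workloads like the one above are often modeled with a skewed file-popularity distribution plus arrival spikes for newly published files. The sketch below uses a Zipf-like distribution with a popularity boost for one new file; the distribution choice and every parameter are assumptions on our part, not values from the slides.

```python
import random

# Hypothetical access-trace generator: Zipf-like file popularity (an
# assumption, not stated in the slides) plus a spike for a newly
# introduced "interesting" file.

def zipf_weights(n_files, skew=1.0):
    """Unnormalized Zipf weights: rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, n_files + 1)]

def generate_accesses(n_files, n_requests, new_file=None,
                      spike_boost=50.0, seed=0):
    """Draw file indices to access; optionally boost one new file's popularity."""
    rng = random.Random(seed)
    weights = zipf_weights(n_files)
    if new_file is not None:
        weights[new_file] *= spike_boost  # spike for the new interesting file
    return rng.choices(range(n_files), weights=weights, k=n_requests)

# Usage: the newly introduced file 99 draws far more requests than a
# mid-rank file, mimicking an access spike.
trace = generate_accesses(100, 10_000, new_file=99)
print(trace.count(99) > trace.count(50))
```

Feeding such a trace to the replica manager is one way to exercise the interest-driven clustering and cost evaluation under controlled conditions.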
16. Top-Down Model
17. Bottom-Up Model
18. Top-Down Experiment Results
19. Bottom-Up Experiment Results
20. Access Performance Evaluation
21. Conclusions
- Cost-guided dynamic replication improves data access performance by up to 30%, and by a minimum of 10%, compared to static user-initiated replication
- The combination of parameter selection for cost evaluation and resource availability plays a key role in influencing the performance of the system
- Lower storage availability might lead to race conditions where popular data files compete for storage space
- The results also show that popular data files benefit the most from dynamic replication