Title: The Vesta Parallel File System
The Vesta Parallel File System
- Peter F. Corbett and Dror G. Feitelson
Outline
- Introduction
- Motivation and Design Guidelines
- Abstractions and Interface
- Implementation
- Conclusion
Introduction
- The Vesta parallel file system
- Runs under AIX on the IBM SP2
- Designed to provide parallel file access
- Can achieve high efficiency on parallel I/O hardware
- Deals exclusively with persistent on-line storage of files, particularly those that must be accessed by parallel applications
Introduction (cont.)
- The Vesta approach
- Introduces a new abstraction of parallel files, by which application programmers can express the required partitioning of file data among the processes of a parallel application
- Reduces the need for synchronization and concurrency control, and allows for a more streamlined implementation
- Provides explicit control over the way data is distributed across the I/O nodes, and allows the distribution to be tailored for the expected access patterns
Motivation and Design Guidelines
- Motivation
- In existing systems, users can create distributed files but do not have full control over the mapping of data to disks
- Design Guidelines
- Parallelism
- Scalability
- Layering
- Providing commonly expected services
Simple striping method to get a parallel view
- Simple striping technique: assuming that the number of I/O nodes is N, block i of the file is located on I/O node i mod N (see the sketch below)
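A minimal sketch of this round-robin striping rule; the block size and node count below are arbitrary example values, not taken from Vesta itself.

```c
#include <stdio.h>

/* Round-robin striping: block i of the file lands on I/O node i mod N. */
static int block_to_node(long block, int num_io_nodes)
{
    return (int)(block % num_io_nodes);
}

int main(void)
{
    const int N = 4;                 /* example: 4 I/O nodes          */
    const long block_size = 65536;   /* example block size in bytes   */
    long offset = 1000000;           /* an arbitrary file offset      */

    long block = offset / block_size;
    printf("offset %ld -> block %ld -> I/O node %d\n",
           offset, block, block_to_node(block, N));
    return 0;
}
```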
Method of Vesta to get a parallel view
- Two steps
- Abstract away from a direct dependency on the number of I/O nodes
- Allow a variety of partitioned views of the data, in addition to partitioning according to the physical distribution of data to the I/O nodes
- All these parallel views partition the file into disjoint subfiles that are typically accessed by different processes of a parallel application
- Guarantee that the accesses by the different processes are non-overlapping at the byte level
- Allow each process to access its data directly
Cell abstraction of Vesta
- Abstracting away from I/O nodes is done by introducing the notion of cells
- Cells can be thought of as containers where data can be deposited
- When a file is created, the number of cells is given as a parameter
- If the number of cells is no more than the number of I/O nodes, then each cell will reside on a different I/O node
- If there are more cells than I/O nodes, the cells will be distributed to the I/O nodes in a round-robin manner (see the sketch below)
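A minimal sketch of this placement rule. Starting the round-robin assignment at a per-file base node is an assumption for illustration (the metadata slide later mentions a base I/O node); the node and cell counts are example values.

```c
#include <stdio.h>

/* Cells are spread over the I/O nodes round-robin, starting at a base node.
   If there are no more cells than nodes, each cell gets its own node. */
static int cell_to_io_node(int cell, int base_node, int num_io_nodes)
{
    return (base_node + cell % num_io_nodes) % num_io_nodes;
}

int main(void)
{
    const int num_io_nodes = 4;  /* example system with 4 I/O nodes   */
    const int num_cells = 6;     /* more cells than nodes             */
    const int base_node = 1;     /* assumed starting node for the file */

    for (int c = 0; c < num_cells; c++)
        printf("cell %d -> I/O node %d\n",
               c, cell_to_io_node(c, base_node, num_io_nodes));
    return 0;
}
```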
2-d structure of Vesta
- 2-dimensional structure
- The cell dimension (horizontal) specifies the parallelism in accessing the data
- Data within the cells (vertical)
- The data in each cell is viewed as a sequence of basic striping units (BSUs)
- The BSU size can be an arbitrary number of bytes, and should be chosen to reflect the minimal unit of data access (see the sketch below)
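A minimal sketch of the two-dimensional addressing this implies: a byte of the file is located by a cell index (horizontal) and a position within that cell's sequence of BSUs (vertical). The BSU size used here is an arbitrary example.

```c
#include <stdio.h>

/* A byte within one cell is addressed by which BSU it falls in and
   where it sits inside that BSU. */
struct bsu_addr {
    long bsu;            /* index of the BSU within the cell */
    long offset_in_bsu;  /* byte offset inside that BSU      */
};

static struct bsu_addr locate_in_cell(long byte_in_cell, long bsu_size)
{
    struct bsu_addr a;
    a.bsu = byte_in_cell / bsu_size;
    a.offset_in_bsu = byte_in_cell % bsu_size;
    return a;
}

int main(void)
{
    const long bsu_size = 1024;   /* example BSU size in bytes */
    struct bsu_addr a = locate_in_cell(5000, bsu_size);
    printf("byte 5000 of a cell -> BSU %ld, offset %ld\n",
           a.bsu, a.offset_in_bsu);
    return 0;
}
```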
Two parameters define the structure
- The number of cells
- The BSU size
- The two parameters are defined when the file is created, and cannot be changed thereafter
- Attach -- a new call introduced for this purpose
- Every process in the application must attach every file before it can open the file
Partition files for parallel access
- Define the template of Vesta subfiles
- Define the block size used to distribute the data
- Data decomposition scheme (see the sketch below)
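A sketch of one plausible block-cyclic decomposition along these lines. The Hbs/Hn names follow the next slide; Vbs/Vn are assumed analogous vertical parameters, and the formula itself is an illustration rather than the exact Vesta scheme.

```c
#include <stdio.h>

/* One plausible block-cyclic decomposition of the 2-d (cell x BSU) structure
   into Hn x Vn disjoint subfiles.  Hbs/Hn follow the names used on the
   "Handling awkward cases" slide; Vbs/Vn are assumed vertical analogues.
   This is an illustration, not the exact Vesta formula. */
struct partition {
    long Hbs, Hn;   /* horizontal: block size in cells, number of groups */
    long Vbs, Vn;   /* vertical: block size in BSUs, number of groups    */
};

/* Which subfile (h, v) owns the BSU at (cell, bsu_row)? */
static void owner(const struct partition *p, long cell, long bsu_row,
                  long *h, long *v)
{
    *h = (cell / p->Hbs) % p->Hn;      /* block-cyclic over cells */
    *v = (bsu_row / p->Vbs) % p->Vn;   /* block-cyclic over BSUs  */
}

int main(void)
{
    struct partition p = { 2, 3, 1, 2 };   /* example parameters */
    long h, v;
    owner(&p, 5, 4, &h, &v);
    printf("cell 5, BSU row 4 -> subfile (%ld, %ld)\n", h, v);
    return 0;
}
```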
Handling awkward cases
- Ghost cells: extra cells are added to make the total number of cells a multiple of Hbs × Hn
- Ghost cells have no effect on reading and writing
- Holes: cells with different lengths leave a hole in the middle of a cell
- Writing to a hole causes it to be filled with valid data
- Call the Vesta stat function to find how much data is contained in the whole file
Data ordering
Features of the Vesta system
- Key feature: the capability to perform direct access from a compute node to an I/O node without referencing any centralized metadata
- The form of the abstraction
- The 2-d structure of BSUs within cells
- The interface used to access the abstraction
- Partitioning is also an innovative feature
- The partitioning is defined in advance, and then processes can perform independent accesses to any part of their partition (subfile)
Implementation
- Dedicated I/O nodes are created
- A client library is linked with application code running on the compute nodes
- A server runs on the I/O nodes
- Direct access from a compute node to an I/O node is achieved
- Metadata is distributed among all the I/O nodes
- The target I/O nodes can be identified using a combination of the metadata, the partitioning parameters, and the offset and count of the data access
Access to Metadata
- Vesta objects: files, cells, and Xrefs
- Each I/O node maintains the Vesta objects in a memory-mapped table
- The I/O nodes are logically numbered
- Each entry in the table contains information such as the file name, its owner ID, group and access permissions, creation, access, and last modification times, the number of cells, the BSU size, the base and highest numbered I/O nodes used, and the current file status
- A 7-bit uniquifier field distinguishes two files or Xrefs with different names
- A 1-bit field distinguishes files from Xrefs
- An 8-bit level field is used to number the cells of a file (a sketch of one possible entry layout follows below)
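A minimal sketch of what one entry of such a memory-mapped object table might look like. The bit widths come from the slide; all field names, types, and the overall layout are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical layout of one entry in the per-I/O-node object table.
   Only the three bit widths are taken from the slide; everything else
   is an assumed illustration. */
struct vesta_object {
    char     name[256];            /* file or Xref name                  */
    uid_t    owner;                /* owner ID                           */
    gid_t    group;                /* group ID                           */
    mode_t   permissions;          /* access permissions                 */
    time_t   ctime, atime, mtime;  /* creation, access, modification     */
    uint32_t num_cells;            /* number of cells in the file        */
    uint32_t bsu_size;             /* BSU size in bytes                  */
    uint16_t base_io_node;         /* base I/O node used by the file     */
    uint16_t high_io_node;         /* highest numbered I/O node used     */
    uint8_t  status;               /* current file status                */

    /* The small fields called out on the slide: */
    unsigned int uniquifier : 7;   /* distinguishes objects with different names */
    unsigned int is_xref    : 1;   /* 1 = Xref, 0 = file                  */
    unsigned int level      : 8;   /* numbers the cells of a file         */
};

int main(void)
{
    printf("one object-table entry occupies %zu bytes in this sketch\n",
           sizeof(struct vesta_object));
    return 0;
}
```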
Attaching and opening
- Attach: the file is attached to the application; this accesses the metadata to get parameters such as the base and maximal I/O nodes, the number of cells, and the BSU size
- Open a subfile: a call to the open function sets the partitioning parameters that define which subfile is being accessed (see the sketch below)
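A minimal sketch of the information flow implied by these two steps: what a process learns at attach time and what it supplies at open time. All struct and field names here are hypothetical illustrations, not Vesta's actual API; the Hbs/Hn names follow earlier slides and the Vbs/Vn/position fields are assumed analogues.

```c
#include <stdio.h>
#include <stdint.h>

/* What a compute-node process learns when it attaches a file
   (hypothetical struct; the fields follow the slide). */
struct attach_info {
    uint32_t base_io_node;   /* base I/O node used by the file */
    uint32_t max_io_node;    /* maximal I/O node used          */
    uint32_t num_cells;      /* number of cells                */
    uint32_t bsu_size;       /* BSU size in bytes              */
};

/* What the process supplies when it opens a subfile: partitioning
   parameters that select the disjoint piece it will access
   (hypothetical field names). */
struct open_params {
    uint32_t Hbs, Hn, Hpos;  /* horizontal block size, group count, my group */
    uint32_t Vbs, Vn, Vpos;  /* vertical block size, group count, my group   */
};

int main(void)
{
    /* Step 1: attach yields the file's fixed structure (example values). */
    struct attach_info info = { 0, 3, 8, 1024 };

    /* Step 2: open names the subfile this process will work on. */
    struct open_params mine = { 2, 4, 1, 1, 1, 0 };

    printf("attached: %u cells, BSU %u bytes, I/O nodes %u..%u\n",
           info.num_cells, info.bsu_size, info.base_io_node, info.max_io_node);
    printf("opened subfile (H group %u of %u, V group %u of %u)\n",
           mine.Hpos, mine.Hn, mine.Vpos, mine.Vn);
    return 0;
}
```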
Directory Structure
- Vesta files are accessed directly by hashing their pathnames, so directories are not needed to find files
- To make it easy for users to organize their files, a hierarchical structure of directories is created using Xrefs
- Xrefs simply contain lists of the internal IDs of files and other Xrefs (see the sketch below)
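A minimal sketch of locating a file's metadata by hashing its pathname rather than walking directories. The hash function (FNV-1a here) and the mapping of hash values onto the logically numbered I/O nodes are assumptions for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative string hash (FNV-1a); Vesta's actual hash is not specified here. */
static uint64_t hash_path(const char *path)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *path; path++) {
        h ^= (uint8_t)*path;
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main(void)
{
    const int num_io_nodes = 8;              /* example: nodes numbered 0..7 */
    const char *path = "/project/run1/data"; /* example pathname             */

    /* The hash picks the I/O node (and table slot) holding the file's
       metadata, with no directory lookup along the way. */
    uint64_t h = hash_path(path);
    printf("%s -> metadata on I/O node %llu\n",
           path, (unsigned long long)(h % num_io_nodes));
    return 0;
}
```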
Access to File Data
- Access is done by providing a byte offset and a byte count
- Vesta does not have a separate seek function
- File data is not cached on the compute nodes
- Three mechanisms for reducing access latency:
- Use of buffer caches on the I/O nodes
- Asynchronous I/O operations
- Explicit prefetch and flush operations
Sharing
- Vesta supports sharing in two main ways
- Partition the file into disjoint subfiles that can be accessed with no synchronization among the sharing processes
- Share a subfile
- Each process can have an independent file pointer into the shared subfile
- The processes can share a single pointer
- When an application process opens a subfile for the first time, it gets a local, private pointer
- When a pointer is shared, a random I/O node is chosen, and the pointer is moved to that I/O node. The identity of this node and the pointer's ID on that node are passed to all processes that share its use. When a data access based on a shared pointer is performed, the accessing node first communicates with the I/O node holding the pointer. The current pointer value is returned to the accessing node (see the sketch below).
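A minimal sketch of the shared-pointer exchange described above, with the I/O-node side collapsed to a single function. The message passing, pointer IDs, and the assumption that the holding node advances the pointer past each access are simplifications.

```c
#include <stdio.h>

/* State kept at the I/O node chosen to hold a shared pointer. */
struct shared_pointer {
    int  id;        /* pointer ID on that node, known to all sharers */
    long offset;    /* current value of the shared file pointer      */
};

/* The holding I/O node returns the current value and (assumed here)
   advances the pointer past the requested access, so concurrent sharers
   obtain disjoint ranges.  In Vesta this is a message exchange; here it
   is collapsed into a local call. */
static long claim_range(struct shared_pointer *p, long count)
{
    long start = p->offset;
    p->offset += count;
    return start;
}

int main(void)
{
    struct shared_pointer p = { 7, 0 };   /* example pointer, starts at offset 0 */

    /* Two processes each read 4096 bytes through the shared pointer. */
    printf("process A reads at offset %ld\n", claim_range(&p, 4096));
    printf("process B reads at offset %ld\n", claim_range(&p, 4096));
    return 0;
}
```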
Concurrency Control
- Concurrency control is needed when
- Processes write data to a shared subfile
- Processes access overlapping subfiles using independent offsets
- An application interleaves file metadata operations, which also affect the file data
- One application writes a file while others read it
- Vesta uses a fast token-passing mechanism among the I/O nodes to guarantee the atomicity of requests that span multiple I/O nodes, and to provide sequential consistency and linearizability among requests
- When the token reaches the last I/O node, it sends an acknowledgement to the requesting compute node
Concurrency Control (cont.)
- Each I/O node maintains a set of 64 token buckets, each with an in counter and an out counter
- Each file is assigned to one bucket of the set
- When a token is sent, the out counter is incremented
- When a node receives a token, it first tries to match the token's value with the value of the bucket's in counter. Tokens that do not match are delayed until the tokens that should be processed before them arrive and increment the in counter (see the sketch below).
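A minimal sketch of the in/out counters and the ordering check described above. The queueing of delayed tokens is simplified to a retry, and the token's carried value is assumed to be the sender's out-counter reading.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BUCKETS 64   /* each I/O node keeps a set of 64 token buckets */

/* One token bucket: "out" counts tokens this node has sent for the bucket,
   "in" counts tokens it has accepted, in order. */
struct bucket {
    unsigned long in;
    unsigned long out;
};

/* Sender side: stamp the token with the current out counter, then bump it. */
static unsigned long send_token(struct bucket *sender)
{
    return sender->out++;
}

/* Receiver side: a token is accepted only if its value matches the bucket's
   in counter; otherwise it is delayed until earlier tokens have arrived and
   advanced the counter. */
static bool try_accept_token(struct bucket *receiver, unsigned long token_value)
{
    if (token_value != receiver->in)
        return false;            /* out of order: delay this token */
    receiver->in++;              /* in order: accept and advance   */
    return true;
}

int main(void)
{
    /* The same bucket index on two I/O nodes (the one this file maps to). */
    struct bucket on_node_a = { 0, 0 };
    struct bucket on_node_b = { 0, 0 };

    unsigned long t0 = send_token(&on_node_a);
    unsigned long t1 = send_token(&on_node_a);

    /* Deliver the tokens to node B out of order. */
    printf("t1 accepted? %d\n", try_accept_token(&on_node_b, t1));  /* 0: delayed */
    printf("t0 accepted? %d\n", try_accept_token(&on_node_b, t0));  /* 1          */
    printf("t1 accepted? %d\n", try_accept_token(&on_node_b, t1));  /* 1          */
    return 0;
}
```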
Structures for Storing Data
- Block lists for cells are maintained at the I/O nodes
- All I/O node metadata, including the block lists, is pinned in memory
- The block list of each cell is organized as a 16-ary tree (see the sketch below)
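A minimal sketch of looking up one of a cell's blocks in a 16-ary tree. The node layout, the fixed tree depth, and the meaning of an absent child are assumptions made for illustration.

```c
#include <stdio.h>

#define FANOUT 16   /* each tree node has 16 children or 16 block entries */

/* A node of the per-cell block list.  Leaves hold disk block numbers;
   interior nodes hold pointers to children.  Layout is illustrative. */
struct blocklist_node {
    int is_leaf;
    union {
        long                   blocks[FANOUT];    /* leaf: disk block numbers */
        struct blocklist_node *child[FANOUT];     /* interior: subtrees       */
    } u;
};

/* Find the disk block holding logical block `index` of a cell by walking
   the 16-ary tree one base-16 digit at a time, most significant first. */
static long lookup_block(const struct blocklist_node *root, long index, int depth)
{
    const struct blocklist_node *n = root;
    for (int level = depth - 1; level > 0; level--) {
        long digit = (index >> (4 * level)) & 0xF;
        n = n->u.child[digit];
        if (n == NULL)
            return -1;                  /* block not allocated yet */
    }
    return n->u.blocks[index & 0xF];
}

int main(void)
{
    /* Build a tiny 2-level tree: root -> one leaf covering blocks 0..15. */
    struct blocklist_node leaf = { 1, { .blocks = { [3] = 4242 } } };
    struct blocklist_node root = { 0, { .child  = { [0] = &leaf } } };

    printf("logical block 3 -> disk block %ld\n", lookup_block(&root, 3, 2));
    return 0;
}
```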
Conclusion
- Vesta is a new approach to parallel I/O file systems
- The basis of this approach is the 2-d structure of Vesta files: one dimension represents the parallelism and the other represents sequential data
- Vesta introduces the notion of partitioning the data
- Vesta is fully implemented on an IBM SP1 multicomputer, using the EUI-H message-passing library and the MPX job control facility
- Vesta is the base technology for the AIX Parallel I/O File System used with the IBM SP2
Questions
- What is the 2-dimensional structure of Vesta files?
- What is the key feature of the Vesta Parallel File System?
- What mechanism does the Vesta file system use to control concurrency?