Title: The Vesta Parallel File System
The Vesta Parallel File System
- Peter F. Corbett and Dror G. Feitelson
Outline
- Introduction
- Motivation and Design Guidelines
- Abstractions and Interface
- Implementation
- Conclusion
Introduction
- The Vesta parallel file system
- Runs under AIX on the IBM SP2
- Designed to provide parallel file access
- Can achieve high efficiency on parallel I/O hardware
- Deals exclusively with persistent on-line storage of files, particularly those that must be accessed by parallel applications
Introduction (cont.)
- The Vesta approach
- Introduces a new abstraction of parallel files, by which application programmers can express the required partitioning of file data among the processes of a parallel application
- Reduces the need for synchronization and concurrency control, and allows for a more streamlined implementation
- Provides explicit control over the way data is distributed across the I/O nodes, and allows the distribution to be tailored for the expected access patterns
Motivation and Design Guidelines
- Motivation
- In existing systems, users can create distributed files but do not have full control over the mapping of data to disks
- Design Guidelines
- Parallelism
- Scalability
- Layering
- Providing commonly expected services
Simple striping method to get a parallel view
- Simple striping technique: assuming that the number of I/O nodes is N, block i of the file is located on I/O node i mod N (see the sketch below)
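A minimal sketch of this round-robin striping rule; the block size and node count below are arbitrary example values, not taken from Vesta itself.

```c
#include <stdio.h>

/* Round-robin striping: block i of the file lands on I/O node i mod N. */
static int block_to_node(long block, int num_io_nodes)
{
    return (int)(block % num_io_nodes);
}

int main(void)
{
    const int N = 4;                 /* example: 4 I/O nodes          */
    const long block_size = 65536;   /* example block size in bytes   */
    long offset = 1000000;           /* an arbitrary file offset      */

    long block = offset / block_size;
    printf("offset %ld -> block %ld -> I/O node %d\n",
           offset, block, block_to_node(block, N));
    return 0;
}
```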
Method of Vesta to get a parallel view
- Two steps
- Abstract away from a direct dependency on the number of I/O nodes
- Allow a variety of partitioned views of the data, in addition to partitioning according to the physical distribution of data to the I/O nodes
- All these parallel views partition the file into disjoint subfiles that are typically accessed by different processes of a parallel application
- Guarantee that the accesses by the different processes are non-overlapping at the byte level
- Allow each process to access its data directly
Cell abstraction of Vesta
- Abstracting away from I/O nodes is done by introducing the notion of cells
- Cells can be thought of as containers where data can be deposited
- When a file is created, the number of cells is given as a parameter
- If the number of cells is no more than the number of I/O nodes, then each cell will reside on a different I/O node
- If there are more cells than I/O nodes, the cells will be distributed to the I/O nodes in a round-robin manner (see the sketch below)
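A minimal sketch of this placement rule. Starting the round-robin assignment at a per-file base node is an assumption for illustration (the metadata slide later mentions a base I/O node); the node and cell counts are example values.

```c
#include <stdio.h>

/* Cells are spread over the I/O nodes round-robin, starting at a base node.
   If there are no more cells than nodes, each cell gets its own node. */
static int cell_to_io_node(int cell, int base_node, int num_io_nodes)
{
    return (base_node + cell % num_io_nodes) % num_io_nodes;
}

int main(void)
{
    const int num_io_nodes = 4;  /* example system with 4 I/O nodes   */
    const int num_cells = 6;     /* more cells than nodes             */
    const int base_node = 1;     /* assumed starting node for the file */

    for (int c = 0; c < num_cells; c++)
        printf("cell %d -> I/O node %d\n",
               c, cell_to_io_node(c, base_node, num_io_nodes));
    return 0;
}
```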
2-d structure of Vesta
- 2-dimensional structure
- The cell dimension (horizontal) specifies the parallelism in accessing the data
- Data within the cells (vertical)
- The data in each cell is viewed as a sequence of basic striping units (BSUs)
- The BSU size can be an arbitrary number of bytes, and should be chosen to reflect the minimal unit of data access (see the sketch below)
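A minimal sketch of the two-dimensional addressing this implies: a byte of the file is located by a cell index (horizontal) and a position within that cell's sequence of BSUs (vertical). The BSU size used here is an arbitrary example.

```c
#include <stdio.h>

/* A byte within one cell is addressed by which BSU it falls in and
   where it sits inside that BSU. */
struct bsu_addr {
    long bsu;            /* index of the BSU within the cell */
    long offset_in_bsu;  /* byte offset inside that BSU      */
};

static struct bsu_addr locate_in_cell(long byte_in_cell, long bsu_size)
{
    struct bsu_addr a;
    a.bsu = byte_in_cell / bsu_size;
    a.offset_in_bsu = byte_in_cell % bsu_size;
    return a;
}

int main(void)
{
    const long bsu_size = 1024;   /* example BSU size in bytes */
    struct bsu_addr a = locate_in_cell(5000, bsu_size);
    printf("byte 5000 of a cell -> BSU %ld, offset %ld\n",
           a.bsu, a.offset_in_bsu);
    return 0;
}
```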
Two parameters define the structure
- The number of cells
- The BSU size
- The two parameters are defined when the file is created, and cannot be changed thereafter
- Attach -- a new call introduced for this purpose
- Every process in the application must attach every file before it can open the file
Partition files for parallel access
- Define the template of Vesta subfiles
- Define the block size used to distribute the data
- Data decomposition scheme (see the sketch below)
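A sketch of one plausible block-cyclic decomposition along these lines. The Hbs/Hn names follow the next slide; Vbs/Vn are assumed analogous vertical parameters, and the formula itself is an illustration rather than the exact Vesta scheme.

```c
#include <stdio.h>

/* One plausible block-cyclic decomposition of the 2-d (cell x BSU) structure
   into Hn x Vn disjoint subfiles.  Hbs/Hn follow the names used on the
   "Handling awkward cases" slide; Vbs/Vn are assumed vertical analogues.
   This is an illustration, not the exact Vesta formula. */
struct partition {
    long Hbs, Hn;   /* horizontal: block size in cells, number of groups */
    long Vbs, Vn;   /* vertical: block size in BSUs, number of groups    */
};

/* Which subfile (h, v) owns the BSU at (cell, bsu_row)? */
static void owner(const struct partition *p, long cell, long bsu_row,
                  long *h, long *v)
{
    *h = (cell / p->Hbs) % p->Hn;      /* block-cyclic over cells */
    *v = (bsu_row / p->Vbs) % p->Vn;   /* block-cyclic over BSUs  */
}

int main(void)
{
    struct partition p = { 2, 3, 1, 2 };   /* example parameters */
    long h, v;
    owner(&p, 5, 4, &h, &v);
    printf("cell 5, BSU row 4 -> subfile (%ld, %ld)\n", h, v);
    return 0;
}
```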
Handling awkward cases
- Ghost cells: extra cells are added to make the total number of cells a multiple of Hbs × Hn
- Ghost cells have no effect on reading and writing
- Holes: cells with different lengths leave a hole in the middle of a cell
- Writing to a hole causes it to be filled with valid data
- Call the Vesta stat function to find how much data is contained in the whole file
Data ordering
Features of the Vesta system
- Key feature: the capability to perform direct access from a compute node to an I/O node without referencing any centralized metadata
- The form of the abstraction
- The 2-d structure of BSUs within cells
- The interface used to access the abstraction
- Partitioning is also an innovative feature
- The partitioning is defined in advance, and then processes can perform independent accesses to any part of their partition (subfile)
Implementation
- Dedicated I/O nodes are created
- A client library is linked with application code running on the compute nodes
- A server runs on the I/O nodes
- Direct access from a compute node to an I/O node is achieved
- Metadata is distributed among all the I/O nodes
- The target I/O nodes can be identified using a combination of the metadata, the partitioning parameters, and the offset and count of the data access
Access to Metadata
- Vesta objects: files, cells, and Xrefs
- Each I/O node maintains the Vesta objects in a memory-mapped table
- The I/O nodes are logically numbered
- Each entry in the table contains information such as the file name, its owner ID, group and access permissions, creation, access, and last modification times, the number of cells, the BSU size, the base and highest numbered I/O nodes used, and the current file status
- A 7-bit uniquifier field distinguishes two files or Xrefs with different names
- A 1-bit field distinguishes files from Xrefs
- An 8-bit level field is used to number the cells of a file (a sketch of one possible entry layout follows below)
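A minimal sketch of what one entry of such a memory-mapped object table might look like. The bit widths come from the slide; all field names, types, and the overall layout are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical layout of one entry in the per-I/O-node object table.
   Only the three bit widths are taken from the slide; everything else
   is an assumed illustration. */
struct vesta_object {
    char     name[256];            /* file or Xref name                  */
    uid_t    owner;                /* owner ID                           */
    gid_t    group;                /* group ID                           */
    mode_t   permissions;          /* access permissions                 */
    time_t   ctime, atime, mtime;  /* creation, access, modification     */
    uint32_t num_cells;            /* number of cells in the file        */
    uint32_t bsu_size;             /* BSU size in bytes                  */
    uint16_t base_io_node;         /* base I/O node used by the file     */
    uint16_t high_io_node;         /* highest numbered I/O node used     */
    uint8_t  status;               /* current file status                */

    /* The small fields called out on the slide: */
    unsigned int uniquifier : 7;   /* distinguishes objects with different names */
    unsigned int is_xref    : 1;   /* 1 = Xref, 0 = file                  */
    unsigned int level      : 8;   /* numbers the cells of a file         */
};

int main(void)
{
    printf("one object-table entry occupies %zu bytes in this sketch\n",
           sizeof(struct vesta_object));
    return 0;
}
```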
Attaching and opening
- Attach: the file is attached to the application; this accesses the metadata to get parameters such as the base and maximal I/O nodes, the number of cells, and the BSU size
- Open a subfile: a call to the open function sets the partitioning parameters that define which subfile is being accessed (see the sketch below)
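A minimal sketch of the information flow implied by these two steps: what a process learns at attach time and what it supplies at open time. All struct and field names here are hypothetical illustrations, not Vesta's actual API; the Hbs/Hn names follow earlier slides and the Vbs/Vn/position fields are assumed analogues.

```c
#include <stdio.h>
#include <stdint.h>

/* What a compute-node process learns when it attaches a file
   (hypothetical struct; the fields follow the slide). */
struct attach_info {
    uint32_t base_io_node;   /* base I/O node used by the file */
    uint32_t max_io_node;    /* maximal I/O node used          */
    uint32_t num_cells;      /* number of cells                */
    uint32_t bsu_size;       /* BSU size in bytes              */
};

/* What the process supplies when it opens a subfile: partitioning
   parameters that select the disjoint piece it will access
   (hypothetical field names). */
struct open_params {
    uint32_t Hbs, Hn, Hpos;  /* horizontal block size, group count, my group */
    uint32_t Vbs, Vn, Vpos;  /* vertical block size, group count, my group   */
};

int main(void)
{
    /* Step 1: attach yields the file's fixed structure (example values). */
    struct attach_info info = { 0, 3, 8, 1024 };

    /* Step 2: open names the subfile this process will work on. */
    struct open_params mine = { 2, 4, 1, 1, 1, 0 };

    printf("attached: %u cells, BSU %u bytes, I/O nodes %u..%u\n",
           info.num_cells, info.bsu_size, info.base_io_node, info.max_io_node);
    printf("opened subfile (H group %u of %u, V group %u of %u)\n",
           mine.Hpos, mine.Hn, mine.Vpos, mine.Vn);
    return 0;
}
```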
Directory Structure
- Vesta files are accessed directly by hashing their pathnames, so directories are not needed to find files
- To make it easy for users to organize their files, a hierarchical structure of directories is created using Xrefs
- Xrefs simply contain lists of the internal IDs of files and other Xrefs (see the sketch below)
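A minimal sketch of locating a file's metadata by hashing its pathname rather than walking directories. The hash function (FNV-1a here) and the mapping of hash values onto the logically numbered I/O nodes are assumptions for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative string hash (FNV-1a); Vesta's actual hash is not specified here. */
static uint64_t hash_path(const char *path)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (; *path; path++) {
        h ^= (uint8_t)*path;
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main(void)
{
    const int num_io_nodes = 8;              /* example: nodes numbered 0..7 */
    const char *path = "/project/run1/data"; /* example pathname             */

    /* The hash picks the I/O node (and table slot) holding the file's
       metadata, with no directory lookup along the way. */
    uint64_t h = hash_path(path);
    printf("%s -> metadata on I/O node %llu\n",
           path, (unsigned long long)(h % num_io_nodes));
    return 0;
}
```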
Access to File Data
- Access is done by providing a byte offset and a byte count
- Vesta does not have a separate seek function
- File data is not cached on the compute nodes
- Three mechanisms for reducing access latency:
- Use of buffer caches on the I/O nodes
- Asynchronous I/O operations
- Explicit prefetch and flush operations
Sharing
- Vesta supports sharing in two main ways
- Partition the file into disjoint subfiles that can be accessed with no synchronization among the sharing processes
- Share a subfile
- Each process can have an independent file pointer into the shared subfile
- The processes can share a single pointer
- When an application process opens a subfile for the first time, it gets a local, private pointer
- When a pointer is shared, a random I/O node is chosen, and the pointer is moved to that I/O node. The identity of this node and the pointer's ID on that node are passed to all processes that share its use. When a data access based on a shared pointer is performed, the accessing node first communicates with the I/O node holding the pointer. The current pointer value is returned to the accessing node (see the sketch below).
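A minimal sketch of the shared-pointer exchange described above, with the I/O-node side collapsed to a single function. The message passing, pointer IDs, and the assumption that the holding node advances the pointer past each access are simplifications.

```c
#include <stdio.h>

/* State kept at the I/O node chosen to hold a shared pointer. */
struct shared_pointer {
    int  id;        /* pointer ID on that node, known to all sharers */
    long offset;    /* current value of the shared file pointer      */
};

/* The holding I/O node returns the current value and (assumed here)
   advances the pointer past the requested access, so concurrent sharers
   obtain disjoint ranges.  In Vesta this is a message exchange; here it
   is collapsed into a local call. */
static long claim_range(struct shared_pointer *p, long count)
{
    long start = p->offset;
    p->offset += count;
    return start;
}

int main(void)
{
    struct shared_pointer p = { 7, 0 };   /* example pointer, starts at offset 0 */

    /* Two processes each read 4096 bytes through the shared pointer. */
    printf("process A reads at offset %ld\n", claim_range(&p, 4096));
    printf("process B reads at offset %ld\n", claim_range(&p, 4096));
    return 0;
}
```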
Concurrency Control
- Concurrency control is needed when
- Processes write data to a shared subfile
- Processes access overlapping subfiles using independent offsets
- An application interleaves file metadata operations, which also affect the file data
- One application writes a file while others read it
- Vesta uses a fast token-passing mechanism among the I/O nodes to guarantee the atomicity of requests that span multiple I/O nodes, and to provide sequential consistency and linearizability among requests
- When the token reaches the last I/O node, it sends an acknowledgement to the requesting compute node
Concurrency Control (cont.)
- Each I/O node maintains a set of 64 token buckets, each with an in counter and an out counter
- Each file is assigned to one bucket of the set
- When a token is sent, the out counter is incremented
- When a node receives a token, it first tries to match the token's value with the value of the bucket's in counter. Tokens that do not match are delayed until the tokens that should be processed before them arrive and increment the in counter (see the sketch below).
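A minimal sketch of the in/out counters and the ordering check described above. The queueing of delayed tokens is simplified to a retry, and the token's carried value is assumed to be the sender's out-counter reading.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BUCKETS 64   /* each I/O node keeps a set of 64 token buckets */

/* One token bucket: "out" counts tokens this node has sent for the bucket,
   "in" counts tokens it has accepted, in order. */
struct bucket {
    unsigned long in;
    unsigned long out;
};

/* Sender side: stamp the token with the current out counter, then bump it. */
static unsigned long send_token(struct bucket *sender)
{
    return sender->out++;
}

/* Receiver side: a token is accepted only if its value matches the bucket's
   in counter; otherwise it is delayed until earlier tokens have arrived and
   advanced the counter. */
static bool try_accept_token(struct bucket *receiver, unsigned long token_value)
{
    if (token_value != receiver->in)
        return false;            /* out of order: delay this token */
    receiver->in++;              /* in order: accept and advance   */
    return true;
}

int main(void)
{
    /* The same bucket index on two I/O nodes (the one this file maps to). */
    struct bucket on_node_a = { 0, 0 };
    struct bucket on_node_b = { 0, 0 };

    unsigned long t0 = send_token(&on_node_a);
    unsigned long t1 = send_token(&on_node_a);

    /* Deliver the tokens to node B out of order. */
    printf("t1 accepted? %d\n", try_accept_token(&on_node_b, t1));  /* 0: delayed */
    printf("t0 accepted? %d\n", try_accept_token(&on_node_b, t0));  /* 1          */
    printf("t1 accepted? %d\n", try_accept_token(&on_node_b, t1));  /* 1          */
    return 0;
}
```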
Structures for Storing Data
- Block lists for cells are maintained at the I/O nodes
- All I/O node metadata, including the block lists, is pinned in memory
- The block list of each cell is organized as a 16-ary tree (see the sketch below)
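A minimal sketch of looking up one of a cell's blocks in a 16-ary tree. The node layout, the fixed tree depth, and the meaning of an absent child are assumptions made for illustration.

```c
#include <stdio.h>

#define FANOUT 16   /* each tree node has 16 children or 16 block entries */

/* A node of the per-cell block list.  Leaves hold disk block numbers;
   interior nodes hold pointers to children.  Layout is illustrative. */
struct blocklist_node {
    int is_leaf;
    union {
        long                   blocks[FANOUT];    /* leaf: disk block numbers */
        struct blocklist_node *child[FANOUT];     /* interior: subtrees       */
    } u;
};

/* Find the disk block holding logical block `index` of a cell by walking
   the 16-ary tree one base-16 digit at a time, most significant first. */
static long lookup_block(const struct blocklist_node *root, long index, int depth)
{
    const struct blocklist_node *n = root;
    for (int level = depth - 1; level > 0; level--) {
        long digit = (index >> (4 * level)) & 0xF;
        n = n->u.child[digit];
        if (n == NULL)
            return -1;                  /* block not allocated yet */
    }
    return n->u.blocks[index & 0xF];
}

int main(void)
{
    /* Build a tiny 2-level tree: root -> one leaf covering blocks 0..15. */
    struct blocklist_node leaf = { 1, { .blocks = { [3] = 4242 } } };
    struct blocklist_node root = { 0, { .child  = { [0] = &leaf } } };

    printf("logical block 3 -> disk block %ld\n", lookup_block(&root, 3, 2));
    return 0;
}
```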
Conclusion
- Vesta is a new approach to parallel I/O file systems
- The basis of this approach is the 2-d structure of Vesta files: one dimension represents the parallelism and the other represents sequential data
- Vesta introduces the notion of partitioning the data
- Vesta is fully implemented on an IBM SP1 multicomputer, using the EUI-H message-passing library and the MPX job control facility
- Vesta is the base technology for the AIX Parallel I/O File System used with the IBM SP2
Questions
- What is the 2-dimensional structure of Vesta files?
- What is the key feature of the Vesta Parallel File System?
- What mechanism does the Vesta file system use to control concurrency?