Title: Parallel and Grid I/O Infrastructure
1. Parallel and Grid I/O Infrastructure
- Rob Ross, Argonne National Lab
- Parallel Disk Access and Grid I/O (P4)
- SDM All Hands Meeting
- March 26, 2002
2. Participants
- Argonne National Laboratory
- Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan
- Northwestern University
- Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li
- Collaborators
- Lawrence Livermore National Laboratory
- Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow
- Application groups
3. Focus Areas in Project
- Parallel I/O on clusters
- Parallel Virtual File System (PVFS)
- MPI-IO hints
- ROMIO MPI-IO implementation
- Grid I/O
- Linking PVFS and ROMIO with Grid I/O components
- Application interfaces
- NetCDF and HDF5
- Everything is interconnected!
- Wei-keng Liao will drill down into specific tasks
4. Parallel Virtual File System
- Lead developer: R. Ross (ANL)
- R. Latham (ANL), developer
- A. Ching, K. Coloma (NWU), collaborators
- Open source, scalable parallel file system
- Project began in the mid-90s at Clemson University
- Now a collaboration between Clemson and ANL
- Successes
- In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, ...)
- 100 unique downloads/month
- 160 users on mailing list, 90 on developers list
- Multiple gigabyte/second performance shown
5. Keeping PVFS Relevant: PVFS2
- Scaling to thousands of clients and hundreds of servers requires some design changes
- Distributed metadata
- New storage formats
- Improved fault tolerance
- New technology, new features
- High-performance networking (e.g. InfiniBand, VIA)
- Application metadata
- New design and implementation warranted (PVFS2)
6. PVFS1, PVFS2, and SDM
- Maintaining PVFS1 as a resource to the community
- Providing support, bug fixes
- Encouraging use by application groups
- Adding functionality to improve performance (e.g. tiled display)
- Implementing next-generation parallel file system
- Basic infrastructure for future PFS work
- New physical distributions (e.g. chunking)
- Application metadata storage
- Ensuring that a working parallel file system will continue to be available on clusters as they scale
7. Data Staging for Tiled Display
- Contact: Joe Insley (ANL)
- Commodity components
- Projectors, PCs
- Provide very high resolution visualization
- Staging application preprocesses frames into a tile stream for each visualization node
- Uses MPI-IO to access data from PVFS file system
- Streams of tiles are merged into movie files on visualization nodes
- End goal is to display frames directly from PVFS
- Enhancing PVFS and ROMIO to improve performance
8. Example Tile Layout
- 3x2 display, 6 readers
- Frame size is 2532x1408 pixels
- Tile size is 1024x768 pixels (overlapped)
- Movies broken into frames, with each frame stored in its own file in PVFS
- Readers pull data from PVFS and send to display
9. Tested Access Patterns
- Subtile
- Each reader grabs a piece of a tile
- Small noncontiguous accesses
- Lots of accesses for a frame
- Tile
- Each reader grabs a whole tile
- Larger noncontiguous accesses
- Six accesses for a frame
- Reading individual pieces is simply too slow
10. Noncontiguous Access in ROMIO
- ROMIO performs data sieving to cut down the number of I/O operations
- Uses large reads which grab multiple noncontiguous pieces
- Example: reading tile 1
11. Noncontiguous Access in PVFS
- ROMIO data sieving
- Works for all file systems (just uses contiguous read)
- Reads extra data (three times the desired amount)
- Noncontiguous access primitive allows requesting just the desired bytes (A. Ching, NWU)
- Support in ROMIO allows transparent use of the new optimization (K. Coloma, NWU)
- PVFS and ROMIO support implemented (see the sketch below)
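The C sketch below is illustrative only (not the tiled display reader itself): it shows how an MPI-IO client can describe one tile of a frame as an MPI subarray type and set a file view, which is exactly the kind of noncontiguous request that data sieving or the PVFS list I/O primitive ends up servicing. The frame and tile dimensions come from the example layout above; the file name, the 3-byte RGB pixel size, and the fixed tile origin are assumptions.

    /* Illustrative sketch: one MPI process reads one 1024x768 tile out of a
     * 2532x1408-pixel frame file using an MPI-IO file view, so ROMIO sees the
     * noncontiguous request as a single read. Assumes 3 bytes (RGB) per pixel
     * and a fixed tile origin; error handling is omitted. */
    #include <mpi.h>

    #define FRAME_W 2532
    #define FRAME_H 1408
    #define TILE_W  1024
    #define TILE_H  768

    int main(int argc, char **argv)
    {
        MPI_Datatype pixel, tile;
        MPI_File fh;
        static unsigned char buf[TILE_H][TILE_W][3];

        MPI_Init(&argc, &argv);

        /* 3-byte RGB pixel as the element type */
        MPI_Type_contiguous(3, MPI_BYTE, &pixel);
        MPI_Type_commit(&pixel);

        /* tile = TILE_H x TILE_W subarray of the FRAME_H x FRAME_W frame;
           starts[] would normally be computed from the reader's tile position */
        int sizes[2]    = { FRAME_H, FRAME_W };
        int subsizes[2] = { TILE_H, TILE_W };
        int starts[2]   = { 0, 0 };
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, pixel, &tile);
        MPI_Type_commit(&tile);

        MPI_File_open(MPI_COMM_WORLD, "frame0000.raw",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, pixel, tile, "native", MPI_INFO_NULL);

        /* Collective read of the whole tile; ROMIO may service this with data
           sieving or, on PVFS, with the noncontiguous (list I/O) primitive. */
        MPI_File_read_all(fh, buf, TILE_H * TILE_W, pixel, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&tile);
        MPI_Type_free(&pixel);
        MPI_Finalize();
        return 0;
    }
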
12. Metadata in File Systems
- Associative arrays of information related to a file
- Seen in other file systems (MacOS, BeOS, ReiserFS)
- Some potential uses
- Ancillary data (from applications)
- Derived values
- Thumbnail images
- Execution parameters
- I/O library metadata
- Block layout information
- Attributes on variables
- Attributes of dataset as a whole
- Headers
- Keeps header out of data stream
- Eliminates need for alignment in libraries
13. Metadata and PVFS2 Status
- Prototype metadata storage for PVFS2 implemented
- R. Ross (ANL)
- Uses Berkeley DB for storage of keyword/value pairs (see the sketch below)
- Need to investigate how to interface to MPI-IO
- Other components of PVFS2 coming along
- Networking in testing (P. Carns, Clemson)
- Client side API under development (Clemson)
- PVFS2 beta early fourth quarter?
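As a rough illustration of the keyword/value storage mentioned above (not the PVFS2 prototype code), the sketch below stores a single attribute in a Berkeley DB table. The function name meta_put and the database file argument are hypothetical, and the call sequence assumes the Berkeley DB 4.x C API.

    /* Illustrative sketch: store one keyword/value metadata attribute in a
     * Berkeley DB b-tree. meta_put() and the database file are hypothetical. */
    #include <string.h>
    #include <db.h>

    int meta_put(const char *dbfile, const char *keyword,
                 const void *val, size_t len)
    {
        DB *dbp;
        DBT key, data;
        int ret;

        if ((ret = db_create(&dbp, NULL, 0)) != 0)
            return ret;
        if ((ret = dbp->open(dbp, NULL, dbfile, NULL,
                             DB_BTREE, DB_CREATE, 0644)) != 0) {
            dbp->close(dbp, 0);
            return ret;
        }

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *) keyword;
        key.size  = strlen(keyword) + 1;
        data.data = (void *) val;
        data.size = len;

        ret = dbp->put(dbp, NULL, &key, &data, 0);   /* keyword -> value */
        dbp->close(dbp, 0);
        return ret;
    }
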
14. ROMIO MPI-IO Implementation
- Written by R. Thakur (ANL)
- R. Ross and R. Latham (ANL), developers
- K. Coloma (NWU), collaborator
- Implementation of MPI-2 I/O specification
- Operates on a wide variety of platforms
- Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
- Successes
- Adopted by industry (e.g. Compaq, HP, SGI)
- Used at ASCI sites (e.g. LANL Blue Mountain)
15. ROMIO Current Directions
- Support for PVFS noncontiguous requests
- K. Coloma (NWU)
- Hints: key to efficient use of HW/SW components
- Collective I/O
- Aggregation (synergy)
- Performance portability
- Controlling ROMIO Optimizations
- Access patterns
- Grid I/O
- Scalability
- Parallel I/O benchmarking
16. ROMIO Aggregation Hints
- Part of ASCI Software Pathforward project
- Contact: Gary Grider (LANL)
- Implementation by R. Ross, R. Latham (ANL)
- Hints control which processes do I/O in collectives (see the sketch below)
- Examples
- All processes on same node as attached storage
- One process per host
- Additionally limit the number of processes that open the file
- Good for systems without a shared FS (e.g. O2K clusters)
- More scalable
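A minimal sketch of what the aggregation hints look like from the application side, assuming ROMIO's documented collective-buffering hint names (cb_nodes, cb_config_list); exact names and accepted values can vary by ROMIO version, and the output file name is a placeholder.

    /* Illustrative sketch: restrict collective I/O to one aggregator per host
     * and cap the number of aggregators via ROMIO's collective-buffering hints. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "4");          /* at most 4 aggregators */
        MPI_Info_set(info, "cb_config_list", "*:1");  /* one aggregator per host */

        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective writes here are funneled through the aggregators ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }
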
17. Aggregation Example
- Cluster of SMPs
- Only one SMP box has connection to disks
- Data is aggregated to processes on single box
- Processes on that box perform I/O on behalf of the others
18. Optimization Hints
- MPI-IO calls should be chosen to best describe the I/O taking place
- Use of file views
- Collective calls for inherently collective operations
- Unfortunately, sometimes choosing the right calls can result in lower performance
- Allow application programmers to tune ROMIO with hints rather than using different MPI-IO calls
- Avoid the misapplication of optimizations (aggregation, data sieving)
19. Optimization Problems
- ROMIO checks for applicability of the two-phase optimization when collective I/O is used
- With the tiled display application using subtile access, this optimization is never used
- Checking for applicability requires communication between processes
- Results in a 33% drop in throughput (on test system)
- A hint that tells ROMIO not to apply the optimization can avoid this without changes to the rest of the application (see the sketch below)
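A sketch of that hint in use, assuming ROMIO's romio_cb_read/romio_cb_write hint names for enabling or disabling collective buffering; the helper function and file path are illustrative only.

    /* Illustrative sketch: open a file with ROMIO's collective-buffering
     * optimization disabled so the two-phase applicability check (and its
     * communication) is skipped on every collective operation. */
    #include <mpi.h>

    static MPI_File open_without_two_phase(const char *path)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_read", "disable");   /* no two-phase on reads */
        MPI_Info_set(info, "romio_cb_write", "disable");  /* no two-phase on writes */

        MPI_File_open(MPI_COMM_WORLD, (char *) path,
                      MPI_MODE_RDONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }
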
20. Access Pattern Hints
- Collaboration between ANL and LLNL (and growing)
- Examining how access pattern information can be passed to the MPI-IO interface, and through to the underlying file system
- Used as input to optimizations in MPI-IO layer
- Used as input to optimizations in FS layer as well
- Prefetching
- Caching
- Writeback
21. Status of Hints
- Aggregation control finished
- Optimization hints
- Collectives, data sieving read finished
- Data sieving write control in progress
- PVFS noncontiguous I/O control in progress
- Access pattern hints
- Exchanging log files, formats
- Getting up to speed on respective tools
22. Parallel I/O Benchmarking
- No common parallel I/O benchmarks
- New effort (consortium) to
- Define some terminology
- Define test methodology
- Collect tests
- Goal: provide a meaningful test suite with consistent measurement techniques
- Interested parties at numerous sites (and growing)
- LLNL, Sandia, UIUC, ANL, UCAR, Clemson
- In infancy
23. Grid I/O
- Looking at ways to connect our I/O work with components and APIs used in the Grid
- New ways of getting data in and out of PVFS
- Using MPI-IO to access data in the Grid
- Alternative mechanisms for transporting data across the Grid (synergy)
- Working towards more seamless integration of the tools used in the Grid and those used on clusters and in parallel applications (specifically MPI applications)
- Facilitate moving between Grid and cluster worlds
24. Local Access to GridFTP Data
- Grid I/O contact: B. Allcock (ANL)
- GridFTP striped server provides a high-throughput mechanism for moving data across the Grid
- Relies on a proprietary storage format on striped servers
- Must manage metadata on stripe location
- Data stored on the servers must be read back from the servers
- No alternative or more direct way to access local data
- Next version assumes a shared file system underneath
25. GridFTP Striped Servers
- Remote applications connect to multiple striped servers to quickly transfer data over the Grid
- Multiple TCP streams better utilize the WAN
- Local processes would need to use the same mechanism to get to data on the striped servers
26. PVFS under GridFTP
- With PVFS underneath, GridFTP servers would store data on PVFS I/O servers
- Stripe information stored on the PVFS metadata server
27. Local Data Access
- Application tasks that are part of a local parallel job could access data directly off the PVFS file system
- Output from the application could be retrieved remotely via GridFTP
28. MPI-IO Access to GridFTP
- Applications such as the tiled display reader desire remote access to GridFTP data
- Access through MPI-IO would allow this with no code changes
- ROMIO's ADIO interface provides the infrastructure necessary to do this
- MPI-IO hints provide a means for specifying number of stripes, transfer sizes, etc. (see the sketch below)
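The sketch below only conveys the idea from the application side; the hint keys (num_stripes, transfer_block_size), the helper function, and the URL-style path are hypothetical placeholders, not an existing ROMIO GridFTP interface.

    /* Purely illustrative: an application opening remote GridFTP-resident data
     * through MPI-IO and passing transfer parameters as hints. Hint keys and
     * the URL-style path are hypothetical placeholders. */
    #include <mpi.h>

    static MPI_File open_gridftp_file(const char *url)
    {
        MPI_Info info;
        MPI_File fh;

        MPI_Info_create(&info);
        MPI_Info_set(info, "num_stripes", "4");                 /* hypothetical hint */
        MPI_Info_set(info, "transfer_block_size", "1048576");   /* hypothetical hint */

        MPI_File_open(MPI_COMM_WORLD, (char *) url, MPI_MODE_RDONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }
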
29. WAN File Transfer Mechanism
- B. Gropp (ANL), P. Dickens (IIT)
- Applications
- PPM and COMMAS (Paul Woodward, UMN)
- Alternative mechanism for moving data across the Grid using UDP
- Focuses on requirements for file movement (see the sketch below)
- All data must arrive at the destination
- Ordering doesn't matter
- Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer
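The sketch below is not the WAN FT implementation; it only illustrates the bookkeeping those requirements imply: blocks may arrive in any order, a bitmap records what has arrived, and anything still missing is re-requested without stalling the rest of the transfer. The block count and the printf stand-in for a retransmission request are arbitrary.

    /* Illustrative bookkeeping only: blocks of a file may arrive in any order
     * over UDP; a bitmap tracks which have arrived, and missing blocks are
     * listed for retransmission. */
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS 1024

    static unsigned char received[NBLOCKS];   /* 1 if block has arrived */

    /* Record an arriving block; duplicates are harmless. */
    static void block_arrived(int blockno)
    {
        if (blockno >= 0 && blockno < NBLOCKS)
            received[blockno] = 1;
    }

    /* After a pass over the file, ask the sender for whatever is missing. */
    static int request_missing(void)
    {
        int i, missing = 0;
        for (i = 0; i < NBLOCKS; i++) {
            if (!received[i]) {
                printf("retransmit block %d\n", i);  /* stand-in for a NACK */
                missing++;
            }
        }
        return missing;
    }

    int main(void)
    {
        memset(received, 0, sizeof(received));
        /* ... receive loop would call block_arrived() per UDP datagram ... */
        block_arrived(0);
        block_arrived(2);
        request_missing();   /* block 1 and the rest would be re-requested */
        return 0;
    }
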
30. WAN File Transfer Performance
- Comparing TCP utilization to WAN FT technique
- See 10-12% utilization with a single TCP stream (8 streams to approach max. utilization)
- With WAN FT, obtain near 90% utilization and more uniform performance
31. Grid I/O Status
- Planning with Grid I/O group
- Matching up components
- Identifying useful hints
- Globus FTP client library is available
- 2nd generation striped server being implemented
- XIO interface prototyped
- Hooks for alternative local file systems
- Obvious match for PVFS under GridFTP
32. NetCDF
- Applications in climate and fusion
- PCM
- John Drake (ORNL)
- Weather Research and Forecast Model (WRF)
- John Michalakes (NCAR)
- Center for Extended Magnetohydrodynamic Modeling
- Steve Jardin (PPPL)
- Plasma Microturbulence Project
- Bill Nevins (LLNL)
- Maintained by Unidata Program Center
- API and file format for storing multidimensional datasets and associated metadata (in a single file)
33. NetCDF Interface
- Strong points
- It's a standard!
- I/O routines allow for subarray and strided access with single calls
- Access is clearly split into two modes (see the sketch below)
- Defining the datasets (define mode)
- Accessing and/or modifying the datasets (data mode)
- Weakness: no parallel writes, limited parallel read capability
- This forces applications to ship data to a single node for writing, severely limiting usability in I/O-intensive applications
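For readers unfamiliar with the interface, a minimal serial example of the two modes and a single-call subarray write follows; the file, dimension, and variable names are made up, and error checking is omitted.

    /* Minimal serial netCDF sketch: define mode creates dimensions and a
     * variable, then data mode writes a subarray with a single call. */
    #include <netcdf.h>

    int main(void)
    {
        int ncid, dimids[2], varid;
        size_t start[2] = { 0, 0 };     /* subarray origin */
        size_t count[2] = { 2, 3 };     /* subarray shape  */
        float data[2][3] = { {1, 2, 3}, {4, 5, 6} };

        nc_create("example.nc", NC_CLOBBER, &ncid);        /* enter define mode */
        nc_def_dim(ncid, "y", 4, &dimids[0]);
        nc_def_dim(ncid, "x", 6, &dimids[1]);
        nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
        nc_enddef(ncid);                                   /* switch to data mode */

        /* One call writes the whole 2x3 subarray at offset (0,0). */
        nc_put_vara_float(ncid, varid, start, count, &data[0][0]);

        nc_close(ncid);
        return 0;
    }
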
34. Parallel NetCDF
- Rich I/O routines and explicit define/data modes provide a good foundation
- Existing applications are already describing noncontiguous regions
- Modes allow for a synchronization point when file layout changes
- Missing
- Semantics for parallel access
- Collective routines
- Option for using MPI datatypes
- Implement in terms of MPI-IO operations
- Retain file format for interoperability
35. Parallel NetCDF Status
- Design document created
- B. Gropp, R. Ross, and R. Thakur (ANL)
- Prototype in progress
- J. Li (NWU)
- Focus is on write functions first
- Biggest bottleneck for checkpointing applications
- Read functions follow
- Investigate alternative file formats in future
- Address differences in access modes between writing and reading
36. FLASH Astrophysics Code
- Developed at the ASCI Center at the University of Chicago
- Contact: Mike Zingale
- Adaptive mesh refinement (AMR) code for simulating astrophysical thermonuclear flashes
- Written in Fortran 90; uses MPI for communication, HDF5 for checkpointing and visualization data
- Scales to thousands of processors, runs for weeks, needs to checkpoint
- At the time, I/O was a bottleneck (½ of runtime on 1024 processors)
37. HDF5 Overhead Analysis
- Instrumented FLASH I/O to log calls to H5Dwrite
- [Figure: time in H5Dwrite vs. time in the underlying MPI_File_write_at calls]
38. HDF5 Hyperslab Operations
- White region is hyperslab gather (from memory)
- Cyan is scatter (to file)
39. Hand-Coded Packing
- Packing time is in the black regions between bars
- Nearly an order of magnitude improvement (see the sketch below)
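To make the two paths concrete, the sketch below (illustrative only, not the FLASH I/O code) writes the interior of a guard-cell-padded block either by selecting a memory hyperslab and letting H5Dwrite gather it, or by hand-packing the interior into a contiguous buffer first; the block sizes, file name, and dataset name are assumptions, and the 5-argument H5Dcreate call follows the HDF5 1.6-era API.

    /* Illustrative sketch: write the 8x8 interior of a 12x12 block either
     * (a) via a memory hyperslab that HDF5 gathers itself, or (b) by
     * hand-packing into a contiguous buffer and writing from that. */
    #include <hdf5.h>

    #define NB 12   /* block size including guard cells */
    #define NI 8    /* interior size */
    #define NG 2    /* guard cells on each side */

    int main(void)
    {
        double block[NB][NB], packed[NI][NI];
        hsize_t mdims[2] = { NB, NB }, fdims[2] = { NI, NI };
        hsize_t start[2] = { NG, NG }, count[2] = { NI, NI };
        hid_t file, mspace, pspace, fspace, dset;
        int i, j;

        for (i = 0; i < NB; i++)
            for (j = 0; j < NB; j++)
                block[i][j] = i * NB + j;

        file   = H5Fcreate("flash_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        fspace = H5Screate_simple(2, fdims, NULL);
        dset   = H5Dcreate(file, "interior", H5T_NATIVE_DOUBLE, fspace, H5P_DEFAULT);

        /* (a) hyperslab path: HDF5 gathers the noncontiguous interior itself */
        mspace = H5Screate_simple(2, mdims, NULL);
        H5Sselect_hyperslab(mspace, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, block);

        /* (b) hand-packed path: copy the interior into a contiguous buffer,
           then hand HDF5 an already-contiguous write */
        for (i = 0; i < NI; i++)
            for (j = 0; j < NI; j++)
                packed[i][j] = block[i + NG][j + NG];
        pspace = H5Screate_simple(2, fdims, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, pspace, fspace, H5P_DEFAULT, packed);

        H5Sclose(mspace);
        H5Sclose(pspace);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }
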
40. Wrap Up
- Progress being made on multiple fronts
- ANL/NWU collaboration is strong
- Collaborations with other groups maturing
- Balance of immediate payoff and medium-term infrastructure improvements
- Providing expertise to application groups
- Adding functionality targeted at specific applications
- Building core infrastructure to scale, ensure availability
- Synergy with other projects
- On to Wei-keng!