Title: IO File Striping
1. IO File Striping
Katie Antypas, User Services Group, kantypas@lbl.gov
NERSC User Group Meeting, October 2, 2007
2. Outline
- File Striping
- Definition
- Default Striping
- Pros and Cons
- One file-per-processor
- Shared Files
- Gotchas
3. Motivation
- We wish users didn't have to deal with striping
- Ideally it would be hidden from users
- Unfortunately, performance is too critical to ignore it
4. What is File Striping?
- The Lustre file system on Franklin is made up of an underlying set of parallel I/O servers
  - OSSs (Object Storage Servers): nodes dedicated to I/O, connected to the high-speed torus interconnect
  - OSTs (Object Storage Targets): software abstraction of a physical disk (1 OST maps to 1 LUN)
- A file is said to be striped when read and write operations access multiple OSTs concurrently
- Striping can increase I/O performance, since writing to or reading from multiple OSTs simultaneously increases the available I/O bandwidth
5. Franklin Configuration
[Diagram: Franklin compute and interactive nodes connect over the torus to 20 OSSs (80 OSTs), which connect over an FC network to 5 DDN appliances (80 LUNs)]
Connectivity and configuration are set up well for parallelism: using 20 OSTs will spread I/O evenly over the 5 DDN appliances.
6. Default Striping on /scratch
[Diagram: 20 I/O servers, OSS 0 through OSS 19; each OSS serves 4 OSTs (OSS 0: OSTs 0,20,40,60; OSS 1: OSTs 1,21,41,61; ...; OSS 19: OSTs 19,39,59,79)]
- 3 parameters characterize the striping pattern of a file
  - Stripe count
    - Number of OSTs the file is split across
    - Default is 4
  - Stripe size
    - Number of bytes to write on each OST before cycling to the next OST
    - Default is 1 MB
  - OST offset
    - Indicates the starting OST
    - Default is round robin across all requests on the system
7. Max Bandwidth of OST and OSS
[Diagram: compute nodes connect over the torus network to storage servers, each serving 4 OSTs]
- Max bandwidth to a single OST: 350 MB/sec
- Max bandwidth to a single OSS: 700 MB/sec
8. Default Stripe Count of 4 on /scratch
[Diagram: 20 I/O servers (OSS 0 through OSS 19), each serving 4 OSTs; a file with stripe count 4 lands on 4 of the 80 OSTs]
- Pros
  - Get 4 times the bandwidth you could get from using 1 OST
    - Max bandwidth to 1 OST: 350 MB/sec
    - Using 4 OSTs: 1,400 MB/sec
- Cons
  - For better or worse, your file is now in 4 different places
  - Metadata operations like ls -l on the file could be slower
  - For small files (<100 MB) there is no performance gain from striping
9. Why a Stripe Count of 4?
[Diagram: 20 I/O servers (OSS 0 through OSS 19), each serving 4 OSTs (0,20,40,60 through 19,39,59,79)]
- Balance
  - With a few important exceptions, it should work decently for most users
- Protection
  - Each OST is backed by a physical disk (LUN)
  - A stripe count of 1 leaves us vulnerable to a single user writing out a huge amount of data and filling the disk
- A stripe count of 4 is a reasonable compromise, although not good for large shared files or certain file-per-processor cases
10. Changing the Default Stripe Count
- A number of applications will see benefits from changing the default striping
- Striping can be set at the file or directory level
- When striping is set on a directory, all files created in that directory inherit the striping set on the directory
- Stripe size: bytes written on each OST before cycling to the next OST
- OST offset: indicates the starting OST
- Stripe count: number of OSTs the file is split across

lstripe <directory|file> <stripe size> <OST offset> <stripe count>
lstripe mydirectory 0 -1 X
11. Parallel I/O: Multi-file
[Diagram: processors 0-5 each write to their own separate file]
- Each processor writes its own data to a separate file (a minimal sketch follows this list)
- Advantages
  - Simple to program
  - Can be fast (up to a point)
- Disadvantages
  - Can quickly accumulate many files
  - Hard to manage
  - Requires post-processing
  - Difficult for storage systems such as HPSS to handle many small files
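A minimal sketch of the file-per-processor pattern in C with MPI and POSIX stdio; the file name pattern and data sizes are illustrative, not from the talk:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* One file per rank, e.g. output.00042 -- easy to write,
     * but many files to manage afterwards */
    char fname[64];
    snprintf(fname, sizeof(fname), "output.%05d", rank);

    FILE *fp = fopen(fname, "wb");
    fwrite(buf, sizeof(double), nlocal, fp);
    fclose(fp);

    free(buf);
    MPI_Finalize();
    return 0;
}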
12. One File-Per-Processor IO with Default Striping
[Diagram: processors 0 through 40,000 each write their own file over the torus network; each file stripes over 4 OSTs spread across OSS 0 through OSS 19]
- With greater than roughly 20 processors/files, each striping over 4 OSTs, files begin to overlap with each other on OSTs
- A stripe count of 4 is not helping the application get more overall bandwidth
- A stripe count of 4 can lead to contention
13. One File-Per-Processor IO with Stripe Count of 1
[Diagram: processors 0 through 40,000 each write their own file with stripe count 1, spread over the torus network across OSS 0 through OSS 19 (4 OSTs each)]
- Use all OSTs, but don't add more contention than is necessary
14. Recommendations for One-File-Per-Processor IO
[Quadrant chart: recommendations by number of processors (> 80 vs. < 80) and file size per processor (> 1 GB vs. < 100s of MB)]
15. Parallel I/O: Single-file
[Diagram: processors 0-5 all write to one shared file]
- Each processor writes its own data to the same file using an MPI-IO mapping (a minimal sketch follows this list)
- Advantages
  - Single file
  - Manageable data
- Disadvantages
  - Lower performance than one file per processor at some concurrencies
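A minimal MPI-IO sketch of this pattern (file name and data sizes are illustrative): each rank describes its slice of a global 1-D array with a subarray filetype, sets the file view, and writes collectively to one shared file.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* Describe this rank's slice of the global array as a file "view" */
    int gsize = nlocal * nprocs, lsize = nlocal, start = rank * nlocal;
    MPI_Datatype filetype;
    MPI_Type_create_subarray(1, &gsize, &lsize, &start,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: all ranks contribute to the single shared file */
    MPI_File_write_all(fh, buf, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}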
16. Shared File I/O with Default Stripe Count of 4
[Diagram: processors 0 through 40,000 write one shared file over the torus network; with the default stripe count the file lands on only 4 OSTs]
- All processors writing the shared file will write to 4 OSTs
- No matter how much data the application is writing, it won't get more than 1,400 MB/sec (4 OSTs x 350 MB/sec)
- Less sophisticated than you might think: there are no optimizations for matching a processor writer to the same OST
- Need to use more OSTs for large shared files
17. Shared File I/O with Stripe Count of 80
[Diagram: processors 0 through 40,000 write one shared file striped across all 80 OSTs on OSS 0 through OSS 19]
- Now striping over all 80 OSTs (one way to request this from within a code is sketched below)
- Increased available bandwidth to the application
  - Theoretically, 14 GB/sec (700 MB/sec OSS max x 20 OSSs)
  - In practice, 11-12 GB/sec
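Striping is normally set ahead of time with lstripe on the directory (slide 10). As a hedged alternative, some MPI-IO implementations also accept striping hints at file-creation time; whether the "striping_factor" and "striping_unit" hints are honored depends on the MPI-IO library and on the file not already existing, so treat this sketch as an assumption to verify on your system. The file name and values are illustrative:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "80");      /* stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "big_shared_file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes as usual ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}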
18. Reduced Writers to Single File
[Diagram: a subset of the processors collects data from the others and writes it to one shared file]
- On Franklin, best performance when the number of writers matches the number of OSTs (80)
- A subset of processors writes data to a single file (a minimal sketch follows this list)
  - Functionality not yet available in XT4 MPI-IO
- Advantages
  - Single file, manageable data
  - Better performance than all tasks writing for high-concurrency jobs
- Disadvantages
  - Requires changes to code
  - Application may not have enough memory to handle data from other processors
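A minimal sketch of one way to hand-roll reduced writers (an illustrative approach, not the implementation discussed in the talk): split the ranks into groups, gather each group's data onto one writer rank, and let only the writers open and write the shared file. The group size, file name, and data sizes are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */
    const int ranks_per_writer = 64;         /* illustrative knob */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* Split ranks into groups; rank 0 of each group is the writer */
    int color = rank / ranks_per_writer;
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    /* Gather the group's data onto its writer (note the memory cost) */
    double *gathered = NULL;
    if (grank == 0) gathered = malloc((size_t)nlocal * gsize * sizeof(double));
    MPI_Gather(buf, nlocal, MPI_DOUBLE, gathered, nlocal, MPI_DOUBLE, 0, group);

    if (grank == 0) {
        /* Only writers join this communicator and touch the file */
        MPI_Comm writers;
        MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &writers);

        MPI_Offset offset = (MPI_Offset)color * ranks_per_writer
                            * nlocal * sizeof(double);
        MPI_File fh;
        MPI_File_open(writers, "reduced_writers.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, offset, gathered,
                              nlocal * gsize, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(gathered);
    } else {
        /* Non-writers must still make the matching collective call */
        MPI_Comm dummy;
        MPI_Comm_split(MPI_COMM_WORLD, MPI_UNDEFINED, rank, &dummy);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}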
19. Recommendations
[Chart: recommended strategy by number of processors (<512, 1024, 2048, >4096) versus aggregate file size (<1 GB, 10 GB, 100 GB, 1 TB)]
Legend:
- Single shared file, default striping (don't do anything; default striping is fine)
- Single shared file, stripe over some OSTs (11): lstripe dir 0 -1 11
- Single shared file, stripe over all OSTs
- Try fewer writers
20. Striping Summary
- Franklin default striping
  - Stripe size: 1 MB (enter 0 for default)
  - OST offset: round-robin starting OST (enter -1 for default)
  - Stripe over 4 OSTs (stripe count 4)
- One file-per-processor
  - lstripe mydir 0 -1 1
- Large shared files
  - lstripe mydir 0 -1 80
- Medium shared files
  - Experiment a little with 10-40 OSTs
  - lstripe mydir 0 -1 11
21. Gotchas
- The file system is a shared resource, and a heavy I/O job can slow the system down for everyone
- Write fewer large blocks of data rather than many small chunks of data
- Some MPI-IO features are under-optimized
  - Collective I/O
  - Two-phase I/O
- The Lustre file system is sensitive to write size and write offset
Please contact the consultants if you have low-performing I/O. There may be something simple we can do to increase performance substantially.
22. Cray/NERSC I/O Initiatives
- Working to address and characterize I/O variation
  - Benchmarking runs
  - Lustre Monitoring Tool
- Improving the MPI-IO layer
  - Multiple people at Sun and Cray working on this
- Discussions forming about improving HDF5 performance on Lustre
23. Best Practices
- Add the striping line to your batch script so you don't forget it!
- Do large I/O: write fewer big chunks of data (> 1 MB) rather than small, bursty I/O
- Do parallel I/O
  - Serial I/O (single writer) cannot take advantage of the system's parallel capabilities
- Stripe large files over many OSTs
- If a job uses many cores, reduce the number of tasks performing IO (experiment with this number: 80, 160, 320)
- Use a single, shared file instead of 1 file per writer, especially at high parallel concurrency
- Use an IO library API and write flexible, portable programs
24. Extra Slides
25. Parallel I/O: A User Perspective
- Wish list
  - Write data from multiple processors into a single file
  - The file can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g., we want to see the logical data layout, not the physical layout)
  - Do so with the same performance as writing one-file-per-processor (we only write one-file-per-processor because of performance problems)
  - And make all of the above portable from one machine to the next
26. I/O Formats
27. Common Storage Formats
- ASCII
  - Slow
  - Takes more space!
  - Inaccurate
- Binary
  - Non-portable (e.g., byte ordering and type sizes)
  - Not future-proof
  - Parallel I/O using MPI-IO
- Self-describing formats
  - NetCDF/HDF4, HDF5, Parallel NetCDF
  - Example: the HDF5 API implements an object DB model in a portable file
  - Parallel I/O using pHDF5/pNetCDF (hides MPI-IO)
- Community file formats
  - FITS, HDF-EOS, SAF, PDB, Plot3D
  - Modern implementations built on top of HDF, NetCDF, or another self-describing object-model API
28. HDF5 Library
HDF5 is a general-purpose library and file format for storing scientific data
- Can store data structures, arrays, vectors, grids, complex data types, text
- Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects, and strings
- Stores the metadata necessary for portability: endian type, size, architecture
29. HDF5 Data Model
- Groups
  - Arranged in a directory hierarchy
  - The root group is always "/"
- Datasets
  - Dataspace
  - Datatype
- Attributes
  - Bind to a Group or Dataset
- References
  - Similar to soft links
  - Can also be subsets of data
[Diagram: example HDF5 file: root group "/" with attributes author="Jane Doe" and date="10/24/2006"; datasets Dataset0 and Dataset1 (each with a type and space, and attributes such as time=0.2345 and validity=None); a subgroup "subgrp" containing Dataset0.1 and Dataset0.2; a minimal code sketch of this structure follows]
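A minimal sketch of this data model in the HDF5 C API, assuming a modern (1.8+) HDF5 build; the file, group, dataset, and attribute names are illustrative. It creates an attribute on the root group, one dataset (dataspace + datatype), and one subgroup:

#include "hdf5.h"

int main(void)
{
    hsize_t dims[1] = {10};
    double  data[10] = {0};
    char    author[16] = "Jane Doe";

    /* File and root group */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Attribute "author" bound to the root group */
    hid_t str_type = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_type, sizeof(author));
    hid_t attr_space = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate(file, "author", str_type, attr_space,
                           H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, str_type, author);

    /* Dataset "Dataset0" = dataspace + datatype */
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate(file, "Dataset0", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* Subgroup "subgrp" in the directory-like hierarchy */
    hid_t grp = H5Gcreate(file, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Gclose(grp); H5Dclose(dset); H5Sclose(space);
    H5Aclose(attr); H5Sclose(attr_space); H5Tclose(str_type);
    H5Fclose(file);
    return 0;
}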
30. A Plug for Self-Describing Formats...
- Application developers shouldn't care about the physical layout of data
- Using your own binary file format forces you to understand the layers below the application to get optimal IO performance
- Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, the user is required to make changes to improve IO performance
- Let other people do the work
  - HDF5 can be optimized for given platforms and file systems by the HDF5 developers
  - The user can stay at the high level
- But what about performance?
31. IO Library Overhead
Very little, if any, overhead from HDF5 for one-file-per-processor IO compared to POSIX and MPI-IO
Data from Hongzhang Shan
32. Ways to do Parallel IO
33. Serial I/O
[Diagram: processors 0-5 send their data to processor 0, which writes a single file]
- Each processor sends its data to the master, who then writes the data to a file (a minimal sketch follows this list)
- Advantages
  - Simple
  - May perform OK for very small IO sizes
- Disadvantages
  - Not scalable
  - Not efficient; slow for any large number of processors or data sizes
  - May not be possible if memory-constrained
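A minimal sketch of the serial-I/O pattern (file name and data sizes are illustrative): every rank gathers its block to rank 0, which writes the whole file with ordinary POSIX calls. Note that rank 0 must hold all of the data at once, which is the memory constraint noted above.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *local = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) local[i] = rank;

    /* Rank 0 must hold everyone's data at once */
    double *all = NULL;
    if (rank == 0) all = malloc((size_t)nlocal * nprocs * sizeof(double));

    MPI_Gather(local, nlocal, MPI_DOUBLE, all, nlocal, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *fp = fopen("serial_output.dat", "wb");
        fwrite(all, sizeof(double), (size_t)nlocal * nprocs, fp);
        fclose(fp);
        free(all);
    }
    free(local);
    MPI_Finalize();
    return 0;
}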
34. Parallel I/O: Multi-file
[Diagram: processors 0-5 each write to their own separate file]
- Each processor writes its own data to a separate file
- Advantages
  - Simple to program
  - Can be fast (up to a point)
- Disadvantages
  - Can quickly accumulate many files
  - Hard to manage
  - Requires post-processing
  - Difficult for storage systems such as HPSS to handle many small files
35. Flash Center IO Nightmare
- Large 32,000-processor run on the LLNL BG/L
- Parallel IO libraries not yet available
- Intensive I/O application
  - Checkpoint files: 0.7 TB, dumped every 4 hours, 200 dumps
    - Used for restarting the run
    - Full-resolution snapshots of the entire grid
  - Plotfiles: 20 GB each, 700 dumps
    - Coarsened by a factor-of-two averaging
    - Single precision
    - Subset of grid variables
  - Particle files: 1,400 particle files, 470 MB each
- 154 TB of disk capacity
- 74 million files!
- Unix tool problems
- 2 years later, still trying to sift through data and sew files together
36. Parallel I/O: Single-file
[Diagram: processors 0-5 all write to one shared file]
- Each processor writes its own data to the same file using an MPI-IO mapping
- Advantages
  - Single file
  - Manageable data
- Disadvantages
  - Lower performance than one file per processor at some concurrencies
37. Parallel IO: Single File
[Diagram: processors 0-5 each write a section of one data array]
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write, as in the sketch below.
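A minimal sketch of that offset computation with MPI-IO (file name and data sizes are illustrative): each rank derives its byte offset from its rank and writes its elements at that offset with a collective call.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int nlocal = 1024;                 /* elements per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* Offset of this rank's section from the beginning of the array */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write at an explicit offset */
    MPI_File_write_at_all(fh, offset, buf, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}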
38. Trade-offs
- Ideally users want speed, portability, and usability
  - Speed: one file per processor
  - Portability: high-level IO library
  - Usability: a single shared file, and your own file format or a community file format layered on top of a high-level IO library
It isn't hard to have speed, portability, or usability. It is hard to have speed, portability, and usability in the same implementation.
39. Good I/O Performance with Simple I/O Patterns
- The file system is capable of high performance for shared files
  - Large-block sequential I/O
  - Transfer size a multiple of the stripe size
  - No metadata
40. More Complicated I/O Patterns
- Harder for the file system to handle
  - Smaller amounts of data (MBs/proc)
  - Transfer size not a multiple of the stripe width
  - Start offset doesn't match the stripe width
  - Strided data
- Can result in lower shared-file performance
41. Description of IOR
- Developed by LLNL, used for the Purple procurement
- Focuses on parallel/sequential read/write operations that are typical in scientific applications
- Can exercise one-file-per-processor or shared-file access for a common set of testing parameters
- Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5, and parallel-netCDF
- Parameterized parallel file access patterns to mimic different application situations
42. Benchmark Methodology
Focus on the performance difference between a single shared file and one file per processor
43. Small Aggregate Output Sizes: 100 MB - 1 GB
[Charts: one file per processor vs. shared file, rate in GB/sec, for aggregate file sizes of 100 MB and 1 GB; a peak-performance line is marked, and anything greater than it is due to caching effects or timer granularity]
Clearly the one-file-per-processor strategy wins in the low-concurrency cases, correct?
44. Small Aggregate Output Sizes: 100 MB - 1 GB
[Charts: one file per processor vs. shared file, time, for aggregate file sizes of 1 GB and 100 MB]
But when looking at absolute time, the difference doesn't seem so big...
45. Aggregate Output Size: 100 GB
[Charts: one file per processor vs. shared file, rate (GB/sec) and time (seconds), with the peak-performance line marked; annotations: 2.5 mins, 390 MB/proc, 24 MB/proc]
Is there anything we can do to improve the performance of the 4096-processor shared-file case?
46. Aggregate Output Size: 1 TB
[Charts: one file per processor vs. shared file, rate (GB/sec) and time (seconds); annotations: 3 mins, 976 MB/proc, 244 MB/proc]
Is there anything we can do to improve the performance of the 4096-processor shared-file case?
47. Recommendations
- Think about the big picture
  - Run-time vs. post-processing trade-off
  - Decide how much IO overhead you can afford
  - Data analysis
  - Portability
  - Longevity
    - h5dump works on all platforms
    - Can view an old file with h5dump
    - If you use your own binary format, you must keep track of not only your file format version but the version of your file reader as well
  - Storability
48. Recommendations
- Use a standard IO format, even if you are following a one-file-per-processor model
  - The one-file-per-processor model really only makes sense when writing out very large files at high concurrencies; for small files, the overhead is low
  - If you must do one-file-per-processor IO, then at least put it in a standard IO format so the pieces can be put back together more easily
- Splitting large shared files into a few files appears promising
  - An option for some users, but requires code changes and output format changes
  - Could be implemented better in IO library APIs
- Follow the striping recommendations
- Ask the consultants; we are here to help!
49. Questions?