Title: IO File Striping
1. IO File Striping
Katie Antypas, User Services Group, kantypas@lbl.gov
NERSC User Group Meeting, October 2, 2007
2. Outline
- File Striping
- Definition
- Default Striping
- Pros and Cons
- One file-per-processor
- Shared Files
- Gotchas
3. Motivation
- We wish users didn't have to deal with striping
- Ideally it would be hidden from users
- Unfortunately, performance is too critical to ignore it
4. What is File Striping?
- The Lustre file system on Franklin is made up of an underlying set of parallel I/O servers
  - OSSs (Object Storage Servers): nodes dedicated to I/O, connected to the high-speed torus interconnect
  - OSTs (Object Storage Targets): software abstraction of a physical disk (1 OST maps to 1 LUN)
- A file is said to be striped when read and write operations access multiple OSTs concurrently
- Striping can increase I/O performance, since writing to or reading from multiple OSTs simultaneously increases the available I/O bandwidth
5. Franklin Configuration
[Diagram: Franklin compute and interactive nodes connect over the torus to 20 OSSs (80 OSTs), which connect over an FC network to 5 DDN appliances (80 LUNs)]
Connectivity and configuration are set up well for parallelism: using 20 OSTs will spread I/O evenly over the 5 DDN appliances.
6. Default Striping on /scratch
[Diagram: 20 I/O servers, OSS 0 through OSS 19; each OSS serves 4 OSTs (OSS 0: OSTs 0,20,40,60; OSS 1: OSTs 1,21,41,61; ...; OSS 19: OSTs 19,39,59,79)]
- 3 parameters characterize the striping pattern of a file
  - Stripe count
    - Number of OSTs the file is split across
    - Default is 4
  - Stripe size
    - Number of bytes to write on each OST before cycling to the next OST
    - Default is 1 MB
  - OST offset
    - Indicates the starting OST
    - Default is round robin across all requests on the system
7. Max Bandwidth of OST and OSS
[Diagram: compute nodes connect over the torus network to storage servers, each serving 4 OSTs]
- Max bandwidth to a single OST: 350 MB/sec
- Max bandwidth to a single OSS: 700 MB/sec
8. Default Stripe Count of 4 on /scratch
[Diagram: 20 I/O servers (OSS 0 through OSS 19), each serving 4 OSTs; a file with stripe count 4 lands on 4 of the 80 OSTs]
- Pros
  - Get 4 times the bandwidth you could get from using 1 OST
    - Max bandwidth to 1 OST: 350 MB/sec
    - Using 4 OSTs: 1,400 MB/sec
- Cons
  - For better or worse, your file is now in 4 different places
  - Metadata operations like ls -l on the file could be slower
  - For small files (<100 MB) there is no performance gain from striping
9. Why a Stripe Count of 4?
[Diagram: 20 I/O servers (OSS 0 through OSS 19), each serving 4 OSTs (0,20,40,60 through 19,39,59,79)]
- Balance
  - With a few important exceptions, it should work decently for most users
- Protection
  - Each OST is backed by a physical disk (LUN)
  - A stripe count of 1 leaves us vulnerable to a single user writing out a huge amount of data and filling the disk
- A stripe count of 4 is a reasonable compromise, although not good for large shared files or certain file-per-processor cases
10. Changing the Default Stripe Count
- A number of applications will see benefits from changing the default striping
- Striping can be set at the file or directory level
- When striping is set on a directory, all files created in that directory inherit the striping set on the directory
- Stripe size: bytes written on each OST before cycling to the next OST
- OST offset: indicates the starting OST
- Stripe count: number of OSTs the file is split across

lstripe <directory|file> <stripe size> <OST offset> <stripe count>
lstripe mydirectory 0 -1 X
11. Parallel I/O: Multi-file
[Diagram: processors 0-5 each write to their own separate file]
- Each processor writes its own data to a separate file (a minimal sketch follows this list)
- Advantages
  - Simple to program
  - Can be fast (up to a point)
- Disadvantages
  - Can quickly accumulate many files
  - Hard to manage
  - Requires post-processing
  - Difficult for storage systems such as HPSS to handle many small files
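A minimal sketch of the file-per-processor pattern in C with MPI and POSIX stdio; the file name pattern and data sizes are illustrative, not from the talk:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* One file per rank, e.g. output.00042 -- easy to write,
     * but many files to manage afterwards */
    char fname[64];
    snprintf(fname, sizeof(fname), "output.%05d", rank);

    FILE *fp = fopen(fname, "wb");
    fwrite(buf, sizeof(double), nlocal, fp);
    fclose(fp);

    free(buf);
    MPI_Finalize();
    return 0;
}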
12. One File-Per-Processor IO with Default Striping
[Diagram: processors 0 through 40,000 each write their own file over the torus network; each file stripes over 4 OSTs spread across OSS 0 through OSS 19]
- With greater than roughly 20 processors/files, each striping over 4 OSTs, files begin to overlap with each other on OSTs
- A stripe count of 4 is not helping the application get more overall bandwidth
- A stripe count of 4 can lead to contention
13. One File-Per-Processor IO with Stripe Count of 1
[Diagram: processors 0 through 40,000 each write their own file with stripe count 1, spread over the torus network across OSS 0 through OSS 19 (4 OSTs each)]
- Use all OSTs, but don't add more contention than is necessary
14. Recommendations for One-File-Per-Processor IO
[Quadrant chart: recommendations by number of processors (> 80 vs. < 80) and file size per processor (> 1 GB vs. < 100s of MB)]
15. Parallel I/O: Single-file
[Diagram: processors 0-5 all write to one shared file]
- Each processor writes its own data to the same file using an MPI-IO mapping (a minimal sketch follows this list)
- Advantages
  - Single file
  - Manageable data
- Disadvantages
  - Lower performance than one file per processor at some concurrencies
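A minimal MPI-IO sketch of this pattern (file name and data sizes are illustrative): each rank describes its slice of a global 1-D array with a subarray filetype, sets the file view, and writes collectively to one shared file.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* Describe this rank's slice of the global array as a file "view" */
    int gsize = nlocal * nprocs, lsize = nlocal, start = rank * nlocal;
    MPI_Datatype filetype;
    MPI_Type_create_subarray(1, &gsize, &lsize, &start,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: all ranks contribute to the single shared file */
    MPI_File_write_all(fh, buf, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}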
16. Shared File I/O with Default Stripe Count of 4
[Diagram: processors 0 through 40,000 write one shared file over the torus network; with the default stripe count the file lands on only 4 OSTs]
- All processors writing the shared file will write to 4 OSTs
- No matter how much data the application is writing, it won't get more than 1,400 MB/sec (4 OSTs x 350 MB/sec)
- Less sophisticated than you might think: there are no optimizations for matching a processor writer to the same OST
- Need to use more OSTs for large shared files
17. Shared File I/O with Stripe Count of 80
[Diagram: processors 0 through 40,000 write one shared file striped across all 80 OSTs on OSS 0 through OSS 19]
- Now striping over all 80 OSTs (one way to request this from within a code is sketched below)
- Increased available bandwidth to the application
  - Theoretically, 14 GB/sec (700 MB/sec OSS max x 20 OSSs)
  - In practice, 11-12 GB/sec
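Striping is normally set ahead of time with lstripe on the directory (slide 10). As a hedged alternative, some MPI-IO implementations also accept striping hints at file-creation time; whether the "striping_factor" and "striping_unit" hints are honored depends on the MPI-IO library and on the file not already existing, so treat this sketch as an assumption to verify on your system. The file name and values are illustrative:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "80");      /* stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MB stripes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "big_shared_file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... collective writes as usual ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}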
18. Reduced Writers to Single File
[Diagram: a subset of the processors collects data from the others and writes it to one shared file]
- On Franklin, best performance when the number of writers matches the number of OSTs (80)
- A subset of processors writes data to a single file (a minimal sketch follows this list)
  - Functionality not yet available in XT4 MPI-IO
- Advantages
  - Single file, manageable data
  - Better performance than all tasks writing for high-concurrency jobs
- Disadvantages
  - Requires changes to code
  - Application may not have enough memory to handle data from other processors
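A minimal sketch of one way to hand-roll reduced writers (an illustrative approach, not the implementation discussed in the talk): split the ranks into groups, gather each group's data onto one writer rank, and let only the writers open and write the shared file. The group size, file name, and data sizes are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */
    const int ranks_per_writer = 64;         /* illustrative knob */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* Split ranks into groups; rank 0 of each group is the writer */
    int color = rank / ranks_per_writer;
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group);

    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    /* Gather the group's data onto its writer (note the memory cost) */
    double *gathered = NULL;
    if (grank == 0) gathered = malloc((size_t)nlocal * gsize * sizeof(double));
    MPI_Gather(buf, nlocal, MPI_DOUBLE, gathered, nlocal, MPI_DOUBLE, 0, group);

    if (grank == 0) {
        /* Only writers join this communicator and touch the file */
        MPI_Comm writers;
        MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &writers);

        MPI_Offset offset = (MPI_Offset)color * ranks_per_writer
                            * nlocal * sizeof(double);
        MPI_File fh;
        MPI_File_open(writers, "reduced_writers.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, offset, gathered,
                              nlocal * gsize, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(gathered);
    } else {
        /* Non-writers must still make the matching collective call */
        MPI_Comm dummy;
        MPI_Comm_split(MPI_COMM_WORLD, MPI_UNDEFINED, rank, &dummy);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}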
19. Recommendations
[Chart: recommended strategy by number of processors (<512, 1024, 2048, >4096) versus aggregate file size (<1 GB, 10 GB, 100 GB, 1 TB)]
Legend:
- Single shared file, default striping (don't do anything; default striping is fine)
- Single shared file, stripe over some OSTs (11): lstripe dir 0 -1 11
- Single shared file, stripe over all OSTs
- Try fewer writers
20. Striping Summary
- Franklin default striping
  - Stripe size: 1 MB (enter 0 for default)
  - OST offset: round-robin starting OST (enter -1 for default)
  - Stripe over 4 OSTs (stripe count 4)
- One file-per-processor
  - lstripe mydir 0 -1 1
- Large shared files
  - lstripe mydir 0 -1 80
- Medium shared files
  - Experiment a little with 10-40 OSTs
  - lstripe mydir 0 -1 11
21. Gotchas
- The file system is a shared resource, and a heavy I/O job can slow the system down for everyone
- Write fewer large blocks of data rather than many small chunks of data
- Some MPI-IO features are under-optimized
  - Collective I/O
  - Two-phase I/O
- The Lustre file system is sensitive to write size and write offset
Please contact the consultants if you have low-performing I/O. There may be something simple we can do to increase performance substantially.
22. Cray/NERSC I/O Initiatives
- Working to address and characterize I/O variation
  - Benchmarking runs
  - Lustre Monitoring Tool
- Improving the MPI-IO layer
  - Multiple people at Sun and Cray working on this
- Discussions forming about improving HDF5 performance on Lustre
23. Best Practices
- Add the striping line to your batch script so you don't forget it!
- Do large I/O: write fewer big chunks of data (> 1 MB) rather than small, bursty I/O
- Do parallel I/O
  - Serial I/O (single writer) cannot take advantage of the system's parallel capabilities
- Stripe large files over many OSTs
- If a job uses many cores, reduce the number of tasks performing IO (experiment with this number: 80, 160, 320)
- Use a single, shared file instead of 1 file per writer, especially at high parallel concurrency
- Use an IO library API and write flexible, portable programs
24. Extra Slides
25. Parallel I/O: A User Perspective
- Wish list
  - Write data from multiple processors into a single file
  - The file can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g., we want to see the logical data layout, not the physical layout)
  - Do so with the same performance as writing one-file-per-processor (we only write one-file-per-processor because of performance problems)
  - And make all of the above portable from one machine to the next
26. I/O Formats
27. Common Storage Formats
- ASCII
  - Slow
  - Takes more space!
  - Inaccurate
- Binary
  - Non-portable (e.g., byte ordering and type sizes)
  - Not future-proof
  - Parallel I/O using MPI-IO
- Self-describing formats
  - NetCDF/HDF4, HDF5, Parallel NetCDF
  - Example: the HDF5 API implements an object DB model in a portable file
  - Parallel I/O using pHDF5/pNetCDF (hides MPI-IO)
- Community file formats
  - FITS, HDF-EOS, SAF, PDB, Plot3D
  - Modern implementations built on top of HDF, NetCDF, or another self-describing object-model API
28. HDF5 Library
HDF5 is a general-purpose library and file format for storing scientific data
- Can store data structures, arrays, vectors, grids, complex data types, text
- Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects, and strings
- Stores the metadata necessary for portability: endian type, size, architecture
29. HDF5 Data Model
- Groups
  - Arranged in a directory hierarchy
  - The root group is always "/"
- Datasets
  - Dataspace
  - Datatype
- Attributes
  - Bind to a Group or Dataset
- References
  - Similar to soft links
  - Can also be subsets of data
[Diagram: example HDF5 file: root group "/" with attributes author="Jane Doe" and date="10/24/2006"; datasets Dataset0 and Dataset1 (each with a type and space, and attributes such as time=0.2345 and validity=None); a subgroup "subgrp" containing Dataset0.1 and Dataset0.2; a minimal code sketch of this structure follows]
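A minimal sketch of this data model in the HDF5 C API, assuming a modern (1.8+) HDF5 build; the file, group, dataset, and attribute names are illustrative. It creates an attribute on the root group, one dataset (dataspace + datatype), and one subgroup:

#include "hdf5.h"

int main(void)
{
    hsize_t dims[1] = {10};
    double  data[10] = {0};
    char    author[16] = "Jane Doe";

    /* File and root group */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Attribute "author" bound to the root group */
    hid_t str_type = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_type, sizeof(author));
    hid_t attr_space = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate(file, "author", str_type, attr_space,
                           H5P_DEFAULT, H5P_DEFAULT);
    H5Awrite(attr, str_type, author);

    /* Dataset "Dataset0" = dataspace + datatype */
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate(file, "Dataset0", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* Subgroup "subgrp" in the directory-like hierarchy */
    hid_t grp = H5Gcreate(file, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Gclose(grp); H5Dclose(dset); H5Sclose(space);
    H5Aclose(attr); H5Sclose(attr_space); H5Tclose(str_type);
    H5Fclose(file);
    return 0;
}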
30. A Plug for Self-Describing Formats...
- Application developers shouldn't care about the physical layout of data
- Using your own binary file format forces you to understand the layers below the application to get optimal IO performance
- Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, the user is required to make changes to improve IO performance
- Let other people do the work
  - HDF5 can be optimized for given platforms and file systems by the HDF5 developers
  - The user can stay at the high level
- But what about performance?
31. IO Library Overhead
Very little, if any, overhead from HDF5 for one-file-per-processor IO compared to POSIX and MPI-IO
Data from Hongzhang Shan
32. Ways to do Parallel IO
33. Serial I/O
[Diagram: processors 0-5 send their data to processor 0, which writes a single file]
- Each processor sends its data to the master, who then writes the data to a file (a minimal sketch follows this list)
- Advantages
  - Simple
  - May perform OK for very small IO sizes
- Disadvantages
  - Not scalable
  - Not efficient; slow for any large number of processors or data sizes
  - May not be possible if memory-constrained
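A minimal sketch of the serial-I/O pattern (file name and data sizes are illustrative): every rank gathers its block to rank 0, which writes the whole file with ordinary POSIX calls. Note that rank 0 must hold all of the data at once, which is the memory constraint noted above.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int nlocal = 1024;                 /* doubles per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *local = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) local[i] = rank;

    /* Rank 0 must hold everyone's data at once */
    double *all = NULL;
    if (rank == 0) all = malloc((size_t)nlocal * nprocs * sizeof(double));

    MPI_Gather(local, nlocal, MPI_DOUBLE, all, nlocal, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *fp = fopen("serial_output.dat", "wb");
        fwrite(all, sizeof(double), (size_t)nlocal * nprocs, fp);
        fclose(fp);
        free(all);
    }
    free(local);
    MPI_Finalize();
    return 0;
}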
34. Parallel I/O: Multi-file
[Diagram: processors 0-5 each write to their own separate file]
- Each processor writes its own data to a separate file
- Advantages
  - Simple to program
  - Can be fast (up to a point)
- Disadvantages
  - Can quickly accumulate many files
  - Hard to manage
  - Requires post-processing
  - Difficult for storage systems such as HPSS to handle many small files
35. Flash Center IO Nightmare
- Large 32,000-processor run on the LLNL BG/L
- Parallel IO libraries not yet available
- Intensive I/O application
  - Checkpoint files: 0.7 TB, dumped every 4 hours, 200 dumps
    - Used for restarting the run
    - Full-resolution snapshots of the entire grid
  - Plotfiles: 20 GB each, 700 dumps
    - Coarsened by a factor-of-two averaging
    - Single precision
    - Subset of grid variables
  - Particle files: 1,400 particle files, 470 MB each
- 154 TB of disk capacity
- 74 million files!
- Unix tool problems
- 2 years later, still trying to sift through data and sew files together
36. Parallel I/O: Single-file
[Diagram: processors 0-5 all write to one shared file]
- Each processor writes its own data to the same file using an MPI-IO mapping
- Advantages
  - Single file
  - Manageable data
- Disadvantages
  - Lower performance than one file per processor at some concurrencies
37. Parallel IO: Single File
[Diagram: processors 0-5 each write a section of one data array]
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write, as in the sketch below.
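A minimal sketch of that offset computation with MPI-IO (file name and data sizes are illustrative): each rank derives its byte offset from its rank and writes its elements at that offset with a collective call.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int nlocal = 1024;                 /* elements per rank (illustrative) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) buf[i] = rank;

    /* Offset of this rank's section from the beginning of the array */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * sizeof(double);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write at an explicit offset */
    MPI_File_write_at_all(fh, offset, buf, nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}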
38. Trade-offs
- Ideally users want speed, portability, and usability
  - Speed: one file per processor
  - Portability: high-level IO library
  - Usability: a single shared file, and your own file format or a community file format layered on top of a high-level IO library
It isn't hard to have speed, portability, or usability. It is hard to have speed, portability, and usability in the same implementation.
39. Good I/O Performance with Simple I/O Patterns
- The file system is capable of high performance for shared files
  - Large-block sequential I/O
  - Transfer size a multiple of the stripe size
  - No metadata
40. More Complicated I/O Patterns
- Harder for the file system to handle
  - Smaller amounts of data (MBs/proc)
  - Transfer size not a multiple of the stripe width
  - Start offset doesn't match the stripe width
  - Strided data
- Can result in lower shared-file performance
41. Description of IOR
- Developed by LLNL, used for the Purple procurement
- Focuses on parallel/sequential read/write operations that are typical in scientific applications
- Can exercise one-file-per-processor or shared-file access for a common set of testing parameters
- Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5, and parallel-netCDF
- Parameterized parallel file access patterns to mimic different application situations
42. Benchmark Methodology
Focus on the performance difference between a single shared file and one file per processor
43. Small Aggregate Output Sizes: 100 MB - 1 GB
[Charts: one file per processor vs. shared file, rate in GB/sec, for aggregate file sizes of 100 MB and 1 GB; a peak-performance line is marked, and anything greater than it is due to caching effects or timer granularity]
Clearly the one-file-per-processor strategy wins in the low-concurrency cases, correct?
44. Small Aggregate Output Sizes: 100 MB - 1 GB
[Charts: one file per processor vs. shared file, time, for aggregate file sizes of 1 GB and 100 MB]
But when looking at absolute time, the difference doesn't seem so big...
45. Aggregate Output Size: 100 GB
[Charts: one file per processor vs. shared file, rate (GB/sec) and time (seconds), with the peak-performance line marked; annotations: 2.5 mins, 390 MB/proc, 24 MB/proc]
Is there anything we can do to improve the performance of the 4096-processor shared-file case?
46. Aggregate Output Size: 1 TB
[Charts: one file per processor vs. shared file, rate (GB/sec) and time (seconds); annotations: 3 mins, 976 MB/proc, 244 MB/proc]
Is there anything we can do to improve the performance of the 4096-processor shared-file case?
47. Recommendations
- Think about the big picture
  - Run-time vs. post-processing trade-off
  - Decide how much IO overhead you can afford
  - Data analysis
  - Portability
  - Longevity
    - h5dump works on all platforms
    - Can view an old file with h5dump
    - If you use your own binary format, you must keep track of not only your file format version but the version of your file reader as well
  - Storability
48. Recommendations
- Use a standard IO format, even if you are following a one-file-per-processor model
  - The one-file-per-processor model really only makes sense when writing out very large files at high concurrencies; for small files, the overhead is low
  - If you must do one-file-per-processor IO, then at least put it in a standard IO format so the pieces can be put back together more easily
- Splitting large shared files into a few files appears promising
  - An option for some users, but requires code changes and output format changes
  - Could be implemented better in IO library APIs
- Follow the striping recommendations
- Ask the consultants; we are here to help!
49. Questions?