Transcript and Presenter's Notes

Title: IO File Striping


1
IO File Striping
Katie Antypas, User Services Group
kantypas@lbl.gov
NERSC User Group Meeting, October 2, 2007
2
Outline
  • File Striping
  • Definition
  • Default Striping
  • Pros and Cons
  • One file-per-processor
  • Shared Files
  • Gotchas

3
Motivation
  • We wish users didn't have to deal with striping
  • Ideally it would be hidden from users
  • Unfortunately, performance is too critical to
    ignore it

4
What is File Striping?
  • The Lustre file system on Franklin is made up of an
    underlying set of parallel I/O servers
  • OSSs (Object Storage Servers) - nodes dedicated
    to I/O, connected to the high-speed torus interconnect
  • OSTs (Object Storage Targets) - software
    abstraction of physical disk (1 OST maps to 1
    LUN)
  • A file is said to be striped when read and write
    operations access multiple OSTs concurrently
  • Striping can increase I/O performance since
    writing or reading from multiple OSTs
    simultaneously increases the available I/O
    bandwidth

5
Franklin Configuration
[Diagram: Franklin compute and interactive nodes connect over the
torus to 20 OSSs (80 OSTs), which connect over an FC network to
5 DDN appliances (80 LUNs)]

The connectivity and configuration are laid out well for
parallelism: using 20 OSTs will spread I/O evenly over the 5 DDN
appliances.
6
Default Striping on /scratch
[Diagram: I/O servers OSS 0 through OSS 19, each serving four OSTs;
OSS n serves OSTs n, n+20, n+40, n+60 (e.g. OSS 0 serves OSTs
0, 20, 40, 60)]
  • 3 parameters characterize the striping pattern of a
    file
  • Stripe count
  • Number of OSTs the file is split across
  • Default is 4
  • Stripe size
  • Number of bytes to write on each OST before
    cycling to the next OST
  • Default is 1MB
  • OST offset
  • Indicates the starting OST
  • Default is round robin across all requests on the
    system
  • For example, with the defaults a 10 MB file is written
    in 1 MB chunks that cycle round robin over 4 OSTs

7
Max Bandwidth of OST and OSS
[Diagram: compute nodes connect over the torus network to the
storage servers, each OSS serving 4 OSTs]
Max bandwidth to a single OST: 350 MB/sec
Max bandwidth to a single OSS: 700 MB/sec
8
Default Stripe Count of 4 on /scratch
[Diagram: the 20 I/O servers (OSS 0-19) and their OSTs; a file with
the default stripe count is spread over 4 of the 80 OSTs]
  • Pros
  • Get 4 times the bandwidth you could get from using 1
    OST
  • Max bandwidth to 1 OST: 350 MB/sec
  • Using 4 OSTs: 1,400 MB/sec
  • Cons
  • For better or worse, your file now is in 4
    different places
  • Metadata operations like ls -l on the file
    could be slower
  • For small files (<100MB) there is no performance gain
    from striping

9
Why a stripe count of 4?
[Diagram: the 20 I/O servers (OSS 0-19) and their OSTs]
  • Balance
  • With a few important exceptions, it should work
    decently for most users
  • Protection
  • Each OST is backed by a physical disk (LUN)
  • A stripe count of 1 leaves us vulnerable to a single
    user writing out a huge amount of data and filling the
    disk
  • A stripe count of 4 is a reasonable compromise,
    although not good for large shared files or
    certain file-per-proc cases

10
Changing the Default Stripe Count
  • A number of applications will see benefits from
    changing the default striping
  • Striping can be set at a file or directory level
  • When striping is set on a directory, all files
    created in that directory will inherit the striping
    set on the directory
  • Stripe size - bytes written on each OST before
    cycling to the next OST
  • OST offset - indicates the starting OST
  • Stripe count - number of OSTs the file is split across

lstripe <directory|file> <stripe size> <OST offset> <stripe count>
lstripe mydirectory 0 -1 X
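
Striping can also be requested from inside a program when a file is
created. The following is a minimal sketch, not part of the original
slides, assuming the Lustre user-space library (liblustreapi) and its
llapi_file_create call are available on the system; the header name
and the output file name are illustrative.

  #include <stdio.h>
  #include <lustre/liblustreapi.h>  /* assumed header; newer installs ship <lustre/lustreapi.h> */

  int main(void)
  {
      /* Create "output.dat" striped over 4 OSTs with a 1 MB stripe size.
       * Arguments: name, stripe size (0 = default), starting OST
       * (-1 = let Lustre choose), stripe count, stripe pattern (0 = RAID0),
       * mirroring the lstripe arguments above. */
      int rc = llapi_file_create("output.dat",
                                 1048576,  /* stripe size in bytes */
                                 -1,       /* OST offset */
                                 4,        /* stripe count */
                                 0);       /* stripe pattern */
      if (rc != 0)
          fprintf(stderr, "llapi_file_create failed: %d\n", rc);
      return rc;
  }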
11
Parallel I/O Multi-file
[Diagram: processors 0 through 5 each write to their own file]
  • Each processor writes its own data to a separate
    file
  • Advantages
  • Simple to program
  • Can be fast -- (up to a point)
  • Disadvantages
  • Can quickly accumulate many files
  • Hard to manage
  • Requires post processing
  • Difficult for storage systems like HPSS to handle
    many small files

12
One File-Per-Processor IO with Default Striping
[Diagram: processors 0 through 40,000 write their own files over the
torus network to OSS 0 through OSS 19 and their OSTs]
  • With greater than roughly 20 processors/files,
    each striping over 4 OSTs, files begin to overlap
    with each other on OSTs
  • A stripe count of 4 is not helping the application get
    more overall bandwidth
  • A stripe count of 4 can lead to contention

13
One File-Per-Processor IO with Stripe Count of 1
[Diagram: processors 0 through 40,000 write their own files over the
torus network to OSS 0 through OSS 19, each file striped over a
single OST]
  • Use all OSTs but don't add more contention than
    is necessary

14
Recommendations for One-File-Per Processor IO
[Chart: recommendations by number of processors (more or fewer than
80) and by file size per processor (greater than 1 GB vs. 100s of
MB)]
15
Parallel I/O Single-file
[Diagram: processors 0 through 5 all write to a single shared file]
  • Each processor writes its own data to the same
    file using an MPI-IO mapping (a minimal sketch
    follows below)
  • Advantages
  • Single file
  • Manageable data
  • Disadvantages
  • Lower performance than one file per processor at
    some concurrencies
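
The following is a minimal sketch of this shared-file pattern, not
code from the presentation: every rank writes a contiguous block of
doubles at an offset computed from its rank, so the file holds the
logical global array no matter how many ranks wrote it. The file
name and block size are made up for illustration.

  #include <mpi.h>
  #include <stdlib.h>

  #define N 1048576              /* doubles written by each rank (illustrative) */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double *buf = malloc(N * sizeof(double));
      for (int i = 0; i < N; i++) buf[i] = rank;   /* fill with something */

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

      /* Each rank writes its block at an offset determined by its rank;
       * the collective call lets the MPI-IO layer aggregate the writes. */
      MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
      MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      free(buf);
      MPI_Finalize();
      return 0;
  }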

16
Shared File I/O with Default Stripe Count 4
[Diagram: processors 0 through 40,000 write a single shared file over
the torus network; with the default striping the file lives on 4 OSTs
out of the 20 OSSs]
  • All processors writing the shared file will write to
    4 OSTs
  • No matter how much data the application is
    writing, it won't get more than 1,400 MB/sec (4
    OSTs x 350 MB/sec)
  • Less sophisticated than you might think - no
    optimizations for matching a processor writer to
    the same OST
  • Need to use more OSTs for large shared files

17
Shared File I/O with Default Stripe Count 80
[Diagram: processors 0 through 40,000 write a single shared file
striped over all OSTs on OSS 0 through OSS 19]
  • Now striping over all 80 OSTs
  • Increased available bandwidth to the application
  • Theoretically 14 GB/sec (700 MB/sec max per OSS x 20
    OSSs)
  • In practice 11-12 GB/sec
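
Besides setting striping on the directory, an application can ask the
MPI-IO layer for wider striping when it creates the file. This sketch,
not from the original slides, uses the reserved MPI-IO hint names
"striping_factor" and "striping_unit"; implementations are free to
ignore hints, so whether a given system's MPI-IO honors them is an
assumption, and the values (80 stripes, 1 MB stripe size) simply
mirror this slide's recommendation.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* Ask the MPI-IO layer to create the file striped over 80 OSTs
       * with a 1 MB stripe size; hints only matter at file creation. */
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "striping_factor", "80");
      MPI_Info_set(info, "striping_unit", "1048576");

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "big_shared.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

      /* ... collective writes as in the earlier sketch ... */

      MPI_File_close(&fh);
      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }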

18
Reduced Writers to Single-file
[Diagram: a subset of the processors collect data and write it to a
single shared file]
  • On Franklin, best performance is seen when the number
    of writers matches the number of OSTs (80)
  • A subset of processors writes the data to a single file
  • This functionality is not yet available in XT4 MPI-IO
  • Advantages
  • Single file, manageable data
  • Better performance than all tasks writing for
    high concurrency jobs
  • Disadvantages
  • Requires changes to code
  • Application may not have enough memory to handle
    data from other processors

19
Recommendations
[Chart: recommendations by processor count (up to 512, 1024, 2048,
more than 4096) and aggregate file size (under 1 GB up to 1 TB).
Legend: single shared file with default striping (don't do anything,
the default striping is fine); single shared file striped over some
OSTs, e.g. 11 (lstripe dir 0 -1 11); single shared file striped over
all OSTs; try fewer writers]
20
Striping Summary
  • Franklin Default Striping
  • Stripe size - 1MB (enter 0 for default)
  • OST offset - round robin starting OST (enter
    -1 for default)
  • Stripe over 4 OSTs (stripe count 4)
  • One File-Per-Processor
  • lstripe mydir 0 -1 1
  • Large shared files
  • lstripe mydir 0 -1 80
  • Medium shared files
  • Experiment a little: 10-40 OSTs
  • lstripe mydir 0 -1 11

21
Gotchas
  • The file system is a shared resource, and a heavy I/O
    job can slow the system down for everyone
  • Write fewer, large blocks of data rather than many
    small chunks of data
  • Some MPI-IO features are under-optimized
  • Collective I/O
  • 2-Phase I/O
  • The Lustre file system is sensitive to write size and
    write offset
Please contact the consultants if you have low-performing I/O.
There may be something simple we can do to increase performance
substantially.
22
Cray/NERSC I/O Initiatives
  • Working to address and characterize I/O variation
  • Benchmarking runs
  • Lustre Monitoring Tool
  • Improving MPI-IO layer
  • Multiple people at Sun and Cray working on this
  • Discussions forming about improving HDF5
    performance on Lustre

23
Best Practices
  • Add the striping line to your batch script so you don't
    forget it!
  • Do large I/O: write fewer, big chunks of data
    (1MB or more) rather than small, bursty I/O
  • Do parallel I/O.
  • Serial I/O (single writer) cannot take advantage
    of the system's parallel capabilities.
  • Stripe large files over many OSTs.
  • If a job uses many cores, reduce the number of
    tasks performing IO (experiment with this number:
    80, 160, 320)
  • Use a single, shared file instead of 1 file per
    writer, esp. at high parallel concurrency.
  • Use an IO library API and write flexible,
    portable programs.

24
Extra Slides
25
Parallel I/O A User Perspective
  • Wish List
  • Write data from multiple processors into a single
    file
  • File can be read in the same manner regardless of
    the number of CPUs that read from or write to the
    file (e.g. we want to see the logical data layout,
    not the physical layout)
  • Do so with the same performance as writing
    one-file-per-processor (only writing
    one-file-per-processor because of performance
    problems)
  • And make all of the above portable from one
    machine to the next

26
I/O Formats
27
Common Storage Formats
  • ASCII
  • Slow
  • Takes more space!
  • Inaccurate
  • Binary
  • Non-portable (e.g. byte ordering and type sizes)
  • Not future proof
  • Parallel I/O using MPI-IO
  • Self-Describing formats
  • NetCDF/HDF4, HDF5, Parallel NetCDF
  • Example: the HDF5 API implements an Object DB model
    in a portable file
  • Parallel I/O using pHDF5/pNetCDF (hides MPI-IO)
  • Community File Formats
  • FITS, HDF-EOS, SAF, PDB, Plot3D
  • Modern implementations built on top of HDF,
    NetCDF, or other self-describing object-model APIs
28
HDF5 Library
HDF5 is a general purpose library and file format
for storing scientific data
  • Can store data structures, arrays, vectors,
    grids, complex data types, text
  • Can use basic HDF5 types (integers, floats, reals)
    or user-defined types such as multi-dimensional
    arrays, objects and strings
  • Stores the metadata necessary for portability -
    endian type, size, architecture

29
HDF5 Data Model
  • Groups
  • Arranged in a directory hierarchy
  • The root group is always /
  • Datasets
  • Dataspace
  • Datatype
  • Attributes
  • Bind to Groups and Datasets
  • References
  • Similar to softlinks
  • Can also be subsets of data
[Diagram: example HDF5 file hierarchy - the root group / with
attributes author="Jane Doe" and date="10/24/2006"; Dataset0 and
Dataset1 (each with a datatype and dataspace); attributes time=0.2345
and validity=None; and a subgroup subgrp containing Dataset0.1 and
Dataset0.2]
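
As an illustration of this model (not code from the presentation),
the sketch below uses the C API to build a small file with a dataset,
an attribute, and a subgroup, loosely mirroring the hierarchy in the
diagram above. It assumes the HDF5 1.8+ function names; the file and
object names are made up.

  #include <hdf5.h>

  int main(void)
  {
      double data[10] = {0};
      hsize_t dims[1] = {10};
      hsize_t one[1] = {1};
      double t = 0.2345;

      /* Create the file; the root group "/" comes with it. */
      hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

      /* Dataset0: a 1-D array of 10 doubles (dataspace + datatype). */
      hid_t space = H5Screate_simple(1, dims, NULL);
      hid_t dset = H5Dcreate2(file, "/Dataset0", H5T_NATIVE_DOUBLE, space,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

      /* Attribute "time" bound to the dataset. */
      hid_t aspace = H5Screate_simple(1, one, NULL);
      hid_t attr = H5Acreate2(dset, "time", H5T_NATIVE_DOUBLE, aspace,
                              H5P_DEFAULT, H5P_DEFAULT);
      H5Awrite(attr, H5T_NATIVE_DOUBLE, &t);

      /* A subgroup under the root group. */
      hid_t grp = H5Gcreate2(file, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

      H5Gclose(grp);
      H5Aclose(attr);
      H5Sclose(aspace);
      H5Dclose(dset);
      H5Sclose(space);
      H5Fclose(file);
      return 0;
  }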
30
A Plug for Self Describing Formats ...
  • Application developers shouldn't have to care about the
    physical layout of data
  • Using your own binary file format forces you to
    understand the layers below the application to get
    optimal IO performance
  • Every time the code is ported to a new machine, or the
    underlying file system is changed or upgraded, the
    user is required to make changes to improve IO
    performance
  • Let other people do the work
  • HDF5 can be optimized for given platforms and
    file systems by the HDF5 developers
  • The user can stay at the high level
  • But what about performance?

31
IO Library Overhead
Very little, if any, overhead from HDF5 for one-file-per-processor
IO compared to POSIX and MPI-IO (data from Hongzhang Shan)
32
Ways to do Parallel IO
33
Serial I/O
[Diagram: processors 0 through 5 send their data to the master
process, which writes a single file]
  • Each processor sends its data to the master, which
    then writes the data to a file (see the sketch after
    this list)
  • Advantages
  • Simple
  • May perform OK for very small IO sizes
  • Disadvantages
  • Not scalable
  • Not efficient; slow for any large number of
    processors or data sizes
  • May not be possible if memory constrained

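A minimal sketch of this serial pattern (illustrative, not from the
presentation): rank 0 gathers every rank's block and writes the file
alone, which is exactly why it stops scaling and why rank 0 can run
out of memory. The names and sizes are made up.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define N 1024                       /* doubles per rank (illustrative) */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      double local[N];
      for (int i = 0; i < N; i++) local[i] = rank;

      /* Rank 0 must hold the entire dataset: nprocs * N doubles. */
      double *all = NULL;
      if (rank == 0) all = malloc((size_t)nprocs * N * sizeof(double));

      MPI_Gather(local, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      if (rank == 0) {
          FILE *fp = fopen("serial_out.dat", "wb");
          fwrite(all, sizeof(double), (size_t)nprocs * N, fp);
          fclose(fp);
          free(all);
      }

      MPI_Finalize();
      return 0;
  }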
34
Parallel I/O Multi-file
[Diagram: processors 0 through 5 each write to their own file]
  • Each processor writes its own data to a separate
    file
  • Advantages
  • Simple to program
  • Can be fast -- (up to a point)
  • Disadvantages
  • Can quickly accumulate many files
  • Hard to manage
  • Requires post processing
  • Difficult for storage systems like HPSS to handle
    many small files

35
Flash Center IO Nightmare
  • Large 32,000 processor run on LLNL BG/L
  • Parallel IO libraries not yet available
  • Intensive I/O application
  • checkpoint files - .7 TB, dumped every 4 hours, 200
    dumps
  • used for restarting the run
  • full resolution snapshots of entire grid
  • plotfiles - 20GB each, 700 dumps
  • coarsened by a factor of two by averaging
  • single precision
  • subset of grid variables
  • particle files - 1400 particle files, 470MB each
  • 154 TB of disk capacity
  • 74 million files!
  • Unix tool problems
  • 2 years later, still trying to sift through data and
    sew files together

36
Parallel I/O Single-file
[Diagram: processors 0 through 5 all write to a single shared file]
  • Each processor writes its own data to the same
    file using MPI-IO mapping
  • Advantages
  • Single file
  • Manageable data
  • Disadvantages
  • Lower performance than one file per processor at
    some concurrencies

37
Parallel IO single file
[Diagram: processors 0 through 5 each write to a section of a single
data array]
Each processor writes to a section of a data
array. Each must know its offset from the
beginning of the array and the number of elements
to write
38
Trade offs
  • Ideally users want speed, portability and
    usability
  • speed - one file per processor
  • portability - high level IO library
  • usability
  • single shared file and
  • own file format or community file format layered
    on top of a high level IO library

It isn't hard to have speed, portability or
usability. It is hard to have speed, portability
and usability in the same implementation.
39
Good I/O Performance with Simple I/O Patterns
  • File system capable of high performance for
    shared files
  • Large block sequential I/O
  • Transfer size multiple of stripe size
  • No metadata

40
More complicated I/O patterns
  • Harder for the file system to handle
  • Smaller amounts of data (MBs/proc)
  • Transfer size not a multiple of the stripe width
  • Start offset doesn't match the stripe width
  • Strided data
  • Can result in lower shared file performance
41
Description of IOR
  • Developed by LLNL, used for the Purple procurement
  • Focuses on parallel/sequential read/write
    operations that are typical in scientific
    applications
  • Can exercise one-file-per-processor or shared
    file access for a common set of testing parameters
  • Exercises an array of modern file APIs such as
    MPI-IO, POSIX (shared or unshared), HDF5 and
    parallel-netCDF
  • Parameterized parallel file access patterns to
    mimic different application situations

42
Benchmark Methodology
Focus on the performance difference between a single shared file
and one file per processor
43
Small Aggregate Output Sizes 100 MB - 1GB
One File per Processor vs Shared File - GB/Sec
[Charts: write rate for aggregate file sizes of 100 MB and 1 GB; a
peak-performance line marks the hardware limit - anything above it
is due to caching effects or timer granularity]
Clearly the one-file-per-processor strategy wins in the low
concurrency cases, correct?
44
Small Aggregate Output Sizes 100 MB - 1GB
One File per Processor vs Shared File - Time
[Charts: the same comparison plotted as absolute time for aggregate
file sizes of 100 MB and 1 GB]
But when looking at absolute time, the difference
doesn't seem so big...
45
Aggregate Output Size 100GB
One File per Processor vs Shared File
[Charts: rate (GB/sec) and time (seconds) with a peak-performance
line; roughly 2.5 minutes, with per-processor data sizes of 390 MB
and 24 MB]
Is there anything we can do to improve the
performance of the 4096-processor shared-file
case?
46
Aggregate Output Size 1TB
One File per Processor vs Shared File
[Charts: rate (GB/sec) and time (seconds); roughly 3 minutes, with
per-processor data sizes of 976 MB and 244 MB]
Is there anything we can do to improve the
performance of the 4096-processor shared-file
case?
47
Recommendations
  • Think about the big picture
  • Run time vs. post-processing trade-off
  • Decide how much IO overhead you can afford
  • Data Analysis
  • Portability
  • Longevity
  • h5dump works on all platforms
  • Can view an old file with h5dump
  • If you use your own binary format you must keep
    track of not only your file format version but
    the version of your file reader as well
  • Storability

48
Recommendations
  • Use a standard IO format, even if you are
    following a one-file-per-processor model
  • The one-file-per-processor model really only makes
    sense when writing out very large files at high
    concurrencies; for small files the overhead is low
  • If you must do one-file-per-processor IO, then at
    least put it in a standard IO format so the pieces
    can be put back together more easily
  • Splitting large shared files into a few files
    appears promising
  • An option for some users, but requires code changes
    and output format changes
  • Could be implemented better in IO library APIs
  • Follow the striping recommendations
  • Ask the consultants, we are here to help!

49
Questions?