1
ASC PI Meeting: I/O and File Systems R&D and
Deployment
  • Lee Ward, Sandia
  • Bill Boas, LLNL
  • Gary Grider, LANL

02/2005
2
The Primary Driver: A Balanced System Approach
[Chart: computing speed (FLOP/s), memory (TeraBytes), disk (TeraBytes), parallel I/O (GigaBytes/sec), archival storage (GigaBytes/sec), and network speed (Gigabits/sec) scaled against application performance and year for ASCI programs.]
Computational resource scaling for ASCI physics applications: 1 FLOP/s peak compute; 0.5 Byte/FLOP/s memory; 0.001 Byte/s/FLOP/s peak parallel I/O; 0.008 bit/s/FLOP/s peak network; 0.0001 Byte/s/FLOP/s archive.
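To make these balance ratios concrete, here is a minimal sketch that applies them to a hypothetical 100 TFLOP/s peak platform; the machine size is an assumption for illustration, while the ratios are the ones stated above.

```c
#include <stdio.h>

/* Balance ratios from the slide (per FLOP/s of peak compute):
 *   memory:       0.5    Byte/FLOP/s
 *   parallel I/O: 0.001  Byte/s per FLOP/s  (i.e., 1 GB/s per TFLOP/s)
 *   network:      0.008  bit/s per FLOP/s
 *   archive:      0.0001 Byte/s per FLOP/s
 */
int main(void)
{
    double peak_flops = 100e12;   /* assumed: 100 TFLOP/s peak machine */

    double memory_bytes    = 0.5    * peak_flops;  /* capacity, Bytes  */
    double pio_bytes_s     = 0.001  * peak_flops;  /* bandwidth, B/s   */
    double net_bits_s      = 0.008  * peak_flops;  /* bandwidth, bit/s */
    double archive_bytes_s = 0.0001 * peak_flops;  /* bandwidth, B/s   */

    printf("Memory:       %.1f TB\n",   memory_bytes    / 1e12);
    printf("Parallel I/O: %.1f GB/s\n", pio_bytes_s     / 1e9);
    printf("Network:      %.1f Gb/s\n", net_bits_s      / 1e9);
    printf("Archive:      %.1f GB/s\n", archive_bytes_s / 1e9);
    return 0;
}
```

At 100 TFLOP/s peak, the ratios work out to roughly 50 TB of memory, 100 GB/s of parallel I/O, 800 Gb/s of network, and 10 GB/s of archive bandwidth.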
3
FS Requirements Summary
  • From the Tri-Lab File System Path Forward RFQ (which
    came from the Tri-Lab file systems requirements
    document), ftp://ftp.lanl.gov/public/ggrider/ASCIFSRFP.DOC
  • POSIX-like interface, works well with MPI-IO,
    open protocols, open source (parts or all), no
    single point of failure, global access
  • Global name space
  • Scalable bandwidth, metadata, management, and
    security: 1 GB/sec per TFLOP, 1000s of
    metadata ops/sec
  • WAN access, global identities, WAN security
  • Manage, tune, diagnose, statistics, RAS, build,
    document, snapshot
  • Authentication, authorization, logging

4
Enterprise Wide Scalable Sharing
[Diagram: a central global parallel file system shared enterprise-wide. Linux compute clusters, Linux viz clusters, Altix systems, the archive, and analysis/visualization tools mount it natively; non-Linux viz/analysis clusters and workstations reach it via NFS and CIFS.]
5
Vendor Collaborations
  • Solution for Linux clusters and enterprise-class
    heterogeneous global parallel file systems
  • HP/CFS/Intel Lustre Path Forward for an object-based,
    secure, global parallel file system, 2002-present
  • Very scalable bandwidth, good non-scaled metadata
  • Being deployed/used at LLNL and Sandia
  • Starting metadata scaling work: finishing CMD-1
    (4 metadata servers), starting CMD-2 (>15 metadata
    servers)
  • Panasas, 2000-present
  • Very scalable bandwidth, good non-scaled metadata
  • Being deployed/used at LANL and LLNL
  • Multi-metadata servers by volume done; arbitrary
    multiple metadata servers this year

6
With only vendor relationships, there are still
gaps in our strategy
  • There may not be a native client to new OBSD file
    systems (more heterogeneous access needed), and we
    need WAN access to OBSD file systems
  • We need to deal with the fact that file systems
    are based on a geometry that the application may
    not be able to easily conform its memory to
  • Scalable metadata and security have not been done
    before in the most general sense
  • How would we integrate Archive/HSM into this
    common enterprise-class OBSD FS?

7
Our University Partnerships
  • Heterogeneous access and WAN access (University
    of Michigan)
  • Secure file sharing with access control and
    need-to-know as a replacement for DFS, fitting into
    the ICSE infrastructure for LAN and WAN
  • Provide a heterogeneous, near-native client to OBFS
  • Showing up in Linux 2.6; first Kerberos/GSS file
    system access shown Jan 2005
  • Aligning application memory with file system
    geometry (Northwestern University and ANL)
  • Move coordination of collaborative caching into
    MPI-IO, out of the file system, for overlapped,
    unaligned, small I/O (a sketch follows this list)
  • This is a good position to be in, as we have been
    building plumbing for years, now we are working
    with applications to enable exploitation of these
    new file systems
  • Showing up in ROMIO/MPI-IO
  • Metadata and Security scaling (UC Santa Cruz,
    Lustre, Panasas, and Sandia)
  • Squashing/replicating permissions and directory
    decomposition caching studies
  • Panasas and Lustre working on metadata
    decomposition and scaling with current ideas
  • Scaling metadata performance into the petaflops
    range looks problematic!
  • Minnesota (Intelligent Storage Consortia) and STK
  • Can we leverage both the object based parallel
    file system technologies and commercial (non
    parallel) archive products to provide a scalable
    parallel archive that is or could be integrated
    tightly into a globally shared scalable parallel
    file system?
  • What would need to change in these global
    parallel file systems and commercial
    backup/archive/HSM products to enable this
    capability?
  • Work just beginning.
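As a hedged illustration of pushing geometry handling into MPI-IO rather than the application (the sketch referenced in the Northwestern/ANL item above): each rank describes its block of a global array with a subarray datatype and lets the library map memory onto the file layout. The array sizes, the four-rank decomposition, and the file name are assumptions for illustration.

```c
#include <mpi.h>

/* Minimal sketch, run with 4 ranks: each rank owns a 2x4 block of an
 * 8x4 global array of ints; a subarray file view lets MPI-IO handle
 * the file-side geometry instead of the application. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int gsizes[2] = {8, 4};          /* global array (rows, cols)   */
    int lsizes[2] = {2, 4};          /* this rank's block           */
    int starts[2] = {2 * rank, 0};   /* block offset for this rank  */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    int local[8];                    /* the 2x4 local block */
    for (int i = 0; i < 8; i++) local[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "global_array.dat",  /* hypothetical file */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, 8, MPI_INT, MPI_STATUS_IGNORE); /* collective */
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}
```

Because the write is collective, the MPI-IO layer is free to reorganize the ranks' pieces into large, well-aligned file system requests.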

8
POSIX I/O Additions
  • Most of the current issues with POSIX I/O
    semantics lie in the lack of support for
    distributed/parallel processes
  • Concepts that involve implied ordering need
    alternative verbs that do not imply ordering
    (a hedged sketch follows this list)
  • Vectored I/O read/write calls that don't imply
    ordering
  • Issues with extending the end of file
  • Group opens
  • Etc.
  • Concepts that involve serialization for strict
    POSIX metadata query/update need to have lazy
    POSIX alternatives
  • Last update times (mtime, atime, ctime), size of
    file
  • Active storage
  • Status
  • Pursuing the Labs joining The Open Group, which
    holds the current One Unix charter that merges
    IEEE, ANSI, and POSIX standards and evolves the
    POSIX standard for UNIX
  • Next Step is to write up proposed new changes and
    begin discussion process within the POSIX UNIX IO
    API working group in the Open Group forum.
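A minimal header-style sketch of what the two kinds of additions above might look like; every name, type, and constant here (io_extent, readx, writex, stat_lazy, the LAZY_* flags) is a hypothetical illustration, not the text of any actual proposal.

```c
/* Hypothetical sketch only: illustrative names and signatures for the
 * two kinds of POSIX additions described above. */
#include <sys/types.h>
#include <time.h>

/* Vectored read/write calls whose element transfers carry no implied
 * ordering, so a parallel file system may service them in any order. */
struct io_extent { off_t offset; size_t len; void *buf; };
ssize_t readx(int fd, struct io_extent *ext, int next);        /* unordered reads  */
ssize_t writex(int fd, const struct io_extent *ext, int next); /* unordered writes */

/* A "lazy" stat: the caller masks which attributes must be strictly
 * current; the rest may be stale, letting servers skip serialization. */
#define LAZY_SIZE   0x1
#define LAZY_MTIME  0x2
#define LAZY_ATIME  0x4
struct stat_lazy {
    unsigned        strict_mask;   /* attributes the caller needs exact */
    off_t           st_size;
    struct timespec st_mtim, st_atim, st_ctim;
};
int stat_lazy(const char *path, unsigned strict_mask, struct stat_lazy *out);
```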

9
Government multi-agency I/O coordination
  • Authored a joint R&D needs document for the next
    five years of needed I/O and file systems R&D work
    (DOE NNSA/Office of Science, DOD NSA)
  • Working with DARPA on how to spin up more I/O R&D
    for HPC (as part of the high-productivity project)
  • Working with NSF on how to spin up more I/O R&D
    for HPC (under the HECURA umbrella or not)




10
Are we headed towards our vision?
  • Lustre in use, in a programmatic role at LLNL and
    Sandia
  • Panasas in use, in programmatic and support
    roles at LANL, Sandia, and LLNL
  • Academic and other partnerships paying off
  • ANL and Northwestern work going into ROMIO:
    aggregation, two-phase I/O, sieving, better
    caching coming
  • NFSv4 getting into Linux
  • We showed multi-realm Kerberized controlled
    access to a file system via NFSv4
  • UCSC ideas beginning to become appealing to our
    file system partners
  • Implementing first-generation enterprise-class
    common parallel file systems, shared between
    multiple clusters (Linux natively, others via
    NFSv3)
  • pNFS entering the IETF to allow non-Linux clients
    to access object file systems natively
  • MPI-IO beginning to be used in more apps
  • Future work spinning up: POSIX API, object
    archive, and multi-agency I/O R&D investments

11
LANL Deployments
  • Turquoise: Institutional and Alliance computing
  • Yellow: Unclassified programmatic computing
  • Red: Classified programmatic computing

12
Current Turquoise Configuration
[Diagram:]
  • Pink: 1,916 Xeon procs, 64 I/O nodes, Myrinet and Gig-E
  • TLC: 224 AMD64 procs (112 compute nodes), 16 I/O nodes, Myrinet
  • Mauve: Altix cluster, 256 procs
  • Panasas: 16 storage shelves, 80 TB, 6.4 GB/s
  • Future institutional machine, FY05
13
Current Yellow Configuration
[Diagram:]
  • Flash: 728 AMD64 procs, 64 I/O nodes, Myrinet and Gig-E
  • Panasas: 8 storage shelves, 40 TB, 3.2 GB/s
  • Future Viz cluster, FY05/06
  • Future direct HPSS movement agents, FY05
14
Red Configuration in a couple of weeks
[Diagram:]
  • Lightning: 3,072 AMD64 procs, 128 I/O nodes, Myrinet and Gig-E
  • Panasas: 48 storage shelves, 200 TB, 20 GB/s
  • Future Viewmaster Viz cluster, FY05/06
  • Future capacity machine, FY05/06
  • Future direct HPSS movement agents (FY05 ASAP)
15
PaScalBB Parallel Scalable Backbone
[Diagram: compute nodes in Cluster A (interconnect A), Cluster B (interconnect B), and Cluster C (interconnect C) reach a single file system through their I/O nodes, Layer 2 Ethernet switches, and core switches; the core switches also connect to the archive and to the site network.]
16
Pink Data Throughput
17
Pink Metadata Throughput
18
Sandia Deployments
  • Lustre tested in two environments; GA achieved or
    imminent
  • ICC NWCC capacity machines
  • Feynman
  • Lustre intended for Red Storm and in test on the
    hardware
  • Panasas in production test on Feynman

02/2005
19
Sandia Capacity Machines
  • EA version of SFS on Liberty and Spirit
  • Successful limited availability trial
  • Withdrawn to upgrade hardware and software
  • Hardware now in place and being verified for all
    (ICC NWCC)
  • Installing SFS 1.1-1 on all
  • Based on CFS 1.2.3 Lustre
  • Sandia pre-production verification currently in
    progress

20
Sandia Feynman
  • CFS 1.4 Lustre deployed
  • GA on this machine and has supported science runs
  • Adding additional disk and OSS nodes to expand
    capability
  • Intended as primary scratch space
  • Four shelves of Panasas deployed
  • GA on this machine
  • At least one of these will be used as the
    environment morphs into ROSE

21
Sandia Red Storm
  • CFS 1.4 server and a heavily modified client
    tested on Red Squall
  • Cray beginning to bring it up on Red Storm
    hardware in place at Sandia now

22
LLNL AC and Open Compute Resources
23
LLNL Open Computing Facility - 2005
[Diagram of the OCF architecture; recoverable details:]
  • Thunder: 1,024-port QsNet Elan4, 1,004 Quad Itanium2 compute nodes, 18.7 TFLOP/s LP
  • MCR: 1,152-port QsNet Elan3, 1,114 Dual P4 compute nodes; ALC: 960-port QsNet Elan3, 924 Dual P4 compute nodes; 7.6 and 6.5 TFLOP/s LP
  • BG/L: 65,536 Dual PowerPC 440 compute nodes, 360 TFLOP/s peak, 1,024 I/O nodes (PPC440), torus and global tree/barrier networks
  • PVC: 128-port Elan3; VIZ; 52 Dual P4 and 6 Dual P4 node groups
  • Login/gateway tiers: 4 login nodes with 6 Gb-Enet and 16 gateway nodes at 350 MB/s delivered Lustre I/O over 4x1GbE; 4 login nodes and 2 login nodes with 4 Gb-Enet, each set with 32 gateway nodes at 190 MB/s delivered Lustre I/O over 2x1GbE; 2 further login nodes
  • Lustre storage with MDS and gateway (GW) nodes: 64 OSS / 180 TB SATA, 224 OSS / 900 TB SATA, 192 OSS / 258 TB FC, housed in buildings B451, B453, and B439
  • Federated Gigabit Ethernet: 128, 386, 448, and 1,152 GigE port groups; 6509 Ethernet switches with 1 or 10 Gig ports; bandwidth figures of 9.6, 12.5, 19.2, 25, 36, and 45 GB/s at various points
  • HPSS Archive, NAS home directories, LLNL external backbone
BB/MKS Version 9, Feb 3, 2005
24
Steps to SWGFS in OCF - 1 of 2
  • 1. Deploy all necessary network connections for BG/L
    IONs, OSS, and network MDS
  • 2. Migrate all MDS from Elan 3/4 in MCR, ALC-P, and
    Thunder to the Federated Gigabit Ethernet
    infrastructure
  • 3. FY05 CFS Development SOW includes one generation
    of backward compatibility, i.e., Version 1.2 to 1.4,
    for protocols and file formats
  • 4. Finalize features in Lustre Version 1.4 for
    testing in the Testbed and on ALC-T
  • 5. Finalize the plan for CHAOS migration to Linux 2.6;
    first deploy CHAOS 2.6 and Lustre 1.4 on
    ALC-T and the IO Testbed

25
Steps to SWGFS in OCF - 2 of 2
  • 6. Proposal to Users for cross-mounting
  • ALC-P sees /p/gm1, gm2, gT1
  • MCR sees /p/gm1, gT1
  • Thunder sees /p/gm1, gm2, gA1
  • PVC/Sphere sees /p/gm1, gm2, gA1, gT1
  • Review with Users to gain approval and prepare
    scripts
  • 7. Deploy SWGFS Phase 1 on MCR, PVC/Sphere, ALC-P
    and Thunder
  • 8. Migrate Lustre from Version 1.2 to 1.4
  • 9. Migrate CHAOS in production to Linux 2.6
  • 10. Cross-mount BG/L and SWGFS Phase 1 as
    above to complete SWGFS Phase 2, which includes BG/L
  • 11. pNFS may be the direction

26
Strawman SWGFS Testing and Deployment Schedule - OCF
[Gantt-style chart: deployment steps 1-10 from the previous two slides plotted by quarter across FY04-FY06 for ALC-T, ALC-P, BG/L Vis, BG/L-O, Thunder, Sphere, PVC, MCR, and CHAOS. Key: test/development, SWGFS, science runs, limited-availability production, general-availability production; numbers mark the step. A companion SCF chart covers White, UM, UV, Purple (Purpura), and Lilac.]
27
LLNL Open Computing Facility - SWGFS Deployed
[Diagram: the same OCF architecture as the 2005 diagram above, shown with SWGFS deployed; ALC-P now appears as a 480-port QsNet Elan3 system, and the compute, login/gateway, MDS, OSS, and network details are otherwise unchanged.]
BB/MKS Version 9, Feb 3, 2005
28
SWGFS in SCF
  • White: AIX/GPFS
  • UM (magenta): AIX/GPFS
  • UV (violet): AIX/GPFS
  • Purple: GPFS by Q1 FY06
  • Lilac: Linux/Lustre Version 1.2
  • GViz: Linux/Lustre in FY05 Q2
  • Some hints that IBM may open up GPFS

29
Backup slides
30
Example of poor alignment and extremely
small write problems
[Diagram: Processes 1-4 each hold small interleaved pieces (blocks 11-44) of a parallel file; the file is striped across RAID Group 1, RAID Group 2, and RAID Group 3, so each process's pieces land scattered across the groups.]
Very small, unbalanced, and unaligned writes, much
smaller than a full RAID stripe: every process
queues up to read/update/write a partial block, all
serialized on RAID Group 1; then everyone moves on
and does it again, eventually serializing on RAID
Group 2, and so on.
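For contrast with the middleware approach on the next slide, a minimal sketch of the naive pattern described above: each process independently issues many tiny, unaligned writes at interleaved offsets, so every write forces a partial-stripe read-modify-write and the processes serialize on one RAID group after another. The 1000-byte record size, the process/piece counts, and the file name are assumptions for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Naive pattern: each of NPROCS processes writes NPIECES small,
 * unaligned records at interleaved offsets. Every write touches only
 * part of a RAID stripe, forcing read-modify-write, and the processes
 * serialize on one RAID group after another. */
#define PIECE   1000   /* 1000-byte records: small and unaligned */
#define NPROCS  4
#define NPIECES 4

void naive_writes(int my_rank)
{
    int fd = open("parallel_file.dat", O_WRONLY | O_CREAT, 0644);
    char buf[PIECE];
    memset(buf, my_rank, PIECE);

    for (int i = 0; i < NPIECES; i++) {
        /* Piece i of this rank lands at a small, interleaved offset. */
        off_t off = (off_t)(i * NPROCS + my_rank) * PIECE;
        pwrite(fd, buf, PIECE, off);   /* independent, tiny, unaligned write */
    }
    close(fd);
}
```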
31
Apps can utilize smart SCS
investments in Middleware (MPI-IO)
[Diagram: the same interleaved pieces from Processes 1-4 now flow through collective-buffering (CB) processes, which aggregate them and write large, aligned chunks to the parallel file across RAID Groups 1-4.]
Thanks to the SCS Argonne and Northwestern work,
collective two-phase buffered I/O helps align,
balance, and aggregate.
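A minimal sketch of the same interleaved pattern routed through MPI-IO collective writes, with ROMIO's collective-buffering hints enabled so aggregator ranks gather the small pieces into large, aligned requests. The hint names are the documented ROMIO ones; the buffer size, record size, and file name are assumptions for illustration.

```c
#include <mpi.h>
#include <string.h>

#define PIECE   1000
#define NPIECES 4

/* Same interleaved data as the previous sketch, but written collectively
 * so ROMIO's two-phase collective buffering can aggregate it into large,
 * aligned I/O. Assumes MPI_Init has already been called. */
void collective_writes(void)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB aggregation buffer   */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "parallel_file.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    char buf[PIECE];
    memset(buf, rank, PIECE);
    for (int i = 0; i < NPIECES; i++) {
        MPI_Offset off = (MPI_Offset)(i * nprocs + rank) * PIECE;
        /* Collective: all ranks participate, aggregators write big chunks. */
        MPI_File_write_at_all(fh, off, buf, PIECE, MPI_BYTE, MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
    MPI_Info_free(&info);
}
```

Each rank still owns the same small pieces, but the collective call lets the library perform the two-phase aggregation shown in the diagram.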
32
Historical Time Line for ASC I/O R&D
[Timeline chart spanning 1999-2004, with milestones including:]
  • Path Forward proposal with OBSD vendor; Panasas born
  • Proposed Path Forward activity for SGPFS
  • pNFS effort begins
  • Alliance contracts placed with universities on OBSD; small/unaligned/unbalanced/overlapped I/O; and NFSv4
  • Propose initial architecture
  • Path Forward project to pursue RFI/RFQ and analysis; recommend funding open-source OBSD development and NFSv4 projects
  • Build initial requirements document
  • U of Minn Object Archive begins
  • SGPFS workshop ("You are crazy")
  • Path Forward Lustre project born
  • "Let's re-invent POSIX I/O?"
  • Tri-Lab joint requirements document complete
33
Why Panasas?
  • Needed a true production file system for Pink
    (FY2003) and Lightning (early FY2004)
  • Request for Proposal for the file system
  • RFP in early 2003, based on the file systems Path
    Forward RFP
  • Panasas had the strongest proposal
  • No real Lustre bidders (HP said it was too early for
    Lustre to step up to a production role)
  • Acceptance on Pink in FY2003; acceptance on
    Lightning in early FY2004
  • Passed all acceptance tests
  • Maximums: file size, number of files, etc.
  • I/O performance: N-to-N and N-to-1 read/write
    bandwidth
  • Metadata performance: creates, deletes,
    stats/sec
  • RAS: graceful reaction to catastrophic events
  • Uptime
  • Will we re-evaluate? Yes, when it makes sense;
    that is our job