Title: The Lustre Storage Architecture - Peter J. Braam, Tim Reddin
1 The Lustre Storage Architecture
Linux Clusters for Super Computing, Linköping 2003
- Peter J. Braam, Tim Reddin
- braam@clusterfs.com, tim.reddin@hp.com
- http://www.clusterfs.com
2 Topics
- History of project
- High level picture
- Networking
- Devices and fundamental APIs
- File I/O
- Metadata recovery
- Project status
- Cluster File Systems, Inc
3 Lustre's History
4 Project history
- 1999: CMU, Seagate
- Worked with Seagate for one year
- Storage management, clustering
- Built prototypes, much design
- Much survives today
5 2000-2002: File system challenge
- First put forward Sep 1999, Santa Fe
- New architecture for the National Labs
- Characteristics
- 100s of GB/sec of I/O throughput
- trillions of files
- 10,000s of nodes
- Petabytes
- From the start, Garth & Peter in the running
6 2002-2003: fast lane
- 3-year ASCI Path Forward contract
- with HP and Intel
- MCR & ALC, 2x 1000-node Linux clusters
- PNNL: HP IA-64, 1000-node Linux cluster
- Red Storm, Sandia (8000 nodes, Cray)
- Lustre Lite 1.0
- Many partnerships (HP, Dell, DDN, …)
7 2003: Production, performance
- Spring and summer
- LLNL MCR: from no, to partial, to full-time use
- PNNL similar
- Stability much improved
- Performance
- Summer 2003: I/O problems tackled
- Metadata much faster
- Dec/Jan
- Lustre 1.0
8 High level picture
9 Lustre Systems: Major Components
- Clients
- Have access to file system
- Typical role: compute server
- OST
- Object storage targets
- Handle (stripes of, references to) file data
- MDS
- Metadata request transaction engine.
- Also LDAP, Kerberos, routers etc.
10 Linux OST servers with disk arrays (diagram): Lustre clients (1,000 with Lustre Lite, up to 10,000s eventually) reach the Lustre Object Storage Targets (OSTs) over networks such as QSW Net and GigE; the OSTs are Linux servers with disk arrays on a SAN, alongside 3rd-party OST appliances.
11 (diagram) The interactions in the system: configuration information, network connection details and security management; directory operations, meta-data and concurrency; file I/O and file locking; recovery, file status and file creation.
12 Networking
13 Lustre Networking
- Currently runs over
- TCP
- Quadrics Elan 3 & 4
- Lustre can route & can use heterogeneous nets
- Beta
- Myrinet, SCI
- Under development
- SAN (FC/iSCSI), I/B
- Planned
- SCTP, some special NUMA and other nets
14 Lustre Network Stack - Portals (layer diagram)
- Lustre request layer: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure
- Portals (Sandia's API, CFS improved implementation): moves small & large buffers, remote DMA handling, generates events
- Network Abstraction Layer (NAL) for TCP, QSW, etc.: small & hard; includes routing API
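To make the division of labour concrete, here is a minimal C sketch of the kind of interface a NAL has to provide, under the assumption that it only moves buffers, handles remote DMA and raises events; the type and field names below are illustrative, not the actual Portals or Lustre NAL headers.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t nid_t;                 /* network-wide node id                 */

    struct nal_event {                      /* completion events handed to Portals  */
        int    type;                        /* send done, receive done, RDMA done   */
        size_t length;
        nid_t  initiator;
    };

    struct nal_ops {
        /* move small and large buffers to a peer (possibly through a router)   */
        int (*send)(nid_t peer, const void *buf, size_t len);
        /* remote DMA: land incoming data directly in a registered buffer       */
        int (*rdma_recv)(nid_t peer, void *buf, size_t len);
        /* generate events so the layers above never poll the wire directly     */
        int (*next_event)(struct nal_event *ev);
    };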
15 Devices and APIs
16 Lustre Devices & APIs
- Lustre has numerous driver modules
- One API - very different implementations
- Driver binds to named device
- Stacking devices is key
- Generalized object devices
- Drivers currently export several APIs
- Infrastructure - a mandatory API
- Object Storage
- Metadata Handling
- Locking
- Recovery
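A minimal, hypothetical C sketch of the two ideas above: a driver binds to a named device exposing one common ops table, and devices stack by pointing at the device below them. None of these names are the real Lustre OBD interface; they only illustrate the pattern.

    #include <stdio.h>

    struct obd_dev_sketch;

    struct obd_ops_sketch {                      /* one API ...                      */
        int (*read)(struct obd_dev_sketch *dev, long objid);
    };

    struct obd_dev_sketch {
        const char                  *name;       /* driver binds to a named device   */
        const struct obd_ops_sketch *ops;        /* ... very different back ends     */
        struct obd_dev_sketch       *below;      /* stacking: logical over physical  */
    };

    /* A logical driver can simply forward a call to the device it stacks on. */
    static int sketch_read(struct obd_dev_sketch *dev, long objid)
    {
        if (dev->ops && dev->ops->read)
            return dev->ops->read(dev, objid);
        return dev->below ? sketch_read(dev->below, objid) : -1;
    }

    int main(void)
    {
        struct obd_dev_sketch osc = { "osc-ost1", NULL, NULL };
        struct obd_dev_sketch lov = { "lov-fs1",  NULL, &osc };  /* LOV over an OSC */
        printf("%s stacks on %s, read -> %d\n",
               lov.name, lov.below->name, sketch_read(&lov, 42));
        return 0;
    }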
17 Lustre Client APIs (diagram): the client uses the data/object API and the metadata API, each paired with a lock API.
18 Object Storage API
- Objects are (usually) unnamed files
- Improves on the block device API
- create, destroy, setattr, getattr, read, write
- OBD driver does block/extent allocation
- Implementation
- Linux drivers, using a file system backend
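The operation list above maps naturally onto a small vtable. The following is a hedged C sketch of such an interface (the type and member names are mine, not the Lustre OBD headers): everything is keyed by an object id, and block/extent allocation stays inside the driver.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    struct obj_attr_sketch { uint64_t size; uint32_t mode; int64_t mtime; };

    /* Objects are (usually) unnamed files addressed by id; the implementation
     * behind this table does its own block/extent allocation on the backend FS. */
    struct obj_storage_api_sketch {
        int     (*create)(void *target, uint64_t *objid_out);
        int     (*destroy)(void *target, uint64_t objid);
        int     (*setattr)(void *target, uint64_t objid, const struct obj_attr_sketch *a);
        int     (*getattr)(void *target, uint64_t objid, struct obj_attr_sketch *a);
        ssize_t (*read)(void *target, uint64_t objid, void *buf, size_t len, off_t off);
        ssize_t (*write)(void *target, uint64_t objid, const void *buf, size_t len, off_t off);
    };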
19 Bringing it all together (diagram)
- Lustre client: Lustre client file system with metadata WB cache, OSCs, MDC and lock client, over request processing, NIO API, Portal library, Portal NALs and the networking device (Elan, TCP, …); recovery
- Client to OST traffic: system & parallel file I/O, file locking
- Client to MDS traffic: directory metadata & concurrency
- MDS to OST traffic: recovery, file status, file creation
- OST: networking, recovery, object-based disk server (OBD server), lock server, backing file system (Ext3, Reiser, XFS, …), Fibre Channel storage
- MDS: networking, recovery, lock server, backing file system, Fibre Channel storage
20 File I/O
21 File I/O: Write Operation
- Open the file on the meta-data server
- Get information on all objects that are part of the file
- Object ids
- Which storage controllers (OSTs)
- Which part of the file (offset)
- Striping pattern
- Create LOV, OSC drivers
- Use connection to OST
- Object writes go to the OSTs
- No MDS involvement at all
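The write path just listed can be sketched in a few lines of C. Every type and function below (lov_layout, mds_open, ost_write, lustre_write_sketch) is invented for illustration and is not the real Lustre client code; it only shows the shape of the flow: one MDS round trip for the layout, then object writes straight to the OSTs.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct lov_stripe { int ost_index; uint64_t objid; };
    struct lov_layout { int stripe_count; struct lov_stripe stripe[4]; };

    /* stand-in for the single MDS round trip: returns object ids, their OSTs
     * and the striping pattern */
    static int mds_open(const char *path, struct lov_layout *l)
    {
        (void)path;
        l->stripe_count = 2;
        l->stripe[0] = (struct lov_stripe){ 1, 1001 };   /* obj1 lives on OST 1 */
        l->stripe[1] = (struct lov_stripe){ 3, 1002 };   /* obj2 lives on OST 3 */
        return 0;
    }

    /* stand-in for an OSC sending an object write straight to one OST */
    static int ost_write(int ost, uint64_t objid, const void *buf, size_t len)
    {
        (void)buf;
        printf("write %zu bytes to object %llu on OST %d\n",
               len, (unsigned long long)objid, ost);
        return 0;
    }

    /* the write path from the slide: one MDS open, then direct object writes */
    static int lustre_write_sketch(const char *path, const void *buf, size_t len)
    {
        struct lov_layout l;
        if (mds_open(path, &l) != 0)                 /* open file on the MDS     */
            return -1;
        for (int i = 0; i < l.stripe_count; i++)     /* LOV/OSC split the data   */
            ost_write(l.stripe[i].ost_index, l.stripe[i].objid, buf, len);
        return 0;                                    /* no MDS involvement in I/O */
    }

    int main(void)
    {
        char data[8] = { 0 };
        return lustre_write_sketch("/mnt/pt/file", data, sizeof data);
    }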
22 Write operation (diagram): the Lustre client (client file system, MDC, LOV, OSC 1, OSC 2) sends the file open request to the meta-data server (MDS with its file system) and gets back the file meta-data, e.g. inode A = (O1, obj1), (O3, obj2); the client then issues write(obj 1) and write(obj 2) directly to the OSTs (OST 1, OST 2, OST 3 shown).
23 I/O bandwidth
- 100s of GB/sec => saturate many 100s of OSTs
- OSTs
- Do ext3 extent allocation, non-caching direct I/O
- Lock management spread over the cluster
- Achieve 90-95% of network throughput
- Single client, single thread, Elan3: write at 269 MB/sec
- OSTs handle up to 260 MB/sec
- W/O extent code, on a 2-way 2.4 GHz Xeon
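As a rough, assumption-laden sanity check using the per-OST figure above: 100 GB/sec divided by roughly 0.25 GB/sec per OST comes to about 400 OSTs, which is why the aggregate bandwidth target translates into saturating many hundreds of OSTs.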
24 Metadata
25 Intent locks & Write-back caching
- Client-MDS protocol adaptation
- Low concurrency: write-back caching
- Client does in-memory updates
- delayed replay to the MDS
- High concurrency (mostly merged in 2.6)
- Single network request per transaction
- No lock revocations to clients
- Intent-based lock includes the complete request
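A hedged sketch of what "the intent-based lock includes the complete request" could look like on the wire; the structures and field names below are invented for illustration and are not the real Lustre/LDLM protocol.

    #include <stdint.h>

    enum intent_op_sketch { INTENT_LOOKUP, INTENT_OPEN, INTENT_CREATE, INTENT_UNLINK };

    struct mds_intent_sketch {
        enum intent_op_sketch op;        /* what the client ultimately wants        */
        uint32_t              mode;      /* e.g. create mode or open flags          */
        char                  name[256]; /* target name, so no follow-up RPC needed */
    };

    /* Single network request per transaction: the lock enqueue carries the
     * intent, and the MDS executes the operation while deciding what lock
     * (if any) to grant, so no lock ever has to be revoked from a client.  */
    struct mds_lock_enqueue_sketch {
        uint64_t                 parent_id;   /* directory being locked */
        uint32_t                 lock_mode;
        struct mds_intent_sketch intent;
    };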
27 Lustre 1.0
- Only has the high concurrency model
- Aggregate throughput (1,000 clients)
- Achieve 5000 file creations (open/close) /sec
- Achieve 7800 stats/sec in 10 x 1M file directories
- Single client
- Around 1500 creations or stats /sec
- Handling 10M file directories is effortless
- Many changes to ext3 (all merged in 2.6)
28 Metadata Future
- Lustre 2.0 (2004)
- Metadata clustering
- Common operations will parallelize
- 100% WB caching in memory or on disk
- Like AFS
29 Metadata: Odds and Ends
- Logical drivers
- Local persistent metadata cache, like AFS/Coda/InterMezzo
- Replicated metadata server driver
- Remotely mirrored MDS
- Small scale clusters
- CFS is focused on big systems
- Our drivers + an ordinary FS can export all protocols
- Get shared ext3/Reiser/… file systems
30 Recovery
31 Recovery approach
- Keep it simple!
- Based on failover circles
- Use existing failover software
- Left working neighbor is failover node for you
- At HP we use failover pairs
- Simplify storage connectivity
- I/O failure triggers
- Peer node serves failed OST
- Retry from client routed to new OST node
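A minimal sketch of the "left working neighbour" rule above, assuming nodes sit on an indexed circle; the array layout and scan direction are my illustration, not CFS code.

    #include <stdio.h>

    /* Return the nearest working node to the "left" of a failed node on the
     * failover circle (wrapping around), or -1 if every other node is down. */
    static int failover_target(const int node_up[], int n, int failed)
    {
        for (int step = 1; step < n; step++) {
            int candidate = (failed - step + n) % n;   /* walk left around the ring */
            if (node_up[candidate])
                return candidate;
        }
        return -1;
    }

    int main(void)
    {
        int up[4] = { 1, 1, 0, 1 };                    /* node 2 has failed */
        printf("OSTs of node 2 are served by node %d\n", failover_target(up, 4, 2));
        return 0;
    }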
32 OST server redundant pair (diagram): OST 1 and OST 2 each connect through two FC switches to both controllers (C1, C2) of the shared disk arrays, so either server can reach the other's storage after a failure.
33 Configuration
34 Lustre 1.0
- Good tools to build configuration
- Configuration is recorded on MDS
- Or on dedicated management server
- Configuration can be changed
- 1.0 requires downtime
- Clients auto-configure
- mount -t lustre -o mds://fileset/sub/dir /mnt/pt
- SNMP support
35 Futures
36 Advanced Management
- Snapshots
- All features you might expect
- Global namespace
- Combine the best of AFS & autofs4
- HSM, hot migration
- Driven by customer demand (we plan XDSM)
- Online 0-downtime re-configuration
- Part of Lustre 2.0
37 Security
38 Security
- Authentication
- POSIX style authorization
- NASD style OST authorization
- Refinement: use OST ACLs and cookies
- File encryption with a group key service
- STK secure file system
39 (diagram) Security steps, in order:
- Step 1: Authenticate user, get session key
- Step 2: Authenticated open RPCs
- Step 4: Get OST ACL
- Step 5: Send ACL capability cookie
- Step 6: Read encrypted file data
- Step 7: Get SFS file key
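For the "ACL capability cookie" in step 5, a NASD-style credential might look roughly like the C sketch below; the field names and MAC size are assumptions for illustration only, not the designed Lustre format.

    #include <stdint.h>

    struct ost_capability_sketch {
        uint64_t objid;         /* object the holder may access                 */
        uint32_t allowed_ops;   /* read/write bits derived from the OST ACL     */
        uint64_t expiry;        /* cookies are short-lived                      */
        uint8_t  mac[20];       /* keyed MAC lets the OST verify the cookie     */
                                /* without asking the MDS on every I/O          */
    };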
40 CFS Cluster Tools for 2.6
- Remote serial GDB debugging over UDP
- Conman & UDP consoles for
- syslog
- sysrq
- Core dumps over net or to local disk
- Many dump format enhancements
- Analyze dumps with gdb extension (not lcrash)
- Llanalyze
- Analyzes distributed Lustre logs
41 Metadata transaction protocol
- No synchronous I/O unless requested
- Reply and commit confirmation
- Lustre covers single component failure
- Replay of requests central
- Preserve transaction sequence
- Acknowledge replies to remove barriers
- Avoid cascading aborts
- In DB parlance: strict execution
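A small C sketch of the reply-and-commit-confirmation idea, under the assumption that every reply carries the request's transaction number plus the last transaction known to be on stable storage; the names are illustrative, not the Lustre wire format.

    #include <stdint.h>

    struct mds_reply_sketch {
        uint64_t transno;          /* sequence number assigned to this request      */
        uint64_t last_committed;   /* highest transno known to be on stable storage */
        int      status;
    };

    /* The client keeps every request whose effects may still be lost, and on
     * failover replays them in transno order: "strict execution" in DB terms. */
    int must_keep_for_replay(const struct mds_reply_sketch *reply)
    {
        return reply->transno > reply->last_committed;
    }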
42 Distributed persistent data
- Happens in many places
- Inode & object creation/removal (MDS/OST)
- Replicating OSTs
- Metadata clustering
- Recovery with replay logs
- Cancellation of log records
- Logs ubiquitous in Lustre
- Recovery, WB caching logs, replication etc.
- Configuration
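A hedged sketch of the log-plus-cancellation pattern used for distributed persistent state (for example keeping MDS inodes and OST objects consistent); the record layout is an assumption, not the real Lustre log format.

    #include <stdint.h>

    struct replay_log_rec_sketch {
        uint64_t index;    /* position in the log                                */
        uint32_t op;       /* e.g. "object created on OST", "object destroyed"   */
        uint64_t objid;    /* the persistent update the record describes         */
    };

    /* Once the peer has the matching update on stable storage it cancels the
     * record; after a crash, any record never cancelled is simply replayed.  */
    struct replay_log_cancel_sketch {
        uint64_t first_index;   /* cancel a contiguous range of committed records */
        uint64_t last_index;
    };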
43 Project status
44 Lustre Feature Roadmap

Lustre (Lite) 1.0 (Linux 2.4 & 2.6), 2003 | Lustre 2.0 (2.6), 2004   | Lustre 3.0, 2005
Failover MDS                              | Metadata cluster         | Metadata cluster
Basic Unix security                       | Basic Unix security      | Advanced security
File I/O very fast (100s of OSTs)         | Collaborative read cache | Storage management
Intent-based scalable metadata            | Write-back metadata      | Load-balanced MD
POSIX compliant                           | Parallel I/O             | Global namespace
45 Cluster File Systems, Inc.
46 Cluster File Systems
- Small service company, 20-30 people
- Software development & service (95% Lustre)
- Contract work for Government labs
- OSS, but defense contracts
- Extremely specialized, with extreme expertise
- we only do file systems and storage
- Investments: not needed. Profitable.
- Partners: HP, Dell, DDN, Cray
47 Lustre conclusions
- Great vehicle for advanced storage software
- Things are done differently
- Protocol design from Coda & InterMezzo
- Stacking & DB recovery theory applied
- Leverage existing components
- Initial signs promising
48 HP & Lustre
- Two projects
- ASCI PathForward (Hendrix)
- Lustre Storage product
- Field trial in Q1 of '04
49 Questions?