Title: Improving File System Reliability with I/O Shepherding
1. Improving File System Reliability with I/O Shepherding
- Haryadi S. Gunawi,
- Vijayan Prabhakaran, Swetha Krishnan,
- Andrea C. Arpaci-Dusseau,
- Remzi H. Arpaci-Dusseau
University of Wisconsin - Madison
2. Storage Reality
- Complex storage subsystem
- Mechanical/electrical failures, buggy drivers
- Complex failures
- Intermittent faults, latent sector errors, corruption, lost writes, misdirected writes, etc.
- FS reliability is important
- Managing disk and individual block failures
3. File System Reality
- Good news
- Rich literature
- Checksums, parity, mirroring
- Versioning, physical/logical identity
- Important for single- and multiple-disk settings
- Bad news
- File system reliability is broken [SOSP '05]
- Unlike other components (performance, consistency)
- Reliability approaches are hard to understand and evolve
4. Broken FS Reliability
- Lack of a good reliability strategy
- No remapping, checksumming, redundancy
- Existing strategies are coarse-grained
- Mount read-only, panic, retry
- Inconsistent policies
- Different techniques in similar failure scenarios
- Bugs
- Ignored write failures
Let's fix them!
With the current framework? Not so easy
5. No Reliability Framework
- Diffused
- Each fault is handled at each I/O location
- Different developers might increase the diffusion
- Inflexible
- Fixed policies, hard to change
- But no single policy fits all diverse settings
- Less reliable vs. more reliable drives
- Desktop workloads vs. web-server apps
- The need for a new framework
- Reliability should be a first-class file system concern
6. Localized
- I/O Shepherd
- Localized policies
- More correct, fewer bugs, simpler reliability management
(Diagram: the shepherd is interposed between the file system and the disk subsystem)
7. Flexible
- I/O Shepherd
- Localized, flexible policies
(Diagram: File System -> Shepherd -> Disk Subsystem)
8. Powerful
- I/O Shepherd
- Localized, flexible, and powerful policies
(Diagram: composable policies in the shepherd, such as adding a mirror, a checksum, or more retries, provide more or less protection to match the environment: ATA, SCSI, or networked storage; archival or scientific data; less reliable, more reliable, or custom drives)
9. Outline
- Introduction
- I/O Shepherd Architecture
- Implementation
- Evaluation
- Conclusion
10. Architecture
- Building a reliability framework
- How to specify reliability policies?
- How to make powerful policies?
- How to simplify reliability management?
- I/O Shepherd layer
- Four important components
- Policy table
- Policy code
- Policy primitives
- Policy metadata
(Diagram: the I/O Shepherd sits between the file system and the disk subsystem; its policy table maps block types such as Data, Inode, and Super to policy code, e.g. Data -> Mirror())
Policy Code
    DynMirrorWrite(DiskAddr D, MemAddr A)
        DiskAddr copyAddr
        IOS_MapLookup(MMap, D, copyAddr)
        if (copyAddr == NULL)
            PickMirrorLoc(MMap, D, copyAddr)
            IOS_MapAllocate(MMap, D, copyAddr)
        return (IOS_Write(D, A, copyAddr, A))
11. Policy Table
- How to specify reliability policies?
- Different block types, different levels of importance
- Different volumes, different reliability levels
- Need fine-grained policies
- Policy table
- Different policies across different block types
- Different policy tables across different volumes
Example policy table (block type -> read/write policy):
    Superblock      TripleMirror()
    Inode           ChecksumParity()
    Inode Bitmap    ChecksumParity()
    Data            WriteRetry1sec()
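The table is the whole interface: the shepherd simply dispatches each I/O to the policy registered for that block type on that volume. A minimal sketch of that dispatch in C, using hypothetical names rather than the CrookFS code:

    /* Sketch only: one policy table per volume maps each block type to the
     * policy routine the shepherd runs for writes of that type. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t diskaddr_t;
    typedef int (*policy_fn)(diskaddr_t blk, void *buf);

    enum block_type { BT_SUPER, BT_INODE, BT_INODE_BITMAP, BT_DATA, BT_NUM };

    static int triple_mirror(diskaddr_t b, void *buf)    { (void)buf; printf("TripleMirror(%llu)\n", (unsigned long long)b); return 0; }
    static int checksum_parity(diskaddr_t b, void *buf)  { (void)buf; printf("ChecksumParity(%llu)\n", (unsigned long long)b); return 0; }
    static int write_retry_1sec(diskaddr_t b, void *buf) { (void)buf; printf("WriteRetry1sec(%llu)\n", (unsigned long long)b); return 0; }

    /* One table per volume; another volume could install a different table. */
    static policy_fn write_policy[BT_NUM] = {
        [BT_SUPER]        = triple_mirror,
        [BT_INODE]        = checksum_parity,
        [BT_INODE_BITMAP] = checksum_parity,
        [BT_DATA]         = write_retry_1sec,
    };

    /* The shepherd intercepts each write and dispatches on the block's type. */
    int shepherd_write(enum block_type t, diskaddr_t blk, void *buf)
    {
        return write_policy[t](blk, buf);
    }

    int main(void) { return shepherd_write(BT_DATA, 12345, NULL); }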
12. Policy Metadata
- What support is needed to make powerful policies?
- Remapping: track bad-block remapping
- Mirroring: allocate new blocks
- Sanity check: needs an on-disk structure specification
- Integration with the file system
- Runtime allocation
- Detailed knowledge of on-disk structures
- I/O Shepherd maps
- Managed by the shepherd
- Commonly used maps (see the sketch below)
- Mirror-map
- Checksum-map
- Remap-map
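A shepherd map is essentially a persistent lookup table from a block's original disk address to the location the policy chose for its copy, checksum, or remapped block, with entries allocated lazily at runtime. A small C sketch of the lookup/allocate operations a policy relies on, using hypothetical structures rather than the CrookFS maps:

    /* Sketch only: a shepherd map such as the mirror-map, keyed by a block's
     * original address and recording where its copy lives. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t diskaddr_t;
    #define ADDR_NONE ((diskaddr_t)0)
    #define MAP_SLOTS 1024

    struct map_entry { diskaddr_t orig, copy; };
    struct shepherd_map { struct map_entry slot[MAP_SLOTS]; size_t used; };

    /* Return the recorded copy of block d, or ADDR_NONE if no entry exists yet. */
    diskaddr_t map_lookup(const struct shepherd_map *m, diskaddr_t d)
    {
        for (size_t i = 0; i < m->used; i++)
            if (m->slot[i].orig == d)
                return m->slot[i].copy;
        return ADDR_NONE;
    }

    /* Record that block d now has a copy at addr copy (persisted via the journal). */
    int map_allocate(struct shepherd_map *m, diskaddr_t d, diskaddr_t copy)
    {
        if (m->used == MAP_SLOTS)
            return -1;
        m->slot[m->used].orig = d;
        m->slot[m->used].copy = copy;
        m->used++;
        return 0;
    }

    int main(void)
    {
        static struct shepherd_map mirror_map;     /* zero-initialized */
        map_allocate(&mirror_map, 100, 9000);
        return map_lookup(&mirror_map, 100) == 9000 ? 0 : 1;
    }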
13. Policy Primitives and Code
- How to make reliability management simple?
- I/O Shepherd primitives
- Rich set and reusable
- Complexities are hidden
- The policy writer simply composes primitives into policy code
Primitive categories include Maps and Computation; example primitives: Checksum, Parity, Map Update, Map Lookup, FS-Level Layout, Sanity Check, Allocate Near, Allocate Far, Stop FS
Policy Code
    MirrorData(Addr D)
        Addr M
        MapLookup(MMap, D, M)
        if (M == NULL)
            M = PickMirrorLoc(D)
            MapAllocate(MMap, D, M)
        Copy(D, M)
        Write(D, M)
14. Example: MirrorData()
(Diagram: the file system issues a write of data block D; the shepherd's policy table maps Data to MirrorData(); the policy writes both D and a replica R to the disk subsystem)
Policy Code
    MirrorData(Addr D)
        Addr R
        R = MapLookup(MMap, D)
        if (R == NULL)
            R = PickMirrorLoc(D)
            MapAllocate(MMap, D, R)
        Copy(D, R)
        Write(D, R)
15. Summary
- Interposition simplifies reliability management
- Localized policies
- Simple and extensible policies
- Challenge: keeping new data and metadata consistent
16. Outline
- Introduction
- I/O Shepherd Architecture
- Implementation
- Consistency Management
- Evaluation
- Conclusion
17. Implementation
- CrookFS
- (named for the hooked staff of a shepherd)
- An ext3 variant with I/O shepherding capabilities
- Implementation
- Changes in the core OS
- Semantic information, layout and allocation interface, allocation during recovery
- Consistency management (data journaling mode)
- 900 LOC (non-intrusive)
- Shepherd infrastructure
- Shepherd primitives, thread support, maps management, etc.
- 3500 LOC (reusable for other file systems)
- Well integrated with the file system
- Small overhead
18. Data Journaling Mode
(Diagram: in-memory blocks D, I, and Bm are first synced to the journal, where the intent is logged; they are then checkpointed to their fixed locations, where the intent is realized; afterwards the transaction is released)
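For reference, the ordering the diagram describes can be summarized in a few lines of C with stubbed helpers (illustrative names, not the ext3/CrookFS interfaces):

    /* Sketch only: the data-journaling ordering: log the intent, realize it
     * in place, and only then release the transaction. */
    #include <stdio.h>

    struct txn { int id; };   /* stands in for the dirty blocks D, I, Bm */

    static void journal_sync(struct txn *t) { printf("tx %d: blocks + commit record to journal (intent logged)\n", t->id); }
    static void checkpoint(struct txn *t)   { printf("tx %d: blocks to fixed locations (intent realized)\n", t->id); }
    static void txn_release(struct txn *t)  { printf("tx %d: journal space reclaimed\n", t->id); }

    void data_journal(struct txn *t)
    {
        journal_sync(t);    /* 1. sync        */
        checkpoint(t);      /* 2. checkpoint  */
        txn_release(t);     /* 3. release     */
    }

    int main(void) { struct txn t = { 1 }; data_journal(&t); return 0; }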
19. Reliability Policy + Journaling
- When to run policies?
- Policies (e.g. mirroring) are executed during checkpoint
- Is the current journaling approach adequate to support reliability policies?
- Could we run remapping/mirroring during checkpoint?
- No: the problem of failed intentions
- Cannot react to checkpoint failures
20. Failed Intentions
Example policy: remapping
(Diagram: the transaction (TB | D | I | TC) has been committed to the journal and released. During checkpoint, the write of D to its fixed location fails, so the policy remaps D to a new block R and must change the remap-map entry from RM: D -> 0 to RM: D -> R. Recording this failed intent in the already-released transaction is impossible. If a crash occurs before the remap-map reaches disk, the on-disk state is inconsistent: 1) the pointer I -> D is invalid, and 2) there is no reference to R)
21. Journaling Flaw
- Journal: the intent is logged to the journal
- If a journal write failure occurs? Simply abort the transaction
- Checkpoint: the intent is realized at the final location
- If a checkpoint failure occurs? No solution!
- Ext3, IBM JFS: ignore
- ReiserFS: stop the FS (coarse-grained recovery)
- Flaw in the current journaling approach
- No consistency for any checkpoint recovery that changes state
- Too late: the transaction has already been committed
- A crash could occur at any time
- Hopes that checkpoint writes always succeed (wrong!)
- Consistent reliability with the current journal: impossible (see the sketch below)
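A small C sketch of the flaw, with stubbed helpers (illustrative only, not any file system's code): by checkpoint time the transaction has already committed, so a failed checkpoint write has no consistent place to record a recovery action such as a remap.

    #include <stdio.h>

    #define NBLOCKS 3

    /* Pretend the write of block 1 fails. */
    static int disk_write(int blk) { return blk == 1 ? -1 : 0; }

    void checkpoint(const int blocks[NBLOCKS])
    {
        for (int i = 0; i < NBLOCKS; i++) {
            if (disk_write(blocks[i]) != 0) {
                /* The intent is already committed; the transaction cannot be
                 * amended, so existing systems either ignore the failure
                 * (ext3, JFS) or stop the file system (ReiserFS). */
                printf("checkpoint of block %d failed: no consistent recovery\n", blocks[i]);
            }
        }
    }

    int main(void) { int blocks[NBLOCKS] = { 0, 1, 2 }; checkpoint(blocks); return 0; }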
22. Chained Transactions
- Contains all recent changes (e.g. the shepherd's modified metadata)
- Chained to the previous transaction
- Rule: only after the chained transaction commits can we release the previous transaction (see the sketch below)
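A minimal C sketch of the rule, again with stubbed, illustrative helpers: shepherd metadata modified while checkpointing transaction T is logged in a new chained transaction CT, and T is released only after CT commits.

    #include <stdio.h>

    struct txn { int id; };

    static int  checkpoint_blocks(struct txn *t)   { printf("tx %d: checkpoint blocks\n", t->id); return -1; /* pretend a write failed */ }
    static void txn_add_map_update(struct txn *ct) { printf("tx %d: log shepherd map update (e.g. remap-map D -> R)\n", ct->id); }
    static void txn_commit(struct txn *t)          { printf("tx %d: commit\n", t->id); }
    static void txn_release(struct txn *t)         { printf("tx %d: release journal space\n", t->id); }

    void checkpoint_with_chaining(struct txn *t)
    {
        struct txn ct = { t->id + 1 };      /* chained transaction CT                        */
        if (checkpoint_blocks(t) != 0)      /* a checkpoint write failed...                  */
            txn_add_map_update(&ct);        /* ...so the policy's map change is logged in CT */
        txn_commit(&ct);                    /* rule: CT commits first                        */
        txn_release(t);                     /* only then release the previous transaction    */
    }

    int main(void) { struct txn t = { 7 }; checkpoint_with_chaining(&t); return 0; }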
23. Chained Transactions
Example policy: remapping
(Diagram: memory holds D, I, and the new remap-map entry RM: D -> R. The old transaction (TB | D | I | TC) is in the journal; the remap-map update is logged in a new chained transaction (TB | RM: D -> R | TC). The old transaction is released only after the chained transaction commits. At the fixed locations the checkpoint completes with R, D, and I on disk, and the on-disk remap-map (initially RM: D -> 0) is brought up to date when the chained transaction is checkpointed in turn)
24. Summary
- Chained transactions
- Handle failed intentions
- Work for all policies
- Minimal changes in the journaling layer
- Repeatable across crashes
- Idempotent policies (see the sketch below)
- An important property for consistency across multiple crashes
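To illustrate the idempotence point, here is a hypothetical remap policy in C (illustrative helpers, not the CrookFS code): because it consults its map before choosing a target, replaying it after one or more crashes reuses the earlier target rather than allocating a new block each time.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t diskaddr_t;
    #define ADDR_NONE ((diskaddr_t)0)

    static diskaddr_t remap_map_entry = ADDR_NONE;   /* persistent map entry for block D */

    static diskaddr_t map_lookup(void)       { return remap_map_entry; }
    static void map_allocate(diskaddr_t r)   { remap_map_entry = r; }
    static diskaddr_t pick_remap_loc(void)   { return 4242; }   /* choose a new target */
    static void disk_write(diskaddr_t r)     { printf("write remapped copy to %llu\n", (unsigned long long)r); }

    diskaddr_t remap_write(void)
    {
        diskaddr_t r = map_lookup();   /* was D already remapped before a crash? */
        if (r == ADDR_NONE) {
            r = pick_remap_loc();      /* pick the target only once              */
            map_allocate(r);
        }
        disk_write(r);                 /* safe to repeat across crashes          */
        return r;
    }

    int main(void) { remap_write(); remap_write(); /* replay reuses the same target */ return 0; }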
25. Outline
- Introduction
- I/O Shepherd Architecture
- Implementation
- Evaluation
- Conclusion
26. Evaluation
- Flexible
- Change ext3 to all-stop or retry-more policies
- Fine-grained
- Implement gracefully degrading RAID [TOS '05]
- Composable
- Perform multiple lines of defense
- Simple
- Craft 8 policies in a simple manner
27. Flexibility
- Modify ext3's inconsistent read-recovery policies
Example fault: failed block type: indirect block; workload: path traversal (cd /mnt/fs2/test/a/b/); policy observed: detect the failure and propagate it to the application
(Chart: recovery actions observed in ext3 across workloads and block types: propagate, retry, ignore failure, stop)
28. Flexibility
- Modify ext3's policies to all-stop policies
Policy Table: Any Block Type -> AllStopRead()
    AllStopRead(Block B)
        if (Read(B) == OK)
            return OK
        else
            Stop()
(Chart: observed recovery actions, ext3 vs. All-Stop: no recovery, retry, stop, propagate)
29. Flexibility
- Modify ext3's policies to retry-more policies
Policy Table: Any Block Type -> RetryMoreRead()
    RetryMoreRead(Block B)
        for (int i = 0; i < RETRY_MAX; i++)
            if (Read(B) == SUCCESS)
                return SUCCESS
        return FAILURE
(Chart: observed recovery actions, ext3 vs. Retry-More: no recovery, retry, stop, propagate)
30. Fine-Granularity
- RAID problem
- Extreme unavailability
- Partially available data
- Unavailable root directory
- D-GRAID [TOS '05]
- Degrades gracefully
- Fault-isolates a file to a disk
- Highly replicates metadata
31. Fine-Granularity
D-GRAID Policy Table (X = 1, 5, 10):
    Superblock    MirrorXway()
    Group Desc    MirrorXway()
    Bitmaps       MirrorXway()
    Directory     MirrorXway()
    Inode         MirrorXway()
    Indirect      MirrorXway()
    Data          IsolateAFileToADisk()
(Chart: availability under disk failures for X = 1, 5, 10, compared with 10-way and linear layouts)
32. Composability
- Multiple lines of defense
- Assemble both low-level and high-level recovery mechanisms
    ReadInode(Block B)
        C = Lookup(Ch-Map, B)
        Read(B, C)
        if (CompareChecksum(B, C) == OK) return OK
        M = Lookup(M-Map, B)
        Read(M)
        if (CompareChecksum(M, C) == OK)
            B = M
            return OK
        if (SanityCheck(B) == OK) return OK
        if (SanityCheck(M) == OK)
            B = M
            return OK
        RunOnlineFsck()
        return ReadInode(B)
(Chart: recovery time in ms)
33. Simplicity
Policy                       LOC
Propagate                    8
Sanity Check                 10
Reboot                       15
Retry                        15
Mirroring                    18
Parity                       28
Multiple Lines of Defense    39
D-GRAID                      79
- Writing a reliability policy is simple
- Implemented 8 policies
- Using reusable primitives
- The most complex one is < 80 LOC
34. Conclusion
- Modern storage failures are complex
- Not only fail-stop; drives also exhibit individual block failures
- An FS reliability framework does not exist
- Scattered policy code: one cannot expect much reliability
- Journaling + block failures -> failed intentions (a flaw)
- I/O Shepherding
- Powerful
- Deploy disk-level, RAID-level, FS-level policies
- Flexible
- Reliability as a function of workload and environment
- Consistent
- Chained transactions
35. ADvanced Systems Laboratory (www.cs.wisc.edu/adsl)
Thanks to the I/O Shepherds' shepherd, Frans Kaashoek
Scholarship and research sponsors
36. Extra Slides
37. Example: RemapMirrorData()
Policy Table: Data -> RemapMirrorData()
Policy Code
    RemapMirrorData(Addr D)
        Addr R, Q
        MapLookup(MMap, D, R)
        if (R == NULL)
            R = PickMirrorLoc(D)
            MapAllocate(MMap, D, R)
        Copy(D, R)
        Write(D, R)
        if (Fail(R))
            Deallocate(R)
            Q = PickMirrorLoc(D)
            MapAllocate(MMap, D, Q)
            Write(Q)
(Diagram: blocks D, R, and Q in the disk subsystem)
38. Chained Transactions (2)
Example policy: RemapMirrorData
(Diagram: memory holds D, I, and the mirror-map entry, which changes from M: D -> R1 to M: D -> R2 after the first copy fails. Each map update is logged in a chained transaction (TB ... TC) in the journal, alongside the original transaction containing D and I. Each previous transaction is released only after its chained transaction commits; when the checkpoint completes, the fixed locations hold D, I, R1, and R2, and the on-disk map (initially M: D -> 0) is brought up to date by the chained transactions)
39. Are Existing Solutions Enough?
- Is the machinery in high-end systems enough (e.g. disk scrubbing, redundancy, end-to-end checksums)?
- Not pervasive in the home environment (storing photos, tax returns)
- New trend: commodity storage clusters (Google, EMC Centera)
- Is RAID enough?
- Requires more than one disk
- Does not protect against faults above the disk system
- Focuses on whole-disk failure
- Does not enable fine-grained policies