Title: IRON File Systems
1. IRON File Systems
- Remzi Arpaci-Dusseau
- University of Wisconsin, Madison
2. Understanding How Things Fail Is Important
3. How Disks Fail
4. Classic Failure Model: Fail-Stop
- As defined by Schneider 90
- Stop: Upon failure, halt
- Make known: But first, switch to a state such that other components can detect that you have failed
- Very simple model of disk failure
- Used by all early file and storage systems (once controllers could detect failure)
- But is it realistic?
5. Assertion: Modern Disks Are Not Whole-Disk Fail-Stop
6. Real Failures
- Latent sector errors Kari 93, Bairavasundaram 07
- A block or blocks become inaccessible
- Data corruption Weinberg 04, Greene 05, Bairavasundaram 08
- Controller bugs, not bit rot
- Transient errors too Talagala 99
- Bus stuttering, etc.
- Result: Partial failures are a reality
7. So What Should We Do?
8. High-End Systems: Extra Measures
- Disk scrubbing Kari 93
- Proactively scan drives in search of latent errors
- When an error is detected, correct from a redundant copy on another disk (see the sketch after this list)
- Extra redundancy Corbett 04
- RAID systems with two parity disks
- Checksums Bartlett 04, Weinberg 04
- Extra computation over the data
- Guards against corruption
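
The scrubbing idea in miniature, as a hedged sketch rather than any cited system's algorithm: all helpers (read_block, write_block, checksum, stored_checksum) and the two-way mirror layout are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

/* Hypothetical helpers: return 0 on success, nonzero on I/O error. */
int read_block(int disk, uint64_t blk, uint8_t *buf);
int write_block(int disk, uint64_t blk, const uint8_t *buf);
uint32_t checksum(const uint8_t *buf, size_t len);
uint32_t stored_checksum(int disk, uint64_t blk);

/* Scrub a mirrored pair: proactively read every block, and when a
 * latent error or checksum mismatch turns up, correct the bad copy
 * from the redundant copy on the other disk. */
void scrub(uint64_t nblocks)
{
    uint8_t buf[BLOCK_SIZE];
    for (uint64_t b = 0; b < nblocks; b++) {
        for (int d = 0; d < 2; d++) {
            int bad = read_block(d, b, buf) != 0 ||
                      checksum(buf, BLOCK_SIZE) != stored_checksum(d, b);
            if (bad && read_block(1 - d, b, buf) == 0)
                write_block(d, b, buf);  /* repair from the good mirror */
        }
    }
}
```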
9. But What About Desktop File Systems?
10. Desktop FSs: Lost in the Past?
- Desktop file systems are important
- Home use: Photos, movies, tax returns, ...
- Cluster use too: GoogleFS is built on local FSs
- Performance policies are well known
- e.g., FFS placement policy
- But what is their fault-handling policy?
- Do they handle partial disk failures?
- How can we tell?
11. Two Questions
12. Questions I Will Answer
- Question 1: How do local file systems react to the more realistic set of disk failures?
- Question 2: How can we change file systems to better handle these types of faults?
13. How Disks Fail: The Details
14. The Storage Stack
[Figure: the storage stack, from the host at the top down to the disk at the bottom]
- Not just a file system on top of the disk
- Many layers
- Lots of software
- Even within the disk!
- Failures occur at all levels
15. Latent Sector Errors
- Disks experience partial failures
- A small portion of the data on disk becomes temporarily or permanently unavailable Corbett 04
- Root causes
- Scratched surfaces, inaccurate arm movement, interconnect problems
- Bottom line: A single read or write can fail
16. Data Corruption
- Sun's ZFS Weinberg 04
- Misdirected writes: Right data, wrong location
- Phantom/lost writes: "Yes, I wrote the data!" (but didn't)
- EIDE interfaces on motherboards Greene 05
- A read reported as done when it was not (a race)
- A similar problem at Google Ghemawat 03
- Network Appliance Lewis 99
- A disk occasionally returns byte-shifted data
17. Transient Errors
- 18-month study of a large disk farm Talagala 99
- Most machines had SCSI timeout errors (loose cables? bad cables?)
- SCSI parity errors were common too (data corrupted while moving across the bus)
- Failures can be transient too
- Might work if just retried
18. Even Worse With ATA (Not SCSI)
- ATA drives: Less reliable Anderson 03, Hughes and Murray 05
- Few are returned for failure analysis
- Some are partially flaw-marked during testing
- Test conditions not as harsh (power, temperature)
- High-end reliability features missing (filters that remove particles and chemicals, humidity controls)
- Cheap disks -> less testing -> less reliability
- But cost drives many purchasing decisions
19. Trend: More Problems, Not Fewer
- Denser drives: Capacity sells drives
- More logic -> more complexity
- More complexity -> more bugs
- Cost per byte dominates: Pennies matter
- Manufacturers will cut corners
- Reliability features are the first to go
- Increasing amounts of software
- 400K lines of code in a modern Seagate drive
- Hard to write, hard to debug
20. The Fail-Partial Failure Model
21. The Fail-Partial Failure Model
- Disk failure
- Entire disk may fail
- Block failure
- Part of disk may fail
- Block corruption
- Part of disk may get corrupted
- All can be either transient or sticky
22. Important Parameters
- Locality
- Are partial faults independent of each other?
- Frequency
- How often do partial faults occur?
23. Frequency of Failures
- Study of latent sector errors Bairavasundaram et al. 07
- 1.53 million disks, 3 years of data
- ATA: 8.5%; SCSI: 1.9%
- Latent sector errors are not independent
- Spatial locality exists; disk capacity matters
- Study of block corruption Bairavasundaram et al. 08
- Same data set
- ATA: 0.6%; SCSI: 0.06%
- Corruptions within a disk are not independent
- Spatial locality exists
- The bad-block-number problem
24. How Do File Systems React to Partial Failures?
25. How to Detect and Handle Failures?
- Need: A classification of techniques
- Detection: Discovering that a failure took place
- Recovery: Recovering from the failure
- Detection + Recovery = IRON
- File systems with Internal RObustNess
- IRON taxonomy: Classify the techniques
26. IRON Detection Taxonomy
- How to detect block failure or corruption?
- Possible strategies (a sketch of how they layer follows below):
- Zero: No detection technique used
- Error Code: Check return codes from the disk
- Sanity: Check data structures for consistency
- Redundancy: Add checksums or other forms of computed replication to detect problems
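
To make the levels concrete, here is a hedged C sketch of a read path that layers the strategies; the structure, magic value, and helpers are invented for illustration, not taken from any of the tested file systems.

```c
#include <stdint.h>

#define INODE_MAGIC 0xEF530000u  /* hypothetical marker value */

struct inode_disk { uint32_t magic; uint32_t nlinks; /* ... */ };

int disk_read(uint64_t blk, void *buf);      /* hypothetical: 0 on success */
uint32_t checksum(const void *buf, int len);
uint32_t stored_checksum(uint64_t blk);

/* Level Zero would call disk_read() and trust the result blindly.
 * Each check below adds one IRON detection technique. */
int read_inode_block(uint64_t blk, struct inode_disk *ino)
{
    /* Error Code: check the return code from the disk. */
    if (disk_read(blk, ino) != 0)
        return -1;

    /* Sanity: check the data structure for internal consistency. */
    if (ino->magic != INODE_MAGIC || ino->nlinks > 65535)
        return -2;

    /* Redundancy: a checksum catches corruption that still looks sane. */
    if (checksum(ino, sizeof *ino) != stored_checksum(blk))
        return -2;

    return 0;
}
```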
27. IRON Recovery Taxonomy
- How to recover from a detected failure?
- Possible strategies (a sketch of how several compose follows below):
- Zero: Don't do anything
- Propagate: Pass the error on to a higher level
- Stop: Halt activity (fail-stop)
- Guess: Manufacture data, return it to the user
- Retry: Assume the failure is transient
- Repair: If an inconsistency is detected
- Remap: Redirect to another block
- Redundancy: Use another copy of the block
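
A hedged sketch of how several of these compose on a failed read: retry for transients, fall back to a redundant copy, and finally propagate rather than guess. read_verified and replica_of are assumed helpers, not real APIs.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_RETRIES 3

int read_verified(uint64_t blk, void *buf);  /* read + detection, hypothetical */
uint64_t replica_of(uint64_t blk);           /* location of a redundant copy */

int recover_read(uint64_t blk, void *buf)
{
    /* Retry: assume the failure may be transient (timeout, bus glitch). */
    for (int i = 0; i < MAX_RETRIES; i++)
        if (read_verified(blk, buf) == 0)
            return 0;

    /* Redundancy: use another copy of the block, if one exists. */
    if (read_verified(replica_of(blk), buf) == 0)
        return 0;

    /* Propagate: pass the error to a higher level instead of guessing.
     * A fail-stop policy would halt the file system here instead. */
    return -EIO;
}
```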
28. What IRON Techniques Do Modern File Systems Use?
29. Fault Injection
- Typical fault injection
- Insert failures at random disk locations/times
- Watch system to see what happens
- Not good enough
- May miss interesting behavior
- May find problems, but without explaining them
- What we do: Space- and time-aware injection
- A gray-box approach to testing
30. Space Awareness
- File systems are composed of many on-disk structures
- e.g., superblocks, inodes, etc.
- Idea: Make the fault-injection layer aware of file system structures (see the sketch after this list)
- Inject faults across all block types
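
A minimal sketch of the gray-box idea, assuming an interposition layer below the file system; classify_block (e.g., built from ext3's fixed layout) and real_disk_read are hypothetical names.

```c
#include <stdint.h>

enum blk_type { BT_SUPER, BT_INODE, BT_BITMAP, BT_JOURNAL, BT_DATA };

/* Gray-box knowledge: map a block number to the on-disk structure
 * it belongs to. */
enum blk_type classify_block(uint64_t blk);
int real_disk_read(uint64_t blk, void *buf);

static enum blk_type target_type = BT_INODE;  /* structure under test */

/* Interposed read path: fail exactly the block type being studied,
 * rather than failing random locations and hoping to hit it. */
int injected_read(uint64_t blk, void *buf)
{
    if (classify_block(blk) == target_type)
        return -5;  /* -EIO: simulate a latent sector error */
    return real_disk_read(blk, buf);
}
```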
31. Time Awareness
- Time is key to testing as well
- e.g., the update sequence
- Idea: Build a model of file system I/O activity (a sketch follows below)
[Figure: write stream under data journaling (simplified); legend: J = journal, C = commit, K = checkpoint, S = superblock]
- Use the model to induce faults at crucial times
- Don't miss interesting behaviors
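
A sketch of the same interposition layer made time-aware under the simplified data-journaling model above: classify each observed write as a step in the journal/commit/checkpoint sequence and fire the fault only at the chosen step. classify_write and real_disk_write are illustrative assumptions.

```c
#include <stdint.h>

/* Steps of the simplified data-journaling write sequence. */
enum jstep { J_JOURNAL, J_COMMIT, J_CHECKPOINT, J_SUPER };

enum jstep classify_write(uint64_t blk, const void *buf);  /* the I/O model */
int real_disk_write(uint64_t blk, const void *buf);

static enum jstep fault_at = J_COMMIT;  /* the crucial time under test */

/* Inject the fault at a precise point in the update sequence, e.g.,
 * fail the commit block after the journal writes have succeeded. */
int injected_write(uint64_t blk, const void *buf)
{
    if (classify_write(blk, buf) == fault_at)
        return -5;  /* -EIO */
    return real_disk_write(blk, buf);
}
```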
32. Making It Comprehensive
- Workloads
- Exercise as much of the FS as possible
- Two types of workloads
- Singlets: Stress a single system call (open, lstat, rename, symlink, write, etc.)
- Generics: Stress common functionality (path traversal, recovery, log writes, etc.)
33. Injecting Faults
- In the disk: Hard to do -> it's hardware
- Software approach
- Easy
- Desirable
- Fail-partial faults
- Read and write errors
- Read corruption
[Figure: the fault-injection layer sits in the stack between the host and the disk]
34. The File Systems We Tested
- Linux ext3
- Popular, simple, compatible Linux file system
- Linux ReiserFS
- Scalable, database-like file system
- Linux IBM JFS
- Big Blue's classic journaling file system
- Windows NTFS
- Yes, a non-Linux file system
35. Result Matrix
[Figure: result matrix, workloads × data structures]
36. Read Errors: Recovery
[Figure: per-file-system result matrices for ext3, ReiserFS, and JFS]
- Ext3: Stop and propagate (doesn't tolerate transience)
- ReiserFS: Mostly propagate
- JFS: Stop, propagate, retry
- All: Some cases missed
37. Write Errors: Recovery
[Figure: per-file-system result matrices for ext3, ReiserFS, and JFS]
- Ext3/JFS: Ignore write faults
- No detection -> no recovery
- Can corrupt the entire volume
- ReiserFS: Always calls panic
- Exception: indirect blocks
38. Corruption: Recovery
[Figure: per-file-system result matrices for ext3, ReiserFS, and JFS]
- Ext3/ReiserFS/JFS
- Some sanity checking used
- Stop/propagate common
- Sanity checking is not enough
39. File-System-Specific Results
- Ext3: Overall simplicity
- Checks error codes, modest sanity checking, propagates errors, aborts the operation
- Overreacts on read errors -> halts instead of propagating
- But some write errors are ignored
- ReiserFS: First, do no harm
- At the slightest sign of failure, panic() the file system
- Preserves integrity, but overreacts to transients
- IBM JFS: The kitchen sink
- Uses the broadest range of techniques
- Windows NTFS: Persistence is a virtue
- Liberal retry (understands that disks can be flaky)
40. General Results (1 of 3)
- Illogical inconsistency is common
- Similar faults -> different reactions (e.g., a failed read of the JFS superblock)
- Bugs are common
- Code not stress-tested enough? (e.g., ReiserFS indirect-block code paths)
- Error codes are sometimes ignored
- Highly surprising: Easiest to detect (but sometimes hard to act upon)
41. General Results (2 of 3)
- Sanity checking is of limited utility
- Doesn't help if you read the right type of block but the wrong block
- Hard to do for some structures (e.g., bitmaps)
- Stop is useful (if used correctly)
- ReiserFS halts on write errors
- Ext3 tries to do this (but aborts too late)
- Stop should not be overused
- Faults can be transient
- Faults can be sticky, too!
42. General Results (3 of 3)
- Retry is underutilized
- JFS does it some, NTFS quite a bit
- But transient faults occur
- Automatic repair is rare
- Almost all stop actions involve administrator intervention/repair (running fsck, rebooting, etc.)
- Redundancy is rarely used
- Only superblocks are replicated, and only sometimes
43. Towards an IRON File System
44. IRON ext3 (ixt3)
- A prototype of an IRON file system
- A first cut: Many other possibilities still exist
- Start with Linux ext3
- Add checksums: To detect corruption
- Add replication: For important structures (e.g., meta-data)
- Add parity: For user data
- Result: IRON ext3 (ixt3)
45. Ixt3 Implementation
- Checksums
- Initially written to the ext3 log, then checkpointed to their final location
- Meta-data replicas
- Written to a replica log, checkpointed later to their final on-disk location
- Parity protection for data (a sketch follows below)
- One parity block per file, with an extra pointer in the inode
- Performance issues
- Space overhead: Low
- Time overhead?
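
To illustrate the one-parity-block-per-file scheme (a sketch of the idea, not the actual ixt3 code; the block helpers are assumed): XOR all of a file's data blocks into one parity block, so any single lost block can be rebuilt from the survivors.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096

int read_block(uint64_t blk, uint8_t *buf);        /* hypothetical helpers */
int write_block(uint64_t blk, const uint8_t *buf);

/* Compute the file's parity block: the XOR of all its data blocks.
 * ixt3 reaches it through an extra pointer in the inode. */
void update_parity(const uint64_t *data_blks, int n, uint64_t parity_blk)
{
    uint8_t acc[BLOCK_SIZE] = {0}, buf[BLOCK_SIZE];
    for (int i = 0; i < n; i++) {
        read_block(data_blks[i], buf);
        for (int j = 0; j < BLOCK_SIZE; j++)
            acc[j] ^= buf[j];
    }
    write_block(parity_blk, acc);
}

/* Reconstruct one lost data block: XOR the parity with the others. */
void reconstruct(const uint64_t *data_blks, int n, int lost,
                 uint64_t parity_blk, uint8_t *out)
{
    uint8_t buf[BLOCK_SIZE];
    read_block(parity_blk, out);
    for (int i = 0; i < n; i++) {
        if (i == lost)
            continue;
        read_block(data_blks[i], buf);
        for (int j = 0; j < BLOCK_SIZE; j++)
            out[j] ^= buf[j];
    }
}
```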
46. Ixt3 Performance Evaluation
- For home use or read-mostly workloads: No overhead
- Has a cost for write-intensive workloads
47. Wrapping Up
48. Summary
- File systems are important
- Used everywhere, in many different ways
- Disks fail in interesting ways
- New model: The fail-partial failure model
- Local file systems: Not ready for local faults
- Illogical inconsistencies, bugs, and little recovery
- Need: IRON file systems
- Ixt3: Low-cost protection from partial failures
49. Challenges and Directions
- Need to rethink how we build file systems
- Performance policy isn't the only policy
- Fault-handling policy is critical
- Testing and beyond testing
- Failure handling must be tested (continuously?)
- Beyond testing: Code analysis too?
- Guiding principles
- Lessons from networking
- Put simply: Don't trust the disk
50. ADvanced Systems Lab (ADSL)
51. ADvanced Systems Lab (ADSL)
- Who did the real work:
- Nitin Agrawal
- Lakshmi Bairavasundaram
- Haryadi Gunawi
- Vijayan Prabhakaran
52. Backup Slides
53. Read Errors: Detection Techniques
- Across all three file systems
- Error codes are checked for read errors (rarely ignored)
54. Write Errors: Detection Techniques
- Ext3 and JFS ignore write errors!
- Either ignored altogether or not used meaningfully
- ReiserFS: Much more careful
55. Corruption: Detection Techniques
- Sanity checking is used across all three file systems
- Sanity checking is not sufficient
- e.g., when you read a block of a similar type
56. File Systems: The Manager of Your Data
57. Why File Systems Are Important
- The file system: The manager of most data
- Consists of named files: Linear arrays of bytes
- Organized in directories: /this/is/my/file
- Access methods: open(), read(), write(), close() (a usage sketch follows below)
- Where we use them: Everywhere
- Home use: Photos, tax returns, home movies
- Servers: Network file servers, the Google search engine
- Why we use them
- Simple, convenient
- Good performance: The subject of much research
- Reliable? Depends on how disks fail
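
A minimal POSIX usage sketch of those access methods, with the return-code checks whose absence in file systems the rest of the talk criticizes:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];

    int fd = open("/this/is/my/file", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* read() returns bytes read, 0 at end of file, or -1 on error:
     * exactly the kind of error code that must not be ignored. */
    ssize_t n = read(fd, buf, sizeof buf);
    if (n < 0) { perror("read"); return 1; }

    if (write(STDOUT_FILENO, buf, (size_t)n) != n) { perror("write"); return 1; }
    if (close(fd) != 0) { perror("close"); return 1; }
    return 0;
}
```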
58. File System Background
- Meta-data: Structures the file system uses to track what it needs to track (a toy rendering follows below)
- Superblock: File-system-wide parameters
- Inodes: Information about a file
- Data: Blocks that hold user data
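
A toy C rendering of those structures, assuming a simple fixed layout; the field choices are illustrative, not ext3's actual on-disk format.

```c
#include <stdint.h>

/* Superblock: file-system-wide parameters. */
struct superblock {
    uint32_t magic;         /* identifies the file-system type */
    uint32_t block_size;    /* e.g., 4096 bytes */
    uint64_t total_blocks;
    uint64_t inode_count;
    uint64_t free_blocks;
};

/* Inode: information about one file. */
struct inode {
    uint32_t mode;          /* file type and permissions */
    uint32_t uid, gid;      /* owner */
    uint64_t size;          /* length in bytes */
    uint64_t mtime;         /* last modification time */
    uint64_t direct[12];    /* block numbers of the file's data blocks */
    uint64_t indirect;      /* a block holding further block pointers */
};

/* Data blocks hold raw user data and carry no structure of their own. */
```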