Title: Transactions and Reliability
1Transactions and Reliability
- Sarah Diesburg
- Operating Systems
- COP 4610
2Motivation
- File systems have lots of metadata
- Free blocks, directories, file headers, indirect blocks
- Metadata is heavily cached for performance
3Problem
- System crashes
- OS needs to ensure that the file system does not reach an inconsistent state
- Example: move a file between directories
- Remove a file from the old directory
- Add a file to the new directory
- What happens when a crash occurs in the middle?
4UNIX File System (Ad Hoc Failure-Recovery)
- Metadata handling
- Uses a synchronous write-through caching policy
- A call to update metadata does not return until the changes are propagated to disk
- Updates are ordered
- When crashes occur, run fsck to repair in-progress operations
5Some Examples of Metadata Handling
- Undo effects not yet visible to users
- If a new file is created, but not yet added to the directory
- Delete the file
- Continue effects that are visible to users
- If file blocks are already allocated, but not recorded in the bitmap
- Update the bitmap
6UFS User Data Handling
- Uses a write-back policy
- Modified blocks are written to disk at 30-second intervals
- Unless a user issues the sync system call
- Data updates are not ordered
- In many cases, consistent metadata is good enough
7Example: Vi
- Vi saves changes by doing the following
- 1. Writes the new version in a temp file
- Now we have old_file and new_temp file
- 2. Moves the old version to a different temp file
- Now we have new_temp and old_temp
- 3. Moves the new version into the real file
- Now we have new_file and old_temp
- 4. Removes the old version
- Now we have new_file
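The four steps above can be sketched in Python. This is a minimal illustration, not Vi's actual implementation: the function name `safe_save` and the `.new_temp` / `.old_temp` suffixes are hypothetical, and the `os.fsync` call is an added assumption so that the new version reaches the disk before the old one is moved.

```python
import os
import tempfile


def safe_save(path, new_contents):
    """Replace the file at `path` using the four-step sequence (sketch)."""
    new_temp = path + ".new_temp"  # hypothetical temp-file names
    old_temp = path + ".old_temp"

    # 1. Write the new version to a temp file (old_file + new_temp exist).
    with open(new_temp, "w") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())  # assumption: force the data to disk first

    # 2. Move the old version aside (new_temp + old_temp exist).
    os.rename(path, old_temp)
    # 3. Move the new version into the real name (new_file + old_temp exist).
    os.rename(new_temp, path)
    # 4. Remove the old version (only new_file remains).
    os.remove(old_temp)


# Demo in a scratch directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "notes.txt")
with open(demo_path, "w") as f:
    f.write("old text")

safe_save(demo_path, "new text")

with open(demo_path) as f:
    saved = f.read()
leftovers = os.listdir(demo_dir)
```

At every point in the sequence at least one complete copy of the file exists on disk, which is what makes recovery possible after a crash.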
8Example: Vi
- When crashes occur
- Looks for the leftover files
- Moves forward or backward depending on the
integrity of files
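One way to picture the recovery decision is the sketch below. It is only an illustration under stated assumptions: the function name `recover_save` and the temp-file suffixes are hypothetical, and the rule "a new temp file next to a missing real file is complete" holds only if the new version was forced to disk before the old one was moved.

```python
import os
import tempfile


def recover_save(path):
    """Inspect leftover files and move forward or backward (sketch)."""
    new_temp = path + ".new_temp"  # hypothetical temp-file names
    old_temp = path + ".old_temp"
    have_real = os.path.exists(path)
    have_new = os.path.exists(new_temp)
    have_old = os.path.exists(old_temp)

    if have_real and have_new:
        # Crash during or after step 1: the real file is still the old
        # version and the temp copy may be incomplete. Move backward.
        os.remove(new_temp)
        return "rolled back"
    if not have_real and have_new and have_old:
        # Crash after step 2: finish steps 3 and 4. Move forward.
        os.rename(new_temp, path)
        os.remove(old_temp)
        return "rolled forward"
    if have_real and have_old:
        # Crash after step 3: only the cleanup in step 4 is missing.
        os.remove(old_temp)
        return "rolled forward"
    return "nothing to do"


# Demo: simulate a crash right after step 2 of the save sequence.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "notes.txt")
with open(demo_path + ".new_temp", "w") as f:
    f.write("new text")
with open(demo_path + ".old_temp", "w") as f:
    f.write("old text")

outcome = recover_save(demo_path)
with open(demo_path) as f:
    recovered = f.read()
remaining = os.listdir(demo_dir)
```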
9Transaction Approach
- A transaction groups operations as a unit, with the following characteristics
- Atomic: all operations either happen or they do not (no partial operations)
- Serializable: transactions appear to happen one after the other
- Durable: once a transaction happens, it is recoverable and can survive crashes
10More on Transactions
- A transaction is not done until it is committed
- Once committed, a transaction is durable
- If a transaction fails to complete, it must roll back as if it did not happen at all
- Critical sections are atomic and serializable, but not durable
11Transaction Implementation (One Thread)
- Example: money transfer
- Begin transaction
- x = x - 1
- y = y + 1
- Commit
12Transaction Implementation (One Thread)
- Common implementations involve the use of a log, a journal that is never erased
- A file system uses a write-ahead log to track all transactions
13Transaction Implementation (One Thread)
- Once the changes to accounts x and y are in the log, the log is committed to disk in a single write
- Actual changes to those accounts are done later
14Transaction Illustrated
x = 1, y = 1
x = 1, y = 1
15Transaction Illustrated
x = 0, y = 2
x = 1, y = 1
16Transaction Illustrated
x = 0, y = 2
x = 1, y = 1
17Transaction Steps
- Mark the beginning of the transaction
- Log the changes in account x
- Log the changes in account y
- Commit
- Modify account x on disk
- Modify account y on disk
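The six steps above can be sketched with an in-memory model. This is an illustration only: `disk` and `log` are plain Python objects standing in for real storage, the record format (`"begin"`, `"set"`, `"commit"` tuples) is a hypothetical choice, and logging the new values (rather than the deltas) is an assumption.

```python
def transfer(disk, log, amount):
    """Move `amount` from account x to account y using a write-ahead log."""
    new_x = disk["x"] - amount
    new_y = disk["y"] + amount

    log.append(("begin",))           # 1. mark the beginning of the transaction
    log.append(("set", "x", new_x))  # 2. log the change to account x
    log.append(("set", "y", new_y))  # 3. log the change to account y
    log.append(("commit",))          # 4. commit

    disk["x"] = new_x                # 5. modify account x on disk
    disk["y"] = new_y                # 6. modify account y on disk


disk = {"x": 1, "y": 1}  # stand-in for the on-disk account balances
log = []                 # stand-in for the append-only log

transfer(disk, log, 1)
```

The key ordering constraint is that steps 5 and 6 never run before the commit record is in the log.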
18Scenarios of Crashes
- If a crash occurs after the commit
- Replay the log to update accounts
- If a crash occurs before or during the commit
- Roll back and discard the transaction
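The two crash scenarios can be sketched as a recovery pass over the log. This is a simplified model, not a real file system's recovery code: the tuple record format and the function name `recover` are hypothetical, and the log is assumed to store new values so that replaying a record twice is harmless.

```python
def recover(disk, log):
    """Replay committed transactions; discard any without a commit record."""
    pending = None
    for record in log:
        kind = record[0]
        if kind == "begin":
            pending = []                      # start collecting updates
        elif kind == "set" and pending is not None:
            pending.append((record[1], record[2]))
        elif kind == "commit" and pending is not None:
            for name, value in pending:       # crash after commit: replay
                disk[name] = value
            pending = None
    # Anything still pending here never committed, so it is discarded.
    return disk


committed_log = [("begin",), ("set", "x", 0), ("set", "y", 2), ("commit",)]
uncommitted_log = [("begin",), ("set", "x", 0), ("set", "y", 2)]

# Crash after the commit, before either account was modified on disk.
after_commit = recover({"x": 1, "y": 1}, committed_log)
# Crash after the commit, with only account x modified on disk.
half_applied = recover({"x": 0, "y": 1}, committed_log)
# Crash before the commit: the transaction is rolled back.
before_commit = recover({"x": 1, "y": 1}, uncommitted_log)
```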
19Two-Phase Locking (Multiple Threads)
- Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable)
- Solution: two-phase locking
- 1. Acquire all locks
- 2. Perform updates and release all locks
- Thread A cannot see thread B's changes until thread B commits and releases locks
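A minimal sketch of the two phases using Python threads follows. The account names, the `transfer` function, and the starting balances are hypothetical. Acquiring the locks in sorted order is an added assumption (not stated above) to keep two opposing transfers from deadlocking.

```python
import threading

balances = {"x": 100, "y": 100}
locks = {name: threading.Lock() for name in balances}


def transfer(src, dst, amount):
    """Transfer between two accounts under two-phase locking (sketch)."""
    needed = sorted([src, dst])  # assumption: fixed order avoids deadlock

    # Phase 1: acquire all locks before touching any data.
    for name in needed:
        locks[name].acquire()
    try:
        # Phase 2: perform the updates, then release all locks.
        balances[src] -= amount
        balances[dst] += amount
    finally:
        for name in reversed(needed):
            locks[name].release()


threads = [threading.Thread(target=transfer, args=("x", "y", 1)) for _ in range(50)]
threads += [threading.Thread(target=transfer, args=("y", "x", 2)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no transfer releases a lock before it has finished all of its updates, other threads never observe a state where money has left one account but not yet arrived in the other.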
20Transactions in File Systems
- Almost all file systems built since 1985 use write-ahead logging
- e.g., NTFS, HFS+, ext3, ext4
- Eliminates running fsck after a crash
- Write-ahead logging provides reliability
- Drawback: all modifications need to be written twice
21Log-Structured File System (LFS)
- If logging is so great, why don't we treat everything as log entries?
- Log-structured file system
- Everything is a log entry (file headers, directories, data blocks)
- Write the log only once
- Use version stamps to distinguish between old and new entries
22More on LFS
- New log entries are always appended to the end of the existing log
- All writes are sequential
- Seeks occur only during reads
- Not so bad due to temporal locality and caching
- Problem
- Need to create more contiguous space all the time
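The append-only behavior, the version stamps, and the space problem can be illustrated with a toy in-memory model. This is a sketch under stated assumptions: the class `LogFS` and its methods are hypothetical names, entries are `(version, name, data)` tuples, and `clean` stands in for whatever mechanism a real LFS uses to reclaim contiguous space.

```python
class LogFS:
    """Toy log-structured store: every write is an appended log entry."""

    def __init__(self):
        self.log = []          # list of (version, name, data) entries
        self.next_version = 0  # monotonically increasing version stamp

    def write(self, name, data):
        # New entries are always appended to the end of the log.
        self.log.append((self.next_version, name, data))
        self.next_version += 1

    def read(self, name):
        # The entry with the highest version stamp is the current one.
        newest = None
        for entry in self.log:
            if entry[1] == name and (newest is None or entry[0] > newest[0]):
                newest = entry
        if newest is None:
            raise KeyError(name)
        return newest[2]

    def clean(self):
        # Drop superseded entries so the space they used can be reused.
        latest = {}
        for entry in self.log:
            if entry[1] not in latest or entry[0] > latest[entry[1]][0]:
                latest[entry[1]] = entry
        self.log = sorted(latest.values())


fs = LogFS()
fs.write("a", "first")
fs.write("b", "only")
fs.write("a", "second")  # supersedes the first entry for "a"

size_before = len(fs.log)
fs.clean()
size_after = len(fs.log)
current_a = fs.read("a")
```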
23RAID and Reliability
- So far, we assume that we have a single disk
- What if we have multiple disks?
- The chance of a single-disk failure increases
- RAID: redundant array of independent disks
- Standard way of organizing disks and classifying the reliability of multi-disk systems
- General methods: data duplication, parity, and error-correcting codes (ECC)
24RAID 0
- No redundancy
- Uses block-level striping across disks
- i.e., 1st block stored on disk 1, 2nd block stored on disk 2
- Failure of any one disk causes data loss
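Block-level striping can be sketched as a simple mapping from a logical block number to a disk and an offset on that disk. The function name `locate_block` is hypothetical, indices are 0-based here (so "disk 1" above is index 0), and the round-robin wraparound after the last disk is an assumption.

```python
def locate_block(block_number, num_disks):
    """Map a logical block to (disk index, block offset on that disk)."""
    disk = block_number % num_disks     # consecutive blocks go to consecutive disks
    offset = block_number // num_disks  # position of the block on that disk
    return disk, offset


# Where the first six logical blocks land on a three-disk array.
layout = [locate_block(b, 3) for b in range(6)]
```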
25Non-Redundant Disk Array Diagram (RAID Level 0)
[Diagram: file system requests open(foo), read(bar), write(zoo) served by the disk array]
26Mirrored Disks (RAID Level 1)
- Each disk has a second disk that mirrors its contents
- Writes go to both disks
- Reliability is doubled
- Read access faster
- Drawback: write access slower
- Drawback: expensive and inefficient
27Mirrored Disk Diagram (RAID Level 1)
[Diagram: file system requests open(foo), read(bar), write(zoo) served by the disk array]
28Memory-Style ECC (RAID Level 2)
- Some disks in array are used to hold ECC
- Byte to detect error, extra bits for error correction
- More efficient than mirroring
- Can correct, not just detect, errors
- Drawback: still fairly inefficient
- e.g., 4 data disks require 3 ECC disks
29Memory-Style ECC Diagram (RAID Level 2)
[Diagram: file system requests open(foo), read(bar), write(zoo) served by the disk array]
30Bit-Interleaved Parity (RAID Level 3)
- Uses bit-level striping across disks
- i.e., 1st byte stored on disk 1, 2nd byte stored on disk 2
- One disk in the array stores parity for the other disks
- No detection bits needed, relies on disk controller to detect errors
- More efficient than Levels 1 and 2
- Drawback: parity disk doesn't add bandwidth
31Parity Method
- Disk 1: 1001
- Disk 2: 0101
- Disk 3: 1000
- Parity: 0100 = 1001 xor 0101 xor 1000
- To recover disk 2
- Disk 2: 0101 = 1001 xor 1000 xor 0100
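The same arithmetic can be checked in a few lines of Python, using the bit patterns from the example above. The variable names are illustrative only.

```python
from functools import reduce
from operator import xor

disks = [0b1001, 0b0101, 0b1000]  # contents of disks 1, 2, and 3

# Parity is the XOR of all data disks.
parity = reduce(xor, disks)

# Suppose disk 2 fails: XOR the surviving disks with the parity.
survivors = [disks[0], disks[2], parity]
recovered = reduce(xor, survivors)

parity_bits = format(parity, "04b")
recovered_bits = format(recovered, "04b")
```

This works because XOR-ing a value with itself cancels it out, so XOR-ing everything except the lost disk leaves exactly the lost disk's contents.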
32Bit-Interleaved RAID Diagram (Level 3)
[Diagram: file system requests open(foo), read(bar), write(zoo) served by the disk array]
33Block-Interleaved Parity (RAID Level 4)
- Like bit-interleaved, but data is interleaved in blocks
- More efficient data access than level 3
- Drawback: parity disk can be a bottleneck
- Drawback: small writes require 4 I/Os
- Read the old block
- Read the old parity
- Write the new block
- Write the new parity
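The four I/Os of a small write can be sketched as follows. The function `small_write` and the list-based disk model are hypothetical. The parity update rule used here (new parity = old parity xor old block xor new block) is the standard XOR identity and is stated as an assumption, since the steps above do not spell it out.

```python
def small_write(data_disks, parity_disk, disk_index, block_index, new_block):
    """Update one block and its parity; return the number of I/Os used."""
    io_count = 0

    old_block = data_disks[disk_index][block_index]   # read the old block
    io_count += 1
    old_parity = parity_disk[block_index]             # read the old parity
    io_count += 1

    data_disks[disk_index][block_index] = new_block   # write the new block
    io_count += 1
    # Remove the old block's contribution and add the new block's.
    parity_disk[block_index] = old_parity ^ old_block ^ new_block  # write the new parity
    io_count += 1

    return io_count


data_disks = [[0b1001], [0b0101], [0b1000]]  # one block per data disk
parity_disk = [0b0100]                       # parity of the three blocks

ios = small_write(data_disks, parity_disk, 1, 0, 0b0011)

# Cross-check against recomputing parity from all data disks.
full_parity = data_disks[0][0] ^ data_disks[1][0] ^ data_disks[2][0]
```

The other data disks are never read, which is why the cost stays at four I/Os regardless of how many disks are in the array.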
34Block-Interleaved Parity Diagram (RAID Level 4)
[Diagram: file system requests open(foo), read(bar), write(zoo) served by the disk array]
35Block-Interleaved Distributed-Parity (RAID Level 5)
- Sort of the most general level of RAID
- Spreads the parity out over all disks
- No parity disk bottleneck
- All disks contribute read bandwidth
- Requires 4 I/Os for small writes
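One way to spread parity over all disks is to rotate the parity block's position from stripe to stripe. The sketch below shows one possible rotation. The function names and the specific rotation rule are assumptions for illustration; real arrays may place parity differently.

```python
def parity_disk(stripe, num_disks):
    """Disk index holding the parity block for a given stripe (assumed rotation)."""
    return (num_disks - 1 - stripe) % num_disks


def stripe_layout(stripe, num_disks):
    """Mark each disk in a stripe as data ("D") or parity ("P")."""
    p = parity_disk(stripe, num_disks)
    return ["P" if d == p else "D" for d in range(num_disks)]


# Four stripes on a four-disk array.
layout = [stripe_layout(s, 4) for s in range(4)]

# How many parity blocks each disk holds across those stripes.
parity_counts = [sum(row[d] == "P" for row in layout) for d in range(4)]
```

With this rotation every disk holds the same share of parity blocks, so parity updates are not concentrated on a single disk.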
36Block-Interleaved Distributed-Parity Diagram (RAID Level 5)
[Diagram: file system requests open(foo), read(bar), write(zoo) served by the disk array]