Title: Log-Structured File Systems (Rosenblum and Ousterhout)
1 Log-Structured File Systems (Rosenblum and Ousterhout)
2 Rotational Latency and Seek
(Figure: disk diagram labeling rotational latency and seek.)
3 Introduction
- CPUs are fast. Disks are slow. Real slow.
- CPU cycle time is about 0.3 ns.
- Memory latency is about 50 ns.
- Disk seek is about 5,000,000 ns.
- Transfer rate is about 100 MB/s.
- Log-structured file systems are based on three assumptions:
- Memory is plentiful and cheap.
- Most reads come from RAM.
- Disk I/O dominated by writes.
- Contrast with logging or journalling FS.
4 File Systems of the 1990s
- Workloads
- Office and engineering
- Many small files
- Time dominated by bookkeeping (metadata)
- Scientific computing
- Sequential access to large files
- Time dominated by transfer rate (hopefully)
- Other workloads?
5
- Problems with existing file systems (old FFS/UFS):
- They spread information around the disk, causing too many small accesses.
- They physically separate different files.
- Attributes (inode), directory entry, and file data are all separate.
- Five I/O operations to create a new file.
- When writing small files, less than 5% of potential bandwidth is used.
- They tend to write synchronously.
- Data is asynchronous, but metadata is synchronous.
- With small files, traffic is dominated by metadata.
6 Log-Structured File Systems
7 Log-Structured File System
- Buffer a sequence of changes, then write them all at once, sequentially, to the end of the log (sketched below).
- Another use for buffers: allows reordering.
- The write includes almost everything: file data, inodes, etc.
- Faster writes, and faster crash recovery. (Why?)
- Two issues:
- Locating and reading files.
- Managing free space.
- Good performance requires large extents of free space.
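A minimal sketch in C of the buffered write path just described; the segment size, block size, and disk_write_sequential are illustrative assumptions, not Sprite LFS's actual interface.

    #include <stddef.h>
    #include <string.h>

    #define SEG_SIZE (512 * 1024)  /* a segment size in the paper's range */
    #define BLK_SIZE 4096

    extern void disk_write_sequential(const void *buf, size_t len); /* hypothetical */

    static char   seg_buf[SEG_SIZE]; /* in-memory segment under construction */
    static size_t seg_used;          /* bytes filled so far */

    /* Append one dirty block to the log; when the segment fills, flush it
       with a single large sequential write, so no per-block seeks are paid. */
    void log_append(const void *blk)
    {
        memcpy(seg_buf + seg_used, blk, BLK_SIZE);
        seg_used += BLK_SIZE;
        if (seg_used == SEG_SIZE) {
            disk_write_sequential(seg_buf, SEG_SIZE);
            seg_used = 0;
        }
    }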
8 File Location and Reading
- The inode contains metadata and addresses for the first 10 data blocks.
- For larger files, the inode also contains indirect blocks.
- In FFS, inode locations are static. In LFS, inodes are written to the log.
- Any change requires rewriting the inode to the log.
- The inode map maintains the location of each inode.
- The inode map itself is written to the log.
- A fixed checkpoint region identifies the inode map blocks (see the sketch below).
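A hedged sketch of the lookup path: the fixed checkpoint region locates the inode map, and the inode map locates the latest copy of each inode in the log. All type and field names here are hypothetical simplifications.

    #include <stdint.h>

    #define MAX_INODES 4096              /* hypothetical fixed bound */

    typedef uint32_t log_addr_t;         /* disk address of a block in the log */

    struct checkpoint_region {           /* lives at a fixed disk address */
        log_addr_t imap_blocks[16];      /* where the inode map currently is */
        uint64_t   timestamp;
    };

    struct inode_map {
        log_addr_t inode_addr[MAX_INODES]; /* inode number -> latest inode copy */
    };

    /* Unlike FFS, an inode's address changes on every rewrite, so reading a
       file starts with one extra indirection through the inode map. */
    log_addr_t locate_inode(const struct inode_map *imap, uint32_t inum)
    {
        return imap->inode_addr[inum];
    }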
9 FFS (figure only)
11 Free Space Management: Segments
- Most difficult issue is management of free space.
- Goal is to maintain large free extents.
- Two choices: threading or compaction.
- Pros/cons?
(Figure: the same four blocks laid out under threading vs. compaction.)
12 Hybrid
- Sprite LFS uses a hybrid scheme.
- Disk divided into fixed size segments.
- Threaded between segments.
- Compaction within a segment.
- Segment size is chosen so that transfer time is much greater than access time: 512 KB or 1 MB.
(Figure: blocks 1-6 arranged across fixed-size segments under the hybrid scheme.)
13 Segment Cleaning
- The steps are:
- Read some number of segments.
- Why might it need to read more than one?
- Identify live data.
- Write it to a smaller number of clean segments.
- Mark the original segments as clean.
- Essentially: read segments, coalesce live data, write them to new segments.
14
- Must identify:
- Which blocks of a segment are live.
- To which file each block belongs.
- The position of each block within its file.
- Needed to update the inode.
- Use a segment summary block, which identifies each piece of information in a segment.
- Liveness is determined by checking the inode or indirect block to see if it still points to the claimed data block.
- Optimization: use a generation (version) number that is incremented whenever a file is truncated or deleted. The segment summary also includes this number (see the liveness sketch below).
- When must we still check?
- Overflow danger?
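A hedged sketch of the liveness check just described; the field and function names are hypothetical, not the paper's actual structures.

    #include <stdint.h>

    /* One entry of the segment summary block. */
    struct summary_entry {
        uint32_t inum;      /* file this block claims to belong to */
        uint32_t offset;    /* block's position within that file */
        uint32_t version;   /* file's generation number at write time */
    };

    /* The version comparison is the fast path: if the file has since been
       truncated or deleted, the block is dead with no further I/O. Otherwise
       we must still consult the inode (or indirect block), because the block
       may have been overwritten by a newer copy elsewhere in the log. */
    int block_is_live(const struct summary_entry *e,
                      uint32_t current_version,   /* from the inode map */
                      uint32_t addr_in_inode,     /* where the inode points */
                      uint32_t this_block_addr)   /* where this copy lives */
    {
        if (e->version != current_version)
            return 0;
        return addr_in_inode == this_block_addr;
    }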
15 Segment Cleaning Policies
- Four policy issues:
- When should the segment cleaner execute?
- Continuously, only when needed, etc.
- How many segments to clean at a time?
- Must free enough space to result in at least one clean segment.
- Which segments to clean?
- Lowest utilized, oldest, etc.
- What reordering to perform when rewriting a segment?
- Group files in the same directory, group temporally, etc.
- The first two were ignored; they do not seem to be important. Sprite LFS starts cleaning when the number of clean segments drops below a watermark, and cleans a few tens of segments at a time.
- The third and fourth are important.
16 Write Cost
- The ratio of total I/O time to the time needed just to transfer the new data.
- Can ignore access time for LFS. (Why?)
- Example at utilization u = 2/3: read 3 segments, write back 2 segments of live data, and write 1 segment of new data, all to free one segment's worth of space: (3 + 2 + 1)/1 = 6. In general the write cost is 2/(1 - u), computed below.
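The general formula from the paper as a small C program; it reproduces the example above.

    #include <stdio.h>

    /* To free one segment's worth of space at utilization u, the cleaner
       reads N segments, writes back N*u of live data, and the new data
       fills the freed N*(1-u); total I/O per unit of new data is
       (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1 - u). */
    double write_cost(double u)
    {
        return 2.0 / (1.0 - u);   /* grows without bound as u -> 1 */
    }

    int main(void)
    {
        printf("%.1f\n", write_cost(2.0 / 3.0));  /* the example above: 6.0 */
        return 0;
    }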
17 Write Cost
- Do we want to clean segments with lots of live
data, or little live data?
18 Distribution of u
- How do we get segments with low u?
- Can all segments have low u?
19 Bimodal Distribution of u
- This implies that we want to force a bimodal distribution.
- You want to clean segments with low u, because if you are going to put in the work, you want to get the most reclaimed space out of it.
- Which distribution is better?
(Figure: two distributions of u with different variance.)
20 Simulation Results
21 Simulation
- Fixed number of 4 KB files. No reading, just rewriting.
- Two access patterns:
- Uniform (no cleaner reordering).
- Hot-and-cold (cleaner reordering based on age).
- One group contains 10% of the files and has a 90% chance of being selected.
- The other group contains 90% of the files and has a 10% chance of being selected.
- The simulator runs until all clean segments are exhausted, then runs the cleaner until a threshold of clean segments is reached.
- The cleaner always chooses the least-utilized segments, with no reordering.
- How do you expect this to perform?
22
- Hot-and-cold performs worse than they expected.
23 Promoting Variance
- The key to low write cost is variance.
- For a given total disk utilization, high variance will result in more segments that give a lot of bang for the buck.
- The cleaner generates variance.
(Figure: segment utilizations before and after cleaning.)
- How do we make this variance last a long time?
24 Preferring Stable Files
- If we clean stable files, the resulting full segments are more likely to stay full for a long time, which increases variance (zero-sum).
- A segment usage table helps support this computation; see the cost-benefit sketch below.
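The paper's cost-benefit cleaning policy captures this preference; a minimal sketch, with u and the age of the segment's youngest block taken from the segment usage table (the age units are a policy detail).

    /* Clean the segment that maximizes benefit/cost:
       cost is 1 + u (read the whole segment, write back u of live data);
       benefit is (1 - u) * age (freed space, weighted by how long it is
       likely to stay free, estimated from the data's age). */
    double cost_benefit(double u, double age)
    {
        return (1.0 - u) * age / (1.0 + u);
    }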
25 Hot-and-cold
(Graphs: distributions of segment utilization under lowest-utilization and max-benefit cleaning, for uniform and hot-and-cold access patterns.)
- Which has higher variance in each graph?
27 Crash Recovery
28 Crash Recovery
- Two-pronged approach:
- Checkpoint: a complete, self-contained record of a consistent state of the file system.
- Roll-forward: recover operations performed after the checkpoint by processing a list of changes.
- Why not just use checkpoints?
- Why not just use the log?
- When should we checkpoint more often? Less often?
29 Checkpoints
- Two phases:
- Write out all modified information.
- Write pointers to all inode maps and the segment usage table, plus the current time and a pointer to the last segment written, to the checkpoint region.
- What happens if we crash while writing the checkpoint region?
- Two checkpoint regions (see the sketch below).
- How do we make sure it won't use the wrong one?
- What kind of ordering guarantees are required?
- Are we sure that the previous checkpoint is good?
- What happens if we crash while writing the last block?
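A sketch of why two regions suffice, assuming (as a simplification) that the timestamp is written last, so a torn checkpoint write leaves a stale timestamp behind.

    #include <stdint.h>

    struct checkpoint {
        /* ... pointers to inode map blocks and segment usage table ... */
        uint64_t timestamp;   /* written only after everything else is on disk */
    };

    /* Recovery reads both fixed regions and uses the one with the newer
       timestamp: the most recent checkpoint known to be complete. */
    const struct checkpoint *
    pick_checkpoint(const struct checkpoint *a, const struct checkpoint *b)
    {
        return (a->timestamp > b->timestamp) ? a : b;
    }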
30 Roll-Forward
- Use segment summary information to update the inode map.
- If a new inode is found, update the inode map obtained from the checkpoint.
- Also adjust the segment utilization table.
- Directory entries and inodes may be inconsistent: they are two blocks, which may be written in different orders. Solutions?
- Use a directory operation log within the main log.
- Do we need the directory entries anymore? Inodes?
- Checkpoints would take longer.
- Other solutions?
31 Performance
32 Micro-benchmarks
- Machines were not fast enough to be disk-bound for general use.
- Sun-4/260, 32 MB RAM, Wren IV disk (1.3 MB/s, 17.5 ms average seek).
- No cleaning. Cleaning overhead measured separately.
33
- Small-file performance under Sprite LFS and SunOS: create 10,000 1 KB files, then read them back in the same order, then delete them.
- The logging approach provides an order-of-magnitude speedup for creation and deletion.
- In SunOS, the disk was 85% saturated, so faster processors will not help much. In Sprite LFS, the disk was only 17% saturated, while the CPU was 100% utilized.
34 Faster CPU?
(Timeline figure, reconstructed: with the disk 17% busy, CPU time is 1 - 0.17 = 0.83; a 4x faster CPU cuts that to 0.25 * (1 - 0.17) = 0.21, while disk time stays at 0.17, so the total is about 0.38 and the speedup is 1/0.38 = 2.65. The arithmetic appears as code below.)
- Do we really get 4 times speedup if the CPU is 4 times faster?
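The slide's arithmetic as a small function; this is just Amdahl's law with the disk fraction held fixed.

    /* Only the CPU component shrinks, so the 17% disk time bounds the
       overall speedup no matter how fast the CPU gets. */
    double speedup(double disk_frac, double cpu_factor)
    {
        double cpu_frac = 1.0 - disk_frac;
        return 1.0 / (disk_frac + cpu_frac / cpu_factor);
    }
    /* speedup(0.17, 4.0) = 1 / (0.17 + 0.83/4), about 2.65 */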
35 Large Files
- Large-file performance under Sprite LFS and SunOS: create a 100 MB file with sequential writes, read it back sequentially, write 100 MB randomly, read 100 MB randomly, and finally read sequentially.
36 Fair?
- Could you design a FS to beat or equal Sprite on these micro-benchmarks?
- Cleaning overheads were measured separately.
37 Cleaning Overheads
- Recorded statistics over a period of several months.
- /user6: Home directories for Sprite developers. Program development, text processing, electronic communication, and simulations.
- /pcs: Home directories and project area for research on parallel processing and VLSI circuit design.
- /src/kernel: Sources and binaries for the Sprite kernel.
- /swap2: Sprite client workstation swap files. Backing store for 40 diskless workstations. Large, sparse, non-sequential access.
- /tmp: Temporary file storage area for 40 workstations.
38 Segment Utilization in /user6
- Distribution of segment utilizations in a recent snapshot of the /user6 disk.
- Good or bad?
39
- More than half of the segments cleaned were empty. Nonempty segments also had very low utilization.
- Long-term write performance is 70% of the maximum sequential write bandwidth.
- Why better than the simulation?
- Files are bigger. (So what?)
- Traffic is extremely low. Does that suggest some ways to improve?
40 Recovery Time
- Recovery time for 1, 10, and 50 MB of fixed-size files. Same system as the micro-benchmarks. Time is dominated by the number of files to be recovered.
- No checkpoints were done.
41 Disk Usage
- Block types marked with an asterisk have equivalent data structures in Unix FFS.
- Inode map usage is a little high.
42 Summary
- The key design principle is to minimize head movement, but:
- Make good use of disk space.
- Don't be too CPU intensive.
- Be able to recover from crashes.
- And bad blocks.
- It is absolutely critical to know the access pattern.
44 More Background
- I'll try to give a half-lecture on background.
- But you can help me a lot by sending me e-mail explaining exactly what I need to cover. Just briefly skim the paper.
45 Next Week
- No class on ?.
- KD will give a lecture on ? on the Stankovic
paper.
46 Make-Up Lecture
- Saturday
- Sunday
- Monday
- 1:00-2:30
- 6:30-8:00
- Tuesday
- 11:30-1:00
- Wednesday
- 12:00-3:00
- 6:30-8:00
- Friday
- 12:00-3:30
47 Soft Updates (McKusick and Ganger)
48 Two Kinds of Data
- Two kinds of data
- Metadata
- Directories, inodes, free block maps, etc.
- File data
- Actual data in the file.
- What kinds of consistency guarantees?
- No pointers to uninitialized space
- No multiple resource ownership
- No unreferenced live resources
- File data guarantees?
49 Kinds of Failures
- What kinds of failures are there?
- Power
- Bad disk
- Misbehaving controller, wrong blocks
- Taxonomy
- Halting failure
- Byzantine
50 Traditional Synchronicity
- Creating a file
- First initialize inode
- Then point to it.
- Alternatives
- Ignore it
- Logging/atomic/doing-things-carefully
- Non-volatile
- Soft-updates
51 Soft Updates
- Allow block writes to be reordered.
- But make sure the data that is written is consistent with what has been written before.
- Use additional bookkeeping to coerce the data in each block to be consistent.
52 Set Theory
- Reflexive
- Irreflexive
- Symmetric
- Antisymmetric, asymmetric
- Transitive
53 Orders
- Weak
- Reflexive, antisymmetric, transitive
- Strict (strong)
- Irreflexive, asymmetric, transitive
- Partial order
- Not all elements comparable (e.g., the subset relation on sets)
- Total order
- All elements comparable (e.g., <= on the integers)
54 Dependencies
(Figure: two dependency graphs over operations O1-O4, induced by the writes A <- 3 and B <- 2.)
- Strict or weak? Total or partial?
55 Execution
(Figure, steps 1-6: operations O1-O4 of T1 and T2 execute step by step, in an order consistent with the dependency graphs.)
56 Multiple Views
(Figure: in-core and on-disk views passing through states S1-S6 as the operations A <- 3 and B <- 2 are applied, starting from A = 1, B = 0, under the invariant A > B.)
- Some invariants to preserve.
- Some write orderings to preserve those invariants.
- One view in memory, which must be efficient.
- Another view on disk.
- The two may not match.
57 Cycles
(Figure: writes A <- 1, B <- 2, C <- 3, where A, B, and C share one disk block, creating a dependency cycle.)
- What if all are on one disk block? Does that solve the problem? Strict or weak?
58 Induced Cycles
(Figure: writes A <- 1, B <- 2, C <- 3, D <- 4, where A and D share an in-core block; the false sharing induces a cycle among block writes.)
59 Undo/Redo
(Figure: the same operations and in-core blocks as the previous slide.)
- Undo (roll back) before writing.
- Redo (roll forward) after writing (sketched below).
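A minimal sketch of the undo/redo mechanism; the dependency records, callbacks, and disk_write are hypothetical names, not the paper's actual structures.

    /* Each not-yet-safe change to a buffer carries undo/redo callbacks. */
    struct dep {
        void (*undo)(void *buf);   /* remove the unsafe change */
        void (*redo)(void *buf);   /* reapply it */
        struct dep *next;
    };

    extern void disk_write(void *buf);   /* hypothetical raw block write */

    /* Roll back pending changes so the disk only ever sees a consistent
       (older) state, write the block, then roll forward in core. */
    void write_block_with_deps(void *buf, struct dep *deps)
    {
        struct dep *d;
        for (d = deps; d != NULL; d = d->next)
            d->undo(buf);
        disk_write(buf);
        for (d = deps; d != NULL; d = d->next)
            d->redo(buf);
    }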
60 Efficiency
(Figure: one block holds O1; another holds O2, O3, and O4, with dependencies among the operations.)
- Which block should be written first?
61 Three Rules
- Never point to a structure before it has been initialized. (Why?)
- An inode must be initialized before a directory entry references it.
- Never reuse a resource before nullifying all previous pointers to it.
- An inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode.
- Never reset the last pointer to a live resource before a new pointer has been set.
- When renaming a file, do not remove the old name for an inode until after the new name has been written (sketched below).
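A sketch of the third rule applied to rename, enforced here with one synchronous write for clarity (soft updates would record a dependency instead); all types and helpers are hypothetical.

    #include <stdint.h>

    struct dirblock;
    extern void add_direntry(struct dirblock *db, int slot, uint32_t inum);
    extern void clear_direntry(struct dirblock *db, int slot);
    extern void write_block_sync(struct dirblock *db);
    extern void write_block_async(struct dirblock *db);

    void rename_ordered(struct dirblock *olddb, int oldslot,
                        struct dirblock *newdb, int newslot, uint32_t inum)
    {
        add_direntry(newdb, newslot, inum);
        write_block_sync(newdb);     /* new name is safely on disk first */
        clear_direntry(olddb, oldslot);
        write_block_async(olddb);    /* only then remove the last old name */
    }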
62 Previous Solutions
- Synchronous writes
- NVRAM
- Atomic updates
- Scheduler-enforced ordering
- Changes to the disk scheduler
- Interbuffer dependencies
- Too many synchronous writes to avoid cycles.
63 Characteristics of an Ideal Solution
- Applications should never wait unless they choose to do so.
- Propagate modified metadata with the minimum number of writes (allow coalescing).
- Minimize memory usage.
- The write-back code and disk scheduler should not be constrained in ordering.
- Any inherent conflicts?
64 Cyclic Dependency
65 Undo/Redo 1
66 Undo/Redo 2
- Ordering between add and delete?
67 Undo/Redo 3
68 Soft Updates in FFS
- Block allocation
- Block deallocation
- Link addition
- Link removal
69 Block Allocation
(Dependency diagram: initialize the block, and update the free bitmap, before setting the block pointer.)
70 Block Deallocation
(Dependency diagram: clear the block pointer before marking the block free in the bitmap.)
71 Link Addition
(Dependency diagram: write the new inode, and its free-bitmap update, before the new directory entry.)
72 Link Removal
(Dependency diagram: clear the directory entry before decrementing the inode reference count; sketched below.)
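A sketch of the ordering behind the link-removal diagram, again using a synchronous write for clarity; all names are hypothetical.

    #include <stdint.h>

    struct dirblock;
    struct inode { uint32_t refcnt; };
    extern void clear_direntry(struct dirblock *db, int slot);
    extern void write_block_sync(struct dirblock *db);
    extern void write_inode(struct inode *ip);

    /* The cleared directory entry must reach disk before the reference
       count drops, so no on-disk entry ever names an inode that has been
       freed and possibly reused. */
    void remove_link(struct dirblock *db, int slot, struct inode *ip)
    {
        clear_direntry(db, slot);
        write_block_sync(db);
        ip->refcnt--;
        write_inode(ip);
    }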