Transcript and Presenter's Notes: Log-Structured File Systems (Rosenblum and Ousterhout)
1
Log-Structured File Systems (Rosenblum and Ousterhout)
  • Kenneth Chiu

2
(Diagram: disk geometry, illustrating seek and rotational latency.)
3
Introduction
  • CPUs are fast. Disks are slow. Real slow.
  • CPU cycle time is about 0.3 ns.
  • Memory latency is about 50 ns.
  • Disk seek is about 5,000,000 ns.
  • Transfer rate is about 100 MB/s.
  • Log-structured file systems are based on these
    assumptions:
  • Memory is plentiful and cheap.
  • Most reads come from RAM.
  • Disk I/O is dominated by writes.
  • Contrast with a logging or journaling FS.
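To put the gap in perspective, a quick back-of-the-envelope calculation using the figures above:

```python
# Back-of-the-envelope: how much does one disk seek cost in CPU terms?
cpu_cycle_ns = 0.3
memory_latency_ns = 50
disk_seek_ns = 5_000_000

print(disk_seek_ns / cpu_cycle_ns)       # ~16.7 million CPU cycles per seek
print(disk_seek_ns / memory_latency_ns)  # ~100,000 memory accesses per seek
```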

4
File Systems of the 1990s
  • Workloads
  • Office and engineering
  • Many small files
  • Times dominated by bookkeeping (metadata)
  • Scientific computing
  • Sequential access to large files
  • Time dominated by transfer rate (hopefully)
  • Other workloads?

5
  • Problems with existing file systems (old FFS/UFS)
  • They spread information around the disk, causing
    too many small accesses.
  • They physically separate different files.
  • Attributes (inode), directory entry, and file
    data are all separate.
  • Five I/O operations to create a new file.
  • When writing small files, less than 5% of the
    potential bandwidth is used.
  • They tend to write synchronously.
  • Data is asynchronous, but metadata is
    synchronous.
  • With small files, traffic is dominated by
    metadata.

6
Log-Structured File Systems
7
Log-Structured File System
  • Buffer a sequence of changes, then write them all
    at once, sequentially, to the end of the log.
  • Another use for buffers: allowing reordering.
  • The write includes almost everything: file data,
    inodes, etc.
  • Faster writes, and faster crash recovery. (Why?)
  • Two issues:
  • Locating and reading files.
  • Managing free space.
  • Good performance requires large extents of free
    space.
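A minimal sketch of this write path, assuming a hypothetical `disk` object with a sequential `append_segment()` call; all names here are illustrative, not Sprite LFS's actual interfaces:

```python
SEGMENT_SIZE = 512 * 1024  # Sprite LFS used 512 KB or 1 MB segments

class LogBuffer:
    """Accumulate dirty data in RAM; flush it as one sequential write."""

    def __init__(self, disk):
        self.disk = disk      # assumed to expose append_segment(items)
        self.pending = []     # (kind, payload): file data, inodes, inode map...
        self.size = 0

    def write(self, kind, payload):
        self.pending.append((kind, payload))
        self.size += len(payload)
        if self.size >= SEGMENT_SIZE:
            self.flush()

    def flush(self):
        # One large sequential transfer replaces many small seeks;
        # buffering also gives us the chance to reorder pending writes.
        self.disk.append_segment(self.pending)
        self.pending, self.size = [], 0
```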

8
File Location and Reading
  • Inode contains metadata and addresses for first
    10 data blocks.
  • For larger files, inode also contains indirect
    blocks.
  • In FFS, inode location is static. In LFS, they
    are written to the log.
  • Any changes require rewriting to log.
  • Inode map maintains location of inode.
  • Inode map itself is written to log.
  • Fixed checkpoint region identifies inode map
    blocks.
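A sketch of the resulting read path, with hypothetical names (`read_checkpoint_region`, `imap_addrs`, and so on) standing in for the real on-disk layout:

```python
ENTRIES_PER_IMAP_BLOCK = 1024  # assumed; depends on block and entry sizes

def read_file_block(disk, inum, block_no):
    """LFS lookup chain: checkpoint region -> inode map -> inode -> data."""
    cp = disk.read_checkpoint_region()        # the only fixed disk location
    imap_addr = cp.imap_addrs[inum // ENTRIES_PER_IMAP_BLOCK]
    imap = disk.read(imap_addr)               # the inode map lives in the log
    inode = disk.read(imap.inode_addr(inum))  # and so does the inode itself
    if block_no < 10:
        return disk.read(inode.direct[block_no])
    raise NotImplementedError("indirect blocks omitted from this sketch")
```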

9
FFS
10
(No Transcript)
11
Free Space Management: Segments
  • The most difficult issue is management of free
    space.
  • The goal is to maintain large free extents.
  • Two choices: threading or compaction.
  • Pros/cons?

(Diagram: the same four blocks, 1-4, managed by threading vs. by compaction.)
12
Hybrid
  • Sprite LFS uses a hybrid scheme.
  • The disk is divided into fixed-size segments.
  • Threaded between segments.
  • Compaction within a segment.
  • Segment size is chosen so that transfer time is
    much greater than access time: 512 KB or 1 MB.
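As a sanity check on the size choice, here is the transfer-vs-access ratio for a 1 MB segment on the Wren IV disk used in the paper's benchmarks (1.3 MB/s transfer, 17.5 ms average seek):

```python
# Segment size check: transfer time should dwarf access time.
transfer_rate = 1.3e6        # bytes/s (Wren IV)
avg_seek_s = 17.5e-3         # seconds
segment_bytes = 2**20        # 1 MB segment

transfer_s = segment_bytes / transfer_rate
print(transfer_s)                # ~0.81 s to transfer the whole segment
print(transfer_s / avg_seek_s)   # ~46x the average seek time
```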

(Diagram: numbered blocks threaded across fixed-size segments, with live data compacted within each segment.)
13
Segment Cleaning
  • The steps are:
  • Read some number of segments.
  • Why might it need to read more than one?
  • Identify the live data.
  • Write it to a smaller number of clean segments.
  • Mark the original segments as clean.
  • Essentially: read segments, coalesce the live
    data, and write it to new segments (see the
    sketch below).
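A sketch of one cleaning pass, using hypothetical `disk` helpers (`read_segment`, `mark_clean`) and an `is_live` check that is spelled out on the next slide:

```python
def clean(disk, victim_segments):
    """Read victims, coalesce live blocks, rewrite them, reclaim segments."""
    live = []
    for seg in victim_segments:
        for block in disk.read_segment(seg).blocks:
            if is_live(disk, block.summary_entry):
                live.append(block)
    for block in live:
        disk.log_buffer.write("data", block.payload)  # rewritten at log head
    for seg in victim_segments:
        disk.mark_clean(seg)  # entire segments become reusable free extents
```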

14
  • Must identify:
  • Which blocks of a segment are live
  • To which file each block belongs
  • The position of the block within the file
  • (Needed to update the inode.)
  • Use a segment summary block, which identifies
    each piece of information in a segment.
  • Liveness is determined by checking the inode or
    indirect block to see if it still points to the
    claimed data block.
  • Optimization: use a generation (version) number
    that is incremented whenever a file is truncated
    or deleted. The segment summary also includes
    this number.
  • When must we still check?
  • Overflow danger?
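A sketch of the liveness check, assuming each summary entry records the block's inode number, file offset, version number, and its own address:

```python
def is_live(disk, entry):
    """Is the block described by this segment-summary entry still live?"""
    # Fast path: if the file's current version differs, every block the
    # old version wrote is dead (the file was truncated or deleted).
    if disk.inode_map.version(entry.inum) != entry.version:
        return False
    # Slow path: the inode (or an indirect block) must still point here.
    inode = disk.read_inode(entry.inum)
    return inode.block_addr(entry.offset) == entry.addr
```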

15
Segment Cleaning Policies
  • Four policy issues:
  • When should the segment cleaner execute?
  • Continuously, only when needed, etc.
  • How many segments to clean at a time?
  • Must free enough space to produce at least one
    clean segment.
  • Which segments to clean?
  • Least utilized, oldest, etc.
  • What reorderings to perform when rewriting a
    segment?
  • Group files in the same directory, group
    temporally, etc.
  • The first two are ignored; they do not seem to be
    important. Sprite LFS starts cleaning when the
    number of clean segments drops below a watermark,
    and cleans a few tens of segments at a time.
  • The third and fourth are important.

16
Write Cost
  • The ratio of total I/O time to the time needed
    just to transfer the new data.
  • Access time can be ignored for LFS. (Why?)

Example: with u = 2/3, cleaning reads 3 segments, rewrites 2 segments' worth of live data, and writes 1 segment of new data, so write cost = (3 + 2 + 1) / 1 = 6 = 2 / (1 - u).
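The same computation as a function of u, the live fraction of the segments being cleaned:

```python
def write_cost(u):
    """Bytes moved per byte of new data written.

    Cleaning N segments reads N segments, writes back N*u of live data,
    and leaves N*(1-u) of room for new data:
        (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1 - u).
    The paper treats u == 0 specially: an empty segment need not be read.
    """
    if u == 0:
        return 1.0
    return 2 / (1 - u)

print(write_cost(2 / 3))  # 6.0: the (3 + 2 + 1) / 1 example above
print(write_cost(0.8))    # 10.0: cleaning mostly-live segments is costly
```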
17
Write Cost
  • Do we want to clean segments with lots of live
    data, or little live data?

18
Distribution of u
  • How do we get segments with low u?
  • Can all segments have low u?

19
Bimodal Distribution of u
  • Implies that we want to force a bimodal
    distribution.
  • You want to clean segments with low u, because if
    you are going to put in the work, you want to get
    the most reclaimed space out of it.
  • Which distribution is better?

(Diagram: two segment-utilization distributions with different variance.)
20
Simulation Results
21
Simulation
  • A fixed number of 4 KB files. No reading, just
    rewriting.
  • Two access patterns:
  • Uniform (no cleaner reordering)
  • Hot-and-cold (cleaner reordering based on age)
  • One group contains 10% of the files and has a 90%
    chance of being selected.
  • The other group contains 90% of the files and has
    a 10% chance of being selected.
  • The simulator runs until all clean segments are
    exhausted, then runs the cleaner until a
    threshold of clean segments is reached.
  • The cleaner always chooses the least-utilized
    segments, with no reordering.
  • How do you expect this to perform?
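A sketch of the hot-and-cold file selector (the simulator's code is not in the paper; this just realizes the 90/10 split described above):

```python
import random

def pick_file(n_files):
    """Hot-and-cold: 10% of the files receive 90% of the (re)writes."""
    n_hot = max(1, n_files // 10)
    if random.random() < 0.9:
        return random.randrange(n_hot)                # hot group
    return n_hot + random.randrange(n_files - n_hot)  # cold group
```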

22
  • Hot-and-cold performs worse than they expected.

23
Promoting Variance
  • The key to low write cost is variance.
  • For a given total disk utilization, high variance
    will result in more segments that give a lot of
    bang-for-the-buck.
  • The cleaner generates variance.

(Diagram: segment utilizations before and after cleaning.)
  • How do we make this variance last a long time?

24
Preferring Stable Files
  • If we clean stable (cold) files, the resulting
    full segments are more likely to stay full for a
    long time, which increases variance (zero-sum).
  • A segment usage table helps support this
    computation (see the cost-benefit sketch below).
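Sprite LFS expresses this preference with the paper's cost-benefit policy: clean the segment with the highest benefit-to-cost ratio, where the age of a segment's data comes from the segment usage table:

```python
def cost_benefit(u, age):
    """Benefit/cost of cleaning a segment with utilization u and given age.

    benefit = free space generated * how long it will likely stay free
            = (1 - u) * age
    cost    = 1 + u  (read the whole segment, write back the live part)
    """
    return (1 - u) * age / (1 + u)

# Cold (old) data is worth cleaning even at fairly high utilization:
print(cost_benefit(u=0.75, age=1000))  # ~142.9: old data, clean it
print(cost_benefit(u=0.75, age=10))    # ~1.4: hot segment, leave it alone
```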

25
(Graphs: write cost vs. disk utilization under the uniform and hot-and-cold patterns, comparing the greedy lowest-utilization policy with the cost-benefit max-benefit policy.)
  • Which has higher variance in each graph?

26
(No Transcript)
27
Crash Recovery
28
Crash Recovery
  • Two-pronged approach:
  • Checkpoint: a complete, self-contained record of
    a consistent state of the file system.
  • Roll-forward: recover operations performed after
    the checkpoint by processing a list of changes.
  • Why not just use checkpoints?
  • Why not just use the log?
  • When should we checkpoint more often? Less often?

29
Checkpoints
  • Two phases:
  • Write out all modified information.
  • Write pointers to all inode map blocks and the
    segment usage table, plus the current time and a
    pointer to the last segment written, into the
    checkpoint region.
  • What happens if we crash while writing the
    checkpoint region?
  • Two checkpoint regions (see the sketch below).
  • How do we make sure it won't use the wrong one?
  • What kind of ordering guarantees are required?
  • Are we sure that the previous checkpoint is good?
  • What happens if we crash while writing the last
    block?
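A sketch of how recovery might choose between the two regions, assuming each region's timestamp is written last so that a torn write leaves an old or invalid timestamp behind (the names here are illustrative):

```python
def load_checkpoint(disk):
    """Pick the newest checkpoint region whose write completed."""
    regions = [disk.read_checkpoint_region(i) for i in (0, 1)]
    valid = [cp for cp in regions if cp is not None and cp.is_complete()]
    # At least one region is intact: they are written alternately, and a
    # crash can corrupt only the one being written at the time.
    return max(valid, key=lambda cp: cp.timestamp)
```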

30
Roll-Forward
  • Use segment summary information to update the
    inode map.
  • If a new inode is found, update the inode map
    obtained from the checkpoint.
  • Also adjust the segment utilization table.
  • Directory entries and inodes may be inconsistent:
    they are two blocks that may be written in either
    order. Solutions?
  • Use a directory operation log within the main
    log.
  • Do we need the directory entries anymore? Inodes?
  • Checkpoints would take longer.
  • Other solutions?
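A sketch of the roll-forward scan, again with hypothetical helpers; it walks the log written after the checkpoint and reapplies what the segment summaries describe:

```python
def roll_forward(disk, checkpoint):
    """Reapply log writes that happened after the checkpoint."""
    for seg in disk.segments_written_after(checkpoint.last_segment):
        summary = disk.read_summary(seg)
        for entry in summary.entries:
            if entry.kind == "inode":
                # A newer inode supersedes the checkpointed map entry and
                # implicitly recovers the file's new data blocks too.
                disk.inode_map.set(entry.inum, entry.addr)
        disk.usage_table.update_from(seg, summary)  # fix live-byte counts
```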

31
Performance
32
Micro-benchmarks
  • Machines were not fast enough to be disk-bound
    for general use.
  • Sun-4/260, 32 MB RAM; Wren IV disk (1.3 MB/s,
    17.5 ms average seek).
  • No cleaning. Cleaning overhead was measured
    separately.

33
  • Small-file performance under Sprite LFS and
    SunOS: create 10,000 1 KB files, then read them
    back in the same order, then delete them.
  • The logging approach provides an
    order-of-magnitude speedup for creation and
    deletion.
  • In SunOS, the disk was 85% saturated, so faster
    processors will not help much. In Sprite LFS, the
    disk was only 17% saturated, while the CPU was
    100% utilized.

34
Faster CPU?
(Diagram: CPU and disk utilization timelines. With the disk 17% busy and the CPU busy the rest of the time, a 4x faster CPU gives a new elapsed time of 0.17 + 0.25 x (1 - 0.17) ≈ 0.38, so the speedup is 1/0.38 ≈ 2.65.)
  • Do we really get 4 times speedup if the CPU is 4
    times faster?
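The answer is no; this is Amdahl's law, assuming (as the diagram does) no overlap between CPU and disk time:

```python
def overall_speedup(disk_fraction, cpu_speedup):
    """Speedup when only the CPU portion of the elapsed time gets faster."""
    new_time = disk_fraction + (1 - disk_fraction) / cpu_speedup
    return 1 / new_time

print(overall_speedup(0.17, 4))          # ~2.65, not 4
print(overall_speedup(0.17, 1_000_000))  # ~5.9: the disk caps the speedup
```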

35
Large Files
  • Large-file performance under Sprite LFS and
    SunOS: create a 100 MB file with sequential
    writes, read it back sequentially, write 100 MB
    randomly, read 100 MB randomly, and finally read
    it back sequentially.

36
Fair?
  • Could you design a FS to beat or equal Sprite on
    these micro-benchmarks?
  • Cleaning overheads measured separately.

37
Cleaning Overheads
  • Recorded statistics over a period of several
    months.
  • /user6: Home directories for Sprite developers.
    Program development, text processing, electronic
    communication, and simulations.
  • /pcs: Home directories and project area for
    research on parallel processing and VLSI circuit
    design.
  • /src/kernel: Source and binaries for the Sprite
    kernel.
  • /swap2: Sprite client workstation swap files.
    Backing store for 40 diskless workstations.
    Large, sparse, non-sequential access.
  • /tmp: Temporary file storage area for 40
    workstations.

38
Segment Utilization in /user6
  • Distribution of segment utilizations in a recent
    snapshot of the /user6 disk.
  • Good or bad?

39
  • More than half of the segments cleaned were
    empty. Nonempty segments also had very low
    utilization.
  • Long-term write performance is 70% of the maximum
    sequential write bandwidth.
  • Why better than the simulation?
  • Files are bigger. (So what?)
  • Traffic is extremely low. Does that suggest some
    ways to improve?

40
Recovery Time
  • Recovery time for 1, 10, and 50 MB of fixed-size
    files. Same system as the micro-benchmarks. Time
    is dominated by the number of files to be
    recovered.
  • No checkpoints were done.

41
Disk Usage
  • Block types marked with an asterisk have
    equivalent data structures in Unix FFS.
  • Inode map usage is a little high.

42
Summary
  • The key design principle is to minimize head
    movement, but:
  • Make good use of disk space.
  • Don't be too CPU-intensive.
  • Be able to recover from crashes.
  • And bad blocks.
  • It is absolutely critical to know the access
    pattern.

43
(No Transcript)
44
More Background
  • I'll try to give a half-lecture on background.
  • But you can help me a lot by sending me e-mail
    explaining exactly what I need to cover. Just
    briefly skim the paper.

45
Next Week
  • No class on ?.
  • KD will give a lecture on ? on the Stankovic
    paper.

46
Make-Up Lecture
  • Saturday
  • Sunday
  • Monday
  • 1:00-2:30
  • 6:30-8:00
  • Tuesday
  • 11:30-1:00
  • Wednesday
  • 12:00-3:00
  • 6:30-8:00
  • Friday
  • 12:00-3:30

47
Soft Updates (McKusick and Ganger)
  • Kenneth Chiu

48
Two Kinds of Data
  • Two kinds of data
  • Metadata
  • Directories, inodes, free block maps, etc.
  • File data
  • Actual data in the file.
  • What kinds of consistency guarantees?
  • No pointers to uninitialized space
  • No multiple resource ownership
  • No unreferenced live resources
  • File data guarantees?

49
Kinds of Failures
  • What kinds of failures are there?
  • Power loss
  • Bad disk
  • Misbehaving controller (writes the wrong blocks)
  • Taxonomy:
  • Halting failure
  • Byzantine failure

50
Traditional Synchronicity
  • Creating a file:
  • First initialize the inode.
  • Then point to it.
  • Alternatives:
  • Ignore it
  • Logging/atomic/doing-things-carefully
  • Non-volatile memory
  • Soft updates

51
Soft-Updates
  • Allow block writes to be reordered.
  • But make sure the data that is written is
    consistent with what has been written before.
  • Use additional bookkeeping to coerce the data in
    the block to be consistent.

52
Set Theory
  • Reflexive
  • Irreflexive
  • Symmetric
  • Antisymmetric, asymmetric
  • Transitive

53
Orders
  • Weak
  • Reflexive, antisymmetric, transitive
  • Strict (Strong)
  • Irreflexive, asymmetric, transitive
  • Partial order
  • Not all elements comparable
  • Total order
  • All elements comparable

54
Dependencies
(Diagram: two dependency graphs over operations O1-O4; the operations are updates such as A←3 and B←2.)
  • Strict or weak? Total or partial?

55
Execution
(Diagram: steps 1-6 of an execution that retires operations O1-O4, grouped under T1 and T2, in dependency order.)
56
Multiple Views
(Diagram: invariant A > B; operations A←3 and B←2; the in-core view has A=3, B=2 while the on-disk view has A=1, B=0; states S1-S6.)
  • Some invariants to preserve.
  • Some write ordering to preserve those invariants.
  • One view in memory, must be efficient.
  • Another view on disk.
  • May not match.

57
Cycles
(Diagram: operations A←1, B←2, C←3 whose targets A, B, and C all live in one disk block.)
  • What if they are all in one disk block? Does that
    solve the problem? Strict or weak?

58
Induced Cycles
(Diagram: operations A←1, B←2, C←3, D←4; A and D share an in-core block, so false sharing induces a cycle among the blocks.)
  • Solutions/workarounds?

59
Undo/Redo
(Diagram: the same operations and in-core blocks as before, now handled with undo/redo.)
  • Undo (roll-back) before writing.
  • Redo (roll-forward) after writing.
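A sketch of undo/redo at write time: per-update dependency records let the writer temporarily hide updates whose dependencies have not reached disk. The names and layout are illustrative, not the BSD implementation:

```python
class Update:
    """One metadata change within a block, plus the block it must follow."""
    def __init__(self, offset, new, old, depends_on=None):
        self.offset, self.new, self.old = offset, new, old
        self.depends_on = depends_on  # another in-core block, or None

def write_block(disk, block):
    """Undo unsafe updates, write the block, then redo them in core."""
    unsafe = [u for u in block.pending
              if u.depends_on is not None and u.depends_on.dirty]
    for u in unsafe:
        block.data[u.offset] = u.old        # undo: on-disk image stays safe
    disk.write(block.addr, bytes(block.data))
    for u in unsafe:
        block.data[u.offset] = u.new        # redo: in-core view is current
    block.pending = unsafe                  # safe updates are now durable
    block.dirty = bool(unsafe)
```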

60
Efficiency
(Diagram: operation O1 in one block; O2, O3, and O4 share another block.)
Which block should be written first?
61
Three Rules
  • Never point to a structure before it has been
    initialized. (Why?)
  • An inode must be initialized before a directory
    entry references it.
  • Never reuse a resource before nullifying all
    previous pointers to it.
  • An inode's pointer to a data block must be
    nullified before that disk block may be
    reallocated for a new inode.
  • Never reset the last pointer to a live resource
    before a new pointer has been set.
  • When renaming a file, do not remove the old name
    for an inode until after the new name has been
    written.
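For instance, rule 1 applied to file creation might be recorded like this (a hypothetical API in the spirit of the update records above):

```python
def create_file(fs, dir_block, name, inode):
    """Rule 1: the directory entry must not reach disk before its inode."""
    # new_entry_record() is an assumed helper that builds the entry bytes.
    fs.record_update(inode.block, inode.init_record())
    fs.record_update(dir_block,
                     new_entry_record(name, inode.inum),
                     depends_on=inode.block)  # entry waits for the inode
```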

62
Previous Solutions
  • Synchronous writes
  • NVRAM
  • Atomic updates
  • Scheduler-enforced ordering
  • Changes to the disk scheduler
  • Interbuffer dependencies
  • Too many synchronous writes to avoid cycles.

63
Characteristics of an Ideal Solution
  • Applications should never wait unless they choose
    to do so.
  • Propagate modified metadata with the minimum
    number of writes (allow coalescing).
  • Minimize memory usage.
  • Write-back code and disk scheduler should not be
    constrained in ordering.
  • Any inherent conflicts?

64
Cyclic Dependency
65
Undo/Redo 1
66
Undo/Redo 2
  • Ordering between add and delete?

67
Undo/Redo 3
68
Soft Updates in FFS
  • Block allocation
  • Block deallocation
  • Link addition
  • Link removal

69
Block Allocation
(Diagram: the new block must be initialized, and the free bitmap updated, before the block pointer is set.)
  • Why?

70
Block Deallocation
(Diagram: the block pointer must be cleared before the block is marked free in the bitmap.)
  • Why?

71
Link Addition
(Diagram: the new inode, and the free-bitmap update, must be written before the new directory entry.)
  • Why?

72
Link Removal
(Diagram: the directory entry must be cleared before the inode reference count is decremented.)
  • Why?