Title: Log-Structured File Systems (Rosenblum and Ousterhout)
1 Log-Structured File Systems (Rosenblum and Ousterhout)
2 Rotational Latency and Seek
(Figure: disk diagram labeling rotational latency and seek.)
3 Introduction
- CPUs are fast. Disks are slow. Real slow.
- CPU cycle time is about 0.3 ns.
- Memory latency is about 50 ns.
- Disk seek is about 5,000,000 ns.
- Transfer rate is about 100 MB/s.
- Log-structured file systems are based on three assumptions:
- Memory is plentiful and cheap.
- Most reads come from RAM.
- Disk I/O dominated by writes.
- Contrast with logging or journalling FS.
4 File Systems of the 1990s
- Workloads
- Office and engineering
- Many small files
- Time dominated by bookkeeping (metadata)
- Scientific computing
- Sequential access to large files
- Time dominated by transfer rate (hopefully)
- Other workloads?
5
- Problems with existing file systems (old FFS/UFS):
- They spread information around the disk, causing too many small accesses.
- They physically separate different files.
- Attributes (inode), directory entry, and file data are all separate.
- Five I/O operations to create a new file.
- When writing small files, less than 5% of potential bandwidth is used.
- They tend to write synchronously.
- Data is asynchronous, but metadata is synchronous.
- With small files, traffic is dominated by metadata.
6 Log-Structured File Systems
7 Log-Structured File System
- Buffer a sequence of changes, then write them all at once, sequentially, to the end of the log (sketched below).
- Another use for buffers: allows reordering.
- The write includes almost everything: file data, inodes, etc.
- Faster writes, and faster crash recovery. (Why?)
- Two issues:
- Locating and reading files.
- Managing free space.
- Good performance requires large extents of free space.
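A minimal sketch in C of the buffered write path just described; the segment size, block size, and disk_write_sequential are illustrative assumptions, not Sprite LFS's actual interface.

    #include <stddef.h>
    #include <string.h>

    #define SEG_SIZE (512 * 1024)  /* a segment size in the paper's range */
    #define BLK_SIZE 4096

    extern void disk_write_sequential(const void *buf, size_t len); /* hypothetical */

    static char   seg_buf[SEG_SIZE]; /* in-memory segment under construction */
    static size_t seg_used;          /* bytes filled so far */

    /* Append one dirty block to the log; when the segment fills, flush it
       with a single large sequential write, so no per-block seeks are paid. */
    void log_append(const void *blk)
    {
        memcpy(seg_buf + seg_used, blk, BLK_SIZE);
        seg_used += BLK_SIZE;
        if (seg_used == SEG_SIZE) {
            disk_write_sequential(seg_buf, SEG_SIZE);
            seg_used = 0;
        }
    }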
8 File Location and Reading
- The inode contains metadata and addresses for the first 10 data blocks.
- For larger files, the inode also contains indirect blocks.
- In FFS, inode locations are static. In LFS, inodes are written to the log.
- Any change requires rewriting the inode to the log.
- The inode map maintains the location of each inode.
- The inode map itself is written to the log.
- A fixed checkpoint region identifies the inode map blocks (see the sketch below).
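A hedged sketch of the lookup path: the fixed checkpoint region locates the inode map, and the inode map locates the latest copy of each inode in the log. All type and field names here are hypothetical simplifications.

    #include <stdint.h>

    #define MAX_INODES 4096              /* hypothetical fixed bound */

    typedef uint32_t log_addr_t;         /* disk address of a block in the log */

    struct checkpoint_region {           /* lives at a fixed disk address */
        log_addr_t imap_blocks[16];      /* where the inode map currently is */
        uint64_t   timestamp;
    };

    struct inode_map {
        log_addr_t inode_addr[MAX_INODES]; /* inode number -> latest inode copy */
    };

    /* Unlike FFS, an inode's address changes on every rewrite, so reading a
       file starts with one extra indirection through the inode map. */
    log_addr_t locate_inode(const struct inode_map *imap, uint32_t inum)
    {
        return imap->inode_addr[inum];
    }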
9 FFS (figure only)
11 Free Space Management: Segments
- Most difficult issue is management of free space.
- Goal is to maintain large free extents.
- Two choices: threading or compaction.
- Pros/cons?
(Figure: the same four blocks laid out under threading vs. compaction.)
12 Hybrid
- Sprite LFS uses a hybrid scheme.
- Disk divided into fixed size segments.
- Threaded between segments.
- Compaction within a segment.
- Segment size is chosen so that transfer time is much greater than access time: 512 KB or 1 MB.
(Figure: blocks 1-6 arranged across fixed-size segments under the hybrid scheme.)
13 Segment Cleaning
- The steps are:
- Read some number of segments.
- Why might it need to read more than one?
- Identify live data.
- Write it to a smaller number of clean segments.
- Mark the original segments as clean.
- Essentially: read segments, coalesce live data, write them to new segments.
14
- Must identify:
- Which blocks of a segment are live.
- To which file each block belongs.
- The position of each block within its file.
- Needed to update the inode.
- Use a segment summary block, which identifies each piece of information in a segment.
- Liveness is determined by checking the inode or indirect block to see if it still points to the claimed data block.
- Optimization: use a generation (version) number that is incremented whenever a file is truncated or deleted. The segment summary also includes this number (see the liveness sketch below).
- When must we still check?
- Overflow danger?
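A hedged sketch of the liveness check just described; the field and function names are hypothetical, not the paper's actual structures.

    #include <stdint.h>

    /* One entry of the segment summary block. */
    struct summary_entry {
        uint32_t inum;      /* file this block claims to belong to */
        uint32_t offset;    /* block's position within that file */
        uint32_t version;   /* file's generation number at write time */
    };

    /* The version comparison is the fast path: if the file has since been
       truncated or deleted, the block is dead with no further I/O. Otherwise
       we must still consult the inode (or indirect block), because the block
       may have been overwritten by a newer copy elsewhere in the log. */
    int block_is_live(const struct summary_entry *e,
                      uint32_t current_version,   /* from the inode map */
                      uint32_t addr_in_inode,     /* where the inode points */
                      uint32_t this_block_addr)   /* where this copy lives */
    {
        if (e->version != current_version)
            return 0;
        return addr_in_inode == this_block_addr;
    }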
15 Segment Cleaning Policies
- Four policy issues:
- When should the segment cleaner execute?
- Continuously, only when needed, etc.
- How many segments to clean at a time?
- Must free enough space to result in at least one clean segment.
- Which segments to clean?
- Lowest utilized, oldest, etc.
- What reordering to perform when rewriting a segment?
- Group files in the same directory, group temporally, etc.
- The first two were ignored; they do not seem to be important. Sprite LFS starts cleaning when the number of clean segments drops below a watermark, and cleans a few tens of segments at a time.
- The third and fourth are important.
16 Write Cost
- The ratio of total I/O time to the time needed just to transfer the new data.
- Can ignore access time for LFS. (Why?)
- Example at utilization u = 2/3: read 3 segments, write back 2 segments of live data, and write 1 segment of new data, all to free one segment's worth of space: (3 + 2 + 1)/1 = 6. In general the write cost is 2/(1 - u), computed below.
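The general formula from the paper as a small C program; it reproduces the example above.

    #include <stdio.h>

    /* To free one segment's worth of space at utilization u, the cleaner
       reads N segments, writes back N*u of live data, and the new data
       fills the freed N*(1-u); total I/O per unit of new data is
       (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1 - u). */
    double write_cost(double u)
    {
        return 2.0 / (1.0 - u);   /* grows without bound as u -> 1 */
    }

    int main(void)
    {
        printf("%.1f\n", write_cost(2.0 / 3.0));  /* the example above: 6.0 */
        return 0;
    }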
17 Write Cost
- Do we want to clean segments with lots of live
data, or little live data?
18 Distribution of u
- How do we get segments with low u?
- Can all segments have low u?
19 Bimodal Distribution of u
- This implies that we want to force a bimodal distribution.
- You want to clean segments with low u, because if you are going to put in the work, you want to get the most reclaimed space out of it.
- Which distribution is better?
(Figure: two distributions of u with different variance.)
20 Simulation Results
21 Simulation
- Fixed number of 4 KB files. No reading, just rewriting.
- Two access patterns:
- Uniform (no cleaner reordering).
- Hot-and-cold (cleaner reordering based on age).
- One group contains 10% of the files and has a 90% chance of being selected.
- The other group contains 90% of the files and has a 10% chance of being selected.
- The simulator runs until all clean segments are exhausted, then runs the cleaner until a threshold of clean segments is reached.
- The cleaner always chooses the least-utilized segments, with no reordering.
- How do you expect this to perform?
22
- Hot-and-cold performs worse than they expected.
23 Promoting Variance
- The key to low write cost is variance.
- For a given total disk utilization, high variance will result in more segments that give a lot of bang for the buck.
- The cleaner generates variance.
(Figure: segment utilizations before and after cleaning.)
- How do we make this variance last a long time?
24 Preferring Stable Files
- If we clean stable files, the resulting full segments are more likely to stay full for a long time, which increases variance (zero-sum).
- A segment usage table helps support this computation; see the cost-benefit sketch below.
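The paper's cost-benefit cleaning policy captures this preference; a minimal sketch, with u and the age of the segment's youngest block taken from the segment usage table (the age units are a policy detail).

    /* Clean the segment that maximizes benefit/cost:
       cost is 1 + u (read the whole segment, write back u of live data);
       benefit is (1 - u) * age (freed space, weighted by how long it is
       likely to stay free, estimated from the data's age). */
    double cost_benefit(double u, double age)
    {
        return (1.0 - u) * age / (1.0 + u);
    }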
25 Hot-and-cold
(Graphs: distributions of segment utilization under lowest-utilization and max-benefit cleaning, for uniform and hot-and-cold access patterns.)
- Which has higher variance in each graph?
27 Crash Recovery
28 Crash Recovery
- Two-pronged approach:
- Checkpoint: a complete, self-contained record of a consistent state of the file system.
- Roll-forward: recover operations performed after the checkpoint by processing a list of changes.
- Why not just use checkpoints?
- Why not just use the log?
- When should we checkpoint more often? Less often?
29 Checkpoints
- Two phases:
- Write out all modified information.
- Write pointers to all inode maps and the segment usage table, plus the current time and a pointer to the last segment written, to the checkpoint region.
- What happens if we crash while writing the checkpoint region?
- Two checkpoint regions (see the sketch below).
- How do we make sure it won't use the wrong one?
- What kind of ordering guarantees are required?
- Are we sure that the previous checkpoint is good?
- What happens if we crash while writing the last block?
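A sketch of why two regions suffice, assuming (as a simplification) that the timestamp is written last, so a torn checkpoint write leaves a stale timestamp behind.

    #include <stdint.h>

    struct checkpoint {
        /* ... pointers to inode map blocks and segment usage table ... */
        uint64_t timestamp;   /* written only after everything else is on disk */
    };

    /* Recovery reads both fixed regions and uses the one with the newer
       timestamp: the most recent checkpoint known to be complete. */
    const struct checkpoint *
    pick_checkpoint(const struct checkpoint *a, const struct checkpoint *b)
    {
        return (a->timestamp > b->timestamp) ? a : b;
    }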
30 Roll-Forward
- Use segment summary information to update the inode map.
- If a new inode is found, update the inode map obtained from the checkpoint.
- Also adjust the segment utilization table.
- Directory entries and inodes may be inconsistent: they are two blocks, which may be written in different orders. Solutions?
- Use a directory operation log within the main log.
- Do we need the directory entries anymore? Inodes?
- Checkpoints would take longer.
- Other solutions?
31 Performance
32 Micro-benchmarks
- Machines were not fast enough to be disk-bound for general use.
- Sun-4/260, 32 MB RAM, Wren IV disk (1.3 MB/s, 17.5 ms average seek).
- No cleaning. Cleaning overhead measured separately.
33
- Small-file performance under Sprite LFS and SunOS: create 10,000 1 KB files, then read them back in the same order, then delete them.
- The logging approach provides an order-of-magnitude speedup for creation and deletion.
- In SunOS, the disk was 85% saturated, so faster processors will not help much. In Sprite LFS, the disk was only 17% saturated, while the CPU was 100% utilized.
34 Faster CPU?
(Timeline figure, reconstructed: with the disk 17% busy, CPU time is 1 - 0.17 = 0.83; a 4x faster CPU cuts that to 0.25 * (1 - 0.17) = 0.21, while disk time stays at 0.17, so the total is about 0.38 and the speedup is 1/0.38 = 2.65. The arithmetic appears as code below.)
- Do we really get 4 times speedup if the CPU is 4 times faster?
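The slide's arithmetic as a small function; this is just Amdahl's law with the disk fraction held fixed.

    /* Only the CPU component shrinks, so the 17% disk time bounds the
       overall speedup no matter how fast the CPU gets. */
    double speedup(double disk_frac, double cpu_factor)
    {
        double cpu_frac = 1.0 - disk_frac;
        return 1.0 / (disk_frac + cpu_frac / cpu_factor);
    }
    /* speedup(0.17, 4.0) = 1 / (0.17 + 0.83/4), about 2.65 */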
35 Large Files
- Large-file performance under Sprite LFS and SunOS: create a 100 MB file with sequential writes, read it back sequentially, write 100 MB randomly, read 100 MB randomly, and finally read sequentially.
36 Fair?
- Could you design a FS to beat or equal Sprite on these micro-benchmarks?
- Cleaning overheads were measured separately.
37 Cleaning Overheads
- Recorded statistics over a period of several months.
- /user6: Home directories for Sprite developers. Program development, text processing, electronic communication, and simulations.
- /pcs: Home directories and project area for research on parallel processing and VLSI circuit design.
- /src/kernel: Sources and binaries for the Sprite kernel.
- /swap2: Sprite client workstation swap files. Backing store for 40 diskless workstations. Large, sparse, non-sequential access.
- /tmp: Temporary file storage area for 40 workstations.
38 Segment Utilization in /user6
- Distribution of segment utilizations in a recent snapshot of the /user6 disk.
- Good or bad?
39
- More than half of the segments cleaned were empty. Nonempty segments also had very low utilization.
- Long-term write performance is 70% of the maximum sequential write bandwidth.
- Why better than the simulation?
- Files are bigger. (So what?)
- Traffic is extremely low. Does that suggest some ways to improve?
40 Recovery Time
- Recovery time for 1, 10, and 50 MB of fixed-size files. Same system as the micro-benchmarks. Time is dominated by the number of files to be recovered.
- No checkpoints were done.
41 Disk Usage
- Block types marked with an asterisk have equivalent data structures in Unix FFS.
- Inode map usage is a little high.
42 Summary
- The key design principle is to minimize head movement, but:
- Make good use of disk space.
- Don't be too CPU intensive.
- Be able to recover from crashes.
- And bad blocks.
- It is absolutely critical to know the access pattern.
44 More Background
- I'll try to give a half-lecture on background.
- But you can help me a lot by sending me e-mail explaining exactly what I need to cover. Just briefly skim the paper.
45 Next Week
- No class on ?.
- KD will give a lecture on ? on the Stankovic
paper.
46 Make-Up Lecture
- Saturday
- Sunday
- Monday
- 1:00-2:30
- 6:30-8:00
- Tuesday
- 11:30-1:00
- Wednesday
- 12:00-3:00
- 6:30-8:00
- Friday
- 12:00-3:30
47 Soft Updates (McKusick and Ganger)
48 Two Kinds of Data
- Two kinds of data
- Metadata
- Directories, inodes, free block maps, etc.
- File data
- Actual data in the file.
- What kinds of consistency guarantees?
- No pointers to uninitialized space
- No multiple resource ownership
- No unreferenced live resources
- File data guarantees?
49 Kinds of Failures
- What kinds of failures are there?
- Power
- Bad disk
- Misbehaving controller, wrong blocks
- Taxonomy
- Halting failure
- Byzantine
50 Traditional Synchronicity
- Creating a file
- First initialize inode
- Then point to it.
- Alternatives
- Ignore it
- Logging/atomic/doing-things-carefully
- Non-volatile
- Soft-updates
51 Soft Updates
- Allow block writes to be reordered.
- But make sure the data that is written is consistent with what has been written before.
- Use additional bookkeeping to coerce the data in each block to be consistent.
52 Set Theory
- Reflexive
- Irreflexive
- Symmetric
- Antisymmetric, asymmetric
- Transitive
53 Orders
- Weak
- Reflexive, antisymmetric, transitive
- Strict (strong)
- Irreflexive, asymmetric, transitive
- Partial order
- Not all elements comparable (e.g., the subset relation on sets)
- Total order
- All elements comparable (e.g., <= on the integers)
54 Dependencies
(Figure: two dependency graphs over operations O1-O4, induced by the writes A <- 3 and B <- 2.)
- Strict or weak? Total or partial?
55 Execution
(Figure, steps 1-6: operations O1-O4 of T1 and T2 execute step by step, in an order consistent with the dependency graphs.)
56 Multiple Views
(Figure: in-core and on-disk views passing through states S1-S6 as the operations A <- 3 and B <- 2 are applied, starting from A = 1, B = 0, under the invariant A > B.)
- Some invariants to preserve.
- Some write orderings to preserve those invariants.
- One view in memory, which must be efficient.
- Another view on disk.
- The two may not match.
57 Cycles
(Figure: writes A <- 1, B <- 2, C <- 3, where A, B, and C share one disk block, creating a dependency cycle.)
- What if all are on one disk block? Does that solve the problem? Strict or weak?
58 Induced Cycles
(Figure: writes A <- 1, B <- 2, C <- 3, D <- 4, where A and D share an in-core block; the false sharing induces a cycle among block writes.)
59 Undo/Redo
(Figure: the same operations and in-core blocks as the previous slide.)
- Undo (roll back) before writing.
- Redo (roll forward) after writing (sketched below).
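A minimal sketch of the undo/redo mechanism; the dependency records, callbacks, and disk_write are hypothetical names, not the paper's actual structures.

    /* Each not-yet-safe change to a buffer carries undo/redo callbacks. */
    struct dep {
        void (*undo)(void *buf);   /* remove the unsafe change */
        void (*redo)(void *buf);   /* reapply it */
        struct dep *next;
    };

    extern void disk_write(void *buf);   /* hypothetical raw block write */

    /* Roll back pending changes so the disk only ever sees a consistent
       (older) state, write the block, then roll forward in core. */
    void write_block_with_deps(void *buf, struct dep *deps)
    {
        struct dep *d;
        for (d = deps; d != NULL; d = d->next)
            d->undo(buf);
        disk_write(buf);
        for (d = deps; d != NULL; d = d->next)
            d->redo(buf);
    }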
60 Efficiency
(Figure: one block holds O1; another holds O2, O3, and O4, with dependencies among the operations.)
- Which block should be written first?
61 Three Rules
- Never point to a structure before it has been initialized. (Why?)
- An inode must be initialized before a directory entry references it.
- Never reuse a resource before nullifying all previous pointers to it.
- An inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode.
- Never reset the last pointer to a live resource before a new pointer has been set.
- When renaming a file, do not remove the old name for an inode until after the new name has been written (sketched below).
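A sketch of the third rule applied to rename, enforced here with one synchronous write for clarity (soft updates would record a dependency instead); all types and helpers are hypothetical.

    #include <stdint.h>

    struct dirblock;
    extern void add_direntry(struct dirblock *db, int slot, uint32_t inum);
    extern void clear_direntry(struct dirblock *db, int slot);
    extern void write_block_sync(struct dirblock *db);
    extern void write_block_async(struct dirblock *db);

    void rename_ordered(struct dirblock *olddb, int oldslot,
                        struct dirblock *newdb, int newslot, uint32_t inum)
    {
        add_direntry(newdb, newslot, inum);
        write_block_sync(newdb);     /* new name is safely on disk first */
        clear_direntry(olddb, oldslot);
        write_block_async(olddb);    /* only then remove the last old name */
    }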
62 Previous Solutions
- Synchronous writes
- NVRAM
- Atomic updates
- Scheduler-enforced ordering
- Changes to the disk scheduler
- Interbuffer dependencies
- Too many synchronous writes to avoid cycles.
63 Characteristics of an Ideal Solution
- Applications should never wait unless they choose to do so.
- Propagate modified metadata with the minimum number of writes (allow coalescing).
- Minimize memory usage.
- The write-back code and disk scheduler should not be constrained in ordering.
- Any inherent conflicts?
64 Cyclic Dependency
65 Undo/Redo 1
66 Undo/Redo 2
- Ordering between add and delete?
67 Undo/Redo 3
68 Soft Updates in FFS
- Block allocation
- Block deallocation
- Link addition
- Link removal
69 Block Allocation
(Dependency diagram: initialize the block, and update the free bitmap, before setting the block pointer.)
70 Block Deallocation
(Dependency diagram: clear the block pointer before marking the block free in the bitmap.)
71 Link Addition
(Dependency diagram: write the new inode, and its free-bitmap update, before the new directory entry.)
72 Link Removal
(Dependency diagram: clear the directory entry before decrementing the inode reference count; sketched below.)
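A sketch of the ordering behind the link-removal diagram, again using a synchronous write for clarity; all names are hypothetical.

    #include <stdint.h>

    struct dirblock;
    struct inode { uint32_t refcnt; };
    extern void clear_direntry(struct dirblock *db, int slot);
    extern void write_block_sync(struct dirblock *db);
    extern void write_inode(struct inode *ip);

    /* The cleared directory entry must reach disk before the reference
       count drops, so no on-disk entry ever names an inode that has been
       freed and possibly reused. */
    void remove_link(struct dirblock *db, int slot, struct inode *ip)
    {
        clear_direntry(db, slot);
        write_block_sync(db);
        ip->refcnt--;
        write_inode(ip);
    }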