ecs150 Fall 2006: Operating System


1
ecs150 Fall 2006: Operating System
#5: File Systems (chapters 6.4-6.7, 8)
  • Dr. S. Felix Wu
  • Computer Science Department
  • University of California, Davis
  • http://www.cs.ucdavis.edu/~wu/
  • sfelixwu_at_gmail.com

2
File System Abstraction
  • Files
  • Directories

3
System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
7
DIR *dirp = opendir(const char *filename);
struct dirent *direntp = readdir(dirp);
struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};
[Diagram: a directory is a list of dirent entries (inode, file_name); each entry points to a file.]
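
The opendir/readdir interface above can be exercised with a few lines of C. This is a minimal sketch, not code from the lecture: the helper name `list_dir` is made up for illustration, and it only prints the `d_ino` and `d_name` fields that the slide shows.

```c
#include <dirent.h>
#include <stdio.h>

/* Print the inode number and name of every entry in the directory `path`.
 * Returns the number of entries seen, or -1 if the directory cannot be opened.
 * (list_dir is a hypothetical helper name, not part of any library.) */
int list_dir(const char *path)
{
    DIR *dirp = opendir(path);
    if (dirp == NULL)
        return -1;

    int count = 0;
    struct dirent *direntp;
    while ((direntp = readdir(dirp)) != NULL) {
        /* Each dirent pairs an inode number with a file name. */
        printf("%10lu  %s\n", (unsigned long)direntp->d_ino, direntp->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

Calling `list_dir(".")` from `main` prints one line per directory entry, which makes the "dirent = (inode, file_name)" picture concrete.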
8
Local versus Remote
  • System Call Interface
  • V-node
  • Local versus remote
  • NFS or i-node
  • Stackable File System
  • Hard-disk blocks

9
File-System Structure
  • File structure
  • Logical storage unit
  • Collection of related information
  • File system resides on secondary storage (disks).
  • File system organized into layers.
  • File control block storage structure consisting
    of information about a file.

10
File → Disk
  • separate the disk into blocks
  • separate the file into blocks as well
  • paging from file to disk

Blocks: 4 → 7 → 2 → 10 → 12
How to represent the file? How to link these 5 pages together?
11
Bit torrent pieces
  • 1 big file (X Gigabytes) with a number of pieces
    (5) already in (and sharing with others).
  • How much disk space do we need at this moment?

12
Hard Disk
  • Track, Sector, Head
  • Track, Heads → Cylinder
  • Performance
  • seek time
  • rotation time
  • transfer time
  • LBA
  • Logical Block Addressing

13
File → Disk blocks
file block 0 → disk block 4
file block 1 → disk block 7
file block 2 → disk block 2
file block 3 → disk block 10
file block 4 → disk block 12
  • What are the disadvantages?
  • disk access can be slow for random access.
  • How big is each block? 64 bytes? 68 bytes?
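
The mapping on this slide is just a per-file table indexed by file block number. A minimal sketch, assuming a 1 KB block size (the slide leaves the block size open) and using the made-up names `block_map` and `offset_to_disk_block`:

```c
#include <stddef.h>

#define BLOCK_SIZE 1024   /* assumed block size, for illustration only */

/* Block map from the slide: file block i is stored in disk block block_map[i]. */
static const int block_map[] = { 4, 7, 2, 10, 12 };
#define NFILEBLOCKS (sizeof block_map / sizeof block_map[0])

/* Translate a byte offset within the file into a disk block number.
 * Returns -1 if the offset lies outside the file's five blocks. */
int offset_to_disk_block(long offset)
{
    if (offset < 0)
        return -1;
    size_t file_block = (size_t)offset / BLOCK_SIZE;
    if (file_block >= NFILEBLOCKS)
        return -1;
    return block_map[file_block];
}
```

Consecutive file blocks land on scattered disk blocks (4, 7, 2, ...), which is why random access can cost a seek per block.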

14
Kernel Hacking Session
  • This Friday from 7:30 p.m. until midnight.
  • 3083 Kemper
  • Bring your laptop
  • And bring your mug

15
A File System
[Diagram: a disk divided into partitions; inside a partition, the blocks labelled s, b and d are followed by the i-list (i-node, i-node, ..., i-node) and then the directory and data blocks.]
16
One Logical File → Physical Disk Blocks
efficient representation & access
17
An i-node
A file
??? entries in one disk block
Typical: each block is 8K or 16K bytes
18
inode (index node) structure
  • meta-data of the file (field sizes in bytes, 64 in total).
  • di_mode: 2
  • di_nlinks: 2
  • di_uid: 2
  • di_gid: 2
  • di_size: 4
  • di_addr: 39
  • di_gen: 1
  • di_atime: 4
  • di_mtime: 4
  • di_ctime: 4
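
Reading the numbers on this slide as field sizes in bytes, the fields add up to a 64-byte on-disk i-node. The struct below is a sketch under that reading, not a copy of any kernel header; the name `classic_dinode`, the helper `dinode_addr`, and the little-endian packing of the 3-byte block addresses are assumptions for illustration.

```c
#include <stdint.h>

/* On-disk i-node laid out with the field sizes listed on the slide. */
struct classic_dinode {
    uint16_t di_mode;      /* 2 bytes: type and permissions */
    uint16_t di_nlinks;    /* 2 bytes: link count */
    uint16_t di_uid;       /* 2 bytes: owner */
    uint16_t di_gid;       /* 2 bytes: group */
    uint32_t di_size;      /* 4 bytes: file size in bytes */
    uint8_t  di_addr[39];  /* 39 bytes: 13 block addresses of 3 bytes each */
    uint8_t  di_gen;       /* 1 byte: generation number */
    uint32_t di_atime;     /* 4 bytes: last access time */
    uint32_t di_mtime;     /* 4 bytes: last modification time */
    uint32_t di_ctime;     /* 4 bytes: last i-node change time */
};

/* The field sizes sum to 64, so i-nodes pack evenly into a disk block. */
_Static_assert(sizeof(struct classic_dinode) == 64, "i-node should be 64 bytes");

/* Decode the i-th 3-byte block address (byte order assumed little-endian). */
uint32_t dinode_addr(const struct classic_dinode *ip, int i)
{
    const uint8_t *p = &ip->di_addr[3 * i];
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
}
```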

23
struct ufs2_dinode {
    u_int16_t    di_mode;          /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;         /*   2: File link count. */
    u_int32_t    di_uid;           /*   4: File owner. */
    u_int32_t    di_gid;           /*   8: File group. */
    u_int32_t    di_blksize;       /*  12: Inode blocksize. */
    u_int64_t    di_size;          /*  16: File byte count. */
    u_int64_t    di_blocks;        /*  24: Bytes actually held. */
    ufs_time_t   di_atime;         /*  32: Last access time. */
    ufs_time_t   di_mtime;         /*  40: Last modified time. */
    ufs_time_t   di_ctime;         /*  48: Last inode change time. */
    ufs_time_t   di_birthtime;     /*  56: Inode creation time. */
    int32_t      di_mtimensec;     /*  64: Last modified time. */
    int32_t      di_atimensec;     /*  68: Last access time. */
    int32_t      di_ctimensec;     /*  72: Last inode change time. */
    int32_t      di_birthnsec;     /*  76: Inode creation time. */
    int32_t      di_gen;           /*  80: Generation number. */
    u_int32_t    di_kernflags;     /*  84: Kernel flags. */
    u_int32_t    di_flags;         /*  88: Status flags (chflags). */
    int32_t      di_extsize;       /*  92: External attributes block. */
    ufs2_daddr_t di_extb[NXADDR];  /*  96: External attributes block. */
    ufs2_daddr_t di_db[NDADDR];    /* 112: Direct disk blocks. */
    ufs2_daddr_t di_ib[NIADDR];    /* 208: Indirect disk blocks. */
    int64_t      di_spare[3];      /* 232: Reserved; currently unused */
};
24
struct ufs1_dinode {
    u_int16_t    di_mode;          /*   0: IFMT, permissions; see below. */
    int16_t      di_nlink;         /*   2: File link count. */
    union {
        u_int16_t oldids[2];       /*   4: Ffs: old user and group ids. */
    } di_u;
    u_int64_t    di_size;          /*   8: File byte count. */
    int32_t      di_atime;         /*  16: Last access time. */
    int32_t      di_atimensec;     /*  20: Last access time. */
    int32_t      di_mtime;         /*  24: Last modified time. */
    int32_t      di_mtimensec;     /*  28: Last modified time. */
    int32_t      di_ctime;         /*  32: Last inode change time. */
    int32_t      di_ctimensec;     /*  36: Last inode change time. */
    ufs1_daddr_t di_db[NDADDR];    /*  40: Direct disk blocks. */
    ufs1_daddr_t di_ib[NIADDR];    /*  88: Indirect disk blocks. */
    u_int32_t    di_flags;         /* 100: Status flags (chflags). */
    int32_t      di_blocks;        /* 104: Blocks actually held. */
    int32_t      di_gen;           /* 108: Generation number. */
    u_int32_t    di_uid;           /* 112: File owner. */
    u_int32_t    di_gid;           /* 116: File group. */
    int32_t      di_spare[2];      /* 120: Reserved; currently unused */
};
25
BitTorrent pieces
  • File size: 10 GB
  • Pieces downloaded: 512 MB
  • How much disk space do we need?
26
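#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */
#define NWRITES 1000  /* the loop bound is lost in the transcript; placeholder value */
int main(void)
{
    FILE *f1 = fopen("./sss.txt", "w");
    int i;
    if (f1 == NULL) return 1;
    for (i = 0; i < NWRITES; i++) {
        fseek(f1, rand(), SEEK_SET);   /* jump to a random offset, leaving holes */
        fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
        if (i % 100 == 0) sleep(1);
    }
    fflush(f1);
}
$ ./t
$ ls -l ./sss.txt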
30
An i-node
A file
??? entries in one disk block
Typical: each block is 1K bytes
31
i-node
  • How many disk blocks can a FS have?
  • How many levels of i-node indirection will be
    necessary to store a file of 2G bytes? (I.e., 0,
    1, 2 or 3)
  • What is the largest possible file size in i-node?
  • What is the size of the i-node itself for a file
    of 10GB with only 512 MB downloaded?

32
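Answer
  • How many disk blocks can a FS have?
  • $2^{64}$ or $2^{32}$: the pointer (to blocks) size is 8 or 4 bytes.
  • How many levels of i-node indirection will be
    necessary to store a file of 2G ($2^{31}$) bytes?
    (I.e., 0, 1, 2 or 3)
  • With 1K blocks and 4-byte pointers ($2^{8}$ pointers per block):
    $12 \cdot 2^{10} + 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{8} \cdot 2^{10} \ge 2^{31}$,
    so 3 levels are needed (the first three terms cover only about 64 MB).
  • What is the largest possible file size in i-node?
  • $12 \cdot 2^{10} + 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{8} \cdot 2^{10}$
  • $2^{64} - 1$
  • $2^{32} \cdot 2^{10}$

You need to consider three issues and find the
minimum!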
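
The arithmetic in this answer can be checked mechanically. The sketch below assumes 12 direct pointers plus one single, one double and one triple indirect pointer, as in the sum above; the function names `levels_needed` and `max_file_size` and the parameter choices are illustrative, not taken from the course material.

```c
#include <stdint.h>

#define NDIRECT 12   /* direct block pointers in the i-node */

/* Bytes reachable through the direct pointers plus `levels` levels of indirection. */
static uint64_t tree_capacity(uint64_t block_size, uint64_t ptr_size, int levels)
{
    uint64_t ptrs = block_size / ptr_size;   /* pointers per indirect block */
    uint64_t capacity = NDIRECT * block_size;
    uint64_t span = block_size;              /* bytes covered by one pointer */
    for (int i = 0; i < levels; i++) {
        span *= ptrs;
        capacity += span;
    }
    return capacity;
}

/* Smallest number of indirection levels (0..3) that can hold `size` bytes,
 * or -1 if even triple indirection is not enough. */
int levels_needed(uint64_t size, uint64_t block_size, uint64_t ptr_size)
{
    for (int level = 0; level <= 3; level++)
        if (size <= tree_capacity(block_size, ptr_size, level))
            return level;
    return -1;
}

/* Largest file size: the minimum of three limits (only valid for ptr_size < 8,
 * so that the shifts and products below do not overflow 64 bits). */
uint64_t max_file_size(uint64_t block_size, uint64_t ptr_size)
{
    uint64_t by_tree   = tree_capacity(block_size, ptr_size, 3);   /* pointer tree */
    uint64_t by_field  = UINT64_MAX;                               /* 64-bit di_size */
    uint64_t by_blocks = (1ULL << (8 * ptr_size)) * block_size;    /* addressable blocks */
    uint64_t m = by_tree < by_field ? by_tree : by_field;
    return m < by_blocks ? m : by_blocks;
}
```

With 1K blocks and 4-byte pointers the pointer tree is the binding limit, a little over 16 GB, well below the other two bounds.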
33
Answer
  • How many pointers?
  • 512MB divided by the block size (1K)
  • 512K pointers × 8 (4) bytes = 4 (2) MB

35
FFS and UFS
  • /usr/src/sys/ufs/ffs/
  • Higher-level directory structure
  • Soft updates & Snapshot
  • /usr/src/sys/ufs/ufs/
  • Lower-level buffer, i-node

36
# of i-nodes
  • UFS1: pre-allocation
  • 3% of HD, about
  • UFS2: dynamic allocation
  • Still limited # of i-nodes

37
di_size vs. di_blocks
  • ???

38
One Logical File → Physical Disk Blocks
efficient representation & access
39
di_size vs. di_blocks
  • Logical
  • Physical
  • fstat
  • du
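
The logical/physical distinction shows up directly in `struct stat`: `st_size` is the logical byte count, while `st_blocks` counts the blocks actually allocated. The sketch below creates a file with a hole and reports both. The helper name `sparse_sizes` is made up, and the 512-byte unit for `st_blocks` is the common convention on BSD and Linux rather than something the slides state.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create `path`, write a single byte at `offset` (leaving a hole before it),
 * and report the logical size and the physically allocated bytes.
 * Returns 0 on success, -1 on any failure. */
int sparse_sizes(const char *path, off_t offset,
                 long long *logical, long long *physical)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    struct stat st;
    int ok = lseek(fd, offset, SEEK_SET) == offset
          && write(fd, "x", 1) == 1
          && fstat(fd, &st) == 0;
    close(fd);
    if (!ok)
        return -1;

    *logical  = (long long)st.st_size;           /* what ls -l and fstat report */
    *physical = (long long)st.st_blocks * 512;   /* roughly what du reports */
    return 0;
}
```

On a file system that supports holes, the physical figure stays close to one block no matter how large the offset is, which is the situation of a partially downloaded BitTorrent file.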

40
Extended Attributes in UFS2
  • Attributes associated with the File
  • di_extb[2]
  • two blocks, but indirection if needed.
  • Format (field sizes in bytes)
  • Length: 4
  • Name Space: 1
  • Content Pad Length: 1
  • Name Length: 1
  • Name: padded to a multiple of 8
  • Content: variable
  • Applications: ACL, Data Labelling
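
One way to read the format above: a 7-byte header (4 + 1 + 1 + 1), followed by the name padded so that header plus name end on an 8-byte boundary, followed by the content padded to a multiple of 8. The sketch below computes the record length under that reading. It is an interpretation of the slide, not the kernel's definition, and the names `ea_record_length` and `ea_content_pad` are hypothetical.

```c
#include <stdint.h>

#define EA_HEADER_BYTES 7u   /* length (4) + name space (1) + pad length (1) + name length (1) */

/* Round n up to the next multiple of 8. */
static uint32_t roundup8(uint32_t n)
{
    return (n + 7u) & ~7u;
}

/* Total bytes occupied by one attribute record (assumed layout, see lead-in). */
uint32_t ea_record_length(uint8_t name_len, uint32_t content_len)
{
    return roundup8(EA_HEADER_BYTES + name_len) + roundup8(content_len);
}

/* Value that would be stored in the "Content Pad Length" field. */
uint8_t ea_content_pad(uint32_t content_len)
{
    return (uint8_t)(roundup8(content_len) - content_len);
}
```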

41
Some thoughts.
  • What can you do with extended attributes?
  • How to design/implement?
  • Should/can we do it in Stackable File Systems?
  • Otherwise, the program to manipulate the EAs
    will have to be very UFS2-dependent or FiST with
    a UFS2 optimization option.
  • Are there any counter examples?
  • security and performance considerations.

44
struct dirent {
    ino_t d_ino;
    char  d_name[NAME_MAX+1];
};
struct stat {
    short nlinks;
};
[Diagram: a directory is a list of dirent entries (inode, file_name); each entry points to a file.]
46
  • ln -s /usr/src/sys/sys/proc.h ppp.h
  • ln /usr/src/sys/sys/proc.h ppp.h
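
The two commands differ in what the new directory entry refers to: `ln` adds a second name for the same i-node (raising its link count), while `ln -s` creates a new file whose content is a path. A minimal sketch using the `link` and `symlink` system calls; the helper name `hard_vs_soft` and its output parameters are invented for illustration.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Create a hard link and a symbolic link to `target`, then report:
 *   *nlink            link count of the target's i-node,
 *   *same_inode       whether the hard link shares the target's i-node,
 *   *soft_is_symlink  whether the soft link is a separate symlink file.
 * Returns 0 on success, -1 on failure. */
int hard_vs_soft(const char *target, const char *hard, const char *soft,
                 int *nlink, int *same_inode, int *soft_is_symlink)
{
    struct stat t, h, s;

    unlink(hard);   /* ignore errors: the names may not exist yet */
    unlink(soft);
    if (link(target, hard) != 0 || symlink(target, soft) != 0)
        return -1;
    if (stat(target, &t) != 0 || stat(hard, &h) != 0 || lstat(soft, &s) != 0)
        return -1;

    *nlink = (int)t.st_nlink;
    *same_inode = (t.st_ino == h.st_ino);
    *soft_is_symlink = S_ISLNK(s.st_mode) ? 1 : 0;
    return 0;
}
```

For a freshly created target, the link count comes back as 2 after the hard link is made; the symbolic link does not change it.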

47
File System Buffer Cache
[Diagram: the application reads/writes files; the OS translates files to disk blocks, maintains the buffer cache, and controls disk accesses (read/write blocks); the hardware sits below.]
Any problems?
48
File System Consistency
  • To maintain file system consistency the ordering
    of updates from buffer cache to disk is critical
  • Example
  • if the directory block is written back before the
    i-node and the system crashes, the directory
    structure will be inconsistent

49
File System Consistency
  • File systems almost always use a buffer/disk cache
    for performance reasons
  • This problem is critical especially for the
    blocks that contain control information: i-node,
    free-list, directory blocks
  • Two copies of a disk block (buffer cache, disk) →
    consistency problem if the system crashes before
    all the modified blocks are written back to disk
  • Write back critical blocks from the buffer cache
    to disk immediately
  • Data blocks are also written back periodically
    (sync)

50
Two Strategies
  • Prevention
  • Use un-buffered I/O when writing i-nodes or
    pointer blocks
  • Use buffered I/O for other writes and force sync
    every 30 seconds
  • Detect and Fix
  • Detect the inconsistency
  • Fix them according to the rules
  • fsck (File System Checker)

51
File System Integrity
  • Block consistency
  • Block-in-use table
  • Free-list table
  • File consistency
  • how many directories pointing to that i-node?
  • nlink?
  • three cases (D = directory references, L = link count): D = L, L > D, D > L
  • What to do with the latter two cases?

[Figure: per-block counters from the block-in-use table and the free-list table.]
52
File System Integrity
  • File system states
  • (a) consistent
  • (b) missing block
  • (c) duplicate block in free list
  • (d) duplicate data block
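
The block-consistency check amounts to comparing two counters per block: how many files use it, and how many times it appears in the free list. A minimal sketch of that classification follows; the enum and function names are made up, and the `USED_AND_FREE` case is an extra case added here for completeness, not one of the four states on the slide.

```c
/* Possible outcomes for one disk block. */
enum block_state {
    CONSISTENT,     /* (a) exactly one of the two counters is 1 */
    MISSING,        /* (b) in neither table */
    DUP_FREE,       /* (c) listed more than once in the free list */
    DUP_DATA,       /* (d) used by more than one file */
    USED_AND_FREE   /* in use and also on the free list (extra case) */
};

/* Classify a block from its in-use count and its free-list count. */
enum block_state classify_block(int in_use, int in_free)
{
    if (in_use == 0 && in_free == 0)
        return MISSING;
    if (in_use > 1)
        return DUP_DATA;
    if (in_free > 1)
        return DUP_FREE;
    if (in_use == 1 && in_free == 1)
        return USED_AND_FREE;
    return CONSISTENT;
}

/* Count the blocks that are not in the CONSISTENT state. */
int count_inconsistent(const int *in_use, const int *in_free, int nblocks)
{
    int bad = 0;
    for (int i = 0; i < nblocks; i++)
        if (classify_block(in_use[i], in_free[i]) != CONSISTENT)
            bad++;
    return bad;
}
```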

53
Metadata Operations
  • Metadata operations modify the structure of the
    file system
  • Creating, deleting, or renaming files,
    directories, or special files
  • Directory & i-node
  • Data must be written to disk in such a way that
    the file system can be recovered to a consistent
    state after a system crash

54
Metadata Integrity
  • FFS uses synchronous writes to guarantee the
    integrity of metadata
  • Any operation modifying multiple pieces of
    metadata will write its data to disk in a
    specific order
  • These writes will be blocking
  • Guarantees integrity and durability of metadata
    updates

55
Deleting a file (I)
i-node-1
abc
def
i-node-2
ghi
i-node-3
Assume we want to delete file def
56
Deleting a file (II)
i-node-1
abc
?
def
ghi
i-node-3
Cannot delete i-node before directory entry def
57
Deleting a file (III)
  • Correct sequence is
  • Write to disk directory block containing deleted
    directory entry def
  • Write to disk i-node block containing deleted
    i-node
  • Leaves the file system in a consistent state

58
Creating a file (I)
i-node-1
abc
ghi
i-node-3

Assume we want to create new file tuv
59
Creating a file (II)
i-node-1
abc
ghi
i-node-3
tuv
?
Cannot write directory entry tuv before i-node
60
Creating a file (III)
  • Correct sequence is
  • Write to disk i-node block containing new i-node
  • Write to disk directory block containing new
    directory entry
  • Leaves the file system in a consistent state
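
The create sequence can be sketched with blocking writes on an ordinary file that stands in for the disk. This is an illustration of the ordering only, not FFS code: the block size, the block numbers, the string payloads and the names `write_block_sync` and `create_file_ordered` are all assumptions.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLK 512   /* assumed block size of the stand-in disk image */

/* Write one block and wait until fsync reports it is on stable storage. */
static int write_block_sync(int fd, int blkno, const char *data)
{
    char buf[BLK] = {0};
    strncpy(buf, data, BLK - 1);
    if (pwrite(fd, buf, BLK, (off_t)blkno * BLK) != BLK)
        return -1;
    return fsync(fd);   /* blocking: returns only after the write completes */
}

/* Create a file in the required order: i-node block first, directory block second.
 * A crash between the two writes leaves an unreferenced i-node, never a
 * directory entry that points at an uninitialized i-node. */
int create_file_ordered(int fd, int inode_blk, int dir_blk, const char *name)
{
    if (write_block_sync(fd, inode_blk, "initialized i-node") != 0)
        return -1;
    return write_block_sync(fd, dir_blk, name);
}
```

Deletion would use the same helper with the order reversed: directory block first, i-node block second.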

61
Synchronous Updates
  • Used by FFS to guarantee consistency of metadata
  • All metadata updates are done through blocking
    writes
  • Increases the cost of metadata updates
  • Can significantly impact the performance of whole
    file system

63
SOFT UPDATES
  • Use delayed writes (write back)
  • Maintain dependency information about cached
    pieces of metadata
  • This i-node must be updated before/after this
    directory entry
  • Guarantee that metadata blocks are written to
    disk in the required order

64
3 Soft Update Rules
  • Never point to a structure before it has been
    initialized.
  • Never reuse a resource before nullifying all
    previous pointers to it.
  • Never reset the old pointer to a live resource
    before the new pointer has been set.

65
Problem 1 with S.U.
  • Synchronous writes guaranteed that metadata
    operations were durable once the system call
    returned
  • Soft Updates guarantee that file system will
    recover into a consistent state but not
    necessarily the most recent one
  • Some updates could be lost

66
What are the dependency relationships?
We want to delete file foo and create new file
bar
Block A
Block B


i-node-2
foo
NEW bar
NEW i-node-3


67
Circular Dependency
X-2nd
Y-1st
We want to delete file foo and create new file
bar
Block A
Block B


i-node-2
foo
NEW bar
NEW i-node-3


68
Problem 2 with S.U.
  • Cyclical dependencies
  • Same directory block contains entries to be
    created and entries to be deleted
  • These entries point to i-nodes in the same block
  • Brainstorming
  • How to resolve this issue in S.U.?

69
How to update? i-node first or directory block
first?
71
Solution in S.U.
  • Roll back metadata in one of the blocks to an
    earlier, safe state
  • (Safe state does not contain new directory entry)

Block A
72
  • Write first block with metadata that were rolled
    back (block A of example)
  • Write blocks that can be written after first
    block has been written (block B of example)
  • Roll forward block that was rolled back
  • Write that block
  • Breaks the cyclical dependency but must now write
    block A twice
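
The roll-back/roll-forward sequence can be simulated with two tiny blocks. In this sketch block A is the directory block (entries foo and bar) and block B is the i-node block (i-node-2 for foo, i-node-3 for bar), as in the example slides; the struct layout, the invariant check and all names are simplifications invented for illustration, not the FreeBSD implementation.

```c
#include <stdbool.h>

struct dir_block   { bool has_foo, has_bar; };          /* block A */
struct inode_block { bool inode2_live, inode3_init; };  /* block B */

struct fs_state {
    struct dir_block   mem_a, disk_a;   /* cached and on-disk copies of A */
    struct inode_block mem_b, disk_b;   /* cached and on-disk copies of B */
    int writes;                         /* number of block writes issued */
};

/* State after "delete foo, create bar" was done in memory only. */
struct fs_state example_state(void)
{
    struct fs_state s = {
        .mem_a  = { false, true }, .disk_a = { true, false },
        .mem_b  = { false, true }, .disk_b = { true, false },
        .writes = 0
    };
    return s;
}

/* On-disk invariants: no entry may point to an uninitialized or cleared i-node. */
bool disk_consistent(const struct fs_state *s)
{
    if (s->disk_a.has_bar && !s->disk_b.inode3_init) return false;
    if (s->disk_a.has_foo && !s->disk_b.inode2_live) return false;
    return true;
}

/* Write A rolled back, then B, then A rolled forward: three writes in total. */
bool flush_with_rollback(struct fs_state *s)
{
    struct dir_block safe = s->mem_a;
    if (!s->disk_b.inode3_init)
        safe.has_bar = false;            /* roll back the new entry */
    s->disk_a = safe;     s->writes++;   /* 1st write of block A */
    if (!disk_consistent(s)) return false;

    s->disk_b = s->mem_b; s->writes++;   /* block B is now safe to write */
    if (!disk_consistent(s)) return false;

    s->disk_a = s->mem_a; s->writes++;   /* roll forward: 2nd write of block A */
    return disk_consistent(s);
}
```

Writing either cached block as-is first would violate an invariant, which is the circular dependency; the rolled-back copy of A is the only block that can safely go first.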

73
Before any Write Operation
SU Dependency Checking (roll back if necessary)
After any Write Operation
SU Dependency Processing (task list
updating) (roll forward if necessary)
74
  • Two most popular approaches for improving the
    performance of metadata operations and recovery:
  • Journaling
  • Soft Updates
  • Journaling systems record metadata operations on
    an auxiliary log
  • Soft Updates uses ordered writes

75
JOURNALING
  • Journaling systems maintain an auxiliary log that
    records all meta-data operations
  • Write-ahead logging ensures that the log is
    written to disk before any blocks containing data
    modified by the corresponding operations.
  • After a crash, can replay the log to bring the
    file system to a consistent state
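
Write-ahead logging and replay can be shown with an in-memory model. In this sketch the arrays stand in for on-disk structures, a "crash" is simulated by skipping the in-place update, and every name (`jfs`, `journaled_update`, `replay`) is hypothetical.

```c
#include <stdbool.h>

#define NBLOCKS 8
#define LOGCAP  16

struct log_rec { int blkno; int value; };

struct jfs {
    int disk[NBLOCKS];            /* home locations of the metadata blocks */
    struct log_rec log[LOGCAP];   /* the auxiliary log (assumed durable once appended) */
    int log_len;
};

/* Append the update to the log first, then apply it in place.
 * If crash_before_apply is set, the in-place write never happens. */
bool journaled_update(struct jfs *fs, int blkno, int value, bool crash_before_apply)
{
    if (fs->log_len == LOGCAP || blkno < 0 || blkno >= NBLOCKS)
        return false;
    fs->log[fs->log_len++] = (struct log_rec){ blkno, value };   /* write-ahead */
    if (crash_before_apply)
        return true;
    fs->disk[blkno] = value;
    return true;
}

/* After a crash: re-apply every logged update in order. Replay is idempotent. */
void replay(struct jfs *fs)
{
    for (int i = 0; i < fs->log_len; i++)
        fs->disk[fs->log[i].blkno] = fs->log[i].value;
}
```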

76
JOURNALING
  • Log writes are performed in addition to the
    regular writes
  • Journaling systems incur log write overhead but
  • Log writes can be performed efficiently because
    they are sequential (block operation
    consideration)
  • Metadata blocks do not need to be written back
    after each update

77
JOURNALING
  • Journaling systems can provide
  • same durability semantics as FFS if log is
    forced to disk after each meta-data operation
  • the laxer semantics of Soft Updates if log writes
    are buffered until entire buffers are full

78
Soft Updates vs. Journaling
  • Advantages
  • Disadvantages

79
With Soft Updates??
Do we still need FSCK? at boot time?
CPU
80
Recover the Missing Resources
  • In the background, in an active FS
  • We don't want to wait for the lengthy FSCK
    process to complete
  • A related issue
  • the virus scanning process
  • what happens if we get a new virus signature?

81
Snapshot of the FS
  • backup and restore
  • dump reliably an active File System
  • what would we do today to dump consistent
    snapshots of our 40GB FS? (at midnight)
  • background FSCK checks

82
What is a snapshot? (I mean conceptually.)
  • Freeze all activities related to the FS.
  • Copy everything to some space.
  • Resume the activities.

How do we efficiently implement this concept such
that the activities will only be blocked for
about 0.25 seconds, and we don't have to buy a
really big hard drive?
85
Copy-on-Write
86
Snapshot a file
Logical size Versus physical size
87
Example
mkdir /backups/usr/noon
mount -u -o snapshot /usr/snap.noon /usr
mdconfig -a -t vnode -u 0 -f /usr/snap.noon
mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
umount /backups/usr/noon
mdconfig -d -u 0
rm -f /usr/snap.noon
102
A File System
A file
??? entries in one disk block
103
A Snapshot i-node
A file
??? entries in one disk block
Not used or Not yet copy
104
Copy-on-write
A file
??? entries in one disk block
Not used or Not yet copy
105
Copy-on-write
A file
??? entries in one disk block
Not used or Not yet copy
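
The copy-on-write idea from these slides fits in a few lines: taking the snapshot copies nothing, and a block's old contents are saved only the first time that block is overwritten. The model below keeps blocks as integers in memory; the types and function names are invented for illustration and do not mirror the UFS2 snapshot code.

```c
#include <stdbool.h>
#include <string.h>

#define NBLK 4

struct snapshot {
    bool copied[NBLK];   /* false = "not yet copied": read through to the live block */
    int  saved[NBLK];    /* old contents, filled in lazily */
};

struct volume {
    int blocks[NBLK];        /* live file system blocks */
    struct snapshot *snap;   /* active snapshot, or NULL */
};

/* Taking a snapshot is cheap: only the bookkeeping is reset. */
void take_snapshot(struct volume *v, struct snapshot *s)
{
    memset(s->copied, 0, sizeof s->copied);
    v->snap = s;
}

/* Before the first overwrite of a block, preserve its old contents. */
void write_block(struct volume *v, int i, int value)
{
    if (v->snap != NULL && !v->snap->copied[i]) {
        v->snap->saved[i] = v->blocks[i];
        v->snap->copied[i] = true;
    }
    v->blocks[i] = value;
}

/* Read block i as it was when the snapshot was taken. */
int snapshot_read(const struct volume *v, int i)
{
    return v->snap->copied[i] ? v->snap->saved[i] : v->blocks[i];
}
```

The snapshot's physical size grows only with the number of blocks modified since it was taken, even though its logical size equals that of the whole volume.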
106
Multiple Snapshots
  • about 20 snapshots
  • Interactions/sharing among snapshots

109
VFS: the FS Switch
  • Sun Microsystems introduced the virtual file
    system interface in 1985 to accommodate diverse
    filesystem types cleanly.
  • VFS allows diverse specific file systems to
    coexist in a file tree, isolating all
    FS-dependencies in pluggable filesystem modules.

VFS was an internal kernel restructuring with no
effect on the syscall interface.
Incorporates object-oriented concepts: a generic
procedural interface with multiple
implementations.
Based on abstract objects with dynamic method
binding by type...in C.
Other abstract interfaces in the kernel: device
drivers, file objects, executable files, memory
objects.
110
vnode
  • In the VFS framework, every file or directory in
    active use is represented by a vnode object in
    kernel memory.

Each vnode has a standard file attributes struct.
Generic vnode points at filesystem-specific
struct (e.g., inode, rnode), seen only by the
filesystem.
Each specific file system maintains a cache of
its resident vnodes.
Vnode operations are macros that vector
to filesystem-specific procedures.
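
"Dynamic method binding by type in C" usually means a table of function pointers hung off each object. The sketch below shows the shape of it with a single operation. The field names `v_op` and `v_data` follow BSD convention, but the one-entry operations table, the `struct inode` and `struct rnode` contents, and the macro are simplified assumptions, not the kernel's definitions.

```c
struct vnode;

/* Operations vector: one slot per vnode operation (only one shown here). */
struct vnodeops {
    int (*vop_getattr)(struct vnode *vp, long *size_out);
};

/* Generic vnode: points at its operations and at filesystem-private data. */
struct vnode {
    const struct vnodeops *v_op;
    void *v_data;   /* e.g. an inode (UFS) or an rnode (NFS) */
};

/* The macro vectors the call to the filesystem-specific procedure. */
#define VOP_GETATTR(vp, size_out) ((vp)->v_op->vop_getattr((vp), (size_out)))

struct inode { long i_size; };   /* stand-in for UFS-private state */
struct rnode { long r_size; };   /* stand-in for NFS-private state */

static int ufs_getattr(struct vnode *vp, long *size_out)
{
    *size_out = ((struct inode *)vp->v_data)->i_size;
    return 0;
}

static int nfs_getattr(struct vnode *vp, long *size_out)
{
    *size_out = ((struct rnode *)vp->v_data)->r_size;
    return 0;
}

const struct vnodeops ufs_vnodeops = { ufs_getattr };
const struct vnodeops nfs_vnodeops = { nfs_getattr };
```

Code above the VFS layer calls `VOP_GETATTR(vp, &size)` without knowing which file system owns the vnode; only the operations table differs.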
111
vnode Operations and Attributes
vnode attributes (vattr)
  • type (VREG, VDIR, VLNK, etc.)
  • mode (9 bits of permissions)
  • nlink (hard link count)
  • owner user ID
  • owner group ID
  • filesystem ID
  • unique file ID
  • file size (bytes and blocks)
  • access time
  • modify time
  • generation number
directories only
  • vop_lookup (OUT vpp, name)
  • vop_create (OUT vpp, name, vattr)
  • vop_remove (vp, name)
  • vop_link (vp, name)
  • vop_rename (vp, name, tdvp, tvp, name)
  • vop_mkdir (OUT vpp, name, vattr)
  • vop_rmdir (vp, name)
  • vop_symlink (OUT vpp, name, vattr, contents)
  • vop_readdir (uio, cookie)
  • vop_readlink (uio)
files only
  • vop_getpages (page, count, offset)
  • vop_putpages (page, count, sync, offset)
  • vop_fsync ()
generic operations
  • vop_getattr (vattr)
  • vop_setattr (vattr)
  • vhold()
  • vholdrele()
112
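Network File System (NFS)
[Diagram: on the client, user programs call the syscall layer, which goes through VFS to either the local UFS or the NFS client; the NFS client talks over the network to the NFS server, which goes through the server's VFS to its UFS.]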
113
vnode Cache
VFS free list head

HASH(fsid, fileid)
Active vnodes are reference-counted by the
structures that hold pointers to them.
  • system open file table
  • process current directory
  • file system mount points
  • etc.
Each specific file system maintains its
own hash of vnodes (BSD).
  • specific FS handles initialization
  • free list is maintained by VFS
vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)
116
struct vnode {
    struct mtx      v_interlock;     /* lock for "i" things */
    u_long          v_iflag;         /* i vnode flags (see below) */
    int             v_usecount;      /* i ref count of users */
    long            v_numoutput;     /* i writes in progress */
    struct thread  *v_vxthread;      /* i thread owning VXLOCK */
    int             v_holdcnt;       /* i page & buffer references */
    struct buflists v_cleanblkhd;    /* i SORTED clean blocklist */
    struct buf     *v_cleanblkroot;  /* i clean buf splay tree */
    int             v_cleanbufcnt;   /* i number of clean buffers */
    struct buflists v_dirtyblkhd;    /* i SORTED dirty blocklist */
    struct buf     *v_dirtyblkroot;  /* i dirty buf splay tree */
    int             v_dirtybufcnt;
    ...
};
119
How Stacking Works
User process
USER
System Call Interface
read()
data error codes
KERNEL
File System Interface
ext2fs_read()
EXT2FS
120
  • FiST: File System Translator
  • Language compiler
  • Code portability
  • Average code size over other stackable
    file-systems is reduced ten times.
  • Average development time is reduced seven times
  • Developers need only to describe the core
    functionality of their file systems.
  • Basefs minimalist template derived from Wrapfs
  • Extending platform-specific vnode interfaces in a
    platform independent way.

122
Transaction-based FS
  • Performance versus consistency
  • Atomic Writes on Multiple Blocks
  • See the paper titled "Atomic Writes for Data
    Integrity and Consistency in Shared Storage
    Devices for Clusters" by Okun and Barak, FGCS,
    vol. 20, pages 539-547, 2004.
  • Modify SCSI handling