Title: ecs150 Fall 2006: Operating System
1ecs150 Fall 2006 Operating System 5: File Systems (chapters 6.4-6.7, 8)
- Dr. S. Felix Wu
- Computer Science Department
- University of California, Davis
- http://www.cs.ucdavis.edu/wu/
- sfelixwu_at_gmail.com
2File System Abstraction
3System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
7
dirp = opendir(const char *filename);
struct dirent *direntp = readdir(dirp);
struct dirent {
	ino_t d_ino;
	char  d_name[NAME_MAX+1];
};
directory
dirent inode file_name
file
dirent inode file_name
dirent inode file_name
file
file
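The opendir/readdir calls on this slide can be exercised with a short sketch like the one below. The function name list_directory is made up for illustration; the POSIX calls themselves are standard. It prints the i-node number and name stored in each dirent, which is the (inode, file_name) pair drawn in the diagram.

```c
#include <dirent.h>
#include <stdio.h>

/* Print "inode  name" for every entry in `path`.
 * Returns the number of entries, or -1 if the directory cannot be opened. */
int list_directory(const char *path)
{
    DIR *dirp = opendir(path);
    if (dirp == NULL)
        return -1;

    int count = 0;
    struct dirent *direntp;
    while ((direntp = readdir(dirp)) != NULL) {
        printf("%10lu  %s\n", (unsigned long)direntp->d_ino, direntp->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

Calling list_directory(".") should show that a directory is just a list of names, each paired with an i-node number.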
8Local versus Remote
- System Call Interface
- V-node
- Local versus remote
- NFS or i-node
- Stackable File System
- Hard-disk blocks
9File-System Structure
- File structure
- Logical storage unit
- Collection of related information
- File system resides on secondary storage (disks).
- File system organized into layers.
- File control block storage structure consisting
of information about a file.
10File -> Disk
- separate the disk into blocks
- separate the file into blocks as well
- paging from file to disk
blocks: 4 - 7 - 2 - 10 - 12
How do we represent the file? How do we link these 5 pages together?
11BitTorrent pieces
- 1 big file (X Gigabytes) with a number of pieces (5) already in (and sharing with others).
- How much disk space do we need at this moment?
12Hard Disk
- Track, Sector, Head
- Track + Heads -> Cylinder
- Performance
- seek time
- rotation time
- transfer time
- LBA
- Logical Block Addressing
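The three performance terms add up to the time for one random access. A minimal sketch, assuming the usual textbook convention that the average rotational delay is half a revolution; the function and parameter names are invented for this example:

```c
/* Average time in milliseconds to read `bytes` from a random position.
 * Assumes average rotational delay = half a revolution. */
double avg_access_ms(double avg_seek_ms, double rpm,
                     double bytes, double transfer_mb_per_s)
{
    double rotation_ms = 0.5 * (60000.0 / rpm);   /* 60000 ms per minute */
    double transfer_ms = bytes / (transfer_mb_per_s * 1e6) * 1000.0;
    return avg_seek_ms + rotation_ms + transfer_ms;
}
```

For small transfers the seek and rotation terms dominate, which is why random access is so much slower than sequential access.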
13File -> Disk blocks
file block 0 -> disk block 4
file block 1 -> disk block 7
file block 2 -> disk block 2
file block 3 -> disk block 10
file block 4 -> disk block 12
- What are the disadvantages?
- disk access can be slow for random access.
- How big is each block? 64 bytes? 68 bytes?
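The mapping on this slide can be written down as a small lookup table. The sketch below assumes a 1K block size (the slide leaves the size open) and uses invented names; it turns a byte offset in the file into the disk block that holds it.

```c
#include <stddef.h>

#define BLOCK_SIZE 1024   /* assumed block size in bytes */

/* Table from the slide: file block i lives in disk block file_to_disk[i]. */
static const long file_to_disk[] = { 4, 7, 2, 10, 12 };

/* Return the disk block holding byte `offset` of the file,
 * or -1 if the offset lies beyond the last mapped block. */
long disk_block_for_offset(long offset)
{
    size_t nblocks = sizeof(file_to_disk) / sizeof(file_to_disk[0]);
    if (offset < 0)
        return -1;
    size_t index = (size_t)offset / BLOCK_SIZE;
    return index < nblocks ? file_to_disk[index] : -1;
}
```

With a table like this, any file block is found in one step; consecutive file blocks may still sit far apart on disk.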
14Kernel Hacking Session
- This Friday from 7:30 p.m. until midnight.
- 3083 Kemper
- Bring your laptop
- And bring your mug
15A File System
partition
partition
partition
i-list
directory and data blocks
s
b
d
i-node
i-node
.
i-node
16One Logical File -> Physical Disk Blocks
efficient representation & access
17An i-node
A file
??? entries in one disk block
Typical: each block is 8K or 16K bytes
18inode (index node) structure
- meta-data of the file.
- di_mode: 2 bytes
- di_nlinks: 2 bytes
- di_uid: 2 bytes
- di_gid: 2 bytes
- di_size: 4 bytes
- di_addr: 39 bytes
- di_gen: 1 byte
- di_atime: 4 bytes
- di_mtime: 4 bytes
- di_ctime: 4 bytes
19System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
21A File System
partition
partition
partition
i-list
directory and data blocks
s
b
d
i-node
i-node
.
i-node
23
struct ufs2_dinode {
	u_int16_t	di_mode;	/*   0: IFMT, permissions; see below. */
	int16_t		di_nlink;	/*   2: File link count. */
	u_int32_t	di_uid;		/*   4: File owner. */
	u_int32_t	di_gid;		/*   8: File group. */
	u_int32_t	di_blksize;	/*  12: Inode blocksize. */
	u_int64_t	di_size;	/*  16: File byte count. */
	u_int64_t	di_blocks;	/*  24: Bytes actually held. */
	ufs_time_t	di_atime;	/*  32: Last access time. */
	ufs_time_t	di_mtime;	/*  40: Last modified time. */
	ufs_time_t	di_ctime;	/*  48: Last inode change time. */
	ufs_time_t	di_birthtime;	/*  56: Inode creation time. */
	int32_t		di_mtimensec;	/*  64: Last modified time. */
	int32_t		di_atimensec;	/*  68: Last access time. */
	int32_t		di_ctimensec;	/*  72: Last inode change time. */
	int32_t		di_birthnsec;	/*  76: Inode creation time. */
	int32_t		di_gen;		/*  80: Generation number. */
	u_int32_t	di_kernflags;	/*  84: Kernel flags. */
	u_int32_t	di_flags;	/*  88: Status flags (chflags). */
	int32_t		di_extsize;	/*  92: External attributes block. */
	ufs2_daddr_t	di_extb[NXADDR];	/*  96: External attributes block. */
	ufs2_daddr_t	di_db[NDADDR];	/* 112: Direct disk blocks. */
	ufs2_daddr_t	di_ib[NIADDR];	/* 208: Indirect disk blocks. */
	int64_t		di_spare[3];	/* 232: Reserved; currently unused */
};
24
struct ufs1_dinode {
	u_int16_t	di_mode;	/*   0: IFMT, permissions; see below. */
	int16_t		di_nlink;	/*   2: File link count. */
	union {
		u_int16_t oldids[2];	/*   4: Ffs: old user and group ids. */
	} di_u;
	u_int64_t	di_size;	/*   8: File byte count. */
	int32_t		di_atime;	/*  16: Last access time. */
	int32_t		di_atimensec;	/*  20: Last access time. */
	int32_t		di_mtime;	/*  24: Last modified time. */
	int32_t		di_mtimensec;	/*  28: Last modified time. */
	int32_t		di_ctime;	/*  32: Last inode change time. */
	int32_t		di_ctimensec;	/*  36: Last inode change time. */
	ufs1_daddr_t	di_db[NDADDR];	/*  40: Direct disk blocks. */
	ufs1_daddr_t	di_ib[NIADDR];	/*  88: Indirect disk blocks. */
	u_int32_t	di_flags;	/* 100: Status flags (chflags). */
	int32_t		di_blocks;	/* 104: Blocks actually held. */
	int32_t		di_gen;		/* 108: Generation number. */
	u_int32_t	di_uid;		/* 112: File owner. */
	u_int32_t	di_gid;		/* 116: File group. */
	int32_t		di_spare[2];	/* 120: Reserved; currently unused */
};
25BitTorrent pieces
File size: 10 GB
Pieces downloaded: 512 MB
How much disk space do we need?
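26
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NWRITES 1000	/* assumed: the loop bound did not survive in the slide text */

int main(void)
{
	FILE *f1 = fopen("./sss.txt", "w");
	int i;

	if (f1 == NULL)
		return 1;
	for (i = 0; i < NWRITES; i++) {
		/* seek to a random offset, leaving holes in the file */
		fseek(f1, rand(), SEEK_SET);
		fprintf(f1, "%d%d%d%d", rand(), rand(), rand(), rand());
		if (i % 100 == 0)
			sleep(1);
		fflush(f1);
	}
	fclose(f1);
	return 0;
}

./t
ls -l ./sss.txt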
30An i-node
A file
??? entries in one disk block
Typical: each block is 1K
31i-node
- How many disk blocks can a FS have?
- How many levels of i-node indirection will be necessary to store a file of 2G bytes? (I.e., 0, 1, 2 or 3)
- What is the largest possible file size in i-node?
- What is the size of the i-node itself for a file of 10GB with only 512 MB downloaded?
32Answer
- How many disk blocks can a FS have?
- $2^{64}$ or $2^{32}$: pointer (to blocks) size is 8/4 bytes.
- How many levels of i-node indirection will be necessary to store a file of 2G ($2^{31}$) bytes? (I.e., 0, 1, 2 or 3)
- $12 \cdot 2^{10} + 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{8} \cdot 2^{10} \ge 2^{31}$
- What is the largest possible file size in i-node?
- $12 \cdot 2^{10} + 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{10} + 2^{8} \cdot 2^{8} \cdot 2^{8} \cdot 2^{10}$
- $2^{64} - 1$
- $2^{32} \cdot 2^{10}$
You need to consider three issues and find the minimum!
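The indirection arithmetic can be checked with a short sketch. It assumes 12 direct pointers, as in the sum on this slide, and takes the block size and pointer size as parameters (1K blocks and 4-byte pointers give the $2^{8}$ pointers per block used above). The function names are invented for this example.

```c
#include <stdint.h>

#define NDIRECT 12   /* direct pointers in the i-node, as in the slide's sum */

/* Bytes reachable with the direct blocks plus `levels` levels of indirection. */
uint64_t reachable_bytes(int levels, uint64_t block_size, uint64_t ptr_size)
{
    uint64_t ptrs_per_block = block_size / ptr_size;
    uint64_t blocks = NDIRECT;
    uint64_t fanout = 1;
    for (int i = 0; i < levels; i++) {
        fanout *= ptrs_per_block;   /* blocks added by one more level */
        blocks += fanout;
    }
    return blocks * block_size;
}

/* Smallest number of indirection levels (0..3) able to hold `file_size`
 * bytes, or -1 if even triple indirection is not enough. */
int levels_needed(uint64_t file_size, uint64_t block_size, uint64_t ptr_size)
{
    for (int levels = 0; levels <= 3; levels++)
        if (reachable_bytes(levels, block_size, ptr_size) >= file_size)
            return levels;
    return -1;
}
```

This covers only the first of the three limits; the width of di_size and the number of addressable blocks still have to be compared against it to find the minimum.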
33Answer
- How many pointers?
- 512MB divided by the block size (1K)
- 512K pointers times 8 (4) bytes = 4 (2) MB
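The same estimate as a small function, counting only the pointers to data blocks (as the slide does) and ignoring the few extra pointers that link the indirect blocks together. The name pointer_bytes is made up for this sketch.

```c
#include <stdint.h>

/* Bytes of block pointers needed to map `data_bytes` of allocated data. */
uint64_t pointer_bytes(uint64_t data_bytes, uint64_t block_size, uint64_t ptr_size)
{
    uint64_t nblocks = (data_bytes + block_size - 1) / block_size;  /* round up */
    return nblocks * ptr_size;
}
```

With 512 MB of data and 1K blocks this gives 4 MB for 8-byte pointers and 2 MB for 4-byte pointers, independent of the 10 GB logical size.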
34A File System
partition
partition
partition
i-list
directory and data blocks
s
b
d
i-node
i-node
.
i-node
35FFS and UFS
- /usr/src/sys/ufs/ffs/
- Higher-level directory structure
- Soft updates Snapshot
- /usr/src/sys/ufs/ufs/
- Lower-level buffer, i-node
36# of i-nodes
- UFS1: pre-allocation
- about 3% of HD
- UFS2: dynamic allocation
- Still limited # of i-nodes
37di_size vs. di_blocks
38One Logical File -> Physical Disk Blocks
efficient representation & access
39di_size vs. di_blocks
- Logical
- Physical
- fstat
- du
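The logical/physical distinction is visible from user space through stat. A minimal sketch, assuming st_blocks counts 512-byte units (true on BSD and Linux, but not guaranteed everywhere); file_sizes is a name invented here:

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Fill in the logical size (bytes) and the physical allocation (bytes)
 * of `path`. Returns 0 on success, -1 on error. */
int file_sizes(const char *path, long long *logical, long long *physical)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    *logical = (long long)st.st_size;
    /* Assumption: st_blocks is in 512-byte units, as on BSD and Linux. */
    *physical = (long long)st.st_blocks * 512;
    return 0;
}
```

For a sparse file such as a partially downloaded torrent, the logical size can be far larger than the physical one; ls -l reports the former and du the latter.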
40Extended Attributes in UFS2
- Attributes associated with the File
- di_extb[2]
- two blocks, but indirection if needed.
- Format
- Length: 4
- Name Space: 1
- Content Pad Length: 1
- Name Length: 1
- Name: mod 8
- Content: variable
- Applications: ACL, Data Labelling
41Some thoughts.
- What can you do with extended attributes?
- How to design/implement?
- Should/can we do it with Stackable File Systems?
- Otherwise, the program to manipulate the EAs will have to be very UFS2-dependent, or FiST with a UFS2 optimization option.
- Are there any counter examples?
- security and performance considerations.
44
struct dirent {
	ino_t d_ino;
	char  d_name[NAME_MAX+1];
};
struct stat {
	short nlinks;
};
directory
dirent inode file_name
file
dirent inode file_name
dirent inode file_name
file
file
45A File System
partition
partition
partition
i-list
directory and data blocks
s
b
d
i-node
i-node
.
i-node
46- ln -s /usr/src/sys/sys/proc.h ppp.h
- ln /usr/src/sys/sys/proc.h ppp.h
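The difference between the two commands shows up in the link count. The sketch below uses the standard POSIX calls link, symlink and lstat; the helper names (make_links, link_count, is_symlink) are invented for illustration.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create `target`, then a hard link and a symbolic link to it. */
int make_links(const char *target, const char *hard, const char *soft)
{
    int fd = open(target, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    close(fd);
    if (link(target, hard) != 0)       /* like: ln target hard */
        return -1;
    if (symlink(target, soft) != 0)    /* like: ln -s target soft */
        return -1;
    return 0;
}

/* Hard-link count of `path` (a final symlink is not followed), or -1. */
long link_count(const char *path)
{
    struct stat st;
    return lstat(path, &st) == 0 ? (long)st.st_nlink : -1;
}

/* 1 if `path` is itself a symbolic link, 0 if not, -1 on error. */
int is_symlink(const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}
```

A hard link adds a second directory entry for the same i-node, so the count goes up; a symbolic link is a separate file holding a path, so the target's count is unchanged.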
47File System Buffer Cache
application read/write files
translate file to disk blocks
OS
...
...buffer cache
maintains
controls disk accesses read/write blocks
hardware
Any problems?
48File System Consistency
- To maintain file system consistency the ordering of updates from buffer cache to disk is critical
- Example
- if the directory block is written back before the i-node and the system crashes, the directory structure will be inconsistent
49File System Consistency
- File systems almost always use a buffer/disk cache for performance reasons
- This problem is critical especially for the blocks that contain control information: i-node, free-list, directory blocks
- Two copies of a disk block (buffer cache, disk) -> consistency problem if the system crashes before all the modified blocks are written back to disk
- Write back critical blocks from the buffer cache to disk immediately
- Data blocks are also written back periodically: sync
50Two Strategies
- Prevention
- Use un-buffered I/O when writing i-nodes or pointer blocks
- Use buffered I/O for other writes and force sync every 30 seconds
- Detect and Fix
- Detect the inconsistency
- Fix them according to the rules
- Fsck (File System Checker)
51File System Integrity
- Block consistency
- Block-in-use table
- Free-list table
- File consistency
- how many directories pointing to that i-node?
- nlink?
- three cases: D = L, L > D, D > L
- What to do with the latter two cases?
(Figure: per-block counters from the block-in-use table and the free-list table.)
52File System Integrity
- File system states
- (a) consistent
- (b) missing block
- (c) duplicate block in free list
- (d) duplicate data block
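The four states can be expressed as a check on two counters per block: how many files use the block, and how many times it appears on the free list. This is a sketch with invented names; the fifth case (in use and also free) is not listed on the slide and is included only so the function covers every input.

```c
enum block_state {
    BLOCK_CONSISTENT,     /* (a) in use exactly once, or free exactly once */
    BLOCK_MISSING,        /* (b) neither in use nor on the free list */
    BLOCK_DUP_FREE,       /* (c) appears more than once on the free list */
    BLOCK_DUP_DATA,       /* (d) used by more than one file */
    BLOCK_USED_AND_FREE   /* not on the slide: in use and also listed as free */
};

enum block_state classify_block(int in_use, int in_free)
{
    if (in_use > 1)
        return BLOCK_DUP_DATA;
    if (in_use == 1 && in_free > 0)
        return BLOCK_USED_AND_FREE;
    if (in_free > 1)
        return BLOCK_DUP_FREE;
    if (in_use == 0 && in_free == 0)
        return BLOCK_MISSING;
    return BLOCK_CONSISTENT;
}
```

A checker in the style of fsck would build both counters by scanning all i-nodes and the free list, then call something like this for every block.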
53Metadata Operations
- Metadata operations modify the structure of the file system
- Creating, deleting, or renaming files, directories, or special files
- Directory and i-node
- Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash
54Metadata Integrity
- FFS uses synchronous writes to guarantee the integrity of metadata
- Any operation modifying multiple pieces of metadata will write its data to disk in a specific order
- These writes will be blocking
- Guarantees integrity and durability of metadata updates
55Deleting a file (I)
i-node-1
abc
def
i-node-2
ghi
i-node-3
Assume we want to delete file def
56Deleting a file (II)
i-node-1
abc
?
def
ghi
i-node-3
Cannot delete i-node before directory entry def
57Deleting a file (III)
- Correct sequence is
- Write to disk directory block containing deleted directory entry def
- Write to disk i-node block containing deleted i-node
- Leaves the file system in a consistent state
58Creating a file (I)
i-node-1
abc
ghi
i-node-3
Assume we want to create new file tuv
59Creating a file (II)
i-node-1
abc
ghi
i-node-3
tuv
?
Cannot write directory entry tuv before i-node
60Creating a file (III)
- Correct sequence is
- Write to disk i-node block containing new i-node
- Write to disk directory block containing new directory entry
- Leaves the file system in a consistent state
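The create and delete orderings can be checked against a toy model. This is only a sketch: the "disk" holds a single i-node slot and a single directory entry, and every name in it is invented. The invariant is the one from these slides, namely that a directory entry on disk never names an i-node that is not on disk.

```c
#include <stdbool.h>

/* A toy "disk" with one i-node slot and one directory entry. */
struct toy_disk {
    bool inode_initialized;   /* the i-node block has been written */
    int  dirent_ino;          /* 0 = empty entry, else an i-node number */
};

/* Invariant: an entry never points at an uninitialized i-node. */
bool disk_consistent(const struct toy_disk *d)
{
    return d->dirent_ino == 0 || d->inode_initialized;
}

/* Create: step 1 writes the i-node, step 2 writes the directory entry. */
void create_step(struct toy_disk *d, int step, int ino)
{
    if (step == 1)
        d->inode_initialized = true;
    else if (step == 2)
        d->dirent_ino = ino;
}

/* Delete: step 1 clears the directory entry, step 2 frees the i-node. */
void delete_step(struct toy_disk *d, int step)
{
    if (step == 1)
        d->dirent_ino = 0;
    else if (step == 2)
        d->inode_initialized = false;
}
```

A crash after any single step in the correct order leaves the invariant intact; at worst an i-node is allocated but unreferenced, which a checker can reclaim later.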
61Synchronous Updates
- Used by FFS to guarantee consistency of metadata
- All metadata updates are done through blocking writes
- Increases the cost of metadata updates
- Can significantly impact the performance of whole file system
63SOFT UPDATES
- Use delayed writes (write back)
- Maintain dependency information about cached pieces of metadata
- This i-node must be updated before/after this directory entry
- Guarantee that metadata blocks are written to disk in the required order
64Three Soft Update Rules
- Never point to a structure before it has been initialized.
- Never reuse a resource before nullifying all previous pointers to it.
- Never reset the old pointer to a live resource before the new pointer has been set.
65Problem 1 with S.U.
- Synchronous writes guaranteed that metadata operations were durable once the system call returned
- Soft Updates guarantee that file system will recover into a consistent state but not necessarily the most recent one
- Some updates could be lost
66What are the dependency relationships?
We want to delete file foo and create new file
bar
Block A
Block B
i-node-2
foo
NEW bar
NEW i-node-3
67Circular Dependency
X-2nd
Y-1st
We want to delete file foo and create new file
bar
Block A
Block B
i-node-2
foo
NEW bar
NEW i-node-3
68Problem 2 with S.U.
- Cyclical dependencies
- Same directory block contains entries to be created and entries to be deleted
- These entries point to i-nodes in the same block
- Brainstorming
- How to resolve this issue in S.U.?
69How to update? i-node first or directory block first?
71Solution in S.U.
- Roll back metadata in one of the blocks to an earlier, safe state
- (Safe state does not contain new directory entry)
Block A
72- Write first block with metadata that were rolled back (block A of example)
- Write blocks that can be written after first block has been written (block B of example)
- Roll forward block that was rolled back
- Write that block
- Breaks the cyclical dependency, but block A must now be written twice
73Before any Write Operation
SU Dependency Checking (roll back if necessary)
After any Write Operation
SU Dependency Processing (task list
updating) (roll forward if necessary)
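The roll back / roll forward idea can be sketched on a toy directory block. This is an illustration with invented names, not the FreeBSD implementation: entries whose i-node has not reached the disk yet are written as empty in the on-disk image, while the in-memory copy keeps them, so the next write of the block carries them forward.

```c
#include <stdbool.h>
#include <string.h>

#define NENTRIES 4

/* In-memory copy of a directory block (block A in the example). */
struct dir_block {
    int  ino[NENTRIES];       /* 0 = empty slot */
    bool pending[NENTRIES];   /* true: the entry's i-node is not yet on disk */
};

/* Write the block. Entries with an unsatisfied dependency are rolled back
 * to "empty" in the on-disk image only; the in-memory copy is untouched.
 * Returns how many entries were held back. */
int write_dir_block(const struct dir_block *mem, int disk_image[NENTRIES])
{
    int rolled_back = 0;
    for (int i = 0; i < NENTRIES; i++) {
        if (mem->pending[i]) {
            disk_image[i] = 0;
            rolled_back++;
        } else {
            disk_image[i] = mem->ino[i];
        }
    }
    return rolled_back;
}

/* Called once the i-node block (block B) has reached the disk:
 * the dependencies are satisfied, so nothing needs rolling back any more. */
void inode_block_written(struct dir_block *mem)
{
    memset(mem->pending, 0, sizeof(mem->pending));
}
```

The sequence is then: write block A (rolled back), write block B, clear the dependencies, write block A again. That second write of A is the price for breaking the cycle.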
74- Two most popular approaches for improving the performance of metadata operations and recovery
- Journaling
- Soft Updates
- Journaling systems record metadata operations on an auxiliary log
- Soft Updates uses ordered writes
75JOURNALING
- Journaling systems maintain an auxiliary log that records all meta-data operations
- Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations.
- After a crash, can replay the log to bring the file system to a consistent state
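Log replay can be illustrated with a toy model in which each record says "metadata block N should contain value V". The record layout and names are assumptions made for this sketch; real journals log richer operations and use checksums or commit blocks to detect incomplete records.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* One log record: "metadata block `block` should contain `value`". */
struct log_record {
    int  block;
    int  value;
    bool committed;   /* the record fully reached the on-disk log */
};

/* Replay after a crash: apply committed records in log order and stop at
 * the first incomplete one. Returns the number of records applied. */
size_t replay_log(const struct log_record *log, size_t n, int disk[NBLOCKS])
{
    size_t applied = 0;
    for (size_t i = 0; i < n && log[i].committed; i++) {
        if (log[i].block >= 0 && log[i].block < NBLOCKS) {
            disk[log[i].block] = log[i].value;
            applied++;
        }
    }
    return applied;
}
```

Because each record states the final content rather than a delta, replaying the same log twice gives the same result, so a crash during recovery is harmless.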
76JOURNALING
- Log writes are performed in addition to the regular writes
- Journaling systems incur log write overhead but
- Log writes can be performed efficiently because they are sequential (block operation consideration)
- Metadata blocks do not need to be written back after each update
77JOURNALING
- Journaling systems can provide
- same durability semantics as FFS if log is forced to disk after each meta-data operation
- the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full
78Soft Updates vs. Journaling
79With Soft Updates??
Do we still need FSCK? at boot time?
CPU
80Recover the Missing Resources
- In the background, in an active FS
- We don't want to wait for the lengthy FSCK process to complete
- A related issue
- the virus scanning process
- what happens if we get a new virus signature?
81Snapshot of the FS
- backup and restore
- dump reliably an active File System
- what will we do today to dump our 40GB FS consistent snapshots? (at midnight)
- background FSCK checks
82What is a snapshot?(I mean conceptually.)
- Freeze all activities related to the FS.
- Copy everything to some space.
- Resume the activities.
How do we efficiently implement this concept such that the activities will only be blocked for about 0.25 seconds, and we don't have to buy a really big hard drive?
85Copy-on-Write
86Snapshot a file
Logical size Versus physical size
87Example
mkdir /backups/usr/noon
mount -u -o snapshot /usr/snap.noon /usr
mdconfig -a -t vnode -u 0 -f /usr/snap.noon
mount -r /dev/md0 /backups/usr/noon
/* do whatever you want to test it */
umount /backups/usr/noon
mdconfig -d -u 0
rm -f /usr/snap.noon
100Copy-on-Write
102A File System
A file
??? entries in one disk block
103A Snapshot i-node
A file
??? entries in one disk block
Not used or not yet copied
104Copy-on-write
A file
??? entries in one disk block
Not used or not yet copied
105Copy-on-write
A file
??? entries in one disk block
Not used or not yet copied
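A minimal copy-on-write sketch, assuming a tiny in-memory "volume" with fixed-size blocks. All names are invented; the snapshot's block map starts out with every entry marked "not yet copied", and an old block is preserved only when the live file system is about to overwrite it.

```c
#include <string.h>

#define NBLOCKS    4
#define BLOCK_SIZE 8
#define NOT_COPIED (-1)

struct volume {
    char live[NBLOCKS][BLOCK_SIZE];    /* the active file system */
    char saved[NBLOCKS][BLOCK_SIZE];   /* space for preserved old blocks */
    int  snap_map[NBLOCKS];            /* NOT_COPIED, or an index into saved[] */
};

/* Taking a snapshot copies no data: it only resets the block map. */
void take_snapshot(struct volume *v)
{
    for (int i = 0; i < NBLOCKS; i++)
        v->snap_map[i] = NOT_COPIED;
}

/* Write to the live file system; preserve the old block on the first write. */
void write_block(struct volume *v, int blk, const char *data)
{
    if (v->snap_map[blk] == NOT_COPIED) {
        memcpy(v->saved[blk], v->live[blk], BLOCK_SIZE);
        v->snap_map[blk] = blk;
    }
    memcpy(v->live[blk], data, BLOCK_SIZE);
}

/* Read a block as the snapshot sees it. */
const char *snapshot_read(const struct volume *v, int blk)
{
    return v->snap_map[blk] == NOT_COPIED ? v->live[blk]
                                          : v->saved[v->snap_map[blk]];
}
```

Here `data` must point to at least BLOCK_SIZE bytes. The snapshot costs space only for blocks that change afterwards, which is why its physical size can stay far below its logical size.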
106Multiple Snapshots
- about 20 snapshots
- Interactions/sharing among snapshots
107Snapshot of the FS
- backup and restore
- dump reliably an active File System
- what will we do today to dump our 40GB FS consistent snapshots? (at midnight)
- background FSCK checks
109VFS: the FS Switch
- Sun Microsystems introduced the virtual file system interface in 1985 to accommodate diverse filesystem types cleanly.
- VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.
VFS was an internal kernel restructuring with no effect on the syscall interface.
Incorporates object-oriented concepts: a generic procedural interface with multiple implementations.
Based on abstract objects with dynamic method binding by type...in C.
Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.
110vnode
- In the VFS framework, every file or directory in
active use is represented by a vnode object in
kernel memory.
Each vnode has a standard file attributes struct.
Generic vnode points at filesystem-specific
struct (e.g., inode, rnode), seen only by the
filesystem.
Each specific file system maintains a cache of
its resident vnodes.
Vnode operations are macros that vector
to filesystem-specific procedures.
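"Dynamic method binding in C" comes down to a table of function pointers per file system. The sketch below is a simplified illustration, not the kernel's definitions: vop_getsize, the toy_* structures and the macro are all invented, and only the shape (generic vnode, private data pointer, operations vector, dispatch macro) follows the description above.

```c
struct vnode;

/* Per-filesystem operations vector (a one-entry subset for illustration). */
struct vnodeops {
    long (*vop_getsize)(struct vnode *vp);
};

/* Generic vnode: points at the operations and at the private FS struct. */
struct vnode {
    const struct vnodeops *v_op;
    void *v_data;   /* filesystem-specific struct (e.g., inode, rnode) */
};

/* The generic call vectors through the vnode's own operations table. */
#define VOP_GETSIZE(vp) ((vp)->v_op->vop_getsize(vp))

/* Two toy file systems with different private structures. */
struct toy_inode { long size; };
struct toy_rnode { long cached_size; };

static long toy_ufs_getsize(struct vnode *vp)
{
    return ((struct toy_inode *)vp->v_data)->size;
}

static long toy_nfs_getsize(struct vnode *vp)
{
    return ((struct toy_rnode *)vp->v_data)->cached_size;
}

const struct vnodeops toy_ufs_ops = { toy_ufs_getsize };
const struct vnodeops toy_nfs_ops = { toy_nfs_getsize };
```

Code above the VFS layer calls VOP_GETSIZE(vp) without knowing which file system owns the vnode; only the file system's own procedures look inside v_data.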
111vnode Operations and Attributes
vnode attributes (vattr)
- type (VREG, VDIR, VLNK, etc.)
- mode (9 bits of permissions)
- nlink (hard link count)
- owner user ID
- owner group ID
- filesystem ID
- unique file ID
- file size (bytes and blocks)
- access time
- modify time
- generation number
directories only
- vop_lookup (OUT vpp, name)
- vop_create (OUT vpp, name, vattr)
- vop_remove (vp, name)
- vop_link (vp, name)
- vop_rename (vp, name, tdvp, tvp, name)
- vop_mkdir (OUT vpp, name, vattr)
- vop_rmdir (vp, name)
- vop_symlink (OUT vpp, name, vattr, contents)
- vop_readdir (uio, cookie)
- vop_readlink (uio)
files only
- vop_getpages (page, count, offset)
- vop_putpages (page, count, sync, offset)
- vop_fsync ()
generic operations
- vop_getattr (vattr)
- vop_setattr (vattr)
- vhold()
- vholdrele()
112Network File System (NFS)
server
client
syscall layer
user programs
VFS
syscall layer
NFS server
VFS
UFS
NFS client
UFS
network
113vnode Cache
VFS free list head
HASH(fsid, fileid)
Active vnodes are reference-counted by the structures that hold pointers to them.
- system open file table
- process current directory
- file system mount points
- etc.
Each specific file system maintains its own hash of vnodes (BSD).
- specific FS handles initialization
- free list is maintained by VFS
vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)
116
struct vnode {
	struct mtx v_interlock;		/* lock for "i" things */
	u_long	v_iflag;		/* i vnode flags (see below) */
	int	v_usecount;		/* i ref count of users */
	long	v_numoutput;		/* i writes in progress */
	struct thread *v_vxthread;	/* i thread owning VXLOCK */
	int	v_holdcnt;		/* i page & buffer references */
	struct buflists v_cleanblkhd;	/* i SORTED clean blocklist */
	struct buf *v_cleanblkroot;	/* i clean buf splay tree */
	int	v_cleanbufcnt;		/* i number of clean buffers */
	struct buflists v_dirtyblkhd;	/* i SORTED dirty blocklist */
	struct buf *v_dirtyblkroot;	/* i dirty buf splay tree */
	int	v_dirtybufcnt;
	...
};
117System-call interface
Active file entries
VNODE Layer or VFS
Local naming (UFS)
FFS
Buffer cache
Block or character device driver
Hardware
119How Stacking Works
User process
USER
System Call Interface
read()
data error codes
KERNEL
File System Interface
ext2fs_read()
EXT2FS
120- FiST: File System Translator
- Language compiler
- Code portability
- Average code size over other stackable file-systems is reduced ten times.
- Average development time is reduced seven times
- Developers need only to describe the core functionality of their file systems.
- Basefs: minimalist template derived from Wrapfs
- Extending platform-specific vnode interfaces in a platform independent way.
122Transaction-based FS
- Performance versus consistency
- Atomic Writes on Multiple Blocks
- See the paper titled "Atomic Writes for Data Integrity and Consistency in Shared Storage Devices for Clusters" by Okun and Barak, FGCS, vol. 20, pages 539-547, 2004.
- Modify SCSI handling