CS 140 Lecture: files and directories
  • Dawson Engler
  • Stanford CS department

File system fun
Processes, VM, synchronization, FS: all around
since the 60s or earlier. Clear win.
  • File systems: the hardest part of OS
  • More papers on FSes than any other single topic
  • Main tasks of file system:
  • don't go away (ever)
  • associate bytes with name (files)
  • associate names with each other (directories)
  • Can implement file systems on disk, over network,
    in memory, in non-volatile RAM (NVRAM), on tape,
    w/ paper.
  • We'll focus on disk and generalize later
  • Today: files and directories + a bit of speed.

The medium is the message
  • Disk: first thing we've seen that doesn't go away
  • So: where everything important lives. Failure.
  • Slow (ms access vs ns for memory)
  • Huge (100x bigger than memory)
  • How to organize large collection of ad hoc
    information? Taxonomies! (Basically, FS =
    general way to make these)

Optimization = usability. Cache everything:
files, directories, names, non-existent names.
Memory vs. Disk
Disk is just memory. We already know memory.
But there are some differences. The big
differences: minimum transfer unit, and that it
doesn't go away (multiple writes, crash?).
Note: 100,000x in terms of latency, only 10x in
terms of bandwidth.
Disk:
  • Smallest write: sector
  • Atomic write: sector
  • Random access: ~10ms
  • not on a good curve
  • Seq access: 20MB/s
  • NUMA
  • Crash?
  • Contents not gone (non-volatile)
  • Lose? Corrupt? Not ok.
Memory:
  • Smallest write: (usually) bytes
  • Atomic write: byte, word
  • Random access: nanosecs
  • faster all the time
  • Seq access: 200-1000MB/s
  • UMA
  • Crash?
  • Contents gone (volatile)
  • Lose + start over = ok

Some useful facts
  • Disk reads/writes in terms of sectors, not bytes
  • read/write single sector or adjacent groups
  • How to write a single byte? Read-modify-write:
  • read in sector containing the byte
  • modify that byte
  • write entire sector back to disk
  • key: if cached, don't need to read in
  • Sector = unit of atomicity.
  • sector write done completely, even if crash in
    the middle
  • (disk saves up enough momentum to complete)
  • larger atomic units have to be synthesized by OS
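
The read-modify-write steps above can be sketched as follows. This is a minimal illustration, not a real driver: `ToyDisk`, `write_byte`, the dictionary cache, and the 512-byte sector size are all assumptions made for the example.

```python
SECTOR_SIZE = 512  # bytes; assumed sector size for this sketch


class ToyDisk:
    """Hypothetical in-memory disk that only transfers whole sectors."""

    def __init__(self, num_sectors):
        self.sectors = [bytes(SECTOR_SIZE) for _ in range(num_sectors)]
        self.reads = 0
        self.writes = 0

    def read_sector(self, n):
        self.reads += 1
        return self.sectors[n]

    def write_sector(self, n, data):
        assert len(data) == SECTOR_SIZE
        self.writes += 1
        self.sectors[n] = bytes(data)


def write_byte(disk, cache, offset, value):
    """Write one byte using read-modify-write on the containing sector."""
    n, within = divmod(offset, SECTOR_SIZE)
    if n not in cache:                  # key: if cached, no read needed
        cache[n] = bytearray(disk.read_sector(n))
    cache[n][within] = value            # modify that byte
    disk.write_sector(n, cache[n])      # write entire sector back


disk = ToyDisk(num_sectors=4)
cache = {}
write_byte(disk, cache, 700, 0x41)  # sector 1, byte 188: one read, one write
write_byte(disk, cache, 701, 0x42)  # same sector, already cached: write only
print(disk.reads, disk.writes)      # 1 2
```

The second write skips the read because the sector is already cached, which is the point of the "key" bullet.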

Can happen all the time on Alphas: they only do
word ops, so to write a byte, had to do a read,
modify it, and write it out. Means you can now
have a cache miss, which happens here too. RMW
for assigning to a bit in memory.
Just like we built large atomic units from small
atomic instructions, we'll build up large atomic
ops based on sector writes.
The equation that ruled the world.
  • Approximate time to get data:
  • So?
  • Each time you touch disk = 10s of ms.
  • Touch 50-100 times = 1 second
  • Can do billions of ALU ops in same time.
  • This fact = huge social impact on OS research
  • Most pre-2000 research based on speed.
  • Publishable speedup = 30%
  • Easy to get > 30% by removing just a few
    disk accesses
  • Result: more papers on FSes than any other single
    topic

time = seek time (ms) + rotational delay (ms) +
       bytes / disk bandwidth
Files: named bytes on disk
  • File abstraction:
  • user's view: named sequence of bytes
  • FS's view: collection of disk blocks
  • file system's job: translate name + offset to
    disk blocks
  • File operations:
  • create a file, delete a file
  • read from file, write to file
  • Want: operations to have as few disk accesses as
    possible + have minimal space overhead

[Figure: (name, offset: int) translated to a disk address]
The operations you do on a noun
What's so hard about grouping blocks???
Like usual, we're going to call the same thing by
different names. We'll be using lists and trees
of arrays to track integers, but instead of
calling them that or page tables, now: meta data.
Purpose is the same: construct a mapping.
  • In some sense, the problems we will look at are
    no different than those in virtual memory
  • like page tables, file system meta data are
    simply data structures used to construct
    mappings.
  • Page table: map virtual page to physical page
  • file meta data: map byte offset to disk block
  • directory: map name to disk address or file
    number

Unix inode
FS vs VM
  • In some ways problem similar:
  • want location transparency, oblivious to size,
  • In some ways the problem is easier:
  • CPU time to do FS mappings not a big deal (= no
    TLB)
  • Page tables deal with sparse address spaces and
    random access, files are dense (0 .. filesize-1)
  • In some ways problem is harder:
  • Each layer of translation = potential disk access
  • Space a huge premium! (But disk is huge?!?!)
    Reason? Cache space never enough, the amount of
    data you can get into one fetch never enough.
  • Range very extreme: many < 10K, some more than
    a GB.
  • Implications?

Recall: can fetch a track at a time, or about 64K
Problem: how to track file's data?
  • Disk management:
  • Need to keep track of where file contents are on
    disk
  • Must be able to use this to map byte offset to
    disk block
  • Things to keep in mind while designing file
    structure:
  • Most files are small
  • Much of the disk is allocated to large files
  • Many of the I/O operations are made to large
    files
  • Want good sequential and good random access (what
    do these require?)
  • Just like VM: data structures recapitulate CS107
  • Arrays, linked list, trees (of arrays), hash
    tables.

Fixed cost must be low, must be able to nicely
represent large files, and accessing them must
not take too much time
Simple mechanism: contiguous allocation
Just call malloc() on disk memory. Essentially
we will be putting lists and trees on disk, where
every pointer reference is possibly a disk access.
  • Extent-based: allocate files like segmented
    memory
  • When creating a file, make the user
    pre-specify its length and allocate all space at
    once
  • File descriptor contents: location and size
  • Example: IBM OS/360
  • Pro?
  • Cons? (What VM scheme does this correspond
    to?)

What happened in segmentation? Variable sized
units: fragmentation. Large files impossible
without expensive compaction, hard to predict
size at creation time.
Linked files
If you increase the block size of a stupid file
system, what do you expect will happen?
  • Basically a linked list on disk.
  • Keep a linked list of all free blocks
  • file descriptor contents: a pointer to file's
    first block
  • in each block, keep a pointer to the next one
  • Pro?
  • Con?
  • Examples (sort-of): Alto, TOPS-10, DOS FAT

Variably sized, flexibly laid out files
Random access impossible, lots of seeks even for
sequential access
Example: DOS FS (simplified)
But linked list expensive? Why does this work?
  • Uses linked files. Cute: links reside in
    fixed-sized file allocation table (FAT) rather
    than in the blocks.
  • Still do pointer chasing, but can cache entire
    FAT so can be cheap compared to disk access.
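
A rough sketch of the pointer chasing described above, with the whole FAT held in memory. The `EOF` marker, the dictionary representation, and the block numbers are assumptions made for illustration; they are not the real DOS on-disk format.

```python
EOF = -1  # assumed end-of-chain marker for this sketch


def file_blocks(fat, first_block):
    """Follow the chain in the FAT and return the file's blocks in order."""
    blocks = []
    b = first_block
    while b != EOF:
        blocks.append(b)
        b = fat[b]
    return blocks


def block_for_offset(fat, first_block, offset, block_size=512):
    """Map a byte offset to a disk block by chasing offset // block_size links."""
    b = first_block
    for _ in range(offset // block_size):
        b = fat[b]
        if b == EOF:
            raise ValueError("offset past end of file")
    return b


# Hypothetical FAT: entry i holds the number of the block that follows block i.
fat = {6: 4, 4: 1, 1: EOF,   # file a: 6 -> 4 -> 1
       2: 5, 5: EOF}         # file b: 2 -> 5

print(file_blocks(fat, 6))             # [6, 4, 1]
print(block_for_offset(fat, 6, 1030))  # 1
```

Every hop is a memory lookup rather than a disk read, which is why random access is tolerable here even though the structure is still a linked list.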

[Figure: FAT (16-bit entries); Directory (5) with
an entry for file b pointing into the FAT]
FAT discussion
64K entries x 2 bytes = 128KB FAT;
64K blocks x .5KB = 32MB FS
  • Entry size = 16 bits
  • What's the maximum size of the FAT?
  • Given a 512 byte block, what's the maximum size
    of FS?
  • One attack: go to bigger blocks. Pro? Con?
  • Space overhead of FAT is trivial:
  • 2 bytes / 512 byte block = ~.4% (Compare to Unix)
  • Reliability: how to protect against errors?
  • Bootstrapping: where is root directory?
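
The size questions above reduce to a few lines of arithmetic. A small sketch, where `fat_limits` is a hypothetical helper written for this example:

```python
def fat_limits(entry_bits=16, block_size=512):
    """Return (FAT size in bytes, max FS size in bytes, space overhead)."""
    entries = 2 ** entry_bits                 # one FAT entry per block
    fat_bytes = entries * entry_bits // 8     # size of the table itself
    max_fs_bytes = entries * block_size       # blocks addressable by the FAT
    overhead = (entry_bits / 8) / block_size  # FAT bytes per data byte
    return fat_bytes, max_fs_bytes, overhead


fat_bytes, max_fs, overhead = fat_limits()
print(fat_bytes // 1024, "KB FAT")        # 128 KB FAT
print(max_fs // 2**20, "MB max FS")       # 32 MB max FS
print(f"{overhead:.2%} space overhead")   # 0.39% space overhead

# Bigger blocks raise the ceiling (at the cost of internal fragmentation):
print(fat_limits(block_size=4096)[1] // 2**20, "MB max FS")  # 256 MB max FS
```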

Bigger internal frag, faster access
Indexed files
  • Each file has an array holding all of its block
    pointers
  • (purpose and issues = those of a page table)
  • max file size fixed by array's size (static or
    dynamic?)
  • create: allocate array to hold all file's blocks,
    but allocate on demand using free list
  • Pro?
  • Con?

Large contiguous chunk of disk space.
Essentially the same problem.
Indexed files
Want it to incrementally grow on use, and don't
want to contig allocate
  • Issues same as in page tables
  • Large possible file size = lots of unused entries
  • Large actual size? table needs large contiguous
    disk chunk
  • Solve identically: small regions with index
    array, this array with another array, ...

4K block size, 4GB file = 1M entries (4MB!)
Multi-level indexed files: 4.3 BSD
  • File descriptor (inode) = 14 block pointers

[Figure: inode block pointers (ptr 1 ... ptr 14)
leading to data blocks, indirect blocks
(ptr 1 ... ptr 128), and a double indirect block]
Unix discussion
  • Pro?
  • simple, easy to build, fast access to small
    files
  • Maximum file length fixed, but large. (With 4k
  • Cons:
  • what's the worst case # of accesses?
  • What are some bad space overheads?
  • An empirical problem:
  • because you allocate blocks by taking them off
    unordered freelist, meta data and data get strewn
    across disk
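
To make the "worst case # of accesses" question concrete, here is a sketch of how a byte offset could be mapped through a multi-level index. The split of the 14 pointers into 12 direct, 1 indirect, and 1 double indirect is an assumption for this example, as are the 512-byte blocks; the 128 pointers per indirect block follows the figure labels.

```python
BLOCK_SIZE = 512        # assumed block size
NDIRECT = 12            # assumed: 12 direct + 1 indirect + 1 double indirect
PTRS_PER_BLOCK = 128    # pointers held by one indirect block

MAX_FILE_SIZE = (NDIRECT + PTRS_PER_BLOCK + PTRS_PER_BLOCK ** 2) * BLOCK_SIZE


def locate(offset):
    """Return which pointer path leads to the block holding this offset."""
    blk = offset // BLOCK_SIZE
    if blk < NDIRECT:
        return ("direct", blk)                 # 0 extra disk reads
    blk -= NDIRECT
    if blk < PTRS_PER_BLOCK:
        return ("indirect", blk)               # 1 extra disk read
    blk -= PTRS_PER_BLOCK
    if blk < PTRS_PER_BLOCK ** 2:              # 2 extra disk reads
        return ("double", blk // PTRS_PER_BLOCK, blk % PTRS_PER_BLOCK)
    raise ValueError("offset beyond maximum file size")


print(locate(4))                  # ('direct', 0)
print(locate(NDIRECT * 512))      # ('indirect', 0)
print(locate(MAX_FILE_SIZE - 1))  # ('double', 127, 127)
print(MAX_FILE_SIZE)              # 8460288
```

Small files never leave the direct pointers, which is why they are fast; the deepest path costs two index-block reads before the data block itself.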

4K file: inode size / 4K = ~2.5%. File with one
indirect block: (inode + indirect) / 52K = ~8%
More about inodes
  • Inodes are stored in a fixed sized array
  • Size of array determined when disk is initialized
    and can't be changed. Array lives in known
    location on disk. Originally at one side of
    disk.
  • Now is smeared across it (why?)
  • The index of an inode in the inode array is called
    an i-number. Internally, the OS refers to files
    by i-number
  • When file is opened, the inode is brought into
    memory; when closed, it is flushed back to disk.

Example: (oversimplified) Unix file system
  • Want to modify byte 4 in /a/b.c:
  • read in root directory (inode 2)
  • look up a (inode 12); read in
  • look up inode for b.c (13); read in
  • use inode to find blk for byte 4 (blksize = 512,
    so offset = 0 gives blk 14); read in and modify
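
The lookup steps above can be sketched with an in-memory stand-in for the inode table. The dictionary layout and the names `namei` and `block_for` are illustrative assumptions; the i-numbers and block number are the ones from the example.

```python
ROOT_INUMBER = 2

# Hypothetical inode table: i-number -> inode contents.
inodes = {
    2:  {"type": "dir",  "entries": {".": 2,  "..": 2, "a": 12}},
    12: {"type": "dir",  "entries": {".": 12, "..": 2, "b.c": 13}},
    13: {"type": "file", "blocks": [14]},
}


def namei(path):
    """Translate an absolute path to an i-number, one component at a time."""
    inum = ROOT_INUMBER
    for name in path.split("/"):
        if not name:                      # skip empty pieces from "/" separators
            continue
        inode = inodes[inum]              # each step = a potential disk access
        if inode["type"] != "dir":
            raise NotADirectoryError(path)
        if name not in inode["entries"]:
            raise FileNotFoundError(path)
        inum = inode["entries"][name]
    return inum


def block_for(path, offset, block_size=512):
    """Find the disk block that holds the given byte offset of a file."""
    inode = inodes[namei(path)]
    return inode["blocks"][offset // block_size]


print(namei("/a/b.c"))         # 13
print(block_for("/a/b.c", 4))  # 14
```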

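[Figure: root directory (inode 2) with entries
<., 2, dir> and <a, 12, dir>; directory a (inode 12)
with entries <., 12, dir>, <.., 2, dir> and
<b.c, 13, inode>; the data block of b.c begins
with "int main()"]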
Disk contains millions of messy things.
  • Problem:
  • "spend all day generating data, come back the
    next morning, want to use it." F. Corbato, on
    why files/dirs invented.
  • Approach 0: have user remember where on disk the
    file is.
  • (e.g., social security numbers)
  • Yuck. People want human digestible names
  • we use directories to map names to file blocks
  • Next: What is in a directory and why?

A short history of time
  • Approach 1: have a single directory for entire
    system.
  • put directory at known location on disk
  • directory contains <name, index> pairs
  • if one user uses a name, no one else can
  • many ancient PCs work this way. (cf. hosts.txt)
  • Approach 2: have a single directory for each user
  • still clumsy. And ls on 10,000 files is a real
    pain
  • (many older mathematicians work this way)
  • Approach 3: hierarchical name spaces
  • allow directory to map names to files or other
    directories
  • file system forms a tree (or graph, if links
    allowed)
  • large name spaces tend to be hierarchical (IP
    addresses, domain names, scoping in programming
    languages, etc.)

Hierarchical Unix
[Figure: directory tree rooted at / with entries
afs, bin, cdrom, dev, sbin, tmp]
  • Used since CTSS (1960s)
  • Unix picked up and used really nicely.
  • Directories stored on disk just like regular
    files
  • inode contains special flag bit set
  • users can read just like any other file
  • only special programs can write (why?)
  • Inodes at fixed disk location
  • File pointed to by the index may be another
    directory
  • makes FS into hierarchical tree (what needed
    to make a DAG?)
  • Simple. Plus speeding up file ops = speeding up
    dir ops!

[Figure, continued: entries awk, chmod, chown]
Naming magic
  • Bootstrapping: Where do you start looking?
  • Root directory
  • inode 2 on the system
  • 0 and 1 used for other purposes
  • Special names:
  • Root directory: /
  • Current directory: .
  • Parent directory: ..
  • user's home directory: ~
  • Using the given names, only need two operations
    to navigate the entire name space:
  • cd name: move into (change context to)
    directory name
  • ls: enumerate all names in current directory

Unix example: /a/b/c.c
[Figure: name space for /a/b/c.c next to its
physical organization: inode table and directory
blocks, including the entry <c.c, 14>]
What inode holds file for a? b? c.c?
Default context: working directory
  • Cumbersome to constantly specify full path names
  • in Unix, each process is associated with a current
    working directory
  • file names that do not begin with / are assumed
    to be relative to the working directory,
    otherwise translation happens as before
  • Shells track a default list of active contexts:
  • a search path
  • given a search path A : B : C a shell will
    check in A, then check in B, then check in C
  • can escape using explicit paths: ./foo
  • Example of locality

Creating synonyms: hard and soft links
  • More than one dir entry can refer to a given file
  • Unix stores count of pointers (hard links) to
    inode
  • to make: "ln foo bar" creates a synonym
    (bar) for foo
  • Soft links:
  • also point to a file (or dir), but object can be
    deleted from underneath it (or never even exist).
  • Unix builds like directories: normal file holds
    pointed-to name, with special sym link bit set
  • When the file system encounters a symbolic link
    it automatically translates it (if possible).
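
The difference between the two kinds of link shows up once the original name is removed. A toy model, assuming a single flat directory and a made-up `ToyFS` class (symbolic-link loops are not handled):

```python
class ToyFS:
    """Hypothetical one-directory file system with hard and soft links."""

    def __init__(self):
        self.inodes = {}      # i-number -> {"data", "nlink", "symlink"}
        self.directory = {}   # name -> i-number
        self.next_inum = 3

    def _new_inode(self, data, symlink=False):
        inum = self.next_inum
        self.next_inum += 1
        self.inodes[inum] = {"data": data, "nlink": 0, "symlink": symlink}
        return inum

    def _add_entry(self, name, inum):
        self.directory[name] = inum
        self.inodes[inum]["nlink"] += 1       # count of hard links

    def create(self, name, data):
        self._add_entry(name, self._new_inode(data))

    def link(self, old, new):                 # like: ln old new
        self._add_entry(new, self.directory[old])

    def symlink(self, target, new):           # like: ln -s target new
        self._add_entry(new, self._new_inode(target, symlink=True))

    def unlink(self, name):
        inum = self.directory.pop(name)
        self.inodes[inum]["nlink"] -= 1
        if self.inodes[inum]["nlink"] == 0:   # last name gone: free the inode
            del self.inodes[inum]

    def read(self, name):
        inode = self.inodes[self.directory[name]]  # KeyError if name is missing
        if inode["symlink"]:
            return self.read(inode["data"])        # translate the stored name
        return inode["data"]


fs = ToyFS()
fs.create("foo", "hello")
fs.link("foo", "bar")        # hard link: same inode, link count 2
fs.symlink("foo", "soft")    # soft link: new inode holding the name "foo"
fs.unlink("foo")
print(fs.read("bar"))        # hello
```

After the unlink, `bar` still reaches the data because the inode's link count is still 1, while `fs.read("soft")` raises `KeyError` because the name it stores no longer exists.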

Micro-case study: speeding up a FS
  • Original Unix FS: simple and elegant
  • Nouns:
  • data blocks
  • inodes (directories represented as files)
  • hard links
  • superblock. (specifies number of blks in FS,
    counts of max # of files, pointer to head of free
    list)
  • Problem: slow
  • only gets 20Kb/sec (2% of disk maximum) even for
    sequential disk transfers!

[Figure: disk layout: inodes, data blocks (512 bytes)]

A plethora of performance costs
  • Blocks too small (512 bytes)
  • file index too large
  • too many layers of mapping indirection
  • transfer rate low (get one block at time)
  • Sucky clustering of related objects
  • Consecutive file blocks not close together
  • Inodes far from data blocks
  • Inodes for directory not close together
  • poor enumeration performance: e.g., ls, grep
    foo *.c
  • Next: how FFS fixes these problems (to a degree)

Two disk accesses. Poor enumeration performance.
Before, after!
Problem 1: Too small block size
  • Why not just make bigger?
  • Bigger block increases bandwidth, but how to deal
    with wastage (internal fragmentation)?
  • Use idea from malloc: split unused portion.

Block size   space wasted (%)   file bandwidth (%)
512           6.9                2.6
1024         11.8                3.3
2048         22.4                6.4
4096         45.6               12.0
1MB          99.0               97
Handling internal fragmentation
Only at the ends: may contain one or more
consecutive fragments. Can't have in middle: if
you do small writes, can get screwed by copying.
  • FFS has large block size (4096 or 8192)
  • allow large blocks to be chopped into small ones
    (fragments)
  • Used for little files and pieces at the ends of
    files
  • Best way to eliminate internal fragmentation?
  • Variable sized splits of course
  • Why does FFS use fixed-sized fragments (1024,
    ...)?

[Figure: fragments of file a and file b]
Finding all objects: don't have time to search
entire heap. External fragmentation.
Prob 2: Where to allocate data?
  • Our central fact:
  • Moving disk head expensive
  • So? Put related data close
  • Fastest: adjacent
    sectors (can span platters)
  • Next: in same cylinder
    (can also span platters)
  • Next: in cylinder close by

Clustering related objects in FFS
  • 1 or more consecutive cylinders into a cylinder
    group
  • Key: can access any block in a cylinder without
    performing a seek. Next fastest place is
    adjacent cylinder.
  • Tries to put everything related in same cylinder
    group
  • Tries to put everything not related in different
    group (?!)

[Figure: cylinder group 1 and cylinder group 2 on disk]
Clustering in FFS
  • Tries to put sequential blocks in adjacent
    sectors
  • (access one block, probably access next)
  • Tries to keep inode in same cylinder as file
    data
  • (if you look at inode, most likely will look at
    data too)
  • Tries to keep all inodes in a dir in same
    cylinder group
  • (access one name, frequently access many)
  • e.g., ls -l

[Figure: file a (blocks 1 2 3) and file b
(blocks 1 2) in adjacent sectors; inodes 1 2 3
kept together]
Frequently hack in same working dir
What's a cylinder group look like?
  • Basically a mini-Unix file system
  • How to ensure there's space for related
    objects?
  • Place different directories in different cylinder
    groups
  • Keep a free space reserve so can allocate near
    existing things
  • when file grows too big (1MB) send its remainder
    to different cylinder group.

[Figure: cylinder group layout: inodes, data blocks]
Prob 3: Finding space for related objects
Array of bits, one per block: supports FFS
heuristic of trying to allocate each block in
adjacent sector
  • Old Unix (& DOS): linked list of free blocks
  • Just take a block off of the head. Easy.
  • Bad: free list gets jumbled over time. Finding
    adjacent blocks hard and slow
  • FFS: switch to bit-map of free blocks
  • 1010101111111000001111111000101100
  • easier to find contiguous blocks.
  • Small, so usually keep entire thing in memory
  • key: keep a reserve of free blocks. Makes
    finding a close block easier

Using a bitmap
  • Usually keep entire bitmap in memory
  • 4G disk / 4K byte blocks. How big is map?
  • Allocate block close to block x?
  • check for blocks near bmap[x/32]
  • if disk almost empty, will likely find one near
  • as disk becomes full, search becomes more
    expensive and less effective.
  • Trade space for time (search time, file access
    time)
  • keep a reserve (e.g., 10%) of disk always free,
    ideally scattered across disk
  • don't tell users (df --> 110% full)
  • N platters = N adjacent blocks
  • with 10% free, can almost always find one of them
    free
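
A sketch of both bitmap questions above: how big the map is, and how a search for a free block near block x might go. The slide's version indexes 32-bit words (`bmap[x/32]`); this example uses a plain list of booleans for readability, and treating `True` as "free" is an assumption of the example. `bitmap_bytes` and `alloc_near` are hypothetical names.

```python
def bitmap_bytes(disk_bytes, block_bytes):
    """Size of a one-bit-per-block free map, in bytes."""
    return disk_bytes // block_bytes // 8


def alloc_near(bitmap, x):
    """Claim the free block closest to block x, or return None if disk is full."""
    n = len(bitmap)
    for dist in range(n):                # search outward from x
        for b in (x + dist, x - dist):
            if 0 <= b < n and bitmap[b]:
                bitmap[b] = False        # mark as allocated
                return b
    return None


print(bitmap_bytes(4 * 2**30, 4 * 2**10) // 1024, "KB")  # 128 KB

bitmap = [c == "1" for c in "0001100010"]   # True = free (assumed convention)
print(alloc_near(bitmap, 5))  # 4
print(alloc_near(bitmap, 5))  # 3
print(alloc_near(bitmap, 5))  # 8
print(alloc_near(bitmap, 5))  # None
```

Each successive allocation lands farther from block 5, which is the effect the reserve is meant to limit: the fuller the disk, the longer the search and the worse the placement.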

So what did we gain?
Average waste: .5 fragment; same mapping
structure has larger reach
  • Performance improvements:
  • able to get 20-40% of disk bandwidth for large
    files
  • 10-20x original Unix file system!
  • Better small file performance (why?)
  • Is this the best we can do? No.
  • Block based rather than extent based
  • name contiguous blocks with single pointer and
    length
  • (Linux ext2fs)
  • Writes of meta data done synchronously
  • really hurts small file performance
  • make asynchronous with write-ordering (soft
    updates) or logging (the Episode file system,
    ...)
  • play with semantics (/tmp file systems)

Map integers to integers. Basically this is just
like base and bounds.
Doesn't exploit multiple disks.
Other hacks?
  • Obvious:
  • Big file cache.
  • Fact: no rotation delay if get whole track.
  • How to use?
  • Fact: transfer cost negligible.
  • Can get 20x the data for only ~5% more overhead
  • 1 sector: 10ms + 8ms + 50us (512/10MB/s) = 18ms
  • 20 sectors: 10ms + 8ms + 1ms = 19ms
  • How to use?
  • Fact: if transfer huge, seek + rotation
    negligible
  • Mendel: LFS. Hoard data, write out MB at a time.
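
The 1-sector versus 20-sector comparison above is the access-time equation with numbers plugged in. A sketch using the slide's figures (10ms seek, 8ms rotation, 10MB/s, 512-byte sectors); `access_time_ms` is a hypothetical helper name.

```python
def access_time_ms(nbytes, seek_ms=10.0, rotation_ms=8.0,
                   bandwidth_bytes_per_s=10e6):
    """Approximate time = seek + rotational delay + bytes / bandwidth."""
    transfer_ms = nbytes / bandwidth_bytes_per_s * 1000.0
    return seek_ms + rotation_ms + transfer_ms


one = access_time_ms(512)          # 1 sector
twenty = access_time_ms(20 * 512)  # 20 sectors

print(f"{one:.2f} ms")                    # 18.05 ms
print(f"{twenty:.2f} ms")                 # 19.02 ms
print(f"{twenty / one - 1:.1%} more time for 20x the data")  # 5.4% more ...
```

Seek and rotation dominate until the transfer gets large, which is the argument for fetching whole tracks and for writing out megabytes at a time.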
