Title: More on Disks and File Systems
More on Disks and File Systems
- CS-502 Operating Systems, Fall 2006
- (Slides include materials from Operating System Concepts, 7th ed., by Silberschatz, Galvin, and Gagne, and from Modern Operating Systems, 2nd ed., by Tanenbaum)
Additional Topics
- Mounting a file system
- Mapping files to virtual memory
- RAID: Redundant Array of Inexpensive Disks
- Stable Storage
- Log-Structured File Systems
- Linux Virtual File System
Summary of Reading Assignments in Silberschatz
- Disks (general): 12.1 to 12.6
- File systems (general): Chapter 11
- Ignore 11.9, 11.10 for now!
- RAID: 12.7
- Stable Storage: 12.8
- Log-structured File System: 11.8, 6.9
Mounting
- mount -t type device pathname
- Attach device (which contains a file system of type type) to the directory at pathname
- File system implementation for type gets loaded and connected to the device
- Anything previously below pathname becomes hidden until the device is un-mounted again
- The root of the file system on device is now accessed as pathname
- E.g.,
- mount -t iso9660 /dev/cdrom /myCD
Mounting (continued)
- OS automatically mounts devices in its mount table at initialization time
- /etc/fstab in Linux (see the sample entry below)
- Type may be implicit in device
- Users or applications may mount devices at run time, explicitly or implicitly, e.g.,
- Insert a floppy disk
- Plug in a USB flash drive
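For concreteness, a sketch of what an /etc/fstab entry for the CD-ROM example above might look like (fields: device, mount point, type, options, dump, fsck pass; the option values here are illustrative assumptions, not required ones):

    # device      mount point   type      options         dump  pass
    /dev/cdrom    /myCD         iso9660   ro,noauto,user  0     0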
Linux Virtual File System (VFS)
- A generic file system interface provided by the kernel
- Common object framework
- superblock object: a specific, mounted file system
- i-node object: a specific file in storage
- d-entry object: a directory entry
- file object: an open file associated with a process
Linux Virtual File System (continued)
- VFS operations (a sketch of file_operations follows this list)
- super_operations
- read_inode, sync_fs, etc.
- inode_operations
- create, link, etc.
- d_entry_operations
- d_compare, d_delete, etc.
- file_operations
- read, write, seek, etc.
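As a sketch of how a concrete file system plugs into this framework, here is a hypothetical "myfs" supplying its own file_operations; the field names follow the 2.6-era Linux kernel headers, and the read routine is only a stub:

    #include <linux/fs.h>
    #include <linux/module.h>

    /* Hypothetical read implementation for an imaginary "myfs". */
    static ssize_t myfs_read(struct file *filp, char __user *buf,
                             size_t count, loff_t *ppos)
    {
        /* ... copy data from the backing store to user space ... */
        return 0;
    }

    /* The VFS calls through this table for every open "myfs" file. */
    static const struct file_operations myfs_file_ops = {
        .owner  = THIS_MODULE,
        .llseek = generic_file_llseek,
        .read   = myfs_read,
    };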
Linux Virtual File System (continued)
- Individual file system implementations conform to this architecture
- May be linked to kernel or loaded as modules (see the registration sketch below)
- Linux supports over 50 file systems in the official kernel
- E.g., minix, ext, ext2, ext3, iso9660, msdos, nfs, smb, ...
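A minimal sketch of how such a module might announce itself to the VFS; "myfs" is again hypothetical, and a real file system would also supply callbacks for reading its superblock:

    #include <linux/fs.h>
    #include <linux/module.h>

    static struct file_system_type myfs_type = {
        .owner = THIS_MODULE,
        .name  = "myfs",   /* the name given to mount -t */
    };

    /* Called at module load; adds "myfs" to the kernel's list of
       known file system types. */
    static int __init myfs_init(void)
    {
        return register_filesystem(&myfs_type);
    }
    module_init(myfs_init);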
Linux Virtual File System (continued)
- A special file system type: proc
- Mounted as /proc
- Provides access to kernel internal data structures as if those structures were files!
- E.g., /proc/meminfo
- There are several other special file types
- Vary from one version/vendor to another
- See Silberschatz, 11.2.3
- Love, Linux Kernel Development, Chapter 12
- SUSE Linux Administrator Guide, Chapter 20
Questions?
Mapping Files to Virtual Memory
- Instead of reading from disk into virtual memory, why not simply use the file as the swapping storage for certain VM pages?
- Called mapping the file
- Page tables in kernel point to disk blocks of the file
Memory-Mapped Files
- Memory-mapped file I/O allows file I/O to be treated as routine memory access by mapping a disk block to a page in memory
- A file is initially read using demand paging. A page-sized portion of the file is read from the file system into a physical page. Subsequent reads/writes to/from the file are treated as ordinary memory accesses.
- Simplifies file access by allowing the application to simply access memory rather than being forced to use read()/write() calls to the file system (see the sketch below)
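A minimal user-level sketch of this using the POSIX mmap() interface; the file name is a placeholder and error handling is abbreviated:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);   /* illustrative file name */
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) return 1;

        /* Map the whole file; pages are faulted in on first touch. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        p[0] = 'X';                     /* an ordinary store, not write() */
        msync(p, st.st_size, MS_SYNC);  /* flush dirty pages to the file  */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }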
Memory-Mapped Files (continued)
- A tantalizingly attractive notion, but ...
- Cannot use C/C++ pointers within mapped data structure
- Corrupted data structures likely to persist in file
- Recovery after a crash is more difficult
- Don't really save anything in terms of
- Programming energy
- Thought processes
- Storage space efficiency
Memory-Mapped Files (continued)
- Nevertheless, the idea has its uses
- Simpler implementation of file operations
- read(), write() are memory-to-memory operations
- seek() is simply changing a pointer, etc.
- Called memory-mapped I/O
- Shared Virtual Memory among processes
Shared Virtual Memory
Shared Virtual Memory (continued)
- Supported in
- Windows XP
- Apollo DOMAIN
- Linux??
- Synchronization is the responsibility of the sharing applications
- OS retains no knowledge
- Few (if any) synchronization primitives between processes in separate address spaces (a sketch of one sharing mechanism follows)
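On Linux, one way for related processes to share virtual memory is a shared anonymous mapping inherited across fork(); a minimal sketch (MAP_ANONYMOUS is a Linux/BSD extension rather than strict POSIX):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* One shared page: parent and child see the same physical
           memory, so stores are mutually visible. */
        int *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (counter == MAP_FAILED) return 1;
        *counter = 0;

        if (fork() == 0) {      /* child */
            *counter = 42;
            _exit(0);
        }
        wait(NULL);             /* parent: child's store is now visible */
        printf("parent sees %d\n", *counter);   /* prints 42 */
        return 0;
    }

As the slide notes, nothing here synchronizes the two processes; the wait() merely stands in for real synchronization.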
Questions?
Problem
- Question
- If mean time to failure of a disk drive is 100,000 hours,
- and if your system has 100 identical disks,
- what is the mean time between drive replacements?
- Answer
- 100,000 hours ÷ 100 disks = 1,000 hours (i.e., 41.67 days ≈ 6 weeks)
- I.e.,
- You lose 1% of your data every 6 weeks!
- But don't worry: you can restore most of it from backup!
Can we do better?
- Yes: mirrored disks
- Write every block twice, on two separate disks
- Mean time between simultaneous failure of both disks is 57,000 years
- Can we do even better?
- E.g., use fewer extra disks?
- E.g., get more performance?
RAID: Redundant Array of Inexpensive Disks
- Distribute a file system intelligently across multiple disks to
- Maintain high reliability and availability
- Enable fast recovery from failure
- Increase performance
Levels of RAID
- Level 0: non-redundant striping of blocks across disks
- Level 1: simple mirroring
- Level 2: striping of bytes or bits with ECC
- Level 3: Level 2 with parity, not ECC
- Level 4: Level 0 with parity block
- Level 5: Level 4 with distributed parity blocks
RAID Level 0: Simple Striping
- Each stripe is one or a group of contiguous blocks
- Block/group i is on disk (i mod n)
- Advantage
- Read/write n blocks in parallel: n times the bandwidth
- Disadvantage
- No redundancy at all. System MTBF is 1/n of the disk MTBF!
RAID Level 1: Striping and Mirroring
- Each stripe is written twice
- Two separate, identical disks
- Block/group i is on disks (i mod 2n) and ((i+n) mod 2n)
- Advantages
- Read/write n blocks in parallel: n times the bandwidth
- Redundancy: System MTBF ≈ (Disk MTBF)² at twice the cost
- Failed disk can be replaced by copying
- Disadvantage
- A lot of extra disks for much more reliability than we need
RAID Levels 2 & 3
- Bit- or byte-level striping
- Requires synchronized disks
- Highly impractical
- Requires fancy electronics
- For ECC calculations
- Not used; academic interest only
- See Silberschatz, 12.7.3 (pp. 471-472)
Observation
- When a disk or stripe is read incorrectly,
- we know which one failed!
- Conclusion
- A simple parity disk can provide very high reliability
- (unlike simple parity in memory)
RAID Level 4: Parity Disk
- parity 0-3 = stripe 0 xor stripe 1 xor stripe 2 xor stripe 3
- n stripes plus parity are written/read in parallel
- If any disk/stripe fails, it can be reconstructed from the others
- E.g., stripe 1 = stripe 0 xor stripe 2 xor stripe 3 xor parity 0-3
- Advantages
- n times read bandwidth
- System MTBF ≈ (Disk MTBF)² at 1/n additional cost
- Failed disk can be reconstructed on-the-fly (hot swap)
- Hot expansion: simply add n ≥ 1 disks, all initialized to zeros
- However
- Writing requires read-modify-write of parity stripe → only 1x write bandwidth (see the parity sketch below)
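A minimal sketch of the XOR arithmetic behind parity computation and reconstruction; the stripe size and function names are illustrative:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define STRIPE 4096                /* bytes per stripe (illustrative) */

    /* parity = s[0] xor s[1] xor ... xor s[n-1] */
    void compute_parity(uint8_t *parity, uint8_t *s[], size_t n)
    {
        memset(parity, 0, STRIPE);
        for (size_t i = 0; i < n; i++)
            for (size_t b = 0; b < STRIPE; b++)
                parity[b] ^= s[i][b];
    }

    /* Rebuild one lost stripe: XOR the n survivors with the parity. */
    void rebuild_stripe(uint8_t *lost, uint8_t *survivors[], size_t n,
                        const uint8_t *parity)
    {
        memcpy(lost, parity, STRIPE);
        for (size_t i = 0; i < n; i++)
            for (size_t b = 0; b < STRIPE; b++)
                lost[b] ^= survivors[i][b];
    }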
RAID Level 5: Distributed Parity
[Figure: stripes with parity blocks rotated across the disks]
- Parity calculation is the same as RAID Level 4
- Advantages & disadvantages: same as RAID Level 4
- Additional advantages
- Avoids beating up on the parity disk
- Some writes can proceed in parallel
- Writing individual stripes (RAID 4 & 5)
- Read existing stripe and existing parity
- Recompute parity (see the small-write sketch below)
- Write new stripe and new parity
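And a one-loop sketch of that small-write parity update, reusing STRIPE and the headers from the sketch above; old_data/new_data are illustrative names:

    /* new parity = old parity xor old data xor new data
       (only the changed stripe and the parity stripe are touched) */
    void update_parity(uint8_t *parity, const uint8_t *old_data,
                       const uint8_t *new_data)
    {
        for (size_t b = 0; b < STRIPE; b++)
            parity[b] ^= old_data[b] ^ new_data[b];
    }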
RAID 4 & 5
- Very popular in data centers
- Corporate and academic servers
- Built-in support in Windows XP and Linux
- Connect a group of disks to a fast SCSI port (320 MB/sec bandwidth)
- OS RAID support does the rest!
New Topic
Incomplete Operations
- Problem: how to protect against disk write operations that don't finish
- Power or CPU failure in the middle of a block
- Related series of writes interrupted before all are completed
- Examples
- Database update of charge and credit
- RAID 1, 4, 5: failure between redundant writes
Solution (part 1): Stable Storage
- Write everything twice to separate disks
- Be sure 1st write does not invalidate previous 2nd copy
- RAID 1 is okay; RAID 4/5 not okay!
- Read blocks back to validate, then report completion
- Reading both copies
- If 1st copy okay, use it (i.e., newest value)
- If 2nd copy different, update it with 1st copy
- If 1st copy is bad, use 2nd copy (i.e., old value)
Stable Storage (continued)
- Crash recovery
- Scan disks, compare corresponding blocks
- If one is bad, replace it with the good one
- If both good but different, replace 2nd with 1st copy
- Result
- If 1st block is good, it contains the latest value
- If not, 2nd block still contains the previous value
- An abstraction of an atomic disk write of a single block
- Uninterruptible by power failure, etc. (see the sketch below)
What about more complex disk operations?
- E.g., a file create operation involves
- Allocating free blocks
- Constructing and writing i-node
- Possibly multiple i-node blocks
- Reading and updating directory
- What if the system crashes with the sequence only partly completed?
- Answer: inconsistent data structures on disk
Solution (Part 2): Log-Structured File System
- Make changes to cached copies in memory
- Collect together all changed blocks
- Including i-nodes and directory blocks
- Write to log file (aka journal file)
- A circular buffer on disk
- Fast, contiguous write
- Update log file pointer in stable storage
- Offline: play back the log file to actually update directories, i-nodes, free list, etc.
- Update playback pointer in stable storage (a sketch of one possible log record follows)
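For illustration only, one possible on-disk layout for a record in such a journal; every field name here is hypothetical, not taken from any real file system:

    #include <stdint.h>

    #define MAX_BLOCKS 64              /* illustrative limit per record */

    struct log_record {
        uint64_t seq;                  /* increasing sequence number     */
        uint32_t nblocks;              /* data blocks in this record     */
        uint64_t home[MAX_BLOCKS];     /* where each block really lives  */
        uint32_t checksum;             /* detects a torn (partial) write */
        /* nblocks block images follow, then the next record (circularly) */
    };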
Transaction Database Systems
- Similar techniques
- Every transaction is recorded in the log before being recorded on disk
- Stable storage techniques for managing log pointers
- Once the log entry is confirmed, the disk can be updated in place
- After a crash, replay the log to redo disk operations
Berkeley LFS: a slight variation
- Everything is written to the log
- i-nodes point to updated blocks in the log
- i-node cache in memory is updated whenever an i-node is written
- Cleaner daemon follows behind to compact the log
- Advantages
- LFS is always consistent
- LFS performance
- Much better than the Unix file system for small writes
- At least as good for reads and large writes
- Tanenbaum, 6.3.8, pp. 428-430
- Rosenblum & Ousterhout, Log-structured File System (PDF)
- Note: not the same as Linux LFS (large file support)
Example
[Figure: log-structured file system before and after an update, with new blocks appended to the log]
Summary of Reading Assignments in Silberschatz
- Disks (general): 12.1 to 12.6
- File systems (general): Chapter 11
- Ignore 11.9, 11.10 for now!
- RAID: 12.7
- Stable Storage: 12.8
- Log-structured File System: 11.8, 6.9