1
More on Disks and File Systems
  • CS-502 Operating Systems, Fall 2006
  • (Slides include materials from Operating System
    Concepts, 7th ed., by Silberschatz, Galvin, and
    Gagne, and from Modern Operating Systems, 2nd ed.,
    by Tanenbaum)

2
Additional Topics
  • Mounting a file system
  • Mapping files to virtual memory
  • RAID: Redundant Array of Inexpensive Disks
  • Stable Storage
  • Log Structured File Systems
  • Linux Virtual File System

3
Summary of Reading Assignments in Silberschatz
  • Disks (general): 12.1 to 12.6
  • File systems (general): Chapter 11
  • Ignore 11.9, 11.10 for now!
  • RAID: 12.7
  • Stable Storage: 12.8
  • Log-structured File System: 11.8 and 6.9

4
Mounting
  • mount -t type device pathname
  • Attach device (which contains a file system of
    type type) to the directory at pathname
  • File system implementation for type gets loaded
    and connected to the device
  • Anything previously below pathname becomes hidden
    until the device is un-mounted again
  • The root of the file system on device is now
    accessed as pathname
  • E.g.,
  • mount -t iso9660 /dev/cdrom /myCD

5
Mounting (continued)
  • The OS automatically mounts the devices in its mount
    table at initialization time
  • /etc/fstab in Linux
  • Type may be implicit in the device
  • Users or applications may mount devices at run
    time, explicitly or implicitly, e.g.,
  • Insert a floppy disk
  • Plug in a USB flash drive

6
Linux Virtual File System (VFS)
  • A generic file system interface provided by the
    kernel
  • Common object framework
  • superblock object: a specific, mounted file system
  • i-node object: a specific file in storage
  • d-entry object: a directory entry
  • file object: an open file associated with a
    process

7
Linux Virtual File System (continued)
  • VFS operations (see the sketch below)
  • super_operations: read_inode, sync_fs, etc.
  • inode_operations: create, link, etc.
  • dentry_operations: d_compare, d_delete, etc.
  • file_operations: read, write, seek, etc.
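  The dispatch idea behind these operation tables can be shown in a small,
  self-contained C sketch. The names below (vfs_file, vfs_file_ops, vfs_read)
  are invented for illustration; they are simplified stand-ins, not the actual
  Linux kernel definitions.

    /* Illustrative sketch only: a file system fills in a table of function
     * pointers, and the generic VFS layer dispatches through it. */
    #include <stddef.h>
    #include <sys/types.h>

    struct vfs_file;                       /* stands in for the VFS file object */

    struct vfs_file_ops {                  /* analogous in spirit to file_operations */
        ssize_t (*read)(struct vfs_file *f, char *buf, size_t len, off_t *pos);
        ssize_t (*write)(struct vfs_file *f, const char *buf, size_t len, off_t *pos);
        off_t   (*seek)(struct vfs_file *f, off_t offset, int whence);
    };

    struct vfs_file {
        const struct vfs_file_ops *ops;    /* filled in when the file is opened */
        void *private_data;                /* per-file-system state */
    };

    /* Generic layer: every file system is invoked the same way. */
    ssize_t vfs_read(struct vfs_file *f, char *buf, size_t len, off_t *pos)
    {
        return f->ops->read(f, buf, len, pos);
    }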

8
Linux Virtual File System (continued)
  • Individual file system implementations conform to
    this architecture
  • May be linked into the kernel or loaded as modules
  • Linux supports over 50 file systems in the official
    kernel
  • E.g., minix, ext, ext2, ext3, iso9660, msdos,
    nfs, smb, ...

9
Linux Virtual File System (continued)
  • A special file type: proc
  • Mounted as /proc
  • Provides access to kernel internal data
    structures as if those structures were files!
  • E.g., /proc/dmesg
  • There are several other special file types
  • Vary from one version/vendor to another
  • See Silberschatz, 11.2.3
  • Love, Linux Kernel Development, Chapter 12
  • SUSE Linux Administrator Guide, Chapter 20

10
Questions?
11
Mapping files to Virtual Memory
  • Instead of reading from disk into virtual
    memory, why not simply use the file as the swapping
    storage for certain VM pages?
  • Called mapping
  • Page tables in the kernel point to disk blocks of the
    file

12
Memory-Mapped Files
  • Memory-mapped file I/O allows file I/O to be
    treated as routine memory access by mapping a
    disk block to a page in memory
  • A file is initially read using demand paging. A
    page-sized portion of the file is read from the
    file system into a physical page. Subsequent
    reads/writes to/from the file are treated as
    ordinary memory accesses.
  • Simplifies file access by allowing the application to
    simply access memory rather than being forced to use
    read()/write() calls to the file system (see the
    sketch below)
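  As a concrete illustration, here is a minimal POSIX mmap() example in C:
  the file is mapped into the address space, after which reads and writes are
  ordinary loads and stores. The file name data.bin is just a placeholder,
  and error handling is abbreviated.

    /* Minimal memory-mapped file I/O example using POSIX mmap(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; pages are brought in by demand paging. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 'X';                       /* a "write" is a plain store    */
        printf("%c\n", p[1]);             /* a "read" is a plain load      */

        msync(p, st.st_size, MS_SYNC);    /* flush dirty pages to the file */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }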

13
Memory-Mapped Files (continued)
  • A tantalizingly attractive notion, but ...
  • Cannot use C/C++ pointers within the mapped data
    structure
  • Corrupted data structures are likely to persist in the
    file
  • Recovery after a crash is more difficult
  • Don't really save anything in terms of
  • Programming energy
  • Thought processes
  • Storage space efficiency

14
Memory-Mapped Files (continued)
  • Nevertheless, the idea has its uses
  • Simpler implementation of file operations
  • read(), write() are memory-to-memory operations
  • seek() is simply changing a pointer, etc.
  • Called memory-mapped I/O
  • Shared Virtual Memory among processes

15
Shared Virtual Memory
16
Shared Virtual Memory (continued)
  • Supported in
  • Windows XP
  • Apollo DOMAIN
  • Linux??
  • Synchronization is the responsibility of the
    sharing applications
  • OS retains no knowledge
  • Few (if any) synchronization primitives between
    processes in separate address spaces

17
Questions?
18
Problem
  • Question
  • If the mean time to failure of a disk drive is
    100,000 hours,
  • and if your system has 100 identical disks,
  • what is the mean time between drive replacements?
  • Answer
  • 1,000 hours (i.e., 41.67 days ≈ 6 weeks)
  • I.e.,
  • You lose 1% of your data every 6 weeks!
  • But don't worry, you can restore most of it from
    backup!

19
Can we do better?
  • Yes: mirroring
  • Write every block twice, on two separate disks
  • Mean time between simultaneous failure of both
    disks is 57,000 years
  • Can we do even better?
  • E.g., use fewer extra disks?
  • E.g., get more performance?

20
RAID: Redundant Array of Inexpensive Disks
  • Distribute a file system intelligently across
    multiple disks to
  • Maintain high reliability and availability
  • Enable fast recovery from failure
  • Increase performance

21
Levels of RAID
  • Level 0: non-redundant striping of blocks across
    disks
  • Level 1: simple mirroring
  • Level 2: striping of bytes or bits with ECC
  • Level 3: Level 2 with parity, not ECC
  • Level 4: Level 0 with a parity block
  • Level 5: Level 4 with distributed parity blocks

22
RAID Level 0: Simple Striping
  • Each stripe is one or a group of contiguous
    blocks
  • Block/group i is on disk (i mod n) (see the sketch
    below)
  • Advantage
  • Read/write n blocks in parallel; n times the
    bandwidth
  • Disadvantage
  • No redundancy at all. System MTBF is 1/n of the
    disk MTBF!
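  A tiny C sketch of this placement rule (illustrative only; a real RAID
  controller also handles stripe sizes, caching, and so on):

    /* Sketch: placing logical block i on one of n striped disks (RAID 0). */
    struct disk_location {
        int disk;      /* which disk holds the block      */
        int offset;    /* block offset within that disk   */
    };

    struct disk_location raid0_map(int i, int n)
    {
        struct disk_location loc;
        loc.disk   = i % n;    /* block i lives on disk (i mod n)       */
        loc.offset = i / n;    /* and is the (i div n)-th block on it   */
        return loc;
    }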

23
RAID Level 1: Striping and Mirroring
  • Each stripe is written twice
  • Two separate, identical disks
  • Block/group i is on disks (i mod 2n) and ((i + n) mod
    2n)
  • Advantages
  • Read/write n blocks in parallel; n times the
    bandwidth
  • Redundancy: System MTBF ≈ (Disk MTBF)² at twice
    the cost
  • Failed disk can be replaced by copying
  • Disadvantage
  • A lot of extra disks for much more reliability
    than we need

24
RAID Levels 2 and 3
  • Bit- or byte-level striping
  • Requires synchronized disks
  • Highly impractical
  • Requires fancy electronics
  • For ECC calculations
  • Not used; of academic interest only
  • See Silberschatz, 12.7.3 (pp. 471-472)

25
Observation
  • When a disk or stripe is read incorrectly,
    we know which one failed!
  • Conclusion
  • A simple parity disk can provide very high
    reliability
  • (unlike simple parity in memory)

26
RAID Level 4: Parity Disk
  • parity 0-3 = stripe 0 xor stripe 1 xor stripe 2
    xor stripe 3
  • n stripes plus parity are written/read in
    parallel
  • If any disk/stripe fails, it can be reconstructed
    from the others (see the sketch below)
  • E.g., stripe 1 = stripe 0 xor stripe 2 xor stripe
    3 xor parity 0-3
  • Advantages
  • n times read bandwidth
  • System MTBF ≈ (Disk MTBF)² at 1/n additional
    cost
  • Failed disk can be reconstructed on-the-fly
    (hot swap)
  • Hot expansion: simply add an (n+1)st disk
    initialized to zeros (parity remains valid)
  • However
  • Writing requires a read-modify-write of the parity
    stripe, so only 1x write bandwidth
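  The parity arithmetic can be sketched in a few lines of C. The stripe size
  and disk count below are arbitrary illustration values; the point is that
  parity is the XOR of the data stripes, and any one missing stripe is the
  XOR of the survivors and the parity.

    /* Sketch of RAID 4/5 parity arithmetic (illustration only). */
    #include <stddef.h>
    #include <string.h>

    #define STRIPE_BYTES 4096

    /* parity = stripes[0] xor stripes[1] xor ... xor stripes[n-1] */
    void compute_parity(unsigned char *parity,
                        unsigned char stripes[][STRIPE_BYTES], int n)
    {
        memset(parity, 0, STRIPE_BYTES);
        for (int d = 0; d < n; d++)
            for (size_t b = 0; b < STRIPE_BYTES; b++)
                parity[b] ^= stripes[d][b];
    }

    /* Rebuild a failed stripe: XOR the surviving stripes with the parity. */
    void reconstruct_stripe(unsigned char *out,
                            unsigned char stripes[][STRIPE_BYTES], int n,
                            int failed, const unsigned char *parity)
    {
        memcpy(out, parity, STRIPE_BYTES);
        for (int d = 0; d < n; d++)
            if (d != failed)
                for (size_t b = 0; b < STRIPE_BYTES; b++)
                    out[b] ^= stripes[d][b];
    }

  The same identity explains the write penalty: updating one stripe requires
  reading the old stripe and old parity so that
  new parity = old parity xor old stripe xor new stripe can be written back.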

27
RAID Level 5: Distributed Parity
  • Parity calculation is the same as RAID Level 4
  • Advantages and disadvantages: same as RAID Level 4
  • Additional advantages
  • Avoids beating up on the parity disk
  • Some writes can proceed in parallel
  • Writing individual stripes (RAID 4 and 5)
  • Read the existing stripe and existing parity
  • Recompute the parity (new parity = old parity xor
    old stripe xor new stripe)
  • Write the new stripe and new parity

28
RAID 4 and 5
  • Very popular in data centers
  • Corporate and academic servers
  • Built-in support in Windows XP and Linux
  • Connect a group of disks to a fast SCSI port (320
    MB/sec bandwidth)
  • The OS RAID support does the rest!

29
New Topic
30
Incomplete Operations
  • Problem: how to protect against disk write
    operations that don't finish?
  • Power or CPU failure in the middle of a block
  • Related series of writes interrupted before all
    are completed
  • Examples
  • Database update of a charge and a credit
  • RAID 1, 4, 5: failure between redundant writes

31
Solution (part 1): Stable Storage
  • Write everything twice, to separate disks (see the
    sketch below)
  • Be sure the 1st write does not invalidate the previous
    2nd copy
  • RAID 1 is okay; RAID 4/5 is not okay!
  • Read blocks back to validate, then report
    completion
  • Reading both copies
  • If the 1st copy is okay, use it, i.e., the newest value
  • If the 2nd copy is different, update it with the 1st copy
  • If the 1st copy is bad, use the 2nd copy, i.e., the old value
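  A C sketch of this protocol follows. The disk-I/O helpers (write_block,
  read_and_verify, read_block) are hypothetical stand-ins declared here only
  so the sketch is self-contained; they are not a real API.

    /* Sketch of stable-storage write/read over two mirrored disks.
     * The helpers below are hypothetical stand-ins, not a real API. */
    #include <stdbool.h>

    enum disk { DISK_A, DISK_B };

    bool write_block(enum disk d, int blockno, const void *data);
    bool read_and_verify(enum disk d, int blockno, const void *expected);
    bool read_block(enum disk d, int blockno, void *out);  /* true if copy is good */

    void stable_write(int blockno, const void *data)
    {
        /* Write and verify the 1st copy before touching the 2nd, so a crash
         * in mid-write can never leave both copies bad. */
        do { write_block(DISK_A, blockno, data); }
        while (!read_and_verify(DISK_A, blockno, data));

        do { write_block(DISK_B, blockno, data); }
        while (!read_and_verify(DISK_B, blockno, data));
    }

    bool stable_read(int blockno, void *out)
    {
        if (read_block(DISK_A, blockno, out))        /* 1st copy good: newest value */
            return true;
        return read_block(DISK_B, blockno, out);     /* else the old value from 2nd */
    }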

32
Stable Storage (continued)
  • Crash recovery
  • Scan disks, compare corresponding blocks
  • If one is bad, replace with good one
  • If both good but different, replace 2nd with 1st
    copy
  • Result
  • If 1st block is good, it contains latest value
  • If not, 2nd block still contains previous value
  • An abstraction of an atomic disk write of a
    single block
  • Uninterruptible by power failure, etc.

33
What about more complex disk operations?
  • E.g., File create operation involves
  • Allocating free blocks
  • Constructing and writing i-node
  • Possibly multiple i-node blocks
  • Reading and updating directory
  • What if system crashes with the sequence only
    partly completed?
  • Answer: inconsistent data structures on disk

34
Solution (Part 2): Log-Structured File System
  • Make changes to cached copies in memory
  • Collect together all changed blocks
  • Including i-nodes and directory blocks
  • Write to log file (a.k.a. journal file)
  • A circular buffer on disk
  • Fast, contiguous write
  • Update log file pointer in stable storage
  • Offline: play back the log file to actually update
    directories, i-nodes, free list, etc.
  • Update playback pointer in stable storage (see the
    sketch of this sequence below)
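  A C sketch of this commit-and-playback sequence is below. All of the helper
  functions (log_append, stable_write_log_pointer, and so on) are hypothetical
  stand-ins for buffer-cache and disk routines, named here only for
  illustration.

    /* Sketch of the log-structured update sequence (illustration only).
     * Every helper below is a hypothetical stand-in, not a real API. */
    struct block { long blockno; unsigned char data[4096]; };

    long log_append(const struct block *blocks, int n);  /* returns new end-of-log offset */
    void stable_write_log_pointer(long offset);          /* atomic, via stable storage    */
    int  log_next_committed(struct block *out);          /* 0 when the log is drained     */
    void write_in_place(const struct block *b);
    void stable_write_playback_pointer(long offset);
    long log_position(void);

    void commit_changes(struct block *dirty, int ndirty)
    {
        /* 1. One fast, contiguous append of all changed blocks
         *    (i-nodes, directory blocks, data) to the circular log. */
        long end = log_append(dirty, ndirty);

        /* 2. Advance the log pointer with a stable-storage write;
         *    once this succeeds, the changes are durable. */
        stable_write_log_pointer(end);
    }

    void playback(void)
    {
        /* 3. Offline: replay committed entries into their home locations,
         *    advancing the playback pointer as we go. */
        struct block b;
        while (log_next_committed(&b)) {
            write_in_place(&b);
            stable_write_playback_pointer(log_position());
        }
    }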

35
Transaction Database Systems
  • Similar techniques
  • Every transaction is recorded in the log before
    being recorded on disk
  • Stable storage techniques for managing log
    pointers
  • Once the log entry is confirmed, the disk can be updated
    in place
  • After a crash, replay the log to redo disk operations

36
Berkeley LFS: a slight variation
  • Everything is written to log
  • i-nodes point to updated blocks in log
  • i-node cache in memory updated whenever i-node is
    written
  • Cleaner daemon follows behind to compact log
  • Advantages
  • LFS is always consistent
  • LFS performance
  • Much better than Unix file system for small
    writes
  • At least as good for reads and large writes
  • Tanenbaum, 6.3.8, pp. 428-430
  • Rosenblum and Ousterhout, "The Design and Implementation
    of a Log-Structured File System" (pdf)
  • Note: not the same as Linux LFS (large file system)

37
Example
  • (Diagram: file system blocks before and after being
    written to the log)
38
Summary of Reading Assignments in Silberschatz
  • Disks (general): 12.1 to 12.6
  • File systems (general): Chapter 11
  • Ignore 11.9, 11.10 for now!
  • RAID: 12.7
  • Stable Storage: 12.8
  • Log-structured File System: 11.8 and 6.9