Improving the Reliability of Commodity Operating Systems

1
Improving the Reliability of Commodity
Operating Systems
  • Mike Swift
  • University of Wisconsin
  • Brian Bershad, Hank Levy
  • University of Washington

2
Outline
  • Introduction
  • Vision
  • Design
  • Evaluation
  • Summary

4
OS Today
7
OS Today
8
Outline
  • Introduction
  • Vision
  • Design
  • Evaluation
  • Summary

9
OS With Reliability
11
OS With Reliability
14
Outline
  • Introduction
  • Vision
  • Design
  • Evaluation
  • Summary

15
Previous Approaches to Reliability
  • 1. Fix the code: MetaCompilation, Windows Driver Verifier
  • 2. Build a new system: Singularity, Tandem NonStop, QuickSilver
  • 3. Add hardware: VaxClusters, Wolfpack, Google

16
Contributions
  • We designed and built a new kernel subsystem that
  • Prevents majority of driver-caused crashes
  • Requires no changes to existing drivers
  • Requires only minor changes to OS
  • Minimally impacts performance

17
What Is a Driver?
  • A module that translates OS requests to
    device requests
  • 10s of thousands exist
  • 81 drivers running on this laptop!
  • Run in the OS kernel
  • Small number of common interfaces

18
Why Do Drivers Fail?
  • Complex and hard to write
  • Must handle asynchronous events
  • Must obey kernel programming rules
  • Non-reproducible failures
  • Difficult to test and debug
  • Written by inexperienced programmers

19
Objectives
  • Eliminate most downtime caused by drivers
  • 1. Prevent system crashes - isolation
  • 2. Keep applications running - recovery

20
Design of Nooks
  • Standard Linux kernel and drivers
  • Plus
  • Isolation
  • Recovery
  • Compatible with existing code

21
System Architecture
22
Outline
  • Introduction
  • Vision
  • Design
  • Isolation
  • Recovery
  • Evaluation
  • Summary

23
Existing Kernels
24
Memory Isolation
25
Lightweight protection domain
  • Provide protection by having extensions execute
    with a different page table, giving them
  • Read/write access to their own pages
  • Read access to other kernel pages
  • Lightweight solution because extensions still
    execute in kernel mode
  • A malicious extension could switch back to the
    kernel's page table

26
Lightweight protection domain
  • Each lightweight protection domain has
  • A synchronized copy of the kernel page table
  • Many private structures, including its
    own heap and a pool of stacks
  • Changing protection domains requires a change of
    page tables
  • Results in a TLB flush!
  • Provides no protection against DMA by extensions
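
The page-table rules on the two slides above can be sketched as a toy model. This is only an illustration, not Nooks code: the `Machine` class, the page names, and the one-flush-per-switch cost are assumptions made for the example. It assumes the kernel may write every page while an extension may write only its own pages and read the rest.

```python
class ProtectionFault(Exception):
    """Raised when the active domain writes a page it may only read."""


class Machine:
    """Toy model: one page-table view per domain, one active at a time."""

    def __init__(self, owners):
        # owners maps a page name to the domain that owns it (hypothetical names).
        self.owners = dict(owners)
        self.memory = {page: 0 for page in self.owners}
        self.current = "kernel"
        self.tlb_flushes = 0

    def can_write(self, domain, page):
        # Assumption: the kernel writes anything, an extension only its own pages.
        return domain == "kernel" or self.owners[page] == domain

    def switch_to(self, domain):
        # Changing domains loads another page table, modeled as one TLB flush.
        if domain != self.current:
            self.current = domain
            self.tlb_flushes += 1

    def read(self, page):
        # Every domain has at least read access in this model.
        return self.memory[page]

    def write(self, page, value):
        if not self.can_write(self.current, page):
            raise ProtectionFault(f"{self.current} cannot write {page}")
        self.memory[page] = value
```

In this model a stray write from an extension to a kernel page raises `ProtectionFault` instead of silently corrupting kernel state, and each round trip into an extension costs two flushes. DMA is not modeled, which matches the last bullet above.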

27
How to isolate extensions?
(Figure: Linux Kernel with three page-table views; two items are marked X)
  • Ext. 1 R, Ext. 2 R, Kernel R/W
  • Ext. 1 R/W, Ext. 2 R, Kernel R/W
  • Ext. 1 R, Ext. 2 R/W, Kernel R/W
28
Control Transfer
29
Nooks interposition mechanisms
  • Ensure that
  • All extension-to-kernel and kernel-to-extension
    control flow goes through Nooks XPC mechanism
  • All data transfer between kernel and extension
    goes through Nooks object tracking code
  • All interfaces are done through wrappers, similar
    to the stubs of an RPC package

30
Transparency
31
Interposition (I)
  • Through wrapper stubs, all executing in the kernel
    protection domain
  • Before the call, when the kernel calls an extension
  • After the call, when an extension calls the kernel
  • Wrappers
  • check parameters for validity
  • implement call by value and result
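
A wrapper that checks parameters and passes them by value and result might look like the sketch below. The names (`make_wrapper`, `validate_packet`, the `length` field and its 0 to 1500 range) are invented for the illustration and are not taken from Nooks.

```python
import copy


class InvalidParameter(Exception):
    """Raised when a wrapper rejects an argument or a result."""


def make_wrapper(extension_fn, validate):
    """Build a stub around an extension entry point, RPC style."""

    def wrapper(kernel_obj):
        validate(kernel_obj)                 # check parameters on the way in
        private = copy.deepcopy(kernel_obj)  # copy in: extension gets a value
        result = extension_fn(private)       # extension touches only the copy
        validate(private)                    # check again before copying back
        kernel_obj.clear()
        kernel_obj.update(private)           # copy out: publish the result
        return result

    return wrapper


def validate_packet(pkt):
    # Hypothetical validity rule for the example.
    if not 0 <= pkt["length"] <= 1500:
        raise InvalidParameter(f"bad length {pkt['length']}")


def good_driver(pkt):
    pkt["length"] += 4
    return 0


def bad_driver(pkt):
    pkt["length"] = -1  # corrupts its private copy only
    return 0
```

Because the extension only ever sees the copy, a bad value written by `bad_driver` is caught by the second check and never reaches the kernel's object.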

32
Wrappers
33
Interposition (II)
  • Writing wrappers is not an easy task
  • Requires knowing how parameters are used
  • Significant amount of wrapper sharing among
    extensions
  • Especially when extensions implement the same
    functionality

34
Nooks object tracking
  • Maintain a list of kernel data structures
    accessed by each extension
  • Control all modifications to these structures
  • Provide object information for cleanup if the
    extension fails
  • Initially implemented 43 object tracking
    procedures for 43 kernel objects
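
As a rough sketch of the bookkeeping described above, a tracker can keep a per-extension list of held objects together with a release routine for each. The `ObjectTracker` class and its method names are assumptions for the example, not the Nooks interface.

```python
class ObjectTracker:
    """Records which kernel objects each extension currently holds."""

    def __init__(self):
        # extension name -> list of (object, release function) pairs
        self._held = {}

    def track(self, ext, obj, release):
        self._held.setdefault(ext, []).append((obj, release))

    def untrack(self, ext, obj):
        entries = self._held.get(ext, [])
        self._held[ext] = [(o, r) for (o, r) in entries if o is not obj]

    def held_by(self, ext):
        return [o for (o, _) in self._held.get(ext, [])]

    def cleanup(self, ext):
        """Release everything a failed extension still holds, newest first."""
        released = 0
        for obj, release in reversed(self._held.pop(ext, [])):
            release(obj)
            released += 1
        return released
```

When an extension fails, a single `cleanup` call hands every object it still holds back to the kernel, which is what makes a later restart safe.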

35
Nooks object tracking
  • Extensions cannot directly modify kernel data
    structures
  • Object tracking code will
  • Copy kernel data structures into extension
    address space
  • Copy them back after changes have been applied
  • Perform some checks whenever possible

36
Knowing object lifetime
  • Why is it important?
  • Garbage collection
  • Preventing access to dangling references
  • Common schemes
  • Add an entry at the start of a call, remove it
    at the end of the call
  • Explicitly allocate and deallocate
  • For complex objects,
  • Deep copy
  • Use page tracker
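
The two common lifetime schemes can be sketched as follows: an object that is valid only for the duration of one call, and an object that lives between an explicit allocate and deallocate. `LifetimeTable` and its methods are hypothetical names for the illustration.

```python
from contextlib import contextmanager


class LifetimeTable:
    """Tracks which object ids an extension may currently reference."""

    def __init__(self):
        self.live = set()

    @contextmanager
    def for_call(self, obj_id):
        # Scheme 1: add the entry when the call starts, remove it when it ends.
        self.live.add(obj_id)
        try:
            yield obj_id
        finally:
            self.live.discard(obj_id)

    def allocate(self, obj_id):
        # Scheme 2: the object stays live until it is explicitly freed.
        self.live.add(obj_id)

    def free(self, obj_id):
        self.live.remove(obj_id)

    def check(self, obj_id):
        # Reject dangling references instead of dereferencing them.
        if obj_id not in self.live:
            raise LookupError(f"dangling reference to {obj_id}")
```

A reference kept past the end of the call fails the `check` rather than reaching freed memory.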

37
Data Access
38
Outline
  • Introduction
  • Vision
  • Design
  • Isolation
  • Recovery
  • Evaluation
  • Summary

39
Nooks recovery functions
  • Detect and recover from various extension faults
  • When an extension improperly invokes a kernel
    function
  • When processor raises an exception
  • Recovery helped by Nooks isolation mechanisms
  • All accesses to kernel structures go through
    wrappers
  • Nooks can successfully release extension-held
    kernel structures

40
Recovery
  • Goals
  • Restore driver state so it can process requests
    as if it had never failed
  • Conceal failure from applications
  • Observation
  • Driver interface specifies how driver responds to
    requests
  • Approach: model drivers as state machines

41
Drivers as State Machines
42
Drivers as State Machines
  • Recovery
  • Advance driver from initial state to state at
    time of crash
  • Reply to requests with valid responses according
    to driver state

43
Shadow Drivers
  • Generic code that
  • Normally
  • Records state-changing inputs
  • On failure
  • Restarts driver
  • Replays inputs to recover
  • Emulates driver to applications/OS
  • One shadow driver handles recovery for an entire
    class of drivers
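
The record, restart, and replay cycle can be sketched with a toy driver whose state is fully determined by its inputs. Everything here is an assumption made for the illustration: `NicDriver`, its three operations, and the choice of which operations count as state-changing are not taken from the shadow driver implementation.

```python
class NicDriver:
    """Stand-in for a real driver, modeled as a small state machine."""

    def __init__(self):
        self.state = {"up": False, "mtu": 1500}

    def handle(self, op, arg=None):
        if op == "open":
            self.state["up"] = True
        elif op == "set_mtu":
            self.state["mtu"] = arg
        elif op == "send":
            if not self.state["up"]:
                raise RuntimeError("device is down")
            return len(arg)
        else:
            raise ValueError(f"unknown request {op}")
        return None


class ShadowDriver:
    """Generic recovery code for one class of drivers."""

    STATE_CHANGING = {"open", "set_mtu"}  # hypothetical classification

    def __init__(self, driver_factory):
        self._factory = driver_factory
        self.driver = driver_factory()
        self.log = []

    def request(self, op, arg=None):
        result = self.driver.handle(op, arg)
        if op in self.STATE_CHANGING:
            self.log.append((op, arg))  # record only requests that succeeded
        return result

    def recover(self):
        self.driver = self._factory()   # restart: fresh driver in initial state
        for op, arg in self.log:        # replay: advance to pre-failure state
            self.driver.handle(op, arg)
```

Data requests such as `send` are not logged, so the log stays proportional to the configuration rather than to the traffic.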

44
Shadow Driver Overview
45
Preparing for Recovery
46
Recovering a Failed Driver
47
Recovering a Failed Driver
  • Summary
  • Reset driver
  • Reinitialize driver
  • Replay logged requests

48
Spoofing a Failed Driver
49
Spoofing a Failed Driver
  • Shadow acts as driver -- replies to requests with
    valid possible responses
  • Applications and OS unaware that driver failed
  • No device control
  • General Strategies
  • 1. Answer request from log
  • 2. Act busy
  • 3. Block caller
  • 4. Queue request
  • 5. Drop request
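
The five strategies above can be read as a per-request policy that the shadow consults while the real driver is being restarted. The mapping from request type to strategy below is invented for the example, as are the request names and the `SpoofingShadow` class.

```python
from collections import deque
from enum import Enum, auto


class Strategy(Enum):
    ANSWER_FROM_LOG = auto()
    ACT_BUSY = auto()
    BLOCK_CALLER = auto()
    QUEUE_REQUEST = auto()
    DROP_REQUEST = auto()


# Hypothetical policy: which strategy to use for which kind of request.
POLICY = {
    "get_mtu": Strategy.ANSWER_FROM_LOG,
    "ioctl": Strategy.ACT_BUSY,
    "read": Strategy.BLOCK_CALLER,
    "write": Strategy.QUEUE_REQUEST,
    "send_packet": Strategy.DROP_REQUEST,
}


class SpoofingShadow:
    """Answers requests on behalf of a driver that is being recovered."""

    def __init__(self, logged_answers):
        self.logged_answers = dict(logged_answers)
        self.queued = deque()
        self.blocked = []

    def handle(self, caller, op, arg=None):
        strategy = POLICY[op]
        if strategy is Strategy.ANSWER_FROM_LOG:
            return ("ok", self.logged_answers[op])
        if strategy is Strategy.ACT_BUSY:
            return ("busy", None)
        if strategy is Strategy.BLOCK_CALLER:
            self.blocked.append(caller)
            return ("blocked", None)
        if strategy is Strategy.QUEUE_REQUEST:
            self.queued.append((op, arg))
            return ("queued", None)
        return ("ok", None)  # DROP_REQUEST: report success, do nothing

    def finish_recovery(self):
        """Hand back queued requests and blocked callers once the driver is up."""
        pending, waiters = list(self.queued), list(self.blocked)
        self.queued.clear()
        self.blocked.clear()
        return pending, waiters
```

None of these paths touches the device, consistent with the "no device control" bullet: the shadow only produces responses that the driver interface permits.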

50
Design Summary
  • Isolation
  • Lightweight Kernel Protection Domains
  • eXtension Procedure Call (XPC)
  • Object Table
  • Wrappers
  • Recovery
  • Shadow Drivers

51
Nooks Architecture
52
Outline
  • Introduction
  • Vision
  • Design
  • Evaluation
  • Implementation
  • Benefit
  • Cost
  • Summary

53
Drivers Tested
54
Details
  • 26,000 lines of code
  • Linux kernel has 2.4 million lines
  • Makes no use of Intel x86 protection rings
  • Extensions execute at same protection level as
    kernel

55
Implementation Complexity
  • New code
  • Isolation 23,000 lines
  • Recovery 3,300 lines

57
Outline
  • Introduction
  • Vision
  • Design
  • Evaluation
  • Implementation
  • Benefit
  • Isolation
  • Recovery
  • Cost
  • Summary

58
RELIABILITY
  • Tested eight extensions
  • Two sound card drivers
  • Four ethernet drivers
  • A Win95-compatible file system
  • An in-kernel Web server
  • Injected 400 faults

59
Test results
  • Nooks eliminated 99% of the crashes observed with
    native Linux (313 out of 317)
  • Nooks is slower
  • VFAT benchmark spent 165 s in the kernel instead
    of 29.5 s for native Linux
  • Web server could only serve about 6,000 pages/s
    instead of 15,000 pages/s for native Linux
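
The figures on this slide can be checked with a few lines of arithmetic. Only the raw numbers come from the slide; the variable names and the derived ratios are added here.

```python
# Crash prevention: 313 of the 317 native-Linux crashes were eliminated.
crashes_native = 317
crashes_eliminated = 313
eliminated_pct = 100 * crashes_eliminated / crashes_native

# VFAT benchmark: kernel time with Nooks versus native Linux, in seconds.
vfat_slowdown = 165 / 29.5

# Web server: pages per second with Nooks versus native Linux.
web_throughput_ratio = 6_000 / 15_000

print(f"crashes eliminated: {eliminated_pct:.1f}%")
print(f"VFAT kernel time:   {vfat_slowdown:.1f}x native")
print(f"Web throughput:     {web_throughput_ratio:.0%} of native")
```

So the 99 on the slide is 98.7% rounded up, the VFAT benchmark spends about 5.6 times as long in the kernel, and the Web server delivers 40% of native throughput.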

60
Isolation Works
61
Isolation Works
62
Isolation Works
63
Recovery Works
64
Relative Performance
65
CPU Usage
66
Future Work
  • Application to virtual machines
  • Apply shadow drivers to Xen
  • Fixing drivers
  • Reducing hardware-induced driver failures
  • New driver architectures
  • Testing drivers w/o devices
  • Writing drivers in Python

67
Conclusion
  • Nooks is an architecture and a set of components
    and techniques for improving system reliability
    that is
  • Highly effective at preventing crashes
  • Compatible with existing code
  • Low performance overhead
  • Low implementation cost

68
Questions?