Title: Improving the Reliability of Commodity Operating Systems
1Improving the Reliability of Commodity Operating Systems
- Mike Swift
- University of Wisconsin
- Brian Bershad, Hank Levy
- University of Washington
2Outline
- Introduction
- Vision
- Design
- Evaluation
- Summary
4OS Today
7OS Today
8Outline
- Introduction
- Vision
- Design
- Evaluation
- Summary
9OS With Reliability
11OS With Reliability
14Outline
- Introduction
- Vision
- Design
- Evaluation
- Summary
15Previous Approaches to Reliability
- 1. Fix the code
- MetaCompilation, Windows Driver Verifier
- 2. Build a new system
- Singularity, Tandem NonStop, QuickSilver
- 3. Add hardware
- VaxClusters, Wolfpack, Google
16Contributions
- We designed and built a new kernel subsystem that
- Prevents majority of driver-caused crashes
- Requires no changes to existing drivers
- Requires only minor changes to OS
- Minimally impacts performance
17What Is a Driver?
- A module that translates OS requests to device requests
- Tens of thousands exist
- 81 drivers running on this laptop!
- Run in the OS kernel
- Small number of common interfaces
18Why Do Drivers Fail?
- Complex and hard to write
- Must handle asynchronous events
- Must obey kernel programming rules
- Non-reproducible failures
- Difficult to test and debug
- Written by inexperienced programmers
19Objectives
- Eliminate most downtime caused by drivers
- 1. Prevent system crashes - isolation
- 2. Keep applications running - recovery
20Design of Nooks
- Standard Linux kernel and drivers
- Plus
- Isolation
- Recovery
- Compatible with existing code
21System Architecture
22Outline
- Introduction
- Vision
- Design
- Isolation
- Recovery
- Evaluation
- Summary
23Existing Kernels
24Memory Isolation
25Lightweight protection domain
- Provide protection by having extensions execute with a different page table, giving them
- Read/write access to their own pages
- Read access to other kernel pages
- Lightweight solution because extensions still execute in kernel mode
- A malicious extension could switch back to the kernel's page table
26Lightweight protection domain
- Each lightweight protection domain has
- A synchronized copy of the kernel page table
- Many private structures, such as its own heap and a pool of stacks
- Changing protection domains requires a change of page tables
- Results in a TLB flush!
- Provides no protection against DMA by extensions
27How to isolate extensions?
(Figure: access rights per memory region)
- Linux kernel memory: Ext. 1 R, Ext. 2 R, Kernel R/W
- Ext. 1 memory: Ext. 1 R/W, Ext. 2 R, Kernel R/W
- Ext. 2 memory: Ext. 1 R, Ext. 2 R/W, Kernel R/W
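The access rights described on the lightweight protection domain slides can be modeled in a few lines. This is only a toy sketch in Python, not Nooks code: the names `Access`, `page_table_for`, and `allowed` are hypothetical, and real domains are enforced by hardware page tables rather than dictionaries.

```python
from enum import Flag, auto


class Access(Flag):
    NONE = 0
    READ = auto()
    WRITE = auto()


READ_WRITE = Access.READ | Access.WRITE


def page_table_for(domain, regions):
    """Build a hypothetical per-domain view: region name -> allowed access.

    The kernel can read and write everything. An extension can read and
    write its own region and only read every other region.
    """
    if domain == "kernel":
        return {region: READ_WRITE for region in regions}
    return {
        region: (READ_WRITE if region == domain else Access.READ)
        for region in regions
    }


def allowed(table, region, wanted):
    """Return True if every requested access bit is granted for the region."""
    return (table[region] & wanted) == wanted


regions = ["kernel", "ext1", "ext2"]
ext1_view = page_table_for("ext1", regions)
print(allowed(ext1_view, "ext1", Access.WRITE))    # True
print(allowed(ext1_view, "kernel", Access.WRITE))  # False
```

The model captures only the intended policy. As the slides note, an extension still runs in kernel mode, so a malicious one could switch page tables, and DMA is not covered at all.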
28Control Transfer
29Nooks interposition mechanisms
- Ensure that
- All extension-to-kernel and kernel-to-extension control flow goes through the Nooks XPC mechanism
- All data transfer between kernel and extension goes through the Nooks object tracking code
- All interfaces are accessed through wrappers, similar to the stubs of an RPC package
30Transparency
31Interposition (I)
- Through wrapper stubs, all executing in the kernel protection domain
- Before the call when the kernel calls an extension
- After the call when an extension calls the kernel
- Wrappers
- check parameters for validity
- implement call by value and result
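The two wrapper duties listed above, parameter checking and call by value and result, can be illustrated with a small Python sketch. It is an assumption-laden model rather than the Nooks implementation: `make_wrapper`, `InvalidParameter`, `set_mtu`, and `faulty` are hypothetical names, and arguments are plain dictionaries instead of kernel structures.

```python
import copy


class InvalidParameter(Exception):
    """Raised when a wrapper rejects an argument before the call is made."""


def make_wrapper(target, validate):
    """Wrap `target` so it only ever sees a private copy of its argument."""

    def wrapper(arg):
        if not validate(arg):
            raise InvalidParameter(f"rejected argument: {arg!r}")
        private = copy.deepcopy(arg)   # call by value: callee gets a copy
        result = target(private)       # a fault here leaves `arg` untouched
        arg.clear()
        arg.update(private)            # call by result: copy changes back
        return result

    return wrapper


def set_mtu(config):
    """Hypothetical well-behaved extension entry point."""
    config["mtu"] = 1500
    return 0


def faulty(config):
    """Hypothetical extension that corrupts its argument and then fails."""
    config["mtu"] = -1
    raise RuntimeError("extension fault")


def positive_mtu(config):
    return isinstance(config.get("mtu"), int) and config["mtu"] > 0


config = {"mtu": 576}
make_wrapper(set_mtu, positive_mtu)(config)
print(config)  # {'mtu': 1500}
```

Because results are copied back only after a normal return, a call that faults midway cannot leave a half-updated structure in the caller's hands.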
32Wrappers
33Interposition (II)
- Writing wrappers is not an easy task
- Requires knowing how parameters are used
- Significant amount of wrapper sharing among extensions
- Especially when extensions implement the same functionality
34Nooks object tracking
- Maintain a list of kernel data structures accessed by each extension
- Control all modifications to these structures
- Provide object information for cleanup if the extension fails
- Initially implemented 43 object tracking procedures for 43 kernel objects
35Nooks object tracking
- Extensions cannot directly modify kernel data structures
- Object tracking code will
- Copy kernel data structures into extension address space
- Copy them back after changes have been applied
- Perform some checks whenever possible
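The object tracking behavior described on the two slides above (per-extension bookkeeping, copy in, checked copy back, and cleanup information) might look roughly like the following Python sketch. The class `ObjectTracker` and its method names are hypothetical and chosen for illustration; the real tracker works on kernel objects and address spaces.

```python
import copy


class ObjectTracker:
    """Toy model: the kernel keeps the authoritative copy of every object."""

    def __init__(self):
        self._kernel_objects = {}  # object id -> authoritative kernel copy
        self._held = {}            # extension name -> ids it has accessed

    def register(self, obj_id, value):
        self._kernel_objects[obj_id] = value

    def checkout(self, extension, obj_id):
        """Record the access and hand the extension a private copy."""
        self._held.setdefault(extension, set()).add(obj_id)
        return copy.deepcopy(self._kernel_objects[obj_id])

    def commit(self, extension, obj_id, new_value, check=lambda value: True):
        """Copy a modified object back, after an optional sanity check."""
        if obj_id not in self._held.get(extension, set()):
            raise PermissionError(f"{extension} never checked out {obj_id}")
        if not check(new_value):
            raise ValueError(f"rejected update to {obj_id}")
        self._kernel_objects[obj_id] = copy.deepcopy(new_value)

    def kernel_view(self, obj_id):
        return self._kernel_objects[obj_id]

    def cleanup(self, extension):
        """Return (and forget) everything a failed extension was holding."""
        return sorted(self._held.pop(extension, set()))


tracker = ObjectTracker()
tracker.register("netdev0", {"mtu": 1500})
local = tracker.checkout("e1000", "netdev0")
local["mtu"] = 9000
tracker.commit("e1000", "netdev0", local, check=lambda v: v["mtu"] > 0)
print(tracker.kernel_view("netdev0"))  # {'mtu': 9000}
print(tracker.cleanup("e1000"))        # ['netdev0']
```

Keeping the held-object list per extension is what makes recovery tractable: after a failure, the kernel knows exactly which structures to release.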
36Knowing object lifetime
- Why is it important?
- Garbage collection
- Prevent access to dangling references
- Common scheme
- Add an entry at the start of the call, remove it at the end of the call
- Explicitly allocate and deallocate
- For complex objects,
- Deep copy
- Use page tracker
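The simplest lifetime scheme above, where an object is tracked only for the duration of one call, maps naturally onto a scoped construct. The Python sketch below uses a context manager for that; `tracked_for_call` and the table contents are hypothetical, and the table is just a set of identifiers.

```python
from contextlib import contextmanager


@contextmanager
def tracked_for_call(table, obj_id):
    """Track `obj_id` only while the enclosed call is in progress."""
    table.add(obj_id)              # entry added when the call begins
    try:
        yield obj_id
    finally:
        table.discard(obj_id)      # entry removed when the call ends


table = set()
with tracked_for_call(table, "skb-17"):
    print("skb-17" in table)  # True
print("skb-17" in table)      # False
```

The `finally` clause removes the entry even if the call fails, so no stale entry is left behind to become a dangling reference.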
37Data Access
38Outline
- Introduction
- Vision
- Design
- Isolation
- Recovery
- Evaluation
- Summary
39Nooks recovery functions
- Detect and recover from various extension faults
- When an extension improperly invokes a kernel function
- When the processor raises an exception
- Recovery helped by Nooks isolation mechanisms
- All access to kernel structures is done through wrappers
- Nooks can successfully release extension-held kernel structures
40Recovery
- Goals
- Restore driver state so it can process requests as if it had never failed
- Conceal failure from applications
- Observation
- Driver interface specifies how driver responds to requests
- Approach: Model drivers as state machines
41Drivers as State Machines
42Drivers as State Machines
- Recovery
- Advance driver from initial state to state at time of crash
- Reply to requests with valid responses according to driver state
43Shadow Drivers
- Generic code that
- Normally
- Records state-changing inputs
- On failure
- Restarts driver
- Replays inputs to recover
- Emulates driver to applications/OS
- One shadow driver handles recovery for an entire
class of drivers
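The record-and-replay idea above can be sketched in Python as follows. This is a minimal model under stated assumptions, not the shadow driver implementation: `SoundDriver`, `ShadowDriver`, the `"set"` and `"get"` request names, and the choice of which requests count as state-changing are all hypothetical.

```python
class SoundDriver:
    """Hypothetical driver whose state is a small configuration table."""

    def __init__(self):
        self.config = {}

    def handle(self, request, value=None):
        if request == "set":
            key, setting = value
            self.config[key] = setting
            return "ok"
        if request == "get":
            return self.config.get(value)
        raise ValueError(f"unknown request: {request}")


class ShadowDriver:
    """Passes requests through, logging the ones that change driver state."""

    STATE_CHANGING = {"set"}

    def __init__(self, driver_factory):
        self._factory = driver_factory
        self._driver = driver_factory()
        self._log = []

    def handle(self, request, value=None):
        result = self._driver.handle(request, value)
        if request in self.STATE_CHANGING:
            self._log.append((request, value))  # record only after success
        return result

    def recover(self):
        """Restart the driver, then replay the log to rebuild its state."""
        self._driver = self._factory()
        for request, value in self._log:
            self._driver.handle(request, value)


shadow = ShadowDriver(SoundDriver)
shadow.handle("set", ("volume", 7))
shadow.recover()                      # simulate a crash and recovery
print(shadow.handle("get", "volume"))  # 7
```

Nothing in `ShadowDriver` depends on a particular device, only on the request interface, which is why one shadow can serve a whole class of drivers.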
44Shadow Driver Overview
45Preparing for Recovery
46Recovering a Failed Driver
47Recovering a Failed Driver
- Summary
- Reset driver
- Reinitialize driver
- Replay logged requests
48Spoofing a Failed Driver
49Spoofing a Failed Driver
- Shadow acts as driver -- replies to requests with valid possible responses
- Applications and OS unaware that driver failed
- No device control
- General Strategies
- 1. Answer request from log
- 2. Act busy
- 3. Block caller
- 4. Queue request
- 5. Drop request
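A rough Python sketch of how a shadow might choose among the strategies above while the real driver is being recovered. The request names and their mapping to strategies are assumptions made for illustration; strategy 3 (block the caller) is left out because it needs threads or a scheduler to demonstrate.

```python
from collections import deque


class SpoofingShadow:
    """Toy stand-in that answers requests while the driver is down."""

    def __init__(self, logged_state):
        self.logged_state = dict(logged_state)  # state captured before failure
        self.queued = deque()                   # work deferred until recovery

    def handle(self, request, arg=None):
        if request == "get":        # 1. answer the request from the log
            return self.logged_state.get(arg)
        if request == "poll":       # 2. act busy
            return "busy"
        if request == "write":      # 4. queue the request for later
            self.queued.append(arg)
            return "queued"
        return None                 # 5. drop anything else

    def drain(self, deliver):
        """After recovery, hand queued requests to the restarted driver."""
        while self.queued:
            deliver(self.queued.popleft())


shadow = SpoofingShadow({"volume": 7})
print(shadow.handle("get", "volume"))  # 7
print(shadow.handle("poll"))           # busy
print(shadow.handle("write", b"pcm"))  # queued
```

Which strategy is valid for a given request depends on the driver interface: each reply has to be one the real driver could legitimately have produced.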
50Design Summary
- Isolation
- Lightweight Kernel Protection Domains
- eXtension Procedure Call (XPC)
- Object Table
- Wrappers
- Recovery
- Shadow Drivers
51Nooks Architecture
52Outline
- Introduction
- Vision
- Design
- Evaluation
- Implementation
- Benefit
- Cost
- Summary
53Drivers Tested
54Details
- 26,000 lines of code
- Linux kernel has 2.4 million lines
- Makes no use of Intel x86 protection rings
- Extensions execute at same protection level as
kernel
55Implementation Complexity
- New code
- Isolation 23,000 lines
- Recovery 3,300 lines
57Outline
- Introduction
- Vision
- Design
- Evaluation
- Implementation
- Benefit
- Isolation
- Recovery
- Cost
- Summary
58RELIABILITY
- Tested eight extensions
- Two sound card drivers
- Four ethernet drivers
- A Win95 compatible file system
- An in-kernel Web server
- Injected 400 faults
59Test results
- Nooks eliminated 99% of the crashes observed with native Linux (313 out of 317)
- Nooks is slower
- VFAT benchmark spent 165 s in the kernel instead of 29.5 s for native Linux
- Web server could only serve about 6,000 pages/s instead of 15,000 pages/s for native Linux
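The figures on the test results slide can be turned into ratios with a few lines of arithmetic. The Python below uses only the numbers quoted above; the variable names are illustrative.

```python
native_crashes = 317
crashes_eliminated = 313
crash_reduction_pct = 100 * crashes_eliminated / native_crashes

vfat_kernel_seconds_nooks = 165
vfat_kernel_seconds_native = 29.5
vfat_slowdown = vfat_kernel_seconds_nooks / vfat_kernel_seconds_native

web_pages_nooks = 6000
web_pages_native = 15000
web_throughput_pct = 100 * web_pages_nooks / web_pages_native

print(f"crashes eliminated: {crash_reduction_pct:.1f}%")     # 98.7%
print(f"VFAT kernel time:   {vfat_slowdown:.1f}x native")    # 5.6x native
print(f"web throughput:     {web_throughput_pct:.0f}% of native")  # 40% of native
```

So the crash reduction is 98.7%, which rounds to 99%, and the cost for these two kernel-intensive workloads is roughly 5.6 times the kernel time for VFAT and 40% of native throughput for the Web server.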
60Isolation Works
61Isolation Works
62Isolation Works
63Recovery Works
64Relative Performance
65CPU Usage
66Future Work
- Application to virtual machines
- Apply shadow drivers to Xen
- Fixing drivers
- Reducing hardware-induced driver failures
- New driver architectures
- Testing drivers w/o devices
- Writing drivers in Python
67Conclusion
- Nooks is an architecture and a set of components and techniques for improving system reliability that is
- Highly effective at preventing crashes
- Compatible with existing code
- Low performance overhead
- Low implementation cost
68Questions?