Improving the Reliability of Commodity Operating Systems

Provided by: csF2
Learn more at: http://www.cs.fsu.edu

1
Improving the Reliability of Commodity Operating
Systems
2
Introduction
  • Nooks
  • Allows existing OS extensions to execute safely
    in commodity kernels
  • Use lightweight kernel protection domains
  • Restricted write access to kernel memory
  • Track and validate all modifications to kernel
    data structures

3
Motivation
  • Computer reliability is an unsolved problem
  • Cost of failures continues to rise
  • OS extensions have become prevalent
  • 70% of Linux kernel code
  • 35,000 drivers on Windows XP
  • Written by people who are less experienced in
    kernel organization

4
Motivation
  • Extensions are the leading cause of failures
  • In Windows XP, drivers cause 85% of failures
  • In Linux, device drivers have 7x the error rate of
    the rest of the kernel
  • Extended OS cannot be tested completely

5
Nooks Approach
  • Target existing extension architecture
  • Use conventional C instead of type-safe languages
  • Aim to reduce the number of crashes due to
    drivers and extensions
  • Prototype implemented in Linux
  • Showed graceful recovery for 99% of fault
    injections

6
Related Work
  • Hardware approaches
  • Capability-based architectures
  • Recovery difficult for shared resources
  • Segment architectures
  • Difficult to program
  • New OS structures
  • Microkernels
  • Good fault isolation
  • Rebooting required to restart services

7
Related Work
  • Transaction-based systems
  • Works well for file systems
  • Language-based approaches
  • Limited applicability

8
Architecture
  • Core principles
  • Design for fault resistance, not fault tolerance
  • Prevent and recover from most, not all
  • Design for mistakes, not abuse
  • Extensions are generally well-behaved (not
    malicious)
  • Can explore the design space between unprotected
    and safe

9
Architecture
  • Implications
  • Can define an architecture that supports
    existing drivers with moderate performance costs
  • Limitation: malicious code can bypass these mechanisms

10
Goals
  • Isolation of kernel from extension failures
  • Need to detect failures before they spread
  • Automatic recovery from failures
  • Backward compatibility

11
Functions
  • Reliability layer inserted between the extensions
    and the OS kernel
  • Intercepts all interactions between the
    extensions and the OS kernel
  • Major functions
  • Isolation
  • Interposition
  • Object tracking
  • Recovery

12
Isolation
  • Lightweight kernel protection domain
  • Write access to a limited portion of the kernel's
    address space
  • Major tasks
  • Creation, manipulation, and maintenance of
    lightweight kernel protection domains
  • Inter-domain control transfer

13
Isolation
  • Extension procedure call (XPC)
  • Similar to lightweight RPC
  • Assume trusted interactions
  • Asymmetric relationship
  • Kernel has more privileges

14
Interposition
  • The Nooks interposition mechanisms
  • Make sure that
  • All control flows between the kernel and
    extensions are through the XPC mechanism
  • All data flows between the kernel and extensions
    are managed by Nooks object-tracking code
  • Extensions and the kernel communicate through
    wrapper stubs

15
Object Tracking
  • Maintains a list of kernel data structures that
    are manipulated by an extension
  • Controls all modifications to those structures
  • Provides object info for cleanup when an
    extension fails

16
Object Tracking
  • An object must be copied into an extension before
    it is modified
  • Object tracking code verifies the type and
    accessibility of each parameter being passed

17
Recovery
  • Nooks detects software faults
  • When kernel services are invoked incorrectly
  • When an extension consumes too many resources
  • Actions
  • Return to the extension
  • Generate an error code

18
Recovery
  • Nooks detects hardware faults
  • Processor raises an exception during extension
    execution
  • Attempts to read unmapped memory
  • Write memory outside of its protection domain
  • A user or a program can also trigger Nooks recovery
    explicitly

19
Recovery
  • Since extensions are decoupled from kernel, Nooks
    can freely release extension-held kernel
    structures, such as objects or locks, during the
    recovery process

20
Architecture
21
Implementation
  • Linux 2.4.18
  • Worst-case target
  • 18 months of development
  • 22,000 lines of Nooks code (vs. 2.4 million lines
    of Linux code and 50 million lines of Windows
    2003 code)

22
Isolation
  • Two parts
  • Memory management
  • Extension procedure call

23
Memory Management
  • Kernel has read-write access to the entire
    address space
  • Each extension is restricted to read-only kernel
    access and read-write access to its local domain
  • Nooks maintains a copy of the kernel page table
    for each domain

24
Memory Management
  • Changing protection domains is not as costly as
    changing processes
  • Protection domains share kernel address space
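
The memory-management rules on these two slides can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the names `Memory`, `Domain`, and `ProtectionFault` are hypothetical, pages are plain integers, and nothing here reflects the actual Nooks page-table code. It shows the per-domain copy of the kernel page table, with kernel pages downgraded to read-only for the extension.

```python
from dataclasses import dataclass, field

KERNEL = "kernel"  # hypothetical name for the kernel's own domain


class ProtectionFault(Exception):
    """Raised when a domain touches a page it is not allowed to."""


@dataclass
class Domain:
    name: str
    # page number -> "ro" or "rw"; each domain holds its own table
    page_table: dict = field(default_factory=dict)


class Memory:
    def __init__(self, kernel_pages):
        self.data = {}
        self.kernel = Domain(KERNEL, {p: "rw" for p in kernel_pages})

    def create_domain(self, name, local_pages):
        # Copy the kernel page table, but map every kernel page read-only.
        table = {p: "ro" for p in self.kernel.page_table}
        for p in local_pages:
            table[p] = "rw"                   # the extension's local domain
            self.kernel.page_table[p] = "rw"  # the kernel can write everywhere
        return Domain(name, table)

    def write(self, domain, page, value):
        if domain.page_table.get(page) != "rw":
            raise ProtectionFault(f"{domain.name} cannot write page {page}")
        self.data[page] = value

    def read(self, domain, page):
        if page not in domain.page_table:
            raise ProtectionFault(f"{domain.name} cannot read page {page}")
        return self.data.get(page)
```

In this model a write by the extension to a kernel page raises `ProtectionFault`, while reads of kernel pages succeed because the address space is shared. The sketch does not keep the tables coherent when the kernel maps new pages later, which is one of the kernel changes the deck lists separately.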

25
Extension Procedure Call
  • Transparent to both the kernel and its extensions
  • Managed by two functions
  • nooks_driver_call(func_ptr, arg_list, domain)
  • nooks_kernel_call(func_ptr, arg_list, domain)
  • Deferred call mechanisms available
  • Useful for network drivers to queue up packets
    and perform bulk transfers
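
A rough user-space model of the two XPC entry points and the deferred-call idea is sketched below. Only the names `nooks_driver_call` and `nooks_kernel_call` and their argument order come from the slide; the domain stack, the `crossings` counter, and the `DeferredCalls` class are assumptions made for illustration, and a real XPC switches page tables and stacks rather than pushing a string.

```python
from contextlib import contextmanager

current_domain = ["kernel"]  # stack of active domains; the last entry is current
crossings = [0]              # hypothetical counter of domain switches


@contextmanager
def switch_to(domain):
    crossings[0] += 1
    current_domain.append(domain)
    try:
        yield
    finally:
        current_domain.pop()  # always return to the caller's domain


def nooks_driver_call(func_ptr, arg_list, domain):
    """Kernel -> extension: run func_ptr inside the extension's domain."""
    with switch_to(domain):
        return func_ptr(*arg_list)


def nooks_kernel_call(func_ptr, arg_list, domain):
    """Extension -> kernel: run func_ptr back in the kernel's domain."""
    if current_domain[-1] != domain:
        raise RuntimeError("caller is not executing in the stated domain")
    with switch_to("kernel"):
        return func_ptr(*arg_list)


class DeferredCalls:
    """Queue calls and run them all in one domain crossing."""

    def __init__(self):
        self.queue = []

    def defer(self, func_ptr, arg_list):
        self.queue.append((func_ptr, arg_list))

    def flush(self, domain):
        pending, self.queue = self.queue, []

        def run_all():
            return [f(*a) for f, a in pending]

        return nooks_driver_call(run_all, [], domain)
```

Batching is what makes the deferred variant attractive for network drivers: queued packets cost one crossing per flush instead of one per packet.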

26
Changes to Linux Kernel
  • Maintain coherency between the kernel and
    extension page tables
  • Detect exceptions that occur within Nooks
    protection domains
  • Locate the current task structure, which is no
    longer collocated on the kernel stack due to isolation

27
Interposition
  • Provides wrapper stubs between extensions and the
    kernel
  • Transparent to the kernel and drivers
  • Kernel modifications
  • The standard module loader binds extensions to
    wrappers instead of kernel functions
  • Kernel initialization code is changed so that Nooks
    interposes on calls into extensions

28
Interposition
  • Some data references are interposed
  • Certain objects are linked directly into the
    extension for reading
  • Kernel modification calls are wrapped
  • Performance-critical data structures
  • Shadow objects in the extension are synchronized
    before and after XPCs
  • Otherwise, just XPCs

29
Wrappers
  • Within the kernel's protection domain
  • Three basic tasks
  • Check parameters for validity
  • Create a copy of kernel objects in the
    extension's protection domain
  • No serialization/deserialization necessary
  • Synchronization code placed in wrappers
  • Perform an XPC into the kernel or extension
  • Automatically generated
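
The three wrapper tasks listed above can be sketched as a small wrapper factory. This is an illustrative model, not generated Nooks code: `NetDevice`, `make_wrapper`, and the `writable_fields` parameter are hypothetical, and the XPC is reduced to a plain function call.

```python
import copy
from dataclasses import dataclass


class InvalidParameter(Exception):
    """Raised when a wrapper rejects an argument before the call."""


@dataclass
class NetDevice:  # hypothetical stand-in for a kernel object
    name: str
    mtu: int


def make_wrapper(extension_func, expected_type, writable_fields):
    def wrapper(kernel_obj):
        # Task 1: check the parameter for validity.
        if not isinstance(kernel_obj, expected_type):
            raise InvalidParameter(f"expected {expected_type.__name__}")
        # Task 2: copy the kernel object for the extension (a direct copy,
        # with no serialization step).
        shadow = copy.copy(kernel_obj)
        # Task 3: perform the call into the extension (an XPC in Nooks).
        result = extension_func(shadow)
        # Synchronization: only fields the extension may modify flow back.
        for field_name in writable_fields:
            setattr(kernel_obj, field_name, getattr(shadow, field_name))
        return result

    return wrapper
```

Because the extension only ever sees the shadow copy, a stray write to a field outside `writable_fields` never reaches the kernel's version of the object.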

30
Wrapper Code Sharing
  • 50% of the Nooks code base
  • Shared among multiple drivers

31
Object Tracking
  • Supports 43 kernel object types
  • Records the addresses of all objects in use by an
    extension
  • Records the association between the kernel and
    the extension versions of writable objects
  • Performs garbage collection
  • Determines whether to copy an object
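
A minimal sketch of the bookkeeping described above follows, assuming kernel objects can be modeled as Python dicts. The class and method names (`ObjectTracker`, `to_extension`, `to_kernel`, `release`) are hypothetical; the real tracker handles 43 typed kernel objects and is considerably more involved.

```python
class ObjectTracker:
    def __init__(self):
        self.in_use = {}   # id(kernel object) -> kernel object
        self.shadows = {}  # id(kernel object) -> extension's writable copy

    def to_extension(self, kernel_obj, writable):
        """Record the object and decide whether the extension gets a copy."""
        key = id(kernel_obj)
        self.in_use[key] = kernel_obj
        if not writable:
            return kernel_obj  # read-only use: no copy needed
        if key not in self.shadows:
            self.shadows[key] = dict(kernel_obj)  # first write access: copy
        return self.shadows[key]

    def to_kernel(self, kernel_obj):
        """Propagate the extension's changes back to the kernel version."""
        shadow = self.shadows.get(id(kernel_obj))
        if shadow is not None:
            kernel_obj.update(shadow)

    def release(self, kernel_obj):
        """Forget an object the extension no longer uses."""
        key = id(kernel_obj)
        self.in_use.pop(key, None)
        self.shadows.pop(key, None)

    def live_objects(self):
        return list(self.in_use.values())
```

The `live_objects` list is what a recovery step would walk after a failure, since it names everything the extension still held.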

32
Recovery
  • Recovery manager releases resources
  • Unloading the extension
  • Releasing its kernel and physical resources
  • Reloading and restarting the extension
  • User-mode agent coordinates recovery
  • Each object is associated with a recovery function
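
The recovery sequence on this slide can be modeled as below. This is a sketch under assumptions: `RecoveryManager`, `call_extension`, the event strings, and the `-1` error code are all hypothetical, and any Python exception stands in for a detected extension fault.

```python
class RecoveryManager:
    def __init__(self):
        self.tracked = []  # (object, recovery function) pairs
        self.events = []   # record of recovery steps, for inspection

    def track(self, obj, recovery_func):
        """Associate a recovery function with an object the extension holds."""
        self.tracked.append((obj, recovery_func))

    def recover(self, extension_name):
        self.events.append(f"unload {extension_name}")
        # Release resources newest-first using each object's recovery function.
        while self.tracked:
            obj, recovery_func = self.tracked.pop()
            recovery_func(obj)
        self.events.append(f"reload {extension_name}")
        self.events.append(f"restart {extension_name}")


def call_extension(manager, extension_name, func, *args):
    """Run an extension entry point; on a fault, recover and return an error."""
    try:
        return func(*args)
    except Exception:
        manager.recover(extension_name)
        return -1  # hypothetical error code reported to the caller
```

Pairing each object with its own recovery function lets the manager release a lock differently from a buffer without knowing either type in advance.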

33
Implementation Limitations
  • Nooks does not handle all possible errors
  • Deliberate corruptions of system states
  • Infinite loops
  • However, a moderate reduction of system crashes
    is a significant contribution

34
Achieving Transparency
  • Wrapper stubs for every call in the
    extension-kernel interface
  • Object-tracking code for every object type that
    is passed between the extension and the kernel
  • Nooks transparent to both the extension and the
    kernel

35
Reliability
  • Nooks can detect and recover from 99% of the
    extension faults that would otherwise crash the system

36
Test Methodology
  • Synthetic fault injection
  • Automatically changes single instructions in the
    extension code to emulate common errors
  • Uninitialized variables
  • Bad parameters
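
To make the single-instruction mutation idea concrete, here is a toy model. It does not reproduce the injector used in the experiments: the three-opcode register machine, `inject_fault`, and `classify` are all invented for illustration. Turning a `set` into a `nop` emulates an uninitialized variable, and corrupting an operand emulates a bad parameter.

```python
def run(program):
    """Execute a list of (op, register, value) instructions."""
    regs = {}
    for op, reg, value in program:
        if op == "set":
            regs[reg] = value
        elif op == "add":
            regs[reg] += value  # raises KeyError if reg was never set
        # "nop" does nothing
    return regs


def inject_fault(program, index):
    """Return a copy of program with exactly one instruction changed."""
    mutated = list(program)
    op, reg, value = mutated[index]
    if op == "set":
        mutated[index] = ("nop", reg, value)        # uninitialized variable
    else:
        mutated[index] = (op, reg, value ^ 0xFFFF)  # bad parameter
    return mutated


def classify(program, mutated):
    """Compare the mutated run against the original run."""
    expected = run(program)
    try:
        actual = run(mutated)
    except Exception:
        return "crash"
    return "no effect" if actual == expected else "non-fatal failure"
```

Even in this toy, mutating a `nop` changes nothing observable, so some injected faults produce no faulty behavior at all.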

37
Types of Extensions Isolated
  • Device drivers (network, sound cards)
  • Optional kernel subsystems (VFAT)
  • Application-specific kernel extension (kHTTPd)

38
Test Environment
  • VMware
  • Allows automation of crash testing without
    reboots
  • 5 extensions
  • 400 tests each

39
Test Results
  • Not all fault-injection trials cause faulty
    behavior

40
System Crashes
  • A system crash is easiest to detect
  • OS panics
  • Hangs
  • Reboots
  • Linux experienced 317 crashes
  • Nooks eliminated 313 crashes, or 99%
  • 4 deadlocks
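
The percentage follows directly from the counts on this slide and the trial counts on the Test Environment slide. A quick check of the arithmetic, using only those numbers (the variable names are illustrative):

```python
extensions = 5
tests_per_extension = 400
total_trials = extensions * tests_per_extension  # fault-injection trials

linux_crashes = 317   # crashes observed without Nooks
eliminated = 313      # crashes Nooks eliminated
remaining = linux_crashes - eliminated  # the deadlocks that were left

elimination_rate = eliminated / linux_crashes
print(f"{total_trials} trials, {remaining} crashes remain, "
      f"{elimination_rate:.1%} eliminated")
```

The exact ratio is just under 99%, which the slide rounds to 99%.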

41
System Crashes
  • Sound blaster and VFAT extensions are
    process-oriented
  • Fewer crashes
  • kHTTPd, pcnet32, and e1000 are interrupt-based
  • More crashes

42
Non-Fatal Extension Failures
  • Nooks cannot detect erroneous extension behaviors
  • Network could disappear
  • Mounted file system hangs

43
Recovery Errors
  • A faulting extension is unloaded, reloaded, and
    restarted
  • Works well with kHTTPd
  • Not as well with VFAT
  • Corruptions can propagate to disk if not detected
    in time

44
Summary of Reliability Experiments
  • Nooks eliminated 99% of the system crashes in
    extensions
  • Nooks eliminated nearly 60% of non-fatal
    extension failures

45
Performance
  • Dell 1.7 GHz Pentium 4
  • 890 MB of RAM
  • SoundBlaster 16
  • Intel Pro/1000 Gb Ethernet Adapter
  • 7200 RPM, 41 GB IDE HD
  • Linux 2.4.18

46
Sound Benchmark
  • Plays an MP3 file at 128 Kb/sec
  • 150 XPCs/sec
  • Nooks imposes little overhead

47
Network Benchmark
  • netperf performance tool
  • A node sends/receives a stream of 32 KB TCP
    messages via a 256KB buffer
  • 10% overhead

48
Compile Benchmark
  • Linux kernel compilation on VFAT
  • 25% slowdown

49
Web Server Benchmarks
  • httperf
  • Repeatedly request a 1-KB file and measure the
    maximum request rate
  • 60% slowdown
  • CPU bound
  • SPECweb99
  • 3% slowdown

50
Summary
  • If the computation is not CPU bound, the penalty
    may not be important

51
Conclusions
  • Nooks is achievable with modest engineering
    effort
  • Extensions such as device drivers can be isolated
    without changes to extension code
  • Isolation and recovery can dramatically improve
    the system's ability to survive extension faults