Many-Core Operating Systems - PowerPoint PPT Presentation

1 / 17
About This Presentation
Title:

Many-Core Operating Systems

Description:

Overhead is automatically amortized. I talked about this stuff ... It would be nice to amortize it... The more resources you get, the longer you may keep them ... – PowerPoint PPT presentation

Number of Views:35
Avg rating:3.0/5.0
Slides: 18
Provided by: burt161
Category:

less

Transcript and Presenter's Notes

Title: Many-Core Operating Systems


1
Many-Core Operating Systems
  • Burton SmithTechnical FellowAdvanced Strategies
    and Policy

2
The von Neumann Premise
  • Simply put there is exactly one program
    counter
  • It has led to some artifacts
  • Synchronous coprocessor coroutining (e.g. 8087)
  • Interrupts for asynchronous concurrency
  • Demand paging
  • to make memory allocation incremental
  • to let virtual gt physical
  • And some serious problems
  • The memory wall (insufficient memory concurrency)
  • The ILP wall (diminished improvement in ILP)
  • The power wall (the cost of run-time ILP
    exploitation)
  • Given multiple program counters, what should we
    change?
  • Scheduling?
  • Synchronization?

3
Computing is at a Crossroads
  • Continual performance improvement is our
    lifeblood
  • It encourages people to buy new hardware
  • It opens up new software possibilities
  • Single-thread performance is nearing the end of
    the line
  • But Moores Law will continue for some time to
    come
  • What can we do with all those transistors?
  • Computation needs to become as parallel as
    possible
  • Henceforth, serial means slow
  • Systems must support general purpose parallel
    computing
  • The alternative to all this is commoditization
  • New many-core chips will need new system software
  • And vice versa!
  • This talk is about interplay between OS and
    hardware

4
Many-Core OS Challenges
  • Architecture of the parallel virtual machine
  • Processor management
  • Multiple processors
  • A mix of in-order and out-of-order CPUs
  • GPUs and other performance accelerators
  • I/O processors and devices
  • Memory management
  • Performance problems due to paging
  • TLB pressure from larger working sets
  • Bandwidth resources
  • Quality of service (time management)
  • For media applications, games, real-time apps,
    etc.
  • For deadlines

5
The Parallel Virtual Machine
  • What should the interface that the OS presents to
    parallel application software look like?
  • Stable, negotiated resource allocation
  • Isolation among protection domains
  • Freedom from bottlenecks in OS services
  • The key objective is fine-grain application
    parallelism
  • We need the whole tree, not just the low-hanging
    fruit

6
Fine-grain Parallelism
  • Exploitable parallelism grows as task granularity
    shrinks
  • But dependences among tasks become more numerous
  • Inter-task dependence enforcement demands
    scheduling
  • A task needing a value from elsewhere must wait
    for it
  • User-level work scheduling is called for
  • No privilege change is needed to stop or restart
    a task
  • Locality (e.g. cache content) can be better
    preserved
  • Todays OS and hardware dont encourage waiting
  • OS thread scheduling makes blocking dangerous
  • Instruction sets encourage non-blocking
    approaches
  • Busy-waiting wastes instruction issue
    opportunities
  • Impact
  • Better instruction set support for blocking
    synchronization
  • Changes to OS processor and memory resource
    management

7
Multithreading and Synchronization
  • Fine-grain multithreading can use TLP to tolerate
    latency
  • Memory latency
  • Other operation latency, e.g. branch latency
  • Synchronization latency
  • In the latter case, some architectural support is
    helpful
  • To stop issuing from a context while it is
    waiting
  • To resume issuing when the wait is over
  • To free up the context if and when a wait becomes
    long
  • The benefits
  • Waiting does not consume issue slots
  • Overhead is automatically amortized
  • I talked about this stuff in my 1996 FCRC keynote

8
Resource Scheduling Consequences
  • Since the user runtime is scheduling work on
    processors, the OS should not attempt to do the
    same
  • An asynchronous OS API is a necessary corollary
  • Scheduling memory via demand paging is also
    problematic
  • Instead, the two schedulers should negotiate
  • The application tells the OS its resource
    needs/desires
  • The OS makes decisions based on the big picture
  • Availability of resources
  • Appropriateness of power consumption level
  • Requirements for quality of service
  • The OS can preempt resources to reclaim them
  • But with notification, so the application can
    rearrange things
  • Resources should be time- and space-shared in
    chunks
  • Scheduling turns into a bin-packing problem

9
Bin Packing
  • The more resources allocated, the more swapping
    overhead
  • It would be nice to amortize it
  • The more resources you get, the longer you may
    keep them
  • Roughly, this means scheduling packing squarish
    blocks
  • QOS applications might need long rectangles
    instead
  • When the blocks dont fit, the OS can morph them
    a little
  • Or cut corners when absolutely necessary

Quantity of resource
Time
10
What About Priority Scheduling?
  • Priorities are appropriate for some kinds of
    scheduling
  • Especially when some things to be scheduled are
    optional
  • If it all has to be done, how do the priorities
    get set?
  • The answer is usually ad-hoc, and often!
  • Fairness is seldom maintained in the process
  • Quality of service needs a different approach
  • How much work must be done before the next
    deadline?
  • Even highly interactive tasks can benefit
  • Deadlines are harder to implement than priorities
  • Then again, so is bin packing compared to fixed
    quanta
  • Fairness can also be based on quality-of-service
    concepts
  • Relative work rates rather than absolute
  • In the next 16 milliseconds, give level i
    activities r times as many processor-seconds as
    level i-1 activities

11
Heterogeneous Processors
  • There are two kinds of heterogeny
  • In architecture, i.e. different instruction sets
  • In implementation, i.e. different performance
    characteristics
  • Both are likely to be important
  • A single application might ask for a
    heterogeneous mix
  • Failure in the HA case might need multiple
    versions or JIT
  • In the HI case, scheduling might be based on
    instrumentation
  • A key question is whether a processor is
    time-sharable
  • If not, the OS has to dedicate it to one
    application at a time
  • With user-level scheduling and some support for
    preemption, application state save and restore
    can be done at user level

12
Virtual Memory Design Alternatives
  • Swapping instead of demand paging
  • Address-space names/identifiers
  • TLB shootdown becomes a rarer event
  • Hardware TLB coherence
  • Two-dimensional addressing (segmentation w/o
    registers)
  • To assist with variable granularity memory
    allocation
  • To help mitigate upward pressure on TLB size
  • To leverage persistent memory via segment sharing
  • A variation of mmap()might suffice for this
    purpose
  • To accommodate variations in memory bank
    architecture
  • Local versus global, for example

13
Physical Memory Bank Architecture
  • Consider this example
  • An application is using 31 cores, about half of
    them
  • 50 of its cache misses are stack references
  • The stacks are all allocated in a compact virtual
    region
  • How many of the 128 memory banks are available?
  • Interleaving addresses across the banks is a
    solution
  • Page granularity is the standard choice
  • If memory access is non-uniform, this is not the
    best idea
  • Stacks should be allocated near their processors
  • So should compiler-allocated temporary arrays on
    the heap
  • Is it one bank architecture scheme fits all, or
    not?
  • If not, how do we manage the virtual address
    space?

14
Hot Spots
  • When processors share memory, they can interfere
  • Not only data races, but also bandwidth
    oversubscription
  • Within an application, this creates performance
    problems
  • Hardware help is needed to discover where these
    are
  • Beween applications, interference is even more
    serious
  • Performance unpredictability
  • Denial of service
  • Covert-channel signaling
  • Bandwidth is a resource like any other
  • We need to be able to partition and isolate it

15
I/O Architecture
  • Direct memory access is usually a good way to do
    I/O
  • Todays DMA mostly demands wired down pages
  • This leads to lots of data copying and other OS
    warts
  • But I/O devices are getting smarter all the time
  • Transistors are cheaper than almost anything else
  • Why not treat I/O devices like heterogeneous
    processors?
  • Teach them to do virtual address translation
  • Allocate them to real-time or sensor-intensive
    applications
  • Allocate them to a not-very-trusted driver
    application
  • Address space sharing can be partial, as it is
    now
  • There is a problem, though inter-domain
    signaling (IPC)
  • This is what interrupts do
  • I have some issues with interrupts

16
Interrupts
  • Interrupts are OK when there is only one
    processor
  • Some people avoid them to make systems more
    predictable
  • If there are many processors, which one do you
    interrupt?
  • The usual solution just pick one and leave it
    to software
  • A better idea is to signal via an address space
    you already share (perhaps only in part) with the
    intended recipient
  • The DMA at address ltagt is (ready)(done)
  • This is kinda like doing programmed I/O via
    device CSRs
  • Its also the way the CDC 6600 and 7600 did
    things
  • You may not want to have the signal recipient
    busy-wait

17
Conclusions
  • It is time to rethink some of the basics of
    computing
  • There is lots of work for everyone to do
  • e.g. Ive left out compilers, debuggers, and
    applications
  • We need basic research as well as industrial
    development
  • Research in computer systems is deprecated these
    days
  • In the USA, NSF and DOD need to take the
    initiative
Write a Comment
User Comments (0)
About PowerShow.com