
1
Adaptive Single-Chip Multiprocessing
  • Dan Gibson
  • degibson@wisc.edu
  • University of Wisconsin-Madison
  • Department of Electrical and Computer Engineering

2
Introduction
  • Moore's Law continues to provide more transistors
  • Devices are getting smaller
  • Devices are getting faster
  • Leads to increases in clock frequency
  • Memories are getting bigger
  • Large memories often require more time to access
  • RC Circuits continue to charge exponentially
  • Long-wire signal propagation time is not
    improving as rapidly as switching speed
  • On-chip communication time is slower relative to
    processor clock speeds

3
The Memory Wall
  • Processors grow faster, memory grows slower
  • Off-chip cache misses can halt even aggressive
    out-of-order processors
  • On-chip cache accesses are becoming long-latency
    events
  • Latency can sometimes be tolerated
  • Caching
  • Prefetching
  • Speculation
  • Out-of-order execution
  • Multithreading

4
The Power Wall
  • More devices, faster clocks ⇒ more power
  • Power supply accounts for lots of pins in chip
    packaging (3,057 of 5,370 pins on the
    POWER5)
  • Heat dissipation increases total cost of
    ownership (34W cooling power
    required to remove 100W of heat)
  • Dynamic power in CMOS: P ≈ α · C_L · V² · f
  • Devices get smaller, faster, and more numerous
  • More capacitance (C_L)
  • Higher frequency (f)
  • Architects can constrain α, C_L, and f
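The dynamic-power relationship above can be sketched numerically. The parameter values below are illustrative assumptions, not figures from the presentation:

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Dynamic (switching) power in watts: P = alpha * C_L * Vdd^2 * f.

    alpha:  activity factor (fraction of capacitance switched per cycle)
    c_load: total switched load capacitance, in farads
    vdd:    supply voltage, in volts
    freq:   clock frequency, in Hz
    """
    return alpha * c_load * vdd ** 2 * freq

# Illustrative (made-up) values: halving frequency halves dynamic power,
# and lowering Vdd helps quadratically.
p_fast = dynamic_power(alpha=0.2, c_load=1e-9, vdd=1.2, freq=3.0e9)
p_slow = dynamic_power(alpha=0.2, c_load=1e-9, vdd=1.0, freq=1.5e9)
```

This is why the slide notes that architects can attack power through α, C_L, and f: voltage scaling typically belongs to circuit designers, but activity, capacitance, and frequency are architectural levers.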

5
Enter Chip Multiprocessors (CMPs)
  • One chip, many processors
  • Multiple cores per chip
  • Often multiple threads per core

Dual-core AMD Opteron die photo (from Microprocessor Report, "Best Servers of 2004")
6
CMPs
  • CMPs can have good performance
  • Explicit thread-level parallelism
  • Related threads experience constructive
    prefetching
  • CMPs can tolerate long-latency events well
  • Many concurrent threads ⇒ long-latency memory
    accesses can be overlapped
  • CMPs can be power-efficient
  • Enables use of simpler cores
  • Distributes hot spots

7
CMPs
  • CMPs are very specialized
  • Assumes (highly) threaded workload
  • Parallel machines are difficult to use
  • Parallel programming is not (yet) commonplace
  • Many problems similar to traditional
    multiprocessors
  • Cache coherence
  • Memory consistency
  • Many new opportunities
  • Cache sharing
  • More integration

8
Adaptive CMPs
  • To combat specialization, adapt a CMP dynamically
    to its current workload and system
  • Adapt caching policy (Beckmann et al., Chang
    et al., and more)
  • Adapt cache structure (Alameldeen et al., and
    more)
  • Adapt thread scheduling (Kihm et al., in the
    SMT space)
  • Current idea
  • Adaptive thread scheduling from the space of
    un-stalled and stalled threads
  • A union of single-core multithreading and
    runahead execution in the context of CMPs

9
Single-Core Multithreading
  • Allow multiple (HW) threads within the same
    execution pipeline
  • Shares processor resources: FUs, decode, ROB,
    etc.
  • Shares local memory resources: L1 caches, LSQ,
    etc.
  • Can increase processor and memory utilization

Sun's Niagara pipeline block diagram (Kongetira
et al.)
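The fine-grained interleaving described above can be sketched as a round-robin issue loop. The thread representation here is a hypothetical simplification for illustration, not Niagara's actual thread-select logic:

```python
from collections import deque

def interleave(threads):
    """Fine-grained multithreading sketch: each cycle, issue one
    instruction from the next hardware thread that still has work,
    rotating round-robin among threads (assumed simplified model).

    Each thread is a dict: {"id": ..., "instrs": [...]}.
    Returns the issue schedule as (thread_id, instruction) pairs.
    """
    ready = deque(threads)
    schedule = []
    while ready:
        t = ready.popleft()
        if t["instrs"]:
            schedule.append((t["id"], t["instrs"].pop(0)))
            ready.append(t)  # thread goes to the back of the rotation
    return schedule
```

With two threads, one holding two instructions and one holding one, the issue order alternates A, B, A: long-latency stalls in one thread leave slots the other threads can fill, which is the utilization argument made above.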
10
Runahead Execution
  • Continue execution in the face of a cache miss
  • Checkpoint architectural state
  • Continue execution speculatively
  • Convert memory accesses to prefetches
  • Runahead prefetches can be highly accurate, and
    can greatly improve cache performance (Mutlu
    et al.)
  • It is possible to issue useless prefetches
  • Can be power-inefficient (Mutlu et al.)
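The checkpoint/speculate/prefetch/restore sequence above can be sketched as a toy state machine. The class, its fields, and the simplified memory model are assumptions for illustration, not the mechanism from the cited work:

```python
import copy

class RunaheadCore:
    """Minimal sketch of runahead execution (assumed, simplified model).

    On a long-latency cache miss the core checkpoints architectural
    state, keeps executing speculatively so that later memory accesses
    become prefetches, then restores the checkpoint when the miss
    resolves and discards all speculative results.
    """
    def __init__(self):
        self.regs = {}           # architectural register state
        self.checkpoint = None   # saved state for rollback
        self.runahead = False
        self.prefetches = []     # addresses touched while in runahead

    def on_cache_miss(self):
        self.checkpoint = copy.deepcopy(self.regs)  # checkpoint state
        self.runahead = True                        # enter runahead mode

    def load(self, addr):
        if self.runahead:
            self.prefetches.append(addr)  # convert access to prefetch
            return 0                      # bogus value; result is discarded
        return 0  # placeholder for a real cache access

    def on_miss_complete(self):
        self.regs = self.checkpoint      # discard speculative state
        self.checkpoint = None
        self.runahead = False
```

The useless-prefetch and power concerns noted above show up directly in this sketch: every instruction executed between `on_cache_miss` and `on_miss_complete` burns energy, and only the addresses left in `prefetches` can pay it back.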

11
Runahead/Multithreaded Core Interaction
  • Similar Hardware Requirements
  • Additional register files
  • Additional LSQ entries
  • Competition for Similar Resources
  • Execution time (Processor pipeline, Functional
    units, etc)
  • Memory bandwidth
  • TLB Entries, cache space, etc.

12
Runahead/Multithreaded Core Interaction
  • A multithreaded core in a CMP, with runahead,
    must make difficult scheduling decisions
  • Thread scheduling considerations
  • Which thread should run?
  • Should the thread use runahead?
  • How long should the thread run/runahead?
  • Scheduling implications
  • Is an idle thread making forward progress at the
    expense of a useful thread?
  • Is a thread spinning on a lock held by another
    thread?
  • Is runahead effective for a given thread?
  • Is a given thread causing performance problems
    elsewhere in the CMP?

13
Proposed Mechanism
  • Track per-thread state on
  • Runahead prefetching accuracy
  • High accuracy favors allowing thread to runahead
  • HW-assigned thread priority
  • Highly useful threads are preferred
  • Selection criteria
  • Heuristic-guided
  • Select the best priority/accuracy pair
  • Probabilistically-guided
  • Select a thread with likelihood proportional to
    its priority/accuracy
  • Useful-first
  • Select non-runahead threads first, then select
    runahead threads
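The three selection policies above can be sketched as follows. The per-thread record fields and the priority-times-accuracy score are assumptions made for illustration; the presentation does not specify the exact scoring:

```python
import random

# Each hardware thread is tracked as a dict with hypothetical fields:
#   priority:    HW-assigned usefulness (higher is better)
#   accuracy:    measured runahead prefetching accuracy in [0, 1]
#   is_runahead: whether the thread is currently in runahead mode

def select_heuristic(threads):
    # Heuristic-guided: pick the best priority/accuracy pair
    # (scored here, as an assumption, by their product).
    return max(threads, key=lambda t: t["priority"] * t["accuracy"])

def select_probabilistic(threads, rng=random):
    # Probabilistically-guided: likelihood proportional to the
    # thread's priority/accuracy score.
    weights = [t["priority"] * t["accuracy"] for t in threads]
    return rng.choices(threads, weights=weights, k=1)[0]

def select_useful_first(threads):
    # Useful-first: non-runahead threads first; fall back to
    # runahead threads only when no normal thread is available.
    normal = [t for t in threads if not t["is_runahead"]]
    pool = normal if normal else threads
    return max(pool, key=lambda t: t["priority"])
```

The policies trade off differently: the heuristic is greedy and can starve low-scoring threads, the probabilistic policy gives every thread some chance, and useful-first guarantees that speculation never displaces real work.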

14
Future Directions
  • Dynamically Adaptable CMPs offer several future
    areas of research
  • Adapt for power savings / heat dissipation
  • Computation relocation, load balancing, automatic
    low-power modes, etc.
  • Adapt to error conditions
  • Dynamically allocate backup threads
  • Automatically relocate threads to improve
    resource sharing
  • Combined HW/SW/VM approach

15
Summary
  • Latency now dominates off-chip communication
  • On-chip communication isn't far behind
  • Many techniques to tolerate latency, including
    multithreading
  • CMPs provide new challenges and opportunities to
    computer architects
  • Latency tolerance
  • Potential for power savings
  • Can adapt a CMP's behavior to its workload
  • Dynamic management of shared resources