Title: Temperature-Sensitive Loop Parallelization for Chip Multiprocessors
Slide 1: Temperature-Sensitive Loop Parallelization for Chip Multiprocessors
International Conference on Computer Design, 10/2-5, 2005, San Jose
- Sri HK Narayanan, Guilin Chen, Mahmut Kandemir, Yuan Xie
- Embedded Mobile Computing Center (EMC2)
- The Pennsylvania State University
Slide 2: Outline
- Motivation
- Related Work
- Our Approach
- Example
- Experimental Results
- Conclusion
Slide 3: Motivation
- Thermal hotspots are a cause for concern
  - Caused by increasing power density
  - Can result in permanent chip damage
- How to avoid damage
  - Cooling techniques
- How to prevent hotspots
  - Hardware techniques
- This paper proposes a compiler directed technique to avoid hotspots in CMPs
Slide 4: Related work: Dynamic Thermal Management
- When one unit overheats, migrate its functionality to a distant, spare unit
  - Dual pipeline (Intel, ISQED 02)
  - Spare register file (Skadron et al. 2003)
  - Separate core (CMP) (Heo et al. ISLPED 2003)
  - Microarchitectural clusters (Intel, ICCD 2004)
- Raises many interesting issues
  - Cost-benefit tradeoff for extra area
  - Use both resources (scheduling)
- Run-time thermal sensing/estimation
  - Yesterday, a UC Riverside paper at Session 2.2 proposed a run-time thermal tracking method
Slide 5: Related work: Design-time techniques
- MDL at PSU
  - Thermal-Aware IP Virtualization and Placement for Networks-on-Chip Architecture, ICCD 2004
  - Thermal-Aware Allocation and Scheduling for MPSoC Design, DATE 2005
  - Thermal-Aware Floorplanning Using Genetic Algorithms, ISQED 2005
  - Thermal-Aware Voltage-Island Architecting, the other paper in this session
- Other groups
  - Thermal-Aware High-Level Synthesis (Northwestern Univ.: Memik, R. Dick; ISLPED 2005, ASP-DAC 2006)
  - Many more in this conference
- Industry
  - Gradient Design Automation (a start-up that showcased at DAC 2005)
Slide 6: CMP
Intel researchers and scientists are experimenting with "many tens of cores, potentially even hundreds of cores per die, per single processor die ..."
Justin R. Rattner, Intel director of the Corporate Technology Group, Spring 2005 IDF
Industry examples
Last night: panel discussion on CMP
Slide 7: This paper: compiler approach
- Temperature and performance sensitive loop scheduling
  - Schedules different loop iterations on the CMP
  - Data locality aware and hence performance aware
- Intuition behind the approach
  - Let hot cores idle while cool cores work.
- Static scheduling of parallelized loop iterations at compile time
Slide 8: How can the compiler schedule temperature aware code?
- This work targets loop intensive programs run on embedded CMPs
- Loop nests are divided into chunks.
- The number of cycles in a chunk is ?.
- Let the starting temperature of a processor be Tc
- The temperature after executing the chunk is
  - Tc F(Tc , ? , floorplan, power? )
- ?, power? are obtained by profiling the code.
- Floorplan and physical parameters remain constant.
Slide 9: Thermal modeling
- Want a good model of chip temperature
  - That accounts for adjacency and package
  - That does not require detailed designs
  - That is fast enough for practical use
- A compact model based on thermal R, C (HotSpot)
- Parameterized to automatically derive a model based on various
  - Architectures
  - Power models
  - Floorplans
  - Thermal packages
Slide 10: Temperature Estimation
- The temperature of each block depends on the power consumption and the location of blocks.
- The thermal resistance Rij of PEi with respect to PEj is the temperature rise at PEi caused by one unit of power dissipated at PEj.
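Slide 11: Running Example: Basic Schedule
Jacobi's Algorithm
  for (i = 1; i <= 600; i++)
    for (j = 1; j <= 1000; j++)
      B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;
Parallel Schedule
Parallelized Algorithm for 5 cores
  /* core k (k = 0..4) runs iterations k*120+1 through (k+1)*120 */
  for (i = k*120 + 1; i <= (k+1)*120; i++)
    for (j = 1; j <= 1000; j++)
      B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;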
Slide 12: Analysis of Basic Schedule
- Assumptions in the example
  - Initial temperature is 0
  - Threshold temperature is 2
  - An idle slot reduces the temperature by 1 degree (but not below 0)
  - So at most 2 active slots can be scheduled together on one core
  - The ideal number of active processors at any time is 5.
  - Because of the structure of Jacobi's algorithm, consecutive iteration chunks exhibit locality
- Analysis
  - Great locality
  - Uses only 5 processors
  - Will definitely overheat
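The toy thermal rules of this example fit in a few lines of C. The sketch below assumes that an active slot raises a core's temperature by 1 degree. The slide does not state this, but it is what the "at most 2 active slots" bullet implies. The function names are hypothetical.

```c
#include <stdbool.h>

#define THRESHOLD 2  /* threshold temperature used in the example */

/*
 * Temperature of a core after one time slot in the toy model.
 * Assumption: an active slot adds 1 degree. An idle slot removes
 * 1 degree but never takes the temperature below 0.
 */
int next_temperature(int temp, bool active)
{
    if (active)
        return temp + 1;
    return temp > 0 ? temp - 1 : 0;
}

/* A core may accept a chunk only while it is below the threshold. */
bool can_schedule(int temp)
{
    return temp < THRESHOLD;
}

/*
 * Count how many back-to-back active slots a core can run, starting
 * from start_temp, before it reaches the threshold.
 */
int max_consecutive_active(int start_temp)
{
    int slots = 0;
    int temp = start_temp;

    while (can_schedule(temp)) {
        temp = next_temperature(temp, true);
        slots++;
    }
    return slots;
}
```

Starting from the initial temperature of 0, `max_consecutive_active(0)` yields 2, matching the limit stated in the assumptions. The basic schedule keeps the same 5 cores active in every slot, so under these rules each of them reaches the threshold after two slots.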
Slide 13: Pure Temperature Aware Scheduling
- Algorithm
  - Start with the time slot at 0 and all iterations unscheduled
  - While unscheduled iterations exist
    - Select the coolest A processors whose temperature is less than the threshold.
    - Schedule the chunks on those processors at the current time slot.
    - Reduce the number of chunks to be scheduled.
    - Increase the time slot by 1.
- Analysis
  - Poor locality
  - 1 extra time slot is used.
  - No temperature problems
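The algorithm steps can be sketched in C as follows. This is an illustration under stated assumptions, not the authors' implementation. It reuses the toy thermal rules of the running example (an active slot adds 1 degree, which is inferred; an idle slot removes 1 degree down to 0). It reads A as the number of chunks to run per slot and breaks temperature ties in favor of the lowest core index. The core count of 8, the slot bound and all names are hypothetical.

```c
#include <stdbool.h>

#define NUM_CORES 8   /* hypothetical core count */
#define THRESHOLD 2   /* threshold temperature of the toy example */
#define MAX_SLOTS 64  /* hypothetical upper bound on schedule length */

/*
 * Pure temperature-aware scheduling under the toy thermal rules.
 * schedule[t][p] receives the chunk run on core p in slot t, or -1 if
 * the core idles. Returns the number of time slots used, or -1 if the
 * schedule does not fit in MAX_SLOTS.
 */
int temperature_aware_schedule(int num_chunks, int active_per_slot,
                               int schedule[MAX_SLOTS][NUM_CORES])
{
    int temp[NUM_CORES] = {0};
    int next_chunk = 0;
    int slot = 0;

    while (next_chunk < num_chunks) {
        bool busy[NUM_CORES] = {false};

        if (slot >= MAX_SLOTS)
            return -1;
        for (int p = 0; p < NUM_CORES; p++)
            schedule[slot][p] = -1;

        /* Pick up to active_per_slot cores below the threshold, coolest first. */
        for (int a = 0; a < active_per_slot && next_chunk < num_chunks; a++) {
            int best = -1;
            for (int p = 0; p < NUM_CORES; p++) {
                if (busy[p] || temp[p] >= THRESHOLD)
                    continue;
                if (best < 0 || temp[p] < temp[best])
                    best = p;
            }
            if (best < 0)
                break;  /* every remaining core is too hot */
            busy[best] = true;
            schedule[slot][best] = next_chunk++;
        }

        /* Active cores heat up by 1; idle cores cool by 1, not below 0. */
        for (int p = 0; p < NUM_CORES; p++) {
            if (busy[p])
                temp[p]++;
            else if (temp[p] > 0)
                temp[p]--;
        }
        slot++;
    }
    return slot;
}
```

Because chunks are handed out in order to whichever cores happen to be coolest, consecutive chunks usually land on different cores. That is the poor locality noted in the analysis.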
Slide 14: Pure Temperature Aware Scheduling
Slide 15: Pure Locality Aware Scheduling
- Algorithm
  - Start with a clean slate.
  - For each iteration chunk
    - Schedule it on the processor with the greatest locality with it, keeping at most two chunks together.
    - If more slots are required (when all processors are exhausted), increase the schedule length.
    - Otherwise move to the next processor
- Analysis
  - Very good locality
  - However, 2 extra time slots are used.
  - No temperature problems
Slide 16: Locality and temperature aware scheduling
- Algorithm
  - Use temperature aware scheduling to obtain the schedulable slots.
  - Use locality aware scheduling to assign chunks to these slots.
- Analysis: best of both worlds
  - Great locality
  - No temperature problems
  - Good performance
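The second step of the combined approach, assigning chunks to slots that are already known to be thermally safe, might look like the C sketch below. It assumes the temperature aware pass has produced a table of schedulable slots. It interprets "locality" as giving consecutive chunk numbers to back-to-back slots on the same core, since consecutive chunks of the Jacobi example share data. That interpretation, the core count of 4 and all names are assumptions made for illustration.

```c
#include <stdbool.h>

#define NUM_CORES 4  /* hypothetical core count */

/*
 * Locality-aware assignment of chunks to pre-selected slots.
 *
 * active[t][p] is true when the temperature-aware pass marked core p
 * as schedulable in slot t. schedule[t][p] receives a chunk id, or -1
 * if the core idles. Each run of back-to-back active slots on one core
 * is filled with consecutive chunk ids, so that neighboring chunks can
 * reuse data already in that core's cache.
 *
 * Returns the number of chunks placed.
 */
int assign_chunks_by_locality(int num_slots, int num_chunks,
                              bool active[][NUM_CORES],
                              int schedule[][NUM_CORES])
{
    int next_chunk = 0;

    for (int t = 0; t < num_slots; t++)
        for (int p = 0; p < NUM_CORES; p++)
            schedule[t][p] = -1;

    for (int t = 0; t < num_slots; t++) {
        for (int p = 0; p < NUM_CORES; p++) {
            if (!active[t][p] || schedule[t][p] >= 0)
                continue;  /* idle, or already filled as part of a run */

            /* Fill the whole run that starts here with consecutive chunks. */
            for (int s = t;
                 s < num_slots && active[s][p] && next_chunk < num_chunks;
                 s++)
                schedule[s][p] = next_chunk++;
        }
    }
    return next_chunk;
}
```

Since the set of active slots is fixed before any chunk is placed, this step cannot introduce a thermal violation. It only changes which chunk runs where.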
Slide 18: Experiments
- Five loop intensive codes were tested
Slide 19: adi - Threshold Temperature 88 °C
Slide 20: eflux - Threshold Temperature 88 °C
Slide 21: adi - Threshold Temperature 88 °C
Slide 22: eflux - Threshold Temperature 88 °C
Slide 23: Sensitivity Analysis: adi - Threshold Temperature 87 °C
Slide 24: Sensitivity Analysis: adi - Threshold Temperature 86 °C
Slide 25: Sensitivity Analysis: adi - Threshold Temperature 85 °C
Slide 26: Sensitivity Analysis: adi - Threshold Temperature 84 °C
Slide 27: Experiments
Slide 28: Experiments
Slide 29: Conclusion
- Implemented a compiler directed scheduling algorithm that is both temperature sensitive and performance aware.
- Achieved impressive reductions in average and peak chip temperature.
- This allows software to take up the burden of preventing chip damage due to thermal effects.
  - Chips can be aggressively scaled
  - Cooling costs can be reduced
  - Lowers the need for hardware based thermal management schemes.
Slide 30: Thank you!