Title: Multicore: Panic or Panacea?
1Multicore: Panic or Panacea?
- Mikko H. Lipasti
- Associate Professor
- Electrical and Computer Engineering
- University of Wisconsin-Madison
http://www.ece.wisc.edu/pharm
2Multicore Mania
- First, servers
- IBM Power4, 2001
- Then desktops
- AMD Athlon X2, 2005
- Then laptops
- Intel Core Duo, 2006
- Soon, your cellphone
- ARM MPCore, prototypes for a while now
3What is behind this trend?
- Moore's Law
- Chip power consumption
- Single-thread performance trend
- Source: Intel
4Dynamic Power
- Static CMOS: current flows when active
- Combinational logic evaluates new inputs
- Flip-flop or latch captures a new value (clock edge)
- Terms
- C: capacitance of circuit
- Wire length, number and size of transistors
- V: supply voltage
- A: activity factor
- f: frequency
- Future: fundamentally power-constrained
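The four terms above combine in the standard CMOS dynamic-power relation, P = A * C * V^2 * f. The sketch below is an illustration only: the function name and the numeric values are assumptions, not figures from the talk. It shows why voltage matters most: lowering V and f together by 20% cuts dynamic power roughly in half.

```python
def dynamic_power(capacitance, voltage, activity, frequency):
    """Standard CMOS dynamic power estimate: P = A * C * V^2 * f (watts)."""
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical operating point: 1 nF switched capacitance, 1.0 V, A = 0.5, 2 GHz.
baseline = dynamic_power(1e-9, 1.0, 0.5, 2e9)

# Scale supply voltage and frequency down together by 20%.
scaled = dynamic_power(1e-9, 0.8, 0.5, 0.8 * 2e9)

# Power falls with the cube of the scaling factor: 0.8 ** 3 = 0.512.
ratio = scaled / baseline
print(f"baseline = {baseline:.3f} W, scaled = {scaled:.3f} W, ratio = {ratio:.3f}")
```

The cubic relationship only holds as long as voltage can keep scaling with frequency, which is part of why the outlook is power-constrained.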
5Easy answer: Multicore
                   Single Core   Dual Core   Quad Core
Core area          A             A/2         A/4
Core power         W             W/2         W/4
Chip power         W + O         W + O       W + O
Core performance   P             0.9P        0.8P
Chip performance   P             1.8P        3.2P
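The bottom row of the table is the core count multiplied by the per-core performance. A minimal sketch of that arithmetic follows; the function name is illustrative, and it assumes every core is kept fully busy, which is the best case.

```python
def chip_performance(cores, core_performance):
    """Best-case chip throughput: all cores busy, no coordination cost."""
    return cores * core_performance


# Per-core performance relative to the single-core design (P = 1.0),
# using the values from the table.
designs = {"single": (1, 1.0), "dual": (2, 0.9), "quad": (4, 0.8)}

for name, (cores, perf) in designs.items():
    print(name, chip_performance(cores, perf))
```

These are peak numbers. Whether a real program reaches them depends on how much of it runs in parallel.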
6Amdahl's Law
[Figure: CPUs (1 to n) versus Time, with regions labeled f and 1-f]
- f: fraction that can run in parallel
- 1-f: fraction that must run serially
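With f and 1-f defined as above, Amdahl's Law gives the speedup on n CPUs as 1 / ((1 - f) + f / n). The sketch below simply evaluates that expression; the function name and sample values are illustrative.

```python
def amdahl_speedup(f, n):
    """Speedup on n CPUs when a fraction f of the work runs in parallel."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    serial_time = 1.0 - f      # runs on one CPU
    parallel_time = f / n      # spread evenly over n CPUs
    return 1.0 / (serial_time + parallel_time)


for n in (1, 2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Even with f = 0.9, the speedup can never exceed 1 / (1 - f) = 10, however many CPUs are added.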
7Fixed Chip Power Budget
[Figure: n CPUs]
- Amdahl's Law
- Ignores (power) cost of n cores
- Revised Amdahl's Law
- More cores → each core is slower
- Parallel speedup < n
- Serial portion (1-f) takes longer
- Also, interconnect and scaling overhead
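The slides do not give a formula for the revised law, so the following is only a sketch under stated assumptions: chip power is split evenly across n cores, and per-core speed scales with the cube root of per-core power (a common rule of thumb when voltage and frequency scale together). Interconnect and scaling overheads are ignored. The function names and the exponent are assumptions, not the author's model.

```python
def per_core_speed(n, exponent=1.0 / 3.0):
    """Relative speed of one core when each core gets 1/n of the chip power.

    ASSUMPTION: speed ~ power ** (1/3); not taken from the slides.
    """
    return (1.0 / n) ** exponent


def fixed_power_speedup(f, n, exponent=1.0 / 3.0):
    """Amdahl-style speedup versus one full-power core, at fixed chip power."""
    s = per_core_speed(n, exponent)
    serial_time = (1.0 - f) / s        # serial part runs on one slower core
    parallel_time = f / (n * s)        # parallel part runs on n slower cores
    return 1.0 / (serial_time + parallel_time)


for n in (1, 2, 4, 8, 16):
    print(n, round(fixed_power_speedup(0.9, n), 2))
```

Under these assumptions a program that is 50% parallel runs slower on 8 cores than on one full-power core, because the serial half is stuck on a core running at about half speed.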
8Fixed Power Scaling
- Fixed power budget forces slow cores
- Serial code quickly dominates
9Predictions and Challenges
- Parallel scaling limits many-core
- >4 cores only for well-behaved programs
- Optimistic about new applications
- Interconnect overhead
- Single-thread performance
- Will degrade unless we innovate
- Parallel programming
- Express/extract parallelism in new ways
- Retrain programming workforce
10Research Agenda
- Programming for parallelism
- Sources of parallelism
- New applications, tools, and approaches
- Single-thread performance and power
- Most attractive to programmer/user
- Chip multiprocessor overheads
- Interconnect, caches, coherence, fairness
11Finding Parallelism
- Functional parallelism
- Car: engine, brakes, entertainment, navigation, ...
- Game: physics, logic, UI, render, ...
- Automatic extraction (UW Multiscalar)
- Decompose serial programs
- Data parallelism
- Vector, matrix, db table, pixels, ...
- Request parallelism
- Web, shared database, telephony, ...
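Data parallelism is the easiest of these to show in a few lines: the same operation is applied independently to each slice of the data. The sketch below is illustrative only; the pixel-brightening task and all names are assumptions. It uses a thread pool to show the decomposition. In CPython, threads will not speed up pure-Python arithmetic, so treat this as a picture of the structure, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(chunk, delta=10):
    """Brighten one slice of 8-bit pixel values, clamping at 255."""
    return [min(255, p + delta) for p in chunk]


def split(data, parts):
    """Cut data into at most `parts` contiguous chunks of similar size."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_brighten(pixels, workers=4):
    """Apply brighten() to each chunk independently, then reassemble in order."""
    chunks = split(pixels, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(brighten, chunks))
    return [p for chunk in results for p in chunk]


print(parallel_brighten([0, 100, 250, 255], workers=2))
```

No chunk reads or writes another chunk's pixels, so no coordination is needed beyond reassembling the results in order.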
12Balancing Work
- Amdahl's parallel phase f: all cores busy
- If not perfectly balanced
- (1-f) term grows (f not fully parallel)
- Performance scaling suffers
- Manageable for data- and request-parallel apps
- Very difficult problem for the other two
- Functional parallelism
- Automatically extracted
- Scale power to mismatch (Multiscalar)
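The cost of imbalance can be seen with a small calculation: a parallel phase finishes only when its most heavily loaded core finishes. A minimal sketch, assuming equal-speed cores and hypothetical work units per core:

```python
def parallel_phase_speedup(work_per_core):
    """Speedup of a parallel phase over running all of its work on one core.

    The phase ends when the most heavily loaded core finishes.
    """
    total = sum(work_per_core)
    slowest = max(work_per_core)
    return total / slowest


balanced = [25, 25, 25, 25]      # hypothetical work units on four cores
unbalanced = [40, 20, 20, 20]    # same total work, one overloaded core

print(parallel_phase_speedup(balanced))
print(parallel_phase_speedup(unbalanced))
```

With the same total work, the unbalanced split gets a speedup of 2.5 instead of 4 on four cores. While the overloaded core finishes, the other three sit idle, which has the same effect as a larger serial fraction.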
13Coordinating Work
- Synchronization
- Some data somewhere is shared
- Coordinate/order updates and reads
- Otherwise → chaos
- Traditionally: locks and mutual exclusion
- Hard to get right, even harder to tune for performance
- Research: Transactional Memory (UW Multifacet)
- Programmer: declare potential conflict
- Hardware and/or software: speculate, check
- Commit or roll back and retry
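The speculate, check, commit-or-retry cycle can be mimicked in software with a version number. The toy sketch below is not a real transactional memory system and is not the UW Multifacet design; the class and function names are hypothetical, and a small lock still guards the commit step itself. It shows only the control flow: compute on a snapshot, commit if nothing changed, otherwise discard and retry.

```python
import threading


class VersionedCell:
    """A shared value with a version counter (toy example, hypothetical API)."""

    def __init__(self, value=0):
        self._value = value
        self._version = 0
        self._commit_lock = threading.Lock()  # guards only the short read/commit steps

    def read(self):
        with self._commit_lock:
            return self._value, self._version

    def try_commit(self, new_value, seen_version):
        """Commit only if nobody else committed since our snapshot."""
        with self._commit_lock:
            if self._version != seen_version:
                return False               # conflict detected: caller must retry
            self._value = new_value
            self._version += 1
            return True


def atomic_update(cell, fn):
    """Speculatively compute fn(value); retry until the commit succeeds."""
    retries = 0
    while True:
        value, version = cell.read()       # snapshot
        if cell.try_commit(fn(value), version):
            return retries
        retries += 1                       # roll back (discard result) and retry


def run_demo(num_threads=4, increments=1000):
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            atomic_update(cell, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.read()[0]


print(run_demo())
```

No increment is lost, because an update computed from a stale snapshot fails its version check and is recomputed.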
14Single-thread Performance
- Still most attractive source of performance
- Speeds up parallel and serial phases
- Can use it to buy back power
- Must focus on power consumption
- Performance benefit vs. power cost
15Single-thread Performance
- Hardware accelerators and circuits
- Domain-specific (UW MESA)
- Reconfigurable (UW Compton)
- VLSI and design automation (UW WISCAD, Kursun)
- Increasing frequency
- Seems prohibitive: clock power
- Clever clocking schemes can help (UW Pharm)
- Increasing instruction-level parallelism
- (UW Multiscalar, UW Pharm, UW Smith)
- Without blowing power budget
- Alternatively, reduce power for same performance
16Chip Multiprocessor Overheads
- Core interconnect (UW Pharm)
- 80% of chip power (Borkar, ISLPED '07 panel)
- Need fundamentally different approach
- Revisit circuit switching
- Cache coherence (UW Multifacet, Pharm)
- Match workload behavior
- Optimize for on-chip communication
17Chip Multiprocessor Overheads
- Shared caches (UW Multifacet, Multiscalar, Smith)
- On-chip memory can be shared
- Optimize replacement, replication
- Fairness (UW Smith)
- Maintain performance isolation
- Share resources fairly (memory, caches)
18Research Groups @ UW
Group        Faculty                 URL
Compton      Kati Compton            www.ece.wisc.edu/kati
Kursun       Volkan Kursun           www.cae.wisc.edu/kursun
MESA         Mike Schulte            mesa.ece.wisc.edu
Multifacet   Mark Hill, David Wood   http://www.cs.wisc.edu/multifacet
Multiscalar  Guri Sohi               www.cs.wisc.edu/mscalar
PHARM        Mikko Lipasti           www.ece.wisc.edu/pharm
Smith        James Smith             www.engr.wisc.edu/ece/faculty/smith_james.html
Vertical     Karu Sankaralingam      www.cs.wisc.edu/vertical/wiki
WISCAD       Azadeh Davoodi          www.cae.wisc.edu/adavoodi
19Conclusion
- Forecast
- Limited multicore (≤4) is here to stay
- Manycore (>4) will find its place
- Hardware Challenges
- Single-thread performance and power
- Multicore overhead
- Software Challenges
- Finding application parallelism
- Creating correct parallel programs
- Creating scalable parallel programs
20Questions?
- http://www.ece.wisc.edu/pharm