1 Data Scale Applications and Architecture
- Aqeel Mahesri
- Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign
- mahesri_at_crhc.uiuc.edu
2 Outline
- Introduction
- Data scale applications and architecture
- Memory system study
- Proposed work
- Related work
- Conclusion
3 Previous Work
- Data Scale Architecture
  - Aqeel Mahesri, Nicholas J. Wang, Sanjay J. Patel, "Tradeoffs in Cache Design and Simultaneous Multithreading in Many-Core Architectures," submitted to the International Conference on Supercomputing, July 2007.
- Control Decoupling and NXA
  - Aqeel Mahesri, Nicholas J. Wang, Sanjay J. Patel, "Hardware Support for Software Controlled Multithreading," Workshop on Design, Architecture, and Simulation of Chip Multiprocessors, 39th International Symposium on Microarchitecture, December 2006.
  - Aqeel Mahesri, Sanjay J. Patel, "Exploiting Parallelism Between Control and Data Computation," University of Illinois Technical Report UILU-ENG-05-2214, September 2005.
  - Aqeel Mahesri, "Exploiting Control/Data Parallelism," M.S. thesis, May 2004.
- Robust Architecture
  - Nicholas J. Wang, Aqeel Mahesri, and Sanjay J. Patel, "Examining ACE Analysis Reliability Estimates Using Fault Injection," 34th International Symposium on Computer Architecture, June 2007.
- Power Consumption
  - Aqeel Mahesri and Vibhore Vardhan, "Power Consumption Breakdown on a Modern Laptop," Workshop on Power-Aware Computing Systems, 37th International Symposium on Microarchitecture, December 2004.
- Dynamic Optimization
  - Brian Fahs, Aqeel Mahesri, Francesco Spadini, Sanjay J. Patel, and Steven S. Lumetta, "The Performance Potential of Trace-based Dynamic Optimization," University of Illinois Technical Report UILU-ENG-04-2208, November 2004.
4 Introduction - Data Scale Architecture
- Motivation for the project
- architecture shift from single-thread performance to parallel performance
- software shift from sequential apps to parallel apps
- envision a future where the trend toward parallelism continues
- Goals of the project
- select and analyze data scale applications
- optimize parallel architecture for data scale applications
- evaluate how the architecture should evolve as it scales further
5 Motivation: Uniprocessor Era
- single-thread performance is king
- ever larger, faster uniprocessors
- exponential performance growth
- but hitting limits
- interconnect delays
- power
- limited ILP of sequential workloads
- performance growth of uniprocessors is slowing down
- (chart taken from Mark Horowitz)
6 Motivation: Multicore Era
- single-thread and parallel performance compete
- uniprocessors grow slowly
- but an increasing number of cores on chip
- most applications are still sequential
- slow performance growth for individual apps
- performance growth for running multiple applications or for throughput applications
7 Motivation: Data Scale Era
- parallel performance is king
- scaling number of cores rather than performance of each core
- continues to provide exponential performance growth
- BUT the performance growth comes from increasing parallelism
- performance growth for data scale applications
8 Motivation: Emerging Parallel Workloads
- emerging uses of computers
- what are people going to be doing with computers in 10 years?
- real-time computer vision, AI, speech and image recognition
- visualization, simulation
- RMS (Recognition, Mining, Synthesis) applications
- graphics APIs
- offers massive parallelism
- sometimes sequential application tasks can be done in parallel
- compilers
9 Outline
- Introduction
- Data scale applications and architecture
- Memory system study
- Proposed work
- Related work
- Conclusion
10 Parallelism of Workloads
- An n-core architecture makes sense when the available parallelism p > n.
- But the number of cores n is scaling exponentially over time
- need applications where
- the required throughput scales over time
- the available parallelism p scales over time
- To maintain machine utilization (see the sketch below)
- available parallelism p in the parallel part must grow at least as fast as n
- the sequential portion must grow no faster than the performance of one core
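To make the utilization condition concrete, here is a minimal sketch assuming a simple Amdahl-style model with a serial fraction s and p-way parallelism in the parallel phase; the model and all numbers are illustrative, not taken from the study:

```python
# Toy Amdahl-style model: serial fraction s, parallel fraction (1 - s)
# that exposes p-way parallelism; an n-core chip can only use min(p, n)
# cores during the parallel phase.

def speedup(s, p, n):
    """Speedup on n cores relative to one core."""
    return 1.0 / (s + (1.0 - s) / min(p, n))

def utilization(s, p, n):
    """Average fraction of the n cores kept busy."""
    return speedup(s, p, n) / n

# Parallelism that grows with core count sustains utilization;
# fixed parallelism leaves cores increasingly idle as n scales.
for n in (16, 64, 256):
    growing = utilization(s=0.01, p=4 * n, n=n)   # p scales with n
    fixed = utilization(s=0.01, p=64, n=n)        # p stuck at 64
    print(f"n={n:3d}  p=4n: {growing:.2f}   p=64: {fixed:.2f}")
```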
11 Data Scale Applications
- A data scale application is one where both the complexity and the parallelism scale over time
- Definition
- the application can be parallelized
- the achievable parallelism grows with the input data set
- the data set, and hence compute time, grows exponentially over time at a rate fast enough to require taking advantage of additional parallelism
12 Workloads with data scale properties
- 172.mgrid from SPECfp
- parallelized as part of SPEC OMP
- ILP study
- shows parallelism is available
- grows linearly with input
- mgrid is a scientific application
- a multi-grid potential field solver
- a domain where we want to solve ever larger problems
13 Workloads with data scale properties
- 173.applu from SPECfp
- parallelized as part of SPEC OMP
- ILP study
- shows parallelism grows with the cube of the input
- applu solves a computational fluid dynamics problem
- again an application domain where we want to solve larger problems
14 What else might be data scale?
- Visualization
- raster-based graphics, ray tracing, global illumination, shadow volumes, dynamic texturing
- Video processing
- high-definition encoding, transcoding, video effects
- Financial analytics
- options pricing, ticker stream analysis
- Physical simulation
- real-time fluid simulation, rigid bodies, mesh-based simulation, facial simulation
- Artificial intelligence
- real-time AI, multiple intelligent agents, physically aware AI
- Real-time computer vision
- for robotics, autonomous cars, facial recognition
- lots more deep in the bowels of the CS department
15 Architecture for Data Scale Workloads
- Parallelism in the workload is assumed
- single-thread performance is not the focus
- performance can be increased arbitrarily by adding more parallelism in HW
- hence performance must be measured against constraints
- What should we optimize? (compared in the sketch below)
- performance/area
- maximize performance given a maximum area
- performance/watt
- maximize performance given a maximum power supply or cooling
- performance/joule
- minimize energy-delay product for low power
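As an illustration of how the three metrics can rank designs differently, here is a small sketch; the two design points and every number in them are hypothetical, not results from this work:

```python
# Hypothetical design points: normalized throughput (perf), area in mm^2,
# and power in W. For a fixed amount of work, energy/op = power/perf and
# delay/op = 1/perf, so the energy-delay product is power / perf^2.
designs = {
    "many small cores": {"perf": 100.0, "area": 400.0, "power": 80.0},
    "few large cores":  {"perf":  60.0, "area": 400.0, "power": 40.0},
}

for name, d in designs.items():
    perf, area, power = d["perf"], d["area"], d["power"]
    print(f"{name}: perf/area = {perf / area:.3f}, "
          f"perf/W = {perf / power:.3f}, "
          f"EDP = {power / perf ** 2:.5f}")
```

Under these made-up numbers the small-core design wins perf/area and EDP while the large-core design wins perf/watt, which is why the constraint must be fixed before optimizing.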
16 Architecture for Data Scale Workloads
- How should we optimize an architecture for data scale workloads?
- core design
- ISA design
- Out-of-order vs. in-order
- Issue width
- SIMD vs. scalar
- Is multithreading worth it?
- memory system
- What to do about memory bandwidth
- Frequency scaling, energy effects, design time, architectural scaling
17 Outline
- Introduction
- Data scale applications and architecture
- Memory system study
- Proposed work
- Related work
- Conclusion
18 Memory Latency Problem
- uniprocessors
- huge performance bottleneck
- latency stays steady as clock rate increases - the memory gap
- a long-latency memory access can stall the machine
- data scale applications
- provide a way around memory latency
- lots of threads can keep running while a long-latency mem op completes
- data scale architectures
- how much chip area to devote to countering memory latency?
19 Cache
- in uniprocessors
- primary technique for overcoming memory latency
- a cache miss can stall the entire machine
- hierarchies of caches attempt to store the entire working set
- large fraction of chip area
- in data scale architectures
- a cache miss only stalls a single core
- a small fraction of the machine
20 Simultaneous Multithreading
- in uniprocessors
- keeps the machine running despite a cache miss
- requires a small number of threads
- area cost
- in data scale architectures
- keeps the core running despite a cache miss
- but lets you put fewer cores on chip
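A toy analytic model (my own sketch, not the simulation methodology used later) shows why extra thread contexts keep a core busy across misses:

```python
# Toy latency-hiding model: each thread alternates C cycles of compute
# with an M-cycle cache miss. With T threads interleaved on one core,
# the T threads together can supply at most T*C cycles of work per
# (C + M)-cycle miss period, so utilization is min(1, T*C / (C + M)).

def core_utilization(C, M, T):
    return min(1.0, T * C / (C + M))

# e.g. 20 compute cycles between misses, 200-cycle miss penalty:
for T in (1, 2, 4, 8, 16):
    print(f"{T:2d} thread(s): utilization = {core_utilization(20, 200, T):.2f}")
```

The same arithmetic drives the SMT-versus-more-cores tradeoff: each added context buys utilization only until the core saturates, while its area could instead have bought part of another core.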
21 CMP Architecture
- [Diagram: cores P0, P1, ..., PN, each with a private L2 cache, sharing an L3 cache backed by main memory]
22 Methodology - Workload
- want apps that look like the targeted data scale workloads
- want apps with sufficient parallelism to occupy all cores
- use SPECfp and MediaBench apps
- parallelize loops using perfect information on loop-carried dependences
- from the definition of data scale, we don't want a constraint from single-thread performance
- generate performance numbers looking only at the parallel portions
- does not necessarily reflect the parallelization from a compiler or programmer
- but it doesn't matter because data scale apps are easy to parallelize
- in fact a programmer can probably do a better job
- does accurately represent resource usage for those apps
23 Methodology - Performance
- use simulation to measure throughput
- simple, fast simulation of each core
- fixed core architecture
- 8-stage, 2-wide, in-order pipeline
- 2.4 GHz clock speed
- cache design
- vary L2 (per core) cache
- 8 kB - 2 MB per core
- vary L3 (shared) cache
- 8 kB x core count - 512 kB x core count
- latency based on the cache latencies of the Intel P4 and IBM POWER4
- roughly proportional to the square root of cache size (see the sketch below)
- 0.45 ns to 7.1 ns for the L2
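A sketch of the square-root latency rule stated above; the exact fit (anchored at 0.45 ns for 8 kB) is my assumption, chosen so the 2 MB point lands near the slide's 7.1 ns figure:

```python
import math

# Assumed fit: L2 access latency grows with the square root of capacity,
# anchored at 0.45 ns for an 8 kB cache.
def l2_latency_ns(size_kb, base_kb=8.0, base_ns=0.45):
    return base_ns * math.sqrt(size_kb / base_kb)

for size_kb in (8, 64, 512, 2048):
    print(f"{size_kb:5d} kB L2 -> {l2_latency_ns(size_kb):.2f} ns")
# 2048 kB gives 7.20 ns, close to the 7.1 ns quoted on the slide.
```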
24 Methodology - Area
- chip area = core area + cache area (see the sketch below)
- assume a 90 nm TSMC process
- core area
- area of the Alpha 21164 scaled from the .35u to the 90 nm process
- 13.4 mm2
- cache area
- taken from SRAM area data provided by AGEIA
- 0.34 mm2 to 23.754 mm2 for each L2
- SMT area
- 20% increase in the 13.4 mm2 core area
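A minimal sketch of this additive area model using the numbers from the slide; the helper and the core-count calculation are illustrative and ignore the shared L3 and interconnect:

```python
# chip area = cores * (core area [+ SMT overhead] + per-core L2 area)
CORE_MM2 = 13.4        # scaled Alpha 21164 core at 90 nm (from the slide)
SMT_OVERHEAD = 0.20    # 20% core-area increase for SMT (from the slide)

def max_cores(budget_mm2, l2_mm2, smt=False):
    per_core = CORE_MM2 * (1.0 + (SMT_OVERHEAD if smt else 0.0)) + l2_mm2
    return int(budget_mm2 // per_core)

# With the 400 mm^2 budget used on the next slide:
for l2_mm2 in (0.34, 23.754):   # smallest and largest per-core L2
    print(f"L2 = {l2_mm2:6.2f} mm2: "
          f"{max_cores(400.0, l2_mm2)} cores, "
          f"{max_cores(400.0, l2_mm2, smt=True)} SMT cores")
```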
25 Cache Area vs. Performance
- Area budget of 400 mm2 in a 90 nm process
- More cores are better than more cache
- especially with SMT
26 Core Count vs. Performance
- Devote less area for each core
27 Optimize With Process Scaling
- available transistors grow with each process generation
- model as an increasing area budget (see the sketch below)
- assume perfect scaling
- [Chart: performance vs. area budget at 90 nm, 65 nm, and 45 nm]
- Given enough threads, we can achieve nearly linear performance growth
- SMT performance falls behind for larger area budgets
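The "larger area budget" abstraction reduces to simple arithmetic under the standard density assumption; a sketch (the budget and the scaling rule are assumptions, not measured data):

```python
# Under perfect scaling, transistor density grows with the square of the
# feature-size ratio, so a fixed die at a newer node behaves like a
# proportionally larger die at 90 nm.
BASE_NM, BASE_MM2 = 90.0, 400.0

for node_nm in (90.0, 65.0, 45.0):
    effective_mm2 = BASE_MM2 * (BASE_NM / node_nm) ** 2
    print(f"{node_nm:4.0f} nm -> effective 90 nm budget: {effective_mm2:6.0f} mm2")
```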
28 Scaling Core Count with Process Scaling
- How did we get that speedup with an increasing transistor budget?
- [Chart: core count scaling across 90 nm, 65 nm, and 45 nm]
29 Memory System Summary
- evaluated 2 techniques for countering memory latency
- cache
- SMT
- found cores are a better use of area than additional cache
- especially if cores are multithreaded
- found cores are a better use of area than SMT
- especially for large area budgets and later process nodes
- main point
- a highly parallel workload favors more execution resources over countering memory latency
30 Outline
- Introduction
- Data scale applications and architecture
- Memory system study
- Proposed work
- Related work
- Conclusion
31 Overview
- suite of data scale applications
- modeling CMP architecture
- hardware design studies
32 Data Scale Benchmarks
- no standard benchmark suite for many-core architectures
- want to create a benchmark suite for this project
- data scale applications
- small enough to perform large state-space exploration
- representative of important future apps
- candidates
- SPEC OMP benchmarks
- physics simulation - Open Dynamics Engine
- ray tracing
- options pricing
33 Area Model
- current model
- area = core area + cache area
- fixed core design and size
- cache area based on data and varies with size
- proposed model
- area = core area + cache area + interconnect area
- cache area stays the same
- core area is a map of core parameters to area
- add up the area of functional units, pipe latches, control logic, etc.
- validate against real designs: Alpha 21064, 21164, 21264, 21464
- interconnect area maps core count and area, link bandwidth, buffer sizes, and network topology to area
- Kumar, Zyuban, Tullsen, ISCA 2005
34 Power Model
- power and energy consumption are additional metrics
- perf/watt
- maximize performance for a fixed power budget
- perf/joule
- minimize energy-delay product due to a limited energy supply
- dynamic power model (see the sketch below)
- numerous published models
- Wattch
- SimplePower
- adapt for use in our studies
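Models like Wattch and SimplePower build on the standard CMOS dynamic-power relation; a minimal sketch, where all parameter values are illustrative placeholders rather than values from either tool:

```python
# P_dynamic = a * C * V^2 * f per structure, summed over the chip:
# a = activity factor, C = switched capacitance, V = supply voltage,
# f = clock frequency.
def dynamic_power_w(activity, cap_f, vdd_v, freq_hz):
    return activity * cap_f * vdd_v ** 2 * freq_hz

# e.g. a structure with 30% activity, 1 nF switched capacitance,
# 1.2 V supply, at the 2.4 GHz clock used in the memory study:
print(f"{dynamic_power_w(0.3, 1e-9, 1.2, 2.4e9):.2f} W")
```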
36 Programming Model Support
- proposals to add HW to make parallel programming easier
- hardware transactional memory
- proposals to remove HW to improve performance at the expense of the programmer
- Cell
- evaluate possible HW support for parallel programming
- HW support for data communication
- HW support for thread management
- metrics are perf/area and perf/power
- a complete picture would consider perf/software cost
- beyond scope
37 Hardware-Supported Data Communication
- proposals range from fully SW-managed communication to fully HW-managed
- SW communication imposes SW overhead
- less HW overhead
- HW communication requires HW structures
- eliminates SW overhead
- measure the performance benefit of reduced communication overhead
- ... vs. the cost of extra HW
38 Hardware Thread Management
- overhead from thread creation and scheduling
- some massively parallel architectures manage threads in HW
- GPUs: NVIDIA G80 and ATI R5xx series
- eliminates OS calls for creation and scheduling of threads
- requires a HW structure
- measure the performance benefit of less scheduling overhead (see the sketch below)
- ... vs. the cost of extra HW
39 Outline
- Introduction
- Data scale applications and architecture
- Memory system study
- Proposed work
- Related work
- Conclusion
40 Related Work
- workloads for CMPs
- Intel-academic venture to create a suite of RMS applications (recognition, mining, synthesis)
- P. Dubey, "Recognition, Mining, and Synthesis Moves Computers to the Era of Tera"
- similar apps as in our effort
- suite is not publicly available
- GPGPU research
- see Owens et al., "A Survey of General-Purpose Computation on Graphics Hardware," Computer Graphics Forum 2007
- Fourier transform, dynamics simulation
- 13 dwarves
- Asanovic et al., "The Landscape of Parallel Computing Research: A View from Berkeley"
- 13 basic algorithms important for future performance
- most are highly parallel
- not full applications
41 Related Work
- CMP optimization studies
- on-chip network studies
- Balfour and Dally, ICS 2006
- synthetic workload, various topologies
- Kumar, Zyuban, Tullsen, ISCA 2005
- shared bus vs. peer links vs. crossbar
- core complexity studies
- Huh, Burger, Keckler, PACT 2001
- copies of sequential workloads, found a preference for higher-complexity cores
- Li et al., HPCA 2006
- copies of sequential workloads, vary pipeline with fixed area and power budgets
- Monchiero, Canal, Gonzalez, ICS 2006
- small-scale shared memory workloads; performance, area, and power
- cache design studies
- Hsu et al., CAN April 2005
- server workloads, find shared caches provide substantial area savings
- these studies generally use n copies of sequential apps, or server benchmarks
- still looking at sequential application performance/throughput
- leads to a very different design point
42 Conclusion
- Microprocessor architecture scaling is changing
- from scaling single-thread performance to scaling parallel performance
- Workloads are changing
- from sequential workloads to massively parallel workloads
- The rise of data scale workloads
- size of dataset, required throughput, and achievable parallelism all grow over time
- workloads suited for core count scaling
- Architectures for data scale workloads
- found additional execution resources a better use of area than hiding memory latency
- will be considering core complexity vs. core count, the inter-core communication system, and hardware support for parallel programming
43 Backup
44 Core Count vs. Performance
- Devote less area for each core
45 Memory System Revisited
- re-examine previous results with constrained memory bandwidth
- re-examine previous results in the context of power
- cache eases bandwidth usage
- cache uses less power/area than cores
- if the chip is power constrained
- limits core count
- use cache to fill up the area budget
- SMT uses more power/perf than the baseline if cores idle less
- adding cache due to a power constraint should make SMT less desirable
46 Core Complexity
- dynamic scheduling
- large performance benefit for uniprocessor workloads
- allows execution to continue past long-latency operations
- finds ILP within a thread
- benefits unclear for data scale applications
- cost of a large area overhead, roughly 2X (see the sketch below)
- will mean fewer cores on chip
- less raw execution bandwidth
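The 2X figure implies a simple perf/area break-even test, assuming ample parallelism so aggregate throughput is what matters; the budget and speedup values below are illustrative:

```python
# An out-of-order core at 2X the area must deliver 2X the per-core
# throughput of an in-order core just to match the same silicon spent
# on more in-order cores.
INORDER_AREA, OOO_AREA = 1.0, 2.0   # normalized core areas
BUDGET = 32.0                       # normalized area budget

def chip_throughput(core_area, core_perf):
    return (BUDGET // core_area) * core_perf

for ooo_speedup in (1.5, 2.0, 2.5):   # per-core OoO speedup over in-order
    print(f"OoO speedup {ooo_speedup}: in-order chip = "
          f"{chip_throughput(INORDER_AREA, 1.0):.0f}, "
          f"OoO chip = {chip_throughput(OOO_AREA, ooo_speedup):.0f}")
```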
47 Core Complexity
- pipeline depth
- a deeper pipeline provides a higher clock speed
- increases execution bandwidth per core
- costs power and area for pipeline latches and bypass networks
- pipeline width
- sequential apps favor narrow pipelines
- data scale apps have lots of parallelism
- may favor wider execution per core
- or may favor more cores
48 Interconnection: Cache Coherence
- multicore roadmaps feature cache-coherent shared memory
- with cache coherence
- allows caching of writable shared memory locations
- without cache coherence
- writable shared memory cannot be cached
- all reads and writes must go to shared higher-level caches or memory
- increases memory latency
- measure the perf/area and perf/watt effect of cache coherence
49 Interconnection Network
- data scale application threads may be independent
- e.g. graphics
- don't need much interconnection
- data scale application threads may not be independent
- e.g. physics
- evaluate perf/area and perf/power
- dense vs. sparse networks
- high- vs. low-bandwidth links
50 Global Optimization
- the four previous design studies provide a broad exploration of the design space
- also want to examine the interaction between different parameters
- unified optimization study
- find the optimal overall design
- scaling study
- find optimal design points for different area budgets
- examine how tradeoffs change as architectures scale over the next decade
51 Chronological Ordering of Projects
- planned order of proposed work
- initial data scale suite
- core area modeling
- core complexity study
- final data scale suite
- interconnect study
- programming model study
- power modeling
- global optimization study
52 NXA
- conceptual architecture
- 2 cores
- connected by a spawn queue
- allows P0 to spawn work to P1 with low overhead
- communication network
- ensures P0 and P1 see well-defined architectural state
- automatically communicates shared data
53 NXA Decoupling Approach
- master/worker approach
- main thread runs on P0
- master thread
- spawns off work threads to P1
- unidirectional flow of dependences
- allows P1 to run far behind P0
- a reverse dependence forces P1 and P0 to re-synchronize
- critical thread on P0
- contains control instructions, miss-prone memory accesses, and the dataflow dependence spine
54 NXA Microarchitecture
55 Performance
- average control decoupling speedup: 1.16
- average memory decoupling speedup: 1.14
- average critical path decoupling speedup: 1.15
- choosing the best decoupling scheme for each program, the average speedup is 1.20
56 Multicore NXA
57 Multicore NXA