Title: MapReduce for the Cell B.E. Architecture
1MapReduce for the Cell B.E. Architecture
Marc de Kruijf, University of Wisconsin-Madison
Advised by Professor Sankaralingam
2MapReduce
- A model for parallel programming
- Proposed by Google
- Large-scale distributed systems (1,000-node clusters)
- Applications
- Distributed sort
- Distributed grep
- Indexing
- Simple, high-level interface
- Runtime handles parallelization, scheduling, synchronization, and communication
3Cell B.E. Architecture
- A heterogeneous computing platform
- 1 PPE, 8 SPEs
- Programming is hard
- Multi-threading is explicit
- SPE local memories are software-managed
- The Cell is like a cluster-on-a-chip
4Motivation
- MapReduce
- Scalable parallel model
- Simple interface
- Cell B.E.
- Complex parallel architecture
- Hard to program
MapReduce for the Cell B.E. Architecture
5Overview
- Motivation
- MapReduce
- Cell B.E. Architecture
- MapReduce Example
- Design
- Evaluation
- Workload Characterization
- Application Performance
- Conclusions and Future Work
6MapReduce Example
- Counting word occurrences in a set of documents
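The word-count example can be sketched in a few lines of Python. This is a minimal, single-threaded illustration of the programming model only, not the Cell runtime described in this talk; the function names `map_word_count`, `reduce_word_count`, and `run_word_count` are hypothetical and chosen for this sketch.

```python
from collections import defaultdict


def map_word_count(document):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce: collapse all counts emitted for one word into a single total."""
    return word, sum(counts)


def run_word_count(documents):
    # Group intermediate values by key. A MapReduce runtime performs this
    # grouping on behalf of the user; here it is a plain dictionary.
    grouped = defaultdict(list)
    for document in documents:
        for word, count in map_word_count(document):
            grouped[word].append(count)
    return dict(reduce_word_count(w, c) for w, c in grouped.items())


totals = run_word_count(["the cell is fast", "the cell is hard to program"])
```

The user writes only the Map and Reduce functions; grouping, scheduling, and data movement belong to the runtime.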
8Design
Flow of Execution Five stages Map, Partition,
Quick-sort, Merge-sort, Reduce
9Design
Flow of Execution Five stages Map, Partition,
Quick-sort, Merge-sort, Reduce 1. Map streams
key/value pairs
10Design
- Flow of Execution
- Five stages Map, Partition, Quick-sort,
Merge-sort, Reduce - 1. Map streams key/value pairs
- Key grouping implemented as
- 2. Partition hash and distribute
- 3. Quick-sort
- 4. Merge-sort
two-phase external sort
11Design
- Flow of Execution
- Five stages Map, Partition, Quick-sort,
Merge-sort, Reduce - 1. Map streams key/value pairs
- Key grouping implemented as
- 2. Partition hash and distribute
- 3. Quick-sort
- 4. Merge-sort
two-phase external sort
12Design
- Flow of Execution
- Five stages Map, Partition, Quick-sort,
Merge-sort, Reduce - 1. Map streams key/value pairs
- Key grouping implemented as
- 2. Partition hash and distribute
- 3. Quick-sort
- 4. Merge-sort
two-phase external sort
13Design
- Flow of Execution
- Five stages: Map, Partition, Quick-sort, Merge-sort, Reduce
- 1. Map streams key/value pairs
- Key grouping implemented as stages 2 to 4:
- 2. Partition: hash and distribute
- 3. Quick-sort
- 4. Merge-sort
- Quick-sort and Merge-sort together form a two-phase external sort
- 5. Reduce reduces key/list-of-values pairs to key/value pairs
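The five-stage flow above can be sketched sequentially in Python. This is an illustration under stated assumptions, not the Cell implementation: `run_pipeline`, `num_partitions`, and `run_size` are hypothetical names, Python's built-in `sorted` stands in for the quick-sort stage, and `heapq.merge` stands in for the merge-sort stage.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_pipeline(inputs, map_fn, reduce_fn, num_partitions=4, run_size=3):
    partitions = [[] for _ in range(num_partitions)]

    # Stage 1 (Map) streams key/value pairs; stage 2 (Partition) hashes each
    # key so that all pairs with the same key land in the same partition.
    for item in inputs:
        for key, value in map_fn(item):
            partitions[hash(key) % num_partitions].append((key, value))

    results = []
    for pairs in partitions:
        # Stage 3: sort fixed-size runs independently (phase one of the
        # external sort; sorted() is a stand-in for quick-sort).
        runs = [
            sorted(pairs[i:i + run_size], key=itemgetter(0))
            for i in range(0, len(pairs), run_size)
        ]
        # Stage 4: merge the sorted runs into one sorted stream (phase two).
        merged = heapq.merge(*runs, key=itemgetter(0))
        # Stage 5: equal keys are now adjacent, so Reduce sees each key once
        # together with the list of all its values.
        for key, group in groupby(merged, key=itemgetter(0)):
            results.append(reduce_fn(key, [value for _, value in group]))
    return results


pairs = run_pipeline(
    ["a b a", "b c"],
    lambda doc: ((word, 1) for word in doc.split()),
    lambda key, values: (key, sum(values)),
)
```

Sorting is what groups the values by key: once a partition is fully sorted, a single linear pass is enough to hand each key and its values to Reduce.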
15Evaluation Methodology
- MapReduce Model Characterization
- Synthetic micro-benchmark with six parameters
- Run on a 3.2 GHz Cell Blade
- Measured effect of each parameter on execution time
- Application Performance Comparison
- Six full applications
- MapReduce versions run on 3.2 GHz Cell Blade
- Single-threaded versions run on 2.4 GHz Core 2 Duo
- Evaluation
- Measured speedup comparing execution times
- Measured overheads on the Cell by monitoring SPE idle time
- Measured ideal speedup assuming no Cell overheads
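The two speedup figures can be related with simple arithmetic. The sketch below assumes that the overhead is expressed as the fraction of Cell execution time lost to it (for example, derived from SPE idle time); that definition, the function names, and the numbers are illustrative assumptions, not measurements or formulas from this talk.

```python
def speedup(baseline_seconds, cell_seconds):
    """Measured speedup: baseline execution time over Cell execution time."""
    return baseline_seconds / cell_seconds


def ideal_speedup(baseline_seconds, cell_seconds, overhead_fraction):
    """Speedup if the overhead share of the Cell run were removed entirely.

    overhead_fraction is an assumed quantity in [0, 1): the part of the
    Cell execution time attributed to runtime overhead.
    """
    useful_seconds = cell_seconds * (1.0 - overhead_fraction)
    return baseline_seconds / useful_seconds


# Made-up timings, purely to show the arithmetic.
measured = speedup(10.0, 4.0)
ideal = ideal_speedup(10.0, 4.0, 0.25)
```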
16MapReduce Model Characterization
Effect on Execution Time
Characteristic    Description
Map intensity     Execution cycles per input byte to Map
Reduce intensity  Execution cycles per input byte to Reduce
Map fan-out       Ratio of input size to output size in Map
Reduce fan-in     Number of values per key in Reduce
Partitions        Number of partitions
Input size        Input size in bytes
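To make the six characteristics concrete, the sketch below collects them in one structure and derives a few quantities from the definitions in the table. It assumes that the input to Reduce is the output of Map; the class name, the `estimate` helper, and the parameter values are hypothetical and do not describe the actual micro-benchmark.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkParams:
    map_intensity: float     # execution cycles per input byte to Map
    reduce_intensity: float  # execution cycles per input byte to Reduce
    map_fan_out: float       # ratio of Map input size to Map output size
    reduce_fan_in: int       # values per key in Reduce (unused in estimate)
    partitions: int          # number of partitions
    input_size: int          # input size in bytes


def estimate(params):
    # By the table's definition, Map output size = input size / fan-out.
    map_output_bytes = params.input_size / params.map_fan_out
    return {
        "map_cycles": params.map_intensity * params.input_size,
        "map_output_bytes": map_output_bytes,
        # Assumption: everything Map emits becomes input to Reduce.
        "reduce_cycles": params.reduce_intensity * map_output_bytes,
        "bytes_per_partition": map_output_bytes / params.partitions,
    }


# Illustrative values only.
example = estimate(BenchmarkParams(8.0, 2.0, 4.0, 16, 8, 1 << 20))
```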
17Application Performance
- Applications
- histogram: counts bitmap RGB occurrences
- kmeans: clustering algorithm
- linearReg: least-squares linear regression
- wordCount: word count
- NAS_EP: EP benchmark from NAS suite
- distSort: distributed sort
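As an illustration of how one of these applications fits the model, least-squares linear regression can be expressed as a Map that emits partial sums per chunk of points and a Reduce that adds them up. This is a sketch of the general technique, not the linearReg code evaluated here; all names are hypothetical.

```python
def map_partial_sums(points):
    """Map: reduce one chunk of (x, y) points to (n, sum x, sum y, sum x*x, sum x*y)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return n, sx, sy, sxx, sxy


def reduce_partial_sums(partials):
    """Reduce: add the partial sums from all chunks component-wise."""
    return tuple(sum(component) for component in zip(*partials))


def fit_line(chunks):
    n, sx, sy, sxx, sxy = reduce_partial_sums(map_partial_sums(c) for c in chunks)
    # Closed-form least-squares solution for y = slope * x + intercept.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Four points on y = 2x + 1, split across two chunks.
slope, intercept = fit_line([[(0, 1), (1, 3)], [(2, 5), (3, 7)]])
```

Each chunk is processed independently, so the Map work parallelizes, while the Reduce step only touches five numbers per chunk.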
18Speedup Over Core 2 Duo
19Runtime Overheads
21Conclusions and Future Work
- Conclusions
- Programmability benefits
- High performance on computationally intensive workloads
- Not applicable to all application types
- Future Work
- Additional performance tuning
- Extend for clusters of Cell processors
- Hierarchical MapReduce
22Questions?
23Backup Slides
24MapReduce API
- void MapReduce_exec(MapReduceSpecification specification)
- The exec function initializes the MapReduce runtime and executes MapReduce according to the user specification.
- void MapReduce_emitIntermediate(void **key, void **value)
- void MapReduce_emit(void **value)
- These two functions are called by the user-defined Map and Reduce functions, respectively. They take references to pointers as arguments and modify each referenced pointer to point to pre-allocated storage. It is then the responsibility of the application to provision this storage.
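The emit pattern described above, where the runtime hands out pre-allocated storage and the application fills it, can be mimicked in Python with a fixed buffer and `memoryview` slices. This is an analogy only: `OutputRegion`, its `emit` method, and the fixed record size are assumptions made for the sketch and are not part of the API on this slide.

```python
import struct


class OutputRegion:
    """A fixed-size output buffer that is allocated once, up front."""

    def __init__(self, capacity, record_size):
        self.record_size = record_size
        self.buffer = bytearray(capacity * record_size)
        self.used = 0

    def emit(self):
        """Return a writable view of the next free record slot.

        No allocation happens here; the caller writes its output
        directly into storage owned by the region.
        """
        start = self.used * self.record_size
        if start + self.record_size > len(self.buffer):
            raise MemoryError("output region is full")
        self.used += 1
        return memoryview(self.buffer)[start:start + self.record_size]


region = OutputRegion(capacity=4, record_size=4)
for value in (42, 7):
    slot = region.emit()               # runtime provides the storage
    slot[:] = struct.pack("<I", value)  # application fills it in place
```

Because every record lives in one contiguous region, the output can later be moved with bulk transfers and needs no per-record allocation.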
25Optimizations
- Priority work queue
- Distributes load
- Avoids serialization
- Pipelined execution maximizes concurrency
- Double-buffering
- Application support
- Map only
- Map with sorted output
- Chaining invocations
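Double-buffering overlaps the transfer of the next block of data with computation on the current one. The sketch below shows the idea with a background thread standing in for an asynchronous transfer; on the Cell the transfer would be a DMA into SPE local memory. The names `process_double_buffered`, `fetch`, and `compute` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_double_buffered(chunks, fetch, compute):
    """Compute on chunk i while chunk i + 1 is being fetched."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start fetching the first chunk into the "next" buffer.
        pending = pool.submit(fetch, chunks[0])
        for next_chunk in chunks[1:]:
            current = pending.result()                # wait for the transfer
            pending = pool.submit(fetch, next_chunk)  # start the next transfer
            results.append(compute(current))          # overlaps with the fetch
        # The last chunk has no successor to prefetch.
        results.append(compute(pending.result()))
    return results


# list() copies a chunk (a stand-in for a transfer); sum() is the computation.
sums = process_double_buffered([[1, 2], [3, 4], [5]], list, sum)
```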
27Optimizations
- Balanced merge (n / log(n) better bandwidth utilization as n ? 8)
- Map and Reduce output regions pre-allocated.
- optimal memory alignment
- bulk memory transfers
- no user memory management
- no dynamic allocation overhead