Title: Mars: A MapReduce Framework on Graphics Processors
1Mars: A MapReduce Framework on Graphics Processors
- Bingsheng He (1), Wenbin Fang, Qiong Luo: Hong Kong Univ. of Sci. and Tech.
- Naga K. Govindaraju: Microsoft Corp.
- Tuyong Wang: Sina Corp.
Presenter: Wenbin Fang
(1) Currently at Microsoft Research Asia
2Overview
- Introduction
- Design
- Implementation
- Evaluation
- Conclusion
4Graphics Processing Units (GPUs)
- Massively multithreaded co-processors
- Recent NVIDIA GPUs
- 1 TFLOPS peak performance
- Consist of many SIMD multiprocessors
- Thread groups
5Graphics Processing Units (Cont.)
- High-bandwidth device memory
- >10x higher than CPU main memory's bandwidth
- High-latency device memory
- 200 clock cycles of latency
- Latency hiding using a large number of concurrent threads
- Low context-switch overhead
6GPGPU
- Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05]
- FFT [Moreland 03, Horn 06]
- Matrix operations [Jiang 05]
- Folding@home, SETI@home
- Database applications
- Basic operators [Govindaraju 04]
- Sorting [Govindaraju 06]
- Join [He 08]
7 GPU Programming
Graphics rendering pipeline
Different programming models.
8- Program without worrying about hardware details.
- Make GPGPU programming much easier.
- Harness the high parallelism and high computational power of GPUs.
9The Original MapReduce
- MapReduce is a parallel programming model for
processing and generating large datasets,
proposed by Google [OSDI'04].
- It takes a set of records in the form of
key/value pairs as input, and produces a set of
output key/value pairs.
10MapReduce Functions
- Programmers specify two functions
- map (in_key, in_value)
- reduce (out_key, list(intermediate_value))
- The MapReduce runtime takes care of
- parallelization
- fault tolerance
- data distribution
- load balancing
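As a minimal, sequential sketch of the two user functions above (not how Mars or Google's runtime is implemented), word counting can be written as follows. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the tiny in-process "runtime" only groups intermediate values by key; it does no parallelization, fault tolerance, data distribution, or load balancing.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # in_key: document name (unused here); in_value: document text.
    # Emits one intermediate (word, 1) pair per word.
    for word in in_value.split():
        yield word, 1

def reduce_fn(out_key, intermediate_values):
    # Receives one key and the list of all values emitted for it.
    return out_key, sum(intermediate_values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Hypothetical toy runtime: run map on every record, group the
    # intermediate pairs by key, then run reduce once per key.
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

word_counts = run_mapreduce(
    [("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn
)
```

Here `word_counts` maps each word to the number of times it appears across both documents.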
11MapReduce Workflow
From http://labs.google.com/papers/mapreduce.html
12MapReduce outside Google
- Hadoop [Apache project]
- MapReduce on multicore CPUs: Phoenix [HPCA'07, Ranger et al.]
- MapReduce on Cell [07, Kruijf et al.]
- Merge [ASPLOS'08, Linderman et al.]
- MapReduce on GPUs [STMCS'08, Catanzaro et al.]
13Overview
- Motivation
- Design
- Implementation
- Evaluation
- Conclusion
14Software stack of Mars
- Applications
- Matrix Multiplication
- String Match
- Inverted Index
- Similarity Score
- Page View Count
- Page View Rank
- Mars: MapReduce on GPUs
- NVIDIA CUDA
- GPU Driver, OS
15MapReduce on Multi-core CPU (Phoenix HPCA'07)
Input -> Split -> Map -> Partition -> Reduce -> Merge -> Output
16Limitations on GPUs
- Lack of dynamic memory allocation on GPUs
- How to support variable-length data?
- How to dynamically allocate the output buffer on GPUs?
- Lack of lock support
- How to synchronize to avoid write conflicts?
17Data Structure for Mars
Supports variable-length records!
- A record = <Key, Value, Index entry>
Keys: Key1 | Key2 | Key3
Values: Value1 | Value2 | Value3
Index entries: IndexEntry1 | IndexEntry2 | IndexEntry3
An index entry = <key size, key offset, val size, val offset>
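A small sketch of this layout, assuming keys and values are raw byte strings packed back to back into two flat buffers, with one index entry per record. The helper names `pack_records` and `get_record` are hypothetical and are not part of Mars.

```python
def pack_records(pairs):
    # Pack variable-length (key, value) byte strings into two flat
    # buffers plus an index array, one entry per record.
    keys = bytearray()
    vals = bytearray()
    index = []  # entries: (key size, key offset, val size, val offset)
    for key, val in pairs:
        index.append((len(key), len(keys), len(val), len(vals)))
        keys += key
        vals += val
    return bytes(keys), bytes(vals), index

def get_record(keys, vals, index, i):
    # Recover record i using only its fixed-size index entry.
    key_size, key_offset, val_size, val_offset = index[i]
    return (keys[key_offset:key_offset + key_size],
            vals[val_offset:val_offset + val_size])

keys, vals, index = pack_records(
    [(b"a", b"apple"), (b"bb", b"kiwi"), (b"ccc", b"fig")]
)
second = get_record(keys, vals, index, 1)
```

Because every index entry has the same size, record `i` can be located directly at position `i` of the index array even though the keys and values themselves differ in length.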
18Lock-free scheme for result output
Basic idea: calculate the offset for each thread
on the output buffer.
- Histogram on key size, value size, and record count.
- Prefix sum on key size, value size, and record count.
- Allocate the output buffer in GPU memory.
- Perform the computation.
19Lock-free scheme example
Pick out the odd numbers from the array 1, 3, 2, 3,
4, 7, 9, 8. The map function acts as a filter that
keeps all odd numbers.
20Lock-free scheme example
Threads: T1, T2, T3, T4
Input array: 1, 3, 2, 3, 4, 7, 9, 8
Step 1: Histogram
Step 2: Prefix sum (5)
21Lock-free scheme example
Threads: T1, T2, T3, T4
Input array: 1, 3, 2, 3, 4, 7, 9, 8
Step 3: Allocate
22Lock-free scheme example
Threads: T1, T2, T3, T4
Input array: 1, 3, 2, 3, 4, 7, 9, 8
Step 4: Computation (each thread writes at its offset from the prefix sum)
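The four steps of the example can be simulated sequentially as below. This is only a sketch under assumptions: the slides do not state how elements are assigned to threads, so the code assumes each of the four threads handles two consecutive elements, and the function names are hypothetical. On a GPU, the per-thread loops in steps 1 and 4 would run concurrently.

```python
def exclusive_prefix_sum(counts):
    # offsets[t] = number of outputs produced by threads before t.
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets, running

def lock_free_filter(data, num_threads, keep):
    # Assumption: thread t owns a contiguous chunk of the input.
    chunk = (len(data) + num_threads - 1) // num_threads
    chunks = [data[t * chunk:(t + 1) * chunk] for t in range(num_threads)]

    # Step 1: histogram, each thread counts its own outputs.
    counts = [sum(1 for x in c if keep(x)) for c in chunks]
    # Step 2: prefix sum gives each thread its write offset.
    offsets, total = exclusive_prefix_sum(counts)
    # Step 3: allocate the output buffer once, with the exact size.
    out = [None] * total
    # Step 4: computation, each thread writes only inside its own
    # range, so no locks are needed.
    for t, c in enumerate(chunks):
        pos = offsets[t]
        for x in c:
            if keep(x):
                out[pos] = x
                pos += 1
    return counts, offsets, out

counts, offsets, odds = lock_free_filter(
    [1, 3, 2, 3, 4, 7, 9, 8], 4, lambda x: x % 2 == 1
)
```

The write ranges of different threads never overlap, which is what removes the need for synchronization on the output buffer.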
23Mars workflow
Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU
-> Map -> Sort and Group -> ReduceCount -> Prefix sum
-> Allocate output buffer on GPU -> Reduce -> Output
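The stages of the full workflow can be sketched as a sequential pipeline. This is an illustration only, assuming a page-count-style job; `mars_like_workflow` and the four user callbacks are hypothetical names, and Python lists stand in for the buffers that Mars allocates in GPU memory.

```python
from itertools import groupby

def prefix_sum(counts):
    # Exclusive prefix sum plus the total, used to size each buffer.
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets, running

def mars_like_workflow(inputs, map_count, map_fn, reduce_count, reduce_fn):
    # MapCount -> Prefix sum -> Allocate intermediate buffer
    offsets, total = prefix_sum([map_count(rec) for rec in inputs])
    inter = [None] * total
    # Map: each invocation writes at its precomputed offset.
    for i, rec in enumerate(inputs):
        for j, pair in enumerate(map_fn(rec)):
            inter[offsets[i] + j] = pair
    # Sort and Group: bring pairs with equal keys together.
    inter.sort(key=lambda kv: kv[0])
    groups = [(k, [v for _, v in g])
              for k, g in groupby(inter, key=lambda kv: kv[0])]
    # ReduceCount -> Prefix sum -> Allocate output buffer
    r_offsets, r_total = prefix_sum([reduce_count(k, vs) for k, vs in groups])
    out = [None] * r_total
    # Reduce: one invocation per group, again writing at fixed offsets.
    for i, (k, vs) in enumerate(groups):
        for j, pair in enumerate(reduce_fn(k, vs)):
            out[r_offsets[i] + j] = pair
    return out

page_counts = mars_like_workflow(
    ["/a", "/b", "/a", "/c", "/a"],
    map_count=lambda url: 1,
    map_fn=lambda url: [(url, 1)],
    reduce_count=lambda key, values: 1,
    reduce_fn=lambda key, values: [(key, sum(values))],
)
```

The count callbacks must report exactly as many records as the matching map or reduce callback emits; otherwise the preallocated buffers would be too small or contain gaps.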
24Mars workflow: Map Only
Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU
-> Map -> Output
Map only, without grouping and reduce
25Mars workflow - Without Reduce
Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU
-> Map -> Sort and Group -> Output
Map and grouping, without reduce
26APIs of Mars
- User-defined
- MapCount
- Map
- Compare (optional)
- ReduceCount (optional)
- Reduce (optional)
- Runtime Provided
- AddMapInput
- MapReduce
- EmitInterCount
- EmitIntermediate
- EmitCount (optional)
- Emit (optional)
27Overview
- Introduction
- Design
- Implementation
- Evaluation
- Conclusion
28Mars-GPU
- NVIDIA CUDA
- Each map invocation or reduce invocation is a GPU
thread.
Mars-CPU
- Operating system thread APIs
- Each map invocation or reduce invocation is a CPU
thread.
29CUDA Features Used in Mars-GPU Implementation
Coalesced Access
Built-in vector types: int4, char4, float4
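To illustrate why the access pattern matters, the sketch below counts how many aligned memory segments a group of threads touches in one step under two ways of assigning elements to threads. It is a counting model only, not a performance measurement: the group size of 16 threads and the segment size of 16 elements are illustrative assumptions, and the function names are hypothetical.

```python
def segments_touched(addresses, segment_size):
    # Number of distinct aligned segments covered by one step's accesses.
    return len({a // segment_size for a in addresses})

def access_pattern(num_threads, items_per_thread, coalesced):
    # Returns, for each step, the element index read by every thread.
    steps = []
    for step in range(items_per_thread):
        if coalesced:
            # Neighbouring threads read neighbouring elements.
            addrs = [step * num_threads + t for t in range(num_threads)]
        else:
            # Each thread walks its own contiguous chunk, so neighbouring
            # threads are items_per_thread elements apart.
            addrs = [t * items_per_thread + step for t in range(num_threads)]
        steps.append(addrs)
    return steps

coalesced_segments = [segments_touched(a, 16)
                      for a in access_pattern(16, 4, coalesced=True)]
chunked_segments = [segments_touched(a, 16)
                    for a in access_pattern(16, 4, coalesced=False)]
```

In this model the coalesced pattern touches one segment per step while the chunked pattern touches four, which is the kind of difference that lets the hardware combine neighbouring accesses into fewer memory transactions.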
30Overview
- Motivation
- Design
- Implementation
- Evaluation
- Conclusion
31Experimental Setup
- Comparison
- CPU: Phoenix, Mars-CPU (4 CPU threads)
- GPU: Mars-GPU (number of GPU threads is application dependent)
32Applications
- String Match (SM): Find the position of a string in a file.
- S: 32MB, M: 64MB, L: 128MB
- Inverted Index (II): Build an inverted index for links in HTML files.
- S: 16MB, M: 32MB, L: 64MB
- Matrix Multiplication (MM): Multiply two matrices.
- S: 512x512, M: 1024x1024, L: 2048x2048
33Applications (Cont.)
- Similarity Score (SS): Compute the pair-wise similarity score for a set of documents.
- S: 512x128, M: 1024x128, L: 2048x128
- Page View Rank (PVR): Find the top-10 hot pages in the web log.
- S: 32MB, M: 64MB, L: 96MB
- Page View Count (PVC): Count the number of distinct page views from web logs.
- S: 32MB, M: 64MB, L: 96MB
34Effect of Coalesced Access
Coalesced access achieves a speedup of 1.2-2x
35Effect of Built-In Data Types
Built-in data types achieve a speedup of up to 2x
36GPU accelerates computation in MapReduce
With large data sets
37Mars-GPU vs. Phoenix on Quad-core CPU
The speedup is 1.5-16x across the various data sizes
38Mars-CPU vs. Phoenix
Mars-CPU is 1-5 times as fast as Phoenix
39Overview
- Motivation
- Design
- Implementation
- Evaluation
- Conclusion
40Conclusion
- MapReduce framework on GPUs
- Ease of programming on GPUs
- Promising performance
- Want a Copy of Mars?
- http://www.cse.ust.hk/gpuqp/Mars.html