Title: CS347: Map-Reduce
1 CS347: Map-Reduce and Pig
- Hector Garcia-Molina
- Stanford University
2 "Big Data" Open Source Systems
- Infrastructure for distributed data computations
  - Map-Reduce, S4, Hyracks, Pregel, Storm, Muppet
- Components
  - Memcached, ZooKeeper, Kestrel
- Data services
  - Pig, F1, Cassandra, HBase, Bigtable, Hive
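3 Motivation for Map-Reduce
Recall one of our sort strategies:
[Figure: fragments R1, R2, R3 are repartitioned by key range using split keys k0 and k1; each node then sorts its partition locally, and the sorted partitions together form the result. Phases marked: "process data, partition" and "additional processing".]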
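4 Another example: Asymmetric fragment-and-replicate join
[Figure: partition function f distributes the data into fragments Ra, Sa and Rb, Sb; each node performs a local join on its fragments, and the union of the local joins is the result. Phases marked: "process data, partition" and "additional processing".]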
5 Building Text Index - Part I
The original Map-Reduce application.
[Figure: a page stream (page 1: rat, dog; page 2: dog, cat; page 3: rat, dog) goes through loading, tokenizing, and sorting. Tokenizing yields (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3); sorting yields (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3); flushing writes the sorted postings to disk as intermediate runs.]
6 Building Text Index - Part II
[Figure: the intermediate runs are merged to produce the final index.]
7 Generalizing: Map-Reduce
[Figure: the Part I index-building pipeline (page stream, loading, tokenizing, sorting, flushing intermediate runs to disk), now labeled Map. Tokenizing yields (rat, 1) (dog, 1) (dog, 2) (cat, 2) (rat, 3) (dog, 3); sorting yields (cat, 2) (dog, 1) (dog, 2) (dog, 3) (rat, 1) (rat, 3).]
8 Generalizing: Map-Reduce
[Figure: the Part II merge of intermediate runs into the final index, now labeled Reduce.]
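9 Map Reduce
- Input: $R = \{r_1, r_2, \ldots, r_n\}$, functions $M$, $R$
- $M(r_i) \rightarrow \{[k_1, v_1], [k_2, v_2], \ldots\}$
- $R(k_i, \mathit{valSet}) \rightarrow [k_i, \mathit{valSet}']$
- Let $S = \{[k, v] \mid [k, v] \in M(r) \text{ for some } r \in R\}$ (S is a bag)
- Let $K = \{k \mid [k, v] \in S \text{ for any } v\}$
- Let $G(k) = \{v \mid [k, v] \in S\}$ (G is a bag)
- Output $= \{[k, T] \mid k \in K,\ T = R(k, G(k))\}$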
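The definition above can be evaluated directly on one machine, which may help when checking what a given M and R should produce. The following is a minimal sequential sketch, not the distributed implementation. The names `map_reduce`, `index_map`, and `index_reduce` are hypothetical, and the input reuses the three pages from the text-index example.

```python
from collections import defaultdict

def map_reduce(records, M, R):
    """Evaluate the Map-Reduce definition sequentially on one machine."""
    # S: bag of every [k, v] pair emitted by M over the input records
    S = [pair for r in records for pair in M(r)]
    # G(k): bag of the values paired with k in S; the keys of G form K
    G = defaultdict(list)
    for k, v in S:
        G[k].append(v)
    # Output: one [k, T] per k in K, with T = R(k, G(k))
    return {k: R(k, values) for k, values in G.items()}

# Hypothetical M and R for the text-index example; records are (page, text)
def index_map(record):
    page, text = record
    return [(word, page) for word in text.split()]

def index_reduce(word, pages):
    return sorted(pages)

pages = [(1, "rat dog"), (2, "dog cat"), (3, "rat dog")]
index = map_reduce(pages, index_map, index_reduce)
print(index)
```

Because S and G(k) are bags, duplicates emitted by M are kept and passed to R.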
10 References
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, available at http://labs.google.com/papers/mapreduce-osdi04.pdf
- Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, available at http://wiki.apache.org/pig/
11 Example: Counting Word Occurrences
- map(String doc, String value):
    // doc is document name
    // value is document content
    for each word w in value:
      EmitIntermediate(w, "1");
- Example:
  - map(doc, "cat dog cat bat dog") emits [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]
- Why does map have 2 parameters?
12 Example: Counting Word Occurrences
- reduce(String key, Iterator values):
    // key is a word
    // values is a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
- Example:
  - reduce("dog", "1 1 1 1") emits "4"
Should it emit ("dog", 4)?
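The map and reduce pseudocode for word counting can be written as ordinary functions to check the two examples. This is a sketch under one assumption: `EmitIntermediate` and `Emit` are modeled as return values rather than framework calls, and the names `map_fn` and `reduce_fn` are illustrative.

```python
def map_fn(doc, value):
    # doc is the document name (unused here); value is the document content
    return [(w, "1") for w in value.split()]

def reduce_fn(key, values):
    # key is a word; values is an iterable of counts encoded as strings
    result = 0
    for v in values:
        result += int(v)
    return str(result)

emitted = map_fn("doc", "cat dog cat bat dog")
total = reduce_fn("dog", ["1", "1", "1", "1"])
print(emitted)
print(total)
```

Here `reduce_fn` returns only the count, as on the slide. Returning `(key, count)` instead would be a one-line change.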
13 Google MR Overview
14 Implementation Issues
- Combine function
- File system
- Partition of input, keys
- Failures
- Backup tasks
- Ordering of results
15 Combine Function
[Figure: without combining, map workers send cat 1, cat 1, cat 1... and dog 1, dog 1... to the reduce workers; with combining, they send cat 3... and dog 2...]
Combine is like a local reduce applied before distribution.
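A combine step for word counting can be sketched as a per-worker aggregation of the map output. The function names below are illustrative, and the sketch assumes the reduce function is a sum, so that applying it locally first does not change the final result.

```python
from collections import Counter

def map_words(text):
    # Map output of one worker: one (word, 1) pair per occurrence
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Local reduce on the map worker, applied before pairs are distributed
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

raw = map_words("cat cat cat dog dog")
combined = combine(raw)
print(len(raw), len(combined))
print(combined)
```

The worker sends two pairs instead of five, which is the saving the figure illustrates.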
16 Distributed File System
- Reduce worker must be able to access local disks on map workers
- All data transfers are through distributed file system
- Any worker must be able to write its part of answer; answer is left as distributed file
- Worker must be able to access any part of input file
17 Partition of input, keys
- How many workers, partitions of input file?
- How many workers? Best to have many splits per worker: this improves load balance, and if a worker fails, it is easier to spread its tasks.
- How many splits?
- Should workers be assigned to splits near them?
- Similar questions for reduce workers
[Figure: an input file divided into splits 1, 2, 3, ..., 9, assigned to workers.]
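One simple way to answer both partitioning questions is sketched below: the input is cut into more splits than there are workers, and intermediate keys are assigned to reduce workers by hashing. Both functions are hypothetical helpers. The round-robin split and the CRC32 hash are assumptions made for illustration, not the policy of any particular system.

```python
import zlib

def split_input(records, num_splits):
    # Deal records round-robin into num_splits input splits
    return [records[i::num_splits] for i in range(num_splits)]

def reduce_partition(key, num_reducers):
    # Stable hash so every map worker sends a given key to the same reducer
    return zlib.crc32(key.encode("utf-8")) % num_reducers

splits = split_input(list(range(10)), 3)
print(splits)
for word in ["cat", "dog", "rat"]:
    print(word, reduce_partition(word, 4))
```

`zlib.crc32` is used instead of Python's built-in `hash` because string hashes vary between processes, and all map workers must agree on the assignment.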
18 Failures
- Distributed implementation should produce same output as would have been produced by a non-faulty sequential execution of the program.
- General strategy: Master detects worker failures, and has work re-done by another worker.
[Figure: the master checks on the worker handling split j ("ok?"); on failure it has another worker redo split j.]
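The general strategy can be illustrated with a toy master loop that hands each split to a worker and, when that worker fails, has another worker redo the same split. Everything here is a simplification: workers are plain functions, a failure is modeled as a `RuntimeError`, and the names `run_job`, `healthy`, and `broken` are hypothetical.

```python
def run_job(splits, workers):
    # Toy master: try workers in turn until one finishes split j
    results = {}
    for j, split in enumerate(splits):
        for worker in workers:
            try:
                results[j] = worker(split)
                break
            except RuntimeError:
                continue  # worker failed: redo split j on the next worker
        else:
            raise RuntimeError(f"no worker could process split {j}")
    return results

def healthy(split):
    return sum(split)

def broken(split):
    raise RuntimeError("worker down")

results = run_job([[1, 2], [3, 4]], [broken, healthy])
print(results)
```

Because each split is processed by a deterministic function, redoing it elsewhere yields the same output as a failure-free sequential run.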
19 Backup Tasks
- A straggler is a machine that takes unusually long (e.g., bad disk) to finish its work.
- A straggler can delay final completion.
- When the overall job is close to finishing, master schedules backup executions for the remaining tasks.
- Must be able to eliminate redundant results.
20 Ordering of Results
- Final result (at each node) is in key order: [k1, T1] [k2, T2] [k3, T3] [k4, T4]
- Pairs such as [k1, v1] [k3, v3] are also in key order
21 Example: Sorting Records
- Map: extract k, output [k, record]
- Reduce: Do nothing!
[Figure: sorting data flow among workers W1, W2, W3, W5, W6. Annotation: one or two records for k6?]
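The sorting example relies on the framework delivering each reduce worker's pairs in key order, so the reduce function itself has nothing to do. The sketch below makes two assumptions that the slide does not state: keys are range-partitioned among reduce workers using given boundaries, and records with equal keys are all kept. The name `mr_sort` is illustrative.

```python
import bisect

def mr_sort(records, key, boundaries):
    # Map: extract k, output [k, record]
    pairs = [(key(r), r) for r in records]
    # Partition: range-partition keys among len(boundaries) + 1 reduce workers
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for k, r in pairs:
        partitions[bisect.bisect_right(boundaries, k)].append((k, r))
    # Each partition arrives in key order; reduce does nothing
    output = []
    for part in partitions:
        part.sort(key=lambda kr: kr[0])
        output.extend(r for _, r in part)
    return output

records = [("k6", "f"), ("k2", "b"), ("k6", "g"), ("k1", "a")]
result = mr_sort(records, key=lambda r: r[0], boundaries=["k3"])
print(result)
```

In this sketch both records with key k6 survive, which is one possible answer to the question in the figure. A reduce function that emitted one record per key would give the other.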
22 Other Issues
- Skipping bad records
- Debugging
23 MR Claimed Advantages
- Model easy to use, hides details of parallelization, fault recovery
- Many problems expressible in MR framework
- Scales to thousands of machines
24 MR Possible Disadvantages
- 1-input, 2-stage data flow is rigid, hard to adapt to other scenarios
- Custom code needs to be written even for the most common operations, e.g., projection and filtering
- Opaque nature of map, reduce functions impedes optimization
25 Questions
- Can MR be made more declarative?
- How can we perform joins?
- How can we perform approximate grouping?
  - Example: for all keys that are similar, reduce all values for those keys
26 Additional Topics
- Hadoop: open-source Map-Reduce system
- Pig: Yahoo system that builds on MR but is more declarative
27 Pig & Pig Latin
- A layer on top of map-reduce (Hadoop)
  - Pig is the system
  - Pig Latin is the query language
- Pig Latin is a hybrid between
  - a high-level declarative query language in the spirit of SQL
  - low-level, procedural programming à la map-reduce
28 Example
- Table urls: (url, category, pagerank)
- Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL:
- SELECT category, AVG(pagerank)
  FROM urls WHERE pagerank > 0.2
  GROUP BY category HAVING COUNT(*) > 10^6
29 Example in Pig Latin
- SELECT category, AVG(pagerank)
  FROM urls WHERE pagerank > 0.2
  GROUP BY category HAVING COUNT(*) > 10^6
- In Pig Latin:
- good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
30 good_urls = FILTER urls BY pagerank > 0.2
urls: (url, category, pagerank)
good_urls: (url, category, pagerank)
31 groups = GROUP good_urls BY category
good_urls: (url, category, pagerank)
groups: (category, good_urls)
32 big_groups = FILTER groups BY COUNT(good_urls) > 1
groups: (category, good_urls)
big_groups: (category, good_urls)
33 output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
big_groups: (category, good_urls)
output: (category, AVG(good_urls.pagerank))
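The four Pig Latin steps can be traced on a tiny table to see what each intermediate relation holds. The sketch below mirrors them in plain Python. The rows in `urls` are made-up illustration data, and the group-size threshold is 1, as in the step-by-step slides, rather than 10^6.

```python
from collections import defaultdict

# Made-up rows of (url, category, pagerank)
urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.25),
    ("c.com", "news", 0.125),
    ("d.org", "sports", 0.5),
]
MIN_GROUP_SIZE = 1  # stands in for 10^6 on this toy input

# good_urls = FILTER urls BY pagerank > 0.2
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category
groups = defaultdict(list)
for u in good_urls:
    groups[u[1]].append(u)

# big_groups = FILTER groups BY COUNT(good_urls) > MIN_GROUP_SIZE
big_groups = {c: g for c, g in groups.items() if len(g) > MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
output = {c: sum(u[2] for u in g) / len(g) for c, g in big_groups.items()}
print(output)
```

Each statement consumes the relation produced by the previous one, which is the dataflow style the Pig Latin program makes explicit.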
34 Features
- Similar to specifying a query execution plan (i.e., a dataflow graph), thereby making it easier for programmers to understand and control how their data processing task is executed.
- Support for a flexible, fully nested data model
- Extensive support for user-defined functions
- Ability to operate over plain input files without any schema information.
- Novel debugging environment useful when dealing with enormous data sets.
35 Execution Control: Good or Bad?
- Example:
  spam_urls = FILTER urls BY isSpam(url);
  culprit_urls = FILTER spam_urls BY pagerank > 0.8;
- Should system re-order filters?
36 User Defined Functions
- Example:
- groups = GROUP urls BY category;
- output = FOREACH groups GENERATE category, top10(urls);
Annotation on top10(urls): should be groups.url?
groups:
  .gov { (x.fbi.gov, .gov, 0.7) ... }
  .edu { (y.yale.edu, .edu, 0.5) ... }
  .com { (z.cnn.com, .com, 0.9) ... }
UDF top10 can return scalar or set.
output:
  .gov { (fbi.gov) (cia.gov) ... }
  .edu { (yale.edu) ... }
  .com { (cnn.com) (ibm.com) ... }
37 Data Model
- Atom, e.g., 'alice'
- Tuple, e.g., ('alice', 'lakers')
- Bag, e.g., { ('alice', 'lakers') ('alice', ('iPod', 'apple')) }
- Map, e.g., [ 'fan of' → { ('lakers') ('iPod') }, 'age' → 20 ]
Note: Bags can currently only hold tuples. So {1, 2, 3} is stored as {(1) (2) (3)}
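As a rough analogy, the four kinds of value map onto familiar Python types. This is only an illustration of the nesting, not how Pig stores data: strings stand in for atoms, Python tuples for tuples, lists of tuples for bags, and dicts for maps. The helper `to_bag` is hypothetical.

```python
atom = "alice"
tup = ("alice", "lakers")
# A bag is a collection of tuples; tuples may nest other tuples
bag = [("alice", "lakers"), ("alice", ("iPod", "apple"))]
# A map takes atom keys to arbitrary values, including bags
data_map = {"fan of": [("lakers",), ("iPod",)], "age": 20}

def to_bag(values):
    # Bags hold only tuples, so each atom is wrapped in a 1-tuple
    return [(v,) for v in values]

print(to_bag([1, 2, 3]))
```

The wrapping in `to_bag` corresponds to the note that {1, 2, 3} is stored as {(1) (2) (3)}.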
38 Expressions in Pig Latin
Annotations: Should be {(1) (2)}. See flatten examples ahead.
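39 Specifying Input Data
- queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
Annotations: queries is a handle for future use; 'query_log.txt' is the input file; myLoad() is a custom deserializer; (userId, queryString, timestamp) is the output schema.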
40 For Each
- expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
- See example next slide
- Note: each tuple is processed independently, which is good for parallelism
- To remove one level of nesting:
  expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
41 ForEach and Flattening
Annotations: "lakers rumors" is a single string value; plus userId.
42 Flattening Example (Fill In)
X: (A, B, C)
Y = FOREACH X GENERATE A, FLATTEN(B), C
43 Flattening Example (Fill In)
Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)
Is Z = Z' where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C) ?
44 Flattening Example
X: (A, B, C)
Y = FOREACH X GENERATE A, FLATTEN(B), C
Note: first tuple is (a1, b1, b2, {(c1)(c2)})
Flatten is not recursive.
Note: attribute naming gets complicated. For example, $2 for first tuple is b2; for third tuple it is {(c1)(c2)}.
45 Flattening Example
Y = FOREACH X GENERATE A, FLATTEN(B), C
Z = FOREACH Y GENERATE A, B, FLATTEN(C)
Note that Z = Z' where
Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)
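The claim that flattening B and then C gives the same result as flattening both at once can be checked on a small relation. In this sketch bags are modeled as Python lists of tuples, flattening several bags in one statement is modeled as a cross product within each row, and the contents of X are made up apart from the first tuple.

```python
from itertools import product

# X: (A, B, C) where B and C are bags (lists of tuples)
X = [
    ("a1", [("b1", "b2")], [("c1",), ("c2",)]),
    ("a2", [("b3", "b4"), ("b5", "b6")], [("c3",)]),
]

# Y = FOREACH X GENERATE A, FLATTEN(B), C  (one level only; C stays a bag)
Y = [(a,) + b + (C,) for a, B, C in X for b in B]

# Z = FOREACH Y GENERATE A, B, FLATTEN(C)  (C is the last field of each Y row)
Z = [row[:-1] + c for row in Y for c in row[-1]]

# Z' = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)
Z_prime = [(a,) + b + c for a, B, C in X for b, c in product(B, C)]

print(Y[0])
print(Z == Z_prime)
```

The first row of Y is `("a1", "b1", "b2", [("c1",), ("c2",)])`, matching the note that the bag C is left intact by the first flatten.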
46 Filter
- real_queries = FILTER queries BY userId neq 'bot';
- real_queries = FILTER queries BY NOT isBot(userId);
Annotation: isBot is a UDF.
47 Co-Group
- Two data sets, for example:
  - results: (queryString, url, position)
  - revenue: (queryString, adSlot, amount)
- grouped_data = COGROUP results BY queryString, revenue BY queryString;
- url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));
- Co-Group more flexible than SQL JOIN
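The COGROUP step can be pictured as building, for each key, one bag of matching tuples per input. The following sketch models that with a hypothetical `cogroup` helper over lists of tuples. The sample rows are made up, and the user-defined `distributeRevenue` function is not modeled.

```python
from collections import defaultdict

def cogroup(left, left_key, right, right_key):
    # For each key, collect a bag of left tuples and a bag of right tuples
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return [(k, l, r) for k, (l, r) in sorted(groups.items())]

# Made-up rows: results (queryString, url, position), revenue (queryString, adSlot, amount)
results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped_data = cogroup(results, lambda t: t[0], revenue, lambda t: t[0])
for row in grouped_data:
    print(row)
```

Unlike a join, the two bags stay separate in each output tuple, so a later function can combine them in any way it likes.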
48 CoGroup vs Join
49 Group (Simple CoGroup)
- grouped_revenue = GROUP revenue BY queryString;
- query_revenues = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;
50 CoGroup Example 1
X: (A, B, C)
Y: (A, B, D)
Z1 = GROUP X BY A
Z1: (A, X)
51 CoGroup Example 1
X: (A, B, C)
Y: (A, B, D)
Z1 = GROUP X BY A
Z1: (A, X)
52 CoGroup Example 2
X: (A, B, C)
Y: (A, B, D)
Syntax not in paper but being added:
Z2 = GROUP X BY (A, B)
Z2: (?, X)
53 CoGroup Example 2
X: (A, B, C)
Y: (A, B, D)
Syntax not in paper but being added:
Z2 = GROUP X BY (A, B)
Z2: (A/B?, X)
54 CoGroup Example 3
X: (A, B, C)
Y: (A, B, D)
Z3 = COGROUP X BY A, Y BY A
Z3: (A, X, Y)
55 CoGroup Example 3
X: (A, B, C)
Y: (A, B, D)
Z3 = COGROUP X BY A, Y BY A
Z3: (A, X, Y)
56 CoGroup Example 4
X: (A, B, C)
Y: (A, B, D)
Z4 = COGROUP X BY A, Y BY B
Z4: (A, X, Y)
57 CoGroup Example 4
X: (A, B, C)
Y: (A, B, D)
Z4 = COGROUP X BY A, Y BY B
Z4: (A, X, Y)
58 CoGroup With Function Call?
X: (A, B)
Y = GROUP X BY A
Z = GROUP X BY SUM(A)   (SUM adds integers in tuple)
Y: (A, X)
Z: (?, X)
59 CoGroup With Function Call?
X: (A, B)
Y = GROUP X BY A
Z = GROUP X BY SUM(A)   (SUM adds integers in tuple)
Y: (A, X)
Z: (SUM(A)/A?, X)
60 Pig Latin Join
- join_result = JOIN results BY queryString, revenue BY queryString;
- Shorthand for:
- temp_var = COGROUP results BY queryString, revenue BY queryString;
- join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);
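The equivalence between JOIN and COGROUP followed by two FLATTENs can be sketched in a few lines. The model assumed here is that flattening two bags in the same statement produces their cross product within each group, so a key missing from either input contributes no rows. The function name `join` and the sample rows are illustrative.

```python
from collections import defaultdict
from itertools import product

def join(left, left_key, right, right_key):
    # COGROUP step: one bag per input for every key
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    # FLATTEN both bags: cross product within each group;
    # an empty bag on either side yields no output rows
    return [l + r
            for _, (ls, rs) in sorted(groups.items())
            for l, r in product(ls, rs)]

results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("bulls", "side", 10)]

join_result = join(results, lambda t: t[0], revenue, lambda t: t[0])
for row in join_result:
    print(row)
```

Only "lakers" appears in both inputs, so the output has two rows. The "kings" and "bulls" groups each have one empty bag and disappear, which gives inner-join behavior.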
61 MapReduce in Pig Latin
- map_result = FOREACH input GENERATE FLATTEN(map(*));
- key_groups = GROUP map_result BY $0;
- output = FOREACH key_groups GENERATE reduce(*);
Annotations: $0 means the key is the first attribute; * means all attributes.
62 Store
- To materialize result in a file:
- STORE query_revenues INTO 'myoutput' USING myStore();
Annotations: 'myoutput' is the output file; myStore() is a custom serializer.
63 Hadoop
- HDFS: Hadoop file system
- How to use Hadoop, examples