Title: MapReduce, Hadoop, and MapReduceMerge
1 Map-Reduce, Hadoop, and Map-Reduce-Merge
2 Presentation Overview
- What is map-reduce?
  - input/output data types
  - why is it useful and where is it used?
- Execution overview
- Features
  - fault tolerance
  - ordering guarantee
  - other perks and bonuses
- Hands-on demonstration and follow-along
- Map-reduce-merge
3 What is map-reduce?
- Map-reduce is a programming model (and an associated implementation) for processing and generating large data sets.
- It consists of two steps: map and reduce.
- The map step takes a key/value pair and produces a list of intermediate key/value pairs.
- The reduce step takes a key and the list of that key's values and outputs the final values for that key.
4 Types
- map (k1, v1) → list(k2, v2)
- reduce (k2, list(v2)) → list(v2)
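The two signatures can be made concrete with a small single-process driver. This is only a sketch of the model, not Hadoop's API: `run_map_reduce`, `length_mapper`, and `count_reducer` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map (k1, v1) -> list(k2, v2)
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce (k2, list(v2)) -> list(v2)
Reducer = Callable[[K2, List[V2]], Iterable[V2]]


def run_map_reduce(records, mapper, reducer) -> Dict:
    """Apply mapper to every input pair, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Keys are visited in sorted order, mirroring the sort before the reduce step.
    return {k2: list(reducer(k2, values)) for k2, values in sorted(groups.items())}


def length_mapper(line_no, line):
    # Emit (word length, word) for each word in the line.
    for word in line.split():
        yield len(word), word


def count_reducer(length, words):
    # Emit a single value: how many words had this length.
    yield len(words)


result = run_map_reduce(
    [(0, "go comets go"), (1, "we fight")], length_mapper, count_reducer
)
```

Here the input keys (line numbers) are ignored by the mapper, which is common: k1 often just identifies where a record came from.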
5 Why is this useful?
- Map-reduce jobs are automatically parallelized.
- Partial failure of the processing cluster is expected and tolerable.
- Redundancy and fault tolerance are built in, so the programmer doesn't have to worry about them.
- It scales very well.
- Many jobs are naturally expressible in the map/reduce paradigm.
6 What are some uses?
- Word count
  - map: <word, 1>. reduce: <word, count>
- Grep
  - map: <file, line>. reduce: identity
- Inverted index
  - map: <word, docID>. reduce: <word, list(docID)>
- Distributed sort (special case)
  - map: <key, record>. reduce: identity
- Users: Google, Yahoo!, Amazon, Facebook, etc.
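As an illustration of one of these uses, the inverted index can be sketched in a few lines. This runs in a single process and the function names and sample documents are made up for the example; a real job would distribute the map and reduce calls.

```python
from collections import defaultdict


def index_mapper(doc_id, text):
    # map: emit <word, docID> once for every distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reducer(word, doc_ids):
    # reduce: emit <word, list(docID)> with the posting list sorted.
    return word, sorted(doc_ids)


def build_inverted_index(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, found_in in index_mapper(doc_id, text):
            groups[word].append(found_in)
    return dict(index_reducer(word, ids) for word, ids in sorted(groups.items()))


docs = {"d1": "go comets go", "d2": "comets fight"}  # hypothetical documents
index = build_inverted_index(docs)
```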
8 Execution overview: map
- The user begins a map-reduce job. One of the machines becomes the master.
- The input is partitioned into M splits (16-64 MB each) and distributed among the machines. A worker reads its split and begins work. Upon completion, the worker notifies the master.
- The intermediate keyspace is partitioned into R pieces with a partitioning function.
9 Execution overview: reduce
- When a reduce worker is notified about a job, it uses RPC to read the intermediate data from the mappers, then sorts it by key.
- The reducer processes its job, then writes its output to the final output file for its reduce partition.
- When all reducers are finished, the master wakes up the user program.
10 What are M and R?
- M is the number of map pieces. R is the number of reduce pieces.
- Ideally, M and R are much larger than the number of workers. This allows one machine to perform many different tasks, which improves load balancing and speeds up recovery.
- The master makes O(M+R) scheduling decisions and keeps O(M*R) states in memory.
- At least R files end up being written.
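The execution flow and the roles of M and R can be simulated in one process. The sketch below is an assumption-laden toy (round-robin splits, a CRC-based partitioner, word count as the job), but it shows where the M*R intermediate pieces and the R output files come from.

```python
import zlib
from collections import defaultdict


def partition(key, r):
    # Stable hash so the sketch produces the same buckets on every run.
    return zlib.crc32(key.encode("utf-8")) % r


def run_job(lines, m, r):
    # 1. Partition the input into M splits (round-robin here; by size in a real system).
    splits = [lines[i::m] for i in range(m)]

    # 2. Each map task writes its output into R buckets: M * R intermediate pieces.
    pieces = {}
    for map_id, split in enumerate(splits):
        buckets = [[] for _ in range(r)]
        for line in split:
            for word in line.split():
                buckets[partition(word, r)].append((word, 1))
        for reduce_id in range(r):
            pieces[(map_id, reduce_id)] = buckets[reduce_id]

    # 3. Each reduce task reads its piece from every map task, sorts by key, and reduces.
    outputs = []
    for reduce_id in range(r):
        grouped = defaultdict(list)
        for map_id in range(m):
            for key, value in pieces[(map_id, reduce_id)]:
                grouped[key].append(value)
        outputs.append([(key, sum(vals)) for key, vals in sorted(grouped.items())])
    return pieces, outputs


lines = ["Comets! Go!", "C-O-M-E-T-S! Go!", "Green, Orange, White!"]
pieces, outputs = run_job(lines, m=3, r=2)
```

With m=3 and r=2 the job tracks six intermediate pieces and produces two output lists, one per reduce partition.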
11 Example: counting words
- We have UTD's fight song:
  - C-O-M-E-T-S! Go!
  - Green, Orange, White!
  - Comets! Go!
  - Strong of will, we fight for right!
  - Let's all show our comet might!
- We want to count the number of occurrences of each word.
- The next slides show the map and reduce phases.
12 First stage: map
- Go through the input, and for each word return a tuple of (<word>, 1).
- Output:
  - <C-O-M-E-T-S!, 1>
  - <Go!, 1>
  - <Green,, 1>
  - <Orange,, 1>
  - <White!, 1>
  - <Comets!, 1>
  - <Go!, 1>
  - <Strong, 1>
  - <of, 1>
  - ...
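A minimal sketch of this map stage, assuming words are split on whitespace only (which is why punctuation stays attached to the keys). The name `word_count_map` is illustrative.

```python
fight_song = [
    "C-O-M-E-T-S! Go!",
    "Green, Orange, White!",
    "Comets! Go!",
    "Strong of will, we fight for right!",
    "Let's all show our comet might!",
]


def word_count_map(line):
    # Split on whitespace only, so punctuation stays attached to each word.
    return [(word, 1) for word in line.split()]


mapped = [pair for line in fight_song for pair in word_count_map(line)]
```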
13 Between map and reduce...
- Between the mapper and the reducer, some gears turn within Hadoop, and it groups identical keys and sorts by key before starting the reducer.
- Here's the output:
  - <C-O-M-E-T-S!, 1>
  - <Comets!, 1>
  - <Go!, 1, 1>
  - <Green,, 1>
  - <Orange,, 1>
  - <Strong, 1>
  - <White!, 1>
  - <of, 1>
  - ...
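The grouping and sorting step can be sketched as a sort followed by a group-by. This is a single-machine stand-in for what the framework does across the network; `shuffle` is a hypothetical name, and the input is limited to the nine pairs shown on the map slide.

```python
from itertools import groupby
from operator import itemgetter

# The nine pairs listed on the map slide (the rest of the song is elided there).
mapped = [
    ("C-O-M-E-T-S!", 1), ("Go!", 1), ("Green,", 1), ("Orange,", 1), ("White!", 1),
    ("Comets!", 1), ("Go!", 1), ("Strong", 1), ("of", 1),
]


def shuffle(pairs):
    # Sort by key, then collapse runs of identical keys into (key, [values]).
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


grouped = shuffle(mapped)
```

Plain string comparison puts uppercase letters before lowercase ones, which is why "of" sorts after "White!".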
14 Second stage: reducer
- The reducer receives the content, one key/value-list pair at a time, and does its own processing.
- For wordcount, it sums the values in each list.
- Here's the output:
  - <C-O-M-E-T-S!, 1>
  - <Go!, 2>
  - <Green,, 1>
  - <Orange,, 1>
  - ...
- Then it writes these tuples to the final files in HDFS.
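A sketch of the reduce stage for wordcount, taking grouped pairs as input. Writing to HDFS is left out; the function name and the short input list are illustrative.

```python
def word_count_reduce(word, counts):
    # Sum the list of 1s collected for this word.
    return word, sum(counts)


grouped = [("C-O-M-E-T-S!", [1]), ("Comets!", [1]), ("Go!", [1, 1]), ("Green,", [1])]
reduced = [word_count_reduce(word, counts) for word, counts in grouped]
```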
15 How can we improve our wordcount? Also, any questions?
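One possible improvement, offered as an assumption since the slides do not give their own answer, is to normalize words in the mapper so that "Comets!" and "C-O-M-E-T-S!" count as the same word. The `normalize` rule below (lowercase, keep only letters and apostrophes) is just one choice.

```python
import re
from collections import Counter

fight_song = [
    "C-O-M-E-T-S! Go!",
    "Green, Orange, White!",
    "Comets! Go!",
    "Strong of will, we fight for right!",
    "Let's all show our comet might!",
]


def normalize(word):
    # Lowercase and drop everything except letters and apostrophes.
    return re.sub(r"[^a-z']", "", word.lower())


def improved_map(line):
    for word in line.split():
        token = normalize(word)
        if token:  # skip tokens that were pure punctuation
            yield token, 1


counts = Counter()
for line in fight_song:
    for token, one in improved_map(line):
        counts[token] += one
```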
17 Fault tolerance
- Worker failure is expected. If a worker fails during a map phase, its workload is reassigned to another worker. If a mapper fails during a reduce phase, both phases are re-executed.
- Master failure is not expected, though checkpointing can be used for recovery.
- If a particular record causes the mapper or reducer to reliably crash, the map-reduce system can figure this out, skip the record, and proceed.
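The record-skipping idea can be sketched in-process. This is a simplification: a real system detects crashes across task re-executions on different workers, while here a Python exception stands in for a crash, and `max_attempts` is an assumed threshold for "reliably".

```python
def run_with_skipping(records, mapper, max_attempts=2):
    """Run mapper over records, skipping any record that fails on every attempt."""
    output, skipped = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                output.extend(mapper(record))
                break  # success, move on to the next record
            except Exception:
                if attempt == max_attempts:
                    skipped.append(record)  # failed every time: give up on it
    return output, skipped


def parse_int_mapper(record):
    # Raises ValueError on records that are not integers.
    return [(int(record), 1)]


output, skipped = run_with_skipping(["1", "2", "oops", "3"], parse_int_mapper)
```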
18 Ordering guarantee
- The implementation of map-reduce guarantees that within a given partition, the intermediate key/value pairs are processed in increasing key order.
- This means that each reduce partition ends up with an output file sorted by key.
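The guarantee is per partition. A sketch of how it can be combined with a range-based partitioner (an assumption here, not something the slide specifies) so that the output files, read in order, form one globally sorted sequence:

```python
import bisect


def range_partition(key, boundaries):
    # Partition i holds keys in [boundaries[i-1], boundaries[i]).
    return bisect.bisect_right(boundaries, key)


def sort_job(keys, boundaries):
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[range_partition(key, boundaries)].append(key)
    # Each reducer sees its keys in increasing order; with an identity
    # reducer, each output file is therefore sorted.
    return [sorted(part) for part in partitions]


files = sort_job([42, 7, 19, 88, 3, 61], boundaries=[20, 60])
```

With the default hash partitioner each file would still be sorted, but concatenating the files would not give a sorted result.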
19 Partitioning function
- By default, your reduce tasks will be distributed evenly by using a hash(intermediate-key) mod R function, where R is the number of reduce pieces.
- You can specify a custom partitioning function.
- Useful for locality reasons, such as if the key is a URL and you want all URLs belonging to a single host to be processed on a single machine.
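A sketch of the default and the host-based partitioning functions. CRC32 is used only because it is stable across runs; the function names and example URLs are illustrative, and `r` is the number of reduce partitions.

```python
import zlib
from urllib.parse import urlparse


def default_partition(key, r):
    # hash(intermediate-key) mod r
    return zlib.crc32(key.encode("utf-8")) % r


def host_partition(url, r):
    # Hash only the host so every URL from one site lands in the same partition.
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % r


urls = [
    "http://example.com/a",
    "http://example.com/b?x=1",
    "http://example.org/a",
]
assignments = {url: host_partition(url, 4) for url in urls}
```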
20 Combiner function
- After a map phase, the mapper transmits the entire intermediate data file over the network to the reducer.
- Sometimes this file is highly compressible.
- The user can specify a combiner function. It's just like a reduce function, except it's run by the mapper before passing the job to the reducer.
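A sketch of a combiner for wordcount: it applies the reducer's summing logic to one mapper's local output, so fewer pairs cross the network while the final result stays the same. All names here are illustrative.

```python
from collections import Counter


def map_words(line):
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Same logic as the reducer, applied to one mapper's local output.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(all_pairs):
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


split = ["Go! Go! Go!", "Comets! Go!"]
raw = [pair for line in split for pair in map_words(line)]
combined = combine(raw)
```

This works because addition is associative and commutative; a reduce function without those properties cannot be reused as a combiner unchanged.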
21 Counters
- A counter can be associated with any action that a mapper or a reducer does. This is in addition to default counters such as the number of input and output key/value pairs processed.
- A user can watch the counters in real time to see the progress of a job.
- When the map/reduce job finishes, these counters are provided to the user program.
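A sketch of how counters behave, using a plain in-process `Counter` in place of the framework's aggregated counters. The counter names are made up for the example and are not Hadoop's built-in names.

```python
from collections import Counter

counters = Counter()


def counting_mapper(line):
    counters["map_input_records"] += 1  # stands in for a default counter
    for word in line.split():
        if word.isupper():
            counters["uppercase_words"] += 1  # user-defined counter
        counters["map_output_records"] += 1
        yield word, 1


pairs = [pair for line in ["GO Comets GO", "we fight"] for pair in counting_mapper(line)]
```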
23 What is Hadoop?
- Hadoop is the implementation of the map/reduce design that we will use.
- Hadoop is released under the Apache License 2.0, so it's open source.
- Hadoop uses the Hadoop Distributed File System, HDFS. (In contrast to what we've seen with Lucene.)
- Get the release from:
  - http://hadoop.apache.org/core/
24 Preparing Hadoop on your system
- Configure passwordless public-key SSH on localhost
- Configure Hadoop
  - look at the two configuration files at http://utdallas.edu/pmw033000/hadoop/
- Format the HDFS
  - bin/hadoop namenode -format
- Start Hadoop
  - cd <hadoop-dir>
  - bin/start-all.sh (and wait 20 seconds)
25 Example: grep
- Standard Unix 'grep' behavior: run it on the command line with the search string as the first argument and the list of files or directories as the subsequent argument(s).
- grep HelloWorld file1.c file2.c file3.c
- file2.c:System.out.println("I say HelloWorld!")
26 Preparing for 'grep' in Hadoop
- Hadoop's jobs always operate within the HDFS.
- Hadoop will read its input from HDFS, and will write its output to HDFS.
- Thus, to prepare:
  - Download a free electronic book
    - http://utdallas.edu/pmw033000/hadoop/book.txt
  - Load the file into HDFS
    - bin/hadoop fs -copyFromLocal book.txt /book.txt
27 Using 'grep' within Hadoop
- bin/hadoop jar \
    hadoop-0.18.2-examples.jar \
    grep /book.txt /grep-result \
    "search string"
- bin/hadoop fs -ls /grep-result
- bin/hadoop fs -cat /grep-result/part-00000
- A good string to try: "Horace de \S"
- Between job runs: bin/hadoop fs -rmr /grep-result
28 How 'grep' in Hadoop works
- The program runs two map/reduce jobs in sequence. The first job counts how many times each matching string occurred, and the second job sorts the matching strings by their frequency and stores the output in a single output file.
- Each mapper of the first job takes a line as input and matches the user-provided regular expression against the line. It extracts all matching strings and emits (matching string, 1) pairs. Each reducer sums the frequencies of each matching string. The output is sequence files containing the matching string and count. The reduce phase is optimized by running a combiner that sums the frequency of strings from local map output. As a result, less data needs to be shipped to a reduce task.
- The second job takes the output of the first job as input. The mapper is an inverse map (it swaps each key and value), while the reducer is an identity reducer. The number of reducers is one, so the output is stored in one file, sorted by count in descending order. The output file is text, each line of which contains a count and a matching string.
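The two jobs can be imitated in a single process. This is a sketch of the logic only, not the code of Hadoop's example: the function names are hypothetical and the input lines are invented for illustration (they are not taken from book.txt).

```python
import re
from collections import Counter


def grep_count_job(lines, pattern):
    # Job 1: map emits (match, 1) for every match; reduce sums per match.
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for match in regex.findall(line):
            counts[match] += 1
    return counts


def sort_job(counts):
    # Job 2: the inverse map swaps each pair to (count, match); a single
    # reducer then outputs the pairs in descending order of count.
    inverted = [(count, match) for match, count in counts.items()]
    return sorted(inverted, key=lambda pair: pair[0], reverse=True)


# Made-up input lines, used only to exercise the two jobs.
lines = ["Horace de Vere rode on", "said Horace de Vere", "Horace de Witt"]
result = sort_job(grep_count_job(lines, r"Horace de \S"))
```

Because the pattern ends in a single `\S`, each match includes only the first character after "Horace de ".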
29 Another example: word count
- bin/hadoop jar hadoop-0.18.2-examples.jar \
    wordcount /book.txt /wc-result
- bin/hadoop fs -cat /wc-result/part-00000 \
    | sort -n -k 2
- You can also try passing a -r option to increase the number of parallel reducers.
- Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
- As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining each word into a single record.
30 Presentation Overview
- What is map-reduce?
  - input/output data types
  - why is it useful and where is it used?
- Execution overview
- Features
  - fault tolerance
  - ordering guarantee
  - other perks and bonuses
- Hands-on demonstration and follow-along
- Map-reduce-merge (a proposal, not implemented)
31 Does map-reduce satisfy all needs?
- Map-reduce is great for homogeneous data, such as grepping a large collection of files or word-counting a huge document.
- Joining heterogeneous databases does not work well.
- As is, we'd need additional map-reduce steps, such as map-reducing one database and reading from the others on the fly.
- We want to support relational algebra.
32 Solution
- The solution to these problems is map-reduce-merge. It is map-reduce with an additional merging step.
- The merge phase makes it easier to process data relationships among heterogeneous data sets.
- Types
  - map: (k1, v1)α → [(k2, v2)]α
  - reduce: (k2, [v2])α → (k2, [v3])α (notice that the output v is a list)
  - merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ
- If α = β, then the merging step performs a self-merge (self-join in relational algebra).
33 New terms
- Partition selector: determines which data partitions produced by reducers should be retrieved for merging.
- Processor: user-defined logic of processing data from an individual source.
- Merger: user-defined logic of processing data merged from two sources where data satisfies a merge condition.
- Configurable iterator: see the next slide.
34 Configurable iterators
- The map and reduce user-defined functions get one iterator for the values.
- The merge function gets two iterators, one for each data source.
- The iterators do not have to move forward; they can be instrumented to do whatever the user wants.
- Relational join algorithms have specific patterns for the merging step.
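As one example of such a pattern, a sort-merge join advances two cursors over key-sorted inputs, moving whichever side is behind. The sketch below uses list indices in place of the proposal's configurable iterators, and the function name and sample data are illustrative.

```python
def sort_merge_join(left, right):
    """Join two key-sorted lists of (key, value) pairs on equal keys."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # the left cursor moves alone
        elif left_key > right_key:
            j += 1  # the right cursor moves alone
        else:
            # Find the run of matching keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every matching left row with every matching right row.
            while i < len(left) and left[i][0] == left_key:
                out.extend((left_key, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


left = [("A", 0), ("A", 250), ("B", 150)]
right = [("A", 0.95), ("B", 1.15)]
joined = sort_merge_join(left, right)
```

Other join algorithms, such as nested-loop or hash joins, would drive the two cursors in different patterns.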
35 Map-reduce-merge example
- Table A: emp-id, dept-id, bonus
  - 1, B, 100
  - 1, B, 50
  - 2, A, 0
  - 3, A, 150
  - 3, A, 100
- Table B: dept-id, bonus-adjust
  - B, 1.15
  - A, 0.95
- Final table: emp-id, bonus
  - 2, 0
  - 3, 237.5
  - 1, 172.5
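The final table can be reproduced with a short sketch. The slide does not spell out the intermediate steps, so the decomposition below (sum bonuses per department and employee, then merge with the adjustment table on dept-id) is one plausible reading, and the function names are illustrative.

```python
from collections import defaultdict

table_a = [(1, "B", 100), (1, "B", 50), (2, "A", 0), (3, "A", 150), (3, "A", 100)]
table_b = [("B", 1.15), ("A", 0.95)]


def map_reduce_bonuses(rows):
    # map: (emp-id, dept-id, bonus) -> ((dept-id, emp-id), bonus)
    # reduce: sum the bonuses for each (dept-id, emp-id) key
    totals = defaultdict(int)
    for emp_id, dept_id, bonus in rows:
        totals[(dept_id, emp_id)] += bonus
    return sorted(totals.items())


def merge(bonus_totals, adjustments):
    # merge: match on dept-id and scale each total by the department's factor
    factor = dict(adjustments)
    return {
        emp_id: total * factor[dept_id]
        for (dept_id, emp_id), total in bonus_totals
    }


final = merge(map_reduce_bonuses(table_a), table_b)
```

Sorting the reduced output by (dept-id, emp-id) gives the row order 2, 3, 1 seen in the final table.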
36 Map-reduce-merge diagram