MapReduce, Hadoop, and MapReduceMerge - PowerPoint PPT Presentation

Provided by: coursesI

1
Map-Reduce, Hadoop, and Map-Reduce-Merge
2
Presentation Overview

What is map-reduce?
input/output data types
why is it useful and where is it used?
Execution overview
Features
fault tolerance
ordering guarantee
other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge

3
What is map-reduce?

Map-reduce is a programming model (and an
associated implementation) for processing and
generating large data sets.
It consists of two steps: map and reduce.
The map step takes a key/value pair and
produces a list of intermediate key/value pairs.
The reduce step takes a key and a list of the
key's values and outputs the final list of values.

4
Types

map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)
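
As a rough illustration of these signatures, here is a minimal single-process sketch in Python. The names `run_map_reduce`, `word_lengths`, and `distinct_words` are hypothetical, and nothing here is distributed; the sketch only shows how the input, intermediate, and output types line up.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """records: iterable of (k1, v1); map_fn: (k1, v1) -> list of (k2, v2);
    reduce_fn: (k2, list of v2) -> list of output values."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)          # collect every v2 under its k2
    # Call reduce once per intermediate key, in increasing key order.
    return {k2: list(reduce_fn(k2, values)) for k2, values in sorted(groups.items())}

def word_lengths(_line_no, line):
    # map: (line number, line) -> list of (word length, word)
    return [(len(word), word) for word in line.split()]

def distinct_words(_length, words):
    # reduce: (word length, list of words) -> list of distinct words
    return sorted(set(words))

result = run_map_reduce([(1, "go comets go"), (2, "we fight")],
                        word_lengths, distinct_words)
print(result)
```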

5
Why is this useful?

Map-reduce jobs are automatically parallelized.
Partial failure of the processing cluster is
expected and tolerable.
Redundancy and fault tolerance are built in, so
the programmer doesn't have to worry.
It scales very well.
Many jobs are naturally expressible in the
map/reduce paradigm.

6
What are some uses?

Word count
map: <word, 1>. reduce: <word, count>
Grep
map: <file, line>. reduce: identity
Inverted index
map: <word, docID>. reduce: <word, list(docID)>
Distributed sort (special case)
map: <key, record>. reduce: identity
Users: Google, Yahoo!, Amazon, Facebook, etc.
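
To make one of these uses concrete, the inverted index could be sketched in Python as follows. This is a toy in-memory version; `map_inverted_index`, `reduce_inverted_index`, `build_index`, and the two sample documents are made up for illustration.

```python
from collections import defaultdict

def map_inverted_index(doc_id, text):
    # Emit (word, docID) once per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_inverted_index(word, doc_ids):
    # Emit (word, list(docID)) with the document IDs in sorted order.
    return word, sorted(doc_ids)

def build_index(docs):
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_inverted_index(doc_id, text):
            groups[word].append(emitted_id)
    return dict(reduce_inverted_index(word, ids) for word, ids in groups.items())

index = build_index({"d1": "go comets go", "d2": "comets fight"})
print(index["comets"])
```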

8
Execution overview: map

The user begins a map-reduce job. One of the
machines becomes the master.
The input is partitioned into M splits (16-64 MB
each) and distributed among the machines. A worker
reads its split and begins work. Upon
completion, the worker notifies the master.
The master partitions the intermediate keyspace
into R pieces with a partitioning function.

9
Execution overview: reduce

When a reduce worker is notified about a job, it
uses RPC to read the intermediate data from a
mapper, then sorts it by key.
The reducer processes its job, then writes its
output to the final output file for its reduce
partition.
When all reducers are finished, the master wakes
up the user program.

10
What are M and R?

M is the number of map pieces. R is the number
of reduce pieces.
Ideally, M and R are much larger than the number
of workers. This allows one machine to perform
many different tasks, which improves load
balancing and speeds up recovery.
The master makes O(M + R) scheduling decisions and
keeps O(M × R) state in memory.
At least R files end up being written.
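
The execution flow can be imitated in one process to see where M and R enter. The sketch below is a simplification under stated assumptions: lists stand in for files and RPC, `zlib.crc32` stands in for the partitioning hash, and all function names are hypothetical. Each of the M map tasks writes R buckets, and each of the R reduce tasks reads its bucket from every map task, sorts it by key, and reduces it.

```python
import zlib
from itertools import groupby

def partition(key, r):
    # Deterministic stand-in for the partitioning function.
    return zlib.crc32(key.encode("utf-8")) % r

def split_input(lines, m):
    # Divide the input into M splits (round-robin, for simplicity).
    return [lines[i::m] for i in range(m)]

def run_map_task(split, map_fn, r):
    # One map task produces R buckets of intermediate pairs.
    buckets = [[] for _ in range(r)]
    for line in split:
        for key, value in map_fn(line):
            buckets[partition(key, r)].append((key, value))
    return buckets

def run_reduce_task(index, map_outputs, reduce_fn):
    # Gather bucket `index` from every map task, sort by key, then reduce.
    pairs = sorted(pair for buckets in map_outputs for pair in buckets[index])
    return [(key, reduce_fn(key, [value for _, value in group]))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

def run_job(lines, map_fn, reduce_fn, m=3, r=2):
    map_outputs = [run_map_task(s, map_fn, r) for s in split_input(lines, m)]
    # One output list per reduce partition, so R outputs in total.
    return [run_reduce_task(i, map_outputs, reduce_fn) for i in range(r)]

outputs = run_job(["go comets go", "we fight", "go"],
                  lambda line: [(word, 1) for word in line.split()],
                  lambda key, values: sum(values), m=2, r=3)
print(outputs)
```

Each inner list plays the role of one reduce partition's output file, and each is sorted by key.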

11
Example: counting words

We have UTD's fight song:
C-O-M-E-T-S! Go!
Green, Orange, White!
Comets! Go!
Strong of will, we fight for right!
Let's all show our comet might!
We want to count the number of occurrences of
each word.
The next slides show the map and reduce phases.

12
First stage: map

Go through the input, and for each word return a
tuple of (<word>, 1).
Output:
<C-O-M-E-T-S!, 1>
<Go!, 1>
<Green,, 1>
<Orange,, 1>
<White!, 1>
<Comets!, 1>
<Go!, 1>
<Strong, 1>
<of, 1>
...

13
Between map and reduce...

Between the mapper and the reducer, some gears
turn within Hadoop, and it groups identical keys
and sorts by key before starting the reducer.
Here's the output:
<C-O-M-E-T-S!, 1>
<Comets!, 1>
<Go!, 1,1>
<Green,, 1>
<Orange,, 1>
<Strong, 1>
<White!, 1>
<of, 1>
...

14
Second stage: reducer

The reducer receives the content, one
key/value-list pair at a time, and does its own
processing.
For wordcount, it sums the values in each list.
Here's the output:
<C-O-M-E-T-S!, 1>
<Go!, 2>
<Green,, 1>
<Orange,, 1>
Then it writes these tuples to the final files in
HDFS.
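
The three stages of the fight-song example can be reproduced in a few lines of Python. This is a single-process sketch, not Hadoop code; `map_words`, `shuffle`, and `reduce_counts` are hypothetical names, and words are split on whitespace only, so punctuation stays attached as on the slides.

```python
from itertools import groupby

SONG = """C-O-M-E-T-S! Go!
Green, Orange, White!
Comets! Go!
Strong of will, we fight for right!
Let's all show our comet might!"""

def map_words(line):
    # First stage: emit (word, 1) for every word on the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Between map and reduce: sort by key and group identical keys.
    ordered = sorted(pairs)
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=lambda kv: kv[0])]

def reduce_counts(word, ones):
    # Second stage: sum the values in the list.
    return word, sum(ones)

mapped = [pair for line in SONG.splitlines() for pair in map_words(line)]
grouped = shuffle(mapped)
counts = [reduce_counts(word, ones) for word, ones in grouped]
print(counts[:3])
```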

15
How can we improve our wordcount? Also, any
questions?

17
Fault tolerance

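Worker failure is expected. If a worker fails
during a map phase, its workload is reassigned to
another worker. If a map worker fails during the
reduce phase, its map tasks are re-executed as
well, because their intermediate output lived on
the failed machine, and the affected reducers read
it again from the new worker.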
Master failure is not expected, though
checkpointing can be used for recovery.
If a particular record causes the mapper or
reducer to reliably crash, the map-reduce system
can figure this out, skip the record, and proceed.
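
The record-skipping idea can be sketched as below. This is a deliberately simplified illustration, not how any real system implements it: retries happen per record rather than per task, and `run_with_skipping` and `max_attempts` are hypothetical names.

```python
def run_with_skipping(records, map_fn, max_attempts=2):
    """Apply map_fn to each record; skip a record that fails every attempt."""
    output, skipped = [], []
    for index, record in enumerate(records):
        for _attempt in range(max_attempts):
            try:
                output.extend(map_fn(record))
                break                      # success, move to the next record
            except Exception:
                continue                   # retry the same record
        else:
            skipped.append(index)          # failed reliably: skip and proceed
    return output, skipped

def parse_number(record):
    # Crashes on records that are not integers.
    return [(record, int(record))]

output, skipped = run_with_skipping(["1", "2", "x", "4"], parse_number)
print(output, skipped)
```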

18
Ordering guarantee

The implementation of map-reduce guarantees that
within a given partition, the intermediate
key/value pairs are processed in increasing key
order.
This means that each reduce partition ends up
with an output file sorted by key.

19
Partitioning function

By default, your reduce tasks will be distributed
evenly by using a hash(intermediate key) mod R
function, where R is the number of reduce pieces.
You can specify a custom partitioning function.
This is useful for locality, for example when the
key is a URL and you want all URLs belonging to a
single host to be processed on a single machine.
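
A sketch of both kinds of partitioning function, in Python. The function names and `num_reducers` are hypothetical, and `zlib.crc32` is only a convenient deterministic hash for the example, not the hash any particular implementation uses.

```python
import zlib
from urllib.parse import urlparse

def default_partition(key, num_reducers):
    # hash(key) mod (number of reduce pieces)
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def host_partition(url, num_reducers):
    # Custom partitioner: hash only the host, so every URL from one
    # host lands in the same reduce partition.
    host = urlparse(url).hostname or ""
    return default_partition(host, num_reducers)

a = host_partition("http://utdallas.edu/a", 4)
b = host_partition("http://utdallas.edu/b/c.html", 4)
print(a == b)
```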

20
Combiner function

After a map phase, the mapper transmits over the
network the entire intermediate data file to the
reducer.
Sometimes this file is highly compressible.
The user can specify a combiner function. It's
just like a reduce function, except it's run by
the mapper before passing the job to the reducer.
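
The effect of a combiner on the amount of intermediate data can be seen in a small sketch. `map_words` and `combine` are hypothetical names, and the input line is made up.

```python
from collections import Counter

def map_words(line):
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Same logic as the word-count reducer, applied to one mapper's
    # local output before anything is sent over the network.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

raw = map_words("go go go comets go")
combined = combine(raw)
print(len(raw), len(combined))
```

Here the mapper would send two pairs instead of five. This works for word count because addition is associative and commutative, so summing partial sums gives the same total.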

21
Counters

A counter can be associated with any action that
a mapper or a reducer does. This is in addition
to default counters such as the number of input
and output key/value pairs processed.
A user can watch the counters in real time to
see the progress of a job.
When the map/reduce job finishes, these counters
are provided to the user program.
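
A minimal sketch of user-defined counters, using a plain `collections.Counter` as a stand-in for the framework's counter facility. The counter names and `map_with_counters` are invented for the example.

```python
from collections import Counter

def map_with_counters(lines, counters):
    for line in lines:
        counters["input_records"] += 1
        for word in line.split():
            if word[0].isupper():
                counters["capitalized_words"] += 1   # user-defined counter
            counters["output_pairs"] += 1
            yield word, 1

counters = Counter()
pairs = list(map_with_counters(["Comets! Go!", "we fight"], counters))
print(dict(counters))
```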

23
What is Hadoop?

Hadoop is the implementation of the map/reduce
design that we will use.
Hadoop is released under the Apache License 2.0,
so it's open source.
Hadoop uses the Hadoop Distributed File System,
HDFS. (In contrast to what we've seen with
Lucene.)
Get the release from
http://hadoop.apache.org/core/

24
Preparing Hadoop on your system

Configure passwordless public-key SSH on
localhost
Configure Hadoop
look at the two configuration files at
http://utdallas.edu/pmw033000/hadoop/
Format the HDFS
bin/hadoop namenode -format
Start Hadoop
cd <hadoop-dir>
bin/start-all.sh (and wait 20 seconds)

25
Example: grep

Standard Unix 'grep' behavior: run it on the
command line with the search string as the first
argument and the list of files or directories as
the subsequent argument(s).
grep HelloWorld file1.c file2.c file3.c
file2.c:System.out.println("I say HelloWorld!")

26
Preparing for 'grep' in Hadoop

Hadoop's jobs always operate within the HDFS.
Hadoop will read its input from HDFS, and will
write its output to HDFS.
Thus, to prepare:
Download a free electronic book
http://utdallas.edu/pmw033000/hadoop/book.txt
Load the file into HDFS
bin/hadoop fs -copyFromLocal book.txt /book.txt

27
Using 'grep' within Hadoop

bin/hadoop jar \
hadoop-0.18.2-examples.jar \
grep /book.txt /grep-result \
"search string"
bin/hadoop fs -ls /grep-result
bin/hadoop fs -cat /grep-result/part-00000
A good string to try: Horace de \S
Between job runs: bin/hadoop fs -rmr /grep-result

28
How 'grep' in Hadoop works

The program runs two map/reduce jobs in sequence.
The first job counts how many times a matching
string occurred and the second job sorts matching
strings by their frequency and stores the output
in a single output file.
Each mapper of the first job takes a line as
input and matches the user-provided regular
expression against the line. It extracts all
matching strings and emits (matching string, 1)
pairs. Each reducer sums the frequencies of each
matching string. The output is sequence files
containing the matching string and count. The
reduce phase is optimized by running a combiner
that sums the frequency of strings from local map
output. As a result it reduces the amount of data
that needs to be shipped to a reduce task.
The second job takes the output of the first job
as input. The mapper is an inverse map, while the
reducer is an identity reducer. The number of
reducers is one, so the output is stored in one
file, and it is sorted by the count in a
descending order. The output file is text, each
line of which contains count and a matching
string.
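
The two-job pipeline described above can be mimicked in plain Python. This is a sketch of the logic only, not the Hadoop example's source; the function names, the sample lines, and the regular expression are made up for illustration.

```python
import re
from collections import Counter

def count_matches(lines, pattern):
    # Job 1: the mapper emits (matching string, 1) for every match on a
    # line; the reducer (and combiner) sums the ones per matching string.
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for match in regex.finditer(line):
            counts[match.group(0)] += 1
    return counts

def sort_by_frequency(counts):
    # Job 2: the inverse map swaps each pair to (count, string); a single
    # identity reducer then yields one output sorted by descending count.
    inverted = [(count, text) for text, count in counts.items()]
    return sorted(inverted, key=lambda pair: pair[0], reverse=True)

lines = ["Horace de Vere", "Horace de Vere met Horace de Bois", "nothing here"]
result = sort_by_frequency(count_matches(lines, r"Horace de \w+"))
print(result)
```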

29
Another example: word count

bin/hadoop jar hadoop-0.18.2-examples.jar \
wordcount /book.txt /wc-result
bin/hadoop fs -cat /wc-result/part-00000 | \
sort -n -k 2
You can also try passing a -r option to
increase the number of parallel reducers.
Each mapper takes a line as input and breaks it
into words. It then emits a key/value pair of the
word and 1. Each reducer sums the counts for each
word and emits a single key/value with the word
and sum.
As an optimization, the reducer is also used as a
combiner on the map outputs. This reduces the
amount of data sent across the network by
combining each word into a single record.

30
Presentation Overview

What is map-reduce?
input/output data types
why is it useful and where is it used?
Execution overview
Features
fault tolerance
ordering guarantee
other perks and bonuses
Hands-on demonstration and follow-along
Map-reduce-merge (proposal, not implemented)

31
Does map-reduce satisfy all needs?

Map-reduce is great for homogeneous data, such as
grepping a large collection of files or
word-counting a huge document.
Joining heterogeneous databases does not work
well.
As is, we'd need additional map-reduce steps,
such as map-reducing one database and reading
from the others on the fly.
We want to support relational algebra.

32
Solution

The solution to these problems is
map-reduce-merge. It is map-reduce with a new
additional merging step.
The merge phase makes it easier to process data
relationships among heterogeneous data sets.
Types
map: (k1, v1)α → (k2, v2)α
reduce: (k2, v2)α → (k2, v3)α (notice that
the output v is a list)
merge: ((k2, v3)α, (k3, v4)β) → (k4, v5)
If α = β, then the merging step performs a
self-merge (self-join in relational algebra).
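
To give a feel for the extra step, here is a toy Python sketch that runs map and reduce separately over two made-up data sets and then merges the reduced outputs on their keys. Everything in it (`map_reduce`, `merge`, the sales and people records) is hypothetical, and the merge shown is a simple equi-join on matching keys, one of several possibilities.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    # Reduced output: (k2, list of v3), sorted by key.
    return sorted((k2, reduce_fn(k2, values)) for k2, values in groups.items())

def merge(left, right, merge_fn):
    # Combine reduced outputs from two lineages where the keys match.
    right_by_key = dict(right)
    return [merge_fn(key, left_values, right_by_key[key])
            for key, left_values in left if key in right_by_key]

sales = [("s1", ("alice", 10)), ("s2", ("bob", 5)), ("s3", ("alice", 7))]
people = [("alice", "Dallas"), ("bob", "Plano"), ("carol", "Austin")]

# First data set: total sales per person.
totals = map_reduce(sales, lambda _id, sale: [sale], lambda _k, vs: [sum(vs)])
# Second data set: city per person (identity map, list-valued reduce).
cities = map_reduce(people, lambda name, city: [(name, city)], lambda _k, vs: vs)

joined = merge(totals, cities,
               lambda name, total, city: (name, (total[0], city[0])))
print(joined)
```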

33
New terms

Partition selector: determines which data
partitions produced by reducers should be
retrieved for merging.
Processor: user-defined logic for processing data
from an individual source.
Merger: user-defined logic for processing data
merged from two sources where data satisfies a
merge condition.
Configurable iterator: see next slide.

34
Configurable iterators

The map and reduce user-defined functions get one
iterator for the values.
The merge function gets two iterators, one for
each data source.
The iterators do not have to move forward; they
can be instrumented to do whatever the user
wants.
Relational join algorithms have specific patterns
for the merging step.
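
One such pattern is the sort-merge join, where the two iterators advance in lockstep over key-sorted inputs. The Python sketch below assumes both inputs are sorted by key and that keys are unique within each input; `sort_merge_join` is a hypothetical name.

```python
def sort_merge_join(left, right):
    """Join two key-sorted sequences of (key, value) with unique keys."""
    left_iter, right_iter = iter(left), iter(right)
    l = next(left_iter, None)
    r = next(right_iter, None)
    joined = []
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left_iter, None)      # advance only the left iterator
        elif l[0] > r[0]:
            r = next(right_iter, None)     # advance only the right iterator
        else:
            joined.append((l[0], l[1], r[1]))
            l = next(left_iter, None)      # keys match: advance both
            r = next(right_iter, None)
    return joined

result = sort_merge_join([(1, "a"), (2, "b"), (4, "d")],
                         [(2, "x"), (3, "y"), (4, "z")])
print(result)
```

Other join algorithms would drive the iterators differently, for example by rewinding one of them for a nested-loop join.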

35
Map-reduce-merge example