Outline - PowerPoint PPT Presentation

About This Presentation

Title:

Outline

Description:

Consider a slightly more general program to compute the word frequency of every word in a single document. Biography of Pat Martino. When the anesthesia wore off, Pat ...

Slides: 60

Provided by: conc84

Transcript and Presenter's Notes

Title: Outline

1
Outline

Parallel Processing Patterns
MapReduce Abstraction
MapReduce Pseudocode
MapReduce Examples
Relational Join
Matrix Multiplication
MapReduce Implementation Overview

2
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
3
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
Divide the parameter sets among k computers
4
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
Divide the parameter sets among k computers
f runs the simulation and produces some output
apply it to every item
f
f
f
f
f
f
5
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
Divide the parameter sets among k computers
f runs the simulation and produces some output
apply it to every item
f
f
f
f
f
f
Now we have a big distributed set of simulation
results
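The pattern on this slide (apply f to every parameter set, spread over k workers) can be sketched as follows. This is only an illustration: `f` is a hypothetical stand-in for a real simulation, and a thread pool with k workers stands in for k computers.

```python
from concurrent.futures import ThreadPoolExecutor


def f(params):
    """Hypothetical simulation: compound growth at `rate` for `steps` steps."""
    rate, steps = params
    value = 1.0
    for _ in range(steps):
        value *= 1.0 + rate
    return value


def run_all(parameter_sets, k=4):
    # k workers play the role of k computers; map applies f to every item
    # and returns the results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(f, parameter_sets))


parameter_sets = [(0.0, 10), (1.0, 3), (0.5, 2)]
results = run_all(parameter_sets)
print(results)
```

Each call to `f` is independent of the others, which is what makes the work trivially divisible among machines.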
6
Find the most common word in each document
You have millions of documents
7
Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
8
Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
f finds the most common word in a single document
f
f
f
f
f
f
9
Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
f finds the most common word in a single document
f
f
f
f
f
f
Now we have a big distributed list of (doc_id,
word) pairs
10
Consider a slightly more general program to
compute the word frequency of every word in a
single document
Biography of Pat Martino When the anesthesia
wore off, Pat Martino looked up hazily at his
parents and his doctors, and tried to piece
together any memory of his life. One of the
greatest guitarists in jazz, Martino had suffered
a severe brain aneurysm and underwent surgery
after being told that his condition could be
terminal. After his operations he could remember
almost nothing. He barely recognized his parents
and had no memory of his guitar or his career. He
remembers feeling as if he had been "dropped
cold, empty, neutral, cleansed, ... naked." In
the following months, Martino made a remarkable
recovery. Through intensive study of his own
historic recordings, and with the help of
computer technology, Pat managed to reverse his
memory loss and return to form on his instrument.
His past recordings eventually became "an old
friend, a spiritual experience which remained
beautiful and honest." This recovery fits in
perfectly with Pat's illustrious personal
history. Since playing his first notes while
still in his pre-teenage years, Martino has been
recognized as one of the most exciting and
virtuosic guitarists in jazz. With a distinctive,
fat sound and gut-wrenching performances, he
represents the best not just in jazz, but in
music. He embodies thoughtful energy and
soul. Born Pat Azzara in Philadelphia in 1944,
Pat was first exposed to jazz through his father,
Carmen "Mickey" Azzara, who sang in local clubs
and briefly studied guitar with Eddie Lang. He
took Pat to all the city's hot-spots to hear and
meet Wes Montgomery and other musical giants. "I
have always admired my father and have wanted to
impress him. As a result, it forced me to get
serious with my creative powers."
(memory, 3) (jazz, 4) (life, 1) (with,
6) (recovery, 2)
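A minimal sketch of such a per-document function, assuming a simple tokenizer (lowercase, runs of letters and apostrophes) that the slides do not specify. The name `word_freq` and the short sample text are illustrative only.

```python
import re
from collections import Counter


def word_freq(document):
    """Return a Counter mapping each word in one document to its frequency."""
    # Assumption: words are runs of letters/apostrophes, compared case-insensitively.
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)


sample = "Martino had no memory of his guitar. His memory returned."
freq = word_freq(sample)
print(sorted(freq.items()))
```

The result is a set of (word, freq) pairs for a single document, in the same shape as the pairs shown above.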
11
Compute the word frequency of 5M documents
You have millions of documents
12
Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
13
Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
For each document f returns a set of (word, freq)
pairs
f
f
f
f
f
f
14
Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
For each document f returns a set of (word, freq)
pairs
f
f
f
f
f
f
Now we have a big distributed list of sets of
word freqs.
15
There is a pattern here

A function that maps a set of parameters to a
simulation result
A function that maps a document to its most
common word
A function that maps a document to a histogram of
word frequencies

16
What if we want to compute the word frequency
across all documents?

17
Compute the word frequency across 5M documents
You have millions of documents
18
Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
19
Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
20
Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
How can we make sure that a single computer has
access to every occurrence of a given word
regardless of which document it appeared in?
Now what?
21
Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
Now we have a big distributed list of sets of
word freqs.
24
Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
Now we have a big distributed list of sets of
word freqs.
Now just count the occurrences of each word
reduce
reduce
reduce
reduce
25
Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
Now we have a big distributed list of sets of
word freqs.
Now just count the occurrences of each word
reduce
reduce
reduce
reduce
We have our distributed histogram
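One way to make sure a single reducer sees every occurrence of a given word is to route each (word, freq) pair by a deterministic function of the word. The sketch below assumes that approach; `partition`, `num_reducers`, and the byte-sum hash are illustrative choices, not something the slides prescribe.

```python
from collections import Counter, defaultdict


def map_doc(text):
    """map: one document -> (word, freq) pairs for that document."""
    return Counter(text.lower().split())


def partition(word, num_reducers):
    # Assumption: any deterministic hash works; every copy of a word
    # lands on the same reducer regardless of which document it came from.
    return sum(word.encode("utf-8")) % num_reducers


def word_count(docs, num_reducers=4):
    # One bucket per reducer: word -> list of per-document frequencies.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for text in docs:
        for word, freq in map_doc(text).items():
            buckets[partition(word, num_reducers)][word].append(freq)
    # reduce: each reducer sums the frequencies of the words it owns.
    histogram = {}
    for bucket in buckets:
        for word, freqs in bucket.items():
            histogram[word] = sum(freqs)
    return histogram


docs = ["jazz guitar jazz", "guitar memory", "jazz memory memory"]
hist = word_count(docs)
print(hist)
```

Because each word belongs to exactly one bucket, the union of the reducers' outputs is the distributed histogram.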
26
Outline

Parallel Processing Patterns
MapReduce Abstraction
MapReduce Pseudocode
MapReduce Examples
Relational Join
Matrix Multiplication
MapReduce Implementation Overview

27
MapReduce
MAP: (did1, v1), (did2, v2), (did3, v3), . . .  ->  (w1, 1), (w2, 1), (w3, 1), . . ., (w1, 1), (w2, 1), . . .
Shuffle: groups the intermediate pairs by key  ->  (w1, (1, 1, 1, ..., 1)), (w2, (1, 1, ...)), (w3, (1, ...)), . . .
REDUCE: one call per key and its list of values  ->  . . .
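The shuffle step in this picture can be sketched as a sort followed by a group-by on the key. This is a single-machine illustration; the function name `shuffle` is an assumption, and a real system performs this step across the network.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Group (key, value) pairs into (key, [values]) with one entry per key."""
    ordered = sorted(pairs, key=itemgetter(0))  # groupby needs sorted input
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


map_output = [("w1", 1), ("w2", 1), ("w3", 1), ("w1", 1), ("w2", 1), ("w1", 1)]
grouped = shuffle(map_output)
print(grouped)
```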

28
MapReduce

Google paper published 2004: "MapReduce: Simplified
Data Processing on Large Clusters", Jeffrey
Dean and Sanjay Ghemawat
Free variant: Hadoop
MapReduce: high-level programming model and
implementation for large-scale parallel data
processing

29
Hadoop History

Created by Doug Cutting, the creator of Apache
Lucene, the widely used text search
library
2002: Nutch, before GFS publication
2004: Nutch Distributed Filesystem
2006: Hadoop at Yahoo!
2008: Yahoo! announced its production search
index was being generated by a 10,000 core Hadoop
cluster
2008: Hadoop made its own top-level project at
Apache

30
Hadoop History

April 2008: Won the 1 terabyte sort benchmark in
209 seconds on 900 nodes
April 2009: Won the minute sort by sorting 500 GB
in 59 seconds (on 1,400 nodes) and the 100
terabyte sort in 173 minutes (on 3,400 nodes)

31
Apache Hadoop

Open-source implementation of Map-Reduce
The storage is provided by HDFS and analysis by
Map-Reduce
There are other parts, like Pig and Hive, but the above
capabilities are its kernel
Pig: A data flow language and execution
environment for exploring very large datasets.
Hive: A distributed data warehouse which
manages data stored in HDFS and provides a query
language based on SQL for querying the data

32
Data Model

A file = a bag of (key, value) pairs
A map-reduce program:
Input: a bag of (inputkey, value) pairs
Output: a bag of (outputkey, value) pairs

33
Step 1: The MAP Phase

User provides the MAP function
Input: (input key, value)
Output: bag of (intermediate key, value)
System applies the map function in parallel to
all (input key, value) pairs in the input file.

34
Step 2: The REDUCE Phase

User provides the REDUCE function
Input: (intermediate key, bag of values)
Output: bag of output (values)
The system groups all pairs with the same
intermediate key and passes the bag of
values to the REDUCE function.

35
MapReduce Programming Model

Input & Output: each a set of key/value pairs
Programmer specifies two functions
map(in_key, in_value) -> list(out_key,
intermediate_value)
Processes input key/value pair
Produces set of intermediate pairs
reduce(out_key, list(intermediate_value)) ->
list(out_value)
Combines all intermediate values for a particular
key
Produces a set of merged output values (usually
just one)
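The two signatures above can be exercised with a small single-process driver. The sketch below is an assumption-laden illustration: `map_reduce`, `map_by_length`, and `reduce_unique` are hypothetical names, and the example job (grouping words by length) is chosen only to show the data flow.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map over every (in_key, in_value), group by out_key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # reduce(out_key, list(intermediate_value)) -> list(out_value)
    return {out_key: reduce_fn(out_key, values) for out_key, values in groups.items()}


def map_by_length(doc_name, text):
    # Emit (word length, word) for each word in the document.
    return [(len(word), word) for word in text.split()]


def reduce_unique(length, words):
    # Merge all words of one length into a sorted list without duplicates.
    return sorted(set(words))


inputs = [("d1", "a bb cc"), ("d2", "bb ddd")]
output = map_reduce(inputs, map_by_length, reduce_unique)
print(output)
```

The programmer writes only `map_by_length` and `reduce_unique`; grouping by key is the system's job.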

36
Outline

Parallel Processing Patterns
MapReduce Abstraction
MapReduce Pseudocode
MapReduce Examples
Relational Join
Matrix Multiplication
MapReduce Implementation Overview

37
Example: What does this do?

map(String input_key, String input_value)
  // input_key: document name
  // input_value: document content
  For each word w in input_value
    EmitIntermediate(w, 1)
reduce(String intermediate_key, Iterator intermediate_values)
  // intermediate_key: word
  // intermediate_values: ???
  int result = 0
  For each v in intermediate_values
    result += v
  EmitFinal(intermediate_key, result)
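A runnable Python rendering of this pseudocode may help in working out the answer. It is a sketch: the `run` driver that groups intermediate pairs is an assumption standing in for the MapReduce system, and whitespace splitting stands in for "each word".

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name, input_value: document content
    for w in input_value.split():
        yield (w, 1)


def reduce_fn(intermediate_key, intermediate_values):
    # intermediate_key: a word; intermediate_values: every value emitted for it
    result = 0
    for v in intermediate_values:
        result += v
    return (intermediate_key, result)


def run(documents):
    """Hypothetical driver: apply map, group by key, apply reduce."""
    grouped = defaultdict(list)
    for name, content in documents.items():
        for key, value in map_fn(name, content):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


counts = run({"doc1": "a b a", "doc2": "b c"})
print(counts)
```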

38
Outline

Parallel Processing Patterns
MapReduce Abstraction
MapReduce Pseudocode
MapReduce Examples
Relational Join
Matrix Multiplication
MapReduce Implementation Overview

39
Natural Join

Join of R(A, B) with S(B, C) is the set of
tuples (a, b, c) such that (a, b) is in R and (b,
c) is in S.
Mappers need to send R(a, b) and S(b, c) to the
same reducer, so they can be joined there.
Mapper output: key = B-value, value = relation
and other component (A or C).
Example:
R(1, 2) -> (2, (R, 1))
S(2, 3) -> (2, (S, 3))
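The mapper side of the join can be sketched as two small functions, one per relation. The names `map_r` and `map_s` and the string tags "R" and "S" are illustrative assumptions.

```python
def map_r(a, b):
    """Tuple R(a, b): key on the B-value, tag the value with its relation."""
    return (b, ("R", a))


def map_s(b, c):
    """Tuple S(b, c): key on the B-value, tag the value with its relation."""
    return (b, ("S", c))


print(map_r(1, 2))
print(map_s(2, 3))
```

Both outputs share the key 2, so the system delivers them to the same reducer.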

40
Mapping Tuples
Mapper for R(1,2): R(1,2) -> (2, (R,1))
Mapper for R(4,2): R(4,2)
Mapper for S(2,3): S(2,3)
Mapper for S(5,6): S(5,6)
41
Grouping Phase

There is a reducer for each key.
Every key-value pair generated by any mapper is
sent to the reducer for its key.

42
Mapping Tuples
Mapper for R(1,2) -> (2, (R,1)) -> Reducer for B = 2
Mapper for R(4,2) -> (2, (R,4)) -> Reducer for B = 2
Mapper for S(2,3) -> (2, (S,3)) -> Reducer for B = 2
Mapper for S(5,6) -> (5, (S,6)) -> Reducer for B = 5
43
Constructing Value-Lists

The input to each reducer is organized by the
system into a pair:
- The key.
- The list of values associated with that key.

44
The Value-List Format
Reducer for B = 2: (2, [(R,1), (R,4), (S,3)])
Reducer for B = 5: (5, [(S,6)])
45
The Reduce Function for Join

Given key b and a list of values that are either
(R, ai) or (S, cj), output each triple
(ai, b, cj).
Thus, the number of outputs made by a reducer is
the product of the number of R's on the list and
the number of S's on the list.
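A minimal sketch of this reduce function, assuming values arrive as (tag, component) pairs with the tags "R" and "S". The name `reduce_join` is hypothetical; the inputs are the value lists from the running example.

```python
def reduce_join(b, values):
    """Pair every R-side component with every S-side component for key b."""
    r_side = [a for tag, a in values if tag == "R"]
    s_side = [c for tag, c in values if tag == "S"]
    # Output size is len(r_side) * len(s_side), the product noted above.
    return [(a, b, c) for a in r_side for c in s_side]


out_b2 = reduce_join(2, [("R", 1), ("R", 4), ("S", 3)])
out_b5 = reduce_join(5, [("S", 6)])
print(out_b2)
print(out_b5)
```

For B = 5 there is no R tuple, so the product is zero and the reducer emits nothing.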

46
Output of the Reducers
Reducer for B = 2: (2, [(R,1), (R,4), (S,3)]) -> (1,2,3), (4,2,3)
Reducer for B = 5: (5, [(S,6)]) -> (no output)
47
Outline

Parallel Processing Patterns
MapReduce Abstraction
MapReduce Pseudocode
MapReduce Examples
Relational Join
Matrix Multiplication
MapReduce Implementation Overview

48
Matrix Multiplication
[ 1  3  4 -2 ]   X   [  1 -2 ]   =   [  1 -9 ]
[ 6  2 -3  1 ]       [  4  3 ]       [ 23  4 ]
                     [ -3 -2 ]
                     [  0  4 ]
49
Matrix Multiply in MapReduce

50
Matrix Multiply in MapReduce

A X B = AB
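The slide's diagram did not survive in this transcript, so the sketch below uses one common one-pass formulation as an assumption, not necessarily the scheme on the slide: each element of A is sent to every output cell in its row, each element of B to every output cell in its column, and the reducer for cell (i, k) sums the matching products. The input matrices are the numbers from slide 48, read as a 2 x 4 matrix times a 4 x 2 matrix.

```python
from collections import defaultdict


def map_a(i, j, a_ij, n_cols_b):
    # A[i][j] contributes to every output cell (i, k) in row i.
    for k in range(n_cols_b):
        yield ((i, k), ("A", j, a_ij))


def map_b(j, k, b_jk, n_rows_a):
    # B[j][k] contributes to every output cell (i, k) in column k.
    for i in range(n_rows_a):
        yield ((i, k), ("B", j, b_jk))


def reduce_cell(key, values):
    # Match A and B entries on the shared index j and sum the products.
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return sum(a[j] * b[j] for j in a if j in b)


def multiply(A, B):
    n, m, p = len(A), len(B), len(B[0])
    groups = defaultdict(list)  # stands in for the shuffle
    for i in range(n):
        for j in range(m):
            for key, value in map_a(i, j, A[i][j], p):
                groups[key].append(value)
    for j in range(m):
        for k in range(p):
            for key, value in map_b(j, k, B[j][k], n):
                groups[key].append(value)
    return [[reduce_cell((i, k), groups[(i, k)]) for k in range(p)] for i in range(n)]


A = [[1, 3, 4, -2], [6, 2, -3, 1]]
B = [[1, -2], [4, 3], [-3, -2], [0, 4]]
C = multiply(A, B)
print(C)
```

This formulation replicates each input element once per output cell it touches, which is simple but communication-heavy; other schemes trade a second MapReduce pass for less replication.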

51
Outline

Parallel Processing Patterns
MapReduce Abstraction
MapReduce Pseudocode
MapReduce Examples
Relational Join
Matrix Multiplication
MapReduce Implementation Overview

52
Cluster Computing

Large number of commodity servers, connected by
high speed, commodity network
Rack holds a small number of servers
Data center holds many racks

53
Cluster Computing

Massive parallelism
- 100s, or 1000s, or 10000s of servers
Many hours
Failure
If mean time between failures is 1 year
Then 10000 servers have about one failure / hour
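The arithmetic behind this estimate can be checked in a few lines, assuming failures are independent and occur at a constant rate (an idealization). The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def expected_failures_per_hour(num_servers, mtbf_years=1.0):
    """Expected cluster-wide failures per hour under a constant failure rate."""
    return num_servers / (mtbf_years * HOURS_PER_YEAR)


rate = expected_failures_per_hour(10_000)
print(round(rate, 2))
```

With 10000 servers the expected rate is a little above one failure per hour, so a job that runs for many hours should expect to lose machines along the way.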

54
Distributed File System (DFS)

For very large files: TBs, PBs
Each file is partitioned into chunks, typically
64MB
Each chunk is replicated several times (3), on
different racks, for fault tolerance
Implementations
Google's DFS: GFS, proprietary
Hadoop's DFS: HDFS, open source
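To get a feel for these numbers, the sketch below counts the chunks and stored replicas for a 1 TB file, assuming binary units (1 TB = 2^40 bytes, 64 MB = 64 x 2^20 bytes) and the replication factor of 3 given above. The helper name `chunk_count` is hypothetical.

```python
CHUNK_BYTES = 64 * 2**20  # 64 MB chunk size
REPLICATION = 3           # copies of each chunk


def chunk_count(file_bytes, chunk_bytes=CHUNK_BYTES):
    """Number of chunks needed for a file, rounding the last chunk up."""
    return (file_bytes + chunk_bytes - 1) // chunk_bytes


one_tb = 2**40
chunks = chunk_count(one_tb)
stored_replicas = chunks * REPLICATION
print(chunks, stored_replicas)
```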

55
MapReduce Phases
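[Diagram: data flow through a Map Task and a Reduce Task]
Map Task: HDFS -> Split -> Record Reader -> Map -> Combine -> file on Local Storage
Reduce Task: Copy -> Sort -> Reduce -> file on HDFS
Labels in the diagram: P 1, P 2, P 3, P 4, P 5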
56
Combiner
Same word appears twice. Why not just send (w1, 2)?
MAP: (did1, v1), (did2, v2), (did3, v3), . . .  ->  (w1, 1), (w2, 1), (w1, 1), . . ., (w1, 1), (w2, 1), . . .
Shuffle: groups the intermediate pairs by key  ->  (w1, (1, 1, 1, ..., 1)), (w2, (1, 1, ...)), (w3, (1, ...)), . . .
REDUCE: one call per key and its list of values  ->  . . .

57
Adding a Combiner
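A combiner can be sketched as a local pre-aggregation applied to one map task's output before the shuffle. The function names below are illustrative, and the approach relies on the reduce operation (here, addition) being associative and commutative.

```python
from collections import Counter


def map_fn(doc_id, text):
    # Emit (word, 1) for each word, exactly as in the word-count example.
    return [(w, 1) for w in text.split()]


def combine(pairs):
    """Sum counts per word locally so fewer pairs cross the network."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())


pairs = map_fn("did1", "w1 w2 w1")
combined = combine(pairs)
print(pairs)
print(combined)
```

Instead of sending (w1, 1) twice, the map task sends (w1, 2) once; the reducer still sums whatever values it receives, so the final counts are unchanged.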