Outline - PowerPoint PPT Presentation

Title:

Outline

Description:

Consider a slightly more general program to compute the word frequency of every word in a single document. Biography of Pat Martino. When the anesthesia wore off, Pat ... – PowerPoint PPT presentation

Slides: 60
Provided by: conc84
Category:
Tags: history | jazz | outline


Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Parallel Processing Patterns
  • MapReduce Abstraction
  • MapReduce Pseudocode
  • MapReduce Examples
  • Relational Join
  • Matrix Multiplication
  • MapReduce Implementation Overview

2
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
3
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
Divide the parameter sets among k computers
4
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
Divide the parameter sets among k computers
f runs the simulation and produces some output
apply it to every item
f
f
f
f
f
f
5
Run thousands of simulations
You have sets of Parameters for thousands of
small simulations
Divide the parameter sets among k computers
f runs the simulation and produces some output
apply it to every item
f
f
f
f
f
f
Now we have a big distributed set of simulation
results
6
Find the most common word in each document
You have millions of documents
7
Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
8
Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
f finds the most common word in a single document
f
f
f
f
f
f
9
Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
f finds the most common word in a single document
f
f
f
f
f
f
Now we have a big distributed list of (doc_id,
word) pairs
10
Consider a slightly more general program to
compute the word frequency of every word in a
single document
Biography of Pat Martino When the anesthesia
wore off, Pat Martino looked up hazily at his
parents and his doctors and tried to piece
together any memory of his life.  One of the
greatest guitarists in jazz, Martino had suffered
a severe brain aneurysm and underwent surgery
after being told that his condition could be
terminal. After his operations he could remember
almost nothing. He barely recognized his parents
and had no memory of his guitar or his career. He
remembers feeling as if he had been "dropped
cold, empty, neutral, cleansed, ... naked." In
the following months, Martino made a remarkable
recovery. Through intensive study of his own
historic recordings, and with the help of
computer technology, Pat managed to reverse his
memory loss and return to form on his instrument.
His past recordings eventually became "an old
friend, a spiritual experience which remained
beautiful and honest." This recovery fits in
perfectly with Pat's illustrious personal
history. Since playing his first notes while
still in his pre-teenage years, Martino has been
recognized as one of the most exciting and
virtuosic guitarists in jazz. With a distinctive,
fat sound and gut-wrenching performances, he
represents the best not just in jazz, but in
music. He embodies thoughtful energy and
soul. Born Pat Azzara in Philadelphia in 1944,
Pat was first exposed to jazz through his father,
Carmen "Mickey" Azzara, who sang in local clubs
and briefly studied guitar with Eddie Lang. He
took Pat to all the city's hot-spots to hear and
meet Wes Montgomery and other musical giants. "I
have always admired my father and have wanted to
impress him. As a result, it forced me to get
serious with my creative powers."
(memory, 3) (jazz, 4) (life, 1) (with, 6) (recovery, 2)
11
Compute the word frequency of 5M documents
You have millions of documents
12
Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
13
Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
For each document f returns a set of (word, freq)
pairs
f
f
f
f
f
f
14
Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
For each document f returns a set of (word, freq)
pairs
f
f
f
f
f
f
Now we have a big distributed list of sets of
word freqs.
15
There is a pattern here
  • A function that maps a set of parameters to a
    simulation result
  • A function that maps a document to its most
    common word
  • A function that maps a document to a histogram
    of word frequencies
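
The three cases above share one shape: an independent function f applied to every item. A minimal single-machine sketch in Python, assuming documents are plain strings and words are split on whitespace. The names `word_histogram`, `most_common_word`, and `docs` are illustrative and do not come from the slides.

```python
from collections import Counter

def word_histogram(document: str) -> Counter:
    """f for the third case: map one document to its word frequencies."""
    return Counter(document.lower().split())

def most_common_word(document: str) -> str:
    """f for the second case: map one non-empty document to its most common word."""
    return word_histogram(document).most_common(1)[0][0]

# Hypothetical toy corpus keyed by document id.
docs = {
    "d1": "jazz guitar jazz",
    "d2": "memory loss and memory recovery",
}

# Applying f to every item is an ordinary map. Each call is independent of
# the others, so a cluster could spread the calls over k machines.
common = {doc_id: most_common_word(text) for doc_id, text in docs.items()}
histograms = {doc_id: word_histogram(text) for doc_id, text in docs.items()}
```

Because no call depends on another, the only coordination a cluster needs is deciding which machine gets which documents.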

16
What if we want to compute the word frequency
across all documents?
 
17
Compute the word frequency across 5M documents
You have millions of documents
18
Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
19
Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
20
Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
How can we make sure that a single computer has
access to every occurrence of a given word
regardless of which document it appeared in?
Now what?
21
Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
Now we have a big distributed list of sets of
word freqs.
24
Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
Now we have a big distributed list of sets of
word freqs.
Now just count the occurrences of each word
reduce
reduce
reduce
reduce
25
Compute the word frequency across 5M documents
Distribute the documents among k computers
For each document, returns a set of (word, freq)
pairs
map
map
map
map
map
map
Now we have a big distributed list of sets of
word freqs.
Now just count the occurrences of each word
reduce
reduce
reduce
reduce
We have our distributed histogram
26
Outline
  • Parallel Processing Patterns
  • MapReduce Abstraction
  • MapReduce Pseudocode
  • MapReduce Examples
  • Relational Join
  • Matrix Multiplication
  • MapReduce Implementation Overview

27
MapReduce
[Figure: MAP takes the input pairs (did1, v1), (did2, v2), (did3, v3), . . . and emits intermediate pairs (w1, 1), (w2, 1), (w3, 1), . . . Shuffle groups the intermediate pairs by word into lists of the form (word, (1, 1, . . . , 1)). REDUCE processes each list.]

28
MapReduce
  • Google paper published 2004: "MapReduce:
    Simplified Data Processing on Large Clusters",
    Jeffrey Dean and Sanjay Ghemawat
  • Free variant: Hadoop
  • MapReduce: high-level programming model and
    implementation for large-scale parallel data
    processing

29
Hadoop History
  • Created by Doug Cutting, the creator of Apache
    Lucene, the widely used text search library
  • 2002: Nutch, before GFS publication
  • 2004: Nutch Distributed Filesystem
  • 2006: Hadoop at Yahoo!
  • 2008: Yahoo! announced its production search
    index was being generated by a 10,000 core Hadoop
    cluster
  • 2008: Hadoop made its own top-level project at
    Apache

30
Hadoop History
  • April 2008: Won the 1 terabyte sort benchmark in
    209 seconds on 900 nodes
  • April 2009: Won the minute sort by sorting 500 GB
    in 59 seconds (on 1,400 nodes) and the 100
    terabyte sort in 173 minutes (on 3,400 nodes)

31
Apache Hadoop
  • Open-source implementation of Map-Reduce
  • The storage is provided by HDFS and analysis by
    Map-Reduce
  • There are other parts like Pig and Hive, but the
    above capabilities are its kernel
  • Pig: a data flow language and execution
    environment for exploring very large datasets
  • Hive: a distributed data warehouse which manages
    data stored in HDFS and provides a query language
    based on SQL for querying the data

32
Data Model
  • A file = a bag of (key, value) pairs
  • A map-reduce program:
  • Input: a bag of (input key, value) pairs
  • Output: a bag of (output key, value) pairs

33
Step 1: The MAP Phase
  • User provides the MAP function
  • Input: (input key, value)
  • Output: bag of (intermediate key, value)
  • System applies the map function in parallel to
    all (input key, value) pairs in the input file.

34
Step 2: The REDUCE Phase
  • User provides the REDUCE function
  • Input: (intermediate key, bag of values)
  • Output: bag of output (values)
  • The system groups all pairs with the same
    intermediate key and passes the bag of values
    to the REDUCE function.

35
MapReduce Programming Model
  • Input and output: each a set of key/value pairs
  • Programmer specifies two functions
  • map(in_key, in_value) -> list(out_key,
    intermediate_value)
  • Processes input key/value pair
  • Produces set of intermediate pairs
  • reduce(out_key, list(intermediate_value)) ->
    list(out_value)
  • Combines all intermediate values for a particular
    key
  • Produces a set of merged output values (usually
    just one)
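
The two signatures above can be exercised with a small in-memory driver. This is a sketch under stated assumptions, not how Hadoop or Google's system is implemented: everything runs in one process, the shuffle is a dictionary, and `run_mapreduce`, `max_temp_map`, `max_temp_reduce`, and the sample records are hypothetical names chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every (in_key, in_value), group by out_key, then reduce."""
    groups = defaultdict(list)
    # Map phase: each input pair may yield any number of intermediate pairs.
    for in_key, in_value in inputs:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            # Shuffle: collect all values that share an intermediate key.
            groups[out_key].append(intermediate_value)
    # Reduce phase: one call per distinct key (keys must be sortable here).
    return {key: list(reduce_fn(key, groups[key])) for key in sorted(groups)}

def max_temp_map(line_no, line):
    # in_key: line number (unused), in_value: "city temperature"
    city, temp = line.split()
    yield city, int(temp)

def max_temp_reduce(city, temps):
    # Produces a single merged output value for the key.
    yield max(temps)

records = [(0, "oslo 3"), (1, "cairo 31"), (2, "oslo 7")]
result = run_mapreduce(records, max_temp_map, max_temp_reduce)
```

The programmer writes only `max_temp_map` and `max_temp_reduce`; the grouping step in the middle is the part the system provides.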

36
Outline
  • Parallel Processing Patterns
  • MapReduce Abstraction
  • MapReduce Pseudocode
  • MapReduce Examples
  • Relational Join
  • Matrix Multiplication
  • MapReduce Implementation Overview

37
Example: What does this do?
  • map(String input_key, String input_value)
  • // input_key: document name
  • // input_value: document content
  • For each word w in input_value
  • EmitIntermediate(w, 1)
  • reduce(String intermediate_key, Iterator
    intermediate_values)
  • // intermediate_key: word
  • // intermediate_values: ???
  • int result = 0
  • For each v in intermediate_values
  • result += v
  • EmitFinal(intermediate_key, result)
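
A runnable Python transcription of the pseudocode above may help in working out the answer. It is a single-process sketch: `EmitIntermediate` becomes `yield`, `EmitFinal` becomes `return`, and the hypothetical helper `word_count` stands in for the grouping the system would do between the two phases.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    # input_key: document name, input_value: document content
    for w in input_value.split():
        yield (w, 1)

def reduce_fn(intermediate_key, intermediate_values):
    # intermediate_key: a word, intermediate_values: the 1s emitted for it
    result = 0
    for v in intermediate_values:
        result += v
    return (intermediate_key, result)

def word_count(documents):
    """Hypothetical driver: group map output by word, then reduce each group."""
    grouped = defaultdict(list)
    for name, content in documents.items():
        for w, one in map_fn(name, content):
            grouped[w].append(one)
    return dict(reduce_fn(w, ones) for w, ones in grouped.items())

counts = word_count({"doc1": "jazz guitar jazz", "doc2": "guitar memory"})
```

Here `counts` holds one total per distinct word across both documents, which is the distributed histogram from the earlier slides.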

38
Outline
  • Parallel Processing Patterns
  • MapReduce Abstraction
  • MapReduce Pseudocode
  • MapReduce Examples
  • Relational Join
  • Matrix Multiplication
  • MapReduce Implementation Overview

39
Natural Join
  • Join of R(A, B) with S(B, C) is the set of
    tuples (a, b, c) such that (a, b) is in R and
    (b, c) is in S.
  • Mappers need to send R(a, b) and S(b, c) to the
    same reducer, so they can be joined there.
  • Mapper output: key = B-value, value = relation
    and other component (A or C).
  • Example:
  • R(1, 2) -> (2, (R, 1))
  • S(2, 3) -> (2, (S, 3))

40
Mapping Tuples
Mapper for R(1,2): R(1,2) -> (2, (R,1))
Mapper for R(4,2): R(4,2)
Mapper for S(2,3): S(2,3)
Mapper for S(5,6): S(5,6)
41
Grouping Phase
  • There is a reducer for each key.
  • Every key-value pair generated by any mapper is
    sent to the reducer for its key.

42
Mapping Tuples
Mapper for R(1,2): (2, (R,1)) -> Reducer for B = 2
Mapper for R(4,2): (2, (R,4)) -> Reducer for B = 2
Mapper for S(2,3): (2, (S,3)) -> Reducer for B = 2
Mapper for S(5,6): (5, (S,6)) -> Reducer for B = 5
43
Constructing Value-Lists
  • The input to each reducer is organized by the
    system into a pair:
  • - The key.
  • - The list of values associated with that key.

44
The Value-List Format
Reducer for B = 2: (2, (R,1), (R,4), (S,3))
Reducer for B = 5: (5, (S,6))
45
The Reduce Function for Join
  • Given key b and a list of values that are either
    (R, ai) or (S, cj), output each triple
    (ai, b, cj).
  • Thus, the number of outputs made by a reducer is
    the product of the number of Rs on the list and
    the number of Ss on the list.

46
Output of the Reducers
Reducer for B = 2: (2, (R,1), (R,4), (S,3)) -> (1,2,3), (4,2,3)
Reducer for B = 5: (5, (S,6))
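
The join walkthrough on the preceding slides can be reproduced with a short Python sketch. It uses the same tuples, R = {(1,2), (4,2)} and S = {(2,3), (5,6)}. The function names `map_r`, `map_s`, and `reduce_join` are illustrative, and the dictionary `groups` stands in for the grouping done by the system.

```python
from collections import defaultdict

def map_r(a, b):
    # R(a, b) -> (b, (R, a)): key on the shared attribute B
    return (b, ("R", a))

def map_s(b, c):
    # S(b, c) -> (b, (S, c))
    return (b, ("S", c))

def reduce_join(b, values):
    # Pair every R-side value with every S-side value for this key.
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    return [(a, b, c) for a in r_side for c in s_side]

R = [(1, 2), (4, 2)]
S = [(2, 3), (5, 6)]

# Grouping phase: every mapper output goes to the reducer for its key.
groups = defaultdict(list)
for key, value in [map_r(a, b) for a, b in R] + [map_s(b, c) for b, c in S]:
    groups[key].append(value)

joined = [t for b in sorted(groups) for t in reduce_join(b, groups[b])]
```

The reducer for B = 5 sees only an S-side value, so the product of R-count and S-count is zero and it emits nothing.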
47
Outline
  • Parallel Processing Patterns
  • MapReduce Abstraction
  • MapReduce Pseudocode
  • MapReduce Examples
  • Relational Join
  • Matrix Multiplication
  • MapReduce Implementation Overview

48
Matrix Multiplication
[ 1  3  4 -2 ]     [  1 -2 ]     [  1 -9 ]
[ 6  2 -3  1 ]  X  [  4  3 ]  =  [ 23  4 ]
                   [ -3 -2 ]
                   [  0  4 ]
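
One common way to express a matrix product in MapReduce is sketched below in Python. This is an assumption about the approach, since the transcript does not preserve the slide's own formulation. The idea: each element of A is sent to every output cell in its row, each element of B to every output cell in its column, and the reducer for cell (i, k) sums the matching products. The matrices use the numbers from the Matrix Multiplication slide, read as a 2x4 matrix times a 4x2 matrix. All function names are illustrative.

```python
from collections import defaultdict

A = [[1, 3, 4, -2], [6, 2, -3, 1]]        # 2 x 4
B = [[1, -2], [4, 3], [-3, -2], [0, 4]]   # 4 x 2
n_rows_a, n_cols_b = len(A), len(B[0])

def map_a(i, j, a_ij):
    # A[i][j] contributes to every cell (i, k) of the product.
    for k in range(n_cols_b):
        yield (i, k), ("A", j, a_ij)

def map_b(j, k, b_jk):
    # B[j][k] contributes to every cell (i, k) of the product.
    for i in range(n_rows_a):
        yield (i, k), ("B", j, b_jk)

def reduce_cell(key, values):
    # Match A and B entries on the shared index j and sum the products.
    a_vals = {j: v for tag, j, v in values if tag == "A"}
    b_vals = {j: v for tag, j, v in values if tag == "B"}
    return sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)

# Shuffle: group all mapper output by output cell (i, k).
groups = defaultdict(list)
for i, row in enumerate(A):
    for j, a_ij in enumerate(row):
        for key, value in map_a(i, j, a_ij):
            groups[key].append(value)
for j, row in enumerate(B):
    for k, b_jk in enumerate(row):
        for key, value in map_b(j, k, b_jk):
            groups[key].append(value)

AB = [[reduce_cell((i, k), groups[(i, k)]) for k in range(n_cols_b)]
      for i in range(n_rows_a)]
```

The cost of this formulation is replication: every element is emitted once per output cell it touches, which is why multi-pass variants exist for large matrices.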
49
Matrix Multiply in MapReduce

50
Matrix Multiply in MapReduce
[Figure: A X B = AB]
51
Outline
  • Parallel Processing Patterns
  • MapReduce Abstraction
  • MapReduce Pseudocode
  • MapReduce Examples
  • Relational Join
  • Matrix Multiplication
  • MapReduce Implementation Overview

52
Cluster Computing
  • Large number of commodity servers, connected by
    high speed, commodity network
  • Rack holds a small number of servers
  • Data center holds many racks

53
Cluster Computing
  • Massive parallelism
  • - 100s, or 1000s, or 10000s of servers
  • Many hours
  • Failure
  • If mean-time-between-failure is 1 year
  • Then 10000 servers have about one failure / hour
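
The failure estimate is simple arithmetic, sketched here in Python. It assumes servers fail independently at a constant rate, which is a simplification. The function name `expected_failures_per_hour` is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def expected_failures_per_hour(servers: int, mtbf_years: float) -> float:
    """Each server fails on average once per MTBF, so rates add across servers."""
    return servers / (mtbf_years * HOURS_PER_YEAR)

rate = expected_failures_per_hour(10_000, 1.0)
```

With 10,000 servers and a one-year mean time between failures, `rate` is 10000 / 8760, a little more than one failure per hour, so a job that runs for many hours should expect to lose machines along the way.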

54
Distributed File System (DFS)
  • For very large files: TBs, PBs
  • Each file is partitioned into chunks, typically
    64MB
  • Each chunk is replicated several times (3), on
    different racks, for fault tolerance
  • Implementations
  • Google's DFS: GFS, proprietary
  • Hadoop's DFS: HDFS, open source
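
To give a feel for the numbers, here is a small Python sketch of chunking and replica placement. The round-robin placement in `place_replicas` is a hypothetical policy chosen only to illustrate "3 replicas on different racks"; it is not the placement algorithm of GFS or HDFS. The rack names are made up.

```python
CHUNK_SIZE = 64 * 1024**2   # 64 MB, taken here as 64 MiB
REPLICAS = 3

def chunk_count(file_size_bytes: int) -> int:
    """Number of chunks needed, rounding the last partial chunk up."""
    return -(-file_size_bytes // CHUNK_SIZE)

def place_replicas(chunk_index: int, racks: list) -> list:
    """Hypothetical round-robin policy: REPLICAS consecutive racks per chunk.

    The racks are distinct as long as len(racks) >= REPLICAS.
    """
    return [racks[(chunk_index + r) % len(racks)] for r in range(REPLICAS)]

one_tb = 1024**4
n_chunks = chunk_count(one_tb)
racks = ["rack-a", "rack-b", "rack-c", "rack-d"]
placement = place_replicas(0, racks)
```

A 1 TB file becomes 16,384 chunks, and with three replicas each the cluster stores 49,152 chunk copies for it.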

55
MapReduce Phases
[Figure: Map Task: Split -> Record Reader -> Map -> Combine, writing a file to Local Storage. Reduce Task: Copy -> Sort -> Reduce, writing a file to HDFS. Partitions are labeled P 1 through P 5.]
56
Combiner
Same word appears twice. Why not just send (w1,
2)?
[Figure: MAP takes the input pairs (did1, v1), (did2, v2), (did3, v3), . . . and emits (w1, 1), (w2, 1), (w1, 1), . . . so the same mapper emits (w1, 1) twice. Shuffle groups the pairs by word into lists of the form (word, (1, 1, . . . , 1)). REDUCE processes each list.]


57
Adding a Combiner
  • map(String input_key, String input_value)
  • // input_key: document name
  • // input_value: document content
  • For each word w in input_value
  • EmitIntermediate(w, 1)
  • combine(String intermediate_key, Iterator
    intermediate_values)
  • returns (intermediate_key, intermediate_value)
  • reduce(String intermediate_key, Iterator
    intermediate_values)
  • // intermediate_key: word
  • // intermediate_values: ???
  • int result = 0
  • For each v in intermediate_values
  • result += v
  • Emit(result)
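
The effect of a combiner can be seen in a small Python sketch. It assumes the combiner runs once per mapper output, before anything crosses the network, and it counts the pairs that would be shuffled. The names `combine`, `pairs_sent`, and the toy documents are illustrative.

```python
from collections import defaultdict

def map_fn(name, content):
    for w in content.split():
        yield (w, 1)

def combine(pairs):
    """Runs on the mapper's side: pre-sum the counts per word before the shuffle."""
    partial = defaultdict(int)
    for w, n in pairs:
        partial[w] += n
    return list(partial.items())

def reduce_fn(word, values):
    # Values may now be partial sums rather than only 1s.
    return sum(values)

documents = {"did1": "w1 w2 w1", "did2": "w1 w2"}

shuffled = defaultdict(list)
pairs_sent = 0
for name, content in documents.items():
    for w, n in combine(map_fn(name, content)):
        shuffled[w].append(n)
        pairs_sent += 1

counts = {w: reduce_fn(w, ns) for w, ns in shuffled.items()}
```

The mapper for did1 sends (w1, 2) instead of two copies of (w1, 1), so four pairs are shuffled instead of five. This is safe here because addition is associative and commutative; a combiner is not valid for every reduce function.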

58
Apache Hadoop Architecture
[Figure: INPUT PARTITION -> MAP (x4) -> SHUFFLING, SORT IN PARALLEL -> REDUCE (x2) -> OUTPUT PARTITION. DATA ON HDFS.]
59
  • Thank you!