Advanced Topics on MapReduce with Hadoop

1
Advanced topics on MapReduce with Hadoop
  • Jiaheng Lu
  • Department of Computer Science
  • Renmin University of China
  • www.jiahenglu.net

2
Outline
  • Brief Review
  • Chaining MapReduce Jobs
  • Join in MapReduce
  • Bloom Filter

3
Brief Review
  • A parallel programming framework
  • Divide and merge

[Figure: input data (split0, split1, split2) → mappers (one map task per split) → shuffle → reducers (reduce tasks) → output data (output0, output1)]

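The divide-and-merge flow above can be imitated in a few lines of plain Java. This is only an in-memory sketch for intuition, not Hadoop code: the class name `MiniMapReduce` and the word-count task are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {
    // map(): emit one (word, 1) pair per word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // shuffle(): group every emitted value under its key, in key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce(): merge all values of one key into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String split : splits) {
            emitted.addAll(map(split)); // each split would be its own map task
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=3, b=2, c=1}
        System.out.println(wordCount(List.of("a b a", "b c", "a")));
    }
}
```

In a real cluster the map and reduce calls run in parallel on different machines; here they run sequentially, which keeps the data flow easy to follow.
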
4
Chaining MapReduce jobs
  • Chaining in a sequence
  • Chaining with complex dependency
  • Chaining preprocessing and postprocessing steps

5
Chaining in a sequence
  • Simple and straightforward
  • MAP → REDUCE → MAP → REDUCE → MAP
  • The output of one job is the input to the next
  • Similar to Unix pipes

6
  • Configuration conf = getConf();
  • JobConf job = new JobConf(conf);
  • job.setJobName("ChainJob");
  • job.setInputFormat(TextInputFormat.class);
  • job.setOutputFormat(TextOutputFormat.class);
  • FileInputFormat.setInputPaths(job, in);
  • FileOutputFormat.setOutputPath(job, out);
  • JobConf map1Conf = new JobConf(false);
  • ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class, Text.class,
    Text.class, true, map1Conf);

7
Chaining with complex dependency
  • Jobs are not chained in a linear fashion
  • Use the addDependingJob() method to add dependency
    information
  • x.addDependingJob(y) means that x will not start until y has finished

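The effect of dependency information on execution order can be sketched in plain Java. `SimJob` and `runOrder` below are hypothetical stand-ins written for this illustration; they are not Hadoop classes and only mimic the meaning of x.addDependingJob(y).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DependencyDemo {
    // Hypothetical stand-in for a controlled job; not a Hadoop class.
    static final class SimJob {
        final String name;
        final List<SimJob> dependsOn = new ArrayList<>();

        SimJob(String name) {
            this.name = name;
        }

        // Mirrors the meaning of x.addDependingJob(y): x waits for y.
        void addDependingJob(SimJob other) {
            dependsOn.add(other);
        }
    }

    // Repeatedly "run" every job whose dependencies have all finished.
    static List<String> runOrder(List<SimJob> jobs) {
        Set<SimJob> done = new LinkedHashSet<>();
        while (done.size() < jobs.size()) {
            boolean progressed = false;
            for (SimJob job : jobs) {
                if (!done.contains(job) && done.containsAll(job.dependsOn)) {
                    done.add(job);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or unsatisfiable dependency");
            }
        }
        List<String> order = new ArrayList<>();
        for (SimJob job : done) {
            order.add(job.name);
        }
        return order;
    }

    public static void main(String[] args) {
        SimJob x = new SimJob("x");
        SimJob y = new SimJob("y");
        SimJob z = new SimJob("z");
        x.addDependingJob(y);
        x.addDependingJob(z);
        // prints [y, z, x]
        System.out.println(runOrder(List.of(x, y, z)));
    }
}
```

Jobs y and z have no dependencies, so a scheduler is free to run them at the same time; x becomes eligible only after both have finished.
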
8
Chaining preprocessing and postprocessing steps
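  • Example: removing stop words in information retrieval (IR)
  • Approach 1: run each step as a separate job (inefficient)
  • Approach 2: chain the steps into a single job
  • Use ChainMapper.addMapper() and
    ChainReducer.setReducer()
  • Resulting job shape: Map → Reduce → Map (one or more mappers, a
    single reducer, then zero or more mappers)
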
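The idea of folding preprocessing and postprocessing into one job can be illustrated with ordinary function composition. The sketch below is a plain-Java analogy, not the ChainMapper/ChainReducer API: the class `ChainSketch`, the stop-word list, and the four stage names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;

final class ChainSketch {
    // Illustrative stop-word list; a real IR system would use a larger one.
    static final Set<String> STOP_WORDS = Set.of("a", "the", "of");

    // Preprocessing mapper 1: split lines into lower-case tokens.
    static final Function<List<String>, List<String>> TOKENIZE = lines -> {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            for (String t : line.toLowerCase().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
        }
        return out;
    };

    // Preprocessing mapper 2: drop stop words.
    static final Function<List<String>, List<String>> REMOVE_STOP_WORDS = tokens -> {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    };

    // Reducer: count occurrences of each remaining token.
    static final Function<List<String>, Map<String, Integer>> COUNT = tokens -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    };

    // Postprocessing mapper: format each (token, count) pair as one line.
    static final Function<Map<String, Integer>, List<String>> FORMAT = counts -> {
        List<String> out = new ArrayList<>();
        counts.forEach((k, v) -> out.add(k + "\t" + v));
        return out;
    };

    // All stages run as one pipeline; nothing is written out between them.
    static List<String> run(List<String> lines) {
        return TOKENIZE.andThen(REMOVE_STOP_WORDS).andThen(COUNT).andThen(FORMAT).apply(lines);
    }

    public static void main(String[] args) {
        run(List.of("The cat of the house", "a cat")).forEach(System.out::println);
    }
}
```

Running each stage as its own job would write and re-read intermediate output between stages; chaining avoids that I/O, which is the inefficiency noted for the separate-jobs approach.
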
9
Join in MapReduce
  • Reduce-side join
  • Broadcast join
  • Map-side filtering followed by a reduce-side join, filtering on one of:
  • a given key
  • a range of keys taken from one dataset (broadcast)
  • a Bloom filter

10
Reduce-side join
  • map(): output <key, value> pairs, where the key is the join key and
    the value is the record tagged with its data source
  • reduce(): compute the full cross-product of the values from the
    different sources and output the combined records

11
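Example
Table x (columns a, b): (1, ab), (1, cd), (4, ef)
Table y (columns a, c): (1, b), (2, d), (4, c)
map() emits (join key, tagged value):
  • from x: (1, x ab), (1, x cd), (4, x ef)
  • from y: (1, y b), (2, y d), (4, y c)
shuffle() groups the tagged values by join key:
  • 1: x ab, x cd, y b
  • 2: y d
  • 4: x ef, y c
reduce() output (columns a, b, c): (1, ab, b), (1, cd, b), (4, ef, c)
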
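The reduce-side join of tables x and y can be traced with a small in-memory sketch. This is plain Java written for illustration, assuming each row is a {key, value} string pair; the class `ReduceSideJoin` and its helper names are not from the slides, and a `TreeMap` stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideJoin {
    // A mapper output value: the payload tagged with its source table.
    static final class Tagged {
        final String source;
        final String value;

        Tagged(String source, String value) {
            this.source = source;
            this.value = value;
        }
    }

    // map(): key = join key, value = payload tagged with the data source.
    static void map(String source, String[][] table, Map<Integer, List<Tagged>> shuffled) {
        for (String[] row : table) {
            int joinKey = Integer.parseInt(row[0]);
            shuffled.computeIfAbsent(joinKey, k -> new ArrayList<>())
                    .add(new Tagged(source, row[1]));
        }
    }

    // reduce(): full cross-product of the x values and y values of one key.
    static List<String> reduce(int key, List<Tagged> values) {
        List<String> xs = new ArrayList<>();
        List<String> ys = new ArrayList<>();
        for (Tagged t : values) {
            if (t.source.equals("x")) {
                xs.add(t.value);
            } else {
                ys.add(t.value);
            }
        }
        List<String> out = new ArrayList<>();
        for (String x : xs) {
            for (String y : ys) {
                out.add(key + " " + x + " " + y);
            }
        }
        return out;
    }

    static List<String> join(String[][] x, String[][] y) {
        Map<Integer, List<Tagged>> shuffled = new TreeMap<>(); // stands in for shuffle()
        map("x", x, shuffled);
        map("y", y, shuffled);
        List<String> out = new ArrayList<>();
        shuffled.forEach((k, v) -> out.addAll(reduce(k, v)));
        return out;
    }

    public static void main(String[] args) {
        String[][] x = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] y = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        join(x, y).forEach(System.out::println); // 1 ab b / 1 cd b / 4 ef c
    }
}
```

Key 2 appears only in table y, so its cross-product is empty and it produces no output, which matches an inner join.
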
12
Broadcast join (replicated join)
  • Broadcast the smaller table
  • Perform the join in map()
  • Distribute the smaller table through the distributed cache
  • DistributedCache.addCacheFile()
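A broadcast join needs no shuffle at all, because every mapper holds the whole smaller table in memory. The sketch below shows that logic in plain Java; it does not call DistributedCache, and it assumes join keys are unique in the smaller table. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BroadcastJoin {
    // Build the in-memory lookup that each mapper would load from its
    // broadcast copy of the smaller table (assumes unique join keys).
    static Map<String, String> loadSmallTable(String[][] small) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] row : small) {
            lookup.put(row[0], row[1]);
        }
        return lookup;
    }

    // map(): probe the lookup with one record of the larger table.
    static List<String> map(String[] bigRow, Map<String, String> lookup) {
        List<String> out = new ArrayList<>();
        String match = lookup.get(bigRow[0]);
        if (match != null) {
            out.add(bigRow[0] + " " + bigRow[1] + " " + match);
        }
        return out;
    }

    static List<String> join(String[][] big, String[][] small) {
        Map<String, String> lookup = loadSmallTable(small);
        List<String> out = new ArrayList<>();
        for (String[] row : big) {
            out.addAll(map(row, lookup)); // map-only job: no reduce phase
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] big = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] small = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        join(big, small).forEach(System.out::println);
    }
}
```

The approach only works while the smaller table fits in each mapper's memory; otherwise a reduce-side join, possibly with map-side filtering, is the fallback.
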

13
Map-side filtering and Reduce-side join
  • Join key: student IDs from info
  • Generate an IDs file from info
  • Broadcast the IDs file to the mappers
  • Join
  • What if the IDs file can't be stored in memory?
  • Use a Bloom filter

14
A Bloom Filter
  • Introduction
  • Implementation of Bloom filter
  • Use in MapReduce join

15
Introduction to Bloom Filter
  • A space-efficient, constant-size data structure for testing set
    membership; supports add() and contains()
  • No false negatives, and a small probability of
    false positives
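The size of that "small probability" can be estimated with the standard approximation below. The symbols are assumptions introduced here, not from the slides: m is the number of bits, k the number of hash functions, and n the number of elements added, with the hash functions treated as independent and uniform.

```latex
\Pr[\text{a given bit is still } 0] = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m},
\qquad
p_{\text{false positive}} \approx \left(1 - e^{-kn/m}\right)^{k}
```

For instance, with m = 10, k = 3 and n = 2 the estimate is roughly (1 - e^{-0.6})^3 ≈ 0.09. Increasing m lowers the rate, which is the trade-off between space and accuracy.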

16
Implementation of Bloom filter
  • Allocate a bit array with all bits set to 0
  • To add an element: generate k indexes by hashing it, then set
    those k bits to 1
  • To test an element: generate the same k indexes; if all k bits
    are 1, return true, otherwise return false
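The add and test steps above translate almost directly into code. This is a minimal sketch using `java.util.BitSet`; the way the k indexes are derived (double hashing from `String.hashCode()`) is an assumption made for the example, since the slides do not specify the hash functions, and `SimpleBloomFilter` is a hypothetical class name.

```java
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits; // the bit array, initially all 0
    private final int m;       // number of bits
    private final int k;       // number of indexes per element

    SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th index from two base hashes (double hashing).
    // This scheme is an illustrative assumption, not from the slides.
    private int index(String element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, m);
    }

    // add(): set the k bits chosen for this element.
    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    // contains(): true only if all k bits are set. May be a false
    // positive, but is never a false negative.
    boolean contains(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("alice");
        filter.add("bob");
        System.out.println(filter.contains("alice")); // true: it was added
        System.out.println(filter.contains("carol")); // usually false, may be a false positive
    }
}
```

The filter stores no elements, only bits, so its size stays fixed at m bits however many elements are added; the cost is that elements cannot be listed or removed.
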

17
Example
Bit array with 10 positions (indexes 0 to 9), k = 3
  index:          0 1 2 3 4 5 6 7 8 9
  initial state:  0 0 0 0 0 0 0 0 0 0
  add x(0,2,6):   1 0 1 0 0 0 1 0 0 0
  add y(0,3,9):   1 0 1 1 0 0 1 0 0 1
  • contain m(1,3,9)? Bit 1 is 0, so the answer is false: m was never added
  • contain n(0,2,9)? Bits 0, 2 and 9 are all 1, so the answer is true,
    although n was never added: a false positive

18
Use in MapReduce join
  • Run a separate sub-job to create the Bloom filter
  • Broadcast the Bloom filter and use it in map() of the
    join job
  • Drop the records that fail the filter test, then do the join in reduce()
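The three steps above can be sketched end to end in plain Java. Everything here is illustrative: the class `BloomFilteredJoin`, the sizes M and K, the index function, and the sample student records are assumptions, and a simple map lookup stands in for the reduce-side join.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BloomFilteredJoin {
    static final int M = 64; // illustrative number of bits
    static final int K = 3;  // illustrative number of indexes per key

    // Illustrative index function; not taken from the slides.
    static int index(String key, int i) {
        return Math.floorMod(key.hashCode() * (2 * i + 1) + i, M);
    }

    // Sub-job: build the filter from the join keys of the smaller dataset.
    static BitSet buildFilter(List<String> keys) {
        BitSet bits = new BitSet(M);
        for (String key : keys) {
            for (int i = 0; i < K; i++) {
                bits.set(index(key, i));
            }
        }
        return bits;
    }

    static boolean mightContain(BitSet bits, String key) {
        for (int i = 0; i < K; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // map() of the join job: forward only records whose key passes the filter.
    static List<String[]> mapFilter(List<String[]> records, BitSet filter) {
        List<String[]> kept = new ArrayList<>();
        for (String[] r : records) {
            if (mightContain(filter, r[0])) {
                kept.add(r);
            }
        }
        return kept;
    }

    // Stand-in for the reduce-side join: the exact match here also
    // discards any false positives that slipped through the filter.
    static List<String> reduceJoin(List<String[]> kept, Map<String, String> info) {
        List<String> out = new ArrayList<>();
        for (String[] r : kept) {
            String name = info.get(r[0]);
            if (name != null) {
                out.add(r[0] + " " + name + " " + r[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> info = new TreeMap<>(Map.of("s1", "Ann", "s2", "Bob"));
        List<String[]> scores = List.of(
                new String[] {"s1", "90"}, new String[] {"s7", "55"}, new String[] {"s2", "80"});
        BitSet filter = buildFilter(new ArrayList<>(info.keySet()));
        reduceJoin(mapFilter(scores, filter), info).forEach(System.out::println);
    }
}
```

A false positive only costs some wasted shuffle traffic, because the real join still checks the key exactly; a false negative would lose results, which is why the no-false-negatives property matters here.
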

19
References
  • Chuck Lam, Hadoop in Action
  • Jairam Chandar, Join Algorithms using Map/Reduce

20
THANK YOU