Title: Hadoop
1Hadoop
- ?????? 3.
- ???????? MapReduce
2????
- ??????? ???????? MapReduce
- ?????? MapReduce
- ??????? ????????????? MapReduce
- ??????????? ?????????? MapReduce ? Hadoop
- ?????????? MapReduce
3??????? ????????
- MapReduce ??????????? ?????? ???????????????
??? ???????????? ????????? ??????? ??????? ??????
?? ???? ?????????? ?????? ?? ??????????? ????? - MapReduce ????????? ? Google
- Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. 2004.
- ???? ????????? ?????????? Web ??? ?????????
???????
4MapReduce ? Google
- ???????? ???? ????????? ???, ?????????? ??
?????? ???? - ?????????? ? ????????? ??????
- ???????????????, ????????????????? ? ?????????
??????? - ??????????
- ????? ??????? ? ???????? ?????????? ???. ???
????? ?? ??? ?????????? ??? ?????????? ? 3800
????? C ?? 700 - ??????? ???????? ????????? ? ??????? ??????????
(??? ??? ?????? ?????? ???????) - ??????? ??????????????? ????? ??????????? ?????
????? ??? ????????? ?????????? ?????????
5?????????? MapReduce
- Google: closed implementation in C++
- Apache Hadoop: open implementation in Java
- Erlang
- NoSQL
- MongoDB
- CouchDB
6?????? MapReduce
- ?????????????? ????????????????
- ????????? ???????
- ??????? ?????? ?? ??????????
- ???????????, ??? ????? ??????? ? ???????, ? ??
??? ??? ?????? - MapReduce ????????????? ???????????? ????????????
????????? ?? ??????? ????? - ????? ??????, ???? ?????? ???????? ????????????
???????
7??????? Map
- ?????? ? ???????????? ?????? ?????? ??????
- ?????? ??????? Map toUpper(str)
- ???????? ?????? ?? ????????, ????????? ?????!
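The `toUpper(str)` example can be sketched in plain Java, without Hadoop. The class name `MapSketch` and the generic `map` helper are hypothetical, introduced only for illustration. The sketch assumes Map means applying the same operation to each list element independently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration class; not part of Hadoop.
class MapSketch {
    // Applies the same function to every element of the input list
    // and returns the results in the same order.
    static <A, B> List<B> map(List<A> input, Function<A, B> f) {
        List<B> out = new ArrayList<>();
        for (A item : input) {
            out.add(f.apply(item)); // each element is processed independently
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = List.of("hello", "world");
        List<String> upper = map(words, s -> s.toUpperCase());
        System.out.println(upper); // prints [HELLO, WORLD]
    }
}
```

Because no element depends on another, the loop body could run on different machines for different parts of the list.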
8??????? Reduce
- ?????? ? ???????????? ?????? ???? ????????
- ?????? ??????? Reduce
- ??????? ?????? ????? ?? ????????
9MapReduce
- ? MapReduce ? ???????? ?????? ???????????
??????????????? ??????? Map ? Reduce - ?????? ??????? ?? ??? ????-????????
- ??????? Map ? Reduce ?? ?? ?????? ?
?????????????? ?????? - Map ????? ???????????? ??? ??????? ????????
???????? ?????? ????????? ????????? ????????? - Reduce ????? ???????????? ????????? ????????
????????
10?????? ????-????????
- ?????? ?????? ???1
- ???? ??? ????????, ??????????? ?????
- ???????? ?????? ? ????? ?????
1 http://ftp.rts.ru/pub/info/stats/
11????? ? Reduce
- ????????? ????????? ?????? ? ??????? ???????
????????? Reduce ??????????? ???????? - ??? ??????? ????? ???????????? ????????? ????????
????????
12?????? WordCount
- ??????? ?????????? ???? ?? ??????? ??????
- ??????? ??????
- file1 Hello World Bye World
- file2 Hello Hadoop Goodbye Hadoop
- ????????? ??????????
- Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2
13??????? Map ??? WordCount
- ????????
- map (filename, file-contents):
  - for each word in file-contents:
    - emit (word, 1)
- ?????????
- file1
- Hello 1
- World 1
- Bye 1
- World 1
- file2
- Hello 1
- Hadoop 1
- Goodbye 1
- Hadoop 1
14??????? Reduce ??? WordCount
- ????????
- reduce (word, values):
  - sum = 0
  - for each value in values:
    - sum = sum + value
  - emit (word, sum)
- ?????????
- Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2
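The WordCount map and reduce pseudocode can be traced end to end with a small in-memory sketch. It does not use the Hadoop API. The class name `WordCountSketch` and the `run` helper are hypothetical, and the grouping step stands in for what the framework would do between Map and Reduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration; it does not use the Hadoop API.
class WordCountSketch {
    // Map step: emit a (word, 1) pair for each word in the file contents.
    static List<Map.Entry<String, Integer>> map(String fileContents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : fileContents.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum all values collected for one word.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum = sum + value;
        }
        return sum;
    }

    // Runs map on every file, groups the pairs by key, then reduces each group.
    static Map<String, Integer> run(List<String> files) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // sorted by key
        for (String contents : files) {
            for (Map.Entry<String, Integer> pair : map(contents)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> files = List.of("Hello World Bye World",
                                     "Hello Hadoop Goodbye Hadoop");
        System.out.println(run(files));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

The result matches the expected output for file1 and file2 listed in the WordCount example.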
15????? ?????? ? MapReduce
16????? ?????? ? MapReduce
- ??????? ????? ??????????? ? HDFS ? ??????????????
?? ???? ????? - ?? ?????? ???? ??????????? ???????? Map ?
???????????? ??????? ????? - ????? ??????? Map ????? ???????????? ????? ????
- ???????? Map ?????????? ????????????? ?????? ???
????-????????
17????? ?????? MapReduce
- ???? ????-???????? ?????????? ?? ???? ???
????????? Reduce - ??? ???????? ? ?????????? ?????? ??????????
?????? ???????? Reduce - ???????? ?????? ? ???? ?????? ???????????? ? HDFS
18????? ?????? ? MapReduce
- ?????????? ??????? ?????? MapReduce ????????????
????????????? - ??????????? ?? ????? ?????? ????? ??????
- ????????? ?????????? ?????? ?? ????????????
??????? ????? ????? - ? ????????? ?????? ?????????? ??????????????
?????????????? ????? ???? - ? ?????? ???? ???? ???????? Map ? Reduce
????????????? ??????????????? ?? ?????? ????
19????????????? MapReduce
- MapReduce ?????????? ???
- ??????? ?????? ??????? ?????? (?? ????????
???????? ? ??????) - ??????? ?????????? ????? (?? ???????? ????? ?
??????) - ??? ????????? ??????? ??????? ?????? ?????????
??????? - Hadoop ?? ???????? ??????????? ???????????????????
- ?????????????????? ??????? ???????? ???????
20?????????????? grep
- ?????? ????? ????????? ? ????????? ??????
- ??????? Map ?????? ?????? ????? ? ?????????? ?
????????. ??? ?????????? ?????????? ???? - ???? ??? ?????
- ???????? ??????????? ??????
- ??????? Reduce ???????? ??????? ??????
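A distributed grep is commonly described (for example, in the Dean and Ghemawat paper) as a Map that emits a line when it matches a pattern and a Reduce that copies its input to the output unchanged. The in-memory sketch below assumes that description. The class name `GrepSketch` and its methods are hypothetical and do not use the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical in-memory illustration; it does not use the Hadoop API.
class GrepSketch {
    // Map step: emit the line if it contains a match for the pattern.
    static List<String> map(String line, Pattern pattern) {
        List<String> out = new ArrayList<>();
        if (pattern.matcher(line).find()) {
            out.add(line);
        }
        return out;
    }

    // Reduce step: identity, copies the matched lines to the output.
    static List<String> reduce(List<String> matchedLines) {
        return new ArrayList<>(matchedLines);
    }

    // Applies map to every line, then passes the intermediate data to reduce.
    static List<String> grep(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line, pattern));
        }
        return reduce(intermediate);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("Hello World", "Bye World", "Hello Hadoop");
        System.out.println(grep(lines, "Hello")); // prints [Hello World, Hello Hadoop]
    }
}
```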
21????????? ? URL
- ?????? ????????? ?????????? ????????? ? URL
- ??????? Map ?????? ??????? ????????? ?
Web-??????? ? ?????? ???? - ???? URL
- ???????? 1
- ??????? Reduce ????????? ?????????? ??????????
URL ? ?????? ???? - ???? URL
- ???????? ????? ?????????? ?????????
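Counting URL accesses follows the same shape as WordCount: Map emits the pair (URL, 1) for each request in the log, and Reduce sums the values for each URL. The sketch below is in-memory only. The class name `UrlCountSketch` and the log format `<client> <url>` are assumptions made for illustration, not a real server log format.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration; it does not use the Hadoop API.
class UrlCountSketch {
    // Map step: extract the URL (the key) from one log line; the value is 1.
    // Assumed, hypothetical log format: "<client> <url>".
    static String mapUrl(String logLine) {
        return logLine.trim().split("\\s+")[1];
    }

    // Reduce step folded into merge: adds 1 per occurrence, summing per URL.
    static Map<String, Integer> countAccesses(List<String> logLines) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted by URL
        for (String line : logLines) {
            totals.merge(mapUrl(line), 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> log = List.of("10.0.0.1 /index.html",
                                   "10.0.0.2 /about.html",
                                   "10.0.0.3 /index.html");
        System.out.println(countAccesses(log));
        // prints {/about.html=1, /index.html=2}
    }
}
```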
22??????????????? ??????
- ?????? ????????? ?????? ??????????, ? ???????
??????????? ???????? ????? - ??????? Map ?????? ????????? ? ??? ??????? ?????
?????????? ???? - ???? ?????
- ???????? ????????????? ?????????
- ??????? Reduce ?????????? ?????? ??? ???????
????? ? ?????? ???? - ???? ?????
- ???????? ?????? ??????????????? ??????????
23????????? ??? ?????
- ?????? ????? ???????????? ??????? ????????? ???
????? ?? ????????? ?????? - ??????? Map ?????? ?????? ? ????????? ????? ?
?????????? ???? - ???? ??? ????????
- ???????? ????????? ???? ????? ?? ????
- ??????? Reduce ???? ???????? ????? ????????? ??
???? ??? ??????? ????????
24MapReduce ? Hadoop
- Hadoop ?????????? ?????????? MapReduce ?
????????? ????????? ?????? - ???? ???????????????? Java
- ???? ??????????? ?????? ??????? Map ? Reduce ??
?????? ?????? ? ?????????????? Streaming - ???????????? ??????? Linux ? Windows
(??????????), ????? Unix ? Java
25?????? ????? ? Hadoop
26?????? ????? ? Hadoop
- Job ?????? MapReduce
- Task ????? Job, ??????????? Map ??? Reduce
- Job Tracker ?????? ? ???????? Hadoop,
?????????? ?? ?????? ????? ????????????? ??????
?? ????? - Task Tracker ???????????? Task
27????????? ????????? ? Hadoop
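    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
        // Class for the Map function
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> { /* body on the "WordCount Map" slide */ }
        // Class for the Reduce function
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> { /* body on the "WordCount Reduce" slide */ }
        // The main function launches the Hadoop job
        public static void main(String[] args) throws Exception { /* body on the job launch slide */ }
    }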
28WordCount Map ? Hadoop
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // key: offset of the line in the file; value: the line itself
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);  // emit (word, 1)
            }
        }
    }
29WordCount Reduce ? Hadoop
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // add up the counts for this word
            }
            output.collect(key, new IntWritable(sum));  // emit (word, sum)
        }
    }
30?????? ?????? ? Hadoop
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));   // output directory
        JobClient.runJob(conf);
    }
31?????????
- ??? ???????? ??????
- conf.setCombinerClass(Reduce.class)
- ????????? ????????? ?????????? ???????? ?
??????????? ?????????? ????? ????? ?????? Map ??
???????? Reduce - ??????
- Map (<Hello, 1>, <World, 1>, <Hello, 1>, <Hadoop, 1>)
- Combiner (<Hello, 2>, <World, 1>, <Hadoop, 1>)
- ????????? ????????? ????????? ????? ????????????
?? ???? ?????? - ????? ???????????? Reducer, ???? ???????
?????????????? ? ????????????
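The Map and Combiner example above can be reproduced with a short in-memory sketch. The class name `CombinerSketch` and the `combine` method are hypothetical and do not use the Hadoop API. The sketch assumes the combiner sums values per key over the output of a single map task before that output is sent to Reduce.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; it does not use the Hadoop API.
class CombinerSketch {
    // Combines the output of ONE map task locally: sums the values per key.
    // Keys keep the order in which they first appear in the map output.
    static List<Map.Entry<String, Integer>> combine(
            List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("Hello", 1), Map.entry("World", 1),
                Map.entry("Hello", 1), Map.entry("Hadoop", 1));
        System.out.println(combine(mapOutput));
        // prints [Hello=2, World=1, Hadoop=1]
    }
}
```

Summation is commutative and associative, which is why WordCount can pass the same `Reduce` class to `setCombinerClass`: partial sums computed on each node still add up to the same total.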
32????????????? ??????
- MapReduce ????????????? ???????????? ???????
?????? ????? ?????????? Map - ?????? ??????? Map ???????????? ???? ???
????????? ??????? ?????? - ???? ???? ???????, ?? ?? ??????? ?? ????? ?
?????????????? ??????? ?????????? Map - ?? ????????? ?????? ????? ?????? ????? 64??
- Hadoop ????????? ????????? ?????? Map ?? ???
????, ??? ????? ??????? ?????? - ??????????? ?????????? ? ??????
33??????????? ?????????? ? ??????
34???????? ??????????? HDFS
- ? MapReduce ??????? ?????? ?? ??????????, ?
????????? ????? - ? HDFS ????? ???????????? ?????? ???? ??? (WORM)
- ? MapReduce ?????? ?? ??????? Map ??????????????
???????? ?? 64?? - ?????? ????? HDFS 64 ??. ?????? ??? ????? ??????
??????????? ?? ???? ???????? - MapReduce ?????? ??????? ????? ??????? ??????
???????????????, ? ????? ???????????????
?????????? ??????? ????? ???????? ?????? - HDFS ?????????????? ??? ????????? ????????
35?????????? MapReduce
- ??????? ?????????????? ??????????
- ???? HBase, Hive, Pig, Mahout ? ??.
- ???????????? ? ????????? ????????? ? ?????????
??????? ?????? - ????????? ??????? ??????
- ???????? ? ????? ???????? Map ??? Reduce ????? ?
???????? ???? ??????
36???????? ? MapReduce
- ???? ?? ?????????????? ???????? Map, Reduce ??
????? ??????????? - ??????? ?????????????????????
37???????? ? MapReduce
38?????
- MapReduce ??????????? ?????? ???????????????
??? ???????????? ????????? ??????? ??????? ??????
?? ???? ?????????? ?????? ?? ??????????? ????? - ??????? Map ? Reduce ???????? ? ??????????????
????????????????? - ?????????? ??????????? ?????????? ? ??????
- ?????????? MapReduce ?????????????? ??????????,
????????????? ?????? ? ??????? ?????????????,
????????
39?????????????? ?????????
- MapReduce: Simplified Data Processing on Large Clusters
- http://labs.google.com/papers/mapreduce.html
- MapReduce Tutorial
- http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html
- A Study of Skew in MapReduce Applications
- http://nuage.cs.washington.edu/pubs/opencirrus2011.pdf