Title: Led by ZOU Quan, LIN Chen
1?????
??
306???????
Led by ZOU Quan, LIN Chen
Data Mining Group @ Xiamen University
2. BigData BigDeal
?????2???,?????????? ??,???????? ????????????
??????????????????????????????,?????????????????,?
??????????? ?????,????????????????2??????,??????
???????????????????????
3. Big Data Gone Wild
Big Data@Tmall
- ??????33.6???????
- ? 2011?11?11?,???????????
- ?????????????????,????
- ???????????????????11?12
- ???,??????????33.6??,???
- ??????????52??,???????
- ???????????6??
- ? ????12??????43.8?
- ? ???12?12?????????,?????
- ????270??,??????,?????4.75
- ?,??278???
- ? ??????????
- ? ?????
- ? ???
- ? ????
- ? ???
4. Big Data Gone Wild
Big Data@Social Network
??10????? Gmail???? 3.5? Google??? ?1.7? Youtube?
?? ????8?, ??????? ???1??? 11?18?, Android???? ???
???2? Android???? ???????? ?100??
5. Big Data Gone Wild
Big Data@Amazon
- ??????
- 1.37?
- ????S3?
- ??????
- ??????
- ???82??
- ??????
- ??????
- ??Target?
- buy.com??
- ?????5
- ?
6. What will we talk about?
CONTENTS
7. MapReduce Process
8. Four Stages
- InputFormat: input file --> splits --> <K, V>
- MapTask: <K, V> --> map function --> <K', V'>
- Shuffle: <K', V'> --> Sort and Group --> <K', List(V')>
- ReduceTask: <K', List(V')> --> reduce function --> <K'', V''>
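The four stages can be traced end to end with a small in-memory word count. This is only a sketch in plain Java, not Hadoop code: the class `FourStagesSketch` and its methods are hypothetical names, the "input file" is assumed to be a list of lines, and the shuffle is reduced to a sorted map that groups values by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FourStagesSketch {

    // Map stage: one input record <offset, line> becomes several <word, 1> pairs.
    // The offset key is not needed for word count and is ignored here.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle stage: sort by key and group the values, giving <K', List(V')>.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: <K', List(V')> becomes <K'', V''>, here the sum of the counts.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {              // InputFormat stage: records as <offset, line>
            mapped.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "c b a")));   // prints {a=3, b=2, c=1}
    }
}
```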
9. InputFormat
- InputFormat???????,???????
- public interface InputFormat<K, V> {
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
      RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
  }
- ??getSplits????????splits,splits?????map tasks???,splits?????????,?64M
- ??getSplits???split???records,????record???<K,V>?
- ????InputFormat??????
- InputFile --> splits --> <K,V>
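The InputFile --> splits step can be illustrated by cutting a file's byte range into fixed-size pieces. This is a minimal sketch, assuming a 64 MB split size as mentioned on the slide; `SplitSketch` and its nested `Split` class are hypothetical stand-ins rather than Hadoop's `InputSplit`, and the refinements of Hadoop's own file input formats are ignored.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {

    // Hypothetical stand-in for an input split: a byte range within one file.
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Cuts [0, fileLength) into consecutive ranges of at most splitSize bytes.
    static List<Split> getSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Split> splits = getSplits(200 * mb, 64 * mb);
        Split last = splits.get(splits.size() - 1);
        // A 200 MB file with 64 MB splits gives 4 splits; the last one holds 8 MB.
        System.out.println(splits.size() + " splits, last split " + last.length / mb + " MB");
    }
}
```

Since one map task processes one split, the number of splits computed this way is also the number of map tasks the job would start for the file.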
10. Shuffle
- Shuffle ??
-
- MapReduce ???Shuffle ??child ??????Maptask
?Reducetask ??? - Map ?????????Spill ?Collect
- Reduce ?????????Copy?Sort?Reduce
- ??Shuffle ?????????????Circle
Buffer??????????? - Map ??????????SpillThread ?Collect
????,?????????-?????,Collect ????,SpillThread
????,???????SpillLock ???? -
- ??????????
- ??Shuffle ????????????????????
11. ???Shuffle ?????
12. ???Shuffle ?????
- ?WordCount ??
- MapTasks ?ReduceTasks ??3
- ?????a,b,c,d,e,f,g????
-
- ????
- Shuffle@Map (No Combiner)
- Shuffle@ReduceTask 0 (No Combiner)
- ??,Key ????partition ???
- ????????
13. Shuffle@Map (No Combiner)
14. Shuffle@Map
- Shuffle@Map ?????????
- ???????map task ?????????
- ???map ?????,?????????
- ????????????????????????
- ??map task ??????????map task ????????????
- ???????????,????reduce task ??????
15. Shuffle@Map Stage 1
- ?map task ???,?????????HDFS ?block, map task
???split?Split ?block ???????????,??? - ??????????WordCount ??,??a,b,c,d,e..???????
16. Shuffle@Map Stage 2
- ???mapper ????,????mapper
????????key/value ? key ?a, value
???1?????map ??? - ?1 ???,?reduce task ???????????????job ?3 ?reduce
task,?????a??????reduce ???,?? - ???????
-
- MapReduce ??Partitioner ??,????????key ?value ?reduce ???????????????????????reduce task ??????key hash ???reduce task ?????????????????reduce ?????
- ???????,a??Partitioner ???0,?????????????reducer ???????,?????????????,
- ??????????? map ??,???? IO ??????? key/value ??? Partition ????????????
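The slide refers to hashing the key against the number of reduce tasks. A minimal sketch of that idea, mirroring the computation used by Hadoop's default `HashPartitioner`, is given here for `String` keys; the class name `HashPartitionSketch` is hypothetical, and the partition numbers it prints depend on `String.hashCode()`, so they need not match the figures used in the slide's own example.

```java
class HashPartitionSketch {

    // Masks off the sign bit so the result is non-negative, then takes the
    // remainder by the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;   // the WordCount example in the slides uses 3 reduce tasks
        for (String key : new String[] {"a", "b", "c", "d", "e", "f", "g"}) {
            System.out.println(key + " -> reduce task " + getPartition(key, numReduceTasks));
        }
    }
}
```

Every occurrence of the same key lands in the same partition, which is what allows a single reduce task to see all values for that key.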
17. Shuffle@Map Stage 3
- ??????????????,???100MB??map task
????????,????????,????????????????????????,???????
?????????????????????? Spill? - ?????????????,????????map ??????????????????map
?????,??????????????spill.percent????????0.8,?????
???????????(buffer size * spill percent = 100MB * 0.8 = 80MB),??????,???80MB ???,????
- ???Map task ????????????20MB ????,?????
-
- ????????,????80MB ????key ???(Sort)????MapReduce
???????
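The threshold arithmetic on this slide (a 100MB buffer with a spill percent of 0.8 gives 80MB, leaving 20MB for further map output) can be written as a small helper. The class and method names below are hypothetical; only the 100MB and 0.8 figures come from the slide.

```java
class SpillThresholdSketch {

    // Number of buffered bytes at which a spill would be triggered.
    static long spillThresholdBytes(long bufferBytes, double spillPercent) {
        return Math.round(bufferBytes * spillPercent);
    }

    // True once the bytes in use have reached the spill threshold.
    static boolean shouldSpill(long usedBytes, long bufferBytes, double spillPercent) {
        return usedBytes >= spillThresholdBytes(bufferBytes, spillPercent);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long buffer = 100 * mb;
        long threshold = spillThresholdBytes(buffer, 0.8);
        System.out.println("spill at " + threshold / mb + " MB");                    // 80 MB
        System.out.println("still writable: " + (buffer - threshold) / mb + " MB");  // 20 MB
    }
}
```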
18. Shuffle@Map Stage 3
- ????,???????,?????key ?partition ?,????partition
???,??????key ???,????????, - ?????????spill,???????Combiner,????spill
??,???????????Combiner ??,?????? - ?????reduce ???????,?????????a/1,
a/1???WordCount ??,????????????? - ?,??????map task ?????????a???????key,??????????
?????,?????reduce ??combine? - ?MapReduce ????,reduce ??reduce ??????map task
???????????reduce ?,????????????combine - ?????????,MapReduce ??Combiner ???Reducer?
-
- ??client ???Combiner,????????Combiner
?????????key ?key/value ??value
???,????????????Combiner ??? MapReduce
?????,???????????????????????? Combiner
???????,Combiner ???? Reducer ???,Combiner
?????????????Combiner ??????? Reduce ???
key/value ??? - key/value ??????,????????????????,?????Combiner
????????,????,?? job ???????,?????reduce ??????
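For the WordCount case mentioned above, where the same map task emits a/1 more than once, a combiner can add the values locally before anything is sent to the reduce side. The sketch below is plain Java with a hypothetical class name `CombinerSketch`; it assumes the reduce operation is a sum, which is safe to pre-aggregate because addition is commutative and associative. An operation such as an average would not give the same final result if combined this way.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    // Sums the values of identical keys within one map task's output.
    static TreeMap<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput =
                List.of(Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1));
        System.out.println(combine(mapOutput));   // prints {a=2, b=1}
    }
}
```

Two records for key a become one, so less data has to be written to disk and copied across the network.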
19. Shuffle@Map Stage 4
- ?????????????????,??map ?????????,??????????,?????
??????????????map task ?????,????????????????????
???????????????????????????,???????????,??????????
??????,???????Merge? - Merge ???????????,a???map task???????5,?????map
?????8,????????key,???merge ?group???a??????
a, 5, 8, 2, ,????????????????????,???????????
???,??merge ???????????????,?????????key
??,????????client ???Combiner,????Combiner
??????key??????????Combiner - ??Merge ??????mergeParts
??spilln?????,?????????partition ?????spill
??,????? - partition ???????????
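The grouping that merge performs, where key a with values 5, 8 and 2 ends up as one group, can be sketched as follows. This is an in-memory simplification with hypothetical names (`MergeSketch`, `merge`, `sumPerKey`): the real merge streams over sorted files on disk, while here each merge input is assumed to be a small map. The optional `sumPerKey` step shows what a sum-style combiner would do to each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MergeSketch {

    // Groups the values of identical keys across all merge inputs, sorted by key.
    static TreeMap<String, List<Integer>> merge(List<? extends Map<String, Integer>> inputs) {
        TreeMap<String, List<Integer>> merged = new TreeMap<>();
        for (Map<String, Integer> input : inputs) {
            for (Map.Entry<String, Integer> pair : input.entrySet()) {
                merged.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return merged;
    }

    // Optional combiner step: collapses each group to the sum of its values.
    static TreeMap<String, Integer> sumPerKey(Map<String, List<Integer>> groups) {
        TreeMap<String, Integer> sums = new TreeMap<>();
        groups.forEach((key, values) ->
                sums.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return sums;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> inputs =
                List.of(Map.of("a", 5, "b", 1), Map.of("a", 8), Map.of("a", 2, "c", 4));
        TreeMap<String, List<Integer>> groups = merge(inputs);
        System.out.println(groups);              // prints {a=[5, 8, 2], b=[1], c=[4]}
        System.out.println(sumPerKey(groups));   // prints {a=15, b=1, c=4}
    }
}
```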
20. Shuffle@Map Over
- ??,map ??????????,?????????????TaskTracker
??????????????reduce task ?????RPC ?JobTracker
????map task ???????,??reduce task
????,????TaskTracker ??map task
????,Shuffle??????????? -
- ???Shuffle@Reduce ???
21. ???-?????
- ????????Shuffle@Map ?????????-?????,SpillThread
????,MapTask.MapOutputBuffer.collect ???
?,????????? - ???MapOutputBuffer.collect ?????
- (1)?????????(<K, V>??Mapper ?????)
- (2)spillLock.lock(),?????
- (3)????spill ??,???????spillReady.signal(),??spill Thread???spill ??(??spillDone.await()??)
- (4)spillLock.unlock()
- (5)??key,value ???kvindices ?kvoffsets(??,??collect ?synchronized,key ?value ????,??????????????)
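The lock-and-signal sequence in steps (2) to (4) is a producer-consumer handshake between `collect` and the spill thread. The sketch below reproduces only that handshake and is not the Hadoop implementation: `spillLock`, `spillReady` and `spillDone` are the names from the slide, while the class `SpillHandshakeSketch`, the record-count threshold and the list-based buffer are assumptions made to keep the example small. It also adds the record while holding the lock, which is simpler than the ordering listed in step (5).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class SpillHandshakeSketch {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillReady = spillLock.newCondition();
    private final Condition spillDone = spillLock.newCondition();

    private final int threshold;                       // records per spill (hypothetical)
    private List<String> buffer = new ArrayList<>();   // filled by collect()
    private List<String> pending = null;               // batch handed to the spill thread
    private boolean finished = false;
    private final List<List<String>> spills = new ArrayList<>();

    SpillHandshakeSketch(int threshold) { this.threshold = threshold; }

    // Producer side: called once per map output record.
    void collect(String record) throws InterruptedException {
        spillLock.lock();
        try {
            buffer.add(record);
            if (buffer.size() >= threshold) {
                while (pending != null) {
                    spillDone.await();                 // previous spill still running
                }
                pending = buffer;
                buffer = new ArrayList<>();
                spillReady.signal();                   // wake the spill thread
            }
        } finally {
            spillLock.unlock();
        }
    }

    // Final flush: hand over what is left and tell the spill thread to stop.
    void close() throws InterruptedException {
        spillLock.lock();
        try {
            while (pending != null) {
                spillDone.await();
            }
            if (!buffer.isEmpty()) {
                pending = buffer;
                buffer = new ArrayList<>();
            }
            finished = true;
            spillReady.signal();
        } finally {
            spillLock.unlock();
        }
    }

    // Consumer side: body of the spill thread.
    void spillLoop() throws InterruptedException {
        while (true) {
            List<String> batch;
            spillLock.lock();
            try {
                while (pending == null && !finished) {
                    spillReady.await();
                }
                if (pending == null) {
                    return;                            // finished and nothing left to spill
                }
                batch = pending;
            } finally {
                spillLock.unlock();
            }
            spills.add(batch);                         // stands in for writing a spill file
            spillLock.lock();
            try {
                pending = null;
                spillDone.signalAll();
            } finally {
                spillLock.unlock();
            }
        }
    }

    static List<List<String>> run(List<String> records, int threshold) {
        SpillHandshakeSketch s = new SpillHandshakeSketch(threshold);
        Thread spillThread = new Thread(() -> {
            try { s.spillLoop(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        spillThread.start();
        try {
            for (String r : records) s.collect(r);
            s.close();
            spillThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.spills;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c", "d", "e", "f", "g"), 3));
        // prints [[a, b, c], [d, e, f], [g]]
    }
}
```

The "write" of a batch happens outside the lock, so the producer can keep collecting into the fresh buffer while a spill is in progress and only blocks when a second batch is ready before the first has finished.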
22. ???SpillThread
- ???SpillThread ?????
-
- 1)???????????????kv ?????spillSingleRecord ???
-
- 2)???????spill.percent ?,??SpillThread ???
- 3)Mapper ??????collect ?,?????????????Flush
????SpillThread ??
23. Shuffle@ReduceTask 0 (No Combiner)
24. Copy
- Copy ??
- ????????Reduce ????????copy ??(Fetcher),??HTTP
????map task ???TaskTracker??map task ??????
25. Merge
- Merge ???
- ???merge ?map ??merge ??,???????????map ?copy
?????Copy ???????????????,??????????map
??????,???JVM ?heap size ??,??Shuffle ??Reducer
??? - ??????????????Shuffle ?
- Merge ???1)????? 2)????????????????????,?????????merge?
- ?map ???,????????,???????????Combiner,??????(????Combiner ??????)?????????????????
- ???merge ???????,????map ????????,?????????????merge ????????????
26. Reduce
- Reducer ?????????merge ?,????????????
27. Big Cloud@China Mobile
- China Mobile looks to data warehousing and mining of its data to extract insights for improving marketing operations, network optimization, and service optimization.
- Some typical applications include:
- Analyzing user behavior
- Predicting customer churn
- Analyzing service association
- Analyzing network quality of service (QoS)
- Analyzing signaling data
- Filtering spam messages
28. BC-PDM Born
- Because of the limitations of the current system, China Mobile initiated an experimental project to develop a parallel data mining tool set on Hadoop and evaluated it against its current system. They named the project Big Cloud-based Parallel Data Mining (BC-PDM), and it was architected to achieve four objectives:
- Massive scalability: using Hadoop for a scale-out architecture
- Low cost: built around cheap commodity hardware and free software
- Customizable: applications built around specific business requirements
- Ease of use: graphical user interface similar to ones in commercial tools
29. Algorithms it includes
- BC-PDM implemented many of the standard ETL operations and data mining algorithms in MapReduce. The ETL operations include computing aggregate statistics, attribute processing, data sampling, redundancy removal, and others.
- It implemented nine data mining algorithms from three categories: clustering (e.g., K-means), classification (e.g., C4.5), and association analysis (e.g., Apriori).
30. Hardware of the Cloud
- MapReduce programs were executed and evaluated within a Hadoop cluster consisting of 256 nodes connected to a single 264-port Gbps switch. The hardware for the nodes is:
- Datanode/TaskTracker: 1-way 4-core Xeon 2.5 GHz CPU, 8 GB RAM, 4 x 250 GB SATA II disks
- Namenode/JobTracker: 2-way 2-core AMD Opteron 2.6 GHz CPU, 16 GB RAM, 4 x 146 GB SAS disks
31. Cost Comparison
32. Observe the Elephant
- Read the source code of Hadoop
- Evolutionary versions of Hadoop
- i-MapReduce (Pregel, Hama, HaLoop, Twister)
- C-MapReduce
33. Some News
34. Thank you. Any questions?
http://weibo.com/wenruij
Mail: wenruij@gmail.com