Led by ZOU Quan, LIN Chen

1
?????
??
306???????
Led by ZOU Quan, LIN Chen
Data Mining Group @ Xiamen University
2
BigData BigDeal
?????2???,?????????? ??,???????? ????????????
??????????????????????????????,?????????????????,?
???????????  ?????,????????????????2??????,??????
???????????????????????
3
Big Data Gone Wild
Big Data @ Tmall
  • ??????33.6???????
  • ? 2011?11?11?,???????????
  • ?????????????????,????
  • ???????????????????11?12
  • ???,??????????33.6??,???
  • ??????????52??,???????
  • ???????????6??
  • ? ????12??????43.8?
  • ? ???12?12?????????,?????
  • ????270??,??????,?????4.75
  • ?,??278???
  • ? ??????????
  • ? ?????
  • ? ???
  • ? ????
  • ? ???

4
Big Data Gone Wild
Big Data @ Social Network
??10????? Gmail???? 3.5? Google??? ?1.7? Youtube?
?? ????8?, ??????? ???1??? 11?18?, Android???? ???
???2? Android???? ???????? ?100??
5
Big Data Gone Wild
Big Data @ Amazon
  • ??????
  • 1.37?
  • ????S3?
  • ??????
  • ??????
  • ???82??
  • ??????
  • ??????
  • ??Target?
  • buy.com??
  • ?????5
  • ?

6
What will we talk about?
CONTENTS
7
MapReduce Process
8
Four Stages
  • InputFormat: ???? -> ?? -> <K, V>
  • MapTask: <K, V> -> map?? -> <K', V'>
  • Shuffle: <K', V'> -> Sort and Group -> <K', List(V')>
  • ReduceTask: <K', List(V')> -> Reduce?? -> <K'', V''>
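
The four stages above can be sketched in plain Java, without Hadoop, using word count as the job. This is a minimal illustration under stated assumptions: the class and method names (`FourStagesSketch`, `map`, `shuffle`, `reduce`, `run`) are hypothetical, one input line stands in for one record produced by the InputFormat stage, and everything runs in a single process.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical single-process illustration; not Hadoop code.
class FourStagesSketch {
    // MapTask: <K, V> -> <K', V'>. Each word becomes the pair (word, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: <K', V'> -> sort and group -> <K', List(V')>.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // ReduceTask: <K', List(V')> -> <K'', V''>. The counts are summed.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static SortedMap<String, Integer> run(List<String> lines) {
        // InputFormat stand-in: every line is treated as one record.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "c b a"))); // prints {a=3, b=2, c=1}
    }
}
```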

9
InputFormat
  • InputFormat???????,???????
  • public interface InputFormat<K, V> {
      InputSplit[] getSplits(JobConf job, int numSplits)
          throws IOException;
      RecordReader<K, V> getRecordReader(InputSplit split,
                                         JobConf job,
                                         Reporter reporter)
          throws IOException;
    }
  •      ??getSplits????????splits,splits?????map
    tasks???,splits?????????,?64M
  •      ??getSplits???split???records,
    ????record???<K, V>?
  • ????InputFormat??????
  • InputFile -> splits -> <K, V>
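
As a rough illustration of the InputFile -> splits step, the sketch below cuts a file of a given length into fixed-size byte ranges, using the 64M figure from the slide. It is a simplification under stated assumptions: `SplitSketch` and `Range` are hypothetical names, and Hadoop's real split computation applies further rules (block locations, minimum and maximum split sizes, handling of a small tail) that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the Hadoop implementation of getSplits.
class SplitSketch {
    // Stand-in for InputSplit: a byte range of one file.
    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Cut [0, fileLength) into consecutive ranges of at most splitSize bytes.
    static List<Range> getSplits(long fileLength, long splitSize) {
        List<Range> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += splitSize) {
            splits.add(new Range(off, Math.min(splitSize, fileLength - off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Range> splits = getSplits(200 * mb, 64 * mb);
        // In this simplified model each split feeds one map task.
        System.out.println(splits.size() + " splits"); // prints 4 splits
    }
}
```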

10
Shuffle
  • Shuffle ??
  • MapReduce ???Shuffle ??child ??????Maptask
    ?Reducetask ???
  • Map ?????????Spill ?Collect
  • Reduce ?????????Copy?Sort?Reduce
  • ??Shuffle ?????????????Circle
    Buffer???????????
  • Map ??????????SpillThread ?Collect
    ????,?????????-?????,Collect ????,SpillThread
    ????,???????SpillLock ????
  • ??????????
  • ??Shuffle ????????????????????

11
???Shuffle ?????
12
???Shuffle ?????
  • ?WordCount ??
  • MapTasks ?ReduceTasks ??3
  • ?????a,b,c,d,e,f,g????
  • ????
  • Shuffle @ Map (No Combiner)
  • Shuffle @ ReduceTask 0 (No Combiner)
  • ??,Key ????partition ???
  • ????????

13
Shuffle @ Map (No Combiner)
14
Shuffle @ Map
  • Shuffle @ Map ?????????
  • ???????map task ?????????
  • ???map ?????,?????????
  • ????????????????????????
  • ??map task ??????????map task ????????????
  • ???????????,????reduce task ??????

15
Shuffle @ Map Stage 1
  • ?map task ???,?????????HDFS ?block, map task
    ???split?Split ?block ???????????,???
  • ??????????WordCount ??,??a,b,c,d,e..???????

16
Shuffle @ Map Stage 2
  • ???mapper ????,????mapper
    ????????key/value ? key ?a, value
    ???1?????map ???
  • ?1 ???,?reduce task ???????????????job ?3 ?reduce
    task,?????a??????reduce ???,??
  • ???????
  • MapReduce ??Partitioner ??,????????key
    ?value ?reduce ???????????????????????reduce task
    ??????key hash ???reduce task ?????????????????reduce
    ?????
  • ???????,a??Partitioner
    ???0,?????????????reducer ???????,?????????????,
  • ??????????? map ??,???? IO ??????? key/value
    ??? Partition ????????????
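
The key-hash rule mentioned above can be sketched as follows. The formula matches Hadoop's default `HashPartitioner` (non-negative hash code modulo the number of reduce tasks); the class name `PartitionSketch` is hypothetical, and because Java's `String.hashCode()` is not the same as the hash of Hadoop's `Text` type, the partition numbers printed here are not necessarily the ones shown on the slides.

```java
// Hypothetical illustration of hash partitioning with plain Java strings.
class PartitionSketch {
    // Mask off the sign bit so the result is never negative,
    // then take the remainder by the number of reduce tasks.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // the example job on the slides uses 3 reduce tasks
        for (String key : new String[] {"a", "b", "c", "d", "e", "f", "g"}) {
            System.out.println(key + " -> reduce task " + getPartition(key, numReduceTasks));
        }
    }
}
```

Every occurrence of the same key lands in the same partition, which is what lets a single reduce task see all values for that key.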

17
Shuffle @ Map Stage 3
  • ??????????????,???100MB??map task
    ????????,????????,????????????????????????,???????
    ?????????????????????? Spill?
  • ?????????????,????????map ??????????????????map
    ?????,??????????????spill.percent????????0.8,?????
    ???????????(buffer size × spill percent = 100MB ×
    0.8 = 80MB),??????,???80MB ???,????
  • ???Map task ????????????20MB ????,?????
  • ????????,????80MB ????key ???(Sort)????MapReduce
    ???????
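
The threshold arithmetic (buffer size × spill percent) and the sort-before-spill step can be sketched like this. It is a deliberately simplified model under stated assumptions: `SpillSketch` is a hypothetical name, capacity is counted in records rather than bytes, each spill is kept in memory instead of being written to disk, and the spill runs synchronously, whereas the slides describe a separate spill thread working while the map task keeps writing into the remaining space.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, single-threaded model of "spill when the buffer is 80% full".
class SpillSketch {
    final int capacity;          // buffer size, in records for simplicity
    final double spillPercent;   // e.g. 0.8
    final List<String> buffer = new ArrayList<>();
    final List<List<String>> spills = new ArrayList<>(); // one list per spill file

    SpillSketch(int capacity, double spillPercent) {
        this.capacity = capacity;
        this.spillPercent = spillPercent;
    }

    // buffer size x spill percent, e.g. 100 x 0.8 = 80
    int threshold() {
        return (int) Math.round(capacity * spillPercent);
    }

    void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= threshold()) {
            flush();
        }
    }

    // Sort the buffered keys, record them as one spill, and empty the buffer.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
    }

    public static void main(String[] args) {
        SpillSketch s = new SpillSketch(100, 0.8);
        System.out.println(s.threshold()); // prints 80
    }
}
```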

18
Shuffle @ Map Stage 3
  • ????,???????,?????key ?partition ?,????partition
    ???,??????key ???,????????,
  • ?????????spill,???????Combiner,????spill
    ??,???????????Combiner ??,??????
  • ?????reduce ???????,?????????a/1,
    a/1???WordCount ??,?????????????
  • ?,??????map task ?????????a???????key,??????????
    ?????,?????reduce ??combine?
  • ?MapReduce ????,reduce ??reduce ??????map task
    ???????????reduce ?,????????????combine
  • ?????????,MapReduce ??Combiner ???Reducer?
  • ??client ???Combiner,????????Combiner
    ?????????key ?key/value ??value
    ???,????????????Combiner ??? MapReduce
    ?????,???????????????????????? Combiner
    ???????,Combiner ???? Reducer ???,Combiner
    ?????????????Combiner ??????? Reduce ???
    key/value ???
  • key/value ??????,????????????????,?????Combiner
    ????????,????,?? job ???????,?????reduce ??????

19
Shuffle @ Map Stage 4
  • ?????????????????,??map ?????????,??????????,?????
    ??????????????map task ?????,????????????????????
    ???????????????????????????,???????????,??????????
    ??????,???????Merge?
  • Merge ???????????,a???map task???????5,?????map
    ?????8,????????key,???merge ?group???a??????
    a, 5, 8, 2, ,????????????????????,???????????
    ???,??merge ???????????????,?????????key
    ??,????????client ???Combiner,????Combiner
    ??????key??????????Combiner
  • ??Merge ??????mergeParts
    ??spilln?????,?????????partition ?????spill
    ??,?????
  • partition ???????????
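
The merge-and-group behaviour described above, where the values for "a" from different spills end up side by side as 5, 8, 2, can be sketched with a k-way merge over sorted runs. The names `MergeSketch` and `Rec` are hypothetical, the runs live in memory rather than in spill files, and partitions are ignored, so this only illustrates the grouping idea and is not Hadoop's mergeParts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical in-memory k-way merge of spill runs that are each sorted by key.
class MergeSketch {
    static final class Rec {
        final String key;
        final int value;

        Rec(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    static Map<String, List<Integer>> merge(List<List<Rec>> runs) {
        // Heap entries are {runIndex, positionInRun}, ordered by the key they
        // point at; ties are broken by run index to keep the output stable.
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> {
            int c = runs.get(x[0]).get(x[1]).key.compareTo(runs.get(y[0]).get(y[1]).key);
            return c != 0 ? c : Integer.compare(x[0], y[0]);
        });
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {r, 0});
            }
        }
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            Rec rec = runs.get(top[0]).get(top[1]);
            grouped.computeIfAbsent(rec.key, k -> new ArrayList<>()).add(rec.value);
            if (top[1] + 1 < runs.get(top[0]).size()) {
                heap.add(new int[] {top[0], top[1] + 1}); // advance within the same run
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<List<Rec>> runs = Arrays.asList(
                Arrays.asList(new Rec("a", 5), new Rec("b", 1)),
                Arrays.asList(new Rec("a", 8), new Rec("c", 2)),
                Arrays.asList(new Rec("a", 2)));
        System.out.println(merge(runs)); // prints {a=[5, 8, 2], b=[1], c=[2]}
    }
}
```

If a Combiner were configured, it could then be applied to each grouped list, for example turning a=[5, 8, 2] into a=15 for word count.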

20
Shuffle @ Map Over
  • ??,map ??????????,?????????????TaskTracker
    ??????????????reduce task ?????RPC ?JobTracker
    ????map task ???????,??reduce task
    ????,????TaskTracker ??map task
    ????,Shuffle???????????
  • ???Shuffle @ Reduce ???

21
???-?????
  • ????????Shuffle @ Map ?????????-?????,SpillThread
    ????,MapTask.MapOutputBuffer.collect ???
    ?,?????????
  • ???MapOutputBuffer.collect ?????
  • (1)?????????(<K, V>??Mapper ?????)
  • (2)spillLock.lock(),?????
  • (3)????spill ??,???????spillReady.signal(),??spill
    Thread???spill ??(??spillDone.await()??)
  • (4)spillLock.unlock()
  • (5)??key,value ???kvindices ?kvoffsets(??,??collect
    ?synchronized,key ?value ????,??????????????)
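
The lock and condition handshake in steps (2) to (4) can be sketched with `java.util.concurrent.locks`. The field names `spillLock`, `spillReady` and `spillDone` follow the slide; everything else (`SpillLockSketch`, `spillLoop`, `close`, `demo`, the record-count threshold) is hypothetical. Two simplifications are assumed: the collector always waits for the spill to finish, and the record is written while the lock is still held because this sketch shares one plain list between the two threads, whereas the slide lists the write in step (5) after the unlock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical producer-consumer sketch; not the MapOutputBuffer source code.
class SpillLockSketch {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillReady = spillLock.newCondition(); // collector -> spill thread
    private final Condition spillDone = spillLock.newCondition();  // spill thread -> collector
    private final List<String> buffer = new ArrayList<>();
    final List<List<String>> spills = new ArrayList<>();
    private final int threshold;
    private boolean spillRequested = false;
    private boolean closed = false;

    SpillLockSketch(int threshold) {
        this.threshold = threshold;
    }

    // Producer side: called once per map output record.
    void collect(String record) throws InterruptedException {
        spillLock.lock();
        try {
            if (buffer.size() >= threshold) {
                requestSpillAndWait();
            }
            buffer.add(record);
        } finally {
            spillLock.unlock();
        }
    }

    // Must be called with spillLock held.
    private void requestSpillAndWait() throws InterruptedException {
        spillRequested = true;
        spillReady.signal();
        while (spillRequested) {
            spillDone.await(); // releases the lock while waiting
        }
    }

    // Consumer side: body of the spill thread.
    void spillLoop() throws InterruptedException {
        spillLock.lock();
        try {
            while (true) {
                while (!spillRequested && !closed) {
                    spillReady.await();
                }
                if (!spillRequested) {
                    return; // closed and nothing left to spill
                }
                List<String> run = new ArrayList<>(buffer);
                Collections.sort(run);
                spills.add(run);
                buffer.clear();
                spillRequested = false;
                spillDone.signal();
            }
        } finally {
            spillLock.unlock();
        }
    }

    // Final flush after the last record, then stop the spill thread.
    void close() throws InterruptedException {
        spillLock.lock();
        try {
            if (!buffer.isEmpty()) {
                requestSpillAndWait();
            }
            closed = true;
            spillReady.signal();
        } finally {
            spillLock.unlock();
        }
    }

    static List<List<String>> demo() {
        SpillLockSketch s = new SpillLockSketch(3);
        Thread spillThread = new Thread(() -> {
            try {
                s.spillLoop();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        spillThread.start();
        try {
            for (String r : new String[] {"g", "f", "e", "d", "c", "b", "a"}) {
                s.collect(r);
            }
            s.close();
            spillThread.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return s.spills;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[e, f, g], [b, c, d], [a]]
    }
}
```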

22
???SpillThread
  • ???SpillThread ?????
  • 1)???????????????kv ?????spillSingleRecord ???
  • 2)???????spill.percent ?,??SpillThread ???
  • 3)Mapper ??????collect ?,?????????????Flush
    ????SpillThread ??

23
Shuffle @ ReduceTask 0 (No Combiner)
24
Copy
  • Copy ??
  • ????????Reduce ????????copy ??(Fetcher),??HTTP
    ????map task ???TaskTracker??map task ??????

25
Merge
  • Merge ???
  • ???merge ?map ??merge ??,???????????map ?copy
    ?????Copy ???????????????,??????????map
    ??????,???JVM ?heap size ??,??Shuffle ??Reducer
    ???
  • ??????????????Shuffle ?
  • Merge ???1)????? 2)????????????????????,???????
    ??merge?
  • ?map ???,????????,???????????Combiner,??????(????
    Combiner ??????)?????????????????
  • ???merge ???????,????map ????????,?????????????merge
    ????????????

26
Reduce
  • Reducer ?????????merge ?,????????????

27
Big Cloud @ China Mobile
  • China Mobile looks to data warehousing and
    mining of its data to extract insights for
    improving marketing operations, network
    optimization, and service optimization.
  • Some typical applications include:
  • Analyzing user behavior
  • Predicting customer churn
  • Analyzing service association
  • Analyzing network quality of service (QoS)
  • Analyzing signaling data
  • Filtering spam messages

28
BC-PDM Born
  • Because of the limitations of its current system,
    China Mobile initiated an experimental project to
    develop a parallel data mining tool set on Hadoop
    and evaluated it against that system. The project
    was named Big Cloud-based Parallel Data Mining
    (BC-PDM), and it was architected to achieve four
    objectives:
  • Massive scalability: using Hadoop for a
    scale-out architecture
  • Low cost: built around cheap commodity
    hardware and free software
  • Customizable: applications built around
    specific business requirements
  • Ease of use: graphical user interface similar
    to ones in commercial tools

29
Algorithms it includes
  • BC-PDM implemented many of the standard ETL
    operations and data mining algorithms in
    MapReduce. The ETL operations include computing
    aggregate statistics, attribute processing, data
    sampling, redundancy removal, and others.
  • It implemented nine data mining algorithms from
    three categories: clustering (e.g., K-means),
    classification (e.g., C4.5), and association
    analysis (e.g., Apriori).

30
Hardware of the Cloud
  • MapReduce programs were executed and evaluated
    within a Hadoop cluster consisting of 256 nodes
    connected to a single 264-port Gbps switch. The
    node hardware was:
  • Datanode/TaskTracker: 1-way 4-core Xeon 2.5
    GHz CPU, 8 GB RAM, 4 x 250 GB SATA II disks
  • Namenode/JobTracker: 2-way 2-core AMD
    Opteron 2.6 GHz CPU, 16 GB RAM, 4 x 146 GB
    SAS disks
31
Costs Comparison
32
Observe the Elephant
  • Read the Source Code of Hadoop
  • Evolutionary versions of Hadoop
  • i-MapReduce (Pregel, Hama, HaLoop, Twister)
  • C-MapReduce

33
Some News
34
Thank you. Any questions?
http://weibo.com/wenruij  Mail: wenruij@gmail.com