Clustering Very Large Multi-dimensional Datasets with MapReduce

Provided by: rach2229


1
Clustering Very Large Multi-dimensional Datasets with MapReduce
2
INTRODUCTION
  • large datasets of moderate-to-high dimensional elements
  • serial subspace clustering algorithms
  • data sizes: TB to PB
  • e.g., Twitter crawl: > 12 TB
  • Yahoo! operational data: 5 PB
  • take a fast, scalable serial algorithm and make it run efficiently
    in parallel

3
INTRODUCTION
  • bottlenecks: I/O, network
  • Best of both Worlds (BoW)
  • automatically spots the bottleneck and picks a good strategy
  • uses serial clustering methods as a plugged-in clustering
    subroutine

4
(No Transcript)
5
RELATED WORK
  • MapReduce
  • mapper, reducer
  • map stage: reads the input file and outputs (key, value) pairs
  • shuffle stage: transfers the mappers' output to the reducers
    based on the key
  • reduce stage: processes the received pairs and outputs the
    final result
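
The three stages listed above can be imitated in a single process. The following is only a minimal sketch, not Hadoop code: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) and the cell-counting example are illustrative assumptions that do not come from the slides.

```python
from collections import defaultdict

def map_stage(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_stage(pairs, num_reducers):
    """Send each pair to a reducer chosen by its key, grouping values per key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_stage(buckets, reducer):
    """Let every reducer process the keys it received and gather the final result."""
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            result[key] = reducer(key, values)
    return result

def run_job(records, mapper, reducer, num_reducers=2):
    pairs = map_stage(records, mapper)
    buckets = shuffle_stage(pairs, num_reducers)
    return reduce_stage(buckets, reducer)

# Hypothetical example: count how many 1-D points fall in each unit-wide cell.
def cell_mapper(x):
    return [(int(x), 1)]  # int() truncates; fine for the non-negative toy data

def count_reducer(key, values):
    return sum(values)

counts = run_job([0.2, 0.7, 1.5, 3.1, 3.9, 3.3], cell_mapper, count_reducer)
print(counts)
```

In a real cluster the shuffle stage is where data crosses the network, which is why the number of pairs emitted by the mappers matters for the network cost discussed later in the talk.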

6
BoW
  • ParC
  • SnI: trades I/O cost against network cost
  • trade-off

7
ParC--Parallel Clustering
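  • partition the data and assign each partition to one machine
  • each machine clusters its partition with the plugged-in serial
    algorithm, producing β-clusters
  • merge the β-clusters to obtain the final clusters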
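
A toy, single-process sketch of the ParC idea may help: split the data into partitions, cluster each partition with a serial plug-in to obtain β-clusters, then merge the β-clusters. Everything concrete here is an assumption made for illustration: the data is 1-D, the plug-in `cluster_1d` is a hypothetical gap-based clusterer, β-clusters are represented as intervals, and two β-clusters are merged when their intervals overlap.

```python
def cluster_1d(values, max_gap=1.0):
    """Hypothetical serial plug-in: walk the sorted values and start a new
    cluster whenever the gap to the previous value exceeds max_gap.
    Each cluster is returned as an interval (lo, hi)."""
    intervals = []
    for v in sorted(values):
        if intervals and v - intervals[-1][1] <= max_gap:
            intervals[-1] = (intervals[-1][0], v)
        else:
            intervals.append((v, v))
    return intervals

def merge_overlapping(intervals):
    """Merge intervals that overlap into single intervals."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def parc(values, num_partitions, max_gap=1.0):
    # Step 1: partition the data (round-robin stands in for map + shuffle).
    partitions = [values[i::num_partitions] for i in range(num_partitions)]
    # Step 2: every partition is clustered independently, giving beta-clusters.
    beta_clusters = []
    for part in partitions:
        beta_clusters.extend(cluster_1d(part, max_gap))
    # Step 3: merge the beta-clusters into the final clusters.
    return merge_overlapping(beta_clusters)

data = [0.0, 0.5, 1.0, 1.5, 10.0, 10.5, 11.0]
clusters = parc(data, num_partitions=2)
print(clusters)  # [(0.0, 1.5), (10.0, 11.0)]
```

Because each partition sees only part of the data, the local β-clusters can be fragments of the true clusters; the merge step is what stitches them back together.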

8
(No Transcript)
9
SnI--Sample and Ignore
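  • sample the data and find initial clusters in the sample
  • ignore the points that already belong to those clusters
  • run ParC on the remaining points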
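
The Sample-and-Ignore idea can be sketched in the same toy setting: cluster a small sample first, drop every point already covered by a sample cluster, and hand only the remainder to the second clustering phase. The details below are assumptions for illustration only: 1-D data, a hypothetical gap-based plug-in `cluster_1d`, systematic sampling (every `sample_step`-th point) so the result is deterministic, and a plain serial call standing in for ParC in the last step.

```python
def cluster_1d(values, max_gap=1.0):
    """Hypothetical serial plug-in: a new cluster starts whenever the gap
    between consecutive sorted values exceeds max_gap. Returns (lo, hi) intervals."""
    intervals = []
    for v in sorted(values):
        if intervals and v - intervals[-1][1] <= max_gap:
            intervals[-1] = (intervals[-1][0], v)
        else:
            intervals.append((v, v))
    return intervals

def covered(v, intervals):
    """True if the point lies inside any of the given intervals."""
    return any(lo <= v <= hi for lo, hi in intervals)

def sample_and_ignore(values, sample_step, max_gap=1.0):
    # Step 1: cluster a small sample of the data.
    sample = values[::sample_step]
    sample_clusters = cluster_1d(sample, max_gap)
    # Step 2: ignore points that already fall inside a sample cluster.
    remaining = [v for v in values if not covered(v, sample_clusters)]
    # Step 3: cluster only what is left (stand-in for running ParC).
    rest_clusters = cluster_1d(remaining, max_gap)
    return sample_clusters, remaining, rest_clusters

data = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 20.0, 20.3]
sample_clusters, remaining, rest_clusters = sample_and_ignore(data, sample_step=2)
print(len(data), "points in,", len(remaining), "points left for the second phase")
```

In this example only 2 of the 8 points survive the ignore step, which illustrates why fewer points have to be shuffled over the network in the second phase, at the price of reading the data more than once.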

10
(No Transcript)
11
(No Transcript)
12
COST-BASED OPTIMIZATION
  • ParC Cost
  • Map Cost
  • Shuffle Cost
  • Reduce Cost

13
  • SnI Cost

14
BoW
  • compute ParC cost -> costC
  • compute SnI cost -> costCs
  • if costC > costCs then clusters = result of SnI
  • else clusters = result of ParC
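
The selection rule on this slide is small enough to write out directly. The sketch below is a hedged illustration: the actual cost equations are not in the transcript, so `total_cost` simply assumes a cost is the sum of its map, shuffle and reduce parts, and `run_parc` / `run_sni` are hypothetical callables standing in for the two strategies. The numbers in the example are placeholders, not measurements.

```python
def total_cost(map_cost, shuffle_cost, reduce_cost):
    """Assumed cost structure: the sum of the three stage costs."""
    return map_cost + shuffle_cost + reduce_cost

def bow(cost_c, cost_cs, run_parc, run_sni):
    """Pick the cheaper strategy, following the rule on the slide:
    if costC > costCs use SnI, otherwise use ParC."""
    if cost_c > cost_cs:
        return "SnI", run_sni()
    return "ParC", run_parc()

# Placeholder estimates (arbitrary units) and dummy strategies.
cost_c = total_cost(map_cost=10.0, shuffle_cost=50.0, reduce_cost=20.0)
cost_cs = 65.0
strategy, clusters = bow(
    cost_c,
    cost_cs,
    run_parc=lambda: ["clusters from ParC"],
    run_sni=lambda: ["clusters from SnI"],
)
print(strategy)  # SnI, because 80.0 > 65.0
```

Passing the strategies as callables means only the chosen one is actually executed, which mirrors the point of estimating the costs before running anything.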

15
EXPERIMENTAL RESULTS
  • Hadoop
  • M45: 1.5 PB storage, 1 TB memory
  • DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk
    storage

16
Quality of results

17
Scale-up results
  • varying the number of reducers

18
Scale-up results
  • r = 128, m = 700

19
Accuracy of our cost equations
20
Thanks for your time