Title: Cloud Computing - I
1. Cloud Computing - I
- Presenters: Abhishek Verma, Nicolas Zea
2. Cloud Computing
- MapReduce
  - Clean abstraction
  - Extremely rigid two-stage group-by-aggregate data flow
  - Code reuse and maintenance are difficult
- Google → MapReduce, Sawzall
- Yahoo → Hadoop, Pig Latin
- Microsoft → Dryad, DryadLINQ
- Improving MapReduce in heterogeneous environments
3. MapReduce: A group-by-aggregate
[Diagram: input records are split across map tasks; each map's output is locally sorted (quicksort), shuffled by key to the reduce tasks, and aggregated by the reduce tasks to produce the output records.]
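The two-stage group-by-aggregate pattern above can be sketched in a few lines. Below is a minimal, single-process Python sketch of the map → shuffle → reduce flow (an illustration only, not the Hadoop API; the record format and function names are assumptions for this example):

from collections import defaultdict

# map: emit (key, value) pairs from each input record
def map_fn(record):
    url, user = record
    yield (url, 1)

# reduce: aggregate all values that share a key
def reduce_fn(key, values):
    return (key, sum(values))

def mapreduce(records):
    groups = defaultdict(list)                 # "shuffle": group intermediate pairs by key
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]   # reduce phase

visits = [("cnn.com", "Amy"), ("bbc.com", "Amy"), ("cnn.com", "Fred")]
print(mapreduce(visits))                       # [('cnn.com', 2), ('bbc.com', 1)]

Everything other than map_fn and reduce_fn is boilerplate the framework provides, which is exactly the rigidity the next slide criticizes.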
4. Shortcomings
- Extremely rigid data flow
  - Other flows (stages, joins, splits) must be hacked in
- Common operations must be coded by hand
  - Join, filter, projection, aggregation, sorting, distinct
- Semantics hidden inside the map and reduce functions
  - Difficult to maintain, extend, and optimize
5. Pig Latin: A Not-So-Foreign Language for Data Processing
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
- Yahoo! Research
6. Pig Philosophy
- Pigs Eat Anything
  - Can operate on data with or without metadata: relational, nested, or unstructured
- Pigs Live Anywhere
  - Not tied to one particular parallel framework
- Pigs Are Domestic Animals
  - Designed to be easily controlled and modified by its users
  - UDFs: transformation functions, aggregates, grouping functions, and conditionals
- Pigs Fly
  - Processes data quickly (?)
7. Features
- Dataflow language
- Procedural (different from SQL)
- Quick Start and Interoperability
- Nested Data Model
- UDFs as First-Class Citizens
- Parallelism Required
- Debugging Environment
8. Pig Latin
- Data Model
  - Atom: 'cs'
  - Tuple: ('cs', 'ece', 'ee')
  - Bag: {('cs', 'ece'), ('cs')}
  - Map: 'courses' → ('523', '525', '599')
- Expressions
  - Fields by position: $0
  - Fields by name: f1
  - Map lookup: #
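For concreteness, the nested data model can be mirrored with ordinary Python values (an illustrative analogy, not Pig syntax): an atom is a scalar, a tuple is an ordered sequence of fields, a bag is a collection of tuples, and a map associates keys with (possibly nested) values.

# Pig Latin data model mirrored in Python (illustration only)
atom = 'cs'                               # Atom: a single scalar value
tup  = ('cs', 'ece', 'ee')                # Tuple: an ordered sequence of fields
bag  = [('cs', 'ece'), ('cs',)]           # Bag: a collection of tuples (duplicates allowed)
mp   = {'courses': ('523', '525', '599')} # Map: keys mapped to nested values

# Expressions over this data:
print(tup[0])          # field by position ($0 in Pig Latin) -> 'cs'
print(mp['courses'])   # map lookup (# in Pig Latin) -> ('523', '525', '599')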
9. Example Data Analysis Task
- Find the top 10 most-visited pages in each category

Visits:
| User | URL        | Time |
|------|------------|------|
| Amy  | cnn.com    | 800  |
| Amy  | bbc.com    | 1000 |
| Amy  | flickr.com | 1005 |
| Fred | cnn.com    | 1200 |

URL Info: (url, category, pRank)
10. Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
11. In Pig Latin
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
12. Quick Start and Interoperability
(Same script as in Slide 11.)
Operates directly over files
13. Optional Schemas
(Same script as in Slide 11.)
Schemas are optional and can be assigned dynamically
14. UDFs as First-Class Citizens
(Same script as in Slide 11.)
UDFs can be used in every construct
15. Operators
- LOAD: specifying input data
- FOREACH: per-tuple processing
- FLATTEN: eliminates nesting
- FILTER: discarding unwanted data
- COGROUP: getting related data together
- GROUP, JOIN
- STORE: asking for output
- Others: UNION, CROSS, ORDER, DISTINCT
16. COGROUP vs. JOIN
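(The original slide shows this as a figure.) COGROUP collects the tuples from each input that share a key into nested bags, one output tuple per key; JOIN additionally flattens those bags into a per-key cross-product. A rough Python sketch of the difference, with made-up example data (not Pig syntax):

from collections import defaultdict
from itertools import product

results = [('lakers', 'nba.com'), ('lakers', 'espn.com'), ('kings', 'nhl.com')]
revenue = [('lakers', 50), ('kings', 30), ('kings', 10)]

def cogroup(left, right):
    # one output tuple per key, holding a bag of matching tuples from each input
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[0]][0].append(t)
    for t in right:
        groups[t[0]][1].append(t)
    return [(k, bags[0], bags[1]) for k, bags in groups.items()]

def join(left, right):
    # JOIN = COGROUP followed by flattening the per-key bags into a cross-product
    return [l + r for _, lbag, rbag in cogroup(left, right)
                  for l, r in product(lbag, rbag)]

print(cogroup(results, revenue))   # nested bags, grouped by key
print(join(results, revenue))      # flat tuples, one per matching pair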
17. Compilation into MapReduce
- Every group or join operation forms a map-reduce boundary
- Other operations are pipelined into the map and reduce phases
[Diagram: the data flow of Slide 10 split at those boundaries: Load Visits, Group by url (Map1/Reduce1) → Foreach url generate count; Load Url Info; Join on url (Map2/Reduce2) → Group by category (Map3/Reduce3) → Foreach category generate top10 urls]
18. Debugging Environment
- Write-run-debug cycle
- Sandbox dataset
- Objectives
- Realism
- Conciseness
- Completeness
- Problems
- UDFs
19. Future Work
- Optional "safe" query optimizer
  - Performs only high-confidence rewrites
- User interface
  - Boxes-and-arrows UI
  - Promote collaboration, sharing of code fragments and UDFs
- Tight integration with a scripting language
  - Use loops and conditionals of the host language
20. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
- Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
21. Dryad System Architecture
[Diagram: the job manager on the control plane holds the job schedule and consults a name server (NS); per-node daemons (PD) in the cluster run the vertices (V); vertices exchange data over the data plane via files, TCP, or FIFOs across the network.]
22. LINQ
Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
23. DryadLINQ Constructs
- Partitioning: Hash, Range, RoundRobin
- Apply, Fork
- Hints
24. Dryad + LINQ = DryadLINQ
(Same LINQ fragment as Slide 22.)
[Diagram: the query over the data collection is compiled into vertex code and a query plan (a Dryad job); the collection is partitioned across vertices, which compute the results.]
25. DryadLINQ Execution Overview
[Diagram: a C# program on the client machine builds a query expression over input tables; calling ToDryadTable invokes DryadLINQ, which compiles a distributed query plan and submits a Dryad job to the job manager (JM) in the data center; Dryad executes the plan, writes the output tables, and returns the results to the client as an output DryadTable of C# objects that the program iterates with foreach (step 11 in the paper's figure).]
26. System Implementation
- LINQ expressions are converted to an execution plan graph (EPG)
  - Similar to a database query plan
  - A DAG, annotated with metadata properties
- The EPG is the skeleton of the Dryad dataflow graph
  - As long as native operations are used, properties can propagate, helping optimization
27. Static Optimizations
- Pipelining
  - Multiple operations in a single process
- Removing redundancy
- Eager aggregation (see the sketch after this list)
  - Move aggregations in front of partitionings
- I/O reduction
  - Try to use TCP and in-memory FIFOs instead of disk
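The eager-aggregation rewrite is essentially the combiner idea: apply a partial aggregate to each node's local data before repartitioning, so far less data crosses the network. A rough Python sketch under that interpretation (the function names and word-count workload are assumptions for illustration, not the DryadLINQ API):

from collections import Counter

def partial_aggregate(local_records):
    # runs on each node before repartitioning: collapse local duplicates
    return Counter(local_records)

def final_aggregate(partials):
    # runs after the (much smaller) partial counts are shuffled
    total = Counter()
    for p in partials:
        total.update(p)
    return total

node_inputs = [["a", "b", "a"], ["b", "b", "c"]]
partials = [partial_aggregate(recs) for recs in node_inputs]   # per-node pre-aggregation
print(final_aggregate(partials))       # Counter({'b': 3, 'a': 2, 'c': 1})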
28. Dynamic Optimizations
- As information from the running job becomes available, mutate the execution graph
- Decisions based on dataset sizes
- Intelligent partitioning of data
29. Dynamic Optimizations
- Aggregation can turn into a tree to improve I/O, based on locality
- Example: part of the computation is done locally, then aggregated before being sent across the network
30. Evaluation
- 240-computer cluster of 2.6 GHz dual-core AMD Opterons
- Sort 10 billion 100-byte records on a 10-byte key
- Each computer stores 3.87 GB
31. Evaluation
- DryadLINQ vs. Dryad on the SkyServer query
  - The Dryad version is hand-optimized
  - No dynamic optimization overhead
  - DryadLINQ is within about 10% of the native code
32. Main Benefits
- High level and data type transparent
- Automatic optimization friendly
- Manual optimizations using Apply operator
- Leverage any system running LINQ framework
- Support for interacting with SQL databases
- Single computer debugging made easy
- Strong typing, narrow interface
- Deterministic replay execution
33. Discussion
- Dynamic optimizations appear data-intensive
  - What kind of overhead?
  - EPG analysis overhead → high latency?
- No real comparison with other systems
- Progress tracking is difficult
  - No speculation
- Will solid-state drives diminish the advantages of MapReduce?
- Why not use parallel databases?
- MapReduce vs. Dryad
- How different from Sawzall and Pig?
34. Comparison

| Language           | Sawzall                 | Pig Latin                   | DryadLINQ                     |
|--------------------|-------------------------|-----------------------------|-------------------------------|
| Built by           | Google                  | Yahoo                       | Microsoft                     |
| Programming        | Imperative              | Imperative                  | Imperative/declarative hybrid |
| Resemblance to SQL | Least                   | Moderate                    | Most                          |
| Execution engine   | Google MapReduce        | Hadoop                      | Dryad                         |
| Performance        | Very efficient          | 5-10 times slower           | 1.3-2 times slower            |
| Implementation     | Internal, inside Google | Open source, Apache license | Internal, inside Microsoft    |
| Model              | Operate per record      | Sequence of MapReduce jobs  | DAGs                          |
| Usage              | Log analysis            | Machine learning            | Iterative computations        |
35. Improving MapReduce Performance in Heterogeneous Environments
- Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica
- University of California at Berkeley
36. Hadoop Speculative Execution Overview
- Speculative tasks are executed only if there are no failed or waiting tasks available
- Notion of progress
  - Three phases of execution (for a reduce task): copy, sort, reduce
  - Each phase is weighted by the data processed (see the sketch after this list)
  - Determines whether a task has failed or is a straggler and available for speculation
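Per the paper's description of Hadoop's default scheduler, a reduce task's progress score averages the three phases (each weighted 1/3), and a task becomes a speculation candidate when its score falls a fixed margin (0.2) below the average for its category. A minimal Python sketch of that scoring (the phase fractions used here are made-up example values):

def reduce_progress(copy_frac, sort_frac, reduce_frac):
    # each of the three phases contributes 1/3 of the overall progress score
    return (copy_frac + sort_frac + reduce_frac) / 3.0

def is_speculation_candidate(score, category_avg, margin=0.2):
    # Hadoop's heuristic: a task is a straggler if it falls a fixed
    # margin below the average progress of its category
    return score < category_avg - margin

scores = [reduce_progress(1.0, 1.0, 0.8),   # almost done
          reduce_progress(1.0, 0.4, 0.0)]   # stuck in the sort phase
avg = sum(scores) / len(scores)
print([is_speculation_candidate(s, avg) for s in scores])   # [False, True]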
37. Hadoop's Assumptions
- Nodes can perform work at exactly the same rate
- Tasks progress at a constant rate throughout time
- There is no cost to launching a speculative task on an idle node
- The three phases of execution take approximately the same time
- Tasks with a low progress score are stragglers
- Maps and reduces require roughly the same amount of work
38. Breaking Down the Assumptions
- Virtualization breaks down homogeneity
  - Amazon EC2: multiple VMs on the same physical host
  - VMs compete for memory and network bandwidth
  - Example: two map tasks can compete for disk bandwidth, causing one to become a straggler
39. Breaking Down the Assumptions
- The progress threshold in Hadoop is fixed and assumes that low progress means a faulty node
- Too many speculative tasks get executed
- Speculative execution can harm running tasks
40. Breaking Down the Assumptions
- Task phases are not equal
  - The copy phase is typically the most expensive due to network communication cost
  - Causes a rapid jump from 1/3 progress to 1 for many tasks, creating fake stragglers
  - Real stragglers get usurped
  - Unnecessary copying due to fake stragglers
- Because a task is speculated only when its score falls 0.2 below the category average (which is at most 1), anything with more than 80% progress is never speculatively executed
41. LATE Scheduler
- Longest Approximate Time to End
- Primary assumption: the best task to speculatively execute is the one that will finish furthest into the future
- Secondary assumption: tasks make progress at an approximately constant rate
- ProgressRate = ProgressScore / T, where T is the time the task has been running
- Estimated time to completion = (1 - ProgressScore) / ProgressRate
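LATE's core heuristic can be sketched directly from these two formulas: estimate each running task's progress rate, estimate its remaining time, and speculate the task with the longest estimated time left (subject to the caps and thresholds on the next slide). A minimal Python sketch with made-up task names and values:

def time_left(progress_score, elapsed):
    # progress rate = ProgressScore / T; time left = (1 - ProgressScore) / rate
    rate = progress_score / elapsed
    return (1.0 - progress_score) / rate

# (progress score, seconds running) for three hypothetical running tasks
tasks = [("t1", 0.9, 60), ("t2", 0.3, 60), ("t3", 0.5, 30)]
estimates = {name: time_left(p, t) for name, p, t in tasks}
print(estimates)
# t2 has the largest estimate, so LATE would speculate it first (on a fast node)
print(max(estimates, key=estimates.get))   # 't2'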
42. LATE Scheduler
- Launch speculative tasks on fast nodes
  - Best chance to overcome a straggler, versus using the first available node
- Cap on the total number of speculative tasks
- Minimum slowness threshold before a task can be speculated
- Does not take data locality into account
43. Performance Comparison Without Stragglers
- EC2 test cluster
  - 1.0-1.2 GHz Opteron/Xeon with 1.7 GB memory
[Figure: Sort benchmark results]
44. Performance Comparison With Stragglers
- Manually slowed down 8 VMs with background processes
[Figure: Sort benchmark results]
45. Performance Comparison With Stragglers
[Figures: WordCount and Grep benchmark results]
46. Sensitivity
47. Sensitivity
48. Takeaways
- Make decisions early
- Use finishing times
- Nodes are not equal
- Resources are precious
49. Further Questions
- Is focusing the work on small VMs fair?
  - Would it be better to pay for a large VM and implement a system with more customized control?
- Could this be used in other systems?
  - Progress tracking is key
- Is this a fundamental contribution, or just an optimization?
- Good research?