Title: Cloud Computing
1. Cloud Computing Languages and Architectures
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Implementing Data Management Systems
- November 2, 2008
Slides 15-21 by Chris Olston, used with permission
2. The Core of Distributed Programming in the Cloud: MapReduce
- In many circles, considered the key building block for much of Google's data analysis
- A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
  - Sawzall has become one of the most widely used programming languages at Google. On one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2x10^15 bytes of data (2.8PB) and wrote 9.9x10^12 bytes (9.3TB).
- Other similar languages: Yahoo's Pig Latin and Pig, Microsoft's Dryad
- Cloned in open source: Hadoop, http://hadoop.apache.org/core/
3. MapReduce: Simple Distributed Functional Programming Primitives
- Modeled after Lisp primitives:
  - map (apply function to all items in a collection) and reduce (apply function to set of items with a common key)
- We start with:
  - A user-defined function to be applied to all data, map: (key, value) -> (key, value)
  - Another user-specified operation, reduce: (key, set of values) -> result
  - A set of n nodes, each with data
- All nodes run map on all of their data, producing new data with keys
  - This data is collected by key, then reduced
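The three steps above (map on every node, collect by key, reduce) can be sketched on a single machine. This is a minimal illustration rather than how Google's MapReduce or Hadoop is implemented: the names `run_mapreduce`, `length_map`, and `count_reduce` and the sample data are made up for the example, and each "node" is simply a Python list.

```python
from collections import defaultdict


def run_mapreduce(partitions, map_fn, reduce_fn):
    """Simulate map -> collect by key -> reduce on one machine."""
    # Map phase: every simulated node applies map_fn to each (key, value) it holds.
    intermediate = []
    for node_data in partitions:
        for key, value in node_data:
            intermediate.extend(map_fn(key, value))

    # Redistribution: collect the intermediate values by their output key.
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)

    # Reduce phase: one reduce_fn call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def length_map(doc_id, text):
    # map: (key, value) -> list of (key, value); here the new key is the word length.
    return [(len(word), 1) for word in text.split()]


def count_reduce(length, ones):
    # reduce: (key, set of values) -> result
    return sum(ones)


# Two simulated nodes, each holding some (doc_id, text) pairs (made-up data).
partitions = [
    [("d1", "the cloud"), ("d2", "a map")],
    [("d3", "map and reduce")],
]
histogram = run_mapreduce(partitions, length_map, count_reduce)
print(histogram)
```

Here `histogram` maps each word length to the number of words of that length across both simulated nodes.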
4. Some Example Tasks
- Count word occurrences
  - Map: output word with count 1
  - Reduce: sum the counts
- Distributed grep: all lines matching a pattern
  - Map: filter by pattern
  - Reduce: output set
- Count URL access frequency
  - Map: output each URL as key, with count 1
  - Reduce: sum the counts
- For each IP address, get the document with the most in-links
- Number of queries by IP address (requires multiple steps)
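The first three tasks can be written as map and reduce functions over a toy single-machine runner. This is a sketch under assumptions: the helper `run_mapreduce`, the pattern `error`, the grep output key `"match"`, and the sample lines and log entries are all invented for illustration.

```python
import re
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy runner: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Count word occurrences. Map: output word with count 1. Reduce: sum the counts.
def word_map(line):
    return [(word, 1) for word in line.split()]


def sum_reduce(key, counts):
    return sum(counts)


# Distributed grep. Map: filter by pattern. Reduce: output set.
PATTERN = re.compile(r"error")  # hypothetical pattern


def grep_map(line):
    return [("match", line)] if PATTERN.search(line) else []


def grep_reduce(key, lines):
    return sorted(set(lines))


# Count URL access frequency. Map: output each URL as key, with count 1.
def url_map(log_entry):
    ip, url = log_entry
    return [(url, 1)]


lines = ["error in map", "reduce ok", "map error again"]
log = [("192.0.2.1", "/a"), ("192.0.2.2", "/a"), ("192.0.2.1", "/b")]

word_counts = run_mapreduce(lines, word_map, sum_reduce)
grep_hits = run_mapreduce(lines, grep_map, grep_reduce)
url_counts = run_mapreduce(log, url_map, sum_reduce)
print(word_counts, grep_hits, url_counts)
```

Word count and URL frequency share the same reduce function; only the map function changes.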
5. MapReduce Dataflow Diagram (Default MapReduce Uses Filesystem)
[Diagram: a coordinator; data partitions by key; map computation partitions; redistribution by output's key; reduce computation partitions]
6. MapReduce Is Too Low-Level
- It represents a single, two-level aggregation computation
- It requires all of the logic to be encoded into two external functions, map and reduce, even if the operations are generic like selection operations
- Can we do something compositional?
7. A First Take: Sawzall
- Single Map-Reduce operation
- Based on aggregators that take tables, produce tables

    count: table sum of int;
    total: table sum of float;
    sum_of_squares: table sum of float;
    x: float = input;
    emit count <- 1;
    emit total <- x;
    emit sum_of_squares <- x * x;
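To see why sum tables suit a single Map-Reduce pass, the program above can be imitated in Python: each record emits into three running sums, and partial tables from different partitions merge by addition. This is a rough sketch, not Sawzall's runtime; `new_tables`, `process`, `merge`, and the input values are hypothetical.

```python
def new_tables():
    # One entry per declared aggregator: "table sum of int" / "table sum of float".
    return {"count": 0, "total": 0.0, "sum_of_squares": 0.0}


def process(x, tables):
    # Per-record logic, mirroring the three emit statements.
    tables["count"] += 1                 # emit count <- 1
    tables["total"] += x                 # emit total <- x
    tables["sum_of_squares"] += x * x    # emit sum_of_squares <- x * x


def merge(a, b):
    # Sum aggregators combine by addition, so partial results can be merged in any order.
    return {name: a[name] + b[name] for name in a}


partitions = [[1.0, 2.0], [3.0, 4.0]]    # made-up input, split across two "machines"

partials = []
for part in partitions:
    local = new_tables()
    for x in part:
        process(x, local)
    partials.append(local)

tables = new_tables()
for partial in partials:
    tables = merge(tables, partial)

mean = tables["total"] / tables["count"]
variance = tables["sum_of_squares"] / tables["count"] - mean ** 2
print(tables, mean, variance)
```

The three sums are enough to derive the mean and (population) variance after the fact, which is one reason to emit the sum of squares.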
8. Could We Map SQL to MapReduce?
- Select
- Project
- Join
- Group-by
- Having
- Pros and cons of this?
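One possible mapping, shown only as a sketch: selection and projection fit in the map function, group-by is the redistribution by key, and having is a filter applied in the reduce function. The query, the `visits` rows, and the thresholds are invented for the example, and joins are not covered here.

```python
from collections import defaultdict

# Hypothetical query being mapped:
#   SELECT url, COUNT(*) FROM visits
#   WHERE time >= 1000 GROUP BY url HAVING COUNT(*) >= 2


def map_fn(row):
    user, url, time = row
    if time >= 1000:        # Select: the WHERE predicate
        yield (url, 1)      # Project: keep only the grouping column; 1 feeds COUNT(*)


def reduce_fn(url, ones):
    count = sum(ones)       # Group-by aggregate for one key
    if count >= 2:          # Having: filter on the aggregate
        yield (url, count)


def run(rows):
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)     # redistribution by output key
    return [out for key, values in groups.items() for out in reduce_fn(key, values)]


visits = [
    ("Alice", "www.cnn.com", 700),
    ("Alice", "www.digg.com", 720),
    ("Joe", "www.cnn.com", 1200),
    ("Bob", "www.cnn.com", 1300),
    ("Bob", "www.digg.com", 1400),
]
result = run(visits)
print(result)
```

One drawback this makes visible: the generic selection and projection logic is buried inside user code, so the system cannot easily optimize it.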
9. Pig Latin and Pig
- Pig Latin: a compositional, collections-oriented dataflow language
  - Oriented towards parallel data processing and analysis
  - Think of it as a series of query operators, without the declarative language aspects
  - Emphasizes user-defined functions, esp. those that have nice algebraic properties
  - Supports non-first-normal-form (nested) data
  - Supports external data from files
- Pig: the runtime system
10. A Simple Example: Face Detection
- Each expression creates a named collection
- load collections from files
- process them (e.g., per tuple) using a UDF
- store the results into files

    I = load '/mydata/images' using ImageParser() as (id, image);
    F = foreach I generate id, detectFaces(image);
    store F into '/mydata/faces';
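The load / process / store shape of this program can be mimicked in plain Python. Everything below is a stand-in: `image_parser` and `detect_faces` are toy substitutes for the UDFs named on the slide, "images" are short strings, files are JSON lines in a temporary directory, and no real face detection happens.

```python
import json
import os
import tempfile


def image_parser(path):
    """Stand-in loader: yields (id, image) tuples from a JSON-lines file."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield record["id"], record["image"]


def detect_faces(image):
    """Placeholder UDF: counts ':)' markers in a toy string 'image'."""
    return image.count(":)")


def store(collection, path):
    """Write each tuple of the collection as one JSON line."""
    with open(path, "w") as f:
        for row in collection:
            f.write(json.dumps(row) + "\n")


# Set up a made-up input file in a temporary directory.
workdir = tempfile.mkdtemp()
images_path = os.path.join(workdir, "images")
faces_path = os.path.join(workdir, "faces")
with open(images_path, "w") as f:
    f.write(json.dumps({"id": 1, "image": ":) .. :)"}) + "\n")
    f.write(json.dumps({"id": 2, "image": "...."}) + "\n")

images = image_parser(images_path)                             # load ... using a parser
faces = ((id_, detect_faces(img)) for id_, img in images)      # foreach ... generate
store(faces, faces_path)                                       # store ... into a file

with open(faces_path) as f:
    stored = [json.loads(line) for line in f]
print(stored)
```

Both collections are lazy generators, so nothing is read until `store` pulls tuples through the pipeline, loosely echoing how a dataflow is defined first and executed later.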
11. Another Example: Sessions Ending in Best Page According to PageRank
- Suppose we have two tables we'd like to join, then compare the final rank in the sequence vs. other ranks

Pages
URL              PageRank
www.cnn.com      0.9
www.flickr.com   0.9
www.social.com   0.7
www.digg.com     0.2
. . .

Visits
User   URL                     Time
Alice  www.cnn.com             700
Alice  www.digg.com            720
Alice  www.social.com          1000
Alice  www.flickr.com          1005
Joe    www.cnn.com/index.htm   1200
. . .
12. Parallel Evaluation
[Diagram: visit lists (filesystem) and rank lists (filesystem) feed parallel joins, whose results feed a parallel group-by session / choose best step]
13. The Computation in Pig Latin

    Visits = load '/data/visits' as (user, url, time);
    Visits = foreach Visits generate user, Canonicalize(url), time;
    Pages = load '/data/pages' as (url, pagerank);
    VP = join Visits by url, Pages by url;
    UserVisits = group VP by user;
    Sessions = foreach UserVisits generate flatten(FindSessions());
    HappyEndings = filter Sessions by BestIsLast();
    store HappyEndings into '/data/happy_endings';
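A single-machine Python walk-through of the same steps may help make the dataflow concrete. The slides do not define the UDFs, so the versions here are assumptions: `canonicalize` keeps only the host part of the URL, `find_sessions` starts a new session after a gap of more than `SESSION_GAP` time units, and `best_is_last` checks that the final page has the highest PageRank in its session. The rows come from the Pages and Visits tables on the earlier slide.

```python
from collections import defaultdict

SESSION_GAP = 100  # hypothetical threshold, in the same units as the time column


def canonicalize(url):
    # Assumed behavior: keep only the host, so "www.cnn.com/index.htm" -> "www.cnn.com".
    return url.split("/")[0]


def find_sessions(rows, gap=SESSION_GAP):
    # Assumed behavior: split one user's visits into sessions at large time gaps.
    sessions, current = [], []
    for row in sorted(rows, key=lambda r: r["time"]):
        if current and row["time"] - current[-1]["time"] > gap:
            sessions.append(current)
            current = []
        current.append(row)
    if current:
        sessions.append(current)
    return sessions


def best_is_last(session):
    # Assumed behavior: the last page ranks at least as high as every page in the session.
    return session[-1]["pagerank"] >= max(r["pagerank"] for r in session)


visits = [
    ("Alice", "www.cnn.com", 700),
    ("Alice", "www.digg.com", 720),
    ("Alice", "www.social.com", 1000),
    ("Alice", "www.flickr.com", 1005),
    ("Joe", "www.cnn.com/index.htm", 1200),
]
pages = [
    ("www.cnn.com", 0.9),
    ("www.flickr.com", 0.9),
    ("www.social.com", 0.7),
    ("www.digg.com", 0.2),
]

# Visits = foreach Visits generate user, Canonicalize(url), time
visits = [(user, canonicalize(url), time) for user, url, time in visits]

# VP = join Visits by url, Pages by url (assumes url is unique in Pages)
rank = dict(pages)
vp = [
    {"user": user, "url": url, "time": time, "pagerank": rank[url]}
    for user, url, time in visits
    if url in rank
]

# UserVisits = group VP by user
user_visits = defaultdict(list)
for row in vp:
    user_visits[row["user"]].append(row)

# Sessions = foreach UserVisits generate flatten(FindSessions())
sessions = [s for rows in user_visits.values() for s in find_sessions(rows)]

# HappyEndings = filter Sessions by BestIsLast()
happy_endings = [s for s in sessions if best_is_last(s)]

summary = [(s[0]["user"], [r["url"] for r in s]) for s in happy_endings]
print(summary)
```

Under these assumptions Alice's first session (cnn, then digg) is filtered out because it ends on a lower-ranked page, while her second session and Joe's single visit are kept.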
14. Pig Latin Features
- Record-oriented transformations
- Can work over nested collections
- Basic operators expose parallelism; user-defined operators may not
- Operations are explicit, not declarative
- Unary operators:
  - FILTER
  - FOREACH GENERATE
  - GROUP
- Binary operators:
  - JOIN
  - COGROUP
  - UNION
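The semantics of these operators can be approximated over in-memory lists of tuples. This is an illustrative sketch, not Pig's implementation: the function names, the key-extractor arguments, and the sample rows are assumptions. It also shows a join expressed as a cogroup followed by flattening, and that group and cogroup produce nested collections.

```python
from collections import defaultdict
from itertools import product


def filter_op(rows, predicate):                  # FILTER
    return [row for row in rows if predicate(row)]


def foreach_generate(rows, generate):            # FOREACH ... GENERATE
    return [generate(row) for row in rows]


def group(rows, key):                            # GROUP: (key, nested list of rows)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return list(groups.items())


def cogroup(left, left_key, right, right_key):   # COGROUP: (key, left rows, right rows)
    groups = {}
    for row in left:
        groups.setdefault(left_key(row), ([], []))[0].append(row)
    for row in right:
        groups.setdefault(right_key(row), ([], []))[1].append(row)
    return [(k, ls, rs) for k, (ls, rs) in groups.items()]


def join(left, left_key, right, right_key):      # JOIN: cogroup, then flatten
    return [
        l + r
        for _, ls, rs in cogroup(left, left_key, right, right_key)
        for l, r in product(ls, rs)
    ]


def union(a, b):                                 # UNION
    return list(a) + list(b)


# Sample rows with already-canonicalized URLs (simplified for the example).
visits = [
    ("Alice", "www.cnn.com", 700),
    ("Alice", "www.digg.com", 720),
    ("Joe", "www.cnn.com", 1200),
]
pages = [("www.cnn.com", 0.9), ("www.digg.com", 0.2)]

early = filter_op(visits, lambda v: v[2] < 1000)
urls = foreach_generate(visits, lambda v: (v[1],))
by_user = group(visits, key=lambda v: v[0])
vp = join(visits, lambda v: v[1], pages, lambda p: p[0])
everything = union(early, visits[2:])
print(by_user)
print(vp)
```

Because each operator takes collections and returns a collection, they compose freely, in contrast with the fixed map-then-reduce shape.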
15. Pig Latin vs. Map-Reduce
- Map-reduce combines 3 primitives:
  - process records -> create groups -> process groups
- In Pig, these primitives are:
  - explicit
  - independent
  - fully composable
- Pig adds primitives for:
  - filtering tables
  - projecting tables
  - combining 2 or more tables
optimization opportunities
16. Pig System
[Diagram: user; Pig Latin program; cross-job optimizer]
17. Key Issue: Redundant Work
- Popular tables:
  - web crawl
  - search log
- Popular transformations:
  - eliminate spam pages
  - group pages by host
  - join web crawl with search log
- Goal: Minimize redundant work
18. Work-Sharing Techniques
[Diagram; only the label "Join A B" is recoverable]
19. Executing Similar Jobs Together
[Diagram: jobs; queue (job groups); execution engine]
- Optimal queue ordering policy?
- New sharable jobs arrive with frequency λ1, λ2
- Which schedule is best:
  - If λ1 >> λ2
  - If λ1 = λ2
20. Caching Data Transformations
- Options:
  - Cache Op2 output
  - Cache Op3 output
  - Cache both
- Considerations:
  - Space
  - Utility
  - Cost to generate
- => Difficult to estimate a priori
- => Can materialize fragments, and learn
21. Caching Data Moves
[Diagram; only the label "Join A B" is recoverable]
22. Pig Summarized
- Somewhere between a programming language and a DBMS
- Allows distributed programming with explicit parallel dataflow operators
- Runtime system does caching and batching
23. Another Option: EC2
- Amazon Elastic Compute Cloud: part of their range of services
- User gets many Linux VMs
- VMs get temporary IP addresses for communication
- Optionally have access to disk storage, or to Simple Storage Service (key-value pairs)
- How does this compare to Pig?