1
Pig Latin: A Not-So-Foreign Language for Data Processing

Chris Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins

Yahoo! Research
2
Data Processing Renaissance

Internet companies swimming in data
  E.g. TBs/day at Yahoo!
Data analysis is inner loop of product innovation
Data analysts are skilled programmers

3
Data Warehousing?
Often not scalable enough
Prohibitively expensive at web scale
  Up to $200K/TB

Little control over execution method
  Query optimization is hard
    Parallel environment
    Little or no statistics
    Lots of UDFs
4
New Systems For Data Analysis

Map-Reduce
Apache Hadoop
Dryad

. . .
5
Map-Reduce
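[Figure: input records (k1 v1), (k2 v2), (k1 v3), (k2 v4), (k1 v5) pass through two map tasks and are grouped by key for two reduce tasks, one receiving k1 (v1, v3, v5) and one receiving k2 (v2, v4), which emit the output records.]
Just a group-by-aggregate?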
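
The "group-by-aggregate" reading of map-reduce can be sketched in a few lines of Python. This is a single-process toy, not Hadoop's API: the names `map_reduce`, `identity_map`, and `count_reduce` are made up for illustration, and the input reuses the five key/value pairs from the slide.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-process map-reduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # The map function may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            groups[key].append(value)
    # Each key's values are handed to the reduce function together.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def identity_map(record):
    # Hypothetical map function: pass the (key, value) pair through.
    yield record

def count_reduce(key, values):
    # Hypothetical reduce function: aggregate by counting values per key.
    return len(values)

records = [("k1", "v1"), ("k2", "v2"), ("k1", "v3"), ("k2", "v4"), ("k1", "v5")]
counts = map_reduce(records, identity_map, count_reduce)
print(counts)
```

With an identity map and a counting reduce, the whole job amounts to a SQL-style `GROUP BY key` with `COUNT(*)`, which is the point the slide's question makes.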
6
The Map-Reduce Appeal

Scalable due to simpler design
  Only parallelizable operations
  No transactions
Runs on cheap commodity hardware
Procedural control: a processing "pipe"
7
Disadvantages
1. Extremely rigid data flow
   Other flows constantly hacked in: join, union, chains, split
   [Diagrams: the basic M -> R flow, and hacked-in variants built from several M and R stages, labeled Join/Union, Chains, and Split]

2. Common operations must be coded by hand
   Join, filter, projection, aggregates, sorting, distinct

3. Semantics hidden inside map-reduce functions
   Difficult to maintain, extend, and optimize
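
To make point 2 concrete, here is a rough sketch of what coding a join by hand in map-reduce style tends to look like. It is a single-process Python illustration, not Hadoop code: the function names and the source-tagging scheme are assumptions, and the rows are the sample Visits and Url Info records from the example task in this deck.

```python
from collections import defaultdict

def join_map(record):
    """Tag each row with its source so the reducer can tell the inputs apart."""
    source, row = record
    if source == "visits":
        user, url, time = row
        yield url, ("visits", (user, time))
    else:  # assumed to be "urlInfo"
        url, category, page_rank = row
        yield url, ("urlInfo", (category, page_rank))

def join_reduce(url, tagged_values):
    """Pair every visit with every urlInfo row that shares the same url."""
    visits = [value for tag, value in tagged_values if tag == "visits"]
    infos = [value for tag, value in tagged_values if tag == "urlInfo"]
    return [(url, user, time, category, page_rank)
            for user, time in visits
            for category, page_rank in infos]

def run_join(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in join_map(record):
            groups[key].append(value)
    joined = []
    for url in sorted(groups):  # sorted only to make the output order stable
        joined.extend(join_reduce(url, groups[url]))
    return joined

records = [
    ("visits", ("Amy", "cnn.com", 800)),
    ("visits", ("Amy", "bbc.com", 1000)),
    ("visits", ("Amy", "flickr.com", 1005)),
    ("visits", ("Fred", "cnn.com", 1200)),
    ("urlInfo", ("cnn.com", "News", 0.9)),
    ("urlInfo", ("bbc.com", "News", 0.8)),
    ("urlInfo", ("flickr.com", "Photos", 0.7)),
    ("urlInfo", ("espn.com", "Sports", 0.9)),
]
joined = run_join(records)
for row in joined:
    print(row)
```

Nothing in `join_map` or `join_reduce` tells the framework that a join is happening, which is also the concern behind point 3: the semantics live inside opaque functions.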

8
Pros And Cons
Need a high-level, general data flow language
9
Enter Pig Latin
Need a high-level, general data flow language
Pig Latin
10
Outline

Map-Reduce and the need for Pig Latin
Pig Latin example
Salient features
Implementation

11
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User  Url         Time
Amy   cnn.com     800
Amy   bbc.com     1000
Amy   flickr.com  1005
Fred  cnn.com     1200

Url Info
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9
12
Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
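
The steps of this data flow can be traced on the sample tables with a short Python sketch. It only illustrates the logic, not how Pig executes it: the function name `top_urls_per_category` and the tie-breaking rule (alphabetical by url) are assumptions.

```python
from collections import Counter, defaultdict

visits = [
    ("Amy", "cnn.com", 800),
    ("Amy", "bbc.com", 1000),
    ("Amy", "flickr.com", 1005),
    ("Fred", "cnn.com", 1200),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]

def top_urls_per_category(visits, url_info, n=10):
    # Group by url, then foreach url generate count.
    visit_counts = Counter(url for _user, url, _time in visits)
    # Join on url (inner join: urls missing from either side are dropped).
    categories = {url: category for url, category, _rank in url_info}
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in categories:
            # Group by category.
            by_category[categories[url]].append((url, count))
    # Foreach category generate the top-n urls by visit count.
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }

top = top_urls_per_category(visits, url_info)
print(top)
```

On the sample data, espn.com has no visits, so the Sports category does not appear in the result.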
13
In Pig Latin

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

14
Outline

Map-Reduce and the need for Pig Latin
Pig Latin example
Salient features
Implementation

15
Step-by-step Procedural Control

Target users are entrenched procedural programmers

"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
Jasmine Novak, Engineer, Yahoo!

"With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it's more powerful."
David Ciemiewicz, Search Excellence, Yahoo!

Automatic query optimization is hard
Pig Latin does not preclude optimization

16
Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Operates directly over files
17
Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

Schemas optional; can be assigned dynamically
18
User-Code as a First-Class Citizen

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';

User-defined functions (UDFs) can be used in every construct
  Load, Store
  Group, Filter, Foreach
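
One way to picture "user code in every construct" is a set of operators that each take a user-supplied function. The Python sketch below is only an analogy: the operator names (`load`, `filter_rows`, `group`, `foreach`, `store`) and the sample functions are hypothetical and are not Pig's UDF interface.

```python
from collections import defaultdict

def load(lines, parse_fn):
    # User code decides how raw input becomes tuples.
    return [parse_fn(line) for line in lines]

def filter_rows(rows, predicate_fn):
    # User code decides which tuples survive.
    return [row for row in rows if predicate_fn(row)]

def group(rows, key_fn):
    # User code computes the grouping key.
    bags = defaultdict(list)
    for row in rows:
        bags[key_fn(row)].append(row)
    return list(bags.items())

def foreach(rows, generate_fn):
    # User code computes the output tuple for each input tuple.
    return [generate_fn(row) for row in rows]

def store(rows, format_fn):
    # User code decides the output format; here we just return the lines.
    return [format_fn(row) for row in rows]

def parse_visit(line):
    user, url, time = line.split("\t")
    return user, url, int(time)

lines = [
    "Amy\tcnn.com\t800",
    "Amy\tbbc.com\t1000",
    "Amy\tflickr.com\t1005",
    "Fred\tcnn.com\t1200",
]
visits = load(lines, parse_visit)
late = filter_rows(visits, lambda visit: visit[2] >= 1000)
by_user = group(late, lambda visit: visit[0])
counts = foreach(by_user, lambda pair: (pair[0], len(pair[1])))
output = store(counts, lambda row: f"{row[0]}\t{row[1]}")
print(output)
```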

19
Nested Data Model

Pig Latin has a fully-nestable data model with:
  Atomic values, tuples, bags (lists), and maps
More natural to programmers than flat tuples
Avoids expensive joins
See paper
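
As a rough illustration of nesting, the sketch below models a tuple as a Python tuple, a bag as a list of tuples, and a map as a dict. The mapping to Python types and the helper `group_by_url` are assumptions made for this example, using the sample Visits rows from earlier in the deck.

```python
from collections import defaultdict

visits = [
    ("Amy", "cnn.com", 800),
    ("Amy", "bbc.com", 1000),
    ("Amy", "flickr.com", 1005),
    ("Fred", "cnn.com", 1200),
]

def group_by_url(rows):
    """Return one tuple per url whose second field is a nested bag of visits."""
    bags = defaultdict(list)
    for user, url, time in rows:
        bags[url].append((user, time))
    return [(url, bag) for url, bag in bags.items()]

nested = group_by_url(visits)

# A single tuple can hold an atomic value, a bag, and a map side by side.
url, bag = nested[0]
record = (url, bag, {"category": "News", "pRank": 0.9})
print(record)
```

Because each url carries its visits as a nested bag, per-url questions (how many visits, which users) can be answered from one record without joining back to a flat Visits table.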

20
Outline

Map-Reduce and the need for Pig Latin
Pig Latin example
Novel features
Implementation

21
Implementation
[Diagram labels: user; SQL; automatic rewrite + optimize; Pig; Hadoop Map-Reduce]
Pig is open-source: http://incubator.apache.org/pig
22
Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary
Other operations pipelined into map and reduce phases

[Diagram: the example data flow annotated with phase labels, in the order shown]
Map1: Load Visits; Group by url
Reduce1 / Map2: Foreach url generate count; Load Url Info; Join on url
Reduce2 / Map3: Group by category
Reduce3: Foreach category generate top10(urls)
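
The boundary rule can be mimicked with a small Python sketch that cuts a linear list of operations into jobs at every group or join. This is a deliberately simplified toy, not Pig's compiler: the function `split_into_jobs` and its placement rules (a load waits for the next map phase, other operations are pipelined into the current reduce phase) are assumptions made for illustration.

```python
BOUNDARY_OPS = ("group", "join")

def split_into_jobs(plan):
    """Cut a linear list of operation names into map-reduce jobs."""
    jobs = []
    pending_map = []  # operations waiting for the next boundary
    for op in plan:
        kind = op.split()[0].lower()
        if kind in BOUNDARY_OPS:
            # A group or join closes the current map phase and opens a reduce.
            jobs.append({"map": pending_map + [op], "reduce": []})
            pending_map = []
        elif kind == "load" or not jobs:
            # Loads (and anything before the first boundary) run map-side.
            pending_map.append(op)
        else:
            # Other operations are pipelined into the current reduce phase.
            jobs[-1]["reduce"].append(op)
    if pending_map:
        # Leftover operations form a map-only job.
        jobs.append({"map": pending_map, "reduce": []})
    return jobs

plan = [
    "Load Visits",
    "Group by url",
    "Foreach url generate count",
    "Load Url Info",
    "Join on url",
    "Group by category",
    "Foreach category generate top10(urls)",
]
jobs = split_into_jobs(plan)
for number, job in enumerate(jobs, start=1):
    print(f"Map{number}: {job['map']}  Reduce{number}: {job['reduce']}")
```

For the example plan this yields three jobs, one per group or join, in line with the Map1 through Reduce3 labels on the slide.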
23
Usage