Title: Pig Latin: A Not-So-Foreign Language For Data Processing
1 Pig Latin: A Not-So-Foreign Language For Data Processing
- Chris Olston, Benjamin Reed
- Utkarsh Srivastava
- Ravi Kumar, Andrew Tomkins
Yahoo! Research
2 Data Processing Renaissance
- Internet companies swimming in data
  - E.g. TBs/day at Yahoo!
- Data analysis is the inner loop of product innovation
- Data analysts are skilled programmers
3 Data Warehousing?
Scale
- Often not scalable enough
- Prohibitively expensive at web scale
  - Up to $200K/TB
SQL
- Little control over execution method
- Query optimization is hard
  - Parallel environment
  - Little or no statistics
  - Lots of UDFs
4 New Systems For Data Analysis
- Map-Reduce
- Apache Hadoop
- Dryad
- ...
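5 Map-Reduce
[Diagram: input records (k1 v1), (k2 v2), (k1 v3) and (k2 v4), (k1 v5) pass through two map tasks; records are then grouped by key, (k1: v1, v3, v5) and (k2: v2, v4), and handed to two reduce tasks that emit the output records.]
Just a group-by-aggregate?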
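One way to read the question on this slide is the minimal Python sketch below. It is not from the talk. It treats map-reduce as a group-by over emitted keys followed by a per-group aggregate. The helper names `map_reduce`, `identity_map`, and `sum_reduce`, and the numeric values standing in for v1 to v5, are assumptions made for illustration.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def identity_map(record):
    # Hypothetical map function: pass each (key, value) record through unchanged.
    yield record


def sum_reduce(key, values):
    # Hypothetical reduce function: aggregate a key's values by summing them.
    return sum(values)


# Assumed numeric stand-ins for the slide's v1..v5.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]
totals = map_reduce(records, identity_map, sum_reduce)
print(totals)
```

In this reading, anything beyond a single group-by-aggregate (a join, a chain of stages) has to be expressed by composing several such calls.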
6 The Map-Reduce Appeal
Scale
- Scalable due to simpler design
  - Only parallelizable operations
  - No transactions
Runs on cheap commodity hardware
Procedural control: a processing pipe
7 Disadvantages
- 1. Extremely rigid data flow: one map stage (M) feeding one reduce stage (R)
  - Other flows constantly hacked in: join, union, split, chains of M and R stages
- 2. Common operations must be coded by hand
  - Join, filter, projection, aggregates, sorting, distinct
- 3. Semantics hidden inside map-reduce functions
  - Difficult to maintain, extend, and optimize
8 Pros And Cons
Need a high-level, general data flow language
9 Enter Pig Latin
Need a high-level, general data flow language
Pig Latin
10 Outline
- Map-Reduce and the need for Pig Latin
- Pig Latin example
- Salient features
- Implementation
11 Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits
User   Url          Time
Amy    cnn.com      800
Amy    bbc.com      1000
Amy    flickr.com   1005
Fred   cnn.com      1200

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9
12 Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
13 In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
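The same steps can be followed in plain Python. This is a minimal sketch, not Pig's implementation. It runs over the sample rows from the example tables and shows what each statement contributes. The function name `top_urls_by_category` and the choice to drop unvisited URLs (an inner join) are assumptions.

```python
from collections import Counter, defaultdict

# Sample rows copied from the example tables; times kept as written on the slide.
visits = [
    ("Amy", "cnn.com", 800),
    ("Amy", "bbc.com", 1000),
    ("Amy", "flickr.com", 1005),
    ("Fred", "cnn.com", 1200),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]


def top_urls_by_category(visits, url_info, n=10):
    # group visits by url; foreach url generate count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visit counts with url info on url (assumed inner join)
    joined = [
        (url, category, visit_counts[url])
        for url, category, _p_rank in url_info
        if url in visit_counts
    ]
    # group by category
    by_category = defaultdict(list)
    for url, category, count in joined:
        by_category[category].append((url, count))
    # foreach category generate the n most visited urls
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }


top_urls = top_urls_by_category(visits, url_info)
print(top_urls)
```

Each comment corresponds to one box of the data flow, which is the step-by-step structure the Pig Latin script keeps visible.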
14 Outline
- Map-Reduce and the need for Pig Latin
- Pig Latin example
- Salient features
- Implementation
15 Step-by-step Procedural Control
- Target users are entrenched procedural programmers
"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
Jasmine Novak, Engineer, Yahoo!
"With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it's more powerful."
David Ciemiewicz, Search Excellence, Yahoo!
- Automatic query optimization is hard
- Pig Latin does not preclude optimization
16 Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Operates directly over files
17 Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Schemas are optional and can be assigned dynamically
18 User-Code as a First-Class Citizen
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
- User-defined functions (UDFs) can be used in every construct
  - Load, Store
  - Group, Filter, Foreach
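To illustrate what it means for user code to plug into each construct, here is a minimal Python sketch, not Pig's UDF API. A small pipeline takes its load, filter, and foreach behavior as ordinary functions. The names `tab_loader`, `is_news_site`, `shout_user`, and `run_pipeline`, and the list of news sites, are all hypothetical.

```python
def tab_loader(text):
    """Hypothetical load UDF: parse tab-separated lines into tuples of strings."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line]


def is_news_site(row):
    """Hypothetical filter UDF: keep visits to an assumed set of news sites."""
    return row[1] in {"cnn.com", "bbc.com"}


def shout_user(row):
    """Hypothetical foreach UDF: project (user, url) with the user upper-cased."""
    user, url, _time = row
    return (user.upper(), url)


def run_pipeline(text, load, keep, transform):
    # Every stage is supplied by the caller, so user code is a first-class input.
    rows = load(text)
    rows = [row for row in rows if keep(row)]
    return [transform(row) for row in rows]


raw = "Amy\tcnn.com\t800\nAmy\tbbc.com\t1000\nAmy\tflickr.com\t1005\nFred\tcnn.com\t1200\n"
news_visits = run_pipeline(raw, tab_loader, is_news_site, shout_user)
print(news_visits)
```

Swapping any one of the three functions changes that stage without touching the rest of the pipeline.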
19 Nested Data Model
- Pig Latin has a fully nestable data model with
  - Atomic values, tuples, bags (lists), and maps
- More natural to programmers than flat tuples
- Avoids expensive joins
- See paper
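A rough Python analogy for the four kinds of values follows. It is a sketch under assumed names, not Pig's type system. It also shows why nesting can avoid a join: grouping keeps each url's visits as a bag inside one tuple, so a per-url count needs no join back to the visits. The helper `group_by` and the variable names are assumptions.

```python
from collections import defaultdict

# Atomic value, tuple, and map, modeled with Python built-ins (an analogy only).
atom = "cnn.com"
visit = ("Amy", "cnn.com", 800)                    # tuple of atoms
page = {"url": "cnn.com", "category": "News"}      # map from keys to values

visits = [
    ("Amy", "cnn.com", 800),
    ("Amy", "bbc.com", 1000),
    ("Amy", "flickr.com", 1005),
    ("Fred", "cnn.com", 1200),
]


def group_by(rows, key_index):
    """Return (key, bag) tuples, where each bag is the list of rows sharing that key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return [(key, bag) for key, bag in bags.items()]


# Each result tuple nests a bag of visit tuples: (url, [visit, visit, ...]).
g_visits = group_by(visits, 1)
# The count is computed directly on the nested bag, with no join.
visit_counts = [(url, len(bag)) for url, bag in g_visits]
print(visit_counts)
```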
20 Outline
- Map-Reduce and the need for Pig Latin
- Pig Latin example
- Novel features
- Implementation
21 Implementation
[Diagram: the user writes SQL (automatic rewrite and optimize) or Pig; Pig runs on Hadoop Map-Reduce.]
Pig is open-source: http://incubator.apache.org/pig
22 Compilation into Map-Reduce
- Every group or join operation forms a map-reduce boundary
- Other operations are pipelined into map and reduce phases
Map1: Load Visits
Map1/Reduce1 boundary: Group by url
Reduce1: Foreach url generate count
Map2: Load Url Info
Map2/Reduce2 boundary: Join on url
Map3/Reduce3 boundary: Group by category
Reduce3: Foreach category generate top10(urls)
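The boundary rule on this slide can be sketched as a toy planner in Python. This is an illustration under simplifying assumptions, not Pig's actual compiler. It takes a linear operator list, places each group or join on a map/reduce boundary, allows loads only in a map phase, and pipelines everything else into whichever phase is open. The function `plan_jobs` and the operator kinds are hypothetical.

```python
BOUNDARY_KINDS = {"group", "join"}


def plan_jobs(operators):
    """Split (kind, label) operators into map-reduce jobs using the toy rules above."""
    jobs = []
    phase = None  # no job is open before the first operator
    for kind, label in operators:
        # Loads and boundary operators need an open map phase; open a new job
        # if the current job has already moved on to its reduce phase.
        needs_map = kind == "load" or kind in BOUNDARY_KINDS
        if phase is None or (needs_map and phase == "reduce"):
            jobs.append({"map": [], "boundary": None, "reduce": []})
            phase = "map"
        if kind in BOUNDARY_KINDS:
            jobs[-1]["boundary"] = label
            phase = "reduce"
        else:
            jobs[-1][phase].append(label)
    return jobs


operators = [
    ("load", "Load Visits"),
    ("group", "Group by url"),
    ("foreach", "Foreach url generate count"),
    ("load", "Load Url Info"),
    ("join", "Join on url"),
    ("group", "Group by category"),
    ("foreach", "Foreach category generate top10(urls)"),
]
jobs = plan_jobs(operators)
for number, job in enumerate(jobs, start=1):
    print(number, job)
```

On the example script this yields three jobs, matching the three map-reduce pairs on the slide.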
23 Usage
- First production release about a year ago
- 150 early adopters within Yahoo!
- Over 25% of the Yahoo! map-reduce user base
24 Related Work
- Sawzall
  - Data processing language on top of map-reduce
  - Rigid structure of filtering followed by aggregation
- DryadLINQ
  - SQL-like language on top of Dryad
- Nested data models
  - Object-oriented databases
25 Future Work
- Optional safe query optimizer
  - Performs only high-confidence rewrites
- User interface
  - Boxes and arrows UI
  - Promote collaboration, sharing code fragments and UDFs
- Tight integration with a scripting language
  - Use loops, conditionals of host language
26 Credits
Arun Murthy, Pi Song, Santhosh Srinivasan, Amir Youssefi, Shubham Chopra, Alan Gates, Shravan Narayanamurthy, Olga Natkovich
27 Summary
- Big demand for parallel data processing
- Emerging tools that do not look like SQL DBMS
- Programmers like dataflow pipes over static files
- Hence the excitement about Map-Reduce
- But Map-Reduce is too low-level and rigid
Pig Latin: sweet spot between map-reduce and SQL