Chris Olston Benjamin Reed - PowerPoint PPT Presentation

1
Pig Latin: A Not-So-Foreign Language for Data
Processing
  • Chris Olston, Benjamin Reed
  • Utkarsh Srivastava
  • Ravi Kumar, Andrew Tomkins

Research
2
Data Processing Renaissance
  • Internet companies swimming in data
  • E.g. TBs/day at Yahoo!
  • Data analysis is inner loop of product
    innovation
  • Data analysts are skilled programmers

3
Data Warehousing?
Scale
  • Often not scalable enough
  • Prohibitively expensive at web scale
  • Up to $200K/TB

SQL
  • Little control over execution method
  • Query optimization is hard
  • Parallel environment
  • Little or no statistics
  • Lots of UDFs
4
New Systems For Data Analysis
  • Map-Reduce
  • Apache Hadoop
  • Dryad

. . .
5
Map-Reduce
[Figure: input records k1 v1, k2 v2, k1 v3, k2 v4, k1 v5 flow through two map tasks, are regrouped by key (k1: v1, v3, v5; k2: v2, v4), and flow through two reduce tasks to produce the output records.]
Just a group-by-aggregate?
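
The question on this slide can be made concrete with a minimal Python sketch. It is not Hadoop code: `map_reduce`, `identity_map`, `collect`, and `count` are hypothetical names chosen for illustration, and the whole thing runs in one process. It only shows that, with an identity map, the model reduces to grouping by key and aggregating each group.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def identity_map(record):
    # Emit the (key, value) pair unchanged.
    return [record]

def collect(key, values):
    # "Aggregate" that just keeps the grouped values.
    return values

def count(key, values):
    # A typical aggregate: number of values per key.
    return len(values)

# The key-value records shown on the slide.
records = [("k1", "v1"), ("k2", "v2"), ("k1", "v3"), ("k2", "v4"), ("k1", "v5")]

grouped = map_reduce(records, identity_map, collect)
counts = map_reduce(records, identity_map, count)
```

Anything beyond this single group-then-aggregate shape has to be expressed by changing `map_fn`, changing `reduce_fn`, or chaining several such jobs.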
6
The Map-Reduce Appeal
Scale
  • Scalable due to simpler design
  • Only parallelizable operations
  • No transactions

Runs on cheap commodity hardware
Procedural control: a processing "pipe"
7
Disadvantages
  • 1. Extremely rigid data flow: map (M), then reduce (R)
  • Other flows constantly hacked in: join, union, chains, split
  • 2. Common operations must be coded by hand
  • Join, filter, projection, aggregates, sorting,
    distinct
  • 3. Semantics hidden inside map-reduce functions
  • Difficult to maintain, extend, and optimize
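
To illustrate what "coded by hand" means for a join, here is a small Python sketch of one common hand-written approach, a reduce-side join. The function name `reduce_side_join`, the `"L"`/`"R"` tags, and the sample rows are assumptions made for this example; it runs in a single process and is not taken from the talk.

```python
from collections import defaultdict

def reduce_side_join(left, right, left_key, right_key):
    """Inner-join two lists of dicts the way a hand-written map-reduce job might."""
    # "Map" phase: tag each row with its source and emit it under the join key.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for row in left:
        groups[row[left_key]]["L"].append(row)
    for row in right:
        groups[row[right_key]]["R"].append(row)
    # "Reduce" phase: for each key, pair every left row with every right row.
    joined = []
    for key in sorted(groups):
        for left_row in groups[key]["L"]:
            for right_row in groups[key]["R"]:
                joined.append({**left_row, **right_row})
    return joined

# Hypothetical sample rows.
visit_counts = [{"url": "cnn.com", "visits": 2}, {"url": "bbc.com", "visits": 1}]
url_info = [{"url": "cnn.com", "category": "News"},
            {"url": "espn.com", "category": "Sports"}]

result = reduce_side_join(visit_counts, url_info, "url", "url")
```

The join logic lives entirely inside user code, so a framework that only sees opaque map and reduce functions cannot recognize it as a join, which is the point of item 3.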

8
Pros And Cons
Need a high-level, general data flow language
9
Enter Pig Latin
Need a high-level, general data flow language
Pig Latin
10
Outline
  • Map-Reduce and the need for Pig Latin
  • Pig Latin example
  • Salient features
  • Implementation

11
Example Data Analysis Task
Find the top 10 most visited pages in each
category

Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9
12
Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
13
In Pig Latin
  • visits = load '/data/visits' as
    (user, url, time);
  • gVisits = group visits by url;
  • visitCounts = foreach gVisits generate url,
    count(visits);
  • urlInfo = load '/data/urlInfo' as (url,
    category, pRank);
  • visitCounts = join visitCounts by url, urlInfo
    by url;
  • gCategories = group visitCounts by category;
  • topUrls = foreach gCategories generate
    top(visitCounts,10);
  • store topUrls into '/data/topUrls';
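
For readers who want to check the intended result, the same steps can be sketched in plain Python over the sample rows from the example task. This is only an in-memory illustration, not how Pig executes the script; the function name `top_urls_by_category` and the tie-break by URL are assumptions added here.

```python
from collections import Counter, defaultdict

# Sample rows from the example task (times kept as plain strings).
visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

def top_urls_by_category(visits, url_info, n=10):
    # group visits by url; foreach url generate count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join on url (inner join: urls without visits are dropped),
    # then group by category
    by_category = defaultdict(list)
    for url, category, _p_rank in url_info:
        if url in visit_counts:
            by_category[category].append((url, visit_counts[url]))
    # foreach category generate the n most visited urls
    # (ties broken by url, an assumption of this sketch)
    return {category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
            for category, rows in by_category.items()}

top_urls = top_urls_by_category(visits, url_info)
```

Each commented step corresponds to one Pig Latin statement, which is the step-by-step structure the talk argues for.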

14
Outline
  • Map-Reduce and the need for Pig Latin
  • Pig Latin example
  • Salient features
  • Implementation

15
Step-by-step Procedural Control
  • Target users are entrenched procedural
    programmers

"The step-by-step method of creating a program in
Pig is much cleaner and simpler to use than the
single block method of SQL. It is easier to keep
track of what your variables are, and where you
are in the process of analyzing your data."
Jasmine Novak, Engineer, Yahoo!

"With the various interleaved clauses in SQL, it
is difficult to know what is actually happening
sequentially. With Pig, the data nesting and the
temporary tables get abstracted away. Pig has
fewer primitives than SQL does, but it's more
powerful."
David Ciemiewicz, Search Excellence, Yahoo!
  • Automatic query optimization is hard
  • Pig Latin does not preclude optimization

16
Quick Start and Interoperability
  • visits = load '/data/visits' as
    (user, url, time);
  • gVisits = group visits by url;
  • visitCounts = foreach gVisits generate url,
    count(visits);
  • urlInfo = load '/data/urlInfo' as (url,
    category, pRank);
  • visitCounts = join visitCounts by url, urlInfo
    by url;
  • gCategories = group visitCounts by category;
  • topUrls = foreach gCategories generate
    top(visitCounts,10);
  • store topUrls into '/data/topUrls';

Operates directly over files
17
Quick Start and Interoperability
  • visits = load '/data/visits' as
    (user, url, time);
  • gVisits = group visits by url;
  • visitCounts = foreach gVisits generate url,
    count(visits);
  • urlInfo = load '/data/urlInfo' as (url,
    category, pRank);
  • visitCounts = join visitCounts by url, urlInfo
    by url;
  • gCategories = group visitCounts by category;
  • topUrls = foreach gCategories generate
    top(visitCounts,10);
  • store topUrls into '/data/topUrls';

Schemas are optional and can be assigned dynamically
18
User-Code as a First-Class Citizen
  • visits = load '/data/visits' as
    (user, url, time);
  • gVisits = group visits by url;
  • visitCounts = foreach gVisits generate url,
    count(visits);
  • urlInfo = load '/data/urlInfo' as (url,
    category, pRank);
  • visitCounts = join visitCounts by url, urlInfo
    by url;
  • gCategories = group visitCounts by category;
  • topUrls = foreach gCategories generate
    top(visitCounts,10);
  • store topUrls into '/data/topUrls';
  • User-defined functions (UDFs) can be used in
    every construct
  • Load, Store
  • Group, Filter, Foreach
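
The idea of user code plugging into every construct can be sketched in Python with operators that each accept a function. None of this is Pig's UDF interface: `load`, `filter_by`, `group_by`, `foreach`, `parse_visit`, `is_not_crawler`, and the sample lines are all hypothetical names and data used only to show the shape.

```python
from collections import defaultdict

def load(lines, parse):
    """Load: a user-supplied parser turns each raw line into a tuple."""
    return [parse(line) for line in lines]

def filter_by(rows, predicate):
    """Filter: a user-supplied predicate decides which rows survive."""
    return [row for row in rows if predicate(row)]

def group_by(rows, key_fn):
    """Group: a user-supplied function computes the grouping key."""
    bags = defaultdict(list)
    for row in rows:
        bags[key_fn(row)].append(row)
    return dict(bags)

def foreach(groups, generate):
    """Foreach: a user-supplied function is applied to every group."""
    return {key: generate(bag) for key, bag in groups.items()}

# Hypothetical user-defined functions.
def parse_visit(line):
    user, url, time = line.split()
    return (user, url.lower(), time)

def is_not_crawler(row):
    return row[0] != "crawler"

raw = ["Amy CNN.com 8:00", "crawler cnn.com 9:00", "Fred cnn.com 12:00"]
visits = load(raw, parse_visit)
humans = filter_by(visits, is_not_crawler)
counts = foreach(group_by(humans, lambda row: row[1]), len)
```

Here the parser, the predicate, the grouping key, and the aggregate are all ordinary functions supplied by the user.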

19
Nested Data Model
  • Pig Latin has a fully-nestable data model with
    atomic values, tuples, bags (lists), and maps
  • More natural to programmers than flat tuples
  • Avoids expensive joins
  • See paper
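
A rough Python analogy for the nested model, assuming tuples stand in for Pig tuples, lists for bags, and dicts for maps: grouping keeps the full input tuples inside a bag nested in each output tuple, rather than flattening them. The helper `group_by` and the `record` value are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]

def group_by(rows, key_index):
    """Return (key, bag) tuples; each bag holds the full tuples sharing that key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return [(key, bag) for key, bag in bags.items()]

g_visits = group_by(visits, 1)

# One nested value mixing all four kinds: atoms inside a tuple,
# a bag of tuples, and a map (contents are hypothetical).
record = ("cnn.com", g_visits[0][1], {"category": "News"})
```

Because the visits for a URL travel together inside one nested value, later steps can aggregate over the bag without joining the flat rows back together.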

20
Outline
  • Map-Reduce and the need for Pig Latin
  • Pig Latin example
  • Salient features
  • Implementation

21
Implementation
[Figure: stack diagram with the labels user, SQL, "automatic rewrite + optimize", Pig, and Hadoop Map-Reduce.]
Pig is open-source. http://incubator.apache.org/pig
22
Compilation into Map-Reduce
  • Every group or join operation forms a map-reduce
    boundary
  • Other operations pipelined into map and reduce
    phases

Map1
  Load Visits
  Group by url
Reduce1, Map2
  Foreach url generate count
  Load Url Info
  Join on url
Reduce2, Map3
  Group by category
Reduce3
  Foreach category generate top10(urls)
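
The boundary rule can be illustrated with a toy Python planner. This is a simplified sketch, not Pig's compiler: it assumes a linear list of operation strings, treats the first word of each string as the operation kind, and places every load in a map phase. The name `compile_to_stages` and the operation strings are hypothetical.

```python
BOUNDARY_OPS = ("group", "join")

def compile_to_stages(ops):
    """Assign each operation to the map or reduce phase of a numbered job."""
    jobs = []            # each job: {"map": [...], "reduce": [...]}
    in_reduce = False
    for op in ops:
        kind = op.split()[0]
        starts_map = kind in BOUNDARY_OPS or kind == "load"
        # A load, group, or join seen while in a reduce phase opens a new job.
        if not jobs or (in_reduce and starts_map):
            jobs.append({"map": [], "reduce": []})
            in_reduce = False
        jobs[-1]["reduce" if in_reduce else "map"].append(op)
        # Every group or join ends the current map phase.
        if kind in BOUNDARY_OPS:
            in_reduce = True
    return jobs

plan = ["load visits", "group by url", "foreach url generate count",
        "load urlInfo", "join on url", "group by category",
        "foreach category generate top10"]
jobs = compile_to_stages(plan)
```

For the example plan this yields three jobs, one per group or join, with the foreach steps pipelined into the reduce phase that follows each boundary.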
23
Usage
  • First production release about a year ago
  • 150 early adopters within Yahoo!
  • Over 25% of the Yahoo! map-reduce user base

24
Related Work
  • Sawzall
  • Data processing language on top of map-reduce
  • Rigid structure of filtering followed by
    aggregation
  • DryadLINQ
  • SQL-like language on top of Dryad
  • Nested data models
  • Object-oriented databases

25
Future Work
  • Optional safe query optimizer
  • Performs only high-confidence rewrites
  • User interface
  • Boxes and arrows UI
  • Promote collaboration, sharing code fragments and
    UDFs
  • Tight integration with a scripting language
  • Use loops, conditionals of host language

26
Credits
Arun Murthy, Pi Song, Santhosh Srinivasan, Amir
Youssefi, Shubham Chopra, Alan Gates, Shravan
Narayanamurthy, Olga Natkovich
27
Summary
  • Big demand for parallel data processing
  • Emerging tools that do not look like SQL DBMS
  • Programmers like dataflow pipes over static files
  • Hence the excitement about Map-Reduce
  • But, Map-Reduce is too low-level and rigid

Pig Latin: sweet spot between map-reduce and SQL