Team Picard - PowerPoint PPT Presentation

Description:

Ironman. Frank,5. Jeff,5. Matt,4. John,4. Fargo. Create Groups ... Ironman,4. Jeff,John, Jesse. Fargo,4. Matt,John. Fargo,5. Frank,Jeff. Jaws,1. Matt. Jaws,5 ...

1
Cloud Computing Your Movie Ratings
Team Picard
Jonathan Kupferman Jeff Silverman
Frank Jones John Morse
Jesse Wang
2
What Am I?
  • I Am Responsible For 20% Of Amazon's Revenue [1]
  • $4 Billion In 2007
  • I Doubled The Revenue Of Overstock.com [2]
  • I Increased User Queues On Blockbuster.com By 50% [3]
  • I Make Customers 45% More Likely To Shop At A Site
  • Hint: I Am Not YouTube Or David Hasselhoff

Personalized Recommendations
[1] ReadWriteWeb. "The Art, Science and Business of Recommendation Engines." January 16, 2007.
[2] Leavitt Communications. "Industry Trends." May 2006.
[3] Internet Retailer. "Getting Personal Helps Boost Web Sales." January 2008.
3
Recommendations
  • How It's Done
  • Get An Item You Don't Have
  • Find People Who Have It
  • Who Is Similar To You?
  • Did They Like It?
  • Your Opinion Likely Similar
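
The steps on this slide can be sketched as a tiny neighbor-based predictor. This is only an illustration: the ratings dictionary, the user names, and the agreement-based similarity measure are assumptions made for the example, not the team's data or method.

```python
# Hypothetical toy ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "you":   {"Crash": 5, "Fargo": 4},
    "alice": {"Crash": 5, "Fargo": 4, "Jaws": 5},
    "bob":   {"Crash": 1, "Fargo": 2, "Jaws": 1},
}

def similarity(a, b):
    """Fraction of co-rated movies on which two users gave the same rating."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    return sum(ratings[a][m] == ratings[b][m] for m in shared) / len(shared)

def predict(user, movie):
    """Similarity-weighted average over the people who have rated the movie."""
    raters = [u for u in ratings if u != user and movie in ratings[u]]
    weights = {u: similarity(user, u) for u in raters}
    total = sum(weights.values())
    if total == 0:
        return None  # nobody similar has rated this movie
    return sum(weights[u] * ratings[u][movie] for u in raters) / total

print(predict("you", "Jaws"))  # 5.0: only alice agrees with "you", and she gave a 5
```

Here bob has the movie but disagrees with "you" on every shared title, so his opinion carries no weight.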

4
So What's The Problem?
  • Why It's Hard
  • Extracting Only The Relevant Data
  • Incomplete Information
  • Data Growth Exponentially Increases Computation
  • Reliance On Recommendations
  • Amazon, Pandora
  • Public Challenge
  • Little Improvement Since 2000
  • Netflix Prize: $1,000,000

5
Two Different Approaches
  • Improve Algorithm
  • Fix Problem Cases
  • Combining Algorithms
  • Data-Specific Improvements (Overfitting)
  • Not Designed For Parallel Computation
  • Add More Data
  • More Opinions: 10 vs. 10,000
  • Fewer Outliers
  • Need To Deal With Scale

6
Our Plan
  • Simple General Algorithm
  • Add More Data
  • Relevant Social Networking Sites
  • Cloud Computing
  • Distributed Computation
  • Scalability
  • Applied To The Netflix Prize
  • Beat Their Algorithm By 10%
  • Subset Of Their Data Provided

7
Algorithm
  • Categorical Approach
  • Put User Into Groups
  • Group = (Movie, Rating)
  • e.g. (Crash,5) = {Frank, John}
  • Determine Group Likelihood Combinations
  • If In (Crash,5), Then 66% In (Jaws,5)
  • Pick A Movie The User Hasn't Seen
  • Which Group Are They Most Likely To Be In?
  • Max Of (Jaws,1), (Jaws,2), (Jaws,3), (Jaws,4), (Jaws,5)
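
A minimal sketch of the categorical approach, under stated assumptions: the toy ratings are invented, likelihood is taken to be the share of one group's members who also appear in another, and several likelihoods are combined with the 1 - (1 - p)(1 - q)... form that appears on the "Determine Group Probability" slide. The function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (user, movie, rating).
ratings = [
    ("Frank", "Crash", 5), ("John", "Crash", 5), ("Matt", "Crash", 5),
    ("Frank", "Jaws", 5),  ("John", "Jaws", 5),  ("Matt", "Jaws", 1),
]

# Put users into groups keyed by (movie, rating).
groups = defaultdict(set)
for user, movie, rating in ratings:
    groups[(movie, rating)].add(user)

def likelihood(a, b):
    """Share of group a's members who are also in group b (assumed definition)."""
    members = groups.get(a)
    if not members:
        return 0.0
    return len(members & groups.get(b, set())) / len(members)

def predict(user_groups, movie):
    """Score each (movie, r) group and return the most likely rating."""
    scores = {}
    for r in range(1, 6):
        miss = 1.0
        for g in user_groups:
            miss *= 1.0 - likelihood(g, (movie, r))
        scores[r] = 1.0 - miss
    return max(scores, key=scores.get), scores

print(round(likelihood(("Crash", 5), ("Jaws", 5)), 2))  # 0.67
best, scores = predict([("Crash", 5)], "Jaws")
print(best)  # 5
```

With this data, two of the three members of (Crash,5) are also in (Jaws,5), which reproduces the roughly 66% figure from the slide.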

8
  • Movie Oriented Social Networking
  • High Quality Data
  • More Diverse User Base
  • More Profiles
  • Added 1.2 Million Profiles
  • Dataset 3x Larger

9
Cloud Computing
  • Space/Computation Requirements
  • 88,000 Groups
  • 88,000² ≈ 8 Billion Group Comparisons
  • Some Groups Have 100,000 Users
  • Renting Computation Power
  • Pay For Use - $0.10/hour
  • No Admin Headaches
  • Adjust Size On The Fly
  • Amazon Elastic Compute Cloud (EC2)
  • Managed Using Rightscale
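
The comparison count on this slide is quick to check. The sketch below also shows the count of unordered pairs, on the assumption (not stated in the slides) that the overlap between two groups only needs to be computed once per pair.

```python
groups = 88_000

# Every group compared against every group, as on the slide.
ordered_pairs = groups ** 2

# Assumption: overlap counts are symmetric, so each distinct pair is needed once.
unordered_pairs = groups * (groups - 1) // 2

print(f"{ordered_pairs:,}")    # 7,744,000,000
print(f"{unordered_pairs:,}")  # 3,871,956,000
```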

10

11
Parallel Computation: MapReduce
  • Break Data Into Small Pieces
  • Compute On Separate Machines
  • Easy To Use
  • Designed For Data
  • Google Created It And Uses It
  • 20,000,000 Gigabytes Per Day
  • 11,000 Machines
  • Yahoo! In Comparable Numbers
  • Hadoop Open-Source Implementation
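
To make the map and reduce phases concrete, here is a single-process sketch in plain Python that builds (movie, rating) groups. It only imitates the programming model; it does not use the Hadoop API, and the record format and function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_rating(record):
    """Map: one 'user,movie,rating' line -> [((movie, rating), user)]."""
    user, movie, rating = record.split(",")
    return [((movie, int(rating)), user)]

def reduce_group(key, users):
    """Reduce: collect every user that shares a (movie, rating) key."""
    return key, sorted(users)

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle by key, then reduce, all in one process."""
    shuffled = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())

records = ["Frank,Crash,5", "John,Crash,5", "Matt,Jaws,1"]
groups = run_mapreduce(records, map_rating, reduce_group)
print(groups)  # {('Crash', 5): ['Frank', 'John'], ('Jaws', 1): ['Matt']}
```

On a real cluster the map calls and the per-key reduce calls would run on separate machines, with the framework handling the shuffle step.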

12
Project Overview
Amazon's Compute Cloud
13
How The Algorithm Works
14
Create Groups
Remove Empty Groups
Jeff,?
15
Keep The Unrated Movie
Remove Groups Jeff Is Not In
Determine Group Probability
1 - (1 - 2/3)(1 - 1/2)(1 - 1/2) = 11/12 ≈ 9/10
1 - (1 - 1/3)(1 - 1/3) = 5/9
Likelihood labels from the slide diagram: 0, 1/3, 1/3, 2/3, 1/2, 1/2
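
The two expressions on this slide share one form: one minus the product of (1 - p) over the individual likelihoods. A short sketch with exact fractions follows; the function name `combine` is hypothetical. Evaluated exactly, the first expression gives 11/12, which is close to the 9/10 shown on the slide, and the second gives 5/9.

```python
from fractions import Fraction
from math import prod

def combine(likelihoods):
    """Chance that at least one likelihood 'hits': 1 - product of (1 - p)."""
    return 1 - prod(1 - p for p in likelihoods)

first = combine([Fraction(2, 3), Fraction(1, 2), Fraction(1, 2)])
second = combine([Fraction(1, 3), Fraction(1, 3)])
print(first, second)  # 11/12 5/9
```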
16
Netflix Prize Structure
  • That Was Just A Single User And Movie
  • 480,000 Users
  • 18,000 Movies
  • 2,800,000 Predictions To Make
  • How It's Scored
  • Root Mean Squared Error (RMSE)
  • Error Between Predicted Ratings And Actual Ratings
  • 10% Better
  • Netflix's Cinematch: 0.95 RMSE On Provided Data
  • RMSE Of 0.87 = $1,000,000
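
For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower is better. A minimal sketch with made-up ratings:

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(squared / len(actual))

# Hypothetical predictions vs. true ratings: two are off by one star, one is exact.
print(rmse([4, 3, 5], [5, 3, 4]))  # sqrt(2/3), about 0.816
```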

17
Time For A Demo
18
Results
  • Computed Ratings For Some Users
  • 400,000 To Go
  • RMSE With Netflix: 1.5
  • RMSE With Netflix + Flixster: 0.96
  • RMSE For Netflix's Cinematch: 0.95
  • What We Accomplished
  • Created Simple, Scalable Algorithm
  • Successfully Utilized MapReduce
  • Proved Adding Data Is Effective

19
What We Learned
  • MapReduce
  • Simple Yet Constricting
  • Quantity Quality For Computers
  • If It's Broken, Don't Fix It
  • Future
  • Complete Submission
  • Fine Tuning The Algorithm
  • Determine Worst Cases

20
What Worked/What Didn't
  • Worked
  • MapReduce
  • Rightscale
  • Ruby/Python Scripts
  • Didn't
  • Waterfall Development
  • Trying Lots Of Algorithms
  • Long Algorithm Runtimes
  • HBase Distributed Database

21
Thank You
  • Rightscale
  • Martin Rhoads
  • Google
  • Mohamed Hafetz
  • Tevfik Bultan
  • Rich Wolski
  • Chris Coakley
  • Ben Zhao, CURRENT Lab

22
Questions?
23
  • Automatic Load Balancing
  • Provides Fault Tolerance
  • Uses Java, Some Python
  • Highly Configurable
  • Performance Tuning Can Be Tricky
  • Current Version 0.17.0

24
Flixster Data
  • Scraping
  • Multi-threaded Python
  • Max Throughput 6Mb/s
  • 1000 Profiles Per Hour
  • Data Analysis
  • 1.25 Million Profiles
  • Average 35 Ratings Per User
  • 5 Users With 10k Ratings
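
A multi-threaded scraper of the kind mentioned above might be structured roughly as follows. This is a sketch under assumptions: `fetch_profile` is a hypothetical stub standing in for the real HTTP request and HTML parsing, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(profile_id):
    """Hypothetical stand-in for downloading and parsing one profile page."""
    return {"id": profile_id, "ratings": []}

def scrape(profile_ids, workers=8):
    """Fetch profiles concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_profile, profile_ids))

profiles = scrape(range(100))
print(len(profiles))  # 100
```

Threads suit this kind of job because each worker spends most of its time waiting on the network rather than computing.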

25
MapReduce Example
26
Route For Improvement
  • Making Better Algorithms vs. Increasing Data
  • Add More High Quality Data
  • Why Limit Your Data?
  • Internet Provides Untapped Resource
  • Blogs, Social Networking, Reviews, etc.
  • Extract And Apply Relevant Data
  • Broader Range Of Opinion
  • Fewer Outliers