CS378 Final Project - PowerPoint PPT Presentation

Slides: 15
Provided by: csUt8
Transcript and Presenter's Notes

1
CS378 Final Project
  • The Netflix Data Set
  • Class Project Ideas and Guidelines

2
The Data Set
  • 17,770 Movies
  • 480,189 Reviewers
  • More than 100 Million reviews
  • Rating of 1 through 5
  • Review Date
  • Uncompressed full dataset is 2 Gigabytes

3
Netflix Data Properties
  • Distribution of Number of Reviews per Reviewer
  • X-axis: # of reviews
  • Y-axis: P(# of reviews)
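
A minimal sketch of how the plotted distribution could be computed, assuming the input is simply one reviewer ID per review. The list `reviewer_ids` and the function name are hypothetical and the data is a toy example, not the Netflix set.

```python
from collections import Counter

# Hypothetical toy input: one reviewer ID per review.
reviewer_ids = ["r1", "r1", "r1", "r2", "r3", "r3", "r4"]

def review_count_distribution(ids):
    """Map each '# of reviews' value to the fraction of reviewers with that count."""
    per_reviewer = Counter(ids)                  # reviewer -> number of reviews
    histogram = Counter(per_reviewer.values())   # number of reviews -> number of reviewers
    n_reviewers = len(per_reviewer)
    return {count: histogram[count] / n_reviewers for count in sorted(histogram)}

print(review_count_distribution(reviewer_ids))  # {1: 0.5, 2: 0.25, 3: 0.25}
```

The keys give the X-axis values and the fractions give the Y-axis values.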

4
Netflix Data Subsets
  • You will be given two subsets of the data
  • Format
  • Subset
  • Contains 9,000 reviewers
  • Restricted to only those movies with at least 5
    ratings
  • 12,000 movies
  • 2 Million reviews
  • 50 MB

5
Project Requirements
  • Compute each of the following
  • Average review score
  • Top 10 most highly rated movies
  • Distribution of all review scores
  • p(rating=1), ..., p(rating=5)
  • Number of reviews as a function of time
  • The reviewer whose review score distribution has
    the largest entropy
  • Compute five other properties of the data
  • These properties should be relevant to your
    project
  • You should explain this relevancy
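
The required statistics can be sketched in a few lines. This is an illustration under stated assumptions: each review is a `(reviewer_id, movie_id, rating, date)` tuple with an ISO date string, "number of reviews as a function of time" is bucketed by month, and the toy records and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy records: (reviewer_id, movie_id, rating, date).
reviews = [
    ("r1", "m1", 5, "2005-01-03"),
    ("r1", "m2", 3, "2005-01-04"),
    ("r2", "m1", 4, "2005-01-03"),
    ("r2", "m2", 4, "2005-02-10"),
    ("r3", "m1", 1, "2005-02-11"),
    ("r3", "m2", 5, "2005-02-11"),
    ("r3", "m3", 3, "2005-02-12"),
]

def average_score(records):
    return sum(r[2] for r in records) / len(records)

def rating_distribution(records):
    """p(rating=1), ..., p(rating=5) as a dict."""
    counts = Counter(r[2] for r in records)
    return {score: counts.get(score, 0) / len(records) for score in range(1, 6)}

def top_movies(records, n=10, min_ratings=1):
    """Movies ranked by mean rating; min_ratings guards against tiny samples."""
    by_movie = defaultdict(list)
    for _, movie, rating, _ in records:
        by_movie[movie].append(rating)
    means = {m: sum(v) / len(v) for m, v in by_movie.items() if len(v) >= min_ratings}
    return sorted(means, key=lambda m: (-means[m], m))[:n]

def reviews_per_month(records):
    """Review counts keyed by 'YYYY-MM' (assumed bucketing)."""
    return dict(sorted(Counter(r[3][:7] for r in records).items()))

def entropy(ratings):
    """Shannon entropy (in bits) of the empirical rating distribution."""
    counts = Counter(ratings)
    total = len(ratings)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def max_entropy_reviewer(records):
    by_reviewer = defaultdict(list)
    for reviewer, _, rating, _ in records:
        by_reviewer[reviewer].append(rating)
    return max(by_reviewer, key=lambda r: entropy(by_reviewer[r]))

print(top_movies(reviews))            # ['m2', 'm1', 'm3']
print(max_entropy_reviewer(reviews))  # r3
```

On the real subset, a `min_ratings` threshold larger than 1 is probably needed so that a movie with a single 5-star review does not top the list.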

6
Project Options
  • Classification
  • Clustering
  • Recommendation
  • Data Cubes

7
Project 1: Classification
  • Goal: Predict classification scores
  • 5-class classification problem
  • K-Nearest Neighbor
  • Represent each reviewer by a (sparse) vector of
    his review scores
  • How can scores be predicted given a reviewer's
    nearest neighbors?
  • Represent each movie by a vector of each
    reviewer's scores
  • How can scores be predicted given a movie's
    nearest neighbors?
  • Experiment with different distance measures
  • Experiment with various normalization schemes
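
One possible answer to the reviewer-based question, sketched under assumptions: each reviewer is a sparse dict of movie scores, the distance measure is cosine similarity, and the prediction is a similarity-weighted average of the neighbors' scores rounded to a class. The toy data and the names `ratings`, `cosine`, and `predict` are hypothetical; other distance measures and normalizations would slot into the same structure.

```python
import math

# Hypothetical toy data: reviewer -> {movie: score}. Missing movies are unrated.
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m4": 2},
    "carol": {"m1": 1, "m3": 5, "m4": 5},
    "dave":  {"m1": 4, "m2": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse score vectors."""
    shared = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(reviewer, movie, k=2):
    """Predict a 1-5 class from the k most similar reviewers who rated the movie."""
    target = ratings[reviewer]
    candidates = [
        (cosine(target, vec), vec[movie])
        for other, vec in ratings.items()
        if other != reviewer and movie in vec
    ]
    neighbors = sorted(candidates, reverse=True)[:k]
    weight = sum(sim for sim, _ in neighbors)
    if weight == 0:
        return None  # no usable neighbor rated this movie
    score = sum(sim * s for sim, s in neighbors) / weight
    return min(5, max(1, round(score)))

print(predict("dave", "m4"))  # 2
print(predict("dave", "m3"))  # 1
```

The movie-based variant is the same code applied to the transposed data (movie -> {reviewer: score}).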

8
Project 1: Classification
  • Decision Trees and other Parametric Classifiers
  • Create dense features for each instance
  • Reviewer's average rating
  • Movie's average rating
  • Movie related features
  • Actors in each movie (collected from IMDB)
  • Time related features
  • Number of reviewer's previous scores
  • Use the WEKA machine learning package
  • Evaluate performance of various algorithms in the
    package
  • Decision Tree, SVM, ...
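
A sketch of how some of these dense features might be built before handing the rows to a classifier. It assumes the records are sorted by date and computes each feature only from earlier reviews, so the label does not leak into its own features. The fallback value of 3.0 for the very first review, the field names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records, sorted by date: (reviewer_id, movie_id, rating, date).
reviews = [
    ("r1", "m1", 5, "2005-01-03"),
    ("r2", "m1", 3, "2005-01-05"),
    ("r1", "m2", 4, "2005-02-01"),
    ("r2", "m2", 2, "2005-02-07"),
    ("r1", "m3", 3, "2005-03-09"),
]

def build_features(records):
    """Return one feature row per review, using only information from earlier reviews."""
    reviewer_hist = defaultdict(list)
    movie_hist = defaultdict(list)
    global_sum, global_n = 0, 0
    rows = []
    for reviewer, movie, rating, _ in records:
        # Fall back to the running global mean (3.0 before any review is seen).
        fallback = global_sum / global_n if global_n else 3.0
        r_hist, m_hist = reviewer_hist[reviewer], movie_hist[movie]
        rows.append({
            "reviewer_avg": sum(r_hist) / len(r_hist) if r_hist else fallback,
            "movie_avg": sum(m_hist) / len(m_hist) if m_hist else fallback,
            "num_previous": len(r_hist),
            "label": rating,
        })
        r_hist.append(rating)
        m_hist.append(rating)
        global_sum += rating
        global_n += 1
    return rows

for row in build_features(reviews):
    print(row)
```

Rows like these could then be written to a CSV file for loading into WEKA. Movie-related features such as actors would be extra columns joined in from an external source.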

9
Project 1: Classification
  • Evaluation of Classification Performance
  • Accuracy, Confusion Matrices
  • Analysis: Are 1's harder to predict than 5's?
  • Cross-validation
  • Does this make sense when there is a time-series
    component?
  • Extensions
  • Learning curves
  • How does accuracy change as the training set size
    increases?
  • Distribution of accuracy per reviewer
  • Are some reviewers harder to predict than
    others?
  • Are some movies harder to predict?
  • ...
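
The evaluation ideas above can be sketched as follows. The helper names and toy labels are hypothetical. The `time_split` function shows one alternative to ordinary cross-validation when dates matter: train on reviews before a cutoff date and test on the rest.

```python
def confusion_matrix(actual, predicted, classes=(1, 2, 3, 4, 5)):
    """matrix[a][p] counts reviews with true score a that were predicted as p."""
    matrix = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_recall(matrix):
    """Fraction of each true class predicted correctly (None if the class is absent)."""
    recall = {}
    for a, row in matrix.items():
        total = sum(row.values())
        recall[a] = row[a] / total if total else None
    return recall

def time_split(records, cutoff):
    """Train on reviews dated before cutoff, test on the rest (ISO date strings)."""
    train = [r for r in records if r[3] < cutoff]
    test = [r for r in records if r[3] >= cutoff]
    return train, test

# Hypothetical true and predicted scores for six reviews.
actual = [1, 1, 5, 5, 5, 3]
predicted = [3, 1, 5, 5, 4, 3]

matrix = confusion_matrix(actual, predicted)
print(accuracy(actual, predicted))
print(per_class_recall(matrix))
```

Comparing the recall for class 1 against class 5 is one direct way to check whether 1's are harder to predict than 5's.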

10
Project 2: Clustering
  • Goal: Cluster reviewers and movies
  • K-means based methods
  • Download G-Means
  • Supports k-means and also other variants
  • Cluster using both sparse and dense
    representations
  • Sparse representation: same as used for KNN
    classification
  • Dense representation: same as used for parametric
    classification
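
For intuition about what a k-means tool does with the dense representation, here is a minimal sketch of the standard k-means (Lloyd's) iteration in plain Python. It is not the G-Means software. The naive initialization from the first k points and the two-feature toy data (average rating, scaled activity) are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and recomputing centers."""
    centers = list(points[:k])  # naive init (assumption): first k points
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers

# Hypothetical dense reviewer features: (average rating, scaled number of reviews).
points = [(4.5, 0.9), (1.5, 0.2), (4.8, 0.8), (1.2, 0.1), (4.2, 1.0)]
assignment, centers = kmeans(points, k=2)
print(assignment)  # [0, 1, 0, 1, 0]
```

Because the result depends on the starting centers, several random restarts are usually worth trying on real data.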

11
Project 2: Clustering
  • Graph-based methods
  • Compute pairwise similarities between reviewers
  • Correlation
  • Your own ad-hoc method
  • e.g. the Kevin Bacon method
  • Sim(x, y) = # of Kevin Bacon movies viewed by
    both x and y
  • Similarity computation may be too expensive to
    perform on the full dataset
  • Software: Graclus
  • Results analysis
  • Quantitative as well as Qualitative
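
Both similarity ideas can be sketched in a few lines. The toy data, the `bacon_movies` set, and the function names are hypothetical, and the output is just a dict of weighted edges; it is not written in any particular tool's input format.

```python
import math
from itertools import combinations

# Hypothetical toy inputs: movies each reviewer has rated, and the Kevin Bacon subset.
viewed = {
    "x": {"Footloose", "Apollo 13", "Tremors", "Alien"},
    "y": {"Apollo 13", "Tremors", "Heat"},
    "z": {"Alien", "Heat"},
}
bacon_movies = {"Footloose", "Apollo 13", "Tremors"}

def bacon_sim(a, b):
    """Sim(x, y) = # of Kevin Bacon movies viewed by both x and y."""
    return len(viewed[a] & viewed[b] & bacon_movies)

def similarity_edges(reviewers):
    """Weighted edges between reviewer pairs with nonzero similarity."""
    edges = {}
    for a, b in combinations(sorted(reviewers), 2):
        weight = bacon_sim(a, b)
        if weight > 0:
            edges[(a, b)] = weight
    return edges

def pearson(u, v):
    """Pearson correlation of two reviewers' scores over their co-rated movies."""
    shared = set(u) & set(v)
    n = len(shared)
    if n < 2:
        return 0.0
    mean_u = sum(u[m] for m in shared) / n
    mean_v = sum(v[m] for m in shared) / n
    cov = sum((u[m] - mean_u) * (v[m] - mean_v) for m in shared)
    var_u = sum((u[m] - mean_u) ** 2 for m in shared)
    var_v = sum((v[m] - mean_v) ** 2 for m in shared)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

print(similarity_edges(viewed))  # {('x', 'y'): 2}
```

The double loop visits n(n-1)/2 pairs, which is 40,495,500 pairs for the 9,000-reviewer subset. That quadratic growth is why the all-pairs computation may be too expensive on the full dataset.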

12
Project 3: Recommendation
  • Goal: Create movie recommendations for each
    reviewer
  • K-Nearest Neighbor
  • Instance representation
  • Sparse representation
  • Find the reviewer's nearest neighbors
  • Recommend movies scored highly by these
    neighbors
  • Try out various distance measures
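
The steps above can be sketched as follows, assuming cosine similarity as the distance measure and a score of 4 or higher as "scored highly". The toy data and the names `ratings`, `cosine`, and `recommend` are hypothetical.

```python
import math

# Hypothetical toy data: reviewer -> {movie: score}.
ratings = {
    "alice": {"m1": 5, "m2": 4},
    "bob":   {"m1": 5, "m2": 5, "m3": 5, "m4": 2},
    "carol": {"m1": 1, "m2": 2, "m5": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse score vectors."""
    shared = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(reviewer, k=1, min_score=4, n=3):
    """Recommend unseen movies that the k nearest reviewers scored >= min_score."""
    target = ratings[reviewer]
    neighbors = sorted(
        (other for other in ratings if other != reviewer),
        key=lambda o: (-cosine(target, ratings[o]), o),
    )[:k]
    scores = {}
    for other in neighbors:
        for movie, score in ratings[other].items():
            if movie not in target and score >= min_score:
                scores.setdefault(movie, []).append(score)
    # Rank by mean neighbor score, breaking ties by movie ID.
    ranked = sorted(scores, key=lambda m: (-sum(scores[m]) / len(scores[m]), m))
    return ranked[:n]

print(recommend("alice"))       # ['m3']
print(recommend("alice", k=2))  # ['m3', 'm5']
```

Swapping `cosine` for another similarity function is the only change needed to try a different distance measure.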

13
Project 3: Recommendation
  • Evaluation
  • Propose a way of quantifying the quality of your
    recommendations
  • e.g. a recommendation is good if the reviewer ended
    up rating the recommended movie with a score of 4 or
    higher
  • Is it harder to recommend movies to reviewers who
    do not watch many movies?
  • Does your evaluation metric reflect this?
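
One way to turn the example criterion into a number, sketched under assumptions: ratings made after the training period are held out, only recommendations that the reviewer actually rated are scored, and reviewers are split into "light" and "heavy" by an arbitrary cutoff on their training review count. All names, the cutoff of 20, and the toy data are hypothetical.

```python
def hit_rate(recommended, held_out, threshold=4):
    """Fraction of rated recommendations that received a score >= threshold.

    Recommendations without a held-out rating are ignored.
    Returns None when no recommendation was rated.
    """
    rated = [held_out[m] for m in recommended if m in held_out]
    if not rated:
        return None
    return sum(s >= threshold for s in rated) / len(rated)

def hit_rate_by_activity(recs, held_out, train_counts, cutoff=20):
    """Average hit rate separately for light and heavy reviewers."""
    buckets = {"light": [], "heavy": []}
    for reviewer, recommended in recs.items():
        rate = hit_rate(recommended, held_out.get(reviewer, {}))
        if rate is None:
            continue  # nothing to score for this reviewer
        bucket = "heavy" if train_counts.get(reviewer, 0) >= cutoff else "light"
        buckets[bucket].append(rate)
    return {b: (sum(v) / len(v) if v else None) for b, v in buckets.items()}

# Hypothetical recommendations, held-out ratings, and training review counts.
recs = {"r1": ["m1", "m2", "m3"], "r2": ["m4", "m5"], "r3": ["m6"]}
held_out = {"r1": {"m1": 5, "m2": 3}, "r2": {"m4": 4, "m5": 5}, "r3": {}}
train_counts = {"r1": 5, "r2": 120, "r3": 2}

print(hit_rate_by_activity(recs, held_out, train_counts))
# {'light': 0.5, 'heavy': 1.0}
```

Note that reviewer r3 drops out entirely because none of the recommendations were rated. A metric like this can therefore hide exactly the reviewers who watch few movies, which is worth reporting alongside the score.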

14
Project 4: Data Cubes
  • Load the data into a data cube
  • Find interesting trends in the data
  • e.g. Relation between average review score and
    day of week?
  • Slice on day, aggregate review scores across all
    reviewers and movies
  • Find other interesting trends
  • Use an open source data cube package (OLAP)
  • Mondrian: Java based
  • Must be a proficient coder
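
To make the day-of-week example concrete, here is a plain-Python stand-in for the slice-and-aggregate step. It does not use Mondrian or any OLAP package; it only shows the computation a cube query would perform. The toy records and function name are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical toy records: (reviewer_id, movie_id, rating, ISO date).
reviews = [
    ("r1", "m1", 5, "2005-01-03"),  # Monday
    ("r2", "m1", 3, "2005-01-10"),  # Monday
    ("r1", "m2", 2, "2005-01-08"),  # Saturday
    ("r3", "m2", 4, "2005-01-15"),  # Saturday
    ("r3", "m1", 4, "2005-01-15"),  # Saturday
]

def average_by_weekday(records):
    """Slice on day of week, aggregating scores across all reviewers and movies."""
    totals = defaultdict(lambda: [0, 0])  # weekday -> [sum of scores, count]
    for _, _, rating, day in records:
        name = WEEKDAYS[date.fromisoformat(day).weekday()]
        totals[name][0] += rating
        totals[name][1] += 1
    return {name: s / n for name, (s, n) in totals.items()}

print(average_by_weekday(reviews))
```

In a data cube the same question becomes a query over a time dimension with a day-of-week level and an average-score measure.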