Data Mining on the Web

Provided by: jeff452

1
Data Mining on the Web
  • Outline of Course
  • Goals

2
What is Data Mining?
  • Discovery of useful and unexpected patterns in
    data.
  • Other issues:
      • Data cleansing: detection of bogus data, e.g.,
        age 150.
      • Entity resolution: merging records that refer to
        the same entity (e.g., a person).
      • Visualization: something better than megabyte
        files of output.
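
The data-cleansing bullet can be illustrated with a minimal sketch. It assumes every record is a dict with a numeric "age" field; the function name find_bogus_ages, the sample records, and the 0-120 plausibility range are all hypothetical choices made for illustration, not part of the course material.

```python
def find_bogus_ages(records, min_age=0, max_age=120):
    """Return the records whose 'age' lies outside [min_age, max_age]."""
    # The plausibility bounds are an assumption; pick them per data set.
    return [r for r in records if not (min_age <= r["age"] <= max_age)]

# Hypothetical sample data, including the "age 150" case from the slide.
people = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 150},
    {"name": "Cy", "age": -2},
]

bogus = find_bogus_ages(people)
print([r["name"] for r in bogus])  # prints ['Bob', 'Cy']
```

A range check like this only flags suspicious values; deciding whether to drop, correct, or keep them is a separate step.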

3
Web Mining
  • Take advantage of the size of the Web to extract
    information.
  • Use information from Web transactions (e.g.,
    on-line purchases, social networks) to extract
    information.

4
Outline of Course
  • Frequent Itemsets and association rules.
  • PageRank and related measures of importance on
    the Web (link analysis).
  • Spam detection.
  • Topic-specific search.
  • Map-Reduce and Hadoop.

5
Outline (2)
  • Finding similar items (e.g., Web pages).
      • Shingling (documents turned into sets).
      • Minhashing (summarizing sets).
      • Locality-Sensitive Hashing (finding only similar
        sets).
  • Mining data streams.
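
Shingling and minhashing can be sketched roughly as follows. This is an illustration under stated assumptions, not the course's own implementation: the shingle length k=5, the use of CRC32 to map shingles to integers, the hash family x -> (a*x + b) mod PRIME, and all function names are choices made here. Locality-sensitive hashing is not covered by the sketch.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, used as the modulus for hashing

def shingles(text, k=5):
    """Turn a document into the set of its k-character substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def make_hash_params(n, seed=0):
    """n random (a, b) pairs defining hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def minhash_signature(shingle_set, params):
    """Summarize a non-empty set by its minimum hash value under each function."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]

def estimate_similarity(sig1, sig2):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

# Hypothetical example documents.
doc1 = "data mining on the web"
doc2 = "data mining of the web"
s1, s2 = shingles(doc1), shingles(doc2)
params = make_hash_params(100)

print(jaccard(s1, s2))  # exact set similarity
print(estimate_similarity(minhash_signature(s1, params),
                          minhash_signature(s2, params)))  # minhash estimate
```

The point of the signature is that it has a fixed length (here 100 integers) no matter how large the shingle set is, while the fraction of agreeing positions approximates the Jaccard similarity of the underlying sets.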

6
Goals
  • Learn some interesting algorithms and techniques
    that have important applications.
  • Learn how to deal efficiently with data that is
    so large it doesn't fit in main memory.

7
Application: Selling Stuff
  • There are two marketing environments:
      • Brick-and-mortar: conventional stores.
          • Strategies need large volume to succeed.
          • Run ads only for the most common items.
      • On-line, e.g., Amazon.
          • Takes advantage of the long tail: selling
            unusual things to unusual people.
          • Your Amazon landing page suggests things you are
            interested in.

8
The Long Tail
Source: Chris Anderson (2004)

9
Application: Understanding Documents
  • Many large databases of text information:
      • The Web.
      • News articles.
      • Blogs.
      • Postings (Craig's list, B2B Web sites,
        del.icio.us, social networks, etc.)

10
Questions About Web Pages
  • Which Web pages are the best answers to a query?
      • PageRank, topic-specific PageRank,
        hubs-and-authorities.
  • Which Web pages are spam?
      • TrustRank.

11
Questions About Documents
  • Which documents are about similar topics?
      • Based on similar sets of words in their text.
      • Based on usage, e.g., documents accessed by many
        of the same people.
  • What is the opinion in a document?
      • E.g., a car that uses less fuel, versus a car
        that has less room.

12
Application: Collaborative Filtering
  • Advise one person based on what similar people
    have done.
  • Example data sets:
      • NetFlix Challenge: matrix of user-movie ratings.
          • Predict how a user would rate a movie.
      • Amazon purchases: matrix of buyer-item pairs.

13
Example: NetFlix Data
Movies
0 0 5 0 0 0 4 0 0 0 5 3 0 0 0 0 0 4 0 0 0 3 0 2 0
0 5 0 0 1 0 4 0 0 0 0 4 0 0 1 0 5 3 0 2 0 0 0 0 0
4 3 3 0 5 0 0 0 0 0 0 0 0 3 4 0 0 5 etc., etc.
Users
Note: ratings are 1-5; 0 = not rated.

14
Collaborative Filtering (2)
  • Data is always sparse (most entries in the
    matrix are 0).
  • Look for similar rows or columns.
      • E.g., items bought by many of the same customers,
        or users who gave many of the same movies similar
        ratings.
  • Cluster rows and/or columns.
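
The idea of looking for similar rows, and using them to predict a missing rating, can be sketched as follows. This is one simple approach under stated assumptions, not the method prescribed by the course: rows are users, columns are movies, 0 means "not rated", similarity is plain cosine similarity (which treats unrated entries as zeros), and the prediction is a similarity-weighted average. The tiny ratings matrix and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two rating rows; 0.0 if either row is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when no similar user has rated the movie.
    """
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[movie] == 0:
            continue  # skip the user themself and users who did not rate it
        sim = cosine(ratings[user], row)
        num += sim * row[movie]
        den += sim
    return num / den if den else None

# Hypothetical 3-user x 4-movie matrix; ratings are 1-5, 0 = not rated.
ratings = [
    [5, 4, 0, 1],
    [5, 5, 4, 1],
    [1, 0, 2, 5],
]

# User 0 has not rated movie 2; users 1 and 2 have.
print(predict(ratings, user=0, movie=2))
```

Because user 0's row resembles user 1's far more than user 2's, the prediction is pulled toward user 1's rating of 4 rather than user 2's rating of 2.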

15
Gradiance Automated Homework
  • Go to www.gradiance.com/services
  • Create an account for yourself.
  • Enroll in class FA335CA1

16
Gradiance: How It Works
  • Don't think of it as multiple-choice.
  • Really solve the problems.
  • Demonstrate you solved them by answering randomly
    chosen multiple-choice questions.
  • If you miss something, you get an explanation,
    and should try again.

17
Discussion: Gradiance for Course Credit?
  • It has been suggested that we give credit to
    those who get at least 80% of the marks on the
    assignments.
  • No exam.
  • But deadlines for assignments: first two pieces
    have May 21 deadlines.