Data Mining on the Web

About This Presentation

Slides: 18

Provided by: jeff452

Transcript and Presenter's Notes

1
Data Mining on the Web

Outline of Course
Goals

2
What is Data Mining?

Discovery of useful and unexpected patterns in
data.
Other issues:
Data cleansing: detection of bogus data, e.g.,
age 150.
Entity resolution: merging records that refer to
the same entity (e.g., a person).
Visualization: something better than megabyte
files of output.
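
A minimal sketch of the first two issues, data cleansing and entity resolution. The 120-year cutoff, the field names, and the rule "same normalized name means same person" are illustrative assumptions, not something the slides specify; real entity resolution usually needs fuzzier matching.

```python
def is_plausible_age(age, max_age=120):
    """Return True if age is a number in [0, max_age]. max_age is an assumed cutoff."""
    return isinstance(age, (int, float)) and 0 <= age <= max_age


def normalize_name(name):
    """Lowercase a name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def resolve_entities(records):
    """Merge records whose normalized names agree (assumed matching rule).

    For each field, the first non-empty value seen is kept.
    """
    merged = {}
    for rec in records:
        target = merged.setdefault(normalize_name(rec["name"]), {})
        for field, value in rec.items():
            if value not in (None, "") and field not in target:
                target[field] = value
    return list(merged.values())


# Hypothetical records: two spellings of one person, plus one bogus age.
records = [
    {"name": "Ann Lee", "age": 34, "city": ""},
    {"name": "ann  lee", "age": None, "city": "Palo Alto"},
    {"name": "Bob Roy", "age": 150, "city": "Austin"},
]

# Data cleansing: drop records whose age is present but implausible.
clean = [r for r in records if r["age"] is None or is_plausible_age(r["age"])]

# Entity resolution: merge what is left.
people = resolve_entities(clean)
```

Here the record with age 150 is discarded, and the two "Ann Lee" records collapse into one that carries both the age and the city.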

3
Web Mining

Take advantage of the size of the Web to extract
information.
Use information from Web transactions (e.g.,
on-line purchases, social networks) to extract
information.

4
Outline of Course

Frequent Itemsets and association rules.
PageRank and related measures of importance on
the Web (link analysis).
Spam detection.
Topic-specific search.
Map-Reduce and Hadoop.
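
As a rough illustration of the first outline topic, frequent itemsets, the sketch below counts item pairs that appear together in at least `min_support` baskets. It uses a two-pass approach in the style usually called A-Priori; that name, the function name, and the toy baskets are assumptions added here, not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: count} for item pairs found in at least min_support baskets."""
    # Pass 1: count individual items (each item once per basket).
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs whose members are both frequent,
    # since a pair cannot be frequent unless both of its items are.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
result = frequent_pairs(baskets, min_support=3)
```

With these baskets, only the pair (beer, diapers) reaches a support of 3.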

5
Outline (2)

Finding similar items (e.g., Web pages).
Shingling (documents turned into sets).
Minhashing (summarizing sets).
Locality-Sensitive Hashing (finding only similar
sets).
Mining data streams.
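
The shingling and minhashing steps can be sketched as follows. This is an illustration under assumptions: 5-character shingles, 100 hash functions of the form (a*x + b) mod p, and CRC32 as the shingle-to-integer mapping are all choices made here, not given in the slides. Locality-sensitive hashing is not shown.

```python
import random
import zlib

PRIME = 4_294_967_311  # a prime above 2**32, so every CRC32 value is a valid input


def shingles(text, k=5):
    """Return the set of all k-character substrings (shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def make_hash_params(n, seed=0):
    """Draw n random (a, b) pairs for hash functions h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(shingle_set, params):
    """Summarize a non-empty set by its minimum hash value under each function."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_similarity(sig1, sig2):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


params = make_hash_params(100)
doc_a = shingles("data mining on the web is fun")
doc_b = shingles("data mining on the web is hard")
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)

exact = jaccard(doc_a, doc_b)
approx = estimate_similarity(sig_a, sig_b)
```

The signatures are short fixed-length lists, so comparing them is cheap even when the underlying shingle sets are large; `approx` should be close to `exact`, with accuracy improving as more hash functions are used.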

6
Goals

Learn some interesting algorithms and techniques
that have important applications.
Learn how to deal efficiently with data that is
so large it doesn't fit in main memory.

7
Application: Selling Stuff

There are two marketing environments:
Brick-and-mortar conventional stores.
Strategies need large volume to succeed.
Run ads only for the most common items.
On-line, e.g., Amazon.
Takes advantage of the long tail: selling
unusual things to unusual people.
Your Amazon landing page suggests things you are
interested in.

8
The Long Tail
Source: Chris Anderson (2004)
9
Application: Understanding Documents

Many large databases of text information:
The Web.
News articles.
Blogs.
Postings (Craig's list, B2B Web sites,
del.icio.us, social networks, etc.)

10
Questions About Web Pages

Which Web pages are the best answers to a query?
PageRank, topic-specific PageRank,
hubs-and-authorities.
Which Web pages are spam?
TrustRank.
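
A minimal sketch of basic PageRank by power iteration, to make the first question concrete. The damping value 0.85, the fixed iteration count, the handling of dead ends, and the three-page graph are assumptions made for illustration; topic-specific PageRank, hubs-and-authorities, and TrustRank are not shown.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Compute PageRank for a graph given as {page: [pages it links to]}.

    beta is the probability of following a link; with probability 1 - beta
    the surfer jumps to a page chosen uniformly at random.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Split this page's rank evenly among its out-links.
                share = beta * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dead end (assumed handling): spread its rank over all pages.
                share = beta * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical three-page Web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
```

In this toy graph page C ends up with the highest rank, because both A and B link to it, and B ends up lowest, because it receives only half of A's vote.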

11
Questions About Documents

Which documents are about similar topics?
Based on similar sets of words in their text.
Based on usage, e.g., documents accessed by many
of the same people.
What is the opinion in a document?
E.g., "a car that uses less fuel" versus "a car
that has less room."

12
Application: Collaborative Filtering

Advise one person based on what similar people
have done.
Example data sets:
NetFlix Challenge: matrix of user-movie ratings.
Predict how a user would rate a movie.
Amazon purchases: matrix of buyer-item pairs.

13
Example: NetFlix Data

Matrix of ratings (axes: Movies, Users):
0 0 5 0 0 0 4 0 0 0 5 3 0 0 0 0 0 4 0 0 0 3 0 2 0
0 5 0 0 1 0 4 0 0 0 0 4 0 0 1 0 5 3 0 2 0 0 0 0 0
4 3 3 0 5 0 0 0 0 0 0 0 0 3 4 0 0 5 etc., etc.
Note: ratings are 1-5; 0 = not rated.
14
Collaborative Filtering (2)

Data is always sparse (most entries in the
matrix are 0).
Look for similar rows or columns.
E.g., items bought by many of the same customers,
or users who gave many of the same movies similar
ratings.
Cluster rows and/or columns.
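
One way to "look for similar rows" is sketched below: each user's row is stored sparsely as a dict of the movies they rated, rows are compared by cosine similarity, and a missing rating is predicted as a similarity-weighted average. Cosine similarity with unrated entries treated as 0, the function names, and the toy ratings are assumptions for illustration; the slides do not prescribe a particular similarity measure.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse rows ({item: rating}); unrated counts as 0."""
    dot = sum(u[i] * v[i] for i in set(u) & set(v))
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(ratings, user, item):
    """Predict user's rating of item from other users who rated it.

    Returns the similarity-weighted average of their ratings,
    or None if no similar user has rated the item.
    """
    num = den = 0.0
    for other, row in ratings.items():
        if other == user or item not in row:
            continue
        sim = cosine(ratings[user], row)
        num += sim * row[item]
        den += sim
    return num / den if den else None


# Hypothetical sparse ratings: only rated movies are stored.
ratings = {
    "u1": {"m1": 5, "m2": 4},
    "u2": {"m1": 5, "m2": 4, "m3": 5},
    "u3": {"m1": 1, "m3": 2},
}
guess = predict(ratings, "u1", "m3")
```

Because u1's ratings look much more like u2's than u3's, the prediction for m3 lands closer to u2's rating of 5 than to u3's rating of 2. Storing only the nonzero entries also reflects the sparsity noted above.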

15
Gradiance: Automated Homework

Go to www.gradiance.com/services
Create an account for yourself.
Enroll in class FA335CA1

16
Gradiance: How It Works

Don't think of it as multiple-choice.
Really solve the problems.
Demonstrate you solved them by answering randomly
chosen multiple-choice questions.
If you miss something, you get an explanation,
and should try again.

17
Discussion: Gradiance for Course Credit?

It has been suggested that we give credit to
those who get at least 80% of the marks on the
assignments.
No exam.
But deadlines for assignments: first two pieces
have May 21 deadlines.
