Title: Data Mining on the Web
1. Data Mining on the Web
2. What is Data Mining?
- Discovery of useful and unexpected patterns in data.
- Other issues:
  - Data cleansing: detection of bogus data, e.g., age 150.
  - Entity resolution: merging records that refer to the same entity (e.g., a person).
  - Visualization: something better than megabyte files of output.
3. Web Mining
- Take advantage of the size of the Web to extract information.
- Use information from Web transactions (e.g., on-line purchases, social networks) to extract information.
4. Outline of Course
- Frequent itemsets and association rules.
- PageRank and related measures of importance on the Web (link analysis).
- Spam detection.
- Topic-specific search.
- Map-Reduce and Hadoop.
5. Outline (2)
- Finding similar items (e.g., Web pages).
  - Shingling (documents turned into sets).
  - Minhashing (summarizing sets).
  - Locality-Sensitive Hashing (finding only similar sets).
- Mining data streams.
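As a rough illustration of the similar-items pipeline listed above, here is a minimal Python sketch of shingling and minhashing. The shingle length `k=5`, the 100 seeded hash functions, and the use of `hashlib.blake2b` as the hash family are all assumptions made for this example, not choices stated in the slides. Locality-sensitive hashing, which would avoid comparing every pair of signatures, is not shown.

```python
import hashlib

def shingles(text, k=5):
    """Turn a document into the set of its k-character shingles."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(seed, shingle):
    """Seeded 64-bit hash of a shingle (an illustrative choice of hash family)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(shingle_set, num_hashes=100):
    """Summarize a non-empty set by its minimum hash value under each seeded hash."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates Jaccard similarity."""
    agree = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return agree / len(sig_a)

# Hypothetical documents, used only to exercise the functions.
doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumped over the lazy dogs")
print("exact Jaccard:   ", jaccard(doc_a, doc_b))
print("minhash estimate:", estimate_similarity(minhash_signature(doc_a),
                                               minhash_signature(doc_b)))
```

The signature is a fixed-size summary (here 100 integers) no matter how large the document is, which is the point of minhashing: similarity can be estimated from the summaries without keeping the full shingle sets in memory.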
6. Goals
- Learn some interesting algorithms and techniques that have important applications.
- Learn how to deal efficiently with data that is so large it doesn't fit in main memory.
7. Application: Selling Stuff
- There are two marketing environments:
  - Brick-and-mortar: conventional stores.
    - Strategies need large volume to succeed.
    - Run ads only for the most common items.
  - On-line, e.g., Amazon.
    - Takes advantage of the "long tail": selling unusual things to unusual people.
    - Your Amazon landing page suggests things you are interested in.
8. The Long Tail
Source: Chris Anderson (2004)
9. Application: Understanding Documents
- Many large databases of text information:
  - The Web.
  - News articles.
  - Blogs.
  - Postings (Craigslist, B2B Web sites, del.icio.us, social networks, etc.).
10. Questions About Web Pages
- Which Web pages are the best answers to a query?
  - PageRank, topic-specific PageRank, hubs-and-authorities.
- Which Web pages are spam?
  - TrustRank.
11. Questions About Documents
- Which documents are about similar topics?
  - Based on similar sets of words in their text.
  - Based on usage, e.g., documents accessed by many of the same people.
- What is the opinion in a document?
  - E.g., "a car that uses less fuel" versus "a car that has less room."
12. Application: Collaborative Filtering
- Advise one person based on what similar people have done.
- Example data sets:
  - NetFlix Challenge: matrix of user-movie ratings.
    - Predict how a user would rate a movie.
  - Amazon purchases: matrix of buyer-item pairs.
13. Example: NetFlix Data
Rating matrix (axes labeled "Movies" and "Users" on the slide):
0 0 5 0 0 0 4 0 0 0 5 3 0 0 0 0 0 4 0 0 0 3 0 2 0
0 5 0 0 1 0 4 0 0 0 0 4 0 0 1 0 5 3 0 2 0 0 0 0 0
4 3 3 0 5 0 0 0 0 0 0 0 0 3 4 0 0 5 etc., etc.
Note: ratings are 1-5; 0 = not rated.
14. Collaborative Filtering (2)
- Data is always sparse (most entries in the matrix are 0).
- Look for similar rows or columns.
  - E.g., items bought by many of the same customers, or users who gave many of the same movies similar ratings.
- Cluster rows and/or columns.
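To make "look for similar rows" concrete, here is a small Python sketch that compares users by the cosine similarity of their rating rows and predicts a missing rating as a similarity-weighted average. The users, movies, and ratings are hypothetical, and both cosine similarity and the weighted average are illustrative choices; the slides do not prescribe a particular similarity measure or prediction rule. Rows are stored sparsely as dictionaries, so unrated movies take no space.

```python
from math import sqrt

# Hypothetical ratings, stored sparsely: only rated movies appear (no explicit 0s).
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 4, "m2": 5, "m4": 2},
    "carol": {"m1": 1, "m3": 5, "m4": 4},
}

def cosine(u, v):
    """Cosine similarity of two sparse rating rows (dicts of movie -> rating).

    Unrated movies are treated as 0, which is a simplification.
    """
    common = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(user, ratings):
    """Return the other user whose row is closest to `user` by cosine similarity."""
    others = (name for name in ratings if name != user)
    return max(others, key=lambda name: cosine(ratings[user], ratings[name]))

def predict(user, movie, ratings):
    """Similarity-weighted average of other users' ratings; None if nobody rated the movie."""
    num = den = 0.0
    for name, row in ratings.items():
        if name == user or movie not in row:
            continue
        w = cosine(ratings[user], row)
        num += w * row[movie]
        den += w
    return num / den if den > 0 else None

print(most_similar("alice", ratings))
print(predict("alice", "m4", ratings))
```

With this toy data, `alice` and `bob` agree on the two movies they both rated, so `bob` is the nearest row and his rating dominates the prediction for `m4`. The same functions apply to columns (items) if the dictionary is keyed by movie instead of by user.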
15. Gradiance: Automated Homework
- Go to www.gradiance.com/services
- Create an account for yourself.
- Enroll in class FA335CA1
16. Gradiance: How It Works
- Don't think of it as multiple-choice.
- Really solve the problems.
- Demonstrate you solved them by answering randomly chosen multiple-choice questions.
- If you miss something, you get an explanation, and should try again.
17. Discussion: Gradiance for Course Credit?
- It has been suggested that we give credit to those who get at least 80% of the marks on the assignments.
- No exam.
- But there are deadlines for assignments: the first two pieces have May 21 deadlines.