Title: 34: The Zombie Day
1. 3/4: The Zombie Day
- Feedback thingie..
- Last bits of clustering
- Review of any questions etc
- Assuming I actually remember..
- Filtering
2. Filtering and Recommender Systems: Content-based and Collaborative
Some of the slides are based on Mooney's slides
3. Personalization
- Recommenders are instances of personalization software.
- Personalization concerns adapting to the individual needs, interests, and preferences of each user.
- Includes:
  - Recommending
  - Filtering
  - Predicting (e.g. form or calendar appt. completion)
- From a business perspective, it is viewed as part of Customer Relationship Management (CRM).
4. Feedback, Prediction, Recommendation
- Traditional IR has a single user, probably working in single-shot mode
  - Relevance feedback
- Web search engines have users
  - working continually
  - User profiling
    - A profile is a model of the user
    - (and also relevance feedback)
  - Many users
    - Collaborative filtering
      - Propagate user preferences to other users
(You know this one)
5. Recommender Systems in Use
- Systems for recommending items (e.g. books, movies, CDs, web pages, newsgroup messages) to users based on examples of their preferences.
- Many on-line stores provide recommendations (e.g. Amazon, CDNow).
- Recommenders have been shown to substantially increase sales at on-line stores.
6. Feedback Detection

Non-intrusive:
- Click certain pages in a certain order while ignoring most pages.
- Read some clicked pages longer than other clicked pages.
- Save/print certain clicked pages.
- Follow some links in clicked pages to reach more pages.
- Buy items / put them in wish-lists / shopping carts.

Intrusive:
- Explicitly ask users to rate items/pages.
7. 3/11
- Midterm returned
- Two talks of interest tomorrow
  - Louiqa Raschid at 10am
  - Zaiqing Nie at 2pm
8. Midterm (In-class)
- 494 section
  - Max 59.5, Min 12, Mean 32.17, Stdev 11.98
- 598 section
  - Max 61.5, Min 30, Mean 47.4, Stdev 10
- Overall class
  - Max 61.5, Min 12, Mean 38.15, Stdev 13.4
Pick up your exam at the end of the class
9. Midterm discussion
10. Content-based vs. Collaborative Recommendation
11. Collaborative Filtering
The correlation analysis here is similar to the association-clusters analysis!
12. Collaborative Filtering Method
- Weight all users with respect to similarity with the active user.
- Select a subset of the users (neighbors) to use as predictors.
- Normalize ratings and compute a prediction from a weighted combination of the selected neighbors' ratings.
- Present items with the highest predicted ratings as recommendations.
13. Neighbor Selection
- For a given active user, a, select correlated users to serve as a source of predictions.
- The standard approach is to use the n most similar users, u, based on similarity weights w_{a,u}.
- An alternate approach is to include all users whose similarity weight is above a given threshold.
14. Rating Prediction
- Predict a rating, p_{a,i}, for each item i for the active user a, using the n selected neighbor users u = 1, 2, ..., n.
- To account for users' different rating levels, base predictions on differences from each user's average rating.
- Weight each neighbor's contribution by their similarity to the active user:

  p_{a,i} = \bar{r}_a + [ Σ_{u=1..n} w_{a,u} (r_{u,i} - \bar{r}_u) ] / [ Σ_{u=1..n} |w_{a,u}| ]

  where r_{i,j} is user i's rating for item j and \bar{r}_u is user u's average rating.
15. Similarity Weighting
- Typically use the Pearson correlation coefficient between the ratings of the active user, a, and another user, u:

  w_{a,u} = covar(r_a, r_u) / (σ_{r_a} σ_{r_u})

  where r_a and r_u are the ratings vectors over the m items rated by both a and u, and r_{i,j} is user i's rating for item j.
16. Covariance and Standard Deviation
- Covariance
- Standard Deviation
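The formula images on this slide did not survive the export; the standard definitions that the Pearson weight above is built from, written in the slide's notation (m co-rated items, r_{i,j} = user i's rating for item j), would be:

  covar(r_a, r_u) = Σ_{j=1..m} (r_{a,j} - \bar{r}_a)(r_{u,j} - \bar{r}_u) / m

  σ_{r_a} = sqrt( Σ_{j=1..m} (r_{a,j} - \bar{r}_a)^2 / m )

so that w_{a,u} = covar(r_a, r_u) / (σ_{r_a} σ_{r_u}), as on the previous slide.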
17. Significance Weighting
- It is important not to trust correlations based on very few co-rated items.
- Include significance weights, s_{a,u}, based on the number m of co-rated items (a code sketch tying slides 12-17 together follows below).
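A minimal Python sketch of the user-based CF pipeline on slides 12-17 (weight users, select neighbors, predict), assuming ratings are stored as a dict of dicts {user: {item: rating}}. The function names and the 50-item significance cutoff are my own choices; the slides do not fix the constant.

from math import sqrt

def pearson(ratings, a, u):
    """Pearson correlation w_{a,u} over the items co-rated by users a and u."""
    common = set(ratings[a]) & set(ratings[u])
    m = len(common)
    if m < 2:
        return 0.0, m
    mean_a = sum(ratings[a][j] for j in common) / m
    mean_u = sum(ratings[u][j] for j in common) / m
    cov = sum((ratings[a][j] - mean_a) * (ratings[u][j] - mean_u) for j in common)
    var_a = sum((ratings[a][j] - mean_a) ** 2 for j in common)
    var_u = sum((ratings[u][j] - mean_u) ** 2 for j in common)
    if var_a == 0 or var_u == 0:
        return 0.0, m
    return cov / sqrt(var_a * var_u), m

def predict(ratings, a, item, n=20, sig_cutoff=50):
    """Predict p_{a,i}: the active user's mean rating plus a similarity-weighted
    average of the neighbors' deviations from their own mean ratings."""
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    neighbors = []
    for u in ratings:
        if u == a or item not in ratings[u]:
            continue
        w, m = pearson(ratings, a, u)
        w *= min(m, sig_cutoff) / sig_cutoff          # significance weighting s_{a,u}
        neighbors.append((abs(w), w, u))
    neighbors = sorted(neighbors, reverse=True)[:n]   # the n most similar users
    num = sum(w * (ratings[u][item] - sum(ratings[u].values()) / len(ratings[u]))
              for _, w, u in neighbors)
    den = sum(aw for aw, _, _ in neighbors)
    return mean_a if den == 0 else mean_a + num / den

For example, predict(ratings, 'alice', 'Memento') would return Alice's predicted rating for a movie she has not rated yet.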
18. Problems with Collaborative Filtering
- Cold Start: there need to be enough other users already in the system to find a match.
- Sparsity: if there are many items to be recommended, then even with many users the user/ratings matrix is sparse, and it is hard to find users that have rated the same items.
- First Rater: cannot recommend an item that has not been previously rated.
  - New items
  - Esoteric items
- Popularity Bias: cannot recommend items to someone with unique tastes.
  - Tends to recommend popular items.
  - "WHAT DO YOU MEAN YOU DON'T CARE FOR BRITNEY SPEARS, YOU DUNDERHEAD?"
19. Content-Based Recommending
- Recommendations are based on information about the content of items rather than on other users' opinions.
- Uses machine learning algorithms to induce a profile of the user's preferences from examples, based on a featural description of content.
- Lots of systems
20. Advantages of the Content-Based Approach
- No need for data on other users.
  - No cold-start or sparsity problems.
- Able to recommend to users with unique tastes.
- Able to recommend new and unpopular items.
  - No first-rater problem.
- Can provide explanations of recommended items by listing the content features that caused an item to be recommended.
- Well-known technology: the entire field of Classification Learning is at (y)our disposal!
21. Disadvantages of the Content-Based Method
- Requires content that can be encoded as meaningful features.
- Users' tastes must be representable as a learnable function of these content features.
- Unable to exploit quality judgments of other users.
  - Unless these are somehow included in the content features.
22. Primer on Classification Learning
FAST
- (you can learn more about this in
- CSE 471 Intro to AI
- CSE 575 Datamining
- EEE 511 Neural Networks)
23. Many uses of Classification Learning in IR/Web Search
- Learn user profiles
- Classify documents into categories based on their contents
- Useful in
  - focused crawling
  - Spam mail filtering
  - Relevance reasoning..
24. A classification learning example: predicting when Russell will wait for a table
-- similar to learning book preferences, predicting credit card fraud, predicting when people are likely to respond to junk mail
25. Uses different biases in predicting Russell's waiting habits

Decision Trees -- examples are used to learn the topology and the order of questions

K-nearest neighbors

Association rules -- examples are used to learn the support and confidence of association rules, e.g.:
  If patrons=full and day=Friday then wait (0.3/0.7)
  If wait>60 and Reservation=no then wait (0.4/0.9)

SVMs

Neural Nets -- examples are used to learn the topology and the edge weights

Naïve Bayes (Bayes-net learning) -- examples are used to learn the topology and the CPTs
26. Mirror, mirror, on the wall: which learning bias is the best of all?

Well, there is no such thing, silly!
-- Each bias makes it easier to learn some patterns and harder (or impossible) to learn others:
  - A line-fitter can fit the best line to the data very fast, but won't know what to do if the data doesn't fall on a line.
  - A curve-fitter can fit lines as well as curves, but takes longer to fit lines than a line-fitter does.
-- Different types of bias classes (decision trees, NNs, etc.) provide different ways of naturally carving up the space of all possible hypotheses.

So a more reasonable question is:
-- What is the bias class that has a specialization corresponding to the type of patterns that underlie my data?
-- Within this bias class, what is the most restrictive bias that can still capture the true pattern in the data?

-- Decision trees can capture all boolean functions
   -- but are faster at capturing conjunctive boolean functions
-- Neural nets can capture all boolean or real-valued functions
   -- but are faster at capturing linearly separable functions
-- Bayesian learning can capture all probabilistic dependencies
   -- but is faster at capturing single-level dependencies (the naïve bayes classifier)
27. Fitting test cases vs. predicting future cases: the BIG TENSION...
(Figure: three candidate hypotheses, labeled 1, 2, and 3, fit to the same training data.)
Why not the 3rd?
28. Naïve Bayesian Classification
- Problem: classify a given example E into one of the classes C1, C2, ..., Cn.
- E has k attributes A1, A2, ..., Ak, and each Ai can take d different values.
- Bayes classification: assign E to the class Ci that maximizes P(Ci | E)
  - P(Ci | E) = P(E | Ci) P(Ci) / P(E)
- P(Ci) and P(E) are a priori knowledge (or can easily be estimated from the data set).
- Estimating P(E | Ci) is harder
  - It requires P(A1=v1, A2=v2, ..., Ak=vk | Ci)
  - Assuming d values per attribute, we would need n·d^k probabilities
- Naïve Bayes assumption: assume all attributes are independent, so P(E | Ci) = Π_j P(Aj=vj | Ci)
- The assumption is BOGUS, but it seems to WORK (and needs only n·d·k probabilities)
29. NBC in terms of Bayes networks..
(Figures: the NBC independence assumption vs. a more realistic assumption, drawn as Bayes networks.)
30. Estimating the probabilities for NBC
- Given an example E described as A1=v1, A2=v2, ..., Ak=vk, we want to compute the class of E.
- Calculate P(Ci | A1=v1, A2=v2, ..., Ak=vk) for every class Ci and say that the class of E is the one for which P(.) is maximum
  - P(Ci | A1=v1, ..., Ak=vk) = Π_j P(Aj=vj | Ci) · P(Ci) / P(A1=v1, ..., Ak=vk)
- Given a set of N training examples that have already been classified into n classes Ci:
  - Let #(Ci) be the number of examples that are labeled as Ci.
  - Let #(Ci, Aj=vj) be the number of examples labeled as Ci that have attribute Aj set to value vj.
  - P(Ci) = #(Ci) / N
  - P(Aj=vj | Ci) = #(Ci, Aj=vj) / #(Ci)
31. Example
P(willwait=yes) = 6/12 = .5
P(Patrons=full | willwait=yes) = 2/6 = 0.333
P(Patrons=some | willwait=yes) = 4/6 = 0.666
Similarly we can show that P(Patrons=full | willwait=no) = 0.666

P(willwait=yes | Patrons=full)
  = P(Patrons=full | willwait=yes) · P(willwait=yes) / P(Patrons=full)
  = k · .333 · .5
P(willwait=no | Patrons=full) = k · 0.666 · .5
32. Using M-estimates to improve probability estimates
- The simple frequency-based estimate of P(Ai=vj | Ck) can be inaccurate, especially when the true value is close to zero and the number of training examples is small (so the probability that your examples don't contain rare cases is quite high).
- Solution: use the M-estimate
  - P(Ai=vj | Ci) = [ #(Ci, Ai=vj) + m·p ] / [ #(Ci) + m ]
  - p is the prior probability of Ai taking the value vj
    - If we don't have any background information, assume a uniform probability (that is, 1/d if Ai can take d values).
  - m is a constant, called the equivalent sample size.
    - If we believe that our sample set is large enough, we can keep m small; otherwise, keep it large.
  - Essentially we are augmenting the #(Ci) real samples with m more virtual samples drawn according to the prior probability of how Ai takes values.

Also, to avoid underflow errors, do addition of logarithms of probabilities (instead of multiplication of probabilities). A sketch combining these pieces follows below.
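Putting slides 28-32 together, a rough Python sketch of an NBC trained by counting, with m-estimate smoothing and log-space scoring to avoid underflow. The data layout (a list of (attribute-dict, class) pairs) and all names are illustrative assumptions, not from the slides.

import math
from collections import Counter, defaultdict

def train_nbc(examples, m=2.0):
    """examples: list of (attribute_dict, class_label) pairs.
    Collects the counts needed for m-estimated P(Ai=v | C) and for P(C)."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)   # (class, attribute) -> Counter of values
    domains = defaultdict(set)            # attribute -> set of observed values
    for attrs, c in examples:
        class_counts[c] += 1
        for a, v in attrs.items():
            value_counts[(c, a)][v] += 1
            domains[a].add(v)
    return class_counts, value_counts, domains, m

def classify(model, attrs):
    class_counts, value_counts, domains, m = model
    N = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = math.log(nc / N)                          # log P(C)
        for a, v in attrs.items():
            p = 1.0 / max(len(domains[a]), 1)             # uniform prior p = 1/d
            count = value_counts[(c, a)][v]
            score += math.log((count + m * p) / (nc + m)) # m-estimate of P(Ai=v | C)
        if score > best_score:
            best, best_score = c, score
    return best

For instance, training on the restaurant examples and calling classify(model, {'Patrons': 'full', 'day': 'Friday'}) would return the more probable willwait class.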
33. Applying NBC to Text Classification
- Text classification is the task of classifying text documents into multiple classes
  - Is this mail spam?
  - Is this article from comp.ai or misc.piano?
  - Is this article likely to be relevant to user X?
  - Is this page likely to lead me to pages relevant to my topic? (as in topic-specific crawling)
- NBC has been applied a lot to text classification tasks.
- The big question: how to represent text documents as feature vectors?
  - Vector space variants (e.g. a binary version of the vector space rep)
    - Used by Sahami et al. in SPAM filtering
    - A problem is that the vectors are likely to be as large as the size of the vocabulary
      - Use feature-selection techniques to select only a subset of words as features (see the Sahami et al. paper)
  - Unigram model (Mitchell paper)
    - Used by Joachims for newspaper article categorization
    - Document as a vector of positions, with values being the words
34. 25th March
- Text Classification
- Spam mail filtering
35. Extensions to the Naïve Bayes idea
- Vector of Bags model
  - E.g. books have several different fields that are all text
    - Authors, description, ...
  - A word appearing in one field is different from the same word appearing in another
  - Want to keep each bag different: a vector of m bags
- Additional useful terms
  - Odds Ratio
    - P(rel | example) / P(~rel | example)
    - An example is positive if the odds ratio is > 1
  - Strength of a keyword
    - log P(w | rel) / P(w | ~rel)
  - We can summarize a user's profile in terms of the words that have strength above some threshold.
36. Sahami et al.'s Solution for SPAM detection
- Use the standard Term Vector Space model developed by the Information Retrieval field (similar to AdEater)
  - 1 e-mail message → a single fixed-width feature vector
  - Have 1 bit in this vector for each term that occurs in some message in E (plus a bunch of domain-specific features, e.g., when the message was sent)
- Learning algorithm
  - Use the standard Naive Bayes algorithm
37. Sahami et al. spam filtering
- The above framework is completely general. We just need to encode each e-mail as a fixed-width vector X = <X1, X2, X3, ..., XN> of features.
- So... what features are used in Sahami's system?
  - words
  - suggestive phrases ("free money", "must be over 21", ...)
  - sender's domain (.com, .edu, .gov, ...)
  - peculiar punctuation (!!!Get Rich Quick!!!)
  - did the email contain an attachment?
  - was the message sent during evening or daytime?
  - ?
  - ?
- (We'll see a similar list for AdEater and other learning systems)
(The word features are generated automatically; the rest are handcrafted!)
38. Feature Selection
- A problem -- too many features -- each vector x contains several thousand features.
  - Most come from word features -- include a word if any e-mail contains it (e.g., every x contains an "opossum" feature even though this word occurs in only one message).
  - Slows down learning and predictions
  - May cause lower performance
- The Naïve Bayes classifier makes a huge assumption -- the independence assumption.
  - A good strategy is to have few features, to minimize the chance that the assumption is violated.
  - Ideally, discard all features that violate the assumption. (But if we knew these features, we wouldn't need to make the naive independence assumption!)
- Feature selection: a few thousand → 500 features
39. Feature-Selection approach
- Lots of ways to perform feature selection
  - FEATURE SELECTION is a form of DIMENSIONALITY REDUCTION
- One simple strategy: mutual information
- Suppose we have two random variables A and B.
- Mutual information MI(A,B) is a numeric measure of what we can conclude about A if we know B, and vice-versa.
  - MI(A,B) = Σ Pr(A,B) log( Pr(A,B) / (Pr(A)·Pr(B)) )
- Example: if A and B are independent, then we can't conclude anything: MI(A,B) = 0
- Note that MI can be calculated without needing conditional probabilities.
40. Mutual Information, continued
- Check our intuition: independence → MI(A,B) = 0

  MI(A,B) = Σ Pr(A,B) log( Pr(A,B) / (Pr(A)·Pr(B)) )
          = Σ Pr(A,B) log( Pr(A)·Pr(B) / (Pr(A)·Pr(B)) )
          = Σ Pr(A,B) log 1
          = 0

- Fully correlated, it becomes the information content: MI(A,A) = -Σ Pr(A) log(Pr(A))
  - It depends on how uncertain the event is: notice that the expression becomes maximum (1) when Pr(A) = .5. This makes sense, since the most uncertain event is one whose probability is .5 (if it is .3 we know it is likely not to happen; if it is .7 we know it is likely to happen).
41. MI and Feature Selection
- Back to feature selection: pick features Xi that have high mutual information with the junk/legit classification C.
  - These are exactly the features that are good for prediction
  - Pick the 500 features Xi with the highest value of MI(Xi, C) (see the sketch below)
- NOTE: NBC's estimates of the probabilities are actually quite a bit wrong, but they still got by with those..
- Also, note that this analysis looks at each feature in isolation and may thus miss highly predictive word groups whose individual words are quite non-predictive
  - e.g. "free" and "money" may have low MI, but "free money" may have higher MI.
  - A way to handle this is to look at the MI of not just single words but subsets of words
    - (in the worst case, you will need to compute 2^n MIs!)
  - So instead, Sahami et al. add domain-specific phrases separately..
- Note: there's no reason that the highest-MI features are the ones that least violate the independence assumption -- this is just a heuristic!
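A small Python sketch of the MI-based selection described above, for binary word-presence features and a binary junk/legit class; the data layout (a set of words per document) and function names are my own assumptions.

import math
from collections import Counter

def mutual_information(docs_words, labels, word):
    """MI between the presence of `word` and the class label:
    MI = sum_{x,c} P(x,c) log( P(x,c) / (P(x) P(c)) )."""
    n = len(labels)
    joint = Counter()
    for words, c in zip(docs_words, labels):
        joint[(word in words, c)] += 1
    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / n
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_c = sum(v for (_, cc), v in joint.items() if cc == c) / n
        mi += p_xc * math.log(p_xc / (p_x * p_c))
    return mi

def select_features(docs_words, labels, k=500):
    """Pick the k words with the highest MI(word, class)."""
    vocab = set().union(*docs_words)
    scored = sorted(vocab,
                    key=lambda w: mutual_information(docs_words, labels, w),
                    reverse=True)
    return scored[:k]

Here docs_words is a list of word sets (one per message) and labels is the parallel list of junk/legit tags; select_features returns the 500 highest-MI words.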
42. MI-based feature selection vs. LSI
- Both MI and LSI are dimensionality-reduction techniques
- MI reduces dimensions by picking a subset of the original dimensions
- LSI looks instead at linear combinations of the original dimensions (Good: can automatically capture sets of dimensions that are more predictive. Bad: the new features may not have any significance to the user.)
- MI does feature selection w.r.t. a classification task (MI is computed between a feature and a class)
- LSI does dimensionality reduction independent of the classes (it just looks at data variance)
43. Experiments
- 1789 hand-tagged e-mail messages
  - 1578 junk
  - 211 legit
- Split into
  - 1528 training messages (86%)
  - 251 testing messages (14%)
- Similar to the experiment described in the AdEater lecture, except messages are not randomly split. This is unfortunate -- maybe performance is just a fluke.
- Training phase: compute Pr[X=x | C=junk], Pr[X=x], and Pr[C=junk] from the training messages.
- Testing phase: compute Pr[C=junk | X=x] for each test message x. Predict junk if Pr[C=junk | X=x] > 0.999. Record mistake/correct answer in a confusion matrix.
44. Precision/Recall Curves
(Figure: precision/recall curves; points from the table on Slide 14. Up and to the right is better performance.)
45. Results
same configuration, just different training/test
messages
46. Real scenario
- The data in the previous experiments was collected in a strange way. The real scenario tries to fix it.
- Three kinds of messages:
  - Read and keep
  - Read and discard (e.g., a joke from a friend)
  - Junk
- The real scenario models a setting where messages arrive and some are deleted because they are junk, others are deleted because they aren't worth saving, and others are read and then saved. Both read-and-keep and read-and-discard should count as legit -- but read-and-discard messages were not collected.
47. Summary
- Bayesian Classification
- Naïve Bayesian Classification
- Email features: automatically generated lists of words; hand-picked phrases; domain-specific features
- Feature selection by the Mutual Information heuristic
- Semi-controlled experiments
  - Collect data in various ways; compare 2/3 categories
- Confusion Matrix
- Precision/recall vs. accuracy
- Can trade precision for recall by varying the classification threshold.
48. Current State of the Art in Spam Filtering
- SpamAssassin (http://www.spamassassin.org) is pretty much the best spam filter out there (it is FREE!)
- Based on a variety of tests. Each test gives a numerical score ("spam points") to the message (the more positive it is, the more spammy it is). When the cumulative score is above a threshold, it puts the message in the spam box. The tests used are at http://www.spamassassin.org/tests.html.
- Tests are of three types (a toy sketch of the scoring scheme follows below):
  - Domain Specific: has a set of hand-written rules (sort of like the Sahami et al. domain-specific features). If a rule matches, the message is given a score (+ve or -ve). If the cumulative score is more than a threshold, the message is classified as SPAM.
  - Bayesian Filter: uses NBC to train on messages that the user classified (requires that SA be integrated with a mail client; the ASU IMAP version does it).
    - An interesting point is that it is hard to explain to the user why the Bayesian filter found a message to be spam (while the domain-specific filter can say that specific phrases were found).
  - Collaborative Filter: e.g. Vipul's Razor, etc. If this type of message has been reported as SPAM by other users (to a central spam server), then the message is given additional spam points.
    - Messages are reported in terms of their signatures.
    - Simple checksum signatures don't quite work (since spammers put minor variations in the body).
    - So these techniques use "fuzzy" signatures, and similarity rather than equality of signatures (see the connection with crawling and duplicate detection).
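The domain-specific part of this scheme boils down to accumulating per-rule points and comparing the total against a threshold; a toy Python sketch follows (the rules, scores, and threshold are invented for illustration and are not SpamAssassin's real tests or values).

# Toy illustration of cumulative rule scoring; rules and points are made up.
RULES = [
    ("ALL_CAPS_SUBJECT", lambda msg: msg["subject"].isupper(), 1.5),
    ("MENTIONS_VIAGRA",  lambda msg: "viagra" in msg["body"].lower(), 2.0),
    ("HAS_UNSUBSCRIBE",  lambda msg: "unsubscribe" in msg["body"].lower(), -0.5),
]
THRESHOLD = 5.0

def spam_score(msg):
    """Return the cumulative score and the list of rules that fired."""
    hits = [(name, pts) for name, test, pts in RULES if test(msg)]
    return sum(p for _, p in hits), hits

def is_spam(msg):
    score, hits = spam_score(msg)
    return score >= THRESHOLD, score, hits

The list of fired rules is what makes the explanation easy, exactly as in the X-Spam-Status header on the next slide.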
49. A message caught by SpamAssassin
- Message 346
- From aetbones@ccinet.ab.ca Thu Mar 25 16:51:23 2004
- From: Geraldine Montgomery <aetbones@ccinet.ab.ca>
- To: randy.mullen@asu.edu
- Cc: ranga@asu.edu, rangers@asu.edu, rao@asu.edu, raphael@asu.edu, rapture@asu.edu, rashmi@asu.edu
- Subject: V1AGKRA 80% DISCOUNT !! sg g pz kf
- Date: Fri, 26 Mar 2004 02:49:21 +0000 (GMT)
- X-Spam-Flag: YES
- X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on parichaalak.eas.asu.edu
- X-Spam-Level:
- X-Spam-Status: Yes, hits=42.2 required=5.0 tests=BIZ_TLD,DCC_CHECK,FORGED_MUA_OUTLOOK,FORGED_OUTLOOK_TAGS,HTML_30_40,HTML_FONT_BIG,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_NO_CHARSET,MIME_HTML_ONLY,MIME_HTML_ONLY_MULTI,MISSING_MIMEOLE,OBFUSCATING_COMMENT,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,RCVD_IN_NJABL,RCVD_IN_NJABL_PROXY,RCVD_IN_OPM,RCVD_IN_OPM_HTTP,RCVD_IN_OPM_HTTP_POST,RCVD_IN_SORBS,RCVD_IN_SORBS_HTTP,SORTED_RECIPS,
50. Example of SpamAssassin explanation
- X-Spam-Status: Yes, hits=42.2 required=5.0 tests=BIZ_TLD,DCC_CHECK,FORGED_MUA_OUTLOOK,FORGED_OUTLOOK_TAGS,HTML_30_40,HTML_FONT_BIG,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,MIME_HTML_NO_CHARSET,MIME_HTML_ONLY,MIME_HTML_ONLY_MULTI,MISSING_MIMEOLE,OBFUSCATING_COMMENT,RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_DSBL,RCVD_IN_NJABL,RCVD_IN_NJABL_PROXY,RCVD_IN_OPM,RCVD_IN_OPM_HTTP,RCVD_IN_OPM_HTTP_POST,RCVD_IN_SORBS,RCVD_IN_SORBS_HTTP,SORTED_RECIPS,SUSPICIOUS_RECIPS,X_MSMAIL_PRIORITY_HIGH,X_PRIORITY_HIGH autolearn=no version=2.63
(Callouts on the slide mark the hand-written tests as domain specific and the network blacklist lookups, such as DCC_CHECK and the RCVD_IN_* tests, as collaborative.)
In this case, autolearn is set to no, so the Bayesian filter is not active.
51. General comments on Spam
- Spam is a technical problem (we created it)
- It has an arms-race character to it
- We can't quite legislate against SPAM
  - Most spam comes from outside national boundaries
- We need technical solutions
  - To detect spam (we mostly have a handle on it)
  - To STOP spam generation (detecting spam after it gets sent still taxes mail servers; by some estimates more than 66% of the mail relayed by AOL/Yahoo mailservers is SPAM)
- Brother Gates suggests a monetary cost
  - Make every mailer pay for the mail they send
  - Not necessarily in stamps, but perhaps by agreeing to give some CPU cycles to work on some problem (e.g. finding primes, computing PI, etc.)
  - The cost will be minuscule for normal users, but will multiply for spam mailers who send millions of mails.
- Other innovative ideas are needed; we now have a conference on Spam mail
  - http://www.ceas.cc/
52. NBC with the Unigram Model
- Assume that words from a fixed vocabulary V appear in the document D at different positions (assume D has L words).
- P(D|C) is P(p1=w1, p2=w2, ..., pL=wL | C)
- Assume that word-appearance probabilities are independent of each other:
  - P(D|C) = P(p1=w1|C) · P(p2=w2|C) ··· P(pL=wL|C)
- Assume that the word-occurrence probability is INDEPENDENT of its position in the document:
  - P(p1=w1|C) = P(p2=w1|C) = ··· = P(pL=w1|C)
- Use m-estimates: set p to 1/|V| and m to |V| (where |V| is the size of the vocabulary):
  - P(wk|Ci) = [ #(wk, Ci) + 1 ] / [ #w(Ci) + |V| ]
  - #(wk, Ci) is the number of times wk appears in the documents classified into class Ci
  - #w(Ci) is the total number of words in all documents of class Ci
(A sketch of this model follows below.)

Used to classify Usenet articles from 20 different groups -- achieved an accuracy of 89%!! (Random guessing would get you 5%.)
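A rough Python sketch of the unigram model above: per-class word counts with the p = 1/|V|, m = |V| smoothing (i.e. add-one), ignoring position. The names and the token-list input format are my own assumptions.

import math
from collections import Counter

def train_unigram_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)                       # #docs per class, for P(Ci)
    word_counts = {c: Counter() for c in class_docs}   # #(w_k, C_i)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    return vocab, class_docs, word_counts

def classify_doc(model, doc):
    vocab, class_docs, word_counts = model
    n_docs = sum(class_docs.values())
    V = len(vocab)
    best, best_lp = None, float("-inf")
    for c in class_docs:
        total_words = sum(word_counts[c].values())     # #w(C_i)
        lp = math.log(class_docs[c] / n_docs)          # log P(C_i)
        for w in doc:
            # P(w | C_i) = (#(w, C_i) + 1) / (#w(C_i) + |V|)
            lp += math.log((word_counts[c][w] + 1) / (total_words + V))
        if lp > best_lp:
            best, best_lp = c, lp
    return best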
53. How Well (and WHY) DOES NBC WORK?
- The naïve bayes classifier is darned easy to implement
  - Good learning speed, classification speed
  - Modest storage space
  - Supports incrementality
    - Recommendations can be re-done as more attribute values of the new item become known.
- It seems to work very well in many scenarios
  - Peter Norvig, the director of Machine Learning at GOOGLE, when asked what sort of technology they use, said: "Naïve bayes".
- But WHY?
  - Domingos/Pazzani (1996) showed that NBC has a much wider range of applicability than previously thought (despite using the independence assumption)
  - Classification accuracy is different from probability-estimate accuracy
    - Notice that normal classification applications don't quite care about the actual probability, only about which probability is the highest
    - An exception is cost-based learning, where false positives and false negatives have different costs
      - E.g. Sahami et al. consider a message to be spam only if the Spam class probability is > .9 (so they are relying on the incorrect NBC estimates here)
54. Combining Content and Collaboration
- Content-based and collaborative methods have complementary strengths and weaknesses.
- Combine methods to obtain the best of both.
- Various hybrid approaches:
  - Apply both methods and combine recommendations.
  - Use collaborative data as content.
  - Use a content-based predictor as another collaborator.
  - Use a content-based predictor to complete the collaborative data.
55. Movie Domain
- EachMovie Dataset (Compaq Research Labs)
  - Contains user ratings for movies on a 0-5 scale.
  - 72,916 users (avg. 39 ratings each).
  - 1,628 movies.
  - Sparse user-ratings matrix (2.6% full).
- Crawled Internet Movie Database (IMDb)
  - Extracted content for titles in EachMovie.
  - Basic movie information
    - Title, Director, Cast, Genre, etc.
  - Popular opinions
    - User comments, newspaper and newsgroup reviews, etc.
56. Content-Boosted Collaborative Filtering
(Diagram: EachMovie ratings combined with IMDb content.)
57. Content-Boosted CF - I
58. Content-Boosted CF - II
(Diagram: User Ratings Matrix + Content-Based Predictor → Pseudo User Ratings Matrix)
- Compute the pseudo user-ratings matrix (see the sketch below)
  - The full matrix approximates the actual full user-ratings matrix
- Perform CF
  - Using Pearson correlation between pseudo user-rating vectors
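A minimal sketch of the pseudo-ratings step, assuming the same {user: {item: rating}} layout as the earlier CF sketch; content_predict stands in for whatever content-based learner is trained on the IMDb features (it is just a placeholder signature here).

def build_pseudo_ratings(ratings, all_items, content_predict):
    """ratings: {user: {item: rating}}; content_predict(user, item) -> predicted rating.
    Returns a dense {user: {item: rating}} matrix: actual ratings where they exist,
    content-based predictions everywhere else."""
    pseudo = {}
    for user, user_ratings in ratings.items():
        pseudo[user] = {
            item: user_ratings.get(item, content_predict(user, item))
            for item in all_items
        }
    return pseudo

# The dense pseudo matrix can then be fed to the Pearson-based predictor
# sketched after slide 17, e.g. predict(build_pseudo_ratings(...), 'alice', 'Memento').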
59. Conclusions
- Recommending and personalization are important approaches to combating information overload.
- Machine Learning is an important part of systems for these tasks.
- Collaborative filtering has problems.
- Content-based methods address these problems (but have problems of their own).
- Integrating both is best.