Title: Information Retrieval
1. Information Retrieval
March 25, 2005
2. Course Information
- Instructor: Dragomir R. Radev (radev@si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: M 11-12, Th 12-1, or via email
- Course page: http://tangra.si.umich.edu/radev/650
- Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
3. Text classification
4. Introduction
- Text classification: assigning documents to predefined categories
- Hierarchical vs. flat
- Many techniques: generative (maxent, kNN, Naïve Bayes) vs. discriminative (SVM, regression)
- Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
- Discriminative: model p(y|x) directly
5. Generative models: kNN
- K-nearest neighbors
- Very easy to program
- Issues: choosing k, b?
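Below is a minimal sketch of a kNN text classifier; the scikit-learn pipeline, the toy corpus, and the choice of cosine distance are illustrative assumptions, not part of the original slides.

```python
# Hypothetical kNN text classification sketch (assumes scikit-learn is available).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["wheat prices rose", "corn harvest down", "merger announced today"]
train_labels = ["grain", "grain", "acquisitions"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# k controls how many neighbors vote; cosine distance is a common choice for text.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["wheat and corn exports"])
print(knn.predict(X_test))  # e.g. ['grain']
```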
6. Feature selection: the χ2 test
- For a term t
- Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0)
- P(C=0) = (k00 + k01)/n
- P(C=1) = 1 - P(C=0) = (k10 + k11)/n
- P(I_t=0) = (k00 + k10)/n
- P(I_t=1) = 1 - P(I_t=0) = (k01 + k11)/n
7. Feature selection: the χ2 test
- High values of χ2 indicate lower belief in independence.
- In practice, compute χ2 for all words and pick the top k among them.
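As an illustration (not from the slides), here is a small sketch of the χ2 score for a single term, computed from the 2x2 table of counts defined above.

```python
# Hypothetical chi-square scoring sketch; k[c][i] = number of documents with
# class C=c and term-indicator I_t=i, as in the table above.
def chi_square(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    observed = [[k00, k01], [k10, k11]]
    row = [k00 + k01, k10 + k11]   # class marginals
    col = [k00 + k10, k01 + k11]   # term marginals
    chi2 = 0.0
    for c in range(2):
        for i in range(2):
            expected = row[c] * col[i] / n   # expected count under independence
            chi2 += (observed[c][i] - expected) ** 2 / expected
    return chi2

# Rank all terms by chi_square(...) and keep the top k as features.
print(chi_square(40, 10, 5, 45))   # strongly dependent term -> large score
```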
8. Feature selection: mutual information
- No document length scaling is needed
- Documents are assumed to be generated according
to the multinomial model
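For illustration only, here is a sketch of mutual information between the class and a term; note it uses the simpler binary-indicator estimate rather than the multinomial-model formulation the slide refers to, and the counts are the same hypothetical 2x2 table as for χ2.

```python
import math

# Hypothetical sketch: mutual information between the class variable C and the
# term-indicator I_t, estimated from binary document counts (a common, simpler
# variant; not the multinomial-model formulation mentioned on the slide).
def mutual_information(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    counts = [[k00, k01], [k10, k11]]
    p_c = [(k00 + k01) / n, (k10 + k11) / n]   # class marginals
    p_i = [(k00 + k10) / n, (k01 + k11) / n]   # term marginals
    mi = 0.0
    for c in range(2):
        for i in range(2):
            p_ci = counts[c][i] / n
            if p_ci > 0:
                mi += p_ci * math.log(p_ci / (p_c[c] * p_i[i]))
    return mi

print(mutual_information(40, 10, 5, 45))   # higher value = more informative term
```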
9. Naïve Bayesian classifiers
- Naïve Bayesian classifier
- Assumes statistical independence of the features given the class
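A minimal sketch of a multinomial Naïve Bayes text classifier follows; the scikit-learn calls and the toy spam/ham data are assumptions for illustration.

```python
# Hypothetical multinomial Naive Bayes sketch (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap meds buy now", "meeting agenda attached", "win money now"]
labels = ["spam", "ham", "spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

nb = MultinomialNB(alpha=1.0)   # alpha = Laplace smoothing
nb.fit(X, labels)

print(nb.predict(vectorizer.transform(["cheap money now"])))  # e.g. ['spam']
```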
10. Spam recognition
Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <ig_esq@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU. I AM THE CHAIRMAN, TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION (INEC). I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS (US$20M) TO A SAFE FOREIGN ACCOUNT. THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING. IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
11. Well-known datasets
- 20 newsgroups (/data0/projects/graph/20ng)
- http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
- Reuters-21578 (/data2/corpora/reuters21578)
- Categories: grain, acquisitions, corn, crude, wheat, trade
- WebKB (/data2/corpora/webkb)
- http://www-2.cs.cmu.edu/webkb/
- course, student, faculty, staff, project, dept, other
- NB performance (2000)
- P: 26, 43, 18, 6, 13, 2, 94
- R: 83, 75, 77, 9, 73, 100, 35
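For reference, a loading sketch for one of these benchmarks; scikit-learn ships a downloader for 20 Newsgroups, and the category subset chosen here is an illustrative assumption.

```python
# Hypothetical loading sketch for the 20 Newsgroups benchmark (assumes scikit-learn).
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.autos", "talk.politics.misc"],  # illustrative subset
    remove=("headers", "footers", "quotes"),
)
print(len(train.data), "documents,", len(train.target_names), "categories")
```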
12. Support vector machines
- Introduced by Vapnik in the early 90s.
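A minimal linear-SVM text classification sketch follows; the TF-IDF + LinearSVC pipeline and the toy Reuters-style categories are assumptions for illustration, not the slides' setup.

```python
# Hypothetical linear SVM text classifier sketch (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["oil prices fell sharply", "the company acquired a rival", "wheat exports rose"]
labels = ["crude", "acquisitions", "grain"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["crude oil output"]))  # e.g. ['crude']
```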
13. Semi-supervised learning
- EM
- Co-training
- Graph-based
14. Exploiting Hyperlinks: Co-training
- Each document instance has two alternate views (Blum and Mitchell 1998)
- terms in the document, x1
- terms in the hyperlinks that point to the document, x2
- Each view is sufficient to determine the class of the instance
- The labeling function that classifies examples is the same whether applied to x1 or to x2
- x1 and x2 are conditionally independent, given the class
Slide from Pierre Baldi
15. Co-training Algorithm
- Labeled data are used to infer two Naïve Bayes classifiers, one for each view
- Each classifier will
- examine unlabeled data
- pick the most confidently predicted positive and negative examples
- add these to the labeled examples
- Classifiers are now retrained on the augmented set of labeled examples (see the sketch after this slide)
Slide from Pierre Baldi
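Below is a simplified sketch of that loop; the helper names, the use of MultinomialNB for both views, and the one-example-per-round selection are assumptions, not part of the original algorithm description.

```python
# Hypothetical co-training sketch: two views (document terms x1, anchor-text terms x2);
# each classifier labels the unlabeled example it is most confident about.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=10):
    """X*_lab / X*_unlab: lists of count-vector feature rows; y_lab: labels."""
    X1_lab, X2_lab, y_lab = list(X1_lab), list(X2_lab), list(y_lab)
    unlabeled = list(range(len(X1_unlab)))
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        if not unlabeled:
            break
        clf1.fit(np.array(X1_lab), y_lab)   # view 1: document terms
        clf2.fit(np.array(X2_lab), y_lab)   # view 2: hyperlink (anchor) terms
        for clf, X_un in ((clf1, X1_unlab), (clf2, X2_unlab)):
            if not unlabeled:
                break
            probs = clf.predict_proba(np.array([X_un[i] for i in unlabeled]))
            best = unlabeled[int(np.max(probs, axis=1).argmax())]   # most confident example
            label = clf.predict(np.array([X_un[best]]))[0]
            X1_lab.append(X1_unlab[best]); X2_lab.append(X2_unlab[best]); y_lab.append(label)
            unlabeled.remove(best)
    return clf1, clf2
```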
16. Additional topics
- Soft margins
- VC dimension
- Kernel methods
17. Conclusion
- SVMs are widely considered to be the best method for text classification (see the papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters.
- NB is also good in many circumstances
18. Information extraction
19. Information Extraction
- Automatically extract structured information from unstructured text on Web pages
- Represent the extracted information in some well-defined schema
- E.g.
- crawl the Web searching for information about certain technologies or products of interest
- extract information on authors and books from various online bookstore and publisher pages
Slide from Pierre Baldi
20. Info Extraction as Classification
- Represent each document as a sequence of words
- Use a sliding window of width k as input to a classifier (see the sketch after this slide)
- each of the k inputs is a word in a specific position
- The system is trained on positive and negative examples (typically manually labeled)
- Limitation: no account of sequential constraints
- e.g. the author field usually precedes the address field in the header of a research paper
- can be fixed by using stochastic finite-state models
Slide from Pierre Baldi
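Here is a small sketch of the sliding-window representation; the feature naming (w0, w1, ...) and the toy header text are assumptions for illustration.

```python
# Hypothetical sliding-window feature extraction for IE-as-classification.
def window_instances(words, k=5):
    """Yield (features, start_position) for every window of k consecutive words."""
    for start in range(len(words) - k + 1):
        window = words[start:start + k]
        features = {f"w{i}": w.lower() for i, w in enumerate(window)}  # positional features
        yield features, start

words = "John Smith Department of Computer Science University of Example".split()
for feats, pos in window_instances(words, k=3):
    print(pos, feats)
# Each window would then be labeled (e.g. "author name" vs. "other") and fed
# to a standard classifier.
```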
21. Hidden Markov Models
Example: classify short segments of text in terms of whether they correspond to the title, author names, addresses, affiliations, etc.
Slide from Pierre Baldi
22. Hidden Markov Model
- Each state corresponds to one of the fields that we wish to extract
- e.g. paper title, author name, etc.
- The true Markov state sequence is unknown at parse time
- we only see noisy observations from each state
- the sequence of words from the document
- Each state has a characteristic probability distribution over the set of all possible words
- e.g. a specific distribution of words for the state "title"
Slide from Pierre Baldi
23. Training HMMs
- Given a sequence of words and an HMM
- parse the observed sequence into a corresponding set of inferred states
- Viterbi algorithm (see the sketch after this slide)
- Can be trained
- in a supervised manner with manually labeled data
- bootstrapped using a combination of labeled and unlabeled data
Slide from Pierre Baldi
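A compact sketch of the Viterbi decoder follows; the two-state "title"/"author" toy parameters are invented for illustration.

```python
import numpy as np

# Hypothetical Viterbi sketch: recover the most probable hidden state sequence.
def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial probs (S,); A: transitions (S,S); B: emissions (S,V)."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))            # best log-prob of a path ending in each state
    psi = np.zeros((T, S), dtype=int)   # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t][states[-1]]))
    return states[::-1]

# Toy example: states 0 = "title", 1 = "author"; 3 word types.
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
print(viterbi([0, 1, 2, 2], pi, A, B))  # -> [0, 0, 1, 1]
```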
24. Human behavior on the Web
The slides in this section are from Pierre Baldi
25. Web data and measurement issues
- Background
- Important to understand how data is collected
- Web data is collected automatically via software logging tools
- Advantage
- No manual supervision required
- Disadvantage
- Data can be skewed (e.g. due to the presence of robot traffic)
- Important to identify robots (also known as crawlers, spiders)
26. A time-series plot of Web requests
Number of page requests per hour as a function of time, from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.
27. Robot / human identification
- Robot requests are identified by classifying page requests using a variety of heuristics
- e.g. some robots self-identify themselves in the server logs (robots.txt)
- Robots explore the entire website in a breadth-first fashion
- Humans access web pages in a depth-first fashion
- Tan and Kumar (2002) discuss more techniques
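As an illustration (not from the slides), here is a rough sketch of such heuristics applied to one client's requests; the agent-string tags, the robots.txt test, and the thresholds are assumptions.

```python
# Hypothetical robot-detection heuristics over one client's request records.
def looks_like_robot(records, bot_agent_tags=("bot", "crawler", "spider")):
    """records: list of dicts with 'agent' and 'url' keys for one client IP."""
    agents = " ".join(r["agent"].lower() for r in records)
    if any(tag in agents for tag in bot_agent_tags):
        return True                                        # self-identifying robots
    if any(r["url"].endswith("robots.txt") for r in records):
        return True                                        # fetched robots.txt
    urls = [r["url"] for r in records]
    # Breadth-first-like behavior: many requests, almost no repeat visits.
    if len(urls) > 100 and len(set(urls)) / len(urls) > 0.9:
        return True
    return False

print(looks_like_robot([{"agent": "ExampleBot/1.0", "url": "/robots.txt"}]))  # True
```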
28. Robot / human identification
- Robot traffic consists of two components
- periodic spikes (can overload a server)
- requests by bad robots
- a lower-level constant stream of requests
- requests by good robots
- Human traffic has
- a daily pattern: Monday to Friday
- an hourly pattern: peak around midday, low traffic from midnight to early morning
29. Server-side data
- Data logging at Web servers
- The Web server sends requested pages to the requester's browser
- It can be configured to archive these requests in a log file, recording
- URL of the page requested
- Time and date of the request
- IP address of the requester
- Requester's browser information (agent)
30. Data logging at Web servers
- Status of the request
- Referrer page URL, if applicable
- Server-side log files
- provide a wealth of information
- require considerable care in interpretation
- More information in Cooley et al. (1999), Mena (1999) and Shahabi et al. (2001)
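For concreteness, a sketch (assuming the common combined log format; the sample line is invented) of parsing the fields listed above out of one log entry:

```python
import re

# Hypothetical log-line parser for the combined log format: IP, timestamp,
# request, status, size, referrer, and user agent.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('192.0.2.7 - - [05/Apr/2002:14:03:11 -0700] "GET /index.html HTTP/1.0" '
        '200 2326 "http://www.example.edu/" "Mozilla/4.0"')
m = LOG_RE.match(line)
if m:
    print(m.group("ip"), m.group("time"), m.group("status"), m.group("agent"))
```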
31. Page requests, caching, and proxy servers
- In theory, the requester's browser requests a page from a Web server and the request is processed
- In practice, there are
- other users
- browser caching
- dynamic addressing in the local network
- proxy server caching
32. Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be masked at various stages between the user's local computer and the Web server.
33. Identifying individual users from Web server logs
- Useful to associate specific page requests with specific individual users
- IP address is most frequently used
- Disadvantages
- one IP address can belong to several users
- dynamic allocation of IP addresses
- Better to use cookies
- information in the cookie can be accessed by the Web server to identify an individual user over time
- actions by the same user during different sessions can be linked together
34. Identifying individual users from Web server logs
- Commercial websites use cookies extensively
- 90% of users have cookies enabled permanently on their browsers
- However
- there are privacy issues: needs implicit user cooperation
- cookies can be deleted / disabled
- Another option is to enforce user registration
- high reliability
- can discourage potential visitors
35. Client-side data
- Advantages of collecting data at the client side
- direct recording of page requests (eliminates masking due to caching)
- recording of all browser-related actions by a user (including visits to multiple websites)
- more reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
- Preferred mode of data collection for studies of navigation behavior on the Web
- Companies like comScore and Nielsen use client-side software to track home computer users
- Zhu, Greiner and Häubl (2003) used client-side data
36. Client-side data
- Statistics like time per session and page-view duration are more reliable in client-side data
- Some limitations
- statistics like page-view duration still cannot be totally reliable, e.g. the user might go to fetch coffee
- need explicit user cooperation
- typically recorded on home computers, so may not reflect a complete picture of Web browsing behavior
- Web surfing data can also be collected at intermediate points like ISPs and proxy servers
- can be used to create user profiles and target advertising
37. Handling massive Web server logs
- Web server logs can be very large
- a small university department website gets a million requests per month
- Amazon and Google can get tens of millions of requests each day
- Logs exceed main memory capacities and are stored on disk
- time costs of data access place significant constraints on the types of analysis
- In practice
- analysis of a subset of the data
- filtering out events and fields of no direct interest
38. Empirical client-side studies of browsing behavior
- Data for client-side studies are collected at the client side over a period of time
- Reliable page revisitation patterns can be gathered
- Explicit user permission is required
- Typically conducted at universities
- the number of individuals is small
- can introduce bias because of the nature of the population being studied
- caution must be exercised when generalizing observations
- Nevertheless, they provide good data for studying human behavior
39. Early studies from 1995 to 1997
- The earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
- In both studies, data was collected by logging Web browser commands
- The population consisted of faculty, staff and students
- Both studies found
- clicking on hypertext anchors was the most common action
- using the back button was the second most common action
40. Early studies from 1995 to 1997
- High probability of page revisitation (0.58-0.61)
- a lower bound, because page requests prior to the start of the studies are not accounted for
- humans are creatures of habit?
- content of the pages changed over time?
- Strong recency effect (the page that is revisited is usually a page that was visited in the recent past)
- correlates with back button usage
- similar repetitive actions are found in telephone number dialing, etc.
41. The Cockburn and McKenzie study from 2002
- Previous studies are relatively old
- the Web has changed dramatically in the past few years
- Cockburn and McKenzie (2002) provide a more up-to-date analysis
- Analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months
- The population studied consisted of faculty, staff and graduate students
- The study found revisitation rates higher than the '94 and '95 studies (0.81)
- the time window is three times that of the past studies
42. The Cockburn and McKenzie study from 2002
- Is the revisitation rate less biased than in the previous studies?
- Has human behavior changed from an exploratory mode to a utilitarian mode?
- The more pages a user visits, the more requests there are for new pages
- The most frequently requested page for each user can account for a relatively large fraction of his/her page requests
- Useful to see a scatter plot of the number of distinct pages requested per user versus the total pages requested
- A log-log plot is also informative
43. The Cockburn and McKenzie study from 2002
The number of distinct pages visited versus page vocabulary size of each of the 17 users in the Cockburn and McKenzie (2002) study.
44. The Cockburn and McKenzie study from 2002
The number of distinct pages visited versus page vocabulary size of each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot).
45. The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent page divided by the total number of page requests, for the 17 users in the Cockburn and McKenzie (2002) study.
46. Video-based analysis of Web usage
- Byrne et al. (1999) analyzed video-taped recordings of eight different users over a period of 15 minutes to 1 hour
- Audio descriptions by the users were combined with the video recordings of their screens for analysis
- The study found
- users spent a considerable amount of time scrolling Web pages
- users spent a considerable amount of time waiting for pages to load (15% of time)
47. Probabilistic models of browsing behavior
- Useful to build models that describe the browsing behavior of users
- Can generate insight into how we use the Web
- Provide a mechanism for making predictions
- Can help in pre-fetching and personalization
48. Markov models for page prediction
- The general approach is to use a finite-state Markov chain
- Each state can be a specific Web page or a category of Web pages
- If we are only interested in the order of visits (and not in time), each new request can be modeled as a state transition
- Issues
- self-transitions
- time-independence
49. Markov models for page prediction
- For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states
- Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C. s_t is the state at position t (1 <= t <= L). In general,
  P(s) = P(s_1) \prod_{t=2}^{L} P(s_t | s_1, ..., s_{t-1})
- Under a first-order Markov assumption, we have
  P(s) = P(s_1) \prod_{t=2}^{L} P(s_t | s_{t-1})
- This provides a simple generative model to produce sequential data
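To make the generative view concrete, here is a tiny sketch; the three-category state space and the transition probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical first-order Markov chain over page categories A, B, C.
states = ["A", "B", "C"]
start = np.array([0.5, 0.3, 0.2])        # P(s_1)
T = np.array([[0.6, 0.3, 0.1],           # T[i, j] = P(s_t = j | s_{t-1} = i)
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

rng = np.random.default_rng(0)
seq = [rng.choice(3, p=start)]
for _ in range(9):
    seq.append(rng.choice(3, p=T[seq[-1]]))
print("".join(states[i] for i in seq))   # prints a sampled 10-state sequence
```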
50. Markov models for page prediction
- If we denote T_ij = P(s_t = j | s_{t-1} = i), we can define an M x M transition matrix
- Properties
- strong first-order assumption
- simple way to capture sequential dependence
- If each page is a state and there are W pages, this requires O(W^2) parameters; W can be of the order 10^5 to 10^6 for the CS dept. of a university
- To alleviate this, we can cluster the W pages into M clusters, each assigned a state in the Markov model
- Clustering can be done manually, based on the directory structure on the Web server, or automatically using clustering techniques
51. Markov models for page prediction
- T_ij = P(s_t = j | s_{t-1} = i) now represents the probability that an individual user's next request will be from category j, given that they were in category i
- We can add E, an end state, to the model
- E.g. for three categories with an end state
- E denotes the end of a sequence, and the start of a new sequence
52. Markov models for page prediction
- A first-order Markov model assumes that the next state is based only on the current state
- Limitation
- doesn't consider long-term memory
- We can try to capture more memory with a kth-order Markov chain
- Limitation
- requires an inordinate amount of training data: O(M^(k+1)) parameters
53. Fitting Markov models to observed page-request data
- Assume that we have collected data in the form of N sessions from server-side logs, where the ith session s_i, 1 <= i <= N, consists of a sequence of L_i page requests, categorized into M + 1 states and terminating in E. Therefore, the data are D = {s_1, ..., s_N}
- Let \theta denote the set of parameters of the Markov model; it consists of the M^2 - 1 free entries in T
- Let \hat{T}_{ij} denote the estimated probability of transitioning from state i to j
54. Fitting Markov models to observed page-request data
- The likelihood function is
  L(\theta) = P(D | \theta) = \prod_{i=1}^{N} P(s_i | \theta)
- This assumes conditional independence of sessions
- Under the Markov assumption, the likelihood is
  L(\theta) = \prod_{i,j} T_{ij}^{n_{ij}}
- where n_ij is the number of times we see a transition from state i to state j in the observed data D
55. Fitting Markov models to observed page-request data
- For convenience, we use the log-likelihood
  log L(\theta) = \sum_{i,j} n_{ij} \log T_{ij}
- We can maximize this expression by taking partial derivatives with respect to each parameter and incorporating the constraint (via Lagrange multipliers) that the transition probabilities out of any state must sum to one
- The maximum likelihood (ML) solution is
  \hat{T}_{ij} = n_{ij} / \sum_{k} n_{ik}
56. Bayesian parameter estimation for Markov models
- In practice, M is large (10^2-10^3), so we end up estimating M^2 probabilities
- D may contain potentially millions of sequences, yet some n_ij = 0
- A better way is to incorporate prior knowledge: put a prior probability distribution P(\theta) on the parameters and then maximize P(\theta | D), the posterior distribution on \theta given the data (rather than P(D | \theta))
- The prior distribution reflects our prior belief about the parameter set
- The posterior reflects our belief in the parameter set, now informed by the data D
57. Bayesian parameter estimation for Markov models
- For Markov transition matrices, it is common to put a prior distribution on each row of T and to assume that these priors are independent:
  P(T) = \prod_{i=1}^{M} P(T_i), where T_i = (T_{i1}, ..., T_{iM}) is the ith row of T
- Considering the set of parameters for the ith row of T, a useful prior distribution on these parameters is the Dirichlet distribution, defined as
  P(T_i) = Dirichlet(\alpha_{i1}, ..., \alpha_{iM}) = C \prod_{j=1}^{M} T_{ij}^{\alpha_{ij} - 1}
- where \alpha_{ij} > 0, \sum_j T_{ij} = 1, and C is a normalizing constant
58. Bayesian parameter estimation for Markov models
- The MP (mean posterior) parameter estimates are
  \hat{T}_{ij} = (n_{ij} + \alpha_{ij}) / \sum_{k} (n_{ik} + \alpha_{ik})
- If n_ij = 0 for some transition (i, j), then instead of having a parameter estimate of 0 (ML), we get an estimate proportional to \alpha_{ij}, allowing prior knowledge to be incorporated
- If n_ij > 0, we get a smooth combination of the data-driven information (n_ij) and the prior
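A sketch of this smoothed estimate (the scalar pseudo-count alpha and the toy sessions are assumptions): add the Dirichlet pseudo-counts to the observed counts before normalizing.

```python
import numpy as np

# Hypothetical smoothed (mean-posterior) fit: n_ij + alpha_ij pseudo-counts.
def fit_markov_mp(sessions, num_states, alpha=0.5):
    """alpha: scalar pseudo-count, or a (num_states x num_states) array of alpha_ij."""
    counts = (np.full((num_states, num_states), float(alpha))
              if np.isscalar(alpha) else np.array(alpha, dtype=float))
    for s in sessions:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1                      # n_ij + alpha_ij
    return counts / counts.sum(axis=1, keepdims=True)

sessions = [[0, 1, 1, 2, 3], [0, 0, 2, 3], [1, 2, 2, 3]]
print(fit_markov_mp(sessions, num_states=4, alpha=0.5))   # no zero estimates
```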
59. Bayesian parameter estimation for Markov models
- One simple way to set the prior parameters is
- consider alpha as the effective sample size
- partition the states into two sets: set 1 containing all states directly linked to state i, and the remaining states in set 2
- assign uniform probability e/K to all states in set 2 (all set 2 states are equally likely)
- the remaining (1 - e) can be either uniformly assigned among set 1 elements or weighted by some measure
- Prior probabilities into and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state
60. Predicting page requests with Markov models
- Many flavors of Markov models have been proposed for next-page and future-page prediction
- Useful in pre-fetching, caching and personalization of Web pages
- For a typical website the number of pages is large; clustering is useful in this case
- First-order Markov models are found to be inferior to other types of Markov models
- kth-order models are an obvious extension
- Limitation: O(M^(k+1)) parameters (combinatorial explosion)
61. Predicting page requests with Markov models
- Deshpande and Karypis (2001) propose schemes to prune the kth-order Markov state space
- provide systematic but modest improvements
- Another way is to use empirical smoothing techniques that combine different models from order 1 to order k (Chen and Goodman 1996)
- Cadez et al. (2003) and Sen and Hansen (2003) propose mixtures of Markov chains, where we replace the first-order Markov chain
62. Predicting page requests with Markov models
- with a mixture of first-order Markov chains
  P(s_t | s_{t-1}) = \sum_{k=1}^{K} P(s_t | s_{t-1}, c = k) P(c = k)
- where c is a discrete-valued hidden variable taking K values, \sum_k P(c = k) = 1, and
- P(s_t | s_{t-1}, c = k) is the transition matrix for the kth mixture component
- One interpretation of this is that user behavior consists of K different navigation behaviors described by the K Markov chains
- Cadez et al. use this model to cluster sequences of page requests into K groups; parameters are learned using the EM algorithm
63. Predicting page requests with Markov models
- Consider the problem of predicting the next state, given the first t states
- Let s_{1,t} = s_1, ..., s_t denote the sequence of t states
- The predictive distribution for a mixture of K Markov models is
  P(s_{t+1} | s_{1,t}) = \sum_{k=1}^{K} P(s_{t+1} | s_{1,t}, c = k) P(c = k | s_{1,t})
                       = \sum_{k=1}^{K} P(s_{t+1} | s_t, c = k) P(c = k | s_{1,t})
- The last line is obtained if we assume that, conditioned on component c = k, the next state s_{t+1} depends only on s_t
64. Predicting page requests with Markov models
- The membership weight based on the observed history is
  P(c = k | s_{1,t}) \propto P(s_{1,t} | c = k) P(c = k)
- where P(s_{1,t} | c = k) = P(s_1 | c = k) \prod_{\tau=2}^{t} P(s_\tau | s_{\tau-1}, c = k)
- Intuitively, these membership weights evolve as we see more data from the user
- In practice,
- sequences are short
- it is not realistic to assume that the observed data are generated by a mixture of K first-order Markov chains
- Still, the mixture model is a useful approximation
65. Predicting page requests with Markov models
- K can be chosen by evaluating the out-of-sample predictive performance based on
- accuracy of prediction
- log probability score
- entropy
- Other variations of Markov models
- Sen and Hansen (2003)
- position-dependent Markov models (Anderson et al. 2001, 2002)
- Zukerman et al. (1999)
66. Search Engine Querying
- How users issue queries to search engines
- Tracking search query logs
- timestamp, text string, user ID, etc.
- Collecting query datasets from different distributions
- Jansen et al. (1998), Silverstein et al. (1998)
- Lau and Horvitz (1999), Spink et al. (2002)
- Xie and O'Hallaron (2002)
- E.g.
- Xie and O'Hallaron (2002)
- checked how many queries were coming in
- checked users' IP addresses
- reported 111,000 queries (2.7%) originating from AOL
67. Analysis of Search Engine Query Logs
68. Main Results
- The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
- The most common number of terms in a query is 2
- The majority of users don't refine their query
- The number of users who viewed only a single page increased from 29% (1997) to 51% (2001) (Excite)
- 85% of users viewed only the first page of search results (AltaVista)
- 45% of queries (2001) are about Commerce, Travel, Economy, or People (was 20% in 1997)
- Queries about adult or entertainment topics decreased from 20% (1997) to around 7% (2001)
69. Main Results
- Query length distributions (bars) vs. a Poisson model (dots and lines)
- All four studies produced a generally consistent set of findings about user behavior in a search engine context
- most users view relatively few pages per query
- most users don't use advanced search features
70. Advanced Search Tips
- Useful operators for searching (Google)
- "+" includes a stop word (common words): +where is Irvine
- "-" excludes: operating system -Microsoft
- "~" finds synonyms: ~computer
- quotes give a phrase search: "modeling the internet"
- OR matches either A or B: vacation London OR Paris
- "site:" restricts to a domain: admission site:www.uci.edu
71. Power-law Characteristics
Power law in log-log space
- Frequency f(r) of queries with rank r
- 110,000 queries from Vivisimo
- 1.9 million queries from Excite
- There are strong regularities in terms of patterns of behavior in how we search the Web
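As an illustration (the query counts below are synthetic), a sketch of checking such a rank-frequency power law by fitting log f(r) = log C - beta * log r with least squares:

```python
import numpy as np

# Hypothetical power-law fit on synthetic rank-frequency data.
query_counts = np.array([5000, 2300, 1500, 900, 600, 420, 300, 220, 160, 120])
ranks = np.arange(1, len(query_counts) + 1)

slope, intercept = np.polyfit(np.log(ranks), np.log(query_counts), 1)
print(f"estimated power-law exponent: {-slope:.2f}")   # slope in log-log space is -beta
```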
72. Models for Search Strategies
- It is useful to understand the process by which a typical user navigates through the search space when looking for information using a search engine
- Inferences about users' search actions could be used for marketing purposes such as real-time targeted advertising
73. Graphical Representation
- Lau and Horvitz (1999)
- Model of users' search query actions over time
- Simple Bayesian network
- current search action
- time interval
- next search action
- informational goals
- Track the search trajectory of individual users
- Provide more relevant feedback to users