Title: Information Retrieval
1. Information Retrieval
March 25, 2005
2. Course Information
- Instructor: Dragomir R. Radev (radev@si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: M 11-12, Th 12-1, or via email
- Course page: http://tangra.si.umich.edu/radev/650
- Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
3. Text classification
4. Introduction
- Text classification: assigning documents to predefined categories
- Hierarchical vs. flat
- Many techniques: generative (maxent, kNN, Naïve Bayes) vs. discriminative (SVM, regression)
- Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
- Discriminative: model p(y|x) directly
5. Generative models: kNN
- K-nearest neighbors
- Very easy to program
- Issues: choosing k, b?
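Below is a minimal sketch of a kNN text classifier; the scikit-learn pipeline, the toy corpus, and the choice of cosine distance are illustrative assumptions, not part of the original slides.

```python
# Hypothetical kNN text classification sketch (assumes scikit-learn is available).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["wheat prices rose", "corn harvest down", "merger announced today"]
train_labels = ["grain", "grain", "acquisitions"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# k controls how many neighbors vote; cosine distance is a common choice for text.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["wheat and corn exports"])
print(knn.predict(X_test))  # e.g. ['grain']
```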
6. Feature selection: the χ2 test
- For a term t
- Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0)
- P(C=0) = (k00 + k01)/n
- P(C=1) = 1 - P(C=0) = (k10 + k11)/n
- P(I_t=0) = (k00 + k10)/n
- P(I_t=1) = 1 - P(I_t=0) = (k01 + k11)/n
7. Feature selection: the χ2 test
- High values of χ2 indicate lower belief in independence.
- In practice, compute χ2 for all words and pick the top k among them.
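As an illustration (not from the slides), here is a small sketch of the χ2 score for a single term, computed from the 2x2 table of counts defined above.

```python
# Hypothetical chi-square scoring sketch; k[c][i] = number of documents with
# class C=c and term-indicator I_t=i, as in the table above.
def chi_square(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    observed = [[k00, k01], [k10, k11]]
    row = [k00 + k01, k10 + k11]   # class marginals
    col = [k00 + k10, k01 + k11]   # term marginals
    chi2 = 0.0
    for c in range(2):
        for i in range(2):
            expected = row[c] * col[i] / n   # expected count under independence
            chi2 += (observed[c][i] - expected) ** 2 / expected
    return chi2

# Rank all terms by chi_square(...) and keep the top k as features.
print(chi_square(40, 10, 5, 45))   # strongly dependent term -> large score
```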
8. Feature selection: mutual information
- No document length scaling is needed
- Documents are assumed to be generated according
to the multinomial model
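For illustration only, here is a sketch of mutual information between the class and a term; note it uses the simpler binary-indicator estimate rather than the multinomial-model formulation the slide refers to, and the counts are the same hypothetical 2x2 table as for χ2.

```python
import math

# Hypothetical sketch: mutual information between the class variable C and the
# term-indicator I_t, estimated from binary document counts (a common, simpler
# variant; not the multinomial-model formulation mentioned on the slide).
def mutual_information(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    counts = [[k00, k01], [k10, k11]]
    p_c = [(k00 + k01) / n, (k10 + k11) / n]   # class marginals
    p_i = [(k00 + k10) / n, (k01 + k11) / n]   # term marginals
    mi = 0.0
    for c in range(2):
        for i in range(2):
            p_ci = counts[c][i] / n
            if p_ci > 0:
                mi += p_ci * math.log(p_ci / (p_c[c] * p_i[i]))
    return mi

print(mutual_information(40, 10, 5, 45))   # higher value = more informative term
```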
9. Naïve Bayesian classifiers
- Naïve Bayesian classifier
- Assumes statistical independence of the features given the class
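A minimal sketch of a multinomial Naïve Bayes text classifier follows; the scikit-learn calls and the toy spam/ham data are assumptions for illustration.

```python
# Hypothetical multinomial Naive Bayes sketch (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap meds buy now", "meeting agenda attached", "win money now"]
labels = ["spam", "ham", "spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

nb = MultinomialNB(alpha=1.0)   # alpha = Laplace smoothing
nb.fit(X, labels)

print(nb.predict(vectorizer.transform(["cheap money now"])))  # e.g. ['spam']
```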
10. Spam recognition
Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <ig_esq@rediffmail.com>
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU. I AM THE CHAIRMAN, TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION (INEC). I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS (US$20M) TO A SAFE FOREIGN ACCOUNT. THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING. IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
11. Well-known datasets
- 20 newsgroups (/data0/projects/graph/20ng)
- http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
- Reuters-21578 (/data2/corpora/reuters21578)
- Categories: grain, acquisitions, corn, crude, wheat, trade
- WebKB (/data2/corpora/webkb)
- http://www-2.cs.cmu.edu/webkb/
- course, student, faculty, staff, project, dept, other
- NB performance (2000)
- P: 26, 43, 18, 6, 13, 2, 94
- R: 83, 75, 77, 9, 73, 100, 35
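For reference, a loading sketch for one of these benchmarks; scikit-learn ships a downloader for 20 Newsgroups, and the category subset chosen here is an illustrative assumption.

```python
# Hypothetical loading sketch for the 20 Newsgroups benchmark (assumes scikit-learn).
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.autos", "talk.politics.misc"],  # illustrative subset
    remove=("headers", "footers", "quotes"),
)
print(len(train.data), "documents,", len(train.target_names), "categories")
```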
12. Support vector machines
- Introduced by Vapnik in the early 90s.
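A minimal linear-SVM text classification sketch follows; the TF-IDF + LinearSVC pipeline and the toy Reuters-style categories are assumptions for illustration, not the slides' setup.

```python
# Hypothetical linear SVM text classifier sketch (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["oil prices fell sharply", "the company acquired a rival", "wheat exports rose"]
labels = ["crude", "acquisitions", "grain"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["crude oil output"]))  # e.g. ['crude']
```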
13. Semi-supervised learning
- EM
- Co-training
- Graph-based
14. Exploiting Hyperlinks: Co-training
- Each document instance has two alternate views (Blum and Mitchell 1998)
- terms in the document, x1
- terms in the hyperlinks that point to the document, x2
- Each view is sufficient to determine the class of the instance
- The labeling function that classifies examples is the same whether applied to x1 or to x2
- x1 and x2 are conditionally independent, given the class
Slide from Pierre Baldi
15. Co-training Algorithm
- Labeled data are used to infer two Naïve Bayes classifiers, one for each view
- Each classifier will
- examine unlabeled data
- pick the most confidently predicted positive and negative examples
- add these to the labeled examples
- Classifiers are now retrained on the augmented set of labeled examples (see the sketch after this slide)
Slide from Pierre Baldi
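Below is a simplified sketch of that loop; the helper names, the use of MultinomialNB for both views, and the one-example-per-round selection are assumptions, not part of the original algorithm description.

```python
# Hypothetical co-training sketch: two views (document terms x1, anchor-text terms x2);
# each classifier labels the unlabeled example it is most confident about.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=10):
    """X*_lab / X*_unlab: lists of count-vector feature rows; y_lab: labels."""
    X1_lab, X2_lab, y_lab = list(X1_lab), list(X2_lab), list(y_lab)
    unlabeled = list(range(len(X1_unlab)))
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        if not unlabeled:
            break
        clf1.fit(np.array(X1_lab), y_lab)   # view 1: document terms
        clf2.fit(np.array(X2_lab), y_lab)   # view 2: hyperlink (anchor) terms
        for clf, X_un in ((clf1, X1_unlab), (clf2, X2_unlab)):
            if not unlabeled:
                break
            probs = clf.predict_proba(np.array([X_un[i] for i in unlabeled]))
            best = unlabeled[int(np.max(probs, axis=1).argmax())]   # most confident example
            label = clf.predict(np.array([X_un[best]]))[0]
            X1_lab.append(X1_unlab[best]); X2_lab.append(X2_unlab[best]); y_lab.append(label)
            unlabeled.remove(best)
    return clf1, clf2
```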
16. Additional topics
- Soft margins
- VC dimension
- Kernel methods
17. Conclusion
- SVMs are widely considered to be the best method for text classification (see the papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters.
- NB is also good in many circumstances
18. Information extraction
19. Information Extraction
- Automatically extract structured information from unstructured text on Web pages
- Represent the extracted information in some well-defined schema
- E.g.
- crawl the Web searching for information about certain technologies or products of interest
- extract information on authors and books from various online bookstore and publisher pages
Slide from Pierre Baldi
20. Info Extraction as Classification
- Represent each document as a sequence of words
- Use a sliding window of width k as input to a classifier (see the sketch after this slide)
- each of the k inputs is a word in a specific position
- The system is trained on positive and negative examples (typically manually labeled)
- Limitation: no account of sequential constraints
- e.g. the author field usually precedes the address field in the header of a research paper
- can be fixed by using stochastic finite-state models
Slide from Pierre Baldi
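Here is a small sketch of the sliding-window representation; the feature naming (w0, w1, ...) and the toy header text are assumptions for illustration.

```python
# Hypothetical sliding-window feature extraction for IE-as-classification.
def window_instances(words, k=5):
    """Yield (features, start_position) for every window of k consecutive words."""
    for start in range(len(words) - k + 1):
        window = words[start:start + k]
        features = {f"w{i}": w.lower() for i, w in enumerate(window)}  # positional features
        yield features, start

words = "John Smith Department of Computer Science University of Example".split()
for feats, pos in window_instances(words, k=3):
    print(pos, feats)
# Each window would then be labeled (e.g. "author name" vs. "other") and fed
# to a standard classifier.
```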
21. Hidden Markov Models
Example: classify short segments of text in terms of whether they correspond to the title, author names, addresses, affiliations, etc.
Slide from Pierre Baldi
22. Hidden Markov Model
- Each state corresponds to one of the fields that we wish to extract
- e.g. paper title, author name, etc.
- The true Markov state sequence is unknown at parse time
- we only see noisy observations from each state
- the sequence of words from the document
- Each state has a characteristic probability distribution over the set of all possible words
- e.g. a specific distribution of words for the state "title"
Slide from Pierre Baldi
23. Training HMMs
- Given a sequence of words and an HMM
- parse the observed sequence into a corresponding set of inferred states
- Viterbi algorithm (see the sketch after this slide)
- Can be trained
- in a supervised manner with manually labeled data
- bootstrapped using a combination of labeled and unlabeled data
Slide from Pierre Baldi
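A compact sketch of the Viterbi decoder follows; the two-state "title"/"author" toy parameters are invented for illustration.

```python
import numpy as np

# Hypothetical Viterbi sketch: recover the most probable hidden state sequence.
def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: initial probs (S,); A: transitions (S,S); B: emissions (S,V)."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))            # best log-prob of a path ending in each state
    psi = np.zeros((T, S), dtype=int)   # back-pointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t][states[-1]]))
    return states[::-1]

# Toy example: states 0 = "title", 1 = "author"; 3 word types.
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
print(viterbi([0, 1, 2, 2], pi, A, B))  # -> [0, 0, 1, 1]
```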
24. Human behavior on the Web
The slides in this section are from Pierre Baldi
25. Web data and measurement issues
- Background
- Important to understand how data is collected
- Web data is collected automatically via software logging tools
- Advantage
- No manual supervision required
- Disadvantage
- Data can be skewed (e.g. due to the presence of robot traffic)
- Important to identify robots (also known as crawlers, spiders)
26. A time-series plot of Web requests
Number of page requests per hour as a function of time, from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.
27. Robot / human identification
- Robot requests are identified by classifying page requests using a variety of heuristics
- e.g. some robots self-identify themselves in the server logs (robots.txt)
- Robots explore the entire website in a breadth-first fashion
- Humans access web pages in a depth-first fashion
- Tan and Kumar (2002) discuss more techniques
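As an illustration (not from the slides), here is a rough sketch of such heuristics applied to one client's requests; the agent-string tags, the robots.txt test, and the thresholds are assumptions.

```python
# Hypothetical robot-detection heuristics over one client's request records.
def looks_like_robot(records, bot_agent_tags=("bot", "crawler", "spider")):
    """records: list of dicts with 'agent' and 'url' keys for one client IP."""
    agents = " ".join(r["agent"].lower() for r in records)
    if any(tag in agents for tag in bot_agent_tags):
        return True                                        # self-identifying robots
    if any(r["url"].endswith("robots.txt") for r in records):
        return True                                        # fetched robots.txt
    urls = [r["url"] for r in records]
    # Breadth-first-like behavior: many requests, almost no repeat visits.
    if len(urls) > 100 and len(set(urls)) / len(urls) > 0.9:
        return True
    return False

print(looks_like_robot([{"agent": "ExampleBot/1.0", "url": "/robots.txt"}]))  # True
```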
28. Robot / human identification
- Robot traffic consists of two components
- periodic spikes (can overload a server)
- requests by bad robots
- a lower-level constant stream of requests
- requests by good robots
- Human traffic has
- a daily pattern: Monday to Friday
- an hourly pattern: peak around midday, low traffic from midnight to early morning
29. Server-side data
- Data logging at Web servers
- The Web server sends requested pages to the requester's browser
- It can be configured to archive these requests in a log file, recording
- URL of the page requested
- Time and date of the request
- IP address of the requester
- Requester's browser information (agent)
30. Data logging at Web servers
- Status of the request
- Referrer page URL, if applicable
- Server-side log files
- provide a wealth of information
- require considerable care in interpretation
- More information in Cooley et al. (1999), Mena (1999) and Shahabi et al. (2001)
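For concreteness, a sketch (assuming the common combined log format; the sample line is invented) of parsing the fields listed above out of one log entry:

```python
import re

# Hypothetical log-line parser for the combined log format: IP, timestamp,
# request, status, size, referrer, and user agent.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('192.0.2.7 - - [05/Apr/2002:14:03:11 -0700] "GET /index.html HTTP/1.0" '
        '200 2326 "http://www.example.edu/" "Mozilla/4.0"')
m = LOG_RE.match(line)
if m:
    print(m.group("ip"), m.group("time"), m.group("status"), m.group("agent"))
```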
31. Page requests, caching, and proxy servers
- In theory, the requester's browser requests a page from a Web server and the request is processed
- In practice, there are
- other users
- browser caching
- dynamic addressing in the local network
- proxy server caching
32. Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be masked at various stages between the user's local computer and the Web server.
33. Identifying individual users from Web server logs
- Useful to associate specific page requests with specific individual users
- IP address is most frequently used
- Disadvantages
- one IP address can belong to several users
- dynamic allocation of IP addresses
- Better to use cookies
- information in the cookie can be accessed by the Web server to identify an individual user over time
- actions by the same user during different sessions can be linked together
34. Identifying individual users from Web server logs
- Commercial websites use cookies extensively
- 90% of users have cookies enabled permanently on their browsers
- However
- there are privacy issues: needs implicit user cooperation
- cookies can be deleted / disabled
- Another option is to enforce user registration
- high reliability
- can discourage potential visitors
35. Client-side data
- Advantages of collecting data at the client side
- direct recording of page requests (eliminates masking due to caching)
- recording of all browser-related actions by a user (including visits to multiple websites)
- more reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
- Preferred mode of data collection for studies of navigation behavior on the Web
- Companies like comScore and Nielsen use client-side software to track home computer users
- Zhu, Greiner and Häubl (2003) used client-side data
36. Client-side data
- Statistics like time per session and page-view duration are more reliable in client-side data
- Some limitations
- statistics like page-view duration still cannot be totally reliable, e.g. the user might go to fetch coffee
- need explicit user cooperation
- typically recorded on home computers, so may not reflect a complete picture of Web browsing behavior
- Web surfing data can also be collected at intermediate points like ISPs and proxy servers
- can be used to create user profiles and target advertising
37. Handling massive Web server logs
- Web server logs can be very large
- a small university department website gets a million requests per month
- Amazon and Google can get tens of millions of requests each day
- Logs exceed main memory capacities and are stored on disk
- time costs of data access place significant constraints on the types of analysis
- In practice
- analysis of a subset of the data
- filtering out events and fields of no direct interest
38. Empirical client-side studies of browsing behavior
- Data for client-side studies are collected at the client side over a period of time
- Reliable page revisitation patterns can be gathered
- Explicit user permission is required
- Typically conducted at universities
- the number of individuals is small
- can introduce bias because of the nature of the population being studied
- caution must be exercised when generalizing observations
- Nevertheless, they provide good data for studying human behavior
39. Early studies from 1995 to 1997
- The earliest studies on client-side data are Catledge and Pitkow (1995) and Tauscher and Greenberg (1997)
- In both studies, data was collected by logging Web browser commands
- The population consisted of faculty, staff and students
- Both studies found
- clicking on hypertext anchors was the most common action
- using the back button was the second most common action
40. Early studies from 1995 to 1997
- High probability of page revisitation (0.58-0.61)
- a lower bound, because page requests prior to the start of the studies are not accounted for
- humans are creatures of habit?
- content of the pages changed over time?
- Strong recency effect (the page that is revisited is usually a page that was visited in the recent past)
- correlates with back button usage
- similar repetitive actions are found in telephone number dialing, etc.
41. The Cockburn and McKenzie study from 2002
- Previous studies are relatively old
- the Web has changed dramatically in the past few years
- Cockburn and McKenzie (2002) provide a more up-to-date analysis
- Analyzed the daily history.dat files produced by the Netscape browser for 17 users over about 4 months
- The population studied consisted of faculty, staff and graduate students
- The study found revisitation rates higher than the '94 and '95 studies (0.81)
- the time window is three times that of the past studies
42. The Cockburn and McKenzie study from 2002
- Is the revisitation rate less biased than in the previous studies?
- Has human behavior changed from an exploratory mode to a utilitarian mode?
- The more pages a user visits, the more requests there are for new pages
- The most frequently requested page for each user can account for a relatively large fraction of his/her page requests
- Useful to see a scatter plot of the number of distinct pages requested per user versus the total pages requested
- A log-log plot is also informative
43. The Cockburn and McKenzie study from 2002
The number of distinct pages visited versus page vocabulary size of each of the 17 users in the Cockburn and McKenzie (2002) study.
44. The Cockburn and McKenzie study from 2002
The number of distinct pages visited versus page vocabulary size of each of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot).
45. The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent page divided by the total number of page requests, for the 17 users in the Cockburn and McKenzie (2002) study.
46. Video-based analysis of Web usage
- Byrne et al. (1999) analyzed video-taped recordings of eight different users over a period of 15 minutes to 1 hour
- Audio descriptions by the users were combined with the video recordings of their screens for analysis
- The study found
- users spent a considerable amount of time scrolling Web pages
- users spent a considerable amount of time waiting for pages to load (15% of time)
47. Probabilistic models of browsing behavior
- Useful to build models that describe the browsing behavior of users
- Can generate insight into how we use the Web
- Provide a mechanism for making predictions
- Can help in pre-fetching and personalization
48. Markov models for page prediction
- The general approach is to use a finite-state Markov chain
- Each state can be a specific Web page or a category of Web pages
- If we are only interested in the order of visits (and not in time), each new request can be modeled as a state transition
- Issues
- self-transitions
- time-independence
49. Markov models for page prediction
- For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states
- Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C. s_t is the state at position t (1 <= t <= L). In general,
  P(s) = P(s_1) \prod_{t=2}^{L} P(s_t | s_1, ..., s_{t-1})
- Under a first-order Markov assumption, we have
  P(s) = P(s_1) \prod_{t=2}^{L} P(s_t | s_{t-1})
- This provides a simple generative model to produce sequential data
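To make the generative view concrete, here is a tiny sketch; the three-category state space and the transition probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical first-order Markov chain over page categories A, B, C.
states = ["A", "B", "C"]
start = np.array([0.5, 0.3, 0.2])        # P(s_1)
T = np.array([[0.6, 0.3, 0.1],           # T[i, j] = P(s_t = j | s_{t-1} = i)
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

rng = np.random.default_rng(0)
seq = [rng.choice(3, p=start)]
for _ in range(9):
    seq.append(rng.choice(3, p=T[seq[-1]]))
print("".join(states[i] for i in seq))   # prints a sampled 10-state sequence
```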
50. Markov models for page prediction
- If we denote T_ij = P(s_t = j | s_{t-1} = i), we can define an M x M transition matrix
- Properties
- strong first-order assumption
- simple way to capture sequential dependence
- If each page is a state and there are W pages, this requires O(W^2) parameters; W can be of the order 10^5 to 10^6 for the CS dept. of a university
- To alleviate this, we can cluster the W pages into M clusters, each assigned a state in the Markov model
- Clustering can be done manually, based on the directory structure on the Web server, or automatically using clustering techniques
51. Markov models for page prediction
- T_ij = P(s_t = j | s_{t-1} = i) now represents the probability that an individual user's next request will be from category j, given that they were in category i
- We can add E, an end state, to the model
- E.g. for three categories with an end state
- E denotes the end of a sequence, and the start of a new sequence
52. Markov models for page prediction
- A first-order Markov model assumes that the next state is based only on the current state
- Limitation
- doesn't consider long-term memory
- We can try to capture more memory with a kth-order Markov chain
- Limitation
- requires an inordinate amount of training data: O(M^(k+1)) parameters
53. Fitting Markov models to observed page-request data
- Assume that we have collected data in the form of N sessions from server-side logs, where the ith session s_i, 1 <= i <= N, consists of a sequence of L_i page requests, categorized into M + 1 states and terminating in E. Therefore, the data are D = {s_1, ..., s_N}
- Let \theta denote the set of parameters of the Markov model; it consists of the M^2 - 1 free entries in T
- Let \hat{T}_{ij} denote the estimated probability of transitioning from state i to j
54. Fitting Markov models to observed page-request data
- The likelihood function is
  L(\theta) = P(D | \theta) = \prod_{i=1}^{N} P(s_i | \theta)
- This assumes conditional independence of sessions
- Under the Markov assumption, the likelihood is
  L(\theta) = \prod_{i,j} T_{ij}^{n_{ij}}
- where n_ij is the number of times we see a transition from state i to state j in the observed data D
55. Fitting Markov models to observed page-request data
- For convenience, we use the log-likelihood
  log L(\theta) = \sum_{i,j} n_{ij} \log T_{ij}
- We can maximize this expression by taking partial derivatives with respect to each parameter and incorporating the constraint (via Lagrange multipliers) that the transition probabilities out of any state must sum to one
- The maximum likelihood (ML) solution is
  \hat{T}_{ij} = n_{ij} / \sum_{k} n_{ik}
56. Bayesian parameter estimation for Markov models
- In practice, M is large (10^2-10^3), so we end up estimating M^2 probabilities
- D may contain potentially millions of sequences, yet some n_ij = 0
- A better way is to incorporate prior knowledge: put a prior probability distribution P(\theta) on the parameters and then maximize P(\theta | D), the posterior distribution on \theta given the data (rather than P(D | \theta))
- The prior distribution reflects our prior belief about the parameter set
- The posterior reflects our belief in the parameter set, now informed by the data D
57. Bayesian parameter estimation for Markov models
- For Markov transition matrices, it is common to put a prior distribution on each row of T and to assume that these priors are independent:
  P(T) = \prod_{i=1}^{M} P(T_i), where T_i = (T_{i1}, ..., T_{iM}) is the ith row of T
- Considering the set of parameters for the ith row of T, a useful prior distribution on these parameters is the Dirichlet distribution, defined as
  P(T_i) = Dirichlet(\alpha_{i1}, ..., \alpha_{iM}) = C \prod_{j=1}^{M} T_{ij}^{\alpha_{ij} - 1}
- where \alpha_{ij} > 0, \sum_j T_{ij} = 1, and C is a normalizing constant
58. Bayesian parameter estimation for Markov models
- The MP (mean posterior) parameter estimates are
  \hat{T}_{ij} = (n_{ij} + \alpha_{ij}) / \sum_{k} (n_{ik} + \alpha_{ik})
- If n_ij = 0 for some transition (i, j), then instead of having a parameter estimate of 0 (ML), we get an estimate proportional to \alpha_{ij}, allowing prior knowledge to be incorporated
- If n_ij > 0, we get a smooth combination of the data-driven information (n_ij) and the prior
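A sketch of this smoothed estimate (the scalar pseudo-count alpha and the toy sessions are assumptions): add the Dirichlet pseudo-counts to the observed counts before normalizing.

```python
import numpy as np

# Hypothetical smoothed (mean-posterior) fit: n_ij + alpha_ij pseudo-counts.
def fit_markov_mp(sessions, num_states, alpha=0.5):
    """alpha: scalar pseudo-count, or a (num_states x num_states) array of alpha_ij."""
    counts = (np.full((num_states, num_states), float(alpha))
              if np.isscalar(alpha) else np.array(alpha, dtype=float))
    for s in sessions:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1                      # n_ij + alpha_ij
    return counts / counts.sum(axis=1, keepdims=True)

sessions = [[0, 1, 1, 2, 3], [0, 0, 2, 3], [1, 2, 2, 3]]
print(fit_markov_mp(sessions, num_states=4, alpha=0.5))   # no zero estimates
```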
59. Bayesian parameter estimation for Markov models
- One simple way to set the prior parameters is
- consider alpha as the effective sample size
- partition the states into two sets: set 1 containing all states directly linked to state i, and the remaining states in set 2
- assign uniform probability e/K to all states in set 2 (all set 2 states are equally likely)
- the remaining (1 - e) can be either uniformly assigned among set 1 elements or weighted by some measure
- Prior probabilities into and out of E can be set based on our prior knowledge of how likely we think a user is to exit the site from a particular state
60. Predicting page requests with Markov models
- Many flavors of Markov models have been proposed for next-page and future-page prediction
- Useful in pre-fetching, caching and personalization of Web pages
- For a typical website the number of pages is large; clustering is useful in this case
- First-order Markov models are found to be inferior to other types of Markov models
- kth-order models are an obvious extension
- Limitation: O(M^(k+1)) parameters (combinatorial explosion)
61. Predicting page requests with Markov models
- Deshpande and Karypis (2001) propose schemes to prune the kth-order Markov state space
- provide systematic but modest improvements
- Another way is to use empirical smoothing techniques that combine different models from order 1 to order k (Chen and Goodman 1996)
- Cadez et al. (2003) and Sen and Hansen (2003) propose mixtures of Markov chains, where we replace the first-order Markov chain
62. Predicting page requests with Markov models
- with a mixture of first-order Markov chains
  P(s_t | s_{t-1}) = \sum_{k=1}^{K} P(s_t | s_{t-1}, c = k) P(c = k)
- where c is a discrete-valued hidden variable taking K values, \sum_k P(c = k) = 1, and
- P(s_t | s_{t-1}, c = k) is the transition matrix for the kth mixture component
- One interpretation of this is that user behavior consists of K different navigation behaviors described by the K Markov chains
- Cadez et al. use this model to cluster sequences of page requests into K groups; parameters are learned using the EM algorithm
63. Predicting page requests with Markov models
- Consider the problem of predicting the next state, given the first t states
- Let s_{1,t} = s_1, ..., s_t denote the sequence of t states
- The predictive distribution for a mixture of K Markov models is
  P(s_{t+1} | s_{1,t}) = \sum_{k=1}^{K} P(s_{t+1} | s_{1,t}, c = k) P(c = k | s_{1,t})
                       = \sum_{k=1}^{K} P(s_{t+1} | s_t, c = k) P(c = k | s_{1,t})
- The last line is obtained if we assume that, conditioned on component c = k, the next state s_{t+1} depends only on s_t
64. Predicting page requests with Markov models
- The membership weight based on the observed history is
  P(c = k | s_{1,t}) \propto P(s_{1,t} | c = k) P(c = k)
- where P(s_{1,t} | c = k) = P(s_1 | c = k) \prod_{\tau=2}^{t} P(s_\tau | s_{\tau-1}, c = k)
- Intuitively, these membership weights evolve as we see more data from the user
- In practice,
- sequences are short
- it is not realistic to assume that the observed data are generated by a mixture of K first-order Markov chains
- Still, the mixture model is a useful approximation
65. Predicting page requests with Markov models
- K can be chosen by evaluating the out-of-sample predictive performance based on
- accuracy of prediction
- log probability score
- entropy
- Other variations of Markov models
- Sen and Hansen (2003)
- position-dependent Markov models (Anderson et al. 2001, 2002)
- Zukerman et al. (1999)
66. Search Engine Querying
- How users issue queries to search engines
- Tracking search query logs
- timestamp, text string, user ID, etc.
- Collecting query datasets from different distributions
- Jansen et al. (1998), Silverstein et al. (1998)
- Lau and Horvitz (1999), Spink et al. (2002)
- Xie and O'Hallaron (2002)
- E.g.
- Xie and O'Hallaron (2002)
- checked how many queries were coming in
- checked users' IP addresses
- reported 111,000 queries (2.7%) originating from AOL
67. Analysis of Search Engine Query Logs
68. Main Results
- The average number of terms in a query ranges from a low of 2.2 to a high of 2.6
- The most common number of terms in a query is 2
- The majority of users don't refine their query
- The number of users who viewed only a single page increased from 29% (1997) to 51% (2001) (Excite)
- 85% of users viewed only the first page of search results (AltaVista)
- 45% of queries (2001) are about Commerce, Travel, Economy, or People (was 20% in 1997)
- Queries about adult or entertainment topics decreased from 20% (1997) to around 7% (2001)
69. Main Results
- Query length distributions (bars) vs. a Poisson model (dots and lines)
- All four studies produced a generally consistent set of findings about user behavior in a search engine context
- most users view relatively few pages per query
- most users don't use advanced search features
70. Advanced Search Tips
- Useful operators for searching (Google)
- "+" includes a stop word (common words): +where is Irvine
- "-" excludes: operating system -Microsoft
- "~" finds synonyms: ~computer
- quotes give a phrase search: "modeling the internet"
- OR matches either A or B: vacation London OR Paris
- "site:" restricts to a domain: admission site:www.uci.edu
71. Power-law Characteristics
Power law in log-log space
- Frequency f(r) of queries with rank r
- 110,000 queries from Vivisimo
- 1.9 million queries from Excite
- There are strong regularities in terms of patterns of behavior in how we search the Web
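As an illustration (the query counts below are synthetic), a sketch of checking such a rank-frequency power law by fitting log f(r) = log C - beta * log r with least squares:

```python
import numpy as np

# Hypothetical power-law fit on synthetic rank-frequency data.
query_counts = np.array([5000, 2300, 1500, 900, 600, 420, 300, 220, 160, 120])
ranks = np.arange(1, len(query_counts) + 1)

slope, intercept = np.polyfit(np.log(ranks), np.log(query_counts), 1)
print(f"estimated power-law exponent: {-slope:.2f}")   # slope in log-log space is -beta
```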
72. Models for Search Strategies
- It is useful to understand the process by which a typical user navigates through the search space when looking for information using a search engine
- Inferences about users' search actions could be used for marketing purposes such as real-time targeted advertising
73. Graphical Representation
- Lau and Horvitz (1999)
- Model of users' search query actions over time
- Simple Bayesian network
- current search action
- time interval
- next search action
- informational goals
- Track the search trajectory of individual users
- Provide more relevant feedback to users