1
Information Retrieval
March 25, 2005
  • Handout 11

2
Course Information
  • Instructor: Dragomir R. Radev (radev@si.umich.edu)
  • Office: 3080, West Hall Connector
  • Phone: (734) 615-5225
  • Office hours: M 11-12, Th 12-1, or via email
  • Course page: http://tangra.si.umich.edu/radev/650/
  • Class meets on Fridays, 2:10-4:55 PM in 409 West
    Hall

3
Text classification
4
Introduction
  • Text classification: assigning documents to
    predefined categories
  • Hierarchical vs. flat
  • Many techniques: generative (maxent, kNN, Naïve
    Bayes) vs. discriminative (SVM, regression)
  • Generative: model the joint prob. p(x,y) and use
    Bayesian prediction to compute p(y|x)
  • Discriminative: model p(y|x) directly

5
Generative models: kNN
  • K-nearest neighbors
  • Very easy to program
  • Issues: choosing k, b?
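A minimal sketch of kNN text classification over tf-idf vectors (illustrative only; the toy data, the choice of k, and the cosine metric are assumptions, not from the slides):

    # Hedged sketch: kNN text classifier with scikit-learn (assumed library choice)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    train_docs = ["grain prices rose sharply", "crude oil exports fell"]   # toy data
    train_labels = ["grain", "crude"]

    vec = TfidfVectorizer()
    X = vec.fit_transform(train_docs)

    # k and the distance metric are the main knobs; both affect accuracy
    knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
    knn.fit(X, train_labels)

    # should print ['crude'] on this toy data (nearest neighbor shares two terms)
    print(knn.predict(vec.transform(["crude oil prices"])))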

6
Feature selection: the χ² test
  • For a term t
  • Testing for independence: P(C=0, It=0) should be
    equal to P(C=0) P(It=0)
  • P(C=0) = (k00 + k01)/n
  • P(C=1) = 1 − P(C=0) = (k10 + k11)/n
  • P(It=0) = (k00 + k10)/n
  • P(It=1) = 1 − P(It=0) = (k01 + k11)/n

7
Feature selection: the χ² test
  • High values of χ² indicate lower belief in
    independence.
  • In practice, compute ?2 for all words and pick
    the top k among them.
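For concreteness, a small sketch of the χ² score for one term from the 2x2 counts above, with kij = number of documents having class C = i and term indicator It = j (the helper name and toy counts are made up):

    # Hedged sketch: chi-square score for a single term from a 2x2 contingency table
    def chi_square(k00, k01, k10, k11):
        n = k00 + k01 + k10 + k11
        chi2 = 0.0
        for kij, row, col in [(k00, k00 + k01, k00 + k10),
                              (k01, k00 + k01, k01 + k11),
                              (k10, k10 + k11, k00 + k10),
                              (k11, k10 + k11, k01 + k11)]:
            expected = row * col / n          # expected count under independence
            chi2 += (kij - expected) ** 2 / expected
        return chi2

    print(chi_square(40, 10, 5, 45))          # large value -> term and class look dependent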

8
Feature selection: mutual information
  • No document length scaling is needed
  • Documents are assumed to be generated according
    to the multinomial model
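The mutual-information score itself does not survive in this transcript; a standard form used for this kind of feature selection (an assumed reconstruction, not copied from the deck) is

    I(I_t; C) = \sum_{i \in \{0,1\}} \sum_{c \in \{0,1\}}
                P(I_t = i, C = c) \log \frac{P(I_t = i, C = c)}{P(I_t = i)\, P(C = c)}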

9
Naïve Bayesian classifiers
  • Naïve Bayesian classifier
  • Assuming statistical independence
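A minimal multinomial Naïve Bayes sketch in the same spirit; the independence assumption shows up as the sum over per-word log-probabilities (data and the add-one smoothing choice are illustrative assumptions):

    # Hedged sketch: multinomial Naive Bayes with add-one (Laplace) smoothing
    import math
    from collections import Counter

    def train_nb(docs, labels):
        classes = set(labels)
        prior = {c: labels.count(c) / len(labels) for c in classes}
        word_counts = {c: Counter() for c in classes}
        for doc, c in zip(docs, labels):
            word_counts[c].update(doc.split())
        vocab = {w for counts in word_counts.values() for w in counts}
        return prior, word_counts, vocab

    def predict_nb(doc, prior, word_counts, vocab):
        scores = {}
        for c in prior:
            total = sum(word_counts[c].values())
            score = math.log(prior[c])
            for w in doc.split():
                if w in vocab:
                    # independence assumption: add the per-word log-probabilities
                    score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            scores[c] = score
        return max(scores, key=scores.get)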

10
Spam recognition
Return-Path: <ig_esq@rediffmail.com> X-Sieve: CMU
Sieve 2.2 From: "Ibrahim Galadima"
<ig_esq@rediffmail.com> Reply-To:
galadima_esq@netpiper.com To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject:
Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS
LETTER MAY COME TO YOU AS A SURPRISE SINCE I
HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE
CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL
ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN
THE COURSE OF MY SEARCH FOR A RELIABLE PERSON
WITH WHOM TO HANDLE A VERY CONFIDENTIAL
TRANSACTION INVOLVING THE TRANSFER OF FUND VALUED
AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED
STATES DOLLARS US20M TO A SAFE FOREIGN
ACCOUNT THE ABOVE FUND IN QUESTION IS NOT
CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT
IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN
1999 BY INEC TO A
11
Well-known datasets
  • 20 newsgroups (/data0/projects/graph/20ng)
  • http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
  • Reuters-21578 (/data2/corpora/reuters21578)
  • Cats: grain, acquisitions, corn, crude, wheat,
    trade
  • WebKB (/data2/corpora/webkb)
  • http://www-2.cs.cmu.edu/webkb/
  • course, student, faculty, staff, project, dept,
    other
  • NB performance (2000)
  • P: 26, 43, 18, 6, 13, 2, 94
  • R: 83, 75, 77, 9, 73, 100, 35

12
Support vector machines
  • Introduced by Vapnik in the early 90s.
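A hedged scikit-learn sketch of the usual linear-SVM setup for text classification (the toy documents and the LinearSVC parameters are assumptions, not from the lecture):

    # Hedged sketch: linear SVM over bag-of-words / tf-idf features
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs = ["wheat shipment delayed", "bank acquires rival", "corn harvest up"]
    labels = ["grain", "acquisitions", "grain"]

    clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
    clf.fit(docs, labels)
    print(clf.predict(["rival bank merger"]))   # predicted category for a new document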

13
Semi-supervised learning
  • EM
  • Co-training
  • Graph-based

14
Exploiting Hyperlinks Co-training
  • Each document instance has two alternate views
    (Blum and Mitchell 1998)
  • terms in the document, x1
  • terms in the hyperlinks that point to the
    document, x2
  • Each view is sufficient to determine the class of
    the instance
  • Labeling function that classifies examples is
    the same applied to x1 or x2
  • x1 and x2 are conditionally independent, given
    the class

Slide from Pierre Baldi
15
Co-training Algorithm
  • Labeled data are used to infer two Naïve Bayes
    classifiers, one for each view
  • Each classifier will
  • examine unlabeled data
  • pick the most confidently predicted positive and
    negative examples
  • add these to the labeled examples
  • Classifiers are now retrained on the augmented
    set of labeled examples

Slide from Pierre Baldi
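The loop on the previous slide can be sketched roughly as follows; this is a simplified illustration (it adds only the single most confident example per view per round, whereas Blum and Mitchell add the top positives and negatives), and the inputs are assumed to be count vectors:

    # Hedged sketch of co-training with one Naive Bayes classifier per view
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=5):
        X1_lab, X2_lab, y_lab = list(X1_lab), list(X2_lab), list(y_lab)
        unlab = list(range(len(X1_unlab)))
        c1 = c2 = None
        for _ in range(rounds):
            c1 = MultinomialNB().fit(np.array(X1_lab), y_lab)   # view 1: document terms
            c2 = MultinomialNB().fit(np.array(X2_lab), y_lab)   # view 2: anchor-text terms
            for clf, X_un in [(c1, X1_unlab), (c2, X2_unlab)]:
                if not unlab:
                    break
                # pick the unlabeled example this view's classifier is most confident about
                probs = clf.predict_proba(np.array([X_un[i] for i in unlab]))
                pick = unlab[int(probs.max(axis=1).argmax())]
                label = clf.predict(np.array([X_un[pick]]))[0]
                # add it to the labeled pool with the predicted label, then retrain
                X1_lab.append(X1_unlab[pick]); X2_lab.append(X2_unlab[pick]); y_lab.append(label)
                unlab.remove(pick)
        return c1, c2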
16
Additional topics
  • Soft margins
  • VC dimension
  • Kernel methods

17
Conclusion
  • SVMs are widely considered to be the best method
    for text classification (look at papers by
    Sebastiani, Cristianini, Joachims), e.g. 86%
    accuracy on Reuters.
  • NB also good in many circumstances

18
Information extraction
19
Information Extraction
  • Automatically extract unstructured text data from
    Web pages
  • Represent extracted information in some
    well-defined schema
  • E.g.
  • crawl the Web searching for information about
    certain technologies or products of interest
  • extract information on authors and books from
    various online bookstore and publisher pages

Slide from Pierre Baldi
20
Info Extraction as Classification
  • Represent each document as a sequence of words
  • Use a sliding window of width k as input to a
    classifier
  • each of the k inputs is a word in a specific
    position
  • The system is trained on positive and negative
    examples (typically manually labeled)
  • Limitation: no account of sequential constraints
  • e.g. the author field usually precedes the
    address field in the header of a research paper
  • can be fixed by using stochastic finite-state
    models

Slide from Pierre Baldi
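A rough illustration of the sliding-window representation described above (the window width and the position-keyed feature encoding are arbitrary choices for this sketch):

    # Hedged sketch: turn a token sequence into width-k windows for a window classifier
    def windows(tokens, k=3):
        # each instance is the k tokens in the window, keyed by position
        return [{f"w{i}": tokens[j + i] for i in range(k)}
                for j in range(len(tokens) - k + 1)]

    tokens = "John Smith University of California Irvine CA".split()
    for w in windows(tokens):
        print(w)   # e.g. {'w0': 'John', 'w1': 'Smith', 'w2': 'University'}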
21
Hidden Markov Models
Example: classify short segments of text in terms
of whether they correspond to the title, author
names, addresses, affiliations, etc.
Slide from Pierre Baldi
22
Hidden Markov Model
  • Each state corresponds to one of the fields that
    we wish to extract
  • e.g. paper title, author name, etc.
  • True Markov state diagram is unknown at
    parse-time
  • can see noisy observations from each state
  • the sequence of words from the document
  • Each state has a characteristic probability
    distribution over the set of all possible words
  • e.g. specific distribution of words from the
    state title

Slide from Pierre Baldi
23
Training HMM
  • Given a sequence of words and an HMM
  • parse the observed sequence into a corresponding
    set of inferred states
  • Viterbi algorithm
  • Can be trained
  • in a supervised manner with manually labeled data
  • bootstrapped using a combination of labeled and
    unlabeled data

Slide from Pierre Baldi
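A compact Viterbi sketch for recovering the most likely field sequence (title, author, etc.) from word observations; the dictionary-based model representation and the 1e-9 unseen-word floor are assumptions of this sketch, and start/transition probabilities are assumed nonzero:

    # Hedged sketch: Viterbi decoding of field labels for a word sequence
    import math

    def viterbi(words, states, start_p, trans_p, emit_p):
        # V[t][s] = best log-probability of a state path ending in state s after word t
        V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(words[0], 1e-9))
              for s in states}]
        back = [{}]
        for t in range(1, len(words)):
            V.append({}); back.append({})
            for s in states:
                prev, score = max(
                    ((r, V[t - 1][r] + math.log(trans_p[r][s])
                          + math.log(emit_p[s].get(words[t], 1e-9)))
                     for r in states),
                    key=lambda x: x[1])
                V[t][s], back[t][s] = score, prev
        # trace back from the best final state to recover the full label sequence
        last = max(V[-1], key=V[-1].get)
        path = [last]
        for t in range(len(words) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))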
24
Human behavior on the Web
The slides in this section are from Pierre Baldi
25
Web data and measurement issues
  • Background
  • Important to understand how data is collected
  • Web data is collected automatically via software
    logging tools
  • Advantage
  • No manual supervision required
  • Disadvantage
  • Data can be skewed (e.g. due to the presence of
    robot traffic)
  • Important to identify robots (also known as
    crawlers, spiders)

26
A time-series plot of Web requests
Number of page requests per hour as a function of
time from page requests in the www.ics.uci.edu
Web server logs during the first week of April
2002.
27
Robot / human identification
  • Robot requests are identified by classifying page
    requests using a variety of heuristics
  • e.g. some robots self-identify themselves in the
    server logs (robots.txt)
  • Robots explore the entire website in breadth-first
    fashion
  • Humans access web pages in depth-first fashion
  • Tan and Kumar (2002) discuss more techniques

28
Robot / human identification
  • Robot traffic consists of two components
  • Periodic Spikes (can overload a server)
  • Requests by bad robots
  • Lower-level constant stream of requests
  • Requests by good robots
  • Human traffic has
  • Daily pattern: Monday to Friday
  • Hourly pattern: peak around midday, low traffic
    from midnight to early morning

29
Server-side data
  • Data logging at Web servers
  • Web server sends requested pages to the
    requester's browser
  • It can be configured to archive these requests in
    a log file recording
  • URL of the page requested
  • Time and date of the request
  • IP address of the requester
  • Requester browser information (agent)

30
Data logging at Web servers
  • Status of the request
  • Referrer page URL if applicable
  • Server-side log files
  • provide a wealth of information
  • require considerable care in interpretation
  • More information in Cooley et al. (1999), Mena
    (1999) and Shahabi et al. (2001)
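For concreteness, a hedged sketch of pulling those fields out of one combined-format log line (the regex and the sample line are illustrative; real server configurations vary):

    # Hedged sketch: parse one Apache-style (Combined Log Format) request line
    import re

    line = ('128.195.1.1 - - [05/Apr/2002:10:15:30 -0800] "GET /index.html HTTP/1.0" '
            '200 2048 "http://www.ics.uci.edu/" "Mozilla/4.0"')

    pattern = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

    m = pattern.match(line)
    if m:
        # requester IP, timestamp, requested URL, and status of the request
        print(m.group("ip"), m.group("time"), m.group("request"), m.group("status"))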

31
Page requests, caching, and proxy servers
  • In theory, the requester's browser requests a page
    from a Web server and the request is processed
  • In practice, there are
  • Other users
  • Browser caching
  • Dynamic addressing in local network
  • Proxy Server caching

32
Page requests, caching, and proxy servers
A graphical summary of how page requests from an
individual user can be masked at various stages
between the user's local computer and the Web
server.
33
Identifying individual users from Web server logs
  • Useful to associate specific page requests to
    specific individual users
  • IP address is most frequently used
  • Disadvantages
  • One IP address can belong to several users
  • Dynamic allocation of IP address
  • Better to use cookies
  • Information in the cookie can be accessed by the
    Web server to identify an individual user over
    time
  • Actions by the same user during different
    sessions can be linked together

34
Identifying individual users from Web server logs
  • Commercial websites use cookies extensively
  • 90% of users have cookies enabled permanently on
    their browsers
  • However
  • There are privacy issues: need implicit user
    cooperation
  • Cookies can be deleted / disabled
  • Another option is to enforce user registration
  • High reliability
  • Can discourage potential visitors

35
Client-side data
  • Advantages of collecting data at the client side
  • Direct recording of page requests (eliminates
    masking due to caching)
  • Recording of all browser-related actions by a
    user (including visits to multiple websites)
  • More-reliable identification of individual users
    (e.g. by login ID for multiple users on a single
    computer)
  • Preferred mode of data collection for studies of
    navigation behavior on the Web
  • Companies like comScore and Nielsen use
    client-side software to track home computer users
  • Zhu, Greiner and Häubl (2003) used client-side
    data

36
Client-side data
  • Statistics like Time per session and Page-view
    duration are more reliable in client-side data
  • Some limitations
  • Still, some statistics like Page-view duration
    cannot be totally reliable, e.g. the user might go
    to fetch coffee
  • Need explicit user cooperation
  • Typically recorded on home computers, so may not
    reflect a complete picture of Web browsing
    behavior
  • Web surfing data can be collected at intermediate
    points like ISPs, proxy servers
  • Can be used to create user profiles and targeted
    advertising

37
Handling massive Web server logs
  • Web server logs can be very large
  • Small university department website gets a
    million requests per month
  • Amazon, Google can get tens of millions of
    requests each day
  • These exceed main-memory capacities and are stored
    on disk
  • Time costs of data access place significant
    constraints on the types of analysis
  • In practice
  • Analysis of subset of data
  • Filtering out events and fields of no direct
    interest

38
Empirical client-side studies of browsing behavior
  • Data for client-side studies are collected at the
    client-side over a period of time
  • Reliable page revisitation patterns can be
    gathered
  • Explicit user permission is required
  • Typically conducted at universities
  • Number of individuals is small
  • Can introduce bias because of the nature of the
    population being studied
  • Caution must be exercised when generalizing
    observations
  • Nevertheless, provide good data for studying
    human behavior

39
Early studies from 1995 to 1997
  • Earliest studies on client-side data are Catledge
    and Pitkow (1995) and Tauscher and Greenberg
    (1997)
  • In both studies, data was collected by logging
    Web browser commands
  • Population consisted of faculty, staff and
    students
  • Both studies found
  • clicking on hypertext anchors was the most common
    action
  • using the back button was the second most common
    action

40
Early studies from 1995 to 1997
  • high probability of page revisitation
    (0.58-0.61)
  • Lower bound, because page requests prior to the
    start of the studies are not accounted for
  • Humans are creatures of habit?
  • Content of the pages changed over time?
  • strong recency effect (a revisited page is usually
    one that was visited in the recent past)
  • Correlates with back-button usage
  • Similar repetitive actions are found in telephone
    number dialing, etc.

41
The Cockburn and McKenzie study from 2002
  • Previous studies are relatively old
  • Web has changed dramatically in the past few
    years
  • Cockburn and McKenzie (2002) provides a more
    up-to-date analysis
  • Analyzed the daily history.dat files produced by
    the Netscape browser for 17 users for about 4
    months
  • Population studied consisted of faculty, staff
    and graduate students
  • Study found revisitation rates higher than the
    '94 and '95 studies (0.81)
  • Time-window is three times that of past studies

42
The Cockburn and McKenzie study from 2002
  • Revisitation rate less biased than the previous
    studies?
  • Human behavior changed from an exploratory mode
    to a utilitarian mode?
  • The more pages a user visits, the more requests
    there are for new pages
  • The most frequently requested page for each user
    can account for a relatively large fraction of
    his/her page requests
  • Useful to see the scatter plot of the distinct
    number of pages requested per user versus the
    total pages requested
  • Log-log plot also informative

43
The Cockburn and McKenzie study from 2002
The number of distinct pages visited versus page
vocabulary size of each of the 17 users in the
Cockburn and McKenzie (2002) study
44
The Cockburn and McKenzie study from 2002
The number of distinct pages visited versus page
vocabulary size of each of the 17 users in the
Cockburn and McKenzie (2002) study (log-log plot)
45
The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page
requests for the most frequent page divided by
the total number of page requests, for 17 users
in the Cockburn and McKenzie (2002) study
46
Video-based analysis of Web usage
  • Byrne et al. (1999) analyzed video-taped
    recordings of eight different users over a period
    of 15 min to 1 hour
  • Audio descriptions by the users were combined with
    video recordings of their screens for analysis
  • Study found
  • users spent a considerable amount of time
    scrolling Web pages
  • users spent a considerable amount of time waiting
    for pages to load (15% of time)

47
Probabilistic models of browsing behavior
  • Useful to build models that describe the browsing
    behavior of users
  • Can generate insight into how we use the Web
  • Provide mechanism for making predictions
  • Can help in pre-fetching and personalization

48
Markov models for page prediction
  • General approach is to use a finite-state Markov
    chain
  • Each state can be a specific Web page or a
    category of Web pages
  • If only interested in the order of visits (and
    not in time), each new request can be modeled as
    a transition of states
  • Issues
  • Self-transition
  • Time-independence

49
Markov models for page prediction
  • For simplicity, consider an order-dependent,
    time-independent finite-state Markov chain with M
    states
  • Let s be a sequence of observed states of length
    L, e.g. s = ABBCAABBCCBBAA with three states A, B
    and C. st is the state at position t (1 ≤ t ≤ L).
    In general,
  • Under a first-order Markov assumption, we have
  • This provides a simple generative model for
    producing sequential data
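The two formulas the bullets above point to do not survive in this transcript; a reconstruction is, in general,

    P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_1, \ldots, s_{t-1})

and, under the first-order Markov assumption,

    P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1})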

50
Markov models for page prediction
  • If we denote Tij = P(st = j | st-1 = i), we can
    define an M x M transition matrix
  • Properties
  • Strong first-order assumption
  • Simple way to capture sequential dependence
  • If each page is a state and there are W pages, the
    matrix is O(W^2); W can be of the order 10^5 to
    10^6 for the CS dept. of a university
  • To alleviate this, we can cluster the W pages into
    M clusters, each assigned a state in the Markov
    model
  • Clustering can be done manually, based on the
    directory structure on the Web server, or
    automatically using clustering techniques
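A small sketch of building such a transition matrix from observed state sequences; the count-and-normalize estimate anticipates the ML solution derived a few slides later, and the state names are toy examples:

    # Hedged sketch: estimate a Markov transition matrix from page-category sequences
    from collections import defaultdict

    def transition_matrix(sequences):
        counts = defaultdict(lambda: defaultdict(int))   # counts[i][j] = n_ij
        for seq in sequences:
            for i, j in zip(seq, seq[1:]):
                counts[i][j] += 1
        # normalize each row so transition probabilities out of a state sum to one
        return {i: {j: c / sum(row.values()) for j, c in row.items()}
                for i, row in counts.items()}

    print(transition_matrix(["ABBCAABBCCBBAA", "ABCA"]))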

51
Markov models for page prediction
  • Tij = P(st = j | st-1 = i) now represents the
    probability that an individual user's next
    request will be from category j, given that they
    were in category i
  • We can add E, an end state, to the model
  • E.g. for three categories with end state:
  • E denotes the end of a sequence and the start of a
    new sequence

52
Markov models for page prediction
  • First-order Markov model assumes that the next
    state is based only on the current state
  • Limitations
  • Doesn't consider long-term memory
  • We can try to capture more memory with a kth-order
    Markov chain
  • Limitations
  • Inordinate amount of training data: O(M^(k+1))
    parameters

53
Fitting Markov models to observed page-request
data
  • Assume that we have collected data in the form of
    N sessions from server-side logs, where the ith
    session si, 1 ≤ i ≤ N, consists of a sequence of
    Li page requests, categorized into M + 1 states
    and terminating in E. Therefore, the data is
    D = {s1, ..., sN}
  • Let Θ denote the set of parameters of the Markov
    model; it consists of M^2 - 1 entries in T
  • Let T̂ij denote the estimated probability of
    transitioning from state i to j.

54
Fitting Markov models to observed page-request
data
  • The likelihood function would be
  • This assumes conditional independence of
    sessions.
  • Under Markov assumptions, likelihood is
  • where nij is the number of times we see a
    transition from state i to state j in the
    observed data D.
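The likelihood expression itself is missing from the transcript; a reconstruction under the stated assumptions (independent sessions, first-order Markov model) is

    L(\Theta) = P(D \mid \Theta) = \prod_{i=1}^{N} P(s_i \mid \Theta)
              = \prod_{i,j} T_{ij}^{\,n_{ij}}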

55
Fitting Markov models to observed page-request
data
  • For convenience, we use log-likelihood
  • We can maximize the expression by taking partial
    derivatives wrt each parameter and incorporating
    the constraint (via Lagrange multipliers) that
    the sum of transition probabilities out of any
    state must sum to one
  • The maximum likelihood (ML) solution is
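The corresponding log-likelihood and its maximizer, reconstructed since the slide equations are missing (this is the standard result of the constrained maximization described above):

    \log L(\Theta) = \sum_{i,j} n_{ij} \log T_{ij},
    \qquad
    \hat{T}_{ij} = \frac{n_{ij}}{\sum_{j'} n_{ij'}}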

56
Bayesian parameter estimation for Markov models
  • In practice, M is large (10^2-10^3), so we end up
    estimating M^2 probabilities
  • D may contain potentially millions of sequences,
    and still some nij = 0
  • A better way would be to incorporate prior
    knowledge as a prior probability distribution P(Θ)
    and then maximize P(Θ | D), the posterior
    distribution on Θ given the data (rather than the
    likelihood alone)
  • The prior distribution reflects our prior belief
    about the parameter set
  • The posterior reflects our posterior belief in
    the parameter set, now informed by the data D

57
Bayesian parameter estimation for Markov models
  • For Markov transition matrices, it is common to
    put a distribution on each row of T and assume
    that these priors are independent
  • where
  • Considering the set of parameters for the ith row
    of T, a useful prior distribution on these
    parameters is the Dirichlet distribution, defined
    as
  • where , and C is a
    normalizing constant

58
Bayesian parameter estimation for Markov models
  • The MP posterior parameter estimates are
  • If nij = 0 for some transition (i, j), then
    instead of having a parameter estimate of 0 (as
    with ML), we will have a nonzero estimate, allowing
    prior knowledge to be incorporated
  • If nij > 0, we get a smooth combination of the
    data-driven information (nij) and the prior
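With a Dirichlet(α_i1, ..., α_iM) prior on row i of T, the mean-posterior (MP) estimate referred to above takes the standard form (a reconstruction, not copied from the slide):

    \hat{T}_{ij} = \frac{n_{ij} + \alpha_{ij}}{\sum_{j'} \left( n_{ij'} + \alpha_{ij'} \right)}

so a transition with n_ij = 0 still receives nonzero probability from the prior, and for n_ij > 0 the estimate smoothly blends counts and prior.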

59
Bayesian parameter estimation for Markov models
  • One simple way to set the prior parameters is:
  • Consider alpha as the effective sample size
  • Partition the states into two sets, set 1
    containing all states directly linked to state i
    and the remaining states in set 2
  • Assign uniform probability ε/K to all states in
    set 2 (all set 2 states are equally likely)
  • The remaining (1 - ε) can be either uniformly
    assigned among set 1 elements or weighted by some
    measure
  • Prior probabilities in and out of E can be set
    based on our prior knowledge of how likely we
    think a user is to exit the site from a
    particular state

60
Predicting page requests with Markov models
  • Many flavors of Markov models proposed for next
    page and future page prediction
  • Useful in pre-fetching, caching, and
    personalization of Web pages
  • For a typical website, the number of pages is
    large; clustering is useful in this case
  • First-order Markov models are found to be
    inferior to other types of Markov models
  • kth-order is an obvious extension
  • Limitation: O(M^(k+1)) parameters (combinatorial
    explosion)

61
Predicting page requests with Markov models
  • Deshpande and Karypis (2001) propose schemes to
    prune kth-order Markov state space
  • Provide systematic but modest improvements
  • Another way is to use empirical smoothing
    techniques that combine different models from
    order 1 to order k (Chen and Goodman 1996)
  • Cadez et al. (2003) and Sen and Hansen (2003)
    propose mixtures of Markov chains, where we
    replace the first-order Markov chain

62
Predicting page requests with Markov models
  • with a mixture of first-order Markov chains
  • where c is a discrete-valued hidden variable
    taking K values, with Σk P(c = k) = 1, and
  • P(st | st-1, c = k) is the transition matrix
    for the kth mixture component
  • One interpretation of this is that user behavior
    consists of K different navigation behaviors,
    described by the K Markov chains
  • Cadez et al. use this model to cluster sequences
    of page requests into K groups; parameters are
    learned using the EM algorithm

63
Predicting page requests with Markov models
  • Consider the problem of predicting the next
    state, given some number of observed states t
  • Let s1,t = s1, ..., st denote the sequence of the
    first t states
  • The predictive distribution for a mixture of K
    Markov models is
  • The last line is obtained if we assume that,
    conditioned on component c = k, the next state
    st+1 depends only on st
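The mixture and its predictive distribution, reconstructed from the surrounding description (the slide's own equations do not survive in this transcript):

    P(s_t \mid s_{t-1}) = \sum_{k=1}^{K} P(c = k)\, P(s_t \mid s_{t-1}, c = k)

    P(s_{t+1} \mid s_{1,t}) = \sum_{k=1}^{K} P(s_{t+1} \mid s_t, c = k)\, P(c = k \mid s_{1,t})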

64
Predicting page requests with Markov models
  • Weight based on observed history is
  • where
  • Intuitively, these membership weights evolve as
    we see more data from the user
  • In practice,
  • Sequences are short
  • Not realistic to assume that observed data is
    generated by a mixture of K first-order Markov
    chains
  • Still, mixture model is a useful approximation

65
Predicting page requests with Markov models
  • K can be chosen by evaluating the out-of-sample
    predictive performance based on
  • Accuracy of prediction
  • Log probability score
  • Entropy
  • Other variations of Markov models
  • Sen and Hansen 2003
  • Position-dependent Markov models (Anderson et al.
    2001, 2002)
  • Zukerman et al. 1999

66
Search Engine Querying
  • How users issue queries to search engines
  • Tracking search query logs
  • timestamp, text string, user ID, etc.
  • Collecting query datasets from different
    distributions
  • Jansen et al. (1998), Silverstein et al. (1998)
  • Lau and Horvitz (1999), Spink et al. (2002)
  • Xie and O'Hallaron (2002)
  • e.g.
  • Xie and O'Hallaron (2002)
  • Checked how many queries were coming in
  • Checked users' IP addresses
  • Reported 111,000 queries (2.7%) originating from
    AOL

67
Analysis of Search Engine Query Logs
68
Main Results
  • The average number of terms in a query ranges
    from a low of 2.2 to a high of 2.6
  • The most common number of terms in a query is 2
  • The majority of users don't refine their query
  • The number of users who viewed only a single page
    increased from 29% (1997) to 51% (2001) (Excite)
  • 85% of users viewed only the first page of search
    results (AltaVista)
  • 45% of queries (2001) were about Commerce, Travel,
    Economy, People (was 20% in 1997)
  • Queries about adult content or entertainment
    decreased from 20% (1997) to around 7% (2001)

69
Main Results
Query length distributions (bars) vs. a Poisson
model (dots and lines)
  • All four studies produced a generally consistent
    set of findings about user behavior in a search
    engine context
  • most users view relatively few pages per query
  • most users don't use advanced search features

70
Advanced Search Tips
  • Useful operators for searching (Google)
  • + : include a stop word (common
    word), e.g. +where is Irvine
  • - : exclude, e.g. operating system -Microsoft
  • ~ : synonyms, e.g. ~computer
  • " " : phrase search, e.g. "modeling the internet"
  • OR : either A or B, e.g. vacation London OR Paris
  • site: : domain search, e.g. admission
    site:www.uci.edu

71
Power-law Characteristics
Power-Law in log-log space
  • Frequency f(r) of queries with rank r
  • 110,000 queries from Vivisimo
  • 1.9 million queries from Excite
  • There are strong regularities in terms of
    patterns of behavior in how we search the Web
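The power-law regularity referred to here can be written as (a standard formulation, assumed rather than taken from the missing figure):

    f(r) \propto r^{-\beta}
    \quad\Longleftrightarrow\quad
    \log f(r) = \log C - \beta \log r

i.e., a straight line with slope -β in log-log space.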

72
Models for Search Strategies
  • It is important to understand the process by which
    a typical user navigates through the search space
    when looking for information using a search engine
  • The inference of users' search actions could be
    used for marketing purposes such as real-time
    targeted advertising

73
Graphical Representation
  • Lau and Horvitz (1999)
  • Model of users' search query actions over time
  • Simple Bayesian network
  • Current search action
  • Time interval
  • Next search action
  • Informational goals
  • Track the search trajectory of individual users
  • Provide more relevant feedback to users