Title: Topic-Sensitive PageRank
1Topic-Sensitive PageRank
2Abstract
- Target: improving the ranking of search-query results
- Before: using the link structure of the Web to
capture the relative importance of Web pages,
independent of any particular search query
- Now: a set of PageRank vectors, biased using a set
of topics, to capture more accurately the notion
of importance with respect to a particular topic
3Abstract contribution
- more accurate rankings than generic PageRank
- Compute topic-sensitive PageRank scores for pages
satisfying the query using the topic of the query
keywords
- Considering searches done in context
- Compute the topic-sensitive PageRank scores using
the topic of the context in which the query
appeared
41. Introduction
- HITS [14]
- a link analysis algorithm
- Hubs
- Authorities
- Augmented with content analysis [4]
- Used for automatically compiling resource lists
for general topics [8]
51. Introduction - PageRank algorithm [7, 16]
- rank vector -> a priori importance estimates for
pages on the Web
- Computed once
- Offline
- independent of the search query (a drawback)
- importance scores are used in conjunction with
query-specific IR scores to rank the query
results
61. Introduction - Advantage of PageRank
- query-time cost of incorporating the precomputed
PageRank importance score for a page is low
- PageRank is generated using the entire Web graph
rather than a small subset
71. Introduction - Method in this paper
- allows the query to influence the link-based
score (as in HITS)
- requires minimal query-time processing (as in
PageRank)
- uses a set of PageRank vectors, each biased with
a different topic
81. Introduction - making PageRank topic-sensitive
- avoid the problem of heavily linked pages getting
highly ranked for queries for which they have no
particular authority
- Hilltop [5] is designed to improve results for
popular queries
- Generates a query-specific authority score by
detecting and indexing pages that appear to be
good experts for certain keywords
- Queries for which experts were not found will not
be handled by the Hilltop algorithm
91. Introduction - making PageRank topic-sensitive
- [17] proposes using a set of Web pages containing
a given term to influence the PageRank computation
- An approach for enhancing search rankings by
generating a PageRank vector for each possible
query term was recently proposed in [18] with
favorable results
- requires considerable processing time and storage
- not easily extended to make use of user and query
context
101. Introduction - two query scenarios
- Scenario 1: assume a user with a specific
information need issues a query
- Determine the topics most closely associated with
the query, and use the appropriate
topic-sensitive PageRank vectors for ranking the
documents satisfying the query.
111. Introduction - two query scenarios
- Scenario 2: the user is viewing a document (for
instance, browsing the Web or reading email), and
selects a term from the document for which he
would like more information.
12Summary of approach
- generate 16 topic-sensitive PageRank vectors,
each biased using the URLs from one top-level
category of the Open Directory Project (ODP)
- At query time, calculate the similarity of the
query to each topic
- take the linear combination of the
topic-sensitive vectors, weighted using the
similarities of the query to the topics
- link-based computations are performed offline,
so the query-time cost is low
132. Review of PageRank
- if page u links to page v, the author of u
implicitly confers some importance on v
- Example
- Yahoo! -> important page (many pages point to it)
- pages pointed to from Yahoo! are probably important
- $N_u$ -> outdegree of page u
- $Rank(p)$ -> importance of page p
- link (u, v) confers $Rank(u)/N_u$ units of rank to v
14- N is the number of pages; assign all pages the
initial value 1/N
- $B_v$ represents the set of pages pointing to v
- In each iteration, for every page v:
$Rank_{i+1}(v) = \sum_{u \in B_v} Rank_i(u)/N_u$
- The final vector $\vec{Rank}^*$
- contains the PageRank vector over the Web
- computed only once after each crawl of the Web
15- expressed as the following eigenvector
calculation: $\vec{Rank} = M \times \vec{Rank}$
- M -> square stochastic matrix corresponding to
the directed graph G of the Web
- if there is a link from page j to page i,
$m_{ij} = 1/N_j$; all other entries are 0
- Repeatedly multiplying $\vec{Rank}$ by M yields the
dominant eigenvector $\vec{Rank}^*$ of the matrix M
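The iteration described on the last two slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pagerank`, the dictionary-of-outlinks representation, and the three-page graph are all assumptions made for the example, and it assumes every page has at least one outlink and that the graph is strongly connected and aperiodic (no damping is applied here).

```python
def pagerank(out_links, iterations=100):
    """Iterate Rank_{i+1}(v) = sum over u in B_v of Rank_i(u) / N_u."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # every page starts at 1/N
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for u, targets in out_links.items():
            share = rank[u] / len(targets)  # link (u, v) confers Rank(u)/N_u
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

For this toy graph the iteration settles at roughly 0.4 for `a` and `c` and 0.2 for `b`, which is the dominant eigenvector of the corresponding matrix M.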
16An example
[Figure: example Web graph with nodes v1, v2, v3, v4, v5]
17- PageRank can be viewed as the stationary
probability distribution over pages induced by a
random walk on the Web
- Convergence is guaranteed if M is irreducible
and aperiodic
- Damping factor $1-\alpha$ to limit the effect of
rank sinks
- add transition edges of probability $\alpha/N$
between every pair of nodes in G
18Rank Sink
- Page A points to Page B and Page B points to Page
A, with no links leaving the pair; rank flowing in
is never passed on, so the PageRank value of these
pages keeps increasing
Ref: Analysis of Rank Sink Problem in PageRank
Algorithm
19The key to creating topic-sensitive PageRank is
to bias the computation to increase the effect of
certain categories of pages by using a nonuniform
$N \times 1$ personalization vector for $\vec{p}$ in
$\vec{Rank} = (1-\alpha) M \times \vec{Rank} + \alpha \vec{p}$
203. Topic-Sensitive PageRank - Method in this
article
- precompute multiple importance scores for
each page
- a set of scores of the importance of a page with
respect to various topics
- combined to form a composite PageRank score
- to produce the final rank of the query results
213. Topic-Sensitive PageRank - 3.2 ODP biasing
- The first step: generate a set of biased PageRank
vectors using a set of basis topics
- Performed once
- Offline
- Personalization vector
- 16 different biased PageRank vectors (one per
top-level ODP category)
- $T_j$: set of URLs in the ODP category $c_j$
DMOZ Open Directory Project: 16 top-level topics
223.2 Personalization vector
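As a rough sketch of how a personalization vector biases the computation, the Python below assumes the damped form Rank = (1 - a) M Rank + a p, with p uniform over the bias set T_j (each entry 1/|T_j|) and zero elsewhere. The function name `topic_pagerank`, the toy graph, and the choice of bias set are hypothetical; the code also assumes every page has at least one outlink and that the bias set is a non-empty subset of the pages.

```python
def topic_pagerank(out_links, bias_set, alpha=0.25, iterations=100):
    """Iterate Rank = (1 - alpha) * M * Rank + alpha * p.

    p is the personalization vector: 1/|T_j| for pages in bias_set, else 0.
    """
    pages = list(out_links)
    p = {page: (1.0 / len(bias_set) if page in bias_set else 0.0)
         for page in pages}
    rank = dict(p)  # start from the personalization vector
    for _ in range(iterations):
        new_rank = {page: alpha * p[page] for page in pages}  # jump to T_j
        for u, targets in out_links.items():
            share = (1.0 - alpha) * rank[u] / len(targets)  # follow an outlink
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph; "b" plays the role of the only URL in one ODP category.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
biased = topic_pagerank(graph, bias_set={"b"})
unbiased = topic_pagerank(graph, bias_set=set(graph))  # uniform p
```

With a uniform bias set the function reduces to ordinary damped PageRank, so comparing `biased` and `unbiased` shows how the pages in and near the bias set gain rank.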
233.3 Query Time Importance Score
- The second step is performed at query time
- Given a query q, let q' be the context of q
- Let $q'_i$ be the ith term in the query context q'
- Then, given the query q, compute for each $c_j$ the
following:
$P(c_j \mid q') \propto P(c_j) \cdot \prod_i P(q'_i \mid c_j)$
- the prior $P_k(c_j)$ can be chosen to reflect the
interests of user k
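The per-topic quantity computed at query time can be sketched as a small unigram classifier. This is only an illustration under stated assumptions: it takes the class probability to be proportional to the topic prior times the product of per-term probabilities, the add-one smoothing is an addition made here so unseen terms do not zero out a topic, and the function name `class_probabilities` and the term counts are hypothetical.

```python
from math import prod


def class_probabilities(query_terms, term_counts, priors=None, smoothing=1.0):
    """Return P(c_j | q') proportional to P(c_j) * prod_i P(q'_i | c_j)."""
    topics = list(term_counts)
    if priors is None:
        priors = {c: 1.0 / len(topics) for c in topics}  # uniform P(c_j)
    vocab = {t for counts in term_counts.values() for t in counts}
    scores = {}
    for c in topics:
        total = sum(term_counts[c].values())
        # Smoothed unigram estimate of P(term | topic); smoothing is an assumption.
        likelihood = prod(
            (term_counts[c].get(t, 0) + smoothing)
            / (total + smoothing * len(vocab))
            for t in query_terms
        )
        scores[c] = priors[c] * likelihood
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Hypothetical term counts for two topics.
term_counts = {
    "Sports": {"jordan": 8, "basketball": 12},
    "Regional": {"jordan": 5, "amman": 15},
}
probs = class_probabilities(["jordan", "basketball"], term_counts)
```

Passing a non-uniform `priors` dictionary is one way a per-user prior could be plugged in.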
24- compute the query-sensitive importance score of
each of the retrieved URLs:
$s_{qd} = \sum_j P(c_j \mid q') \cdot rank_{jd}$
- $rank_{jd}$ is the rank of document d given by the
rank vector for topic $c_j$
- The results are ranked according to this
composite score $s_{qd}$
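A minimal sketch of the composite score, assuming it is the sum over topics of the topic's class probability times the document's rank in that topic's vector. The function name `composite_scores`, the example domains, and all numbers are made up for illustration.

```python
def composite_scores(topic_probs, topic_ranks, retrieved):
    """s_qd = sum_j P(c_j | q') * rank_jd for each retrieved document d."""
    return {
        d: sum(topic_probs[c] * topic_ranks[c].get(d, 0.0) for c in topic_probs)
        for d in retrieved
    }


# Hypothetical class probabilities and per-topic rank vectors.
topic_probs = {"Sports": 0.9, "Regional": 0.1}
topic_ranks = {
    "Sports": {"nba.example": 0.30, "visitjordan.example": 0.01},
    "Regional": {"nba.example": 0.02, "visitjordan.example": 0.40},
}
scores = composite_scores(
    topic_probs, topic_ranks, ["nba.example", "visitjordan.example"]
)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the per-topic vectors are computed offline, the query-time work is only this weighted sum and a sort.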
25- random surfer model: the surfer visits a web
page with a certain probability, which derives
from the page's PageRank
- $w_j$ is the coefficient used to weight the jth
rank vector
- With probability $1-\alpha$ a random surfer on
page u follows an outlink of u
264. Experimental Results
- A series of experiments
- 4.1 describes the similarity measures used to
compare two rankings
- 4.2 investigates how the induced rankings vary,
based on both the topic used to bias the rank
vectors and the choice of the bias factor $\alpha$
- 4.3 compares the retrieval performance of ordinary
PageRank versus topic-sensitive PageRank
- 4.4 shows how query context can be used in
conjunction with topic-sensitive PageRank
274. Experiment Data
- the crawl contained roughly 280,000 of the 3
million URLs in the ODP
- 35 test queries taken from [9], shown in Table 1
284.1 Similarity Measure - first measure
- OSim: the degree of overlap between the top n
URLs of two rankings
- n = 20 is used for the comparisons
- it does not indicate the degree to which the
relative orderings of the top n URLs of two
rankings are in agreement
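The overlap measure can be sketched as the fraction of URLs shared by the two top-n lists. This assumes the overlap is normalized by n; the function name `osim` and the sample lists are illustrative only.

```python
def osim(ranking1, ranking2, n=20):
    """Fraction of URLs shared by the top n of two rankings."""
    top1, top2 = set(ranking1[:n]), set(ranking2[:n])
    return len(top1 & top2) / n


half = osim(["a", "b", "c", "d"], ["d", "c", "x", "y"], n=4)
same_set = osim(["a", "b"], ["b", "a"], n=2)
```

The second call returns full overlap even though the order is reversed, which is exactly the limitation noted on the slide.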
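294.1 Similarity Measure - second measure
- a variant of Kendall's $\tau$ distance measure [9]
- U: union of the URLs in $\tau_1$ and $\tau_2$
- $\tau_1'$ and $\tau_2'$: $\tau_1$ and $\tau_2$, each
extended with the URLs of U it is missing, placed
after all of its own URLs
- $KSim(\tau_1, \tau_2)$ is the probability that
$\tau_1'$ and $\tau_2'$ agree on the relative ordering
of a randomly selected pair of distinct nodes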
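A minimal sketch of this pairwise-agreement measure follows. It assumes each ranking is extended with the URLs it is missing, all placed (tied) after its own URLs, and that a pair left unordered by one extension counts as a disagreement; that tie-handling rule is an assumption made here, as are the function name `ksim` and the sample lists. Rankings are assumed to contain no duplicate URLs.

```python
from itertools import combinations


def ksim(ranking1, ranking2):
    """Fraction of distinct URL pairs on whose order both rankings agree."""
    universe = set(ranking1) | set(ranking2)
    if len(universe) < 2:
        return 1.0  # no pairs to disagree on

    def positions(ranking):
        pos = {url: i for i, url in enumerate(ranking)}
        for url in universe - set(ranking):
            pos[url] = len(ranking)  # missing URLs go after all ranked ones
        return pos

    def sign(x):
        return (x > 0) - (x < 0)

    pos1, pos2 = positions(ranking1), positions(ranking2)
    pairs = list(combinations(universe, 2))
    agree = sum(
        1 for u, v in pairs
        if sign(pos1[u] - pos1[v]) == sign(pos2[u] - pos2[v])
    )
    return agree / len(pairs)


identical = ksim(["a", "b", "c"], ["a", "b", "c"])
reversed_order = ksim(["a", "b", "c"], ["c", "b", "a"])
one_swap = ksim(["a", "b", "c"], ["a", "c", "b"])
```

Unlike the overlap measure, this value drops when the same URLs appear in a different order.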
30- Ref: https://en.wikipedia.org/wiki/Kendall_tau_distance
314.2 Effect of ODP Biasing
- bias factor $\alpha$
- affects the degree to which the resultant vector
is biased towards the topic vector used for
$\vec{p}$
- For $\alpha = 1$, the URLs in the bias set $T_j$ will
be assigned the score $1/|T_j|$, and all other URLs
the score 0
- as $\alpha \to 0$, the content of $T_j$ becomes
irrelevant to the final score
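The two extreme cases can be read off the biased PageRank equation. The short derivation below assumes the damped form with personalization vector $\vec{v_j}$, uniform over the bias set $T_j$; the symbol names follow that assumption rather than anything shown on the slide.

```latex
\begin{align*}
\vec{Rank} &= (1-\alpha)\, M \times \vec{Rank} + \alpha\, \vec{v_j},
  \qquad v_{ji} = \begin{cases} 1/|T_j| & i \in T_j \\ 0 & i \notin T_j \end{cases} \\
\alpha = 1 &: \quad \vec{Rank} = \vec{v_j}
  \quad\Rightarrow\quad Rank(i) = 1/|T_j| \text{ for } i \in T_j,\ 0 \text{ otherwise} \\
\alpha \to 0 &: \quad \vec{Rank} \to M \times \vec{Rank}
  \quad\Rightarrow\quad \text{the bias term vanishes and only the link structure matters}
\end{align*}
```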
32- Use $\alpha = 0.25$ (chosen heuristically)
- the induced rankings of query results are not
very sensitive to the choice of $\alpha$
- The average overlap between the top 20 results
for the two values of $\alpha$ is very high
33- the difference between rankings induced by
different topically-biased PageRank vectors is
much higher
34- investigate which of these rankings is best for
specific queries
- Table 5 shows the top 5 ranked URLs
354.3 query-sensitive scoring
- how effectively can we utilize the ranking
precision gained from multiple PageRank vectors
- the class probabilities intuitively pick out the
most relevant categories for the query
- Use only the three categories with the highest
values to compute the $s_{qd}$ score
364.3 query-sensitive scoring
374.3 query-sensitive scoring -experiment
- To compare the query-sensitive approach to
PageRank
- 10 queries
- 5 volunteers
- For each query, the volunteer was shown 2 result
rankings
- Top 10 results with the unbiased PageRank vector
- Top 10 results with the composite $s_{qd}$ score
- Select all URLs relevant to the query
- Choose the better of the two rankings
384.3 query-sensitive scoring -result
394.4 context-sensitive scoring
- Using the context can help disambiguate the query
term and yield results that more closely reflect
the intent of the user
415. Sources of search context
- the history of queries issued leading up to the
current query is another form of query context
- e.g. a search for "Jordan" issued after a search
for "basketball"
- the current node in some sort of hierarchical
directory the user is browsing
- User context
- Browsing patterns
- Bookmarks
- Email archives
426. Ongoing Work
- discovering sources of search context
- developing the best set of basis topics (second
or third level of the Open Directory hierarchy)
-> efficiency problem
- Creating the damping vectors used to create the
topic-sensitive rank vectors -> being more resistant