ECE 7995 CACHING AND PREFETCHING TECHNIQUES

Provided by: eceEng
1
ECE 7995 CACHING AND PREFETCHING TECHNIQUES
2
Locality In Search Engine Queries And Its
Implications For Caching
  • By
  • LAKSHMI JANARDHAN ba8671
  • JUNAID AHMED ax9974

3
Outline
  • Introduction
  • Related previous work
  • Analysis of search engines and their query traces
  • Query locality and its implications
  • User lexicon analysis and its implication
  • Results
  • Conclusion and Scope of improvement

4
Introduction
  • Serving a search request requires a significant amount of computation as well as I/O and network bandwidth; caching query results could improve performance in three ways.
  • Repeated queries can be answered directly from the cache.
  • Because of the reduction in server workload, scarce computing cycles in the server are saved, allowing these cycles to be applied to more advanced algorithms.
  • Caching can distribute part of the computational tasks and customize search results based on user contextual information.

5
Questions Still Open??
  • Where should we cache these results?
  • How long should we keep a query in cache before
    it becomes stale?
  • What might be the other benefits gained from
    caching?
  • Do we have to study real search engines in order to understand caching of search engine results?

6
Related Work and Motivation
  • Due to the exponential growth of the Web, there
    has been much research on the impact of Web
    caching and how to maximize its performance
    benefits.
  • Deploying proxies between clients and servers
    yields a number of performance benefits. It
    reduces server load, network bandwidth usage as
    well as user access latency.

7
  • There are previous studies on search engine traces: the Excite search engine trace was used to determine how users search the Web and what they search for, and the AltaVista search engine trace was used to study the interaction of terms within queries and to present a correlation analysis of the log entries.
  • Although these studies have not focused on caching search engine results, all of them suggest queries have significant locality, which motivated the authors to write the paper.

8
Analysis of search engines and their query traces
  • Analysis of the Vivisimo and Excite search engines.
  • Query trace descriptions of both the Vivisimo and Excite search engines.
  • Statistical summary of the above traces.

9
Vivisimo search engine
  • Vivisimo is a clustering meta-search engine that
    organizes the combined outputs of multiple search
    engines.
  • Upon reception of each user query, Vivisimo
    combines the results from other search engines
    and organizes these documents into meaningful
    groups.
  • The groupings are generated dynamically based on
    extracts from the documents, such as titles,
    URLs, and short descriptions.

10
Excite search engine
  • Excite is a basic search engine that
    automatically produces search results by listing
    relevant web sites and information upon reception
    of each user query.
  • Capitalization of the query is disregarded.
  • The default logic operation to be performed is 'ALL'. It also supports other logic operations like AND, OR, and AND NOT.
  • More advanced searching features of Excite include wild-card matching, phrase searching, and relevance feedback.

11
Query traces of the Vivisimo and Excite search engines
  • Vivisimo trace captures the behavior of early
    adopters who may not be representative of a
    steady state user group.
  • The Vivisimo and Excite traces were collected at different times.
  • Even though they were captured over different time periods and from different user populations, their results appear to be similar.

12
  • In both traces, entries contain the following fields of interest:
  • An Anonymous ID identifying the user IP address.
  • A Timestamp specifying when the user request is received.
  • A Query String submitted by the user. If any advanced query operations are selected, they are also specified in this string.
  • A Number indicating whether the request is for next-page results or a new user query.
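The entry layout above can be sketched as a small data structure. This is a minimal illustration, not the actual Vivisimo or Excite log format: the field names and the tab-separated layout are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TraceEntry:
    anon_id: str      # anonymized identifier derived from the user IP address
    timestamp: int    # seconds since epoch when the request was received
    query: str        # query string, including any advanced operators
    page: int         # 1 = new query, >1 = next-page request

def parse_line(line):
    """Parse one (hypothetical) tab-separated trace line into a TraceEntry."""
    anon_id, ts, query, page = line.rstrip("\n").split("\t")
    return TraceEntry(anon_id, int(ts), query, int(page))
```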

13
Statistical summaries
  • Users do not issue many next-page requests. Fewer
    than two pages on average are examined for each
    query.
  • Users do repeat queries a lot.
  • In the Vivisimo trace, over 32% of the queries are repeated ones that have been submitted before by either the same user or a different user.
  • In the Excite trace, more than 42% of the queries are repeated queries.

14
  • The majority of users do not use advanced query options.
  • 97% of the queries from the Vivisimo trace and 93% of the queries from the Excite trace use the default logic operations offered by the corresponding search engines.
  • Users on average do not submit many queries. The
    average numbers of queries submitted by a user
    are 5.48 and 3.69, respectively.
  • About 70% of the queries consist of more than one word, although the average query length is fewer than three terms, which is short.

15
The number of HTTP requests cannot be inferred from the Excite trace, since the trace did not contain information about HTTP requests from users.
16
  • Vivisimo Trace
  • Excite Trace

17
Query locality and its implications
  • Query Repetition and Distribution
  • Query Locality based on Individual User
  • Temporal Query Locality
  • Multiple Word Query Locality
  • Users with shared IP Address

18
Query repetition and distribution
  • Among the 35,538 queries that are repeated, there are 16,612 distinct queries in the Vivisimo trace; each query was repeated 3.20 times on average.
  • In the Excite trace each query was repeated 4.49 times on average; there were 235,607 distinct queries among the 821,315 repeated ones.
  • The figure shows the Zipf distribution of repeated query frequencies for both traces.
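The statistics above can be computed directly from a list of query strings. The sketch below is illustrative (function and variable names are my own); it counts the distinct repeated queries and the average number of occurrences per repeated query.

```python
from collections import Counter

def repetition_stats(queries):
    """Given a list of query strings, return (distinct_repeated,
    total_repeated, avg_occurrences), where a query counts as repeated
    if it is submitted more than once."""
    # Excite-style normalization: capitalization is disregarded.
    counts = Counter(q.strip().lower() for q in queries)
    repeated = {q: n for q, n in counts.items() if n > 1}
    total_repeated = sum(repeated.values())
    distinct_repeated = len(repeated)
    avg = total_repeated / distinct_repeated if distinct_repeated else 0.0
    return distinct_repeated, total_repeated, avg
```

On a toy trace `["a", "a", "a", "b", "b", "c"]` this yields 2 distinct repeated queries ("a" and "b") occurring 5 times in total, or 2.5 occurrences per repeated query.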

19
Query repetition distribution (contd..)
  • We are interested only in repeated queries, whether they are from the same users or shared by different users.
  • Of the 32.05% repeated queries in the Vivisimo trace, only 70.58% are from the same users.
  • Of the 42.75% repeated queries in the Excite trace, only 37.35% are repeated by the same users.
  • Thus, these results suggest that queries with a high degree of shareness should be cached at the server side.

20
Query repetition distribution (contd..)
Shareness is based on their repetition
frequencies
21
Query locality based on individual user
  • The query shareness distribution indicates that
    query locality exists with respect to the same
    users as well as among different users.
  • In the Vivisimo trace, out of 20,220 users, 6,628 users repeated queries at least once. Each of these users on average repeated 5.36 queries.
  • In the Excite trace, out of 520,883 users who submitted queries, 136,626 repeated queries. Each of these users on average repeated 6.01 queries.
  • The figure shows the percentage of repeated queries over the total number of queries submitted by each user.

22
Query locality based on individual user (contd..)
  • Thus from these results, we can see that not only did a lot of users repeat queries, but each user repeated queries a lot.
  • Since 16% to 22% of all queries were repeated by the same users, and search engine servers can cache only limited data, caching these queries and query results based on individual user requirements in a more distributed way is important.
  • It reduces user query submission overheads and
    access latencies as well as server load.
  • By caching queries at the user side, we also have
    the opportunity to improve query results based on
    individual user context, which cannot be achieved
    by caching queries at a centralized server.

23
Temporal query locality
  • Users tend to repeat queries within a short time interval.
  • In Vivisimo about 65% of the queries were repeated within an hour, and 83% in Excite.
  • 45.5% of the queries were repeated by the same users within 5 minutes.
  • Of the 21.98% of the queries that were repeated more than a day later, only 5.09% came from the same users.
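A measurement like "65% of queries were repeated within an hour" can be derived by tracking, for each repeat, the gap since the previous occurrence of the same query. This is a minimal sketch under assumed inputs (a time-sorted list of `(timestamp, query)` pairs); it is not the authors' actual analysis code.

```python
def fraction_repeated_within(events, window):
    """events: list of (timestamp_seconds, query) sorted by time.
    Returns the fraction of repeat submissions whose gap from the
    previous occurrence of the same query is <= window seconds."""
    last_seen = {}
    repeats = within = 0
    for ts, q in events:
        if q in last_seen:
            repeats += 1
            if ts - last_seen[q] <= window:
                within += 1
        last_seen[q] = ts          # measure gaps between consecutive repeats
    return within / repeats if repeats else 0.0
```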

24
Temporal query locality (contd..)
25
Temporal query locality (contd..)
26
Multi word query locality
  • This comes into the picture when a user uses more than one word in a query.
  • Multi-word queries have a lower degree of shareness.
  • Caching multi-word queries is more promising because their results take more time to compute.
  • Since they have a lower degree of shareness, they are best cached at the user side.

27
Multi word query locality (contd..)
Multiple word query summary and comparison
between single and multi word queries.
28
Users with shared IP addresses
  • Some users' IP addresses are dynamically allocated by DHCP servers. Unfortunately, there is no common way to identify these kinds of users. This impacts the analysis in two ways.
  • First, because different users can share the same
    IP address at different times, their queries seem
    like they come from the same user, leading to an
    overestimate of the query locality from the same
    users.
  • Second, because the same users can use different
    IP addresses at different times, it is also
    possible for us to underestimate the query
    locality from the same users.

29
User Lexicon Analysis and Implication
  • User Lexicon: a lexicon is a synonym for dictionary; here it refers to a user's vocabulary.
  • By analyzing user query lexicons, it is possible to prefetch query results for each user based on frequently used terms.

30
User Lexicon Analysis and Implication
  • All the words used by each user individually were grouped and the user lexicon size distribution was observed.
  • Vivisimo trace: 249,541 words in queries → 51,895 distinct words (20.8%)
  • Excite trace: 5,095,189 words in queries → 350,879 distinct words (6.9%)

31
User Lexicon Analysis and Implication
  • User lexicon sizes are much smaller than the overall lexicon size.
  • The largest user lexicon in the Vivisimo trace has only 885 words, and 202 words in the Excite trace.

32
Distribution of User Lexicon Size
  • User lexicon size does not follow a Zipf distribution.
  • The majority of users have small lexicons, as indicated by the heavy tail.

33
Distribution of User Lexicon Size
  • Users who submitted more than 100 queries over the 8-hour period are removed, as they are likely to be meta-search engines.
  • User lexicons grow larger as more queries are submitted by the user.

34
Distribution of User Lexicon Size
  • Shows the relationship between the number of queries submitted by a user and the corresponding user lexicon size.
  • The Excite trace shows the same pattern.

35
Analysis of Frequent Users and Their Lexicons
  • Users have large lexicons, but they do not use all words uniformly.
  • We are interested in how frequently words are used by frequent users in their queries.
  • Users who submitted only a few queries over the long period (35 days), or whose trace lasts too short a period, are ignored.

36
Analysis of Frequent Users and Their Lexicons
  • Fre-user: a user who has submitted at least 70 queries over 35 days.
  • Fre-lexicon: consists of the words that were used at least 5 times by the corresponding user.
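The two definitions above translate into a short filter. This is a sketch under the thresholds stated on the slide (70 queries, 5 uses); the function name and tokenization by whitespace are my own assumptions.

```python
from collections import Counter

MIN_QUERIES = 70   # "fre-user" threshold: at least 70 queries over 35 days
MIN_USES = 5       # "fre-lexicon" threshold: word used at least 5 times

def fre_lexicon(user_queries):
    """user_queries: all query strings from one user over the 35-day window.
    Returns the user's fre-lexicon as a set of words, or None if the user
    does not qualify as a fre-user."""
    if len(user_queries) < MIN_QUERIES:
        return None
    word_counts = Counter(w for q in user_queries for w in q.lower().split())
    return {w for w, n in word_counts.items() if n >= MIN_USES}
```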

37
Analysis of Frequent Users and Their Lexicons
  • Most of the fre-users had small fre-lexicons.
  • A small number of users have relatively large fre-lexicons.

38
Analysis of Frequent Users and Their Lexicons
  • Prefetching based on fre-lexicons for these users will reduce the number of queries sent to the server.
  • But it would require a large cache to store all possible word combinations.

39
Analysis of Frequent Users and Their Lexicons
  • Shows both the percentage of the queries from
    fre-lexicons and the percentage of the queries
    from fre-lexicons with fewer than five terms

40
Analysis of Frequent Users and Their Lexicons
  • The constraint imposed on the number of terms does not affect the results for most of the fre-users. Hence, we could just enumerate the word combinations using no more than 4 terms, which would greatly reduce the number of queries to be prefetched.

41
Research Implications
  • Review the statistical results derived from both
    traces and discuss their implications
  • Focus on 3 aspects
  • 1. Caching search engine results
  • 2. Prefetching search engine results
  • 3. Improving query result rankings

42
Caching search engine results
  • Query results can be cached on the servers, the
    proxies, and the clients. For optimal
    performance, we should make decisions based on
    the following aspects
  • 1. Scalability
  • 2. Performance improvement
  • 3. Hit rate and shareness
  • 4. Overhead.
  • 5. Opportunity for other benefits

43
Caching search engine results
44
Caching search engine results
  • Server Caching
  • The server has limited resources, so it doesn't scale well with the increasing size of the Internet.
  • We can't reduce the number of requests received by the server.
  • Server caching has small overhead.
  • It allows maximum query shareness, and the hit rate would be high by caching popular queries.

45
Caching search engine results
  • Proxy Caching
  • It is effective in reducing both server workload and network traffic.
  • In the case of caching query results, this assumes that the users near a proxy would share queries.
  • However, one of the main disadvantages of proxy caching is the significant overhead of placing dedicated proxies across the Internet.

46
Caching search engine results
  • User Side Caching
  • User-side caching achieves the best scalability.
  • Because the overhead of caching is amortized over a large number of users, the overhead at each user side is small.
  • User-side caching makes it possible to prefetch or improve query results based on individual user requirements.
  • No shareness can be exploited with user-side caching.
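A user-side cache with the short expiration times recommended later in the talk can be sketched in a few lines. This is an illustrative design, not code from the paper; the class name, TTL value, and injectable clock are my own choices (the clock makes the expiry behavior testable).

```python
import time

class UserQueryCache:
    """Minimal user-side query-result cache: entries expire after a short
    TTL (hours), matching the temporal locality observed in the traces."""
    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self.entries = {}           # query -> (results, insert_time)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is None:
            return None
        results, t = hit
        if self.clock() - t > self.ttl:   # stale: evict and report a miss
            del self.entries[query]
            return None
        return results

    def put(self, query, results):
        self.entries[query] = (results, self.clock())
```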

47
Caching search engine results
  • The degree of shareness tells us where query results should be cached.
  • Two extreme cases:
  • 1. Users never repeat their own queries (maximum degree of shareness): cache query results at servers/proxies.
  • 2. Users never share queries (no shareness): cache query results at the user side.

48
Caching search engine results
  • 32% to 42% of queries are repeated.
  • 16% to 22% of queries are repeated by the same user.
  • Queries repeated by the same user can be cached at the user side.
  • The rest of the repeated queries can be cached at the server/proxy side.
  • Multiple-word queries should be cached at the user side.

49
Caching search engine results
  • How long should query results be cached?
  • Temporal query locality indicates that most queries are repeated within short time intervals.
  • With user-side caching, query results should be cached for hours. This also helps to remove or update stale query results in time.
  • Long-term caching, such as for a couple of days, should be done at servers/proxies.

50
Caching search engine results
  • Time to live (TTL) tells whether the query result has been in the cache for too long and should be discarded. It is set to a relatively short time interval.
  • The If-Modified-Since request-header field can be used in each user request. If the requested resource has not been modified since the time specified in this field, a copy of the resource will not be returned by the server. This guarantees query freshness but has an overhead.
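The If-Modified-Since mechanism described above is the standard HTTP conditional-request flow: the server compares the header against the time the cached result was last computed and answers 304 Not Modified with no body when the client's copy is still fresh. A minimal sketch of the server-side check (function name and return convention are illustrative):

```python
from email.utils import parsedate_to_datetime, format_datetime
from datetime import datetime, timezone

def respond(last_computed, if_modified_since=None):
    """Sketch of the server-side If-Modified-Since check.
    last_computed: aware datetime when the query results were last computed.
    if_modified_since: the request header value (HTTP-date string) or None.
    Returns (status_code, body)."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if last_computed <= since:
            return 304, None              # client's cached copy is fresh
    return 200, "new query results"       # recompute and send full results
```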

51
Caching search engine results
  • For server side and proxy caching, we should use
    different mechanisms to maintain cache
    consistency
  • With server side caching, we can easily ensure
    cache consistency by removing or updating stale
    query results whenever new query results are
    computed

52
Caching search engine results
  • For proxy caching, queries shared by different users repeat over longer time intervals, so a TTL-based approach would incur more overhead, since TTL is usually set to a relatively short interval to prevent caches from serving stale data.
  • Advanced protocols should be used to update stale query results while keeping the overhead low.
  • A proxy-polling technique can be used, where proxies periodically check back with the server to determine whether cached objects are still valid.

53
Prefetching Search Engine Results
  • The lexicon analysis suggests that prefetching query results based on user fre-lexicons is promising.
  • Prefetching has always been an important method for reducing user access latency.

54
Prefetching Search Engine Results
  • The user lexicon analysis shows that the majority of users have small lexicon sizes.
  • Prefetching can be done by enumerating all the word combinations from fre-lexicons and prefetching the corresponding query results into a user-level cache.
  • The analysis shows that the same performance improvement can be achieved while skipping queries longer than 4 words.
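The enumeration step can be sketched with `itertools.combinations`. This is a simplified illustration of the idea (unordered word combinations, which matches the default ALL/AND semantics where word order does not matter); the function name and the sorting are my own choices.

```python
from itertools import combinations

def enumerate_prefetch_queries(fre_lexicon, max_terms=4):
    """Enumerate candidate queries as unordered combinations of up to
    max_terms words from a user's fre-lexicon, ready to be prefetched
    into a user-level cache."""
    words = sorted(fre_lexicon)    # deterministic order for reproducibility
    out = []
    for k in range(1, max_terms + 1):
        out.extend(" ".join(c) for c in combinations(words, k))
    return out
```

Capping `max_terms` at 4 keeps the candidate set polynomial in the lexicon size rather than letting it explode combinatorially, which is the point made on the slide.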

55
Prefetching Search Engine Results
  • When user interests stay relatively stable, the majority of the words in a fre-lexicon remain the same, so there are fewer new query results to fetch during prefetching; the overhead at the user side is thus small.
  • As user interests change gradually, the fre-lexicons should also be updated to match. New queries will be formed and their results will need to be prefetched to achieve the best cache hit rates.

56
Prefetching Search Engine Results
  • The same algorithm can be used at the server side, though redundant queries may be sent to the server.
  • If the prefetching algorithm runs regularly, when user machines are idle and servers are not busy, most of these requests will be processed by servers at non-peak times.
  • Hence, server peak-time workload can even be reduced, since many of the queries will already have been satisfied by user prefetching.

57
Prefetching Search Engine Results
  • Prefetching can also be performed at proxies. In such cases, the server's global knowledge about user query patterns can be utilized to decide what and when to prefetch.
  • Since proxies allow query shareness among different users, exploring how to achieve the maximum hit rate is left as future work.

58
Improving Query Result Rankings
  • Search engines return thousands of results, but users actually examine only a few of them, i.e., fewer than two pages of results.
  • Improving query result rankings based on individual user requirements is more important than ever.
  • With user-side/proxy caching, it is now possible to re-rank the returned search engine results based on the unique interests of the individual user.
  • For example, a naive algorithm would be to increase the ranks of the Web pages previously visited by the user among the next query results.
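The naive algorithm mentioned above can be sketched as a stable re-sort: previously visited pages are promoted ahead of unvisited ones, with the original engine order preserved within each group. This is one possible reading of the slide's example, not the authors' implementation.

```python
def rerank(results, visited):
    """Promote previously visited pages in a result list.
    results: list of URLs in the order returned by the search engine.
    visited: set of URLs the user has visited before."""
    return ([r for r in results if r in visited] +      # boosted group
            [r for r in results if r not in visited])   # original order kept
```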

59
Conclusion
  • In conclusion, we answer again the questions that we had at the beginning:
  • Where should we cache search engine results?
  • How long should we cache search results?
  • What are the other benefits of caching search
    engine results?