ECE 7995 CACHING AND PREFETCHING TECHNIQUES

Provided by: eceEng
1
ECE 7995 CACHING AND PREFETCHING TECHNIQUES
2
Locality In Search Engine Queries And Its
Implications For Caching
  • By
  • LAKSHMI JANARDHAN ba8671
  • JUNAID AHMED ax9974

3
Outline
  • Introduction
  • Related previous work
  • Analysis of search engines and their query traces
  • Query locality and its implications
  • User lexicon analysis and its implication
  • Results
  • Conclusion and Scope of improvement

4
Introduction
  • Serving a search request requires a significant amount of computation as well as I/O and network bandwidth; caching query results could improve performance in three ways.
  • Repeated queries can be answered directly from the cache.
  • Because of the reduction in server workload, scarce computing cycles in the server are saved, allowing these cycles to be applied to more advanced algorithms.
  • Caching can distribute part of the computational tasks and customize search results based on user contextual information.

5
Questions Still Open??
  • Where should we cache these results?
  • How long should we keep a query in cache before
    it becomes stale?
  • What might be the other benefits gained from
    caching?
  • Do we have to study real search engines in order to understand caching of search engine results?

6
Related Work and Motivation
  • Due to the exponential growth of the Web, there
    has been much research on the impact of Web
    caching and how to maximize its performance
    benefits.
  • Deploying proxies between clients and servers
    yields a number of performance benefits. It
    reduces server load, network bandwidth usage as
    well as user access latency.

7
  • There are previous studies on search engine traces: the Excite search engine trace was used to determine how users search the Web and what they search for, and the AltaVista search engine trace was used to study the interaction of terms within queries and to present a correlation analysis of the log entries.
  • Although these studies have not focused on caching search engine results, all of them suggest queries have significant locality, which motivated the authors to write the paper.

8
Analysis of search engines and their query traces
  • Analysis of the Vivisimo and Excite search engines.
  • Query trace descriptions of both the Vivisimo and Excite search engines.
  • Statistical summary of the above traces.

9
Vivisimo search engine
  • Vivisimo is a clustering meta-search engine that
    organizes the combined outputs of multiple search
    engines.
  • Upon reception of each user query, Vivisimo
    combines the results from other search engines
    and organizes these documents into meaningful
    groups.
  • The groupings are generated dynamically based on
    extracts from the documents, such as titles,
    URLs, and short descriptions.

10
Excite search engine
  • Excite is a basic search engine that
    automatically produces search results by listing
    relevant web sites and information upon reception
    of each user query.
  • Capitalization of the query is disregarded.
  • The default logic operation to be performed is 'ALL'. It also supports other logic operations like AND, OR, and AND NOT.
  • More advanced searching features of Excite include wild-card matching, phrase searching, and relevance feedback.

11
Query traces of the Vivisimo and Excite search engines
  • Vivisimo trace captures the behavior of early
    adopters who may not be representative of a
    steady state user group.
  • The Vivisimo and Excite traces were collected at different times.
  • Even though they were captured over different time periods and from different user populations, their results appear to be similar.

12
  • In both traces, entries contain the following fields of interest:
  • An Anonymous ID identifying the user IP address.
  • A Timestamp specifying when the user request is received.
  • A Query String submitted by the user. If any advanced query operations are selected, they are also specified in this string.
  • A Number indicating whether the request is for next-page results or a new user query.
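The entry layout above can be sketched as a small data structure. This is a minimal illustration, not the actual Vivisimo or Excite log format: the field names and the tab-separated layout are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TraceEntry:
    anon_id: str      # anonymized identifier derived from the user IP address
    timestamp: int    # seconds since epoch when the request was received
    query: str        # query string, including any advanced operators
    page: int         # 1 = new query, >1 = next-page request

def parse_line(line):
    """Parse one (hypothetical) tab-separated trace line into a TraceEntry."""
    anon_id, ts, query, page = line.rstrip("\n").split("\t")
    return TraceEntry(anon_id, int(ts), query, int(page))
```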

13
Statistical summaries
  • Users do not issue many next-page requests. Fewer
    than two pages on average are examined for each
    query.
  • Users do repeat queries a lot.
  • In the Vivisimo trace, over 32% of the queries are repeated ones that have been submitted before by either the same user or a different user.
  • In the Excite trace, more than 42% of the queries are repeated queries.

14
  • The majority of users do not use advanced query options.
  • 97% of the queries from the Vivisimo trace and 93% of the queries from the Excite trace use the default logic operations offered by the corresponding search engines.
  • Users on average do not submit many queries. The
    average numbers of queries submitted by a user
    are 5.48 and 3.69, respectively.
  • About 70% of the queries consist of more than one word, although the average query length is fewer than three terms, which is short.

15
The number of HTTP requests cannot be inferred from the Excite trace, since the trace did not contain information about HTTP requests from users.
16
  • Vivisimo Trace
  • Excite Trace

17
Query locality and its implications
  • Query Repetition and Distribution
  • Query Locality based on Individual User
  • Temporal Query Locality
  • Multiple Word Query Locality
  • Users with shared IP Address

18
Query repetition and distribution
  • Among the 35,538 queries that are repeated, there are 16,612 distinct queries in the Vivisimo trace; each query was repeated 3.20 times on average.
  • In the Excite trace each query was repeated 4.49 times on average; there were 235,607 distinct queries among the 821,315 repeated ones.
  • The figure shows the Zipf distribution of repeated query frequencies for both traces.
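The statistics above can be computed directly from a list of query strings. The sketch below is illustrative (function and variable names are my own); it counts the distinct repeated queries and the average number of occurrences per repeated query.

```python
from collections import Counter

def repetition_stats(queries):
    """Given a list of query strings, return (distinct_repeated,
    total_repeated, avg_occurrences), where a query counts as repeated
    if it is submitted more than once."""
    # Excite-style normalization: capitalization is disregarded.
    counts = Counter(q.strip().lower() for q in queries)
    repeated = {q: n for q, n in counts.items() if n > 1}
    total_repeated = sum(repeated.values())
    distinct_repeated = len(repeated)
    avg = total_repeated / distinct_repeated if distinct_repeated else 0.0
    return distinct_repeated, total_repeated, avg
```

On a toy trace `["a", "a", "a", "b", "b", "c"]` this yields 2 distinct repeated queries ("a" and "b") occurring 5 times in total, or 2.5 occurrences per repeated query.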

19
Query repetition distribution (contd..)
  • We are interested only in repeated queries, whether they are from the same users or shared by different users.
  • Of the 32.05% repeated queries in the Vivisimo trace, only 70.58% are from the same users.
  • Of the 42.75% repeated queries in the Excite trace, only 37.35% are repeated by the same users.
  • Thus, these results suggest that queries with a high degree of shareness should be cached at the server side.

20
Query repetition distribution (contd..)
Shareness is based on their repetition
frequencies
21
Query locality based on individual user
  • The query shareness distribution indicates that
    query locality exists with respect to the same
    users as well as among different users.
  • In the Vivisimo trace, out of 20,220 users, 6,628 users repeated queries at least once. Each of these users on average repeated 5.36 queries.
  • In the Excite trace, out of 520,883 users who submitted queries, 136,626 repeated queries. Each of these users on average repeated 6.01 queries.
  • The figure shows the percentage of repeated queries over the total number of queries submitted by each user.

22
Query locality based on individual user (contd..)
  • Thus from these results, we can see that not only did a lot of users repeat queries, but each user repeated queries a lot.
  • Since 16% to 22% of all queries were repeated by the same users, and search engine servers can cache only limited data, caching these queries and query results based on individual user requirements in a more distributed way is important.
  • It reduces user query submission overheads and
    access latencies as well as server load.
  • By caching queries at the user side, we also have
    the opportunity to improve query results based on
    individual user context, which cannot be achieved
    by caching queries at a centralized server.

23
Temporal query locality
  • Users tend to repeat queries within a short time interval.
  • In Vivisimo about 65% of the queries were repeated within an hour, and 83% in Excite.
  • 45.5% of the queries were repeated by the same users within 5 minutes.
  • Of the 21.98% of the queries that were repeated more than a day later, only 5.09% came from the same users.
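A measurement like "65% of queries were repeated within an hour" can be derived by tracking, for each repeat, the gap since the previous occurrence of the same query. This is a minimal sketch under assumed inputs (a time-sorted list of `(timestamp, query)` pairs); it is not the authors' actual analysis code.

```python
def fraction_repeated_within(events, window):
    """events: list of (timestamp_seconds, query) sorted by time.
    Returns the fraction of repeat submissions whose gap from the
    previous occurrence of the same query is <= window seconds."""
    last_seen = {}
    repeats = within = 0
    for ts, q in events:
        if q in last_seen:
            repeats += 1
            if ts - last_seen[q] <= window:
                within += 1
        last_seen[q] = ts          # measure gaps between consecutive repeats
    return within / repeats if repeats else 0.0
```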

24
Temporal query locality (contd..)
25
Temporal query locality (contd..)
26
Multi word query locality
  • This comes into the picture when a user uses more than one word in a query.
  • Multi-word queries have a lower degree of shareness.
  • Caching multi-word queries is more promising because their results take more time to compute.
  • Since they have a lower degree of shareness, they are best cached at the user side.

27
Multi word query locality (contd..)
Multiple word query summary and comparison
between single and multi word queries.
28
Users with shared IP addresses
  • Some users' IP addresses are dynamically allocated by DHCP servers. Unfortunately, there is no common way to identify these kinds of users. This impacts the analysis in two ways.
  • First, because different users can share the same
    IP address at different times, their queries seem
    like they come from the same user, leading to an
    overestimate of the query locality from the same
    users.
  • Second, because the same users can use different
    IP addresses at different times, it is also
    possible for us to underestimate the query
    locality from the same users.

29
User Lexicon Analysis and Implication
  • User Lexicon: a lexicon is a synonym for dictionary; here it refers to a user's vocabulary.
  • By analyzing user query lexicons, it is possible to prefetch query results for each user based on frequently used terms.

30
User Lexicon Analysis and Implication
  • All the words used by each user individually were grouped and the user lexicon size distribution was observed.
  • Vivisimo trace: 249,541 words in queries → 51,895 distinct words (20.8%)
  • Excite trace: 5,095,189 words in queries → 350,879 distinct words (6.9%)

31
User Lexicon Analysis and Implication
  • User lexicon sizes are much smaller than the overall lexicon size.
  • The largest user lexicon in the Vivisimo trace has only 885 words, and 202 words in the Excite trace.

32
Distribution of User Lexicon Size
  • User lexicon size does not follow a Zipf distribution.
  • The majority of users have small lexicons, as indicated by the heavy tail.

33
Distribution of User Lexicon Size
  • Users who submitted more than 100 queries over the 8-hour period are removed, as they are likely to be meta-search engines.
  • User lexicons grow larger as more queries are submitted by the user.

34
Distribution of User Lexicon Size
  • Shows the relationship between the number of queries submitted by a user and the corresponding user lexicon size.
  • The Excite trace shows the same pattern.

35
Analysis of Frequent Users and Their Lexicons
  • Users have large lexicons, but they do not use all words uniformly.
  • We are interested in how frequently words are used by frequent users in their queries.
  • Users who submitted only a few queries over the long period (35 days), or whose trace lasts too short a period, are ignored.

36
Analysis of Frequent Users and Their Lexicons
  • Fre-user: a user who has submitted at least 70 queries over 35 days.
  • Fre-lexicon: consists of the words that were used at least 5 times by the corresponding user.
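The two definitions above translate into a short filter. This is a sketch under the thresholds stated on the slide (70 queries, 5 uses); the function name and tokenization by whitespace are my own assumptions.

```python
from collections import Counter

MIN_QUERIES = 70   # "fre-user" threshold: at least 70 queries over 35 days
MIN_USES = 5       # "fre-lexicon" threshold: word used at least 5 times

def fre_lexicon(user_queries):
    """user_queries: all query strings from one user over the 35-day window.
    Returns the user's fre-lexicon as a set of words, or None if the user
    does not qualify as a fre-user."""
    if len(user_queries) < MIN_QUERIES:
        return None
    word_counts = Counter(w for q in user_queries for w in q.lower().split())
    return {w for w, n in word_counts.items() if n >= MIN_USES}
```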

37
Analysis of Frequent Users and Their Lexicons
  • Most of the fre-users had small fre-lexicons.
  • A small number of users have relatively large fre-lexicons.

38
Analysis of Frequent Users and Their Lexicons
  • Prefetching based on fre-lexicons for these users will reduce the number of queries sent to the server.
  • But it would require a large cache to store all possible word combinations.

39
Analysis of Frequent Users and Their Lexicons
  • Shows both the percentage of the queries from
    fre-lexicons and the percentage of the queries
    from fre-lexicons with fewer than five terms

40
Analysis of Frequent Users and Their Lexicons
  • The constraint imposed on the number of terms does not affect the results for most of the fre-users. Hence, we could just enumerate the word combinations using no more than 4 terms, which would greatly reduce the number of queries to be prefetched.

41
Research Implications
  • Review the statistical results derived from both
    traces and discuss their implications
  • Focus on 3 aspects
  • 1. Caching search engine results
  • 2. Prefetching search engine results
  • 3. Improving query result rankings

42
Caching search engine results
  • Query results can be cached on the servers, the
    proxies, and the clients. For optimal
    performance, we should make decisions based on
    the following aspects
  • 1. Scalability
  • 2. Performance improvement
  • 3. Hit rate and shareness
  • 4. Overhead.
  • 5. Opportunity for other benefits

43
Caching search engine results
44
Caching search engine results
  • Server Caching
  • The server has limited resources, so it doesn't scale well with the increasing size of the Internet.
  • We can't reduce the number of requests received by the server.
  • Server caching has small overhead.
  • It allows maximum query shareness, and the hit rate would be high by caching popular queries.

45
Caching search engine results
  • Proxy Caching
  • It is effective in reducing both server workload and network traffic.
  • In the case of caching query results, this assumes that the users near a proxy would share queries.
  • However, one of the main disadvantages of proxy caching is the significant overhead of placing dedicated proxies across the Internet.

46
Caching search engine results
  • User Side Caching
  • User-side caching achieves the best scalability.
  • Because the overhead of caching is amortized over a large number of users, the overhead at each user side is small.
  • User-side caching makes it possible to prefetch or improve query results based on individual user requirements.
  • No shareness can be exploited with user-side caching.
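A user-side cache with the short expiration times recommended later in the talk can be sketched in a few lines. This is an illustrative design, not code from the paper; the class name, TTL value, and injectable clock are my own choices (the clock makes the expiry behavior testable).

```python
import time

class UserQueryCache:
    """Minimal user-side query-result cache: entries expire after a short
    TTL (hours), matching the temporal locality observed in the traces."""
    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self.entries = {}           # query -> (results, insert_time)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is None:
            return None
        results, t = hit
        if self.clock() - t > self.ttl:   # stale: evict and report a miss
            del self.entries[query]
            return None
        return results

    def put(self, query, results):
        self.entries[query] = (results, self.clock())
```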

47
Caching search engine results
  • The degree of shareness tells us where query results should be cached.
  • Two extreme cases:
  • 1. Users never repeat their own queries (maximum degree of shareness): cache query results at servers/proxies.
  • 2. Users never share queries (no shareness): cache query results at the user side.

48
Caching search engine results
  • 32% to 42% of queries are repeated.
  • 16% to 22% of queries are repeated by the same user.
  • Queries repeated by the same user can be cached at the user side.
  • The rest of the repeated queries can be cached at the server/proxy side.
  • Multiple-word queries should be cached at the user side.

49
Caching search engine results
  • How long should query results be cached?
  • Temporal query locality indicates that most queries are repeated within short time intervals.
  • With user-side caching, query results should be cached for hours. This also helps to remove or update stale query results in time.
  • Long-term caching, such as for a couple of days, should be done at servers/proxies.

50
Caching search engine results
  • Time to live (TTL) tells whether the query result has been in the cache for too long and should be discarded. It is set to a relatively short time interval.
  • The If-Modified-Since request-header field can be used in each user request. If the requested resource has not been modified since the time specified in this field, a copy of the resource will not be returned by the server. This guarantees query freshness but has an overhead.
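The If-Modified-Since mechanism described above is the standard HTTP conditional-request flow: the server compares the header against the time the cached result was last computed and answers 304 Not Modified with no body when the client's copy is still fresh. A minimal sketch of the server-side check (function name and return convention are illustrative):

```python
from email.utils import parsedate_to_datetime, format_datetime
from datetime import datetime, timezone

def respond(last_computed, if_modified_since=None):
    """Sketch of the server-side If-Modified-Since check.
    last_computed: aware datetime when the query results were last computed.
    if_modified_since: the request header value (HTTP-date string) or None.
    Returns (status_code, body)."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if last_computed <= since:
            return 304, None              # client's cached copy is fresh
    return 200, "new query results"       # recompute and send full results
```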

51
Caching search engine results
  • For server side and proxy caching, we should use
    different mechanisms to maintain cache
    consistency
  • With server side caching, we can easily ensure
    cache consistency by removing or updating stale
    query results whenever new query results are
    computed

52
Caching search engine results
  • For proxy caching, queries shared by different users repeat over longer time intervals, so a TTL-based approach would incur more overhead, since TTL is usually set to a relatively short interval to prevent caches from serving stale data.
  • Advanced protocols should be used to update stale query results while keeping the overhead low.
  • A proxy-polling technique can be used, where proxies periodically check back with the server to determine whether cached objects are still valid.

53
Prefetching Search Engine Results
  • The lexicon analysis suggests that prefetching query results based on user fre-lexicons is promising.
  • Prefetching has always been an important method for reducing user access latency.

54
Prefetching Search Engine Results
  • The user lexicon analysis shows that the majority of users have small lexicon sizes.
  • Prefetching can be done by enumerating all the word combinations from fre-lexicons and prefetching the corresponding query results into a user-level cache.
  • The analysis shows that the same performance improvement can be achieved while skipping queries longer than 4 words.
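The enumeration step can be sketched with `itertools.combinations`. This is a simplified illustration of the idea (unordered word combinations, which matches the default ALL/AND semantics where word order does not matter); the function name and the sorting are my own choices.

```python
from itertools import combinations

def enumerate_prefetch_queries(fre_lexicon, max_terms=4):
    """Enumerate candidate queries as unordered combinations of up to
    max_terms words from a user's fre-lexicon, ready to be prefetched
    into a user-level cache."""
    words = sorted(fre_lexicon)    # deterministic order for reproducibility
    out = []
    for k in range(1, max_terms + 1):
        out.extend(" ".join(c) for c in combinations(words, k))
    return out
```

Capping `max_terms` at 4 keeps the candidate set polynomial in the lexicon size rather than letting it explode combinatorially, which is the point made on the slide.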

55
Prefetching Search Engine Results
  • When user interests stay relatively stable, the majority of the words in a fre-lexicon remain the same, so there are fewer new query results to fetch during prefetching; the overhead at the user side is thus small.
  • As user interests change gradually, the fre-lexicons should also be updated to match. New queries will be formed and their results will need to be prefetched to achieve the best cache hit rates.

56
Prefetching Search Engine Results
  • The same algorithm can be used at the server side, though redundant queries may be sent to the server.
  • If the prefetching algorithm runs regularly, when user machines are idle and servers are not busy, most of these requests will be processed by servers at non-peak times.
  • Hence, server peak-time workload can even be reduced, since many of the queries will already have been satisfied by user prefetching.

57
Prefetching Search Engine Results
  • Prefetching can also be performed at proxies. In such cases, the server's global knowledge about user query patterns can be utilized to decide what and when to prefetch.
  • Since proxies allow query shareness among different users, exploring how to achieve the maximum hit rate is left as future work.

58
Improving Query Result Rankings
  • Search engines return thousands of results, but users actually examine only a few of them, i.e., fewer than two pages of results.
  • Improving query result rankings based on individual user requirements is more important than ever.
  • With user-side/proxy caching, it is now possible to re-rank the returned search engine results based on the unique interests of the individual user.
  • For example, a naive algorithm would be to increase the ranks of the Web pages previously visited by the user among the next query results.
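The naive algorithm mentioned above can be sketched as a stable re-sort: previously visited pages are promoted ahead of unvisited ones, with the original engine order preserved within each group. This is one possible reading of the slide's example, not the authors' implementation.

```python
def rerank(results, visited):
    """Promote previously visited pages in a result list.
    results: list of URLs in the order returned by the search engine.
    visited: set of URLs the user has visited before."""
    return ([r for r in results if r in visited] +      # boosted group
            [r for r in results if r not in visited])   # original order kept
```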

59
Conclusion
  • In conclusion, we answer again the questions that we had at the beginning:
  • Where should we cache search engine results?
  • How long should we cache search results?
  • What are the other benefits of caching search
    engine results?