Title: Web Caching
1Web Caching
2What is Web Caching
- Introducing proxy servers at certain points in the network that cache Web documents for faster client access.
- Comparable to the cache memory in a computer system.
3Proxy Cache
[Figure: clients send requests (Req.) to the proxy, which forwards them to servers; replies (Reply) flow back through the proxy.]
4How?
- Clients send requests to the proxy.
- If the requested document is in its cache, the proxy serves the request from its cache.
- Otherwise, the proxy forwards the request to the server.
- The server replies to the request through the proxy (the proxy keeps a copy of the requested document).
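The steps above can be sketched as a toy in-memory proxy. This is only an illustration: the `ProxyCache` class and the `origin` function are hypothetical names standing in for a real proxy and a real web server.

```python
from typing import Callable, Dict


class ProxyCache:
    """Toy in-memory proxy: serve from cache on a hit, fetch and store on a miss."""

    def __init__(self, fetch_from_server: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_server      # hypothetical stand-in for the origin server
        self._store: Dict[str, bytes] = {}   # URL -> cached document
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:               # document is in the cache
            self.hits += 1
            return self._store[url]
        self.misses += 1                     # otherwise forward the request to the server
        body = self._fetch(url)
        self._store[url] = body              # keep a copy on the way back
        return body


def origin(url: str) -> bytes:
    """Hypothetical origin server that simply echoes the URL."""
    return f"<html>{url}</html>".encode()


proxy = ProxyCache(origin)
proxy.get("/index.html")   # miss: fetched from the origin and stored
proxy.get("/index.html")   # hit: served from the proxy's cache
print(proxy.hits, proxy.misses)
```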
5Why Web Caching?
- Rapid growth in HTTP traffic has made it the largest part of Internet traffic, causing more network congestion and server unavailability.
- The number of static Web pages almost doubles every year.
- Some old data:
- Number of unique pages: 800M < X < 2.2B
- Number of unique web sites: 8,500,000
- Static pages: 30 - 40%
- Pages revisited: 80%
- Expected hit rate: 24 - 32%
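The expected hit-rate range is consistent with multiplying the share of static pages by the share of revisited pages. That reading is an assumption rather than something the slide states; the function name below is hypothetical.

```python
def expected_hit_rate(static_fraction: float, revisit_fraction: float) -> float:
    """Assumed model: only static pages are cacheable, and only revisits can hit."""
    return static_fraction * revisit_fraction


low = expected_hit_rate(0.30, 0.80)    # 30% static pages, 80% revisited
high = expected_hit_rate(0.40, 0.80)   # 40% static pages, 80% revisited
print(f"{low:.0%} - {high:.0%}")       # prints "24% - 32%"
```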
6Why Web Caching?
- Bandwidth
- Latency
- Performance: response time
- Server Load
- Failure Redundancy
7Expected Gains
- Bandwidth saving
- Improving content availability.
- Improving web server availability.
- Server load balancing.
- Reducing user-perceived latency
8What Content and Protocols
- HTTP 1.0: basic protocol
- Send a request using one of a fixed number of verbs
- GET
- HEAD
- POST
- Receive response, meta-data, content
9What Content and Protocols
- HTTP Request
- Request = Simple-Request | Full-Request
- Simple-Request = "GET" SP Request-URI CRLF
- Full-Request = Request-Line
- *( General-Header
- | Request-Header
- | Entity-Header )
- CRLF
- [ Entity-Body ]
10What Content and Protocols
- Example
- GET /pub/www/index.html HTTP/1.0
- Response
- HTTP/1.1 200 OK
- Server: Microsoft-IIS/5.0
- Date: Sat, 19 Oct 2002 05:46:53 GMT
- Expires: Sun, 20 Oct 2002 16:00:00 GMT
- Content-Length: 2291
- Content-Type: text/html
- Cache-control: private
11What Content and Protocols
- Example: if-modified-since
- GET /pub/www/index.html HTTP/1.0
- If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
- Response
- HTTP/1.1 200 OK
- Server: Microsoft-IIS/5.0
- Date: Thu, 13 Jul 2000 05:46:53 GMT
- Expires: Sun, 20 Oct 2002 16:00:00 GMT
- Content-Length: 2291
- Content-Type: text/html
- Cache-control: private
12What Content and Protocols
- Example: if-modified-since
- GET /pub/www/index.html HTTP/1.0
- If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
- Response
- HTTP/1.1 304 Not Modified
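The two if-modified-since outcomes (200 with a body, or 304 with none) can be sketched from the server's side as follows. This is a minimal sketch, not how any particular server implements it: the `respond` function is hypothetical, and it assumes `last_modified` is a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def respond(last_modified: datetime, body: bytes,
            if_modified_since: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET, honouring If-Modified-Since."""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            not_modified = last_modified <= since
        except (TypeError, ValueError):
            not_modified = False              # unusable date: ignore the condition
        if not_modified:
            return 304, headers, b""          # the client's copy is still current
    headers["Content-Length"] = str(len(body))
    return 200, headers, body


modified = datetime(2002, 10, 19, 5, 46, 53, tzinfo=timezone.utc)
page = b"<html></html>"
print(respond(modified, page, "Sat, 19 Oct 2002 19:43:31 GMT")[0])  # 304
print(respond(modified, page, "Fri, 18 Oct 2002 00:00:00 GMT")[0])  # 200
```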
13HTTP support for caching
- Conditional requests (IMS)
- Servers can set expires and max-age
- Request indirection: application-level routing
- Range requests, entity tags
- Cache-control header
- Requests: min-fresh, max-stale, no-transform
- Responses: must-revalidate, public, private, no-cache
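As an illustration of how a cache might read the Cache-control header, here is a minimal parser sketch. Both function names are hypothetical. The simple comma split does not handle quoted arguments that themselves contain commas, and `no-store` is a standard HTTP/1.1 directive that the list above does not mention.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into {directive: argument-or-None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def shared_cache_may_store(value: str) -> bool:
    """Rough check: a shared proxy should not store private or no-store responses."""
    directives = parse_cache_control(value)
    return "private" not in directives and "no-store" not in directives


print(parse_cache_control("max-age=3600, must-revalidate"))
print(shared_cache_may_store("private"))
```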
14Where
[Figure: where caches sit along the request path. Labels: browsers with browser caches, intranet cache, local ISP cache, data center ISP, L4 switch, CDN nodes, reverse proxy, content server.]
15Cache Types
- Proxy Caching
- Reverse Proxy Caching
- Transparent Caching
- Adaptive Caching
- Push Caching
- Active Caching
16Proxy Caching
- Harvest/Squid
- Provide web content for a fixed user base
- Deployed at the network edges (company or institutional gateway or firewall hosts)
- Standalone operation
- Manual configuration in web browsers
- Commodity product/technology
- Single point of failure
17Reverse Proxy Caching
- Designed to offload duties from one or more specific servers
- Data size is limited to the size of the static content on the server
- Challenge is fast, disk-less operation
- Cache consistency is easy
18Transparent Caching
- Intercepts HTTP requests and redirects them to web cache servers or cache clusters
- No client configuration
- Violates the end-to-end paradigm
- Client thinks it is talking directly to the server
- Server thinks it is talking to the cache
- Implemented as an L4 switch
- A Layer 4 switch makes switching decisions based on the TCP or UDP port number, e.g., 80 for HTTP
19Transparent Caching
20Adaptive Caching
- ISP-level caching, global data placement optimization
- Multiple cooperating distributed caches
- Operate as a cache mesh based on content demand
- Cache Group Management Protocol
- How meshes are formed
- How individual caches join/leave the meshes
- Content Routing Protocol: sends each request to the appropriate cache within the meshes
- Uses distributed cache meshes to solve the hot spot problem
- Caches dynamically join and leave the groups based on content demand
- Administrative boundaries must be relaxed
21Push Caching
- Keep data close to the clients requesting this information
- Send the data out proactively
- Assumption: we are able to launch caches that may cross administrative boundaries
- Incurs cost (storage and transmission)
22Active Caching
- Applies caching to dynamic documents
- 30% of client HTTP requests contain cookies
- The server provides the cache with the objects and any associated cache applets
- Use an applet inside the cache to customize dynamic pages on the fly
23Cache Placement/Deployment
- Close to clients/content consumers
- Proxy caching
- Transparent proxy caching
- Close to servers/content providers
- Improve access to logical sets of data
- Delay-sensitive data: video, audio
- Reverse proxy caching
- Push caching
- Network choke points: strategic deployment
- Adaptive caching
- Problem with administrative control
24Zipf's Law vs. Web Access
- Zipf's Law
- Web Access
- Caching?
25Zipf's Law
- Zipf's law: the frequency of an event P as a function of rank i is a power-law function
- $P_i = \Omega / i^{\alpha}$, where $\Omega$ is a constant and $\alpha$ is close to 1
26Zipf's Law
- Observed to be true for
- Frequency of written words in English texts
- Population of cities
- Income of a company as a function of rank
27Zipf's Law vs. Web Access
- For a given server, page access by rank follows Zipf's law
- Web requests from a fixed population of users follow Zipf's law with $0.64 < \alpha < 0.83$
28Observations
- Top 1% of all documents account for 20 - 35% of proxy requests
- Top 10% account for 45 - 55% of requests
- It takes 25 to 40% of all documents to account for 70% of requests
- It takes 70 to 80% of all documents to account for 90% of requests
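To get a feel for how a Zipf-like popularity curve concentrates requests on a few documents, the sketch below computes the share of requests that go to the most popular fraction of documents. The document count (100,000) and the exponent (0.75, inside the range quoted for Web requests) are illustrative assumptions, so the printed shares will not necessarily match the measured figures above.

```python
def zipf_request_share(num_docs: int, top_fraction: float, alpha: float) -> float:
    """Share of requests going to the most popular top_fraction of documents
    when the request probability of rank i is proportional to 1 / i**alpha."""
    weights = [1.0 / (i ** alpha) for i in range(1, num_docs + 1)]
    top_k = max(1, round(num_docs * top_fraction))
    return sum(weights[:top_k]) / sum(weights)


for fraction in (0.01, 0.10, 0.40, 0.80):
    share = zipf_request_share(num_docs=100_000, top_fraction=fraction, alpha=0.75)
    print(f"top {fraction:.0%} of documents -> {share:.0%} of requests")
```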
29Zipf's Law and Caching
- Discussion
- How does this help in cache design?
30Basic caching algorithm
- Pages may be
- Fresh: up to date
- Expired: current date > expiration date
- Stale: old
31Basic caching algorithm - 2
- If (page is in the cache)
  - If (page is expired or stale)
    - Get from server with if-modified-since (soft miss)
    - If not modified, get from cache
    - Else get from server
  - Else get from cache
- Else
  - Get from server
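The lookup logic can be sketched in a few lines. This is an illustrative sketch only: `Entry`, `lookup`, the `fetch` callback signature, and the `ttl` used to extend a revalidated copy are all assumptions made for the example, and "soft miss" is taken here to mean a cached copy that had to be revalidated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    body: bytes
    last_modified: float   # timestamp reported by the server
    expires: float         # time after which the copy must be revalidated


# Hypothetical origin interface: fetch(url, if_modified_since) returns None for
# "not modified", otherwise (body, last_modified, expires).
Fetch = Callable[[str, Optional[float]], Optional[Tuple[bytes, float, float]]]


def lookup(cache: Dict[str, Entry], url: str, now: float, fetch: Fetch,
           ttl: float = 60.0) -> Tuple[bytes, str]:
    entry = cache.get(url)
    if entry is not None and now < entry.expires:
        return entry.body, "hit"                   # fresh copy
    if entry is not None:                          # expired: conditional request
        result = fetch(url, entry.last_modified)
        if result is None:                         # not modified: reuse the copy
            entry.expires = now + ttl
            return entry.body, "soft miss"
    else:
        result = fetch(url, None)                  # not cached at all
    assert result is not None, "an unconditional fetch must return a body"
    body, last_modified, expires = result
    cache[url] = Entry(body, last_modified, expires)
    return body, "miss"


def origin(url: str, if_modified_since: Optional[float]):
    """Hypothetical server: one page, last changed at t=100, expiring at t=160."""
    last_modified = 100.0
    if if_modified_since is not None and if_modified_since >= last_modified:
        return None
    return b"page", last_modified, 160.0


cache: Dict[str, Entry] = {}
print(lookup(cache, "/a", now=120.0, fetch=origin)[1])   # miss
print(lookup(cache, "/a", now=130.0, fetch=origin)[1])   # hit
print(lookup(cache, "/a", now=200.0, fetch=origin)[1])   # soft miss
```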
32Basic caching algorithm - 3
- If cache has space
- Store the file
- Else
- Delete expired from cache
- Delete stale from cache
- Delete LRU from cache
- Delete largest/smallest from cache?
33Cache Replacement
- Cache size is limited, need replacement policy
- LRU
- LFU
- Greedy-dual size
- Many others
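As one concrete instance, LRU replacement over a byte-limited cache might look like the sketch below. The `LRUCache` class is a hypothetical illustration; LFU or Greedy-dual size would differ in how the eviction victim is chosen.

```python
from collections import OrderedDict
from typing import List, Optional


class LRUCache:
    """Byte-bounded cache that evicts the least recently used object first."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, url: str) -> Optional[bytes]:
        if url not in self._items:
            return None
        self._items.move_to_end(url)              # mark as most recently used
        return self._items[url]

    def put(self, url: str, body: bytes) -> None:
        if url in self._items:                    # replace any older copy
            self.used -= len(self._items.pop(url))
        if len(body) > self.capacity:
            return                                # can never fit: do not cache
        while self.used + len(body) > self.capacity:
            _, evicted = self._items.popitem(last=False)   # oldest entry first
            self.used -= len(evicted)
        self._items[url] = body
        self.used += len(body)

    def urls(self) -> List[str]:
        """Cached URLs, least recently used first."""
        return list(self._items)


cache = LRUCache(capacity_bytes=10)
cache.put("/a", b"aaaa")
cache.put("/b", b"bbbb")
cache.get("/a")              # /a becomes the most recently used
cache.put("/c", b"cccc")     # no room: /b is evicted
print(cache.urls())          # ['/a', '/c']
```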
34Cache Consistency
- Multiple copies of objects created
- How and when should the copies be renewed?
- Goals
- Avoid stale copies
- Keep non-useful traffic as low as possible
35Cache Consistency Polling
- Solution 1: polling every time
- Implemented in HTTP using the optional "if-modified-since" request header field
- Benefit: strong consistency
- Drawback: very slow cache hit
36Cache Consistency Polling
- Solution 2: polling if TTL expires (widely used)
- Associate a TTL (12 hours or 2 days) with each cached object
- Implemented in HTTP using the optional "expires" header field
- Benefit: fast cache hit
- Drawback: weak cache consistency (5% stale), because the TTL is an a priori estimate of an object's lifetime
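The trade-off between the two polling schemes can be illustrated with a small simulation of one cached object. The `simulate` function and the request and update times are made up for the example; `ttl=0` models polling on every request.

```python
from typing import List, Optional, Tuple


def simulate(request_times: List[float], update_times: List[float],
             ttl: float) -> Tuple[int, int]:
    """Return (server_contacts, stale_responses) for one cached object."""

    def version_at(t: float) -> int:
        return sum(1 for u in update_times if u <= t)   # updates applied so far

    contacts = 0
    stale = 0
    cached_version: Optional[int] = None
    validated_at = 0.0
    for t in sorted(request_times):
        if cached_version is None or t - validated_at >= ttl:
            contacts += 1                     # fetch or revalidate at the server
            cached_version = version_at(t)
            validated_at = t
        elif cached_version != version_at(t):
            stale += 1                        # served an outdated copy from cache
    return contacts, stale


requests = [0, 10, 20, 30, 40, 50]
updates = [15]                               # the object changes once, at t=15
print(simulate(requests, updates, ttl=0))    # (6, 0): always consistent, always slow
print(simulate(requests, updates, ttl=30))   # (2, 1): fast hits, one stale response
```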
37Cache Consistency
- Solution 3: invalidation protocols
- The server helps the proxy maintain consistency
- Invalidation protocols piggyback on the requests the proxy makes:
- Piggyback cache validation (PCV): the proxy provides a list of other potentially stale copies for the server to validate
- Piggyback cache invalidation (PCI): the server provides a list of copies that have been updated since the last access
- Use of volumes
- Volume lease
- The client receives a lease from the server
- During the lease validity the client can retrieve copies from the proxy
- When the lease expires the client has to renew it
- Problems: scalability; servers need to keep cache state
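Piggyback cache invalidation can be sketched as follows. This is a simplified illustration, not the published protocol: the `Server` and `Proxy` classes are hypothetical, and the server derives the invalidation list from the proxy's last contact time instead of keeping per-proxy state.

```python
from typing import Dict, List, Tuple


class Server:
    """Hypothetical origin that remembers when each URL last changed."""

    def __init__(self) -> None:
        self.bodies: Dict[str, bytes] = {}
        self.modified_at: Dict[str, float] = {}

    def update(self, url: str, body: bytes, now: float) -> None:
        self.bodies[url] = body
        self.modified_at[url] = now

    def get(self, url: str, last_contact: float) -> Tuple[bytes, List[str]]:
        """Return the body and, piggybacked, other URLs changed since last_contact."""
        changed = [u for u, t in self.modified_at.items()
                   if t > last_contact and u != url]
        return self.bodies[url], changed


class Proxy:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.cache: Dict[str, bytes] = {}
        self.last_contact = float("-inf")

    def get(self, url: str, now: float) -> bytes:
        if url in self.cache:
            return self.cache[url]                 # served without a server contact
        body, changed = self.server.get(url, self.last_contact)
        for stale_url in changed:
            self.cache.pop(stale_url, None)        # drop copies the server reports
        self.cache[url] = body
        self.last_contact = now
        return body


server = Server()
server.update("/a", b"a1", now=1)
server.update("/b", b"b1", now=1)
proxy = Proxy(server)
proxy.get("/a", now=2)
server.update("/a", b"a2", now=5)
print(proxy.get("/a", now=6))   # b'a1': still the old copy, no server contact yet
proxy.get("/b", now=7)          # the reply piggybacks the invalidation of /a
print(proxy.get("/a", now=8))   # b'a2': refetched after the invalidation
```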
38Cache Cooperation
- Hierarchical caching
- Cache servers form a hierarchy of tree-like structures
- Parent servers: at the top of the hierarchy, they receive requests from child servers. If they do not have the requested object, they ask either their own parents or the original web servers
- Sibling servers: if the local cache does not have the requested object, it asks its sibling caches. If the sibling caches do not have the object, the local cache asks the parent cache
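The resolution order (local cache, then siblings, then parent, then the original server) can be sketched with in-memory objects. The `Cache` class and the `origin` function are hypothetical, and sibling lookups are plain dictionary checks here instead of network queries.

```python
from typing import Callable, Dict, List, Optional


class Cache:
    def __init__(self, name: str, origin: Callable[[str], bytes],
                 parent: Optional["Cache"] = None) -> None:
        self.name = name
        self.origin = origin
        self.parent = parent
        self.siblings: List["Cache"] = []
        self.store: Dict[str, bytes] = {}

    def get(self, url: str, trace: List[str]) -> bytes:
        if url in self.store:
            trace.append(f"{self.name}: local hit")
            return self.store[url]
        body: Optional[bytes] = None
        for sibling in self.siblings:                 # ask the siblings first
            if url in sibling.store:
                trace.append(f"{self.name}: hit at sibling {sibling.name}")
                body = sibling.store[url]
                break
        if body is None and self.parent is not None:  # then ask the parent
            trace.append(f"{self.name}: ask parent {self.parent.name}")
            body = self.parent.get(url, trace)
        if body is None:                              # top of the hierarchy
            trace.append(f"{self.name}: fetch from origin")
            body = self.origin(url)
        self.store[url] = body
        return body


def origin(url: str) -> bytes:
    """Hypothetical original web server."""
    return b"content of " + url.encode()


parent = Cache("parent", origin)
child1 = Cache("child1", origin, parent)
child2 = Cache("child2", origin, parent)
child1.siblings = [child2]
child2.siblings = [child1]

first: List[str] = []
child1.get("/x", first)
print(first)    # child1 asks the parent, which fetches from the origin
second: List[str] = []
child2.get("/x", second)
print(second)   # child2 finds the object at its sibling
```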
40Cache Hierarchies
- Use hierarchy to scale a proxy
- Why?
- Larger population: higher hit rate (fewer compulsory misses)
- Larger effective cache size
- Why is the population of a single proxy limited?
- Performance, administration, policy, etc.
- NLANR cache hierarchy
- Most popular
- 9 top level caches
- Internet Cache Protocol based (ICP)
- Squid/Harvest proxy
- How to locate content?
41ICP (Internet cache protocol)
- Simple protocol to query another cache for content
- Uses UDP. Why?
- ICP message contents
- Type: query, hit, hit_obj, miss
- Other: identifier, URL, version, sender address
- Special message types used with the UDP echo port
- Used to probe a server or a "dumb" cache
- Query and then wait until time-out (2 sec)
- Transfers between caches are still done using HTTP
42Squid
[Figure: Squid hierarchy with one parent and three child caches. Messages shown: web page request, two ICP queries.]
43Squid
[Figure: same hierarchy. Messages shown: two ICP MISS replies.]
44Squid
[Figure: same hierarchy. Message shown: web page request.]
45Squid
[Figure: same hierarchy. Messages shown: web page request, three ICP queries.]
46Squid
[Figure: same hierarchy. Messages shown: web page request, ICP HIT, ICP MISS, ICP HIT.]
47Squid
[Figure: same hierarchy. Message shown: web page request.]
48Hierarchical caching
- Ideally, we want the cache mesh to behave as a single cache with equivalent capacity and processing capability
- ICP: many copies of popular objects are created, so capacity is wasted
- High latency: more than one hop is needed to search for an object
- How to improve? Discuss!
49Problems with caching
- Over 50% of all HTTP objects are uncacheable.
- Sources:
- Dynamic data: stock prices, frequently updated content
- CGI scripts: results based on passed parameters
- SSL: encrypted data is not cacheable
- Most web clients don't handle mixed pages well, so many generic objects are transferred with SSL
- Cookies: results may be based on passed data
- Hit metering: the owner wants to measure the number of hits for revenue, etc., and so resorts to cache busting
50Risks of Using Proxy
- Benefits: reduced latency, bandwidth savings, etc.
- Risks
- Obsolete data
- Violation of client privacy: the proxy can keep a log file recording which objects the client has requested
- Data integrity
51Real Proxy Servers
- Squid: the most widely used; it works well and is free.
- http://www.squid-cache.org/
- Microsoft ISA Server 2004: Microsoft developed ISA to replace Microsoft Proxy Server. It is fully functional with Active Directory.
- http://www.microsoft.com/isaserver/
- Apache: the Apache web server has a module for reverse caching (experimental).
- http://httpd.apache.org/docs-2.0/mod/mod_cache.html
- Cisco Cache Engine: sits next to (mostly) Cisco routers and receives transparently redirected HTTP requests.
- http://www.cisco.com/warp/public/cc/pd/cxsr/500/index.shtml
- CERN/W3C HTTPd: the original proxy server.
- http://www.w3.org/hypertext/WWW/Daemon/Status.html