Web Workload Characterization
Web Protocols and Practice
Topics
Web Workload Definition
Workload Characterization
Statistics and Probability Distributions
HTTP Message Characteristics
Web Resource Characteristics
User Behavior Characteristics
Applying Workload Models
Web Workload Definition
Important performance metrics, such as user-perceived latency and server throughput, depend on the interaction of numerous protocols and software components. A workload consists of the set of all inputs a system receives over a period of time. Web workload models are used to generate request traffic for comparing the performance of different proxy and server implementations.
Developing a workload model involves three main steps:
Identifying the important workload parameters
Analyzing measurement data to quantify these parameters
Validating the model against reality
Constructing a workload model requires an understanding of statistical techniques for analyzing measurement data and representing the key properties of Web traffic.
The key properties of Web workloads are:
HTTP message characteristics
Resource characteristics
User behavior
Workload Characterization
A workload model consists of a collection of parameters that represent the key features of the workload that affect resource allocation and system performance. A workload model can be applied to a variety of performance-evaluation tasks, such as:
Identifying performance problems
Benchmarking Web components
Capacity planning
There are several approaches to workload modeling.
Trace-driven workload:
Constructs requests directly from an existing log or trace
Reproduces a known workload
Avoids the intermediate step of analyzing the traffic
Does not provide flexibility for experimenting with changes to the workload
Offers no clear separation between the load and the performance
Stress testing:
Sends requests as fast as possible to evaluate a proxy or a server under heavy load
May not present realistic traffic patterns
Synthetic workload:
Derives from an explicit mathematical model that can be inspected, analyzed, and criticized
Represents the key properties of real Web traffic
Explores system performance in a controlled manner by changing the parameters associated with each probability distribution
To ensure that a workload model is representative of real workloads, the parameters of the model should have certain properties:
Decoupling from the underlying system
Proper level of detail
Independence from other parameters (Table 10.1)
Table 10.1. Examples of Web workload parameters

Parameter | Category
Request method, response code | Protocol
Content type, resource size, response size, popularity, modification frequency, temporal locality, number of embedded resources | Resource
Session interarrival times, number of clicks per session, request interarrival times | Users
Statistics and Probability Distributions
Statistics such as the mean, median, and variance capture the basic properties of many workload parameters.
The mean is the average value of the parameter.
The median is the middle value: half of the values are smaller than the median and the other half are larger.
The variance, or standard deviation, quantifies how much the parameter varies from the average value.
For the sequence 4100, 4700, 4200, 20,000, 4000 bytes:
mean size = 7400 bytes
median size = 4200 bytes
For the sequence 4100, 4700, 4200, 4800, 4000 bytes:
mean size = 4360 bytes
median size = 4200 bytes
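The example above can be checked with Python's standard statistics module; note how the single 20,000-byte outlier pulls the mean far above the median while leaving the median untouched.

```python
import statistics

# The five response sizes (bytes) from the example; the 20,000-byte
# outlier pulls the mean far above the median.
with_outlier = [4100, 4700, 4200, 20000, 4000]
without_outlier = [4100, 4700, 4200, 4800, 4000]

print(statistics.mean(with_outlier))       # 7400
print(statistics.median(with_outlier))     # 4200
print(statistics.mean(without_outlier))    # 4360
print(statistics.median(without_outlier))  # 4200
```

This sensitivity of the mean to a few very large values is exactly why heavy-tailed Web size distributions are better summarized by the median or by the full distribution.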
Probability distributions capture how a parameter varies over a wide range of values.
For the sequence 4100, 4700, 4200, 20,000, 4000 bytes:
F(x) = P(X <= x)
Example of a cumulative distribution function (CDF)
For the sequence 4100, 4700, 4200, 20,000, 4000 bytes:
Fc(x) = P(X > x) = 1 - F(x)
Figure 10.1. Example of a complementary cumulative distribution function (CCDF)
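An empirical CDF and CCDF for the five-sample sequence can be sketched in a few lines (a simple linear-scan implementation, fine for small samples):

```python
def empirical_cdf(samples):
    """Return F, where F(x) is the fraction of samples with value <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: sum(1 for s in ordered if s <= x) / n

sizes = [4100, 4700, 4200, 20000, 4000]
F = empirical_cdf(sizes)

print(F(4200))      # 0.6 -- three of the five sizes are <= 4200 bytes
print(1 - F(4200))  # CCDF Fc(4200) = 0.4
print(F(20000))     # 1.0 -- no size exceeds the largest sample
```

Plotting Fc(x) on a logarithmic scale, as in Figure 10.1, is the usual way to make a heavy tail visible.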
Several probability distributions have been widely applied to workload characterization. One of the most popular is the exponential distribution, with the form F(x) = 1 - e^(-lambda*x) and mean 1/lambda.
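A quick sketch of sampling from an exponential distribution with the standard library; the rate value is an illustrative assumption, not a figure from the text.

```python
import random

random.seed(1)        # reproducible illustration
rate = 1.0            # lambda; the distribution's mean is 1/lambda

# Draw 100,000 exponential samples and compare the sample mean to 1/lambda
samples = [random.expovariate(rate) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

# With this many samples the mean lands very close to 1/lambda = 1.0
print(round(sample_mean, 2))
```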
Relating a measured distribution to an equation requires justifying the hypothesis that the equation can accurately represent the measured data. Justifying this hypothesis involves two key steps:
The measured data is fitted to the equation to determine the value of each parameter.
Statistical tests are performed to compare the resulting equation with the measured data.
In some cases, no single well-known distribution matches the measured data. It may be necessary to represent different parts of the measured distribution with different equations.
HTTP Message Characteristics
HTTP request methods
HTTP response codes
HTTP Request Methods
Knowing which request methods arise in practice is useful for optimizing server implementations and developing realistic benchmarks for evaluating Web proxies and servers.
Traffic characteristics:
The overwhelming majority of Web requests use the GET method to fetch resources and invoke scripts.
A small fraction of HTTP requests use the POST method to submit data in forms.
Measurements show a small number of HEAD requests, used to test whether a Web server is operational.
Web Distributed Authoring and Versioning (WebDAV) uses the PUT and DELETE methods frequently.
The emergence of tools for testing and debugging Web components may increase the use of the TRACE method.
The exact distribution of request methods varies from site to site.
HTTP Response Codes
Knowing how servers respond to client requests is an important part of constructing a realistic model of Web workloads.
Traffic characteristics:
200 OK accounts for 75 to 90 percent of responses.
304 Not Modified accounts for 10 to 30 percent of responses.
The other redirection (3xx) codes and the client error (4xx) codes are the next most common.
206 Partial Content may become more common as servers return a range of bytes from the requested resource.
302 Found is used for redirection responses; its frequency varies from site to site.
Web Resource Characteristics
Content type
Resource size
Response size
Resource popularity
Modification frequency (resource changes)
Temporal locality
Number of embedded resources
Understanding the characteristics of Web resources is an important part of modeling Web workloads.
Resources vary in terms of:
How big they are
How popular they are
How often they change
Content Type
Content type has a direct relationship to other key workload parameters, such as resource size and modification frequency.
Traffic characteristics:
The overwhelming majority of resources are text content (plain text and HTML) and images (JPEG and GIF).
The remaining content types include documents such as PostScript and PDF, software such as JavaScript or Java applets, and audio and video data.
The emergence of new applications can influence the distribution of content types.
Resource Sizes
The sizes of Web resources affect:
The storage requirements at the origin server
The overhead of caching resources at browsers and proxies
The load on the network
The latency in delivering the response message
Traffic characteristics:
The average resource size is relatively small:
Average size of an HTML file: 4 to 8 KB
Median size of an HTML file: 2 KB
Average size of an image: 14 KB
Knowing the distribution of resource sizes at Web sites is useful for deciding how to allocate memory or disk space at a server or proxy. The high variability in resource sizes is captured by the Pareto distribution, F(x) = 1 - (k/x)^a for x >= k, with mean ak/(a - 1) when a > 1, where a is a shape parameter and k is a scale parameter.
Figure 10.2. Exponential and Pareto distributions (with mean of 1)
Figure 10.3. Exponential and Pareto distributions on a logarithmic scale
Figure 10.4. Lognormal distribution
Response Sizes
In analyzing server and network performance, the size of response messages is the more important factor.
Traffic characteristics:
Response sizes may differ from resource sizes for a variety of reasons:
Some HTTP response messages do not have a message body.
Some Web resources are never requested and so do not contribute to the set of response messages.
Some responses are aborted before they complete, resulting in shorter transfers.
The median of the response-size distribution is several hundred bytes smaller than the median resource size.
Response sizes can be represented by a combination of the lognormal and Pareto distributions.
The response-size distribution has a heavy tail.
These factors suggest that the distribution of response sizes is not the same as the distribution of resource sizes.
Resource Popularity
The popularity of the various resources at a Web site has important performance implications. The most popular resources are likely to reside in main memory at the origin server, obviating the need to fetch the data from disk.
Traffic characteristics:
Popularity is measured in terms of the proportion of requests that access a particular resource.
The probability mass function (pmf) P(r) captures the proportion of requests directed to each resource.
The proportion of requests for a resource follows Zipf's law, P(r) = k/r, where r is the rank of the object (the most popular resource has rank 1) and k is a constant that ensures that P(r) sums to 1. (Figure 10.5)
Figure 10.5. Zipf's law
More generally, a Zipf-like distribution has the form P(r) = k/r^c for some constant c. The extreme case of c = 0 corresponds to all resources having equal popularity. Early studies of requests to Web servers found c values close to 1. More recent studies show values for c in the range of 0.75 to 0.95.
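A Zipf-like popularity profile over a finite catalog can be sketched directly; the catalog size of 1000 resources is an illustrative assumption, while c = 0.9 sits inside the 0.75 to 0.95 range cited above.

```python
def zipf_like(n, c):
    """Popularity P(r) = k / r**c over ranks 1..n, with k chosen so the
    probabilities sum to 1 (the finite-catalog form of a Zipf-like law)."""
    weights = [1.0 / r ** c for r in range(1, n + 1)]
    k = 1.0 / sum(weights)
    return [k * w for w in weights]

# c = 0.9 is within the range reported for Web servers; n is assumed
popularity = zipf_like(1000, 0.9)

# The most popular resource draws far more traffic than the least popular:
# the ratio is (1000/1)**0.9, roughly 500
print(popularity[0] > 100 * popularity[-1])
```

Setting c = 0 in the same function yields the equal-popularity extreme, since every weight becomes 1.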
Resource Changes
Web resources change over time as a result of modifications at the origin server. Modifications to resources affect the performance of Web caching. Resources that change less often may be given preference in caching or revalidated with the origin server less frequently.
Traffic characteristics:
Images do not change very often.
Text and HTML files change more often than images.
Some resources change in a periodic fashion (for example, news stories).
The Expires header could indicate the next time that a cached resource will change.
Accurate timing information in the HTTP response message can reduce the load on the origin server as well as the user-perceived latency for accessing the resource.
An accurate model of Web workloads needs to consider the frequency of resource changes.
Temporal Locality
The time between successive requests for the same resource has a significant effect on Web traffic. Resource popularity indicates the frequency of requests without indicating the spacing between the requests. Temporal locality captures the likelihood that a requested resource will be requested again in the near future.
Testing a server with a benchmark that has low temporal locality would underestimate the potential throughput. High temporal locality also increases the likelihood that a request is satisfied by a browser or proxy cache and reduces the likelihood that a resource has changed since the previous access.
Traffic characteristics:
Temporal locality can be measured by sequencing through the stream of requests, putting each requested resource at the top of a stack, and noting the position in the stack (the stack distance) of the previous access to each resource.
A small stack distance suggests high temporal locality.
The stack distances of requests for a resource follow a lognormal distribution.
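The stack-distance procedure described above can be sketched as a move-to-front stack; this linear-scan version is fine for small traces, and the resource names are made up for illustration.

```python
def stack_distances(requests):
    """For each request, record the depth of the resource in an LRU
    (move-to-front) stack, or None on the first reference."""
    stack = []       # most recently requested resource at index 0
    distances = []
    for resource in requests:
        if resource in stack:
            depth = stack.index(resource) + 1   # 1 = top of the stack
            stack.remove(resource)
        else:
            depth = None                        # first access to this resource
        distances.append(depth)
        stack.insert(0, resource)               # move to front
    return distances

# Repeated accesses to /home reappear at small depths: high temporal locality
trace = ["/home", "/a.gif", "/home", "/b.gif", "/home"]
print(stack_distances(trace))  # [None, None, 2, None, 2]
```

A trace with mostly small stack distances would keep a small cache effective; large distances mean re-references arrive long after the resource was pushed down the stack.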
Number of Embedded Resources
Embedded resources include images, JavaScript programs, and other HTML files that appear as frames in the containing Web page. The number of embedded references in a Web page has a significant impact on the server and network load.
Traffic characteristics:
Web pages have a median of 8 to 20 embedded resources.
The distribution has high variability, following the Pareto distribution.
The number of embedded images has tended to increase over time as more users have high-bandwidth connections to the Internet.
A large number of embedded resources does not necessarily translate into a large number of requests to the Web server:
A cached copy of an embedded resource may be available.
Some embedded images do not reside at the same Web server as the containing Web page.
User Behavior Characteristics
Web workload characteristics depend on the behavior of users as they download Web pages from various sites. The workload introduced by a single user can be modeled at three levels:
Session: The series of requests by a single user to a single Web site could be viewed as a logical session.
Click: A user performs one or more clicks to request Web pages.
Request: Each click triggers the browser to issue an HTTP request for a resource. The client may establish a new TCP connection for a request or send the request on an existing TCP connection.
Each session arrival brings a new user to the site. Session arrivals can be studied by considering the time between the start of one user session and the start of the next.
Session interarrival times follow an exponential distribution. Exponential interarrival times correspond to a Poisson process, in which users arrive independently of one another. The exponential distribution is not, however, an accurate model of the interarrival times of TCP connections and HTTP requests.
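Simulating Poisson session arrivals reduces to accumulating exponential gaps; the rate and horizon below are illustrative assumptions.

```python
import random

def poisson_session_arrivals(rate, horizon):
    """Session start times over [0, horizon): a Poisson process is
    simulated by accumulating exponential interarrival gaps."""
    t, starts = 0.0, []
    while True:
        t += random.expovariate(rate)   # gap ~ Exponential(rate)
        if t >= horizon:
            return starts
        starts.append(t)

random.seed(3)
# Assumed rate for illustration: 2 sessions per second over 1000 seconds
arrivals = poisson_session_arrivals(2.0, 1000.0)

# The number of arrivals concentrates around rate * horizon = 2000
print(1800 < len(arrivals) < 2200)
```

Using the same recipe for individual HTTP requests would be a mistake, per the point above: request arrivals are burstier than a Poisson process allows.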
A workload model that assumes that HTTP requests arrive as a Poisson process would underestimate the likelihood of heavy-load periods and would overestimate the potential performance of the Web server. The number of clicks associated with user sessions has considerable influence on the load on a server.
The number of clicks per session follows a Pareto distribution, suggesting that some sessions involve a much larger number of clicks than others.
The time between successive requests (the request interarrival time) by each user has important implications for the server and network load.
The time between the downloading of one page and its embedded images and the user's next click is referred to as think time or quiet time.
The characteristics of user think times influence the effectiveness of policies for closing persistent connections.
Most interrequest times are less than 60 seconds.
Think times follow a Pareto distribution with a heavy tail, with the shape parameter a around 1.5.
Heavy-tailed distributions apply to numerous properties of Web traffic:
Resource sizes
Response sizes
The number of embedded references in a Web page
The number of clicks per session
The time between successive clicks
A Web session can be modeled as a sequence of on/off periods, in which each on period corresponds to downloading a Web page and its embedded images and each off period corresponds to the user's think time.
The durations of the on and off periods both follow heavy-tailed distributions. The load on Web servers and the network exhibits a phenomenon known as self-similarity, in which the traffic varies dramatically on a variety of time scales, from microseconds to several minutes.
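The on/off session model can be sketched by alternating Pareto-distributed periods. The standard library's random.paretovariate draws from a Pareto distribution with scale 1, and the shape value here is an illustrative assumption.

```python
import random

def on_off_session(n_pages, shape=1.5):
    """One session as alternating (state, duration) pairs: an 'on' period
    (downloading a page and its embedded images) followed by an 'off'
    period (user think time), both drawn from a heavy-tailed Pareto.
    The shape value is an assumption, not a measured parameter."""
    periods = []
    for _ in range(n_pages):
        periods.append(("on", random.paretovariate(shape)))
        periods.append(("off", random.paretovariate(shape)))
    return periods

random.seed(4)
session = on_off_session(3)
print([state for state, _ in session])  # ['on', 'off', 'on', 'off', 'on', 'off']
```

Superimposing many such heavy-tailed on/off sources is one standard explanation for the self-similarity of aggregate Web traffic.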
Applying Workload Models
A deeper understanding of Web workload characteristics can drive the creation of a workload model for evaluating Web protocols and software components. Generating synthetic traffic involves sampling the probability distribution associated with each workload parameter. (Table 10.2)
Table 10.2. Probability distributions in Web workload models
Workload parameter | Distribution
Session interarrival times | Exponential
Response sizes (tail of distribution), resource sizes (tail of distribution), number of embedded images, request interarrival times | Pareto
Response sizes (body of distribution), resource sizes (body of distribution), temporal locality | Lognormal
Resource popularity | Zipf-like
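The sampling step can be sketched by drawing one value from each distribution family in Table 10.2 for a single user session. All numeric parameter values below (rates, shapes, lognormal moments, and the click cap) are illustrative assumptions, not measured figures.

```python
import random

def synthetic_session():
    """Draw one user session from the Table 10.2 distribution families.
    All numeric parameters are assumptions chosen for illustration."""
    gap = random.expovariate(0.01)      # session interarrival: exponential
    # Clicks per session: Pareto, capped at 50 to keep the sketch bounded
    clicks = min(int(random.paretovariate(1.5)), 50)
    # Response-size body: lognormal (tails would come from a Pareto)
    sizes = [random.lognormvariate(8.0, 1.0) for _ in range(clicks)]
    return gap, clicks, sizes

random.seed(5)
gap, clicks, sizes = synthetic_session()
print(clicks >= 1 and len(sizes) == clicks)
```

A full generator would also draw the number of embedded images and the request interarrival times from Pareto distributions and pick each requested resource from a Zipf-like popularity profile.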
Generating synthetic traffic that accurately represents a real workload is very challenging. Validation of the synthetic workload model is an important step in constructing and using a workload model. Validation is different from verification:
Verification involves testing that the synthetic traffic has the statistical properties embodied in the workload model.
Validation requires demonstrating that the performance of a system subjected to the synthetic workload matches the performance of the same system under a real workload, according to some predefined performance metrics. Synthetic workload models are also used to test servers over a range of scenarios that might not have happened in practice. Generating synthetic traffic provides an opportunity to evaluate a proxy or server in a controlled manner.
Web performance depends on the interaction between user behavior, resource characteristics, server load, and network dynamics. Synthetic workloads help address the need to evaluate and compare Web software components in a controlled manner.