Title: Analyzing Web Server Log Files
1Analyzing Web Server Log Files
- Eric Landrieu
- e-mail eland_at_perfman.com
- Lead Developer, PerfMan for Web Servers
- The Information Systems Manager, Inc.
2Growth of the Web Server
- Has become a vital part of the business model
- Internet web servers must be reliable, as they
are truly an international 24x7x365 sales
mechanism
- Content of the site(s) can be just as damaging in
users' eyes as poor performance: we have a
2-edged sword
3So how do we monitor the web server?
- OS-level tools
- Performance Monitor (Windows NT)
- SMF, RMF (OS/390)
- Third-party offerings
- Active web site monitors (give a client-side
view of the site)
- Database/Application monitoring
- Web server log files
4So how do we monitor the web server?
- No one method can give you the whole picture of
your web server's health and performance
5What's in the Log Files?
- View of client-server transactions: the client
request, with the server response
- Multiple transactions can be required for a web
page
GET /parking/space.asp
404 File Not Found
6What's in the Log Files?
- Each transaction is totally separate in the log
file
- Any user-level data must be manually grouped
using criteria available in the particular log
file
7So what is in these log files?
8Information in the log files
- Client IP: usually the IP address, but can be
resolved to a DNS name by the web server (not
recommended)
- File requested by client (including directory)
- Method used in request (GET, POST, etc.)
9Information in the log files
- Return Code: was it successful, and if not, why?
- Bytes Sent: back to the client in the response
- Referring URL: where did the user find the link
to this request?
- Browser String: tells what browser is being used
10Information in the log files
- Username: anonymous or authenticated access
- Cookie: the cookie relating to this
transaction, if any
- Bytes Received: by the server in the request
- Time Taken: by the server to process the request
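The fields listed on these three slides can be collected into a single record type. The following is a minimal Python sketch; the `LogRecord` name and its attribute names are illustrative assumptions, not part of any log format. A timestamp is included because every sample log in this deck carries one, and fields that some formats omit default to `None`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogRecord:
    """One client-server transaction from a web server log (hypothetical type)."""
    client_ip: str
    timestamp: datetime
    method: str                          # GET, POST, etc.
    path: str                            # file requested, including directory
    status: int                          # return code, e.g. 200 or 404
    bytes_sent: Optional[int] = None
    referrer: Optional[str] = None       # absent from Common Log Format
    user_agent: Optional[str] = None     # absent from Common Log Format
    username: Optional[str] = None       # None for anonymous access
    cookie: Optional[str] = None
    bytes_received: Optional[int] = None
    time_taken: Optional[int] = None     # unit depends on the server


# Example built from one of the sample requests in this deck.
rec = LogRecord(
    client_ip="64.12.97.9",
    timestamp=datetime(2001, 2, 16, 6, 59, 43),
    method="GET",
    path="/graphics/trombone.gif",
    status=200,
    bytes_sent=1050,
)
```

Keeping the optional fields explicit makes it obvious, per record, which statistics a given log format can and cannot support.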
11Standardized Log Formats
- Common Log Format (CLF)
- Extended Common Log Format
- W3C Standard
- Other formats may be product-specific, and many
are extensions of the CLF or Extended CLF formats.
12Common Log Format
- Advantages
- Supported by just about every web server ever
written
- Disadvantages
- Inflexible
- Contains very limited data: no Bytes Received,
Time Taken, User Agent (Browser), or Referer
fields available.
13Common Log Format
64.12.105.154 - - [16/Feb/2001:06:59:35 -0800] "GET /cgi-bin/Count.cgi?dfgecbhomeddB HTTP/1.0" 404 211
64.12.97.10 - - [16/Feb/2001:06:59:37 -0800] "GET /java/FixFontHeadline.class HTTP/1.0" 200 2898
64.12.97.9 - - [16/Feb/2001:06:59:43 -0800] "GET /graphics/trombone.gif HTTP/1.0" 200 1050
64.12.96.206 - - [16/Feb/2001:06:59:58 -0800] "GET /images/joinband.jpg HTTP/1.0" 200 13457
64.12.97.9 - - [16/Feb/2001:07:00:30 -0800] "GET /images/parade.jpg HTTP/1.0" 200 22754
128.93.11.53 - - [16/Feb/2001:10:20:53 -0800] "GET /schedule.shtml HTTP/1.0" 200 7103
128.93.11.53 - - [16/Feb/2001:10:26:48 -0800] "GET /index.shtml HTTP/1.0" 200 8650
128.93.11.53 - - [16/Feb/2001:10:21:18 -0800] "GET /about.shtml HTTP/1.0" 200 9151
128.93.11.53 - - [16/Feb/2001:10:26:25 -0800] "GET /communty.shtml HTTP/1.0" 200 5731
128.93.11.53 - - [16/Feb/2001:10:18:25 -0800] "GET /join.shtml HTTP/1.0" 200 5056
128.93.11.53 - - [16/Feb/2001:10:24:53 -0800] "GET /write.shtml HTTP/1.0" 200 9633
128.93.11.53 - - [16/Feb/2001:10:54:05 -0800] "GET /robots.txt HTTP/1.0" 404 204
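A Common Log Format record can be split into its fields with one regular expression. The sketch below is one possible approach in Python, not a reference parser; the `parse_clf` function and the dictionary keys are names chosen for illustration. It assumes the usual layout of host, ident, user, bracketed timestamp, quoted request, status, and bytes, and an English locale for the month abbreviation.

```python
import re
from datetime import datetime

# host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Return a dict of fields for one CLF line, or None if it does not match."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # Some servers log "-" when no body was sent.
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    # The request field is normally "METHOD path PROTOCOL".
    parts = rec["request"].split()
    rec["method"] = parts[0] if len(parts) >= 2 else None
    rec["path"] = parts[1] if len(parts) >= 2 else None
    return rec


sample = ('64.12.97.9 - - [16/Feb/2001:06:59:43 -0800] '
          '"GET /graphics/trombone.gif HTTP/1.0" 200 1050')
record = parse_clf(sample)
print(record["host"], record["method"], record["path"],
      record["status"], record["bytes"])
```

Returning `None` for lines that do not match lets the caller count and inspect malformed records instead of failing on the first one.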
14Extended Common Log Format
- Adds User Agent (Browser) and Referrer to Common
Log Format
- Advantages
- Most web servers support it
- More information available than CLF
- Disadvantages
- Still no Time Taken or Bytes Received
- Still inflexible
15Extended Common Log Format
64.12.105.154 - - [16/Feb/2001:06:59:35 -0800] "GET /cgi-bin/Count.cgi?dfgecbhomeddB HTTP/1.0" 404 211 "http://www.mycommunityband.org/" "Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98)"
64.12.97.10 - - [16/Feb/2001:06:59:37 -0800] "GET /java/FixFontHeadline.class HTTP/1.0" 200 2898 "-" "Java 1.1"
64.12.97.9 - - [16/Feb/2001:06:59:43 -0800] "GET /graphics/trombone.gif HTTP/1.0" 200 1050 "http://www.mycommunityband.org/" "Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98)"
64.12.96.206 - - [16/Feb/2001:06:59:58 -0800] "GET /images/joinband.jpg HTTP/1.0" 200 13457 "http://www.mycommunityband.org/join.shtml" "Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98)"
64.12.97.9 - - [16/Feb/2001:07:00:30 -0800] "GET /images/parade.jpg HTTP/1.0" 200 22754 "http://www.mycommunityband.org/about.shtml" "Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98)"
128.93.11.53 - - [16/Feb/2001:10:20:53 -0800] "GET /schedule.shtml HTTP/1.0" 200 7103 "-" "xyro_(xcrawler_at_cosmos.inria.fr)"
128.93.11.53 - - [16/Feb/2001:10:26:48 -0800] "GET /index.shtml HTTP/1.0" 200 8650 "-" "xyro_(xcrawler_at_cosmos.inria.fr)"
128.93.11.53 - - [16/Feb/2001:10:21:18 -0800] "GET /about.shtml HTTP/1.0" 200 9151 "-" "xyro_(xcrawler_at_cosmos.inria.fr)"
128.93.11.53 - - [16/Feb/2001:10:26:25 -0800] "GET /communty.shtml HTTP/1.0" 200 5731 "-" "xyro_(xcrawler_at_cosmos.inria.fr)"
128.93.11.53 - - [16/Feb/2001:10:18:25 -0800] "GET /join.shtml HTTP/1.0" 200 5056 "-" "xyro_(xcrawler_at_cosmos.inria.fr)"
128.93.11.53 - - [16/Feb/2001:10:24:53 -0800] "GET /write.shtml HTTP/1.0" 200 9633 "-" "xyro_(xcrawler_at_cosmos.inria.fr)"
128.93.11.53 - - [16/Feb/2001:10:54:05 -0800] "GET /robots.txt HTTP/1.0" 404 204 - -
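The two extra quoted fields make it possible to tally requests by browser and by referring page. The following Python sketch shows one way to do that, assuming the usual Extended CLF layout (the Common Log Format fields followed by a quoted referrer and a quoted user agent). The sample lines follow records from this slide, and `ECLF_PATTERN` is an illustrative name.

```python
import re
from collections import Counter

# Common Log Format fields followed by "referrer" "user agent".
ECLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

lines = [
    '64.12.97.10 - - [16/Feb/2001:06:59:37 -0800] '
    '"GET /java/FixFontHeadline.class HTTP/1.0" 200 2898 "-" "Java 1.1"',
    '64.12.97.9 - - [16/Feb/2001:06:59:43 -0800] '
    '"GET /graphics/trombone.gif HTTP/1.0" 200 1050 '
    '"http://www.mycommunityband.org/" '
    '"Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98)"',
    '64.12.97.9 - - [16/Feb/2001:07:00:30 -0800] '
    '"GET /images/parade.jpg HTTP/1.0" 200 22754 '
    '"http://www.mycommunityband.org/about.shtml" '
    '"Mozilla/4.0 (compatible; MSIE 5.5; CS 2000; Windows 98)"',
]

agents = Counter()
referrers = Counter()
for line in lines:
    m = ECLF_PATTERN.match(line)
    if m is None:
        continue  # skip lines that are not in this format
    agents[m.group("agent")] += 1
    if m.group("referrer") != "-":  # "-" means no referrer was sent
        referrers[m.group("referrer")] += 1

print(agents.most_common())
print(referrers.most_common())
```

A user agent that appears with a referrer of "-" on every request, like the crawler in the sample above, is a useful hint that the traffic is automated.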
16W3C Extended Log Format
- http://www.w3.org/TR/WD-logfile
- Advantages
- Very Flexible
- Extensible
- Disadvantages
- Not as universally supported by web servers
17W3C Extended Log Format
#Software: Microsoft Internet Information Services 5.0
#Version: 1.0
#Date: 2001-03-18 05:01:20
#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs-host cs(User-Agent) cs(Cookie) cs(Referer)
2001-03-18 05:01:20 144.249.14.154 - 144.249.252.75 GET /Default.asp - 200 40606 253 16 HTTP/1.1 entry.corp.com Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+95) SITESERVER=ID=547754cdab354b60fcd92cd09351121e -
2001-03-18 05:01:21 144.249.14.154 - 144.249.252.75 GET /corporate.css - 304 160 436 0 HTTP/1.1 entry.corp.com Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+95) SITESERVER=ID=547754cdab354b60fcd92cd09351121e;+ASPSESSIONIDGGQQGZEC=KEJNEBECDJLKONONHOOBBINF http://entry.corp.com/
2001-03-18 05:01:21 144.249.14.154 - 144.249.252.75 GET /images/vDivider2.gif - 304 209 444 0 HTTP/1.1 entry.corp.com Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+95) SITESERVER=ID=547754cdab354b60fcd92cd09351121e;+ASPSESSIONIDGGQQGZEC=KEJNEBECDJLKONONHOOBBINF http://entry.corp.com/
2001-03-18 05:01:21 144.249.14.154 - 144.249.252.75 GET /images/toc_quicklink.gif - 304 209 448 0 HTTP/1.1 entry.corp.com Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+95) SITESERVER=ID=547754cdab354b60fcd92cd09351121e;+ASPSESSIONIDGGQQGZEC=KEJNEBECDJLKONONHOOBBINF http://entry.corp.com/
2001-03-18 05:01:21 144.249.14.154 - 144.249.252.75 GET /images/region_am.jpg - 304 209 444 0 HTTP/1.1 entry.corp.com Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+95) SITESERVER=ID=547754cdab354b60fcd92cd09351121e;+ASPSESSIONIDGGQQGZEC=KEJNEBECDJLKONONHOOBBINF http://entry.corp.com/
2001-03-18 05:01:21 144.249.14.154 - 144.249.252.75 GET /images/orange_square_bullet.gif - 304 209 455 0 HTTP/1.1 entry.corp.com Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+95) SITESERVER=ID=547754cdab354b60fcd92cd09351121e;+ASPSESSIONIDGGQQGZEC=KEJNEBECDJLKONONHOOBBINF http://entry.corp.com/
2001-03-18 05:01:22 144.249.14.154 - 144.249.252.75 GET /corpnews/images/org_pointer_2.gif - 304 209 456 0 HTTP/1.1 entry.corp.com Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+95) SITESERVER=ID=547754cdab354b60fcd92cd09351121e;+ASPSESSIONIDGGQQGZEC=KEJNEBECDJLKONONHOOBBINF http://entry.corp.com/
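Because a W3C Extended log names its own columns in the `#Fields` directive, a parser can read the layout from the file instead of hard-coding it. The Python sketch below illustrates the idea under two assumptions: fields are separated by whitespace, and no field value contains a space (IIS, for example, writes `+` in place of spaces). The `parse_w3c` name is hypothetical, and the sample uses a shortened field list for brevity.

```python
def parse_w3c(lines):
    """Yield one dict per record, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives look like "#Name: value"; only #Fields matters here.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # malformed record, or no #Fields directive seen yet
        yield dict(zip(fields, values))


log = """\
#Software: Microsoft Internet Information Services 5.0
#Version: 1.0
#Date: 2001-03-18 05:01:20
#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes cs-bytes time-taken
2001-03-18 05:01:20 144.249.14.154 GET /Default.asp 200 40606 253 16
2001-03-18 05:01:21 144.249.14.154 GET /corporate.css 304 160 436 0
""".splitlines()

records = list(parse_w3c(log))
for rec in records:
    print(rec["cs-uri-stem"], rec["sc-status"], rec["time-taken"])
```

Re-reading `#Fields` whenever it appears also copes with a log whose field selection changes partway through the file.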
18Which Format(s) Does My Web Server Support
Server                                         Common Log Format  Extended CLF  W3C Extended Log Format
Apache                                         Default            Available     No
Microsoft IIS                                  Available          No            Default
IBM HTTP Server (WebSphere) (based on Apache)  Default            Available     No
iPlanet Web Server                             Default            Available     No
WebSite Pro (O'Reilly)                         Available          No            Available
Lotus Domino                                   Default            Available     No
19Which Format(s) Does My Web Server Support
Server                     Common Log Format  Extended CLF  W3C Extended Log Format
AOLServer                  Default            Available     No
Zeus Web Server            Default            Available     No
Xitami                     Available          Default       No
I/Net Commerce Server/400  Default            No            No
WebStar (Mac)              Available          No            Available
Servertec Internet Server  Available          Available     Available
20Limitations
- -or-
- Why we can't ignore other sources of information
21Log File Limitations
- Not enough information to get the whole picture
of the site's performance and health
- We need to correlate the log data with other
sources:
- OS-level statistics (Performance Monitor, SMF,
3rd party)
- Active web analysis (e.g. Keynote)
- Data on databases or other components of the site
[Diagram: Client, Internet, Web Server, Back End DB]
25Log File Limitations
- Only when the log data is fit together with the
other pieces do we get the complete picture of
your total web site health.
[Diagram: Client, Internet, Web Server, Back End DB]
26Log File Limitations
- You may also have to deal with log file formats
which don't include all of the information that
you would like.
[Diagram labels: Common Log Format; Bytes Received, Time Taken, Referrer, User Agent]
27Issues With Log Files
- User or Session level statistics
- Caching
- Clustering
- What constitutes a site?
28User or Session Level Statistics
- The server doesn't give you statistics for the
user (e.g. how long were they on the site?)
- You have to mine these yourself from the data
available
- You will only be able to get approximations with
this data, not exact figures
29How do we group records for user-level statistics?
- Client's IP address
- Proxy servers and firewalls with Network Address
Translation (NAT) will make all users behind
the firewall look like one user
- If the proxy or firewall has multiple IP
addresses (or it is an array), multiple accesses
of the site from one user may look like multiple users
30How do we group records for user-level statistics?
- Cookies
- If the site assigns cookies to track users
through the site, you can group the records based
upon the cookie
- Users who disable cookies in their browser break
this grouping
- Not all log file formats include the cookie
31How do we group records for user-level statistics?
- User name
- Useful for intranets, but you must have the server
disallow anonymous access
- Impractical for most internet sites (except
restricted-access ones)
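The three grouping criteria above can be combined into a single key function. The sketch below is one possible policy, not a prescribed one: it prefers the user name, then the cookie, and falls back to the client IP address. The `user_key` and `group_by_user` names, the record layout, and the sample values are all made up for illustration.

```python
from collections import defaultdict


def user_key(record):
    """Pick the most specific identifier available for a record."""
    username = record.get("username")
    if username and username != "-":
        return ("user", username)
    cookie = record.get("cookie")
    if cookie and cookie != "-":
        return ("cookie", cookie)
    # Least reliable: proxies and NAT can merge or split users.
    return ("ip", record["client_ip"])


def group_by_user(records):
    """Group records by the key chosen above."""
    groups = defaultdict(list)
    for rec in records:
        groups[user_key(rec)].append(rec)
    return dict(groups)


# Hypothetical records: one user seen through two proxy addresses,
# plus one cookieless client.
records = [
    {"client_ip": "144.249.14.154", "cookie": "abc123", "path": "/Default.asp"},
    {"client_ip": "144.249.14.155", "cookie": "abc123", "path": "/corporate.css"},
    {"client_ip": "128.93.11.53", "cookie": "-", "path": "/index.shtml"},
]
groups = group_by_user(records)
for key, recs in groups.items():
    print(key, len(recs))
```

Tagging each key with its source ("user", "cookie", or "ip") keeps the weaker IP-based groups distinguishable in later reports.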
32Caching
- Content from the web site may be cached outside
of the web server
- The web server may not get notification of
requests for content that are serviced by these
caches
- The caches may be in proxy servers, browsers, or
elsewhere
33Clustering
- Each server in a web cluster may maintain its own
log file
- You have to combine the log files to get
information relevant to the entire site
- One user accessing your site may get data from
multiple servers
- You may still want information on each individual
server, to verify that they are load-balancing
properly
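Combining per-server logs amounts to merging several time-ordered streams while remembering which server each record came from. A minimal Python sketch follows, assuming each server's records are already parsed, sorted by time, and stamped by clocks that are reasonably synchronized. The record layout and server names are hypothetical.

```python
import heapq
from datetime import datetime


def merge_logs(*server_logs):
    """Merge per-server record lists, each sorted by time, into one stream."""
    return heapq.merge(*server_logs, key=lambda rec: rec["time"])


# Hypothetical parsed records from two servers in one cluster.
server_a = [
    {"server": "A", "time": datetime(2001, 3, 18, 5, 1, 20), "path": "/Default.asp"},
    {"server": "A", "time": datetime(2001, 3, 18, 5, 1, 22), "path": "/images/region_am.jpg"},
]
server_b = [
    {"server": "B", "time": datetime(2001, 3, 18, 5, 1, 21), "path": "/corporate.css"},
]

merged = list(merge_logs(server_a, server_b))
for rec in merged:
    print(rec["time"].time(), rec["server"], rec["path"])
```

`heapq.merge` reads its inputs lazily, so large files need not be loaded whole. A log that is not strictly in time order has to be sorted before merging.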
34What constitutes a web site?
- You have to decide exactly what you want to call
a site
- A load-balanced cluster
- A single site running on a dedicated server
- A single site on a server running multiple sites
- A directory within a site on a server
- Multiple servers which act as your web presence
(home, support, e-commerce)
35What good is analyzing log files?
- OS-level analysis can't
- Provide user (session)-level info
- Break down by return code, file type or name,
directory, etc.
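The breakdowns named above (return code, file type, directory) are simple tallies over parsed records. The Python sketch below shows one way to compute them; the `breakdown` function and the record layout are illustrative assumptions, and the sample paths are taken from the Common Log Format example earlier in the deck.

```python
import posixpath
from collections import Counter


def breakdown(records):
    """Count requests by return code, by file extension, and by directory."""
    by_status, by_type, by_dir = Counter(), Counter(), Counter()
    for rec in records:
        path = rec["path"].split("?", 1)[0]  # drop any query string
        by_status[rec["status"]] += 1
        ext = posixpath.splitext(path)[1].lower() or "(none)"
        by_type[ext] += 1
        by_dir[posixpath.dirname(path) or "/"] += 1
    return by_status, by_type, by_dir


records = [
    {"path": "/cgi-bin/Count.cgi?dfgecbhomeddB", "status": 404},
    {"path": "/graphics/trombone.gif", "status": 200},
    {"path": "/images/joinband.jpg", "status": 200},
    {"path": "/images/parade.jpg", "status": 200},
    {"path": "/robots.txt", "status": 404},
]
by_status, by_type, by_dir = breakdown(records)
print(by_status.most_common())
print(by_type.most_common())
print(by_dir.most_common())
```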
36What good is analyzing log files?
- Active monitoring
- Gives the client-side perspective
- May not distinguish between a slow link/router
and a slow response from the server
- Some are concerned only with response to the
testing system, not server load
- If a browser-based product, it may have trouble
with browser incompatibilities
37So what's the key to analyzing log files?
- Grouping your log file records into useful
statistics that will help you understand what is
going on with your site
38Example 404 Errors
- When a user gets a 404 Error (File Not Found),
they may perceive a lack of professionalism or
quality in your site.
- You want to know not only what non-existent files
are being requested, but why they are being
requested (outdated link?)
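Pairing each missing file with the pages that referred to it answers both questions at once: what is being requested, and from where. The Python sketch below assumes records parsed from a format that includes the referrer, such as Extended CLF. The `report_404s` name and the record layout are illustrative, and the sample records are simplified from the examples earlier in the deck.

```python
from collections import Counter, defaultdict


def report_404s(records):
    """Map each missing path to a Counter of the referrers that led to it."""
    missing = defaultdict(Counter)
    for rec in records:
        if rec["status"] == 404:
            missing[rec["path"]][rec.get("referrer") or "-"] += 1
    return missing


records = [
    {"path": "/cgi-bin/Count.cgi", "status": 404,
     "referrer": "http://www.mycommunityband.org/"},
    {"path": "/graphics/trombone.gif", "status": 200,
     "referrer": "http://www.mycommunityband.org/"},
    {"path": "/robots.txt", "status": 404, "referrer": "-"},
]

missing = report_404s(records)
# List the most frequently missed files first, with their top referrers.
for path, refs in sorted(missing.items(), key=lambda kv: -sum(kv[1].values())):
    print(path, sum(refs.values()), refs.most_common(3))
```

A referrer on your own site points to a broken internal link you can fix directly. A referrer of "-" usually means a typed address, a bookmark, or a crawler.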
39Example 404 Errors
40Example 404 Errors
41Example User Session Time
- You want to get as useful an approximation as
possible of how long users are staying at your
site (at least, marketing will)
- Obviously, the longer they are browsing your
site, the more interested they may be in what you
have to offer
- You can use their first and last requests for
files to get a rough approximation
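The first-request to last-request approximation can be sketched as follows in Python. One addition is not from the slides: a 30-minute inactivity timeout, used so that a return visit later in the day is counted as a new session. The `session_lengths` name is hypothetical, and the last timestamp in the sample is made up to show the split.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cutoff, not from the slides


def session_lengths(times, timeout=SESSION_TIMEOUT):
    """Split one user's request times into sessions; return each duration."""
    durations = []
    start = last = None
    for t in sorted(times):
        if start is None:
            start = last = t
        elif t - last > timeout:
            # Gap too long: close the current session and open a new one.
            durations.append(last - start)
            start = last = t
        else:
            last = t
    if start is not None:
        durations.append(last - start)
    return durations


times = [
    datetime(2001, 2, 16, 10, 18, 25),
    datetime(2001, 2, 16, 10, 20, 53),
    datetime(2001, 2, 16, 10, 26, 48),
    datetime(2001, 2, 16, 10, 54, 5),
    datetime(2001, 2, 16, 14, 0, 0),   # made-up later visit
]
print(session_lengths(times))
```

A session with a single request has a duration of zero, and the time spent reading the last page is never recorded, which is why these figures are approximations.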
42Example User Session Time
- Most sessions were very short (1-2 pages)
- This was an Entry server cluster, which passed
users off to other sites
- A few (<20% of total sessions) were very long
43Example Cluster Load-Balancing
- Ideally, your clustered servers for the site
would be sharing the load equally
- If one server is carrying a larger load, it can
lead to an overall perceived slowdown of your site
(most people going to a heavily loaded server
while an idle server sits and does nothing)
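One way to check the balance is to compute each server's share of requests and of bytes sent from the combined logs. The Python sketch below assumes each parsed record carries a server identifier (for example the `s-ip` field of a W3C log, or the name of the file the record came from). The `load_share` name, server names, and byte counts are made up for illustration.

```python
from collections import Counter


def load_share(records):
    """Return each server's fraction of requests and of bytes sent."""
    requests, sent = Counter(), Counter()
    for rec in records:
        requests[rec["server"]] += 1
        sent[rec["server"]] += rec["bytes_sent"]
    total_req = sum(requests.values())
    total_bytes = sum(sent.values())
    return {
        server: (requests[server] / total_req,
                 sent[server] / total_bytes if total_bytes else 0.0)
        for server in requests
    }


# Hypothetical records from a two-server cluster.
records = [
    {"server": "web1", "bytes_sent": 3000},
    {"server": "web1", "bytes_sent": 1000},
    {"server": "web1", "bytes_sent": 2000},
    {"server": "web2", "bytes_sent": 2000},
]
shares = load_share(records)
for server, (req_share, byte_share) in sorted(shares.items()):
    print(f"{server}: {req_share:.0%} of requests, {byte_share:.0%} of bytes")
```

Looking at bytes as well as request counts helps when one server handles fewer but much larger responses.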
44Example Cluster Load-Balancing
45Example Cluster Load-Balancing
46So What Should I Take Out Of This?
47Summary
- Web server log file analysis is an important part
of monitoring your web servers
- Log file analysis alone will not give you the
complete picture of your web server, but you
can't get the complete picture without it
- Know what is useful in the log files, what
limitations are inherent in them, and how to
analyze them