Title: Web log analysis
1 Web log analysis
- Presented by Zhan Wu
- Guided by Dr. Bettina Berendt
- Seminar Web Mining
2 What is a log file?
- Log file: a file that lists all the actions that have occurred.
- Every time you visit a site, the web server writes a record of the
HTTP transaction to a log file.
3 Why web log analysis?
- Is anyone looking at your Web site?
- Do they like what they see?
- Do all your links work well?
- How much traffic does your website get?
4 Why web log analysis? (cont.)
- Web designers: what motivates visitors, what makes them stay and what makes them leave.
- Web administrators: whether every click leads to a document, and whether images, multimedia files, scripts and applets are loaded and displayed properly.
- Companies that place advertisements: whether their investment is effective, so that no money is wasted.
5 Log file types
- Access log
- Referrer log
- Agent log
- Error log
6 Log file from www.eduserver.de
- pd9e0e981.dip.t-dialin.net - - [01/Dec/2001:00:17:42 +0100] "GET /db/stellenliste.html HTTP/1.1" 200 8038
- Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
- http://www.jobs.zeit.de/akad.html
- The access log, agent log and referrer log are usually written
together; this is called the extended log file format. However, some
servers turn off the agent log and referrer log and keep only the
access log, which is called the common log file format.
7 Access log
- Address / DNS: pd9e0e981.dip.t-dialin.net
- identification: -
- authuser: -
- timestamp: [01/Dec/2001:00:17:42 +0100]
- Request page: "GET /db/stellenliste.html HTTP/1.1"
- Status code: 200
- Transfer volume: 8038 (bytes)
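The fields above can be pulled out of a log line with a regular expression. The following is only a minimal sketch, assuming the standard Common Log Format layout (host, identification, authuser, bracketed timestamp, quoted request, status, bytes); the names `CLF_PATTERN` and `parse_access_line` are made up for this illustration.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of access-log fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["status"] = int(fields["status"])
    # A "-" size means no body was sent; treat it as 0 bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = ('pd9e0e981.dip.t-dialin.net - - [01/Dec/2001:00:17:42 +0100] '
          '"GET /db/stellenliste.html HTTP/1.1" 200 8038')
record = parse_access_line(sample)
print(record["host"], record["request"], record["status"], record["size"])
```

Lines that do not follow this layout return `None`, so they can be counted or skipped instead of crashing the analysis.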
8 Four series of status codes
- Success (200 series)
- Redirect (300 series)
- Failure (400 series)
- Server Error (500 series)
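As a small illustration, the series can be read from the first digit of the status code. The function name `status_series` and the fallback label "Other" are assumptions of this sketch, not part of any log format.

```python
def status_series(code):
    """Map an HTTP status code to the series named on this slide."""
    series = {2: "Success", 3: "Redirect", 4: "Failure", 5: "Server Error"}
    # Integer division by 100 keeps only the leading digit of a 3-digit code.
    return series.get(code // 100, "Other")

for code in (200, 302, 404, 500):
    print(code, status_series(code))
```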
9 Agent log
- The agent log has information about the browser, its version, and
the operating system of the visitor.
- Mozilla was the original code name of Netscape. Now almost all
browsers compatible with Netscape use Mozilla as their code name.
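A rough sketch of how an agent entry could be split into its parts. It assumes the usual layout of a product token followed by semicolon-separated details in parentheses; `parse_agent` is a hypothetical helper, and real agent strings vary far more than this handles.

```python
import re

def parse_agent(agent):
    """Split an agent string into its product token and parenthesised details."""
    match = re.match(r'(?P<product>\S+)\s*(?:\((?P<details>[^)]*)\))?', agent)
    if match is None:
        # Empty or whitespace-only agent field.
        return {"product": "", "details": []}
    details = match.group("details") or ""
    return {
        "product": match.group("product"),
        "details": [part.strip() for part in details.split(";") if part.strip()],
    }

info = parse_agent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)")
print(info["product"])
print(info["details"])
```

In this entry the product token is the Mozilla code name, while the details carry the actual browser (MSIE 5.5) and the operating system (Windows NT 5.0).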
10 Referrer log
- The referrer log indicates the page where the
visitor was located when making the next request.
- How is your site categorized in search engines?
- http://de.dir.yahoo.com/Bildung_und_Ausbildung/Portale_und_Linksammlungen/Bildungsserver
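To see which sites send visitors, the host part of each referrer can be counted. This is a sketch only: `referrer_host` is a made-up helper, and the sample list simply reuses the two referrer URLs shown on these slides plus an empty ("-") entry.

```python
from collections import Counter
from urllib.parse import urlsplit

def referrer_host(referrer):
    """Return the host of a referrer URL, or None for an empty ("-") referrer."""
    if not referrer or referrer == "-":
        return None
    return urlsplit(referrer).hostname

referrers = [
    "http://de.dir.yahoo.com/Bildung_und_Ausbildung/Portale_und_Linksammlungen/Bildungsserver",
    "http://www.jobs.zeit.de/akad.html",
    "-",  # request without a referrer, e.g. a typed-in address or bookmark
]
hosts = [referrer_host(r) for r in referrers]
counts = Counter(h for h in hosts if h is not None)
print(counts.most_common())
```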
11 Referrer log (cont.)
- What path does the visitor take through your site?
- pd9e0e981.dip.t-dialin.net "GET /db/set.html?Id221KATEGORIEstellenangebotsaeadf3e55f209b8c73ba53df99dc574a HTTP/1.1" http://www.bildungsserver.de/db/stellenliste.html
- www.job.zeit.de → joblist → a certain job information
12 Error log
- Another standard and important log, kept separate from the other
three logs
- Example from www.schulweb.de
- [Wed Jan 16 13:40:45 2002] [error] [client 194.51.47.214] File does not exist: /home/schulweb/html/images/dot_so.gif
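An error-log entry can be taken apart in the same way as an access-log entry. The sketch below assumes the bracketed Apache-style layout `[timestamp] [level] [client address] message`; `ERROR_PATTERN` and `parse_error_line` are names invented for this example.

```python
import re
from datetime import datetime

# Assumed layout: [timestamp] [level] [client address] message
ERROR_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] \[(?P<level>\w+)\] '
    r'\[client (?P<client>[^\]]+)\] (?P<message>.*)'
)

def parse_error_line(line):
    """Return a dict of error-log fields, or None if the line does not match."""
    match = ERROR_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%a %b %d %H:%M:%S %Y"
    )
    return fields

sample = ('[Wed Jan 16 13:40:45 2002] [error] [client 194.51.47.214] '
          'File does not exist: /home/schulweb/html/images/dot_so.gif')
entry = parse_error_line(sample)
print(entry["level"], entry["client"], entry["message"])
```

Counting the messages that start with "File does not exist" gives a quick list of broken links and missing images.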
13 Overview of log analysis software
- Writing your own program
- Free software (top 3 by Google PageRank)
- eXTReMe Tracking, The Webalizer, Analog
- Commercial software and solution packages (top 3 by Google PageRank)
- Wusage, WebTrends, AccessWatch
14 Three steps of web log analysis
- Decide what we need
- Choose a log analysis software
- Analyze the output of the program
15 Step 1: what we need
- The traffic of the site
- The distribution of the domains
- The referrer site
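The domain distribution can be approximated by counting the last label of each resolved host name. This is a sketch under assumptions: `top_level_domain` is a hypothetical helper, numeric addresses are grouped as "unresolved", and the two `example.de` hosts are invented sample data.

```python
from collections import Counter

def top_level_domain(host):
    """Return the last label of a host name, or "unresolved" for numeric IPs."""
    last = host.rsplit(".", 1)[-1].lower()
    return "unresolved" if last.isdigit() else last

hosts = [
    "pd9e0e981.dip.t-dialin.net",
    "194.51.47.214",
    "proxy.example.de",    # hypothetical host
    "client7.example.de",  # hypothetical host
]
distribution = Counter(top_level_domain(h) for h in hosts)
print(distribution.most_common())
```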
16 Step 1 (cont.): what we don't need
- We don't care about the error log; that problem is left to the web
administrator.
- We don't care about the browser or operating system of the visitors.
- User sessions are not important either.
17 Step 2: which way should I choose?
- A limited budget and little background in computer science mean
that I have to choose free software!
- I chose Analog.
- There are different versions for Macintosh, Unix, DOS and Windows.
- Also, while the default configuration gives a great report, Analog
is easily customizable to produce exactly the report you want.
18 Step 3: get the output (traffic)
- All the data come from the results of Analog.
- The average number of requests is 912,615 per month and 30,420 per
day; traffic increased month by month over the last year.
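Averages like these come from counting requests per calendar month and per day. The sketch below shows one way to do that from parsed timestamps; it is not how Analog computes its report, `average_requests` is a made-up name, and the four timestamps are hypothetical sample data.

```python
from collections import Counter
from datetime import datetime

def average_requests(timestamps):
    """Return (average per month, average per day).

    Only months and days that contain at least one request are counted.
    """
    if not timestamps:
        return 0.0, 0.0
    per_month = Counter((t.year, t.month) for t in timestamps)
    per_day = Counter(t.date() for t in timestamps)
    total = len(timestamps)
    return total / len(per_month), total / len(per_day)

# Hypothetical request times, not taken from the real log
stamps = [
    datetime(2001, 12, 1, 0, 17, 42),
    datetime(2001, 12, 1, 9, 5, 0),
    datetime(2001, 12, 2, 14, 30, 0),
    datetime(2002, 1, 16, 13, 40, 45),
]
monthly, daily = average_requests(stamps)
print(monthly, daily)
```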
19 Step 3 (cont.): Domain distribution
20 Data cleaning
- Why doesn't the domain eduserver appear?
- Separating in-house from external requests
- Thanks to Dr. Berendt for filtering out all the entries from the
eduserver itself
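Separating in-house from external requests can be sketched as a filter on the host name. This is not the filter that was actually used; `is_in_house` and the suffix list `INTERNAL` are assumptions for illustration, and `office1.eduserver.de` is an invented host.

```python
def is_in_house(host, internal_suffixes):
    """True if the host equals, or is a subdomain of, one of the internal domains."""
    host = host.lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in internal_suffixes
    )

INTERNAL = ("eduserver.de",)  # assumed in-house domain

hosts = [
    "pd9e0e981.dip.t-dialin.net",
    "www.eduserver.de",
    "office1.eduserver.de",  # hypothetical in-house machine
]
external = [h for h in hosts if not is_in_house(h, INTERNAL)]
print(external)
```

Matching on "." plus the suffix avoids treating an unrelated host such as `noteduserver.de` as in-house.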
21 Limitations of log analysis
- User sessions
- Not all information is captured
- Confusion of domains
22 User sessions
- The popular methods to measure user sessions are as follows:
- 1. Authenticated user
- 2. Cookies
- 3. IP address of the visitor
- All of the above have problems!
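As an illustration of the third method, sessions can be estimated by grouping requests from the same address and starting a new session after a period of inactivity. The 30-minute timeout, the name `count_sessions` and the sample hosts are assumptions of this sketch.

```python
from datetime import datetime, timedelta

def count_sessions(requests, timeout=timedelta(minutes=30)):
    """Count sessions from (host, timestamp) pairs using an inactivity timeout."""
    last_seen = {}
    sessions = 0
    # Process requests in time order so gaps are measured correctly.
    for host, stamp in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        if previous is None or stamp - previous > timeout:
            sessions += 1  # first request, or a gap longer than the timeout
        last_seen[host] = stamp
    return sessions

# Hypothetical requests from two addresses
requests = [
    ("host-a", datetime(2001, 12, 1, 0, 0)),
    ("host-b", datetime(2001, 12, 1, 0, 5)),
    ("host-a", datetime(2001, 12, 1, 0, 10)),
    ("host-a", datetime(2001, 12, 1, 1, 0)),  # 50 minutes later: new session
]
total_sessions = count_sessions(requests)
print(total_sessions)
```

The result is only an estimate: several visitors behind one proxy share an address and are merged, while one visitor whose address changes is split into several sessions.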
23 Not all entries are captured
- ISPs cache specific pages
- Web browsers also have their own local caches
24 Confusion of domains
- There is nothing to stop a commercial entity from registering a
site in the .org domain.
- Sites in the .com domain and other domains can also be located in
foreign countries, so you cannot tell exactly which requests are
coming from users in other countries.
- .edu domains exist only in the USA. We cannot identify a German
educational site from the last part of its domain name.
25 Conclusion
- There is a great deal of useful information you can get from web
logs.
- There is still a lot to do in this field in the future.