Title: Logfile-Preprocessing using WUMprep
Logfile-Preprocessing using WUMprep
A Perl script suite that does more than just filter raw data
About WUMprep
- WUMprep is part of the open source project HypKnowSys and was written by Carsten Pohle
- It covers logfile preprocessing in two ways:
  - filtering
  - adding meaning to web sites (taxonomies)
- It can be used both stand-alone and in conjunction with other mining tools (e.g. WUM)
Configuring WUMprep (1)
- wumprep.conf defines the basic settings needed by each script
- Just give your domain and your input log; that will do for the moment (a hypothetical sketch follows below)
- Before running removeRobots.pl you can define the seconds value in the timestamp heuristic → question: which value is appropriate?
[Screenshot: wumprep.conf]
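To make this concrete, here is a minimal sketch of such a configuration. The [global] section and key names are illustrative assumptions, not verified WUMprep option names; the shipped wumprep.conf documents the real ones in its comments.

# Hypothetical minimal wumprep.conf -- section and key names are assumptions
[global]
domain      www.example.org              # your site's domain
inputLogs   /var/log/apache/access.log   # the raw log to preprocess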
Next step: logfileTemplate (config 2)
- The four basic web-server log formats are defined in WUMprep's logFormat.txt
- According to a given format, you arrange logfileTemplate
- Basically anything goes, but if the log is queried from a MySQL database, remember that host, timestamp, and agent are mandatory (and referrer is at least helpful) [1]
[1] See Nicolas Michael's presentation for details concerning problems with the basic algorithm
[Screenshot: logFormat.txt]
[Screenshot: logfileTemplate]
Usage of logfileTemplate (config 3)
- You have this format:
  koerting.hannover.kkf.net - - 01/Mar/2003:00:34:41 -0700 "GET /css/styles.css HTTP/1.1" 200 7867 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
- You take this template:
  @host_dns@ @auth_user@ @ident@ @ts_day@/@ts_month@/@ts_year@:@ts_hour@:@ts_minutes@:@ts_seconds@ @tz@ "@method@ @path@ @protocol@" @status@ @sc_bytes@ @referrer@ @agent@
- You have this format:
  200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 000000 2399027
- You take this template:
  @host_ip@ @agent@ @ts_year@-@ts_month@-@ts_day@ @ts_hour@@ts_minutes@@ts_seconds@ @dummy@
NB: Have a close look at your logfile and arrange logfileTemplate by following the given format exactly (a small parsing sketch follows below).
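As an aside, the template mechanism can be pictured as turning each @field@ placeholder into a capture group. The following Perl sketch only illustrates that idea; it is not WUMprep's code, and it naively assumes one whitespace-free token per field, which real agent strings violate.

use strict;
use warnings;

# A logfileTemplate line for the MySQL-style log from this slide.
my $template = '@host_ip@ @agent@ @ts_year@-@ts_month@-@ts_day@ '
             . '@ts_hour@@ts_minutes@@ts_seconds@ @dummy@';

# Escape the literal parts, then turn every @field@ placeholder
# into a named capture group matching one whitespace-free token.
my $regex = quotemeta($template);
$regex =~ s/\\\@(\w+)\\\@/(?<$1>\\S+)/g;

# Toy log line (agent shortened so it contains no spaces).
my $line = '200.11.240.17 Mozilla/4.0 2001-10-20 000000 2399027';
if ($line =~ /^$regex$/) {
    print "host: $+{host_ip}, date: $+{ts_year}-$+{ts_month}-$+{ts_day}\n";
}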
Dealing with an unstandardized format (config 4)
- The last slide's second example is taken from a MySQL database:
  200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 000000 2399027
- You have an unusual timestamp format (see the adapted template on the last slide) and a missing referrer → sessionize.pl may, e.g., look for foreign referrers to start a new session
Configuring wumprep.conf (5)
- Go to sessionizeSettings in wumprep.conf.
- Comment out everything that deals with referrers. It'll look like this:

# Set to true if the sessionizer should insert dummy hits to the
# referring document at the beginning of each session.
sessionizeInsertReferrerHits true
# Name of the GET query parameter denoting the referrer
# (leave blank if not applicable)
sessionizeQueryReferrerName referrer
# Should a foreign referrer start a new session?
sessionizeForeignReferrerStartsSession 0
We're ready to go: sessionize the log
- If no cookie ID is given, sessionize.pl will look at the host and the timestamp. There is a threshold q = 1800 sec. in wumprep.conf. Let t0 be the timestamp of the first entry. Then a session is computed by taking any URL whose timestamp t satisfies t - t0 <= q as a subsequent request of t0 (see the sketch below).
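This timeout rule can be sketched in a few lines of Perl, using made-up records; it illustrates the rule only and is not sessionize.pl itself.

use strict;
use warnings;

my $q = 1800;    # timeout threshold (sec.) as set in wumprep.conf

# Hypothetical records: [host, timestamp in epoch seconds, URL],
# assumed to be sorted by timestamp.
my @records = (
    ['200.11.240.17', 0,    '/index.html'],
    ['200.11.240.17', 600,  '/tools.html'],   # 600 - 0 <= q: same session
    ['200.11.240.17', 2000, '/home.html'],    # 2000 - 0 > q: new session
);

my (%t0, %sid, $counter);
for my $r (@records) {
    my ($host, $t, $url) = @$r;
    # Open a new session when the host is new or the timeout is exceeded.
    if (!exists $t0{$host} || $t - $t0{$host} > $q) {
        $t0{$host}  = $t;           # this request becomes t0
        $sid{$host} = ++$counter;
    }
    print "session $sid{$host}: $host $url\n";
}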
detectRobots (1)
- There are two types of robots: ethical and non-ethical (let's say three: the good, the bad, and the very ugly ;-)).
- The first type acts according to the Robots Exclusion Standard and looks first into a file called robots.txt to see where to go and where not.
- Removing them is done via the robot database indexers.lst. Additionally, detectRobots.pl flags IPs as robots when they have accessed robots.txt.
detectRobots (2)
- The second type, whose IP and agent look like they come from a human, is difficult to detect and requires a sessionized log.
- There is (besides two others) a time-based heuristic to remove them: too many HTML requests in a given time are very likely to come from a robot; a sketch of the idea follows below. The default value in wumprep.conf is 2 sec.
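A rough sketch of such a time-based check, assuming we already have the HTML-request timestamps of one session. This shows the general idea only (here via the average inter-request gap, an assumption), not the actual detectRobots.pl implementation.

use strict;
use warnings;

my $min_gap = 2;    # sec.; cf. the default threshold in wumprep.conf

# Hypothetical timestamps (epoch seconds) of one session's HTML requests.
my @ts = (100, 101, 102, 103, 104);

my $is_robot = 0;
if (@ts > 1) {
    # Average gap between successive requests over the whole session.
    my $avg_gap = ($ts[-1] - $ts[0]) / (@ts - 1);
    $is_robot = 1 if $avg_gap < $min_gap;
}
print $is_robot ? "flag session as robot\n" : "session looks human\n";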
detectRobots (3)
- You can add entries to indexers.lst by taking a larger log and typing at the command line:
  grep "bot" logfile | awk '{print "robot-host ", $1}' | sort | uniq >> indexers.lst
- detectRobots.pl removed 6% of the entries before and 17% after sessionizing in my logs (2668/2821 Kb vs. 2360/2821 Kb for xyz.nobots/xyz.sess)
- There will always remain some uncertainty about robot detection. Further research is necessary.
logFilter
- Further data cleaning is thankfully much easier
- logFilter.pl uses the filter rules in wumprep.conf
- You can define your own filter rules or add them to wumprep.conf (a sketch of such a filter pass follows below):
  \.ico
  \.gif
  \.jpg
  \.jpeg
  \.css
  \.js
  \.GIF
  \.JPG
  @mydomainFilter.txt
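The effect of these rules can be pictured with a few lines of Perl: every request whose line matches one of the patterns is dropped. This is a sketch of the principle, not logFilter.pl itself.

use strict;
use warnings;

# The rules from above, compiled into regexes.
my @rules = map { qr/$_/ } ('\.ico', '\.gif', '\.jpg', '\.jpeg',
                            '\.css', '\.js',  '\.GIF', '\.JPG');

while (my $line = <DATA>) {
    next if grep { $line =~ $_ } @rules;   # discard filtered requests
    print $line;                           # keep everything else
}

__DATA__
GET /index.html HTTP/1.1
GET /css/styles.css HTTP/1.1
GET /img/logo.gif HTTP/1.1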
Taxonomies
- Taxonomies are built using regular expressions: map your site according to a taxonomy, and mapReTaxonomies.pl uses your predefined regexes to overwrite the requests in the log with your site concepts.
- It'll look something like this (see the sketch after this list):
  HOME          www\.c-o-k\.de\/
  METHODS       \/cp_\.htm\?fall3\/
  TOOLS         \/cp_\.htm\?fall1\/
  FIELDSTUDIES  \/cp_.htm?fall2\/
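To illustrate, the mapping step can be sketched as follows: the first taxonomy regex that matches a requested URL replaces the request with the concept name. This shows the idea only, with the concepts from above; it is not the actual mapReTaxonomies.pl code.

use strict;
use warnings;

# Concept/regex pairs as defined above.
my @taxonomy = (
    [ 'HOME',         qr/www\.c-o-k\.de\// ],
    [ 'METHODS',      qr/\/cp_\.htm\?fall3\// ],
    [ 'TOOLS',        qr/\/cp_\.htm\?fall1\// ],
    [ 'FIELDSTUDIES', qr/\/cp_.htm?fall2\// ],
);

sub map_concept {
    my ($url) = @_;
    for my $pair (@taxonomy) {
        my ($concept, $regex) = @$pair;
        return $concept if $url =~ $regex;
    }
    return $url;    # no concept matched: keep the raw request
}

print map_concept('http://www.c-o-k.de/'), "\n";    # prints HOME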
Taxonomies II
- This is what mapReTaxonomies.pl does with it (aggregation):
  117858180.136.155.126 - - 29/Mar/2003:00:02:00 +0100 "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 1406 "-" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
  117858180.136.155.126 - - 29/Mar/2003:00:02:00 +0100 "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 10301 "http://edoc.hu-berlin.de/conferences/conf2/Kuehne-Hartmut-2002-09-08/HTML/kuehne-ch1.html" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
- This data aggregation is a necessary step before working with WUM
Taxonomies III
- Beyond that, Carsten Pohle wants to use taxonomies as a filter for the uninteresting patterns one usually gets out of association rules → any pattern that matches the taxonomy (via mapReTaxonomies.pl) is most likely to be uninteresting
Further Reading
- Berendt, Mobasher, Spiliopoulou, Wiltshire: Measuring the Accuracy of Sessionizers for Web Usage Analysis
- Pang-Ning Tan and Vipin Kumar: Discovery of Web Robot Sessions Based on Their Navigational Patterns. In: Data Mining and Knowledge Discovery 6 (1) (2002), pp. 9-35
- Nicolas Michael: Erkennen von Web-Robotern anhand ihres Navigationsmusters [Detecting web robots by their navigation patterns] (on Berendt, HS Web Mining SS03)
- Gebhard Dettmar: Knowledge Discovery in Databases - Methodik und Anwendungsbereiche [Methodology and Areas of Application], Knowledge Discovery in Databases, Part II - Web Mining
Logfile-Preprocessing via WUMprep
Thanks for listening!