1
Logfile-Preprocessing using WUMprep
A Perl-Script-Suite that does more than just
filtering raw data
2
About WUMprep
  • WUMprep is part of the open-source project HypKnowSys and was written by Carsten Pohle
  • It covers logfile preprocessing in two ways:
  • filtering
  • adding meaning to websites (taxonomies)
  • It can be used both stand-alone and in conjunction with other mining tools (e.g. WUM)

3
Configuring WUMprep (1)
  • wumprep.conf defines the basic settings for each script
  • Just give your domain and your input log; that will do for the moment (an illustrative sketch follows below)
  • Before running removeRobots.pl you can define the seconds threshold for the timestamps → the question is which value is appropriate

(Screenshot: wumprep.conf)
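For orientation, a minimal start might look like the sketch below. The key names are assumptions for illustration only; the commented wumprep.conf shipped with WUMprep documents the real ones.

# Hypothetical minimal wumprep.conf excerpt -- the key names
# (domain, inputLog) are assumptions, not verified against the
# distributed file
domain   = www.example.org
inputLog = access.log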
4
Next step: logfileTemplate (config 2)
  • The four basic web server log formats are defined in WUMprep's logFormat.txt
  • You arrange logfileTemplate according to the given format
  • Basically anything goes, but if the log is queried from a MySQL database, remember that Host, Timestamp and Agent are mandatory (and Referrer is at least helpful)¹

¹See Nicolas Michael's presentation for details concerning problems with the basic algorithm
(Screenshots: logFormat.txt, logfileTemplate)
5
Usage of logfileTemplate (config 3)
  • If you have this format:
  • koerting.hannover.kkf.net - - [01/Mar/2003:00:34:41 -0700] "GET /css/styles.css HTTP/1.1" 200 7867 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
  • you take this template:
  • @host_dns@ @auth_user@ @ident@ [@ts_day@/@ts_month@/@ts_year@:@ts_hour@:@ts_minutes@:@ts_seconds@ @tz@] "@method@ @path@ @protocol@" @status@ @sc_bytes@ @referrer@ @agent@
  • If you have this format:
  • 200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 00:00:00 2399027
  • you take this template:
  • @host_ip@ @agent@ @ts_year@-@ts_month@-@ts_day@ @ts_hour@:@ts_minutes@:@ts_seconds@ @dummy@

NB: Have a close look at your logfile and build logfileTemplate by following the given format exactly (a parsing sketch follows below)
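To see how such a template can drive parsing, here is a minimal Perl sketch. It is not WUMprep's actual parser: field handling is simplified (e.g. the agent field just gets a greedy match), and the template and log line are the ones from this slide.

#!/usr/bin/perl
# Sketch only: build a regex from a logfileTemplate-style string and
# parse one line with named captures. Not WUMprep's actual parser.
use strict;
use warnings;

my $template = '@host_ip@ @agent@ @ts_year@-@ts_month@-@ts_day@'
             . ' @ts_hour@:@ts_minutes@:@ts_seconds@ @dummy@';

# Split the template into literals and @field@ placeholders; literals
# are quoted, placeholders become named captures. The agent field may
# contain spaces, so it gets a greedy .+ here (a simplification).
my $regex = join '', map {
    m/^\@(\w+)\@$/
        ? ($1 eq 'agent' ? "(?<$1>.+)" : "(?<$1>\\S+)")
        : quotemeta($_)
} split /(\@\w+\@)/, $template;

my $line = '200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)'
         . ' 2001-10-20 00:00:00 2399027';

if ($line =~ /\A$regex\z/) {
    printf "%-10s = %s\n", $_, $+{$_} for sort keys %+;
}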
6
Dealing with a Non-Standard Format (config 4)
  • The last slide's example is taken from a MySQL database:
  • 200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 00:00:00 2399027
  • You have an unusual timestamp format (see the adapted template on the last slide) and a missing referrer → sessionize.pl may then, e.g., look for foreign referrers to start a new session

7
Configuring wumprep.conf (5)
  • Go to the sessionize settings in wumprep.conf
  • Comment out everything that deals with referrers. It'll look like this:

# Set to true if the sessionizer should insert dummy hits to the
# referring document at the beginning of each session.
sessionizeInsertReferrerHits = true
# Name of the GET query parameter denoting the referrer
# (leave blank if not applicable)
sessionizeQueryReferrerName = referrer
# Should a foreign referrer start a new session?
sessionizeForeignReferrerStartsSession = 0
8
We're ready to go: sessionize the log
  • If no cookie ID is given, sessionize.pl will look at the host and the timestamp. There is a threshold q = 1800 sec. in wumprep.conf. Let t0 be the timestamp of the first entry; a session is then computed by taking every URL whose timestamp t satisfies t - t0 ≤ q as a subsequent request of the session started at t0 (a sketch of this rule follows below)
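A minimal Perl sketch of this heuristic (not sessionize.pl itself; the records and their time order are assumed):

#!/usr/bin/perl
# Sketch of the time-based sessionizing rule: a request belongs to the
# current session of its host while t - t0 <= q, where t0 is the
# session's first timestamp. Not sessionize.pl itself.
use strict;
use warnings;

my $q = 1800;    # threshold in seconds, as in wumprep.conf
my (%t0, %sid, $next);

# Records are [host, epoch seconds], assumed sorted by time.
for my $r ([ '200.11.240.17', 0    ],
           [ '200.11.240.17', 900  ],
           [ '200.11.240.17', 2000 ]) {
    my ($host, $t) = @$r;
    if (!exists $t0{$host} || $t - $t0{$host} > $q) {
        $t0{$host}  = $t;        # this request opens a new session
        $sid{$host} = ++$next;
    }
    print "$host t=$t -> session $sid{$host}\n";
}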

9
detectRobots (1)
  • There are two types of robots: ethical and non-ethical (let's say three: the good, the bad and the very ugly ;-))
  • The first type acts according to the 'Robots Exclusion Standard' and first looks into a file called robots.txt to see where it may and may not go (see the example below)
  • Removing them is done via the robot database indexers.lst. Additionally, detectRobots.pl flags IPs as robots when they have accessed robots.txt
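For reference, a robots.txt according to the Robots Exclusion Standard looks like this (the paths are examples):

# robots.txt: ethical robots read this before crawling
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/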

10
detectRobots (2)
  • The second type, whose IP and agent look like they come from a human, is difficult to detect and requires a sessionized log
  • There is (besides two others) a time-based heuristic to remove them: too many HTML requests within a given time are very likely to come from a robot. The default value in wumprep.conf is 2 sec. (a sketch of the idea follows below)
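A Perl sketch of the idea behind such a time-based heuristic (simplified: detectRobots.pl combines several heuristics, and the sessions below are made-up examples):

#!/usr/bin/perl
# Sketch of a time-based robot heuristic: flag a session whose average
# spacing between HTML requests falls below the threshold (default
# 2 sec. in wumprep.conf). Sessions here are made-up examples.
use strict;
use warnings;

my $min_gap  = 2;    # seconds per HTML request
my %sessions = (     # session id => timestamps of its HTML requests
    'human-1' => [ 0, 15, 44, 90 ],
    'robot-1' => [ 0, 1, 2, 3, 4 ],
);

for my $id (sort keys %sessions) {
    my @ts = @{ $sessions{$id} };
    next if @ts < 2;
    my $gap = ($ts[-1] - $ts[0]) / (@ts - 1);    # average gap in seconds
    printf "%s: %.1f s/request -> %s\n",
        $id, $gap, $gap < $min_gap ? 'likely robot' : 'probably human';
}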

11
detectRobots (3)
  • You can add entries to indexers.lst by taking a larger log and typing at the command line:
  • grep "bot" logfile | awk '{print "robot-host ", $1}' | sort | uniq >> indexers.lst
  • In my logs, detectRobots.pl removes 6% of the entries before and 17% after sessionizing (2668/2821 Kb vs. 2360/2821 Kb for xyz.nobots/xyz.sess)
  • There will always remain some uncertainty about robot detection. Further research is necessary.

12
logFilter
  • Further data cleaning is thankfully much easier
  • logFilter.pl applies the filter rules from wumprep.conf
  • You can define your own filter rules or add them to wumprep.conf (a filtering sketch follows after this list):
  • \.ico
  • \.gif
  • \.jpg
  • \.jpeg
  • \.css
  • \.js
  • \.GIF
  • \.JPG
  • @mydomainFilter.txt
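A Perl sketch of the filtering pass (not logFilter.pl itself: the rules are hard-coded here instead of being read from wumprep.conf, the upper-case variants are folded into case-insensitive regexes, and the @mydomainFilter.txt entry, which pulls rules from a file, is omitted):

#!/usr/bin/perl
# Sketch of a filter pass: drop every line whose path matches one of
# the regexes. logFilter.pl reads such rules from wumprep.conf.
use strict;
use warnings;

my @filters = (qr/\.ico/, qr/\.gif/i, qr/\.jpe?g/i, qr/\.css/, qr/\.js/);

while (my $line = <DATA>) {
    print $line unless grep { $line =~ $_ } @filters;
}

__DATA__
GET /index.html HTTP/1.1
GET /css/styles.css HTTP/1.1
GET /img/logo.GIF HTTP/1.1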

13
Taxonomies
  • Taxonomies are built using regular expressions: map your site according to a taxonomy, and mapReTaxonomies.pl uses your predefined regexes to overwrite the requests in the log with your site concepts (a sketch follows after the example)
  • It'll look something like this:

HOME          www\.c-o-k\.de\/
METHODS       \/cp_\.htm\?fall3\/
TOOLS         \/cp_\.htm\?fall1\/
FIELDSTUDIES  \/cp_\.htm\?fall2\/
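A Perl sketch of this concept mapping (not mapReTaxonomies.pl itself; the URLs are examples, and unmatched requests are simply kept):

#!/usr/bin/perl
# Sketch of concept mapping: the first matching regex rewrites a
# requested URL to its concept name, as in the taxonomy above.
use strict;
use warnings;

my @taxonomy = (
    [ HOME         => qr{www\.c-o-k\.de/}   ],
    [ METHODS      => qr{/cp_\.htm\?fall3/} ],
    [ TOOLS        => qr{/cp_\.htm\?fall1/} ],
    [ FIELDSTUDIES => qr{/cp_\.htm\?fall2/} ],
);

for my $url ('http://www.c-o-k.de/index.htm', '/cp_.htm?fall1/intro') {
    my ($concept) = map { $_->[0] } grep { $url =~ $_->[1] } @taxonomy;
    print "$url -> ", $concept // $url, "\n";  # keep the URL if nothing matches
}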

14
Taxonomies II
  • This is what mapReTaxonomies.pl does with it (aggregation):
  • 117858180.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 1406 "-" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
  • 117858180.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 10301 "http://edoc.hu-berlin.de/conferences/conf2/Kuehne-Hartmut-2002-09-08/HTML/kuehne-ch1.html" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
  • This data aggregation is a necessary step before working with WUM

15
Taxonomies III
  • Beyond that, Carsten Pohle wants to use them as a filter for the uninteresting patterns one usually gets out of association rules → any pattern that matches the taxonomy (via mapReTaxonomies.pl) is most likely to be uninteresting

16
Further Reading
  • Berendt, Mobasher, Spiliopoulou, Wiltshire: Measuring the Accuracy of Sessionizers for Web Usage Analysis
  • Pang-Ning Tan and Vipin Kumar: Discovery of Web Robot Sessions Based on Their Navigational Patterns, in: Data Mining and Knowledge Discovery 6 (1) (2002), pp. 9-35
  • Nicolas Michael: Erkennen von Web-Robotern anhand ihres Navigationsmusters [Detecting Web Robots by Their Navigation Patterns] (on Berendt, HS Web Mining SS03)
  • Gebhard Dettmar: Knowledge Discovery in Databases - Methodik und Anwendungsbereiche [Methodology and Areas of Application], Knowledge Discovery in Databases, Teil II - Web Mining

17
Logfile-Preprocessing via WUMprep
Thanks for Listening!