WIRED - Web Analytics Week - PowerPoint PPT Presentation

About This Presentation
Title:

WIRED - Web Analytics Week

Description:

WIRED System Evaluations due now; Web Logs overview; Web Analytics; Understanding Queries; Tracking Users; Web Log Reliability; Web Log Data Mining & KDD

Slides: 32
Provided by: DonTur5

Transcript and Presenter's Notes

1
WIRED - Web Analytics Week
  • WIRED System Evaluations due now
  • Web Logs overview
  • Web Analytics
  • Understanding Queries
  • Tracking Users
  • Web Log Reliability
  • Web Log Data Mining & KDD

2
Web Analytics
  • Evaluation of Web Information Retrieval (& Web
    Information Seeking)
  • What can we learn?
  • IR systems use
  • Web server administration
  • Who are the users?
  • Types of users
  • User situations
  • How does it affect or help IR?

3
Web Server Overview
  • Any application that can serve files using the
    HTTP protocol
  • Text, HTML, XHTML, XML
  • Graphics
  • CGI, applets, servlets
  • other media MIME types
  • Apache or MS IIS that serve primarily Web pages
  • Servers create ASCII text log files showing
  • Date, time, bytes transferred, (cache status)
  • Status/error codes, user IP address, (domain
    name)
  • Server method, URI, misc comments

4
Web Log Overview
  • Access Log
  • Logs information such as page served or time
    served
  • Referer Log
  • Logs name of the server and page that links to
    current served page
  • Not always
  • Can be from any Web site
  • Agent Log
  • Logs browser type and operating system
  • Mozilla
  • Windows

5
What can we learn from Web logs?
  • Every time a Web browser requests a file, it gets
    logged
  • Where the user came from
  • What kind of browser used to access the server
  • Referring URL
  • Every time a page gets served, it gets logged
  • Request time, serve time, bytes transferred, URI,
    status code

6
Web Log Analysis in Action
  • UT Web log reports
  • (Figures in parentheses refer to the 7 days to
    28-Mar-2004 03:00).
  • Successful requests 39,826,634 (39,596,364)
  • Average successful requests per day 5,690,083
    (5,656,623)
  • Successful requests for pages 4,189,081
    (4,154,717)
  • Average successful requests for pages per day
    598,499 (593,530)
  • Failed requests 442,129 (439,467)
  • Redirected requests 1,101,849 (1,093,606)
  • Distinct files requested 479,022 (473,341)
  • Corrupt logfile lines 427
  • Data transferred 278.504 Gbytes (276.650 Gbytes)
  • Average data transferred per day 39.790 Gbytes
    (39.521 Gbytes)
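
A few ratios can be derived from the report figures above. The sketch below is only illustrative arithmetic: it assumes the successful, failed, and redirected counts are disjoint, and that 1 Gbyte means 10**9 bytes, neither of which the report states.

```python
# Figures copied from the report above (current 7-day period).
successful_requests = 39_826_634
successful_page_requests = 4_189_081
failed_requests = 442_129
redirected_requests = 1_101_849
data_transferred_gbytes = 278.504

# Share of successful requests that were for pages (rather than images, etc.).
page_share = successful_page_requests / successful_requests

# Failure rate over all logged outcomes.
# Assumption: the three outcome counts do not overlap.
total_outcomes = successful_requests + failed_requests + redirected_requests
failure_rate = failed_requests / total_outcomes

# Average payload per successful request.
# Assumption: 1 Gbyte = 10**9 bytes.
bytes_per_request = data_transferred_gbytes * 10**9 / successful_requests

print(f"page share of successful requests: {page_share:.1%}")
print(f"failure rate: {failure_rate:.2%}")
print(f"average bytes per successful request: {bytes_per_request:.0f}")
```

Roughly one successful request in ten is for a page, which is one reason raw hit counts overstate how many pages people actually viewed.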

7
Problems with Web Servers
  • Actual user or intent not known
  • Paths difficult to determine
  • Infrequent access challenging to uncover
  • No State Information
  • Server Hits not Representative
  • Counters inaccurate
  • DOS, Floods, Bandwidth can Stop intended usage
  • Robots, etc.
  • ISP Proxy servers
  • "5.3 Unsound inferences from data that is logged"
    (Haigh & Megarity, 1998)

8
Web Server Configuration
  • Unique file and directory names for at-a-glance
    analysis
  • Hierarchical directory structure
  • Redirect CGI to find referrer
  • Use a database
  • store web content
  • record usage data with context of content logged
  • Create state information with programming
  • Servlets, ActiveX, Javascript
  • Custom server or log format
  • Log rollover, report frequency, special case
    testing

9
Log File Format
  • Extended Log File Format - W3C Working Draft
    WD-logfile-960323
  • 192.117.240.3 - - [24/Jul/1998:00:00:04 -0400]
  • "GET /10/3/a3-160-e.html HTTP/1.0" 200 2308
    "http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l0/d1/r1/e0/h10/i11683503"
  • "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
  • Every server generates slightly different logs
  • Versions & operating system issues
  • Admin tweaks to log formats
  • Extended Log Format most common
  • WWW Consortium Standards (Apache)
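
A minimal sketch of how one such log entry might be parsed. It assumes the sample entry on this slide was originally a single line in the common "combined" layout (bracketed timestamp, quoted request, referrer, and agent), with the punctuation restored by guesswork. The `LOG_PATTERN` and `parse_line` names are illustrative, and real server logs may need a different pattern.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields, or None for a line that does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # would be counted as a corrupt logfile line
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# The slide's sample entry, joined into one line (punctuation is assumed).
sample = (
    '192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] '
    '"GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 '
    '"http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l0/d1/r1/e0/h10/i11683503" '
    '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"'
)
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["bytes"], entry["uri"])
```

Lines that fail to match return `None` instead of raising, so a caller can count them the way analysis tools report corrupt logfile lines.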

10
Let's look at some logs
  • http://www.ischool.utexas.edu/analog-monthly.html
  • http://www.ischool.utexas.edu/analog-weekly.html

11
Log Analysis Tools
  • Analog
  • Webalizer
  • Sawmill
  • WebTrends
  • AWStats
  • WWWStat
  • GetStats
  • Perl Scripts
  • Data Mining & Business Intelligence tools

12
WebTrends
  • A whole industry of analytics
  • Most popular commercial application

13
Measuring Web Site Usage
  • Now that the Web is a primary source,
    understanding its use is critical
  • Few external cues that the Web site is being
    used
  • What - pages and their content/subject
  • How - browsers
  • Who - userid or IP
  • When - trends, daily, weekly, yearly
  • Where - the user is and what page they came from

14
What can't you measure?
  • Who the user is
  • Always
  • If the user's needs have changed
  • If they're using the information
  • Browsing vs. Reading vs. Acting on the
    information
  • Changes to site and how they affect each user
  • Pages not used at all - and why

15
Analysis of a Very Large Search Log
  • What kinds of patterns can we find?
  • Request query and results page
  • 280 GB Six Weeks of Web Queries
  • Almost 1 Billion Search Requests, 850M valid,
    575M queries
  • 285 Million User Sessions (cookie issues)
  • Large volume, less trendy
  • Why are unique queries important?
  • Web Users
  • Use Short Queries in short sessions - 63.7% one
    request
  • Mostly Look at the First Ten Results only
  • Seldom Modify Queries
  • Traditional IR Isn't Accurately Describing Web
    Search
  • Phrase Searching Could Be Augmented
  • Silverstein, Henzinger, Marais, & Moricz (1998)

16
Analysis of a Very Large Search Log
  • 2.35 Average Terms Per Query
  • 0 terms: 20.6% (?)
  • 1 term: 25.8%
  • 2 terms: 26.0% (72.4% combined for 0-2 terms)
  • Operators Per Query
  • 0 operators: 79.6%
  • Terms Predictable
  • First Set of Results Viewed Only: 85%
  • Some (Single Term Phrase) Query Correlation
  • Augmentation
  • Taxonomy Input
  • Robots vs. Humans

17
Web Analytics and IR?
  • Knowing access patterns of users
  • Lists of search terms
  • Numbers of words
  • Words, concepts to add (synonyms)
  • Types of queries
  • Success of searching a site
  • Was a result link clicked on?
  • How many pp/user after a search?
  • Is a new or better search interface needed?
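
Lists of search terms can often be recovered from the query string of a logged URL, for example a referrer from a search results page. The sketch below assumes the search engine passes the query in a parameter named `q`, `query`, or `p`; those names and the example URL are hypothetical, so check them against the actual site.

```python
from urllib.parse import urlparse, parse_qs

def search_terms_from_url(url, param_names=("q", "query", "p")):
    """Return the lowercased search terms found in a URL's query string."""
    params = parse_qs(urlparse(url).query)
    for name in param_names:
        if name in params:
            # parse_qs decodes "+" and %-escapes; take the first value.
            return params[name][0].lower().split()
    return []

# Hypothetical referrer from a search results page.
referer = "http://search.example.com/results?q=Web+log+analysis&start=10"
terms = search_terms_from_url(referer)
print(terms, len(terms))
```

A missing referrer (logged as `-`) simply yields an empty list, so the function can be applied to every log entry.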

18
Real Life Information Retrieval
  • 51K Queries from Excite (1997)
  • Search Terms per Query: 2.21
  • Number of Terms
  • 1: 31%, 2: 31%, 3: 18% (80% Combined)
  • Logic Modifiers (by User)
  • Infrequent
  • AND, +, -
  • Logic Modifiers (by Query)
  • 6% of Users
  • Less Than 10% of Queries
  • Lots of Mistakes
  • Uniqueness of Queries
  • 35% successive
  • 22% modified
  • 43% identical
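
One way to approximate a "uniqueness of queries" breakdown is to compare each query with the one the same user issued just before it. The rules below are an assumption made for illustration, not the definitions used in the Excite study: same term set means identical, at least one shared term means modified, and anything else is labeled unique.

```python
def classify_queries(queries):
    """Label each query in one user's sequence relative to the previous one."""
    labels = []
    previous_terms = None
    for query in queries:
        # Comparing sets ignores word order and repeated words.
        terms = set(query.lower().split())
        if previous_terms is None:
            labels.append("unique")
        elif terms == previous_terms:
            labels.append("identical")
        elif terms & previous_terms:
            labels.append("modified")
        else:
            labels.append("unique")
        previous_terms = terms
    return labels

# Invented query sequence for a single user.
sequence = ["web logs", "web logs", "web log analysis", "austin weather"]
labels = classify_queries(sequence)
print(list(zip(sequence, labels)))
```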

19
Real Life Information Retrieval
  • Queries per user 2.8
  • Sessions
  • Flawed Analysis (User ID)
  • Some Revisits to Query (Result Page Revisits)
  • Page Views
  • Accurate, but not by User
  • Use of Relevance Feedback (more like this)
  • Not Used Much (11)
  • Terms Used Typical frequent
  • Mistakes
  • Typos
  • Misspellings
  • Bad (Advanced) Query Formulation
  • Jansen, B. J., Spink, A., Bateman, J., &
    Saracevic, T. (1998)

20
KDD for Extracting Knowledge
  • Knowledge extraction, information discovery,
    information extraction, data archeology, data
    pattern processing, OLAP, HV statistical analysis
  • Sounds as if knowledge is there to be found.
  • User and usage context help find the knowledge
  • Hypothesis before analysis
  • Why KDD, why now?
  • Data storage, analysis costs
  • Visualization

21
KDD Process
  • Database for structured data and queries
  • How structured, algorithms for queries
  • How results can be understood and visualized
  • Iterative & interactive; hypothesis driven &
    hypothesis generating

22
KDD Efforts
  • Data Cleaning
  • Formulating the Questions
  • Finding useful features to represent the data
    (p. 30)
  • Models
  • Classification to fit data into pre-defined
    classes
  • Regression to fit prediction values
  • Clustering to class sets found in data
  • Summarization to briefly describe data
  • Dependency discovery of variable relationships
  • Sequence analysis for time or interaction patterns

23
Data Prep for Mining the WWW
  • Processing the data before mining
  • WEBMINER system - site topology
  • Cleaning
  • User identification
  • Session identification (episodes)
  • Path completion
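
The cleaning, user identification, and session identification steps can be sketched as follows. This is not the WEBMINER implementation: it assumes a user is an (IP, agent) pair, drops requests for common image and stylesheet extensions, and starts a new session after 30 minutes of inactivity, a widely used heuristic. Path completion is not attempted, and the sample entries are invented.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def is_page_request(uri):
    """Cleaning step: keep pages, drop embedded images, styles, scripts."""
    return not uri.lower().endswith(NON_PAGE_SUFFIXES)

def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Group cleaned entries into per-user sessions, in order of first hit."""
    open_sessions = {}  # user -> that user's current session
    last_seen = {}      # user -> time of that user's previous request
    sessions = []
    for entry in sorted(entries, key=lambda e: e["time"]):
        if not is_page_request(entry["uri"]):
            continue
        user = (entry["ip"], entry["agent"])  # user identification heuristic
        previous = last_seen.get(user)
        if previous is None or entry["time"] - previous > timeout:
            open_sessions[user] = {"user": user, "pages": []}
            sessions.append(open_sessions[user])
        open_sessions[user]["pages"].append(entry["uri"])
        last_seen[user] = entry["time"]
    return sessions

# Invented entries: two agents share one IP, as they might behind a proxy.
t0 = datetime(2004, 3, 28, 9, 0)
entries = [
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0", "uri": "/index.html", "time": t0},
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0", "uri": "/logo.gif",
     "time": t0 + timedelta(seconds=1)},
    {"ip": "10.0.0.1", "agent": "Lynx/2.8", "uri": "/index.html",
     "time": t0 + timedelta(minutes=1)},
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0", "uri": "/about.html",
     "time": t0 + timedelta(minutes=5)},
    {"ip": "10.0.0.1", "agent": "Mozilla/4.0", "uri": "/index.html",
     "time": t0 + timedelta(hours=2)},
]
sessions = sessionize(entries)
for session in sessions:
    print(session["user"], session["pages"])
```

Combining the IP address with the agent string separates some users who share a proxy, but it cannot distinguish two people using the same browser version behind the same address.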

24
(No Transcript)
25
Web Usage Mining
  • VL Verification
  • Data Mining to Discover Patterns of Use
  • Pre-Processing
  • Pattern Discovery
  • Pattern Analysis
  • Site Analysis, Not User Analysis
  • Srivastava, J., Cooley, R., Deshpande, M., &
    Tan, P.-N. (2000)

26
Web Usage Discovery
  • Content
  • Text
  • Graphics
  • Features
  • Structure
  • Content Organization
  • Templates and Tags
  • Usage
  • Patterns
  • Page References
  • Dates and Times
  • User Profile
  • Demographics
  • Customer Information

27
Web Usage Collection
  • Types of Data
  • Web Servers
  • Proxies
  • Web Clients
  • Data Abstractions
  • Sessions
  • Episodes
  • Clickstreams
  • Page Views
  • The Tools for Web Use Verification

28
Web Usage Preprocessing
  • Usage Preprocessing
  • Understanding the Web Use Activities of the Site
  • Extract from Logs
  • Content Preprocessing
  • Converting Content Into Formats for Processing
  • Understanding Content (Working with Dev Team)
  • Structure Preprocessing
  • Mining Links and Navigation from Site
  • Understanding Page Content and Link Structures

29
Web Usage Pattern Discovery
  • Clustering for Similarities
  • Pages
  • Users
  • Links
  • Classification
  • Mapping Data to Pre-defined Classes
  • Rule Discovery
  • Rule Rules
  • Computation Intensive
  • Many Paths to the Similar Answers
  • Pattern Detection
  • Ordering By Time
  • Predicting Use With Time

30
Web Usage Mining as Evaluation?
  • Mining Goals
  • Improved Design
  • Improved Delivery
  • Improved Content
  • Personalization (XMod Data)
  • System Improvement (Tech Data)
  • Site Modification (IA Data)
  • Business Intelligence (Market Data)
  • Usage Characterization (User Behavior Data)

31
Web Analytics Wrap-up
  • What can we learn about users?
  • What can we learn about services?
  • How can we help users improve their use?
  • How can IR models benefit from this analysis?
  • What kind of improvements in Web IR systems and
    their interfaces can be taken from this?