Title: Web Mining
1Web Mining
- What we would like to teach you
- Web Mining, what can it do for you?
- Why do we want to mine the web?
- What should we mine?
- What type of problems do we have when Mining?
2Web Mining cont.
- How do we mine the web?
- What problems will we have along the way?
- Our data requires users; how can we tell who is who?
- How do we track users?
- Knowing how we can track the users, what type of data do we need?
3Background Information
- What you need to know before mining the web.
- There are three types of web mining
- Web usage mining
- Web content mining
- Web structure mining
4Web Usage Mining
Web usage mining is a type of data mining that
looks at how users use and navigate a web
site. Early web usage mining only reported user
activity. Now we look to find patterns in the
user activity.
5Web Content Mining
- Web content mining tries to discover useful information about the content of a page.
- It is often text mining with little regard for the structure of the page itself.
- The goal is to find useful information in text, video, or images.
6Web Structure Mining
- Web structure mining examines the connections and layout of a web site.
- Types of connections
- Hyperlinks
- HTML/XML tags
7Knowing all of this, what can we do?
- We can help eCommerce sites better market their items and services.
- We can personalize our web sites.
- We can better present data with smarter entry/exit points.
8Web Mining Steps
- Pre-process your data
- Discover patterns
- Pattern Analysis
9Pre-Processing
- Web logs contain varied information; what do we want?
- Cases that complicate the data, e.g.
- Single IP / multiple server sessions
- Multiple IPs / single server session
- Multiple IPs / single user
- Multiple agents / single user
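One common pre-processing heuristic for the cases above is to treat each (IP, user agent) pair as a separate visitor. The sketch below assumes log entries have already been reduced to (host, agent, path) tuples; the function name `group_by_visitor` and the second user agent string are made up for illustration. It does not resolve the multiple-IP or multiple-agent cases for a single user.

```python
from collections import defaultdict

def group_by_visitor(entries):
    """Group requested paths by (host, user agent), a rough stand-in for 'one user'."""
    visitors = defaultdict(list)
    for host, agent, path in entries:
        visitors[(host, agent)].append(path)
    return dict(visitors)

# Same IP, two different agents: treated as two visitors (second agent is hypothetical)
entries = [
    ("123.123.123.123", "Mozilla/4.05 (Macintosh; I; PPC)", "/asctortf/"),
    ("123.123.123.123", "Mozilla/4.7 [en] (Win95; U)", "/news/news.html"),
    ("123.123.123.123", "Mozilla/4.05 (Macintosh; I; PPC)", "/pics/wpaper.gif"),
]
visitors = group_by_visitor(entries)
print(len(visitors), "visitors found")
```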
10Pattern Discovery
- We can use our Data Mining Techniques
- Association rule Mining
- Classification
- Clustering
- Outlier detection
11What to do with our rules?
- eCommerce web sites can use rules to find out who else will purchase similar items.
- We can use rules to generate advertisements tailored to our personal tastes.
- News sites can lay out their pages to give quicker paths to the most-used data, or customize them for the specific user.
12Data and Errors/Noise
- Sample Data and explanation
- Noise problems
- Main problem with HTTP
- IP problems
- NAT, Proxies, VPN, and remote access problems
- Small viewing problem
- Bots
13Sample Data
- fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen_at_looksmart.net)"
- fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen_at_looksmart.net)"
- 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
- 123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
- 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/5star2000.gif HTTP/1.0" 200 4005 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
- 123.123.123.123 - - [26/Apr/2000:00:23:50 -0400] "GET /pics/5star.gif HTTP/1.0" 200 1031 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
- 123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /pics/a2hlogo.jpg HTTP/1.0" 200 4282 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
- 123.123.123.123 - - [26/Apr/2000:00:23:51 -0400] "GET /cgi-bin/newcount?jafsof3&width=4&font=digital&noshow HTTP/1.0" 200 36 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
14Explanation of Data
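- ppp931.on.bellglobal.com (client host name or IP)
- - - (identity and authenticated user name, both unavailable here)
- [26/Apr/2000:00:16:12 -0400] (time of the request and time zone offset)
- "GET /download/windows/asctab31.zip HTTP/1.0" (request line: method, path, protocol)
- 200 (HTTP status code)
- 1540096 (size of the response in bytes)
- "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" (referrer)
- "Mozilla/4.7 [en]C-SYMPA (Win95; U)" (user agent)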
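The fields explained above can be pulled apart with a regular expression. This is a minimal sketch that assumes the log is in the common "combined" access-log layout (bracketed timestamp, quoted request, referrer, and user agent); the names `LOG_PATTERN` and `parse_line` are illustrative and not part of any tool used in the slides.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request line is "METHOD path PROTOCOL"; keep method and path separately
    parts = entry["request"].split()
    entry["method"] = parts[0] if len(parts) >= 2 else None
    entry["path"] = parts[1] if len(parts) >= 2 else None
    return entry

sample = ('ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] '
          '"GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 '
          '"http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" '
          '"Mozilla/4.7 [en]C-SYMPA (Win95; U)"')
entry = parse_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

Lines that do not fit the pattern return None, so malformed entries can be counted or skipped rather than crashing the pre-processing step.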
15Noise Problem HTTP
- Stateless Protocol
- The server forgets that users ever came to the site
- Unable to keep track of any interactions between
users and server
16Noise IP problem NAT
- Network Address Translation
- Aka IP masquerading
- Each connection has its own IP
- However, NAT converts them into one general IP used by everyone
- 192.168.24.5, 192.168.24.8, 192.168.24.38 (internal addresses)
- 56.23.92.1 (the single address the web server sees)
17Noise IP problem Proxies
- Like NAT in that they change your IP
- The difference is that they operate at a higher level (the application layer rather than the network layer)
- Happens when you access library resources
- More control than NAT
18Noise IP problem VPN
- Virtual Private Network
- Use of tunnels
- Used to connect to other computers on the VPN
- IP as in VPN
19Noise IP problem Remote Access
- Citrix, VPC, SSH, PuTTY, and others
- Using programs to access computers from a remote location
- Actually able to use programs on the remote computer rather than just view files
20Noise problem Viewing
- What counts as viewing a page?
- One page will most likely generate multiple GET requests
- Variety of types
- .html, .js, .doc, .zip, .jpg, and various web apps such as .cgi and .php
- Distinguishing between important and irrelevant types
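One simple way to separate page views from embedded resources is to filter on the file extension of the requested path. The sketch below is only a heuristic; the set `PAGE_EXTENSIONS` and the function `is_page_view` are assumptions chosen for illustration, and the right set depends on the site being mined.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical choice of extensions that count as a "page view";
# an empty extension covers directory-style URLs such as /asctortf/
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php", ".cgi"}

def is_page_view(path):
    """True if the requested path looks like a page rather than an embedded resource."""
    clean_path = urlsplit(path).path                      # drop any query string
    extension = posixpath.splitext(clean_path)[1].lower()
    return extension in PAGE_EXTENSIONS

requests = ["/asctortf/", "/pics/wpaper.gif", "/pics/5star2000.gif",
            "/pics/a2hlogo.jpg", "/news/news.html"]
page_views = [p for p in requests if is_page_view(p)]
print(page_views)
```

Extension filtering misclassifies some requests (for example a script that returns a counter image), so a real filter would likely need site-specific exceptions.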
21Noise problem Bots
- Web robot
- Why are bots bad?
- Bots sometimes show up in User Agent
- Problems occur when they do not show up
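A rough sketch of bot filtering, assuming two heuristics: the user agent contains a tell-tale substring, or the client requested /robots.txt (which well-behaved robots fetch before crawling). The marker list `BOT_MARKERS` and the function `looks_like_bot` are illustrative assumptions; bots that hide their identity and skip robots.txt will still get through.

```python
# Hypothetical substrings; real crawler lists are much longer and change over time
BOT_MARKERS = ("crawler", "spider", "bot", "slurp")

def looks_like_bot(user_agent, requested_paths=()):
    """Guess whether a client is a robot from its user agent and requested paths."""
    agent = user_agent.lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return True
    # Fallback for bots with a browser-like user agent
    return "/robots.txt" in requested_paths

print(looks_like_bot("FAST-WebCrawler/2.1-pre2 (ashen_at_looksmart.net)"))
print(looks_like_bot("Mozilla/4.05 (Macintosh; I; PPC)"))
```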
22Tracking User Sessions
- A session: a series of URLs visited in order within a given time frame
- Techniques to add state to HTTP
- HTTP Authentication
- Client-side cookies
- URL cookies
- Hidden form fields
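When none of these techniques is available, sessions are often reconstructed from the log itself by cutting a visitor's requests wherever the gap between hits gets too long. The sketch below assumes hits are (user_key, timestamp, url) tuples and uses a 30-minute timeout, which is a common convention rather than anything stated in these slides; `sessionize` is an illustrative name.

```python
from datetime import datetime, timedelta

def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group (user_key, timestamp, url) hits into per-user sessions.

    A new session starts when the gap since the user's previous hit exceeds timeout.
    """
    sessions = []       # list of (user_key, [urls in visit order])
    last_seen = {}      # user_key -> (timestamp of last hit, index into sessions)
    for user_key, timestamp, url in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(user_key)
        if previous is None or timestamp - previous[0] > timeout:
            sessions.append((user_key, [url]))
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(url)
        last_seen[user_key] = (timestamp, index)
    return sessions

start = datetime(2000, 4, 26, 0, 23, 47)
hits = [
    ("123.123.123.123", start, "/asctortf/"),
    ("123.123.123.123", start + timedelta(seconds=1), "/pics/wpaper.gif"),
    ("123.123.123.123", start + timedelta(hours=2), "/asctortf/"),
]
sessions = sessionize(hits)
print(len(sessions), "sessions")
```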
23HTTP Authentication
- How it Works
- Visitor accesses a URL that requires HTTP authentication
- The server sends a response to the browser asking for credentials
- The browser asks the visitor for credentials and sends them to the server
- The server validates the credentials and logs the username along with the URL accessed
- The browser caches the authentication information for future HTTP requests until the browser window is closed
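To make the exchange above concrete, here is a sketch of how credentials travel in the request header, assuming the Basic scheme (the slides do not say which scheme is meant). The browser sends `Authorization: Basic <base64 of "username:password">`, and the server decodes it to get the username it logs. The function names are illustrative.

```python
import base64

def build_basic_header(username, password):
    """Build the Authorization header value a browser would send (Basic scheme)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

def username_from_header(header_value):
    """Recover the username the server would log, or None for other schemes."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        return None
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, _password = decoded.partition(":")
    return username

header = build_basic_header("alice", "secret")   # hypothetical credentials
print(header)
print(username_from_header(header))
```

Base64 is an encoding, not encryption, so Basic credentials are only protected when the connection itself is encrypted.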
24HTTP Authentication
An example of a site that requires HTTP
authentication
25HTTP Authentication
- Advantages
- Part of the HTTP protocol
- Username logged to standard server logs
- Easy to keep track of unique users
- Disadvantages
- Anonymous access lumped together
- Usernames need to be unique per person
- Users dislike accounts for viewing web pages
26Client-side cookies
- How it works
- Visitor accesses a URL
- The server appends a cookie to the HTTP response
- The browser saves the cookie
- The browser sends the cookie back to the server
with future HTTP requests
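The round trip above can be sketched with Python's standard `http.cookies` module. The cookie name `visitor_id`, the one-year lifetime, and both function names are assumptions made for illustration.

```python
import uuid
from http.cookies import SimpleCookie

def make_set_cookie_header(visitor_id, max_age_days=365):
    """Build the value of the Set-Cookie header the server appends to its response."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = visitor_id
    cookie["visitor_id"]["path"] = "/"
    cookie["visitor_id"]["max-age"] = max_age_days * 24 * 60 * 60
    return cookie["visitor_id"].OutputString()

def read_visitor_id(cookie_header):
    """Extract the visitor ID from the Cookie header the browser sends back."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None

new_id = uuid.uuid4().hex                      # random ID for a first-time visitor
set_cookie = make_set_cookie_header(new_id)    # sent by the server
print("Set-Cookie:", set_cookie)
print(read_visitor_id(f"visitor_id={new_id}; theme=dark"))   # later request
```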
27Client-side cookies
An example cookie from cnet.com
28Client-side cookies
- Advantages
- Transport is transparent to the user
- Works even if user closes the browser window
- Can be logged by the web server
- Cookies are required by many useful web sites
- Users will be used to accepting them
- Wellsfargo.com, discovercard, citicards.com, and chase.com all required cookies when tested.
- facebook.com, gmail.com, hotmail.com, and myspace.com all required cookies when tested.
- Disadvantages
- The browser can decide what to do with cookies (reject/delete them)
- Multiple people can share a browser
- The browser's date/time must match the server's
- Computer cleaning applications often delete cookies for privacy reasons
29Client-side cookies
Firefox 2.02 Privacy Options for Cookies
30Client-side cookies
A site that requires cookies
31URL cookies
- How it works
- Visitor accesses a URL
- The server generates all links on the returned page with a unique session identifier in the URL. Example: csbsju.edu/SESSIONID/home.html
- When the visitor clicks on a link, the URL including the session ID is logged
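A minimal sketch of both halves of this technique: rewriting site-relative links so they carry the session ID, and splitting the ID back out of a logged path. It assumes the ID is a 32-character hex string placed as the first path segment, following the csbsju.edu/SESSIONID/home.html example; the function names are illustrative, and a regular expression is only a stand-in for proper HTML rewriting.

```python
import re
import uuid

def add_session_to_links(html, session_id):
    """Prefix every site-relative href (starting with a single '/') with the session ID."""
    return re.sub(r'href="/(?!/)', f'href="/{session_id}/', html)

def split_session_from_path(path):
    """Split '/<32 hex chars>/rest' into (session_id, '/rest'); (None, path) otherwise."""
    match = re.match(r"^/([0-9a-f]{32})(/.*)$", path)
    if match is None:
        return None, path
    return match.group(1), match.group(2)

session_id = uuid.uuid4().hex
page = '<a href="/home.html">Home</a> <a href="http://example.com/">External</a>'
rewritten = add_session_to_links(page, session_id)
print(rewritten)
print(split_session_from_path(f"/{session_id}/home.html"))
```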
32URL cookies
- Advantages
- Will work when client-side cookies won't
- Disadvantages
- All the links on a page have to be dynamically generated
- Visitors can bookmark URLs and send them to others, so return visits based on old session IDs may be noisy data
- The URLs are confusing to read
- Won't automatically track return visitors without a bookmark
33Hidden form fields
- How it works
- Visitor accesses a URL
- The server generates a unique session identifier and places it within the HTML of the page
- The user clicks a form submit button or JavaScript-controlled link, which submits the form
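As a sketch of the steps above: the server embeds the session ID in a hidden input, and reads it back from the submitted form body. The field name `session_id`, the `/next` action, and the function names are assumptions for illustration.

```python
from html import escape
from urllib.parse import parse_qs

def render_form(session_id, action="/next"):
    """Return an HTML form that carries the session ID in a hidden field."""
    return (
        f'<form method="post" action="{escape(action)}">'
        f'<input type="hidden" name="session_id" value="{escape(session_id)}">'
        f'<button type="submit">Continue</button></form>'
    )

def session_from_post_body(body):
    """Read the session ID back out of a URL-encoded POST body, or None if absent."""
    values = parse_qs(body).get("session_id")
    return values[0] if values else None

form_html = render_form("abc123")
print(form_html)
print(session_from_post_body("session_id=abc123&page=2"))
```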
34Hidden form fields
- Advantages
- Will work with all browsers all the time if form buttons are used without JavaScript
- Transparent to the user
- Disadvantages
- Using GET instead of POST will have the same problems as URL cookies
- May require JavaScript for the best user transparency
- All pages must be dynamically generated by the web server
- Web logs don't contain POST information
35Data Intelligence Processes
- Association Rule Mining
- Used to discover web structure
- Used to discover user access patterns
- Useful because
- Advertising
- Can help to determine web structure
- Notion of hubs and authorities making up a web community
- E.g., link-analysis algorithms such as HITS and Google's PageRank
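To illustrate how association rules can expose user access patterns, the sketch below counts which pairs of pages appear together in a session and reports rules "A -> B" with their support and confidence. It is a simplified pairwise version, not a full Apriori implementation; the page paths, thresholds, and the name `pair_rules` are made up for the example.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for page pairs."""
    total = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))          # each page counts once per session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))
    rules = []
    for (page_a, page_b), count in pair_counts.items():
        support = count / total               # share of sessions containing both pages
        if support < min_support:
            continue
        for antecedent, consequent in ((page_a, page_b), (page_b, page_a)):
            confidence = count / page_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)

# Hypothetical sessions, each a list of visited pages
sessions = [
    ["/admissions", "/tuition", "/apply"],
    ["/admissions", "/tuition"],
    ["/admissions", "/athletics"],
    ["/tuition", "/apply"],
]
rules = pair_rules(sessions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}  support={support:.2f}  confidence={confidence:.2f}")
```

A rule with high confidence, such as visitors to one page nearly always visiting another, suggests the two pages should be linked or advertised together.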
36Data Intelligence Processes
- Classification
- Predict similar web pages that a user would like to visit
- Customization of a web page
- Clustering
- Useful in creating a hierarchy of web pages (e.g., Yahoo! hierarchy)
- Outlier Detection
- Very limited application in web mining
37Our Research Objective
- Mining the CSB/SJU website
- We've noticed
- Search function is bad/horrible
- Use of A to Z Index
- Take only student network traffic
- Take different time sections
- See if any relationships can be discovered
- Using Enter/Exit vectors
- See what common sessions are like
- Are there similar web pages which should be
linked to one another?
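One possible reading of "Enter/Exit vectors" is a count of how often each page is the first or last page of a session. The sketch below takes that interpretation as an assumption; the session data and the name `entry_exit_counts` are hypothetical.

```python
from collections import Counter

def entry_exit_counts(sessions):
    """Count how often each page starts a session and how often it ends one."""
    entries = Counter(session[0] for session in sessions if session)
    exits = Counter(session[-1] for session in sessions if session)
    return entries, exits

# Hypothetical sessions, each a list of pages in visit order
sessions = [
    ["/", "/admissions", "/apply"],
    ["/", "/athletics"],
    ["/admissions", "/apply"],
]
entries, exits = entry_exit_counts(sessions)
print("Most common entry pages:", entries.most_common(2))
print("Most common exit pages:", exits.most_common(2))
```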
38References
- Data Mining and the Web: Past, Present and Future, Garofalakis M, Rastogi R, Seshadri S, Shim K, http://delivery.acm.org/10.1145/320000/319781/p43-garofalakis.pdf?key1=319781&key2=1356793711&coll=Portal&dl=ACM&CFID=16009993&CFTOKEN=87185929
- Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, Srivastava J, Cooley R, Deshpande M, Tan P, http://delivery.acm.org/10.1145/850000/846188/p12-srivastava.pdf?key1=846188&key2=6296793711&coll=Portal&dl=ACM&CFID=16009993&CFTOKEN=87185929
- Warehousing and Mining Web Logs, Joshi K, Joshi A, Yesha Y, Krishnapuram R, http://delivery.acm.org/10.1145/320000/319792/p63-joshi.pdf?key1=319792&key2=2396793711&coll=Portal&dl=ACM&CFID=16009993&CFTOKEN=87185929
39References
- Dataset and explanations, http://www.jafsoft.com/searchengines/log_sample.html, Jafsoft 2005
- NAT picture, http://www.skullbox.net/firewalls/nat.gif, Skullbox.net 2004
- VPN picture, http://www.internetaccessmonitor.com/eng/support/docs/winroute/img/vpn-scheme.png, Red Line Software 2006
- Various research, www.wikipedia.org, Wikipedia 2007
40References
- http://www.galeas.de/webmining.html