WEB MINING Prof. Navneet Goyal BITS, Pilani - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
WEB MINING, Prof. Navneet Goyal, BITS, Pilani
2
Web Mining
  • Web Mining is the use of the data mining
    techniques to automatically discover and extract
    information from web documents/services
  • Discovering useful information from the
    World-Wide Web and its usage patterns
  • My Definition: Using data mining techniques to
    make the web more useful and more profitable (for
    some) and to increase the efficiency of our
    interaction with the web

3
Web Mining
  • Data Mining Techniques
  • Association rules
  • Sequential patterns
  • Classification
  • Clustering
  • Outlier discovery
  • Applications to the Web
  • E-commerce
  • Information retrieval (search)
  • Network management

4
Examples of Discovered Patterns
  • Association rules
  • 98% of AOL users also have E-trade accounts
  • Classification
  • People with age less than 40 and salary > 40k
    trade on-line
  • Clustering
  • Users A and B access similar URLs
  • Outlier Detection
  • User A spends more than twice the average amount
    of time surfing on the Web
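
A rule such as "AOL users also have E-trade accounts" is usually scored by its confidence, the fraction of users matching the left-hand side who also match the right-hand side. The sketch below computes that on invented toy data; the user sets and the function names are illustrative assumptions, not part of the slides.

```python
# Toy data (invented): each set lists the services one user subscribes to.
users = [
    {"AOL", "E-trade"},
    {"AOL", "E-trade"},
    {"AOL"},
    {"E-trade"},
    {"AOL", "E-trade", "Yahoo"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_conf = confidence({"AOL"}, {"E-trade"}, users)
print(f"AOL -> E-trade: confidence = {rule_conf:.2f}")  # prints 0.75
```

Here 3 of the 4 AOL users also have E-trade, so the rule holds with 75% confidence on this toy data.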

5
Web Mining
  • The WWW is a huge, widely distributed, global
    information service centre for:
  • Information services: news, advertisements,
    consumer information, financial management,
    education, government, e-commerce, etc.
  • Hyper-link information
  • Access and usage information
  • WWW provides rich sources of data for data mining

6
Why Mine the Web?
  • Enormous wealth of information on Web
  • Financial information (e.g. stock quotes)
  • Book/CD/Video stores (e.g. Amazon)
  • Restaurant information (e.g. Zagats)
  • Car prices (e.g. Carpoint)
  • Lots of data on user access patterns
  • Web logs contain sequence of URLs accessed by
    users
  • Possible to mine interesting nuggets of
    information
  • People who ski also travel frequently to Europe
  • Tech stocks have corrections in the summer and
    rally from November until February

7
Why is Web Mining Different?
  • The Web is a huge collection of documents except
    for:
  • Hyper-link information
  • Access and usage information
  • The Web is very dynamic
  • New pages are constantly being generated
  • Challenge: Develop new Web mining algorithms and
    adapt traditional data mining algorithms to:
  • Exploit hyper-links and access patterns
  • Be incremental

8
Web Mining Applications
  • E-commerce (Infrastructure)
  • Generate user profiles
  • Targeted advertizing
  • Fraud
  • Similar image retrieval
  • Information retrieval (Search) on the Web
  • Automated generation of topic hierarchies
  • Web knowledge bases
  • Extraction of schema for XML documents
  • Network Management
  • Performance management
  • Fault management

9
User Profiling
  • Important for improving customization
  • Provide users with pages, advertisements of
    interest
  • Example profiles: on-line trader, on-line
    shopper
  • Generate user profiles based on their access
    patterns
  • Cluster users based on frequently accessed URLs
  • Use classifier to generate a profile for each
    cluster
  • Engage Technologies
  • Tracks web traffic to create anonymous user
    profiles of Web surfers
  • Has profiles for more than 35 million anonymous
    users

10
Internet Advertizing
  • Ads are a major source of revenue for Web portals
    (e.g., Yahoo, Lycos) and E-commerce sites
  • Plenty of startups doing internet advertizing
  • Doubleclick, AdForce, Flycast, AdKnowledge
  • Internet advertizing is probably the hottest
    web mining application today

11
Internet Advertizing
  • Scheme 1
  • Manually associate a set of ads with each user
    profile
  • For each user, display an ad from the set based
    on profile
  • Scheme 2
  • Automate association between ads and users
  • Use ad click information to cluster users (each
    user is associated with a set of ads that he/she
    clicked on)
  • For each cluster, find ads that occur most
    frequently in the cluster and these become the
    ads for the set of users in the cluster

12
Internet Advertizing
  • Use collaborative filtering (e.g. Likeminds,
    Firefly)
  • Each user Ui has a rating for a subset of ads
    (based on click information, time spent, items
    bought etc.)
  • Rij: rating of user Ui for ad Aj
  • Problem: Compute user Ui's rating for an unrated
    ad Aj

13
Internet Advertizing
  • Key Idea: User Ui's rating for ad Aj is set to
    Rkj, where Uk is the user whose rating of ads is
    most similar to Ui's
  • User Ui's rating for an ad Aj that has not been
    previously displayed to Ui is computed as
    follows:
  • Consider a user Uk who has rated ad Aj
  • Compute Dik, the distance between Ui's and Uk's
    ratings on common ads
  • Ui's rating for ad Aj = Rkj (Uk is the user with
    smallest Dik)
  • Display to Ui the ad Aj with the highest computed
    rating
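
The nearest-neighbour scheme on this slide can be sketched as follows. The slide does not say how Dik is measured, so Euclidean distance over the commonly rated ads is an assumption here, and the ratings table and function names are invented for illustration.

```python
import math

# ratings[user][ad] = rating; toy values invented for illustration.
ratings = {
    "U1": {"A1": 5, "A2": 1, "A3": 4},
    "U2": {"A1": 5, "A2": 1, "A3": 4, "A4": 2},
    "U3": {"A1": 1, "A2": 5, "A4": 5},
}

def distance(ri, rk):
    """D_ik: Euclidean distance over ads rated by both users (inf if none)."""
    common = set(ri) & set(rk)
    if not common:
        return math.inf
    return math.sqrt(sum((ri[a] - rk[a]) ** 2 for a in common))

def predict(user, ad, ratings):
    """Return R_kj of the nearest user U_k who has rated the ad, or None."""
    best_rating, best_dist = None, math.inf
    for other, other_ratings in ratings.items():
        if other == user or ad not in other_ratings:
            continue
        d = distance(ratings[user], other_ratings)
        if d < best_dist:
            best_rating, best_dist = other_ratings[ad], d
    return best_rating

def best_unseen_ad(user, ratings):
    """Pick the not-yet-rated ad with the highest predicted rating."""
    all_ads = {a for r in ratings.values() for a in r}
    unseen = all_ads - set(ratings[user])
    scored = {a: predict(user, a, ratings) for a in unseen}
    scored = {a: s for a, s in scored.items() if s is not None}
    return max(scored, key=scored.get) if scored else None

print(predict("U1", "A4", ratings))   # prints 2 (U2 is U1's nearest neighbour)
print(best_unseen_ad("U3", ratings))  # prints A3
```

Using a single nearest neighbour follows the slide literally; practical systems often average over several neighbours, which is a different technique from the one described here.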

14
Fraud
  • With the growing popularity of E-commerce,
    systems to detect and prevent fraud on the Web
    become important
  • Maintain a signature for each user based on
    buying patterns on the Web (e.g., amount spent,
    categories of items bought)
  • If buying pattern changes significantly, then
    signal fraud
  • HNC Software uses domain knowledge and neural
    networks for credit card fraud detection
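
As a rough illustration of the signature idea, the sketch below keeps only the mean and standard deviation of a user's past purchase amounts and flags a purchase that deviates strongly from them. The three-standard-deviation threshold, the purchase history, and the function names are assumptions; this is a simple statistical check, not the neural-network approach mentioned above.

```python
from statistics import mean, stdev

def build_signature(amounts):
    """Summarise past purchase amounts as (mean, standard deviation)."""
    return mean(amounts), stdev(amounts)

def is_suspicious(amount, signature, threshold=3.0):
    """Flag a purchase more than `threshold` standard deviations from the mean."""
    mu, sigma = signature
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold

history = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0]  # invented past purchases
signature = build_signature(history)
print(is_suspicious(24.0, signature))   # prints False
print(is_suspicious(400.0, signature))  # prints True
```

A fuller signature would also track the categories of items bought, as the slide suggests.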

15
Retrieval of Similar Images
  • Given
  • A set of images
  • Find
  • All images similar to a given image
  • All pairs of similar images
  • Sample applications
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce

16
Retrieval of Similar Images
  • QBIC, Virage, Photobook
  • Compute feature signature for each image
  • QBIC uses color histograms
  • WBIIS, WALRUS use wavelets
  • Use spatial index to retrieve database image
    whose signature is closest to the query's
    signature
  • WALRUS decomposes an image into regions
  • A single signature is stored for each region
  • Two images are considered to be similar if they
    have enough similar region pairs
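
A colour-histogram signature of the kind attributed to QBIC can be sketched in a few lines. This is an illustration under simplifying assumptions: images are plain lists of (r, g, b) tuples, each channel is quantised into two bins, signatures are compared with L1 distance, and a linear scan stands in for the spatial index. None of these choices come from the systems named on the slide.

```python
def color_histogram(pixels, bins_per_channel=2):
    """Normalised histogram over quantised (r, g, b) colours (values 0-255)."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..n-1.
        rb, gb, bb = r * n // 256, g * n // 256, b * n // 256
        hist[(rb * n + gb) * n + bb] += 1
    total = len(pixels)
    return [count / total for count in hist]

def l1_distance(h1, h2):
    """Sum of absolute differences between two signatures."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def most_similar(query, database):
    """Name of the database image whose signature is closest to the query's."""
    q = color_histogram(query)
    return min(database,
               key=lambda name: l1_distance(q, color_histogram(database[name])))

# Tiny invented "images": four pixels each.
red = [(250, 10, 10)] * 4
mostly_red = [(240, 20, 30)] * 3 + [(10, 10, 250)]
blue = [(10, 10, 250)] * 4
database = {"mostly_red": mostly_red, "blue": blue}
print(most_similar(red, database))  # prints mostly_red
```

A region-based system such as WALRUS would compute one signature per region rather than one per image.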

17
Images retrieved by WALRUS
[Figure: query image and the images retrieved by WALRUS]
18
Problems with Web Search Today
  • Today's search engines are plagued by problems
  • the abundance problem (99% of info of no interest
    to 99% of people)
  • limited coverage of the Web (internet sources
    hidden behind search interfaces)
  • Largest crawlers cover < 18% of all web pages
  • limited query interface based on keyword-oriented
    search
  • limited customization to individual users

19
Problems with Web Search Today
  • Today's search engines are plagued by problems
  • Web is highly dynamic
  • Lot of pages added, removed, and updated every
    day
  • Very high dimensionality

20
Improve Search By Adding Structure to the Web
  • Use Web directories (or topic hierarchies)
  • Provide a hierarchical classification of
    documents (e.g., Yahoo!)
  • Searching in the context of a topic restricts
    the search to only the subset of web pages
    related to the topic

[Figure: example topic hierarchy rooted at the Yahoo home page, with nodes Recreation, Science, Business, News, Sports, Travel, Companies, Finance, Jobs]
21
Automatic Creation of Web Directories
  • In the Clever project, hyper-links between Web
    pages are taken into account when categorizing
    them
  • Use a Bayesian classifier
  • Exploit knowledge of the classes of immediate
    neighbors of document to be classified
  • Show that simply taking text from neighbors and
    using standard document classifiers to classify
    page does not work
  • Inktomi's Directory Engine uses Concept
    Induction to automatically categorize millions
    of documents

22
Network Management
  • Objective: To deliver content to users quickly
    and reliably
  • Traffic management
  • Fault management

23
Why is Traffic Management Important?
  • While annual bandwidth demand is increasing
    ten-fold on average, annual bandwidth supply is
    rising only by a factor of three
  • Result is frequent congestion at servers and on
    network links
  • During a major event (e.g., Princess Diana's
    death), an overwhelming number of user requests
    can result in millions of redundant copies of
    data flowing back and forth across the world
  • Olympic sites during the games
  • NASA sites close to launch and landing of
    shuttles

24
Traffic Management
  • Key Ideas
  • Dynamically replicate/cache content at multiple
    sites within the network and closer to the user
  • Multiple paths between any pair of sites
  • Route user requests to server closest to the user
    or least loaded server
  • Use path with least congested network links
  • Akamai, Inktomi

25
Traffic Management
[Figure: service provider network showing a request, a congested link, and a congested server]
26
Traffic Management
  • Need to mine network and Web traffic to determine:
  • What content to replicate?
  • Which servers should store replicas?
  • Which server to route a user request to?
  • What path to use to route packets?
  • Network Design issues
  • Where to place servers?
  • Where to place routers?
  • Which routers should be connected by links?
  • One can use association rules, sequential pattern
    mining algorithms to cache/prefetch replicas at
    server
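
One simple way to turn access sequences into prefetch decisions is to count which page most often follows each page in past sessions. The sketch below does that with first-order transitions only, a much-reduced stand-in for full sequential pattern mining; the sessions, the 0.5 confidence threshold, and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

def build_transitions(sessions):
    """Count how often each page is immediately followed by each other page."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            transitions[current][nxt] += 1
    return transitions

def pages_to_prefetch(page, transitions, min_confidence=0.5):
    """Pages whose estimated P(next | page) is at least min_confidence."""
    followers = transitions.get(page)
    if not followers:
        return []
    total = sum(followers.values())
    return [p for p, n in followers.most_common() if n / total >= min_confidence]

# Invented sessions: each list is the ordered pages one visitor requested.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
transitions = build_transitions(sessions)
print(pages_to_prefetch("/home", transitions))  # prints ['/products']
```

A server that has just delivered /home could then prefetch or cache /products ahead of the next request.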

27
Fault Management
  • Fault management involves
  • Quickly identifying failed/congested servers and
    links in network
  • Re-routing user requests and packets to avoid
    congested/down servers and links
  • Need to analyze alarm and traffic data to carry
    out root cause analysis of faults
  • Bayesian classifiers can be used to predict the
    root cause given a set of alarms
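
To make the last point concrete, here is a minimal Bernoulli naive Bayes sketch that learns from past incidents labelled with their root cause and then predicts the most probable cause for a new set of alarms. The alarm names, cause names, and training incidents are invented, and Laplace smoothing is an assumption added to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict

def train(incidents):
    """incidents: list of (set_of_alarms, root_cause) pairs from past faults."""
    cause_counts = Counter(cause for _, cause in incidents)
    alarm_counts = defaultdict(Counter)
    vocabulary = set()
    for alarms, cause in incidents:
        vocabulary |= alarms
        for alarm in alarms:
            alarm_counts[cause][alarm] += 1
    return cause_counts, alarm_counts, vocabulary

def predict(alarms, model):
    """Return the root cause with the highest posterior probability."""
    cause_counts, alarm_counts, vocabulary = model
    total = sum(cause_counts.values())
    best_cause, best_score = None, -math.inf
    for cause, n in cause_counts.items():
        score = math.log(n / total)  # log prior of the cause
        for alarm in vocabulary:
            # Laplace-smoothed estimate of P(alarm is raised | cause)
            p = (alarm_counts[cause][alarm] + 1) / (n + 2)
            score += math.log(p if alarm in alarms else 1 - p)
        if score > best_score:
            best_cause, best_score = cause, score
    return best_cause

# Invented training data.
incidents = [
    ({"link_down", "high_latency"}, "cable_cut"),
    ({"link_down"}, "cable_cut"),
    ({"cpu_high", "high_latency"}, "server_overload"),
    ({"cpu_high"}, "server_overload"),
]
model = train(incidents)
print(predict({"link_down"}, model))  # prints cable_cut
```

The "naive" part is the assumption that alarms are independent given the root cause, which rarely holds exactly in a real network but keeps the model simple.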

28
Web Mining Issues
  • Size
  • Grows at about 1 million pages a day
  • Google indexes 9 billion documents
  • Number of web sites
  • Netcraft survey says 72 million sites
  • (http://news.netcraft.com/archives/web_server_survey.html)
  • Diverse types of data
  • Images
  • Text
  • Audio/video
  • XML
  • HTML

29
Number of Active Sites
[Figure: Total Sites Across All Domains, August 1995 - October 2007]
30
Systems Issues
  • Web data sets can be very large
  • Tens to hundreds of terabytes
  • Cannot mine on a single server!
  • Need large farms of servers
  • How to organize hardware/software to mine
    multi-terabyte data sets
  • Without breaking the bank!

31
Different Data Formats
  • Structured Data
  • Unstructured Data
  • OLE DB offers some solutions!

32
Web Data
  • Web pages
  • Intra-page structures
  • Inter-page structures
  • Usage data
  • Supplemental data
  • Profiles
  • Registration information
  • Cookies

33
Web Usage Mining
  • Pages contain information
  • Links are roads
  • How do people navigate the Internet?
  • → Web Usage Mining (clickstream analysis)
  • Information on navigation paths available in log
    files
  • Logs can be mined from a client or a server
    perspective

34
Website Usage Analysis
  • Why analyze Website usage?
  • Knowledge about how visitors use a Website could:
  • Provide guidelines for web site reorganization
  • Help prevent disorientation
  • Help designers place important information where
    the visitors look for it
  • Pre-fetching and caching web pages
  • Provide adaptive Website (Personalization)
  • Questions which could be answered
  • What are the differences in usage and access
    patterns among users?
  • What user behaviors change over time?
  • How do usage patterns change with quality of
    service (slow/fast)?
  • What is the distribution of network traffic over
    time?

35
Website Usage Analysis
36
Website Usage Analysis
37
Website Usage Analysis
  • Analog Web Log File Analyser
  • Gives basic statistics such as
  • number of hits
  • average hits per time period
  • what are the popular pages in your site
  • who is visiting your site
  • what keywords are users searching for to get to
    you
  • what is being downloaded
  • http://www.analog.cx/
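
The kind of basic statistics listed above can be computed from a server log with a short script. The sketch below assumes the log is in Common Log Format and counts hits per page and per host; it is an independent illustration, not how Analog itself works, and the sample log lines are invented.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "method path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def summarize(lines):
    """Return total hits plus per-page and per-host hit counts."""
    pages, hosts = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip malformed lines
        pages[match["path"]] += 1
        hosts[match["host"]] += 1
    return {"hits": sum(pages.values()), "pages": pages, "hosts": hosts}

sample_log = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2007:13:56:01 +0000] "GET /papers.html HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Oct/2007:13:57:12 +0000] "GET /index.html HTTP/1.0" 200 2326',
    'not a log line',
]
stats = summarize(sample_log)
print(stats["hits"])                  # prints 3
print(stats["pages"].most_common(1))  # prints [('/index.html', 2)]
```

Questions such as "what keywords are users searching for" need the referrer field, which the plain Common Log Format does not record.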

38
Web Usage Mining Process
39
Web Usage Mining Process
40
Web Usage Mining Process
41
Web Mining Outline
  • Goal: Examine the use of data mining on the World
    Wide Web
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining

42
Web Mining Taxonomy
[Figure: Web mining taxonomy, modified from [zai01]]
43
Web Content Mining
  • Examine the contents of web pages as well as
    the results of web searching
  • Can be thought of as extending the work performed
    by basic search engines
  • Search engines have crawlers to search the web
    and gather information, indexing techniques to
    store the information, and query processing
    support to provide information to the users
  • Web Content Mining is the process of extracting
    knowledge from web contents

44
Semi-structured Data
  • Content is, in general, semi-structured
  • Example
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content

45
Structuring Textual Data
  • Many methods designed to analyze structured data
  • If we can represent documents by a set of
    attributes we will be able to use existing data
    mining methods
  • How to represent a document?
  • Vector based representation
  • (referred to as "bag of words" as it is
    invariant to permutations)
  • Use statistics to add a numerical dimension to
    unstructured text

46
Document Representation
  • A document representation aims to capture what
    the document is about
  • One possible approach
  • Each entry describes a document
  • Attributes describe whether or not a term appears
    in the document

47
Document Representation
  • Another approach
  • Each entry describes a document
  • Attributes represent the frequency with which a
    term appears in the document
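
Both representations, the presence/absence vector and the term-frequency vector, can be built from the same vocabulary. The sketch below uses a deliberately simple whitespace tokenizer; the sample documents and function names are invented for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Sorted list of every distinct term across the documents."""
    return sorted({term for doc in documents for term in tokenize(doc)})

def binary_vector(doc, vocabulary):
    """1 if the term appears in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocabulary]

def frequency_vector(doc, vocabulary):
    """Number of times each vocabulary term appears in the document."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]

documents = ["web mining mines the web", "data mining"]
vocabulary = build_vocabulary(documents)
print(vocabulary)                                  # ['data', 'mines', 'mining', 'the', 'web']
print(binary_vector(documents[0], vocabulary))     # [0, 1, 1, 1, 1]
print(frequency_vector(documents[0], vocabulary))  # [0, 1, 1, 1, 2]
```

Reordering the words of a document leaves both vectors unchanged, which is the permutation invariance behind the "bag of words" name.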

48
Document Representation
  • Stop word removal: Many words are not
    informative and thus irrelevant for document
    representation
  • e.g., the, and, a, an, is, of, that, ...
  • Stemming: reducing words to their root form
    (reduces dimensionality)
  • A document may contain several occurrences of
    words like fish, fishes, fisher, and fishers, but
    it would not be retrieved by a query with the
    keyword fishing
  • Different words that share the same word stem
    should be represented by that stem instead of
    the actual word, e.g., fish
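
The two preprocessing steps can be sketched as follows. The stop-word list is the one from the slide; the stemmer is a toy suffix stripper written only to handle the fish example, not a real algorithm such as Porter's, and its suffix list and minimum stem length are assumptions.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}
SUFFIXES = ("ing", "ers", "er", "es", "s")  # checked in this order

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]

for word in ["fish", "fishes", "fisher", "fishers", "fishing"]:
    print(word, "->", stem(word))  # every line ends in "fish"

print(preprocess("The fishers of that fishing village"))
# prints ['fish', 'fish', 'village']
```

With every variant mapped to the stem fish, a query for fishing now matches documents that only contain fishes or fishers.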