Introduction to Web Mining and Web Usage Mining


Title: Introduction to Web Mining and Web Usage Mining


1
Introduction to Web Mining and Web Usage Mining
  • Course: Usability of Interactive Applications
  • Year: 2007
  • Lecturer: Federico M. Facca (facca_at_elet.polimi.it)
  • Main Lecturer: Francesca Rizzo (rizzo_at_elet.polimi.it)

2
Agenda
  • Web Mining
  • Introduction
  • Web Content Mining
  • Web Structure Mining
  • Web Usage Mining
  • Introduction
  • Algorithms
  • Applications
  • Examples
  • References

3
Web Mining: Introduction
  • Web Mining
  • The application of data mining techniques to
    discover patterns from the Web.
  • Data Mining
  • Also called Knowledge Discovery in Databases
    (KDD): the process of automatically searching
    large volumes of data for patterns (extracting
    knowledge from data).

4
Web Mining: Introduction
  • Web Content Mining
  • Discovers useful information from the content of
    web pages. Web content may consist of text,
    image, audio or video data.
  • Web Structure Mining
  • Uses graph theory to analyse the node and
    connection structure of a web site.
  • Web Usage Mining
  • Analyses usage data to discover interesting
    patterns in users' behaviour on the web. Usage
    data records what users do when they browse or
    make transactions on a web site.

5
Web Mining: Introduction
[Figure: the web mining process. Phases: Web data, Information Retrieval, Information Extraction, Generalization, Analysis, Knowledge. Component labels: Crawler/Spider, Indexer, Wrapper, Categorization, Clustering, Ranker, Characterizing. Mining categories: WCM, WSM, WUM.]
  • Depending on the Web Mining category and on the
    objective, the different phases take on a
    different role and importance
6
Web Mining: Web Content Mining
  • Discovery of useful information from web contents
    / data / documents
  • Web data contents: text, image, audio, video,
    metadata and hyperlinks.
  • Information Retrieval View (Structured,
    Semi-Structured)
  • Assist / improve information finding
  • Filter information for users based on user profiles
  • Database View
  • Model data on the web
  • Integrate them for more sophisticated queries

7
Web Mining: Web Content Mining
  • Developing intelligent tools for IR
  • Finding keywords and key phrases
  • Discovering grammatical rules and collocations
  • Hypertext classification/categorization
  • Extracting key phrases from text documents
  • Learning extraction models/rules
  • Hierarchical clustering
  • Predicting (word) relationships

8
Web Mining: Web Structure Mining
  • Discovers the link structure of hyperlinks at
    the inter-document level to generate a
    structural summary of the Web site and its Web
    pages.
  • Categorizing Web pages and generating
    information based on the hyperlinks.
  • Discovering the structure of the Web document
    itself.
  • Discovering the nature of the hierarchy or
    network of hyperlinks in the Web sites of a
    particular domain.

9
Web Mining: Web Structure Mining
  • Finding authoritative Web pages
  • Retrieving pages that are not only relevant, but
    also of high quality, or authoritative on the
    topic
  • Hyperlinks can be used to infer the notion of
    authority
  • The Web consists not only of pages, but also of
    hyperlinks pointing from one page to another
  • These hyperlinks contain an enormous amount of
    latent human annotation
  • A hyperlink pointing to another Web page can be
    considered the author's endorsement of that
    page
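
The endorsement idea can be illustrated with a very crude authority proxy: counting how many distinct pages link to each page. This is only a minimal sketch under stated assumptions. The slides do not name a specific ranking algorithm, and the link list and the function name `inlink_counts` are hypothetical.

```python
from collections import Counter

def inlink_counts(links):
    """Count incoming hyperlinks per page from (source, target) pairs."""
    counts = Counter()
    for source, target in set(links):   # count each distinct link once
        if source != target:            # a self-link endorses nothing
            counts[target] += 1
    return counts

# Hypothetical link structure of a tiny site.
links = [
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "d.html"),
    ("d.html", "c.html"),
    ("c.html", "c.html"),
]

ranking = inlink_counts(links).most_common()
print(ranking)  # pages ordered from most to least endorsed
```

In this toy graph c.html collects the most endorsements. Real authority measures weight each endorsement by the quality of the linking page, which this sketch does not attempt.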

10
Web Usage Mining: Introduction
  • Also known as web log mining
  • Not only statistical measures
  • Not only server logs
  • Can be organized along 3 orthogonal dimensions
  • Techniques
  • Statistical Analysis
  • Association Rules
  • Clustering
  • Sequential Patterns
  • Rough Sets
  • Fuzzy Logic
  • Visualization
  • Graphs
  • Relational Tables
  • OLAP
  • Query languages
  • Applications
  • Personalization
  • Usability Testing
  • User modeling
  • Marketing
  • Adaptive Web sites

11
Web Usage Mining: Terms
  • User
  • The principal using a client to interactively
    retrieve and render resources or resource
    manifestations.
  • Page view
  • Visual rendering of a Web page in a specific
    client environment at a specific point in time.
  • Click stream
  • A sequential series of page view requests.
  • User session
  • A delimited set of user clicks (click stream)
    across one or more Web servers.
  • Server session (visit)
  • A collection of user clicks to a single Web
    server during a user session.
  • Episode
  • A subset of related user clicks that occur within
    a user session.

12
Web Usage Mining: Applications
  • Target potential customers for electronic
    commerce
  • Enhance the quality and delivery of Internet
    information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations
  • Facilitate personalization/adaptive sites
  • Improve site design
  • Fraud/intrusion detection
  • Predict users' actions (allows prefetching)

13
Web Usage Mining: Information Retrieval
  • The information is usually easy to obtain (web
    log, cookies, proxy log, database log...).
  • Information can be obtained from server, client
    and proxy.

14
Web Usage Mining: Information Extraction
  • Completing missing information using some
    heuristics
  • Identification of sessions/episodes
  • Mining and conversion of contents to the
    elaboration format (WCM)
  • Mining of the web site structure (WSM)
  • Finding and removing data distortion (e.g.
    crawler sessions).
  • Representing the information in the correct
    format for the pattern discovery task

15
Web Usage Mining: Generalization
  • Usage Patterns
  • Navigation patterns
  • Behaviour patterns
  • Access patterns
  • Techniques
  • Association rules (e.g. 45% of users that visited
    products/product1.html also visited
    products/productX.html).
  • Clustering (identifying groups of users that show
    similar sessions)
  • Classification (e.g. 30% of users that bought
    products from the category Music are between
    18 and 25 years old and live in northern Europe)
  • Sequential patterns (e.g. 15% of users that
    bought a product in the category Music made a
    new order in the category Book a week later)

16
Web Usage Mining: Analysis
  • Removing patterns that do not provide new
    knowledge
  • Visualization of acquired knowledge
  • Use of discovered patterns for
  • Categorizing users
  • Personalizing contents/advertisements
  • Dynamically modifying the web site structure
  • Marketing
  • Improving application usability

17
Web Usage Mining: Problems with Web Logs
  • Identifying users
  • Clients may have multiple streams
  • Clients may access the web from multiple hosts
  • Proxy servers: many clients/one address
  • Proxy servers: one client/many addresses
  • Data not in log
  • POST data (i.e., CGI request) not recorded
  • Cookie data stored elsewhere

18
Web Usage Mining: Problems with Web Logs
  • Missing data
  • Pages may be cached
  • Referring page requires client cooperation
  • When does a session end?
  • Use of forward and backward pointers
  • Typically a 30 minute timeout is used
  • Web content may be dynamic
  • May not be able to reconstruct what the user saw
  • Spiders and automated agents request web pages
    automatically

19
Web Usage Mining: Problems with Web Logs
  • Like most data mining tasks, web log mining
    requires preprocessing
  • To identify users
  • To match sessions to other data
  • To fill in missing data
  • Essentially, to reconstruct the click stream

20
Web Usage Mining: Web Server Logs
  • Web servers have the ability to log all
    requests
  • Web server log formats
  • Common Log Format (CLF)
  • Extended Log Format: allows configuration of the
    log file
  • Generate vast amounts of data

21
Web Usage Mining: Web Server Logs
  • Common Log Format
  • Remotehost: browser hostname or IP
  • Remote log name: log name of the user (almost
    always "-", meaning "unknown")
  • Authuser: authenticated username
  • Date: date and time of the request
  • "Request": exact request line from the client
  • Status: the HTTP status code returned
  • Bytes: the content-length of the response
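
The seven fields above can be pulled out of a log line with a regular expression. The following is a minimal sketch rather than a complete log parser: the group names, the `parse_clf` helper and the sample line are illustrative assumptions, and the date format assumes the usual `dd/Mon/yyyy:hh:mm:ss +zzzz` layout.

```python
import re
from datetime import datetime

# One named group per Common Log Format field (group names are our own).
CLF_PATTERN = re.compile(
    r'^(?P<remotehost>\S+) (?P<logname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Servers write "-" when no body was sent; treat it as zero bytes.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# Made-up sample line, not taken from a real log.
sample = '192.0.2.7 - - [10/Oct/2007:13:55:36 +0200] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf(sample)
print(entry["remotehost"], entry["request"], entry["status"], entry["bytes"])
```

Lines that do not match return None, so malformed entries can be skipped during data cleaning instead of aborting the whole run.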

22
Web Usage Mining: Web Server Logs
23
Web Usage Mining: Pre-Processing
  • Data Cleaning
  • Removes log entries that are not needed for the
    mining process
  • Data Integration
  • Synchronize data from multiple server logs,
    metadata
  • User Identification
  • Associates page references with different users
  • Session/Episode Identification
  • Groups a user's page references into user sessions
  • Page View Identification
  • Path Completion
  • Fills in page references missing due to browser
    and proxy caching

24
Web Usage Mining: Pre-Processing
  • A single IP address is used by many users
  • Different IP addresses in a single session
  • Missing cache hits in the server logs

[Figure labels: different users, proxy server, Web server; single user, ISP server, Web server]
25
Web Usage Mining: Pre-Processing
  • Remote Agent
  • A remote agent is implemented as a Java applet
  • It is loaded into the client only once, when the
    first page is accessed
  • The subsequent requests are captured and sent
    back to the server
  • Modified Browser
  • The source code of an existing browser can be
    modified to gather user-specific data on the
    client side
  • Dynamic page rewriting
  • When the user first submits a request, the
    server returns the requested page rewritten to
    include a session-specific ID
  • Each subsequent request will supply this ID to
    the server
  • Heuristics
  • Use a set of assumptions to identify user
    sessions and find the missing cache hits in the
    server log

26
Web Usage Mining: Session identification heuristics
  • Timeout
  • If the time between page requests exceeds a
    certain limit, it is assumed that the user is
    starting a new session
  • IP/Agent
  • Each different agent type for an IP address
    represents a different session
  • Referring page
  • If the referring page file for a request is not
    part of an open session, it is assumed that the
    request is coming from a different session
  • Same IP-Agent/different sessions (Closest)
  • Assigns the request to the session that is
    closest to the referring page at the time of the
    request
  • Same IP-Agent/different sessions (Recent)
  • In the case where multiple sessions are the same
    distance from a page request, assigns the request
    to the session with the most recent referrer
    access in terms of time
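
The first two heuristics (Timeout and IP/Agent) are simple enough to sketch in a few lines. The code below is an illustration, not the lecturer's implementation: the tuple layout `(ip, agent, timestamp, url)`, the function name `sessionize` and the sample requests are assumptions, and the referrer-based heuristics are left out. The 30 minute default follows the typical timeout mentioned earlier in the deck.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (ip, agent, timestamp, url) requests into sessions.

    A new session starts when the (ip, agent) pair has not been seen yet
    or when the gap since its previous request exceeds the timeout.
    Returns a list of sessions, each a list of URLs in request order.
    """
    sessions = []
    last_seen = {}  # (ip, agent) -> (session index, time of last request)
    for ip, agent, ts, url in sorted(requests, key=lambda r: r[2]):
        key = (ip, agent)
        if key in last_seen and ts - last_seen[key][1] <= timeout:
            index = last_seen[key][0]
        else:
            sessions.append([])
            index = len(sessions) - 1
        sessions[index].append(url)
        last_seen[key] = (index, ts)
    return sessions

# Hypothetical requests: one IP address, two agents, one long pause.
t0 = datetime(2007, 1, 1, 10, 0)
requests = [
    ("10.0.0.1", "Mozilla", t0, "/A"),
    ("10.0.0.1", "Mozilla", t0 + timedelta(minutes=5), "/B"),
    ("10.0.0.1", "Opera", t0 + timedelta(minutes=6), "/A"),
    ("10.0.0.1", "Mozilla", t0 + timedelta(minutes=50), "/C"),
]
sessions = sessionize(requests)
print(sessions)
```

Here the Opera request opens its own session even though the IP address is the same, and the request for /C arrives 45 minutes after /B, so it starts a third session.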

27
Web Usage Mining: Sessionization Example
28
Web Usage Mining: Sessionization Example
29
Web Usage Mining: Sessionization Example
30
Web Usage Mining: Sessionization Example
31
Web Usage Mining: Sessionization Example
32
Web Usage Mining: Association rule mining
  • Proposed by Agrawal et al. in 1993.
  • It is an important data mining model studied
    extensively by the database and data mining
    community.
  • Assume all data are categorical.
  • No good algorithm for numeric data.
  • Initially used for Market Basket Analysis to find
    how items purchased by customers are related.
  • $Url1 \rightarrow Url4$ [sup = 5%, conf = 100%]

33
Web Usage Mining: Association rule mining
  • A set of items
  • $I = \{i_1, i_2, \ldots, i_m\}$
  • Transaction t
  • t is a set of items, and $t \subseteq I$
  • Transaction Database T
  • a set of transactions $T = \{t_1, t_2, \ldots, t_n\}$

34
Web Usage Mining: Association rule mining
  • A transaction t contains X, a set of items
    (itemset) in I, if $X \subseteq t$.
  • An association rule is an implication of the
    form
  • $X \rightarrow Y$, where $X, Y \subset I$, and $X \cap Y = \emptyset$
  • An itemset is a set of items.
  • E.g., $X = \{url1, url2, url3\}$ is an itemset.
  • A k-itemset is an itemset with k items.
  • E.g., $\{url1, url2, url3\}$ is a 3-itemset

35
Web Usage Mining: Association rule mining
  • Support
  • The rule holds with support sup in T (the
    transaction data set) if sup% of transactions
    contain $X \cup Y$.
  • $sup = \Pr(X \cup Y)$.
  • Confidence
  • The rule holds in T with confidence conf if conf%
    of transactions that contain X also contain Y.
  • $conf = \Pr(Y \mid X)$
  • An association rule is a pattern stating that when
    X occurs, Y occurs with a certain probability.

36
Web Usage Mining: Association rule mining
  • Transaction data
    t1: Url1, Url2, Url4
    t2: Url1, Url3
    t3: Url3, Url5
    t4: Url1, Url2, Url3
    t5: Url1, Url2, Url6, Url3, Url4
    t6: Url2, Url6, Url4
    t7: Url2, Url4, Url6
  • Assume
  • minsup = 30%
  • minconf = 80%
  • An example frequent itemset
  • $\{Url2, Url6, Url4\}$ [sup = 3/7]
  • Association rules from the itemset
  • $Url6 \rightarrow Url4, Url2$ [sup = 3/7, conf = 3/3]
  • $Url6, Url2 \rightarrow Url4$ [sup = 3/7, conf = 3/3]
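
The support and confidence figures of this example can be checked mechanically. The sketch below encodes the seven transactions from this slide; the helper names `support` and `confidence` are our own, and exact fractions are used only so the values print the way the slide writes them.

```python
from fractions import Fraction

# The seven transactions t1..t7 from the slide.
transactions = [
    {"Url1", "Url2", "Url4"},
    {"Url1", "Url3"},
    {"Url3", "Url5"},
    {"Url1", "Url2", "Url3"},
    {"Url1", "Url2", "Url6", "Url3", "Url4"},
    {"Url2", "Url6", "Url4"},
    {"Url2", "Url4", "Url6"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """Pr(rhs | lhs): support of lhs and rhs together over support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"Url2", "Url6", "Url4"}, transactions))       # 3/7
print(confidence({"Url6"}, {"Url4", "Url2"}, transactions))  # 1
```

Both rules listed on the slide reach confidence 1 (3/3), which is above the 80% minconf, and their support of 3/7 is above the 30% minsup.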

37
Web Usage Mining: Association rule mining
Apriori Algorithm
  • Probably the best known algorithm
  • Two steps
  • Find all itemsets that have minimum support
    (frequent itemsets, also called large itemsets).
  • Use frequent itemsets to generate rules.
  • E.g., a frequent itemset
  • $\{Url2, Url6, Url4\}$ [sup = 3/7]
  • and one rule from the frequent itemset
  • $Url6 \rightarrow Url4, Url2$ [sup = 3/7, conf = 3/3]

38
Web Usage Mining: Association rule mining
Apriori Algorithm
  • Iterative algorithm
  • Find all 1-item frequent itemsets, then all
    2-item frequent itemsets, and so on.
  • In each iteration k, only consider itemsets that
    contain some frequent (k-1)-itemset.
  • Find frequent itemsets of size 1: $F_1$
  • From k = 2
  • $C_k$ = candidates of size k: those itemsets of size
    k that could be frequent, given $F_{k-1}$
  • $F_k$ = those itemsets that are actually frequent,
    $F_k \subseteq C_k$ (need to scan the database once).
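
The level-wise loop described above can be written compactly. This is a minimal, unoptimized sketch: the function name `apriori`, the join-by-union candidate generation and the in-memory counting are our own simplifications, and the data is the seven-transaction Url example used earlier in the deck with minsup 30%.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def count_frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one scan of the database per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}

    items = {item for t in transactions for item in t}
    level = count_frequent({frozenset([i]) for i in items})  # F1
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = count_frequent(candidates)  # Fk is a subset of Ck
        frequent.update(level)
        k += 1
    return frequent

transactions = [
    {"Url1", "Url2", "Url4"}, {"Url1", "Url3"}, {"Url3", "Url5"},
    {"Url1", "Url2", "Url3"}, {"Url1", "Url2", "Url6", "Url3", "Url4"},
    {"Url2", "Url6", "Url4"}, {"Url2", "Url4", "Url6"},
]
frequent = apriori(transactions, minsup=0.3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the candidate sets small: an itemset such as {Url1, Url2, Url3} is never counted because its subset {Url2, Url3} is already known to be infrequent.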

39
Web Usage Mining: Association rule mining
Apriori Algorithm
Dataset T, minsup = 0.5 (entries are itemset:count)
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
          → F1: {1}:2, {2}:3, {3}:3, {5}:3
          → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
          → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
          → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2
          → F3: {2,3,5}
40
Web Usage Mining: Association rule mining
Apriori Algorithm
  • Frequent itemsets → association rules
  • One more step is needed to generate association
    rules
  • For each frequent itemset X,
  • For each proper nonempty subset A of X,
  • Let $B = X - A$
  • $A \rightarrow B$ is an association rule if
  • $confidence(A \rightarrow B) \geq minconf$,
  • $support(A \rightarrow B) = support(A \cup B) = support(X)$
  • $confidence(A \rightarrow B) = support(A \cup B) / support(A)$
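
The rule generation step can be sketched by enumerating the proper nonempty subsets of one frequent itemset. The function name `rules_from_itemset` is hypothetical and supports are recounted directly from the transactions for simplicity; a real implementation would reuse the counts already collected while finding the frequent itemsets. The data is the Url example from the earlier slides, with minconf 80%.

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """Return (A, B, confidence) for each rule A -> B with B = X - A."""
    itemset = frozenset(itemset)

    def count(s):
        return sum(1 for t in transactions if s <= t)

    sup_x = count(itemset)  # support count of X = A ∪ B
    rules = []
    for size in range(1, len(itemset)):        # proper nonempty subsets only
        for subset in combinations(sorted(itemset), size):
            a = frozenset(subset)
            sup_a = count(a)
            if sup_a == 0:
                continue                       # X never occurs in the data
            conf = sup_x / sup_a               # support(A ∪ B) / support(A)
            if conf >= minconf:
                rules.append((a, itemset - a, conf))
    return rules

transactions = [
    {"Url1", "Url2", "Url4"}, {"Url1", "Url3"}, {"Url3", "Url5"},
    {"Url1", "Url2", "Url3"}, {"Url1", "Url2", "Url6", "Url3", "Url4"},
    {"Url2", "Url6", "Url4"}, {"Url2", "Url4", "Url6"},
]
rules = rules_from_itemset({"Url2", "Url4", "Url6"}, transactions, minconf=0.8)
for a, b, conf in rules:
    print(sorted(a), "->", sorted(b), conf)
```

For this itemset three of the six possible rules pass the threshold; the others, such as Url2 → Url4, Url6 with confidence 3/5, are discarded.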

41
Web Usage Mining: Sequential pattern mining
  • Association rules concern which items appear
    together (at the same time)
  • Intra-transaction patterns
  • Sequential patterns concern which items appear
    at different times
  • Inter-transaction patterns

42
Web Usage Mining: Sequential pattern mining
  • Itemset
  • non-empty set of items. Each itemset is mapped to
    an integer.
  • Sequence
  • Ordered list of itemsets.
  • Customer Sequence
  • List of customer transactions ordered by
    increasing transaction time.
  • A customer supports a sequence if the sequence is
    contained in the customer-sequence.
  • Support for a Sequence
  • Fraction of total customers that support a
    sequence.
  • Maximal Sequence
  • A sequence that is not contained in any other
    sequence.
  • Large Sequence
  • A sequence that meets minsup.
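
The definitions of containment and support can be made concrete with a short sketch. Sequences are represented as lists of sets, which is our own encoding choice; the function names and the four customer sequences are hypothetical.

```python
def contains(customer_sequence, sequence):
    """True if each itemset of `sequence` is a subset of a distinct,
    later-and-later transaction of `customer_sequence` (order preserved)."""
    pos = 0
    for itemset in sequence:
        # Greedily take the earliest transaction that covers this itemset.
        while pos < len(customer_sequence) and not set(itemset) <= set(customer_sequence[pos]):
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1
    return True

def sequence_support(customers, sequence):
    """Fraction of customers whose customer sequence contains `sequence`."""
    return sum(contains(c, sequence) for c in customers) / len(customers)

# Hypothetical customer sequences, each ordered by transaction time.
customers = [
    [{"A"}, {"B", "C"}, {"D"}],
    [{"A", "B"}, {"D"}],
    [{"B"}, {"A"}],
    [{"A"}, {"D"}, {"B"}],
]
print(sequence_support(customers, [{"A"}, {"D"}]))  # 0.75
```

Note that [{"A", "B"}] is not contained in the first customer sequence: A and B both occur there, but never in the same transaction.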

43
Web Usage Mining: Sequential pattern mining
44
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
  • <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of
    sequence <a(abc)(ac)d(cf)>
  • Given sequence <a(abc)(ac)d(cf)>

45
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
  • Step 1: find length-1 sequential patterns
  • <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide search space. The complete set of
    seq. pat. can be partitioned into 6 subsets
  • The ones having prefix <a>
  • The ones having prefix <b>
  • The ones having prefix <f>

46
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
  • Only need to consider projections w.r.t. <a>
  • <a>-projected database: <(abc)(ac)d(cf)>,
    <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix
    <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  • Further partition into 6 subsets
  • Having prefix <aa>
  • Having prefix <af>
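
The divide-and-project idea can be sketched for the simplified case where every element of a sequence is a single item (for example, one page per click). This is a deliberate simplification: the slides' example uses itemset elements such as (abc), which this sketch does not handle. The function name `prefixspan` and the sample click sequences are hypothetical.

```python
from collections import Counter

def prefixspan(sequences, min_count):
    """Return {pattern: support count} for sequences of single items."""
    patterns = {}

    def mine(prefix, projected):
        # Count each item at most once per projected sequence (suffix).
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            new_prefix = prefix + (item,)
            patterns[new_prefix] = count
            # Projection: keep only what follows the first occurrence of item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(new_prefix, new_projected)  # recurse on the smaller database

    mine((), [tuple(s) for s in sequences])
    return patterns

# Hypothetical click sequences, one per session.
sequences = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["a", "b"],
]
patterns = prefixspan(sequences, min_count=3)
print(patterns)
```

Each recursive call only looks at the projected database of its prefix, so no candidate sequences are generated and tested against the full database, which is the main point of the PrefixSpan approach.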

47
Web Usage Mining: Sequential pattern mining
PrefixSpan algorithm
[Figure: PrefixSpan search tree over the sequence database SDB]
  • Length-1 sequential patterns: <a>, <b>, <c>, <d>,
    <e>, <f>
  • Having prefix <a>: <a>-projected database
    <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>,
    <(_f)cbc>
  • Length-2 sequential patterns: <aa>, <ab>,
    <(ab)>, <ac>, <ad>, <af>
  • Having prefix <aa>: <aa>-proj. db
  • Having prefix <af>: <af>-proj. db
  • Having prefix <b>: <b>-projected database
  • Having prefix <c>, …, <f>
48
Web Usage Mining: Clustering
  • Clustering is a technique for finding similarity
    groups in data, called clusters. I.e.,
  • it groups data instances that are similar to
    (near) each other in one cluster and data
    instances that are very different (far away) from
    each other into different clusters.
  • Clustering is often called an unsupervised
    learning task as no class values denoting an a
    priori grouping of the data instances are given,
    which is the case in supervised learning.
  • Due to historical reasons, clustering is often
    considered synonymous with unsupervised learning.
  • In fact, association rule mining is also
    unsupervised

49
Web Usage Mining: Clustering
  • The data set has three natural groups of data
    points, i.e., 3 natural clusters.

50
Web Usage Mining: Clustering
  • Let us see some real-life examples
  • Example 1: group people of similar sizes
    together to make small, medium and large
    T-shirts.
  • Tailor-made for each person: too expensive
  • One-size-fits-all: does not fit all.
  • Example 2: in e-commerce, segment customers
    according to their similarities
  • To do targeted marketing.

51
Web Usage Mining: Clustering
  • A clustering algorithm
  • Partitional clustering
  • Hierarchical clustering
  • A distance (similarity, or dissimilarity)
    function
  • Clustering quality
  • Inter-cluster distance → maximized
  • Intra-cluster distance → minimized
  • The quality of a clustering result depends on the
    algorithm, the distance function, and the
    application.

52
Web Usage Mining: Clustering - K-means algorithm
  • K-means is a partitional clustering algorithm
  • Let the set of data points (or instances) D be
  • $\{x_1, x_2, \ldots, x_n\}$,
  • where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is a vector in a
    real-valued space $X \subseteq \mathbb{R}^r$, and r is the number of
    attributes (dimensions) in the data.
  • The k-means algorithm partitions the given data
    into k clusters.
  • Each cluster has a cluster center, called the
    centroid.
  • k is specified by the user

53
Web Usage Mining: Clustering - K-means algorithm
  • Given k, the k-means algorithm works as follows
  • 1) Randomly choose k data points (seeds) to be the
    initial centroids (cluster centers)
  • 2) Assign each data point to the closest centroid
  • 3) Re-compute the centroids using the current
    cluster memberships.
  • 4) If a convergence criterion is not met, go to 2)
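
The four steps translate almost directly into code. The sketch below is a plain-Python illustration under stated assumptions: squared Euclidean distance, "no re-assignments" as the convergence criterion, a fixed random seed for repeatability, and hypothetical names (`kmeans`, `sq_dist`) and sample points.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # choose k seeds as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignment == assignment:       # converged: no re-assignments
            break
        assignment = new_assignment
        # Re-compute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                        # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Two well-separated hypothetical groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

The result depends on the random seeds in general; for groups as well separated as these, the two clusters are recovered whichever points are picked first.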

54
Web Usage Mining: Clustering - K-means algorithm
  • Possible convergence criteria:
  • no (or minimum) re-assignments of data points to
    different clusters,
  • no (or minimum) change of centroids, or
  • minimum decrease in the sum of squared error
    (SSE),
    $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$
  • where $C_j$ is the jth cluster, $m_j$ is the centroid of
    cluster $C_j$ (the mean vector of all the data
    points in $C_j$), and $dist(x, m_j)$ is the distance
    between data point x and centroid $m_j$.
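
The SSE criterion can be computed directly from a clustering. This is a small sketch assuming Euclidean distance and clusters given as lists of points; the function name `sse` and the sample clusterings are hypothetical.

```python
def sse(clusters):
    """Sum over all clusters of squared distances from each point to its centroid."""
    total = 0.0
    for members in clusters:
        if not members:
            continue  # an empty cluster contributes nothing
        # Centroid m_j: the mean vector of the points in cluster C_j.
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(
            sum((a - b) ** 2 for a, b in zip(x, centroid)) for x in members
        )
    return total

tight = [[(0, 0), (2, 0)], [(5, 5)]]   # two compact clusters
loose = [[(0, 0), (2, 0), (5, 5)]]     # same points forced into one cluster
print(sse(tight), sse(loose))
```

A lower SSE means the points sit closer to their centroids, so k-means can stop once an iteration no longer reduces this value by more than a chosen small amount.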

55
Web Usage Mining: Clustering - K-means algorithm
Select K and, accordingly, K centers in the space
56
Web Usage Mining: Clustering - K-means algorithm
Assign points to the nearest center
57
Web Usage Mining: Clustering - K-means algorithm
Recompute the new center for each cluster
58
Web Usage Mining: Clustering - K-means algorithm
Assign points to the nearest center
59
Web Usage Mining: Clustering - K-means algorithm
Three points change cluster
60
Web Usage Mining: Clustering - K-means algorithm
Recompute the new center for each cluster
61
Web Usage Mining: Clustering - K-means algorithm
Assign points to the nearest center. No change:
STOP
62
Web Usage Mining: Applications
  • User characterizing
  • Creation of user classes according to navigation
    behaviours and visited contents.
  • Basic step for many of the other WUM applications
  • Personalization
  • Attracting users with advanced personalized
    features (content, presentation, navigation).
  • Recommender systems based on user profiles and
    mined behaviours
  • Ad Hoc advertising

63
Web Usage Mining: Applications
  • Web Application Improving
  • Performance: prefetching, load balancing, web
    caching, based on user behaviours
  • Security: finding intrusions and fraud.
  • Usability: adapting the model of the web
    application to the model expected by users
  • Marketing
  • Information on users is very important for
    e-commerce web sites.
  • It is possible to obtain data on
  • Customer acquisition
  • Customer retention
  • Cross sales
  • Customer loss

64
References
  • R. Kosala and H. Blockeel. Web mining research: a
    survey. SIGKDD Explorations, ACM, 2(1):1-15,
    2000.
  • Sankar Pal, Varun Talwar, and Pabitra Mitra. Web
    mining in soft computing framework: Relevance,
    state of the art and future directions, 2002.
  • Jaideep Srivastava, Robert Cooley, Mukund
    Deshpande, and Pang-Ning Tan. Web usage mining:
    Discovery and applications of usage patterns from
    web data. SIGKDD Explorations, ACM, 1(2):12-23,
    2000.
  • Federico Michele Facca and Pier Luca Lanzi. Mining
    interesting knowledge from weblogs: a survey.
    Data Knowl. Eng., 53(3):225-241, 2005.