Mining di dati web (Web Data Mining): Presentation Transcript

1
Mining di dati web
  • Lecture 6
  • Clustering of Web Documents
  • Content-Based Algorithms
  • Academic Year 2005/2006

2
Document Clustering
  • Classical clustering algorithms are not suitable
    for high dimensional data.
  • Dimensionality Reduction is a viable but
    expensive solution.
  • Different kinds of clustering exist
  • Partitional (or Top-Down)
  • Hierarchical (or Bottom-Up)

3
Partitional Clustering
  • Directly decomposes the data set into a set of
    disjoint clusters.
  • The most famous is the K-Means algorithm.
  • Usually they are linear in the number of elements
    to cluster.

4
Hierarchical Partitioning
  • Proceeds successively by either merging smaller
    clusters into larger ones, or by splitting larger
    clusters.
  • The clustering methods differ in the rule by
    which it is decided which two small clusters are
    merged or which large cluster is split.
  • The end result of the algorithm is a tree of
    clusters called a dendrogram, which shows how the
    clusters are related.
  • By cutting the dendrogram at a desired level a
    clustering of the data items into disjoint groups
    is obtained.

5
Dendrogram Example
6
Clustering in Web Content Mining
  • Possible uses of clustering in Web Content
    Mining.
  • Automatic Document Classification.
  • Search Engine Results Presentation.
  • Search Engine Optimization
  • Collection Reorganization.
  • Index Reorganization.
  • Dimensionality Reduction!!!!

7
Advanced Document Clustering Techniques
  • Co-Clustering
  • Dhillon, I. S., Mallela, S., and Modha, D. S.
    2003. Information-theoretic co-clustering. In
    Proceedings of the Ninth ACM SIGKDD international
    Conference on Knowledge Discovery and Data Mining
    (Washington, D.C., August 24 - 27, 2003). KDD
    '03. ACM Press, New York, NY, 89-98.
  • Syntactic Clustering
  • Broder, A. Z., Glassman, S. C., Manasse, M. S.,
    and Zweig, G. 1997. Syntactic clustering of the
    Web. Comput. Netw. ISDN Syst. 29, 8-13 (Sep.
    1997), 1157-1166.

8
Co-Clustering
  • Idea: represent a collection with its term-document
    matrix and then cluster both rows and columns.
  • It has a strong theoretical foundation.
  • It is based on the assumption that the best
    clustering is the one that leads to the largest
    mutual information between the clustered random
    variables.

9
Information Theory
  • Entropy of a random variable X with probability
    distribution p(x):
    H(X) = -∑_x p(x) log p(x)
  • The Kullback-Leibler (KL) divergence, or relative
    entropy, between two probability distributions p
    and q:
    D(p || q) = ∑_x p(x) log( p(x) / q(x) )
  • Mutual information between random variables X and Y:
    I(X;Y) = ∑_x ∑_y p(x,y) log( p(x,y) / (p(x) p(y)) )
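
These three quantities can be computed directly from a joint distribution. A minimal NumPy sketch (the helper names are illustrative, not from the paper):

  import numpy as np

  def entropy(p):
      # H(X) = -sum_x p(x) log p(x)
      p = p[p > 0]
      return -(p * np.log2(p)).sum()

  def kl_divergence(p, q):
      # D(p || q) = sum_x p(x) log(p(x) / q(x))
      mask = p > 0
      return (p[mask] * np.log2(p[mask] / q[mask])).sum()

  def mutual_information(pxy):
      # I(X;Y) = D(p(X,Y) || p(X) p(Y)), from the joint distribution
      px, py = pxy.sum(axis=1), pxy.sum(axis=0)
      return kl_divergence(pxy.ravel(), np.outer(px, py).ravel())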

10
Contingency Table
  • Let X and Y be discrete random variables that
    take values in the sets {x_1, x_2, ..., x_m} and
    {y_1, y_2, ..., y_n}.
  • Let p(X,Y) denote the joint probability
    distribution between X and Y.

11
Problem Formulation
  • Co-clustering is concerned with simultaneously
    clustering X into (at most) k disjoint clusters
    and Y into (at most) l disjoint clusters.
  • Let the k clusters of X be written as
    {x̂_1, x̂_2, ..., x̂_k}, and let the l clusters of Y
    be written as {ŷ_1, ŷ_2, ..., ŷ_l}.
  • (C_X, C_Y) is called a co-clustering, where
  • C_X : {x_1, x_2, ..., x_m} → {x̂_1, x̂_2, ..., x̂_k}
  • C_Y : {y_1, y_2, ..., y_n} → {ŷ_1, ŷ_2, ..., ŷ_l}
  • An optimal co-clustering minimizes
    I(X;Y) - I(X̂;Ŷ), where X̂ = C_X(X) and Ŷ = C_Y(Y).

12
Lemma 2.1
  • For a fixed co-clustering (C_X, C_Y), we can write
    the loss in mutual information as
    I(X;Y) - I(X̂;Ŷ) = D( p(X,Y) || q(X,Y) ),
    where D(·||·) denotes the Kullback-Leibler
    divergence and q(X,Y) is a distribution of the form
    q(x,y) = p(x̂,ŷ) p(x|x̂) p(y|ŷ), where x ∈ x̂, y ∈ ŷ.

13
The Approximation Matrix q(X,Y)
  • q(x,y) = p(x̂,ŷ) p(x|x̂) p(y|ŷ), where
  • p(x̂) = ∑_{x ∈ x̂} p(x)
  • p(ŷ) = ∑_{y ∈ ŷ} p(y)
  • p(x|x̂) = p(x) / p(x̂)
  • p(y|ŷ) = p(y) / p(ŷ)

14
Proof of Lemma 2.1
15
Some Useful Equalities
16
Co-Clustering Algorithm
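
The algorithm itself appears only as a figure in the original deck and is not reproduced in this transcript. Below is a hedged NumPy sketch (illustrative, not the authors' code) of the alternating minimization described in Dhillon et al. (2003): rows and then columns are reassigned to the cluster whose induced distribution q is closest in KL divergence, which monotonically decreases the loss in mutual information.

  import numpy as np

  def cocluster(p, k, l, n_iter=20, seed=0):
      # p is the joint distribution p(X, Y) as an m x n nonnegative array summing to 1
      rng = np.random.default_rng(seed)
      m, n = p.shape
      cx = rng.integers(0, k, size=m)          # row-cluster assignment C_X
      cy = rng.integers(0, l, size=n)          # column-cluster assignment C_Y
      px, py = p.sum(axis=1), p.sum(axis=0)    # marginals p(x), p(y)
      eps = 1e-12

      def cluster_joint():
          # p(x_hat, y_hat): aggregate p(x, y) over the current clusters
          r = np.zeros((k, m)); r[cx, np.arange(m)] = 1.0
          c = np.zeros((n, l)); c[np.arange(n), cy] = 1.0
          return r @ p @ c

      for _ in range(n_iter):
          # row step: reassign each x to argmin over x_hat of D( p(Y|x) || q(Y|x_hat) ),
          # where q(y|x_hat) = p(y|y_hat) * p(y_hat|x_hat)
          ph = cluster_joint()
          q_y = ((py / (ph.sum(axis=0)[cy] + eps))
                 * (ph[:, cy] / (ph.sum(axis=1, keepdims=True) + eps)))   # shape (k, n)
          p_y_x = p / (px[:, None] + eps)                                 # p(y|x), (m, n)
          div = (p_y_x[:, None, :]
                 * np.log((p_y_x[:, None, :] + eps) / (q_y[None, :, :] + eps))).sum(axis=2)
          cx = div.argmin(axis=1)

          # column step: symmetric, with q(x|y_hat) = p(x|x_hat) * p(x_hat|y_hat)
          ph = cluster_joint()
          q_x = ((px / (ph.sum(axis=1)[cx] + eps))
                 * (ph[cx, :] / (ph.sum(axis=0, keepdims=True) + eps)).T)  # shape (l, m)
          p_x_y = (p / (py[None, :] + eps)).T                              # p(x|y), (n, m)
          div = (p_x_y[:, None, :]
                 * np.log((p_x_y[:, None, :] + eps) / (q_x[None, :, :] + eps))).sum(axis=2)
          cy = div.argmin(axis=1)
      return cx, cy
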
17
Co-Clustering Soundness
  • Theorem: the co-clustering algorithm monotonically
    decreases the loss in mutual information (the
    objective function value).
  • The marginals p(x) and p(y) are preserved at every
    step: q(x) = p(x) and q(y) = p(y).

18
Co-Clustering Complexity
  • The algorithm is computationally efficient,
  • even for sparse data.
  • If nz is the number of nonzeros in the input joint
    distribution p(X,Y) and t is the number of
    iterations, the cost is O(nz · t · (k + l)).
  • Experimentally, t ≈ 20.

19
A Toy Example
20
A Real Example: Before
21
A Real Example: After
22
Application: Dimensionality Reduction
  • Feature Selection: select the best words and throw
    away the rest, using frequency-based pruning or
    information-criterion-based pruning; each of the m
    documents' bag of words is mapped to a vector over
    the selected words Word1, ..., Wordk.
  • Feature Clustering: do not throw away words;
    cluster the words instead and use the clusters
    Cluster1, ..., Clusterk as features.
23
Syntactic Clustering
  • Finding syntactically similar documents.
  • The approach is based on two different similarity
    measures:
  • Resemblance
  • Containment
  • A sketch of a few hundred bytes is kept for each
    document.

24
Document Model
  • We view each document as a sequence of words.
  • Start by lexically analyzing the doc into a
    canonical sequence of tokens.
  • This canonical form ignores minor details such as
    formatting, html commands, and capitalization.
  • We then associate with every document D a set of
    subsequences of tokens S(D,w).

25
Shingling
  • A contiguous subsequence contained in D is called
    a shingle.
  • Given a document D we define its w-shingling
    S(D,w) as the set of all unique shingles of size
    w contained in D.
  • For instance, the 4-shingling of
    (a, rose, is, a, rose, is, a, rose) is the set
    {(a, rose, is, a), (rose, is, a, rose),
    (is, a, rose, is)}.
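
A minimal Python illustration of w-shingling (the function name is illustrative, not from the paper):

  def shingling(tokens, w):
      # S(D, w): the set of all unique contiguous token subsequences of length w
      return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

  # The slide's example: the 4-shingling of (a, rose, is, a, rose, is, a, rose)
  print(shingling("a rose is a rose is a rose".split(), 4))
  # -> {('a','rose','is','a'), ('rose','is','a','rose'), ('is','a','rose','is')}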

26
Resemblance
  • For a given shingle size w, the resemblance r of
    two documents A and B is defined as
    r(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|,
    where |S| is the size of set S.

27
Containment
  • For a given shingle size w, the containment c of A
    in B is defined as
    c(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w)|,
    where |S| is the size of set S.
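
Both measures reduce to set operations on the shingle sets. A minimal sketch (names illustrative):

  def resemblance(sa, sb):
      # r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
      return len(sa & sb) / len(sa | sb)

  def containment(sa, sb):
      # c(A, B) = |S(A) ∩ S(B)| / |S(A)|
      return len(sa & sb) / len(sa)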

28
Properties of r and c
  • The resemblance is a number between 0 and 1;
    r(A,A) = 1.
  • The containment is a number between 0 and 1;
    if A ⊆ B then c(A,B) = 1.
  • Experiments show that the definitions capture the
    informal notions of "roughly the same" and
    "roughly contained".

29
Resemblance Distance
  • Resemblance is not transitive.
  • Version 100 of a document is probably quite
    different from version 1.
  • The resemblance distance d(A,B) = 1 - r(A,B) obeys
    the triangle inequality (it is a metric on the
    shingle sets).

30
Resemblance and Containment Estimates
  • Fix a shingle size w.
  • Let U be the set of all shingles of size w.
  • U is countable thus we can view its elements as
    numbers.
  • Fix a parameter s.
  • For a set W ⊆ U define MIN_s(W) as the set of the s
    smallest elements of W (or W itself if |W| < s),
    where smallest refers to numerical order on U, and
    define MOD_m(W) as the set of elements of W that
    are 0 mod m.

31
Resemblance and Containment Estimates
  • Theorem: let π : U → U be a permutation of U chosen
    uniformly at random. Let F(A) = MIN_s(π(S(A))) and
    V(A) = MOD_m(π(S(A))). Define F(B) and V(B)
    analogously. Then
  • |MIN_s(F(A) ∪ F(B)) ∩ F(A) ∩ F(B)| /
    |MIN_s(F(A) ∪ F(B))| is an unbiased estimate of the
    resemblance of A and B,
  • |V(A) ∩ V(B)| / |V(A) ∪ V(B)| is an unbiased
    estimate of the resemblance of A and B, and
  • |V(A) ∩ V(B)| / |V(A)| is an unbiased estimate of
    the containment of A in B.
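
A minimal sketch of MIN_s, MOD_m and the three estimators, assuming the shingles have already been mapped to integers by the random permutation π (all names are illustrative):

  def min_s(w, s):
      # MIN_s(W): the s numerically smallest elements of W (or W itself if |W| < s)
      return set(sorted(w)[:s])

  def mod_m(w, m):
      # MOD_m(W): the elements of W that are 0 mod m
      return {v for v in w if v % m == 0}

  def est_resemblance_fixed(fa, fb, s):
      # |MIN_s(F(A) ∪ F(B)) ∩ F(A) ∩ F(B)| / |MIN_s(F(A) ∪ F(B))|
      u = min_s(fa | fb, s)
      return len(u & fa & fb) / len(u)

  def est_resemblance_var(va, vb):
      # |V(A) ∩ V(B)| / |V(A) ∪ V(B)|
      return len(va & vb) / len(va | vb)

  def est_containment(va, vb):
      # |V(A) ∩ V(B)| / |V(A)|
      return len(va & vb) / len(va)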

32
The Sketch
  • Choose a random permutation of U.
  • The Sketch of a document D consists of the set
    F(D) and/or V(D).
  • F(D) has fixed size. Allows only the estimation
    of resemblance.
  • V(D) has variable size. Grows as D grows.

33
Practical Sketches Representation
  • Canonicalize documents by removing HTML
    formatting and converting all words to lowercase.
  • The shingle size w is 10.
  • Use a 40-bit fingerprint function, based on Rabin
    fingerprints, enhanced to behave as a random
    permutation; each shingle is then represented by
    its fingerprint value.
  • The modulus m is set to 25.

34
Rabin Fingerprints
  • It is based on irreducible polynomials with
    coefficients in the Galois field GF(2).
  • Let A = (a_1, ..., a_m) be a binary string with
    a_1 = 1.
  • A(t) = a_1 t^(m-1) + a_2 t^(m-2) + ... + a_m
  • Let P(t) be an irreducible polynomial of degree k
    over Z_2.
  • f(A) = A(t) mod P(t) is the Rabin fingerprint of A.
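
A minimal sketch of computing A(t) mod P(t) by long division over GF(2), with polynomials encoded as Python integers (bit i is the coefficient of t^i). The degree-3 polynomial below is only an illustration; the paper uses a 40-bit fingerprint.

  def rabin_fingerprint(bits, poly, degree):
      # f(A) = A(t) mod P(t) over GF(2); 'poly' encodes P(t), 'degree' is deg(P)
      f = 0
      for b in bits:            # consume a_1, a_2, ..., a_m (a_1 = 1)
          f = (f << 1) | b      # multiply by t and add the next coefficient
          if f >> degree:       # if the t^degree term appeared,
              f ^= poly         #   subtract (xor) P(t)
      return f

  # Example with P(t) = t^3 + t + 1 (irreducible over Z_2):
  # A = (1,1,0,1,0) -> A(t) = t^4 + t^3 + t, and A(t) mod P(t) = t^2 + t + 1
  print(bin(rabin_fingerprint([1, 1, 0, 1, 0], poly=0b1011, degree=3)))  # 0b111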

35
Shingle Clustering
  • Retrieve every document on the Web.
  • Calculate the sketch for each document.
  • Compare the sketches for each pair of documents
    to see if they exceed a threshold of resemblance.
  • Combine the pairs of similar documents to make
    the clusters of similar documents.

36
Efficiency? INEfficiency!
  • 30,000,000 HTML docs.
  • A pairwise comparison would involve O(10^15)
    comparisons!
  • Just one bit per document in a data structure
    requires 4 MBytes; a sketch size of 800 bytes per
    document requires 24 GBytes!
  • One millisecond of computation per document
    translates into 8 hours of computation!
  • Any algorithm involving random disk accesses or
    that causes paging activity is completely
    infeasible.

37
Divide, Compute, Merge
  • Take the data, divide it into pieces of size m
    (in order to fit the data entirely in memory)
  • Compute on each piece separately
  • Merge the results.
  • The merging process is I/O bound
  • Each merge pass is linear
  • log(n/m) passes are required.
  • The overall performance is O(n log(n/m)).
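
A minimal in-memory sketch of the divide/sort/merge idea. Here the n/m sorted runs are merged in a single k-way heap merge, which does the same O(n log(n/m)) work as the repeated pairwise passes the slide describes; in the paper the runs live on disk and the merge is the I/O-bound pass.

  import heapq
  from itertools import islice

  def sort_merge(items, m):
      # divide the data into pieces of size m, sort each piece in memory,
      # then merge the sorted runs
      it, runs = iter(items), []
      while True:
          run = sorted(islice(it, m))
          if not run:
              break
          runs.append(run)
      return heapq.merge(*runs)   # lazily yields all items in sorted order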

38
The real Clustering Algorithm (I phase)
  • Calculate a sketch for every document. This step
    is linear in the total length of the documents.

39
The real Clustering Algorithm (II phase)
  • Produce a list of all the shingles and the
    documents they appear in, sorted by shingle value.
    To do this, the sketch for each document is
    expanded into a list of <shingle value, document
    ID> pairs, and the list is sorted using the
    divide, sort, merge approach.
  • Remember: shingle value here means the Rabin
    fingerprint stored in the sketch.
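
An in-memory sketch of this expansion step (the paper does it over files with the divide/sort/merge approach). Here `sketches`, mapping a document ID to its set of shingle fingerprints, is an assumed input from phase I.

  def shingle_doc_pairs(sketches):
      # expand every sketch into <shingle value, document ID> pairs, sorted by shingle
      pairs = [(shingle, doc_id)
               for doc_id, sketch in sketches.items()
               for shingle in sketch]
      pairs.sort()
      return pairs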

40
The real Clustering Algorithm (III phase)
  • Generate a list of all the pairs of documents that
    share any shingles, along with the number of
    shingles they have in common. To do this, take the
    file of sorted pairs and expand it into a list of
    <ID, ID, count of common shingles> triplets:
  • take each shingle that appears in multiple
    documents and generate the complete set of
    <ID, ID, 1> triplets;
  • apply the divide, sort, merge procedure (summing
    up the counts for matching ID-ID pairs) to produce
    a single file of all <ID, ID, count> triplets
    sorted by the first document ID. This phase
    requires the greatest amount of disk space because
    the initial expansion of the document ID triplets
    is quadratic in the number of documents sharing a
    shingle, and initially produces many triplets with
    a count of 1.
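
An in-memory sketch of the triplet generation, taking the sorted <shingle, ID> pairs from the previous phase as input (again, the paper does this over files):

  from collections import Counter
  from itertools import combinations, groupby

  def common_shingle_counts(pairs):
      # from the sorted <shingle, ID> list, emit <ID, ID, 1> for every pair of
      # documents sharing a shingle, then sum the counts per document pair
      counts = Counter()
      for _, group in groupby(pairs, key=lambda p: p[0]):
          docs = sorted({doc_id for _, doc_id in group})
          for a, b in combinations(docs, 2):   # quadratic in the sharing documents
              counts[(a, b)] += 1
      return counts                            # (ID, ID) -> number of shared shingles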

41
The real Clustering Algorithm (IV phase)
  • Produce the complete clustering. Examine each
    <ID, ID, count> triplet and decide if the document
    pair exceeds our threshold for resemblance. If it
    does, add a link between the two documents in a
    union-find algorithm. The connected components
    output by the union-find algorithm form the final
    clusters. This phase has the greatest memory
    requirements because the entire union-find data
    structure must be held in memory.
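
A minimal union-find sketch of this final phase. It assumes the variable-size MOD_m sketches described earlier, so the estimated resemblance is shared / (|V(A)| + |V(B)| - shared); `counts` and `sketch_sizes` (document ID -> |V(D)|) are assumed inputs from the previous phases.

  def final_clusters(counts, sketch_sizes, threshold=0.5):
      parent = {}

      def find(x):
          parent.setdefault(x, x)
          while parent[x] != x:
              parent[x] = parent[parent[x]]    # path halving
              x = parent[x]
          return x

      def union(a, b):
          parent[find(a)] = find(b)

      # link every pair whose estimated resemblance exceeds the threshold
      for (a, b), shared in counts.items():
          union_size = sketch_sizes[a] + sketch_sizes[b] - shared
          if union_size and shared / union_size >= threshold:
              union(a, b)

      # the connected components of the union-find structure are the clusters
      clusters = {}
      for doc in list(parent):
          clusters.setdefault(find(doc), set()).add(doc)
      return list(clusters.values())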

42
Performance Issues
  • Common shingles:
  • shared by more than 1,000 documents;
  • the number of document ID pairs is quadratic in
    the number of documents sharing a shingle;
  • remove shingles that are more frequent than a
    given threshold.
  • Identical documents:
  • identical documents do not need to be handled;
    remove them from the collection by discarding
    documents having the same fingerprint.
  • Super-shingles:
  • compute a meta-sketch by shingling the shingles;
  • documents sharing shingles in the meta-sketch are
    very likely to have a high resemblance value;
  • the super-shingle size must be chosen carefully.

43
Super-shingles based Clustering
  • Compute the list of super-shingles for each
    document.
  • Expand the list of super-shingles into a sorted
    list of <super-shingle, ID> pairs.
  • Any documents that share a super-shingle resemble
    each other and are added to the same cluster.

44
Problems with Super-shingles
  • Super-shingles are not as flexible or as accurate
    as computing resemblance with regular sketches.
  • They do not work well for short documents: short
    documents do not contain many shingles, and even
    regular shingles are not accurate in computing
    resemblance for them.
  • Super-shingles represent sequences of shingles, so
    shorter documents, with fewer super-shingles, have
    a lower probability of producing a common
    super-shingle.
  • Super-shingles cannot detect containment.

45
A Nice Application: Page Change Characterization
  • We can use the technique of comparing sketches
    over time to characterize the behavior of pages
    on the web.
  • For instance, we can observe a page at different
    times and see how similar each version is to the
    preceding version.
  • We can thus answer some basic questions like
  • How often do pages change?
  • How much do they change per time interval?
  • How often do pages move? Within a server? Between
    servers?
  • How long do pages live? How many are created? How
    many die?

46
Experiments
  • 30,000,000 HTML pages, 150 GBytes (about 5 KB per
    document).
  • The file containing just the URLs of the documents
    took up 1.8 GBytes (an average of 60 bytes per
    URL).
  • 10-word shingles and 5-byte fingerprints; 1 in 25
    of the shingles found was kept.
  • 600M shingles were kept, and the raw sketch files
    took up 3 GBytes.

47
Experiments
  • In the third phase (the creation of <ID, ID,
    count> triples) the storage required was 20
    GBytes. At the end, the file took up 6 GBytes.
  • The final clustering phase is the most memory
    intensive. The final file took up less than
    100 MBytes.

48
Experiments
  • Resemblance threshold set to 50%.
  • 3.6 million clusters were found, containing a
    total of 12.3 million documents.
  • 2.1 million clusters contained only identical
    documents (5.3 million documents).
  • The remaining 1.5 million clusters contained 7
    million documents (a mixture of exact duplicates
    and similar documents).

49
Experiments
Phase                   Time (CPU-days)   Parallelizable
Sketching               4.6               YES
Duplicate elimination   0.3
Shingle merging         1.7               YES
ID-ID pair formation    0.7
ID-ID merging           2.6               YES
Cluster formation       0.5
Total                   ~10.5