Web Document Clustering - PowerPoint PPT Presentation

About This Presentation
Title:

Web Document Clustering

Description:

Web Document Clustering. Department of Computer Science and Engineering, Southern Methodist University. Wenyi Ni. Why is web document clustering needed?

Slides: 22
Provided by: Weny3
Learn more at: https://s2.smu.edu

Transcript and Presenter's Notes

Title: Web Document Clustering


1
Web Document Clustering
  • Department of Computer Science and Engineering
  • Southern Methodist University
  • Wenyi Ni

2
Why is web document clustering needed?
  • 3.3 billion web pages on the internet
  • Every time you post a query, the search engine
    returns thousands of records.
  • Did you efficiently find what you wanted?
  • Web document clustering is a good choice.
  • An example
  • www.metacrawler.com

3
How to represent a web document in a general model?
  • TF-IDF
  • Each web document consists of words.
  • The more words two documents share, the more
    likely they are to be similar.
  • Each Web document D can be represented in the
    following form:
  • $D = (d_1, d_2, \ldots, d_n)$
  • where n is the total number of distinct words
    in the document collection.
  • $d_i$ represents the presence of the i-th word in
    the document (1 means present, 0 means absent).
  • The order of the $d_i$ is determined by the weight.

4
How to calculate the weight?
  • $tf_{ij}$ is the number of occurrences of the word
    $t_j$ in the Web document $D_i$.
  • $idf_j$ is the inverse document frequency.
  • $df_j$ is the number of Web documents in the
    collection in which the word $t_j$ occurs.
  • n is the total number of Web documents in the
    document collection.
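The weight formula itself did not survive in this transcript. The sketch below assumes the common form $w_{ij} = tf_{ij} \cdot \log(n / df_j)$, which may differ from the variant the slide used; the three-sentence collection and the name `tfidf_weights` are illustrative only.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {word: weight} dict per document.

    Assumes w_ij = tf_ij * log(n / df_j); the slide's exact formula is
    not preserved, so this variant is an assumption.
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # df_j: number of documents in which word t_j occurs
    df = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)  # tf_ij: occurrences of t_j in D_i
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
weights = tfidf_weights(docs)
```

Under this assumed form, a word that occurs in every document (here "ate") gets weight 0, so it contributes nothing to distinguishing documents.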

5
How to calculate the similarity between two web
documents
  • Jaccard similarity measure
  • Other common measures: Cosine, Dice, Overlap
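The slide's formula is not preserved here. As a minimal sketch, the Jaccard measure on the binary (word present or absent) representation can be computed as the size of the intersection of two word sets divided by the size of their union. The function name and the handling of two empty documents are choices made for this example.

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two word collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty documents
    return len(a & b) / len(a | b)

d1 = "cat ate cheese".split()
d2 = "mouse ate cheese too".split()
sim = jaccard(d1, d2)  # 2 shared words out of 5 distinct words
```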

6
Agglomerative Hierarchical clustering
  1. Start by regarding each document as an
    individual cluster.
  2. Merge the most similar pair of documents or
    document clusters (using the similarity measure).
  3. Repeat step 2 until all objects are contained
    within a single cluster, which becomes the root
    of the tree.
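The three steps can be sketched as follows. The slide does not say how the similarity between two clusters is derived from document similarities; this example assumes single link (the highest Jaccard similarity between any two member documents), and the sample documents are made up.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def agglomerative(docs):
    """Merge the most similar pair of clusters until one remains.

    Cluster similarity is single link over Jaccard (an assumption).
    Returns the list of merges, each a tuple of document indices.
    """
    sets = [set(d.split()) for d in docs]
    clusters = [(i,) for i in range(len(docs))]   # step 1
    merges = []
    while len(clusters) > 1:                      # step 3
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = max(jaccard(sets[i], sets[j])
                          for i in clusters[x] for j in clusters[y])
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        _, x, y = best                            # step 2
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
        merges.append(merged)
    return merges

docs = ["cat ate cheese", "mouse ate cheese too",
        "cat ate mouse too", "dog barked"]
merges = agglomerative(docs)
```

The last entry of `merges` contains every document and corresponds to the root of the tree; the earlier entries are its internal nodes.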

7
K-means clustering
  1. Arbitrarily select K documents as seeds; they
    are the initial centroids of the clusters.
  2. Assign all other documents to the closest
    centroid.
  3. Recompute the centroid of each cluster.
  4. Repeat steps 2 and 3 until the centroid of each
    cluster no longer changes.
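A minimal sketch of these four steps on small numeric vectors. It assumes squared Euclidean distance and takes the first K points as seeds so that the run is reproducible; both are choices made for this example, not something the slide specifies.

```python
def kmeans(points, k, max_iter=100):
    """Basic K-means on equal-length numeric vectors."""
    centroids = [list(p) for p in points[:k]]           # step 1: seeds
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k),
                      key=lambda c: sum((x - y) ** 2
                                        for x, y in zip(p, centroids[c])))
                  for p in points]                       # step 2: assign
        new_centroids = []
        for c in range(k):                               # step 3: recompute
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])       # keep an empty cluster's centroid
        if new_centroids == centroids:                   # step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids

points = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, centroids = kmeans(points, k=2)
```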

8
Some other refinement algorithms using the TF-IDF model
  • Bisecting K-means
  • Scatter/Gather

9
Bisecting K-means
  • 1. Select a cluster to split. (There are several
    ways to choose which cluster to split, with no
    significant difference in clustering accuracy.)
    We normally choose the largest cluster or the
    one with the least overall similarity.
  • 2. Use the basic K-means algorithm to subdivide
    the chosen cluster.
  • 3. Repeat step 2 a constant number of times,
    then perform the split that produces clusters
    with the highest overall similarity.
  • 4. Repeat steps 1, 2 and 3 until the desired
    number of clusters is reached.
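The procedure can be sketched as below, under several assumptions: the largest cluster is always chosen for splitting, points are distinct numeric vectors, and "highest overall similarity" is approximated by the lowest sum of squared distances to the centroids. All function names and the `trials` parameter are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sse(points):
    """Sum of squared distances to the centroid (lower = more cohesive)."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, rng, max_iter=50):
    """Basic K-means with K = 2, seeded with two randomly chosen points."""
    centroids = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            nearer_second = sq_dist(p, centroids[0]) > sq_dist(p, centroids[1])
            groups[1 if nearer_second else 0].append(p)
        if not groups[0] or not groups[1]:
            break
        new_centroids = [mean(groups[0]), mean(groups[1])]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return groups

def bisecting_kmeans(points, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:                          # step 4
        target = max(clusters, key=len)               # step 1: largest cluster
        if len(target) < 2:
            break                                     # nothing left to split
        splits = [two_means(target, rng) for _ in range(trials)]  # steps 2-3
        splits = [s for s in splits if s[0] and s[1]]
        if not splits:
            break
        best = min(splits, key=lambda s: sse(s[0]) + sse(s[1]))
        clusters.remove(target)
        clusters.extend(best)
    return clusters

points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
clusters = bisecting_kmeans(points, k=2)
```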

10
How to represent a web document in the STC model
  • What is STC?
  • Suffix Tree clustering
  • The whole web document is treated as a string
  • The identification of base clusters is the
    creation of an inverted index of strings for the
    web document collection

11
A suffix tree example (courtesy of Zamir)
  • Three strings. Each string is a document.
  • Cat ate cheese
  • Mouse ate cheese too
  • Cat ate mouse too.

12
STC algorithm (cont)
  • 1. Document cleaning
  • Strip word prefixes and suffixes and reduce
    plurals to singular. Sentence boundaries are
    marked, and non-word tokens (such as numbers,
    HTML tags and most punctuation) are stripped.
  • 2. Identify base clusters
  • Create an inverted index of strings from the web
    document collection using a suffix tree. Each
    node of the suffix tree represents a group of
    documents and a string that is common to all of
    them. The label of the node represents the
    common string. Each node represents a base
    cluster.
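To illustrate base-cluster identification on the three example strings, the sketch below uses a plain dictionary of word sequences in place of a real suffix tree. This is a simplification: it captures the same phrase-to-documents idea, but it does not have the tree's compact structure or linear-time construction, and it may list more phrases than the tree has nodes. The name `base_clusters` and the `min_docs` parameter are hypothetical.

```python
from collections import defaultdict

def base_clusters(docs, min_docs=2):
    """Map each shared phrase to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        words = doc.lower().split()
        for start in range(len(words)):                    # every suffix ...
            for end in range(start + 1, len(words) + 1):   # ... and each of its prefixes
                index[" ".join(words[start:end])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
```

Each key of `clusters` plays the role of a node label, and its value is the group of documents that share that string.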

13
STC algorithm (cont)
  • 3. Score base clusters
  • Each base cluster is assigned a score.
  • The score formula: $S(B) = |B| \cdot f(|P|)$
  • $|B|$ is the number of documents in base cluster B.
  • $|P|$ is the number of words in string P that
    have a non-zero score.
  • The function f penalizes single-word strings, is
    linear for strings that are two to six words
    long, and becomes constant for longer strings.
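A small sketch of the score $S(B) = |B| \cdot f(|P|)$. The slide describes only the shape of f, so the constants used here (0.5 for a single word, a cap at 6) are illustrative assumptions, as are the function names.

```python
def phrase_factor(num_words):
    """f(|P|): penalize single words, grow linearly from two to six
    words, stay constant beyond six. Constants are assumptions."""
    if num_words <= 1:
        return 0.5
    return float(min(num_words, 6))

def base_cluster_score(num_docs, num_words):
    """S(B) = |B| * f(|P|)."""
    return num_docs * phrase_factor(num_words)

single_word = base_cluster_score(3, 1)   # three documents, one-word string
two_words = base_cluster_score(2, 2)     # two documents, two-word string
```

With these assumed constants, a two-word string shared by two documents outscores a single word shared by three.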

14
STC algorithm
  • 4. Combine base clusters
  • The similarity measure used to combine base
    clusters is based on the overlap of their
    document sets.
  • Consider base clusters $B_x$ and $B_y$ with sizes
    $|B_x|$ and $|B_y|$.
  • $|B_x \cap B_y|$ represents the number of documents
    common to both base clusters.
  • Define the similarity of $B_x$ and $B_y$ to be 1 if
    $|B_x \cap B_y| / |B_x| > 0.5$ and
    $|B_x \cap B_y| / |B_y| > 0.5$; otherwise it is 0.
  • Two base clusters are connected if they have a
    similarity of 1. Using a single-link clustering
    algorithm, all connected base clusters are
    clustered together. All the documents in these
    base clusters constitute a web document cluster.
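The merging step amounts to finding connected components in a graph whose nodes are base clusters and whose edges join pairs with similarity 1. A minimal sketch, assuming each base cluster is a non-empty set of document ids; the union-find helper and the sample data are illustrative.

```python
def similar(bx, by):
    """True when the overlap exceeds half of both base clusters."""
    common = len(bx & by)
    return common / len(bx) > 0.5 and common / len(by) > 0.5

def merge_base_clusters(base):
    """Return the document union of each connected group of base clusters."""
    parent = list(range(len(base)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for x in range(len(base)):
        for y in range(x + 1, len(base)):
            if similar(base[x], base[y]):
                parent[find(x)] = find(y)   # connect the two base clusters

    merged = {}
    for i, docs in enumerate(base):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())

base = [{0, 2}, {0, 1, 2}, {0, 1}, {1, 2}, {3, 4}]
final = merge_base_clusters(base)
```

Note that {0, 2} and {0, 1} are not directly similar (their overlap is exactly half, not more), yet they end up in the same final cluster because both are connected to {0, 1, 2}.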

15
Link Based Model
  • Idea: Web pages that share common links with each
    other are very likely to be tightly related.
  • Each web document P is represented as 2 vectors,
    $P_{out}$ (N-dimensional) and $P_{in}$ (M-dimensional).
  • $P_{out,i}$ indicates whether the web document P has
    the i-th item of vector $P_{out}$ as an out-link.
  • $P_{in,j}$ indicates whether the web document P has
    the j-th item of vector $P_{in}$ as an in-link.
  • For example:
  • $P_{out} = (link_1, link_2, \ldots, link_N)$ covers all
    the out-links in the web document collection.
  • $P_{out,2} = 1$ means this document has $link_2$
    as an out-link.
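As a sketch of the out-link representation, the code below builds one binary vector per page over all out-links seen in the collection. The in-link vectors would be built the same way from in-link lists. The page and link names are hypothetical, as is the function name.

```python
def link_vectors(pages):
    """Build binary out-link vectors.

    `pages` maps a page name to the list of links it points to.
    Returns the ordered link list and one 0/1 vector per page.
    """
    all_links = sorted({link for links in pages.values() for link in links})
    vectors = {page: [1 if link in links else 0 for link in all_links]
               for page, links in pages.items()}
    return all_links, vectors

pages = {
    "a.example": ["x.example", "y.example"],   # hypothetical pages and links
    "b.example": ["y.example"],
}
all_links, vectors = link_vectors(pages)
```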

16
Link based algorithm
  • 1. Filter irrelevant web documents
  • A document is regarded as irrelevant if the sum
    of its in-links and out-links is less than 2.
  • 2. Use near-common links of a cluster to guarantee
    intra-cluster cohesiveness
  • Every cluster should have at least one 30%
    near-common link.
  • 3. Assign each web document to a cluster,
    generating base clusters.
  • The similarity between the document and the
    corresponding cluster is above the similarity
    threshold.
  • The document has a link in common with the
    near-common links of the corresponding cluster.
  • 4. Generate final clusters by merging base
    clusters

17
How to evaluate the quality of the result
clusters (cont)
  • Entropy
  • 1) For each cluster, the class distribution of the
    data (we usually use the TREC5 and TREC6 document
    collections) is calculated first.
  • 2) Using this class distribution, the entropy of
    each cluster j is calculated:
  • $E_j = -\sum_i p_{ij} \log(p_{ij})$
  • 3) The best quality is achieved when all the
    documents in a cluster fall into the same class,
    as known before clustering.
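A minimal sketch of the per-cluster entropy $E_j = -\sum_i p_{ij} \log(p_{ij})$, assuming $p_{ij}$ is the fraction of documents in cluster j that belong to class i and using the natural logarithm (the slide does not state the base). The class labels are made up.

```python
import math
from collections import Counter

def cluster_entropy(class_labels):
    """Entropy of one cluster, given the known class of each member."""
    total = len(class_labels)
    counts = Counter(class_labels)
    # p_ij = count / total for each class i present in the cluster
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

pure = cluster_entropy(["sports", "sports", "sports"])
mixed = cluster_entropy(["sports", "politics"])
```

A cluster whose documents all share one class has entropy 0, which is the best case described in step 3; an even two-class mix gives log 2.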

18
How to evaluate the quality of the result clusters
  • F-measure
  • 1) Calculate the recall and precision of each
    cluster for each given class.
  • 2) For cluster j and its corresponding class i:
  • $Recall(i, j) = n_{ij} / n_i$
  • $Precision(i, j) = n_{ij} / n_j$
  • $F(i, j) = (2 \cdot Recall(i, j) \cdot Precision(i, j)) /
    (Precision(i, j) + Recall(i, j))$
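A short sketch of the F-measure for one class and cluster pair. It assumes $n_{ij}$ is the number of class-i documents in cluster j, $n_i$ the size of class i, and $n_j$ the size of cluster j; the slide does not spell these out, and the numbers below are made up.

```python
def f_measure(n_ij, n_i, n_j):
    """F(i, j) = 2 * Recall * Precision / (Precision + Recall)."""
    if n_ij == 0:
        return 0.0  # no overlap: avoid dividing by zero
    recall = n_ij / n_i
    precision = n_ij / n_j
    return 2 * recall * precision / (precision + recall)

# 3 of the 4 documents of a class land in a cluster of 6 documents:
# recall = 0.75, precision = 0.5
score = f_measure(3, 4, 6)
```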

19
Algorithm evaluation and comparison
  • TF-IDF based AHC
  • Good cluster quality, time complexity O(n²)
  • TF-IDF based K-means
  • Linear time complexity O(Kmn). Sensitive to
    outliers.
  • STC
  • Best suited to incremental clustering. Linear time
    complexity O(n), but has a memory problem.
  • Link based
  • Linear time complexity O(mn), low dimension, good
    cluster quality.

20
Future work
  • Each algorithm has its advantages and
    disadvantages. We need to refine these
    algorithms, and sometimes we need to make
    trade-offs.
  • There is still room for improvement:
  • 1. Improve the entropy or F-measure value of the
    result clusters (the evaluation value is under
    0.6 for almost all algorithms, while the best is 1).
  • 2. Decrease the response time (we often need to
    process a large document collection, so we need
    a fast algorithm).

21
End