Title: Clustering Categorical Data: The Case of Quran Verses
Slide 1: Clustering Categorical Data: The Case of Quran Verses
- Presented by: Muhammad Al-Watban
- IS 598
Slide 2: Outline
- Introduction
- Preprocessing of Quran Verses
- Similarity Measures
- Assessing Cluster Similarities
- Shortcomings of traditional clustering methods with categorical data
- ROCK: Major definitions
- ROCK clustering algorithm
- ROCK example
- Conclusion and future work
Slide 3: Introduction
- The Holy Quran covers a wide range of topics.
- The Quran does not cover each topic in a set of consecutive verses or suras.
- A single verse usually deals with many subjects.
- Project goal: to cluster the verses of The Holy Quran based on the verses' subjects.
Slide 4: Preprocessing of Quran Verses
- It is necessary to perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
- Verses in the Holy Quran can be viewed as records, and the related subjects as attributes of the records: one row per verse, one column per subject.
- This data is similar to what is known as market-basket data.
- Here, we will call it verses-treasures data.
Slide 5: Similarity Measures
- There are two types of attributes.
- Continuous attributes:
- The range of attribute values is continuous and ordered.
- Includes attributes with numeric values (e.g. salary).
- Also includes attributes whose allowed set of values forms an ordered, meaningful sequence (e.g. professional ranks, disease severity levels).
- The similarity (or dissimilarity) between objects is computed based on the distance between them.
- The most commonly used distance measures are Euclidean distance and Manhattan distance.
Slide 6: Similarity Measures
- Categorical attributes:
- Consist of attributes whose underlying domain is not ordered.
- Examples: colors, blood type.
- If the attribute has only two states (namely 0 and 1), it is called binary; if it has more than two states, it is called nominal.
- There is no easy way to measure a distance between such objects.
- We can define dissimilarity based on the simple matching approach (see the sketch below):
- d(i, j) = (p - m) / p
- where m is the number of matched attributes, and p is the total number of attributes.
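- As a concrete illustration, here is the simple matching dissimilarity as a minimal Java sketch (the method name and the array-of-strings record representation are illustrative assumptions, not part of the presented project):

    // Simple matching dissimilarity for two records of categorical
    // attributes: d(i, j) = (p - m) / p, where m is the number of
    // matching attributes and p is the total number of attributes.
    static double simpleMatching(String[] recordI, String[] recordJ) {
        int m = 0;                                // matched attributes
        int p = recordI.length;                   // total attributes
        for (int k = 0; k < p; k++) {
            if (recordI[k].equals(recordJ[k])) m++;
        }
        return (double) (p - m) / p;
    }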
Slide 7: Similarity Measures
- Where does the verses-treasures data fit?
- Each verse can be represented by a record with Boolean attributes; each attribute corresponds to a single subject.
- The attribute corresponding to a subject is T if the verse contains that subject; otherwise, it is F.
- As we said, Boolean attributes are a special case of categorical attributes.
Slide 8: Assessing Cluster Similarities
- Many clustering algorithms (such as hierarchical clustering) require computing the distance between clusters (rather than between elements).
- There are several standard methods:
- 1. Single linkage
- The distance D(r, s) between clusters r and s is defined as the distance between the closest pair of objects:
- D(r, s) = min { d(i, j) : i in r, j in s }
Slide 9: Assessing Cluster Similarities
- 2. Complete linkage
- The distance is defined as the distance between the farthest pair of objects:
- D(r, s) = max { d(i, j) : i in r, j in s }
- 3. Average linkage
- The distance is defined as the average of the distances between all pairs of objects i and j, where i and j belong to different clusters:
- D(r, s) = (1 / (nr * ns)) * sum of d(i, j) over i in r, j in s
Slide 10: Assessing Cluster Similarities
- 4. Centroid linkage
- The distance between clusters is defined as the distance between the pair of cluster centroids.
Slide 11: Shortcomings of Traditional Clustering Methods with Categorical Data
- Example: consider the following 4 market-basket transactions:
- T1: {1, 2, 3, 4}
- T2: {1, 2, 4}
- T3: {3}
- T4: {4}
- Converting these transactions to Boolean points, we get:
- P1: (1, 1, 1, 1)
- P2: (1, 1, 0, 1)
- P3: (0, 0, 1, 0)
- P4: (0, 0, 0, 1)
- Using Euclidean distance to measure the closeness between all pairs of points, we find that d(P1, P2) = 1 is the smallest distance.
Slide 12: Shortcomings of Traditional Clustering Methods with Categorical Data
- If we use the centroid-based hierarchical algorithm, then we merge P1 and P2 and get a new cluster (P12) with (1, 1, 0.5, 1) as its centroid.
- Then, using Euclidean distance again, we find:
- d(P12, P3) = √3.25 ≈ 1.80
- d(P12, P4) = √2.25 = 1.5
- d(P3, P4) = √2 ≈ 1.41
- So, we should merge P3 and P4, since the distance between them is the shortest.
- However, T3 and T4 do not have even a single common item.
- So, using distance metrics as a similarity measure for categorical data is not appropriate.
- The solution is ROCK.
Slide 13: ROCK: Major Definitions
- Similarity function
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 14: Similarity Function
- Let sim(Pi, Pj) be a similarity function used to measure the closeness between points Pi and Pj.
- ROCK assumes that the sim function is normalized to return a value between 0 and 1.
- For the verses-treasures data, a possible definition of the sim function is based on the Jaccard coefficient:
- sim(Pi, Pj) = |Pi ∩ Pj| / |Pi ∪ Pj|
Slide 15: Example: Similarity Function
- Suppose two verses (P1 and P2) contain the following subjects:
- P1: judgment, faith, prayer, fair
- P2: fasting, faith, prayer
- sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2| = 2 / 5 = 0.40
Slide 16: Major Definitions
- Similarity for data objects
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 17: Neighbors and Links
- One main problem of traditional clustering is that only local properties, involving just the two points, are considered.
- Neighbor: if the similarity between two points is at least a certain similarity threshold (θ), they are neighbors.
- Link: the link for a pair of points is the number of their common neighbors.
- Obviously, the link incorporates global information about the other points in the neighborhood of the two points. The larger the link, the higher the probability that the pair of points belongs to the same cluster.
Slide 18: Example: Neighboring and Linking
- Assume that we have three distinct points p1, p2, and p3, where:
- neighbor(p1) = {p1, p2}
- neighbor(p2) = {p1, p2, p3}
- neighbor(p3) = {p3, p2}
- (Neighboring graph: p2 is connected to both p1 and p3, while p1 and p3 are not directly connected.)
- To find the number of links between two points, say p1 and p3, we count their common neighbors; hence, we can define the linkage function between p1 and p3 to be:
- link(p1, p3) = |neighbor(p1) ∩ neighbor(p3)| = |{p2}|
- or link(p1, p3) = 1
Slide 19: Example: Minimum Linkages
- Suppose we have four points P1, P2, P3, P4, and the similarity threshold (θ) is equal to 1.
- Then two points are neighbors if sim(Pi, Pj) ≥ 1.
- Hence, points are considered neighbors only of identical points (i.e. only of themselves).
- To find link(P1, P2):
- neighbor(P1) = {P1}
- neighbor(P2) = {P2}
- link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 0
Slide 20
- The following table shows the number of links (common neighbors) between the four points; with θ = 1, every pair of distinct points has 0 links:
- link(Pi, Pj) = 0 for all i ≠ j
- (Neighboring graph: four isolated nodes, each a neighbor only of itself.)
Slide 21: Example: Maximum Linkages
- Suppose we have four points P1, P2, P3, P4, and the similarity threshold (θ) is equal to 0.
- Then two points are neighbors if sim(Pi, Pj) ≥ 0.
- Hence, any pair of points are neighbors.
- To find link(P1, P2):
- neighbor(P1) = {P1, P2, P3, P4}
- neighbor(P2) = {P1, P2, P3, P4}
- link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 4
Slide 22
- The following table shows the number of links (common neighbors) between the four points; with θ = 0, every pair of points has 4 links:
- link(Pi, Pj) = 4 for all i, j
- (Neighboring graph: a complete graph over the four points.)
Slide 23: Example: Illustrating Links
- From the previous example, we have:
- neighbor(P1) = {P1, P2, P3, P4}
- neighbor(P3) = {P1, P2, P3, P4}
- link(P1, P3) = |neighbor(P1) ∩ neighbor(P3)| = 4 links
- We can depict these four different links (or paths) between P1 and P3, one passing through each of the four neighbors.
Slide 24: Major Definitions
- Similarity for data objects
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 25: Criterion Function
- To get the best clusters, we have to maximize this criterion function:
- El = sum over i = 1..k of [ ni × (sum of link(pq, pr) over pairs pq, pr in Ci) / ni^(1 + 2 f(θ)) ]
- where Ci denotes cluster i,
- ni is the number of points in Ci,
- k is the number of clusters,
- θ is the similarity threshold.
- Suppose that in Ci, each point has roughly ni^f(θ) neighbors.
- A suitable choice for basket data is f(θ) = (1 - θ) / (1 + θ).
Slide 26: Criterion Function
- By maximizing this criterion function, we maximize the sum of links of intra-cluster point pairs while at the same time minimizing the sum of links among pairs of points belonging to different clusters (i.e. among inter-cluster point pairs).
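- A small Java sketch of this criterion function (illustrative; it assumes a precomputed link matrix and clusters given as lists of point indices):

    import java.util.List;

    // E_l = sum over clusters Ci of:
    //   n_i * (sum of link(pq, pr) over pairs in Ci) / n_i^(1 + 2 f(theta)),
    // with f(theta) = (1 - theta) / (1 + theta).
    static double criterion(int[][] links, List<List<Integer>> clusters,
                            double theta) {
        double exponent = 1 + 2 * (1 - theta) / (1 + theta);
        double total = 0;
        for (List<Integer> c : clusters) {
            int n = c.size();
            double intraLinks = 0;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++)   // unordered pairs in Ci
                    intraLinks += links[c.get(i)][c.get(j)];
            total += n * intraLinks / Math.pow(n, exponent);
        }
        return total;
    }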
Slide 27: Major Definitions
- Similarity for data objects
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 28: Goodness Measure
- Goodness function:
- g(Ci, Cj) = link[Ci, Cj] / [ (ni + nj)^(1 + 2 f(θ)) - ni^(1 + 2 f(θ)) - nj^(1 + 2 f(θ)) ]
- where link[Ci, Cj] is the number of cross links between clusters Ci and Cj.
- During clustering, we use this goodness measure in order to maximize the criterion function.
- This goodness measure helps to identify the best pair of clusters to merge during each step of ROCK.
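- A minimal Java sketch of this goodness measure (illustrative helper):

    // g(Ci, Cj) = link[Ci, Cj] /
    //   ((ni + nj)^(1 + 2f) - ni^(1 + 2f) - nj^(1 + 2f)),
    // with f = (1 - theta) / (1 + theta).
    static double goodness(int crossLinks, int ni, int nj, double theta) {
        double e = 1 + 2 * (1 - theta) / (1 + theta);
        return crossLinks /
               (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
    }

- For instance, with theta = 0.3, two singleton clusters joined by 3 links give goodness(3, 1, 1, 0.3) ≈ 1.35, as in the worked example later in the deck.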
Slide 29: ROCK Clustering Algorithm
- Input: a set S of data points
- the number k of clusters to be found
- the similarity threshold (θ)
- Output: groups of clustered data
- The ROCK algorithm is divided into three major parts:
- 1. Draw a random sample from the data set.
- 2. Perform a hierarchical agglomerative clustering algorithm.
- 3. Label the data on disk.
- In our case, we do not deal with a very large data set, so we will consider the whole data set when forming clusters, i.e. we skip steps 1 and 3.
Slide 30: ROCK Clustering Algorithm
- Draw a random sample from the data set:
- Sampling is used to ensure scalability to very large data sets.
- The initial sample is used to form clusters; the remaining data on disk is then assigned to these clusters.
- In our case, we will consider the whole data set in the process of forming clusters.
Slide 31: ROCK Clustering Algorithm
- Perform a hierarchical agglomerative clustering algorithm:
- ROCK performs the following steps, which are common to all hierarchical agglomerative clustering algorithms, but with a different definition of the similarity measure (see the sketch after this list):
- a. Place each single data point into a separate cluster.
- b. Compute the similarity (goodness) measure for all pairs of clusters.
- c. Merge the two clusters with the highest goodness measure.
- d. Check the stop condition; if it is not met, go to step b.
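- A simplified Java sketch of this loop (illustrative only; it reuses the goodness helper and precomputed link matrix from the earlier sketches, and the full ROCK algorithm uses heaps to find the best pair more efficiently):

    import java.util.ArrayList;
    import java.util.List;

    // Greedy agglomerative loop: start from singleton clusters and
    // repeatedly merge the pair with the highest goodness measure
    // until only k clusters remain (or no linked pair is left).
    static List<List<Integer>> cluster(int[][] links, int numPoints,
                                       int k, double theta) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int p = 0; p < numPoints; p++) {            // step a
            List<Integer> singleton = new ArrayList<>();
            singleton.add(p);
            clusters.add(singleton);
        }
        while (clusters.size() > k) {                    // step d
            int bestI = -1, bestJ = -1;
            double best = 0;
            for (int i = 0; i < clusters.size(); i++)    // step b
                for (int j = i + 1; j < clusters.size(); j++) {
                    int cross = 0;                       // cross links
                    for (int p : clusters.get(i))
                        for (int q : clusters.get(j))
                            cross += links[p][q];
                    double g = goodness(cross, clusters.get(i).size(),
                                        clusters.get(j).size(), theta);
                    if (g > best) { best = g; bestI = i; bestJ = j; }
                }
            if (bestI < 0) break;                        // nothing to merge
            clusters.get(bestI).addAll(clusters.remove(bestJ)); // step c
        }
        return clusters;
    }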
Slide 32
- Label the data on disk:
- Finally, the remaining data points on disk are assigned to the generated clusters.
- This is done by selecting a random sample Li from each cluster Ci; then we assign each point p to the cluster for which it has the strongest linkage with Li.
- As we said, we will consider the whole data set in the process of forming clusters.
Slide 33: ROCK Clustering Algorithm
- Computation of links:
- Using the similarity threshold θ, we can convert the similarity matrix into an adjacency matrix A.
- We then obtain a matrix indicating the number of links by calculating A × A, i.e. by multiplying the adjacency matrix A with itself.
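- A short Java sketch of this link computation (illustrative; sim is assumed to be the precomputed similarity matrix):

    // Threshold the similarity matrix into an adjacency matrix A, then
    // compute the link matrix as A x A: entry (i, j) counts the common
    // neighbors of points i and j.
    static int[][] computeLinks(double[][] sim, double theta) {
        int n = sim.length;
        int[][] a = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = sim[i][j] >= theta ? 1 : 0;    // adjacency
        int[][] links = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int m = 0; m < n; m++)
                    links[i][j] += a[i][m] * a[m][j];    // (A x A)[i][j]
        return links;
    }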
Slide 34: ROCK Example
- Suppose we have four verses containing some subjects, as follows:
- P1: judgment, faith, prayer, fair
- P2: fasting, faith, prayer
- P3: fair, fasting, faith
- P4: fasting, prayer, pilgrimage
- The similarity threshold is 0.3, and the number of required clusters is 2.
- Using the Jaccard coefficient as a similarity measure, we obtain the following similarity table:

        P1     P2     P3     P4
  P1    1.00   0.40   0.40   0.17
  P2    0.40   1.00   0.50   0.50
  P3    0.40   0.50   1.00   0.20
  P4    0.17   0.50   0.20   1.00
Slide 35: ROCK Example
- Since we have a similarity threshold equal to 0.3, we derive the adjacency table (1 if sim ≥ 0.3, 0 otherwise; each point is a neighbor of itself):

        P1  P2  P3  P4
  P1    1   1   1   0
  P2    1   1   1   1
  P3    1   1   1   0
  P4    0   1   0   1

- By multiplying the adjacency table with itself, we derive the following table, which shows the number of links (or common neighbors):

        P1  P2  P3  P4
  P1    3   3   3   1
  P2    3   4   3   2
  P3    3   3   3   1
  P4    1   2   1   2
Slide 36: ROCK Example
- We compute the goodness measure for all adjacent points, assuming f(θ) = (1 - θ) / (1 + θ); with θ = 0.3, the exponent 1 + 2 f(θ) ≈ 2.077, so each pair of singleton clusters has denominator 2^2.077 - 2 ≈ 2.22.
- We obtain the following table:

  pair        links   goodness
  (P1, P2)    3       ≈ 1.35
  (P1, P3)    3       ≈ 1.35
  (P2, P3)    3       ≈ 1.35
  (P2, P4)    2       ≈ 0.90
  (P1, P4)    1       ≈ 0.45
  (P3, P4)    1       ≈ 0.45

- We have an equal goodness measure for merging (P1, P2), (P2, P3), and (P1, P3).
Slide 37: ROCK Example
- Now, we start the hierarchical algorithm by merging, say, P1 and P2.
- A new cluster (let's call it C(P1,P2)) is formed.
- It should be noted that some other hierarchical clustering techniques would not start the clustering process by merging P1 and P2, since sim(P1, P2) = 0.4, which is not the highest. But ROCK uses the number of links as the similarity measure rather than distance.
Slide 38: ROCK Example
- Now, after merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors (links) for these clusters:

  link(C(P1,P2), P3) = 3 + 3 = 6
  link(C(P1,P2), P4) = 1 + 2 = 3
  link(P3, P4)       = 1

- Then we obtain the following goodness measures for all adjacent clusters:

  g(C(P1,P2), P3) = 6 / (3^2.077 - 2^2.077 - 1) ≈ 1.31
  g(C(P1,P2), P4) = 3 / (3^2.077 - 2^2.077 - 1) ≈ 0.66
  g(P3, P4)       = 1 / (2^2.077 - 2) ≈ 0.45
Slide 39: ROCK Example
- Since the number of required clusters is 2, we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) that contains P1, P2, and P3, and leaving P4 alone in a separate cluster.
Slide 40: Conclusion and Future Work (1/3)
- We aim to apply a clustering technique to the verses of the Holy Quran.
- We should first perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
- Then we can apply a clustering algorithm that groups each set of similar verses into the same cluster.
Slide 41: Conclusion and Future Work (2/3)
- Most traditional clustering algorithms use distance-based similarity measures, which are not appropriate for clustering our categorical dataset.
- We will apply the general framework of the ROCK algorithm.
- The ROCK (RObust Clustering using linKs) algorithm is an agglomerative hierarchical clustering algorithm for categorical data. It presents a new notion of links to measure the similarity between data objects.
Slide 42: Conclusion and Future Work (3/3)
- We will use the Java language to implement the ROCK clustering algorithm.
- During testing, we will try to form clusters of verses belonging to a single sura, and of verses belonging to many different suras.
- Insha Allah, we will achieve success in performing this mission.
Slide 43
- Thank you for your attention.
- I will be glad to answer your questions.