Title: Clustering Categorical Data: The Case of Quran Verses
Slide 1: Clustering Categorical Data: The Case of Quran Verses
- Presented by: Muhammad Al-Watban
- IS 598
Slide 2: Outline
- Introduction
- Preprocessing of Quran Verses
- Similarity Measures
- Assessing Cluster Similarities
- Shortcomings of traditional clustering methods with categorical data
- ROCK: Major definitions
- ROCK clustering algorithm
- ROCK example
- Conclusion and future work
Slide 3: Introduction
- The Holy Quran covers a wide range of topics.
- The Quran does not cover each topic in a set of consecutive verses or suras.
- A single verse usually deals with many subjects.
- Project goal: to cluster the verses of The Holy Quran based on the verses' subjects.
Slide 4: Preprocessing of Quran Verses
- It is necessary to perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
- Verses in the Holy Quran can be viewed as records, and the related subjects as attributes of the records: one row per verse, one column per subject.
- This data is similar to what is known as market-basket data.
- Here, we will call it verses-treasures data.
Slide 5: Similarity Measures
- There are two types of attributes.
- Continuous attributes:
- The range of attribute values is continuous and ordered.
- Includes attributes with numeric values (e.g. salary).
- Also includes attributes whose allowed set of values forms an ordered, meaningful sequence (e.g. professional ranks, disease severity levels).
- The similarity (or dissimilarity) between objects is computed based on the distance between them.
- The most commonly used distance measures are Euclidean distance and Manhattan distance.
Slide 6: Similarity Measures
- Categorical attributes:
- Consist of attributes whose underlying domain is not ordered.
- Examples: colors, blood type.
- If the attribute has only two states (namely 0 and 1), it is called binary; if it has more than two states, it is called nominal.
- There is no easy way to measure a distance between such objects.
- We can define dissimilarity based on the simple matching approach (see the sketch below):
- d(i, j) = (p - m) / p
- where m is the number of matched attributes, and p is the total number of attributes.
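- As a concrete illustration, here is the simple matching dissimilarity as a minimal Java sketch (the method name and the array-of-strings record representation are illustrative assumptions, not part of the presented project):

    // Simple matching dissimilarity for two records of categorical
    // attributes: d(i, j) = (p - m) / p, where m is the number of
    // matching attributes and p is the total number of attributes.
    static double simpleMatching(String[] recordI, String[] recordJ) {
        int m = 0;                                // matched attributes
        int p = recordI.length;                   // total attributes
        for (int k = 0; k < p; k++) {
            if (recordI[k].equals(recordJ[k])) m++;
        }
        return (double) (p - m) / p;
    }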
Slide 7: Similarity Measures
- Where does the verses-treasures data fit?
- Each verse can be represented by a record with Boolean attributes; each attribute corresponds to a single subject.
- The attribute corresponding to a subject is T if the verse contains that subject; otherwise, it is F.
- As we said, Boolean attributes are a special case of categorical attributes.
Slide 8: Assessing Cluster Similarities
- Many clustering algorithms (such as hierarchical clustering) require computing the distance between clusters (rather than between elements).
- There are several standard methods:
- 1. Single linkage
- The distance D(r, s) between clusters r and s is defined as the distance between the closest pair of objects:
- D(r, s) = min { d(i, j) : i in r, j in s }
Slide 9: Assessing Cluster Similarities
- 2. Complete linkage
- The distance is defined as the distance between the farthest pair of objects:
- D(r, s) = max { d(i, j) : i in r, j in s }
- 3. Average linkage
- The distance is defined as the average of the distances between all pairs of objects i and j, where i and j belong to different clusters:
- D(r, s) = (1 / (nr * ns)) * sum of d(i, j) over i in r, j in s
Slide 10: Assessing Cluster Similarities
- 4. Centroid linkage
- The distance between clusters is defined as the distance between the pair of cluster centroids.
Slide 11: Shortcomings of Traditional Clustering Methods with Categorical Data
- Example: consider the following 4 market-basket transactions:
- T1: {1, 2, 3, 4}
- T2: {1, 2, 4}
- T3: {3}
- T4: {4}
- Converting these transactions to Boolean points, we get:
- P1: (1, 1, 1, 1)
- P2: (1, 1, 0, 1)
- P3: (0, 0, 1, 0)
- P4: (0, 0, 0, 1)
- Using Euclidean distance to measure the closeness between all pairs of points, we find that d(P1, P2) = 1 is the smallest distance.
Slide 12: Shortcomings of Traditional Clustering Methods with Categorical Data
- If we use the centroid-based hierarchical algorithm, then we merge P1 and P2 and get a new cluster (P12) with (1, 1, 0.5, 1) as its centroid.
- Then, using Euclidean distance again, we find:
- d(P12, P3) = √3.25 ≈ 1.80
- d(P12, P4) = √2.25 = 1.5
- d(P3, P4) = √2 ≈ 1.41
- So, we should merge P3 and P4, since the distance between them is the shortest.
- However, T3 and T4 do not have even a single common item.
- So, using distance metrics as a similarity measure for categorical data is not appropriate.
- The solution is ROCK.
Slide 13: ROCK: Major Definitions
- Similarity function
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 14: Similarity Function
- Let sim(Pi, Pj) be a similarity function used to measure the closeness between points Pi and Pj.
- ROCK assumes that the sim function is normalized to return a value between 0 and 1.
- For the verses-treasures data, a possible definition of the sim function is based on the Jaccard coefficient:
- sim(Pi, Pj) = |Pi ∩ Pj| / |Pi ∪ Pj|
Slide 15: Example: Similarity Function
- Suppose two verses (P1 and P2) contain the following subjects:
- P1: judgment, faith, prayer, fair
- P2: fasting, faith, prayer
- sim(P1, P2) = |P1 ∩ P2| / |P1 ∪ P2| = 2 / 5 = 0.40
Slide 16: Major Definitions
- Similarity for data objects
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 17: Neighbors and Links
- One main problem of traditional clustering is that only local properties, involving just the two points, are considered.
- Neighbor: if the similarity between two points is at least a certain similarity threshold (θ), they are neighbors.
- Link: the link for a pair of points is the number of their common neighbors.
- Obviously, the link incorporates global information about the other points in the neighborhood of the two points. The larger the link, the higher the probability that the pair of points belongs to the same cluster.
Slide 18: Example: Neighboring and Linking
- Assume that we have three distinct points p1, p2, and p3, where:
- neighbor(p1) = {p1, p2}
- neighbor(p2) = {p1, p2, p3}
- neighbor(p3) = {p3, p2}
- (Neighboring graph: p2 is connected to both p1 and p3, while p1 and p3 are not directly connected.)
- To find the number of links between two points, say p1 and p3, we count their common neighbors; hence, we can define the linkage function between p1 and p3 to be:
- link(p1, p3) = |neighbor(p1) ∩ neighbor(p3)| = |{p2}|
- or link(p1, p3) = 1
Slide 19: Example: Minimum Linkages
- Suppose we have four points P1, P2, P3, P4, and the similarity threshold (θ) is equal to 1.
- Then two points are neighbors if sim(Pi, Pj) ≥ 1.
- Hence, points are considered neighbors only of identical points (i.e. only of themselves).
- To find link(P1, P2):
- neighbor(P1) = {P1}
- neighbor(P2) = {P2}
- link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 0
Slide 20
- The following table shows the number of links (common neighbors) between the four points; with θ = 1, every pair of distinct points has 0 links:
- link(Pi, Pj) = 0 for all i ≠ j
- (Neighboring graph: four isolated nodes, each a neighbor only of itself.)
Slide 21: Example: Maximum Linkages
- Suppose we have four points P1, P2, P3, P4, and the similarity threshold (θ) is equal to 0.
- Then two points are neighbors if sim(Pi, Pj) ≥ 0.
- Hence, any pair of points are neighbors.
- To find link(P1, P2):
- neighbor(P1) = {P1, P2, P3, P4}
- neighbor(P2) = {P1, P2, P3, P4}
- link(P1, P2) = |neighbor(P1) ∩ neighbor(P2)| = 4
Slide 22
- The following table shows the number of links (common neighbors) between the four points; with θ = 0, every pair of points has 4 links:
- link(Pi, Pj) = 4 for all i, j
- (Neighboring graph: a complete graph over the four points.)
Slide 23: Example: Illustrating Links
- From the previous example, we have:
- neighbor(P1) = {P1, P2, P3, P4}
- neighbor(P3) = {P1, P2, P3, P4}
- link(P1, P3) = |neighbor(P1) ∩ neighbor(P3)| = 4 links
- We can depict these four different links (or paths) between P1 and P3, one passing through each of the four neighbors.
Slide 24: Major Definitions
- Similarity for data objects
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 25: Criterion Function
- To get the best clusters, we have to maximize this criterion function:
- El = sum over i = 1..k of [ ni × (sum of link(pq, pr) over pairs pq, pr in Ci) / ni^(1 + 2 f(θ)) ]
- where Ci denotes cluster i,
- ni is the number of points in Ci,
- k is the number of clusters,
- θ is the similarity threshold.
- Suppose that in Ci, each point has roughly ni^f(θ) neighbors.
- A suitable choice for basket data is f(θ) = (1 - θ) / (1 + θ).
Slide 26: Criterion Function
- By maximizing this criterion function, we maximize the sum of links of intra-cluster point pairs while at the same time minimizing the sum of links among pairs of points belonging to different clusters (i.e. among inter-cluster point pairs).
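- A small Java sketch of this criterion function (illustrative; it assumes a precomputed link matrix and clusters given as lists of point indices):

    import java.util.List;

    // E_l = sum over clusters Ci of:
    //   n_i * (sum of link(pq, pr) over pairs in Ci) / n_i^(1 + 2 f(theta)),
    // with f(theta) = (1 - theta) / (1 + theta).
    static double criterion(int[][] links, List<List<Integer>> clusters,
                            double theta) {
        double exponent = 1 + 2 * (1 - theta) / (1 + theta);
        double total = 0;
        for (List<Integer> c : clusters) {
            int n = c.size();
            double intraLinks = 0;
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++)   // unordered pairs in Ci
                    intraLinks += links[c.get(i)][c.get(j)];
            total += n * intraLinks / Math.pow(n, exponent);
        }
        return total;
    }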
Slide 27: Major Definitions
- Similarity for data objects
- Neighbors
- Links
- Criterion function
- Goodness measure
Slide 28: Goodness Measure
- Goodness function:
- g(Ci, Cj) = link[Ci, Cj] / [ (ni + nj)^(1 + 2 f(θ)) - ni^(1 + 2 f(θ)) - nj^(1 + 2 f(θ)) ]
- where link[Ci, Cj] is the number of cross links between clusters Ci and Cj.
- During clustering, we use this goodness measure in order to maximize the criterion function.
- This goodness measure helps to identify the best pair of clusters to merge during each step of ROCK.
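- A minimal Java sketch of this goodness measure (illustrative helper):

    // g(Ci, Cj) = link[Ci, Cj] /
    //   ((ni + nj)^(1 + 2f) - ni^(1 + 2f) - nj^(1 + 2f)),
    // with f = (1 - theta) / (1 + theta).
    static double goodness(int crossLinks, int ni, int nj, double theta) {
        double e = 1 + 2 * (1 - theta) / (1 + theta);
        return crossLinks /
               (Math.pow(ni + nj, e) - Math.pow(ni, e) - Math.pow(nj, e));
    }

- For instance, with theta = 0.3, two singleton clusters joined by 3 links give goodness(3, 1, 1, 0.3) ≈ 1.35, as in the worked example later in the deck.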
Slide 29: ROCK Clustering Algorithm
- Input: a set S of data points
- the number k of clusters to be found
- the similarity threshold (θ)
- Output: groups of clustered data
- The ROCK algorithm is divided into three major parts:
- 1. Draw a random sample from the data set.
- 2. Perform a hierarchical agglomerative clustering algorithm.
- 3. Label the data on disk.
- In our case, we do not deal with a very large data set, so we will consider the whole data set when forming clusters, i.e. we skip steps 1 and 3.
Slide 30: ROCK Clustering Algorithm
- Draw a random sample from the data set:
- Sampling is used to ensure scalability to very large data sets.
- The initial sample is used to form clusters; the remaining data on disk is then assigned to these clusters.
- In our case, we will consider the whole data set in the process of forming clusters.
Slide 31: ROCK Clustering Algorithm
- Perform a hierarchical agglomerative clustering algorithm:
- ROCK performs the following steps, which are common to all hierarchical agglomerative clustering algorithms, but with a different definition of the similarity measure (see the sketch after this list):
- a. Place each single data point into a separate cluster.
- b. Compute the similarity (goodness) measure for all pairs of clusters.
- c. Merge the two clusters with the highest goodness measure.
- d. Check the stop condition; if it is not met, go to step b.
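- A simplified Java sketch of this loop (illustrative only; it reuses the goodness helper and precomputed link matrix from the earlier sketches, and the full ROCK algorithm uses heaps to find the best pair more efficiently):

    import java.util.ArrayList;
    import java.util.List;

    // Greedy agglomerative loop: start from singleton clusters and
    // repeatedly merge the pair with the highest goodness measure
    // until only k clusters remain (or no linked pair is left).
    static List<List<Integer>> cluster(int[][] links, int numPoints,
                                       int k, double theta) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int p = 0; p < numPoints; p++) {            // step a
            List<Integer> singleton = new ArrayList<>();
            singleton.add(p);
            clusters.add(singleton);
        }
        while (clusters.size() > k) {                    // step d
            int bestI = -1, bestJ = -1;
            double best = 0;
            for (int i = 0; i < clusters.size(); i++)    // step b
                for (int j = i + 1; j < clusters.size(); j++) {
                    int cross = 0;                       // cross links
                    for (int p : clusters.get(i))
                        for (int q : clusters.get(j))
                            cross += links[p][q];
                    double g = goodness(cross, clusters.get(i).size(),
                                        clusters.get(j).size(), theta);
                    if (g > best) { best = g; bestI = i; bestJ = j; }
                }
            if (bestI < 0) break;                        // nothing to merge
            clusters.get(bestI).addAll(clusters.remove(bestJ)); // step c
        }
        return clusters;
    }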
Slide 32
- Label the data on disk:
- Finally, the remaining data points on disk are assigned to the generated clusters.
- This is done by selecting a random sample Li from each cluster Ci; then we assign each point p to the cluster for which it has the strongest linkage with Li.
- As we said, we will consider the whole data set in the process of forming clusters.
Slide 33: ROCK Clustering Algorithm
- Computation of links:
- Using the similarity threshold θ, we can convert the similarity matrix into an adjacency matrix A.
- We then obtain a matrix indicating the number of links by calculating A × A, i.e. by multiplying the adjacency matrix A with itself.
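- A short Java sketch of this link computation (illustrative; sim is assumed to be the precomputed similarity matrix):

    // Threshold the similarity matrix into an adjacency matrix A, then
    // compute the link matrix as A x A: entry (i, j) counts the common
    // neighbors of points i and j.
    static int[][] computeLinks(double[][] sim, double theta) {
        int n = sim.length;
        int[][] a = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = sim[i][j] >= theta ? 1 : 0;    // adjacency
        int[][] links = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int m = 0; m < n; m++)
                    links[i][j] += a[i][m] * a[m][j];    // (A x A)[i][j]
        return links;
    }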
Slide 34: ROCK Example
- Suppose we have four verses containing some subjects, as follows:
- P1: judgment, faith, prayer, fair
- P2: fasting, faith, prayer
- P3: fair, fasting, faith
- P4: fasting, prayer, pilgrimage
- The similarity threshold is 0.3, and the number of required clusters is 2.
- Using the Jaccard coefficient as a similarity measure, we obtain the following similarity table:

        P1     P2     P3     P4
  P1    1.00   0.40   0.40   0.17
  P2    0.40   1.00   0.50   0.50
  P3    0.40   0.50   1.00   0.20
  P4    0.17   0.50   0.20   1.00
Slide 35: ROCK Example
- Since we have a similarity threshold equal to 0.3, we derive the adjacency table (1 if sim ≥ 0.3, 0 otherwise; each point is a neighbor of itself):

        P1  P2  P3  P4
  P1    1   1   1   0
  P2    1   1   1   1
  P3    1   1   1   0
  P4    0   1   0   1

- By multiplying the adjacency table with itself, we derive the following table, which shows the number of links (or common neighbors):

        P1  P2  P3  P4
  P1    3   3   3   1
  P2    3   4   3   2
  P3    3   3   3   1
  P4    1   2   1   2
Slide 36: ROCK Example
- We compute the goodness measure for all adjacent points, assuming f(θ) = (1 - θ) / (1 + θ); with θ = 0.3, the exponent 1 + 2 f(θ) ≈ 2.077, so each pair of singleton clusters has denominator 2^2.077 - 2 ≈ 2.22.
- We obtain the following table:

  pair        links   goodness
  (P1, P2)    3       ≈ 1.35
  (P1, P3)    3       ≈ 1.35
  (P2, P3)    3       ≈ 1.35
  (P2, P4)    2       ≈ 0.90
  (P1, P4)    1       ≈ 0.45
  (P3, P4)    1       ≈ 0.45

- We have an equal goodness measure for merging (P1, P2), (P2, P3), and (P1, P3).
Slide 37: ROCK Example
- Now, we start the hierarchical algorithm by merging, say, P1 and P2.
- A new cluster (let's call it C(P1,P2)) is formed.
- It should be noted that some other hierarchical clustering techniques would not start the clustering process by merging P1 and P2, since sim(P1, P2) = 0.4, which is not the highest. But ROCK uses the number of links as the similarity measure rather than distance.
Slide 38: ROCK Example
- Now, after merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors (links) for these clusters:

  link(C(P1,P2), P3) = 3 + 3 = 6
  link(C(P1,P2), P4) = 1 + 2 = 3
  link(P3, P4)       = 1

- Then we obtain the following goodness measures for all adjacent clusters:

  g(C(P1,P2), P3) = 6 / (3^2.077 - 2^2.077 - 1) ≈ 1.31
  g(C(P1,P2), P4) = 3 / (3^2.077 - 2^2.077 - 1) ≈ 0.66
  g(P3, P4)       = 1 / (2^2.077 - 2) ≈ 0.45
Slide 39: ROCK Example
- Since the number of required clusters is 2, we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) that contains P1, P2, and P3, and leaving P4 alone in a separate cluster.
Slide 40: Conclusion and Future Work (1/3)
- We aim to apply a clustering technique to the verses of the Holy Quran.
- We should first perform manual preprocessing of the Quran text to capture the subjects of the verses in a tabular format.
- Then we can apply a clustering algorithm that groups each set of similar verses into the same cluster.
Slide 41: Conclusion and Future Work (2/3)
- Most traditional clustering algorithms use distance-based similarity measures, which are not appropriate for clustering our categorical dataset.
- We will apply the general framework of the ROCK algorithm.
- The ROCK (RObust Clustering using linKs) algorithm is an agglomerative hierarchical clustering algorithm for categorical data. It presents a new notion of links to measure the similarity between data objects.
Slide 42: Conclusion and Future Work (3/3)
- We will use the Java language to implement the ROCK clustering algorithm.
- During testing, we will try to form clusters of verses belonging to a single sura, and of verses belonging to many different suras.
- Insha Allah, we will achieve success in performing this mission.
Slide 43
- Thank you for your attention.
- I will be glad to answer your questions.