Title: An Efficient Method for Projected Clustering
1An Efficient Method for Projected Clustering
- Hongyin Cui Jiang Ye
- School of Computing Science
- Simon Fraser University
2Introduction
- Clustering is a widely used technique for data mining, indexing and classification.
- Most clustering algorithms do not work efficiently or effectively in high dimensional spaces because of the inherent sparsity of the data.
3Clusters may exist in different subspaces
comprised of different combinations of attributes
4Related Work
- CLIQUE: density-based and grid-based
  - It partitions each dimension into the same number of equal-length intervals.
  - It partitions an m-dimensional data space into non-overlapping rectangular units.
  - A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter.
  - A cluster is a maximal set of connected dense units within a subspace.
  - A bottom-up greedy algorithm
  - Exponential dependency on the number of dimensions
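The dense-unit test described above can be sketched in a few lines. This is only an illustration under assumptions: the names `dense_units`, `xi` (intervals per dimension) and `tau` (density threshold) are chosen here, and the value range of each dimension is passed in explicitly.

```python
from collections import Counter

def dense_units(points, dims, lo, hi, xi, tau):
    """Return the grid units of subspace `dims` holding more than a fraction tau of the points."""
    counts = Counter()
    for p in points:
        # Map each coordinate to one of xi equal-length intervals (clamp the upper edge).
        cell = tuple(
            min(int((p[i] - lo[i]) / (hi[i] - lo[i]) * xi), xi - 1) for i in dims
        )
        counts[cell] += 1
    n = len(points)
    return {cell for cell, c in counts.items() if c / n > tau}

points = [(0.1, 0.1), (0.15, 0.9), (0.12, 0.5), (0.9, 0.9)]
print(dense_units(points, (0,), (0, 0), (1, 1), xi=4, tau=0.5))    # {(0,)}
print(dense_units(points, (0, 1), (0, 0), (1, 1), xi=4, tau=0.5))  # set()
```

In this toy data the first dimension has a dense interval while the full 2-dimensional space has none, which is the situation subspace clustering is meant to capture.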
5Related Work (Cont.)
- PROCLUS
  - Finds the best set of medoids by a hill-climbing process.
  - Searches not just in the space of possible medoids but also in the space of possible dimensions associated with each medoid.
  - It uses a locality analysis, so its result may be a local optimum.
6Related Work (Cont.)
- DOC
  - A dense projective cluster is a pair $(C, D)$, where $C$ is a subset of the data set $S$ and $D$ is a subset of the full dimension set $\mathbf{d}$, such that
    - $C$ is sufficiently large, i.e. $|C| \ge \alpha |S|$
    - $\forall i \in D$: $\max_{p \in C} p_i - \min_{q \in C} q_i \le w$
    - $\forall i \in \mathbf{d} - D$: $\max_{p \in C} p_i - \min_{q \in C} q_i > w$
  - It repeatedly
    - chooses $p \in S$ and $X \subseteq S$ via random sampling
    - computes the corresponding cluster $(C, D)$
  - It reports the best cluster found.
  - The result is an approximation of the optimal projective cluster.
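One sampling trial might look like the sketch below. The slide does not say how $(C, D)$ is computed from $p$ and $X$, so the rule here is an assumption based on the usual description of DOC: keep the dimensions on which every point of $X$ lies within $w$ of $p$, then collect all points of $S$ inside that box. The function name `doc_trial` is hypothetical.

```python
def doc_trial(S, p, X, w):
    """One DOC-style trial: derive (C, D) from a seed point p and a sampled set X."""
    d = len(p)
    # Assumed rule: D holds the dimensions on which all of X stays within w of p.
    D = [i for i in range(d) if all(abs(q[i] - p[i]) <= w for q in X)]
    # C holds the points of S within w of p on every dimension of D.
    C = [q for q in S if all(abs(q[i] - p[i]) <= w for i in D)]
    return C, D

S = [(1.0, 5.0), (1.2, 9.0), (0.9, 1.0), (4.0, 5.1)]
C, D = doc_trial(S, p=S[0], X=[S[1], S[2]], w=0.5)
print(D, C)
```

Two points that are each within $w$ of $p$ can be up to $2w$ apart, so the box used in this sketch is looser than the width-$w$ condition in the definition.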
7Problem Definition
- Key observations
  - Often many records in a database share similar values for several attributes.
  - Identifying and grouping together records that share similar values for some attributes can both give useful insight into the data (projected clusters) and yield a more parsimonious representation of the data.
8Problem Definition (cont.)
- The user can define discretization criteria by specifying the interval $w_i$ for each attribute $i$, or by using a global interval $w$ for all attributes.
- We say a group of records share a similar value on attribute $i$ if they have the same discretized value on $i$ (i.e. they fall in the same interval).
- For example:

Name          Position  Points  Played Mins  Penalty Mins
Blake         Defense   43      395          34
Borque        Defense   80      430          22
Gullimore     Defense   3       30           18
Gretzky       Center    89      458          26
Konstantinov  Defense   10      560          120
May           Winger    35      290          180
Odjick        Winger    9       115          245
Tkachuk       Center    82      475          160
Wotton        Defense   5       38           6

Figure 1: A fragment of the NHL players' statistics table (1996)
9Problem Definition (cont.)
- In Figure 1, suppose the discretization intervals imposed on the attributes are
  - Position: already discrete
  - $w_{Points} = 10$, $w_{PlayedMins} = 60$, $w_{PenaltyMins} = 20$
- We find
  - {Borque, Gretzky, Tkachuk} on {Points, Played Mins}: played and scored a lot
  - {Gullimore, Wotton} on {Position, Points, Played Mins, Penalty Mins}: same position; played, scored and were penalized sparingly
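The two groups can be checked mechanically. The sketch below assumes that intervals start at 0, so a value $v$ falls in interval $\lfloor v / w \rfloor$; the slides do not state where intervals begin. The helper names are hypothetical.

```python
ATTRS = ("Position", "Points", "Played Mins", "Penalty Mins")
WIDTHS = (None, 10, 60, 20)  # None: Position is already discrete

players = {
    "Blake": ("Defense", 43, 395, 34),
    "Borque": ("Defense", 80, 430, 22),
    "Gullimore": ("Defense", 3, 30, 18),
    "Gretzky": ("Center", 89, 458, 26),
    "Konstantinov": ("Defense", 10, 560, 120),
    "May": ("Winger", 35, 290, 180),
    "Odjick": ("Winger", 9, 115, 245),
    "Tkachuk": ("Center", 82, 475, 160),
    "Wotton": ("Defense", 5, 38, 6),
}

def discretize(record):
    """Replace each numeric value by its interval index (assumed origin 0)."""
    return tuple(v if w is None else v // w for v, w in zip(record, WIDTHS))

def shared_attributes(names):
    """Attributes on which all named players have the same discretized value."""
    rows = [discretize(players[n]) for n in names]
    return [a for a, col in zip(ATTRS, zip(*rows)) if len(set(col)) == 1]

print(shared_attributes(["Borque", "Gretzky", "Tkachuk"]))
print(shared_attributes(["Gullimore", "Wotton"]))
```

Under that assumption the first call yields Points and Played Mins, and the second yields all four attributes, matching the groups listed above.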
10Problem Definition (cont.)
- Definition
  - Let $p = (p_1, \dots, p_d)$ be a point in $\mathbb{R}^d$, let $\mathbf{d}$ denote the set of the $d$ dimensions, and let $w_i > 0$ for $0 < i \le d$.
  - $\forall i \in \mathbf{d}$, dimension $i$ is partitioned and $p_i$ is discretized by $w_i$.
  - Let $S$ be a set of points in $\mathbb{R}^d$. For any $0 \le \alpha \le 1$, a projected cluster in $S$ is a pair $(C, D)$, $C \subseteq S$, $D \subseteq \mathbf{d}$, such that
    - $|C| \ge \alpha |S|$
    - $\forall j \in D$, all points in $C$ share the same discretized value on attribute $j$ (i.e. they fall in the same interval).
    - No $D' \supset D$ also satisfies the above two conditions.
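The three conditions translate into a short check. In this sketch the minimum-size fraction is called `alpha` (an assumed name), intervals are assumed to start at 0, and the maximality condition is tested by requiring $D$ to equal the full set of dimensions shared by $C$.

```python
def is_projected_cluster(S, C, D, widths, alpha):
    """Check the size, shared-interval and maximality conditions for (C, D)."""
    def cell(p, j):
        # Interval index of p on dimension j (assumed origin 0).
        return int(p[j] // widths[j])

    # All dimensions on which every point of C falls in the same interval.
    shared = {j for j in range(len(widths)) if len({cell(p, j) for p in C}) == 1}
    # D must be exactly `shared`: a smaller D would not be maximal.
    return len(C) >= alpha * len(S) and set(D) == shared

S = [(43, 395, 34), (80, 430, 22), (3, 30, 18), (89, 458, 26), (5, 38, 6)]
C = [S[1], S[3]]
print(is_projected_cluster(S, C, {0, 1, 2}, (10, 60, 20), alpha=0.4))  # True
print(is_projected_cluster(S, C, {0, 1}, (10, 60, 20), alpha=0.4))     # False: D is not maximal
```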
11FIPCLUS Mining projected clusters via frequent
closed itemsets
- Basic steps
  - Step 1: discretize p on each attribute.
  - Step 2: create a transaction database.
  - Step 3: mine frequent closed itemsets with the CLOSET algorithm; each itemset identifies one subspace.
  - Step 4: find the corresponding group of points for each subspace by scanning the DB once.
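Step 4 is the only step not detailed on later slides. A minimal sketch, assuming transactions and itemsets are represented as `frozenset`s of integer item ids (the function name is hypothetical):

```python
def groups_for_itemsets(transactions, itemsets):
    """Collect, in one pass over the DB, the record ids supporting each itemset."""
    groups = {itemset: [] for itemset in itemsets}
    for rid, t in enumerate(transactions):  # single scan of the DB
        for itemset in itemsets:
            if itemset <= t:  # the record contains every item of the itemset
                groups[itemset].append(rid)
    return groups

db = [frozenset({1, 2, 3}), frozenset({4, 2, 3}), frozenset({1, 5, 6})]
found = groups_for_itemsets(db, [frozenset({2, 3}), frozenset({1})])
print(found)
```

Each itemset plays the role of a subspace $D$, and the record ids collected for it form the point set $C$.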
12FIPCLUS Mining projected clusters via frequent
closed itemsets (cont.)
- Step 1
  - $\forall i \in \mathbf{d}$, partition dimension $i$ and discretize $p_i$ by $w_i$.
  - Or discretize $p_i$ using user-specified criteria.
  - Or skip step 1 if users provide discretized data.
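The first two options can share one interface: a per-attribute rule that maps a raw value to an interval index. The sketch below is illustrative only; `by_width`, `by_cuts` and the cut points are assumptions, not part of the original slides.

```python
from bisect import bisect_right

def by_width(w):
    """Equal-width rule: value v falls in interval floor(v / w)."""
    return lambda v: int(v // w)

def by_cuts(cuts):
    """User-specified rule: `cuts` is a sorted list of interval boundaries."""
    return lambda v: bisect_right(cuts, v)

def discretize(point, rules):
    """Apply one rule per attribute."""
    return tuple(rule(v) for rule, v in zip(rules, point))

rules = [by_width(10), by_cuts([60, 300]), by_width(20)]
print(discretize((89, 458, 26), rules))  # (8, 2, 1)
```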
13FIPCLUS Mining projected clusters via frequent
closed itemsets (cont.)
- Step 2
  - $\forall i \in \mathbf{d}$, enumerate and number each discretized value with a different integer $j$, using consecutive numbers.
  - E.g. Position = {defense, center, winger}; then defense = 1, center = 2 and winger = 3.
  - Substitute each discretized value in $\mathbf{d}$ with a unique integer, idj.
  - The original database is thus transformed into a transaction database.
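A possible encoding is sketched below. It assumes that the unique integer is assigned per (attribute, discretized value) pair in order of first appearance; the slides do not fix a numbering scheme, so this is one choice among several.

```python
def to_transactions(rows):
    """Turn discretized records into transactions of integer item ids."""
    item_id = {}  # (attribute index, discretized value) -> unique integer
    transactions = []
    for row in rows:
        t = set()
        for attr, value in enumerate(row):
            key = (attr, value)
            if key not in item_id:
                item_id[key] = len(item_id) + 1  # consecutive ids starting at 1
            t.add(item_id[key])
        transactions.append(frozenset(t))
    return transactions, item_id

rows = [("Defense", 8, 7), ("Center", 8, 7), ("Defense", 0, 0)]
transactions, item_id = to_transactions(rows)
print(transactions)
```

Keying on the attribute index as well as the value keeps equal numbers on different attributes (for example interval 0 on Points and on Played Mins) from collapsing into one item.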
14FIPCLUS Mining projected clusters via frequent
closed itemsets (cont.)
- Step 3: based on CLOSET (Jian Pei, Jiawei Han)
  - Definition (frequent closed itemset): An itemset $X$ is a closed itemset if there exists no itemset $X'$ such that (1) $X'$ is a proper superset of $X$, and (2) every transaction containing $X$ also contains $X'$. A closed itemset $X$ is frequent if its support passes the given support threshold.
  - CLOSET is based on the FP-tree and needs no candidate generation.
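The definition can be tested directly with a brute-force check: an itemset is closed exactly when it equals the intersection of all transactions that contain it. This sketch is for illustration on small data and is unrelated to how CLOSET itself works; the function names are hypothetical.

```python
def support(X, transactions):
    """Number of transactions containing every item of X."""
    return sum(1 for t in transactions if X <= t)

def closure(X, transactions):
    """Largest itemset contained in every transaction that contains X."""
    X = frozenset(X)
    covering = [frozenset(t) for t in transactions if X <= t]
    return frozenset.intersection(*covering) if covering else X

def is_frequent_closed(X, transactions, min_sup):
    X = frozenset(X)
    return support(X, transactions) >= min_sup and closure(X, transactions) == X

T = [{1, 2, 3}, {1, 2}, {1, 2, 4}]
print(is_frequent_closed({1}, T, 2))     # False: every transaction with 1 also has 2
print(is_frequent_closed({1, 2}, T, 2))  # True
```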
15FIPCLUS Mining projected clusters via frequent
closed itemsets (cont.)
- CLOSET
  - Input: transaction database TDB and support threshold min_sup
  - Output: the complete set of frequent closed itemsets
  - Method
    - Initialization. Let FCI be the set of frequent closed itemsets. Initialize FCI $\leftarrow \emptyset$.
    - Find frequent items. Scan the transaction database TDB and compute the frequent item list f_list.
    - Mine frequent closed itemsets recursively. Call CLOSET($\emptyset$, TDB, f_list, FCI).
16FIPCLUS Mining projected clusters via frequent
closed itemsets (cont.)
- CLOSET(X, DB, f_list, FCI)
  - Parameters
    - X: the frequent itemset.
    - DB: the X-conditional database, i.e. the subset of transactions in TDB containing X.
    - f_list: the frequent item list of DB.
    - FCI: the set of frequent closed itemsets already found.
17FIPCLUS Mining projected clusters via frequent
closed itemsets (cont.)
- CLOSET(X, DB, f_list, FCI)
  - Extract the set $Y$ of items appearing in every transaction of DB; insert $X \cup Y$ into FCI if it is not a subset of some itemset in FCI with the same support.
  - Build the FP-tree for DB; items in $Y$ are excluded.
  - Directly extract frequent closed itemsets from the FP-tree.
  - $\forall i \in$ rest of f_list, form the conditional database DB_i and compute the local frequent item list f_list_i.
  - $\forall i \in$ rest of f_list, call CLOSET(iX, DB_i, f_list_i, FCI) if iX is not a subset of any frequent closed itemset in FCI with the same support.
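The recursion can be illustrated without an FP-tree. The sketch below keeps the overall shape of the steps above (extract the common items, record the closed itemset, recurse on conditional databases, skip extensions already covered with the same support) but stores conditional databases as plain lists and ignores the f_list ordering. It is a simplified stand-in under those assumptions, not the CLOSET implementation.

```python
from collections import Counter

def mine_closed(transactions, min_sup):
    """Return {closed itemset: support} for all frequent closed itemsets."""
    fci = {}
    _mine(frozenset(), [frozenset(t) for t in transactions], min_sup, fci)
    return fci

def _mine(X, db, min_sup, fci):
    # db: transactions containing X, with the items of X already removed.
    n = len(db)
    counts = Counter(item for t in db for item in t)
    # Y: items present in every transaction of db, so X | Y is the closure of X.
    closed = X | frozenset(i for i, c in counts.items() if c == n)
    if closed and n >= min_sup and not any(
        closed <= Z and s == n for Z, s in fci.items()
    ):
        fci[closed] = n
    # Remaining frequent items, each tried as an extension of the closed set.
    for i in sorted(i for i, c in counts.items() if min_sup <= c < n):
        iX = closed | {i}
        # Skip if already covered by a known closed itemset with equal support.
        if any(iX <= Z and s == counts[i] for Z, s in fci.items()):
            continue
        _mine(iX, [t - iX for t in db if i in t], min_sup, fci)

print(mine_closed([{1, 2, 3}, {1, 2}, {1, 2, 4}, {3, 4}], min_sup=2))
```

On this toy database the result contains {1, 2} with support 3 and the singletons {3} and {4} with support 2; {1} and {2} are absent because neither is closed.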
18Evaluation Comparison
- Definition: more flexible and meaningful.
  - No assumption on the distribution of C in D.
  - A different interval $w_i$ on each dimension, or flexible discretization criteria.
- CLIQUE
  - Partitions each dimension into $\xi$ intervals; not flexible.
  - Hard to determine the density threshold for each unit.
- PROCLUS
  - Distance-based, so it has all the flaws of distance-based methods.
- DOC
  - Very similar definition
  - But it uses a global interval width $w$ for every dimension; not flexible.
19Evaluation Comparison
- Algorithm
  - Solves the clustering problem via mining frequent itemsets: more efficient, scalable and faster on large databases.
  - Runtime complexity is $O(N)$, where $N = |DB|$. Typically 4 or 5 scans of the DB.
- CLIQUE
  - Bottom-up construction generates a huge number of candidates, each of which needs one scan of the DB. → not efficient
- PROCLUS
  - Finds the best set of medoids by a hill-climbing process.
  - It uses a locality analysis, so its result may be a local optimum.
  - Runtime complexity $O(N \cdot k \cdot l + N \cdot k \cdot d)$, where $k$ is the number of clusters, $l$ the average dimensionality of subspaces, and $d$ the full dimensionality. → less efficient
20Evaluation Comparison
- DOC
  - Finds an approximation of the clusters via random sampling.
  - Not complete, and quality cannot be guaranteed.
  - Runtime complexity is $O(N \cdot d^{c+1})$, where $c$ is a constant, $d$ the full dimensionality, and $N = |DB|$. → less efficient
21Conclusion
- We proposed FIPCLUS, which
  - Efficiently mines projected clusters via frequent closed itemsets.
  - Applies a compressed FP-tree structure for mining frequent closed itemsets without candidate generation.
  - Generates a much smaller set of frequent itemsets, leading to fewer and more interesting projected clusters.
22Weakness & Future Work
- Weakness
  - FIPCLUS may generate some overlapping clusters.
  - E.g. for (C1, D1) and (C2, D2):
    - C1 = {a, b, c}, D1 = {d1, d2, d3, d4}; C2 = {a, b, c, e, f}, D2 = {d1, d2}
- Future work
  - Modify FIPCLUS to mine the maximal frequent itemsets to address the above weakness.
  - E.g. in the above example, it outputs only (C1, D1).
  - It is actually a tradeoff, since maximal frequent itemsets may lose some interesting clusters and information.
  - Evaluate its effectiveness.
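The effect of switching to maximal itemsets can be previewed with a post-filter over the clusters already found. This is only a sketch of the intended outcome: the proposal is to change the mining step itself, and `maximal_only` is a hypothetical helper.

```python
def maximal_only(clusters):
    """Keep (C, D) only if D is not a proper subset of another cluster's D."""
    return [
        (C, D) for C, D in clusters
        if not any(D < D_other for _, D_other in clusters)
    ]

C1, D1 = frozenset("abc"), frozenset({"d1", "d2", "d3", "d4"})
C2, D2 = frozenset("abcef"), frozenset({"d1", "d2"})
print(maximal_only([(C1, D1), (C2, D2)]) == [(C1, D1)])  # True
```

The larger group C2 disappears from the output even though it contains points e and f that C1 does not, which is the information loss mentioned above.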
23References
- [1] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications
- [2] C. Procopiuc, M. Jones, P. Agarwal, T. M. Murali, A Monte Carlo Algorithm for Fast Projective Clustering
- [3] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, J. Park, Fast Algorithms for Projected Clustering
- [4] J. Pei, J. Han, R. Mao, CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
- [5] K. Yip, D. Cheung, M. Ng, A Highly-usable Projected Clustering Algorithm for Gene Expression Profiles
- [6] H. V. Jagadish, J. Madar, R. Ng, Semantic Compression and Pattern Extraction with Fascicles