Title: Consistent Bipartite Graph Co-Partitioning for High-Order Heterogeneous Co-Clustering
1Consistent Bipartite Graph Co-Partitioning for
High-Order Heterogeneous Co-Clustering
- Tie-Yan Liu
- WSM Group, Microsoft Research Asia
- 2005.11.11
- Joint work with Bin Gao, Peking University
2Outline
- Motivation
- What is high-order heterogeneous co-clustering
- Why previous methods cannot work well on this
problem
- Consistent Bipartite Graph Co-partitioning (CGBC)
- Experimental Evaluation
- Conclusions and Future Work
3Clustering
- Clustering groups data objects into clusters,
so that objects in the same cluster are similar
to each other.
- Spectral Clustering
- Models the similarity of data objects by an
affinity graph, and assumes that the best
clustering result corresponds to the minimal
(ratio, normalized or min-max) graph cut.
- It can be proven that the minimum of the
normalized cut can be achieved by minimizing an
objective function in the form of a Rayleigh
quotient, and the corresponding solution q is the
eigenvector associated with the second smallest
eigenvalue of a generalized eigenvalue problem.
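The eigenvector step described above can be sketched in plain Python. This is a minimal illustration, assuming the standard normalized-cut form in which q minimizes q^T L q / q^T D q with L = D - W and solves L q = lambda D q. The function name `fiedler_embedding`, the toy affinity matrix, and the use of power iteration instead of a library eigensolver are assumptions made for the example.

```python
import math


def fiedler_embedding(W, iters=500):
    """Approximate the eigenvector q of L q = lambda D q (with L = D - W)
    for the second smallest eigenvalue. Assumes W is symmetric, nonnegative,
    and has no isolated nodes."""
    n = len(W)
    deg = [sum(row) for row in W]                  # diagonal of D
    s = [math.sqrt(d) for d in deg]
    # N = D^{-1/2} W D^{-1/2}; an eigenvalue mu of N corresponds to lambda = 1 - mu.
    N = [[W[i][j] / (s[i] * s[j]) for j in range(n)] for i in range(n)]
    # The trivial eigenvector of N (mu = 1) is proportional to D^{1/2} e.
    norm = math.sqrt(sum(deg))
    u1 = [v / norm for v in s]
    u = [float(i + 1) for i in range(n)]           # deterministic start vector
    for _ in range(iters):
        # Remove the trivial direction, then apply the shifted matrix N + I.
        dot = sum(a * b for a, b in zip(u, u1))
        u = [a - dot * b for a, b in zip(u, u1)]
        u = [u[i] + sum(N[i][j] * u[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in u))
        u = [a / length for a in u]
    # Map back from the symmetric problem to the generalized one: q = D^{-1/2} u.
    return [u[i] / s[i] for i in range(n)]


# Toy affinity graph: two triangles joined by one weak edge (2-3).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q = fiedler_embedding(W)
clusters = [int(v > 0) for v in q]   # the sign of q gives the two-way cut
```

Thresholding q at zero is the simplest way to read off a bipartition; running k-means on the embedding is a common alternative.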
4Co-Clustering
- Co-clustering groups two types of objects into
their own clusters simultaneously.
- Bipartite graph partitioning (Dhillon and Zha)
- Use a bipartite graph to model the
inter-relationship between the two types of
objects. The edges are all of the same type in
the bipartite graph, so the graph cut is still
easy to define.
- It can be proven that the solutions are the
singular vectors associated with the second
largest singular value of the normalized
inter-relationship matrix (equivalently, the
second smallest eigenvalue of the corresponding
generalized eigenvalue problem).
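A small sketch of bipartite co-partitioning in plain Python follows. It assumes the normalization D1^{-1/2} A D2^{-1/2}, where D1 and D2 hold the row and column sums of the relation matrix A. Under that normalization the relevant singular pair is the one for the second largest singular value, which corresponds to the second smallest eigenvalue of the generalized eigenvalue problem. The function name `co_partition`, the toy matrix, and the alternating power iteration are illustrative assumptions, not the authors' implementation.

```python
import math


def co_partition(A, iters=300):
    """Split the rows and columns of a nonnegative relation matrix A into two
    co-clusters. Assumes every row and every column has a positive sum."""
    m, n = len(A), len(A[0])
    r = [math.sqrt(sum(row)) for row in A]                              # sqrt of row degrees
    c = [math.sqrt(sum(A[i][j] for i in range(m))) for j in range(n)]   # sqrt of column degrees
    An = [[A[i][j] / (r[i] * c[j]) for j in range(n)] for i in range(m)]
    # Trivial left singular vector (singular value 1) is proportional to r.
    total = math.sqrt(sum(v * v for v in r))
    u1 = [v / total for v in r]
    u = [float(i + 1) for i in range(m)]                                # deterministic start
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(u, u1))
        u = [a - dot * b for a, b in zip(u, u1)]                        # deflate the trivial pair
        v = [sum(An[i][j] * u[i] for i in range(m)) for j in range(n)]  # v = An^T u
        u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = An v
        length = math.sqrt(sum(a * a for a in u))
        u = [a / length for a in u]
    v = [sum(An[i][j] * u[i] for i in range(m)) for j in range(n)]
    x = [u[i] / r[i] for i in range(m)]   # row embedding
    y = [v[j] / c[j] for j in range(n)]   # column embedding
    return [int(val > 0) for val in x], [int(val > 0) for val in y]


# Toy relation matrix: rows 0-1 mostly relate to columns 0-1, rows 2-3 to column 2.
A = [
    [2, 2, 0],
    [2, 2, 0.1],
    [0, 0.1, 3],
    [0, 0, 3],
]
row_labels, col_labels = co_partition(A)
```

Rows and columns that receive the same label form one co-cluster; which group is labeled 0 or 1 is arbitrary.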
5High-order Heterogeneous Co-Clustering (HHCC)
- HHCC groups multiple (more than 2) types of
objects into clusters simultaneously.
- Order is defined as the number of types of
objects.
- If we use a graph to represent the
inter-relationship between data objects, the
edges within each bipartite graph are of the same
type, but edges in different bipartite graphs are
of different types. This is what heterogeneous
refers to, in contrast to spectral clustering and
bipartite graph co-clustering.
6HHCC is not a Rare Problem
- Typical examples
- Surrounding Text, Web Image, Visual Features
- User, Query, Click-through
- Many other examples
Category Document Term Reader Newspaper
Article Passenger Airplane Airways Webpage
Website Site-group Article Magazine
Category Hardware Computer Usage Software
People Community
7Why HHCC is a new problem?
- Although bipartite graph partitioning is just a
trivial extension of spectral clustering, the
extension to HHCC is non-trivial.
- Since there are different types of edges in the
HHCC problem, the cut of high-order data is
difficult to define. It may not be reasonable to
assign weights to heterogeneous edges so as to
make their contributions to the graph cut
comparable.
- Simply applying spectral clustering may cause the
high-order problem to degrade into a 2-order
problem.
8An Example of Weighting Heterogeneous Edges
Embeddings produced by spectral clustering for
a = 0.01, a = 1 and a = 100.
No matter how we adjust the weights to balance
the different types of edges, we can never
cluster X into two groups successfully.
9An Example of Weighting Heterogeneous Edges
(Cont.)
Including X and Z
10Order Degradation
2-Order Heterogeneous graph
11Our Solution
- We try to tackle the aforementioned problems by
proposing a new solution to HHCC: Consistent
Bipartite Graph Co-Partitioning (CGBC).
- Where should we get started?
- Star-structured HHCC
- The concept of consistency
- An SDP-based solution
12Why Star-Structured?
- Star-structured means that in the heterogeneous
graph there is a central type of objects which
connects to all the other types of objects, and
there are no direct connections between any other
object types.
- The star structure is the simplest but a very
common case of HHCC.
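The definition can be checked mechanically. The sketch below is only an illustration: the helper `star_center` and the way relations are encoded as pairs of type names are assumptions, not something taken from the slides.

```python
def star_center(types, relations):
    """Return the central object type if the relations form a star over
    `types`, otherwise None. `relations` is an iterable of (type, type) pairs."""
    edges = {frozenset(pair) for pair in relations}
    for candidate in types:
        # A star has exactly one edge from the center to every other type
        # and no edge between two non-central types.
        expected = {frozenset((candidate, t)) for t in types if t != candidate}
        if edges == expected:
            return candidate
    return None


types = ["surrounding text", "web image", "visual features"]
relations = [("surrounding text", "web image"), ("web image", "visual features")]
center = star_center(types, relations)   # "web image"
```

Adding a direct relation between surrounding text and visual features would break the star, and the function would return None.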
13Why Star-Structured?
- Star-Structured is the simplest but very common
case of HHCC.
- Surrounding text
- Web Images
- Visual features
- Author
- Conference
- Paper
- Key Word
- Customer
- Shareholder
- Shop
- Supplier
- Advertisement Media
14The Concept of Consistency
- Divide the star-structured HHCC problem into a
set of bipartite sub-problems, where each
sub-problem only has homogeneous edges.
- Solve each sub-problem separately, to avoid
order degradation.
- Add a global constraint on the central type of
objects, so as to get a feasible cut for the
original problem.
15The Concept of Consistency
Divide this tripartite graph into two bipartite
graphs.
Partition these two graphs simultaneously and
consistently.
16Formulating the Optimization Problem
- Minimize the cuts of the two bipartite graphs,
with the constraint that their partitioning
results on the central type of objects are the
same.
- Objective Function
The definition of q and p indicates the
consistency between these two graphs: the y in
the two embeddings is the same, so we actually
force the partitioning on the central type of
objects to be the same.
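One way to write this formulation out is sketched below, assuming the usual normalized-cut notation. Here L^(1), D^(1) and L^(2), D^(2) are the Laplacian and degree matrices of the two bipartite graphs, x and z are the embeddings of the two non-central object types, y is the embedding of the central type, and e is the all-ones vector. The symbols x, z, e and the exact form of the constraints are assumptions.

```latex
\begin{aligned}
&\min_{q \neq 0} \frac{q^{T} L^{(1)} q}{q^{T} D^{(1)} q}, \qquad
 \min_{p \neq 0} \frac{p^{T} L^{(2)} p}{p^{T} D^{(2)} p} \\
&\text{s.t.}\quad q^{T} D^{(1)} e = 0, \quad p^{T} D^{(2)} e = 0, \\
&\text{where}\quad q = \begin{pmatrix} x \\ y \end{pmatrix}, \quad
 p = \begin{pmatrix} y \\ z \end{pmatrix}.
\end{aligned}
```

Because the same block y appears in both q and p, any cut read off from the two embeddings agrees on the central objects.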
17How to Solve the Optimization Problem 1: Convert
it to a QCQP Problem
Simplify the original problem to
single-objective programming
Auxiliary Notations
Since the normalized Rayleigh quotient is already
a scalar measure of the graph structure, a linear
combination of the two Rayleigh quotients is
reasonable, and the weight indicates which graph
we should trust more. Linear combination is only
one approach to multi-objective programming;
other methods that do not require this weighting
parameter could also be used.
Quadratically Constrained Quadratic Programming
(QCQP)
Sum-of-ratios Quadratic Fractional Programming
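The single-objective form can be illustrated numerically. The sketch below evaluates a β-weighted sum of the two normalized Rayleigh quotients for given embeddings, using the identity that q^T L q equals the weighted sum of squared differences across edges. The function names, the toy relation matrices, and the convention that β weights the first graph are assumptions.

```python
def rayleigh_quotient(R, rows, cols):
    """Normalized Rayleigh quotient q^T L q / q^T D q of the bipartite graph
    with relation matrix R, where q stacks the row and column embeddings."""
    m, n = len(R), len(R[0])
    # q^T L q: weighted sum of squared embedding differences over all edges.
    num = sum(R[i][j] * (rows[i] - cols[j]) ** 2 for i in range(m) for j in range(n))
    # q^T D q: squared embedding values weighted by node degree.
    den = sum(sum(R[i]) * rows[i] ** 2 for i in range(m))
    den += sum(sum(R[i][j] for i in range(m)) * cols[j] ** 2 for j in range(n))
    return num / den


def combined_objective(A, B, x, y, z, beta):
    """beta-weighted sum of the two quotients; y is shared by both graphs."""
    return beta * rayleigh_quotient(A, x, y) + (1.0 - beta) * rayleigh_quotient(B, y, z)


A = [[1, 0], [1, 0], [0, 1]]     # hypothetical X-Y relations (3 x 2)
B = [[1, 1, 0], [0, 1, 1]]       # hypothetical Y-Z relations (2 x 3)
x, y, z = [1, 1, -1], [1, -1], [1, 1, -1]
value = combined_objective(A, B, x, y, z, beta=0.5)
```

With β = 1 only the first graph contributes and with β = 0 only the second, so β expresses how much each graph is trusted.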
18How to Solve the Optimization Problem 2: Convert
QCQP to SDP
Semi-definite Programming (SDP)
19The Final Algorithm (CGBC)
- Set the parameters β, ?1 and ?2.
- Given the inter-relation matrices A and B, form
the corresponding diagonal matrices and Laplacian
matrices D(1), D(2), L(1) and L(2).
- Extend D(1), D(2), L(1) and L(2) to ?1, ?2, ?1
and ?2, and form ?, such that the coefficient
matrices in the SDP problem can be computed.
- Solve the above SDP problem by a certain
iterative algorithm such as SDPA.
- Extract ? from W and regard it as the embedding
vector of the heterogeneous objects.
- Run the k-means algorithm on ? to obtain the
desired partitioning of the heterogeneous objects.
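The matrix-building steps above can be sketched in plain Python. This is only an illustration under stated assumptions: A is taken to be |X| x |Y| and B to be |Y| x |Z|, the stacked embedding is ordered (x, y, z), and "extending" a matrix is assumed to mean zero-padding it to the size of the stacked embedding. The helper names are hypothetical, and the SDP and k-means steps are not shown.

```python
def bipartite_laplacian(R):
    """Return (D, L) for the bipartite graph with relation matrix R.
    Nodes are ordered rows first, then columns; L = D - M with M = [[0, R], [R^T, 0]]."""
    m, n = len(R), len(R[0])
    size = m + n
    M = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            M[i][m + j] = M[m + j][i] = float(R[i][j])
    D = [[sum(M[i]) if i == j else 0.0 for j in range(size)] for i in range(size)]
    L = [[D[i][j] - M[i][j] for j in range(size)] for i in range(size)]
    return D, L


def pad(M, before, after):
    """Embed square matrix M in a zero matrix with `before` leading and
    `after` trailing zero rows and columns."""
    size = before + len(M) + after
    out = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(M):
        for j, val in enumerate(row):
            out[before + i][before + j] = val
    return out


def quad(M, v):
    """Quadratic form v^T M v."""
    return sum(v[i] * M[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))


A = [[1, 0], [1, 1]]            # hypothetical X-Y relations (2 x 2)
B = [[1, 0, 0], [0, 1, 1]]      # hypothetical Y-Z relations (2 x 3)
nx, nz = len(A), len(B[0])

D1, L1 = bipartite_laplacian(A)
D2, L2 = bipartite_laplacian(B)
L1_ext, D1_ext = pad(L1, 0, nz), pad(D1, 0, nz)   # acts on the (x, y) part
L2_ext, D2_ext = pad(L2, nx, 0), pad(D2, nx, 0)   # acts on the (y, z) part

x, y, z = [1.0, -1.0], [0.5, -0.5], [1.0, -1.0, -1.0]
w = x + y + z                   # stacked embedding of all objects
```

Padding lets both quadratic forms be written in terms of one stacked vector, so the shared block y is the same variable in both sub-problems: `quad(L1_ext, w)` equals `quad(L1, x + y)` and `quad(L2_ext, w)` equals `quad(L2, y + z)`.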
20CGBC's Extension to the k-star-structured HHCC
21Experiment on Toy Problem
Relation Matrix A
Relation Matrix B
Embedding values of heterogeneous objects for
β = 0, 0.2, 0.4, 0.6, 0.8, 1.0
Totally based on the first graph: Y(812)
Totally based on the second graph: Y(128)
A more reasonable cut, based on information from
both the first and the second graph
22Experiment on Web Image Clustering
23Embedding of the Clustering
Hill vs Owl
Flying vs Map
24Average Performance
Performance Comparison
25Conclusions
- We propose a new problem named high-order
heterogeneous co-clustering (HHCC).
- We propose a consistent bipartite graph
co-partitioning algorithm to solve the HHCC
problem with star-structured inter-relationship.
- Various experiments demonstrate the effectiveness
of our proposed algorithm.
26References
- Bin Gao, Tie-Yan Liu, et al., Consistent Bipartite
Graph Co-Partitioning for Star-Structured
High-Order Heterogeneous Data Co-Clustering, in
Proceedings of the Eleventh ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining (KDD 2005), pp. 41-50.
- Bin Gao, Tie-Yan Liu, Tao Qin, Qian-Sheng Cheng,
Wei-Ying Ma, Web Image Clustering by Consistent
Utilization of Low-level Features and Surrounding
Texts, in Proceedings of ACM Multimedia 2005.
27Thanks!
Contact: tyliu_at_microsoft.com
http://research.microsoft.com/users/tyliu/