Title: Further Investigations on Heat Diffusion Models
1Further Investigations on Heat Diffusion Models
- Haixuan Yang
- Supervisors: Prof. Irwin King and Prof. Michael R. Lyu
- Term Presentation 2006
2Outline
- Introduction
- Input Improvement: Three candidate graphs
- Outside Improvement: DiffusionRank
- Inside Improvement: Volume-based heat diffusion model
- Summary
3Introduction
- PHDC: the model proposed last year
- HDM on Graphs
- Input Improvement
- Outside Improvement: DiffusionRank
- Inside Improvement: Volume-based HDM
4PHDC
- PHDC is a classifier motivated by
- Tenenbaum et al (Science 2000)
- Approximate the manifold by a KNN graph
- Reduce dimension by shortest paths
- Kondor & Lafferty (NIPS 2002)
- Construct a diffusion kernel on an undirected graph
- Apply to a large margin classifier
- Belkin & Niyogi (Neural Computation 2003)
- Approximate the manifold by a KNN graph
- Reduce dimension by heat kernels
- Lafferty & Kondor (JMLR 2005)
- Construct a diffusion kernel on a special manifold
- Apply to SVM
5PHDC
- Ideas we inherit
- Local information
- Relatively accurate in a nonlinear manifold.
- Heat diffusion on a manifold
- A generalization of the Gaussian density from Euclidean space to a manifold.
- Heat diffuses in the same way as the Gaussian density in the ideal case when the manifold is Euclidean space.
- Ideas we think differently
- Establish the heat diffusion equation on a weighted directed graph.
- The broader setting enables its application to ranking Web pages.
- Construct a classifier directly from the solution.
6Heat Diffusion Model in PHDC
- Notations
- Solution
- Classifier
- G is the KNN graph: connect a directed edge (j,i) if j is one of the K nearest neighbors of i.
- For each class k, f(i,0) is set to 1 if data point i is labeled as k and 0 otherwise.
- Assign data point j to label q if j receives the most heat from data in class q.
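The classification rule on this slide can be sketched in Python as follows. The solution formula itself did not survive on the slide, so this sketch assumes a diffusion of the form f(t) = e^{γH} f(0), approximated by (I + (γ/N)H)^N, with unweighted edges and H = A - D (A is the KNN adjacency, D the in-degree). The function name `heat_diffusion_classify` and the parameters `gamma` and `N` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def heat_diffusion_classify(X, labels, K=3, gamma=1.0, N=100):
    """Sketch of a heat diffusion classifier on a KNN graph.

    labels[i] is a class id for labeled points and -1 for unlabeled ones.
    Assumption: unweighted edges and H = A - D (not taken from the slides).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    # A[i, j] = 1 encodes the directed edge (j, i):
    # j is one of the K nearest neighbors of i.
    A = np.zeros((n, n))
    nearest = np.argsort(dist, axis=1)[:, :K]
    A[np.arange(n)[:, None], nearest] = 1.0
    # Assumed diffusion operator: node i exchanges heat with its in-neighbors.
    H = A - np.diag(A.sum(axis=1))
    kernel = np.linalg.matrix_power(np.eye(n) + (gamma / N) * H, N)
    classes = np.unique(labels[labels >= 0])
    # One heat source per class: f(i, 0) = 1 if i is labeled with that class.
    f0 = np.stack([(labels == c).astype(float) for c in classes], axis=1)
    heat = kernel @ f0
    # Each point takes the label of the class it receives the most heat from.
    return classes[np.argmax(heat, axis=1)]
```

Calling the function with one labeled point per cluster and -1 for every other point propagates each label through its own cluster.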
7Input Improvement
- Three candidate graphs
- KNN Graph
- Connect a directed edge from j to i if j is one of the K nearest neighbors of i, measured by the Euclidean distance.
- SKNN-Graph
- Choose the smallest Kn/2 undirected edges, which amounts to Kn directed edges.
- Minimum Spanning Tree
- Choose the subgraph such that
- It is a tree connecting all vertices
- The sum of weights is minimum among all such trees.
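The three candidate graphs can be built from a pairwise distance matrix, as in the minimal Python sketch below. The function names are hypothetical, and Prim's algorithm is used for the spanning tree only as one convenient choice; the slides do not say which algorithm was used.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix for the rows of X."""
    X = np.asarray(X, dtype=float)
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def knn_graph(D, K):
    """Directed edges (j, i): j is one of the K nearest neighbors of i."""
    edges = set()
    for i in range(len(D)):
        nearest = [j for j in np.argsort(D[i]) if j != i][:K]
        edges.update((int(j), i) for j in nearest)
    return edges

def sknn_graph(D, K):
    """The K*n/2 shortest undirected edges, stored as (i, j) with i < j."""
    n = len(D)
    pairs = sorted((D[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    return {(i, j) for _, i, j in pairs[: (K * n) // 2]}

def mst_graph(D):
    """Minimum spanning tree by Prim's algorithm, edges as (i, j) with i < j."""
    n = len(D)
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        _, i, j = min((D[i, j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.add((min(i, j), max(i, j)))
        in_tree.add(j)
    return edges
```

On four points on a line at 0, 1, 3 and 10 with K = 1, the SKNN graph keeps only the two shortest edges and leaves the far point isolated, while the spanning tree connects it.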
8Input Improvement
- Illustration
- Manifold
- KNN Graph
- SKNN-Graph
- Minimum Spanning Tree
9Input Improvement
- Advantages and disadvantages
- KNN Graph
- Democratic to each node
- Resulting classifier is a generalization of KNN
- May not be connected
- Long edges may exist while short edges are removed
- SKNN-Graph
- Not democratic
- May not be connected
- Short edges are more important than long edges
- Minimum Spanning Tree
- Not democratic
- Long edges may exist while short edges are removed
- Connection is guaranteed
- Fewer parameters
- Faster in training and testing
10Experiments
- Experimental Setup
- Experimental Environments
- Hardware: Nix Dual Intel Xeon 2.2GHz
- OS: Linux Kernel 2.4.18-27smp (RedHat 7.3)
- Developing tool: C
- Data Description
- 3 artificial data sets and 6 data sets from UCI
- Comparison
- Algorithms
- Parzen window, KNN, SVM, KNN-H, SKNN-H, MST-H
- Results: average of ten-fold cross-validation
11Experiments
12Conclusions
- KNN-H, SKNN-H and MST-H
- Candidates for the Heat Diffusion Classifier on a Graph.
13Application Improvement
- PageRank
- Tries to find the importance of a Web page based on the link structure.
- The importance of a page i is defined recursively in terms of the pages that point to it.
- Two problems
- Incomplete information about the Web structure.
- Web pages manipulated by people for commercial interests.
- About 70% of all pages in the .biz domain are spam.
- About 35% of the pages in the .us domain belong to the spam category.
14Why PageRank is susceptible to web spam?
- Two reasons
- Over-democratic
- All pages are born equal: each page has equal voting ability, since the sum of each column is equal to one.
- Input-independent
- For any given non-zero initial input, the iteration will converge to the same stable distribution.
- Heat Diffusion Model: a natural way to avoid these two weaknesses of PageRank
- Points are not equal, as some points are born with high temperatures while others are born with low temperatures.
- Different initial temperature distributions will give rise to different temperature distributions after a fixed time period.
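The contrast between the two behaviours can be illustrated with a small Python sketch. It assumes the usual damped power iteration for PageRank (the damping factor `alpha` is not mentioned on the slides) and a heat diffusion step of the form (I + (γ/N)(P - I))^N, where P is the column-normalized link matrix; both function names are hypothetical.

```python
import numpy as np

def column_stochastic(A):
    """Normalize so that each column sums to one (equal voting ability)."""
    A = np.asarray(A, dtype=float)
    return A / A.sum(axis=0)

def pagerank(A, f0, alpha=0.85, iters=200):
    """Damped power iteration; alpha is an assumed damping factor."""
    P = column_stochastic(A)
    n = len(P)
    f = np.asarray(f0, dtype=float)
    f = f / f.sum()
    for _ in range(iters):
        f = alpha * (P @ f) + (1 - alpha) / n
    return f

def heat_diffusion(A, f0, gamma=1.0, N=100):
    """Heat after a fixed time, assuming the operator H = P - I."""
    P = column_stochastic(A)
    n = len(P)
    step = np.eye(n) + (gamma / N) * (P - np.eye(n))
    return np.linalg.matrix_power(step, N) @ np.asarray(f0, dtype=float)

# A[i][j] = 1 if page j links to page i (toy graph, every page has an out-link).
A = [[0, 0, 1],
     [1, 0, 0],
     [1, 1, 0]]
same = np.allclose(pagerank(A, [1, 0, 0]), pagerank(A, [0, 0, 1]))
differ = not np.allclose(heat_diffusion(A, [1, 0, 0]), heat_diffusion(A, [0, 0, 1]))
```

Here `same` records that PageRank forgets its initial input, while `differ` records that the diffused heat after a fixed time still depends on where the heat started.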
15DiffusionRank
- On an undirected graph
- Assumption: the amount of heat flow from j to i is proportional to the heat difference between i and j.
- Solution
- On a directed graph
- Assumption: there is extra energy imposed on the link (j,i) such that heat flows only from j to i if there is no link (i,j).
- Solution
- On a random directed graph
- Assumption: the heat flow is proportional to the probability of the link (j,i).
- Solution
16DiffusionRank
- On a random directed graph
- Solution
- The initial value f(i,0) in f(0) is set to 1 if i is trusted and 0 otherwise, according to the inverse PageRank.
17Computation consideration
- Approximation of heat kernel
- N?
- When N > 30, the real eigenvalues of [formula] are less than 0.01
- When N > 100, they are less than 0.005.
- We use N = 100 in the paper.
When N tends to infinity
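The effect of N on the approximation can be checked numerically. The Python sketch below assumes the approximation has the form (I + (γ/N)H)^N for a heat kernel e^{γH}, and compares it with a truncated Taylor series used as the reference. The matrix H (a 3-node undirected path) and the function names are illustrative assumptions, not the matrix studied in the paper.

```python
import numpy as np

def kernel_power(H, gamma, N):
    """Approximate e^{gamma H} by (I + gamma/N * H)^N."""
    n = len(H)
    return np.linalg.matrix_power(np.eye(n) + (gamma / N) * H, N)

def kernel_series(H, gamma, terms=40):
    """Reference value of e^{gamma H} from a truncated Taylor series."""
    n = len(H)
    total, term = np.eye(n), np.eye(n)
    for k in range(1, terms):
        term = term @ (gamma * H) / k
        total = total + term
    return total

# Toy operator: adjacency minus degree for the path graph 0 - 1 - 2.
H = np.array([[-1.0, 1.0, 0.0],
              [1.0, -2.0, 1.0],
              [0.0, 1.0, -1.0]])
exact = kernel_series(H, 1.0)
err30 = np.abs(kernel_power(H, 1.0, 30) - exact).max()
err100 = np.abs(kernel_power(H, 1.0, 100) - exact).max()
```

On this toy operator the largest entry-wise error shrinks as N grows from 30 to 100, which is the trade-off between accuracy and the number of matrix multiplications that the choice of N controls.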
18Discuss γ
- γ can be understood as the thermal conductivity.
- When γ = 0, the ranking value is most robust to manipulation since no heat is diffused, but the Web structure is completely ignored.
- When γ → ∞, DiffusionRank becomes PageRank, and it can be manipulated easily.
- When γ = 1, DiffusionRank works well in practice.
19DiffusionRank
- Advantages
- Can detect Group-group relations
- Can cut Graphs
- Anti-manipulation
[Figure labels: γ = 0.5 or 1; 1; -1]
20DiffusionRank
- Experiments
- Data
- a toy graph (6 nodes)
- a middle-size real-world graph (18542 nodes)
- a large-size real-world graph crawled from CUHK (607170 nodes)
- Compare with TrustRank and PageRank
21Results
- The tendency of DiffusionRank when γ becomes larger
- On the toy graph
22Anti-manipulation On the toy graph
23Anti-manipulation on the middle graph and the large graph
24Stability: the order difference between ranking results for an algorithm before it is manipulated and those after that
25Conclusions
- This anti-manipulation feature enables DiffusionRank to be a candidate as a penicillin for Web spamming.
- DiffusionRank is a generalization of PageRank (when γ → ∞).
- DiffusionRank can be employed to detect group-group relations.
- DiffusionRank can be used to cut graphs.
26Inside Improvement
- Motivations
- Finite Difference Method is a possible way to solve the heat diffusion equation.
- the discretization of time
- the discretization of space and time
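For reference, discretizing both space and time on a regular grid looks like the minimal Python sketch below. It is the standard explicit finite difference scheme for the one-dimensional heat equation u_t = alpha * u_xx with fixed end values; it is shown only to illustrate the method and is not taken from the slides.

```python
import numpy as np

def heat_fd_1d(u0, alpha, dx, dt, steps):
    """Explicit finite differences for u_t = alpha * u_xx on a regular grid.

    Update rule: u_i <- u_i + r * (u_{i+1} - 2 u_i + u_{i-1}), r = alpha*dt/dx^2.
    The two end points are held fixed (Dirichlet boundary conditions).
    """
    r = alpha * dt / dx**2
    if r > 0.5:
        raise ValueError("unstable step: need alpha * dt / dx**2 <= 0.5")
    u = np.asarray(u0, dtype=float).copy()
    for _ in range(steps):
        # The right-hand side is evaluated on the old values before assignment.
        u[1:-1] = u[1:-1] + r * (u[2:] - 2 * u[1:-1] + u[:-2])
    return u

# Example: a sine profile on [0, 1] decays like exp(-pi^2 * alpha * t).
x = np.linspace(0.0, 1.0, 21)
u = heat_fd_1d(np.sin(np.pi * x), alpha=1.0, dx=0.05, dt=0.001, steps=100)
```

The scheme relies on every interior grid point having evenly spaced left and right neighbors, which is the regularity that a graph built from real data does not provide.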
27Motivation
- Problems where we cannot employ FD directly in real data analysis
- The graph constructed is irregular
- The density of data varies; this also results in an irregular graph
- The manifold is unknown
- The differential equation expression is unknown even if the manifold is known.
28Intuition
29Volume-based Heat Diffusion Model
- Assumption
- There is a small patch SPj of space containing node j.
- The volume of the small patch SPj is V(j), and the heat diffusion ability of the small patch is proportional to its volume.
- The temperature in the small patch SPj at time t is almost equal to f(j,t) because every unseen node in the small patch is near node j.
- Solution
30Volume Computation
- Define V(i) to be the volume of the hypercube whose side length is the average distance between node i and its neighbors.
a maximum likelihood estimation
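The volume definition can be sketched in Python as follows. The sketch assumes that "neighbors" means the K nearest neighbors under the Euclidean distance and that the hypercube has dimension `v`, passed in as a parameter; the function name `node_volumes` and both parameters are illustrative assumptions.

```python
import numpy as np

def node_volumes(X, K, v):
    """V(i) = (average distance from node i to its K nearest neighbors) ** v.

    Assumptions: neighbors are the K nearest points, and v is the
    dimension of the hypercube (supplied by the caller).
    """
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a node is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :K]    # K smallest distances per node
    side = nearest.mean(axis=1)               # side length of the hypercube
    return side ** v
```

Nodes in sparse regions get larger volumes than nodes in dense regions, so a node's volume reflects how much unseen space it stands for.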
31Experiments
- K: KNN
- P: Parzen window
- U: UniverSvm
- L: LightSVM
- C: consistency method
- VHD-v: by the best v
- VHD: v is found by the estimation
- HD: without volume consideration
- C1: 1st variation of C
- C2: 2nd variation of C
32Conclusions
- The proposed VHDM has the following advantages
- It can model the effect of unseen points by introducing the volume of a node
- It avoids the difficulty of finding the explicit expression for the unknown geometry by approximating the manifold by a finite neighborhood graph
- It has a closed form solution that describes the heat diffusion on a manifold
- VHDC is a generalization of both the Parzen Window Approach (when the window function is a multivariate normal kernel) and KNN.
33Summary
- The input improvement of PHDC provides us with more choices for the input graph.
- The outside improvement provides us with a possible penicillin for Web spamming, and a potentially useful tool for group-group discovery and graph cut.
- The inside improvement shows us a promising classifier.