Further Investigations on Heat Diffusion Models

1
Further Investigations on Heat Diffusion Models
  • Haixuan Yang
  • Supervisors: Prof. Irwin King and Prof. Michael R.
    Lyu
  • Term Presentation 2006

2
Outline
  • Introduction
  • Input Improvement: Three candidate graphs
  • Outside Improvement: DiffusionRank
  • Inside Improvement: Volume-based heat diffusion
    model
  • Summary

3
Introduction
  • [Diagram] PHDC (the model proposed last year) and its three improvements:
    Input Improvement (HDM on Graphs), Outside Improvement (DiffusionRank),
    Inside Improvement (Volume-based HDM)

4
PHDC
  • PHDC is a classifier motivated by
    • Tenenbaum et al. (Science 2000)
      • Approximate the manifold by a KNN graph
      • Reduce dimension by shortest paths
    • Kondor & Lafferty (NIPS 2002)
      • Construct a diffusion kernel on an undirected
        graph
      • Apply to a large margin classifier
    • Belkin & Niyogi (Neural Computation 2003)
      • Approximate the manifold by a KNN graph
      • Reduce dimension by heat kernels
    • Lafferty & Kondor (JMLR 2005)
      • Construct a diffusion kernel on a special
        manifold
      • Apply to SVM

5
PHDC
  • Ideas we inherit
    • Local information
      • relatively accurate in a nonlinear manifold.
    • Heat diffusion on a manifold
      • a generalization of the Gaussian density from
        Euclidean space to a manifold.
      • heat diffuses in the same way as the Gaussian density
        in the ideal case when the manifold is
        Euclidean space.
  • Ideas where we think differently
    • Establish the heat diffusion equation on a
      weighted directed graph.
      • The broader setting enables its application to
        ranking Web pages.
    • Construct a classifier directly from the solution.

6
Heat Diffusion Model in PHDC
  • Notations
  • Solution
  • Classifier
  • G is the KNN graph: connect a directed edge (j, i)
    if j is one of the K nearest neighbors of i.
  • For each class k, f(i,0) is set to 1 if data point i is
    labeled as k and 0 otherwise.
  • Assign data point j to label q if j receives the most
    heat from the data in class q.
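
The classification rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes unweighted edges, assumes that node i gains heat along each edge (j, i) in proportion to f(j) - f(i), and integrates with simple explicit time steps. The names `diffuse`, `classify`, `gamma`, and `steps` are hypothetical.

```python
def diffuse(n, edges, f0, gamma=1.0, steps=100):
    """Diffuse heat for one time unit using `steps` small explicit updates.

    edges holds directed pairs (j, i): heat flows from j to i.
    Assumption: the flow along (j, i) is proportional to f[j] - f[i].
    """
    f = list(f0)
    for _ in range(steps):
        gain = [0.0] * n
        for j, i in edges:
            gain[i] += f[j] - f[i]
        f = [fi + (gamma / steps) * g for fi, g in zip(f, gain)]
    return f


def classify(n, edges, labels, gamma=1.0):
    """labels[i] is a class name, or None if node i is unlabeled."""
    classes = sorted({c for c in labels if c is not None})
    heat = {}
    for c in classes:
        # f(i,0) = 1 if node i is labeled c, and 0 otherwise.
        f0 = [1.0 if lab == c else 0.0 for lab in labels]
        heat[c] = diffuse(n, edges, f0, gamma)
    # Each node takes the class from which it received the most heat.
    return [max(classes, key=lambda c: heat[c][i]) for i in range(n)]


# Two separate pairs of mutually linked nodes, one labeled node per pair.
edges = [(0, 1), (1, 0), (2, 3), (3, 2)]
labels = ["a", None, "b", None]
predicted = classify(4, edges, labels)
```

In the slides the edges would come from the KNN graph; here they are written by hand to keep the sketch short.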

7
Input Improvement
  • Three candidate graphs
    • KNN Graph
      • Connect a directed edge from j to i if j is one of
        the K nearest neighbors of i, measured by
        Euclidean distance.
    • SKNN-Graph
      • Choose the smallest Kn/2 undirected edges, which
        amounts to Kn directed edges.
    • Minimum Spanning Tree
      • Choose the subgraph such that
        • It is a tree connecting all vertices
        • The sum of weights is minimum among all such trees
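
The three constructions can be sketched as follows. This is an illustrative reading of the definitions above, assuming Euclidean edge weights and Kruskal's algorithm for the spanning tree (the slides do not say which MST algorithm was used). The function names are hypothetical.

```python
import math
from itertools import combinations


def knn_graph(points, k):
    """Directed edges (j, i): j is one of the k nearest neighbors of i."""
    edges = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        edges.extend((j, i) for _, j in ranked[:k])
    return edges


def sorted_pairs(points):
    """All undirected pairs, shortest first."""
    return sorted(combinations(range(len(points)), 2),
                  key=lambda e: math.dist(points[e[0]], points[e[1]]))


def sknn_graph(points, k):
    """The k*n/2 shortest undirected edges (k*n directed edges)."""
    return sorted_pairs(points)[: k * len(points) // 2]


def minimum_spanning_tree(points):
    """Kruskal's algorithm with a small union-find structure."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for a, b in sorted_pairs(points):
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two components, so keep it
            parent[ra] = rb
            tree.append((a, b))
    return tree


# Three nearby points and one far-away point.
pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
knn = knn_graph(pts, 1)
sknn = sknn_graph(pts, 1)
mst = minimum_spanning_tree(pts)
```

On this toy input the KNN graph keeps the long edge into the far-away point, the SKNN graph leaves that point isolated, and the spanning tree connects everything.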

8
Input Improvement
  • Illustration
  • Manifold
  • KNN Graph
  • SKNN-Graph
  • Minimum Spanning Tree

9
Input Improvement
  • Advantages and disadvantages
    • KNN Graph
      • Democratic to each node
      • Resulting classifier is a generalization of KNN
      • May not be connected
      • Long edges may exist while short edges are removed
    • SKNN-Graph
      • Not democratic
      • May not be connected
      • Short edges are more important than long edges
    • Minimum Spanning Tree
      • Not democratic
      • Long edges may exist while short edges are removed
      • Connection is guaranteed
      • Fewer parameters
      • Faster in training and testing

10
Experiments
  • Experimental Setup
    • Experimental Environments
      • Hardware: Nix Dual Intel Xeon 2.2GHz
      • OS: Linux Kernel 2.4.18-27smp (RedHat 7.3)
      • Developing tool: C
    • Data Description
      • 3 artificial data sets and 6 data sets from UCI
    • Comparison
      • Algorithms
        • Parzen window, KNN, SVM, KNN-H, SKNN-H, MST-H
      • Results: average of the ten-fold cross-validation

11
Experiments
  • Results

12
Conclusions
  • KNN-H, SKNN-H and MST-H
  • Candidates for the Heat Diffusion Classifier on a
    Graph.

13
Application Improvement
  • PageRank
    • Tries to find the importance of a Web page based
      on the link structure.
    • The importance of a page i is defined recursively
      in terms of the pages that point to it.
  • Two problems
    • The incomplete information about the Web
      structure.
    • The web pages manipulated by people for
      commercial interests.
      • About 70% of all pages in the .biz domain are
        spam.
      • About 35% of the pages in the .us domain belong
        to the spam category.

14
Why is PageRank susceptible to web spam?
  • Two reasons
    • Over-democratic
      • All pages are born equal: every page has equal voting
        ability, since the sum of each column is equal to one.
    • Input-independent
      • For any given non-zero initial input, the
        iteration will converge to the same stable
        distribution.
  • Heat Diffusion Model: a natural way to avoid
    these two weaknesses of PageRank
    • Points are not equal, as some points are born with
      high temperatures while others are born with low
      temperatures.
    • Different initial temperature distributions will
      give rise to different temperature distributions
      after a fixed time period.
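
The contrast between the two behaviors can be seen on a toy example. The sketch below is an illustration under stated assumptions, not the presentation's code: `A` is a hand-made column-stochastic matrix standing in for the PageRank matrix (no damping factor), and the heat update f ← f + (gamma/steps)(A f - f) is an assumed discretization of diffusion over one time unit.

```python
def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def power_iteration(A, x, iters=200):
    """PageRank-style iteration: repeatedly apply A to the input."""
    for _ in range(iters):
        x = matvec(A, x)
    return x


def heat_diffusion(A, f, gamma=1.0, steps=100):
    """Diffuse for a fixed time; the result keeps a memory of f(0)."""
    for _ in range(steps):
        Af = matvec(A, f)
        f = [fi + (gamma / steps) * (ai - fi) for fi, ai in zip(f, Af)]
    return f


# Hypothetical 3-page Web: every column sums to one.
A = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
start_a = [1.0, 0.0, 0.0]
start_b = [0.0, 0.0, 1.0]

rank_a = power_iteration(A, start_a)
rank_b = power_iteration(A, start_b)
heat_a = heat_diffusion(A, start_a)
heat_b = heat_diffusion(A, start_b)
```

Both power iterations end at the same distribution, while the two heat distributions remain visibly different after the fixed diffusion time.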

15
DiffusionRank
  • On an undirected graph
    • Assumption: the amount of heat flow from j to
      i is proportional to the heat difference between
      i and j.
    • Solution
  • On a directed graph
    • Assumption: there is extra energy imposed on the
      link (j, i) such that heat flows only from j
      to i if there is no link (i, j).
    • Solution
  • On a random directed graph
    • Assumption: the heat flow is proportional to the
      probability of the link (j, i).
    • Solution
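
The solution formulas are not preserved in this transcript. For the undirected case, the stated assumption leads to a closed form along the following lines. This is a sketch under assumptions: γ denotes the proportionality constant, E the edge set, d_i the degree of node i, and H is a matrix introduced here for illustration.

```latex
\begin{aligned}
f(i, t + \Delta t) - f(i, t)
  &= \gamma \sum_{j : (j,i) \in E} \bigl( f(j, t) - f(i, t) \bigr)\, \Delta t \\
\frac{d f(t)}{dt} &= \gamma H f(t),
  \qquad
  H_{ij} =
  \begin{cases}
    -d_i & \text{if } i = j \\
    1    & \text{if } (j, i) \in E \\
    0    & \text{otherwise}
  \end{cases} \\
f(t) &= e^{\gamma t H} f(0)
\end{aligned}
```

The first line is the heat balance for node i over a short interval, the second is its limit as Δt tends to zero written in matrix form, and the third is the solution of that linear system.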

16
DiffusionRank
  • On a random directed graph
  • Solution
  • The initial value f(i,0) in f(0) is set to 1
    if page i is trusted and 0 otherwise, with trust
    determined according to the inverse PageRank.

17
Computation consideration
  • Approximation of heat kernel
  • Choice of N
  • When N > 30, the real eigenvalues of
    [formula not preserved] are less than 0.01
  • When N > 100, they are less than 0.005.
  • We use N = 100 in the paper.

When N tends to infinity
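
The effect of N can be illustrated numerically. The slide's formula is missing from this transcript, so the sketch below assumes the common discrete approximation (I + (γ/N)H)^N of the heat kernel e^{γH}, which becomes exact as N tends to infinity. The matrix `H`, the conductivity `gamma`, and the function names are hypothetical, and the reference value is computed from a truncated Taylor series.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def exact_heat(H, f, gamma, terms=60):
    """Reference value of e^{gamma*H} f from a truncated Taylor series."""
    result, term = list(f), list(f)
    for k in range(1, terms):
        # term_k = (gamma*H)^k f / k!
        term = [gamma * v / k for v in matvec(H, term)]
        result = [r + t for r, t in zip(result, term)]
    return result


def approx_heat(H, f, gamma, N):
    """Apply (I + (gamma/N) H) to f, N times."""
    for _ in range(N):
        Hf = matvec(H, f)
        f = [fi + (gamma / N) * h for fi, h in zip(f, Hf)]
    return f


def max_error(H, f, gamma, N):
    exact = exact_heat(H, f, gamma)
    approx = approx_heat(H, f, gamma, N)
    return max(abs(e - a) for e, a in zip(exact, approx))


# Toy generator: a column-stochastic matrix minus the identity.
H = [[-1.0, 0.5, 0.5],
     [0.5, -1.0, 0.5],
     [0.5, 0.5, -1.0]]
f0 = [1.0, 0.0, 0.0]
errors = {N: max_error(H, f0, 1.0, N) for N in (10, 30, 100)}
```

On this toy matrix the error shrinks steadily as N grows, which is the trade-off behind fixing a moderate N instead of computing the matrix exponential exactly.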
18
Discuss γ
  • γ can be understood as the thermal conductivity.
  • When γ = 0, the ranking value is most robust to
    manipulation since no heat is diffused, but the
    Web structure is completely ignored.
  • When γ → ∞, DiffusionRank becomes PageRank, which can
    be manipulated easily.
  • When γ = 1, DiffusionRank works well in practice.

19
DiffusionRank
  • Advantages
  • Can detect Group-group relations
  • Can cut Graphs
  • Anti-manipulation

[Figure: example graph with γ = 0.5 or 1; labels 1 and -1]

20
DiffusionRank
  • Experiments
  • Data
  • a toy graph (6 nodes)
  • a middle-size real-world graph (18542 nodes)
  • a large-size real-world graph crawled from CUHK
    (607170 nodes)
  • Compare with TrustRank and PageRank

21
Results
  • The tendency of DiffusionRank when γ becomes
    larger
  • On the toy graph

22
Anti-manipulation on the toy graph
23
Anti-manipulation on the middle graph and the
large graph
24
Stability: the order difference between the ranking
results of an algorithm before it is manipulated
and those after it is manipulated
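
The transcript does not say how the order difference is measured. One plausible reading, shown here purely as an assumption, counts the pairs of pages whose relative order flips between the two rankings. The function name `order_difference` is hypothetical.

```python
from itertools import combinations


def order_difference(before, after):
    """Count page pairs ranked in opposite order by the two score lists.

    before[i] and after[i] are the scores of page i before and after
    manipulation. Tied pairs are not counted as flips.
    """
    return sum(
        1
        for i, j in combinations(range(len(before)), 2)
        if (before[i] - before[j]) * (after[i] - after[j]) < 0
    )


unchanged = order_difference([3, 2, 1], [3, 2, 1])
one_swap = order_difference([3, 2, 1], [3, 1, 2])
reversed_order = order_difference([3, 2, 1], [1, 2, 3])
```

A smaller count means the ranking moved less under manipulation, which is the sense of stability described above.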
25
Conclusions
  • This anti-manipulation feature enables
    DiffusionRank to be a candidate as a penicillin
    for Web spamming.
  • DiffusionRank is a generalization of PageRank
    (when γ → ∞).
  • DiffusionRank can be employed to detect
    group-group relations.
  • DiffusionRank can be used to cut graphs.

26
Inside Improvement
  • Motivations
  • Finite Difference Method is a possible way to
    solve the heat diffusion equation.
  • the discretization of time
  • the discretization of space and time

27
Motivation
  • Problems that prevent us from employing FD directly in
    real data analysis
    • The graph constructed is irregular
    • The density of data varies; this also results in
      an irregular graph
    • The manifold is unknown
    • The differential equation expression is unknown
      even if the manifold is known.

28
Intuition
29
Volume-based Heat Diffusion Model
  • Assumption
  • There is a small patch SPj of space containing
    node j
  • The volume of the small patch SPj is V(j), and
    the heat diffusion ability of the small patch is
    proportional to its volume.
  • The temperature in the small patch SPj at time
    t is almost equal to f(j,t) because every unseen
    node in the small patch is near node j.
  • Solution

30
Volume Computation
  • Define V(i) to be the volume of the hypercube
    whose side length is the average distance between
    node i and its neighbors.
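
The definition above translates directly into code. In this sketch the hypercube dimension `dim` is passed in as a parameter, which is an assumption: the slide does not state how the dimension is chosen. The names `volume` and `neighbors` are hypothetical.

```python
import math


def volume(points, neighbors, i, dim):
    """Volume of a hypercube whose side is the mean distance from
    node i to its neighbors. `dim` is the assumed cube dimension."""
    side = sum(math.dist(points[i], points[j])
               for j in neighbors[i]) / len(neighbors[i])
    return side ** dim


points = [(0, 0), (2, 0), (0, 4)]
neighbors = {0: [1, 2]}
v0 = volume(points, neighbors, 0, dim=2)  # side = (2 + 4) / 2 = 3
```

Nodes in sparse regions have distant neighbors and therefore receive larger volumes than nodes in dense regions.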

a maximum likelihood estimation
31
Experiments
  • K: KNN
  • P: Parzen window
  • U: UniverSVM
  • L: LightSVM
  • C: consistency method
  • VHD-v: by the best v
  • VHD: v is found by the estimation
  • HD: without volume consideration
  • C1: 1st variation of C
  • C2: 2nd variation of C

32
Conclusions
  • The proposed VHDM has the following advantages
  • It can model the effect of unseen points by
    introducing the volume of a node
  • It avoids the difficulty of finding the explicit
    expression for the unknown geometry by
    approximating the manifold by a finite
    neighborhood graph
  • It has a closed form solution that describes the
    heat diffusion on a manifold
  • VHDC is a generalization of both the Parzen
    Window Approach (when the window function is a
    multivariate normal kernel) and KNN.

33
Summary
  • The input improvement of PHDC provides us with more
    choices for the input graphs.
  • The outside improvement provides us a possible
    penicillin for Web spamming, and a potentially
    useful tool for group-group discovery and graph
    cut.
  • The inside improvement shows us a promising
    classifier.