Using Graphs in Unstructured and Semistructured Data Mining

1
Using Graphs in Unstructured and Semistructured
Data Mining

Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/~soumen

2
Acknowledgments

C. Faloutsos, CMU
W. Cohen, CMU
IBM Almaden (many colleagues)
IIT Bombay (many students)
S. Sarawagi, IIT Bombay
S. Sudarshan, IIT Bombay

3
Graphs are everywhere

Phone network, Internet, Web
Databases, XML, email, blogs
Web of trust (epinion)
Text and language artifacts (WordNet)
Commodity distribution networks

[Figures: protein interactions (genomebiology.com); Internet map (lumeta.com); food web [Martinez1991]]
4
Why analyze graphs?

What properties do real-life graphs have?
How important is a node? What is importance?
Who is the best customer to target in a social
network?
Who spread a raging rumor?
How similar are two nodes?
How do nodes influence each other?
Can I predict some property of a node based on
its neighborhood?

5
Outline, some more detail

Part 1 (Modeling graphs)
What do real-life graphs look like?
What laws govern their formation, evolution and
properties?
What structural analyses are useful?
Part 2 (Analyzing graphs)
Modeling data analysis problems using graphs
Proposing parametric models
Estimating parameters
Applications from Web search and text mining

6
Modeling and generating realistic graphs
7
Questions

What do real graphs look like?
Edges, communities, clustering effects
What properties of nodes, edges are important to
model?
Degree, paths, cycles,
What local and global properties are important to
measure?
How to artificially generate realistic graphs?

8
Modeling: why care?

Algorithm design
Can skewed degree distribution make our algorithm
faster?
Extrapolation
How well will Pagerank work on the Web 10 years
from now?
Sampling
Make sure scaled-down algorithm shows same
performance/behavior on large-scale data
Deviation detection
Is this page trying to spam the search engine?

9
Laws: degree distributions

Q: avg degree is 10 - what is the most probable
degree?

[Plot: count vs. degree; the curve is left as "??" and 10 is marked on the axis]
10
Laws: degree distributions

Q: avg degree is 10 - what is the most probable
degree?

[Plot: count vs. degree]
11
Power-law: outdegree O

[Plot: frequency vs. outdegree, log-log scale, Nov 97; exponent = slope, O = -2.15]

The plot is linear in log-log scale [FFF99]
freq ∝ degree^(-2.15)
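
A minimal sketch of how such an exponent can be read off a log-log plot: fit a straight line to log(frequency) against log(outdegree) by least squares and take its slope. The function name `loglog_slope` and the synthetic data are illustrative assumptions, not the measurement procedure behind the plot.

```python
import math

def loglog_slope(degrees, counts):
    """Least-squares slope of log(count) against log(degree)."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic counts that follow freq = 1000 * degree^(-2.15) exactly.
degrees = [1, 2, 4, 8, 16, 32]
counts = [1000 * d ** -2.15 for d in degrees]
slope = loglog_slope(degrees, counts)
```

On real degree counts the tail is noisy, so a slope obtained this way is only an estimate.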

12
Power-law: rank R

[Plot: outdegree vs. rank, log-log scale, Dec 98; exponent = slope, R = -0.74]

Rank nodes in decreasing outdegree order
The plot is a line in log-log scale

13
Eigenvalues

Let A be the adjacency matrix of the graph
The eigenvalue λ satisfies
A v = λ v, where v is some vector
Eigenvalues are strongly related to graph
topology

14
Power-law: eigenvalues (exponent E)

Eigenvalues in decreasing order

[Plot: eigenvalue vs. rank of decreasing eigenvalue, log-log scale, Dec 98; exponent = slope, E = -0.48]
15
The Node Neighborhood

N(h) = # of pairs of nodes within h hops
Let average degree = 3
How many neighbors should I expect within 1, 2, ..., h
hops?
Potential answer:
1 hop -> 3 neighbors
2 hops -> 3 * 3
h hops -> 3^h

16
The Node Neighborhood

WRONG!
WE HAVE DUPLICATES!
17
The Node Neighborhood

WRONG x 2!
avg degree meaningless!
18
Power-law: hop-plot H

[Plots: # of pairs vs. hops, log-log scale, for Dec 98 and for router level 95; fitted exponents H = 2.83 and H = 4.86]

Pairs of nodes as a function of hops: N(h) ∝ h^H

19
Observation

Q: Intuition behind hop exponent?
A: intrinsic = fractal dimensionality of the
network

[Figures: N(h) ∝ h^1 and N(h) ∝ h^2]
20
Any other laws?

The Web looks like a bow-tie [Kumar1999]
IN, SCC, OUT, tendrils
Disconnected components

21
Generators

How to generate graphs from a realistic
distribution?
Difficulty: simultaneously preserving many local
and global properties seen in realistic graphs
Erdős-Rényi: switch on each edge independently
with some probability
Problem: degree distribution not power-law
Degree-based
Process-based (preferential attachment)

22
Degree-based generator

Fix the degree distribution (e.g., Zipf)
Assign degrees to nodes
Add matching edges to satisfy degrees
No direct control over other properties
ACL model [AielloCL2000]

23
Process-based: preferential attachment

Start with a clique with m nodes
Add one node v at every time step
v makes m links to old nodes
Suppose old node u has degree d(u)
Let $p_u = d(u) / \sum_w d(w)$
v invokes a multinomial distribution defined by
the set of p's
And links to whichever u's show up
At time t, there are m+t nodes, mt links
What is the degree distribution?
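
The process above can be sketched as follows, assuming m >= 2 (so the starting clique has at least one edge) and allowing repeated targets, which means multi-edges can occur. The function name and the endpoint-list sampling trick are illustrative choices, not part of the slides.

```python
import random

def preferential_attachment(m, steps, seed=0):
    """Start from an m-clique, then add `steps` nodes with m links each."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks node u with probability d(u) / sum_w d(w).
    endpoints = [x for e in edges for x in e]
    for v in range(m, m + steps):
        # Draw all m targets before adding v, so v only links to old nodes.
        targets = [rng.choice(endpoints) for _ in range(m)]
        for u in targets:
            edges.append((v, u))
            endpoints.extend((v, u))
    return edges

pa_edges = preferential_attachment(3, 200)
```

Counting how often each node occurs in `pa_edges` gives the degree sequence, which can then be plotted on log-log axes.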

24
Preferential attachment analysis

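$k_i(t)$ = degree of node i at time t
Discrete random variable
Approximate as continuous random variable
Let $\kappa_i(t) = E(k_i(t))$, expectation over random
linking choices
At time t, the infinitesimal expected growth rate
of $\kappa_i(t)$ is, by linearity of expectation,

$\frac{d\kappa_i(t)}{dt} = m \cdot \frac{\kappa_i(t)}{2mt} = \frac{\kappa_i(t)}{2t}$
(m = degrees to add; 2mt = total degree at t)

With $\kappa_i(t_i) = m$, where $t_i$ is the time at which node i was born:
$\kappa_i(t) = m \sqrt{t / t_i}$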
25
Preferential attachment, continued

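Expected degree of each node grows as square-root
of age
Let the current time be t
A node must be old enough for its degree to be
large: for $\kappa_i(t) > k$, we need $t_i < m^2 t / k^2$ ($t_i$ = birth time of node i)
Therefore, the fraction of nodes with degree
larger than k is $\frac{m^2 t / k^2}{m + t} \approx \frac{m^2}{k^2}$
$\Pr(\text{degree} = k) \approx \text{const}/k^3$ (data closer to 2)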

26
Bipartite cores

Basic preferential attachment does not
Explain dense/complete bipartite cores (100,000s
in a O(20 million)-page crawl)
Account for influence of search engines
The story isn't over yet

[Figure: a 2:3 core (n:m core); plot of number of cores vs. log m, with curves from n=2 to n=7]
27
Other process-based generators

Copying model [KumarRRTU2000]
New node v picks old reference node r u.a.r.
v adds k new links to old nodes; for the ith link:
W.p. a, add a link to an old node picked u.a.r.
W.p. 1-a, copy the ith link from r
More difficult to analyze
Reference node → compression techniques!
H.O.T.: connect to closest, high-connectivity
neighbor [Fabrikant2002]
Winner does not take all [Pennock2002]

28
Reference-based graph compression

Well-motivated: pack graph into limited fast
memory for, e.g., query-time Web analysis
Standard approach
Assign integer IDs to URLs, lexicographic order
Delta or Gamma encoding of outlink IDs
If link-copying is rampant, should be able to
compress Outlinks(u) by recording
A reference node r
The difference between Outlinks(r) and Outlinks(u): the correction
Finding r: what's optimal? practical? [Adler,
Mitzenmacher, Boldi, Vigna 2002-2004]

29
Reference-based compression, cont'd

r is a candidate reference for u if
Outlinks(r) ∩ Outlinks(u) is large enough
Given G, construct G' in which
Directed edge from r to u with edge cost = number
of bits needed to write down Outlinks(u) given
Outlinks(r)
Dummy node z; z has no outlinks in G
z connected to each u in G'
cost(z,u) = # bits to write Outlinks(u) w/o ref
Shortest path tree rooted at z
In practice, pick recent r: 2.58 bits/link
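
A rough sketch of the two edge costs, under stated assumptions: outlink IDs are positive integers, a list without a reference is written as Elias-gamma-coded gaps, and a list with a reference is written as one copy flag per outlink of r plus the gap-coded links that r lacks. This cost model and all function names are illustrative; the cited schemes use more refined encodings.

```python
def gamma_bits(n):
    """Length in bits of the Elias gamma code for a positive integer n."""
    return 2 * n.bit_length() - 1

def gap_cost(outlinks):
    """Bits to gamma-code distinct positive IDs as gaps (first gap is from 0)."""
    bits, prev = 0, 0
    for x in sorted(outlinks):
        bits += gamma_bits(x - prev)
        prev = x
    return bits

def reference_cost(out_u, out_r):
    """Bits for Outlinks(u) given Outlinks(r): one copy flag per outlink
    of r, plus the gap-coded links of u that r does not have."""
    return len(out_r) + gap_cost(set(out_u) - set(out_r))

out_r = [100, 101, 205, 300]
out_u = [100, 101, 205, 301]
cost_without_ref = gap_cost(out_u)            # plays the role of cost(z, u)
cost_with_ref = reference_cost(out_u, out_r)  # plays the role of the r -> u edge cost
```

With heavy link copying the referenced cost is the smaller one, which is what makes choosing a good r worthwhile.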

30
Summary: power laws are everywhere

Bible: rank vs. word frequency
Length of file transfers [Bestavros]
Web hit counts [Huberman]
Click-stream data [Montgomery01]
Lotka's law of publication count (CiteSeer data)

31
Resources

Generators
R-MAT: deepay_at_cs.cmu.edu
BRITE: www.cs.bu.edu/brite/
INET: topology.eecs.umich.edu/inet
Visualization tools
Graphviz: www.graphviz.org
Pajek: vlado.fmf.uni-lj.si/pub/networks/pajek
Kevin Bacon web site: www.cs.virginia.edu/oracle
Erdös numbers etc.

32
R-MAT: Recursive MATrix generator

Goals
Power-law in- and out-degrees
Power-law eigenvalues
Small diameter (six degrees of separation)
Simple, few parameters
Approach
Subdivide the adjacency matrix
Choose a quadrant with probability (a,b,c,d)

33
R-MAT algorithm, contd

Subdivide the adjacency matrix
Choose a quadrant with probability (a,b,c,d)
Recurse till we reach a 1×1 cell
By construction
Rich gets richer for in- and out-degree
Self-similar (communities within communities)
Small diameter

[Figure: adjacency matrix recursively subdivided into quadrants a, b, c, d]
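
The recursion can be sketched as below. The mapping of (a, b, c, d) to top-left, top-right, bottom-left and bottom-right quadrants, the default probabilities, and the function names are assumptions made for illustration; duplicate edges are not removed.

```python
import random

def rmat_edge(scale, a, b, c, rng):
    """Pick one cell of a 2^scale x 2^scale adjacency matrix by descending
    into one quadrant per level until a 1x1 cell is reached."""
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row, col = 2 * row, 2 * col
        if r < a:
            pass                # top-left quadrant (a)
        elif r < a + b:
            col += 1            # top-right quadrant (b)
        elif r < a + b + c:
            row += 1            # bottom-left quadrant (c)
        else:
            row += 1            # bottom-right quadrant (d)
            col += 1
    return row, col

def rmat(scale, n_edges, a=0.45, b=0.15, c=0.15, d=0.25, seed=0):
    """Generate n_edges (source, target) pairs; a + b + c + d must be 1."""
    assert abs(a + b + c + d - 1.0) < 1e-9
    rng = random.Random(seed)
    return [rmat_edge(scale, a, b, c, rng) for _ in range(n_edges)]

rmat_edges = rmat(4, 50)
```

Skewing the probabilities toward one quadrant concentrates edges on low-numbered nodes, which is the "rich get richer" effect.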
34
Evaluation on clickstream data
[Plots: count vs. indegree; count vs. outdegree; hop-plot; singular value vs. rank; left "network value"; right "network value"]
R-MAT matches it well
35
Topic structure of the Web

Measure correlations between link proximity and
content similarity
How to characterize topics?
Started with http://dmoz.org
Keep pruning until all leaf topics have enough
(>300) samples
Approx 120k sample URLs
Flatten to approx 482 topics
Train a text classifier
Characterize new document d as a vector of
probabilities $p_d = (\Pr(c|d)\ \forall c)$

[Figure: test doc → classifier]
36
Sampling the background topic distrib.

What fraction of Web pages are about
/Health/Diabetes?
How to sample the Web?
Invoke the random surfer model (Pagerank)
Walk from node to node
Sample trail adjusting for Pagerank
Modify Web graph to do better sampling
Self loops
Bidirectional edges

37
Convergence

Start from pairs of diverse topics
Two random walks, sample from each walk
Measure distance between topic distributions
L1 distance $\|p_1 - p_2\|_1 = \sum_c |p_1(c) - p_2(c)| \in [0,2]$
Below .05-.2 within 300-400 physical pages

38
Biases in topic directories

Use Dmoz to train a classifier
Sample the Web
Classify samples
Diff Dmoz topic distribution from Web sample
topic distribution
Report maximum deviation in fractions
NOTE Not exactly Dmoz

39
Topic-specific degree distribution

Preferential attachment: connect u to v w.p.
proportional to the degree of v, regardless of
topic
More realistic: u has a topic, and links to v
with related topics
Unclear if power-law should be upheld

[Plots: intra-topic linkage; inter-topic linkage]
40
Random forward walk without jumps

Sampling walk is designed to mix topics well
How about walking forward without jumping?
Start from a page u0 on a specific topic
Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
Compare $(\Pr(c|u_i)\ \forall c)$ with $(\Pr(c|u_0)\ \forall c)$ and with
the background distribution

41
Observations and implications

Forward walks wander away from starting topic
slowly
But do not converge to the background
distribution
Global PageRank ok also for topic-specific
queries
Jump parameter d = .1-.2
Topic drift not too bad within path length of
5-10
Prestige conferred mostly by same-topic neighbors
Also explains why focused crawling works

[Figure: w.p. d jump to a random node; w.p. (1-d) jump to an out-neighbor u.a.r.; labels: jump, high-prestige node]
42
Citation matrix

Given a page is about topic i, how likely is it
to link to topic j?
Matrix C[i,j] = probability that page about topic
i links to page about topic j
Soft counting: for each link u → v, C[i,j] += Pr(i|u) Pr(j|v)
Applications
Classifying Web pages into topics
Focused crawling for topic-specific pages
Finding relations between topics in a directory

[Figure: page u linking to page v]
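
Soft counting can be sketched as below, assuming each page comes with a classifier output Pr(topic | page) and that rows are normalized at the end so each row reads as a distribution over target topics. The data layout and the function name are illustrative assumptions.

```python
def citation_matrix(links, topic_probs, n_topics):
    """Soft-count topic-to-topic citations.

    links: iterable of (u, v) hyperlinks.
    topic_probs[u][i]: classifier's Pr(i | u).
    Returns row-normalized C; rows with no mass stay all zero.
    """
    C = [[0.0] * n_topics for _ in range(n_topics)]
    for u, v in links:
        for i in range(n_topics):
            for j in range(n_topics):
                C[i][j] += topic_probs[u][i] * topic_probs[v][j]
    for row in C:
        total = sum(row)
        if total > 0:
            for j in range(n_topics):
                row[j] /= total
    return C

probs = {"u": [1.0, 0.0], "v": [0.0, 1.0]}
C = citation_matrix([("u", "v")], probs, 2)
```

With hard (0/1) topic assignments this reduces to ordinary counting of links between topics.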
43
Citation, confusion, correction
Classifier's confusion on held-out documents can
be used to correct confusion matrix
[Figures: matrices with axes "from topic" → "to topic" and "true topic" → "guessed topic", over the topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports]
44
Fine-grained views of citation
Prominent off-diagonal entries raise
design issues for taxonomy editors and maintainers
Clear block-structure derived from coarse-grain
topics
Strong diagonals reflect tightly-knit topic
communities
45
Outline, some more detail

Part 1 (Modeling graphs)
What do real-life graphs look like?
What laws govern their formation, evolution and
properties?
What structural analyses are useful?
Part 2 (Analyzing graphs)
Modeling data analysis problems using graphs
Proposing parametric models
Estimating parameters
Applications from Web search and text mining

46
Centrality and prestige
47
How important is a node?

Degree, min-max radius,
Pagerank
Maximum entropy network flows
HITS and stochastic variants
Stability and susceptibility to spamming
Hypergraphs and nonlinear systems
Using other hypertext properties
Applications: ranking, crawling, clustering,
detecting obsolete pages

48
Importance/prestige as Pagerank

A node is important if it is connected to
important nodes
Random surfer walks along links forever
If current node has 3 outlinks, take each with
probability 1/3
Importance = steady-state probability (ssp) of
visit
"Maxwell's equation for the Web"

49
(Simplified) Pagerank algorithm

Let A be the node adjacency matrix
Column-normalize $A^T$
Want vector p such that $A^T p = p$
I.e. p is the eigenvector corresponding to the
largest eigenvalue, which is 1
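
A minimal power-iteration sketch of this simplified form, assuming every node has at least one outlink and the graph is irreducible and aperiodic (no teleport step is included here). The dictionary layout and function name are illustrative.

```python
def simplified_pagerank(out_links, iters=100):
    """Power iteration for p = A^T p with A^T column-normalized.
    out_links[u] lists the nodes u links to; each node needs >= 1 outlink."""
    nodes = list(out_links)
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            share = p[u] / len(out_links[u])  # split u's score over its outlinks
            for v in out_links[u]:
                nxt[v] += share
        p = nxt
    return p

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = simplified_pagerank(graph)
```

For this three-node graph the steady state works out by hand to a = 0.4, b = 0.2, c = 0.4, which the iteration approaches.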

50
Intuition

A as vector transformation

[Figure: A applied to a vector x]
51
Intuition

By defn., eigenvectors remain parallel to
themselves (fixed points)

[Figure: eigenvector $v_1$ is mapped to $\lambda_1 v_1$, with $\lambda_1 = 3.62$]

52
Convergence

Usually, fast
Depends on ratio $\lambda_1 : \lambda_2$

[Figure: eigenvalues $\lambda_2$ and $\lambda_1$]
53
Eigenvalues and epidemic networks

Will a virus spreading across an arbitrary network
create an epidemic?
Susceptible-Infected-Susceptible (SIS)
Was healthy, did not get infected
Was infected, got cured without further attack
Was infected, got cured immediately after attack
Cured nodes immediately become susceptible

54
The infection model

(virus) Birth rate β: probability that an
infected neighbor attacks
(virus) Death rate δ: probability that an
infected node heals

55
Working parameters

$p_{i,t}$ = prob. node i is infected at time t
Was healthy, did not get infected
Was infected, got cured without further attack
Was infected, got attacks from neighbors and yet
cured itself in this time step
(somewhat arbitrary)

56
Time recurrence for $p_{i,t}$

Assuming various probabilities are suitably
small
Recurrence of the probability of infection can be
approximated by this linear form
In other words, the (symmetric) transition matrix
is

57
Epidemic threshold

Virus strength s = β/δ
Epidemic threshold of a graph is defined as the
value of τ, such that if strength s = β/δ < τ,
then an epidemic cannot happen
Problem: compute epidemic threshold

58
Epidemic threshold τ

What should τ depend on?
avg. degree? and/or highest degree?
and/or variance of degree?
and/or third moment of degree?

59
Analysis

Eigenvectors of S and A are the same
Eigenvalues are shifted and scaled
From spectral decomposition
A sufficient condition for infection dying down
is that

60
Epidemic threshold

An epidemic must die down if

$\beta/\delta < \tau = 1/\lambda_{1,A}$

β: attack prob.
δ: recovery prob.
τ: epidemic threshold
$\lambda_{1,A}$: largest eigenvalue of adj. matrix A
Proof: [Wang03]
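
As a small illustration, the threshold can be computed as the reciprocal of the largest adjacency eigenvalue. The sketch below assumes a connected undirected graph stored as neighbor sets and estimates the eigenvalue by power iteration on A + I (the shift is an implementation choice that avoids oscillation on bipartite graphs); the function names are illustrative.

```python
def largest_eigenvalue(adj, iters=200):
    """Largest eigenvalue of the adjacency matrix of a connected undirected
    graph, via power iteration on A + I. adj[u] is the set of neighbors of u."""
    x = {u: 1.0 for u in adj}
    lam = 0.0
    for _ in range(iters):
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        lam = max(y.values())              # eigenvalue estimate for A + I
        x = {u: y[u] / lam for u in adj}
    return lam - 1.0                       # undo the shift by I

def epidemic_threshold(adj):
    """Threshold = 1 / (largest eigenvalue of A)."""
    return 1.0 / largest_eigenvalue(adj)

# Star with hub 0 and four leaves: largest eigenvalue is sqrt(4) = 2.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
tau_star = epidemic_threshold(star)
```

A virus whose attack-to-recovery ratio stays below `tau_star` would be expected to die out on this star.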
61
Experiments
β/δ > τ (above threshold)
β/δ = τ (at the threshold)
β/δ < τ (below threshold)
62
Remarks

Primal problem: topology design
Design a network resilient to infection
Dual problem: viral marketing
Selectively convert important customers who can
influence many others
Will come back to this later

63
Back to the random surfer

Practical issues with Pagerank [BrinP1997]
PR converges only if E is aperiodic and
irreducible; make it so
d is the (tuned) probability of teleporting to
one of N pages uniformly at random (0.1-0.2 in
practice)
(Possibly) unintended consequences: topic
sensitivity, stability

64
Prestige as network flow

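$y_{ij}$ = # surfers clicking from i to j per unit time
Hits per unit time on page j is $H_j = \sum_i y_{ij}$
Flow is conserved at each page j: $\sum_i y_{ij} = \sum_k y_{jk}$
The total traffic is $Y = \sum_{i,j} y_{ij}$
Normalize: $p_{ij} = y_{ij} / Y$. Can interpret $p_{ij}$ as a probability
Standard Pagerank corresponds to one solution
Many other solutions possible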

65
Maximum entropy flow [Tomlin2003]

Flow conservation modeled using feature
And the constraints
Goal is to maximize entropy subject to the constraints
Solution has form
?i is the hotness of page i

66
Maxent flow results
Hotness ranking is better than Pagerank; $H_i$ ranking
is worse
Two IBM intranet data sets with known top URLs
[Plots: average rank (10^6 and 10^8) of known top URLs when sorted by Pagerank, $H_i$, and hotness, against the depth up to which dmoz.org URLs are used as ground truth; smaller rank is better]
67
HITS [Kleinberg1997]

Two kinds of prestige
Good hubs link to good authorities
Good authorities are linked to by good hubs
In matrix notation, iterations amount to $a \leftarrow E^T h$ and $h \leftarrow E a$, with
interleaved normalization of h and a
Note that scores are copied, not divided
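
The iteration can be sketched as below, assuming the graph is given as a non-empty list of (source, target) links and using L2 normalization after each half-step. The function name and the fixed iteration count are illustrative choices.

```python
import math

def hits(edges, iters=50):
    """edges: list of (u, v) meaning u links to v. Returns (hub, auth) dicts."""
    nodes = {x for e in edges for x in e}
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iters):
        auth = {u: 0.0 for u in nodes}
        for u, v in edges:
            auth[v] += hub[u]      # a = E^T h: the full hub score is copied
        norm = math.sqrt(sum(s * s for s in auth.values()))
        auth = {u: s / norm for u, s in auth.items()}
        hub = {u: 0.0 for u in nodes}
        for u, v in edges:
            hub[u] += auth[v]      # h = E a
        norm = math.sqrt(sum(s * s for s in hub.values()))
        hub = {u: s / norm for u, s in hub.items()}
    return hub, auth

hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

Here a1 is cited by both hubs and ends up with the larger authority score, while h1 cites both authorities and ends up the stronger hub.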

68
HITS graph acquisition

Steps
Get root set via keyword query
Expand by one move forward and backward
Drop same-site links (why?)
Each node assigned both a hub and an auth score
Graph tends to be topic-specific
Whereas Pagerank is usually run on the entire
Web (but need not be)

69
HITS vs. Pagerank, more comments

Dominant eigenvectors of different matrices
HITS: eigensystems of $E E^T$ (h) and $E^T E$ (a)
Pagerank: eigensystem of the normalized link matrix with random jumps
HITS copies scores, Pagerank distributes
HITS drops same-site links, Pagerank does (can?)
not
Implications?

70
HITS: dyadic interpretation [CohnC2000]

Graph includes many communities z
Query "Jaguar" gets auto, game, animal links
Each URL is represented as two things
A document d
A citation c
Max
Guess number of aspects z's and use [Hofmann 1999]
to estimate Pr(c|z)
These are the most authoritative URLs

71
Dyadic results for "Machine learning"
Clustering based on citations; ranking within
clusters
72
Spamming link-based ranking

Recipe for spamming HITS
Create a hub linking to genuine authorities
Then mix in links to your customers' sites
Highly susceptible to adversarial behavior
Recipe for spamming Pagerank
Buy a bunch of domains, cloak IP addresses
Host a site at each domain
Sprinkle a few links at random per page to other
sites you own
Takes more work than spamming HITS

73
Stability of link analysis [NgZJ2001]

Compute HITS authority scores and Pagerank
Delete 30% of nodes/links at random
Recompute and compare ranks; repeat
Pagerank ranks more stable than HITS authority
ranks
Why?
How to design more stable algorithms?

[Tables: HITS authority; Pagerank]
74
Stability depends on graph and params

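Auth score is eigenvector for $E^T E = S$, say
Let $\lambda_1 > \lambda_2$ be the first two eigenvalues
There exists an $S'$ such that
S and $S'$ are close: $\|S - S'\|_F = O(\lambda_1 - \lambda_2)$
But $\|u_1 - u_1'\|_2 = \Omega(1)$
Pagerank p is eigenvector of $(\epsilon U + (1-\epsilon) E)^T$
U is a matrix full of 1/N and ε is the jump prob
If set C of nodes are changed in any way, the new
Pagerank vector $p'$ satisfies $\|p' - p\|_1 \le \frac{2}{\epsilon} \sum_{u \in C} p_u$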

75
Randomized HITS

Each half-step, with probability ε, teleport to a
node chosen uniformly at random
Much more stable than HITS
Results more meaningful
ε near 1 will always stabilize
Here ε was 0.2

[Tables: Randomized HITS; Pagerank]
76
Another random walk variation of HITS

SALSA: stochastic HITS [Lempel2000]
Two separate random walks
From authority to authority via hub
From hub to hub via authority
Transition probability $\Pr(a_i \to a_j)$
If transition graph is irreducible, steady-state authority score is proportional to indegree
For disconnected components, depends on relative
size of bipartite cores
Avoids dominance of larger cores

[Figure: authority node $a_1$]
77
SALSA sample result (movies)
HITS: the Tightly-Knit Community (TKC) effect
SALSA: less TKC influence (but no reinforcement!)
78
Links in relational data [GibsonKR1998]

(Attribute, value) pair is a node
Each node v has weight $w_v$
Each tuple is a hyperedge
Tuple r has weight $x_r$
HITS-like iterations to update weight $w_v$
For each tuple
Update weight
Combining operator ⊕ can be sum, max, product, $L_p$
avg, etc.

79
Distilling links in relational data
[Figure labels: Theory, Database; columns Author, Author, Forum, Year]
80
Searching and annotating graph data
81
Searching graph data

Nodes in graph contain text
Random → intelligent surfer [RichardsonD2001]
Topic-sensitive Pagerank [Haveliwala2002]
Assigning image captions using random walks
[PanYFD2004]
Rotting pages and links [BarYossefBKT2004]
Query is a set of keywords
All keywords may not match a single node
Implicit joins [Hulgeri2001, Agrawal2002]
Or rank aggregation [Balmin2004] required

82
Intelligent Web surfer
[Equation annotations: keyword; probability of teleporting to node j; probability of walking from i to j wrt q; relevance of node k wrt q]
Pick out-link to walk on in proportion to
relevance of target out-neighbor
Query = set of words
Pick a query word per some distribution, e.g. IDF
83
Implementing the intelligent surfer

$PR_Q(j)$ approximates a walk that picks a query
keyword using Pr(q) at every step
Precompute and store $PR_q(j)$ for each keyword q in
lexicon: space blowup = avg doc length
Query-dependent PR rated better by volunteers

84
Topic-sensitive Pagerank

High overhead for per-word Pagerank
Instead, compute Pageranks for some collection of
broad topics: $PR_c(j)$
Topic c has sample page set $S_c$
Walk as in Pagerank
Jump to a node in $S_c$ uniformly at random
Project query onto set of topics
Rank responses by projection-weighted Pageranks

85
Topic-sensitive Pagerank results

Users prefer topic-sensitive Pagerank on most
queries to global Pagerank + keyword filter

86
Image captioning

Segment images into regions
Image has caption words
Three-layer graph image, regions, caption words
Threshold on region similarity to connect regions
(dotted)

87
Random walks with restarts
[Figure: layered graph with regions, images (including the test image), and words]

Find regions in test image
Connect regions to other nodes in the region
layer using region similarity
Random walk, restarting at test image node
Pick words with largest visit probability

88
More random walks: link rot

How stale is a page?
Last-mod unreliable
Automatic dead-link cleaners mask disuse
A page is completely stale if it is dead
Let D be the set of pages which cannot be
accessed (404 and other problems)
How stale is a page u? Start with p ← u
If p ∈ D, declare decay value of u to be 1, else
With probability σ, declare decay value of u to be 0
W.p. 1−σ, choose outlink v, set p ← v, loop
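
The expected decay value under this walk satisfies a simple recurrence, which the sketch below iterates to a fixed point. It assumes outlinks are chosen uniformly at random and that a live page with no outlinks has decay 0; `sigma` stands for the stopping probability, and the function name and data layout are illustrative.

```python
def decay_scores(out_links, dead, sigma=0.1, iters=100):
    """Expected decay value of each page.
    out_links[u]: list of outlinks of live page u; dead: set of dead pages."""
    pages = set(out_links) | set(dead) | {v for vs in out_links.values() for v in vs}
    decay = {u: (1.0 if u in dead else 0.0) for u in pages}
    for _ in range(iters):
        for u in pages:
            if u in dead:
                continue           # dead pages keep decay 1
            outs = out_links.get(u, [])
            if outs:
                # Continue w.p. (1 - sigma), then average over the outlinks.
                decay[u] = (1 - sigma) * sum(decay[v] for v in outs) / len(outs)
    return decay

scores = decay_scores({"u": ["x", "y"], "y": []}, dead={"x"}, sigma=0.1)
```

Page u links to one dead and one live page, so its decay is 0.9 * (1 + 0) / 2 = 0.45.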

89
Page staleness results
[Plot labels: decay; 404s]

Decay score is correlated with, but generally
larger than the fraction of dead outlinks on a
page
Removing direct dead links automatically does not
eliminate live but rotting pages

90
Graph proximity search: two paradigms

A single node as query response
Find node that matches query terms
or is near nodes matching query terms
[Goldman 1998]
A connected subgraph as query response
Single node may not match all keywords
No natural page boundary [Bhalotia2002]
[Agrawal2002]

91
Single-node response examples

Query → response
Travolta, Cage → Actor, Face/Off
Travolta, Cage, Movie → Face/Off
Kleiser, Movie → Gathering, Grease
Kleiser, Woo, Actor → Travolta

[Figure: data graph with nodes Movie, Face/Off, Grease, Gathering, Actor, Travolta, Cage, A3, Director, Kleiser, Woo and edges is-a, acted-in, directed]
92
Basic search strategy

Node subset A activated because they match query
keyword(s)
Look for node near nodes that are activated
Goodness of response node depends
Directly on degree of activation
Inversely on distance from activated node(s)

93
Proximity query screenshot
http://www.cse.iitb.ac.in/banks/
94
Ranking a single node response

Activated node set A
Rank node r in response set R based on
proximity to nodes a in A
Nodes have relevance $\rho_R$ and $\rho_A$ in [0,1]
Edge costs are specified by the system
d(a,r) = cost of shortest path from a to r
Bond between a and r
Parameter t tunes relative emphasis on distance
and relevance score
Several ad-hoc choices

95
Scoring single response nodes

Additive
Belief
Goal: list a limited number of find nodes with
the largest scores
Performance issues
Assume the graph is in memory?
Precompute all-pairs shortest path ($|V|^3$)?
Prune unpromising candidates?
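
The bond and score formulas did not survive in this transcript, so the sketch below rests on assumed forms: a bond equal to the product of the two relevances divided by distance raised to the power t, an additive score that sums bonds, and a belief score that combines them noisy-OR style. All three forms and all names are assumptions for illustration only.

```python
def bond(rel_a, rel_r, dist, t=2.0):
    """Assumed bond: relevance product damped by distance ** t (dist > 0)."""
    return rel_a * rel_r / (dist ** t)

def additive_score(bonds):
    """Assumed additive score: plain sum of bonds to activated nodes."""
    return sum(bonds)

def belief_score(bonds):
    """Assumed belief score: 1 - prod(1 - b); needs every bond in [0, 1]."""
    prod = 1.0
    for b in bonds:
        prod *= 1.0 - b
    return 1.0 - prod

# One response node with two activated nodes at distances 2 and 1.
bonds = [bond(1.0, 1.0, 2.0), bond(0.5, 1.0, 1.0)]
add_score = additive_score(bonds)
bel_score = belief_score(bonds)
```

A larger t makes distance dominate relevance, which matches the role the slides give that parameter.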

96
Hub indexing

Decompose APSP problem using sparse vertex cuts
|A|+|B| shortest paths to p
|A|+|B| shortest paths to q
d(p,q)
To find d(a,b) compare
d(a→p→b) not through q
d(a→q→b) not through p
d(a→p→q→b)
d(a→q→p→b)
Greatest savings when |A| ≈ |B|
Heuristics to find cuts, e.g. large-degree nodes

[Figure: node sets A and B separated by cut vertices p and q, with a in A and b in B]
97
ObjectRank [Balmin2004]

Given a data graph with nodes having text
For each keyword precompute a keyword-sensitive
Pagerank [RichardsonD2001]
Score of a node for multiple keyword search based
on fuzzy AND/OR
Approximation to Pagerank of node with restarts
to nodes matching keywords
Use Fagin-merge [Fagin2002] to get best nodes in
data graph

98
Connected subgraph as response

Single node may not match all keywords
No natural page boundary
On-the-fly joins make up a response page
Two scenarios
Keyword search on relational data
Keywords spread among normalized relations
Keyword search on XML-like or Web data
Keywords spread among DOM nodes and subtrees

99
Keyword search on relational data

Tuple = node
Some columns have text
Foreign key constraints = edges in schema graph
Query = set of terms
No natural notion of a document
Normalization
Join may be needed to generate results
Cycles may exist in schema graph: "Cites"

[Schema figure]
Cites(Citing, Cited, ...)
Paper(PaperID, PaperName, ...)
Writes(AuthorID, PaperID, ...)
Author(AuthorID, AuthorName, ...)
100
DBXplorer and DISCOVER

Enumerate subsets of relations in schema graph
which, when joined, may contain rows which have
all keywords in the query
Join trees derived from schema graph
Output SQL query for each join tree
Generate joins, checking rows for matches
[Agrawal2001, Hristidis2002]

[Figure: schema graph over tables T1 to T5, tables matching keywords K1, K2, K3, and the join trees enumerated from them]
101
Discussion

Exploits relational schema information to contain
search
Pushes final extraction of joined tuples into
RDBMS
Faster than dealing with full data graph directly

Coarse-grained ranking based on schema tree
Does not model proximity or (dis) similarity of
individual tuples
No recipe for data with less regular (e.g. XML)
or ill-defined schema

102
Motivation from Web search

"Linux modem driver for a Thinkpad A22p"
Hyperlink path matches query collectively
Conjunction query would fail
"Projects where X and P work together"
Conjunction may retrieve wrong page
General notion of graph proximity

[Figure: linked pages "IBM Thinkpads" (A20m, A22p), "Thinkpad Drivers" (Windows XP, Linux), and "Download / Installation tips" (Modem, Ethernet)]
[Figure: pages "The B System" (group members P, S, X), "Home Page of Professor X" (Papers: VLDB; Students: P, Q), and P's home page ("I work on the B project.")]
103
Data structures for search

Answer tree with at least one leaf containing
each keyword in query
Group Steiner tree problem, NP-hard
Query term t found in source nodes $S_t$
Single-source-shortest-path (SSSP) iterator
Initialize with a source (near-) node
Consider edges backwards
getNext() returns next nearest node
For each iterator, each visited node v maintains
for each t a set $v.R_t$ of nodes in $S_t$ which have
reached v

104
Generic expanding search

Near node sets $S_t$ with $S = \bigcup_t S_t$
For all source nodes σ ∈ S
    create a SSSP iterator with source σ
While more results required
    Get next iterator and its next-nearest node v
    Let t be the term for the iterator's source s
    crossProduct = {s} × $\prod_{t' \ne t} v.R_{t'}$
    For each tuple of nodes in crossProduct
        Create an answer tree rooted at v with paths to
        each source node in the tuple
    Add s to $v.R_t$
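
The sketch below is a deliberately simplified, non-incremental variant of the search above: instead of interleaving SSSP iterators and emitting answers as they appear, it runs a full backward Dijkstra from every source node and then picks the single root with the smallest total path cost to one source per term. The function names, the cost-sum criterion, and the toy graph are assumptions for illustration.

```python
import heapq

def backward_dijkstra(rev_adj, source):
    """Cost of the cheapest path from each node to `source`.
    rev_adj[v] lists (u, cost) for every original edge u -> v."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                      # stale heap entry
        for u, cost in rev_adj.get(v, []):
            nd = d + cost
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def best_answer_root(rev_adj, keyword_nodes):
    """keyword_nodes: term -> set of source nodes S_t (at least one term).
    Returns (cost, root, {term: chosen source}) or None if no root exists."""
    per_term = {}
    for t, sources in keyword_nodes.items():
        reach = {}                        # node -> (cost, nearest source for t)
        for s in sources:
            for v, d in backward_dijkstra(rev_adj, s).items():
                if v not in reach or d < reach[v][0]:
                    reach[v] = (d, s)
        per_term[t] = reach
    roots = set.intersection(*(set(r) for r in per_term.values()))
    best = None
    for v in roots:
        cost = sum(per_term[t][v][0] for t in per_term)
        if best is None or cost < best[0]:
            best = (cost, v, {t: per_term[t][v][1] for t in per_term})
    return best

edges = [("r", "a", 1), ("a", "k1", 1), ("r", "k2", 1), ("y", "k1", 1), ("y", "k2", 5)]
rev_adj = {}
for u, v, cost in edges:
    rev_adj.setdefault(v, []).append((u, cost))
answer = best_answer_root(rev_adj, {"t1": {"k1"}, "t2": {"k2"}})
```

In the toy graph both r and y reach a node for each term, and r wins with total cost 3 against 6 for y.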

105
Search example ("Vu Kleinberg")
[Figure: data graph with author nodes Quoc Vu and Jon Kleinberg; legend: author, writes, cites, paper]
106
First response
[Figure: answer tree linking Quoc Vu and Jon Kleinberg through writes and cites edges; papers shown: "Organizing Web pages by Information Unit", "Authoritative sources in a hyperlinked environment", "A metric labeling problem"; other authors shown: Divyakant Agrawal, Eva Tardos; legend: author, writes, cites, paper]
107
Subgraph search screenshot
http://www.cse.iitb.ac.in/banks/
108
Similarity, neighborhood, influence
109
Why are two nodes similar?

What is/are the best paths connecting two nodes
explaining why/how they are related?
Graph of co-starring, citation, telephone call, ...
Graph with nodes s and t; budget of b nodes
Find best b nodes capturing relationship
between s and t [FaloutsosMT2004]
Proposing a definition of goodness
How to efficiently select best connections

[Figure: connection subgraph with nodes Negroponte, Palmisano, Esther Dyson, Gerstner]
110
Simple proposals that do not work

Shortest path
Pizza boy p gets same attention as g
Network flow
s→a→b→t is as good as s→g→t
Voltage
Connect 1V at s, ground t
Both g and p will be at 0.5V
Observations
Must reward parallel paths
Must reward short paths
Must penalize/tax pizza boys

[Figure: graph with nodes s, a, b, g, p, t]
111
Resistive network with universal sink

Connect 1V to s
Ground t
Introduce universal sink
Grounded
Connected to every node
Universal sink is a tax collector
Penalizes pizza boys
Penalizes long paths
Goodness of a path is the electric current it
carries

[Figure: graph with nodes s, a, b, g, p, t; the universal sink is connected to every node]
112
Resistive network algorithm

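Ohm's law: $I(u,v) = C(u,v)\,(V(u) - V(v))$
Kirchhoff's current law: $\sum_u I(u,v) = 0$ for every $v \ne s, t$
Boundary conditions (without sink): $V(s) = 1$, $V(t) = 0$
Solution: $V(v) = \sum_u C(u,v) V(u) / \sum_u C(u,v)$ for $v \ne s, t$
Here C(u,v) is the conductance from u to v
Add grounded universal sink z with V(z) = 0
Set $C(u,z) = \alpha \sum_{w \ne z} C(u,w)$
Display subgraph carrying high current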
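
A small sketch of the voltage computation, assuming symmetric conductances stored as nested dictionaries, a sink conductance at each node equal to alpha times that node's total conductance, and plain fixed-point (Gauss-Seidel style) iteration rather than a direct linear solve. The names and the value of alpha are illustrative assumptions.

```python
def voltages(cond, s, t, alpha=1.0, iters=2000):
    """cond[u][v] = conductance between u and v (symmetric).
    V(s) = 1 and V(t) = 0 are fixed; every other node also leaks to a
    grounded sink with conductance alpha * sum_v cond[u][v]."""
    V = {u: 0.0 for u in cond}
    V[s] = 1.0
    for _ in range(iters):
        for u in cond:
            if u in (s, t):
                continue
            total = sum(cond[u].values())
            # Kirchhoff at u, with the sink (at 0 V) added to the denominator.
            V[u] = sum(c * V[v] for v, c in cond[u].items()) / (total * (1.0 + alpha))
    return V

def current(cond, V, u, v):
    """Ohm's law: current flowing from u to v."""
    return cond[u][v] * (V[u] - V[v])

path = {"s": {"g": 1.0}, "g": {"s": 1.0, "t": 1.0}, "t": {"g": 1.0}}
v_no_sink = voltages(path, "s", "t", alpha=0.0)
v_sink = voltages(path, "s", "t", alpha=1.0)
```

Without the sink the middle node sits at 0.5 V; with alpha = 1 it drops to 0.25 V, so less current reaches t, which is the intended tax on intermediate nodes.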

113
Distributions influenced via graphs

Directed or undirected graph; nodes have
Observable properties
Some unobservable (random) state
Edges indicate that distributions over
unobservable states are coupled
Many applications
Hypertext classification (topics are clustered)
Social network of customers buying products
Hierarchical classification
Labeling categorical sequences: POS/NE tagging,
sense disambiguation, linkage analysis

114
Basic techniques

Directed (acyclic) graphs: Bayesian network
Markov networks
(Loopy) belief propagation
Conditional Markov networks
Avoid modeling joint distribution over observable
and hidden properties of nodes
Some computationally simple special cases

115
Hypertext classification

Want to assign labels to Web pages
Text on a single page may be too little or
misleading
Page is not an isolated instance by itself
Problem setup
Web graph G = (V,E)
Node u ∈ V is a page having text $u_T$
Edges in E are hyperlinks
Some nodes are labeled
Make collective guess at missing labels
Probabilistic model? Benefits?

116
Graph labeling model

Seek a labeling f of all unlabeled nodes so as to
maximize
Let $V^K$ be the nodes with known labels and $f(V^K)$
their known label assignments
Let N(v) be the neighbors of v and $N^K(v) \subseteq N(v)$ be
neighbors with known labels
Markov assumption: f(v) is conditionally
independent of rest of G given f(N(v))

117
Markov graph labeling

Circularity between f(v) and $f(N^U(v))$
Some form of iterative Gibbs sampling or MCMC

118
Iterative labeling

Sum over all possible $N^U(v)$ labelings still too
expensive to compute
In practice, prune to most likely configurations
Let us look at the last term more carefully

119
A generative node model
By the Markov assumption, finally we need a
distribution coupling f(v) and $v_T$ (the text on v)
and f(N(v))

Can use Bayes classifier as with ordinary text:
estimate a parametric model for the
class-conditional joint distribution between the
text on the page v and the labels of neighbors of
v
Must make naïve Bayes assumptions to keep
practical

120
Pictorially

c = class, t = text, N = neighbors
Text-only model: Pr[t|c]
Using neighbors' text to judge my topic: Pr[t,
t(N) | c] (hurts)
Better model: Pr[t, c(N) | c]
Estimate histograms and update based on
neighbors' histograms

[Figure: page with unknown class (?) and its neighbors]
121
Generative model results

9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
Forget fraction of neighbors' classes

122
Detour: generative vs. discriminative

x = feature vector, y = label ∈ {0,1}, say
Generative method models Pr(x|y) or Pr(x,y): the
generation of data x given label y
Use Bayes rule to get Pr(y|x)
Inaccurate: x may have many dimensions
Discriminative: directly estimates Pr(y|x)
Simple linear case: want w s.t. w·x > 0 if y=1
and w·x ≤ 0 if y=0
Cannot differentiate for w; instead pick some
smooth loss function
Works very well in practice
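
One common choice of smooth loss is the logistic loss, which the sketch below minimizes by batch gradient descent. The slides do not say which loss was meant, so logistic regression is used here as an assumed example; the learning rate, epoch count, and toy data (with a constant first feature acting as bias) are also illustrative.

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit w by gradient descent on the logistic loss, a smooth stand-in
    for the non-differentiable 0/1 error."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-score))   # model's Pr(y = 1 | x)
            for k, xk in enumerate(x):
                grad[k] += (p - y) * xk
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w

xs = [[1.0, 2.0], [1.0, 1.0], [1.0, -1.0], [1.0, -2.0]]
ys = [1, 1, 0, 0]
w = train_logistic(xs, ys)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x in xs]
```

The learned w separates the toy data with w·x > 0 exactly on the y = 1 examples, without ever modeling how x itself is generated.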

123
Discriminative node model

OA(X) = direct own attributes of node X
LD(X) = link-derived attributes of node X
Mode-link: most frequent label of neighbors(X)
Count-link: histogram of neighbor labels
Binary-link: 0/1 histogram of neighbor labels
Iterate as in generative case

[Equation annotations: local model params; neighborhood model params]
124
Discriminative model results [Li2003]

Binary-link and count-link outperform
content-only at 95% confidence
Better to separately estimate $w_l$ and $w_o$
In+Out+Cocitation better than any subset for LD

125
Undirected Markov networks

Clique c ∈ C(G): a set of completely connected nodes
Clique potential $\phi_c(V_c)$: a function over all
possible configurations of nodes in $V_c$
Decompose Pr(v) as $(1/Z) \prod_{c \in C(G)} \phi_c(V_c)$
Parametric form

[Equation annotations: label coupling; instance; local feature variable; label variable; params of model; feature functions]
126
Conditional and relational networks

x = vector of observed features at all nodes
y = vector of labels
A set of clique templates specifying links to
use
Other features: in the same HTML section

127
Toy problem Hierarchical classification

Obvious approaches
Flatten to leaf topics, losing hierarchy info
Level-by-level, compounding error probability
Cascaded generative model
Pr(c|d,r) estimated as Pr(c|r) Pr(d|c) / Z(r)
Estimate of Pr(d|c) makes naïve independence
assumptions if d has high dimensionality
Pr(c|d,r) tends to 0/1 for large dimensions, and
mistakes made at shallow levels become irrevocable

(Figure: topic tree with parent node r and child node c)
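
The saturation effect can be illustrated numerically. This sketch assumes a parent r with two children, equal priors Pr(c|r), and a naive Bayes estimate of Pr(d|c) over binary term indicators; the child names and the per-term probabilities 0.55 and 0.45 are made up. Even though each term barely favors one child, the posterior approaches 1 as the number of dimensions grows.

```python
import math

def child_posterior(doc, prior, theta):
    """Pr(c | d, r) = Pr(c | r) * Pr(d | c) / Z(r), where Pr(d | c) is a
    product over terms (the naive independence assumption)."""
    log_scores = {}
    for c in prior:
        logp = math.log(prior[c])
        for term_present, p in zip(doc, theta[c]):
            logp += math.log(p if term_present else 1.0 - p)
        log_scores[c] = logp
    m = max(log_scores.values())          # subtract max for numerical stability
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {c: math.exp(s - m) / z for c, s in log_scores.items()}

def posterior_for_dimension(n):
    """Posterior of child 'a' for a document in which all n terms are present."""
    prior = {"a": 0.5, "b": 0.5}
    theta = {"a": [0.55] * n, "b": [0.45] * n}   # weak per-term evidence
    return child_posterior([1] * n, prior, theta)["a"]

p_1 = posterior_for_dimension(1)
p_10 = posterior_for_dimension(10)
p_100 = posterior_for_dimension(100)
print(p_1, p_10, p_100)
```

Because the cascade commits to the highest-scoring child at each level, such overconfident posteriors near the root leave no room to recover from an early mistake.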
128
Global discriminative model

Each node has an associated bit X
Propose a parametric form
Each training instance sets one path to 1, all
other nodes have X=0

(Figure not preserved; labels: T, xr, d, F(d,xr), xr=0, xr=1, 2T1, wc)
129
Network value of customers

Customer node X in graph, neighbors N
M is a marketing action (promotion, coupon)
Want to predict the response of customer i, given
the known responses of other customers, the
product attributes, and the aggregate marketing
action
Broader objective is to design action M
Again, we approximate, summing over unknown
neighbor configurations
130
Network value, continued

Let the action be boolean
c is the cost of marketing
r0 is the revenue without marketing, r1 with
Expected lift in profit by marketing to customer
i in isolation
Global effect
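
A hedged sketch of the isolated lift computation. The slide's equation is not preserved; this assumes the expected lift in profit for customer i is the expected revenue with marketing, minus the expected revenue without it, minus the marketing cost c. The purchase probabilities would come from the network model and are simply made-up inputs here. The global effect, where marketing to i also changes the responses of i's neighbors, is not modeled.

```python
def expected_lift_in_profit(p_buy_marketed, p_buy_unmarketed, r1, r0, c):
    """Assumed form: r1 * Pr(buy | marketed) - r0 * Pr(buy | not marketed) - c."""
    return r1 * p_buy_marketed - r0 * p_buy_unmarketed - c

def customers_worth_marketing(customers, r1, r0, c):
    """Names of customers whose isolated expected lift is positive."""
    chosen = []
    for name, (p_marketed, p_unmarketed) in customers.items():
        if expected_lift_in_profit(p_marketed, p_unmarketed, r1, r0, c) > 0:
            chosen.append(name)
    return chosen

# Hypothetical customers: (Pr(buy | marketed), Pr(buy | not marketed)).
customers = {
    "ann": (0.30, 0.10),
    "bob": (0.12, 0.10),
    "cho": (0.50, 0.45),
}
# r1 < r0 here because the marketing action is assumed to be a discount.
lift_ann = expected_lift_in_profit(0.30, 0.10, r1=10.0, r0=12.0, c=1.0)
targets = customers_worth_marketing(customers, r1=10.0, r0=12.0, c=1.0)
print(lift_ann, targets)
```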

131
Special case: sequential networks

Text modeled as sequence of tokens drawn from a
large but finite vocabulary
Each token has attributes
Visible: allCaps, noCaps, hasXx, allDigits,
hasDigit, isAbbrev, (part-of-speech, wnSense)
Not visible: part-of-speech, (isPersonName,
isOrgName, isLocation, isDateTime),
{starts|continues|ends}-noun-phrase
Visible (symbols) and invisible (states)
attributes of nearby tokens are dependent
Application decides what is (not) visible
Goal: Estimate invisible attributes
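
A minimal sketch of extracting some of the visible attributes for a token. The slides name the features but do not define them, so the exact definitions below are assumptions (in particular, hasXx is taken to mean an initial capital followed by lowercase letters). isAbbrev, part-of-speech and wnSense need external resources and are omitted.

```python
def visible_features(token):
    """Orthographic features of one token (definitions are assumptions)."""
    return {
        "allCaps": token.isalpha() and token.isupper(),
        "noCaps": not any(ch.isupper() for ch in token),
        "hasXx": len(token) > 1 and token[0].isupper() and token[1:].islower(),
        "allDigits": token.isdigit(),
        "hasDigit": any(ch.isdigit() for ch in token),
    }

tokens = ["IBM", "Bombay", "2004", "x86"]
features = {tok: visible_features(tok) for tok in tokens}
for tok in tokens:
    active = [name for name, on in features[tok].items() if on]
    print(tok, active)
```

Each token then contributes a feature vector (the symbol), and the model's job is to estimate the invisible attributes (the states) from the sequence of such vectors.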

132
Hidden Markov model

A generative sequential model for the joint
distribution of states (s) and symbols (o)

(Figure: state chain ... → St-1 → St → St+1 → ..., with each state St emitting symbol Ot)
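
A small sketch of the joint distribution in a first-order HMM, Pr(s, o) = ∏ Pr(st | st-1) Pr(ot | st), together with Viterbi decoding of the most probable state sequence. The two states, two symbols, and all probability tables are made up for illustration.

```python
def joint_probability(states, symbols, start, trans, emit):
    """Pr(s, o): start and transition probabilities times emission probabilities."""
    p = start[states[0]] * emit[states[0]][symbols[0]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][symbols[t]]
    return p

def viterbi(symbols, state_names, start, trans, emit):
    """Return (probability, path) of the most probable state sequence."""
    best = {s: (start[s] * emit[s][symbols[0]], [s]) for s in state_names}
    for o in symbols[1:]:
        new_best = {}
        for s in state_names:
            prob, path = max(
                ((best[prev][0] * trans[prev][s], best[prev][1])
                 for prev in state_names),
                key=lambda item: item[0])
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Toy model (made up): is a token part of a name, given its capitalization?
state_names = ["name", "other"]
start = {"name": 0.2, "other": 0.8}
trans = {"name": {"name": 0.5, "other": 0.5},
         "other": {"name": 0.2, "other": 0.8}}
emit = {"name": {"cap": 0.9, "lower": 0.1},
        "other": {"cap": 0.2, "lower": 0.8}}

symbols = ["lower", "cap", "cap"]
best_prob, best_path = viterbi(symbols, state_names, start, trans, emit)
print(best_path, best_prob)
```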
133
Using redundant token features

Each o is usually a vector of features extracted
from a token
Might have high dependence/redundancy: hasCap,
hasDigit, isNoun, isPreposition
Parametric model for Pr(st → ot), the emission of
ot from state st, needs to make naïve assumptions
to be practical
Overall joint model Pr(s,o) can be very
inaccurate
(Same argument as in naïve Bayes vs. SVM or
maximum entropy text classifiers)

134
Discriminative graphical model

Assume one-stage Markov dependence
Propose direct parametric form for conditional
probability of state sequence given symbol
sequence

(Equation not preserved; annotations: model in log-linear form, feature function might depend on whole o, parameters to fit)
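
A sketch of one log-linear form consistent with the description above, assuming Pr(s | o) = exp(Σt Σk wk fk(st-1, st, o, t)) / Z(o), as in linear-chain conditional random fields. The feature functions, weights and tokens are hypothetical. Z(o) is computed by enumerating all state sequences, which only works for toy inputs; practical implementations use dynamic programming over the chain.

```python
import itertools
import math

STATE_NAMES = ("NAME", "OTHER")

# Hypothetical feature functions f_k(prev_state, state, symbols, t).
def f_cap_name(prev, s, symbols, t):
    return float(s == "NAME" and symbols[t][0].isupper())

def f_same_state(prev, s, symbols, t):
    return float(prev is not None and prev == s)

def f_before_said(prev, s, symbols, t):
    # Looks at the next token: a feature may depend on the whole sequence o.
    return float(s == "NAME" and t + 1 < len(symbols) and symbols[t + 1] == "said")

FEATURES = (f_cap_name, f_same_state, f_before_said)

def score(states, symbols, weights):
    """Sum over positions t and features k of w_k * f_k(s_{t-1}, s_t, o, t)."""
    total, prev = 0.0, None
    for t, s in enumerate(states):
        for w, f in zip(weights, FEATURES):
            total += w * f(prev, s, symbols, t)
        prev = s
    return total

def conditional_probability(states, symbols, weights):
    """Pr(s | o) = exp(score(s, o)) / Z(o), with Z(o) by brute force."""
    z = sum(math.exp(score(seq, symbols, weights))
            for seq in itertools.product(STATE_NAMES, repeat=len(symbols)))
    return math.exp(score(states, symbols, weights)) / z

weights = [2.0, 0.5, 1.0]           # made-up parameter values
symbols = ("Alice", "said")
p_name_first = conditional_probability(("NAME", "OTHER"), symbols, weights)
p_no_name = conditional_probability(("OTHER", "OTHER"), symbols, weights)
total = sum(conditional_probability(seq, symbols, weights)
            for seq in itertools.product(STATE_NAMES, repeat=len(symbols)))
print(p_name_first, p_no_name, total)
```

Because the model conditions on o rather than generating it, overlapping and redundant features can be added without any independence assumption among them.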
135
Feature functions and parameters
Penalize large params
Maximize total conditional likelihood over all
instances

Find ∂L/∂λk for each k and perform a
gradient-based numerical optimization
Efficient for linear state dependence structure
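
A sketch of the training step under stated assumptions: the objective is taken to be the total conditional log-likelihood minus a quadratic penalty Σk wk²/(2σ²), so each partial derivative is (observed feature count) minus (expected feature count under the model) minus wk/σ². The two features, the training sequences, σ² and the step size are made up, and expectations are computed by brute-force enumeration rather than by the efficient chain-structured recursion.

```python
import itertools
import math

STATES = ("NAME", "OTHER")
SIGMA2 = 10.0  # penalty strength (assumed value)

def feature_counts(states, symbols):
    """Global feature vector: per-position features summed over the sequence."""
    counts = [0.0, 0.0]
    for t, s in enumerate(states):
        if s == "NAME" and symbols[t][0].isupper():
            counts[0] += 1.0            # capitalized token labeled NAME
        if t > 0 and states[t - 1] == s:
            counts[1] += 1.0            # state repeated from previous position
    return counts

def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def objective_and_gradient(w, data):
    """Penalized conditional log-likelihood and its gradient."""
    value = -sum(wi * wi for wi in w) / (2.0 * SIGMA2)
    grad = [-wi / SIGMA2 for wi in w]
    for states, symbols in data:
        seqs = list(itertools.product(STATES, repeat=len(symbols)))
        feats = [feature_counts(seq, symbols) for seq in seqs]
        scores = [dot(w, f) for f in feats]
        log_z = math.log(sum(math.exp(sc) for sc in scores))
        observed = feature_counts(states, symbols)
        value += dot(w, observed) - log_z
        for k in range(len(w)):
            expected = sum(math.exp(sc - log_z) * f[k]
                           for sc, f in zip(scores, feats))
            grad[k] += observed[k] - expected
    return value, grad

data = [(("NAME", "OTHER", "OTHER"), ("Alice", "went", "home")),
        (("OTHER", "NAME", "NAME"), ("the", "New", "York"))]

w = [0.0, 0.0]
initial_value, initial_grad = objective_and_gradient(w, data)
for _ in range(200):                    # plain gradient ascent
    _, grad = objective_and_gradient(w, data)
    w = [wi + 0.1 * g for wi, g in zip(w, grad)]
final_value, _ = objective_and_gradient(w, data)
print(initial_value, final_value, w)
```

Plain gradient ascent is used here only to keep the sketch short; any gradient-based optimizer can consume the same objective and gradient.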

136
Conditional vs. joint results
Out-of-vocabulary error
Orthography: use words, plus overlapping
features: isCap, startsWithDigit, hasHyphen,
endsWith -ing, -ogy, -ed, -s, -ly, -ion, -tion,
-ity, -ies
137
Summary

Graphs provide a powerful way to model many kinds
of data, at multiple levels
Web pages, XML, relational data, images
Words, senses, phrases, parse trees
A few broad paradigms for analysis
Factors affecting graph evolution over time
Eigen analysis, conductance, random walks
Coupled distributions between node attributes and
graph neighborhood
Several new classes of model estimation and
inferencing algorithms

138
References

BrinP1998 The Anatomy of a Large-Scale
Hypertextual Web Search Engine. WWW.
GoldmanSVG1998 Proximity search in databases.
VLDB, 26-37.
ChakrabartiDI1998 Enhanced hypertext
categorization using hyperlinks. SIGMOD.
BikelSW1999 An Algorithm that Learns What's in
a Name. Machine Learning Journal.
GibsonKR1999 Clustering categorical data: An
approach based on dynamical systems. VLDB.
Kleinberg1999 Authoritative sources in a
hyperlinked environment. JACM 46.

139
References

CohnC2000 Probabilistically Identifying
Authoritative Documents. ICML.
LempelM2000 The stochastic approach for
link-structure analysis (SALSA) and the TKC
effect. Computer Networks 33(1-6), 387-401.
RichardsonD2001 The Intelligent Surfer:
Probabilistic Combination of Link and Content
Information in PageRank. NIPS 14, 1441-1448.
LaffertyMP2001 Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling
Sequence Data. ICML.
BorkarDS2001 Automatic text segmentation for
extracting structured records. SIGMOD.

140
References

NgZJ2001 Stable algorithms for link analysis.
SIGIR.
Hulgeri2001 Keyword Search in Databases. IEEE
Data Engineering Bulletin 24(3), 22-32.
Hristidis2002 DISCOVER: Keyword Search in
Relational Databases. VLDB.
Agrawal2002 DBXplorer: A system for
keyword-based search over relational databases.
ICDE.
TaskarAK2002 Discriminative probabilistic
models for relational data.
Fagin2002 Combining fuzzy information: an
overview. SIGMOD Record 31(2), 109-118.

141
References

Chakrabarti2002 Mining the Web: Discovering
Knowledge from Hypertext Data.
Tomlin2003 A New Paradigm for Ranking Pages on
the World Wide Web. WWW.
Haveliwala2003 Topic-Sensitive PageRank: A
Context-Sensitive Ranking Algorithm for Web
Search. IEEE TKDE.
LuG2003 Link-based Classification. ICML.
FaloutsosMT2004 Connection Subgraphs in Social
Networks. SIAM-DM workshop.
PanYFD2004 GCap: Graph-based Automatic Image
Captioning. MDDE/CVPR.

142
References

Balmin2004 Authority-Based Keyword Queries in
Databases using ObjectRank. VLDB.
BarYossefBKT2004 Sic transit gloria telae:
Towards an understanding of the Web's decay.
WWW 2004.
