Using Graphs in Unstructured and Semistructured Data Mining


1
Using Graphs in Unstructured and Semistructured Data Mining
  • Soumen Chakrabarti
  • IIT Bombay
  • www.cse.iitb.ac.in/~soumen

2
Acknowledgments
  • C. Faloutsos, CMU
  • W. Cohen, CMU
  • IBM Almaden (many colleagues)
  • IIT Bombay (many students)
  • S. Sarawagi, IIT Bombay
  • S. Sudarshan, IIT Bombay

3
Graphs are everywhere
  • Phone network, Internet, Web
  • Databases, XML, email, blogs
  • Web of trust (epinion)
  • Text and language artifacts (WordNet)
  • Commodity distribution networks

[Figures: protein interactions (genomebiology.com); Internet map (lumeta.com); food web [Martinez1991]]
4
Why analyze graphs?
  • What properties do real-life graphs have?
  • How important is a node? What is importance?
  • Who is the best customer to target in a social
    network?
  • Who spread a raging rumor?
  • How similar are two nodes?
  • How do nodes influence each other?
  • Can I predict some property of a node based on
    its neighborhood?

5
Outline, some more detail
  • Part 1 (Modeling graphs)
  • What do real-life graphs look like?
  • What laws govern their formation, evolution and
    properties?
  • What structural analyses are useful?
  • Part 2 (Analyzing graphs)
  • Modeling data analysis problems using graphs
  • Proposing parametric models
  • Estimating parameters
  • Applications from Web search and text mining

6
Modeling and generating realistic graphs
7
Questions
  • What do real graphs look like?
  • Edges, communities, clustering effects
  • What properties of nodes, edges are important to
    model?
  • Degree, paths, cycles, ...
  • What local and global properties are important to
    measure?
  • How to artificially generate realistic graphs?

8
Modeling: why care?
  • Algorithm design
  • Can skewed degree distribution make our algorithm
    faster?
  • Extrapolation
  • How well will Pagerank work on the Web 10 years
    from now?
  • Sampling
  • Make sure scaled-down algorithm shows same
    performance/behavior on large-scale data
  • Deviation detection
  • Is this page trying to spam the search engine?

9
Laws: degree distributions
  • Q: The average degree is 10. What is the most probable
    degree?

[Figure: count vs. degree, with the answer left open ("??") around degree 10]
10
Laws: degree distributions
  • Q: The average degree is 10. What is the most probable
    degree?

[Figure: count vs. degree]
11
Power-law: outdegree exponent O
[Figure: frequency vs. outdegree on log-log axes (Nov '97); exponent = slope, O = -2.15]
  • The plot is linear in log-log scale [FFF99]
  • $\text{freq} \propto \text{degree}^{-2.15}$
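
One way such an exponent could be estimated is a least-squares line fit on (log degree, log frequency). The sketch below is illustrative only: the function name `loglog_slope` and the synthetic data are assumptions, not taken from the slides or from [FFF99].

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic frequencies that follow freq = 1000 * degree^(-2.15) exactly.
degrees = list(range(1, 51))
freqs = [1000.0 * d ** -2.15 for d in degrees]
slope = loglog_slope(degrees, freqs)
```

On real degree counts the tail is noisy, so a plain line fit like this is only a rough first estimate.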

12
Power-law: rank exponent R
[Figure: outdegree vs. rank (nodes ranked in decreasing outdegree order) on log-log axes (Dec '98); exponent = slope, R = -0.74]
  • The plot is a line in log-log scale

13
Eigenvalues
  • Let A be the adjacency matrix of graph
  • An eigenvalue $\lambda$ satisfies
  • $A v = \lambda v$, where v is some (nonzero) vector
  • Eigenvalues are strongly related to graph
    topology

14
Power-law: eigenvalue exponent E
  • Eigenvalues in decreasing order

[Figure: eigenvalue vs. rank of decreasing eigenvalue on log-log axes (Dec '98); exponent = slope, E = -0.48]
15
The Node Neighborhood
  • N(h) = # of pairs of nodes within h hops
  • Let average degree = 3
  • How many neighbors should I expect within 1, 2, ..., h
    hops?
  • Potential answer
  • 1 hop → 3 neighbors
  • 2 hops → 3 × 3
  • h hops → $3^h$

16
The Node Neighborhood
  • N(h) = # of pairs of nodes within h hops
  • Let average degree = 3
  • How many neighbors should I expect within 1, 2, ..., h
    hops?
  • Potential answer
  • 1 hop → 3 neighbors
  • 2 hops → 3 × 3
  • h hops → $3^h$

WRONG!
WE HAVE DUPLICATES!
17
The Node Neighborhood
  • N(h) = # of pairs of nodes within h hops
  • Let average degree = 3
  • How many neighbors should I expect within 1, 2, ..., h
    hops?
  • Potential answer
  • 1 hop → 3 neighbors
  • 2 hops → 3 × 3
  • h hops → $3^h$

WRONG x 2!
avg degree meaningless!
18
Power-law: hop-plot exponent H
[Figures: # of pairs vs. hops on log-log axes, for Dec '98 and for router level '95; fitted exponents H = 2.83 and H = 4.86]
  • Pairs of nodes as a function of hops: $N(h) \propto h^H$
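
A hop-plot can be computed with one breadth-first search per node; the visited set is what removes the duplicates that break the naive $3^h$ estimate. This is a minimal sketch under assumptions: the adjacency-dict representation and the name `hop_plot` are illustrative, and N(h) here counts ordered pairs of distinct nodes.

```python
from collections import deque

def hop_plot(adj, max_hops):
    """Return [N(1), ..., N(max_hops)], where N(h) counts ordered pairs
    (u, v), u != v, whose shortest-path distance is at most h."""
    counts = [0] * (max_hops + 1)
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in adj[u]:
                if v not in dist:  # each node is counted once per source
                    dist[v] = dist[u] + 1
                    counts[dist[v]] += 1
                    queue.append(v)
    # Turn "exactly h hops" counts into cumulative "within h hops" counts.
    total, result = 0, []
    for h in range(1, max_hops + 1):
        total += counts[h]
        result.append(total)
    return result

# A 5-node undirected path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pairs_within = hop_plot(path, 4)
```

For the path graph, N(h) grows roughly linearly in h, in line with the one-dimensional intuition for the hop exponent.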

19
Observation
  • Q: Intuition behind hop exponent?
  • A: "intrinsic = fractal dimensionality" of the
    network

[Figure: example networks with $N(h) \propto h^1$ and $N(h) \propto h^2$]
20
Any other laws?
  • The Web looks like a "bow-tie" [Kumar1999]
  • IN, SCC, OUT, tendrils
  • Disconnected components

21
Generators
  • How to generate graphs from a realistic
    distribution?
  • Difficulty: simultaneously preserving many local
    and global properties seen in realistic graphs
  • Erdős-Rényi: switch on each edge independently
    with some probability
  • Problem: degree distribution not power-law
  • Degree-based
  • Process-based (preferential attachment)

22
Degree-based generator
  • Fix the degree distribution (e.g., Zipf)
  • Assign degrees to nodes
  • Add matching edges to satisfy degrees
  • No direct control over other properties
  • ACL model [AielloCL2000]

23
Process-based: preferential attachment
  • Start with a clique with m nodes
  • Add one node v at every time step
  • v makes m links to old nodes
  • Suppose old node u has degree d(u)
  • Let $p_u = d(u) / \sum_w d(w)$
  • v invokes a multinomial distribution defined by
    the set of $p_u$'s
  • And links to whichever u's show up
  • At time t, there are m + t nodes, mt links
  • What is the degree distribution?
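
The process above can be simulated directly. The sketch below is an illustration under assumptions, not the generator used for the slides: it requires m ≥ 2 so the starting clique has edges, it allows repeated targets (as a multinomial draw would), and the function name and seed handling are made up for the example.

```python
import random

def preferential_attachment(m, steps, seed=0):
    """Start with a clique on m nodes (m >= 2), then add one node per step
    that makes m links to old nodes sampled proportionally to degree."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks node u with probability d(u) / sum_w d(w).
    endpoints = [x for e in edges for x in e]
    for step in range(steps):
        new = m + step
        targets = [rng.choice(endpoints) for _ in range(m)]  # repeats possible
        for u in targets:
            edges.append((new, u))
            endpoints.extend((new, u))
    return edges

edges = preferential_attachment(m=3, steps=100)
```

Sampling from the endpoint list avoids recomputing the degree distribution at every step.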

24
Preferential attachment analysis
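  • $k_i(t)$ = degree of node i at time t
  • Discrete random variable
  • Approximate as continuous random variable
  • Let $\mu_i(t) = E(k_i(t))$, expectation over random
    linking choices
  • At time t, the infinitesimal expected growth rate
    of $\mu_i(t)$ is, by linearity of expectation,
    $\frac{d\mu_i(t)}{dt} = m \cdot \frac{\mu_i(t)}{2mt} = \frac{\mu_i(t)}{2t}$
    (m degrees to add; 2mt = total degree at t)
  • Solving with $\mu_i(t_i) = m$ gives $\mu_i(t) = m \sqrt{t / t_i}$,
    where $t_i$ is the time at which node i was born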
25
Preferential attachment, continued
  • Expected degree of each node grows as square-root
    of age
  • Let the current time be t
  • A node must be old enough for its degree to be
    large: for $\mu_i(t) > k$, we need $t_i < m^2 t / k^2$
  • Therefore, the fraction of nodes with degree
    larger than k is about $m^2 t / (k^2 (m + t)) \approx m^2 / k^2$
  • Differentiating in k: $\Pr(\text{degree} = k) \propto \text{const}/k^3$ (exponent in data is closer to 2)

26
Bipartite cores
  • Basic preferential attachment does not
  • Explain dense/complete bipartite cores (100,000s
    in a O(20 million)-page crawl)
  • Account for influence of search engines
  • The story isn't over yet

[Figure: number of n×m cores vs. log m, one curve per n from n = 2 to n = 7; example shown: a 2×3 core]
27
Other process-based generators
  • Copying model [KumarRRTU2000]
  • New node v picks old reference node r u.a.r.
  • v adds k new links to old nodes; for the ith link:
  • W.p. $\alpha$, add a link to an old node picked u.a.r.
  • W.p. $1 - \alpha$, copy the ith link from r
  • More difficult to analyze
  • Reference node → compression techniques!
  • H.O.T.: connect to closest, high-connectivity
    neighbor [Fabrikant2002]
  • Winner does not take all [Pennock2002]

28
Reference-based graph compression
  • Well-motivated: pack graph into limited fast
    memory for, e.g., query-time Web analysis
  • Standard approach
  • Assign integer IDs to URLs, lexicographic order
  • Delta or Gamma encoding of outlink IDs
  • If link-copying is rampant, should be able to
    compress Outlinks(u) by recording
  • A reference node r
  • Outlinks(r) Δ Outlinks(u), the symmetric difference: the "correction"
  • Finding r: what's optimal? practical? [Adler,
    Mitzenmacher, Boldi, Vigna 2002-2004]

29
Reference-based compression, cont'd
  • r is a candidate reference for u if
    Outlinks(r) ∩ Outlinks(u) is large enough
  • Given G, construct G' in which
  • Directed edge from r to u with edge cost = number
    of bits needed to write down Outlinks(u) Δ
    Outlinks(r)
  • Dummy node z; z has no outlinks in G
  • z connected to each u in G'
  • cost(z,u) = # bits to write Outlinks(u) w/o reference
  • Shortest path tree rooted at z
  • In practice, pick "recent" r: 2.58 bits/link

30
Summary: power laws are everywhere
  • Bible: rank vs. word frequency
  • Length of file transfers [Bestavros]
  • Web hit counts [Huberman]
  • Click-stream data [Montgomery01]
  • Lotka's law of publication count (CiteSeer data)

31
Resources
  • Generators
  • R-MAT: deepay_at_cs.cmu.edu
  • BRITE: www.cs.bu.edu/brite/
  • INET: topology.eecs.umich.edu/inet
  • Visualization tools
  • Graphviz: www.graphviz.org
  • Pajek: vlado.fmf.uni-lj.si/pub/networks/pajek
  • Kevin Bacon web site: www.cs.virginia.edu/oracle
  • Erdös numbers etc.

32
R-MAT: Recursive MATrix generator
  • Goals
  • Power-law in- and out-degrees
  • Power-law eigenvalues
  • Small diameter (six degrees of separation)
  • Simple, few parameters
  • Approach
  • Subdivide the adjacency matrix
  • Choose a quadrant with probability (a,b,c,d)

33
R-MAT algorithm, cont'd
  • Subdivide the adjacency matrix
  • Choose a quadrant with probability (a,b,c,d)
  • Recurse till we reach a 1×1 cell
  • By construction
  • Rich gets richer for in- and out-degree
  • Self-similar (communities within communities)
  • Small diameter
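
The recursive quadrant choice can be sketched in a few lines. This is an illustration under assumptions rather than the released R-MAT code: the quadrant layout (a = top-left, b = top-right, c = bottom-left, d = bottom-right), the default probabilities, and the function names are all chosen for the example, and duplicate edges are simply discarded.

```python
import random

def rmat_edge(scale, a, b, c, d, rng):
    """Pick one cell (row, col) of a 2^scale x 2^scale adjacency matrix by
    recursively choosing a quadrant with probabilities (a, b, c, d)."""
    row = col = 0
    for _ in range(scale):
        r = rng.random() * (a + b + c + d)
        row, col = 2 * row, 2 * col
        if r < a:
            pass                # top-left quadrant
        elif r < a + b:
            col += 1            # top-right quadrant
        elif r < a + b + c:
            row += 1            # bottom-left quadrant
        else:
            row += 1            # bottom-right quadrant
            col += 1
    return row, col

def rmat_graph(scale, num_edges, probs=(0.45, 0.15, 0.15, 0.25), seed=0):
    """Draw cells until num_edges distinct directed edges are collected."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:
        edges.add(rmat_edge(scale, *probs, rng))
    return edges

edges = rmat_graph(scale=4, num_edges=50)
```

A skewed choice such as a > d > b, c concentrates edges in one corner at every level of the recursion, which is where the rich-get-richer and self-similar behavior comes from.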

[Figure: adjacency matrix recursively subdivided into quadrants a, b, c, d]
34
Evaluation on clickstream data
[Figure panels: count vs. indegree; count vs. outdegree; hop-plot; singular value vs. rank; left "network value"; right "network value"]
R-MAT matches it well
35
Topic structure of the Web
  • Measure correlations between link proximity and
    content similarity
  • How to characterize topics?
  • Started with http://dmoz.org
  • Keep pruning until all leaf topics have enough
    (>300) samples
  • Approx 120k sample URLs
  • Flatten to approx 482 topics
  • Train a text classifier
  • Characterize new document d as a vector of
    probabilities $p_d = (\Pr(c|d)\ \forall c)$

[Figure: test doc → classifier]
36
Sampling the background topic distrib.
  • What fraction of Web pages are about
    /Health/Diabetes?
  • How to sample the Web?
  • Invoke the random surfer model (Pagerank)
  • Walk from node to node
  • Sample trail adjusting for Pagerank
  • Modify Web graph to do better sampling
  • Self loops
  • Bidirectional edges

37
Convergence
  • Start from pairs of diverse topics
  • Two random walks, sample from each walk
  • Measure distance between topic distributions
  • L1 distance $|p_1 - p_2| = \sum_c |p_1(c) - p_2(c)|$ in
    [0, 2]
  • Below .05-.2 within 300-400 physical pages

38
Biases in topic directories
  • Use Dmoz to train a classifier
  • Sample the Web
  • Classify samples
  • Diff Dmoz topic distribution from Web sample
    topic distribution
  • Report maximum deviation in fractions
  • NOTE: not exactly Dmoz

39
Topic-specific degree distribution
  • Preferential attachment: connect u to v w.p.
    proportional to the degree of v, regardless of
    topic
  • More realistic: u has a topic, and links to v
    with related topics
  • Unclear if power-law should be upheld

[Figure: intra-topic linkage vs. inter-topic linkage]
40
Random forward walk without jumps
  • Sampling walk is designed to mix topics well
  • How about walking forward without jumping?
  • Start from a page $u_0$ on a specific topic
  • Forward random walk $(u_0, u_1, \ldots, u_i, \ldots)$
  • Compare $(\Pr(c|u_i)\ \forall c)$ with $(\Pr(c|u_0)\ \forall c)$ and with
    the background distribution

41
Observations and implications
  • Forward walks wander away from starting topic
    slowly
  • But do not converge to the background
    distribution
  • Global PageRank OK also for topic-specific
    queries
  • Jump parameter d = .1-.2
  • Topic drift not too bad within path length of
    5-10
  • Prestige conferred mostly by same-topic neighbors
  • Also explains why focused crawling works

[Figure: random surfer. W.p. d, jump to a random node; w.p. (1-d), jump to an out-neighbor u.a.r.; a high-prestige node is marked]
42
Citation matrix
  • Given a page is about topic i, how likely is it
    to link to topic j?
  • Matrix C[i,j] = probability that page about topic
    i links to page about topic j
  • Soft counting: C[i,j] += Pr(i|u) Pr(j|v) for each link u → v
  • Applications
  • Classifying Web pages into topics
  • Focused crawling for topic-specific pages
  • Finding relations between topics in a directory
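
Soft counting can be sketched as follows: every link contributes the product of the source and target topic probabilities to each matrix entry. The row normalization at the end is an assumption added so that each row reads as a probability distribution; the function name and the toy classifier outputs are likewise illustrative.

```python
def soft_citation_matrix(links, topic_probs, num_topics):
    """Accumulate C[i][j] += Pr(i|u) * Pr(j|v) over links u -> v, then
    normalize each row so C[i][j] reads as Pr(target topic j | source topic i)."""
    C = [[0.0] * num_topics for _ in range(num_topics)]
    for u, v in links:
        for i, pu in enumerate(topic_probs[u]):
            for j, pv in enumerate(topic_probs[v]):
                C[i][j] += pu * pv
    for row in C:
        total = sum(row)
        if total > 0:
            for j in range(num_topics):
                row[j] /= total
    return C

# Hypothetical classifier outputs over two topics for three pages.
topic_probs = {"u": [1.0, 0.0], "v": [0.5, 0.5], "w": [0.0, 1.0]}
links = [("u", "v"), ("w", "w")]
C = soft_citation_matrix(links, topic_probs, num_topics=2)
```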

[Figure: link from page u to page v]
43
Citation, confusion, correction
Classifier's confusion on held-out documents can
be used to correct confusion matrix
[Figure: two matrices over the topics Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports. One has axes "from topic" and "to topic"; the other has axes "true topic" and "guessed topic"]
44
Fine-grained views of citation
Prominent off-diagonal entries raise
design issues for taxonomy editors and maintainers
Clear block-structure derived from coarse-grain
topics
Strong diagonals reflect tightly-knit topic
communities
45
Outline, some more detail
  • Part 1 (Modeling graphs)
  • What do real-life graphs look like?
  • What laws govern their formation, evolution and
    properties?
  • What structural analyses are useful?
  • Part 2 (Analyzing graphs)
  • Modeling data analysis problems using graphs
  • Proposing parametric models
  • Estimating parameters
  • Applications from Web search and text mining

46
Centrality and prestige
47
How important is a node?
  • Degree, min-max radius, ...
  • Pagerank
  • Maximum entropy network flows
  • HITS and stochastic variants
  • Stability and susceptibility to spamming
  • Hypergraphs and nonlinear systems
  • Using other hypertext properties
  • Applications: ranking, crawling, clustering,
    detecting obsolete pages

48
Importance/prestige as Pagerank
  • A node is important if it is connected to
    important nodes
  • Random surfer walks along links forever
  • If current node has 3 outlinks, take each with
    probability 1/3
  • Importance = steady-state probability (ssp) of
    visit
  • "Maxwell's equation for the Web"

49
(Simplified) Pagerank algorithm
  • Let A be the node adjacency matrix
  • Column-normalize $A^T$
  • Want vector p such that $A^T p = p$
  • I.e. p is the eigenvector corresponding to the
    largest eigenvalue, which is 1
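
The fixed point can be approached by power iteration: repeatedly push each node's score evenly along its outlinks. This is a minimal sketch assuming every node has at least one outlink and the graph is irreducible and aperiodic; the function name and the three-node graph are illustrative.

```python
def simplified_pagerank(out_links, iterations=100):
    """Power iteration for p = A^T p, with A row-normalized by outdegree
    (equivalently, A^T column-normalized). No teleportation."""
    nodes = list(out_links)
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            share = p[u] / len(out_links[u])  # split p[u] evenly over outlinks
            for v in out_links[u]:
                nxt[v] += share
        p = nxt
    return p

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
p = simplified_pagerank(graph)
```

For this small graph the scores settle near a = 0.4, b = 0.2, c = 0.4, which can be checked by hand against the equation for p.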

50
Intuition
  • A as vector transformation

[Figure: a 2×2 matrix A maps a vector x to a new vector x']
51
Intuition
  • By defn., eigenvectors remain parallel to
    themselves (fixed points)

[Figure: $A v_1 = \lambda_1 v_1$ with $\lambda_1 = 3.62$; $v_1$ keeps its direction]

52
Convergence
  • Usually, fast
  • Depends on the ratio $\lambda_1 : \lambda_2$

53
Eigenvalues and epidemic networks
  • Will a virus spreading across an arbitrary network
    create an epidemic?
  • Susceptible-Infected-Susceptible (SIS)
  • Was healthy, did not get infected
  • Was infected, got cured without further attack
  • Was infected, got cured immediately after attack
  • Cured nodes immediately become susceptible

54
The infection model
  • (virus) Birth rate $\beta$: probability that an
    infected neighbor attacks
  • (virus) Death rate $\delta$: probability that an
    infected node heals

55
Working parameters
  • $p_{i,t}$ = prob. node i is infected at time t
  • Node i is healthy at time t in one of three cases:
  • Was healthy, did not get infected
  • Was infected, got cured without further attack
  • Was infected, got attacks from neighbors and yet
    cured itself in this time step
    (somewhat arbitrary)

56
Time recurrence for pi,t
  • Assuming various probabilities are suitably
    small
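  • Recurrence of the probability of infection can be
    approximated by this linear form:
    $p_{i,t} \approx (1 - \delta)\, p_{i,t-1} + \beta \sum_j A_{ij}\, p_{j,t-1}$
  • In other words, the (symmetric) transition matrix
    is $S = \beta A + (1 - \delta) I$, so that $p_t \approx S\, p_{t-1}$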

57
Epidemic threshold
  • Virus strength $s = \beta/\delta$
  • Epidemic threshold of a graph is defined as the
    value $\tau$ such that if strength $s = \beta/\delta < \tau$,
    then an epidemic cannot happen
  • Problem: compute the epidemic threshold

58
Epidemic threshold $\tau$
  • What should $\tau$ depend on?
  • avg. degree? and/or highest degree?
  • and/or variance of degree?
  • and/or third moment of degree?

59
Analysis
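  • Eigenvectors of S and A are the same
  • Eigenvalues are shifted and scaled: $\lambda_{i,S} = 1 - \delta + \beta \lambda_{i,A}$
  • From spectral decomposition, $p_t \approx S^t p_0 = \sum_i \lambda_{i,S}^t\, u_i u_i^T p_0$
  • A sufficient condition for infection dying down
    is that $\lambda_{1,S} < 1$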

60
Epidemic threshold
  • An epidemic must die down if $\beta/\delta < \tau = 1/\lambda_{1,A}$
  • $\beta$: attack prob.
  • $\delta$: recovery prob.
  • $\tau$: epidemic threshold
  • $\lambda_{1,A}$: largest eigenvalue of adj. matrix A
  • Proof: [Wang03]
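
Since the threshold is the reciprocal of the largest adjacency eigenvalue, it can be estimated with power iteration. The sketch below assumes a connected, undirected graph stored as an adjacency dict; the function names are illustrative, and iterating on A + I (rather than A) is a choice made here to avoid oscillation on bipartite graphs.

```python
import math

def largest_eigenvalue(adj, iterations=200):
    """Power iteration on A + I for a symmetric 0/1 adjacency matrix.
    Adding I shifts every eigenvalue by 1 without changing eigenvectors."""
    nodes = list(adj)
    x = {u: 1.0 for u in nodes}
    lam = 0.0
    for _ in range(iterations):
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {u: val / norm for u, val in y.items()}
        # Rayleigh quotient x^T A x for the unit-length vector x.
        lam = sum(x[u] * sum(x[v] for v in adj[u]) for u in nodes)
    return lam

def epidemic_threshold(adj):
    """Threshold on attack prob. / recovery prob.: 1 / largest eigenvalue."""
    return 1.0 / largest_eigenvalue(adj)

# Star graph: hub 0 connected to four leaves; largest eigenvalue is sqrt(4) = 2.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
tau = epidemic_threshold(star)
```

The star has average degree 1.6 but a threshold of 0.5, which shows that average degree alone does not determine the threshold.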
61
Experiments
$\beta/\delta > \tau$ (above threshold)
$\beta/\delta = \tau$ (at the threshold)
$\beta/\delta < \tau$ (below threshold)
62
Remarks
  • Primal problem: topology design
  • Design a network resilient to infection
  • Dual problem: viral marketing
  • Selectively convert important customers who can
    influence many others
  • Will come back to this later

63
Back to the random surfer
  • Practical issues with Pagerank [BrinP1997]
  • PR converges only if E is aperiodic and
    irreducible; make it so
  • d is the (tuned) probability of teleporting to
    one of N pages uniformly at random (0.1-0.2 in
    practice)
  • (Possibly) unintended consequences: topic
    sensitivity, stability
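
Teleportation changes only one line of the power iteration: every node receives a uniform share d/N, and the rest of the score flows along outlinks. This is a sketch under assumptions: the function name is illustrative, d = 0.15 is just a value inside the quoted range, and the treatment of pages without outlinks (spread evenly over all pages) is a choice the slides do not specify.

```python
def pagerank(out_links, d=0.15, iterations=100):
    """With probability d jump to one of the N pages uniformly at random;
    otherwise follow a uniformly random outlink of the current page."""
    nodes = list(out_links)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: d / n for u in nodes}
        for u in nodes:
            targets = out_links[u] or nodes  # dead end: spread evenly (assumption)
            share = (1.0 - d) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        p = nxt
    return p

# "a" and "b" form a 2-cycle and nothing links to "c", so the plain walk is
# neither aperiodic nor irreducible; teleportation repairs both.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

Page "c" ends up with exactly the teleport share d/N, because no link contributes to it.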

64
Prestige as network flow
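  • $y_{ij}$ = # surfers clicking from i to j per unit time
  • Hits per unit time on page j is $H_j = \sum_i y_{ij}$
  • Flow is conserved at each page j: $\sum_i y_{ij} = \sum_k y_{jk}$
  • The total traffic is $Y = \sum_{i,j} y_{ij}$
  • Normalize: $p_{ij} = y_{ij} / Y$. Can interpret $p_{ij}$ as a probability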
  • Standard Pagerank corresponds to one solution
  • Many other solutions possible

65
Maximum entropy flow [Tomlin2003]
  • Flow conservation modeled using feature
  • And the constraints
  • Goal is to maximize the entropy subject to these constraints
  • Solution has form
  • ?i is the hotness of page i

66
Maxent flow results
  • Hotness ranking is better than Pagerank; $H_i$ ranking
    is worse
  • Two IBM intranet data sets with known top URLs
  • Depth up to which dmoz.org URLs are used as
    ground truth

[Figure: average rank of known top URLs (axes scaled by $10^6$ and $10^8$) when sorted by Pagerank, by $H_i$, and by hotness; smaller rank is better]
67
HITS [Kleinberg1997]
  • Two kinds of prestige
  • Good hubs link to good authorities
  • Good authorities are linked to by good hubs
  • In matrix notation, iterations amount to $a \leftarrow E^T h$ and $h \leftarrow E a$, with
    interleaved normalization of h and a
  • Note that scores are copied, not divided
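
The hub/authority iteration can be sketched as two passes over the edge list per round, each followed by normalization. The edge-list representation, unit-length (Euclidean) normalization, and the names below are assumptions made for the example.

```python
import math

def hits(edges, iterations=50):
    """Alternate authority and hub updates over directed edges (u, v),
    normalizing each score vector to unit Euclidean length."""
    nodes = sorted({x for e in edges for x in e})
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        auth = {u: 0.0 for u in nodes}
        for u, v in edges:          # each hub copies its full score to v
            auth[v] += hub[u]
        norm = math.sqrt(sum(s * s for s in auth.values()))
        auth = {u: s / norm for u, s in auth.items()}
        hub = {u: 0.0 for u in nodes}
        for u, v in edges:          # each authority copies its score back to u
            hub[u] += auth[v]
        norm = math.sqrt(sum(s * s for s in hub.values()))
        hub = {u: s / norm for u, s in hub.items()}
    return hub, auth

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, auth = hits(edges)
```

Note that a hub passes its whole score to every page it links to, rather than splitting it over its outlinks as the random surfer does.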

68
HITS graph acquisition
  • Steps
  • Get root set via keyword query
  • Expand by one move forward and backward
  • Drop same-site links (why?)
  • Each node assigned both a hub and an auth score
  • Graph tends to be topic-specific
  • Whereas Pagerank is usually run on the entire
    Web (but need not be)

69
HITS vs. Pagerank, more comments
  • Dominant eigenvectors of different matrices
  • HITS: eigensystems of $E E^T$ (h) and $E^T E$ (a)
  • Pagerank: eigensystem of the normalized, teleport-adjusted link matrix
  • HITS copies scores, Pagerank distributes
  • HITS drops same-site links, Pagerank does (can?)
    not
  • Implications?

70
HITS: dyadic interpretation [CohnC2000]
  • Graph includes many communities z
  • Query "Jaguar" gets auto, game, animal links
  • Each URL is represented as two things
  • A document d
  • A citation c
  • Max
  • Guess number of aspects z and use [Hofmann 1999]
    to estimate $\Pr(c|z)$
  • These are the most authoritative URLs

71
Dyadic results for "Machine learning"
Clustering based on citations + ranking within
clusters
72
Spamming link-based ranking
  • Recipe for spamming HITS
  • Create a hub linking to genuine authorities
  • Then mix in links to your customers' sites
  • Highly susceptible to adversarial behavior
  • Recipe for spamming Pagerank
  • Buy a bunch of domains, cloak IP addresses
  • Host a site at each domain
  • Sprinkle a few links at random per page to other
    sites you own
  • Takes more work than spamming HITS

73
Stability of link analysis [NgZJ2001]
  • Compute HITS authority scores and Pagerank
  • Delete 30% of nodes/links at random
  • Recompute and compare ranks; repeat
  • Pagerank ranks more stable than HITS authority
    ranks
  • Why?
  • How to design more stable algorithms?

[Figure: rank comparisons across trials for HITS authority and for Pagerank]
74
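Stability depends on graph and params
  • Auth score is eigenvector for $E^T E = S$, say
  • Let $\lambda_1 > \lambda_2$ be the first two eigenvalues
  • There exists an $S'$ such that
  • S and S' are close: $\|S - S'\|_F = O(\lambda_1 - \lambda_2)$
  • But $\|u_1 - u_1'\|_2 = \Omega(1)$
  • Pagerank p is eigenvector of $(\epsilon U + (1 - \epsilon) E)^T$ (E row-normalized)
  • U is a matrix full of 1/N and $\epsilon$ is the jump prob
  • If set C of nodes are changed in any way, the new
    Pagerank vector p' satisfies $\|p' - p\|_1 \le \frac{2}{\epsilon} \sum_{i \in C} p_i$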

75
Randomized HITS
  • Each half-step, with probability $\epsilon$, teleport to a
    node chosen uniformly at random
  • Much more stable than HITS
  • Results more meaningful
  • $\epsilon$ near 1 will always stabilize
  • Here $\epsilon$ was 0.2

[Figure: rank comparisons for randomized HITS and for Pagerank]
76
Another random walk variation of HITS
  • SALSA: stochastic HITS [Lempel2000]
  • Two separate random walks
  • From authority to authority via hub
  • From hub to hub via authority
  • Transition probability $\Pr(a_i \to a_j)$
  • If transition graph is irreducible, the authority score of a node is proportional to its indegree
  • For disconnected components, depends on relative
    size of bipartite cores
  • Avoids dominance of larger cores
77
SALSA sample result (movies)
HITS: the Tightly-Knit Community (TKC) effect
SALSA: less TKC influence (but no reinforcement!)
78
Links in relational data [GibsonKR1998]
  • (Attribute, value) pair is a node
  • Each node v has weight $w_v$
  • Each tuple is a hyperedge
  • Tuple r has weight $x_r$
  • HITS-like iterations to update weight $w_v$
  • For each tuple
  • Update weight
  • Combining operator $\oplus$ can be sum, max, product, $L_p$
    avg, etc.

79
Distilling links in relational data
[Figure: example attribute nodes: Theory, Database, Author, Author, Forum, Year]
80
Searching and annotating graph data
81
Searching graph data
  • Nodes in graph contain text
  • Random → intelligent surfer [RichardsonD2001]
  • Topic-sensitive Pagerank [Haveliwala2002]
  • Assigning image captions using random walks
    [PanYFD2004]
  • Rotting pages and links [BarYossefBKT2004]
  • Query is a set of keywords
  • All keywords may not match a single node
  • Implicit joins [Hulgeri2001, Agrawal2002]
  • Or rank aggregation [Balmin2004] required

82
Intelligent Web surfer
  • Query = set of words
  • Pick a query word (keyword) q per some distribution, e.g. IDF
  • Probability of teleporting to node j
  • Probability of walking from i to j wrt q: pick out-link to walk on in
    proportion to relevance of target out-neighbor
  • Relevance of node k wrt q
83
Implementing the intelligent surfer
  • $PR_Q(j)$ approximates a walk that picks a query
    keyword using Pr(q) at every step
  • Precompute and store $PR_q(j)$ for each keyword q in
    lexicon: space blowup = avg doc length
  • Query-dependent PR rated better by volunteers

84
Topic-sensitive Pagerank
  • High overhead for per-word Pagerank
  • Instead, compute Pageranks for some collection of
    broad topics: $PR_c(j)$
  • Topic c has sample page set $S_c$
  • Walk as in Pagerank
  • Jump to a node in $S_c$ uniformly at random
  • Project query onto set of topics
  • Rank responses by projection-weighted Pageranks

85
Topic-sensitive Pagerank results
  • Users prefer topic-sensitive Pagerank on most
    queries to global Pagerank + keyword filter

86
Image captioning
  • Segment images into regions
  • Image has caption words
  • Three-layer graph: image, regions, caption words
  • Threshold on region similarity to connect regions
    (dotted)

87
Random walks with restarts
[Figure: three-layer graph of images, regions, and words, with the test image attached]
  • Find regions in test image
  • Connect regions to other nodes in the region
    layer using region similarity
  • Random walk, restarting at test image node
  • Pick words with largest visit probability

88
More random walks: link rot
  • How stale is a page?
  • Last-mod unreliable
  • Automatic dead-link cleaners mask disuse
  • A page is completely stale if it is dead
  • Let D be the set of pages which cannot be
    accessed (404 and other problems)
  • How stale is a page u? Start with p ← u
  • If p ∈ D, declare decay value of u to be 1, else
  • With probability $\sigma$, declare decay value of u to be 0
  • W.p. $1 - \sigma$, choose outlink v, set p ← v, loop
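
The expected decay value of this walk obeys a simple recursion: 1 for a dead page, otherwise the continuation probability times the average decay of the page's outlinks. The sketch below iterates that recursion to a fixed point. It rests on assumptions: the stopping probability is called `sigma`, outlinks are chosen uniformly, a live page without outlinks gets decay 0, and every link target appears as a key in the graph dict.

```python
def decay_scores(out_links, dead, sigma=0.1, iterations=100):
    """Expected decay of the walk: 1 for a dead page, otherwise
    (1 - sigma) times the average decay of the page's outlinks."""
    decay = {u: (1.0 if u in dead else 0.0) for u in out_links}
    for _ in range(iterations):
        updated = {}
        for u, targets in out_links.items():
            if u in dead:
                updated[u] = 1.0
            elif not targets:
                updated[u] = 0.0  # live page without outlinks (assumption)
            else:
                avg = sum(decay[v] for v in targets) / len(targets)
                updated[u] = (1.0 - sigma) * avg
        decay = updated
    return decay

# Hypothetical four-page site; only "gone" is dead.
web = {"home": ["old", "news"], "old": ["gone"], "news": [], "gone": []}
scores = decay_scores(web, dead={"gone"})
```

Here "home" has no dead outlinks of its own yet still receives a positive decay score through "old".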

89
Page staleness results
[Figure: decay score compared with fraction of 404s]
  • Decay score is correlated with, but generally
    larger than the fraction of dead outlinks on a
    page
  • Removing direct dead links automatically does not
    eliminate live but rotting pages

90
Graph proximity search: two paradigms
  • A single node as query response
  • Find node that matches query terms
  • or is near nodes matching query terms
  • [Goldman 1998]
  • A connected subgraph as query response
  • Single node may not match all keywords
  • No natural "page boundary" [Bhalotia2002]
    [Agrawal2002]

91
Single-node response examples
  • Query: Travolta, Cage → response: Actor, Face/Off
  • Query: Travolta, Cage, Movie → response: Face/Off
  • Query: Kleiser, Movie → response: Gathering, Grease
  • Query: Kleiser, Woo, Actor → response: Travolta

[Figure: data graph with is-a, acted-in, and directed edges over the nodes Movie, Actor, Director, Face/Off, Grease, Gathering, Travolta, Cage, A3, Kleiser, Woo]
92
Basic search strategy
  • Node subset A activated because they match query
    keyword(s)
  • Look for node near nodes that are activated
  • Goodness of response node depends
  • Directly on degree of activation
  • Inversely on distance from activated node(s)

93
Proximity query screenshot
http://www.cse.iitb.ac.in/banks/
94
Ranking a single node response
  • Activated node set A
  • Rank node r in response set R based on
    proximity to nodes a in A
  • Nodes have relevance $\rho_R$ and $\rho_A$ in [0,1]
  • Edge costs are "specified by the system"
  • d(a,r) = cost of shortest path from a to r
  • Bond between a and r: $b(a,r) = \rho_A(a)\, \rho_R(r) / d(a,r)^t$
  • Parameter t tunes relative emphasis on distance
    and relevance score
  • Several ad-hoc choices

95
Scoring single response nodes
  • Additive
  • Belief
  • Goal: list a limited number of "find" nodes with
    the largest scores
  • Performance issues
  • Assume the graph is in memory?
  • Precompute all-pairs shortest path ($O(|V|^3)$)?
  • Prune unpromising candidates?

96
Hub indexing
  • Decompose APSP problem using sparse vertex cuts
  • |A| + |B| shortest paths to p
  • |A| + |B| shortest paths to q
  • d(p,q)
  • To find d(a,b), compare
  • d(a → p → b), not through q
  • d(a → q → b), not through p
  • d(a → p → q → b)
  • d(a → q → p → b)
  • Greatest savings when |A| ≈ |B|
  • Heuristics to find cuts, e.g. large-degree nodes

[Figure: node sets A and B separated by cut vertices p and q, with a in A and b in B]
97
ObjectRank [Balmin2004]
  • Given a data graph with nodes having text
  • For each keyword precompute a keyword-sensitive
    Pagerank [RichardsonD2001]
  • Score of a node for multiple keyword search based
    on fuzzy AND/OR
  • Approximation to Pagerank of node with restarts
    to nodes matching keywords
  • Use Fagin-merge [Fagin2002] to get best nodes in
    data graph

98
Connected subgraph as response
  • Single node may not match all keywords
  • No natural page boundary
  • On-the-fly joins make up a response page
  • Two scenarios
  • Keyword search on relational data
  • Keywords spread among normalized relations
  • Keyword search on XML-like or Web data
  • Keywords spread among DOM nodes and subtrees

99
Keyword search on relational data
  • Tuple = node
  • Some columns have text
  • Foreign key constraints = edges in schema graph
  • Query = set of terms
  • No natural notion of a document
  • Normalization
  • Join may be needed to generate results
  • Cycles may exist in schema graph: "Cites"

[Schema figure:]
Cites(Citing, Cited, ...)
Paper(PaperID, PaperName, ...)
Writes(AuthorID, PaperID, ...)
Author(AuthorID, AuthorName, ...)
100
DBXplorer and DISCOVER
  • Enumerate subsets of relations in schema graph
    which, when joined, may contain rows which have
    all keywords in the query
  • Join trees derived from schema graph
  • Output SQL query for each join tree
  • Generate joins, checking rows for matches
  • [Agrawal2001, Hristidis2002]

[Figure: tables T1 to T5 in the schema graph with keyword matches K1, K2, K3, and the candidate join trees over them]
101
Discussion
  • Exploits relational schema information to contain
    search
  • Pushes final extraction of joined tuples into
    RDBMS
  • Faster than dealing with full data graph directly
  • Coarse-grained ranking based on schema tree
  • Does not model proximity or (dis)similarity of
    individual tuples
  • No recipe for data with less regular (e.g. XML)
    or ill-defined schema

102
Motivation from Web search
  • "Linux modem driver for a Thinkpad A22p"
  • Hyperlink path matches query collectively
  • Conjunction query would fail
  • Projects where X and P work together
  • Conjunction may retrieve wrong page
  • General notion of graph proximity

[Figure: two hyperlinked page hierarchies. (1) IBM Thinkpads: A20m, A22p; Thinkpad Drivers: Windows XP, Linux; Download, Installation tips; Modem, Ethernet. (2) The B System: Group members P, S, X; Home Page of Professor X: Papers (VLDB), Students (P, Q); P's home page: "I work on the B project."]
103
Data structures for search
  • Answer tree with at least one leaf containing
    each keyword in query
  • Group Steiner tree problem, NP-hard
  • Query term t found in source nodes $S_t$
  • Single-source-shortest-path (SSSP) iterator
  • Initialize with a source (near-) node
  • Consider edges backwards
  • getNext() returns next nearest node
  • For each iterator, each visited node v maintains
    for each t a set $v.R_t$ of nodes in $S_t$ which have
    reached v

104
Generic expanding search
  • Near node sets $S_t$ with $S = \bigcup_t S_t$
  • For all source nodes $\sigma \in S$
  • create a SSSP iterator with source $\sigma$
  • While more results required
  • Get next iterator and its next-nearest node v
  • Let t be the term for the iterator's source s
  • crossProduct = $\{s\} \times \prod_{t' \ne t} v.R_{t'}$
  • For each tuple of nodes in crossProduct
  • Create an answer tree rooted at v with paths to
    each source node in the tuple
  • Add s to $v.R_t$

105
Search example ("Vu Kleinberg")
[Figure: data graph with author and paper nodes joined by writes and cites edges; the authors Quoc Vu and Jon Kleinberg match the query]
106
First response
[Figure: answer subgraph connecting Quoc Vu and Jon Kleinberg through writes and cites edges. Papers shown: "Organizing Web pages by Information Unit", "Authoritative sources in a hyperlinked environment", "A metric labeling problem". Other authors shown: Divyakant Agrawal, Eva Tardos]
107
Subgraph search screenshot
http://www.cse.iitb.ac.in/banks/
108
Similarity, neighborhood, influence
109
Why are two nodes similar?
  • What is/are the best paths connecting two nodes
    explaining why/how they are related?
  • Graph of co-starring, citation, telephone call, ...
  • Graph with nodes s and t; budget of b nodes
  • Find "best" b nodes capturing relationship
    between s and t [FaloutsosMT2004]
  • Proposing a definition of goodness
  • How to efficiently select best connections

[Figure: example connection subgraph linking Negroponte, Palmisano, Esther Dyson, and Gerstner]
110
Simple proposals that do not work
  • Shortest path
  • Pizza boy p gets same attention as g
  • Network flow
  • s → a → b → t is as good as s → g → t
  • Voltage
  • Connect 1V at s, ground t
  • Both g and p will be at 0.5V
  • Observations
  • Must reward parallel paths
  • Must reward short paths
  • Must penalize/tax pizza boys

[Figure: example graph on nodes s, a, b, g, p, t]
111
Resistive network with universal sink
  • Connect 1V to s
  • Ground t
  • Introduce universal sink
  • Grounded
  • Connected to every node
  • Universal sink is a tax collector
  • Penalizes pizza boys
  • Penalizes long paths
  • Goodness of a path is the electric current it
    carries

[Figure: example graph on nodes s, a, b, g, p, t, with a universal sink connected to every node]
112
Resistive network algorithm
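  • Ohm's law: $I(u,v) = C(u,v)\,(V(u) - V(v))$
  • Kirchhoff's current law: $\sum_u I(u,v) = 0$ for every node v other than s and t
  • Boundary conditions (without sink): $V(s) = 1$, $V(t) = 0$
  • Solution: $V(u) = \sum_v C(u,v) V(v) / \sum_v C(u,v)$ for every node u other than s and t
  • Here C(u,v) is the conductance from u to v
  • Add grounded universal sink z with V(z) = 0
  • Set the conductance C(u,z) from every node u to the sink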
  • Display subgraph carrying high current
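
Without the sink, the voltages can be found by repeatedly replacing each interior node's voltage with the conductance-weighted average of its neighbors, keeping the two boundary nodes fixed. The sketch below is an illustration under assumptions: the nested-dict network, the function names, and the small example graph are made up, and the universal sink is omitted.

```python
def voltages(conductance, source, sink, iterations=2000):
    """Repeatedly set each interior voltage to the conductance-weighted
    average of its neighbors, holding V(source) = 1 and V(sink) = 0."""
    V = {u: 0.0 for u in conductance}
    V[source] = 1.0
    for _ in range(iterations):
        for u, nbrs in conductance.items():
            if u in (source, sink):
                continue
            total = sum(nbrs.values())
            V[u] = sum(c * V[v] for v, c in nbrs.items()) / total
    return V

def current(conductance, V, u, v):
    """Ohm's law: current flowing from u to v."""
    return conductance[u][v] * (V[u] - V[v])

# Unit-conductance edges forming two parallel paths: s - g - t and s - p - t.
net = {
    "s": {"g": 1.0, "p": 1.0},
    "g": {"s": 1.0, "t": 1.0},
    "p": {"s": 1.0, "t": 1.0},
    "t": {"g": 1.0, "p": 1.0},
}
V = voltages(net, "s", "t")
```

Both middle nodes settle at 0.5V and carry equal current, so voltage alone cannot tell them apart in this network.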

113
Distributions influenced via graphs
  • Directed or undirected graph; nodes have
  • Observable properties
  • Some unobservable (random) state
  • Edges indicate that distributions over
    unobservable states are coupled
  • Many applications
  • Hypertext classification (topics are clustered)
  • Social network of customers buying products
  • Hierarchical classification
  • Labeling categorical sequences: POS/NE tagging,
    sense disambiguation, linkage analysis

114
Basic techniques
  • Directed (acyclic) graphs: Bayesian network
  • Markov networks
  • (Loopy) belief propagation
  • Conditional Markov networks
  • Avoid modeling joint distribution over observable
    and hidden properties of nodes
  • Some computationally simple special cases

115
Hypertext classification
  • Want to assign labels to Web pages
  • Text on a single page may be too little or
    misleading
  • Page is not an isolated instance by itself
  • Problem setup
  • Web graph G = (V,E)
  • Node $u \in V$ is a page having text $u_T$
  • Edges in E are hyperlinks
  • Some nodes are labeled
  • Make collective guess at missing labels
  • Probabilistic model? Benefits?

116
Graph labeling model
  • Seek a labeling f of all unlabeled nodes so as to
    maximize the probability of f given the links, the text, and the known labels
  • Let $V_K$ be the nodes with known labels and $f(V_K)$
    their known label assignments
  • Let N(v) be the neighbors of v and $N_K(v) \subseteq N(v)$ be
    neighbors with known labels
  • Markov assumption: f(v) is conditionally
    independent of rest of G given f(N(v))

117
Markov graph labeling
  • Circularity between f(v) and f(NU(v))
  • Some form of iterative Gibbs sampling or MCMC
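
One simple way to break this circularity is relaxation labeling, sketched below as a stand-in for the sampling schemes named on the slide. It keeps a soft label distribution per node and repeatedly re-estimates each unlabeled node from its text and its neighbors' current beliefs. All names (`text_prob`, `coupling`, and so on) are hypothetical, and the update assumes neighbors are independent given the node's own label.

```python
def relaxation_labeling(neighbors, text_prob, known, coupling, classes, rounds=10):
    """neighbors: node -> list of adjacent nodes
    text_prob: node -> {class: score from the node's own text}
    known:     node -> fixed label for labeled nodes
    coupling:  coupling[c][cn] = Pr(neighbor has class cn | node has class c)
    """
    belief = {}
    for v in neighbors:
        if v in known:
            belief[v] = {c: 1.0 if c == known[v] else 0.0 for c in classes}
        else:
            belief[v] = dict(text_prob[v])
    for _ in range(rounds):
        updated = {}
        for v in neighbors:
            if v in known:
                continue
            scores = {}
            for c in classes:
                score = text_prob[v][c]
                for u in neighbors[v]:
                    # Expected compatibility with neighbor u's current belief.
                    score *= sum(coupling[c][cu] * belief[u][cu] for cu in classes)
                scores[c] = score
            z = sum(scores.values())
            updated[v] = {c: s / z for c, s in scores.items()}
        belief.update(updated)  # synchronous update after each round
    return belief


classes = ["sports", "arts"]
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
known = {"a": "sports", "c": "sports"}
text_prob = {"b": {"sports": 0.4, "arts": 0.6}}
coupling = {"sports": {"sports": 0.9, "arts": 0.1},
            "arts": {"sports": 0.1, "arts": 0.9}}
belief = relaxation_labeling(neighbors, text_prob, known, coupling, classes)
```

Here the text of page b slightly favors "arts", but two neighbors labeled "sports" pull its belief toward "sports".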

118
Iterative labeling
  • Sum over all possible NU(v) labelings still too
    expensive to compute
  • In practice, prune to most likely configurations
  • Let us look at the last term more carefully

119
A generative node model
By the Markov assumption, finally we need a
distribution coupling f(v) and vT (the text on v)
and f(N(v))
  • Can use a Bayes classifier as with ordinary text:
    estimate a parametric model for the
    class-conditional joint distribution between the
    text on the page v and the labels of neighbors of
    v
  • Must make naïve Bayes assumptions to keep it
    practical

120
Pictorially
  • c = class, t = text, N = neighbors
  • Text-only model: Pr[t | c]
  • Using neighbors' text to judge my topic:
    Pr[t, t(N) | c] (hurts)
  • Better model: Pr[t, c(N) | c]
  • Estimate histograms and update based on
    neighbors' histograms

121
Generative model results
  • 9600 patents from 12 classes marked by USPTO
  • Patents have text and cite other patents
  • Expand test patent to include neighborhood
  • Forget a fraction of neighbors' classes

122
Detour: generative vs. discriminative
  • x = feature vector, y = label (0 or 1, say)
  • Generative method models Pr(x|y) or Pr(x,y): the
    generation of data x given label y
  • Use Bayes rule to get Pr(y|x)
  • Inaccurate: x may have many dimensions
  • Discriminative: directly estimates Pr(y|x)
  • Simple linear case: want w s.t. w·x > 0 if y = 1
    and w·x ≤ 0 if y = 0
  • Cannot differentiate for w; instead pick some
    smooth loss function
  • Works very well in practice
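
As a small illustration of the discriminative route, the sketch below fits w by gradient descent on the logistic loss, one common choice of smooth loss (the slide does not say which loss is meant). The function names, learning rate, and toy data are assumptions.

```python
import math


def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit w so that sigmoid(w.x) approximates Pr(y = 1 | x)."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj  # gradient of the logistic loss
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w


def predict(w, x):
    """Label 1 if w.x > 0, else 0, as in the linear case on the slide."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


# First component is a constant bias feature.
xs = [[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]]
ys = [1, 1, 0, 0]
w = train_logistic(xs, ys)
```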

123
Discriminative node model
  • OA(X): direct own attributes of node X
  • LD(X): link-derived attributes of node X
  • Mode-link: most frequent label of neighbors(X)
  • Count-link: histogram of neighbor labels
  • Binary-link: 0/1 histogram of neighbor labels
  • Iterate as in generative case
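
The three link-derived encodings can be sketched as follows. The function name and the tie-breaking rule for mode-link (first class in `classes` wins) are assumptions made for illustration.

```python
from collections import Counter


def link_features(neighbor_labels, classes):
    """Return (mode_link, count_link, binary_link) for one node."""
    counts = Counter(neighbor_labels)
    count_link = [counts[c] for c in classes]           # histogram of labels
    binary_link = [1 if n > 0 else 0 for n in count_link]  # 0/1 histogram
    mode_link = max(classes, key=lambda c: counts[c]) if neighbor_labels else None
    return mode_link, count_link, binary_link


features = link_features(["a", "b", "a"], ["a", "b", "c"])
```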

[Equation annotations: local model params, neighborhood model params]
124
Discriminative model results [Li2003]
  • Binary-link and count-link outperform
    content-only at 95% confidence
  • Better to separately estimate wl and wo
  • In+Out+Cocitation better than any subset for LD

125
Undirected Markov networks
  • Clique c ∈ C(G): a set of completely connected nodes
  • Clique potential φc(Vc): a function over all
    possible configurations of nodes in Vc
  • Decompose Pr(v) as (1/Z) ∏c∈C(G) φc(Vc)
  • Parametric form

[Equation annotations: label coupling, instance, local feature variable, label variable, params of model, feature functions]
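
To make the clique decomposition concrete, the sketch below enumerates all labelings of a tiny network, multiplies the clique potentials, and normalizes by Z. It is brute force and only meant for toy graphs; the function name and the example potential are assumptions.

```python
from itertools import product


def joint_distribution(nodes, cliques, potentials, labels=(0, 1)):
    """Pr(v) = (1/Z) * product over cliques of phi_c(V_c), by enumeration."""
    weights = {}
    for assignment in product(labels, repeat=len(nodes)):
        value = dict(zip(nodes, assignment))
        w = 1.0
        for clique, phi in zip(cliques, potentials):
            w *= phi(tuple(value[v] for v in clique))
        weights[assignment] = w
    z = sum(weights.values())  # partition function Z
    return {a: w / z for a, w in weights.items()}


# One edge clique whose potential favors equal labels.
agree = lambda labels: 2.0 if labels[0] == labels[1] else 1.0
dist = joint_distribution(["a", "b"], [("a", "b")], [agree])
```

Real networks are far too large for enumeration, which is why the deck turns to approximate inference such as (loopy) belief propagation.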
126
Conditional and relational networks
  • x vector of observed features at all nodes
  • y vector of labels
  • A set of clique templates specifying links to
    use
  • Other features in the same HTML section

127
Toy problem: Hierarchical classification
  • Obvious approaches
  • Flatten to leaf topics, losing hierarchy info
  • Level-by-level, compounding error probability
  • Cascaded generative model
  • Pr(c|d,r) estimated as Pr(c|r) Pr(d|c) / Z(r)
  • Estimate of Pr(d|c) makes naïve independence
    assumptions if d has high dimensionality
  • Pr(c|d,r) tends to 0/1 for large dimensions
  • Mistakes made at shallow levels become irrevocable

[Figure: topic hierarchy with parent node r and child node c]
128
Global discriminative model
  • Each node has an associated bit X
  • Propose a parametric form
  • Each training instance sets one path to 1, all
    other nodes have X = 0

[Figure labels: T, xr, d, F(d,xr), xr = 0, xr = 1, wc]
129
Network value of customers
  • Customer node X in graph, neighbors N
  • M is a marketing action (promotion, coupon)
  • Want to predict
  • Broader objective is to design action M
  • Again, we approximate as

[Equation annotations: aggregate marketing action, response of customer i, known response of other customers, product attributes, sum over unknown neighbor configurations]
130
Network value, continued
  • Let the action be boolean
  • c is the cost of marketing
  • r0 is the revenue without marketing, r1 with
  • Expected lift in profit by marketing to customer
    i in isolation
  • Global effect
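
The slide's formula for the isolated lift is not preserved in this transcript. A plausible reading, sketched below under that assumption, is revenue-weighted purchase probability with marketing minus the same without marketing, less the marketing cost. The function and parameter names are hypothetical.

```python
def expected_lift(p_buy_marketed, p_buy_unmarketed, r1, r0, c):
    """Expected lift in profit from marketing to one customer in isolation.

    p_buy_marketed:   Pr(customer buys | marketing action applied)
    p_buy_unmarketed: Pr(customer buys | no marketing action)
    r1, r0:           revenue with and without marketing
    c:                cost of marketing to this customer
    """
    return r1 * p_buy_marketed - r0 * p_buy_unmarketed - c


lift = expected_lift(0.5, 0.2, r1=10.0, r0=10.0, c=1.0)
```

A customer is worth targeting in isolation when this quantity is positive; the global effect additionally counts the change in neighbors' purchase probabilities.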

131
Special case: sequential networks
  • Text modeled as sequence of tokens drawn from a
    large but finite vocabulary
  • Each token has attributes
  • Visible: allCaps, noCaps, hasXx, allDigits,
    hasDigit, isAbbrev, (part-of-speech, wnSense)
  • Not visible: part-of-speech, (isPersonName,
    isOrgName, isLocation, isDateTime),
    starts|continues|ends-noun-phrase
  • Visible (symbols) and invisible (states)
    attributes of nearby tokens are dependent
  • Application decides what is (not) visible
  • Goal: estimate invisible attributes

132
Hidden Markov model
  • A generative sequential model for the joint
    distribution of states (s) and symbols (o)

[Figure: chain of hidden states S(t-1) → S(t) → S(t+1), each state S(t) emitting the symbol O(t)]
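
Given such a model, the most likely state sequence for an observed symbol sequence can be recovered with the Viterbi algorithm. The sketch below is a minimal version using plain probabilities (no log-space scaling); the state names, symbols, and probability tables are made up for illustration.

```python
def viterbi(observations, states, start, trans, emit):
    """Most likely state path under Pr(s, o) = start * prod(trans * emit)."""
    best = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back = []
    for o in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans[p][s])
            scores[s] = best[-1][prev] * trans[prev][s] * emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["Name", "Other"]
start = {"Name": 0.5, "Other": 0.5}
trans = {"Name": {"Name": 0.6, "Other": 0.4},
         "Other": {"Name": 0.3, "Other": 0.7}}
emit = {"Name": {"cap": 0.9, "low": 0.1},
        "Other": {"cap": 0.2, "low": 0.8}}
path = viterbi(["cap", "low", "low"], states, start, trans, emit)
```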
133
Using redundant token features
  • Each o is usually a vector of features extracted
    from a token
  • Might have high dependence/redundancy: hasCap,
    hasDigit, isNoun, isPreposition
  • Parametric model for Pr(st → ot) needs to make
    naïve assumptions to be practical
  • Overall joint model Pr(s,o) can be very
    inaccurate
  • (Same argument as in naïve Bayes vs. SVM or
    maximum entropy text classifiers)

134
Discriminative graphical model
  • Assume one-stage Markov dependence
  • Propose direct parametric form for conditional
    probability of state sequence given symbol
    sequence

[Equation annotations: model, log-linear form, feature function might depend on whole o, parameters to fit]
135
Feature functions and parameters
[Equation annotations: penalize large params; maximize total conditional likelihood over all instances]
  • Find ∂L/∂λk for each k and perform a
    gradient-based numerical optimization
  • Efficient for linear state dependence structure
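
The slide's equations are not preserved in this transcript. For orientation, the standard linear-chain conditional random field formulation [LaffertyMP2001] with a Gaussian penalty on the parameters reads as follows; the symbols σ², f_k, and the superscript (i) for training instances are assumed notation, and the slide's own form may differ.

```latex
\Pr(s \mid o) = \frac{1}{Z(o)}
  \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, o, t) \Big)

L(\lambda) = \sum_{i} \log \Pr\big(s^{(i)} \mid o^{(i)}\big)
  - \sum_{k} \frac{\lambda_k^2}{2\sigma^2}

\frac{\partial L}{\partial \lambda_k}
  = \sum_{i} \Big[ \sum_{t} f_k\big(s^{(i)}_{t-1}, s^{(i)}_t, o^{(i)}, t\big)
  - \mathbb{E}_{s' \sim \Pr(\cdot \mid o^{(i)})}
      \sum_{t} f_k\big(s'_{t-1}, s'_t, o^{(i)}, t\big) \Big]
  - \frac{\lambda_k}{\sigma^2}
```

The gradient is the observed feature count minus the expected feature count under the model, minus the penalty term. For a chain, the expectation can be computed with a forward-backward pass, which is what makes the optimization efficient.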

136
Conditional vs. joint results
Out-of-vocabulary error
Orthography: use words, plus overlapping
features: isCap, startsWithDigit, hasHyphen,
endsWith -ing, -ogy, -ed, -s, -ly, -ion, -tion,
-ity, -ies
137
Summary
  • Graphs provide a powerful way to model many kinds
    of data, at multiple levels
  • Web pages, XML, relational data, images
  • Words, senses, phrases, parse trees
  • A few broad paradigms for analysis
  • Factors affecting graph evolution over time
  • Eigen analysis, conductance, random walks
  • Coupled distributions between node attributes and
    graph neighborhood
  • Several new classes of model estimation and
    inferencing algorithms

138
References
  • [BrinP1998] The Anatomy of a Large-Scale
    Hypertextual Web Search Engine. WWW.
  • [GoldmanSVG1998] Proximity search in databases.
    VLDB, 26-37.
  • [ChakrabartiDI1998] Enhanced hypertext
    categorization using hyperlinks. SIGMOD.
  • [BikelSW1999] An Algorithm that Learns What's in
    a Name. Machine Learning Journal.
  • [GibsonKR1999] Clustering categorical data: An
    approach based on dynamical systems. VLDB.
  • [Kleinberg1999] Authoritative sources in a
    hyperlinked environment. JACM 46.

139
References
  • [CohnC2000] Probabilistically Identifying
    Authoritative Documents. ICML.
  • [LempelM2000] The stochastic approach for
    link-structure analysis (SALSA) and the TKC
    effect. Computer Networks 33 (1-6), 387-401.
  • [RichardsonD2001] The Intelligent Surfer:
    Probabilistic Combination of Link and Content
    Information in PageRank. NIPS 14 (1441-1448).
  • [LaffertyMP2001] Conditional Random Fields:
    Probabilistic Models for Segmenting and Labeling
    Sequence Data. ICML.
  • [BorkarDS2001] Automatic text segmentation for
    extracting structured records. SIGMOD.

140
References
  • [NgZJ2001] Stable algorithms for link analysis.
    SIGIR.
  • [Hulgeri2001] Keyword Search in Databases. IEEE
    Data Engineering Bulletin 24(3), 22-32.
  • [Hristidis2002] DISCOVER: Keyword Search in
    Relational Databases. VLDB.
  • [Agrawal2002] DBXplorer: A system for
    keyword-based search over relational databases.
    ICDE.
  • [TaskarAK2002] Discriminative probabilistic
    models for relational data.
  • [Fagin2002] Combining fuzzy information: an
    overview. SIGMOD Record 31(2), 109-118.

141
References
  • [Chakrabarti2002] Mining the Web: Discovering
    Knowledge from Hypertext Data.
  • [Tomlin2003] A New Paradigm for Ranking Pages on
    the World Wide Web. WWW.
  • [Haveliwala2003] Topic-Sensitive PageRank: A
    Context-Sensitive Ranking Algorithm for Web
    Search. IEEE TKDE.
  • [LuG2003] Link-based Classification. ICML.
  • [FaloutsosMT2004] Connection Subgraphs in Social
    Networks. SIAM-DM workshop.
  • [PanYFD2004] GCap: Graph-based Automatic Image
    Captioning. MDDE/CVPR.

142
References
  • [Balmin2004] Authority-Based Keyword Queries in
    Databases using ObjectRank. VLDB.
  • [BarYossefBKT2004] Sic transit gloria telae:
    Towards an understanding of the Web's decay.
    WWW2004.