CS246 - PowerPoint PPT Presentation

Description:

How can we identify these Web communities? Junghoo 'John' Cho (UCLA ... Linux, Star wars, Anti-abortion, Nicole Kidman, ... Pages tend to point to each other ...

1
CS246
  • Web Community

2
Today's Topic
  • Web Community
  • Community link
  • Complete bipartite core
  • Hub/Authority

3
Web Community
  • Many pages seem to form a community over time
  • Linux pages
  • Abortion pages
  • Star fan sites
  • How can we identify these Web communities?

4
Applications of Web Community
  • Better search and discovery
  • Topic specific searching
  • Users often want to see a set of related pages
  • Focused crawling
  • Study of knowledge/community evolution

5
Community Identification
  • Manual community categorization
  • Yahoo! approach (central approach)
  • Hire a bunch of people for manual categorization
  • Dmoz directory (distributed approach)
  • Loosely organized volunteers
  • Independent category editor
  • Very expensive, often not very up to date
  • Can we identify communities automatically?

6
Questions
  • How can we find Web communities automatically?
  • What are the characteristics of Web communities?
  • What do we mean by Web communities?

7
Characteristics of Web Communities
  • Have a common interest
  • Linux, Star Wars, anti-abortion, Nicole Kidman, ...
  • Pages tend to point to each other
  • How can we exploit these characteristics to
    identify communities?

8
Two Cases
  • We are interested in a particular topic
  • Say, Linux
  • For focused crawling, for example
  • We want to find all communities on the Web
  • For Yahoo!, for example

9
Case 1: We Know the Topic
  • The topic or the example pages are given
  • Find the community on Linux
  • Find a community including pages A, B, and C
  • How should we do this?

10
Solution 1: IR Measure
  • Use a traditional IR similarity metric or a naïve
    Bayes classifier to find similar pages
  • Return, say, top 100 pages
  • Problem?
  • Often, similar pages are not very relevant
  • Pages may be disconnected

11
Solution 2: Hub/Authority (Kleinberg)
  • Expand the example pages to their neighbors
  • Identify important pages from the neighbor
    graph
  • Authority: many hub pages point to it
  • Hub: points to many authority pages
  • Anything else?
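The hub/authority idea above can be sketched as a short power iteration. This is a minimal illustration, not Kleinberg's full procedure: the neighbor-expansion step is omitted, and the function name `hits`, the toy graph `LINKS`, and the fixed iteration count are assumptions made for the example.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the set of pages it points to.

    Returns (hub, authority) score dictionaries, each L2-normalized.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy graph: three hub-like pages pointing to two authorities.
LINKS = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a1"}}
hub, auth = hits(LINKS)
```

On this toy graph, `a1` ends up with the highest authority score because all three hubs point to it, and `h1` scores higher as a hub than `h3` because it points to both authorities.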

12
Solution 3: Community Link (Flake et al.)
  • Basic idea: a community page should be closer
    to the community than to others
  • Start with the example pages
  • Add a page to the community
  • If the page is linked to the community pages more
    than to others
  • Stop if no more pages can be added to the
    community

13
Case 2: No Starting Point
  • Can we identify all existing Web communities
    without any starting points?

14
Solution 1: IR Clustering
  • Using an IR similarity metric, find similar
    clusters of pages
  • From each cluster, use Hub/Authority techniques
    to identify communities

15
Solution 2: Trawling
  • Find all complete, say, (2,3) bipartite graphs
  • Inspired by the Hub/Authority idea
  • Apply Hub/Authority to expand the subgraphs

16
Outline
  • Community link
  • Case 1: A particular community
  • Bipartite-core trawling
  • Case 2: All communities

17
Community Link (Flake et al.)
  • What is their definition of community?
  • C ⊂ V is a community iff
  • ∀ v ∈ C, v has at least as many edges to C as to (V-C)
  • ∀ v ∈ (V-C), v has at least as many edges to (V-C) as to
    C
  • Rationale
  • Pages in a community should link to/be linked to
    by more pages in the community than to others
  • Given starting pages, does this definition lead
    to a unique community?
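The definition can be checked mechanically on a small graph. The sketch below assumes undirected links (following the "link to/be linked to" rationale), reads "as many" as "at least as many", and uses hypothetical names (`edges_into`, `is_community`, `GRAPH`).

```python
def edges_into(graph, v, group):
    """Count the neighbors of v that lie in group."""
    return sum(1 for u in graph[v] if u in group)

def is_community(graph, c):
    """Check both conditions of the definition for the candidate set c."""
    c = set(c)
    rest = set(graph) - c
    inside_ok = all(
        edges_into(graph, v, c) >= edges_into(graph, v, rest) for v in c
    )
    outside_ok = all(
        edges_into(graph, v, rest) >= edges_into(graph, v, c) for v in rest
    )
    return inside_ok and outside_ok

# Toy graph: two triangles joined by the single edge c - d.
GRAPH = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
```

Here `{"a", "b", "c"}` and `{"d", "e", "f"}` both satisfy the definition, and so does the whole vertex set, so the definition alone does not pick out a unique community even on a tiny graph.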

18
Uniqueness: A Simple Case
  • (Figure: two candidate communities, C1 and C2)
  • C1 or C2?

19
Uniqueness: More Troubling Case
  • (Figure: two candidate communities, C1 and C2)
  • C1 or C2?

20
Uniqueness
  • Typical solution
  • Balanced partition: prefer partitions with a
    balanced number of nodes on each side
  • NP-complete (assuming minimum links between
    partitions)
  • Inappropriate for community identification
  • What choice did the authors make?

21
Community Identification
  • How can we identify a community given examples?
  • Assuming any community is fine

22
Solution 1: Greedy Algorithm
  • C = initial examples
  • Add a node v iff
  • v has at least as many edges to C as to (V-C)
  • Stop if no more nodes can be added
  • Polynomial-time algorithm
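The greedy procedure can be sketched as follows. The graph is assumed to be undirected and stored as an adjacency dictionary. The extra requirement that a node have at least one edge into C is an assumption added here so that unrelated nodes are never pulled in. The names are hypothetical.

```python
def greedy_community(graph, examples):
    """Grow a community from the example pages until no node qualifies."""
    community = set(examples)
    changed = True
    while changed:
        changed = False
        for v in sorted(set(graph) - community):
            inside = sum(1 for u in graph[v] if u in community)
            outside = len(graph[v]) - inside
            # Add v iff it has at least as many edges to C as to V-C.
            # (inside > 0 is an extra assumption, see the lead-in.)
            if inside > 0 and inside >= outside:
                community.add(v)
                changed = True
    return community

# Toy graph: two triangles joined by the single edge c - d.
GRAPH = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
```

Starting from `{"a"}`, the loop adds `b` and `c` and then stops, because `d` has two edges to the rest of the graph and only one into the community. Each pass scans every remaining node once and the community only grows, which is why the algorithm is polynomial.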

23
Greedy Algorithm Example
  • What community?

  • Greedy prefers a smaller (tighter) community

24
Solution 2: Min-Cut Algorithm
  • Partition the graph such that the flow between
    partitions is minimal
  • This algorithm also restricts the community to a
    specific case
  • Exactly how did they compute min-cut?

25
S-T Min Cut
  • Given a source (s) and a target (t) node, find a
    min cut
  • How did they pick s and t?

26
Choice of S and T
  • S: Connect all examples to a virtual s

27
Choice of S and T
  • T: Supposedly the center of the Web, but in
    practice, connect all grandchildren of the
    samples to a virtual t
  • Why? The authors say it works well empirically
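A minimal sketch of the virtual-source/virtual-sink construction, using a plain Edmonds-Karp max flow. Several choices here are assumptions rather than details taken from the slides: links are treated as undirected with unit capacity, source edges get infinite capacity, sink edges get unit capacity, and the caller supplies the pages that are attached to t. The community is read off as the set of pages still reachable from s in the residual graph.

```python
from collections import defaultdict, deque

INF = float("inf")

def min_cut_community(edges, seeds, sinks):
    """Return the source side of a minimum s-t cut, without the virtual s."""
    cap = defaultdict(lambda: defaultdict(float))
    for u, v in edges:              # undirected, unit-capacity links
        cap[u][v] += 1
        cap[v][u] += 1
    s, t = "_s", "_t"               # virtual source and sink
    for x in seeds:
        cap[s][x] = INF
    for x in sinks:
        cap[x][t] += 1

    def reachable():
        """Breadth-first search over edges with remaining capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable()
        if t not in parent:
            break                   # no augmenting path left: flow is maximal
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow
    return set(parent) - {s}

# Toy graph: two triangles joined by the single edge c - d.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
```

With seed `a` and pages `e` and `f` attached to the sink, the cut falls on the bridge `c - d` and the returned community is the first triangle.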

28
One More Idea
  • Iteration (Expectation maximization)
  • Once we identify a community, repeat the same
    process using the newly identified community as
    the examples
  • Repeat until we reach a fixed point

29
Question
  • Why the heck did the authors use min cut/max
    flow?
  • Greedy is more natural given the definition
  • Lots of approximations and complicated (and
    loose) theorems
  • Maybe:
  • The original idea was partitioning, not the
    definition
  • Maybe reverse engineered
  • Greedy looked too simple to make the paper
    impressive
  • Min-cut may have worked better in experiments

30
Any Questions?
31
Outline
  • Community link
  • Bipartite-core trawling
  • Case 2: Identify all communities

32
Web Trawling
  • Let us find all potential communities
  • Community core: a complete (i, j) bipartite graph
  • Will they be meaningful?

33
Will They Be Meaningful?
  • Q: How many (2,3) bipartite cores are in a billion-node
    random graph (10 links per node)?
  • They are at least very rare!
  • 10^9 x 10^-13 = 10^-4 nodes on a billion-node graph
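A back-of-envelope check of how rare such cores are. This is a sketch under an assumed random model and is not necessarily the calculation behind the slide's figure: every node picks its 10 link targets uniformly at random, a (2,3) core is read as 2 fans that both point to the same 3 centers, and overlap between the fan and center sets is ignored. The function name is hypothetical.

```python
from math import comb

def expected_cores(n, out_degree, fans, centers):
    """Expected number of complete (fans, centers) bipartite cores."""
    # Probability that one fan's random links include a given set of centers.
    p_fan = comb(out_degree, centers) / comb(n, centers)
    # Number of ways to choose the fans and the centers, times the
    # probability that every chosen fan points to every chosen center.
    return comb(n, fans) * comb(n, centers) * p_fan ** fans

estimate = expected_cores(10**9, 10, 2, 3)
```

Under these assumptions the estimate is a few times 10^-5, so a purely random billion-node graph is expected to contain essentially no (2,3) cores. Cores found on the real Web are therefore unlikely to be accidents.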

34
Question
  • How can we find, say, a (2,3) bipartite graph?
  • Can be very expensive for a large graph
    structure
  • In most of the paper, the authors
    discuss their pruning heuristics
  • What pruning heuristics did they use?

35
Heuristics 1: Mirror Elimination
  • Mirrors can add bogus communities
  • Remove all mirrors
  • Shingling technique. Pruned 60% of pages
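A generic word-shingling sketch for spotting mirrors. The shingle length, the Jaccard threshold, and all names are assumptions for illustration; the variant used in the paper may differ (for example, it may shingle link sequences rather than page text).

```python
def shingles(text, k=4):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Overlap of two shingle sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b) if a | b else 1.0

def drop_mirrors(pages, threshold=0.9):
    """Keep the first page of every group of near-identical pages."""
    kept = {}
    for url, text in pages.items():
        sig = shingles(text)
        if all(jaccard(sig, other) < threshold for other in kept.values()):
            kept[url] = sig
    return list(kept)

# Hypothetical pages: u2 is an exact mirror of u1.
PAGES = {
    "u1": "linux kernel howto mirrors list of pages",
    "u2": "linux kernel howto mirrors list of pages",
    "u3": "star wars fan site with photos and news",
}
```

This pairwise comparison is quadratic in the number of pages. At Web scale, shingle sets are usually sampled or hashed into short sketches first.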

36
Heuristics: In/Out-Degree
  • Well-known communities are not interesting
  • We already know them anyway
  • Remove pages with high in-degree (> 50)
  • Did anyone understand why 50?
  • Hub (fan) pages should point to many pages on
    other sites
  • More than 6 inter-site links
  • Explanation not very convincing
  • 10M pages after the pruning
  • Iterative pruning
  • Iteratively remove all pages whose out-degree or
    in-degree is too low to be part of an (i, j) core
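Iterative pruning can be sketched as a fixed-point loop over the link list. The thresholds are an assumption based on reading an (i, j) core as i fans pointing to j centers: a fan needs out-degree at least j and a center needs in-degree at least i. The function name and toy data are hypothetical.

```python
from collections import Counter

def iterative_prune(edges, i, j):
    """Drop links that cannot belong to any (i, j) core, until nothing changes."""
    edges = set(edges)
    while True:
        out_deg, in_deg = Counter(), Counter()
        for src, dst in edges:
            out_deg[src] += 1
            in_deg[dst] += 1
        # A link survives only if its source can still be a fan
        # and its target can still be a center.
        kept = {(s, d) for s, d in edges if out_deg[s] >= j and in_deg[d] >= i}
        if kept == edges:
            return kept
        edges = kept

# A complete (2,3) core plus two stray links.
CORE = {(f, c) for f in ("f1", "f2") for c in ("c1", "c2", "c3")}
EDGES = CORE | {("f3", "c1"), ("f1", "x")}
```

Removing one link can push a neighboring page below its threshold, which is why the pass is repeated until the link set stops shrinking.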

37
Algorithm
  • How can we find the bipartite graphs?

38
Conceptual Algorithm
  • For every node n ∈ V
  • C(n) = the children of n
  • If |C(n)| ≥ j
  • For every ck ∈ C(n)
  • P(ck) = the parents of ck
  • If |∩ P(ck)| ≥ i for j of the ck's, return them
  • Two indexes?
  • Efficient disk-based algorithm?
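An in-memory sketch of the conceptual algorithm, using the two indexes the slide hints at (children and parents). It assumes an (i, j) core means at least i parents that all point to the same j children. The names are hypothetical.

```python
from itertools import combinations

def find_cores(links, i, j):
    """Return (fans, centers) pairs where at least i fans share j centers."""
    children = {n: set(targets) for n, targets in links.items()}
    parents = {}
    for n, targets in children.items():
        for c in targets:
            parents.setdefault(c, set()).add(n)
    cores = set()
    for n, kids in children.items():
        if len(kids) < j:
            continue                      # n cannot be a fan of an (i, j) core
        for centers in combinations(sorted(kids), j):
            # Intersect the parent sets of the chosen children.
            fans = set.intersection(*(parents[c] for c in centers))
            if len(fans) >= i:
                cores.add((frozenset(fans), frozenset(centers)))
    return cores

LINKS = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1"},
}
```

Enumerating every j-subset of a node's children is exponential in the out-degree, and both indexes must support random access. These are the costs the disk-based variant tries to avoid.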

39
Efficient Disk-Based Algorithm
  • (Figure: two copies of the link table, one with columns
    Src1 and Dst, the other with columns Src2 and Dst)

40
Efficient Disk-Based Algorithm
  • Read all entries with the same Src1
  • Find |∩ P(ck)| ≥ i
  • (Figure: joined table with columns Src1, Dst, Src2)
  • Potential problem
  • Is the joined table too large?
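A sketch of the join-based idea, emulated in memory. The link table is joined with itself on Dst, the result is sorted by Src1, and each Src1 group is then scanned on its own. It assumes, as before, that an (i, j) core means i fans sharing j centers. A real implementation would use external sorting and sequential reads, and the names here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, groupby

def self_join(link_table):
    """Join the link table with itself on Dst: rows of (src1, dst, src2)."""
    by_dst = defaultdict(list)
    for src, dst in link_table:
        by_dst[dst].append(src)
    rows = [(s1, dst, s2)
            for dst, srcs in by_dst.items() for s1 in srcs for s2 in srcs]
    return sorted(rows)          # sorted by src1, so groups are contiguous

def cores_from_join(rows, i, j):
    """Scan one Src1 group at a time and report (fans, centers) pairs."""
    cores = set()
    for _, group in groupby(rows, key=lambda row: row[0]):
        parents = defaultdict(set)           # dst -> its parents (the src2 values)
        for _, dst, s2 in group:
            parents[dst].add(s2)
        for centers in combinations(sorted(parents), j):
            fans = set.intersection(*(parents[c] for c in centers))
            if len(fans) >= i:
                cores.add((frozenset(fans), frozenset(centers)))
    return cores

LINK_TABLE = [("f1", "c1"), ("f1", "c2"), ("f1", "c3"),
              ("f2", "c1"), ("f2", "c2"), ("f2", "c3"), ("f3", "c1")]
ROWS = self_join(LINK_TABLE)
```

Each Src1 group already contains the parents of all of that page's children, so no second index lookup is needed while scanning. The price is the size of the joined table.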

41
Joined Table Size
  • Original link table
  • 10M pages
  • 10 links per page on average
  • 10 x 10M = 100M rows
  • Joined table
  • At most 50 in-degree
  • At most 50 x 100M = 5B rows
  • In practice far fewer rows
  • Assuming a node ID is 8 bytes
  • 3 x 8 x 5B = 120GB
  • Not too bad
  • In-degree pruning is the killer pruning
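The size estimate can be reproduced with a few lines of arithmetic. The constants come from the slide; the variable names are illustrative.

```python
PAGES = 10_000_000        # pages left after pruning
AVG_OUT_LINKS = 10        # average links per page
MAX_IN_DEGREE = 50        # in-degree cap from the pruning step
NODE_ID_BYTES = 8         # assumed size of one node ID
COLUMNS = 3               # Src1, Dst, Src2

link_rows = PAGES * AVG_OUT_LINKS                 # rows in the link table
joined_rows_max = MAX_IN_DEGREE * link_rows       # each row joins with at most 50 parents
joined_bytes_max = COLUMNS * NODE_ID_BYTES * joined_rows_max

print(link_rows, joined_rows_max, joined_bytes_max / 10**9, "GB")
```

The bound is linear in the in-degree cap, which is why in-degree pruning matters so much: without the cap, a single very popular page would contribute a number of joined rows equal to the square of its in-degree.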

42
Algorithm in the Paper
  • For the first table, join only the entries with
    exactly i links
  • Maybe unnecessary restriction

43
Summary
  • Web Community
  • Seed pages given
  • Community link: min cut/max flow
  • Hub/Authority
  • Seed pages not given
  • Trawling
  • Join-based algorithm