CS246 - PowerPoint PPT Presentation

About This Presentation

Title:

CS246

Description:

How can we identify these Web communities? Junghoo 'John' Cho (UCLA ... Linux, Star wars, Anti-abortion, Nicole Kidman, ... Pages tend to point to each other ... – PowerPoint PPT presentation

Slides: 44

Provided by: wind358

Transcript and Presenter's Notes

1
CS246

Web Community

2
Today's Topic

Web Community
Community link
Complete bipartite core
Hub/Authority

3
Web Community

Many pages seem to form a community over time
Linux pages
Abortion pages
Star fan sites
How can we identify these Web communities?

4
Applications of Web Community

Better search and discovery
Topic specific searching
Users often want to see a set of related pages
Focused crawling
Study of knowledge/community evolution

5
Community Identification

Manual community categorization
Yahoo! approach (central approach)
Hire a bunch of people for manual categorization
Dmoz directory (distributed approach)
Loosely organized volunteers
Independent category editor
Very expensive, often not very up to date
Can we identify communities automatically?

6
Questions

How can we find Web communities automatically?
What are the characteristics of Web communities?
What do we mean by Web communities?

7
Characteristics of Web Communities

Have a common interest
Linux, Star Wars, Anti-abortion, Nicole Kidman, ...
Pages tend to point to each other
How can we exploit these characteristics to identify communities?

8
Two Cases

We are interested in a particular topic
Say, Linux
For focused crawling, for example
We want to find all communities on the Web
For Yahoo!, for example

9
Case 1: We Know the Topic

The topic or the example pages are given
Find the community on Linux
Find a community including pages A, B, and C
How should we do this?

10
Solution 1: IR Measure

Use a traditional IR similarity metric or a Naïve Bayesian classifier to find similar pages
Return, say, top 100 pages
Problem?

Often, similar pages are not very relevant
Pages may be disconnected
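
As a rough illustration of the IR approach, the sketch below ranks candidate pages by cosine similarity over raw term-frequency vectors. The slide does not name a specific metric, so cosine similarity, whitespace tokenization, and the function names `cosine_similarity` and `top_k_similar` are all assumptions made for this example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(example_text, pages, k=100):
    """Rank candidate pages (dict: url -> text) by similarity to the example."""
    scored = [(cosine_similarity(example_text, text), url)
              for url, text in pages.items()]
    scored.sort(reverse=True)
    return [url for _, url in scored[:k]]
```

Note that the ranking looks only at page text and ignores links, which is why the returned pages may be disconnected from one another.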

11
Solution 2: Hub/Authority (Kleinberg)

Expand the example pages to their neighbors
Identify important pages from the neighbor graph
Authority: Many hub pages point to it
Hub: Points to many authority pages
Anything else?
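
The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. This is a minimal version, assuming the neighbor graph is already built and given as a dict of out-links; the function names, the fixed iteration count, and the L2 normalization are choices made for this example rather than details taken from the slides.

```python
def _normalize(scores):
    """Scale scores to unit L2 norm (left unchanged if all are zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    return {p: v / norm for p, v in scores.items()} if norm else scores

def hits(out_links, iterations=50):
    """Return (hub, authority) score dicts for a small link graph.

    out_links maps each page to the pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages it points to.
        hub = {p: sum(auth[d] for d in out_links.get(p, ())) for p in pages}
        auth = _normalize(auth)
        hub = _normalize(hub)
    return hub, auth
```

Pages with the highest authority scores would then be reported as the core of the community, with the top hubs serving as good entry points to it.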

12
Solution 3: Community Link (Flake et al.)

Basic idea: A community page should be closer to the community than to others
Start with the example pages
Add a page to the community
If the page is linked to the community pages more than to others
Stop if no more pages can be added to the community

13
Case 2: No Starting Point

Can we identify all existing Web communities without any starting points?

14
Solution 1: IR Clustering

Using an IR similarity metric, find similar clusters of pages
From each cluster, use Hub/Authority techniques to identify communities

15
Solution 2: Trawling

Find all complete, say, (2,3) bipartite graphs
Inspired by the Hub/Authority idea
Apply Hub/Authority to expand the subgraphs
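
To make the idea of a complete bipartite core concrete, here is a brute-force sketch that enumerates every core with `i` pages all linking to the same `j` pages. It assumes that "(2,3)" means 2 linking pages and 3 linked-to pages, which the slide does not spell out. Exhaustive enumeration is only feasible on small graphs; a real trawling system would need aggressive pruning, which is not shown here.

```python
from itertools import combinations

def bipartite_cores(out_links, i=2, j=3):
    """Enumerate complete bipartite (i, j) cores in a small link graph.

    out_links maps each page to the pages it links to. A core is a pair
    (fans, centers) where each of the i fans links to all j centers.
    """
    link_sets = {page: set(targets) for page, targets in out_links.items()}
    # A page with fewer than j out-links can never be a fan.
    fans = sorted(p for p, targets in link_sets.items() if len(targets) >= j)
    cores = []
    for fan_group in combinations(fans, i):
        # Centers must be linked from every fan in the group.
        common = set.intersection(*(link_sets[f] for f in fan_group))
        common -= set(fan_group)  # keep the two sides disjoint
        for centers in combinations(sorted(common), j):
            cores.append((fan_group, centers))
    return cores
```

Each core found this way is a small seed that could then be expanded with the Hub/Authority technique.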

16
Outline

Community link
Case 1: A particular community
Bipartite-core trawling
Case 2: All communities

17
Community Link (Flake et al.)

What is their definition of community?
C ⊆ V is a community iff
∀ v ∈ C, v has at least as many edges to C as to (V-C)
∀ v ∈ (V-C), v has at least as many edges to (V-C) as to C
Rationale
Pages in a community should link to/be linked to by more pages in the community than to others
Given starting pages, does this definition lead to a unique community?
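
The definition can be checked mechanically. The sketch below assumes an undirected graph stored as a dict from each node to its set of neighbors, and reads "as many edges" as "at least as many"; the function name `is_community` is made up for this example.

```python
def is_community(adj, community):
    """Check the two-sided community condition on an undirected graph.

    adj maps each node to the set of its neighbors.
    """
    community = set(community)
    for v, neighbors in adj.items():
        inside = len(neighbors & community)   # edges from v into C
        outside = len(neighbors) - inside     # edges from v into V-C
        if v in community and inside < outside:
            return False
        if v not in community and outside < inside:
            return False
    return True

# Two triangles joined by a single bridge edge (3-4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(is_community(adj, {1, 2, 3}))           # one triangle
print(is_community(adj, {1, 2, 3, 4, 5, 6}))  # the whole graph
print(is_community(adj, {1, 2}))              # node 3 leans toward C
```

On this toy graph both the triangle and the whole graph satisfy the definition while containing node 1, so a starting page alone does not pin down a unique community.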

18
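Uniqueness: A Simple Case

[Figure: a graph with two candidate communities, C1 and C2]
C1 or C2?

19
Uniqueness: More Troubling Case

[Figure: a graph with two candidate communities, C1 and C2]
C1 or C2?

20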
Uniqueness

Typical solution
Balanced partition: prefer partitions with a balanced number of nodes on each side
NP-complete (assuming minimum links between partitions)
Inappropriate for community identification
What choice did the authors make?

21
Community Identification

How can we identify a community given examples?
Assuming any community is fine

22
Solution 1: Greedy Algorithm

C = initial examples
Add a node v iff
v has at least as many edges to C as to (V-C)
Stop if no more nodes can be added
Polynomial-time algorithm
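
The greedy procedure above can be sketched as follows, assuming an undirected graph given as a dict from node to neighbor set. Two details are choices made for this example and do not come from the slides: a node must have at least one edge into C before it is considered (otherwise isolated nodes would qualify with 0 vs. 0 edges), and candidates are scanned in sorted order to make the result deterministic.

```python
def greedy_community(adj, seeds):
    """Grow a community from seed nodes by repeatedly adding any node
    that has at least as many edges into the community as out of it.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    community = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in sorted(set(adj) - community):
            inside = len(adj[v] & community)
            outside = len(adj[v]) - inside
            # Assumption: require at least one link into the community.
            if inside > 0 and inside >= outside:
                community.add(v)
                changed = True
    return community
```

Each pass either adds a node or terminates the loop, so there are at most |V| passes and each pass touches every edge a bounded number of times, which keeps the running time polynomial.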

23
Greedy Algorithm: Example

What community?

Greedy prefers smaller (tighter) community

24
Solution 2: Min-Cut Algorithm

Partition the graph such that the flow between partitions is minimal

This algorithm also restricts the community to a specific case

Exactly how did they compute min-cut?

25
S-T Min Cut

Given a source (s) and a target (t) node, find a min cut
How did they pick s and t?

[Figure: graph with a source node s and a target node t]

26
Choice of S and T

S: Connect all examples to a virtual s

[Figure: example pages connected to a virtual source s]

27
Choice of S and T

T: Supposedly the center of the Web, but in practice, connect all grandchildren of the samples to a virtual t
Why? The authors say it works well empirically
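
The virtual source and target construction can be sketched with a plain BFS-based max-flow (Edmonds-Karp), where the community is whatever remains reachable from s once no augmenting path is left. Several details here are assumptions for the example, not the authors' stated setup: the graph is undirected with unit capacity per edge, s connects to the examples with infinite capacity, each chosen sink node connects to t with unit capacity, and the caller supplies `sink_nodes` directly (for instance, the grandchildren of the samples).

```python
from collections import defaultdict, deque

def min_cut_community(adj, seeds, sink_nodes):
    """Return the source side of an s-t min cut as the community.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    S, T = "_virtual_s", "_virtual_t"  # hypothetical names for virtual nodes
    cap = defaultdict(lambda: defaultdict(float))
    for u, neighbors in adj.items():
        for v in neighbors:
            cap[u][v] = 1.0
            cap[v][u] = 1.0
    for v in seeds:
        cap[S][v] = float("inf")
    for v in sink_nodes:
        cap[v][T] = 1.0

    def reachable_from_source():
        """BFS over edges with remaining capacity; returns a parent map."""
        parent = {S: None}
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if T not in parent:
            break  # no augmenting path left: max flow reached
        # Recover the augmenting path from t back to s.
        path, node = [], T
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        flow = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= flow
            cap[b][a] += flow
    # Nodes still reachable from s lie on the source side of the cut.
    return set(parent) - {S}
```

On two triangles joined by one bridge edge, seeding one triangle and sinking the other cuts the bridge and returns the seeded triangle.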

[Figure: virtual source s and virtual target t attached to the graph]

28
One More Idea

Iteration (Expectation maximization)
Once we identify a community, repeat the same process using the newly identified community as the examples
Repeat until we reach a fixed point
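
The iteration itself is a plain fixed-point loop. In this sketch, `identify` stands for any procedure that maps a set of example pages to a community (a hypothetical callable, not something defined in the slides), and `max_rounds` is a safety cap added for the example.

```python
def iterate_to_fixed_point(identify, seeds, max_rounds=20):
    """Re-run community identification, feeding each result back in as
    the new examples, until the community stops changing."""
    current = frozenset(seeds)
    for _ in range(max_rounds):
        found = frozenset(identify(current))
        if found == current:
            break  # fixed point: the result equals its own input
        current = found
    return set(current)
```

Any single-pass identification procedure can be passed in as `identify`, as long as it accepts a set of pages and returns a set of pages.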

29
Question