Efficient Identification of Web Communities - PowerPoint PPT Presentation

About This Presentation
Slides: 18
Provided by: wu1

Transcript and Presenter's Notes

Title: Efficient Identification of Web Communities


1
Efficient Identification of Web Communities
  • Gary W. Flake, Steve Lawrence, and C. Lee Giles
  • Presenter: Baoning Wu

2
Motivation
  • Search engines cannot cover the whole web, and
    their indexes are often out of date
  • Query results must balance precision against
    recall
  • The concept of a web community can help to solve
    these two problems

3
Community
  • A community is a set of web pages that link to
    more web pages in the community than to pages
    outside of the community
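
The definition above can be checked mechanically on a small link graph. The sketch below is only an illustration: it assumes the condition is applied to each member page separately and that links are treated as undirected neighbours. The function name `is_community` and the toy graph are hypothetical.

```python
def is_community(members, links):
    """Return True if every member has more links inside than outside.

    `links` maps each page to the set of pages it is linked with
    (assumption: link direction is ignored).
    """
    members = set(members)
    for page in members:
        neighbours = links.get(page, set())
        inside = len(neighbours & members)
        outside = len(neighbours - members)
        if inside <= outside:
            return False
    return True


# Hypothetical toy graph: a, b, c are tightly linked; d and e sit outside.
toy_links = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
print(is_community({"a", "b", "c"}, toy_links))
print(is_community({"a", "d"}, toy_links))
```

Here {a, b, c} passes the check, while {a, d} fails because page a has two links leaving the set and only one staying inside.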

4
Graph cuts and Partitions
  • Identifying a tightly coupled community can be
    cast as a balanced minimum cut problem: the total
    weight of the cut edges is minimized while each
    partition keeps at least a minimum size.
  • The balanced problem is NP-complete.
  • Unrestricted minimum cuts tend to leave one
    partition very small.
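
The effect of the size constraint can be seen on a tiny graph by brute force. This is a sketch for illustration only: it enumerates every bipartition, which is feasible only for a handful of nodes, and the helper names and toy graph (two triangles joined by two edges, plus one pendant node g) are hypothetical.

```python
from itertools import combinations


def cut_weight(part, edges):
    """Total weight of edges with exactly one endpoint in `part`."""
    return sum(w for (u, v), w in edges.items() if (u in part) != (v in part))


def best_cut(nodes, edges, min_size=1):
    """Exhaustively find a minimum cut whose two sides each have >= min_size nodes."""
    nodes = sorted(nodes)
    best = None
    for k in range(min_size, len(nodes) - min_size + 1):
        for part in combinations(nodes, k):
            weight = cut_weight(set(part), edges)
            if best is None or weight < best[0]:
                best = (weight, set(part))
    return best


# Hypothetical toy graph with unit edge weights.
toy_edges = {
    ("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1,   # first triangle
    ("d", "e"): 1, ("d", "f"): 1, ("e", "f"): 1,   # second triangle
    ("c", "d"): 1, ("b", "e"): 1,                  # links between the triangles
    ("a", "g"): 1,                                 # pendant node
}
toy_nodes = "abcdefg"

print(best_cut(toy_nodes, toy_edges))              # unrestricted cut
print(best_cut(toy_nodes, toy_edges, min_size=3))  # balanced cut
```

Without a size constraint the cheapest cut simply isolates the pendant node g. Requiring at least three nodes per side recovers the split between the two triangles instead.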

5
Maximum Flow
  • The s-t maximum flow problem:
  • Given a directed graph with edge capacities and
    two vertices s and t, find the maximum flow that
    can be routed from the source s to the sink t.
  • By the max-flow min-cut theorem, the maximum flow
    value equals the capacity of the minimum cut that
    separates s and t.
  • Many polynomial-time algorithms exist, for
    example the shortest augmenting path algorithm.
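
As a concrete reference, a shortest augmenting path scheme (the Edmonds-Karp variant, which finds each augmenting path by breadth-first search) can be sketched as follows. The edge-dictionary representation and the toy capacities are assumptions made for this example, not taken from the slides.

```python
from collections import defaultdict, deque


def max_flow(capacity, s, t):
    """Edmonds-Karp maximum flow.

    `capacity` maps (u, v) edge tuples to non-negative capacities.
    Returns (flow value, set of vertices on the source side of a minimum cut).
    """
    residual = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # reverse arc needed in the residual graph

    def bfs():
        """Breadth-first search over arcs with remaining capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if t not in parent:
            # No augmenting path left: reachable vertices form the source side.
            return flow, set(parent)
        # Find the bottleneck capacity along the path s -> ... -> t.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical toy network.
toy_capacity = {
    ("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
    ("a", "t"): 2, ("b", "t"): 3,
}
value, source_side = max_flow(toy_capacity, "s", "t")
print(value, source_side)
```

The second return value is the part that matters for community identification: the vertices still reachable from the source once the flow is saturated lie on the source side of a minimum cut.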

6
Definition
7
Theorem 1
8
Seeds and sinks
  • Multiple seeds are used to address the unrelated
    link problem.
  • Sinks are a small collection of web portals, such
    as Yahoo.
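
One way to turn several seeds and several sinks into a single s-t problem is to add a virtual source and a virtual sink. The sketch below is an assumption-laden illustration: the node names `_source` and `_sink`, the unit capacity on page-to-page links, the bidirectional treatment of hyperlinks, and the infinite capacity on the virtual arcs are all choices made for this example.

```python
INF = float("inf")


def add_virtual_terminals(links, seeds, portals, source="_source", sink="_sink"):
    """Build a capacity dict with one virtual source and one virtual sink.

    `links` is an iterable of (page, page) hyperlinks.
    """
    capacity = {}
    for u, v in links:
        # Assumption: each hyperlink carries unit capacity in both directions.
        capacity[(u, v)] = 1
        capacity[(v, u)] = 1
    for seed in seeds:
        capacity[(source, seed)] = INF   # virtual source feeds every seed
    for portal in portals:
        capacity[(portal, sink)] = INF   # every portal drains to the virtual sink
    return capacity


# Hypothetical page names.
toy_links = [("svm1", "svm2"), ("svm2", "portal"), ("svm1", "other")]
toy_capacity = add_virtual_terminals(toy_links, seeds=["svm1", "svm2"], portals=["portal"])
print(len(toy_capacity))
```

The resulting dictionary can be handed to any s-t maximum flow routine with `_source` as s and `_sink` as t.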

9
Approximate communities
  • Identifying an ideal community requires rapid
    access to the inbound and outbound links of many
    web sites; in effect, the whole web graph is
    needed.
  • An approximate method is developed instead.
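
A bounded crawl around the seeds is one way to work with a small subgraph instead of the whole web. The following is a minimal sketch under stated assumptions: the fixed depth limit, the function name `crawl_neighbourhood`, and the `get_links` callback (standing in for whatever fetches a page's links) are hypothetical and not taken from the slides.

```python
def crawl_neighbourhood(seeds, get_links, max_depth=2):
    """Collect pages and link edges within `max_depth` hops of the seeds.

    `get_links(page)` should return the pages linked with `page`; here it
    only follows outbound links, but it could also return inbound ones.
    """
    frontier = set(seeds)
    seen = set(seeds)
    edges = set()
    for _ in range(max_depth):
        next_frontier = set()
        for page in frontier:
            for target in get_links(page):
                edges.add((page, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.add(target)
        frontier = next_frontier
    return seen, edges


# Hypothetical in-memory "web" used in place of real HTTP requests.
toy_web = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
pages, edges = crawl_neighbourhood({"a"}, lambda p: toy_web.get(p, ()), max_depth=2)
print(sorted(pages), sorted(edges))
```

With a depth of 2 the crawl stops at page c, so pages d and e never enter the graph that the flow computation would see.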

10
Focused crawler
11
Expectation Maximization
  • If the seed set is small, only a small subset of
    the community can be identified, so new seeds
    need to be added.
  • After each run, the strongest newly discovered
    web sites are relabeled as seeds.
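
The iterate-and-relabel loop described above can be sketched as follows. Several pieces are assumptions for illustration: "strength" is taken to be the number of links a page has to current community members, `find_community` is a caller-supplied stand-in for the flow-based step, and the names `expand_seeds`, `rounds`, and `promote` are hypothetical.

```python
def expand_seeds(seeds, find_community, links, rounds=3, promote=1):
    """Repeatedly find a community and promote its strongest new pages to seeds."""
    seeds = set(seeds)
    community = set(seeds)
    for _ in range(rounds):
        community = set(find_community(seeds))
        new_pages = community - seeds
        if not new_pages:
            break

        def strength(page):
            # Assumption: strength = number of links into the current community.
            return len(links.get(page, set()) & community)

        # Strongest first; ties broken by name to keep the result deterministic.
        ranked = sorted(new_pages, key=lambda p: (-strength(p), p))
        seeds |= set(ranked[:promote])
    return seeds, community


# Hypothetical toy data and a stand-in community finder
# (seeds plus the pages they link to), used in place of maximum flow.
toy_links = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "b", "d"}, "d": {"c"}}


def toy_find_community(seeds):
    found = set(seeds)
    for seed in seeds:
        found |= toy_links.get(seed, set())
    return found


final_seeds, final_community = expand_seeds({"a"}, toy_find_community, toy_links, rounds=2)
print(sorted(final_seeds), sorted(final_community))
```

In the toy run, page c is promoted first because it has two links into the community, and the larger seed set then pulls page d into the second-round result.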

12
Seeds: SVM community
13
Results: SVM community
14
Seeds: Internet Archive community
15
Results: Internet Archive community
16
Conclusion
  • Defines a new type of web community.
  • Uses maximum flow to compute it.
  • Uses a focused web crawler to approximate a
    community.
  • Applications: focused crawlers, automatic
    population of portals, and improved filtering.

17
  • Questions?