1
Nurturing content-based collaborative
communities on the Web
  • Soumen Chakrabarti
  • Center for Intelligent Internet Research, Computer
    Science and Engineering, Indian Institute of
    Technology Bombay
  • www.cse.iitb.ernet.in/soumen, www.cse.iitb.ernet.in/cfiir

2
Generic search engines
  • Struggle to cover the expanding Web
  • 35% coverage in 1997 (Bharat and Broder)
  • 18% in 1999 (Lawrence and Lee Giles)
  • Google rebounds to 50% in 2000
  • Moore's law vs. Web population
  • Search quality, index freshness
  • Cannot afford advanced processing
  • Alta Vista serves 40 million queries / day
  • Cannot even afford to seek on disk (8ms)
  • Limits intelligence of search engines

3
Scale vs. quality
[Figure: approaches placed on axes of scale and quality. Labels: lexical networks, parsing, semantic indexing; resource discovery; focused crawling; link-assisted ranking; topic distillation; Google, Clever; keyword-based search engines; HotBot, Alta Vista.]
4
The case for vertical portals
  • "Portals and search pages are changing rapidly,
    in part because their biggest strength, massive
    size and reach, can also be a drawback. The most
    interesting trend is the growing sense of natural
    limits, a recognition that covering a single
    galaxy can be more practical and useful than
    trying to cover the entire universe."
  • (San Jose Mercury News, 1/1999)

5
Scaling through specialization
  • The Web shows content-based locality
  • Link-based clusters correlated with content
  • Content-based communities emerge in a
    spontaneous, decentralized fashion
  • Can learn and exploit locality patterns
  • Analyze page visits and bookmarks
  • Automatically construct a focused portal with
    resources that
    • Have high relevance and quality
    • Are up-to-date and collectively comprehensive

6
Roadmap
  • Hyperlink mining: a short history
  • Resource discovery
  • Content-based locality in hypertext
  • Taxonomy models, topic distillation
  • Strategies for focused crawling
  • Data capture and mining architecture
  • The Memex collaboration system
  • Collaborative construction of vertical portals
  • Link metadata management architecture
  • Surfing backwards on the Web

7
Historical background
  • First generation Web search engines
  • Delete stopwords from queries
  • Can only do syntactic matching
  • Users stopped asking good questions!
  • TREC queries: tens to hundreds of words
  • Alta Vista: at most 2-3 words
  • Crisis of abundance
  • Relevance ranking for very short queries
  • Quality complements relevance: that's where
    hand-made topic directories shine

8
Hyperlink Induced Topic Search
[Figure: a query goes to a keyword search engine; the response is grown into an expanded graph.]
Hubs and authorities: $a = Eh$, $h = E^{T}a$
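
A minimal sketch of the hub and authority iteration named on this slide, run on a small hand-made edge list instead of a real expanded graph. The function names, the L2 normalization, and the fixed iteration count are illustrative assumptions, not details from the talk. Each edge `(src, dst)` means page `src` links to page `dst`.

```python
from math import sqrt

def normalize(scores):
    """Scale a score dict to unit L2 norm (assumed normalization)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: add up hub scores of the pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: add up authority scores of the pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

# Hypothetical toy graph: three hub-like pages citing two authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
print(sorted(auth.items(), key=lambda kv: -kv[1])[:2])
```

Because the scores depend only on link structure, the loop never looks at page text, which is the content independence noted under Observations.
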
9
PageRank and Google
  • Prestige of a page is proportional to sum of
    prestige of citing pages
  • Standard bibliometric measure of influence
  • Simulate a random walk on the Web to precompute
    prestige of all pages
  • Sort keyword-matched responses by decreasing
    prestige

[Figure: pages p1, p2, and p3 each cite page p4.]
$p_4 \propto p_1 + p_2 + p_3$, i.e., $p = Ep$ (up to normalization)
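
A rough sketch of precomputing prestige with a random walk, as the bullets above describe. The damping factor (a random jump with probability 0.15), the handling of pages without outlinks, and the toy link table are assumptions added to make the walk well defined; the slide itself only states the proportionality.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it cites."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Page with no outlinks: the walk restarts uniformly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical graph: p1, p2, p3 cite p4 (as in the figure); p4 cites p1.
links = {"p1": ["p4"], "p2": ["p4"], "p3": ["p4"], "p4": ["p1"]}
rank = pagerank(links)
print(max(rank, key=rank.get))
```

The ranks are computed once, offline; at query time the keyword-matched pages only need to be sorted by their stored value.
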
10
Observations
  • HITS
    • Uses text initially to select Web subgraph
    • Expands subgraph by radius 1: magic!
    • h and a scores independent of content
    • Iterations required at query time
  • Google/PageRank
    • Precomputed query-independent prestige
    • No iterations needed at query time, faster
    • Keyword query selects subgraph to rank
    • No notion of hub or bipartite reinforcement

11
Limitations
  • Artificial decoupling of text and links
  • Connectivity-based topic drift (HITS)
    • "movie awards" → "movies"
    • Expanders at www.web-popularity.com
  • Feature diffusion (Google)
    • "more evil than evil" → www.microsoft.com
    • New threat of anchor-text spamming
  • Decoupled ranking (Google)
    • "harvard mother" → Bill Gates's bio page!

12
Genealogy
[Figure: genealogy of hyperlink-analysis techniques. Nodes: bibliometry; exploiting anchor text; Google; HITS; outlier elimination; topic distillation @Compaq; Clever @IBM; text classification; focused crawling; hypertext classification; relaxation labeling; crawling context graphs; learning topic paths.]
13
Reducing topic drift: anchor text
  • Page modeled as sequence of tokens and outlinks
  • Radius of influence around each token
  • Query term matching token increases link weight
  • Favors hubs and authorities near relevant pages
  • Better answers than HITS
  • Ad-hoc spreading activation, but no formal
    model as yet
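
One way to picture the radius-of-influence idea is the sketch below. Since the slide says the scheme is ad hoc with no formal model, the weighting formula here (a base weight of 1 plus a fixed boost per nearby query-term match) is purely an assumption, as are the function name and the token format.

```python
def link_weights(tokens, query_terms, radius=2, boost=1.0):
    """tokens: list of ("word", text) or ("link", url) in page order."""
    query = {q.lower() for q in query_terms}
    weights = {}
    for i, (kind, value) in enumerate(tokens):
        if kind != "link":
            continue
        # Look at tokens within `radius` positions on either side of the link.
        window = tokens[max(0, i - radius): i + radius + 1]
        matches = sum(1 for k, v in window if k == "word" and v.lower() in query)
        weight = 1.0 + boost * matches
        # If a URL occurs more than once, keep its best weight.
        weights[value] = max(weight, weights.get(value, 0.0))
    return weights

# Hypothetical page: one link sits next to the query term, one does not.
page = [("word", "best"), ("word", "bike"), ("link", "http://a.example/"),
        ("word", "and"), ("word", "some"), ("word", "unrelated"),
        ("word", "text"), ("link", "http://b.example/")]
weights = link_weights(page, ["bike"])
print(weights)
```
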

14
Reducing topic drift: outlier detection
  • Search response is usually purer than radius-1
    expansion
  • Compute document term vectors
  • Compute centroid of response vectors
  • Eliminate far-away expanded vectors
  • Results improve
  • Why stop at radius 1?

[Figure: vector-space document model showing the keyword search response, its centroid, the cut-off radius, and the expanded graph.]
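
The steps above can be sketched as follows. This is an illustration under assumptions: raw term counts stand in for the document vectors, "far away" is measured with cosine similarity, and the threshold value and function names are invented for the example.

```python
from collections import Counter
from math import sqrt

def term_vector(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average of the given term vectors."""
    total = Counter()
    for vec in vectors:
        for term, count in vec.items():
            total[term] += count / len(vectors)
    return total

def prune_expansion(response_docs, expanded_docs, min_similarity=0.2):
    """Keep expanded documents that stay close to the response centroid."""
    center = centroid([term_vector(d) for d in response_docs])
    return [d for d in expanded_docs
            if cosine(term_vector(d), center) >= min_similarity]

# Hypothetical search response and radius-1 expansion.
response = ["mountain bike trail guide", "bike repair and bike shops"]
expanded = ["bike clubs and trail maps", "web popularity counter and banner ads"]
kept = prune_expansion(response, expanded)
print(kept)
```
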

15
Resource discovery
  • Given
    • Yahoo-like topic tree with example URLs
    • A selection of good topics to explore
    • Examples, not queries, define topics
    • Need 2-way decision, not ad-hoc cut-off
  • Goal
    • Start from the good / relevant examples
    • Crawl to collect additional relevant URLs
    • Fetch as few irrelevant URLs as possible

16
A model for relevance
[Figure: topic taxonomy tree. Nodes: All; BusEcon, Recreation, Arts; Companies, Cycling, ...; Bike Shops, Clubs, Mt. Biking. Legend: good classes, path class, subsumed classes, blocked class.]
17
Pr(c|d) from Pr(d|c) using Bayes' rule
  • Decide topic: topic c is picked with prior
    probability $\pi(c)$; $\sum_c \pi(c) = 1$
  • Each c has parameters $\theta(c,t)$ for terms t
  • Coin with face probabilities $\sum_t \theta(c,t) = 1$
  • Fix document length n(d) and toss coin
  • Naïve yet effective; can use other algorithms
  • Given c, the probability of the document is
    multinomial in its term counts
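
A small sketch of the classifier this slide outlines: estimate a prior and per-class term probabilities from labeled examples, then apply Bayes' rule. The multinomial coefficient is the same for every class, so it cancels when the posterior is normalized and is omitted. Add-one smoothing, the function names, and the toy training data are assumptions for illustration.

```python
from collections import Counter
from math import exp, log

def train(examples):
    """examples: list of (class_label, text) pairs. Returns (prior, theta)."""
    docs_per_class = Counter(label for label, _ in examples)
    counts = {label: Counter() for label in docs_per_class}
    for label, text in examples:
        counts[label].update(text.lower().split())
    vocab = {t for term_counts in counts.values() for t in term_counts}
    prior = {c: n / len(examples) for c, n in docs_per_class.items()}
    theta = {}
    for c, term_counts in counts.items():
        # Add-one (Laplace) smoothing is an assumption, not from the slide.
        total = sum(term_counts.values()) + len(vocab)
        theta[c] = {t: (term_counts[t] + 1) / total for t in vocab}
    return prior, theta

def posterior(text, prior, theta):
    """Return Pr(c|d) for each class c, computed in log space."""
    log_score = {}
    for c in prior:
        score = log(prior[c])
        for t in text.lower().split():
            if t in theta[c]:  # terms unseen in training are skipped
                score += log(theta[c][t])
        log_score[c] = score
    top = max(log_score.values())
    unnormalized = {c: exp(s - top) for c, s in log_score.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

# Hypothetical training examples for two topics.
examples = [("cycling", "bike trail ride"), ("cycling", "bike shop gear"),
            ("arts", "museum painting gallery"), ("arts", "painting sculpture")]
prior, theta = train(examples)
post = posterior("bike gear trail", prior, theta)
print(max(post, key=post.get))
```
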

18
Enhanced models for hypertext
  • c = class, d = text, N = neighbors
  • Text-only model: Pr(d|c)
  • Using neighbors' text to judge my topic:
    Pr(d, d(N) | c)
  • Better recursive model: Pr(d, c(N) | c)
  • Relaxation labeling over Markov random fields
  • Or, EM formulation

19
Hyperlink modeling boosts accuracy
  • 9600 patents from 12 classes marked by USPTO
  • Patents have text and prior-art links
  • Expand test patent to include neighborhood
  • Forget and re-estimate a fraction of the neighbors'
    classes
  • (Even better for Yahoo)

20
Resource discovery: basic approach
  • Topic taxonomy with examples and good topics
    specified
  • Crawler coupled to hypertext classifier
  • Crawl frontier expanded in relevance order
  • Neighbors of good hubs expanded with high priority

[Figure: crawl expanding outward from the example URLs, annotated with the radius-1 rule and the radius-2 rule.]
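
The relevance-ordered frontier can be sketched with a priority queue over a toy in-memory "web". Everything concrete here is an assumption for illustration: the keyword-overlap `relevance` function stands in for the hypertext classifier, an unvisited link simply inherits the score of the page that cites it, and the page table is made up. Hub-based prioritization is left out.

```python
import heapq

TOPIC_TERMS = {"bike", "trail", "mountain"}  # hypothetical topic vocabulary

def relevance(text):
    """Stand-in classifier: fraction of words that are topic terms."""
    words = text.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0

def focused_crawl(web, seeds, budget):
    """web: url -> (text, outlinks). Returns [(url, score)] in fetch order."""
    frontier = [(-1.0, url) for url in seeds]  # negated priority: max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, outlinks = web[url]
        score = relevance(text)
        fetched.append((url, score))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # An unvisited link inherits the relevance of its citing page.
                heapq.heappush(frontier, (-score, link))
    return fetched

web = {
    "seed": ("bike trail club", ["bikes", "ads"]),
    "bikes": ("mountain bike shop", ["trails"]),
    "ads": ("casino banner offers", ["spam1", "spam2"]),
    "trails": ("bike trail maps", []),
    "spam1": ("casino casino", []),
    "spam2": ("banner offers", []),
}
fetched = focused_crawl(web, ["seed"], budget=4)
print([url for url, _ in fetched])
```

With a budget of four fetches, the links found on the irrelevant page sink to the bottom of the queue and are never fetched. Ties between equally scored links are broken by URL order, which is an artifact of this sketch.
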
21
Focused crawler block diagram
22
Focused crawling evaluation
  • Harvest rate
    • What fraction of crawled pages are relevant
  • Robustness across seed sets
    • Perform separate crawls with random disjoint
      samples
    • Measure overlap in URLs, server IP addresses, and
      best-rated resources
  • Evidence of non-trivial work
    • Path length to the best resources
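
Two of these measures are simple enough to sketch directly. The relevance threshold of 0.5 and the choice to normalize overlap by the smaller crawl are assumptions; the slide does not say how either was defined.

```python
def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance reaches the threshold."""
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for s in relevance_scores if s >= threshold)
    return relevant / len(relevance_scores)

def overlap(crawl_a, crawl_b):
    """Shared items as a fraction of the smaller crawl (assumed definition)."""
    a, b = set(crawl_a), set(crawl_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Hypothetical crawl results.
rate = harvest_rate([0.9, 0.1, 0.8, 0.7])
shared = overlap(["u1", "u2", "u3"], ["u2", "u3", "u4", "u5"])
print(rate, shared)
```

The same `overlap` function applies to URLs, server addresses, or top-rated resources, depending on which lists are passed in.
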

23
Harvest rate
[Figure: harvest rate plot; visible label: Unfocused.]
24
Crawl robustness
[Figure: URL overlap and server overlap between Crawl 1 and Crawl 2.]
25
Robustness of resource quality
  • Sample disjoint sets of starting URLs
  • Two separate crawls
  • Run HITS/Clever
  • Find best authorities
  • Order by rank
  • Find overlap in the top-rated resources

26
Distance to best resources
27
A top hub on airlines after half an hour of
focused crawling
28
A top hub on bicycling after one hour
of focused crawling
29
Learning context graphs
  • Topics form connected cliques
    • heart disease ↔ swimming, hiking
    • cycling ↔ first-aid!
  • Radius-1 rule can be myopic
    • Trapped within boundaries of related topics
  • From short pre-crawled paths
    • Can learn frequent chains of related topics
    • Use this knowledge to circumvent local topic
      traps

30
Context improves focused crawling
31
Roadmap
  • Hyperlink mining: a short history
  • Resource discovery
  • Content-based locality in hypertext
  • Taxonomy models, topic distillation
  • Strategies for focused crawling
  • Data capture and mining architecture
  • The Memex collaboration system
  • Collaborative construction of vertical portals
  • Link metadata management architecture
  • Surfing backwards on the Web

32
Memex project goals
  • Infrastructure to support spontaneous formation
    of topic-based communities
  • Mining algorithms for personal and community
    level topic management and collaborative resource
    discovery
  • Extensible API for plugging in additional
    hypertext analysis tools

33
Memex project status
  • Java applet client
    • Netscape 4.5 (JavaScript) available
    • IE4 (ActiveX) planned
  • Server code for Unix and Windows
    • Servlets, IBM Universal Database
    • Berkeley DB lightweight storage manager
    • Simple-to-install RPMs for Linux planned
  • About a dozen alpha testers
  • First beta available 12/2000

34
Creating personal topic spaces
  • "?" indicates automatic placement by the Memex
    classifier
  • User cuts and pastes to correct or reinforce
    the Memex classifier
  • File manager-like interface
  • Privacy choice
  • Valuable user input and feedback on topics and
    associated examples

35
Replaying topic-based contexts
  • Choice of topic context
  • Replay of recent browsing context restricted
    to chosen topic
  • Active browser monitoring and dynamic layout of
    new/incremental context graph
  • Better mobility than the one-dimensional history
    provided by popular browsers
  • "Where was I when last surfing around
    /Software/Programming?"

36
Synthesis of a community taxonomy
  • Users classify URLs into folders
  • How to synthesize personal folders into common
    taxonomy?
  • Combine multiple similarity hints

[Figure: example user folders and the URLs filed in them]
  • Media: kpfa.org, bbc.co.uk, kron.com
  • Broadcasting: channel4.com, kcbs.com
  • Entertainment: foxmovies.com, lucasfilms.com
  • Studios: miramax.com
37
Setting up the focused crawler
[Figure: Taxonomy Editor screenshot. Labels: Current Examples; Drag; Suggested Additional Examples.]
38
Monitoring harvest rate
[Figure: relevance/harvest rate plotted against time, with one point per URL and a moving average.]
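
The moving average in this plot can be computed with a fixed-size window over the per-URL relevance scores. The window length and the function name are assumptions; the slide does not state which window was used.

```python
from collections import deque

def moving_average(scores, window):
    """Average of the last `window` scores after each newly fetched URL."""
    recent = deque(maxlen=window)  # old scores fall out automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages

# Hypothetical per-URL relevance judgments (1 = relevant, 0 = not).
averages = moving_average([1, 0, 1, 1, 0, 0], window=3)
print(averages)
```
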
39
Overview of the Memex system
[Figure: overview of the Memex system. Components: browser, running client applet, client JAR, Memex server, event-handler servlets, mining demons. Labels: visit, attach, folder, search, download, context, recommendation, taxonomy synthesis, resource discovery, classification, clustering. Storage: archive, relational metadata, text index, topic models.]
Memex client-server protocol and workload sharing
negotiations
40
Surfing backwards using contexts
  • Space-bounded referrer log
  • HTTP extension to query backlink data

GET /P2 HTTP/1.0
Referer: http://S1/P1
[Figure: client C follows a link from http://S1/P1 on server S1 to http://S2/P2 on server S2.]
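
A possible shape for the space-bounded referrer log on the server side is sketched below. The class name, the two size limits, and the eviction policy (drop the page that entered the log first, keep only the most recent referrers per page) are all assumptions; the slide does not describe how the bound is enforced or how the HTTP extension exposes the data.

```python
from collections import OrderedDict, deque

class ReferrerLog:
    """Remembers a bounded number of referrers for a bounded number of pages."""

    def __init__(self, max_pages=1000, max_referrers_per_page=10):
        self.max_pages = max_pages
        self.max_referrers_per_page = max_referrers_per_page
        self._log = OrderedDict()  # path -> deque of referrer URLs

    def record(self, path, referer):
        """Store the Referer header value seen on a request for `path`."""
        if not referer:
            return
        refs = self._log.get(path)
        if refs is None:
            if len(self._log) >= self.max_pages:
                self._log.popitem(last=False)  # evict the oldest page entry
            refs = self._log[path] = deque(maxlen=self.max_referrers_per_page)
        if referer not in refs:
            refs.append(referer)  # a full deque drops its oldest referrer

    def backlinks(self, path):
        """Return the known referrers (backlinks) for `path`."""
        return list(self._log.get(path, ()))

# The request from the slide: GET /P2 with Referer http://S1/P1.
log = ReferrerLog(max_pages=2, max_referrers_per_page=2)
log.record("/P2", "http://S1/P1")
print(log.backlinks("/P2"))
```
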
41
Surfing backwards 1
42
Surfing backwards 2
43
Surfing backwards 3
44
Surfing backwards 4
45
User study and analysis
  • (1999) Significant improvement in finding
    comprehensive resource lists
    • Six broad information needs, 25 volunteers
    • Find good resources within limited time
    • Backlinks faked using search engines
    • Blind-reviewed by three other volunteers
  • (2000) Average path length of the undirected Web
    graph is much smaller than that of the directed
    Web graph
  • (2000) Better focused crawls using backlinks
  • Proposal to W3C

46
Backlinks improve focused crawling
  • Follow forward HREF as before
  • Also expand backlinks using link queries
  • Classify pages as before

Sometimes distracts in unrewarding work
but pays off in the end
47
Surfing backwards summary
  • "Life must be lived forwards, but it can only be
    understood backwards" (Søren Kierkegaard)
  • Hubs are everywhere!
  • To find them, look backwards
  • Bidirectional surfing is a valuable means to seed
    focused resource discovery
  • Even if one has to depend on search engines
    initially for link queries

48
Conclusion
  • Architecture for topic-specific web resource
    discovery
    • Driven by examples collected from surfing and
      bookmarking activity
    • Reduced dependence on large crawlers
    • Modest desktop hardware adequate
  • Variable-radius goal-directed crawling
    • High harvest rate
    • High-quality resources found far from keyword
      query response nodes