Title: Nurturing content-based collaborative communities on the Web
1. Nurturing content-based collaborative communities on the Web
- Soumen Chakrabarti
- Center for Intelligent Internet Research
- Computer Science and Engineering, Indian Institute of Technology Bombay
- www.cse.iitb.ernet.in/soumen
- www.cse.iitb.ernet.in/cfiir
2. Generic search engines
- Struggle to cover the expanding Web
  - 35% coverage in 1997 (Bharat and Broder)
  - 18% in 1999 (Lawrence and Lee Giles)
  - Google rebounds to about 50% in 2000
- Moore's law vs. Web population
  - Search quality, index freshness
- Cannot afford advanced processing
  - Alta Vista serves 40 million queries / day
  - Cannot even afford to seek on disk (8 ms)
  - Limits the intelligence of search engines
3. Scale vs. quality
[Figure: approaches plotted on scale vs. quality axes. At the high-scale end: keyword-based search engines (HotBot, Alta Vista); then link-assisted ranking and topic distillation (Google, Clever); then focused crawling and resource discovery; at the high-quality end: lexical networks, parsing, semantic indexing]
4. The case for vertical portals
- "Portals and search pages are changing rapidly, in part because their biggest strength (massive size and reach) can also be a drawback. The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical and useful than trying to cover the entire universe." (San Jose Mercury News, 1/1999)
5. Scaling through specialization
- The Web shows content-based locality
  - Link-based clusters correlated with content
  - Content-based communities emerge in a spontaneous, decentralized fashion
- Can learn and exploit locality patterns
  - Analyze page visits and bookmarks
- Automatically construct a focused portal with resources that
  - Have high relevance and quality
  - Are up-to-date and collectively comprehensive
6. Roadmap
- Hyperlink mining: a short history
- Resource discovery
- Content-based locality in hypertext
- Taxonomy models, topic distillation
- Strategies for focused crawling
- Data capture and mining architecture
- The Memex collaboration system
- Collaborative construction of vertical portals
- Link metadata management architecture
- Surfing backwards on the Web
7. Historical background
- First-generation Web search engines
  - Delete stopwords from queries
  - Can only do syntactic matching
- Users stopped asking good questions!
  - TREC queries: tens to hundreds of words
  - Alta Vista: at most 2-3 words
- Crisis of abundance
  - Relevance ranking for very short queries
  - Quality complements relevance; that's where hand-made topic directories shine
8. Hyperlink Induced Topic Search (HITS)
[Figure: the query goes to a keyword search engine; its response is expanded by one link radius into an expanded graph]
- Hubs and authorities computed by iterating a = E h and h = E^T a over the expanded graph (sketched below)
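For concreteness, a minimal sketch of the hub/authority iteration, assuming the expanded graph is given as an adjacency list; the function name and iteration count are illustrative, not part of the original HITS description:

# Minimal sketch of the HITS iteration on the expanded graph (illustrative).
# out_links: dict mapping each page to the list of pages it links to.
def hits(out_links, iters=50):
    pages = set(out_links) | {v for vs in out_links.values() for v in vs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # a = E h: a page's authority is the total hub score of pages pointing to it
        auth = {p: 0.0 for p in pages}
        for u, vs in out_links.items():
            for v in vs:
                auth[v] += hub[u]
        # h = E^T a: a page's hub score is the total authority of pages it points to
        hub = {u: sum(auth[v] for v in out_links.get(u, [])) for u in pages}
        # normalize so the scores stay bounded
        na = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        nh = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {p: x / na for p, x in auth.items()}
        hub = {p: x / nh for p, x in hub.items()}
    return hub, auth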
9. PageRank and Google
- Prestige of a page is proportional to the sum of the prestige of citing pages
  - Standard bibliometric measure of influence
- Simulate a random walk on the Web to precompute the prestige of all pages (sketched below)
- Sort keyword-matched responses by decreasing prestige
[Figure: pages p1, p2, p3 each link to p4, so p4's prestige derives from p1 + p2 + p3; i.e., p = E p]
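A minimal sketch of the prestige precomputation, assuming the Web graph is given as an adjacency list; the damping/teleport factor is the usual Google-style addition to make the walk converge and is not spelled out on the slide:

# Minimal sketch of prestige precomputation by power iteration (illustrative).
# A Google-style damping/teleport factor d is added so the random walk converges;
# the slide's p = E p corresponds to the undamped case.
def pagerank(out_links, d=0.85, iters=50):
    pages = set(out_links) | {v for vs in out_links.values() for v in vs}
    n = len(pages)
    p = {pg: 1.0 / n for pg in pages}
    for _ in range(iters):
        nxt = {pg: (1.0 - d) / n for pg in pages}
        for u in pages:
            outs = out_links.get(u, [])
            if outs:
                share = d * p[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # dead end: spread this page's prestige uniformly
                for v in pages:
                    nxt[v] += d * p[u] / n
        p = nxt
    return p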
10. Observations
- HITS
- Uses text initially to select Web subgraph
- Expands subgraph by radius 1 (magic!)
- h and a scores independent of content
- Iterations required at query time
- Google/PageRank
- Precomputed query-independent prestige
- No iterations needed at query time, faster
- Keyword query selects subgraph to rank
- No notion of hub or bipartite reinforcement
11. Limitations
- Artificial decoupling of text and links
- Connectivity-based topic drift (HITS)
- "movie awards" → movies
- Expanders at www.web-popularity.com
- Feature diffusion (Google)
- "more evil than evil" → www.microsoft.com
- New threat of anchor-text spamming
- Decoupled ranking (Google)
- "harvard mother" → Bill Gates's bio page!
12. Genealogy
[Figure: genealogy of research threads: bibliometry; HITS; Google; exploiting anchor text; outlier elimination; topic distillation @Compaq; Clever @IBM; text classification; hypertext classification; relaxation labeling; focused crawling; crawling context graphs; learning topic paths]
13. Reducing topic drift: anchor text
- Page modeled as a sequence of tokens and outlinks
- Radius of influence around each token
- A query term matching a token increases the weight of nearby links (sketched below)
- Favors hubs and authorities near relevant pages
- Better answers than HITS
- Ad-hoc spreading activation, but no formal model as yet
[Figure: a query term's window of influence covering nearby outlinks on the page]
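A rough sketch of this proximity-based weighting, under assumed representations (a page as a token list with inline ("LINK", url) markers, a fixed radius, and a set of lowercase query terms); the exact weighting used in the Clever work is not reproduced here:

# Sketch of proximity-based link weighting (illustrative; not the exact Clever scheme).
RADIUS = 5  # assumed radius of influence, in tokens

def link_weights(tokens, query_terms):
    weights = {}
    # positions where a token matches a query term
    hits = [i for i, t in enumerate(tokens)
            if isinstance(t, str) and t.lower() in query_terms]
    for i, t in enumerate(tokens):
        if isinstance(t, tuple) and t[0] == "LINK":
            url = t[1]
            # each query-term match within RADIUS tokens boosts this link
            boost = sum(1.0 for h in hits if abs(h - i) <= RADIUS)
            weights[url] = weights.get(url, 1.0) + boost
    return weights

# The weight can then scale the link's contribution in the HITS update,
# e.g. auth[v] += weights.get(v, 1.0) * hub[u].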
14. Reducing topic drift: outlier detection
- The search response is usually purer than its radius-1 expansion
- Compute document term vectors
- Compute the centroid of the response vectors
- Eliminate expanded documents whose vectors are far from the centroid (sketched below)
- Results improve
- Why stop at radius 1?
[Figure: vector-space document model; the keyword search response defines a centroid, and expanded-graph documents outside a cut-off radius are discarded]
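A minimal sketch of the pruning step, assuming raw term-frequency vectors and a cosine-similarity cut-off; the threshold value is illustrative:

# Sketch of centroid-based outlier pruning (illustrative).
# Documents are plain strings; response_docs come from the keyword search,
# expanded_docs from the radius-1 expansion; min_sim is an assumed cosine cut-off.
import math
from collections import Counter

def unit_vector(text):
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def prune_expansion(response_docs, expanded_docs, min_sim=0.2):
    # centroid of the response vectors
    centroid = Counter()
    for doc in response_docs:
        centroid.update(unit_vector(doc))
    norm = math.sqrt(sum(c * c for c in centroid.values())) or 1.0
    centroid = {t: c / norm for t, c in centroid.items()}
    # keep only expanded documents close enough to the centroid
    return [d for d in expanded_docs if cosine(unit_vector(d), centroid) >= min_sim]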
15. Resource discovery
- Given
- Yahoo-like topic tree with example URLs
- A selection of good topics to explore
- Examples, not queries, define topics
- Need 2-way decision, not ad-hoc cut-off
- Goal
- Start from the good / relevant examples
- Crawl to collect additional relevant URLs
- Fetch as few irrelevant URLs as possible
16. A model for relevance
[Figure: a topic taxonomy rooted at All, with children such as BusEcon, Recreation, and Arts, and deeper nodes such as Companies, Cycling, Bike Shops, Clubs, and Mt. Biking. The chosen good classes (e.g., Cycling) have subsumed descendant classes, lie on path classes back to the root, and are separated from blocked sibling classes]
17. Pr(c|d) from Pr(d|c) using Bayes rule
- Generation model: topic c is picked with prior probability π(c), with Σ_c π(c) = 1
- Each class c has term parameters θ(c,t): a coin with face probabilities, Σ_t θ(c,t) = 1
- Fix the document length n(d) and toss the coin n(d) times
- Naive yet effective; can use other algorithms (sketched below)
- Given c, the probability of the document is Pr(d|c) = (n(d) choose {n(d,t)}) Π_t θ(c,t)^n(d,t), where n(d,t) counts term t in d; Bayes rule then gives Pr(c|d) ∝ π(c) Pr(d|c)
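A compact sketch of this generative classifier, with Laplace smoothing added (an assumption, not stated on the slide) so unseen terms do not zero out the posterior:

# Compact sketch of the slide's generative classifier (multinomial naive Bayes).
import math
from collections import Counter, defaultdict

def train(labeled_docs, smoothing=1.0):
    # labeled_docs: list of (class, text); returns priors pi(c) and term params theta(c,t)
    class_counts, term_counts, vocab = Counter(), defaultdict(Counter), set()
    for c, text in labeled_docs:
        class_counts[c] += 1
        for t in text.lower().split():
            term_counts[c][t] += 1
            vocab.add(t)
    total = sum(class_counts.values())
    pi = {c: n / total for c, n in class_counts.items()}
    theta = {}
    for c in class_counts:
        denom = sum(term_counts[c].values()) + smoothing * len(vocab)
        theta[c] = {t: (term_counts[c][t] + smoothing) / denom for t in vocab}
    return pi, theta

def posterior(text, pi, theta):
    # Pr(c|d) proportional to pi(c) * prod_t theta(c,t)^n(d,t), computed in log space
    n_dt = Counter(text.lower().split())
    log_scores = {}
    for c in pi:
        s = math.log(pi[c])
        for t, n in n_dt.items():
            if t in theta[c]:
                s += n * math.log(theta[c][t])
        log_scores[c] = s
    m = max(log_scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}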
18. Enhanced models for hypertext
- c = class, d = text, N = neighbors
- Text-only model: Pr(d|c)
- Using neighbors' text to judge my topic: Pr(d, d(N) | c)
- Better recursive model: Pr(d, c(N) | c)
- Relaxation labeling over Markov random fields (sketched below)
- Or, an EM formulation
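One simplified reading of the relaxation-labeling idea: start from text-only class probabilities and repeatedly blend each page's own evidence with its neighbors' current beliefs. The coupling weight and averaging rule below are illustrative, not the exact Markov-random-field formulation:

# Simplified relaxation-labeling sketch for hypertext classification (illustrative).
# text_prob[p][c] is the text-only Pr(c|d_p); links maps each page to its neighbors.
def relax_labels(text_prob, links, coupling=0.5, iters=10):
    labels = {p: dict(probs) for p, probs in text_prob.items()}
    classes = list(next(iter(text_prob.values())))
    for _ in range(iters):
        new_labels = {}
        for p in labels:
            nbrs = [labels[q] for q in links.get(p, []) if q in labels]
            scores = {}
            for c in classes:
                # blend the page's own text evidence with neighbors' current beliefs
                nbr_support = sum(nb[c] for nb in nbrs) / len(nbrs) if nbrs else labels[p][c]
                scores[c] = (1 - coupling) * text_prob[p][c] + coupling * nbr_support
            z = sum(scores.values()) or 1.0
            new_labels[p] = {c: s / z for c, s in scores.items()}
        labels = new_labels
    return labels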
19. Hyperlink modeling boosts accuracy
- 9600 patents from 12 classes marked by USPTO
- Patents have text and prior-art links
- Expand each test patent to include its neighborhood
- Forget, then re-estimate, a fraction of the neighbors' classes
- (Even better for Yahoo)
20. Resource discovery: basic approach
- Topic taxonomy with examples and good topics specified
- Crawler coupled to a hypertext classifier
- Crawl frontier expanded in relevance order (see the sketch below)
- Neighbors of good hubs expanded with high priority
[Figure: example URLs seed the crawl; the radius-1 rule expands neighbors of relevant pages, and the radius-2 rule pushes one step further from good hubs]
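A bare-bones sketch of the relevance-ordered frontier; fetch, extract_links, and relevance are hypothetical stand-ins for the downloader and the hypertext classifier, and only the radius-1 rule is shown:

# Sketch of a focused crawl loop with a relevance-ordered frontier (illustrative).
import heapq

def focused_crawl(seed_urls, fetch, extract_links, relevance, budget=1000, threshold=0.5):
    frontier = [(-1.0, u) for u in seed_urls]   # max-heap via negated priorities
    heapq.heapify(frontier)
    seen, harvested = set(seed_urls), []
    while frontier and len(harvested) < budget:
        _prio, url = heapq.heappop(frontier)
        page = fetch(url)
        score = relevance(page)                  # classifier's Pr(good topic | page)
        if score >= threshold:
            harvested.append((url, score))
            # radius-1 rule: expand neighbors of relevant pages, best pages first
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score, link))
    return harvested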
21. Focused crawler block diagram
22. Focused crawling: evaluation
- Harvest rate
  - What fraction of crawled pages are relevant?
- Robustness across seed sets
  - Perform separate crawls with random disjoint samples
  - Measure overlap in URLs, server IP addresses, and best-rated resources
- Evidence of non-trivial work
  - Path length to the best resources
23. Harvest rate
[Figure: harvest rate over the course of the crawl, focused vs. unfocused]
24. Crawl robustness
[Figure: URL overlap and server overlap between crawl 1 and crawl 2]
25. Robustness of resource quality
- Sample disjoint sets of starting URLs
- Two separate crawls
- Run HITS/Clever
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
26. Distance to best resources
27. A top hub on airlines after half an hour of focused crawling
28. A top hub on bicycling after one hour of focused crawling
29. Learning context graphs
- Topics form connected cliques
  - heart disease → swimming, hiking
  - cycling → first-aid!
- The radius-1 rule can be myopic
  - Trapped within boundaries of related topics
- From short pre-crawled paths
  - Can learn frequent chains of related topics
  - Use this knowledge to circumvent local topic traps
30. Context improves focused crawling
31. Roadmap
- Hyperlink mining: a short history
- Resource discovery
- Content-based locality in hypertext
- Taxonomy models, topic distillation
- Strategies for focused crawling
- Data capture and mining architecture
- The Memex collaboration system
- Collaborative construction of vertical portals
- Link metadata management architecture
- Surfing backwards on the Web
32. Memex project goals
- Infrastructure to support spontaneous formation of topic-based communities
- Mining algorithms for personal and community-level topic management and collaborative resource discovery
- Extensible API for plugging in additional hypertext analysis tools
33. Memex project status
- Java applet client
- Netscape 4.5 (Javascript) available
- IE4 (ActiveX) planned
- Server code for Unix and Windows
- Servlets + IBM Universal Database
- Berkeley DB lightweight storage manager
- Simple-to-install RPMs for Linux planned
- About a dozen alpha testers
- First beta available 12/2000
34. Creating personal topic spaces
[Screenshot: a file-manager-like interface with a privacy choice; a marker indicates automatic placement by the Memex classifier, and the user cuts and pastes to correct or reinforce the classifier]
- Valuable user input and feedback on topics and associated examples
35. Replaying topic-based contexts
[Screenshot: a choice of topic context, and a replay of the recent browsing context restricted to the chosen topic]
- Active browser monitoring and dynamic layout of the new / incremental context graph
- Better mobility than the one-dimensional history provided by popular browsers
- "Where was I when last surfing around /Software/Programming?"
36. Synthesis of a community taxonomy
- Users classify URLs into folders
- How to synthesize personal folders into a common taxonomy?
- Combine multiple similarity hints
[Figure: users' folders such as Media, Broadcasting, Entertainment, and Studios, holding URLs such as kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, and miramax.com, being merged into a shared taxonomy]
37. Setting up the focused crawler
[Screenshot: the taxonomy editor showing current examples, with suggested additional examples that can be dragged in]
38. Monitoring harvest rate
[Figure: relevance/harvest rate over time; each point is one crawled URL, with a moving average overlaid]
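The moving average in the plot can be maintained with a few lines; this sketch assumes a sliding window over the most recent fetches (the window size is illustrative):

# Sketch of harvest-rate monitoring with a moving average (illustrative window size).
from collections import deque

class HarvestMonitor:
    def __init__(self, window=100):
        self.recent = deque(maxlen=window)

    def record(self, relevant):
        # relevant: True if the classifier judged the fetched page on-topic
        self.recent.append(1.0 if relevant else 0.0)

    def harvest_rate(self):
        # fraction of recently crawled pages that were relevant
        return sum(self.recent) / len(self.recent) if self.recent else 0.0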
39. Overview of the Memex system
[Figure: block diagram of the Memex system. The browser runs the client applet (client JAR), which reports visit, folder, attach, and download events to event-handler servlets on the Memex server; the server keeps an archive, relational metadata, a text index, and topic models, and runs mining demons for classification, clustering, context tracking, search, recommendation, taxonomy synthesis, and resource discovery; the Memex client-server protocol supports workload-sharing negotiations]
40. Surfing backwards using contexts
- Space-bounded referrer log
- HTTP extension to query backlink data (sketched below)
[Figure: client C fetches http://S1/P1 and follows a link to http://S2/P2; the request "GET /P2 HTTP/1.0" carries "Referer: http://S1/P1", so server S2 learns a backlink for P2]
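A rough sketch of a space-bounded referrer log answering backlink queries; the class, size bound, and eviction policy are assumptions, and the actual HTTP extension proposed to the W3C is not reproduced here:

# Sketch of a space-bounded referrer log answering backlink queries (illustrative).
from collections import OrderedDict, defaultdict

class ReferrerLog:
    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self.backlinks = defaultdict(OrderedDict)   # target path -> referring URLs (ordered)
        self.size = 0

    def record(self, target_path, referer_url):
        # call for each request that carries a Referer header
        bucket = self.backlinks[target_path]
        if referer_url not in bucket:
            bucket[referer_url] = True
            self.size += 1
        if self.size > self.max_entries:
            # evict the oldest referrer of the most-populated target to stay within bounds
            victim = max(self.backlinks, key=lambda p: len(self.backlinks[p]))
            self.backlinks[victim].popitem(last=False)
            self.size -= 1

    def query(self, target_path):
        # return the known pages linking to target_path on this server
        return list(self.backlinks.get(target_path, {}))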
41. Surfing backwards (1)
42. Surfing backwards (2)
43. Surfing backwards (3)
44. Surfing backwards (4)
45. User study and analysis
- (1999) Significant improvement in finding comprehensive resource lists
  - Six broad information needs, 25 volunteers
  - Find good resources within a limited time
  - Backlinks faked using search engines
  - Blind-reviewed by three other volunteers
- (2000) The average path length of the undirected Web graph is much smaller than that of the directed Web graph
- (2000) Better focused crawls using backlinks
- Proposal to W3C
46. Backlinks improve focused crawling
- Follow forward HREFs as before
- Also expand backlinks using link queries
- Classify pages as before
- Sometimes distracts the crawler into unrewarding work, but pays off in the end
47. Surfing backwards: summary
- "Life must be lived forwards, but it can only be understood backwards" (Soren Kierkegaard)
- Hubs are everywhere!
  - To find them, look backwards
- Bidirectional surfing is a valuable means to seed focused resource discovery
  - Even if one has to depend on search engines initially for link queries
48. Conclusion
- Architecture for topic-specific Web resource discovery
  - Driven by examples collected from surfing and bookmarking activity
- Reduced dependence on large crawlers
  - Modest desktop hardware adequate
- Variable-radius, goal-directed crawling
  - High harvest rate
  - High-quality resources found far from keyword query response nodes