Title: Nurturing content-based collaborative communities on the Web
1Nurturing content-based collaborative
communitieson the Web
- Soumen Chakrabarti
- Center for Intelligent Internet ResearchComputer
Science and EngineeringIndian Institute of
Technology Bombay - www.cse.iitb.ernet.in/soumenwww.cse.iitb.ernet.i
n/cfiir
2Generic search engines
- Struggle to cover the expanding Web
- 35 coverage in 1997 (Bharat and Broder)
- 18 in 1999 (Lawrence and Lee Giles)
- Google rebounds to 50 in 2000
- Moores law vs. Web population
- Search quality, index freshness
- Cannot afford advanced processing
- Alta Vista serves gt40 million queries / day
- Cannot even afford to seek on disk (8ms)
- Limits intelligence of search engines
3Scale vs. quality
Lexical networks, parsing, semantic indexing
Resourcediscovery
Quality
Focusedcrawling
Link-assistedranking
Topic distillation
Keyword-basedsearch engines
Google, Clever
HotBot, Alta Vista
Scale
4The case for vertical portals
- Portals and search pages are changing rapidly,
in part because their biggest strength massive
size and reach can also be a drawback. The most
interesting trend is the growing sense of natural
limits, a recognition that covering a single
galaxy can be more practical and useful than
trying to cover the entire universe. - (San Jose Mercury News, 1/1999)
5Scaling through specialization
- The Web shows content-based locality
- Link-based clusters correlated with content
- Content-based communities emerge in a
spontaneous, decentralized fashion - Can learn and exploit locality patterns
- Analyze page visits and bookmarks
- Automatically construct a focused portal with
resources that - Have high relevance and quality
- Are up-to-date and collectively comprehensive
6Roadmap
- Hyperlink mining a short history
- Resource discovery
- Content-based locality in hypertext
- Taxonomy models, topic distillation
- Strategies for focused crawling
- Data capture and mining architecture
- The Memex collaboration system
- Collaborative construction of vertical portals
- Link metadata management architecture
- Surfing backwards on the Web
7Historical background
- First generation Web search engines
- Delete stopwords from queries
- Can only do syntactic matching
- Users stopped asking good questions!
- TREC queries tens to hundreds of words
- Alta Vista at most 23 words
- Crisis of abundance
- Relevance ranking for very short queries
- Quality complements relevance thats where
hand-made topic directories shine
8Hyperlink Induced Topic Search
Expanded graph
Query
Keyword Search engine
Response
a Eh h ETa Hubs and authorities
9PageRank and Google
- Prestige of a page is proportional to sum of
prestige of citing pages - Standard bibliometric measure of influence
- Simulate a random walk on the Web to precompute
prestige of all pages - Sort keyword-matched responses by decreasing
prestige
p1
p4
p2
p3
p4 ? p1 p2 p3
I.e., p Ep
10Observations
- HITS
- Uses text initially to select Web subgraph
- Expands subgraph by radius 1 magic!
- h and a scores independent of content
- Iterations required at query time
- Google/PageRank
- Precomputed query-independent prestige
- No iterations needed at query time, faster
- Keyword query selects subgraph to rank
- No notion of hub or bipartite reinforcement
11Limitations
- Artificial decoupling of text and links
- Connectivity-based topic drift (HITS)
- movie awards ? movies
- Expanders at www.web-popularity.com
- Feature diffusion (Google)
- more evil than evil ? www.microsoft.com
- New threat of anchor-text spamming
- Decoupled ranking (Google)
- harvard mother ? Bill Gatess bio page!
12Genealogy
Bibliometry
Exploiting anchor text
Google
HITS
Outlier elimination
Topic distillation _at_Compaq
Clever_at_IBM
Text classification
Focusedcrawling
Hypertextclassification
Relaxationlabeling
Crawlingcontext graphs
Learningtopic paths
13Reducing topic drift anchor text
- Page modeled as sequence of tokens and outlinks
- Radius of influence around each token
- Query term matching token increases link weight
- Favors hubs and authorities near relevant pages
- Better answers than HITS
- Ad-hoc spreading activation, but no formal
model as yet
Query term
14Reducing topic drift Outlier detection
- Search response is usually purer than radius1
expansion - Compute document term vectors
- Compute centroid of response vectors
- Eliminate far-away expanded vectors
- Results improve
- Why stop at radius1?
Expanded graph
Keyword searchresponse
Vector-spacedocumentmodel
Centroid
Cut-off radius
15Resource discovery
- Given
- Yahoo-like topic tree with example URLs
- A selection of good topics to explore
- Examples, not queries, define topics
- Need 2-way decision, not ad-hoc cut-off
- Goal
- Start from the good / relevant examples
- Crawl to collect additional relevant URLs
- Fetch as few irrelevant URLs as possible
16A model for relevance
Blocked class
Path class
All
BusEcon
Recreation
Arts
Companies
Cycling
...
...
Bike Shops
Clubs
Mt.Biking
Subsumed classes
Good classes
17Pr(cd) from Pr(cd) using Bayes rule
- Decide topic topic c is picked with prior
probability ?(c) ?c?(c) 1 - Each c has parameters ?(c,t) for terms t
- Coin with face probabilities ?t ?(c,t) 1
- Fix document length n(d) and toss coin
- Naïve yet effective can use other algos
- Given c, probability of document is
18Enhanced models for hypertext
- cclass, dtext, Nneighbors
- Text-only model Pr(dc)
- Using neighbors text to judge my topicPr(d,
d(N) c) - Better recursive modelPr(d, c(N) c)
- Relaxation labeling over Markov random fields
- Or, EM formulation
?
19Hyperlink modeling boosts accuracy
- 9600 patents from 12 classes marked by USPTO
- Patents have text and prior art links
- Expand test patent to include neighborhood
- Forget and re-estimate fraction of neighbors
classes - (Even better for Yahoo)
20Resource discovery basic approach
- Topic taxonomy with examples and good topics
specified - Crawler coupled to hypertext classifier
- Crawl frontier expanded in relevance order
- Neighbors of good hubs expanded with high priority
ExampleURLs
?
?
Radius-1 rule
Radius-2 rule
21Focused crawler block diagram
22Focused crawling evaluation
- Harvest rate
- What fraction of crawled pages are relevant
- Robustness across seed sets
- Perform separate crawls with random disjoint
samples - Measure overlap in URLs, server IP addresses, and
best-rated resources - Evidence of non-trivial work
- Path length to the best resources
23Harvest rate
Unfocused
24Crawl robustness
URL Overlap
Server Overlap
Crawl 1
Crawl 2
25Robustness of resource quality
- Sample disjoint sets of starting URLs
- Two separate crawls
- Run HITS/Clever
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
26Distance to best resources
27A top hub onairlines after half an hourof
focusedcrawling
28A top hub onbicycling after one hour
of focused crawling
29Learning context graphs
- Topics form connected cliques
- heart disease ? swimming, hiking
- cycling ? first-aid!
- Radius-1 rule can be myopic
- Trapped within boundaries of related topics
- From short pre-crawled paths
- Can learn frequent chains of related topics
- Use this knowledge to circumvent local topic
traps
30Context improves focused crawling
31Roadmap
- Hyperlink mining a short history
- Resource discovery
- Content-based locality in hypertext
- Taxonomy models, topic distillation
- Strategies for focused crawling
- Data capture and mining architecture
- The Memex collaboration system
- Collaborative construction of vertical portals
- Link metadata management architecture
- Surfing backwards on the Web
32Memex project goals
- Infrastructure to support spontaneous formation
of topic-based communities - Mining algorithms for personal and community
level topic management and collaborative resource
discovery - Extensible API for plugging in additional
hypertext analysis tools
33Memex project status
- Java applet client
- Netscape 4.5 (Javascript) available
- IE4 (ActiveX) planned
- Server code for Unix and Windows
- Servlets IBM Universal Database
- Berkeley DB lightweight storage manager
- Simple-to-install RPMs for Linux planned
- About a dozen alpha testers
- First beta available 12/2000
34Creating personal topic spaces
? indicates automatic placement by Memex
classifier
User cuts and pastes to correct or reinforce
the Memex classifier
File manager- like interface
Privacy choice
- Valuable user input and feedback on topics and
associated examples
35Replaying topic-based contexts
Choice of topic context
Replay of recent browsing context restricted
to chosen topic
Active browser monitoring and dynamic layout of
new/ incremental context graph
Better mobility than one- dimensional history
provided by popular browsers
- Where was I when last surfing around
/Software/Programming?
36Synthesis of a community taxonomy
- Users classify URLs into folders
- How to synthesize personal folders into common
taxonomy? - Combine multiple similarity hints
Media
kpfa.org
bbc.co.uk
kron.com
Broadcasting
channel4.com
kcbs.com
Entertainment
foxmovies.com
lucasfilms.com
Studios
miramax.com
37Setting up the focused crawler
Current Examples
Drag
Taxonomy Editor
Suggested Additional Examples
38Monitoring harvest rate
One URL
Relevance/Harvest rate
Moving Average
Time
39Overview of the Memex system
Browser
Memex server
Visit
Client JAR
Taxonomy synthesis
Resource discovery
Search
Attach
Recommendation
Folder
Download
Context
Classification
Mining demons
Running client applet
Event-handler servlets
Archive
Clustering
Relational metadata
Text index
Topic models
Memex client-server protocol and workload sharing
negotiations
40Surfing backwards using contexts
- Space-bounded referrer log
- HTTP extension to query backlink data
GET /P2 HTTP/1.0 Referer http//S1/P1
S1
S2
C
http//S1/P1
http//S2/P2
41Surfing backwards 1
42Surfing backwards 2
43Surfing backwards 3
44Surfing backwards 4
45User study and analysis
- (1999) Significant improvement in finding
comprehensive resource lists - Six broad information needs, 25 volunteers
- Find good resources within limited time
- Backlinks faked using search engines
- Blind-reviewed by three other volunteers
- (2000) Average path length of undirected Web
graph is much smaller compared to directed Web
graph - (2000) Better focused crawls using backlinks
- Proposal to W3C
46Backlinks improve focused crawling
- Follow forward HREF as before
- Also expand backlinks using link queries
- Classify pages as before
but pays off in the end
Sometimes distracts in unrewarding work
47Surfing backwards summary
- Life must be lived forwards, but it can only be
understood backwards Soren Kierkegaard - Hubs are everywhere!
- To find them, look backwards
- Bidirectional surfing is a valuable means to seed
focused resource discovery - Even if one has to depend on search engines
initially for link queries
48Conclusion
- Architecture for topic-specific web resource
discovery - Driven by examples collected from surfing and
bookmarking activity - Reduced dependence on large crawlers
- Modest desktop hardware adequate
- Variable radius goal-directed crawling
- High harvest rate
- High quality resources found far from keyword
query response nodes