Title: Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
1 Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
- Soumen Chakrabarti (IBM Almaden)
- Joint work with Martin van den Berg (Xerox), Byron Dom (IBM), David Gibson (Berkeley)
- Funded by Global Web Solutions, IBM Atlanta
2 Portals and portholes
- Popular search portals and directories
  - Useful for generic needs
  - Difficult to do serious research
- Information needs of net-savvy users are getting very sophisticated
  - Relatively little business incentive
- Need hand-made specialty sites: "portholes"
- Resource discovery must be personalized
3 Quote
- "The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals."
- Jim Hake (Founder, Global Information Infrastructure Awards)
4 Quote
- "The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe."
- Dan Gillmor (Tech Columnist, San Jose Mercury News)
5 Scenario
- Disk drive research group wants to track magnetic surface technologies
- Compiler research group wants to trawl the web for graduate student résumés
- ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
- Virtual libraries like the Open Directory Project and the Mining Co.
6 Goal
- Automatically construct a focused portal ("porthole") containing resources that are
  - Relevant to the user's focus of interest
  - Of high influence and quality
  - Collectively comprehensive
7 Tools at hand
- Keyword search engines
  - Synonymy, polysemy
  - Abundance, lack of quality
- Hand-compiled topic directories
  - Labor-intensive, subjective judgements
- Resources automatically located using keyword search and link graph distillation
  - Dependence on large crawls and indices
8 Estimating popularity
- Extensive research on social network theory
  - [Wasserman and Faust]
- Hyperlink-based
  - Large in-degree indicates popularity/authority
  - Not all votes are worth the same
- Several similar ideas and refinements
  - Google (Page and Brin) and HITS (Kleinberg)
  - CLEVER (Chakrabarti et al.)
  - Topic distillation (Bharat and Henzinger)
9 Topic distillation overview
- Given web graph and query
- Search engine selects sub-graph
- Expansion, pruning and edge weights
- Nodes iteratively transfer authority to cited neighbors
[Diagram: a query goes to a search engine, which selects a subgraph of the Web for distillation]
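The iterative authority transfer above can be sketched as a small HITS-style computation. The graph below is a hypothetical toy example, not data from the talk, and this omits the edge weighting and pruning refinements the slide mentions.

```python
import math

def hits(graph, iterations=50):
    """graph: dict mapping node -> list of cited nodes.
    Returns (hub, authority) score dicts."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each node transfers authority to the neighbors it cites.
        auth = {n: 0.0 for n in nodes}
        for u, cited in graph.items():
            for v in cited:
                auth[v] += hub[u]
        # A good hub is one that cites good authorities.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize so scores stay bounded.
        for d in (auth, hub):
            norm = math.sqrt(sum(x * x for x in d.values())) or 1.0
            for n in d:
                d[n] /= norm
    return hub, auth

# Toy subgraph: pages "a" and "b" both cite "c".
graph = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(graph)
```

Since "c" is cited by both other pages, it accumulates the highest authority score, while "a" and "b" become the hubs.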
10 Preliminary approach
- Use topic distillation for focused crawling
- Each node in topic taxonomy is a query
  - Query is refined by trial-and-error
  - Topic distillation runs at each node
- E.g., European airlines: swissair, iberia, klm
11 (No Transcript)
12 Query construction
- /Companies/Electronics/Power_Supply
  - power suppl
  - switch mode, smps
  - -multiprocessor
  - uninterrupt power suppl, ups
  - -parcel
13 Query complexity
- Complex queries (966 trials)
  - Average words: 7.03
  - Average operators: 4.34
- Typical AltaVista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
  - Average query words: 2.35
  - Average operators: 0.41
- Forcibly adding a hub or authority node helped in 86% of the queries
14 Problems with preliminary approach
- Difficulty of query construction
- Dependence on large web crawl and index
  - System = crawler + index + distiller
- Unreliability of keyword match
  - Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  - Narrow, arbitrary view of relevant subgraph
- Topic model does not improve over time
- Lack of output sensitivity
15 Output sensitivity
- Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
- Ideally, effort should scale with the size of the result
- Time spent crawling and indexing sites unrelated to the topic is wasted
- Likewise, time that does not improve comprehensiveness is wasted
16 Proposed solution
- Resource discovery system that can be customized to crawl for any topic by giving examples
- Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their goodness
- Crawler has guidance hooks controlled by these two scores
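A minimal sketch of such a guidance hook, assuming hypothetical `fetch()` and `relevance()` callbacks that stand in for the real crawler and the learned classifier; this is an illustration of priority-guided expansion, not the system's actual code.

```python
import heapq

def focused_crawl(seeds, fetch, relevance, budget=1000, threshold=0.5):
    """fetch(url) -> (text, outlinks); relevance(text) -> score in [0, 1]."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = fetch(url)
        score = relevance(text)
        if score >= threshold:
            harvested.append(url)
            # Expand only relevant pages; outlinks inherit the parent's score.
            for link in outlinks:
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return harvested

# Toy demo: a five-page "web"; pages mentioning bikes are relevant.
web = {
    "s": ("bike links", ["a", "b"]),
    "a": ("bike race", ["c"]),
    "b": ("stock tips", ["d"]),
    "c": ("cycling news", []),
    "d": ("bond prices", []),
}
found = focused_crawl(
    ["s"], lambda u: web[u],
    lambda text: 1.0 if ("bike" in text or "cycling" in text) else 0.0)
```

In the demo, the irrelevant page "b" is fetched but never expanded, so its child "d" is never crawled; effort tracks the size of the relevant result, as the output-sensitivity slide asks.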
17 Advantages
- No need for query formulation: system learns from examples
- No dependence on global crawls
- Specialized, deep and up-to-date web exploration
- Modest desktop hardware adequate
18 Administration scenario
[Screenshot: taxonomy editor; the administrator drags current examples onto taxonomy nodes, and the system suggests additional examples]
19 Relevance
[Diagram: topic taxonomy rooted at All, with children such as Bus&Econ, Recreation, Arts; Bus&Econ contains Companies, Recreation contains Cycling; Cycling is a good node, its children Bike Shops, Clubs and Mt. Biking are subsumed nodes, and the nodes from the root down to a good node are path nodes]
20 Classification
- How relevant is a document w.r.t. a class?
  - Supervised learning, filtering, classification, categorization
- Many types of classifiers
  - Bayesian, nearest neighbor, rule-based
- Hypertext
  - Both text and links are class-dependent clues
  - How to model link-based features?
21 Exploiting link features
- c = class, t = text, N = neighbors
- Text-only model: Pr[t | c]
- Using neighbors' text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation
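The text-only model Pr[t | c] can be sketched as a multinomial naive Bayes classifier over bag-of-words features. The training documents below are toy examples; the better model Pr[t, c(N) | c] would additionally iterate this classifier over the link graph so that neighbors' class labels feed back into each page's score (the non-linear relaxation), which is omitted here.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)
        self.word_counts = defaultdict(Counter)  # per-class term frequencies
        self.totals = Counter()                  # per-class token counts
        self.vocab = set()
        for doc, c in zip(docs, labels):
            words = doc.split()
            self.word_counts[c].update(words)
            self.totals[c] += len(words)
            self.vocab.update(words)
        return self

    def log_prob(self, doc, c):
        # log Pr[c] + sum over words of log Pr[w | c], Laplace-smoothed
        lp = math.log(self.prior[c] / sum(self.prior.values()))
        v = len(self.vocab)
        for w in doc.split():
            lp += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + v))
        return lp

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_prob(doc, c))

nb = NaiveBayes().fit(
    ["mountain bike trail ride", "road bike race gear",
     "stock market bond price", "interest rate bond yield"],
    ["cycling", "cycling", "finance", "finance"])
```

A crawler guidance hook would use `log_prob` (or the posterior it induces) as the relevance score for a fetched page, rather than just the hard `predict` label.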
23 Putting it together
24 Monitoring the crawler
[Chart: relevance of each fetched URL, and its moving average, plotted over crawl time]
25 RDBMS benefits
- Multiple priority controls
- Dynamically changing crawling strategies
- Concurrency and crash recovery
- Effective out-of-core computations
- Ad-hoc crawl monitoring and tweaking
- Synergy of scale
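These benefits come largely for free once the crawl frontier lives in a database table. A minimal sketch using sqlite (table and column names are illustrative, not from the talk):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # a file path would add crash recovery

# The frontier is just a table; priorities can be changed with UPDATEs
# at any time, which is what makes crawl strategies dynamically tunable.
db.execute("""CREATE TABLE frontier (
    url      TEXT PRIMARY KEY,
    priority REAL,                -- relevance score of the citing page
    fetched  INTEGER DEFAULT 0)""")

def enqueue(url, priority):
    # Insert if new; otherwise keep the highest priority seen so far.
    db.execute("INSERT OR IGNORE INTO frontier (url, priority) VALUES (?, ?)",
               (url, priority))
    db.execute("UPDATE frontier SET priority = ? WHERE url = ? AND priority < ?",
               (priority, url, priority))

def next_url():
    # Out-of-core priority queue: the best unfetched URL by one indexed query.
    row = db.execute("""SELECT url FROM frontier WHERE fetched = 0
                        ORDER BY priority DESC LIMIT 1""").fetchone()
    if row:
        db.execute("UPDATE frontier SET fetched = 1 WHERE url = ?", (row[0],))
    return row[0] if row else None

enqueue("http://a.example/", 0.9)
enqueue("http://b.example/", 0.4)
```

Ad-hoc monitoring then reduces to SQL, e.g. `SELECT AVG(priority) FROM frontier WHERE fetched = 1` over a recent window.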
26 Measures of success
- Harvest rate
  - What fraction of crawled pages are relevant?
- Robustness across seed sets
  - Separate crawls with random disjoint samples
  - Measure overlap in URLs and servers crawled
  - Measure agreement in best-rated resources
- Evidence of non-trivial work
  - Links from start set to the best resources
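The first two measures are simple enough to state directly in code. Jaccard overlap is one reasonable choice for the robustness score; the talk does not pin down the exact formula, so treat it as an assumption. Inputs below are toy values.

```python
def harvest_rate(relevance_flags):
    """relevance_flags: one boolean per crawled page;
    returns the fraction judged relevant."""
    return sum(relevance_flags) / len(relevance_flags) if relevance_flags else 0.0

def overlap(top_a, top_b):
    """Jaccard overlap between the top-rated resources of two crawls
    started from disjoint seed sets."""
    a, b = set(top_a), set(top_b)
    return len(a & b) / len(a | b) if a | b else 0.0

rate = harvest_rate([True, True, False, True])      # 3 of 4 pages relevant
agree = overlap(["u1", "u2", "u3"], ["u2", "u3", "u4"])  # share 2 of 4 URLs
```

High harvest rate says the crawler is not wasting fetches; high overlap says the discovered resources are a property of the topic, not of the particular seed set.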
27 Harvest rate
[Chart: harvest rate over time for the focused crawl vs. an unfocused crawl]
28 Crawl robustness
[Charts: URL overlap and server overlap between crawl A and crawl B, started from disjoint seed sets]
29 Top resources after one hour
- Recreational and competitive cycling
  - http://www.truesport.com/Bike/links.htm
  - http://reality.sgi.com/employees/billh_hampton/jrvs/links.html
  - http://www.acs.ucalgary.ca/bentley/mark_links.html
- HIV/AIDS research and treatment
  - http://www.stopaids.org/Otherorgs.html
  - http://www.iohk.com/UserPages/mlau/aidsinfo.html
  - http://www.ahandyguide.com/cat1/a/a66.htm
- Purer and better than root set
30 (No Transcript)
31 (No Transcript)
32 Distance to best resources
33 Robustness of resource discovery
- Sample disjoint sets of starting URLs
- Two separate crawls
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
34 Future work
- Harvest rate at different levels of taxonomy
  - By definition, harvest rate is 1 for the root node
- Sociology of citations
  - Build a gigantic citation matrix for web topics
- Further enhance resource finding skills
  - Semi-structured queries
  - Suspicious link neighborhoods, e.g., traffic radar manufacturer and auto insurance company
35 Related work
- WebWatcher, HotList/ColdList
  - Filtering as post-processing, not acquisition
- Fish search, WebCrawler
  - Crawler guided by query keyword matches
- Ahoy!, Cora
  - Hand-crafted to find home pages and papers
- ReferralWeb
  - Social network on the Web
36 Conclusion
- New architecture for example-driven, topic-specific web resource discovery
- No dependence on full web crawl and index
  - Modest desktop hardware adequate
- Variable-radius goal-directed crawling
  - High harvest rate
  - High-quality resources found far from keyword query response nodes
37 References
- soumen@cs.berkeley.edu
- www.cs.berkeley.edu/soumen/
  - www8focus.pdf
  - sigmod98.ps
- www.almaden.ibm.com/cs/k53/ir.html