Title: Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
1 Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
- Soumen Chakrabarti (IBM Almaden)
- Joint work with Martin van den Berg (Xerox), Byron Dom (IBM), David Gibson (Berkeley)
- Funded by Global Web Solutions, IBM Atlanta
2 Portals and portholes
- Popular search portals and directories
  - Useful for generic needs
  - Difficult to do serious research
- Information needs of net-savvy users are getting very sophisticated
  - Relatively little business incentive
- Need hand-made specialty sites: "portholes"
- Resource discovery must be personalized
3 Quote
- "The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals."
- Jim Hake (Founder, Global Information Infrastructure Awards)
4 Quote
- "The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe."
- Dan Gillmor (Tech Columnist, San Jose Mercury News)
5 Scenario
- Disk drive research group wants to track magnetic surface technologies
- Compiler research group wants to trawl the web for graduate student résumés
- ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
- Virtual libraries like the Open Directory Project and the Mining Co.
6 Goal
- Automatically construct a focused portal ("porthole") containing resources that are
  - Relevant to the user's focus of interest
  - Of high influence and quality
  - Collectively comprehensive
7 Tools at hand
- Keyword search engines
  - Synonymy, polysemy
  - Abundance, lack of quality
- Hand-compiled topic directories
  - Labor-intensive, subjective judgements
- Resources automatically located using keyword search and link graph distillation
  - Dependence on large crawls and indices
8 Estimating popularity
- Extensive research on social network theory
  - [Wasserman and Faust]
- Hyperlink-based
  - Large in-degree indicates popularity/authority
  - Not all votes are worth the same
- Several similar ideas and refinements
  - Google (Page and Brin) and HITS (Kleinberg)
  - CLEVER (Chakrabarti et al.)
  - Topic distillation (Bharat and Henzinger)
9 Topic distillation overview
- Given web graph and query
- Search engine selects sub-graph
- Expansion, pruning and edge weights
- Nodes iteratively transfer authority to cited neighbors
[Diagram: a query goes to a search engine, which selects a subgraph of the Web for distillation]
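The iterative authority transfer above can be sketched as a small HITS-style computation. The graph below is a hypothetical toy example, not data from the talk, and this omits the edge weighting and pruning refinements the slide mentions.

```python
import math

def hits(graph, iterations=50):
    """graph: dict mapping node -> list of cited nodes.
    Returns (hub, authority) score dicts."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Each node transfers authority to the neighbors it cites.
        auth = {n: 0.0 for n in nodes}
        for u, cited in graph.items():
            for v in cited:
                auth[v] += hub[u]
        # A good hub is one that cites good authorities.
        hub = {n: sum(auth[v] for v in graph.get(n, ())) for n in nodes}
        # Normalize so scores stay bounded.
        for d in (auth, hub):
            norm = math.sqrt(sum(x * x for x in d.values())) or 1.0
            for n in d:
                d[n] /= norm
    return hub, auth

# Toy subgraph: pages "a" and "b" both cite "c".
graph = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(graph)
```

Since "c" is cited by both other pages, it accumulates the highest authority score, while "a" and "b" become the hubs.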
10 Preliminary approach
- Use topic distillation for focused crawling
- Each node in topic taxonomy is a query
  - Query is refined by trial-and-error
  - Topic distillation runs at each node
- E.g., European airlines: swissair, iberia, klm
11 (No Transcript)
12 Query construction
- /Companies/Electronics/Power_Supply
  - power suppl
  - switch mode, smps
  - -multiprocessor
  - uninterrupt power suppl, ups
  - -parcel
13 Query complexity
- Complex queries (966 trials)
  - Average words: 7.03
  - Average operators: 4.34
- Typical AltaVista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
  - Average query words: 2.35
  - Average operators: 0.41
- Forcibly adding a hub or authority node helped in 86% of the queries
14 Problems with preliminary approach
- Difficulty of query construction
- Dependence on large web crawl and index
  - System = crawler + index + distiller
- Unreliability of keyword match
  - Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  - Narrow, arbitrary view of relevant subgraph
- Topic model does not improve over time
- Lack of output sensitivity
15 Output sensitivity
- Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
- Ideally, effort should scale with the size of the result
- Time spent crawling and indexing sites unrelated to the topic is wasted
- Likewise, time that does not improve comprehensiveness is wasted
16 Proposed solution
- Resource discovery system that can be customized to crawl for any topic by giving examples
- Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their goodness
- Crawler has guidance hooks controlled by these two scores
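A minimal sketch of such a guidance hook, assuming hypothetical `fetch()` and `relevance()` callbacks that stand in for the real crawler and the learned classifier; this is an illustration of priority-guided expansion, not the system's actual code.

```python
import heapq

def focused_crawl(seeds, fetch, relevance, budget=1000, threshold=0.5):
    """fetch(url) -> (text, outlinks); relevance(text) -> score in [0, 1]."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = fetch(url)
        score = relevance(text)
        if score >= threshold:
            harvested.append(url)
            # Expand only relevant pages; outlinks inherit the parent's score.
            for link in outlinks:
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return harvested

# Toy demo: a five-page "web"; pages mentioning bikes are relevant.
web = {
    "s": ("bike links", ["a", "b"]),
    "a": ("bike race", ["c"]),
    "b": ("stock tips", ["d"]),
    "c": ("cycling news", []),
    "d": ("bond prices", []),
}
found = focused_crawl(
    ["s"], lambda u: web[u],
    lambda text: 1.0 if ("bike" in text or "cycling" in text) else 0.0)
```

In the demo, the irrelevant page "b" is fetched but never expanded, so its child "d" is never crawled; effort tracks the size of the relevant result, as the output-sensitivity slide asks.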
17 Advantages
- No need for query formulation: system learns from examples
- No dependence on global crawls
- Specialized, deep and up-to-date web exploration
- Modest desktop hardware adequate
18 Administration scenario
[Screenshot: taxonomy editor; the administrator drags current examples onto taxonomy nodes, and the system suggests additional examples]
19 Relevance
[Diagram: topic taxonomy rooted at All, with children such as Bus&Econ, Recreation, Arts; Bus&Econ contains Companies, Recreation contains Cycling; Cycling is a good node, its children Bike Shops, Clubs and Mt. Biking are subsumed nodes, and the nodes from the root down to a good node are path nodes]
20 Classification
- How relevant is a document w.r.t. a class?
  - Supervised learning, filtering, classification, categorization
- Many types of classifiers
  - Bayesian, nearest neighbor, rule-based
- Hypertext
  - Both text and links are class-dependent clues
  - How to model link-based features?
21 Exploiting link features
- c = class, t = text, N = neighbors
- Text-only model: Pr[t | c]
- Using neighbors' text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation
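The text-only model Pr[t | c] can be sketched as a multinomial naive Bayes classifier over bag-of-words features. The training documents below are toy examples; the better model Pr[t, c(N) | c] would additionally iterate this classifier over the link graph so that neighbors' class labels feed back into each page's score (the non-linear relaxation), which is omitted here.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)
        self.word_counts = defaultdict(Counter)  # per-class term frequencies
        self.totals = Counter()                  # per-class token counts
        self.vocab = set()
        for doc, c in zip(docs, labels):
            words = doc.split()
            self.word_counts[c].update(words)
            self.totals[c] += len(words)
            self.vocab.update(words)
        return self

    def log_prob(self, doc, c):
        # log Pr[c] + sum over words of log Pr[w | c], Laplace-smoothed
        lp = math.log(self.prior[c] / sum(self.prior.values()))
        v = len(self.vocab)
        for w in doc.split():
            lp += math.log((self.word_counts[c][w] + 1) / (self.totals[c] + v))
        return lp

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_prob(doc, c))

nb = NaiveBayes().fit(
    ["mountain bike trail ride", "road bike race gear",
     "stock market bond price", "interest rate bond yield"],
    ["cycling", "cycling", "finance", "finance"])
```

A crawler guidance hook would use `log_prob` (or the posterior it induces) as the relevance score for a fetched page, rather than just the hard `predict` label.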
23 Putting it together
24 Monitoring the crawler
[Chart: relevance of each fetched URL, and its moving average, plotted over crawl time]
25 RDBMS benefits
- Multiple priority controls
- Dynamically changing crawling strategies
- Concurrency and crash recovery
- Effective out-of-core computations
- Ad-hoc crawl monitoring and tweaking
- Synergy of scale
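These benefits come largely for free once the crawl frontier lives in a database table. A minimal sketch using sqlite (table and column names are illustrative, not from the talk):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # a file path would add crash recovery

# The frontier is just a table; priorities can be changed with UPDATEs
# at any time, which is what makes crawl strategies dynamically tunable.
db.execute("""CREATE TABLE frontier (
    url      TEXT PRIMARY KEY,
    priority REAL,                -- relevance score of the citing page
    fetched  INTEGER DEFAULT 0)""")

def enqueue(url, priority):
    # Insert if new; otherwise keep the highest priority seen so far.
    db.execute("INSERT OR IGNORE INTO frontier (url, priority) VALUES (?, ?)",
               (url, priority))
    db.execute("UPDATE frontier SET priority = ? WHERE url = ? AND priority < ?",
               (priority, url, priority))

def next_url():
    # Out-of-core priority queue: the best unfetched URL by one indexed query.
    row = db.execute("""SELECT url FROM frontier WHERE fetched = 0
                        ORDER BY priority DESC LIMIT 1""").fetchone()
    if row:
        db.execute("UPDATE frontier SET fetched = 1 WHERE url = ?", (row[0],))
    return row[0] if row else None

enqueue("http://a.example/", 0.9)
enqueue("http://b.example/", 0.4)
```

Ad-hoc monitoring then reduces to SQL, e.g. `SELECT AVG(priority) FROM frontier WHERE fetched = 1` over a recent window.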
26 Measures of success
- Harvest rate
  - What fraction of crawled pages are relevant?
- Robustness across seed sets
  - Separate crawls with random disjoint samples
  - Measure overlap in URLs and servers crawled
  - Measure agreement in best-rated resources
- Evidence of non-trivial work
  - Links from start set to the best resources
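The first two measures are simple enough to state directly in code. Jaccard overlap is one reasonable choice for the robustness score; the talk does not pin down the exact formula, so treat it as an assumption. Inputs below are toy values.

```python
def harvest_rate(relevance_flags):
    """relevance_flags: one boolean per crawled page;
    returns the fraction judged relevant."""
    return sum(relevance_flags) / len(relevance_flags) if relevance_flags else 0.0

def overlap(top_a, top_b):
    """Jaccard overlap between the top-rated resources of two crawls
    started from disjoint seed sets."""
    a, b = set(top_a), set(top_b)
    return len(a & b) / len(a | b) if a | b else 0.0

rate = harvest_rate([True, True, False, True])      # 3 of 4 pages relevant
agree = overlap(["u1", "u2", "u3"], ["u2", "u3", "u4"])  # share 2 of 4 URLs
```

High harvest rate says the crawler is not wasting fetches; high overlap says the discovered resources are a property of the topic, not of the particular seed set.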
27 Harvest rate
[Chart: harvest rate over time for the focused crawl vs. an unfocused crawl]
28 Crawl robustness
[Charts: URL overlap and server overlap between crawl A and crawl B, started from disjoint seed sets]
29 Top resources after one hour
- Recreational and competitive cycling
  - http://www.truesport.com/Bike/links.htm
  - http://reality.sgi.com/employees/billh_hampton/jrvs/links.html
  - http://www.acs.ucalgary.ca/bentley/mark_links.html
- HIV/AIDS research and treatment
  - http://www.stopaids.org/Otherorgs.html
  - http://www.iohk.com/UserPages/mlau/aidsinfo.html
  - http://www.ahandyguide.com/cat1/a/a66.htm
- Purer and better than root set
30 (No Transcript)
31 (No Transcript)
32 Distance to best resources
33 Robustness of resource discovery
- Sample disjoint sets of starting URLs
- Two separate crawls
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
34 Future work
- Harvest rate at different levels of taxonomy
  - By definition, harvest rate is 1 for the root node
- Sociology of citations
  - Build a gigantic citation matrix for web topics
- Further enhance resource finding skills
  - Semi-structured queries
  - Suspicious link neighborhoods, e.g., traffic radar manufacturer and auto insurance company
35 Related work
- WebWatcher, HotList/ColdList
  - Filtering as post-processing, not acquisition
- Fish search, WebCrawler
  - Crawler guided by query keyword matches
- Ahoy!, Cora
  - Hand-crafted to find home pages and papers
- ReferralWeb
  - Social network on the Web
36 Conclusion
- New architecture for example-driven, topic-specific web resource discovery
- No dependence on full web crawl and index
  - Modest desktop hardware adequate
- Variable-radius goal-directed crawling
  - High harvest rate
  - High-quality resources found far from keyword query response nodes
37 References
- soumen@cs.berkeley.edu
- www.cs.berkeley.edu/soumen/
  - www8focus.pdf
  - sigmod98.ps
- www.almaden.ibm.com/cs/k53/ir.html