Title: Adaptive Focused Crawling
1. Adaptive Focused Crawling
- Dr. Alexandra Cristea
- a.i.cristea_at_warwick.ac.uk
- http://www.dcs.warwick.ac.uk/acristea/
2. Contents
- Introduction
- Crawlers and the World Wide Web
- Focused Crawling
- Agent-Based Adaptive Focused Crawling
- Machine Learning-Based Adaptive Focused Crawling
- Evaluation Methodologies
- Conclusions
3. Motivation
- There is a large amount of information on the Web
- A standard crawler traverses the Web, downloading everything it encounters
- A focused, adaptive crawler selects only related documents and ignores the rest
4. Introduction
5. A focused crawler retrieval
6. Adaptive Focused Crawler
- Traditional, non-adaptive focused crawlers are suitable for user communities with shared interests and goals that do not change over time.
- A focused crawler that uses learning methods to adapt its behaviour to the particular environment and to its relationships with the given input parameters (e.g. the set of retrieved pages and the user-defined topic) >> adaptive
- An adaptive focused crawler is:
  - suited to personalized search systems, with information needs, user interests, goals, preferences, etc.
  - aimed at single users rather than communities of people
  - sensitive to potential alterations in the environment
7. Crawlers and the WWW
8. Growth and Size of the Web
- Growth
  - 2005: at least 11.5 billion pages
  - Doubling in less than 2 years
  - http://www.worldwidewebsize.com/
  - Today (2008): indexable Web of about 23 billion pages
- Change
  - 23% of pages change daily, 40% within a week
  - Challenge: search engines keep local copies
  - Crawls are time-consuming, so trade-offs are needed
- Alternatives
  - Google Sitemaps: an XML file that lists a web site's pages and how often they change (push instead of pull)
  - Distributed crawling
- Truism: the Web is growing faster than search engines
9. Reaching the Web: Hypertextual Connectivity and the Deep Web
- Dark matter: information not accessible to search engines
- Page sets: In, Out, SCC (Strongly Connected Component)
What happens if you crawl starting from Out?
10. Deep Web
- Dynamic page generators
- A 2001 estimate: public information on the deep Web is up to 550 times larger than the normally accessible Web
- Databases
11. Crawling Strategies
- Important pages first: ordering metrics, e.g., Breadth-First, Backlink count, PageRank
12. Backlink
- Number of pages linking in
- Based on bibliographic (citation) research
- Local minima issue
- Based on this: PageRank, HITS
- A frontier-ordering sketch is given below
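As an illustration only (not part of the original slides), here is a minimal Python sketch of "important pages first" frontier ordering, using the backlink count accumulated during the crawl as the priority metric; fetch_links is an assumed helper that returns a page's out-links.

    import heapq
    from collections import defaultdict

    def crawl_backlink_first(seeds, fetch_links, max_pages=100):
        # Best-first crawl: always expand the frontier URL with the most
        # in-links discovered so far (a crude "important pages first" metric).
        backlinks = defaultdict(int)             # url -> known in-link count
        frontier = [(0, url) for url in seeds]   # heap of (-backlink count, url)
        heapq.heapify(frontier)
        visited = set()
        while frontier and len(visited) < max_pages:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            for out in fetch_links(url):
                backlinks[out] += 1
                if out not in visited:
                    # re-push with the updated (negated) backlink count
                    heapq.heappush(frontier, (-backlinks[out], out))
        return visited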
13. Focused Crawling
- Exploiting additional information on web pages, such as anchors or the text surrounding links, to skip some of the pages encountered
14. Exploiting the Hypertextual Info
- Links and (topical) locality are just as important as IR information
15. Fish Search
- Input: the user's query, starting URLs (e.g., bookmarks), a priority list
- The first URL in the list is downloaded and scored; a heuristic decides whether to continue in that direction; if not, its links are ignored
- If yes, its links are scanned, each with a depth value (e.g., the parent's depth - 1); when the depth reaches zero, the direction is dropped
- A timeout or a maximum number of pages is also possible
- Very heavy and demanding: a big burden on the Web (a minimal sketch follows below)
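A minimal, illustrative Python sketch of the fish-search idea described above (not from the slides); score and fetch_links are assumed helpers, with score returning a relevance value in [0, 1].

    from collections import deque

    def fish_search(query, seeds, score, fetch_links,
                    max_pages=100, initial_depth=3, threshold=0.5):
        frontier = deque((url, initial_depth) for url in seeds)
        visited = set()
        while frontier and len(visited) < max_pages:
            url, depth = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            # heuristic: only relevant pages keep "swimming" in this direction
            if score(url, query) < threshold or depth == 0:
                continue  # direction dropped; this page's links are ignored
            for out in fetch_links(url):
                frontier.append((out, depth - 1))  # children get parent's depth - 1
        return visited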
16. Other Focused Crawlers
- Taxonomy and distillation
  - A classifier evaluates the relevance of hypertext documents with respect to the topic
  - A distiller identifies nodes that serve as access points to many pages (via the HITS algorithm)
- Tunneling
  - Allow a limited number of bad pages, to avoid losing information (pages close to the topic may not point to each other)
- Contextual crawling
  - A context graph for each page, with a related distance (minimum number of links to traverse from the initial set)
  - Naïve Bayes classifiers for category identification according to distance; predicting a generic document's distance becomes possible
  - Problem: reverse-link information is required
- Semantic Web
  - Ontologies
  - Improvements in performance
17. Agent-based Adaptive Focused Crawling
18. Genetic-based Crawling
- GA (genetic algorithms)
  - Approximate solutions to hard-to-solve combinatorial optimization problems
  - Genetic operators: inheritance, mutation, crossover; population evolution
- GA crawler agents (InfoSpiders, http://www.informatics.indiana.edu/fil/IS/)
  - Genotype (chromosome set) defining the search behaviour:
    - trust in out-links
    - query terms
    - weights (uniform distribution initially; feed-forward NN information later, via supervised/unsupervised backpropagation)
  - Energy = Benefit() - Cost() (fitness)
19. Genotype and NN
(figure; output labels: relevant / irrelevant)
20. Algorithm 1. Pseudo-code of the InfoSpiders algorithm
initialize each agent's genotype, energy and starting page
PAGES ← maximum number of pages to visit
while number of visited pages < PAGES do
    for each agent a do
        pick and visit an out-link from agent a's current page
        update the energy, estimating benefit() - cost()
        update the genotype as a function of the current benefit
        if agent a's energy > THRESHOLD then
            apply the genetic operators to produce offspring
        else
            kill the agent
        end if
    end for
end while
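For concreteness, a minimal Python sketch of this loop (illustrative only; the agent objects and the helpers benefit, cost, pick_outlink, adapt and reproduce are assumptions, not part of the original pseudo-code):

    def infospiders(agents, max_pages, threshold,
                    benefit, cost, pick_outlink, adapt, reproduce):
        visited = 0
        while visited < max_pages and agents:
            for agent in list(agents):          # copy: offspring/deaths modify the list
                page = pick_outlink(agent)      # pick and visit an out-link of the current page
                visited += 1
                agent.energy += benefit(agent, page) - cost(agent, page)
                adapt(agent, page)              # update the genotype from the current benefit
                if agent.energy > threshold:
                    agents.append(reproduce(agent))  # genetic operators -> offspring
                else:
                    agents.remove(agent)             # the agent dies
        return agents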
21. Ant-based Crawling
- Collective intelligence
- Simple individual behaviour, complex results (e.g., shortest paths)
- Pheromone trails
22. Ant Crawling: preferred path
23. Transition probabilities (p) according to pheromone trails (τ)
24. Task-accomplishing behaviours
- 1. At the end of a cycle, the agent updates the pheromone trails of the path it followed and places itself on one of the start resources
- 2. If an ant trail exists, the agent decides whether to follow it with a probability that is a function of the respective pheromone intensity
- 3. If the agent does not have any available information, it moves randomly
25. Transition probability
Link between i, l
where τ_ij(t) corresponds to the pheromone trail between url_i and url_j
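A plausible reconstruction of the transition rule, assuming the classical ant-colony form that the symbols above suggest (the exact expression on the slide may differ):

    p_{ij}(t) = \frac{\tau_{ij}(t)}{\sum_{l} \tau_{il}(t)}

i.e., the probability of moving from url_i to url_j is the pheromone on that link, normalized over all links (i, l) leaving url_i.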
26. Trail updating
p(k) is the ordered set of pages visited by the k-th ant
p(k)_i is the i-th element of p(k)
score(p) returns, for each page p, its similarity to the current information needs, in [0, 1], where 1 is the highest similarity
M is the number of ants
ρ is the trail evaporation coefficient
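A plausible reconstruction of the update rule these symbols describe, assuming the classical evaporation-plus-deposit form of ant-colony optimization (the exact weighting by score is an assumption):

    \tau_{ij}(t+1) = (1 - \rho)\,\tau_{ij}(t) + \sum_{k=1}^{M} \Delta\tau_{ij}^{(k)}

where Δτ_ij^(k) is, for example, the score of the page reached when ant k traversed the link from url_i to url_j on its path p(k), and 0 if ant k did not follow that link.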
27. Algorithm 2. Pseudo-code of the Ant-based crawler
initialize each agent's starting page
PAGES ← maximum number of pages to visit
cycle ← 1
t ← 1
while number of visited pages < PAGES do
    for each agent a do
        for move = 0 to cycle do
            calculate the probabilities p_ij(t) of the out-going links as in the equation above
            select the next page to visit for agent a
        end for
    end for
    update all the pheromone trails
    initialize each agent's starting page
    cycle ← cycle + 1
    t ← t + 1
end while
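An illustrative Python sketch of this cycle (the evaporation rate, the default pheromone of 1.0 on unseen links, and the helpers fetch_links and score are assumptions made for the example):

    import random

    def ant_crawl(start_pages, fetch_links, score, max_pages=100, rho=0.3):
        pheromone = {}                       # (src, dst) -> trail intensity
        visited, cycle = 0, 1
        while visited < max_pages:
            paths = []
            for start in start_pages:        # one ant per start resource
                page, path = start, [start]
                for _ in range(cycle):       # walks get longer as cycles progress
                    links = fetch_links(page)
                    if not links:
                        break
                    # transition probabilities proportional to pheromone trails
                    weights = [pheromone.get((page, l), 1.0) for l in links]
                    page = random.choices(links, weights=weights)[0]
                    path.append(page)
                    visited += 1
                paths.append(path)
            # evaporation, then deposit proportional to page scores along each path
            pheromone = {k: (1 - rho) * v for k, v in pheromone.items()}
            for path in paths:
                for src, dst in zip(path, path[1:]):
                    pheromone[src, dst] = pheromone.get((src, dst), 0.0) + score(dst)
            cycle += 1
        return pheromone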
28. Machine Learning-Based Adaptive Focused Crawling
- Intelligent Crawling: Statistical Model
- Reinforcement Learning-Based Approaches
29. Intelligent Crawling: Statistical Model
- Statistically learn the characteristics of the Web's linkage structure while performing the search
- Unseen-page predicates (content of the pages linking in, tokens associated with the unseen page)
- Evidence E is used to update the probability that a page is related to the user's needs
30. Evidence-based update
- P(C|E) > P(C)
- Interest Ratio:
  - I(C, E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) · P(E))
- e.g., E: 10% of the pages pointing in contain "Bach"
No initial collection is needed: at the beginning, users specify their needs via predicates, e.g., the page content or the title must contain a given set of keywords.
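A toy numerical illustration (all numbers invented for this example): suppose 5% of crawled pages satisfy the user predicate C, so P(C) = 0.05, but among candidate URLs where E holds (at least 10% of the in-linking pages contain "Bach") 20% satisfy C, so P(C|E) = 0.20. Then I(C, E) = 0.20 / 0.05 = 4, and such candidates receive four times the baseline priority.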
31. Reinforcement Learning-Based Approaches
- Traditional focused crawler plus an apprentice
  - The apprentice assigns priorities to unvisited URLs (based on DOM features) for the next steps of crawling
  - Naïve Bayes text classifiers compare the text around links to the next steps (a toy sketch follows below)
DOM = Document Object Model
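As a toy sketch only (the training data, helper names and the use of scikit-learn are assumptions made for illustration), an apprentice-like Naïve Bayes classifier over anchor text that assigns priorities to unvisited URLs:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Training pairs: anchor text of a followed link -> did the target prove relevant?
    anchor_texts = ["deep learning tutorial", "buy cheap shoes",
                    "focused crawling survey", "celebrity gossip"]
    labels = [1, 0, 1, 0]

    vec = CountVectorizer()
    apprentice = MultinomialNB().fit(vec.fit_transform(anchor_texts), labels)

    # Priority of an unvisited URL ~ probability its anchor text leads to a relevant page
    new_anchors = ["adaptive focused crawler", "discount sneakers"]
    priorities = apprentice.predict_proba(vec.transform(new_anchors))[:, 1]
    print(dict(zip(new_anchors, priorities)))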
32. Evaluation Methodologies
- For a fixed Web corpus and a standard crawl:
  - computation time to complete the crawl, or
  - the number of downloaded resources per time unit
- For a focused crawl:
  - we need only the correctly retrieved documents, not all of them, so...
33. Focused Crawling
- Precision
  - Pr = found / (found + false alarms)
- Recall
  - Re = found / (found + misses)
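A tiny worked example (counts invented for illustration) of these two measures for a focused crawl:

    # Suppose the crawler downloaded 1000 pages, of which 300 are on-topic ("found")
    # and 700 are off-topic ("false alarms"); 200 on-topic pages were never reached ("misses").
    found, false_alarms, misses = 300, 700, 200
    precision = found / (found + false_alarms)   # 0.30
    recall = found / (found + misses)            # 0.60
    print(precision, recall)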
34. Precision / Recall
35. Conclusions
- Focused crawling: an interesting alternative to Web search
- Adaptive focused crawlers:
  - learning methods are able to adapt the system's behaviour to a particular environment and to the input parameters during the search
- Dark matter research, NLP, the Semantic Web