Title: Adaptive Focused Crawling
1. Adaptive Focused Crawling
- Dr. Alexandra Cristea
- a.i.cristea_at_warwick.ac.uk
- http://www.dcs.warwick.ac.uk/acristea/
2. Contents
- Introduction
- Crawlers and the World Wide Web
- Focused Crawling
- Agent-Based Adaptive Focused Crawling
- Machine Learning-Based Adaptive Focused Crawling
- Evaluation Methodologies
- Conclusions
3. Motivation
- There is a large amount of information on the Web
- A standard crawler traverses the Web, downloading everything it encounters
- A focused, adaptive crawler selects only related documents and ignores the rest
4. Introduction
5. A focused crawler retrieval
6. Adaptive Focused Crawler
- Traditional, non-adaptive focused crawlers are suitable for user communities with shared interests and goals that do not change over time.
- A focused crawler that uses learning methods to adapt its behaviour to the particular environment and to its relationships with the given input parameters (e.g. the set of retrieved pages and the user-defined topic) >> adaptive
- An adaptive focused crawler is:
  - suited to personalized search systems, with information needs, user interests, goals, preferences, etc.
  - aimed at single users rather than communities of people
  - sensitive to potential alterations in the environment
7. Crawlers and the WWW
8. Growth and Size of the Web
- Growth
  - 2005: at least 11.5 billion pages
  - Doubling in less than 2 years
  - http://www.worldwidewebsize.com/
  - Today (2008): indexable Web of about 23 billion pages
- Change
  - 23% of pages change daily, 40% within a week
  - Challenge: search engines keep local copies
  - Crawls are time-consuming, so trade-offs are needed
- Alternatives
  - Google Sitemaps: an XML file that lists a web site's pages and how often they change (push instead of pull)
  - Distributed crawling
- Truism: the Web is growing faster than search engines
9. Reaching the Web: Hypertextual Connectivity and the Deep Web
- Dark matter: information not accessible to search engines
- Page sets: In, Out, SCC (Strongly Connected Component)
What happens if you crawl starting from Out?
10. Deep Web
- Dynamic page generators
- A 2001 estimate: public information on the deep Web is up to 550 times larger than the normally accessible Web
- Databases
11. Crawling Strategies
- Important pages first: ordering metrics, e.g., Breadth-First, Backlink count, PageRank
12. Backlink
- Number of pages linking in
- Based on bibliographic (citation) research
- Local minima issue
- Based on this: PageRank, HITS
- A frontier-ordering sketch is given below
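As an illustration only (not part of the original slides), here is a minimal Python sketch of "important pages first" frontier ordering, using the backlink count accumulated during the crawl as the priority metric; fetch_links is an assumed helper that returns a page's out-links.

    import heapq
    from collections import defaultdict

    def crawl_backlink_first(seeds, fetch_links, max_pages=100):
        # Best-first crawl: always expand the frontier URL with the most
        # in-links discovered so far (a crude "important pages first" metric).
        backlinks = defaultdict(int)             # url -> known in-link count
        frontier = [(0, url) for url in seeds]   # heap of (-backlink count, url)
        heapq.heapify(frontier)
        visited = set()
        while frontier and len(visited) < max_pages:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            for out in fetch_links(url):
                backlinks[out] += 1
                if out not in visited:
                    # re-push with the updated (negated) backlink count
                    heapq.heappush(frontier, (-backlinks[out], out))
        return visited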
13. Focused Crawling
- Exploiting additional information on web pages, such as anchors or the text surrounding links, to skip some of the pages encountered
14. Exploiting the Hypertextual Info
- Links and (topical) locality are just as important as IR information
15. Fish Search
- Input: the user's query, starting URLs (e.g., bookmarks), a priority list
- The first URL in the list is downloaded and scored; a heuristic decides whether to continue in that direction; if not, its links are ignored
- If yes, its links are scanned, each with a depth value (e.g., the parent's depth - 1); when the depth reaches zero, the direction is dropped
- A timeout or a maximum number of pages is also possible
- Very heavy and demanding: a big burden on the Web (a minimal sketch follows below)
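A minimal, illustrative Python sketch of the fish-search idea described above (not from the slides); score and fetch_links are assumed helpers, with score returning a relevance value in [0, 1].

    from collections import deque

    def fish_search(query, seeds, score, fetch_links,
                    max_pages=100, initial_depth=3, threshold=0.5):
        frontier = deque((url, initial_depth) for url in seeds)
        visited = set()
        while frontier and len(visited) < max_pages:
            url, depth = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            # heuristic: only relevant pages keep "swimming" in this direction
            if score(url, query) < threshold or depth == 0:
                continue  # direction dropped; this page's links are ignored
            for out in fetch_links(url):
                frontier.append((out, depth - 1))  # children get parent's depth - 1
        return visited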
16. Other Focused Crawlers
- Taxonomy and distillation
  - A classifier evaluates the relevance of hypertext documents with respect to the topic
  - A distiller identifies nodes that serve as access points to many pages (via the HITS algorithm)
- Tunneling
  - Allow a limited number of bad pages, to avoid losing information (pages close to the topic may not point to each other)
- Contextual crawling
  - A context graph for each page, with a related distance (minimum number of links to traverse from the initial set)
  - Naïve Bayes classifiers for category identification according to distance; predicting a generic document's distance becomes possible
  - Problem: reverse-link information is required
- Semantic Web
  - Ontologies
  - Improvements in performance
17. Agent-based Adaptive Focused Crawling
18. Genetic-based Crawling
- GA (genetic algorithms)
  - Approximate solutions to hard-to-solve combinatorial optimization problems
  - Genetic operators: inheritance, mutation, crossover; population evolution
- GA crawler agents (InfoSpiders, http://www.informatics.indiana.edu/fil/IS/)
  - Genotype (chromosome set) defining the search behaviour:
    - trust in out-links
    - query terms
    - weights (uniform distribution initially; feed-forward NN information later, via supervised/unsupervised backpropagation)
  - Energy = Benefit() - Cost() (fitness)
19. Genotype and NN
(figure; output labels: relevant / irrelevant)
20. Algorithm 1. Pseudo-code of the InfoSpiders algorithm
initialize each agent's genotype, energy and starting page
PAGES ← maximum number of pages to visit
while number of visited pages < PAGES do
    for each agent a do
        pick and visit an out-link from agent a's current page
        update the energy, estimating benefit() - cost()
        update the genotype as a function of the current benefit
        if agent a's energy > THRESHOLD then
            apply the genetic operators to produce offspring
        else
            kill the agent
        end if
    end for
end while
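For concreteness, a minimal Python sketch of this loop (illustrative only; the agent objects and the helpers benefit, cost, pick_outlink, adapt and reproduce are assumptions, not part of the original pseudo-code):

    def infospiders(agents, max_pages, threshold,
                    benefit, cost, pick_outlink, adapt, reproduce):
        visited = 0
        while visited < max_pages and agents:
            for agent in list(agents):          # copy: offspring/deaths modify the list
                page = pick_outlink(agent)      # pick and visit an out-link of the current page
                visited += 1
                agent.energy += benefit(agent, page) - cost(agent, page)
                adapt(agent, page)              # update the genotype from the current benefit
                if agent.energy > threshold:
                    agents.append(reproduce(agent))  # genetic operators -> offspring
                else:
                    agents.remove(agent)             # the agent dies
        return agents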
21. Ant-based Crawling
- Collective intelligence
- Simple individual behaviour, complex results (e.g., shortest paths)
- Pheromone trails
22. Ant Crawling: preferred path
23. Transition probabilities (p) according to pheromone trails (τ)
24. Task-accomplishing behaviours
- 1. At the end of a cycle, the agent updates the pheromone trails of the path it followed and places itself on one of the start resources
- 2. If an ant trail exists, the agent decides whether to follow it with a probability that is a function of the respective pheromone intensity
- 3. If the agent does not have any available information, it moves randomly
25. Transition probability
Link between i, l
where τ_ij(t) corresponds to the pheromone trail between url_i and url_j
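A plausible reconstruction of the transition rule, assuming the classical ant-colony form that the symbols above suggest (the exact expression on the slide may differ):

    p_{ij}(t) = \frac{\tau_{ij}(t)}{\sum_{l} \tau_{il}(t)}

i.e., the probability of moving from url_i to url_j is the pheromone on that link, normalized over all links (i, l) leaving url_i.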
26. Trail updating
p(k) is the ordered set of pages visited by the k-th ant
p(k)_i is the i-th element of p(k)
score(p) returns, for each page p, its similarity to the current information needs, in [0, 1], where 1 is the highest similarity
M is the number of ants
ρ is the trail evaporation coefficient
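A plausible reconstruction of the update rule these symbols describe, assuming the classical evaporation-plus-deposit form of ant-colony optimization (the exact weighting by score is an assumption):

    \tau_{ij}(t+1) = (1 - \rho)\,\tau_{ij}(t) + \sum_{k=1}^{M} \Delta\tau_{ij}^{(k)}

where Δτ_ij^(k) is, for example, the score of the page reached when ant k traversed the link from url_i to url_j on its path p(k), and 0 if ant k did not follow that link.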
27. Algorithm 2. Pseudo-code of the Ant-based crawler
initialize each agent's starting page
PAGES ← maximum number of pages to visit
cycle ← 1
t ← 1
while number of visited pages < PAGES do
    for each agent a do
        for move = 0 to cycle do
            calculate the probabilities p_ij(t) of the out-going links as in the equation above
            select the next page to visit for agent a
        end for
    end for
    update all the pheromone trails
    initialize each agent's starting page
    cycle ← cycle + 1
    t ← t + 1
end while
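An illustrative Python sketch of this cycle (the evaporation rate, the default pheromone of 1.0 on unseen links, and the helpers fetch_links and score are assumptions made for the example):

    import random

    def ant_crawl(start_pages, fetch_links, score, max_pages=100, rho=0.3):
        pheromone = {}                       # (src, dst) -> trail intensity
        visited, cycle = 0, 1
        while visited < max_pages:
            paths = []
            for start in start_pages:        # one ant per start resource
                page, path = start, [start]
                for _ in range(cycle):       # walks get longer as cycles progress
                    links = fetch_links(page)
                    if not links:
                        break
                    # transition probabilities proportional to pheromone trails
                    weights = [pheromone.get((page, l), 1.0) for l in links]
                    page = random.choices(links, weights=weights)[0]
                    path.append(page)
                    visited += 1
                paths.append(path)
            # evaporation, then deposit proportional to page scores along each path
            pheromone = {k: (1 - rho) * v for k, v in pheromone.items()}
            for path in paths:
                for src, dst in zip(path, path[1:]):
                    pheromone[src, dst] = pheromone.get((src, dst), 0.0) + score(dst)
            cycle += 1
        return pheromone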
28. Machine Learning-Based Adaptive Focused Crawling
- Intelligent Crawling: Statistical Model
- Reinforcement Learning-Based Approaches
29. Intelligent Crawling: Statistical Model
- Statistically learn the characteristics of the Web's linkage structure while performing the search
- Unseen-page predicates (content of the pages linking in, tokens associated with the unseen page)
- Evidence E is used to update the probability that a page is related to the user's needs
30. Evidence-based update
- P(C|E) > P(C)
- Interest Ratio:
  - I(C, E) = P(C|E) / P(C) = P(C ∩ E) / (P(C) · P(E))
- e.g., E: 10% of the pages pointing in contain "Bach"
No initial collection is needed: at the beginning, users specify their needs via predicates, e.g., the page content or the title must contain a given set of keywords.
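A toy numerical illustration (all numbers invented for this example): suppose 5% of crawled pages satisfy the user predicate C, so P(C) = 0.05, but among candidate URLs where E holds (at least 10% of the in-linking pages contain "Bach") 20% satisfy C, so P(C|E) = 0.20. Then I(C, E) = 0.20 / 0.05 = 4, and such candidates receive four times the baseline priority.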
31. Reinforcement Learning-Based Approaches
- Traditional focused crawler plus an apprentice
  - The apprentice assigns priorities to unvisited URLs (based on DOM features) for the next steps of crawling
  - Naïve Bayes text classifiers compare the text around links to the next steps (a toy sketch follows below)
DOM = Document Object Model
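As a toy sketch only (the training data, helper names and the use of scikit-learn are assumptions made for illustration), an apprentice-like Naïve Bayes classifier over anchor text that assigns priorities to unvisited URLs:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Training pairs: anchor text of a followed link -> did the target prove relevant?
    anchor_texts = ["deep learning tutorial", "buy cheap shoes",
                    "focused crawling survey", "celebrity gossip"]
    labels = [1, 0, 1, 0]

    vec = CountVectorizer()
    apprentice = MultinomialNB().fit(vec.fit_transform(anchor_texts), labels)

    # Priority of an unvisited URL ~ probability its anchor text leads to a relevant page
    new_anchors = ["adaptive focused crawler", "discount sneakers"]
    priorities = apprentice.predict_proba(vec.transform(new_anchors))[:, 1]
    print(dict(zip(new_anchors, priorities)))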
32. Evaluation Methodologies
- For a fixed Web corpus and a standard crawl:
  - computation time to complete the crawl, or
  - the number of downloaded resources per time unit
- For a focused crawl:
  - we need only the correctly retrieved documents, not all of them, so...
33. Focused Crawling
- Precision
  - Pr = found / (found + false alarms)
- Recall
  - Re = found / (found + misses)
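A tiny worked example (counts invented for illustration) of these two measures for a focused crawl:

    # Suppose the crawler downloaded 1000 pages, of which 300 are on-topic ("found")
    # and 700 are off-topic ("false alarms"); 200 on-topic pages were never reached ("misses").
    found, false_alarms, misses = 300, 700, 200
    precision = found / (found + false_alarms)   # 0.30
    recall = found / (found + misses)            # 0.60
    print(precision, recall)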
34. Precision / Recall
35. Conclusions
- Focused crawling: an interesting alternative to Web search
- Adaptive focused crawlers:
  - learning methods are able to adapt the system's behaviour to a particular environment and to the input parameters during the search
- Dark matter research, NLP, the Semantic Web