Adaptive Focused Crawling

Slides: 35

Provided by: lc177

Learn more at: https://sites.pitt.edu

Transcript and Presenter's Notes

1
Adaptive Focused Crawling

Presented by Siqing Du
Date 10/19/05

2
Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

3
Crawling the Web

A simple crawler starts from a set of seed pages,
follows the URLs they contain, retrieves the
linked web pages, and adds them to a local
repository.
Taking the Web as a graph structure (V,E), web
crawling is similar to a graph traversal problem.
Breadth-first search
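
The breadth-first traversal named above can be sketched as follows. This is a minimal illustration, not the presenter's code: the Web is replaced by a small in-memory dictionary (the hypothetical TOY_WEB), and "retrieving" a page just means appending its URL to a list.

```python
from collections import deque

# Hypothetical toy link graph: page -> list of outgoing links.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["a"],
    "e": [],
}

def bfs_crawl(graph, seeds, max_pages=100):
    """Visit pages in breadth-first order, starting from the seed URLs."""
    seeds = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    frontier = deque(seeds)              # FIFO queue of URLs still to visit
    seen = set(seeds)                    # URLs already queued or visited
    repository = []                      # stands in for the local repository
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)           # "retrieve" and store the page
        for link in graph.get(url, []):
            if link not in seen:         # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return repository

print(bfs_crawl(TOY_WEB, ["a"]))  # ['a', 'b', 'c', 'd', 'e']
```

The `seen` set is what keeps the traversal from looping on cycles such as d -> a.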

4
Flow of a Basic Sequential Crawler

5
What Is the Problem?

Current size of the web (static/crawlable/visible)
is 4 to 10 billion pages, or maybe a lot more
Average out-degree (number of URLs in a page) of a
random page on the web is 7
Hence the number of pages reachable from a seed
grows exponentially, by a factor of about 7 per
level of depth
Even a well-known web search engine can cover
only a part of the whole web
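
A rough back-of-the-envelope check of that growth, using the out-degree of 7 quoted above. The function names are illustrative, and the count is an upper bound that assumes every link leads to a page not seen before, which is not true of the real Web.

```python
AVG_OUT_DEGREE = 7  # average number of links per page, as quoted on the slide

def pages_at_depth(depth, out_degree=AVG_OUT_DEGREE):
    """Upper bound on pages reached at a given depth from a single seed."""
    return out_degree ** depth

def depth_to_exceed(target, out_degree=AVG_OUT_DEGREE):
    """Smallest depth at which the upper bound exceeds `target` pages."""
    depth = 0
    while pages_at_depth(depth, out_degree) <= target:
        depth += 1
    return depth

print(pages_at_depth(3))            # 343
print(depth_to_exceed(10 ** 10))    # 12
```

Under these assumptions, about a dozen levels of breadth-first expansion already exceed a 10-billion-page Web, which is why an exhaustive crawl does not scale.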

6
Adaptive Focused Crawling

Focused crawling: developing crawlers that are
able to seek out and collect pages related to a
given topic.
It is also called topical crawling.
If a focused crawler includes learning methods in
order to adapt its behavior during the crawl to
the particular environment and its relationships
with the given input parameters, e.g., the set of
retrieved pages and the user-defined topic, the
crawler is called adaptive.
Best-first search
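
A minimal sketch of best-first crawling, assuming a priority queue ordered by a relevance score. The toy graph and the keyword-in-URL scoring function are hypothetical stand-ins; a real focused crawler would estimate relevance from page text, anchor text, or a trained classifier.

```python
import heapq

# Hypothetical toy link graph: page -> list of outgoing links.
TOPIC_WEB = {
    "seed": ["sports", "crawler-intro", "news"],
    "crawler-intro": ["focused-crawler"],
    "sports": ["football"],
}

def url_score(url):
    """Toy relevance estimate (an assumption): 1 if the URL mentions the topic."""
    return 1 if "crawler" in url else 0

def best_first_crawl(graph, seeds, score, max_pages=100):
    """Always expand the highest-scoring URL in the frontier first."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(u), u) for u in dict.fromkeys(seeds)]
    heapq.heapify(frontier)
    seen = {u for _, u in frontier}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited

print(best_first_crawl(TOPIC_WEB, ["seed"], url_score))
```

On this toy graph the two on-topic pages are visited right after the seed, before any off-topic page. Ties are broken alphabetically by URL.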

7
Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

8
Exploiting the Hypertextual Information

PageRank and HITS are founded on citation
analysis, which was started in the 1950s by
Garfield.
In focused crawling systems, the precision is not
defined only in terms of number of crawled pages,
but in terms of rank.
Short result lists of high-rank documents are
definitely better than long lists of
interesting documents that force the users to
sift through them in order to find the most
valuable information.

9
Topical Locality and Anchors

Topical locality occurs each time a page is
linked to others with related content (in order
to give users the chance to see further related
information or services).
Proximal cues or residues correspond to the
imperfect information at intermediate locations
that a user exploits to decide which paths to
follow in order to reach the target information.
Text snippets, anchor text, or icons are usually
the imperfect information related to a certain
distant content.

10
HITS

Authorities have relevant content about a topic
Hubs contain several links toward relevant
authoritative pages.
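
The mutual reinforcement between hubs and authorities can be sketched with the usual iterative update. This is a minimal version under stated assumptions: the toy graph, the fixed iteration count, and the L2 normalization are choices made for the example.

```python
import math

# Hypothetical toy link graph: page -> list of pages it links to.
LINKS = {
    "hub1": ["a1", "a2"],
    "hub2": ["a1"],
}

def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(graph, iterations=50):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = set(graph) | {q for links in graph.values() for q in links}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, links in graph.items():
            for q in links:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        auth, hub = _normalize(auth), _normalize(hub)
    return hub, auth

hub, auth = hits(LINKS)
print(max(auth, key=auth.get), max(hub, key=hub.get))  # a1 hub1
```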

11
PageRank

Random surfer model: a surfer in that model
randomly clicks on one of the links contained in
a page p with equal probability 1/Np, where Np is
the number of links on p.
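
A minimal power-iteration sketch of the random surfer model. The slide only states the 1/Np click probability; the damping factor of 0.85 (the chance of following a link rather than jumping to a random page) and the uniform treatment of pages without out-links are assumptions added for the example.

```python
# Hypothetical toy link graph: page -> list of pages it links to.
SURF_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(graph, damping=0.85, iterations=100):
    """Approximate PageRank by repeatedly redistributing rank along links."""
    pages = sorted(set(graph) | {q for links in graph.values() for q in links})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            links = graph.get(p, [])
            if links:
                # Each of the Np links is followed with probability 1/Np.
                share = damping * rank[p] / len(links)
                for q in links:
                    new_rank[q] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

ranks = pagerank(SURF_WEB)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['c', 'a', 'b']
```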

12
Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

13
AI-based Approaches

Treat crawlers as single autonomous units that
live and keep moving in search of interesting
resources.
Genetic-based crawlers
Ant paradigm

14
Genetic-based crawlers

InfoSpiders, also known as ARACHNID (Adaptive
Retrieval Agents Choosing Heuristic Neighborhoods
for Information Discovery)
Genetic algorithms have been introduced in order
to find approximate solutions to hard-to-solve
combinatorial optimization problems.
Inspired by evolutionary biology studies.

15
Basic Idea of GA

A population of candidate solutions
Genetic operators, such as inheritance,
mutation, and crossover.
The ones that are closer to the better solutions
are given more chances to live and reproduce,
while the ones that are ill-suited for an
environment die out.
The initial population is generated randomly.
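
The loop described above can be sketched on a toy problem. This is an illustrative example, not a crawler: individuals are bit strings, fitness is the number of 1 bits, and tournament selection, single-point crossover, and keeping the best individual unchanged (elitism) are assumptions chosen for the sketch.

```python
import random

def fitness(individual):
    """Toy fitness: the number of 1 bits in the individual."""
    return sum(individual)

def pick(population, rng):
    """Tournament selection: the fitter of two random individuals reproduces."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def evolve(pop_size=20, length=16, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # The initial population is generated randomly.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        next_gen = [max(population, key=fitness)]   # elitism (assumption)
        while len(next_gen) < pop_size:
            mother, father = pick(population, rng), pick(population, rng)
            cut = rng.randint(1, length - 1)        # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best))
```

Because the best individual is always carried over unchanged, the best fitness can never decrease from one generation to the next.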

16
InfoSpider

In InfoSpiders, an evolving population of
intelligent agents browses the Web, driven by
the user queries.
Each agent is able to draw relevant resources and
reason autonomously about the next page to
download and analyze.
The goal is to mimic the intelligent browsing
behavior of human users with little or no
interaction among agents.

17
InfoSpider cont.

Each agent is built on top of a genotype
(a parameter that represents the degree to which
the agent trusts the textual description of
outgoing links, a set of keywords initialized
with the query terms, and a vector of weights).
A feed-forward neural network is used to judge
which keywords in that set best discriminate the
documents relevant to the user.

18
InfoSpider cont.

The adaptivity is both unsupervised and
supervised (without or with user feedback).
If any error occurs (an uninteresting page) due
to the agent's action selection, the weights of
the neural network are updated subsequently.
Mutation and crossover provide the second kind
of adaptivity to the environment.
An agent's energy value is assigned at the
beginning and updated according to the relevance
of the pages visited.
The energy determines which agent survives or
dies out.
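
The energy bookkeeping can be sketched as follows. The slide does not give the exact update rule, so the fixed cost per page visit and the rule that an agent dies when its energy drops to zero or below are assumptions made for illustration; the Agent class and step function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    energy: float   # assigned at the beginning

def step(agents, relevance, cost=0.2):
    """Update every agent's energy after one page visit; return survivors.

    relevance maps an agent name to the relevance (0..1) of the page it
    just visited. `cost` is an assumed fixed price paid per visit.
    """
    survivors = []
    for agent in agents:
        agent.energy += relevance[agent.name] - cost
        if agent.energy > 0:        # assumed survival threshold
            survivors.append(agent)
    return survivors

agents = [Agent("A", 0.5), Agent("B", 0.5)]
for _ in range(3):
    # A keeps finding relevant pages, B keeps finding irrelevant ones.
    agents = step(agents, {"A": 0.9, "B": 0.0})
print([a.name for a in agents])  # ['A']
```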

19
Itsy Bitsy Spider

Itsy Bitsy Spider, an implementation of a
genetic-based crawler, was tested on the Yahoo
database.
During the evaluation, the genetic approach does
not outperform the best-first search algorithm
(recall high, precision no significant
difference).
However, Itsy Bitsy is a simplified version of
InfoSpiders: it lacks the neural network and some
other components, and it cannot reason
autonomously.

20
Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

21
Ant-based Crawlers

Based on a model of social insect collective
behavior.
Inspired by studies of how blind animals, such as
ants, are able to find the shortest paths from
their nest to the feeding sources and back.
Ants can release a hormonal substance, the
pheromone, to mark the ground, leaving a trail.
Other ants follow the trail and reinforce it.

22
Mechanism

The first ants returning to their nest from the
feeding sources are those that chose the
shortest paths.
The back-and-forth trip lets them release
pheromone on those paths twice.
Other ants, when they have to choose between
different paths, prefer those with more
pheromone.

23
Ant-based Crawlers

Each agent corresponds to a virtual ant that
moves from url_i to url_j.
The system execution is divided into cycles; in
each cycle, the ants make a sequence of moves.
At the end of a cycle, the ants update the
pheromone intensity values of the followed path
as a function of the retrieved resource scores.

24
Ant-based Crawlers

The transition probability from url_i to url_j
at cycle t is [formula not captured in this
transcript]
To prevent circular paths, each ant stores a
list L containing the visited URLs.
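
Since the slide's formula did not survive in the transcript, the following is only a sketch of one common choice in ant-based systems: the probability of moving to a linked URL is proportional to the pheromone on that link. Excluding URLs already in the ant's visited list is an assumption about how that list is used; all names here are hypothetical.

```python
import random

def transition_probabilities(pheromone, current, visited):
    """Probability of each outgoing move, proportional to trail pheromone.

    pheromone maps a URL to a dict {linked URL: pheromone intensity}.
    URLs in `visited` are skipped to avoid circular paths (an assumption).
    """
    candidates = {nxt: tau
                  for nxt, tau in pheromone.get(current, {}).items()
                  if nxt not in visited}
    total = sum(candidates.values())
    if total == 0:
        return {}
    return {nxt: tau / total for nxt, tau in candidates.items()}

def choose_next(pheromone, current, visited, rng=random):
    """Sample the next URL according to the transition probabilities."""
    probs = transition_probabilities(pheromone, current, visited)
    if not probs:
        return None                     # dead end: nothing left to visit
    urls = list(probs)
    return rng.choices(urls, weights=[probs[u] for u in urls], k=1)[0]

trails = {"url_i": {"url_j": 3.0, "url_k": 1.0}}
print(transition_probabilities(trails, "url_i", visited=["url_i"]))
# {'url_j': 0.75, 'url_k': 0.25}
```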

25
Updating Rule

The pheromone of the trail from url_i to url_j at
cycle t+1 [formula not captured in this
transcript]
Adaptivity: the pheromone intensities are updated
according to the visited resource scores.
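
The slide's update formula is not in the transcript, so this sketch uses a generic ant-colony update as a stand-in: existing pheromone partly evaporates, then each ant deposits an amount derived from the scores of the resources it visited. The evaporation coefficient and the use of the average score along the path are assumptions.

```python
def update_pheromone(pheromone, ant_paths, scores, evaporation=0.8):
    """Return the pheromone intensities for the next cycle.

    pheromone maps (url_i, url_j) edges to intensities, ant_paths is a list
    of URL sequences followed in this cycle, and scores maps a URL to the
    relevance score of its resource. `evaporation` is the assumed fraction
    of old pheromone that remains.
    """
    updated = {edge: evaporation * tau for edge, tau in pheromone.items()}
    for path in ant_paths:
        if len(path) < 2:
            continue                      # the ant did not move
        reached = path[1:]
        # Assumed deposit: average score of the resources reached on the path.
        deposit = sum(scores.get(url, 0.0) for url in reached) / len(reached)
        for edge in zip(path, path[1:]):
            updated[edge] = updated.get(edge, 0.0) + deposit
    return updated

old = {("a", "b"): 1.0, ("a", "c"): 1.0}
new = update_pheromone(old, [["a", "b"]], {"b": 0.5}, evaporation=0.5)
print(new)  # {('a', 'b'): 1.0, ('a', 'c'): 0.5}
```

The edge that led to a well-scored resource keeps its intensity, while the unused edge fades, which is what steers later ants toward relevant regions.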

26
Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

27
Intelligent Crawling's Statistical Model

Aims at statistically learning the
characteristics of the linkage structure of the
Web while performing the search.
Uses the knowledge obtained during the search to
calculate the conditional probability and the
interest ratio, which determine whether an unseen
page is likely to satisfy the user's needs.
It does not need any collection of topical
examples for training.
The crawler adapts its behavior by learning the
correlations among given features.
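
A small sketch of how such an interest ratio could be computed from counts gathered during the crawl. The slide does not define the ratio; the definition assumed here is P(C|E) / P(C), where C means "the page satisfies the user's needs" and E is some observed feature of the candidate page (for example, a keyword in its URL). A value above 1 suggests the feature makes relevance more likely.

```python
def interest_ratio(n_total, n_relevant, n_feature, n_feature_and_relevant):
    """Assumed definition: P(C | E) / P(C), estimated from crawl counts.

    n_total                 pages crawled so far
    n_relevant              crawled pages that satisfied the user's needs (C)
    n_feature               crawled pages that showed feature E
    n_feature_and_relevant  crawled pages with both E and C
    """
    if min(n_total, n_relevant, n_feature) <= 0:
        raise ValueError("counts must be positive to estimate the ratio")
    p_c = n_relevant / n_total
    p_c_given_e = n_feature_and_relevant / n_feature
    return p_c_given_e / p_c

# 10 of 100 pages were relevant overall, but 8 of the 20 pages with the
# feature were relevant, so the feature looks favorable (ratio near 4).
print(interest_ratio(100, 10, 20, 8))
```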

28
Reinforcement Learning-based Approaches

A classifier evaluates the relevance of a
hypertext document with respect to the chosen
topics.
The interesting documents found are the rewards.
The goal is to learn, during the crawl, which
text in the neighborhood of a hyperlink most
likely points to relevant pages.

29
Outline

Introduction of web crawling
Exploiting the hypertextual information
Genetic-based crawler
Ant-based crawler
Machine learning-based crawler
Evaluation

30
Evaluation Methodologies

The goodness of the retrieved documents is one
measure.
The percentage of important pages retrieved over
the progress of the crawl is another often-used
measure.
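
The second measure can be sketched as a curve over the crawl. This is a minimal illustration: it assumes the set of important pages is known in advance, which in practice is itself an estimate, and the function name is hypothetical.

```python
def important_pages_curve(crawl_order, important):
    """Fraction of important pages retrieved after each crawled page."""
    important = set(important)
    found = set()
    curve = []
    for url in crawl_order:
        if url in important:
            found.add(url)              # a set ignores repeated visits
        curve.append(len(found) / len(important))
    return curve

print(important_pages_curve(["a", "x", "b", "y"], {"a", "b"}))
# [0.5, 0.5, 1.0, 1.0]
```

Plotting this curve for several crawlers on the same topic shows which one reaches the important pages earliest.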

31
An Example of Performance Plot

32
Summarization

Focused crawling has become an interesting
alternative to the current Web search tools.
It is a particular kind of crawler able to seek
out and collect the subset of Web pages related
to a given topic.
With learning methods, adaptive focused crawlers
are able to adapt the system behavior to the
particular environment and input parameters
during the search.
Evaluation results show how the whole searching
process may profit from those techniques and
increase crawling performance.

33
Reference

Core paper
Alessandro Micarelli and Fabio Gasparetti,
Adaptive Focused Crawling.
Additional papers
Gautam Pant, Padmini Srinivasan, and Filippo
Menczer, Crawling the Web, in Web Dynamics,
Springer-Verlag, 2003.
Martin Ester, Matthias Groß, and Hans-Peter
Kriegel, Focused Web Crawling: A Generic
Framework for Specifying the User Interest and
for Adaptive Crawling Strategies, VLDB 2001.

Questions? Comments?
Thanks!
