Adaptive Focused Crawling - PowerPoint PPT Presentation

Slides: 35
Provided by: lc177
Learn more at: https://sites.pitt.edu
Transcript and Presenter's Notes

1
Adaptive Focused Crawling
  • Presented by Siqing Du
  • Date: 10/19/05

2
Outline
  • Introduction of web crawling
  • Exploiting the hypertextual information
  • Genetic-based crawler
  • Ant-based crawler
  • Machine learning-based crawler
  • Evaluation

3
Crawling the Web
  • Simple crawling on the web proceeds by following
    the URLs in the seed pages, retrieving the web
    pages, and adding them to a local repository.
  • Taking the Web as a graph structure (V,E), web
    crawling is similar to a graph traversal problem.
  • Breadth-first search
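
The traversal described above can be sketched as a breadth-first search over a link graph. This is only a minimal illustration: the `toy_web` dictionary and the `get_links` callback are hypothetical stand-ins for real page fetching and link extraction.

```python
from collections import deque

def bfs_crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)      # URLs waiting to be fetched (FIFO queue)
    seen = set(seeds)            # URLs already queued, to avoid revisits
    repository = []              # local repository, in crawl order
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)   # "retrieve" the page and store it
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

# Toy web graph standing in for real HTTP fetches (hypothetical URLs).
toy_web = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["seed"],
}
order = bfs_crawl(["seed"], lambda u: toy_web.get(u, []))
```

The FIFO queue is what makes the traversal breadth-first: every page one link away from the seed is stored before any page two links away.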

4
Flow of a Basic Sequential Crawler
  • (Flowchart figure; not preserved in this transcript.)

5
What Is the Problem?
  • The current size of the web
    (static/crawlable/visible) is 4 to 10 billion
    pages, or maybe a lot more.
  • The average out-degree (number of URLs in a page)
    of a random page on the web is 7.
  • Hence the number of pages reachable from a seed
    grows exponentially, by a factor of about 7 with
    each additional link followed.
  • Even a well-known web search engine can cover
    only a part of the whole web.
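
As a rough worked example of that growth, the sketch below computes an upper bound on the pages reachable at a given depth, assuming every page has exactly 7 outgoing links and no two links lead to the same page. Both assumptions are simplifications, so the numbers are only illustrative.

```python
def pages_at_depth(depth, out_degree=7):
    """Upper bound on pages reachable at exactly `depth` links from one seed."""
    return out_degree ** depth

def depth_to_exceed(target, out_degree=7):
    """Smallest depth whose upper bound exceeds `target` pages."""
    depth = 0
    while pages_at_depth(depth, out_degree) <= target:
        depth += 1
    return depth

three_links_away = pages_at_depth(3)             # 7 * 7 * 7 = 343
depth_for_4_billion = depth_to_exceed(4_000_000_000)
```

Under these assumptions, 7^11 is about 2 billion and 7^12 is about 13.8 billion, so roughly a dozen link hops already exceed the estimated size of the web.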

6
Adaptive Focused Crawling
  • Focused crawling: developing crawlers able to
    seek out and collect pages related to a given
    topic.
  • It is also called topical crawling.
  • If a focused crawler includes learning methods in
    order to adapt its behavior during the crawl to
    the particular environment and its relationships
    with the given input parameters, e.g., the set of
    retrieved pages and the user-defined topic, the
    crawler is called adaptive.
  • Best-first search
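
A best-first focused crawler can be sketched with a priority queue in place of the FIFO queue of a breadth-first crawler. The relevance measure (`topic_score`), the rule that a link inherits the score of the page it was found on, and the `toy_pages` data are all assumptions made for illustration, not the method of any particular system.

```python
import heapq

def topic_score(text, topic_terms):
    """Fraction of topic terms that occur in the text (a crude relevance measure)."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def best_first_crawl(seeds, pages, topic_terms, max_pages=10):
    """pages maps url -> (text, outgoing links); a stand-in for fetching."""
    frontier = [(0.0, url) for url in seeds]   # heap of (negated priority, url)
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)       # most promising URL first
        text, links = pages[url]
        score = topic_score(text, topic_terms)
        visited.append((url, score))
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return visited

toy_pages = {
    "seed": ("web crawler news", ["sports", "crawl"]),
    "sports": ("football results", ["scores"]),
    "crawl": ("focused web crawler design", ["spider"]),
    "scores": ("football scores table", []),
    "spider": ("adaptive focused crawler", []),
}
visited = best_first_crawl(["seed"], toy_pages, ["focused", "crawler"])
```

In this toy run the on-topic branch (`crawl`, then `spider`) is explored before the off-topic one, which is the behavior that distinguishes best-first from breadth-first ordering.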

7
Outline
  • Introduction of web crawling
  • Exploiting the hypertextual information
  • Genetic-based crawler
  • Ant-based crawler
  • Machine learning-based crawler
  • Evaluation

8
Exploiting the Hypertextual Information
  • PageRank and HITS build on citation analysis,
    which was started in the 1950s by Garfield.
  • In focused crawling systems, precision is
    defined not only in terms of the number of
    crawled pages, but also in terms of rank.
  • Short result lists of high-rank documents are
    definitely better than long lists of
    interesting documents that force users to sift
    through them in order to find the most valuable
    information.

9
Topical Locality and Anchors
  • Topical locality occurs each time a page is
    linked to others with related content (in order
    to give users the chance to see further related
    information or services).
  • Proximal cues or residues correspond to the
    imperfect information at intermediate locations
    that a user exploits to decide which paths to
    follow in order to reach the target information.
  • Text snippets, anchor text, or icons usually
    provide this imperfect information about
    distant content.
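
One way a crawler might use anchor text as a proximal cue is to extract each link's anchor text and rank links by how many topic terms it contains. The sketch below uses Python's standard `html.parser`; the ranking rule and the sample HTML are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.links.append((self._href, text))
            self._href = None

def rank_links_by_anchor(html, topic_terms):
    """Order links by how many topic terms appear in their anchor text."""
    parser = AnchorCollector()
    parser.feed(html)
    def cue_strength(link):
        words = set(link[1].lower().split())
        return sum(term in words for term in topic_terms)
    return sorted(parser.links, key=cue_strength, reverse=True)

html = '<p><a href="/a">Cooking tips</a> and <a href="/b">focused <b>crawler</b> survey</a></p>'
ranked = rank_links_by_anchor(html, ["focused", "crawler"])
```

The link whose anchor text mentions the topic is ranked first, even though the crawler has not yet seen the page it points to.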

10
HITS
  • Authorities have relevant content about a topic
  • Hubs contain several links toward relevant
    authoritative pages.
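
The mutual reinforcement between hubs and authorities can be sketched with the standard HITS iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The toy graph and the fixed iteration count are assumptions.

```python
import math

def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[dst] for dst in graph.get(n, [])) for n in nodes}
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

toy_graph = {"hub1": ["x", "y"], "hub2": ["x", "y"], "other": ["x"], "x": [], "y": []}
hub, auth = hits(toy_graph)
```

Here `x` ends up as the strongest authority because three pages point to it, while `hub1` and `hub2` score higher as hubs than `other` because they point to both authorities.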

11
PageRank
  • Random surfer model: a surfer in that model
    randomly clicks on one of the links contained in
    a page p, each with equal probability 1/N_p,
    where N_p is the number of links on p.
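
The random surfer model can be sketched as a power iteration. The damping factor of 0.85 (the chance that the surfer follows a link rather than jumping to a random page) and the handling of pages without links are common conventions assumed here; they are not stated on the slide.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """graph maps each page to the list of pages it links to."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            out = graph.get(p, [])
            if out:
                # Each of the N_p links is followed with equal probability 1/N_p.
                share = damping * rank[p] / len(out)
                for q in out:
                    new_rank[q] += share
            else:
                # Dangling page: the surfer jumps to a random page.
                for q in nodes:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(toy_graph)
```

The ranks form a probability distribution over pages; in this toy graph `c` ranks highest because it receives links from both `a` and `b`.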

12
Outline
  • Introduction of web crawling
  • Exploiting the hypertextual information
  • Genetic-based crawler
  • Ant-based crawler
  • Machine learning-based crawler
  • Evaluation

13
AI-based Approaches
  • Model crawlers as single autonomous units that
    live and keep moving in search of interesting
    resources.
  • Genetic-based crawlers
  • Ant paradigm

14
Genetic-based crawlers
  • InfoSpiders, also known as ARACHNID (Adaptive
    Retrieval Agents Choosing Heuristic Neighborhoods
    for Information Discovery)
  • Genetic algorithms have been introduced in order
    to find approximate solutions to hard-to-solve
    combinatorial optimization problems.
  • Inspired by evolutionary biology studies.

15
Basic Idea of GA
  • A population of candidate solutions
  • Genetic operators, such as inheritance,
    mutation, and crossover.
  • Individuals that are closer to better solutions
    are given more chances to live and reproduce,
    while those that are ill-suited to the
    environment die out.
  • The initial population is generated randomly.
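
The ingredients listed above can be put together in a small generic genetic algorithm. This is a sketch on a toy problem (maximizing the number of 1 bits in a bit string), not a crawler; tournament selection, elitism, and all parameter values are assumptions chosen for brevity.

```python
import random

def evolve(fitness, genome_len=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Maximize `fitness` over bit strings; returns (initial best, final best)."""
    rng = random.Random(seed)
    # The initial population is generated randomly.
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    initial_best = max(population, key=fitness)
    for _ in range(generations):
        def pick():
            # Tournament selection: fitter individuals get more chances to reproduce.
            return max(rng.sample(population, 3), key=fitness)
        # Elitism: the best individual is inherited unchanged.
        next_gen = [max(population, key=fitness)]
        while len(next_gen) < pop_size:
            mother, father = pick(), pick()
            cut = rng.randrange(1, genome_len)       # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    return initial_best, max(population, key=fitness)

# Toy fitness: the number of 1 bits in the genome.
initial_best, final_best = evolve(sum)
```

Because the best individual is always carried over, the best fitness in the population can never decrease from one generation to the next.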

16
InfoSpiders
  • In InfoSpiders, an evolving population of
    intelligent agents browses the Web, driven by
    user queries.
  • Each agent is able to draw relevant resources and
    reason autonomously about the next page to
    download and analyze.
  • The goal is to mimic the intelligent browsing
    behavior of human users with little or no
    interaction among agents.

17
InfoSpiders (cont.)
  • Each agent is built on top of a genotype (a
    parameter that represents the degree to which an
    agent trusts the textual description of
    outgoing links, a set of keywords initialized
    with the query terms, and a vector of weights).
  • A feed-forward neural network is used to judge
    which keywords in that set best discriminate the
    documents relevant to the user.
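
The three parts of the genotype can be sketched as a small data structure plus a link-selection rule. Everything below is an assumption made for illustration: a single sigmoid unit stands in for the feed-forward network, and the trust parameter (called `beta` here) is used as the sharpness of a softmax choice over link estimates.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class Genotype:
    beta: float        # how strongly the agent trusts its link estimates
    keywords: list     # initialized with the query terms
    weights: list      # one weight per keyword (stand-in for the network)

def estimate_link(genotype, link_text):
    """Score a link from the keywords found in the text around it."""
    words = link_text.lower().split()
    activation = sum(w * words.count(k)
                     for k, w in zip(genotype.keywords, genotype.weights))
    return 1.0 / (1.0 + math.exp(-activation))

def choose_link(genotype, links, rng):
    """links maps url -> surrounding text; pick one with softmax probabilities."""
    urls = list(links)
    scores = [math.exp(genotype.beta * estimate_link(genotype, links[u]))
              for u in urls]
    return rng.choices(urls, weights=scores, k=1)[0]

agent = Genotype(beta=5.0, keywords=["focused", "crawler"], weights=[1.0, 1.0])
links = {"/a": "cooking tips and recipes", "/b": "a focused crawler tutorial"}
picked = choose_link(agent, links, random.Random(0))
```

With a large `beta` the agent almost always follows its best-scored link; with `beta` near zero it chooses nearly at random, ignoring the textual description.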

18
InfoSpiders (cont.)
  • The adaptivity is both unsupervised and
    supervised (with or without user feedback).
  • If an error occurs (an uninteresting page) due to
    the agent's action selection, the weights of the
    neural network are updated accordingly.
  • Mutation and crossover provide the second kind
    of adaptivity to the environment.
  • An agent's energy value is assigned at the
    beginning and updated according to the relevance
    of the pages visited.
  • The energy determines which agents survive and
    which die out.
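
The energy mechanism can be sketched as one update round over a population of agents. The fixed cost per visit, the reproduction threshold, and the rule that parent and child split the energy are assumptions for illustration; in a full system the child would also receive a mutated copy of the parent's genotype.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    energy: float

def step_population(agents, relevance_of_visit, cost=0.2, reproduce_at=2.0):
    """Apply one round of energy updates, then selection and reproduction."""
    survivors = []
    for agent in agents:
        # Energy grows with the relevance of the visited page, minus a fixed cost.
        agent.energy += relevance_of_visit[agent.name] - cost
        if agent.energy <= 0:
            continue                               # the agent dies out
        if agent.energy >= reproduce_at:
            agent.energy /= 2                      # parent and child split the energy
            survivors.append(Agent(agent.name + "'", agent.energy))
        survivors.append(agent)
    return survivors

agents = [Agent("a1", 1.5), Agent("a2", 0.1), Agent("a3", 1.0)]
relevance = {"a1": 0.9, "a2": 0.0, "a3": 0.3}
survivors = step_population(agents, relevance)
```

In this toy round the agent that found a relevant page reproduces, the one that found nothing runs out of energy and is removed, and the third simply survives.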

19
Itsy Bitsy Spider
  • Itsy Bitsy Spider, an implementation of a
    genetic-based crawler, was tested on the Yahoo
    database.
  • During the evaluation, the genetic approach does
    not outperform the best-first search algorithm
    (recall high, no significant difference in
    precision).
  • However, Itsy Bitsy is a simplified version of
    InfoSpiders: it lacks the neural network and some
    other components, and it cannot reason
    autonomously.

20
Outline
  • Introduction of web crawling
  • Exploiting the hypertextual information
  • Genetic-based crawler
  • Ant-based crawler
  • Machine learning-based crawler
  • Evaluation

21
Ant-based Crawlers
  • Based on a model of social insect collective
    behavior.
  • Draws on studies of how blind animals, such as
    ants, are able to find the shortest paths from
    their nest to food sources and back.
  • Ants can release a hormonal substance, the
    pheromone, to mark the ground, leaving a trail.
  • Other ants follow the trail and reinforce it.

22
Mechanism
  • The first ants returning to their nest from the
    food sources are those that chose the shortest
    paths.
  • The back-and-forth trip lets them release
    pheromone twice on the same path.
  • Other ants, when they have to choose between
    different paths, prefer those with more
    pheromone.

23
Ant-based Crawlers
  • Each agent corresponds to a virtual ant that
    moves from url_i to url_j.
  • The system execution is divided into cycles; in
    each of them, the ants make a sequence of moves.
  • At the end of a cycle, the ants update the
    pheromone intensity values of the followed path
    as a function of the retrieved resource scores.

24
Ant-based Crawlers
  • The transition probability from url_i to url_j
    at cycle t is given by a formula (not preserved
    in this transcript).
  • To prevent circular paths, each ant stores a
    list L containing the visited URLs.
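
Since the slide's formula is not available here, the sketch below uses one plausible form as an assumption: the probability of moving to a neighbor is proportional to the pheromone on the connecting link, and neighbors already in the ant's visited list get probability zero. It should not be read as the exact rule from the presentation.

```python
def transition_probabilities(current, out_links, pheromone, visited):
    """Return {next_url: probability} for an ant standing on `current`."""
    candidates = [u for u in out_links[current] if u not in visited]
    total = sum(pheromone[(current, u)] for u in candidates)
    if total == 0:
        return {}      # no usable trail, e.g. every neighbor was already visited
    return {u: pheromone[(current, u)] / total for u in candidates}

out_links = {"i": ["j", "k", "m"]}
pheromone = {("i", "j"): 3.0, ("i", "k"): 1.0, ("i", "m"): 4.0}
probs = transition_probabilities("i", out_links, pheromone, visited={"m"})
```

Although the link to `m` carries the most pheromone, it is excluded because `m` is already in the visited list, so the ant chooses between `j` and `k` in a 3:1 ratio.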

25
Updating Rule
  • The pheromone of the trail from url_i to url_j
    at cycle t+1 is given by an updating formula
    (not preserved in this transcript).
  • Adaptivity: the pheromone intensities are updated
    according to the visited resource scores.
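
As the slide's updating formula is not available here, the following is a sketch in the usual ant-colony style, under stated assumptions: old pheromone decays by a fixed factor (`evaporation`, a made-up value), and each ant adds the mean score of the pages it reached to every link on its path.

```python
def update_pheromone(pheromone, ant_paths, page_score, evaporation=0.8):
    """Return the pheromone table for cycle t+1.

    ant_paths: one list of URLs per ant, in visiting order.
    page_score: url -> relevance score in [0, 1].
    evaporation: fraction of old pheromone that is kept (assumed value).
    """
    updated = {edge: evaporation * level for edge, level in pheromone.items()}
    for path in ant_paths:
        if len(path) < 2:
            continue
        # Each ant deposits the mean score of the pages it reached
        # on every link it followed.
        deposit = sum(page_score[u] for u in path[1:]) / (len(path) - 1)
        for edge in zip(path, path[1:]):
            updated[edge] = updated.get(edge, 0.0) + deposit
    return updated

pheromone = {("i", "j"): 1.0, ("i", "k"): 1.0}
scores = {"i": 0.0, "j": 1.0, "k": 0.0}
pheromone = update_pheromone(pheromone, [["i", "j"], ["i", "k"]], scores)
```

After one cycle the link leading to the relevant page `j` holds more pheromone than the link to `k`, so later ants are more likely to follow it.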

26
Outline
  • Introduction of web crawling
  • Exploiting the hypertextual information
  • Genetic-based crawler
  • Ant-based crawler
  • Machine learning-based crawler
  • Evaluation

27
Intelligent Crawling's Statistical Model
  • Aims at statistically learning the
    characteristics of the linkage structure of the
    Web while performing the search.
  • Uses the knowledge obtained during the search to
    calculate conditional probabilities and interest
    ratios, which determine whether an unseen page
    is likely to satisfy the user's needs.
  • It does not need any collection of topical
    examples for training.
  • The crawler adapts its behavior by learning the
    correlations among given features.
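
The interest ratio can be illustrated with counts gathered during the crawl. The sketch assumes the usual definition, the probability that a page is relevant given an observed feature E divided by the overall probability of relevance, I(C, E) = P(C | E) / P(C). The counts in the example are made up.

```python
def interest_ratio(total, relevant, with_feature, relevant_with_feature):
    """I(C, E) = P(C | E) / P(C), estimated from pages crawled so far."""
    if with_feature == 0 or relevant == 0:
        return 1.0   # no evidence yet: treat the feature as neutral
    # (relevant_with_feature / with_feature) / (relevant / total),
    # rearranged so that only one division is needed.
    return (relevant_with_feature * total) / (with_feature * relevant)

# 100 pages crawled, 20 relevant; 10 pages have feature E, 6 of them relevant.
ratio = interest_ratio(total=100, relevant=20, with_feature=10,
                       relevant_with_feature=6)
```

A ratio above 1 (here 0.6 / 0.2 = 3) means the feature makes relevance more likely, so candidate pages showing that feature deserve higher priority; a ratio below 1 argues against them.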

28
Reinforcement Learning-based Approaches
  • A classifier evaluates the relevance of a
    hypertext document with respect to the chosen
    topics.
  • The interesting documents found are the rewards.
  • The goal is to learn which text in the
    neighborhood of a hyperlink most likely points
    to relevant pages during the crawl.
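
One ingredient of a reinforcement-learning crawler is the value assigned to following a link: the immediate reward plus discounted future rewards along the path. The sketch below computes these values for a single crawl path; the discount factor `gamma` is an assumed parameter. Such values could then serve as training targets for a learner that maps the text around a link to its expected value.

```python
def discounted_returns(rewards, gamma=0.5):
    """rewards[i] is 1 if the i-th page on a crawl path is relevant, else 0."""
    returns = [0.0] * len(rewards)
    future = 0.0
    # Work backwards so each step sees the value of everything after it.
    for i in range(len(rewards) - 1, -1, -1):
        future = rewards[i] + gamma * future
        returns[i] = future
    return returns

# Two irrelevant pages followed by a relevant one.
values = discounted_returns([0, 0, 1])
```

Links closer to the relevant page receive higher values, which lets the crawler prefer links that lead to rewards sooner.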

29
Outline
  • Introduction of web crawling
  • Exploiting the hypertextual information
  • Genetic-based crawler
  • Ant-based crawler
  • Machine learning-based crawler
  • Evaluation

30
Evaluation Methodologies
  • The goodness of the retrieved documents
  • The percentage of important pages retrieved over
    the progress of the crawl is another often-used
    measure.
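
The second measure can be computed as a curve over the crawl. The sketch assumes that the set of important pages is known in advance (for example, from a labeled test collection) and that each URL appears at most once in the crawl order; the URLs are hypothetical.

```python
def fraction_of_important_retrieved(crawl_order, important):
    """Cumulative share of the important pages retrieved after each fetch."""
    found = 0
    curve = []
    for url in crawl_order:
        if url in important:
            found += 1
        curve.append(found / len(important))
    return curve

curve = fraction_of_important_retrieved(["a", "b", "c", "d"], {"a", "c"})
```

Plotting this curve against the number of pages crawled shows how quickly a crawler finds the important pages; a better focused crawler rises toward 1.0 earlier.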

31
An Example of a Performance Plot
  • (Plot figure; not preserved in this transcript.)

32
Summary
  • Focused crawling has become an interesting
    alternative to the current Web search tools.
  • Focused crawlers are a particular kind of crawler
    able to seek out and collect the subset of Web
    pages related to a given topic.
  • With learning methods, adaptive focused crawlers
    are able to adapt the system behavior to the
    particular environment and input parameters
    during the search.
  • Evaluation results show how the whole searching
    process may profit from these techniques and
    increase crawling performance.

33
References
  • Core paper
  • Alessandro Micarelli and Fabio Gasparetti,
    Adaptive Focused Crawling.
  • Additional papers
  • Gautam Pant, Padmini Srinivasan, and Filippo
    Menczer, Crawling the Web, in Web Dynamics,
    Springer-Verlag, 2003.
  • Martin Ester, Matthias Groß, Hans-Peter Kriegel,
    Focused Web Crawling: A Generic Framework for
    Specifying the User Interest and for Adaptive
    Crawling Strategies (VLDB 2001).

34
  • Questions? Comments?
  • Thanks!