Title: Publication Spider


1
Publication Spider
  • Wang Xuan
  • 07/14/2006

2
What is a publication spider
  • Gathering publication pages
  • Using focused crawling
  • With the help of Search Engine

4
What is focused crawling
  • Crawling vs. Focused crawling

5
Crawling methods
  • Web search algorithm
    • Breadth-first (used in standard crawling)
    • Best-first (used in focused crawling)
    • Both are local-search strategies
  • Web analysis algorithm
    • Content-based web analysis
      • page text, title, URI, page layout
    • Link-based web analysis
      • hard to apply while the search graph is not yet
        completely known
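
The difference between the two search algorithms comes down to how the frontier of unvisited pages is ordered. The sketch below is only an illustration on a hypothetical in-memory link graph (`LINKS`) with made-up relevance scores (`SCORE`); neither comes from the slides. It also looks scores up directly, whereas a real focused crawler has to estimate the score of a link before fetching the page.

```python
import heapq
from collections import deque

# Hypothetical toy link graph: page -> outgoing links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
# Hypothetical relevance scores that a web analysis step might assign.
SCORE = {"seed": 0.5, "a": 0.1, "b": 0.9, "c": 0.2, "d": 0.8}


def breadth_first(seed):
    """Visit pages in the order they were discovered (FIFO frontier)."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def best_first(seed):
    """Always visit the highest-scoring page in the frontier next."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier, seen, order = [(-SCORE[seed], seed)], {seed}, []
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE[link], link))
    return order


print(breadth_first("seed"))  # ['seed', 'a', 'b', 'c', 'd']
print(best_first("seed"))     # ['seed', 'b', 'd', 'a', 'c']
```

On this toy graph the best-first order reaches the high-scoring pages `b` and `d` before the low-scoring `a`, while breadth-first ignores the scores entirely.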

6
Focused Crawling
  • Learning phase
  • Crawling phase

7
Related work - naïve Bayes Crawler
  • one of the simplest focused crawlers
  • extracted text is represented as a vector of
    words weighted by word frequency
  • relevance score is the cosine similarity between
    page p and the query q representing the topic
  • focuses only on target pages; assigns low priority
    to source links
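
The relevance score described above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and raw term counts as weights; the function names and the sample texts are hypothetical, not taken from the presentation.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: each word weighted by its raw frequency."""
    return Counter(text.lower().split())


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Counter returns 0 for words missing from q, so no key check is needed.
    dot = sum(weight * q[word] for word, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty page or query has no defined direction
    return dot / (norm_p * norm_q)


# Hypothetical page text and topic query.
page = term_vector("publications list journal publications conference")
query = term_vector("publications conference")
score = cosine_similarity(page, query)
print(round(score, 4))  # 0.8018
```

The score lies between 0 and 1 for non-negative weights, so it can be used directly as a crawl priority.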

8
Related work - Context focused Crawler
  • Represents the context in which the target
    pages are found as a graph
  • a page in layer i has a direct link to some page
    in layer i-1
  • layer 0 contains the target page
  • N classifiers, one for each layer
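
The layer structure above can be built by following links backwards from the target page. The sketch below assumes the backlinks are already available as a dictionary (in practice they would have to be obtained from somewhere, for example a search engine); `build_layers` and the page names are hypothetical. Training the per-layer classifiers is not shown.

```python
from collections import deque


def build_layers(backlinks, target, depth):
    """Group pages by the length of their shortest link path to the target.

    backlinks maps a page to the pages that link *to* it.
    Layer 0 holds the target; a page in layer i links to some page in layer i-1.
    """
    layer_of = {target: 0}
    queue = deque([target])
    while queue:
        page = queue.popleft()
        if layer_of[page] == depth:
            continue  # do not expand beyond the requested number of layers
        for parent in backlinks.get(page, []):
            if parent not in layer_of:
                layer_of[parent] = layer_of[page] + 1
                queue.append(parent)
    layers = [[] for _ in range(depth + 1)]
    for page, i in layer_of.items():
        layers[i].append(page)
    return layers


# Hypothetical backlink data for a single target page.
backlinks = {
    "paper.html": ["pubs.html"],
    "pubs.html": ["home.html", "group.html"],
    "home.html": ["dept.html"],
}
layers = build_layers(backlinks, "paper.html", depth=2)
print(layers)  # [['paper.html'], ['pubs.html'], ['home.html', 'group.html']]
```

Each resulting layer would then serve as training data for the classifier of that layer.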

10
Related work - PaSE (Page Search Engine)
  • Given citation information, find the online PDF
    document
  • the top 10 links returned from Google --> the right
    page that is likely to contain the online PDF
  • Breadth-first, Depth-first, Random
  • right web page --> identify the right (citation,
    PDF) pair
  • using (title, PDF)
  • pick the PDF link with the shortest distance to the
    citation block
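
The last step, pairing a citation with the nearest PDF link, might look like the sketch below. It is an assumption-laden illustration: distance is measured in characters of raw HTML, the citation block is approximated by the position of the title, and `nearest_pdf_link` and the `PDF_LINK` pattern are hypothetical. The slides do not say how PaSE actually measures distance.

```python
import re

# Hypothetical pattern for href values ending in .pdf inside an HTML page.
PDF_LINK = re.compile(r'href="([^"]+\.pdf)"', re.IGNORECASE)


def nearest_pdf_link(html, title):
    """Return the PDF link closest (in characters) to the title, or None."""
    found = re.search(re.escape(title), html, re.IGNORECASE)
    if found is None:
        return None  # the citation does not appear on this page
    start, end = found.span()
    best_url, best_distance = None, None
    for match in PDF_LINK.finditer(html):
        if match.end() <= start:
            distance = start - match.end()          # link precedes the title
        else:
            distance = max(0, match.start() - end)  # link follows the title
        if best_distance is None or distance < best_distance:
            best_url, best_distance = match.group(1), distance
    return best_url


# Hypothetical publication page with two citations.
page = (
    '<li>Focused Crawling Basics. <a href="a.pdf">PDF</a></li>'
    '<li>Publication Spider. <a href="b.pdf">PDF</a></li>'
)
print(nearest_pdf_link(page, "Publication Spider"))       # b.pdf
print(nearest_pdf_link(page, "Focused Crawling Basics"))  # a.pdf
```

Character offsets are a crude proxy; a distance counted in DOM nodes or rendered lines would be a reasonable alternative under the same idea.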

11
General framework for spider
  • Highly dependent on the seed pages
  • keywords
  • Target pages

12
Target Repository
PublicationEntry
KeyWord
Search Engine
More Pages

13
(No Transcript)

14
Future work
  • Scale up the evaluation
  • Improve the performance of the spider