Title: Parallel Perl Robot PPRobot


1
Parallel Perl Robot (PPRobot)
  • Spring semester 2000
  • Technion
  • Comnet Lab
  • Shlomo Yona, Semuel Fomberg

Instructor: Yigal Bejerano
2
Description
  • The project's purpose
    • Study crawlers
    • Design a crawler system
    • Implement a crawler system

3
Parts of a Search Engine
  • Internet
  • Crawler
  • Indexer
  • Sifter
4
Crawler flow (diagram labels, in extracted order)
  • Find host
  • Clean the URL
  • In robots.txt cache?
  • Get next URL
  • Get robots.txt
  • Extract links
  • Allowed to get URL?
  • Is HTML?
  • Get resource, serialize request/response
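
The steps named on this slide can be sketched roughly as follows. This is an illustration in Python, not the project's Perl code, and the order of the steps is an assumption, since the transcript lists the diagram labels without their arrows. The names `USER_AGENT`, `clean_url`, `RobotsCache`, `LinkExtractor`, `crawl_one`, and the two fetch callbacks are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "PPRobot-sketch"  # hypothetical agent name, not from the slides
DEFAULT_PORTS = {"http": 80, "https": 443}


def clean_url(url):
    """'Clean the URL': lower-case the host, drop fragment and default port."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))


class RobotsCache:
    """Keep one parsed robots.txt per host so it is fetched only once."""

    def __init__(self, fetch_robots):
        self._fetch_robots = fetch_robots  # assumed callable: host -> text
        self._rules = {}

    def allowed(self, url):
        host = urlsplit(url).netloc          # "Find host"
        if host not in self._rules:          # "In robots.txt cache?"
            parser = RobotFileParser()
            parser.parse(self._fetch_robots(host).splitlines())  # "Get robots.txt"
            self._rules[host] = parser
        return self._rules[host].can_fetch(USER_AGENT, url)  # "Allowed to get URL?"


class LinkExtractor(HTMLParser):
    """'Extract links': collect cleaned, absolute href targets of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(clean_url(urljoin(self.base_url, value)))


def crawl_one(url, fetch_page, robots):
    """Process one URL; fetch_page is an assumed callable: url -> (type, body)."""
    url = clean_url(url)
    if not robots.allowed(url):
        return []
    content_type, body = fetch_page(url)     # "Get resource"
    if not content_type.startswith("text/html"):  # "Is HTML?"
        return []
    extractor = LinkExtractor(url)
    extractor.feed(body)
    return extractor.links
```

Passing the two fetch functions in as callbacks keeps the sketch free of network access, so the control flow can be exercised with canned pages.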
5
(No Transcript)
6
(No Transcript)
7
Problems of crawling
  • Equivalent Set (types of, solutions)
  • Many to Many IP-URL mapping
  • Politeness (Robots Exclusion Protocol)
  • DNS (problem, C internals, the Squid solution)
  • Load (buckets algorithm)
  • Storage (need for, compression, index file)
  • Speed (buckets, profiling, SELECT)
  • Parsing (DTD, incorrect syntax)
  • Counting on correctness of other systems
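
The slides name a "buckets algorithm" for load but do not define it. One plausible reading, sketched here purely as an assumption, is to keep pending URLs in one bucket per host and to serve hosts in rotation, so that no single server receives a burst of consecutive requests. The class `HostBuckets` and its methods are hypothetical names.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class HostBuckets:
    """Queue URLs per host and hand them out round-robin across hosts."""

    def __init__(self):
        self._buckets = OrderedDict()  # host -> deque of pending URLs

    def add(self, url):
        host = urlsplit(url).netloc.lower()
        self._buckets.setdefault(host, deque()).append(url)

    def next_url(self):
        """Return the next URL, or None when every bucket is empty."""
        if not self._buckets:
            return None
        # Take the host that has waited longest since it was last served.
        host, queue = self._buckets.popitem(last=False)
        url = queue.popleft()
        if queue:
            self._buckets[host] = queue  # re-insert at the back of the rotation
        return url
```

A real crawler would likely also enforce a minimum delay per host; the rotation alone only spreads requests out when several hosts are pending.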

8
Technology
  • Programming languages
    • Perl, SQL, HTML
  • Databases
    • MySQL, Berkeley DB
  • Standards
    • HTTP
    • Robots Exclusion Protocol

9
Two implementations
  • SELECT solution
    • resourceGetter.pl (and company)
  • Multi-process solution
    • getter.pl (and company)
  • Other ideas
    • Use of threads (why? why not?)
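
Assuming "SELECT" here refers to the select() call (available in Perl as the four-argument select and through IO::Select), the idea behind a single-process solution can be sketched as below: one loop waits on many sockets at once and reads from whichever is ready. This is a Python illustration, not a reconstruction of resourceGetter.pl; the function `read_all` is a hypothetical name.

```python
import select
import socket


def read_all(sockets, chunk_size=4096):
    """Read every socket to EOF in one process, multiplexing with select()."""
    buffers = {s: bytearray() for s in sockets}
    pending = set(sockets)
    while pending:
        # Block until at least one pending socket has data or has closed.
        readable, _, _ = select.select(list(pending), [], [])
        for s in readable:
            data = s.recv(chunk_size)
            if data:
                buffers[s].extend(data)
            else:
                pending.discard(s)  # empty read: the peer closed the connection
    return [bytes(buffers[s]) for s in sockets]
```

The multi-process alternative would instead give each download its own process and let the operating system interleave them, trading memory for simpler per-download code.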

10
Results
  • 0.5-2 million pages a day using getter.pl
  • While
    • Running 100-200 instances
    • 2-CPU (700 MHz P3) Linux machine
    • 2 GB memory
    • 200 GB disk space
    • T1 line to the Internet
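
To put the reported figures in perspective, the short calculation below converts pages per day into pages per second and estimates how large the average page could be if the T1 line were the only link and were fully used. It assumes the nominal T1 rate of 1.544 Mbit/s and ignores protocol overhead; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate (assumption: no overhead)


def pages_per_second(pages_per_day):
    """Average fetch rate implied by a daily page count."""
    return pages_per_day / SECONDS_PER_DAY


def max_avg_page_bytes(pages_per_day, link_bits_per_second=T1_BITS_PER_SECOND):
    """Upper bound on average bytes per page if the link were saturated all day."""
    bytes_per_day = link_bits_per_second / 8 * SECONDS_PER_DAY
    return bytes_per_day / pages_per_day


for daily in (500_000, 2_000_000):
    print(daily, round(pages_per_second(daily), 2), round(max_avg_page_bytes(daily)))
```

Under these assumptions the reported range corresponds to roughly 5.8 to 23 pages per second, with an average page of at most about 33 KB at the low end and about 8 KB at the high end.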

11
Benefits
12
Contact
  • Shlomo Yona shlomo_at_vipe.technion.ac.il
  • Semuel Fomberg semuel_at_vipe.technion.ac.il
  • http://www.comnet.technion.ac.il/cn8s00