1
SLASHPack Collector: Performance Improvement and Evaluation
  • Rudd Stevens
  • CS 690
  • Spring 2006

2
Outline
  • 1. Introduction, system overview and design.
  • 2. Performance modifications, re-factoring and
    re-structuring.
  • 3. Performance testing results and evaluation.

3
Outline
  • 1. Introduction, system overview and design.
  • 2. Performance modifications, re-factoring and
    re-structuring.
  • 3. Performance testing results and evaluation.

4
Introduction
  • SLASHPack Toolkit
    (Semi-LArge Scale Hypertext Package)
  • Sponsored by Prof. Chris Brooks; engineered for
    initial clients Nancy Montanez and Ryan King.
  • Collector component: a framework for collecting
    documents.
  • Goal: evaluate and improve the Collector's
    performance.

5
Contact and Information Sources
  • Contact Information
    Rudd Stevens, rstevens@cs.usfca.edu
  • Project Website
    http://www.cs.usfca.edu/rstevens/slashpack/collector/
  • Project Sponsor
    Professor Christopher Brooks, Department of
    Computer Science, University of San Francisco
    cbrooks@cs.usfca.edu

6
Stages
  • Addition of a protocol module for the Weblog data set.
  • Performance testing using the Weblog and HTTP
    modules; identification of problem areas.
  • Modification of the Collector to improve scalability
    and performance.
  • Repeated performance testing and evaluation of the
    improvements.

7
Implementation
  • Language: Python
  • Platform: any OS with Python support,
    Python 2.4 or later.
    (Developed and tested under Linux.)
  • Progress: fully built, newly re-factored for
    performance and usability.

8
High level design
  • SLASHPack is designed as a framework.
  • Modular components that contain sub-modules.
  • The Collector is pluggable: protocol modules,
    parsers, filters, output writers, etc. (see the
    sketch below).
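
The slides do not show the Collector's actual module interface; the sketch
below only illustrates the pluggable-framework idea, and every class and
method name in it is hypothetical, not SLASHPack code.

    # Minimal sketch of a pluggable protocol module (illustrative names only).
    import urllib2

    class ProtocolModule(object):
        """Interface each protocol module (HTTP, Weblog, ...) implements."""
        def fetch(self, url):
            raise NotImplementedError

    class HttpModule(ProtocolModule):
        def fetch(self, url):
            # fetch a document over HTTP with the standard library
            return urllib2.urlopen(url).read()

    class Collector(object):
        """The framework talks only to module interfaces, so protocol
        modules, parsers, filters and output writers can be swapped."""
        def __init__(self, protocol, parsers=(), filters=(), writers=()):
            self.protocol = protocol
            self.parsers = parsers
            self.filters = filters
            self.writers = writers

        def collect(self, url):
            doc = self.protocol.fetch(url)
            for parser in self.parsers:
                doc = parser.parse(doc)
            for filt in self.filters:
                if not filt.accept(doc):
                    return
            for writer in self.writers:
                writer.write(doc)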

9
High level design (cont.)
10
Outline
  • 1. Introduction, system overview and design.
  • 2. Performance modifications, re-factoring and
    re-structuring.
  • 3. Performance testing results and evaluation.

11
Performance Testing
  • Large-scale text collection
    Weblog data set.
    Long web crawls.
  • Performance testing / monitoring
    Python profiling (see the sketch below).
    Integrated statistics.
  • Functionality testing
    Python logging.
    Functionality test runs.
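
The slides only name "Python profiling"; a minimal way to do this with the
standard-library profile and pstats modules is sketched below. The
fetch_batch function is a stand-in for a real Collector run, not SLASHPack
code.

    # Profile a (stand-in) collection run with the standard library.
    import profile
    import pstats
    import urllib2

    def fetch_batch(urls):
        # stand-in for a Collector run: fetch each URL and discard the body
        for url in urls:
            try:
                urllib2.urlopen(url).read()
            except Exception:
                pass

    profile.run('fetch_batch(["http://www.cs.usfca.edu/"])', 'collector.prof')

    stats = pstats.Stats('collector.prof')
    stats.sort_stats('cumulative')   # order by cumulative time
    stats.print_stats(20)            # show the 20 most expensive calls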

12
Collector Runtime Statistics
  • UrlFrontier
    Url Frontier size (current number of links): 3465
    Urls requested from frontier: 659
    Url Frontier, current number of server queues: 78
    Urls delivered from frontier: 639
  • Collector
    Documents per second: 3.70328865405
    Total runtime: 2 minutes 31.4869761467 seconds
  • UrlSentry
    Urls filtered using robots: 38
    Urls filtered for depth: 9
    Urls processed: 5881
    Urls filtered using filters: 165
  • UrlBookkeeper
    Duplicate Urls: 1557
    Urls recorded: 4104

13
Collector Runtime Statistics
  • DocFingerprinter
    Documents written: 386
    Average document size (bytes): 20570
  • HTTP status responses
    200: 394    204: 10
    301: 8      302: 25
    404: 91     403: 7
    401: 1      400: 24
    500: 1
  • Duplicate documents: 51
  • Total documents collected: 561
  • Documents by mimetype
    text/xml: 1      image/jpeg: 1
    text/html: 451   image/gif: 1
    text/plain: 106  application/octet-stream: 1

14
Challenges
  • Large text (XML) files
    21 XML files of 1 GB each.
    450,000 files per XML file.
    10 million files after processing.
  • Memory/Storage
    Disk space.
    Memory usage during (XML) processing.

15
Weblog raw data
    <post>
      <weblog_url> http://www.livejournal.com/users/chuckdarwin </weblog_url>
      <weblog_title> "Evolve!" </weblog_title>
      <permalink>http://www.livejournal.com/users/chuckdarwin/1001264.html</permalink>
      <post_title> Flickr </post_title>
      <author_name> Darwin (chuckdarwin) </author_name>
      <date_posted> 2005-07-09 </date_posted>
      <time_posted> 00:00:00 </time_posted>
      <content> <html><head><meta content="text/html; charset=UTF-8"
        http-equiv="Content-Type"/><title>"Evolve!"</title></head><body>
        <div style="text-align: center"><font size="1"><a
        href="http://www.nytimes.com/2005/07/09/arts/09boxe.html?ei=5088&amp;en=61cfcd5835008b1a&amp;ex=1278561600&amp;partner=rssnyt&amp;emc=rss&amp;pagewanted=print">7/7
        and 9/11?</a></font></div></body></html>
      </content>
      <outlinks>
        <outlink>
          <url> http://www.nytimes.com/2005/07/09/arts/09boxe.html </url>
          <site> http://www.nytimes.com </site>
          <type> Press </type>
        </outlink>
      </outlinks>
    </post>

16
Weblog processed data
    <spdata>
      <url>http://www.livejournal.com/users/chuckdarwin</url>
      <date>20060212</date>
      <crawlname>WeblogPosts20050709</crawlname>
      <weblog>
        <weblog_title>"Evolve!"</weblog_title>
        <permalink>http://www.livejournal.com/users/chuckdarwin/1001264.html</permalink>
        <post_title>Flickr</post_title>
        <author_name>Darwin (chuckdarwin)</author_name>
        <date_posted>2005-07-09</date_posted>
        <time_posted>00:00:00</time_posted>
        <outlinks>
          <outlink>
            <type>Press</type>
            <url>http://www.nytimes.com/2005/07/09/arts/09boxe.html</url>
            <site>http://www.nytimes.com</site>
          </outlink>
        </outlinks>
      </weblog>
      <tags></tags>
    </spdata>

17
Original Design
18
Problems to Address
  • Overall collection performance
    Streamline processing.
  • Robot file look-up
    Incredibly slow and inefficient. (Not mine!)
    (One standard remedy is sketched below.)
  • Thread interaction
    Efficient use of threads and queues to process data.
  • Inefficient code
    Python code is not always the fastest.
    miniDom XML parsing.
  • Faster data structures
    Re-work collection protocols, DNS pre-fetch.
    Re-structure URL Frontier, URL Bookkeeper.
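
The slides do not show how the robots.txt look-up was actually fixed; the
sketch below shows one standard approach, caching a parsed robots.txt per
host with the standard-library robotparser module. The function and cache
names are illustrative, not SLASHPack code.

    # Cache one parsed robots.txt per host so each host is fetched only once.
    import robotparser            # urllib.robotparser in Python 3
    import urlparse               # urllib.parse in Python 3

    _robot_cache = {}

    def allowed(url, agent='SLASHPack'):
        host = urlparse.urlsplit(url)[1]      # network location of the URL
        parser = _robot_cache.get(host)
        if parser is None:
            parser = robotparser.RobotFileParser()
            parser.set_url('http://%s/robots.txt' % host)
            parser.read()                     # single network fetch per host
            _robot_cache[host] = parser
        return parser.can_fetch(agent, url)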

19
New Design
20
Performance Modifications
  • Structure re-design (threading)
    More queues, more independence.
  • Robot parser
    String creation, debug calls.
  • URL Frontier
    More efficient data structures.
  • Protocol modules
    More efficient data structures.
    Re-factoring for reliable collection.
  • XML parsing
    Switch to a faster parser, removal of the DOM parser
    (see the sketch below).
  • DNS pre-fetching
    More efficient structuring.
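
The slides do not name the parser that replaced miniDom. As an illustration
only, the sketch below uses ElementTree's iterparse, a common memory-friendly
way to stream through a large Weblog XML file one <post> element at a time;
the file name in the usage note is hypothetical.

    # Stream a large Weblog XML file instead of building a full DOM tree.
    # (Illustrative only; the slides do not say which parser was adopted.
    # The C-accelerated cElementTree was the usual choice in the Python 2.4 era.)
    import xml.etree.ElementTree as ET

    def iter_posts(path):
        for event, elem in ET.iterparse(path, events=('end',)):
            if elem.tag == 'post':
                yield elem        # hand one post to the caller
                elem.clear()      # release the subtree so memory stays flat

    # usage (hypothetical file name):
    # for post in iter_posts('weblog-2005-07-09.xml'):
    #     handle(post)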

21
New data structures
  • Dictionary fields for the Base data type
    (must be implemented by any data protocol).
  • Now passed as a dictionary to the storage component
    (see the example below).

    Key           Value                         Type
    datatype      user-defined datatype name    string
    status        HTTP document status          string
    url           URL of document               string
    date          collection date               string
    crawlname     name of current crawl         string
    size          byte length of content        string
    mimetype      mime type of document         string
    fingerprint   md5sum hash of content        string
    content       raw text of document          string
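
For illustration, a record with the fields listed above could be built like
this; the field names come from the slide, while the helper function and the
sample values are hypothetical.

    # Build a document record with the Base data type's dictionary fields.
    # (Field names from the slide; function name and values are illustrative.)
    import md5      # hashlib.md5 on later Pythons
    import time

    def make_record(url, content, status, mimetype, crawlname, datatype='http'):
        return {
            'datatype':    datatype,                      # user-defined datatype name
            'status':      str(status),                   # HTTP document status
            'url':         url,                           # URL of document
            'date':        time.strftime('%Y%m%d'),       # collection date
            'crawlname':   crawlname,                     # name of current crawl
            'size':        str(len(content)),             # byte length of content
            'mimetype':    mimetype,                      # mime type of document
            'fingerprint': md5.new(content).hexdigest(),  # md5sum hash of content
            'content':     content,                       # raw text of document
        }

    record = make_record('http://www.cs.usfca.edu/', '<html>...</html>',
                         200, 'text/html', 'TestCrawl')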

22
Outline
  • 1. Introduction, system overview and design.
  • 2. Performance modifications, re-factoring and
    re-structuring.
  • 3. Performance testing results and evaluation.

23
Performance Comparison
  • Initial results
    Weblog data set
      without parsing, robots: 161 doc/s, 50 min.
      with parsing, robots: 3.9 doc/s, 162 min. (killed)
    HTTP web crawl
      100 docs with parsing, robots: 0.2 doc/s, 16 min 13 s
      150 docs with parsing, robots: 0.3 doc/s, 21 min 3 s
  • Modified results
    Weblog data set
      without parsing, robots: 170 doc/s, 42 min.
      with parsing, robots: 186 doc/s, 63 min.
    HTTP web crawl
      100 docs with parsing, robots: 2.2 doc/s, 1 min 10 s
      150 docs with parsing, robots: 2.9 doc/s, 1 min 14 s

24
Performance Comparison (cont.)
  • Hardware considerations: HTTP web crawl for 500
    documents.
  • Pentium 4, 2.4 GHz, 1 GB RAM (faster connection):
    3.7 doc/s, 3 min 18 s, 728 docs total.
  • Pentium 4, 2.0 GHz, 1 GB RAM:
    3.7 doc/s, 4 min 25 s, 725 docs total.
  • Pentium 4, 3.2 GHz HT, 2 GB RAM (faster connection):
    4.3 doc/s, 2 min 47 s, 717 docs total.

25
Performance Comparison (cont.)
  • Comparison to other web crawlers
    (published results, 1999):
    Google: 33.5 doc/s
    Internet Archive: 46.3 doc/s
    Mercator: 112 doc/s
  • Consideration of functionality
    More than just a web crawler.
    Multiple MIME types.

26
Available Documentation
  • API documentation (pydoc-style), generated with
    Epydoc.
  • Use and configuration guide (README).
  • Quick-start guide.
  • Full report: full specification of the Collector,
    its use and configuration, and development
    background.

27
Future Work
  • Addition of pluggable modules.
  • Improved fingerprint sets.
  • Improved Python memory management and threading.

28
References
  • Allan Heydon and Marc Najork. Mercator: A scalable,
    extensible web crawler.
    http://research.compaq.com/SRC/mercator/papers/www/paper.pdf
  • Soumen Chakrabarti. Mining the Web, 2002.
    Ch. 2, pages 17-43.
  • Heritrix, Internet Archive.
    http://crawler.archive.org/
  • Python Performance Tips.
    http://wiki.python.org/moin/PythonSpeed/PerformanceTips
  • Prof. Chris Brooks and the SLASHPack Team.

29
Conclusion
  • Four stages
    Addition of a protocol module for the Weblog data set.
    Performance testing and identification of problem areas.
    Modification of the Collector to improve scalability
    and performance.
    Repeated performance testing and evaluation of the
    improvements.
  • Results
    Expanded functionality for data types.
    Modifications improved performance.
    More stable and flexible design.