Gordon Mohr - PowerPoint PPT Presentation


Title: Gordon Mohr


1
An Introduction To Heritrix
  • Gordon Mohr
  • Chief Technologist, Web Projects
  • Internet Archive

2
Web Collection
  • Since 1996
  • Over 4 x 10^10 resources (URI + time)
  • Over 400 TB (compressed)

3
Web Collection via Alexa
  • Alexa Internet
  • Private company
  • Crawling for IA since 1996
  • 2-month rolling snapshots
  • Recent: 3 billion URIs, 35 million websites, 20 TB
  • Crawling software
  • Sophisticated
  • Weighted towards popular sites
  • Proprietary: we only receive the data

4
Heritrix Motivations 1
  • Deeper, specialized, in-house crawling
  • Sites of topical interest
  • Contractual crawls for libraries and governments
  • US Library of Congress
  • Elections, current events, government websites
  • UK Public Records Office, US National Archives
  • Government websites
  • Using our own software and machines

5
Heritrix Motivations 2
  • Open source
  • Encourage collaboration on features and best
    practices
  • Avoid duplication of work, incompatibilities
  • Archival-quality
  • Perfect copies
  • Keep up with changing web
  • Meet evolving needs of Internet Archive and
    International Internet Preservation Consortium

6
Heritrix
  • New
  • Open-source
  • Extensible
  • Web-scale
  • Archival-quality
  • Web crawling software

7
Heritrix Use Cases
  • Broad Crawling
  • Large, as-much-as-possible
  • Focused Crawling
  • Collect specific sites/topics deeply
  • Continuous Crawling
  • Revisit changed sites
  • Experimental Crawling
  • Novel approaches

8
Heritrix Project
  • Heritrix means heiress
  • Java, modular
  • Project website: http://crawler.archive.org
  • News, downloads, documentation
  • SourceForge: open source hosting site
  • Source-code control (CVS)
  • Issue databases
  • Lesser GPL license
  • Outside contributions

9
http://crawler.archive.org
10
Heritrix Milestones
  • Summer 2003: Prototypes created and tested
    against existing crawlers; requirements collected
    from IA and IIPC
  • October 2003-April 2004: Nordic Web Archive
    programmers join project, add capabilities
  • January 2004: First public beta (0.2.0)
  • Used for all in-house crawling since
  • February-June 2004: Workshops for Heritrix
    users at national libraries
  • August 2004: Version 1.0.0 released

11
Heritrix Architecture
  • Basic loop
  • 1. Choose a URI from among all those scheduled
  • 2. Fetch that URI
  • 3. Analyze or archive the results
  • 4. Select discovered URIs of interest, and add to
    those scheduled
  • 5. Note that the URI is done and repeat
  • Parallelized across threads (and eventually,
    machines)
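
The five steps of the basic loop can be sketched in a few lines. This is a minimal, single-threaded illustration in Python (Heritrix itself is Java), not Heritrix code: the `crawl` function, the `FAKE_WEB` dictionary standing in for the network, and the `in_scope` and `fetch` callables are all hypothetical names chosen for this example. Parallelization across threads is left out.

```python
from collections import deque

# Hypothetical in-memory "web": URI -> list of URIs it links to.
FAKE_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://other.example/"],
    "http://example.org/b": [],
}


def crawl(seeds, in_scope, fetch):
    """Run the basic loop until no scheduled URIs remain."""
    scheduled = deque(seeds)   # URIs waiting to be fetched
    seen = set(seeds)          # every URI ever scheduled
    archived = {}              # URI -> what was fetched for it
    while scheduled:
        uri = scheduled.popleft()          # 1. choose a scheduled URI
        links = fetch(uri)                 # 2. fetch that URI
        archived[uri] = links              # 3. analyze or archive the result
        for link in links:                 # 4. select discovered URIs of interest
            if link not in seen and in_scope(link):
                seen.add(link)
                scheduled.append(link)
        # 5. this URI is done; the loop repeats with the next one
    return archived


result = crawl(
    ["http://example.org/"],
    in_scope=lambda u: u.startswith("http://example.org/"),
    fetch=lambda u: FAKE_WEB.get(u, []),
)
print(sorted(result))
```

The off-site link to other.example is discovered in step 4 but never scheduled, because the scope test rejects it.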

12
Key components of Heritrix
  • Scope
  • which URIs should be included
  • (seeds + rules)
  • Frontier
  • which URIs are done, or waiting to be done
  • (queues and lists/maps)
  • Processor chains
  • configurable sequential tasks to do to each URI
  • (code modules + configuration)
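
One way to picture how Scope and Frontier divide the work is the sketch below. It is an assumption-laden illustration in Python, not the Heritrix API: the class and method names (`Scope.accepts`, `Frontier.schedule`, `Frontier.next_uri`, `Frontier.finished`) are invented for this example, and a single FIFO queue stands in for whatever queues and maps the real Frontier uses.

```python
from collections import deque
from urllib.parse import urlsplit


class Scope:
    """Seeds plus rules that decide which URIs should be included."""

    def __init__(self, seeds, rules):
        self.seeds = list(seeds)
        self.rules = list(rules)   # each rule is a callable: uri -> bool

    def accepts(self, uri):
        return all(rule(uri) for rule in self.rules)


class Frontier:
    """Tracks which URIs are done and which are waiting to be done."""

    def __init__(self, scope):
        self.scope = scope
        self.pending = deque()     # waiting to be fetched
        self.known = set()         # scheduled at any point
        self.done = set()          # finished
        for seed in scope.seeds:
            self.schedule(seed)

    def schedule(self, uri):
        """Queue a URI if it is new and in scope; report whether it was queued."""
        if uri in self.known or not self.scope.accepts(uri):
            return False
        self.known.add(uri)
        self.pending.append(uri)
        return True

    def next_uri(self):
        return self.pending.popleft() if self.pending else None

    def finished(self, uri):
        self.done.add(uri)


scope = Scope(
    seeds=["http://example.org/"],
    rules=[
        lambda u: urlsplit(u).hostname == "example.org",  # stay on one host
        lambda u: not u.endswith(".zip"),                 # exclude archives
    ],
)
frontier = Frontier(scope)
rejected = frontier.schedule("http://example.org/big.zip")
accepted = frontier.schedule("http://example.org/page")
duplicate = frontier.schedule("http://example.org/page")
print(rejected, accepted, duplicate)
```

Keeping the inclusion decision in Scope and the bookkeeping in Frontier means either one can be swapped without touching the other, which matches the modular design the slides describe.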

13
Heritrix Architecture
14
Heritrix Processor Chains
  • Prefetch
  • Ensure conditions are met
  • Fetch
  • Network activity (HTTP, DNS, FTP, etc.)
  • Extract
  • Analyze, especially for new URIs
  • Write
  • Save archival copy to disk
  • Postprocess
  • Feed URIs back to Frontier, update crawler state
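
The chain idea, an ordered list of steps applied to each URI, can be sketched as below. This is an illustrative Python toy, not Heritrix's processor classes: the step functions, the `record` dictionary, and the `FAKE_PAGES`, `ARCHIVE`, and `SCHEDULED` containers are hypothetical stand-ins for the network, the disk, and the Frontier.

```python
import re

# Hypothetical stand-ins for the network, the archive on disk, and the Frontier.
FAKE_PAGES = {"http://example.org/": '<a href="http://example.org/a">A</a>'}
ARCHIVE = {}
SCHEDULED = []


def prefetch(record):
    # Ensure conditions are met; here, only that the scheme is supported.
    if not record["uri"].startswith("http://"):
        record["skip"] = True


def fetch(record):
    # Network activity, simulated with a dictionary lookup.
    record["body"] = FAKE_PAGES.get(record["uri"], "")


def extract(record):
    # Analyze the content, especially for new URIs.
    record["links"] = re.findall(r'href="([^"]+)"', record["body"])


def write(record):
    # Save an archival copy (to a dictionary rather than to disk).
    ARCHIVE[record["uri"]] = record["body"]


def postprocess(record):
    # Feed discovered URIs back for scheduling.
    SCHEDULED.extend(record["links"])


CHAIN = [prefetch, fetch, extract, write, postprocess]


def process(uri):
    """Run one URI through the chain, stopping early if a step says to skip."""
    record = {"uri": uri, "skip": False, "body": "", "links": []}
    for step in CHAIN:
        step(record)
        if record["skip"]:
            break
    return record


done = process("http://example.org/")
skipped = process("ftp://example.org/file")
print(done["links"], skipped["skip"])
```

Because the chain is just a list, reordering, removing, or adding a step is a configuration change rather than a code change to the loop itself.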

15
Heritrix Features & Limitations
  • Other key features
  • Web UI console to control & monitor crawl
  • Very configurable inclusion, exclusion,
    politeness policies
  • Limitations
  • Requires sophisticated operator
  • Large crawls hit single-machine limits
  • No capacity for automatic revisit of changed
    material
  • Generally
  • Good for focused & experimental crawling use
    cases; not yet for broad and continuous

16
Heritrix console
17
Heritrix settings
18
Heritrix logs
19
Heritrix reports
20
Heritrix Current Uses
  • Weekly, Monthly, 6-monthly, and special one-time
    crawls
  • Hundreds to thousands of specific target sites
  • Over 20 million collected URIs per crawl
  • Crawls run for 1-2 weeks

21
Heritrix Performance
  • Not yet stressed, optimized
  • Current crawls limited by material to crawl and
    chosen politeness, not our performance
  • Typical observed rates (actual focused crawls)
  • 20-40 URIs/sec (peaking over 60)
  • 2-3Mbps (peaking over 20Mbps)
  • Limits imposed by memory usage
  • Over 10,000 hosts/over 10 million URIs (512MB
    machine, more on larger machines)
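
The observed rates can be checked against the crawl sizes quoted on the "Heritrix Current Uses" slide. The short Python calculation below assumes the 20-40 URIs/sec figures are sustained averages over a whole crawl, which the slides do not state; `crawl_days` is a helper name invented for this example.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(total_uris, uris_per_second):
    """Days needed to collect total_uris at a constant average rate."""
    return total_uris / uris_per_second / SECONDS_PER_DAY


for rate in (20, 40):
    print(f"{rate} URIs/sec: {crawl_days(20_000_000, rate):.1f} days")
```

Under that assumption, a 20 million URI crawl takes roughly 6 to 12 days, which is consistent with the reported 1-2 week crawl durations.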

22
Heritrix Future Plans
  • Larger scale crawl capacity
  • Giant focused crawls
  • Broad whole-web crawls
  • New protocols & formats
  • Automate expert operator tasks
  • Continuous and dynamic crawling
  • Revisit sites as they change
  • Dynamically rank sites and URIs

23
Latest Developments
  • 1.2 Release (next week)
  • Configurable canonicalization
  • Handles common session-IDs, URI variations
  • Politeness by IP address
  • Experimental, more memory-efficient Frontier
  • Bug fixes
  • 1.4 Release (January 2005)
  • Memory robustness
  • Experimental multi-machine distribution support

24
The End
  • Questions?