Saving the Web for Future Generations

1
Saving the Web for Future Generations
  • Michele Kimpton
  • Internet Archive
  • Michele_at_archive.org

2
About Internet Archive
  • www.archive.org
  • Largest public web archive
  • 40 billion pages from 35 million sites, 1996 to the present
  • In the last few years, added video, music and text
    collections
  • All available online for free public access

3
What do we collect?
  • Snapshot of the public web every two months, over
    4 billion pages
  • Have content in 21 languages
  • Websites from every domain

4
Why save it all?
  • It can be done
  • Web has no boundaries
  • Selection is expensive
  • What will be important?
  • What is there today may be gone tomorrow

5
(No Transcript)

6
Video
  • 1.5 million downloads in US election week
  • Get out the vote

7
Collections
  • Focused topic- or site-directed collections
    developed with partners
  • Library of Congress
  • UK National Archives
  • US National Archives

Broad crawling alone is not good enough; it is important to
do both

8
Policy
  • Opt out
  • We collect it all, and make it inaccessible if
    requested by site owner
  • Site owner blocks harvester on site directly
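
The slide does not name the mechanism a site owner uses to block the harvester. One common way is a robots.txt file, so the sketch below assumes that. The user-agent name `example-archiver`, the rules, and the URLs are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish (assumption):
# it blocks one named harvester entirely and hides /private/ from everyone.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_harvest(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_harvest(ROBOTS_TXT, "example-archiver", "http://example.org/index.html"))
    print(may_harvest(ROBOTS_TXT, "other-bot", "http://example.org/index.html"))
```

A harvester that honors opt-out in this way would run such a check before fetching each URL and skip anything the rules disallow.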

9
Access
  • Entire collection accessible for free to the
    public via the website
  • Gets 100 hits/second
  • 60k unique users per day
  • Through public use we hope to find out what is
    important and continuously improve

10
Preservation
  • Multiple copies within each Archive
  • Multiple copies at different geographical
    locations
  • Standard storage boxes, open source design
  • $3,000 per terabyte
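
Keeping multiple copies only helps if a damaged copy can be detected. The slide does not say how copies are checked, so the following is a minimal sketch under one assumption: each location's copy is hashed with SHA-256 and any copy that disagrees with the majority is flagged. The location names are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: dict[str, bytes]) -> list[str]:
    """Return the locations whose copy differs from the majority digest.

    copies maps a location name to the bytes stored there.
    """
    digests = {location: sha256_hex(data) for location, data in copies.items()}
    # The most common digest is treated as the reference value.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(loc for loc, d in digests.items() if d != majority_digest)


if __name__ == "__main__":
    copies = {
        "site-a": b"archived page",
        "site-b": b"archived page",
        "site-c": b"archived pagX",  # simulated corruption
    }
    print(find_damaged_copies(copies))
```

A majority vote needs at least three copies to be meaningful; with only two copies a mismatch shows that something is wrong but not which copy to trust.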

11
Challenges we face
  • Making it useful to researchers
  • Making sure we do not miss the good stuff
  • Developing more sophisticated tools for access
    and harvesting (crawler.archive.org)
  • Managing a collection of this size

12
IA's Future
  • Collaboration and Partnerships
  • Interoperable tools, common storage formats and
    standards through the IIPC
  • www.netpreserve.org
  • Multiple copies around the world

13
Recommendations
  • Start now
  • THINK BIG: entry level is a petabyte of storage
  • Collaborate
  • Provide Access
  • Have multiple copies in different locations
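
As a rough back-of-the-envelope check, the petabyte entry level can be combined with the figure of 3,000 per terabyte from the Preservation slide. The sketch below assumes 1 petabyte = 1,000 terabytes and that every copy costs the same; the result is in whatever currency unit that figure is quoted in.

```python
TB_PER_PB = 1000  # decimal terabytes per petabyte (assumption)


def storage_cost(petabytes: float, cost_per_tb: float, copies: int) -> float:
    """Estimate the total storage cost for the given number of full copies."""
    return petabytes * TB_PER_PB * cost_per_tb * copies


if __name__ == "__main__":
    # One petabyte at 3,000 per terabyte, kept as a single copy and as two copies.
    print(storage_cost(1, 3000, copies=1))
    print(storage_cost(1, 3000, copies=2))
```

Under these assumptions a single petabyte copy comes to 3,000,000, and each additional copy in another location adds the same amount again.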