NARA Web Preservation Efforts - PowerPoint PPT Presentation

Slides: 21
Provided by: RSpan
Transcript and Presenter's Notes

Title: NARA Web Preservation Efforts


1
NARA Web Preservation Efforts
  • Robert Spangler
  • Information Technology Specialist
  • U.S. National Archives and Records Administration
  • www.archives.gov

2
NARA Web Preservation Efforts
  • NARA and the Web
  • Managing Web sites as Records
  • Appraisal Issues
  • Guidance to Agencies on Managing Web Records
  • Guidance to Agencies on Transfer of Web Records
  • 2001 Presidential Term Harvest
  • 2004 Presidential Term Harvest

3
Appraisal Issues
  • Permanently valuable
  • Content vs. Presentation
  • Surface vs. deep
  • Validity of snapshot paradigm
  • Frequency of snapshot

4
Managing Web sites as Records
  • General Issues
  • Trustworthiness
    • Reliability
    • Authenticity
    • Integrity
    • Usability
  • Maintaining Trustworthiness
    • Will those aspects be maintained upon transfer?

5
Transfer of Web Records
  • Part of the U.S. Administration's e-Gov initiative:
    Electronic Records Management
  • NARA effort to define transfer standards for modern
    formats, of which Web records were one
  • Guidance sent to agency records officers,
    effective 17 Sept 2004

6
Transfer of Web Records
  • Content limited to what can be accessed via
    Hypertext Transfer Protocol (HTTP).
  • Content may include images, text, audio, or
    video.
  • Static v. Dynamic Content

7
Transfer of Web Records
  • Supplementary documentation for each web transfer
    is required
  • Methods for transfer include
    • Harvest
      • Preferred method, but settings are important
    • PDF Capture
      • Must also comply with issued PDF Transfer
        Guidance
    • Manual download and copying
      • Applies only for small volume transfers
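
To illustrate why harvest settings matter, the following is a minimal sketch of a breadth-first harvest in which link depth and host scope are explicit settings. The names `HarvestSettings`, `harvest`, and the caller-supplied `fetch` function are hypothetical; they are not taken from NARA's guidance or from any particular harvesting tool.

```python
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse


@dataclass
class HarvestSettings:
    """Hypothetical settings; real harvesters expose many more."""
    max_depth: int = 2         # link hops to follow from the seed
    stay_on_host: bool = True  # skip links that leave the seed's host


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed: str,
            fetch: Callable[[str], Optional[str]],
            settings: HarvestSettings) -> Dict[str, str]:
    """Capture pages reachable from the seed, breadth first."""
    seed_host = urlparse(seed).netloc
    captured: Dict[str, str] = {}
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:  # fetch failed or content unavailable
            continue
        captured[url] = html
        if depth >= settings.max_depth:
            continue  # depth limit reached: do not follow links
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target, _fragment = urldefrag(urljoin(url, href))
            parts = urlparse(target)
            if parts.scheme not in ("http", "https"):
                continue
            if settings.stay_on_host and parts.netloc != seed_host:
                continue
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return captured
```

With the same seed, changing only `max_depth` changes which pages end up in the capture, which is one reason the settings used for a harvest are worth recording alongside the transfer.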

8
Transfer of Web Records
  • Acceptable formats include HTML and other
    standard markup formats such as XML
  • Component parts and files associated with the
    primary web content record must also be
    transferred
  • Hypertext links internal to the records must be
    redirected internally
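
As one possible reading of "redirected internally", the sketch below rewrites links that point at the site's own host into relative paths, so that they resolve inside the transferred copy. The function name `localize_links` and the example host are hypothetical, and the regular expression only handles double-quoted `href` and `src` attributes; it is a simplification, not a full HTML rewriter.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Simplification: only double-quoted href/src attributes are handled.
LINK_ATTR_RE = re.compile(r'(href|src)\s*=\s*"([^"]*)"', re.IGNORECASE)


def localize_links(html: str, page_url: str, site_host: str) -> str:
    """Rewrite links to site_host as paths relative to page_url."""
    page_dir = posixpath.dirname(urlparse(page_url).path) or "/"

    def rewrite(match):
        attr, value = match.group(1), match.group(2)
        absolute = urlparse(urljoin(page_url, value))
        if absolute.scheme not in ("http", "https"):
            return match.group(0)  # e.g. mailto: links stay as they are
        if absolute.netloc != site_host:
            return match.group(0)  # external link: not handled here
        relative = posixpath.relpath(absolute.path or "/", start=page_dir)
        if absolute.query:
            relative += "?" + absolute.query
        if absolute.fragment:
            relative += "#" + absolute.fragment
        return f'{attr}="{relative}"'

    return LINK_ATTR_RE.sub(rewrite, html)
```

For a page at `http://agency.example/docs/index.html`, a link to `http://agency.example/images/seal.gif` becomes `../images/seal.gif`, while links to other hosts are left untouched.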

9
Transfer of Web Records
  • External links must be disabled
    • If external links are determined to be
      significant to the content of the transferred
      records, they should be commented
  • Transfers of dynamic web content must include
    • Web form
    • Description of the business process
    • Back-end databases
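
The external-link rule above could be sketched as follows. This assumes that "commented" means leaving an HTML comment that records the original target, which is only one interpretation of the slide. The function name `disable_external_links` and the `significant` parameter are hypothetical, and the regular expression handles only simple anchors with double-quoted `href` values.

```python
import re
from urllib.parse import urlparse

# Simplification: matches <a ... href="..."> ... </a> with a quoted href.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def disable_external_links(html, site_host, significant=frozenset()):
    """Replace external anchors with their text.

    Targets listed in `significant` also get an HTML comment that
    records where the link used to point (an assumed convention).
    """
    def rewrite(match):
        target, text = match.group(1), match.group(2)
        host = urlparse(target).netloc
        if not host or host == site_host:
            return match.group(0)  # relative or internal link: keep
        if target in significant:
            # "--" is not allowed inside an HTML comment, so encode it.
            safe = target.replace("--", "%2D%2D")
            return f"{text}<!-- external link disabled; original target: {safe} -->"
        return text

    return ANCHOR_RE.sub(rewrite, html)
```

Which external links count as significant is an appraisal decision; in this sketch the caller simply passes them in as a set of URLs.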

10
Transfer of Web Records
  • Some examples: 9/11 Commission, Columbia Accident
    Investigation Board
  • Commercial and freeware harvesters have been used
  • Somewhat limited, but expected to grow as tools
    improve and acceptance grows

11
2001 Web Harvest
  • One-time snapshot of agency public web sites as
    they exist by 20 January 2001
  • Attempt to document at least in part agency use
    of the Internet
  • Preserved on NARA archival media (DLT tape)
  • There has been some off-line reference activity
    for this material

12
2001 Web Harvest
  • For inclusion
    • All public content
    • For dynamic content, take a snapshot; background
      data not included
    • Snapshot should be in a format that can be read
      on other platforms
    • Technical documentation, site map, etc.
    • Termination of external links

13
2005 Web Harvest
  • Executed by NARA in a coordinated effort
  • Used GSA and DOD/DISA seed lists
  • Contractor hired, actual crawl by Internet
    Archive

14
2005 Web Harvest
  • Sites harvested
    • 1st and 2nd level Federal and military domains,
      i.e. .gov and .mil as represented on seed
      lists
  • IA Heritrix crawler used
  • Available at www.webharvest.gov
  • Also preserved by NARA in ARC format and HTML
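
A seed-list scope check of the kind described above might look like the sketch below. The helpers `load_seeds`, `second_level_domain`, and `in_scope` are hypothetical, and the seed file layout (one domain per line, `#` for comments) is an assumption; the actual GSA and DOD/DISA list formats are not described in the slides.

```python
from urllib.parse import urlparse

FEDERAL_TLDS = {"gov", "mil"}


def load_seeds(lines):
    """Read seed domains, one per line; blank lines and # comments are skipped."""
    seeds = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            seeds.add(entry)
    return seeds


def second_level_domain(url):
    """Return e.g. 'archives.gov' for a .gov or .mil URL, else None."""
    host = (urlparse(url).hostname or "").rstrip(".")
    labels = host.split(".")
    if len(labels) < 2 or labels[-1] not in FEDERAL_TLDS:
        return None
    return ".".join(labels[-2:])


def in_scope(url, seeds):
    """True when the URL's second-level domain appears on the seed list."""
    domain = second_level_domain(url)
    return domain is not None and domain in seeds
```

The top-level domain test alone is not enough to decide scope in this sketch; a URL is only harvested when its second-level domain is also present on the seed list.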

15
2005 Web Harvest
  • Limitations of capabilities
  • The netpreserve.org guidance "Web Harvesting
    Survey" ties in well with our experience and advises
    well on the easy-to-difficult continuum

16
2005 Web Harvest
  • Types of elements that caused problems in QC
    process
    • Forms (deep web)
    • Encoding
    • Passwords
    • Server-side scripts
    • Proprietary software
    • Database requests
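
As a rough illustration of how a page might be screened for the element types listed above before a crawl, here is a heuristic sketch. The class `ProblemScanner`, the function `scan`, the file-extension list, and the rule that a query string suggests a database request are all assumptions made for this example; they do not describe NARA's QC process.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed heuristics, not an authoritative list.
SCRIPT_EXTENSIONS = (".cgi", ".pl", ".asp", ".aspx", ".php", ".jsp", ".cfm")
PLUGIN_TAGS = {"object", "embed", "applet"}


class ProblemScanner(HTMLParser):
    """Flags markup that is often hard for a crawler to capture."""

    def __init__(self):
        super().__init__()
        self.problems = set()
        self.declares_charset = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.problems.add("form")
        if tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.problems.add("password")
        if tag in PLUGIN_TAGS:
            self.problems.add("proprietary software")
        if tag == "meta" and any(
            "charset" in str(part).lower()
            for pair in attrs.items() for part in pair
        ):
            self.declares_charset = True
        for name in ("href", "src", "action"):
            value = attrs.get(name)
            if not value:
                continue
            parts = urlparse(value)
            if parts.path.lower().endswith(SCRIPT_EXTENSIONS):
                self.problems.add("server-side script")
            if parts.query:  # heuristic: query strings often hit a database
                self.problems.add("database request")


def scan(html):
    """Return a sorted list of potential problem types found in the page."""
    scanner = ProblemScanner()
    scanner.feed(html)
    problems = set(scanner.problems)
    if not scanner.declares_charset:
        problems.add("encoding")  # no declared character set
    return sorted(problems)
```

A page with a search form posting to a `.cgi` script would be flagged for both "form" and "server-side script", while a plain page that declares its character set returns an empty list.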

17
2005 Web Harvest
  • Stats
    • 982 active and unrestricted second-level URLs
    • 6.5 terabytes of information, 75 million
      web pages, about 50k .gov and .mil web sites

18
Lessons for Futureproofing
  • Adherence to standards
  • Plainer rather than fancier
  • Avoid Flash, etc.
  • Avoid browser-specific features

19
Conclusions
  • For preservation, it's a race between
    • web features,
    • web technical capabilities,
    • needs of users,
    • harvesting capabilities,
    • preservation capabilities,
    • standards development

20
Thank You
  • Robert.spangler_at_nara.gov