Title: Preserving access: Making more informed guesses about what works
1. Preserving access: Making more informed guesses
about what works
- Prepared by Maxine Davis, Collaboration Research Officer
- Presented by David Pearson, Acting Director,
Web Archiving and Digital Preservation,
National Library of Australia
- IIPC Open Day, San Francisco, 7 October 2009
2. Presentation Outline
- The problem
- Case study: PANDORA Web Archive
- Some approaches and options
- Approach 1: Unified Digital Format Registry (UDFR)
- Approach 2: Wikipedia
- Approach 3: Another way, documenting what web
archives actually use or used
3. The problem
- The World Wide Web is constantly evolving
- Requires combinations of software/hardware to
render web content
- But what is used for creation and access changes
- Web archives:
- Contain snapshots of websites taken at different
times (different sites or same sites multiple
times)
- Lots of files, many file formats, various
versions
- Aim for ongoing access
4. Process of version creep in the archive
- Mixed accessibility resulting from:
- Different browsers, plug-ins, operating systems
in use (then and now)
- Backwards compatibility not guaranteed
- Changes in standards and coding practices
(deprecated, dead and non-standard tags)
- Obsolescence of file formats and renderers
- Changes to access paths:
- Incremental loss of access not directly obvious
- Alternative access paths not specified
5. Case study: PANDORA, Australia's Web Archive (1)
- Selective archive: began collecting in 1996
- Sites individually selected by NLA and partners
- As at July 2009: over 70.6 million files
- Accessible over the web using a standard web
browser
- .au whole domain harvests:
- 4 annual harvests 2005-2008 completed; 2009
underway with Internet Archive
- Combined harvests 2005-2008: 2.3 billion files
- Not currently publicly available
6. Case study: PANDORA, Australia's Web Archive (2)
7. IIPC Preservation Working Group discussions
- Need for documenting the technical environment
- Support required for alternative preservation
action strategies:
- Emulation of past environments
- Migration to standard formats
- Risk notification
- Recording conversion and alternate access paths
- Exploring different approaches
- Sharing information is sensible
8. Technical information of interest
- Browsers, plug-ins/helper applications: versions
and dependencies
- Used approximately when?
- Appropriate for which individual file format, type
of file format, or whole archive?
9. Already documented?
- Manufacturers'/vendors' websites
- Developers' networks, forums, blogs, etc.
- File format registries
- File extension resources
- Software archives/download sites
- Internet history websites
- Internet statistics websites
- Wikipedia
10. Possible Approach 1: UDFR
- Digital format registry will result from proposed
merger of PRONOM and GDFR
- Pros:
- Considerable intellectual investment already
- Could be used for general digital preservation
and potential interaction with other tools
- Cons:
- Under development
- Web archive requirements need to be specified,
use cases developed, changes to data model,
population with relevant data and regular
updating
- Temporal aspect not currently catered for
- Entry point: individual file format or software
type (could be a pro?)
11. Possible Approach 2: Wikipedia (1)
- Pros:
- Existing free, web-based, collaborative,
multilingual project
- Draws together a rich set of information:
- browsers, layout engines, plug-ins, software,
statistics, creators, standards, etc.
- lists, history, comparisons, timelines, links to
internal and external references
- Updated by many voluntary contributors
12. Possible Approach 2: Wikipedia (2)
- Cons:
- General audience, not specific to web archive
requirements or a specific web archive
- Amount of detail varies (between different
language versions, articles)
- Can be edited by multiple users (+/-)
- Not designed to interact with other digital
preservation tools as UDFR has potential to do
13. Extract example
14. Possible Approach 3: Documenting what web
archives are using/used
- Pros:
- Time-based software suite approach
- Starting point for:
- Potential UDFR seed list
- Identifying commonly used software
- Inferring additional software requirements
- Identifying alternate access paths
- Cons:
- Easier to document current versions
- Obscure/obsolete material in our collections may
be unknown
15. Individual web archives as sources of information
- Analysis of archive contents and harvesting
statistics
- Web archivists' observations and records
- UK Web Archive Technology Watch blog
- Website usage statistics:
- Browser versions and operating systems
- Indicative of popularity
- Archived sites:
- Plug-in requirements, file type information
- May include useful information websites
- Internet Archive: complementary collection
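As a rough illustration of what "analysis of archive contents" could look like at its simplest, the sketch below tallies file extensions from a list of harvested URLs. The function name `tally_extensions` and the sample URLs are hypothetical and not drawn from any archive's tooling. An extension is only a hint at the file format, not a reliable identification, so a count like this is a starting point rather than a format inventory.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def tally_extensions(urls):
    """Count harvested URLs by lower-cased file extension.

    URLs whose path has no extension are counted under "(none)".
    """
    counts = Counter()
    for url in urls:
        # Use only the path component so query strings are ignored.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; a real run would read a harvest log or index.
    sample = [
        "http://example.org/index.html",
        "http://example.org/report.PDF",
        "http://example.org/movie.swf?autoplay=1",
        "http://example.org/",
    ]
    for extension, count in sorted(tally_extensions(sample).items()):
        print(extension, count)
```

Running the same tally over each harvest year separately would give a crude view of how the mix of file types, and so the likely plug-in requirements, shifts over time.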
16. Example: NLA Web archiving software environment,
July 2009
- Operating system: Windows XP
- Computer: Windows PC, Intel Pentium 4
- Browser: Internet Explorer 7 (main browser), IE8,
Firefox 3.0
- Additional software:
- Adobe Reader 8
- Adobe Shockwave Player
- Adobe Flash Player 10
- Real Player 10
- Apple QuickTime 7
- Windows Media Player 11
- Java 6 Update 11
- JavaScript enabled
- Word, Excel, PowerPoint 2003
- WinZip
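An environment listing like the one above could be recorded in a time-based, machine-readable form along the following lines. This is a minimal sketch, not an IIPC or NLA format: the class `SoftwareEnvironment`, its field names and the helper `environment_for` are all hypothetical, and the day of the month is a placeholder because the slide gives only "July 2009".

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SoftwareEnvironment:
    """One dated record of the software suite used to view an archive."""
    archive: str
    recorded: date
    operating_system: str
    computer: str
    browsers: Tuple[str, ...]
    additional_software: Tuple[str, ...]


# Values copied from the slide; the day (1) is an assumed placeholder.
NLA_JULY_2009 = SoftwareEnvironment(
    archive="NLA",
    recorded=date(2009, 7, 1),
    operating_system="Windows XP",
    computer="Windows PC, Intel Pentium 4",
    browsers=("Internet Explorer 7", "Internet Explorer 8", "Firefox 3.0"),
    additional_software=(
        "Adobe Reader 8",
        "Adobe Shockwave Player",
        "Adobe Flash Player 10",
        "Real Player 10",
        "Apple QuickTime 7",
        "Windows Media Player 11",
        "Java 6 Update 11",
        "JavaScript enabled",
        "Word, Excel, PowerPoint 2003",
        "WinZip",
    ),
)


def environment_for(
    snapshot_date: date, environments: Iterable[SoftwareEnvironment]
) -> Optional[SoftwareEnvironment]:
    """Return the latest environment recorded on or before snapshot_date."""
    candidates = [e for e in environments if e.recorded <= snapshot_date]
    return max(candidates, key=lambda e: e.recorded, default=None)
```

With several dated records per archive, `environment_for` gives a first guess at which software suite was in use when a given snapshot was collected; it returns `None` when no record is old enough.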
17. Example: Earlier NLA Software Environment
18. Example: Comparison of NLA and BnF software
environments
19. Going forward
- Is it worth pursuing approach 3?
- If so, where would we record it (IIPC PWG wiki?
other suggestions)?
- Interested in contributing?
20. Questions?
- Contact:
- David Pearson: dapearson_at_nla.gov.au
- Maxine Davis: madavis_at_nla.gov.au
- Report to IIPC PWG by end of October 2009
Everything, for Everyone Forever