Title: DSM
Slide 1: Access to Web Archives - the Nordic Web Archive Access Project
Svein Arne Solbakk, Director of ICT, National Library of Norway
Archiving Web Resources Information Day, NLA, Canberra, 12.11.2004
Slide 2:
- NWA History
- NWA Toolset / Nordic experience
- NWA Access Tool
- Relation to IIPC
- Status of the work in Norway
Slide 3: NWA History
- From 1996 exchange of experience
- Similar strategies and long history of collaboration
- Used a variety of tools over the years
- Decided to collaborate on developing a common access interface towards web archives (2000)
- Summer 2003 joined IIPC together
- Collaborated with Internet Archive on developing a common harvester (2003-2004)
- October 2004 NWA Access Tool version 1.1
Slide 4: NWA Toolset / Nordic Experience
- Harvesters: Combine, NEDLIB harvester, HTTrack, Heritrix
- Search tools: Excalibur, search engine from Fast Search and Transfer, Lucene
- Digital repository: different strategies
- Access: NWA Access Tool
Slide 5: NWA Access Tool
- Funded by Nordunet2 (40%), NORDINFO (15%) and NWA (45%)
- Total cost ECU 400 000 (AUD 620 000)
- Version 1.0 as Open Source April 2004
- Version 1.1 as Open Source October 21st 2004
Slide 6: NWA Access Tool
- What it does
- Access based on URL
- Full text search
- Navigation
- Time dimension
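The combination of URL-based access and a time dimension can be illustrated with a minimal sketch. It assumes a hypothetical in-memory index mapping each URL to its capture timestamps in the 14-digit form YYYYMMDDhhmmss; the `CaptureIndex` class and its method names are illustrative only and are not taken from the NWA Access Tool.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp, e.g. 20041021120000


class CaptureIndex:
    """Hypothetical index: URL -> sorted list of capture timestamps."""

    def __init__(self):
        self._captures = defaultdict(list)

    def add(self, url, timestamp):
        # Keep each URL's timestamps sorted so lookups can use binary search.
        insort(self._captures[url], timestamp)

    def versions(self, url):
        """All capture timestamps for a URL, oldest first."""
        return list(self._captures.get(url, []))

    def closest(self, url, timestamp):
        """Capture timestamp nearest in time to the requested one, or None."""
        stamps = self._captures.get(url)
        if not stamps:
            return None
        wanted = datetime.strptime(timestamp, STAMP_FORMAT)
        pos = bisect_left(stamps, timestamp)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = stamps[max(pos - 1, 0):pos + 1]
        return min(
            candidates,
            key=lambda s: abs(datetime.strptime(s, STAMP_FORMAT) - wanted),
        )
```

Listing `versions(url)` corresponds to browsing a page along the time dimension, while `closest(url, timestamp)` is one plausible way to resolve a link to the capture nearest a chosen date.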
Slide 8: NWA Access Module
- Why?
  - For internal quality assurance of harvested web sites
  - For internal access to the web archive
  - For researchers
  - Hopefully, eventually, for everyone interested
Slide 9: NWA Access Module
- Developed using Perl, PHP and Java
- Utilizes open standards like HTTP and XML for communication
- Access through standard web browser
- No plugin required
- Open Source
- Interface towards several search engines
- Interface towards several storage tools and formats
Slide 10: How does it work?
[Diagram] Components: Lucene; Web archive (ARC format); NIXML (NWA Indexing Markup Language).
Slide 11: How does it work?
[Diagram] Components: Lucene; ARC Retriever; Web archive (ARC format); NIXML (NWA Indexing Markup Language).
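What a retriever for ARC-format archives has to do can be sketched as follows. This is a minimal illustration, not the NWA ARC Retriever itself: it assumes an uncompressed ARC version 1 stream in which each record starts with a header line of five space-separated fields (URL, IP address, 14-digit archive date, content type, length in bytes) followed by that many bytes of content. The names `ArcRecord` and `read_record` are hypothetical.

```python
import io
from typing import BinaryIO, NamedTuple, Optional


class ArcRecord(NamedTuple):
    url: str
    ip: str
    archive_date: str
    content_type: str
    content: bytes


def read_record(stream: BinaryIO) -> Optional[ArcRecord]:
    """Read the next record from an uncompressed ARC v1 stream, or None at EOF."""
    line = stream.readline()
    # Skip the blank line(s) that separate consecutive records.
    while line in (b"\n", b"\r\n"):
        line = stream.readline()
    if not line:
        return None
    # Split from the right so the URL is whatever precedes the last four fields.
    fields = line.decode("latin-1").rstrip("\r\n").rsplit(" ", 4)
    if len(fields) != 5:
        raise ValueError("not an ARC v1 record header: %r" % line)
    url, ip, archive_date, content_type, length = fields
    content = stream.read(int(length))
    return ArcRecord(url, ip, archive_date, content_type, content)


# Usage with a hand-built, single-record stream (sample data only).
body = b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>"
header = ("http://nwa.nb.no/ 192.0.2.1 20041021120000 text/html %d\n" % len(body)).encode()
stream = io.BytesIO(header + body + b"\n")
record = read_record(stream)
```

Real ARC files are commonly gzip-compressed and begin with a file description record, so a production retriever would need to handle both; this sketch leaves them out.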
Slide 12: How does it work?
[Diagram] Components: Index; Lucene; Exporter; ARC Retriever; Web archive (ARC format); NIXML (NWA Indexing Markup Language). Link label: HTTP.
Slide 13: How does it work?
[Diagram] Components: Index; Lucene; Lucene interface; Exporter; NWA Access Tool v1.1; ARC Retriever; Web archive (ARC format); NIXML (NWA Indexing Markup Language). Link labels: HTTP (three links).
Slide 14: How does it work?
[Diagram] Components: Index; Lucene; Lucene interface; Exporter; Heritrix; ARC Writer (Heritrix module); NWA Access Tool v1.1; ARC Retriever; Web archive (ARC format); NIXML (NWA Indexing Markup Language). Link labels: HTTP (three links).
Slide 15: How does it work?
[Diagram] Components: Index; Lucene; NIXML Writer (Heritrix module); Lucene interface; Exporter; Heritrix; ARC Writer (Heritrix module); NWA Access Tool v1.1; ARC Retriever; Web archive (ARC format); NIXML (NWA Indexing Markup Language). Link labels: HTTP (three links).
Slide 16: How does it work?
[Diagram] Components: Index; Lucene; Lucene interface; Exporter; NWA Access Tool v1.1; NEDLIB; ARC Retriever; NEDLIB Retriever; Web archive (ARC format); Web archive (NEDLIB format). Link labels: HTTP (three links).
Slide 17: How does it work?
[Diagram] Components: Index (FAST); Index (Lucene); Lucene interface; FAST interface; Exporter; NWA Access Tool v1.1; NEDLIB; ARC Retriever; NEDLIB Retriever; Web archive (ARC format); Web archive (NEDLIB format). Link labels: HTTP (three links).
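The idea of one access tool sitting in front of interchangeable search engines (Lucene, FAST) and interchangeable retrievers (ARC, NEDLIB) can be sketched with two small abstract interfaces. Everything here is an assumption made for illustration: the class names, method signatures and in-memory stand-ins are hypothetical, and the real components communicate over HTTP rather than through in-process calls.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Capture = Tuple[str, str]  # hypothetical key: (url, 14-digit timestamp)


class SearchInterface(ABC):
    """Hypothetical common interface in front of any search engine."""

    @abstractmethod
    def search(self, query: str) -> List[Capture]:
        ...


class Retriever(ABC):
    """Hypothetical common interface in front of any archive format."""

    @abstractmethod
    def fetch(self, capture: Capture) -> bytes:
        ...


class InMemorySearch(SearchInterface):
    """Stand-in for a real engine: case-insensitive substring match."""

    def __init__(self, texts: Dict[Capture, str]):
        self._texts = texts

    def search(self, query: str) -> List[Capture]:
        needle = query.lower()
        return sorted(c for c, text in self._texts.items() if needle in text.lower())


class InMemoryRetriever(Retriever):
    """Stand-in for a real archive: content held in a dictionary."""

    def __init__(self, blobs: Dict[Capture, bytes]):
        self._blobs = blobs

    def fetch(self, capture: Capture) -> bytes:
        return self._blobs[capture]


class AccessTool:
    """Depends only on the two interfaces, not on a specific backend."""

    def __init__(self, search: SearchInterface, retriever: Retriever):
        self._search = search
        self._retriever = retriever

    def find(self, query: str) -> List[Tuple[Capture, bytes]]:
        return [(c, self._retriever.fetch(c)) for c in self._search.search(query)]
```

Swapping one search engine or archive format for another then means supplying a different implementation of `SearchInterface` or `Retriever`, while `AccessTool` stays unchanged.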
Slide 18: Relation to IIPC Architecture?
[Diagram] Components: Index; Index Search; Harvester; Ingest; Access tool (three instances); Access; Storage; Archive analysis tools; Content management; Web archive. Flow labels: seeds, rules, schedule; logs; SIP; AIP (twice); DIP.
Slide 19: Relation to IIPC Architecture?
- Plug and play, but interfaces and formats are not standardised yet
[Diagram] Components: Index; Lucene; Heritrix; NWA Access Tool; ARC tools; Web archive (ARC format). Flow labels: seeds, rules; logs.
Slide 20: Relation to IIPC Architecture?
Norwegian implementation to be operational 1.1.2005
[Diagram] Components: Index; FAST; Heritrix; NWA Access Tool; Digital Repository; Web archive. Flow labels: seeds, rules; logs.
Slide 21: Relation to IIPC Architecture?
Norwegian implementation to be operational 1.1.2005
[Diagram] Components: Index; FAST; Heritrix; NWA Access Tool; Digital Repository; Web archive. Flow labels: seeds, rules; logs.
Digital Repository:
- Java, web service
- Hides storage servers from applications
- URN identification
- Integrity checking
- Format aware
- Migration/emulation
Slide 22: Relation to IIPC Architecture?
Norwegian implementation to be operational 1.1.2005
[Diagram] Components: Index; FAST; Heritrix; NWA Access Tool; Digital Repository; Web archive. Flow labels: seeds, rules; logs.
Digital Repository:
- Java, web service
- Hides storage servers from applications
- URN identification
- Integrity checking
- Format aware
- Migration/emulation
- Probably open source in 2005
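Two of the Digital Repository properties listed above, URN identification and integrity checking, can be illustrated together. The sketch below is an assumption-laden toy, not the National Library of Norway's repository (which is described as a Java web service): the `Repository` class, its methods, the use of SHA-256 and the `urn:example:` identifiers are all hypothetical.

```python
import hashlib


class Repository:
    """Toy repository: callers use URNs only and never see storage locations."""

    def __init__(self):
        self._objects = {}  # urn -> stored bytes (stands in for storage servers)
        self._digests = {}  # urn -> SHA-256 hex digest recorded at ingest

    def ingest(self, urn, data):
        if urn in self._objects:
            raise KeyError("URN already in use: %s" % urn)
        self._objects[urn] = data
        self._digests[urn] = hashlib.sha256(data).hexdigest()

    def verify(self, urn):
        """True if the stored bytes still match the digest recorded at ingest."""
        current = hashlib.sha256(self._objects[urn]).hexdigest()
        return current == self._digests[urn]

    def get(self, urn):
        """Return the object for a URN, refusing to serve corrupted content."""
        if not self.verify(urn):
            raise ValueError("integrity check failed for %s" % urn)
        return self._objects[urn]


# Usage with a hypothetical identifier in the reserved example namespace.
repo = Repository()
repo.ingest("urn:example:web-archive-0001", b"harvested content")
content = repo.get("urn:example:web-archive-0001")
```

Because applications address objects only by URN, the storage layer behind `get` could be moved or replaced without changing the callers.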
Slide 23: Direction for the Access Tool?
- IIPC plans to make an IIPC Access Tool
- Will be based on NWA Access Tool, Wayback Machine and Lucene/Nutch
- Important issues:
  - scalability
  - flexibility
  - ease of installation and use
  - IIPC Architecture
Slide 24: Useful links
- NWA website: http://nwa.nb.no
- NWA Access Tool: http://nwatoolset.sourceforge.net
- Heritrix: http://archive-crawler.sourceforge.net
- IIPC: http://netpreserve.org
- Demo: http://nwa.nb.no/demo
Slide 25: Demo