Title: Presentaci
1- Experiences on Migration of Data in Digitization
Projects - Julián Bescós
Presentation for the ERPANET Workshop "Workflow in Digital Preservation", Budapest, 13-15 October 2004
2- The Migration Issue
- Our Experience
- Migration Tasks
- Best Practices for Preservation
- Planning and Schedule
3- Migration is the set of tasks needed to achieve the periodic transfer of digital materials from one hardware/software configuration to another
- Purpose
- Long-term preservation of the digital information created and stored using digital technology
- Allow broad access
- Retrieve, display and use
- Origin
- New devices, processes and software replace the methods used to record, store and access
- New standards
- Enhancement of service
4- Technology obsolescence
- Hardware
- More powerful computers and higher-density storage
- Elements for upgrading are not available (increase of storage, memory, etc.)
- Basic software
- Operating systems
- Database managers
- Media
- Lifetime is rarely the constraining factor for digital preservation
- Obsolescence of old storage media as newer and better media become available on the market
- Obsolescence of the access software
- Access on new platforms and media
- Programs are not available in the long term
- Changes in metadata and in image formats
- New functions of the software
5- In practice it is a combination of
- Technology obsolescence
- New functionalities of the software
- Derived from information and communication technology
- Daily work on digitisation, storage and access requiring
- Higher-density storage
- Faster computers
- It is a consequence of
- The digital world of information and communication technology is still relatively young and immature
6- Beginning in 1988 with the design and development of the Information System for the Archivo de Indias in Seville
- Computerization of 66 archives and libraries of different kinds and sizes in Spain and abroad
- Digitization of more than 20 million pages of ancient documents
- Installation of more than 320 workstations
- Development of our own products, ArchiDOC-ArchiGES, for archives
- With a team in the areas of consulting, management, development, installation, training and maintenance of systems for archives
Archivo General de Indias, Sevilla
Access Room in 1992
7- MAIN PROJECTS WITH DIGITIZATION
- Archivo General de Indias, Sevilla
- Archivo General de Simancas
- Archivo Histórico Nacional, Madrid
- Archivo Histórico Nacional - Sección Nobleza, Toledo
- Archivo Histórico Nacional - Sección Guerra Civil, Salamanca
- Archivo de la Corona de Aragón, Barcelona
- Archivo General de Navarra
- Archivo del Reino de Valencia
- Archivo del Reino de Mallorca
- Biblioteca Sancho el Sabio, Vitoria
- Archivo Virtual de la Corona de Aragón (with images from the ACA and AHN)
- Archivo Eclesiástico de Poblet
- Archivo Histórico Universidad de Salamanca
- Archivo Histórico de la Universidad de Santiago de Compostela
- Archivo Histórico de la Universidad de Oviedo
- Archivo General de la Nación, Colombia
- Archivo Histórico Ultramarino, Lisboa
- Archivo del Nacionalismo de la Fundación Sabino Arana, Vizcaya
- Biblioteca Valenciana
- Archivo del Ilustre Colegio Notarial de Granada
- Real Academia Española (Diccionarios Histórico)
- Diccionario Biográfico Real Academia Historia
- Archivo General Militar, Segovia
- Archivo General Militar, Ávila
- Instituto de Historia y Cultura Militar
- Archivo General de la Marina, El Viso del Marqués, Ciudad Real
- Archivo Histórico Provincial de Murcia
- Sistema de Información del Archivo, Biblioteca, Fototeca y Videoteca de Cruz Roja Española
- Biblioteca de la Fundación Francisco de Zabalburu, Madrid
- Biblioteca Parlamento Vasco
- Archivo-Biblioteca de la Diputación de Cáceres
- Digitization of 11 newspapers for 11 Basque institutions (retrospective and current press)
- Archivo Municipal de Castellón de la Plana
- Archivo Histórico del Excmo. Ayuntamiento de La Laguna, Tenerife
- Archivo del Ayuntamiento Oviedo
- Archivo del Komintern, Moscow, and its replica in 6 National Archives, LOC and Open Society Archives
Archivo General Militar, Segovia
Archivo General de Navarra
Zabalburu Library
10- 1. Projects from 1988 to 1992
- Computer System for Archivo General de Indias
- The Archive contains 86 million pages of original manuscripts related to the Spanish Administration in America (15th-19th centuries), in 43,000 bundles
- The Computer System integrated
- A textual database with 400,000 descriptive entries
- A digital image archive with 11 million digital images in 1995
- A module for user and document management: control of user management, consultation room, document movements and statistics
- Access by researchers and archivists from 50 workstations
- About 30% of present consultations are on the screen (1 million pages/year)
- About 35% of printing is digital (85,000/year)
- Access system in service since 1992
11- Architecture
- The database for descriptions in SQL/400 keeps the hierarchical structure of fonds
- Standalone digitization workstations with flatbed scanners and optical disk drives under DOS
- Image servers based on PCs with optical disk drives
- Access from PCs under OS/2
- Image Acquisition and Storage
- 11 million images digitized in gray levels with high fidelity to the original manuscripts
- Low-cost workstations
- Legibility enhancements applied by users at consultation time
- Non-expert digitization operators
- Digitization: 100 dpi, 16 gray levels
- 1 page/minute, 15 workstations, 2 shifts, 4 years
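As a rough consistency check on the throughput figures above, the following sketch works out how many scanning hours per year each shift would need in order to reach the 11 million images quoted on the slide. The variable names are illustrative, and the calculation assumes every workstation sustains the stated rate of 1 page per minute.

```python
# Figures taken from the slide; variable names are illustrative.
total_images = 11_000_000
pages_per_hour = 60          # 1 page/minute
workstations = 15
shifts = 2
years = 4

# Hours each shift must scan per year, on every workstation, to reach the total.
hours_per_shift_year = total_images / (pages_per_hour * workstations * shifts * years)
print(round(hours_per_shift_year))
```

The result is on the order of 1,500 hours per shift per year, which is in the range of a normal working year, so the stated rate and staffing are consistent with the stated total.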
12- Image Acquisition and Storage
- Images stored on WORM optical disks
- The structure at the low level (bundle/documents) was also reflected in directories on the WORM disks
- Access to the images on a disk was done through the call number of the document
- Image path as metadata: image names carried information about the document call number and the page number
- No standard compression was available for gray-level images. Images were DPCM-compressed losslessly in software
- Compressed image size for A4: 300-350 KB
- Storage for 1 bundle: 2,000 x 350 KB = 700 MB
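The slide does not describe the DPCM scheme that was actually used. The following is only a minimal sketch of the general idea of lossless DPCM, assuming the simplest possible predictor (the previous pixel in the row). Each pixel is replaced by its difference from the prediction, and the small residuals would then be entropy-coded, a step not shown here. The function names are hypothetical.

```python
def dpcm_encode(row):
    """Replace each pixel by its difference from the previous pixel.

    The first pixel is predicted as 0, so it is stored unchanged.
    """
    previous = 0
    residuals = []
    for pixel in row:
        residuals.append(pixel - previous)
        previous = pixel
    return residuals


def dpcm_decode(residuals):
    """Rebuild the original pixels by accumulating the residuals."""
    previous = 0
    row = []
    for residual in residuals:
        previous += residual
        row.append(previous)
    return row


# One row of 16-level (4-bit) gray pixels, as in the 1988-1992 images.
row = [7, 7, 8, 8, 8, 9, 15, 15]
residuals = dpcm_encode(row)
restored = dpcm_decode(residuals)
```

Because neighbouring pixels in a manuscript image are usually similar, most residuals are close to zero, which is what makes the later coding step effective, and decoding reproduces the row exactly.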
13- Image Acquisition and Storage
- Media for storage of digital images
Bundles   Media                            Start year   Number of disks   Images
1,729     IBM optical disks (200 MB)       1989         6,916             3,458,000
3,732     Plasmon optical disks (940 MB)   1991         3,732             7,464,000
50        CD-R (640 MB)                    1996                           100,000
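The figures in the media list above can be cross-checked against totals quoted elsewhere in the presentation (about 10,600 optical disks and 11 million images). The short sketch below does that arithmetic; the data structure is illustrative, and the CD-R disk count is left as None because the slide does not give it.

```python
# (media, bundles, disks, images); None marks a value the slide does not give.
media = [
    ("IBM optical disk (200 MB)", 1_729, 6_916, 3_458_000),
    ("Plasmon optical disk (940 MB)", 3_732, 3_732, 7_464_000),
    ("CD-R (640 MB)", 50, None, 100_000),
]

# Total number of optical disks (the CD-R row is excluded by name).
optical_disks = sum(disks for name, _, disks, _ in media if "optical" in name)

# Total number of images across all three media.
total_images = sum(images for *_, images in media)

# Average number of images per bundle for each medium.
images_per_bundle = {name: images // bundles for name, bundles, _, images in media}

print(optical_disks, total_images, images_per_bundle)
```

Every medium works out to 2,000 images per bundle, matching the per-bundle storage estimate of 2,000 images given earlier.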
17- Example of blotch removal to be applied by the user
19- Example of reduction of ink bleeding through the paper
20- Archivo General de Indias
Digitization Room of Archivo de Indias in 1989
21- Archivo General de Indias
Shelf with optical disks
22- 2. Projects from 1992 to 1996
- Database server under OS/2 and DB2
- Access and digitization workstations on PCs with OS/2
- The relational database keeps the hierarchical structure of the documentation
- Images stored on CD-Rs
- Directory structures and image names changed
- Metadata in binary control files: each image has information about signature (call number), position in the hierarchical structure, page number, notes
- Image compression: JPEG
- Metadata in images: resolution, date, dimensions
23- Example metadata in Binary Control File
- The file keeps information about the hierarchical structure
- It maintains the relationship between each image file and its position in the document
- The control file and its metadata can be imported into the database
24- Migration of Images of Archivo de Indias from 10,600 optical disks to 6,000 CD-Rs
- The images of a bundle are stored on 1 or 2 CD-Rs
- Reading of optical disks through the network
- No direct connectivity between optical disks and Windows NT
- Main Operation Tasks
- Decompression of the DPCM format
- Compression to JPEG format
- Temporary storage on magnetic disk
- All images of the bundle are copied to CD-R
- Verification of images by reading
- 6,000 CD-Rs, and 6,000 CD-Rs as backup copy
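The copy and "verification by reading" steps listed above can be sketched as follows. This is not the original migration software. It is a minimal illustration, assuming each bundle is a directory of image files and that verification means re-reading both copies and comparing SHA-256 digests. The DPCM-to-JPEG conversion step is left out, and all function, directory and file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Read the whole file back and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_bundle(source_dir: Path, target_dir: Path) -> list[str]:
    """Copy every file of one bundle; return the names that failed verification."""
    target_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for source in sorted(source_dir.iterdir()):
        if not source.is_file():
            continue
        target = target_dir / source.name
        shutil.copyfile(source, target)
        # Verification by reading: re-read both copies and compare digests.
        if sha256_of(source) != sha256_of(target):
            failed.append(source.name)
    return failed


# Small demonstration with two dummy "images" in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "bundle_0001"
    bundle.mkdir()
    (bundle / "page_0001.jpg").write_bytes(b"dummy image 1")
    (bundle / "page_0002.jpg").write_bytes(b"dummy image 2")
    staging = Path(tmp) / "cdr_staging" / "bundle_0001"
    failures = copy_bundle(bundle, staging)
    copied = sorted(p.name for p in staging.iterdir())
```

Files that end up in the failure list are the ones that would have to be recovered from an alternative copy or digitised again.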
25- Migration of Images from 6,916 WORM IBM disks to CD-Rs
- Typically 4 WORM disks (200 MB each) on 1 or 2 CD-Rs
IBM Disks to CD-R (diagram)
- Microchannel IBM PS/2: file system driver for OS/2, OS/2 1.3 and LAN Server, Token Ring Microchannel card, IBM optical drives
- Token Ring network
- Pentium PC: Windows NT, Token Ring PCI card, 3 GB disk, SCSI interface, CD-R drives
26- Migration of Images from 3,732 WORM Plasmon disks to CD-Rs
- 1 WORM Plasmon disk (940 MB) on 1 or 2 CD-Rs
Plasmon Disks to CD-R (diagram)
- PC with i486: SCSI interface, file system driver for OS/2, OS/2 3.0, Ethernet card, Plasmon drives
- Ethernet network with hub
- Pentium PC: Windows NT, Token Ring PCI card, 3 GB disk, SCSI interface, CD-R drives
27- Migration of Images of Archivo de Indias from 10,600 optical disks to 6,000 CD-Rs
- Requirements of personnel and time
- 3 operators for 4 months
- Similar migration schemes with fewer images
- Library Sancho el Sabio (Vitoria): 1,000,000 images
- University of Salamanca: 700,000 images
- Archivo General Militar, Segovia: 200,000 images
- Archivo del Monasterio Poblet: 100,000 images
28- 3. Projects from 1996 to now
- Oracle database
- Access and digitization workstations on PCs with Windows NT through Windows XP
- Capturing images also using standard programs and their metadata
- Images stored on magnetic disks. CD-ROMs as backup
- Metadata in database: scanning operator, date of creation, signature, path, size in bytes. Data about control of the information
- Metadata in image: resolution, dimensions. Data for presentation on computers and for printing
- Image quality
- 200-300 dpi, 256 gray levels
- Color images
- Standard formats
- TIFF, CCITT Group IV
- JPEG, PDF
29- Example metadata in database
Modes of Image Display
Management of Image Access
30- Example metadata XML File
- Same functionality as the binary control file
- Standard: virtually any program can import these metadata
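The actual XML layout is not shown in the transcript. The sketch below only illustrates how per-image metadata of the kind listed for the control files (signature, position in the hierarchy, page number, notes) could be written to and read back from XML with Python's standard library. All element and attribute names, and the sample values, are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_record(call_number, level_path, page, file_name, notes=""):
    """Build one <image> element; element names are illustrative only."""
    image = ET.Element("image", attrib={"file": file_name})
    ET.SubElement(image, "callNumber").text = call_number
    ET.SubElement(image, "position").text = "/".join(level_path)
    ET.SubElement(image, "page").text = str(page)
    ET.SubElement(image, "notes").text = notes
    return image


# Write one record and read it back, as an importing program would.
record = build_record("EXAMPLE,123", ["Fonds", "Series", "Bundle 123"], 4, "0004.jpg")
xml_text = ET.tostring(record, encoding="unicode")

parsed = ET.fromstring(xml_text)
page = int(parsed.findtext("page"))
position = parsed.findtext("position").split("/")
```

Because the result is plain XML text, it can be read by any XML parser, which is the portability advantage over the binary control file.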
31- Migration of Archivo de Indias from CD-R to magnetic disk in 2000
- Project for online access and Internet
- Just a copy. Images are already JPEG-compressed
- 10 RAID cabinets of 350 GB each (8 disks x 50 GB)
- 1 operator was required for 1 month for the copy from a CD-ROM tower to magnetic disks
- Transfer rate from different media
Media                  Transfer rate   Image       Bundle
IBM optical disk       60 KB/s         6 seconds   4 hours
Plasmon optical disk   100 KB/s        3 seconds   1 hour
CD-R 16x               2.5 MB/s        <1 second   5 minutes
Magnetic disk          80 MB/s                     1 minute
- Similar Migrations
- Sancho el Sabio Library (Vitoria): 1 million images
- Zabalburu Library: 700,000 images
- Military Archives: 500,000 images
- Archivo General de Navarra: 600,000 images
- Komintern Archives (Moscow): 1 million images
- ...
Komintern Archives, Moscow
32- Archivo General de Indias
33- Archivo General de Indias
34- Analysis of origin and destination data models
- Equivalence between the fields in the origin and destination models
- New versions include new metadata not available before
- Development of migration software
- Testing with a limited number of objects
- Display of information in a destination record
- Application of migration to all data
- Verification of results
- Correction of errors
- Sometimes some images cannot be copied and must be recovered from alternative media or even digitised again
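The sequence above (field equivalence, a trial on a limited number of objects, the full run, then verification and correction of errors) can be sketched as follows. This is a minimal illustration, not the software used in these projects. The field names, the new metadata field and the sample records are all hypothetical.

```python
# Equivalence between origin and destination fields (hypothetical names).
FIELD_MAP = {"signature": "call_number", "title": "title", "date": "date"}

# Metadata that exists only in the destination model starts out empty.
NEW_FIELD_DEFAULTS = {"scanning_operator": None}


def migrate_record(old):
    """Map one origin record to the destination model."""
    new = {dest: old[src] for src, dest in FIELD_MAP.items()}  # KeyError if a field is missing
    new.update(NEW_FIELD_DEFAULTS)
    return new


def migrate(records, limit=None):
    """Migrate up to `limit` records; collect failures for later correction."""
    migrated, errors = [], []
    for index, old in enumerate(records[:limit]):
        try:
            migrated.append(migrate_record(old))
        except KeyError as missing:
            errors.append((index, f"missing field {missing}"))
    return migrated, errors


records = [
    {"signature": "A-1", "title": "Letter", "date": "1600"},
    {"signature": "A-2", "title": "Map"},  # incomplete origin record
]

trial, trial_errors = migrate(records, limit=1)   # test with a limited number of objects
all_migrated, errors = migrate(records)           # application to all data
```

Records that end up in the error list are the ones that need manual correction before a second pass.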
Komintern Archives, Moscow
35- Preparation of the system for migration
- Hardware and basic software
- Magnetic disk storage for images
- PCs with appropriate OS and DB manager
- Development of software (1 programmer, 2-3 weeks of work)
- Software development for migration
- Testing of migration of data
- Operation (usually less than 1 week)
- Significant operator effort when removable media are involved
Komintern Archives, Moscow
36- General principles
- Based on PCs and mainstream commercial equipment
- Key hardware provided by first-class IT companies
- Database managers in widespread use
- Consultations with institutions undertaking projects
- Based on standard elements and formats, official or de facto, like TIFF, JPEG, XML, etc.
- Modular, allowing a progressive installation and easy update of elements
- Selection of software
- Functionalities
- Number of installations
- Maintenance
- Provided by an IT company established in the sector
- Key factors
- Server, operating system, database manager
- Backup policies
37- Digitization
- Capture systems
- Robust flatbed scanners (A3)
- Overhead (zenithal) scanners. Digital cameras, with limitations
- Use of standard compression formats: JPEG, CCITT Group IV
- Ensure that digital images will allow a broad range of future uses
- Capture the highest-quality image technically possible and economically feasible for large-scale production
- Capture the informational content / physical appearance
- Fast and easy correction of errors
- Criteria for holdings selection
- Value
- Condition
- Use
- Acceptability of the digital object
- Access aids
38- Storage
- Media of wide use and low cost
- Magnetic disk for online image service (especially in high demand)
- Disks with redundancy
- Backup on high-capacity tapes (10/20 GB)
- One or two units available as hot swap
- It allows migration without operator intervention
- In a distributed network they may need to be stored online in multiple locations
- CD-R or DVD as backup for offline access in case of system failure
- In general there is little experience in storing massive quantities of culturally valuable materials
- Backup and Recovery
- Use industry-standard backup and recovery procedures
- Periodic backup to magnetic tape
- A copy held on site for near-term recovery
- A copy stored off-site for disaster recovery
39- Traditional approach of Computer Science
- Migration of media
- Refreshing digital information by copying it from medium to medium
- Conversion of files to a reduced number of standard formats that new programs can interpret
- Migration of technology platform
- Server and PCs
- Peripherals
- Capture devices and CD-R writers
- Operating system and database manager
- Migration of the digitising and access software
- Maintenance of software on the new platform
- New software versions for digitising and access
40- Planning for migration is difficult due to
- The limited experience
- We cannot predict when media, software and hardware will become obsolete
- No single strategy applies to all formats of digital information
- It varies across application environments, for different formats of digital materials and for preserving different degrees of computation, display and retrieval
- It requires a unique new solution for each new format and process
- Automatic conversion is only partially possible
- In general there are no firm plans for migration, other than to stay up to date with current technologies by migrating the content
- Usually there is urgency involved in migration due to the obsolescence of software and hardware
41- Schedule
- New releases of software, databases, etc. can be expected every 2-3 years, with minor updates more often
- Migration from one storage medium to another every 4-5 years, if not online
- Migration to new hardware and software occurs less frequently but can be expected every 5-10 years
42- Best practices for Digital Preservation
- Mainstream commercial equipment
- Use of standard formats
- Storage in magnetic disk with redundancy
- Backup policies
- Maintenance
- Periodical Update Policy
- Hardware
- Media
- Basic software
- Application software