Title: PreScan Preservation Scanner towards automating the ingestion process
1 PreScan (Preservation Scanner): towards automating
the ingestion process
- FORTH-ICS
- Presentation by Yannis Tzitzikas, Yannis
Marketakis
2 Outline
- Motivation and Background
- The architecture of PreScan
- Scanner
- Metadata Extractor
- Repository Manager
- Controller
- Time Performance
- Related Works and Systems
- Future Extensions
- Software Releases
3 Motivation
- The creation and maintenance of metadata is a
laborious task that does not always pay off
immediately.
- There is a need for tools that automate as much
as possible the creation and curation of
preservation metadata.
- PreScan is a tool (developed by FORTH-ICS during
the third year of the project) for automating the
ingestion phase.
- It can bind together automatically extracted
embedded metadata with manually provided
metadata, and dependency management services.
- In addition it offers some features for keeping
the metadata repository up-to-date.
4 Background
- Metadata can be stored either
- internally, i.e. in the same file as the data
(these are called embedded), or
- externally, i.e. in a separate file/repository
(these are called detached).
- Both approaches have advantages and
disadvantages.
- One benefit of embedded metadata is that they
are transferred with the data, so accessing and
manipulating them is straightforward. However,
embedded metadata can create redundancies, and
this approach does not allow holding and
managing all metadata together.
- On the other hand, detached metadata are stored
in a dedicated repository. This approach has less
redundancy, supports efficient metadata search,
and allows efficient manipulation, e.g. bulk
metadata updates. However, the way metadata are
linked to data should be treated with care, as
inconsistencies may arise.
5 PreScan (Preservation Scanner)
- Components
- scanner: scans the file system
- metadata extractor: extracts the embedded
metadata of the scanned files
- repository manager: stores and manages
these metadata
- controller: controls the entire process and
metadata life-cycle
6 Component: Scanner
- It acts like the scanner of an antivirus program
- The user defines
- the folders that should be scanned
- where metadata should be stored
- It also allows the re-scanning of projects
- The re-scan is aware of
- file movements/additions
- human-provided metadata
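The scanning step described on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not PreScan's actual code: the names `ScanProject`, `metadata_folder` and `scan_project` are hypothetical, chosen only to mirror the two things the user defines (the folders to scan and where metadata should be stored).

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, List


@dataclass
class ScanProject:
    """Hypothetical project definition: what to scan and where metadata go."""
    folders: List[str]        # folders selected by the user for scanning
    metadata_folder: str      # where the metadata records should be stored


def scan_project(project: ScanProject) -> Iterator[Path]:
    """Yield every file found under the folders of the project."""
    for folder in project.folders:
        # os.walk visits the folder and all of its sub-folders.
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in sorted(filenames):
                yield Path(dirpath) / name
```

A caller would iterate over `scan_project(...)` and hand each path to the extraction step; re-scanning would run the same walk again and compare the result with the previous one.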
7 Component: Extractor
- It extracts the embedded metadata of the scanned
files.
- Currently it relies on JHOVE, although more
extractors could be plugged in.
(Figure: examples of extracted metadata)
(Table: some of the supported file-types and metadata)
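The idea that further extractors could be plugged in can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `MetadataExtractor` interface and the two stand-in extractors are hypothetical, and the code does not call JHOVE or reproduce its API.

```python
from pathlib import Path
from typing import Dict, List, Protocol


class MetadataExtractor(Protocol):
    """Hypothetical plug-in interface for an extractor."""

    def supports(self, path: Path) -> bool: ...

    def extract(self, path: Path) -> Dict[str, str]: ...


class BasicFileExtractor:
    """Stand-in extractor (not JHOVE): reports only size and suffix."""

    def supports(self, path: Path) -> bool:
        return path.is_file()

    def extract(self, path: Path) -> Dict[str, str]:
        return {
            "size_bytes": str(path.stat().st_size),
            "suffix": path.suffix.lower(),
        }


class TextLineCountExtractor:
    """Second stand-in extractor, applicable to .txt files only."""

    def supports(self, path: Path) -> bool:
        return path.suffix.lower() == ".txt"

    def extract(self, path: Path) -> Dict[str, str]:
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            return {"line_count": str(sum(1 for _ in handle))}


def extract_all(path: Path, extractors: List[MetadataExtractor]) -> Dict[str, str]:
    """Merge the output of every plugged-in extractor that supports the file."""
    record: Dict[str, str] = {}
    for extractor in extractors:
        if extractor.supports(path):
            record.update(extractor.extract(path))
    return record
```

Adding support for a new format then amounts to appending another object with the same two methods to the list passed to `extract_all`.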
8 Component: Controller
- It controls the entire process and metadata
life-cycle.
- It offers a re-scan option that ensures that the
manually provided metadata will not be lost after
the next scan.
- To this end it tries to identify the new files,
the files that were deleted and the files that
changed location since the previous scan.
- The identified file movements are shown to the
user in order to confirm the change (and thus the
association of the manually provided metadata
with the updated extracted metadata).
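One way such a re-scan could tell new, deleted and moved files apart is to compare content hashes between the previous and the current scan. The slides do not say how PreScan actually identifies file movements, so the hash-based matching below is an assumption, and `content_hash`, `diff_scans` and `apply_confirmed_moves` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, NamedTuple, Tuple


def content_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ScanDiff(NamedTuple):
    added: List[str]
    deleted: List[str]
    moved: List[Tuple[str, str]]  # (old_path, new_path)


def diff_scans(previous: Dict[str, str], current: Dict[str, str]) -> ScanDiff:
    """Compare two scans, each given as a mapping path -> content hash."""
    gone = sorted(set(previous) - set(current))
    appeared = sorted(set(current) - set(previous))
    # Index the newly appeared paths by their content hash.
    by_hash: Dict[str, List[str]] = {}
    for path in appeared:
        by_hash.setdefault(current[path], []).append(path)
    moved: List[Tuple[str, str]] = []
    deleted: List[str] = []
    for old_path in gone:
        candidates = by_hash.get(previous[old_path])
        if candidates:
            # Same content at a new path: treat as a candidate move.
            moved.append((old_path, candidates.pop(0)))
        else:
            deleted.append(old_path)
    # Whatever was not matched as a move is a genuinely new file.
    added = sorted(p for paths in by_hash.values() for p in paths)
    return ScanDiff(added, deleted, moved)


def apply_confirmed_moves(
    manual: Dict[str, Dict[str, str]], confirmed: List[Tuple[str, str]]
) -> Dict[str, Dict[str, str]]:
    """Re-attach manually provided metadata to the new paths of moved files."""
    updated = dict(manual)
    for old_path, new_path in confirmed:
        if old_path in updated:
            updated[new_path] = updated.pop(old_path)
    return updated
```

In this sketch the `moved` list is only a proposal: as on the slide, the user would confirm each pair before `apply_confirmed_moves` carries the manual metadata over to the new location.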
9 Component: Repository Manager (more on the
available choices)
- The Repository Manager is responsible for
storing, querying and updating the metadata
records.
- The metadata record of a file includes both the
extracted and the human-provided metadata.
- There is more than one choice regarding where
these metadata are stored. The options that are
currently supported are listed below (they are
not mutually exclusive):
- (SF) For each scanned file, its metadata record is
created and stored in a Specific Folder specified
by the user.
- (OF) For each scanned file, its metadata record is
created and stored in the Original Folder (the
same folder as the scanned file).
- (KB) The contents of the metadata records of the
scanned files are stored in a Semantic Web-based
Knowledge Base.
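The difference between the SF and OF options can be made concrete with a short sketch that computes where a record file would be written. The record naming (`.metadata.xml` appended to the file name) and the mirroring of sub-folders under the Specific Folder are assumptions for illustration only; the slides do not describe PreScan's actual layout. The KB option is not covered here.

```python
from pathlib import Path
from typing import List, Optional

RECORD_SUFFIX = ".metadata.xml"  # hypothetical naming convention


def record_path_of(scanned_file: Path) -> Path:
    """OF: the record is stored in the same folder as the scanned file."""
    return scanned_file.with_name(scanned_file.name + RECORD_SUFFIX)


def record_path_sf(scanned_file: Path, scan_root: Path, specific_folder: Path) -> Path:
    """SF: the record is stored under a folder chosen by the user.

    The sub-folder structure below scan_root is mirrored (an assumption)
    so that files with the same name in different folders do not collide.
    """
    relative = scanned_file.relative_to(scan_root)
    return specific_folder / relative.with_name(relative.name + RECORD_SUFFIX)


def record_paths(
    scanned_file: Path,
    scan_root: Path,
    specific_folder: Optional[Path] = None,
    use_original_folder: bool = False,
) -> List[Path]:
    """The options are not mutually exclusive, so several paths may result."""
    paths: List[Path] = []
    if specific_folder is not None:
        paths.append(record_path_sf(scanned_file, scan_root, specific_folder))
    if use_original_folder:
        paths.append(record_path_of(scanned_file))
    return paths
```

With both options enabled, one scanned file yields two record locations, which matches the note that the storage choices can be combined.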
10 Component: Repository Manager (more on the KB
choice)
- Architecture of Ontologies and Metadata
The ontology of GapManager
11 Component: Repository Manager (more on the KB
choice, cont.)
12 Component: Repository Manager (RDF Exporter)
(Figure: example of exported RDF; property labels shown:
P1 is identified by, P2 has type, P3 has note,
P4 has time span, P43 has dimension, P90 has value,
P91 has unit, S2B was source for, S11B was output of)
13 PreScan Time Performance
- It takes about 10 hours for 100 thousand files
(i.e. about 0.36 seconds per file on average)
14 Synopsis
- PreScan is quite similar in spirit to the
crawlers of Web search engines. In our case we
scan the file system, we extract the embedded
metadata and build an index. The difference in
our case is that we need to support (a) more
advanced extraction services, (b) manual addition
of metadata, (c) more expressive representation
frameworks for keeping and exploiting the
metadata (i.e. Semantic Web languages), (d) rescans
that do not start from scratch but exploit the
previous status of the index, and (e) associations
with external sources (e.g. registries).
- In brief, PreScan can help automate the
ingestion process for file system-based archives.
15 Related Works and Systems
16 Future Steps
- Extensions
- Pluggable additional Metadata Extractors (for
recognizing more formats)
- Key Extension
- Flexible generation of CIDOC CRM Digital instances
17 Software Releases
- Alpha Release (June 2009)
- Repository Manager (SF, OF) with XML output
- Known bugs
- The progress bar sometimes does not progress
(although scanning progresses)
- For some files the extractor crashes (this is a
bug of the extractor, i.e. JHOVE)
- Beta Release (September 2009)
- Generation of instances of CIDOC CRM Digital
- That would allow browsing the KB repository
through the GUI of GapManager
- URL
- http://wiki.casparpreserves.eu/bin/view/Main/PreScan
18 Developers and Contact points
- Main Developers
- Yannis Marketakis, Makis Tzanakis
- Contact person
- Yannis Tzitzikas
- PreScan Web Page
- http://www.ics.forth.gr/PreScan
19 Thanks for your attention