Preservation Metadata Extraction and Collection : Tools and Techniques - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Preservation Metadata Extraction and Collection : Tools and Techniques


1
Preservation Metadata Extraction and Collection
Tools and Techniques
  • Mat Black
  • National Library of New Zealand
  • Te Puna Matauranga o Aotearoa

2
How to get what you need to keep what you've got
3
The stack
  • Fixity generation
  • Virus checking
  • Format identification
  • Format validation
  • Environmental metadata collection
  • Format specific metadata extraction

4
Fixity
Get it early and get it right
  • Common fixity types
  • Hashing algorithms (MD5, SHA1)
  • Digital signatures
  • File size?
  • Use multiple fixity algorithms.
  • Find out the legal implications.

5
Fixity values for what?
  • File
  • Bitstream
  • Compound (all the files in an object)
  • Metadata
  • The whole lot (files, filenames, metadata)
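
One possible way to get a compound fixity value, sketched here under assumptions: hash each file, then hash a sorted manifest of "digest plus relative filename" lines, so that a change to any file's content or name changes the compound value. The manifest layout and the names `file_digest` and `compound_digest` are hypothetical choices for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1"):
    """Hex digest of a single file's content."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def compound_digest(root, algorithm="sha1"):
    """Digest over every file under root, covering both names and contents."""
    root = Path(root)
    names = sorted(p.relative_to(root).as_posix()
                   for p in root.rglob("*") if p.is_file())
    hasher = hashlib.new(algorithm)
    for name in names:
        # One manifest line per file: "<digest>  <relative name>".
        line = f"{file_digest(root / name, algorithm)}  {name}\n"
        hasher.update(line.encode("utf-8"))
    return hasher.hexdigest()
```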

6
Virus checking
  • Virus check datetime
  • Results, including false positives and any
    warnings (Word macros, etc.)
  • The virus checker name and version
  • The virus pattern file name and version
  • The virus engine name and version
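
The items above could be kept together as one record per check. The sketch below is only a possible shape for that record; the class and field names are assumptions, and it does not call any real virus scanner.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class VirusCheckRecord:
    """One virus-check event, holding the items listed on the slide."""
    checked_at: str                 # datetime of the check, e.g. ISO 8601
    outcome: str                    # e.g. "clean", "infected", "false positive"
    checker_name: str
    checker_version: str
    pattern_file_name: str
    pattern_file_version: str
    engine_name: str
    engine_version: str
    warnings: list = field(default_factory=list)  # e.g. macro warnings


def to_metadata(record):
    """Flatten a record into a plain dict ready to be stored."""
    return asdict(record)
```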

7
Format identification
File / Bitstream / Complex
  • Methods of format identification
  • File name or extension
  • File type/creator codes (Old Macs)
  • Magic numbers
  • Brute force file parsing (for every known format:
    try to parse, catch the error)
  • http://en.wikipedia.org/wiki/File_format
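
Identification by magic number can be sketched as a lookup of the file's leading bytes against known signatures. The table below covers only a handful of well-known formats and the function names are illustrative; a real signature registry such as the one behind DROID is far larger and also handles offsets and trailing signatures.

```python
# Leading byte signatures ("magic numbers") for a few common formats.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"PK\x03\x04", "ZIP"),
    (b"OggS", "OGG"),
]


def identify_by_magic(header):
    """Return a format name if the header starts with a known signature."""
    for signature, name in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return name
    return None


def identify_file(path):
    """Identify a file from its first bytes, ignoring its name and extension."""
    with open(path, "rb") as handle:
        return identify_by_magic(handle.read(16))
```

Because the extension is never consulted, a renamed file is still identified by its content, or reported as unknown (`None`) rather than guessed.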

8
A sound file opened in an image viewer
9
What file format is this?
  • ?
  • And the winner is..

10
Subzero by Pain Receptor
  • They describe their music as sounding like
    "falling down the stairs carrying leeches and
    bottles"
  • http://www.myspace.com/painreceptor

11
Sub format identification
  • Embedded bitstreams
  • XML Base64 encoded octet streams
  • Microsoft Structured Storage
  • Archives
  • ZIP, TAR, ARC
  • Encapsulation/Container formats
  • OGG, AVI, MIME
  • CODECs
  • DV, DivX, Indeo, Cinepak, MS MPEG-4
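
For the archive case, a rough sketch of looking inside a container: the Python standard library can list the members of ZIP and TAR files so that each member can then be identified in turn. The function name `list_container_members` is hypothetical, and the other container types on this slide (ARC, OGG, AVI, MIME, Structured Storage) would need their own readers.

```python
import tarfile
import zipfile


def list_container_members(path):
    """Return (container type, member name, size) for each file in an archive.

    Returns an empty list when the file is neither a ZIP nor a TAR archive.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return [("zip", info.filename, info.file_size)
                    for info in archive.infolist() if not info.is_dir()]
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            return [("tar", member.name, member.size)
                    for member in archive.getmembers() if member.isfile()]
    return []
```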

12
Available tools
  • File extensions (google it)
  • Magic utilities (google it)
  • JHOVE http://hul.harvard.edu/jhove/
  • DROID http://www.nationalarchives.gov.uk/pronom/
  • Build your own! (Java, PERL, C, C)
  • If you have a fixed format list
  • You use a proprietary format.

13
Format validation
  • Types of validation
  • Pattern comparison
  • Parsing
  • Rendering

14
Available tools
  • JHOVE
  • NLNZ Extract tool (sort of)
  • The application used to create the file
  • Anything that opens a file and can throw an
    error.
  • Parsing tools
  • E.g. XML Parsers, XML Schema, PERL Modules, Java
    Classes.
  • Rendering tools
  • E.g. LibTIFF, ImageMagick, Microsoft Office
    (wrapped), OpenOffice, PERL Modules, Java classes,
    etc.
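
Validation by parsing amounts to "open it with something that can throw an error". A minimal sketch using the XML parser in the Python standard library follows; the function name is an assumption, and the check covers well-formedness only, not validity against an XML Schema.

```python
import xml.etree.ElementTree as ET


def validate_xml_by_parsing(path):
    """Return (True, None) if the file parses as well-formed XML,
    otherwise (False, reason) with the parser's error message."""
    try:
        ET.parse(path)
    except ET.ParseError as error:
        return False, str(error)
    return True, None
```

Keeping the parser's message alongside the pass or fail result gives the repository something to store as validation metadata.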

15
Things to keep in mind
  • Test it till it breaks.
  • Define your requirements, break them, then define
    them again. (Repeat if required).
  • Not all tools are created equal.
  • Not all tools obey the rules.
  • Some rules are made to be broken.

16
Environmental metadata
  • Consider the native environment of your content.
  • Is there metadata that you need that only exists
    in a digital object's native environment?
  • Structure and relationships.
  • File system attributes
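
File system attributes are the easiest environmental metadata to lose, since they vanish once a file is copied out of its original location. The sketch below records a few of them with `os.stat`; the function name and the choice of fields are assumptions, and which attributes are available varies by operating system.

```python
import os
import stat
from datetime import datetime, timezone


def file_system_attributes(path):
    """Capture basic file system attributes before the file is moved."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        # The parent directory records where the file sat in its structure.
        "parent": os.path.dirname(os.path.abspath(path)),
        "size": info.st_size,
        "modified": modified.isoformat(),
        "permissions": stat.filemode(info.st_mode),
    }
```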

17
Format specific metadata extraction
aka format characterisation
  • Available metadata will vary depending on the
    format.
  • You will probably need format specific schemas.
  • The types of metadata that can be extracted
  • Preservation
  • Descriptive
  • Structural
  • Administrative
  • Rights
  • Technical

18
The big question.
  • Why would I extract the metadata now and store
    it in a database if I can just come back and
    extract it again later when I need it?

19
Available tools
  • NLNZ Metadata Extract Tool
  • http://www.natlib.govt.nz/en/whatsnew/4initiatives.html#extraction
  • JHOVE
  • http://hul.harvard.edu/jhove/
  • Anything you can wrap
  • LibTIFF, ImageMagick, PERL Modules, Java classes
    etc
  • Build your own!
  • And make sure you open source it

20
What tools should I use?
  • Use as many tools as you need to.
  • Keep the workflow configurable
  • Preferably by content or format type.
  • Allow for multiple tools to be used.
  • Allow for new tools to be added later.
  • Compare metadata from multiple tools.
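
A configurable workflow of this kind might be sketched as a table that maps each format to a list of tools, plus a comparison step that reports where the tools disagree. Everything here is hypothetical: `tools_by_format` holds caller-supplied functions that take a path and return a dict of metadata, standing in for wrapped tools such as JHOVE or the NLNZ extractor.

```python
def run_tools(path, format_name, tools_by_format):
    """Run every tool configured for a format; return metadata per tool name."""
    results = {}
    for tool_name, tool in tools_by_format.get(format_name, []):
        results[tool_name] = tool(path)
    return results


def disagreements(results):
    """Return the metadata keys on which the tools reported different values."""
    conflicts = {}
    keys = set().union(*(metadata.keys() for metadata in results.values()))
    for key in sorted(keys):
        values = {name: metadata[key]
                  for name, metadata in results.items() if key in metadata}
        distinct = []
        for value in values.values():
            if value not in distinct:
                distinct.append(value)
        if len(distinct) > 1:
            conflicts[key] = values
    return conflicts
```

Adding a tool later means appending one `(name, function)` pair to the list for a format; nothing else in the workflow changes.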

21
The workflow
  1. Fixity generation
  2. Virus checking
  3. Format identification
  4. Format validation
  5. Environmental metadata extraction
  6. Format specific metadata extraction
  7. Store in repository

22
Paranoid workflow
  • Fixity generation
  • Virus checking
  • Fixity check
  • Format identification
  • Fixity check
  • Format validation
  • Fixity check
  • Environmental metadata extraction
  • Fixity check
  • Format specific metadata extraction
  • Fixity check
  • Virus check
  • Store in repository
  • Fixity check
  • Virus check
  • Fixity check
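
The paranoid pattern, a fixity check after every step, can be sketched as a loop that records a baseline digest and compares against it each time a tool has touched the file. The step functions are caller-supplied placeholders, and `run_paranoid` and `FixityError` are illustrative names.

```python
import hashlib


class FixityError(Exception):
    """Raised when a file's digest no longer matches its baseline."""


def sha1_of(path):
    hasher = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def run_paranoid(path, steps):
    """Run each (name, function) step, re-checking fixity after every one."""
    baseline = sha1_of(path)
    log = [("Fixity generation", baseline)]
    for name, step in steps:
        log.append((name, step(path)))
        # A tool that opens a file may also alter it, so verify each time.
        if sha1_of(path) != baseline:
            raise FixityError(f"content changed during step: {name}")
        log.append(("Fixity check", "ok"))
    return log
```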

23
Paranoid access flow.
  • Retrieve content from repository
  • Fixity check
  • Virus check
  • Send content to consumer
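
The access flow can be sketched the same way: nothing is handed to the consumer until both checks pass. Here `stored_digest` is assumed to be the SHA-1 value recorded at ingest, and `virus_check` is a caller-supplied placeholder that returns True for clean content; neither name comes from a real repository API.

```python
import hashlib


def retrieve_for_consumer(path, stored_digest, virus_check):
    """Return the content only if the fixity check and virus check both pass."""
    with open(path, "rb") as handle:
        content = handle.read()
    if hashlib.sha1(content).hexdigest() != stored_digest:
        raise ValueError("fixity check failed; content not sent")
    if not virus_check(content):
        raise ValueError("virus check failed; content not sent")
    return content
```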

24
Global Digital Format Registry
  • Format identification components
  • Format validation components
  • Metadata extraction components
  • Format registry
  • At risk content alerts
  • http://hul.harvard.edu/gdfr/

25
Questions?