Preservation Metadata Extraction and Collection: Tools and Techniques

1
Preservation Metadata Extraction and Collection
Tools and Techniques
  • Mat Black
  • National Library of New Zealand
  • Te Puna Matauranga o Aotearoa

2
How to get what you need to keep what you've got
3
The stack
  • Fixity generation
  • Virus checking
  • Format identification
  • Format validation
  • Environmental metadata collection
  • Format specific metadata extraction

4
Fixity
Get it early and get it right
  • Common fixity types
      • Hashing algorithms (MD5, SHA1)
      • Digital signatures
      • File size?
  • Use multiple fixity algorithms.
  • Find out the legal implications.
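
As a minimal sketch of the advice to use more than one fixity algorithm, the function below computes MD5 and SHA1 digests plus the byte count in a single pass over a binary stream. The function name, the returned dictionary layout and the in-memory example are assumptions made for illustration, not part of any tool named in these slides.

```python
import hashlib
import io


def fixity_values(stream, algorithms=("md5", "sha1"), chunk_size=1 << 20):
    """Return the byte count and one hex digest per algorithm for a binary stream."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    # Read in chunks so a large file never has to fit in memory at once.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        size += len(chunk)
        for hasher in hashers.values():
            hasher.update(chunk)
    values = {name: hasher.hexdigest() for name, hasher in hashers.items()}
    values["size"] = size
    return values


# In-memory stand-in for a file; for real content use open(path, "rb").
example = fixity_values(io.BytesIO(b"abc"))
```

Because the function only needs a `read` method, the same code could serve a whole file or a bitstream extracted from one.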

5
Fixity values for what?
  • File
  • Bitstream
  • Compound (all the files in an object)
  • Metadata
  • The whole lot (files, filenames, metadata)
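
One possible way to get a single fixity value for a compound object is to hash a sorted manifest of file names and per-file digests. This is only a sketch of that idea: the manifest line format, the function names and the sample file names are assumptions, and a repository may well define its own convention.

```python
import hashlib


def file_digest(data):
    """SHA1 digest of one file's bytes."""
    return hashlib.sha1(data).hexdigest()


def compound_digest(files):
    """Digest a whole object: every file name plus every file's own digest."""
    manifest = hashlib.sha1()
    # Sort by name so the result does not depend on the order files arrive in.
    for name in sorted(files):
        line = f"{file_digest(files[name])}  {name}\n"
        manifest.update(line.encode("utf-8"))
    return manifest.hexdigest()


# Hypothetical two-file object held in memory for illustration.
digital_object = {"page1.tif": b"first page", "page2.tif": b"second page"}
object_digest = compound_digest(digital_object)
```

Since file names are part of the manifest, renaming a file changes the compound value even when every file-level digest stays the same.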

6
Virus checking
  • Virus check date and time
  • Results, including false positives and any
    warnings (Word macros, etc.)
  • The virus checker name and version
  • The virus pattern file name and version
  • The virus engine name and version
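
The items above could be captured in a small record like the one sketched here. The class, its field names and the sample values are all hypothetical; no real virus checker's output format is implied.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class VirusCheckRecord:
    """One virus check event, with the fields listed on the slide."""
    checker_name: str
    checker_version: str
    pattern_file_name: str
    pattern_file_version: str
    engine_name: str
    engine_version: str
    result: str  # for example "clean", "infected" or "warning"
    warnings: List[str] = field(default_factory=list)
    false_positive: bool = False
    # Timestamp of the check, recorded in UTC as an ISO 8601 string.
    checked_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Hypothetical values for illustration only.
record = VirusCheckRecord(
    checker_name="ExampleScanner",
    checker_version="1.0",
    pattern_file_name="example.pattern",
    pattern_file_version="2024-01",
    engine_name="ExampleEngine",
    engine_version="3.2",
    result="warning",
    warnings=["Word macro present"],
)
record_as_dict = asdict(record)
```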

7
Format identification
File / Bitstream / Complex
  • Methods of format identification
      • File name or extension
      • File type/creator codes (Old Macs)
      • Magic numbers
      • Brute force file parsing (for every format: try
        to parse, catch the error)
  • http://en.wikipedia.org/wiki/File_format
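
To illustrate the first and third methods side by side, the sketch below reports what the file extension claims and what the leading bytes (magic numbers) suggest. The signature table is deliberately tiny and the function names are assumptions; tools such as DROID rely on a far larger registry.

```python
# A few well-known leading byte signatures; a real table would be much longer.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (or a ZIP-based format)"),
]


def identify_by_magic(header):
    """Return a label for the first matching signature, or None if none match."""
    for signature, label in MAGIC_NUMBERS:
        if header.startswith(signature):
            return label
    return None


def identify(name, header):
    """Return what the extension claims and what the leading bytes suggest."""
    extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return {"extension": extension, "magic": identify_by_magic(header)}


# A file named like a JPEG whose content starts with a PDF signature.
mismatch = identify("holiday.jpg", b"%PDF-1.4 ...")
```

Keeping both answers, rather than trusting either one, makes a mismatch between name and content visible.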

8
A sound file opened in an image viewer
9
What file format is this?
  • ?
  • And the winner is...

10
Subzero by Pain Receptor
  • They describe their music as sounding like
    "falling down the stairs carrying leeches and
    bottles"
  • http://www.myspace.com/painreceptor

11
Sub format identification
  • Embedded bitstreams
      • XML Base64 encoded octet streams
      • Microsoft Structured Storage
  • Archives
      • ZIP, TAR, ARC
  • Encapsulation/Container formats
      • OGG, AVI, MIME
  • CODECs
      • DV, DivX, Indeo, Cinepak, MS MPEG-4
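
For the archive case, a rough sketch of looking inside a container is shown below: it opens a ZIP archive and returns the leading bytes of each member so that every embedded bitstream can be identified in turn. The function name and the in-memory sample archive are assumptions; only Python's standard `zipfile` module is used.

```python
import io
import zipfile


def list_embedded_bitstreams(container):
    """Return (member name, first 8 bytes) for each file inside a ZIP archive."""
    found = []
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as member:
                found.append((info.filename, member.read(8)))
    return found


# Build a tiny archive in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("report.pdf", b"%PDF-1.4 example")
    archive.writestr("scan.png", b"\x89PNG\r\n\x1a\n example")
members = list_embedded_bitstreams(buffer.getvalue())
```

The leading bytes of each member can then be passed to whatever format identification step is used for ordinary files.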

12
Available tools
  • File extensions (google it)
  • Magic utilities (google it)
  • JHOVE http://hul.harvard.edu/jhove/
  • DROID http://www.nationalarchives.gov.uk/pronom/
  • Build your own! (Java, PERL, C, C)
      • If you have a fixed format list
      • If you use a proprietary format

13
Format validation
  • Types of validation
      • Pattern comparison
      • Parsing
      • Rendering

14
Available tools
  • JHOVE
  • NLNZ Extract tool (sort of)
  • The application used to create the file
  • Anything that opens a file and can throw an
    error.
  • Parsing tools
      • E.g. XML Parsers, XML Schema, PERL Modules, Java
        Classes.
  • Rendering tools
      • E.g. LibTIFF, ImageMagick, Microsoft Office
        (wrapped), OpenOffice, PERL Modules, Java
        classes, etc.
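
The idea that anything which opens a file and can throw an error is a validator can be sketched with two standard-library parsers. The `PARSERS` table and both function names are assumptions for illustration; note that this checks only that the data parses (well-formedness), not that it conforms to a schema.

```python
import json
import xml.etree.ElementTree as ET

# Each entry is a callable that raises an exception on data it cannot parse.
PARSERS = {"XML": ET.fromstring, "JSON": json.loads}


def validate(data, format_name):
    """Return (ok, message) after letting the named parser try the data."""
    parser = PARSERS[format_name]
    try:
        parser(data)
    except Exception as error:  # any parser error counts as failed validation
        return False, f"{type(error).__name__}: {error}"
    return True, "parsed without error"


def formats_that_parse(data):
    """Brute force: try every known parser and keep the ones that succeed."""
    return [name for name in PARSERS if validate(data, name)[0]]


good_xml = validate(b"<record><title>Example</title></record>", "XML")
bad_xml = validate(b"<record><title>Example</record>", "XML")
```

The second function is the brute force approach applied to identification: whichever parsers accept the data are the candidate formats.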

15
Things to keep in mind
  • Test it till it breaks.
  • Define your requirements, break them, then define
    them again. (Repeat if required).
  • Not all tools are created equal.
  • Not all tools obey the rules.
  • Some rules are made to be broken.

16
Environmental metadata
  • Consider the native environment of your content.
  • Is there metadata that you need that only exists
    in a digital object's native environment?
  • Structure and relationships.
  • File system attributes
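
As one example of environmental metadata, the file system attributes of an object can be recorded before it leaves its native environment. The sketch below uses only `os.stat`; the function name and the selection of fields are assumptions, and other attributes (owner, creation time, extended attributes) vary by platform.

```python
import os
import stat
from datetime import datetime, timezone


def file_system_attributes(path):
    """Collect a few attributes that exist only while the object is on disk."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, timezone.utc)
    return {
        "name": os.path.basename(os.path.abspath(path)),
        "directory": os.path.dirname(os.path.abspath(path)),
        "size": info.st_size,
        "modified": modified.isoformat(),
        "permissions": stat.filemode(info.st_mode),
    }


# The current directory always exists, so it serves as a safe example.
attributes = file_system_attributes(os.curdir)
```

The directory path is kept alongside the name because structure and relationships are often implied only by where a file sits.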

17
Format specific metadata extraction
aka format characterisation
  • Available metadata will vary depending on the
    format.
  • You will probably need format specific schemas.
  • The types of metadata that can be extracted:
      • Preservation
      • Descriptive
      • Structural
      • Administrative
      • Rights
      • Technical
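
As a small illustration of format specific technical metadata, the sketch below reads the image dimensions from the IHDR chunk that opens every PNG file. PNG is chosen only because its header layout is simple; the function name and the returned field names are assumptions rather than any published schema.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png_metadata(data):
    """Read width, height, bit depth and colour type from a PNG's IHDR chunk."""
    if len(data) < 26 or not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # Bytes 8-11 hold the chunk length and bytes 12-15 the chunk type.
    if data[12:16] != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    width, height, bit_depth, colour_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "colour_type": colour_type,
    }


# Hand-built header (signature plus IHDR fields) so the sketch needs no file.
sample = PNG_SIGNATURE + struct.pack(
    ">I4sIIBBBBB", 13, b"IHDR", 640, 480, 8, 2, 0, 0, 0
)
png_metadata = extract_png_metadata(sample)
```

A different format would need a different parser and most likely a different set of output fields, which is why format specific schemas come up.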

18
The big question.
  • Why would I extract the metadata now and store
    it in a database if I can just come back and
    extract it again later when I need it?

19
Available tools
  • NLNZ Metadata Extract Tool
      • http://www.natlib.govt.nz/en/whatsnew/4initiatives.html#extraction
  • JHOVE
      • http://hul.harvard.edu/jhove/
  • Anything you can wrap
      • LibTIFF, ImageMagick, PERL Modules, Java classes,
        etc.
  • Build your own!
      • And make sure you open source it.

20
What tools should I use?
  • Use as many tools as you need to.
  • Keep the workflow configurable
      • Preferably by content or format type.
      • Allow for multiple tools to be used.
      • Allow for new tools to be added later.
  • Compare metadata from multiple tools.
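
One way to keep tools configurable per format and compare their output is sketched below. The two "tools" are hypothetical stand-ins written for this example, as are the configuration table and function names; in practice each entry would wrap a real extractor. The comparison assumes metadata values are simple hashable types such as strings and numbers.

```python
def run_extractors(data, extractors):
    """Run every configured tool and keep each tool's output separately."""
    return {name: tool(data) for name, tool in extractors.items()}


def compare_outputs(outputs):
    """Return the fields on which the tools disagree, with each tool's value."""
    fields = set()
    for metadata in outputs.values():
        fields.update(metadata)
    disagreements = {}
    for field in sorted(fields):
        values = {name: metadata.get(field) for name, metadata in outputs.items()}
        if len(set(values.values())) > 1:
            disagreements[field] = values
    return disagreements


# Hypothetical stand-in tools; real entries would wrap actual extractors.
def tool_a(data):
    return {"size": len(data), "format": "text"}


def tool_b(data):
    return {"size": len(data), "format": "unknown"}


# Configuration keyed by format type, so tools can be added or swapped later.
TOOLS_BY_FORMAT = {"text/plain": {"tool_a": tool_a, "tool_b": tool_b}}

outputs = run_extractors(b"hello", TOOLS_BY_FORMAT["text/plain"])
disagreements = compare_outputs(outputs)
```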

21
The workflow
  • Fixity generation
  • Virus checking
  • Format identification
  • Format validation
  • Environmental metadata extraction
  • Format specific metadata extraction
  • Store in repository

22
Paranoid workflow
  • Fixity generation
  • Virus checking
  • Fixity check
  • Format identification
  • Fixity check
  • Format validation
  • Fixity check
  • Environmental metadata extraction
  • Fixity check
  • Format specific metadata extraction
  • Fixity check
  • Virus check
  • Store in repository
  • Fixity check
  • Virus check
  • Fixity check
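
The pattern in this list, a fixity check after every step, can be sketched as a loop that recomputes the digest once each step returns. The step functions here are hypothetical stand-ins, and the content is held in a mutable `bytearray` purely so that a misbehaving step can be simulated; a real workflow would re-read the file from storage.

```python
import hashlib


class FixityError(Exception):
    """Raised when content no longer matches its original digest."""


def run_paranoid(content, steps):
    """Run each step, then confirm the content still matches its first digest."""
    expected = hashlib.sha1(content).hexdigest()  # fixity generation comes first
    results = {}
    for name, step in steps:
        results[name] = step(content)
        # Fixity check after every step, as in the list above.
        if hashlib.sha1(content).hexdigest() != expected:
            raise FixityError(f"content changed during step: {name}")
    return expected, results


content = bytearray(b"example content")
steps = [
    ("virus checking", lambda data: "clean"),        # hypothetical stand-in
    ("format identification", lambda data: "text"),  # hypothetical stand-in
]
digest, results = run_paranoid(content, steps)
```

Naming the step in the error makes it clear which tool altered the content, which is the point of checking so often.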

23
Paranoid access flow.
  • Retrieve content from repository
  • Fixity check
  • Virus check
  • Send content to consumer

24
Global Digital Format Registry
  • Format identification components
  • Format validation components
  • Metadata extraction components
  • Format registry
  • At risk content alerts
  • http://hul.harvard.edu/gdfr/

25
Questions?