Title: File formats and registries
1File formats and registries
- Manfred Thaller, University at Cologne
- October 2nd, 2007
2- PART I Formats and Registries
- EXERCISE I Evaluate some
- PART II Formats in PLANETS
- EXERCISE II A bit of modelling
3An image
4An image
6 rows 5 columns
55 rows 6 columns
6An image
1 yellow 0 red
7An image
1 violet 0 green
8An image
Store 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
9An image
Store 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1
10An image
Store 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
(Uncompressed)
11An image
Store 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1
(Compressed) Run Length Encoded
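The step from the uncompressed stream to the run-length encoded one can be sketched as follows. This is a minimal illustration, assuming the encoding on the slide is a flat list of (run length, value) pairs; the function names are made up for this example.

```python
from itertools import groupby


def rle_encode(pixels):
    """Encode a pixel sequence as alternating run length, value numbers."""
    encoded = []
    for value, run in groupby(pixels):
        encoded.extend([len(list(run)), value])
    return encoded


def rle_decode(encoded):
    """Expand alternating run length, value numbers back into pixels."""
    pixels = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        pixels.extend([value] * count)
    return pixels


# The 6-row, 5-column image from the slides (1 = yellow, 0 = red).
PIXELS = [
    1, 1, 1, 1, 1,
    1, 0, 0, 0, 1,
    1, 1, 0, 1, 1,
    1, 1, 0, 1, 1,
    1, 1, 0, 1, 1,
    1, 1, 1, 1, 1,
]

ENCODED = rle_encode(PIXELS)
print(ENCODED)
print(rle_decode(ENCODED) == PIXELS)
```

Thirty stored values shrink to eighteen, and decoding restores the original stream exactly, which is what makes this a lossless storage choice rather than a change to the image.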
12An image
Store
SetSize 5 by 6
SetBackgroundColor Yellow
SetForegroundColor Red
SetLetterHeight 4
MoveTo 3,5
DrawLetter T
13An image
6 rows 5 columns
1 yellow 0 red
Uncompressed
14An image
dimensions 1 yellow 0 red Uncompressed
15An image
dimensions photometric interpretation Uncompressed
16An image
dimensions photometric interpretation compression
17An image
<basic information>
<rendering information>
<storage information>
18An image
<basic information> (implicit / explicit)
<rendering information> (implicit / explicit)
<storage information> (implicit / explicit)
and the data?
19An image
Data either as data stream:
1,1,1,1,1,1, 0,0,0,1,1,1, 0,1,1,1,1,0, 1,1,1,1,0,1, 1,1,1,1,1,1
20An image
Data either as data stream or as processing instructions:
SetSize 5 by 6
SetBackgroundColor Yellow
SetForegroundColor Red
SetLetterHeight 4
MoveTo 3,5
DrawLetter T
21File format
<basic information>
<rendering information>
<storage information>
<data>
22File format
<basic information> What to do?
<rendering information>
<storage information>
<data>
23File format
<basic information> What to do?
<rendering information> How to do it?
<storage information>
<data>
24File format
<basic information> What to do?
<rendering information> How to do it?
<storage information> How to move it from persistent to deployed form?
<data>
25File format
<basic information> What to do?
<rendering information> How to do it?
<storage information> How to move it from persistent to deployed form?
<data> What to deploy?
27File format
<basic information> Mandatory
<rendering information> Useful
<storage information> Historical
<data> Mandatory
28File format
A deterministic specification of how the properties
of a digital object can reversibly be converted
into a linear bytestream (bitstream).
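To make "deterministic" and "reversible" concrete, here is a minimal sketch of a toy format for the two-colour image used earlier. The layout (magic bytes "TOY1", field order, field widths) is entirely hypothetical and was invented for this illustration; it only mirrors the four parts named on the previous slides.

```python
import struct

# Hypothetical toy layout, not a real format:
#   4 bytes  magic                     (basic information)
#   1 byte   rows, 1 byte columns      (basic information)
#   6 bytes  colour name for value 1   (rendering information)
#   6 bytes  colour name for value 0   (rendering information)
#   1 byte   compression flag, 0=none  (storage information)
#   rows * columns bytes of pixels     (data)
HEADER = struct.Struct(">4sBB6s6sB")
MAGIC = b"TOY1"


def serialize(rows, cols, colour_1, colour_0, pixels):
    """Convert the object's properties into one linear byte stream."""
    header = HEADER.pack(
        MAGIC, rows, cols,
        colour_1.encode("ascii"), colour_0.encode("ascii"),
        0,  # uncompressed
    )
    return header + bytes(pixels)


def deserialize(blob):
    """Recover the same properties from the byte stream."""
    magic, rows, cols, c1, c0, compression = HEADER.unpack_from(blob)
    if magic != MAGIC or compression != 0:
        raise ValueError("not an uncompressed TOY1 stream")
    pixels = list(blob[HEADER.size:HEADER.size + rows * cols])
    # struct pads short strings with NUL bytes; strip them again.
    return (rows, cols,
            c1.rstrip(b"\0").decode("ascii"),
            c0.rstrip(b"\0").decode("ascii"),
            pixels)


IMAGE = (6, 5, "yellow", "red",
         [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
BLOB = serialize(*IMAGE)
print(deserialize(BLOB) == IMAGE)
```

The same input always yields the same bytes, and the round trip returns the original properties; a format that fails either test does not meet the definition above.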
29File format TIFF
30File format PDF
1 0 obj << /Type /Page /Parent 281 0 R
/Resources 2 0 R /Contents 3 0 R
/StructParents 2 /MediaBox [0 0 612 792]
/CropBox [0 0 612 792] /Rotate 0 >> endobj
31File format PDF
2 0 obj << /ProcSet [/PDF /Text] /Font <<
/TT2 292 0 R /TT4 288 0 R >> /ExtGState << /GS1
300 0 R >> /ColorSpace << /Cs6 289 0 R >> >>
endobj
32File format PDF
3 0 obj << /Length 4605 /Filter /FlateDecode >>
stream HWÛÛÈWô4jRÀø Í"
(²5j"¹lräýoêÖ-j udTÙÂfPnìþgtÓE²ÝÕ˽âä
uª2iltltv úÓk9Q¼xXTP /i²½Ö)ÔÏöªÙHltCµ
and about 4000 bytes more øL"ÈÛÆJYØÂmjÝ
qõϺºÕ²ôÒÛº.u-kP0 4øTxMltéï¼9uôøòLiØoT
Ö mÇÿlÕºvéUËLmgu1Åëu5l3O
òËTîü7?ìNdh endstream endobj
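The unreadable bytes between "stream" and "endstream" are not damage: /FlateDecode marks the stream as compressed with the zlib/deflate method. A minimal sketch of that round trip follows; the content stream used here is an illustrative stand-in, not the page from the slide.

```python
import zlib

# Illustrative page content (assumption): standard PDF text operators
# repeated a few times so that compression has something to work on.
content = b"BT /TT2 12 Tf 72 720 Td (File formats and registries) Tj ET\n" * 20

compressed = zlib.compress(content)    # what a writer would store in the stream
restored = zlib.decompress(compressed)  # what a reader has to do before rendering

print(len(content), len(compressed))
print(compressed[:16])                 # opaque bytes, like those on the slide
print(restored == content)
```

Until the filter has been applied in reverse, nothing in the stream can be interpreted, which is why the storage information has to be readable before the data is.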
33File format XML (here SVG)
<?xml version="1.0" encoding="UTF-16"?>
<svg:svg width="800" height="1000" xmlns:svg="http://www.w3.org ...
<svg:rect x="0" y="0" width="800" height="1000" fill="white" />
<svg:g transform="translate(-140,0)">
<svg:line x1="600" y1="20" x2="500" y2="20" stroke="black"
<svg:text x="600" y="28.8" font-size="6" fill="black"
</svg:g>
<svg:g transform="translate(-140,0)">
<svg:text x="500" y="24.4">
<svg:tspan font-size="4" fill="black">Leiste</svg:tspan>
</svg:text>
</svg:g>
<svg:defs>
<svg:g id="halbeSaeuleLeiste0">
34File format XML (here SVG)
35File format XML (ETH column XML)
<?xml version="1.0" encoding="UTF-8"?>
<Autor name="Vitruv">
<Ordnung name="Ionisch" THz="" THn="" MH="" TBz="" TBn=""
<Element name="Gebaelk" original="" THz="" THn="" MH=""
<Element name="Gesims" original="corona" THz="" THn="" MH=""
<Element name="Leiste" original="" THz="" THn="" MH="0.03"
<Element name="Kyma" original="sima" THz="" THn=""
<Element name="Leiste" original="" THz="" THn="" MH="0.017"
<Element name="Kyma_reversa" original="cymatium" THz=""
<Element name="Platte" original="corona" THz="" THn=""
<Element name="Leiste" original="" THz="" THn="" MH="0.017"
<Element name="Kyma_reversa" original="cymatium" THz=""
<hElement name="Band" typ="1" dx="0.048" r="0.019"/>
<hElement name="Band" typ="1" dx="0.048" r="0.019"/>
</Element>
36Files and Preservation
- Bit rot.
- Obsolescence of software.
37Bit rot
An image file before ...
38Bit rot
... and after one byte is changed.
39Bit rot
Undetectable by software.
... and after one byte is changed.
40Bit rot
Processing dictionary Payload
41Bit rot
One byte is damaged, one byte cannot be displayed
correctly.
42Bit rot
One byte is damaged, ten bytes cannot be
displayed correctly.
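The difference between the two cases can be reproduced with the small image from Part I. This is a sketch under assumptions: run-length encoding stands in for whatever compression the damaged files on the slides used, and the choice of which value to damage is arbitrary.

```python
def rle_decode(encoded):
    """Expand alternating run length, value numbers back into pixels."""
    pixels = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        pixels.extend([value] * count)
    return pixels


def damaged_positions(original, corrupted):
    """Count differing positions, including any change in overall length."""
    differing = sum(a != b for a, b in zip(original, corrupted))
    return differing + abs(len(original) - len(corrupted))


RAW = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
RLE = [6, 1, 3, 0, 3, 1, 1, 0, 4, 1, 1, 0, 4, 1, 1, 0, 7, 1]

# Damage one stored value in the uncompressed stream: one pixel changes.
raw_hit = RAW.copy()
raw_hit[7] = 1
raw_damage = damaged_positions(RAW, raw_hit)

# Damage one stored value in the compressed stream: a run length changes,
# so every later pixel shifts position.
rle_hit = RLE.copy()
rle_hit[2] = 5
rle_damage = damaged_positions(RAW, rle_decode(rle_hit))

print(raw_damage, rle_damage)
```

In the compressed stream each stored value controls how the following values are interpreted, so a single damaged value spreads across the decoded image.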
43Result: http://www.cflr.beniculturali.it/Progetti/Fixit.php
Università di Roma La Sapienza Dipartimento
Informatica
Centro Fotoriproduzione Legatoria e Restauro
Franco Liberati liberati_at_di.uniroma1.it
Paolo Buonora paolo.buonora_at_beniculturali.it
www.cflr.beniculturali.it
44Paolo on JPEG
JPEG2000 more robust against bit rot than TIFF.
45Paolo on JPEG
JPEG2000 more robust against bit rot than
TIFF. So, to stimulate more empiricism
46Notice / caveat
This problem does not go away, just because you
employ fancy language! It applies to digital
objects encoded in XML, just as well as to
humble files.
47IBM Digital Library, ca. 1997
Kn A 269_1802.idx Kn A 269_1802.log TDI00001.jpg
TDI00002.jpg TDI00005.jpg TDI00006.jpg
TDI00007.jpg TDI00008.jpg TDI00010.jpg
TDI00011.jpg TDI00013.jpg TDI00015.jpg
TDI00017.jpg TDI00020.jpg TDI00023.jpg
TDI00024.jpg TDI00028.jpg TDI00032.jpg
TDI00036.jpg TDI00037.jpg TDI00039.jpg
TDI00040.jpg TDI00042.jpg TDI00047.jpg
TDI00048.jpg TDI00050.jpg TDI00053.jpg
TDI00054.jpg TDI00056.jpg TDI00057.jpg
TDI00060.jpg TDI00062.jpg TDI00063.jpg
TDI00064.jpg
48IBM Digital Library, ca. 1997
Shelfmark Kn A 269/1802
Scanner-Op Omniscan 6000
Count C2
Timestamp 4.1.2000
xdo TDI00005.jpg
end
end
Shelfmark Kn A 269/1802
Scanner-Op Omniscan 6000
Count F3
Timestamp 4.1.2000
xdo TDI00006.jpg
end
end
49IBM Digital Library / saving
KnA115_646A1r.jpg
KnA115_646A2.jpg KnA115_646A3.jpg
KnA115_646A4.jpg KnA115_646B1.jpg
KnA115_646B2.jpg KnA115_646B3.jpg
KnA115_646B4.jpg KnA115_646C1.jpg
KnA115_646C2.jpg KnA115_646C3.jpg
KnA115_646C4.jpg KnA115_646D1.jpg
KnA115_646D2.jpg KnA115_646D3.jpg
KnA115_646D4.jpg KnA115_646E1.jpg
KnA115_646E2.jpg KnA115_646E3.jpg
KnA115_646E4.jpg KnA115_646F1.jpg
KnA115_646F2.jpg KnA115_646G4v.jpg
KnA115_646.xml
50IBM Digital Library / saving
<body>
<div><page>TB Lsp <image seqno="KnA115_646A1r.jpg" nativeno="A1r" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646A2.jpg" nativeno="A2" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646A3.jpg" nativeno="A3" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646A4.jpg" nativeno="A4" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646B1.jpg" nativeno="B1" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646B2.jpg" nativeno="B2" timestamp="14.02.1997"/></page>
...
51Obsolescence
- Software able to read does not exist any more.
- Format specification lost.
- Implied algorithm lost.
- Required object lost.
52Recommended formats: text
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
53Recommended formats: bitmap / raster image
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
54Recommended formats: vector graphics
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
55Recommended formats: audio
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
56Recommended formats: video
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
57Recommended formats: database
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
58Recommended formats: 3D (virtual reality)
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
59What kind of file is this?
- Two ways to identify a file:
- (a) By extension.
- (b) By internal characteristics (magic number, signature).
60What kind of file is this?
- Two ways to identify a file:
- (a) By extension.
- Each file ending with .doc is an MS Word document.
61What kind of file is this?
- Two ways to identify a file:
- (b) By internal characteristics (magic number, signature).
A TIFF file begins with:
Bytes 0-1: The byte order used within the file. Legal values are "II" (4949.H) / "MM" (4D4D.H).
Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file.
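Identification by signature can be sketched in a few lines. The check below covers only the four header bytes quoted above; the function name is made up, and a real identification tool would inspect far more than this.

```python
import struct


def tiff_byte_order(header):
    """Return 'little' or 'big' if the first four bytes look like TIFF, else None."""
    if len(header) < 4:
        return None
    if header[:2] == b"II":      # 49 49 hex: least significant byte first
        order, fmt = "little", "<H"
    elif header[:2] == b"MM":    # 4D 4D hex: most significant byte first
        order, fmt = "big", ">H"
    else:
        return None
    (magic,) = struct.unpack(fmt, header[2:4])
    return order if magic == 42 else None


print(tiff_byte_order(b"II\x2a\x00"))   # little-endian TIFF header
print(tiff_byte_order(b"MM\x00\x2a"))   # big-endian TIFF header
print(tiff_byte_order(b"%PDF"))         # something else
```

Note that the number 42 has to be read in the byte order announced by bytes 0-1, so the two checks cannot be done independently.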
62What kind of file is this?
- Necessity to identify files led to two developments:
- Clever software inspects files to decide how to process them.
- MIME Types.
- FORMAT registries.
63What kind of file is this?
The following 4 transparencies are a quotation
from http://hul.harvard.edu/gdfr (see below).
64Why Do We Need a Registry?
- Repository functions are performed on a format-specific basis
- Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented
- Interchange requires mutual agreement of format syntax and semantics
65Potential Use Cases
- Identification
- I have a digital object; what format is it?
- Validation
- I have an object purportedly of format F; is it?
- Transformation
- I have an object of format F, but need G; how can I produce it?
- Characterization
- I have an object of format F; what are its significant properties?
- Risk assessment
- I have an object of format F; is it at risk of obsolescence?
- Delivery
- I have an object of format F; how can I render it?
66Repository Format Dependencies Using the OAIS
Reference Model
67What's Wrong with MIME Types?
- Insufficient depth of detail
- No requirements regarding syntax and semantic description
- No requirement for complete disclosure, especially of proprietary formats
- Insufficient granularity
- Both tiled RGB GeoTIFF with LZW and striped bi-tonal TIFF-FX with Group 4 are typed as image/tiff
- All of PDF 1.0-1.4, PDF/X-1, X-2, X-3, and PDF/A are typed as application/pdf
- These variants might require radically different workflows
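The granularity problem is easy to observe with Python's standard mimetypes module, which maps file names to MIME types by extension. The file names below are hypothetical; they merely stand for the variants listed above.

```python
import mimetypes


def guess_mime(filename):
    """Return the MIME type guessed from the file name alone, or None."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


# Hypothetical file names standing in for very different variants.
for name in ["geotiff_rgb_lzw_tiled.tif", "fax_bitonal_group4.tiff",
             "report_pdf_1_0.pdf", "archive_copy_pdfa.pdf"]:
    print(name, "->", guess_mime(name))
```

Both TIFF variants collapse into one label and both PDF variants into another, so a workflow that depends on the variant cannot be selected from the MIME type alone.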
68File format registries - URLs
PRONOM: http://www.nationalarchives.gov.uk/pronom/ (does not only rely on extensions)
Global Digital Format Registry: http://hul.harvard.edu/gdfr (predominantly project description)
FileExt: http://filext.com (predominantly links to software)
69Exercise I A few experiments
70Exercise I A few experiments
71Exercise I A few experiments
72Exercise I A few experiments
73Exercises
- Use the Shotgun to systematically shoot and corrupt all files on your USB stick.
- Use at least the settings: size=1, count=1; size=512, count=1; size=1, count=more; size=512, count=more.
- Try to damage each file a number of times, say 5.
- Report on their robustness against this treatment.
- Check whether your experience supports the Florida recommendations.
- Any idea why the formats behave as they do?
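For readers without access to the Shotgun tool, the kind of damage it inflicts can be approximated as below. This is a guess at the behaviour implied by the "size" and "count" settings, not the tool's actual implementation; the function name and the seed parameter are assumptions made for repeatable experiments.

```python
import random


def shoot(data, size, count, seed=None):
    """Return a copy of data with `count` blocks of `size` bytes overwritten
    by random bytes at random positions."""
    rng = random.Random(seed)
    damaged = bytearray(data)
    if not damaged:
        return bytes(damaged)
    for _ in range(count):
        start = rng.randrange(len(damaged))
        end = min(start + size, len(damaged))  # do not grow the file
        damaged[start:end] = bytes(rng.randrange(256) for _ in range(end - start))
    return bytes(damaged)


original = bytes(range(256)) * 8              # stand-in for a file's contents
for size, count in [(1, 1), (512, 1), (1, 5), (512, 5)]:
    hit = shoot(original, size, count, seed=2007)
    changed = sum(a != b for a, b in zip(original, hit))
    print(f"size={size} count={count}: {changed} bytes changed")
```

Working on a copy of the bytes rather than on the file itself keeps the original intact, so each format can be damaged several times from the same starting point.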
74PART II Formats in PLANETS: File characteristics
75PART II Formats in PLANETS: File characteristics
- Based on two formal languages:
- eXtensible Characterisation Extraction Language (XCEL)
- eXtensible Characterisation Description Language (XCDL)
7699
2017
7793
png
78- <XCELDocument...> ...
- <formatDescription>....
- <symbol identifier="ID01_I01_I01_S02" originalName="height" interpretation="uint32">
- <range>
- <startposition xsi:type="sequential"></startposition>
- <length xsi:type="fixed">4</length></range>
- <name>height</name>
- </symbol>
- <symbol identifier="ID01_I01_I01_S04" originalName="colourType">
- <range>
- <startposition xsi:type="sequential"></startposition>
- <length xsi:type="fixed">1</length></range>
- <valueInterpretation>
- <valueLabel>greyscale</valueLabel>
- <value>0</value>...
- <name>imageType</name>
- </symbol>
- <symbol identifier="ID01_I01_I01_S05" originalName="compressionMethod">
- <range>
<xcdl>
<object id="o1" >
<normData id="nd1" > ... </normData>
<property id="p1" source="raw" cat="descr" >
<name>compression</name>
<valueSet id="i_i1_s6" >
<rawValue>0 </rawValue>
<labValue>...</labValue>
<dataRef ind="normAll" />
<propRel/>
</valueSet>
</property>
<property id="p2" source="raw" cat="descr" >
<name>height</name>
<valueSet id="i_i1_s3" >
<rawValue>0 0 1 ad </rawValue>
<labValue>
<val>429</val>
<type>uint32</type>
</labValue>
<dataRef ind="normAll" />
<propRel/>
</valueSet>
</property>
<property id="p3" source="raw" cat="descr" >
<name>imageType</name> .....
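What the XCEL symbols describe (fixed-length fields read sequentially) and what the XCDL properties record (raw value plus interpreted value) can be imitated by hand for the PNG header. This sketch is not the PLANETS extractor; it hard-codes the PNG IHDR layout, and the sample width of 640 is an invented value. Only the height bytes 00 00 01 ad and the resulting 429 come from the slide.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
COLOUR_TYPES = {0: "greyscale", 2: "truecolour", 3: "indexed-colour",
                4: "greyscale with alpha", 6: "truecolour with alpha"}


def read_ihdr(blob):
    """Extract a few properties from the IHDR chunk at the start of a PNG."""
    if blob[:8] != PNG_SIGNATURE:
        raise ValueError("PNG signature missing")
    length, chunk_type = struct.unpack(">I4s", blob[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("IHDR chunk expected first")
    # Sequential fixed-length fields: 4, 4, 1, 1, 1, 1, 1 bytes, big-endian.
    width, height, _depth, colour, compression, _filter, _interlace = \
        struct.unpack(">IIBBBBB", blob[16:29])
    return {"width": width, "height": height,
            "imageType": COLOUR_TYPES.get(colour, "unknown"),
            "compression": compression}


# The raw value from the XCDL example, interpreted as uint32.
HEIGHT = int.from_bytes(bytes.fromhex("000001ad"), "big")

# A hand-built header for demonstration (checksum and image data omitted).
SAMPLE = PNG_SIGNATURE + struct.pack(">I4sIIBBBBB", 13, b"IHDR",
                                     640, HEIGHT, 8, 2, 0, 0, 0)
print(HEIGHT)
print(read_ihdr(SAMPLE))
```

The point of XCEL is that such field layouts are written down as data, so one generic extractor can serve many formats instead of one hand-written reader per format.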
80 Confession
81 Confession
Computer science does not really know what
information is.
82 Confession Claim
Computer science does not really know what
information is. It is pretty good at
representing and processing it, though.
83 Representations migrations
III 3 ? ???
Four representations of the idea / concept / model "three"
84 Representations migrations
I divided by III = 1 / 3 = 0.3333?
I divided by III = 1 / 3 = 0.3 periodic
Some ideas are handled more precisely by some thinkers than others.
85 Representations migrations
48-bit images on 24-bit and on 48-bit graphics cards.
Some data is processed more adequately by some equipment than others.
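A small numeric sketch of the same point, assuming three channels of 16 bits each for a 48-bit pixel and three channels of 8 bits each for a 24-bit one. The pixel values and the conversion rules are illustrative choices, not taken from any particular graphics card or driver.

```python
def to_8_bit(sample_16):
    """Keep only the most significant byte of a 16-bit sample."""
    return sample_16 >> 8


def to_16_bit(sample_8):
    """Stretch an 8-bit sample back to 16 bits (0xFF becomes 0xFFFF)."""
    return sample_8 * 257


pixel_48 = (0x1234, 0xABCD, 0xFFFF)                 # arbitrary example pixel
pixel_24 = tuple(to_8_bit(s) for s in pixel_48)     # what 24-bit equipment shows
restored = tuple(to_16_bit(s) for s in pixel_24)    # migrating back again

print([hex(s) for s in pixel_24])
print([hex(s) for s in restored])
print(restored == pixel_48)
```

The low-order byte of each channel is gone after the first conversion and cannot be recovered, even though the 24-bit display of both versions looks identical.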
86 Representations migrations
A model for information before and after a
migration must therefore potentially represent
all information there is, irrespective of the
possibility to process it in a given environment.
87 XCEL / XCDL
Languages are being processed; development
focus currently: dynamic handling of
format-specific algorithms.
88 XCEL / XCDL image model (1)
A pixel cube
Each pixel: MSB (channel 1), LSB (channel 1), ... MSB (channel n), LSB (channel n),
MSB (aux 1), LSB (aux 1), ... MSB (aux m), LSB (aux m)
89 XCEL / XCDL image model (2)
A pixel cube
Accompanied by rendering info plus deployment info.
90 XCEL / XCDL image model - example
<property id="p4" source="raw" cat="descr" >
<name>imageType</name>
<valueSet id="i_i1_s5" >
<rawValue>2</rawValue>
<labValue>
<val>truecolour</val>
<type>fixedLabel</type>
</labValue>
<dataRef ind="normAll" />
<propRel/>
</valueSet>
</property>
91 XCEL / XCDL text model
A text (<object>) is composed of:
- data (<normData>) plus
- interpretations of data according to the underlying format specification (properties: <property>).
92 XCEL / XCDL text model - example
This is a text
<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
<property>
<name>fontsize</name>
<rawVal>
<val>00 18</val>
<type>unsignedInt8</type>
</rawVal>
<dataRef>
<!-- property refers to discrete part of reference data -->
<ref id="1" start="0" end="3"/>
<ref id="1" start="10" end="12"/>
</dataRef>
</property>
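The reference data and the two references can be resolved by hand as a check. This sketch assumes that the hexadecimal values are ASCII codes and that "start" and "end" are zero-based byte offsets with "end" inclusive; neither assumption is stated on the slide.

```python
# The reference data from the example, written as hexadecimal byte values.
ref_data = bytes.fromhex("54 68 69 73 20 69 73 20 61 20 74 65 78 74")


def referenced(data, start, end):
    """Return the part of the data a <ref> points to.

    Assumption: offsets are zero-based and `end` is inclusive.
    """
    return data[start:end + 1].decode("ascii")


print(ref_data.decode("ascii"))
print(referenced(ref_data, 0, 3))
print(referenced(ref_data, 10, 12))
```

Keeping the data once and attaching properties to byte ranges lets several properties refer to overlapping parts of the same text without duplicating it.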
93Exercise II Abstract modelling
Group 1: maps
Group 2: music
Group 3: Excel sheets
Group 4: books (ever heard of FRBR?)
94File format XML (here SVG)
95File format XML (here SVG)
96File format XML (ETH column XML)
97- <Abstract Definition of Visual Content>
- <Transformation Layer>
- <Display System>
98- Content specific Languages
- XSLT, C, Java, PHP ...
- VRML, X3D, AutoCAD, 3DS, Maya, SoftImage ...