File formats and registries - PowerPoint PPT Presentation

1
File formats and registries
  • Manfred Thaller, University at Cologne
  • October 2nd, 2007

2
  • PART I: Formats and Registries
  • EXERCISE I: Evaluate some
  • PART II: Formats in PLANETS
  • EXERCISE II: A bit of modelling

3
An image
4
An image
6 rows 5 columns
5
5 rows 6 columns
6
An image
1 = yellow, 0 = red
7
An image
1 = violet, 0 = green
8
An image
Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
9
An image
Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1
10
An image
Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1
(Uncompressed)
11
An image
Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1
(Compressed) Run Length Encoded
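The relation between the uncompressed stream and the run-length encoded one can be sketched in a few lines of Python. This is only an illustration, assuming the stored numbers are (run length, value) pairs; the function names are invented for this example.

```python
# The 30 pixel values of the 6 x 5 image, row by row (1 = yellow, 0 = red).
STREAM = [1, 1, 1, 1, 1,
          1, 0, 0, 0, 1,
          1, 1, 0, 1, 1,
          1, 1, 0, 1, 1,
          1, 1, 0, 1, 1,
          1, 1, 1, 1, 1]


def rle_encode(values):
    """Return a flat list of (run length, value) pairs."""
    encoded = []
    i = 0
    while i < len(values):
        run = 1
        # Extend the run while the next value repeats the current one.
        while i + run < len(values) and values[i + run] == values[i]:
            run += 1
        encoded += [run, values[i]]
        i += run
    return encoded


def rle_decode(encoded):
    """Expand a flat list of (run length, value) pairs again."""
    decoded = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        decoded.extend([value] * count)
    return decoded


compressed = rle_encode(STREAM)
print(compressed)  # [6, 1, 3, 0, 3, 1, 1, 0, 4, 1, 1, 0, 4, 1, 1, 0, 7, 1]
```

Decoding the 18 stored numbers gives back the original 30 values, so nothing is lost.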
12
An image
Store:
SetSize 5 by 6
SetBackgroundColor Yellow
SetForegroundColor Red
SetLetterHeight 4
MoveTo 3,5
DrawLetter T
13
An image
6 rows, 5 columns; 1 = yellow, 0 = red; uncompressed
14
An image
dimensions; 1 = yellow, 0 = red; uncompressed
15
An image
dimensions; photometric interpretation; uncompressed
16
An image
dimensions; photometric interpretation; compression
17
An image
<basic information> <rendering information>
<storage information>
18
An image
<basic information> (implicit / explicit)
<rendering information> (implicit / explicit)
<storage information> (implicit / explicit)
and the data?
19
An image
Data either as data stream 1,1,1,1,1,1, 0,0,0,1,
1,1, 0,1,1,1,1,0, 1,1,1,1,0,1, 1,1,1,1,1,1
20
An image
Data either as data stream or as processing instructions:
SetSize 5 by 6
SetBackgroundColor Yellow
SetForegroundColor Red
SetLetterHeight 4
MoveTo 3,5
DrawLetter T
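To make the equivalence of the two forms concrete, here is a small sketch of an interpreter for the instructions. It rests on assumptions that are not in the slides: MoveTo is read as the (column, row) of the foot of the letter, counted from 1, and the glyph for T is a three-pixel bar on top of a stem. Under these assumptions the instructions produce exactly the 30-value data stream.

```python
PROGRAM = """
SetSize 5 by 6
SetBackgroundColor Yellow
SetForegroundColor Red
SetLetterHeight 4
MoveTo 3,5
DrawLetter T
"""

COLOURS = {"Yellow": 1, "Red": 0}  # pixel values as used on the earlier slides


def render(program):
    """Interpret the instructions and return the pixels row by row."""
    width = height = background = foreground = letter_height = x = y = None
    grid = None
    for line in program.strip().splitlines():
        op, *args = line.split()
        if op == "SetSize":                  # "SetSize 5 by 6": width, height
            width, height = int(args[0]), int(args[2])
        elif op == "SetBackgroundColor":
            background = COLOURS[args[0]]
        elif op == "SetForegroundColor":
            foreground = COLOURS[args[0]]
        elif op == "SetLetterHeight":
            letter_height = int(args[0])
        elif op == "MoveTo":                 # assumed: column,row of the foot
            x, y = (int(v) for v in args[0].split(","))
        elif op == "DrawLetter":
            if args[0] != "T":
                raise ValueError("this sketch only knows the letter T")
            if grid is None:
                grid = [[background] * width for _ in range(height)]
            top = y - letter_height + 1
            for col in range(x - 1, x + 2):  # horizontal bar of the T
                grid[top - 1][col - 1] = foreground
            for row in range(top, y + 1):    # vertical stem of the T
                grid[row - 1][x - 1] = foreground
        else:
            raise ValueError(f"unknown instruction: {op}")
    return [value for row in grid for value in row]


pixels = render(PROGRAM)
```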
21
File format
<basic information> <rendering information>
<storage information> <data>
22
File format
<basic information> What to do?
<rendering information>
<storage information>
<data>

23
File format
<basic information> What to do?
<rendering information> How to do it?
<storage information>
<data>
24
File format
<basic information> What to do?
<rendering information> How to do it?
<storage information> How to move it from persistent to deployed form?
<data>
25
File format
<basic information> What to do?
<rendering information> How to do it?
<storage information> How to move it from persistent to deployed form?
<data> What to deploy?
26
File format
<basic information> What to do?
<rendering information> How to do it?
<storage information> How to move it from persistent to deployed form?
<data> What to deploy?
27
File format
<basic information> Mandatory
<rendering information> Useful
<storage information> Historical
<data> Mandatory
28
File format
A deterministic specification of how the properties
of a digital object can be reversibly converted
into a linear bytestream (bitstream).
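A minimal sketch of such a reversible conversion, using an invented toy format. The signature TOY1, the field layout and the function names are hypothetical and do not follow any real specification; the four parts mirror basic, rendering and storage information plus data.

```python
import struct

MAGIC = b"TOY1"  # hypothetical signature, not a real format


def write_image(rows, cols, palette, pixels):
    """Convert the properties of the image into a linear bytestream."""
    if len(pixels) != rows * cols:
        raise ValueError("pixel count does not match dimensions")
    basic = MAGIC + struct.pack(">HH", rows, cols)           # basic information
    names = [name.encode("ascii") for name in palette]
    rendering = bytes([len(names)]) + b"".join(              # rendering information
        bytes([len(n)]) + n for n in names)
    storage = bytes([0])                                     # storage: 0 = uncompressed
    return basic + rendering + storage + bytes(pixels)       # data


def read_image(blob):
    """Recover exactly the properties that write_image stored."""
    if blob[:4] != MAGIC:
        raise ValueError("not a TOY1 file")
    rows, cols = struct.unpack(">HH", blob[4:8])
    pos = 8
    count = blob[pos]
    pos += 1
    palette = []
    for _ in range(count):
        length = blob[pos]
        pos += 1
        palette.append(blob[pos:pos + length].decode("ascii"))
        pos += length
    if blob[pos] != 0:
        raise ValueError("unknown storage method")
    pos += 1
    pixels = list(blob[pos:pos + rows * cols])
    return rows, cols, palette, pixels


image = (6, 5, ["red", "yellow"],
         [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
blob = write_image(*image)
restored = read_image(blob)
```

Writing and reading are deterministic, and reading the bytestream returns the same properties that were written.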
29
File format TIFF
30
File format PDF
1 0 obj
<< /Type /Page /Parent 281 0 R
/Resources 2 0 R /Contents 3 0 R
/StructParents 2 /MediaBox [0 0 612 792]
/CropBox [0 0 612 792] /Rotate 0 >>
endobj
31
File format PDF
2 0 obj
<< /ProcSet [/PDF /Text]
/Font << /TT2 292 0 R /TT4 288 0 R >>
/ExtGState << /GS1 300 0 R >>
/ColorSpace << /Cs6 289 0 R >> >>
endobj
32
File format PDF
3 0 obj << /Length 4605 /Filter /FlateDecode >>
stream HWÛÛÈWô4jRÀø Í"
(²5j"¹lräýoêÖ-j udTÙÂfPnìþgtÓE²ÝÕ˽âä
uª2iltltv úÓk9Q¼xXTP /i²½Ö)ÔÏöªÙHltCµ
and about 4000 bytes more øL"ÈÛÆJYØÂmjÝ
qõϺºÕ²ôÒÛº.u-kP0 4øTxMltéï¼9uôøòLiØoT
Ö mÇÿlÕºvéUËLmgu1Åëu5l3O
òËTîü7?ìNdh endstream endobj
33
File format XML (here SVG)
<?xml version="1.0" encoding="UTF-16"?>
<svg:svg width="800" height="1000" xmlns:svg="http://www.w3.org ...
<svg:rect x="0" y="0" width="800" height="1000" fill="white" />
<svg:g transform="translate(-140,0)">
<svg:line x1="600" y1="20" x2="500" y2="20" stroke="black"
<svg:text x="600" y="28.8" font-size="6" fill="black"
</svg:g>
<svg:g transform="translate(-140,0)">
<svg:text x="500" y="24.4">
<svg:tspan font-size="4" fill="black">Leiste</svg:tspan>
</svg:text>
</svg:g>
<svg:defs>
<svg:g id="halbeSaeuleLeiste0">
34
File format XML (here SVG)
35
File format XML (ETH column XML)
<?xml version="1.0" encoding="UTF-8"?>
<Autor name="Vitruv">
<Ordnung name="Ionisch" THz="" THn="" MH="" TBz="" TBn=""
<Element name="Gebaelk" original="" THz="" THn="" MH=""
<Element name="Gesims" original="corona" THz="" THn="" MH=""
<Element name="Leiste" original="" THz="" THn="" MH="0.03"
<Element name="Kyma" original="sima" THz="" THn=""
<Element name="Leiste" original="" THz="" THn="" MH="0.017"
<Element name="Kyma_reversa" original="cymatium" THz=""
<Element name="Platte" original="corona" THz="" THn=""
<Element name="Leiste" original="" THz="" THn="" MH="0.017"
<Element name="Kyma_reversa" original="cymatium" THz=""
<hElement name="Band" typ="1" dx="0.048" r="0.019"/>
<hElement name="Band" typ="1" dx="0.048" r="0.019"/>
</Element>
36
Files and Preservation
  • Bit rot.
  • Obsolescence of software.

37
Bit rot
An image file before ...
38
Bit rot
... and after one byte is changed.
39
Bit rot
Undetectable by software.
... and after one byte is changed.
40
Bit rot
Processing dictionary Payload
41
Bit rot
One byte is damaged, one byte cannot be displayed
correctly.
42
Bit rot
One byte is damaged, ten bytes cannot be
displayed correctly.
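Why the same amount of damage can have such different consequences can be sketched with the small T image from the beginning. This is an illustration only: it assumes the simple run-length scheme of (run length, value) pairs shown earlier, and the helper names are invented.

```python
RAW = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
RLE = [6, 1, 3, 0, 3, 1, 1, 0, 4, 1, 1, 0, 4, 1, 1, 0, 7, 1]


def rle_decode(encoded):
    """Expand a flat list of (run length, value) pairs."""
    decoded = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        decoded.extend([value] * count)
    return decoded


def wrong_pixels(expected, actual):
    """Count pixels that differ, including missing or surplus ones."""
    differing = sum(a != b for a, b in zip(expected, actual))
    return differing + abs(len(expected) - len(actual))


# Damage one stored value in the uncompressed stream.
damaged_raw = list(RAW)
damaged_raw[7] = 1

# Damage one stored value in the run-length encoded stream
# (the value of the first run flips from 1 to 0).
damaged_rle = list(RLE)
damaged_rle[1] = 0

raw_errors = wrong_pixels(RAW, damaged_raw)              # 1
rle_errors = wrong_pixels(RAW, rle_decode(damaged_rle))  # 6
```

One damaged value spoils one pixel in the uncompressed form but a whole run of six pixels in the compressed form. Damage to a run length instead of a value would shift every pixel that follows.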
43
Result: http://www.cflr.beniculturali.it/Progetti/Fixit.php
Università di Roma La Sapienza Dipartimento
Informatica
Centro Fotoriproduzione Legatoria e Restauro
Franco Liberati liberati_at_di.uniroma1.it
Paolo Buonora paolo.buonora_at_beniculturali.it
www.cflr.beniculturali.it
44
Paolo on JPEG
JPEG2000 more robust against bit rot than TIFF.
45
Paolo on JPEG
JPEG2000 more robust against bit rot than
TIFF. So, to stimulate more empiricism
46
Notice / caveat
This problem does not go away just because you
employ fancy language! It applies to digital
objects encoded in XML just as much as to
humble files.
47
IBM Digital Library, ca. 1997
Kn A 269_1802.idx Kn A 269_1802.log TDI00001.jpg
TDI00002.jpg TDI00005.jpg TDI00006.jpg
TDI00007.jpg TDI00008.jpg TDI00010.jpg
TDI00011.jpg TDI00013.jpg TDI00015.jpg
TDI00017.jpg TDI00020.jpg TDI00023.jpg
TDI00024.jpg TDI00028.jpg TDI00032.jpg
TDI00036.jpg TDI00037.jpg TDI00039.jpg
TDI00040.jpg TDI00042.jpg TDI00047.jpg
TDI00048.jpg TDI00050.jpg TDI00053.jpg
TDI00054.jpg TDI00056.jpg TDI00057.jpg
TDI00060.jpg TDI00062.jpg TDI00063.jpg
TDI00064.jpg
48
IBM Digital Library, ca. 1997
Shelfmark Kn A 269/1802
Scanner-Op Omniscan 6000
Count C2
Timestamp 4.1.2000
xdo TDI00005.jpg
end
end
Shelfmark Kn A 269/1802
Scanner-Op Omniscan 6000
Count F3
Timestamp 4.1.2000
xdo TDI00006.jpg
end
end
49
IBM Digital Library / saving
KnA115_646A1r.jpg
KnA115_646A2.jpg KnA115_646A3.jpg
KnA115_646A4.jpg KnA115_646B1.jpg
KnA115_646B2.jpg KnA115_646B3.jpg
KnA115_646B4.jpg KnA115_646C1.jpg
KnA115_646C2.jpg KnA115_646C3.jpg
KnA115_646C4.jpg KnA115_646D1.jpg
KnA115_646D2.jpg KnA115_646D3.jpg
KnA115_646D4.jpg KnA115_646E1.jpg
KnA115_646E2.jpg KnA115_646E3.jpg
KnA115_646E4.jpg KnA115_646F1.jpg
KnA115_646F2.jpg KnA115_646G4v.jpg
KnA115_646.xml
50
IBM Digital Library / saving
<body>
<div><page>TB Lsp <image seqno="KnA115_646A1r.jpg" nativeno="A1r" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646A2.jpg" nativeno="A2" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646A3.jpg" nativeno="A3" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646A4.jpg" nativeno="A4" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646B1.jpg" nativeno="B1" timestamp="14.02.1997"/></page>
<page><image seqno="KnA115_646B2.jpg" nativeno="B2" timestamp="14.02.1997"/></page>
...
51
Obsolescence
  • Software able to read the format no longer exists.
  • Format specification lost.
  • Implied algorithm lost.
  • Required object lost.

52
Recommended formats: text
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
53
Recommended formats: bitmap / raster image
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
54
Recommended formats: vector graphics
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
55
Recommended formats: audio
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
56
Recommended formats: video
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
57
Recommended formats: database
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
58
Recommended formats: 3D (virtual reality)
http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf
59
What kind of file is this?
  • Two ways to identify a file
  • By extension.
  • By internal characteristics (magic number,
    signature).

60
What kind of file is this?
  • Two ways to identify a file
  • By extension.
  • Each file ending with .doc is a MS Word
    document

61
What kind of file is this?
Two ways to identify a file: (b) By internal
characteristics (magic number, signature).
A TIFF file begins with:
Bytes 0-1: The byte order used within the file.
Legal values are "II" (4949.H) and "MM" (4D4D.H).
Bytes 2-3: An arbitrary but carefully chosen number (42)
that further identifies the file as a TIFF file.
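The check described above fits in a few lines. A minimal sketch in Python; the function name and the returned labels are invented for this example, and only the first four bytes are inspected, which is not a full validation of the file.

```python
import struct


def identify_tiff(header: bytes):
    """Return the byte order of a TIFF header, or None if it is not TIFF."""
    if len(header) < 4:
        return None
    order = header[:2]
    if order == b"II":          # 49 49 hex: least significant byte first
        fmt = "<H"
    elif order == b"MM":        # 4D 4D hex: most significant byte first
        fmt = ">H"
    else:
        return None
    # Bytes 2-3 must hold the number 42 in the declared byte order.
    (magic,) = struct.unpack(fmt, header[2:4])
    if magic != 42:
        return None
    return "little-endian" if order == b"II" else "big-endian"


print(identify_tiff(b"II*\x00"))   # little-endian
print(identify_tiff(b"MM\x00*"))   # big-endian
print(identify_tiff(b"%PDF"))      # None
```

In practice one would pass the first bytes read from the file, for example `open(path, "rb").read(4)`.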
62
What kind of file is this?
  • The need to identify files led to two
    developments
  • Clever software inspects files to decide how
    to process them.
  • MIME Types.
  • FORMAT registries.

63
What kind of file is this?
The following 4 transparencies are a quotation
from http://hul.harvard.edu/gdfr (see below).
64
Why Do We Need a Registry?
  • Repository functions are performed on a
    format-specific basis
  • Interpretation of otherwise opaque content
    streams is dependent upon knowledge of how typed
    content is represented
  • Interchange requires mutual agreement of format
    syntax and semantics

65
Potential Use Cases
  • Identification: I have a digital object; what format is it?
  • Validation: I have an object purportedly of format F; is
    it?
  • Transformation: I have an object of format F, but need G; how
    can I produce it?
  • Characterization: I have an object of format F; what are its
    significant properties?
  • Risk assessment: I have an object of format F; is it at risk of
    obsolescence?
  • Delivery: I have an object of format F; how can I render
    it?

66
Repository Format Dependencies Using the OAIS
Reference Model
67
What's Wrong with MIME Types?
  • Insufficient depth of detail
  • No requirements regarding syntax and semantic
    description
  • No requirement for complete disclosure,
    especially of proprietary formats
  • Insufficient granularity
  • Both tiled RGB GeoTIFF with LZW and striped
    bi-tonal TIFF-FX with Group 4 are typed as
    image/tiff
  • All of PDF 1.0 to 1.4, PDF/X-1, X-2, X-3, and
    PDF/A are typed as application/pdf
  • These variants might require radically different
    workflows
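The granularity problem can be seen with Python's standard mimetypes module, which maps file extensions to MIME types. The file names below are invented for illustration; the point is only that very different variants end up with the same type.

```python
import mimetypes

# Hypothetical file names standing for the variants listed above.
files = [
    "map_geotiff_rgb_lzw.tif",
    "fax_tiff_fx_group4.tif",
    "report_pdf_a.pdf",
    "print_job_pdf_x1.pdf",
]

# guess_type returns (type, encoding); only the type is of interest here.
guessed = {name: mimetypes.guess_type(name)[0] for name in files}

for name, mime in guessed.items():
    print(f"{name:28} {mime}")
```

Both TIFF variants come out as image/tiff and both PDF variants as application/pdf, so the MIME type alone cannot select the right workflow.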

68
File format registries - URLs
PRONOM: http://www.nationalarchives.gov.uk/pronom/
(does not only rely on extensions)
Global Digital Format Registry: http://hul.harvard.edu/gdfr
(predominantly project description)
FileExt: http://filext.com
(predominantly links to software)
69
Exercise I A few experiments

70
Exercise I A few experiments

71
Exercise I A few experiments

72
Exercise I A few experiments

73
Exercises
  • Use the Shotgun to systematically shoot and
    corrupt all files on your USB stick.
  • Use at least the settings size=1, count=1; size=512,
    count=1; size=1, count=more; size=512,
    count=more.
  • Try to damage each file a number of times, say
    5.
  • Report on their robustness against this
    treatment.
  • Check, whether your experience supports the
    Florida recommendations.
  • Any idea, why the formats behave as they do?
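For readers without the Shotgun tool, the kind of damage it inflicts can be imitated with a short sketch. This is not the Shotgun's actual implementation; it only assumes that size is the length of each damaged span in bytes and count the number of spans, and the function name and the seed parameter are invented here.

```python
import random


def shoot(data: bytes, size: int, count: int, seed: int = 0) -> bytes:
    """Return a copy of data with `count` spans of `size` bytes damaged."""
    if size <= 0 or size > len(data):
        raise ValueError("size must be between 1 and len(data)")
    rng = random.Random(seed)           # seeded, so experiments are repeatable
    damaged = bytearray(data)
    for _ in range(count):
        start = rng.randrange(0, len(data) - size + 1)
        for i in range(start, start + size):
            # Invert every bit of the original byte, so it always changes.
            damaged[i] = data[i] ^ 0xFF
    return bytes(damaged)


def differing_bytes(a: bytes, b: bytes) -> int:
    return sum(x != y for x, y in zip(a, b))


original = bytes(range(256)) * 8        # 2048 bytes of test data
one_byte = shoot(original, size=1, count=1)
one_block = shoot(original, size=512, count=1)
several = shoot(original, size=1, count=5)
```

To try it on a real file, read the file in binary mode, pass its content through `shoot`, write the result under a new name and open it with the usual viewer.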

74
PART II: Formats in PLANETS
File characteristics

75
PART II: Formats in PLANETS
File characteristics
  • Based on two formal languages
  • eXtensible Characterisation Extraction Language
    (XCEL)
  • eXtensible Characterisation Description Language
    (XCDL)

76
[Diagram; recoverable labels: 2007, 99, 2017]
77
[Diagram; recoverable labels: tiff, 93, png]
78
<XCELDocument...> ...
<formatDescription>....
<symbol identifier="ID01_I01_I01_S02" originalName="height" interpretation="uint32">
<range>
<startposition xsi:type="sequential"></startposition>
<length xsi:type="fixed">4</length></range>
<name>height</name>
</symbol>
<symbol identifier="ID01_I01_I01_S04" originalName="colourType">
<range>
<startposition xsi:type="sequential"></startposition>
<length xsi:type="fixed">1</length></range>
<valueInterpretation>
<valueLabel>greyscale</valueLabel>
<value>0</value>...
<name>imageType</name>
</symbol>
<symbol identifier="ID01_I01_I01_S05" originalName="compressionMethod">
<range>

<xcdl>
<object id="o1" >
<normData id="nd1" > ... </normData>
<property id="p1" source="raw" cat="descr" >
<name>compression</name>
<valueSet id="i_i1_s6" >
<rawValue>0 </rawValue>
<labValue>...</labValue>
<dataRef ind="normAll" />
<propRel/>
</valueSet>
</property>
<property id="p2" source="raw" cat="descr" >
<name>height</name>
<valueSet id="i_i1_s3" >
<rawValue>0 0 1 ad </rawValue>
<labValue>
<val>429</val>
<type>uint32</type>
</labValue>
<dataRef ind="normAll" />
<propRel/>
</valueSet>
</property>
<property id="p3" source="raw" cat="descr" >
<name>imageType</name> .....
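How a description of sequential, fixed-length symbols turns raw bytes into labelled values can be sketched as follows. This is not the XCEL processor itself. The symbol table is a hand-written stand-in for the XML description, the byte sequence is a toy example rather than a real file header, and the label for value 2 is taken from the image model example further on.

```python
# (name, length in bytes, optional labels for raw values)
SYMBOLS = [
    ("height", 4, None),                                     # uint32
    ("imageType", 1, {0: "greyscale", 2: "truecolour"}),
]


def extract(data: bytes, symbols):
    """Read the symbols one after another from the start of data."""
    position = 0
    properties = {}
    for name, length, labels in symbols:
        raw = data[position:position + length]
        if len(raw) != length:
            raise ValueError(f"data too short for symbol {name}")
        value = int.from_bytes(raw, "big")   # most significant byte first
        properties[name] = labels.get(value, value) if labels else value
        position += length
    return properties


# Raw bytes "0 0 1 ad" followed by one byte with the value 2.
properties = extract(bytes.fromhex("000001ad02"), SYMBOLS)
print(properties)  # {'height': 429, 'imageType': 'truecolour'}
```

The raw value 0 0 1 ad is hexadecimal 1AD, which is 429, the labelled value shown for height.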
80
Confession
81
Confession
Computer science does not really know what
information is.
82
Confession Claim
Computer science does not really know what
information is. It is pretty good at
representing and processing it, though.
83
Representations migrations
III, 3, ?, ???: four representations of
the idea / concept / model "three"
84
Representations migrations
I divided by III = 1 / 3 = 0.3333...
I divided by III = 1 / 3 = 0.3 periodic
Some ideas are handled more precisely by some
thinkers than by others.
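The same point can be made with two representations of one third in Python: an exact fraction and a decimal cut off after four digits. This is a small illustration, not part of the original slides.

```python
from fractions import Fraction

exact = Fraction(1, 3)            # the idea "one third", kept exactly
truncated = Fraction("0.3333")    # the same idea, cut off after four digits

print(exact * 3)                  # 1
print(truncated * 3)              # 9999/10000
print(exact - truncated)          # 1/30000
```

Multiplying by three recovers the original value only from the exact representation. A migration from the first form to the second silently loses the remainder.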
85
Representations migrations
48 bit images on 24 and on 48 bit graphics
cards. Some data is processed more adequately by
some equipment than others
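What a 24 bit device does to a 48 bit image can be sketched as follows, assuming three channels of 16 bits each and a reduction that simply keeps the upper 8 bits of every channel. The function names are invented for this example.

```python
def to_24_bit(pixel48):
    """Keep only the most significant 8 bits of each 16 bit channel."""
    return tuple(channel >> 8 for channel in pixel48)


def back_to_48_bit(pixel24):
    """Widen again; the discarded lower bits cannot be recovered."""
    return tuple(channel << 8 for channel in pixel24)


a = (0x1234, 0xABCD, 0xFFFF)
b = (0x12FF, 0xAB00, 0xFF80)      # a different 48 bit colour

print(to_24_bit(a) == to_24_bit(b))            # True
print(back_to_48_bit(to_24_bit(a)) == a)       # False
```

Two distinct 48 bit pixels become indistinguishable at 24 bits, and converting back does not restore them.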
86
Representations migrations
A model for information before and after a
migration must therefore potentially represent
all information there is, irrespective of the
possibility to process it in a given environment.
87
XCEL / XCDL
The languages are being processed; the current
development focus is the dynamic handling of
format-specific algorithms.
88
XCEL / XCDL image model (1)
A pixel cube. Each pixel: MSB (channel 1), LSB
(channel 1), ..., MSB (channel n), LSB (channel
n), MSB (aux 1), LSB (aux 1), ..., MSB (aux m),
LSB (aux m)
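The byte layout of one pixel in such a cube can be sketched with Python's struct module. The sketch assumes that every channel and every auxiliary channel is a 16 bit value stored most significant byte first; the function names are invented.

```python
import struct


def pack_pixel(channels, aux=()):
    """Store n channels followed by m aux channels as MSB, LSB pairs."""
    values = list(channels) + list(aux)
    return struct.pack(f">{len(values)}H", *values)


def unpack_pixel(blob, n_channels, n_aux):
    """Split the bytes of one pixel back into channels and aux channels."""
    values = struct.unpack(f">{n_channels + n_aux}H", blob)
    return list(values[:n_channels]), list(values[n_channels:])


# Three colour channels plus one auxiliary channel (for example alpha).
pixel = pack_pixel([0x0102, 0x0304, 0x0506], aux=[0xFFEE])
print(pixel.hex())  # 010203040506ffee
```

Because the number of channels and aux channels is part of the model, the same layout can describe an 8 bit greyscale image (with the MSB always zero) as well as a 48 bit colour image with transparency.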
89
XCEL / XCDL image model (2)
A pixel cube, accompanied by rendering info
plus deployment info.
90
XCEL / XCDL image model - example
<property id="p4" source="raw" cat="descr" >
<name>imageType</name>
<valueSet id="i_i1_s5" >
<rawValue>2</rawValue>
<labValue>
<val>truecolour</val>
<type>fixedLabel</type>
</labValue>
<dataRef ind="normAll" />
<propRel/>
</valueSet>
</property>
91
XCEL / XCDL text model
A text (<object>) is composed of
- data (<normData>) plus
- interpretations of data according to the underlying
  format specification (properties, <property>).
92
XCEL / XCDL text model - example
This is a text
<refData id="1">54 68 69 73 20 69 73 20 61 20 74 65 78 74</refData>
<property>
<name>fontsize</name>
<rawVal>
<val>00 18</val>
<type>unsignedInt8</type>
</rawVal>
<dataRef>
<!-- property refers to discrete part of reference data -->
<ref id="1" start="0" end="3"/>
<ref id="1" start="10" end="12"/>
</dataRef>
</property>
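How the property points into the reference data can be sketched in a few lines. Two assumptions are made here that the slide does not state: the reference data is ASCII, and start and end are zero-based positions with the end included.

```python
REF_DATA_HEX = "54 68 69 73 20 69 73 20 61 20 74 65 78 74"
REFS = [(0, 3), (10, 12)]           # (start, end) pairs from the example


def decode_ref_data(hex_text):
    """Turn the space-separated hex bytes into text."""
    return bytes.fromhex(hex_text.replace(" ", "")).decode("ascii")


def resolve(text, refs):
    """Return the parts of the text a property refers to (end inclusive)."""
    return [text[start:end + 1] for start, end in refs]


text = decode_ref_data(REF_DATA_HEX)
parts = resolve(text, REFS)
print(text)    # This is a text
print(parts)   # ['This', 'tex']
```

Under these assumptions the font size property applies to the characters at positions 0 to 3 and 10 to 12 of the text, while the rest of the text is untouched by it.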
93
Exercise II: Abstract modelling

Group 1: maps
Group 2: music
Group 3: Excel sheets
Group 4: books (ever heard of FRBR?)
94
File format XML (here SVG)
95
File format XML (here SVG)
<?xml version="1.0" encoding="UTF-16"?>
<svg:svg width="800" height="1000" xmlns:svg="http://www.w3.org ...
<svg:rect x="0" y="0" width="800" height="1000" fill="white" />
<svg:g transform="translate(-140,0)">
<svg:line x1="600" y1="20" x2="500" y2="20" stroke="black"
<svg:text x="600" y="28.8" font-size="6" fill="black"
</svg:g>
<svg:g transform="translate(-140,0)">
<svg:text x="500" y="24.4">
<svg:tspan font-size="4" fill="black">Leiste</svg:tspan>
</svg:text>
</svg:g>
<svg:defs>
<svg:g id="halbeSaeuleLeiste0">
96
File format XML (ETH column XML)
<?xml version="1.0" encoding="UTF-8"?>
<Autor name="Vitruv">
<Ordnung name="Ionisch" THz="" THn="" MH="" TBz="" TBn=""
<Element name="Gebaelk" original="" THz="" THn="" MH=""
<Element name="Gesims" original="corona" THz="" THn="" MH=""
<Element name="Leiste" original="" THz="" THn="" MH="0.03"
<Element name="Kyma" original="sima" THz="" THn=""
<Element name="Leiste" original="" THz="" THn="" MH="0.017"
<Element name="Kyma_reversa" original="cymatium" THz=""
<Element name="Platte" original="corona" THz="" THn=""
<Element name="Leiste" original="" THz="" THn="" MH="0.017"
<Element name="Kyma_reversa" original="cymatium" THz=""
<hElement name="Band" typ="1" dx="0.048" r="0.019"/>
<hElement name="Band" typ="1" dx="0.048" r="0.019"/>
</Element>
97
  • <Abstract Definition of Visual Content>
  • <Transformation Layer>
  • <Display System>

98
  • Content specific Languages
  • XSLT, C, Java, PHP ...
  • VRML, X3D, AutoCAD, 3DS, Maya, SoftImage ...