Title: DCC Persistent Identifiers for Representation Information
1DCC Persistent Identifiers for Representation
Information
Digital Curation Centre
a centre of expertise in data curation and
preservation
- D Giaretta
- http://www.dcc.ac.uk
- http://dev.dcc.ac.uk
Funders
2Outline
- DCC Development work
- Beginning with OAIS Reference Model
- Motivation for use of Persistent IDs
- Simple case!
- Discussion of some possible Persistent ID systems
- Conclusions
3OAIS Reminder
- OAIS is a standard about the long-term
preservation of information
- An Information Object is made up of a Data
Object plus its accompanying Representation
Information (RepInfo)
4Information Objects
5Representation Information
- The Data Object is interpreted using the
RepInfo
- The Reference Model is designed to ensure that an
OAIS is NOT set the impossible task of having to
provide ALL possible RepInfo immediately
- Hence
- Take account of the Designated Community and its
associated Knowledge Base
6Representation Information
- The information that maps a Data Object into more
meaningful concepts. An example is the ASCII
definition that describes how a sequence of bits
(i.e., a Data Object) is mapped into a symbol.
7Representation Information
- The Representation Information accompanying a
physical object, like a moon rock, may give
additional meaning
- It typically is a result of some analysis of the
physically observable attributes of the rock
- The Representation Information accompanying a
digital object, or sequence of bits, is used to
provide additional meaning.
- It typically maps the bits into commonly
recognized data types such as character, integer,
and real and into groups of these data types.
- It associates these with higher level meanings
which can have complex inter-relationships that
are also described
8Recursive Nature of Representation Information
- Structure Information
- Semantic Information
- Other Representation Information
9Examples (cont)
- 504b0304140000000800f696.
- This is a ZIP file which contains Word files,
each of which contains an encoded message which
needs the key !DGAJUKI to decode it using
encryption method SHA7
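The first layer of this example can be checked mechanically. The sketch below is only an illustration: it assumes the trailing dot on the slide marks truncation of the hex string, it inspects only the first ten bytes, and the function name `describe_zip_header` is hypothetical.

```python
import struct

# First bytes of the example object, as shown on the slide (truncated there).
HEADER_HEX = "504b0304140000000800f696"

def describe_zip_header(data: bytes) -> dict:
    """Decode the fixed leading fields of a ZIP local file header."""
    # Little-endian: 4-byte signature, then three unsigned 16-bit fields.
    signature, version, flags, method = struct.unpack("<4sHHH", data[:10])
    if signature != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    return {"version_needed": version, "flags": flags, "compression": method}

info = describe_zip_header(bytes.fromhex(HEADER_HEX))
print(info)
```

Recognising the container (compression method 8 is Deflate) is only one piece of RepInfo. The Word format, the key and the decoding method each need their own.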
10Examples (cont)
- LaTeX file containing an EPS (Encapsulated
Postscript) version of an image
- Web page containing Java Applet generating random
numbers
- SWISS-PROT data
- Foreign Language emails
11Further RepInfo Classification
12Why classify?
- This is a Word file
- This is a ZIP file which contains Word files
- This is a ZIP file which contains Word files,
each of which contains an encoded message which
needs the key !DGAJUKI to decode it using
encryption method SHA7
- This is a ZIP file which contains Word files,
each of which contains an encoded message which
needs the key !DGAJUKI to decode it using
encryption method SHA8
- To avoid repetition
- To facilitate automation
13Structure including Formats
- Distinguish
- formats which are used mainly for rendering to
be followed by human inspection, and
- formats used for automated processing
- Distinguish
- Things with unknown structure need software
- proprietary software e.g. MS Word
- Open Source software e.g. CDF
- Things with known structure
- ASCII file, FITS file etc
- Document the format
- Use description language if possible e.g. EAST
- The EAST tools are themselves Representation
Information which in due course will have to be
fully defined: the closure of their
Representation Nets will be the EAST standard
- Higher level definitions should include useful
scientific objects and humanities objects
14Layered Model from OAIS
15Semantics
- Meaning/ Relationships
- Hard problem
- Probably start with Data Dictionaries
- Add RDF etc
16Time Dependent Information
- Many, perhaps most, datasets change over time and
the state at each particular moment in time may
be important. It may be useful to break the issue
into separate parts.
- At each moment in time we could, in principle,
take a snapshot and store it. That snapshot has
its associated Representation Net.
- Efficient storage of a series of snapshots may
lead one to store differences or include time
tags in the data
- Additional Representation Information would be
needed which describes how to get to a particular
time's snapshot from the efficiently encoded
version.
- Also applies to ANNOTATION: who said what about
which, and when did they say it
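The idea of recovering a snapshot from an efficiently encoded version can be sketched as follows. This is an illustration only: the encoding (one base snapshot plus time-tagged differences, with `None` meaning "deleted"), the sample data and the name `snapshot_at` are all assumptions, not something the slides prescribe.

```python
from typing import Any

# Hypothetical efficient encoding: one full base snapshot plus
# time-tagged differences (key -> new value, None meaning "deleted").
base = {"temperature": 20.0, "status": "ok"}
deltas = [
    (10, {"temperature": 21.5}),
    (20, {"status": None, "operator": "alice"}),
    (30, {"temperature": 19.0}),
]

def snapshot_at(t: int) -> dict[str, Any]:
    """Rebuild the dataset state at time t by replaying differences up to t."""
    state = dict(base)
    for timestamp, changes in sorted(deltas, key=lambda d: d[0]):
        if timestamp > t:
            break
        for key, value in changes.items():
            if value is None:
                state.pop(key, None)   # the key was removed at this time
            else:
                state[key] = value
    return state

print(snapshot_at(25))
```

The replay rule itself (what a difference means, how deletions are marked) is exactly the additional Representation Information that would have to be preserved alongside the data.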
17Actions and Processes (Behaviour)
- Some information has, as an integral part of its
content, an implicit or explicit process
associated with it
- An example of this is a database or other time
dependent or reactive system such as a Neural
Net.
- Emulations
- Universal Virtual Computer (UVC)
- A very well specified VM e.g. JVM
18Is saying it's XML enough?
<?xml version='1.0'?>
<VOTABLE version="1.1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1"
 xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
<!--
 ! VOTable written by uk.ac.starlink.votable.VOTableWriter
 !-->
<RESOURCE>
<TABLE name="6dfgs_E7_subset" nrows="875">
<PARAM arraysize="*" datatype="char" name="Original Source" value="http://www-wfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz">
<DESCRIPTION>URL of data file used to create this table.</DESCRIPTION>
</PARAM>
<PARAM arraysize="*" datatype="char" name="Credits" value="Column explanations provided by Mike Read (ROE) from 6dfGS project."/>
<PARAM arraysize="*" datatype="char" name="Conversion" value="Converted from 6dfgs_E7.fld.gz by Mark Taylor (Starlink) using STIL."/>
<PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo usage."/>
<FIELD arraysize="15" datatype="char" name="TARGET">
<DESCRIPTION>Target name</DESCRIPTION>
Or here
NO!
19Why not embed in the object?
- Do we have to repeat things each time?
- Does every archive have to do everything?
- What happens when the Designated Community
Knowledge Base changes?
20Registries
- A place to register something
- A place to look something up (find something)
21Examples
- http://www.loc.gov/film/nfr2004.html
- http://hul.harvard.edu/gdfr/
- http://sunsite.berkeley.edu/rbeaubie/metsimpl/
- http://metadata.net/registries.html
- http://uddi.microsoft.com/default.aspx
22Simplest cases
- Data object has an identifier pointing to
Representation Information (RepInfo)
- Services: given an identifier, return associated
contents of Repository
- Writer of RepInfo: needs to be able to find
related stuff (i.e. has someone already done the
work?)
- Services: must be able to SEARCH registry in
various ways
- Updater of RepInfo: someone/something needs to
be able to add, extend (add RepInfo for the
RepInfo), correct
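The services listed above can be sketched as a toy in-memory registry. Everything here is hypothetical: the class name `Registry`, its method names and the `ri:` identifiers are illustrative and do not describe the DCC registry's actual interface.

```python
class Registry:
    """Toy in-memory RepInfo registry: resolve, search, add, extend, correct."""

    def __init__(self):
        # identifier -> {"description": str, "repinfo": [identifiers]}
        self._entries = {}

    def add(self, identifier, description, repinfo=()):
        if identifier in self._entries:
            raise KeyError(f"{identifier} already registered")
        self._entries[identifier] = {
            "description": description,
            "repinfo": list(repinfo),
        }

    def resolve(self, identifier):
        """Given an identifier, return the associated contents."""
        return self._entries[identifier]

    def search(self, text):
        """Find identifiers whose description mentions the text."""
        needle = text.lower()
        return sorted(
            ident for ident, entry in self._entries.items()
            if needle in entry["description"].lower()
        )

    def extend(self, identifier, repinfo_id):
        """Attach RepInfo for the RepInfo."""
        self._entries[identifier]["repinfo"].append(repinfo_id)

    def correct(self, identifier, description):
        """Replace a faulty description."""
        self._entries[identifier]["description"] = description


registry = Registry()
registry.add("ri:ascii", "ASCII character encoding")
registry.add("ri:fits", "FITS file format", repinfo=["ri:ascii"])
print(registry.search("fits"))
```

A writer of RepInfo would call `search` before `add` to check whether someone has already done the work.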
23High Level Conceptual View
The Digital Object could have RepInfo packed with
it
Example of use of Representation Information
Labelling
24Possible ways to attach ID
- DOI metadata
- SRB attribute
- METS/XFDU attribute
- Object-based Storage Devices (OSD) attribute
- NB local caching is possible
- Simple buy-in
25Example Label
26Persistent ID Digital Objects
- Persistent Identifiers of Persistent Objects
- Uniqueness (over time)
- Actionable i.e. actually allows one to get
something
- Bootstrap step
- Sequence of resolutions
- Terminal step
27Uniqueness
- Hierarchy of name spaces
- In each namespace
- Unique (how many?)
- Final namespace e.g.
- Unique (probably out of a larger number)
- Repository assigned e.g. Sequential, Hashed etc
- Repository or Depositor assigned or Distributed
system e.g. UUID based
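The assignment strategies named above can be sketched side by side. This is a minimal illustration: SHA-256, version 4 UUIDs, the `/` separator and all function and namespace names are assumptions chosen for the example.

```python
import hashlib
import itertools
import uuid

_counter = itertools.count(1)

def sequential_id() -> str:
    """Repository assigned: next number in a repository-wide sequence."""
    return f"{next(_counter):08d}"

def hashed_id(content: bytes) -> str:
    """Repository assigned: derived from the content itself (SHA-256 here)."""
    return hashlib.sha256(content).hexdigest()

def distributed_id() -> str:
    """Depositor or distributed assignment: random UUID, no coordination needed."""
    return str(uuid.uuid4())

def qualified(namespaces: list[str], local_id: str) -> str:
    """Prefix the final token with its hierarchy of namespaces."""
    return "/".join([*namespaces, local_id])

print(qualified(["example-archive", "collection-a"], sequential_id()))
print(hashed_id(b"moon rock analysis"))
print(distributed_id())
```

Sequential numbers are unique only within the assigning namespace, which is why the namespace hierarchy has to travel with them. Hashed and UUID-based tokens draw from a much larger space.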
28Resolvability
- BsXsYs(Z)T
- B: Bootstrap step
- s: Separators, may be different
- X, Y: Sequence of intermediate resolver steps
- Z: (implicit) terminal resolver service
- T: terminal token
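The BsXsYs(Z)T pattern can be walked through in code. The sketch below is hypothetical throughout: the `pid:` scheme, the resolver names and the nested dictionary standing in for real resolver services are invented for illustration, and `:` and `/` are assumed to be the only separators.

```python
import re

# Hypothetical resolver tree: each level maps a step name to the next resolver.
# The innermost mapping plays the role of the terminal resolver Z.
RESOLVERS = {
    "pid": {                      # B: bootstrap step
        "example-org": {          # X: first intermediate resolver
            "archive-1": {        # Y: second intermediate resolver
                "obj42": "contents of object 42",   # T, looked up by Z
            }
        }
    }
}

def split_identifier(identifier: str) -> list[str]:
    """Split on ':' or '/' since the separators may differ between steps."""
    return [part for part in re.split(r"[:/]", identifier) if part]

def resolve(identifier: str):
    *steps, token = split_identifier(identifier)
    resolver = RESOLVERS
    for step in steps:            # B first, then X, Y, ...
        resolver = resolver[step]
    return resolver[token]        # the terminal resolver looks up T

print(resolve("pid:example-org/archive-1/obj42"))
```

Each dictionary lookup stands for a service that must itself persist. A missing key at any level corresponds to a broken link in the resolution chain.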
29Persistence Requirements
- External to Repository
- Bootstrap step
- Each resolver step
- Within the control of a Repository
- Terminal resolver
- The Digital Object
30Bootstrap step
- Fixed root
- ISO based
- ISO/IEC 6523-1:1998 (rolodex?)
- ISO/IEC 8824-1:1998
- DOI
- Handle
- PURL
- Mutable root
- ARK
- http://NMAH/ark:/NAAN/Name
- URN
- LSID
31Two Forms of ISO Highest-Level Identifiers - from
ISO 8824
- 1. iso(1) standard(0)
- and
- 2. iso(1) identified-organization(3)
- Form 1
- Requirements on the standard, if any, are not
currently known
- Can standard simply define procedures for ID
assignments?
- Must standard explicitly give all identifiers
to be used?
- Form 2
- Identified Organization is to be identified
using ISO 6523
- ISO 6523-1,-2 (1998) extensively revised from
1984 version
32ISO 6523-1 (1998)
- Rules for ICD registration, and usage of 3
additional fields
- ICD: identifies organization registration system,
1-4 characters (e.g., ICD 112 is system for
registering top level standards organizations)
- Organization Identifier (OI), up to 35 characters
(e.g., 4 assigned to CCSDS)
- Optional Organization Part Identifier (OPI), up
to 35 characters identifying sub-org., services,
or entity (e.g., 1 could be assigned to CCSDS
CA Agent)
- Optional OPI Source (OPIS), 1 character,
identifies who assigned the OPI (e.g., 1 says
identified organization (CCSDS) assigned the
OPI)
- Interpretation of identification string under
6523 requires full knowledge of context of usage
- Fields can be in any order
- No syntax specified
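The four fields and their length limits can be captured in a small data structure. This is a sketch under stated assumptions: the class name `Iso6523Identifier`, the rule that OPIS requires an OPI, and the `:`-separated rendering are choices made for the example, since the standard as summarised here specifies no syntax or field order.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Iso6523Identifier:
    icd: str                      # registration system, 1-4 characters
    oi: str                       # Organization Identifier, up to 35 characters
    opi: Optional[str] = None     # Organization Part Identifier, up to 35 characters
    opis: Optional[str] = None    # OPI Source, exactly 1 character

    def __post_init__(self):
        if not 1 <= len(self.icd) <= 4:
            raise ValueError("ICD must be 1-4 characters")
        if not 1 <= len(self.oi) <= 35:
            raise ValueError("OI must be 1-35 characters")
        if self.opi is not None and not 1 <= len(self.opi) <= 35:
            raise ValueError("OPI must be 1-35 characters")
        if self.opis is not None:
            if self.opi is None:
                raise ValueError("OPIS is meaningless without an OPI")
            if len(self.opis) != 1:
                raise ValueError("OPIS must be exactly 1 character")

    def render(self, separator: str = ":") -> str:
        """One possible local syntax; an assumption, not part of the standard."""
        parts = [self.icd, self.oi, self.opi, self.opis]
        return separator.join(p for p in parts if p is not None)

# Values taken from the examples on the slide.
ccsds_agent = Iso6523Identifier(icd="112", oi="4", opi="1", opis="1")
print(ccsds_agent.render())
```

Because the separator and field order are local conventions, two parties exchanging such a string still need a shared context of usage to interpret it.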
33Implications for Registered Identifier Usage
- All identifiers are ambiguous without context of
usage
- No string is globally unique
- Need syntax specification including meaning of
included fields
- In most contexts of usage, full iso string not
needed
- Sending and receiving parties understand context
- May need to broaden context of usage in some
cases
- Can employ full string
- Map into new identifier string syntax and
semantics - not automatic
34Investigation Status
- ICD 112 has been obtained by ISO for
identification of standards developing
organizations
- ICD 112 is under control of ISO JTC1/SC32 (in
2000 the contact was)
35Potential ID Construction (abstract level)
- Using ISO/ICDs
- x distinguishes among CCSDS defined domains
(TBD)
- Maintained by CCSDS Secretariat
[Figure: identifier built from ICD, OI, OPI and OPIS fields. Labels: "P2 CA ADID services", "1 NSSD 0233 (Panel 2)", "P3 SLE services", "1 5 2. (Panel 3)"]
- Using ISO/ISO Standards
- x is number of ISO standard
[Figure: identifier built from 1, 0, X and ID fields. Labels: "13764 NSSD 0233 (Panel 2)", "P3 Top Level SLE Standard", "5 2. (Panel 3)"]
36ISO 8824-1 Naming Tree
37Persistent ID - roles
- Who/What (role?) can update a Registry entry in
the long term?
- Who/What (role?) can access a Registry entry?
- Authorisation?
- Encryption keys?
38What can be relied on?
- Organisational/ Procedural/ Sociological issues
are important
- What can be relied on?
- Organisations?
- Internet? DNS?
- Nothing
39Example
<reference>
<identifier>
<value>e1fe9271-cd48-4418-a63e-b112ebf792c7</value>
<resolver resolverType="ark">http://foobar.zaf.org/ark:/64269/</resolver>
<resolver resolverType="doi">10.123456/</resolver>
</identifier>
<description>For example something registered with both ARK and DOI</description>
</reference>
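A consumer of such a reference might read it as sketched below. The XML is repeated inline so the example is self-contained, with the ARK prefix written in its `ark:/` form. The function name `candidate_identifiers` is hypothetical, and forming a full identifier by concatenating each resolver prefix with the value is an assumption, since the slide does not say how the two are combined.

```python
import xml.etree.ElementTree as ET

REFERENCE_XML = """
<reference>
  <identifier>
    <value>e1fe9271-cd48-4418-a63e-b112ebf792c7</value>
    <resolver resolverType="ark">http://foobar.zaf.org/ark:/64269/</resolver>
    <resolver resolverType="doi">10.123456/</resolver>
  </identifier>
  <description>For example something registered with both ARK and DOI</description>
</reference>
"""

def candidate_identifiers(xml_text: str) -> dict[str, str]:
    """Map each resolver type to prefix + value (concatenation is an assumption)."""
    root = ET.fromstring(xml_text.strip())
    value = root.findtext("identifier/value").strip()
    return {
        resolver.get("resolverType"): resolver.text.strip() + value
        for resolver in root.findall("identifier/resolver")
    }

for kind, full_id in candidate_identifiers(REFERENCE_XML).items():
    print(kind, full_id)
```

Listing more than one resolver gives a client a fallback if one identifier system stops being resolvable.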
40Conclusions
- There is a need for Persistent Identifiers for
persistent objects
- There are many systems; some may be more
believable than others
- None can actually be trusted in the really long
term