1
Identifiers and Types
  • CS431 Architecture of Web Information Systems
  • Carl Lagoze Cornell University Feb. 7 2005

2
Identity, Change, Persistence
  • Paradox: reality contains things that persist and
    change over time
  • Heraclitus and Plato: can you step into the same
    river twice?
  • Ship of Theseus: over the years, the Athenians
    replaced each plank in the original ship of
    Theseus as it decayed, thereby keeping it in good
    repair. Eventually, there was not a single plank
    left of the original ship. So, did the Athenians
    still have one and the same ship that used to
    belong to Theseus?

3
Identity, Change, Persistence
4
Identifiers
  • Provide a key or handle linking abstract concepts
    to physical or perceptible entities
  • Provide us with a necessary figment of
    persistence
  • They are perhaps the one essential and common
    form of metadata
  • Why bother?
  • Finding things
  • Referring to things (Citations)
  • Asserting ownership over things

5
I have lots of identifiers
  • Carl Jay Lagoze, Dad, Hey you
  • 123-456-7890 (SSN)
  • 1234-5678-1234-1234 (Visa Card)
  • FZBMLH (US Airways locator on January 18 flight
    to San Diego)

6
Identifier Issues
  • Object granularity
  • Identifier Context
  • Object atomicity
  • Part/whole relationships
  • Location independence
  • Global uniqueness
  • Persistence across time
  • Human vs. machine generation
  • Machine resolution
  • Administration (centralized vs. decentralized)
  • Intrinsic semantics
  • Type specificity

7
Two common pre-digital identifiers
  • ISBN (International Standard Book Number)
    • Uniquely identifies every monograph (book)
    • One ISBN for each format
      • HP SS hardback: 0590353403
      • HP SS softcover: 059035342X
    • Number is semantically meaningful (components)
    • International administration (>150 countries)
  • ISSN (International Standard Serial Number)
    • Uniquely identifies every serial (not issue or
      volume)
    • Semantically meaningless
    • International administration
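One of the components that makes an ISBN semantically meaningful is its final check digit. The following is a minimal sketch of the standard ISBN-10 check (weights 10 down to 1, sum divisible by 11, "X" meaning 10 in the last position). The function name isbn10_is_valid is an illustrative choice, not something from the lecture.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if `isbn` is a 10-character ISBN with a correct check digit."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10 and is only legal as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


# The two ISBNs quoted on the slide
for isbn in ("0590353403", "059035342X"):
    print(isbn, isbn10_is_valid(isbn))
```

Both ISBNs from the slide pass this check; changing any single digit makes it fail.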

8
URI: Uniform Resource Identifier
  • Generic syntax for identifiers of resources
  • Defined by RFC 2396
  • Syntax: <scheme>://<authority><path>?<query>
  • Scheme
    • Defines semantics of remainder of URI
    • ftp, gopher, http, mailto, news, telnet
  • Authority
    • Authority governing namespace for remainder of
      URI
    • Typically Internet-based server
  • Path
    • Identification of data within scope of authority
  • Query
    • String of information to be interpreted by
      authority
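As a small illustration of the four components, Python's standard urllib.parse.urlsplit splits a URI along the same lines. The query string "lang=en" is made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical URI; only the host and path echo the lecture's examples.
parts = urlsplit("http://an.org/index.html?lang=en")

print("scheme:   ", parts.scheme)   # governs how the rest is interpreted
print("authority:", parts.netloc)   # who controls the namespace
print("path:     ", parts.path)     # data within the authority's scope
print("query:    ", parts.query)    # string interpreted by the authority
```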

9
Why is RFC 2396 so big?
  • Character encodings
  • Partial and relative URIs

10
URL: Uniform Resource Locator
  • String representation of the location for a
    resource that is available via the Internet
  • Uses URI syntax
  • Scheme has function of defining the access
    (protocol) method. Used by client to determine
    the protocol to speak.
  • http://an.org/index.html - open socket to an.org
    on port 80 and issue a GET for index.html
  • ftp://an.org/index.html - open socket to an.org
    on port 21, open ftp session, issue ftp get for
    index.html.
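The way a client derives "what protocol to speak" from the scheme can be sketched roughly as follows. The function plan_access and the DEFAULT_PORTS table are hypothetical helpers for illustration, and no network connection is made. The sketch only reports the host, the port, and the first retrieval command a client would send (RETR is the FTP command behind a "get").

```python
from urllib.parse import urlsplit

# Default ports for the two schemes used on the slide.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def plan_access(url):
    """Return (host, port, retrieval command) implied by the URL's scheme."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"no access method known for scheme {parts.scheme!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    if parts.scheme == "http":
        command = f"GET {parts.path or '/'} HTTP/1.0"
    else:
        # After logging in, an FTP client retrieves a file with RETR.
        command = f"RETR {parts.path.lstrip('/')}"
    return parts.hostname, port, command


print(plan_access("http://an.org/index.html"))
print(plan_access("ftp://an.org/index.html"))
```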

11
URL Issues
  • Persistence
  • Location dependence
  • Valid only at the item level
    • What about works, expressions, manifestations?
  • Multiple resolution
    • Get the one that is cheapest, most reliable,
      most recent, most appropriate for my hardware,
      etc.
  • Non-digital resources?
  • Sub-parts

12
URC: Uniform Resource Characteristic (Catalog)
  • Failed but interesting effort
  • Multiple resolution
  • Describe resource by its characteristics
  • Provide adequate bundled information about a
    resource (metadata) to create identification
    block for any given resource (including
    locations)
  • Exactly what is the common set of
    characteristics for describing different types of
    resources?
  • Where are these characteristics stored?

13
Robust Hyperlinks
  • Characteristic of document (metadata) is computed
    automatically via fingerprint of its content.
  • Lexical signatures: the top n words of a
    document chosen for rarity, subject to heuristic
    filters to aid robustness.
    • A TF-IDF-like measure
  • Five or so words are sufficient
  • Can be used to locate document (via search
    engine) after it is moved
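A lexical signature of this kind might be computed roughly as below. This is a sketch of a generic TF-IDF ranking, not the actual Robust Hyperlinks algorithm (which adds heuristic filters not shown here). The function name, the toy document-frequency table, and the corpus size of 1000 are all invented for illustration.

```python
import math
import re


def lexical_signature(text, doc_freq, total_docs, n=5):
    """Return the n words of `text` with the highest TF-IDF-like score."""
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        counts[word] = counts.get(word, 0) + 1

    def score(word):
        # Rare words (small document frequency) get a large weight.
        df = doc_freq.get(word, 1)
        return counts[word] * math.log(total_docs / df)

    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(counts, key=lambda w: (-score(w), w))[:n]


# Toy corpus statistics: number of documents (out of 1000) containing each word.
doc_freq = {"the": 900, "a": 950, "to": 800, "system": 200,
            "url": 100, "resolves": 20, "handle": 10}
text = "The handle system resolves the handle to a URL"
print(lexical_signature(text, doc_freq, total_docs=1000, n=3))
```

The resulting words would then be submitted as a search-engine query to relocate the document after it has moved.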

14
Robust Hyperlinks: Why does this work?
  • Number of terms on Web is reportedly close to
    10,000,000.
  • If terms were distributed independently, the
    probability of 5 even moderately common terms
    occurring in more than one document is very
    small.
  • In fact, picking 3 terms restricted to those
    occurring in 100,000 documents works pretty well.
  • Many documents contain very infrequently used
    words.
  • There is lots of room for the independence
    assumption to be off, and to tune term selection
    for robustness, etc.
  • http://www.cs.berkeley.edu/phelps/Robust/
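The independence argument can be made concrete with a back-of-the-envelope calculation. If each term appears in df_i of N documents, and terms are independent, the expected number of documents containing all of them is N times the product of df_i / N. The slides do not give N, so the web size of one billion documents used below is purely an assumption for illustration.

```python
def expected_matches(total_docs, doc_freqs):
    """Expected number of documents containing every term, assuming independence."""
    expected = float(total_docs)
    for df in doc_freqs:
        expected *= df / total_docs  # probability a random document has this term
    return expected


ASSUMED_WEB_SIZE = 10**9  # assumption, not a figure from the lecture

# Three terms, each occurring in 100,000 documents (the case on the slide)
print(expected_matches(ASSUMED_WEB_SIZE, [100_000] * 3))
```

Under these assumptions the expected count is about one in a thousand, so a real document that does contain all three terms is very likely to be the only match.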

15
URN: Uniform Resource Name
  • Globally unique, persistent names
  • Independence from location and location methods
  • <URN> ::= "urn:" <NID> ":" <NSS>
    • NID: namespace identifier
    • NSS: namespace-specific string
  • Examples
    • urn:ISSN:1234-5678
    • urn:isbn:9044107642
    • urn:doi:10.1000/140
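Splitting a URN into its namespace identifier and namespace-specific string can be sketched as follows, assuming the colon-separated form "urn:" NID ":" NSS of RFC 2141 (NID of 1 to 32 letters, digits, or hyphens, compared case-insensitively). The helper name parse_urn is illustrative.

```python
import re

# "urn:" <NID> ":" <NSS>; the "urn" prefix and the NID are case-insensitive.
URN_PATTERN = re.compile(r"^urn:([a-z0-9][a-z0-9-]{0,31}):(.+)$", re.IGNORECASE)


def parse_urn(urn):
    """Return (nid, nss); the NID is lower-cased, the NSS is left untouched."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    nid, nss = match.groups()
    return nid.lower(), nss


for urn in ("urn:ISSN:1234-5678", "urn:isbn:9044107642", "urn:doi:10.1000/140"):
    print(parse_urn(urn))
```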

16
Why isn't DNS sufficient? (parenthetical comment)
  • Issue of semantic vs. non-semantic names
  • Changing ownership
  • Hierarchical legacy of DNS is sometimes
    inappropriate

17
Handles: Names for Internet Resources
  • Naming system for location-independent,
    persistent names
  • One name, multiple resolutions
  • http://www.handle.net

The resource named by a Handle can be:
  • A library item
  • A collection of library items
  • A catalog record
  • A computer
  • An e-mail address
  • A public key for encryption
  • etc.
18
Syntax of Handles
  • <naming_authority>/<locally_unique_string>
    or hdl:<naming_authority>/<locally_unique_string>
  • Examples
    • 10.1234/1995.02.12.16.42.219 (date-time stamp)
    • cornell.cs/cstr-94.45 (mnemonic name)
    • loc/a43v-8940cgr (random string)
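A Handle in this syntax can be split at the first slash into its naming authority and locally unique string. The sketch below assumes an optional "hdl:" prefix, and parse_handle is a hypothetical helper, not part of any Handle System library.

```python
def parse_handle(handle):
    """Split a Handle into (naming_authority, locally_unique_string)."""
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]  # drop the optional scheme-like prefix
    authority, slash, local = handle.partition("/")
    if not slash or not authority or not local:
        raise ValueError(f"not a Handle: {handle!r}")
    return authority, local


print(parse_handle("cornell.cs/cstr-94.45"))
print(parse_handle("hdl:loc.ndlp.amrlp/123456"))
```

Only the first slash is significant here, so the locally unique string may itself contain slashes.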
19
Example of a Handle and its Data Used to Identify
Two Locations
Handle: loc.ndlp.amrlp/123456
Data type    Handle data
URL          http://www.loc.gov/.....
RAP          loc/repository-1r4589
20
Use of Handles in a Digital Library
[Diagram labels: Repository, User interface, Search System, Handle System]
21
Replication for Performance and Reliability
  • Example: the Global Handle System
[Figure labels: Los Angeles, CA; Washington, DC]
22
Proxies to Resolve Handles
A Web browser can resolve Handles via a proxy.
For example, the Handle loc.ndlp.amrlp/3a16616 can
be resolved through the proxy server hdl.handle.net.
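A proxy URL for a Handle can be formed by appending the Handle to the proxy's base URL, which is the convention used by hdl.handle.net. The sketch below only builds the URL and does not contact the proxy. The function name proxy_url is illustrative.

```python
from urllib.parse import quote

HANDLE_PROXY = "http://hdl.handle.net/"


def proxy_url(handle):
    """Build the URL a browser would request to resolve `handle` via the proxy."""
    # Keep "/" literal so the authority/local-name separator survives.
    return HANDLE_PROXY + quote(handle, safe="/")


print(proxy_url("loc.ndlp.amrlp/3a16616"))
```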
23
Proxy Resolution
[Diagram labels: WWW browser, URL to Proxy, Proxy server, hdl.handle.net, Handle System, URL, HTTP server, Resource]
24
DOI: Digital Object Identifier
  • Technology and social infrastructure for naming
  • Established by publishers for persistent naming
    of entities (articles, journals, conference
    proceedings)
  • Cognizant of FRBR elements
  • Underlying technology is handle system
  • persistent names
  • Persistence is fortified by social underpinnings
  • Rules for establishing registration agencies
  • Multiple resolution
  • Registration/mechanism has metadata associated
    with it
  • doi:10.1000/186

25
OCLC's Persistent URL (PURL)
  • A PURL is a URL
  • -> Is fully compatible with today's Internet
    browsers
  • -> Users need no special software
  • Has some of the desirable features of URNs
  • Lacks some desirable features of URNs
  • -> Resolves only to a URL
  • -> Does not support multiple resolution
  • Developed by OCLC
  • Software openly available
  • http://www.purl.org

26
PURL Syntax
  • A PURL is a URL.
  • PURL resolvers use standard http redirects to
    return the actual URL.
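The redirect step can be sketched with an in-memory lookup table standing in for the PURL database. The table contents and the target URL are invented for illustration, and a real resolver would send the status and Location header over HTTP instead of returning them from a function.

```python
# Hypothetical PURL database: PURL path -> current URL of the resource.
PURL_TABLE = {
    "/keith/home": "http://www.example.org/~keith/",
}


def resolve_purl(path):
    """Return (HTTP status, headers) a PURL server might answer with."""
    target = PURL_TABLE.get(path)
    if target is None:
        return 404, {}
    # A standard HTTP redirect: the browser follows Location on its own.
    return 302, {"Location": target}


print(resolve_purl("/keith/home"))
print(resolve_purl("/unknown"))
```

When the resource moves, only the table entry changes. The PURL that users cite stays the same.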

27
PURL Namespaces
A PURL provides a local (not global) namespace:
http://purl.oclc.org/keith/home is different
from http://purl.stanford.edu/keith/home
28
OCLC PURL Resolution
[Diagram labels: WWW browser, PURL, PURL server, PURL database, URL, HTTP server, Resource]
29
Making links context sensitive
  • Why?
  • Appropriate item differs for each user
  • Licensing locality
  • Some users may want a choice (abstract, full
    text, etc.)
  • Conceptualize link as service rather than object
    targeted.
  • OpenURL
    • Transports metadata about the work to a
      localized service that interprets the metadata
      and provides contextualized choices to the user.

30
OpenURL linking
[Diagram labels: reference; provision of OpenURL; transportation of metadata identifiers; user-specific, context-sensitive resolution of metadata identifiers into services]
31
OpenURL 0.1 syntax
  • http://www.mysrv.org/menu?
  • id=doi:10.111/12345
  • &genre=article
  • &aulast=Weibel&aufirst=Stu&ISSN=35345353
  • &year=2001&volume=14&issue=3&spage=44
  • &pid=2829393
  • &sid=OCLC:Inspec
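An OpenURL of this shape is a base URL for the local resolver followed by key/value metadata. The sketch below assembles one with the standard library and parses it back. It assumes the fields are ordinary "key=value" pairs joined by "&", and note that urlencode percent-encodes ":" and "/" in values, which decodes to the same data. The helper name build_openurl is illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_openurl(resolver_base, metadata):
    """Attach citation metadata to the resolver's base URL as a query string."""
    return resolver_base + "?" + urlencode(metadata)


# Field names and values taken from the slide's example.
fields = {
    "id": "doi:10.111/12345",
    "genre": "article",
    "aulast": "Weibel",
    "aufirst": "Stu",
    "ISSN": "35345353",
    "year": "2001",
    "volume": "14",
    "issue": "3",
    "spage": "44",
    "pid": "2829393",
    "sid": "OCLC:Inspec",
}

url = build_openurl("http://www.mysrv.org/menu", fields)
print(url)

# The resolver side recovers the same metadata from the query string.
decoded = parse_qs(urlsplit(url).query)
print(decoded["id"], decoded["sid"])
```

Pointing the same metadata at a different resolver_base is what makes the link context sensitive: each institution's resolver decides which services to offer.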

32
Why haven't URNs caught on beyond certain
communities?
  • Complexity of systems
  • One size does not fit all: special-purpose URN
    schemes have been successful, e.g., PubMed ID,
    Astrophysics BibCode
  • No guarantee of persistence: longevity is an
    organizational, not technical, issue
  • Requires well-regulated administrative systems
  • Absence of killer applications, although
    reference linking is emerging

33
Types: Not all data and content is the same
  • Format or Genre
    • How you sense it
    • What you can do with it
    • E.g., audio, video, map, book
  • Type
    • What you need to process it
    • What is its bit layout
    • Compression or encoding

34
Multipurpose Internet Mail Extensions
  • RFC 822: defines textual format of email messages
  • RFC 2045-2049: extend textual email to allow
    • Character sets other than US-ASCII
    • Extensible set of non-ASCII types for message
      bodies
    • Definition of multi-part mail (attachments)
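These three extensions can be seen together in a small message built with Python's standard email package. The subject, body text, and attachment bytes below are made up for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Lecture notes"

# A text body with a non-ASCII character, so a charset beyond US-ASCII is needed.
msg.set_content("Café notes attached.")

# A non-text body part; adding it turns the message into multipart/mixed.
msg.add_attachment(b"%!PS-Adobe-3.0\n", maintype="application",
                   subtype="postscript", filename="notes.ps")

parts = list(msg.iter_parts())
print(msg.get_content_type())
for part in parts:
    print(part.get_content_type(), part.get_content_charset())
```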

35
MIME Types
  • Two-part type hierarchy
  • Top level type
    • text
    • audio
    • video
    • image
    • application
    • multipart
  • Examples
    • text/plain, image/gif, application/postscript
  • Extensions are handled by IANA
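The two-part hierarchy means every MIME type splits at a single slash into a top-level type and a subtype. A minimal sketch follows. The set of top-level types is just the list from this slide, not the complete registry, and split_mime_type is a hypothetical helper.

```python
# Top-level types as listed on the slide (the full registry has more).
TOP_LEVEL_TYPES = {"text", "audio", "video", "image", "application", "multipart"}


def split_mime_type(mime_type):
    """Return (top_level_type, subtype) for a string such as "image/gif"."""
    top, slash, sub = mime_type.strip().lower().partition("/")
    if not slash or not sub or top not in TOP_LEVEL_TYPES:
        raise ValueError(f"unrecognized MIME type: {mime_type!r}")
    return top, sub


for example in ("text/plain", "image/gif", "application/postscript"):
    print(split_mime_type(example))
```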

36
MIME in HTTP (Content Negotiation)
  • Accept in request-header
    • Accept: text/plain; q=0.5, text/html, text/x-dvi;
      q=0.8, text/xml
    • text/html and text/xml are preferred, then
      text/x-dvi, then text/plain
  • Content-Type in response-header
    • Content-Type: text/html
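The preference order implied by an Accept header can be worked out by reading each entry's q value (defaulting to 1 when absent) and sorting in descending order. The sketch below handles only the simple "type; q=value" form and ignores wildcards and other parameters. The function name rank_accept is illustrative.

```python
def rank_accept(header):
    """Return [(media_type, q), ...] sorted from most to least preferred."""
    entries = []
    for item in header.split(","):
        media_type, *params = [piece.strip() for piece in item.split(";")]
        q = 1.0  # entries without an explicit q are fully preferred
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        entries.append((media_type, q))
    # sorted() is stable, so equal-q entries keep their header order.
    return sorted(entries, key=lambda entry: -entry[1])


header = "text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml"
for media_type, q in rank_accept(header):
    print(media_type, q)
```

For this header the ranking puts text/html and text/xml first (q of 1), then text/x-dvi (0.8), then text/plain (0.5).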

37
MIME is too limited
  • Two-level type depth is simplistic
  • Multi-media documents
  • Documents that have many types or views
  • FEDORA