Identifiers and Types - PowerPoint PPT Presentation

1
Identifiers and Types
  • CS502 Architecture of Web Information Systems
  • Carl Lagoze, Cornell University, Feb. 03, 2003

2
Identity, Change, Persistence
  • Paradox: reality contains things that persist and
    change over time
  • Heraclitus and Plato: can you step into the same
    river twice?
  • Ship of Theseus: over the years, the Athenians
    replaced each plank in the original ship of
    Theseus as it decayed, thereby keeping it in good
    repair. Eventually, there was not a single plank
    left of the original ship. So, did the Athenians
    still have one and the same ship that used to
    belong to Theseus?

3
Identity, Change, Persistence
4
Identifiers
  • Provide a key or handle linking abstract concepts
    to physical or perceptible entities
  • Provide us with a necessary figment of
    persistence
  • They are perhaps the one essential and common
    form of metadata
  • Why bother?
  • Finding things
  • Referring to things
  • Asserting ownership over things

5
I have lots of identifiers
  • Carl Jay Lagoze, Dad, Hey you
  • 123-456-7890 (SSN)
  • 1234-5678-1234-1234 (Visa Card)
  • FZBMLH (US Airways locator on March 21 flight to
    San Diego)

6
Identifier Issues
  • Object granularity
  • Identifier Context
  • Object atomicity
  • Part/whole relationships
  • Location independence
  • Global uniqueness
  • Persistent across time
  • Human vs. machine generation
  • Machine resolution
  • Administration (centralized vs. decentralized)
  • Intrinsic semantics
  • Type specificity

7
Two common pre-digital identifiers
  • ISBN (International Standard Book Number)
  • Uniquely identifies every monograph (book)
  • One ISBN for each format
  • HP SS hardback 0590353403
  • HP SS softcover 059035342X
  • Number is semantically meaningful (components)
  • International administration (>150 countries)
  • ISSN (International Standard Serial Number)
  • Uniquely identifies every serial (not issue or
    volume)
  • Semantically meaningless
  • International administration
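
One of the meaningful components of a 10-digit ISBN is its final check digit. The sketch below is not from the slides; it applies the standard ISBN-10 rule (weights 10 down to 1, sum divisible by 11, with "X" standing for 10) to the two ISBNs quoted above. The function name is made up for illustration.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if the ISBN-10 check digit agrees with the other nine digits."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10  # 'X' stands for 10 and is allowed only as the check digit
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (10 - position) * value  # weights 10, 9, ..., 1
    return total % 11 == 0


for candidate in ("0590353403", "059035342X"):
    print(candidate, isbn10_is_valid(candidate))
```

Both ISBNs on the slide pass this check, while changing any single digit makes it fail.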

8
URI: Uniform Resource Identifier
  • Generic syntax for identifiers of resources
  • Defined by RFC 2396
  • Syntax: <scheme>://<authority><path>?<query>
  • Scheme
  • Defines semantics of remainder of URI
  • ftp, gopher, http, mailto, news, telnet
  • Authority
  • Authority governing namespace for remainder of
    URI
  • Typically Internet-based server
  • Path
  • Identification of data within scope of authority
  • Query
  • String of information to be interpreted by
    authority
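
As a rough illustration of the four components, Python's standard `urllib.parse.urlsplit` breaks a URI into scheme, authority, path and query. The path and query string in the example URI are invented for the demonstration.

```python
from urllib.parse import urlsplit

# Hypothetical URI; only the host an.org appears in the slides.
parts = urlsplit("http://an.org/docs/index.html?lang=en")

print("scheme:   ", parts.scheme)
print("authority:", parts.netloc)
print("path:     ", parts.path)
print("query:    ", parts.query)
```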

9
Why is RFC 2396 so big?
  • Character encodings
  • Partial and relative URIs

10
URL: Uniform Resource Locator
  • String representation of the location for a
    resource that is available via the Internet
  • Uses URI syntax
  • Scheme has function of defining the access
    (protocol) method. Used by client to determine
    the protocol to speak.
  • http://an.org/index.html - open socket to an.org
    on port 80 and issue a GET for index.html
  • ftp://an.org/index.html - open socket to an.org
    on port 21, open ftp session, issue ftp get for
    index.html.
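
A minimal sketch of how a client might turn the scheme into an access plan, without opening any connection. The function name, the port table and the exact command strings are illustrative assumptions; a real FTP client also logs in before retrieving the file.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes used on this slide.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def plan_access(url: str) -> tuple[str, int, str]:
    """Return (host, port, retrieval command) implied by the URL's scheme."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"no access method known for scheme {parts.scheme!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    if parts.scheme == "http":
        command = f"GET {parts.path or '/'} HTTP/1.0"
    else:
        # FTP retrieves a file with RETR once the session is open.
        command = f"RETR {parts.path.lstrip('/')}"
    return parts.hostname, port, command


print(plan_access("http://an.org/index.html"))
print(plan_access("ftp://an.org/index.html"))
```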

11
URL Issues
  • Persistence
  • Location dependence
  • Valid only at the item level
  • What about works, expressions, manifestations?
  • Multiple resolution
  • Get the one that is cheapest, most reliable,
    most recent, most appropriate for my hardware,
    etc.
  • Non-digital resources?
  • Disconnection from the entity

12
URC: Uniform Resource Characteristic (Catalog)
  • Failed but interesting effort
  • Multiple resolution
  • Describe resource by its characteristics
  • Provide adequate bundled information about a
    resource (metadata) to create identification
    block for any given resource (including
    locations)
  • Exactly what is the common set of
    characteristics for describing different types of
    resources?
  • Where are these characteristics stored?

13
Robust Hyperlinks
  • Characteristic of document (metadata) is computed
    automatically via fingerprint of its content.
  • Lexical signatures: the top n words of a
    document chosen for rarity, subject to heuristic
    filters to aid robustness.
  • A TF-IDF-like measure
  • Five or so words are sufficient
  • Can be used to locate document (via search
    engine) after it is moved
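
The idea of a lexical signature can be sketched with a plain TF-IDF score. This is an illustrative simplification, not the actual Robust Hyperlinks algorithm: the tokenizer, the smoothing in the IDF formula, and the tiny corpus are all assumptions, and the heuristic filters mentioned above are left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc: str, corpus: list[str], n: int = 5) -> list[str]:
    """Pick the n terms of doc with the highest TF-IDF score against corpus."""
    term_counts = Counter(tokenize(doc))
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))  # count documents, not occurrences

    def score(term: str) -> float:
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[term]))
        return term_counts[term] * idf

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(term_counts, key=lambda t: (-score(t), t))[:n]


corpus = [
    "the river flows to the sea",
    "the ship sails on the sea",
    "the handle system resolves the handle",
]
print(lexical_signature(corpus[2], corpus, n=3))
```

Words that appear in every document (here "the") score zero and drop out, which is what makes the remaining rare terms useful as search-engine queries.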

14
Robust Hyperlinks: Why does this work?
  • Number of terms on Web is reportedly close to
    10,000,000.
  • If terms were distributed independently, the
    probability of 5 even moderately common terms
    occurring in more than one document is very
    small.
  • In fact, picking 3 terms restricted to those
    occurring in 100,000 documents works pretty well.
  • Many documents contain very infrequently used
    words.
  • There is lots of room for independence to be off,
    and to play with term selection for robustness,
    etc.
  • http://www.cs.berkeley.edu/phelps/Robust/

15
URN: Uniform Resource Name
  • Globally unique, persistent names
  • Independence from location and location methods
  • <URN> ::= "urn:" <NID> ":" <NSS>
  • NID: namespace identifier
  • NSS: namespace-specific string
  • Examples:
  • urn:ISSN:1234-5678
  • urn:isbn:9044107642
  • urn:doi:10.1000/140
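
A small sketch of splitting a URN into its parts, assuming the colon-delimited form `urn:<NID>:<NSS>`. The function name is hypothetical. The NID is lowercased because it is case-insensitive, while the NSS is returned untouched.

```python
def parse_urn(urn: str) -> tuple[str, str]:
    """Split 'urn:<NID>:<NSS>' into (NID, NSS)."""
    pieces = urn.split(":", 2)  # the NSS itself may contain further colons
    if len(pieces) != 3 or pieces[0].lower() != "urn" or not pieces[1] or not pieces[2]:
        raise ValueError(f"not a URN: {urn!r}")
    return pieces[1].lower(), pieces[2]


for example in ("urn:ISSN:1234-5678", "urn:isbn:9044107642", "urn:doi:10.1000/140"):
    print(parse_urn(example))
```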

16
Handles: Names for Internet Resources
  • Naming system for location-independent,
    persistent names
  • http://www.handle.net

The resource named by a Handle can be:
  • A library item
  • A collection of library items
  • A catalog record
  • A computer
  • An e-mail address
  • A public key for encryption
  • etc.
17
Syntax of Handles
  • <naming_authority>/<locally_unique_string>
  • or hdl:<naming_authority>/<locally_unique_string>
  • Examples:
  • 10.1234/1995.02.12.16.42.219 (date-time stamp)
  • cornell.cs/cstr-94.45 (mnemonic name)
  • loc/a43v-8940cgr (random string)
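
A minimal sketch of parsing this syntax, assuming the naming authority ends at the first "/" and that the optional prefix is "hdl:". The function name is hypothetical and this is not the Handle Client Library.

```python
def parse_handle(handle: str) -> tuple[str, str]:
    """Split a Handle into (naming authority, locally unique string)."""
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]  # drop the optional scheme-like prefix
    authority, slash, local = handle.partition("/")
    if not slash or not authority or not local:
        raise ValueError(f"not a Handle: {handle!r}")
    return authority, local


print(parse_handle("cornell.cs/cstr-94.45"))
print(parse_handle("hdl:loc/a43v-8940cgr"))
```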
18
Example of a Handle and its Data Used to Identify
Two Locations
  • Handle: loc.ndlp.amrlp/123456
  • Handle data, by data type:
  • URL: http://www.loc.gov/.....
  • RAP: loc/repository-1r4589
19
Use of Handles in a Digital Library
(Diagram components: Repository, User interface,
Search System, Handle System)
20
Scalability and Caching
(Diagram components: Client, Caching Server, Handle
Servers, Hash)
21
Replication for Performance and Reliability
Example: the Global Handle System
(Map labels: Los Angeles, CA; Washington, DC)
22
Global and Local Handle Servers
(Diagram labels: Global; Local Handle Servers)
23
Ways to Resolve Handles: I. Resolution by Program
Any program can resolve Handles by sending
standard format messages to the Handle System. A
set of procedures, with Java and C versions, is
available to link into application programs.
They are known as the Handle Client Library.
24
Ways to Resolve Handles: II. Web Browsers
Browsers modified to recognize Handles. This
requires installation of a Handle Extension.
  1. Whenever the browser expects a URL, it will
     recognize "hdl:".
  2. The Handle is passed to the Handle System,
     where it is resolved and a data item of type
     "URL" is returned.
Handle Extensions for Netscape and Internet Explorer
are available for most versions of Windows.
25
Ways to Resolve Handles: III. Proxies
Any Web browser can resolve Handles, even with no
extension, via a proxy. For example, the
following URL can be used to resolve the Handle
loc.ndlp.amrlp/3a16616:
http://hdl.handle.net/loc.ndlp.amrlp/3a16616
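
Building the proxy URL is simple string composition: the Handle is appended to the proxy's address. In the sketch below the function name is hypothetical, and percent-encoding unusual characters with `urllib.parse.quote` is an assumption rather than something the slides specify.

```python
from urllib.parse import quote

PROXY_BASE = "http://hdl.handle.net/"  # proxy address given on the slide


def proxy_url(handle: str) -> str:
    """Return the URL that asks the proxy to resolve the given Handle."""
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    # Keep "/" literal so the authority/local-name separator survives.
    return PROXY_BASE + quote(handle, safe="/")


print(proxy_url("loc.ndlp.amrlp/3a16616"))
```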
26
Proxy Resolution
(Diagram labels: WWW browser, URL to Proxy, Proxy
server (hdl.handle.net), Handle System, URL, HTTP
server, Resource)
27
OCLC's Persistent URL (PURL)
  • A PURL is a URL
  • -> Is fully compatible with today's Internet
    browsers
  • -> Users need no special software
  • Has some of the desirable features of URNs
  • Lacks some desirable features of URNs
  • -> Resolves only to a URL
  • -> Does not support multiple resolution
  • Developed by OCLC
  • Software openly available
  • http://www.purl.org

28
PURL Syntax
  • A PURL is a URL.
  • PURL resolvers use standard http redirects to
    return the actual URL.
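
The redirect step can be sketched as a table lookup that produces an HTTP status and a Location header. Everything named here is hypothetical: the table contents, the target URL, the function name, and the choice of status 302. It models only the resolver's decision, not a running server or OCLC's software.

```python
# Hypothetical PURL database: PURL path -> current URL of the resource.
PURL_TABLE = {
    "/net/example": "http://www.example.org/current/location.html",
}


def resolve_purl(path: str) -> tuple[int, dict[str, str]]:
    """Return (status code, response headers) for a request to a PURL path."""
    target = PURL_TABLE.get(path)
    if target is None:
        return 404, {}
    # A standard HTTP redirect: the browser follows Location on its own.
    return 302, {"Location": target}


print(resolve_purl("/net/example"))
print(resolve_purl("/net/missing"))
```

When the resource moves, only the table entry changes; the PURL that users hold stays the same.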

29
PURL Namespaces
A PURL provides a local (not global) namespace:
http://purl.oclc.org/keith/home is different
from http://purl.stanford.edu/keith/home
30
OCLC PURL Resolution
(Diagram labels: WWW browser, PURL, PURL server,
PURL database, URL, URL, HTTP server, Resource)
31
Why haven't URNs caught on?
  • Complexity of systems
  • One size does not fit all: special-purpose URN
    schemes have been successful, e.g., PubMed ID,
    Astrophysics BibCode
  • No guarantee of persistence: longevity is an
    organizational, not technical, issue
  • Requires well-regulated administrative systems
  • Absence of killer applications, although
    reference linking is emerging

32
Types: Not all data and content is the same
  • Format or Genre
  • How you sense it
  • What you can do with it
  • e.g., audio, video, map, book
  • Type
  • What you need to process it
  • What is its bit layout
  • Compression or encoding

33
Multipurpose Internet Mail Extensions
  • RFC 822: defines textual format of email messages
  • RFC 2045-2049: extend textual email to allow:
  • Character sets other than US-ASCII
  • Extensible set of non-ASCII types for message
    bodies
  • Definition of multi-part mail (attachments)
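
As an illustration of these extensions, Python's standard `email` package can build a message with a non-ASCII text body and an attachment. The body text, the attachment bytes (a placeholder, not a real image) and the filename are invented for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
# Text body containing characters outside US-ASCII.
msg.set_content("Grüße aus Zürich")
# Adding an attachment turns the message into a multi-part container.
msg.add_attachment(b"GIF89a", maintype="image", subtype="gif", filename="pixel.gif")

part_types = [part.get_content_type() for part in msg.iter_parts()]
body_charset = next(msg.iter_parts()).get_content_charset()

print(msg.get_content_type())
print(part_types)
print(body_charset)
```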

34
MIME Types
  • Two-part type hierarchy
  • Top level type
  • text
  • audio
  • video
  • image
  • application
  • multipart
  • Examples
  • text/plain, image/gif, application/postscript
  • Extensions are handled by IANA

35
MIME in HTTP (Content Negotiation)
  • Accept in request-header
  • Accept: text/plain; q=0.5, text/html, text/x-dvi;
    q=0.8, text/xml
  • text/html and text/xml are preferred (default
    q=1), then text/x-dvi, then text/plain
  • Content-Type in response-header
  • Content-Type: text/html
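
A simplified sketch of how a server might rank the media types in an Accept header. In HTTP a media type with no explicit q parameter has quality 1. The function name is hypothetical, and real content negotiation also handles wildcards and specificity, which are ignored here.

```python
def rank_accept(header: str) -> list[tuple[str, float]]:
    """Return (media type, q) pairs from an Accept header, best first."""
    ranked = []
    for item in header.split(","):
        media_type, *params = [piece.strip() for piece in item.split(";")]
        q = 1.0  # a missing q parameter means full preference
        for param in params:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        ranked.append((media_type, q))
    # Stable sort: types with equal q keep their order in the header.
    ranked.sort(key=lambda pair: -pair[1])
    return ranked


header = "text/plain; q=0.5, text/html, text/x-dvi; q=0.8, text/xml"
for media_type, q in rank_accept(header):
    print(media_type, q)
```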

36
MIME is too limited
  • Two-level type depth is simplistic
  • Multi-media documents
  • Documents that have many types or views
  • FEDORA