Title: Globally Unique Identifiers and Life Science Identifiers
1 Globally Unique Identifiers and Life Science Identifiers
- Dave Thau
- thau_at_learningsite.com
- University of Kansas
- California Academy of Sciences
- www.learningsite.com
2 Outline
- Describe Globally Unique Identifiers
- Show how they're relevant
- Describe one GUID system (LSIDs)
- Outline some issues around using GUIDs for TDWG-related activities
- Provide some resources
- Open discussion
3 GUID Is Not An Ugly Word
"It's guid to be merry and wise, It's guid to be honest and true,"
Robert Burns, "Here's a Health to Them that's Awa"
Pteroptochos tarnii AKA Guidguid
Image From animaldiversity.ummz.umich.edu
4 GUID: Globally Unique Identifier
- A short name for a complex entity
- Useful for locating information about the entity
- Each name identifies only one entity
- There is some sense of permanence
5 Some things which fit this description
- GenBank accession numbers: AP006480.1
- US Patent numbers: 5443036 (laser-guided cat exercise)
- Digital Object Identifiers: 10.121/3212
6 In Our Domain
SDD Document: representing some data set.
<ClassName id="1">
  <Label>
    <Representation language="en">
      <Text>Cypselurus heterurus (Rafinesque, 1810)</Text>
    </Representation>
  </Label>
  <Link>
    <LSID>lsid.gbif.net:www.fishbase.org:1029</LSID>
  </Link>
  <Rank>sp</Rank>
</ClassName>
Napier Schema Document: representing some taxon.
<TaxonConcept id="urn:lsid:bioguid.org:seek:121212" type="original">
  <Name type="scientific">
    <NameSimple>Canis lupus</NameSimple>
  </Name>
  <Relationships>
    <Relationship type="is child of">
      <ToTaxonConcept ref="urn:lsid:bioguid.org:seek:5743" />
    </Relationship>
  </Relationships>
</TaxonConcept>
7 Features of a GUID system
- Global uniqueness, scoped to the Internet
- Should be easily resolvable by a computer or human
- Should identify things down to whatever level of granularity is necessary
- Should not be limited to proprietary systems
- Should serve up all sorts of data
  - Database records
  - Text files
  - Images
- It would be nice if the identifier had associated
metadata
8 Life Science Identifiers
- Official standard of the Object Management Group (OMG)
- Support for metadata and authentication
- Supports multiple protocols (e.g. HTTP, SOAP)
- Can serve up data in any format
- Decentralized: anyone can issue an LSID
- LSID code available in Java and Perl.
- A young standard, but increasingly used.
9 Organizations Using LSIDs
- National Center for Biotechnology Information (NCBI)
  - PubMed
  - GenBank
- European Bioinformatics Institute (EBI)
- US Long Term Ecological Research Network (LTER)
- BioMOBY: a biological database interoperability program (biomoby.org)
- Open Bioinformatics Foundation (open-bio.org)
- myGrid: a BioGRID project (mygrid.org.uk)
10 A Small Pause For More Squid Humor
11 LSID Format
urn:lsid:bioguid.org:seek:117866:v1
- urn: indicates that this is a URN
- lsid: indicates that it's an LSID-type URN
- bioguid.org: the authority that issued the LSID
  - Doesn't have to be a domain name, but for now probably should be.
  - bioguid.org does not necessarily have the data or metadata.
  - There may not even be a machine called bioguid.org.
- seek: a namespace ID internal to that authority
  - The namespace is meaningless to systems outside that authority.
- 117866: the local identifier within that authority
  - Also internal to the authority
- v1: an optional version number
  - If no version, no trailing colon either.
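The parts of an LSID described on this slide can be pulled apart with a few lines of code. This is a minimal sketch, not a reference implementation: the names `LSID` and `parse_lsid` are hypothetical, and the checks cover only what the slide states (the urn and lsid prefixes, three required parts, an optional version, no trailing colon).

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Hypothetical container for the parts of an LSID."""
    authority: str                  # e.g. "bioguid.org"
    namespace: str                  # e.g. "seek", internal to the authority
    object_id: str                  # e.g. "117866", also internal
    version: Optional[str] = None   # e.g. "v1", optional

    def __str__(self) -> str:
        parts = ["urn", "lsid", self.authority, self.namespace, self.object_id]
        if self.version is not None:
            parts.append(self.version)  # no trailing colon when there is no version
        return ":".join(parts)


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its colon-separated parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID-type URN: {text!r}")
    if any(part == "" for part in parts):
        # An empty part also catches a trailing colon with no version.
        raise ValueError(f"empty component in: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], version)


lsid = parse_lsid("urn:lsid:bioguid.org:seek:117866:v1")
print(lsid.authority, lsid.namespace, lsid.object_id, lsid.version)
```

Only the issuing authority can interpret the namespace and local identifier, so a client would normally use just the authority part and pass the rest through unchanged.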
12 Data and Metadata
- An LSID has data
  - Examples
    - The gene sequence in GenBank
    - The actual LTER data set, maybe in Excel, or in a text file
  - The data should never change
- An LSID also has metadata
  - Example metadata
    - The format of the data
    - A display title for clients displaying the LSID
    - Dublin Core metadata
    - Anything you want
  - The metadata can change
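One way to picture the split between fixed data and editable metadata is a record whose data cannot be reassigned while its metadata stays an ordinary dictionary. This is only an illustrative sketch: the class `LSIDRecord`, its fields, and the placeholder bytes are all hypothetical, and nothing here comes from the LSID specification.

```python
from typing import Dict, Optional


class LSIDRecord:
    """Hypothetical in-memory record: fixed data, editable metadata."""

    def __init__(self, lsid: str, data: bytes,
                 metadata: Optional[Dict[str, str]] = None) -> None:
        self._lsid = lsid
        self._data = bytes(data)  # copied into an immutable bytes object
        self.metadata: Dict[str, str] = dict(metadata or {})

    @property
    def lsid(self) -> str:
        return self._lsid

    @property
    def data(self) -> bytes:
        # Read-only: there is no setter, so assignment raises AttributeError.
        return self._data


record = LSIDRecord(
    "urn:lsid:bioguid.org:seek:117866:v1",
    b"placeholder bytes standing in for a data set",
    {"format": "text/plain"},
)

# Metadata may be added or corrected later.
record.metadata["title"] = "A display title for clients"

# The data may not be replaced; changed data would need a new LSID or version.
try:
    record.data = b"different bytes"
except AttributeError:
    print("data behind an LSID is read-only")
```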
13 Example LSIDs
- An LTER fish abundance data set
  - urn:lsid:limnology.wisc.edu:dataset:ntlfi02
- A PubMed reference
  - urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:12441808
- A GenBank sequence
  - urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027
14 How LSIDs work
LSID Client: maybe Launchpad, maybe Haystack, maybe BioFerret, maybe myGRID, maybe yours!
1. Find the authority for this LSID
   - DNS: find the DNS record and resolve it to get the address of the authority
   - Returns the LSID Authority Server
2. Query the authority for available services
   - The LSID Authority returns WSDL for this LSID
3. Choose a service, get the goods
   - Data Store and Metadata Store
   - HTTP, SOAP, FTP, others
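The three steps on this slide can be walked through in code. The sketch below is an offline simulation under loud assumptions: plain dictionaries stand in for DNS and for the WSDL an authority would return, and every name in it (`FAKE_DNS`, `FAKE_AUTHORITIES`, `Service`, `resolve`, the example.org address) is hypothetical. It makes no network calls and does not use any real LSID client library.

```python
from typing import Callable, Dict, List, NamedTuple

LSID_EXAMPLE = "urn:lsid:bioguid.org:seek:117866:v1"


class Service(NamedTuple):
    kind: str                    # "data" or "metadata"
    protocol: str                # e.g. "HTTP", "SOAP", "FTP"
    fetch: Callable[[], bytes]   # stand-in for the real network call


# Stand-in for DNS: authority name -> address of its LSID authority server.
FAKE_DNS: Dict[str, str] = {"bioguid.org": "lsid-authority.example.org"}

# Stand-in for the service descriptions (WSDL) an authority returns per LSID.
FAKE_AUTHORITIES: Dict[str, Dict[str, List[Service]]] = {
    "lsid-authority.example.org": {
        LSID_EXAMPLE: [
            Service("data", "HTTP", lambda: b"<data bytes>"),
            Service("metadata", "SOAP", lambda: b"<metadata document>"),
        ]
    }
}


def find_authority_server(lsid: str) -> str:
    """Step 1: look up the authority named in the LSID."""
    authority = lsid.split(":")[2]
    return FAKE_DNS[authority]


def available_services(server: str, lsid: str) -> List[Service]:
    """Step 2: ask the authority which services exist for this LSID."""
    return FAKE_AUTHORITIES[server][lsid]


def resolve(lsid: str, kind: str) -> bytes:
    """Step 3: choose a service of the wanted kind and fetch from it."""
    server = find_authority_server(lsid)
    for service in available_services(server, lsid):
        if service.kind == kind:
            return service.fetch()
    raise LookupError(f"no {kind} service for {lsid}")


print(resolve(LSID_EXAMPLE, "data"))
print(resolve(LSID_EXAMPLE, "metadata"))
```

Because the client only learns the data and metadata locations in step 2, the authority server, the data store, and the metadata store can all live on different machines.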
15 LSID Promises
- I promise to never change the data behind an LSID.
- I will make sure my LSIDs are being served, or give them to someone who can do it.
- I will give my LSIDs metadata: at least a title and a format.
16 Other GUID systems
- URLs
  - Files move
  - The data change
  - Unstructured metadata
- UUIDs: 128-bit string, guaranteed unique
  - 58f202ac-22cf-11d1-b12d-002035b29092
  - No resolution
  - No metadata
- Handle System / DOIs (10.12/2312)
  - Non-standard protocol
  - Centralized resolution
  - Unstructured metadata (for Handle System)
  - High costs (for DOI)
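The UUID on this slide can be inspected with Python's standard `uuid` module. This short sketch only confirms the 128-bit size and shows how a new UUID is generated locally; it also illustrates the slide's point that the identifier itself carries no address to resolve and no metadata.

```python
import uuid

# The example UUID from the slide.
example = uuid.UUID("58f202ac-22cf-11d1-b12d-002035b29092")

bit_length = len(example.bytes) * 8   # a UUID is 16 bytes
print(bit_length)
print(example.version)                # version field encoded in the UUID

# A new random UUID can be generated locally, with no central registry.
fresh = uuid.uuid4()
print(fresh)

# Nothing in either value says where the identified thing lives or what it is:
# resolution and metadata would have to come from a separate system.
```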
17 Issues For This Community
- What gets a GUID?
- For each of those things, what's the data, what's the metadata?
- One GUID per item?
- Centralization: who issues GUIDs?
18 What Gets a GUID?
- These things probably should get GUIDs
  - Taxonomic concepts
  - Specimens
  - Publications
  - People
- These things might get GUIDs
  - Taxonomic names
  - Journals
  - Data providers
  - Observations
19 Specimen Data? Metadata?
- If specimens get a GUID, what does it identify?
  - The physical specimen?
  - A collections database record of the specimen?
  - What about multiple labels?
- Main question: what doesn't change about a specimen?
- Other main question: how should the data be represented?
  - Darwin Core includes the current institution location. Not a good idea for the data of a GUID, since that may change.
20 One GUID Per Item?
- No GUID system inherently enforces a 1:1 mapping between GUID and data.
- Everyone should TRY to limit the number of GUIDs per item.
- Should there be any centralization to help achieve this?
21 Degrees of Centralization
- An index
  - List your GUID authority in an index so your GUIDs are easy to find.
- A central authority
  - One authority could be responsible for issuing GUIDs to the community for specific types of information; you'd have to get one from here.
    - GBIF?
    - The IC_Ns? (ICZN, ICBN.)
    - lsidauthority.org?
  - This would help enforce a 1:1 mapping of GUIDs and data items
  - It would also relieve data providers of the need to maintain their own authorities
  - It MAY also reduce the likelihood of GUIDs becoming unresolvable
  - It may also be infeasible technically, or socially.
- A respected authority
  - With LSIDs, an authority can be set up to serve its own GUIDs and proxy other authorities.
  - This would help enforce a 1:1 mapping for those who use the authority
  - It may also be more feasible.
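To make the 1:1 mapping idea concrete, here is a minimal sketch of an issuing authority that hands back the existing GUID when the same item is registered twice. Everything in it is hypothetical: the class `GuidIndex`, the example.org authority, the "specimens" namespace, and above all the assumption that callers can supply a canonical key for each item. Agreeing on that key is the hard social and technical problem the slide points at, and this sketch does not solve it.

```python
from typing import Dict


class GuidIndex:
    """Hypothetical issuing authority that keeps one GUID per item key."""

    def __init__(self, authority: str, namespace: str) -> None:
        self._authority = authority
        self._namespace = namespace
        self._by_item: Dict[str, str] = {}
        self._next_id = 1

    def guid_for(self, item_key: str) -> str:
        """Return the item's existing GUID, or issue a new one."""
        existing = self._by_item.get(item_key)
        if existing is not None:
            return existing  # same item, same GUID: the 1:1 mapping
        guid = f"urn:lsid:{self._authority}:{self._namespace}:{self._next_id}"
        self._next_id += 1
        self._by_item[item_key] = guid
        return guid


index = GuidIndex("example.org", "specimens")
first = index.guid_for("institution-X/catalog-1")    # hypothetical item key
again = index.guid_for("institution-X/catalog-1")
other = index.guid_for("institution-X/catalog-2")
print(first, again, other)
```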
22 LSID Resources
- LSID articles and code from IBM
  - http://www-124.ibm.com/developerworks/oss/lsid/whatislsid
- Current LSID specification
  - http://www.omg.org/cgi-bin/doc?dtc/04-05-01
- Launchpad: an LSID resolver for Windows IE
  - available from first link
- A website which resolves LSIDs
  - http://lsid.biopathways.org/resolver/
- URN specification
  - http://www.ietf.org/rfc/rfc2141.txt
23 Acknowledgements
- My work on GUIDs has been funded by the SEEK project (seek.ecoinformatics.org).
- SEEK is funded by National Science Foundation award 0225676.
- Thanks to Ben Szekely at IBM for his LSID articles, his LSID Java code, and for answering all my questions.
24 Questions for Discussion
- Do we need GUIDs?
- What gets a GUID?
- One GUID per item?
- Centralization?