Title: SDD
1. SDD
Structured Descriptive Data
(Talk held Tuesday, 2004-10-12, at the TDWG 2004
meeting in Christchurch, New Zealand)
2. What are descriptive data?
- Descriptive data inform about the state of
  repeatably observable, inherent properties of
  - objects (individual organisms)
  - classes (taxa)
- Specimens in natural history collections are
  special named cases of such instances.
- SDD also considers descriptions of observed
  objects / field data.
3. What are descriptive data?
- Not limited to morphological / ultrastructural
  data. However, these are important because they
  - have a long history of use in biology, reflected
    in a huge knowledge base
  - in most cases are easily observable
  - are easily memorized by human beings
- Specifically, the definition includes
  - chemical/enzymatic features
  - molecular data like nucleic acid / protein sequences,
    RFLP/AFLP patterns, etc.
  - behavior patterns
4. Why are people doing it?
- The driving force behind most of the interest in
descriptive data is the identification of
organisms
5. Why are people doing it?
Phylogenetic Relationships (cladograms)
Taxonomic concepts (species, genera etc)
The real world (The Field)
(From talk by K. Thiele, Library of Life, 2003)
6. Why are people doing it?
Phylogenetic Relationships (cladograms)
Taxonomic concepts (species, genera etc)
Maps: field guides, floras, monographs etc.
The real world (The Field)
(From talk by K. Thiele, Library of Life, 2003)
7. Why are people doing it?
TOL Tree of Life
GBIF-Specimen Services
GBIF-Names Services
Library of Life? Key to Life?
The real world (The Field)
(From talk by K. Thiele, Library of Life, 2003)
8. Species diversity and estimated completeness by taxonomic groups
Described and named
Unknown to science!
(Purvis & Hector 2000)
9. Description of new Species?
- Mycologists inadvertently redescribe already
  known species at the rate of about 2.5:1
  (Hawksworth 1991)
10. Kinds of descriptive data
- Terminology
  - Principal definitions of terms in natural
    language (glossary/ontology)
  - Operational definitions of preferred / selected
    terms (character/state)
  - Can be defined by scientist, but voluntary
    standardization recommended
- Coded description
  - Like a taxon × character spreadsheet (with
    multiple values per cell)
  - Alternative paradigm: a questionnaire form
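The spreadsheet paradigm for coded descriptions can be illustrated with a minimal sketch. This is not SDD's actual format; the taxa, characters, states and the helper names `score` and `cell` are invented for illustration. Each cell of the taxon × character table holds a set, so several states can be scored at once.

```python
from collections import defaultdict

# Hypothetical taxon x character matrix; each cell may hold several states.
coded = defaultdict(dict)

def score(taxon, character, *states):
    """Record one or more states in the cell for this taxon and character."""
    coded[taxon].setdefault(character, set()).update(states)

def cell(taxon, character):
    """Return the states in a cell, or an empty set if nothing was scored."""
    return coded.get(taxon, {}).get(character, set())

score("Taxon A", "leaf arrangement", "alternate")
score("Taxon A", "flower color", "red", "blue")  # multiple values in one cell
score("Taxon B", "leaf arrangement", "opposite")
score("Taxon B", "flower color", "blue")

print(sorted(cell("Taxon A", "flower color")))  # ['blue', 'red']
```

A questionnaire form would present the same data one taxon at a time, with one question per character.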
11. Example form
12. Kinds of descriptive data
- Natural language description
  - Traditional free-form text: not the ideal form,
    but important for legacy publications and
    low-learning-curve collaborations (e.g., through
    wikis)
  - Can be dynamically generated if coded
    descriptions and terminology wordings are present
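Generating a natural language description from coded data plus terminology wordings can be sketched as follows. This is only an illustration under simple assumptions: the data layout, the identifiers and the function `render` are hypothetical and not part of SDD.

```python
# Hypothetical coded description: character id -> list of state ids.
description = {
    "leaf_arrangement": ["alternate"],
    "flower_color": ["red", "blue"],
}

# Hypothetical terminology wordings, one table per language.
wordings = {
    "en": {
        "leaf_arrangement": "leaves",
        "flower_color": "flowers",
        "alternate": "alternate",
        "red": "red",
        "blue": "blue",
        "or": "or",
    },
}

def render(description, wording):
    """Build one short sentence per character from the wording table."""
    sentences = []
    for character, states in description.items():
        state_text = f" {wording['or']} ".join(wording[s] for s in states)
        sentences.append(f"{wording[character].capitalize()} {state_text}.")
    return " ".join(sentences)

print(render(description, wordings["en"]))
# Leaves alternate. Flowers red or blue.
```

Because the text comes entirely from the wording table, adding a second language would only require a second table, not a second description.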
13. Example form
14. Kinds of descriptive data
- Stored / static identification keys
  - Traditional printed di-/polychotomous keys:
    legacy data
  - Can be dynamically generated if coded
    descriptions and terminology wordings are present,
    but manual keys may be better!
  - Tool to capture taxonomists' intuition about
    preferred paths
15. Background
- DELTA
  - Well used standard (> 25 years old!)
  - Quite complex: > 170 directives
  - Legacy problems, outgrown
  - Some principal limitations
  - Natural language descriptions and printed
    dichotomous keys from coded data
  - Interactive identification from coded data
- DELTA II proposal as extension of DELTA
  - Inclusion of taxon names, literature, etc.
- Other relevant standards
  - Lucid Interchange Format (LIF, identification)
  - NEXUS (phylogenetics)
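Interactive identification from coded data, one of the uses listed for DELTA above, amounts to filtering taxa by the states a user has observed. The following is a minimal sketch with invented data; the function name `remaining` and the rule that unscored characters never exclude a taxon are assumptions, not behavior taken from DELTA or Lucid.

```python
# Hypothetical coded data: taxon -> character -> set of possible states.
taxa = {
    "Taxon A": {"leaf arrangement": {"alternate"}, "flower color": {"red", "blue"}},
    "Taxon B": {"leaf arrangement": {"opposite"}, "flower color": {"blue"}},
    "Taxon C": {"leaf arrangement": {"alternate"}, "flower color": {"white"}},
}

def remaining(taxa, observations):
    """Return the taxa compatible with every observed character state.

    A taxon with no score for a character is kept (assumption: missing
    data cannot exclude a taxon).
    """
    result = []
    for name, scores in taxa.items():
        if all(state in scores.get(char, {state})
               for char, state in observations.items()):
            result.append(name)
    return result

print(remaining(taxa, {"leaf arrangement": "alternate"}))
# ['Taxon A', 'Taxon C']
print(remaining(taxa, {"leaf arrangement": "alternate", "flower color": "blue"}))
# ['Taxon A']
```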
16. Background: SDD
- Initiated in 1999 as a revision of DELTA in XML
- Took much longer than expected
- Recently two meetings per year
  - 2002 Fall: Brazil
  - 2003 Spring: Paris
  - 2003 Fall: Lisbon
  - 2004 Spring: Berlin
  - 2004 Fall: Christchurch
- Currently we call SDD 1.0 beta fairly complete
  - Criticism trap: need experience!
- This week
  - Refocus and scale down to a light version 1.0?
  - Preserving forward compatibility!
  - Release complete schema in parallel as 1.1?
17. SDD Schema design philosophy
- Strongly typed
  - Close to object-oriented programming: types
    correspond directly to OO classes
  - Using schema inheritance mechanisms to promote
    extensibility and ease of evolution
- Attempt at intuitiveness of type/element names
  - Less concerned with human-readability and
    compact XML text
- Using object relations!
  - Definitions with id, references with a ref
    attribute instead of labels. Validated by
    identity constraints, so use in the correct
    context is validated. To ease OOP, IDs are also
    typed (CharacterRelationID, etc.).
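The id/ref idea with typed IDs can be mirrored in application code. The sketch below is an assumption about how software might model it, not the SDD schema itself: the names `CharacterID`, `StateID`, `Statement` and `check_refs` are hypothetical, and the check only imitates what an XML Schema identity constraint would enforce.

```python
from dataclasses import dataclass, field
from typing import List, NewType

# Typed identifiers (hypothetical names, in the spirit of "CharacterRelationID").
# NewType is enforced by static type checkers only; at runtime these are str.
CharacterID = NewType("CharacterID", str)
StateID = NewType("StateID", str)

@dataclass
class State:
    id: StateID
    label: str

@dataclass
class Character:
    id: CharacterID
    label: str
    states: List[State] = field(default_factory=list)

@dataclass
class Statement:
    """A coded statement that points to its definitions by ref, not by label."""
    character_ref: CharacterID
    state_ref: StateID

def check_refs(characters, statements):
    """Imitate an identity constraint: each ref must resolve in the right context."""
    errors = []
    by_id = {c.id: c for c in characters}
    for s in statements:
        character = by_id.get(s.character_ref)
        if character is None:
            errors.append(f"unknown character ref: {s.character_ref}")
        elif s.state_ref not in {st.id for st in character.states}:
            errors.append(f"state {s.state_ref} not defined for {character.id}")
    return errors

color = Character(CharacterID("c1"), "flower color", [State(StateID("s1"), "red")])
shape = Character(CharacterID("c2"), "leaf shape", [State(StateID("s2"), "broad")])
ok = Statement(CharacterID("c1"), StateID("s1"))
wrong_context = Statement(CharacterID("c1"), StateID("s2"))  # state of another character
print(check_refs([color, shape], [ok, wrong_context]))
# ['state s2 not defined for c1']
```

The second statement uses an existing state ID, but in the wrong context, which is the kind of error a label-based format cannot catch.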
18. Some SDD Requirements
- Should be a complete format for scientific data,
  not optimized for a specific purpose
  - Lucid LIF is optimized for simple identification;
    data already pre-processed during building
    process
- Should describe taxa, specimens, observations,
  media such as images
- Structured, analyzable state annotations or
  nuances (in addition to free-form notes)
- Not bound to biological knowledge domain
  - Medicine, pathology, archeology, musical
    instruments, restaurants
- Multilingual design
  - DELTA was already able to handle multiple
    languages. Problems with comments were to be
    addressed in DELTA II.
  - SDD (not UBIF!) extends language with audience
19. UBIF
- SDD has a strong need to use objects from other
  knowledge domains
  - class names and hierarchy
  - collected or observed specimen objects
  - agent data (person/organization)
  - geographical data
  - publication (description may be digitized from a
    publication or cite published information)
  - media resources, for example images
- If all these were used as fully developed
  data models (ABCD, TCS, MARC), SDD would become
  even more complex than it is.
20. UBIF
- Instead, we need a simplified abstraction
  framework
  - This is not specific to the purpose of SDD and
    should be elaborated with other modeling groups
  - UBIF (Universal Biosciences Information
    Framework) is an attempt to do so
  - UBIF is under development (SDD, ABCD, TCS)
- For the purpose of SDD, it is best to learn UBIF
  by example!
  - See also separate talk about UBIF held at TDWG
    2004!
21. SDD is a UBIF application
UBIF proxy data
22. UBIF provides Metadata and Relations to external objects
23. SDD inside UBIF
24. SDD Schema
- Different kinds of characters are being used
- Categorical characters
  - Enumeration of values defining categories
  - Examples: red/green/blue, broad/narrow,
    alternate/opposite
- Quantitative (numerical) characters
  - Measures with or without measurement unit
  - Actual values or statistical summary like mean,
    std. dev., sample size
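For quantitative characters, the relation between actual values and a statistical summary can be sketched as follows. The class `QuantitativeSummary`, the function `summarize` and the measurements are invented for illustration, and the choice of the sample standard deviation is an assumption.

```python
import statistics
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantitativeSummary:
    mean: float
    std_dev: float
    sample_size: int
    unit: Optional[str] = None  # measures may come with or without a unit

def summarize(values, unit=None):
    """Collapse actual measurements into mean, sample std. dev. and sample size."""
    return QuantitativeSummary(
        mean=statistics.mean(values),
        # stdev needs at least two values; report 0.0 for a single measurement
        std_dev=statistics.stdev(values) if len(values) > 1 else 0.0,
        sample_size=len(values),
        unit=unit,
    )

leaf_lengths = [4.0, 5.0, 6.0]  # invented measurements
summary = summarize(leaf_lengths, unit="cm")
print(summary.mean, summary.std_dev, summary.sample_size)  # 5.0 1.0 3
```

A format that stores only the summary cannot recover the individual values, which is one reason to allow both forms.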
26. Example form
Categorical and quantitative (numerical) characters
in conventional DeltaAccess web forms
27. Character types
- Color range measurements
  - Color range defined as area in color space
  - (Aside: Use triangle, circle, polygon?)
- Parameterized/numeric shape functions
- Various molecular data
  - Specially structured data for AFLP pattern
    data, etc.
  - Multiple characters may be needed
  - Nothing done yet, only conceptual, to make sure
    SDD is extensible!
- etc.
28. Abstract characters
- Different character types are all based on an
  abstract character
  - Categorical characters
  - Quantitative (numerical) characters
  - Color range measurements
- Currently as choice, rather than using the XML Schema
  xsi:type mechanism: social question!
- However, OOP software would benefit from
  implementing it through type polymorphism
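What type polymorphism over an abstract character could look like in OOP software is sketched below. The class names and the `accepts` method are hypothetical and do not come from the SDD schema.

```python
from abc import ABC, abstractmethod

class AbstractCharacter(ABC):
    """Common base for all character types (hypothetical model)."""

    def __init__(self, label):
        self.label = label

    @abstractmethod
    def accepts(self, value):
        """Return True if value is a valid score for this character."""

class CategoricalCharacter(AbstractCharacter):
    def __init__(self, label, states):
        super().__init__(label)
        self.states = set(states)

    def accepts(self, value):
        return value in self.states

class QuantitativeCharacter(AbstractCharacter):
    def __init__(self, label, unit=None):
        super().__init__(label)
        self.unit = unit

    def accepts(self, value):
        # bool is a subclass of int, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)

characters = [
    CategoricalCharacter("leaf arrangement", ["alternate", "opposite"]),
    QuantitativeCharacter("leaf length", unit="cm"),
]
# The caller needs no type switch; each subclass supplies its own check.
for character, value in zip(characters, ["opposite", 4.2]):
    print(character.label, character.accepts(value))
```

A new character type, such as a color range, would be added as another subclass without changing the calling code.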
29. Mapping between character types
30. Modifiers
- Modifiers act on statements
  - character has state / measure
  - variable has value
- Examples
  - frequency: rarely, usually, often
  - probability: perhaps, certainly, and (special
    case of "certainly not") by misinterpretation
  - spatial: at the top, at the base
  - temporal: in spring, in autumn, when young
  - degree: strongly
- Abstract base type and derived concrete types
  defined, hence extensible
- Complication: character modifiers vs. state
  modifiers
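Modifiers acting on a statement, with a base type and derived concrete types, can be sketched as follows. All class names, the `before_state` flag and the word order produced by `text` are assumptions made for this illustration, not SDD definitions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Modifier:
    """Base type; concrete modifier kinds derive from it (hypothetical model)."""
    wording: str
    before_state = True  # class attribute without annotation, so not a dataclass field

class FrequencyModifier(Modifier):
    pass

class SpatialModifier(Modifier):
    before_state = False  # assumption: spatial wordings follow the state

class TemporalModifier(Modifier):
    before_state = False

@dataclass
class Statement:
    """'character has state', qualified by any number of modifiers."""
    character: str
    state: str
    modifiers: List[Modifier] = field(default_factory=list)

    def text(self):
        before = [m.wording for m in self.modifiers if m.before_state]
        after = [m.wording for m in self.modifiers if not m.before_state]
        return " ".join([self.character, *before, self.state, *after])

s = Statement("leaves", "hairy",
              [FrequencyModifier("usually"), SpatialModifier("at the base")])
print(s.text())  # leaves usually hairy at the base
```

Because each modifier keeps its type, software can still analyze the statement, for example by selecting only the frequency modifiers, instead of parsing free-form notes.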
31. Frequently asked questions that have never been asked
- SDD is intimidating
- When you find things unintuitive, do ask
- Nobody is expected to have read all the documentation
- This meeting is for review, not politeness