Title: Metacat and EcoGrid: metadata and data management
1. Metacat and EcoGrid: metadata and data management
- Second KNB Data Management Workshop
- Matthew Jones
- National Center for Ecological Analysis and Synthesis
- University of California, Santa Barbara
2. Metacat
- Flexible system for storing and accessing metadata and data
3. Roadmap
- Part I: Introduction to Metacat capabilities
  - Overview
  - Metacat web interface
  - Registries and Repositories
  - Features
  - Questions
- Part II: Metacat Design and Architecture
  - Architecture overview
  - Storage subsystem
  - Query subsystem
  - Transformation subsystem
  - Authentication subsystem
  - Other subsystems
  - Client API
- Part III: EcoGrid
4. Metacat
- Flexible storage system for metadata and data
- Stores arbitrary metadata documents (requires XML)
- Supports structured searches
- Customizable web interface
- Replication capabilities
- Works on Linux, Windows, MacOS
- Database backends: Oracle, PostgreSQL, MS SQL Server
5. KNB Overview
[Figure: KNB architecture diagram showing clients and server]
6. Building the KNB network
[Figure: network diagram. Key: Metacat Catalog; Morpho clients; Web clients; Site metadata system; XML output filter]
7. Knowledge Network for Biocomplexity
- LTER Network (24)
- Organization of Biological Field Stations (180)
- UC Natural Reserve System (36)
- Partnership for Interdisciplinary Studies of Coastal Oceans (4)
- Multi-agency Rocky Intertidal Network (60)
[Figure: map of participating sites. Key: Metacat node; Site-specific node]
8. Metacat web user interface
9. Metacat UI is reconfigurable
10. Data Registries
- UC NRS Information System
- NRS Network, NCEAS
- Resource Discovery Initiative for Field Stations (RDIFS)
- LTER Network, OBFS Network, NCEAS, San Diego Supercomputer Center, University of Kansas
- Use Metacat
- Web-based metadata entry
11. Metacat features
- Metadata
- Store and search any XML-formatted metadata
- Ecological Metadata Language (EML)
- NBII Biological Data Profile
- FGDC CSDGM
- Site specific formats
- Metadata validation
- Configure to accept particular metadata formats
- Enforces access control rules
- Metadata conversion (using XSLT)
- To HTML for presentation
- To other metadata formats (e.g., NBII)
- Data
- Storage
- Access control
12. Metacat implementation
- Java servlet for portability
- Linux, Windows 2000, MacOS X
- HTTP access
- Standard POST and GET queries
- Web HTML interface via XSLT transforms
- Separates content from presentation
- Interfaces with RDBMS for storage
- Oracle, PostgreSQL, (SQL Server) backend
13. Advanced features
- Replication
- Synchronize content between two Metacat servers
- Harvesting
- Scheduled pull of XML documents from web sources
14. Questions
- Discussion?
- Questions/Comments?
16. Metacat architecture
18. Storage subsystem
- Storage
- XML metadata stored in relational db
- Oracle, PostgreSQL, (SQL Server)
- Data object storage on filesystem
- Data viewed as opaque objects
- Assigned unique identifier
- Metadata describes data structure and semantics
19. Versioning and Identifiers
- Metacat prescribes a format for identifiers
- Identifiers are a contract regarding uniqueness
- Incorporates both identity and version
- Two data streams with the same ID are defined as identical
- insert requires a unique ID
- update requires an existing ID with a new revision
- Will be adopting Life Science Identifiers (LSID)
- E.g., urn:lsid:lsid.ecoinformatics.org:obfs:263
[Figure: anatomy of the identifier obfs.23.4: scope (obfs), identifier (23), revision (4)]
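The identifier rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DocId`, `parse_docid`, and `is_allowed` are hypothetical, and the rule that an update must carry a higher revision number is one reading of "a new revision", not a statement of how Metacat enforces it.

```python
from typing import Dict, NamedTuple, Tuple


class DocId(NamedTuple):
    scope: str
    identifier: int
    revision: int


def parse_docid(docid: str) -> DocId:
    """Split 'scope.identifier.revision', e.g. 'obfs.23.4'.

    Splitting from the right lets the scope itself contain hyphens or dots.
    """
    scope, identifier, revision = docid.rsplit(".", 2)
    return DocId(scope, int(identifier), int(revision))


def is_allowed(action: str, docid: str,
               latest: Dict[Tuple[str, int], int]) -> bool:
    """Check an insert or update against known documents.

    `latest` maps (scope, identifier) to the newest stored revision.
    Assumption: an update needs a revision higher than the stored one.
    """
    doc = parse_docid(docid)
    key = (doc.scope, doc.identifier)
    if action == "insert":
        return key not in latest
    if action == "update":
        return key in latest and doc.revision > latest[key]
    raise ValueError(f"unsupported action: {action}")


stored = {("obfs", 23): 4}
print(parse_docid("obfs.23.4"))
print(is_allowed("update", "obfs.23.5", stored))
```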
20. Storage actions
- Read
- Download a document
- Insert
- Put a new XML document in the database
- Update
- Replace an existing XML document with a new version, incrementing the revision in the identifier
- Delete
- Archive a document so that it does not show up in searches
- Upload
- Put a binary or other non-XML file in the file system
21. Reading a document
- Simple HTTP GET or POST request
- http://a.com/knb/metacat?action=read&docid=knb-lter-gce.109.5
- Return document is in XML format by default
- Login is optional
- If you don't log in, you have public privileges
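As a minimal sketch, the read request above can be assembled with Python's standard library. The helper name `read_url` is hypothetical, and `a.com` is the placeholder host from the slide; nothing is sent over the network here.

```python
from urllib.parse import urlencode


def read_url(base_url: str, docid: str) -> str:
    """Build the GET URL for Metacat's read action."""
    return base_url + "?" + urlencode({"action": "read", "docid": docid})


print(read_url("http://a.com/knb/metacat", "knb-lter-gce.109.5"))
```

The same two parameters could be sent as a POST body instead, since the slide notes that either method works.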
22. A simple web client
23. Schema-independence
- Most relational dbs support only one data model
- This makes maintenance expensive as models change
- Metacat is schema independent
- Any XML document, regardless of schema, can be stored without modifications to Metacat
- Metacat's data model follows the XML Document Object Model (DOM)
- Thus, it models the XML structure rather than the data schema
24. Yeah, so what?
- You can throw whatever you need in Metacat
- (without schema or software changes)
- And you can query it
    <?xml version="1.0"?>
    <poll>
      <favoriteOS id="1">MacOS</favoriteOS>
      <favoriteOS id="2">Linux</favoriteOS>
      <favoriteOS id="3">Linux</favoriteOS>
      <favoriteOS id="4">Linux</favoriteOS>
      <favoriteOS id="5">WinXP</favoriteOS>
      <favoriteOS id="6">Linux</favoriteOS>
      <favoriteOS id="7">Linux</favoriteOS>
    </poll>
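To make the schema-independence idea concrete, here is a rough Python sketch that flattens the poll document into generic node rows and then queries them. The row layout (node id, parent id, node type, name, value) is an assumption made for illustration and is not Metacat's actual table definition; `to_node_rows` and `count_values` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET
from collections import Counter

POLL = """<?xml version="1.0"?>
<poll>
  <favoriteOS id="1">MacOS</favoriteOS>
  <favoriteOS id="2">Linux</favoriteOS>
  <favoriteOS id="3">Linux</favoriteOS>
  <favoriteOS id="4">Linux</favoriteOS>
  <favoriteOS id="5">WinXP</favoriteOS>
  <favoriteOS id="6">Linux</favoriteOS>
  <favoriteOS id="7">Linux</favoriteOS>
</poll>"""


def to_node_rows(xml_text):
    """Flatten any XML document into (id, parent_id, type, name, value) rows."""
    rows = []

    def walk(elem, parent_id):
        node_id = len(rows) + 1
        rows.append((node_id, parent_id, "ELEMENT", elem.tag, None))
        for name, value in elem.attrib.items():
            rows.append((len(rows) + 1, node_id, "ATTRIBUTE", name, value))
        text = (elem.text or "").strip()
        if text:
            rows.append((len(rows) + 1, node_id, "TEXT", None, text))
        for child in elem:
            walk(child, node_id)

    walk(ET.fromstring(xml_text), None)
    return rows


def count_values(rows, element_name):
    """Count the text values found directly under elements with this name."""
    element_ids = {r[0] for r in rows
                   if r[2] == "ELEMENT" and r[3] == element_name}
    return Counter(r[4] for r in rows
                   if r[2] == "TEXT" and r[1] in element_ids)


rows = to_node_rows(POLL)
print(count_values(rows, "favoriteOS").most_common())
```

Because the rows describe XML structure rather than a poll schema, the same two functions would work unchanged on an EML document or any other XML input.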
26. Query subsystem
- Two means of submitting queries
- Structured query action (squery)
- Custom query syntax in XML format (pathquery)
- Query action (query)
- Query parameters passed as URL-encoded form parameters (i.e., from an HTML form)
- Metacat builds a pathquery document automatically
27. HTML form queries (query action)
- Parameters passed as form elements
- Special fields
- action, qformat, operator
- returnfield, returndoctype
- anyfield
- Other fields create additional conditions
- Metacat builds the query from the fields
28. Example HTML form
    <form method="POST" action="@servlet-path@" target="_top">
      Search for
      <input name="action" value="query" type="hidden">
      <input name="operator" value="INTERSECT" type="hidden">
      <input name="anyfield" type="text" value="" size="14">
      <input name="organizationName" value="Organization of Biological Field Stations" type="hidden">
      <input name="qformat" value="xml" type="hidden">
      <input name="returnfield" value="creator/individualName/surName" type="hidden">
      <input name="returnfield" value="creator/individualName/givenName" type="hidden">
      <input name="returnfield" value="creator/organizationName" type="hidden">
      <input name="returnfield" value="dataset/title" type="hidden">
      <input name="returnfield" value="keyword" type="hidden">
      <input name="returndoctype" value="eml://ecoinformatics.org/eml-2.0.1" type="hidden">
      <input value="Start Search" type="submit">
    </form>
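The form above boils down to a URL-encoded parameter list in which `returnfield` repeats. A small Python sketch of the same request body follows; `build_query_body` is a hypothetical helper, and treating any extra keyword argument as an additional condition mirrors the "other fields create additional conditions" rule rather than any documented client function.

```python
from urllib.parse import parse_qs, urlencode


def build_query_body(anyfield, returnfields, returndoctype,
                     operator="INTERSECT", qformat="xml", **conditions):
    """Encode the special fields plus any extra condition fields."""
    pairs = [
        ("action", "query"),
        ("operator", operator),
        ("anyfield", anyfield),
        ("qformat", qformat),
    ]
    # returnfield may appear several times, so use a list of pairs, not a dict.
    pairs += [("returnfield", field) for field in returnfields]
    pairs.append(("returndoctype", returndoctype))
    pairs += list(conditions.items())
    return urlencode(pairs)


body = build_query_body(
    "soil",
    ["creator/individualName/surName", "dataset/title", "keyword"],
    "eml://ecoinformatics.org/eml-2.0.1",
    organizationName="Organization of Biological Field Stations",
)
print(body)
```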
29. Query Result Set Structure
    <?xml version="1.0"?>
    <resultset>
      <document>
        <docid>knb.2.1</docid>
        <docname>eml</docname>
        <doctype>eml://ecoinformatics.org/eml-2.0.1</doctype>
        <doctitle>Soil profiles from lower Yosemite Valley</doctitle>
        <createdate>2000-06-10 12:54:07</createdate>
        <updatedate>2000-06-10 12:54:07</updatedate>
        <param name="/eml/dataset/creator/individualName/surName">Levings</param>
        <param name="/eml/dataset/creator/individualName/surName">Shriver</param>
        <param name="/eml/dataset/keywordSet/keyword">strata</param>
        <param name="/eml/dataset/keywordSet/keyword">mineralization</param>
      </document>
      <document>
        ...
      </document>
    </resultset>
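A client consuming this result set might group the repeated `param` elements by path. The following Python sketch does that with the standard library; `parse_resultset` and the shape of the returned dictionaries are illustrative assumptions, and the sample XML is a trimmed copy of the slide's example.

```python
import xml.etree.ElementTree as ET

RESULT = """<?xml version="1.0"?>
<resultset>
  <document>
    <docid>knb.2.1</docid>
    <docname>eml</docname>
    <doctype>eml://ecoinformatics.org/eml-2.0.1</doctype>
    <doctitle>Soil profiles from lower Yosemite Valley</doctitle>
    <param name="/eml/dataset/creator/individualName/surName">Levings</param>
    <param name="/eml/dataset/creator/individualName/surName">Shriver</param>
    <param name="/eml/dataset/keywordSet/keyword">strata</param>
    <param name="/eml/dataset/keywordSet/keyword">mineralization</param>
  </document>
</resultset>"""


def parse_resultset(xml_text):
    """Return one dict per <document>, with param values grouped by path."""
    documents = []
    for doc in ET.fromstring(xml_text).findall("document"):
        fields = {}
        for param in doc.findall("param"):
            fields.setdefault(param.get("name"), []).append(param.text)
        documents.append({
            "docid": doc.findtext("docid"),
            "title": doc.findtext("doctitle"),
            "fields": fields,
        })
    return documents


for record in parse_resultset(RESULT):
    print(record["docid"], record["title"])
```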
30. Structured queries (squery action)
- Pathquery syntax
- Can build precise queries against arbitrary metadata schemas
- Boolean combinations of conditions (AND, OR)
- Uses XPath-like syntax
- Specify document types to search
- Specify fields to return in resultset
31. Query Conditions
- Language-independent representation of a query structure
- Transformed into the appropriate native language of the data store
- Example
    <querygroup operator="UNION">
      <queryterm searchmode="contains" casesensitive="false">
        <value>soil</value>
        <pathexpr>dataset/title</pathexpr>
      </queryterm>
      <queryterm searchmode="contains" casesensitive="false">
        <value>nutrients</value>
        <pathexpr>dataset/title</pathexpr>
      </queryterm>
    </querygroup>
32. Specifying the Resultset
- Specify the list of fields to be returned in the resultset
- Simple paths used to identify elements or document subtrees
- Effectively flattens the structure of the records, but allows generic representation (i.e., multiple standards)
- Example
    <returnfield>dataset/title</returnfield>
    <returnfield>creator/individualName/surName</returnfield>
    <returnfield>keyword</returnfield>
33. Full Query Example
    <?xml version="1.0"?>
    <pathquery version="1.2">
      <querytitle>Soil search</querytitle>
      <returndoctype>eml://ecoinformatics.org/eml-2.0.0</returndoctype>
      <returnfield>creator/individualName/surName</returnfield>
      <returnfield>keyword</returnfield>
      <querygroup operator="UNION">
        <queryterm searchmode="contains" casesensitive="false">
          <value>soil</value>
        </queryterm>
        <queryterm searchmode="contains" casesensitive="false">
          <value>nutrients</value>
        </queryterm>
      </querygroup>
    </pathquery>
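A pathquery document like the one above can also be generated programmatically. The sketch below uses Python's `xml.etree.ElementTree`; the function `build_pathquery` and its argument layout are hypothetical, and the element and attribute names are taken only from the slide's example rather than from a schema.

```python
import xml.etree.ElementTree as ET


def build_pathquery(title, doctype, returnfields, terms, operator="UNION"):
    """Build a pathquery string.

    `terms` is a list of (value, pathexpr) pairs; pathexpr may be None
    to leave the term unrestricted, as in the slide's full example.
    """
    root = ET.Element("pathquery", version="1.2")
    ET.SubElement(root, "querytitle").text = title
    ET.SubElement(root, "returndoctype").text = doctype
    for field in returnfields:
        ET.SubElement(root, "returnfield").text = field
    group = ET.SubElement(root, "querygroup", operator=operator)
    for value, pathexpr in terms:
        term = ET.SubElement(group, "queryterm",
                             searchmode="contains", casesensitive="false")
        ET.SubElement(term, "value").text = value
        if pathexpr:
            ET.SubElement(term, "pathexpr").text = pathexpr
    return ET.tostring(root, encoding="unicode")


query_xml = build_pathquery(
    "Soil search",
    "eml://ecoinformatics.org/eml-2.0.0",
    ["creator/individualName/surName", "keyword"],
    [("soil", None), ("nutrients", "dataset/title")],
)
print(query_xml)
```

Building the document through an XML library rather than string concatenation keeps values such as `&` or `<` correctly escaped.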
36. Transforming a document
- Used to convert a document before returning it
- Conversion uses XSLT
- The skin, selected via the qformat parameter, determines which style sheet is used
- http://knb.ecoinformatics.org/knb/metacat?action=read&docid=knb-lter-gce.109.5&qformat=ltss
- The document is transformed and then returned
- The skinname.xml file controls the mappings that determine which style sheet to use
38. Authentication subsystem
- Actions: login and logout
- Simple username/password system
- Successful login creates a session
- Session ID tracked using an HTTP cookie
39. Authentication plugins
- Delegates authentication requests to a backend service via a plugin
- Lightweight Directory Access Protocol (LDAP)
- Replaceable to interface with other systems
- Metacat admin can choose which LDAP server to use
40. Ecoinformatics.org LDAP
- Need for community-wide user identities
- Distributed system for participating institutions
- Root LDAP server refers requests to specific organizations for authentication
[Figure: LDAP tree rooted at dc=ecoinformatics,dc=org with branches o=NCEAS, o=UCNRS, o=LTER, o=unaffiliated]
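User identities in this tree are LDAP distinguished names. As a rough sketch, the Python below composes such a name and extracts the organization that a root server would refer the request to. Both helpers are hypothetical, they do no escaping of special characters, and they perform no actual LDAP lookup.

```python
def build_dn(uid, organization, domain_components=("ecoinformatics", "org")):
    """Compose a distinguished name under the ecoinformatics.org root."""
    parts = [f"uid={uid}", f"o={organization}"]
    parts += [f"dc={dc}" for dc in domain_components]
    return ",".join(parts)


def organization_of(dn):
    """Return the o= component of a distinguished name, or None."""
    for part in dn.split(","):
        key, _, value = part.strip().partition("=")
        if key.lower() == "o":
            return value
    return None


dn = build_dn("jones", "NCEAS")
print(dn)
print(organization_of(dn))
```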
42. Other subsystems
- action=validate
- Parameters: valtext, docid
- action=setaccess
- Parameters: docid, principal, permission, permType, permOrder
- action=getversion
- action=getlog
- Parameters: ipaddress, principal, docid, event, start, end
- http://68.111.43.225:8080/knb/metacat?action=getlog&event=insert
44. Client API
- Application Programming Interface (API)
- Defines language-specific binding for communicating with Metacat
- Available in Java and Perl (partial Python support)
- Allows development of new applications
- Allows integration of Metacat with existing applications
- Simple set of method calls
45. Basic Client API
- public String login(String username, String password)
- public String logout()
- public Reader read(String docid)
- public Reader query(Reader xmlQuery)
- public String insert(String docid, Reader xmlDocument, Reader schema)
- public String update(String docid, Reader xmlDocument, Reader schema)
- public String delete(String docid)
- public String upload(String docid, File file)
- public String upload(String docid, String fileName, InputStream fileData, int size)
46. Example use of client
    String metacatUrl = "http://foo.com/context/metacat";
    String username = "uid=jones,o=NCEAS,dc=ecoinformatics,dc=org";
    String password = "neverHardcodeAPasswordInCode";
    try {
        // Connect to the server, then authenticate to open a session
        Metacat m = MetacatFactory.createMetacatConnection(metacatUrl);
        m.login(username, password);
        // Read one document by its scope.identifier.revision docid
        Reader r = m.read("testdocument.1.1");
        // Do whatever you want with Reader r
    } catch (MetacatAuthException mae) {
        handleError("Authorization failed\n" + mae.getMessage());
    } catch (MetacatInaccessibleException mie) {
        handleError("Metacat Inaccessible\n" + mie.getMessage());
    } catch (Exception e) {
        handleError("General exception\n" + e.getMessage());
    }
47. Overview
48. Documentation
50. Grid Services
- A Grid service is a Web service plus:
- Lifecycle management (persisting the service over outages)
- State management (tracking sessions across multiple requests)
- Factory services (allowing many clients to connect)
- Security (authorization)
51. Web services
- Service Oriented Architecture (SOA)
- Remote discovery and execution of services
- Network transport of data (HTTP)
- Message format (SOAP/XML)
- Service interface description (WSDL)
[Figure: web services architecture with numbered steps 1 to 3. Diagram from http://www.w3.org/TR/2002/WD-ws-arch-20021114/]
52. SEEK EcoGrid
- Goal: allow diverse environmental data systems to interoperate
- Hides complexity of underlying systems using lightweight interfaces
- Integrate diverse data networks from ecology, biodiversity, and environmental sciences
- Data systems
- Any system can implement these interfaces
- Prototyping using Metacat, SRB, DiGIR, Xanthoria, etc.
- Supports multiple metadata standards
- EML, Darwin Core as foci
53. EcoGrid example
54. EcoGrid Query Interfaces
- Provides a mechanism for search and retrieval of metadata and federated data
- Supports third-party interaction with search results
- Forwarding of result set identifiers to another service instance for retrieval
- Different levels of compliance
- Low barrier for participation
- Bulk of data will be accessible through Type I
55. EcoGrid Query Level I
- Basic, entry-level exposure of data and metadata for EcoGrid and SEEK
- Response contains data intended for direct communications rather than third-party indirection
- ResultsetType query(SessionID, QueryType)
- byte[] get(SessionID, objectID)
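The two Level I operations are small enough to mock. The Python class below is a toy in-memory stand-in, not the SEEK implementation: `InMemoryEcoGridNode` is a hypothetical name, a case-insensitive substring match stands in for the structured `QueryType`, the session ID is accepted but not checked, and the sample records are invented.

```python
from typing import Dict, List


class InMemoryEcoGridNode:
    """Toy service exposing query() and get() over a dict of byte objects."""

    def __init__(self, objects: Dict[str, bytes]):
        self._objects = dict(objects)

    def query(self, session_id: str, text: str) -> List[str]:
        """Return the IDs of objects whose bytes contain `text`."""
        needle = text.lower().encode("utf-8")
        return sorted(object_id
                      for object_id, data in self._objects.items()
                      if needle in data.lower())

    def get(self, session_id: str, object_id: str) -> bytes:
        """Return the raw bytes of one object."""
        return self._objects[object_id]


node = InMemoryEcoGridNode({
    "mvz1": b"Peromyscus maniculatus",   # invented sample record
    "mvz2": b"Microtus californicus",    # invented sample record
})
for object_id in node.query("session-1", "peromyscus"):
    print(object_id, node.get("session-1", object_id))
```

Keeping the interface to these two calls is what gives Level I its low barrier for participation: any data system that can search and return bytes can sit behind it.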
56. Query Conditions
- Language-independent representation of a query structure
- Transformed into the appropriate native language of the data store
- Example
    <AND>
      <condition operator="LIKE" concept="ScientificName">peromyscus</condition>
      <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
    </AND>
57. Specifying the Resultset
- Specify the list of concepts (fields) to be returned in the resultset
- Simple paths used to identify elements or document subtrees
- Effectively flattens the structure of the records, but allows generic representation
- Example
    <returnfield>/ScientificName</returnfield>
    <returnfield>/Longitude</returnfield>
    <returnfield>/Latitude</returnfield>
58. Full Query Example
    <egq:query queryId="query-digir.1.1" system="http://knb.ecoinformatics.org"
        xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
      <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
      <returnfield>/ScientificName</returnfield>
      <returnfield>/Longitude</returnfield>
      <returnfield>/Latitude</returnfield>
      <title>Peromyscus genus query</title>
      <condition operator="LIKE" concept="Genus">Peromyscus</condition>
    </egq:query>
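For comparison with the pathquery format, here is a hedged Python sketch that emits an EcoGrid query with a namespaced root element. The namespace URI and element names are copied from the slide's example; `build_ecogrid_query` is a hypothetical helper, and it omits the `namespace` child and the `xsi:schemaLocation` attribute for brevity.

```python
import xml.etree.ElementTree as ET

EGQ_NS = "ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
# Ask ElementTree to serialize the namespace with the egq prefix.
ET.register_namespace("egq", EGQ_NS)


def build_ecogrid_query(query_id, system, title, returnfields, conditions):
    """Build an egq:query string.

    `conditions` is a list of (operator, concept, value) triples.
    """
    root = ET.Element(f"{{{EGQ_NS}}}query",
                      {"queryId": query_id, "system": system})
    for field in returnfields:
        ET.SubElement(root, "returnfield").text = field
    ET.SubElement(root, "title").text = title
    for operator, concept, value in conditions:
        condition = ET.SubElement(
            root, "condition", {"operator": operator, "concept": concept})
        condition.text = value
    return ET.tostring(root, encoding="unicode")


ecogrid_xml = build_ecogrid_query(
    "query-digir.1.1",
    "http://knb.ecoinformatics.org",
    "Peromyscus genus query",
    ["/ScientificName", "/Longitude", "/Latitude"],
    [("LIKE", "Genus", "Peromyscus")],
)
print(ecogrid_xml)
```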
59. Query Result Set Structure
    <rs:resultset resultsetId="foo.1.1"
        system="urn:not://sure/what/to/put/here"
        xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
      <resultsetMetadata>
        <sendTime>2003-05-02T16:45:50-0900</sendTime>
        <startRecord>1</startRecord>
        <endRecord>2</endRecord>
        <recordCount>2</recordCount>
        <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
        <system id="1">http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2</system>
      </resultsetMetadata>
      <record number="1"
          system="1"
          identifier="mvz1">
60. EcoGrid Write
- Used to push data back to sources (e.g., publishing EML documents)
- Depends on the availability of an authentication and access control system
- put(sessionID, objectID, object, type)
- delete(sessionID, objectID)
61. Building the EcoGrid
- LTER Network (24)
- Natural History Collections (>> 100)
- Organization of Biological Field Stations (180)
- UC Natural Reserve System (36)
- Partnership for Interdisciplinary Studies of Coastal Oceans (4)
- Multi-agency Rocky Intertidal Network (60)
[Figure: map of participating nodes. Key: Metacat node; SRB node; VegBank node; DiGIR node; Xanthoria node; Legacy system]
62. EcoGrid Queries in Kepler
63. Metadata-driven analysis cycle
64. Acknowledgements
This material is based upon work supported by:
- The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676
- The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus
- The Andrew W. Mellon Foundation
PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON
65. Spare slides
66. DOM
- DOM models hierarchical element and attribute structure of XML
[Figure: DOM tree with three node types: element, attribute, text]
    eml (element)
      packageId (attribute): knb.1.2 (text)
      dataset
        title: Soil profiles in 1982
        creator
          individualName
            givenName: Matthew
            surName: Jones
67. Metacat's Data Model
- Metacat data model is a recursive structure
- Gains in flexibility offset by performance penalties (we're working on this)
68. The simple client HTML code
    <form action="/knb/metacat" target="right" method="POST">
      <strong>1. Choose an action </strong>
      <input type="radio" name="action" value="insert" checked> Insert
      <input type="radio" name="action" value="update"> Update
      <input type="radio" name="action" value="delete"> Delete
      <input type="submit" value="Process Action">
      <br />
      <strong>2. Provide a Document ID </strong>
      <input type="text" name="docid">
      <br />
      <strong>3. Provide XML text </strong> (not needed for Delete)
      <br />
      <textarea name="doctext" cols="65" rows="15"></textarea>
      <br />
      <strong>4. Provide DTD text for upload </strong> (optional; not needed for Delete)
      <br />
      <textarea name="dtdtext" cols="65" rows="15"></textarea>
      <br />
    </form>