Title: PANGAEA
1 Data portal based on Open Archives Initiative Protocols and Apache Lucene
Uwe Schindler, uschindler_at_wdc-mare.org
Michael Diepenbroek, mdiepenbroek_at_wdc-mare.org
MARUM, University of Bremen, Germany
EGU 2006, Vienna, 2006-04-03
2 Data Portals
- WDC-MARE with its information system PANGAEA provides data portals for several EU/international projects: CARBOOCEAN, EUR-OCEANS, IODP
- Problem:
  - Not all data are stored centrally, so all datasets provided in the portals must be consolidated from different sources!
3 Example: CARBOOCEAN data portal
- Data stays at the data providers
- Metadata is harvested by the portal
- Search queries are handled by the centralized catalogue
- Scientist gets link to data at the provider
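The flow above can be sketched in a few lines. This is only an illustration of the idea (metadata in the central catalogue, data left at the provider), not the portal's actual code; the class names, identifiers, and example URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    identifier: str  # unique record ID assigned by the provider (hypothetical)
    title: str       # searchable text
    data_url: str    # link to the data, which stay at the provider


class Catalogue:
    """Central catalogue holding harvested metadata only, never the data."""

    def __init__(self):
        self._records = {}

    def harvest(self, records):
        # Re-harvesting a record with the same identifier overwrites it.
        for record in records:
            self._records[record.identifier] = record

    def search(self, term):
        # Naive case-insensitive substring match on the title.
        term = term.lower()
        return [r.data_url for r in self._records.values()
                if term in r.title.lower()]


catalogue = Catalogue()
catalogue.harvest([
    MetadataRecord("provider-a:1", "Surface water pCO2, North Atlantic",
                   "https://provider-a.example/data/1"),
    MetadataRecord("provider-b:7", "Sediment core description",
                   "https://provider-b.example/data/7"),
])

# The scientist receives links that point back to the provider.
links = catalogue.search("pco2")
```

The search result contains only links, so the catalogue never has to store or serve the datasets themselves.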
4 Open Archives Protocol
- The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative.
- Google uses it during web crawling (Google Scholar)
- Almost all digital libraries support it (the most famous ones: arXiv and the CERN Document Server)
- Very simple to implement (XML over HTTP based)
- Repository software for databases or file system metadata providers is widely available
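To illustrate "XML over HTTP": an OAI-PMH request is an ordinary HTTP URL with a `verb` parameter, and the response is an XML document in the OAI-PMH namespace. The sketch below builds a `ListRecords` request URL and reads the record headers from a response. The base URL, set name, and sample response are made up for the example, and the record's metadata payload is left empty for brevity.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build the URL of an OAI-PMH ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_headers(response_xml):
    """Return (identifier, datestamp) for each record in a response."""
    root = ET.fromstring(response_xml)
    result = []
    for header in root.iterfind(".//oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        result.append((identifier, datestamp))
    return result


# Hypothetical response from a hypothetical repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-04-03T12:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">https://repository.example/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example:1</identifier>
        <datestamp>2006-03-01</datestamp>
      </header>
      <metadata/>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example/oai", "oai_dc", "carboocean")
headers = parse_headers(SAMPLE_RESPONSE)
```

A real harvester would also follow the `resumptionToken` that repositories return for long result lists; that step is omitted here.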
5 Current OAI-PMH software
- Limited to Dublin Core metadata (libraries)!
- Limited full text search functionality due to relational databases in the background!
- No geographic retrievals (because of Dublin Core limitation)!
- End user interface is part of the software; this limits usability in content management systems
???
6 Requirements for portal software
- Open for any XML metadata format
- Any mappings to document fields should be done by XPath
- Possibility to map incompatible XML schemas during harvesting by XSL
- No relational database, only a full text search engine that contains everything needed for operation
- Range queries for specific fields (date/time or numeric)
- Web service interface for the end user software that is accessible from any language (Java/JSP, PHP, Perl, ...)
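The range-query requirement can be illustrated without a relational database. The sketch below is not Lucene and not the portal's implementation; it keeps the values of one field in sorted order and answers inclusive range queries by binary search. The class name `RangeField` and the document IDs are hypothetical. Dates are stored as ISO 8601 strings, whose lexicographic order matches chronological order.

```python
import bisect


class RangeField:
    """Sorted (value, doc_id) pairs supporting inclusive range lookups."""

    def __init__(self):
        self._entries = []  # kept sorted by (value, doc_id)

    def add(self, doc_id, value):
        bisect.insort(self._entries, (value, doc_id))

    def range(self, low, high):
        # Position of the first entry whose value is >= low.
        start = bisect.bisect_left(self._entries, (low,))
        hits = []
        for value, doc_id in self._entries[start:]:
            if value > high:
                break  # entries are sorted, so nothing further can match
            hits.append(doc_id)
        return hits


dates = RangeField()
dates.add("doc1", "2004-11-02")
dates.add("doc2", "2005-06-15")
dates.add("doc3", "2006-01-20")

in_2005 = dates.range("2005-01-01", "2005-12-31")
```

The same structure works for numeric fields, as long as all values of a field share one comparable type.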
7 MetadataPortal Java Package
Diagram: components of the MetadataPortal package
- Source 1: OAI-PMH, OAI-Harvester (OAI protocol in HTTP), Lucene
- Source 2: OAI-PMH, OAI-Harvester (OAI protocol in HTTP, specific set), XSL, Lucene
- Source 3: XML files, Filesystem-Harvester (filesystem directory, FTP), XSL, Lucene
- Virtual Index (shown twice)
- Apache Axis
- Mini PanHTTP Server, Jetty HTTP Server, Tomcat
- Portal 1 (Webserver, PHP)
- Portal 2 (Webserver, JSP)
Stored: xmldata (same format everywhere, XSL before indexing), identifier, lastModified, sets
Searchable:
- field1: /oai_dc:dc/dc:author
- field2: /oai_dc:dc/dc:title
- field3: java:org.test.LatLon.parse(/oai_dc:dc/dc:coverage)
- default . )
- xmlns:java="http://xml.apache.org/xalan/java"
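The field configuration above maps XPath expressions to searchable fields, with an optional conversion function for values such as coordinates. A minimal sketch of that idea follows. It uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath, so the paths are written relative to the record root. The sample record, the coverage format `lat=...; lon=...`, and the helper `parse_lat_lon` are assumptions for illustration and do not reproduce the `org.test.LatLon.parse` function named on the slide. The sample uses `dc:creator`, the standard Dublin Core element for authors.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_lat_lon(text):
    """Hypothetical coverage parser: 'lat=<deg>; lon=<deg>' -> (lat, lon)."""
    parts = dict(item.strip().split("=") for item in text.split(";"))
    return float(parts["lat"]), float(parts["lon"])


# Field name -> (path relative to the record root, optional converter).
FIELD_MAP = {
    "author": ("dc:creator", None),
    "title": ("dc:title", None),
    "position": ("dc:coverage", parse_lat_lon),
}


def extract_fields(record_xml, field_map=FIELD_MAP):
    """Apply every path in field_map to one record and collect the values."""
    root = ET.fromstring(record_xml)
    fields = {}
    for name, (path, convert) in field_map.items():
        values = [el.text for el in root.findall(path, NS) if el.text]
        if convert is not None:
            values = [convert(v) for v in values]
        fields[name] = values
    return fields


# Made-up Dublin Core record.
RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Surface water pCO2</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:coverage>lat=53.5; lon=8.1</dc:coverage>
</oai_dc:dc>"""

fields = extract_fields(RECORD)
```

Because the mapping is plain configuration, supporting another metadata format means changing `FIELD_MAP` rather than the extraction code.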
!!!
8 CARBOOCEAN Data Portal
- Metadata standard harvested for search: DIF v9.4
- Searchable fields: bounding box, date/time, parameters, authors, investigators, title
- Data centers:
  - World Data Center for Marine Environmental Sciences (WDC-MARE), University of Bremen and Alfred-Wegener-Institute in Bremerhaven, Germany
  - French National Oceanographic Data Centre, SISMER (Systèmes d'Informations Scientifiques pour la Mer) at the Ifremer in Brest, France
  - Carbon Dioxide Information Analysis Center (CDIAC), Environmental Sciences Division at Oak Ridge National Laboratory, USA
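The bounding-box search field listed for this portal amounts to a geographic retrieval: find every dataset whose box overlaps the query box. One way this could work is sketched below, treating a box as two numeric ranges (longitude and latitude) that must both overlap. This is an assumption-laden illustration, not the portal's implementation; the dataset names and coordinates are invented, and boxes that cross the 180° meridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    west: float   # minimum longitude in degrees
    south: float  # minimum latitude in degrees
    east: float   # maximum longitude in degrees
    north: float  # maximum latitude in degrees

    def intersects(self, other):
        # Two boxes overlap when both their longitude ranges and
        # their latitude ranges overlap (touching edges count).
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


# Invented datasets with invented extents.
datasets = {
    "north-atlantic-cruise": BoundingBox(-40.0, 30.0, -10.0, 60.0),
    "southern-ocean-mooring": BoundingBox(0.0, -70.0, 20.0, -50.0),
}

query = BoundingBox(-20.0, 40.0, 5.0, 55.0)
matches = [name for name, box in datasets.items() if box.intersects(query)]
```

Expressed this way, a bounding-box query reduces to range comparisons on four numeric fields.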
9 Thank you!