Title: Metadata Acquisition with XML
1 Metadata Acquisition with XML
- Case studies from the Swiss Federal Archives
- 9 October 2002 / Stephan Heuscher
2 Overview
- Problems acquiring metadata
- Why XML?
- Featured Projects
- Lessons learned
- Conclusions
3 Problems acquiring metadata
- Documentation
- Data format
- Data consistency
- System borders
- Money
- Communication with stakeholders
4 Why XML?
- XML
  - is an open standard
  - is self-explanatory
  - is human-readable
  - can be validated automatically
  - has broad software support
    - Most products feature XML support
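The point about automatic validation can be illustrated with a minimal sketch. It uses the parser in Python's standard library, which checks well-formedness only; validating against a DTD or an XML Schema would need an additional library. The function name `is_well_formed` and both sample strings are assumptions made for illustration and do not come from the projects described here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents, loosely modelled on archive metadata.
good = '<archive><database product-name="Oracle"/></archive>'
bad = '<archive><database product-name="Oracle"></archive>'  # <database> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check like this can run unattended on every delivered file, so structural errors surface before any manual review.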
5 Featured Projects
- SIARD
  - Archiving of relational databases
  - Manual generation of additional metadata
  - Metadata and content are stored in XML files
- AMDA
  - Manages metadata for audio data from the Swiss Parliament
    - Does not manage audio data
  - Import of XML metadata
  - Must provide a variety of export formats
6 SIARD (System Independent Archiving of Relational Databases)
[Diagram: source databases (Oracle, MS-SQL, ???-DB); data and low-level metadata extraction; additional high-level descriptive metadata; Digital Archive (to be built); database regeneration]
7 XML use in SIARD
- SQL-99 (ISO/IEC 9075)
  - Low-level data description
    - Structure
    - Data types
    - Constraints
- XML
  - High-level metadata
  - Table content (thin wrapper)
8 Data Logic (SQL)
CREATE TABLE "FLUGLE"."CLASS" (
    "CLASS_ID" NATIONAL CHARACTER VARYING(20) NOT NULL,
    "SCHEDULE_ID" NATIONAL CHARACTER VARYING(20),
    "CLASS_BUILDING" NATIONAL CHARACTER VARYING(25),
    "CLASS_ROOM" NATIONAL CHARACTER VARYING(25),
    "COURSE_ID" NATIONAL CHARACTER VARYING(5),
    "DEPARTMENT_ID" NATIONAL CHARACTER VARYING(20),
    "INSTRUCTOR_ID" NATIONAL CHARACTER VARYING(20),
    "SEMESTER" NATIONAL CHARACTER VARYING(6),
    "SCHOOL_YEAR" TIMESTAMP(0)
)
CREATE TABLE "FLUGLE"."CLASS_LOCATION" (
    "CLASS_BUILDING" NATIONAL CHARACTER VARYING(25) NOT NULL,
    "CLASS_ROOM" NATIONAL CHARACTER VARYING(25) NOT NULL ...
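Because the data logic is stored as SQL-99 DDL, a database can in principle be regenerated by executing it. The sketch below tries this with SQLite from Python's standard library. SQLite is only a stand-in chosen for a self-contained example; it is not one of the database products named in this deck, and its handling of types differs from theirs. The DDL is shortened to three columns, and attaching an in-memory database under the name `FLUGLE` is a workaround assumed here so that the schema-qualified table name resolves.

```python
import sqlite3

# Shortened DDL in the style of the slide; only three columns are kept.
DDL = '''
CREATE TABLE "FLUGLE"."CLASS" (
    "CLASS_ID" NATIONAL CHARACTER VARYING(20) NOT NULL,
    "SCHEDULE_ID" NATIONAL CHARACTER VARYING(20),
    "SCHOOL_YEAR" TIMESTAMP(0)
)
'''

conn = sqlite3.connect(":memory:")
# Assumption: a second in-memory database attached as FLUGLE stands in
# for the original schema, so the qualified name "FLUGLE"."CLASS" works.
conn.execute("ATTACH DATABASE ':memory:' AS FLUGLE")
conn.execute(DDL)

# Read the column definitions back from the regenerated table.
columns = [
    (name, declared_type, bool(notnull))
    for _cid, name, declared_type, notnull, _default, _pk in conn.execute(
        'PRAGMA FLUGLE.table_info("CLASS")'
    )
]
for column in columns:
    print(column)
```

Reading the definitions back gives a simple round-trip check that column names and NOT NULL constraints survived regeneration.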
9 SIARD Metadata XML
<?xml version="1.0" encoding="UTF-8"?>
<archive>
  <database product-name="Oracle"
            product-version="Personal Oracle9i Release 9.0.1.1.1 - Production. With the Partitioning option. JServer Release 9.0.1.1.1 - Production"
            table-number="22" view-number="4" archiv-size="175KB">
    <schemas>
      <schema tag-name="FLUGLE" table-number="22" view-number="4">
        <status sql3="true" integrity="true" archiv="true" reason="0" mandatory="true"/>
        <tables>
          <table tag-name="BACKUP_CLASS" column-number="9" row-number="10">
            <status sql3="true" integrity="false" archiv="true" reason="3" mandatory="true"/>
            <columns>
              <column tag-name="CLASS_ID"
                      sql3type="NATIONAL CHARACTER VARYING" sql3size="(20)"
                      type="VARCHAR2" length="20" precision="" scale=""
                      nullable="false" defaultvalue="">
                <status sql3="true" integrity="true" archiv="true" reason="0" mandatory="true"/>
              </column>
              ...
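Metadata in this form is straightforward to query. As a minimal sketch, the Python below lists every table whose `status` element reports `integrity="false"`. The sample is a shortened, closed-off version of the excerpt above; the second table entry and its status values are invented for contrast, and the meaning of the `reason` codes is not given in the slides, so the value is only passed through.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the slide's excerpt. The CLASS table
# entry and its status values are hypothetical, added for contrast.
SAMPLE = """
<archive>
  <database product-name="Oracle" table-number="22" view-number="4">
    <schemas>
      <schema tag-name="FLUGLE" table-number="22" view-number="4">
        <status sql3="true" integrity="true" archiv="true" reason="0" mandatory="true"/>
        <tables>
          <table tag-name="BACKUP_CLASS" column-number="9" row-number="10">
            <status sql3="true" integrity="false" archiv="true" reason="3" mandatory="true"/>
          </table>
          <table tag-name="CLASS" column-number="9" row-number="10">
            <status sql3="true" integrity="true" archiv="true" reason="0" mandatory="true"/>
          </table>
        </tables>
      </schema>
    </schemas>
  </database>
</archive>
"""


def tables_failing_integrity(xml_text):
    """Return (table name, reason code) for tables with integrity="false"."""
    root = ET.fromstring(xml_text)
    failing = []
    for table in root.iter("table"):
        # find() looks at direct children only, so column-level status
        # elements nested deeper in the table are not picked up here.
        status = table.find("status")
        if status is not None and status.get("integrity") == "false":
            failing.append((table.get("tag-name"), status.get("reason")))
    return failing


print(tables_failing_integrity(SAMPLE))
```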
10 SIARD Data XML
<?xml version="1.0" encoding="UTF-16"?>
<dmp-file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="../dmp.xsd">
  <schema tag-name="FLUGLE"/>
  <table tag-name="CLASS"/>
  <column tag-name="CLASS_ID"
          sql3type="NATIONAL CHARACTER VARYING" sql3size="(20)"
          defaultvalue="" nullable="false"
          constraints="PKPK_CLASS"/>
  <column tag-name="SCHEDULE_ID"
          sql3type="NATIONAL CHARACTER VARYING" sql3size="(20)"
          defaultvalue="" nullable="true"
          constraints="FKFLUGLE.SCHEDULE_TYPE.SCHEDULE_ID"/>
  ...
  <data>
    <row>6,1042004,S1809,POCO HALL3,1503,1985,PHILO4,E4916,SPRING19,1997-03-01 000000</row>
    <row>6,1045003,T1511,NARROW HALL3,2003,1844,HIST4,D9446,SPRING19,1997-03-01 000000</row>
    ...
11 AMDA (Audio MetaData Acquisition)
[Diagram: Access DB; online parliament session metadata (XML); web interface; unified XML import; AMDA; metadata; Digital Archive (to be built)]
12 XML use in AMDA
- Import
  - XSLT transformation to common format
    - Online metadata
    - Legacy data (Access database)
- Export
  - Raw XML output transformed using XSLT
13 AMDA Import XML (raw)
<?xml version="1.0" encoding="iso-8859-1"?>
<root>
  <session oid="34695" session_id="session_4609" text_update_time="1002882007656">
    <meeting date="20010917" local_time="1430" location="N" oid="34696" publish_status="final">
      <subject oid="34697" publish_status="draft" subject_type="gesch">
        <gesch_list oid="34698" publish_status="draft" transfer_gesch_list="01.9001">
          01.9001
          <gesch_info oid="000000000">
            <a99_gesch last_modified="2001/03/05 144342 GMT0100">
              <gesch_id raw_id="20019001">2001.9001</gesch_id>
              <title language="d">
                <line>Mitteilungen</line>
                <line>des Präsidenten</line>
              </title>
            </a99_gesch>
          </gesch_info>
        </gesch_list>
        <speech_text audio_channel="N" audio_end="1000729995203"
                     audio_start="1000729751250" speaker_id="9005"
                     turnus_nr="1000" turnus_oid="155989">
          <pd_text>
            <p>Der Beginn dieser Herbstsession ist schmerzlich getrübt von unseren Gedanken an das
            ...
14 AMDA Import XML (transformed)
<?xml version="1.0" encoding="iso8859-1"?>
<Session id="4609" start="20010917T14300200">
  <Geschaefte>
    <Geschaeft nummer="1998.0446"
               themaDeutsch="Parlamentarische Initiative&#xA;Hämmerle Andrea.&#xA;Post, SBB, Swisscom.&#xA;Arbeitsplätze&#xA;in der ganzen Schweiz"
               themaFranzoesisch="Initiative parlementaire&#xA;Hämmerle Andrea.&#xA;Poste, CFF, Swisscom.&#xA;Des emplois&#xA;dans toute la Suisse" />
    <Geschaeft nummer="2001.9001"
               themaDeutsch="Mitteilungen&#xA;des Präsidenten"
               themaFranzoesisch="Communications&#xA;du président" />
    ...
  </Geschaefte>
  <Verhandlungen>
    <Verhandlung geschaeftNummern="2001.9001" rat="V" start="1000729751"
                 dauer="244" bulletin="" bulletinSeiten="825">
      <Votum start="1000729751" dauer="20" sprache="de">
        <Person id="9005" vorname="Peter" nachname="Hess" kanton="ZG" ort="Zug" />
        <VotumText>Der Beginn dieser Herbstsession ist schmerzlich getrübt von unseren Gedanken
        ...
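The step from the raw import format to the common format can be sketched in a few lines. AMDA itself used XSLT for this; the Python below is a stand-in that uses only the standard library, and its mapping rules are assumptions inferred from comparing the two samples: the numeric part of `session_id` becomes `Session/@id`, and the German title lines are joined with a line feed into `themaDeutsch`. The French title, the start time, and the `Verhandlungen` part are left out. The function name `to_common_format` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened raw sample in the shape of the "raw" slide.
RAW = """<root>
  <session oid="34695" session_id="session_4609">
    <meeting date="20010917" local_time="1430">
      <subject oid="34697" subject_type="gesch">
        <gesch_list oid="34698">
          <gesch_info oid="000000000">
            <a99_gesch>
              <gesch_id raw_id="20019001">2001.9001</gesch_id>
              <title language="d">
                <line>Mitteilungen</line>
                <line>des Präsidenten</line>
              </title>
            </a99_gesch>
          </gesch_info>
        </gesch_list>
      </subject>
    </meeting>
  </session>
</root>"""


def to_common_format(raw_xml: str) -> ET.Element:
    """Map a raw session onto Session/Geschaefte/Geschaeft (assumed rules)."""
    raw_session = ET.fromstring(raw_xml).find("session")
    # "session_4609" -> "4609"; a value without "_" is kept unchanged.
    session_id = raw_session.get("session_id", "").rpartition("_")[2]
    session = ET.Element("Session", id=session_id)
    geschaefte = ET.SubElement(session, "Geschaefte")
    for item in raw_session.iter("a99_gesch"):
        title = item.find("title[@language='d']")
        lines = []
        if title is not None:
            lines = [line.text or "" for line in title.findall("line")]
        ET.SubElement(
            geschaefte,
            "Geschaeft",
            nummer=item.findtext("gesch_id", default=""),
            themaDeutsch="\n".join(lines),  # serialized as a character reference
        )
    return session


common = to_common_format(RAW)
print(ET.tostring(common, encoding="unicode"))
```

Keeping the mapping in one small function mirrors the idea of a single stylesheet per source format: each supplier's XML gets its own transformation, and everything downstream sees only the common format.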
15 Lessons learned
- Transforming and reformatting XML data is easy
- Documentation and data integrity are crucial
- Agree on rules and standards for XML formats early
  - Stakeholders' uses of XML differ greatly
16 Conclusions
- XML
  - is not a preservation strategy
  - is only a technology
  - is too new for a common understanding
- XML provides tools and techniques for concise metadata management
- Working solutions need both XML and non-XML experience
- Most problems are still human in nature