Collection- and Knowledge-Based Persistent Archives at SDSC

1
Collection- and Knowledge-Based Persistent
Archives at SDSC

Bertram Ludäscher
LUDAESCH_at_SDSC.EDU
San Diego Supercomputer Center
University of California, San Diego

work sponsored by the National Archives and
Records Administration (NARA), the Advanced Research
Projects Agency, and NHPRC
2
Content Overview

Part I
SDSC's Persistent Archives Initiative
(material from Reagan Moore, Deputy
Director SDSC, Data Intensive Computing
Environments)
Example/Case Study
Wrapping Websites into XML for
Archival (material from Valter
Crescenzi, visiting from U Roma 3)
Part II
From Collection-Based to Self-Validating
Knowledge-Based Archives
(joint work
with Reagan Moore and Richard Marciano)
Running Example: The Senate Collection

3
Data Intensive Computing Environments

Staff
Reagan Moore
Chaitan Baru
Sheau Yen Chen
Charles Cowart
Amarnath Gupta
George Kremenek
Bertram Ludäscher
Richard Marciano
Arcot Rajasekar
Abe Singer
Michael Wan
Ilya Zaslavsky
Bing Zhu

Students - GSRA
Martin Kuhl
Liying Sui
Yang Yu
Valter Crescenzi
Students - Undergrad Interns
Peter Shin
Roman Olshanowsky
Shabbar Tambawala
Pratik Mukhopadhyay
/- NN

4
Part I

Overview
SDSC's Persistent Archival Approach
Case Study
Wrapping Websites into XML for Archival

5
Persistent Archive Goals/Objectives

Manage digital objects for the life of the
republic
Maintain ability to discover and access digital
objects while the supporting hardware and
software systems evolve

6
Example Archive Components

Storage system for storing the digital objects
e.g. HPSS (tape silos with disk cache) or SANs
(Storage Area Networks)
Database for managing a collection that
represents the digital objects
e.g. an object-relational DBMS
Web server for discovering and displaying the
digital objects
e.g. CGI scripts with helper applications

7
Example Archive

Assyrian clay tablets
Provided long term storage, but limited volume
Characterize an archive by the bandwidth it
provides for transporting data into the future
and its archival capacity

8
Risks and Challenges of Persistent Archives

Each of the software and hardware systems may
become obsolete
the storage media may degrade
the storage system may become obsolete
the database backups may become obsolete, with no
way to recover the collection (structure)
the digital object formats may become obsolete,
with no helper application that can read them

9
Good News: Persistent Archives are Possible

The Archivist (archival engineer) is in control
Archivist gets to decide on the persistence
policies
how to minimize risk
how to minimize cost
when to use new technology

10
Persistent Archive Bandwidth (Migration Speed)

Ability/Necessity to transport data is
(size of the archive) / (media lifetime)
the larger the archive and/or shorter the
lifetime,
the higher the required bandwidth
(migration speed)
Example: size(archive) = 100 TB, media_lifetime = 5 years
ability/necessity to migrate 20 TB/year to
avoid data loss!
Clay tablets provided a long media lifetime, but
a very small storage capacity
effective bandwidth was a byte/day
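
The migration-speed rule on this slide is plain arithmetic. A minimal Python sketch follows; the function name and the choice of TB and years as units are illustrative assumptions, not something the slides prescribe.

```python
def required_migration_rate(archive_size_tb: float, media_lifetime_years: float) -> float:
    """Minimum sustained migration rate (TB/year) needed to copy the whole
    archive onto new media before the old media reach end of life."""
    if media_lifetime_years <= 0:
        raise ValueError("media lifetime must be positive")
    return archive_size_tb / media_lifetime_years


# The slide's example: a 100 TB archive on media that last 5 years.
rate = required_migration_rate(100, 5)
print(rate)  # 20.0
```

Doubling the archive size or halving the media lifetime doubles the required rate, which is the sense in which the archive is characterized by its bandwidth.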

11
Concept 1

Persistent Archive is a Migration Mechanism
Since the amount of data is increasing
exponentially, the archive capacity must increase
correspondingly
Migrate to new technology to get to higher
sustainable Archive Bandwidths

12
Data Scales

Megabyte - one million bytes
Digital content of a book
Gigabyte - one thousand MBs
Terabyte - one thousand GBs
Digital content of a film
Petabyte - one thousand TBs
Amount of tape produced in 1994
Exabyte - one thousand PBs
Data produced per year

13
Archive Bandwidth

If you wait too long to migrate, you will be
unable to read the data from the archive before
the media degrades
There is a maximum capacity for any choice of
archive media, for a given capital investment in
media read devices

14
Archive Capacities for a 2-tape Drive System
15
SDSC Archive

Currently store 240 TeraBytes, with the capacity
of the system being 500 TeraBytes
16 tape drives in 3 silos
Migrated digital holdings through
three different storage systems
five different computer systems
six different types of tape media

16
Reasons for Migration

Avoid data loss
manage degradation of media
Minimize cost
minimize the number of tapes that must be managed
recover floor space
Keep pace with data growth
provide higher Archive Bandwidth

17
Migration Costs

Media costs are fixed
price of each new tape technology cartridge is
the same as the previous cartridge, but the
capacity is doubled (so far)
Cost is then
1 + 1/2 + 1/4 + 1/8 + ... = 2 times original
price
(additional assumption: labor cost is minimized
by using a tape robot)
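
The cost bound can be checked numerically. The sketch below assumes, as the slide does, a fixed archive size, a constant cartridge price, and a doubling of capacity per generation, so each migration needs half as many cartridges as the one before. The function name is hypothetical.

```python
from fractions import Fraction


def cumulative_media_cost(generations: int) -> Fraction:
    """Total media cost after `generations` purchases, in units of the
    first purchase. Generation k needs 1/2**k as many cartridges."""
    return sum((Fraction(1, 2 ** k) for k in range(generations)), Fraction(0))


for n in (1, 2, 4, 10):
    print(n, float(cumulative_media_cost(n)))
```

The partial sums approach 2 but never reach it, which is why the total media cost stays below twice the original price no matter how many migrations take place under these assumptions.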

18
Concept 2

Automation of all processes is essential if costs
are to be minimized
eliminate manual manipulation of tapes
robots
eliminate manual manipulation of digital objects
data handling systems
eliminate manual discovery of digital objects
information catalogs

19
Data Archive
Ingest Services
Management
Access Services
Access platform
Data repositories
Ingestion platform
Interoperability Standards
Interoperability Protocols
20
ERA Concept model
21
Concept 3

Persistent Archive is an Interoperability System
persistence requires migration over time onto new
technology
while the migration occurs, a persistent archive
must be able to interoperate with both the old
technology and the new technology
employ XML-based interoperability
(mediation) technology

22
Implicit Concepts for Persistent Archive

Infrastructure independence
Non-proprietary formatting
Collection management
Data set access
Authentication
Presentation
Information models
XML as a (meta-) information markup language
Example GML - Graphics markup language
Support for ingestion, management, access
Accessioning workbench, archive, access workbench

23
XML as a Standard Information Markup Language

XML representation of metadata attributes
standardization of DTDs - MOA II DTD for text
standardization of markup language
XML based representation of collection structure
attributes defining the physical layout of a
schema into relational tables (foreign keys,
attribute data types, ...)
XML databases: XML-organized data collections
commercial systems: Excelon, TAMINO, Oracle8i,
...
XML-based queries (XQuery, Quilt, XQL, XMAS, ...)
XML based Topic Maps
represent relationships between collection domain
concepts, collection attributes
navigational access of intra- and
intercollection concept spaces

24
Archival Example: E-mail Collection

Test of the scalability of the technology
archived a one-million record E-mail collection
(1999)
Ingestion
tagged E-mail using XML syntax (6 required,
13 optional, 1000 user-defined tags): wrapping
of raw data
created description of the collection
aggregated E-mail into containers, stored in an
archive
retrieved collection description, created
database, and optimized for query
Total time was 27 hours (used 10 Mbit/sec
Ethernet)
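
The tagging step of the ingestion can be pictured with a short Python sketch. The tag names below are hypothetical: the slides state how many tags were required, not what they were called, and the real ingestion pipeline is not shown in this transcript.

```python
import email
import xml.etree.ElementTree as ET

# Hypothetical subset of header tags; the actual required tags are not listed.
HEADER_TAGS = ("From", "To", "Date", "Subject")


def wrap_email(raw: str) -> ET.Element:
    """Wrap one raw RFC 822 style message into a flat XML element."""
    msg = email.message_from_string(raw)
    root = ET.Element("email")
    for header in HEADER_TAGS:
        ET.SubElement(root, header.lower()).text = msg.get(header, "")
    ET.SubElement(root, "body").text = msg.get_payload()
    return root


raw = (
    "From: a@example.org\n"
    "To: b@example.org\n"
    "Date: Mon, 1 Feb 1999 10:00:00 -0800\n"
    "Subject: test\n"
    "\n"
    "Hello archive.\n"
)
wrapped = wrap_email(raw)
print(ET.tostring(wrapped, encoding="unicode"))
```

The raw message is kept intact inside the wrapper, so the tagging adds information attributes without discarding the original data.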

25
Collection-Based Persistent Archive
Ingest Services
Management
Access Services
Information Repository
Attribute- based Query
Attributes Semantics
SDLIP
Information
XML DTD
(Data Handling System - SRB / FTP / HTTP)
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Grids
Feature-based Query
MCAT/HDF
26
Collection-Based Persistent Archive Processes
27
What Types of Interoperability are Needed?

Data management (digital objects)
ability to work with multiple types of storage
systems, across separate administration domains
Information management (attributes)
ability to define a collection independent of
database choice
ability to migrate collection onto new databases
Knowledge management (relationships)
ability to manage relationships and high-level
domain concepts
ability to map concepts to collection attributes

28
Simplest Definitions

Data
digital object, i.e., the object representation
as a bit stream
Information
any tagged data, where tags are treated as
information attributes
attributes may be tagged data within the digital
object, or tagged data that is associated with
the digital object
Knowledge
higher-order concepts and relationships between
attributes
relationships can be procedural, temporal,
structural, spatial, functional, ... and
described in a Logic formalism (semantic
networks, description logics, conceptual graphs,
...) which is often rule-based (e.g. Datalog,
Frame-Logic)

29
(No Transcript)
30
Types of Knowledge Relationships

Logical / semantic
e.g. Digital Library cross-walks
Temporal / procedural
e.g. Workflow systems
Spatial / structural
e.g. GIS systems
Functional / algorithmic
e.g. scientific feature analysis

31
Knowledge-Based Persistent Archive (more Part
II)
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
XTM DTD
Knowledge
Rules - KQL
(Topic Maps / Buckets / Model-based Access)
Information Repository
Attribute- based Query
Attributes Semantics
SDLIP
Information
XML DTD
(Data Handling System - SRB / FTP / HTTP)
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Grids
Feature-based Query
MCAT/HDF
32
Creating Archivable Forms

Archivable form for digital objects
infrastructure-independent/self-describing...
... mechanism for describing the digital object
format
... mechanism for displaying the digital object
based upon the digital object description
no proprietary formats
information content directly tagged
based on XML (for information tagging,
interoperability, ...)
Example Archival of Web information

33
Wrapping Websites into XML for Archival

... based on material by

Valter Crescenzi Visiting Scholar San Diego
Supercomputer Center University Of California,
San Diego crescenz_at_sdsc.edu Dipartimento di
Informatica ed Automazione Universita' di Roma
Tre crescenz_at_dia.uniroma3.it
34
Outline

General issues for archival of websites
Specific aspects of archival of websites
gathering information
data extraction
Extraction of information content from web pages
first results
effort needed to extract information from web
pages
recommendations to minimize wrapping effort

35
Website Archival Issues

How to access?
locally ( /htdocs/ )
remotely ( http:// )
What to archive?
a binary image of the site
(explicit) information content only
behavior?
as is (e.g., existing cgi scripts)
equivalent functionality

36
Gathering Information

Def.: A website is completely crawlable if all
pages can be automatically copied to a local
archive
Fact: Many (most?) websites are not completely
crawlable!
example: service-oriented websites
(maps.yahoo.com, ... )
Many mirroring tools are available if the site is
crawlable, i.e. through index pages
If the site's URLs can be enumerated, a
specialized mirroring tool may easily be developed
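
A specialized mirroring tool based on URL enumeration can be very small. The sketch below is an illustration only: the URL template is hypothetical, and the fetch function is passed in as a parameter so that the example runs without network access.

```python
from typing import Callable, Dict, Iterable, Iterator


def enumerate_urls(template: str, congress: int, first: int, last: int) -> Iterator[str]:
    """Yield one URL per bill number in the closed range [first, last]."""
    for number in range(first, last + 1):
        yield template.format(congress=congress, number=number)


def mirror(urls: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Copy every reachable page into a local dict keyed by URL."""
    archive: Dict[str, str] = {}
    for url in urls:
        try:
            archive[url] = fetch(url)
        except OSError:
            continue  # skip pages that cannot be retrieved
    return archive


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP GET, used only for this demonstration."""
    return f"<html>{url}</html>"


pages = mirror(
    enumerate_urls("http://example.org/bills/{congress}/{number}", 106, 1, 3),
    fake_fetch,
)
print(sorted(pages))
```

Replacing `fake_fetch` with a real downloader is the only change needed to run this against a site whose URLs follow a predictable pattern.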

37
Focus on Crawlable Data-intensive Websites
[Figure: websites plotted by data-centricity / structuredness (low to high) against complexity of applications (low to high). Regions: Data-intensive Sites, Web-Based Information Systems, Service-Oriented Sites, Web-Presence Sites. Example sites: Senate, NeuronDB, www.amazon.com, www.hotmail.com, maps.yahoo.com, GridPortal.]
38
Information Extraction Approach

Information content may be extracted from
documents through dedicated software modules
called wrappers

Information content (e.g. in XML)
Web pages (X)HTML
39
Extraction of Information Content

Wrappers
brittle
difficult to maintain
mean time to failure quite short!
manually coded/automatically generated
manually very expensive, but only possibility if
sources lack structure
automatically cheap but needs structure

40
Case Study for Extraction of Information CONTENT
The "Senate Web Site"

Bill Status
Summary Info
Site is crawlable
through URL enumeration
through indices

41
Sample XML Output File
42
Why Wrapping HTML into XML?

What is being archived?
presentation (maybe)
content (most likely)
Technology/infrastructure independence
"better" persistent archival format
XML can express information model and schema
information (while HTML only provides fixed
structural and presentational information)

43
Example Minerva Wrapper Specification
S out.println("encoding'ISO-8859-1' ?")
out.println("'BillSummary.dtd'") out.println("")
Bill\sSummary\s\sStatus\sfor\st
he\sCongress\sCongress
BillNumber
((nbsp)"("")")?
out.println(" "Congress""
) out.println("
"BillNumber"")
(NewStyle OldStyle)
TitleList Status ...
44
Steps of the Manual Wrapping Process

Initial wrapper specification has been derived
from a few sample bills from the 106th congress
Refinement using more samples (including from
other congresses)
changes in the HTML layout starting from the 105th
have been discovered!!
many manual fixes were needed (irregularities in
the structure)
use of a random bill URL generator for testing
the wrapper: two more fixes

45
Changes in the HTML Layout
H.R.4236 (Major Legislation)
Public
Law 104-333 (11/12/96)
SPONSOR href""Rep Young, D. (introduced
09/27/96)
HTML code from a bill of the 104th congress
H.J.RES.25
Sponsor href""Rep Livingston, Bob (introduced
1/9/1997)
Latest Major Action 2/3/1997
Became Public Law No 105-1.

Title Making technical corrections to the
Omnibus Consolidated Appropriations
Act, 1997 (Public Law 104-208), and
for other purposes.
Corresponding code from a bill of the 105th
congress
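
A hand-written wrapper has to absorb this kind of layout drift. The Python sketch below is not the Minerva wrapper from the slides; it is a simplified stand-in that strips tags and then matches the sponsor line case-insensitively, so that both the 104th-congress and the 105th-congress style are accepted. The two HTML snippets are reconstructed approximations of the fragments above.

```python
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<[^>]+>")
SPONSOR_RE = re.compile(
    r"sponsor\s*:?\s*(?P<name>.+?)\s*"
    r"\(introduced\s+(?P<date>\d{1,2}/\d{1,2}/\d{2,4})\)",
    re.IGNORECASE,
)


def wrap_sponsor(html: str) -> ET.Element:
    """Extract the sponsor line from a bill page into an XML element."""
    text = " ".join(TAG_RE.sub(" ", html).split())  # drop tags, squeeze spaces
    match = SPONSOR_RE.search(text)
    if match is None:
        raise ValueError("no sponsor line found")
    element = ET.Element("sponsor", introduced=match.group("date"))
    element.text = match.group("name")
    return element


OLD_STYLE = '<B>SPONSOR:</B> <A href="">Rep Young, D.</A> (introduced 09/27/96)'
NEW_STYLE = '<b>Sponsor:</b> <a href="">Rep Livingston, Bob</a> (introduced 1/9/1997)'

for page in (OLD_STYLE, NEW_STYLE):
    print(ET.tostring(wrap_sponsor(page), encoding="unicode"))
```

Tolerant matching of this kind reduces breakage but does not remove it, which is consistent with the slides' point that wrappers for irregular sites are brittle.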
46
Level of Development Effort (Senate Website)

Manual Approach
Around 350 lines of wrapper code
250 lines of grammar specification
100 lines to specify output format
One full day to write the specification
One full day to test and debug it
Automatic Approach
automatic wrapper generator fails!

47
Wrapping a Structured Website: NeuronDB

NeuronDB is a well structured site which
presents information content of an underlying
database about neurons
A wrapper has been manually coded for comparing
the efforts needed to wrap this structured site
and the Senate site
The automatic wrapper generator was able to
successfully build a wrapper without user
interaction

48
(No Transcript)
49
Level of Development Effort (Neuron DB)

Manual Approach
Around 220 lines of code
140 lines of grammar specification
80 lines to specify output format
Half day to write the specification
The wrapper extracts all information content
Automatic Approach
automatic wrapper generator succeeds

50
Automatic Wrapper Generation Succeeds for NeuronDB

The wrapper generator toolkit is part of an
ongoing project at the Terza Universita di
Roma called RoadRunner
The wrapper has been automatically generated
looking at similar pages without user
interactions
wrapper generation takes a few seconds

51
Neuron DB the inferred schema

A common schema (expressed as regular expression)
is inferred for input pages

A B ( C ) ( D ( ( E F ( G )? ) )? ( H ( I
)? ) ( J K ( L )? ) ) ( M N )

The schema is enriched with the extraction
rules needed to actually wrap sources
This is a kind of physical schema of the HTML
layout, not a logical schema

52
NeuronDB Result of the data extraction
53
Results (1)

Make sites crawlable for remote archival
e.g. archival backdoor
Extracting information from web sites may be very
expensive depending on the (ir)regularities of
pages
The more regular the structure, the cheaper
the writing of wrappers
Automatic approaches are feasible for well
structured web pages!

54
Results (2)

Well-Structured Sites
not only for data archival but also to minimize
cost for web site maintenance and management
XHTML can help (simplifies XHTML to XML)
Separation of content and presentation
XML + XSL(T)

55
References (1)

G. Mecca, P. Atzeni: Cut and Paste. Journal of Computing and System Sciences, Special Issue on PODS'97, 1999
DOM: The Document Object Model. http://www.w3.org/DOM/
D. W. Embley, D. M. Campbell, Jiang Y. S., S. W. Liddle, Ng Y., D. Quass, and Smith R. D.: A conceptual-modeling approach to extracting data from the web. In ER98.
N. Kushmerick: Wrapper Induction: Efficiency and expressiveness. Artificial Intelligence, 118, 200
V. Crescenzi, G. Mecca: Grammars Have Exceptions. Information Systems, Special Issue on Semistructured Data, 1998

56
References (2)

B. Adelberg: NoDoSe, a tool for semi-automatically extracting structured and semistructured data from text documents. In SIGMOD98.
The Neuron DB Web Site: http://senselab.med.yale.edu/senselab/NeuronDB/
S. Grumbach, G. Mecca: In Search of the Lost Schema. In Proceedings of Intern. Conference on Database Theory (ICDT'99), 1999
The Senate Web Site: http://thomas.loc.gov/
The Tidy Utility: http://www.w3.org/People/Raggett/tidy/
The W3C XHTML activity: http://www.w3.org/MarkUp/
Extensible Markup Language (XML): http://www.w3.org/XML/
Extensible Stylesheet Language (XSL): http://www.w3.org/Style/XSL/

57
Overview

Part I
SDSC's Persistent Archives Initiative
(material from Reagan Moore, Deputy
Director SDSC, Data Intensive Computing
Environments)
Example/Case Study
Wrapping Websites into XML for
Archival (material from Valter
Crescenzi, visiting from U Roma 3)
Part II
From Collection-Based to Self-Validating
Knowledge-Based Archives
(joint work
with Reagan Moore and Richard Marciano)
Running Example: The Senate Collection

58
WARM UP XML (eXtensible Markup Language)

origins: HTML, SGML (ISO Standard, 1986, 600 pp)
W3C standard (26 pp): XML syntax + DTDs
XML = HTML - presentational tags
+ user-defined DTD
(tags + nesting)
a metalanguage for defining other languages
(e.g. via DTDs)
XML is more like SGML than HTML
XML = SGML - complexity, document perspective
+ simplicity, data
exchange perspective

59
XML as a Self-Describing Data Exchange Format

can be easily understood by our friend (...
even using CP/M edlin)
can be parsed easily
contains its own structure (parse tree) in the
data
allows the application programmer to
rediscover schema and content/semantics (to
which extent???)
may include its own schema description (e.g.,
DTD, XML Schema)
meta-language definition of specific
languages (XYZ-ML)
allows separation of marked-up content from
presentation (style sheets)
many tools (and many more to come -- (re)use
code) parsers, validators, query languages,
storage,
standards (good for interoperation, integration,
etc)
generic standards (XML, DTDs, XML Schema,
XPath,...)
community/industry standards (specific markup
languages)

60
Mind Your Vocabulary Identifying Vocabularies
with XML Namespaces

My element may not be your element
geometry context line
chemistry context oxygen
SGML/XML context ....
use XML namespaces to identify the vocabulary
... when I say semantics, I can make clear
whether I am talking as a logician (needs
additional specification: mathematical logic,
philosophy, AI, ...) or a linguist, or a
psychologist, etc.

61
XML Namespaces

mechanism for globally unique tag names
xmlns:h="http://www.w3.org/HTML/1998/html4"
Book Review
...
XML A Primer
...
mix of different tag vocabularies without
confusion
namespaces only identify the vocabulary
additional mechanisms required for structure and
meaning of tags
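
How namespaces keep two vocabularies apart can be shown with Python's standard XML library. In this sketch the HTML namespace URI is the one from the slide, while the second namespace and the element layout are hypothetical.

```python
import xml.etree.ElementTree as ET

H = "http://www.w3.org/HTML/1998/html4"  # namespace URI from the slide
BK = "http://example.org/books"          # hypothetical second vocabulary

doc = f"""<h:html xmlns:h="{H}" xmlns:bk="{BK}">
  <h:head><h:title>Book Review</h:title></h:head>
  <h:body><bk:review><bk:title>XML: A Primer</bk:title></bk:review></h:body>
</h:html>"""

root = ET.fromstring(doc)

# Both elements are called "title", but their qualified names differ.
html_title = root.find(f".//{{{H}}}title").text
book_title = root.find(f".//{{{BK}}}title").text
print(html_title)
print(book_title)
```

The parser distinguishes the two `title` elements purely by namespace; what a book title means, or where it may appear, still has to come from a schema or other mechanism.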

62
Information Hierarchy (Simplest Definitions)

Data
digital object, i.e., the object representation
as a bit stream
Information
any tagged data, where tags are treated as
information attributes
attributes may be tagged data within the digital
object, or tagged data that is associated with
the digital object
Knowledge
higher-order concepts and relationships between
attributes
relationships can be procedural, temporal,
structural, spatial, functional, ... and
described in a Logic formalism (semantic
networks, description logics, conceptual graphs,
...) which is often rule-based (e.g. Datalog,
Frame-Logic)

63
Types of Knowledge Relationships

Logical / semantic digression
semantics ?semantics
e.g. Digital Library cross-walks
Temporal / procedural
e.g. Workflow systems
Spatial / structural
e.g. GIS systems
Functional / algorithmic
e.g. scientific feature analysis

64
Knowledge-Based Persistent Archive
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
XTM DTD
Knowledge
Rules - KQL
(Topic Maps / Buckets / Model-based Access)
Information Repository
Attribute- based Query
Attributes Semantics
SDLIP
Information
XML DTD
(Data Handling System - SRB / FTP / HTTP)
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Grids
Feature-based Query
MCAT/HDF
65
Data Handling System
SDSC Storage Resource Broker Meta-data Catalog
Application
Resource
Third-party copy
User
Remote Proxies
MCAT
Dublin Core
DataCutter
Application Meta-data
66
Ingestion Processes for Collection Creation
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection
67
Running Example Senate Collection

from the 106th US Congress database
keeps track of Senate bills, resolutions, and
amendments
raw format: 99 RTF (Rich Text Format) files on
CD-ROM (provided by NARA)
one file per senator

68
Examples of Implied Knowledge: Senate Legislative
Activities

Structural knowledge
Pertinent information embedded in document
headers
Procedural knowledge
Naming convention
Senator represented by last name
Senator represented by last name and state
Senator represented by last name, first name, and
state
Collection knowledge
Referenced senators include senators no longer in
the senate

69
Knowledge Generation

Accessioning Template
Defines the concepts under which the data objects
will be tagged and organized
Attribute selection
Define the attributes that represent the
information content associated with the domain
concepts
Tag attributes using minimal constraint language,
such as XML or XMLSchema
Evaluate closure of mined attributes compared to
expected attributes
Refine concept map

70
Information Generation

Create occurrence index
(Occurrence, attribute, value)
This is needed to be able to recreate original
form of digital object
Analyze completeness of information
Inverse index of attribute values
Identifies unexpected values - consistency
Analyze closure of collection
Are additional attributes needed to represent
inverse index value ranges?
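
The occurrence index and the inverse index can be sketched with plain Python data structures. The triples below are invented sample data, and the idea of using a (file id, line number) pair as the occurrence is an assumption made for illustration.

```python
from collections import defaultdict

# (occurrence, attribute, value) triples; occurrence = (file id, line number).
occurrences = [
    (("allard", 1), "bill_name", "S. 345"),
    (("allard", 2), "date_introduced", "02/03/1999"),
    (("allard", 3), "sponsor", "Allard"),
    (("other", 3), "sponsor", "Allard, Wayne"),
]


def inverse_index(triples):
    """Map attribute -> value -> list of occurrences carrying that value."""
    index = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in triples:
        index[attribute][value].append(occurrence)
    return index


def unexpected_attributes(triples, expected):
    """Closure check: attributes that were mined but not anticipated."""
    return {attribute for _, attribute, _ in triples} - set(expected)


index = inverse_index(occurrences)
print(sorted(index["sponsor"]))
print(unexpected_attributes(occurrences, {"bill_name", "sponsor"}))
```

Listing the distinct values per attribute is what exposes inconsistencies such as the same senator appearing under two different naming conventions.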

71
Data Organization

Archive preferred views of collection
Original data
XML tagged representation
Minimal representation of consolidated
information
Noise-free version based upon occurrence tags
Object-relational database version
Archive occurrence tagged view
Archive ingestion procedures that transform
collection from the original digital objects to
the preferred views

72
Information Management Projects

Digital Libraries
NSF Digital Library Initiative, Phase II - UCSB,
Stanford
Digital Embryo digital library - GMU
NPACI Digital Sky - Caltech 2MASS sky survey
CDL - AMICO
NSF NSDL - UCAR / DLESE
Grid Environments
NASA Information Power Grid - NASA Ames
DOE Data Visualization Corridor - LLNL
DOE Particle Physics Data Grid - Stanford,
Caltech
NSF Grid Physics Network - U Fl
Persistent Archives
NARA Persistent Archive
NHPRC - Scalable archives

73
Research and Development Activities - FY00

Demonstration of scalable systems
Expansion of persistent archive Framework
Knowledge-based persistent archives
Demonstration of archivable forms for new types
of data
Web, GIS, compound documents, collections
Knowledge and anomaly processing
Tightness of fit of XML DTDs
Self validating archives as a preservation
strategy

74
Research Challenges

Infrastructure independence
Progress on archivable form creation
Digital paper
Finding aids for a million collections
Concept spaces that support identification of
collection
Product authentication
Tracking all updates, movements, media
migrations, collection instantiations
Choice of Archival Markup Language
Tracking of E-commerce implementations
Knowledge management systems
Workflow, ingestion processing steps, system
evolution procedures, finding aid concept spaces

75
Digital Archives

Problem
How to achieve long-term preservation of
information (for the archivist: records) and
sustained access?
Challenges and Opportunities
fight archive obsolescence in the presence of
rapidly changing storage, data formats, software
environment, hardware, ...
Approaches
Time out (do nothing: assume hardware,
software, data formats, etc. all work 400 years
from now ...)
Emulation (emulate hardware and software
infrastructure)
Migration (migrate to new infrastructure)
Factors
What do you need to archive? (records, data,
programs, ?)
determine usefulness and cost of emulation
vs. migration
archival of electronic records: data-centric
migration

76
What is it That We Try to Archive??

What constitutes a record?
beats me...
but there are hierarchies of information /
abstractions
data ... information ... knowledge ...
wisdom?
instance ... schema ... model ... metamodel ...
metametamodel ...
object serialization ... data structure ... data
model ... meta model
What is the nature of the information?
data .. functions/programs
extensional data . intensional/virtual/derived
data (facts/rules)
Managing complexity using layers
protocol stacks (e.g. ISO/OSI, SemanticWeb,
Semantic Mediation)
going up abstract, correlate, aggregate,
index, the lower levels

77
Archival Processes and Functions

Data Submission/Accessioning
loop information producer "archival
engineer (ok archivist)
Ingestion
a sequence of information preserving
transformations is applied to submitted "raw
data" ingestion network
Migration
... as time goes by ...
... migrate to new physical media, maybe data
formats, information model ...
"easy migration" "good" archival format
model
Instantiation/Access
revive/reanimate the archive queryable
collection/database
Goal: preserve information!
(ok: just records ...)

78
Archival Example Senate Collection

What you see

is maybe NOT what you get (a not so well
documented format)

79
Senate Collection Example

Rich Text Format (a documented Microsoft
format)

\pard\parM \pard\b S. 345\b0\parM
\pard\qr DATE INTRODUCED 02/03/1999\parM
\pard SPONSOR Allard\parM \i\qc OFFICIAL
TITLE\i0\parM \pard A bill to amend the Animal
Welfare Act to remove the limitation that permits
\ interstate movement of live birds, for the
purpose of fighting, to States in which \ animal
fighting is lawful.\parM \i\qc LATEST
STATUS\i0\par\pardM \pard\plain
\fi-1900\li1900\nowidctlpar\adjustrightFeb 3,
1999\tab Read twice and\ referred to the
Committee on Agriculture.\parM \pardM

can be wrapped into XML

S. 345 bold"off"DATE INTRODUCED 02/03/1999 bold"off"SPONSOR Allard bold"off" italic"off"OFFICIAL TITLE bold"off" italic"off"A bill to amend the
Animal Welfare Act to remove the lim\ itation
that permits interstate movement of live birds,
for the purpose of fighting\ , to States in which
animal fighting is lawful. bold"off" italic"off"LATEST STATUS
Feb 3, 1999tabRead twice and
referred to the Committee on Agriculture\ .g
80
Senate Collection Example

the XML can be lifted from the presentation
level

to the information level

SENATE AGRICULTURE
02/03/1999te_introduced
Feb 3,
1999
Read twice and referred to the
Committee on Agriculture

A bill to amend the Animal
Welfare Act to remove the limitation that permits
interstate movement of live birds, for the
purpose of fighting, to States in which animal
fighting is lawful.
Allard, Wayne CO
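
Lifting from the presentation level to the information level amounts to recognizing labelled fields and re-tagging them. The Python sketch below is an assumption-laden illustration: it handles only three fields, takes plain text lines as input, and borrows the element and attribute names (`bill`, `bill_name`, `date_introduced`, `sponsor`) from the collection's DTD.

```python
import re
import xml.etree.ElementTree as ET

FIELD_RE = re.compile(r"^(DATE INTRODUCED|SPONSOR)\s*:?\s*(.+)$")
TAG_FOR = {"DATE INTRODUCED": "date_introduced", "SPONSOR": "sponsor"}
BILL_RE = re.compile(r"S\. \d+")


def lift(lines):
    """Turn presentation-level text lines into an information-level element."""
    bill = ET.Element("bill")
    for line in lines:
        line = line.strip()
        field = FIELD_RE.match(line)
        if field:
            ET.SubElement(bill, TAG_FOR[field.group(1)]).text = field.group(2)
        elif BILL_RE.fullmatch(line):
            bill.set("bill_name", line)
    return bill


bill = lift(["S. 345", "DATE INTRODUCED 02/03/1999", "SPONSOR Allard"])
print(ET.tostring(bill, encoding="unicode"))
```

Formatting details such as bold and italic switches are dropped at this level, which is why the presentation-level XML is archived as well.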

81
XML as an Archival Format

Information level schema as an XML DTD

bills (bill) committees?, congressional_record?, cosponsors?,
date_introduced?,
digest?, latest_status_list?, official_title?,
sponsor?, statement_of_purpose?,
submitted_by?, submitted_for?) bill_name CDATA REQUIRED (committee) (cosponsor) T latest_status_list (latest_status) latest_status (ls_date, ls_txt) abstract (PCDATA) (PCDATA) (PCDATA) T co_name (PCDATA) CDATA IMPLIED (PCDATA) (PCDATA) (PCDATA)
82
Open Archival Information System (OAIS)
Information Model

An AIP (archival information package) contains
content information (CI) (represented as
info_objects), and
preservation description information (PDI)
(A)IP (archival) information package
DI descriptive information
PI packaging information (ISO-9660 for CD
directories)
CI content information
PDI preservation description
information
PR
provenance (origin, processing history)
CON context (relation to external
information)
REF reference (identifies the CI, e.g.,
ISBN, URI)
FIX
fixity (e.g., checksum over CI)
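
The package structure can be mirrored in a few Python dataclasses. This is a sketch only: the field types, the helper `make_aip`, and the choice of SHA-256 as the fixity checksum are assumptions, since the OAIS model does not mandate a particular algorithm.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class PDI:
    """Preservation description information."""
    provenance: List[str]  # origin and processing history
    context: str           # relation to external information
    reference: str         # identifier of the content information
    fixity: str            # hex SHA-256 digest of the content (assumed choice)


@dataclass
class AIP:
    """Archival information package: content plus its PDI."""
    content: bytes
    pdi: PDI

    def verify_fixity(self) -> bool:
        return hashlib.sha256(self.content).hexdigest() == self.pdi.fixity


def make_aip(content: bytes, reference: str, context: str, provenance: List[str]) -> AIP:
    digest = hashlib.sha256(content).hexdigest()
    return AIP(content, PDI(list(provenance), context, reference, digest))


aip = make_aip(b"<bill/>", "urn:example:s345", "106th Congress", ["RTF from CD-ROM"])
print(aip.verify_fixity())  # True
```

Recomputing the digest after every migration gives a cheap check that the content information has not been altered in transit.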

83
Archival Ingestion Networks
Transformation t is information preserving if
it is reversible, i.e., if there is an inverse
t_inv s.t. for all d in dom(t):
t_inv( t( d ) ) = d .

Example
d1, d2, ... in HTML -> wrapper -> d1', d2', ...
in XML
d1', d2', ... -> inverse wrapper (XSLT) -> d1,
d2, ... in HTML
asking for exact inverse often not practical
consider e.g. normalized HTML or restrict to
higher level representations
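
The reversibility condition can be tested on sample documents. The sketch below checks t_inv(t(d)) = d only for the given sample, not for all of dom(t), and the toy wrappers are hypothetical stand-ins for real ingestion transformations.

```python
from xml.sax.saxutils import escape, unescape


def round_trip_failures(t, t_inv, sample):
    """Return the sample documents d for which t_inv(t(d)) != d."""
    return [d for d in sample if t_inv(t(d)) != d]


def wrap(d: str) -> str:
    """Lossless toy wrapper: escape the text and enclose it in <doc>."""
    return f"<doc>{escape(d)}</doc>"


def lossy_wrap(d: str) -> str:
    """Toy wrapper that normalizes whitespace and therefore loses information."""
    return f"<doc>{escape(' '.join(d.split()))}</doc>"


def unwrap(x: str) -> str:
    """Inverse of wrap: strip the enclosing element and unescape."""
    return unescape(x[len("<doc>"):-len("</doc>")])


sample = ["a < b", "two  spaces", "R&D"]
print(round_trip_failures(wrap, unwrap, sample))        # []
print(round_trip_failures(lossy_wrap, unwrap, sample))  # ['two  spaces']
```

The lossy wrapper illustrates why an exact inverse is often impractical: once whitespace is normalized, only a normalized form of the original can be recovered.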

84
Ingestion Network Senate Collection
85
From XML-Based to Knowledge-Based Archives

Collection-based archival with XML save data "as
is" plus...
... separate content from presentation
... tag your data (take a lift in the info
hierarchy)
... use a self-describing, semistructured data
format (XML)
Knowledge-based archival now add ...
... conceptual level information
... integrity constraints
... explanations/derivation rules
archiving only results y = f(x) vs. archiving the
rules/function "f" (e.g. f = the
Florida procedure...)
employ knowledge representation languages

86
Knowledge-Based Archival Senate Example

Data provider says
Please archive all records of legislative
activities of the 106th senate!
Integrity constraints, e.g.
(1) senators_with_file = UNION(sponsor,
cosponsors, submitted_by)
(2) senators sponsors co-sponsors
Violation
the rhs is a SUPERSET of the lhs !
Exceptions
(Chafee, John), (Gramm, Phil), (Miller, Zell)
(Possible) Explanations
senators who joined (Zell), passed away (Chafee),
were forgotten (Gramm)!?
Checking ICs
IF sponsor(X), not senator(X) THEN
ADD(exception_log, missing_senator_info(X))
IF condition THEN action
Action: LOG, WARN,
ABORT, ...
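
The integrity check can be expressed with set operations. In this Python sketch the three exception names come from the slide, while the contents of `senators_with_file` and the function name are hypothetical; only the LOG action is implemented.

```python
def check_senators(senators_with_file, referenced):
    """IF referenced(X), not senator_with_file(X)
    THEN ADD(exception_log, missing_senator_info(X))."""
    exception_log = []
    for name in sorted(referenced - senators_with_file):
        exception_log.append(("missing_senator_info", name))
    return exception_log


senators_with_file = {"Allard, Wayne"}  # hypothetical extract of the 99 files
referenced = {                          # sponsors, cosponsors, submitters
    "Allard, Wayne",
    "Chafee, John",
    "Gramm, Phil",
    "Miller, Zell",
}

log = check_senators(senators_with_file, referenced)
for entry in log:
    print(entry)
```

Logging rather than aborting keeps the ingestion running while still recording every exception for the data provider to explain.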

87
Maximizing Self-Containedness ...

Self-validating archives add ...
... "executable knowledge" (rules)
"helping (bugging?) the data provider"
add the functionality and meaning of DTD
(SchemaIC...) validation to the AIP
package the validator!
Self-instantiating archives add ...
... "executable ingestion process"
helping the archival engineer (aka archivist)
here is looking over your shoulder
add the functionality of database
transformations to the AIP
package the transformers!
BUT packaging validators and transformers
increases infrastructure dependence!

88
Maximize Self-Containedness ...While
Minimizing Infrastructure Dependence

Basic Idea use a language of executable
specifications for self-validation and
self-instantiation!
Use Bootstrapping for Self-Validating
Self-Instantiating Archives
Example: DTD Validator in Logic (F-Logic,
Datalog, ...)
specify
false IF PX, not (P1.X)Y.
false IF PX, not (P2.X)Y.
false IF PX, not P_- _.
false IF PXN-_, not N1, not N2.
...
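
The idea of a validator written as rules that derive "false" on a violation can be imitated in Python. This sketch is not F-Logic or Datalog and is not the authors' validator: it encodes a hypothetical fragment of a content model as regular expressions over child tag names and returns one message per violated rule.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model: element -> regex over "child " sequences,
# or None for elements that may contain text only.
CONTENT_MODEL = {
    "bills": r"(bill )*",
    "bill": r"(date_introduced )?(sponsor )?",
    "date_introduced": None,
    "sponsor": None,
}


def violations(elem, model=CONTENT_MODEL):
    """Return one message per violated rule; an empty list means valid."""
    if elem.tag not in model:
        return [f"undeclared element: {elem.tag}"]
    found = []
    children = "".join(child.tag + " " for child in elem)
    pattern = model[elem.tag]
    if pattern is None:
        if children:
            found.append(f"{elem.tag}: expected text only")
    elif not re.fullmatch(pattern, children):
        found.append(f"{elem.tag}: bad child sequence [{children.strip()}]")
    for child in elem:
        found.extend(violations(child, model))
    return found


good = ET.fromstring("<bills><bill><date_introduced>02/03/1999</date_introduced>"
                     "<sponsor>Allard</sponsor></bill></bills>")
bad = ET.fromstring("<bills><bill><sponsor>Allard</sponsor>"
                    "<date_introduced>02/03/1999</date_introduced></bill></bills>")
print(violations(good))
print(violations(bad))
```

A declarative rule language would state the same conditions without the traversal code, which is the reason the slides prefer logic rules for packaging inside the archive.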

89
XML Extensions as General Constraint Languages

Assume an archival language $A$ for IPs (e.g.
$A$ = XML)
Def.: $C$ is a constraint language for $A$ if, for all
$\varphi \in C$, the set of valid archives $V(\varphi) = \{a \in A \mid a \models \varphi\}$ is decidable.
Example: $C$ = XML_DTD, $\varphi$ = Senate_DTD
Def.: $C'$ subsumes $C$ ($C' \geq C$) w.r.t. $A$ if, for
all $\varphi \in C$, there is an encoding $enc(\varphi) \in C'$ s.t.
for all $a \in A$:
$a \models \varphi$ iff $a \models enc(\varphi)$
Proposition:
XML_Schema $\geq$ XML_DTD
F_Logic, Datalog $\geq$ XML_DTD

90
Summary Towards Bootstrapping Knowledge-Based
Archives

enable addition of semantic annotations
("knowledge") via logic rules to AIPs
add executable specifications of semantics
AIP + KP (knowledge package,
i.e., logic rules)
self-validating archive
add executable specifications of the ingestion
network AIP + IN (ingestion network, ...more
logic rules)
self-instantiating archive

bootstrapping knowledge-based archive with
DTD/Schema/IC validation and ingestion
transformations all expressed in a declarative
logic program
Outlook (from the to-do list): build a prototype
BARON: Bootstrapping Archive of Rules,
Ontologies, and Ingestion Networks

Baron von Münchhausen, pulling himself out of the
swamp
91
References

Towards Self-Validating Knowledge-Based
Archives, Bertram Ludäscher, Richard Marciano,
Reagan Moore, 11th Workshop on Research Issues in
Data Engineering (RIDE), Heidelberg, IEEE
Computer Society, April 2001, SDSC TR-2001-1,
January 18, 2001.
Knowledge-Based Persistent Archives, Reagan
Moore, SDSC TR-2001-7, January 18, 2001
The Senate Legislative Activities Collection
(SLA) a Case Study Infrastructure Research to
Support Preservation Strategies, Richard
Marciano, Bertram Ludäscher, Reagan Moore, SDSC
TR-2001-5, January 18, 2001
Reference Model for an Open Archival Information
System (OAIS), Draft Recommendation, Consultative
Committee for Space Data Systems, CCSDS
650.0-R-1, May 1999.
Digital Rosetta Stone A Conceptual Model for
Maintaining Long-term Access to Digital
Documents, Alan R. Heminger, Steven B. Robertson

92
ADDITIONAL MATERIAL AHEAD
93
Collection-Based Archival with XML

Archival Formats Desiderata
standardized, open, as simple as possible,
self-contained and self-describing
XML provides a good framework for archival
Data/Instance Level records/objects/tuples
content information (CI)
Schema/Class Level
collection structure metadata, types
packaging information (PI) and descriptive
information (DI)
Missing in Action...
conceptual level information relationships
between collection attributes/classes, integrity
constraints, derived knowledge,
parts in CON, PI, but need for knowledge packages
(KPs)
Knowledge-Based Archival

94
Getting your hands dirty with logic rules

Some logic rules for reassembling the doc
structure (lexical scopes) from the OAV (or
rather AOV)

attr_interval(Attr, SID, Attr_val, LN, LN1) :-
oav(Attr, (SID, LN), Attr_val),
oav(Attr, (SID, LN1), _), LN1 > LN,
not attr_between(Attr,SID,LN,LN1).
attr_between(Attr,SID,LN,LN1) :-
oav(Attr, (SID, LN), _), oav(Attr, (SID,
LN1), _), oav(Attr, (SID, LN2), _),
LN
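
The interval rule can be paraphrased in Python. This sketch assumes that the truncated `attr_between` condition means "some occurrence of the same attribute lies strictly between LN and LN1", so an interval runs from one occurrence to the next occurrence of that attribute in the same file. The sample facts are invented.

```python
def attr_intervals(oav):
    """oav: iterable of (attr, (sid, ln), value) facts.
    Returns (attr, sid, value, ln, ln1) where ln1 is the next line in the
    same file (sid) at which the same attribute occurs again."""
    facts = list(oav)
    result = []
    for attr, (sid, ln), value in facts:
        later = [ln1 for a, (s, ln1), _ in facts
                 if a == attr and s == sid and ln1 > ln]
        if later:
            # min(later) is the only ln1 with no occurrence in between.
            result.append((attr, sid, value, ln, min(later)))
    return sorted(result)


facts = [
    ("bill_name", ("f1", 3), "S. 345"),
    ("bill_name", ("f1", 10), "S. 346"),
    ("bill_name", ("f1", 20), "S. 347"),
]
for interval in attr_intervals(facts):
    print(interval)
```

Each interval is the lexical scope of one attribute value, which is what allows the document structure to be reassembled from the flat occurrence index.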
95
Summary: What is the declarative (logic) approach?

Use of declarative database and knowledge
representation formalisms for...
adding knowledge packages to AIPs
capture context known at the time of archival
using conceptual models of collections, integrity
constraints, virtual relations,
applying them at ingestion, migration, and
instantiation/access time
( wrapping, transforming, querying
collections)