Title: Collection- and Knowledge-Based Persistent Archives at SDSC
1Collection- and Knowledge-Based Persistent
Archives at SDSC
- Bertram Ludäscher
- LUDAESCH_at_SDSC.EDU
- San Diego Supercomputer Center
- University of California, San Diego
Work sponsored by the National Archives and Records Administration (NARA), the Advanced Research Projects Agency, and NHPRC
2Content Overview
- Part I
- SDSC's Persistent Archives Initiative (material from Reagan Moore, Deputy Director SDSC, Data Intensive Computing Environments)
- Example/Case Study: Wrapping Websites into XML for Archival (material from Valter Crescenzi, visiting from U Roma 3)
- Part II
- From Collection-Based to Self-Validating Knowledge-Based Archives (joint work with Reagan Moore and Richard Marciano)
- Running Example: The Senate Collection
3Data Intensive Computing Environments
- Staff
- Reagan Moore
- Chaitan Baru
- Sheau Yen Chen
- Charles Cowart
- Amarnath Gupta
- George Kremenek
- Bertram Ludäscher
- Richard Marciano
- Arcot Rajasekar
- Abe Singer
- Michael Wan
- Ilya Zaslavsky
- Bing Zhu
- Students - GSRA
- Martin Kuhl
- Liying Sui
- Yang Yu
- Valter Crescenzi
- Students - Undergrad Interns
- Peter Shin
- Roman Olshanowsky
- Shabbar Tambawala
- Pratik Mukhopadhyay
- /- NN
4Part I
- Overview
- SDSC's Persistent Archival Approach
- Case Study
- Wrapping Websites into XML for Archival
5Persistent Archive Goals/Objectives
- Manage digital objects for the life of the republic
- Maintain ability to discover and access digital objects while the supporting hardware and software systems evolve
6Example Archive Components
- Storage system for storing the digital objects
- e.g. HPSS (tape silos with disk cache) or SANs (Storage Area Networks)
- Database for managing a collection that represents the digital objects
- e.g. an object-relational DBMS
- Web server for discovering and displaying the digital objects
- e.g. CGI scripts with helper applications
7Example Archive
- Assyrian clay tablets
- Provided long term storage, but limited volume
- Characterize an archive by the bandwidth it
provides for transporting data into the future
and its archival capacity
8 Risks and Challenges of Persistent Archives
- Each of the software and hardware systems may become obsolete
- the storage media may degrade
- the storage system may become obsolete
- the database backups may become obsolete, with no way to recover the collection (structure)
- the digital object formats may become obsolete, with no helper application that can read them
9Good News Persistent Archives are Possible
- The Archivist (archival engineer) is in control
- Archivist gets to decide on the persistence policies
- how to minimize risk
- how to minimize cost
- when to use new technology
10Persistent Archive Bandwidth (Migration Speed)
- Ability/Necessity to transport data is
- (size of the archive) / (media lifetime)
- the larger the archive and/or the shorter the lifetime,
- the higher the required bandwidth (migration speed)
- Example: size(archive) = 100 TB, media_lifetime = 5 years
- ability/necessity to migrate 20 TB/year to avoid data loss!
- Clay tablets provided a long media lifetime, but a very small storage capacity
- effective bandwidth was a byte/day
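The bandwidth rule on this slide is a single division. A minimal sketch in Python follows; the function name and the choice of TB/year as the unit are assumptions made for illustration, not part of the original slides.

```python
def required_migration_rate(archive_size_tb: float, media_lifetime_years: float) -> float:
    """Minimum sustained migration rate (TB/year) needed to copy the whole
    archive onto new media before the old media reaches end of life."""
    if media_lifetime_years <= 0:
        raise ValueError("media lifetime must be positive")
    return archive_size_tb / media_lifetime_years


# The slide's example: a 100 TB archive on media that lasts 5 years.
rate = required_migration_rate(100, 5)
print(f"{rate:.0f} TB/year")
```

A larger archive or a shorter media lifetime raises the result, which is the sense in which the slide calls bandwidth both an ability and a necessity.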
11Concept 1
- Persistent Archive is a Migration Mechanism
- Since the amount of data is increasing exponentially, the archive capacity must increase correspondingly
- Migrate to new technology to get to higher sustainable Archive Bandwidths
12Data Scales
- Megabyte - one million bytes
- Digital content of a book
- Gigabyte - one thousand MBs
- Terabyte - one thousand GBs
- Digital content of a film
- Petabyte - one thousand TBs
- Amount of tape produced in 1994
- Exabyte - one thousand PBs
- Data produced per year
13Archive Bandwidth
- If you wait too long to migrate, you will be unable to read the data from the archive before the media degrades
- There is a maximum capacity for any choice of archive media, for a given capital investment in media read devices
14Archive Capacities for a 2-tape Drive System
15SDSC Archive
- Currently store 240 TeraBytes, with the capacity of the system being 500 TeraBytes
- 16 tape drives in 3 silos
- Migrated digital holdings through
- three different storage systems
- five different computer systems
- six different types of tape media
16Reasons for Migration
- Avoid data loss
- manage degradation of media
- Minimize cost
- minimize the number of tapes that must be managed
- recover floor space
- Keep pace with data growth
- provide higher Archive Bandwidth
17Migration Costs
- Media costs are fixed
- price of each new tape technology cartridge is the same as the previous cartridge, but the capacity is doubled (so far)
- Cost is then
- 1 + 1/2 + 1/4 + 1/8 + ... < 2 times original price
- (additional assumption: labor cost is minimized by using a tape robot)
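The factor-of-two bound can be written as a geometric series. The sketch below assumes the holdings stay fixed, the cartridge price stays constant, and every migration generation halves the number of cartridges needed; the symbol C_0 (the media cost of the first generation) is introduced here for illustration.

```latex
C_{\text{total}}
  = C_0 \sum_{k=0}^{\infty} \left(\tfrac{1}{2}\right)^{k}
  = C_0 \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \cdots\right)
  = \frac{C_0}{1 - \tfrac{1}{2}}
  = 2\,C_0
```

After any finite number n of migrations the partial sum is C_0 (2 - 2^{-n}), which stays strictly below 2 C_0.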
18Concept 2
- Automation of all processes is essential if costs are to be minimized
- eliminate manual manipulation of tapes
- robots
- eliminate manual manipulation of digital objects
- data handling systems
- eliminate manual discovery of digital objects
- information catalogs
19Data Archive
(Diagram labels: Ingest Services; Management; Access Services; Ingestion platform; Data repositories; Access platform; Interoperability Standards; Interoperability Protocols)
20ERA Concept model
21Concept 3
- Persistent Archive is an Interoperability System
- persistence requires migration over time onto new technology
- while the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology
- employ XML-based interoperability (mediation) technology
22Implicit Concepts for Persistent Archive
- Infrastructure independence
- Non-proprietary formatting
- Collection management
- Data set access
- Authentication
- Presentation
- Information models
- XML as a (meta-) information markup language
- Example GML - Graphics markup language
- Support for ingestion, management, access
- Accessioning workbench, archive, access workbench
23XML as a Standard Information Markup Language
- XML representation of metadata attributes
- standardization of DTDs - MOA II DTD for text
- standardization of markup language
- XML-based representation of collection structure
- attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, ...)
- XML databases: XML-organized data collections
- commercial systems: Excelon, TAMINO, Oracle8i, ...
- XML-based queries (XQuery, Quilt, XQL, XMAS, ...)
- XML-based Topic Maps
- represent relationships between collection domain concepts, collection attributes
- navigational access of intra- and intercollection concept spaces
24Archival Example: E-mail Collection
- Test of the scalability of the technology
- archived a one-million record E-mail collection (1999)
- Ingestion
- tagged E-mail using XML syntax (6 required, 13 optional, 1000 user-defined tags): wrapping of raw data
- created description of the collection
- aggregated E-mail into containers, stored in an archive
- retrieved collection description, created database, and optimized for query
- Total time was 27 hours (used 10 Mbit/sec Ethernet)
25Collection-Based Persistent Archive
(Diagram labels: Ingest Services; Management; Access Services; Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; Information; XML DTD; Data Handling System - SRB / FTP / HTTP; Data; Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
26Collection-Based Persistent Archive Processes
27What Types of Interoperability are Needed?
- Data management (digital objects)
- ability to work with multiple types of storage systems, across separate administration domains
- Information management (attributes)
- ability to define a collection independent of database choice
- ability to migrate collection onto new databases
- Knowledge management (relationships)
- ability to manage relationships and high-level domain concepts
- ability to map concepts to collection attributes
28Simplest Definitions
- Data
- digital object, i.e., the object representation as a bit stream
- Information
- any tagged data, where tags are treated as information attributes
- attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
- Knowledge
- higher-order concepts and relationships between attributes
- relationships can be procedural, temporal, structural, spatial, functional, ... and described in a logic formalism (semantic networks, description logics, conceptual graphs, ...) which is often rule-based (e.g. Datalog, Frame-Logic)
30Types of Knowledge Relationships
- Logical / semantic
- e.g. Digital Library cross-walks
- Temporal / procedural
- e.g. Workflow systems
- Spatial / structural
- e.g. GIS systems
- Functional / algorithmic
- e.g. scientific feature analysis
31Knowledge-Based Persistent Archive (more in Part II)
(Diagram labels: Ingest Services; Management; Access Services; Knowledge or Topic-Based Query / Browse; Knowledge Repository for Rules; Relationships Between Concepts; XTM DTD; Knowledge; Rules - KQL; Topic Maps / Buckets / Model-based Access; Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; Information; XML DTD; Data Handling System - SRB / FTP / HTTP; Data; Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
32Creating Archivable Forms
- Archivable form for digital objects
- infrastructure-independent/self-describing...
- ... mechanism for describing the digital object format
- ... mechanism for displaying the digital object based upon the digital object description
- no proprietary formats
- information content directly tagged
- based on XML (for information tagging, interoperability, ...)
- Example: Archival of Web information
33Wrapping Websites into XML for Archival
Valter Crescenzi
Visiting Scholar, San Diego Supercomputer Center, University of California, San Diego, crescenz_at_sdsc.edu
Dipartimento di Informatica ed Automazione, Universita' di Roma Tre, crescenz_at_dia.uniroma3.it
34Outline
- General issues for archival of websites
- Specific aspects of archival of websites
- gathering information
- data extraction
- Extraction of information content from web pages
- first results
- effort needed to extract information from web pages
- recommendations to minimize wrapping effort
35Website Archival Issues
- How to access?
- locally ( /htdocs/ )
- remotely ( http// )
- What to archive?
- a binary image of the site
- (explicit) information content only
- behavior?
- as is (e.g., existing cgi scripts)
- equivalent functionality
36Gathering Information
- Def.: A website is completely crawlable if all pages can be automatically copied to a local archive
- Fact: Many (most?) websites are not completely crawlable!
- example: service-oriented websites (maps.yahoo.com, ...)
- Many mirroring tools are available if the site is crawlable, i.e. through index pages
- If the site's URLs can be enumerated, a specialized mirroring tool may easily be developed
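As an illustration of URL enumeration, the sketch below generates the page addresses a specialized mirroring tool would visit. The host name and the path pattern are hypothetical placeholders, not the layout of any real site, and the sketch deliberately stops short of fetching anything.

```python
from typing import Iterator


def enumerate_bill_urls(base: str, congress: int, first: int, last: int) -> Iterator[str]:
    """Yield one URL per bill number; the path pattern is a made-up example."""
    for number in range(first, last + 1):
        yield f"{base}/congress/{congress}/bill/{number}.html"


# Hypothetical base address; a real tool would download each URL to a local archive.
urls = list(enumerate_bill_urls("http://example.org", 106, 1, 3))
for url in urls:
    print(url)
```

Enumeration only works when the identifier space is known in advance, which is why service-oriented sites that compute pages on demand are not completely crawlable.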
37Focus on Crawlable Data-intensive Websites
(Diagram: sites placed by data-centricity/structuredness, low to high, against complexity of applications, low to high. Region labels: Data-intensive Sites; Web-Based Information System; Service-Oriented Sites; Web-Presence Sites. Example sites: Senate, NeuronDB, www.amazon.com, www.hotmail.com, maps.yahoo.com, GridPortal.)
38Information Extraction Approach
- Information content may be extracted from
documents through dedicated software modules
called wrappers
Information content (e.g. in XML)
Web pages (X)HTML
39Extraction of Information Content
- Wrappers
- brittle
- difficult to maintain
- mean time to failure quite short!
- manually coded/automatically generated
- manually: very expensive, but the only possibility if sources lack structure
- automatically: cheap but needs structure
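To make the notion of a hand-coded wrapper concrete, here is a minimal sketch that pulls two fields out of an HTML fragment with regular expressions and emits XML. The sample page, the patterns, and the output element names are assumptions made for illustration; this is not the Minerva wrapper used in the case study.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment, loosely modeled on a bill summary page.
SAMPLE_HTML = """
<html><body>
<b>S. 345</b><br>
<b>Sponsor:</b> <a href="#">Sen Allard</a> (introduced 02/03/1999)<br>
</body></html>
"""

# Extraction rules: each one is tied to a specific page layout.
BILL_RE = re.compile(r"<b>(?P<number>[A-Z][A-Z.\s]*\d+)</b>")
SPONSOR_RE = re.compile(
    r"<b>Sponsor:</b>\s*<a[^>]*>(?P<sponsor>[^<]+)</a>\s*"
    r"\(introduced\s+(?P<date>\d{1,2}/\d{1,2}/\d{2,4})\)",
    re.IGNORECASE,
)


def wrap_bill(html: str) -> ET.Element:
    """Extract the information content of one page as an XML element."""
    bill_match = BILL_RE.search(html)
    sponsor_match = SPONSOR_RE.search(html)
    if bill_match is None or sponsor_match is None:
        # A layout change breaks the wrapper: this is the brittleness noted above.
        raise ValueError("page layout does not match the wrapper")
    bill = ET.Element("bill", {"bill_name": bill_match.group("number").strip()})
    ET.SubElement(bill, "sponsor").text = sponsor_match.group("sponsor").strip()
    ET.SubElement(bill, "date_introduced").text = sponsor_match.group("date")
    return bill


print(ET.tostring(wrap_bill(SAMPLE_HTML), encoding="unicode"))
```

Any change in markup, such as a different label or tag nesting, makes the patterns fail, which is why such wrappers need continued maintenance.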
40Case Study for Extraction of Information CONTENT
The "Senate Web Site"
- Bill Status
- Summary Info
- Site is crawlable
- through URL enumeration
- through indices
41Sample XML Output File
42Why Wrapping HTML into XML?
- What is being archived?
- presentation (maybe)
- content (most likely)
- Technology/infrastructure independence: a "better" persistent archival format
- XML can express information model and schema information (while HTML only provides fixed structural and presentational information)
43Example Minerva Wrapper Specification
S out.println("encoding'ISO-8859-1' ?")
out.println("'BillSummary.dtd'") out.println("")
Bill\sSummary\s\sStatus\sfor\st
he\sCongress\sCongress BillNumber ((nbsp)"("")")?
out.println(" "Congress""
) out.println("
"BillNumber"")
(NewStyle OldStyle)
TitleList Status ...
44Steps of the Manual Wrapping Process
- Initial wrapper specification has been derived from a few sample bills from the 106th congress
- Refinement using more samples (including from other congresses)
- changes in the HTML layout starting from the 105th have been discovered!!
- many manual fixes were needed (irregularities in the structure)
- use of a random bill URL generator for testing the wrapper: two more fixes
45Changes in the HTML Layout
H.R.4236 (Major Legislation)
Public
Law 104-333 (11/12/96)
SPONSOR href""Rep Young, D. (introduced
09/27/96)
HTML code from a bill of the 104th congress
H.J.RES.25 Sponsor href""Rep Livingston, Bob (introduced
1/9/1997)
Latest Major Action 2/3/1997
Became Public Law No 105-1.
Title Making technical corrections to the
Omnibus Consolidated Appropriations
Act, 1997 (Public Law 104-208), and
for other purposes.
Corresponding code from a bill of the 105th
congress
46Level of Development Effort (Senate Website)
- Manual Approach
- Around 350 lines of wrapper code
- 250 lines of grammar specification
- 100 lines to specify output format
- One full day to write the specification
- One full day to test and debug it
- Automatic Approach
- automatic wrapper generator fails!
47Wrapping a Structured Website NeuronDB
- NeuronDB is a well-structured site which presents information content of an underlying database about neurons
- A wrapper has been manually coded for comparing the efforts needed to wrap this structured site and the Senate site
- The automatic wrapper generator was able to successfully build a wrapper without user interaction
49Level of Development Effort (Neuron DB)
- Manual Approach
- Around 220 lines of code
- 140 lines of grammar specification
- 80 lines to specify output format
- Half day to write the specification
- The wrapper extracts all information content
- Automatic Approach
- automatic wrapper generator succeeds
50Automatic Wrapper Generation Succeeds for NeuronDB
- The wrapper generator toolkit is part of an ongoing project at the Terza Universita di Roma called RoadRunner
- The wrapper has been automatically generated looking at similar pages without user interactions
- wrapper generation takes a few seconds
51Neuron DB the inferred schema
- A common schema (expressed as regular expression)
is inferred for input pages
A B ( C ) ( D ( ( E F ( G )? ) )? ( H ( I
)? ) ( J K ( L )? ) ) ( M N )
- The schema is enriched with the extraction rules needed to actually wrap sources
- This is a kind of physical schema of the HTML layout, not a logical schema
52NeuronDB Result of the data extraction
53Results (1)
- Make sites crawlable for remote archival
- e.g. archival backdoor
- Extracting information from web sites may be very expensive depending on the (ir)regularities of pages
- The more regular the structure, the cheaper the writing of wrappers
- Automatic approaches are feasible for well-structured web pages!
54Results (2)
- Well-Structured Sites
- not only for data archival but also to minimize cost for web site maintenance and management
- XHTML can help (simplifies XHTML to XML)
- Separation of content and presentation
- XML + XSL(T)
55References (1)
- G. Mecca, P. Atzeni: Cut and Paste. Journal of Computer and System Sciences, Special Issue on PODS'97, 1999
- DOM: The document object model. http://www.w3.org/DOM/
- D. W. Embley, D. M. Campbell, Jiang Y. S., S. W. Liddle, Ng Y., D. Quass, and Smith R. D.: A conceptual-modeling approach to extracting data from the web. In ER98.
- N. Kushmerick: Wrapper Induction: Efficiency and expressiveness. Artificial Intelligence, 118, 200
- V. Crescenzi, G. Mecca: Grammars Have Exceptions. Information Systems, Special Issue on Semistructured Data, 1998
56References (2)
- B. Adelberg: NoDoSe, a tool for semi-automatically extracting structured and semistructured data from text documents. In SIGMOD98.
- The Neuron DB Web Site: http://senselab.med.yale.edu/senselab/NeuronDB/
- S. Grumbach, G. Mecca: In Search of the Lost Schema. In Proceedings of Intern. Conference on Database Theory (ICDT'99), 1999
- The Senate Web Site: http://thomas.loc.gov/
- The Tidy Utility: http://www.w3.org/People/Raggett/tidy/
- The W3C XHTML activity: http://www.w3.org/MarkUp/
- Extensible Markup Language (XML): http://www.w3.org/XML/
- Extensible Stylesheet Language (XSL): http://www.w3.org/Style/XSL/
57Overview
- Part I
- SDSC's Persistent Archives Initiative (material from Reagan Moore, Deputy Director SDSC, Data Intensive Computing Environments)
- Example/Case Study: Wrapping Websites into XML for Archival (material from Valter Crescenzi, visiting from U Roma 3)
- Part II
- From Collection-Based to Self-Validating Knowledge-Based Archives (joint work with Reagan Moore and Richard Marciano)
- Running Example: The Senate Collection
58WARM UP XML (eXtensible Markup Language)
- origins: HTML + SGML (ISO Standard, 1986, 600pp)
- W3C standard (26 pp): XML syntax + DTDs
- XML = HTML - presentational tags + user-defined DTD (tags + nesting)
- a metalanguage for defining other languages (e.g. via DTDs)
- XML is more like SGML than HTML
- XML = SGML - complexity, document perspective + simplicity, data exchange perspective
59XML as a Self-Describing Data Exchange Format
- can be easily understood by our friend (... even using CP/M edlin)
- can be parsed easily
- contains its own structure (parse tree) in the data
- allows the application programmer to rediscover schema and content/semantics (to which extent???)
- may include its own schema description (e.g., DTD, XML Schema)
- meta-language: definition of specific languages (XYZ-ML)
- allows separation of marked-up content from presentation (style sheets)
- many tools (and many more to come -- (re)use code): parsers, validators, query languages, storage, ...
- standards (good for interoperation, integration, etc.)
- generic standards (XML, DTDs, XML Schema, XPath, ...)
- community/industry standards (specific markup languages)
60Mind Your Vocabulary: Identifying Vocabularies with XML Namespaces
- My element may not be your element
- geometry context: line
- chemistry context: oxygen
- SGML/XML context: ....
- use XML namespaces to identify the vocabulary
- ... when I say semantics, I can make clear whether I am talking as a logician (needs additional specification: mathematical logic, philosophy, AI, ...) or a linguist, or a psychologist, etc.
61XML Namespaces
- mechanism for globally unique tag names
-
- xmlns:h="http://www.w3.org/HTML/1998/html4"
- Book Review
- ...
-
- XML A Primer
- ...
-
- mix of different tag vocabularies without confusion
- namespaces only identify the vocabulary; additional mechanisms are required for structure and meaning of tags
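A short sketch of how two vocabularies that both define a "title" tag can coexist in one document, using Python's standard xml.etree.ElementTree. The HTML namespace URI is the one shown on the slide; the book namespace URI and the enclosing "review" element are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """
<review xmlns:h="http://www.w3.org/HTML/1998/html4"
        xmlns:b="http://example.org/books">
  <h:title>Book Review</h:title>
  <b:title>XML: A Primer</b:title>
</review>
""".strip()

# Prefixes used in queries are local to this mapping; only the URIs matter.
NS = {
    "h": "http://www.w3.org/HTML/1998/html4",
    "b": "http://example.org/books",  # hypothetical vocabulary
}

root = ET.fromstring(DOC)
html_title = root.find("h:title", NS).text
book_title = root.find("b:title", NS).text
print(html_title, "/", book_title)
```

The parser keeps the two title elements apart purely by namespace URI; it knows nothing about what either tag means, which is the limitation stated in the last bullet.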
62Information Hierarchy (Simplest Definitions)
- Data
- digital object, i.e., the object representation as a bit stream
- Information
- any tagged data, where tags are treated as information attributes
- attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
- Knowledge
- higher-order concepts and relationships between attributes
- relationships can be procedural, temporal, structural, spatial, functional, ... and described in a logic formalism (semantic networks, description logics, conceptual graphs, ...) which is often rule-based (e.g. Datalog, Frame-Logic)
63Types of Knowledge Relationships
- Logical / semantic (digression: semantics ≠ semantics)
- e.g. Digital Library cross-walks
- Temporal / procedural
- e.g. Workflow systems
- Spatial / structural
- e.g. GIS systems
- Functional / algorithmic
- e.g. scientific feature analysis
64Knowledge-Based Persistent Archive
(Diagram labels: Ingest Services; Management; Access Services; Knowledge or Topic-Based Query / Browse; Knowledge Repository for Rules; Relationships Between Concepts; XTM DTD; Knowledge; Rules - KQL; Topic Maps / Buckets / Model-based Access; Information Repository; Attribute-based Query; Attributes, Semantics; SDLIP; Information; XML DTD; Data Handling System - SRB / FTP / HTTP; Data; Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
65Data Handling System
SDSC Storage Resource Broker and Meta-data Catalog
(Diagram labels: Application; Resource; Third-party copy; User; Remote Proxies; MCAT; Dublin Core; DataCutter; Application Meta-data)
66Ingestion Processes for Collection Creation
(Diagram labels: Accession Template; Closure; Concept/Attribute; Attribute Inverse Indexing; Information Generation; Knowledge Generation; Attribute Tagging; Attribute Selection; Occurrence Tagging; View Management; Data Organization; Collection)
67Running Example Senate Collection
- from the 106th US Congress database
- keeps track of Senate bills, resolutions, and amendments
- raw format: 99 RTF (Rich Text Format) files on CD-ROM (provided by NARA)
- one file per senator
68Examples of Implied Knowledge: Senate Legislative Activities
- Structural knowledge
- Pertinent information embedded in document headers
- Procedural knowledge
- Naming convention
- Senator represented by last name
- Senator represented by last name and state
- Senator represented by last name, first name, and state
- Collection knowledge
- Referenced senators include senators no longer in the senate
69Knowledge Generation
- Accessioning Template
- Defines the concepts under which the data objects will be tagged and organized
- Attribute selection
- Define the attributes that represent the information content associated with the domain concepts
- Tag attributes using minimal constraint language, such as XML or XMLSchema
- Evaluate closure of mined attributes compared to expected attributes
- Refine concept map
70Information Generation
- Create occurrence index
- (Occurrence, attribute, value)
- This is needed to be able to recreate original form of digital object
- Analyze completeness of information
- Inverse index of attribute values
- Identifies unexpected values - consistency
- Analyze closure of collection
- Are additional attributes needed to represent inverse index value ranges?
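The occurrence index and its inverse can be sketched with plain Python containers. In the sketch below an occurrence is assumed to be a (file, line) pair; the file names, line numbers, and the misspelled sample value are invented for illustration.

```python
from collections import defaultdict

# (occurrence, attribute, value) triples; an occurrence is (file id, line number).
occurrences = [
    (("allard.rtf", 3), "bill_name", "S. 345"),
    (("allard.rtf", 4), "date_introduced", "02/03/1999"),
    (("allard.rtf", 5), "sponsor", "Allard"),
    (("other.rtf", 7), "sponsor", "Allard"),
    (("other.rtf", 9), "sponsor", "Alard"),  # deliberately misspelled sample value
]


def inverse_index(triples):
    """Map attribute -> value -> list of occurrences where that value appears."""
    index = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in triples:
        index[attribute][value].append(occurrence)
    return index


def unexpected_values(index, attribute, expected):
    """Values seen for an attribute that are not in the expected set."""
    return sorted(value for value in index[attribute] if value not in expected)


index = inverse_index(occurrences)
odd = unexpected_values(index, "sponsor", {"Allard"})
print(odd)
```

Because every value keeps its occurrence list, an unexpected value can be traced back to the exact place in the original object where it was found.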
71Data Organization
- Archive preferred views of collection
- Original data
- XML tagged representation
- Minimal representation of consolidated information
- Noise-free version based upon occurrence tags
- Object-relational database version
- Archive occurrence tagged view
- Archive ingestion procedures that transform collection from the original digital objects to the preferred views
72Information Management Projects
- Digital Libraries
- NSF Digital Library Initiative, Phase II - UCSB, Stanford
- Digital Embryo digital library - GMU
- NPACI Digital Sky - Caltech 2MASS sky survey
- CDL - AMICO
- NSF NSDL - UCAR / DLESE
- Grid Environments
- NASA Information Power Grid - NASA Ames
- DOE Data Visualization Corridor - LLNL
- DOE Particle Physics Data Grid - Stanford, Caltech
- NSF Grid Physics Network - U Fl
- Persistent Archives
- NARA Persistent Archive
- NHPRC - Scalable archives
73Research and Development Activities - FY00
- Demonstration of scalable systems
- Expansion of persistent archive Framework
- Knowledge-based persistent archives
- Demonstration of archivable forms for new types of data
- Web, GIS, compound documents, collections
- Knowledge and anomaly processing
- Tightness of fit of XML DTDs
- Self-validating archives as a preservation strategy
74Research Challenges
- Infrastructure independence
- Progress on archivable form creation
- Digital paper
- Finding aids for a million collections
- Concept spaces that support identification of collection
- Product authentication
- Tracking all updates, movements, media migrations, collection instantiations
- Choice of Archival Markup Language
- Tracking of E-commerce implementations
- Knowledge management systems
- Workflow, ingestion processing steps, system evolution procedures, finding aid concept spaces
75Digital Archives
- Problem
- How to achieve long-term preservation of information (for the archivist: records) and sustained access?
- Challenges and Opportunities
- fight archive obsolescence in the presence of rapidly changing storage, data formats, software environment, hardware, ...
- Approaches
- Time out (do nothing: assume hardware, software, data formats, etc. all work 400 years from now ...)
- Emulation (emulate hardware and software infrastructure)
- Migration (migrate to new infrastructure)
- Factors
- What do you need to archive? (records, data, programs, ?)
- determine usefulness and cost of emulation vs. migration
- archival of electronic records: data-centric migration
76What is it That We Try to Archive??
- What constitutes a record?
- beats me...
- but there are hierarchies of information / abstractions
- data ... information ... knowledge ... wisdom?
- instance ... schema ... model ... metamodel ... metametamodel ...
- object serialization ... data structure ... data model ... meta model
- What is the nature of the information?
- data ... functions/programs
- extensional data ... intensional/virtual/derived data (facts/rules)
- Managing complexity using layers
- protocol stacks (e.g. ISO/OSI, SemanticWeb, Semantic Mediation)
- going up: abstract, correlate, aggregate, index, ... the lower levels
77Archival Processes and Functions
- Data Submission/Accessioning
- loop between information producer and archival engineer (ok: archivist)
- Ingestion
- a sequence of information preserving transformations is applied to submitted "raw data": ingestion network
- Migration
- ... as time goes by ...
- ... migrate to new physical media, maybe data formats, information model ...
- "easy migration" needs a "good" archival format and model
- Instantiation/Access
- revive/reanimate the archive as a queryable collection/database
- Goal: preserve information!
- (ok: just records ...)
78Archival Example Senate Collection
- is maybe NOT what you get (a not so well
documented format)
79Senate Collection Example
- Rich Text Format (a documented Microsoft
format)
\pard\parM \pard\b S. 345\b0\parM
\pard\qr DATE INTRODUCED 02/03/1999\parM
\pard SPONSOR Allard\parM \i\qc OFFICIAL
TITLE\i0\parM \pard A bill to amend the Animal
Welfare Act to remove the limitation that permits
\ interstate movement of live birds, for the
purpose of fighting, to States in which \ animal
fighting is lawful.\parM \i\qc LATEST
STATUS\i0\par\pardM \pard\plain
\fi-1900\li1900\nowidctlpar\adjustrightFeb 3,
1999\tab Read twice and\ referred to the
Committee on Agriculture.\parM \pardM
S. 345 bold"off"DATE INTRODUCED 02/03/1999 bold"off"SPONSOR Allard bold"off" italic"off"OFFICIAL TITLE bold"off" italic"off"A bill to amend the
Animal Welfare Act to remove the lim\ itation
that permits interstate movement of live birds,
for the purpose of fighting\ , to States in which
animal fighting is lawful. bold"off" italic"off"LATEST STATUS
Feb 3, 1999tabRead twice and
referred to the Committee on Agriculture\ .g
80Senate Collection Example
- the XML can be lifted from the presentation
level
S. 345 bold"off"DATE INTRODUCED 02/03/1999 bold"off"SPONSOR Allard bold"off" italic"off"OFFICIAL TITLE bold"off" italic"off"A bill to amend the
Animal Welfare Act to remove the lim\ itation
that permits interstate movement of live birds,
for the purpose of fighting\ , to States in which
animal fighting is lawful. bold"off" italic"off"LATEST STATUS
Feb 3, 1999tabRead twice and
referred to the Committee on Agriculture\ .g
SENATE AGRICULTURE
02/03/1999te_introduced
Feb 3,
1999
Read twice and referred to the
Committee on Agriculture
A bill to amend the Animal
Welfare Act to remove the limitation that permits
interstate movement of live birds, for the
purpose of fighting, to States in which animal
fighting is lawful.
Allard, Wayne CO
81XML as an Archival Format
- Information level schema as an XML DTD
bills (bill) committees?, congressional_record?, cosponsors?,
date_introduced?,
digest?, latest_status_list?, official_title?,
sponsor?, statement_of_purpose?,
submitted_by?, submitted_for?) bill_name CDATA REQUIRED (committee) (cosponsor) T latest_status_list (latest_status) latest_status (ls_date, ls_txt) abstract (PCDATA) (PCDATA) (PCDATA) T co_name (PCDATA) CDATA IMPLIED (PCDATA) (PCDATA) (PCDATA)
82Open Archival Information System (OAIS)
Information Model
- An AIP (archival information package) contains
- content information (CI) (represented as info_objects), and
- preservation description information (PDI)
- (A)IP: (archival) information package
- DI: descriptive information
- PI: packaging information (ISO-9660 for CD directories)
- CI: content information
- PDI: preservation description information
- PR: provenance (origin, processing history)
- CON: context (relation to external information)
- REF: reference (identifies the CI, e.g., ISBN, URI)
- FIX: fixity (e.g., checksum over CI)
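The package structure can be sketched as two small record types, with the fixity entry computed as a checksum over the content information. The class and field names and the choice of SHA-256 are assumptions for illustration; the OAIS model itself does not prescribe a data structure or a particular checksum.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PDI:
    """Preservation description information (field names are illustrative)."""
    provenance: str   # PR: origin, processing history
    context: str      # CON: relation to external information
    reference: str    # REF: identifier of the content information
    fixity: str       # FIX: here, a hex SHA-256 digest of the content


@dataclass(frozen=True)
class AIP:
    """Archival information package: content information plus its PDI."""
    content: bytes
    pdi: PDI


def make_aip(content: bytes, provenance: str, context: str, reference: str) -> AIP:
    digest = hashlib.sha256(content).hexdigest()
    return AIP(content, PDI(provenance, context, reference, digest))


def fixity_ok(aip: AIP) -> bool:
    """Recompute the checksum and compare it with the stored fixity value."""
    return hashlib.sha256(aip.content).hexdigest() == aip.pdi.fixity


aip = make_aip(b"<bill bill_name='S. 345'/>", "RTF files on CD-ROM", "106th Congress", "urn:example:s345")
print(fixity_ok(aip))
```

Rerunning the fixity check after each media migration is one simple way to detect silent corruption of the content information.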
83 Archival Ingestion Networks
Transformation t is information preserving if it is reversible, i.e., if there is an inverse t_inv s.t. for all d in dom(t): t_inv(t(d)) = d.
- Example
- HTML documents d1, d2, ... -> wrapper -> XML documents d1, d2, ...
- XML documents d1, d2, ... -> inverse wrapper (XSLT) -> HTML documents d1, d2, ...
- asking for an exact inverse is often not practical
- consider e.g. normalized HTML or restrict to higher level representations
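The reversibility condition can be tested on sample inputs. The sketch below uses a toy wrapper from (name, state) records to XML and back; the record shape, element names, and helper names are assumptions. Note that a sample-based check can only refute information preservation, it cannot prove it for all of dom(t).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, Tuple

Record = Tuple[str, str]  # (senator name, state)


def wrap(d: Record) -> str:
    """Toy transformation t: record -> XML string."""
    name, state = d
    element = ET.Element("senator", {"state": state})
    element.text = name
    return ET.tostring(element, encoding="unicode")


def unwrap(x: str) -> Record:
    """Candidate inverse t_inv: XML string -> record."""
    element = ET.fromstring(x)
    return (element.text or "", element.get("state", ""))


def wrap_lossy(d: Record) -> str:
    """A transformation that drops the state, so no inverse can exist."""
    name, _state = d
    element = ET.Element("senator")
    element.text = name
    return ET.tostring(element, encoding="unicode")


def round_trips(t: Callable, t_inv: Callable, samples: Iterable) -> bool:
    """Check t_inv(t(d)) == d on every sample d."""
    return all(t_inv(t(d)) == d for d in samples)


samples = [("Allard", "CO"), ("Chafee", "RI")]
preserving = round_trips(wrap, unwrap, samples)
lossy = round_trips(wrap_lossy, unwrap, samples)
print(preserving, lossy)
```

Restricting the check to a higher-level representation, as the slide suggests, corresponds to comparing normalized records rather than the raw input bytes.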
84Ingestion Network Senate Collection
85From XML-Based to Knowledge-Based Archives
- Collection-based archival with XML: save data "as is" plus...
- ... separate content from presentation
- ... tag your data (take a lift in the info hierarchy)
- ... use a self-describing, semistructured data format (XML)
- Knowledge-based archival: now add ...
- ... conceptual level information
- ... integrity constraints
- ... explanations/derivation rules
- archiving only results y = f(x) vs. archiving the rules/function "f" (e.g. f = the Florida procedure...)
- employ knowledge representation languages
86 Knowledge-Based Archival Senate Example
- Data provider says
- Please archive all records of legislative activities of the 106th senate!
- Integrity constraints, e.g.
- (1) senators_with_file = UNION(sponsor, cosponsors, submitted_by)
- (2) senators = sponsors UNION co-sponsors
- Violation
- the rhs is a SUPERSET of the lhs!
- Exceptions
- (Chafee, John), (Gramm, Phil), (Miller, Zell)
- (Possible) Explanations
- senators who joined (Zell), passed away (Chafee), were forgotten (Gramm)!?
- Checking ICs
- IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))
- IF condition THEN action
- Action: LOG, WARN, ABORT, ...
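The IF condition THEN action rule can be sketched as a plain Python check. The data below is a made-up miniature: the bill number "S. 999" is hypothetical, and the function and field names are assumptions; the slides express the rule in a logic language, not in Python.

```python
# Senators for whom a file was delivered (toy subset).
senators = {"Allard"}

# Extracted bill records; "S. 999" is a hypothetical bill used only for the demo.
bills = [
    {"bill_name": "S. 345", "sponsor": "Allard"},
    {"bill_name": "S. 999", "sponsor": "Chafee"},
]


def check_sponsors(bills, senators, action="LOG"):
    """IF sponsor(X), not senator(X) THEN record missing_senator_info(X).

    action="LOG" collects exceptions; action="ABORT" stops at the first one.
    """
    exception_log = []
    for bill in bills:
        x = bill["sponsor"]
        if x not in senators:
            entry = ("missing_senator_info", x)
            if action == "ABORT":
                raise ValueError(f"integrity constraint violated: {entry}")
            if entry not in exception_log:
                exception_log.append(entry)
    return exception_log


log = check_sponsors(bills, senators)
print(log)
```

Logging instead of aborting lets ingestion continue while the exceptions are sent back to the data provider for explanation.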
87 Maximizing Self-Containedness ...
- Self-validating archives add ...
- ... "executable knowledge" (rules)
- "helping (bugging?) the data provider"
- add the functionality and meaning of DTD (Schema, IC, ...) validation to the AIP
- package the validator!
- Self-instantiating archives add ...
- ... "executable ingestion process"
- helping the archival engineer (aka archivist)
- here is looking over your shoulder
- add the functionality of database transformations to the AIP
- package the transformers!
- BUT: packaging validators and transformers increases infrastructure dependence!
88Maximize Self-Containedness ... While Minimizing Infrastructure Dependence
- Basic Idea: use a language of executable specifications for self-validation and self-instantiation!
- Use Bootstrapping for Self-Validating, Self-Instantiating Archives
- Example: DTD Validator in Logic (F-Logic, Datalog, ...)
- specify
- false IF PX, not (P1.X)Y.
- false IF PX, not (P2.X)Y.
- false IF PX, not P_- _.
- false IF PXN-_, not N1, not N2.
- ...
89XML Extensions as General Constraint Languages
- Assume an archival language $A$ for IPs (e.g. $A$ = XML)
- Def.: $C$ is a constraint language for $A$ if, for all $\varphi \in C$, the set of valid archives $V(\varphi) = \{\, a \in A \mid a \models \varphi \,\}$ is decidable.
- Example: $C$ = XML_DTD, $\varphi$ = Senate_DTD
- Def.: $C'$ subsumes $C$ ($C' \geq C$) w.r.t. $A$ if, for all $\varphi \in C$, there is an encoding $enc(\varphi) \in C'$ s.t. for all $a \in A$:
- $a \models \varphi$ iff $a \models enc(\varphi)$
- Proposition
- XML_Schema $\geq$ XML_DTD
- {F_Logic, Datalog} $\geq$ XML_DTD
90Summary: Towards Bootstrapping Knowledge-Based Archives
- enable addition of semantic annotations ("knowledge") via logic rules to AIPs
- add executable specifications of semantics: AIP + KP (knowledge package, i.e., logic rules)
- self-validating archive
- add executable specifications of the ingestion network: AIP + IN (ingestion network, ...more logic rules)
- self-instantiating archive
- bootstrapping knowledge-based archive with DTD/Schema/IC validation and ingestion transformations all expressed in a declarative logic program
- Outlook (from the 2do list): build a prototype
- BARON: Bootstrapping Archive of Rules, Ontologies, and Ingestion Networks
(Image: Baron von Münchhausen, pulling himself out of the swamp)
91References
- Towards Self-Validating Knowledge-Based Archives, Bertram Ludäscher, Richard Marciano, Reagan Moore, 11th Workshop on Research Issues in Data Engineering (RIDE), Heidelberg, IEEE Computer Society, April 2001, SDSC TR-2001-1, January 18, 2001.
- Knowledge-Based Persistent Archives, Reagan Moore, SDSC TR-2001-7, January 18, 2001
- The Senate Legislative Activities Collection (SLA): a Case Study. Infrastructure Research to Support Preservation Strategies, Richard Marciano, Bertram Ludäscher, Reagan Moore, SDSC TR-2001-5, January 18, 2001
- Reference Model for an Open Archival Information System (OAIS), Draft Recommendation, Consultative Committee for Space Data Systems, CCSDS 650.0-R-1, May 1999.
- Digital Rosetta Stone: A Conceptual Model for Maintaining Long-term Access to Digital Documents, Alan R. Heminger, Steven B. Robertson
92ADDITIONAL MATERIAL AHEAD
93 Collection-Based Archival with XML
- Archival Formats Desiderata
- standardized, open, as simple as possible,
- self-contained and self-describing
- XML provides a good framework for archival
- Data/Instance Level: records/objects/tuples
- content information (CI)
- Schema/Class Level
- collection structure: metadata, types
- packaging information (PI) and descriptive information (DI)
- Missing in Action...
- conceptual level information: relationships between collection attributes/classes, integrity constraints, derived knowledge, ...
- parts in CON, PI, but need for knowledge packages (KPs)
- Knowledge-Based Archival
94Getting your hands dirty with logic rules
- Some logic rules for reassembling the doc structure (lexical scopes) from the OAV (or rather AOV)
attr_interval(Attr, SID, Attr_val, LN, LN1) :-
    oav(Attr, (SID, LN), Attr_val),
    oav(Attr, (SID, LN1), _), LN1 > LN,
    not attr_between(Attr,SID,LN,LN1).
attr_between(Attr,SID,LN,LN1) :-
    oav(Attr, (SID, LN), _), oav(Attr, (SID, LN1), _), oav(Attr, (SID, LN2), _),
LN
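The rule text on this slide is cut off, so the following Python sketch rests on an assumed reading: attr_interval pairs each occurrence of an attribute with the next occurrence of the same attribute in the same source (SID), with no other occurrence in between. The sample facts and the assumption that line numbers are distinct per attribute and source are illustrative only.

```python
def attr_intervals(oav):
    """Return sorted (attr, sid, value, ln, ln1) tuples where ln1 is the line of
    the next occurrence of attr in the same source after line ln."""
    by_key = {}
    for attr, (sid, ln), value in oav:
        by_key.setdefault((attr, sid), []).append((ln, value))

    intervals = []
    for (attr, sid), occurrences in by_key.items():
        occurrences.sort()
        # Consecutive pairs in line order have, by construction, nothing in between.
        for (ln, value), (ln1, _next_value) in zip(occurrences, occurrences[1:]):
            if ln1 > ln:
                intervals.append((attr, sid, value, ln, ln1))
    return sorted(intervals)


# oav(Attr, (SID, LN), Value) facts; values and line numbers are made up.
facts = [
    ("bill_name", ("s1", 40), "S. 347"),
    ("bill_name", ("s1", 10), "S. 345"),
    ("bill_name", ("s1", 25), "S. 346"),
    ("sponsor", ("s1", 12), "Allard"),
]

scopes = attr_intervals(facts)
for scope in scopes:
    print(scope)
```

Each interval gives the lexical scope of one attribute value, for example the range of lines belonging to one bill, which is the information needed to reassemble the document structure.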
95Summary what is the declarative (logic) approach?
- Use of declarative database and knowledge representation formalisms for...
- adding knowledge packages to AIPs
- capture context known at the time of archival using conceptual models of collections, integrity constraints, virtual relations, ...
- applying them at ingestion, migration, and instantiation/access time
- (wrapping, transforming, querying collections)