Title: L.S. Jackson
1 Illinois Electronic Archives Project: Preserving Electronic Publications
- http://realfun.isrl.uiuc.edu/eArchives/
- Principal Investigator
- Joe Natale, Illinois State Library
- jnatale_at_ilsos.net
- StateGILS-4 Presenter
- Larry S. Jackson, University of Illinois, Graduate School of Library and Information Science
- lsjackso_at_uiuc.edu
- http://realfun.isrl.uiuc.edu/jackson
2 Illinois Electronic Archives Project: Preserving Electronic Publications
- Design philosophy
- System summary
- Results to date
- Problems to date
- Open issues
- Implementation schedule
3 Design Philosophy
- Support local control, local responsibility
- Can be run remotely -- it's the Web, after all
- Nominal system costs (acquisition, operation)
- Commodity hardware, freeware
- Costs should be comparable to a listserv or httpd
- Relatively complete coverage of Web features
- i.e., static or dynamic HTML, and things that look like URLs
- Minimal administrative burden on agency web authors and webmasters, once running
- Web access to configuration and job control
4 Design Philosophy
- More an archive than a digital library
- For the PEP project at this stage, it is more important to retain the (volatile) materials than to facilitate a wide variety of modes of searching and browsing.
- Retention facilities must support subsequent export into other archival formats or systems.
- Usage modes will reflect user information needs and scenarios that are not presently understood.
- System capacity requirements, being a function of the result of user needs assessment, should not be presumed to warrant great expenditures. If great numbers of users, or highly time-critical needs, are identified, it may be necessary to export the PEP archive contents to a high-performance system.
- That is not presently anticipated.
5 Design Philosophy
- Standards-based (i.e., primarily uses XML and open, mature freeware)
- Forms the basis of the design of the selected freeware, and generally essential for freeware interoperability.
- Best odds on eventually supporting data export, or interoperability of the PEP system with other tools/layers.
- Do no harm
- Not all sites are standards-based, but if they are, the PEP system should archive them as completely and accurately as the freeware components support.
- e.g., spiders generally can't process (i.e., follow links out of) a proprietary-format file. They (at most) just download the proprietary-format file itself.
7 PEP High-Level System Architecture Summary
[Figure: architecture diagram. Box labels recovered from the slide:]
- Users (browser)
- Apache Web Server
- XML Web Site Descriptors
- Configuration Admin Script
- XML Spider Defaults
- WGET Web Spider
- Retrieval Script
- Spider Launcher and Post-Processor
- Retrieved Sites Document Trees
- Version Deposit
- Retrieval Logs
- Version Retrieval Script
- ISearch Engine
- CVS Version Control System
- mySQL DBMS (Metadata)
[Key: Existing Technology, New, Pending, Data File]
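One way the spider launcher might turn an XML web site descriptor into a WGET invocation is sketched below. The descriptor layout (`site`, `start_url`, `wait_seconds`), the `/archive` root, and the function name are assumptions made for illustration; the project's actual schema and scripts are not shown on the slide. The command is only built, not executed.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor layout; the project's real XML schema is not shown.
DESCRIPTOR = """
<site name="DoTran">
  <start_url>http://www.dot.state.il.us/</start_url>
  <wait_seconds>1</wait_seconds>
</site>
"""

def build_wget_command(descriptor_xml, timestamp, archive_root="/archive"):
    """Turn one site descriptor into a wget argument list (not executed here)."""
    site = ET.fromstring(descriptor_xml)
    name = site.get("name")
    start_url = site.findtext("start_url").strip()
    wait = site.findtext("wait_seconds", default="1").strip()
    # One document tree per site per retrieval time (assumed layout).
    # A real launcher would need to create this directory before running wget,
    # because the log file is written into it.
    target_dir = f"{archive_root}/{name}/{timestamp}"
    return [
        "wget",
        "--mirror",                           # recursive retrieval with timestamping
        "--no-parent",                        # never ascend above the start URL
        f"--wait={wait}",                     # pause between requests
        f"--directory-prefix={target_dir}",   # where the document tree is saved
        f"--output-file={target_dir}/retrieval.log",  # retrieval log
        start_url,
    ]

if __name__ == "__main__":
    print(" ".join(build_wget_command(DESCRIPTOR, "200201121430")))
```

Keeping per-site settings in a descriptor and building the command from it means the spider defaults can change without editing the launcher itself.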
9 Results to Date
- The State of Illinois web (as of 15 Jan 2002)
- 5.6 GB total size
- 84K files
- 23K are HTML, 26K are PDF, 1.7K are DOC, 9.3K are GIF, 5.5K are JPEG
- 82K META tags
- META tag usage frequency extensively analyzed by type
- Discovered one agency using their own thesaurus (vs. using ISL's thesaurus)
- Most are not using embedded metadata significantly
- They may have their own, separate databases and indexes
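Counts like these can be gathered by walking a retrieved document tree. The sketch below is one possible approach using only the Python standard library, not the project's actual analysis script; the function and class names are hypothetical.

```python
import os
from collections import Counter
from html.parser import HTMLParser

class MetaCounter(HTMLParser):
    """Counts META tags by their name (or http-equiv) attribute."""

    def __init__(self):
        super().__init__()
        self.names = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            key = attrs.get("name") or attrs.get("http-equiv") or "(unnamed)"
            self.names[key.lower()] += 1

def survey_tree(root):
    """Return (files per extension, META tags per name) for a document tree."""
    extensions, meta_names = Counter(), Counter()
    for dirpath, _dirs, files in os.walk(root):
        for filename in files:
            ext = os.path.splitext(filename)[1].lower() or "(none)"
            extensions[ext] += 1
            if ext in (".html", ".htm"):
                parser = MetaCounter()
                path = os.path.join(dirpath, filename)
                # Tolerate odd encodings; only the tag names matter here.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    parser.feed(handle.read())
                meta_names.update(parser.names)
    return extensions, meta_names
```

Calling `survey_tree` on one agency's tree gives a per-extension file count and a per-name META tag count, which is enough to spot an agency whose tag vocabulary differs from the rest.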
10 Results to Date
- Partial (private) ability to browse the recorded documents, for all 100 official sites.
[Figure: screenshots labeled ARCHIVE and LIVE WEBSITE]
11 Results to Date
- Now working on version control issues
- Many freeware incompatibility issues
- especially variations in file and directory naming conventions and metacharacters across different platform types.
- How fast is this content changing? What metric(s)?
- e.g., text vs. binary changes
[Figure: chart labeled Binary Bytes, Text Bytes, and Spidering]
13 Problems Encountered to Date
- General design issues when including freeware/shareware
- Overlooking hidden costs.
- Is it adequate, per our system requirements?
- Is it reliable and robust?
- If not, at what cost can it be fixed?
- At what cost can it be extended?
- e.g., plug-in architectures, provisions for interfacing with scripts and other programs, good documentation.
- What are the learning-curve costs to set it up?
- Will all the parts work with our hardware and operating system?
14 Problems Encountered to Date
- Hidden implementation problems
- Platform incompatibilities
- Platform: computer hardware and/or operating system. (Applies to both client/browser and server.)
- Metacharacters
- Characters performing special functions, per the given operating system. Legal names under one system may be illegal under another system.
- e.g., / . , ! ( ) newlines, tabs, and space.
- Case sensitivity
- Requesting A.html and a.html from some servers can return one document twice. On other servers, these names may return two different documents.
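Both hazards can be checked before retrieved files are written to disk. The sketch below shows one possible mitigation, not the project's actual approach: percent-encoding metacharacters in a URL path segment, and flagging names that differ only by case. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import quote

def safe_name(url_path_segment):
    """Percent-encode everything except letters, digits, and '_', '.', '-', '~'."""
    return quote(url_path_segment, safe="")

def case_collisions(names):
    """Group names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return [group for group in groups.values() if len(group) > 1]

if __name__ == "__main__":
    print(safe_name("my file!.html"))
    print(case_collisions(["A.html", "a.html", "b.html"]))
```

A collision report does not say whether the two names are one document or two; that still has to be settled by comparing the retrieved content.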
15 Problems Encountered to Date
- Hidden implementation problems
- Platform incompatibilities
- Use of "cutting-edge" features in web pages limits portability and disenfranchises segments of the population
- Mostly seeing just animations or appearance-changing flash.
- Technologies often vendor-specific, not portable to, or interoperable with, other browsers, wireless access devices, PDAs, or disability access devices.
- Freeware (e.g., spiders) doesn't try to (can't afford to) keep up with all these web-incremental experiments.
16 Problems Encountered to Date
- Hidden implementation problems
- Platform incompatibilities
- Dynamic (script-served) documents
- Web sites composing web pages only in response to user request have date/time tags that say "just modified!", while the content has not in fact changed. (Static HTML files are usually tagged with the date/time of the last editing of the file.)
- Bugs in freeware?
- Your money cheerfully refunded!
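Since the date/time tag on a script-served page cannot be trusted, one alternative is to compare a digest of the page body against the digest from the previous retrieval. This is a minimal sketch under that assumption, not the project's method; the function names are hypothetical.

```python
import hashlib

def content_digest(body: bytes) -> str:
    """Fingerprint the page body itself, independent of any header dates."""
    return hashlib.sha256(body).hexdigest()

def has_changed(previous_digest: str, body: bytes) -> bool:
    """True when the body differs from the previously recorded snapshot."""
    return previous_digest != content_digest(body)

if __name__ == "__main__":
    first = content_digest(b"<html>bridge report</html>")
    print(has_changed(first, b"<html>bridge report</html>"))
    print(has_changed(first, b"<html>bridge report, revised</html>"))
```

A page that embeds the current time or a hit counter in its body would still register as changed on every retrieval, so such pages would need extra normalization before hashing.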
17 Problems Encountered to Date
- Statewide metadata retrofit
- From the agency point of view, unforeseen costs, time requirements, and promises of usability benefits they little understand.
- ISL's metadata generator
- A web script
- Produces the HTML META tags for direct insertion into the HEAD of agency HTML documents.
- Tag system and thesaurus adapted from Wa-GILS by ISL.
- Part of a largely manual cataloging and editing process, by the agencies themselves
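The core of such a generator is small. The sketch below renders a dictionary of fields as META tags ready to paste into a document HEAD. It is an illustration only: ISL's actual script and its Wa-GILS-derived tag names are not shown on the slide, so the field names used here (`title`, `subject`) are placeholders.

```python
from html import escape

def meta_tags(record):
    """Render a dict of metadata fields as HTML META tags, one per line."""
    lines = []
    for name, value in record.items():
        # Escape quotes and ampersands so values cannot break the attribute.
        lines.append(
            f'<meta name="{escape(name, quote=True)}" '
            f'content="{escape(value, quote=True)}">'
        )
    return "\n".join(lines)

if __name__ == "__main__":
    # Placeholder field names, not the real tag system.
    print(meta_tags({"title": "Bridges", "subject": 'Roads & "Bridges"'}))
```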
19 Open Issues
- What are useful metrics addressing web site change?
- Presumably based on some combination of
- ( ) of (new, changed, deleted) (files, lines, bytes)
- elapsed time
- How can these be visualized effectively?
- What rates of change are typical?
- For "term of office" periodicities (e.g., a few years)?
- For "disaster response" periodicities (e.g., an hour)?
- Is government web site change reliably, quantifiably predictable per such models?
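One candidate metric from the combination above, files that are new, changed, or deleted per unit of elapsed time, might be computed as follows. The input format (a mapping from file path to content digest for each retrieval) and the function names are assumptions for illustration.

```python
def change_counts(old, new):
    """old/new map file path -> content digest for two retrievals of one site."""
    old_paths, new_paths = set(old), set(new)
    return {
        "new": len(new_paths - old_paths),
        "deleted": len(old_paths - new_paths),
        "changed": sum(1 for p in old_paths & new_paths if old[p] != new[p]),
    }

def change_rate(old, new, elapsed_days):
    """Files new, changed, or deleted per day between two retrievals."""
    counts = change_counts(old, new)
    return sum(counts.values()) / elapsed_days

if __name__ == "__main__":
    week1 = {"a.html": "1", "b.html": "2", "c.pdf": "3"}
    week2 = {"a.html": "1", "b.html": "9", "d.pdf": "4"}
    print(change_counts(week1, week2), change_rate(week1, week2, 7))
```

The same structure works for lines or bytes by swapping the per-file digest for a per-file size or line count and summing differences instead of counting files.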
20 Open Issues
- Is it live, or is it Memorex?
- Will users become confused between live and recorded material when browsing an archive?
- Almost certainly! Would you notice this prefix, only visible in the lower-left corner, and only visible when the mouse cursor is over a link?
21 Open Issues
- Is it live, or is it Memorex?
- Differences in URL appearance
- An original web page URL: http://www.dot.state.il.us/bridges/bridges.html
- Resulting archive URL: http://pep.isrl.uiuc.edu/DoTran/200201121430/www.dot.state.il.us/bridges/bridges.html
- Off-site content is
- live (volatile, transient), not archived, and
- intermingled with links pointing back into the archive
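The example pair of URLs suggests a mapping of the form archive host, site label, retrieval timestamp, then the original host and path. The sketch below reproduces that example and tests whether a link points into the archive. Reading `DoTran` as a site label and `200201121430` as a retrieval timestamp is an assumption drawn from the example, and the function names are hypothetical.

```python
from urllib.parse import urlsplit

ARCHIVE_BASE = "http://pep.isrl.uiuc.edu"  # host taken from the slide's example

def archive_url(original_url, site_label, timestamp):
    """Map a live URL to its assumed location in the archive."""
    parts = urlsplit(original_url)
    return f"{ARCHIVE_BASE}/{site_label}/{timestamp}/{parts.netloc}{parts.path}"

def is_archived(url):
    """True for links into the archive; anything else is live content."""
    return url.startswith(ARCHIVE_BASE + "/")

if __name__ == "__main__":
    print(archive_url(
        "http://www.dot.state.il.us/bridges/bridges.html",
        "DoTran",
        "200201121430",
    ))
```

A check like `is_archived` is what a post-processor would need in order to mark off-site links as live rather than recorded.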
22 Open Issues
- Is it live, or is it Memorex?
- What measures can be taken to minimize this chance of confusion?
- e.g., headers/footers à la Google's cached documents
- HTML FRAMEs
- color change, etc.
- Are legal disclaimers in order?
- Should state-run archives have some form of pedigree statement on their contents?
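The header option could be implemented as a post-processing step that inserts a notice at the top of each archived page. This is a rough sketch of that idea only; the banner wording, styling, and function name are invented for illustration, and a regular expression is a simplification that will not handle every malformed page.

```python
import re
from html import escape

def add_banner(page, captured, original_url):
    """Insert an 'archived copy' notice right after the opening BODY tag."""
    banner = (
        '<div style="background:#ffc;border:1px solid #000;padding:4px">'
        f"ARCHIVED COPY retrieved {escape(captured)} from {escape(original_url)}. "
        "This is not the live site.</div>"
    )
    new_page, count = re.subn(
        r"(<body\b[^>]*>)",
        lambda match: match.group(1) + banner,
        page,
        count=1,
        flags=re.IGNORECASE,
    )
    # Pages without a BODY tag get the banner prepended instead.
    return new_page if count else banner + page
```

Because the banner changes the stored page, it is probably better applied when a page is served than when it is deposited, so the archived bytes stay identical to what was retrieved.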
23 Open Issues
- Searching archives, and other usability issues
- Some users will want specific web pages on specific dates
- How to best support "vertical" searches, for the same page or topic, down through the layers of different versions?
- e.g., identifying the versions of each identified document where a search term was or was not found
- Part of a more basic question of identifying the user community, and assessing their needs
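A "vertical" search over one page can be sketched as a scan through its stored versions, reporting where the term was and was not found. The input format (retrieval timestamp mapped to page text) and the sample timestamps and text are made up for illustration; a real system would query the search engine's index rather than scan text.

```python
def vertical_search(versions, term):
    """versions maps timestamp -> page text; report where the term appears."""
    term = term.lower()
    found, missing = [], []
    for timestamp in sorted(versions):
        bucket = found if term in versions[timestamp].lower() else missing
        bucket.append(timestamp)
    return found, missing

if __name__ == "__main__":
    # Illustrative data only.
    page_versions = {
        "200201051430": "Road conditions",
        "200201121430": "Bridge closures this week",
        "200201191430": "bridge reopened",
    }
    print(vertical_search(page_versions, "bridge"))
```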
24 Open Issues
- Approximating the retrofit of systematic metadata across one agency's web archive
- So as to better convince the agencies of the value of systematic metadata use.
- Implementation intended via insertion of additional records into the archive's metadata database, and thereby into the search process.
- Tag storage would remain distinct from agency-provided metadata, to support removal if desired.
- Perform a cost/benefit analysis, from the agency's point of view.
26 Implementation Schedule
- Now: Extensive testing with 5 IL agencies, with beginning statistics-gathering. Irregular archival of all 100 IL agencies. Integrating acquisition and archival script.
- May/June: Develop web-based control features and test keyword search mechanisms.
- July/August: Install custom host hardware and begin weekly archival for the 5 agencies, and infrequent archival for all 100 IL agencies. Gather statistics.
- September: Package, document, and release project software download. Report statistical analysis.
27 Illinois Electronic Archives Project: Preserving Electronic Publications
- http://realfun.isrl.uiuc.edu/eArchives/
- Principal Investigator
- Joe Natale, Illinois State Library
- jnatale_at_ilsos.net
- StateGILS-4 Presenter
- Larry S. Jackson, University of Illinois, Graduate School of Library and Information Science
- lsjackso_at_uiuc.edu
- http://realfun.isrl.uiuc.edu/jackson