L.S. Jackson - PowerPoint PPT Presentation

Provided by: larrysj

Transcript and Presenter's Notes

Title: L.S. Jackson


1
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • http://realfun.isrl.uiuc.edu/eArchives/
  • Principal Investigator
  • Joe Natale, Illinois State Library
  • jnatale@ilsos.net
  • StateGILS-4 Presenter
  • Larry S. Jackson, University of Illinois,
    Graduate School of Library and Information
    Science
  • lsjackso@uiuc.edu
  • http://realfun.isrl.uiuc.edu/jackson

2
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • Design philosophy
  • System summary
  • Results to date
  • Problems to date
  • Open issues
  • Implementation schedule

3
Design Philosophy
  • Support local control, local responsibility
  • Can be run remotely -- it's the Web, after all
  • Nominal system costs (acquisition, operation)
  • Commodity hardware, freeware
  • Costs should be comparable to a listserv or
    httpd
  • Relatively complete coverage of Web features
  • i.e., static or dynamic HTML, and things that
    look like URLs
  • Minimal administrative burden on agency web
    authors and webmasters, once running
  • Web access to configuration and job control

4
Design Philosophy
  • More an archive than a digital library
  • For the PEP project at this stage, it is more
    important to retain the (volatile) materials than
    to facilitate a wide variety of modes of
    searching and browsing.
  • Retention facilities must support subsequent
    export into other archival formats or systems.
  • Usage modes will reflect user information needs
    and scenarios that are not presently understood.
  • System capacity requirements, being a function of
    the result of user needs assessment, should not
    be presumed to warrant great expenditures. If
    great numbers of users, or highly time-critical
    needs are identified, it may be necessary to
    export the PEP archive contents to a
    high-performance system.
  • That is not presently anticipated.

5
Design Philosophy
  • Standards-based (i.e., primarily uses XML and
    open, mature freeware)
  • Forms the basis of the design of the selected
    freeware, and generally essential for freeware
    interoperability.
  • Best odds on eventually supporting data export,
    or interoperability of the PEP system with other
    tools/layers.
  • Do no harm
  • Not all sites are standards-based, but if they
    are, the PEP system should archive them as
    completely and accurately as the freeware
    components support.
  • e.g., spiders generally can't process (i.e.,
    follow links out of) a proprietary-format file.
    They (at most) just download the
    proprietary-format file itself.

6
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • Design philosophy
  • System summary
  • Results to date
  • Problems to date
  • Open issues
  • Implementation schedule

7
PEP High-Level System Architecture Summary
[Architecture diagram. Components: Users (browser);
Apache Web Server; XML Web Site Descriptors;
Configuration Admin Script; XML Spider Defaults;
WGET Web Spider; Retrieval Script; Spider Launcher
and Post-Processor; Retrieved Sites Document Trees;
Version Deposit; Retrieval Logs; Version Retrieval
Script; ISearch Engine; CVS Version Control System;
mySQL DBMS (Metadata).
Key: Existing Technology; New; Pending; Data File.]
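
The spider launcher in the architecture summary could be sketched roughly as follows. This is an illustration only, not the project's script: the function name, the directory layout, and the YYYYMMDDHHMM timestamp format are assumptions. The options used (--mirror, --page-requisites, --no-parent, --directory-prefix, --output-file) are standard GNU Wget options.

```python
from datetime import datetime


def build_wget_command(site_url, agency, archive_root, when):
    """Return an argv list for one wget mirroring run (hypothetical layout)."""
    stamp = when.strftime("%Y%m%d%H%M")  # assumed stamp format, e.g. 200201121430
    target_dir = f"{archive_root}/{agency}/{stamp}"
    return [
        "wget",
        "--mirror",            # recursive retrieval with timestamping
        "--page-requisites",   # also fetch images, stylesheets, etc.
        "--no-parent",         # never ascend above the starting directory
        f"--directory-prefix={target_dir}",
        f"--output-file={target_dir}.log",  # retrieval log for this run
        site_url,
    ]


cmd = build_wget_command(
    "http://www.dot.state.il.us/", "DoTran", "/archive",
    datetime(2002, 1, 12, 14, 30),
)
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once the parent directory for the log file exists.
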
8
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • Design philosophy
  • System summary
  • Results to date
  • Problems to date
  • Open issues
  • Implementation schedule

9
Results to Date
  • The State of Illinois web (as of 15 Jan 2002)
  • 5.6 GB total size
  • 84K files
  • 23K are HTML, 26K are PDF, 1.7K are DOC
  • 9.3K are GIF, 5.5K are JPEG
  • 82K META tags
  • META tag usage frequency extensively analyzed by
    type
  • Discovered one agency using its own thesaurus
    (vs. using ISL's thesaurus)
  • Most are not using embedded metadata
    significantly
  • They may have their own, separate databases and
    indexes
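
A META tag frequency count of the kind mentioned above can be sketched with Python's standard html.parser module. This is not the analysis tool used in the project; the class name and the sample page are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaCounter(HTMLParser):
    """Count META tags by their name (or http-equiv) attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("name") or attr.get("http-equiv") or "(unnamed)"
        self.counts[key.lower()] += 1


# Hypothetical page content, for illustration only.
sample = """<html><head>
<meta name="Keywords" content="bridges, roads">
<meta name="description" content="Bridge information">
<meta http-equiv="Content-Type" content="text/html">
<meta name="keywords" content="inspections">
</head><body></body></html>"""

counter = MetaCounter()
counter.feed(sample)
print(counter.counts.most_common())
```

Feeding every retrieved HTML file through one counter gives the per-type usage frequencies across a whole site.
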

10
Results to Date
  • Partial (private) ability to browse the recorded
    documents, for all 100 official sites.

[Figure: panels labeled ARCHIVE and LIVE WEBSITE]

11
Results to Date
  • Now working on version control issues
  • Many freeware incompatibility issues
  • especially variations in file and directory
    naming conventions and metacharacters across
    different platform types.
  • How fast is this content changing? What
    metric(s)?
  • e.g., text vs. binary changes

[Figure: chart with labels Binary Bytes, Text Bytes, and Spidering]

12
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • Design philosophy
  • System summary
  • Results to date
  • Problems to date
  • Open issues
  • Implementation schedule

13
Problems Encountered to Date
  • General design issues when including
    freeware/shareware
  • Overlooking hidden costs.
  • Is it adequate, per our system requirements?
  • Is it reliable and robust?
  • If not, at what cost can it be fixed?
  • At what cost can it be extended?
  • e.g., plug-in architectures, provisions for
    interfacing with scripts and other programs, good
    documentation.
  • What are the learning-curve costs to set it up?
  • Will all the parts work with our hardware and
    operating system?

14
Problems Encountered to Date
  • Hidden implementation problems
  • Platform incompatibilities
  • Platform: computer hardware and/or operating
    system. (Applies to both client/browser and
    server.)
  • Metacharacters
  • Characters performing special functions, per the
    given operating system. Legal names under one
    system may be illegal under another system.
  • e.g., / . , ! ( ) newlines, tabs,
    and space.
  • Case sensitivity
  • Requesting "A.html" and "a.html" from some
    servers can return the same document twice. On
    other servers, these names may return two
    different documents.
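
The metacharacter and case-sensitivity problems above can be illustrated with a small sketch. The set of characters treated as unsafe, and both function names, are assumptions chosen for the example; the right set depends on the platforms actually involved.

```python
import re
from collections import defaultdict

# Assumed set of troublesome characters (illustrative, not exhaustive).
UNSAFE = re.compile(r'[%\\/:*?"<>|!()\s]')


def safe_name(name):
    """Replace each unsafe character with a %XX hex escape."""
    return UNSAFE.sub(lambda m: "%{:02X}".format(ord(m.group())), name)


def case_collisions(names):
    """Group names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(safe_name("annual report (2001)!.html"))
print(case_collisions(["A.html", "a.html", "index.html"]))
```

Names flagged by case_collisions would overwrite one another if the archive were stored on a case-insensitive file system.
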

15
Problems Encountered to Date
  • Hidden implementation problems
  • Platform incompatibilities
  • Use of cutting-edge features in web pages
    limits portability and disenfranchises segments
    of the population
  • Mostly seeing just animations or
    appearance-changing flash.
  • Technologies often vendor-specific, not portable
    to, or interoperable with other browsers,
    wireless access devices, PDAs, or disability
    access devices.
  • Freeware (e.g., spiders) doesn't try to (can't
    afford to) keep up with all these web-incremental
    experiments.

16
Problems Encountered to Date
  • Hidden implementation problems
  • Platform incompatibilities
  • Dynamic (script-served) documents
  • Web sites that compose pages only in response
    to a user request have date/time tags that say
    "just modified!", even though the content has
    not in fact changed. (Static HTML files are
    usually tagged with the date/time of the last
    editing of the file.)
  • Bugs in freeware?
  • Your money cheerfully refunded!
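
One common way around misleading modification times is to compare content digests instead of dates. The sketch below uses SHA-256 from Python's hashlib; this is a swapped-in technique for illustration, not something the slides say the project used, and the function names and page bodies are hypothetical.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Fingerprint of the page body, independent of any date header."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous_digest: str, body: bytes) -> bool:
    """True only when the bytes differ from the previously stored version."""
    return previous_digest != content_digest(body)


# Hypothetical page bodies, for illustration only.
first_fetch = b"<html><body>Bridge closures: none</body></html>"
second_fetch = b"<html><body>Bridge closures: none</body></html>"  # served later
third_fetch = b"<html><body>Bridge closures: Route 66</body></html>"

stored = content_digest(first_fetch)
print(has_changed(stored, second_fetch))
print(has_changed(stored, third_fetch))
```

This does not help when a script embeds the current time in the page body itself; such pages would need the volatile portion stripped before hashing.
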

17
Problems Encountered to Date
  • Statewide metadata retrofit
  • From the agency point of view, unforeseen costs,
    time requirements, and promises of usability
    benefits they little understand.
  • ISL's metadata generator
  • A web script
  • Produces the HTML META tags for direct insertion
    into the HEAD of agency HTML documents.
  • Tag system and thesaurus adapted from Wa-GILS by
    ISL.
  • Part of a largely manual cataloging and editing
    process, by the agencies themselves
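
A generator that emits META tags for pasting into a document HEAD might look like the sketch below. It is not ISL's script; the function name and the field names are hypothetical and do not reflect the actual ISL tag system or thesaurus.

```python
from html import escape


def meta_tags(fields):
    """Render a dict of name -> content as HTML META tags, one per line."""
    lines = []
    for name, value in fields.items():
        lines.append('<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(value, quote=True)))
    return "\n".join(lines)


# Hypothetical field names and values, for illustration only.
tags = meta_tags({
    "keywords": "bridges, inspections",
    "description": 'Bridge "status" reports',
})
print(tags)
```

Escaping the values keeps quotation marks and ampersands in cataloger-supplied text from breaking the surrounding HTML.
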

18
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • Design philosophy
  • System summary
  • Results to date
  • Problems to date
  • Open issues
  • Implementation schedule

19
Open Issues
  • What are useful metrics addressing web site
    change?
  • Presumably based on some combination of
  • ( ) of (new | changed | deleted) (files |
    lines | bytes)
  • elapsed time
  • How can these be visualized effectively?
  • What rates of change are typical?
  • For term-of-office periodicities (e.g., a few
    years)?
  • For disaster-response periodicities (e.g., an
    hour)?
  • Is government web site change reliably,
    quantifiably predictable per such models?
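
Candidate change metrics of this kind can be computed by comparing two snapshots of a site. The sketch below assumes each snapshot is a mapping from file path to file contents; the function name, the choice of metrics, and the sample data are illustrative, not the project's definitions.

```python
def change_metrics(old, new):
    """Compare two snapshots given as {path: bytes} dictionaries."""
    old_paths, new_paths = set(old), set(new)
    added = new_paths - old_paths
    deleted = old_paths - new_paths
    changed = {p for p in old_paths & new_paths if old[p] != new[p]}
    return {
        "new_files": len(added),
        "deleted_files": len(deleted),
        "changed_files": len(changed),
        "new_bytes": sum(len(new[p]) for p in added),
        "deleted_bytes": sum(len(old[p]) for p in deleted),
        "pct_files_changed": 100.0 * len(changed) / len(old) if old else 0.0,
    }


# Hypothetical snapshots, for illustration only.
old = {"index.html": b"a" * 100, "old.pdf": b"b" * 50, "map.gif": b"c" * 10}
new = {"index.html": b"a" * 120, "map.gif": b"c" * 10, "news.html": b"d" * 30}

metrics = change_metrics(old, new)
print(metrics)
```

Dividing any of these counts by the elapsed time between the two snapshots gives a rate of change that can be compared across agencies or periods.
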

20
Open Issues
  • Is it live, or is it Memorex?
  • Will users become confused between live and
    recorded material when browsing an archive?
  • Almost certainly! Would you notice this prefix,
    only visible in the lower-left corner, and only
    visible when the mouse cursor is over a link?

21
Open Issues
  • Is it live, or is it Memorex?
  • Differences in URL appearance
  • An original web page URL:
    http://www.dot.state.il.us/bridges/bridges.html
  • Resulting archive URL:
    http://pep.isrl.uiuc.edu/DoTran/200201121430/www.dot.state.il.us/bridges/bridges.html
  • Off-site content is
  • live (volatile, transient), not archived, and
  • intermingled with links pointing back into the
    archive
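
The URL pattern on this slide (archive host, then an agency directory, then a timestamp, then the original host and path) can be expressed as a small mapping function. The function names are hypothetical, and the sketch ignores query strings, which dynamic pages would need.

```python
from urllib.parse import urlsplit

ARCHIVE_BASE = "http://pep.isrl.uiuc.edu"


def to_archive_url(original_url, agency, stamp):
    """Map a live URL to its location in the archive (assumed layout)."""
    parts = urlsplit(original_url)
    return f"{ARCHIVE_BASE}/{agency}/{stamp}/{parts.netloc}{parts.path}"


def is_archived(url):
    """True for links that point back into the archive, False for live links."""
    return url.startswith(ARCHIVE_BASE + "/")


archived = to_archive_url(
    "http://www.dot.state.il.us/bridges/bridges.html", "DoTran", "200201121430")
print(archived)
print(is_archived(archived), is_archived("http://www.example.com/"))
```

A check like is_archived is one way to tell live off-site links apart from links that stay inside the archive when rendering a page.
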

22
Open Issues
  • Is it live, or is it Memorex?
  • What measures can be taken to minimize this
    chance of confusion?
  • e.g., headers/footers à la Google's cached
    documents
  • HTML FRAMEs
  • color change, etc.
  • Are legal disclaimers in order?
  • Should state-run archives have some form of
    pedigree statement on their contents?
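
The header approach could be prototyped by inserting a notice into each archived page as it is served. The sketch below is one possible way to do that; the function name, banner wording, and styling are assumptions, and the regular expression is a simplification that only looks for the first BODY tag.

```python
import re
from html import escape


def add_banner(html, original_url, retrieved):
    """Insert a visible 'archived copy' notice right after the <body> tag."""
    banner = (
        '<div style="background:#ffc;border:1px solid #000;padding:4px">'
        "ARCHIVED COPY of {} as retrieved {}. This is not the live site."
        "</div>"
    ).format(escape(original_url), escape(retrieved))
    match = re.search(r"<body[^>]*>", html, flags=re.IGNORECASE)
    if match is None:
        return banner + html  # no <body> tag found: prepend instead
    return html[:match.end()] + banner + html[match.end():]


# Hypothetical archived page, for illustration only.
page = '<HTML><BODY BGCOLOR="white"><H1>Bridges</H1></BODY></HTML>'
print(add_banner(page, "http://www.dot.state.il.us/bridges/bridges.html",
                 "12 Jan 2002"))
```
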

23
Open Issues
  • Searching archives, and other usability issues
  • Some users will want specific web pages on
    specific dates
  • How to best support vertical searches, for the
    same page or topic, down through the layers of
    different versions?
  • e.g., identifying the versions of each identified
    document where a search term was or was not found
  • Part of a more basic question of identifying the
    user community, and assessing their needs
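
A vertical search reports, for one page, which archived versions contain a term. The sketch below assumes the versions are available as a mapping from a sortable timestamp string to page text; the function name and the sample data are hypothetical.

```python
def vertical_search(versions, term):
    """Return (stamp, found) pairs for one page, oldest version first.

    `versions` maps a sortable timestamp string to that version's text.
    """
    needle = term.lower()
    return [(stamp, needle in text.lower())
            for stamp, text in sorted(versions.items())]


# Hypothetical versions of one page, for illustration only.
versions = {
    "200201121430": "Bridge inspections scheduled for spring.",
    "200202091430": "Bridge inspections postponed.",
    "200203091430": "Road resurfacing schedule.",
}

for stamp, found in vertical_search(versions, "bridge"):
    print(stamp, "found" if found else "not found")
```
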

24
Open Issues
  • Approximating the retrofit of systematic metadata
    across one agency's web archive
  • So as to better convince the agencies of the
    value of systematic metadata use.
  • Implementation intended via insertion of
    additional records into the archive's metadata
    database, and thereby into the search process.
  • Tag storage would remain distinct from
    agency-provided metadata, to support removal if
    desired.
  • Perform a cost/benefit analysis, from the
    agency's point of view.
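
Keeping retrofitted tags distinct from agency-provided metadata could be done with a source column in the metadata table. The sketch below uses Python's built-in sqlite3 in place of the mySQL DBMS named in the architecture, purely so that it is self-contained; the table layout, column names, and records are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (url TEXT, name TEXT, content TEXT, source TEXT)")

url = "http://www.dot.state.il.us/bridges/bridges.html"
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", [
    (url, "keywords", "bridges", "agency"),        # supplied by the agency
    (url, "subject", "Transportation", "retrofit"),  # added by the archive
])

# Searches see both kinds of record.
hits = conn.execute(
    "SELECT DISTINCT url FROM metadata WHERE content LIKE ?",
    ("%Transportation%",)).fetchall()
print(hits)

# The retrofit can be removed without touching agency-provided metadata.
conn.execute("DELETE FROM metadata WHERE source = 'retrofit'")
remaining = conn.execute("SELECT COUNT(*) FROM metadata").fetchone()[0]
print(remaining)
```
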

25
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • Design philosophy
  • System summary
  • Results to date
  • Problems to date
  • Open issues
  • Implementation schedule

26
Implementation Schedule
  • Now: Extensive testing with 5 IL agencies, with
    initial statistics gathering. Irregular
    archival of all 100 IL agencies. Integrating
    acquisition and archival script.
  • May/June: Develop web-based control features and
    test keyword search mechanisms.
  • July/August: Install custom host hardware and
    begin weekly archival for the 5 agencies, and
    infrequent archival for all 100 IL agencies.
    Gather statistics.
  • September: Package, document, and release project
    software for download. Report statistical analysis.

27
Illinois Electronic Archives Project: Preserving
Electronic Publications
  • http://realfun.isrl.uiuc.edu/eArchives/
  • Principal Investigator
  • Joe Natale, Illinois State Library
  • jnatale@ilsos.net
  • StateGILS-4 Presenter
  • Larry S. Jackson, University of Illinois,
    Graduate School of Library and Information
    Science
  • lsjackso@uiuc.edu
  • http://realfun.isrl.uiuc.edu/jackson