Title: Symposium on Best Practice
1 Ensuring that digital data last
- The priority of archival form over working form and presentation form
Gary Simons
SIL International
2 A paradox of writing history
- The more advanced the writing technology, the less durable the written product.
- From most durable to least durable
  - Clay tablets and stone
  - Vellum
  - Papyrus
  - Paper
  - Digital word processing
3 Storage media are ephemeral
- Life expectancy of digital storage media
  - Magnetic tape: 10 to 20 years
  - CD-R (write once)
    - Manufacturers say 100 to 200 years
    - Independent lab says 30 years
  - CD-RW (write many times)
    - Manufacturers say 25 years
4 Hardware devices are ephemeral
- Advances in removable media on personal computers over 25 years
  - 8-inch floppies
  - 5.25-inch floppies
  - 3.5-inch floppies
  - Zip drives
  - CD-Rs
  - DVD-Rs
5 Software formats are ephemeral
- Software vendors change file formats and functionality with each version.
- When we use a proprietary single-vendor format, we lose access to the data when the software is obsolete.
- For instance,
  - Microsoft Word files from the 1980s cannot be read by current versions of Word.
6 An impending Digital Dark Age
- Future historians may see our present age as another Dark Ages, since so much information documenting our current civilization is recorded digitally and will have vanished.
- If linguists fail to act in time, our digital data records are in danger of dying out before the endangered languages we are seeking to document.
7 What's a linguist to do?
- Do two things to ensure that digital data endure long into the future:
  - Put the materials into an enduring file format.
  - Deposit the materials with an archive that will make a practice of periodically migrating them to new storage media as needed.
8 Forms contrasted by function
- Working form
  - The form in which information is stored as it is created and edited.
- Presentation form
  - The form in which information is presented to the public.
- Archival form
  - The form in which information is stored for access long into the future.
9 The problem
- Popular working forms (like Microsoft Word or database applications) are not suitable archival forms.
- Popular presentation forms (like dynamic web pages) are not suitable archival forms.
- Linguists tend to focus on working form and presentation form; they must look beyond these to create enduring work.
10 Unacceptable practice
- The form that is archived is a binary working form that requires a specific piece of software, e.g.,
  - .DOC, .XLS, .PPT, .MDB
  - A format supported by homemade software
- The information will cease to exist when the required software ceases to work on the hardware in use.
11 Minimally acceptable practice
- The form that is archived is a presentation form based on an open format supported by multiple vendors, e.g.,
  - HTML, PDF
- The good news
  - A snapshot of how you presented the information will persist.
- The bad news
  - It is a dead-end format: the information is not repurposeable.
12 Best practice
- The form that is archived preserves all of the information (including its structure) in such a way that it is portable and repurposeable.
  - Descriptive XML markup
- An XML archival form is not a dead end
  - It may be reloaded into a working form.
  - It may be used to generate new presentation forms.
13 A sample presentation form
- From a dictionary of Sikaiana, Solomon Islands

aha na the shell tool used for measuring the spaces between mesh in nets (seu manu, kupena).

ahaa (from PPN afaa) n a cyclone, a tidal wave.

aaha 1. vt to open up, to push apart, as in pushing apart branches in order to look through. 2. vt to open up a new settlement or start a new garden. 3. vt to start, to begin a new project or way of life. Tapa mai a koe ko hano i mua ki aaha te ala o te taina, 'you called upon me to go first (to school) to open the way for my brother (MS)'.
14 Unacceptable practice
- If you archive a .DOC file, this is what future
generations will see when they open it
15 Minimally acceptable practice
- If you archive an HTML presentation, this is what future generations will see:

<P><B>aha</B> <I>na</I> the shell tool used for measuring the spaces between mesh in nets (<I>seu manu, kupena</I>).</P>
<P><B> ahaa</B> (from PPN afaa) <I>n</I> a cyclone, a tidal wave.</P>
<P><B> aaha</B> 1. <I>vt</I> to open up, to push apart, as in pushing apart branches in order to look through. 2. <I>vt</I> to open up a new settlement or start a new garden. 3. <I>vt</I> to start, to begin a new project or way of life. <I>Tapa mai a koe ko hano i mua ki aaha te ala o te taina,</I> 'you called upon me to go first (to school) to open the way for my brother (MS)'. </P>
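
A small sketch may help show why this HTML is a dead end for reuse. It uses Python's standard html.parser module to pull out every italic run from the first two entries. The class and function names (ItalicCollector, italic_runs) are invented for this illustration. The point is that italics mark a part of speech in one place and what appear to be cited Sikaiana words in another, and the markup alone gives a program no way to tell them apart.

```python
from html.parser import HTMLParser

# First two entries of the HTML presentation form from this slide.
HTML = (
    "<P><B>aha</B> <I>na</I> the shell tool used for measuring the spaces "
    "between mesh in nets (<I>seu manu, kupena</I>).</P>"
    "<P><B>ahaa</B> (from PPN afaa) <I>n</I> a cyclone, a tidal wave.</P>"
)


class ItalicCollector(HTMLParser):
    """Collect the text of every <I> element, in document order."""

    def __init__(self):
        super().__init__()
        self.in_italic = False
        self.italics = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "i":
            self.in_italic = True
            self.italics.append("")

    def handle_endtag(self, tag):
        if tag == "i":
            self.in_italic = False

    def handle_data(self, data):
        if self.in_italic:
            self.italics[-1] += data


def italic_runs(html):
    """Return the list of italicized strings found in the HTML text."""
    parser = ItalicCollector()
    parser.feed(html)
    parser.close()
    return parser.italics


print(italic_runs(HTML))
```

The result mixes part-of-speech labels with cited words in a single undifferentiated list; only a human reader (or fragile guesswork) can sort them back into categories.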
16 Best practice
- If you archive descriptive XML markup, this is what future generations will see
- Future generations (though they lack our current working tools) will be able to
  - See and understand the information
  - Load it into their own working tools
  - Create modern presentation forms
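
As a rough illustration of the last two points, the sketch below reloads a descriptively marked-up entry into a plain data structure and then generates a fresh presentation form from it. The element names (entry, headword, etymology, sense, pos, definition) and the function names are assumptions made up for this example, not the markup used in the original dictionary; the entry content is the "ahaa" entry from the sample presentation form.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical descriptive markup for one entry. The element names are
# illustrative assumptions, not a published schema.
ENTRY_XML = """\
<entry>
  <headword>ahaa</headword>
  <etymology source="PPN">afaa</etymology>
  <sense>
    <pos>n</pos>
    <definition>a cyclone, a tidal wave</definition>
  </sense>
</entry>
"""


def load_entry(xml_text):
    """Reload the archival form into a plain working structure (a dict)."""
    root = ET.fromstring(xml_text)
    etym = root.find("etymology")
    return {
        "headword": root.findtext("headword"),
        "etymology": None if etym is None else (etym.get("source"), etym.text),
        "senses": [
            {"pos": s.findtext("pos"), "definition": s.findtext("definition")}
            for s in root.findall("sense")
        ],
    }


def render_html(entry):
    """Generate one possible presentation form from the working structure."""
    parts = ["<p><b>%s</b>" % escape(entry["headword"])]
    if entry["etymology"]:
        source, form = entry["etymology"]
        parts.append("(from %s %s)" % (escape(source), escape(form)))
    for sense in entry["senses"]:
        parts.append(
            "<i>%s</i> %s." % (escape(sense["pos"]), escape(sense["definition"]))
        )
    return " ".join(parts) + "</p>"


entry = load_entry(ENTRY_XML)
print(render_html(entry))
```

Because the part of speech, etymology, and definition are labeled as such, a different renderer (print layout, a search index, a new web format) could be written against the same archived file without guessing at what the typography meant.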
17 Is XML just one more ephemeral format?
- No! It's as rock solid as ASCII.
- ASCII was adopted in 1963; 40 years later it is at the heart of operating systems, email, and the web. It won't change.
- XML uses ASCII notation to essentially extend ASCII by solving two of its inherent limitations:
  - Via Unicode it encodes text in any language
  - Via tags it encodes the structure of information
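
The two points above can be sketched with Python's standard xml.etree.ElementTree module, which is only one of many available XML tools. The element names and the sample string are invented for illustration and are not real language data. Serializing with the us-ascii encoding yields a file made only of ASCII bytes, with each non-ASCII character written as a numeric character reference, and parsing the file restores the original Unicode text and structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the form contains characters outside ASCII.
# The string is an invented sample, written with escapes so that this
# source file itself stays plain ASCII.
SAMPLE_FORM = "\u014ba\u0294a"

word = ET.Element("word")
form = ET.SubElement(word, "form")
form.text = SAMPLE_FORM
gloss = ET.SubElement(word, "gloss")
gloss.text = "sample"

# Serialize using only ASCII bytes; non-ASCII characters become
# numeric character references such as &#331;.
archived = ET.tostring(word, encoding="us-ascii")
print(archived.decode("ascii"))

# Parsing the ASCII bytes recovers the Unicode text and the structure.
restored = ET.fromstring(archived)
print(restored.findtext("form") == SAMPLE_FORM)
```

The tags carry the structure (which string is the form, which is the gloss), and the character references carry the Unicode text, all within plain ASCII notation.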
18 Is XML just one more theory?
- No! It has become part of the fabric of the global information infrastructure.
- It's a family of open standards from the World Wide Web Consortium.
- All major vendors (e.g. Microsoft, IBM, Sun, Oracle) have embraced it.
- Hundreds of small vendors and open-source projects have developed tools.
19 What's linguistics to do?
- The community needs to recognize the fleeting value of digital presentation forms and embrace archival forms.
- Grants should require best practice archiving, not just dissemination.
- Reward archival language documentation.
- Get in league with libraries and archives.
- Only by taking steps like these can we ensure that our digital data will endure.