Title: XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLARLY PUBLISHERS
1. XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLARLY PUBLISHERS
XML 2004, Washington, D.C., 18 November 2004
Alexander (Sasha) Schwarzman, AGU <sschwarzman_at_agu.org>
Hyunmin Hur, DocsDoc
Shu-Li Pai, AGU
Carter M. Glass, AGU
2. PRESENTATION OVERVIEW
- Introduction
- Publishing requirements in a dual paper/electronic world
- AGU manuscript production modes in 2001
- System architecture and workflow in 2004
- Design decisions
  - Schema language: DTD vs. W3C Schema and Relax NG
  - Copyediting manuscript: in author-submitted format vs. in XML
  - Converting manuscript to XML: vendor vs. in-house
  - Validating XML instance: beyond a validating parser
  - Extracting and loading metadata into a metadata database (MDDB)
  - MDDB-based information products and services
  - Choice of DB technology and programming languages
- Lessons learned
  - DOI: what does it identify and what its format should be
  - Version of record
  - "Published as ready" journal model deconstructed
  - Page numbers and article IDs
  - Special characters and math
- Results
3. INTRODUCTION
- AGU
  - Nonprofit multidisciplinary scientific society, 41,000 members from 130 countries
  - Focus: organization and dissemination of scientific information in the interdisciplinary field of geophysics (atmospheric, oceanic, solid Earth, hydrologic, and space sciences)
  - Publishes 14 high-impact English-language journals, 4500 articles annually
- Manuscript life cycle
  - Production
    - copyediting (copy editor)
    - proofing (proofreader and author)
    - correcting (vendor)
    - publishing (production coordinator)
While AGU has introduced radical changes in how a manuscript is handled at all stages of its life cycle, in this presentation we will concentrate on the post-acceptance part of the publishing process.
4. PUBLISHING REQUIREMENTS IN A PAPER/ELECTRONIC WORLD
- Multiple outputs
  - journal article has to appear in multiple formats: print, PDF, HTML, ...
- Search
  - both an article's metadata and its full text have to be searchable
- Linking
  - bibliographic references, external datasets, inter- and intra-article linking
- Dynamic content
  - authors have to be able to include multimedia objects, such as videos or animations, in their articles
- Cross-journal products
  - ability to create collections cutting across journals (virtual journals)
- Customization
- Metadata sharing
- Preservation of scientific content
  - ability to preserve scientific content in a readable nonproprietary format for the foreseeable future
5. AGU MANUSCRIPT PRODUCTION MODES IN 2001
- Camera-ready copy (CRC)
  - no electronic copy → no reuse/repurposing of scientific content
  - authors prepared production files → inconsistent quality of published product
  - metadata-based products (issue ToCs, author and subject indices, AGU bibliographic database EASI) created manually (rekeying). Still, no abstracts in EASI!
  - no article in electronic form → printed issues mailed to A&I services, metadata rekeyed. Delay between issue publication and A&I information availability
- Typeset manuscript
  - two journals wholly typeset in XyVision. PDF, HTML, and ISO 12083 SGML generated from proprietary typesetting system
  - 1997: GRL authors given an option to submit in LaTeX, which was converted to HTML and PDF in-house and posted on the Web
- SGML markup of the electronic manuscript file
  - 1997: Earth Interactions marked up in SGML in accordance with own DTD
  - 1999: Geochemistry, Geophysics, Geosystems marked up with a variation of ISO 12083 SGML DTD
  - 2000: Global Biogeochemical Cycles, partial use of AGU Article SGML DTD
CRC publishing model is a dead end. Disparate production modes are counterproductive. Solution: a unified XML-centric process to cut costs, provide services, and stay competitive.
6. SYSTEM ARCHITECTURE AND WORKFLOW (1)
7. SYSTEM ARCHITECTURE AND WORKFLOW (2)
- Custom software
  - AGU-article XML DTD
  - AGU Validator
  - XML conversion tool
  - Metadata Loader (bib and ref metadata extractor and inserter)
  - AGU metadata database (AGU MDDB)
  - reporting, linking, and metadata dissemination modules
  - full-text database and search engine
8. DESIGN DECISIONS: DTD vs. W3C Schema and Relax NG
- DTD advantages (Beck and Lapeyre, 2003)
  - Technical
    - parameter entity mechanism: modular design, inclusion of DTDs (CALS, MathML), maintainability, scalability, customization
    - availability of processing tools
    - consistency of validity checking among parsers
  - Business
    - vendor community: taggers, compositors, conversion shops
    - aggregators, archives
  - Practical
    - XML character entities: &eacute; vs. Unicode point &#x000E9; (readability)
- DTD disadvantages
  - not in XML syntax
  - lacks strong datatyping
- DTD: publisher-specific vs. industry-standard
  - no suitable DTD at the time → AGU developed its own to meet the Requirements
  - emerging industry standard: NCBI/NLM DTD, http://dtd.nlm.nih.gov/publishing/
9. DESIGN DECISIONS: non-DTD schema languages, W3C Schema
- W3C Schema
  - Ask yourself
    - schema must be in XML syntax?
    - strong datatyping, full namespace support for elements and attributes essential?
    - min/max mechanism for elements essential?
    - content models similar enough to make use of derivation by extension/restriction?
    - developers and vendors okay with inconsistency of tools?
  - Cons
    - in scholarly publishing content models are diverse → not derivable
    - mixed content → min/max, regular expressions not usable
    - difficult to modularize, scale, maintain
    - no XML character entities
10. DESIGN DECISIONS: non-DTD schema languages, Relax NG
- Relax NG
  - Pros
    - has both XML and compact (non-XML) syntax
    - combines intelligibility of DTD with datatyping capability of W3C Schema
    - provides for context-sensitive content models
    - can validate documents of different types using a combined schema
    - DocBook and TEI converted to Relax NG in the past year
  - Cons
    - not as well established as W3C Schema, number of tools available is limited
    - does not support XML character entities → you must also have a DTD
    - does not permit Formal Public Identifiers (FPI)
    - Relax NG-specific features may not translate neatly to either DTD or W3C Schema
Today we would still opt for a DTD, but Relax NG may become a schema of choice for text (as opposed to data) content in the future.
11. DESIGN DECISIONS: more
- Converting manuscript to XML
  - vendor vs. in-house
- Copyediting manuscript
  - in author-submitted format vs. in XML
- Validator
  - XML instance may be valid but not correct or even meaningful (Rosenblum and Golfman, 2001)
  - in addition to validity, also checks datatypes, specific dependencies, and naming conventions
  - overall, performs 100 checks on each XML article instance
- Metadata Loader
  - extracts bibliographic and reference metadata and loads them into MDDB
- Validator and Loader: Java/XSLT applications, for portability
- MDDB and its reporting, linking, and metadata dissemination modules
  - information products and services: online ToCs (updated daily), printed issue ToCs, author and subject indices, virtual journals
  - metadata deposits, response pages, and citation linking
  - business reports for the managers
- MDDB: relational vs. native XML DB
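The point that an XML instance can be valid yet incorrect can be illustrated with a minimal sketch. It is written in Python rather than the Java/XSLT of the AGU Validator, and the element names (meta, journal, year, citnum, figure) and the three rules are assumptions made for illustration only; they are not the actual AGU checks. Each rule stands for one of the categories named above: a datatype, a dependency between fields, and a naming convention.

```python
import re
import xml.etree.ElementTree as ET

# Assumed naming convention: figure ids look like "fig1", "fig2", ...
FIGURE_ID = re.compile(r"^fig[0-9]+$")


def check_article(xml_text):
    """Return problems that plain DTD validation would not report."""
    root = ET.fromstring(xml_text)
    problems = []

    # Datatype: a DTD can require <year> but cannot constrain its text.
    year = root.findtext("meta/year", default="")
    if not re.fullmatch(r"[0-9]{4}", year):
        problems.append(f"year is not a four-digit number: {year!r}")

    # Dependency (hypothetical rule): the citation number starts with the
    # part letter that ends the journal code, e.g. JGRD and D16203.
    journal = root.findtext("meta/journal", default="")
    citnum = root.findtext("meta/citnum", default="")
    if not journal or not citnum or citnum[0] != journal[-1]:
        problems.append(f"citation number {citnum!r} does not match {journal!r}")

    # Naming convention: every figure id follows the assumed pattern.
    for figure in root.iter("figure"):
        fig_id = figure.get("id", "")
        if not FIGURE_ID.match(fig_id):
            problems.append(f"figure id breaks naming convention: {fig_id!r}")

    return problems


GOOD = (
    "<article><meta><journal>JGRD</journal><year>2004</year>"
    "<citnum>D16203</citnum></meta>"
    '<body><figure id="fig1"/></body></article>'
)
# Well-formed, and it could be DTD-valid, but wrong in three ways.
BAD = (
    "<article><meta><journal>JGRD</journal><year>20O4</year>"
    "<citnum>A16203</citnum></meta>"
    '<body><figure id="Figure_1"/></body></article>'
)

for problem in check_article(BAD):
    print(problem)
```

Both sample documents have the same structure, so a schema that only describes structure would accept both; only the second layer of checks tells them apart.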
12. LESSONS LEARNED (1)
- DOI
  - Decide what your DOI identifies: abstraction/manifestation, format, extent, granularity
  - DOI format: dumb vs. intelligent, based on volume/issue/page vs. tracking number, etc.
- Version of record
  - XML, if the goal is to separate content from presentation
  - rendering preserved content in a variety of formats/devices/media
  - archiving textual source and non-textual components
- "Published as ready" journal model deconstructed
  - each article appears online as soon as its production cycle is completed; articles are assembled into printed issues and mailed later
  - if an article is published (online) in one calendar year and printed in the next, what is its year and volume? Should your XML schema account for the difference?
  - a set of articles selected on the basis of user- or publisher-defined criteria may cut across journals → a journal is but one of many collections. Any collection is just a query executed against MDDB
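The claim that any collection is just a query can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database, a hypothetical one-table schema, and made-up placeholder rows, none of which describe the real AGU MDDB.

```python
import sqlite3

# Hypothetical one-table schema; the real MDDB design is not shown here.
SCHEMA = """
CREATE TABLE article (
    doi       TEXT PRIMARY KEY,
    journal   TEXT NOT NULL,
    subject   TEXT NOT NULL,
    published TEXT NOT NULL  -- ISO date of online publication
)
"""

# Made-up placeholder rows, not real articles or DOIs.
ROWS = [
    ("10.9999/example-1", "Journal A", "aerosols", "2004-08-17"),
    ("10.9999/example-2", "Journal B", "aerosols", "2004-09-02"),
    ("10.9999/example-3", "Journal A", "seismology", "2004-09-10"),
]


def build_mddb():
    """Create the toy metadata database in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", ROWS)
    return conn


def collection_by_subject(conn, subject):
    """A virtual journal: articles on one subject, whatever the journal."""
    cursor = conn.execute(
        "SELECT doi, journal FROM article WHERE subject = ? ORDER BY published",
        (subject,),
    )
    return cursor.fetchall()


def collection_by_journal(conn, journal):
    """A traditional journal is the same kind of query, on another column."""
    cursor = conn.execute(
        "SELECT doi FROM article WHERE journal = ? ORDER BY published",
        (journal,),
    )
    return [row[0] for row in cursor.fetchall()]


conn = build_mddb()
print(collection_by_subject(conn, "aerosols"))
print(collection_by_journal(conn, "Journal A"))
```

The two functions differ only in the selection criterion, which is the sense in which a journal is one collection among many.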
13. LESSONS LEARNED (2)
- Page numbers and article IDs
  - at the time of article publishing it is not always possible to predict accurately what its continuous pagination within the printed issue will be
  - waiting until a printed issue is assembled and then adding page numbers to article representations creates a discrepancy in how an article is to be cited and runs contrary to the principle that an article must not be changed after it is published
  - abandoning page numbers altogether is not an option because many A&I services may need them for the purposes of citation tracking and metadata resolution (Thomson ISI, CrossRef)
  - as long as a printed issue exists, the reader needs a means of finding a particular article within it
AGU solution: smart Article ID (citation number)
Citation: Holzworth, R. H., and R. A. Goldberg (2004), Electric field measurements in noctilucent clouds, J. Geophys. Res., 109, D16203, doi:10.1029/2003JD004468.
D16203 is a citation number, where
- D = part D (Atmospheres) of Journal of Geophysical Research (JGRD)
- 16 = issue number
- 2 = Aerosols and Clouds subset of JGRD
- 03 = article sequence within the subset
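The decomposition of D16203 can be expressed as a small parser. This is a sketch under an assumption: the field widths (one letter, two-digit issue, one-digit subset, two-digit sequence) are inferred from this single example and may not hold for every AGU journal. The names CitationNumber and parse_citation_number are hypothetical.

```python
import re
from typing import NamedTuple


class CitationNumber(NamedTuple):
    part: str      # journal part letter, e.g. "D" for JGR Atmospheres
    issue: int     # issue number
    subset: int    # subject subset within the journal part
    sequence: int  # article sequence within the subset


# Assumed layout, inferred only from the D16203 example.
_PATTERN = re.compile(r"^([A-Z])([0-9]{2})([0-9])([0-9]{2})$")


def parse_citation_number(text):
    """Split a citation number such as 'D16203' into its components."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognizable citation number: {text!r}")
    part, issue, subset, sequence = match.groups()
    return CitationNumber(part, int(issue), int(subset), int(sequence))


print(parse_citation_number("D16203"))
```

Because the identifier is "intelligent", a reader or a program can recover the issue and the position of the article without consulting any page numbers.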
14. LESSONS LEARNED (3)
- Page numbers and article IDs (contd)
  - all metadata needed for the citation are in the version of record (XML), as well as in HTML and PDF → article can be consistently cited as soon as it is published!
  - A&I services can use either page numbers (each article begins with page number 1, though), or a citation number, or both
  - citation numbers appear in print → easy for the reader to locate an article within a printed issue
  - citation numbers in most cases follow the physical sequence of articles within an issue or its unit, but may occasionally deviate from it → AGU has the flexibility to deal with exceptions to a regular publishing flow
- Special characters
  - XML is the version of record, and &eacute; is more readable than &#x000E9;
  - an XML instance with Unicode points can always be produced simply by running a validating parser
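The conversion from named entities to Unicode points can be sketched with Python's standard library. Two caveats, both assumptions of this example: the parser used here is non-validating but does read entity declarations in the internal DTD subset, and the one-entity DTD and the author element are invented for illustration rather than taken from the AGU DTD.

```python
import xml.etree.ElementTree as ET

# Toy instance: the internal DTD subset declares a single named entity.
SOURCE = """<!DOCTYPE author [
  <!ENTITY eacute "&#xE9;">
]>
<author>Andr&eacute;</author>"""


def resolve_entities(xml_text):
    """Parse the instance and re-serialize it with entities expanded."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


def resolve_to_ascii(xml_text):
    """Same, but as ASCII bytes with numeric character references."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root)  # default encoding is US-ASCII


print(resolve_entities(SOURCE))
print(resolve_to_ascii(SOURCE))
```

The readable form stays in the version of record, while either derived form can be regenerated whenever a downstream consumer cannot process the DTD.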
15. LESSONS LEARNED (4)
- Math
  - Tagging math
    - MathML
    - LaTeX
    - link to an image
  - Rendering math
    - displaying MathML in a browser
    - providing an image
  - Problems with MathML (Gaylord, 2004)
    - MathML Presentation vs. MathML Content
    - MathML Presentation verbosity (debugging problems)
    - Firefox and Netscape display MathML natively, IE 6.0 needs a plug-in. Opera, etc.?
    - Math Player: Windows only. Mac, Linux, UNIX?
    - all browsers require additional fonts, yet not all characters can be displayed
MathML as a display format is not an option if multiple browsers/platforms are involved.
Using an image gives complete control over appearance, but math can't be searched/reused.
AGU approach: tag math using LaTeX within XML; convert to GIF for presenting in HTML.
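The LaTeX-in-XML approach can be sketched as a two-output transformation: the HTML rendition gets image references, and a separate job list records which LaTeX string each image must show. Everything named here is an assumption for illustration (the tex element, the eqN.gif file names, the function plan_math_images), and the actual LaTeX-to-GIF rendering, which needs an external toolchain, is left out.

```python
import xml.etree.ElementTree as ET


def plan_math_images(xml_text):
    """Replace each <tex> element with an <img> and list the render jobs.

    Returns (markup, jobs), where jobs is a list of (filename, latex)
    pairs for an external LaTeX-to-GIF step that is not shown here.
    """
    root = ET.fromstring(xml_text)
    jobs = []
    # Materialize the list first so elements can be edited safely.
    for number, element in enumerate(list(root.iter("tex")), start=1):
        latex = element.text or ""
        filename = f"eq{number}.gif"
        jobs.append((filename, latex))
        # Rewrite the element in place; its tail text is left untouched.
        element.tag = "img"
        element.text = None
        element.attrib.clear()
        element.set("src", filename)
        element.set("alt", latex)
    return ET.tostring(root, encoding="unicode"), jobs


SAMPLE = "<p>Energy is <tex>E = mc^2</tex> in this sketch.</p>"

markup, jobs = plan_math_images(SAMPLE)
print(markup)
print(jobs)
```

The XML source is never modified, so the LaTeX remains available for search and reuse even though readers of the HTML rendition see only an image.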
16. RESULTS (1)
- Improved productivity
  - since introduction of the XML-centric workflow, the number of published articles has increased, while in the journal production department 2 full-time positions have been eliminated and 25% of staff positions have been downgraded
  - production time from acceptance to publication has been substantially reduced: it is now the fastest since 1984 (when records began to be kept). GRL: 5 weeks; semimonthlies and monthlies: 10 weeks from acceptance to publication
  - management reporting improved significantly
- Automated production of publishing products
  - printed issue ToCs
  - end-of-year author and subject indices
- Improved quality and value of published product
  - human error reduced
  - authors responsible for content only, publisher responsible for accuracy of articles' structure and consistency of their appearance
  - multiple outputs (PDF, HTML, print) produced automatically from the XML source
  - previously unfeasible checks performed: accuracy of references' metadata
17. RESULTS (2)
- Automatic production of the Web search repository
  - metadata and full text automatically extracted from XML into the repository
- Automatic archiving
  - XML, HTML, PDF, non-textual components, and article metadata
- Direct data feeds to A&I services
  - CrossRef, NASA's Astrophysics Data System (ADS), AIP's SPIN, etc.
  - There used to be delays of up to half a year between article publication and metadata appearance in A&I services. Now delivery is instantaneous and fully automated
- Reference linking implementation
  - CrossRef inbound, outbound, and forward linking
- Introduction of new information products and services
  - virtual journals (cross-journal article collections)
  - multimedia content
  - immediate access to underlying datasets
  - RSS
Making the production process XML-centric has allowed AGU to bring its readers the results of scientific research of the highest quality in the fastest, most cost-efficient manner.