Title: XML-CENTRIC WORKFLOW OFFERS BENEFITS TO SCHOLARLY PUBLISHERS

1
XML-CENTRIC WORKFLOW OFFERS BENEFITS TO
SCHOLARLY PUBLISHERS
XML 2004, Washington, D.C., 18 November 2004
Alexander (Sasha) Schwarzman, AGU <sschwarzman_at_agu.org>
Hyunmin Hur, DocsDoc
Shu-Li Pai, AGU
Carter M. Glass, AGU
2
PRESENTATION OVERVIEW

Introduction
Publishing requirements in a dual
paper-electronic world
AGU manuscript production modes in 2001
System architecture and workflow in 2004
Design decisions
Schema language: DTD vs. W3C Schema and Relax NG
Copyediting manuscript: in author-submitted
format vs. in XML
Converting manuscript to XML: vendor vs. in-house
Validating XML instance: beyond a validating
parser
Extracting and loading metadata into a metadata
database (MDDB)
MDDB-based information products and services
Choice of DB technology and programming languages
Lessons learned
DOI: what it identifies and what its format
should be
Version of record
"Published as ready": journal model deconstructed
Page numbers and article IDs
Special characters and math
Results

3
INTRODUCTION

AGU
Nonprofit multidisciplinary scientific society
41,000 members from 130 countries
Focus: organization and dissemination of
scientific info in interdisciplinary field of
geophysics (atmospheric, oceanic, solid Earth,
hydrologic, and space sciences)
Publishes 14 high-impact English language
journals, 4500 articles annually
Manuscript life cycle

Production
copyediting (copy editor)
proofing (proofreader and author)
correcting (vendor)
publishing (production coordinator)

While AGU has introduced radical changes in
how a manuscript is handled at all stages of
its life cycle, in this presentation we will
concentrate on the post-acceptance part of the
publishing process.
4
PUBLISHING REQUIREMENTS IN A PAPER-ELECTRONIC
WORLD

Multiple outputs
journal article has to appear in multiple
formats: print, PDF, HTML, ...
Search
both an article's metadata and its full text have
to be searchable
Linking
bibliographic references, external datasets,
inter- and intra-article linking
Dynamic content
authors have to be able to include multimedia
objects, such as videos or animations, into their
articles
Cross-journal products
ability to create collections cutting across
journals (virtual journals)
Customization
Metadata sharing
Preservation of scientific content
ability to preserve scientific content in a
readable nonproprietary format for the
foreseeable future

5
AGU MANUSCRIPT PRODUCTION MODES IN 2001

Camera-ready copy (CRC)
no electronic copy → no reuse/repurposing of
scientific content
authors prepared production files → inconsistent
quality of published product
metadata-based products (issue ToCs, author and
subject indices, AGU bibliographic database EASI)
created manually (rekeying). Still, no abstracts
in EASI!
no article in electronic form → printed issues
mailed to A&I services, metadata rekeyed. Delay
between issue publication and A&I information
availability
Typeset manuscript
two journals wholly typeset in XyVision. PDF,
HTML, and ISO 12083 SGML generated from
proprietary typesetting system
1997: GRL authors given an option to submit in
LaTeX, which was converted to HTML and PDF
in-house and posted on the Web
SGML markup of the electronic manuscript file
1997: Earth Interactions marked up in SGML in
accordance with own DTD
1999: Geochemistry, Geophysics, Geosystems
marked up with a variation of ISO 12083 SGML DTD
2000: Global Biogeochemical Cycles, partial use
of AGU Article SGML DTD

CRC publishing model is a dead end. Disparate
production modes are counterproductive. Solution:
a unified XML-centric process to cut costs, provide
services, and stay competitive.
6
SYSTEM ARCHITECTURE AND WORKFLOW 1
7
SYSTEM ARCHITECTURE AND WORKFLOW 2

Custom software
AGU-article XML DTD
AGU Validator
XML conversion tool
Metadata Loader (bib and ref metadata extractor
and inserter)
AGU metadata database (AGU MDDB)
reporting, linking, and metadata dissemination
modules
full-text database and search engine
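
The Metadata Loader listed above extracts bibliographic metadata from an article's XML and inserts it into the MDDB. A minimal sketch of that idea follows. The element names, the table layout, and the use of SQLite are assumptions made for illustration only; they are not the AGU-article DTD or the actual MDDB schema, and the slides describe the real Loader as a Java/XSLT application.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical article instance; the element names are illustrative,
# not those of the AGU-article DTD.
ARTICLE_XML = """<article doi="10.1029/2003JD004468">
  <title>Electric field measurements in noctilucent clouds</title>
  <journal>J. Geophys. Res.</journal>
  <year>2004</year>
  <author>Holzworth, R. H.</author>
  <author>Goldberg, R. A.</author>
</article>"""


def load_metadata(xml_text, conn):
    """Extract bibliographic metadata from one article and insert it."""
    root = ET.fromstring(xml_text)
    doi = root.get("doi")
    conn.execute(
        "INSERT INTO article (doi, title, journal, year) VALUES (?, ?, ?, ?)",
        (doi, root.findtext("title"), root.findtext("journal"),
         int(root.findtext("year"))),
    )
    # Keep author order, since it matters when the citation is rebuilt.
    conn.executemany(
        "INSERT INTO author (doi, position, name) VALUES (?, ?, ?)",
        [(doi, i, a.text) for i, a in enumerate(root.findall("author"), start=1)],
    )
    conn.commit()
    return doi


# An in-memory database stands in for the metadata database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (doi TEXT PRIMARY KEY, title TEXT, journal TEXT, year INTEGER)"
)
conn.execute("CREATE TABLE author (doi TEXT, position INTEGER, name TEXT)")

loaded_doi = load_metadata(ARTICLE_XML, conn)
authors = [row[0] for row in conn.execute("SELECT name FROM author ORDER BY position")]
print(loaded_doi, authors)
```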

8
DESIGN DECISIONS: DTD vs. W3C Schema and Relax NG

DTD advantages [Beck and Lapeyre, 2003]
Technical
parameter entity mechanism: modular design,
inclusion of DTDs (CALS, MathML),
maintainability, scalability, customization
availability of processing tools
consistency of validity checking among parsers
Business
vendor community: taggers, compositors,
conversion shops
aggregators, archives
Practical
XML character entities: &eacute; vs. Unicode
point &#x000E9; (readability)
DTD disadvantages
not in XML syntax
lacks strong datatyping
DTD: publisher-specific vs. industry-standard
no suitable DTD at the time; AGU developed its
own to meet the Requirements
emerging industry standard: NCBI/NLM DTD
http://dtd.nlm.nih.gov/publishing/

9
DESIGN DECISIONS: non-DTD schema languages (W3C
Schema)

W3C Schema
Ask yourself
schema must be in XML syntax?
strong datatyping, full namespace support for
elements and attributes essential?
min/max mechanism for elements essential?
content models similar enough to make use of
derivation by extension/restriction?
developers and vendors okay with inconsistency of
tools?
Cons
in scholarly publishing content models are
diverse → not derivable
mixed content → min/max, regular expressions not
usable
difficult to modularize, scale, maintain
no XML character entities

10
DESIGN DECISIONS: non-DTD schema languages
(Relax NG)

Relax NG
Pros
has both XML and compact (non-XML) syntax
combines intelligibility of DTD with datatyping
capability of W3C Schema
provides for context-sensitive content models
can validate documents of different types using a
combined schema
DocBook and TEI converted to Relax NG in the past
year
Cons
not as well-established as W3C Schema, number of
tools available is limited
does not support XML character entities → you
must also have a DTD
does not permit Formal Public Identifiers (FPI)
Relax NG-specific features may not translate
neatly to either DTD or W3C Schema
Today we would still opt for a DTD, but Relax
NG may become the schema of choice for text (as
opposed to data) content in the future.

11
DESIGN DECISIONS: more

Converting manuscript to XML
vendor vs. in-house
Copyediting manuscript
in author-submitted format vs. in XML
Validator
XML instance may be valid but not correct or even
meaningful [Rosenblum and Golfman, 2001]
in addition to validity, also checks datatypes,
specific dependencies, and naming conventions
overall, performs 100 checks on each XML article
instance
Metadata Loader
extracts bibliographic and reference metadata and
loads them to MDDB
Validator and Loader: Java/XSLT applications
(portability)
MDDB and its reporting, linking, and metadata
dissemination modules
information products and services: online ToCs
(updated daily), printed issue ToCs, author and
subject indices, virtual journals
metadata deposits, response pages, and citation
linking
business reports for the managers
MDDB: relational vs. native XML DB
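
The Validator described on this slide goes beyond DTD validity to check datatypes, specific dependencies, and naming conventions. The sketch below shows one check of each kind. The element names, the attribute names, and the three rules are hypothetical examples chosen for illustration; the roughly 100 checks the actual AGU Validator performs are not listed in the slides, and the real tool is a Java/XSLT application rather than Python.

```python
import re
import xml.etree.ElementTree as ET

# Assumed naming convention for DOIs: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,}/\S+")


def check_article(xml_text):
    """Return a list of problems a DTD-validating parser would not catch."""
    errors = []
    root = ET.fromstring(xml_text)

    # Datatype check: a DTD only sees character data, not "a four-digit year".
    year = root.findtext("year", default="")
    if not re.fullmatch(r"\d{4}", year):
        errors.append(f"year is not a four-digit number: {year!r}")

    # Naming-convention check on the DOI attribute.
    doi = root.get("doi", "")
    if not DOI_PATTERN.fullmatch(doi):
        errors.append(f"DOI does not follow the expected pattern: {doi!r}")

    # Dependency check: every in-text citation must point at a listed reference.
    ref_ids = {ref.get("id") for ref in root.iter("ref")}
    for cite in root.iter("cite"):
        rid = cite.get("rid")
        if rid not in ref_ids:
            errors.append(f"citation points at a missing reference: {rid!r}")

    return errors


GOOD = ('<article doi="10.1029/2003JD004468"><year>2004</year>'
        '<body><cite rid="r1"/></body><reflist><ref id="r1"/></reflist></article>')
BAD = ('<article doi="2003JD004468"><year>04</year>'
       '<body><cite rid="r9"/></body><reflist><ref id="r1"/></reflist></article>')

good_errors = check_article(GOOD)
bad_errors = check_article(BAD)
for message in bad_errors:
    print(message)
```

Both instances are well formed and could be valid against a permissive DTD, yet only the first passes these checks. This is the sense in which an instance can be valid but not correct.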

12
LESSONS LEARNED 1

DOI
Decide what your DOI identifies
abstraction/manifestation, format, extent,
granularity
DOI format: dumb vs. intelligent, based on
volume/issue/page vs. tracking number, etc.
Version of record
XML, if the goal is to separate content from
presentation
rendering preserved content in a variety of
formats/devices/media
archiving textual source and non-textual
components
"Published as ready": journal model deconstructed
each article appears online as soon as its
production cycle is completed, assembled into
printed issues and mailed later
if article is published (online) in one calendar
year and printed in the next, what is its year
and volume? Should your XML schema account for
the difference?
set of articles selected on the basis of user- or
publisher-defined criteria may cut across
journals → journal is but one of many
collections. Any collection is just a query
executed against MDDB
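
The last point, that a journal and a virtual journal are both just queries against the MDDB, can be sketched as follows. The table layout, the placeholder DOIs, the journal names, and the subject terms are all invented for illustration, and SQLite stands in for whatever database the MDDB actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article (doi TEXT PRIMARY KEY, journal TEXT, year INTEGER, subject TEXT)"
)
# Placeholder rows; none of these identifiers refer to real articles.
conn.executemany(
    "INSERT INTO article VALUES (?, ?, ?, ?)",
    [
        ("10.0000/example-1", "Journal A", 2004, "aerosols"),
        ("10.0000/example-2", "Journal B", 2004, "aerosols"),
        ("10.0000/example-3", "Journal A", 2003, "seismology"),
    ],
)


def collection(conn, where, params=()):
    """Return the DOIs of all articles matching a selection criterion.

    `where` is a trusted SQL fragment written by the publisher, not user input.
    """
    sql = f"SELECT doi FROM article WHERE {where} ORDER BY doi"
    return [row[0] for row in conn.execute(sql, params)]


# A traditional journal is the collection selected by journal name.
journal_a = collection(conn, "journal = ?", ("Journal A",))
# A virtual journal selects by subject and cuts across journals.
virtual_journal = collection(conn, "subject = ?", ("aerosols",))
print(journal_a)
print(virtual_journal)
```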

13
LESSONS LEARNED 2

Page numbers and article IDs
at the time of article publishing it is not
always possible to predict accurately what its
continuous pagination within the printed issue
will be
waiting until a printed issue is assembled and
then adding page numbers to article
representations creates discrepancy in how an
article is to be cited and runs contrary to the
principle that an article must not be changed
after it is published
abandoning page numbers altogether is not an
option because many A&I services may need them
for the purposes of citation tracking and
metadata resolution (Thomson ISI, CrossRef)
as long as a printed issue exists, the reader
needs a means of finding a particular article
within it

AGU solution: smart Article ID (citation number)
Citation: Holzworth, R. H., and R. A. Goldberg
(2004), Electric field measurements in
noctilucent clouds, J. Geophys. Res., 109,
D16203, doi:10.1029/2003JD004468.
D16203 is a citation number, where
D = part D (Atmospheres) of Journal of
Geophysical Research (JGRD)
16 = issue number
2 = Aerosols and Clouds subset of JGRD
03 = article sequence within the subset
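
The structure of the citation number above can be sketched as a small parser. The field widths (one letter, two digits, one digit, two digits) are inferred from the single example D16203 and are an assumption; other AGU journals may lay out their citation numbers differently.

```python
import re
from typing import NamedTuple


class CitationNumber(NamedTuple):
    part: str      # journal part, e.g. "D" for JGR Atmospheres
    issue: int     # issue number
    subset: int    # subject subset within the journal part
    sequence: int  # article sequence within the subset


# Layout inferred from the one example in the slide: D 16 2 03.
PATTERN = re.compile(r"([A-Z])(\d{2})(\d)(\d{2})")


def parse_citation_number(text):
    """Split a citation number such as 'D16203' into its components."""
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized citation number: {text!r}")
    part, issue, subset, sequence = match.groups()
    return CitationNumber(part, int(issue), int(subset), int(sequence))


parsed = parse_citation_number("D16203")
print(parsed)
```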
14
LESSONS LEARNED 3

Page numbers and article IDs (cont'd)
all metadata needed for the citation are in the
version of record (XML), as well as in HTML and
PDF → article can be consistently cited as soon
as it is published!
A&I services can use either page numbers (each
article begins with page number 1, though), or
a citation number, or both
citation numbers appear in print → easy for the
reader to locate an article within a printed
issue
citation numbers in most cases follow the
physical sequence of articles within an issue or
its unit, but may occasionally deviate from it →
AGU has the flexibility to deal with exceptions
to a regular publishing flow
Special characters
XML is the version of record, and &eacute; is
more readable than &#x000E9;
an XML instance with Unicode points can always be
produced simply by running a validating parser

15
LESSONS LEARNED 4

Math
Tagging math
MathML
LaTeX
link to an image
Rendering math
displaying MathML in a browser
providing an image
Problems with MathML [Gaylord, 2004]
MathML Presentation vs. MathML Content
MathML Presentation: verbosity (debugging
problems)
Firefox and Netscape display MathML natively; IE
6.0 needs a plug-in. Opera, etc.?
Math Player: Windows only. Mac, Linux, UNIX?
all browsers require additional fonts, yet not
all characters can be displayed

MathML as a display format is not an option if
multiple browsers/platforms are involved.
Using an image gives complete control over
appearance, but the math can't be searched or
reused.
AGU approach: tag math using LaTeX within XML;
convert to GIF for presenting in HTML.
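
The approach above keeps LaTeX in the XML source and shows an image in HTML. A rough sketch of the substitution step follows. The `tex` element name and the `eqNNN.gif` file naming are hypothetical, and the step that actually renders LaTeX to GIF is left out. Keeping the LaTeX in the `alt` attribute is an illustrative choice, not something the slides state.

```python
import xml.etree.ElementTree as ET


def math_to_img(xml_text):
    """Replace each <tex> element with an <img> that points at a GIF file."""
    root = ET.fromstring(xml_text)
    count = 0
    # Snapshot the element list first so the tree can be edited safely.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "tex":
                count += 1
                img = ET.Element(
                    "img",
                    {"src": f"eq{count:03d}.gif", "alt": child.text or ""},
                )
                img.tail = child.tail  # keep the text that follows the formula
                parent[index] = img
    return ET.tostring(root, encoding="unicode")


SOURCE = "<p>Energy is <tex>E = mc^2</tex> in this notation.</p>"
html = math_to_img(SOURCE)
print(html)
```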
16
RESULTS 1

Improved productivity
since introduction of the XML-centric workflow,
the number of published articles has increased,
while in the journal production department 2
full-time positions have been eliminated and 25%
of staff positions have been downgraded
production time from acceptance to publication
has been substantially reduced: it is now the
fastest since 1984 (when records began to be
kept). GRL: 5 weeks; semimonthlies and monthlies:
10 weeks from acceptance to publication
management reporting improved significantly
Automated production of publishing products
printed issue ToCs
end-of-year author and subject indices
Improved quality and value of published product
human error reduced
authors responsible for content only, publisher
responsible for accuracy of articles' structure
and consistency of their appearance
multiple outputs (PDF, HTML, print) produced
automatically from the XML source
previously unfeasible checks performed: accuracy
of references' metadata

17
RESULTS 2

Automatic production of the Web search repository
metadata and full text automatically extracted
from XML into the repository
Automatic archiving
XML, HTML, PDF, non-textual components, and
article metadata
Direct data feeds to A&I services
CrossRef, NASA's Astrophysics Data System (ADS),
AIP's SPIN, etc.
There used to be delays of up to half a year
between article publication and metadata
appearance in A&I services. Now delivery is
instantaneous and fully automated
Reference linking implementation
CrossRef: inbound, outbound, and forward linking
Introduction of new information products and
services
virtual journals (cross-journal article
collections)
multimedia content
immediate access to underlying datasets
RSS

Making the production process XML-centric has
allowed AGU to bring its readers the results of
scientific research of the highest quality in the
fastest, most cost-efficient manner.
