1
XML-CENTRIC WORKFLOW OFFERS BENEFITS TO
SCHOLARLY PUBLISHERS
XML 2004, Washington, D.C., 18 November 2004
Alexander (Sasha) Schwarzman, AGU <sschwarzman_at_agu.org>
Hyunmin Hur, DocsDoc
Shu-Li Pai, AGU
Carter M. Glass, AGU
2
PRESENTATION OVERVIEW
  • Introduction
  • Publishing requirements in a dual
    paper-electronic world
  • AGU manuscript production modes in 2001
  • System architecture and workflow in 2004
  • Design decisions
      • Schema language: DTD vs. W3C Schema and Relax NG
      • Copyediting manuscript: in author-submitted
        format vs. in XML
      • Converting manuscript to XML: vendor vs. in-house
      • Validating XML instance beyond a validating
        parser
      • Extracting and loading metadata into a metadata
        database (MDDB)
      • MDDB-based information products and services
      • Choice of DB technology and programming languages
  • Lessons learned
      • DOI: what it identifies and what its format
        should be
      • Version of record
      • "Published as ready": journal model deconstructed
      • Page numbers and article IDs
      • Special characters and math
  • Results

3
INTRODUCTION
  • AGU
      • Nonprofit multidisciplinary scientific society:
        41,000 members from 130 countries
      • Focus: organization and dissemination of
        scientific information in the interdisciplinary
        field of geophysics (atmospheric, oceanic, solid
        Earth, hydrologic, and space sciences)
      • Publishes 14 high-impact English-language
        journals, 4500 articles annually
  • Manuscript life cycle
      • Production
          • copyediting (copy editor)
          • proofing (proofreader and author)
          • correcting (vendor)
          • publishing (production coordinator)

While AGU has introduced radical changes in
how a manuscript is handled at all stages of
its life cycle, in this presentation we will
concentrate on the post-acceptance part of the
publishing process.
4
PUBLISHING REQUIREMENTS IN A PAPER-ELECTRONIC
WORLD
  • Multiple outputs
      • journal article has to appear in multiple
        formats: print, PDF, HTML, ...
  • Search
      • both an article's metadata and its full text
        have to be searchable
  • Linking
      • bibliographic references, external datasets,
        inter- and intra-article linking
  • Dynamic content
      • authors have to be able to include multimedia
        objects, such as videos or animations, in their
        articles
  • Cross-journal products
      • ability to create collections cutting across
        journals (virtual journals)
  • Customization
  • Metadata sharing
  • Preservation of scientific content
      • ability to preserve scientific content in a
        readable nonproprietary format for the
        foreseeable future

5
AGU MANUSCRIPT PRODUCTION MODES IN 2001
  • Camera-ready copy (CRC)
      • no electronic copy → no reuse/repurposing of
        scientific content
      • authors prepared production files → inconsistent
        quality of published product
      • metadata-based products (issue ToCs, author and
        subject indices, AGU bibliographic database EASI)
        created manually (rekeying). Still, no abstracts
        in EASI!
      • no article in electronic form → printed issues
        mailed to abstracting and indexing (A&I)
        services, metadata rekeyed. Delay between issue
        publication and A&I information availability
  • Typeset manuscript
      • two journals wholly typeset in XyVision. PDF,
        HTML, and ISO 12083 SGML generated from
        proprietary typesetting system
      • 1997: GRL authors given an option to submit in
        LaTeX, which was converted to HTML and PDF
        in-house and posted on the Web
  • SGML markup of the electronic manuscript file
      • 1997: Earth Interactions marked up in SGML in
        accordance with its own DTD
      • 1999: Geochemistry, Geophysics, Geosystems
        marked up with a variation of ISO 12083 SGML DTD
      • 2000: Global Biogeochemical Cycles, partial use
        of AGU Article SGML DTD

CRC publishing model is a dead end. Disparate
production modes are counterproductive. Solution:
a unified XML-centric process to cut costs, provide
services, and stay competitive.
6
SYSTEM ARCHITECTURE AND WORKFLOW 1
7
SYSTEM ARCHITECTURE AND WORKFLOW 2
  • Custom software
  • AGU-article XML DTD
  • AGU Validator
  • XML conversion tool
  • Metadata Loader (bib and ref metadata extractor
    and inserter)
  • AGU metadata database (AGU MDDB)
  • reporting, linking, and metadata dissemination
    modules
  • full-text database and search engine

8
DESIGN DECISIONS: DTD vs. W3C Schema and Relax NG
  • DTD advantages [Beck and Lapeyre, 2003]
      • Technical
          • parameter entity mechanism: modular design,
            inclusion of DTDs (CALS, MathML),
            maintainability, scalability, customization
          • availability of processing tools
          • consistency of validity checking among parsers
      • Business
          • vendor community: taggers, compositors,
            conversion shops
          • aggregators, archives
      • Practical
          • XML character entities: &eacute; vs. Unicode
            point &#x000E9; (readability)
  • DTD disadvantages
      • not in XML syntax
      • lacks strong datatyping
  • DTD: publisher-specific vs. industry-standard
      • no suitable DTD at the time, so AGU developed
        its own to meet the requirements
      • emerging industry standard: NCBI/NLM DTD,
        http://dtd.nlm.nih.gov/publishing/

9
DESIGN DECISIONS: non-DTD schema languages, W3C
Schema
  • W3C Schema
      • Ask yourself:
          • must the schema be in XML syntax?
          • are strong datatyping and full namespace
            support for elements and attributes essential?
          • is a min/max mechanism for elements essential?
          • are content models similar enough to make use
            of derivation by extension/restriction?
          • are developers and vendors okay with
            inconsistency of tools?
      • Cons
          • in scholarly publishing content models are
            diverse → not derivable
          • mixed content → min/max, regular expressions
            not usable
          • difficult to modularize, scale, maintain
          • no XML character entities

10
DESIGN DECISIONS: non-DTD schema languages,
Relax NG
  • Relax NG
      • Pros
          • has both XML and compact (non-XML) syntax
          • combines intelligibility of DTD with datatyping
            capability of W3C Schema
          • provides for context-sensitive content models
          • can validate documents of different types using
            a combined schema
          • DocBook and TEI converted to Relax NG in the
            past year
      • Cons
          • not as well-established as W3C Schema; number
            of tools available is limited
          • does not support XML character entities → you
            must also have a DTD
          • does not permit Formal Public Identifiers (FPI)
          • Relax NG-specific features may not translate
            neatly to either DTD or W3C Schema
  • Today we would still opt for a DTD, but Relax
    NG may become a schema of choice for text (as
    opposed to data) content in the future

11
DESIGN DECISIONS: more
  • Converting manuscript to XML
      • vendor vs. in-house
  • Copyediting manuscript
      • in author-submitted format vs. in XML
  • Validator
      • XML instance may be valid but not correct or
        even meaningful [Rosenblum and Golfman, 2001]
      • in addition to validity, also checks datatypes,
        specific dependencies, and naming conventions
      • overall, performs 100 checks on each XML article
        instance
  • Metadata Loader
      • extracts bibliographic and reference metadata
        and loads them into MDDB
  • Validator and Loader: Java/XSLT applications
    (portability)
  • MDDB and its reporting, linking, and metadata
    dissemination modules
      • information products and services: online ToCs
        (updated daily), printed issue ToCs, author and
        subject indices, virtual journals
      • metadata deposits, response pages, and citation
        linking
      • business reports for the managers
  • MDDB: relational vs. native XML DB
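
The Validator bullet says an instance can be valid yet incorrect. A minimal sketch of what such post-parser checks might look like is given here in Python for brevity; the slides state that the actual AGU Validator is a Java/XSLT application, and its rules are not shown. The element names (doi, received, accepted) and the three rules are assumptions made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical naming convention: a DOI prefix, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def _parse_iso_date(text):
    """Return a date for YYYY-MM-DD text, or None if it is not a valid date."""
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return None


def check_article(xml_text):
    """Return a list of problems that a DTD alone would not catch."""
    root = ET.fromstring(xml_text)
    problems = []

    # Naming-convention check (hypothetical rule).
    doi = root.findtext("doi", default="").strip()
    if not DOI_PATTERN.match(doi):
        problems.append(f"DOI does not follow the expected pattern: {doi!r}")

    # Datatype checks: a DTD sees these elements only as character data.
    received = _parse_iso_date(root.findtext("received", default=""))
    accepted = _parse_iso_date(root.findtext("accepted", default=""))
    if received is None:
        problems.append("received is not a valid ISO date")
    if accepted is None:
        problems.append("accepted is not a valid ISO date")

    # Dependency check: one value constrains another.
    if received and accepted and accepted < received:
        problems.append("accepted date precedes received date")

    return problems
```

An instance that passes returns an empty list; each failed rule adds one message, so a production tool could report all problems for an article in a single pass.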

12
LESSONS LEARNED 1
  • DOI
      • Decide what your DOI identifies:
        abstraction/manifestation, format, extent,
        granularity
      • DOI format: dumb vs. intelligent, based on
        volume/issue/page vs. tracking number, etc.
  • Version of record
      • XML, if the goal is to separate content from
        presentation
      • rendering preserved content in a variety of
        formats/devices/media
      • archiving textual source and non-textual
        components
  • "Published as ready": journal model deconstructed
      • each article appears online as soon as its
        production cycle is completed; articles are
        assembled into printed issues and mailed later
      • if an article is published (online) in one
        calendar year and printed in the next, what is
        its year and volume? Should your XML schema
        account for the difference?
      • a set of articles selected on the basis of user-
        or publisher-defined criteria may cut across
        journals → a journal is but one of many
        collections. Any collection is just a query
        executed against MDDB
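
The idea that any collection is just a query against the MDDB can be sketched as follows. The slides do not show the MDDB schema, so the table and column names here are hypothetical, an in-memory SQLite database stands in for whatever database AGU used, and the two example rows with 10.0000 DOIs are placeholders rather than real articles.

```python
import sqlite3


def build_demo_mddb():
    """Create a tiny in-memory stand-in for a metadata database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article ("
        " doi TEXT PRIMARY KEY, journal TEXT, pub_year INTEGER, topic TEXT)"
    )
    conn.executemany(
        "INSERT INTO article VALUES (?, ?, ?, ?)",
        [
            # The first row reuses the citation quoted in the slides;
            # the topic value and the other rows are invented placeholders.
            ("10.1029/2003JD004468", "J. Geophys. Res.", 2004, "clouds"),
            ("10.0000/example-a", "Journal A", 2004, "clouds"),
            ("10.0000/example-b", "Journal B", 2004, "tides"),
        ],
    )
    return conn


def journal_collection(conn, journal):
    """A traditional journal: every article with the same journal value."""
    rows = conn.execute(
        "SELECT doi FROM article WHERE journal = ? ORDER BY doi", (journal,)
    )
    return [doi for (doi,) in rows]


def virtual_journal(conn, topic):
    """A cross-journal collection: the same kind of query, different criterion."""
    rows = conn.execute(
        "SELECT doi, journal FROM article WHERE topic = ? ORDER BY doi", (topic,)
    )
    return rows.fetchall()
```

Both functions are ordinary SELECT statements over the same table, which is the point of the slide: the journal is one selection criterion among many.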

13
LESSONS LEARNED 2
  • Page numbers and article IDs
      • at the time of article publishing it is not
        always possible to predict accurately what its
        continuous pagination within the printed issue
        will be
      • waiting until a printed issue is assembled and
        then adding page numbers to article
        representations creates a discrepancy in how an
        article is to be cited and runs contrary to the
        principle that an article must not be changed
        after it is published
      • abandoning page numbers altogether is not an
        option because many A&I services may need them
        for the purposes of citation tracking and
        metadata resolution (Thomson ISI, CrossRef)
      • as long as a printed issue exists, the reader
        needs a means of finding a particular article
        within it

AGU solution: smart Article ID (citation
number).
Citation: Holzworth, R. H., and R. A.
Goldberg (2004), Electric field measurements in
noctilucent clouds, J. Geophys. Res.,
109, D16203, doi:10.1029/2003JD004468.
D16203 is a citation number, where
  • D: part D (Atmospheres) of Journal of
    Geophysical Research (JGRD)
  • 16: issue number
  • 2: Aerosols and Clouds subset of JGRD
  • 03: article sequence within the subset
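
The breakdown of D16203 can be sketched as a small parser. The field layout (one letter, two-digit issue, one-digit subset, two-digit sequence) is inferred from this single example only; whether other AGU journals use the same layout is an assumption, and the names below are hypothetical.

```python
import re
from typing import NamedTuple


class CitationNumber(NamedTuple):
    part: str       # journal part letter, e.g. "D" for Atmospheres
    issue: int      # issue number
    subset: int     # subject subset within the journal part
    sequence: int   # article sequence within the subset


# Layout inferred from the single example D16203.
_PATTERN = re.compile(r"^([A-Z])(\d{2})(\d)(\d{2})$")


def parse_citation_number(text):
    """Split a citation number such as 'D16203' into its components."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognized citation number: {text!r}")
    part, issue, subset, sequence = match.groups()
    return CitationNumber(part, int(issue), int(subset), int(sequence))
```

For the example on the slide, parse_citation_number("D16203") yields part "D", issue 16, subset 2, and sequence 3.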
14
LESSONS LEARNED 3
  • Page numbers and article IDs (cont'd)
      • all metadata needed for the citation are in the
        version of record (XML), as well as in HTML and
        PDF → article can be consistently cited as soon
        as it is published!
      • A&I services can use either page numbers (each
        article begins with page number 1, though), or
        a citation number, or both
      • citation numbers appear in print → easy for the
        reader to locate an article within a printed
        issue
      • citation numbers in most cases follow the
        physical sequence of articles within an issue or
        its unit, but may occasionally deviate from it →
        AGU has the flexibility to deal with exceptions
        to a regular publishing flow
  • Special characters
      • XML is the version of record, and &eacute; is
        more readable than &#x000E9;
      • an XML instance with Unicode points can always be
        produced simply by running a validating parser
15
LESSONS LEARNED 4
  • Math
      • Tagging math
          • MathML
          • LaTeX
          • link to an image
      • Rendering math
          • displaying MathML in a browser
          • providing an image
      • Problems with MathML [Gaylord, 2004]
          • MathML Presentation vs. MathML Content
          • MathML Presentation: verbosity (debugging
            problems)
          • Firefox and Netscape display MathML natively;
            IE 6.0 needs a plug-in. Opera, etc.?
          • MathPlayer: Windows only. Mac, Linux, UNIX?
          • all browsers require additional fonts, yet not
            all characters can be displayed

MathML as a display format is not an option if
multiple browsers/platforms are involved. Using an
image gives complete control over appearance, but
math can't be searched/reused. AGU approach: tag
math using LaTeX within XML; convert to GIF for
presenting in HTML.
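
A rough sketch of the HTML side of this approach: the LaTeX source stays in the XML, and each formula is replaced in HTML by an image reference. The element name tex-math and the file-naming scheme are hypothetical, and the actual LaTeX-to-GIF rendering step is deliberately left out because the slides do not describe the tool used.

```python
import xml.etree.ElementTree as ET
from html import escape


def math_to_img_tags(xml_text, tag="tex-math"):
    """List (gif_name, latex_source, img_tag) for each LaTeX formula in the XML.

    A separate rendering step (not shown) would turn each latex_source
    into the GIF file named gif_name.
    """
    root = ET.fromstring(xml_text)
    jobs = []
    for number, node in enumerate(root.iter(tag), start=1):
        latex = (node.text or "").strip()
        gif_name = f"eq{number:03d}.gif"
        # Keeping the LaTeX source as alt text leaves the math readable
        # even though the display form is an image.
        img_tag = (
            f'<img src="{escape(gif_name, quote=True)}" '
            f'alt="{escape(latex, quote=True)}">'
        )
        jobs.append((gif_name, latex, img_tag))
    return jobs
```

Because the formula remains as text in the XML version of record, it can still be searched and reused there, while the HTML rendering stays independent of browser MathML support.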
16
RESULTS 1
  • Improved productivity
      • since introduction of the XML-centric workflow,
        the number of published articles has increased,
        while in the journal production department 2
        full-time positions have been eliminated and 25%
        of staff positions have been downgraded
      • production time from acceptance to publication
        has been substantially reduced; it is now the
        fastest since 1984 (when records began to be
        kept). GRL: 5 weeks; semimonthlies and monthlies:
        10 weeks from acceptance to publication
      • management reporting improved significantly
  • Automated production of publishing products
      • printed issue ToCs
      • end-of-year author and subject indices
  • Improved quality and value of published product
      • human error reduced
      • authors responsible for content only, publisher
        responsible for accuracy of articles' structure
        and consistency of their appearance
      • multiple outputs (PDF, HTML, print) produced
        automatically from the XML source
      • previously unfeasible checks performed: accuracy
        of reference metadata

17
RESULTS 2
  • Automatic production of the Web search repository
      • metadata and full text automatically extracted
        from XML into the repository
  • Automatic archiving
      • XML, HTML, PDF, non-textual components, and
        article metadata
  • Direct data feeds to A&I services
      • CrossRef, NASA's Astrophysics Data System (ADS),
        AIP's SPIN, etc.
      • there used to be delays of up to half a year
        between article publication and metadata
        appearance in A&I services. Now delivery is
        instantaneous and fully automated
  • Reference linking implementation
      • CrossRef: inbound, outbound, and forward linking
  • Introduction of new information products and
    services
      • virtual journals (cross-journal article
        collections)
      • multimedia content
      • immediate access to underlying datasets
      • RSS

Making the production process XML-centric has
allowed AGU to bring its readers the results of
scientific research of the highest quality in the
fastest, most cost-efficient manner.