Title: An XML-based Encoding Standard for Language Corpora
1. An XML-based Encoding Standard for Language Corpora
XCES
- Nancy Ide, Vassar College
- Patrice Bonhomme, LORIA/CNRS
- Laurent Romary, LORIA/CNRS
2. The Corpus Encoding Standard
- EAGLES standard
- encoding conventions for corpora used in language
engineering research
- an SGML application
- TEI conformant
3. The CES defines...
- requirements for increasing levels of encoding
- a suite of DTDs for encoding basic document
structure and linguistic annotation
- a corresponding data architecture for linguistic
corpora
4. XCES
- instantiation of the CES DTDs in XML
- same tags, data architecture as the CES
- motivation: use in creating the American National
Corpus (ANC)
- Macleod, Ide, and Grishman, LREC 2000
5. XML provides more than SGML
- better linkage mechanisms
- XSLT for document access and transformation
- XML schemas
- provision for accessing all or part of multiple
DTDs
6. Minimal XML conversion
- adaptation of DTDs
- eliminate inclusion exceptions
- make mixed-content models XML-compliant
- adaptation of the CES mechanism for
inter-document reference
- meet the specifications of XML pointer and
linking mechanisms
7. Additional Adaptations
- validate the CES data architecture by ensuring
conformance to other XML specifications
- XSL Transformation Language
- XQL
- exploit XML mechanisms for combining all or part
of documents described by different DTDs
- instantiate the XCES DTDs using XML schemas
8. The CES/XCES Data Architecture
- remote markup, or "stand-off" model
- annotations maintained in separate documents that
point back to the original
- yields a hyper-document composed of the
original text and all annotations
- increasingly accepted as the appropriate
architecture for language resources
9. Use of links in CES and XCES
- link corresponding segments of two or more
aligned primary texts
- link annotation documents to a base document
containing the primary text
- e.g., morpho-syntactic information linked to the
string of characters in the original text to
which it applies
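A minimal Python sketch of the stand-off idea, under stated assumptions: the primary text is a plain string, and each annotation is a separate record that points back to a character span in it. The `Annotation` class, the sample sentence, and the labels are hypothetical and are not part of the CES or XCES specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it stores a pointer, never a copy of the text."""
    start: int    # 0-based character offset into the primary text
    length: int   # number of characters covered
    label: str    # hypothetical morpho-syntactic label


def resolve(primary_text: str, ann: Annotation) -> str:
    """Return the string of characters in the primary text that ann points to."""
    return primary_text[ann.start:ann.start + ann.length]


# The primary text is kept in one place and is never modified.
primary = "It is raining."

# Annotations live apart from the primary text (here: a separate list).
annotations = [
    Annotation(0, 2, "PRON"),
    Annotation(3, 2, "VERB"),
    Annotation(6, 7, "VERB"),
]

# The "hyper-document" view: original text combined with its annotations.
hyper_document = [(resolve(primary, a), a.label) for a in annotations]
```

Because the annotations only hold offsets, several annotation sets can target the same primary text without touching it.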
10. XCES Requirements for Linking
- must be able to point to other documents
- must be able to point to tagged elements as well
as locations within tagged elements
- eliminate the need to tag every element that
might be referenced
- eliminate IDs on every element that is
referenced, as in SGML
11. XML Path Language (XPath)
- concise notation for element localization in the
document tree
- /div/p[2]/s[3] - third sentence of second
paragraph in each <div>
- /descendant::p - all <p> elements
- functions for accessing characters within
elements
- substring(/p/s[2]/text(),10,12)
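The path expressions above can be tried out with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (child steps, `.//`, and positional predicates, but no functions). The sample document is hypothetical, and `xpath_substring` is an assumed helper that imitates the XPath `substring()` function for positive integer arguments only.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-division document; element names follow the slide's examples.
doc = ET.fromstring(
    "<text>"
    "<div><p><s>A1.</s></p>"
    "<p><s>B1.</s><s>B2.</s><s>B3 is here.</s></p></div>"
    "<div><p><s>C1.</s></p>"
    "<p><s>D1.</s><s>D2.</s><s>D3 is there.</s></p></div>"
    "</text>"
)

# Counterpart of /div/p[2]/s[3]: third sentence of the second paragraph in each div.
third_sentences = [s.text for s in doc.findall("./div/p[2]/s[3]")]

# Counterpart of /descendant::p: every p element below the root.
paragraph_count = len(doc.findall(".//p"))


def xpath_substring(value: str, start: int, length: int) -> str:
    """Imitate XPath substring(): start is 1-based, third argument is a length.

    Assumes start >= 1 and length >= 0 (the full XPath rounding rules are omitted).
    """
    return value[start - 1:start - 1 + length]


# Two characters starting at the fourth character of the first selected sentence.
snippet = xpath_substring(third_sentences[0], 4, 2)
```

Note that the third argument of `substring()` is a length, not an end position, so `substring(..., 10, 12)` covers twelve characters starting at the tenth.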
12. XPointer
- extends XPath syntax to allow
- addressing points and ranges as well as nodes
- locating information by string matching
- use of addressing expressions in URI-references
as fragment identifiers
13. XLink
- uni- or multi-directional links
- can specify how link is to be activated
- by hand or automatically by the browser
- can specify what to do with the target fragment
- replace it or insert it into the source document
14. Links to External Documents
- None in SGML
- HyTime/TEI invented "doc" attribute
- CES used "doc" with inheritance to avoid
repetition of the attribute
- not supported by SGML processors
- XML: XLink and xml:base attribute
15. Linking Mechanisms (A brief history)
HyTime/TEI
to="CHILD (1) (2) STRLOC (22)"
doc="doc.xml"
CES
XCES using XLink
xptr(substring(/p/s[2]/text(),10,12))"
16. Use of xml:base
c.xml"
xlink:href="xptr(substring(/p/s[2]/text(), 10, 12))"/>
xlink:href="xptr(substring(/p/s[2]/text(), 24, 4))"/>
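A rough Python sketch of what xml:base buys: the base URI is stated once, and each link then carries only a fragment (or a relative reference) that is resolved against it. It uses the standard `urllib.parse` functions. The `example.org` URI and the file names are hypothetical, and a real XLink processor would do this resolution itself.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical base URI, as it might be declared once in an xml:base attribute.
base = "http://example.org/corpus/doc.xml"

# Link targets that omit the document and carry only an addressing expression.
hrefs = [
    "#xptr(substring(/p/s[2]/text(),10,12))",
    "#xptr(substring(/p/s[2]/text(),24,4))",
]

# Resolve every reference against the base, as a URI resolver would.
targets = [urljoin(base, href) for href in hrefs]

# Split a resolved target back into document URI and fragment identifier.
document_uri, fragment = urldefrag(targets[0])

# A relative reference to a sibling document resolves against the same base.
sibling = urljoin(base, "other.xml#xptr(/p[1])")
```

The design point is the same one the CES "doc" inheritance aimed at: the document reference is not repeated on every link.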
17. XSLT
- a powerful tree-traversal language
- translate any XML document into another document
in any form
- html
- XML
- plain text
- etc.
- most to offer for handling annotated corpora
18. XSLT Capabilities
- selection of elements or portions of element
content using the XPath syntax
- rearrangement, transformation of extracted
information (text content, element names, etc.)
in the target document
- addition of information to the target document
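The three capabilities above (select, rearrange, add) can be illustrated without an XSLT processor. The sketch below is a Python analogue using the standard `xml.etree.ElementTree`, not XSLT itself. The sample annotation document and the element names `chunkList`, `orth`, and `ctag` are assumptions made for illustration; only `tok` appears elsewhere in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation document; element names are assumptions.
ANNOTATIONS = """<chunkList>
  <tok><orth>It</orth><ctag>PPER3</ctag></tok>
  <tok><orth>rains</orth><ctag>VVZ</ctag></tok>
</chunkList>"""


def to_html(xml_text: str) -> str:
    """Build an HTML table from token annotations."""
    source = ET.fromstring(xml_text)

    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    table = ET.SubElement(body, "table")

    # Addition: a header row that does not exist in the source document.
    header = ET.SubElement(table, "tr")
    for label in ("Tag", "Word"):
        ET.SubElement(header, "th").text = label

    # Selection: pick out every tok element, wherever it occurs.
    for tok in source.iter("tok"):
        row = ET.SubElement(table, "tr")
        # Rearrangement: the tag is written before the word form.
        ET.SubElement(row, "td").text = tok.findtext("ctag")
        ET.SubElement(row, "td").text = tok.findtext("orth")

    return ET.tostring(html, encoding="unicode")


page = to_html(ANNOTATIONS)
```

An XSLT stylesheet would express the same steps declaratively, with templates matched against the source tree.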
19. A Simple Example
xml:base="http://www.cs.vassar.edu/ME/Oen.xcesDoc"
substring(//p/s[1]/text(),1,2)" It it
Pp3ns PPER3x it
Pp3ns PPER3
...
xcesAna document
20. XSLT creates HTML
[XSLT document; recoverable fragments: namespace http://www.w3.org/1999/XSL/Transform, a template with match="//par", a path ending in //tok, HTML body and th elements]
21. Result
22. Possibilities
- create new documents containing selected
annotations
- transduce XML encoded documents to tool-internal
formats
- generate a new document with all phonemes that
appear in a certain context (or all the unique
contexts of a certain phoneme), etc.
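As an illustration of the last possibility, here is a small Python sketch that collects the unique left and right contexts of one phoneme. The `<utt>` and `<ph>` element names, the sample utterance, and the use of "#" as a boundary marker are all assumptions; the slides do not define a phoneme encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical phoneme-level encoding of one utterance.
UTTERANCE = "<utt><ph>k</ph><ph>a</ph><ph>t</ph><ph>a</ph></utt>"


def unique_contexts(phonemes, target, boundary="#"):
    """Return the sorted set of (left, right) neighbours of target."""
    contexts = set()
    for i, phoneme in enumerate(phonemes):
        if phoneme != target:
            continue
        left = phonemes[i - 1] if i > 0 else boundary
        right = phonemes[i + 1] if i + 1 < len(phonemes) else boundary
        contexts.add((left, right))
    return sorted(contexts)


# Extract the phoneme sequence from the XML, then query it.
phonemes = [ph.text for ph in ET.fromstring(UTTERANCE).iter("ph")]
contexts_of_a = unique_contexts(phonemes, "a")
```

The result could then be written out as a new XML document or in any tool-internal format.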
23. XML Schemas
- constrain and document the meaning, usage and
relationships of the constituent parts of XML
documents
- datatypes
- elements and their content
- attributes and their values
- provide default values for attributes and
elements
24. Impact for language resources
- provide means to define an abstract data model
for a class of documents
- e.g., data model for annotations and annotated
objects
- one of the most important tasks for corpus and
tool creators
- provide for much tighter validation of document
form and content
25. Capabilities
- different attribute declarations and/or content
models can apply to elements with the same name
in different contexts
- allows for more tightly constrained content
models than possible with DTDs
- e.g., the same element in the header and in the text
is likely to have different content constraints
26. Capabilities (continued)
- define equivalence classes for groups of elements
and/or attributes
- may be used in the same ways as defined for a
particular named element
- in CES, parameter entities were used to make a class
of phrase-level objects (for example)
- a "kludge"
27. Capabilities (continued)
- constrain attribute or element values (or
combinations) to be unique, e.g.,
- only one entry in a computational lexicon can be
defined with a given word form
- only one paragraph can have an attribute
indicating that it is the 23rd
- only one disambiguated form is given for each
token
- only one correspondence for a given item in an
alignment document
Useful for error detection and prevention
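The kind of error a uniqueness constraint catches can be shown with a short Python check. This is a sketch of the check itself, not of XML Schema syntax. The `<lexicon>` and `<entry>` elements and the `form` attribute are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical computational lexicon in which "run" is entered twice.
LEXICON = """<lexicon>
  <entry form="run"/>
  <entry form="walk"/>
  <entry form="run"/>
</lexicon>"""


def duplicated_values(root, path, attribute):
    """Return attribute values that occur on more than one selected element."""
    counts = Counter(el.get(attribute) for el in root.findall(path))
    # Elements lacking the attribute yield None and are ignored here.
    return sorted(v for v, n in counts.items() if v is not None and n > 1)


duplicates = duplicated_values(ET.fromstring(LEXICON), "./entry", "form")
```

With a schema-level uniqueness constraint, the validator would report the second "run" entry, so no such ad hoc check is needed.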
28. Capabilities (continued)
- establish dependencies based on element or
attribute values, for example
- prevent nouns from being assigned a tense
- specify that tokens with type attribute value
PUNCT include only elements containing
specific characters
- specify annotation labels elsewhere, constrain
element content to these values only
- e.g., constrain the values of a given element
in an XCES annotation document to those in the EAGLES
morpho-syntactic specifications
Another means for error control and validation
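The first two dependencies listed above can be sketched as a Python check over an annotated sentence. The attribute names `pos`, `tense`, `type`, and `id`, the sample tokens, and the restriction to ASCII punctuation are assumptions made for illustration; the slides do not say how such rules would be written in a schema.

```python
import string
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; tokens t2 and t4 break the rules on purpose.
ANNOTATED = """<s>
  <tok id="t1" pos="NOUN">rain</tok>
  <tok id="t2" pos="NOUN" tense="past">rained</tok>
  <tok id="t3" type="PUNCT">.</tok>
  <tok id="t4" type="PUNCT">ok</tok>
</s>"""


def dependency_violations(root):
    """Return (token id, message) pairs for tokens that break a dependency."""
    problems = []
    for tok in root.iter("tok"):
        # Rule 1: a noun must not be assigned a tense.
        if tok.get("pos") == "NOUN" and tok.get("tense") is not None:
            problems.append((tok.get("id"), "noun carries a tense"))
        # Rule 2: a PUNCT token may contain only punctuation characters
        # (ASCII punctuation only in this sketch).
        text = tok.text or ""
        only_punct = bool(text) and all(c in string.punctuation for c in text)
        if tok.get("type") == "PUNCT" and not only_punct:
            problems.append((tok.get("id"), "PUNCT token has non-punctuation content"))
    return problems


violations = dependency_violations(ET.fromstring(ANNOTATED))
```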
29. Why is XML a good thing?
- search, extraction, and transformation
capabilities answer most current and foreseen
needs for corpus-based language engineering
- means to fully implement the CES/XCES data
architecture
- processing tools for XML recommendations are
freely distributed
- no need for costly and time-consuming tool
development
30. XCES and its future
- CES and XCES have been developed for and by the
language engineering community
- at present, they cover
- various features in written text
- morpho-syntactic annotation
- alignment information
- relatively stable and agreed-upon within the
community
31. XCES and its future (continued)
- coverage will continue to evolve
- currently working with different groups to
implement encoding guidelines for
- additional written text features
- computational lexicons
- discourse and dialogue
- co-reference
- speech and its various levels of annotation and
representation
- Asian character support
32. Information
http://www.cs.vassar.edu/CES and http://www.cs.vassar.edu/XCES
ide_at_cs.vassar.edu or ide_at_loria.fr
bonhomme_at_loria.fr
romary_at_loria.fr