An XML-based Encoding Standard for Language Corpora

Description:

LREC 2000, Athens, Greece.

Slides: 33


1
An XML-based Encoding Standard for Language
Corpora
XCES
  • Nancy Ide, Vassar College
  • Patrice Bonhomme, LORIA/CNRS
  • Laurent Romary, LORIA/CNRS

2
The Corpus Encoding Standard
  • EAGLES standard
  • encoding conventions for corpora used in language
    engineering research
  • an SGML application
  • TEI conformant

3
The CES defines...
  • requirements for increasing levels of encoding
  • a suite of DTDs for encoding basic document
    structure and linguistic annotation
  • a corresponding data architecture for linguistic
    corpora

4
XCES
  • instantiation of the CES DTDs in XML
  • same tags, data architecture as the CES
  • motivation: use in creating the American National
    Corpus (ANC)
  • Macleod, Ide, and Grishman, LREC 2000

5
XML provides more than SGML
  • better linkage mechanisms
  • XSLT for document access and transformation
  • XML schemas
  • provision for accessing all or part of multiple
    DTDs

6
Minimal XML conversion
  • adaptation of DTDs
  • eliminate inclusion exceptions
  • make mixed-content models XML-compliant
  • adaptation of the CES mechanism for
    inter-document reference
  • meet the specifications of XML pointer and
    linking mechanisms

7
Additional Adaptations
  • validate the CES data architecture by ensuring
    conformance to other XML specifications
  • XSL Transformation Language
  • XQL
  • exploit XML mechanisms for combining all or part
    of documents described by different DTDs
  • instantiate the XCES DTDs using XML schemas

8
The CES/XCES Data Architecture
  • remote markup, or "stand-off" model
  • annotations maintained in separate documents that
    point back to the original
  • yields a hyper-document composed of the
    original text and all annotations
  • increasingly accepted as the appropriate
    architecture for language resources

9
Use of links in CES and XCES
  • link corresponding segments of two or more
    aligned primary texts
  • link annotation documents to a base document
    containing the primary text
  • e.g., morpho-syntactic information linked to the
    string of characters in the original text to
    which it applies
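
A minimal sketch of the stand-off idea from the two slides above, written in Python rather than XML: the primary text is left untouched, and each annotation record points into it by character offset. The `Annotation` class, its field names, the 0-based offset convention, and the sample sentence and labels are illustrative assumptions, not part of the CES/XCES specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off record: a label attached to a character span."""
    start: int   # 0-based offset into the primary text (assumed convention)
    length: int  # number of characters covered
    label: str   # e.g. a morpho-syntactic tag (hypothetical value)


def resolve(primary_text: str, ann: Annotation) -> str:
    """Return the substring of the primary text that the annotation points to."""
    end = ann.start + ann.length
    if ann.start < 0 or end > len(primary_text):
        raise ValueError("annotation points outside the primary text")
    return primary_text[ann.start:end]


# The primary text is stored once; annotations live in a separate structure.
primary = "It rains in Athens."
annotations = [
    Annotation(start=0, length=2, label="PRON"),
    Annotation(start=3, length=5, label="VERB"),
]

# Combining both views on demand gives the "hyper-document".
pairs = [(resolve(primary, a), a.label) for a in annotations]
```

Because the annotations only reference the text, several independent annotation sets can point at the same primary document without modifying it.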

10
XCES Requirements for Linking
  • must be able to point to other documents
  • must be able to point to tagged elements as well
    as locations within tagged elements
  • eliminate the need to tag every element that
    might be referenced
  • eliminate IDs on every element that is
    referenced, as in SGML

11
XML Path Language (XPath)
  • concise notation for element localization in the
    document tree
  • /div/p[2]/s[3] - third sentence of second
    paragraph in each <div>
  • /descendant::p - all <p> elements
  • predicates for accessing characters within
    elements
  • substring(/p/s[2]/text(),10,12)
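
The path expressions on this slide can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. This is a sketch under assumptions: the sample document is invented, ElementTree paths are relative to the element they are called on, and ElementTree has no `substring()` function, so a small helper (`xpath_substring`, a hypothetical name) mimics its 1-based indexing.

```python
import xml.etree.ElementTree as ET

# Invented sample document using the element names from the slide.
DOC = """<text>
  <div>
    <p><s>First sentence.</s></p>
    <p><s>Alpha.</s><s>Beta.</s><s>The third sentence of the second paragraph.</s></p>
  </div>
</text>"""

root = ET.fromstring(DOC)

# Third <s> of the second <p> in each <div>, relative to the root element.
third = root.find("./div/p[2]/s[3]")

# .//p searches all descendants for <p> elements.
paragraphs = root.findall(".//p")


def xpath_substring(text: str, start: int, length: int) -> str:
    """Mimic XPath substring() for integer arguments (positions are 1-based)."""
    return text[max(start - 1, 0): max(start - 1 + length, 0)]


# Five characters starting at position 5 of the selected sentence.
fragment = xpath_substring(third.text, 5, 5)
```

A full XPath 1.0 engine would evaluate `substring(...)` directly inside the expression; the helper only stands in for that one function.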

12
XPointer
  • extends XPath syntax to allow
  • addressing points and ranges as well as nodes
  • locating information by string matching
  • use of addressing expressions in URI-references
    as fragment identifiers

13
XLink
  • uni- or multi-directional links
  • can specify how link is to be activated
  • by hand or automatically by the browser
  • can specify what to do with the target fragment
  • replace it or insert it into the source document

14
Links to External Documents
  • None in SGML
  • HyTime/TEI invented "doc" attribute
  • CES used "doc" with inheritance to avoid
    repetition of the attribute
  • not supported by SGML processors
  • XML: XLink and the xml:base attribute

15
Linking Mechanisms (A brief history)
HyTime/TEI
to="CHILD (1) (2) STRLOC (22)"
doc="doc.xml"
CES

XCES using XLink
xlink:href="...xptr(substring(/p/s[2]/text(),10,12))"

16
Use of xml:base
...c.xml" ... (/p/s[2]/text(), 10, 12))"/>
... xlink:href="xptr(substring(/p/s[2]/text(), 24, 4))"/>
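
The effect of a base URI can be sketched with Python's `urllib.parse`: each link carries only a pointer, and the document part is supplied once by the base. The base URI below is a hypothetical placeholder, and writing the pointers as `#` fragment identifiers is an assumption made for this illustration, since the slide's markup did not survive intact.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical base URI, standing in for the value of an xml:base attribute.
base = "http://example.org/corpus/doc.xml"

# Fragment-only references: each one inherits the whole base URI.
hrefs = [
    "#xptr(substring(/p/s[2]/text(),10,12))",
    "#xptr(substring(/p/s[2]/text(),24,4))",
]
resolved = [urljoin(base, h) for h in hrefs]

# Split each absolute reference back into document URI and pointer.
targets = [urldefrag(r) for r in resolved]
```

Stating the document once and resolving every link against it is what removes the need to repeat a document attribute on each link.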
17
XSLT
  • a powerful tree-traversal language
  • translate any XML document into another document
    in any form
  • html
  • XML
  • plain text
  • etc.
  • most to offer for handling annotated corpora

18
XSLT Capabilities
  • selection of elements or portions of element
    content using the XPath syntax
  • rearrangement, transformation of extracted
    information (text content, element names, etc.)
    in the target document
  • addition of information to the target document
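
The three capabilities listed above (selection, rearrangement, addition) can be illustrated with a Python stand-in. This is not XSLT; it only imitates the same kind of transformation using the standard library. The element names `chunk`, `orth`, and `ctag`, and the second sample token, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Invented annotation fragment; element names are assumptions.
ANA = """<chunk>
  <tok><orth>It</orth><ctag>PPER3</ctag></tok>
  <tok><orth>rains</orth><ctag>VERB</ctag></tok>
</chunk>"""


def tokens_to_html(source: str) -> str:
    """Build an HTML table with one row per <tok> element."""
    root = ET.fromstring(source)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    table = ET.SubElement(body, "table")

    # Addition: a header row that does not exist in the source document.
    header = ET.SubElement(table, "tr")
    for title in ("Tag", "Word"):
        ET.SubElement(header, "th").text = title

    # Selection and rearrangement: pick each token, emit tag before word.
    for tok in root.iterfind(".//tok"):
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "td").text = tok.findtext("ctag", default="")
        ET.SubElement(row, "td").text = tok.findtext("orth", default="")

    return ET.tostring(html, encoding="unicode")


page = tokens_to_html(ANA)
```

An XSLT stylesheet would express the same steps declaratively, with templates matched against the source tree instead of an explicit loop.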

19
A Simple Example
xml:base="http://www.cs.vassar.edu/ME/Oen.xcesDoc"
...string(//p/s[1]/text(),1,2" It it
Pp3ns PPER3x it
Pp3ns PPER3
...
xcesAna document

20
XSLT creates HTML
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
<body> ... </body> ... template match="//par" ... //tok ...
XSLT document

21
Result
22
Possibilities
  • create new documents containing selected
    annotations
  • transduce XML encoded documents to tool-internal
    formats
  • generate a new document with all phonemes that
    appear in a certain context (or all the unique
    contexts of a certain phoneme), etc.

23
XML Schemas
  • constrain and document the meaning, usage and
    relationships of the constituent parts of XML
    documents
  • datatypes
  • elements and their content
  • attributes and their values
  • provide default values for attributes and
    elements

24
Impact for language resources
  • provide means to define an abstract data model
    for a class of documents
  • e.g., data model for annotations and annotated
    objects
  • one of the most important tasks for corpus and
    tool creators
  • provide for much tighter validation of document
    form and content

25
Capabilities
  • different attribute declarations and/or content
    models can apply to elements with the same name
    in different contexts
  • allows for more tightly constrained content
    models than possible with DTDs
  • e.g., an element in the header and a same-named element
    in the text are likely to have different content constraints

26
  • define equivalence classes for groups of elements
    and/or attributes
  • may be used in the same ways as defined for a
    particular named element
  • the CES used parameter entities to define a class of
    phrase-level objects (for example)
  • a "kludge"

27
  • constrain attribute or element values (or
    combinations) to be unique, e.g.,
  • only one entry in a computational lexicon can be
    defined with a given word form
  • only one paragraph can have an attribute
    indicating that it is the 23rd
  • only one disambiguated form is given for each
    token
  • only one correspondence for a given item in an
    alignment document

Useful for error detection and prevention
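
The kind of check a uniqueness constraint performs can be sketched in a few lines of Python. This is only an illustration of the idea, not of XML Schema syntax; the function name `find_duplicates` and the sample word forms are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def find_duplicates(keys: Iterable[Hashable]) -> list:
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


# Word forms of lexicon entries: each form should be defined only once.
lexicon_forms = ["run", "walk", "run", "see"]
dups = find_duplicates(lexicon_forms)

# Paragraph numbering: only one paragraph may claim a given number.
paragraph_numbers = [21, 22, 23, 24]
clashes = find_duplicates(paragraph_numbers)
```

A schema-level constraint moves this check into validation itself, so a duplicate entry is rejected before any tool processes the document.
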
28
  • establish dependencies based on element or
    attribute values, for example
  • prevent nouns from being assigned a tense
  • specify that tokens with type attribute value
    PUNCT include only elements containing
    specific characters
  • specify annotation labels elsewhere, constrain
    element content to these values only
  • e.g., constrain the values of an element in an XCES
    annotation document to the EAGLES
    morpho-syntactic specifications

Another means for error control and validation
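
The value-dependent rules above can be sketched as a small Python validator. It illustrates the logic only, not how a schema would declare it; the dictionary keys (`pos`, `tense`, `type`, `text`) and the use of `string.punctuation` as the set of allowed characters are assumptions made for the example.

```python
import string


def check_token(tok: dict) -> list:
    """Return a list of rule violations for one token record."""
    errors = []

    # Dependency on a value: nouns must not be assigned a tense.
    if tok.get("pos") == "NOUN" and "tense" in tok:
        errors.append("noun must not carry a tense")

    # Tokens typed PUNCT may contain only punctuation characters.
    if tok.get("type") == "PUNCT":
        text = tok.get("text", "")
        if not text or any(ch not in string.punctuation for ch in text):
            errors.append("PUNCT token contains non-punctuation characters")

    return errors


bad_noun = check_token({"pos": "NOUN", "tense": "past", "text": "dog"})
good_punct = check_token({"type": "PUNCT", "text": "?!"})
bad_punct = check_token({"type": "PUNCT", "text": "a."})
```
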
29
Why is XML a good thing?
  • search, extraction, and transformation
    capabilities answer most current and foreseen
    needs for corpus-based language engineering
  • means to fully implement the CES/XCES data
    architecture
  • processing tools for XML recommendations are
    freely distributed
  • no need for costly and time-consuming tool
    development

30
XCES and its future
  • CES and XCES have been developed for and by the
    language engineering community
  • At present, they cover
  • various features in written text
  • morpho-syntactic annotation
  • alignment information
  • relatively stable and agreed-upon within the
    community

31
  • coverage will continue to evolve
  • currently working with different groups to
    implement encoding guidelines for
  • additional written text features
  • computational lexicons
  • discourse and dialogue
  • co-reference
  • speech and its various levels of annotation and
    representation
  • Asian character support

32
Information
http://www.cs.vassar.edu/CES and http://www.cs.vassar.edu/XCES
ide@cs.vassar.edu or ide@loria.fr
bonhomme@loria.fr
romary@loria.fr