Tone Merete Bruvik, Aksis, June 16, 2006

1
Tone Merete Bruvik, Aksis, June 16, 2006
Short introduction to XML-encoding
Corpora in Phonological Research, Amsterdam, Netherlands,
15-17 June 2006
2
Source and editions
(Source: The Botulph Breviary fragments, MPF at
Bergen University Library)
3
The encoding
  • ...
  • <text>
  • <body lang="lat">
  • <pb n="1r"/>
  • <cb n="A"/>
  • <lb n="1"/><div type="hymnus"><p><supplied>sum</supplied>ens illud aue gabrelis ore
  • <lb n="2"/><supplied>fun</supplied>da nos in pace mutans nomen
  • <lb n="3"/><supplied>eu</supplied>e. <hi rend="blue">S</hi>olue uincla reis pro
  • <lb n="4"/><supplied>fer</supplied> lumen cecis mala nostra pelle
  • ...
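
To show what such an encoding makes possible, here is a minimal Python sketch that reads the four encoded lines and pulls out the editorially supplied text, the line numbers, and a plain reading text. The slide elides the beginning and end of the file, so the closing tags in the sketch are an assumption added only to make the fragment parseable.

```python
import xml.etree.ElementTree as ET

# Assumption: the closing tags (</p>, </div>, </body>, </text>) are not on the
# slide; they are added here only so that the fragment is well formed.
FRAGMENT = """<text>
<body lang="lat">
<pb n="1r"/>
<cb n="A"/>
<lb n="1"/><div type="hymnus"><p><supplied>sum</supplied>ens illud aue gabrelis ore
<lb n="2"/><supplied>fun</supplied>da nos in pace mutans nomen
<lb n="3"/><supplied>eu</supplied>e. <hi rend="blue">S</hi>olue uincla reis pro
<lb n="4"/><supplied>fer</supplied> lumen cecis mala nostra pelle
</p></div>
</body>
</text>"""

root = ET.fromstring(FRAGMENT)

# Text supplied by the editor where the parchment is damaged.
supplied = [el.text for el in root.iter("supplied")]

# Manuscript line numbers, taken from the <lb/> milestones.
line_numbers = [lb.get("n") for lb in root.iter("lb")]

# Reading text of the paragraph with markup stripped and whitespace collapsed.
paragraph = root.find(".//p")
reading_text = " ".join("".join(paragraph.itertext()).split())

print(supplied)
print(line_numbers)
print(reading_text)
```

Because the markup records structure rather than layout, the same file can feed a diplomatic view (with line breaks and supplied text marked) or a plain reading text.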

4
What should a document format be like?
  • Open and well documented.
  • Application independent.
  • Interchangeable.
  • Readable by both computers and humans.
  • Encoding structure, not layout.
  • Make the encoding of the document explicit.

5
Text encoding languages
  • Encoding grammars: SGML, MECS, XML
  • Encoding semantics: HTML, HNML, MECS-WIT, TEI
    Guidelines

SGML (1986)
XML (1998)
6
Extensible Markup Language (XML)
  • Design goals:
  • XML shall be straightforwardly usable over the
    Internet.
  • XML shall support a wide variety of applications.
  • XML shall be compatible with SGML.
  • It shall be easy to write programs which process
    XML documents.
  • The number of optional features in XML is to be
    kept to the absolute minimum, ideally zero.
  • XML documents should be human-legible and
    reasonably clear.
  • The XML design should be prepared quickly.
  • The design of XML shall be formal and concise.
  • XML documents shall be easy to create.
  • Terseness in XML markup is of minimal importance.
  • Source: Extensible Markup Language (XML) 1.0
    (Third Edition), W3C Recommendation 04 February
    2004.

7
DTD and Schema
  • A template for the structure of a text. Two main
    types:
  • DTD (Document Type Definition)
  • Schema languages: RELAX NG, W3C XML Schema,
    Schematron

8
Well formed and Valid
  • A document has to be well formed to be called an
    XML document:
  • It has only one root element.
  • It has no unclosed tags.
  • All tags nest properly.
  • ...
  • A document which follows the rules of a schema is
    said to be valid.
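
The well-formedness rules above can be checked mechanically. The following is a minimal sketch using Python's standard-library parser; the function name `is_well_formed` and the sample strings are illustrative assumptions. It tests well-formedness only: checking validity against a DTD or schema needs a validating parser, which the Python standard library does not include.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples, one per rule on the slide.
samples = {
    "one root, all tags closed and nested": "<text><body><p>aue</p></body></text>",
    "two root elements": "<p>aue</p><p>maris</p>",
    "unclosed tag": "<text><body><p>aue</body></text>",
    "tags do not nest": "<p><hi>aue <emph>maris</hi> stella</emph></p>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```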

9
TEI - Text Encoding Initiative
  • Design goals
  • Provide a standard format for data interchange
  • Provide guidance for encoding of texts in this
    format
  • Support the encoding of all kinds of features of
    all kinds of texts studied by researchers
  • Be application independent
  • (Source: TEI - Guidelines for Electronic Text
    Encoding and Interchange, 2002)

10
TEI - design decisions
  • The choice of SGML, XML, ISO 646, and Unicode
  • The provision of a large predefined tag set
  • A distinction between required, recommended, and
    optional encoding practices
  • Encodings for different views of text
  • Alternative encodings for the same text features
  • Mechanisms for user-defined extensions to the
    scheme
  • (Source: TEI - Guidelines for Electronic Text
    Encoding and Interchange, 2002)

11
TEI versions
  • P1 (1990)
  • P3 (1994, revised 1999)
  • P4 (2002)
  • P5 (2006, and still in development)
  • Do not be afraid of new versions of the TEI: old
    texts will still be valid according to the old
    version.
  • A new TEI version is like a new set of spelling
    rules: it applies only to texts that refer to
    it.

12
TEI schemata
  • There is no such thing as "the TEI schema" or
    "the TEI DTD".
  • Each project has to make its own TEI
    customisation or use a TEI customisation made by
    someone else.
  • TEI is made to be customised.
  • Correctly customised TEI is still TEI.
  • The tool ROMA helps you pick what you would like
    to include in your TEI schema; see
    http://tei.oucs.ox.ac.uk/Roma/

13
TEI on Speech and Corpora
  • TEI P5 Chapter 11: Transcriptions of Speech
  • TEI P5 Chapter 15: Simple Analytic Mechanisms
  • TEI P5 Chapter 18: Transcription of Primary
    Sources
  • TEI P5 Chapter 23: Language Corpora

14
Elements Unique to Spoken Texts in TEI
  • <u>: an utterance.
  • <pause>: a pause.
  • <vocal>: a vocalized semi-lexical phenomenon.
  • <kinesic>: any communicative phenomenon, for
    example a gesture, frown, etc.
  • <event>: any phenomenon or occurrence, for example
    incidental noises or other events affecting
    communication.
  • <writing>: a passage of written text revealed to
    participants in the course of a spoken text.
  • <shift>: marks the point at which some
    paralinguistic feature of a series of utterances
    by any one speaker changes.
  • (Source: TEI P5, chapter 11, Transcriptions of
    Speech)

15
Some general elements relevant to spoken texts
  • <seg> (arbitrary segment) contains any arbitrary
    phrase-level unit of text.
  • <w> (word) represents a grammatical (not
    necessarily orthographic) word.
  • <s> (s-unit) contains a sentence-like division of
    a text.
  • <c> (character)
  • Elements for transcriptions: <abbr>, <add>, <app>,
    <corr>, <del>, <damage>, <expand>, <gap>, <hi>,
    <rdg>, <sic>, <supplied>, <unclear>.

16
Sample: Coding of schwa
  • <div>
  • <u who="A">Heter hun Ronja Langangen?</u>
  • <u who="B">Nei hun heter Sonja Langang<c rend="vowel" ana="phon.schwa">e</c>n</u>
  • </div>
  • <div>
  • <u who="A">Heter hun Nelly Dalen?</u>
  • <u who="B">Nei hun heter Molly Dal<c rend="empty" ana="phon.schwa">e</c>n</u>
  • </div>
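
Once the schwa positions are tagged this way, they can be counted and the spoken form reconstructed. The sketch below makes two assumptions that are not stated on the slide: that `rend="empty"` means the schwa is not pronounced, and that the two dialogues sit under a hypothetical `<body>` root. The helper name `surface_form` is likewise made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The two dialogues from the slide, wrapped in a hypothetical <body> root
# (an assumption) so that the sample is a single well-formed document.
SAMPLE = """<body>
<div>
<u who="A">Heter hun Ronja Langangen?</u>
<u who="B">Nei hun heter Sonja Langang<c rend="vowel" ana="phon.schwa">e</c>n</u>
</div>
<div>
<u who="A">Heter hun Nelly Dalen?</u>
<u who="B">Nei hun heter Molly Dal<c rend="empty" ana="phon.schwa">e</c>n</u>
</div>
</body>"""

root = ET.fromstring(SAMPLE)

# Count how each tagged schwa is realised, using the rend attribute.
realisations = Counter(
    c.get("rend") for c in root.iter("c") if c.get("ana") == "phon.schwa"
)


def surface_form(utterance: ET.Element) -> str:
    """Return the utterance text, leaving out characters marked rend="empty"."""
    parts = [utterance.text or ""]
    for child in utterance:
        # Assumption: rend="empty" marks a schwa that is written but not spoken.
        if not (child.tag == "c" and child.get("rend") == "empty"):
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


spoken = [surface_form(u) for u in root.iter("u") if u.get("who") == "B"]

print(realisations)
print(spoken)
```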

17
Sample: ... more on schwa
  • <u who="B">Nei hun heter Molly Dal<c rend="empty" ana="phon.schwa">e</c>n</u>
  • <u who="B">Nei hun heter Molly Dal<del rend="empty" ana="phon.schwa">e</del>n</u>
  • <u who="B">Nei hun heter Molly <choice><reg>Dalen</reg><orig ana="phon.schwa">Daln</orig></choice></u>
  • <u who="B">Nei hun heter Molly Dal<choice><reg>e</reg><orig ana="phon.schwa"></orig></choice>n</u>
  • <u who="B">Nei hun heter Molly <reg ana="phon.schwa">Dalen</reg></u>
18
Sample: Pronunciation of orthographic /rd/
sequences in East Norwegian
  • Sverd [sværd]: sve<m rend="rd" ana="phon.rd">rd</m>
  • Bord [bur]: bo<m rend="r" ana="phon.rd">rd</m>
  • Verdi [væɖi]: ve<m rend="ɖ" ana="phon.rd">rd</m>i
  • or:
  • <!ENTITY dtail "&#x0256;"> <!-- LATIN SMALL LETTER D WITH TAIL -->
  • ...
  • ve<m rend="&dtail;" ana="phon.rd">rd</m>i</p>

19
Sample: Prosodic features
  • <!ENTITY lr "?"> <!-- low rise intonation -->
  • <!ENTITY rf "!"> <!-- rise fall intonation -->
  • ...
  • <u who="person2">hvilket da&lr;</u>
  • Prosodic features might be encoded using entities.
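
A short Python sketch of how entity-based encoding behaves when the file is parsed: the parser replaces each entity reference with its declared replacement text. The `<text>` root, the DOCTYPE wrapper, and the reading of the utterance as "hvilket da" followed by a reference to the `lr` entity are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# The entity declarations sit in an internal DTD subset. The <text> root and
# the DOCTYPE wrapper are assumptions, not taken from the slide.
DOCUMENT = """<!DOCTYPE text [
  <!ENTITY lr "?"> <!-- low rise intonation -->
  <!ENTITY rf "!"> <!-- rise fall intonation -->
]>
<text><u who="person2">hvilket da&lr;</u></text>"""

root = ET.fromstring(DOCUMENT)
utterance = root.find("u")

# The parser has replaced &lr; with the declared replacement text.
print(utterance.text)
```

The design choice here is that the symbol used for an intonation contour is declared once, so changing the declaration changes every occurrence in the text.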

20
Problems with text encoding
  • Texts are more complex than one might think.
  • Text encoding can be both a very philosophical
    and a very prosaic task.
  • It is time-consuming.

(Source: The Botulph Breviary fragments, MPF at
Bergen University Library)
21
... and more problems
  • Overlap
  • Discontinuous elements
  • Alternative element orderings
  • This is a problem to encode in XML.
  • <u>This <hi rend="blue italic">is a <emph
    rend="bold">problem</hi> to encode</emph> in
    XML</u>.
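
The overlap problem can be demonstrated with a parser. In the sketch below, the overlapping example from the slide is rejected as not well formed. The fragmented variant is only one possible workaround and is an assumption, not something proposed on the slides: it splits `<emph>` into two pieces so that every element nests.

```python
import xml.etree.ElementTree as ET

# The slide's example: <hi> closes while <emph> is still open.
OVERLAPPING = (
    '<u>This <hi rend="blue italic">is a <emph rend="bold">problem</hi>'
    " to encode</emph> in XML</u>"
)

# One possible workaround (an assumption, not from the slides): split <emph>
# into two pieces so that every element nests inside its parent.
FRAGMENTED = (
    '<u>This <hi rend="blue italic">is a <emph rend="bold">problem</emph></hi>'
    '<emph rend="bold"> to encode</emph> in XML</u>'
)

try:
    ET.fromstring(OVERLAPPING)
    overlapping_parses = True
except ET.ParseError as error:
    overlapping_parses = False
    print("not well formed:", error)

fragmented_text = "".join(ET.fromstring(FRAGMENTED).itertext())

print(overlapping_parses)
print(fragmented_text)
```

The fragmented version parses and preserves the text, but the two `<emph>` pieces are no longer recorded as a single unit, which is exactly the information the overlapping markup was trying to express.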

22
Questions to be solved
  • What should be encoded?
  • Separate tiers?
  • Should a standard encoding schema be used?
  • Should the text be interchangeable?
  • Is there a community that has to decide which
    schema is to be used?

23
Links and references
  • TEI P5 - Guidelines for Electronic Text Encoding
    and Interchange, edited by C.M. Sperberg-McQueen
    and Lou Burnard, 2005
    (http://www.tei-c.org/release/doc/tei-p5-doc/html/).
  • ROMA, http://tei.oucs.ox.ac.uk/Roma/
  • XML, http://www.w3.org/XML/
  • Markup Language for Complex Documents (MLCD),
    http://teksttek.aksis.uib.no/projects/mlcd