Shoebox Interlinear glossing - PowerPoint PPT Presentation

About This Presentation
Title:

Shoebox Interlinear glossing


Slides: 24
Provided by: simonmu

Transcript and Presenter's Notes

1
Shoebox Interlinear glossing
2
Basic principles
  • Shoebox parses by looking for possible analyses
    of a word form based on information stored in a
    lexicon or lexica
  • Longer substrings are preferred to shorter
    substrings
  • The parse process works from the outside to the
    inside: prefixes and suffixes are parsed before
    roots
  • Where more than one parse is possible, the user
    is prompted to resolve the ambiguity
  • The output is whatever information has been
    requested for each item from the lexicon, all
    formatted to preserve alignment
  • Where no parse is found, a string of meaningless
    symbols ( by default) is returned
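
The principles above can be illustrated with a small Python sketch. This is only a toy under stated assumptions, not Shoebox's actual algorithm: the lexicon contents, the function name parse and the failure mark "***" are all invented for illustration. It peels affixes from the outside in, tries longer affixes first, returns every analysis it finds (so that a user could be asked to choose), and falls back to a failure mark when nothing parses.

```python
# Toy lexicon: all entries are hypothetical, chosen only to show the idea.
PREFIXES = ["un", "re"]
SUFFIXES = ["s", "ing", "ings"]   # "ings" is an invented fused suffix
ROOTS = {"do", "build", "sing"}
FAILURE_MARK = "***"              # assumed placeholder for a failed parse


def parse(word):
    """Return every analysis of `word`, longer affixes tried first."""
    results = []

    def strip(rest, pre, suf):
        if rest in ROOTS:
            results.append(pre + [rest] + suf)
        # Outside-in: peel one prefix or one suffix, then recurse inwards.
        for p in sorted(PREFIXES, key=len, reverse=True):
            if rest.startswith(p) and len(rest) > len(p):
                strip(rest[len(p):], pre + [p + "-"], suf)
        for s in sorted(SUFFIXES, key=len, reverse=True):
            if rest.endswith(s) and len(rest) > len(s):
                strip(rest[:-len(s)], pre, ["-" + s] + suf)

    strip(word, [], [])

    # The same analysis can be reached by different peeling orders.
    unique = []
    for analysis in results:
        if analysis not in unique:
            unique.append(analysis)
    return unique or [[FAILURE_MARK]]
```

With this toy lexicon, parse("rebuildings") yields two analyses (the one using the longer suffix comes first), which is the situation in which the user would be prompted to choose.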

3
Setting up - 1
  • The fields to be filled by interlinear processes
    have to be created in the database type
    definition
  • By default, Shoebox gives you the following:
  • \t unparsed text
  • \m morphemic parse
  • \g morphemic gloss
  • \p part of speech
  • You can add other information if you want it
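
To make the idea of aligned interlinear lines concrete, here is a rough Python sketch that pads the columns of the four default fields so that they line up. It is a simplification: it aligns one column per word rather than per morpheme, and the function name align and the example data are assumptions made for illustration.

```python
def align(lines):
    """lines: list of (marker, items); returns one padded string per field."""
    n_cols = max(len(items) for _, items in lines)
    # Each column is as wide as its widest item on any line.
    widths = [
        max(len(items[i]) for _, items in lines if i < len(items))
        for i in range(n_cols)
    ]
    out = []
    for marker, items in lines:
        cells = [item.ljust(widths[i]) for i, item in enumerate(items)]
        out.append((marker + " " + " ".join(cells)).rstrip())
    return out


# Hypothetical example record using the default markers.
EXAMPLE = [
    ("\\t", ["dogs", "bark"]),
    ("\\m", ["dog -s", "bark"]),
    ("\\g", ["dog -PL", "bark"]),
    ("\\p", ["n -suff", "v"]),
]

for line in align(EXAMPLE):
    print(line)
```

Because the result is plain text padded with spaces, any later change to one line means the padding has to be recomputed for all of them.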

4
Setting up - 2
  • The interlinear processes are defined at the
    Interlinear tab of the database type dialogue
  • There are two types of interlinear process:
  • Parse: analyses the text word-by-word and parses
    it into morphemes
  • Look-up: takes information from the lexicon
    about each parsed morpheme
  • For each process, you must specify which file to
    look in and what fields to look in for the
    information which is needed

5
How many lexica?
  • Although using one lexicon might seem simplest,
    often it makes sense to use more than one
  • Where there are many loan words used, especially
    if they are mainly from one other language, it
    may be easier to keep the languages in separate
    lexica
  • You won't have to separate the loans before you
    turn the main lexicon into a dictionary (you
    could also do this with filtering)
  • Possibly, you will store less information about
    loan words (e.g. loans may be stored without
    morphological analysis)
  • Maybe keep proper names in a separate lexicon
  • Some people advocate using a different lexicon
    for interlinear analysis and for dictionaries; I
    have never found this to be necessary

6
Processes - look-up
  • Look-up processes are straightforward
  • Shoebox has found the required morpheme in the
    lexicon and has only to retrieve relevant
    information
  • Only one lexicon will be relevant
  • The input is whatever field the base form of
    lexical items is stored in
  • The output is whatever other field you want
    information from, e.g. gloss or word class
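
A look-up process amounts to a keyed retrieval, which the following Python sketch tries to show. The field names ge (gloss) and ps (part of speech), the toy lexicon and the "***" fallback are assumptions for illustration; only \lx appears in these slides.

```python
# Hypothetical lexicon records; "ge" and "ps" are assumed field names.
LEXICON = [
    {"lx": "dog", "ge": "dog", "ps": "n"},
    {"lx": "-s", "ge": "PL", "ps": "suff"},
]


def look_up(morphemes, lexicon, in_field="lx", out_field="ge", missing="***"):
    """Return the out_field value for each already-parsed morpheme."""
    index = {entry[in_field]: entry for entry in lexicon}
    return [
        index[m].get(out_field, missing) if m in index else missing
        for m in morphemes
    ]
```

For example, look_up(["dog", "-s"], LEXICON) returns the glosses, and passing out_field="ps" returns the word classes instead.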

7
Processes - parse
  • All the complications occur in parse processes!
  • You have to specify the location of any variant
    forms which should all resolve to a single
    morpheme (usually the \a field)
  • Typically, in the early stages of work on data,
    you will have many \a entries, often several for
    a single morpheme
  • As you know more about a language, these should
    reduce: your transcription will be more
    consistent, etc.
  • But it is always possible that variants will
    remain
  • You must specify each lexicon field that may
    contain forms of the morpheme
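
One way to picture what the parse process needs from these fields is an index from every listed form to the morpheme it resolves to. The sketch below is a hypothetical illustration in Python, not Shoebox's internal data structure; the lexicon entries and the function name build_form_index are invented.

```python
def build_form_index(lexicon, form_fields=("lx", "a")):
    """Map each form found in form_fields to the \\lx value(s) it belongs to."""
    index = {}
    for entry in lexicon:
        for field in form_fields:
            values = entry.get(field, [])
            if isinstance(values, str):
                values = [values]
            for form in values:
                index.setdefault(form, []).append(entry["lx"])
    return index


# Hypothetical entries: one suffix with two variant spellings, one root.
LEXICON = [
    {"lx": "-s", "a": ["-es", "-z"]},
    {"lx": "dog"},
]
INDEX = build_form_index(LEXICON)
```

If a field that contains forms is left out of form_fields, the variants stored there never reach the index, which is why every such field has to be named in the parse settings.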

8
Listing affixes
  • Lexical entries for affixes use hyphens to
    specify the attachment of the affix
  • -xxx means that the morpheme is a suffix
  • xxx- means that the morpheme is a prefix
  • -xxx- means that the morpheme is an infix
  • Shoebox does not parse circumfixes or internal
    modification (but we'll look at some possible
    work-arounds later)
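
The hyphen convention can be expressed as a small classification function. This Python sketch simply restates the three rules above; the function name affix_type and the "root" label for entries without hyphens are assumptions.

```python
def affix_type(lx):
    """Classify a lexical entry by the position of its hyphens."""
    starts, ends = lx.startswith("-"), lx.endswith("-")
    if starts and ends and len(lx) > 2:
        return "infix"
    if starts:
        return "suffix"
    if ends:
        return "prefix"
    return "root"
```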

9
Reduplication
  • Shoebox can parse reduplication
  • All morphemes analysed as reduplication are
    entered in the lexicon with the letters dup in
    their \lx field, e.g. dupCV- could be the entry
    for a reduplicating CV- prefix
  • Use \a fields to specify the actual forms of the
    morpheme
  • Use variables (cons, vowel) to generalise
  • The prefix above would be listed as
    \lx dupCV- \a consvowel-
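
The following Python sketch shows one way a variable-based reduplication pattern could be checked against a word. It is not Shoebox's implementation: the character classes assigned to cons and vowel, the function names and the test word are all assumptions, and the pattern is passed as a tuple of variable names rather than in Shoebox's own notation.

```python
import re

# Hypothetical variable definitions; a real project would define its own.
VARIABLES = {"cons": "ptkbdgmnslrwy", "vowel": "aeiou"}


def pattern_to_regex(pattern):
    """Turn variable names, e.g. ("cons", "vowel"), into a compiled regex."""
    return re.compile(
        "".join("[" + re.escape(VARIABLES[name]) + "]" for name in pattern)
    )


def split_reduplication(word, pattern=("cons", "vowel"), label="dupCV-"):
    """Return (label, base) if the word starts with a copy of its base's onset."""
    match = pattern_to_regex(pattern).match(word)
    if not match:
        return None
    copy, base = match.group(0), word[match.end():]
    # It only counts as reduplication if the base begins with the same string.
    if base.startswith(copy):
        return (label, base)
    return None
```

For instance, split_reduplication("tatakbo") returns ("dupCV-", "takbo"), while "takbo" itself is not analysed as reduplicated.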

10
Segmenting text
  • The help files recommend that each text is
    treated as a single record in a text database
  • This has advantages for searching and filtering
    your texts: you can access all relevant examples
    in one file
  • The text can be internally segmented for ease of
    handling and display
  • For this, it is important to differentiate
    between the \id field and the \ref field
  • \id is the record marker
  • \ref is the identifier for individual segments
    within the record

11
Segmenting and hierarchy
  • To follow this approach, we have to use the
    marker hierarchy
  • \ref is under \id in the hierarchy
  • The fields associated with a segment of text
    (\t, \m, \g, free translation, notes, audio
    references etc.) are all under \ref in the
    hierarchy
  • Shoebox will segment and number a text
    automatically, dividing at punctuation as
    specified
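
As a rough picture of what automatic segmenting and numbering produces, the Python sketch below splits a text at punctuation and emits one \id line followed by a \ref and \t pair per segment. The reference format (text id plus a three-digit counter), the default break characters and the function name segment_text are assumptions, not Shoebox's actual settings.

```python
import re


def segment_text(text_id, text, breaks=".!?"):
    """Split text after each break character and number the segments."""
    chars = re.escape(breaks)
    pieces = re.findall("[^" + chars + "]+[" + chars + "]?", text)
    lines = ["\\id " + text_id]          # one record per text
    count = 0
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        count += 1
        # Each segment gets its own \ref, nested under the record's \id.
        lines.append("\\ref " + text_id + "." + format(count, "03d"))
        lines.append("\\t " + piece)
    return lines
```

Calling segment_text("story", "Dogs bark. Cats sleep!") gives one \id line and two numbered segments, each ready to receive its own \m and \g lines.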

12
Ambiguity
  • Where more than one parse is possible, Shoebox
    will produce a dialogue asking you to choose
  • The ambiguity can come about because of
    homophonous morphemes
  • Or the ambiguity can come about because Shoebox
    can segment the string in more than one way
  • It is not uncommon for several choices to be
    necessary to get a single word form to parse

13
Reducing ambiguity
  • Many cases where Shoebox asks for resolution of
    an ambiguity are caused by homophonous (or
    homographic) morphemes, e.g.
  • -s in English must be listed twice in the
    lexicon, as a plural marker and as a 3rd person
    singular marker
  • But one attaches to nouns and one attaches to
    verbs
  • Shoebox can be given this information by using
    word formulas, and this will prevent both parses
    being offered every time -s is encountered
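
The effect of this kind of constraint can be sketched in Python as a filter over candidate parses. This does not use Shoebox's word formula syntax; the table ATTACHES_TO, the glosses PL and 3SG and the function name filter_parses are stand-ins invented for illustration.

```python
# Two candidate parses of "dogs": each item is (morpheme, class or gloss).
CANDIDATES = [
    [("dog", "n"), ("-s", "PL")],
    [("dog", "n"), ("-s", "3SG")],
]

# Hypothetical constraint table: which word class each affix attaches to.
ATTACHES_TO = {"PL": "n", "3SG": "v"}


def filter_parses(candidates, attaches_to):
    """Keep only parses whose affixes are compatible with the root's class."""
    kept = []
    for parse in candidates:
        root_class = parse[0][1]
        if all(attaches_to.get(gloss, root_class) == root_class
               for _, gloss in parse[1:]):
            kept.append(parse)
    return kept
```

With the constraint in place only the plural reading of "dogs" survives, so no choice would need to be offered.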

14
Circumfixes
  • As mentioned previously, Shoebox has no specific
    method for handling circumfixes
  • If the two parts of a circumfix occur separately
    with different meanings, then a word formula can
    be used to streamline parsing
  • For example, a formula might read
    Symbol: Word, Pattern: (NOM1)adj(NOM2)

15
Variants
  • It is quite possible to list variants even where
    regular morphophonemic processes can be seen
  • But such processes can also be generalised and
    specified in the lexicon entry for an affix
  • This is done using an underlying form field (\u)
    which is always paired with an alternate form
    field (\a)

16
Morphophonemics example
  • The \a field shows the string which will be found
    in the text line ignoring morpheme boundaries
  • The \u field shows the underlying form to be
    returned and specifies the morpheme boundary
  • For example, the English orthographic rule which
    gives the plural of nouns ending in y would be
    specified as
  • \lx -s \a -ies \u ys or y -s
  • But I have never got this to work!

17
Internal modification
  • The work-around for e.g. vowel changes as a
    morphemic process is to use the \u field to force
    a parse
  • For example, the English strong verb past tense
    could be expressed as follows: \lx sang \u sing
    ed
  • Note that no generalisation of this is possible;
    each form has to be entered
  • The same approach works for suppletion and
    irregular forms, e.g. \lx went \u go ed
  • Again, this is not a feature which I have ever
    used with success
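
The intended mapping from a surface form to its underlying form can be pictured with a short Python sketch. It only illustrates the idea of the \u field as described above and says nothing about how, or how reliably, Shoebox applies it; the entries and the function name underlying are assumptions.

```python
# Entries copied from the examples above, stored as simple dicts.
ENTRIES = [
    {"lx": "sang", "u": "sing ed"},
    {"lx": "went", "u": "go ed"},
]


def underlying(word, entries):
    """Return the morphemes listed in \\u for this form, else the word itself."""
    for entry in entries:
        if entry["lx"] == word and "u" in entry:
            return entry["u"].split()
    return [word]
```

Every irregular form needs its own entry, because nothing in the table generalises from "sang" to, say, "rang".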

18
When parsing fails
  • When Shoebox cannot parse a string, it returns a
    series of characters (or you can set a
    different character)
  • This means that there is not enough information
    in the lexicon(s) for the string to be parsed
  • You need to enter more information into the
    lexicon
  • Setting the Jump Path is now useful!

19
Jump path
  • If a Jump Path is set, you can select a string in
    the text file, press Ctrl+J and jump to the
    lexicon, either:
  • To a new entry if the selected string doesn't
    match any existing entry
  • Or to an existing entry
  • Jump Path is set in the database properties
    dialogue
  • If you are using more than one lexicon, they can
    all be entered in the Jump Path. You can specify
    the order in which they will be accessed, and you
    will get to choose which one you want

20
Other failures
  • You may be sure that all relevant information is
    available in the lexicon(s), but the parse still
    fails
  • Playing with the settings for Prefer
    Prefixes/Suffixes may help
  • If all else fails, you can mark the morpheme
    breaks in the text line with hyphens, run the
    parse and then delete the hyphens afterwards

21
Verification
  • Shoebox has a feature for checking interlinear text
  • This is useful because Shoebox stores interlinear
    lines as formatted lines of text
  • These lines do not change if you make a change to
    the lexicon; the parse has to be run again
  • Verification runs through the whole file and
    stops where changes have affected the result, or
    where parsing failed previously
  • This is a lot faster than going through the whole
    file again yourself!
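
The logic of verification can be sketched as a comparison between the stored parse line and a freshly computed one. The Python below is a toy under stated assumptions: the word-level lexicon, the "***" failure mark and the function names are invented, and a real parse is far more involved than a dictionary look-up.

```python
FAILURE_MARK = "***"  # assumed placeholder for a failed parse


def toy_parse(text, lexicon):
    """Stand-in parser: look each word up, or emit the failure mark."""
    return " ".join(lexicon.get(word, FAILURE_MARK) for word in text.split())


def verify(records, lexicon):
    """Return indexes of records whose stored \\m line needs attention."""
    flagged = []
    for i, record in enumerate(records):
        fresh = toy_parse(record["t"], lexicon)
        # Stop where the result has changed or where parsing failed before.
        if fresh != record["m"] or FAILURE_MARK in record["m"]:
            flagged.append(i)
    return flagged


RECORDS = [
    {"t": "dogs bark", "m": "dog -s bark"},  # still up to date
    {"t": "cats", "m": "***"},               # failed previously
    {"t": "dogs", "m": "dog -z"},            # lexicon changed since
]
LEXICON = {"dogs": "dog -s", "bark": "bark"}
```

Here verify(RECORDS, LEXICON) flags the second and third records and skips the first.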

22
Export possibilities
  • Shoebox will export records or files in various
    formats
  • Preset options are Rich Text Format, Standard
    Format (text file) and XML
  • There is a trap with the RTF export: it is
    formatted according to the view you see on screen,
    so if you have set a wide screen (using Reshape),
    you will get messy wrapping

23
XML export
  • The XML export wraps each field in tags but does
    not preserve formatting of interlinears
  • People are working on better XML tools to use
    with Shoebox
  • Peter Newman has produced a tool which uses the
    Bow, Hughes, Bird model of interlinear text as
    output other
  • Peter is finalising a new version (and waiting
    for a baby to arrive), and I can't demonstrate
    this tool today
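
To show what wrapping each field in tags means, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The element names (record, t, m) are hypothetical and are not claimed to match Shoebox's actual XML output.

```python
import xml.etree.ElementTree as ET


def record_to_xml(fields, root_tag="record"):
    """Wrap each (marker, value) pair in an element named after the marker."""
    root = ET.Element(root_tag)
    for marker, value in fields:
        ET.SubElement(root, marker).text = value
    return ET.tostring(root, encoding="unicode")


print(record_to_xml([("t", "dogs    bark"), ("m", "dog -s  bark")]))
```

Each line ends up as one flat string inside its element, so the relationship between a word and the morphemes beneath it is not encoded in the structure; a richer model of interlinear text would need nested elements.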