Shoebox Interlinear glossing - PowerPoint PPT Presentation


Provided by: simonmu


1
Shoebox Interlinear glossing
2
Basic principles

Shoebox parses by looking for possible analyses
of a word form based on information stored in a
lexicon or lexica
Large substrings are preferred to short
substrings
The parse process works from the outside to the
inside: prefixes and suffixes are parsed before
roots
Where more than one parse is possible, the user
is prompted to resolve the ambiguity
The output is whatever information has been
requested for each item from the lexicon, all
formatted to preserve alignment
Where no parse is found, a string of meaningless
symbols (a default placeholder) is returned
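
To make the outside-in, longest-match idea concrete, here is a minimal Python sketch. It is not Shoebox's actual algorithm: the tiny lexicon, the function names and the "***" placeholder are all assumptions made for illustration, and unlike Shoebox it never asks the user to resolve an ambiguity.

```python
# Illustrative sketch only, not Shoebox's real parser or data format.
PREFIXES = {"un", "re"}          # hypothetical lexicon entries un-, re-
SUFFIXES = {"s", "ed", "ing"}    # hypothetical lexicon entries -s, -ed, -ing
ROOTS = {"walk", "do", "lock"}   # hypothetical roots

FAILURE_MARK = "***"  # assumed placeholder returned when no parse is found


def longest_match(word, affixes, at_start):
    """Return the longest affix found at the chosen edge of word, or None."""
    hits = [
        a for a in affixes
        if (word.startswith(a) if at_start else word.endswith(a))
        and len(a) < len(word)
    ]
    return max(hits, key=len) if hits else None


def parse(word):
    """Strip affixes from the outside in, then check what is left as a root."""
    prefixes, suffixes = [], []
    rest = word
    while rest not in ROOTS:
        p = longest_match(rest, PREFIXES, at_start=True)
        s = longest_match(rest, SUFFIXES, at_start=False)
        if p:
            prefixes.append(p + "-")
            rest = rest[len(p):]
        elif s:
            suffixes.insert(0, "-" + s)
            rest = rest[:-len(s)]
        else:
            return FAILURE_MARK
    return " ".join(prefixes + [rest] + suffixes)


print(parse("unlocked"))
print(parse("xyz"))
```

The sketch is greedy and never backtracks, so it can fail on words that a real parser with several candidate analyses would handle.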

3
Setting up - 1

The fields to be filled by interlinear processes
have to be created in the database type
definition
By default, Shoebox gives you the following:
\t unparsed text
\m morphemic parse
\g morphemic gloss
\p part of speech
You can add other information if you want it
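
The \m, \g and \p lines are kept aligned item by item. The following Python sketch shows one simple way such column padding could be done. It is an assumption-laden illustration, not Shoebox's own formatting code: the align function and the sample glosses are hypothetical, and the real program also aligns the \t line word by word.

```python
def align(lines):
    """Pad each column so items on every line start at the same offset."""
    rows = [line.split() for line in lines]
    n_cols = max(len(r) for r in rows)
    # Width of a column is the length of its longest item on any line.
    widths = [
        max(len(r[i]) for r in rows if i < len(r))
        for i in range(n_cols)
    ]
    return [
        " ".join(item.ljust(widths[i]) for i, item in enumerate(r)).rstrip()
        for r in rows
    ]


# Hypothetical morpheme, gloss and part-of-speech lines for one word.
for line in align(["walk -s", "walk -3SG", "v -sfx"]):
    print(line)
```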

4
Setting up - 2

The interlinear processes are defined at the
Interlinear tab of the database type dialogue
There are two types of interlinear process:
Parse: analyses the text word-by-word and parses
it into morphemes
Look-up: takes information from the lexicon
about each parsed morpheme
For each process, you must specify which file to
look in and what fields to look in for the
information which is needed

5
How many lexica?

Although using one lexicon might seem simplest,
often it makes sense to use more than one
Where there are many loan words used, especially
if they are mainly from one other language, it
may be easier to keep the languages in separate
lexica
You won't have to separate the loans before you
turn the main lexicon into a dictionary (you
could also do this with filtering)
Possibly, you will store less information about
loan words (e.g. loans may be stored without
morphological analysis)
Maybe keep proper names in a separate lexicon
Some people advocate using a different lexicon
for interlinear analysis and for dictionaries; I
have never found this to be necessary

6
Processes - look-up

Look-up processes are straightforward
Shoebox has found the required morpheme in the
lexicon and has only to retrieve relevant
information
Only one lexicon will be relevant
The input is whatever field the base form of
lexical items is stored in
The output is whatever other field you want
information from, e.g. gloss or word class
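
A look-up process can be pictured as a plain dictionary retrieval. The Python sketch below is only an illustration under stated assumptions: the LEXICON contents, the look_up function and the "***" failure placeholder are hypothetical, and a real lexicon is a standard-format database rather than a Python dict.

```python
# Hypothetical lexicon: base form -> other fields (g = gloss, p = word class).
LEXICON = {
    "walk": {"g": "walk", "p": "v"},
    "-s": {"g": "3SG", "p": "sfx"},
}


def look_up(morpheme_line, field, failure_mark="***"):
    """Return the requested field for each morpheme on a parsed line."""
    return [
        LEXICON.get(morpheme, {}).get(field, failure_mark)
        for morpheme in morpheme_line.split()
    ]


print(look_up("walk -s", "g"))
print(look_up("walk -ed", "p"))
```

The input is the parsed morpheme line and the output is one value per morpheme, which is why a look-up needs only one input field and one output field.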

7
Processes - parse

All the complications occur in parse processes!
You have to specify the location of any variant
forms which should all resolve to a single
morpheme, usually the \a field
Typically, in the early stages of work on data,
you will have many \a entries, often several for
a single morpheme
As you learn more about a language, these should
reduce: your transcription will be more
consistent, etc.
But it is always possible that variants will
remain
You must specify each lexicon field that may
contain forms of the morpheme

8
Listing affixes

Lexical entries for affixes use hyphens to
specify the attachment of the affix
-xxx means that the morpheme is a suffix
xxx- means that the morpheme is a prefix
-xxx- means that the morpheme is an infix
Shoebox does not parse circumfixes or internal
modification (but we'll look at some possible
work-arounds later)
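
The hyphen convention is easy to express in code. This Python sketch classifies a lexical entry by where its hyphens sit. The function name affix_type is hypothetical, and treating an entry with no hyphens as a root is an assumption.

```python
def affix_type(lx):
    """Classify a lexicon entry by the position of its hyphens."""
    starts = lx.startswith("-")
    ends = lx.endswith("-")
    if starts and ends:
        return "infix"
    if starts:
        return "suffix"
    if ends:
        return "prefix"
    return "root"  # assumption: no hyphen means a free form


for entry in ["-xxx", "xxx-", "-xxx-", "xxx"]:
    print(entry, affix_type(entry))
```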

9
Reduplication

Shoebox can parse reduplication
All morphemes analysed as reduplication are
entered in the lexicon with the letters dup in
their \lx field, e.g. dupCV- could be the entry
for a reduplicating CV- prefix
Use \a fields to specify the actual forms of the
morpheme
Use variables (cons, vowel) to generalise
The prefix above would be listed:
\lx dupCV-
\a consvowel-
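
The idea of matching a copied consonant-vowel sequence can be sketched with a regular expression backreference. This Python example is an illustration only: the CONS and VOWEL character sets stand in for Shoebox variables, the function name is hypothetical, and the test word is just a convenient example of CV- reduplication.

```python
import re

# Assumed stand-ins for the cons and vowel variables.
CONS = "bcdfghjklmnpqrstvwxyz"
VOWEL = "aeiou"

# Group 1 is the CV prefix; group 2 is the stem, which must begin
# with the same CV sequence (the backreference \1).
CV_REDUP = re.compile(rf"^([{CONS}][{VOWEL}])(\1.*)$")


def parse_cv_reduplication(word):
    """Return 'dupCV- stem' if word starts with a copied CV, else None."""
    m = CV_REDUP.match(word)
    if m is None:
        return None
    return f"dupCV- {m.group(2)}"


print(parse_cv_reduplication("tatakbo"))
print(parse_cv_reduplication("takbo"))
```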

10
Segmenting text

The help files recommend that each text is
treated as a single record in a text database
This has advantages for searching and filtering
your texts: you can access all relevant examples
in one file
The text can be internally segmented for ease of
handling and display
For this, it is important to differentiate
between the \id field and the \ref field
\id is the record marker
\ref is the identifier for individual segments
within the record

11
Segmenting and hierarchy

To follow this approach, we have to use the
marker hierarchy
\ref is under \id in the hierarchy
All the fields associated with a segment of text
(\t, \m, \g, free translation, notes, audio
references etc.) are under \ref in the
hierarchy
Shoebox will segment and number a text
automatically, dividing at punctuation as
specified
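
The record-and-segment layout can be sketched as follows. This Python example is a rough illustration, not Shoebox's own numbering routine: the segment function, the choice of break characters and the "id.NNN" reference format are all assumptions.

```python
import re


def segment(text_id, text, breaks=".?!"):
    """Split text at the given punctuation and number each segment."""
    cls = re.escape(breaks)
    # A segment is a run of non-break characters plus any closing breaks.
    pieces = [p.strip() for p in re.findall(rf"[^{cls}]+[{cls}]*", text)]
    lines = [f"\\id {text_id}"]            # one record marker per text
    for n, piece in enumerate((p for p in pieces if p), start=1):
        lines.append(f"\\ref {text_id}.{n:03d}")  # one \ref per segment
        lines.append(f"\\t {piece}")              # fields sit under \ref
    return "\n".join(lines)


print(segment("story1", "He came. She left!"))
```

Each \ref line opens a new segment inside the single \id record, and the \t line (with any later \m, \g and translation lines) belongs to the \ref above it.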

12
Ambiguity

Where more than one parse is possible, Shoebox
will produce a dialogue asking you to choose
The ambiguity can come about because of
homophonous morphemes
Or the ambiguity can come about because Shoebox
can segment the string in more than one way
It is not uncommon for several choices to be
necessary to get a single word form to parse

13
Reducing ambiguity

Many cases where Shoebox asks for resolution of
an ambiguity are caused by homophonous (or
homographic) morphemes, e.g.
-s in English must be listed twice in the
lexicon, as a plural marker and as a 3rd person
singular marker
But one attaches to nouns and one attaches to
verbs
Shoebox can be given this information by using
word formulas, and this will prevent both parses
being offered every time -s is encountered
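
The effect of such a constraint can be shown with a small Python sketch. It does not use Shoebox's word formula syntax: the ROOT_CLASS and S_ENTRIES tables and the candidate_parses function are hypothetical stand-ins that simply filter candidate parses by the word class of the root.

```python
# Hypothetical word classes for two roots.
ROOT_CLASS = {"dog": "n", "sing": "v"}

# The two homophonous -s entries, each with the class it attaches to.
S_ENTRIES = [
    {"gloss": "PL", "attaches_to": "n"},
    {"gloss": "3SG", "attaches_to": "v"},
]


def candidate_parses(root, use_formulas):
    """List the glossed parses offered for root + -s."""
    entries = S_ENTRIES
    if use_formulas:
        # Keep only the -s whose host class matches the root's class.
        entries = [
            e for e in entries
            if e["attaches_to"] == ROOT_CLASS.get(root)
        ]
    return [f"{root} -{e['gloss']}" for e in entries]


print(candidate_parses("dog", use_formulas=False))
print(candidate_parses("dog", use_formulas=True))
```

Without the constraint both analyses are offered and the user must choose; with it only one remains, so no prompt is needed.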

14
Circumfixes

As mentioned previously, Shoebox has no specific
method for handling circumfixes
If the two parts of a circumfix occur separately
with different meaning, then a word formula can
be used to streamline parsing
For example, a formula might read:
Symbol: Word
Pattern: (NOM1) adj (NOM2)

15
Variants

It is quite possible to list variants even where
regular morphophonemic processes can be seen
But such processes can also be generalised and
specified in the lexicon entry for an affix
This is done using an underlying form field (\u)
which is always paired with an alternate form
field (\a)

16
Morphophonemics example

The \a field shows the string which will be found
in the text line ignoring morpheme boundaries
The \u field shows the underlying form to be
returned and specifies the morpheme boundary
For example, the English orthographic rule which
gives the plural of nouns ending in y would be
specified as
\lx -s
\a -ies
\u ys or y -s
But I have never got this to work!

17
Internal modification

The work-around for e.g. vowel changes as a
morphemic process is to use the \u field to force
a parse
For example, the English strong verb past tense
could be expressed as follows:
\lx sang
\u sing ed
Note that no generalisation of this is possible:
each form has to be entered
The same approach works for suppletion and
irregular forms, e.g.
\lx went
\u go ed
Again, this is not a feature which I have ever
used with success
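
The intent of the \a and \u pairing on this slide and the previous one can be sketched in Python. This is an illustration of the idea only, not a working Shoebox configuration: the IRREGULAR table and the underlying function are hypothetical, and the -ies rule uses the "y -s" reading of the \u field.

```python
# Surface form -> underlying form, one entry per irregular word,
# mirroring the sang and went examples.
IRREGULAR = {"sang": "sing ed", "went": "go ed"}


def underlying(word):
    """Return the underlying form that would be passed on for parsing."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Generalised orthographic rule: final -ies is read as y plus -s.
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y -s"
    return word


print(underlying("sang"))
print(underlying("ponies"))
```

The rule applies blindly to any word ending in -ies, so a real setup would still need the lexicon to confirm that the resulting root exists.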

18
When parsing fails

When Shoebox cannot parse a string, it returns a
series of placeholder characters (you can set a
different character)
This means that there is not enough information
in the lexicon(s) for the string to be parsed
You need to enter more information into the
lexicon
Setting the Jump Path is now useful!

19
Jump path

If a Jump Path is set, you can select a string in
the text file, press Ctrl+J and jump to the
lexicon, either:
To a new entry if the selected string doesn't
match any existing entry
Or to an existing entry
Jump Path is set in the database properties
dialogue
If you are using more than one lexicon, they can
all be entered in the Jump Path. You can specify
the order in which they will be accessed, and you
will get to choose which one you want

20
Other failures

You may be sure that all relevant information is
available in the lexicon(s), but the parse still
fails
Playing with the settings for Prefer
Prefixes/Suffixes may help
If all else fails, you can mark the morpheme
breaks in the text line with hyphens, run the
parse and then delete the hyphens afterwards

21
Verification

Shoebox has a feature for checking interlinear
This is useful because Shoebox stores interlinear
lines as formatted lines of text
These lines do not change if you make a change to
the lexicon: the parse has to be run again
Verification runs through the whole file and
stops where changes have affected the result, or
where parsing failed previously
This is a lot faster than going through the whole
file again yourself!
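
The logic of verification can be sketched as a comparison between stored lines and a fresh parse. This Python example is a simplified illustration, not Shoebox's implementation: the record layout, the make_parser and verify functions and the "***" failure placeholder are assumptions.

```python
FAILURE_MARK = "***"  # assumed placeholder for a failed parse


def make_parser(lexicon):
    """Build a toy word parser backed by a dict of word -> parse."""
    return lambda word: lexicon.get(word, FAILURE_MARK)


def verify(records, parse):
    """Return indexes of records whose stored parse is stale or failed."""
    flagged = []
    for i, rec in enumerate(records):
        fresh = " ".join(parse(w) for w in rec["t"].split())
        if fresh != rec["m"] or FAILURE_MARK in rec["m"]:
            flagged.append(i)
    return flagged


# Stored interlinear text: the m lines are plain strings and do not
# update themselves when the lexicon changes.
records = [
    {"t": "walks", "m": "walk -s"},
    {"t": "sang", "m": "***"},
]
print(verify(records, make_parser({"walks": "walk -s"})))
```

Only the flagged records need attention, which is what makes verification faster than rereading the whole file.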

22
Export possibilities

Shoebox will export records or files in various
formats
Preset options are Rich Text Format, Standard
Format (text file) and XML
There is a trap with the RTF export: it is
formatted according to the view you see on screen,
so if you have set a wide screen (using Reshape),
you will get messy wrapping

23
XML export

The XML export wraps each field in tags but does
not preserve formatting of interlinears
People are working on better XML tools to use
with Shoebox
Peter Newman has produced a tool which uses the
Bow, Hughes, Bird model of interlinear text as
output other
Peter is finalising a new version (and waiting
for a baby to arrive), and I can't demonstrate
this tool today
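
The basic field-wrapping behaviour mentioned at the top of this slide can be sketched with Python's standard library. This is not Shoebox's exporter or its actual XML schema: the record element name and the use of markers as tag names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def record_to_xml(fields):
    """Wrap each (marker, value) pair of a record in its own tag."""
    record = ET.Element("record")  # hypothetical wrapper element
    for marker, value in fields:
        ET.SubElement(record, marker).text = value
    return ET.tostring(record, encoding="unicode")


print(record_to_xml([("t", "walks"), ("m", "walk -s"), ("g", "walk -3SG")]))
```

Each field becomes a flat element, so nothing in the output records which gloss belongs to which morpheme. That is the alignment information a richer interlinear model has to add.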