Shoebox: Starting out and lexical management

Provided by: simonmu

1
Shoebox: Starting out and lexical management
2
Shoebox / Toolbox
  • What is it? Shoebox is a data management program
    for language data. It is not a text editor, but
    nor is it a database management system in the
    sense usually understood (it is not an
    implementation of a relational database system).
  • Where do I get it?
  • Shoebox: http://www.sil.org/computing/shoebox/index.html
    (cost: US$19.95)
  • Toolbox: http://www.sil.org/computing/toolbox/
    (freeware)

3
Shoebox and Toolbox
  • Which program / version should I use? If you use
    a Windows PC, then you should certainly use
    Toolbox: it has features which are not in
    Shoebox, such as better Unicode compliance, better
    XML export and (best of all) support for
    scrolling with a mouse.
  • If you work on a Mac, you don't have a choice:
    Shoebox works on Mac, but Toolbox doesn't (except
    under VirtualPC). And Shoebox runs under OS 9.
  • The application is not officially available for
    Linux (or UNIX). But Shoebox and Toolbox will
    actually run under Linux with an add-on (ask
    Baden).

4
Why Shoebox?
  • Advantages:
  • Good functionality
  • Choice of output possibilities
  • Portability: simple file formats
  • Drawbacks:
  • Data input is not always easy (but no other
    application is better!)
  • Manuals are not always easy to use
  • The way interlinear text is stored means you need
    to revise often
  • Everything is text; only weak data typing is
    possible

5
Installing
  • When you install Shoebox or Toolbox (versions
    later than 4), the installation creates a folder
    called My Shoebox Settings on your C drive.
  • By default, all Shoebox files will be stored
    here, but you can easily change the setting.
  • But you have to be careful when moving things!
  • Look at the sample files for help.

6
Basic concepts
  • Project: A project is the work unit in Shoebox. A
    project file (.prj) is a shell file which holds
    information about what files are included in the
    project and what their properties are.
  • Database type: When you set up a project, you have
    to define the type of the database files which
    you want to use in that project; that is, you
    must specify what fields are included in the file
    and what the properties of the fields are.
    (Shortly, we will work through setting up a
    lexicon database.)
  • Language encoding: A crucial property which you
    have to set for each field in a database is the
    language encoding which will be used. A language
    encoding includes information about:
  • List of characters
  • Case pairs
  • Sort order
  • Onscreen presentation
  • Variables

7
Language Encoding - 1
  • The Help files contain lots of useful
    information on how to set the properties of a
    Language Encoding, but nothing about how to
    choose the characters which you want to use!
  • You can do this in two ways:
  • Create a new encoding file, then use the Language
    Encoding dialogues to work through all the
    settings you need. (tricky)
  • Create a new encoding file, then open it in a
    text editor and manipulate it there. (easier)

8
Language Encoding - 2
  • Case pairs: you have to tell the program which
    pairs of symbols to treat as alphabetically
    equivalent, e.g. A and a, for sorting and for
    parsing.
  • Sort order: you have to tell the program what
    order you want the alphabet to be in for sorting,
    e.g. if you use glottal stop, where should it be
    in the alphabet?
  • Fonts: you can specify the on-screen
    characteristics of each language encoding which
    you use. This is useful to make the screen easier
    to read.
  • You can also specify screen characteristics for
    fields when you define a database type,
    overriding or modifying the language settings.
  • Neither of these options affects the presentation
    of data when you export to the Multi Dictionary
    Formatter (MDF): MDF uses its own font settings
    regardless.
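
To make the idea of case pairs and a custom sort order concrete, here is a minimal Python sketch. It is not how Shoebox implements sorting; the alphabet in ORDER (including the position of the glottal stop and the digraph "ng") and the sample words are invented for illustration.

```python
def tokenize(word, order):
    """Split a word into sorting units, preferring the longest unit (e.g. 'ng')."""
    units = sorted(order, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                out.append(unit)
                i += len(unit)
                break
        else:
            # Character not listed in the alphabet: keep it as its own unit.
            out.append(word[i])
            i += 1
    return out


def sort_key(word, order):
    """Rank each unit by its position in the alphabet; unknown units sort last."""
    rank = {unit: i for i, unit in enumerate(order)}
    # lower() stands in for case pairs: 'A' and 'a' are treated as equivalent.
    return [rank.get(unit, len(order)) for unit in tokenize(word.lower(), order)]


# Hypothetical alphabet: glottal stop after 'a', digraph 'ng' after 'n'.
ORDER = ["a", "ʔ", "b", "k", "n", "ng", "o", "u"]
words = ["nuka", "ngaba", "ʔaba", "Aba", "bana"]
result = sorted(words, key=lambda w: sort_key(w, ORDER))
print(result)
```

Because "ng" is listed as one unit after "n", every word beginning with "ng" sorts after every word beginning with plain "n", which an ordinary character-by-character sort would not give you.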

9
Language Encoding - 3
  • Variables: you have to specify which characters
    will be included in which sets of variables. The
    default groupings which are set in the program
    are:
  • Everything
  • Lower case
  • Upper case
  • Vowels
  • Consonants
  • Nasals
  • Punctuation
  • Digits
  • These variable definitions are used for wildcards
    in searches, and for specifying some
    morphological processes in parsing.
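
As a rough illustration of how character-class variables can act as wildcards in a search, the sketch below expands named sets into a regular expression. The bracket notation ([C], [V], [N]) and the contents of each set are assumptions made for this example, not Shoebox's own search syntax.

```python
import re

# Hypothetical variable sets for an invented language.
VARIABLES = {
    "V": "aeiou",      # vowels
    "C": "bdgkmnpst",  # consonants
    "N": "mn",         # nasals
}


def compile_pattern(pattern, variables):
    """Turn a pattern such as '[C]a[N]' into a compiled regular expression."""
    parts = []
    for piece in re.split(r"(\[\w+\])", pattern):
        name = piece[1:-1]
        if piece.startswith("[") and piece.endswith("]") and name in variables:
            # A variable: match any one character from its set.
            parts.append("[" + re.escape(variables[name]) + "]")
        else:
            # Literal text: match it exactly.
            parts.append(re.escape(piece))
    return re.compile("".join(parts))


pattern = compile_pattern("[C]a[N]", VARIABLES)
matches = [w for w in ["kan", "kat", "tam", "an"] if pattern.fullmatch(w)]
print(matches)
```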

10
Language encoding and data input
  • Inputting non-ASCII characters is a problem!
  • One solution is to use a keyboard mapping
    utility: Tavultesoft Keyman is recommended for
    Shoebox
  • One option available in a language encoding is to
    associate a keyboard mapping with that language
  • Keyboard definitions are available for Keyman,
    but if what you want doesn't exist, you have to
    make a definition yourself
  • An alternative, assuming you are using Unicode,
    is to use UniPad:
  • A Unicode text editor
  • Keyboards can be made by dragging and dropping
  • Keyboards are both hard (you type) and soft
    (you click on a display on screen)

11
Database types - 1
  • Relational database (e.g. Access):
  • One field (or a combination of fields) must have
    data and function as a unique identifier
  • Every field specified in the definition occurs in
    every record
  • Every field specified in the definition occurs
    only once in each record
  • Non-relational database (Shoebox):
  • One field specified in the definition must occur
    in every record as a unique identifier: the
    record marker
  • Other fields can occur many times in each record
12
Database types - markers 1
  • Shoebox database files are a special sort of text
    file: Standard Format Marker (SFM) files
  • A new record starts with the occurrence of a
    record marker field
  • Each field has the structure:
  • Marker: the \ character followed by an
    identifying string
  • Text content: whatever is stored in the field
  • Return: indicates the end of the field
  • NB: database definitions and language encodings
    are also SFM files
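
Because SFM files are plain text, they are easy to read outside Shoebox. The following Python sketch parses records along the lines described on this slide. The function name, the sample entries and the choice to treat a line without a marker as a continuation of the previous field are assumptions for illustration.

```python
def parse_sfm(text, record_marker="lx"):
    """Parse SFM text into a list of records; each record is a list of
    (marker, content) pairs in file order."""
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("\\"):
            # Marker = backslash + identifying string; the rest is content.
            marker, _, content = line[1:].partition(" ")
            if marker == record_marker:
                current = []
                records.append(current)
            # Lines before the first record marker (e.g. a file header)
            # are skipped because no record is open yet.
            if current is not None:
                current.append((marker, content.strip()))
        elif line.strip() and current:
            # Assumption: a non-marker line continues the previous field.
            marker, content = current[-1]
            current[-1] = (marker, (content + " " + line.strip()).strip())
    return records


# Invented sample data.
SAMPLE = r"""\lx kala
\ps n
\ge fish

\lx kalana
\ps v
\ge go fishing
"""

records = parse_sfm(SAMPLE)
for record in records:
    print(record)
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves repeated fields and their order, both of which matter in a non-relational file.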

13
Database types - markers 2
  • When you define a database type, you define a set
    of markers
  • First you must define the record marker
  • For a lexicon, the head word is a good choice, as
    this will provide the default sort order for the
    file
  • For each marker and its associated field, you can
    specify various properties.

14
Marker properties - 1
  • Marker: from the standard list for MDF, or a
    mnemonic
  • Name: should be unambiguous and relate to the
    marker
  • Hierarchy: more to follow on this
  • Following field: useful if one field will always
    occur with another one
  • Language encoding: ensures that the needed
    characters are available for that field
  • Description: important documentation for other
    users (or you in a few years!)
  • Font: you can accept the default font settings
    which go with the language encoding, or you can
    override them

15
Marker properties - 2
  • Although Shoebox doesn't permit any strong data
    typing, you can do a little bit to make things
    more secure:
  • You can specify that a field cannot be empty
    (other than the record marker, which must have
    data anyway)
  • You can specify that a field will not contain
    spaces
  • Range set: you can specify that a field will
    only contain one of a set of specified values,
    useful for e.g. part of speech, semantic domains

16
Database types - dates
  • Date stamping: you can include a date field
    (usually \dt) in your database and enable
    automatic date stamping
  • Date stamping happens on insertion of a record
    and then again whenever a record is edited; if
    you want to preserve the information about when
    you first entered a record, this has to be done
    manually
  • You have to create a date field before you can
    enable date stamping
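
The difference between the automatically refreshed date and a manually kept creation date can be sketched as follows. The creation marker "dc" is hypothetical (it is not taken from MDF), and the dd/Mon/yyyy date format is an assumption about how the stamp is written.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_stamp(d):
    """Format a date as dd/Mon/yyyy (assumed stamp format)."""
    return f"{d.day:02d}/{MONTHS[d.month - 1]}/{d.year}"


def stamp(fields, today, created_marker="dc", edited_marker="dt"):
    """Return a copy of a record with the edit date refreshed and the
    creation date added only if it is not there yet."""
    text = format_stamp(today)
    # Drop the old edit date; it is always replaced.
    out = [(m, t) for m, t in fields if m != edited_marker]
    if not any(m == created_marker for m, _ in out):
        out.append((created_marker, text))  # written once, never overwritten
    out.append((edited_marker, text))
    return out


first = stamp([("lx", "kala")], date(2005, 6, 6))
later = stamp(first, date(2005, 7, 1))
print(later)
```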

17
Hierarchies
  • Hierarchies are used to create structure within
    records
  • This feature is especially valuable in a lexicon
    file which has sub-entries with multiple part of
    speech and gloss information
  • Hierarchies are defined for each field in the
    Markers window of the Database Type dialogue
  • The predefined MDF_4.0 database type has a
    complex hierarchy included
  • A properly defined hierarchy ensures that all
    relevant information is retrieved in sorts and
    filters, i.e. glosses for all sub-entries rather
    than just the first gloss entry

18
Other database properties
  • There are plenty of other features which can be
    set for a database
  • Many of these are not so relevant to lexica:
    interlinear, jump path, etc.
  • We will return to some of these this afternoon

19
MDF fields
  • The full definition of the MDF_4 database type
    has 103 fields specified
  • It is unlikely that you will want to use all of
    these!
  • There are three possible approaches:
  • Use the preset and just don't bother about the
    fields you don't use
  • Eliminate fields from the preset until you have
    what you want
  • Create a new database definition from scratch
  • We'll work through option 1 here

20
Entries in the dictionary
  • The record marker for an MDF file is \lx, the
    lexeme
  • This can be a morpheme smaller than a word
  • Other forms can be included:
  • A citation form: \lc
  • A phonetic form: \ph
  • Alternative forms: to be listed in a dictionary,
    these are entered under \va; for interlinear use
    they are typically entered under \a, which is not
    a defined field in MDF_4

21
Sub-entries, sense numbers and homonyms
  • Homonyms should be used where forms are identical
    but there is no semantic relationship
  • Homonyms are identified only by a number in the
    field \hm
  • Sub-entries should be used where a word or phrase
    is derived from the root
  • Sub-entries are identified by numbers in the
    field \se
  • Where a form has multiple senses within the same
    part of speech, the senses are identified by a
    number in the field \sn

22
The hierarchy in entries
  • The hierarchical structure of entries set up by
    the various markers is:
      Head item
        homonym 1
          pos1
            sense1
            sense2
          pos2
            sense1
          subentry1
            pos1
              sense1
          subentry2
            pos1
              sense1
        homonym 2
          pos1
            sense1
            sense2
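
One way to picture what a hierarchy does is to rebuild the nesting from the flat list of fields in a record. The sketch below uses a deliberately simplified ranking of four markers (\lx over \se over \ps over \sn), which is an assumption for illustration; the real MDF_4 hierarchy covers many more markers. The sample entry is invented.

```python
# Assumed, simplified ranking; any other marker is treated as a leaf.
LEVEL = {"lx": 0, "se": 1, "ps": 2, "sn": 3}
LEAF = 4


def nest(fields):
    """Build a tree from one record's (marker, text) pairs."""
    root = {"marker": fields[0][0], "text": fields[0][1], "children": []}
    stack = [(LEVEL.get(root["marker"], LEAF), root)]
    for marker, text in fields[1:]:
        level = LEVEL.get(marker, LEAF)
        node = {"marker": marker, "text": text, "children": []}
        # Climb back up until the open node outranks the new field.
        while len(stack) > 1 and stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        if level < LEAF:
            stack.append((level, node))
    return root


def glosses(node, marker="ge"):
    """Collect every gloss in the tree, including those under sub-entries."""
    found = [node["text"]] if node["marker"] == marker else []
    for child in node["children"]:
        found.extend(glosses(child, marker))
    return found


record = [
    ("lx", "kala"), ("ps", "n"),
    ("sn", "1"), ("ge", "fish"),
    ("sn", "2"), ("ge", "meat"),
    ("se", "kalana"), ("ps", "v"), ("ge", "go fishing"),
]
tree = nest(record)
print(glosses(tree))
```

Walking the whole tree is what lets a sort or filter see the glosses of every sense and sub-entry rather than only the first one.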

23
Word classes
  • As we just saw, word classes are very important
    in the hierarchical structure
  • The field used for this information is \ps
  • A field is also available for word class names in
    a second language: \pn
  • The MDF format assumes that you will work with
    three (or four) languages:
  • A vernacular language (the object language)
  • A national language
  • An international language (probably English)
  • (A regional language can also be used)
  • The MDF_4 file recommends use of range sets for
    these word class fields; this is unrealistic at
    early stages, since you have to know a lot about
    a language before you are confident about listing
    word classes exhaustively

24
Glosses and definitions
  • Single word glosses for use in interlinears can
    be entered in English (\ge) and the national
    language (\gn) (\gr is also available)
  • More extended definitions can be entered in
    English (\de), national language (\dn) and the
    vernacular (\dv) (\dr is also available)
  • Encyclopedic information can be entered in all
    three (four) languages: \ee, \en, \ev, (\er)

25
Semantic information
  • MDF_4 offers a semantic domain field (\sd -
    English) and also a thesaurus field (\th -
    vernacular)
  • For both, use of a range set is recommended, but
    again this is unrealistic in the early stages of
    research
  • It is better to allow categories to be added
    freely until a good picture is obtained of the
    semantic domains needed, then move to restricting
    the possible entries

26
Examples
  • MDF_4 has five fields for including example
    phrases or sentences:
  • \rf: to provide a reference to the example
  • \xv: vernacular text (i.e. the actual example)
  • \xe: English text, a free translation
  • \xn: national language text, a free translation
  • \xr: regional language text
  • As Shoebox is a non-relational database, it is
    possible to use each of these fields several
    times in one record: you can include as many
    examples as you like for each entry
  • There is a hierarchy here:
  • \rf is under sense number, and allows you to give
    a reference for each example
  • \xv is under \rf and over the other \x.. fields,
    ensuring that the translations for each example
    stay together
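
The way repeated example fields stay grouped can be sketched in a few lines of Python. This is an illustration of the grouping idea only, with invented sample text and references; it is not Shoebox's own logic.

```python
def group_examples(fields):
    """Group \\rf, \\xv and translation fields of one record into examples."""
    examples = []
    current = None
    for marker, text in fields:
        if marker == "rf":
            # A reference opens a new example.
            current = {"rf": text}
            examples.append(current)
        elif marker == "xv":
            # A vernacular line without its own reference also opens one.
            if current is None or "xv" in current:
                current = {}
                examples.append(current)
            current["xv"] = text
        elif marker in ("xe", "xn", "xr") and current is not None:
            # Translations attach to the example that is currently open.
            current[marker] = text
    return examples


# Invented sample data.
record = [
    ("lx", "kala"), ("sn", "1"),
    ("rf", "Text 3.12"), ("xv", "kala nuka"), ("xe", "a big fish"),
    ("rf", "Text 5.02"), ("xv", "kala ika"), ("xe", "that fish"),
    ("xn", "(national language translation)"),
]
examples = group_examples(record)
for example in examples:
    print(example)
```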

27
Notes
  • The MDF_4 definition specifies many fields for
    notes
  • All the fields which are defined will be exported
    in the MDF process
  • So maybe more important than the distinctions
    allowed in the preset is a distinction between
    information which will appear in the dictionary,
    and information which is for your use only
  • I recommend creating a notes field which isn't
    part of the MDF presets!
  • \so: a field for the source of the data

28
Miscellaneous
  • \bw: borrowed word, for entering the source
    language
  • \cf: cross-reference, plus fields for glosses
    for the referenced item
  • \mr: morphology, for showing the internal
    structure of morphologically complex items (note
    that this may not be desirable for interlinear
    glossing!)
  • Various reversal fields: used in making finder
    lists, can be used if you don't want the given
    gloss to be the reversal of an entry
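
To show what a reversal field buys you, here is a small sketch that builds a finder list from English glosses (\ge) and lets an English reversal field (\re) override the gloss when one is present. Applying the override per record rather than per sense is a simplification, and the sample entries are invented.

```python
def finder_list(records):
    """Map each English key to the headwords it should point to."""
    index = {}
    for record in records:
        headword = record[0][1]
        reversals = [t for m, t in record if m == "re" and t]
        # Use the reversal fields if any exist, otherwise fall back to glosses.
        keys = reversals or [t for m, t in record if m == "ge" and t]
        for key in keys:
            index.setdefault(key, []).append(headword)
    return sorted(index.items())


# Invented sample data.
records = [
    [("lx", "kala"), ("ge", "fish")],
    [("lx", "nuka"), ("ge", "big"), ("re", "large")],
    [("lx", "ika"), ("ge", "fish")],
]
finder = finder_list(records)
for key, headwords in finder:
    print(key, "->", ", ".join(headwords))
```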

29
Housekeeping
  • Date stamping is very valuable: for example, it
    can be useful to be able to sort or filter
    entries by date
  • But as noted before, if you want to keep track of
    both the date of insertion of a record and the
    date of last edit, you will need two fields and
    you will have to manually enter the date in the
    first one
  • MDF_4 also allows a status field (\st) which is
    very useful for tracking whether an entry is
    complete and fully checked, whether it is in the
    last printed version of a dictionary etc.

30
Other stuff
  • Obviously we have only looked at a few of the
    fields which are defined in MDF_4
  • It is worth looking through the entire list to
    see what could be relevant to your needs
  • Reversal fields are certainly worth
    investigating
  • But it is also worth remembering that you can
    achieve a lot with a reasonably small number of
    fields

31
Range sets
  • Range sets, as previously mentioned, are used to
    limit the values which can appear in a field
  • Often, it is not possible to specify a set of
    values when you start work on a language
  • When you have some data, Shoebox can
    automatically create a set of values for you from
    what is already entered
  • You must remember to check the "Use a Range Set"
    box in the Marker properties section of the
    Database Type dialogue
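
The idea of deriving a range set from existing data can be sketched outside Shoebox as well. The function name and sample records below are invented; the point is that counting the values already entered shows both the candidate set and likely typos.

```python
from collections import Counter


def collect_values(records, marker):
    """Count every non-empty value entered under one marker."""
    return Counter(
        text
        for record in records
        for m, text in record
        if m == marker and text
    )


# Invented sample data; note the inconsistent "N".
records = [
    [("lx", "kala"), ("ps", "n")],
    [("lx", "nuka"), ("ps", "adj")],
    [("lx", "ika"), ("ps", "n")],
    [("lx", "tapa"), ("ps", "N")],
]
counts = collect_values(records, "ps")
for value, n in sorted(counts.items()):
    print(value, n)
```

Values that occur only once, such as "N" here, are worth checking by hand before the list is adopted as a range set.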

32
Consistency checks
  • Shoebox can perform some checking of data for you
    automatically:
  • If you choose Consistency Check from the Tools
    menu
  • If you specify that data should be checked in an
    export process
  • When you move to a new record, if you have Check
    Consistency When Editing enabled on the Tools
    menu
  • In any of these cases, the program will check:
  • That data matches any Data Property settings
  • That data matches any Range Set settings
  • That Jump Path destinations are valid links
  • It is valuable to constrain data as much as
    possible to reduce the possibility of
    entering invalid data
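
The three kinds of check listed above can be imitated in a short script, which is handy for data that has left Shoebox. This is a sketch under stated assumptions: the function and its parameters are invented, and treating \cf values as links that must match a headword is a stand-in for real Jump Path checking.

```python
def check_records(records, required=(), no_spaces=(), range_sets=None,
                  link_marker="cf"):
    """Return a list of (headword, marker, problem) tuples."""
    range_sets = range_sets or {}
    headwords = {record[0][1] for record in records}
    problems = []
    for record in records:
        head = record[0][1]
        filled = {m for m, t in record if t}
        # Data property: field must be present and non-empty.
        for marker in required:
            if marker not in filled:
                problems.append((head, marker, "missing or empty"))
        for marker, text in record:
            # Data property: field must not contain spaces.
            if marker in no_spaces and " " in text:
                problems.append((head, marker, "contains a space"))
            # Range set: value must be one of the allowed values.
            if marker in range_sets and text and text not in range_sets[marker]:
                problems.append((head, marker, "not in range set"))
            # Link: target must exist as a headword.
            if marker == link_marker and text not in headwords:
                problems.append((head, marker, "target not found"))
    return problems


# Invented sample data.
records = [
    [("lx", "kala"), ("ps", "n"), ("ge", "fish")],
    [("lx", "nuka"), ("ps", "adj"), ("ge", "very big"), ("cf", "kala")],
    [("lx", "ika"), ("ps", "noun"), ("ge", ""), ("cf", "tapa")],
]
problems = check_records(
    records,
    required=("ge",),
    no_spaces=("ge",),
    range_sets={"ps": {"n", "v", "adj"}},
)
for problem in problems:
    print(problem)
```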

33
Export processes
  • The most important export process when working
    with a lexicon is the Multi Dictionary Formatter
    (MDF)
  • This powerful facility creates fully formatted
    dictionaries and finder lists from your lexicon
    file
  • The results are Rich Text Format files (.rtf)
    which can be opened and manipulated in most word
    processing packages (such as Word)

34
MDF basics
  • You can choose to export your data to a bilingual
    dictionary or a trilingual dictionary
  • If bilingual, you can choose whether the second
    language is English or the relevant national
    language
  • If trilingual, English and the national language
    are used
  • (Regional language apparently vanishes at this
    point)

35
Other options
  • Data can be filtered (i.e. only entries which
    correspond to some criteria are included)
  • Fields can be excluded
  • Some formatting can be controlled
  • Header and footer material
  • Total number of entries is printed
  • Output file can be .rtf or web pages (HTML)

36
Other export possibilities
  • You can export all your data as a document in
    .rtf format, or a text format which Shoebox
    describes as standard
  • In these exports, you can export the records in
    the current window, or all records
  • You can define other export processes for
    yourself if you feel brave!

37
Lexique Pro
  • Lexique Pro is a freeware tool distributed by SIL
    via www.lexiquepro.com
  • It is intended to produce versions of lexica for
    distribution to people who are not Shoebox users
  • The program makes a version which is
    well-formatted for on-screen viewing
  • It also makes an executable file (.exe) to
    distribute the lexicon to other people; this
    will install a run-time version of Lexique Pro
    and the database extracted from your lexicon onto
    another person's computer
  • You can also export your lexicon as web pages