Title: Hypermedia Lexica and Lexicon Metadata
1 Hypermedia Lexica and Lexicon Metadata
The MetaLex model in the ModeLex project
Dafydd Gibbon, U Bielefeld
Europe E-MELD Workshop, Detroit, August 2002
2 Overview
- Metalex goals
  - Background: DATR, HyprLex, Speech, Language Documentation
- Metalex design: theory and practice
  - Lexical documents and metadocuments
  - Lexical objects, properties, structures
- Metalex implementation
  - Ivory Coast encyclopaedia project
  - Ega documentation model project
  - The Modelex (multimodal lexicon) project
  - Ivory Coast and Nigeria documentation curriculum project
- Extending metalex
  - Modalities and submodalities
  - Data-driven lexicography
  - Data structures and algorithms: trees, lattices; induction, inference
3 Metalex goals and background
- General objectives
  - Versatile high quality spoken language lexicography
  - Motivated balance of high-tech and low-tech
  - Good resources are data-driven and theory-informed
- Specific project objectives
  - DATR/ILEX: formal lexicon theory and implementation
  - VerbMobil: integrated HyprLex dissemination model
  - HyprLex: encyclopaedia model for Ivory Coast languages
  - Ega: endangered language documentation model
  - Modelex: theory and design of multimodal lexica
  - Ivory Coast and Nigeria: curricula for language documentation
6 Metalex design: data and theory
- Data-driven data and metadata acquisition
  - Systematic metatext derived from and supporting ...
  - Computational fieldwork
  - Induction of lexica
- Theory-informed data and metadata acquisition
  - Integrated Lexicon (ILEX) consisting of ...
  - Abstract Lexicon (ALEX): "theory" in the mathematical sense
  - Object Lexicon (OLEX): "model" in the mathematical sense
7 Metalex design: data
- Data-driven acquisition
  - Computational fieldwork
    - Portable metadatabase with restricted vocabulary and general metatext, and
    - Definition of and support for transcription and annotation
    - Portable support for scenarios, scripts
    - Portable support for lexicon processing
  - Induction of lexica
    - Lexicon tools for
      - Extraction of macrostructural elements (lexeme elements)
      - Induction of microstructural information (media concordance, POS, ...)
      - Induction of mesostructural regularities and subregularities (grammar, ...)
8 Metalex design: theory
- Theory-informed formalisation
  - Abstract Lexicon (ALEX): "theory" in the mathematical sense
    - Decomposition (componential A-V description)
    - Generalisation (inheritance)
    - Composition (multilinear operations)
  - Object Lexicon (OLEX): "model" in the mathematical sense
    - XML archiving and dissemination formats
    - object-relational database acquisition and processing formats
  - Integrated Lexicon (ILEX)
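The relation between an abstract lexicon with inheritance and a fully specified object lexicon can be illustrated with a minimal Python sketch. It is not the DATR/ILEX implementation: the node names, attributes and English toy data are hypothetical, and plain single-parent default inheritance stands in for the formalism used in the project.

```python
from typing import Dict, Optional

# Hypothetical abstract lexicon (ALEX): each node carries attribute-value
# pairs and an optional parent that supplies default values.
ALEX: Dict[str, dict] = {
    "Noun":  {"parent": None,   "av": {"pos": "noun", "plural_suffix": "s"}},
    "bird":  {"parent": "Noun", "av": {"stem": "bird"}},
    "sheep": {"parent": "Noun", "av": {"stem": "sheep", "plural_suffix": ""}},
}


def lookup(node: Optional[str], attr: str) -> Optional[str]:
    """Return the value of attr at node, falling back to ancestors."""
    while node is not None:
        entry = ALEX[node]
        if attr in entry["av"]:
            return entry["av"][attr]
        node = entry["parent"]
    return None


def interpret(node: str) -> Dict[str, Optional[str]]:
    """Expand one ALEX node into a fully specified OLEX record."""
    attrs = set()
    current: Optional[str] = node
    while current is not None:
        attrs.update(ALEX[current]["av"])
        current = ALEX[current]["parent"]
    return {attr: lookup(node, attr) for attr in sorted(attrs)}


# Object lexicon (OLEX): every attribute spelled out per entry.
OLEX = {name: interpret(name) for name in ("bird", "sheep")}
print(OLEX)
```

The generalisation lives once in the `Noun` node, while `sheep` overrides the default locally; the expanded records are the kind of flat data that could then be archived in XML or stored in a database.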
11 Metalex implementation: architecture
- Data model ∩ Theory: shared lexicon architecture
- Macrostructure: declarative and procedural components
  - Lexicon architecture: relational, inheritance, text, ...
  - Lexical objects: entry types
  - Lexical access: fact query, semasiological / onomasiological indexing
- Mesostructure
  - Generalisations: grammar, phonetics, cultural background, ...
  - Composition of lexicon object types: idioms, words, morphemes, ...
  - Lexical access: inferential query
- Microstructure
  - Lexical entry (article, lemma structure - atom, string, tree, ...)
  - Types of lexical information - standardly "lexicon model"
12 Metalex implementation: microstructure
- Microstructure specification philosophy
  - Anybody can specify any kind of unpredictable detail
  - Questionnaire / Experiment / Corpus / Archive dependence
  - Lexicon architecture: relational, inheritance, text, ...
  - Intelligent (semi-)automatic classification, not fixed attributes
- Theory-informed coarse grouping is possible
  - Media attributes: visual, auditory, tactile, ...
  - Meaning attributes: definition, gloss, lexical relations, ...
  - Composition attributes: context/category, parts, operations
  - Use attributes: style, register, concordance, media illustrations, ...
  - Micrometadata attributes: lexicographer DB indices, source (e.g. fieldwork metadata) DB indices, modification, ...
13 Metalex implementation: fieldwork metadata source (1)
- Situation dimensions
  - participant: fieldworker, partners, contacts
  - channel: modalities, media
  - locale: indoor/outdoor, spatial configuration
  - temporal: date, time, calendar event
  - functional: affiliation, role, occasion, observation (prompt, metadata management)
- Language dimension
  - affiliation
  - discourse level: discourse type, genre, prosody
  - phrase level: recursive phrasal categories/relations, prosody
  - word level: clitics, inflexion, word formation, prosody
14 Metalex implementation: fieldwork metadata source (2)
- Technical dimension
  - physical characteristics of participants: age, sex, health
  - physical characteristics of locale: indoor/outdoor, spatial configuration, temporal sequence, date (season), time (of day)
  - audio: mike type, position, room; A/D channels, fsample, resolution; formats
  - video: camera and microphone type, analogue/digital; filters, lenses; audio formats
  - other sensors: laryngograph, airflow, data glove, ...
- Metalinguistic dimension
  - empirical method: introspection, experiment, corpus elicitation
  - materials: questionnaire, experiment layout, corpus scenario
  - metadata specification: index, metatext type, metacatalogue type
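A fieldwork metadata record along these dimensions could be modelled as in the following sketch. The class and field names, the choice of fields and all sample values are assumptions made for illustration; the slides do not define a schema. The restricted vocabulary for the empirical method is taken from the list above.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Situation:
    participants: List[str]
    channel: str
    locale: str
    date: str


@dataclass
class Technical:
    microphone: str
    audio_channels: int
    sample_rate_hz: int


@dataclass
class FieldworkMetadata:
    index: str            # key that lexicon entries can point back to
    situation: Situation
    technical: Technical
    language: str
    method: str           # restricted vocabulary, see ALLOWED_METHODS
    metatext: str = ""    # free-text notes


ALLOWED_METHODS = {"introspection", "experiment", "corpus elicitation"}


def validate(record: FieldworkMetadata) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if record.method not in ALLOWED_METHODS:
        problems.append(f"unknown method: {record.method!r}")
    if record.technical.sample_rate_hz <= 0:
        problems.append("sample rate must be positive")
    return problems


# Invented sample values, for illustration only.
record = FieldworkMetadata(
    index="session-001",
    situation=Situation(["fieldworker", "partner"], "audio", "outdoor", "2002-03"),
    technical=Technical("headset", 2, 44100),
    language="Anyi",
    method="corpus elicitation",
)
print(validate(record))
print(asdict(record)["situation"]["locale"])
```

Keeping controlled fields separate from the free `metatext` field mirrors the split between restricted vocabulary and general metatext.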
15 Metalex implementation: fieldwork metadata entry tool
- LREC 2002, Workshop on Portability Issues
16 Metalex implementation: fieldwork metadata entry tool
HanDBase DBMS for PalmOS
17 Metalex objects, in conjunction with work in ISLE CLWG (Computational Lexicon Working Group)
- (see Gibbon in reading list)
- LEXICON
  - < Macrostructure, Mesostructure >
  - Macrostructure: Ordering( ENTRY, ... )
  - Mesostructure: < FrontmatterMetadata, Descriptions >
- ENTRY
  - < Microstructure, HousekeepingMetadata >
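Read as data types, the pairs above might look like the following sketch. The Python names, the dictionary-based microstructure and the alphabetical ordering key are hypothetical choices; the slide only fixes the overall shape (a lexicon pairs an ordering over entries with a mesostructure, and an entry pairs a microstructure with housekeeping metadata).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Entry:
    microstructure: Dict[str, str]        # lexical information proper
    housekeeping: Dict[str, str]          # lexicographer, source, date, ...


@dataclass
class Mesostructure:
    frontmatter: Dict[str, str]           # bibliographical data, medium, ...
    descriptions: List[str]               # grammar sketch, phonetics, ...


@dataclass
class Lexicon:
    entries: List[Entry]
    ordering: Callable[[Entry], str]      # sort key defining the macrostructure
    mesostructure: Mesostructure

    def macrostructure(self) -> List[Entry]:
        """Return the entries in the order imposed by the ordering key."""
        return sorted(self.entries, key=self.ordering)


lexicon = Lexicon(
    entries=[
        Entry({"headword": "word_b"}, {"lexicographer": "A"}),
        Entry({"headword": "word_a"}, {"lexicographer": "B"}),
    ],
    ordering=lambda e: e.microstructure["headword"],
    mesostructure=Mesostructure({"title": "Toy lexicon"}, ["grammar sketch"]),
)
print([e.microstructure["headword"] for e in lexicon.macrostructure()])
```

Swapping the `ordering` function is enough to obtain a different macrostructure over the same set of entries.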
18 The LEXICON object
- Front Matter Metadata
  - Bibliographical: creator, publisher, title, date, ...
  - Medium / format: paper, CD-ROM/DVD, web, ...
  - Macrostructure type
    - access: semasiological/onomasiological,
    - n-lingual/langue(s),
    - special: taxonomy (thesaurus), concordance
    - structure, e.g. tabular: f(type,attrib)=value
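The two access types can be sketched as two indices over the same entry table: form to entry (semasiological) and meaning to entry (onomasiological). The sketch below is a toy under stated assumptions: the lowercase headword used as the form key and the flat `gloss` field are hypothetical simplifications, with entry names borrowed from the Anyi example elsewhere in these slides.

```python
from collections import defaultdict
from typing import Dict, List

# Toy entry table; "form" and "gloss" are hypothetical field names.
entries = [
    {"id": "Anouman_1", "form": "anouman", "gloss": "bird"},
    {"id": "Anouman_2", "form": "anouman", "gloss": "grandchild"},
    {"id": "Anouman_3", "form": "anouman", "gloss": "yesterday"},
]

by_form: Dict[str, List[str]] = defaultdict(list)     # semasiological index
by_meaning: Dict[str, List[str]] = defaultdict(list)  # onomasiological index

for entry in entries:
    by_form[entry["form"]].append(entry["id"])
    by_meaning[entry["gloss"]].append(entry["id"])

print(by_form["anouman"])   # all entries sharing the form
print(by_meaning["bird"])   # entries expressing the meaning
```

A single form key returning several entries is exactly the homography/homophony case that the microstructure has to record.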
19 The ENTRY object: metadata
- Entry Metadata (see Gibbon & al. in reading list)
  - Entry type (wrt macrostructure specification)
    - encyclopaedic
    - multiword unit, word, ...
  - Microstructure data model specification
    - entry structure: flat, tree, graph (net), ...
    - data categories specification (attribute, field, information type)
      - DC groups - structural skeleton
      - DCs
      - DC substructure - homography, homophony, polysemy ...
20 The ENTRY object: DC groups
- Media ("surface")
  - acoustic (phonetic, earcon, sonification, ...), visual (orthography, icon, gesture, ...)
- Composition (structure)
  - part (e.g. morphology for words), context (e.g. POS, subcat for words)
- Meaning (definition, illustration)
  - semantic (components, relations, senses, ontology)
  - pragmatic (speech act, dialogue, disfluency, ...)
- Use: typically media (e.g. audio) concordance, ...
- Metadata: lexicographer, ...
21 The ENTRY object: DCs
- Countless Data Category models (see reading list)
  - every existing dictionary
  - linguistic "types of lexical information"
  - several European projects (GENELEX, MULTILEX, ACQUILEX, ...)
  - ISO terminology norms (cf. MARTIF etc. ...)
22 The ENTRY object: DC structures
- Computationally relevant properties of fields
  - type (atomic, complex: tree, string, xyz-formatted text)
  - character encoding spec.: ASCII, Unicode, xyz
  - tree (or other graph/net)
    - finite depth
    - flat, disjunctive disjunctive tree
    - recursive graph (net)
    - table, non-tree graph, anchor/link/index structure
  - generated text
    - print, hypertext (compiled vs. dynamic (generated on the fly))
23 Metalex microstructure: application
- Media ("surface")
  - phonemic and tonemic transcription (SAMPA ASCII - still waiting for Unicode...)
- Composition (structure)
  - morphemic substructure, category and subcategory
- Meaning (definition, illustration)
  - glosses (English, French, German)
  - definitions, senses, relations, components, audio-visual illustration
- Use: genres, examples (e.g. concordance link), free text notes
- Metadata: first record, last field
24 Metalex field lexicon microstructure
- Anouman_1
  - Media attributes
    - Phonemic tier: an'Um'a
    - Skeletal tier: VNVNV
    - Tonal tier: L H LH
    - Signal tier: Audio
  - Meaning attributes
    - F-gloss: Oiseau
    - E-gloss: Bird
    - G-gloss: Vogel
    - Definition: avis
    - Homophone full: Anouman_2 grandchild
    - Homophone phonemic: Anouman_3 yesterday
  - Use
    - < Concordance pointer >
    - Genre: narrative
  - Metadata
    - Lexicographer: S. Adouakou
    - Source: Bielefeld-Anyi-Corpus, Adaou village, CI
    - Date: March 2002
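As an illustration of an XML archiving format for such an entry, the sketch below serialises part of the Anouman_1 record with Python's standard `xml.etree.ElementTree`. The element and attribute names (`entry`, `media`, `phonemic`, ...) are hypothetical; the slides do not give the actual Metalex XML schema.

```python
import xml.etree.ElementTree as ET

# Part of the entry above, grouped by attribute type.
entry = {
    "id": "Anouman_1",
    "media": {"phonemic": "an'Um'a", "skeletal": "VNVNV", "tonal": "L H LH"},
    "meaning": {"f_gloss": "Oiseau", "e_gloss": "Bird", "g_gloss": "Vogel"},
    "metadata": {"lexicographer": "S. Adouakou", "date": "March 2002"},
}


def entry_to_xml(record: dict) -> ET.Element:
    """Build one <entry> element with a child element per attribute group."""
    root = ET.Element("entry", id=record["id"])
    for group, fields in record.items():
        if group == "id":
            continue
        group_el = ET.SubElement(root, group)
        for name, value in fields.items():
            ET.SubElement(group_el, name).text = value
    return root


root = entry_to_xml(entry)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the grouping is carried by element nesting, the same record can be read back with ordinary path queries such as `root.find("media/phonemic")`.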
25 Metalex portable lexical database
- Relational database
  - Metalex specs flattened
  - structure re-constitution via metalex specs
- HanDBase for PalmOS
- Features
  - standard full RelDBMS
  - XML, CSV, text export
  - export/import via GSM
  - inexpensive (wrt laptop)
  - stylus, keyboard, sync input
  - light weight
  - low power consumption
  - inconspicuous in use
  - interfaces to Scheme, C
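Flattening a structured entry into relational rows and reconstituting it afterwards can be sketched as follows. This is not HanDBase: an in-memory SQLite table stands in for the handheld database, and the path-string encoding, the table layout and the function names are assumptions. The sketch also assumes that attribute names contain no "/".

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str]


def flatten(entry_id: str, tree: dict, path: Tuple[str, ...] = ()) -> List[Row]:
    """Turn a nested entry into (entry, path, value) rows."""
    rows: List[Row] = []
    for key, value in tree.items():
        if isinstance(value, dict):
            rows.extend(flatten(entry_id, value, path + (key,)))
        else:
            rows.append((entry_id, "/".join(path + (key,)), value))
    return rows


def reconstitute(rows: Iterable[Row]) -> Dict[str, dict]:
    """Rebuild nested entries from flat rows using the stored paths."""
    entries: Dict[str, dict] = {}
    for entry_id, path, value in rows:
        node = entries.setdefault(entry_id, {})
        *parents, leaf = path.split("/")
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return entries


tree = {"media": {"tonal": "L H LH"}, "meaning": {"e_gloss": "Bird"}}
rows = flatten("Anouman_1", tree)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE field (entry TEXT, path TEXT, value TEXT)")
conn.executemany("INSERT INTO field VALUES (?, ?, ?)", rows)
stored = conn.execute("SELECT entry, path, value FROM field ORDER BY rowid")
restored = reconstitute(stored)
print(restored)
```

The path column plays the role of the specification here: it is the only place where the original structure survives in the flat table.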
26 Metalex extension: The Modelex project, "Theory and Design of Multimodal Lexica"
- Goals
  - Data-driven, theory-informed lexicon models
  - Formal properties of abstract data models for multimodal lexica
  - Interpretation of abstract data models in XML
  - Integration of parallel annotation lattices for modalities and submodalities
  - Development of a prototype multimodal lexicon
27 The Modelex domain: modalities and submodalities
28 Modelex: data driven lexicography
29 Modelex: gesture annotation
- Time Aligned Signal
- Corpus System
- (Java, GPL)
- Jan-Torsten Milde, U Bielefeld
- TASX annotator
- Phonological tier
- ToBI tiers
- Gesture tier
- Speech Act tier
- Anyi, Ega, German
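Time-aligned tiers of this kind can be related to each other by interval overlap, which is one simple way to link a gesture to the words it accompanies. The sketch below is not TASX code: the `Interval` type, the tier contents and the time stamps are invented for illustration.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share more than a boundary point."""
    return a.start < b.end and b.start < a.end


def align(tier_a: List[Interval], tier_b: List[Interval]) -> List[Tuple[str, str]]:
    """Pair every label on tier_a with the overlapping labels on tier_b."""
    return [(a.label, b.label) for a in tier_a for b in tier_b if overlaps(a, b)]


# Invented annotation data for two parallel tiers.
word_tier = [
    Interval(0.0, 0.4, "w1"),
    Interval(0.4, 0.9, "w2"),
    Interval(0.9, 1.2, "w3"),
]
gesture_tier = [Interval(0.3, 0.6, "point")]

pairs = align(gesture_tier, word_tier)
print(pairs)
```

The resulting pairs are the raw material for a multimodal concordance: each gesture label is listed with the stretch of speech it co-occurs with.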
30 Model-theoretic compilation in ILEX: INTERPRETATION ( ALEX ) = OLEX
31 Metalex in the Modelex project: Multimodal concordance as microstructure DC
- Prototype: http://www.spectrum.uni-bielefeld.de/langdoc/PAX/
32 Metalex in the Modelex project: underspecified ALEX microstructure for gesture coordinates
- Hand:
  - <parts> "Palm" "Digit"
  - <vector> "<name>" <coord "<name>">
  - <coord> "<x1>" "<y1>" "<x2>" "<y2>"
  - <>
  - .
- Palm:
  - <parts> <vector>
  - <name> palm
  - <width> pw
  - <height> ph
  - <x1 fore> <x1>
  - <x1 middle> ( <x1> + ( <x2> - <x1> ) / 3 )
  - <x1 ring> ( <x1> + ( <x2> - <x1> ) * 2 / 3 )
  - <x1 pinky> <x2>
  - <x1> px1
  - <y1> py1
  - <x2> ( <x1> + <width> )
  - <y2> ( <y1> + <height> )
33 Metalex in the Modelex project: fully specified ALEX microstructure for gesture coordinates
- Hand:<parts>
  - palm px1 py1 ( px1 + pw ) ( py1 + ph )
  - thumb px1 py1 ( px1 - lt ) py1
  - fore px1 py1 px1 ( py1 - lf )
  - middle ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) / 3 ) ( py1 - lm )
  - ring ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) py1 ( px1 + ( ( px1 + pw ) - px1 ) * 2 / 3 ) ( py1 - lr )
  - pinky ( px1 + pw ) py1 ( px1 + pw ) ( py1 - lp )
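The step from the underspecified to the fully specified description can be mimicked numerically. The following sketch is an interpretation of the slide, not the project's code: it assumes that the palm's second corner is the first corner plus width and height, that the finger bases divide the palm width into thirds, and that `lt`, `lf`, `lm`, `lr`, `lp` are the lengths of thumb, fore, middle, ring and pinky. The function name and the sample numbers are made up.

```python
from typing import Dict, Tuple

Vector = Tuple[float, float, float, float]  # x1, y1, x2, y2


def hand_vectors(px1: float, py1: float, pw: float, ph: float,
                 lt: float, lf: float, lm: float, lr: float,
                 lp: float) -> Dict[str, Vector]:
    """Compute palm and digit vectors from palm origin, size and digit lengths."""
    x2 = px1 + pw                          # right edge of the palm
    y2 = py1 + ph                          # second palm corner
    x_middle = px1 + (x2 - px1) / 3        # one third across the palm
    x_ring = px1 + (x2 - px1) * 2 / 3      # two thirds across the palm
    return {
        "palm": (px1, py1, x2, y2),
        "thumb": (px1, py1, px1 - lt, py1),          # extends sideways
        "fore": (px1, py1, px1, py1 - lf),
        "middle": (x_middle, py1, x_middle, py1 - lm),
        "ring": (x_ring, py1, x_ring, py1 - lr),
        "pinky": (x2, py1, x2, py1 - lp),
    }


# Made-up dimensions in arbitrary screen units.
vectors = hand_vectors(px1=30, py1=100, pw=60, ph=80,
                       lt=30, lf=50, lm=60, lr=50, lp=40)
for name, vector in vectors.items():
    print(name, vector)
```

Only nine parameters are supplied; every other coordinate follows from the shared palm geometry, which is the point of keeping the abstract description underspecified.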
34 Metalex: conclusion and prospects
- User complexity
  - demands an open, data-driven approach
- Domain
  - demands a theory-informed approach
  - with computational acquisition and inference
- Data-driven and theory-informed lexica
  - are possible (METALEX)
  - need integrated model-theoretic approach (ILEX)
    - INTERPRETATION (ALEX) = OLEX
  - a formal problem remains: differing complexity of
    - trees (archive): simulation of other graphs via semantics only
    - annotation lattices (data), tables (lexica)
    - regular relations if non-recursive, indexed grammars if recursive?