Title: Eliciting Features from Minor Languages
Overview
Feature Structure Design
Feature Specification
This research is part of the AVENUE Machine
Translation Project. AVENUE is supported by the
US National Science Foundation, NSF grant number
IIS-0121-631.

In the field of Machine
Translation, fully aligned and tagged translation
corpora are considered to be one of the most
valuable resources for automatically training
translation systems. However, among minority
languages such resources are hard to find. It is
possible to overcome this obstacle by using
techniques inspired by field linguistics, that
is, by drawing on bilingual informants to
translate and align given sentences. Field
linguists have relied on questionnaires that have
remained relatively static over a number of
years. We want the flexibility to change the
questionnaire to reflect different semantic
domains, different
goals for machine translation systems, different
levels of detail, etc. We also want the
questionnaire to be available in multiple
languages. For example, we would want a version
of the questionnaire in Spanish for use by Latin
American minority language speakers. We also
want flexibility in lexical selection in order to
avoid cultural bias and to choose appropriate
lexical items for the major language. This paper
will look at methods for specifying the scope and
depth of an elicitation corpus as well as methods
for quick design and implementation of
elicitation corpora. The resulting corpora can also be
used as a test suite to explore existing machine
translation systems or to design far-reaching
corpora for studying low-resource languages.
<feature>
  <feature-name>np-my-number</feature-name>
  <value> <value-name>num-sg</value-name> </value>
  <value> <value-name>num-pl</value-name> </value>
  <value> <value-name>num-dual</value-name> </value>
  <note>Notes for analysis of data CS, 2.1.2.4.1 page 38, seem to
  imply that some combinations of numbers are more
  expected than others</note>
</feature>
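A feature specification entry like the np-my-number one above can be read with a few lines of code. The following is a minimal sketch, assuming the entry is well-formed XML with the tag names shown; the function name parse_feature is hypothetical and is not part of the AVENUE tools.

```python
import xml.etree.ElementTree as ET

# One feature entry, using the tag names from the specification above.
FEATURE_XML = """
<feature>
  <feature-name>np-my-number</feature-name>
  <value><value-name>num-sg</value-name></value>
  <value><value-name>num-pl</value-name></value>
  <value><value-name>num-dual</value-name></value>
  <note>Notes for analysis of data CS, 2.1.2.4.1 page 38, seem to
  imply that some combinations of numbers are more expected than others</note>
</feature>
"""

def parse_feature(xml_text):
    """Return (feature name, list of value names, note text or None)."""
    root = ET.fromstring(xml_text.strip())
    name = root.findtext("feature-name").strip()
    values = [v.findtext("value-name").strip() for v in root.findall("value")]
    note = root.findtext("note")
    return name, values, note.strip() if note else None

name, values, note = parse_feature(FEATURE_XML)
print(name, values)
```

Keeping the note alongside the values lets the analyst's comments travel with the feature when the corpus is designed.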
Our Goals
- Tools for semi-automated corpus design
- Test suite for MT
- Structured corpus for input to machine learning
- A user interface for producing high quality,
word-aligned parallel corpora (Elicitation Tool)
- Automated learning of morpho-syntax for
low-resource languages
A control language is used to define the size and
scope of the set of feature structures that will
be used by GenKit to generate the corpus.
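The control language itself is not shown here, so the sketch below stands in a plain Python dictionary for it. It only illustrates the idea of bounding the set of feature structures: the listed features are varied, everything else is held fixed. The function enumerate_structures is hypothetical, and the value person-first is an assumed value added for illustration.

```python
from itertools import product

def enumerate_structures(spec, fixed=None):
    """Yield one flat feature structure (dict) per combination of values."""
    fixed = fixed or {}
    names = sorted(spec)
    for combo in product(*(spec[n] for n in names)):
        fs = dict(zip(names, combo))
        fs.update(fixed)  # features held constant across the whole set
        yield fs

# Features to vary; this dictionary plays the role of a control specification.
spec = {
    "np-my-number": ["num-sg", "num-pl", "num-dual"],
    "np-my-person": ["person-first", "person-third"],  # person-first is assumed
}
structures = list(enumerate_structures(spec, fixed={"np-my-animacy": "anim-human"}))
print(len(structures))  # 3 number values x 2 person values = 6
```

The size of the set is the product of the value counts, which is why restricting the varied features keeps the corpus manageable.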
Feature Structures
((subj ((np-my-general-type pronoun-type)
        (np-my-person person-third)
        (np-my-number num-sg)
        (np-my-biological-gender bio-gender-male)
        (np-my-function fn-predicatee)
        (np-my-animacy anim-human)
        (np-my-info-function info-neutral)
        (np-d-my-distance-from-speaker distance-neutral)
        (np-pronoun-reflexivity reflexivity-n/a)
        (np-my-emphasis emph-no-emph)
        (np-my-semantic-class NEED_VALUES)
        (np-pronoun-exclusivity exclusivity-n/a)
        (np-pronoun-antecedent-function antecedent-n/a)))
 (predicate ((np-my-general-type common-noun-type)
             (np-my-person person-third)
             (np-my-function predicate)
             (np-my-animacy anim-human)
             (np-my-info-function info-neutral)
             (np-d-my-distance-from-speaker distance-neutral)
             (np-pronoun-reflexivity reflexivity-n/a)
             (np-my-emphasis emph-no-emph)
             (np-my-number num-sg)
             (np-my-semantic-class NEED_VALUES)
             (np-pronoun-exclusivity exclusivity-n/a)
             (np-pronoun-antecedent-function antecedent-n/a)))
 (c-my-copula-type role)
 (c-my-secondary-type secondary-copula)
 (c-my-polarity polarity-positive)
 (c-my-function fn-main-clause)
 (c-my-general-type declarative)
 (c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense past)
 (c-v-my-phase-aspect durative)
 (c-my-imperative-degree imp-degree-n/a)
 (c-my-ynq-type ynq-n/a)
 (c-my-actor's-sem-role actor-sem-role-neutral)
 (c-my-minor-type minor-n/a)
 (c-my-headedness-rc rc-head-n/a)
 (c-my-answer-type ans-n/a)
 (c-my-restrictivess-rc rc-restrictive-n/a)
 (c-my-focus-rc focus-n/a)
 (c-my-actor's-status actor-neutral)
 (c-my-gaps-function gap-n/a)
 (c-my-relative-tense relative-n/a))
The Elicitation Tool
The elicitation tool provides a simple interface
for bilingual informants with no linguistic
training and limited computer skills to translate
and word-align a corpus in some source language.
The output of the elicitation tool is a text file
containing triplets of eliciting sentence,
elicited sentence, and alignment. The
elicitation tool can produce bilingual glossaries
based on the aligned corpus. It also has a
simple "auto-align" option to add alignments for
unambiguous word pairs in the same file.
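The triplet output and the glossary option can be illustrated as follows. This is a sketch only: the actual file format of the elicitation tool is not given here, so the Triple class, the 0-based index pairs, and the function build_glossary are assumptions, and the alignments are illustrative.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass
class Triple:
    eliciting: str   # sentence shown to the informant
    elicited: str    # informant's translation
    alignment: list  # (eliciting index, elicited index) pairs, 0-based

def build_glossary(triples):
    """Map each eliciting word to the word it is most often aligned with."""
    counts = defaultdict(Counter)
    for t in triples:
        src = t.eliciting.lower().split()
        tgt = t.elicited.lower().split()
        for i, j in t.alignment:
            counts[src[i]][tgt[j]] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

triples = [
    Triple("I was a teacher", "watashi wa sensei deshita", [(0, 0), (1, 3), (3, 2)]),
    Triple("I am a teacher", "watashi wa sensei desu", [(0, 0), (1, 3), (3, 2)]),
]
glossary = build_glossary(triples)
print(glossary["teacher"])
```

Words left unaligned by the informant (here "a" and "wa") simply do not appear in the glossary.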
Feature structures are multi-level sets of
feature-value pairs that reflect the grammatical
structures intended for elicitation. When paired
with an English grammar and lexicon, the
feature structure shown above generates "He was a
teacher."
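Because a feature structure is a nested list of feature-value pairs, it can be loaded into a nested dictionary with a small reader. The sketch below assumes the parenthesized notation used in the example; tokenize, read, and to_dict are hypothetical helper names, and the input is a shortened version of the structure above.

```python
import re

def tokenize(text):
    """Split into parentheses and atoms (atoms may contain ' and /)."""
    return re.findall(r"[()]|[^\s()]+", text)

def read(tokens):
    """Read one expression from the front of the token list."""
    tok = tokens.pop(0)
    if tok != "(":
        return tok
    items = []
    while tokens[0] != ")":
        items.append(read(tokens))
    tokens.pop(0)  # drop the closing parenthesis
    return items

def to_dict(node):
    """Turn a list of (feature value) pairs into a nested dict."""
    out = {}
    for name, value in node:
        out[name] = to_dict(value) if isinstance(value, list) else value
    return out

FS = """((subj ((np-my-person person-third) (np-my-number num-sg)))
         (c-v-my-absolute-tense past))"""
fs = to_dict(read(tokenize(FS)))
print(fs["subj"]["np-my-number"])
```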
Feature Detection
[Figure: feature detection diagram. Panel labels: Sentence Selection, Translation/Alignment, Minimal Pair Linking, Mapping, Difference Detection.]

Example minimal pair from the figure:

I was a teacher    Watashi wa sensei deshita
((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
 (Obj ((person third) (animacy human) (identifiability -)))
 (tense past))

I am a teacher    Watashi wa sensei desu
((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
 (Obj ((person third) (animacy human) (identifiability -)))
 (tense present))

The two feature structures differ only in (tense past) versus (tense present).
Substitution mismatch: difference is found on ME
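The comparison behind the minimal pair above (I was a teacher / I am a teacher) can be sketched as two diffs: one over the feature structures and one over the elicited sentences. This is an illustration under stated assumptions, not the project's detection procedure: flatten, feature_diff, and token_diff are hypothetical names, and the token comparison assumes both translations have the same number of words.

```python
def flatten(fs, prefix=()):
    """Map each feature path (tuple of names) to its atomic value."""
    flat = {}
    for name, value in fs.items():
        path = prefix + (name,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def feature_diff(fs_a, fs_b):
    """Return the feature paths whose values differ between two structures."""
    a, b = flatten(fs_a), flatten(fs_b)
    paths = sorted(set(a) | set(b))
    return {p: (a.get(p), b.get(p)) for p in paths if a.get(p) != b.get(p)}

def token_diff(sent_a, sent_b):
    """Position-wise token differences (assumes equal-length sentences)."""
    ta, tb = sent_a.lower().split(), sent_b.lower().split()
    return [(x, y) for x, y in zip(ta, tb) if x != y]

past = {"Subj": {"person": "first", "num": "sg", "animacy": "human"},
        "Obj": {"person": "third", "animacy": "human", "identifiability": "-"},
        "tense": "past"}
present = dict(past, tense="present")

print(feature_diff(past, present))
print(token_diff("Watashi wa sensei deshita", "Watashi wa sensei desu"))
```

When exactly one feature differs and exactly one word differs, the changed word is a candidate expression of that feature in the elicited language.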