Data Collection and Language Technologies for Mapudungun

1
Data Collection and Language Technologies for
Mapudungun
  • Lori Levin, Rodolfo Vega,
  • Jaime Carbonell, Ralf Brown,
  • Alon Lavie
  • Language Technologies Institute
  • Carnegie Mellon University
  • Eliseo Cañulef
  • Instituto de Estudios Indígenas
  • Universidad de La Frontera
  • Carolina Huenchullán
  • Ministerio de Educación
  • Chile

Presented by Ariadna Font-Llitjos, Language
Technologies Institute, Carnegie Mellon University
2
Overview
  • Chile's programs in bilingual and multicultural
    education
  • The AVENUE project at Carnegie Mellon University
  • The Mapudungun corpus
  • Plans for Example-Based Machine Translation
  • Plans for Rule-Based Machine Translation

3
Bilingual and Intercultural Education in Chile
  • Eight ethnic groups: Mapuche, Aymara, Rapa Nui
    (Pascuense), Likay Antai, Quechua, Colla,
    Kawashkar (Alacalufe), Yamana (Yagan).
  • Make education culturally and linguistically
    relevant.
  • Languages of instruction are native language and
    second language (Spanish).
  • Community involvement in curriculum design.

4
AVENUE: Automatic Voice Enabled Natural language
Understanding Environment
  • Affordable machine translation for languages with
    scarce resources.
  • No large corpus in electronic form
  • Few or no native speakers trained in
    computational linguistics

5
AVENUE: Omnivorous MT
  • AVENUE can consume whatever resources are
    available
  • EBMT if a parallel corpus is available
  • Human-Engineered MT if a human computational
    linguist is available
  • Seeded Version Space Learning for automatic
    acquisition of transfer rules if no corpus or
    computational linguist is available
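
The resource-driven choice on this slide can be sketched as a small dispatch function. This is only an illustration of the decision logic as the bullets state it, not AVENUE's actual code; the `Resources` class, its field names, and `choose_engines` are hypothetical names introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Resources:
    """What is available for a language pair (hypothetical field names)."""
    parallel_corpus: bool = False
    computational_linguist: bool = False


def choose_engines(res: Resources) -> list[str]:
    """Return the MT approaches the slide associates with the given resources."""
    engines = []
    if res.parallel_corpus:
        engines.append("EBMT")
    if res.computational_linguist:
        engines.append("human-engineered MT")
    if not engines:
        # Neither a corpus nor a computational linguist is available:
        # fall back to learning transfer rules automatically.
        engines.append("seeded version space learning")
    return engines


if __name__ == "__main__":
    print(choose_engines(Resources(parallel_corpus=True)))
    print(choose_engines(Resources()))
```

Returning a list rather than a single engine reflects the "omnivorous" idea that several resources, when present, can be used together.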

6
Mapudungun
  • Language of the Mapuche
  • Over 900,000 Mapuche in Chile and Argentina
  • Words contain several morphemes including
    multiple open class items.
  • Still spoken by a majority of Mapuche
  • Still spoken as a first language
  • Competing orthographies
  • Some vocabulary loss
  • Some written literature, newsletters and textbooks

7
The Mapudungun Corpora
  • First step toward:
  • Corpus-based machine translation
  • Authentic corpus for instructional purposes
  • Written corpus
  • Spoken corpus

8
The Written Mapudungun Corpus
  • Existing texts were entered in electronic form
    and translated into Spanish
  • Memorias de Pascual Coña: the life story of a
    Mapuche leader, written by Ernesto Wilhelm de
    Moessbach.
  • Las Ultimas Familias by Tomás Guevara.
  • Nuestros Pueblos: newspaper published by
    Corporación Nacional de Desarrollo Indígena
    (CONADI).
  • Total of around 200,000 words

9
The Spoken Mapudungun Corpus
  • Recorded with Sony DAT recorder and digital
    stereo microphone.
  • Downloaded with CoolEdit
  • Transcribed with TransEdit
  • Alignment of audio and transcript for speech
    recognition

10
The Spoken Mapudungun Corpus
  • All sessions were scheduled and recorded by a
    native speaker interviewer
  • Subject matter: primary and preventive health
  • Limited domain for higher quality machine
    translation
  • People were asked to describe their experiences
    with an illness and how it was treated by modern
    or traditional medicine

11
The Spoken Mapudungun Corpus
  • Speakers
  • 21-75 years old, most 40-65
  • Fully native speakers
  • Some auxiliary nurses for rural areas in the
    Chilean public health system
  • Some machi
  • Did not reveal specialized knowledge

12
The Mapudungun Spoken Corpus
  • Dialects
  • Lafkenche, Nguluche, Pewenche
  • Williche will be recorded at a later stage of the
    project
  • Williche has more morpho-syntactic differences
    from the other dialects

13
The Mapudungun Spoken Corpus
  • Orthography
  • Pan-dialectal
  • 32 phones
  • Some are dialectal variants of each other
  • Supra-dialectal
  • 28 letters covering the 32 phones
  • Typable on Spanish keyboard with some diacritics
    such as apostrophes
  • Use Spanish letters for phonemes that sound like
    Spanish phonemes

14
Plans for Machine Translation
  • Example-Based MT
  • Seeded Version Space Learning for automated
    acquisition of transfer rules

15
Example-Based MT
  • Insert one of Ralf's slides

16
Automated Acquisition of Transfer Rules
  • Elicitation Tool
  • Seeded Version Space Learning
  • Run-time transfer system for MT

17
Chinese-English Transfer Rule for Yes-No Questions
    S::S : [NP VP MA] -> [AUX NP VP]
    (
      (x1::y2)                ; set alignments
      (x2::y3)
      ((x0 subj) = x1)        ; create Chinese f-structure
      ((x0 subj case) = nom)  ; Chinese has no case, so add it
      ((x0 act) = quest)      ; set speech act to question
      (x0 = x2)               ; create Chinese f-structure
      ((y1 form) = do)        ; set base form of AUX to "do";
                              ; proper form will be selected based on
                              ; subj-verb agreement
      ((y3 vform) =c inf)     ; verb must be infinitive
      ((y1 agr) = (y2 agr))   ; subject and "do" must agree
    )
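
As a rough illustration of what this rule does at run time, the sketch below reorders the constituents and picks the form of "do". It is a toy under stated assumptions, not the AVENUE transfer engine: the function name, the dictionary layout, the `"3sg"` agreement value, and the example words are all hypothetical, and the constituents are assumed to be translated into English already.

```python
def apply_yes_no_rule(np: dict, vp: dict) -> list[str]:
    """Map a Chinese [NP VP MA] question to English [AUX NP VP] (toy version).

    `np` and `vp` are hypothetical dicts holding already-translated English
    words plus the features the rule's constraints refer to.
    """
    # Constraint from the slide: the English verb must be an infinitive.
    if vp["vform"] != "inf":
        raise ValueError("verb must be infinitive")
    # The AUX has base form "do" and must agree with the subject.
    aux = "does" if np["agr"] == "3sg" else "do"
    # Alignments: the source NP becomes the second target constituent and the
    # source VP the third. The particle MA has no English counterpart here.
    return [aux.capitalize()] + np["words"] + vp["words"]


if __name__ == "__main__":
    subject = {"words": ["she"], "agr": "3sg"}
    predicate = {"words": ["like", "tea"], "vform": "inf"}
    print(" ".join(apply_yes_no_rule(subject, predicate)) + "?")
```

With the hypothetical inputs above, the sketch yields the word order "Does she like tea", with the auxiliary inserted in front and the question particle dropped.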

18
Example of Seed Rule and Generalization
  • Pair 1: the man / der Mann
  • Pair 2: the woman / die Frau

20
Elicitation Tool
21
Elicitation Process
  • Bilingual informant
  • Literate in the elicitation language and the
    elicited language
  • Translate sentences
  • Align words

22
Elicitation Corpus Excerpt
  • He has sold both of his cars. (English prompt)
  • Él ha vendido sus dos automóviles (Spanish prompt)
  • fey weluiñi epu awtu (Mapudungun provided by
    informant)
  • He can move both of his thumbs.
  • Él puede mover sus dos pulgares
  • fey pepi newüleliñi epu fütrarumechangüll
  • He loves both of his sisters.
  • Él ama a sus dos hermanas
  • fey poyeyñi epu deya
  • He loves both of his brothers.
  • Él ama a sus dos hermanos
  • fey poyeyñi epu peñi

23
Elicitation Corpus
  • Compositional
  • Small phrases are elicited first and then are
    combined into larger phrases
  • For learnability
  • Minimal Pairs
  • Sentences that differ in only one feature (e.g.,
    number of the subject)
  • For automatic feature detection
  • If the minimal pair differs only in the number of
    the subject, and the verbs are different in the
    two sentences, the language may have agreement in
    number between subjects and verbs.
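
The minimal-pair heuristic in the last bullet can be sketched as follows. This is a simplified illustration, not the project's feature-detection code: the function names are hypothetical, the verb position is passed in by hand (in practice it would come from the informant's word alignments), and the Spanish and English example sentences are supplied here only for demonstration.

```python
def differing_positions(a: list[str], b: list[str]) -> list[int]:
    """Indices at which two equal-length token lists differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length translations")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def suggests_number_agreement(sg: list[str], pl: list[str], verb_index: int) -> bool:
    """True if the verb form changes when only the subject's number changes."""
    return verb_index in differing_positions(sg, pl)


if __name__ == "__main__":
    # Prompts "the boy runs" / "the boys run", translated into Spanish.
    print(suggests_number_agreement(
        ["el", "niño", "corre"], ["los", "niños", "corren"], verb_index=2))
    # English past tense: the verb is identical in both sentences.
    print(suggests_number_agreement(
        ["the", "boy", "ran"], ["the", "boys", "ran"], verb_index=2))
```

A positive result is only a hint, which matches the slide's wording that the language "may have" agreement: a single pair cannot rule out other reasons for the verb to change.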

24
Elicitation Corpus: Current Coverage
  • 864 Sentences (pilot corpus)
  • Transitive and intransitive sentences
  • Animate and inanimate subjects and objects
  • Definite and indefinite subjects and objects
  • Present/ongoing and past/completed
  • Singular, plural, and dual nouns
  • Simple noun phrases with definiteness, modifiers
  • Possessive noun phrases

25
Elicitation Corpus: Future Work
  • Probst and Levin (2002)
  • Pitfalls of automated elicitation
  • Automatic Branching and skipping
  • Automatically skip parts of the corpus depending
    on what features have been detected
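
One way to picture the planned branching and skipping is a filter over corpus sections. The slide lists this as future work, so the sketch below is a guess at the idea rather than a description of any implemented tool; the section names, the `requires` key, and `plan_elicitation` are hypothetical.

```python
from __future__ import annotations


def plan_elicitation(sections: list[dict], detected: dict[str, bool]) -> list[str]:
    """Return the names of the corpus sections still worth eliciting.

    Each section is a hypothetical dict with a 'name' and an optional
    'requires' feature. A section is skipped only when its required feature
    has been tested and found absent in the language.
    """
    plan = []
    for section in sections:
        required = section.get("requires")
        if required is not None and detected.get(required) is False:
            continue  # feature ruled out earlier: skip the dependent section
        plan.append(section["name"])
    return plan


if __name__ == "__main__":
    corpus = [
        {"name": "singular and plural nouns"},
        {"name": "dual nouns", "requires": "dual"},
        {"name": "possessive noun phrases"},
    ]
    print(plan_elicitation(corpus, {"dual": False}))
```

Features that have not been tested yet are treated as possibly present, so their sections stay in the plan until evidence rules them out.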

26
Status of automated rule learning
  • Preliminary results
  • Learned some compositional rules for German
  • Current work
  • Interaction of compositional rules
  • Seed rule generation
  • Generalization and verification of seed rule
    hypothesis

27
Status of Transfer Rule System
  • Preliminary experiments on Chinese-English MT
  • Integrated into a multi-engine system with
    Example-Based MT

28
Tools for Field Linguists?
  • Can feature detection and automatically learned
    rules be useful to alert a field worker to
    possible interesting data?
  • Can automated elicitation with branching and
    skipping be helpful?