Title: Overview of the Language Technologies Institute and AVENUE Project
1. Overview of the Language Technologies Institute and AVENUE Project
- Jaime Carbonell, Director
- March 2, 2002
2. School of Computer Science at Carnegie Mellon University
- Computer Science Department (theory, systems)
- Robotics Institute (space, industry, medical)
- Language Technologies Institute (MT, speech, IR)
- Human-Computer Interaction Institute (Ergonomics)
- Institute for Software Research International (software engineering)
- Center for Automated Learning and Discovery (data mining)
- Entertainment Technologies (Animation, graphics)
3. Language Technologies Institute
- Founded in 1986 as the Center for Machine Translation (CMT).
- Became the Language Technologies Institute in 1996, unifying the CMT and the Computational Linguistics program.
- Current size: 110 FTEs
- 18 Faculty
- 22 Staff
- 60 Graduate Students (45 PhD, 15 MLT)
- 10 Visiting Scholars
4. Bill of Rights
- Get the right information
- To the right people
- At the right time
- On the right medium
- In the right language
- With the right level of detail
5. The Right Information
- Find the right papers, web-pages, ...
- Language modeling for IR (Lafferty, Callan)
- Translingual IR (Yang, Carbonell, Brown)
- Distributed IR (Callan)
- Seek Novelty (Carbonell, Yang, ...)
- Avoid massive redundancy
- Detect new events in streaming data
6. To the Right People
- Text Categorization
- Multi-class classifiers by topic (Yang)
- Boosting for genre learning (Carbonell)
- Filtering / Routing
- Topic tracking in streaming data (Yang)
- TREC filtering/routing (Callan, Yang)
7. At the Right Time
- I.e., when the information is needed
- Anticipatory analysis
- Helpful info without being asked
- Context-aware learning
- Interactivity with user
- Utility theory (when to ask, when to give new or deeper info, when to back off)
- (We have not yet taken up this challenge)
8. On the Right Medium
- Speech Recognition
- SPHINX (Reddy, Rudnicky, Rosenfeld, ...)
- JANUS (Waibel, Schultz, ...)
- Speech Synthesis
- Festival (Black, Lenzo)
- Handwriting & Gesture Recognition
- ISL (Waibel, J. Yang)
- Multimedia Integration (CSD)
- Informedia (Wactlar, Hauptmann, ...)
9. In the Right Language
- High-Accuracy Interlingual MT
- KANT (Nyberg, Mitamura)
- Parallel Corpus-Trainable MT
- Statistical MT (Lafferty, Vogel)
- Example-Based MT (Brown, Carbonell)
- AVENUE Instructible MT (Levin, Lavie, Carbonell)
- Speech-to-speech MT
- JANUS/DIPLOMAT/AVENUE (Waibel, Frederking, Levin, Schultz, Vogel, Lafferty, Black, ...)
10. At the Right Level of Detail
- Multidocument Summarization (Carbonell, Waibel, Yang, ...)
- Question Answering (Carbonell, Callan, Nyberg, Mitamura, Lavie, ...)
- New thrust (JAVELIN project)
- Combines Q-analysis, IR, extraction, planning, user feedback, utility analysis, answer synthesis, ...
11. We also Engage in
- Tutoring Systems (Eskenazi, Callan)
- Linguistic Analysis (Levin, Mitamura)
- Robust Parsing Algorithms (Lavie, ...)
- Interface communication language design (Rosenfeld, Waibel, Rudnicky)
- Complex System Design (Nyberg, Callan)
- Machine Learning (Carbonell, Lafferty, Yang, Rosenfeld, Lavie, ...)
12. How we do it at LTI
- Data-driven methods
- Statistical learning
- Corpora-based
- Examples
- Statistical MT
- Example-based MT
- Text categorization
- Novelty detection
- Translingual IR
- Knowledge-based
- Symbolic learning
- Linguistic analysis
- Knowledge representation
- Examples
- Interlingual MT
- Parsing & generation
- Discourse modeling
- Language tutoring
13. Hot Research Topics
- Automated Q/A from web/text (JAVELIN)
- Endangered Language MT (AVENUE)
- Novelty detection and tracking (TDT)
- Theoretical foundations of language modeling and knowledge discovery
- (All require a multi-discipline approach.)
14. Educational Programs at LTI
- PhD Program
- 45 PhD students, all research areas of LTI
- Individual and joint advisorships
- Marriage process in mid-September to match faculty/projects with new students
- Years 1-2: 50% research, 50% courses
- Years 3-N: 100% research (target N = 5)
- Semi-annual student evaluations
15. Education at LTI (II)
- MLT Program (1-2 years)
- Courses are more central
- 50% on project/research work (if funded)
- Many MLTs apply for PhD admission
- CALL Masters (1 year)
- New program joint with Modern Languages
- Certificate program (1 semester)
16. The AVENUE Project: Machine Translation and Language Tools for Minority Languages
- Jaime Carbonell, Lori Levin, Alon Lavie, Tanja Schultz, Eric Petersen, Kathrin Probst, Christian Monson, ...
17. Machine Translation of Indigenous Languages
- Policy makers have access to information about indigenous people.
- Epidemics, crop failures, etc.
- Indigenous people can participate in
- Health care
- Education
- Government
- Internet
- without giving up their languages.
18. History of AVENUE
- Arose from a series of joint workshops of the NSF and the OAS.
- Workshop recommendations:
- Create multinational projects using information technology to
- provide immediate benefits to governments and citizens
- develop critical infrastructure for communication and collaborative research
- train researchers and engineers
- advance science and technology
19. Resources for MT
- People who speak the language.
- Linguists who speak the language.
- Computational linguists who speak the language.
- Text on paper.
- Text on line.
- Comparable text on paper or on line.
- Parallel text on paper or on line.
- Annotated text (part of speech, morphology, etc.)
- Dictionaries (monolingual or bilingual) on paper or on line.
- Recordings of spoken language.
- Recordings of spoken language that are transcribed.
- Etc.
20. MT for Indigenous Languages
- Minimal amount of parallel text
- Possibly competing standards for orthography/spelling
- Maybe not so many trained linguists
- Access to native informants possible
- Need to minimize development time and cost
21. Two Technical Approaches
- Generalized EBMT
- Parallel text: 50K-2MB (uncontrolled corpus)
- Rapid implementation
- Proven for major languages with reduced data
- Transfer-rule learning
- Elicitation (controlled) corpus to extract grammatical properties
- Seeded version-space learning
22. Types of Machine Translation
[MT pyramid diagram: paths from the source (Arabic) to the target (English), ranging from direct SMT/EBMT at the bottom, through transfer rules connecting syntactic parsing and text generation, up to semantic analysis and sentence planning at the top.]
23. Multi-Engine Machine Translation
- MT systems have different strengths:
- Rapidly adaptable: statistical, example-based
- Good grammar: rule-based (linguistic) MT
- High precision in narrow domains: KBMT
- Minority-language MT: learnable from an informant
- Combine results of parallel-invoked MT
- Select best of multiple translations
- Selection based on optimizing a combination of (see the sketch below):
- Target-language joint-exponential model
- Confidence scores of individual MT engines
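A minimal sketch of the selection step, assuming each engine returns a candidate translation with its own confidence score and that some target-language model supplies a fluency score; the interpolation weight and the toy scoring function below are illustrative assumptions, not the actual multi-engine implementation.

```python
# Hypothetical sketch of multi-engine selection: combine a target-language
# fluency score with each engine's own confidence and keep the best candidate.

def select_best(candidates, lm_score, weight=0.5):
    """candidates: list of (engine_name, translation, confidence) tuples.
    lm_score: function mapping a target sentence to a fluency score.
    weight: interpolation between fluency and engine confidence (assumed)."""
    best, best_score = None, float("-inf")
    for engine, translation, confidence in candidates:
        score = weight * lm_score(translation) + (1 - weight) * confidence
        if score > best_score:
            best, best_score = (engine, translation), score
    return best, best_score

# Toy usage with a crude stand-in for a language model: the fraction of
# in-vocabulary words serves as the "fluency" score here.
def toy_lm_score(sentence):
    vocab = {"i", "would", "like", "to", "meet", "the", "tallest", "man"}
    words = sentence.lower().split()
    return sum(w in vocab for w in words) / len(words)

candidates = [
    ("EBMT", "I would like to meet the tallest man", 0.8),
    ("SMT", "I want meet tallest the man", 0.6),
]
print(select_best(candidates, toy_lm_score))
```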
24. Illustration of Multi-Engine MT
25. EBMT Example
- English: I would like to meet her. / Mapudungun: Ayükefun trawüael fey engu.
- English: The tallest man is my father. / Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
- English: I would like to meet the tallest man. / Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru / Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
(A fragment-stitching sketch follows below.)
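A minimal sketch of the idea behind this example, assuming a tiny translation memory of aligned phrase fragments; the fragment table and the greedy left-to-right matching are illustrative stand-ins, not the generalized EBMT engine itself.

```python
# Hypothetical sketch: translate by covering the input with fragments already
# seen in a (tiny) aligned translation memory, then concatenating the
# corresponding target fragments.

FRAGMENTS = {
    "i would like to meet": "Ayükefun trawüael",
    "the tallest man": "Chi doy fütra chi wentru",
}

def ebmt_translate(sentence):
    words = sentence.lower().rstrip(".").split()
    output, i = [], 0
    while i < len(words):
        # Greedily look for the longest known fragment starting at position i.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in FRAGMENTS:
                output.append(FRAGMENTS[chunk])
                i = j
                break
        else:
            output.append(words[i])  # pass unknown words through unchanged
            i += 1
    return " ".join(output)

print(ebmt_translate("I would like to meet the tallest man."))
# -> "Ayükefun trawüael Chi doy fütra chi wentru" (the "new" output above)
```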
26. Architecture Diagram
[Architecture diagram: SL input and TL output; a run-time module containing the SL parser, EBMT engine, transfer engine, transfer rules, TL generator, and unifier module; and a learning module containing the elicitation process (with the user) and the SVS learning process that produces the transfer rules.]
27. Version Space Learning
- Symbolic learning from positive and negative examples
- Invented by Mitchell, refined by Hirsch
- Builds the generalization lattice implicitly
- Bounded by the G and S sets (see the candidate-elimination sketch below)
- Worst-case exponential complexity (in the size of G and S)
- Slow convergence rate
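A minimal candidate-elimination sketch for conjunctive attribute-value concepts, in the spirit of Mitchell's algorithm; the hypothesis representation and the toy examples are assumptions for illustration, and the first training example is assumed to be positive.

```python
# Hypothetical sketch of candidate elimination over conjunctive
# attribute-value hypotheses, where "?" matches any value.

def covers(hypothesis, example):
    return all(h == "?" or h == e for h, e in zip(hypothesis, example))

def generalize_S(s, example):
    # Minimal generalization of the specific boundary to cover a positive example.
    return tuple(si if si == ei else "?" for si, ei in zip(s, example))

def specialize_G(g, s, example):
    # Minimal specializations of a general hypothesis that exclude a negative
    # example while staying consistent with the specific boundary s.
    return [g[:i] + (s[i],) + g[i + 1:]
            for i, gi in enumerate(g)
            if gi == "?" and s[i] != "?" and s[i] != example[i]]

def candidate_elimination(examples, n_attrs):
    S = None                    # specific boundary (assumes first example is positive)
    G = [("?",) * n_attrs]      # general boundary
    for x, label in examples:
        if label:               # positive example: grow S, prune G
            S = tuple(x) if S is None else generalize_S(S, x)
            G = [g for g in G if covers(g, x)]
        else:                   # negative example: specialize G
            G = [h for g in G for h in specialize_G(g, S, x)]
    return S, G

# Toy usage: learn a two-slot pattern from labelled (sequence, label) pairs.
examples = [
    (("noun", "adj"), True),
    (("noun", "noun"), False),
    (("noun", "adj"), True),
]
print(candidate_elimination(examples, 2))   # -> (('noun', 'adj'), [('?', 'adj')])
```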
28. Example of Transfer Rule Lattice
29. Seeded Version Spaces
- Generate a concept seed from the first example
- Generalization-level hypothesis (POS feature agreement for T-rules in NICE)
- Generalization/specialization level bounds:
- up to k levels of generalization, and up to j levels of specialization
- Implicit lattice explored seed-outwards (see the bounded-search sketch below)
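A minimal sketch of the seed-outwards search, reusing the conjunctive-hypothesis representation from the previous sketch; the generalization/specialization operators and the toy example are assumptions for illustration, not the NICE transfer-rule learner.

```python
# Hypothetical sketch: explore the hypothesis lattice outwards from a seed,
# allowing at most k generalization steps (replace a value with "?") and at
# most j specialization steps (replace a "?" with a concrete value), keeping
# only hypotheses consistent with the labelled examples.

def covers(h, x):
    return all(hi == "?" or hi == xi for hi, xi in zip(h, x))

def generalizations(h):
    return [h[:i] + ("?",) + h[i + 1:] for i, hi in enumerate(h) if hi != "?"]

def specializations(h, values):
    return [h[:i] + (v,) + h[i + 1:]
            for i, hi in enumerate(h) if hi == "?" for v in values]

def seeded_version_space(seed, examples, k, j, values):
    frontier = [(seed, 0, 0)]   # (hypothesis, generalization steps, specialization steps)
    seen, consistent = set(), set()
    while frontier:
        h, g_used, s_used = frontier.pop()
        if (h, g_used, s_used) in seen:
            continue
        seen.add((h, g_used, s_used))
        if all(covers(h, x) == label for x, label in examples):
            consistent.add(h)
        if g_used < k:
            frontier.extend((h2, g_used + 1, s_used) for h2 in generalizations(h))
        if s_used < j:
            frontier.extend((h2, g_used, s_used + 1) for h2 in specializations(h, values))
    return consistent

# Toy usage: the seed is built from the first elicited example.
seed = ("noun", "adj")
examples = [(("noun", "adj"), True), (("noun", "noun"), False)]
print(seeded_version_space(seed, examples, k=1, j=1, values=["noun", "adj", "det"]))
```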
30. Complexity of SVS
- O(g^k) upward search, where g = number of generalization operators
- O(s^j) downward search, where s = number of specialization operators
- Since j and k are constants, the SVS runs in polynomial time of order max(j, k) (restated below)
- Convergence rate bounded by F(j, k)
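Putting the two bounds together (a restatement of the bullets above, with the symbols spelled out):

```latex
% g = number of generalization operators, k = maximum generalization depth,
% s = number of specialization operators, j = maximum specialization depth.
\[
  T_{\mathrm{SVS}} \;=\; O\!\left(g^{k}\right) + O\!\left(s^{j}\right)
                   \;=\; O\!\left(g^{k} + s^{j}\right)
\]
% which is polynomial of degree max(j, k) in g and s once j and k are fixed,
% in contrast to the worst-case exponential behaviour of unrestricted version spaces.
```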
31. Next Steps in SVS
- Implementation of the transfer-rule interpreter (partially complete)
- Implementation of SVS to learn transfer rules (under way)
- Elicitation corpus extension for evaluation (under way)
- Evaluation, first on Mapudungun MT (next)
32. NICE Partners
33. Agreement Between LTI and the Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile
- Contributions of IEI:
- Native-language knowledge and linguistic expertise in Mapudungun
- Experience in bicultural, bilingual education
- Data collection: recording, transcribing, translating
- Orthographic normalization of Mapudungun
34. Agreement Between LTI and the Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile
- Contributions of LTI:
- Develop MT technology for indigenous languages
- Training for data collection and transcription
- Partial support for the data collection effort, pending funding from the Chilean Ministry of Education
- International coordination, technical and project management
35. LTI/IEI Agreement
- Continue collaboration on data collection and machine translation technology.
- Pursue focused areas of mutual interest, such as bilingual education.
- Seek additional funding sources in Chile and the US.
36. The IEI Team
- Coordinator (leader of a bilingual and multicultural education project)
- Eliseo Canulef
- Distinguished native speaker
- Rosendo Huisca
- Linguists (one native speaker, one near-native)
- Juan Hector Painequeo
- Hugo Carrasco
- Typists/Transcribers
- Recording assistants
- Translators
- Native speaker linguistic informants
37. MINEDUC/IEI Agreement: Highlights
- Based on the LTI/IEI agreement, the Chilean Ministry of Education agreed to fund the data collection and processing team for the year 2001. This agreement will be renewed each year, as needed.
38. MINEDUC/IEI Agreement: Objectives
- To evaluate the NICE/Mapudungun proposal for orthography and spelling
- To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health, traditional and western.
39. MINEDUC/IEI Agreement: Deliverables
- An oral corpus of 800 recorded hours, proportional to the demography of each currently spoken dialect
- 120 hours transcribed and translated from Mapudungun to Spanish
- A refined proposal for writing Mapudungun
40. NICE/Mapudungun Database
- Writing conventions (Grafemario)
- Glossary Mapudungun/Spanish
- Bilingual newspaper, 4 issues
- Ultimas Familias memoirs
- Memorias de Pascual Coña
- Publishable product with new Spanish translation
- 35 hours transcribed speech
- 80 hours recorded speech
41. NICE/Mapudungun: Other Products
- Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.
- Training for spoken-language collection: In January 2001, native speakers of Mapudungun were trained in the recording and transcription of spoken data.
42. Underfunded Activities
- Data collection
- Colombia (unfunded)
- Chile (partially funded)
- Travel
- More contact between CMU and Chile (UFRO) and Colombia.
- Training
- Train Mapuche linguists in language technologies at CMU.
- Extend training to Colombia
- Refine MT system for Mapudungun and Siona
- Current funding covers research on the MT engine
and data collection, but not detailed linguistic
analysis
43. Outline
- History of MT -- see Wired magazine, May 2000 issue. Available on the web.
- How well does it work?
- Procedure for designing an LT project:
- Choose an application: What do you want to do?
- Identify the properties of your application.
- Methods: knowledge-based, statistical/corpus-based, or hybrid.
- Methods: interlingua, transfer, or direct.
- Typical components of an MT system.
- Typical resources required for an MT system.
44. How well does it work? Example: SpanAm
- Possibly the best Spanish-English MT system.
- Around 20 years of development.
45. How well does it work? Example: Systran
- Try it on the Altavista web page.
- Many language pairs are available.
- Some language pairs might have taken up to a person-century of development.
- Can translate text on any topic.
- Results may be amusing.
46. How well does it work? Example: KANT
- Translates equipment manuals for Caterpillar.
- Input is controlled English: many ambiguities are eliminated. The input is checked carefully for compliance with the rules.
- Around 5 output languages.
- The output might be post-edited.
- The result has to be perfect to prevent accidents
with the equipment.
47. How well does it work? Example: JANUS
- Translates spoken conversations about booking hotel rooms or flights.
- Six languages: English, French, German, Italian, Japanese, Korean (with partners in the C-STAR consortium).
- Input is spontaneous speech spoken into a microphone.
- Output is around 60% correct.
- Task completion is higher than translation accuracy: users can always get their flights or rooms if they are willing to repeat 40% of their sentences.
48. How well does it work? Speech Recognition
- Jupiter weather information: 1-888-573-8255. You can say things like "What cities do you know about in Chile?" and "What will be the weather tomorrow in Santiago?"
- Communicator flight reservations: 1-877-CMU-PLAN. You can say things like "I'm travelling to Pittsburgh."
- Speechworks demo: 1-888-SAY-DEMO. You can say things like "Sell my shares of Microsoft."
- These are all in English, and are toll-free only in the US, but they are speaker-independent and should work with reasonable foreign accents.
49. Different kinds of MT
- Different applications: for example, translation of spoken language or text.
- Different methods: for example, translation rules that are hand-crafted by a linguist, or rules that are learned automatically by a machine.
- The work of building an MT program will be very different depending on the application and the methods.
50. Procedure for planning an MT project
- Choose an application.
- Identify the properties of your application.
- List your resources.
- Choose one or more methods.
- Make adjustments if your resources are not
adequate for the properties of your application.
51. Choose an application: What do you want to do?
- Exchange email or chat in Quechua and Spanish.
- Translate Spanish web pages about science into Quechua so that kids can read about science in their language.
- Scan the web: Is there any information about such-and-such new fertilizer and water pollution? Then, if you find something that looks interesting, take it to a human translator.
- Answer government surveys about health and agriculture (spoken or written).
- Ask directions ("Where is the library?") (spoken).
- Read government publications in Quechua.
52. Identify the properties of your application.
- Do you need reliable, high quality translation?
- How many languages are involved? Two or more?
- Type of input.
- One topic (for example, weather reports) or any topic (for example, calling your friend on the phone to chat).
- Controlled or free input.
- How much time and money do you have?
- Do you anticipate having to add new topics or new
languages?
53. Do you need high quality?
- Assimilation: Translate something into your language so that you can
- understand it -- may not require high quality.
- evaluate whether it is important or interesting and then send it off for a better translation -- does not require high quality.
- use it for educational purposes -- probably requires high quality.
54. Do you need high quality?
- Dissemination: Translate something into someone else's language, e.g., for publication.
- Usually should be high quality.
55. Do you need high quality?
- Two-way: e.g., a chat room or spoken conversation
- May not require high reliability on correctness if you have a native-language paraphrase.
- Original input: "I would like to reserve a double room."
- Paraphrase: "Could you make a reservation for a double room?"
56. Type of Input
- Formal text: newspaper, government reports, on-line encyclopedia.
- Difficulty: long sentences
- Formal speech: spoken news broadcast.
- Difficulty: speech recognition won't be perfect.
- Conversational speech
- Difficulty: speech recognition won't be perfect
- Difficulty: disfluencies
- Difficulty: non-grammatical speech
- Informal text: email, chat
- Difficulty: non-grammatical speech
57. Methods: Knowledge-Based
- Knowledge-based MT: a linguist writes rules for translation
- noun adjective → adjective noun (see the sketch below)
- Requires a computational linguist who knows the source and target languages.
- Usually takes many years to get good coverage.
- Usually high quality.
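A minimal, hypothetical sketch of what such a hand-written rule might look like in code, using the noun/adjective reordering above; the rule format and the toy lexicon are assumptions for illustration, not the format of any system described here.

```python
# Hypothetical sketch of a hand-written, knowledge-based translation rule:
# Spanish "noun adjective" becomes English "adjective noun".

LEXICON = {"casa": "house", "roja": "red"}          # toy bilingual lexicon
NOUNS, ADJECTIVES = {"casa"}, {"roja"}              # toy part-of-speech information

def translate_np(words):
    """Translate a two-word Spanish noun phrase using one hand-written rule."""
    if len(words) == 2 and words[0] in NOUNS and words[1] in ADJECTIVES:
        noun, adj = words
        return [LEXICON[adj], LEXICON[noun]]        # reorder: adjective before noun
    return [LEXICON.get(w, w) for w in words]       # fall back to word-for-word lookup

print(translate_np(["casa", "roja"]))               # -> ['red', 'house']
```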
58. Methods: Statistical/Corpus-Based
- Statistical and corpus-based methods involve computer programs that automatically learn to translate.
- The program must be trained by showing it a lot of data.
- Requires huge amounts of data.
- The data may need to be annotated by hand.
- Does not require a human computational linguist who knows the source and target languages.
- Could be applied to a new language in a few days.
- At the current state of the art, the quality is not very good.
59. Methods: Interlingua
- An interlingua is a machine-readable representation of the meaning of a sentence.
- "I'd like a double room" / "Quisiera una habitación doble"
- request-action+reservation+hotel (room-type=double) (see the sketch below)
- Good for multi-lingual situations. Very easy to add a new language.
- Probably better for limited domains -- meaning is very hard to define.
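A minimal sketch of the interlingua idea, representing the frame above as a small data structure and generating into more than one language from the same frame; the field names and generator functions are illustrative assumptions.

```python
# Hypothetical sketch: one language-neutral meaning representation,
# many per-language generators.

interlingua = {
    "speech-act": "request-action",
    "action": "reservation",
    "object": "hotel room",
    "room-type": "double",
}

def generate_english(frame):
    return f"I'd like to reserve a {frame['room-type']} {frame['object']}."

def generate_spanish(frame):
    room = {"double": "doble"}[frame["room-type"]]
    return f"Quisiera reservar una habitación {room}."

for generate in (generate_english, generate_spanish):
    print(generate(interlingua))
```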
60. Multilingual Interlingual Machine Translation
61. Methods: Transfer
- A transfer rule tells you how a structure in one language corresponds to a different structure in another language:
- an adjective followed by a noun in English corresponds to a noun followed by an adjective in Spanish.
- Not good when there are more than two languages -- you have to write different transfer rules for each pair (see the count below).
- Better than interlingua for unlimited domains.
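A quick count that makes the per-pair cost concrete (my arithmetic, not a figure from the slide): with N languages, pairwise transfer needs a rule set for every ordered language pair, while an interlingua needs one analyzer and one generator per language.

```latex
% Components needed for N languages:
%   transfer:    one rule set per ordered language pair
%   interlingua: one analyzer + one generator per language
\[
  \text{transfer rule sets} = N(N-1), \qquad
  \text{interlingua modules} = 2N .
\]
% Example: for N = 6 languages, 6 x 5 = 30 transfer rule sets
% versus 2 x 6 = 12 interlingua modules.
```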
62. Methods: Direct
- Direct translation does not involve analyzing the structure or meaning of a language.
- For example, look up each word in a bilingual dictionary.
- Results can be hilarious: "the spirit is willing but the flesh is weak" can become "the wine is good, but the meat is lousy."
- Can be developed very quickly.
- Can be a good back-up when more complicated methods fail to produce output. (A word-for-word sketch follows below.)
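A minimal sketch of direct, word-for-word dictionary lookup; the tiny Spanish-English dictionary is an assumption for illustration.

```python
# Hypothetical sketch of direct MT: look each word up in a bilingual
# dictionary and keep the source order, with no analysis of structure.

DICTIONARY = {
    "la": "the", "casa": "house", "roja": "red", "es": "is", "grande": "big",
}

def direct_translate(sentence):
    words = sentence.lower().rstrip(".").split()
    # Unknown words are passed through unchanged, as a crude back-up.
    return " ".join(DICTIONARY.get(w, w) for w in words)

print(direct_translate("La casa roja es grande."))
# -> "the house red is big"  (word-order errors are exactly what this method cannot fix)
```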
63. Components of a Knowledge-Based Interlingua MT System
- Morphological analyzer: identifies prefixes, suffixes, and stems.
- Parser (sentence-to-syntactic structure for the source language, hand-written or automatically learned)
- Meaning interpreter (syntax-to-semantics, source language)
- Meaning interpreter (semantics-to-syntax, target language)
- Generator (syntactic structure-to-sentence) for the target language
(A pipeline sketch follows below.)
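A minimal sketch of how these components chain together, with every stage stubbed out; the function names and the trivial stand-in implementations are assumptions purely to show the data flow, not a real analyzer or generator.

```python
# Hypothetical sketch of the knowledge-based interlingua pipeline:
# source sentence -> morphology -> parse -> interlingua -> target syntax -> target sentence.

def morphological_analysis(sentence):
    # Stand-in: just tokenize; a real analyzer would split prefixes/suffixes/stems.
    return sentence.lower().rstrip(".").split()

def parse(tokens):
    # Stand-in syntactic structure for the source language.
    return {"subject": tokens[0], "verb": tokens[1], "object": tokens[2:]}

def interpret(parse_tree):
    # Syntax-to-semantics for the source language (the interlingua).
    return {"event": parse_tree["verb"], "agent": parse_tree["subject"],
            "theme": " ".join(parse_tree["object"])}

def realize(frame):
    # Semantics-to-syntax for the target language (stand-in).
    return [frame["agent"], frame["event"], frame["theme"]]

def generate(target_tree):
    # Syntactic structure-to-sentence for the target language.
    return " ".join(target_tree).capitalize() + "."

def translate(sentence):
    return generate(realize(interpret(parse(morphological_analysis(sentence)))))

print(translate("I want a double room."))
```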
64. Resources for a Knowledge-Based Interlingua MT System
- Computational linguists who know the source and target languages.
- As large a corpus as possible, so that the linguists can confirm that they are covering the necessary constructions; the size of the corpus is not crucial to system development.
- Lexicons for the source and target languages: syntax, semantics, and morphology.
- A list of all the concepts that can be expressed in the system's domain.
65. Components of Example-Based MT (a direct statistical method)
- A morphological analyzer and part-of-speech tagger would be nice, but are not crucial.
- An alignment algorithm that runs over a parallel corpus and finds corresponding source and target sentences.
- An algorithm that compares an input sentence to sentences that have been previously translated, or whose translation is known.
- An algorithm that pulls out the corresponding translation, possibly slightly modifying a previous translation.
(A retrieval sketch follows below.)
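A minimal sketch of the comparison-and-retrieval step, assuming the parallel corpus has already been sentence-aligned into (source, target) pairs; the word-overlap similarity measure is an illustrative stand-in for the real matcher.

```python
# Hypothetical sketch of the EBMT retrieval step: find the previously
# translated sentence most similar to the input and return its translation.

ALIGNED_CORPUS = [
    ("i would like to meet her", "Ayükefun trawüael fey engu"),
    ("the tallest man is my father", "Chi doy fütra chi wentru fey ta inche ñi chaw"),
]

def similarity(a, b):
    """Crude word-overlap similarity between two sentences (stand-in measure)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve(input_sentence):
    input_sentence = input_sentence.lower().rstrip(".")
    best = max(ALIGNED_CORPUS, key=lambda pair: similarity(input_sentence, pair[0]))
    return best  # (closest known source sentence, its stored translation)

print(retrieve("I would like to meet the tallest man."))
```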
66. Resources for Example-Based MT
- Lexicons would improve the quality of translation, but are not crucial.
- A large parallel corpus (hundreds of thousands of words).
67. Omnivorous Multi-Engine MT: eats any available resources
68. Approaches we had in mind
- Direct bilingual-dictionary lookup, because it is easy and is a back-up when other methods fail.
- Generalized Example-Based MT, because it is easy and fast and can also be a back-up.
- Instructable Transfer-based MT: a new, untested idea involving machine learning of rules from a human native speaker. Useful when computational linguists don't know the language, and people who know the language are not computational linguists.
- Conventional, hand-written transfer rules, in case the new method doesn't work.