Overview of the Language Technologies Institute and AVENUE Project

1
Overview of the Language Technologies Institute
and AVENUE Project
  • Jaime Carbonell, Director
  • March 2, 2002

2
School of Computer Science at Carnegie Mellon
University
  • Computer Science Department (theory, systems)
  • Robotics Institute (space, industry, medical)
  • Language Technologies Institute (MT, speech, IR)
  • Human-Computer Interaction Inst. (Ergonomics)
  • Institute for Software Research Int. (SE)
  • Center for Automated Learning and Discovery (DM)
  • Entertainment Technologies (Animation, graphics)

3
Language Technologies Institute
  • Founded in 1986 as the Center for Machine
    Translation (CMT).
  • Became Language Technologies Institute in 1996,
    unifying CMT, Comp Ling program.
  • Current Size 110 FTEs
  • 18 Faculty
  • 22 Staff
  • 60 Graduate Students (45 PhD, 15 MLT)
  • 10 Visiting Scholars

4
Bill of Rights
  • Get the right information
  • To the right people
  • At the right time
  • On the right medium
  • In the right language
  • With the right level of detail

5
The Right Information
  • Find the right papers, web-pages, ...
  • Language modeling for IR (Lafferty, Callan)
  • Translingual IR (Yang, Carbonell, Brown)
  • Distributed IR (Callan)
  • Seek Novelty (Carbonell, Yang, ...)
  • Avoid massive redundancy
  • Detect new events in streaming data

6
to the Right People
  • Text Categorization
  • Multi-class classifiers by topic (Yang)
  • Boosting for genre learning (Carbonell)
  • Filtering & Routing
  • Topic tracking in streaming data (Yang)
  • TREC filtering/routing (Callan, Yang)

7
at the Right Time
  • I.e. when the information is needed
  • Anticipatory analysis
  • Helpful info without being asked
  • Context-aware learning
  • Interactivity with user
  • Utility theory (when to ask, when to give new or
    deeper info, when to back off)
  • (We have not yet taken up this challenge)

8
on the Right Medium
  • Speech Recognition
  • SPHINX (Reddy, Rudnicky, Rosenfeld, ...)
  • JANUS (Waibel, Schultz, ...)
  • Speech Synthesis
  • Festival (Black, Lenzo)
  • Handwriting & Gesture Recognition
  • ISL (Waibel, J. Yang)
  • Multimedia Integration (CSD)
  • Informedia (Wactlar, Hauptmann, ...)

9
in the Right Language
  • High-Accuracy Interlingual MT
  • KANT (Nyberg, Mitamura)
  • Parallel Corpus-Trainable MT
  • Statistical MT (Lafferty, Vogel)
  • Example-Based MT (Brown, Carbonell)
  • AVENUE Instructible MT (Levin, Lavie, Carbonell)
  • Speech-to-speech MT
  • JANUS/DIPLOMAT/AVENUE (Waibel, Frederking, Levin,
    Schultz, Vogel, Lafferty, Black, ...)

10
at the Right Level of Detail
  • Multidocument Summarization (Carbonell, Waibel,
    Yang, ...)
  • Question Answering (Carbonell, Callan, Nyberg,
    Mitamura, Lavie, ...)
  • New thrust (JAVELIN project)
  • Combines Q-analysis, IR, extraction, planning,
    user-feedback, utility analysis, answer
    synthesis, ...

11
We also Engage in
  • Tutoring Systems (Eskenazi, Callan)
  • Linguistic Analysis (Levin, Mitamura)
  • Robust Parsing Algorithms (Lavie, ...)
  • Interface communication language design
    (Rosenfeld, Waibel, Rudnicky)
  • Complex System Design (Nyberg, Callan)
  • Machine Learning (Carbonell, Lafferty, Yang,
    Rosenfeld, Lavie, ...)

12
How we do it at LTI
  • Data-driven methods
  • Statistical learning
  • Corpora-based
  • Examples
  • Statistical MT
  • Example-based MT
  • Text categorization
  • Novelty detection
  • Translingual IR
  • Knowledge-based
  • Symbolic learning
  • Linguistic analysis
  • Knowledge represent.
  • Examples
  • Interlingual MT
  • Parsing & generation
  • Discourse modeling
  • Language tutoring

13
Hot Research Topics
  • Automated Q/A from web/text (JAVELIN)
  • Endangered Language MT (AVENUE)
  • Novelty detection and tracking (TDT)
  • Theoretical foundations of Language modeling, and
    knowledge discovery
  • (All require multi-discipline approach.)

14
Educational Programs at LTI
  • PhD Program
  • 45 PhD students, all research areas of LTI
  • Individual and joint advisorships
  • Marriage process in mid-September to match
    faculty/projects with new students
  • Years 1-2: 50% research, 50% courses
  • Years 3-N: 100% research (target N=5)
  • Semi-annual student evaluations

15
Education at LTI (II)
  • MLT Program (1-2 years)
  • Courses are more central
  • 50% on project/research work (if funded)
  • Many MLTs apply for PhD admission
  • CALL Masters (1 year)
  • New program joint with Modern Languages
  • Certificate program (1 semester)

16
The AVENUE Project: Machine Translation and
Language Tools for Minority Languages
  • Jaime Carbonell, Lori Levin, Alon Lavie, Tanja
    Schultz, Eric Petersen, Kathrin Probst, Christian
    Monson, ...

17
Machine Translation of Indigenous Languages
  • Policy makers have access to information about
    indigenous people.
  • Epidemics, crop failures, etc.
  • Indigenous people can participate in
  • Health care
  • Education
  • Government
  • Internet
  • without giving up their languages.

18
History of AVENUE
  • Arose from a series of joint workshops of NSF and
    OAS.
  • Workshop recommendations
  • Create multinational projects using information
    technology to
  • provide immediate benefits to governments and
    citizens
  • develop critical infrastructure for communication
    and collaborative research
  • training researchers and engineers
  • advancing science and technology

19
Resources for MT
  • People who speak the language.
  • Linguists who speak the language.
  • Computational linguists who speak the language.
  • Text on paper.
  • Text on line.
  • Comparable text on paper or on line.
  • Parallel text on paper or on line.
  • Annotated text (part of speech, morphology, etc.)
  • Dictionaries (mono-lingual or bilingual) on paper
    or on line.
  • Recordings of spoken language.
  • Recordings of spoken language that are
    transcribed.
  • Etc.

20
MT for Indigenous Languages
  • Minimal amount of parallel text
  • Possibly competing standards for
    orthography/spelling
  • Maybe not so many trained linguists
  • Access to native informants possible
  • Need to minimize development time and cost

21
Two Technical Approaches
  • Generalized EBMT
  • Parallel text: 50K-2MB (uncontrolled corpus)
  • Rapid implementation
  • Proven for major languages with reduced data
  • Transfer-rule learning
  • Elicitation (controlled) corpus to extract
    grammatical properties
  • Seeded version-space learning

22
Types of Machine Translation
[Diagram: MT pyramid. Source (Arabic) → Syntactic Parsing → Semantic Analysis → Interlingua → Sentence Planning → Text Generation → Target (English); shortcuts across the pyramid: Transfer Rules; Direct: SMT, EBMT]
23
Multi-Engine Machine
Translation
  • MT systems have different strengths
  • Rapidly adaptable: statistical, example-based
  • Good grammar: rule-based (linguistic) MT
  • High precision in narrow domains: KBMT
  • Minority-language MT: learnable from informant
  • Combine results of parallel-invoked MT
  • Select best of multiple translations
  • Selection based on optimizing combination of:
  • Target language joint-exponential model
  • Confidence scores of individual MT engines
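
The selection step above can be sketched roughly as follows. This is a minimal illustration, not the AVENUE implementation: it assumes each engine returns one candidate with a confidence in (0, 1], and it uses a toy unigram language model with a simple weighted log-linear sum as a stand-in for the joint-exponential model named on the slide. All names (`Candidate`, `select_best`), weights, and probabilities are made up for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    engine: str        # which MT engine produced this output
    text: str          # candidate translation
    confidence: float  # engine's own confidence in (0, 1]


# Toy target-language unigram probabilities (made-up numbers).
UNIGRAM = {"the": 0.2, "tallest": 0.05, "man": 0.1, "is": 0.15,
           "my": 0.1, "father": 0.05}
UNKNOWN = 1e-4  # probability assigned to words the model has never seen


def lm_logprob(text: str) -> float:
    """Average per-word log-probability under the toy unigram model."""
    words = text.lower().split()
    return sum(math.log(UNIGRAM.get(w, UNKNOWN)) for w in words) / len(words)


def score(c: Candidate, w_lm: float = 1.0, w_conf: float = 1.0) -> float:
    """Weighted sum of language-model score and log engine confidence."""
    return w_lm * lm_logprob(c.text) + w_conf * math.log(c.confidence)


def select_best(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=score)


candidates = [
    Candidate("EBMT", "the tallest man is my father", 0.6),
    Candidate("direct", "the more big man is me father", 0.9),
]
best = select_best(candidates)
print(best.engine, "->", best.text)
```

Here the more fluent candidate wins even though the other engine reports higher confidence, which is the point of combining both signals.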

24
Illustration of Multi-Engine MT
25
EBMT Example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
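
A rough sketch of how the "new" output in this example could arise: fragments of the two known sentence pairs are looked up and concatenated without any adjustment at the seams. The fragment alignments in `FRAGMENTS` are an assumption made for illustration (the slide gives only whole-sentence pairs), and the greedy longest-match strategy is one simple choice, not necessarily what the generalized EBMT engine does.

```python
# Fragment pairs implied by the slide's two training examples. The
# sub-sentence alignment itself is an assumption made for this sketch.
FRAGMENTS = {
    ("i", "would", "like", "to", "meet"): "Ayükefun trawüael",
    ("her",): "fey engu",
    ("the", "tallest", "man"): "Chi doy fütra chi wentru",
    ("is", "my", "father"): "fey ta inche ñi chaw",
}


def translate(sentence: str) -> str:
    """Greedy longest-match lookup; matched fragments are concatenated as-is."""
    words = sentence.lower().rstrip(".").split()
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):      # try the longest span first
            target = FRAGMENTS.get(tuple(words[i:j]))
            if target is not None:
                out.append(target)
                i = j
                break
        else:
            out.append(f"<{words[i]}>")         # no fragment covers this word
            i += 1
    return " ".join(out)


print(translate("I would like to meet the tallest man"))
```

The result matches the slide's "Mapudungun (new)" line rather than the "correct" one, because plain concatenation cannot make the changes a fluent speaker would make where the fragments join.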
26
Architecture Diagram
[Diagram labels: SL Input; Run-Time Module; Learning Module; SL Parser; EBMT Engine; Elicitation Process; SVS Learning Process; Transfer Rules; Transfer Engine; TL Generator; User; Unifier Module; TL Output]
27
Version Space Learning
  • Symbolic learning from positive and negative
    examples
  • Invented by Mitchell, refined by Hirsh
  • Builds generalization lattice implicitly
  • Bounded by G and S sets
  • Worst-case exponential complexity (in size of G
    and S)
  • Slow convergence rate

28
Example of Transfer Rule Lattice
29
Seeded Version Spaces
  • Generate concept seed from first example
  • Generalization-level hypothesis (POS feature
    agreement for T-rules in NICE)
  • Generalization/specialization level bounds
  • Up to k-levels generalization, and up to j-levels
    specialization.
  • Implicit lattice explored seed-outwards

30
Complexity of SVS
  • O(g^k) upward search, where g = number of
    generalization operators
  • O(s^j) downward search, where s = number of
    specialization operators
  • Since j and k are constants, the SVS runs in
    polynomial time of order max(j,k)
  • Convergence rates bounded by F(j,k)
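
To make the upward-search bound concrete, here is a small sketch of a seed-outward search limited to k generalization levels. The hypothesis representation (a tuple of feature values) and the single generalization operator (replace one value with a wildcard) are illustrative assumptions; the slides do not specify the actual operators. With at most g operators applicable to any hypothesis, the number of hypotheses reached within k levels is at most 1 + g + g^2 + ... + g^k, which is polynomial in g for fixed k.

```python
WILDCARD = "*"


def generalize(hyp: tuple[str, ...]) -> list[tuple[str, ...]]:
    """One-step generalizations: replace one concrete value with a wildcard."""
    return [hyp[:i] + (WILDCARD,) + hyp[i + 1:]
            for i, v in enumerate(hyp) if v != WILDCARD]


def explore_upward(seed: tuple[str, ...], k: int) -> set[tuple[str, ...]]:
    """Breadth-first search from the seed, at most k generalization levels up."""
    seen = {seed}
    frontier = [seed]
    for _ in range(k):
        next_frontier = []
        for hyp in frontier:
            for general in generalize(hyp):
                if general not in seen:
                    seen.add(general)
                    next_frontier.append(general)
        frontier = next_frontier
    return seen


seed = ("det", "adj", "noun", "sg")  # hypothetical features of a seed rule
g = len(seed)                        # at most g operators apply to a hypothesis
for k in range(1, 4):
    explored = explore_upward(seed, k)
    bound = sum(g ** i for i in range(k + 1))
    print(f"k={k}: explored {len(explored)} hypotheses, bound {bound}")
```

The downward (specialization) search is symmetric, with s operators and j levels in place of g and k.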

31
Next Steps in SVS
  • Implementation of transfer-rule interpreter
    (partially complete)
  • Implementation of SVS to learn transfer rules
    (under way)
  • Elicitation corpus extension for evaluation
    (under way)
  • Evaluation first on Mapudungun MT (next)

32
NICE Partners
33
Agreement Between LTI and Institute of Indigenous
Studies (IEI), Universidad De La Frontera, Chile
  • Contributions of IEI
  • Native language knowledge and linguistic
    expertise in Mapudungun
  • Experience in bicultural, bilingual education
  • Data collection: recording, transcribing,
    translating
  • Orthographic normalization of Mapudungun

34
Agreement between LTI and Institute of Indigenous
Studies (IEI), Universidad de la Frontera, Chile
  • Contributions of LTI
  • Develop MT technology for indigenous languages
  • Training for data collection and transcription
  • Partial support for data collection effort
    pending funding from Chilean Ministry of
    Education
  • International coordination, technical and project
    management

35
LTI/IEI Agreement
  • Continue collaboration on data collection and
    machine translation technology.
  • Pursue focused areas of mutual interest, such as
    bilingual education.
  • Seek additional funding sources in Chile and the
    US.

36
The IEI Team
  • Coordinator (leader of a bilingual and
    multicultural education project)
  • Eliseo Canulef
  • Distinguished native speaker
  • Rosendo Huisca
  • Linguists (one native speaker, one near-native)
  • Juan Hector Painequeo
  • Hugo Carrasco
  • Typists/Transcribers
  • Recording assistants
  • Translators
  • Native speaker linguistic informants

37
MINEDUC/IEI Agreement: Highlights
  • Based on the LTI/IEI agreement, the Chilean
    Ministry of Education agreed to fund the data
    collection and processing team for the year 2001.
    This agreement will be renewed each year, as
    needed.

38
MINEDUC/IEI Agreement: Objectives
  • To evaluate the NICE/Mapudungun proposal for
    orthography and spelling
  • To collect an oral corpus that represents the four
    Mapudungun dialects spoken in Chile. The main
    domain is primary health, traditional and western.

39
MINEDUC/IEI Agreement: Deliverables
  • An oral corpus of 800 hours recorded,
    proportional to the demography of each current
    spoken dialect
  • 120 hours transcribed and translated from
    Mapudungun to Spanish
  • A refined proposal for writing Mapudungun

40
NICE/Mapudungun: Database
  • Writing conventions (Grafemario)
  • Glossary Mapudungun/Spanish
  • Bilingual newspaper, 4 issues
  • Ultimas Familias memoirs
  • Memorias de Pascual Coña
  • Publishable product with new Spanish translation
  • 35 hours transcribed speech
  • 80 hours recorded speech

41
NICE/Mapudungun: Other Products
  • Standardization of orthography: Linguists at UFRO
    have evaluated the competing orthographies for
    Mapudungun and written a report detailing their
    recommendations for a standardized orthography
    for NICE.
  • Training for spoken language collection: In
    January 2001 native speakers of Mapudungun were
    trained in the recording and transcription of
    spoken data.

42
Underfunded Activities
  • Data collection
  • Colombia (unfunded)
  • Chile (partially funded)
  • Travel
  • More contact between CMU and Chile (UFRO) and
    Colombia.
  • Training
  • Train Mapuche linguists in language technologies
    at CMU.
  • Extend training to Colombia
  • Refine MT system for Mapudungun and Siona
  • Current funding covers research on the MT engine
    and data collection, but not detailed linguistic
    analysis

43
Outline
  • History of MT--See Wired magazine May 2000 issue.
    Available on the web.
  • How well does it work?
  • Procedure for designing an LT project.
  • Choose an application: What do you want to do?
  • Identify the properties of your application.
  • Methods: knowledge-based, statistical/corpus-based,
    or hybrid.
  • Methods: interlingua, transfer, direct
  • Typical components of an MT system.
  • Typical resources required for an MT system.

44
How well does it work? Example: SpanAm
  • Possibly the best Spanish-English MT system.
  • Around 20 years of development.

45
How well does it work? Example: Systran
  • Try it on the Altavista web page.
  • Many language pairs are available.
  • Some language pairs might have taken up to a
    person-century of development.
  • Can translate text on any topic.
  • Results may be amusing.

46
How well does it work? Example: KANT
  • Translates equipment manuals for Caterpillar.
  • Input is controlled English: many ambiguities are
    eliminated. The input is checked carefully for
    compliance with the rules.
  • Around 5 output languages.
  • The output might be post-edited.
  • The result has to be perfect to prevent accidents
    with the equipment.

47
How well does it work? Example: JANUS
  • Translates spoken conversations about booking
    hotel rooms or flights.
  • Six languages: English, French, German, Italian,
    Japanese, Korean (with partners in the C-STAR
    consortium).
  • Input is spontaneous speech spoken into a
    microphone.
  • Output is around 60% correct.
  • Task completion is higher than translation
    accuracy: users can always get their flights or
    rooms if they are willing to repeat 40% of their
    sentences.

48
How well does it work? Speech Recognition
  • Jupiter weather information: 1-888-573-8255. You
    can say things like "What cities do you know
    about in Chile?" and "What will be the weather
    tomorrow in Santiago?"
  • Communicator flight reservations: 1-877-CMU-PLAN.
    You can say things like "I'm travelling to
    Pittsburgh."
  • Speechworks demo: 1-888-SAY-DEMO. You can say
    things like "Sell my shares of Microsoft."
  • These are all in English, and are toll-free only
    in the US, but they are speaker-independent and
    should work with reasonable foreign accents.

49
Different kinds of MT
  • Different applications: for example, translation
    of spoken language or text.
  • Different methods: for example, translation rules
    that are hand-crafted by a linguist or rules that
    are learned automatically by a machine.
  • The work of building an MT program will be very
    different depending on the application and the
    methods.

50
Procedure for planning an MT project
  • Choose an application.
  • Identify the properties of your application.
  • List your resources.
  • Choose one or more methods.
  • Make adjustments if your resources are not
    adequate for the properties of your application.

51
Choose an application: What do you want to do?
  • Exchange email or chat in Quechua and Spanish.
  • Translate Spanish web pages about science into
    Quechua so that kids can read about science in
    their language.
  • Scan the web: "Is there any information about
    such-and-such new fertilizer and water
    pollution?" Then if you find something that looks
    interesting, take it to a human translator.
  • Answer government surveys about health and
    agriculture (spoken or written).
  • Ask directions ("Where is the library?")
    (spoken).
  • Read government publications in Quechua.

52
Identify the properties of your application.
  • Do you need reliable, high quality translation?
  • How many languages are involved? Two or more?
  • Type of input.
  • One topic (for example, weather reports) or any
    topic (for example, calling your friend on the
    phone to chat).
  • Controlled or free input.
  • How much time and money do you have?
  • Do you anticipate having to add new topics or new
    languages?

53
Do you need high quality?
  • Assimilation: Translate something into your
    language so that you can
  • understand it--may not require high quality.
  • evaluate whether it is important or interesting
    and then send it off for a better
    translation--does not require high quality.
  • use it for educational purposes--probably
    requires high quality.

54
Do you need high quality?
  • Dissemination: Translate something into someone
    else's language, e.g., for publication.
  • Usually should be high quality.

55
Do you need high quality?
  • Two-way: e.g., chat room or spoken conversation
  • May not require high reliability on correctness
    if you have a native language paraphrase.
  • Original input: "I would like to reserve a double
    room."
  • Paraphrase: "Could you make a reservation for a
    double room?"

56
Type of Input
  • Formal text: newspaper, government reports,
    on-line encyclopedia.
  • Difficulty: long sentences
  • Formal speech: spoken news broadcast.
  • Difficulty: speech recognition won't be perfect.
  • Conversational speech
  • Difficulty: speech recognition won't be perfect
  • Difficulty: disfluencies
  • Difficulty: non-grammatical speech
  • Informal text: email, chat
  • Difficulty: non-grammatical speech

57
Methods: Knowledge-Based
  • Knowledge-based MT: a linguist writes rules for
    translation
  • noun adjective --> adjective noun
  • Requires a computational linguist who knows the
    source and target languages.
  • Usually takes many years to get good coverage.
  • Usually high quality.

58
Methods: statistical/corpus-based
  • Statistical and corpus-based methods involve
    computer programs that automatically learn to
    translate.
  • The program must be trained by showing it a lot
    of data.
  • Requires huge amounts of data.
  • The data may need to be annotated by hand.
  • Does not require a human computational linguist
    who knows the source and target languages.
  • Could be applied to a new language in a few days.
  • At the current state-of-the-art, the quality is
    not very good.

59
Methods: Interlingua
  • An interlingua is a machine-readable
    representation of the meaning of a sentence.
  • "I'd like a double room." / "Quisiera una
    habitación doble."
  • request-action+reservation+hotel (room-type=double)
  • Good for multi-lingual situations. Very easy to
    add a new language.
  • Probably better for limited domains -- meaning is
    very hard to define.
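
As a rough illustration of why adding a language is easy with an interlingua, the sketch below maps both source sentences to one language-neutral structure and generates from that structure. Everything here is a toy assumption: the `Interlingua` class, the string-matching "analyzer", the two generators, and the `+` and `=` separators used when printing the representation.

```python
from dataclasses import dataclass, field


@dataclass
class Interlingua:
    speech_act: str                 # e.g. "request-action"
    concepts: list[str]             # e.g. ["reservation", "hotel"]
    args: dict[str, str] = field(default_factory=dict)

    def __str__(self) -> str:
        head = "+".join([self.speech_act, *self.concepts])
        body = ", ".join(f"{k}={v}" for k, v in self.args.items())
        return f"{head} ({body})"


def analyze(sentence: str) -> Interlingua:
    """Toy analyzer: English and Spanish inputs map to the same meaning."""
    s = sentence.lower()
    if "double room" in s or "habitación doble" in s:
        return Interlingua("request-action", ["reservation", "hotel"],
                           {"room-type": "double"})
    raise ValueError(f"outside the toy domain: {sentence!r}")


# One generator per target language; a new language needs only a new entry
# here plus its own analyzer, not a new rule set per language pair.
GENERATORS = {
    "en": lambda il: f"I'd like a {il.args['room-type']} room.",
    "es": lambda il: "Quisiera una habitación "
                     + {"double": "doble"}[il.args["room-type"]] + ".",
}

il = analyze("Quisiera una habitación doble.")
print(il)
print(GENERATORS["en"](il))
```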

60
Multilingual Interlingual Machine Translation

61
Methods: Transfer
  • A transfer rule tells you how a structure in one
    language corresponds to a different structure in
    another language:
  • an adjective followed by a noun in English
    corresponds to a noun followed by an adjective in
    Spanish.
  • Not good when there are more than two languages
    -- you have to write different transfer rules for
    each pair.
  • Better than interlingua for unlimited domain.
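
A minimal sketch of the adjective-noun transfer rule described above, applied after a word-level lookup. The three-word lexicon and its part-of-speech tags are hypothetical, and a real transfer system would operate on parsed structures rather than flat tag sequences.

```python
# Hypothetical mini-lexicon: English word -> (Spanish word, part of speech).
LEXICON = {
    "a": ("una", "DET"),
    "double": ("doble", "ADJ"),
    "room": ("habitación", "NOUN"),
}


def transfer(english: str) -> str:
    """Translate word by word, swapping each ADJ NOUN pair to NOUN ADJ."""
    tagged = [LEXICON[w] for w in english.lower().split()]
    out, i = [], 0
    while i < len(tagged):
        # Transfer rule: ADJ NOUN (English) -> NOUN ADJ (Spanish)
        if (i + 1 < len(tagged)
                and tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN"):
            out += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)


print(transfer("a double room"))
```

The rule is specific to the English-Spanish direction, which is why each additional language pair needs its own rule set.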

62
Methods: Direct
  • Direct translation does not involve analyzing the
    structure or meaning of a language.
  • For example, look up each word in a bilingual
    dictionary.
  • Results can be hilarious: "the spirit is willing
    but the flesh is weak" can become "the wine is
    good, but the meat is lousy."
  • Can be developed very quickly.
  • Can be a good back-up when more complicated
    methods fail to produce output.
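
Direct translation in its simplest form is a dictionary lookup per word, as in this sketch. The dictionary entries are hypothetical; unknown words are passed through unchanged, which is one reason the method always produces some output and can serve as a back-up.

```python
# Hypothetical bilingual dictionary (English -> Spanish).
BILINGUAL_DICT = {"a": "una", "double": "doble", "room": "habitación"}


def direct_translate(sentence: str) -> str:
    """Replace each word by its dictionary entry; keep unknown words as-is."""
    return " ".join(BILINGUAL_DICT.get(w, w) for w in sentence.lower().split())


print(direct_translate("a double room"))
print(direct_translate("a quiet room"))
```

Because no structure is analyzed, the English word order is kept, so the adjective ends up before the noun in the Spanish output.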

63
Components of a Knowledge-Based Interlingua MT
System
  • Morphological analyzer: identify prefixes,
    suffixes, and stem.
  • Parser (sentence-to-syntactic structure for
    source language, hand-written or automatically
    learned)
  • Meaning interpreter (syntax-to-semantics, source
    language).
  • Meaning interpreter (semantics-to-syntax, target
    language).
  • Generator (syntactic structure-to-sentence) for
    target language.

64
Resources for a knowledge-based interlingua MT
system
  • Computational linguists who know the source and
    target languages.
  • As large a corpus as possible so that the
    linguists can confirm that they are covering the
    necessary constructions, but the size of the
    corpus is not crucial to system development.
  • Lexicons for source and target languages, syntax,
    semantics, and morphology.
  • A list of all the concepts that can be expressed
    in the system's domain.

65
Components of Example-Based MT: a direct
statistical method
  • A morphological analyzer and part of speech
    tagger would be nice, but not crucial.
  • An alignment algorithm that runs over a parallel
    corpus and finds corresponding source and target
    sentences.
  • An algorithm that compares an input sentence to
    sentences that have been previously translated,
    or whose translation is known.
  • An algorithm that pulls out the corresponding
    translation, possibly slightly modifying a
    previous translation.
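
The matching component can be sketched as a nearest-example lookup. This version uses Python's standard `difflib.SequenceMatcher` over word lists as the similarity measure, which is an assumption chosen for brevity; the slides do not say which measure the EBMT engine uses. The two-pair corpus is taken from the EBMT example slide.

```python
from difflib import SequenceMatcher

# Tiny "parallel corpus": the two sentence pairs from the EBMT example slide.
EXAMPLES = [
    ("I would like to meet her.", "Ayükefun trawüael fey engu."),
    ("The tallest man is my father.",
     "Chi doy fütra chi wentru fey ta inche ñi chaw."),
]


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def closest_example(sentence: str) -> tuple[str, str, float]:
    """Return the best-matching source sentence, its translation, and the score."""
    src, tgt = max(EXAMPLES, key=lambda pair: similarity(sentence, pair[0]))
    return src, tgt, similarity(sentence, src)


src, tgt, sim = closest_example("I would like to meet him.")
print(f"{sim:.2f}  {src}  ->  {tgt}")
```

The retrieved translation would then need the "slight modification" step (here, replacing the part that corresponds to "her"), which this sketch leaves out.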

66
Resources for Example Based MT
  • Lexicons would improve quality of translation,
    but are not crucial.
  • A large parallel corpus (hundreds of thousands of
    words).

67
Omnivorous Multi-Engine MT: eats any available
resources
68
Approaches we had in mind
  • Direct bilingual-dictionary lookup: because it is
    easy and is a back-up when other methods fail.
  • Generalized Example-Based MT: because it is easy
    and fast and can also be a back-up.
  • Instructable Transfer-based MT: a new, untested
    idea involving machine learning of rules from a
    human native speaker. Useful when computational
    linguists don't know the language, and people who
    know the language are not computational
    linguists.
  • Conventional, hand-written transfer rules: in
    case the new method doesn't work.