Title: CSA405: Advanced Topics in NLP
1CSA405 Advanced Topics in NLP
- Machine Translation I
- Introduction to MT
2Outline
- MT: Machine Translation
- Why MT is important
- What MT is and why MT is difficult
- MT and the Human Translator
3Why Machine Translation is Important
4Implications of Multilinguality
5Commercial Interest
- US has invested in MT for intelligence purposes
- MT is popular on the web: it is the most used of Google's special features
- EU spends more than 1B per annum on translation
6Academic Interest
- Different NL technologies include
- parsing
- generation
- morphology
- pronoun resolution
- understanding ...
7Misconceptions about MT
- MT is a waste of time because
- you will never make a machine that can translate Shakespeare.
- the quality of translation you can get from an MT system is very low.
- MT threatens the jobs of translators.
- MT systems are machines, and buying an MT system should be very much like buying a car.
8Facts about MT
- There are many situations where the ability to produce reliable, if less than perfect, translations at high speed is valuable.
- MT systems can take over some of the boring, repetitive translation jobs and allow human translators to concentrate on more interesting specialist tasks.
- Building an MT system is an arduous and time-consuming job, involving the construction of grammars and very large monolingual and bilingual dictionaries.
9The Place for MT
- Human Translators are good at
- Getting the right turn of phrase
- Preserving translation equivalence
- Human Translators are bad at
- Dictionary look-up
- Consistency of translation
- Translation of terminology
- MT can exploit these weaknesses
10Summary
- MT is important because
- There are too few human translators
- Availability of materials in the appropriate language has significant economic consequences.
- Scientifically, it is still one of the best test areas for language technology
11Why Translation is Difficult
12What Makes MT Hard
- Style and Meaning
- Word Order
- Word Sense
- Pronouns
- Tense
- Idioms
13Style and Meaning
- As recently as a decade ago it was widely
believed that infectious disease was no longer
much of a threat in the developed world. The
remaining challenges to public health there, it
was thought, stemmed from noninfectious
conditions such as cancer, heart disease and
degenerative diseases.
- Il y a une dizaine d'années, on croyait que les pays industrialisés étaient débarrassés des risques liés aux maladies infectieuses et que la santé publique n'était menacée que par des maladies comme le cancer, les troubles cardiaques, et les anomalies génétiques.
14Style and Meaning
- English
- Two sentences
- infectious disease was no longer much of a threat in the developed world
- The remaining challenges to public health there
- noninfectious conditions
- French
- One sentence
- les pays industrialisés étaient débarrassés des risques liés aux maladies infectieuses
- la santé publique n'était menacée que
- maladies
15Different word orders
- English word order is subject - verb - object
- Japanese word order is subject - object - verb
- English: IBM bought Lotus
- Japanese: IBM Lotus bought
- English: Reporters said IBM bought Lotus
- Japanese: Reporters IBM Lotus bought said
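The reordering in these examples can be sketched in a few lines of Python. This is only a toy illustration: the tuple-based clause representation and the function name `to_sov` are assumptions made for this sketch, not part of any real MT system, and real Japanese would also need particles and morphology that are ignored here.

```python
# Toy clause representation (an assumption for illustration, not parser output):
# a clause is a tuple (subject, verb, obj), where obj is a word or another clause.

def to_sov(clause):
    """Linearize a (subject, verb, object) clause in subject-object-verb order."""
    subject, verb, obj = clause
    # An embedded clause is reordered recursively before the main verb is appended.
    obj_words = to_sov(obj) if isinstance(obj, tuple) else [obj]
    return [subject] + obj_words + [verb]


simple = ("IBM", "bought", "Lotus")
embedded = ("Reporters", "said", ("IBM", "bought", "Lotus"))

print(" ".join(to_sov(simple)))    # IBM Lotus bought
print(" ".join(to_sov(embedded)))  # Reporters IBM Lotus bought said
```

The recursion is what makes the embedded example work: the inner clause is fully reordered before the outer verb "said" is placed at the end.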
16Word Sense Ambiguity
- Bank as in river
- Bank as in financial institution
- Plant as in tree
- Plant as in factory
- Different senses usually translate into different
words
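To make the point concrete, here is a minimal sketch of sense-dependent lexical choice, assuming French as the target language. The sense inventory `SENSES`, the cue words, and the function `translate_word` are all hypothetical and hand-written for this example; a real system would need far richer context modelling than counting overlapping cue words.

```python
# Hypothetical sense inventory: each sense lists context cue words and a
# French translation. The cue words are invented for this illustration.
SENSES = {
    "bank": [
        {"gloss": "river", "cues": {"river", "water", "fishing"}, "fr": "rive"},
        {"gloss": "financial institution", "cues": {"money", "loan", "account"}, "fr": "banque"},
    ],
    "plant": [
        {"gloss": "tree", "cues": {"leaves", "garden", "tree"}, "fr": "plante"},
        {"gloss": "factory", "cues": {"workers", "production", "car"}, "fr": "usine"},
    ],
}


def translate_word(word, sentence):
    """Return the translation of the sense whose cues best match the sentence.

    Returns None when the word is unknown or no cue word appears at all,
    i.e. when this toy model cannot resolve the ambiguity.
    """
    senses = SENSES.get(word)
    if senses is None:
        return None
    context = set(sentence.lower().split())
    best = max(senses, key=lambda sense: len(sense["cues"] & context))
    if not best["cues"] & context:
        return None
    return best["fr"]


print(translate_word("bank", "He sat on the bank of the river"))     # rive
print(translate_word("bank", "She opened an account at the bank"))   # banque
print(translate_word("plant", "The car plant hired more workers"))   # usine
```

Returning None for the unresolved case, rather than guessing, keeps the ambiguity visible to whatever component (or person) handles it next.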
17Hutchins and Somers (1992)
18Problems: Contextual Interpretation
OPEN
19Different Cultural Models
- English: health insurance; German: Krankenversicherung; French: assurance maladie
- English: validate; French: oblitérer
20Differences in Marking of Semantic Information
- Head marking.
- In English the possessive relation is marked on the dependent (the possessor): the man's house
- In Hungarian it is marked on the head (the possessed noun): the man house-his
- his house / sa maison
- Direction and manner of motion marking
- He ran into the room (English)
- He entered the room running (French)
21Summary
- Translation is about more than equivalence of meaning.
- Translation may involve the resolution of ambiguity.
- Preservation of intention involves cultural background as well as linguistic knowledge.
- Translation is a hard problem for humans, let alone machines.
22Similarities and Differences Between Languages
- Differences
- Morphology
- Word order and syntactic structures
- Marking of semantic distinctions
- Lexical
- Similarities
- Communicative function for survival
- Mechanisms for reference to people, eating, politeness, time.
- Syntactic complexity
- Nouns
- Verbs
23Machine Translation and Human Translators
24In the Beginning ... was the dream of FAMT
- Fully Automatic (High Quality) Machine Translation (Bar-Hillel 1960)
- Source language text -> FAHQMT -> Target language text
25FAMT
- Basic Characteristics
- No human intervention
- Arbitrary text
- Evaluation Criteria
- Quality of output
- Cost (per page)
- Speed (pages/hour)
26FAMT Success Story: TAUM METEO
- Written by Chevalier et al. 1978.
- Translation of weather reports from English to French
- Highly constrained subset of English
- Small number of senses for each word
- Restricted syntactic constructions
- System determines whether a given sentence is within its capabilities
- Very fast, very accurate, no post-editing
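The idea that a system checks whether a sentence is within its capabilities can be illustrated with a very rough sketch. Everything here is an assumption made for illustration: the word list `SUBLANGUAGE_VOCAB` and the function names are invented, and a system of the kind described above would rely on a full grammar and lexicon rather than a simple vocabulary test.

```python
# Hypothetical weather-report vocabulary, invented for this sketch.
SUBLANGUAGE_VOCAB = {
    "cloudy", "sunny", "with", "a", "chance", "of", "showers",
    "rain", "today", "tonight", "light", "winds",
}


def within_capabilities(sentence):
    """Accept a sentence only if every word belongs to the sublanguage."""
    words = sentence.lower().split()
    return bool(words) and all(word in SUBLANGUAGE_VOCAB for word in words)


def partition(sentences):
    """Split sentences into those the toy system accepts and those it rejects."""
    accepted = [s for s in sentences if within_capabilities(s)]
    rejected = [s for s in sentences if not within_capabilities(s)]
    return accepted, rejected


accepted, rejected = partition([
    "Cloudy with a chance of showers",
    "The minister resigned today",
])
print(accepted)  # ['Cloudy with a chance of showers']
print(rejected)  # ['The minister resigned today']
```

Rejecting out-of-scope input up front is what lets such a system stay accurate on the sentences it does accept.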
27FAMT MORAL
- FAMT can work well, but only if we give up one or more of the goals, e.g.
- Unrestricted text input
- High quality translation
- This observation has led to research on sub-languages
- And to the use of FALQT
28FAMT is not the only way
- FAMT lies at one extreme of a continuum of ways in which technology can be brought to bear upon the translation problem
- At the other extreme there are word processing software, fax machines, and even mobile phones
- Between these two extremes there are other points of interest where technology can radically affect the productivity of the individual translator.
29MAHT and HAMT
- Machine Aided Human Translation (MAHT)
- Human Aided Machine Translation (HAMT).
- The essential difference between these two lies
not only in the way in which the person is
involved but also in the extent of their
involvement
30MAHT - Translation Memories
- Systems consist of a database in which each source sentence of a translation is stored together with its target sentence (each such pair is called a translation memory "unit")
- Each new source sentence is searched for in the database and a match value is calculated.
- When the match value is 100%, the translation of the source sentence from the database is inserted into the text being translated.
31MAHT - Translation Memories
- If the match value is below 100% but above a certain user-definable percentage (a "fuzzy match"), the old translation is inserted as a translation proposal for the translator to review and edit.
- Sentences with match values below that threshold have to be translated from scratch.
- New and changed translations are then stored in the database for future use.
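The lookup procedure described on these two slides can be sketched as follows. This is a minimal illustration under stated assumptions, not how any particular TM product works: the match value is approximated with the character-level similarity ratio from Python's standard `difflib` module, the memory is a plain dictionary, and the names `match_value`, `lookup`, `store` and the default `fuzzy_threshold` of 75 are all hypothetical.

```python
from difflib import SequenceMatcher


def match_value(stored, new):
    """Similarity between two source sentences as a percentage (0 to 100)."""
    return 100.0 * SequenceMatcher(None, stored, new).ratio()


def lookup(memory, source, fuzzy_threshold=75.0):
    """Return (status, proposal, score) for a new source sentence.

    status is "exact" (insert directly), "fuzzy" (proposal for the
    translator to review), or "none" (translate from scratch).
    """
    best_source, best_score = None, 0.0
    for stored in memory:
        score = match_value(stored, source)
        if score > best_score:
            best_source, best_score = stored, score
    if best_source is not None and best_score == 100.0:
        return "exact", memory[best_source], best_score
    if best_source is not None and best_score >= fuzzy_threshold:
        return "fuzzy", memory[best_source], best_score
    return "none", None, best_score


def store(memory, source, target):
    """Save a new or revised translation unit for future use."""
    memory[source] = target


memory = {"Press the red button.": "Appuyez sur le bouton rouge."}

print(lookup(memory, "Press the red button.")[0])       # exact
print(lookup(memory, "Press the green button.")[0])     # fuzzy
print(lookup(memory, "The weather is nice today.")[0])  # none

# After the translator edits the fuzzy proposal, the new unit is stored.
store(memory, "Press the green button.", "Appuyez sur le bouton vert.")
print(lookup(memory, "Press the green button.")[0])     # exact
```

Note that the fuzzy proposal for "green" still says "rouge": the match value says nothing about whether the stored translation is correct for the new sentence, which is why the translator has to review it.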
32MAHT - Translation Memories - Advantages
- Avoid redoing translation of repeated material
- Use previous texts as a model for new translations
- Ensure consistency throughout a translation
33MAHT - Translation Memories - Drawbacks
- If terminology changes between projects, the content of a TM needs to be updated to reflect these changes.
- Blind faith in exact matches (without validation) can generate incorrect translations, since the context in which the new segment is used is not checked against the context in which the original one was used.
34MAHT - Translation Memories - Remarks
- Translation process: TM tools may not fit easily into existing translation or localization processes; they work best where work can be signed off in pieces rather than as a whole.
- Customisation: TM tools rarely work straight out of the box. Menu adaptation and filters for desktop applications may require significant effort.
- Investment costs are high
- Setup and maintenance of TMs have to be factored in.
- OpenTag/TMX formats exist for exchanging TM data between competing systems
35MAHT - Other Technology
- Communication/coordination amongst translators
- Integration of internet technologies and web services.
- Database technology, smart indexing, and networking
- Improvements can be achieved that are well within the scope of current technology.
36HAMT - Human Aided Machine Translation
- Machine retains the initiative but works in collaboration with a human consultant.
- System translates autonomously until it recognises that a linguistic difficulty of a certain type has arisen, e.g.
- ambiguity
- pronoun reference
- unknown word
- unrecognised construction
- At this point it seeks help from the consultant.
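The control flow described here, where the machine keeps the initiative and only interrupts the human when it detects a difficulty, might look roughly like the sketch below. It is a deliberately simplified word-for-word toy: the lexicon `LEXICON`, the callback signature of `ask_consultant`, and the difficulty labels are assumptions made for this example, and only two of the listed difficulty types (ambiguity and unknown word) are modelled.

```python
# Hypothetical toy lexicon: one candidate means the word is unambiguous,
# several candidates mean the system cannot choose on its own.
LEXICON = {
    "the": ["la"],
    "bank": ["banque", "rive"],
}


def translate(words, ask_consultant):
    """Translate word by word, consulting the human only when stuck.

    ask_consultant(kind, word, options) is called with the type of
    difficulty, the problem word, and any candidate translations, and
    must return the text to use in the output.
    """
    output = []
    for word in words:
        candidates = LEXICON.get(word.lower(), [])
        if len(candidates) == 1:
            output.append(candidates[0])  # no difficulty: proceed alone
        elif not candidates:
            output.append(ask_consultant("unknown word", word, []))
        else:
            output.append(ask_consultant("ambiguity", word, candidates))
    return output


def scripted_consultant(kind, word, options):
    """Stand-in for a human: picks the first option or supplies a translation."""
    print(f"[{kind}] {word!r} options={options}")
    return options[0] if options else "a rouvert"


print(translate(["The", "bank", "reopened"], scripted_consultant))
# ['la', 'banque', 'a rouvert']
```

Passing the consultant in as a callback keeps the translation loop in control, which mirrors the point that the machine, not the human, retains the initiative.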
37HAMT - Challenges
- Reliable identification/classification of difficulty.
- Reliable communication of difficulty to user.
- Tradeoff between quality and scope of translation.
38HAMT - Advantages
- Provided the HAMT challenges are met, a high quality of translation can be guaranteed.
- Speed, if large sections of text can be translated automatically.
- The human consultant need not have all the skills of a human translator; native competence in one or both languages may suffice.
39Summary
- Machine Translation is a continuum
- FAMT
- HAMT
- MAHT
- The utility of a given type of system cannot be assessed with very simple criteria
- The utility function involves at least the human cost, the machine cost, the quality of the result, and the nature of the translation requirements.
40Some References
- Jonathan Slocum, Machine Translation: its History, Current Status, and Future Prospects, Proc. ACL 1984, Stanford University, http://acl.ldc.upenn.edu/P/P84/P84-1116.pdf
- Martin Kay, Machine Translation, Computational Linguistics vol. 11, numbers 2-3, 1985.
- Richard Kittredge, Sublanguages, Computational Linguistics vol. 11, numbers 2-3, 1985.