Title: Automating machine translation from poorly studied languages
- John Goldsmith
- Departments of Linguistics and Computer Science
Outline
- The goal of automatic translation
- The history of automatic translation
  - From the cybernetics era (1948-1960)
  - To the statistical era (1993 to date)
- The problem of complex word structure in most languages, and our solution
- Where we stand today
1. The goal of automatic translation
- There are over 6,000 human languages in use today.
- Researchers need access to documents in all of these languages over the long run.
- Defense analysts may need access to documents in any of these languages under pressing time constraints.
- Languages can be intentionally used as encryption systems.
6,000 natural languages of the world
Just the major languages of Africa alone
Computational linguistic research
- When we specify the problems that we tackle, they often sound superhumanly difficult.
- When we begin to explain the methods, they can sound far too simple.
- In fact, the methods are conceptually elegant and highly quantitative.
- The goals of linguistics, with the tools of computer science.
2. History of MT
- A reminder of where computers actually came from
- World War II, and their uses
  - more accurately aim artillery
  - efforts to break German encryption systems
- Post-war period of industrialization
Warren Weaver's memo (July 1949)
- Director, Natural Sciences Division of the Rockefeller Foundation
- "It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the 'Chinese code.' If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?"
Efforts in the 1950s
- ...stymied by
  - the lack of sufficient computing power,
  - and immature computing technology
Example
They hadn't reckoned on ambiguity when they set out to translate human languages.
January 1954
- Progress during the 1970s and 1980s was incremental.
- In the 1990s, a major sea-change in computational linguistics occurred, based on data-driven statistical techniques.
- IBM Research developed an approach to translation based on systems that learn from examples.
Statistical Machine Translation (MT)
1999: The Egypt system
- NSF funded a summer project at Johns Hopkins University: Egypt.
- Open source and widely used in research.
- Difficult to use in practice.
What do we translate?
- Do we translate sentences?
- In a sense, yes.
- L'entourage de Chirac est plus imperméable que celui de Nicolas Sarkozy.
- Chirac's inner circle is more tightly knit than that of Nicolas Sarkozy.
Sentences = W + C
- A sentence is a collection of words and constructions.
- We translate the words and the constructions.
- We will break the problem down into these two parts, then.
Word-level alignments
Given a parallel sentence pair, we can link (align) words or phrases that are translations of each other.
Le chien se est assis sur le tapis
the dog sat down on the rug
The system is given the two sentences, but without any information about how the words are aligned: the alignment links are inferred, not given.
MT: first two tasks
- Figure out word-to-word matchings (translations)
- Figure out common alignments across the source and target languages: how their word orders differ.
  - French and English: quite similar
  - Japanese, Korean: verb appears at the end of the sentence.
Just a taste
NULL?
We often find that a word in one language corresponds to nothing in the other language, so we include NULL as an ever-present possibility of translation.
NULL the dog
le chien
j = 1 (le): total = P(le | NULL) + P(le | the) + P(le | dog) = 2/3 + 1/2 = 7/6 ≈ 1.17
tc = total count; tc(a | b) = total expected count of this joint occurrence
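The arithmetic for "le" can be checked with exact fractions. This is a minimal sketch under an assumption: the slide gives the NULL and "the" terms only as a combined 2/3, so the even split of 1/3 each is a guess made for illustration, and the names `p_le` and `expected` are hypothetical.

```python
from fractions import Fraction

# Current estimates P(le | e) for each English candidate, including NULL.
# Assumption: the combined 2/3 is split evenly between NULL and "the".
p_le = {
    "NULL": Fraction(1, 3),
    "the": Fraction(1, 3),
    "dog": Fraction(1, 2),
}

# Normalizing total for the French word "le" (j = 1).
total = sum(p_le.values())

# Expected (fractional) count credited to each pairing of "le" with e:
# the single occurrence of "le" is shared out in proportion to P(le | e).
expected = {e: p / total for e, p in p_le.items()}

print("total =", total, "=", round(float(total), 2))
for e, c in expected.items():
    print(f"tc(le | {e}) += {c}")
```

Dividing by the total is what turns raw probabilities into shares of one observed word, so the three expected counts always add up to exactly 1.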
Changes in probabilities
[Tables: initialized values; iteration 2; after 5 iterations]
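How the probabilities move away from their initial values can be sketched with expectation-maximization in the style of IBM Model 1. This is an illustrative sketch, not the Egypt implementation: the two-sentence toy corpus, the uniform initialization, and the function name `train_model1` are all assumptions.

```python
from collections import defaultdict

def train_model1(corpus, iterations=5):
    """Estimate t(f | e) by EM from (french_tokens, english_tokens) pairs."""
    # Prepend NULL so a French word may correspond to nothing in English.
    pairs = [(f_sent, ["NULL"] + e_sent) for f_sent, e_sent in corpus]
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # initialized values: all equal
    for _ in range(iterations):
        tc = defaultdict(float)     # expected count of each (f, e) pairing
        total = defaultdict(float)  # expected count of e over all pairings
        for f_sent, e_sent in pairs:
            for f in f_sent:
                # E-step: share one count for f among the English candidates.
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    share = t[(f, e)] / norm
                    tc[(f, e)] += share
                    total[e] += share
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in tc.items()})
    return t

# Toy parallel corpus (an assumption, chosen to echo the slides' vocabulary).
corpus = [
    ("le chien".split(), "the dog".split()),
    ("le tapis".split(), "the rug".split()),
]

for n in (1, 2, 5):
    t = train_model1(corpus, iterations=n)
    print(n, round(t[("chien", "dog")], 3), round(t[("le", "dog")], 3))
```

Because "le" appears in both sentences while "chien" appears only alongside "dog", each iteration shifts probability mass for "dog" toward "chien" and away from "le", with no alignment information supplied.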
3. What is morphology?
- Morphology studies the internal structure of words
- English: words = word + s
- findings = find + ing + s
- Swahili: tunakisema "we speak it"
European languages are outliers.
- From the morphological point of view, most languages of the world are much more complex than European languages.
Linguistica
- Computational linguistic project under development since 1997
- http://linguistica.uchicago.edu
- Core engine: automatic morphology analyzer
- Learns the morphological structure of a language directly from a (written) sample, with no human intervention.
English illustration
- Bear in mind the system has no initial knowledge at all about English.
- It takes about 15 seconds to analyze 200,000 words of English.
- The C++ code is highly optimized, and runs two orders of magnitude faster than comparable computational linguistic systems.
Signatures
[Figure: signatures for adjectives, verbs, and nouns]
We find these automatically.
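One way to picture a signature, assuming the term means the set of suffixes that a group of stems shares, is to group stems by the exact suffixes they occur with. The following is a deliberately naive sketch and not Linguistica's actual algorithm, which these slides do not describe; the word list, the fixed candidate suffixes, and the function name `find_signatures` are assumptions made for illustration.

```python
from collections import defaultdict

def find_signatures(words, suffixes=("s", "ed", "ing")):
    """Group stems by the set of candidate suffixes they appear with."""
    stem_suffixes = defaultdict(set)
    for word in set(words):
        stem_suffixes[word].add("NULL")  # the word itself, with no suffix
        for suf in suffixes:
            if word.endswith(suf) and len(word) > len(suf):
                stem_suffixes[word[: -len(suf)]].add(suf)
    signatures = defaultdict(set)
    for stem, sufs in stem_suffixes.items():
        if len(sufs) > 1:  # keep only stems seen with at least two endings
            signatures[tuple(sorted(sufs))].add(stem)
    return dict(signatures)

words = [
    "jump", "jumps", "jumped", "jumping",
    "walk", "walks", "walked", "walking",
    "dog", "dogs", "cat", "cats",
]
for signature, stems in find_signatures(words).items():
    print(".".join(signature), sorted(stems))
```

On this toy list the verb-like stems and the noun-like stems fall into two different groups without any part-of-speech labels. A real system has to discover the suffixes themselves rather than take them as given.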
Compounds
- English makes heavy use of compounds, which are best handled if we can break them apart:
- Eastward
- eggshell
- farmhouse
- headdress
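Breaking compounds apart can be sketched as a search for a split point where both halves are themselves known words. This is a simple illustrative heuristic, not the method used in Linguistica; the small `lexicon` and the function name `split_compound` are hypothetical.

```python
def split_compound(word, lexicon, min_len=3):
    """Return every (left, right) split of word whose parts are both in lexicon."""
    word = word.lower()
    return [
        (word[:i], word[i:])
        for i in range(min_len, len(word) - min_len + 1)
        if word[:i] in lexicon and word[i:] in lexicon
    ]

# Hypothetical lexicon of known words; a real system would learn this from text.
lexicon = {"east", "ward", "egg", "shell", "farm", "house", "head", "dress"}

for w in ["Eastward", "eggshell", "farmhouse", "headdress"]:
    print(w, split_compound(w, lexicon))
```

The `min_len` parameter rules out very short pieces, which would otherwise produce many spurious splits. A word with no valid split returns an empty list.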
Compounds
[Figure panels: Selected; Rejected; Rejected]
4. Where we stand today
- Our project is working on
  - improving automatic morphology
  - integrating Egypt statistical machine translation into our package for easy application
  - improving translation by using morphology
  - testing with Swahili-English
4.1 Improving automatic morphology
- Swahili, Somali, Urdu, Finnish
- Compounds: English, Finnish
Swahili
nilimupenda nitakamupenda
Swahili verb
4.2 Integrating Egypt MT software into our front-end
- Linguistica has a user-friendly front end
- Linguistica is written in C++, and compiles under Windows, MacOS, and Linux
- Open source
4.3 Improving translations using morphology
- Developing mathematical models
- A small amount of work has been done by other researchers, but the goal has largely been to use morphology to strip off affixes.
4.4 Testing with Swahili
- 8 books from the New Testament available on the internet.
The end