Title: Leaving the Last Century
1Leaving the Last Century
- New Solutions for TM and MT
ESTeam AB, Dublin, 9/1/2002
2Rule Based Data Driven Methods
- Translation Memories are the extreme of data-driven methods
- Systran contains an old-fashioned lexicon structure
- Logos, Globalink, PTrans, etc. are rule-based MT solutions
3Rule based systems in MT
- 30 years of proven failure
- 30 years of commercial failure
- 30 years of still being used since there was no
alternative
4Rule Based MT
- Development time: 2 years minimum per language pair
- Development cost: 1 mil. per language pair
- Equally low quality in all domains
- Systran has cost the EU 200 man-years (MY) and 200 Mil. Euro for 20 language pairs
5Where is the problem?
- Rules do not work everywhere; in fact, a parse succeeds on real data only about 1 time in 100. Why?
- Words not in lexicon, spelling mistakes, and sentence (translation unit) not covered by rule.
- Link between real texts and grammar missing
- Peter doesn't love Mary, we just think he does
6Where is the problem? cont...
- The biggest failure in Translation Solutions of all time is Eurotra: 220 Mil. Euro spent on writing analysis rules
- All analysis is MONOLINGUAL but Translation isn't; it is the relationship between 2 languages, the source and its translation, and the world (domain) that they represent
7Data Driven Solutions not new
- In the 1970s: corpus-based research
- Computers were too slow and too small
- Ideas were there but couldn't be applied
8The First Data Driven Solutions
- Translation Memories date back to ca. 1980 and have been in use since ca. 1985.
- In the translation tool market there is only one commercial success: Trados
- Moving beyond TM
9Data Driven Methods in MT
- Several tests during the last 10 years by IBM, Sharp Labs and many more
- Counting alone doesn't do it.
- Credible solutions come from the merge with Translation Memory methods (a type of example-based MT)
10Integrating TM and MT
- Where do we select TM and where MT?
- Issues
- How deep can we go with TM and not lose the sentence structure?
- MT is always low quality; how can we improve it by using the TM?
- Is there a difference between Sub-sentence TM and MT?
11Requirements for Success
- Translation Memories
- Internet data
- Lexicons
- Any monolingual language material
- Good computers and serious data storage and
access tools
12Building Resources for TMT
- Global structure of domain
- Monolingual issues in each domain
- Sentence Phrase Word distribution
- Word context statistics
- etc.
- Translation issues in each domain
- Sentence Phrase Word alignment
13State of the Art
- MT Rule based
- Disambiguation on word class
- Cannot be tuned to a domain
- Language pair based (source-language dependent because of analysis)
- TM Sentence based
- Single user
- Language pair based
- Project based
14ESTeam Translator
- Multi-domain
- Multi-User (Client Server)
- Multi-lingual
- TM on Sentence and Sub-Sentence levels
- MT on the remaining parts of the sentence: uses Rules and Statistical Methods
- Improves MT translations in a domain by
- Statistical disambiguation filters
- Structuring lexicons automatically
- Post-Editing MT (Target Language Verification)
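The division of labour on this slide (sentence TM first, sub-sentence TM next, MT only for what remains) can be sketched roughly as below. This is an illustration under stated assumptions, not ESTeam's implementation: the dictionaries `sentence_tm` and `subsentence_tm`, the stand-in `toy_mt`, the toy French strings, and the comma-based segmentation are all invented for the example.

```python
# Hypothetical toy resources (not ESTeam data).
sentence_tm = {"Good morning.": "Bonjour."}
subsentence_tm = {
    "chemical products": "produits chimiques",
    "for industrial purposes": "pour l'industrie",
}


def toy_mt(segment):
    """Stand-in for an MT engine: just marks the segment as machine translated."""
    return f"<MT:{segment}>"


def translate(sentence):
    # Level 1: exact sentence match in the TM.
    if sentence in sentence_tm:
        return sentence_tm[sentence]
    # Level 2: split at commas and look each segment up in the sub-sentence TM.
    segments = [s.strip() for s in sentence.rstrip(".").split(",") if s.strip()]
    out = []
    for seg in segments:
        hit = subsentence_tm.get(seg.lower())
        # Level 3: only segments without a TM hit are sent to MT.
        out.append(hit if hit is not None else toy_mt(seg))
    return ", ".join(out) + "."


print(translate("Chemical products, all intended, for industrial purposes."))
```

Only the middle segment reaches the MT stand-in here, which is the point of the design: the less text MT has to handle, the fewer MT errors end up in the output.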
15ESTeam Goals
- Maximum reuse of data resources
- Highest possible translation quality
- Optimal control of data
- Full operational control (Workflow)
- Multiple usage with the same tools
- Translation Support
- Information Browsing
16Client Example
- Translation Agency with 11 subsidiary translation companies and freelance translators in more than 11 countries
- Human Translation supported optimally at all levels
- ESTeam Translation Workflow controls operations
17Translation Tools
- ESTeam Translator
- Translation Memory
- Machine Translation
- Term Tool
- Translation Memory Admin Tool
- Concordancer
- Aligner
- Translation Tools Administration
18Translation Memory
- Unique on the market
- Multi-lingual
- Multi-domain
- Multi-level
- sentence
- subsentence
- Client Server
19Link Architecture
- [Diagram: link architecture connecting Greek, Italian, Spanish, German, French, Danish, English, Dutch, Finnish, Portuguese and Swedish, with a slot for a new language]
20Multilingual Linking in TM and MT
- [Diagram: multilingual links among FR, EN, IT, EL, ES, FI, PT, SV, DE, DA and NL]
212Level TM
- Sentence
- Chemical and pharmaceutical products, all intended for industrial purposes.
- Sub-Sentence
- Chemical | and | pharmaceutical products | , | all intended for industrial purposes | .
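A minimal sketch of how the example sentence on this slide could be cut into sub-sentence units. The choice of segmentation points (commas, full stops and the conjunction "and") is an assumption read off the slide's example; the function name `segment` is hypothetical.

```python
import re

# Assumed segmentation points: commas, full stops and the conjunction "and".
SPLIT = re.compile(r"\s*(,|\.|\band\b)\s*")


def segment(sentence):
    """Split a sentence into sub-sentence units, keeping the separators as units."""
    return [part for part in SPLIT.split(sentence) if part]


units = segment(
    "Chemical and pharmaceutical products, all intended for industrial purposes."
)
print(units)
```

Keeping the separators as units of their own means the original sentence structure can be rebuilt after each unit has been looked up.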
22Creating Sub-Sentence Resources
- Analysing segmentation points
- Aligning sentence TM data
- Automatic statistical processing
- Manual intervention
- Loading TM with sub-sentence data
23Existing Data Resources
- Available TM Sentence Data
  Language pair            Number of sentences
  English to French        191.680
  English to Portuguese    167.287
  Portuguese to French     3.740
  French to Portuguese     42.910
24TM Multilingual Linking Effect
- ESTeam Multi-lingual and Multi-level Approach with the same data
  Language combination          Sentences   Sub-sentences
  English to/from French        191.680     61.848
  English to/from Portuguese    167.287     60.695
  French to/from Portuguese     150.048     56.562
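One plausible reading of these figures is that French and Portuguese sentences become linked whenever they share the same English sentence, which would explain why French to/from Portuguese grows from a few tens of thousands of direct pairs to 150.048. The sketch below shows that idea only; the function `link_via_pivot` and the toy data are assumptions, and the slides do not describe the actual linking mechanism.

```python
def link_via_pivot(en_fr, en_pt):
    """Derive French-Portuguese pairs from two TMs that share English sources."""
    return {fr: en_pt[en] for en, fr in en_fr.items() if en in en_pt}


# Hypothetical toy TMs keyed by the English sentence.
en_fr = {"Good morning.": "Bonjour.", "Thank you.": "Merci.", "Goodbye.": "Au revoir."}
en_pt = {"Good morning.": "Bom dia.", "Thank you.": "Obrigado."}

fr_pt = link_via_pivot(en_fr, en_pt)
print(fr_pt)
```

No new translation work is needed for the derived pairs: they come entirely from data that already exists in the two English-based memories.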
25MT and TM Integration
- Theoretical assumptions
- MT is erroneous, thus the less MT the better
- Conclusions
- Cost-effective MT development
- Minimize MT compared to TM
- Tailor TM to work for information browsing as well as translation support
26Lexical Machine Translation
- Information in the lexicon
- Domain, Frequency and Category info on Source and Target
- Disambiguation through shared info and reliability on the link
- Lexical/Category based rules that are part of the database
- Multi-lingual
- As in TM all languages CAN be linked to each other by translating into a previously existing language
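Disambiguation from domain and frequency information in the lexicon might look roughly like this. The entry layout, the field names (`target`, `domain`, `freq`) and the sample words are assumptions for illustration; the slides do not specify the lexicon schema.

```python
# Hypothetical lexicon: each source word maps to candidate translations
# annotated with a domain and a frequency.
lexicon = {
    "bank": [
        {"target": "banque", "domain": "finance", "freq": 120},
        {"target": "rive", "domain": "geography", "freq": 15},
    ]
}


def choose_translation(word, domain):
    """Prefer candidates from the current domain, then the most frequent one."""
    entries = lexicon[word]
    in_domain = [e for e in entries if e["domain"] == domain]
    pool = in_domain or entries  # fall back to all candidates if none match
    return max(pool, key=lambda e: e["freq"])["target"]


print(choose_translation("bank", "geography"))
print(choose_translation("bank", "law"))
```

When the domain is unknown to the lexicon, the sketch falls back to overall frequency, which is one simple way to still return a usable candidate.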
27Target Language Verification
- Assumption: MT is erroneous, data is correct
- Example
- Source: τροφοδοτικά μπαταριών
- MT: chargers batteries
- TLV: battery chargers
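The example above can be approximated by generating variants of the MT output and keeping the one best attested in target-language data. This is a sketch of the general idea only: the counts in `target_counts`, the crude singular/plural rule in `variants`, and the function `verify` are assumptions, not the TLV component itself.

```python
from itertools import permutations, product

# Hypothetical phrase counts standing in for real target-language data.
target_counts = {"battery chargers": 42, "batteries chargers": 1}


def variants(word):
    """Very rough singular/plural variants; a real system would use morphology data."""
    forms = {word}
    if word.endswith("ies"):
        forms.add(word[:-3] + "y")
    elif word.endswith("s"):
        forms.add(word[:-1])
    return forms


def verify(mt_output, counts):
    """Return the reordered/inflected candidate best attested in the target data."""
    words = mt_output.split()
    best, best_count = mt_output, counts.get(mt_output, 0)
    # Exhaustive permutation is only reasonable for short phrases.
    for order in permutations(words):
        for forms in product(*(sorted(variants(w)) for w in order)):
            candidate = " ".join(forms)
            count = counts.get(candidate, 0)
            if count > best_count:
                best, best_count = candidate, count
    return best


print(verify("chargers batteries", target_counts))
```

If no candidate is attested in the data, the MT output is returned unchanged, so the check can only replace output with something the data supports.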
28Languages in MT and TM
- Current: all EU Languages into all, and Norwegian (TM is Unicode; any language can be catered for)
- Goal for 2002: start work on all EU Minority Languages and Icelandic, linking them to each other and all EU languages
- Eastern European Languages to be integrated by request (approximate development duration 6 months for each new language linked to all current EU languages)
29Developing New Languages
- Semi-Automatic Lexicon Building
- Manual entry of Special Words and Morphology for TLV and TM Fuzzy
- Re-using translators' work for automatic solutions
- Sub-Sentence TM
- Phrases in Lexicon
- Automatic Alignment
- sentence
- phrase/word
30Term Tool
- User-friendly interface to all the functionality of the terminology and translation lexicons
- For Information Browsing when translating
- For building new languages and enhancing existing lexicons
31TM Admin Tool
- User-friendly interface to all the functionality of the Translation Memory including editing and viewing
- For Information Browsing when translating
- For enhancing, correcting and building new translation memories
32Concordancer
- Bilingual support tool for freelance translators
- Easy to install and use
- Directly linked to interface and stores the translations carried out by the translator
- Database: MySQL
33Aligner
- Goal: to find as many good matches as possible to build an application and discard the rest
- Uses the lexicon to verify alignment
- Manual intervention minimal
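Verifying alignments with a lexicon and discarding the rest could be sketched as follows. The scoring rule (share of source words whose lexicon translation occurs in the target), the 0.5 threshold and the names `lexicon_score` and `keep_good_pairs` are assumptions chosen for the example.

```python
def lexicon_score(source, target, lexicon):
    """Fraction of source words whose lexicon translation appears in the target."""
    source_words = source.lower().split()
    target_words = set(target.lower().split())
    if not source_words:
        return 0.0
    hits = sum(1 for w in source_words if lexicon.get(w) in target_words)
    return hits / len(source_words)


def keep_good_pairs(pairs, lexicon, threshold=0.5):
    """Keep candidate alignments the lexicon supports; discard the rest."""
    return [(s, t) for s, t in pairs if lexicon_score(s, t, lexicon) >= threshold]


# Hypothetical toy lexicon and candidate sentence pairs.
toy_lexicon = {"red": "rouge", "car": "voiture"}
candidates = [("red car", "voiture rouge"), ("red car", "bonjour monde")]
print(keep_good_pairs(candidates, toy_lexicon))
```

Favouring precision in this way matches the stated goal: good matches are kept for building the application, and doubtful ones are simply dropped rather than corrected by hand.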
34Translation Tools Admin
- Specify how to run a translation
- Operate TM
- select sentence or sub-sentence or both
- set fuzzy threshold
- etc
- Operate MT
- activate rules
- activate statistics
- set quality threshold
- etc
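The "set fuzzy threshold" option can be illustrated with a small TM lookup that accepts a match only above a similarity cut-off. This sketch uses Python's standard `difflib.SequenceMatcher` as the similarity measure, which is a substitution for whatever measure the product uses; the function `fuzzy_lookup` and the sample TM are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_lookup(source, tm, threshold=0.8):
    """Return the translation of the most similar TM source, or None below threshold."""
    best_score, best_target = 0.0, None
    for tm_source, tm_target in tm.items():
        score = SequenceMatcher(None, source, tm_source).ratio()
        if score > best_score:
            best_score, best_target = score, tm_target
    return best_target if best_score >= threshold else None


# Hypothetical one-entry translation memory.
toy_tm = {"Press the red button.": "Appuyez sur le bouton rouge."}

print(fuzzy_lookup("Press the green button.", toy_tm, threshold=0.8))
print(fuzzy_lookup("Press the green button.", toy_tm, threshold=0.95))
```

Raising the threshold trades recall for safety: fewer sentences are served from the TM, but those that are need less revision.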
35Translation Workflow
- Local Workflow
- Automatic Translation
- On-line Manual Translation / Revision
- Subsidiary Workflow
- Local Distribution / Collection
- Off-line Manual Translation
36Local Workflow Features
- Roles
- Workflow Manager
- Translators / Revisers
- Enterprise Integration
- Address book personnel and clients
- Distribution of work, availability
- Administration Data
- Archiving of processed files
37Local Workflow (pt. 1)
38Subsidiary Workflow
39Local Workflow (pt. 2)
40What is different?
- Multi-lingual
- Multi-level TM
- Single Global TM (client server and structure of database)
- Data used to decide on selection in MT
- Data used to change MT output
41Effect
- For information browsing
- serious cost savings both in development and production
- speed of development for new languages
- For translation support
- Client example: using TM data from another system provided 6 translation saving using TM only
42Technical details
- All Tools are
- Client Server
- Unicode compliant
- developed in C and Oracle
- running on Unix and Windows
- All Interfaces are developed in Java
- Workflow is developed in C, Java and Lotus Domino
43ESTeam AB
- Developer of Automatic Translation Solutions
- Legal Residence Gothenburg, Sweden, since 1995
- Development Site in Athens, Greece
- Marketing site in UK
- Products
- ESTeam Translator (2000)
- ESTeam Translation WorkFlow (2001/2)
- ESTeam BTR (1996)
44Contact Info
- ESTeam AB
- Markou Botsari 15
- 145 61 Athens
- Greece
- Email esteam_at_otenet.gr
- Tel 30 10 8085704