1 Design and Evaluation of a Versatile Architecture for a Multilingual Word Prediction System
- Sira E. Palazuelos Cagigas, José L. Martín Sánchez, Lisset Hierrezuelo Sabatela
  - Departamento de Electrónica. Universidad de Alcalá.
  - Alcalá de Henares. España
- Javier Macías Guarasa
  - Dpto. de Ingeniería Electrónica. ETSI de Telecomunicación. UPM.
  - Madrid. España
ICCHP06
2 Overview
- Introduction
- Word prediction system
- Description of the prediction systems for each language
- Evaluation
- Conclusions
3 Introduction (I)
- Word prediction is the set of techniques that try
to predict the word a person is typing
- Example (Spanish; typed text, followed by the words proposed):
  - La casa de los e: el, es, este
  - La casa de los es: este, esta, estado
  - La casa de los esp: español, española, especial
  - La casa de los espí: espíritu, espías, espía
  - La casa de los espír: espíritu, espíritus
  - La casa de los espíritus
- Second example, with different proposals for the same text:
  - La casa de los e: ensayos, elementos, estudios
  - La casa de los es: estudios, espíritus, estados
  - La casa de los espíritus
  - La casa de los espíritus. Dichos e: espíritus, ensayos, elementos
  - La casa de los espíritus. Dichos espíritus
4 Introduction (II)
- Justification
  - People with physical disabilities
  - Computer access for writing or communication
5 WPS: General architecture
- Main features
  - Modularity: separation between information sources and prediction methods
  - Flexibility: task and language independent
  - Power: integration of multiple information sources
- Architecture diagram (labels: Management module, Training module)
6 WPS: Lexicons
- Main lexicon
  - Word form and all the possible lemmas of each word.
  - Probabilistic information.
  - Grammatical information: POS and features.
- Dynamic lexicons: subject and personal lexicons
  - User vocabulary (new words, proper names, specific vocabulary, etc.).
  - Frequencies dependent on the user and subject.
  - Word pairs.
  - Training from pre-stored texts (subject lexicons) or the current text (personal lexicon).
  - Problem: spelling mistakes.
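The dynamic lexicons described above store user-dependent word frequencies and word pairs learned from text. A minimal sketch of that idea follows; the class name `DynamicLexicon`, its methods and the tokenization rule are assumptions made for illustration, not the structures used in the system presented here.

```python
import re
from collections import Counter


class DynamicLexicon:
    """Toy dynamic lexicon: word frequencies plus word-pair counts (hypothetical)."""

    def __init__(self):
        self.word_counts = Counter()
        self.pair_counts = Counter()

    def train(self, text):
        # Lowercase and keep runs of letters (accented letters included).
        words = re.findall(r"[^\W\d_]+", text.lower())
        self.word_counts.update(words)
        # Count adjacent word pairs; sentence boundaries are ignored here.
        self.pair_counts.update(zip(words, words[1:]))

    def complete(self, prefix, n=3):
        # Most frequent known words starting with the typed prefix.
        matches = [(w, c) for w, c in self.word_counts.items() if w.startswith(prefix)]
        matches.sort(key=lambda wc: (-wc[1], wc[0]))
        return [w for w, _ in matches[:n]]


lexicon = DynamicLexicon()
lexicon.train("La casa de los espíritus. Dichos espíritus")
proposals = lexicon.complete("e")
```

Because such a lexicon learns directly from what the user writes, a misspelled word would be stored and proposed like any other, which is the spelling-mistake problem noted on the slide.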
7 WPS: Prediction methods
- Word probabilistic grammars
  - Unigrams, bigrams, trigrams.
- POS probabilistic grammars
  - Bipos, tripos.
  - Smoothing.
  - Fall back.
  - Basic feature management.
- Stochastic context-free grammar (SCFG)
  - Probabilistic information.
  - Possibility to include optional symbols in the rules (each with its probability).
  - Feature agreement, imposition and prohibition.
  - Word form and lemma prohibition and imposition.
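To make the word n-gram idea and the fall-back step concrete, here is a small sketch of bigram prediction that falls back to unigram counts when the bigram model offers too few candidates. The function names, the toy corpus and the fall-back rule are assumptions for illustration; the slides do not specify how the system implements smoothing or fall back.

```python
from collections import Counter, defaultdict


def train_ngrams(sentences):
    """Collect unigram counts and bigram counts (previous word -> next word)."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        for prev, cur in zip(words, words[1:]):
            bigrams[prev][cur] += 1
    return unigrams, bigrams


def predict(prefix, previous_word, unigrams, bigrams, n=3):
    """Propose up to n words starting with prefix, bigram candidates first."""

    def ranked(counts):
        matches = [(w, c) for w, c in counts.items() if w.startswith(prefix)]
        matches.sort(key=lambda wc: (-wc[1], wc[0]))
        return [w for w, _ in matches]

    proposals = ranked(bigrams.get(previous_word, {}))
    # Fall back: fill the remaining slots from the unigram counts.
    for word in ranked(unigrams):
        if len(proposals) >= n:
            break
        if word not in proposals:
            proposals.append(word)
    return proposals[:n]


# Hypothetical toy corpus, only for demonstration.
corpus = [
    "la casa de los espíritus",
    "los estudios de los espíritus",
    "el estado de la casa",
]
unigrams, bigrams = train_ngrams(corpus)
after_los = predict("e", "los", unigrams, bigrams)
after_casa = predict("e", "casa", unigrams, bigrams)
```

With this toy corpus, the words seen after "los" are ranked first, and the list is completed from the unigram counts when the context provides fewer than three candidates.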
8 WPS: Heuristics
- Elimination of the words previously rejected by the user.
- Prediction of the most frequent word suffixes beginning with the last letter.
- Automatic insertion of spaces after punctuation marks.
- Automatic capitalization after a period.
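Three of these heuristics can be sketched as small text-processing helpers. The function names and the exact rules below (for example, also capitalizing at the very start of a text) are assumptions for illustration and are not taken from the system itself.

```python
def drop_rejected(proposals, rejected):
    """Remove words the user has already rejected for the current word."""
    return [word for word in proposals if word not in rejected]


def capitalize_if_sentence_start(word, text_so_far):
    """Capitalize a proposal when it follows a period (or starts the text)."""
    stripped = text_so_far.rstrip()
    if not stripped or stripped.endswith("."):
        return word.capitalize()
    return word


def space_after_punctuation(text, marks=".,;:!?"):
    """Append a space when the text ends with a punctuation mark."""
    if text and text[-1] in marks:
        return text + " "
    return text


remaining = drop_rejected(["este", "esta", "estado"], {"esta"})
proposal = capitalize_if_sentence_start("dichos", "La casa de los espíritus. ")
text = space_after_punctuation("La casa de los espíritus.")
```

Each helper saves keystrokes in a different way: the first frees a slot in the proposal list, while the other two spare the user the shift key and the space bar.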
9 WPS: Management module
- The management module
  - Processes the input from the user interface (text written by the user).
  - Manages the information flow between the different prediction methods (coordinating the data each one needs and provides) and the transactions with the lexicons.
  - Obtains the word prediction list that each method provides and combines them to send the most suitable proposals to the user interface.
  - Applies the heuristics.
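The slides do not say how the lists from the different methods are combined. One common option, shown here purely as an assumption, is a weighted sum of the scores each method assigns to a word; the function name, the method names and the weights are all hypothetical.

```python
def combine_predictions(method_outputs, weights, n=3):
    """Merge per-method word scores with a weighted sum and return the top n words."""
    combined = {}
    for method, scores in method_outputs.items():
        weight = weights.get(method, 0.0)
        for word, score in scores.items():
            combined[word] = combined.get(word, 0.0) + weight * score
    ranked = sorted(combined.items(), key=lambda ws: (-ws[1], ws[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical scores from two prediction methods for the prefix "e".
outputs = {
    "bigram": {"espíritus": 0.6, "estudios": 0.4},
    "unigram": {"el": 0.5, "es": 0.3, "este": 0.2},
}
weights = {"bigram": 0.7, "unigram": 0.3}
top = combine_predictions(outputs, weights)
```

Keeping the combination step separate from the methods themselves fits the modular design: a method can be added or removed by changing only the inputs to this step.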
10 WPS: User interface
11 Description of the prediction systems for each language
- Lexicons, word prediction algorithms and heuristics, with the number of languages marked for each (columns: Spanish, English, Portug., Swedish)
Heuristic / lexicon / word prediction algorithm   Languages marked
Main lexicon                                      4 of 4
Subject lexicon                                   4 of 4
Personal lexicon                                  4 of 4
Unigram                                           4 of 4
2-grams to 6-grams                                3 of 4
Static bipos and tripos                           2 of 4
Features management                               1 of 4
SCFG                                              1 of 4
Suffixes prediction                               1 of 4
Elimination of rejected words                     4 of 4
Auto capitalization                               4 of 4
Spaces after punct. marks                         4 of 4
12 Evaluation (I)
- Results of the Spanish word prediction system (% of saved keystrokes)
Exp.  Configuration                                     Result (%)  Relative impr. (%)
1     Static bipos and tripos and features management   42.7
2     Exp. 1 plus 2-grams to 6-grams                    51.9        21.5
3     Exp. 2 plus personal lexicon                      53.3        2.7
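The relative improvement figures are consistent with the usual definition: the gain over the previous experiment divided by the previous result. The short check below reproduces the Spanish values; the function name is a hypothetical helper, and the formula is inferred from the numbers rather than stated on the slide.

```python
def relative_improvement(previous, current):
    """Relative improvement of current over previous, as a percentage."""
    return 100.0 * (current - previous) / previous


# Saved keystrokes (%) for Spanish experiments 1 to 3.
spanish_results = [42.7, 51.9, 53.3]
for before, after in zip(spanish_results, spanish_results[1:]):
    print(round(relative_improvement(before, after), 1))  # prints 21.5, then 2.7
```

The same calculation applied to the English, Swedish and Portuguese results gives the values listed in their tables.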
13 Evaluation (II)
- Results of the English word prediction system (% of saved keystrokes)
Exp.  Configuration                    Result (%)  Relative impr. (%)
1     Static bipos and tripos          28.2
2     Exp. 1 plus 2-grams to 6-grams   37.4        32.6
3     Exp. 2 plus personal lexicon     47.7        27.5
14 Evaluation (III)
- Results of the Swedish word prediction system (% of saved keystrokes)
Exp.  Configuration                    Result (%)  Relative impr. (%)
1     Unigrams                         33.8
2     Exp. 1 plus 2-grams to 6-grams   42.7        26.3
3     Exp. 2 plus personal lexicon     47.7        11.7
15 Evaluation (IV)
- Results of the Portuguese word prediction system (% of saved keystrokes)
Exp.  Configuration                    Result (%)  Relative impr. (%)
1     Unigrams                         38.2
2     Exp. 1 plus 2-grams to 6-grams   42.8        12.0
3     Exp. 2 plus personal lexicon     45.0        5.1
16 Evaluation (V)
- The percentage of saved keystrokes is more than 45%, and the percentage of words predicted before the user types all their letters is usually over 90-95%, for all the languages, lexicons and methods.
- The differences between the results are due to:
  - The amount of information sources available for each language.
    - The grammatical information for Spanish has been specially designed to optimize the prediction process, while the grammatical information available for English was the one included in the BNC (British National Corpus).
  - Agreement between the test and training texts.
    - If the subject of the test and training texts is the same, the prediction obtained by the n-gram based methods could be very good, leaving a narrow margin to the personal lexicon.
- For the best trained languages (Spanish and English), the results for experiment 3 are very similar, showing the power of the personal lexicon.
17 Conclusions (I)
- The architecture, lexicons and word prediction methods of a prediction system have been described.
- The system architecture is:
  - Modular, with independent modules and well defined interfaces between them.
  - Flexible: it allows prediction methods or lexicons to be changed easily for the same or a different language.
- The system has been evaluated for Spanish, English, Swedish and Portuguese, with results of more than 45% of saved keystrokes and over 90% of predicted words.
18 Conclusions (II)
- The results of the evaluation show that:
  - The architecture is able to efficiently handle different languages with similar performance.
  - There are important differences when including additional information sources in the prediction process, when compared with the basic prediction methods.
  - The improvements strongly depend on the previous information (the availability of the grammatical information, features, the number of words in the main lexicon, etc.).
  - N-grams also produced results varying with the agreement between the test and training texts.
  - The use of flexible methods, like the personal and subject lexicons, produces the best results for all the languages, due to their capability to adapt to the new vocabulary and frequencies. They compensate for the shortcomings of the fixed lexicons.
19 Thank you
- Thank you for your attention
- For further information
  - Email to sira_at_depeca.uah.es
  - PhD thesis with the explanation of the architecture
    - http://www.depeca.uah.es/personal/sira/Documentos/thesisSiraEnglish.pdf
    - http://www.depeca.uah.es/personal/sira/Documentos/TraspasTesisIngles47.zip
  - Report: Report on Word Prediction for Spanish, English and Swedish
    - http://www.depeca.uah.es/personal/sira/Documentos/Report on Word Prediction.pdf
  - Papers on Word Prediction (in English or Spanish)
    - http://www.depeca.uah.es/personal/sira/