Title: Query-driven dictionary enhancement
1 Query-driven dictionary enhancement
- Primo Jakopin, Birte Lönneker
- Scientific Research Center ZRC SAZU
- Ljubljana, Slovenia
2 Motivation
- Dictionary authors' question: What are the needs
of the users of my dictionary?
- Log files of online dictionaries provide direct
access to the users' requests.
- Make use of them!
3 Overview
- The dictionary Online SLO-DE-SLO
- The log file
- Use of the log file
- to evaluate current dictionary contents
- to choose the most promising corpus type for
enlarging the dictionary
- Conclusions
4 Dictionary Online SLO-DE-SLO
- Bidirectional online dictionary
- German-Slovenian
- On the Web since 2001
- Initially a learner's dictionary
- for German-speaking learners of Slovenian
5 Online SLO-DE-SLO: user interface
6 Online SLO-DE-SLO: contents
- Evaluated version (October 2003)
- Textbook corpus
- 5,172 entries
- Newspaper corpus
- 729 entries
- Total 5,901 entries
- Current version (June 2004)
- Textbook corpus
- 5,544 entries
- Newspaper corpus
- 743 entries
- Technical corpus
- 829 entries
- Total 7,116 entries
7 Online SLO-DE-SLO: entry concept
- Each entry is bilingual
- Exactly one equivalence per entry
- An entry can describe
- a basic word form
- an inflected word form
- an example sentence or phrase
- a collocation
8 Online SLO-DE-SLO: query results
9 The log file
- When a user submits a query to the dictionary, a
program writes data about the query into the log
file, e.g.
- Source language
- Submitted query string
- Selected search options
(exact string match, match at beginning of word,
match anywhere)
- Time stamp
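A minimal sketch of how one log record with the four fields listed above could be represented. The tab-separated layout, the language codes, the option labels, and the names `LogRecord` and `parse_log_line` are all hypothetical; the actual log format is not given here.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical labels for the three search options named on the slide.
SEARCH_OPTIONS = {"exact", "word_start", "anywhere"}


@dataclass(frozen=True)
class LogRecord:
    source_language: str   # assumed codes: "sl" or "de"
    query: str             # query string as typed by the user
    search_option: str     # one of SEARCH_OPTIONS
    timestamp: datetime


def parse_log_line(line: str) -> LogRecord:
    """Parse one tab-separated log line (assumed layout) into a LogRecord."""
    source_language, query, search_option, stamp = line.rstrip("\n").split("\t")
    if search_option not in SEARCH_OPTIONS:
        raise ValueError(f"unknown search option: {search_option!r}")
    return LogRecord(source_language, query, search_option,
                     datetime.fromisoformat(stamp))


# Invented example line; not taken from the real log file.
record = parse_log_line("sl\tpoljub\texact\t2002-01-06T10:15:00")
```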
10 The log file: details
- Evaluation period
- 6 January 2002 to 10 October 2003
- Number of queries stored in log file
- 131,674
- Number of queries with exact string match
- 88,879
- Only exact string match queries are evaluated
11 The log file: preprocessing
- Has to take into account how the matching is
performed when the dictionary finds an entry for
the user
- Example 1
- Dictionary matching
- Case insensitive (user enters A for a)
- Preprocessing
- Downcase all letters in log file (and in
dictionary evaluation file)
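The downcasing step of Example 1 can be sketched as follows. The function name `downcase` and the sample queries are illustrative assumptions; the same function would be applied to the dictionary evaluation file so that both sides are compared in lower case.

```python
def downcase(text: str) -> str:
    """Lowercase a query or headword; str.lower() also covers Č, Š, Ž, Ä, Ö, Ü."""
    return text.lower()


# Invented sample queries that differ only in case.
queries = ["Hiša", "HIŠA", "hiša", "Čas"]
normalized = [downcase(q) for q in queries]
```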
12 The log file: preprocessing
- Example 2a
- Dictionary matching
- Substitution of special characters for easier
access (user enters ae for ä)
- Preprocessing version I
- Make a second version of log file
- Replace ae by ä in second version
- Use spell checker word list to find valid
versions
- Check ambiguous cases manually
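One possible reading of preprocessing version I, as a hedged sketch: build the ä variant of a query, look up both spellings in a word list, and flag the query for manual checking when the word list does not decide. The function `resolve_ae`, its status labels, and the tiny word list are assumptions made for illustration.

```python
def resolve_ae(query: str, word_list: set) -> tuple:
    """Return (chosen spelling, status) for a downcased German query.

    status is "unchanged" (no 'ae' present), "replaced", "kept",
    or "manual" (both or neither spelling is in the word list).
    """
    candidate = query.replace("ae", "ä")
    if candidate == query:
        return query, "unchanged"
    plain_ok = query in word_list
    umlaut_ok = candidate in word_list
    if umlaut_ok and not plain_ok:
        return candidate, "replaced"
    if plain_ok and not umlaut_ok:
        return query, "kept"
    return query, "manual"


# Tiny stand-in for a spell checker word list (assumed contents).
words = {"mädchen", "israel", "haus"}
results = [resolve_ae(q, words) for q in ["maedchen", "israel", "haus", "xaey"]]
```

A word such as "israel" shows why a blind replacement is not enough: its "ae" is genuine, and only the word list lookup keeps it intact.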
13 The log file: preprocessing
- Example 2b
- Dictionary matching
- Substitution of special characters for easier
access (user enters c for č)
- Preprocessing version II
- Make a second version of log file
- Replace c by č in second version
- Use frequencies of parallel spellings to find
valid versions
- Check ambiguous cases manually
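A sketch of how preprocessing version II might work, under several assumptions: parallel spellings are generated by optionally replacing each plain letter with its diacritic counterpart, a frequency list decides which spelling is attested, and a query is flagged for manual checking when more than one spelling is attested. Only c/č is named on the slide; the pairs s/š and z/ž, the function names, and the frequency values are assumptions.

```python
from itertools import product

# Only c/č appears on the slide; s/š and z/ž are assumed by analogy.
PAIRS = {"c": "č", "s": "š", "z": "ž"}


def parallel_spellings(query: str) -> list:
    """All spellings obtained by optionally adding a diacritic to c, s, z."""
    options = [(ch, PAIRS[ch]) if ch in PAIRS else (ch,) for ch in query]
    return ["".join(chars) for chars in product(*options)]


def pick_spelling(query: str, frequencies: dict) -> tuple:
    """Return (best spelling, needs_manual_check)."""
    attested = [s for s in parallel_spellings(query) if frequencies.get(s, 0) > 0]
    if not attested:
        return query, True            # nothing attested: check by hand
    best = max(attested, key=lambda s: frequencies[s])
    return best, len(attested) > 1    # several attested spellings are ambiguous


# Invented frequencies for illustration only.
freq = {"čas": 120, "kos": 40, "koš": 35}
clear_case = pick_spelling("cas", freq)
ambiguous_case = pick_spelling("kos", freq)
```

The number of variants doubles with every c, s, or z in the query, which is acceptable for short dictionary queries but would need pruning for long strings.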
14 The log file: preprocessing
- Users sometimes select the wrong source
language (SL)
- Correct SL could be found using spell checker
lists for both languages
- In our case, spell checker lists taken from
Online SLO-DE-SLO detect
- ...wrongly determined SL Slovenian: 378
- ...wrongly determined SL German: 593
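The source language check described above could be sketched like this, assuming one word list per language and the rule that a query is reassigned only when it is missing from the declared language's list and present in the other one. The language codes, the function name, and the word lists are hypothetical.

```python
def detect_source_language(query: str, declared: str, word_lists: dict) -> str:
    """Return the corrected source language code ("sl" or "de") for a query."""
    other = "de" if declared == "sl" else "sl"
    if query not in word_lists[declared] and query in word_lists[other]:
        return other
    return declared


# Tiny stand-ins for the two word lists (assumed contents).
word_lists = {"sl": {"hiša", "poljub"}, "de": {"haus", "kuss"}}
corrected = detect_source_language("kuss", "sl", word_lists)
unchanged = detect_source_language("hiša", "sl", word_lists)
```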
15 Evaluation I: Queries against dictionary
- Question: To what extent does the dictionary
satisfy users' requests?
- Method: match preprocessed queries against
downcased dictionary entries, language by language
16 Evaluation I
- Dictionary entries: 5,901
- German distinct (downcased): 5,289
- Slovenian distinct: 5,103
- Compare these lists with queries
- Result I (tokens)
- German: 40.7% of queries match
- Slovenian: 38.3% of queries match
17 Evaluation I
- Result II (types)
- types = distinct queries
- German: 10.4% of types match
- Slovenian: 12.7% of types match
- Well-known frequency distribution also in query
log file
- a few types occur very often and many types
occur rarely
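The difference between the token result and the type result can be illustrated with a small sketch. It assumes the queries are already preprocessed and that matching is a simple set lookup against the downcased headwords; the function name and the toy data are invented.

```python
from collections import Counter


def match_rates(queries: list, headwords: set) -> tuple:
    """Return (token match rate, type match rate) for a non-empty query list."""
    counts = Counter(queries)
    matched_tokens = sum(n for q, n in counts.items() if q in headwords)
    matched_types = sum(1 for q in counts if q in headwords)
    return matched_tokens / sum(counts.values()), matched_types / len(counts)


# Toy log: one frequent query, a few rare ones (invented data).
queries = ["dan"] * 6 + ["hiša"] * 2 + ["poljub", "krava"]
headwords = {"dan", "hiša"}
token_rate, type_rate = match_rates(queries, headwords)
```

In this toy log 8 of 10 query tokens match but only 2 of 4 distinct types do, which mirrors the skewed distribution noted above: frequent queries tend to be covered, while the many rare ones are not.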
18 Evaluation I: Qualitative results
- Online SLO-DE-SLO still lacks some expressions
and words used in social relations and everyday
life, e.g.
- Slovenian top unmatched queries
- regard, offer, confirmation, cow, payment, kiss,
oak, to miss, fond of, to teach, sale,...
- German top unmatched queries
- kiss, welcome, regard, regards, good morning,
treasure, to fuck, good evening,...
19 Corpus-based enlargement
- Log file entries alone are not enough
- The enlargement of the dictionary should stay
corpus-based, because the dictionary author wants
to
- find appropriate examples of use
- find also collocations and idioms
- find more words that are likely to be of interest
to typical users
20 Evaluation II: Outline
- Which corpus should be used next to enlarge
Online-SLO-DE-SLO?
- Which corpus best reflects the structure of the
entire vocabulary entered by the users?
- Evaluation of Slovenian queries using Slovenian
corpora (subcorpora of the Nova Beseda corpus)
21 Evaluation II: Queries against corpora
- Evaluate corpora of three text types
- Newspaper: 88 million
- Fiction: 5.7 million
- Technical: 6.3 million
22 Evaluation II: Method
- Compare lemmas in user queries with relative
frequencies in the three corpora.
- Lemmatize Slovenian queries and assign POS
- Retain lemmatized content words and
interjections
- 7,246 query lemmas
23 Evaluation II: Method
- Lemmatize each corpus (currently with
ambiguities)
- Calculate relative frequencies (per 1 million) of
lemmas in three corpora
- Assign a weight to each query lemma for each
corpus: multiply the number of queries by the
relative frequency
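The weighting step can be written out as a short sketch: the relative frequency is the lemma's corpus count scaled to one million, and the weight is that value multiplied by the number of queries for the lemma. The function names and the counts in the example are invented, and the corpus size is assumed to be measured in the same units as the lemma counts.

```python
def relative_frequency(corpus_count: int, corpus_size: int) -> float:
    """Occurrences of a lemma per 1 million corpus units."""
    return corpus_count * 1_000_000 / corpus_size


def lemma_weight(query_count: int, corpus_count: int, corpus_size: int) -> float:
    """Weight of one query lemma in one corpus: queries x relative frequency."""
    return query_count * relative_frequency(corpus_count, corpus_size)


# Invented counts: a lemma queried 12 times that occurs 2,850 times
# in a corpus of 5.7 million units.
weight = lemma_weight(query_count=12, corpus_count=2_850, corpus_size=5_700_000)
```

Here the relative frequency is 500 per million, so the weight is 12 x 500 = 6,000. A lemma that users ask for often and that is also frequent in the corpus contributes most.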
24 Evaluation II: Example
- First seven lines of fiction corpus evaluation
(alphabetical order)
25 Evaluation II: Result
- All lemma weights are summed up for each of the
three corpora separately
- Fiction: 10,262,558
- Newspaper: 9,694,125
- Technical: 9,369,494
- The fiction corpus reflects the user queries best
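The summation and comparison can be sketched as follows, assuming per-corpus tables of relative frequencies (per 1 million) and treating a lemma that is absent from a corpus as having frequency 0. All lemmas, counts, and frequencies below are invented toy values, not the figures reported above.

```python
def corpus_score(query_counts: dict, rel_freqs: dict) -> float:
    """Sum of lemma weights (query count x relative frequency) for one corpus."""
    return sum(n * rel_freqs.get(lemma, 0.0) for lemma, n in query_counts.items())


# Invented query counts per lemma.
query_counts = {"hiša": 10, "poljub": 4, "podatek": 1}

# Invented relative frequencies per 1 million for each corpus.
corpora = {
    "fiction": {"hiša": 300.0, "poljub": 50.0, "podatek": 5.0},
    "newspaper": {"hiša": 250.0, "poljub": 5.0, "podatek": 60.0},
    "technical": {"hiša": 40.0, "podatek": 400.0},
}

scores = {name: corpus_score(query_counts, freqs) for name, freqs in corpora.items()}
best = max(scores, key=scores.get)
```

The corpus with the highest total is the one whose vocabulary profile is closest to what users look up, which is the criterion applied to the three real corpora.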
26 Evaluation II: Top twenty weights
- Lemmas in at least two corpora (transl.)
- to be, to have, to give, to go, good, day, house,
beautiful, table, to come, light (ADJ), to know,
to see, big, year, to work/to do
- Top 20 weight in fiction
- to think, to say, to look, fond of
- Top 20 weight in newspaper
- town/place, Slovenian
- Top 20 weight in technical
- computer, picture, data item
27 Evaluation II: Improvements and variations
- Improvement: unambiguously lemmatized corpora
(work in progress for Slovenian)
- Variation: evaluate only non-matched queries
- Not overall structure of all queries
- But overall structure of unsuccessful queries
(might change after enhancements)
28 Conclusion
- We have shown
- query-driven methods of evaluation for online
dictionaries
- query-driven methods for finding adequate corpora
as sources for enhancing dictionaries
- Result: the example dictionary Online-SLO-DE-SLO
should and will be enlarged based on literary
texts first
29 Thank you for your attention
http://www.rrz.uni-hamburg.de/slowenisch