Query-driven dictionary enhancement

1
Query-driven dictionary enhancement
  • Primo Jakopin, Birte Lönneker
  • Scientific Research Center ZRC SAZU
  • Ljubljana, Slovenia

2
Motivation
Dictionary author's question: What are the needs
of the users of my dictionary?
  • Log files of online dictionaries provide direct
    access to the users' requests.
  • Make use of them!

3
Overview
  • The dictionary: Online SLO-DE-SLO
  • The log file
  • Use of the log file
    • to evaluate current dictionary contents
    • to choose the most promising corpus type for
      enlarging the dictionary
  • Conclusions

4
Dictionary Online SLO-DE-SLO
  • Bidirectional online dictionary
  • German-Slovenian
  • On the Web since 2001
  • Initially a learner's dictionary
    • for German-speaking learners of Slovenian

5
Online SLO-DE-SLO: user interface
6
Online SLO-DE-SLO: contents
  • Evaluated version (October 2003)
    • Textbook corpus: 5,172 entries
    • Newspaper corpus: 729 entries
    • Total: 5,901 entries
  • Current version (June 2004)
    • Textbook corpus: 5,544 entries
    • Newspaper corpus: 743 entries
    • Technical corpus: 829 entries
    • Total: 7,116 entries

7
Online SLO-DE-SLO: entry concept
  • Each entry is bilingual
  • Exactly one equivalence per entry
  • An entry can describe
    • a basic word form
    • an inflected word form
    • an example sentence or phrase
    • a collocation

8
Online SLO-DE-SLO: query results
9
The log file
  • When a user submits a query to the dictionary, a
    program writes data about the query into the log
    file, e.g.
    • Source language
    • Submitted query string
    • Selected search options (exact string match,
      match at beginning of word, match anywhere)
    • Time stamp

10
The log file: details
  • Evaluation period: 6 January 2002 to 10 October 2003
  • Number of queries stored in log file: 131,674
  • Number of queries with exact string match: 88,879
  • Only exact string match queries are evaluated
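
The fields listed for the log file, and the restriction to exact string match queries, can be sketched as a small record type plus a filter. This is only an illustration: the tab-separated layout, the language codes and the mode labels are assumptions, not the actual log format of Online SLO-DE-SLO.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryRecord:
    source_language: str  # "de" or "sl" (hypothetical codes)
    query: str            # query string as submitted by the user
    match_mode: str       # "exact", "prefix" or "anywhere" (hypothetical labels)
    timestamp: datetime


def parse_line(line):
    """Parse one tab-separated log line (assumed layout, not the real one)."""
    lang, query, mode, stamp = line.rstrip("\n").split("\t")
    return QueryRecord(lang, query, mode, datetime.fromisoformat(stamp))


def exact_match_queries(records):
    """Keep only the queries submitted with the exact string match option."""
    return [r for r in records if r.match_mode == "exact"]


# Invented sample lines, for illustration only.
sample = [
    "de\tKuss\texact\t2002-01-06T10:15:00",
    "sl\thiš\tprefix\t2002-01-06T10:16:30",
]
records = [parse_line(line) for line in sample]
exact = exact_match_queries(records)
```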

11
The log file: preprocessing
  • Preprocessing has to take into account how the
    matching is performed when the dictionary finds
    an entry for the user
  • Example 1
    • Dictionary matching: case insensitive (user
      enters A for a)
    • Preprocessing: downcase all letters in log file
      (and in dictionary evaluation file)

12
The log file: preprocessing
  • Example 2a
    • Dictionary matching: substitution of special
      characters for easier access (user enters ae
      for ä)
    • Preprocessing version I
      • Make a second version of log file
      • Replace ae by ä in second version
      • Use spell checker word list to find valid
        versions
      • Check ambiguous cases manually
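
A minimal sketch of preprocessing version I, assuming the spell checker word list is available as a set of lowercased words. The function names and the tiny word list are hypothetical. Every occurrence of "ae" is tried both as typed and as "ä", and only versions found in the word list are kept.

```python
from itertools import product


def candidates(query, digraph="ae", letter="ä"):
    """All spellings obtained by replacing any subset of digraph occurrences."""
    parts = query.split(digraph)
    results = set()
    for choice in product((digraph, letter), repeat=len(parts) - 1):
        word = parts[0]
        for separator, part in zip(choice, parts[1:]):
            word += separator + part
        results.add(word)
    return results


def valid_versions(query, word_list):
    """Candidate spellings of the downcased query that the word list accepts."""
    return sorted(c for c in candidates(query.lower()) if c in word_list)


# Toy word list, standing in for a real spell checker list.
words = {"mädchen", "israel", "bär"}

restored = valid_versions("Maedchen", words)   # "ae" stands for "ä"
unchanged = valid_versions("Israel", words)    # "ae" is a genuine letter pair
```

One valid version can be accepted automatically; more than one valid version is an ambiguous case for manual checking, and none means the query is not covered by the word list.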

13
The log file: preprocessing
  • Example 2b
    • Dictionary matching: substitution of special
      characters for easier access (user enters c
      for č)
    • Preprocessing version II
      • Make a second version of log file
      • Replace c by č in second version
      • Use frequencies of parallel spellings to find
        valid versions
      • Check ambiguous cases manually

14
The log file: preprocessing
  • Users sometimes select the wrong source
    language (SL)
  • The correct SL could be found using spell checker
    lists for both languages
  • In our case, spell checker lists taken from
    Online SLO-DE-SLO detect
    • wrongly determined SL Slovenian: 378
    • wrongly determined SL German: 593
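
The source language check could look roughly like this, assuming one lowercased word list per language. The language codes, the function name and the toy lists are hypothetical. A query is flagged only when it is missing from the list of the declared language and present in the other one.

```python
def likely_source_language(query, declared, word_lists):
    """Return the other language if the declared one looks wrong, else None."""
    q = query.lower()
    if q in word_lists[declared]:
        return None  # declared language is plausible
    others = [lang for lang, words in word_lists.items()
              if lang != declared and q in words]
    # Flag only unambiguous cases; everything else is left untouched.
    return others[0] if len(others) == 1 else None


# Toy word lists, standing in for the real spell checker lists.
word_lists = {
    "de": {"kuss", "haus"},
    "sl": {"poljub", "hiša"},
}

corrected = likely_source_language("poljub", "de", word_lists)
kept = likely_source_language("Haus", "de", word_lists)
unknown = likely_source_language("xyz", "de", word_lists)
```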

15
Evaluation I: Queries against dictionary
  • Question: To what extent does the dictionary
    satisfy users' requests?
  • Method: match preprocessed queries against
    downcased dictionary entries, language by language

16
Evaluation I
  • Dictionary entries: 5,901
    • German distinct (downcased): 5,289
    • Slovenian distinct: 5,103
  • Compare these lists with queries
  • Result I (tokens)
    • German: 40.7% of queries match
    • Slovenian: 38.3% of queries match

17
Evaluation I
  • Result II (types)
    • types = distinct queries
    • German: 10.4% of types match
    • Slovenian: 12.7% of types match
  • Well-known frequency distribution also in query
    log file
    • a few types occur very often and many types
      occur rarely
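
The difference between the token result and the type result can be made concrete with a short sketch. The data below is invented; `match_rates` is a hypothetical helper that assumes a non-empty list of preprocessed queries for one language.

```python
from collections import Counter


def match_rates(queries, entries):
    """Return (token match rate, type match rate) for one language."""
    counts = Counter(q.lower() for q in queries)   # type -> number of tokens
    entry_set = {e.lower() for e in entries}       # downcased dictionary side
    matched_types = [q for q in counts if q in entry_set]
    token_rate = sum(counts[q] for q in matched_types) / sum(counts.values())
    type_rate = len(matched_types) / len(counts)
    return token_rate, type_rate


# Invented example: one frequent query matches, two rare ones do not.
queries = ["Haus", "haus", "haus", "Kuss", "Gruß"]
entries = ["Haus", "Tisch"]

token_rate, type_rate = match_rates(queries, entries)
```

Because a few types account for many of the tokens, the token rate here (3 of 5 queries) is higher than the type rate (1 of 3 distinct queries), which is the same pattern as in Results I and II.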

18
Evaluation I: Qualitative results
  • Online SLO-DE-SLO still lacks some expressions
    and words used in social relations and everyday
    life, e.g.
    • Slovenian top unmatched queries: regard, offer,
      confirmation, cow, payment, kiss, oak, to miss,
      fond of, to teach, sale,...
    • German top unmatched queries: kiss, welcome,
      regard, regards, good morning, treasure,
      to fuck, good evening,...

19
Corpus-based enlargement
  • Log file entries alone are not enough
  • The enlargement of the dictionary should stay
    corpus-based, because the dictionary author wants
    to
    • find appropriate examples of use
    • also find collocations and idioms
    • find more words that are likely to be of
      interest to typical users

20
Evaluation II: Outline
  • Which corpus should be used next to enlarge
    Online-SLO-DE-SLO?
  • Which corpus best reflects the structure of the
    entire vocabulary entered by the users?
  • Evaluation of Slovenian queries using Slovenian
    corpora (subcorpora of Nova Beseda c.)

21
Evaluation II: Queries against corpora
  • Evaluate corpora of three text types
    • Newspaper: 88 million
    • Fiction: 5.7 million
    • Technical: 6.3 million
22
Evaluation II. Method
  • Compare lemmas in user queries with relative
    frequencies in the three corpora.
  • Lemmatize Slovenian queries and assign POS
  • Retain lemmatized content words and
    interjections
  • 7,246 query lemmas

23
Evaluation II. Method
  • Lemmatize each corpus (currently with
    ambiguities)
  • Calculate relative frequencies (per 1 million) of
    lemmas in three corpora
  • Assign a weight to each lemma: for each query
    lemma and corpus, multiply the number of queries
    by the relative frequency in that corpus
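
A minimal sketch of the weighting step, assuming each corpus is available as a flat list of lemmas and that relative frequency means occurrences per one million corpus lemmas. The function names and the toy data are hypothetical.

```python
from collections import Counter


def relative_frequencies(corpus_lemmas):
    """Occurrences per 1 million lemma tokens of the corpus."""
    counts = Counter(corpus_lemmas)
    total = sum(counts.values())
    return {lemma: n * 1_000_000 / total for lemma, n in counts.items()}


def lemma_weights(query_counts, rel_freq):
    """Weight = number of queries for the lemma * its relative frequency."""
    return {lemma: n * rel_freq.get(lemma, 0.0)
            for lemma, n in query_counts.items()}


# Toy corpus of 10 lemma tokens and invented query counts.
corpus = ["biti"] * 6 + ["hiša"] * 3 + ["in"]
query_counts = {"biti": 2, "hiša": 5, "poljub": 4}

rel_freq = relative_frequencies(corpus)
weights = lemma_weights(query_counts, rel_freq)
```

A query lemma that does not occur in the corpus ("poljub" here) gets weight 0, so it contributes nothing to that corpus.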

24
Evaluation II. Example
  • First seven lines of fiction corpus evaluation
    (alphabetical order)

25
Evaluation II. Result
  • All lemma weights are summed up for each of the
    three corpora separately
  • Fiction 10,262,558
  • Newspaper 9,694,125
  • Technical 9,369,494
  • The fiction corpus reflects the user queries best
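
The final comparison is a sum per corpus followed by picking the maximum. The sketch below uses invented relative frequencies for two corpora and then applies the same selection to the three totals reported on this slide. The function names are hypothetical.

```python
def corpus_score(query_counts, rel_freq):
    """Sum of lemma weights (query count * relative frequency) for one corpus."""
    return sum(n * rel_freq.get(lemma, 0.0)
               for lemma, n in query_counts.items())


def best_corpus(query_counts, rel_freq_by_corpus):
    """Return the name of the highest-scoring corpus and all scores."""
    scores = {name: corpus_score(query_counts, rf)
              for name, rf in rel_freq_by_corpus.items()}
    return max(scores, key=scores.get), scores


# Invented toy data, not taken from the evaluation.
query_counts = {"biti": 3, "računalnik": 1}
rel_freq_by_corpus = {
    "fiction": {"biti": 50000.0, "računalnik": 10.0},
    "technical": {"biti": 30000.0, "računalnik": 900.0},
}
winner, scores = best_corpus(query_counts, rel_freq_by_corpus)

# The same selection applied to the totals reported on the slide.
reported = {"Fiction": 10_262_558, "Newspaper": 9_694_125,
            "Technical": 9_369_494}
reported_winner = max(reported, key=reported.get)
```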

26
Evaluation II. Top twenty weights
  • Lemmas in at least two corpora (translated)
    • to be, to have, to give, to go, good, day,
      house, beautiful, table, to come, light (ADJ),
      to know, to see, big, year, to work/to do
  • Top 20 weight in fiction
    • to think, to say, to look, fond of
  • Top 20 weight in newspaper
    • town/place, Slovenian
  • Top 20 weight in technical
    • computer, picture, data item

27
Evaluation II. Improvements and Variations
  • Improvement: unambiguously lemmatized corpora
    (work in progress for Slovenian)
  • Variation: evaluate only non-matched queries
    • Not the overall structure of all queries
    • But the overall structure of unsuccessful
      queries (might change after enhancements)

28
Conclusion
  • We have shown
    • query-driven methods of evaluation for online
      dictionaries
    • query-driven methods for finding adequate
      corpora as sources for enhancing dictionaries
  • Result: the example dictionary Online-SLO-DE-SLO
    should and will be enlarged based on literary
    texts first

29
Thank you for your attention
http://www.rrz.uni-hamburg.de/slowenisch