1. From term extraction to terminology compilation
The challenge for computational terminology in an era of unlimited corpora availability
- Kyo Kageura
- Library and Information Science Course
- Graduate School of Education
- University of Tokyo
- Oct 9, 2008
- TAMA 2008
2. Background (1)
- Automatic term recognition (ATR): is it actually used?
  - yes, in its basic form, by terminographers and lexicographers
  - yes, in closed settings such as patent translation, etc.
  - no, not by most users, such as online volunteer translators
- Which ATR method/system?
  - ease of use, good maintenance, etc. are what matter
  - recall, precision and f-measure are unimportant beyond a certain level (a small sketch of these measures follows this slide)
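As a side note on that last point: a minimal sketch of how recall, precision and f-measure are computed for an ATR candidate list against a gold-standard term list. The function name and toy data are illustrative, not from any particular ATR system.

```python
# Minimal sketch: evaluating an ATR candidate list against a gold-standard
# term list. All names here are illustrative, not from any real system.

def evaluate_atr(candidates: set[str], gold_terms: set[str]) -> dict[str, float]:
    """Compute precision, recall and F1 for extracted term candidates."""
    true_positives = len(candidates & gold_terms)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold_terms) if gold_terms else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

print(evaluate_atr({"term extraction", "web", "corpus"},
                   {"term extraction", "corpus", "terminology"}))
# -> precision 0.67, recall 0.67, f1 0.67 (2 of 3 candidates are correct)
```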
3. Background (2)
- End users' (or, more precisely, my) situation
  - the type of texts to be translated is open
  - the potential range of reference is also open
  - term lookup is frequently needed
  - the Web is the universal information resource to refer to
- Are the results of terminographers' work available?
  - yes, for in-house use
  - no, in most cases, for general users, including online/volunteer translators and other non-technical translators (myself!)
- Do these translators (me!) need ATR-type help?
  - yes, if it constitutes something equivalent to terminological dictionaries, or supplements existing terminological dictionaries
  - so: not term extraction, but terminology compilation
4. Three possible models of reference tools
- Maximal model
  - you can stop looking for information there (libraries)
- Quality model
  - you can believe what it says (high-quality dictionaries)
- Singularity model
  - you have nothing better/else (Google)
- These models are intimately related to one another
  - a good reference tool represents at least two of these models
  - this is completely different from recall, precision and f-measure
  - beyond that point, you need to make decisions by yourself
5. Requirements of each model
- Maximal model
  - entries: what you can find via Google should be available
  - translations: what you can find via Google should be available
  - examples: what you can find via Google should be available
- Quality model
  - entries: should be coherent as a set and match users' expectations (e.g. what you tend to check by association should be there)
  - translations: give basic translations consistently, from which users can derive possible extensions of translations
  - examples: give a basic range of examples
- Singularity model
  - no general internal requirements
6. Technologies for the maximal model
- Entries: a universal term crawler
  - collect all the technical terms existing on the web
  - but what is a term?
  - so: collect all terms and non-terms on the web (a candidate-collection sketch follows this slide)
- Translations: the maximal range of candidates
  - still needs ordering by goodness; converges with the technology for the quality model
- Examples: the maximal range of candidates
  - still needs ordering by goodness; converges with the technology for the quality model
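A minimal sketch of the candidate-collection step only: harvesting word n-grams from a fetched page and filtering them with a stoplist. A real universal crawler would add POS patterns, termhood scoring (e.g. C-value) and large-scale fetching; the stoplist, token pattern and sample text below are invented.

```python
# Minimal sketch of the candidate-collection step of a "universal term
# crawler": harvest contiguous word n-grams from a page and filter them
# with a stoplist. All names and data are illustrative assumptions.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "to", "is", "from"}

def candidate_terms(text: str, max_len: int = 3) -> Counter:
    """Return n-gram candidates (1..max_len words) that do not start or
    end with a stopword, together with their frequencies."""
    tokens = re.findall(r"[a-z][a-z-]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts

page = "Automatic term recognition extracts term candidates from text."
print(candidate_terms(page).most_common(5))
```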
7. Technologies for the quality model
- Entries
  - stratify texts in accordance with their register and text type (a stratification sketch follows this slide)
  - collect the maximal set of terms from the relevant set of texts
  - make the coherency of entries as a set explicit
- Translations
  - establish correspondences between source and target text classes
  - make the coherency of translations as a set explicit
- Examples
  - something equivalent to what is done for entries and translations
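A minimal sketch of the stratification step, assuming a small labelled sample of documents is available: a supervised classifier assigns text types, and only the relevant stratum is kept for term extraction. The labels, documents and model choice are placeholders, not a claim about how this should actually be done.

```python
# Minimal sketch: classify fetched texts by register/text type with a
# supervised model, then keep only the stratum relevant for term
# extraction. Labels and documents are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "we propose a novel method for statistical machine translation",
    "click here to buy cheap flights and hotel deals",
    "the patient presented with acute respiratory symptoms",
    "win free prizes now limited offer",
]
train_labels = ["technical", "other", "technical", "other"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_docs, train_labels)

crawled = ["a parser for dependency grammar is described",
           "subscribe to our newsletter for discounts"]
relevant = [d for d, y in zip(crawled, classifier.predict(crawled))
            if y == "technical"]
print(relevant)  # only the "technical" stratum is kept
```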
8. Technologies for the singularity model
- Tear down and/or hide other sources
- Run a huge propaganda campaign
- Blackmail
- Block other people's research and development
- Bundle the system with hardware
9. The overall picture
[Diagram. Maximal model: a universal crawler of entries, translations and examples feeds a preference assigner, yielding the maximal dictionary. Quality model: a relevant text collector builds maximal/coherent sets of relevant texts; an entry, translation and example extractor plus a consistency validator turn these into the quality dictionary. A rough sketch of this pipeline follows.]
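To make the diagram concrete, here is a rough sketch of the pipeline as plain function composition. Every function is a stub standing in for an entire component; the names mirror the boxes in the slide, not any implemented system.

```python
# Rough sketch of the overall picture as function composition.
# Every function below is a stub for a whole component.

def universal_crawler(seed_urls):
    """Maximal model: harvest entries, translations and examples."""
    return [{"term": "t", "translation": "tr", "example": "ex"}]  # stub

def relevant_text_collector(seed_urls):
    """Quality model: build maximal/coherent sets of relevant texts."""
    return ["relevant document ..."]  # stub

def extractor(texts):
    """Extract entries, translations and examples from relevant texts."""
    return [{"term": "t", "translation": "tr", "example": "ex"}]  # stub

def consistency_validator(records):
    """Keep only records coherent with the rest of the set."""
    return [r for r in records if r["term"]]  # stub filter

def preference_assigner(records):
    """Order candidates by goodness."""
    return sorted(records, key=lambda r: r["term"])  # stub ordering

maximal_dictionary = preference_assigner(universal_crawler(["..."]))
quality_dictionary = consistency_validator(
    extractor(relevant_text_collector(["..."])))
```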
10. Research agenda: theoretical issues
- How many terms are there on the web (in the source and target languages)? (an estimation sketch follows this slide)
- What is the consistency of entries?
- What is a good example?
- How is their order defined?
- What is a relevant set of texts for term extraction?
- How are their preferences stratified?
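One way to attack the first question is population-size estimation from sample frequencies, for instance the Chao1 lower-bound estimator from species-richness statistics. A minimal sketch with invented counts; whether this is the right estimator for web terminology is itself an open question.

```python
# Minimal sketch: estimating how many terms exist in a population from an
# observed sample, using the Chao1 lower-bound estimator. The sample
# frequencies below are invented for illustration.
from collections import Counter

def chao1(term_frequencies: Counter) -> float:
    """Chao1 estimate: S_obs + f1^2 / (2 * f2), where f1/f2 are the
    numbers of terms observed exactly once/twice in the sample."""
    s_obs = len(term_frequencies)
    f1 = sum(1 for c in term_frequencies.values() if c == 1)
    f2 = sum(1 for c in term_frequencies.values() if c == 2)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2  # bias-corrected variant
    return s_obs + f1 ** 2 / (2 * f2)

sample = Counter({"term a": 5, "term b": 1, "term c": 1, "term d": 2})
print(chao1(sample))  # 4 observed + 2**2 / (2 * 1) = 6.0 terms estimated
```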
11. Research agenda: technical issues
- How can all the terms be identified and extracted?
- How can consistencies be measured?
- How can examples be clustered and classified? (a clustering sketch follows this slide)
- How can these examples be ordered?
- How can features be identified for relevant text classification?
- How can these texts be further classified and ordered?
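A minimal sketch for the clustering and ordering of examples: group example sentences by tf-idf similarity with k-means, then order each cluster by distance to its centroid (most typical first). The sentences and the number of clusters are invented placeholders.

```python
# Minimal sketch: cluster usage examples by tf-idf similarity, then order
# each cluster by closeness to its centroid. All data are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

examples = [
    "the term is defined in the glossary",
    "a glossary lists each term with its definition",
    "the crawler downloads pages from the web",
    "web pages are fetched by the crawler",
]
vectors = TfidfVectorizer().fit_transform(examples)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

for label in range(2):
    members = [i for i, y in enumerate(km.labels_) if y == label]
    # distance of each member to its own centroid: most typical first
    dists = km.transform(vectors[members])[:, label]
    for i in np.argsort(dists):
        print(label, examples[members[i]])
```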
12. Last words
- Do I really intend to pursue these agendas?
  - Yes, if my current funding proposal is accepted (if not, I will still carry out basic research related to these agendas, but will not be able to build operational real-world systems)
- What are the prospects of the research?
  - in 3 years, we will make a maximal crawler available
  - in 2 years, we will make a relevant text collector open
  - I am not sure how the coherency checker can be built; maybe through interaction with human experts? A la ontology research, which I do not like that much...