Title: Computational Lexicography
Computational Lexicography
- Frank Van Eynde
- Centre for Computational Linguistics
- K.U.Leuven
OUTLINE
- 1. The token/type distinction
- 2. Lexicographic practice
- 3. Computational lexica
- 4. Lexical knowledge bases
- 5. Lexical knowledge acquisition
- 6. The use of lexica in text-to-speech
1. Tokens vs. types
- (1) The girl gave the flowers to the athlete.
- - 3 tokens: the properties are context-specific
- - 1 type <THE>: the properties are generalizations over the various uses
- - Heracleitos vs. Plato
- (2) The sooner they come, the better it is.
- <THE, article> vs. <A, article>; NL de, het
- <THE, adverb> vs. <FAR, adverb>; NL hoe
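The token/type count for example (1) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the `tokenize` helper and its regular expression are assumptions, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into lowercase word tokens (naive, orthography only)."""
    return re.findall(r"[a-z']+", sentence.lower())


sentence = "The girl gave the flowers to the athlete."
tokens = tokenize(sentence)      # every occurrence counts as a token
type_counts = Counter(tokens)    # identical spellings collapse into one type

print(len(tokens))               # number of tokens in the sentence
print(len(type_counts))          # number of orthographic types
print(type_counts["the"])        # number of tokens of the type "the"
```

Note that this groups tokens by spelling alone. Example (2) shows why that is too coarse: both occurrences of "the" would be assigned to the same type, although they are adverbs rather than articles.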
1. Tokens vs. types
- (3) I do not think that the dog of that man is really that dangerous.
- <THAT, compl> vs. <IF, compl>; FR que
- <THAT, det> vs. <THIS, det>; FR ce/cet(te)
- <THAT, adverb> vs. <SO, adverb>; FR si
- (4) Je ne pense pas que le chien de cet homme est vraiment si dangereux.
1. Tokens vs. types
- The abstraction problem: given a word W, how many types <W, POS> do we have to distinguish?
- (5) It is not far from here.
- (6) We didn't go far.
- (7) He's living in the Far West.
- (8) Paris is far more expensive than Dublin.
- <FAR, adj> vs. <NEAR, adj>; NL ver
- <FAR, adv> vs. <LITTLE, adv>; NL veel
1. Tokens vs. types
- (9) The ball of the finals will be sold at the ball of the FIFA.
- (10) De bal van de finale wordt verkocht op het bal van de FIFA.
- <BAL, noun non-neuter>; IT palla
- <BAL, noun neuter>; IT ballo
- (11) That girl has been very lucky.
- (12) That girl has a lot of luck.
- <W, POS, VAL>
- <HAVE, verb aux, _VPPSP>; IT avere/essere
- <HAVE, verb main, _NP>; IT avere
1. Tokens vs. types
- (13) The pen is in my pocket.
- (14) The pig is in the pen.
- <W, POS, VAL, SENSE>
- <PEN, noun, writing implement>; NL pen
- <PEN, noun, fenced enclosure>; NL hok
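The growing tuple (W, POS, VAL, SENSE) can be sketched as a small data structure. The following is an illustrative assumption rather than a format proposed in the slides: the class name `LexType`, the table `TRANSLATIONS` and the helper `types_for` are hypothetical, while the entries themselves are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexType:
    """A lexical type: word form plus the properties that distinguish it."""
    word: str
    pos: str
    val: Optional[str] = None     # valency frame, where it matters
    sense: Optional[str] = None   # sense gloss, where it matters


# Each type maps to (language, translation), as given in the slides.
TRANSLATIONS = {
    LexType("THAT", "compl"): ("FR", "que"),
    LexType("THAT", "det"): ("FR", "ce/cet(te)"),
    LexType("THAT", "adverb"): ("FR", "si"),
    LexType("HAVE", "verb aux", val="_VPPSP"): ("IT", "avere/essere"),
    LexType("HAVE", "verb main", val="_NP"): ("IT", "avere"),
    LexType("PEN", "noun", sense="writing implement"): ("NL", "pen"),
    LexType("PEN", "noun", sense="fenced enclosure"): ("NL", "hok"),
}


def types_for(word: str) -> list[LexType]:
    """Return all types that share the given word form."""
    return [t for t in TRANSLATIONS if t.word == word.upper()]


for lex_type in types_for("pen"):
    print(lex_type, "->", TRANSLATIONS[lex_type])
```

The point of the sketch is that a word form alone is not a usable key: choosing a translation requires knowing which type a given token instantiates.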
2. Lexicographic practice
- The entries of pen and peg in the Oxford Advanced Learner's Dictionary of Current English.
- <ORTHn, PHON, POS, m, (VAL,) SENSE>
- Homonymy vs. polysemy
- Problem: for any given ORTH, how many n and how many m does one have to distinguish?
- The entries of pen and peg in the Collins Cobuild Dictionary of the English Language.
- <ORTH, PHON, m, SENSE>
3. Computational Lexica
- Dictionaries are made for people who already understand (much of) the language.
- Computational lexica are made for machines that do not understand (anything of) the language.
- Consequence: an NLP system can only make sense of information which is presented in the notation (or format) which it employs for processing the language.
3. Computational Lexica
- <two hundred fifty-six, 256>
- <two hundred fifty-six, CCLVI>
- POS tagger
- The entry for ik in Van Dale
- The entry for ik in the lexicon of the Spoken Dutch Corpus
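One way to read the number example: the same value can reach a system in different notations, and the system has to be told explicitly how each one maps to the form it works with. The sketch below is an assumption about how such a mapping could look, not something taken from the slides; the function names are hypothetical and only values from 0 to 999 are covered.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral, honouring subtractive notation (IV, IX, ...)."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value directly before a larger one is subtracted.
        if nxt in ROMAN and ROMAN[nxt] > value:
            total -= value
        else:
            total += value
    return total


def normalise(token: str) -> str:
    """Map Arabic or (uppercase) Roman numerals to their spelled-out form."""
    if token.isdigit():
        return number_to_words(int(token))
    if token and all(ch in ROMAN for ch in token):
        return number_to_words(roman_to_int(token))
    return token


print(normalise("256"))
print(normalise("CCLVI"))
```

The Roman branch is deliberately naive: it would also rewrite the English pronoun "I" or a word such as "MIX". Deciding between such readings needs contextual information of the kind a POS tagger provides.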
4. Lexical Knowledge Base
- Computational lexica are often task-specific and application-dependent.
- The need for reusability, maintainability, extensibility
- Creation of a lexical knowledge base which is sufficiently general and abstract to be reusable, maintainable and easily extensible
- Two aspects of abstractness: theory-neutral and level-independent
4. Lexical Knowledge Base
- Lexical knowledge representation languages
- DATR (Gazdar and Evans)
- Typed feature structures (HPSG)
- The number of lexical entries for any given natural language is enormous.
- The information to be captured in each lexical entry is detailed and complex.
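Representation languages of this kind address the size and complexity problem by letting entries inherit shared information instead of repeating it. The sketch below illustrates default inheritance with overriding in plain Python. It is not DATR syntax and not an HPSG type system; the `Node` class, the feature names and the example entries are all hypothetical.

```python
class Node:
    """A lexical node that inherits features from its parent by default."""

    def __init__(self, name, parent=None, **features):
        self.name = name
        self.parent = parent
        self.features = features

    def get(self, feature):
        """Look up a feature locally, then up the inheritance chain."""
        node = self
        while node is not None:
            if feature in node.features:
                return node.features[feature]
            node = node.parent
        raise KeyError(feature)


# General information is stated once, on the abstract NOUN node.
NOUN = Node("NOUN", pos="noun", plural_suffix="s")

# A regular entry only adds what is specific to it.
PEN = Node("PEN", NOUN, orth="pen")

# An irregular entry overrides the inherited default.
OX = Node("OX", NOUN, orth="ox", plural_suffix="en")


def plural(node: Node) -> str:
    """Build the plural form from the (possibly inherited) features."""
    return node.get("orth") + node.get("plural_suffix")


print(plural(PEN))
print(plural(OX))
```

Each entry stays small because only exceptions are stored locally, which is what makes a large lexicon maintainable and extensible.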
5. Lexical knowledge acquisition
- from scratch
- from a machine-readable dictionary
- from an agency for the distribution of resources (ELRA and LDC)
- inductive: from a partial lexicon and a corpus
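The inductive route can be sketched as follows: generalize over the partial lexicon, then use the generalizations to propose entries for corpus words that are not yet covered. The suffix heuristic, the function names and the toy data below are assumptions made for illustration; the slides do not specify a method.

```python
from collections import Counter, defaultdict


def learn_suffix_rules(lexicon: dict[str, str], suffix_len: int = 2) -> dict[str, str]:
    """Map each word-final suffix to the POS it most often occurs with."""
    counts = defaultdict(Counter)
    for word, pos in lexicon.items():
        if len(word) > suffix_len:
            counts[word[-suffix_len:]][pos] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}


def propose_entries(corpus_tokens, lexicon, rules, suffix_len: int = 2) -> dict[str, str]:
    """Propose a POS for corpus words that are missing from the lexicon."""
    proposals = {}
    for token in corpus_tokens:
        word = token.lower()
        if word in lexicon or word in proposals:
            continue  # already known or already proposed
        pos = rules.get(word[-suffix_len:])
        if pos is not None:
            proposals[word] = pos
    return proposals


# Toy data, invented for this sketch.
partial_lexicon = {"walking": "verb", "talking": "verb",
                   "quickly": "adv", "slowly": "adv"}
corpus = "She was singing loudly while walking home".split()

rules = learn_suffix_rules(partial_lexicon)
proposals = propose_entries(corpus, partial_lexicon, rules)
print(proposals)
```

Proposals of this kind are hypotheses that still need checking; words for which no rule applies are simply left without an entry.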
6. Lexica in text-to-speech
- written text
- ↓ text normalisation
- expanded graphemic representation
- ↓ tagging, syntactic analysis
- graphemic representation with prosody
- ↓ grapheme-to-phoneme
- sequence of phonemes, incl. lexical stress
- ↓ speech synthesis
- fluent speech
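The first three stages of this pipeline can be sketched as composed functions, with the lexicon consulted at the grapheme-to-phoneme step. Everything below is a toy under stated assumptions: the function names are hypothetical, the tagger is a single hand-written rule, and the "pronunciations" are stress-marked spellings standing in for real phoneme strings. Speech synthesis itself is left out.

```python
import re

# Toy lexicon keyed by (orthography, POS). Capitals mark the stressed
# syllable; these strings are placeholders, not phonemic transcriptions.
PRONUNCIATIONS = {
    ("record", "noun"): "RE-cord",
    ("record", "verb"): "re-CORD",
}


def normalise(text: str) -> list[str]:
    """Text normalisation: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Toy tagger: a word directly after 'to' is a verb, otherwise a noun."""
    tagged = []
    previous = None
    for token in tokens:
        tagged.append((token, "verb" if previous == "to" else "noun"))
        previous = token
    return tagged


def grapheme_to_phoneme(tagged: list[tuple[str, str]]) -> list[str]:
    """Look up (word, POS) in the lexicon; fall back to the spelling itself."""
    return [PRONUNCIATIONS.get(pair, pair[0]) for pair in tagged]


def text_to_phonemes(text: str) -> list[str]:
    """Chain the stages: normalisation, tagging, grapheme-to-phoneme."""
    return grapheme_to_phoneme(tag(normalise(text)))


print(text_to_phonemes("The record"))
print(text_to_phonemes("to record"))
```

The homograph shows why tagging precedes the grapheme-to-phoneme step: the lexical stress of "record" depends on its part of speech, so the lexicon has to be keyed on types rather than on spellings.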