Title: Dictionaries
1 Dictionaries
- See:
  - Patrick Hanks, "Lexicography", chapter 3 of Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2004.
2 Dictionaries/Lexicons
- Lexicography and the computer
- Corpus-based lexicography
- MRDs
- Dictionaries for NLP
- Thesauri: structured lexicons
3 Computational lexicography
- Restructuring and exploiting human dictionaries for use by computer programs
- Using computational techniques to compile (new) dictionaries
- Focus on English (and other well-established languages)
- Significantly different issues for other languages, especially:
  - Alphabetization and arrangement
  - Compilation from scratch for previously unstudied languages
4 Human dictionaries
- Traditional view of what a dictionary is
  - List of words, arranged (usually) alphabetically
  - Inclusion in dictionary lends authority, even proscriptively
- Entry typically gives
  - spelling (and alternate spellings)
  - POS, morphology (if irregular)
  - core definition (using defining vocab?)
  - pronunciation (using own transcription)
  - etymology
  - examples of usage
    - as justification for inclusion
    - as illustration of use (esp. learners' dictionaries)
- Entry typically doesn't give
  - help with spelling
  - morphology (if regular), especially derivational
  - subcategorization information
  - contrastive examples of use
  - indications of possible metaphorical extensions to meaning
5 Human dictionaries
- Historically
  - bilingual dictionaries for translators
  - monolingual dictionary as (pre/proscriptive) definition of language, often polemical
  - OED (1884-1928): first dictionary on purely descriptive principle, relying on citations
- Deficiencies and difficulties
  - What to include? (neologisms, slang)
  - Inclusion of names
  - Differentiating senses
6 Differentiating word senses
- Dictionaries disagree widely
- Probably no right answer
- General principles (look for excuse to split vs look for reason to lump)
- Keep related words of different POS together?
- Etymology can be misleading (e.g. crane, pupil)
- Metaphorical extension of original meaning: how far do you go? (e.g. rose, bar)
- Purpose of dictionary may help decide, e.g. translation
7 Citations
- Senses and uses identified by collecting examples of use
- Sent in on slips by informants
- Lexicographer's job is to collate these
- Criteria for a new word (or new meaning)
  - Number of citations
  - Source of citations
  - Veracity of use
8 Corpus-based dictionaries
- Corpus: a collection of texts, usually collected with a specific purpose in mind
- British National Corpus: attempt to capture a synchronic picture of BrE of the late 1980s (100m words)
- COBUILD Bank of English: dynamic monitor corpus used to help lexicographers identify/define usage
9 Machine-readable dictionaries
- "Machine" means computer
- Dictionary stored in a format which makes it manipulable on a computer
- Originally, derived from MR version of print dictionary (from typesetters' tapes)
- Now the other way round: data stored as a database from which hard copy can be printed (inter alia)
10 MRDs - advantages
- Flexibility of access and presentation
  - Not bound to alphabetical listing
  - Information presented can be filtered
  - Can be searched as a database
  - Different versions (for different users, serving different purposes) can be produced
- Increased storage capacity
  - More information can be stored, especially
    - Implicit information can be made explicit
    - More examples, including negative data
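A minimal sketch of what "searched as a database" and "filtered" could look like in practice. The records, field names and helper functions below are hypothetical and only illustrate the idea; they are not taken from any real MRD.

```python
# Toy machine-readable dictionary: a list of records rather than an
# alphabetical printed page. Entries and field names are illustrative only.
ENTRIES = [
    {"headword": "crane", "pos": "noun", "definition": "machine for lifting heavy loads"},
    {"headword": "crane", "pos": "noun", "definition": "large bird with long legs and neck"},
    {"headword": "pupil", "pos": "noun", "definition": "person who is taught by another"},
    {"headword": "pupil", "pos": "noun", "definition": "opening in the iris of the eye"},
    {"headword": "lock", "pos": "verb", "definition": "to close something using a key"},
]


def search(entries, **criteria):
    """Database-style search: keep entries matching every field=value pair."""
    return [e for e in entries if all(e.get(k) == v for k, v in criteria.items())]


def defined_with(entries, word):
    """Access by content, not alphabet: headwords whose definition uses `word`."""
    return sorted({e["headword"] for e in entries if word in e["definition"].split()})


def view(entry, fields):
    """Filter the information presented, e.g. for a particular kind of user."""
    return {f: entry[f] for f in fields if f in entry}


print(search(ENTRIES, pos="verb"))
print(defined_with(ENTRIES, "person"))
print(view(ENTRIES[0], ["headword", "pos"]))
```

The same records can serve different "versions" of the dictionary simply by choosing different field lists in `view`.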
11 Lexicons for NLP
- Have to state everything we need to know about the word
- Phonology: stress pattern, possible weak forms
- Orthography: spelling alternatives, hyphenation
- Morphology: inflectional paradigms, even if regular
  - Information about derivations
- Syntax: explicit information about subcategorization
  - e.g. syntactic/semantic features of arguments
  - Any special interpretation of tenses
- Lexical combinatorics: compounds, idioms
- Semantics: definition, semantic features, semantic relations
- Pragmatics: register, collocation, connotation
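One way to picture such an entry is as a record with one slot per level of description. The sketch below is an assumption about how this might be laid out; the class name, field names and the sample values for "eat" are illustrative, not a standard lexicon format.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """Hypothetical NLP lexicon entry; one field per level listed on the slide."""
    lemma: str
    pos: str
    stress_pattern: str = ""                                # phonology
    spellings: list = field(default_factory=list)           # orthography
    inflections: dict = field(default_factory=dict)         # morphology, even if regular
    subcat_frames: list = field(default_factory=list)       # syntax
    argument_features: dict = field(default_factory=dict)   # features of arguments
    idioms: list = field(default_factory=list)              # lexical combinatorics
    definition: str = ""                                    # semantics
    semantic_relations: dict = field(default_factory=dict)  # semantics
    register: str = "neutral"                               # pragmatics

    def missing_fields(self):
        """Names of fields still empty, i.e. information not yet stated."""
        return [name for name, value in vars(self).items() if value in ("", [], {})]


eat = LexicalEntry(
    lemma="eat",
    pos="verb",
    inflections={"3sg": "eats", "past": "ate", "past_part": "eaten", "pres_part": "eating"},
    subcat_frames=[("subj",), ("subj", "obj")],
    argument_features={"obj": ["edible"]},
    definition="to take in food",
)
print(eat.missing_fields())
```

The `missing_fields` check reflects the point that, unlike a human dictionary, an NLP lexicon cannot leave anything to the reader's intuition.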
12 Lexicons for NLP - example
- Information about derivations
  - Agentive derivation (-er) is very productive
  - Usually means the actor doing the action of a verb, e.g. swimmer, dancer, killer
  - Not available for some verbs, e.g. knower, cycler, sayer; though cf. soothsayer, hoper
  - May have a specialised meaning instead of or as well as the derived meaning, e.g. revolver, computer, washer, hitter
  - In some cases can mean the object undergoing the action (via ergative use of verb), e.g. taster
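The pattern above (a productive rule plus lexically listed exceptions) can be sketched as follows. The exception sets reuse the slide's own examples; the spelling rule in `add_er` is a deliberate simplification that only handles short verbs, and all names are hypothetical.

```python
VOWELS = set("aeiou")

# Lexically listed exceptions, taken from the examples on the slide.
BLOCKED = {"know", "cycle", "say"}                   # -er form not available
SPECIALISED = {"revolve", "compute", "wash", "hit"}  # -er form has a specialised sense
UNDERGOER = {"taste"}                                # -er form can denote the undergoer


def add_er(verb):
    """Simplified spelling rule for -er (adequate for short verbs only)."""
    if verb.endswith("e"):
        return verb + "r"
    # Double a final consonant after consonant + vowel: swim -> swimmer.
    if (len(verb) >= 3 and verb[-1] not in VOWELS and verb[-1] not in "wxy"
            and verb[-2] in VOWELS and verb[-3] not in VOWELS):
        return verb + verb[-1] + "er"
    return verb + "er"


def agentive(verb):
    """Return (form, readings), or None if the lexicon blocks the derivation.

    The sketch always keeps the regular "actor" reading and does not try to
    decide whether a specialised sense replaces it or merely adds to it.
    """
    if verb in BLOCKED:
        return None
    readings = ["actor"]
    if verb in SPECIALISED:
        readings.append("specialised")
    if verb in UNDERGOER:
        readings.append("undergoer")
    return add_er(verb), readings


for v in ["swim", "dance", "kill", "know", "hit", "taste"]:
    print(v, agentive(v))
```

Only the exceptions need to be stored per word; the regular cases fall out of the rule.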
13 Subcategorization
- Words are assigned to categories (i.e. parts of speech, POS), e.g. noun, verb
  - on basis of form, meaning, use
- Syntactic behaviour is predictable from (or determined by) category
- Within a category there are subcategories with specific patterns of behaviour, both syntactic and semantic, e.g.
  - transitive/intransitive verb → direct object? passivize?
14 Subcategorization
- Subcat frames indicate complement patterns and preferences, e.g.
  - subj, obj, double obj, prep-obj, infinitival complement, that-complement etc.
  - semantic features of complements, e.g. obj of eat normally edible
- Subcat information can help to disambiguate
  - cf. He told the man where the body was buried.
  - He found the place where the body was buried.
- Much of this info can be captured in general rules
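A rough sketch of how subcat frames could drive the tell/find contrast: if the verb licenses an object plus a wh-complement, the "where ..." clause is read as a complement of the verb; otherwise it is attached to the object noun as a relative clause. The frame inventory, labels and function names are assumptions made for illustration.

```python
# Hypothetical subcat lexicon: each verb lists the complement patterns it allows.
SUBCAT = {
    "tell": [("subj", "obj", "wh-comp"), ("subj", "obj", "that-comp")],
    "find": [("subj", "obj")],
    "eat": [("subj",), ("subj", "obj")],
}

# Semantic preference on the object, e.g. the object of "eat" is normally edible.
OBJECT_PREFERENCE = {"eat": "edible"}


def wh_clause_role(verb):
    """Decide how to attach a 'where ...' clause following the verb's object."""
    frames = SUBCAT.get(verb, [])
    if any("wh-comp" in frame for frame in frames):
        return "complement of verb"
    return "relative clause on object"


def object_is_plausible(verb, object_features):
    """True unless the verb states a preference that the object fails to meet."""
    required = OBJECT_PREFERENCE.get(verb)
    return required is None or required in object_features


print(wh_clause_role("tell"))   # He told the man [where the body was buried]
print(wh_clause_role("find"))   # He found [the place where the body was buried]
print(object_is_plausible("eat", {"edible"}), object_is_plausible("eat", {"metal"}))
```

A real parser would treat these as preferences to be weighed rather than hard decisions.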
15
- Have to state everything we need to know about the word, though not necessarily explicitly
- There can be rules to capture inheritance of properties, e.g.
  - accomplishment verb in progressive tense implies incompletion
  - cf. She was baking a cake when she dropped dead → no cake
  - She was stroking the cat when she dropped dead
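Inheritance of this kind can be sketched by storing the property once per aspectual class and letting individual verbs inherit it unless they state otherwise. Classifying "bake" as an accomplishment and "stroke" as an activity is this sketch's reading of the two examples; the property name and data layout are hypothetical.

```python
# Properties stated once per class, not once per word.
CLASS_DEFAULTS = {
    "accomplishment": {"progressive_entails_simple_past": False},
    "activity": {"progressive_entails_simple_past": True},
}

# Individual entries only name their class; anything else is inherited.
LEXICON = {
    "bake": {"aspect_class": "accomplishment"},
    "stroke": {"aspect_class": "activity"},
}


def lookup(verb, prop):
    """Entry-level value if stated explicitly, otherwise the class default."""
    entry = LEXICON[verb]
    if prop in entry:
        return entry[prop]
    return CLASS_DEFAULTS[entry["aspect_class"]][prop]


# "She was baking a cake" does not entail "she baked a cake" (no cake).
print(lookup("bake", "progressive_entails_simple_past"))
# "She was stroking the cat" does entail "she stroked the cat".
print(lookup("stroke", "progressive_entails_simple_past"))
```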
16 Exploiting human dictionaries in NLP
- In all NLP applications, lexicon is major bottleneck
- Availability of MRD versions of human dictionaries provided possible solution
- Obviously, MRD gives list of words, and some information
- Extract further information about verb frames by analysing the examples
- Identify semantic features from definitions
  - e.g. "a plant which...", "a person who..."
- Identify hidden arguments
  - e.g. to lock: to close sthg using a key
  - cf. He locked the door. The key was heavy.
  - He emptied his pockets. The key was heavy.
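As a rough illustration of mining definitions, the sketch below pulls out the head ("genus") noun from definitions shaped like "a plant which..." and an implicit instrument from "using a ..." phrases. Both regular expressions are simplistic assumptions and would miss most real definition styles.

```python
import re

# "a/an/the <noun> which/who/that/where ..." -> the noun is the genus term.
GENUS_PATTERN = re.compile(r"^(?:an?|the)\s+(\w+)\s+(?:which|who|that|where)\b", re.IGNORECASE)

# "... using a/an <noun>" -> the noun is a hidden instrument argument.
INSTRUMENT_PATTERN = re.compile(r"\busing\s+an?\s+(\w+)", re.IGNORECASE)


def genus_term(definition):
    """Head noun of a definition, usable as a coarse semantic feature."""
    match = GENUS_PATTERN.match(definition.strip())
    return match.group(1).lower() if match else None


def hidden_instrument(definition):
    """Instrument implied by the definition but usually unexpressed in use."""
    match = INSTRUMENT_PATTERN.search(definition)
    return match.group(1).lower() if match else None


print(genus_term("a person who teaches"))
print(hidden_instrument("to close sthg using a key"))
```

Knowing that "lock" has a hidden instrument "key" is what lets a system link "The key" to the locking event in "He locked the door. The key was heavy."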
17 Exploiting human dictionaries in NLP
- Generic information about a word and its usage can be derived from definitions in which it occurs
  - Wine: alcoholic drink made from fermented juices, especially of grapes
  - Vintage: a season's yield of wine from a vineyard
  - Red wine: wine having a red colour derived from the skins of the grapes used ...
  - Vineyard: an orchard where grapes are grown for the purpose of wine making
  - Pinot noir: a dry red Californian table wine
  - Sake: Japanese rice wine
  - Claret: a dry red Bordeaux or Bordeaux-like wine
  - Sherry: a sweet white wine from the Jerez region of Spain
  - Riesling: a dessert wine made from white grapes grown historically in Germany
  - ...
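The idea can be tried directly on the definitions listed above: find every other entry whose definition mentions "wine" and count which words keep it company. The tokenizer and stopword list are ad hoc assumptions, and the "red wine" definition is used only as far as the slide gives it.

```python
import re
from collections import Counter

# Definitions as given on the slide (the "red wine" one is truncated there).
DEFINITIONS = {
    "wine": "alcoholic drink made from fermented juices, especially of grapes",
    "vintage": "a season's yield of wine from a vineyard",
    "red wine": "wine having a red colour derived from the skins of the grapes used",
    "vineyard": "an orchard where grapes are grown for the purpose of wine making",
    "pinot noir": "a dry red Californian table wine",
    "sake": "Japanese rice wine",
    "claret": "a dry red Bordeaux or Bordeaux-like wine",
    "sherry": "a sweet white wine from the Jerez region of Spain",
    "riesling": "a dessert wine made from white grapes grown historically in Germany",
}

STOPWORDS = {"a", "an", "the", "of", "or", "from", "for", "in", "where", "are"}


def tokens(text):
    """Lower-cased word tokens, keeping internal hyphens and apostrophes."""
    return re.findall(r"[a-z][a-z'-]*", text.lower())


def mentions(word):
    """Headwords (other than `word` itself) whose definition uses `word`."""
    return [h for h, d in DEFINITIONS.items() if h != word and word in tokens(d)]


def cooccurring(word):
    """Count content words appearing alongside `word` in those definitions."""
    counts = Counter()
    for headword in mentions(word):
        counts.update(t for t in tokens(DEFINITIONS[headword])
                      if t != word and t not in STOPWORDS)
    return counts


print(mentions("wine"))
print(cooccurring("wine").most_common(5))
```

Even on this tiny sample, the counts suggest that wine is typically red or white, dry or sweet, and made from grapes, none of which the entry for "wine" itself states in full.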
18 Corpus-based lexicography revisited
- Similarly, analysis of real examples can reveal patterns of usage
- Identify primary meaning: not always what you'd expect (example of reckon)
- Identify possible complementation patterns, and their relative frequency
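Relative frequency of complementation patterns is a simple count once corpus examples have been labelled. The observations below are invented toy data, not real corpus figures, and the pattern labels are hypothetical.

```python
from collections import Counter

# Invented, hand-labelled observations: (verb, complementation pattern).
OBSERVATIONS = [
    ("reckon", "that-clause"),
    ("reckon", "that-clause"),
    ("reckon", "that-clause"),
    ("reckon", "np-object"),
]


def pattern_frequencies(observations, verb):
    """Relative frequency of each pattern observed with `verb`."""
    counts = Counter(pattern for v, pattern in observations if v == verb)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pattern: n / total for pattern, n in counts.most_common()}


print(pattern_frequencies(OBSERVATIONS, "reckon"))
```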
19 Structured dictionaries
- Special type of dictionary in which words are grouped together according to their meaning: thesaurus
- Classic example: Roget's Thesaurus (1852)
- Structured vocabulary much used in field of terminology
- Also now a valuable resource for NLP: Miller's (Princeton) WordNet (1985)
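A minimal sketch of a meaning-structured lexicon, loosely in the spirit of a thesaurus or a WordNet-style hierarchy. It is a toy built from the wine examples earlier in these slides, not WordNet's actual data or API; the names are hypothetical.

```python
# Each word points to a more general term (its hypernym).
HYPERNYM = {
    "claret": "wine",
    "sherry": "wine",
    "sake": "wine",
    "wine": "drink",
}


def hypernym_chain(word):
    """Follow 'is a kind of' links upward from `word`."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def co_hyponyms(word):
    """Words grouped with `word` because they share the same parent term."""
    parent = HYPERNYM.get(word)
    if parent is None:
        return []
    return sorted(w for w, p in HYPERNYM.items() if p == parent and w != word)


print(hypernym_chain("claret"))
print(co_hyponyms("claret"))
```

Grouping by parent term is what lets an application find related words without relying on alphabetical order.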