Title: Handling of missing values
1- Handling of missing values
- in lexical acquisition
- Núria Bel
- Universitat Pompeu Fabra
2 By Automatic Lexical Information Acquisition we ...
- try to find out how to build repositories of
language-dependent lexical information
automatically. Many technologies behind
applications (MT, IE, Automatic Summarization,
Sentiment Analysis, Opinion Mining, Question
Answering, etc.) need this information to work.
("paralelo" AST ALO
"paralel" ATR POST CL (PF-AS
PM-OS SF-A SM-O) FC (NPP) LY
AMENTE MC ("a") PLC (NG)
PRED (ESTAR SER) TA (OBJ-P REL)
AUTHOR "juan" DATE "31-Aug-99" SITE
"FB52")
("fiesta" NST ALO
"fiest" CL (PF-AS SF-A) GD
(F) KN MS PLC (NF) TYN
(ABS) AUTHOR "juan" DATE
"28-Aug-99" SITE "FB52")
Entries borrowed from MT system Incyta (Metal
family)
3 Cue Based Lexical Acquisition
- Differences in the distribution of certain
contexts separate words of different classes
(Harris, 1951).
- For example: some / many mud
- Words (types) can be represented in terms of a
collection of contexts; their occurrence or
non-occurrence in these contexts is taken as a
hint, or cue, that the word belongs to a
particular class.
4 Word occurrences are represented as vectors and
used to train a classifier.
- @data
- 15,2,8,4,0,8,1,0,1,0,0,0,0,0
- Number of times the word has been observed in
each of the defined contexts.
- Non-occurrence in particular contexts is as
informative as occurrence.
- We use supervised classifiers (Support Vector
Machines, Decision Trees) to predict the class
(Abstract, Mass, etc.) of new words.
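A minimal sketch of how such a count vector could be built, assuming each cue is a simple test on the word preceding the target. The cue names, the `cue_vector` helper and the toy sentence are hypothetical illustrations, not the cue set used in the experiments.

```python
# Hypothetical cues: each one tests the word immediately before the target.
CUES = [
    ("prev_some", lambda prev: prev == "some"),
    ("prev_many", lambda prev: prev == "many"),
    ("prev_much", lambda prev: prev == "much"),
    ("prev_a", lambda prev: prev == "a"),
]


def cue_vector(tokens, target, cues=CUES):
    """Count, for each cue, how many occurrences of `target` match it."""
    counts = [0] * len(cues)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        prev = tokens[i - 1] if i > 0 else None
        for k, (_name, test) in enumerate(cues):
            if test(prev):
                counts[k] += 1
    return counts


# Toy corpus (invented for illustration only).
corpus = "some mud stuck to the boots and much mud covered the road".split()
print(cue_vector(corpus, "mud"))  # [1, 0, 1, 0]
```

Each such vector, paired with a class label, would be one training row for the classifier; the zeros for `prev_many` and `prev_a` are the kind of non-occurrence the slide treats as informative.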
5 Cues, classification and state-of-the-art results
- Merlo and Stevenson (2001) selected very specific
cues for classifying verbs into a number of Levin
(1993) based verbal classes: animacy of the
subject, passives, ...
- Baldwin (2005) used general features, such as the
POS tags of neighboring words, for type
classification.
- Joanis et al. (2007) used the frequency of filled
syntactic positions or slots, tense and voice of
occurring verbs, etc., to describe the whole
system of English verbal classes.
- The results are difficult to compare, but
accuracy is about 70%.
6 The problem: missing values
7 The sparse data problem
- Joanis and Stevenson (2003), Joanis et al. (2007)
and Korhonen et al. (2008) mention that they have
to face the problem of sparse data: many of the
types/words are low in frequency and provide very
little information.
- Most words will appear very rarely (i.e. a Zipf
distribution) and will therefore show few cues.
- Yallop et al. (2005) calculated that in the
100M-word British National Corpus, from a total
of 124,120 distinct adjectives, 70,246 occur only
once.
- The cues we can use as information are mutually
exclusive within a single occurrence: an adjective
can be both prenominal and postnominal, but if it
occurs only once it will show only one cue, the
others having a zero value.
- Even for types that occur more than once, the
optional nature and variety of the contexts of
occurrence give rise to missing values.
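To make the scale of the problem concrete, the share of types that occur only once can be computed directly. The sketch below is illustrative only: `hapax_share` is a hypothetical helper for a tokenized corpus, and the final lines simply redo the arithmetic on the BNC adjective figures quoted above.

```python
from collections import Counter


def hapax_share(tokens):
    """Fraction of distinct types that occur exactly once (tokens must be non-empty)."""
    freq = Counter(tokens)
    hapaxes = sum(1 for n in freq.values() if n == 1)
    return hapaxes / len(freq)


# Figures quoted above for BNC adjectives (Yallop et al. 2005).
bnc_share = 70_246 / 124_120
print(f"{bnc_share:.1%}")  # 56.6%
```

In other words, more than half of the adjective types can show at most one cue, whatever cue set is defined.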
8 Zero values and learning
- Zero values create not only a problem of having
enough information to decide, but also a further
uncertainty when learning from the data.
- A zero value could indeed be a negative value,
i.e. the cue is that the context has not been
observed, but it could also be that the context
was simply not observed in the examined corpus
for various reasons.
- When there are many zero values, the cue loses
its predictive power because of this uncertainty.
- Katz (1987) and Baayen and Sproat (1996), among
others, acknowledged the importance of
preprocessing low-frequency events, and Joanis et
al. (2007) also decided to smooth the data, even
when working with more than 1000 occurrences per
verb in the BNC.
9 Our smoothing experiment: Harmonization based
on linguistic information
10 Intuitively: how likely is it that a 0 is just an
unobserved feature and not a true 0, given the
values of the other observations?
- To classify Abstract/Concrete nouns in English:
- Cue 1 is a suffix: -ness, -ism, ... For
Abstracts (Light 1996)
- Cue 2 is determiners: such, little, much ...
For Abstracts
- Cue 3 is adjectives like big, small, ... For
Concrete
- $P(\text{cue}_1 = 1 \mid 0,1,0) = P(\text{abstract} = \text{yes} \mid 0,1,0)\, P(\text{cue}_1 = 1 \mid \text{abstract} = \text{yes}) + P(\text{abstract} = \text{no} \mid 0,1,0)\, P(\text{cue}_1 = 1 \mid \text{abstract} = \text{no})$
11- We use the information of observed features to
assess the likelihood of a particular unobserved
cue.
- Harmonization is substituting 0 values by the
likelihood of their being 1 given the other cues
observed.
- BUT
- In order to get $P(\text{cue}_1 = 1 \mid 0,1,0)$ we need to
have $P(\text{cue}_n \mid \text{class})$, and for all cues in the
vector.
12 The challenge: how to get $P(\text{cue}_n \mid \text{class})$ with so
many 0s in the data? By estimating
$P(\text{cue}_n \mid \text{class})$ with linguistic information.
                Abstract   Concrete
Suffix = no       0.5        1.0
Suffix = yes      0.5        0.0
SC_Adj = no       1.0        0.5
SC_Adj = yes      0.0        0.5
- The probability of being Concrete and having the
suffix -ness is 0.
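The harmonization idea can be sketched with the two cues in the table above. This is an illustrative reading, not the exact procedure used in the experiments, and it rests on three assumptions that the slides do not state: a uniform class prior, conditional independence of cues given the class, and treating zeros as missing (not as evidence) when estimating the class. The names `P_CUE_GIVEN_CLASS`, `PRIOR`, `class_posterior` and `harmonize` are hypothetical.

```python
# P(cue = yes | class), copied from the table above.
P_CUE_GIVEN_CLASS = {
    "Suffix": {"Abstract": 0.5, "Concrete": 0.0},
    "SC_Adj": {"Abstract": 0.0, "Concrete": 0.5},
}
# Assumption: uniform class prior (not given on the slides).
PRIOR = {"Abstract": 0.5, "Concrete": 0.5}


def class_posterior(observed):
    """P(class | cues observed at least once); zeros are treated as missing."""
    scores = {}
    for cls, prior in PRIOR.items():
        score = prior
        for cue, value in observed.items():
            if value > 0:
                score *= P_CUE_GIVEN_CLASS[cue][cls]
        scores[cls] = score
    total = sum(scores.values())
    if total == 0:  # contradictory evidence: fall back to the prior
        return dict(PRIOR)
    return {cls: s / total for cls, s in scores.items()}


def harmonize(observed):
    """Replace each 0 by the likelihood of the cue being 1 given the other cues."""
    post = class_posterior(observed)
    out = {}
    for cue, value in observed.items():
        if value > 0:
            out[cue] = 1.0
        else:
            out[cue] = sum(post[c] * P_CUE_GIVEN_CLASS[cue][c] for c in post)
    return out


print(harmonize({"Suffix": 1, "SC_Adj": 0}))  # {'Suffix': 1.0, 'SC_Adj': 0.0}
print(harmonize({"Suffix": 0, "SC_Adj": 0}))  # {'Suffix': 0.25, 'SC_Adj': 0.25}
```

In the first call the observed suffix makes the noun certainly Abstract under this table, so the zero for SC_Adj is kept as a true zero. In the second call nothing was observed, so both zeros become uncertain values.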
13 Harmonization effects in the Spanish Mass experiment
- agua (water)
  Frequency:  0,3,0,1,0,1,1,0,1,0,0,1,1,0
  Harmonized: 0,1,0,1,0,1,1,0,1,0,0,1,1,0
- acero (steel)
  Frequency:  1,2,0,0,0,2,1,1,2,0,0,0,0,0
  Harmonized: 1,1,0.5,0.5,0.5,1,1,1,1,0,0,0,0,0
- desabastecimiento (shortage)
  Frequency:  0,0,0,0,0,0,1,0,0,0,0,0,0,0
  Harmonized: 0.5,0.5,0.5,0.5,0.5,0.5,1,0.5,0.5,0,0,0,0,0
- aceptabilidad (acceptability)
  Frequency:  0,0,0,0,0,0,0,0,0,0,0,0,0,0
  Harmonized: 0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.47,0.47,0.47,0.47,0.47
14 Results of the experiments
                 Spanish Mass     English Abstract
Experiment       DT      SVM      DT      SVM
Mean             74.2    63.8     57.8    61.0
Trimmed mean     77.5    67.4     55.6    61.0
Frequency        79.9    79.1     61.4    64.1
Harmonized       82.8    80.7     76.1    70.1
Baseline            74.8             61.5
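As a quick reading aid, the gain of the Harmonized representation over raw Frequency can be computed from the figures above. The dictionary layout and the name `gains` are only for illustration; the numbers are those reported in the table.

```python
# (task, classifier) -> scores reported in the results table.
results = {
    ("Spanish Mass", "DT"): {"Frequency": 79.9, "Harmonized": 82.8},
    ("Spanish Mass", "SVM"): {"Frequency": 79.1, "Harmonized": 80.7},
    ("English Abstract", "DT"): {"Frequency": 61.4, "Harmonized": 76.1},
    ("English Abstract", "SVM"): {"Frequency": 64.1, "Harmonized": 70.1},
}

# Difference in points, rounded to one decimal to hide float noise.
gains = {
    key: round(scores["Harmonized"] - scores["Frequency"], 1)
    for key, scores in results.items()
}

for (task, clf), gain in gains.items():
    print(f"{task:17s} {clf:3s} +{gain}")
```

The differences are 2.9 and 1.6 points for Spanish Mass and 14.7 and 6.0 points for English Abstract, so the largest gain is for the Decision Tree on the English task.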
15 Error analysis and future work
- Frequency information that could be used to
filter noise has been neutralized.
- Future work is about how to handle missing values
and noise together.
16- Thanks for your attention!