Title: Designing Statistical Language Learners: Experiments on Noun Compounds
1. Designing Statistical Language Learners: Experiments on Noun Compounds
by Mark Lauer, 1995 PhD thesis, Macquarie University
2. Outline
- Background Information
- Part I: Syntactic Analysis
- Part II: Semantic Analysis
- Recent Efforts
3. Background Information
- Material covered in Chapters 1-4 and in other sources, essential for understanding the Chapter 5 experiments.
4. What is a Compound Noun?
- A compound noun (CN) is any consecutive sequence of nouns, at least two words in length, that functions as a noun:
  - police officer, car park, bread knife, radio telescope platform coordination computer software test
- CNs often represent a shortened form of a longer expression:
  - officer in the police department; place intended for the parking of cars; knife designed specifically to cut bread; a test for the software, which is run on a computer, which coordinates the platform that houses the telescope that makes use of radio waves
5. Three Forms of CNs
- Open Form
  - dining room, disaster supplies, fish tank
- Hyphenated Form
  - sky-scraper, secretary-general, hanger-on
- Closed Form
  - boyfriend, baseball, motorcycle
- Typically, only the open form is studied in research, since the other two tend to be lexicalized.
6. Frequency
- CNs are used very frequently, and are highly productive in modern language.
- There is roughly 1 CN per 5 sentences in modern fictional prose, with about an 86% chance of encountering a previously unseen CN type. This is commonly accepted as the low end of CN frequency and generation.
- News articles typically have a new CN type in every second sentence.
- The abstracts from 293 technical articles contained 3,514 distinct CN types, or 12 distinct types per abstract.
7. Frequency (continued)
- CNs cannot simply be cataloged and analyzed:
  - There are too many in existence for this to be practical.
  - New ones are produced too regularly for a catalog to stay current.
- We must therefore have automated tools for processing them.
- However, some CNs are so common, or their meaning so removed from that of their component words, that they have been added to the language lexicon, i.e. lexicalized:
  - soap opera, scum bag, post office
8. Types of CNs
- Nominalizations: the head (rightmost word) is a nominalized form of a verb.
  - stamp collection, marathon running
  - a.k.a. Verbal-Nexus Compounds, Sentence Compounds, Sortal Nouns
- Non-Verbal-Nexus: those not formed with a nominalized head.
  - a.k.a. Deleted Predicate Nominals, Relational Nominals
- Copulative: a subclass of Non-Verbal-Nexus CNs where the modifier is a special type of the head word.
  - tuna fish, submarine ship
9. Meaning Distribution Theory
- Chapter 3 presents Lauer's defense of a predominately semantic view of language, which was not popular at that time. Its basic tenets:
  - Communication is the expression of a combination of smaller primitive elements.
  - Analysis of a text is governed by the expected meaning of that text, given all of its possible meanings.
- A full understanding of the theory is not required to understand the experiments, just some of the motivations.
10. Experiments, Part I: Syntactic Parsing
11. Parsing
- Given a compound, determine the correct parse tree, or equivalently, the bracketing of the component nouns:
  - animal cruelty committee
  - woman customs official
  - chocolate birthday party cake obsession
- CNs longer than three words can be handled in a recursive manner, so studying only the length-three case is sufficient for analysis (see the sketch below).
- For CNs of length three, it is common to refer to the bracketing as either left-branching ((a b) c) or right-branching (a (b c)).
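The recursion mentioned above can be made concrete. Below is a minimal Python sketch (mine, not the thesis's) that enumerates every binary bracketing of a compound as a nested tuple; for three nouns there are exactly two, matching the left-/right-branching distinction, and the count grows as the Catalan numbers for longer compounds.

```python
def bracketings(words):
    # Enumerate all binary bracketings of a noun compound,
    # each represented as a nested (left, right) tuple.
    words = tuple(words)
    if len(words) == 1:
        return [words[0]]
    results = []
    for i in range(1, len(words)):          # split point
        for left in bracketings(words[:i]):
            for right in bracketings(words[i:]):
                results.append((left, right))
    return results

print(bracketings(["animal", "cruelty", "committee"]))
# [('animal', ('cruelty', 'committee')),
#  (('animal', 'cruelty'), 'committee')]
```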
12. Meaning of Bracketing
- Modifier nouns (those not rightmost) modify a noun to their right in some way. The bracketing defines which noun each one modifies.
- In "animal cruelty committee", "animal" describes "cruelty", not "committee"; "animal cruelty" combined modifies "committee".
- In "woman customs official", both "woman" and "customs" modify "official", and not each other.
- The bracketing can sometimes be ambiguous, or depend on context.
13. Adjacency vs. Dependency
- The Adjacency Method of bracketing was the previously common method: given the CN "a b c", compare the suitability of "a b" and of "b c", and choose whichever is more acceptable.
- The Dependency Method is Lauer's creation. He notes that b always modifies c under either bracketing, so the better question to ask is whether "a b" is more acceptable than "a c", since under his meaning distribution theory that is what is important.
14. Acceptability
- How do we compare the acceptability or suitability of two or more given noun pairs?
- Compare the probabilities of each noun pair, scaled by the overall model probability.
15. Dependency Modeling
- The mapping of the modifier dependencies can be viewed as a tree structure, which describes the meaning of the CN.
- A given tree model can generate several different strings, all equally probable (see the sketch below).
- The task then becomes to measure the probability of having a certain model given an observed string of nouns.
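To make the "several strings" point concrete for three nouns, here is an illustrative Python sketch (my notation, not the thesis's): a dependency tree generates every word order in which each modifier precedes the noun it modifies. The left-branching chain yields one string, while the right-branching tree yields two, which is the source of the factor of two on the next slide.

```python
from itertools import permutations

def generated_strings(words, deps):
    # All orderings consistent with a dependency tree:
    # every modifier m must precede its head h.
    return [order for order in permutations(words)
            if all(order.index(m) < order.index(h) for m, h in deps)]

words = ("a", "b", "c")
# Left-branching ((a b) c): a modifies b, b modifies c.
print(generated_strings(words, [("a", "b"), ("b", "c")]))
# [('a', 'b', 'c')]                      -- one string
# Right-branching (a (b c)): a modifies c, b modifies c.
print(generated_strings(words, [("a", "c"), ("b", "c")]))
# [('a', 'b', 'c'), ('b', 'a', 'c')]     -- two strings
```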
16. Model Probability
- Writing P(x -> y) for the probability that noun x modifies noun y, the probability of a dependency tree t given an observed string w is proportional to the product of its dependency probabilities, times the probability that t generates w:
  P(t | w) ∝ P(w | t) × Π P(x -> y) over the edges (x -> y) of t
- Since a tree generates each of its consistent strings equiprobably, P(w | t) is 1 divided by the number of strings t can generate.
17. Model Probability (continued)
- For three-word CNs "a b c", we get:
  P(left | a b c) ∝ P(a -> b) × P(b -> c) (the left tree generates one string)
  P(right | a b c) ∝ (1/2) × P(a -> c) × P(b -> c) (the right tree generates two strings)
- When comparing, the shared factor P(b -> c) cancels, so this simplifies to:
  choose left iff P(a -> b) ≥ (1/2) × P(a -> c)
- Or: choose left bracketing unless "a c" is more than twice as probable as "a b".
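A minimal sketch of the two decision procedures, assuming some estimate prob(x, y) of how likely noun x is to modify noun y is already available (the prob argument here is a hypothetical stand-in for the corpus estimates of slide 19):

```python
def bracket_dependency(a, b, c, prob):
    # Lauer's dependency rule: prefer left-branching unless
    # a->c is more than twice as probable as a->b.
    return "left" if 2 * prob(a, b) >= prob(a, c) else "right"

def bracket_adjacency(a, b, c, prob):
    # The older adjacency rule: compare the adjacent pairs directly.
    return "left" if prob(a, b) >= prob(b, c) else "right"
```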
18. Class-Based Smoothing
- To overcome the problem of data requirements, Lauer proposes smoothing each word to a semantic class, and then computing probabilities based on those classes instead. This smoothing makes the preceding equations much less elegant, but in predictable ways.
- Each word is assumed to belong to at least one semantic class, and all classes to which it belongs are considered equiprobable.
- Lauer makes use of Roget's Thesaurus to define the semantic classes of nouns (see the sketch below).
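A rough sketch of the idea, with a hypothetical toy thesaurus standing in for Roget's categories: pair probabilities are computed over classes and averaged over each word's (equiprobable) class memberships, so counts for one member of a class generalize to the others.

```python
from collections import defaultdict

# Hypothetical toy thesaurus: word -> set of semantic classes.
CLASSES = {
    "bread": {"FOOD"}, "fruit": {"FOOD"},
    "knife": {"TOOL"}, "fork": {"TOOL"},
}

def class_pair_prob(w1, w2, class_counts, total):
    # Estimate P(w1 modifies w2) by averaging the class-level
    # probability over all class pairs the two words belong to.
    c1s, c2s = CLASSES[w1], CLASSES[w2]
    p = sum(class_counts[(c1, c2)] / total
            for c1 in c1s for c2 in c2s)
    return p / (len(c1s) * len(c2s))

# Class co-occurrence counts gathered from a corpus (toy numbers):
counts = defaultdict(int, {("FOOD", "TOOL"): 3})
print(class_pair_prob("fruit", "fork", counts, total=10))  # 0.3
# "fruit fork" gets probability mass even if it never occurred,
# because "bread knife" did.
```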
19. Determining Noun Pair Probabilities
- Scan a suitably large corpus (The New Grolier's Multimedia Encyclopedia) for occurrences of each noun pair appearing as a lone CN, and assign probabilities accordingly.
- This introduces the lasting "Lauer Heuristic" for identifying CNs in text: look for sequences of words known to be nouns, surrounded by words not known to be nouns (see the sketch below).
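A minimal sketch of that heuristic, assuming a lexicon of words known to be nouns (the toy NOUNS set below is hypothetical):

```python
NOUNS = {"radio", "telescope", "platform", "police", "officer"}

def candidate_compounds(tokens, min_len=2):
    # Maximal runs of known nouns of length >= min_len, in the
    # spirit of the Lauer Heuristic: noun sequences whose
    # neighbouring words are not known nouns.
    compounds, run = [], []
    for tok in tokens + ["</s>"]:   # sentinel flushes the last run
        if tok in NOUNS:
            run.append(tok)
        else:
            if len(run) >= min_len:
                compounds.append(tuple(run))
            run = []
    return compounds

print(candidate_compounds("the radio telescope platform was built".split()))
# [('radio', 'telescope', 'platform')]
```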
20. Experiments
- Experiments were conducted varying several different features:
  - Dependency vs. Adjacency models
  - Adjacent vs. windowed identification of CNs
  - Symmetric vs. asymmetric counts
  - Effects of model probability scaling
  - Class-smoothed counts vs. counting each word separately
  - POS-tagged data vs. a list of known nouns
21. Summary of Results
- The Dependency model consistently outperforms the Adjacency model, as well as the baseline of always guessing left.
- Windowed counting only hurts accuracy.
- Asymmetric counts are marginally better than symmetric ones.
- Model probability tuning significantly helps the Adjacency model, but not the Dependency model.
- Class-based smoothing provides significant improvements.
- POS tagging can provide moderate improvements to the estimation process.
22. Semantics
23. Semantics
- Now that we have bracketed a given CN, we can extract a set of length-two CNs, one for each edge in the dependency graph, which together form the entire meaning of the CN (see the sketch below).
- But what does each two-word CN mean?
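A small sketch of that extraction, reusing the nested-tuple bracketings from the parsing sketch earlier: the head of a subtree is its rightmost leaf, and each internal node of the bracketing contributes one (modifier, head) edge.

```python
def head(tree):
    # The head of a bracketing is its rightmost leaf.
    return tree if isinstance(tree, str) else head(tree[1])

def edges(tree):
    # One (modifier, head) pair per internal node, i.e. per
    # edge of the dependency graph.
    if isinstance(tree, str):
        return []
    left, right = tree
    return edges(left) + edges(right) + [(head(left), head(right))]

print(edges((("animal", "cruelty"), "committee")))
# [('animal', 'cruelty'), ('cruelty', 'committee')]
```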
24. Defining Semantics
- Given the two words of the CN, what is the relation between those two words expressed by the CN?
- There is no consensus about which list of relations to use.
- If such a detailed list were created, it would be massively long, and likely still incomplete.
- Instead, analysis is limited to a few common classes.
- Lauer chose to use prepositional paraphrasing.
25. Prepositional Paraphrasing
- Only applies to Non-Verbal-Nexus, Non-Copulative CNs.
- Interpret "a b" as "b <prep> a", where <prep> is one of: of, for, in, at, on, from, with, about.
  - state laws -> laws of the state
  - baby chair -> chair for babies
  - reactor waste -> waste from a reactor
26. Prepositional Paraphrasing
- Pros:
  - A concrete, small list of classes
  - Easily identified in corpus texts
  - Commonly used
- Cons:
  - Does not always apply ($50 jacket, cold virus)
  - A very shallow representation of semantics
  - Certain nouns present lexical preferences for particular prepositions, which can skew empirical results
  - Some relations can be expressed by multiple prepositions
27. Predicting Paraphrases
- When predicting which preposition to use in the paraphrase, it is a simple case of choosing the most probable:
  p* = argmax_p P(p | a, b) for the CN "a b"
- After some assumptions regarding independence and uniformity, and applying Bayes' Theorem, this simplifies to choosing the preposition that maximizes the product of two separately estimated factors, one for the head and one for the modifier:
  p* = argmax_p Phead(p | b) × Pobj(p | a)
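A sketch of that decision rule, assuming the two conditional distributions have already been estimated as on the next slide (the flat dictionary layout here is hypothetical):

```python
PREPS = ["of", "for", "in", "at", "on", "from", "with", "about"]

def best_paraphrase(modifier, head, phead, pobj):
    # Pick the preposition maximizing Phead(p | head) * Pobj(p | modifier),
    # e.g. "state laws" -> "laws of the state".
    return max(PREPS,
               key=lambda p: phead.get((head, p), 0.0)
                             * pobj.get((modifier, p), 0.0))

# Toy tables: Phead[(noun, prep)] and Pobj[(noun, prep)].
phead = {("laws", "of"): 0.4, ("laws", "for"): 0.2}
pobj = {("state", "of"): 0.3, ("state", "for"): 0.1}
print(best_paraphrase("state", "laws", phead, pobj))  # of
```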
28. Estimating Pobj and Phead
- Use a POS-tagged corpus (or an automatic POS tagger). Consider NN, NNS, NNP, NNPS, and VBG to all be nouns.
- For Phead(n, p), look for n tagged as a noun, followed by p.
- For Pobj(n, p), look for p, followed by up to three of JJ, DT, CD, PRP$, POS, followed by n. The words between p and n are assumed to modify n, and thus p and n are still associated (see the sketch below).
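A rough sketch of those two counting patterns over a POS-tagged corpus, assuming tokens arrive as (word, tag) pairs with Penn Treebank tags; normalizing the raw counts into probabilities is left implicit:

```python
from collections import Counter

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "VBG"}
SKIP_TAGS = {"JJ", "DT", "CD", "PRP$", "POS"}   # may sit between p and n
PREPS = {"of", "for", "in", "at", "on", "from", "with", "about"}

def count_patterns(tagged):
    # Collect raw counts for Phead (noun immediately followed by a
    # preposition) and Pobj (preposition, up to three intervening
    # modifiers, then a noun).
    head_counts, obj_counts = Counter(), Counter()
    for i, (word, tag) in enumerate(tagged):
        if tag in NOUN_TAGS and i + 1 < len(tagged):
            nxt = tagged[i + 1][0]
            if nxt in PREPS:
                head_counts[(word, nxt)] += 1
        if word in PREPS:
            j = i + 1
            while (j < len(tagged) and j - (i + 1) < 3
                   and tagged[j][1] in SKIP_TAGS):
                j += 1
            if j < len(tagged) and tagged[j][1] in NOUN_TAGS:
                obj_counts[(tagged[j][0], word)] += 1
    return head_counts, obj_counts

sent = [("laws", "NNS"), ("of", "IN"), ("the", "DT"), ("state", "NN")]
print(count_patterns(sent))
# (Counter({('laws', 'of'): 1}), Counter({('state', 'of'): 1}))
```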
29. Experiments
- Compared to the parsing experiments, there were relatively few experiments performed:
  - Word-only classes vs. Roget's Thesaurus classes
  - MLE (maximum likelihood) vs. ELE (expected likelihood) estimates of probabilities
  - Restriction of predictions to only a select few prepositions
30. Results
- Overall, the results are abysmal, only barely reaching significance above the baseline of always guessing "of" (the most common relation).
- Word-based counts tend to perform marginally better than class-smoothed counts.
- ELE slightly improves class-based estimation over MLE, but significantly hurts word-only estimation.
- Restricting guesses to only the most common prepositions can significantly increase accuracy, but at the cost of never guessing the less frequent relations.
31. Recent Efforts
32. Recent Efforts
- Keller and Lapata have analyzed the effect of using Internet search engine result counts for estimating probabilities. Their results are comparable.
- Lapata has also attempted to resolve nominalized CNs through corpus statistics, with significant results.
- Moldovan et al. applied several learning algorithms to the task of semantic labeling, with a more detailed list of relations, and achieved significant results.