Title: Attacking the Data Sparseness Problem
1. Attacking the Data Sparseness Problem
- Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Bontcheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
2. Motivation for the project
- Texts for text extraction contain sentences like:
- The IRA bombed a family-owned shop in Belfast yesterday.
- FMLN set off a series of explosions in central Bogota today.
3. Motivation for the project
- We'd like to automatically recognize that both are of the form:
- The IRA bombed a family-owned shop in Belfast yesterday.
- FMLN set off a series of explosions in central Bogota today.
4. Our Hypotheses
- A transformation of a corpus to replace words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling and text extraction.
- Semantic category information might also help improve machine translation.
- An initially noun-centric approach will allow bootstrapping for other syntactic categories.
5. A six-week goal: Labeling noun phrases
- Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday.
- Humans aboard space_vehicle dodge satellite timeref.
6. Preparing the data: Pre-Workshop
- Identify a tag set
- Create a human-annotated corpus
- Create a double-annotated corpus
- Process all data for named entity and noun phrase recognition using GATE tools (26 million words)
- Parsed about 26 million words
- Develop algorithms for mapping target categories to Wordnet synsets to support the tag set assessment
7. The Semantic Classes and the Corpus
- A subset of classes available in Longman's Dictionary of Contemporary English (LDOCE), electronic version
- Rationale:
- The number of semantic classes was small
- The classes are somewhat reliable, since they were used by a team of lexicographers to code noun senses, adjective preferences, and verb preferences
- Many words have subject-area information, which might be useful
8. The Semantic Classes
[Hierarchy diagram, built up progressively across slides 8-11. The classes shown: Concrete, Abstract, Animate, Inanimate, Solid, Gas, Liquid, Plant, Animal, Human, Movable, Non-movable, Collective, Female Animal.]
12. The human-annotated statistics
- Inter-annotator agreement is 94%, so that is the upper limit of our task.
- 214,446 total annotated noun phrases (262,683 including "None of the Above")
- 29,071 unique vocabulary items (unlemmatized)
- 25 semantic categories (162 associated subject areas were identified)
- 127,569 with semantic category Abstract (59%)
13. The experimental setup
- BNC (Science, Politics, Business): 26 million words
14. The main development set (dev)
- Training: 113,000 instances
- Held out: 85,000 instances
- Blind portion
- Machine learning to improve this
15. A challenging development set for experiments on unseen words (the Hard data set)
- Training: all unambiguous words, 125,000 instances
- Held out: ambiguous words, 73,000 instances
- Blind portion
- Machine learning to improve this
16. Our Experiments include
- Supervised approaches (learning from human-annotated data)
- Unsupervised approaches
- Using outside evidence (the dictionary or Wordnet)
- Syntactic information from parsing or pattern matching
- Context words, the use of preferences, the use of topical information
17. Experiments on unseen words (Hard data set)
- The training corpus has only words with unambiguous annotations
- 125,000 training instances
- 73,000 instances held out
- Perplexity: 21
- Baseline accuracy: 45%
- Improved accuracy: 68.5%
- Context can contribute greatly in unsupervised experiments
18. Results on the dev set
- Random split, with some frequent ambiguous words moved into testing
- 113,000 training instances
- 85,000 instances held out
- Perplexity: 3.44
- Baseline accuracy: 80%
- Improved accuracy: 87%
19. The scheme for annotating the large corpus
- After experimenting with the development sets, we need a scheme for making use of all of the dev corpus to tag the blind corpus.
- We developed an incremental scheme within the maximum entropy framework.
- Several talks have to do with re-estimation techniques useful to the bootstrapping process.
20. Terminology
- Seen words: words seen in the human-annotated data (new instances of known words)
- Unseen words: not in the training material, but in the dictionary
- Novel words: not in the training material nor in the dictionary/Wordnet
21. Bootstrapping
[Diagram: the human-annotated corpus (with a blind portion) together with the unannotated data.]
22. The Unannotated Data: Four types
[Diagram, built up across slides 22-25: unambiguous words (515,000 instances), words seen in training (550,000 instances), and novel words (20,000 instances); the fourth type is unseen words.]
26. [Diagram: corpus portions and flow. Annotated: 201K; Unambiguous: 515K; Seen: 550K; 9K (presumably the unseen portion); Novel: 20K. Each portion feeds training, which is then used to tag the test data.]
27. Results on the Blind Data
- We set aside one tenth of the annotated corpus
- Randomly selected within each of the domains
- It contained 13,000 annotated instances
- The baseline here was very high: 90% with simple techniques
- We were able to achieve 93.5% accuracy
28. Overview
- Bag of words (Kalina)
- Evaluation (Kris)
- Supervised methods using maximum entropy (Klaus)
- Incorporating context preferences (Jerry)
- Experiments with Adjective Classes and Subject (David, Jia, Martin)
- Structuring the context using syntax and semantics (Cassia, Fabio)
- Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
- Unsupervised Re-estimation (Roberto)
- Student Proposals (Jia, Dave, Marco)
- Conclusion
29. Semantic Categories and MT
- 10 test words: high, medium, and low frequency
- Collected their target translations using EuroWordNet (e.g. Dutch)
- Crane:
- "lifts and moves heavy objects": hijskraan, kraan
- "large long-necked wading bird": kraanvogel
30. SemCats and MT (2)
- Manually mapped synonym sets to semantic categories
- An automatic mapping will be presented later
- Studied how many synonym sets are ruled out as translations by the semantic category
31. Some Results
- 3 words: full disambiguation
- crane (Mov.Solid/Animal), medicine (Abstract/Liquid), plant (Plant/Solid)
- 7 words: the categories substantially reduce the possible translations
- club: Abstr/"an association of people...", Mov.Solid/"stout stick...", Mov.Solid/"an implement used by a golfer...", Mov.Solid/"a playing card...", NonMov.Solid/"a building"
- club/NonMov.Solid: clubgebouw, clubhuis, ...
- club/Abstr.: bevolkingsgroep, broederschap, ...
- club/Mov.Solid: knots, kolf, malie, club
32. The architecture
- The multiple-knowledge-sources WSD architecture (Stevenson 03)
- Allows the use of multiple taggers and combines their results through a weighted function
- Weights can be learned from a corpus
- All taggers are implemented as GATE components and combined in applications
34. The Bag-of-Words Tagger
- The bag-of-words tagger is an Information Retrieval-inspired tagger with parameters:
- Window size: 50 (default value)
- What POS to put in the content vectors (default: nouns and verbs)
- Which similarity measure to use
- Used in WSD (Leacock et al. 92)
- Crane/Animal: species, captivity, disease
- Crane/Mov.Solid: worker, disaster, machinery
35. BoW classifier (2)
- Seen words are classified by calculating the inner product between their context vector and the vectors for each possible category (see the sketch below)
- Inner product calculated as:
- Binary vectors: the number of matching terms
- Weighted vectors:
- Leacock's measure: favour concepts that occur frequently in exactly one category
- Take into account the polysemy of concepts in the vectors
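A minimal sketch of the inner-product classification described above, using binary vectors; the category vectors and helper names are illustrative stand-ins, not the workshop's GATE implementation:

```python
from collections import Counter

def context_vector(words):
    """Bag-of-words context: term -> count."""
    return Counter(words)

def inner_product(ctx, cat_vec, binary=True):
    """Binary vectors: number of matching terms;
    weighted vectors: sum of products of term weights."""
    shared = set(ctx) & set(cat_vec)
    if binary:
        return len(shared)
    return sum(ctx[t] * cat_vec[t] for t in shared)

def classify(ctx, category_vectors, binary=True):
    """Assign the category whose vector best matches the context."""
    return max(category_vectors,
               key=lambda c: inner_product(ctx, category_vectors[c], binary))

# Toy data echoing the crane example from the previous slide:
cats = {"Animal": context_vector(["species", "captivity", "disease"]),
        "Mov.Solid": context_vector(["worker", "disaster", "machinery"])}
print(classify(context_vector(["machinery", "near", "the", "worker"]), cats))
# -> Mov.Solid
```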
36. Current performance measures
- The baseline frequency tagger on its own: 91% on the test (blind) set
- Bag-of-words tagger on its own: 92.7%
- Combined architecture: 93.2% (window size 50, using only nouns, binary vectors)
37. Future work on the architecture
- Integrate syntactic information, subject codes, and document topics
- Experiment with cosine similarity
- Implement the Yarowsky 92 WSD algorithm
- Implement the weighted-function module
- Experiment with integrating the ME tools as one of the taggers, supplying preferences for the weighting module
38. Overview (agenda slide repeated; see slide 28)
39. Accuracy Measurements (Kris Haralambiev)
- How to measure the accuracy
- How to distinguish "correct", "almost correct", and "wrong"
40. Exact Match Measurements
- W = (w1, w2, ..., wn): the vector of annotated words
- X = (x1, x2, ..., xn): the categories assigned by the annotators
- Y = (y1, y2, ..., yn): the categories assigned by a program
- Exact match (default) measurement: 1 for a match and 0 for a mismatch of each (xi, yi) pair
- accuracy(X, Y) = |{i : xi = yi}| / n
41. The Hierarchy
42. Ancestor Relation Measurement
- The exact match will assign 0 for the pairs (H,M), (H,F), (A,Q), ...
- Give a partial score for two categories in an ancestor relation:
- weight(Cat) = |{i : xi is in the tree with root Cat}|
- score(xi, yi) = min( weight(xi)/weight(yi), weight(yi)/weight(xi) )
- accuracy(X, Y) = (1/n) * Σi score(xi, yi)
43. Edge Distance Measurement
- The ancestor relation will assign some score for pairs like (H,M), (A,Q), but will still assign 0 for pairs like (M,F), (A,H)
- Going further, we want to compute the similarity (distance) between X and Y:
- distance(xi, yi) = the length of the simple path from xi to yi
- Each edge can be given an individual length, or all edges have length 1 (we prefer the latter)
44. Edge Distance Measurement (cont'd)
- distance(X, Y) = Σi distance(xi, yi)
- Accuracy falls linearly with distance: accuracy 100% corresponds to distance 0, and accuracy 0% to max_possible_distance, i.e.
- accuracy(X, Y) = 100 * (1 - distance(X, Y) / max_possible_distance)
- max_possible_distance = Σi max_cat distance(xi, cat)
- It might be reasonable to use the average instead of the max (all three measures are sketched below)
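A compact sketch of the three measures over a toy fragment of the category tree; the parent map and the example labels are illustrative, not the full hierarchy:

```python
# Toy fragment of the hierarchy; PARENT maps each category to its parent.
PARENT = {"Concrete": None, "Animate": "Concrete", "Inanimate": "Concrete",
          "Human": "Animate", "Animal": "Animate",
          "Solid": "Inanimate", "Liquid": "Inanimate"}

def ancestors(c):
    """c itself first, then its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def exact_match(gold, pred):
    return sum(x == y for x, y in zip(gold, pred)) / len(gold)

def ancestor_score(x, y, gold):
    """Partial credit only when one category is an ancestor of the other;
    weight(Cat) counts gold labels inside the subtree rooted at Cat."""
    if x == y:
        return 1.0
    if x in ancestors(y) or y in ancestors(x):
        wx = sum(x in ancestors(g) for g in gold)
        wy = sum(y in ancestors(g) for g in gold)
        if min(wx, wy) == 0:
            return 0.0
        return min(wx / wy, wy / wx)
    return 0.0

def edge_distance(x, y):
    """Simple-path length between x and y, all edges of length 1."""
    ax, ay = ancestors(x), ancestors(y)
    lca = next(c for c in ax if c in ay)   # lowest common ancestor
    return ax.index(lca) + ay.index(lca)

def edge_distance_accuracy(gold, pred, all_cats):
    dist = sum(edge_distance(x, y) for x, y in zip(gold, pred))
    max_dist = sum(max(edge_distance(x, c) for c in all_cats) for x in gold)
    return 100.0 * (1 - dist / max_dist)

gold, pred = ["Human", "Solid"], ["Animal", "Inanimate"]
print(exact_match(gold, pred))   # 0.0: neither pair matches exactly
print(sum(ancestor_score(x, y, gold) for x, y in zip(gold, pred)) / len(gold))
print(edge_distance_accuracy(gold, pred, list(PARENT)))
```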
45. Some Baselines
46. Overview (agenda slide repeated; see slide 28)
47. Supervised Methods using Maximum Entropy
- Jia Cui, David Guthrie, Martin Holub, Jerry Liu, Klaus Macherey
48. Overview
- Maximum Entropy Approach
- Feature Functions
- Word Classes
- Experimental Results
49. Maximum Entropy Approach
- Principle:
- Define suitable features (constraints) on the training data
- Find the maximum entropy distribution that satisfies the constraints (GIS; the model form and update are recalled below)
- Properties:
- Easy to integrate information from several knowledge sources
- Always converges to the global optimum on the training data
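For reference, the standard conditional maximum entropy model and GIS update that the slide alludes to; this is the textbook formulation, not transcribed from the deck:

```latex
p_\lambda(c \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, c)\Big),
\qquad
Z_\lambda(x) = \sum_{c'} \exp\Big(\sum_i \lambda_i f_i(x, c')\Big)

% GIS update, with F = \max_{x,c} \sum_i f_i(x, c):
\lambda_i^{(t+1)} = \lambda_i^{(t)} + \frac{1}{F}
  \log \frac{\tilde{E}[f_i]}{E_{p^{(t)}}[f_i]}
```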
50. Feature Functions
- Prior features:
- Use the unigram probabilities P(c) of the semantic categories c as a feature
- Lexical features:
- Use the lexical information directly as a feature
- Reduce the number of features by using the following definition [formula not transcribed]
51. Feature Functions (cont'd)
- Longman preference features:
- The Longman Dictionary provides subject codes for nouns
- Use the frequency of the preferences as additional features
- Unknown word features:
- Prefix features
- Suffix features
- Human-IST feature
52. Word Classes
- Lemmatization:
- Eliminate inflections and reduce words to their base form
- Assumption: different cases of one word have the same semantic classes
- Mutual information (standard definition recalled below):
- Measures the amount of information one random variable contains about another
- Applied to nouns and adjectives
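The mutual information referred to above, in its standard form (a reminder, not transcribed from the slides):

```latex
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
```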
55. Overview (agenda slide repeated; see slide 28)
56. Incorporating Context Features
- Jia Cui, David Guthrie, Martin Holub, Klaus Macherey, Jerry Liu
57. Overview
- Result Analysis
- Rewind: Encoding Feature Functions
- Incorporating Context Features
- Clustering Methods
- Experimental Results
60. Adjectives
- Continuing the example "angry kid":
- Describe adjectives by the categories of the nouns that they prefer to modify, to avoid sparseness.
- Obtain a set of categories for both "kid" and "angry":
- kid: A, S, H, H, H
- angry: T, H
- We can concatenate them together (merging): A, S, H, H, H, T, H
- Or do some kind of component-wise multiplication (pruning): H, H, H
- Simply merging introduces irrelevant categories and increases entropy (see the sketch below)
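A small sketch of the two combination operations on the slide's category multisets; the Counter encoding is an illustrative choice:

```python
from collections import Counter

kid   = Counter({"A": 1, "S": 1, "H": 3})   # kid:   A, S, H, H, H
angry = Counter({"T": 1, "H": 1})           # angry: T, H

# Merging: concatenate the two multisets.
merged = kid + angry                         # A, S, H, H, H, T, H

# Pruning: component-wise multiplication keeps only shared categories.
pruned = Counter({c: kid[c] * angry[c] for c in kid.keys() & angry.keys()})
print(merged)   # Counter({'H': 3, 'A': 1, 'S': 1, 'T': 1})
print(pruned)   # Counter({'H': 3}), i.e. H, H, H
```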
61. Clustering Methods
- The Longman dictionary contains such adjective preferences, but we can also generate preferences based on the corpus.
- Measure the entropy of each adjective from the frequency with which it modifies nouns of each particular category:
- The lower the entropy, the more contextually useful the adjective
- Measure the confidence of an adjective by its frequency
- Example: angry
- adj: angry, entropy: 2.18, freqs: 155, 55, 9, 7, 0, ...
63. Overview (agenda slide repeated; see slide 28)
64. Hard vs Soft Word Clusters
- Words as features are sparse, so we need to cluster them (see the sketch below).
- Hard clusters:
- A feature is assigned to one and only one cluster (the cluster for which there exists the strongest evidence).
- Soft clusters:
- A feature is assigned to as many clusters as there is evidence for.
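A minimal sketch of the two assignment policies; the per-word category-count data structure is an illustrative assumption:

```python
from collections import defaultdict

def hard_clusters(word_cat_counts):
    """Each word joins exactly one cluster: its most frequent category."""
    clusters = defaultdict(set)
    for w, counts in word_cat_counts.items():
        clusters[max(counts, key=counts.get)].add(w)
    return clusters

def soft_clusters(word_cat_counts):
    """Each word joins every cluster it has evidence for."""
    clusters = defaultdict(set)
    for w, counts in word_cat_counts.items():
        for cat, n in counts.items():
            if n > 0:
                clusters[cat].add(w)
    return clusters
```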
65. Using clustering and contextual features
- Baseline: the prior (most frequent) semantic category
- All words within the target noun phrase (with a threshold of 10 occurrences)
- Adjective hard clusters:
- Clusters are defined by the most frequent semantic category
- Noun soft clusters:
- Clusters are defined by all semantic categories
- Combined adjective hard clusters and noun soft clusters
66. Results with clusters and context
- Training on Training + Held-out
- Testing on Blind Data
- ME tool: Jia's MaxEnt toolkit
[Results table not transcribed.]
67. Measuring Usefulness of Adjectives in Context
- We have a huge number of nouns that are assigned a semantic tag, from (A) the training data, and (B) the BNC corpus where the noun is unambiguous with regard to the possible semantic category.
- Using the adjectives that modify these nouns, we can compute the entropy H(T | a) (reconstructed below), where a is an adjective and C is the set of semantic categories.
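The entropy formula itself was an image on the slide; a plausible reconstruction from the surrounding text, treating the category of the modified noun as a random variable T over C:

```latex
H(T \mid a) = - \sum_{t \in C} p(t \mid a) \log p(t \mid a),
\qquad
p(t \mid a) = \frac{\mathrm{freq}(a, t)}{\sum_{t' \in C} \mathrm{freq}(a, t')}
```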
68. Clustering Adjectives
- We take adjectives with low H(T | a) and form clusters from them, depending on which semantic category they predict (see the sketch below)
- Then use each cluster of adjectives as a context feature
- θ1 and θ2 are thresholds
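A sketch combining the entropy and frequency thresholds (called theta1 and theta2 here, matching the slide; the data layout and default values are assumptions):

```python
import math
from collections import defaultdict

def entropy(freqs):
    """H(T | a) over the category frequencies of one adjective."""
    total = sum(freqs.values())
    return -sum(f / total * math.log2(f / total)
                for f in freqs.values() if f > 0)

def adjective_clusters(adj_cat_freqs, theta1=1.0, theta2=50):
    """theta1: max entropy; theta2: min total frequency (confidence)."""
    clusters = defaultdict(set)
    for adj, freqs in adj_cat_freqs.items():
        if sum(freqs.values()) >= theta2 and entropy(freqs) <= theta1:
            # Cluster the adjective under the category it predicts best.
            clusters[max(freqs, key=freqs.get)].add(adj)
    return clusters
```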
69. Overview (agenda slide repeated; see slide 28)
70. Structuring the context using syntax
- Syntactic model: eXtended Dependency Graph
- Syntactic relations considered: V_Obj, V_Subj, V_PP, NP_PP
- Results (held-out): [table not transcribed]
- Observations:
- Features are too scarce
- We're overfitting! We need more intelligent methods.
- Tools used: Chaos syntactic parser (Basili & Zanzotto), Max Entropy Toolkit (implemented by Jia Cui)
71. Semantic Fingerprint: Generalizing nouns using EuroWordnet
- Top level: generalizations
- Base concepts (tree structure)
- Bottom level: synonym sets (directed graph)
72. Noun semantic fingerprints: an example
- Words in the events are replaced by basic concepts
[Diagram: for "the CEO drove into the city with his own car", the nouns generalize through basic concepts such as object, location, person, area, social group, geographic area, district, administrative district, assemblage, and urban center; the example is tagged Solid.]
73. Verb semantic fingerprints: an example
[Diagram: for "the CEO drove into the city with his own car", the lexicalized feature drive_V_PP_into generalizes to travel_V_PP_into and move_V_PP_into (drive → travel → move); the V_Subj and V_PP slots are marked.]
74. How to exploit the word context?
- Semantic category for the Subj of "to think"
- Positive observations:
- "his wife thought he should eat more"
- "the waitress thought that Italians leave tiny tips"
- Our conceptual hierarchy contains FemaleHuman and MaleHuman...
- "Fabio thought he had a terrific idea before looking at the results"
- ⇒ Fabio is a FemaleHuman!
75. How to exploit the word context?
[Diagram: verifying the hypothesis for a target word W against contexts labeled H.]
76. Syntactic slots and slot fillers
77. How to exploit the word context?
- Using...
- a revised hierarchy:
- Female animal and male animal → Animal
- Female human and male human → Human
- Female and male → Animate
- the one-semantic-class-per-discourse hypothesis
- the semantic fingerprint, generalising nouns to the basic concepts of EuroWordnet and verbs to the topmost concepts in Wordnet
78. Results
- Held-out, combined: [table not transcribed]
- Test bed characteristics: [table not transcribed]
- Tools used: Chaos syntactic parser (Basili & Zanzotto), Max Entropy Toolkit (implemented by Jia Cui)
79. Results: a closer look
- Held-out, combined: [table not transcribed]
- Tools used: Chaos syntactic parser (Basili & Zanzotto), Max Entropy Toolkit (implemented by Jia Cui)
80. Overview (agenda slide repeated; see slide 28)
81. Unsupervised Semantic Labeling of Nouns using ME
- Frederick Jelinek
- Semantic Analysis for Sparse Data
82. Motivation
- Base ME features on the lexical and grammatical relationships found in the context of the nouns to be labeled
- Hand-labeled data is too sparse to allow using powerful ME compound features
- We wish to utilize the large unlabeled British National Corpus (and the internet, etc.) for training
- We will use the dictionary, and initialization by statistics from the smaller hand-labeled corpus
83. Format of Labeled Training Data
- w is the noun to be labeled
- r1, r2, ..., rm are the relationships in the context of w which correlate with the label appropriate to w
- C is the label denoting the semantic class
- f1, f2, ..., fK are the label counts, i.e., fC = 1 and fi = 0 for i ≠ C
- Then the event file format is:
- (f1, f2, ..., fK, w, r1, r2, ..., rm)
84. Format of BNC Training Data
- The label counts fi will be fractional, with fi = 0 if the dictionary does not allow noun w to have the i-th label.
- Always fi ≥ 0 and Σi fi = 1
- The problem is the initial selection of the values of fi
- Suggestion: let fi = Q(C = i | w), where Q denotes the empirical distribution from the hand-labeled data (see the sketch below).
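A sketch of that initialization; the container names and the uniform fallback for nouns unseen in the labeled data are assumptions, not from the slides:

```python
def init_label_counts(w, allowed_labels, q_hat, K):
    """Fractional label counts f for one BNC occurrence of noun w.
    allowed_labels: label indices the dictionary permits for w.
    q_hat: empirical P(C = i | w) from the hand-labeled corpus."""
    f = [q_hat.get(w, {}).get(i, 0.0) if i in allowed_labels else 0.0
         for i in range(K)]
    total = sum(f)
    if total == 0.0:
        # Assumed fallback: uniform over the dictionary-allowed labels.
        return [1.0 / len(allowed_labels) if i in allowed_labels else 0.0
                for i in range(K)]
    return [fi / total for fi in f]   # enforce sum_i f_i = 1
```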
85. [Diagram: corpus portions (annotated, unambiguous, BNC-seen, unseen, novel) each feed training; the resulting model tags the held-out data.]
86. Inner Loop: ME Re-estimation
- The empirical distribution used in the ME iterations is obtained from the sums of the fi values found in both the labeled and the BNC data sets.
- These counts determine:
- which of the potential features will be selected as actual features
- the values of the λ parameters in the ME model
87. Constraints and Equations
[Equations not transcribed; a plausible reconstruction follows.]
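Given slides 83-86, the constraints presumably take the usual ME form, with the fractional counts f_i acting as the empirical distribution; this is a reconstruction, not the original slide content:

```latex
% For every selected feature g_j, empirical and model expectations match:
\sum_{(f, w, \bar{r})} \sum_{c} f_c \, g_j(c, w, \bar{r})
  \;=\;
\sum_{(f, w, \bar{r})} \sum_{c} p_\lambda(c \mid w, \bar{r}) \, g_j(c, w, \bar{r})
```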
88. Outer Loop: Re-scaling of the Data
- Once the ME model P(C = c | w, r1, ..., rm) is estimated, the fi values in the event files of the BNC portion of the data are re-scaled (see the sketch below).
- The fi values in the hand-labeled portion remain unchanged
- New empirical counts are thus available:
- to determine the identity of new actual features
- to estimate the parameters of a new ME probability model
- Etc.
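A sketch of the combined inner/outer loop; train_maxent and model.posterior are hypothetical stand-ins for the workshop's ME toolkit, and the event objects are an assumed layout:

```python
def bootstrap(labeled_events, bnc_events, train_maxent, n_outer=5):
    """Outer loop: re-scale the fractional counts on the BNC events from
    the current ME model; hand-labeled counts are never touched."""
    model = None
    for _ in range(n_outer):
        # Inner loop: run GIS to convergence on all events.
        model = train_maxent(labeled_events + bnc_events)
        for ev in bnc_events:
            post = model.posterior(ev.w, ev.relations)  # P(C = i | w, r1..rm)
            # Labels the dictionary disallows (f_i == 0) must stay zero.
            f = [p if ev.f[i] > 0 else 0.0 for i, p in enumerate(post)]
            s = sum(f) or 1e-12                          # guard against 0
            ev.f = [fi / s for fi in f]
    return model
```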
89. Preliminary Results by Jia Cui
- The Sheffield-annotated corpus and the BNC unambiguous nouns provide the initial statistics.
- Label instances of the BNC corpus whose headwords are seen in the unambiguous data but are ambiguous according to the Longman dictionary.
[Results table not transcribed.]
90. Concluding Thoughts
- Preliminary results are promising
- The method requires theoretical and practical exploration
- Changing features and feature targets is a new phenomenon in ME estimation
- Careful selection of the relationships, and basing them on clusters where required, will lead to effective features
- See the proposal by Jia Cui and David Guthrie
91. Overview (agenda slide repeated; see slide 28)
92. Unsupervised Semantic Tagging
- Roberto Basili, Fabio Zanzotto, Marco Cammisa, Martin Holub, Kris Haralambiev, Cassia Martin, Jia Cui, David Guthrie
- JHU Summer Workshop 2003, August 22nd, 2003, Baltimore
93. Summary
- Motivations
- Lexical Information for Semantic Tagging
- Unsupervised Natural Language Learning
- Empirical Estimation for ME bootstrapping
- Weakly Supervised BNC Tagging through Wordnet
- A semantic similarity metric over Wordnet
- Experiments and Results
- Mapping LDOCE to Wordnet
- Bootstrapping over an untagged corpus
- Re-estimation through Wordnet
94. Motivations
- All experiments tell us that lexical information is crucial for semantic tagging,
- but data sparseness seems to limit the effect of the context
- The contribution of different resources needs to be exploited (as in WSD)
- In applications, hand-tagging should be applied in a cost-effective way
- Good results also need to scale up to technological scenarios where poorer (or no) resources are available
95. Motivations (cont'd)
- Wordnet's contribution to semantic tagging:
- A source of evidence for a larger set of lexical items (unseen words)
- A consistent way to generalize single observations
- (Hierarchical) constraints over word-use statistics
- Similarity of word uses suggests semantic similarity:
- Corpus-driven syntactic similarity is one possible choice
- Domain or topical similarity is also relevant
- Semantic similarity in the Wordnet hierarchy suggests useful levels of generalization:
- Specific hypernyms, i.e. those able to separate different senses
- General hypernyms, i.e. those that help to reduce the number of word classes to model
96. Learning Contextual Evidence
- Each syntactic relation provides a view of a word's usage, i.e. it suggests a set of nouns with common behaviour(s)
- Semantic similarity among nouns is a model of local semantic preference:
- to drink {beer, water, ..., cocoa/L, stock/L, ...}
- The {president, director, boy, ace/H, brain/H, ...} succeeds
97. Semantic classes vs. language models
- The role of p(C | v, d):
- e.g. p(n | v, d) ≈ p(n | C) p(C | v, d)
- Implications:
- p(n | C) gives a lexical semantic model that is likely to depend on the corpus and not on the individual context
- p(C | v, d) models selectional preferences and provides disambiguation cues for contexts (v, d, X)
98. Semantic classes vs. language models
- Lexical evidence: p(n | C) (or also p(C | n))
- Contextual evidence: p(C | v, d)
- The idea:
- Contextual evidence can be collected from the corpus by involving the lexical knowledge base
- The modeling of lexical evidence can be seen as a side effect of the context (p(C | n) → p(n | C))
- Implied approach:
- Learn the second as an estimate for the first, and then combine them for bootstrapping to unseen words
99. Conceptual Density
- Basic terminology:
- Target noun set T (e.g. beer, water, stock: nouns in the relation r = V-DirObj with a given verb)
- (Branching factor) The average number m of children of a node s, i.e. the average number of children of any node subsumed by s
- (Marks) The set of marks M, i.e. the subset of nouns in T that are subsumed within the WN subhierarchy rooted in s; N = |M|
- (Area) area(s): the total number of nodes of the subhierarchy rooted at s
100. Conceptual Density (cont'd)
[Diagram: an example subhierarchy with numbered nodes, illustrating marks and area.]
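The formula on this slide was not transcribed; using the terminology of slide 99 (m = branching factor, N = number of marks under s, area(s) = nodes under s), the standard conceptual density of Agirre and Rigau, which the talk presumably follows, is:

```latex
CD(s) = \frac{\sum_{i=0}^{N-1} m^{i}}{\mathrm{area}(s)}
```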
101. Using Conceptual Density
- Target noun set T (e.g. subjects of the verb "to march"):
- horse (6 senses in WN1.6)
- ant (1 sense in WN1.6)
- troop (4 senses in WN1.6)
- division (12 senses in WN1.6)
- elephant (2 senses in WN1.6)
- FIND the smallest set of synsets s that covers T and maximizes CD(s) (see the sketch below)
- (1) organization, organisation: horse, troops, divisions
- (2) placental, placental_mammal, ...: horse, elephant
- (3) animal, animate_being: horse, elephant, ant
- (4) army_unit: troop, division
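A toy sketch of scoring candidate synsets by conceptual density; the m, N, and area values below are made up for illustration, not taken from WordNet:

```python
def conceptual_density(m, N, area):
    """CD(s) = (sum_{i=0}^{N-1} m**i) / area(s): m = branching factor,
    N = marks (members of T under s), area = nodes under s."""
    return sum(m ** i for i in range(N)) / area

# Hypothetical candidates for T = {horse, ant, troop, division, elephant}:
candidates = {
    "organization":     dict(m=2.1, N=3, area=120),
    "placental_mammal": dict(m=2.4, N=2, area=90),
    "animal":           dict(m=2.3, N=3, area=400),
    "army_unit":        dict(m=1.8, N=2, area=15),
}
for syn, params in sorted(candidates.items(),
                          key=lambda kv: -conceptual_density(**kv[1])):
    print(syn, round(conceptual_density(**params), 3))
```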
102. Summary (agenda slide repeated; see slide 93)
103. Results: Mapping LDOCE classes
- Lexical entries in LDOCE are defined in terms of a semantic class and a topical tag (subject codes), e.g. stock ('L','FO')
- The semantic similarity metric has been used to derive the WN synset(s) that represent <SemClass, SubjCode> pairs:
- A WN explanation of the lexical entries in an LM class (lexical mapping)
- The position(s) in the WN noun hierarchy of each LM class (category mapping)
- Semantic preferences of synsets given words, LM classes (and subject codes) can be mapped into probabilities, e.g.:
- p(WN_syns | n, LM_class)
- and then:
- p(LM_class | n, WN_syns), p(LM_class | n), p(LM_class | WN_syns)
104. Mapping LDOCE classes (cont'd)
- Example: Cluster 2---EDZI
- '2' = 'Abstract and solid'
- 'ED'-'ZI' = 'education - institutions, academic name of'
- T = {nursery_school, polytechnic, school, seminary, senate}
- Synset: school; cd = 0.580, coverage = 60%
- Synset: educational_institution; cd = 0.527, coverage = 80%
- Synset: gathering, assemblage; cd = 0.028, coverage = 40%
105. Case study: the word "stock" in LDOCE
- stock T: a supply (of something) for use
- stock J: goods for sale
- stock N: the thick part of a tree trunk
- stock A: a group of animals used for breeding
- stock A: farm animals, usu. cattle; LIVESTOCK
- stock T: a family line, esp. of the stated character
- stock T: money lent to a government at a fixed rate of interest
- stock T: the money (CAPITAL) owned by a company, divided into SHAREs
- stock P: a type of garden flower with a sweet smell
- stock L: a liquid made from the juices of meat, bones, etc., used in cooking
- stock J: (in former times) a stiff cloth worn by men round the neck of a shirt - compare TIE
- stock N: a piece of wood used as a support or handle, as for a gun or tool
- stock N: the piece which goes across the top of an ANCHOR_1_1 from side to side
- stock P: a plant from which CUTTINGs are grown
- stock P: a stem onto which another plant is GRAFTed
106. Case study: stock as Animal (A)
- stock A: a group of animals used for breeding
- stock A: farm animals, usu. cattle; LIVESTOCK
107. Case study: stock (N, P)
- stock N: a piece of wood used as a support or handle, as for a gun or tool
- stock N: the piece which goes across the top of an ANCHOR_1_1 from side to side
- stock N: the thick part of a tree trunk
- stock P: a plant from which CUTTINGs are grown
- stock P: a stem onto which another plant is GRAFTed
- stock P: a type of garden flower with a sweet smell
108. LM Category Mapping
109. Results: A Simple (Unsupervised) Tagger
- Estimate, over the parsed corpus plus Wordnet and by mapping into LD categories, the following quantities:
- P(C | hw, r), P(C | r), P(C | hw)
- (r ranges over SubjV, DirObj, N_P_hw, hw_P_N)
- Apply a simple Bayesian model to any incoming context:
- <hw, r1, ..., rk>
- and select argmaxC( p(C | hw) p(C | r1) ... p(C | rk) ) (see the sketch below)
- (NB: p(C | rj) is the back-off of p(C | hw, rj))
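A minimal sketch of that selection rule in log space; the probability tables and the smoothing floor are assumptions, not the workshop's implementation:

```python
import math

FLOOR = 1e-9   # assumed smoothing floor for unseen events

def tag(hw, relations, p_c_hw, p_c_hw_r, p_c_r, categories):
    """Select argmax_C p(C | hw) * prod_j p(C | r_j),
    backing off from p(C | hw, r_j) to p(C | r_j)."""
    def ctx(c, r):
        table = p_c_hw_r.get((hw, r)) or p_c_r.get(r, {})   # back-off
        return table.get(c, FLOOR)
    def score(c):
        return (math.log(p_c_hw.get(hw, {}).get(c, FLOOR))
                + sum(math.log(ctx(c, r)) for r in relations))
    return max(categories, key=score)
```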
110. Unsupervised Tagger Evaluation
[Results table not transcribed.]
111. Results: Re-estimated probabilities for a ME model
- Use the sentences in the training data to learn lexical and contextual preferences of nouns and relations
- Use the lexical preferences to pre-estimate the empirical distributions over the unseen data (see the constraints Q(c, w, R) in Fred's part)
- Train the ME model over all available data
- Tag the held-out and blind data
112. Results
- Training Data + WN held-out, all syntactic features, tested on held-out: 79-80%
- Training Data + WN held-out, only head words, tested on held-out: 81.80%
- Training Data + WN held-out, all syntactic features, tested on blind data: 86.03%
- Training Data only, all syntactic features, tested on held-out: 78-79%
- Training Data only, only head words, tested on held-out: 80.76%
113. Conclusions
- A robust parameter estimation method for semantic tagging:
- Less prone to sparse data
- Generalizes to meaningful noun classes
- Develops lexicalized contextual cues and a semantic dictionary
- A natural and viable way to integrate corpus-driven evidence with a general-purpose lexicon
- Results are consistent with those of fully supervised methods
- Open perspectives for effective estimation of unseen empirical distributions
114. Open Issues
- Estimate contextual and lexical probabilities from the 28M-word portion of the BNC (already parsed here)
- Alternative formulations of the similarity metrics
- Experiment with a bootstrapping method that imposes the proposed estimates (i.e. p(C | w, SubjV)) as constraints on Q(C, w, SubjV)
- Manually assess and measure the automatically derived Longman-Wordnet mapping
115. Summary Slide
- IR-inspired approaches (Kalina)
- Evaluation (Kris)
- Supervised methods using maximum entropy
- Incorporating context preferences (Jerry)
- Adjective Classes and Subject markings (David)
- Structuring the context using syntax and semantics (Cassia, Fabio)
- Re-estimation techniques for Maximum Entropy Experiments (Fred)
- Unsupervised Re-estimation (Roberto)
116. Our Accomplishments
- Developed a method for bootstrapping using maximum entropy
- Ran more than 300 experiments with features
- Integrated dictionary and syntactic information
- Integrated dictionary, Wordnet, syntactic, and topic information in experiments, which gave us a significant improvement
- Developed a system for unsupervised tagging
117. Lessons learned
- Semantic tagging has an intermediate complexity, between the rather successful NE recognition and Word Sense Disambiguation
- Semantic tagging over the BNC is viable with high accuracy:
- The accuracy reached by most of the proposed methods is ~94%
- This task stimulates cross-fertilization between statistical and symbolic knowledge, grounded in solid linguistic principles and resources
118. NO! The near future at a glance
- The availability of semantic information for head nouns is critical to a variety of linguistic tasks:
- IR and CLIR, Information Extraction and Question Answering
- Machine Translation and Language Modeling
- Annotated resources can provide a significant stimulus to machine learning of linguistic patterns (e.g. QA answer structures)
- Open possibilities for corpus-driven learning of other semantic phenomena (e.g. verb argument structures) and incremental learning methods
119. ... and a quick look further
- Unseen phenomena still represent hard cases for any probabilistic model (rare vs. impossible labels for unseen/novel words)
- Integration of external resources is problematic:
- Projecting observed empirical distributions may lead to overfitting the data
- Lexical information (e.g. Wordnet) does not have a clear probabilistic interpretation
- Soft features (Jia Cui) seem a promising model
- Better use of the context:
- Design and derivation of class-based contextual features (David Guthrie)
- Existing lexical resources provide large-scale and effective information for bootstrapping
120. A Final Thought
- Thanks to the Johns Hopkins faculty and staff for their availability and helpfulness during the workshop.
- Special thanks to Fred Jelinek for answering endless questions about maximum entropy and helping to model our problem.