Title: Attacking the Data Sparseness Problem
1. Attacking the Data Sparseness Problem
- Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Bontcheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
2. Motivation for the project
- Texts for text extraction contain sentences like:
- The IRA bombed a family-owned shop in Belfast yesterday.
- FMLN set off a series of explosions in central Bogota today.
3. Motivation for the project
- We'd like to automatically recognize that both are of the form:
- The IRA bombed a family-owned shop in Belfast yesterday.
- FMLN set off a series of explosions in central Bogota today.
4. Our Hypotheses
- A transformation of a corpus to replace words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling and text extraction.
- Semantic category information might also help improve machine translation.
- An initially noun-centric approach will allow bootstrapping for other syntactic categories.
5. A six-week goal: Labeling noun phrases
- Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday.
- Humans aboard space_vehicle dodge satellite timeref.
6. Preparing the data: Pre-Workshop
- Identify a tag set
- Create a human-annotated corpus
- Create a double-annotated corpus
- Process all data for named entity and noun phrase recognition using GATE tools (26 million words)
- Parsed about 26 million words
- Develop algorithms for mapping target categories to Wordnet synsets to support the tag set assessment
7. The Semantic Classes and the Corpus
- A subset of classes available in Longman's Dictionary of Contemporary English (LDOCE), electronic version
- Rationale:
- The number of semantic classes was small
- The classes are somewhat reliable, since they were used by a team of lexicographers to code noun senses, adjective preferences, and verb preferences
- Many words have subject-area information, which might be useful
8. The Semantic Classes
[Hierarchy diagram, built up progressively across slides 8-11. The classes shown: Concrete, Abstract, Animate, Inanimate, Solid, Gas, Liquid, Plant, Animal, Human, Movable, Non-movable, Collective, Female Animal.]
12. The human-annotated statistics
- Inter-annotator agreement is 94%, so that is the upper limit of our task.
- 214,446 total annotated noun phrases (262,683 including "None of the Above")
- 29,071 unique vocabulary items (unlemmatized)
- 25 semantic categories (162 associated subject areas were identified)
- 127,569 with semantic category Abstract (59%)
13. The experimental setup
- BNC (Science, Politics, Business): 26 million words
14. The main development set (dev)
- Training: 113,000 instances
- Held out: 85,000 instances
- Blind portion
- Machine learning to improve this
15. A challenging development set for experiments on unseen words (the Hard data set)
- Training: all unambiguous words, 125,000 instances
- Held out: ambiguous words, 73,000 instances
- Blind portion
- Machine learning to improve this
16. Our Experiments include
- Supervised approaches (learning from human-annotated data)
- Unsupervised approaches
- Using outside evidence (the dictionary or Wordnet)
- Syntactic information from parsing or pattern matching
- Context words, the use of preferences, the use of topical information
17. Experiments on unseen words (Hard data set)
- The training corpus has only words with unambiguous annotations
- 125,000 training instances
- 73,000 instances held out
- Perplexity: 21
- Baseline accuracy: 45%
- Improved accuracy: 68.5%
- Context can contribute greatly in unsupervised experiments
18. Results on the dev set
- Random split, with some frequent ambiguous words moved into testing
- 113,000 training instances
- 85,000 instances held out
- Perplexity: 3.44
- Baseline accuracy: 80%
- Improved accuracy: 87%
19. The scheme for annotating the large corpus
- After experimenting with the development sets, we need a scheme for making use of all of the dev corpus to tag the blind corpus.
- We developed an incremental scheme within the maximum entropy framework.
- Several talks have to do with re-estimation techniques useful to the bootstrapping process.
20. Terminology
- Seen words: words seen in the human-annotated data (new instances of known words)
- Unseen words: not in the training material, but in the dictionary
- Novel words: not in the training material nor in the dictionary/Wordnet
21. Bootstrapping
[Diagram: the human-annotated corpus (with a blind portion) together with the unannotated data.]
22. The Unannotated Data: Four types
[Diagram, built up across slides 22-25: unambiguous words (515,000 instances), words seen in training (550,000 instances), and novel words (20,000 instances); the fourth type is unseen words.]
26. [Diagram: corpus portions and flow. Annotated: 201K; Unambiguous: 515K; Seen: 550K; 9K (presumably the unseen portion); Novel: 20K. Each portion feeds training, which is then used to tag the test data.]
27. Results on the Blind Data
- We set aside one tenth of the annotated corpus
- Randomly selected within each of the domains
- It contained 13,000 annotated instances
- The baseline here was very high: 90% with simple techniques
- We were able to achieve 93.5% accuracy
28. Overview
- Bag of words (Kalina)
- Evaluation (Kris)
- Supervised methods using maximum entropy (Klaus)
- Incorporating context preferences (Jerry)
- Experiments with Adjective Classes and Subject (David, Jia, Martin)
- Structuring the context using syntax and semantics (Cassia, Fabio)
- Re-estimation techniques for Maximum Entropy Experiments (Fred, Jia)
- Unsupervised Re-estimation (Roberto)
- Student Proposals (Jia, Dave, Marco)
- Conclusion
29. Semantic Categories and MT
- 10 test words: high, medium, and low frequency
- Collected their target translations using EuroWordNet (e.g. Dutch)
- Crane:
- "lifts and moves heavy objects": hijskraan, kraan
- "large long-necked wading bird": kraanvogel
30. SemCats and MT (2)
- Manually mapped synonym sets to semantic categories
- An automatic mapping will be presented later
- Studied how many synonym sets are ruled out as translations by the semantic category
31. Some Results
- 3 words: full disambiguation
- crane (Mov.Solid/Animal), medicine (Abstract/Liquid), plant (Plant/Solid)
- 7 words: the categories substantially reduce the possible translations
- club: Abstr/"an association of people...", Mov.Solid/"stout stick...", Mov.Solid/"an implement used by a golfer...", Mov.Solid/"a playing card...", NonMov.Solid/"a building"
- club/NonMov.Solid: clubgebouw, clubhuis, ...
- club/Abstr.: bevolkingsgroep, broederschap, ...
- club/Mov.Solid: knots, kolf, malie, club
32. The architecture
- The multiple-knowledge-sources WSD architecture (Stevenson 03)
- Allows the use of multiple taggers and combines their results through a weighted function
- Weights can be learned from a corpus
- All taggers are implemented as GATE components and combined in applications
34. The Bag-of-Words Tagger
- The bag-of-words tagger is an Information Retrieval-inspired tagger with parameters:
- Window size: 50 (default value)
- What POS to put in the content vectors (default: nouns and verbs)
- Which similarity measure to use
- Used in WSD (Leacock et al. 92)
- Crane/Animal: species, captivity, disease
- Crane/Mov.Solid: worker, disaster, machinery
35. BoW classifier (2)
- Seen words are classified by calculating the inner product between their context vector and the vectors for each possible category (see the sketch below)
- Inner product calculated as:
- Binary vectors: the number of matching terms
- Weighted vectors:
- Leacock's measure: favour concepts that occur frequently in exactly one category
- Take into account the polysemy of concepts in the vectors
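A minimal sketch of the inner-product classification described above, using binary vectors; the category vectors and helper names are illustrative stand-ins, not the workshop's GATE implementation:

```python
from collections import Counter

def context_vector(words):
    """Bag-of-words context: term -> count."""
    return Counter(words)

def inner_product(ctx, cat_vec, binary=True):
    """Binary vectors: number of matching terms;
    weighted vectors: sum of products of term weights."""
    shared = set(ctx) & set(cat_vec)
    if binary:
        return len(shared)
    return sum(ctx[t] * cat_vec[t] for t in shared)

def classify(ctx, category_vectors, binary=True):
    """Assign the category whose vector best matches the context."""
    return max(category_vectors,
               key=lambda c: inner_product(ctx, category_vectors[c], binary))

# Toy data echoing the crane example from the previous slide:
cats = {"Animal": context_vector(["species", "captivity", "disease"]),
        "Mov.Solid": context_vector(["worker", "disaster", "machinery"])}
print(classify(context_vector(["machinery", "near", "the", "worker"]), cats))
# -> Mov.Solid
```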
36. Current performance measures
- The baseline frequency tagger on its own: 91% on the test (blind) set
- Bag-of-words tagger on its own: 92.7%
- Combined architecture: 93.2% (window size 50, using only nouns, binary vectors)
37. Future work on the architecture
- Integrate syntactic information, subject codes, and document topics
- Experiment with cosine similarity
- Implement the Yarowsky 92 WSD algorithm
- Implement the weighted-function module
- Experiment with integrating the ME tools as one of the taggers, supplying preferences for the weighting module
38. Overview (agenda slide repeated; see slide 28)
39. Accuracy Measurements (Kris Haralambiev)
- How to measure the accuracy
- How to distinguish "correct", "almost correct", and "wrong"
40. Exact Match Measurements
- W = (w1, w2, ..., wn): the vector of annotated words
- X = (x1, x2, ..., xn): the categories assigned by the annotators
- Y = (y1, y2, ..., yn): the categories assigned by a program
- Exact match (default) measurement: 1 for a match and 0 for a mismatch of each (xi, yi) pair
- accuracy(X, Y) = |{i : xi = yi}| / n
41. The Hierarchy
42. Ancestor Relation Measurement
- The exact match will assign 0 for the pairs (H,M), (H,F), (A,Q), ...
- Give a partial score for two categories in an ancestor relation:
- weight(Cat) = |{i : xi is in the tree with root Cat}|
- score(xi, yi) = min( weight(xi)/weight(yi), weight(yi)/weight(xi) )
- accuracy(X, Y) = (1/n) * Σi score(xi, yi)
43. Edge Distance Measurement
- The ancestor relation will assign some score for pairs like (H,M), (A,Q), but will still assign 0 for pairs like (M,F), (A,H)
- Going further, we want to compute the similarity (distance) between X and Y:
- distance(xi, yi) = the length of the simple path from xi to yi
- Each edge can be given an individual length, or all edges have length 1 (we prefer the latter)
44. Edge Distance Measurement (cont'd)
- distance(X, Y) = Σi distance(xi, yi)
- Accuracy falls linearly with distance: accuracy 100% corresponds to distance 0, and accuracy 0% to max_possible_distance, i.e.
- accuracy(X, Y) = 100 * (1 - distance(X, Y) / max_possible_distance)
- max_possible_distance = Σi max_cat distance(xi, cat)
- It might be reasonable to use the average instead of the max (all three measures are sketched below)
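A compact sketch of the three measures over a toy fragment of the category tree; the parent map and the example labels are illustrative, not the full hierarchy:

```python
# Toy fragment of the hierarchy; PARENT maps each category to its parent.
PARENT = {"Concrete": None, "Animate": "Concrete", "Inanimate": "Concrete",
          "Human": "Animate", "Animal": "Animate",
          "Solid": "Inanimate", "Liquid": "Inanimate"}

def ancestors(c):
    """c itself first, then its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def exact_match(gold, pred):
    return sum(x == y for x, y in zip(gold, pred)) / len(gold)

def ancestor_score(x, y, gold):
    """Partial credit only when one category is an ancestor of the other;
    weight(Cat) counts gold labels inside the subtree rooted at Cat."""
    if x == y:
        return 1.0
    if x in ancestors(y) or y in ancestors(x):
        wx = sum(x in ancestors(g) for g in gold)
        wy = sum(y in ancestors(g) for g in gold)
        if min(wx, wy) == 0:
            return 0.0
        return min(wx / wy, wy / wx)
    return 0.0

def edge_distance(x, y):
    """Simple-path length between x and y, all edges of length 1."""
    ax, ay = ancestors(x), ancestors(y)
    lca = next(c for c in ax if c in ay)   # lowest common ancestor
    return ax.index(lca) + ay.index(lca)

def edge_distance_accuracy(gold, pred, all_cats):
    dist = sum(edge_distance(x, y) for x, y in zip(gold, pred))
    max_dist = sum(max(edge_distance(x, c) for c in all_cats) for x in gold)
    return 100.0 * (1 - dist / max_dist)

gold, pred = ["Human", "Solid"], ["Animal", "Inanimate"]
print(exact_match(gold, pred))   # 0.0: neither pair matches exactly
print(sum(ancestor_score(x, y, gold) for x, y in zip(gold, pred)) / len(gold))
print(edge_distance_accuracy(gold, pred, list(PARENT)))
```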
45. Some Baselines
46. Overview (agenda slide repeated; see slide 28)
47. Supervised Methods using Maximum Entropy
- Jia Cui, David Guthrie, Martin Holub, Jerry Liu, Klaus Macherey
48. Overview
- Maximum Entropy Approach
- Feature Functions
- Word Classes
- Experimental Results
49. Maximum Entropy Approach
- Principle:
- Define suitable features (constraints) on the training data
- Find the maximum entropy distribution that satisfies the constraints (GIS; the model form and update are recalled below)
- Properties:
- Easy to integrate information from several knowledge sources
- Always converges to the global optimum on the training data
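For reference, the standard conditional maximum entropy model and GIS update that the slide alludes to; this is the textbook formulation, not transcribed from the deck:

```latex
p_\lambda(c \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, c)\Big),
\qquad
Z_\lambda(x) = \sum_{c'} \exp\Big(\sum_i \lambda_i f_i(x, c')\Big)

% GIS update, with F = \max_{x,c} \sum_i f_i(x, c):
\lambda_i^{(t+1)} = \lambda_i^{(t)} + \frac{1}{F}
  \log \frac{\tilde{E}[f_i]}{E_{p^{(t)}}[f_i]}
```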
50. Feature Functions
- Prior features:
- Use the unigram probabilities P(c) of the semantic categories c as a feature
- Lexical features:
- Use the lexical information directly as a feature
- Reduce the number of features by using the following definition [formula not transcribed]
51. Feature Functions (cont'd)
- Longman preference features:
- The Longman Dictionary provides subject codes for nouns
- Use the frequency of the preferences as additional features
- Unknown word features:
- Prefix features
- Suffix features
- Human-IST feature
52. Word Classes
- Lemmatization:
- Eliminate inflections and reduce words to their base form
- Assumption: different cases of one word have the same semantic classes
- Mutual information (standard definition recalled below):
- Measures the amount of information one random variable contains about another
- Applied to nouns and adjectives
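The mutual information referred to above, in its standard form (a reminder, not transcribed from the slides):

```latex
I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
```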
55. Overview (agenda slide repeated; see slide 28)
56. Incorporating Context Features
- Jia Cui, David Guthrie, Martin Holub, Klaus Macherey, Jerry Liu
57. Overview
- Result Analysis
- Rewind: Encoding Feature Functions
- Incorporating Context Features
- Clustering Methods
- Experimental Results
60. Adjectives
- Continuing the example "angry kid":
- Describe adjectives by the categories of the nouns that they prefer to modify, to avoid sparseness.
- Obtain a set of categories for both "kid" and "angry":
- kid: A, S, H, H, H
- angry: T, H
- We can concatenate them together (merging): A, S, H, H, H, T, H
- Or do some kind of component-wise multiplication (pruning): H, H, H
- Simply merging introduces irrelevant categories and increases entropy (see the sketch below)
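A small sketch of the two combination operations on the slide's category multisets; the Counter encoding is an illustrative choice:

```python
from collections import Counter

kid   = Counter({"A": 1, "S": 1, "H": 3})   # kid:   A, S, H, H, H
angry = Counter({"T": 1, "H": 1})           # angry: T, H

# Merging: concatenate the two multisets.
merged = kid + angry                         # A, S, H, H, H, T, H

# Pruning: component-wise multiplication keeps only shared categories.
pruned = Counter({c: kid[c] * angry[c] for c in kid.keys() & angry.keys()})
print(merged)   # Counter({'H': 3, 'A': 1, 'S': 1, 'T': 1})
print(pruned)   # Counter({'H': 3}), i.e. H, H, H
```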
61. Clustering Methods
- The Longman dictionary contains such adjective preferences, but we can also generate preferences based on the corpus.
- Measure the entropy of each adjective from the frequency with which it modifies nouns of each particular category:
- The lower the entropy, the more contextually useful the adjective
- Measure the confidence of an adjective by its frequency
- Example: angry
- adj: angry, entropy: 2.18, freqs: 155, 55, 9, 7, 0, ...
63. Overview (agenda slide repeated; see slide 28)
64. Hard vs Soft Word Clusters
- Words as features are sparse, so we need to cluster them (see the sketch below).
- Hard clusters:
- A feature is assigned to one and only one cluster (the cluster for which there exists the strongest evidence).
- Soft clusters:
- A feature is assigned to as many clusters as there is evidence for.
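A minimal sketch of the two assignment policies; the per-word category-count data structure is an illustrative assumption:

```python
from collections import defaultdict

def hard_clusters(word_cat_counts):
    """Each word joins exactly one cluster: its most frequent category."""
    clusters = defaultdict(set)
    for w, counts in word_cat_counts.items():
        clusters[max(counts, key=counts.get)].add(w)
    return clusters

def soft_clusters(word_cat_counts):
    """Each word joins every cluster it has evidence for."""
    clusters = defaultdict(set)
    for w, counts in word_cat_counts.items():
        for cat, n in counts.items():
            if n > 0:
                clusters[cat].add(w)
    return clusters
```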
65. Using clustering and contextual features
- Baseline: the prior (most frequent) semantic category
- All words within the target noun phrase (with a threshold of 10 occurrences)
- Adjective hard clusters:
- Clusters are defined by the most frequent semantic category
- Noun soft clusters:
- Clusters are defined by all semantic categories
- Combined adjective hard clusters and noun soft clusters
66. Results with clusters and context
- Training on Training + Held-out
- Testing on Blind Data
- ME tool: Jia's MaxEnt toolkit
[Results table not transcribed.]
67. Measuring Usefulness of Adjectives in Context
- We have a huge number of nouns that are assigned a semantic tag, from (A) the training data, and (B) the BNC corpus where the noun is unambiguous with regard to the possible semantic category.
- Using the adjectives that modify these nouns, we can compute the entropy H(T | a) (reconstructed below), where a is an adjective and C is the set of semantic categories.
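The entropy formula itself was an image on the slide; a plausible reconstruction from the surrounding text, treating the category of the modified noun as a random variable T over C:

```latex
H(T \mid a) = - \sum_{t \in C} p(t \mid a) \log p(t \mid a),
\qquad
p(t \mid a) = \frac{\mathrm{freq}(a, t)}{\sum_{t' \in C} \mathrm{freq}(a, t')}
```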
68. Clustering Adjectives
- We take adjectives with low H(T | a) and form clusters from them, depending on which semantic category they predict (see the sketch below)
- Then use each cluster of adjectives as a context feature
- θ1 and θ2 are thresholds
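A sketch combining the entropy and frequency thresholds (called theta1 and theta2 here, matching the slide; the data layout and default values are assumptions):

```python
import math
from collections import defaultdict

def entropy(freqs):
    """H(T | a) over the category frequencies of one adjective."""
    total = sum(freqs.values())
    return -sum(f / total * math.log2(f / total)
                for f in freqs.values() if f > 0)

def adjective_clusters(adj_cat_freqs, theta1=1.0, theta2=50):
    """theta1: max entropy; theta2: min total frequency (confidence)."""
    clusters = defaultdict(set)
    for adj, freqs in adj_cat_freqs.items():
        if sum(freqs.values()) >= theta2 and entropy(freqs) <= theta1:
            # Cluster the adjective under the category it predicts best.
            clusters[max(freqs, key=freqs.get)].add(adj)
    return clusters
```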
69. Overview (agenda slide repeated; see slide 28)
70. Structuring the context using syntax
- Syntactic model: eXtended Dependency Graph
- Syntactic relations considered: V_Obj, V_Subj, V_PP, NP_PP
- Results (held-out): [table not transcribed]
- Observations:
- Features are too scarce
- We're overfitting! We need more intelligent methods.
- Tools used: Chaos syntactic parser (Basili & Zanzotto), Max Entropy Toolkit (implemented by Jia Cui)
71. Semantic Fingerprint: Generalizing nouns using EuroWordnet
- Top level: generalizations
- Base concepts (tree structure)
- Bottom level: synonym sets (directed graph)
72. Noun semantic fingerprints: an example
- Words in the events are replaced by basic concepts
[Diagram: for "the CEO drove into the city with his own car", the nouns generalize through basic concepts such as object, location, person, area, social group, geographic area, district, administrative district, assemblage, and urban center; the example is tagged Solid.]
73. Verb semantic fingerprints: an example
[Diagram: for "the CEO drove into the city with his own car", the lexicalized feature drive_V_PP_into generalizes to travel_V_PP_into and move_V_PP_into (drive → travel → move); the V_Subj and V_PP slots are marked.]
74. How to exploit the word context?
- Semantic category for the Subj of "to think"
- Positive observations:
- "his wife thought he should eat more"
- "the waitress thought that Italians leave tiny tips"
- Our conceptual hierarchy contains FemaleHuman and MaleHuman...
- "Fabio thought he had a terrific idea before looking at the results"
- ⇒ Fabio is a FemaleHuman!
75. How to exploit the word context?
[Diagram: verifying the hypothesis for a target word W against contexts labeled H.]
76. Syntactic slots and slot fillers
77. How to exploit the word context?
- Using...
- a revised hierarchy:
- Female animal and male animal → Animal
- Female human and male human → Human
- Female and male → Animate
- the one-semantic-class-per-discourse hypothesis
- the semantic fingerprint, generalising nouns to the basic concepts of EuroWordnet and verbs to the topmost concepts in Wordnet
78. Results
- Held-out, combined: [table not transcribed]
- Test bed characteristics: [table not transcribed]
- Tools used: Chaos syntactic parser (Basili & Zanzotto), Max Entropy Toolkit (implemented by Jia Cui)
79. Results: a closer look
- Held-out, combined: [table not transcribed]
- Tools used: Chaos syntactic parser (Basili & Zanzotto), Max Entropy Toolkit (implemented by Jia Cui)
80. Overview (agenda slide repeated; see slide 28)
81. Unsupervised Semantic Labeling of Nouns using ME
- Frederick Jelinek
- Semantic Analysis for Sparse Data
82. Motivation
- Base ME features on the lexical and grammatical relationships found in the context of the nouns to be labeled
- Hand-labeled data is too sparse to allow using powerful ME compound features
- We wish to utilize the large unlabeled British National Corpus (and the internet, etc.) for training
- We will use the dictionary, and initialization by statistics from the smaller hand-labeled corpus
83. Format of Labeled Training Data
- w is the noun to be labeled
- r1, r2, ..., rm are the relationships in the context of w which correlate with the label appropriate to w
- C is the label denoting the semantic class
- f1, f2, ..., fK are the label counts, i.e., fC = 1 and fi = 0 for i ≠ C
- Then the event file format is:
- (f1, f2, ..., fK, w, r1, r2, ..., rm)
84. Format of BNC Training Data
- The label counts fi will be fractional, with fi = 0 if the dictionary does not allow noun w to have the i-th label.
- Always fi ≥ 0 and Σi fi = 1
- The problem is the initial selection of the values of fi
- Suggestion: let fi = Q(C = i | w), where Q denotes the empirical distribution from the hand-labeled data (see the sketch below).
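A sketch of that initialization; the container names and the uniform fallback for nouns unseen in the labeled data are assumptions, not from the slides:

```python
def init_label_counts(w, allowed_labels, q_hat, K):
    """Fractional label counts f for one BNC occurrence of noun w.
    allowed_labels: label indices the dictionary permits for w.
    q_hat: empirical P(C = i | w) from the hand-labeled corpus."""
    f = [q_hat.get(w, {}).get(i, 0.0) if i in allowed_labels else 0.0
         for i in range(K)]
    total = sum(f)
    if total == 0.0:
        # Assumed fallback: uniform over the dictionary-allowed labels.
        return [1.0 / len(allowed_labels) if i in allowed_labels else 0.0
                for i in range(K)]
    return [fi / total for fi in f]   # enforce sum_i f_i = 1
```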
85. [Diagram: corpus portions (annotated, unambiguous, BNC-seen, unseen, novel) each feed training; the resulting model tags the held-out data.]
86. Inner Loop: ME Re-estimation
- The empirical distribution used in the ME iterations is obtained from the sums of the fi values found in both the labeled and the BNC data sets.
- These counts determine:
- which of the potential features will be selected as actual features
- the values of the λ parameters in the ME model
87. Constraints and Equations
[Equations not transcribed; a plausible reconstruction follows.]
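Given slides 83-86, the constraints presumably take the usual ME form, with the fractional counts f_i acting as the empirical distribution; this is a reconstruction, not the original slide content:

```latex
% For every selected feature g_j, empirical and model expectations match:
\sum_{(f, w, \bar{r})} \sum_{c} f_c \, g_j(c, w, \bar{r})
  \;=\;
\sum_{(f, w, \bar{r})} \sum_{c} p_\lambda(c \mid w, \bar{r}) \, g_j(c, w, \bar{r})
```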
88. Outer Loop: Re-scaling of the Data
- Once the ME model P(C = c | w, r1, ..., rm) is estimated, the fi values in the event files of the BNC portion of the data are re-scaled (see the sketch below).
- The fi values in the hand-labeled portion remain unchanged
- New empirical counts are thus available:
- to determine the identity of new actual features
- to estimate the parameters of a new ME probability model
- Etc.
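A sketch of the combined inner/outer loop; train_maxent and model.posterior are hypothetical stand-ins for the workshop's ME toolkit, and the event objects are an assumed layout:

```python
def bootstrap(labeled_events, bnc_events, train_maxent, n_outer=5):
    """Outer loop: re-scale the fractional counts on the BNC events from
    the current ME model; hand-labeled counts are never touched."""
    model = None
    for _ in range(n_outer):
        # Inner loop: run GIS to convergence on all events.
        model = train_maxent(labeled_events + bnc_events)
        for ev in bnc_events:
            post = model.posterior(ev.w, ev.relations)  # P(C = i | w, r1..rm)
            # Labels the dictionary disallows (f_i == 0) must stay zero.
            f = [p if ev.f[i] > 0 else 0.0 for i, p in enumerate(post)]
            s = sum(f) or 1e-12                          # guard against 0
            ev.f = [fi / s for fi in f]
    return model
```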
89. Preliminary Results by Jia Cui
- The Sheffield-annotated corpus and the BNC unambiguous nouns provide the initial statistics.
- Label instances of the BNC corpus whose headwords are seen in the unambiguous data but are ambiguous according to the Longman dictionary.
[Results table not transcribed.]
90. Concluding Thoughts
- Preliminary results are promising
- The method requires theoretical and practical exploration
- Changing features and feature targets is a new phenomenon in ME estimation
- Careful selection of the relationships, and basing them on clusters where required, will lead to effective features
- See the proposal by Jia Cui and David Guthrie
91. Overview (agenda slide repeated; see slide 28)
92. Unsupervised Semantic Tagging
- Roberto Basili, Fabio Zanzotto, Marco Cammisa, Martin Holub, Kris Haralambiev, Cassia Martin, Jia Cui, David Guthrie
- JHU Summer Workshop 2003, August 22nd, 2003, Baltimore
93. Summary
- Motivations
- Lexical Information for Semantic Tagging
- Unsupervised Natural Language Learning
- Empirical Estimation for ME bootstrapping
- Weakly Supervised BNC Tagging through Wordnet
- A semantic similarity metric over Wordnet
- Experiments and Results
- Mapping LDOCE to Wordnet
- Bootstrapping over an untagged corpus
- Re-estimation through Wordnet
94. Motivations
- All experiments tell us that lexical information is crucial for semantic tagging,
- but data sparseness seems to limit the effect of the context
- The contribution of different resources needs to be exploited (as in WSD)
- In applications, hand-tagging should be applied in a cost-effective way
- Good results also need to scale up to technological scenarios where poorer (or no) resources are available
95. Motivations (cont'd)
- Wordnet's contribution to semantic tagging:
- A source of evidence for a larger set of lexical items (unseen words)
- A consistent way to generalize single observations
- (Hierarchical) constraints over word-use statistics
- Similarity of word uses suggests semantic similarity:
- Corpus-driven syntactic similarity is one possible choice
- Domain or topical similarity is also relevant
- Semantic similarity in the Wordnet hierarchy suggests useful levels of generalization:
- Specific hypernyms, i.e. those able to separate different senses
- General hypernyms, i.e. those that help to reduce the number of word classes to model
96. Learning Contextual Evidence
- Each syntactic relation provides a view of a word's usage, i.e. it suggests a set of nouns with common behaviour(s)
- Semantic similarity among nouns is a model of local semantic preference:
- to drink {beer, water, ..., cocoa/L, stock/L, ...}
- The {president, director, boy, ace/H, brain/H, ...} succeeds
97. Semantic classes vs. language models
- The role of p(C | v, d):
- e.g. p(n | v, d) ≈ p(n | C) p(C | v, d)
- Implications:
- p(n | C) gives a lexical semantic model that is likely to depend on the corpus and not on the individual context
- p(C | v, d) models selectional preferences and provides disambiguation cues for contexts (v, d, X)
98. Semantic classes vs. language models
- Lexical evidence: p(n | C) (or also p(C | n))
- Contextual evidence: p(C | v, d)
- The idea:
- Contextual evidence can be collected from the corpus by involving the lexical knowledge base
- The modeling of lexical evidence can be seen as a side effect of the context (p(C | n) → p(n | C))
- Implied approach:
- Learn the second as an estimate for the first, and then combine them for bootstrapping to unseen words
99. Conceptual Density
- Basic terminology:
- Target noun set T (e.g. beer, water, stock: nouns in the relation r = V-DirObj with a given verb)
- (Branching factor) The average number m of children of a node s, i.e. the average number of children of any node subsumed by s
- (Marks) The set of marks M, i.e. the subset of nouns in T that are subsumed within the WN subhierarchy rooted in s; N = |M|
- (Area) area(s): the total number of nodes of the subhierarchy rooted at s
100. Conceptual Density (cont'd)
[Diagram: an example subhierarchy with numbered nodes, illustrating marks and area.]
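The formula on this slide was not transcribed; using the terminology of slide 99 (m = branching factor, N = number of marks under s, area(s) = nodes under s), the standard conceptual density of Agirre and Rigau, which the talk presumably follows, is:

```latex
CD(s) = \frac{\sum_{i=0}^{N-1} m^{i}}{\mathrm{area}(s)}
```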
101. Using Conceptual Density
- Target noun set T (e.g. subjects of the verb "to march"):
- horse (6 senses in WN1.6)
- ant (1 sense in WN1.6)
- troop (4 senses in WN1.6)
- division (12 senses in WN1.6)
- elephant (2 senses in WN1.6)
- FIND the smallest set of synsets s that covers T and maximizes CD(s) (see the sketch below)
- (1) organization, organisation: horse, troops, divisions
- (2) placental, placental_mammal, ...: horse, elephant
- (3) animal, animate_being: horse, elephant, ant
- (4) army_unit: troop, division
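A toy sketch of scoring candidate synsets by conceptual density; the m, N, and area values below are made up for illustration, not taken from WordNet:

```python
def conceptual_density(m, N, area):
    """CD(s) = (sum_{i=0}^{N-1} m**i) / area(s): m = branching factor,
    N = marks (members of T under s), area = nodes under s."""
    return sum(m ** i for i in range(N)) / area

# Hypothetical candidates for T = {horse, ant, troop, division, elephant}:
candidates = {
    "organization":     dict(m=2.1, N=3, area=120),
    "placental_mammal": dict(m=2.4, N=2, area=90),
    "animal":           dict(m=2.3, N=3, area=400),
    "army_unit":        dict(m=1.8, N=2, area=15),
}
for syn, params in sorted(candidates.items(),
                          key=lambda kv: -conceptual_density(**kv[1])):
    print(syn, round(conceptual_density(**params), 3))
```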
102. Summary (agenda slide repeated; see slide 93)
103. Results: Mapping LDOCE classes
- Lexical entries in LDOCE are defined in terms of a semantic class and a topical tag (subject codes), e.g. stock ('L','FO')
- The semantic similarity metric has been used to derive the WN synset(s) that represent <SemClass, SubjCode> pairs:
- A WN explanation of the lexical entries in an LM class (lexical mapping)
- The position(s) in the WN noun hierarchy of each LM class (category mapping)
- Semantic preferences of synsets given words, LM classes (and subject codes) can be mapped into probabilities, e.g.:
- p(WN_syns | n, LM_class)
- and then:
- p(LM_class | n, WN_syns), p(LM_class | n), p(LM_class | WN_syns)
104. Mapping LDOCE classes (cont'd)
- Example: Cluster 2---EDZI
- '2' = 'Abstract and solid'
- 'ED'-'ZI' = 'education - institutions, academic name of'
- T = {nursery_school, polytechnic, school, seminary, senate}
- Synset: school; cd = 0.580, coverage = 60%
- Synset: educational_institution; cd = 0.527, coverage = 80%
- Synset: gathering, assemblage; cd = 0.028, coverage = 40%
105. Case study: the word "stock" in LDOCE
- stock T: a supply (of something) for use
- stock J: goods for sale
- stock N: the thick part of a tree trunk
- stock A: a group of animals used for breeding
- stock A: farm animals, usu. cattle; LIVESTOCK
- stock T: a family line, esp. of the stated character
- stock T: money lent to a government at a fixed rate of interest
- stock T: the money (CAPITAL) owned by a company, divided into SHAREs
- stock P: a type of garden flower with a sweet smell
- stock L: a liquid made from the juices of meat, bones, etc., used in cooking
- stock J: (in former times) a stiff cloth worn by men round the neck of a shirt - compare TIE
- stock N: a piece of wood used as a support or handle, as for a gun or tool
- stock N: the piece which goes across the top of an ANCHOR_1_1 from side to side
- stock P: a plant from which CUTTINGs are grown
- stock P: a stem onto which another plant is GRAFTed
106. Case study: stock as Animal (A)
- stock A: a group of animals used for breeding
- stock A: farm animals, usu. cattle; LIVESTOCK
107. Case study: stock (N, P)
- stock N: a piece of wood used as a support or handle, as for a gun or tool
- stock N: the piece which goes across the top of an ANCHOR_1_1 from side to side
- stock N: the thick part of a tree trunk
- stock P: a plant from which CUTTINGs are grown
- stock P: a stem onto which another plant is GRAFTed
- stock P: a type of garden flower with a sweet smell
108. LM Category Mapping
109. Results: A Simple (Unsupervised) Tagger
- Estimate, over the parsed corpus plus Wordnet and by mapping into LD categories, the following quantities:
- P(C | hw, r), P(C | r), P(C | hw)
- (r ranges over SubjV, DirObj, N_P_hw, hw_P_N)
- Apply a simple Bayesian model to any incoming context:
- <hw, r1, ..., rk>
- and select argmaxC( p(C | hw) p(C | r1) ... p(C | rk) ) (see the sketch below)
- (NB: p(C | rj) is the back-off of p(C | hw, rj))
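A minimal sketch of that selection rule in log space; the probability tables and the smoothing floor are assumptions, not the workshop's implementation:

```python
import math

FLOOR = 1e-9   # assumed smoothing floor for unseen events

def tag(hw, relations, p_c_hw, p_c_hw_r, p_c_r, categories):
    """Select argmax_C p(C | hw) * prod_j p(C | r_j),
    backing off from p(C | hw, r_j) to p(C | r_j)."""
    def ctx(c, r):
        table = p_c_hw_r.get((hw, r)) or p_c_r.get(r, {})   # back-off
        return table.get(c, FLOOR)
    def score(c):
        return (math.log(p_c_hw.get(hw, {}).get(c, FLOOR))
                + sum(math.log(ctx(c, r)) for r in relations))
    return max(categories, key=score)
```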
110. Unsupervised Tagger Evaluation
[Results table not transcribed.]
111. Results: Re-estimated probabilities for a ME model
- Use the sentences in the training data to learn lexical and contextual preferences of nouns and relations
- Use the lexical preferences to pre-estimate the empirical distributions over the unseen data (see the constraints Q(c, w, R) in Fred's part)
- Train the ME model over all available data
- Tag the held-out and blind data
112. Results
- Training Data + WN held-out, all syntactic features, tested on held-out: 79-80%
- Training Data + WN held-out, only head words, tested on held-out: 81.80%
- Training Data + WN held-out, all syntactic features, tested on blind data: 86.03%
- Training Data only, all syntactic features, tested on held-out: 78-79%
- Training Data only, only head words, tested on held-out: 80.76%
113. Conclusions
- A robust parameter estimation method for semantic tagging:
- Less prone to sparse data
- Generalizes to meaningful noun classes
- Develops lexicalized contextual cues and a semantic dictionary
- A natural and viable way to integrate corpus-driven evidence with a general-purpose lexicon
- Results are consistent with those of fully supervised methods
- Open perspectives for effective estimation of unseen empirical distributions
114. Open Issues
- Estimate contextual and lexical probabilities from the 28M-word portion of the BNC (already parsed here)
- Alternative formulations of the similarity metrics
- Experiment with a bootstrapping method that imposes the proposed estimates (i.e. p(C | w, SubjV)) as constraints on Q(C, w, SubjV)
- Manually assess and measure the automatically derived Longman-Wordnet mapping
115. Summary Slide
- IR-inspired approaches (Kalina)
- Evaluation (Kris)
- Supervised methods using maximum entropy
- Incorporating context preferences (Jerry)
- Adjective Classes and Subject markings (David)
- Structuring the context using syntax and semantics (Cassia, Fabio)
- Re-estimation techniques for Maximum Entropy Experiments (Fred)
- Unsupervised Re-estimation (Roberto)
116. Our Accomplishments
- Developed a method for bootstrapping using maximum entropy
- Ran more than 300 experiments with features
- Integrated dictionary and syntactic information
- Integrated dictionary, Wordnet, syntactic, and topic information in experiments, which gave us a significant improvement
- Developed a system for unsupervised tagging
117. Lessons learned
- Semantic tagging has an intermediate complexity, between the rather successful NE recognition and Word Sense Disambiguation
- Semantic tagging over the BNC is viable with high accuracy:
- The accuracy reached by most of the proposed methods is ~94%
- This task stimulates cross-fertilization between statistical and symbolic knowledge, grounded in solid linguistic principles and resources
118. NO! The near future at a glance
- The availability of semantic information for head nouns is critical to a variety of linguistic tasks:
- IR and CLIR, Information Extraction and Question Answering
- Machine Translation and Language Modeling
- Annotated resources can provide a significant stimulus to machine learning of linguistic patterns (e.g. QA answer structures)
- Open possibilities for corpus-driven learning of other semantic phenomena (e.g. verb argument structures) and incremental learning methods
119. ... and a quick look further
- Unseen phenomena still represent hard cases for any probabilistic model (rare vs. impossible labels for unseen/novel words)
- Integration of external resources is problematic:
- Projecting observed empirical distributions may lead to overfitting the data
- Lexical information (e.g. Wordnet) does not have a clear probabilistic interpretation
- Soft features (Jia Cui) seem a promising model
- Better use of the context:
- Design and derivation of class-based contextual features (David Guthrie)
- Existing lexical resources provide large-scale and effective information for bootstrapping
120. A Final Thought
- Thanks to the Johns Hopkins faculty and staff for their availability and helpfulness during the workshop.
- Special thanks to Fred Jelinek for answering endless questions about maximum entropy and helping to model our problem.