Attacking the Data Sparseness Problem

1
Attacking the Data Sparseness Problem
  • Team: Louise Guthrie, Roberto Basili, Fabio
    Zanzotto, Hamish Cunningham, Kalina Bontcheva,
    Jia Cui, Klaus Macherey, David Guthrie, Martin
    Holub, Marco Cammisa, Cassia Martin, Jerry Liu,
    Kris Haralambiev, Fred Jelinek

2
Motivation for the project
  • Texts for text extraction contain sentences like
  • The IRA bombed a family owned shop in
    Belfast yesterday.
  • FMLN set off a series of explosions in
    central Bogota today.

3
Motivation for the project
  • We'd like to automatically recognize that both
    are of the form
  • The IRA bombed a family owned shop in Belfast
    yesterday.
  • FMLN set off a series of explosions in central
    Bogota today.

4
Our Hypotheses
  • A transformation of a corpus to replace words and
    phrases with coarse semantic categories will help
    overcome the data sparseness problem encountered
    in language modeling, and text extraction.
  • Semantic category information might also help
    improve machine translation
  • A noun-centric approach initially will allow
    bootstrapping for other syntactic categories

5
A six-week goal: Labeling noun phrases
  • Astronauts aboard the space shuttle Endeavor were
    forced to dodge a derelict Air Force satellite
    Friday
  • Humans aboard space_vehicle dodge satellite
    timeref.

6
Preparing the data (Pre-Workshop)
  • Identify a tag set
  • Create a Human annotated corpus
  • Create a double annotated corpus
  • Process all data for named entity and noun phrase
    recognition using GATE Tools (26 million words)
  • Parsed about 26 million words
  • Develop algorithms for mapping target categories
    to Wordnet synsets to support the tag set
    assessment

7
The Semantic Classes and the Corpus
  • A subset of classes available in Longman's
    Dictionary of Contemporary English (LDOCE),
    electronic version
  • Rationale
  • The number of semantic classes was small
  • The classes are somewhat reliable, since they were
    used by a team of lexicographers to code noun
    senses, adjective preferences, and verb
    preferences
  • Many words have subject area information, which
    might be useful

8
The Semantic Classes
(Hierarchy diagram, built up over four slides. The classes shown are
Concrete, Abstract, Animate, Inanimate, Solid, Gas, Liquid, Plant,
Animal, Human, Movable, Non-movable, Female Animal, and Collective.)
12
The human-annotated statistics
  • Inter-annotator agreement is 94%, so that is the
    upper limit of our task.
  • 214,446 total annotated noun phrases (262,683
    including None of the Above)
  • 29,071 unique vocabulary items (unlemmatized)
  • 25 semantic categories (162 associated subject
    areas were identified)
  • 127,569 with the semantic category Abstract (59%)

13
The experimental setup
BNC (Science, Politics, Business) 26 million
words
14
The main development set (dev)

Training: 113,000 instances
Held out: 85,000 instances
Blind portion
Machine Learning to improve this
15
A challenging development set for experiments on
unseen words (Hard data set)

Training: all unambiguous words, 125,000 instances
Held out: ambiguous words, 73,000 instances
Blind portion
Machine Learning to improve this
16
Our Experiments include
  • Supervised Approaches (Learning from Human
    Annotated data)
  • Unsupervised approaches
  • Using outside evidence (the dictionary or
    wordnet)
  • Syntactic information from parsing or pattern
    matching
  • Context words, the use of preferences, the use of
    topical information

17
Experiments on unseen words - Hard data set
  • Training corpus has only words with unambiguous
    annotations
  • 125,000 training instances
  • 73,000 instances held out
  • Perplexity: 21
  • Baseline accuracy: 45%
  • Improved accuracy: 68.5%
  • Context can contribute greatly in unsupervised
    experiments

18
Results on the dev set
  • Random split, with some frequent ambiguous words moved
    into testing
  • 113,000 training instances
  • 85,000 instances held out
  • Perplexity: 3.44
  • Baseline accuracy: 80%
  • Improved accuracy: 87%

19
The scheme for annotating the large corpus
  • After experimenting with the development sets, we
    need a scheme for making use of all of the dev
    corpus to tag the blind corpus.
  • We developed an incremental scheme within the
    maximum entropy framework
  • Several talks have to do with re-estimation
    techniques useful to the bootstrapping process.

20
Terminology
  • Seen words: words seen in the human annotated
    data (new instances of known words)
  • Unseen words: not in the training material but
    in the dictionary
  • Novel words: not in the training material nor in
    the dictionary/Wordnet

21
Bootstrapping

Human Annotated
Blind portion
Unannotated Data
22
The Unannotated Data: Four types

(Diagram, built up over several slides. Besides the human-annotated
data and the blind portion, the unannotated data falls into:
Unambiguous - 515,000 instances; Seen in training - 550,000 instances;
Unseen - 9,000; Novel - 20,000. Together with the 201K annotated
instances, these feed successive training rounds used to tag the test
data.)
27
Results on the Blind Data
  • We set aside one tenth of the annotated corpus
  • Randomly selected within each of the domains
  • It contained 13,000 annotated instances
  • The baseline here was very high - 90% with
    simple techniques
  • We were able to achieve 93.5% accuracy


28
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

29
Semantic Categories and MT
  • 10 test words: high, medium, and low frequency
  • Collected their target translations using
    EuroWordNet (e.g. Dutch)
  • Crane
  • lifts and moves heavy objects - hijskraan,
    kraan
  • large long-necked wading bird - kraanvogel

30
SemCats and MT (2)
  • Manually mapped synonym sets to semantic
    categories
  • automatic mapping will be presented later
  • Studied how many synonym sets are ruled out as
    translations by the semantic category

31
Some Results
  • 3 words: full disambiguation
  • crane (Mov.Solid/Animal), medicine
    (Abstract/Liquid), plant (Plant/Solid)
  • 7 words: the categories substantially reduce the
    possible translations
  • club - Abstr/an association of people...,
    Mov.Solid/stout stick..., Mov.Solid/an
    implement used by a golfer..., Mov.Solid/a
    playing card..., NonMov.Solid/a building
  • club/NonMov.Solid - clubgebouw, clubhuis, ...
  • club/Abstr. - bevolkingsgroep, broederschap, ...
  • club/Mov.Solid - knots, kolf, malie, club

32
The architecture
  • The multiple-knowledge-sources WSD architecture
    (Stevenson 03)
  • Allows the use of multiple taggers and combines their
    results through a weighted function
  • Weights can be learned from a corpus
  • All taggers implemented as GATE components and
    combined in applications

33
(No Transcript)
34
The Bag-of-Words Tagger
  • The bag-of-words tagger is an Information
    Retrieval-inspired tagger with parameters
  • Window size: 50 (default value)
  • What POS to put in the content vectors (default:
    nouns and verbs)
  • Which similarity measure to use
  • Used in WSD (Leacock et al 92)
  • Crane/Animal: species, captivity, disease
  • Crane/Mov.Solid: worker, disaster, machinery

35
BoW classifier (2)
  • Seen words are classified by calculating the inner
    product between their context vector and the
    vectors for each possible category
  • Inner product calculated as
  • Binary vectors: number of matching terms
  • Weighted vectors
  • Leacock's measure: favours concepts that occur
    frequently in exactly one category
  • Takes into account the polysemy of concepts in the
    vectors
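To make the scoring concrete, here is a minimal Python sketch of the inner-product classification described above, assuming binary or raw-count context vectors; it is an illustration, not the workshop's GATE implementation, and the example contexts are invented.

```python
from collections import Counter

def build_category_vectors(training_contexts):
    """training_contexts: iterable of (category, [context words]) pairs
    from the annotated data. Returns one bag-of-words vector per category."""
    vectors = {}
    for category, words in training_contexts:
        vectors.setdefault(category, Counter()).update(words)
    return vectors

def classify(context_words, category_vectors, binary=True):
    """Score each category by the inner product between the target noun's
    context vector and the category vector; return the best category."""
    context = Counter(context_words)
    best, best_score = None, float("-inf")
    for category, vec in category_vectors.items():
        if binary:
            # binary vectors: the score is just the number of matching terms
            score = len(set(context) & set(vec))
        else:
            # weighted vectors: plain dot product of raw counts
            score = sum(context[w] * vec[w] for w in context if w in vec)
        if score > best_score:
            best, best_score = category, score
    return best

# Example: contexts observed for "crane" under two categories
cats = build_category_vectors([
    ("Animal", ["species", "captivity", "disease"]),
    ("Mov.Solid", ["worker", "disaster", "machinery"]),
])
print(classify(["machinery", "worker", "site"], cats))  # -> Mov.Solid
```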

36
Current performance measures
  • The baseline frequency tagger on its own: 91% on
    the test (blind) set
  • Bag-of-words tagger on its own: 92.7%
  • Combined architecture: 93.2% (window size 50,
    using only nouns, binary vectors)

37
Future work on the architecture
  • Integrate syntactic information, subject codes,
    and document topics
  • Experiment with cosine similarity
  • Implement Yarowsky92 WSD algorithm
  • Implement the weighted function module
  • Experiment with integrating the ME tools as one
    of the taggers supplying preferences for the
    weighting module

38
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

39
Accuracy Measurements (Kris Haralambiev)
  • How to measure the accuracy
  • How to distinguish "correct", "almost correct"
    and "wrong"

40
Exact Match Measurements
  • W = (w1, w2, ..., wn) - vector of the annotated
    words
  • X = (x1, x2, ..., xn) - categories assigned by the
    annotators
  • Y = (y1, y2, ..., yn) - categories assigned by a
    program
  • Exact match (default) measurement: 1 for a match
    and 0 for a mismatch of each (xi, yi) pair
  • accuracy(X, Y) = |{ i : xi = yi }|

41
The Hierarchy
42
Ancestor Relation Measurement
  • The exact match will assign 0 for the pairs
    (H,M), (H,F), (A,Q), ...
  • Give a partial score for two categories in an
    ancestor relation
  • weight(Cat) = |{ i : xi ∈ tree with root Cat }|
  • score(xi, yi) = min( weight(xi)/weight(yi),
    weight(yi)/weight(xi) )
  • accuracy(X, Y) = Σi score(xi, yi)

43
Edge Distance Measurement
  • The ancestor relation will assign some score for
    pairs like (H,M), (A,Q), but will assign 0 for
    pairs like (M,F), (A,H)
  • Going further, we want to compute the similarity
    (distance) between X and Y
  • distance(xi, yi) = the length of the simple path
    from xi to yi
  • each edge can be given an individual length, or all
    edges have length 1 (we prefer the latter)

44
Edge Distance Measurement (cont'd)
  • distance(X, Y) = Σi distance(xi, yi)
  • Accuracy is obtained from distance by linear scaling:
    accuracy = 100 · (1 - distance(X, Y) / max_possible_distance)
  • max_possible_distance = Σi maxcat distance(xi, cat)
  • it might be reasonable to use the average instead of the max
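The three measurements can be sketched directly from the definitions above. The hierarchy below is a small invented toy (child/parent maps), not the actual tag-set layout, and the ancestor weights follow the reconstruction weight(Cat) = |{ i : xi in subtree(Cat) }|.

```python
def descendants(cat, children):
    """All categories in the subtree rooted at cat (including cat itself)."""
    out = {cat}
    for child in children.get(cat, []):
        out |= descendants(child, children)
    return out

def path_to_root(cat, parent):
    path = [cat]
    while cat in parent:
        cat = parent[cat]
        path.append(cat)
    return path

def exact_match(X, Y):
    """1 for a match, 0 for a mismatch, summed over all (xi, yi) pairs."""
    return sum(x == y for x, y in zip(X, Y))

def ancestor_accuracy(X, Y, children, parent):
    """Partial credit when one category is an ancestor of the other."""
    def weight(cat):
        sub = descendants(cat, children)
        return sum(1 for x in X if x in sub)
    total = 0.0
    for x, y in zip(X, Y):
        if x == y:
            total += 1.0
        elif y in descendants(x, children) or x in descendants(y, children):
            wx, wy = weight(x), weight(y)
            if wx and wy:
                total += min(wx / wy, wy / wx)
    return total

def edge_distance(x, y, parent):
    """Length of the simple path from x to y, all edges of length 1."""
    px, py = path_to_root(x, parent), path_to_root(y, parent)
    lca = next(c for c in px if c in py)       # lowest common ancestor
    return px.index(lca) + py.index(lca)

def edge_distance_accuracy(X, Y, parent, categories):
    dist = sum(edge_distance(x, y, parent) for x, y in zip(X, Y))
    worst = sum(max(edge_distance(x, c, parent) for c in categories) for x in X)
    return 100.0 * (1.0 - dist / worst)

# Invented toy hierarchy, only for illustration
parent = {"Concrete": "Top", "Abstract": "Top", "Animate": "Concrete",
          "Inanimate": "Concrete", "Human": "Animate", "Animal": "Animate"}
children = {"Top": ["Concrete", "Abstract"], "Concrete": ["Animate", "Inanimate"],
            "Animate": ["Human", "Animal"]}
cats = ["Concrete", "Abstract", "Animate", "Inanimate", "Human", "Animal"]
X, Y = ["Human", "Animal", "Abstract"], ["Animate", "Human", "Abstract"]
print(exact_match(X, Y), ancestor_accuracy(X, Y, children, parent),
      edge_distance_accuracy(X, Y, parent, cats))   # 1  1.5  75.0
```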

45
Some Baselines
  • Training held-out data
  • Blind data

46
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

47
Supervised Methods using Maximum Entropy
Jia Cui, David Guthrie, Martin Holub, Jerry Liu,
Klaus Macherey
48
  • Overview
  • Maximum Entropy Approach
  • Feature Functions
  • Word Classes
  • Experimental Results

49
  • Maximum Entropy Approach
  • Principle
  • Define suitable features (constraints) on
    training data
  • Find maximum entropy distribution that satisfies
    constraints (GIS)
  • Properties
  • Easy to integrate information from several
    knowledge sources
  • Always converges to the global optimum on
    training data
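As a rough illustration of the principle (feature-expectation constraints, iterative scaling toward the global optimum), a self-contained GIS sketch follows; the workshop itself used Jia Cui's MaxEnt toolkit, so this is only a didactic stand-in with a dense toy feature encoding.

```python
import numpy as np

def gis_train(feat, y, n_iter=200):
    """Conditional maximum entropy trained with Generalized Iterative Scaling.
    feat[i, c, j] = value of feature j for instance i and candidate class c;
    y[i] = index of the correct class of instance i."""
    n, n_classes, n_feat = feat.shape
    # GIS needs a constant feature sum per event, so add a slack feature.
    totals = feat.sum(axis=2)
    c_max = totals.max()
    feat = np.concatenate([feat, (c_max - totals)[:, :, None]], axis=2)
    lam = np.zeros(n_feat + 1)
    empirical = feat[np.arange(n), y].sum(axis=0)    # observed feature counts
    for _ in range(n_iter):
        scores = feat @ lam                          # (n, n_classes)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)
        expected = (p[:, :, None] * feat).sum(axis=(0, 1))
        lam += np.log((empirical + 1e-9) / (expected + 1e-9)) / c_max
    return lam

# Toy demo: three instances, two classes, two indicator features
feat = np.array([[[1., 0.], [0., 1.]],
                 [[0., 1.], [1., 0.]],
                 [[1., 0.], [0., 1.]]])
print(gis_train(feat, np.array([0, 1, 1])))
```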

50
  • Feature Functions
  • Prior Features
  • Use Unigram probabilities P(c) for semantic
    categories c as feature
  • Lexical Features
  • Use the lexical information directly as a
    feature
  • Reduce number of features by using the
    following definition

51
  • Feature Functions (contd)
  • Longman Preference Features
  • Longman Dictionary provides subject codes for
    nouns
  • Use frequency of preferences as additional
    features
  • Unknown Word Features
  • - Prefix features
  • - Suffix features
  • - Human-IST feature
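A sketch of how such predicates might be extracted per noun instance before being paired with the candidate classes as indicator features; the affix lengths, the "-ist" test standing in for the Human-IST feature, and the Longman lookup format are illustrative assumptions, not the toolkit's actual encoding.

```python
def extract_predicates(word, lemma, longman_codes, known_vocab):
    """Return the binary predicates active for one noun instance.
    Each predicate is later paired with every candidate semantic class c
    to form an indicator feature f(predicate, c)."""
    preds = ["PRIOR"]                        # prior feature: always on
    if lemma in known_vocab:
        preds.append("LEX=" + lemma)         # lexical feature on the lemma
    else:
        # unknown-word features: affixes and an assumed "-ist implies Human" cue
        preds.append("PREFIX=" + word[:3].lower())
        preds.append("SUFFIX=" + word[-3:].lower())
        if word.lower().endswith("ist"):
            preds.append("HUMAN_IST")
    for code in longman_codes.get(lemma, []):
        preds.append("LDOCE_PREF=" + code)   # Longman subject-code preference
    return preds

print(extract_predicates("violinists", "violinist", {}, set()))
```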

52
  • Word Classes
  • Lemmatization
  • - Eliminate inflections and reduce words to
    their base form
  • - Assumption: different cases of one word have
    the same semantic classes
  • Mutual Information
  • - Measures the amount of information one random
    variable contains about another
  • - Applied to nouns and adjectives
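A small sketch of the mutual-information computation, here between adjectives and the nouns they modify; the co-occurrence observations are invented.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X; Y) in bits, estimated from a list of (x, y) co-occurrence observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Invented adjective/noun observations for illustration
obs = [("angry", "kid"), ("angry", "man"), ("green", "plant"),
       ("green", "leaf"), ("angry", "kid")]
print(round(mutual_information(obs), 3))
```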

53
(No Transcript)
54
(No Transcript)
55
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

56
Incorporating Context Features Jia Cui, David
Guthrie, Martin Holub, Klaus Macherey, Jerry Liu
57
  • Overview
  • Result Analysis
  • Rewind: Encoding Feature Functions
  • Incorporating Context Features
  • Clustering Methods
  • Experimental Results

58
(No Transcript)
59
(No Transcript)
60
  • Adjectives
  • Continuing the example "angry kid",
  • Describe adjectives by the categories of nouns
    that they prefer to modify, to avoid sparseness.
  • Obtain a set of categories for both kid and
    angry
  • - kid: A, S, H, H, H
  • - angry: T, H
  • We can concatenate them together (merging): A,
    S, H, H, H, T, H
  • Or do some kind of component-wise multiplication
    (pruning): H, H, H
  • Simply merging introduces irrelevant categories
    and increases entropy (a merging/pruning sketch follows below)
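The merging and pruning operations can be stated in a few lines; the category lists for kid and angry are taken from the example above.

```python
def merge(noun_cats, adj_cats):
    """Merging: simply concatenate the two category lists."""
    return noun_cats + adj_cats

def prune(noun_cats, adj_cats):
    """Pruning: keep only the noun categories that the adjective also licenses
    (a component-wise filter, e.g. kid x angry -> H, H, H)."""
    allowed = set(adj_cats)
    return [c for c in noun_cats if c in allowed]

kid, angry = ["A", "S", "H", "H", "H"], ["T", "H"]
print(merge(kid, angry))   # ['A', 'S', 'H', 'H', 'H', 'T', 'H']
print(prune(kid, angry))   # ['H', 'H', 'H']
```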

61
  • Clustering Methods
  • The Longman dictionary contains such adjective
    preferences, but we can also
    generate preferences based on the corpus.
  • Measure the entropy of each adjective, by
    counting how often the adjective modifies a noun
    of each particular category
  • - The lower the entropy, the more contextually
    useful the adjective
  • - Measure confidence of the adjective by frequency
  • Example: angry
  • - adj = angry, entropy = 2.18, freqs = 155, 55,
    9, 7, 0 ...

62
(No Transcript)
63
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

64
Hard vs Soft Word Clusters
  • Words as features are sparse, so we need to cluster
    them
  • Hard clusters
  • A feature is assigned to one and only one
    cluster (the cluster for which there exists the
    strongest evidence)
  • Soft clusters
  • A feature is assigned to as many clusters as
    there is evidence for

65
Using clustering and contextual features
  • Baseline: prior (most frequent semantic
    category)
  • All words within the target noun phrase (with a
    threshold of 10 occurrences)
  • Adjective hard clusters
  • Clusters are defined by the most frequent semantic
    category
  • Noun soft clusters
  • Clusters are defined by all semantic categories
  • Combined adjective hard clusters and noun soft
    clusters

66
Results with clusters and context
Training on Training + Held-out
Testing on Blind Data
ME tool: Jia's MaxEnt toolkit
67
Measuring Usefulness of Adjectives in Context
We have a huge number of nouns that are assigned
a semantic tag from A. the training data, or B. the BNC
corpus, when the noun is unambiguous with regard
to the possible semantic categories. Using the
adjectives that modify these nouns we are able to
compute the entropy H(T | a), where a is an adjective and
C is the set of semantic categories
68
Clustering Adjectives
  • We take adjectives with low H(T | a) and form
    clusters from them depending on which semantic
    category they predict
  • Then use each cluster of adjectives as a context
    feature

θ1 and θ2 are thresholds
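A sketch of the usefulness measure and the clustering step, assuming H(T | a) is the entropy of the category distribution of the nouns modified by adjective a; the θ1/θ2 thresholds and the adjective counts below are illustrative guesses, not the workshop's settings.

```python
import math
from collections import Counter, defaultdict

def adjective_entropy(category_counts):
    """H(T | a): entropy of the semantic-category distribution of the nouns
    that adjective a modifies, computed from raw frequencies."""
    total = sum(category_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in category_counts.values() if c > 0)

def cluster_adjectives(adj_to_counts, theta1=1.5, theta2=20):
    """Group adjectives with low entropy (< theta1) and enough evidence
    (>= theta2 observations) by the category they most strongly predict."""
    clusters = defaultdict(list)
    for adj, counts in adj_to_counts.items():
        freq = sum(counts.values())
        if freq >= theta2 and adjective_entropy(counts) < theta1:
            best_cat = max(counts, key=counts.get)
            clusters[best_cat].append(adj)
    return dict(clusters)

# Invented counts: how often each adjective modifies nouns of each category
counts = {"angry":   Counter({"Human": 155, "Animal": 55, "Abstract": 9}),
          "wooden":  Counter({"Mov.Solid": 80, "NonMov.Solid": 4}),
          "alleged": Counter({"Human": 30, "Abstract": 29, "Plant": 25})}
print(cluster_adjectives(counts))   # {'Human': ['angry'], 'Mov.Solid': ['wooden']}
```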
69
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

70
Structuring the context using syntax
  • Syntactic model: eXtended Dependency Graph
  • Syntactic relations considered: V_Obj, V_Sog,
    V_PP, NP_PP
  • Results
  • Observations
  • Features are too scarce
  • We're overfitting! We need more intelligent
    methods.

Held-out results
Tools used: Syntactic parser Chaos (Basili &
Zanzotto), Max Entropy Toolkit (implemented by Jia
Cui)
71
Semantic Fingerprint: Generalizing nouns using
EuroWordnet
Top level: generalizations
Base concepts (tree structure)
Bottom level: synonym sets (directed graph)
72
Noun semantic fingerprints: an example
  • Words in the events are replaced by basic
    concepts

(Diagram, for the sentence "the CEO drove into the city with his own
car": the base concepts shown include object, location, person, area,
social group, geographic area, district, administrative district,
assemblage, and urban center; the label Solid also appears.)
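A rough sketch of this generalisation step, using NLTK's Princeton WordNet as a stand-in for the EuroWordNet base concepts; the fixed cut-off depth is an illustrative assumption.

```python
# Requires: pip install nltk ; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def fingerprint(noun, depth=4):
    """Return the hypernyms of the noun's senses truncated at a fixed depth,
    as a crude approximation of 'base concept' generalisation."""
    concepts = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():      # root-first hypernym chain
            concepts.add(path[min(depth, len(path) - 1)].name())
    return sorted(concepts)

for word in ["president", "city", "car"]:
    print(word, fingerprint(word))
```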
73
Verb semantic fingerprints: an example
(Diagram, for "the CEO drove into the city with his own car": the
lexicalized feature drive_V_PP_into is generalized to travel_V_PP_into
and move_V_PP_into along the chain drive -> travel -> move; the V_PP
and V_Subj slots are marked, contrasting generalized and lexicalized
features.)
74
How to exploit the word context?
  • Semantic Category - Subj - to_think
  • Positive observations
  • his wife thought he should eat more
  • the waitress thought that Italians leave tiny
    tips
  • Our conceptual hierarchy contains FemaleHuman and
    MaleHuman...

→ Fabio thought he had had a terrific idea before
looking at the results
→ Fabio is a FemaleHuman!
75
How to exploit the word context?
(Diagram: verifying the hypothesis against further observed contexts
of a word W assigned class H.)
76
Syntactic slots and slot fillers
77
How to exploit the word context?
  • Using...
  • a revised hierarchy
  • Female animal and male animal → Animal
  • Female human and male human → Human
  • Female and male → Animate
  • the one-semantic-class-per-discourse hypothesis
  • the semantic fingerprint, generalising nouns to
    the base concepts of EuroWordnet and verbs to
    the topmost levels in Wordnet

78
Results
Held-out, combined
Test bed characteristics
Tools used: Syntactic parser Chaos (Basili &
Zanzotto), Max Entropy Toolkit (implemented by Jia
Cui)
79
Results: a closer look
Held-out, combined
Tools used: Syntactic parser Chaos (Basili &
Zanzotto), Max Entropy Toolkit (implemented by Jia
Cui)
80
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

81
Unsupervised Semantic Labeling of Nouns using ME
  • Frederick Jelinek
  • Semantic Analysis for Sparse Data

82
Motivation
  • Base ME features on lexical and grammatical
    relationships found in the context of nouns to be
    labeled
  • Hand-labeled data too sparse to allow using
    powerful ME compound features
  • Wish to utilize large unlabeled British National
    Corpus (and internet, etc.) for training
  • Will use dictionary and initialization by
    statistics from smaller hand-labeled corpus

83
Format of Labeled Training Data
  • w is the noun to be labeled
  • r1, r2, ..., rm are the relationships in the context
    of w which correlate with the label appropriate
    to w
  • C is the label denoting the semantic class
  • f1, f2, ..., fK are the label counts, i.e., fC = 1 and
    fi = 0 for i ≠ C
  • Then the event file format is
  • (f1, f2, ..., fK, w, r1, r2, ..., rm)

84
Format of BNC Training Data
  • The label counts fi will be fractional, with fi = 0
    if the dictionary does not allow noun w to
    have the i-th label.
  • Always fi ≥ 0 and Σi fi = 1
  • The problem is the initial selection of values of
    fi
  • Suggestion: let fi = Q(C = i | w), where Q denotes
    the empirical distribution from hand-labeled
    data.

85
(Diagram: the annotated data, the unambiguous BNC data, and the BNC
seen, unseen, and novel portions feed successive training rounds; the
held-out data is tagged at the end.)
86
Inner Loop ME Re-estimation
  • The empirical distribution used in the ME
    iterations is obtained from sums of the values of fi
    found in both the labeled and BNC data sets.
  • These counts determine
  • which of the potential features will be selected
    as actual features
  • the values of the λ parameters in the ME model
87
Constraints and Equations
88
Outer Loop Re-scaling of Data
  • Once the ME model P(C = c | w, r1, ..., rm) is
    estimated, the fi values in the event files of the
    BNC portion of the data are re-scaled.
  • fi values in the hand-labeled portion remain
    unchanged
  • New empirical counts are thus available
  • to determine the identity of new actual features
  • and the parameters of a new ME probability model
  • Etc.
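A schematic sketch of this outer loop: BNC events get their fractional labels replaced by the current model's posteriors (masked by the dictionary), while hand-labeled events keep their counts. The event format and train_maxent are placeholders standing in for the actual toolkit.

```python
def rescale(events, model):
    """events: list of dicts {'f': [f1..fK], 'w': noun, 'r': [relations], 'hand': bool}.
    model(w, r) is assumed to return the posterior P(C | w, r1..rm) as a list."""
    for ev in events:
        if ev["hand"]:
            continue                      # hand-labeled fractions remain unchanged
        posterior = model(ev["w"], ev["r"])
        # zero out classes the dictionary does not allow, then renormalize
        masked = [p if f > 0 else 0.0 for p, f in zip(posterior, ev["f"])]
        z = sum(masked) or 1.0
        ev["f"] = [p / z for p in masked]

def bootstrap(events, train_maxent, n_outer=5):
    """Alternate ME estimation (inner loop) and re-scaling (outer loop)."""
    model = None
    for _ in range(n_outer):
        model = train_maxent(events)      # placeholder for the ME toolkit
        rescale(events, model)
    return model
```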

89
Preliminary Results (by Jia Cui)
The Sheffield annotated corpus and the BNC unambiguous
nouns provide the initial statistics. Label instances
of the BNC corpus whose headwords are seen in the
unambiguous data but are ambiguous according to
the Longman dictionary.
90
Concluding Thoughts
  • Preliminary results are promising
  • The method requires theoretical and practical
    exploration
  • Changing features and feature targets is a new
    phenomenon in ME estimation
  • Careful selection of relationships, and basing
    them on clusters where required, will lead to
    effective features
  • See the proposal by Jia Cui and David Guthrie

91
Overview
  • Bag of words (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy (Klaus)
  • Incorporating context preferences (Jerry)
  • Experiments with Adjective Classes and Subject
    (David, Jia, Martin)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred, Jia)
  • Unsupervised Re-estimation (Roberto)
  • Student Proposals (Jia, Dave, Marco)
  • Conclusion

92
Unsupervised Semantic Tagging
Roberto Basili, Fabio Zanzotto, Marco Cammisa,
Martin Holub, Kris Haralambiev, Cassia Martin,
Jia Cui, David Guthrie
JHU Summer Workshop 2003, August 22nd, 2003,
Baltimore
93
Summary
  • Motivations
  • Lexical Information for Semantic Tagging
  • Unsupervised Natural Language Learning
  • Empirical Estimation for ME bootstrapping
  • Weakly Supervised BNC Tagging through Wordnet
  • A semantic similarity metric over Wordnet
  • Experiments and Results
  • Mapping LDOCE to Wordnet
  • Bootstrapping over an untagged corpus
  • Re-estimation through Wordnet

94
Motivations
  • All experiments indicate that lexical information is
    crucial for semantic tagging,
  • data sparseness seems to limit the effect of the
    context
  • The contribution of different resources needs to
    be exploited (as in WSD)
  • In applications, hand-tagging should be applied in
    a cost-effective way
  • Good results need to scale up also to
    technological scenarios where poorer (or no)
    resources are available

95
Motivations (contd)
  • Wordnet's contribution to semantic tagging
  • A source of evidence for a larger set of lexical items
    (unseen words)
  • A consistent way to generalize single
    observations
  • (hierarchical) constraints over word-use
    statistics
  • Similarity of word uses suggests semantic
    similarity
  • Corpus-driven syntactic similarity is one
    possible choice
  • Domain or topical similarity is also relevant
  • Semantic similarity in the Wordnet hierarchy
    suggests useful levels of generalization
  • Specific hypernyms, i.e. those able to separate
    different senses
  • General hypernyms, i.e. those that help to reduce
    the number of word classes to model

96
Learning Contextual Evidence
  • Each syntactic relation provides a view on a
    word's usage, i.e. suggests a set of nouns with
    common behaviour(s)
  • Semantic similarity among nouns is a model of
    local semantic preference
  • to drink: beer, water, ..., cocoa/L, stock/L, ...
  • The president, director, boy, ace/H, brain/H, ...
    succeeds

97
Semantic classes vs. language models
  • The role of p(C | v, d)
  • e.g.
  • p(n | v, d) ≈ p(n | C) · p(C | v, d)
  • Implications
  • p(n | C) gives a lexical semantic model that
  • is likely to depend on the corpus and not on
    the individual context
  • p(C | v, d) models selectional preferences and
  • provides disambiguation cues for contexts (v, d, X)

98
Semantic classes vs. language models
  • Lexical evidence: p(n | C) (or also p(C | n))
  • Contextual evidence: p(C | v, d)
  • The idea
  • Contextual evidence can be collected from the
    corpus by involving the lexical knowledge base
  • The modeling of lexical evidence can be seen as a
    side effect of the context (p(C | n) ≈ p(n | C))
  • Implied approach
  • Learn the second as an estimate for the first and
    then combine, for bootstrapping to unseen words

99
Conceptual Density
  • Basic terminology
  • Target noun set T (e.g. beer, water, stock:
    nouns in relation r = V_DirObj with a given verb)
  • (Branching Factor) Average number m of children
    of a node s, i.e. the average number of children
    of any node subsumed by s
  • (Marks) Set of marks M, i.e. the subset of nouns
    in T that are subsumed within the WN subhierarchy
    rooted in s; N = |M|
  • (Area) area(s): total number of nodes of the
    subhierarchy rooted at s
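The slides do not spell the formula out, so the sketch below assumes a conceptual-density score in the spirit of Agirre and Rigau's measure, built from the quantities just defined (marks, area, mean branching factor m); treat the exact form as an assumption rather than the workshop's definition.

```python
def conceptual_density(n_marks, area, m):
    """CD of the subhierarchy rooted at s: the expected size of a subtree
    with branching factor m spanning the marked nouns, relative to the
    actual area of the subhierarchy (Agirre & Rigau-style form, assumed)."""
    if n_marks == 0 or area == 0:
        return 0.0
    expected = sum(m ** i for i in range(n_marks))
    return expected / area

# Example: 3 target nouns fall under a node whose subtree has 40 synsets
print(conceptual_density(n_marks=3, area=40, m=2.5))
```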

100
Conceptual Density (contd)
(Diagram: an example subhierarchy with 15 numbered nodes, illustrating
marks, area, and branching factor.)
101
Using Conceptual Density
  • Target noun set T (e.g. subjects of the verb to march)
  • horse (6 senses in WN1.6),
  • ant (1 sense in WN1.6)
  • troop (4 senses in WN1.6)
  • division (12 senses in WN1.6)
  • elephant (2 senses in WN1.6)

FIND the smallest set of synsets s that covers
T and maximizes the conceptual density CD(s)
  • (1) organization, organisation
  • horse, troops, divisions
  • (2) placental, placental_mammal ...
  • horse, elephant
  • (3) animal, animate_being
  • horse, elephant, ant
  • (4) army_unit
  • troop, division

102
Summary
  • Motivations
  • Lexical Information for Semantic Tagging
  • Unsupervised Natural Language Learning
  • Empirical Estimation for ME bootstrapping
  • Weakly Supervised BNC Tagging through Wordnet
  • A semantic similarity metric over Wordnet
  • Experiments and Results
  • Mapping LDOCE to Wordnet
  • Bootstrapping over an untagged corpus
  • Re-estimation through Wordnet

103
Results Mapping LDOCE classes
  • Lexical entries in LDOCE are defined in terms of
    a semantic class and a topical tag (Subject
    Codes), e.g. stock ('L','FO')
  • The semantic similarity metric has been used to
    derive WN synset(s) that represent
    <SemClass, SubjCode> pairs
  • A WN explanation of lexical entries in an LM class
    (lexical mapping)
  • The position(s) in the WN noun hierarchy of each
    LM class (category mapping)
  • Semantic preferences of synsets given words, LM
    classes (and Subject Codes) can be mapped into
    probabilities, e.g.
  • p(WN_syns | n, LM_class)
  • and then
  • p(LM_class | n, WN_syns), p(LM_class | n),
    p(LM_class | WN_syns)

104
Mapping LDOCE classes (contd)
  • Example cluster: 2---EDZI
  • '2': 'Abstract and solid'
  • 'ED'-'ZI': 'education - institutions, academic
    name of'
  • T = {nursery_school, polytechnic, school,
    seminary, senate}
  • Synset school:
  • cd = 0.580, coverage 60%
  • Synset educational_institution:
  • cd = 0.527, coverage 80%
  • Synset gathering, assemblage:
  • cd = 0.028, coverage 40%

105
Case study: the word stock in LDOCE
  • stock T: a supply (of something) for use
  • stock J: goods for sale
  • stock N: the thick part of a tree trunk
  • stock A: a group of animals used for breeding
  • stock A: farm animals, usu. cattle; LIVESTOCK
  • stock T: a family line, esp. of the stated
    character
  • stock T: money lent to a government at a fixed
    rate of interest
  • stock T: the money (CAPITAL) owned by a company,
    divided into SHAREs
  • stock P: a type of garden flower with a sweet
    smell
  • stock L: a liquid made from the juices of meat,
    bones, etc., used in cooking
  • stock J: (in former times) a stiff cloth worn by
    men round the neck of a shirt
    - compare TIE
  • stock N: a piece of wood used as a support or
    handle, as for a gun or tool
  • stock N: the piece which goes across the top of
    an ANCHOR_1_1 from side to side
  • stock P: a plant from which CUTTINGs are grown
  • stock P: a stem onto which another plant is
    GRAFTed

106
Case study: stock as Animal (A)
  • stock A: a group of animals used for breeding
  • stock A: farm animals, usu. cattle; LIVESTOCK

107
Case study: stock (N - P)
  • stock N: a piece of wood used as a support or
    handle, as for a gun or tool
  • stock N: the piece which goes across the top of
    an ANCHOR_1_1 from side to side
  • stock N: the thick part of a tree trunk
  • stock P: a plant from which CUTTINGs are grown
  • stock P: a stem onto which another plant is
    GRAFTed
  • stock P: a type of garden flower with a sweet
    smell

108
LM Category Mapping
109
Results: A Simple (Unsupervised) Tagger
  • Estimate, over the parsed corpus plus Wordnet and by
    mapping into LD categories, the following
    quantities:
  • P(C | hw, r), P(C | r), P(C | hw)
  • (r ranges over SubjV, DirObj, N_P_hw,
    hw_P_N)
  • Apply a simple Bayesian model to any incoming
    context
  • <hw, r1, ..., rk>
  • and select argmaxC ( p(C | hw) · p(C | r1) · ... ·
    p(C | rk) )
  • (Note: p(C | rj) is the back-off of p(C | hw,
    rj))
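The selection rule on this slide is easy to state as code; the probability tables below are toy numbers and the smoothing constant is an arbitrary choice.

```python
def tag(hw, relations, p_c_given_hw, p_c_given_hw_r, p_c_given_r, categories):
    """Select argmax_C p(C | hw) * prod_j p(C | hw, r_j),
    backing off to p(C | r_j) when the (hw, r_j) pair was never observed."""
    best, best_score = None, 0.0
    for c in categories:
        score = p_c_given_hw.get((c, hw), 1e-6)
        for r in relations:
            score *= p_c_given_hw_r.get((c, hw, r),
                                         p_c_given_r.get((c, r), 1e-6))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy example (numbers invented): tag "stock" as the object of "drink"
cats = ["Liquid", "Abstract"]
print(tag("stock", [("DirObj", "drink")],
          {("Liquid", "stock"): 0.2, ("Abstract", "stock"): 0.6},
          {},                                        # no (hw, r) statistics observed
          {("Liquid", ("DirObj", "drink")): 0.7,
           ("Abstract", ("DirObj", "drink")): 0.05},
          cats))                                     # -> Liquid
```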

110
Unsupervised Tagger Evaluation
111
Results: Re-estimating probabilities for an ME model
  • Use sentences in the training data for learning
    lexical and contextual preferences of nouns and
    relations
  • Use lexical preferences to pre-estimate the
    empirical distributions over unseen data (see the
    constraints Q(c, w, R) in Fred's talk)
  • Train the ME model over all available data
  • Tag held-out and blind data

112
Results
  • Features: All syntactic features; ME Training: Training
    Data + WN Held-Out; ME Test: Held-out; Result: 79-80
  • Features: Only head words; ME Training: Training Data +
    WN Held-Out; ME Test: Held-out; Result: 81.80
  • Features: All syntactic features; ME Training: Training
    Data + WN Held-Out; ME Test: Blind Data; Result: 86.03
  • Features: All syntactic features; ME Training: Training
    Data; ME Test: Held-out; Result: 78-79
  • Features: Only head words; ME Training: Training Data;
    ME Test: Held-out; Result: 80.76

113
Conclusions
  • A robust parameter estimation method for semantic
    tagging
  • Less prone to sparse data
  • Generalizes to meaningful noun classes
  • Develops lexicalized contextual cues and a
    semantic dictionary
  • A natural and viable way to integrate
    corpus-driven evidence with a general-purpose
    lexicon
  • Results consistent with fully supervised methods
  • Open perspectives for effective estimation of
    unseen empirical distributions

114
Open Issues
  • Estimate contextual and lexical probabilities
    from the 28M-word portion of the BNC (already parsed
    here)
  • Alternative formulations of the similarity metrics
  • Experiment with a bootstrapping method by imposing the
    proposed estimates (i.e. p(C | w, SubjV)) as
    constraints on Q(C, w, SubjV)
  • Manually assess and measure the automatically
    derived Longman-Wordnet mapping

115
Summary Slide
  • IR-inspired approaches (Kalina)
  • Evaluation (Kris)
  • Supervised methods using maximum entropy
  • Incorporating context preferences (Jerry)
  • Adjective Classes and Subject markings (David)
  • Structuring the context using syntax and
    semantics (Cassia, Fabio)
  • Re-estimation techniques for Maximum Entropy
    Experiments (Fred)
  • Unsupervised Re-estimation (Roberto)

116
Our Accomplishments
  • Developed a method for bootstrapping using
    maximum entropy
  • More than 300 experiments with features
  • Integrated dictionary and syntactic information
  • Integrated dictionary, Wordnet, syntactic
    information and topic information experiments
    which gave us significant improvement
  • Developed a system for unsupervised tagging

117
Lessons learned
  • Semantic tagging has an intermediate complexity,
    between the rather successful NE recognition and
    Word Sense Disambiguation
  • Semantic tagging over the BNC is viable with high
    accuracy
  • The accuracy reached by most of the proposed methods
    approaches 94%
  • This task stimulates cross-fertilization between
    statistical and symbolic knowledge grounded on
    solid linguistic principles and resources

118
NO! The near future at a glance
  • Availability of semantic information for head
    nouns is critical to a variety of linguistic
    tasks
  • IR and CLIR, Information Extraction and Question
    Answering
  • Machine Translation and Language Modeling
  • Annotated resources can provide a significant
    stimulus to machine learning of linguistic
    patterns (e.g. QA answer structures)
  • Open possibilities for corpus-driven learning of
    other semantic phenomena (e.g. verb argument
    structures) and incremental learning methods

119
and a quick look further
  • Unseen phenomena still represent hard cases for
    any probabilistic model (rare vs. impossible
    labels for unseen/novel words)
  • Integration of external resources is problematic
  • Projecting observed empirical distributions may
    lead to overfitting the data
  • Lexical information (e.g. Wordnet) does not have a
    clear probabilistic interpretation
  • Soft features (Jia Cui) seem a promising model
  • Better use of the context
  • Design and derivation of class-based contextual
    features (David Guthrie)
  • Existing lexical resources provide large-scale
    and effective information for bootstrapping

120
A Final thought
  • Thanks to the Johns Hopkins faculty and staff for
    their availability and helpfulness during the
    workshop.
  • Special thanks to Fred Jelinek for answering
    endless questions about maximum entropy and
    helping to model our problem.