Title: Toward Large-Scale Shallow Semantics for Higher-Quality NLP

1. Toward Large-Scale Shallow Semantics for Higher-Quality NLP
What is this Semantics in the Semantic Web, and How Can You Get It?
- Eduard Hovy
- Information Sciences Institute
- University of Southern California
- www.isi.edu/natural-language
2. The Knowledge Base of the World
- We live in the infosphere
- but it's unstructured, inconsistent, often outdated; in other words, a mess!
3. Frank's two Semantic Webs
- 1. The Semantic Web as data definer
  - Applies to circumscribed, structured data types: numbers, lists, tables, inventories, picture annotations
  - Suitable for constrained, context-free semantics
  - Amenable to OWL, etc.: closed vocabularies and controllable relations
- 2. The Semantic Web as text enhancer
  - Applies to open-ended, unstructured information
  - Requires open-ended, context-sensitive semantics
  - Requires... what exactly? Where to find it?
4. Where's the semantics?
- It's in the words: insert standardized symbols for each (content?) word
  - Need symbols, vocabularies, ontologies
- It's in the links: create a standardized set of links and use (only?) them
  - Need links, operational semantics, link interpreters
- It will somehow emerge, by magic, if we just do enough stuff with OWL and RDF
  - Need formalisms, definitions, operational semantics, notation interpreters
5. NO to controlled vocabulary, says IR!
- 1960s: Cleverdon and the Cranfield aeronautics evaluations of text retrieval engines (Cleverdon 67)
  - Tested algorithms and lists of controlled vocabularies, also all words
  - SURPRISE: all words better than controlled vocabs!
  - which led to Salton's vector space approach to IR
  - which led to today's web search engines
- The IR position: forget ontologies and controlled lists; the semantics lies in multi-word combinations!
  - There's no benefit in artificial or controlled languages
  - Multi-word combinations ("kitchen knife") are good enough
  - Build language models: frequency distributions of words in corpus/doc (Callan et al. 99; Ponte and Croft 98)
Nonetheless, for Semantic Web uses, we need semantics. But WHAT is it? And how do we obtain it?
6. Toward semantics: Layers of interpretation 1

7. Layers of interpretation 2
[Slide diagram: layers labeled surface, POS, syntax over the example sentence]
- POS: PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
- Surface: Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
8. Layers of interpretation 3
[Slide diagram: a shallow-semantics layer added above the syntax, POS, and surface layers]
- Shallow semantics:
  - P0: [act: announce1, agent: P1(Sheikh Mohammed), theme: P9, time: present]
  - P9: [act: want3, agent: P6(we), theme: P10]
  - P10: [act: make8, theme: P7(Dubai), result: P8(center)]
  - Entities: P1(Sheikh Mohammed), P2(who), P3(Defense Minister), P4(United Arab Emirates), P5(inaug. ceremony), P6(we), P7(Dubai), P8(trading center)
- Coref: links among P1(Sheikh Mohammed), P2(who), P3(Defense Minister), P4(United Arab Emirates), P6(we)
- POS: PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
- Surface: Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
9. Layers of interpretation 4
[Slide diagram: deep(er) semantics and information structure added above the layers of the previous slide]
- Deep(er) semantics:
  - P0: [act: say-act3, agent: P1(Sheikh), theme: P9, authortime: T1, eventtime: T2 < T1]
  - P9: [state: desire1, experiencer: P1(Sheikh), theme: P10, statetime: T2]
  - P10: [act: change-state, theme: P7(Dubai), old-state: ?, new-state: P11, eventtime: T3 > T2]
  - P11: [state: essence1, experiencer: P7(Dubai), theme: P8(center), statetime: T4 > T3]
- Info structure: topic (theme), rheme, focus
- Shallow semantics, coref, syntax, POS, surface: as on the previous slide, over the same example sentence
10. Shallow and deep semantics
- She sold him the book / He bought the book from her
- He has a headache / He gets a headache
- Though it's not perfect, democracy is the best system

(X1 [act: Sell, agent: She, patient: (X1a [type: Book]), recip: He])
(X2a [act: Transfer, agent: She, patient: (X2c [type: Book]), recip: He]) (X2b [act: Transfer, agent: He, patient: (X2d [type: Money]), recip: She])
(X3a [prop: Headache, patient: He]) (?)
(X4a [type: State, object: (X4c [type: Head, owner: He]), state: -3]) (X4b [type: StateChange, object: X4c, fromstate: 0, tostate: -3])
(X4 [type: Contrast, arg1: (X4a ?), arg2: (X4b ?)])
11. Some semantic phenomena
- Somewhat easier:
  - Bracketing (scope) of predications
  - Word sense selection (incl. copula)
  - NP structure: genitives, modifiers
  - Concepts: ontology definition
  - Concept structure (incl. frames and thematic roles)
  - Coreference (entities and events)
  - Pronoun classification (ref, bound, event, generic, other)
  - Identification of events
  - Temporal relations (incl. discourse and aspect)
  - Manner relations
  - Spatial relations
  - Direct quotation and reported speech
  - Opinions and subjectivity
- More difficult:
  - Quantifier phrases and numerical expressions
  - Comparatives
  - Coordination
  - Information structure (theme/rheme)
  - Focus
  - Discourse structure
  - Other adverbials (epistemic modals, evidentials)
  - Identification of propositions (modality)
  - Pragmatics/speech acts
  - Polarity/negation
  - Presuppositions
  - Metaphors
12. Improving NL applications with semantics
- How to improve accuracy of IR / web search?
  - TREC 98-01: around 40%
  - Understand user query: expand query terms by meaning
- How to achieve conceptual summarization?
  - Never been done yet, at non-toy level
  - Interpret topic, fuse concepts according to meaning, re-generate
- How to improve QA?
  - TREC 99-02: around 65%
  - Understand Q and A: match their meanings; know common info
- How to improve MT quality?
  - MTEval 94: 70%, depending on what you measure
  - Disambiguate word senses to find correct meaning
13. Talk overview
- Introduction: Semantics and the Semantic Web
- Approach: General methodology for building the resources
- Ontology framework: Terminology ontology as start
  - Creating Omega: recent work on connecting ontologies
- Concept level: terms and relations
  - Learning concepts by clustering
  - Learning and using concept associations
- Instance level: instances and more
  - Harvesting instances from text
  - Harvesting relations
- Corpus: manual shallow semantic annotation
  - OntoNotes project
- Conclusion
14. 2. Approach: General methodology for building the resources
15. What's needed?
- A set of semantic symbols: democracy, eat, ...
- For each symbol, some kind of definition, or at least rules for its combination and treatment during notation transformations
- Notational conventions for each phenomenon of meaning: comparatives, time/tense, negation, number, etc.
- A collection of examples, as training data for learning systems to learn to do the work
- A body of world knowledge for use in processing
16. Credo and methodology
- Ontologies (and even concepts) are too complex to build all in one step
- so build them bit by bit, testing each new (kind of) addition empirically
- and develop appropriate learning techniques for each bit, so you can automate the process
- so next time (since there's no ultimate truth) you can build a new one more quickly
17. Large standardized metadata collections
What is an ontology? My def: a collection of terms denoting entities, events, and relationships in the domain, taxonomized and interrelated so as to express the sharing of properties. It's a formalized model of the domain, focusing on the aspects of interest for computation.
- The need is there: everybody's making lists
  - SIC and NAICS and other codes
  - Yahoo!'s topic classification
  - Semantic Web termbanks / ontologies
- But how do you
  - Guarantee the freshness and accuracy of the list?
  - Guarantee its completeness?
  - Ensure commensurate detail in levels of the list?
  - Cross-reference elements of the list?
Need an automated procedure for creating lists / metadata / ontologies
18. Plan: Stepwise accretion of knowledge
- Initial framework
  - Start with existing (terminological) ontologies as pre-metadata
  - Weave them together
- Build metadata/concepts
  - Define/extract concept cores
  - Extract/learn inter-concept relationships
  - Extract/learn definitional and other info
- Build (large) data/instance base
  - Extract instance cores
  - Link into ontology; store in databases
  - Extract more information, guided by parent concept
19. Omega ontology: Content and framework
- Concepts: 120,604 concept/term entries (76 MB)
  - Upper: own Penman Upper Model (ISI; Bateman et al.)
  - Upper: SUMO (Pease et al.); DOLCE (Guarino et al.)
  - Middle: WordNet (Princeton; Miller, Fellbaum)
  - Upper Middle: Mikrokosmos (NMSU; Nirenburg et al.)
  - Middle: 25,000 noun-noun compounds (ISI; Pantel)
- Lexicon / sense space
  - 156,142 English words; 33,822 Spanish words
  - 271,243 word senses
  - 13,000 frames of verb arg structure with case roles
  - LCS case roles (Dorr) 6.3 MB
  - PropBank roleframes (Palmer et al.) 5.3 MB
  - FrameNet roleframes (Fillmore et al.) 2.8 MB
  - WordNet verb frames (Fellbaum) 1.8 MB
- Associated information (not all complete)
  - WordNet subj domains (Magnini, Cavaglia) 1.2 MB
  - Various relations learned from text (ISI; Pantel)
  - TAP domain groupings (Stanford; Guha)
  - SemCor term frequencies 7.5 MB
- Instances: 10.1 GB
  - 1.1 million persons harvested from text
  - 900,000 facts harvested from text
  - 5.7 million locations from USGS and NGA
- Framework (over 28 million statements of concepts, relations, instances)
  - Available in PowerLoom
  - Instances in RDF
  - With database/MySQL
  - Online browser
  - Clustering software
  - Term and ontology alignment software
http://omega.isi.edu
20. Talk overview
21. 3. Framework: Terminology ontology as starting point; semi-automated alignment and merging
(This work with Andrew Philpot, Michael Fleischman, and Jerry Hobbs)
22. Example application: EDC (Hovy et al. 02)

23. Omega (Hovy et al. 03)
- WordNet 1.7 (Princeton): 110,000 nodes
- Our own new work (ISI): 400 nodes
- Mikrokosmos (New Mexico State U): 6,000 nodes
- Penman Upper Model (ISI): 300 nodes
24. General alignment and merging problem
- Goal: find attachment point(s) in ontology for a node/term from somewhere else (ontology, website, metadata schema, etc.)
- It's hard to do manually; very hard to do automatically: the system needs to understand the semantics of the entities to be aligned
- Several sets of algorithms; interesting problems; various algorithms
25. Ontology alignment and merging
- Goal: find attachment point in ontology for a node/term from somewhere else (ontology, website, metadata schema, etc.)
- Procedure
  - 1. For a new term/concept, extract and format name, definition, associated text, local taxonomy cluster, etc.
  - 2. Apply alignment suggestion heuristics (NAME, DEFINITION, HIERARCHY, DISPERSAL match) against the big ontology, to get proposed attachment points with strengths (Hovy 98); test with numerous parameter combinations, see http://edc.isi.edu/alignment/ (Hovy et al. 01)
  - 3. Automatically combine proposals (Fleischman et al. 03)
  - 4. Apply verification checks
  - 5. Bless or reject proposals manually
- Process developed in early 1990s (Agirre et al. 94; Knight & Luk 94; Okumura & Hovy 96; Hovy 98; Hovy et al. 01)
- Not stunningly accurate, but can speed up manual alignment markedly
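A minimal sketch of two of the suggestion heuristics named on this slide (NAME match and DEFINITION match), scored and combined into ranked attachment proposals. The scoring weights and overlap measures here are assumptions for illustration, not the exact Hovy 98 formulation.

```python
# Hypothetical sketch of alignment-suggestion heuristics: score candidate
# attachment points by name similarity plus definition-word overlap.
# Weights (0.6 / 0.4) and the Jaccard measure are illustrative assumptions.

def name_match(term: str, concept: str) -> float:
    """1.0 for an exact name match, 0.5 for substring overlap, else 0."""
    t, c = term.lower(), concept.lower()
    if t == c:
        return 1.0
    if t in c or c in t:
        return 0.5
    return 0.0

def definition_match(term_def: str, concept_def: str) -> float:
    """Jaccard overlap of definition words (a stand-in for DEFINITION match)."""
    a, b = set(term_def.lower().split()), set(concept_def.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_attachments(term: str, term_def: str, ontology: dict) -> list:
    """Rank candidate attachment points by a combined heuristic score."""
    scored = []
    for concept, concept_def in ontology.items():
        score = 0.6 * name_match(term, concept) + 0.4 * definition_match(term_def, concept_def)
        scored.append((concept, round(score, 3)))
    return sorted(scored, key=lambda x: -x[1])
```

The real system combines more heuristics (HIERARCHY, DISPERSAL) and learns the combination (Fleischman et al. 03); this shows only the proposal-with-strengths shape.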
26. Alignment for Omega
- Created Upper Region (400 nodes) manually
- Manually snipped tops off Mikrokosmos and WordNet, then attached them to the fringe of the Upper Region
- Automatically aligned bottom fringe of Mikrokosmos into WordNet
- Automatically aligned sides of "bubbles"
28. A puzzle
- Is Amber Decomposable or Nondecomposable?
- The stone sense of it (Mikrokosmos) is; the resin sense (WordNet) is not
- What to do??
29. Shishkebobs (Hovy et al. 03)
- Library ISA Building (and hence can't buy things)
- Library ISA Institution (and hence can buy things)
- SO: Building + Institution + Location: a Library is all these
- Also: Field-of-Study + Activity + Result-of-Process (Science, Medicine, Architecture, Art)
- Allowing shishkebobs makes merging ontologies easier (possible?): you respect each ontology's perspective
- Continuum from on-the-fly shadings to metonymy (see Guarino's identity conditions; Pustejovsky's qualia)
- We found about 400 shishkebobs
30. http://omega.isi.edu
31. Talk overview
32. 4a. Concept level: Learning terms/concepts by clustering web information
(This work by Patrick Pantel, Marco Pennacchiotti, and Dekang Lin)
33. Where/how to find new concepts/terms?
- Potential sources
  - Existing ontologies (AI efforts, Yahoo!, etc.) and lists (SIC codes, etc.)
  - Manual entry, esp. with reference to foreign-language text (EuroWordNet, IL-Annot, etc.)
  - Dictionaries and thesauri (Webster's, Roget's, etc.)
  - Automated discovery by text clustering (Pantel and Lin, etc.)
- Issues
  - How large do you want it? Tradeoff: size vs. consistency and ease of use
  - How detailed? Tradeoff: granularity/domain-specificity vs. portability and wide acceptance (Semantic Web)
  - How language-independent? Tradeoff: independence vs. utility for non/shallow-semantic NLP applications
34. Clustering By Committee (Pantel and Lin 02)
- CBC clustering procedure
  - Parse entire corpus using MINIPAR (D. Lin)
  - Define syntactic/POS patterns as features: N-N, N-subj-V, Adj-N, etc.
  - Cluster words, using pointwise mutual information on (word e, pattern f) features
  - Disambiguate:
    - find cluster centroids: the word "committee"
    - for non-centroid words, match their pattern features to committee words' features; if match, include the word in the cluster and remove those features
    - if no match, the word has remaining features, so try to include it in other clusters as well: split ambiguous words' senses
- Complexity: O(n²k) for n words in corpus, k features
- Results: no clustering is perfect, but CBC is quite good
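The feature weighting underlying the clustering step can be sketched directly: pointwise mutual information between a word e and a syntactic pattern f, computed from co-occurrence counts. The toy counts below are invented for illustration; CBC computes this over a parsed corpus.

```python
import math
from collections import Counter

# Minimal sketch of the PMI feature weighting CBC uses over
# (word, syntactic-pattern) co-occurrence counts.
# PMI(e, f) = log [ P(e, f) / (P(e) * P(f)) ]

def pmi(pair_counts: Counter, word: str, feat: str) -> float:
    """PMI of a word and a pattern feature from joint co-occurrence counts."""
    total = sum(pair_counts.values())
    p_ef = pair_counts[(word, feat)] / total
    p_e = sum(c for (w, _), c in pair_counts.items() if w == word) / total
    p_f = sum(c for (_, f), c in pair_counts.items() if f == feat) / total
    if p_ef == 0:
        return float("-inf")  # never co-occurred
    return math.log(p_ef / (p_e * p_f))
```

Words are then clustered by the similarity of their PMI-weighted feature vectors; a positive PMI means the word and pattern co-occur more than chance predicts.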
35. www.isi.edu/pantel/
38. From words to concepts
- How to find a name for a cluster?
- Given term instances, search for frequently co-occurring terms, using apposition patterns
  - "the President, Thomas Jefferson, ..."
  - "Kobe Bryant, famous basketball star"
- Extract terms, check if present in ontology
- Examples for "Lincoln":
  - PRESIDENT (N891): 0.187331
  - BORROWER / THRIFT (N724): 0.166958
  - CAR / DIVISION (N257): 0.137333
- Works OK for nouns, less so for others
39. Problems with clustering
- No text-based clustering is ever perfect
- How many concepts are there?
- How are they arranged? (There is no reason to expect that a clustering taxonomy should correspond with an ISA hierarchy!)
- What interrelationships exist between them?
- Clustering is only the start
40. Talk overview
41. 4b. Concept level: Learning and using concept associations
(This work with Chin-Yew Lin, Mike Junk, Michael Fleischman, and Tom Murray)
42. Topic signature
Related words in texts show a Poisson distribution. In a large set of texts, topic keywords concentrate around topics, so families of related words appear in bursts. To find a family, compare topical word frequency distributions against global background counts.
- Word family built around inter-word relations
- Def: head word (or concept), plus a set of related words (or concepts), each with a strength:
  {Tk: (tk1, wk1), (tk2, wk2), ..., (tkn, wkn)}
- Problem: scriptal co-occurrence, etc.; how to find it?
- Approximate by simple textual term co-occurrence...
43. Learning signatures
- 1. Collect texts, sorted by topic (need texts, sorted)
- 2. Identify families of co-occurring words (how to count co-occurrence?)
- 3. Evaluate their purity (how to evaluate?)
- 4. Find the words' concepts in the Ontology (need disambiguator)
- 5. Link together the concept signatures
44. Calculating weights
- tf.idf: w_jk = tf_jk * idf_j
  - tf_jk: count of term j in text k ("waiter": often, but only in some texts)
  - idf_j = log(N / n_j): within-collection frequency ("the": often, in all texts)
  - n_j: number of docs with term j; N: total number of documents
  - tf.idf is the best for IR, among 287 methods (Salton & Buckley, 1988)
- chi-squared (χ²): w_jk = (tf_jk - m_jk)² / m_jk if tf_jk > m_jk, else 0 (Hovy & Lin, 1997)
  - m_jk = (Σ_j tf_jk * Σ_k tf_jk) / Σ_jk tf_jk: expected (mean) count for term j in text k
- likelihood ratio λ: -2 log λ = 2N * I(R; T) (Lin & Hovy, 2000)
  - (more appropriate for sparse data; -2 log λ is asymptotic to χ²)
  - N: total number of terms in corpus
  - I: mutual information between text relevance R and given term T
  - I = H(R) - H(R | T), for H(R) the entropy over relevant texts R and H(R | T) the entropy of term T over relevant and nonrelevant texts
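The first two weighting schemes above are simple enough to state as code. This is a direct transcription of the slide's formulas, with invented toy counts; the likelihood-ratio weight needs relevance-labeled data and is omitted.

```python
import math

# Sketch of the two simpler weighting schemes on this slide, using w_jk for
# the weight of term j in text k. Toy counts are invented for illustration.

def tfidf(tf_jk: int, n_j: int, N: int) -> float:
    """w_jk = tf_jk * idf_j, with idf_j = log(N / n_j)."""
    return tf_jk * math.log(N / n_j)

def chi2_weight(tf_jk: int, m_jk: float) -> float:
    """w_jk = (tf_jk - m_jk)^2 / m_jk if tf_jk > m_jk, else 0."""
    return (tf_jk - m_jk) ** 2 / m_jk if tf_jk > m_jk else 0.0
```

Note the asymmetry of the chi-squared variant: a term occurring *less* often than its expected count m_jk gets weight 0, so only positively bursty terms enter the signature.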
45. Early signature study (Hovy & Lin 97)
- Corpus
  - Training set: WSJ 1987, 16,137 texts (32 topics)
  - Test set: WSJ 1988, 12,906 texts (31 topics)
  - Texts indexed into categories by humans
- Signature data
  - 300 terms each, using tf.idf
  - Word forms: single words, demorphed words, multi-word phrases
- Topic distinctness...
- Topic hierarchy
46. Evaluating signatures
- Solution: perform a text categorization task
  - create N sets of texts, one per topic
  - create N topic signatures TSk
  - for each new document, create a document signature DSi
  - compare DSi against all TSk; assign the document to the best match
- Match function: vector space similarity measure
  - Cosine similarity: cos θ = (TSk · DSi) / (‖TSk‖ ‖DSi‖)
- Test 1 (Hovy & Lin, 1997, 1999)
  - Training set: 10 topics, 3,000 texts (TREC)
  - Contrast set (background): 3,000 texts
  - Conclusion: tf.idf and χ² signatures work OK but depend on signature length
- Test 2 (Lin & Hovy, 2000)
  - 4 topics, 6,194 texts; uni/bi/trigram signatures
  - Evaluated using SUMMARIST: λ > tf.idf
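The categorization test reduces to a few lines: represent each signature as a term-to-weight mapping and assign a document to the topic with the highest cosine similarity. The signature vectors here are invented toys.

```python
import math

# Sketch of the evaluation-by-categorization idea: each topic signature TSk
# and each document signature DSi is a term->weight vector; a document goes
# to the topic whose signature is most cosine-similar.

def cosine(u: dict, v: dict) -> float:
    """cos(u, v) = (u . v) / (|u| |v|) over sparse term->weight dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def categorize(doc_sig: dict, topic_sigs: dict) -> str:
    """Assign the document to the best-matching topic signature."""
    return max(topic_sigs, key=lambda k: cosine(topic_sigs[k], doc_sig))
```

Categorization accuracy then serves as a proxy for signature quality, which is how the two weighting schemes were compared.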
47. Text pollution on the web
Goal: create word families (signatures) for each concept in the Ontology. Get texts from the Web. Main problem: text pollution. What's the search term?
Purifying: in later work, used Latent Semantic Analysis
48. Purifying with Latent Semantic Analysis
- Technique used by psychologists to determine basic cognitive conceptual primitives (Deerwester et al., 1990; Landauer et al., 1998)
- Singular Value Decomposition (SVD), used for text categorization, lexical priming, language learning
- LSA automatically creates collections of items that are correlated or anti-correlated, with strengths
  - ice cream, drowning, sandals → summer
- Each such collection is a semantic primitive in terms of which objects in the world are understood
- We tried LSA to find the most reliable signatures in a collection and reduce the number of signatures in the contrast set
49. LSA for signatures
- Create matrix A, one signature per column (words × topics)
- Apply SVDPAC to compute the decomposition A = U Σ Vᵀ
  - U: m × n orthonormal matrix of left singular vectors that span the space
  - Vᵀ: n × n orthonormal matrix of right singular vectors
  - Σ: diagonal matrix with exactly rank(A) nonzero singular values σ1 > σ2 > ... > σn
- Use only the first k of the new concepts: Σ′ = diag(σ1, σ2, ..., σk)
- Create matrix A′ out of these k vectors: A′ = U Σ′ Vᵀ ≈ A
- A′ is a new (words × topics) matrix, with different weights and "new" topics; each column is a purified signature
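The purification step is a standard truncated-SVD reconstruction, so it can be sketched with NumPy (assumed available here; the slide's SVDPAC is an older SVD package). The matrix below is random illustration data, not real signatures.

```python
import numpy as np

# Sketch of the LSA purification step: SVD of the (words x topics) signature
# matrix A, keep the top-k singular values, rebuild a rank-k A'.
# Each column of A' is then read off as a "purified" signature.

def purify(A: np.ndarray, k: int) -> np.ndarray:
    """Return A' = U_k Sigma_k V_k^T, the rank-k reconstruction of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Keeping only the k strongest singular directions suppresses weight patterns that are idiosyncratic to single topics, which is exactly the "pollution" the slide is trying to remove.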
50. Some results with LSA (Hovy and Junk 99)
- Contrast set (for idf and χ²): set of documents on a very different topic, for good idf
- Partitions: collect documents within each topic set into partitions, for faster processing; /n is a collecting parameter
- U function: function for creation of the LSA matrix
- Results (TREC texts)
  - Demorphing helps
  - χ² better than tf and tf.idf
  - LSA improves results, but not dramatically
51. Weak semantics: Signature for every concept
- Procedure
  - 1. Create query from Ontology concept (word + defn. words)
  - 2. Retrieve 5,000 documents (8 web search engines)
  - 3. Purify results (remove duplicates, HTML, etc.)
  - 4. Extract word family (using tf.idf, χ², LSA, etc.)
  - 5. Purify
  - 6. Compare to siblings and parents in the Ontology
- Problem: raw signatures overlap
  - average parent-child node overlap: 50%
  - Bakery-Edifice 35%: too far; missing generalization
  - Airplane-Aircraft 80%: too close?
- Remaining problem: web signatures still not pure...
- WordNet: in 2002-04, Agirre and students (U of the Basque Country) built signatures for all WordNet nouns
52. Recent work using signatures
- Multi-document summarization (Lin and Hovy, 2002)
  - Create λ signature for each set of texts
  - Create IR query from signature terms; use IR to extract sentences
  - (Then filter and reorder sentences into a single summary)
  - Performance: DUC-01 tied first; DUC-02 tied second place
- Wordsense disambiguation (Agirre, Ansa, Martinez, Hovy, 2001)
  - Try to use WordNet concepts to collect text sets for signature creation (word+synonym > def-words > word AND synonym NEAR def-word > etc.)
  - Built competing signatures for various noun senses: (a) WordNet synonyms; (b) SemCor tagged corpus (χ²); (c) web texts (χ²); (d) WSJ texts (χ²)
  - Performance: Web signatures > random, WordNet baseline
- Email clustering (Murray and Hovy, 2004)
  - Social Network Analysis: cluster emails and create signatures
  - Infer personal expertise, project structure, experts omitted, etc.
  - Corpora: ENRON (240K emails), ISI corpus, NSF eRulemaking corpus
53. Semantics from signatures
- Assuming we can create signatures and use them in some applications:
- How to integrate signatures into an ontology?
- How to employ signatures in inheritance, classification, inference, and other operations?
- How to compose signatures into new concepts?
- How to match signatures across languages?
- How do signatures change?
54. Talk overview
55. 5a. Instance level: Harvesting instances from text
(This work with Michael Fleischman)
56. What kinds of knowledge?
- Goal 1: Add instantial knowledge (classify instances under types)
  - Sofia is a city
  - Sofia is a woman's name
  - Cleopatra was a queen
  - Everest is a mountain
  - Varig is an airline company
- Goal 2: Add definitional / descriptive knowledge (create links between concepts)
  - Mozart was born in 1756
  - Bell invented the telephone
  - Pisa is in Italy
  - The Leaning Tower is in Pisa
  - Columbus discovered America
- Uses
  - QA (answer suggestion and validation)
  - Wordsense disambiguation for MT
- Sources
  - Existing lists (CIA factbook, atlases, phone books)
  - Dictionaries and encyclopedias
  - The Web
57. Learning about locations (Fleischman 01)
- Challenge: e.g., region, state/territory, or city?
  - "The company, which is based in Dpiyj Dsm Gtsmdodvp, Vsaog., said an antibody prevented development of paralysis."
  - "The situation has been strained ever since Yplup began waging war in Dpiyj Rsdy Sdos."
  - "The pulp and paper operations moved to Dpiyj Vstpaomos in 1981."
  - (South San Francisco = region, Calif. = state, Tokyo = city, South East Asia = region, South Carolina = state)
- Try to learn instances of 8 types (country, region, territory, city, street, artifact, mountain, water)
  - (we have lists of these already, so finding sentences for training data is easy)
- Uses
  - QA: corroborating evidence for an answer
  - IR: query expansion and signature enrichment
58. Learning procedure
- Approach
  - Training: for each location, identify features in context; try to learn features that indicate each type
  - Usage: for new material, use the learned features to classify the type of location; place results with high confidence into the ontology
- Training
  - Applied BBN's IdentiFinder to bracket locations
  - Chose 88 features (unigrams, bigrams, trigrams in fixed positions before and after the location instance; later added signatures, etc.)
  - 3 approaches: Bayesian classifier, neural net, decision tree (C4.5)
  - MemRun procedure: store examples if good (> THRESH1) and prefer stored info later if unsure (< THRESH2)
59. MemRun
- Initial results
  - Bayesian classifier not very accurate; neural net OK
  - D-tree better, but still multiple classes for each instance
  - MemRun: record the best example of each instance
- Algorithm with MemRun
  - Pass 1: for each text,
    - preprocess with POS tagger and IdentiFinder
    - apply D-tree to classify instance
    - if score > THRESH1, save (instance, tag, score) in MemRun
  - Pass 2: for each text,
    - again apply D-tree
    - if score < THRESH2, replace tag by MemRun value
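The two-pass MemRun idea is small enough to sketch directly. The decision-tree classifier is abstracted away into precomputed (instance, tag, score) triples, and the threshold values are placeholders, not the ones used in the experiments.

```python
# Sketch of the two-pass MemRun procedure on this slide: pass 1 memorizes
# high-confidence (instance, tag) pairs; pass 2 overrides low-confidence
# classifications with the memorized tag. Thresholds are illustrative.

THRESH1, THRESH2 = 0.9, 0.5

def memrun(classified, thresh1=THRESH1, thresh2=THRESH2):
    """classified: list of (instance, tag, score) triples from the D-tree."""
    memory = {}
    # Pass 1: remember the best high-confidence tag per instance.
    for inst, tag, score in classified:
        if score > thresh1 and score > memory.get(inst, ("", 0.0))[1]:
            memory[inst] = (tag, score)
    # Pass 2: replace uncertain tags with the remembered one, if any.
    out = []
    for inst, tag, score in classified:
        if score < thresh2 and inst in memory:
            tag = memory[inst][0]
        out.append((inst, tag))
    return out
```

The effect is that one confident sighting of a location name propagates its type to later, ambiguous sightings of the same name.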
60. Examples
- Water: Abuna River, Adriatic, Adriatic Sea, Adriatic sea, Aegean Sea, Aguapey river, Aguaray River, Akhtuba River, Akpa Yafe River, Akrotiri salt lake, Aksu River, Alma-Atinka River, Almendares River, Alto Maranon River, Amazon River, Amur river, Andaman Sea, Angara River, Angrapa river, Anna River, Arabian Gulf, Arabian Sea, ...
- City: Aachen, Abadan, Abassi Madani, Abbassi Madani, Abbreviations AZ, Abdullojonov, Aberdeen, Abidjan, Abidjan Radio Cote d'Ivoire Chaine, Abiko, Abrahamite, Abramenkov, Abu Dhabi, Abuja, Abyssinia, Acari, Accom, Accordance, ...
- Territory: General Robles, Ghanaians, Gilan Province, Gilan Province Sha'ban, Gitega Province, Glencore, Goias State, Goias State of Brazil, Gongola State, Granma Province, Great Brotherly Russia, Greytown, Guanacaste Province, Guandong province, Guangdong Province, Guangxi Province, Guangzhou Shipyards, Guantanamo Province, Guayas Province, Guerrero State, Guiliano Amato, Guizhou Province, Gwent, ...
- Mountain: Wicklow Mountains, Wudang Mountain, Wudangshan Mountain, Wuling mountains, Wuyi Mountains, Xiao Hinggan Mountains, Yimeng Mountains, Zamborak mountain, al-Marakishah mountain, al-Maraqishah mountains, al-Nubah mountains, al-Qantal mountain
61. Results for locations
NB: test samples are small
THRESH1: 77; THRESH2: 98
62. People (Fleischman and Hovy 02)
- Goal: collecting training data about 8 types of people: politicians, entertainers (movie stars, etc.), athletes, businesspeople...
- Procedure as before, with added features using the signature of each category and WordNet hypernyms.

athlete: 458.029 398.626 perez 392.904 rogers 368.441 carlos 351.686 points 333.083 roy 311.042 andres 284.927 scored 273.197 chris 252.172 hardee's 239.747 george 223.879 games 222.202 mark 217.711 mike ...

cleric: 1133.793 rabbi 1074.785 cardinal 1011.190 paul 809.128 archbishop 798.372 john 748.170 bishop 714.173 catholic 688.291 church 613.625 roman 610.287 tutu 584.720 desmond 460.057 pope 309.923 kahane 300.236 meir ...

entertainer: 1902.178 " 1573.695 actor 1083.622 actress 721.929 movie 618.947 george 607.466 film 553.659 singer 541.235 president 536.962 her 536.856 keating 528.226 star 448.524 ( 433.065 ) 404.008 said ...

businessperson: 4428.267 greenspan 3999.135 alan 2774.682 reserve 2429.129 chairman 1786.783 federal 1709.120 icahn 1665.358 fed 1252.701 carl 827.0291 board 682.420 rates 662.510 investor 651.529 twa 531.907 kerkorian 522.072 interest ...
63. Some results for people
- Total count: 1030; Total correct: 839 (0.815); Total incorrect: 191 (0.185)
- misc: 0/20 (0.0); lawyer: 13/44 (0.295); police: 11/17 (0.647); doctor: 48/50 (0.96); entertainer: 150/173 (0.867); athlete: 11/13 (0.846); business: 120/166 (0.722); military: 14/21 (0.666); clergy: 11/11 (1.0); politician: 461/515 (0.895)
- Best results using signatures and WordNet hypernyms (but no synset expansion)
- Problems
  - Training and test data skewed
  - Genuine ambiguity: often, politician = military leader
64. Instance extraction (Fleischman & Hovy 03)
- Goal: extract all instances from the web
- Method
  - Download text from web (15 GB)
  - Identify named entities (BBN's IdentiFinder (Bikel et al. 93))
  - Extract ones with descriptive phrases (<APOS>, <CN/PN>)
    - ("the vacuum manufacturer Horeck" / "Saddam's physician Abdul")
  - Cluster them, and categorize in ontology
- Result: over 900,000 instances
  - Average 2 mentions per instance; 40 for George W. Bush
- Evaluation
  - Tested with 200 "who is X?" questions
  - Better than TextMap: 25% more
  - Faster: 10 sec vs. 9 hr!
65. Talk overview
66. 5b. Instance level: Harvesting relations
(This work with Deepak Ravichandran, Donghui Feng, and Patrick Pantel)
67. Shallow patterns for information
- Goal: learn relationship data from the web (when was someone born? where does he live?)
- Procedure: automatically learn word-level patterns
  - When was Mozart born?
  - Mozart (1756-1791)
  - <NAME> ( <BIRTHYEAR> - <DEATHYEAR> )
- Apply patterns to Omega concepts/instances
- Evaluation: test in the TREC QA competition
- Main problem: learning the patterns
  - (In TREC QA 2001, Soubbotin and Soubbotin got a very high score with over 10,000 patterns built by hand)
68. Learning extraction patterns from the web (Ravichandran and Hovy 02)
- Prepare
  - Select an example for the target relation: Q term (Mozart) and A term (1756)
- Collect data
  - Submit Q and A terms as queries to a search engine (AltaVista)
  - Download the top 1000 web documents
- Preprocess
  - Apply a sentence breaker to the documents
  - Retain only sentences with both Q and A terms
  - Pass retained sentences through a suffix tree constructor
- Select and create patterns
  - Filter each phrase in the suffix tree to retain only those phrases that contain both Q and A terms
  - Replace the Q term by the tag <NAME> and the A term by <ANSWER>
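The core abstraction step can be sketched without the suffix tree: take sentences containing both the Q and A terms, substitute the <NAME>/<ANSWER> tags, and count the resulting surface patterns. The real system generalizes substrings via the suffix tree over many downloaded documents; the sentences below are invented.

```python
from collections import Counter

# Sketch of the pattern-induction idea: abstract each sentence containing
# both the question term and the answer term into a surface pattern by
# substituting <NAME> and <ANSWER>, then count pattern frequencies.

def learn_patterns(sentences, q_term, a_term):
    """Return a Counter of abstracted surface patterns."""
    patterns = Counter()
    for s in sentences:
        if q_term in s and a_term in s:
            p = s.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>")
            patterns[p] += 1
    return patterns
```

In the full procedure, each pattern's precision is then estimated by applying it to held-out documents for the same question term, which yields the strengths shown on the next slide.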
69. Some results
- BIRTHYEAR
  - 1.0  <NAME> ( <ANS>
  - 0.85 <NAME> was born on <ANS>
  - 0.6  <NAME> was born in <ANS>
  - ...
- DEFINITION
  - 1.0 <NAME> and related <ANS>s
  - 1.0 <ANS> ( <NAME> ,
  - 0.9 as <NAME> , <ANS> and
  - ...
- LOCATION
  - 1.0 <ANS>s <NAME> .
  - 1.0 regional <ANS> <NAME>
  - 0.9 the <NAME> in <ANS> ,
- Testing (TREC-10 questions)

  Question type | Num Qs | TREC MRR | Web MRR
  BIRTHYEAR     |      8 |    0.479 |   0.688
  INVENTOR      |      6 |    0.167 |   0.583
  DISCOVERER    |      4 |    0.125 |   0.875
  DEFINITION    |    102 |    0.345 |   0.386
  WHY-FAMOUS    |      3 |    0.667 |   0.0
  LOCATION      |     16 |    0.75  |   0.864
70. Regular expressions (Ravichandran et al. 2004)
- New process: learn regular expression patterns
- Results: over 2 million instances from a 15 GB corpus
- Complexity: O(y²), for max string length y
- Later work: downloaded and cleaned 1 TB of text from the web; created a 119 MB corpus used for additional learning of N-N compounds
71Comparing clustering and surface patterns
- Precision took 50 random words, each with the system's learned superconcepts (top 3 of system); added top 3 from WordNet, 1 human superconcept. Used 2 judges (Kappa 0.78-0.85)
- Recall Relative Recall = Recall_Patt / Recall_Co-occ = C_P / C_C
- TREC-03 defns Patt up to 52% MRR; Co-Occ up to 44% MRR
- Measures reported MRR, Precision (correct+partial), Relative Recall

             Pattern System           Co-occurrence System
  Training   Prec   Top-3   MRR      Prec   Top-3   MRR
  1.5MB      56.6   60.0    60.0     12.4   20.0    15.2
  15MB       57.3   63.0    61.0     23.2   50.0    37.3
  150MB      50.7   56.0    55.0     60.6   78.0    73.2
  1.5GB      52.6   51.0    51.0     69.7   93.0    85.8
  15GB       61.8   69.0    67.5     78.7   92.0    86.2
  150GB      67.8   67.0    65.0     (too large to process)
(Ravichandran and Pantel 2004)
72Relation extraction from a small corpus
- The challenge apply RegExp pattern induction to
a small corpus (Chemistry textbook) (Pantel and
Pennacchiotti 06)
73Espresso procedure (Pantel and Pennacchiotti 06)
- Phase 1 Pattern Extraction, like Ravichandran, using MI
- Measure reliability based on an approximation of pattern recall
- Phase 2 Instance Extraction
- Instantiate all patterns to extract all possible instances
- Identify generic patterns using Google redundancy check with previously accepted patterns
- Measure reliability of each instance
- Select top-K instances
- Phase 3 Instance Expansion (if too few instances extracted in Phase 2)
- Syntactic drop nominal mods: proton is-a small particle → proton is-a particle
- WordNet expand using hypernyms: hydrogen is-a element → nitrogen is-a element
- Web apply patterns to the Web to extract additional instances
- Phase 4 Axiomatization (transform relations into axioms in HNF form)
- e.g., R is-a S becomes R(x) ⇒ S(x)
- e.g., R part-of S becomes (∀x)R(x) ⇒ (∃y)S(y) ∧ part-of(x,y)
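Espresso's reliability idea (score a pattern by its average PMI with already-reliable instances, normalized by the maximum PMI) can be sketched as below. This is a simplified toy version: the co-occurrence counts, the PMI estimator, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import math
from collections import defaultdict

def pmi(count_ip, count_i, count_p, total):
    """Pointwise mutual information between an instance and a pattern."""
    return math.log((count_ip * total) / (count_i * count_p))

def pattern_reliability(cooc, inst_rel):
    """Espresso-style pattern reliability: average PMI with known-reliable
    instances, normalized by the maximum observed PMI (a sketch)."""
    total = sum(sum(p.values()) for p in cooc.values())
    inst_tot = {i: sum(cooc[i].values()) for i in cooc}
    pat_tot = defaultdict(int)
    for i in cooc:
        for p, c in cooc[i].items():
            pat_tot[p] += c
    pmis = {(i, p): pmi(c, inst_tot[i], pat_tot[p], total)
            for i in cooc for p, c in cooc[i].items() if c > 0}
    max_pmi = max(pmis.values()) or 1.0  # avoid dividing by zero
    rel = defaultdict(float)
    for (i, p), v in pmis.items():
        rel[p] += (v / max_pmi) * inst_rel.get(i, 0.0)
    return {p: r / len(inst_rel) for p, r in rel.items()}

cooc = {("proton", "particle"): {"X is a Y": 4, "X and Y": 1},
        ("dog", "animal"): {"X is a Y": 3}}
print(pattern_reliability(cooc, {("proton", "particle"): 1.0,
                                 ("dog", "animal"): 1.0}))
```

In the full algorithm the same mutual recursion also scores instances by their reliable patterns, which is what lets Espresso bootstrap from a handful of seeds.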
74IE by pattern (Feng, Ravichandran, Hovy 2005)
- Why not Gorbachev? gender
- Why not Mrs. Roosevelt? period
- Why not Maggie Thatcher? home?
- Which semantics to check?
75Talk overview
- Introduction Semantics and the Semantic Web
- Approach General methodology for building the
resources - Ontology framework Terminology ontology as start
- Creating Omega recent work on connecting
ontologies - Concept level terms and relations
- Learning concepts by clustering
- Learning and using concept associations
- Instance level instances and more
- Harvesting instances from text
- Harvesting relations
- Corpus manual shallow semantic annotation
- OntoNotes project
- Conclusion
766. OntoNotes Creating a Semantic Corpus by
Manual Annotation
(This work with Ralph Weischedel (BBN), Martha
Palmer (U Colorado), Mitch Marcus (UPenn), and
various colleagues)
77Corpus creation by annotation
- Goal create corpus of (sentence, semantic rep) pairs
- Use enable machine learning algorithms to do this
- Process humans add information into sentences (and their parses)
- Recent projects
- Penn Treebank (Marcus et al. 99) syntax
- PropBank (Palmer et al. 03) verb frames
- NomBank (Myers et al. 03) noun frames
- FrameNet (Fillmore et al. 04) verb and noun frames
- TIGER/SALSA Bank (Pinkal et al. 04) verb frames
- Prague Dependency Treebank (Hajic et al. 02)
- Interlingua Annotation (Dorr et al. 04)
- I-CAB, Greek banks
- OntoNotes (Weischedel et al. 05) word senses, coref links, ontology
78Antecedents
[Figure: resources feeding into OntoNotes: Treebank, PropBank, NomBank, FrameNet, VerbNet, WordNet, and their Chinese and Arabic counterparts, plus Salsa-German and Prague-Czech, combined via sense tags, coreference, and ontology links]
79OntoNotes large-scale annotation
- Partners BBN (Weischedel), U of Colorado (Palmer), U of Penn (Marcus), ISI (Hovy)
- Goal in 4 years, annotate nouns, verbs, and corefs in 1 million words of English, Chinese, and Arabic text
- Manually provide semantic symbols for nouns, verbs, adjs, advs
- Manually connect sentence structure in verb and noun frames
- Manually link anaphoric references
- Validation inter-annotator agreement of 90%
- Outcomes (2004)
- PropBank verb annotation procedure developed
- Pilot corpus built, with coref annotation
- New project started October 2005 (English; Chinese and Arabic in 2006)
- Potential for the near future a semantics bank
- May energize lots of research on semantic analysis, reps, etc.
- May enable semantics-based IR, QA, MT, etc.
80OntoNotes representation of literal meaning
The founder of Pakistan's nuclear department Abdul Qadeer Khan has admitted he transferred nuclear technology to Iran, Libya, and North Korea

P1 type Person3   name Abdul Qadeer Khan
P2 type Person3   gender male
P3 type Know-How4
P4 type Nation2   name Iran
P5 type Nation2   name Libya
P6 type Nation2   name N. Korea
X0 act Admit1     speaker P1   saying X2
X1 act Transfer2  agent P2   patient P3   dest (P4 P5 P6)
coref P1 P2
(slide credit to M. Marcus and R. Weischedel, 2004)
81Even so Many words untouched!
- WSJ1428
- OPEC's ability to produce more petroleum than it
can sell is beginning to cast a shadow over world
oil markets. Output from the Organization of
Petroleum Exporting Countries is already at a
high for the year and most member nations are
running flat out. But industry and OPEC
officials agree that a handful of members still
have enough unused capacity to glut the market
and cause an oil-price collapse a few months from
now if OPEC doesn't soon adopt a new quota system
to corral its chronic cheaters. As a result, the
effort by some oil ministers to get OPEC to
approve a new permanent production-sharing
agreement next month is taking on increasing
urgency. The organization is scheduled to meet
in Vienna beginning Nov. 25. So far this year,
rising demand for OPEC oil and production
restraint by some members have kept prices firm
despite rampant cheating by others. But that
could change if demand for OPEC's oil softens
seasonally early next year as some think may
happen. OPEC is currently producing more than 22
million barrels a day, sharply above its nominal,
self-imposed fourth-quarter ceiling of 20.5
million, according to OPEC and industry officials
at an oil conference here sponsored by the Oil
Daily and the International Herald Tribune. At
that rate, a majority of OPEC's 13 members have
reached their output limits, they said.
82OntoNotes annotation The 90% Solution
- 1. Sense creation
- Expert creates meaning options (shallow semantic senses) for verbs, nouns, adjs, advs; follows PropBank (Palmer et al.)
- At same time, creates concepts and organizes/refines Omega ontology content and structure
- 2. Sense annotation process goes by word, across docs. Process developed in PropBank. Annotators manually
- See each sentence in corpus containing the current word (noun, verb, adjective, adverb) to annotate
- Select appropriate senses (= ontology concepts) for each one
- Connect frame structure (for each verb and relational noun)
- 3. Coref annotation process goes by doc. Annotators
- Connect co-references within each doc
- Constant validation require 90% inter-annotator agreement
83Sense annotation procedure
- Sense creator first creates senses for a word
- Loop 1
- Manager selects next nouns from sensed list and assigns annotators
- Programmer randomly selects 50 sentences and creates initial Task File
- Annotators (at least 2) do the first 50
- Manager checks their performance
- If 90% agreement and few or no NoneOfAbove, send on to Loop 2
- Else Adjudicator and Manager identify reasons, send back to Sense creator to fix senses and defs
- Loop 2
- Annotators (at least 2) annotate all the remaining sentences
- Manager checks their performance
- If 90% agreement and few or no NoneOfAbove, send to Adjudicator to fix the rest
- Else Adjudicator annotates differences
- If Adjudicator agrees with one Annotator 90%, then ignore the other Annotator's work (assume a bad day for the other); else if Adjudicator agrees with both about equally often, then assume bad senses and send the problematic ones back to Sense creator
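A minimal sketch of the Manager's routing decision in this loop, assuming a flat list of sense labels per annotator and an illustrative `"NOA"` (NoneOfAbove) label; the threshold follows the 90% rule, and `route` and the cutoff of two NOA tags are assumptions made for this example:

```python
def agreement(a1, a2):
    """Fraction of tokens on which two annotators chose the same sense."""
    assert len(a1) == len(a2)
    return sum(x == y for x, y in zip(a1, a2)) / len(a1)

def route(a1, a2, threshold=0.90):
    """Decide the next workflow step after a 50-sentence pilot round."""
    score = agreement(a1, a2)
    none_of_above = sum(x == "NOA" for x in a1 + a2)
    if score >= threshold and none_of_above <= 2:
        return "proceed"       # senses are usable: annotate the rest
    return "revise-senses"     # send back to the Sense creator

ann1 = ["s1"] * 45 + ["s2"] * 5
ann2 = ["s1"] * 47 + ["s2"] * 3
print(route(ann1, ann2))  # → proceed
```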
84Pre-OntoNotes test can it be done?
- Annotation process and tools developed and tested in PropBank (Palmer et al., U Colorado)
- Typical results (10 words of each type, 100 sentences each)

         tagger agreement        senses                time (min/100 tokens)
         Round1 → Round2 → Round3
  verbs  .76 → .86 → .91         4.5 → 5.2 → 3.8       30 → 25 → 25
  nouns  .71 → .85 → .95         7.3 → 5.1 → 3.3       28 → 20 → 15
  adjs   .87 →  —  → .90         2.8 →  —  → 5.5       24 →  —  → 18
(by comparison, agreement using WordNet senses is about 70%)
85Creating the senses
- Use 90% rule to limit degree of delicacy
- See if annotators can agree
- Perform manual insertion
- After manual creation, get annotator feedback
- Should you create the sense? How many must there be?
- Is the term definition adequate?
- Where should the term go relative to the other terms? (species)
- What is unique/different about this term? (differentia/ae)
- How to do this systematically? Developed method of graduated refinement using creation of sense treelets with differentiae
86Noun and verb sense creation
- Performed by Ann Houston in Boston (who also does verb sense creation)
- Sense groupings created
- 4 nouns per day sense-created
- Max so far head, with 15 senses
- Verb procedure creates senses by grouping WordNet senses (PropBank)
- Noun procedure taxonomizes senses into treelets, with differentiae at each level, for insertion into ontology
<inventory lemma="price-n">
  <sense n="1" type="" name="cost or monetary value of goods or services" group="1">
    <diff> quantity monetary_value </diff>
    <comment> PRICE of NP -> NP's good/service; PRICE = exchange_value </comment>
    <examples>
      The price of gasoline has soared lately.
      I don't know the prices of these two fur coats.
      The museum would not sell its Dutch Masters collection for any price.
      The cattle thief has a price on his head in Maine.
      They say that every politician has a price.
    </examples>
    <mappings> <wn version="2.1">1,2,4,5,6</wn> <omega> </omega> </mappings>
  </sense>
  <sense n="2" type="" name="sacrifice required to achieve something" group="1">
    <diff> activity complex effort </diff>
    <comment> PRICE = effort; PREP(of/for)/SCOMP = NP goal/result </comment>
    <examples>
      John has paid a high price for his risky life style.
    </examples>
  </sense>
</inventory>

PRICE
  abstract > quantity > monetary_value (group 1)
  physical > activity > complex (not a single event or action) > effort (group 2)
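Assuming such sense inventories are stored as well-formed XML (an `<inventory>` element with `<sense>` children carrying `<mappings>`, as on this slide), they can be loaded mechanically with the standard library. The trimmed XML string and `load_senses` here are a hypothetical reconstruction for illustration:

```python
import xml.etree.ElementTree as ET

XML = """<inventory lemma="price-n">
  <sense n="1" name="cost or monetary value of goods or services" group="1">
    <mappings><wn version="2.1">1,2,4,5,6</wn></mappings>
  </sense>
  <sense n="2" name="sacrifice required to achieve something" group="1"/>
</inventory>"""

def load_senses(xml_text):
    """Map each sense number to its gloss and its WordNet sense mapping."""
    root = ET.fromstring(xml_text)
    senses = {}
    for s in root.findall("sense"):
        wn = s.find("mappings/wn")  # absent for senses with no mapping yet
        senses[s.get("n")] = {
            "name": s.get("name"),
            "wordnet": wn.text.split(",") if wn is not None and wn.text else [],
        }
    return senses

print(load_senses(XML)["1"]["wordnet"])  # → ['1', '2', '4', '5', '6']
```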
87Word senses from lexemes to concepts
- Sense space
- hang-hanged
- hang-hung
- summon they called them home
- name he is called Joe
- phone she called her mother
- name2 he called her a liar
- describe she called him ugly
- Concept space
- Cause-to-die
- Suspend-body
- Summon
- Name-Describe
- Phone
- How many concepts?
- How do senses relate to concepts?
88Omega after OntoNotes
- Current Omega
- 120,000 concepts; Middle Model mostly WordNet
- Essentially no formally defined features
- Post-OntoNotes Omega
- 60,000 concepts? (the 90% rule)
- Each concept a sense cluster, defined with
features - Each concept linked to many example sentences
- What problems do we face?
- Sense-to-concept compression
- Cross-sense identification
- Multiple languages senses
- etc.
897. Conclusion
90Summary Obtaining semantics
- Ingredients
- small ontologies and metadata sets
- concept families (signatures)
- information from dictionaries, etc.
- additional info from text and the web
Method
1. Into a large database, pour all ingredients
2. Stir together in the right way
3. Bake
Evaluate IR, QA, MT, and so on!
91My recipe for SW research
- Take two large portions of KR
- one of ontology work,
- one of reasoning
- Add a big slice of databases
- for all the non-text collections,
- and 1 1/2 slices of NL
- for the text collections, to insert the
semantics. - Mix with a medium pinch of Correctness /
Authority / Recency validation, - and add a large helping of Interfaces
- to make the results presentable.
- Combine, using creativity and good methodology,
- (taste frequently to evaluate!)
- and deliver to everyone.
92Extending your ontology
No ontology is ever static; we need to develop methods to handle change
- congratulations to the people of Montenegro!
93Thank you!