Title: Toward Large-Scale Shallow Semantics for Higher-Quality NLP

1. Toward Large-Scale Shallow Semantics for Higher-Quality NLP
What is this Semantics in the Semantic Web, and How Can You Get It?
- Eduard Hovy
- Information Sciences Institute
- University of Southern California
- www.isi.edu/natural-language
2. The Knowledge Base of the World
- We live in the infosphere
- but it's unstructured, inconsistent, often outdated; in other words, a mess!
3. Frank's two Semantic Webs
- 1. The Semantic Web as data definer
  - Applies to circumscribed, structured data types: numbers, lists, tables, inventories, picture annotations
  - Suitable for constrained, context-free semantics
  - Amenable to OWL, etc.: closed vocabularies and controllable relations
- 2. The Semantic Web as text enhancer
  - Applies to open-ended, unstructured information
  - Requires open-ended, context-sensitive semantics
  - Requires... what exactly? Where to find it?
4. Where's the semantics?
- It's in the words: insert standardized symbols for each (content?) word
  - Need symbols, vocabularies, ontologies
- It's in the links: create a standardized set of links and use (only?) them
  - Need links, operational semantics, link interpreters
- It will somehow emerge, by magic, if we just do enough stuff with OWL and RDF
  - Need formalisms, definitions, operational semantics, notation interpreters
5. NO to controlled vocabulary, says IR!
- 1960s: Cleverdon and the Cranfield aeronautics evaluations of text retrieval engines (Cleverdon 67)
  - Tested algorithms and lists of controlled vocabularies, also all words
  - SURPRISE: all words better than controlled vocabs!
  - which led to Salton's vector space approach to IR
  - which led to today's web search engines
- The IR position: forget ontologies and controlled lists; the semantics lies in multi-word combinations!
  - There's no benefit in artificial or controlled languages
  - Multi-word combinations ("kitchen knife") are good enough
  - Build language models: frequency distributions of words in corpus/doc (Callan et al. 99; Ponte and Croft 98)
Nonetheless, for Semantic Web uses, we need semantics. But WHAT is it? And how do we obtain it?
6. Toward semantics: Layers of interpretation 1

7. Layers of interpretation 2
[Slide diagram: layers labeled surface, POS, syntax over the example sentence]
- POS: PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
- Surface: Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
8. Layers of interpretation 3
[Slide diagram: a shallow-semantics layer added above the syntax, POS, and surface layers]
- Shallow semantics:
  - P0: [act: announce1, agent: P1(Sheikh Mohammed), theme: P9, time: present]
  - P9: [act: want3, agent: P6(we), theme: P10]
  - P10: [act: make8, theme: P7(Dubai), result: P8(center)]
  - Entities: P1(Sheikh Mohammed), P2(who), P3(Defense Minister), P4(United Arab Emirates), P5(inaug. ceremony), P6(we), P7(Dubai), P8(trading center)
- Coref: links among P1(Sheikh Mohammed), P2(who), P3(Defense Minister), P4(United Arab Emirates), P6(we)
- POS: PN PN PRO AUX ADV DT PN P DT PN V P DT N N PUN PRO V V PN DT AJ N N PUN
- Surface: Sheikh Mohammed, who is also the Defense Minister of the United Arab Emirates, announced at the inauguration ceremony "we want to make Dubai a new trading center"
9. Layers of interpretation 4
[Slide diagram: deep(er) semantics and information structure added above the layers of the previous slide]
- Deep(er) semantics:
  - P0: [act: say-act3, agent: P1(Sheikh), theme: P9, authortime: T1, eventtime: T2 < T1]
  - P9: [state: desire1, experiencer: P1(Sheikh), theme: P10, statetime: T2]
  - P10: [act: change-state, theme: P7(Dubai), old-state: ?, new-state: P11, eventtime: T3 > T2]
  - P11: [state: essence1, experiencer: P7(Dubai), theme: P8(center), statetime: T4 > T3]
- Info structure: topic (theme), rheme, focus
- Shallow semantics, coref, syntax, POS, surface: as on the previous slide, over the same example sentence
10. Shallow and deep semantics
- She sold him the book / He bought the book from her
- He has a headache / He gets a headache
- Though it's not perfect, democracy is the best system

(X1 [act: Sell, agent: She, patient: (X1a [type: Book]), recip: He])
(X2a [act: Transfer, agent: She, patient: (X2c [type: Book]), recip: He]) (X2b [act: Transfer, agent: He, patient: (X2d [type: Money]), recip: She])
(X3a [prop: Headache, patient: He]) (?)
(X4a [type: State, object: (X4c [type: Head, owner: He]), state: -3]) (X4b [type: StateChange, object: X4c, fromstate: 0, tostate: -3])
(X4 [type: Contrast, arg1: (X4a ?), arg2: (X4b ?)])
11. Some semantic phenomena
- Somewhat easier:
  - Bracketing (scope) of predications
  - Word sense selection (incl. copula)
  - NP structure: genitives, modifiers
  - Concepts: ontology definition
  - Concept structure (incl. frames and thematic roles)
  - Coreference (entities and events)
  - Pronoun classification (ref, bound, event, generic, other)
  - Identification of events
  - Temporal relations (incl. discourse and aspect)
  - Manner relations
  - Spatial relations
  - Direct quotation and reported speech
  - Opinions and subjectivity
- More difficult:
  - Quantifier phrases and numerical expressions
  - Comparatives
  - Coordination
  - Information structure (theme/rheme)
  - Focus
  - Discourse structure
  - Other adverbials (epistemic modals, evidentials)
  - Identification of propositions (modality)
  - Pragmatics/speech acts
  - Polarity/negation
  - Presuppositions
  - Metaphors
12. Improving NL applications with semantics
- How to improve accuracy of IR / web search?
  - TREC 98-01: around 40%
  - Understand user query: expand query terms by meaning
- How to achieve conceptual summarization?
  - Never been done yet, at non-toy level
  - Interpret topic, fuse concepts according to meaning, re-generate
- How to improve QA?
  - TREC 99-02: around 65%
  - Understand Q and A: match their meanings; know common info
- How to improve MT quality?
  - MTEval 94: 70%, depending on what you measure
  - Disambiguate word senses to find correct meaning
13. Talk overview
- Introduction: Semantics and the Semantic Web
- Approach: General methodology for building the resources
- Ontology framework: Terminology ontology as start
  - Creating Omega: recent work on connecting ontologies
- Concept level: terms and relations
  - Learning concepts by clustering
  - Learning and using concept associations
- Instance level: instances and more
  - Harvesting instances from text
  - Harvesting relations
- Corpus: manual shallow semantic annotation
  - OntoNotes project
- Conclusion
14. 2. Approach: General methodology for building the resources
15. What's needed?
- A set of semantic symbols: democracy, eat, ...
- For each symbol, some kind of definition, or at least rules for its combination and treatment during notation transformations
- Notational conventions for each phenomenon of meaning: comparatives, time/tense, negation, number, etc.
- A collection of examples, as training data for learning systems to learn to do the work
- A body of world knowledge for use in processing
16. Credo and methodology
- Ontologies (and even concepts) are too complex to build all in one step
- so build them bit by bit, testing each new (kind of) addition empirically
- and develop appropriate learning techniques for each bit, so you can automate the process
- so next time (since there's no ultimate truth) you can build a new one more quickly
17. Large standardized metadata collections
What is an ontology? My def: a collection of terms denoting entities, events, and relationships in the domain, taxonomized and interrelated so as to express the sharing of properties. It's a formalized model of the domain, focusing on the aspects of interest for computation.
- The need is there: everybody's making lists
  - SIC and NAICS and other codes
  - Yahoo!'s topic classification
  - Semantic Web termbanks / ontologies
- But how do you
  - Guarantee the freshness and accuracy of the list?
  - Guarantee its completeness?
  - Ensure commensurate detail in levels of the list?
  - Cross-reference elements of the list?
Need an automated procedure for creating lists / metadata / ontologies
18. Plan: Stepwise accretion of knowledge
- Initial framework
  - Start with existing (terminological) ontologies as pre-metadata
  - Weave them together
- Build metadata/concepts
  - Define/extract concept cores
  - Extract/learn inter-concept relationships
  - Extract/learn definitional and other info
- Build (large) data/instance base
  - Extract instance cores
  - Link into ontology; store in databases
  - Extract more information, guided by parent concept
19. Omega ontology: Content and framework
- Concepts: 120,604 concept/term entries (76 MB)
  - Upper: own Penman Upper Model (ISI; Bateman et al.)
  - Upper: SUMO (Pease et al.); DOLCE (Guarino et al.)
  - Middle: WordNet (Princeton; Miller, Fellbaum)
  - Upper Middle: Mikrokosmos (NMSU; Nirenburg et al.)
  - Middle: 25,000 noun-noun compounds (ISI; Pantel)
- Lexicon / sense space
  - 156,142 English words; 33,822 Spanish words
  - 271,243 word senses
  - 13,000 frames of verb arg structure with case roles
  - LCS case roles (Dorr) 6.3 MB
  - PropBank roleframes (Palmer et al.) 5.3 MB
  - FrameNet roleframes (Fillmore et al.) 2.8 MB
  - WordNet verb frames (Fellbaum) 1.8 MB
- Associated information (not all complete)
  - WordNet subj domains (Magnini, Cavaglia) 1.2 MB
  - Various relations learned from text (ISI; Pantel)
  - TAP domain groupings (Stanford; Guha)
  - SemCor term frequencies 7.5 MB
- Instances: 10.1 GB
  - 1.1 million persons harvested from text
  - 900,000 facts harvested from text
  - 5.7 million locations from USGS and NGA
- Framework (over 28 million statements of concepts, relations, instances)
  - Available in PowerLoom
  - Instances in RDF
  - With database/MySQL
  - Online browser
  - Clustering software
  - Term and ontology alignment software
http://omega.isi.edu
20. Talk overview
21. 3. Framework: Terminology ontology as starting point; semi-automated alignment and merging
(This work with Andrew Philpot, Michael Fleischman, and Jerry Hobbs)
22. Example application: EDC (Hovy et al. 02)

23. Omega (Hovy et al. 03)
- WordNet 1.7 (Princeton): 110,000 nodes
- Our own new work (ISI): 400 nodes
- Mikrokosmos (New Mexico State U): 6,000 nodes
- Penman Upper Model (ISI): 300 nodes
24. General alignment and merging problem
- Goal: find attachment point(s) in ontology for a node/term from somewhere else (ontology, website, metadata schema, etc.)
- It's hard to do manually; very hard to do automatically: the system needs to understand the semantics of the entities to be aligned
- Several sets of algorithms; interesting problems; various algorithms
25. Ontology alignment and merging
- Goal: find attachment point in ontology for a node/term from somewhere else (ontology, website, metadata schema, etc.)
- Procedure
  - 1. For a new term/concept, extract and format name, definition, associated text, local taxonomy cluster, etc.
  - 2. Apply alignment suggestion heuristics (NAME, DEFINITION, HIERARCHY, DISPERSAL match) against the big ontology, to get proposed attachment points with strengths (Hovy 98); test with numerous parameter combinations, see http://edc.isi.edu/alignment/ (Hovy et al. 01)
  - 3. Automatically combine proposals (Fleischman et al. 03)
  - 4. Apply verification checks
  - 5. Bless or reject proposals manually
- Process developed in early 1990s (Agirre et al. 94; Knight & Luk 94; Okumura & Hovy 96; Hovy 98; Hovy et al. 01)
- Not stunningly accurate, but can speed up manual alignment markedly
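A minimal sketch of two of the suggestion heuristics named on this slide (NAME match and DEFINITION match), scored and combined into ranked attachment proposals. The scoring weights and overlap measures here are assumptions for illustration, not the exact Hovy 98 formulation.

```python
# Hypothetical sketch of alignment-suggestion heuristics: score candidate
# attachment points by name similarity plus definition-word overlap.
# Weights (0.6 / 0.4) and the Jaccard measure are illustrative assumptions.

def name_match(term: str, concept: str) -> float:
    """1.0 for an exact name match, 0.5 for substring overlap, else 0."""
    t, c = term.lower(), concept.lower()
    if t == c:
        return 1.0
    if t in c or c in t:
        return 0.5
    return 0.0

def definition_match(term_def: str, concept_def: str) -> float:
    """Jaccard overlap of definition words (a stand-in for DEFINITION match)."""
    a, b = set(term_def.lower().split()), set(concept_def.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_attachments(term: str, term_def: str, ontology: dict) -> list:
    """Rank candidate attachment points by a combined heuristic score."""
    scored = []
    for concept, concept_def in ontology.items():
        score = 0.6 * name_match(term, concept) + 0.4 * definition_match(term_def, concept_def)
        scored.append((concept, round(score, 3)))
    return sorted(scored, key=lambda x: -x[1])
```

The real system combines more heuristics (HIERARCHY, DISPERSAL) and learns the combination (Fleischman et al. 03); this shows only the proposal-with-strengths shape.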
26. Alignment for Omega
- Created Upper Region (400 nodes) manually
- Manually snipped tops off Mikrokosmos and WordNet, then attached them to the fringe of the Upper Region
- Automatically aligned bottom fringe of Mikrokosmos into WordNet
- Automatically aligned sides of "bubbles"
28. A puzzle
- Is Amber Decomposable or Nondecomposable?
- The stone sense of it (Mikrokosmos) is; the resin sense (WordNet) is not
- What to do??
29. Shishkebobs (Hovy et al. 03)
- Library ISA Building (and hence can't buy things)
- Library ISA Institution (and hence can buy things)
- SO: Building + Institution + Location: a Library is all these
- Also: Field-of-Study + Activity + Result-of-Process (Science, Medicine, Architecture, Art)
- Allowing shishkebobs makes merging ontologies easier (possible?): you respect each ontology's perspective
- Continuum from on-the-fly shadings to metonymy (see Guarino's identity conditions; Pustejovsky's qualia)
- We found about 400 shishkebobs
30. http://omega.isi.edu
31. Talk overview
32. 4a. Concept level: Learning terms/concepts by clustering web information
(This work by Patrick Pantel, Marco Pennacchiotti, and Dekang Lin)
33. Where/how to find new concepts/terms?
- Potential sources
  - Existing ontologies (AI efforts, Yahoo!, etc.) and lists (SIC codes, etc.)
  - Manual entry, esp. with reference to foreign-language text (EuroWordNet, IL-Annot, etc.)
  - Dictionaries and thesauri (Webster's, Roget's, etc.)
  - Automated discovery by text clustering (Pantel and Lin, etc.)
- Issues
  - How large do you want it? Tradeoff: size vs. consistency and ease of use
  - How detailed? Tradeoff: granularity/domain-specificity vs. portability and wide acceptance (Semantic Web)
  - How language-independent? Tradeoff: independence vs. utility for non/shallow-semantic NLP applications
34. Clustering By Committee (Pantel and Lin 02)
- CBC clustering procedure
  - Parse entire corpus using MINIPAR (D. Lin)
  - Define syntactic/POS patterns as features: N-N, N-subj-V, Adj-N, etc.
  - Cluster words, using pointwise mutual information on (word e, pattern f) features
  - Disambiguate:
    - find cluster centroids: the word "committee"
    - for non-centroid words, match their pattern features to committee words' features; if match, include the word in the cluster and remove those features
    - if no match, the word has remaining features, so try to include it in other clusters as well: split ambiguous words' senses
- Complexity: O(n²k) for n words in corpus, k features
- Results: no clustering is perfect, but CBC is quite good
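The feature weighting underlying the clustering step can be sketched directly: pointwise mutual information between a word e and a syntactic pattern f, computed from co-occurrence counts. The toy counts below are invented for illustration; CBC computes this over a parsed corpus.

```python
import math
from collections import Counter

# Minimal sketch of the PMI feature weighting CBC uses over
# (word, syntactic-pattern) co-occurrence counts.
# PMI(e, f) = log [ P(e, f) / (P(e) * P(f)) ]

def pmi(pair_counts: Counter, word: str, feat: str) -> float:
    """PMI of a word and a pattern feature from joint co-occurrence counts."""
    total = sum(pair_counts.values())
    p_ef = pair_counts[(word, feat)] / total
    p_e = sum(c for (w, _), c in pair_counts.items() if w == word) / total
    p_f = sum(c for (_, f), c in pair_counts.items() if f == feat) / total
    if p_ef == 0:
        return float("-inf")  # never co-occurred
    return math.log(p_ef / (p_e * p_f))
```

Words are then clustered by the similarity of their PMI-weighted feature vectors; a positive PMI means the word and pattern co-occur more than chance predicts.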
35. www.isi.edu/pantel/
38. From words to concepts
- How to find a name for a cluster?
- Given term instances, search for frequently co-occurring terms, using apposition patterns
  - "the President, Thomas Jefferson, ..."
  - "Kobe Bryant, famous basketball star"
- Extract terms, check if present in ontology
- Examples for "Lincoln":
  - PRESIDENT (N891): 0.187331
  - BORROWER / THRIFT (N724): 0.166958
  - CAR / DIVISION (N257): 0.137333
- Works OK for nouns, less so for others
39. Problems with clustering
- No text-based clustering is ever perfect
- How many concepts are there?
- How are they arranged? (There is no reason to expect that a clustering taxonomy should correspond with an ISA hierarchy!)
- What interrelationships exist between them?
- Clustering is only the start
40. Talk overview
41. 4b. Concept level: Learning and using concept associations
(This work with Chin-Yew Lin, Mike Junk, Michael Fleischman, and Tom Murray)
42. Topic signature
Related words in texts show a Poisson distribution. In a large set of texts, topic keywords concentrate around topics, so families of related words appear in bursts. To find a family, compare topical word frequency distributions against global background counts.
- Word family built around inter-word relations
- Def: head word (or concept), plus a set of related words (or concepts), each with a strength:
  {Tk: (tk1, wk1), (tk2, wk2), ..., (tkn, wkn)}
- Problem: scriptal co-occurrence, etc.; how to find it?
- Approximate by simple textual term co-occurrence...
43. Learning signatures
- 1. Collect texts, sorted by topic (need texts, sorted)
- 2. Identify families of co-occurring words (how to count co-occurrence?)
- 3. Evaluate their purity (how to evaluate?)
- 4. Find the words' concepts in the Ontology (need disambiguator)
- 5. Link together the concept signatures
44. Calculating weights
- tf.idf: w_jk = tf_jk * idf_j
  - tf_jk: count of term j in text k ("waiter": often, but only in some texts)
  - idf_j = log(N / n_j): within-collection frequency ("the": often, in all texts)
  - n_j: number of docs with term j; N: total number of documents
  - tf.idf is the best for IR, among 287 methods (Salton & Buckley, 1988)
- chi-squared (χ²): w_jk = (tf_jk - m_jk)² / m_jk if tf_jk > m_jk, else 0 (Hovy & Lin, 1997)
  - m_jk = (Σ_j tf_jk * Σ_k tf_jk) / Σ_jk tf_jk: expected (mean) count for term j in text k
- likelihood ratio λ: -2 log λ = 2N * I(R; T) (Lin & Hovy, 2000)
  - (more appropriate for sparse data; -2 log λ is asymptotic to χ²)
  - N: total number of terms in corpus
  - I: mutual information between text relevance R and given term T
  - I = H(R) - H(R | T), for H(R) the entropy over relevant texts R and H(R | T) the entropy of term T over relevant and nonrelevant texts
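The first two weighting schemes above are simple enough to state as code. This is a direct transcription of the slide's formulas, with invented toy counts; the likelihood-ratio weight needs relevance-labeled data and is omitted.

```python
import math

# Sketch of the two simpler weighting schemes on this slide, using w_jk for
# the weight of term j in text k. Toy counts are invented for illustration.

def tfidf(tf_jk: int, n_j: int, N: int) -> float:
    """w_jk = tf_jk * idf_j, with idf_j = log(N / n_j)."""
    return tf_jk * math.log(N / n_j)

def chi2_weight(tf_jk: int, m_jk: float) -> float:
    """w_jk = (tf_jk - m_jk)^2 / m_jk if tf_jk > m_jk, else 0."""
    return (tf_jk - m_jk) ** 2 / m_jk if tf_jk > m_jk else 0.0
```

Note the asymmetry of the chi-squared variant: a term occurring *less* often than its expected count m_jk gets weight 0, so only positively bursty terms enter the signature.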
45. Early signature study (Hovy & Lin 97)
- Corpus
  - Training set: WSJ 1987, 16,137 texts (32 topics)
  - Test set: WSJ 1988, 12,906 texts (31 topics)
  - Texts indexed into categories by humans
- Signature data
  - 300 terms each, using tf.idf
  - Word forms: single words, demorphed words, multi-word phrases
- Topic distinctness...
- Topic hierarchy
46. Evaluating signatures
- Solution: perform a text categorization task
  - create N sets of texts, one per topic
  - create N topic signatures TSk
  - for each new document, create a document signature DSi
  - compare DSi against all TSk; assign the document to the best match
- Match function: vector space similarity measure
  - Cosine similarity: cos θ = (TSk · DSi) / (‖TSk‖ ‖DSi‖)
- Test 1 (Hovy & Lin, 1997, 1999)
  - Training set: 10 topics, 3,000 texts (TREC)
  - Contrast set (background): 3,000 texts
  - Conclusion: tf.idf and χ² signatures work OK but depend on signature length
- Test 2 (Lin & Hovy, 2000)
  - 4 topics, 6,194 texts; uni/bi/trigram signatures
  - Evaluated using SUMMARIST: λ > tf.idf
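The categorization test reduces to a few lines: represent each signature as a term-to-weight mapping and assign a document to the topic with the highest cosine similarity. The signature vectors here are invented toys.

```python
import math

# Sketch of the evaluation-by-categorization idea: each topic signature TSk
# and each document signature DSi is a term->weight vector; a document goes
# to the topic whose signature is most cosine-similar.

def cosine(u: dict, v: dict) -> float:
    """cos(u, v) = (u . v) / (|u| |v|) over sparse term->weight dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def categorize(doc_sig: dict, topic_sigs: dict) -> str:
    """Assign the document to the best-matching topic signature."""
    return max(topic_sigs, key=lambda k: cosine(topic_sigs[k], doc_sig))
```

Categorization accuracy then serves as a proxy for signature quality, which is how the two weighting schemes were compared.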
47. Text pollution on the web
Goal: create word families (signatures) for each concept in the Ontology. Get texts from the Web. Main problem: text pollution. What's the search term?
Purifying: in later work, used Latent Semantic Analysis
48. Purifying with Latent Semantic Analysis
- Technique used by psychologists to determine basic cognitive conceptual primitives (Deerwester et al., 1990; Landauer et al., 1998)
- Singular Value Decomposition (SVD), used for text categorization, lexical priming, language learning
- LSA automatically creates collections of items that are correlated or anti-correlated, with strengths
  - ice cream, drowning, sandals → summer
- Each such collection is a semantic primitive in terms of which objects in the world are understood
- We tried LSA to find the most reliable signatures in a collection and reduce the number of signatures in the contrast set
49. LSA for signatures
- Create matrix A, one signature per column (words × topics)
- Apply SVDPAC to compute the decomposition A = U Σ Vᵀ
  - U: m × n orthonormal matrix of left singular vectors that span the space
  - Vᵀ: n × n orthonormal matrix of right singular vectors
  - Σ: diagonal matrix with exactly rank(A) nonzero singular values σ1 > σ2 > ... > σn
- Use only the first k of the new concepts: Σ′ = diag(σ1, σ2, ..., σk)
- Create matrix A′ out of these k vectors: A′ = U Σ′ Vᵀ ≈ A
- A′ is a new (words × topics) matrix, with different weights and "new" topics; each column is a purified signature
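The purification step is a standard truncated-SVD reconstruction, so it can be sketched with NumPy (assumed available here; the slide's SVDPAC is an older SVD package). The matrix below is random illustration data, not real signatures.

```python
import numpy as np

# Sketch of the LSA purification step: SVD of the (words x topics) signature
# matrix A, keep the top-k singular values, rebuild a rank-k A'.
# Each column of A' is then read off as a "purified" signature.

def purify(A: np.ndarray, k: int) -> np.ndarray:
    """Return A' = U_k Sigma_k V_k^T, the rank-k reconstruction of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Keeping only the k strongest singular directions suppresses weight patterns that are idiosyncratic to single topics, which is exactly the "pollution" the slide is trying to remove.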
50. Some results with LSA (Hovy and Junk 99)
- Contrast set (for idf and χ²): set of documents on a very different topic, for good idf
- Partitions: collect documents within each topic set into partitions, for faster processing; /n is a collecting parameter
- U function: function for creation of the LSA matrix
- Results (TREC texts)
  - Demorphing helps
  - χ² better than tf and tf.idf
  - LSA improves results, but not dramatically
51. Weak semantics: Signature for every concept
- Procedure
  - 1. Create query from Ontology concept (word + defn. words)
  - 2. Retrieve 5,000 documents (8 web search engines)
  - 3. Purify results (remove duplicates, HTML, etc.)
  - 4. Extract word family (using tf.idf, χ², LSA, etc.)
  - 5. Purify
  - 6. Compare to siblings and parents in the Ontology
- Problem: raw signatures overlap
  - average parent-child node overlap: 50%
  - Bakery-Edifice 35%: too far; missing generalization
  - Airplane-Aircraft 80%: too close?
- Remaining problem: web signatures still not pure...
- WordNet: in 2002-04, Agirre and students (U of the Basque Country) built signatures for all WordNet nouns
52. Recent work using signatures
- Multi-document summarization (Lin and Hovy, 2002)
  - Create λ signature for each set of texts
  - Create IR query from signature terms; use IR to extract sentences
  - (Then filter and reorder sentences into a single summary)
  - Performance: DUC-01 tied first; DUC-02 tied second place
- Wordsense disambiguation (Agirre, Ansa, Martinez, Hovy, 2001)
  - Try to use WordNet concepts to collect text sets for signature creation (word+synonym > def-words > word AND synonym NEAR def-word > etc.)
  - Built competing signatures for various noun senses: (a) WordNet synonyms; (b) SemCor tagged corpus (χ²); (c) web texts (χ²); (d) WSJ texts (χ²)
  - Performance: Web signatures > random, WordNet baseline
- Email clustering (Murray and Hovy, 2004)
  - Social Network Analysis: cluster emails and create signatures
  - Infer personal expertise, project structure, experts omitted, etc.
  - Corpora: ENRON (240K emails), ISI corpus, NSF eRulemaking corpus
53. Semantics from signatures
- Assuming we can create signatures and use them in some applications:
- How to integrate signatures into an ontology?
- How to employ signatures in inheritance, classification, inference, and other operations?
- How to compose signatures into new concepts?
- How to match signatures across languages?
- How do signatures change?
54. Talk overview
55. 5a. Instance level: Harvesting instances from text
(This work with Michael Fleischman)
56. What kinds of knowledge?
- Goal 1: Add instantial knowledge (classify instances under types)
  - Sofia is a city
  - Sofia is a woman's name
  - Cleopatra was a queen
  - Everest is a mountain
  - Varig is an airline company
- Goal 2: Add definitional / descriptive knowledge (create links between concepts)
  - Mozart was born in 1756
  - Bell invented the telephone
  - Pisa is in Italy
  - The Leaning Tower is in Pisa
  - Columbus discovered America
- Uses
  - QA (answer suggestion and validation)
  - Wordsense disambiguation for MT
- Sources
  - Existing lists (CIA factbook, atlases, phone books)
  - Dictionaries and encyclopedias
  - The Web
57. Learning about locations (Fleischman 01)
- Challenge: e.g., region, state/territory, or city?
  - "The company, which is based in Dpiyj Dsm Gtsmdodvp, Vsaog., said an antibody prevented development of paralysis."
  - "The situation has been strained ever since Yplup began waging war in Dpiyj Rsdy Sdos."
  - "The pulp and paper operations moved to Dpiyj Vstpaomos in 1981."
  - (South San Francisco = region, Calif. = state, Tokyo = city, South East Asia = region, South Carolina = state)
- Try to learn instances of 8 types (country, region, territory, city, street, artifact, mountain, water)
  - (we have lists of these already, so finding sentences for training data is easy)
- Uses
  - QA: corroborating evidence for an answer
  - IR: query expansion and signature enrichment
58. Learning procedure
- Approach
  - Training: for each location, identify features in context; try to learn features that indicate each type
  - Usage: for new material, use the learned features to classify the type of location; place results with high confidence into the ontology
- Training
  - Applied BBN's IdentiFinder to bracket locations
  - Chose 88 features (unigrams, bigrams, trigrams in fixed positions before and after the location instance; later added signatures, etc.)
  - 3 approaches: Bayesian classifier, neural net, decision tree (C4.5)
  - MemRun procedure: store examples if good (> THRESH1) and prefer stored info later if unsure (< THRESH2)
59. MemRun
- Initial results
  - Bayesian classifier not very accurate; neural net OK
  - D-tree better, but still multiple classes for each instance
  - MemRun: record the best example of each instance
- Algorithm with MemRun
  - Pass 1: for each text,
    - preprocess with POS tagger and IdentiFinder
    - apply D-tree to classify instance
    - if score > THRESH1, save (instance, tag, score) in MemRun
  - Pass 2: for each text,
    - again apply D-tree
    - if score < THRESH2, replace tag by MemRun value
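The two-pass MemRun idea is small enough to sketch directly. The decision-tree classifier is abstracted away into precomputed (instance, tag, score) triples, and the threshold values are placeholders, not the ones used in the experiments.

```python
# Sketch of the two-pass MemRun procedure on this slide: pass 1 memorizes
# high-confidence (instance, tag) pairs; pass 2 overrides low-confidence
# classifications with the memorized tag. Thresholds are illustrative.

THRESH1, THRESH2 = 0.9, 0.5

def memrun(classified, thresh1=THRESH1, thresh2=THRESH2):
    """classified: list of (instance, tag, score) triples from the D-tree."""
    memory = {}
    # Pass 1: remember the best high-confidence tag per instance.
    for inst, tag, score in classified:
        if score > thresh1 and score > memory.get(inst, ("", 0.0))[1]:
            memory[inst] = (tag, score)
    # Pass 2: replace uncertain tags with the remembered one, if any.
    out = []
    for inst, tag, score in classified:
        if score < thresh2 and inst in memory:
            tag = memory[inst][0]
        out.append((inst, tag))
    return out
```

The effect is that one confident sighting of a location name propagates its type to later, ambiguous sightings of the same name.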
60. Examples
- Water: Abuna River, Adriatic, Adriatic Sea, Adriatic sea, Aegean Sea, Aguapey river, Aguaray River, Akhtuba River, Akpa Yafe River, Akrotiri salt lake, Aksu River, Alma-Atinka River, Almendares River, Alto Maranon River, Amazon River, Amur river, Andaman Sea, Angara River, Angrapa river, Anna River, Arabian Gulf, Arabian Sea, ...
- City: Aachen, Abadan, Abassi Madani, Abbassi Madani, Abbreviations AZ, Abdullojonov, Aberdeen, Abidjan, Abidjan Radio Cote d'Ivoire Chaine, Abiko, Abrahamite, Abramenkov, Abu Dhabi, Abuja, Abyssinia, Acari, Accom, Accordance, ...
- Territory: General Robles, Ghanaians, Gilan Province, Gilan Province Sha'ban, Gitega Province, Glencore, Goias State, Goias State of Brazil, Gongola State, Granma Province, Great Brotherly Russia, Greytown, Guanacaste Province, Guandong province, Guangdong Province, Guangxi Province, Guangzhou Shipyards, Guantanamo Province, Guayas Province, Guerrero State, Guiliano Amato, Guizhou Province, Gwent, ...
- Mountain: Wicklow Mountains, Wudang Mountain, Wudangshan Mountain, Wuling mountains, Wuyi Mountains, Xiao Hinggan Mountains, Yimeng Mountains, Zamborak mountain, al-Marakishah mountain, al-Maraqishah mountains, al-Nubah mountains, al-Qantal mountain
61. Results for locations
NB: test samples are small
THRESH1: 77; THRESH2: 98
62. People (Fleischman and Hovy 02)
- Goal: collecting training data about 8 types of people: politicians, entertainers (movie stars, etc.), athletes, businesspeople...
- Procedure as before, with added features using the signature of each category and WordNet hypernyms.

athlete: 458.029 398.626 perez 392.904 rogers 368.441 carlos 351.686 points 333.083 roy 311.042 andres 284.927 scored 273.197 chris 252.172 hardee's 239.747 george 223.879 games 222.202 mark 217.711 mike ...

cleric: 1133.793 rabbi 1074.785 cardinal 1011.190 paul 809.128 archbishop 798.372 john 748.170 bishop 714.173 catholic 688.291 church 613.625 roman 610.287 tutu 584.720 desmond 460.057 pope 309.923 kahane 300.236 meir ...

entertainer: 1902.178 " 1573.695 actor 1083.622 actress 721.929 movie 618.947 george 607.466 film 553.659 singer 541.235 president 536.962 her 536.856 keating 528.226 star 448.524 ( 433.065 ) 404.008 said ...

businessperson: 4428.267 greenspan 3999.135 alan 2774.682 reserve 2429.129 chairman 1786.783 federal 1709.120 icahn 1665.358 fed 1252.701 carl 827.0291 board 682.420 rates 662.510 investor 651.529 twa 531.907 kerkorian 522.072 interest ...
63. Some results for people
- Total count: 1030; Total correct: 839 (0.815); Total incorrect: 191 (0.185)
- misc: 0/20 (0.0); lawyer: 13/44 (0.295); police: 11/17 (0.647); doctor: 48/50 (0.96); entertainer: 150/173 (0.867); athlete: 11/13 (0.846); business: 120/166 (0.722); military: 14/21 (0.666); clergy: 11/11 (1.0); politician: 461/515 (0.895)
- Best results using signatures and WordNet hypernyms (but no synset expansion)
- Problems
  - Training and test data skewed
  - Genuine ambiguity: often, politician = military leader
64. Instance extraction (Fleischman & Hovy 03)
- Goal: extract all instances from the web
- Method
  - Download text from web (15 GB)
  - Identify named entities (BBN's IdentiFinder (Bikel et al. 93))
  - Extract ones with descriptive phrases (<APOS>, <CN/PN>)
    - ("the vacuum manufacturer Horeck" / "Saddam's physician Abdul")
  - Cluster them, and categorize in ontology
- Result: over 900,000 instances
  - Average 2 mentions per instance; 40 for George W. Bush
- Evaluation
  - Tested with 200 "who is X?" questions
  - Better than TextMap: 25% more
  - Faster: 10 sec vs. 9 hr!
65. Talk overview
66. 5b. Instance level: Harvesting relations
(This work with Deepak Ravichandran, Donghui Feng, and Patrick Pantel)
67. Shallow patterns for information
- Goal: learn relationship data from the web (when was someone born? where does he live?)
- Procedure: automatically learn word-level patterns
  - When was Mozart born?
  - Mozart (1756-1791)
  - <NAME> ( <BIRTHYEAR> - <DEATHYEAR> )
- Apply patterns to Omega concepts/instances
- Evaluation: test in the TREC QA competition
- Main problem: learning the patterns
  - (In TREC QA 2001, Soubbotin and Soubbotin got a very high score with over 10,000 patterns built by hand)
68. Learning extraction patterns from the web (Ravichandran and Hovy 02)
- Prepare
  - Select an example for the target relation: Q term (Mozart) and A term (1756)
- Collect data
  - Submit Q and A terms as queries to a search engine (AltaVista)
  - Download the top 1000 web documents
- Preprocess
  - Apply a sentence breaker to the documents
  - Retain only sentences with both Q and A terms
  - Pass retained sentences through a suffix tree constructor
- Select and create patterns
  - Filter each phrase in the suffix tree to retain only those phrases that contain both Q and A terms
  - Replace the Q term by the tag <NAME> and the A term by <ANSWER>
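The core abstraction step can be sketched without the suffix tree: take sentences containing both the Q and A terms, substitute the <NAME>/<ANSWER> tags, and count the resulting surface patterns. The real system generalizes substrings via the suffix tree over many downloaded documents; the sentences below are invented.

```python
from collections import Counter

# Sketch of the pattern-induction idea: abstract each sentence containing
# both the question term and the answer term into a surface pattern by
# substituting <NAME> and <ANSWER>, then count pattern frequencies.

def learn_patterns(sentences, q_term, a_term):
    """Return a Counter of abstracted surface patterns."""
    patterns = Counter()
    for s in sentences:
        if q_term in s and a_term in s:
            p = s.replace(q_term, "<NAME>").replace(a_term, "<ANSWER>")
            patterns[p] += 1
    return patterns
```

In the full procedure, each pattern's precision is then estimated by applying it to held-out documents for the same question term, which yields the strengths shown on the next slide.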
69. Some results
- BIRTHYEAR
  - 1.0  <NAME> ( <ANS>
  - 0.85 <NAME> was born on <ANS>
  - 0.6  <NAME> was born in <ANS>
  - ...
- DEFINITION
  - 1.0 <NAME> and related <ANS>s
  - 1.0 <ANS> ( <NAME> ,
  - 0.9 as <NAME> , <ANS> and
  - ...
- LOCATION
  - 1.0 <ANS>s <NAME> .
  - 1.0 regional <ANS> <NAME>
  - 0.9 the <NAME> in <ANS> ,
- Testing (TREC-10 questions)

  Question type | Num Qs | TREC MRR | Web MRR
  BIRTHYEAR     |      8 |    0.479 |   0.688
  INVENTOR      |      6 |    0.167 |   0.583
  DISCOVERER    |      4 |    0.125 |   0.875
  DEFINITION    |    102 |    0.345 |   0.386
  WHY-FAMOUS    |      3 |    0.667 |   0.0
  LOCATION      |     16 |    0.75  |   0.864
70. Regular expressions (Ravichandran et al. 2004)
- New process: learn regular expression patterns
- Results: over 2 million instances from a 15 GB corpus
- Complexity: O(y²), for max string length y
- Later work: downloaded and cleaned 1 TB of text from the web; created a 119 MB corpus used for additional learning of N-N compounds
71Comparing clustering and surface patterns
- Precision took 50 random words, each with the system's learned superconcepts (top 3 of system); added top 3 from WordNet, 1 human superconcept. Used 2 judges (Kappa 0.78-0.85)
- Recall Relative Recall = Recall_Patt / Recall_Co-occ = C_P / C_C
- TREC-03 defns Patt up to 52% MRR; Co-Occ up to 44% MRR
- Measures reported MRR, Precision (correct+partial), Relative Recall

             Pattern System           Co-occurrence System
  Training   Prec   Top-3   MRR      Prec   Top-3   MRR
  1.5MB      56.6   60.0    60.0     12.4   20.0    15.2
  15MB       57.3   63.0    61.0     23.2   50.0    37.3
  150MB      50.7   56.0    55.0     60.6   78.0    73.2
  1.5GB      52.6   51.0    51.0     69.7   93.0    85.8
  15GB       61.8   69.0    67.5     78.7   92.0    86.2
  150GB      67.8   67.0    65.0     (too large to process)
(Ravichandran and Pantel 2004)
72Relation extraction from a small corpus
- The challenge apply RegExp pattern induction to
a small corpus (Chemistry textbook) (Pantel and
Pennacchiotti 06)
73Espresso procedure (Pantel and Pennacchiotti 06)
- Phase 1 Pattern Extraction, like Ravichandran, using MI
- Measure reliability based on an approximation of pattern recall
- Phase 2 Instance Extraction
- Instantiate all patterns to extract all possible instances
- Identify generic patterns using Google redundancy check with previously accepted patterns
- Measure reliability of each instance
- Select top-K instances
- Phase 3 Instance Expansion (if too few instances extracted in Phase 2)
- Syntactic drop nominal mods: proton is-a small particle → proton is-a particle
- WordNet expand using hypernyms: hydrogen is-a element → nitrogen is-a element
- Web apply patterns to the Web to extract additional instances
- Phase 4 Axiomatization (transform relations into axioms in HNF form)
- e.g., R is-a S becomes R(x) ⇒ S(x)
- e.g., R part-of S becomes (∀x)R(x) ⇒ (∃y)S(y) ∧ part-of(x,y)
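Espresso's reliability idea (score a pattern by its average PMI with already-reliable instances, normalized by the maximum PMI) can be sketched as below. This is a simplified toy version: the co-occurrence counts, the PMI estimator, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import math
from collections import defaultdict

def pmi(count_ip, count_i, count_p, total):
    """Pointwise mutual information between an instance and a pattern."""
    return math.log((count_ip * total) / (count_i * count_p))

def pattern_reliability(cooc, inst_rel):
    """Espresso-style pattern reliability: average PMI with known-reliable
    instances, normalized by the maximum observed PMI (a sketch)."""
    total = sum(sum(p.values()) for p in cooc.values())
    inst_tot = {i: sum(cooc[i].values()) for i in cooc}
    pat_tot = defaultdict(int)
    for i in cooc:
        for p, c in cooc[i].items():
            pat_tot[p] += c
    pmis = {(i, p): pmi(c, inst_tot[i], pat_tot[p], total)
            for i in cooc for p, c in cooc[i].items() if c > 0}
    max_pmi = max(pmis.values()) or 1.0  # avoid dividing by zero
    rel = defaultdict(float)
    for (i, p), v in pmis.items():
        rel[p] += (v / max_pmi) * inst_rel.get(i, 0.0)
    return {p: r / len(inst_rel) for p, r in rel.items()}

cooc = {("proton", "particle"): {"X is a Y": 4, "X and Y": 1},
        ("dog", "animal"): {"X is a Y": 3}}
print(pattern_reliability(cooc, {("proton", "particle"): 1.0,
                                 ("dog", "animal"): 1.0}))
```

In the full algorithm the same mutual recursion also scores instances by their reliable patterns, which is what lets Espresso bootstrap from a handful of seeds.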
74IE by pattern (Feng, Ravichandran, Hovy 2005)
- Why not Gorbachev? gender
- Why not Mrs. Roosevelt? period
- Why not Maggie Thatcher? home?
- Which semantics to check?
75Talk overview
- Introduction Semantics and the Semantic Web
- Approach General methodology for building the
resources - Ontology framework Terminology ontology as start
- Creating Omega recent work on connecting
ontologies - Concept level terms and relations
- Learning concepts by clustering
- Learning and using concept associations
- Instance level instances and more
- Harvesting instances from text
- Harvesting relations
- Corpus manual shallow semantic annotation
- OntoNotes project
- Conclusion
766. OntoNotes Creating a Semantic Corpus by
Manual Annotation
(This work with Ralph Weischedel (BBN), Martha
Palmer (U Colorado), Mitch Marcus (UPenn), and
various colleagues)
77Corpus creation by annotation
- Goal create corpus of (sentence, semantic rep) pairs
- Use enable machine learning algorithms to do this
- Process humans add information into sentences (and their parses)
- Recent projects
- Penn Treebank (Marcus et al. 99) syntax
- PropBank (Palmer et al. 03) verb frames
- NomBank (Myers et al. 03) noun frames
- FrameNet (Fillmore et al. 04) verb and noun frames
- TIGER/SALSA Bank (Pinkal et al. 04) verb frames
- Prague Dependency Treebank (Hajic et al. 02)
- Interlingua Annotation (Dorr et al. 04)
- I-CAB, Greek banks
- OntoNotes (Weischedel et al. 05) word senses, coref links, ontology
78Antecedents
[Figure: resources feeding into OntoNotes: Treebank, PropBank, NomBank, FrameNet, VerbNet, WordNet, and their Chinese and Arabic counterparts, plus Salsa-German and Prague-Czech, combined via sense tags, coreference, and ontology links]
79OntoNotes large-scale annotation
- Partners BBN (Weischedel), U of Colorado (Palmer), U of Penn (Marcus), ISI (Hovy)
- Goal in 4 years, annotate nouns, verbs, and corefs in 1 million words of English, Chinese, and Arabic text
- Manually provide semantic symbols for nouns, verbs, adjs, advs
- Manually connect sentence structure in verb and noun frames
- Manually link anaphoric references
- Validation inter-annotator agreement of 90%
- Outcomes (2004)
- PropBank verb annotation procedure developed
- Pilot corpus built, with coref annotation
- New project started October 2005 (English; Chinese and Arabic in 2006)
- Potential for the near future a semantics bank
- May energize lots of research on semantic analysis, reps, etc.
- May enable semantics-based IR, QA, MT, etc.
80OntoNotes representation of literal meaning
The founder of Pakistan's nuclear department Abdul Qadeer Khan has admitted he transferred nuclear technology to Iran, Libya, and North Korea

P1 type Person3   name Abdul Qadeer Khan
P2 type Person3   gender male
P3 type Know-How4
P4 type Nation2   name Iran
P5 type Nation2   name Libya
P6 type Nation2   name N. Korea
X0 act Admit1     speaker P1   saying X2
X1 act Transfer2  agent P2   patient P3   dest (P4 P5 P6)
coref P1 P2
(slide credit to M. Marcus and R. Weischedel, 2004)
81Even so Many words untouched!
- WSJ1428
- OPEC's ability to produce more petroleum than it
can sell is beginning to cast a shadow over world
oil markets. Output from the Organization of
Petroleum Exporting Countries is already at a
high for the year and most member nations are
running flat out. But industry and OPEC
officials agree that a handful of members still
have enough unused capacity to glut the market
and cause an oil-price collapse a few months from
now if OPEC doesn't soon adopt a new quota system
to corral its chronic cheaters. As a result, the
effort by some oil ministers to get OPEC to
approve a new permanent production-sharing
agreement next month is taking on increasing
urgency. The organization is scheduled to meet
in Vienna beginning Nov. 25. So far this year,
rising demand for OPEC oil and production
restraint by some members have kept prices firm
despite rampant cheating by others. But that
could change if demand for OPEC's oil softens
seasonally early next year as some think may
happen. OPEC is currently producing more than 22
million barrels a day, sharply above its nominal,
self-imposed fourth-quarter ceiling of 20.5
million, according to OPEC and industry officials
at an oil conference here sponsored by the Oil
Daily and the International Herald Tribune. At
that rate, a majority of OPEC's 13 members have
reached their output limits, they said.
82OntoNotes annotation The 90% Solution
- 1. Sense creation
- Expert creates meaning options (shallow semantic senses) for verbs, nouns, adjs, advs; follows PropBank (Palmer et al.)
- At same time, creates concepts and organizes/refines Omega ontology content and structure
- 2. Sense annotation process goes by word, across docs. Process developed in PropBank. Annotators manually
- See each sentence in corpus containing the current word (noun, verb, adjective, adverb) to annotate
- Select appropriate senses (= ontology concepts) for each one
- Connect frame structure (for each verb and relational noun)
- 3. Coref annotation process goes by doc. Annotators
- Connect co-references within each doc
- Constant validation require 90% inter-annotator agreement
83Sense annotation procedure
- Sense creator first creates senses for a word
- Loop 1
- Manager selects next nouns from sensed list and assigns annotators
- Programmer randomly selects 50 sentences and creates initial Task File
- Annotators (at least 2) do the first 50
- Manager checks their performance
- If 90% agreement and few or no NoneOfAbove, send on to Loop 2
- Else Adjudicator and Manager identify reasons, send back to Sense creator to fix senses and defs
- Loop 2
- Annotators (at least 2) annotate all the remaining sentences
- Manager checks their performance
- If 90% agreement and few or no NoneOfAbove, send to Adjudicator to fix the rest
- Else Adjudicator annotates differences
- If Adjudicator agrees with one Annotator 90%, then ignore the other Annotator's work (assume a bad day for the other); else if Adjudicator agrees with both about equally often, then assume bad senses and send the problematic ones back to Sense creator
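A minimal sketch of the Manager's routing decision in this loop, assuming a flat list of sense labels per annotator and an illustrative `"NOA"` (NoneOfAbove) label; the threshold follows the 90% rule, and `route` and the cutoff of two NOA tags are assumptions made for this example:

```python
def agreement(a1, a2):
    """Fraction of tokens on which two annotators chose the same sense."""
    assert len(a1) == len(a2)
    return sum(x == y for x, y in zip(a1, a2)) / len(a1)

def route(a1, a2, threshold=0.90):
    """Decide the next workflow step after a 50-sentence pilot round."""
    score = agreement(a1, a2)
    none_of_above = sum(x == "NOA" for x in a1 + a2)
    if score >= threshold and none_of_above <= 2:
        return "proceed"       # senses are usable: annotate the rest
    return "revise-senses"     # send back to the Sense creator

ann1 = ["s1"] * 45 + ["s2"] * 5
ann2 = ["s1"] * 47 + ["s2"] * 3
print(route(ann1, ann2))  # → proceed
```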
84Pre-OntoNotes test can it be done?
- Annotation process and tools developed and tested in PropBank (Palmer et al., U Colorado)
- Typical results (10 words of each type, 100 sentences each)

         tagger agreement        senses                time (min/100 tokens)
         Round1 → Round2 → Round3
  verbs  .76 → .86 → .91         4.5 → 5.2 → 3.8       30 → 25 → 25
  nouns  .71 → .85 → .95         7.3 → 5.1 → 3.3       28 → 20 → 15
  adjs   .87 →  —  → .90         2.8 →  —  → 5.5       24 →  —  → 18
(by comparison, agreement using WordNet senses is about 70%)
85Creating the senses
- Use 90% rule to limit degree of delicacy
- See if annotators can agree
- Perform manual insertion
- After manual creation, get annotator feedback
- Should you create the sense? How many must there be?
- Is the term definition adequate?
- Where should the term go relative to the other terms? (species)
- What is unique/different about this term? (differentia/ae)
- How to do this systematically? Developed method of graduated refinement using creation of sense treelets with differentiae
86Noun and verb sense creation
- Performed by Ann Houston in Boston (who also does verb sense creation)
- Sense groupings created
- 4 nouns per day sense-created
- Max so far head, with 15 senses
- Verb procedure creates senses by grouping WordNet senses (PropBank)
- Noun procedure taxonomizes senses into treelets, with differentiae at each level, for insertion into ontology
<inventory lemma="price-n">
  <sense n="1" type="" name="cost or monetary value of goods or services" group="1">
    <diff> quantity monetary_value </diff>
    <comment> PRICE of NP -> NP's good/service; PRICE = exchange_value </comment>
    <examples>
      The price of gasoline has soared lately.
      I don't know the prices of these two fur coats.
      The museum would not sell its Dutch Masters collection for any price.
      The cattle thief has a price on his head in Maine.
      They say that every politician has a price.
    </examples>
    <mappings> <wn version="2.1">1,2,4,5,6</wn> <omega> </omega> </mappings>
  </sense>
  <sense n="2" type="" name="sacrifice required to achieve something" group="1">
    <diff> activity complex effort </diff>
    <comment> PRICE = effort; PREP(of/for)/SCOMP = NP goal/result </comment>
    <examples>
      John has paid a high price for his risky life style.
    </examples>
  </sense>
</inventory>

PRICE
  abstract > quantity > monetary_value (group 1)
  physical > activity > complex (not a single event or action) > effort (group 2)
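Assuming such sense inventories are stored as well-formed XML (an `<inventory>` element with `<sense>` children carrying `<mappings>`, as on this slide), they can be loaded mechanically with the standard library. The trimmed XML string and `load_senses` here are a hypothetical reconstruction for illustration:

```python
import xml.etree.ElementTree as ET

XML = """<inventory lemma="price-n">
  <sense n="1" name="cost or monetary value of goods or services" group="1">
    <mappings><wn version="2.1">1,2,4,5,6</wn></mappings>
  </sense>
  <sense n="2" name="sacrifice required to achieve something" group="1"/>
</inventory>"""

def load_senses(xml_text):
    """Map each sense number to its gloss and its WordNet sense mapping."""
    root = ET.fromstring(xml_text)
    senses = {}
    for s in root.findall("sense"):
        wn = s.find("mappings/wn")  # absent for senses with no mapping yet
        senses[s.get("n")] = {
            "name": s.get("name"),
            "wordnet": wn.text.split(",") if wn is not None and wn.text else [],
        }
    return senses

print(load_senses(XML)["1"]["wordnet"])  # → ['1', '2', '4', '5', '6']
```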
87Word senses from lexemes to concepts
- Sense space
- hang-hanged
- hang-hung
- summon they called them home
- name he is called Joe
- phone she called her mother
- name2 he called her a liar
- describe she called him ugly
- Concept space
- Cause-to-die
- Suspend-body
- Summon
- Name-Describe
- Phone
- How many concepts?
- How do senses relate to concepts?
88Omega after OntoNotes
- Current Omega
- 120,000 concepts; Middle Model mostly WordNet
- Essentially no formally defined features
- Post-OntoNotes Omega
- 60,000 concepts? (the 90% rule)
- Each concept a sense cluster, defined with
features - Each concept linked to many example sentences
- What problems do we face?
- Sense-to-concept compression
- Cross-sense identification
- Multiple languages senses
- etc.
897. Conclusion
90Summary Obtaining semantics
- Ingredients
- small ontologies and metadata sets
- concept families (signatures)
- information from dictionaries, etc.
- additional info from text and the web
Method
1. Into a large database, pour all ingredients
2. Stir together in the right way
3. Bake
Evaluate IR, QA, MT, and so on!
91My recipe for SW research
- Take two large portions of KR
- one of ontology work,
- one of reasoning
- Add a big slice of databases
- for all the non-text collections,
- and 1 1/2 slices of NL
- for the text collections, to insert the
semantics. - Mix with a medium pinch of Correctness /
Authority / Recency validation, - and add a large helping of Interfaces
- to make the results presentable.
- Combine, using creativity and good methodology,
- (taste frequently to evaluate!)
- and deliver to everyone.
92Extending your ontology
No ontology is ever static; we need to develop methods to handle change
- congratulations to the people of Montenegro!
93Thank you!