Title: Feature Vector Quality and Distributional Similarity
1. Textual Entailment: A Framework for Applied Semantics
2. Outline: A Vision
- Why do we need Text Understanding?
- Capture understanding by Textual Entailment
- Does one text entail another?
- Major challenge: knowledge acquisition
- Initial applications
- Looking 5 years ahead
3. Text Understanding
- Vision for improving information access
- Common search engines
- Still text processing, mostly matches query keywords
- Deeper understanding
- Consider the meanings of words and the relationships between them
- Relevant for applications
- Question answering, information extraction, semantic search, summarization
6. Towards text understanding: Question Answering
8. Information Extraction (IE)
- Identify information of pre-determined structure
- Automatic filling of forms
- Example: extract product information
9. Search may benefit from understanding
- Query: AIDS treatment
- Irrelevant document:
- Hemophiliacs lack a protein, called factor VIII, that is essential for making blood clots. As a result, they frequently suffer internal bleeding and must receive infusions of clotting protein derived from human blood. During the early 1980s, these treatments were often tainted with the AIDS virus. In 1984, after that was discovered, manufacturers began heating factor VIII to kill the virus. The strategy greatly reduced the problem but was not foolproof. However, many experts believe that adding detergents and other refinements to the purification process has made natural factor VIII virtually free of AIDS.
- (AP890118-0146, TIPSTER Vol. 1)
- Many irrelevant documents mention AIDS and
treatments for other diseases
10. Relevant Document
- Query: AIDS treatment
- Federal health officials are recommending aggressive use of a newly approved drug that protects people infected with the AIDS virus against a form of pneumonia that is the No. 1 killer of AIDS victims. The Food and Drug Administration approved the drug, aerosol pentamidine, on Thursday. The announcement came as the Centers for Disease Control issued greatly expanded treatment guidelines recommending wider use of the drug in people infected with the AIDS virus but who may show no symptoms.
- (AP890616-0048, TIPSTER Vol. 1)
- Relevant documents may mention specific types of
treatments for AIDS
14. Why is it difficult?
Meaning
Language
15. Variability of Semantic Expression
The Dow Jones Industrial Average closed up 255
Dow ends up
Dow gains 255 points
Stock market hits a record high
Dow climbs 255
16. It's all about entailment
Question: Who acquired Overture?
Expected answer template: X acquired Overture
Text: Yahoo's buyout of Overture
entails
Hypothesized answer: Yahoo acquired Overture
- Application inferences can be reduced to entailment!
- IE: X acquire Y
- Summarization (multi-document): identify redundant sentences
- MT: paraphrasing, evaluation
- Educational applications: student answer vs. reference
17. Applied Textual Entailment
- Directional relation between two text fragments: Text (t) and Hypothesis (h)
- Operational (applied) definition
- Human gold standard
- Entailment judgment matches applications' judgments
- Assuming common background knowledge
- Language and world knowledge
18. Textual Entailment and Human Reading Comprehension
- From a children's English learning book (Sela and Greenberg)
- Reference Text: The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida.
- Hypothesis (True/False?): The Bermuda Triangle is near the United States
???
19. PASCAL Recognizing Textual Entailment (RTE) Challenges
FP-6 Funded PASCAL NoE, 2004-7
Bar-Ilan University; ITC-irst and CELCT, Trento; MITRE; Microsoft Research
20. Some Examples
21. Participation and Impact
- Very successful challenges, world wide
- RTE-1: 17 groups
- RTE-2: 23 groups
- 150 downloads!
- RTE-3: 25 groups
- RTE-4 (2008): 25 groups, moved to NIST (TREC organizers)
- High interest in the research community
- Papers, conference keywords, sessions and areas, PhDs, influence on funded projects
- Special issue at Journal of Natural Language Engineering
22. Results: RTE-2
Average: 60%; Median: 59%
23. Classical Approach: Semantics as an Interpretation Task
Stipulated Meaning Representation (by scholar)
Variability
Language (by nature)
- Logical forms, word senses, semantic roles, named entity types: scattered tasks
- Feasible/suitable framework for applied semantics?
24. Textual Entailment: Text Mapping
Assumed Meaning (by humans)
Variability
Language (by nature)
25. General Case: Inference
Meaning Representation
Inference
Interpretation
Language
Textual Entailment
- Entailment mapping is the actual applied goal
- Let's agree on it as a unified test for (all) semantic tasks
- Interpretation becomes a possible means
- Direct inference at the language level may be attempted
26. What is the main obstacle?
- System reports point at
- Lack of knowledge
- rules, paraphrases, lexical relations, etc.
- It seems that systems that coped better with
these issues performed best
27. Research Directions at Bar-Ilan: Knowledge Acquisition, Inference, Applications
Oren Glickman, Idan Szpektor, Roy Bar Haim, Maayan Geffet, Moshe Koppel (Bar Ilan University)
Shachar Mirkin (Hebrew University, Israel)
Hristo Tanev, Bernardo Magnini, Alberto Lavelli, Lorenza Romano (ITC-irst, Italy)
Bonaventura Coppola, Milen Kouylekov (University of Trento and ITC-irst, Italy)
28. Distributional Word Similarity
"Similar words appear in similar contexts" (Harris, 1968)
Similar Word Meanings ↔ Similar Contexts
Distributional Similarity Model:
Similar Word Meanings ↔ Similar Context Features
29. Measuring Context Similarity
- Country: Industry (genitive), Neighboring (modifier), Visit (obj), Population (genitive), Governor (modifier), Parliament (genitive)
- State: Neighboring (modifier), Governor (modifier), Parliament (genitive), Industry (genitive), Visit (obj), President (genitive)
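As a rough illustration of measuring context similarity, the sketch below compares the Country and State feature sets from this slide using cosine similarity over binary feature vectors. The slide does not name the similarity measure or the feature weights, so the binary weights and the `cosine` helper are assumptions, and the assignment of each feature to Country or State is a reading of the flattened two-column layout.

```python
from math import sqrt

# Context features as (word, syntactic relation) pairs, read off the slide.
# Binary weights are an assumption made for this sketch; the slide does not
# say how features are weighted.
country = {
    ("industry", "genitive"): 1.0,
    ("neighboring", "modifier"): 1.0,
    ("visit", "obj"): 1.0,
    ("population", "genitive"): 1.0,
    ("governor", "modifier"): 1.0,
    ("parliament", "genitive"): 1.0,
}
state = {
    ("neighboring", "modifier"): 1.0,
    ("governor", "modifier"): 1.0,
    ("parliament", "genitive"): 1.0,
    ("industry", "genitive"): 1.0,
    ("visit", "obj"): 1.0,
    ("president", "genitive"): 1.0,
}


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(weight * v[feat] for feat, weight in u.items() if feat in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Five of the six features are shared, so the similarity is 5/6.
print(round(cosine(country, state), 3))  # 0.833
```

With binary weights every feature counts equally; a weighting scheme would let indicative features such as "parliament" count for more than generic ones.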
30. Incorporate Indicative Patterns
31. Acquisition Example
- Top-ranked entailments for "company":
- firm, bank, group, subsidiary, unit, business, supplier, carrier, agency, airline, division, giant, entity, financial institution, manufacturer, corporation, commercial bank, joint venture, maker, producer, factory
- Current work: extraction from Wikipedia
32. Extracting Lexical Rules from Wikipedia
- Be-complement: nominal complements of "be"
- Redirect: various terms to canonical title
- Parenthesis: used for disambiguation
- Link: maps to a title of another article
33. Entailment Rules for Predicates
Q: What reduces the risk of Heart Attacks?
Text: Aspirin prevents Heart Attacks
Hypothesis: Aspirin reduces the risk of Heart Attacks
Entailment Rule: X prevent Y ⇒ X reduce risk of Y (template ⇒ template)
⇒ Need a large knowledge base of entailment rules
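To make the idea of a template-to-template entailment rule concrete, here is a minimal sketch that applies the rule from this slide to the example text. It works on surface strings with a crude suffix pattern, whereas the templates in the talk are defined over parse trees; the function names `template_to_regex` and `apply_rule` are hypothetical.

```python
import re

VARIABLES = ("X", "Y")


def template_to_regex(template):
    """Turn a template such as 'X prevent Y' into an anchored regex.

    Variables become named groups. Other tokens may carry a simple
    inflectional suffix, which is only a crude stand-in for lemmatization.
    """
    parts = []
    for tok in template.split():
        if tok in VARIABLES:
            parts.append(rf"(?P<{tok}>.+?)")
        else:
            parts.append(re.escape(tok) + r"(?:s|es|ed|ing)?")
    return re.compile(r"^" + r"\s+".join(parts) + r"$", re.IGNORECASE)


def apply_rule(text, lhs, rhs):
    """Return the instantiated right-hand side if the text matches lhs, else None."""
    match = template_to_regex(lhs).match(text.strip())
    if match is None:
        return None
    # Substitute token by token so variable values are never re-substituted.
    return " ".join(match.group(tok) if tok in VARIABLES else tok for tok in rhs.split())


print(apply_rule("Aspirin prevents Heart Attacks", "X prevent Y", "X reduce risk of Y"))
# Aspirin reduce risk of Heart Attacks
```

The output is left uninflected, since the templates themselves are stated over lemmas.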
34. TEASE Algorithm Flow
Resources: Lexicon, WEB
Input template: X <-subj- accuse -obj-> Y
Anchor Set Extraction (ASE)
- Sample corpus for input template: "Paula Jones accused Clinton", "Sanhedrin accused St. Paul"
- Anchor sets: {Paula Jones (subj), Clinton (obj)}, {Sanhedrin (subj), St. Paul (obj)}
Template Extraction (TE)
- Sample corpus for anchor sets: "Paula Jones called Clinton indictable", "St. Paul defended before the Sanhedrin"
- Templates: X call Y indictable; Y defend before X
iterate
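The two TEASE phases above can be sketched on a tiny in-memory corpus built from the slide's own example sentences. This is only an illustration under strong assumptions: the real algorithm queries the Web and works over parsed, lemmatized text, while this sketch uses plain string matching, so its templates keep surface forms ("called" rather than "call"). The names `extract_anchor_sets` and `extract_templates` are hypothetical.

```python
import re

# Toy corpus: the four example sentences shown on the slide.
CORPUS = [
    "Paula Jones accused Clinton",
    "Sanhedrin accused St. Paul",
    "Paula Jones called Clinton indictable",
    "St. Paul defended before the Sanhedrin",
]


def extract_anchor_sets(corpus, verb_form):
    """ASE: collect (X, Y) fillers from sentences of the form 'X <verb_form> Y'."""
    pattern = re.compile(rf"^(.+?) {re.escape(verb_form)} (.+)$")
    anchors = []
    for sentence in corpus:
        match = pattern.match(sentence)
        if match:
            anchors.append((match.group(1), match.group(2)))
    return anchors


def extract_templates(corpus, anchors, verb_form):
    """TE: find other sentences containing both anchors and abstract them to X and Y."""
    templates = set()
    for x, y in anchors:
        slot = re.compile(f"{re.escape(x)}|{re.escape(y)}")
        for sentence in corpus:
            if f" {verb_form} " in sentence:
                continue  # skip instances of the input template itself
            if x in sentence and y in sentence:
                templates.add(slot.sub(lambda m: "X" if m.group(0) == x else "Y", sentence))
    return sorted(templates)


anchors = extract_anchor_sets(CORPUS, "accused")
print(anchors)
print(extract_templates(CORPUS, anchors, "accused"))
```

The "iterate" step would feed the learned templates back in as new input templates.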
35. Sample of Extracted Anchor-Sets for X prevent Y
36. Sample of Extracted Templates for X prevent Y
37. Experiment and Evaluation
- 48 randomly chosen input verbs
- 1392 templates extracted; human judgments
- Encouraging results
- Future work: improve precision
38. Syntactic Variability Phenomena
Template: X activate Y
39. Inference: putting it all together
A proof system over parse trees
A compact, unified formalism for knowledge representation and inference at the lexical-syntactic level
- Providing
- Uniform representation for all knowledge types
- A single knowledge-based inference mechanism
40. Proof System Components
Research goal: develop formalism components to support the needed inferences
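41. Inference Rules: Tree Transformations
- Pair of subtrees with shared variables (templates)
- Example: Passive-to-Active
[Diagram: left-hand side L is a passive subtree, V (verb) with children obj N1 (noun), be (verb), and by (prep) whose pcomp-n is N2 (noun); right-hand side R is an active subtree, V (verb) with subj N2 (noun) and obj N1 (noun)]
The book was read by John ⇒ John read the book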
42. Alignments
- The book was read by John yesterday
[Diagram: the Passive-to-Active rule, with the same L and R subtrees as on the previous slide]
- We want to infer: John read the book yesterday
- We introduced alignments to indicate copying of modifiers to the generated tree
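The passive-to-active rule and the modifier-copying role of alignments can be sketched together on a small dependency tree. The tree encoding (dicts with a list of `(relation, child)` pairs), the `mod` relation used for "yesterday", and the function names are assumptions made for this sketch, not the formalism's actual data structures.

```python
def node(lemma, pos, *children):
    """A dependency node; children are (relation, node) pairs."""
    return {"lemma": lemma, "pos": pos, "children": list(children)}


def child(tree, rel, pos=None):
    """Return the first child reached via `rel` (optionally with a given POS), or None."""
    for relation, sub in tree["children"]:
        if relation == rel and (pos is None or sub["pos"] == pos):
            return sub
    return None


def passive_to_active(tree):
    """Apply L => R: V(obj N1, be, by(pcomp-n N2)) becomes V(subj N2, obj N1)."""
    if tree["pos"] != "verb":
        return None
    n1 = child(tree, "obj", "noun")
    be = child(tree, "be", "verb")
    by = child(tree, "by", "prep")
    n2 = child(by, "pcomp-n", "noun") if by else None
    if not (n1 and be and n2):
        return None  # the left-hand side does not match this tree
    matched = {id(n1), id(be), id(by)}
    # Alignment: modifiers of V not covered by the rule are copied to the new tree.
    copied = [(rel, sub) for rel, sub in tree["children"] if id(sub) not in matched]
    return node(tree["lemma"], "verb", ("subj", n2), ("obj", n1), *copied)


# "The book was read by John yesterday" (determiner omitted for brevity)
passive = node(
    "read", "verb",
    ("obj", node("book", "noun")),
    ("be", node("be", "verb")),
    ("by", node("by", "prep", ("pcomp-n", node("John", "noun")))),
    ("mod", node("yesterday", "other")),
)
active = passive_to_active(passive)
print([(rel, sub["lemma"]) for rel, sub in active["children"]])
# [('subj', 'John'), ('obj', 'book'), ('mod', 'yesterday')]
```

Without the copying step the generated tree would only support "John read the book" and lose "yesterday".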
43. Proof Example
- Text: It rained when John and Mary left
- ⇒
- Hypothesis: Mary left
44. Proof Example
It rained when John and Mary left
⇒ It rained when Mary left
⇒ Mary left
[Diagram: parse tree of the text, with nodes ROOT, rain (verb), it (other), when (adj), leave (verb), John (noun), Mary (noun) and edges labeled i, expletive, wha, subj, conj]
45. Making sense of (implicit) senses
- What is the RIGHT set of senses?
- Any concrete set is problematic/subjective
- but WSD forces you to choose one
- A lexical entailment perspective
- Instead of identifying an explicitly stipulated sense of a word occurrence
- identify whether a word occurrence (i.e. its implicit sense) entails another word occurrence, in context
- ACL-2006
46. Lexical Matching for Applications
Q: announcement of new models of chairs
T1: IKEA announced a new comfort chair
T2: MIT announced a new CS chair position
- Sense entailment in substitution
Q: announcement of new models of furniture
T1: IKEA announced a new comfort chair
T2: MIT announced a new CS chair position
47. Unsupervised Direct kNN-ranking
- Test example score: average cosine similarity with the k most similar training examples
- Rationale
- positive examples will be similar to some source occurrence (of the corresponding sense)
- negative examples won't be similar to the source
- Rank test examples by score
- A classification slant on language modeling
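The scoring rule on this slide can be sketched as follows: each test example is scored by its average cosine similarity to its k most similar training examples, and test examples are then ranked by that score. The bag-of-words context representation and the toy sentences are assumptions made for illustration; the slide does not specify the features used.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Represent a context as lowercase token counts (an assumed feature set)."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(count * v[tok] for tok, count in u.items() if tok in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def knn_score(test_vec, train_vecs, k):
    """Average cosine similarity with the k most similar training examples."""
    top = sorted((cosine(test_vec, t) for t in train_vecs), reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def rank_by_knn(test_texts, train_texts, k=2):
    """Return (score, text) pairs for the test examples, best first."""
    train_vecs = [bag_of_words(t) for t in train_texts]
    scored = [(knn_score(bag_of_words(t), train_vecs, k), t) for t in test_texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical contexts: training examples are occurrences of the source word.
train = [
    "the store sells comfortable wooden furniture",
    "new furniture models for the living room",
]
tests = [
    "a new wooden chair for the living room",
    "the department chair announced a faculty position",
]
for score, text in rank_by_knn(tests, train, k=2):
    print(f"{score:.3f}  {text}")
```

Because only the top k neighbors are averaged, a test example needs to resemble just some of the source occurrences, which matches the rationale given on the slide.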
48. Results (for synonyms): Ranking
- kNN improves precision by 8-18% at up to 25% recall
49. Initial Applications: Relation Extraction, Semantic Search
50. Relation Extraction
Input Template: X prevent Y
Entailment Rule Acquisition (TEASE)
Templates: X prevention for Y, X treat Y, X reduce Y
Transformation Rules
Syntactic Matcher
Relation Instances: <sunscreen, sunburns>
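A minimal sketch of the matching stage in this pipeline: the input template and the templates learned for it are matched against sentences, and the variable fillers are collected as relation instances. The slide's matcher is syntactic; this sketch substitutes a surface regex matcher with single-word fillers, and the sentences and the `extract_instances` helper are hypothetical.

```python
import re


def compile_template(template):
    """Compile 'X prevent Y' into a regex with one-word slots for X and Y.

    Non-variable tokens accept a simple inflectional suffix, a crude
    replacement for the lemmatized, syntactic matching used in the talk.
    """
    parts = []
    for tok in template.split():
        if tok in ("X", "Y"):
            parts.append(rf"(?P<{tok}>\w+)")
        else:
            parts.append(re.escape(tok) + r"(?:s|es|ed|ing)?")
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def extract_instances(sentences, templates):
    """Return the set of (X, Y) pairs matched by any template in any sentence."""
    instances = set()
    for template in templates:
        pattern = compile_template(template)
        for sentence in sentences:
            for match in pattern.finditer(sentence):
                instances.add((match.group("X").lower(), match.group("Y").lower()))
    return instances


# Input template plus the entailing templates listed on the slide.
templates = ["X prevent Y", "X prevention for Y", "X treat Y", "X reduce Y"]
# Hypothetical sentences for illustration.
sentences = [
    "Daily sunscreen prevents sunburns in most cases.",
    "Regular sunscreen prevention for sunburns is advised.",
    "The weather was sunny.",
]
print(extract_instances(sentences, templates))  # {('sunscreen', 'sunburns')}
```

Both sentences yield the same pair, which shows how learned templates add recall beyond the input template alone.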
51. Dataset
- Recognizing interactions between annotated protein pairs (Bunescu 2005)
- 200 Medline abstracts
- Gold standard dataset of protein pairs
- Input template: X interact with Y
52. Manual Analysis: Results
- 93% of interacting protein pairs can be identified with lexical syntactic templates
Number of templates vs. recall (within 93%)
53. TEASE Output for X interact with Y
A sample of correct templates learned
54. TEASE algorithm: Potential Recall on Training Set
- Iterative: taking the top 5 ranked templates as input
- Morph: recognizing morphological derivations
55. Results vs. Supervised Approaches
59. Integrating IE and Search (with IBM Research Haifa)
60. TE for Summarization (Harabagiu et al., IPM 2007)
- Textual entailment roles
- Selecting information
- Scoring summaries via pyramid-based measures
61. Entailment Engine Specification (API)
- Recognize/generate entailments
- Recognize given t/h pair
- RTE mode validation
- Given h and corpus, find all entailing texts
- IR, QA, FAQ
- Given text, generate all entailed statements
- paraphrase generation for MT
- Identify partial entailments
- summarization, partial match
- Accommodate template hypotheses
- Addresses variable value extraction (QA, IE)
- Accommodate contextual preferences in input
- Variable types (IE, QA, IR)
- Disambiguating context
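One way to picture such an engine specification is as an abstract interface with one method per mode. The sketch below is purely illustrative: the class and method names are hypothetical, and `WordOverlapEngine` is a deliberately naive baseline (it ignores word order and direction) included only so the interface can be exercised; it is not the approach of the talk.

```python
from abc import ABC, abstractmethod
from typing import List


class EntailmentEngine(ABC):
    """Hypothetical interface mirroring the modes listed on the slide."""

    @abstractmethod
    def recognize(self, text: str, hypothesis: str) -> bool:
        """RTE mode: does `text` entail `hypothesis`?"""

    def find_entailing_texts(self, hypothesis: str, corpus: List[str]) -> List[str]:
        """IR / QA / FAQ mode: all corpus texts that entail the hypothesis."""
        return [t for t in corpus if self.recognize(t, hypothesis)]

    @abstractmethod
    def generate_entailed(self, text: str) -> List[str]:
        """Generation mode: statements entailed by `text`."""


class WordOverlapEngine(EntailmentEngine):
    """Naive baseline: h counts as entailed if all of its words occur in t."""

    def recognize(self, text: str, hypothesis: str) -> bool:
        return set(hypothesis.lower().split()) <= set(text.lower().split())

    def generate_entailed(self, text: str) -> List[str]:
        return [text]  # every text trivially entails itself


engine = WordOverlapEngine()
corpus = ["Yahoo acquired Overture after long talks", "Dow gains 255 points"]
print(engine.find_entailing_texts("Yahoo acquired Overture", corpus))
# ['Yahoo acquired Overture after long talks']
```

Template hypotheses, partial entailments, and contextual preferences from the list above would need richer inputs and return types than this sketch provides.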
62. Optimistic Conclusions
- Good prospects for better levels of text understanding
- Enabling more sophisticated information access
- Textual entailment is an appealing framework
- Boosts research on text understanding
- Potential for vast knowledge acquisition
Thank you!