Title: JAVELIN Project Briefing
1 JAVELIN Project Briefing
- February 16, 2007
- Language Technologies Institute, Carnegie Mellon University
2 MLQA Architecture
[Architecture diagram. Components: QA, RS, IX, AG, Keyword Translator, Chinese Corpus, Japanese Corpus, Chinese IX, Japanese IX]
3 MLQA Architecture
Example question: How much did the Japan Bank for International Cooperation decide to loan to the Taiwan High-Speed Corporation?
5 MLQA Architecture
Answer Type: MONEY; Keyword: _____________
6 MLQA Architecture
- DocID: JY-20010705J1TYMCC1300010, Confidence: 44.01
- DocID: JY-20011116J1TYMCB1300010, Confidence: 42.95
7 MLQA Architecture
Answer Candidate; Confidence: 0.0718; Passage
8 MLQA Architecture
Cluster and re-rank answer candidates.
10Question Analyzer
- Primary Subtasks
- Question Classification
- Key Term Identification
- Semantic Analysis
11Question Classification
- Hybrid Approach
- machine learning + rule-based (same features)
- Features
- Lexical
- unigrams, bigrams
- Syntactic
- focus adjective, main verb, wh-word, determiner status of wh-word
- Semantic
- focus word type
12 Question Classification: Focus words
- Examples
- Which town hosted the 2002 Winter Olympics?
- How long is the Golden Gate Bridge?
- Determining the semantic type of focus nouns
- Look up in WordNet
- town -> town-8135936
- Use a manually-created mapping
- town-8135936 -> CITY
- city-metropolis-urban_center-8005407 -> CITY
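The two-step lookup on this slide can be sketched as a pair of dictionary lookups. This is only an illustration: both tables below are tiny hypothetical excerpts built from the slide's examples, and `WORD_TO_SYNSET` stands in for a real WordNet query.

```python
from typing import Optional

# Hypothetical stand-in for a WordNet lookup: focus noun -> synset key.
WORD_TO_SYNSET = {
    "town": "town-8135936",
    "city": "city-metropolis-urban_center-8005407",
}

# Hypothetical excerpt of the manually created synset -> type mapping.
SYNSET_TO_TYPE = {
    "town-8135936": "CITY",
    "city-metropolis-urban_center-8005407": "CITY",
}


def focus_word_type(focus_noun: str) -> Optional[str]:
    """Return the semantic type of a focus noun, or None if it is unmapped."""
    synset = WORD_TO_SYNSET.get(focus_noun.lower())
    if synset is None:
        return None
    return SYNSET_TO_TYPE.get(synset)
```

For "Which town hosted the 2002 Winter Olympics?", the focus noun "town" resolves to CITY. A noun missing from either table yields no type, so the classifier would have to rely on its other features.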
13 Question Classification: Algorithms
- Machine Learning
- Hierarchical classifier
- E-C: MAX_ENT, MAX_ENT
- E-J: MAX_ENT, ADABOOST_OVA
- Rule-based
- Example
- MONEY <- WH_WORD=how_much, FOCUS_ADJ=expensive, FOCUS_TYPE=money
- Hybrid Approach
- Try both ML and rule-based
- If rule-based classification succeeded, use it
- Else use ML-based classification
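A minimal sketch of the hybrid fallback described above. The function names are hypothetical, the ML classifier is passed in as a plain callable, and treating the three conditions of the MONEY rule as a conjunction is an assumption; the slide does not say how conditions combine.

```python
from typing import Callable, Dict, Optional

Features = Dict[str, str]


def rule_based_classify(features: Features) -> Optional[str]:
    """Return an answer type if a rule fires, otherwise None."""
    # Assumption: all three conditions of the slide's MONEY rule must hold.
    if (features.get("WH_WORD") == "how_much"
            and features.get("FOCUS_ADJ") == "expensive"
            and features.get("FOCUS_TYPE") == "money"):
        return "MONEY"
    return None


def hybrid_classify(features: Features,
                    ml_classify: Callable[[Features], str]) -> str:
    """Prefer the rule-based answer type; fall back to the ML classifier."""
    rule_result = rule_based_classify(features)
    if rule_result is not None:
        return rule_result
    return ml_classify(features)
```

Because both classifiers read the same feature dictionary, the fallback needs no feature conversion between them.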
14Key Term Identification
- Sources of evidence
- Syntactic category (POS): NN, JJ, VB, CD
- Common phrases in dictionary
- Named entity tags
- Quoted text
- Unification procedure based on priority of
evidence source
15Semantic Analysis
- Semantic Role Labeling
- ASSERT v0.1
- Back-up: KANTOO (for be, have, etc.)
- more on SRL later
- Semantic Predicate Structures
- Produced from SRL annotations and key terms
- Focus argument is identified
- Semantic Predicate Expansion
- Using small, manually-created ontology
- Relations: is-a, inverse, implies, reflexive
16 Plans for Future Development
- Question Classification
- Replace manually-created knowledge sources and heuristics with learners
- Re-architect to place learned components as supporting agents to rule-based control
- Semantic Role Labeling
- Nominalizations?
- Predicate Expansion
- Learn the expansion ontology automatically from labeled corpora
17 Retrieval Strategist for NTCIR 6
- Sentence and block retrieval
- Blocks are overlapping windows, each containing three sentences
- Annotated Corpora
- Chinese
- Sentence and block boundaries
- Named Entity Types: www, phone, cardinal, time, percent, person, quoted, money, booktitle, date, ordinal, email, location, duration, organization, measure
- Japanese (CLQA and QAC)
- Sentence segmentation, blocks
- Named Entity Types: time, date, optional, location, demo, organization, artifact, made, misc, money, any, person, numex
- Named Entity Subtypes: misc, people, cardinal, age, weight, length, speed, information, area
- Question focus terms: person_bio, reason, method, definition
- Japanese case markers: mo, totomoni, ka, no, tte, nado, wo, ni, ga, toshite, e, wa, yori, nitsuite, dake, kara, to, ya
- Propbank-style semantic roles: target, arg0-4, argx
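The overlapping three-sentence blocks mentioned on this slide can be illustrated with a sliding window. The function name is hypothetical, and a stride of one sentence is an assumption; the slide only says that the windows overlap.

```python
from typing import List


def sentence_blocks(sentences: List[str], size: int = 3) -> List[List[str]]:
    """Return overlapping windows of `size` consecutive sentences."""
    if not sentences:
        return []
    if len(sentences) <= size:
        # A short document yields a single block holding all its sentences.
        return [list(sentences)]
    return [sentences[i:i + size] for i in range(len(sentences) - size + 1)]
```

With four sentences this gives two blocks, sentences 1 to 3 and sentences 2 to 4.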
18 Query Formulation for NTCIR 6
- Retrieve, rank and score blocks
- One weighted synonym inner clause for each keyterm, containing alternate forms, weighted by confidence
#weight[block](
  weight1 #wsyn( 1.0 term1 0.85 alt1a 0.60 alt1b )
  weight2 #wsyn( 1.0 term2 0.75 alt2a )
)
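A small sketch of how such a query string could be assembled from keyterms and their alternate forms. It assumes the operators are Indri's `#weight` and `#wsyn` with a `[block]` field restriction, since the `#` and brackets appear to have been lost from the slide text. The function names and input layout are hypothetical.

```python
from typing import List, Tuple

# One keyterm: a list of (confidence, surface form) pairs, original form first.
Alternates = List[Tuple[float, str]]


def wsyn_clause(alternates: Alternates) -> str:
    """Build one weighted-synonym clause from a keyterm's alternate forms."""
    inner = " ".join(f"{conf:.2f} {form}" for conf, form in alternates)
    return f"#wsyn( {inner} )"


def block_query(keyterms: List[Tuple[float, Alternates]]) -> str:
    """Combine per-keyterm clauses into one weighted block-level query."""
    clauses = " ".join(f"{weight:.1f} {wsyn_clause(alts)}"
                       for weight, alts in keyterms)
    return f"#weight[block]( {clauses} )"
```

Multi-word keyterms would additionally need a phrase operator around each form, which this sketch leaves out.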
19Translation Module Overview
20Outline
- What TM does
- Then (TM at NTCIR-5)
- Now (TM at NTCIR-6)
- Going Forward
21 Translation Module
- Responsible for all translation-related tasks within Javelin
- Currently, TM's main task is to translate keyterms (given by the Question Analyzer) from the source language (language of the user input question) to the target languages (languages of the data collections where answers may be found) so that the answer can be located and extracted based on the translated keyterms
22 Then (NTCIR-5)
- Goal: Produce a high-quality translation for each keyterm based on question context
- View: A translation problem
- Evaluation: Based on gold-standard translation
- How
- Use multiple translation resources
- Dictionaries
- MT systems
- Web-mining techniques
- Use web co-occurrence statistics to select the best combination of translated keyterms for a given question
23 Then (NTCIR-5)
[Diagram: TM with Translation Gathering and Translation Selection stages. Input: Source Language Keyterms. Output: Target Language Keyterms. Resources: World Wide Web, Dictionaries, MT Systems, Web Mining]
24 Then (NTCIR-5)
- Problems
- A correct translation may not be useful in document retrieval and answer extraction
- Bill Clinton, William J. Clinton, President Clinton (Alternate Forms)
- Took over, invaded, attacked, occupied (Near-synonyms)
- Which one is correct? Which one(s) are good for retrieval and extraction in a QA system?
- Needs better translation of named entities
- Accessing the web for gathering statistics could be slow
25 Now (NTCIR-6)
- Goal: For each keyterm, produce a SET of translations useful for retrieval and extraction
- View: A Cross-Lingual Information Retrieval (CLIR) problem
- Evaluation: No direct evaluation, but based on retrieval results
- How
- Use multiple translation resources
- Dictionaries
- MT systems
- Better web-mining techniques
- Wikipedia
- Named entity lists
- More of everything
- Use multiple translation candidates
- Better for retrieval and extraction recall
- But need to minimize noise while retaining recall
- Rank translation candidates
- Use simpler web statistics
26 Now (NTCIR-6)
[Diagram: TM with Translation Gathering and Translation Alternatives Scoring stages. Input: Source Language Keyterms. Output: Target Language Keyterms. Resources: Named Entity Lists, Dictionaries, World Wide Web, Web Mining, MT Systems, Wikipedia]
27 Now (NTCIR-6)
Retrieval results show that good translation coverage is important.
- Gold: Manually created gold-standard translation, one per keyterm
- TM: Automatically generated translation using TM, multiple translations per keyterm
- Note: The unit of retrieval is a 3-sentence block, not a document, and relevance judgment is based on the answer pattern
30 Going Forward
- Improving Translation Coverage
- Improve web-mining translators
- Improve keyterm extraction
- May need alternate forms of the source language keyterm
- May need to segment/transform extracted keyterms
- Japanese translation is poor
- Named entities not translated properly
- Keyterm segmentation problems
- Improving CLIR
- Use corpus statistics for ranking translation candidates
- Use established data sets for CLIR experiments (TREC, NTCIR)
31 Chinese Answer Extraction Module
- Outline
- Review answer extraction in NTCIR5 through an example
- Explain the new techniques we developed for NTCIR6
32 NTCIR5 Chinese Answer Extractor Module (we will use an English example as our running example; the techniques in this module are language-independent)
Q: What percent of the nation's cheese does Wisconsin produce?
T: In Wisconsin, where farmers produce roughly 28 percent of the nation's cheese
33 NTCIR5 Chinese Answer Extractor Module
1. Identify named entities
Q: What percent of the nation's cheese does Wisconsin [Location] produce?
T: In Wisconsin [Location], where farmers produce roughly 28 percent [Percent] of the nation's cheese
34 NTCIR5 Chinese Answer Extractor Module
2. Identify expected answer type
Q: What percent of the nation's cheese does Wisconsin [Location] produce?
Answer Type: PERCENT
T: In Wisconsin [Location], where farmers produce roughly 28 percent [Percent] of the nation's cheese
35 NTCIR5 Chinese Answer Extractor Module
3. Extract the answer candidate whose named-entity type matches the expected answer type
Q: What percent of the nation's cheese does Wisconsin [Location] produce?
Answer Type: PERCENT
T: In Wisconsin [Location], where farmers produce roughly 28 percent [Percent] of the nation's cheese
36 NTCIR5 Chinese Answer Extractor Module
Score the answer candidate based on surface distance to key terms. Select the answer candidate closest to all key terms.
Q: What percent of the nation's cheese does Wisconsin [Location] produce?
Answer Type: PERCENT
T: In Wisconsin [Location], where farmers produce roughly 28 percent [Percent] of the nation's cheese
Example: "Wisconsin" and "28 percent" are 5 word tokens apart
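The surface distance used for scoring can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, punctuation tokens are skipped so that only word tokens count, and the candidate is located by its first token ("28").

```python
import string
from typing import List, Optional


def surface_distance(tokens: List[str], key_term: str,
                     candidate: str) -> Optional[int]:
    """Distance in word tokens between a key term and an answer candidate."""
    # Drop tokens made up entirely of punctuation (assumption: they don't count).
    words = [t for t in tokens
             if not all(ch in string.punctuation for ch in t)]
    term_pos = [i for i, w in enumerate(words) if w == key_term]
    cand_pos = [i for i, w in enumerate(words) if w == candidate]
    if not term_pos or not cand_pos:
        return None
    return min(abs(i - j) for i in term_pos for j in cand_pos)


tokens = ("In Wisconsin , where farmers produce roughly 28 percent "
          "of the nation's cheese").split()
distance = surface_distance(tokens, "Wisconsin", "28")
```

On the running example this gives a distance of 5, matching the figure on the slide. How distances to several key terms are combined into one score is not stated in the slide text, so it is left out here.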
37 NTCIR6 Chinese Answer Extractor Module
Q: What percent of the nation's cheese does Wisconsin produce?
A: In Wisconsin, where farmers produce roughly 28 percent of the nation's cheese
38 NTCIR6 Chinese Answer Extractor Module
Find the best alignment of key terms using a max-flow dynamic programming algorithm
Q: What percent of the nation's cheese does Wisconsin produce?
A: In Wisconsin, where farmers produce roughly 28 percent of the nation's cheese
39 NTCIR6 Chinese Answer Extractor Module
Using the max-flow algorithm, we take into account partial matching of terms and synonym expansion, by assigning different scores to these types of matching
[Figure: alignment edges between Q terms (percent, make) and A terms (28, produce), with scores 0.8 and 0.9, labeled Partial Matching and Synonym Expansion]
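To make the idea of scored alignments concrete, here is a small sketch that searches all one-to-one alignments exhaustively. It is not the max-flow dynamic programming algorithm named on the slides, only a brute-force stand-in for a handful of terms. The score constants, the substring test for partial matches, and the pairing of 0.8 and 0.9 with match types are assumptions.

```python
from itertools import permutations
from typing import FrozenSet, List, Optional, Set, Tuple

# Illustrative scores (assumption: which slide value goes with which type).
EXACT, SYNONYM, PARTIAL = 1.0, 0.9, 0.8


def match_score(q: str, a: str, synonyms: Set[FrozenSet[str]]) -> float:
    """Score one question-term / answer-term pair by match type."""
    if q == a:
        return EXACT
    if frozenset((q, a)) in synonyms:
        return SYNONYM
    if q in a or a in q:  # crude partial match by substring
        return PARTIAL
    return 0.0


def best_alignment(q_terms: List[str], a_terms: List[str],
                   synonyms: Set[FrozenSet[str]]
                   ) -> Tuple[float, List[Tuple[str, Optional[str]]]]:
    """Return the highest-scoring one-to-one alignment and its score."""
    # Extra None slots let a question term stay unaligned.
    slots: List[Optional[str]] = list(a_terms) + [None] * len(q_terms)
    best_score, best_pairs = -1.0, []
    for choice in permutations(range(len(slots)), len(q_terms)):
        pairs = [(q, slots[j]) for q, j in zip(q_terms, choice)]
        score = sum(match_score(q, a, synonyms)
                    for q, a in pairs if a is not None)
        if score > best_score:
            best_score, best_pairs = score, pairs
    return best_score, best_pairs
```

The exhaustive search grows factorially with the number of terms, which is why a flow-based or dynamic programming formulation is the practical choice.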
40 NTCIR6 Chinese Answer Extractor Module
Identify answer type and select answer candidates that have a matching NE type
Q: What percent of the nation's cheese does Wisconsin produce?
Answer Type: PERCENT
T: In Wisconsin, where farmers produce roughly 28 percent [Percent] of the nation's cheese
41 NTCIR6 Chinese Answer Extractor Module
Produce dependency parse trees
Q: What percent of the nation's cheese does Wisconsin produce?
[Dependency labels shown for Q: whn, head, pcomp-n, subj, gen, prep, det, root, mod, i, gen, mod]
T: In Wisconsin, where farmers produce roughly 28 percent of the nation's cheese
[Dependency labels shown for T: root, pcomp-n, obj, pcomp-n]
42 NTCIR6 Chinese Answer Extractor Module
Extract relation triples among the matching terms
Q terms: percent, of, the, nation's, cheese, Wisconsin, produce
[Dependency labels shown for Q: whn, head, pcomp-n, subj, gen, prep, root, mod, i, gen, mod]
T terms: Wisconsin, produce, 28 percent, of, the, nation's, cheese
[Dependency labels shown for T: root, pcomp-n, obj, pcomp-n]
43 NTCIR6 Chinese Answer Extractor Module
Extract relation triples (cont.)
Q term pairs: (percent, of), (Wisconsin, produce), (cheese, of), (percent, produce); relation labels shown: prep, subj, pcomp-n, whn/head, mod, mod/i
T term pairs: (28 percent, of), (Wisconsin, produce), (cheese, of), (28 percent, produce); relation labels shown: pcomp-n, obj
44 NTCIR6 Chinese Answer Extractor Module
Combine multiple sources of information using a maximum-entropy model.
1. atype-NE matching (Q answer type PERCENT; T entity "28 percent" tagged Percent)
2. Dependency path matching (Q: percent, of, cheese; T: 28 percent, of, cheese; relation labels shown: mod, prep, pcomp-n)
3. Term alignment score
4. Sentence term occurrence
5. Passage term occurrence
6. etc.
45 Future work for Chinese IX
- Currently building a more powerful and expressive model to learn the syntactic and semantic transformation from question to answer.
- Plug in more external resources such as paraphrase database and semantic resources (gazetteers, WordNet, thesaurus) into the new model.
46 Japanese Answer Extraction
47 NTCIR6 CLQA EJ/JJ task: Answer Extraction
- Given retrieved documents, we want to pick Named Entities that belong to the expected answer type or other relevant type.
- Named Entity tagging
- Used CaboCha for 9 NEs.
- A pattern-based NE tagger is also used for NUMEX, DATE, TIME, PERCENT classes and for more fine-grained NEs (ORGANIZATION.UNIVERSITY)
- NE family assumption
- Families
- LOCATION, PERSON, ORGANIZATION, ARTIFACT
- NUMEX, PERCENT
- DATE, TIME
- If the answer type is LOCATION, also add entities tagged with other types in the same family to the answer candidate pool, because the NE tagger may have mistakenly tagged a LOCATION as PERSON
- MaxEnt Learner learns different weights for LOCATION-LOCATION and LOCATION-PERSON
- Then, we want to estimate the probability of each Named Entity being an answer.
- Used a Maximum Entropy model where we can easily incorporate customizable features
- We can model both proximity (used in JAVELIN II's LIGHT IX) and patterns (used in JAVELIN II's FST IX)
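The NE family assumption can be sketched as a filter over tagged entities. The families come from the slide; the function names and the `(text, type)` input layout are hypothetical.

```python
from typing import List, Set, Tuple

NE_FAMILIES: List[Set[str]] = [
    {"LOCATION", "PERSON", "ORGANIZATION", "ARTIFACT"},
    {"NUMEX", "PERCENT"},
    {"DATE", "TIME"},
]


def family_of(ne_type: str) -> Set[str]:
    """Return the family containing ne_type, or a singleton if none does."""
    for family in NE_FAMILIES:
        if ne_type in family:
            return family
    return {ne_type}


def candidate_pool(answer_type: str,
                   tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep every entity whose type is in the answer type's family."""
    family = family_of(answer_type)
    return [(text, ne) for text, ne in tagged if ne in family]


def atype_feature(answer_type: str, ne_type: str) -> str:
    """Name of the pair feature, e.g. LOCATION-PERSON."""
    return f"{answer_type}-{ne_type}"
```

Keeping the pair as a feature lets the learner assign a lower weight to LOCATION-PERSON than to LOCATION-LOCATION instead of discarding mistagged entities outright.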
48 NTCIR6 CLQA EJ/JJ task: Answer Extraction
- Numeric features
- Q denotes question sentence, and A denotes answer candidate sentence.
- KEYTERM: # of key terms from Q found in A
- ALIAS: # of aliases (obtained from Wikipedia and Eijiro) from Q found in A
- RELATED_TERM: # of related terms (obtained from web mining) from Q found in A
- KEYTERM_DIST: Closest sentence-level distance of a key term from Q.
- ALIAS_DIST: Closest sentence-level distance of an alias from Q.
- RELATED_TERM_DIST: Closest sentence-level distance of a related term from Q.
- PREDICATE_ARGUMENT: to what degree the predicate-argument structures of Q and A are similar
- Binary features
- ATYPE: pairs of answer types in Q and A
- KEYTERM_ATTACHMENT: -NO, -NI, -WA, -GA, -MO, -WO, -KANJI, -PAREN
- If a certain word occurs directly after the key term in A.
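The count and distance features above might be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, terms are matched by substring (convenient for unsegmented Japanese text), and the distance is read as the number of sentences between A and the nearest sentence containing a term.

```python
from typing import List, Optional


def term_count(terms: List[str], answer_sentence: str) -> int:
    """KEYTERM / ALIAS / RELATED_TERM: how many terms from Q occur in A."""
    return sum(1 for t in terms if t in answer_sentence)


def closest_sentence_distance(terms: List[str], sentences: List[str],
                              answer_index: int) -> Optional[int]:
    """*_DIST: sentences between A and the nearest sentence with a term."""
    hits = [abs(i - answer_index)
            for i, s in enumerate(sentences)
            if any(t in s for t in terms)]
    return min(hits) if hits else None
```

The same two functions serve all three term sources; only the term list passed in changes.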
49 NTCIR6 QAC Overview
- Japanese-to-Japanese complex (non-factoid) QA task.
- Answer unit is larger than in the factoid QA task
- Phrases, sentences, multiple sentences, summarized text
- In reality, it was a pilot task
- Kinds of questions were not predefined
- Small training data
- Evaluation: currently, human judgment only
- Unknown N for the number of top-N answer candidates to evaluate
- Data
- Corpus: Mainichi newspaper 1998-2001
- Training: 30 questions
- Formal run: 100 questions
50 NTCIR6 QAC: Complex Questions
- Example questions by expected answer (translated)
- Relationship, difference
- What is the difference between skeleton and luge?
- Reason, cause
- Why was it easy to predict the eruption of Mt. Usu?
- What is the background of the rise of Islamic fundamentalism?
- Definition, description, (person bio)
- What is the NPO law?
- What are the problems of the aging Mir space station?
- Effect, result
- How does dioxin affect the human body?
- Method, process
- Degree
- Opinion
- What did Ryoko Tamura comment after winning the gold medal?
51 NTCIR6 QAC: Our Approach
- Question Analysis
- Keyterm extraction, dictionary-based keyterm expansion, answer type analysis
- Document Retrieval
- Same as factoid QA. Block (3-sentence level) retrieval with Indri
- Answer Extraction
- Machine learning (Maximum Entropy model) using keyterms, answer types, and patterns as features
- Answer Selection
- Duplicate answer merging
52 NTCIR6 QAC: Answer types
- A-type categories we defined: METHOD, PROCESS, REASON, RESULT, CONDITION, DEFINITION, PERSON_BIO, DEGREE
- Example question patterns: "How do you choose ...", "In what process ...", "Why ...?", "What is the reason of ...", "In what condition?", "What is ...?", "What is the advantage of ...?", "Who is ...?", "How much damage?"
- How to make use of answer type?
- As a feature in answer extraction phase
53 NTCIR6 QAC: Keyterm expansion
- Based on observations, we found some vocabulary mismatches between questions and answers
- Created a synonym/alias dictionary from Wikipedia and Eijiro (English-to-Japanese dictionary)
- From Wikipedia
- Use redirection information, from which aliases can be extracted
- E.g. Carnegie Mellon, CMU
- From Eijiro
- Assume target words are synonyms of each other
- Risk of treating "financial bank" and "river bank" as synonyms
54 NTCIR6 QAC: Answer Extraction
- Used a Maximum Entropy model where we can easily incorporate customizable features
- One-sentence assumption
- Finding answer boundaries is difficult, because in non-factoid QA it requires more text understanding
- So, we assumed one sentence is an appropriate span to start with
- Then, the answer extraction problem became more like an answer selection problem
- Top N answer candidates to return
- As long as the score given to the answer candidate is over the threshold
55 NTCIR6 QAC: Answer Extraction Features
- Numeric features (Q denotes question sentence, and A denotes answer candidate sentence.)
- KEYTERM: # of key terms from Q found in A
- ALIAS: # of aliases (obtained from Wikipedia and Eijiro) from Q found in A
- RELATED_TERM: # of related terms (obtained from web mining) from Q found in A
- KEYTERM_DIST: Closest sentence-level distance of a key term from Q.
- ALIAS_DIST: Closest sentence-level distance of an alias from Q.
- RELATED_TERM_DIST: Closest sentence-level distance of a related term from Q.
- SENTENCE_LENGTH: length of the A sentence in normal distribution
56 NTCIR6 QAC: Answer Extraction Features
- Binary features
- Q denotes question sentence, and A denotes answer candidate sentence.
- PATTERN_CUE: If there is a hand-crafted cue in A
- HAS_SUBJ: If there is a subject in A
- HAS_PRON: If there is a pronoun in A
- ATYPE: Answer type analyzed from Q
- LIST_QUE: If list cues "(1)", "???", "?", "?" found in A
- PAREN: If "?", "?" or "(", ")" found in A
- PARAGRAPH_HEAD: If A is the beginning of a paragraph.
- KEYTERM_ATTACHMENT: -NO, -NI, -WA, -GA, -MO, -WO, -KANJI, -PAREN
- If a certain word occurs directly after the key term in A.
- Observation: -NO and -KANJI are strong features
57 NTCIR6 QAC: Human judgment result
- Manual judgment was done for all 100 questions by a person outside of NTCIR.
- Top 4 answer candidates were evaluated
- Answer candidates are labeled as
- A: the candidate contains the answer
- B: the candidate contains the answer but the main topic of the candidate is not about the answer
- C: the candidate contains a part of the answer
- D: the candidate does not contain the answer
- Judged result
- A: 24, B: 30, C: 13, D: 310 out of 377 answer candidates
- Our interpretation of the result
- Precision is 18% ((A+B+C)/(A+B+C+D)) for loose evaluation.
- For 42% of questions, we were able to return at least one candidate with an A, B, or C label
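The loose precision figure can be checked directly from the judged counts on this slide. The variable and function names below are only illustrative.

```python
from typing import Dict

# Judged counts reported on the slide.
judgments: Dict[str, int] = {"A": 24, "B": 30, "C": 13, "D": 310}


def loose_precision(counts: Dict[str, int]) -> float:
    """(A + B + C) / (A + B + C + D): any candidate with answer content counts."""
    relevant = counts["A"] + counts["B"] + counts["C"]
    return relevant / sum(counts.values())


precision_percent = round(loose_precision(judgments) * 100)
```

Here 67 of the 377 judged candidates carry an A, B, or C label, which rounds to the 18% reported above.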
58 Future plans
- In answer type analysis, classify multiple binary features of the question, instead of picking out only one A-type category.
- Instead of introducing the one-sentence assumption, treat answer extraction as an answer segmentation problem
- Automatic evaluation metrics from the text-segmentation task, such as COAP (Co-Occurrence Agreement Probability), will be available even if factoid and non-factoid questions are mixed together. (cf. Basic Element approach)
59 NTCIR6 CLQA EJ/JJ task: Future Work
- Use or develop a more accurate NE tagger, trading off with speed
- True annotation
- <PERSON>??</PERSON>????????<PERSON>???</PERSON>????????
- Output from CaboCha
- <LOCATION>?</LOCATION>?????????<PERSON>??</PERSON><ORGANIZATION>?</ORGANIZATION>????????
- Output from Bar
- <PERSON>??</PERSON>????????<PERSON>??</PERSON><ORGANIZATION>?</ORGANIZATION>????????
- Try other classifier learner algorithms
- SVM, AdaBoost, Decision Tree, Voted Perceptron, etc.
- Feature engineering and beyond
- "We put this and this feature and got the best accuracy."
- So what?
- Want to interpret the result by answering a question
- How much does feature A contribute to extracting the answer?
60 Answer Generator
- NTCIR5
- Cluster similar or redundant answers
- For a cluster containing K answers whose extraction confidence scores are S1, S2, ..., SK, the cluster confidence is computed as
- NTCIR6
- Apply an answer ranking model to estimate the probability of an answer given multiple answer relevance and similarity features
61 Answer Ranking Model
- Two subtasks for answer ranking
- Identify relevant answer candidates: estimate P(correct(Ai) | Ai, Q)
- Exploit answer redundancy: estimate P(correct(Ai) | Ai, Aj)
- Goal: Estimate P(correct(Ai) | Q, A1, ..., An)
- Use logistic regression to estimate answer probability given the degree of answer relevance and the amount of supporting evidence provided in the set of answer candidates
62 Answer Ranking Model (2)
where
- simk(Ai, Aj): a scoring function used to calculate an answer similarity between Ai and Aj
- relk(Ai): a feature function used to produce an answer relevance score for an answer Ai
- K1, K2: the number of feature functions for answer validity and answer similarity scores, respectively
- N: the number of answer candidates
- α0, βk, λk: weights learned from training data
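The equation itself did not survive in the slide text, so the following is only one plausible reading of the definitions above: a logistic function over a bias, weighted relevance scores, and weighted similarity scores summed over the other candidates. The function name, argument layout, and the choice to sum similarity over all j != i are assumptions.

```python
import math
from typing import List


def answer_probability(i: int,
                       rel: List[List[float]],        # rel[i][k] = rel_k(A_i)
                       sim: List[List[List[float]]],  # sim[k][i][j] = sim_k(A_i, A_j)
                       alpha0: float,
                       betas: List[float],
                       lambdas: List[float]) -> float:
    """Logistic estimate of P(correct(A_i) | Q, A_1..A_N) under the stated assumptions."""
    n = len(rel)
    z = alpha0
    # Relevance evidence for candidate i.
    z += sum(b * r for b, r in zip(betas, rel[i]))
    # Redundancy evidence: similarity of candidate i to every other candidate.
    for lam, sim_k in zip(lambdas, sim):
        z += lam * sum(sim_k[i][j] for j in range(n) if j != i)
    return 1.0 / (1.0 + math.exp(-z))
```

With positive weights, a candidate that scores well on relevance and is similar to many other candidates receives a higher probability than an isolated one.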
63Feature Representation
- Answer Relevance Features
- Knowledge-based Features
- Data-driven Features
- Answer Similarity Features
- String distance metrics
- Synonyms
- Each feature produces an answer relevance or
answer similarity score
64 Knowledge-based Feature: Gazetteers
- Electronic gazetteers provide geographic information
- English
- Use Tipster Gazetteer, CIA World Factbook, information about the US states (www.50states.com)
- Japanese
- Extract Japanese location information from Yahoo
- Use Gengo GoiTaikei location names
- Chinese
- Extract location names from the Web and HowNet
- Translated names
- Translate country names provided by the CIA World Factbook and the Tipster gazetteers into Chinese and Japanese names
- Top 3 translations were used
65Relevance Score (Gazetteers)
66 Knowledge-based Feature: Ontologies
- Ontologies such as WordNet contain information about relationships between words and general meaning types (synsets, semantic categories, etc.)
- English
- WordNet: WordNet 2.1 contains 155,327 words, 117,597 synsets and 207,016 word-sense pairs
- Japanese
- Gengo GoiTaikei: contains 300,000 Japanese words with their associated 3,000 semantic classes
- Chinese
- HowNet: contains 65,000 Chinese concepts and 75,000 corresponding English equivalents
67Relevance Score (WordNet)
68 Data-driven Feature: Google
- Use Google for English, Japanese and Chinese
- For each answer candidate Ai
- 1. Initialize the Google score gs(Ai) = 0
- 2. Create a query
- 3. Retrieve the top 10 snippets from Google
- 4. For each snippet s
- 4.1. Initialize the co-occurrence score cs(s) = 1
- 4.2. For each keyterm translation k in s
- 4.2.1. Compute distance d, the minimum number of words between k and the answer candidate
- 4.2.2. Update the snippet co-occurrence score
- 4.3. gs(Ai) = gs(Ai) + cs(s)
- Question: What is the prefectural capital city whose name is written in hiragana?
- Keyterms and their translations
- prefectural: ???? (0.75), ?????? (0.25)
- capital city: ?? (0.78), ????? (0.11), ?? (0.11)
- written: ?? (0.6)
- hiragana: ??? (0.5), ?? (0.3), ???? (0.11), ???? (0.1)
- Answer candidate: ?????
- Query
- ????? (???? OR ??????) (?? OR ????? OR ??) (??) (? ?? OR ?? OR ???? OR ????)
69Data-driven Feature Wikipedia
- Use Wikipedia for English, Japanese and Chinese
- Algorithm
70 Similarity Features
- String Distance
- Levenshtein, Cosine, Jaro and Jaro-Winkler
- Synonyms
- Binary similarity score for synonyms
- English: WordNet synonyms, Wikipedia redirection, CIA World Factbook
- Japanese: Wikipedia redirection, EIJIRO dictionary
- Chinese: Wikipedia redirection
- sim(Ai, Aj) = 1 if Ai is a synonym of Aj, 0 otherwise
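Two of the similarity features above can be sketched compactly: Levenshtein distance turned into a similarity, and the binary synonym score. The normalization by the longer string's length and the representation of synonym resources as sets of surface forms are assumptions made for illustration.

```python
from typing import Iterable, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map distance into [0, 1], where 1 means identical strings."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def synonym_similarity(a: str, b: str, synonym_sets: Iterable[Set[str]]) -> int:
    """Binary score: 1 if both answers appear in one synonym set, else 0."""
    return 1 if any(a in s and b in s for s in synonym_sets) else 0
```

A synonym set such as {"Carnegie Mellon", "CMU"} could come from Wikipedia redirection, letting two differently worded answers support each other during ranking.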
71Answer Similarity using Canonical forms
- Type specific conversion rules
72AG Results (E-J)
73Breakdown by Answer Type
E-J
E-C
74Effects of Keyterm Translation on Answer Ranking
75 Future Work
- Continue to analyze the effects of keyterm translation on answer ranking
- Improve Web validation
- Query relaxation when there are no matching Web documents
- e.g. Which city in Japan is the "Ramen Museum" located in?
- "Ramen Museum" is translated into "ramen ???" and there are no matching Web documents
- Change the query to (ramen AND ???) or incorporate the English keyterm (Ramen Museum)
- Extend our joint prediction model to CLQA
- Apply a probabilistic graphical model to estimate the joint probability of all answer candidates, from which the probability of an answer can be inferred
76NTCIR Evaluation Results
77Evaluation Metrics
- Datasets
- Monthly Evaluation
- Current Performance
- Performance History
- Periodic Analysis
- Error Analysis (on Development Set)
- Plans for Future Development
78Evaluation Metrics Report
- End-to-end and modular evaluation
- Evaluation of speed and accuracy
- Summary of internal HTML evaluation reports
- http://durazno.lti.cs.cmu.edu/wiki/moin.cgi/Javelin_Project/Multilingual/Evaluation
79 Plans for Future Development
- Question Classification
- Replace manually-created knowledge sources and heuristics with learners
- Re-architect to place learned components as supporting agents to rule-based control
- Semantic Role Labeling
- Nominalizations
- Semantic Predicate Expansion
- Automatic ontology acquisition
80Structured Retrieval for Question Answering
81 Standard Approach to Retrieval for QA
Input: Question; Output: Answers
- Question Analysis
- Determines what linguistic and semantic constraints must hold for a document to contain a valid answer
- Formulates a bag-of-words query using question keywords and a named entity placeholder representing the expected answer type
- Document Retrieval
- Corpus indexed for keywords and named entity annotations [13]
- Provides best-match documents containing keywords and NE
- Answer Extraction and Post Processing
- Checks constraints and extracts NE answer
82 Issues with Standard Approach
- Why is the standard approach sometimes sub-optimal for QA?
- May not scale to large collections
- When question keywords are frequent and co-occur frequently, many documents that do not answer the question may be matched, e.g. What country is Berlin in?
- Named entities can help narrow the search space but still match sentences such as "Berlin is near Poland."
- May be slow
- If answer extraction or constraint checking is not a cheap operation, the current approach may retrieve large numbers of irrelevant documents that need to be checked.
- May be ineffective for non-factoid (e.g. relationship) questions
- Can we reduce the number of documents we need to process in order to find an answer more quickly?
- more relevant documents more highly ranked
83 Alternative: Structured Retrieval
- Using higher-order information can distinguish relevant vs. non-relevant results which look the same to a bag-of-words retrieval model
- Linguistic and semantic analyses are stored as annotations and indexed as fields.
- Constraint checking at retrieval time can improve document ranking based on matching constraints, thereby reducing post-processing burden.
84 The Role of Retrieval in QA
Input: Question; Output: Answers
- Coarse, first-pass filter to narrow search space for answers
- Finding actual answers requires checking linguistic and semantic constraints
- Bag-of-words retrieval does not support such constraint checking at retrieval time
- May need to process a large number of irrelevant documents to find best answers.
- Want to improve document ranking based on constraints
85 Retrieval Approaches for QA
Input: Question; Output: Answers
- System A
- Query composed of question keywords and a named entity placeholder
- Bag-of-words retrieval
- Constraint checking using ASSERT, answer extraction
- System B
- Likely answer-bearing structures posited, one query per structure
- Structured retrieval with constraint checking
- Answer extraction
86 Research Questions
- How can we compare Systems A and B?
- Experiment: Answer-Bearing Sentence Retrieval
- How does the effectiveness of the structured approach compare to bag-of-words?
- Does structured retrieval effectiveness vary with question complexity?
- Experiment: The Effect of Annotation Quality
- To what degree is the effectiveness of structured retrieval dependent on the quality of the annotations?
87 Experiment: Answer-Bearing Sentence Retrieval
- Hypothesis: Structured retrieval retrieves more relevant documents more highly ranked compared to bag-of-words retrieval
- AQUAINT Corpus (LDC2002T31)
- Sentence Segmentation by MXTerminator [14]
- Named Entity Recognition by BBN Identifinder [1]
- Semantic Role Labels by ASSERT [12]
- 109 TREC 2002 Factoid Questions
- Exhaustive document-level judgments over AQUAINT [2, 8]
- Training (55) and test (54) sets, with similar answer type distribution
- Answer-bearing sentences manually identified
- Must completely contain the answer without requiring inference or aggregation of information across multiple sentences
- Gold-standard question analysis/query formulation
88 Example Answer-Bearing Sentence
- Q1402: What year did Wilt Chamberlain score 100 points?
- A: At the time of his 100-point game with the Philadelphia Warriors in 1962, Chamberlain was renting an apartment in New York.
[Figure labels: TARGET, renting]
89 Question-Structure Mapping
Q1402: What year did Wilt Chamberlain score 100 points?
An answer-bearing structure [Figure labels: TARGET, ARGM-TMP, 100 points]
A structured query that retrieves instances of this structure:
#combine[sentence]( #max( #combine[target](
  #max( #combine[./argm-tmp]( 100 point #any:date ) )
  #max( #combine[./arg0]( #max( #combine[person]( chamberlain ) ) ) ) ) ) )
90 Answer-Bearing Sentence Retrieval
- Two experimental conditions
- single structure: one structured query
- Only answer-bearing sentences matching a single structure are considered relevant
- every structure: many queries, round robin
- Any answer-bearing sentence considered relevant
- Most QA systems are somewhere in between, querying for several, but not all, structures
- Keyword + Named Entity baseline
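For the "every structure" condition, results from many structured queries are merged round robin. A minimal sketch of such a merge follows; the function name is hypothetical, and dropping sentences already taken from another list is an assumption, since the slide does not describe duplicate handling.

```python
from itertools import zip_longest
from typing import List, Optional


def round_robin(ranked_lists: List[List[str]],
                limit: Optional[int] = None) -> List[str]:
    """Interleave ranked result lists, taking one result per list in turn."""
    merged: List[str] = []
    seen = set()
    for tier in zip_longest(*ranked_lists):
        for item in tier:
            # None marks an exhausted list; skip it and any repeat.
            if item is None or item in seen:
                continue
            seen.add(item)
            merged.append(item)
    return merged[:limit]
```

Each query's top result appears before any query's second result, so no single structure dominates the head of the merged ranking.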
91 Results: Training Topics
[Results chart not recoverable; values shown: 12.8, 96.9]
Optimal smoothing parameters: Jelinek-Mercer [19], with collection language model weighted 0.2 and document language model weighted 0.2
92 Results: Test Topics
[Results chart not recoverable; values shown: 11.4, 46.6]
93 Results
[Results chart not recoverable; values shown: Training Topics 96.9, Test Topics 46.6]
94 Structure Complexity
- Results show that, on average, structured retrieval has superior recall of answer-bearing sentences.
- For what types of queries is structured retrieval most helpful?
- Analyze recall at 200 for queries of different levels of complexity.
- Complexity of structure estimated by counting the number of combine operators, not including the outermost.
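The complexity estimate can be illustrated by counting operators in a query string. This sketch assumes the structured queries use Indri-style `#combine[...]` operators, which is how the query for Q1402 reads once the stripped `#` and brackets are restored. The function name is hypothetical.

```python
import re


def structure_complexity(query: str) -> int:
    """Count #combine operators, excluding the outermost one."""
    return max(0, len(re.findall(r"#combine\b", query)) - 1)


# Query for Q1402 in its restored form (an assumption about the lost markup).
q1402 = ("#combine[sentence]( #max( #combine[target]( "
         "#max( #combine[./argm-tmp]( 100 point #any:date ) ) "
         "#max( #combine[./arg0]( #max( #combine[person]( chamberlain ) ) ) ) ) ) )")
complexity = structure_complexity(q1402)
```

That query contains five `#combine` operators, so its complexity under this measure is 4.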
95The more complex the structure sought, the more
useful knowledge of that structure is in ranking
answer-bearing sentences.
96 In the test set, there are fewer queries, total, and fewer highly complex queries. This widens confidence intervals, but there is still a range where 95% confidence intervals do not overlap much or at all.
97 The Effect of Annotation Quality
- Penn Treebank WSJ [9] corpus
- WSJ_GOLD: Gold-standard Propbank [6] annotations
- WSJ_DEGRADED: Semantic role labeling by ASSERT (88.8% accurate)
- All questions answerable over the corpus
- Exhaustively generated sentence-level relevance judgments
- 10,690 questions having more than one answer
98 Question and Judgment Generation
- Each sentence that contains a Propbank annotation can answer at least one question
- Dow Jones publishes The Wall Street Journal, Barron's magazine, other periodicals and community newspapers.
- What does Dow Jones publish?
- Who publishes The Wall Street Journal, Barron's ...?
- Does Dow Jones publish The Wall Street Journal, Barron's ...?
[Figure labels: TARGET, publishes]
99 What does Dow Jones publish?
- The group of sentences that answer this question
- WSJ_0427: Dow Jones publishes The Wall Street Journal, Barron's magazine, other periodicals and community newspapers and operates electronic business information services.
- WSJ_0152: Dow Jones publishes The Wall Street Journal, Barron's magazine, and community newspapers and operates financial news services and computer data bases.
- WSJ_1551: Dow Jones also publishes Barron's magazine, other periodicals and community newspapers and operates electronic business information services.
100Judgments for WSJ_DEGRADED
- Sentences relevant for WSJ_GOLD are not relevant
for WSJ_DEGRADED if ASSERT omits or mislabels an
argument
- This models the reality of a QA system that
cannot determine relevance if annotations are
missing or incorrect, or if the sentence cannot be
analyzed
- Constraint checking and answer extraction both
depend on the analysis
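The judgment rule on this slide can be sketched as a filter over the gold judgments. The data layout (a dict from role label to argument text) and the exact-match comparison are assumptions made for illustration, not the procedure actually used.

```python
def relevant_for_degraded(gold_relevant, gold_args, assert_args):
    """Keep a sentence relevant only if it was relevant under the gold
    annotations and the automatic labels reproduce every gold argument.

    gold_args / assert_args: dict mapping role label -> argument text.
    assert_args is None when the sentence could not be analyzed.
    ASSUMPTION: arguments are compared by exact label and text match.
    """
    if not gold_relevant or assert_args is None:
        return False
    return all(assert_args.get(role) == text
               for role, text in gold_args.items())

gold = {"ARG0": "Dow Jones", "TARGET": "publishes",
        "ARG1": "Barron's magazine"}
omitted = {"ARG0": "Dow Jones", "TARGET": "publishes"}
mislabeled = {"ARG0": "Dow Jones", "TARGET": "publishes",
              "ARG2": "Barron's magazine"}
```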
101Annotation Quality Results
Structured retrieval is robust to degraded
annotation quality.
102Structured Retrieval Recall
- Structured retrieval ranks sentences that satisfy
the constraints highly.
- Structured retrieval outperforms the bag-of-words
approach in terms of recall of relevant
sentences.
- Structured retrieval performs best when query
structures anticipate answer-bearing structures,
and when these structures are complex.
103Structured Retrieval Precision
- For questions with keywords that frequently
co-occur in the corpus, structured retrieval
should offer a sizable precision advantage, e.g.,
What country is Berlin in?
- Querying on Berlin alone matches over 6,000
documents in the AQUAINT collection, most of
which do not answer the question.
- Questions such as this were intentionally
excluded during construction of the test
collection to ease the human assessment burden.
104Structured Retrieval Efficiency
- Structured queries are slower to evaluate, but
retrieve more relevant results more highly
ranked, compared to bag-of-words queries.
- A QA system seeking to achieve a certain recall
threshold will have to process fewer documents
- Processing fewer results can improve end-to-end
system runtime, even for systems in which answer
extraction cost is low.
- The structured retrieval approach requires that
the corpus be pre-processed off-line.
- Using the bag-of-words approach, a QA system is
free to run analysis tools on-the-fly, but this
could negatively impact the latency of an
interactive system.
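The claim that a better ranking lets a QA system process fewer documents for the same recall can be made concrete with a small helper. The rankings below are toy data invented for illustration, not experimental results. The function name is hypothetical.

```python
import math

def depth_for_recall(ranking, relevant, target_recall):
    """Return how many ranked results must be processed to reach
    target_recall, or None if the ranking never reaches it."""
    needed = math.ceil(target_recall * len(relevant))
    found = 0
    for depth, item in enumerate(ranking, start=1):
        if item in relevant:
            found += 1
        if found >= needed:
            return depth
    return None

# Toy rankings, invented for illustration only.
relevant = {"s1", "s2"}
structured = ["s1", "s2", "x1", "x2", "x3"]
bag_of_words = ["x1", "s1", "x2", "x3", "s2"]

structured_depth = depth_for_recall(structured, relevant, 1.0)
bow_depth = depth_for_recall(bag_of_words, relevant, 1.0)
```

In this toy case the first ranking reaches full recall after fewer results than the second, so downstream answer extraction would run on fewer items.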
105Structured Retrieval Robustness
- Although accuracy degrades when the annotation
quality degrades, the relative performance edge
that structured retrieval enjoys over
bag-of-words is maintained. (details in the
paper)
106Exploring the Problem Space
Corpus-based view
Query-based view
Domain
Keyword distribution over Corpus
Newswire, WMD, Medical
Language
EN, JP, CH
Annotations
Annotation or Structure distribution over Corpus
NE, SRL, NomSRL, Syntax, Special purpose event
frames
We hypothesize that there is a sub-space in which
structured retrieval consistently outperforms
bag-of-words retrieval. We may be able to determine
its boundaries experimentally and then generalize.
107Conclusions
- Structured retrieval retrieves more relevant
documents, more highly ranked, compared to
bag-of-words retrieval.
- The better ranking requires the QA system to
process fewer documents to achieve a certain
level of recall of answer-bearing sentences.
- Although accuracy degrades when the annotation
quality degrades, the relative performance edge
that structured retrieval enjoys over
bag-of-words is maintained.
- Details are in the paper (submitted to SIGIR)
108Future Work
- Question Analysis for Structured Retrieval
- Map question structures into likely
answer-bearing structures
- Mitigate computational burden of corpus
annotation
- How to merge results from different structured
queries in the event that more than one structure
is considered relevant?
109Experimental Plan
110References
- [1] Bikel, Schwartz and Weischedel. An algorithm
that learns what's in a name. ML,
34(1-3):211-231, 1999.
- [2] Bilotti, Katz and Lin. What works better for
question answering: stemming or morphological
query expansion? In Proc. of the IR4QA Workshop
at SIGIR '04. 2004.
- [6] Kingsbury, Palmer and Marcus. Adding
semantic annotation to the Penn Treebank. In
Proc. of HLT '02. 2002.
- [8] Lin and Katz. Building a reusable test
collection for question answering. JASIST,
57(7):851-861. 2006.
- [9] Marcus, Marcinkiewicz and Santorini.
Building a large annotated corpus of English: the
Penn Treebank. CL, 19(2):313-330. 1993.
- [12] Pradhan, Ward, Hacioglu, Martin and
Jurafsky. Shallow semantic parsing using support
vector machines. In Proc. of HLT/NAACL '04. 2004.
- [13] Prager, Brown, Coden and Radev.
Question-answering by predictive annotation. In
Proc. of SIGIR '00. 2000.
- [14] Reynar and Ratnaparkhi. A maximum entropy
approach to identifying sentence boundaries. In
Proc. of ANLP '97. 1997.
- [19] Zhai and Lafferty. A study of smoothing
methods for language models applied to ad hoc
information retrieval. In Proc. of SIGIR '01.
2001.
111Semantic Role Labeling for JAVELIN
- Introduction to Semantic Role Labeling
- Extracting information such as WHO did WHAT to
WHOM, WHEN and HOW, from the sentence
- The predicate describes the action/event; its
arguments give information about who, what, whom,
when, etc.
- Useful for Information Extraction, Question
Answering and Summarization. E.g.,
"bromocriptine-induced activation of p38 MAP
kinase" contains information used to answer
questions such as "What activates p38 MAP
kinase?" or "What induces the activation of p38
MAP kinase?"
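To make the predicate-argument idea concrete, the sketch below stores the example phrase as two hand-written frames and answers the two questions by lookup. The frames, role assignments, class name, and function name are all assumptions for illustration. A real labeler's output may differ in roles and spans.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    predicate: str  # lemma of the action/event
    args: dict      # role label -> argument text

# Hand-written frames for the example phrase (ASSUMPTION: not the
# output of any actual semantic role labeler).
frames = [
    Frame("induce", {"ARG0": "bromocriptine",
                     "ARG1": "activation of p38 MAP kinase"}),
    Frame("activate", {"ARG0": "bromocriptine",
                       "ARG1": "p38 MAP kinase"}),
]

def agents_of(predicate, arg1, frames):
    """Answer 'What <predicate>s <arg1>?' by returning matching ARG0s."""
    return [f.args["ARG0"] for f in frames
            if f.predicate == predicate
            and f.args.get("ARG1") == arg1
            and "ARG0" in f.args]

what_activates = agents_of("activate", "p38 MAP kinase", frames)
what_induces = agents_of("induce", "activation of p38 MAP kinase", frames)
```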
112Semantic Role Labeling for JAVELIN
- English SRL: ASSERT from U. Colorado
- ASSERT performance (F-measure)
- Hand-corrected parses: 89.4
- Automatic parsing: 79.4
- Upgrade to ASSERT 0.14b
- Current ASSERT is slow: the model is loaded once
per document
- Explore using the remote/client service options
from the new ASSERT
113Semantic Role Labeling for JAVELIN
- Example from ASSERT
- I mean, the line was out the door, when I first
got there.
- ARG0 I TARGET mean ARG1 the line was out
the door when I first got there
- I mean the line was out ARGM-TMP the door
ARGM-TMP when ARG0 I ARGM-TMP first TARGET
got ARGM-LOC there
- ASSERT misses the "be" and "have" verbs. KANTOO's
rule-based system is used to handle these cases.
"be" and "have" occur frequently in questions.
- PROPBANK doesn't have any examples of the predicate
"be" in the training corpus
- Plan for our own future work on SRL for questions
114Semantic Role Labeling for JAVELIN
- SRL for Chinese
- C-ASSERT: Chinese ASSERT
- 82.02 F-score
- Chinese extension of English ASSERT
- Example
- ????????????????????
- ARG0 ??? ?? ARGM-ADV ? ??? ARG0 ? ?? ???
TARGET ?? ARG1 ?? ?
115Semantic Role Labeling for JAVELIN
- SRL for Japanese: not much work done. Develop an
SRL system in-house, starting as a class project
- Recently released: NAIST Text Corpus v1.2 beta,
which includes verbal and nominative predicates with
labeled arguments
- The Kyoto Text Corpus v4.0 includes POS and
non-projective dependency parses for 40,000
sentences, and case role, anaphora, ellipsis and
co-reference for 5,000 sentences
- Use CRF and Tree-CRF for the learning task
116Questions?