Title: Ranking Automatically Generated Questions as a Shared Task
1 Ranking Automatically Generated Questions as a
Shared Task
- Michael Heilman and Noah A. Smith
2 In the summer of 55 BC, Julius Caesar invaded
Britain with an army of two legions.
3 4 5
- In the summer of 55 BC, who invaded Britain with
an army of two legions?
6 An Oversimplified Possible User Interface
Click on the checkboxes to select questions to
include on your reading quiz.
- In the summer of 55 BC, who invaded Britain with
an army of two legions?
- Who invaded Britain?
- When invaded Britain?
7 Outline
- Domain-General Factors
- Shared Task Description
- Annotating Questions
- Semi-automated Evaluation
8 Domain-General Factors
- Questions in almost all applications must:
  - Be grammatical,
  - Use the correct WH-word,
  - Not have obvious answers,
  - Avoid vagueness.
- Even domain-specific applications would benefit
from techniques to ensure basic question quality.
10 Shared Task Description
11 Our Overgenerator
- We implemented a question overgenerator employing
various lexical and syntactic transformations.
13 Annotating Questions
- Undergraduates annotated overgenerated questions
produced by our system.
15 Annotation Scheme
- Moderate inter-annotator agreement for
Acceptable vs. Not Acceptable ( )
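Agreement on a binary Acceptable vs. Not Acceptable label can be quantified with a chance-corrected statistic. The slide does not name the statistic used, so the sketch below assumes Cohen's kappa, one common choice for two annotators. The function name `cohen_kappa` and the example labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Assumption: Cohen's kappa is used here only as an illustration;
    the slides do not state which agreement statistic was reported.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four questions (not data from the study).
annotator_1 = ["acceptable", "acceptable", "not", "not"]
annotator_2 = ["acceptable", "not", "not", "not"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on three of four questions, and half of that agreement is expected by chance, so kappa comes out at 0.5.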
16 Annotation Scheme
- Possible Improvements to the Categories
  - Better descriptions
  - More examples
  - Revising/merging ones with low agreement
- Possible Improvements to the Process
  - More training and spot-checking of annotators
  - More redundancy (e.g., Mechanical Turk)
18 Semi-automated Evaluation
- Once the questions are annotated, new systems can
be evaluated automatically.
19 Semi-automated Evaluation
- Metric: Precision at N
- The fraction of questions in the top N that are acceptable.
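As a minimal sketch of how this metric could be computed once the annotations exist: given a system's ranked list of question IDs and the set of IDs annotated as acceptable, count the acceptable ones among the first N. The function name `precision_at_n` and the question IDs are hypothetical. Dividing by N even when the ranking is shorter than N is an assumption of this sketch, not something the slides specify.

```python
def precision_at_n(ranked_ids, acceptable_ids, n):
    """Fraction of the top-n ranked questions that were annotated acceptable."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    top_n = ranked_ids[:n]
    hits = sum(1 for question_id in top_n if question_id in acceptable_ids)
    # Assumption: divide by n even if fewer than n questions were ranked,
    # so a short ranking is not rewarded.
    return hits / n


# Hypothetical system output (best first) and hypothetical annotations.
system_ranking = ["q3", "q1", "q4", "q2", "q5"]
annotated_acceptable = {"q1", "q3", "q5"}

p_at_2 = precision_at_n(system_ranking, annotated_acceptable, 2)
p_at_4 = precision_at_n(system_ranking, annotated_acceptable, 4)
```

Because the annotations are fixed in advance, any new ranking of the same question pool can be scored this way without further human judgment.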
20 Spectrum of feasible approaches to QG
- Tailored to Domain/Task
  - Deep questions
  - Require expensive ontologies, etc.
- Domain-general
  - Shallow questions
  - Easily generalizable
Our proposed task sits at the domain-general end of this spectrum.
21 Conclusion
- The domain-generality of the proposed task and
its semi-automatic evaluation method are the
primary benefits that make it likely to succeed
in furthering QG research.
23 Annotation Scheme
Deficiency: Description
- Ungrammatical: The question does not appear to be a valid English sentence.
- Does not make sense: The question is grammatical but indecipherable.
- Vague: The question is too vague to know exactly what it is asking about, even after reading the article.
- Obvious Answer: The correct answer would be obvious even to someone who has not read the article.
- Missing Answer: The answer to the question is not in the article.
- Wrong WH-word: The question would be acceptable if the wh-phrase were different.
- Formatting: There are minor formatting errors.
- Other: There are other errors in the question that are not covered by any of the categories.
- Acceptable: None of the above deficiencies were present.
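One way to represent this scheme in code is sketched below. It treats "Acceptable" as the absence of any deficiency rather than as a label of its own. The names `Deficiency`, `Annotation`, and `is_acceptable` are hypothetical, and allowing several deficiencies per question is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class Deficiency(Enum):
    """The eight deficiency categories from the annotation scheme."""
    UNGRAMMATICAL = "Ungrammatical"
    DOES_NOT_MAKE_SENSE = "Does not make sense"
    VAGUE = "Vague"
    OBVIOUS_ANSWER = "Obvious Answer"
    MISSING_ANSWER = "Missing Answer"
    WRONG_WH_WORD = "Wrong WH-word"
    FORMATTING = "Formatting"
    OTHER = "Other"


@dataclass
class Annotation:
    """One annotator's judgment of one question (hypothetical record type)."""
    question: str
    # Assumption: an annotator may mark more than one deficiency.
    deficiencies: set = field(default_factory=set)

    def is_acceptable(self):
        # "Acceptable" means none of the deficiencies were present.
        return not self.deficiencies


good = Annotation("Who invaded Britain?")
bad = Annotation("When invaded Britain?", {Deficiency.UNGRAMMATICAL})
```

The binary Acceptable vs. Not Acceptable label used for agreement and for Precision at N then follows directly from `is_acceptable()`.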