Title: Annotating Students
1Annotating Students' Understanding of Science Concepts
- Rodney D. Nielsen, Wayne Ward, James Martin, and Martha Palmer
- Center for Computational Language and Education Research
- University of Colorado, Boulder
2Annotating Fine-Grained Entailments
- Question: Kate said, "An object has to move to produce sound." Do you agree with her? Why or why not?
- Reference answer: Agree. Vibrations are movements and vibrations produce sound.
- Learner answer: I do not agree because a radio does not move to make sound.
- The student agrees: Contradicted
- Vibrations are movement: Unaddressed
- Vibrations produce something: Different Argument
- Something produces sound: Expressed
3Recognizing Textual Entailment
- Hypothesis: Agree. Vibrations are movements and vibrations produce sound.
- Text: I do not agree because a radio does not move to make sound.
- The student agrees: False
- Vibrations are movement: Unknown
- Vibrations produce something: Unknown
- Something produces sound: True
4Prior Work
- Automated Tutors
- Aleven et al., 2001; Graesser et al., 2001; Jordan et al., 2004; Koedinger et al., 1997; Makatchev et al., 2004; Peters et al., 2004; Pon-Barry et al., 2004; Roll et al., 2005; Rose et al., 2003; VanLehn et al., 2005
- Constructed Response Scoring
- Callear et al., 2001; Leacock and Chodorow, 2003; Mitchell et al., 2002, 2003; Pulman, 2005; Sukkarieh, 2003, 2005
- PASCAL RTE (Dagan, Glickman and Magnini, 2005)
- Differences / Weaknesses
- Coarse-grained entailment: yes/no or a grade of 0-2 points
- Question-specific systems
- Hand-crafted dialog control, parsers, knowledge-based ontologies, logic representations, and/or rules
- Require 100-500 responses per question
5Necessity of Finer-Grained Analysis
- Imagine a tutor knowing only that there is some unspecified part of the reference answer that we are not sure the student understands
- Reference answer: A long string produces a low pitch.
- Break the reference answer down into low-level facets derived from a dependency parse and thematic roles
- NMod(string, long): The string is long.
- Agent(produces, string): A string is producing something.
- Product(produces, pitch): A pitch is being produced.
- NMod(pitch, low): The pitch is low.
- Assess whether an understanding of each facet is implied by the student's response
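The facet notation on this slide, Label(governor, modifier), can be held in a small data structure. The following is a minimal sketch, not the authors' tooling: the `Facet` class, its field names, and the `parse_facet` helper are hypothetical names introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Facet:
    """One low-level facet of a reference answer (hypothetical structure)."""
    label: str     # dependency or thematic-role label, e.g. "NMod"
    governor: str  # first argument, e.g. "string"
    modifier: str  # second argument, e.g. "long"


# Matches the "Label(governor, modifier)" notation used on the slides.
_FACET_RE = re.compile(r"^\s*(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*$")


def parse_facet(text: str) -> Facet:
    """Parse a facet string such as 'NMod(string, long)'."""
    match = _FACET_RE.match(text)
    if match is None:
        raise ValueError(f"not a facet: {text!r}")
    return Facet(*match.groups())


# Facets of "A long string produces a low pitch." as listed on the slide.
reference_facets = [
    parse_facet(s)
    for s in (
        "NMod(string, long)",
        "Agent(produces, string)",
        "Product(produces, pitch)",
        "NMod(pitch, low)",
    )
]

for facet in reference_facets:
    print(facet)
```

Keeping each facet as a separate record is what lets a tutor ask about one specific part of the reference answer rather than the answer as a whole.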
6Representing Fine-Grained Semantics
- Assess the relationship between the student's answer and the reference answer facets at a finer grain
- Reference answer: A long string produces a low pitch.
- Facets: (1) NMod(string, long), (2) Agent(produces, string), (3) Product(produces, pitch), (4) NMod(pitch, low)
Learner answer                   (1)        (2)        (3)        (4)
A long string produces a pitch.  Expressed  Expressed  Expressed  Unaddressed
(yes/no)                         Yes        Yes        Yes        No
It produces a loud pitch.        Assumed    Expressed  Expressed  Different Argument
It produces a high pitch.        Assumed    Expressed  Expressed  Contradiction Expressed
7The Focus of This Effort
- Low-level facets of the reference answer
- Finer-grained relationship to the facets
8The Corpus
Grade  Life Science                      Physical Science and Technology              Earth and Space Science  Scientific Reasoning and Technology
3-4    Human Body; Structure of Life     Magnetism and Electricity; Physics of Sound  Water; Earth Materials   Ideas and Inventions; Measurement
5-6    Food and Nutrition; Environments  Levers and Pulleys; Mixtures and Solutions   Solar Energy; Landforms  Models and Designs; Variables
- Assessing Science Knowledge (ASK), Full Option Science System (FOSS)
- Berkeley, Lawrence Hall of Science national assessment project (NSF)
- 16 science teaching and learning modules, Grades 3-6
- 287 constructed response questions
- 15,400 total student responses
- 146,000 facet entailment annotations
9Annotation Process
- Step 1: FOSS/ASK reference answers were manually decomposed into constituent facets
- Reference answer: The string is tighter, so the pitch is higher.
- Be(string, tighter): The string is tighter.
- Be(pitch, higher): The pitch is higher.
- Cause(X, Y): X is caused by Y.
- Step 2: Learner answers are annotated to indicate whether and how each facet was addressed
- Learner answer: The string is tighter, so there is less tension so the pitch gets higher.
- Be(string, tighter): The string is tighter. Self-Contra
- Be(pitch, higher): The pitch is higher. Expressed
- Cause(X, Y): X is caused by Y. Expressed
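The outcome of Step 2 can be pictured as one label per reference answer facet for each learner answer. The sketch below is an illustration only: the `AnnotatedAnswer` class and the `facets_needing_followup` method are hypothetical, and treating Assumed, Expressed, and Inferred as the labels that imply understanding is an assumption drawn from the label definitions in this deck.

```python
from dataclasses import dataclass, field

# Assumption: these labels imply the student understands the facet.
UNDERSTOOD = {"Assumed", "Expressed", "Inferred"}


@dataclass
class AnnotatedAnswer:
    """A learner answer with one annotation label per reference facet."""
    learner_answer: str
    facet_labels: dict = field(default_factory=dict)  # facet string -> label

    def facets_needing_followup(self):
        """Facets whose label does not imply understanding."""
        return [
            facet
            for facet, label in self.facet_labels.items()
            if label not in UNDERSTOOD
        ]


# The learner answer and labels shown on the slide.
annotated = AnnotatedAnswer(
    learner_answer=(
        "The string is tighter, so there is less tension "
        "so the pitch gets higher."
    ),
    facet_labels={
        "Be(string, tighter)": "Self-Contra",
        "Be(pitch, higher)": "Expressed",
        "Cause(X, Y)": "Expressed",
    },
)

print(annotated.facets_needing_followup())  # ['Be(string, tighter)']
```

Here only the self-contradicted facet is returned, which is the part of the answer a tutor would most plausibly revisit.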
10Reference Answer Decomposition
- Begin with a manual dependency parse of the reference answer
[Figure: dependency parse of "The brass ring would not stick to the nail because the ring is not iron." Arc labels: vc, vmod, sbar, prd, vmod, nmod, sub, vmod, pmod, sub, vmod.]
- Then raise main verbs, remove unimportant dependencies, incorporate copulas, prepositions, and negation into dependency labels, and utilize thematic role labels
[Figure: revised parse of "The brass ring would not stick to the nail because the ring is not iron." Arc labels: theme_not, cause_because, nmod, destination_to_not, be_not.]
11Reference Answer Markup
- Final facets for the reference answer: The brass ring would not stick to the nail because the ring is not iron.
- NMod(ring, brass): The ring is brass.
- Theme_not(stick, ring): The ring does not stick.
- Destination_to_not(stick, nail): Something does not stick to the nail.
- Be_not(ring, iron): The ring is not iron.
- Cause_because(stick, is): X is caused by Y.
[Figure: parse of the reference answer with arc labels theme_not, cause_because, nmod, destination_to_not, be_not.]
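The facet labels above fold a preposition or connective and any negation into the role name (for example, destination + to + not). A minimal sketch of that naming pattern follows. It assumes the order role, connective, negation, which is inferred from the labels shown on the slides; the `compose_label` function is hypothetical, and the decomposition itself was done manually in this work.

```python
def compose_label(role, connective=None, negated=False):
    """Build a facet label such as 'destination_to_not'.

    role:       thematic role or dependency type, e.g. 'destination'
    connective: preposition or conjunction folded into the label, e.g. 'to'
    negated:    True when the relation is negated in the reference answer
    """
    parts = [role]
    if connective:
        parts.append(connective)
    if negated:
        parts.append("not")
    return "_".join(parts)


# Labels for "The brass ring would not stick to the nail
# because the ring is not iron."
labels = [
    compose_label("nmod"),
    compose_label("theme", negated=True),
    compose_label("destination", "to", negated=True),
    compose_label("be", negated=True),
    compose_label("cause", "because"),
]
print(labels)
```

Encoding negation in the label keeps each facet a simple two-argument relation while still distinguishing "sticks to the nail" from "does not stick to the nail".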
12Answer Annotation Labels
- Assumed: Facets that are assumed to be understood a priori based on the question
- Expressed: Any facet directly expressed or inferred by simple reasoning
- Inferred: Facets inferred by pragmatics or nontrivial logical reasoning
- Contra-Expr: Facets directly contradicted by negation, antonymous expressions, and their paraphrases
- Contra-Infr: Facets contradicted by pragmatics or complex reasoning
- Self-Contra: Facets that are both contradicted and implied (self-contradictions)
- Diff-Arg: The core relation is expressed, but it has a different modifier or argument
- Unaddressed: Facets that are not addressed at all by the student's answer
13Annotation: Expressed and Inferred
- Question: Kate said, "An object has to move to produce sound." Do you agree with her? Why or why not?
- Reference answer: Agree. Vibrations are movements and vibrations produce sound.
- Root(root, agree): The student agrees. Expressed
- Be(vibration, movement): Vibration is movement. Inferred
- Agent(produce, vibrations): Vibrations produce something. Expressed
- Patient(produce, sound): Something produces sound. Expressed
- Student answer: Yes because it has to vibrate to make sounds.
14Annotation: Contradictions
- Question: Darla tied one end of a string around a doorknob and held the other end in her hand. When she plucked the string (pulled and let go quickly) she heard a sound. How would the pitch change if Darla pulled the string tighter?
- Reference answer: When the string is tighter, the pitch will be higher.
- Be(string, tighter): The string is tighter. Assumed
- Be(pitch, higher): The pitch is higher. Contra-Expr
- Cause(X, Y): X is caused by Y. Assumed
- Student answer: it will be low the pitch change
15Annotation: Unaddressed
- Question: Write a note to David to tell him why the pitch gets higher rather than lower.
- Reference answer: The string is tighter, so the pitch is higher. The string between the cup and table is not longer.
- Be_not(string, longer): The string is not longer. Unaddressed
- Student answer: David pitch is not happening tension is happening okay so calm down.
17Inter-annotator Agreement
       Fine-Grn  Tutor  Y/N
ITA    78.4%     86.2%  88.0%
Kappa  0.704     0.728  0.752
- In most disagreements (57%), one annotator chose Unaddressed
- 49% were between Unaddressed and Understood
- 35% of disagreements were between the labels implying understanding
- Only 2.3% of disagreements were between Understood and Contradicted
- Fine-Grn: all labels kept separate
- Tutor: combine Expressed, Inferred, and Assumed; combine Contra-Expr and Contra-Infr; keep the others separate
- Y/N: combine Expressed, Inferred, and Assumed versus everything else
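The three label granularities and the two agreement measures can be sketched in a few lines. This is an illustration under stated assumptions: the slide does not say which kappa variant was used, so Cohen's kappa for two annotators is assumed here, and the two label lists are invented toy data, not corpus annotations.

```python
from collections import Counter

# Label groupings as described on the slide.
UNDERSTOOD = {"Expressed", "Inferred", "Assumed"}
CONTRADICTED = {"Contra-Expr", "Contra-Infr"}


def to_tutor(label):
    """Collapse a fine-grained label into one of the five Tutor labels."""
    if label in UNDERSTOOD:
        return "Understood"
    if label in CONTRADICTED:
        return "Contradicted"
    return label  # Self-Contra, Diff-Arg, and Unaddressed stay separate


def to_yes_no(label):
    """Collapse a fine-grained label into the Y/N distinction."""
    return "Yes" if label in UNDERSTOOD else "No"


def agreement_and_kappa(a, b):
    """Return (observed agreement, Cohen's kappa) for two label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for six facets (toy data).
annotator_1 = ["Expressed", "Inferred", "Unaddressed",
               "Contra-Expr", "Diff-Arg", "Assumed"]
annotator_2 = ["Inferred", "Inferred", "Unaddressed",
               "Contra-Infr", "Unaddressed", "Assumed"]

for name, fn in (("Fine-Grn", lambda x: x),
                 ("Tutor", to_tutor),
                 ("Y/N", to_yes_no)):
    obs, kappa = agreement_and_kappa([fn(x) for x in annotator_1],
                                     [fn(x) for x in annotator_2])
    print(f"{name}: agreement={obs:.3f} kappa={kappa:.3f}")
```

On the toy data, agreement rises as labels are merged, which is the same direction as the trend in the table: disagreements inside a merged group no longer count.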
18Assessment Technology Overview
- Start with hand-generated reference answer facets
- Automatically parse the reference and learner answers and automatically extract the representation
- Generate machine learning feature vectors indicative of the student's understanding of each facet
- Features are derived from the answers, their parses, the relations between these, and corpus co-occurrence statistics
- Train a machine learning classifier on the training set feature vectors
- Use the classifier to assess the test set answers, assigning one of five Tutor-Labels to each reference answer (RA) facet
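The feature-generation step can be pictured with a deliberately small example. The sketch below computes only word-match features for one facet against one learner answer. The feature names and the `lexical_features` function are hypothetical; the parse-based and co-occurrence features mentioned on the slide are omitted, and training a classifier on such vectors would require a separate learning library.

```python
import re


def tokens(text):
    """Lowercased word set of a learner answer (crude tokenizer)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def lexical_features(facet_terms, learner_answer):
    """Word-match features for one reference answer facet.

    facet_terms: (governor, modifier) pair, e.g. ("pitch", "low")
    """
    governor, modifier = facet_terms
    words = tokens(learner_answer)
    governor_match = governor.lower() in words
    modifier_match = modifier.lower() in words
    return {
        "governor_match": int(governor_match),
        "modifier_match": int(modifier_match),
        "both_match": int(governor_match and modifier_match),
    }


answer = "It produces a high pitch."
print(lexical_features(("pitch", "low"), answer))
print(lexical_features(("produces", "pitch"), answer))
```

Exact word matching misses paraphrases and antonyms (here, "high" versus "low"), which is one reason purely lexical features are limited for this task.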
19Results (C4.5 decision tree)
                    nonAsmdFacets  MajorityClass  LexicalBaseline  All Features  ReducedTraining
Training Set 10xCV  54,967         54.6           59.7             77.1
Unseen Answers      30,514         51.1           56.1             75.5
Unseen Questions    6,699          58.4           63.4             61.7          66.5
Unseen Modules      3,159          53.4           62.9             61.4          68.8
- Results on Tutor-Labels (Unseen Answers, Unseen Questions, and Unseen Modules, respectively) are
- 24.4%, 8.1%, and 15.4% over the most frequent class baseline
- 19.4%, 3.1%, and 5.9% over the lexical baseline
- (All Unseen Modules facets adjudicated, about half of other modules adjudicated)
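The reported gains can be checked against the table with simple subtraction. The sketch below assumes, based on the arithmetic alone, that the gains use the All Features column for Unseen Answers and the ReducedTraining column for Unseen Questions and Unseen Modules; the dictionary keys are names chosen for illustration.

```python
# Accuracy values copied from the results table. "best" is the column
# assumed to underlie the reported gains (see the note above).
results = {
    "Unseen Answers": {"majority": 51.1, "lexical": 56.1, "best": 75.5},
    "Unseen Questions": {"majority": 58.4, "lexical": 63.4, "best": 66.5},
    "Unseen Modules": {"majority": 53.4, "lexical": 62.9, "best": 68.8},
}


def gains(table):
    """Gain over each baseline, rounded to one decimal place."""
    return {
        name: (
            round(row["best"] - row["majority"], 1),
            round(row["best"] - row["lexical"], 1),
        )
        for name, row in table.items()
    }


for name, (over_majority, over_lexical) in gains(results).items():
    print(f"{name}: +{over_majority} over majority, "
          f"+{over_lexical} over lexical")
```

The differences reproduce the figures in the bullets: 24.4, 8.1, and 15.4 over the majority class baseline, and 19.4, 3.1, and 5.9 over the lexical baseline.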
20Conclusions
- New assessment paradigm to enable more effective tutoring dialog management
- Facet breakdown enables the tutor to provide feedback relevant specifically to the appropriate part of the reference answer
- Additional labels facilitate understanding the type of mismatch between the reference answer/hypothesis and the student's answer/text
21Conclusions
- Corpus of annotated answers
- Substantial agreement: 86.2% on Tutor-Labels, 0.728 Kappa
- About 146K facet annotations
- Only corpus of fine-grained inference information
- Freely available
- Will support alternative approaches to the Recognizing Textual Entailment task
22Conclusions
- Answer Assessment System
- Evaluation according to new paradigm
- Within-domain performance
- 24% over majority class baseline
- Out-of-domain performance
- 15% over majority class baseline
- First system to address out-of-domain assessment
- First successful assessment of Grade 3-6 constructed responses
23Thanks!
- This work was partially funded by Award Numbers
- NSF 0551723,
- IES R305B070434, and
- NSF DRL-0733323.