Title: Automatically Grading Programming Assignments with Web-CAT
1Automatically Grading Programming Assignments
with Web-CAT
- Stephen H. Edwards
- Virginia Tech
- Dept. of Computer Science
- edwards_at_cs.vt.edu
- http://web-cat.sourceforge.net/
2My goals today are to
- Explain how requiring students to formulate and test hypotheses about their own code can improve their understanding and performance
- Describe our experiences with an alternate grading approach supported by a new tool: Web-CAT
- Describe some of the flexibility in Web-CAT for supporting other approaches
- Convince you software testing can be an important (and practical) addition to classroom practices
3Students hold onto ineffective techniques
- Too often, intro students believe that if their code:
  - compiles, the errors are mostly gone
  - runs correctly when I try it once, it is correct
  - runs on the instructor-provided sample input, it is correct
  - has a problem, it can be fixed by trial and error
4What is reflection-in-action?
- For an expert, when the current technique is failing:
  - Step back and reflect: "I must be missing something"
  - Re-examine the situation, your solution, and your implicit assumptions about the problem
  - Leads to guesses (hypotheses) about why the solution isn't working or why something else will be better
  - Carry out an experiment which serves to generate both a new understanding of the phenomenon and a change in the situation
5Practicing software testing will help students
frame and carry out experiments
- The problem: too much focus on synthesis and analysis too early in teaching CS
- Need to be able to read and comprehend source code
- Envision how a change in the code will result in a change in the behavior
- Need explicit, continually reinforced practice in hypothesizing about program behavior and then experimentally verifying their hypotheses
6Student comments suggest their current testing
practices are often weak
- "I run them through some simple tests to ensure that it is operating as expected. But for the most part I have always relied on supplied test data."
- "I don't think about test cases until I am confident my program is 100% working. Of course, it almost never is."
- "I usually write the whole thing up and then start doing rapid-fire tests of everything I can think of."
7A comprehensive strategy is necessary for a
culture shift in what students do
- Students cannot test their own code
- Want a culture shift in student behavior
- A single upper-division course would have little impact on practices in other classes
- So: systematically incorporate testing practices across many courses
[Diagram: Testing Practices spanning CS1, CS2, OO Design, and Data Structures]
8Expect students to apply their testing skills all
the time in programming assignments
- Expect students to test their own work
- Empower students by engaging them in the process of assessing their own programs
- Require students to demonstrate the correctness of their own work through testing
- Do this consistently across many courses
9What tools and techniques should I teach?
- We want to start with skills that are directly applicable to authentic student-oriented tasks
- Don't want to add bureaucratic busywork to assignments
- Without tool support, this is a lost cause!
- It is imperative to give students skills they value
- But most textbooks only give a conceptual intro to idealized industrial practices, not techniques students can use in their own assignments
10Test-driven development is very accessible for
students
- Also called test-first coding
- Focuses on thorough unit testing at the level of individual methods/functions
- "Write a little test, write a little code"
- Tests come first and describe what is expected; code follows and must be revised until all tests pass
- Encourages lots of small (even tiny) iterations
- See http://web-cat.sf.net/ for on-line references
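A minimal sketch of one test-first iteration, written in plain Java so it runs without a test library; in a course these checks would normally be JUnit test methods. The `Counter` class, its methods, and the `check` helper are hypothetical names chosen for illustration.

```java
// Hypothetical example class: this code is written AFTER the tests below.
class Counter {
    private int count = 0;

    void increment() { count++; }

    int value() { return count; }
}

// The tests are written first and state what is expected of Counter.
class CounterTest {
    static void testStartsAtZero() {
        Counter c = new Counter();
        check(c.value() == 0, "a new counter should start at 0");
    }

    static void testIncrementAddsOne() {
        Counter c = new Counter();
        c.increment();
        c.increment();
        check(c.value() == 2, "two increments should give 2");
    }

    // Tiny stand-in for a JUnit assertion.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        testStartsAtZero();
        testIncrementAddsOne();
        System.out.println("all tests passed");
    }
}
```

Each new behavior would repeat the same small cycle: add one test method, watch it fail, then revise `Counter` until every test passes again.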
11Students can apply TDD in assignments and get
immediate, useful benefits
- Conceptually, easy for students to understand and relate to
- Increases confidence in code
- Increases understanding of requirements
- Preempts "big bang" integration
12The problem is devising an effective assessment
strategy
- Need to assess student performance at testing
- Need to give productive feedback
- Need to provide rapid turnaround
- Cannot afford huge increase in resources required
13Conventional automated assessment does not
encourage good testing habits
- Student uploads program
- Program is compiled
- Executed against test data
- Scored based on output
14The conventional approach provides useful
benefits that do lead to a cultural change
- Fast, precise feedback to students
- Chance(s) to improve based on feedback
- Good assessment of behavior
- Systematic use resulted in culture change
15But the conventional approach may discourage
desired behavior and skills
- Focus is on output correctness, first and foremost
- Get it working first, work on commenting, structure, etc. later
- Students not encouraged or rewarded for testing on their own
- Students often do less testing
16Proper grading and feedback can provide positive
incentive for desirable behavior
- Decide what behavior to foster
- Choose a corresponding scoring/reward system
- Design feedback approach
- Use students' adaptive nature to drive cultural change
17Proper grading and feedback is critical to
reinforcing desired behavior
- Assess test validity: correctness of students' tests
- Assess test completeness: the thoroughness of students' tests
- Assess program correctness: behavior of students' solution
- Multiply scores as percentages
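One way to read "multiply scores as percentages" is sketched below. The class and method names are hypothetical, and the exact formula Web-CAT applies may differ; the sample values in `main` are invented for illustration. The point of multiplying is that a low score in any one dimension pulls down the whole result.

```java
class TddScore {
    /**
     * Combines three scores, each given as a fraction between 0.0 and 1.0:
     * test validity, test completeness, and program correctness.
     */
    static double combined(double validity, double completeness, double correctness) {
        return validity * completeness * correctness;
    }

    public static void main(String[] args) {
        // Illustrative values only: all student tests pass (1.0),
        // test completeness is 0.80, and program correctness is 0.90.
        double score = combined(1.0, 0.80, 0.90);
        System.out.printf("combined score: %.0f%%%n", score * 100);
    }
}
```

With these sample values the product is 0.72, so a student with a fully correct program but no valid tests of their own would still score zero under this reading.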
18Students improve their code quality when using
Web-CAT
[Figure labels: newly written untested code; commercial-quality code]
19Students start earlier and finish earlier when
they use Web-CAT
20An evaluation of submitted code indicates
students program more effectively
Bold = p < .05 significance
                                   Without TDD   With TDD
Recorded grades                    90.2          96.1
TA assessment                      98.1          98.2
Automated grader assessment        76.8          94.0
Faults on master test suite        36.7          24.9
Projected Defects/KSLOC            70            38 (45% less!)
How early was first submission?    2.2 days      4.2 days
21After using TDD and Web-CAT, students clearly
perceive practical benefits
Survey ratings (scale runs from Agree to Disagree):
More helpful at detecting errors than Curator    4.3
Provides excellent support for TDD               4.1
Increases my confidence in correctness           3.9
Increases my confidence when making changes      3.8
Makes me test my solution more thoroughly        3.8
Makes me more systematic in devising tests       3.8
Would like to use, even if not required          3.8
22Student reactions are very positive toward TDD
- "I am very excited about using TDD."
- "I agree that TDD can be beneficial and I'm glad we are being required to experiment with it in this course."
- "If it increases the effectiveness of my programming and decreases the time I spend debugging, then I am all for it."
- "Previously, I had to quit my detailed testing and stick to making the program appear to work with the sample data given every time a deadline drew near. With TDD, the tests are such an integral part of the project that no time-conserving measure will save me."
23We use Web-CAT to automatically process student
submissions and check their work
- Web application written in 100% pure Java
- Deployed as a servlet
- Built on Apple's WebObjects
- Uses a large-grained plug-in architecture
internally, providing for easily extensible data
model, UI, and processing features
24Web-CAT's strengths are targeted at broader use
- Security: mini-plug-ins for different authentication schemes, global user permissions, and per-course role-based permissions
- Portability: 100% pure Java servlet for Web-CAT engine
- Extensibility: completely language-neutral, process-agnostic approach to grading, via site-wide or instructor-specific grading plug-ins
- Manual grading: HTML "web printouts" of student submissions can be directly marked up by course staff to provide feedback
25Grading plug-ins are the key to process
flexibility and extensibility in Web-CAT
- Processing for an assignment consists of a "tool chain" or pipeline of one or more grading plug-ins
- The instructor has complete control over which plug-ins appear in the pipeline, in what order, and with what parameters
- A simple and flexible, yet powerful way for plug-ins to communicate with Web-CAT and with each other
- We have a number of existing plug-ins for Java, C++, Scheme, Prolog, Pascal, Standard ML, and more
- Instructors can write and upload their own plug-ins
- Plug-ins can be written in any language executable on the server (we usually use Perl)
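The pipeline idea can be illustrated with a small sketch. Everything here is hypothetical: `GradingStep`, `GradingPipeline`, and the shared map of properties are invented for illustration and are not Web-CAT's actual plug-in API (real plug-ins are separate programs, usually Perl scripts). The sketch only shows ordered steps that pass information to one another.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in contract; NOT Web-CAT's actual API.
interface GradingStep {
    // Each step reads and updates a shared set of properties.
    void run(Map<String, String> properties);
}

class GradingPipeline {
    private final List<GradingStep> steps = new ArrayList<>();

    // The instructor chooses which steps appear and in what order.
    GradingPipeline add(GradingStep step) {
        steps.add(step);
        return this;
    }

    // Runs every step in order over a copy of the initial properties.
    Map<String, String> process(Map<String, String> initial) {
        Map<String, String> properties = new LinkedHashMap<>(initial);
        for (GradingStep step : steps) {
            step.run(properties);
        }
        return properties;
    }

    public static void main(String[] args) {
        GradingPipeline pipeline = new GradingPipeline()
            .add(p -> p.put("compiled", "true"))
            .add(p -> {
                // A later step can see what an earlier step recorded.
                if ("true".equals(p.get("compiled"))) {
                    p.put("testsRun", "true");
                }
            });
        System.out.println(pipeline.process(new LinkedHashMap<>()));
    }
}
```

Reordering or dropping a call to `add` changes the processing without touching any individual step, which is the flexibility the slide describes.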
26The most well-known plug-in is for grading Java
assignments that include student tests
- ANT-based build of arbitrary Java projects
- PMD and Checkstyle static analysis
- ANT-based execution of student-written JUnit tests
- Carefully designed Java security policy
- Clover test coverage instrumentation
- ANT-based execution of optional instructor reference tests
- Unified HTML web printout
- Highly configurable (PMD rules, Checkstyle rules, supplemental jar files, supplemental data files, Java security policy, point deductions, and lots more)
27Web-CAT supports a variety of languages, and its
Java plug-in is aimed at software testing
- ANT-based build of arbitrary Java projects
- PMD and Checkstyle static analysis
- ANT-based execution of student-written JUnit tests
- Carefully designed Java security policy
- Clover test coverage instrumentation
- ANT-based execution of optional instructor reference tests
- Unified HTML web printout
- Highly configurable (PMD rules, Checkstyle rules, supplemental jar files, supplemental data files, Java security policy, point deductions, and lots more)
28Web-CAT provides timely, constructive feedback on
how to improve performance
- Indicates where code can be improved
- Indicates which parts were not tested well enough
- Provides as many revise/resubmit cycles as possible
29The most important step in writing testable
assignments is
- Learning to write tests yourself
- Writing an instructor's solution with tests that thoroughly cover all the expected behavior
- Practice what you are teaching/preaching
30Students get frustrated without feedback, so
reference tests must provide some
- If students only get a score, but no other feedback for how to improve, they get easily frustrated
- We augment our reference tests to provide hints for failed tests, cross-referenced to the program assignment
Requirements in assignment spec: "mul: this command takes two arguments from the evaluation stack and multiplies them." (11)
Feedback to student on failed test: "Your testing does not fully cover (11)"
More detailed alternate feedback: "(11) mul command failed, expected 4 but received 8"
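A reference test that reports a hint keyed to a requirement number might look like the sketch below. `StackMachine` and `MulReferenceTest` are hypothetical names, and the plain-Java check stands in for what would normally be a JUnit reference test; only the two hint messages are taken from the slide.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical student class implementing the "mul" command.
class StackMachine {
    private final Deque<Integer> stack = new ArrayDeque<>();

    void push(int value) { stack.push(value); }

    // Takes two arguments from the evaluation stack and multiplies them.
    void mul() {
        int right = stack.pop();
        int left = stack.pop();
        stack.push(left * right);
    }

    int top() { return stack.peek(); }
}

class MulReferenceTest {
    // Returns null on success, or a hint keyed to requirement (11).
    static String checkMul(boolean detailed) {
        StackMachine m = new StackMachine();
        m.push(2);
        m.push(2);
        m.mul();
        int expected = 4;
        int actual = m.top();
        if (actual == expected) {
            return null;
        }
        return detailed
            ? "(11) mul command failed, expected " + expected + " but received " + actual
            : "Your testing does not fully cover (11)";
    }

    public static void main(String[] args) {
        String hint = checkMul(false);
        System.out.println(hint == null ? "mul: passed" : hint);
    }
}
```

The `detailed` flag reflects the choice between the terse hint and the more revealing alternate message: the instructor decides how much each failed test gives away.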
31Students will try to get Web-CAT to do their work
for them
- Students appreciate the feedback, but will avoid thinking at (nearly) all costs
- Too much feedback encourages students to use Web-CAT for testing instead of writing their own tests; they use it as a development tool instead of simply to check their work
- This limits the learning benefits, which come in large part from students writing their own tests
- Lesson: balance providing suggestive feedback without giving away the answers; lead the student to think about the problem
32We have also tried to influence student work
habits to improve their success
- Encourage early submission by providing extra incentives or using late penalties
- Score bonuses and/or penalties are easy
- Another useful approach:
  - Generous limit on the total number of submissions (60)
  - Hints disappear one day before the due date
  - Project closes for one day to encourage students to step away and reflect on the last bug
  - Project opens again for one day with hints re-enabled, but with a cap on how much the score can improve
33Lessons for writing program assignments intended
for automatic grading
- Requires greater clarity and specificity
- Requires you to explicitly decide what you wish to test, and what you wish to leave open to student interpretation
- Requires you to unambiguously specify the behaviors you intend to test
- Requires preparing a reference solution before the project is due, which means more upfront work for professors or TAs
- Grading is much easier, as many things are taken care of by Web-CAT; course staff can focus on assessing design
34Areas to look out for in writing testable
assignments
- How do you write tests for the following?
  - Main programs
  - Code that reads from or writes to stdin/stdout or files
  - Code with graphical output
  - Code with a graphical user interface
35Testing main programs
- The key: think in object-oriented terms
- There should be a principal class that does all the work, and a really short main program
- The problem is then simply how to test the principal class (i.e., test all of its methods)
- Make sure you specify your assignments so that such principal classes provide enough accessors to inspect or extract what you need to test
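The principal-class structure can be sketched as follows. `Tally`, `TallyMain`, and `TallyTest` are hypothetical names used only for illustration, and the test is plain Java standing in for a JUnit test.

```java
// Hypothetical principal class: it does all the work and exposes accessors.
class Tally {
    private int total = 0;
    private int entries = 0;

    void add(int amount) {
        total += amount;
        entries++;
    }

    // Accessors let a test inspect state without parsing printed output.
    int getTotal() { return total; }

    int getEntries() { return entries; }
}

// The main program only wires command-line arguments to the principal class.
class TallyMain {
    public static void main(String[] args) {
        Tally tally = new Tally();
        for (String arg : args) {
            tally.add(Integer.parseInt(arg));
        }
        System.out.println(tally.getTotal());
    }
}

// The test exercises the principal class directly and never calls main.
class TallyTest {
    static void testAddAccumulates() {
        Tally tally = new Tally();
        tally.add(3);
        tally.add(4);
        if (tally.getTotal() != 7 || tally.getEntries() != 2) {
            throw new AssertionError("expected total 7 over 2 entries");
        }
    }
}
```

Because `main` contains almost no logic, leaving it untested costs little; the accessors are what make the behavior checkable.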
36Testing input and output behavior
- The key: specify assignments so that input and output use streams given as parameters, and are not hard-coded to specific sources/destinations
- Then use string-based streams to write test cases; show students how
- In Java, we use BufferedReaders and PrintWriters for all I/O
- In C++, we use istreams and ostreams for all I/O
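A minimal sketch of the Java side of this approach, using `StringReader` and `StringWriter` from `java.io` as the string-based streams. The `Echoer` class and its behavior (upper-casing each line) are hypothetical stand-ins for an assignment's real processing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical assignment class: the I/O streams arrive as parameters.
class Echoer {
    // Copies each input line to the output in upper case.
    static void run(BufferedReader in, PrintWriter out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line.toUpperCase());
        }
        out.flush();
    }
}

class EchoerTest {
    // Feeds input from a string and captures the output in a string.
    static String runOn(String input) {
        StringWriter captured = new StringWriter();
        try {
            Echoer.run(new BufferedReader(new StringReader(input)),
                       new PrintWriter(captured));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return captured.toString();
    }

    public static void main(String[] args) {
        String expected = "HELLO" + System.lineSeparator();
        if (!runOn("hello\n").equals(expected)) {
            throw new AssertionError("unexpected output");
        }
        System.out.println("ok");
    }
}
```

In the real program, the same `run` method would simply be handed a `BufferedReader` wrapped around `System.in` and a `PrintWriter` wrapped around `System.out`, so the tested code and the shipped code are identical.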
37Testing programs with graphical output
- The key: if graphics are only for output, you can ignore them in testing
- Ensure there are enough methods to extract the key data in test cases
- We use this approach for testing Karel the Robot programs, which use graphic animation so students can observe behavior
38Testing programs with graphical UIs
- This is a harder problem, maybe too distracting for many students, depending on their level
- The key question: what is the goal in writing the tests? Is it the GUI you want to test, some internal behavior, or both?
- Three basic approaches:
  - Specify a well-defined boundary between the GUI and the core, and only test the core code
  - Switch in an alternative implementation of the UI classes during testing
  - Test by simulating GUI events
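The first two approaches can be combined in a small sketch: the core talks to the UI only through an interface, and a recording implementation is switched in during testing. All names here (`MessageView`, `LoginController`, `RecordingView`) are hypothetical, and the sketch does not cover the third approach of simulating GUI events.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical boundary between the core logic and the UI.
interface MessageView {
    void show(String message);
}

// Core class under test; it knows only the interface, not any GUI toolkit.
class LoginController {
    private final MessageView view;

    LoginController(MessageView view) { this.view = view; }

    void submit(String user) {
        if (user == null || user.isEmpty()) {
            view.show("Please enter a user name");
        } else {
            view.show("Welcome, " + user);
        }
    }
}

// Alternative implementation switched in for the real UI class during testing.
class RecordingView implements MessageView {
    final List<String> messages = new ArrayList<>();

    @Override
    public void show(String message) { messages.add(message); }
}

class LoginControllerTest {
    public static void main(String[] args) {
        RecordingView view = new RecordingView();
        new LoginController(view).submit("");
        if (!view.messages.get(0).equals("Please enter a user name")) {
            throw new AssertionError("empty name should prompt the user");
        }
        System.out.println("ok");
    }
}
```

In the running application, a GUI class implementing `MessageView` would display the message in a window; the controller code stays the same in both settings.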
39Conclusion: including software testing helps promote learning and performance
- If you require students to write their own tests, our experience indicates students are more likely to complete assignments on time, produce one-third fewer bugs, and achieve higher grades on assignments
- It is definitely more work for the instructor
- But it definitely improves the quality of programming assignment writeups and student submissions
40Visit our SourceForge project!
- http://web-cat.sourceforge.net/
- Info about using our automated grader, getting trial accounts, etc.
- Movies of making submissions, setting up assignments, and more
- Custom Eclipse plug-ins for C++-style TDD
- Links to our own Eclipse feature site