Title: Evaluating and Restructuring Science Assessments: An Example Measuring Students' Conceptual Understanding of Heat
Kelly D. Bradley, Jessica D. Cunningham, & Shannon O. Sampson
All authors contributed equally to this manuscript. Please address all inquiries to Kelly D. Bradley, 131 Taylor Education Building, Lexington, KY 40506.
Newton's Universe is supported by the National Science Foundation under Grant No. 0437768. For more information, see http://www.as.uky.edu/NewtonsUniverse/.
UNIVERSITY OF KENTUCKY, Department of Educational Policy Studies & Evaluation
- Conclusion
- Following the reconstruction process, the committee was asked to develop a new theoretical hierarchy of item difficulty based on the pilot results and any revisions made. Using the baseline assessment given in September 2006, the theoretical and empirical hierarchies of items will be compared again.
- A strength of this study is the partnership of science educators with researchers in educational measurement to construct a quality assessment.
- This study provides a model for assessing knowledge transferred to students through teacher training.
- Findings will support other researchers' attempts to link student performance outcomes to teacher training, classroom teachers' construction of their own assessments, and the continued growth of collaborative efforts between the measurement and science education communities.
- Method
- Response Frame
- The target population was middle school science students in the rural Appalachian regions of Kentucky and Virginia.
- Instrumentation
- A student assessment was constructed by the Newton's Universe research team to measure students' conceptual understanding of heat.
- The pilot assessment contained forty-one multiple-choice items.
- Data Collection
- The student assessment was piloted with a group of middle school students participating in a science camp during the summer of 2006.
- Data Analysis
- The dichotomous Rasch model was applied to the data.
- ZSTD fit statistics were considered acceptable between -2 and 2, indicating values within two standard deviations of the mean of zero (Wright & Masters, 1982).
- Items with negative point-measure correlations were flagged for review.
- The spread of items and students along the continuum was examined for gaps (a sketch of these screening checks follows below).
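To make the screening rules above concrete, here is a minimal Python sketch of the point-measure correlation check. The toy response matrix, person measures, item difficulties, and function names are illustrative assumptions, not the study's actual data or analysis, which would be run in dedicated Rasch software such as WINSTEPS (Linacre, 2005).

import numpy as np

def rasch_probability(theta, delta):
    """Dichotomous Rasch model: probability of a correct response
    given person ability theta and item difficulty delta (in logits)."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

def point_measure_correlations(responses, theta):
    """Correlate each item's 0/1 scores with the person measures.
    Negative values signal an item working against the construct."""
    return np.array([np.corrcoef(responses[:, i], theta)[0, 1]
                     for i in range(responses.shape[1])])

# Toy data: 5 persons x 3 items, scored 0/1 (illustrative only).
responses = np.array([[1, 0, 0],
                      [1, 1, 0],
                      [1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])
theta = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # person measures (logits)
delta = np.array([-0.5, 0.0, 0.5])             # item difficulties (logits)

# Residuals (observed minus expected) are the raw ingredients of the
# ZSTD fit statistics reported by Rasch software.
residuals = responses - rasch_probability(theta[:, None], delta[None, :])

pm = point_measure_correlations(responses, theta)
flagged = np.where(pm < 0)[0]  # items flagged for review
print("point-measure correlations:", np.round(pm, 2))
print("items flagged:", flagged)

In this toy data the first item correlates negatively with the person measures (the most able person missed it), so it is flagged, mirroring the rule applied to the pilot items.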
- Background
- Although many measurement and testing textbooks present classical test theory as the only way to determine the quality of an assessment (Embretson & Hershberger, 1999), Item Response Theory offers a sound alternative to the classical test theory approach.
- Reliability and various aspects of validity can be examined when applying the Rasch model (Smith, 2004).
- To examine reliability, Rasch measurement places person ability and item difficulty along a linear scale (see the model equation below). Rasch measurement produces a standard error (SE) for each person and item, specifying the range within which each person's true ability and each item's true difficulty fall.
- Rasch fit statistics, which are "derived from a comparison of expected response patterns and the observed patterns" (Smith, 2004, p. 103), can be examined to assess the content validity of the assessment.
- Bradley and Sampson (2006) applied a one-parameter Item Response Theory model, commonly known as the Rasch model, to investigate the quality of a middle school science teacher assessment and advised appropriate improvements to the instrument in an effort to ensure consistent and meaningful measures.
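For reference, the dichotomous Rasch model named above specifies the probability of a correct response as a logistic function of the difference between person ability and item difficulty:

\[
P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\]

where \(\theta_n\) is the ability of person n and \(\delta_i\) the difficulty of item i, both expressed in logits; when ability equals difficulty, the probability of success is 0.5.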
- Discussion
- The first item on the pilot student assessment was relocated to the fourth position in an effort to place an easier item first on the student assessment.
- The item flagged for a high outfit ZSTD statistic was reworded because test developers felt students were overanalyzing the question.
- The item with the negative point-measure correlation (item 13) was deleted because the committee found the item confusing overall.
- Item 19 was revised to replace item 18 on the student assessment since it tested the same concept.
- Item 23 was removed from the student assessment because the course does not adequately cover the concept tested.
- A more difficult foundations item was added to increase the span of foundations items along the ability continuum.
- To fill one potential gap in the item spread, item 24 was changed to make the question clearer and, in turn, less difficult.
- The answer choices of temperature points were changed to increase the difficulty of items 12 and 36.
- For items 3 and 5, the answer options were revised because, empirically, they were not functioning as expected as distracters.
- Items 4 and 40 were determined to be confusing for many higher-ability students, so adjustments were made.
References
Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Bradley, K. D., & Sampson, S. O. (2006). Utilizing the Rasch model in the construction of science assessments: The process of measurement, feedback, reflection and change. In X. Liu & W. Boone (Eds.), Applications of Rasch measurement in science education (pp. 23-44). Maple Grove, MN: JAM Press.
Embretson, S., & Hershberger, S. (1999). The new rules of measurement. Mahwah, NJ: Lawrence Erlbaum Associates.
Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Needham Heights, MA: Allyn & Bacon.
Linacre, J. (1999). A user's guide to Facets: Rasch measurement computer program. Chicago, IL: MESA Press.
Linacre, J. M. (2005). WINSTEPS: Rasch measurement computer program. Chicago, IL: Winsteps.com.
Smith, E. (2004). Evidence for the reliability of measures and validity of measure interpretation: A Rasch measurement perspective. In E. Smith & R. Smith (Eds.), Introduction to Rasch measurement (pp. 93-122). Maple Grove, MN: JAM Press.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press.
Wright, B. D., & Stone, M. H. (2004). Making measures. Chicago, IL: The Phaneron.
- Objectives
- Apply the dichotomous Rasch model to evaluate the quality of an assessment measuring students' conceptual understanding of heat
- Determine the fit of the data to the Rasch model
- Restructure the assessment based on results coupled with theory
- Results
- Person separation and reliability were 2.31 and 0.84, respectively. Item separation and reliability were 1.56 and 0.71 (see the consistency check below).
- Item 13 resulted in a negative point-measure correlation.
- The first item was empirically estimated as more difficult than the theoretical item hierarchy predicted.
- Potential gaps existed between item 28 and items 9 and 12, as well as between item 11 and items 17, 18, 21, 23, 24, 39, and 8.
- Four energy transfer items (18, 21, 23, 24) were located at the same difficulty level.
- Unexpected functioning of distracters occurred for items 4, 13, 14, 30, 32, 38, and 40.
- Items containing distracters not being used included 2, 3, 6, 12, 29, 31, 35, 36, 37, and 39.
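Separation (G) and reliability (R) are linked by a standard Rasch measurement relationship, R = G^2 / (1 + G^2), so the reported pairs can be cross-checked. The short Python sketch below uses only the values reported above; the function name is illustrative.

import math

# Separation G and reliability R satisfy R = G**2 / (1 + G**2),
# equivalently G = sqrt(R / (1 - R)).
def separation_from_reliability(r):
    return math.sqrt(r / (1.0 - r))

print(round(separation_from_reliability(0.84), 2))  # ~2.29 vs. reported 2.31
print(round(separation_from_reliability(0.71), 2))  # 1.56, matching the report

The small discrepancy on the person side (2.29 vs. 2.31) is consistent with rounding in the reported reliability.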
A special thanks to the Newton's Universe committee members who were integral in assessment development: Kimberly Lott, Rebecca McNall, Jeffrey L. Osborn, Sally Shafer, and Joseph Straley.
Submit requests to kdbrad2@uky.edu.