Ensuring quality and validity of measurement with SAM tests

1
Ensuring quality and validity of measurement with
SAM tests
  • Kardanova Elena
  • National Research University
  • Higher School of Economics

2
Outline of presentation
  • SAM test development process
  • Analysis of psychometric quality of test items
    and tests
  • SAM validity study
  • International expert review of SAM
  • Localization and adaptation of SAM tests for use
    in other countries and cultures

3
SAM test development process
4
Steps of test development process
  • Test planning
  • Content analysis
  • Test specification development
  • Item writing
  • Pilot testing
  • Test construction
  • Test results scaling
  • Test results reporting and interpretation
  • Test documentation

5
SAM theoretical model realization
  • Tests in mathematics and Russian language have
    been developed under the SAM model
  • The tests have a similar structure
  • The tests are designed for graduates of primary
    school
  • Each block includes three test items, assigned to
    levels 1, 2 and 3, that correspond to the same
    content area
  • Dichotomous scoring: students get 1 point for a
    correct answer and 0 for an incorrect (or missing)
    answer.

6
Stages of SAM tests piloting
  • Pre-piloting
    • Purpose: face validity
    • Time recording for each item
    • Sample: 10-20 students
  • Clinical approbation
    • Purpose: testing how the items function,
      detecting mistakes, estimating item difficulty
    • Sample: 50 students per item
    • Data analysis under classical test theory
      (CTT)
  • Full-scale approbation
    • Purpose: checking the quality of test items and
      detecting problems in item and test functioning
    • Sample: at least 400-500 students per test
      form
    • Data analysis under CTT and IRT

7
Analysis of psychometric quality of test items
and tests
8
Characteristics of test items under classical
test theory
  • Item difficulty: the proportion of students in the
    sample who completed the item correctly
  • Discrimination: the ability of an item to differentiate
    between students of different ability
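The two CTT characteristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response matrix is invented, the discrimination index is taken as the difference in proportion correct between the top and bottom 27% of students by raw score (a common convention that the slides do not confirm), and the point-biserial coefficient is computed against the uncorrected total score.

```python
from statistics import pstdev

def item_difficulty(responses, item):
    """Proportion of students who answered the item correctly (p-value)."""
    scores = [row[item] for row in responses]
    return sum(scores) / len(scores)

def discrimination_index(responses, item, fraction=0.27):
    """Proportion correct in the top group minus the bottom group.

    Groups are the top and bottom `fraction` of students by raw score;
    the 27% split is an assumption, not taken from the slides.
    """
    ranked = sorted(responses, key=sum)
    k = max(1, round(len(ranked) * fraction))
    low = [row[item] for row in ranked[:k]]
    high = [row[item] for row in ranked[-k:]]
    return sum(high) / k - sum(low) / k

def point_biserial(responses, item):
    """Correlation between the 0/1 item score and the total raw score."""
    totals = [sum(row) for row in responses]
    sd = pstdev(totals)
    correct = [t for t, row in zip(totals, responses) if row[item] == 1]
    incorrect = [t for t, row in zip(totals, responses) if row[item] == 0]
    if sd == 0 or not correct or not incorrect:
        return 0.0  # undefined when there is no variation
    p = len(correct) / len(totals)
    gap = sum(correct) / len(correct) - sum(incorrect) / len(incorrect)
    return gap / sd * (p * (1 - p)) ** 0.5

# Hypothetical data: 6 students (rows) by 3 dichotomous items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
for i in range(3):
    print(i, item_difficulty(responses, i),
          discrimination_index(responses, i),
          point_biserial(responses, i))
```

A higher p-value means an easier item; a discrimination value near zero or below it would flag an item for review.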


9
Reliability and validity
Reliability: a characteristic of the precision and
stability of test results
Validity: a characteristic of the suitability of test
information for decision making
10
Psychometric quality (CTT) (2012 pilot testing,
Mathematics, PP form, over 5000 fourth-graders)
                                     Test form 1   Test form 2
Number of examinees                  3018          2941
Raw score average                    26            27
Standard deviation                   8.37          8.55
Average difficulty level             0.59          0.61
Average discrimination index         0.44          0.46
Average point-biserial coefficient   0.39          0.39
Reliability index (KR-20)            0.90          0.91
Standard error of measurement        2.61          2.61
  • All items have good psychometric quality
  • Item p-values range from 0.16 to 0.98
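For readers who want to see how the last two rows of the table relate to the others, here is a minimal sketch of the standard KR-20 formula and of the standard error of measurement, SEM = SD * sqrt(1 - reliability). The response matrix is invented, and using the population variance of raw scores is an assumption; the slides do not say how the reported values were computed.

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomous (0/1) items."""
    n_students = len(responses)
    n_items = len(responses[0])
    total_variance = pvariance([sum(row) for row in responses])
    # Sum of p * (1 - p) over items, where p is the proportion correct.
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in responses) / n_students
        pq_sum += p * (1 - p)
    return n_items / (n_items - 1) * (1 - pq_sum / total_variance)

def standard_error_of_measurement(sd, reliability):
    """SEM in raw-score units."""
    return sd * (1 - reliability) ** 0.5

# Hypothetical data: 6 students by 3 items.
sample = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
print(kr20(sample))
# Values from the table for test form 1: SD 8.37, KR-20 0.90.
print(standard_error_of_measurement(8.37, 0.90))
```

With the rounded table values for test form 1 the formula gives about 2.65, close to the reported 2.61; a gap of that size would be consistent with KR-20 having been rounded to two decimals.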

11
Joint distribution of item difficulties and
discrimination indices (math, test form 1)
12
IRT analysis: variable map (math, test form 1)
[Figure: variable map placing persons and tasks on a
common scale (PERSON - MAP - TASKS). The visible fragment
lists items M-C-01-1-3, M-D-08-1-3, M-R-05-1-3, M-G-01-1-3
and M-M-11-1-3 just below 3 on the scale. Legend: the 3rd
level items, the 2nd level items, the 1st level items.]
13
Analysis under IRT: conclusions
  • Tests can be considered essentially
    unidimensional (principal component analysis of
    the standardized residuals (Linacre, 1998; Smith,
    2002) was used to confirm the unidimensionality
    of the data)
  • Tests have an optimal difficulty level and are well
    centered relative to the sample of examinees
  • All items demonstrate satisfactory psychometric
    characteristics and fit the model
  • SAM tests can be regarded as a high-quality
    measurement tool
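The slides do not name the IRT model. The variable map and the Linacre reference suggest a dichotomous Rasch model, so the sketch below assumes one. It shows how the probability of a correct answer depends on the gap between person ability and item difficulty (both in logits), and one simple way to express how well a test is centered on a sample. All numbers and the helper name `targeting_offset` are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def targeting_offset(abilities, difficulties):
    """Mean person ability minus mean item difficulty, in logits.

    A value near zero means the test is well centered on the sample.
    """
    mean_ability = sum(abilities) / len(abilities)
    mean_difficulty = sum(difficulties) / len(difficulties)
    return mean_ability - mean_difficulty

# Hypothetical logit values, not taken from the SAM data.
abilities = [-1.0, 0.0, 0.5, 1.5]
difficulties = [-2.0, -0.5, 0.5, 3.0]

print(rasch_probability(0.0, 0.0))  # 0.5 when ability equals difficulty
print(rasch_probability(1.0, 3.0))  # a hard item for this student
print(targeting_offset(abilities, difficulties))
```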

14
SAM validity study
15
Description of the SAM validity study
  • Validity is the extent to which a test fulfils
    its purpose
  • The Dutch rating system was chosen as a basis for
    conducting the SAM validity study (Evers, A.,
    2001)
  • The SAM validity study was conducted during the 2011-2013
    SAM pilot testing in different regions of the
    Russian Federation

16
Structure of the SAM validity study
  • Content validity: external expert review
  • Construct validity: What does the test measure?
    Does the test measure the intended concept, or
    does it partly or mainly measure something else?

17
Construct validity
  • Construct validity is a matter of the
    accumulation of research evidence.
  • Construct validation research is never completed.

18
Evidence for fair test use (DIF analysis by
gender)
  • Test results for both genders (mathematics, test
    form 1)
                                     Females       Males
Sample size                          1471          1545
Observed raw score average (SD)      26.7 (8.4)    26.2 (8.3)
Ability estimate average (SD)        0.76 (1.15)   0.69 (1.11)

Method: t-test and Mantel-Haenszel statistics
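As an illustration of the Mantel-Haenszel part of the method, the sketch below computes the common odds ratio for one item, matching students on total raw score. It is a generic textbook version, not the procedure actually run on the SAM data: the group labels "ref" and "focal", the record layout and the function names are assumptions, and which gender served as the reference group is not stated in the slides.

```python
import math
from collections import defaultdict

def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across total-score strata.

    records: iterable of (group, total_score, item_score), where group is
    "ref" or "focal" (hypothetical labels) and item_score is 0 or 1.
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, total, item in records:
        index = (0 if group == "ref" else 2) + (0 if item == 1 else 1)
        tables[total][index] += 1
    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator if denominator else float("nan")

def mh_delta(odds_ratio):
    """Odds ratio rescaled to the ETS delta metric (MH D-DIF)."""
    return -2.35 * math.log(odds_ratio)

# Hypothetical records for a single item, two score strata.
records = (
    [("ref", 10, 1)] * 3 + [("ref", 10, 0)] * 1
    + [("focal", 10, 1)] * 2 + [("focal", 10, 0)] * 2
    + [("ref", 20, 1)] * 4 + [("ref", 20, 0)] * 1
    + [("focal", 20, 1)] * 4 + [("focal", 20, 0)] * 1
)
alpha = mantel_haenszel_odds_ratio(records)
print(alpha, mh_delta(alpha))
```

An odds ratio close to 1 (delta close to 0) indicates that students of equal total score have similar chances on the item regardless of group.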
19
Testing of hypotheses that follow from the
theoretical foundation of the test construct
  • The first hypothesis
    • The items of the three levels that belong to the same
      block and meet the theoretically grounded
      criteria of the three levels should form a
      difficulty-based hierarchy
  • The second hypothesis
    • Towards the end of primary school the
      syllabus is expected to be acquired at the 2nd,
      reflexive, level. Acquiring this syllabus at the
      3rd, functional, level is expected to happen
      towards the end of middle school.

20
The first hypothesis: the items of the three levels
that belong to the same block and meet the
theoretically grounded criteria of the three levels
should form a difficulty-based hierarchy.
[Figure: distribution of difficulty levels (math, test
form 1)]
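One simple way to check the first hypothesis block by block is to compare the proportion correct of the three items in each block: the level 1 item should be the easiest and the level 3 item the hardest. The sketch below assumes difficulties are expressed as CTT p-values (higher means easier); with IRT difficulties in logits the inequality would be reversed. Block names and values are hypothetical.

```python
def hierarchy_violations(block_p_values):
    """Return the blocks whose items do not satisfy p1 > p2 > p3.

    block_p_values maps a block label to the proportions correct of its
    level 1, level 2 and level 3 items, in that order.
    """
    return [
        block
        for block, (p1, p2, p3) in block_p_values.items()
        if not (p1 > p2 > p3)
    ]

# Hypothetical p-values for three blocks.
blocks = {
    "block_01": (0.90, 0.62, 0.31),
    "block_02": (0.80, 0.85, 0.40),  # level 2 item easier than level 1
    "block_03": (0.75, 0.50, 0.16),
}
print(hierarchy_violations(blocks))
```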
21
Verification of the second hypothesis
  • A special study was conducted in 2011-2012.
  • Research design: in 2011 the tests were
    administered to four age groups: students of the
    4th, 6th, 8th and 10th grades. One year later the
    same tests were administered to the same students,
    who were by then in the 5th, 7th,
    9th and 11th grades.
  • Testing was done in spring, at the end of the
    academic year.
  • The sample included about 100 examinees in each
    grade.

22
The second hypothesis: towards the end of primary
school the syllabus is expected to be acquired at
the 2nd, reflexive, level. Acquiring this syllabus
at the 3rd, functional, level is expected to happen
towards the end of middle school.
[Figure: distribution of students in different grades
by proficiency level in mathematics]
23
Criterion validity
  • Concurrent validity
  • Predictive validity
Predictive validity shows how well a test can
predict future criterion scores. Concurrent
criterion validity answers the question of how test
results are related to a criterion at present.
24
SAM predictive validity study: research design
  • The study was based on SAM pilot testing in the
    Krasnoyarsk region in spring 2011.
  • The total sample was 941 primary school students from
    12 schools.
  • The same students' marks were gathered one year
    later (when they were studying in the 5th grade).

[Figure: student distribution into proficiency levels
(mathematics)]

25
SAM predictive validity study: distribution of
student marks depending on student proficiency
level (mathematics)
  • The correlation between the students'
    ability scores and their school marks is 0.6, and
    the correlation between their proficiency levels
    and their school marks is 0.56.
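The slide does not say which correlation coefficient was used. As a minimal sketch, assuming a plain Pearson correlation between ability estimates and school marks, the computation looks like this. The data are invented; since school marks are ordinal, a rank-based coefficient would be an equally plausible choice.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5

# Hypothetical ability estimates (logits) and school marks one year later.
ability = [-0.8, -0.1, 0.3, 0.9, 1.6, 2.1]
marks = [3, 3, 4, 3, 5, 4]
print(pearson(ability, marks))
```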

26
Convergent validity
  • Convergent validity refers to the degree to which
    two measures of a construct that theoretically
    should be related are in fact related.
  • To establish convergent validity we used the AT test,
    an instrument for monitoring the educational
    achievements in mathematics of primary school
    students (developed by the Russian Academy of
    Education).
  • Among students who completed the AT test, students
    with high test scores were selected.
  • The hypothesis tested: the results of these
    students on SAM tests should be high, and most of
    them should fall into the 2nd and 3rd proficiency
    levels.

27
International expert review of SAM
  • Autumn of 2013
  • The reviewers
  • Howard T. Everson (Center for Advanced Study in
    Education, Graduate School, City University of
    New York, Professor of Psychology and Senior
    Research Fellow)
  • Clancy Blair (New York University, Steinhardt
    School of Culture, Education, and Human
    Development, Professor of Applied Psychology)
  • Bas Hemker (Netherlands, Cito National Institute
    for Test Development, Senior Research Scientist)

28
The SAM test documents provided for expert review
  • Basic
    • Technical manual
    • User's guide
    • Test specification
    • Math tests
    • Validity study
  • Additional
    • SAM Framework
    • Technical report

29
International expert review of SAM: conclusions
  • The SAM toolkit was generally well received by the experts
  • All experts point out the scope of the research
    aimed at establishing the quality of the SAM toolkit
    and validating it
  • Experts call for further research related to SAM
    application
  • Experts stress the need for further research
    related to SAM validation, particularly
    longitudinal studies: for instance, studying the
    correlation between SAM results
    and other cognitive and non-cognitive
    measurements of students, analysing the factors
    that affect the results, etc.

30
Localization and adaptation of SAM tests for use
in other countries and cultures
31
Procedure of localization
  • Dual translation
  • Verification of translation by national experts
  • Verification of translation by SAM developers
  • Ensuring the quality of translated tests:
    piloting, analysis of the psychometric
    characteristics of test items, comparison of
    item characteristics across languages and
    cultures, reliability and validity study

32
AERA/APA/NCME standards
  • Standard 6.2. When test developers introduce
    significant amendments into the test format, the
    time frame, the language or the contents, it is
    essential to validate the test and confirm the
    validity of the localized test, or to establish
    that the validation procedure is impossible or
    irrelevant
  • Standard 13.4. When translating a test from one
    language or dialect into another, it is required
    to establish the validity and reliability of the
    test for the linguistic community it is intended for
  • Standard 13.6. If two versions of the test in
    different languages are expected to be
    equivalent, comparable forms, it is required to
    present evidence of the comparability and
    equivalence of the forms.

33
Five main sources of potential non-comparability
of cross-cultural results
  • differences in construct
  • tool differences
  • procedure differences
  • sample differences
  • answering differences

34
Thank you!
  • Kardanova Elena
  • ekardanova_at_hse.ru
  • Center for Monitoring of the Quality in Education
  • Institute of Education
  • National Research University Higher School of
    Economics
  • http://ioe.hse.ru/monitoring/