Laura Hamilton, Stanford University
Construct validity of constructed-response assessments: Male and female high school science performance
FINAL REPORT:
This study addressed the validity of two kinds of science achievement test items: multiple-choice items, which have traditionally appeared on large-scale measures, and brief constructed-response items, which are increasingly being included in such measures. I was particularly interested in addressing two claims put forth by many critics of standardized achievement tests: (1) open-ended items measure reasoning in a more valid way than do multiple-choice items; and (2) the gender differences often observed on science achievement tests could be reduced by switching to an open-ended format. The study combined statistical analyses of the NELS:88 data with interviews of students completing the test items.
This study revealed that science achievement, whether measured by multiple-choice or open-ended items, is complex and multidimensional. Differences in the constructs measured by test items within a format may in some cases be more important than differences between formats, so the question of which format is better is simplistic. Analyses of both types of items on NELS:88 revealed that the tests could be broken down into components that varied in their relationships with student background variables. In particular, subsets of items on both tests were characterized by spatial reasoning demands, and conclusions about the relative performance of males and females depend on whether these items are considered relevant to science achievement.
This study was not intended to provide conclusive information about influences on achievement, nor should it be interpreted as an endorsement of a particular format. Instead, it illuminates differences among items and between formats, and it calls attention to the value of a careful validity investigation that combines evidence from multiple sources. It is especially important for test developers and users to attend to task characteristics, such as the specificity of instructions, that may influence performance. Most research relating school inputs or processes to achievement has taken the outcome measure at face value and assumed that a multiple-choice test is an adequate indicator of the value added by schooling. As constructed-response items become more prevalent in large databases, we can expect these tests to be used increasingly as outcome measures in such studies. Efforts to interpret the results of studies conducted using large databases should involve a careful examination of the outcome measure and an evaluation of whether a total score is appropriate.