Dr Matteo Madotto, Lecturer in Economics, University of Sussex Business School.[1]
Introduction
When designing multiple-choice quizzes (MCQs), an important decision is whether or not to apply negative marking to incorrect answers. The main rationale for penalizing wrong answers is to discourage “guessing”, i.e. situations where students are very uncertain about the correct alternative and answer more or less at random in the hope of getting it right by chance. Indeed, without negative marking rational students would have an incentive to attempt all questions, even those where they have absolutely no clue about the correct answer, since they would always get a positive expected score (e.g. Budescu and Bar-Hillel, 1993; Prieto and Delgado, 1999; Bereby-Meyer et al., 2002; Betts et al., 2009; Lesage et al., 2013; Akyol et al., 2016). On the other hand, one of the main concerns with negative marking is that it may end up discriminating against female students. Evidence suggests that females tend to be more risk averse than males, which, for an equivalent level of knowledge, may lead them to answer fewer questions and be unfairly disadvantaged when negative marking is applied (e.g. Burton, 2005; Espinosa and Gardeazabal, 2010; Lesage et al., 2013; Akyol et al., 2016).
In this short article, I present the results of five MCQs with negative marking taken by 900 undergraduate students at the University of Sussex Business School between 2021 and 2023, and analyze how these tests performed along the two main dimensions highlighted above, i.e. guessing and gender bias.
All quizzes contained 20 open-book questions, each of which had 4 alternatives with only one correct answer. The order of both questions and answers was randomized to reduce collusion among students. Each correct answer was worth 5 marks, each unanswered question 0 marks, and each incorrect answer -2 marks. The overall score was computed as the sum of the marks, with a minimum floor of 0. Students were made aware of this marking scheme beforehand. However, they were given no strategic advice on when to attempt a question, so as not to bias their choices in either direction. The -2 penalty ensured that a student with absolutely no clue about the correct answer to a question, i.e. one who assigned an equal probability to each of the alternatives, would get an expected mark of approximately 0 (specifically -0.25) by answering the question, as is typically considered appropriate for MCQs with negative marking (e.g. Budescu and Bar-Hillel, 1993; Prieto and Delgado, 1999; Bereby-Meyer et al., 2002; Lesage et al., 2013).
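For a question with four alternatives under this scheme, the expected mark from answering purely at random works out as follows:

```latex
% Expected mark of a blind guess: one alternative in four is correct (+5),
% three in four are incorrect (-2).
E[\text{mark}] = \frac{1}{4}(5) + \frac{3}{4}(-2) = 1.25 - 1.5 = -0.25 \approx 0
```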
Guessing and gender bias
To determine whether random guessing remains an issue even when negative marking is in place, we look at two averages per test: the ratio between the percentages of students who selected the most popular and the second most popular incorrect alternative to each question, and the ratio between the percentages who selected the most and the least popular incorrect alternative (a computational sketch of these ratios follows Table 1). If students assigned an equal probability to all alternatives and answered completely at random, both ratios would be approximately equal to 1. As Table 1 shows, however, this is far from the case for either males or females, regardless of the difficulty of the test.[2] On the contrary, most questions have both a popular incorrect alternative, which appears plausible to a relatively large number of students, and a very unpopular one, which is chosen by few of them. Specifically, from Table 1 we see that in four of the five tests the most popular incorrect alternative to each question is chosen by a percentage of students that is on average about 4 to 10 times larger than that of the second most popular alternative, and 6 to 16 times larger than that of the least popular one.[3] In only one quiz are the two ratios substantially lower (more on this below). Of course, it is not possible to determine here how much of this is due to the negative marking itself; nevertheless, one of the main apprehensions surrounding MCQs, i.e. guessing by students, appears rather limited when such a marking scheme is implemented. Students who decide to answer and choose the wrong alternative seem to do so out of incorrect knowledge rather than no knowledge at all.
Table 1: Cohort sizes, average scores, and average popularity ratios of incorrect answers, by test

Test number | Number of males | Number of females | Average score (out of 100) | Score standard deviation | Average ratio between % of most popular and second most popular incorrect answers – Males | Average ratio between % of most popular and second most popular incorrect answers – Females | Average ratio between % of most and least popular incorrect answers – Males | Average ratio between % of most and least popular incorrect answers – Females |
---|---|---|---|---|---|---|---|---|
1 | 189 | 93 | 63 | 22.5 | 6.3 | 7.3 | 16.4 | 14.5 |
2 | 179 | 77 | 39 | 23.8 | 4.4 | 4.4 | 7.5 | 10.2 |
3 | 75 | 27 | 56 | 18.2 | 10.4 | 4.8 | 10.4 | 8.6 |
4 | 98 | 35 | 53 | 23.2 | 5.5 | 3.7 | 8.2 | 5.7 |
5 | 93 | 34 | 56 | 21.3 | 2.4 | 1.9 | 6.0 | 3.2 |
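As a concrete illustration of how the ratios in Table 1 are obtained, below is a minimal Python sketch. The data structure, function names, and example counts are assumptions for illustration, not the actual quiz data; note that ratios of response counts equal ratios of percentages, since the denominators cancel.

```python
# Minimal sketch: average popularity ratios of incorrect alternatives (cf. Table 1).
# Assumes we have, for each question, the counts of students who chose each of
# the three incorrect alternatives; these would come from the quiz platform.

def popularity_ratios(incorrect_counts):
    """For one question, return the (most/second most, most/least) ratios,
    or None where the denominator is zero (such questions are excluded,
    as noted in footnote [3])."""
    counts = sorted(incorrect_counts, reverse=True)
    most, second, least = counts[0], counts[1], counts[-1]
    r1 = most / second if second > 0 else None
    r2 = most / least if least > 0 else None
    return r1, r2

def average_ratios(per_question_counts):
    """Average each ratio across a test's questions, skipping exclusions."""
    r1s, r2s = [], []
    for counts in per_question_counts:
        r1, r2 = popularity_ratios(counts)
        if r1 is not None:
            r1s.append(r1)
        if r2 is not None:
            r2s.append(r2)
    return sum(r1s) / len(r1s), sum(r2s) / len(r2s)

# Hypothetical counts for three questions (three incorrect alternatives each).
test_counts = [(40, 8, 3), (25, 5, 5), (18, 2, 1)]
print(average_ratios(test_counts))  # illustrative output only
```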
Turning to the second main question of the article, we analyze whether MCQs with negative marking are discriminatory against females. Data on the gender of individual students were not available, so we used students’ names as a proxy for their gender. Summary statistics and two-tailed t-tests for total scores are shown in Table 2, while those for the number of unanswered questions are in Table 3; a sketch of the test computation follows Table 3. In four of the five quizzes, neither the scores nor the unanswered questions of females differed significantly from those of males at any conventional significance level. In one quiz, however, females performed worse than males at the 1% significance level and left more questions unanswered at the 10% significance level.
Table 2: Total scores by gender, with two-tailed t-tests

Test number | Male average score | Female average score | Male score standard deviation | Female score standard deviation | t | p-value |
---|---|---|---|---|---|---|
1 | 62.7 | 62.7 | 21.0 | 25.5 | 0.004 | 0.997 |
2 | 38.4 | 41.0 | 24.4 | 21.8 | -0.825 | 0.411 |
3 | 54.9 | 58.5 | 17.2 | 21.2 | -0.782 | 0.439 |
4 | 53.6 | 52.7 | 22.2 | 26.4 | 0.180 | 0.858 |
5 | 59.5 | 47.5 | 20.0 | 22.5 | 2.734 | 0.009 |
Table 3: Unanswered questions by gender, with two-tailed t-tests

Test number | Male average of non-responses | Female average of non-responses | Male standard deviation of non-responses | Female standard deviation of non-responses | t | p-value |
---|---|---|---|---|---|---|
1 | 0.8 | 1.1 | 1.6 | 2.3 | -1.023 | 0.308 |
2 | 1.9 | 2.2 | 2.9 | 2.3 | -0.702 | 0.483 |
3 | 0.4 | 0.6 | 1.0 | 1.1 | -0.854 | 0.398 |
4 | 1.1 | 1.5 | 1.7 | 1.8 | -1.058 | 0.295 |
5 | 1.5 | 3.0 | 2.8 | 4.2 | -1.895 | 0.065 |
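The two-tailed t-tests in Tables 2 and 3 can be reproduced with standard tools. The article does not state which variant was used; the sketch below, with purely illustrative data, applies Welch’s unequal-variance version via SciPy:

```python
# Sketch: two-tailed t-test comparing male and female scores (cf. Tables 2-3).
# The score lists are illustrative placeholders, not the actual Sussex data.
from scipy import stats

male_scores = [62, 71, 48, 55, 80, 39, 67]    # hypothetical
female_scores = [58, 66, 51, 44, 73, 60]      # hypothetical

# equal_var=False gives Welch's t-test, which does not assume equal
# variances; set equal_var=True for the classical pooled-variance test.
t_stat, p_value = stats.ttest_ind(male_scores, female_scores, equal_var=False)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```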
A possible trade-off
As Tables 2 and 3 show, one of the most common concerns about MCQs with negative marking, i.e. that they may discriminate against female students, does not appear substantiated in most of our cases. However, comparing the results in these two tables with those in Table 1, we see that the only quiz in which females performed significantly worse than males and left more questions unanswered (namely test 5) is exactly the one in which students seemed most uncertain about the correct answers, as measured by the relatively low values of the two ratios in Table 1. It may therefore be the case that gender bias occurs precisely in those situations where random guessing is more likely and hence negative marking would be most useful. This may be because differences in students’ risk attitudes start playing a role exactly when students are sufficiently uncertain about the correct answer, i.e. when they assign similar probabilities to all alternatives.
To avoid this trade-off, it may be sensible to design questions so that at least one of the alternatives appears highly unlikely to students who possess a minimum level of knowledge, allowing them to assign higher probabilities to the remaining options. In this way, negative marking would discourage random guessing by students with very little knowledge, without excessively reducing the incentives of more knowledgeable students to answer, regardless of their risk attitude.
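To see the arithmetic behind this, note that under the marking scheme above a student who can confidently rule out just one of the four alternatives, and guesses uniformly among the remaining three, already faces a positive expected mark:

```latex
% One alternative eliminated: guess among three, one of which is correct.
E[\text{mark}] = \frac{1}{3}(5) + \frac{2}{3}(-2) = \frac{5}{3} - \frac{4}{3} = \frac{1}{3} \approx 0.33 > 0
```

Only a student unable to eliminate any alternative faces a negative expected mark, so the penalty bites exactly where guessing would be blind.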
References
Akyol, S. P., Key, J. and Krishna, K. (2016) “Hit or miss? Test taking behavior in multiple choice exams” NBER Working Paper 22401.
Bereby-Meyer, Y., Meyer, J. and Flascher, O. M. (2002) “Prospect theory analysis of guessing in multiple choice tests” Journal of Behavioral Decision Making, 15(4), 313-327.
Betts, L. R., Elder, T. J., Hartley, J. and Trueman, M. (2009) “Does correction for guessing reduce students’ performance on multiple-choice examinations? Yes? No? Sometimes?” Assessment & Evaluation in Higher Education, 34(1), 1-15.
Budescu, D. and Bar-Hillel, M. (1993) “To guess or not to guess: a decision-theoretic view of formula scoring” Journal of Educational Measurement, 30(4), 277-291.
Burton, R. F. (2005) “Multiple-choice and true/false tests: myths and misapprehensions” Assessment & Evaluation in Higher Education, 30(1), 65-72.
Espinosa, M. P. and Gardeazabal, J. (2010) “Optimal correction for guessing in multiple-choice tests” Journal of Mathematical Psychology, 54(5), 415-425.
Lesage, E., Valcke, M. and Sabbe, E. (2013) “Scoring methods for multiple choice assessment in higher education – Is it still a matter of number right scoring or negative marking?” Studies in Educational Evaluation, 39(3), 188-193.
Prieto, G. and Delgado, A. R. (1999) “The effect of instructions on multiple-choice test scores” European Journal of Psychological Assessment, 15(2), 143-150.
[1] I would like to thank Ana Carolina Tereza Ramos de Oliveira dos Santos for her excellent work as a research assistant.
[2] It is often hard to properly calibrate the level of difficulty of an MCQ, especially when it is administered for the first time, and indeed one of the tests turned out to be very difficult for students. Of course, similar issues can occur with or without negative marking. The presence of the latter, however, tends to amplify the impact of miscalibration to a certain extent.
[3] The average ratios in Table 1 can actually be thought of as lower bounds, since they are computed excluding those questions for which the denominator of the ratio would have involved an answer not chosen by any student.