Report on Descriptive Statistics and Item Analysis of Objective Test Items on Data Extracted From the Grade 12 Final English Second Language Exam 2008.
by Stephan Freysen
Prof. T Kuhn CIA 722 7 April 2008
Acknowledgements
I would like to express my sincere gratitude to the Gauteng Department of Education for their professional and cooperative manner. The data gathering for this report would have been far more gruelling had it not been for the selfless assistance of Mr. Y Zafir and Ms. L Bongani.
I would also like to thank Prof. Knoetze for tabulating the test data. Thanks to Prof. Kuhn for setting up a template with formulas. It has been a great help.
Descriptive Abstract
This report is written so that judgement can be passed on the reliability of the multiple-choice test in the Grade 12 English second language final exam.
Table of contents
Acknowledgements
Descriptive abstract
List of Tables
List of Figures
Terminology list
1. Introduction and purpose
2. Test analysis
3. Item analysis
4. Conclusion
Bibliography
Appendix A: Test Data
List of Tables
Table 2.1: Tabulated Test Scores
Table 2.2: Measure of Central Tendency
Table 2.3: Frequency Distribution
Table 2.4: Test scores with pq values
Table 3.1: Item Difficulty Indices
Table 3.2: Item Discrimination Indices
List of Figures
Figure 2.1: Histogram of Frequency
Figure 2.2: Polygon of Frequency
Figure 2.3: Ogive of Frequency
Figure 4.1: Percentage of acceptability
Terminology List
Descriptive Statistics
The term used to refer to the mode, median and mean.
Difficulty Index
“Proportion of students who answered the item correctly.” Borich & Kubiszyn (2007: 205)
Discrimination Index
“Measure of the extent to which a test item discriminates or differentiates between students who do well on the overall test and those who do not do well on the overall test.” Borich & Kubiszyn (2007: 205)
Mean
The average of a set of numbers
Median
The score that splits the distribution in half. Borich & Kubiszyn (2007: 259)
Mode
The score that appears most frequently in a set of scores. Borich & Kubiszyn (2007: 264)
Quantitative Item Analysis
“A numerical method for analyzing test items employing student response alternatives or options.” Borich & Kubiszyn (2007: 205)
Reliability
Refers to the internal consistency of a test. Borich & Kubiszyn (2007: 318)
Standard Deviation
“The estimate of variability that accompanies the mean in describing a distribution.” Borich & Kubiszyn (2007: 272)
1. Introduction
Objective test items are a very popular tool for testing knowledge, and one of the most popular types is the multiple-choice format. According to Borich & Kubiszyn (2007: 116), multiple-choice items are unique in that they allow you to measure knowledge at higher levels of Bloom’s taxonomy than other objective test items. This presents a problem, as assessors often do not consider any academic guidelines when setting these questions. The result is that the items differ vastly from one another in difficulty and often present unrealistic discrimination indices. Borich & Kubiszyn (2007: 205)
The purpose of this report is to analyse the multiple-choice test item data that was extracted from the final English second language grammar exam of 2008. This will be achieved through analysis of the measures of central tendency and variability of the data. The first part of the analysis considers the question (test) as a whole. The second part consists of individual item analysis.
The data comprises the answers that twenty-five learners gave to twenty questions. This is a small sample group, but it should provide enough critique on the multiple-choice section of the exam to offer a detailed overview of the test’s reliability. The findings in the report will be used to determine whether the multiple-choice test items in the exam were of adequate and fair difficulty.
2. Test Analysis
Descriptive Test Analysis
In quantitative analysis, the first step is to tabulate the raw test scores. According to Borich & Kubiszyn (2007: 204), this type of analysis is ideal for multiple-choice tests. Table 2.1 lists the test scores sorted in ascending order.
Learner   Percentage of items correct
L19       20
L1        35
L17       40
L24       40
L7        45
L15       45
L22       45
L6        50
L21       50
L10       55
L18       60
L4        65
L8        65
L9        65
L23       65
L5        70
L12       70
L13       80
L14       80
L20       80
L25       80
L2        85
L3        85
L11       95
L16       95
Table 2.1: Tabulated Test Scores
Table 2.1 shows the spread of the lower, middle and higher scores. With a pass mark of 40%, only two learners failed this test, while eight learners obtained a distinction.
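The two counts above can be checked with a short script. This is a minimal sketch: the scores are copied from table 2.1, the pass mark of 40% is the one stated in the report, and the distinction mark of 80% is an assumption, chosen because it reproduces the eight distinctions mentioned.

```python
# Percentage scores from table 2.1, in ascending order.
scores = [20, 35, 40, 40, 45, 45, 45, 50, 50, 55, 60, 65, 65, 65, 65,
          70, 70, 80, 80, 80, 80, 85, 85, 95, 95]

PASS_MARK = 40         # stated in the report
DISTINCTION_MARK = 80  # assumption: the report does not state this value

# A learner fails when the score is below the pass mark.
failed = sum(1 for s in scores if s < PASS_MARK)
# A learner obtains a distinction at or above the distinction mark.
distinctions = sum(1 for s in scores if s >= DISTINCTION_MARK)

print(f"learners={len(scores)} failed={failed} distinctions={distinctions}")
```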
The measures of central tendency for the test scores in table 2.1 can be seen in table 2.2.
Mean   Median   Mode     Standard Deviation
62.6   65       65, 80   19.35
Table 2.2: Measure of Central Tendency
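The descriptive statistics can be recomputed from the scores in table 2.1. The sketch below uses only Python's standard `statistics` module. Run on the table 2.1 scores, it gives a mean of 62.6, a median of 65, modes of 65 and 80, and a population standard deviation of about 19.35. The sample standard deviation (dividing by n - 1) is shown for comparison only.

```python
import statistics

# Percentage scores from table 2.1.
scores = [20, 35, 40, 40, 45, 45, 45, 50, 50, 55, 60, 65, 65, 65, 65,
          70, 70, 80, 80, 80, 80, 85, 85, 95, 95]

mean = statistics.mean(scores)
median = statistics.median(scores)
modes = statistics.multimode(scores)       # every score tied for most frequent
sd_population = statistics.pstdev(scores)  # divides by n
sd_sample = statistics.stdev(scores)       # divides by n - 1

print(f"mean={mean:.1f} median={median} modes={modes}")
print(f"population SD={sd_population:.2f} sample SD={sd_sample:.2f}")
```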
The distribution is bimodal, as the scores 65% and 80% occur equally often (four times each). Most scores are above the mean. The next step is to group the scores in table 2.1 into intervals in order to determine a simple frequency distribution. Table 2.3 shows the intervals, their lower and upper limits, their mid values, the frequency and the cumulative frequency.
Interval   Lower limit   Upper limit   Mid value   Frequency   Cumulative frequency
20-26      20            26            23          1           1
27-33      27            33            30          0           1
34-40      34            40            37          3           4
41-47      41            47            44          3           7
48-54      48            54            51          2           9
55-61      55            61            58          2           11
62-68      62            68            65          4           15
69-75      69            75            72          2           17
76-82      76            82            79          4           21
83-89      83            89            86          2           23
90-96      90            96            93          2           25
Table 2.3: Frequency Distribution
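The grouping into intervals can be sketched as follows. The interval width of 7 and the starting value of 20 are read off the limits in table 2.3. The cumulative frequency is a running total of the frequencies, so it ends at 25, the number of learners.

```python
# Percentage scores from table 2.1.
scores = [20, 35, 40, 40, 45, 45, 45, 50, 50, 55, 60, 65, 65, 65, 65,
          70, 70, 80, 80, 80, 80, 85, 85, 95, 95]

LOWEST, WIDTH, N_INTERVALS = 20, 7, 11  # read off the limits in table 2.3

rows = []
cumulative = 0
for i in range(N_INTERVALS):
    lower = LOWEST + i * WIDTH
    upper = lower + WIDTH - 1
    mid = (lower + upper) // 2
    # Count the scores that fall inside the closed interval [lower, upper].
    freq = sum(1 for s in scores if lower <= s <= upper)
    cumulative += freq
    rows.append((lower, upper, mid, freq, cumulative))

for lower, upper, mid, freq, cum in rows:
    print(f"{lower}-{upper}  mid={mid}  f={freq}  cf={cum}")
```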
Graphic Representation
The frequency can be graphically represented as follows:
Figure 2.1: Histogram of Frequency
In figure 2.1, we can see that one learner scored between 20% and 26%. Three learners scored between 34% and 40%. As the cut-off for passing is 40%, this graph unfortunately does not show how many of those three passed. Three more learners scored between 41% and 47%. Two learners scored between 48% and 54%. Another two learners scored between 55% and 61%. Four learners scored between 62% and 68%. Two learners scored between 69% and 75%. Four learners scored between 76% and 82%. Four learners scored above 82%. More than four of the learners obtained distinctions. If we consider table 2.3 once again, we can see that although the graph is accurate, the detail of the distribution is still unclear, because each interval groups together scores that may lie up to six percentage points apart.
Figure 2.2: Polygon of Frequency
In figure 2.2, the mid value of each interval is plotted on the horizontal axis. The graph corresponds with figure 2.1, so we can trust that the reading of figure 2.1 is reliable.
Figure 2.3: Ogive of Frequency
Figure 2.3 is plotted against the upper limits of the intervals. This curve also corresponds with figures 2.2 and 2.1. All three graphs are mesokurtic and negatively skewed. This implies that the sample group did well in the multiple-choice test. According to Borich & Kubiszyn (2007: 257), there can be multiple reasons for this, for example, that the sample group might have been of high intelligence, that the test may have been too easy or that the time constraints for the test were too lenient.
Reliability Coefficient
“Another way of estimating the internal consistency of a test is through one of the Kuder-Richardson methods.” Borich & Kubiszyn (2007: 321) For the purpose of this analysis, we will use the KR-20 method, as it is the more accurate way of determining the reliability of a test. Borich & Kubiszyn (2007: 322) The formula for this test is:
r = (k / (k - 1)) × (1 - Σpq / s²)
where k is the number of items, p is the proportion of learners who answered an item correctly, q = 1 - p, and s² is the variance of the total test scores. In table 2.4, k / (k - 1) is labelled Part 1 and 1 - Σpq / s² is labelled Part 2.
From the data found in table 2.4, we can determine the reliability coefficient.
1 mark for each correct answer (Level: U = upper group, L = lower group)
Learner  Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Total %   Level
1        1   1   0   0   0   0   0   0   0   0   1   0   0   0   1   1   1   0   0   1   7     35  L
2        1   1   1   1   1   0   0   1   1   1   1   1   1   1   1   1   1   1   1   0   17    85  U
3        1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   17    85  U
4        1   1   1   0   1   1   0   1   1   1   1   1   0   1   0   1   1   0   0   0   13    65  U
5        1   1   1   0   1   1   0   1   1   0   1   1   1   1   1   1   0   0   1   0   14    70  U
6        1   0   1   1   0   1   0   0   1   0   1   1   0   1   1   1   0   0   0   0   10    50  U
7        0   1   0   0   1   1   0   0   0   0   1   1   1   1   0   1   0   1   0   0   9     45  L
8        1   1   1   0   1   1   0   0   0   0   1   1   1   1   1   1   1   0   1   0   13    65  U
9        1   1   1   0   1   1   1   0   0   0   1   1   1   1   1   1   1   0   0   0   13    65  U
10       1   1   0   0   1   1   1   0   0   0   1   0   0   1   0   1   1   1   1   0   11    55  U
11       1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   19    95  U
12       1   1   1   1   1   1   1   0   0   0   1   1   0   1   1   1   1   0   1   0   14    70  U
13       1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   0   0   1   0   16    80  U
14       1   1   1   0   1   1   1   1   1   1   1   1   1   1   1   1   0   0   1   0   16    80  U
15       1   1   1   1   1   0   0   1   0   0   1   1   0   0   1   0   0   0   0   0   9     45  L
16       1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   19    95  U
17       0   1   0   0   1   0   1   0   1   0   1   0   1   1   1   0   0   0   0   0   8     40  L
18       1   1   0   1   1   0   1   0   0   0   1   1   0   1   1   1   1   0   1   0   12    60  U
19       0   0   0   1   1   0   0   1   0   0   0   0   0   0   0   0   0   0   0   1   4     20  L
20       1   1   1   1   1   1   1   1   1   0   0   0   1   1   1   1   1   1   1   0   16    80  U
21       1   0   1   1   0   1   0   0   1   0   1   1   0   1   1   1   0   0   0   0   10    50  U
22       0   1   0   0   1   1   0   0   0   0   1   1   1   1   0   1   0   1   0   0   9     45  L
23       1   1   1   0   1   1   0   0   0   0   1   1   1   1   1   1   1   0   1   0   13    65  U
24       1   1   0   0   0   0   0   0   0   0   1   0   0   0   1   1   1   0   1   1   8     40  L
25       1   1   1   1   1   0   0   1   1   1   1   1                                   16    80  U

Item   p          q          pq
Q1     0.84       0.16       0.1344
Q2     0.88       0.12       0.1056
Q3     0.68       0.32       0.2176
Q4     0.48       0.52       0.2496
Q5     0.84       0.16       0.1344
Q6     0.68       0.32       0.2176
Q7     0.44       0.56       0.2464
Q8     0.521739   0.478261   0.249527
Q9     0.52       0.48       0.2496
Q10    0.333333   0.666667   0.222222
Q11    0.92       0.08       0.0736
Q12    0.76       0.24       0.1824
Q13    0.6        0.4        0.24
Q14    0.84       0.16       0.1344
Q15    0.8        0.2        0.16
Q16    0.916667   0.083333   0.076389
Q17    0.625      0.375      0.234375
Q18    0.333333   0.666667   0.222222
Q19    0.52       0.48       0.2496
Q20    0.64       0.36       0.2304

Σpq      3.830336
Var      389.8333
Part 1   1.052632
Part 2   0.990174
r        1.042289
Table 2.4: Test scores with pq values
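The coefficient obtained in table 2.4 is r = 1.04. A KR-20 coefficient cannot validly exceed 1, so this value cannot be interpreted as showing that the test is reliable. The variance in table 2.4 (389.83) is the variance of the percentage scores, whereas the pq values are based on proportions for items that count one mark each, and this difference in scale inflates the result. The reliability of the test therefore remains in doubt, and the test is still too easy.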
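The KR-20 calculation can be sketched in a few lines. The function below follows the standard KR-20 formula, (k / (k - 1)) × (1 - Σpq / variance), and takes Σpq = 3.830336 from table 2.4. The first call uses the variance of the percentage scores, as in table 2.4. The second call is an illustration only and not a result reported in this report: it assumes that the raw score equals the percentage divided by 5, because each of the twenty items counts one mark.

```python
import statistics

def kr20(k, sum_pq, variance):
    """Kuder-Richardson 20: (k / (k - 1)) * (1 - sum_pq / variance)."""
    return (k / (k - 1)) * (1 - sum_pq / variance)

# Percentage scores from table 2.1.
scores_pct = [20, 35, 40, 40, 45, 45, 45, 50, 50, 55, 60, 65, 65, 65, 65,
              70, 70, 80, 80, 80, 80, 85, 85, 95, 95]
K = 20             # number of items
SUM_PQ = 3.830336  # sum of pq, as listed in table 2.4

# Sample variance of the percentage scores (the Var value in table 2.4).
var_pct = statistics.variance(scores_pct)

# Assumption: raw score out of 20 = percentage / 5.
scores_raw = [s / 5 for s in scores_pct]
var_raw = statistics.variance(scores_raw)

r_pct = kr20(K, SUM_PQ, var_pct)
r_raw = kr20(K, SUM_PQ, var_raw)

print(f"variance (percent scale) = {var_pct:.4f}, r = {r_pct:.4f}")
print(f"variance (raw scale)     = {var_raw:.4f}, r = {r_raw:.4f}")
```

With the percentage-scale variance the result is above 1. With the raw-score variance it falls below 1, which suggests that the scale of the variance is what drives the value in table 2.4.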
3. Item Analysis
Difficulty Index
When considering table 3.1, we find that the difficulty indices show that seven of the twenty questions were unacceptable because they were too easy. These are Questions 1, 2, 5, 11, 14, 15 and 16. Questions 6 and 12 were a bit easy and the rest of the questions were of acceptable difficulty.
Question   Difficulty   Rating
Q1         .84          Unacceptable (too easy)
Q2         .88          Unacceptable (too easy)
Q3         .68          Acceptable
Q4         .48          Acceptable
Q5         .84          Unacceptable (too easy)
Q6         .68          Easy
Q7         .44          Acceptable
Q8         .52          Acceptable
Q9         .52          Acceptable
Q10        .33          Acceptable
Q11        .92          Unacceptable (too easy)
Q12        .76          Easy
Q13        .60          Acceptable
Q14        .84          Unacceptable (too easy)
Q15        .80          Unacceptable (too easy)
Q16        .91          Unacceptable (too easy)
Q17        .62          Acceptable
Q18        .33          Acceptable
Q19        .52          Acceptable
Q20        .64          Acceptable
Table 3.1: Item Difficulty Indices
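As a worked example of the difficulty index, the sketch below applies the definition from the terminology list (the proportion of learners who answered the item correctly) to Question 1. The list `q1` is the Q1 column of table 2.4 for learners 1 to 25. The function name is a hypothetical helper.

```python
def difficulty_index(num_correct, num_learners):
    """Proportion of learners who answered the item correctly (p)."""
    return num_correct / num_learners

# Q1 column of table 2.4, learners 1 to 25 (1 = correct, 0 = incorrect).
q1 = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1]

p = difficulty_index(sum(q1), len(q1))
q = 1 - p  # proportion of learners who answered incorrectly

print(f"Q1: correct={sum(q1)} of {len(q1)}, p={p:.2f}, q={q:.2f}, pq={p * q:.4f}")
```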
Discrimination Index
In table 3.2, we can see that there are six items with a low discrimination index. These items will have to be revised. It is also interesting to note that the items with unacceptable difficulty indices tend to be the items with unacceptable discrimination indices, and that the items with acceptable difficulty indices tend to have acceptable discrimination indices.
Question   Discrimination   Rating
Q1         0.16             Negative
Q2         0.12             Negative
Q3         0.32             Positive
Q4         0.52             Positive
Q5         0.16             Negative
Q6         0.32             Positive
Q7         0.56             Positive
Q8         0.48             Positive
Q9         0.48             Positive
Q10        0.67             Positive
Q11        0.08             Negative
Q12        0.24             Positive
Q13        0.40             Positive
Q14        0.16             Negative
Q15        0.20             Negative
Q16        0.08             Positive
Q17        0.38             Positive
Q18        0.67             Positive
Q19        0.48             Positive
Q20        0.36             Positive
Table 3.2: Item Discrimination Indices
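The definition in the terminology list describes the discrimination index as a comparison between learners who do well on the test and those who do not. A common way to calculate it is D = (U - L) / n, where U and L are the numbers of learners in the upper and lower groups who answered the item correctly and n is the number of learners in each group. The sketch below shows that calculation. The function name and the counts in the example are hypothetical and are not taken from the test data. Note that the values listed in table 3.2 coincide with 1 - p for the difficulty indices in table 3.1, so they may not have been calculated in this way.

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """D = (U - L) / n for equally sized upper and lower groups."""
    return (upper_correct - lower_correct) / group_size

# Hypothetical counts for illustration only (not from the report's data):
# 10 of 12 upper-group learners and 4 of 12 lower-group learners answered correctly.
d = discrimination_index(upper_correct=10, lower_correct=4, group_size=12)

# D > 0: more of the upper group answered correctly (positive discrimination).
# D < 0: more of the lower group answered correctly (negative discrimination).
print(f"D = {d:.2f}")
```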
4. Conclusion
Figure 4.1: Percentage of acceptability
In this report on the 2008 Grade 12 English second language exam, the conclusion can be drawn that the multiple-choice test was rather easy. The analysis of the frequency distribution, standard deviation, discrimination indices, difficulty indices and reliability coefficient supports this conclusion. Items 1, 2, 5, 11, 14 and 15 will need revision so that this test may be graded as reliable. As seen in figure 4.1, 76% of the test is reliable and the other 24% of the test is too easy. The questions mentioned were all rather easy and therefore not really suitable for a final exam.
Bibliography
Borich, G. & Kubiszyn, T. (2007). Educational Testing and Measurement: Classroom Application and Practice. NJ: John Wiley & Sons, Inc.
Knoetze, J. (2007). Test Data. Retrieved April 1, 2008, from http://www.jknoetze.co.za/CIA_722/testdata.xls
Appendix A: Test Data
Responses are listed in student-number order within each group. Where a group shows fewer responses than students, the position of the missing response is not recorded.
Question  Key  Students 1-13              Students 14-16  Students 17-23  Students 24-25
Q1        C    C C C C C C B C C C C C C  C C C           B C D C C B C   C C
Q2        B    B B B B B A B B B B B B B  B B B           B B C B A B B   B B
Q3        D    B D D D D D A D D B D D D  D D D           C B A D D A D   B D
Q4        D    A D D B C D B B A A D D A  A D D           C D D D D B B   A D
Q5        B    C B B B B C B B B B B B B  B B B           B B B B C B B   C B
Q6        C    D D C C C C C C C C C C C  C B C           A A A C C C C   D D
Q7        D    A A D B B A B B D D D D D  D A D           D D B D A B B   A A
Q8        A    A A A A D B D D C A D A    A A A           D D A A D B D   A
Q9        C    D C C C C C D B B D C D C  C B C           C D D C C D B   D C
Q10       B    D B B B D D D C D C B A B  B D B           D C A D D C     D B
Q11       A    A A A A A A A A A A A A A  A A A           A A C C A A A   A A
Q12       C    D C C C C C C C C B C C C  C C C           D C D D C C C   D C
Q13       B    A B B A B A B B B A B A B  B D B           B A A B A B B   A B
Q14       D    A D D D D D D D D D D D D  D A D           D D A D D D D   A D
Q15       A    A A A C A A C A A D A A A  A A A           A A D A A C A   A A
Q16       A    A A A A A A A A A A A A A  A C A           A B A A A A     A A
Q17       C    C C C C A A A C C C C C A  A B C           C B C A A C     C C
Q18       D    B D B B B B D A B D D B B  B D             C B B D B D A   B D
Q19       B    D B D C B D D B D B B B B  B D B           A B A B D D B   D B
Q20       C    B C C C C C C A A C C D C  C D C           D C B C C C A   B C