Stop Student Complaining
By Improving Test Question Quality
Karen R. Young
Assistant Dean and Director of Undergraduate Programs
College of Humanities and Social Sciences
Emily Wicker Ligon
Lead Instructional Designer
Distance Education
& Learning Technology Applications (DELTA)
Diane Chapman
Teaching Associate Professor
Department of Leadership, Policy and Adult and Higher Education
Director, Office of Faculty Development
Henry Schaffer
Professor Emeritus of Genetics and Biomathematics
Coordinator of Special OIT Projects & Faculty Collaboration
OFD Workshop
February 12, 2014
Introductions
Why is this workshop important?
- Learning Outcomes are the real Teaching Goal.
- Formative assessment will be of increasing importance!
- Assessment/testing is used to evaluate student progress ⇒ test item quality is crucial.
- We want to have high quality test items to facilitate formative and
summative assessment of our students.
Learning Objectives for this Workshop
- Understanding of test item quality measures.
- Ability to use currently available and forthcoming Item Analyses.
- Ability to assess high/low quality items via Item Analysis.
Purpose of Test Items (Formative and Summative)
- Assess Student Learning Outcomes
- less needed in small classes - but even there . . .
- Separate out students across the learning outcome spectrum
- identify where help is needed
- support assignment of grades
- support the practice of critical pedagogy
- Need good quality test items to further these purposes
- Without good test items, none of the above will follow - hence,
today's focus
Examples of bad questions
- 97% of students answered correctly - too easy - perhaps sometimes appropriate
- e.g. lead-in, or verifying that everyone has mastered a topic
- 3% of students answered correctly - too hard - ever appropriate? Or perhaps the material wasn't covered?
- 70% of students answered correctly - reasonable
- But is it the right 30% who miss it? ⇒ Point Biserial
correlation (also the Discrimination Index - which we won't cover,
as it gives the same type of information)
Item Analysis
- Example - fictional data input - 20 students, 10 questions (a short scoring sketch follows the bar graph below)
BDEACCADBE 10 - correct answers
AR BBEABCADBE 8 - student name, responses, number correct
AW AECAACADBE 6
BA BCDACCADBE 8
CK BDEACAAABD 7
CP EAEADBEDDE 4
CQ DBADECADBE 5
DA BDAADCADBE 8
DV ADEAACAECD 5
DX CCAAECADBB 5
FO ADEAAEBBEA 3
FR DBEADAEBCB 2
IF CDEABEBDAC 4
LC BDEAADCCBC 5
MB CECACCEBCE 4
NB BBDAACADBE 7
ND BDEABCACAA 6
WB BDBABCADDB 6
XB BDEADCBEBE 7
ZD DBEEDEADBE 5
ZF EDEAEEEEEA 3
- Overall View
average grade: 54.0      std dev of grades: 17.1      students: 20
top grade: 80.0          bottom grade: 20.0
top 27% break point: 70.0      bottom 27% break point: 40.0
- Bar Graph of Student Grades
Grades in %    freq
   20-<30     XX                (1)
   30-<40     XX XX             (2)
   40-<50     XX XX XX          (3)
   50-<60     XX XX XX XX XX    (5)
   60-<70     XX XX XX          (3)
   70-<80     XX XX XX          (3)
   80-<90     XX XX XX          (3)
(the 0-<10, 10-<20, and 90-100 bins are empty)
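The Overall View numbers can be reproduced directly from the raw responses. Below is a minimal scoring sketch in Python; the variable names (key, responses, grades) are ours for illustration, not taken from the item analysis tool, and the population standard deviation is assumed since it matches the 17.1 shown above.

    # A minimal scoring sketch (Python). Names are illustrative assumptions,
    # not taken from any campus tool. It reproduces the Overall View figures
    # above (population std dev formula).
    from statistics import mean, pstdev

    key = "BDEACCADBE"                   # correct answers, 10 items
    responses = {                        # student: responses (from the table above)
        "AR": "BBEABCADBE", "AW": "AECAACADBE", "BA": "BCDACCADBE",
        "CK": "BDEACAAABD", "CP": "EAEADBEDDE", "CQ": "DBADECADBE",
        "DA": "BDAADCADBE", "DV": "ADEAACAECD", "DX": "CCAAECADBB",
        "FO": "ADEAAEBBEA", "FR": "DBEADAEBCB", "IF": "CDEABEBDAC",
        "LC": "BDEAADCCBC", "MB": "CECACCEBCE", "NB": "BBDAACADBE",
        "ND": "BDEABCACAA", "WB": "BDBABCADDB", "XB": "BDEADCBEBE",
        "ZD": "DBEEDEADBE", "ZF": "EDEAEEEEEA",
    }

    # Grade in % = fraction of the 10 items answered correctly.
    grades = {s: 100 * sum(a == c for a, c in zip(ans, key)) / len(key)
              for s, ans in responses.items()}

    print("average grade", round(mean(grades.values()), 1))      # 54.0
    print("std dev grades", round(pstdev(grades.values()), 1))   # 17.1
    print("top grade", max(grades.values()),                     # 80.0
          "bottom grade", min(grades.values()))                  # 20.0

    # 27% break points: the grade of the student ranked 27% from the top
    # (and from the bottom); 27% of 20 students rounds to 5 students.
    ranked = sorted(grades.values(), reverse=True)
    n27 = round(0.27 * len(ranked))
    print("top 27% break point", ranked[n27 - 1],       # 70.0
          "bottom 27% break point", ranked[-n27])       # 40.0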
- Item Analysis output
responses unrd corr DifLev
Item A B C D E * blank ans# %corrct pbs
1 3 9 3 3 2 0 0 B 45.0 0.79
2 1 5 2 10 2 0 0 D 50.0 0.00
3 3 1 2 2 12 0 0 E 60.0 -0.35
4 18 0 0 1 1 0 0 A 90.0 0.08
5 5 4 3 5 3 0 0 C 15.0 0.23
6 2 1 12 1 4 0 0 C 60.0 0.61
7 12 3 1 0 4 0 0 A 60.0 0.67
8 1 3 2 11 3 0 0 D 55.0 0.39
9 2 11 3 2 2 0 0 B 55.0 0.68
10 3 3 2 2 10 0 0 E 50.0 0.47
Kuder-Richardson 20 0.29
- Definitions
- Item answer counts for each choice, including unreadable (*) and blank (in some of the output the answers are coded 1=A, 2=B, ...)
- Correct answer - which choice it is
- Difficulty Level = % of students answering this item correctly
- pbs = Point Biserial Correlation coefficient
- Correlation between whether each student answered this
test item correctly and that student's overall test grade.
This is a variation on the usual correlation coefficient,
used where one variable varies (more or
less) continuously (here the students' % grades on this exam)
and the other variable is dichotomous. (Dichotomous = two values
only - here correct or incorrect - which covers multiple choice
questions - MCQs - with a single correct answer, true/false,
and matching. Polytomous would mean having more than one
answer which isn't wrong.) A short computational sketch of
pbs and K-R 20 follows these definitions.
- Kuder-Richardson 20 (K-R 20) Range:0-1. Below 0.64 indicates the
exam doesn't distinguish well between students with varying
mastery of the material. A very high value indicates homogeneity
in the test items. (Roughly equivalent to Moodle's Coefficient
of internal consistency.)
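The Difficulty Level, pbs, and K-R 20 values in the sample output can be computed as follows. This is a minimal Python sketch reusing key, responses, and grades from the scoring sketch above; the helper names (point_biserial, kr20) are ours, and one common convention is assumed (population standard deviation/variance, item kept in the total score), which reproduces the numbers shown.

    # Minimal sketch (Python): Difficulty Level, Point Biserial, and K-R 20.
    # Reuses key, responses, and grades from the scoring sketch above.
    # Assumes every item has at least one right and one wrong answer
    # (true for this data set).
    from statistics import mean, pstdev, pvariance

    def point_biserial(item):                 # item index is 0-based
        right = [grades[s] for s, a in responses.items() if a[item] == key[item]]
        wrong = [grades[s] for s, a in responses.items() if a[item] != key[item]]
        p = len(right) / len(responses)       # proportion correct (Difficulty Level)
        s_x = pstdev(grades.values())         # spread of the overall grades
        return (mean(right) - mean(wrong)) / s_x * (p * (1 - p)) ** 0.5

    def kr20():
        k = len(key)                          # number of items
        raw = [sum(a == c for a, c in zip(ans, key)) for ans in responses.values()]
        p = [sum(ans[i] == key[i] for ans in responses.values()) / len(responses)
             for i in range(k)]
        return (k / (k - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / pvariance(raw))

    for i in range(len(key)):
        n_right = sum(ans[i] == key[i] for ans in responses.values())
        print("item", i + 1, "DifLev", 100 * n_right / len(responses),
              "pbs", round(point_biserial(i), 2))   # e.g. item 1: 45.0, 0.79
    print("K-R 20", round(kr20(), 2))               # 0.29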
Same as above, with a breakout (grid) giving the response patterns of the top and bottom 27% (roughly quartiles) of students
responses unrd corr DifLev
Item A B C D E * blank ans# %corrct pbs
1 3 9 3 3 2 0 0 2 45.0 0.79
0 6 0 0 0 - top 27%
1 0 2 1 2 - bottom 27%
2 1 5 2 10 2 0 0 4 50.0 0.00
0 2 1 3 0
1 1 0 3 1
3 3 1 2 2 12 0 0 5 60.0 -0.35
1 0 0 2 3
0 0 1 0 5
4 18 0 0 1 1 0 0 1 90.0 0.08
6 0 0 0 0
6 0 0 0 0
5 5 4 3 5 3 0 0 3 15.0 0.23
1 1 2 2 0
1 1 1 2 1
6 2 1 12 1 4 0 0 3 60.0 0.61
1 0 5 0 0
1 1 1 0 3
7 12 3 1 0 4 0 0 1 60.0 0.67
5 1 0 0 0
0 2 0 0 4
8 1 3 2 11 3 0 0 4 55.0 0.39
1 0 0 4 1
0 3 0 2 1
9 2 11 3 2 2 0 0 2 55.0 0.68
0 6 0 0 0
1 0 2 1 2
10 3 3 2 2 10 0 0 5 50.0 0.47
0 0 0 1 5
2 1 1 0 2
- Compare the breakout with the pbs (a short tallying sketch follows)
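A minimal sketch (Python) of how the top/bottom breakout can be tallied, again reusing key, responses, and grades from the scoring sketch above; here the groups are assumed to be the students at or above the top break point and at or below the bottom break point, which gives 6 students in each group for this data.

    # Minimal sketch (Python): response tallies for the top and bottom groups.
    # Reuses key, responses, and grades from the scoring sketch above.
    choices = "ABCDE"
    top = [s for s, g in grades.items() if g >= 70]      # at/above top 27% break point
    bottom = [s for s, g in grades.items() if g <= 40]   # at/below bottom 27% break point

    def tally(group, item):                  # counts of A..E chosen by the group
        return [sum(responses[s][item] == c for s in group) for c in choices]

    for i in range(len(key)):
        print("item", i + 1, "top", tally(top, i), "bottom", tally(bottom, i))
        # e.g. item 1: top [0, 6, 0, 0, 0], bottom [1, 0, 2, 1, 2]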
We can easily see the frequencies of correct answer placement (a one-line counting sketch follows this discussion):
For the above:
usage of each answer choice
A B C D E
2 2 2 2 2
Here's an example from an actual class exam with 26 questions:
usage of each answer choice
A B C D E
5 5 9 7 0
(There were 5 choices.)
What do you think?
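Tallying correct-answer placement is just a letter count; a tiny Python sketch, shown for the fictional 10-item key above:

    # Count how often each letter is the correct answer (fictional key above).
    from collections import Counter
    key = "BDEACCADBE"
    print(sorted(Counter(key).items()))   # [('A', 2), ('B', 2), ('C', 2), ('D', 2), ('E', 2)]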
Discussion of Strategy in Selecting Questions - difficulty levels, pbs, ...
- should you have one?
- do you have one?
- if so, what is it?
- do you have a database of questions?
- tagging questions (adding metadata) to assist in selection
Item Analysis Services on Campus
Improving test items
In addition to working on improvement after considering Item Analysis,
here are some basic MCQ suggestions taken from Testing and Evaluation in the
Biological Sciences, Report of the Panel on Evaluation and Testing,
Commission on Undergraduate Education in the Biological Sciences,
November 1967 (CUEBS Publication 20). Available for download at no charge from:
http://ofd.ncsu.edu/wordpress/wp-content/uploads/2013/09/testing-and-evaluation-in-the-biological-sciences.pdf
Characteristics that a satisfactory multiple-choice item should possess:
a. the stem sets forth a single precise unambiguous task for the student to do;
b. the stem is followed by a homogeneous set of responses, parallel in construction;
c. no response can be eliminated because of grammatical inconsistency with the stem;
d. the responses contain no verbal associations that provide irrelevant clues to the answer;
e. the correct response is not more elaborate in phraseology than the incorrect ones;
f. to the student who does not perceive the problem or know the answer, each response may appear to be a plausible answer.
Terminology - item/question, stem, correct answer, foils, distractors.
Writing multiple choice questions
Preparing better MCQs
More on this - detailed
Overview & Looking Forward
Overview - Item Analysis can point out questions which need work.
Immediate goal - better questions covering your course learning objectives.
Longer-range goal - annotation of test items ⇒ allowing computerized
analysis of student learning (especially important in large classes).
Q & A
Resources
- A nice non-technical overview and justification for the use of Item Analysis,
  part of more general coverage of Item Analysis. A related topic is How to Write
  Tests. (These are temporary locations - probably to be moved soon.)
- A description of Item Analysis and its use, with discussion of the
  Index of Discrimination (which gives the same type of information as the
  Point Biserial correlation).
- Fall 2013 OFD workshop on using Bloom's Taxonomy of Cognitive Objectives
  to construct test items at the various Cognitive Levels.
- http://www.ncsu.edu/it/open_source/Item.html
Copyright 2013, 2014, 2017 by Henry E. Schaffer, Karen R. Young, Emily Wicker Ligon
& Diane Chapman
Comments and suggestions are welcome, and should
go to hes@ncsu.edu
Last modified 3/3/2017
Disclaimer - Information is provided for your use. No endorsement
is implied.