Multiple-Choice Reborn

This blog explores the advantages and disadvantages of classical right mark scoring (RMS), Knowledge and Judgment Scoring (KJS) and Confidence Based Learning (CBL) when used to set grades (cut points) and to promote student and employee development.


Wednesday, May 13, 2015

How does IRT information replace CTT reliability? Can this
be found on the audit tool (Table 45)?

This post relates my audit tool, Table 45 (Comparison of
Conditional Error of Measurement between Normal [CTT] Classroom Calculation and
the IRT Model), to a quote from Wikipedia
(Information). I am confident that the math is correct. I need to
clarify the concepts for which the math is making estimates.

Table 45

“One of the major contributions of item response theory is
the extension of the concept of reliability. Traditionally, reliability refers
to the precision of measurement (i.e., the degree to which measurement is free
of error). And traditionally, it is measured using a single index defined in various
ways, such as the ratio of true and observed score variance.”

“This index is helpful in characterizing a test’s average
reliability, for example in order to compare two tests.”

The test reliabilities for CTT and IRT are also comparable on
Tables 45a and 45c: 0.29 and 0.27.

“But IRT makes it clear that precision is not uniform across
the entire range of test scores. Scores at the edges of the test’s range, for example,
generally have more error associated with them than scores closer to the middle
of the range.”

“Item response theory advances the concept of item and test
information. Information is also a function
of the model parameters. For example, according to Fisher information theory,
the item information supplied in the case of the 1PL for dichotomous response
data is simply the probability of a correct response multiplied by the
probability of an incorrect response, or,
. . .” [I = pq].
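A minimal sketch of that relationship, assuming the 1PL (Rasch) form of the item response function; the ability and difficulty values are illustrative only:

```python
import math

def p_correct(theta, b):
    """1PL (Rasch) probability of a right answer for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information for a dichotomous 1PL item: I = p * q."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

# Information peaks at 0.25 where ability matches item difficulty (theta = b)
# and falls off on either side.
for theta in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(theta, round(item_information(theta, b=0.0), 3))
```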

“The standard error of estimation (SE) is the reciprocal of
the test information at a given trait level . . .” [SE = 1/SQRT(pq)].

Is the “test information … at a given trait level” the Score
Information (3.24, red, Chart 89, dummy data) for 17 right out of 21 items?
Then the reciprocal of 3.24 is 0.31, the error variance (green, Chart 89 and
Table 46, col 9) in measures on a logit scale. And the IRT conditional error of
estimation (SE) would be the square root: SQRT(0.31) = 0.56 in measures. And
this inverted would yield the CTT CSEM: 1/0.56 = 1.80 in counts.
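A small sketch of that arithmetic chain, assuming the score information is the binomial value n*p*q for 17 right out of 21:

```python
import math

n_items = 21
right = 17
p = right / n_items              # proportion right, about 0.81
q = 1 - p

score_info = n_items * p * q     # 3.24 (red, Chart 89)
error_var = 1 / score_info       # 0.31 (green, error variance in measures)
irt_se = math.sqrt(error_var)    # 0.56 (IRT conditional SE, in measures)
ctt_csem = 1 / irt_se            # 1.80 (CTT CSEM, in counts) = SQRT(n * p * q)

print(round(score_info, 2), round(error_var, 2), round(irt_se, 2), round(ctt_csem, 2))
```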

The IRT (CSEM) in Chart 89 is really the IRT standard error
of estimation (SE or SEE). On Table 45c, the CSEM (SQRT) is also the SE
(conditional error of estimation) obtained from the square root of the error
variance for that ability level (17 right, 1.73 measures, or 0.81 or 81%).

“Thus more information implies less error of measurement.”

See Table 45c, CSEM, green, and Table 46, col 9-10.

“In general, item information functions tend to look
bell-shaped. Highly discriminating items have tall, narrow information
functions; they contribute greatly but over a narrow range. Less discriminating
items provide less information but over a wider range.”

Chart 92

Table 47

The same generality applies to the item information
functions (IIFs) in Chart 92, but it is not very evident. The item with a
difficulty of 10 (IIF = 1.80, Table 47) is also highly discriminating. The two easiest items had negative discrimination; they show
an increase in information as student ability decreases toward zero measure. The generality applies best near the
average test raw score of 50%, or zero measure, which is not on the chart (no student got a score of 50% on this test).

This test had an average test score of 80%. This has spread the item information
function curves out (Chart 92). They are not centered on the raw score of 50%
or the zero measures location. However, each peaks near the point where item
difficulty in measures is close to student ability in measures. This observation
is critical in establishing the value of IRT item analysis and how it is used.
This makes sense in measures (a natural log of the ratio of right and wrong
marks) but not in raw scores (a normal linear scale), as I first posted in
Chart 75 with only count and percent scales.

“Plots of item information can be used to see how much
information an item contributes and to what portion of the scale score range.”

This is very evident in Table 47 and Chart 92.

“Because of local independence, item information functions
are additive.”

“Thus, the test information function is simply the sum of
the information functions of the items on the exam. Using this property with a
large item bank, test information functions can be shaped to control
measurement error very precisely.”
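A hedged sketch of that additivity and of shaping a test around a single cutscore; the item difficulties (in logits) are invented for illustration, not taken from any table in this post:

```python
import math

def item_information(theta, b):
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def test_information(theta, difficulties):
    # Local independence lets item informations simply add up.
    return sum(item_information(theta, b) for b in difficulties)

cutscore = 0.5  # invented cutscore location in logits

# Two five-item tests: one built from items near the cutscore, one from items far away.
near = [0.3, 0.4, 0.5, 0.6, 0.7]
far = [-2.5, -2.0, -1.5, 2.0, 2.5]

print(round(test_information(cutscore, near), 2))  # high information at the cutscore (~1.24)
print(round(test_information(cutscore, far), 2))   # low information at the cutscore (~0.47)
```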

“Characterizing
the accuracy of test scores is perhaps the central issue in psychometric theory
and is a chief difference between IRT and CTT. IRT findings reveal that the CTT
concept of reliability is a simplification.”

At this point my audit tool, Table 45, falls silent. These
two mathematical models are only means of estimating theoretical values;
they are not the theoretical values, nor are they the reasoning behind them. CTT
starts from observed values and projects into the general environment. IRT can start with the perfect Rasch model and select observations that fit the model. The
two models are looking in opposite directions. CTT uses a linear scale with the
origin at zero counts. IRT sets its log ratio point-of-origin (zero) at the 50%
CTT point. I must accept the concept that CTT is a simplification of IRT on the
basis of authority at this point.

“In the place of reliability, IRT offers the test
information function which shows the degree of precision at different values of
theta, [student ability].”

I would word this, “In ADDITION to reliability,” (Table 45a,
CTT = 0.29 and 45c, IRT = 0.27). Also the “IRT offers the ITEM information
function which shows the degree of precision at different values . . .”

“These results allow psychometricians to (potentially)
carefully shape the level of reliability for different ranges of ability by
including carefully chosen items. For example, in a certification situation in
which a test can only be passed or failed, where there is only a single
“cutscore,” and where the actual passing score is unimportant, a very
efficient test can be developed by selecting only items that have high
information near the cutscore. These items generally correspond to items whose
difficulty is about the same as that of the cutscore.”

The eleven items in Table 47 and Chart 92 each peak near the
point where item difficulty in measures is close to student ability in
measures. The discovery or invention of this relationship is the key advantage
of IRT over CTT.

These data show that a test item need not have (the commonly
recommended) average score near 50% to produce usable results. Any cutscore from 50%
to 80% would produce usable results on this test, with an average score of 80%
and a cutscore (passing) of 70%.

"IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of theory and makes more explicit the hypotheses that are implicit within CTT."

My understanding is that with CTT an item may be 50% difficult for the class without revealing how difficult it is for each student (no location). With IRT every item is 50% difficult for any student with a comparable ability (difficulty and ability at the same location).

I do not know what part of IRT is invention and what part is discovery
on the part of some ingenious people. Two basic parts had to be fit together:
information and measures by way of an inversion. Then a story had to be created
to market the finished product; the Rasch model and Winsteps (full and partial
credit) are the limit of my experience. The unfortunate name choice of “partial
credit” rather than knowledge or skill and judgment may have been a factor in the Rasch
partial credit model not becoming popular. The name, partial credit, falls into
the realm of psychometrician tools. The name, Knowledge and Judgment, falls
into the realm of classroom tools needed to guide the development of scholars
as well as to obtain maximum information from paper standardized tests, where
students individually customize their tests (accurately, honestly, and fairly), rather than
CAT, where the test is tailored to fit the student using best-guess, dated, and questionable
second-hand information.

IRT makes CAT possible. Please see "Adaptive Testing Evolves to Assess Common-Core Skills" for current marketing, use, and a list of comments, including two of mine. The exaggerated claims of test makers to assess and promote developing students by the continued use of forced-choice, lower-level-of-thinking tests continue to be ignored in the marketing of these tests to assess Common Core skills. Increased precision of nonsense still takes precedence over an assessment that is compatible with and supports the classroom and scholarship.

Serious mastery: Knowledge Factor.
Student development: Knowledge and Judgment Scoring (free Power Up Plus) and IRT Rasch partial credit (free Ministep).
Ranking: Forced-choice on paper or CAT.

Wednesday, April 8, 2015

“An apparent
paradox is that extreme scores have perfect precision, but extreme measures
have perfect imprecision” in “Reliability and separation of measures.” A more
complete discussion is given
under the title, “Standard Errors and Reliabilities: Rasch and Raw Score”.

Chart 82

The apparent paradox is graphed in Chart 82. Precision on
one scale is the inverse or reciprocal of the other: 1/0.44 = 2.27 and 1/2.27 =
0.44.

Table 45

I edited Table 32 to disclose a full development of a comparison between CTT and IRT using real classroom data (Table 45). This first view is too complicated.

Table 45 includes the process of combining student
scores and item difficulties onto one logit scale.

Table 46

I then isolated the item analysis from the complete development
above by skipping the formation of a single scale from real classroom data.
Instead, I fed the IRT item analysis a percent (dummy) data set (Table 46)
with the same number of items as in the classroom test (21 items). I then
graphed the data strings in Table 46 as a second, simpler, view of IRT item
analysis.

Chart 85

Turning right counts (Chart 85, blue) into a right/wrong ratio string (red) yields a
very different shape from the straight line of right mark counts. We now have the rate at which each mark completes a
perfect score of 21 or 100%. It starts slow (1/20), with the last mark racing at
20 times (20/1) the average rate (10/11 or 11/10, near 1, in Table 46, col 2).

Taking the natural log of the ratio (a logit, Table 46, col
3) creates the Rasch model IRT characteristic curve (Chart 85, purple) with the
zero logit point of origin positioned at the 50% normal value. [Ratios and log
ratios have no dimensions.]
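A minimal sketch of those two columns (the ratio string and its natural log) for a 21-item dummy set like the one in Table 46; the loop below is illustrative, not a reproduction of the table:

```python
import math

n_items = 21
for right in range(1, n_items):      # 1 through 20 right marks
    wrong = n_items - right
    ratio = right / wrong            # Table 46, col 2: right/wrong ratio (1/20 up to 20/1)
    logit = math.log(ratio)          # Table 46, col 3: natural log of the ratio
    print(right, round(ratio, 2), round(logit, 2))
# The logit crosses zero between 10 and 11 right (the 50% point),
# running negative below it and positive above it.
```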

Chart 86

Winsteps, at this point, has reduced student raw scores and
item difficulties (in counts) into
one logit scale of student ability and item difficulty with the dimension of a measure. These are then combined into
the probability of a right answer to start the item analysis. The percent
(dummy) input (Table 46, col 6) replaces this operation (Chart 86). This
simplifies the current discussion to just item analysis and precision.

Chart 87

Percent input and Information for one central cell are
plotted in Chart 87. Cell information is limited to a maximum of 0.25 at a student
raw score of 50% (Table 46, col 7), when combining p*q (0.50 * 0.50 = 0.25 ).
The next step is to adjust the cell information for 21 items on the test
(Column 8).

Chart 88

Chart 88 completes the comparison of CTT and IRT
calculations on Table 46. The inversion of Information (col 9) yields the error
variance that aligns with student score measures such that the greatest
precision (smallest error variance) is at the point of origin of the logit
scale. The square root of the error variance (col 10) yields the CSEM
equivalent for IRT measures. And then, by a second inversion these measure
values are transformed into the identical normal CSEM values (col 11 - 12) for
a CTT item analysis. The total view in Table 45 was too complicated. Charts 85
– 88 are also.

Chart 89

My third, simple, and last view is a flowchart (Chart 89)
constructed from the above charts and tables.

CTT captures the variation (in marks) within a student score
in the variance (0.15); IRT captures the variation (in probabilities) as
information (0.15). In all cases the score variance and score information are
treated with the square root (SQRT, pink) to yield standard errors (estimates
of precision: CTT CSEM, on a normal scale in counts, and IRT (CSEM) on a logit
scale in measures).

In summary, as CTT score variance and IRT score information
(red) increase, CSEM increases on a normal scale (Chart 89). Precision
decreases. At the same time IRT
error variance (green) and IRT (CSEM) decrease on a logit scale. Precision
increases with respect to the Rasch model point of origin zero (50% on a normal
scale). This inversion aligns the IRT (CSEM) to student scores in measures on a
logit scale.

It appears that the meaning of this depends upon what is
being measured and how well it is being measured. CTT measures in counts and
sets error (based on the score variance, Chart 89, red) about the student score
count on a normal scale (CSEM). IRT converts counts to “measures”. IRT then
measures in “measures” and sets error (based on the error variance, Chart 89,
green) about the point of origin (zero) on a logit scale that corresponds to
50% on a normal scale.

Chart 90

The two methods of feeding an item analysis are using two
different reference points. This was easier to see when I took the core out of
Chart 88 and plotted it in a more common form in Chart 90. Precision on both
scales is shown in solid black. This line intersects the Rasch model IRT
characteristic curve where normal is 50% and IRT is zero. At a count of 17
right, the normal scale shows higher precision; the logit scale shows lower
precision with respect to the perfect Rasch model.

The characteristic curve is a collection of points where student ability and item
difficulty match, so that students with a given ability are expected to get 50% right answers on items with matching difficulties. This situation exists for CTT only at the average test score (mean).

[The slope of the
test characteristic curve is given as the inverse of the raw score error
variance (3.24, red, Chart 88 - 89, and Table 46).]

Chart 91

Chart 91 applies the above thinking to real classroom data
(Table 45c). This time the average score was not at 50% but at 81%. The lowest
student score on Table 45c was 12 (57%).

In a lost reference, I have read that at the 50% point
students do not know anything; it is all chance. I can see that for true-false.
That could put CTT and IRT in conflict. A student must know something to earn a
score of 50% when there are four options to each item. There is a free 25%. The
student must supply the remaining 25%. Also, few CTT tests are filled with items
that have maximum discrimination and precision. A high quality CTT test can
look very much like a high quality IRT test. The difference is that the IRT
item analysis takes more into the calculations than the CTT analysis, whether the test is
offered as forced-choice (a cheap way to rank students) or with knowledge and judgment scoring (where students report what they actually know and find
meaningful and useful: the basis for effective teaching).

Historically, test reliability was the chief marketing point
of standardized tests. In the past decade the precision of individual student
scores has replaced test reliability. IRT (CSEM) provides a more marketable
product along with promoting the sale of equipment and related CAT services.
Again psychometricians on the backside are continuing to support and lend
credibility to the claims from the sales office on the front end.

Wednesday, March 11, 2015

A single standardized right-count score (RCS) has little
meaning beyond a ranking. A knowledge and judgment score (KJS) from the same
set of questions not only tells us how much the student may know or can do but
also the judgment to make use of that knowledge and skill. A student with a RCS
must be told what he/she knows or can do. A student with a KJS tells the
teacher or test maker what he/she knows. A RCS becomes a token in a federally
sponsored political game. A KJS is a base onto which students build further
learning and teachers build further instruction.

Table 40. RCS

Table 41. KJS

The previous two posts dealt with student ability during the test. This one looks at the
score after the test. I developed
four runs of the Visual Education Statistics Engine: Table 40. RCS, Table 41. KJS
(simulated), and after maximizing item discrimination, Table 42. RCSmax, and Table
43. KJSmax.

Table 42. RCSmax

Table 43. KJSmax

Test reliability and the standard error of measurement (SEM) with
some related statistics are gathered into Table 44. The reliability and SEM
values are plotted on Chart 81 below.

Table 44

Students, on average, can reduce their wrong marks by about
one half when they first switch to knowledge and judgment scoring. The most
obvious effect of changing 24 of 48 zeros to a value of 0.5 to simulate Knowledge
and Judgment Scoring (KJS) was to reduce test reliability (0.36, red). Scoring both
quantity and quality also increased the average test score from 64% to 73%.

Psychometricians do not like the reduction in test
reliability. Standardized paper tests were marketed as “the higher the
reliability the better the test”. Marketing has now moved to “the lower the
standard error of measurement (SEM), the better the test”, using computers, CAT
and online testing (green). The simulated KJS shows a better SEM (10%) in
relation to 12% for RCS. By switching the current emphasis from test reliability to
precision (SEM), KJS now shows a slight advantage over RCS to test makers.
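The SEM values in Table 44 follow the usual CTT relationship, SEM = SD * SQRT(1 - reliability). A sketch of that formula; the score SD values and the RCS reliability below are illustrative assumptions, not Table 44 entries, chosen only to show how a lower reliability (the KJS 0.36) can still pair with a lower SEM when the score spread also shrinks:

```python
import math

def sem(sd, reliability):
    """Classical test theory standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative values only, not the actual Table 44 entries.
print(round(sem(sd=20.0, reliability=0.64), 1))   # RCS-like: SEM = 12.0
print(round(sem(sd=12.5, reliability=0.36), 1))   # KJS-like: SEM = 10.0
```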

Chart 80

Chart 80 shows the general relationships between a
right-count score and a KJS. This is Chart 4/4 from the previous post tipped on
its side with the 60% passing performance replaced with the average scores of
64% RMS and 73% KJS. Again, KJS is not a giveaway. There is an increase in the
score, if the student elects to use his/her judgment. There is also an increase
in the ability to know what a student actually knows because the student is
given the opportunity to report what is known, not to just to mark an answer to
every question (even before looking at the test).

Chart 81

Chart 81 expands Chart 80 using the statistics in Table 44. In
general there is little difference between a right-count score and a KJS,
statistically. What is different is what is known about the student; the full
meaning of the score. Right-count scoring delivers a score on a test carefully
crafted to deliver a desired on-average test score distribution and cut score. THE
TEST IS DESIGNED TO PRODUCE THE DESIRED SCORE DISTRIBUTION. The KJS adds to
this the ability to assess what students actually know and can do that is of
value to them. The knowledge and judgment score assesses the complete student
(quantity and quality).

Knowledge and Judgment Scoring requires appropriate
implementation for the maximum effect on student development. In my experience,
the switch from RCS must be voluntary to promote student development. It must
result in a change in the level of thinking and related study habits where the
student assumes responsibility for learning and reporting. At that time
students feel comfortable changing scoring methods. They like the quality
score. It reassures them that they really can learn and understand.

KJS no longer has a totally negative effect on current psychometrician
attempts to sharpen their data reduction tools. But there are still the effects
of tradition and project size. The NCLB movement failed (in part)
because low performing schools mimicked the standardized tests rather than
tending to teaching and learning. Their attempt to succeed was
counterproductive. Doing more of the same does not produce different results.
These schools could also be expected to mimic standardized tests offering KJS.

The current CCSS movement is based on the need for one test
for all in an attempt to get valid comparisons between students, teachers,
schools and states. The effect has been gigantic contracts that only a few
companies have the capacity to bid on and little competition to modernize their
test scoring.

KJS is then a supplement to RCS. It can be offered on
standardized tests. As such, it updates the multiple-choice test to its maximum
potential, IMHO. KJS can be implemented in the classroom, by testing companies
and entrepreneurs who see the mismatch between instruction and assessment.

Knowledge Factor has already done this with their patented
learning/assessment system, Amplifire.
It can prepare students online for current standardized tests. Power
Up Plus is free for paper classroom tests. (Please see the two preceding
posts for more details related to student ability during the test).

Wednesday, February 11, 2015

Students, teachers, and test makers each
have responsibilities that contribute to the meaning of a multiple-choice test
score. This post extracts the responsibilities from the four charts in the
prior post, Meaningful Multiple-Choice Test Scores, that compares short answer,
right-count traditional multiple-choice, and knowledge and judgment scoring
(KJS) of both.

Testing looks simple: learn, test, and evaluate. Short
answer, multiple-choice, or both with student judgment. Lower levels of thinking,
higher levels of thinking, or both as needed. Student ability below, on level,
or above grade level. There are many more variables for standardized test
makers to worry about in a nearly impossible situation. By the time these have
been sanitized from their standardized tests all that remains is a ranking on
the test that is of little if any instructional value (unless student judgment
is added to the scoring).

Chart 1/4 compares a short answer and a right-count
traditional multiple-choice test. The teacher has the most responsibility for
the test score when working with pupils at lower levels of thinking (60%). A
high quality student functioning at higher levels of thinking could take the
responsibility to report what is known or can be done in one pass and then just
mark the remainder for the same score (60%). The teacher’s score is based on
the subjective interpretation of the student’s work. The student’s score is
based on a matching of the subjective interpretation of the test questions with
test preparation. [The judgment needed to do this is not recorded in
traditional multiple-choice scores.]

Chart 2/4 compares what students are told about
multiple-choice tests and what actually takes place. Students are told the
starting score is zero. One point is added for each right mark. Wrong or blank answers
add nothing. There is no penalty. Mark an answer to every question. As a classroom
test, this makes sense if the results are returned in a functional formative
assessment environment. Teachers have the responsibility to sum several scores
when ranking students for grades.

As a standardized test, the single score is very unfair.
Test makers place great emphasis on the right-mark after-test score and the
precision of their data reduction tools (for individual questions and for groups
of students). They have a responsibility of pointing out that the student on
either side of you has an unknowable, different, starting score from chance,
let alone your luck on test day. The forced-choice test actually functions as a
lottery. Lower scoring students are well aware of this and adjust their sense
of responsibility accordingly (in the absence of a judgment or quality score to
guide them).

Chart 3/4 compares student performance by quality. Only a
student with a well-developed sense of responsibility, or a comparable innate
ability, can be expected to function as a high quality, high scoring, student
(100% but reported as 60%). A less self-motivated student or with less ability
can perform two passes at 100% and 80% to also yield 60%. The typical student,
facing a multiple-choice test, will make one pass; marking every question as it
comes to earn a quantity, quality, and test score of 60%; a rank of 60%. No one knows which right mark is a right
answer.

Teachers and test makers have a responsibility to assess and
report individual student quality on multiple-choice tests just as is done on
short-answer, essay, project, research, and performance tests. These notes of
encouragement and direction provide the same “feel good” effect found in a
knowledge and judgment scored quality score when accompanied with a list of what was known or could be done
(the right-marked questions).

Chart 4/4 shows knowledge and judgment scoring (KJS) with a
five-option question made from a regular four-option question plus omit. Omit
replaces “just marking”. A short answer question scored with KJS earns one
point for judgment and +/-1 point for right or wrong. An essay question
expecting four bits of information (short sentence, relationship, sketch, or chart)
earns 4 points for judgment and +/-4 points for an acceptable or not acceptable
report. (All fluff, filler, and snow are ignored. Students quickly learn to not
waste time on these unless the test is scored at the lowest level of thinking
by a “positive” scorer.)

Each student starts with the same multiple-choice score:
50%. Each student stops when each student has customized the test to that
student’s preparation. This produces an accurate, honest and fair test score. The
quality score provides judgment guidance for students at all levels. It is the
best that I know of when operating with paper and pencil. Power
Up Plus is a free example. Amplifire
refines judgment into confidence using a computer, and now on the Internet. It
is just easier to teach a high quality student who knows what he/she knows.

Most teachers I have met question the score of 60% from KJS.
How can a student get a score of 60% and only mark 10% of the questions right?
Easy. Sum 50% for perfect judgment, 10% for right answers, and NO wrong. Or sum 10% right, 10% right
and 10% wrong, and omit 20%. If the
student in the example chose to mark 10% right (a few well mastered facts) and
then just marked the rest (had no idea how to answer) the resulting score falls
below 40% (about 25% wrong). With no
judgment, the two methods of scoring (smart and dumb) produce identical test
scores. KJS is not a give-away. It is a simple, easy way to update currently
used multiple-choice questions to produce an accurate, honest, and fair test
score. KJS records what right-count traditional multiple-choice misses
(judgment) and what the CCSS movement tries to promote.

Wednesday, January 14, 2015

The meaning of a multiple-choice test score is determined by
several factors in the testing cycle including test creation, test
instructions, and the shift from teacher to student being responsible for
learning and reporting. Luck-on-test-day, in this discussion, is considered to
have similar effects on the following scoring methods.

[Luck-on-test-day includes but is not limited to: test
blueprint, question author, item calibration, test creator, teacher,
curriculum, standards; classroom, home, and in between, environment; and a
little bit of random chance (act of God that psychometricians need to smooth
their data).]

Three ways of obtaining test scores: open ended short answer, closed ended right-count four-part
multiple-choice, and knowledge and judgment scoring (KJS) for both short
answer and multiple-choice. These range from familiar manual scoring to what is
now easily done with KJS computer software. Each method of scoring has a
different starting score with a different meaning. The customary average classroom
score of 75% is assumed (60% passing).

Chart 1/4

Open ended short
answer scores start with zero and increase with each acceptable answer.
There may be several acceptable answers for a single short answer question. The
level of thinking required depends upon the stem of the question. There may be
an acceptable answer for a question both at lower and at higher levels of
thinking. These properties carry over into KJS below.

The teacher or test maker is responsible for scoring the
test (Mastery = 60%; + Wrong = 0%; = 60% passing for quantity in Chart 1/4).
The quality of the answers can be judged by the scorer and may influence which
ones are considered right answers.

The open ended short answer question is flexible (multiple
right answers) and with some subjectivity; different scorers are expected to
produce similar scores. The average test score is controlled by selecting a set
of items that is expected to yield an average test score of 75%. The student test
score is a rank based on items included in the test to survey what students
were expected to master, items that group students who know from those who do not
know each item, and items that fail to show mastery or discrimination (unfinished
items, for a host of reasons, including luck-on-test-day above).

The open ended short answer question can also be scored as a
multiple-choice item. First tabulate the answers. Sort the answers from high to
low count. The most frequent
answer, on a normal question, will be the right answer option. The next three
ranking answers will be real student supplied wrong answer options (rather than
test writer created wrong answer options). This pseudo-multiple-choice item can
now be printed as a real question on your next multiple-choice test (with
answers scrambled).
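A sketch of that tabulation step; the short-answer responses below are invented for illustration:

```python
from collections import Counter
import random

# Invented short-answer responses to one question.
responses = ["mitochondria", "nucleus", "mitochondria", "ribosome", "mitochondria",
             "chloroplast", "nucleus", "mitochondria", "cell wall", "nucleus"]

counts = Counter(responses).most_common()
key = counts[0][0]                         # most frequent answer becomes the keyed right option
distractors = [a for a, _ in counts[1:4]]  # next three become student-supplied wrong options

options = [key] + distractors
random.shuffle(options)                    # scramble the options before printing the new item
print("Key:", key)
print("Options:", options)
```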

A high quality student could also mark only right answers on
the first pass using the above test (Chart 1/4) and then finish by just marking
on the second pass to earn a score of 60%. A lower quality student could just
mark each item in order, as is usually done on multiple-choice tests, mixing
right and wrong marks, to earn the same score of 60%. Using only a score after
the test we cannot see what is taking place during the test. Turning a short
answer test into traditional multiple-choice hides student quality, the very
thing that the CCSS movement is now promoting.

Chart 2/4

Closed ended
right-count four-option multiple-choice scores start with zero and increase
with each right mark. Not really!! This is only how this method of scoring has
been marketed for a century by only considering a score based on right-counts
after the test is completed. In the first place traditional multiple-choice is
not multiple-choice, but forced-choice (it lacks one option discussed below).
This injects a 25% bonus (on average) at the start of the test (Chart 2/4). This
evil flaw in test design was countered, over 50 years ago, by a now defunct
“formula scoring”. After forcing students to guess, psychometricians wanted to remove
the effect of just marking! It took the SAT until March 2014 to
drop this “score correction”.

[Since there was no way to tell which right answer must be
changed for the correction, it made no sense to anyone other than psychometricians
wanting to optimize their data reduction tools, with disregard for the effect of
the correction on the students taking such a test. Now that 4-option questions
have become popular on standardized tests, a student who can eliminate one
option can guess from the remaining three for better odds on getting a right
mark (which is not necessarily a right answer that reflects recall,
understanding, or skill).]
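For reference, a minimal sketch of that now defunct correction (right marks minus a fraction of wrong marks); the counts are invented:

```python
def formula_score(right, wrong, options=4):
    """Old 'formula scoring': subtract a fraction of wrong marks to offset guessing.
    Omitted items neither add nor subtract."""
    return right - wrong / (options - 1)

# A student with 60 right, 30 wrong, and 10 omitted on a 100-item, 4-option test.
print(formula_score(60, 30))   # 60 - 30/3 = 50.0
```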

The closed ended right-count four-option multiple-choice question
is inflexible (one right answer) and with no scoring subjectivity; all scorers
yield the same count of right marks. Again, the average test score is
controlled by selecting a set of items expected to yield 75% on-average (60%
passing). However, this 75% is not the same as that for the open ended short
answer test. As a forced-choice test, the multiple-choice test will be easier;
it starts with a 25% on-average advantage. (That means one student may start
with 15% and a classmate with 35%.) To further confound things, the level of
thinking used by students can also vary. A forced-choice test can be marked
entirely at lower levels of thinking.

[Standardized tests control part of the above problems by
eliminating almost all mastery and unfinished items. The game is to use the
fewest items that will produce a desired score distribution with an acceptable
reliability. A traditional multiple-choice scored standardized test score of 60%
is a much more difficult accomplishment than the same score on a classroom
test.]

A forced-choice test score is a rank of how well a student
did on a test. It is not a report of what a student actually knows or can do
that will serve as the basis for further instruction and learning. The
reasoning is rather simple: the forced-choice score is counted up AFTER the
test is finished; this is the final game score. How the game started (25%
on-average) and was played is not observed (but this is what sports fans pay
for). This is what students and teachers need to know so students can take
responsibility for self-corrective learning.

Chart 3/4

[Three student performances that all end up with a
traditional multiple-choice score of 60% are shown in Chart 3/4. The highest
quality student used two passes, “I know or can do this or I can eliminate all
the wrong options” and “I don’t have a clue”. The next lower quality student
used three passes, “I know or can do this”; “I can eliminate one or more answer
options before marking” and “I am just marking.” The lowest level of thinking
student just marks answers in one pass, right and wrong, as most low quality,
lower level of thinking students do. But what takes place during the test is
not seen in the score made after the test. The lowest quality student must
review all past work (if tests are cumulative) or continue on with an
additional burden as a low quality student. A high quality student needs only
to check on what has not been learned.]

Chart 4/4

Knowledge and
Judgment scores start at 50% for every student plus one point for
acceptable and minus one point for not acceptable (right/wrong on traditional multiple-choice).
(Lower level of thinking students prefer: Wrong = 0, Omit = 1, and Right = 2.) Omitting an answer is good
judgment to report what has yet to be learned or to be done (understood).
Omitting keeps the one point for good judgment. An unacceptable or wrong mark
is poor judgment. You lose one point for bad judgment.
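A minimal sketch of one reading of this rule, using the equivalent Wrong = 0, Omit = 1, Right = 2 coding mentioned above (so an all-omit paper scores 50%); the mark string is invented:

```python
def kjs_score(marks):
    """One reading of the Chart 4/4 rule: Wrong = 0, Omit = 1, Right = 2 points per item,
    reported as a percent of the 2-points-per-item maximum (all omits = 50%)."""
    points = {"right": 2, "omit": 1, "wrong": 0}
    total = sum(points[m] for m in marks)
    return 100.0 * total / (2 * len(marks))

# Invented 10-item mark string: 6 right, 1 wrong, 3 omitted.
marks = ["right"] * 6 + ["wrong"] + ["omit"] * 3
print(kjs_score(marks))   # 75.0
```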

Now what is hidden with forced-choice scoring is visible
with Knowledge and Judgment Scoring (KJS). Each student can show how the game is
played. There is a separate student score for quantity and for quality. A
starting score of 50% gives quantity and quality equal value (Chart 4/4). [Knowledge
Factor sets the starting score near 75%. Judgment is far more important than
knowledge in high risk occupations.]

KJS includes a fifth answer option: omit (good judgment to
report what has yet to be learned or understood). When this option is not used,
the test reverts to forced-choice scoring (marking one of the four answer
options for every question).

A high quality student marked 10 right out of 10 marked and
then omitted the remainder (in two passes through the test) or managed to do a
few of one right and one wrong (three passes) for a passing score of 60% in
Chart 4/4. A student of less quality did not omit but just marked for a score
of less than 50%. A lower level of thinking, low quality student marked 10
right and just marked the rest (two passes) for a score of less than 40%. KJS
yields a score based on student judgment (60%) or on the lack of that judgment
(less than 50%).

In summary, the
current assessment fad is still oriented on right marks rather than on student
judgment (and development). Students with a practiced good judgment develop the
sense of responsibility needed to learn at all levels of thinking. They do not
have to wait for the teacher to tell them they are right. Learning is stimulated
and exhilarating. It is fun to learn when you can question, get answers, and
verify a right answer or a new level of understanding; when you can build on
your own trusted foundation.

Low quality students learn by repeating the teacher. High
quality students learn by making sense of an assignment. Traditional multiple-choice
(TMC) assesses and rewards lower-levels-of-thinking. KJS assesses and rewards
all-levels-of-thinking. TMC requires little sense of responsibility. KJS
rewards (encourages) the sense of responsibility needed to learn at all levels
of thinking.

1. A short answer, hand scored, test score is an indicator of student ability and class ranking based on the scorer’s judgment. The scorer can make a subjective estimate of student quality.

2. A TMC score is only a rank on a completed test with increased confounding at lower scores. A score matching a short answer score is easier to obtain in the classroom and much more difficult to obtain on a standardized test.

3. A KJS test score is based on a student, self-reporting, estimate of what the student knows and can do on a completed test (quantity) and an estimate of the student’s ability to make use of that knowledge (judgment) during the test (quality). The score has student judgment and quality, not scorer judgment and quality.

In short, students who know that they can learn (get rapid
feedback on quantity and quality), who
want to learn, enjoy learning (see Amplifire below). All testing methods fail
to promote these student development characteristics unless the test results
are meaningful, easy to use by students and teachers, and timely. Student
development requires student performance, not just talking about it or labeling
something formative assessment.

Power Up Plus (PUP or
PowerUP) scores both TMC and KJS. Students have the option of selecting the
method of scoring they are comfortable with. Such standardized tests have the
ability to estimate the level of thinking used in the classroom and by each
student. Lack of information,
misinformation, misconceptions and cheating can be detected by school, teacher,
classroom, and student.

Power Up Plus is hosted at TeachersPayTeachers to share what
was learned in a nine year period with 3000 students at NWMSU. The free download below supports individual
teachers who want to upgrade their multiple-choice tests for formative,
cumulative, and exit ticket assessment. Good teachers, working within the
bounds of accepted standards, do not need to rely on expensive assessments.
They (and their students) do need fast, easy to use, test results to develop
successful high quality students.

I hope your students respond with the same positive
enthusiasm that over 90% of mine did. We need to assess students to promote
their abilities. We do not need to primarily assess students to promote the
development of psychometric tools that yield far less than what is marketed.

Created partial
credit scoring for the Rasch model (1982) as a scoring refinement for
traditional right-count multiple-choice. It gives partial credit for near right
marks. It does not change the meaning of the right-count score (as quantity and
quality have the same value by default [both wrong marks and blanks are counted
as zeros], only quantity is scored). The routine is free in Ministep software.

Richard A. Hart (1930-) Promotes student development by student self-assessment of what each
student actually knows and can do, AFTER learning, with “next class period”
feedback.

Knowledge and
Judgment Scoring was started as Net-Yield-Scoring in 1975. Later I used it to
reduce the time needed for students to write, and for me to score, short answer
and essay questions. I created software (1981) to score multiple-choice, both
right-count, and knowledge and judgment, to encourage students to take responsibility
for what they were learning at all levels of thinking in any subject area. Students
voted to give knowledge and judgment equal value. The right-count score retains
the same meaning (quantity of right marks) as above. The knowledge and judgment
score is a composite of the judgment score (quality, the “feel good” score
AFTER learning) and the right-count score (quantity). Power
Up Plus (2006) is classroom friendly (for students and teachers) and a free
download: Smarter
Test Scoring and Item Analysis.

Knowledge Factor
was built on the work of Walter R. Borg (1921-1990). The patented learning-assessment
program, Amplifire, places much more weight on confidence than on knowledge (a
wrong mark may reduce the score by three times as much as a right mark adds).
The software leads students through the steps needed to learn easily, quickly
and in a depth that is easily retained for more than a year. Students do not
have to master the study skills and the sense of responsibility needed to learn
at all levels of thinking needed for mastery with KJS. Amplifire is student friendly, online,
and so very commercially successful in developed topics that it is not free.

[Judgment and confidence are not the same thing. Judgment is
measured by performance (percent of right marks), AFTER learning, at any level
of student score. Confidence is a good feeling that Amplifire skillfully uses
to promote rapid learning, DURING learning and self-assessment, into a mastery
level. Students can take confidence in their practiced and applied
self-judgment. The KJS and Amplifire test scores reflect the complete student.
IMHO standardized tests should do this also, considering their cost in time and
money.]

Wednesday, December 10, 2014

Adding 22 balanced items to Table 33 of 21 items, in the
prior post, resulted in a similar average test score (Table 36) and the same
item information functions (the added items were duplicates of those in the
first Nurse124 data set of 21 items.) What happens if an unbalanced set of 6
items is added? I just deleted the 16 high scoring additions from Table 36.
Both balanced additions (Table 36) and unbalanced additions (Table 39) had the
same extended range of item difficulties (5 to 21 right marks, or 23% to 95%
difficulty).

Table 33

Table 36

Table 39

Adding a balanced set of items to the Nurse124 data set kept
the average score the same: 80% and 79% (Table 36). Adding a set of more
difficult items to the Nurse124 data decreased the average score to 70% (Table
39) and decreased student scores. Traditionally, a student’s overall score is
then the average of the three test scores: 80%, 79% and 70% or 76% for an
average student (Tables 33, 36, and 39). An estimate of a student’s “ability”
is thus directly dependent upon his test scores which are dependent upon the
difficulty of the items on each test. This score is accepted as a best estimate
of the student’s true score. This value is a best guess of future test scores.
This makes common sense: past performance is a predictor of future performance.

[Again a
distinction must be made between what is being measured by right mark scoring
(0,1) and by knowledge and judgment scoring (0,1,2). One yields a rank on a
test the student may not be able to read or understand. The other also
indicates the quality of each student’s knowledge; the ability to make meaningful
use of knowledge and skills. Both methods of analysis can use the exact same
tests. I continue to wonder why people are still paying full price but harvesting
only a portion of the results.]

The Rasch model IRT takes a very different route to
“ability”. The very same student mark data sets can be used. Expected IRT student
scores are based on the probability that half of all students with a given
ability location will correctly mark a question with a comparable difficulty
location on a single logit scale. (More at Winsteps and my Rasch Model Audit blog.) [The location starts from the natural
log of a ratio of right/wrong score and wrong/right difficulty. A convergence
of score and difficulty yields the final location. The 50% test score becomes
the zero logit location, the only point right mark scoring and IRT scores are
in full agreement.]

The Rasch model IRT converts student scores and item
difficulties [in the marginal cells of student data] into the probabilities of
a right answer (Table 33b). [The probabilities replace the marks in the central
cell field of student data.] It also yields raw student scores and their conditional
standard errors of measurement (CSEMs) (Table 33c, 34c, and 39c) based on the probabilities of a right answer rather
than the count of right marks. (For
more see my Rasch Model Audit blog.)
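A sketch of that conversion for a single student, assuming ability and item difficulties are already on the logit scale; the values below are invented, not Table 33 entries:

```python
import math

def p_right(ability, difficulty):
    """Rasch model probability of a right answer from logit ability and difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

ability = 1.5                                          # invented student measure (logits)
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]   # invented item measures (logits)

probs = [p_right(ability, b) for b in difficulties]
expected_score = sum(probs)                  # expected raw score built from probabilities
info = sum(p * (1 - p) for p in probs)       # test information at this ability
csem = 1.0 / math.sqrt(info)                 # conditional SE of the measure (logits)

print(round(expected_score, 2), round(csem, 2))
```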

Student ability becomes fixed and separated from the student
test score; a student with a given ability can obtain a range of scores on
future tests without affecting his ability location. A calibrated item can yield a range
of difficulties on future tests without affecting its difficulty calibrated location. This makes
sense only in relation to the trust you can have in the person interpreting IRT
results; that person’s skill, knowledge, and (most important) experience at all
levels of assessment: student performance expectations, test blueprint, and politics.

In practice, student data that do not fit well, “look
right”, can be eliminated from the data set. Also the same data set (Table 33,
Table 36, and Table 39) can be treated differently if it is classified as field
test, operational test, benchmark test, or current test.

At this point states recalibrated and creatively
equilibrated test results to optimize federal dollars during the NCLB era by
showing gradual, continuing improvement. It is time to end the ranking of students by right mark
scoring (0,1 scoring) and include KJS,
or PCM (0,1,2 scoring) [that about every state education department has:
Winsteps], so that standardized testing yields the results needed to guide
student development: the main goal of the CCSS movement.

The need to equilibrate a test is an admission of failure.
The practice has become “normal” because failure is so common. It opened the
door to cheating at state and national levels. [To my knowledge no one has been
charged and convicted of a crime for this cheating.] Current computer adaptive
testing (CAT) hovers about the 50% level of difficulty. This optimizes
psychometric tools. Having a disinterested party outside of the educational
community doing the assessment analysis and online CAT
reduce the opportunity to cheat. They do not IMHO optimize the usefulness of
the test results. End-of-course tests are now molding standardized testing into
an instrument to evaluate teacher effectiveness rather than assess student
knowledge and judgment (student development).


Wednesday, November 12, 2014

I learned in the prior post that test precision can be adjusted by selecting the needed set of items based on their item information
functions (IIF). This post makes use of that observation to improve the
Nurse124 data set that generated the set of IIFs in Chart 75.

I observed that Tables 33 and 34, in the prior post,
contained no items with difficulties below 45%. The item information functions
(IIF) were also skewed (Chart 75). This is not the symmetrical display
associated with the Rasch IRT model. I reasoned that adding a balanced set of
items would increase the number of IFFs without changing the average item
difficulty.

Table 36a shows the addition of a balanced set of 22 items
to the Nurse124 data set of 21 items. As each lower ranking item was added, one
or more high ranking items were added to keep the average test score near 80%.
This table added six lower ranking items and 16 higher scoring items resulting
in an average score of 79% and 43 items total.

Table 36

The average item difficulty for the Nurse124 data set was
17.57 and the expanded set was 17.28. The average test score of 80% came in as
79%. Student scores (ability) also remained about the same. [I did not take the
time to tweak the additions for a better fit.] Both item difficulty and student
score (ability) remained about the same.

The conditional standard error of measurement (CSEM) did
change with the addition of more items (Chart 79 below). The number of cells
containing information expanded from 99 to 204 cells. The average right count
student score increased from 17 to 34.

Table 36c shows the resulting item information functions
(IIF). The original set of 11 IIFs now contains 17 IIFs (orange). The original set
of 9 different student scores now contains 12 different scores, however the
range of student scores is comparable between the two sets. This makes sense as
the average test scores are similar and the student scores are also about the
same.

Table 37

Chart 77

Chart 77 (Table 37) shows the 17 IIFs as they spread across the
student ability range of 12 rankings (student score right count/% right). The
trace for the IIF with a difficulty of 11/50% (blue square) peaks (0.25) near
the average test score of 79%. This was expected as the maximum information
value within an IIF occurs when the
item difficulty and student ability score match. [The three bottom traces on
Chart 77 (blue, red, and green) have been colored in Table 37 as an aid in
relating the table and chart (rotate Table 37 counter-clockwise 90 degrees).]

Even more important is the way the traces are increasingly
skewed the further the IIFs are away from this maximum, 11/50%, trace (blue
square, Chart 77). Also the IIF with a difficulty of 18/82%, near the average
test score, produced the identical total information (1.41) from both the
Nurse124 and the supplemented data sets. But these values also drifted apart
for the two data sets for IIFs of higher and lower difficulty.

Two IIFs near the 50% difficulty point delivered the maximum
information (2.17). Here again is evidence that prompts psychometricians to
work closely to the 50% or zero logit point to optimize their tools when
working on low quality data (limiting scoring only to right counts rather than also
offering students the option to assess their judgment to report what is
actually meaningful and useful; to assess their development toward being a successful,
independent, high quality achiever). [Students that only need some guidance
rather than endless “re-teaching”; that, for the most part, consider right
count standardized tests a joke and a waste of time.]

Chart 78

Table 38

The test information function for the supplemented data set
is the sum of the information in all 17 item information functions (Table 38
and Chart 78). It took 16 easy items to balance 6 difficult items. The result
was a marked increase in precision at the student score levels between 30/70%
and 32/74%. [More at Rasch Model
Audit blog.]

Chart 79

Chart 79 summarizes the relationships between the Nurse124
data, the supplemented data (adding a balanced set of items that keeps student
ability and item difficulty unchanged), and the CTT and IRT data reduction
methods. The IRT logit values (green) were plotted directly and inverted (1/CSEM)
for comparison. In general, both CTT (blue) and IRT inverted (red) produced comparable CSEM values.

Adding 22 items increased the CTT Test SEM from 1.75 to
2.54. The standard deviation (SD) between student test scores increased from
2.07 to 4.46. The relative effect is 1.75/2.07 and 2.54/4.46, or 84% and
57%, a difference of 27 percentage points, or an improvement in relative precision of 27/84 or 32%.
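A sketch of the same arithmetic, treating the SEM as a share of the score spread (SD):

```python
sem_before, sd_before = 1.75, 2.07   # Nurse124, 21 items
sem_after, sd_after = 2.54, 4.46     # supplemented set, 43 items

rel_before = sem_before / sd_before                    # error as a share of the spread, ~84%
rel_after = sem_after / sd_after                       # ~57%
improvement = (rel_before - rel_after) / rel_before    # ~32% relative improvement

print(round(rel_before * 100, 1), round(rel_after * 100, 1), round(improvement * 100, 1))
```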

Chart 79 also makes it very obvious that the higher the
student test score the lower the CTT CSEM, the more precise the student score measurement,
the less error. That makes sense.

The above statement about a CTT CSEM must be related to a
second statement that the more item information, the greater the precision of
measurement by the item at this student score rank. The first statement
harvests variance from the central cell field from within rows of student (right) marks (Table 36a) and from rows of probabilities (of right marks)
in Table 36c.

The binomial
variance CTT CSEM view is then comparable to the reciprocal or inverted
(1/CSEM) view of the test information function CSEM view (Chart 79). CTT (blue,
CTT Nurse124, Chart 79) and IRT inverted (red, IRT N124 Inverted) produced
similar results even with an average test score of 79% that is 29 percentage
points away from the 50%, zero logit, IRT optimum performance point.

The second statement harvests variance, item information
functions, in Table 36c from columns of
probabilities (of right marks). Layering one IIF on top of another across
the student score distribution yields the test information function (Chart 78).

The Rasch IRT model harvests the variance from rows and from columns of probabilities of getting
a right answer that were generated from the marginal student scores and item
difficulties. CTT harvests from the variance of the marks students actually made. Yet,
at the count-only right mark level, they deliver very similar results, with the
exception of the IIF from the IRT analysis, which the CTT analysis does not provide.

- - - - - - - - - - - - - - - - - - - - -

The Best of the Blog - FREE

The Visual Education Statistics Engine (VESEngine) presents the common education statistics on one Excel traditional two-dimensional spreadsheet. The post includes definitions. Download as .xlsm or .xls.

This blog started five years ago. It has meandered through several views. The current project is visualizing the VESEngine in three dimensions. The observed student mark patterns (on their answer sheets) are on one level. The variation in the mark patterns (variance) is on the second level.

Power Up Plus (PUP) is classroom friendly software used to score and analyze what students guess (traditional multiple-choice) and what they report as the basis for further learning and instruction (knowledge and judgment scoring multiple-choice). This is a quick way to update your multiple-choice to meet Common Core State Standards (promote understanding as well as rote memory). Knowledge and judgment scoring originated as a classroom project, starting in 1980, that converted passive pupils into self-correcting highly successful achievers in two to nine months. Download as .xlsm or .xls.