Risk Management of Your Exams: Inside Numbers
Would you have predicted that the Pittsburgh Steelers would go 15-1 during this season? Would you have predicted that the Indianapolis Colts would be knocked out of the playoffs in the second round? The answer is that it depends on which statistics you used to make your predictions. Some numbers are used consistently to predict the outcomes of sporting events; others are used, with higher success rates, by television broadcasters such as Chris Berman.
The same is true in the testing business. We use results from a field test or beta sample to make predictions about how well a test will perform. Do you stop there, or should you be looking at additional data? The focus of this month’s article is on looking inside the numbers—the numbers you should be tracking to monitor the health of your test.
There is a great body of knowledge available about which statistics are appropriate for use in the initial development of a test. These statistics vary depending on whether you use a fixed-form, multi-stage, computer-adaptive or linear on-the-fly testing model. There is much less information widely available about which statistics are most commonly used to monitor the health of a test. The literature tends to focus on item-level performance using item drift statistics. While those statistics are important, there are many other statistics about the overall test or item pool that can be monitored more easily, before you even think about the more advanced analyses.
There are three key areas that you should be tracking, at a minimum, if your program uses computer-based testing: test scores, data forensics and response latency. The first two also apply to paper-and-pencil tests.
First and foremost, it’s essential that you monitor any changes in test scores over time. You can choose to track the scaled test score or you may simply want to monitor pass rates in the case of licensure and certification testing. I recommend the development and review of a report that includes an analysis of scores over time. You can look at the overall test performance across geographies as well as countries and test sites in the case of licensure and certification, and school districts, schools and test administrators in the case of education. You may also want to consider conducting a retake analysis to determine if there are any extreme changes in test performance by examinees over time.
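The kind of report described above can be sketched in a few lines of code. This is a minimal illustration, not any program's actual reporting tool: the record fields ("site", "month", "passed") and the 20-point flagging threshold are assumptions chosen for the example, and any real threshold should come from your program's own baseline data.

```python
# Sketch: compute pass rates by site and month, then flag extreme
# month-over-month shifts for closer review. All field names and the
# threshold are illustrative assumptions.
from collections import defaultdict

def pass_rates_by_site(records):
    """Return {(site, month): pass_rate} from raw test records."""
    counts = defaultdict(lambda: [0, 0])  # (site, month) -> [passes, total]
    for r in records:
        key = (r["site"], r["month"])
        counts[key][0] += 1 if r["passed"] else 0
        counts[key][1] += 1
    return {k: passes / total for k, (passes, total) in counts.items()}

def flag_shifts(rates, threshold=0.20):
    """Flag sites whose pass rate moved more than `threshold` between
    consecutive months -- a cue for review, not proof of a problem."""
    by_site = defaultdict(dict)
    for (site, month), rate in rates.items():
        by_site[site][month] = rate
    flags = []
    for site, months in by_site.items():
        ordered = sorted(months)
        for prev, cur in zip(ordered, ordered[1:]):
            if abs(months[cur] - months[prev]) > threshold:
                flags.append((site, prev, cur))
    return flags
```

The same grouping logic extends to countries, school districts or individual test administrators by swapping the grouping key.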
A second area of analysis that you should investigate is an area sometimes referred to as "data forensics." Similar to the way crime scene investigators use forensic techniques to solve crime cases, program managers can use data forensics to detect test fraud. Test fraud can be divided into the following five areas: cheating, collusion, piracy, retake violations and volatile retakes. Depending on what type of test you deliver, one or more of these potential types of test fraud may occur within your testing program. A rise in pass rates may be one indicator of cheating, and irregular patterns of responding (missing easy items while passing difficult ones) are another. Collusion occurs when there are similar patterns of responding during the same test administration (such as at a test site or in a classroom). Not being able to complete the test is one indicator of a potential pirate. Retake violators are easy to identify by simply comparing the time between test events to the program policy. Volatile retakes are identified by extreme changes in test scores for any given examinee.
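The two simplest checks named above, retake violations and volatile retakes, can be sketched as follows. The 30-day waiting period and the 20-point swing are placeholder values standing in for your program's actual policy, and the data layout is an assumption made for the example.

```python
# Sketch: flag retake violations (retesting sooner than policy allows)
# and volatile retakes (extreme score swings between attempts).
# The min_days and max_swing defaults are illustrative, not real policy.
from datetime import date

def retake_violations(events, min_days=30):
    """Return examinees who retested before `min_days` had elapsed.
    `events` maps examinee id -> list of (test_date, score) tuples."""
    flagged = []
    for examinee, history in events.items():
        history = sorted(history)
        for (d1, _), (d2, _) in zip(history, history[1:]):
            if (d2 - d1).days < min_days:
                flagged.append(examinee)
                break
    return flagged

def volatile_retakes(events, max_swing=20):
    """Return examinees whose score changed by more than `max_swing`
    points between consecutive attempts."""
    flagged = []
    for examinee, history in events.items():
        history = sorted(history)
        for (_, s1), (_, s2) in zip(history, history[1:]):
            if abs(s2 - s1) > max_swing:
                flagged.append(examinee)
                break
    return flagged
```

Detecting collusion or aberrant response patterns requires item-level response data and more sophisticated similarity statistics; these two checks need only the test event log.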
Finally, the time that an examinee takes on an individual item and on the overall exam should be monitored on an ongoing basis and may warrant further investigation. If examinees in Beijing are completing exams three times faster than those in Boston, cheating may be occurring. If examinees can complete the exam in less than 40 percent of the time taken during the field test, your exam is likely overexposed and its content may be available on the Internet.
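A latency screen along these lines can be sketched as below. The 40 percent threshold mirrors the rule of thumb just described; the group names, completion times and field-test baseline are illustrative assumptions.

```python
# Sketch: flag locations whose average completion time falls far below
# the field-test baseline -- a possible sign of pre-knowledge or
# overexposed content. Threshold and data are illustrative.
def mean(xs):
    return sum(xs) / len(xs)

def latency_flags(times_by_group, field_test_mean, ratio=0.40):
    """Return groups whose average completion time is below
    `ratio` of the field-test average."""
    return [group for group, times in times_by_group.items()
            if mean(times) < ratio * field_test_mean]
```

The same comparison works at the item level, where unusually short latencies on hard items can be an even sharper signal than total test time.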
It’s important to track test scores and response time on an ongoing basis. If you can also allocate resources to generate any or all of the data forensic analyses mentioned, you will have a much clearer picture of the health of your test.
Cyndy Fitzgerald, Ph.D., is co-founder and senior security director at Caveon Test Security (www.caveon.com) and is a member of the Association of Test Publishers. Address any test security questions or recommendations to Cyndy via e-mail at firstname.lastname@example.org.