The challenge of writing performance test questions for certification exams
This feature first appeared in the Summer 2020 issue of Certification Magazine.
There are several approaches to measuring the knowledge, skills, and abilities of a candidate who wants to prove their competency by achieving a certification credential. In recent years, many exam creators have shifted away from relying on traditional item types such as multiple-choice questions to make these measurements.
One increasingly popular method of measuring knowledge, skills, and abilities is through the use of “performance” items that require an exam candidate to perform a specific task or solve a specific problem. The purpose of this article is to address several of the considerations and goals of performance test item writing.
While I will cover many topics, I will also exclude many details, such as how to define competencies and job tasks, psychometrics, types of performance tests, test administration, policies, infrastructure, deployment considerations, scoring methods, scalability, archiving, security, localization, maintenance, and costs. Should you have questions regarding these details, I encourage you to seek out other resources.
Validity and reliability
Regardless of the structure of an exam, I recommend a job task analysis to identify competencies (i.e., knowledge, skills, and/or abilities) and to ensure that all items map back to the blueprint. It is the item writer’s responsibility to ensure that each item measures what it is intended to measure, and that each competency is measured as frequently as the blueprint requires via relevant tasks.
Reliability can be tricky in performance tests and is often a source of debate amongst psychometricians for various reasons. For example, sometimes a performance test requires fewer measurements than a traditional exam. Or, a performance test might have a mix of highly related tasks measured on the outcome or tasks broken out into several points of measurement.
For the purposes of this article, I advise item writers to work closely with a psychometrician to ensure validity and reliability.
Exam fidelity and structure
Throughout the years, certification program managers and their stakeholders have debated the item types that are most effective when testing candidates. One can argue that an item author is able to write a very well-conceived, scenario-based multiple-choice item that would be just as effective as a performance test item.
In fact, I recently spoke with an industry colleague who has implemented an open-book, scenario-based exam that requires candidates to launch an instance of the test sponsor’s software in order to answer a multiple-choice item. To answer the item correctly, candidates need prior knowledge of what to look for within the software, training materials, or product documentation.
In other instances, certification program managers have implemented performance testing (e.g., completing tasks in a live, real-world environment with actual equipment) or performance-based testing (e.g., completing an emulation or simulation within the given confines of a scenario, virtual reality, board assessment, etc.) because it is perceived to be a more advanced level of testing. For the purposes of this article, I’ll refer to “performance testing” across the board.
Performance testing offers a high-fidelity way to engage candidates and a highly reliable, credible way to measure real-world skills. Most traditional exams, by contrast, use computer-based multiple-choice items. Regardless of candidate or stakeholder perceptions, however, learning objectives and outcomes may influence the need to evaluate a candidate’s ability to actually perform a task.
With the advent of virtualization (e.g., holographic lenses) and games or simulators, some tests no longer need live actors. Some tasks, on the other hand, simply cannot be replicated without using real-world tools or circumstances.
By way of example, there’s a difference between someone who can play an online football game and someone who can actually throw a football or kick a field goal. As a result, there are specific goals and considerations that item writers must keep in mind when developing performance tests.
Process versus outcomes
One consideration might include restricting the testing environment so that the item types are highly structured and controlled. For instance, item authors might preface a test with a controlled case study. Conversely, one might open up the entire environment to make the test more authentic, which could lead to unintended consequences. For example, perhaps a candidate performs a task in a way that goes against best practice.
Test authors might only care about measuring the outcome versus the process to achieve the outcome. But what happens if the outcome generates an inadvertent and negative result (e.g., introduces a security risk)? Do you still pass the candidate since you weren’t measuring the process to achieve the outcome?
Is it possible to verify an artifact left by the candidate that can be (automatically) identified to mark an item response as incorrect if the process to achieve the outcome is undesirable (e.g., an artifact of a process or setting that is tracked in a log file)? Or, is it worth the extra effort to measure the process to achieve an outcome?
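As a rough illustration of the log-file idea raised above, the sketch below marks a response incorrect when the outcome was achieved but the log contains evidence of an undesirable process. The patterns, the log format, and the function name `score_item` are all hypothetical, chosen only for illustration; a real exam would define its own artifacts.

```python
import re

# Hypothetical patterns signalling an undesirable process, e.g. a candidate
# who disabled a firewall or opened up file permissions to make a task work.
FORBIDDEN_PATTERNS = [
    re.compile(r"firewall\s+disabled", re.IGNORECASE),
    re.compile(r"chmod\s+777"),
]


def score_item(outcome_achieved: bool, log_text: str) -> dict:
    """Mark the item correct only if the outcome was met and no
    forbidden artifact appears in the log."""
    violations = [p.pattern for p in FORBIDDEN_PATTERNS if p.search(log_text)]
    correct = outcome_achieved and not violations
    return {"correct": correct, "violations": violations}


# Hypothetical log excerpt: the service is running, but the firewall was
# switched off along the way.
sample_log = "10:02 service started\n10:03 Firewall disabled by user\n"
result = score_item(outcome_achieved=True, log_text=sample_log)
print(result)
```

Here the outcome alone would have earned credit, but the artifact check overrides it. Whether that extra check is worth the effort is exactly the judgment call the questions above describe.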
Should a task include checkpoints during the process and on the way to achieving an outcome? Are some tasks or checkpoints more important than others, and if so, is partial credit or weighted scoring appropriate?
I once developed a performance test that required candidates to achieve a minimum score in a specific section of the exam. If they failed that section, then they failed the entire exam, regardless of their scores elsewhere (an all-or-nothing rule at the section level, sometimes called conjunctive scoring). But what if a candidate fails an early section, which results in failure of the entire exam? Does the exam just stop right there, or do you let the candidate continue to attempt all tasks in the exam?
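The section-minimum rule and the stop-or-continue question can be sketched as follows. This is a minimal illustration under assumed names and numbers (`Section`, `exam_result`, the cut scores, and the section titles are all hypothetical), not a description of how any particular exam is scored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Section:
    name: str
    earned: float
    possible: float
    minimum: Optional[float] = None  # hypothetical per-section cut score


def exam_result(sections: List[Section], overall_cut: float,
                stop_at_first_failure: bool = False) -> dict:
    """Pass only if every section minimum is met AND the total meets the cut."""
    total = 0.0
    failed_sections = []
    attempted = []
    for section in sections:
        attempted.append(section.name)
        total += section.earned
        if section.minimum is not None and section.earned < section.minimum:
            failed_sections.append(section.name)
            if stop_at_first_failure:
                break  # end the exam as soon as a required section is failed
    passed = not failed_sections and total >= overall_cut
    return {"passed": passed, "total": total,
            "failed_sections": failed_sections, "attempted": attempted}


sections = [
    Section("Setup", earned=4, possible=10, minimum=6),
    Section("Configuration", earned=9, possible=10),
    Section("Troubleshooting", earned=10, possible=10),
]
full = exam_result(sections, overall_cut=20)
stopped = exam_result(sections, overall_cut=20, stop_at_first_failure=True)
print(full)
print(stopped)
```

In this made-up case the candidate's total would clear the overall cut, yet the result is a fail either way because the required section was missed. The only difference between the two runs is whether the candidate gets to attempt the remaining sections, which matters for the testing experience and for any diagnostic feedback.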
The questions above demonstrate the importance of considering all options to achieve the desired and measurable process or outcome. This includes taking more time and effort not only to write meaningful items, but also to consider in advance the candidate testing experience and how the performance items will be implemented and scored.
Exam and item durations
Another consideration is to take into account any testing latency. Let’s imagine that the test takes place in a computer environment where the machine configuration requires time to set up a performance test task. In this case, I’ve heard of other test sponsors who decided to integrate multiple-choice testing during the time it takes to spin up a computer lab environment.
Performance test item writers should also consider how much time to allot for the exam. The exam duration should take into account item exposure and exam security. For example, one might decide to keep the exam duration tight in order to prevent memorization.
In this case, candidates would have to quickly recognize the tasks and what is expected, which requires prior knowledge in order to demonstrate skill and ability. Additionally, item writers need to consider the amount of time a set of job tasks would take to complete and determine whether the performance test accounts for real-world application and/or latency of equipment or connectivity.
It’s important to consider the methods of scoring a performance test. Item writers must consider the scoring rubric and whether certain tasks will be weighted more heavily than other tasks. Ideally, item writers would create a uniform method for scoring items, which supports validity and reliability better than variable item weights do.
If items require a candidate to perform multiple meaningful, highly related steps in order to achieve an outcome, then it might make sense to measure the item as a whole versus measuring each step in the process. Alternatively, if it’s important to measure multiple steps or checkpoints, item writers should consider partial credit (also known as polytomous scoring), which increases accuracy but adds complexity. If one prefers to keep scoring straightforward, then opt for all-or-nothing (dichotomous) scoring.
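To make the difference concrete, the sketch below scores the same response three ways: all-or-nothing, partial credit with uniform weights, and partial credit with variable weights. The checkpoint names and weights are hypothetical, and the partial-credit formula (weighted share of checkpoints met) is just one simple choice among many.

```python
def dichotomous(checkpoints: dict) -> float:
    """All-or-nothing: full credit only if every checkpoint was met."""
    return 1.0 if all(checkpoints.values()) else 0.0


def polytomous(checkpoints: dict, weights: dict = None) -> float:
    """Partial credit: weighted share of checkpoints met, from 0.0 to 1.0."""
    if not checkpoints:
        raise ValueError("an item needs at least one checkpoint")
    if weights is None:
        # Uniform weighting: every checkpoint counts the same.
        weights = {name: 1.0 for name in checkpoints}
    total = sum(weights[name] for name in checkpoints)
    earned = sum(weights[name] for name, met in checkpoints.items() if met)
    return earned / total


# Hypothetical response: two of three checkpoints met.
response = {"create_user": True, "assign_role": True, "enable_audit_log": False}
# Hypothetical variable weights: the missed checkpoint counts double.
weights = {"create_user": 1.0, "assign_role": 1.0, "enable_audit_log": 2.0}

print(dichotomous(response))
print(polytomous(response))
print(polytomous(response, weights))
```

The same performance earns no credit under all-or-nothing scoring, two-thirds credit under uniform partial credit, and half credit once the missed checkpoint is weighted more heavily. That spread is why the scoring model and any weights are worth settling with a psychometrician before items are written.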
Performance testing has grown in the certification industry in recent years, particularly with the advent of new technologies and vendors who help scale programs with automated scoring and remote delivery. While there are many considerations in designing a performance testing program, item writers should specifically consider the following goals:
● Conduct a job task analysis.
● Design tasks aligned to competencies that map back to the blueprint.
● Create a positive candidate testing experience.
● Determine the degree of fidelity to solve real-world tasks.
● Decide the exam structure, taking into account time, costs, resources, and scalability.
● Determine the benefits of measuring process versus outcomes.
● Establish exam and item durations that align with real-world expectations.
● Implement a scoring process that addresses: 1) standardizing item weighting; 2) breaking out meaningful tasks and measuring separate elements of each task, or 3) measuring connected tasks as a single item; and 4) choosing a scoring model according to complexity (e.g., dichotomous versus polytomous scoring).