Now that we know the process for designing and developing a good exam, we need to understand two attributes of a good exam: validity and reliability. This brings up a key question of assessment design: How do you determine whether an exam is valid and reliable?
Validity is a way of gauging the extent to which an exam measures what it is supposed to measure. Put another way, validity is a measure of how well aligned your questions or items are with your objectives or competencies. Four common types of validity are face, concurrent, predictive, and content.
Content validity is the way subject matter experts assess whether an exam measures what it is supposed to measure. In addition to being a practical measure of validity, it is also one of the most powerful ways of legally establishing the validity of your content.
Face validity, on the other hand, is not a good way of proving your content is valid. Face validity, rather, is a measure of the extent to which an exam measures what its name implies is being measured.
Concurrent validity is the technical process that an assessment manager uses to evaluate whether an instrument can tell the difference between those who have mastered the published competencies and those who have not.
Predictive validity is most often used in NRTs (norm-referenced tests). Predictive validity refers to how well a test forecasts a candidate's future success based on his or her performance on the test. The difference between the concurrent and predictive validity of a test is that concurrent validity correctly classifies learners based on currently known competence, while predictive validity correctly predicts the future competence of a learner (Shrock and Coscarelli, 2007).
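To make the distinction concrete, here is a minimal sketch in Python, using invented scores and hypothetical criterion measures, of how the two coefficients are commonly obtained: concurrent validity correlates exam scores with a criterion available now (known mastery), while predictive validity correlates the same scores with a criterion collected later (for example, a supervisor's rating).

```python
# Hypothetical illustration: correlating exam scores against two criterion measures.
# All values are invented for demonstration only.
import numpy as np

exam_scores      = np.array([55, 62, 70, 74, 81, 85, 90, 93])            # scores on the new exam
known_mastery    = np.array([0, 0, 0, 1, 1, 1, 1, 1])                    # current master (1) or not (0)
later_job_rating = np.array([2.1, 2.4, 3.0, 3.2, 3.9, 4.1, 4.5, 4.7])    # supervisor rating months later

# Concurrent validity: does the exam separate current masters from non-masters?
concurrent = np.corrcoef(exam_scores, known_mastery)[0, 1]

# Predictive validity: does the exam forecast future job performance?
predictive = np.corrcoef(exam_scores, later_job_rating)[0, 1]

print(f"Concurrent validity coefficient: {concurrent:.2f}")
print(f"Predictive validity coefficient: {predictive:.2f}")
```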
Reliability does not refer to whether you are being tested on what the test says you are being tested on, or to how well aligned your questions or items are with your objectives or competencies. Rather, reliability refers to consistency. If a learner took the test multiple times without studying or refreshing his or her knowledge between attempts, how consistently would they achieve a roughly equivalent score? Also, how consistent are results of a given exam across multiple forms of that exam?
The five types of reliability that certification program managers are interested in fall into two groups of techniques. The first group is used for a single administration of a test instrument, while the second covers methods used for two administrations of an instrument.
These five methods are primarily concerned with cognitive tests. There is also a sixth method, which should be of interest to those managing a performance-based program: inter-rater reliability. This form of reliability looks at the extent to which raters are consistent in scoring the performance of a test taker.
The single-exam administration techniques are internal consistency, squared-error loss, and threshold loss. The two-exam administration techniques are equivalence reliability and test-retest reliability. Without going into a lot of statistical jargon, I will simply say my preferred method for single administration is internal consistency, looking primarily at Cronbach's alpha.
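For readers who want to see the mechanics, the following is a minimal sketch in Python of the standard Cronbach's alpha calculation for a single administration; the item-response matrix is invented for illustration.

```python
# Minimal sketch of Cronbach's alpha for one test administration.
# item_scores is hypothetical: one row per test taker, one column per item (1 = correct, 0 = incorrect).
import numpy as np

item_scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

k = item_scores.shape[1]                              # number of items
item_variances = item_scores.var(axis=0, ddof=1)      # variance of each item
total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of the total scores

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")
```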
For two-exam administration, my preference is the test-retest method. I also have a strong reliability preference for inter-rater reliability, which includes having multiple raters scoring each exam with a solidly built checklist that each rater has a copy of and can easily follow.
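A comparable sketch, again with invented numbers, illustrates the two checks just mentioned: a test-retest correlation across two administrations, and a simple percent-agreement figure for two raters scoring the same checklist. (More formal agreement statistics exist; this only shows the shape of the calculation.)

```python
# Minimal sketch of test-retest and inter-rater checks; all values are hypothetical.
import numpy as np

# Test-retest: the same learners take the exam twice with no studying in between.
first_attempt  = np.array([72, 85, 64, 90, 78, 69])
second_attempt = np.array([75, 83, 66, 92, 74, 71])
test_retest_r = np.corrcoef(first_attempt, second_attempt)[0, 1]

# Inter-rater: two raters score the same performance against a shared checklist
# (1 = checklist item observed, 0 = not observed).
rater_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])
rater_b = np.array([1, 1, 0, 1, 0, 0, 1, 1])
percent_agreement = (rater_a == rater_b).mean()

print(f"Test-retest correlation: {test_retest_r:.2f}")
print(f"Inter-rater agreement:   {percent_agreement:.0%}")
```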
Whether your exam is a traditional form-based exam or uses a more advanced format, and whether your items are simple Bloom Level 1 and 2 multiple choice or more demanding Bloom Level 3-6 knowledge-based or performance-based items, it is vital to a good program to maintain a constant view of the exams' validity and reliability.
Now that we know what we should have in the exam instrument, it is time to look at the major question every certification manager must wrestle with daily.
4) What do you communicate to your test authors regarding what a good assessment item should look like?
What I will share with you is part of a training that I developed to help SMEs write good assessment items. The basic guidelines for writing good multiple-choice questions include the following.
General Guidelines
1. Multiple choice questions should have a minimum of three distractors along with one correct response.
2. Except when creating multiple response or matching items, write questions that have only one correct or clearly best answer.
3. Do not use text formatting to emphasize words.
4. Avoid the following five common flaws when writing multiple choice questions:
a) Do not include grammatical clues.
b) Do not include logical clues.
c) Do not repeat key words.
d) Do not use absolute terms.
e) Avoid convergence.
5. Use an uppercase letter for the first word of each choice.
6. Do not put a period at the end of individual choices.
7. Always include a minimum of four choices.
Stem (Question) Guidelines
1. The stem should fully state the problem and all qualifications.
2. Write the stem in a positive form.
3. Always include a verb in the stem.
Distractor (Choices) Guidelines
1. Each choice must be plausible.
2. Avoid making the correct choice the longest one.
Good Stem (Question) Characteristics
1. Incorporate scenarios.
2. Incorporate exam objectives.
3. Use Bloom Levels 3 through 6.
4. Always address the job role being tested.
A good scenario-based question puts the test taker in a situation as it would occur in the workplace. Scenarios should be detailed. Provide all of the information that the test taker needs to weigh the choices.
Sample BAD Question (taken out of context): "Which option should you select if you want to document a decrease in SpO2?" There are no details to place the test taker in the situation. Who is "documenting a decrease"? Why would someone do that? How would someone do that?
Sample GOOD Questions (taken out of context): "Philip, a new registrar, is reviewing the census and does not know what the icon on a patient means in the XYZ Registration application. What is the best course of action to take to find out what the icon represents?" "Susan, who is working at the main registration desk, has been called by administration. They need to know the name and the sex of the patient in bed 401. Where in the XYZ Registration application would she find this information?"
These questions give us a full picture of who, what, where, and even how. A picture is painted that draws in the test taker and helps them view the scenario from the inside.
Sample VERY GOOD Question (taken out of context): "Suzanne is at a client site deploying Update Manager. Mike is one of the floor admins with a deployment question for Suzanne. He has heard that one of the Microsoft deployment technologies that has been used in the past on COMPANY projects has had an issue. Which statement correctly defines the issue that Mike has heard about?"
To quickly review: A good certification question will be scenario-based, preferably hit Bloom Levels 3-6, be based on the published exam objectives, and paint a picture of some activity that occurs on the job.
Now that we know what it takes to write a good question, we are ready to move to our fifth point of discussion.
5) How do we score an exam?
The answer to this question is simple: consistently. Whether it be a knowledge-based or cognitive exam, or one of the many types of performance-based exams, every learner expects and deserves a fair and unbiased evaluation of their performance on the given instrument.
This means every learner — whether they are part of a high school class of 20 science students, or a population of several thousand individuals seeking the CTT+ credential from CompTIA, or a population of physicians and nurses seeking an HIT certification from me — deserves a fair and unbiased evaluation of their knowledge, skills, and abilities as demonstrated via a given evaluation instrument.
On a computer-generated cognitive exam, scoring is easy as long as the process outlined in Figures 1 and 2 (see preceding article) is followed. If the process is followed, then what the learner will face is an exam that has been reviewed at least three times — by SMEs, by a copy editor, and by you as the program manager — prior to getting released for alpha and beta reviews. (These are done by a select group of test takers who will either provide a final blessing or inform you that you have more work to do.)
If your instrument has been approved, then what you will have is a legally defensible answer set for the competencies that were being evaluated. If you are evaluating someone's performance, then you will not end up with an answer set, but you will end up with a fair and unbiased scoring tool — called a rubric — that will help you and your evaluators convey your expectations to all potential test takers.
5a) What is a rubric and how is it created?
A rubric (Stevens and Levi, 2013) for our purposes is an authoritative scoring tool that communicates clearly our specific expectations for an assigned assessment. This can be an oral presentation, a hands-on exercise, or a written critique.
A rubric in its simplest form contains four sections: 1) the task description or assignment; 2) the grading scale that pertains to the task; 3) the competencies involved in the assessment; and 4) clear, concise descriptions of what makes up each performance level.
You will note that Table 1, shown below, is simply a Microsoft Word table, with each column heading providing a point on your evaluation scale and each row heading providing a competency to be assessed. Each intersecting box has a description of what constitutes that level of performance.
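Since Table 1 itself is not reproduced here, the following is a minimal, hypothetical sketch in Python of the same layout: each competency (row) maps to a description for each point on the grading scale (column). The task, competencies, and descriptions are invented for illustration only.

```python
# Hypothetical rubric structure: rows are competencies, columns are scale points,
# and each cell describes that level of performance.
rubric = {
    "task": "Register a new patient in the XYZ Registration application",
    "scale": [1, 2, 3],  # 1 = does not meet, 2 = meets, 3 = exceeds expectations
    "competencies": {
        "Locates the patient record": {
            1: "Cannot locate the record without assistance",
            2: "Locates the record with minor prompting",
            3: "Locates the record quickly and independently",
        },
        "Verifies demographics": {
            1: "Skips required fields",
            2: "Verifies all required fields",
            3: "Verifies all fields and resolves discrepancies",
        },
    },
}

# Looking up what a score of 2 means for one competency:
print(rubric["competencies"]["Verifies demographics"][2])
```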
5b) Who will design the scoring rubric for the tests?
To create a sound rubric for a certification program, you as the program manager need to clearly communicate your leadership's expectations of what someone who has mastered the competencies being evaluated should be able to accomplish. Once you can clearly communicate that message, it becomes a matter of getting your fellow managers on board.
Your colleagues will work with you and your team of ISDs and assessment developers to begin drafting the rubrics you will need for each assessment type. You will notice I said rubrics, in the plural. For each exam format, you will need a separate rubric because different competencies will be evaluated on each instrument.
To be fair and unbiased with the potential candidates, it is imperative that the testing rubric be clear and available before the candidate ever steps into the exam room.
5c) When incorporating oral exam formats, role-playing formats, or hands-on performance-based formats, how should these non-traditional formats be scored?
A sample rubric, which I leveraged in the third release of our certification, is shown below in Table 2. Each type of exam was scored by three independent evaluators, and the results were shared with the resource managers, who then communicated them one-on-one with the candidate. This put the burden not solely on the evaluators, but jointly on the resource managers.
The scoring was an agreed-upon value that complied with our clients' wishes. For this oral exam, the score on each competency had to be no less than 2, and the overall score could be no lower than 2.5. This was, for the most part, the scoring decision that was accepted by each resource manager for each type of performance-based evaluation.
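As a quick illustration of that rule, here is a minimal sketch in Python, assuming the overall score is the average of the individual competency scores; the sample score sets are hypothetical.

```python
# Pass rule described above: every competency scores at least 2,
# and the overall (average) score is at least 2.5.
def passes_oral_exam(competency_scores, minimum=2, overall_minimum=2.5):
    overall = sum(competency_scores) / len(competency_scores)
    return all(score >= minimum for score in competency_scores) and overall >= overall_minimum

print(passes_oral_exam([3, 2, 3, 2]))  # True: no competency below 2, overall is 2.5
print(passes_oral_exam([3, 3, 3, 1]))  # False: one competency falls below 2
```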
For each cognitive exam, the initial pass score was set prior to the alpha pilots at 70 percent. This score was also agreed to by the resource managers.
5d) How is the final cut score (or passing score) determined for cognitive exams?
If the initial score is set by the managers, then how do you set the final score? The final passing score (or cut score) should be determined after the beta test using a standard technique called the Angoff method. The Angoff method is the one most often used in the certification industry and in the current IT world.
The way I did the Angoff method is as follows: I identified at least three managers or SMEs for each role to serve as judges for each cognitive exam to be evaluated. The judges for each exam all met with me in a conference room where we first reached consensus on the definition of a minimally competent test taker.
We then took one of the tested competencies and had each judge privately evaluate a single test question, and then estimate the probability of a minimally competent test taker getting a single question correct. After they had privately come to a probability that they were comfortable with, I had each person share their number and recorded it in an Excel spreadsheet.
We did this for two more items, and then I had the judges privately evaluate all items for the identified competencies and estimate the probability of a minimally competent test taker getting each question correct. Each judge's probabilities were then summed across all items, and those totals were averaged over the number of judges. This average became the basis for the final passing score.
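For those who want to see the arithmetic, here is a minimal sketch in Python of that calculation, with invented judge ratings for a hypothetical five-item exam.

```python
# Angoff calculation sketch: each row is one judge, each column is that judge's
# estimated probability that a minimally competent test taker answers the item correctly.
# All ratings are invented for illustration.
import numpy as np

judge_ratings = np.array([
    [0.70, 0.60, 0.85, 0.55, 0.75],   # Judge 1
    [0.65, 0.70, 0.80, 0.60, 0.70],   # Judge 2
    [0.75, 0.55, 0.90, 0.50, 0.80],   # Judge 3
])

# Sum each judge's probabilities across items, then average across the judges.
judge_totals = judge_ratings.sum(axis=1)     # expected raw score per judge
cut_score_points = judge_totals.mean()       # recommended passing score in points
cut_score_percent = cut_score_points / judge_ratings.shape[1] * 100

print(f"Recommended cut score: {cut_score_points:.1f} of {judge_ratings.shape[1]} items "
      f"({cut_score_percent:.0f} percent)")
```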
What this exercise showed was that for some exams the 70 percent initial cut score was either too high or too low, while for some of the role-based exams it proved accurate. There were cases when the final cut score had to be adjusted either up or down as more people took the test. The judges for each exam were kept in the loop, and were consulted when any changes were indicated.
Summary
We have covered a lot of material in this one article, but we have covered only the major high points of the assessment aspect of the role of certification program manager. Stay tuned for the next article, which will be the last in this series. We will examine the program and project management aspects of designing and developing a quality certification program.
REFERENCES
1) Allen, M. (2012). Leaving Addie for SAM: An Agile Model for Developing the Best Learning Experiences. Alexandria, VA: ASTD Press.
2) Allen, M. W. (2003). Michael Allen's Guide To E-Learning: Building Interactive, Fun, and Effective Learning Programs for Any Company. Hoboken, N.J.: John Wiley.
3) Brown, A., & Green, T. D. (2011). The Essentials of Instructional Design: Connecting Fundamental Principles with Process and Practice (2nd ed.). Boston: Prentice Hall.
4) Dick, W., Carey, L., & Carey, J. O. (2009). The Systematic Design of Instruction (7th ed.). Upper Saddle River, N.J.: Merrill/Pearson.
5) Gagné, R. M. (1985). The Conditions of Learning and Theory of Instruction (4th ed.). New York: Holt, Rinehart and Winston.
6) Gagné, R. M., Wager, W. W., Golas, K. C., & Keller, J. M. (2005). Principles of Instructional Design (5th ed.). Belmont, CA: Thomson/Wadsworth.
7) Morrison, G. R., Ross, S. M., Kalman, H. K., & Kemp, J. E. (2013). Designing Effective Instruction (7th ed.). Hoboken, NJ: Wiley.
8) Norton, R. E., & Moser, J. (2007). SCID Handbook (7th ed.). Columbus, OH: Center on Education and Training for Employment, The Ohio State University.
9) Rothwell, W. J. (2002). The Workplace Learner: How to Align Training Initiatives with Individual Learning Competencies. New York: American Management Association.
10) Shrock, S. A., & Coscarelli, W. C. C. (2007). Criterion-referenced Test Development: Technical and Legal Guidelines for Corporate Training (3rd ed.). San Francisco, CA: Pfeiffer.
11) Smith, P. L., & Ragan, T. J. (2005). Instructional Design (3rd ed.). Hoboken, N.J.: J. Wiley & Sons.
12) Stevens, D. D., & Levi, A. J. (2013). Introduction to Rubrics (2nd ed.). Sterling, VA: Stylus.
13) Wyrostek, W. (2011). Core Considerations for a Successful Corporate Certification Program. Retrieved from http://www.informit.com/articles/article.aspx?p=1804139
14) Wyrostek, W., & Downey, S. (2016). Compatibility of common instructional models with the DACUM process. Adult Learning, 0(0), 1-7. doi:10.1177/1045159516669702