CHAPTER 7
In everyday language, we use the term utility to refer to the usefulness of some thing or some process. In the language of psychometrics, utility (also referred to as test utility) means much the same thing; it refers to how useful a test is. More specifically, it refers to the practical value of using a test to aid in decision making. An overview of some frequently raised utility-related questions would include the following:
· How useful is this test in terms of cost efficiency?
· How useful is this test in terms of savings in time?
· What is the comparative utility of this test? That is, how useful is this test as compared to another test?
· What is the clinical utility of this test? That is, how useful is it for purposes of diagnostic assessment or treatment?
· What is the diagnostic utility of this neurological test? That is, how useful is it for classification purposes?
· How useful is this medical school admissions test in assigning a limited number of openings to an overwhelming number of applicants?
· How useful is the addition of another test to the test battery already in use for screening purposes?
· How useful is this personnel test as a tool for the selection of new employees?
· Is this particular personnel test, used for promoting middle-management employees, more useful than using no test at all?
· Is the time and money it takes to administer, score, and interpret this personnel promotion test battery worth it as compared to simply asking the employee’s supervisor for a recommendation as to whether the employee should be promoted?
· How useful is the training program in place for new recruits?
· How effective is this particular clinical technique?
· Should this new intervention be used in place of an existing intervention?
What Is Utility?
We may define utility in the context of testing and assessment as the usefulness or practical value of testing to improve efficiency. Note that in this definition, “testing” refers to anything from a single test to a large-scale testing program that employs a battery of tests. For simplicity and convenience, in this chapter we often refer to the utility of one individual test. Keep in mind, however, that such discussion is applicable and generalizable to the utility of large-scale testing programs that may employ many tests or test batteries. Utility is also used to refer to the usefulness or practical value of a training program or intervention. We may speak, for example, of the utility of adding a particular component to an existing corporate training program or clinical intervention. Throughout this chapter, however, our discussion and illustrations will focus primarily on utility as it relates to testing.
JUST THINK . . .
Based on everything that you have read about tests and testing so far in this book, how do you think you would go about making a judgment regarding the utility of a test?
If your response to our Just Think question about judging a test’s utility made reference to the reliability of a test or the validity of a test, then you are correct—well, partly. Judgments concerning the utility of a test are made on the basis of test reliability and validity data as well as on other data.
Factors That Affect a Test’s Utility
A number of considerations are involved in making a judgment about the utility of a test. Here we will review how a test’s psychometric soundness, costs, and benefits can all affect a judgment concerning a test’s utility.
By psychometric soundness, we refer—as you probably know by now—to the reliability and validity of a test. A test is said to be psychometrically sound for a particular purpose if reliability and validity coefficients are acceptably high. How can an index of utility be distinguished from an index of reliability or validity? The short answer to that question is as follows: An index of reliability can tell us something about how consistently a test measures what it measures; and an index of validity can tell us something about whether a test measures what it purports to measure. But an index of utility can tell us something about the practical value of the information derived from scores on the test. Test scores are said to have utility if their use in a particular situation helps us to make better decisions—better, that is, in the sense of being more cost-effective (see, for example, Brettschneider et al., 2015; or Winser et al., 2015).
In previous chapters on reliability and validity, it was noted that reliability sets a ceiling on validity. It is tempting to draw the conclusion that a comparable relationship exists between validity and utility and conclude that “validity sets a ceiling on utility.” In many instances, such a conclusion would certainly be defensible. After all, a test must be valid to be useful. Of what practical value or usefulness is a test for a specific purpose if the test is not valid for that purpose?
Unfortunately, few things about utility theory and its application are simple and uncomplicated. Generally speaking, the higher the criterion-related validity of test scores for making a particular decision, the higher the utility of the test is likely to be. However, there are exceptions to this general rule. This is so because many factors may enter into an estimate of a test’s utility, and there are great variations in the ways in which the utility of a test is determined. In a study of the utility of a test used for personnel selection, for example, the selection ratio may be very high. We’ll review the concept of a selection ratio (introduced in the previous chapter) in greater detail later in this chapter. For now, let’s simply note that if the selection ratio is very high, most people who apply for the job are being hired. Under such circumstances, the validity of the test may have little to do with the test’s utility.
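The point about the selection ratio can be sketched with a small calculation. This is a minimal illustration with invented numbers, not data from any actual selection study:

```python
# Illustrative sketch: the selection ratio is the number of positions
# to be filled divided by the number of applicants.

def selection_ratio(n_openings: int, n_applicants: int) -> float:
    """Proportion of applicants who will be hired."""
    return n_openings / n_applicants

# When openings nearly equal applicants, almost everyone is hired,
# and even a highly valid test can add little practical value.
print(selection_ratio(95, 100))   # 0.95 -- very high; the test adds little
print(selection_ratio(10, 100))   # 0.10 -- low; a valid test can help greatly
```

With a selection ratio near 1.0, the organization hires nearly everyone regardless of test scores, so the test's validity has little opportunity to improve the quality of those selected.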
What about the other side of the coin? Would it be accurate to conclude that “a valid test is a useful test”? At first blush this statement may also seem perfectly logical and true. But once again—we’re talking about utility theory here, and this can be very complicated stuff—the answer is no; it is not the case that “a valid test is a useful test.” People often refer to a particular test as “valid” if scores on the test have been shown to be good indicators of how the person will score on the criterion.
An example from the published literature may help to further illustrate how a valid tool of assessment may have questionable utility. One way of monitoring the drug use of cocaine users being treated on an outpatient basis is through regular urine tests. As an alternative to that monitoring method, researchers developed a patch which, if worn day and night, could detect cocaine use through sweat. In a study designed to explore the utility of the sweat patch with 63 opiate-dependent volunteers who were seeking treatment, investigators found a 92% level of agreement between a positive urine test for cocaine and a positive test on the sweat patch for cocaine. On the face of it, these results would seem to be encouraging for the developers of the patch. However, this high rate of agreement occurred only when the patch had been untampered with and properly applied by research participants—which, as it turned out, wasn’t all that often. Overall, the researchers felt compelled to conclude that the sweat patch had limited utility as a means of monitoring drug use in outpatient treatment facilities (Chawarski et al., 2007). This study illustrates that even though a test may be psychometrically sound, it may have little utility—particularly if the targeted testtakers demonstrate a tendency to “bend, fold, spindle, mutilate, destroy, tamper with,” or otherwise fail to scrupulously follow the test’s directions.
Another utility-related factor does not necessarily have anything to do with the behavior of targeted testtakers. In fact, it typically has more to do with the behavior of the test’s targeted users.
Mention the word costs and what comes to mind? Usually words like money or dollars. In considerations of test utility, factors variously referred to as economic, financial, or budget-related in nature must certainly be taken into account. In fact, one of the most basic elements in any utility analysis is the financial cost of the selection device (or training program or clinical intervention) under study. However, the meaning of “cost” as applied to test utility can extend far beyond dollars and cents (see Figure 7–1). Briefly, cost in the context of test utility refers to disadvantages, losses, or expenses in both economic and noneconomic terms.
Figure 7–1 Rethinking the “Costs” of Testing—and of Not Testing The cost of this X-ray might be $100 or so . . . but what is the cost of not having this diagnostic procedure done? Depending on the particular case, the cost of not testing might be unnecessary pain and suffering, lifelong disability, or worse. In sum, the decision to test or not must be made with thoughtful consideration of all possible pros and cons, financial and otherwise. © Martin Barraud/age fotostock RF
As used with respect to test utility decisions, the term costs can be interpreted in the traditional, economic sense; that is, relating to expenditures associated with testing or not testing. If testing is to be conducted, then it may be necessary to allocate funds to purchase (1) a particular test, (2) a supply of blank test protocols, and (3) computerized test processing, scoring, and interpretation from the test publisher or some independent service. Associated costs of testing may come in the form of (1) payment to professional personnel and staff associated with test administration, scoring, and interpretation, (2) facility rental, mortgage, and/or other charges related to the usage of the test facility, and (3) insurance, legal, accounting, licensing, and other routine costs of doing business. In some settings, such as private clinics, these costs may be offset by revenue, such as fees paid by testtakers. In other settings, such as research organizations, these costs will be paid from the test user’s funds, which may in turn derive from sources such as private donations or government grants.
The economic costs listed here are the easy ones to calculate. Not so easy to calculate are other economic costs, particularly those associated with not testing or testing with an instrument that turns out to be ineffective. As an admittedly far-fetched example, what if skyrocketing fuel costs prompted a commercial airline to institute cost-cutting methods? What if one of the cost-cutting methods the airline instituted was the cessation of its personnel assessment program? Now, all personnel—including pilots and equipment repair personnel—would be hired and trained with little or no evaluation. Alternatively, what if the airline simply converted its current hiring and training program to a much less expensive program with much less rigorous (and perhaps ineffective) testing for all personnel? What economic (and noneconomic) consequences do you envision might result from such action? Would cost-cutting actions such as those described previously be prudent from a business perspective?
One need not hold an M.B.A. or an advanced degree in consumer psychology to understand that such actions on the part of the airline would probably not be effective. The resulting cost savings from elimination of such assessment programs would pale in comparison to the probable losses in customer revenue once word got out about the airline’s strategy for cost cutting; loss of public confidence in the safety of the airline would almost certainly translate into a loss of ticket sales. Additionally, such revenue losses would be irrevocably compounded by any safety-related incidents (with their attendant lawsuits) that occurred as a consequence of such imprudent cost cutting.
In this example, mention of the variable of “loss of confidence” brings us to another meaning of “costs” in terms of utility analyses; that is, costs in terms of loss. Noneconomic costs of drastic cost cutting by the airline might come in the form of harm or injury to airline passengers and crew as a result of incompetent pilots flying the plane and incompetent ground crews servicing the planes. Although people (and most notably insurance companies) do place dollar amounts on the loss of life and limb, for our purposes we can still categorize such tragic losses as noneconomic in nature.
JUST THINK . . .
How would you describe the noneconomic cost of a nation’s armed forces using ineffective screening mechanisms to screen military recruits?
Other noneconomic costs of testing can be far more subtle. Consider, for example, a published study that examined the utility of taking four X-ray pictures as compared to two X-ray pictures in routine screening for fractured ribs among potential child abuse victims. Hansen et al. (2008) found that a four-view series of X-rays differed significantly from the more traditional, two-view series in terms of the number of fractures identified. These researchers recommended the addition of two more views in the routine X-ray protocols for possible physical abuse. Stated another way, these authors found diagnostic utility in adding two X-ray views to the more traditional protocol. The financial cost of using the two additional X-rays was seen as worth it, given the consequences and potential costs of failing to diagnose the injuries. Here, the (noneconomic) cost concerns the risk of letting a potential child abuser continue to abuse a child without detection. In other medical research, such as that described by our featured assessment professional, the utility of various other tests and procedures is routinely evaluated (see this chapter’s Meet an Assessment Professional).
MEET AN ASSESSMENT PROFESSIONAL
Meet Dr. Delphine Courvoisier
My name is Delphine Courvoisier. I hold a Ph.D. in psychometrics from the University of Geneva, Switzerland, and Master’s degrees in statistics from the University of Geneva, in epidemiology from Harvard School of Public Health, and in human resources from the University of Geneva. I currently work as a biostatistician in the Department of Rheumatology, at the University Hospitals of Geneva, Switzerland. A typical work day for me entails consulting with clinicians about their research projects. Assistance from me may be sought at any stage in a research project. So, for example, I might help out one team of researchers in conceptualizing initial hypotheses. Another research team might require assistance in selecting the most appropriate outcome measures, given the population of subjects with whom they are working. Yet another team might request assistance with data analysis or interpretation. In addition to all of that, a work day typically includes providing a colleague with some technical or social support—this to counter the concern or discouragement that may have been engendered by some methodological or statistical complexity inherent in a project that they are working on.
Rheumatoid arthritis is a chronic disease. Patients with this disease frequently suffer pain and may have limited functioning. Among other variables, research team members may focus their attention on quality-of-life issues for members of this population. Quality-of-life research may be conducted at different points in time through the course of the disease. In conducting the research, various tools of assessment, including psychological tests and structured interviews, may be used.
The focus of my own research team has been on several overlapping variables, including health-related quality of life, degree of functional disability, and disease activity and progression. We measure health-related quality of life using the Short-Form 36 Health Survey (SF36). We measure functional disability by means of the Health Assessment Questionnaire (HAQ). We assess disease activity and progression by means of a structured interview conducted by a health-care professional. The interview yields a proprietary disease activity score (DAS). All these data are then employed to evaluate the effectiveness of various treatment regimens, and adjust, where necessary, patient treatment plans.
Delphine Courvoisier, Ph.D., Psychometrician and biostatistician at the Department of Rheumatology at the University Hospitals of Geneva, Switzerland. © Delphine Courvoisier
Since so much of our work involves evaluation by means of tests or other assessment procedures, it is important to examine the utility of the methods we use. For example, when a research project demands that subjects respond to a series of telephone calls, it would be instructive to understand how compliance (or, answering the phone and responding to the experimenter’s questions) versus non-compliance (or, not answering the phone) affects the other variables under study. It may be, for example, that people who are more compliant are simply more conscientious. If that was indeed the case, all the data collected from people who answered the phone might be more causally related to a personality variable (such as conscientiousness) than anything else. Thus, prior to analyzing content of phone interviews, it would be useful to test—and reject—the hypothesis that only patients high on the personality trait of conscientiousness will answer the phone.
We conducted a study that entailed the administration of a personality test (the NEO Personality Inventory-Revised), as well as ecological momentary assessment (EMA) in the form of a series of phone interviews with subjects (Courvoisier et al., 2012). EMA is a tool of assessment that researchers can use to examine behaviors and subjective states in the settings in which they naturally occur, and at a frequency that can capture their variability. Through the use of EMA we learned, among other things, that subject compliance was not attributable to personality factors (see Courvoisier et al., 2012 for full details).
Being a psychometrician can be most fulfilling, especially when one’s measurement-related knowledge and expertise brings added value to a research project that has exciting prospects for bettering the quality of life for members of a specific population. Psychologists who raise compelling research questions understand that the road to satisfactory answers is paved with psychometric essentials such as a sound research design, the use of appropriate measures, and accurate analysis and interpretation of findings. Psychometricians lend their expertise in these areas to help make research meaningful, replicable, generalizable, and actionable. From my own experience, one day I might be meeting with a researcher to discuss why a particular test is (or is not) more appropriate as an outcome measure, given the unique design and objectives of the study. Another day might find me cautioning experimenters against the use of a spontaneously created, “home-made” questionnaire for the purpose of screening subjects. In such scenarios, a strong knowledge of psychometrics combined with a certain savoir faire in diplomacy would seem to be useful prerequisites to success.
I would advise any student who is considering or contemplating a career as a psychometrician to learn everything they can about measurement theory and practice. In addition, the student would do well to cultivate the interpersonal skills that will most certainly be needed to interact professionally and effectively with fellow producers and consumers of psychological research. Contrary to what many may hold as an intuitive truth, success in the world of psychometrics cannot be measured by numbers alone.
Used with permission of Delphine Courvoisier.
Judgments regarding the utility of a test may take into account whether the benefits of testing justify the costs of administering, scoring, and interpreting the test. So, when evaluating the utility of a particular test, an evaluation is made of the costs incurred by testing as compared to the benefits accrued from testing. Here, benefit refers to profits, gains, or advantages. As we did in discussing costs associated with testing (and not testing), we can view benefits in both economic and noneconomic terms.
From an economic perspective, the cost of administering tests can be minuscule when compared to the economic benefits—or financial returns in dollars and cents—that a successful testing program can yield. For example, if a new personnel testing program results in the selection of employees who produce significantly more than other employees, then the program will have been responsible for greater productivity on the part of the new employees. This greater productivity may lead to greater overall company profits. If a new method of quality control in a food-processing plant results in higher quality products and less product being trashed as waste, the net result will be greater profits for the company.
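The economic logic of weighing testing costs against productivity benefits can be sketched in a few lines. All figures below are invented purely for illustration; a real utility analysis would rest on validated estimates of productivity gains and costs:

```python
# Hypothetical cost-benefit sketch for a personnel testing program.
# Every number here is an assumption made up for illustration.

n_hired = 50                 # employees selected with the new test
gain_per_employee = 2_000    # estimated extra annual productivity ($) per hire
cost_per_applicant = 25      # cost to administer and score one test ($)
n_applicants = 400           # everyone tested, hired or not

total_benefit = n_hired * gain_per_employee    # $100,000
total_cost = n_applicants * cost_per_applicant # $10,000
net_utility = total_benefit - total_cost       # $90,000

print(f"Benefit: ${total_benefit:,}")
print(f"Cost:    ${total_cost:,}")
print(f"Net:     ${net_utility:,}")
```

Even in this toy example, note that the cost side counts every applicant tested while the benefit side counts only those hired; that asymmetry is one reason the selection ratio matters so much in utility estimates.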
There are also many potential noneconomic benefits to be derived from thoughtfully designed and well-run testing programs. In industrial settings, a partial list of such noneconomic benefits—many carrying with them economic benefits as well—would include:
· an increase in the quality of workers’ performance;
· an increase in the quantity of workers’ performance;
· a decrease in the time needed to train workers;
· a reduction in the number of accidents;
· a reduction in worker turnover.
The cost of administering tests can be well worth it if the result is certain noneconomic benefits, such as a good work environment. As an example, consider the admissions program in place at most universities. Educational institutions that pride themselves on their graduates are often on the lookout for ways to improve the way that they select applicants for their programs. Why? Because it is to the credit of a university that their graduates succeed at their chosen careers. A large portion of happy, successful graduates enhances the university’s reputation and sends the message that the university is doing something right. Related benefits to a university that has students who are successfully going through its programs may include high morale and a good learning environment for students, high morale of and a good work environment for the faculty, and reduced load on counselors and on disciplinary personnel and boards. With fewer students leaving the school before graduation for academic reasons, there might actually be less of a load on admissions personnel as well; the admissions office will not be constantly working to select students to replace those who have left before completing their degree programs. A good work environment and a good learning environment are not necessarily things that money can buy. Such outcomes can, however, result from a well-administered admissions program that consistently selects qualified students who will keep up with the work and “fit in” to the environment of a particular university.
JUST THINK . . .
Provide an example of another situation in which the stakes involving the utility of a tool of psychological assessment are high.
One of the economic benefits of a diagnostic test used to make decisions about involuntary hospitalization of psychiatric patients is a benefit to society at large. Persons are frequently confined involuntarily for psychiatric reasons if they are harmful to themselves or others. Tools of psychological assessment such as tests, case history data, and interviews may be used to make a decision regarding involuntary psychiatric hospitalization. The more useful such tools of assessment are, the safer society will be from individuals intent on inflicting harm or injury. Clearly, the potential noneconomic benefit derived from the use of such diagnostic tools is great. It is also true, however, that the potential economic costs are great when errors are made. Errors in clinical determination made in cases of involuntary hospitalization may cause people who are not threats to themselves or others to be denied their freedom. The stakes involving the utility of tests can indeed be quite high.
How do professionals in the field of testing and assessment balance variables such as psychometric soundness, benefits, and costs? How do they come to a judgment regarding the utility of a specific test? How do they decide that the benefits (however defined) outweigh the costs (however defined) and that a test or intervention indeed has utility? There are formulas that can be used with values that can be filled in, and there are tables that can be used with values to be looked up. We will introduce you to such methods in this chapter. But let’s preface our discussion of utility analysis by emphasizing that other, less definable elements—such as prudence, vision, and, for lack of a better (or more technical) term, common sense—must be ever-present in the process. A psychometrically sound test of practical value is worth paying for, even when the dollar cost is high, if the potential benefits of its use are also high or if the potential costs of not using it are high. We have discussed “costs” and “benefits” at length in order to underscore that such matters cannot be considered solely in monetary terms.
What Is a Utility Analysis?
A utility analysis may be broadly defined as a family of techniques that entail a cost–benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment. Note that in this definition, we used the phrase “family of techniques.” This is so because a utility analysis is not one specific technique used for one specific objective. Rather, utility analysis is an umbrella term covering various possible methods, each requiring various kinds of data to be inputted and yielding various kinds of output. Some utility analyses are quite sophisticated, employing high-level mathematical models and detailed strategies for weighting the different variables under consideration (Roth et al., 2001). Other utility analyses are far more straightforward and can be readily understood in terms of answers to relatively uncomplicated questions, such as: “Which test gives us more bang for the buck?”
In a most general sense, a utility analysis may be undertaken for the purpose of evaluating whether the benefits of using a test (or training program or intervention) outweigh the costs. If undertaken to evaluate a test, the utility analysis will help make decisions regarding whether:
· one test is preferable to another test for use for a specific purpose;
· one tool of assessment (such as a test) is preferable to another tool of assessment (such as behavioral observation) for a specific purpose;
· the addition of one or more tests (or other tools of assessment) to one or more tests (or other tools of assessment) that are already in use is preferable for a specific purpose;
· no testing or assessment is preferable to any testing or assessment.
If undertaken for the purpose of evaluating a training program or intervention, the utility analysis will help make decisions regarding whether:
· one training program is preferable to another training program;
· one method of intervention is preferable to another method of intervention;
· the addition or subtraction of elements to an existing training program improves the overall training program by making it more effective and efficient;
· the addition or subtraction of elements to an existing method of intervention improves the overall intervention by making it more effective and efficient;
· no training program is preferable to a given training program;
· no intervention is preferable to a given intervention.
The endpoint of a utility analysis is typically an educated decision about which of many possible courses of action is optimal. For example, in a now-classic utility analysis, Cascio and Ramos (1986) found that the use of a particular approach to assessment in selecting managers could save a telephone company more than $13 million over four years (see also Cascio, 1994, 2000).
Whether reading about utility analysis in this chapter or in other sources, a solid foundation in the language of this endeavor—both written and graphic—is essential. Toward that end, we hope you find the detailed case illustration presented in our Close-Up helpful.
Utility Analysis: An Illustration
Like factor analysis, discriminant analysis, psychoanalysis, and other specific approaches to analysis and evaluation, utility analysis has its own vocabulary. It even has its own images in terms of graphic representations of various phenomena. As a point of departure for learning about the words and images associated with utility analysis, we present a hypothetical scenario involving utility-related issues that arise in a corporate personnel office. The company is a South American package delivery company called Federale (pronounced fed-a-rally) Express (FE). The question at hand concerns the cost-effectiveness of adding a new test to the process of hiring delivery drivers. Consider the following details.
Dr. Wanda Carlos, the personnel director of Federale Express, has been charged with the task of evaluating the utility of adding a new test to the procedures currently in place for hiring delivery drivers. Current FE policy states that drivers must possess a valid driver’s license and have no criminal record. Once hired, the delivery driver is placed on probation for three months, during which time on-the-job supervisory ratings (OTJSRs) are collected on random work days. If scores on the OTJSRs are satisfactory at the end of the probationary period, then the new delivery driver is deemed “qualified.” Only qualified drivers attain permanent employee status and benefits at Federale Express.
The new evaluation procedure to be considered from a cost-benefit perspective is the Federale Express Road Test (FERT). The FERT is a procedure that takes less than one hour and entails the applicant driving an FE truck in actual traffic to a given destination, parallel parking, and then driving back to the start point. Does the FERT evidence criterion-related validity? If so, what cut score instituted to designate passing and failing scores would provide the greatest utility? These are preliminary questions that Dr. Carlos seeks to answer “on the road” to tackling issues of utility. They will be addressed in a study exploring the predictive validity of the FERT.
Dr. Carlos conducts a study in which a new group of drivers is hired based on FE’s existing requirements: possession of a valid driver’s license and no criminal record. However, to shed light on the question of the value of adding a new test to the process, these new hires must also take the FERT. So, subsequent to their hiring and after taking the FERT, these new employees are all placed on probation for the usual period of three months. During this probationary period, the usual on-the-job supervisory ratings (OTJSRs) are collected on randomly selected work days. The total scores the new employees achieve on the OTJSRs will be used to address not only the question of whether the new hire is qualified but also questions concerning the added value of the FERT in the hiring process.
The three-month probationary period for the new hires is now over, and Dr. Carlos has accumulated quite a bit of data including scores on the predictor measure (the FERT) and scores on the criterion measure (the OTJSR). Looking at these data, Dr. Carlos wonders aloud about setting a cut score for the FERT . . . but does she even need to set a cut score? What if FE hired as many new permanent drivers as they need by a process of top-down selection with regard to OTJSRs? Top-down selection is a process of awarding available positions to applicants whereby the highest scorer is awarded the first position, the next highest scorer the next position, and so forth until all positions are filled. Dr. Carlos decides against a top-down hiring policy based on her awareness of its possible adverse impact. Top-down selection practices may carry with them unintended discriminatory effects (Cascio et al., 1995; De Corte & Lievens, 2005; McKinney & Collins, 1991; Zedeck et al., 1996).
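The top-down selection process just described amounts to sorting applicants by score and filling openings from the top. A minimal sketch, with hypothetical names and scores:

```python
# Sketch of top-down selection: positions are awarded to the highest
# scorers until all openings are filled. Names and scores are invented.

def top_down_select(applicants: dict[str, float], n_openings: int) -> list[str]:
    """Return the applicants awarded positions, highest score first."""
    ranked = sorted(applicants, key=applicants.get, reverse=True)
    return ranked[:n_openings]

scores = {"Ana": 88.0, "Luis": 95.0, "Marta": 91.0, "Jorge": 79.0}
print(top_down_select(scores, 2))   # ['Luis', 'Marta']
```

The mechanics are simple; as the text notes, the concern with top-down selection is not computational but practical and legal, given its potential for adverse impact.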
For assistance in setting a cut score for hiring and in answering questions related to the utility of the FERT, Dr. Carlos purchases a (hypothetical) computer program entitled Utility Analysis Made Easy. This program contains definitions for a wealth of utility-related terms and also provides the tools for automatically creating computer-generated, utility-related tables and graphs. In what follows we learn, along with Dr. Carlos, how utility analysis can be “made easy” (or, at the very least, somewhat less complicated). After entering all of the data from this study, she enters the command set cut score, and what pops up is a table (Table 1) and this prompt:
There is no single, all-around best way to determine the cut score to use on the FERT. The cut score chosen will reflect the goal of the selection process. In this case, consider which of the following four options best reflects the company’s hiring policy and objectives. For some companies, the best cut score may be no cut score (Option 1).
(1) Limit the cost of selection by not using the FERT.
This goal could be appropriate (a) if Federale Express just needs “bodies” to fill positions in order to continue operations, (b) if the consequences of hiring unqualified personnel are not a major consideration; and/or (c) if the size of the applicant pool is equal to or smaller than the number of openings.
Table 1 Hits and Misses
| Term | General Definition | What It Means in This Study | Implication |
|------|--------------------|-----------------------------|-------------|
| Hit | A correct classification | A passing score on the FERT is associated with satisfactory performance on the OTJSR, and a failing score on the FERT is associated with unsatisfactory performance on the OTJSR. | The predictor test has successfully predicted performance on the criterion; it has successfully predicted the on-the-job outcome. A qualified driver is hired; an unqualified driver is not hired. |
| Miss | An incorrect classification; a mistake | A passing score on the FERT is associated with unsatisfactory performance on the OTJSR, and a failing score on the FERT is associated with satisfactory performance on the OTJSR. | The predictor test has not predicted performance on the criterion; it has failed to predict the on-the-job outcome. A qualified driver is not hired; an unqualified driver is hired. |
| Hit rate | The proportion of people that an assessment tool accurately identifies as possessing or exhibiting a particular trait, ability, behavior, or attribute | The proportion of FE drivers with a passing FERT score who perform satisfactorily after three months based on OTJSRs. Also, the proportion of FE drivers with a failing FERT score who do not perform satisfactorily after three months based on OTJSRs. | The proportion of qualified drivers with a passing FERT score who actually gain permanent employee status after three months on the job. Also, the proportion of unqualified drivers with a failing FERT score who are let go after three months. |
| Miss rate | The proportion of people that an assessment tool inaccurately identifies as possessing or exhibiting a particular trait, ability, behavior, or attribute | The proportion of FE drivers with a passing FERT score who perform unsatisfactorily after three months based on OTJSRs. Also, the proportion of FE drivers with a failing FERT score who perform satisfactorily after three months based on OTJSRs. | The proportion of drivers whom the FERT inaccurately predicted to be qualified. Also, the proportion of drivers whom the FERT inaccurately predicted to be unqualified. |
| False positive | A specific type of miss whereby an assessment tool falsely indicates that the testtaker possesses or exhibits a particular trait, ability, behavior, or attribute | The FERT indicates that the new hire will perform successfully on the job but, in fact, the new driver does not. | A driver who is hired is not qualified. |
| False negative | A specific type of miss whereby an assessment tool falsely indicates that the testtaker does not possess or exhibit a particular trait, ability, behavior, or attribute | The FERT indicates that the new hire will not perform successfully on the job but, in fact, the new driver would have performed successfully. | The FERT says not to hire, but the driver would have been rated as qualified. |
Source: Boudreau (1988).
(2) Ensure that qualified candidates are not rejected.
To accomplish this goal, set a FERT cut score that ensures that no one who is rejected by the cut would have been deemed qualified at the end of the probationary period. Stated another way, set a cut score that yields the lowest false negative rate. The emphasis in such a scenario is on weeding out the “worst” applicants; that is, those applicants who will definitely be deemed unqualified at the end of the probationary period.
(3) Ensure that all candidates selected will prove to be qualified.
To accomplish this goal, set a FERT cut score that ensures that everyone who “makes the cut” on the FERT is rated as qualified at the end of the probationary period; no one who “makes the cut” is rated as unqualified at the end of the probationary period. Stated another way, set a cut score that yields the lowest false positive rate. The emphasis in such a scenario is on selecting only the best applicants; that is, those applicants who will definitely be deemed qualified at the end of the probationary period.
(4) Ensure, to the extent possible, that qualified candidates will be selected and unqualified candidates will be rejected.
This objective can be met by setting a cut score on the FERT that is helpful in (a) selecting for permanent positions those drivers who performed satisfactorily on the OTJSR, (b) eliminating from consideration those drivers who performed unsatisfactorily on the OTJSR, and (c) reducing the miss rate as much as possible. This approach to setting a cut score will yield the highest hit rate while allowing for FERT-related “misses” that may be either of the false-positive or false-negative variety. Here, false positives are seen as no better or worse than false negatives and vice versa.
It is seldom possible to “have it all ways.” In other words, it is seldom possible to have the lowest false positive rate, the lowest false negative rate, the highest hit rate, and not incur any costs of testing. Which of the four listed objectives represents the best “fit” with your policies and the company’s hiring objectives? Before responding, it may be helpful to review Table 1 .
After reviewing Table 1 and all of the material on terms including hit, miss, false positive, and false negative, Dr. Carlos elects to continue and is presented with the following four options from which to choose.
1. Select applicants without using the FERT.
2. Use the FERT to select with the lowest false negative rate.
3. Use the FERT to select with the lowest false positive rate.
4. Use the FERT to yield the highest hit rate and lowest miss rate.
Curious about the outcome associated with each of these four options, Dr. Carlos wishes to explore all of them. She begins by selecting Option 1: Select applicants without using the FERT. Immediately, a graph (Close-Up Figure 1) and this prompt pop up:
Figure 1 Base Rate Data for Federale Express

Before the use of the FERT, any applicant with a valid driver’s license and no criminal record was hired for a permanent position as an FE driver. Drivers could be classified into two groups based on their on-the-job supervisory ratings (OTJSRs): those whose driving was considered to be satisfactory (located above the dashed horizontal line) and those whose driving was considered to be unsatisfactory (below the dashed line). Without use of the FERT, then, all applicants were hired and the selection ratio was 1.0; 60 drivers were hired out of the 60 applicants. However, the base rate of successful performance shown in Figure 1 was only .50. This means that only half of the drivers hired (30 of 60) were considered “qualified” drivers by their supervisors. This also shows a miss rate of .50, because half of the drivers turned out to perform below the minimally accepted level. Yet because scores on the FERT and the OTJSRs are positively correlated, the FERT can be used to help select the individuals who are likely to be rated as qualified drivers. Thus, using the FERT is a good idea, but how should it be used? One method would entail top-down selection. That is, a permanent position could be offered first to the individual with the highest score on the FERT (the top, rightmost case in Figure 1), followed by the individual with the next highest FERT score, and so on until all available positions are filled. As you can see in the figure, if permanent positions are offered only to individuals with the top 20 FERT scores, then the OTJSR ratings of the permanent hires will mostly be in the satisfactory performer range. However, as previously noted, such a top-down selection policy can be discriminatory.
Generally speaking, base rate is defined as the proportion of people in the population that possess a particular trait, behavior, characteristic, or attribute. In this study, base rate refers to the proportion of new hire drivers who would go on to perform satisfactorily on the criterion measure (the OTJSRs) and be deemed “qualified” regardless of whether or not a test such as the FERT existed (and regardless of their score on the FERT if it were administered). The base rate is represented in Figure 1 (and in all subsequent graphs) by the number of drivers whose OTJSRs fall above the dashed horizontal line (a line that refers to minimally acceptable performance on the OTJSR) as compared to the total number of scores. In other words, the base rate is equal to the ratio of qualified applicants to the total number of applicants.
Without the use of the FERT, it is estimated that about one-half of all new hires would exhibit satisfactory performance; that is, the base rate would be .50. Without use of the FERT, the miss rate would also be .50—this because half of all drivers hired would be deemed unqualified based on the OTJSRs at the end of the probationary period.
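The base-rate arithmetic described above can be sketched in a few lines of Python. The scores and the passing level below are hypothetical stand-ins, not the study data; only the 30-of-60 split comes from the scenario.

```python
# Minimal sketch: base rate when every applicant is hired.
# A driver counts as "qualified" if the OTJSR total meets the minimally
# acceptable level. The scores and passing level here are invented.

def base_rate(otjsr_scores, passing_level):
    """Proportion of hires whose criterion score is at or above passing."""
    qualified = sum(1 for s in otjsr_scores if s >= passing_level)
    return qualified / len(otjsr_scores)

# 60 hires, half of whom turn out qualified, as in the FE scenario:
scores = [75] * 30 + [40] * 30
print(base_rate(scores, passing_level=65))   # 0.5; miss rate = 1 - 0.5 = 0.5
```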
Dr. Carlos considers the consequences of a 50% miss rate. She thinks about the possibility of an increase in customer complaints regarding the level of service. She envisions an increase in at-fault accidents and costly lawsuits. Dr. Carlos is pleasantly distracted from these potential nightmares when she inadvertently leans on her keyboard and it furiously begins to beep. Having rejected Option 1, she “presses on” and next explores what outcomes would be associated with Option 2: Use the FERT to select with the lowest false negative rate. Now, another graph (Close-Up Figure 2) appears along with this text:
This graph, as well as all others incorporating FERT cut-score data, has FERT (predictor) scores on the horizontal axis (increasing from left to right) and OTJSR (criterion) scores on the vertical axis (increasing from bottom to top). The selection ratio provides an indication of the competitiveness of the position; it is directly affected by the cut score used in selection. As the cut score is set farther to the right, the selection ratio goes down. The practical implication of the decreasing selection ratio is that hiring becomes more selective; there is more competition for a position, and the proportion of people actually hired (from all of those who applied) will be smaller. As the cut score is set farther to the left, the selection ratio goes up; hiring becomes less selective, and chances are that more people will be hired.
Using a cut score of 18 on the FERT, as compared to not using the FERT at all, reduces the miss rate from 50% to 45% (see Figure 2). The major advantage of setting the cut score this low is that the false negative rate falls to zero; no potentially qualified drivers will be rejected based on the FERT. Use of this FERT cut score also increases the base rate of successful performance from .50 to .526. This means that the percentage of hires who will be rated as “qualified” has increased from 50% without use of the FERT to 52.6% with the FERT. The selection ratio associated with using 18 as the cut score is .95, which means that 95% of drivers who apply are selected.
Figure 2 Selection with Low Cut Score and High Selection Ratio

As we saw in Figure 1, without the use of the FERT, only half of all the probationary hires would be rated as satisfactory drivers by their supervisors. Now we will consider how to improve selection by using the FERT. For ease of reference, each of the quadrants in Figure 2 (as well as in the remaining Close-Up graphs) has been labeled A, B, C, or D. The selection ratio in this and the following graphs may be defined as the ratio of the number of people who are hired on a permanent basis (qualified applicants as determined by FERT score) to the total number of people who apply. The total number of applicants for permanent positions was 60, as evidenced by all of the dots in all of the quadrants. In quadrants A and B, just to the right of the vertical cut-score line (set at 18), are the 57 FE drivers who were offered permanent employment. We can also see that the false negative rate is zero because no scores fall in quadrant D; thus, no potentially qualified drivers will be rejected based on use of the FERT with a cut score of 18. The selection ratio in this scenario is 57/60, or .95. We can therefore conclude that 57 applicants (95% of the 60 who originally applied) would have been hired on the basis of their FERT scores with a cut score set at 18 (resulting in a “high” selection ratio of 95%); only three applicants would not be hired based on their FERT scores. These three applicants would also have been rated as unqualified by their supervisors at the end of the probationary period. We can also see that, by removing the lowest-scoring applicants, the base rate of successful performance improves slightly as compared to not using the FERT at all. Instead of a successful performance base rate of only .50 (as was the case when all applicants were hired), the base rate of successful performance is now .526.
This is so because 30 drivers are still rated as qualified based on OTJSRs while the number of drivers hired has been reduced from 60 to 57.
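The quadrant tallies just described can be turned into the quoted rates with a short sketch. The four counts below are reconstructed from the narrative (30 qualified hired, 27 unqualified hired, 0 qualified rejected, 3 unqualified rejected), not taken from raw study data.

```python
# Sketch: classification statistics from the four quadrant counts.

def rates(qual_hired, unqual_hired, qual_rejected, unqual_rejected):
    """Hit/miss statistics when all counts are proportions of the full pool."""
    n = qual_hired + unqual_hired + qual_rejected + unqual_rejected
    return {
        "hit_rate": (qual_hired + unqual_rejected) / n,
        "miss_rate": (unqual_hired + qual_rejected) / n,
        "selection_ratio": (qual_hired + unqual_hired) / n,
        "base_rate_among_hired": qual_hired / (qual_hired + unqual_hired),
    }

# Cut score 18: 57 of 60 applicants make the cut; none of the rejects
# would have been rated qualified (false negatives = 0).
r = rates(qual_hired=30, unqual_hired=27, qual_rejected=0, unqual_rejected=3)
print(r)  # miss_rate 0.45, selection_ratio 0.95, base_rate_among_hired ≈ 0.526
```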
Dr. Carlos appreciates that the false negative rate is zero and thus no potentially qualified drivers are turned away based on FERT score. She also believes that a 5% reduction in the miss rate is better than no reduction at all. She wonders, however, whether this reduction in the miss rate is statistically significant. She would have to formally analyze these data to be certain but, after simply “eyeballing” these findings, a decrease in the miss rate from 50% to 45% does not seem significant. Similarly, an increase in the number of qualified drivers of only 2.6% through the use of a test for selection purposes does not, on its face, seem significant. It simply does not seem prudent to institute a new personnel selection test at real cost and expense to the company if the only benefit of the test is to reject the lowest-scoring 3 of 60 applicants—when, in reality, 30 of the 60 applicants will be rated as “unqualified.”
Dr. Carlos pauses to envision a situation in which reducing the false negative rate to zero might be prudent; it might be ideal if she were testing drivers for drug use, because she would definitely not want a test to indicate a driver is drug-free if that driver had been using drugs. Of course, a test with a false negative rate of zero would likely also have a high false positive rate. But then she could retest any candidate who received a positive result with a second, more expensive, more accurate test—this to ensure that the initial positive result was correct and not a testing error. As Dr. Carlos mulls over these issues, a colleague startles her with a friendly query: “How’s that FERT research coming?”
Dr. Carlos says, “Fine,” and smoothly reaches for her keyboard to select Option 3: Use the FERT to select with the lowest false positive rate. Now, another graph (Close-Up Figure 3) and another message pop up:
Using a cut score of 80 on the FERT, as compared to not using the FERT at all, results in a reduction of the miss rate from 50% to 40% (see Figure 3) but also reduces the false positive rate to zero. Use of this FERT cut score also increases the base rate of successful performance from .50 to 1.00. This means that the percentage of drivers selected who are rated as “qualified” increases from 50% without use of the FERT to 100% when the FERT is used with a cut score of 80. The selection ratio associated with using 80 as the cut score is .10, which means that 10% of applicants are selected.
Figure 3 Selection with High Cut Score and Low Selection Ratio

As before, the total number of applicants for permanent positions was 60, as evidenced by all of the dots in all of the quadrants. In quadrants A and B, just to the right of the vertical cut-score line (set at a FERT score of 80), are the 6 FE drivers who were offered permanent employment. The selection ratio in this scenario is 6/60, or .10. We can therefore conclude that 6 applicants (10% of the 60 who originally applied) would have been hired on the basis of their FERT scores with the cut score set at 80 (and with a “low” selection ratio of 10%). Note also that the base rate improves dramatically, from .50 without use of the FERT to 1.00 with a FERT cut score set at 80. This means that all drivers selected when this cut score is in place will be qualified. Although only 10% of the drivers will be offered permanent positions, all who are offered permanent positions will be rated qualified drivers on the OTJSR. Note, however, that even though the false positive rate drops to zero, the overall miss rate only drops to .40. This is so because a substantial number (24) of qualified applicants would be denied permanent positions because their FERT scores were below 80.
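The high-cut-score figures reduce to simple arithmetic. A quick sketch, with counts reconstructed from the narrative (6 hired, all qualified; 24 qualified applicants below the cut):

```python
# Sketch: cut score 80. Every hire is qualified, so all misses are
# false negatives (qualified drivers screened out).
n = 60
hired = 6
rejected_qualified = 24

selection_ratio = hired / n            # 6/60 = .10
miss_rate = rejected_qualified / n     # 24/60 = .40
base_rate_among_hired = 6 / hired      # 1.00: all hires rated qualified
print(selection_ratio, miss_rate, base_rate_among_hired)
```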
Dr. Carlos likes the idea of the “100% solution” entailed by a false positive rate of zero. It means that 100% of the applicants selected by their FERT scores will turn out to be qualified drivers. At first blush, this solution seems optimal. However, there is, as they say, a fly in the ointment. Although the high cut score (80) results in the selection of only qualified candidates, the selection ratio is so stringent that only 10% of those candidates would actually be hired. Dr. Carlos envisions the consequences of this low selection ratio. She sees herself as having to recruit and test at least 100 applicants for every 10 drivers she actually hires. To meet her company goal of hiring 60 drivers, for example, she would have to recruit about 600 applicants for testing. Attracting that many applicants to the company is a venture that has some obvious (as well as some less obvious) costs. Dr. Carlos sees her recruiting budget dwindle as she repeatedly writes checks for classified advertising in newspapers. She sees herself purchasing airline tickets and making hotel reservations in order to attend various job fairs, far and wide. Fantasizing about the applicants she will attract at one of those job fairs, she is abruptly brought back to the here-and-now by the friendly voice of a fellow staff member asking her if she wants to go to lunch. Still half-steeped in thought about a potential budget crisis, Dr. Carlos responds, “Yes, just give me ten dollars . . . I mean, ten minutes.”
As Dr. Carlos takes the menu of a local hamburger haunt from her desk to review, she still can’t get the “100% solution” out of her mind. Although it is clearly attractive, she has reservations (about the solution, not for the restaurant). Offering permanent positions to only the top-performing applicants could easily backfire. Competing companies could be expected to also offer these applicants positions, perhaps with more attractive benefit packages. How many of the top drivers hired would actually stay at Federale Express? Hard to say. What is not hard to say, however, is that the use of the “100% solution” has essentially brought Dr. Carlos full circle back to the top-down hiring policy that she sought to avoid in the first place. Also, scrutinizing Figure 3, Dr. Carlos sees that—even though the base rate with this cut score is 100%—the percentage of misclassifications (as compared to not using any selection test) is reduced only by a measly 10%. Further, there would be many qualified drivers who would also be cut by this cut score. In this instance, then, a cut score that scrupulously seeks to avoid the hiring of unqualified drivers also leads to rejecting a number of qualified applicants. Perhaps in the hiring of “super responsible” positions—say, nuclear power plant supervisors—such a rigorous selection policy could be justified. But is such rigor really required in the selection of Federale Express drivers?
Hoping for a more reasonable solution to her cut-score dilemma and beginning to feel hungry, Dr. Carlos leafs through the burger menu while choosing Option 4 on her computer screen: Use the FERT to yield the highest hit rate and lowest miss rate. In response to this selection, another graph (Close-Up Figure 4) along with the following message is presented:
Using a cut score of 48 on the FERT results in a reduction of the miss rate from 50% to 15% as compared to not using the FERT (see Figure 4). False positive and false negative rates are both fairly low at .167 and .133, respectively. Use of this cut score also increases the base rate from .50 (without use of the FERT) to .839. This means that the percentage of hired drivers who are rated as “qualified” at the end of the probationary period has increased from 50% (without use of the FERT) to 83.9%. The selection ratio associated with using 48 as the cut score is .517, which means that 51.7% of applicants will be hired.
Figure 4 Selection with Moderate Cut Score and Moderate Selection Ratio

Again, the total number of applicants was 60. In quadrants A and B, just to the right of the vertical cut-score line (set at 48), are the 31 FE drivers who were offered permanent employment at the end of the probationary period. The selection ratio in this scenario is therefore equal to 31/60, or about .517. This means that slightly more than half of all applicants will be hired based on the use of 48 as the FERT cut score. The selection ratio of .517 is a moderate one. It is not as stringent as is the .10 selection ratio that results from a cut score of 80, nor is it as lenient as the .95 selection ratio that results from a cut score of 18. Note also that the cut score set at 48 effectively weeds out many of the applicants who won’t receive acceptable performance ratings. Further, it does this while retaining many of the applicants who will receive acceptable performance ratings. With a FERT cut score of 48, the base rate increases quite a bit: from .50 (as was the case without using the FERT) to .839. This means that about 84% (83.9%, to be exact) of the hired drivers will be rated as qualified when the FERT cut score is set to 48 for driver selection.
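The moderate cut-score numbers can be checked the same way. The quadrant counts below are reconstructed from the narrative, and the sketch shows one reading that reconciles the quoted .167 and .133 rates with the overall .15 miss rate: those two figures appear to be computed within each criterion group (30 unqualified and 30 qualified drivers) rather than over all 60 applicants.

```python
# Sketch: cut score 48. Reconstructed counts: 26 qualified and 5
# unqualified hired; 4 qualified and 25 unqualified rejected (sum = 60).
n = 60
hired_qualified, hired_unqualified = 26, 5
rejected_qualified, rejected_unqualified = 4, 25

selection_ratio = (hired_qualified + hired_unqualified) / n   # 31/60 ≈ .517
miss_rate = (hired_unqualified + rejected_qualified) / n      # 9/60 = .15
base_rate_among_hired = hired_qualified / 31                  # ≈ .839

# The quoted .167 and .133 as rates within each criterion group:
false_positive_rate = hired_unqualified / 30     # 5 of 30 unqualified ≈ .167
false_negative_rate = rejected_qualified / 30    # 4 of 30 qualified ≈ .133
print(selection_ratio, miss_rate, base_rate_among_hired)
```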
Although a formal analysis would have to be run, Dr. Carlos again “eyeballs” the findings and, based on her extensive experience, strongly suspects that these results are statistically significant. Moreover, these findings would seem to be of practical significance. As compared to not using the FERT, use of the FERT with a cut score of 48 could reduce misclassifications from 50% to 15%. Such a reduction in misclassifications would almost certainly have positive cost–benefit implications for FE. Also, the percentage of drivers who are deemed qualified at the end of the probationary period would rise from 50% (without use of the FERT) to 83.9% (using the FERT with a cut score of 48). The implications of such improved selection are many and include better service to customers (leading to an increase in business volume), less costly accidents, and fewer costs involved in hiring and training new personnel.
Yet another benefit of using the FERT with a cut score of 48 concerns recruiting costs. Using a cut score of 48, FE would need to recruit only 39 or so applicants for every 20 permanent positions it needed to fill. Now, anticipating real savings in her annual budget, Dr. Carlos returns the hamburger menu to her desk drawer and removes instead the menu from her favorite (pricey) steakhouse.
Dr. Carlos decides that the moderate cut-score solution is optimal for FE. She acknowledges that this solution doesn’t reduce any of the error rates to zero. However, it produces relatively low error rates overall. It also yields a relatively high hit rate; about 84% of the drivers hired will be qualified at the end of the probationary period. Dr. Carlos believes that the costs associated with recruitment and testing using this FERT cut score will be more than compensated by the evolution of a work force that evidences satisfactory performance and has fewer accidents. As she peruses the steakhouse menu and mentally debates the pros and cons of sautéed onions, she also wonders about the dollars-and-cents utility of using the FERT. Are all of the costs associated with instituting the FERT as part of FE hiring procedures worth the benefits?
Dr. Carlos puts down the menu and begins to calculate the company’s return on investment (the ratio of benefits to costs). She estimates the cost of each FERT to be about $200, including the costs associated with truck usage, gas, and supervisory personnel time. She further estimates that FE will test 120 applicants per year in order to select approximately 60 new hires based on a moderate FERT cut score. Given the cost of each test ($200) administered individually to 120 applicants, the total to be spent on testing annually will be about $24,000. So, is it worth it? Considering all of the possible benefits previously listed that could result from a significant reduction of the misclassification rate, Dr. Carlos’s guess is, “Yes, it would be worth it.” Of course, decisions like that aren’t made with guesses. So continue reading—later in this chapter, a formula will be applied that will prove Dr. Carlos right. In fact, the moderate cut score shown in Figure 4 would produce a return on investment of 12.5 to 1. And once Dr. Carlos gets wind of these projections, you can bet it will be surf-and-turf-tortilla time at Federale Express.
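Dr. Carlos’s cost arithmetic is simple enough to sketch. Only the $200 per-test cost, the 120 applicants per year, and the 12.5-to-1 ratio come from the text; the implied benefit figure is an extrapolation from those two numbers, not a value stated in the scenario.

```python
# Sketch of the testing-cost and return-on-investment arithmetic.
cost_per_fert = 200           # dollars per administration (from the text)
applicants_per_year = 120     # applicants tested annually (from the text)

annual_testing_cost = cost_per_fert * applicants_per_year   # $24,000
roi = 12.5                    # benefit-to-cost ratio quoted in the text
implied_annual_benefit = roi * annual_testing_cost          # extrapolated
print(annual_testing_cost, implied_annual_benefit)
```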
How Is a Utility Analysis Conducted?
The specific objective of a utility analysis will dictate what sort of information will be required as well as the specific methods to be used. Here we will briefly discuss two general approaches to utility analysis. The first is an approach that employs data that should actually be quite familiar.
Some utility analyses will require little more than converting a scatterplot of test data to an expectancy table (much like the process described in the previous chapter). An expectancy table can provide an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure—an interval that may be categorized as “passing,” “acceptable,” or “failing.” For example, with regard to the utility of a new and experimental personnel test in a corporate setting, an expectancy table can provide vital information to decision-makers. An expectancy table might indicate, for example, that the higher a worker’s score is on this new test, the greater the probability that the worker will be judged successful. In other words, the test is working as it should and, by instituting this new test on a permanent basis, the company could reasonably expect to improve its productivity.
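As a rough sketch of that conversion, an expectancy table is just the proportion of “successes” within each band of predictor scores. The score pairs, bands, and success codes below are entirely hypothetical.

```python
# Sketch: building a tiny expectancy table from paired
# (test score, outcome) data, where outcome 1 = judged successful.
pairs = [(12, 0), (25, 0), (33, 1), (41, 0), (47, 1),
         (55, 1), (62, 1), (70, 1), (84, 1), (91, 1)]

bands = [(0, 33), (34, 66), (67, 100)]
expectancy = {}
for lo, hi in bands:
    outcomes = [ok for score, ok in pairs if lo <= score <= hi]
    expectancy[(lo, hi)] = sum(outcomes) / len(outcomes)

for (lo, hi), p in expectancy.items():
    print(f"scores {lo}-{hi}: P(success) = {p:.2f}")
# Higher bands show higher success probabilities -- the pattern a
# useful predictor should produce.
```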
Tables that could be used as an aid for personnel directors in their decision-making chores were published by H. C. Taylor and J. T. Russell in the Journal of Applied Psychology in 1939. Referred to by the names of their authors, the Taylor-Russell tables provide an estimate of the extent to which inclusion of a particular test in the selection system will improve selection. More specifically, the tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs, given different combinations of three variables: the test’s validity, the selection ratio used, and the base rate.
The value assigned for the test’s validity is the computed validity coefficient. The selection ratio is a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired. For instance, if there are 50 positions and 100 applicants, then the selection ratio is 50/100, or .50. As used here, base rate refers to the percentage of people hired under the existing system for a particular position who are considered successful. If, for example, a firm employs 25 computer programmers and 20 are considered successful, the base rate would be .80. With knowledge of the validity coefficient of a particular test, the selection ratio, and the base rate, reference to the Taylor-Russell tables provides the personnel officer with an estimate of how much using the test would improve selection over existing methods.
A sample Taylor-Russell table is presented in Table 7–1. This table is for the base rate of .60, meaning that 60% of those hired under the existing system are successful in their work. Down the left-hand side are validity coefficients for a test that could be used to help select employees. Across the top are the various selection ratios. They reflect the proportion of the people applying for the jobs who will be hired. If a new test is introduced to help select employees in a situation with a selection ratio of .20 and if the new test has a predictive validity coefficient of .55, then the table shows that the base rate will increase to .88. This means that, rather than 60% of the hired employees being expected to perform successfully, a full 88% can be expected to do so. When selection ratios are low, as when only 5% of the applicants will be hired, even tests with low validity coefficients, such as .15, can result in improved base rates.
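In code, using the tables amounts to a lookup keyed on validity and selection ratio for a given base rate. Only the one cell cited in the text (validity .55, selection ratio .20, yielding .88) is filled in below; a real implementation would transcribe the full published table from Taylor and Russell (1939).

```python
# Sketch of a Taylor-Russell lookup for a .60 base rate. Only the
# single cell quoted in the chapter is included here.
taylor_russell_base_60 = {
    (0.55, 0.20): 0.88,
}

def expected_success_rate(validity, selection_ratio):
    """Expected proportion of successful hires once the test is added."""
    return taylor_russell_base_60[(validity, selection_ratio)]

print(expected_success_rate(0.55, 0.20))  # 0.88: up from the .60 base rate
```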
Table 7–1 Taylor-Russell Table for a Base Rate of .60
Source: Taylor and Russell (1939).
One limitation of the Taylor-Russell tables is that the relationship between the predictor (the test) and the criterion (rating of performance on the job) must be linear. If, for example, there is some point at which job performance levels off, no matter how high the score on the test, use of the Taylor-Russell tables would be inappropriate. Another limitation of the Taylor-Russell tables is the potential difficulty of identifying a criterion score that separates “successful” from “unsuccessful” employees.
The potential problems of the Taylor-Russell tables were avoided by an alternative set of tables (Naylor & Shine, 1965) that provided an indication of the difference in average criterion scores for the selected group as compared with the original group. Use of the Naylor-Shine tables entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test (or some other tool of assessment) is adding to already established procedures.
Both the Taylor-Russell and the Naylor-Shine tables can assist in judging the utility of a particular test, the former by determining the increase over current procedures and the latter by determining the increase in average score on some criterion measure. With both tables, the validity coefficient used must be one obtained by concurrent validation procedures—a fact that should not be surprising because it is obtained with respect to current employees hired by the selection process in effect at the time of the study.
JUST THINK . . .
In addition to testing, what types of assessment procedures might employers use to help them make judicious personnel selection decisions?
If hiring decisions were made solely on the basis of variables such as the validity of an employment test and the prevailing selection ratio, then tables such as those offered by Taylor and Russell and Naylor and Shine would be in wide use today. The fact is that many other kinds of variables might enter into hiring and other sorts of personnel selection decisions (including decisions relating to promotion, transfer, layoff, and firing). Some additional variables might include, for example, applicants’ minority status, general physical or mental health, or drug use. Given that many variables may affect a personnel selection decision, of what use is a given test in the decision process?
Expectancy data, such as those provided by the Taylor-Russell tables or the Naylor-Shine tables, could be used to shed light on many utility-related decisions, particularly those confined to questions concerning the validity of an employment test and the selection ratio employed. Table 7–2 presents a brief summary of some of the uses, advantages, and disadvantages of these approaches. In many instances, however, the purpose of a utility analysis is to answer a question related to costs and benefits in terms of dollars and cents. When such questions are raised, the answer may be found by using the Brogden-Cronbach-Gleser formula.
Table 7–2: Most Everything You Ever Wanted to Know About Utility Tables

Expectancy table or chart
· What it tells us: Likelihood that individuals who score within a given range on the predictor will perform successfully on the criterion
· Example: A school psychologist uses an expectancy table to determine the likelihood that students who score within a particular range on an aptitude test will succeed in regular classes as opposed to special education classes.
· Advantages: Easy-to-use graphical display; can aid in decision making regarding a specific individual or a group of individuals scoring in a given range on the predictor
· Disadvantages: Dichotomizes performance into successful and unsuccessful categories, which is not realistic in most situations; does not address monetary issues such as cost of testing or return on investment of testing

Taylor-Russell tables
· What they tell us: Increase in base rate of successful performance that is associated with a particular level of criterion-related validity
· Example: A human resources manager of a large computer store uses the Taylor-Russell tables to help decide whether applicants for sales positions should be administered an extraversion inventory prior to hire. The manager wants to increase the proportion of the sales force that is considered successful (i.e., consistently meets sales quota). By using an estimate of the test’s validity (e.g., a value of .20 based on research by Conte & Gintoft, 2005), the current base rate, and the selection ratio, the manager can estimate whether the increase in the proportion of the sales force that successfully meets quota will justify the cost of testing all sales applicants.
· Advantages: Easy to use; show the relationships between selection ratio, criterion-related validity, and existing base rate; facilitate decision making with regard to test use and/or recruitment to lower the selection ratio
· Disadvantages: Relationship between predictor and criterion must be linear; do not indicate the likely average increase in performance with use of the test; difficulty identifying a criterion value to separate successful and unsuccessful performance; dichotomize performance into successful versus unsuccessful, which is not realistic in most situations; do not consider the cost of testing in comparison to benefits

Naylor-Shine tables
· What they tell us: Likely average increase in criterion performance as a result of using a particular test or intervention; also provide the selection ratio needed to achieve a particular increase in criterion performance
· Example: The provost at a private college estimates the increase in applicant pool (and corresponding decrease in selection ratio) needed to improve the mean performance of students the college selects by 0.50 standardized units while still maintaining its enrollment figures.
· Advantages: Provide the information (average performance gain) needed to use the Brogden-Cronbach-Gleser utility formula; do not dichotomize criterion performance; useful for showing either the average performance gain or the selection ratio needed for a particular performance gain; facilitate decision making with regard to the likely increase in performance with test use and/or the recruitment needed to lower the selection ratio
· Disadvantages: Overestimate utility unless top-down selection is used; utility is expressed as a performance gain in standardized units, which can be difficult to interpret in practical terms; do not address monetary issues such as cost of testing or return on investment
The Brogden-Cronbach-Gleser formula
The independent work of Hubert E. Brogden (1949) and a team of decision theorists (Cronbach & Gleser, 1965) has been immortalized in the Brogden-Cronbach-Gleser formula, used to calculate the dollar amount of a utility gain resulting from the use of a particular selection instrument under specified conditions. In general, utility gain refers to an estimate of the benefit (monetary or otherwise) of using a particular test or selection method. The Brogden-Cronbach-Gleser (BCG) formula is:

utility gain = (N)(T)(rxy)(SDy)(Z̄m) − (N)(C)

In the first part of the formula, N represents the number of applicants selected per year, T represents the average length of time in the position (or tenure), rxy represents the (criterion-related) validity coefficient for the given predictor and criterion, SDy represents the standard deviation of performance (in dollars) of employees, and Z̄m represents the mean (standardized) score on the test for selected applicants. The second part of the formula represents the cost of testing, which takes into consideration the number of applicants (N) multiplied by the cost of the test for each applicant (C). A difficulty in using this formula is estimating the value of SDy, a value that is, quite literally, estimated (Hunter et al., 1990). One recommended way to estimate SDy is by setting it equal to 40% of the mean salary for the job (Schmidt & Hunter, 1998).
The BCG formula can be applied to the question raised in this chapter’s Close-Up about the utility of the FERT. Suppose 60 Federale Express (FE) drivers are selected per year and that each driver stays with FE for one and a half years. Let’s further suppose that the standard deviation of performance of the drivers is about $9,000 (calculated as 40% of annual salary), that the criterion-related validity of FERT scores is .40, and that the mean standardized FERT score for applicants is +1.0. Applying the benefits part of the BCG formula, the benefits are $324,000 (60 × 1.5 × .40 × $9,000 × 1.0). When the costs of testing ($24,000) are subtracted from the financial benefits of testing ($324,000), it can be seen that the utility gain amounts to $300,000.
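The arithmetic in the FERT example can be checked with a few lines of code. The function and argument names below are our own, chosen for readability:

```python
def bcg_utility_gain(n_selected, tenure, validity, sd_y, mean_z, cost_of_testing):
    """Brogden-Cronbach-Gleser utility gain in dollars: the benefit term
    (N x T x rxy x SDy x mean standardized test score of those selected)
    minus the total cost of testing (N x C)."""
    benefit = n_selected * tenure * validity * sd_y * mean_z
    return benefit - cost_of_testing

# FERT example: 60 drivers selected per year, 1.5-year average tenure,
# rxy = .40, SDy = $9,000, mean standardized score of +1.0, $24,000 testing cost
gain = bcg_utility_gain(60, 1.5, 0.40, 9_000, 1.0, 24_000)
print(f"Utility gain: ${gain:,.0f}")  # Utility gain: $300,000
```

The $324,000 benefit term less the $24,000 cost of testing reproduces the $300,000 utility gain described in the text.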
So, would it be wise for a company to make an investment of $24,000 to receive a return of about $300,000? Most people (and corporations) would be more than willing to invest in something if they knew that the return on their investment would be more than $12.50 for each dollar invested. Clearly, with such a return on investment, using the FERT with the cut score illustrated in Figure 4 of the Close-Up does provide a cost-effective method of selecting delivery drivers.
JUST THINK . . .
When might it be better to present utility gains in productivity terms rather than financial terms?
By the way, a modification of the BCG formula exists for researchers who prefer their findings in terms of productivity gains rather than financial ones. Here, productivity gain refers to an estimated increase in work output. In this modification of the formula, the value of the standard deviation of productivity, SDp, is substituted for the value of the standard deviation of performance in dollars, SDy (Schmidt et al., 1986). The result is a formula that helps estimate the percent increase in output expected through the use of a particular test. The revised formula is:

productivity gain = (N)(T)(rxy)(SDp)(Z̄m) − (N)(C)
Throughout this text, including in the boxed material, we have sought to illustrate psychometric principles with reference to contemporary, practical illustrations from everyday life. In recent years, for example, there have increasingly been calls for police to wear body cameras as a means of reducing inappropriate use of force against citizens (Ariel, 2015). In response to such demands, some have questioned whether the purchase of such recording systems, as well as all of the ancillary recording and record-keeping technology, is justified; that is, will it really make a difference in the behavior of police personnel? Stated another way, important questions regarding the utility of such systems have been raised. Some answers to these important questions can be found in this chapter’s Everyday Psychometrics.
The Utility of Police Use of Body Cameras*
Imagine you are walking down a street. You see two police officers approach a man who has just walked out of a shop, carrying a shopping bag. The police stop the man and aggressively ask him to explain who he is, where he is going, and what he was doing in the shop. Frustrated at being detained in this way, the man becomes angry and refuses to cooperate. The situation quickly escalates as the police resort to the use of pepper spray and handcuffs to effect an arrest. The man being arrested is physically injured in the process. After his release, the man files a lawsuit in civil court against the police force, alleging illegal use of force. Several bystanders come forward as witnesses to the event. Their account of what happened serves to support the plaintiff’s claims against the defendant (the defendant being the municipality that manages the police). A jury finds in favor of the plaintiff and orders the defendant city to pay the plaintiff one million dollars in damages.
Now imagine the same scenario but played through the eyes of the police officer who effected the arrest. Prior to your sighting of the suspect individual, you have heard “be on the lookout” reports over your police radio regarding a man roughly fitting this person’s description. The individual in question has reportedly been observed stealing items from shops in the area. Having observed him, you now approach him and take command of the situation, because that is what you have been trained to do. Despite your forceful, no-nonsense approach to the suspect, the suspect is uncooperative to the point of defiance. As the suspect becomes increasingly agitated, you become increasingly concerned for your own safety, as well as the safety of your partner. Now trying to effect an arrest without resorting to the use of lethal force, you use pepper spray in an effort to subdue him. Subsequently, in court, after the suspect has been cleared of all charges, and the municipality that employs you has been hit with a one-million-dollar judgment, you wonder how things could have been handled more effectively.
In the scenarios described above, the police did pretty much what they were trained to do. Unfortunately, all of that training resulted in a “lose-lose” situation for both the citizen wrongly detained on suspicion of being a thief and the police officer who was just doing his job as best he could. So a question arises: “Is there something that might have been added to the situation to reduce both the citizen’s combativeness and the police officer’s defensive and reflexive use of force in response?” More specifically, might the situation have been different if the parties involved knew that their every move, and their every utterance, was being faithfully recorded? Might the fact that the event was being recorded influence the extent to which the wrongfully charged citizen was noncompliant, even combative? Similarly, might the fact that the event was being recorded influence the extent to which the police officer doing his job had to resort to the use of force? The answer to such questions is “yes,” according to a study by Ariel et al. (2015). A brief description of that study follows. Readers interested in a more detailed description of the experiment are urged to consult the original article.
© George Frey/Getty Images
The Ariel et al. (2015) Study
Ariel et al.’s (2015) study with the police force in Rialto, California, provided the first published experimental evidence on the effectiveness of the body-worn camera (BWC). In order to establish whether or not cameras were actually able to change officer–citizen interactions for the better, a randomized controlled field trial (RCT) was designed.1 In nearly every police force around the world, officers work according to a shift pattern. Using a randomization program called the Cambridge Randomizer (Ariel et al., 2012), which is essentially an online coin-flip, the researchers randomly assigned officers of each shift to either a camera or no-camera experimental condition. This meant that every officer on a shift would wear a camera in the Camera condition but not wear a camera in the No Camera condition. The relevant behavioral data for analysis was not what any one of the 54 police officers on the Rialto police force was doing, but what occurred during the 988 randomly assigned shifts over a one-year period.
The research protocol required officers to (i) wear cameras only during Camera shifts; (ii) not wear (or use) cameras during No Camera shifts; (iii) keep cameras on throughout their entire Camera shift; and (iv) issue verbal warnings during the Camera shifts to advise citizens confronted that the interaction was being videotaped by a camera attached to the officer’s uniform.
Over the course of the year that the experiment ran, data from police reports of arrest as well as data from videos (when available) were analyzed for the presence or absence of “use of force.” For the purposes of this experiment, “use of force” was coded as being present on any occasion that a police verbal confrontation with a citizen escalated to the point of physical contact. In addition to the presence or absence of use of force as an outcome measure, another outcome measure was formal complaints of police use of force made by citizens. As clearly illustrated in Figure 1, the number of use-of-force incidents per shift decreased significantly beginning at the time of the initiation of this study, as did the number of use-of-force complaints by citizens. Ariel et al. (2015) found that use-of-force rates in the No Camera shifts were more than twice those in the Camera shifts.
Although this study suggests that body cameras worn by police have utility in reducing use-of-force incidents, as well as use-of-force complaints by citizens, it sheds no light on why this might be so. In fact, there are a multitude of variables to consider when analyzing the factors that may influence a police officer’s decision to use force (Bolger, 2015). Given the procedures used in this study, the question of whether changes in the participants’ behavior were more a function of the camera or of the police officer’s verbal warning remains an open one (“Cameras on Cops,” 2014; Ariel, 2016). It would be useful to explore in future research the extent to which being filmed, or simply being advised that one is being filmed, is causal in reducing use-of-force incidents and use-of-force complaints.
To be sure, use of force by police in some situations is indicated, legitimate, and unquestionably justified. However, in those more borderline situations, cameras may serve as silent reminders of the efficacy of more “civil” interaction—and this may be true for both members of the general public as well as those well-meaning police officers whose dedicated service and whose judicious use of force is integral to the functioning of civilized society.
Figure 1 Use of Force by Police and Use-of-Force Complaints by Citizens Before and During the Rialto Body Camera Experiment
Used with permission of Alex Sutherland and Barak Ariel.
1. Although an RCT entails the use of experimental methods, the laboratory in a field experiment is the “real world.” This fact enhances the generalizability of the results. It is also more challenging because there are a lot more things that can go wrong. This is the case for many reasons, not the least of which is the fact that participants do not always do exactly what the experimenter has asked them to do.
*This Everyday Psychometrics was guest-authored by Alex Sutherland of RAND Europe, and Barak Ariel of Cambridge University and Hebrew University.
Decision theory and test utility
Perhaps the most oft-cited application of statistical decision theory to the field of psychological testing is Cronbach and Gleser’s Psychological Tests and Personnel Decisions (1957, 1965). The idea of applying statistical decision theory to questions of test utility was conceptually appealing and promising, and an authoritative textbook of the day reflects the great enthusiasm with which this marriage of enterprises was greeted:
The basic decision-theory approach to selection and placement . . . has a number of advantages over the more classical approach based upon the correlation model. . . . There is no question but that it is a more general and better model for handling this kind of decision task, and we predict that in the future problems of selection and placement will be treated in this context more frequently—perhaps to [the] eventual exclusion of the more stereotyped correlational model. (Blum & Naylor, 1968, p. 58)
Stated generally, Cronbach and Gleser (1965) presented (1) a classification of decision problems; (2) various selection strategies ranging from single-stage processes to sequential analyses; (3) a quantitative analysis of the relationship between test utility, the selection ratio, cost of the testing program, and expected value of the outcome; and (4) a recommendation that in some instances job requirements be tailored to the applicant’s ability instead of the other way around (a concept they refer to as adaptive treatment).
Let’s illustrate decision theory in action. To do so, recall the definition of five terms that you learned in the previous chapter: base rate, hit rate, miss rate, false positive, and false negative. Now, imagine that you developed a procedure called the Vapor Test (VT), which was designed to determine if alive-and-well subjects are indeed breathing. The procedure for the VT entails having the examiner hold a mirror under the subject’s nose and mouth for a minute or so and observing whether the subject’s breath fogs the mirror. Let’s say that 100 introductory psychology students are administered the VT, and it is concluded that 89 were, in fact, breathing (whereas 11 are deemed, on the basis of the VT, not to be breathing). Is the VT a good test? Obviously not. Because the base rate is 100% of the (alive-and-well) population, we really don’t even need a test to measure the characteristic breathing. If for some reason we did need such a measurement procedure, we probably wouldn’t use one that was inaccurate in approximately 11% of the cases. A test is obviously of no value if the hit rate is higher without using it. One measure of the value of a test lies in the extent to which its use improves on the hit rate that exists without its use.
As a simple illustration of decision theory applied to testing, suppose a test is administered to a group of 100 job applicants and that some cutoff score is applied to distinguish applicants who will be hired (applicants judged to have passed the test) from applicants whose employment application will be rejected (applicants judged to have failed the test). Let’s further suppose that some criterion measure will be applied some time later to ascertain whether the newly hired person was considered a success or a failure at the job. In such a situation, if the test is a perfect predictor (if its validity coefficient is equal to 1), then two distinct types of outcomes can be identified: (1) Some applicants will score at or above the cutoff score on the test and be successful at the job, and (2) some applicants will score below the cutoff score and would not have been successful at the job.
In reality, few, if any, employment tests are perfect predictors with validity coefficients equal to 1. Consequently, two additional types of outcomes are possible: (3) Some applicants will score at or above the cutoff score, be hired, and fail at the job (the criterion), and (4) some applicants who scored below the cutoff score and were not hired could have been successful at the job. People who fall into the third category could be categorized as false positives, and those who fall into the fourth category could be categorized as false negatives.
In this illustration, logic alone tells us that if the selection ratio is, say, 90% (9 out of 10 applicants will be hired), then the cutoff score will probably be set lower than if the selection ratio is 5% (only 5 of the 100 applicants will be hired). Further, if the selection ratio is 90%, then it is a good bet that the number of false positives (people hired who will fail on the criterion measure) will be greater than if the selection ratio is 5%. Conversely, if the selection ratio is only 5%, it is a good bet that the number of false negatives (people not hired who could have succeeded on the criterion measure) will be greater than if the selection ratio is 90%.
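A small simulation can make these relationships concrete. In the sketch below, all names and parameter values are our own, purely illustrative choices: applicants' test and criterion scores are drawn from a correlated bivariate-normal model, a cut score is set from the selection ratio, and the four outcome types are tallied for a lenient and a strict selection ratio:

```python
import random
from statistics import NormalDist

def outcome_counts(selection_ratio, validity, base_rate=0.5, n=100_000, seed=1):
    """Tally the four decision-theory outcomes for a test cut score set
    so that the top `selection_ratio` of applicants are hired."""
    nd = NormalDist()
    x_cut = nd.inv_cdf(1 - selection_ratio)  # test cut score
    y_cut = nd.inv_cdf(1 - base_rate)        # criterion cut for "success"
    rng = random.Random(seed)
    k = (1 - validity ** 2) ** 0.5
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for _ in range(n):
        x = rng.gauss(0, 1)                     # test score
        y = validity * x + k * rng.gauss(0, 1)  # criterion, corr = validity
        hired, success = x > x_cut, y > y_cut
        if hired and success:
            counts["true_pos"] += 1       # hired and succeeds on the job
        elif hired:
            counts["false_pos"] += 1      # hired but fails on the criterion
        elif success:
            counts["false_neg"] += 1      # rejected but would have succeeded
        else:
            counts["true_neg"] += 1       # rejected and would have failed
    return counts

# A .90 selection ratio yields far more false positives than a .05 ratio,
# and the reverse holds for false negatives, as the logic in the text predicts.
lenient, strict = outcome_counts(0.90, 0.45), outcome_counts(0.05, 0.45)
print(lenient["false_pos"] > strict["false_pos"],
      strict["false_neg"] > lenient["false_neg"])  # True True
```

The validity value of .45 here is arbitrary; any moderate validity produces the same qualitative pattern.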
Decision theory provides guidelines for setting optimal cutoff scores. In setting such scores, the relative seriousness of making false-positive or false-negative selection decisions is frequently taken into account. Thus, for example, it is prudent policy for an airline personnel office to set a cutoff score on tests for pilots that might result in a false negative (a pilot who is truly qualified being rejected) rather than one that would allow a false positive (a pilot who is truly unqualified being hired).
In the hands of highly skilled researchers, principles of decision theory applied to problems of test utility have led to some enlightening and impressive findings. For example, Schmidt et al. (1979) demonstrated in dollars and cents how the utility of a company’s selection program (and the validity coefficient of the tests used in that program) can play a critical role in the profitability of the company. Focusing on one employer’s population of computer programmers, these researchers asked supervisors to rate (in terms of dollars) the value of good, average, and poor programmers. This information was used in conjunction with other information, including these facts: (1) Each year the employer hired 600 new programmers, (2) the average programmer remained on the job for about 10 years, (3) the Programmer Aptitude Test currently in use as part of the hiring process had a validity coefficient of .76, (4) it cost about $10 per applicant to administer the test, and (5) the company currently employed more than 4,000 programmers.
Schmidt et al. (1979) made a number of calculations using different values for some of the variables. For example, knowing that some of the tests previously used in the hiring process had validity coefficients ranging from .00 to .50, they varied the value of the test’s validity coefficient (along with other factors such as different selection ratios that had been in effect) and examined the relative efficiency of the various conditions. Among their findings was that the existing selection ratio and selection process provided a great gain in efficiency over a previous situation (when the selection ratio was 5% and the validity coefficient of the test used in hiring was equal to .50). This gain was equal to almost $6 million per year. Multiplied over, say, 10 years, that’s $60 million. The existing selection ratio and selection process provided an even greater gain in efficiency over a previously existing situation in which the test had no validity at all and the selection ratio was .80. Here, in one year, the gain in efficiency was estimated to be equal to over $97 million.
By the way, the employer in the previous study was the U.S. government. Hunter and Schmidt (1981) applied the same type of analysis to the national workforce and made a compelling argument with respect to the critical relationship between valid tests and measurement procedures and our national productivity. In a subsequent study, Schmidt, Hunter, and their colleagues found that substantial increases in work output or reductions in payroll costs would result from using valid measures of cognitive ability as opposed to non-test procedures (Schmidt et al., 1986).
JUST THINK . . .
What must happen in society at large if the promise of decision theory in personnel selection is to be fulfilled?
Employers are reluctant to use decision-theory-based strategies in their hiring practices because of the complexity of their application and the threat of legal challenges. Thus, although decision theory approaches to assessment hold great promise, this promise has yet to be fulfilled.
Some Practical Considerations
A number of practical matters must be considered when conducting utility analyses. For example, as we have noted elsewhere, issues related to existing base rates can affect the accuracy of decisions made on the basis of tests. Particular attention must be paid to this factor when the base rates are extremely low or high because such a situation may render the test useless as a tool of selection. Focusing for the purpose of this discussion on the area of personnel selection, some other practical matters to keep in mind involve assumptions about the pool of job applicants, the complexity of the job, and the cut score in use.
The pool of job applicants
If you were to read a number of articles in the utility analysis literature on personnel selection, you might come to the conclusion that there exists, “out there,” what seems to be a limitless supply of potential employees just waiting to be evaluated and possibly selected for employment. For example, utility estimates such as those derived by Schmidt et al. (1979) are based on the assumption that there will be a ready supply of viable applicants from which to choose and fill positions. Perhaps for some types of jobs and in some economic climates that is, indeed, the case. There are certain jobs, however, that require such unique skills or demand such great sacrifice that there are relatively few people who would even apply, let alone be selected. Also, the pool of possible job applicants for a particular type of position may vary with the economic climate. It may be that in periods of high unemployment there are significantly more people in the pool of possible job applicants than in periods of high employment.
JUST THINK . . .
What is an example of a type of job that requires such unique skills that there are probably relatively few people in the pool of qualified employees?
Closely related to issues concerning the available pool of job applicants is the issue of how many people would actually accept the employment position offered to them even if they were found to be qualified candidates. Many utility models, somewhat naively, are constructed on the assumption that all of the people selected by a personnel test accept the position that they are offered. In fact, many of the top performers on the test are people who, because of their superior and desirable abilities, are also being offered positions by one or more other potential employers. Consequently, the top performers on the test are probably the least likely of all of the job applicants to actually be hired. Utility estimates based on the assumption that all people selected will actually accept offers of employment thus tend to overestimate the utility of the measurement tool. These estimates may have to be adjusted downward as much as 80% in order to provide a more realistic estimate of the utility of a tool of assessment used for selection purposes (Murphy, 1986).
The complexity of the job
In general, the same sorts of approaches to utility analysis are put to work for positions that vary greatly in terms of complexity. The same sorts of data are gathered, the same sorts of analytic methods may be applied, and the same sorts of utility models may be invoked for corporate positions ranging from assembly line worker to computer programmer. Yet as Hunter et al. (1990) observed, the more complex the job, the more people differ on how well or poorly they do that job. Whether or not the same utility models apply to jobs of varied complexity, and whether or not the same utility analysis methods are equally applicable, remain matters of debate.
The cut score in use
A cut score (also called a cutoff score) was previously defined as a (usually numerical) reference point derived as a result of a judgment and used to divide a set of data into two or more classifications, with some action to be taken or some inference to be made on the basis of these classifications. In discussions of utility theory and utility analysis, reference is frequently made to different types of cut scores. For example, a distinction can be made between a relative cut score and a fixed cut score. A relative cut score may be defined as a reference point—in a distribution of test scores used to divide a set of data into two or more classifications—that is set based on norm-related considerations rather than on the relationship of test scores to a criterion. Because this type of cut score is set with reference to the performance of a group (or some target segment of a group), it is also referred to as a norm-referenced cut score.
As an example of a relative cut score, envision your instructor announcing on the first day of class that, for each of the four examinations to come, the top 10% of all scores on each test would receive the grade of A. In other words, the cut score in use would depend on the performance of the class as a whole. Stated another way, the cut score in use would be relative to the scores achieved by a targeted group (in this case, the entire class and in particular the top 10% of the class). The actual test score used to define who would and would not achieve the grade of A on each test could be quite different for each of the four tests, depending upon where the boundary line for the 10% cutoff fell on each test.
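Computing such a norm-referenced cut score is straightforward. In the sketch below, the function name and the exam scores are invented for illustration; the cut score for an A is simply whatever score the top 10% of the class happened to achieve:

```python
def relative_cut_score(scores, top_fraction=0.10):
    """Lowest score that still falls within the top `top_fraction` of the group."""
    n_top = max(1, round(len(scores) * top_fraction))
    return sorted(scores, reverse=True)[n_top - 1]

# Hypothetical class of 20 exam scores: the top 10% (2 students) earn an A
exam = [55, 61, 64, 67, 68, 70, 71, 72, 74, 75,
        76, 78, 79, 81, 83, 84, 86, 88, 91, 97]
print(relative_cut_score(exam))  # 91 -- only the scores 91 and 97 earn an A
```

On a harder exam the same function would return a lower cut score, which is exactly the "relative" property described above.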
In contrast to a relative cut score is the fixed cut score, which we may define as a reference point—in a distribution of test scores used to divide a set of data into two or more classifications—that is typically set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification. Fixed cut scores may also be referred to as absolute cut scores. An example of a fixed cut score might be the score achieved on the road test for a driver’s license. Here the performance of other would-be drivers has no bearing upon whether an individual testtaker is classified as “licensed” or “not licensed.” All that really matters here is the examiner’s answer to this question: “Is this driver able to meet (or exceed) the fixed and absolute score on the road test necessary to be licensed?”
JUST THINK . . .
Can both relative and absolute cut scores be used within the same evaluation? If so, provide an example.
A distinction can also be made between the terms multiple cut scores and multiple hurdles as used in decision-making processes. Multiple cut scores refers to the use of two or more cut scores with reference to one predictor for the purpose of categorizing testtakers. So, for example, your instructor may have multiple cut scores in place every time an examination is administered, and each class member will be assigned to one category (e.g., A, B, C, D, or F) on the basis of scores on that examination. That is, meeting or exceeding one cut score will result in an A for the examination, meeting or exceeding another cut score will result in a B for the examination, and so forth. This is an example of multiple cut scores being used with a single predictor. Of course, we may also speak of multiple cut scores being used in an evaluation that entails several predictors wherein applicants must meet the requisite cut score on every predictor to be considered for the position. A more sophisticated but cost-effective multiple cut-score method can involve several “hurdles” to overcome.
At every stage in a multistage (or multiple hurdle) selection process, a cut score is in place for each predictor used. The cut score used for each predictor will be designed to ensure that each applicant possesses some minimum level of a specific attribute or skill. In this context, multiple hurdles may be thought of as one collective element of a multistage decision-making process in which the achievement of a particular cut score on one test is necessary in order to advance to the next stage of evaluation in the selection process. In applying to colleges or professional schools, for example, applicants may have to successfully meet some standard in order to move to the next stage in a series of stages. The process might begin, for example, with the written application stage, in which individuals who turn in incomplete applications are eliminated from further consideration. This is followed by what might be termed an additional materials stage, in which individuals with low test scores, low GPAs, or poor letters of recommendation are eliminated. The final stage in the process might be a personal interview stage. Each of these stages entails unique demands (and cut scores) to be successfully met, or hurdles to be overcome, if an applicant is to proceed to the next stage. Switching gears considerably, another example of a selection process that entails multiple hurdles is presented in Figure 7–2.
Figure 7–2 “There She Goes . . .” Over Yet Another Hurdle Contestants in this pageant must exhibit more than beauty if they are to be crowned. Beyond the swimsuit competition, contestants are judged on talent, responses to interview questions, and other variables. Only by “making the cut” and “clearing each hurdle” in each category of the judging will one of the contestants emerge as the pageant winner. © James Atoa/Everett Collection/Age Fotostock
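The staged screening just described can be sketched in a few lines of code. The stage names, cut scores, and applicant data below are hypothetical, chosen only to illustrate how a multiple-hurdle process eliminates applicants stage by stage:

```python
# Hypothetical multiple-hurdle screen: an applicant advances only by
# meeting or exceeding the cut score at every successive stage.

def passes_hurdles(applicant_scores, hurdles):
    """Return True only if the applicant clears every hurdle in order."""
    for predictor, cut_score in hurdles:
        if applicant_scores.get(predictor, 0) < cut_score:
            return False  # eliminated here; later stages are never reached
    return True

# Hypothetical stages and cut scores for an admissions process
hurdles = [
    ("application_completeness", 1),  # stage 1: complete application
    ("test_score", 70),               # stage 2: minimum test score
    ("interview_rating", 3),          # stage 3: minimum interview rating
]

applicant = {"application_completeness": 1, "test_score": 85, "interview_rating": 4}
print(passes_hurdles(applicant, hurdles))  # True: all hurdles cleared
```

Note that an applicant with a very high interview rating but a test score below 70 is still rejected; no amount of strength at one stage offsets a failure at another, which is exactly the assumption the compensatory model (discussed next) relaxes.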
JUST THINK . . .
Many television programs—including shows like Dancing with the Stars and The Voice—could be conceptualized as having a multiple-hurdle selection policy in place. Explain why these are multiple-hurdle processes. Offer your suggestions, from a psychometric perspective, for improving the selection process on these or any other show with a multiple-hurdle selection policy.
Multiple-hurdle selection methods assume that an individual must possess a certain minimum amount of knowledge, skill, or ability for each attribute measured by a predictor to be successful in the desired position. But is that really the case? Could it be that a very high score in one stage of a multistage evaluation compensates for or “balances out” a relatively low score in another stage of the evaluation? In what is referred to as a compensatory model of selection, an assumption is made that high scores on one attribute can, in fact, “balance out” or compensate for low scores on another attribute. According to this model, a person strong in some areas and weak in others can perform as successfully in a position as a person with moderate abilities in all areas relevant to the position in question.
JUST THINK . . .
Imagine that you are on the hiring committee of an airline that has a compensatory selection model in place. What three pilot characteristics would you rate as most desirable in new hires? Using percentages, how would you differentially weight each of these three characteristics in terms of importance (with the total equal to 100%)?
Intuitively, the compensatory model is appealing, especially when post-hire training or other opportunities are available to develop proficiencies and help an applicant compensate for any areas of deficiency. For instance, with reference to the delivery driver example in this chapter’s Close-Up, consider an applicant with strong driving skills but weak customer service skills. All it might take for this applicant to blossom into an outstanding employee is some additional education (including readings and exposure to videotaped models) and training (role-play and on-the-job supervision) in customer service.
JUST THINK . . .
Is it possible for a corporate employer to have in place personnel selection procedures that use both cutoff scores at one stage of the decision process and a compensatory approach at another? Can you think of an example?
When a compensatory selection model is in place, the individual or entity making the selection will, in general, differentially weight the predictors being used in order to arrive at a total score. Such differential weightings may reflect value judgments made on the part of the test developers regarding the relative importance of different criteria used in hiring. For example, a safe driving history may be weighted more heavily in the selection formula than customer service skills. This weighting might be based on a company-wide “safety first” ethic. It may also be based on a company belief that skill in driving safely is less amenable to education and training than skill in customer service. The total score on all of the predictors will be used to make the decision to select or reject. The statistical tool that is ideally suited for making such selection decisions within the framework of a compensatory model is multiple regression. Other tools, as we will see in what follows, are used to set cut scores.
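To make the contrast with multiple hurdles concrete, here is a minimal sketch of compensatory scoring. The predictor names and weights are hypothetical value judgments; in practice, multiple regression would supply the weights:

```python
# Hypothetical compensatory scoring: differentially weighted predictors are
# combined into a single composite, so strength on one attribute can offset
# weakness on another. Weights are illustrative and sum to 1.0; in practice
# they would come from a multiple regression of job success on the predictors.

weights = {"safe_driving": 0.5, "customer_service": 0.3, "navigation": 0.2}

def composite_score(scores, weights):
    """Weighted sum of an applicant's predictor scores (each on a 0-100 scale)."""
    return sum(weights[p] * scores[p] for p in weights)

strong_driver = {"safe_driving": 95, "customer_service": 60, "navigation": 70}
all_moderate  = {"safe_driving": 78, "customer_service": 78, "navigation": 78}

print(round(composite_score(strong_driver, weights), 1))  # 79.5
print(round(composite_score(all_moderate, weights), 1))   # 78.0
```

Under these (hypothetical) weights, the applicant with outstanding driving skill edges out the uniformly moderate applicant despite weak customer service skills; under a multiple-hurdle model with a customer service cut score of, say, 70, that same applicant would have been eliminated.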
Methods for Setting Cut Scores
If you have ever had the experience of earning a grade of B when you came oh-so-close to the cut score needed for a grade A, then you have no doubt spent some time pondering the way that cut scores are determined. In this exercise, you are not alone. Educators, researchers, corporate statisticians, and others with diverse backgrounds have spent countless hours questioning, debating, and—judging from the nature of the heated debates in the literature—agonizing about various aspects of cut scores. No wonder; cut scores applied to a wide array of tests may be used (usually in combination with other tools of measurement) to make various “high-stakes” (read “life-changing”) decisions, a partial listing of which would include:
· who gets into what college, graduate school, or professional school;
· who is certified or licensed to practice a particular occupation or profession;
· who is accepted for employment, promoted, or moved to some desirable position in a business or other organization;
· who will advance to the next stage in evaluation of knowledge or skills;
· who is legally able to drive an automobile;
· who is legally competent to stand trial;
· who is legally competent to make a last will;
· who is considered to be legally intoxicated;
· who is not guilty by reason of insanity;
· which foreign national will earn American citizenship.
JUST THINK . . .
What if there were a “true cut-score theory” for setting cut scores that was analogous to the “true score theory” for tests? What might it look like?
Page upon page in journal articles, books, and other scholarly publications contain writings that wrestle with issues regarding the optimal method of “making the cut” with cut scores. One thoughtful researcher raised the question that served as the inspiration for our next Just Think exercise (see Reckase, 2004). So, after you have given due thought to that exercise, read on and become acquainted with various methods in use today for setting fixed and relative cut scores. Although no one method has won universal acceptance, some methods are more popular than others.
The Angoff Method
Devised by William Angoff (1971), the Angoff method for setting fixed cut scores can be applied to personnel selection tasks as well as to questions regarding the presence or absence of a particular trait, attribute, or ability. When used for purposes of personnel selection, experts in the area provide estimates regarding how testtakers who have at least minimal competence for the position would be expected to answer test items. As applied for purposes relating to the determination of whether or not testtakers possess a particular trait, attribute, or ability, an expert panel makes judgments concerning the way a person with that trait, attribute, or ability would respond to test items. In both cases, the judgments of the experts are averaged to yield cut scores for the test. Persons who score at or above the cut score are considered high enough in the ability to be hired or to be sufficiently high in the trait, attribute, or ability of interest. This relatively simple technique has wide appeal (Cascio et al., 1988; Maurer & Alexander, 1992) and works well—that is, as long as the experts agree. The Achilles heel of the Angoff method is its vulnerability to low inter-rater reliability; when experts disagree sharply about how certain populations of testtakers should respond to items, it may be time for “Plan B,” a strategy for setting cut scores that is driven more by data and less by subjective judgment.
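The averaging step at the heart of the Angoff procedure can be sketched as follows. In this simplified version, each expert estimates, for each item, the probability that a minimally competent testtaker would answer it correctly; summing one expert's probabilities gives that expert's suggested cut score, and the test's cut score is the average across experts. All judgment values below are hypothetical:

```python
# Simplified Angoff sketch: per-item probability judgments (hypothetical)
# from three experts on a five-item test. Each inner list holds one
# expert's estimates of the probability that a minimally competent
# testtaker would answer each item correctly.

expert_judgments = [
    [0.9, 0.8, 0.6, 0.7, 0.5],  # expert 1
    [0.8, 0.7, 0.7, 0.6, 0.6],  # expert 2
    [0.9, 0.9, 0.5, 0.7, 0.5],  # expert 3
]

# Each expert's implied cut score is the sum of that expert's probabilities.
per_expert_cuts = [round(sum(ratings), 2) for ratings in expert_judgments]

# The test's cut score is the average of the experts' implied cut scores.
cut_score = sum(per_expert_cuts) / len(per_expert_cuts)

print(per_expert_cuts)          # [3.5, 3.4, 3.5]
print(round(cut_score, 2))      # 3.47
```

Here the experts largely agree, so averaging is unproblematic; had one expert's implied cut score been, say, 1.5, the low inter-rater reliability noted above would call the averaged result into question.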
The Known Groups Method
Also referred to as the method of contrasting groups, the known groups method entails collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest. Based on an analysis of these data, a cut score is set on the test that best discriminates between the two groups’ test performance. How does this work in practice? Consider the following example.
A hypothetical online college called Internet Oxford University (IOU) offers a remedial math course for students who have not been adequately prepared in high school for college-level math. But who needs to take remedial math before taking regular math? To answer that question, senior personnel in the IOU Math Department prepare a placement test called the “Who Needs to Take Remedial Math? Test” (WNTRMT). The next question is, “What shall the cut score on the WNTRMT be?” That question will be answered by administering the test to a selected population and then setting a cut score based on the performance of two contrasting groups: (1) students who successfully completed college-level math, and (2) students who failed college-level math.
Accordingly, the WNTRMT is administered to all incoming freshmen. IOU collects all test data and holds it for a semester (or two). It then analyzes the scores of two approximately equal-sized groups of students who took college-level math courses: a group who passed the course and earned credit, and a group who did not earn credit for the course because their final grade was a D or an F. IOU statisticians will now use these data to choose the score that best discriminates the two groups from each other, which is the score at the point of least difference between the two groups. As shown in Figure 7–3, the two groups are indistinguishable at a score of 6. Consequently, now and forever more (or at least until IOU conducts another study), the cut score on the WNTRMT shall be 6.
Figure 7–3 Scores on IOU’s WNTRMT
The main problem with using known groups is that determination of where to set the cut score is inherently affected by the composition of the contrasting groups. No standard set of guidelines exists for choosing contrasting groups. In the IOU example, the university officials could have chosen to contrast just the A students with the F students when deriving a cut score; this would certainly have resulted in a different cut score. Similar problems in choosing contrasting groups arise in other settings. For example, in setting cut scores for a clinical measure of depression, just how depressed do respondents from the depressed group have to be? How “normal” should the respondents in the nondepressed group be?
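The search for the score that best discriminates the two groups can be sketched directly. The score lists below are hypothetical (they are not the data plotted in Figure 7–3), but the logic mirrors the IOU example: try every candidate cut score and keep the one that misclassifies the fewest students:

```python
# Sketch of the known groups method: given WNTRMT scores for students who
# later passed college-level math and for those who failed, pick the cut
# score that produces the fewest misclassifications. Score lists are
# hypothetical.

passed = [6, 6, 7, 8, 8, 9, 10, 10]  # scores of students who passed the course
failed = [2, 3, 3, 4, 4, 5, 5, 6]    # scores of students who failed the course

def best_cut_score(passed, failed):
    """Return the candidate cut score with the fewest misclassified students."""
    candidates = range(min(failed), max(passed) + 2)

    def errors(cut):
        false_neg = sum(1 for s in passed if s < cut)   # passers sent to remedial math
        false_pos = sum(1 for s in failed if s >= cut)  # failers placed in regular math
        return false_neg + false_pos

    return min(candidates, key=errors)

print(best_cut_score(passed, failed))  # 6
```

With these hypothetical data a cut score of 6 misclassifies only one student, which is consistent with the point of least difference described in the IOU example. Notice how sensitive the result is to the composition of the two groups: trimming either list changes the error counts, which is precisely the criticism of the method raised above.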
The methods described thus far for setting cut scores are based on classical test score theory. In this theory, cut scores are typically set based on testtakers’ performance across all the items on the test; some portion of the total number of items on the test must be scored “correct” (or in a way that indicates the testtaker possesses the target trait or attribute) in order for the testtaker to “pass” the test (or be deemed to possess the targeted trait or attribute). Within an item response theory (IRT) framework, however, things can be done a little differently. In the IRT framework, each item is associated with a particular level of difficulty. In order to “pass” the test, the testtaker must answer items that are deemed to be above some minimum level of difficulty, which is determined by experts and serves as the cut score.
There are several IRT-based methods for determining the difficulty level reflected by a cut score (Karantonis & Sireci, 2006; Wang, 2003). For example, a technique that has found application in setting cut scores for licensing examinations is the item-mapping method . It entails the arrangement of items in a histogram, with each column in the histogram containing items deemed to be of equivalent value. Judges who have been trained regarding minimal competence required for licensure are presented with sample items from each column and are asked whether or not a minimally competent licensed individual would answer those items correctly about half the time. If so, that difficulty level is set as the cut score; if not, the process continues until the appropriate difficulty level has been selected. Typically, the process involves several rounds of judgments in which experts may receive feedback regarding how their ratings compare to ratings made by other experts.
An IRT-based method of setting cut scores that is more typically used in academic applications is the bookmark method (Lewis et al., 1996; see also Mitzel et al., 2000). Use of this method begins with the training of experts with regard to the minimal knowledge, skills, and/or abilities that testtakers should possess in order to “pass.” Subsequent to this training, the experts are given a book of items, with one item printed per page, such that items are arranged in an ascending order of difficulty. The expert then places a “bookmark” between the two pages (or, the two items) that are deemed to separate testtakers who have acquired the minimal knowledge, skills, and/or abilities from those who have not. The bookmark serves as the cut score. Additional rounds of bookmarking with the same or other judges may take place as necessary. Feedback regarding placement may be provided, and discussion among experts about the bookmarkings may be allowed. In the end, the level of difficulty to use as the cut score is decided upon by the test developers. Of course, none of these procedures are free of possible drawbacks. Some concerns raised about the bookmarking method include issues regarding the training of experts, possible floor and ceiling effects, and the optimal length of item booklets (Skaggs et al., 2007).
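A much-simplified sketch of the bookmark procedure follows. Real bookmark studies involve response-probability criteria and multiple rounds of judgment; here, as one simple resolution rule, the median of the experts' bookmark placements is taken and the cut score is read from the corresponding item's difficulty. All item difficulties and bookmark placements below are hypothetical:

```python
# Simplified bookmark sketch: items are ordered by ascending IRT difficulty,
# and each trained expert "bookmarks" the last item a minimally competent
# testtaker should be expected to answer correctly. The median bookmark
# position (a simple resolution rule used here) yields the cut difficulty.

import statistics

# Hypothetical item difficulties, already sorted in ascending order
item_difficulties = [-2.0, -1.4, -0.9, -0.5, 0.0, 0.4, 0.9, 1.5, 2.1]

# Each expert's bookmark: the 1-based position of the last "should-answer" item
bookmarks = [4, 5, 4, 6, 5]

median_position = int(statistics.median(bookmarks))
cut_difficulty = item_difficulties[median_position - 1]  # difficulty of that item

print(median_position, cut_difficulty)  # 5 0.0
```

In this hypothetical panel the experts' placements cluster tightly (positions 4 through 6), so the median is a reasonable summary; widely scattered bookmarks would signal the training and booklet-length concerns noted above.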
Our overview of cut-score setting has touched on only a few of the many methods that have been proposed, implemented, or experimented with; many other methods exist. For example, Hambleton and Novick (1973) presented a decision-theoretic approach to setting cut scores. In his book Personnel Psychology, R. L. Thorndike (1949) proposed a norm-referenced method for setting cut scores called the method of predictive yield, a technique that took into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores. Another approach to setting cut scores employs a family of statistical techniques called discriminant analysis (also referred to as discriminant function analysis). These techniques are typically used to shed light on the relationship between identified variables (such as scores on a battery of tests) and two (and in some cases more) naturally occurring groups (such as persons judged to be successful at a job and persons judged unsuccessful at a job).
Given the importance of setting cut scores and how much can be at stake for individuals “cut” by them, research and debate on the issues involved are likely to continue—at least until that hypothetical “true score theory for cut scores” alluded to earlier in this chapter is identified and welcomed by members of the research community.
In this chapter, we have focused on the possible benefits of testing and how to assess those benefits. In so doing, we have touched on several aspects of test development and construction. In the next chapter, we delve more deeply into the details of these important elements of testing and assessment.
Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations: