12 | Judgment and Reasoning
Judgment
The activity of “thinking” takes many forms, but one of the central forms is judgment-the process through which people draw conclusions from the evidence they encounter, often evidence provided by life experiences. But how-and how well-do people make judgments? Experience is, of course, an extraordinary teacher, and so you’re likely to believe the sports coach who, after many seasons, tells you which game strategies work and which ones don’t. Likewise, you trust the police detective who asserts that over the years he’s learned how to tell whether a suspect is lying. You welcome the advice of the hair stylist who says, “I can tell you from the hair I cut every day, this shampoo repairs split ends.”
But we can also find cases in which people don’t learn from experience: “He’s getting married again? Why does he think this one will last longer than the last four?”; “It doesn’t matter how many polite New Yorkers she meets; she’s still convinced that everyone from that city is rude.”
What’s going on here? Why do people sometimes draw accurate conclusions from their experience, and sometimes not?
Attribute Substitution
Let’s start with the information you use when drawing a conclusion from experience. Imagine that you’re shopping for a car and trying to decide if European cars are reliable. Surely, you’d want to know how often these cars break down and need repair-how frequent are the problems? As a different case, imagine that you’re trying to choose an efficient route for your morning drive to work. Here, too, the information you need concerns frequencies: When you’ve gone down 4th Avenue, how often were you late? How often were you late when you stayed on Front Street instead?
Examples like these remind us that a wide range of judgments begin with a frequency estimate-an assessment of how often various events have occurred in the past. For many of the judgments you make in day-to-day life, though, you don’t have direct access to frequency information. You probably don’t have instant access to a count of how many VWs break down, in comparison to how many Hondas. You probably don’t have a detailed list of your various commute times. How, therefore, do you proceed in making your judgments?
Let’s pursue the decision about commuting routes. In making your choice, you’re likely to do a quick scan through memory, looking for relevant cases. If you can immediately think of three occasions when you got caught in a traffic snarl on 4th Avenue and can’t think of similar occasions on Front Street, you’ll probably decide that Front Street is the better bet. In contrast, if you can recall two horrible traffic jams on Front Street but only one on 4th Avenue, you’ll draw the opposite conclusion. The strategy you’re using here is known as attribute substitution- a strategy in which you rely on easily assessed information as a proxy for the information you really need. In this judgment about traffic, the information you need is frequency (how often you’ve been late when you’ve taken one route or the other), but you don’t have access to this information. As a substitute, you base your judgment on availability-how easily and how quickly you can come up with relevant examples. The logic is this: “Examples leap to mind? Must be a common, often-experienced event. A struggle to come up with examples? Must be a rare event”.
This strategy-relying on availability as a substitute for frequency-is a form of attribute substitution known as the availability heuristic (Tversky & Kahneman, 1973). Here’s a different type of attribute substitution: Imagine that you’re applying for a job. You hope that the employer will examine your credentials carefully and make a thoughtful judgment about whether you’d be a good hire. It’s likely, though, that the employer will rely on a faster, easier strategy. Specifically, he may barely glance at your résumé and, instead, ask himself how much you resemble other people he’s hired who have worked out well. Do you have the same mannerisms or the same look as Joan, an employee that he’s very happy with? If so, you’re likely to get the job. Or do you remind him of Jane, an employee he had to fire after just two months? If so, you’ll still be looking at the job ads tomorrow. In this case, the person who’s interviewing you needs to judge a probability (namely, the probability that you’d work out well if hired) and instead relies on resemblance to known cases. This substitution is referred to as the representativeness heuristic.
The Availability Heuristic
People rely on heuristics like availability and representativeness in a wide range of settings, and so, if we understand these strategies, we understand how a great deal of thinking proceeds. (See Table 12.1 for a summary comparison of these two strategies.)
In general, the term heuristic describes an efficient strategy that usually leads to the right answer. The key word, however, is “usually,” because heuristics allow errors; that’s the price you pay in order to gain the efficiency. The availability and representativeness heuristics both fit this profile. In each case, you’re relying on an attribute (availability or resemblance) that’s easy to assess, and that’s the source of the efficiency. And in each case, the attribute is correlated with the target dimension, so that it can serve as a reasonable proxy for the target: Events or objects that are frequent in the world are, in most cases, likely to be easily available in memory, so generally you’ll be fine if you rely on availability as an index for frequency. And many categories are homogeneous enough so that members of the category do resemble one another; that’s why you can often rely on resemblance as a way of judging probability of category membership.
Nonetheless, these strategies can lead to error. To take a simple case, ask yourself: “Are there more words in the dictionary beginning with the letter R (rose, rock, rabbit) or more words with an R in the third position (tarp, bare, throw)?” Most people insist that there are more words beginning with R (Tversky & Kahneman, 1973, 1974), but the reverse is true-by a margin of at least 2-to-1.
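If you want to check this sort of claim for yourself, a few lines of code will do it. The sketch below is ours, not part of the text; it assumes a newline-delimited word list such as the /usr/share/dict/words file found on many Unix systems, and the exact counts will depend on whichever list you use.

```python
# Count words that start with R versus words with R in the third position.
first_r = 0
third_r = 0
with open("/usr/share/dict/words") as f:   # assumed word-list location
    for line in f:
        word = line.strip().lower()
        if word.startswith("r"):
            first_r += 1
        if len(word) >= 3 and word[2] == "r":
            third_r += 1
print(f"start with R: {first_r}, R in third position: {third_r}")
```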
Why do people get this wrong? The answer lies in availability. If you search your memory for words starting with R, many will come to mind. (Try it: How many R-words can you name in 10 seconds?) But if you search your memory for words with an R in the third position, fewer will come up. (Again, try this for 10 seconds.) This difference, favoring the words beginning with R, arises because your memory is organized roughly like a dictionary, with words that share a starting sound all grouped together. As a result, it’s easy to search memory using “starting letter” as your cue; a search based on “R in third position” is more difficult. In this way, the organization of memory creates a bias in what’s easily available, and this bias in availability leads to an error in frequency judgment.
The Wide Range of Availability Effects
The R-word example isn’t very interesting on its own-after all, how often do you need to make judgments about spelling patterns? But other examples are easy to find, including cases in which people are making judgments of some importance.
For example, people regularly overestimate the frequency of events that are, in actuality, quite rare (Lichtenstein, Slovic, Fischhoff, Layman, & Combs, 1978). This probably plays a part in people’s willingness to buy lottery tickets; they overestimate the likelihood of winning. Likewise, physicians often overestimate the likelihood of a rare disease and, in the process, fail to pursue other, more appropriate, diagnoses (e.g., Elstein et al., 1986; Obrecht, Chapman, & Gelman, 2009).
What causes this pattern? There’s little reason to spend time thinking about familiar events (“Oh, look-that airplane has wings!”), but you’re likely to notice and think about rare events, especially rare emotional events (“How awful-that airplane crashed!”). As a result, rare events are likely to be well recorded in memory, and this will, in turn, make these events easily available to you. As a consequence, if you rely on the availability heuristic, you’ll overestimate the frequency of these distinctive events and, correspondingly, overestimate the likelihood of similar events happening in the future.
Here’s a different example. Participants in one study were asked to think about episodes in their lives in which they’d acted in an assertive manner (Schwarz et al., 1991; also see Raghubir & Menon, 2005). Half of the participants were asked to recall 6 episodes; half were asked to recall 12 episodes. Then, all the participants were asked some general questions, including how assertive overall they thought they were. Participants had an easy time coming up with 6 episodes, and so, using the availability heuristic, they concluded, “Those cases came quickly to mind; therefore, there must be a large number of these episodes; therefore, I must be an assertive person.” In contrast, participants who were asked for 12 episodes had some difficulty generating the longer list, so they concluded, “If these cases are so difficult to recall, I guess the episodes can’t be typical for how I act.”
Consistent with these suggestions, participants who were asked to recall fewer episodes judged themselves to be more assertive. Notice, ironically, that the participants who tried to recall more episodes actually ended up with more evidence in view for their own assertiveness. But it’s not the quantity of evidence that matters. Instead, what matters is the ease of coming up with the episodes. Participants who were asked for a dozen episodes had a hard time with the task because they’d been asked to do something difficult-namely, to come up with a lot of cases. But the participants seemed not to realize this. They reacted only to the fact that the examples were difficult to generate, and using the availability heuristic, they concluded that being assertive was relatively infrequent in their past.
The Representativeness Heuristic
Similar points can be made about the representativeness heuristic. Just like availability, this strategy often leads to the correct conclusion. But here, too, the strategy can sometimes lead you astray.
How does the representativeness heuristic work? Let’s start with the fact that many of the categories you encounter are relatively homogeneous. The category “birds,” for example, is reasonably uniform with regard to the traits of having wings, having feathers, and so on. Virtually every member of the category has these traits, and so, in these regards, each member of the category resembles most of the others. Likewise, the category “motels” is homogeneous with regard to traits like has beds in each room, has a Bible in each room, and has an office, and so, again, in these regards each member of the category resembles the others. The representativeness heuristic capitalizes on this homogeneity. We expect each individual to resemble the other individuals in the category (i.e., we expect each individual to be representative of the category overall). As a result, we can use resemblance as a basis for judging the likelihood of category membership. So if a creature resembles other birds you’ve seen, you conclude that the creature probably is a bird. We first met this approach in Chapter 9, when we were discussing simple categories like “dog” and “fruit.” But the same approach can be used more broadly-and this is the heart of the representativeness strategy. Thus, if a job candidate resembles successful hires you’ve made, you conclude that the person will probably be a successful hire; if someone you meet at a party resembles engineers you’ve known, you assume that the person is likely to be an engineer.
Once again, though, use of this heuristic can lead to error. Imagine tossing a coin over and over, and let’s say that it lands “heads” up six times in a row. Many people believe that on the next toss the coin is more likely to come up tails. They reason that if the coin is fair, then any series of tosses should contain roughly equal numbers of heads and tails. If no tails have appeared for a while, then some are “overdue” to make up the balance.
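Before turning to why this reasoning fails, here is a minimal simulation sketch (ours, not the chapter’s) that previews the two facts developed in the next paragraph: the toss that follows a run of six heads is still a 50-50 proposition, and yet short sequences of tosses routinely stray from an even split.

```python
import random

random.seed(1)  # reproducible illustration

# Fact 1: after six heads in a row, toss 7 is still a coin flip.
runs_of_six = 0
heads_on_toss_7 = 0
for _ in range(200_000):
    tosses = [random.random() < 0.5 for _ in range(7)]
    if all(tosses[:6]):                # first six tosses were all heads
        runs_of_six += 1
        heads_on_toss_7 += tosses[6]
print(heads_on_toss_7 / runs_of_six)   # hovers around .50; no tails are "due"

# Fact 2: individual short sequences stray widely from a 50-50 split.
shares = [sum(random.random() < 0.5 for _ in range(20)) / 20 for _ in range(10)]
print(shares)                          # values scattered above and below .50
```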
This pattern of thinking is called the “gambler’s fallacy.” To see that it is a fallacy, bear in mind that a coin has no “memory,” so the coin has no way of knowing how long it has been since the last tails. Therefore, the likelihood of a tail occurring on any particular toss must be independent of what happened on previous tosses; there’s no way that the previous tosses could possibly influence the next one. As a result, the probability of a tail on toss number 7 is .50, just as it was on the first toss-and on every toss. What produces the gambler’s fallacy? The explanation lies in the assumption of category homogeneity. We know that in the long run, a fair coin will produce equal numbers of heads and tails. Therefore, the category of “all tosses” has this property. Our assumption of homogeneity, though, leads us to expect that any “representative” of the category will also have this property-that is, any sequence of tosses will also show the 50-50 split. But this isn’t true: Some sequences of tosses are 75% heads; some are 5% heads. It’s only when we combine these sequences that the 50-50 split emerges. (For a different perspective on the gambler’s fallacy, see Farmer, Warren, & Hahn, 2017.)
Reasoning from a Single Case to the Entire Population
The assumption of homogeneity can also lead to a different error, one that’s in view whenever people try to persuade each other with a “man who” argument. To understand this term (first proposed by Nisbett & Ross, 1980), imagine that you’re shopping for a new cell phone. You’ve read various consumer magazines and decided, based on their test data, that you’ll buy a Smacko brand phone. You report this to a friend, who is aghast. “Smacko? You must be crazy. Why, I know a guy who bought a Smacko, and the case fell apart two weeks after he got it. Then, the wire for the headphones went. Then, the charger failed. How could you possibly buy a Smacko?” What should you make of this argument? The consumer magazines tested many phones and reported that, say, 2% of all Smackos have repair problems. In your friend’s “data,” 100% of the Smackos (one out of one) broke. It seems silly to let this “sample of one” outweigh the much larger sample tested by the magazine, but even so your friend probably thinks he’s offering a persuasive argument. What guides your friend’s thinking? He must be assuming that the category will resemble the instance. Only in that case would reasoning from a single instance be appropriate. (For a classic demonstration of the “man who” pattern, see Hamill, Wilson, & Nisbett, 1980.)
If you listen to conversations around you, you’ll regularly hear “man who” (or “woman who”) arguments. “What do you mean, cigarette smoking causes cancer?! I have an aunt who smoked for 50 years, and she runs in marathons!” Often, these arguments seem persuasive. But they have force only by virtue of the representativeness heuristic and your assumption of category homogeneity.
Demonstration 12.1: Sample Size
Research on how people make judgments suggests that their performance is at best uneven, with people in many cases drawing conclusions that are not justified by the evidence they’ve seen. Here, for example, is a question drawn from a classic study of judgment:
In a small town nearby, there are two hospitals. Hospital A has an average of 45 births per day; Hospital B is smaller and has an average of 15 births per day. As we all know, overall the proportion of males born is 50%. Each hospital recorded the number of days in which, on that day, at least 60% of the babies born were male.
Which hospital recorded more such days?
a. Hospital A
b. Hospital B
c. both equal
What’s your answer to this question? In more formal procedures, the majority of research participants choose response (c), “both equal,” but this answer is statistically unwarranted. All of the births in the country add up to a 50-50 split between male and female babies, and, the larger the sample you examine, the more likely you are to approximate this ideal. But, conversely, the smaller the sample you examine, the more likely you are to stray from this ideal. Days with 60% male births, straying from the ideal, are therefore more likely in the smaller hospital, Hospital B.
If you don’t see this, consider a more extreme case:
Hospital C has 1,000 births per day; Hospital D has exactly 1 birth per day. Which hospital records more days with at least 90% male births?
This value will be observed in Hospital D rather often, since on many days all the babies born (one out of one) will be male. This value is surely less likely, though, in Hospital C: 900 male births, with just 100 female, would be a remarkable event indeed. In this case, it seems clear that the smaller hospital can more easily stray far from the 50-50 split.
In the hospital problem, participants seem not to take sample size into account. They seem to think a particular pattern is just as likely with a small sample as with a large sample, although this is plainly not true. This belief, however, is just what we would expect if people were relying on the representativeness heuristic, making the assumption that each instance of a category-or, in this case, each subset of a larger set-should show the properties associated with the entire set.
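If you’d like to see the sample-size effect directly, here is a brief simulation sketch (ours, not part of the original demonstration) that estimates, for each hospital, the proportion of days on which at least 60% of births are male.

```python
import random

random.seed(0)

def share_of_skewed_days(births_per_day, days=100_000, cutoff=0.60):
    """Fraction of simulated days with at least `cutoff` male births."""
    skewed = 0
    for _ in range(days):
        males = sum(random.random() < 0.5 for _ in range(births_per_day))
        if males / births_per_day >= cutoff:
            skewed += 1
    return skewed / days

print(share_of_skewed_days(45))  # larger hospital: about 12% of days
print(share_of_skewed_days(15))  # smaller hospital: about 30% of days
```

The smaller hospital strays from the 50-50 split far more often, which is exactly the point of the demonstration.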
Try this demonstration with a couple of your friends. As you’ll see, it’s easy to find people who choose the incorrect option (“both equal”), underlining just how often people seem to be insensitive to considerations of sample size.
Demonstration 12.2: Relying on the Representativeness Heuristic
Demonstration 12.1 indicated that people often neglect (or misunderstand the meaning of) sample size. In other cases, people rely on heuristics that are not in any way guided by logic, so their conclusion ends up being quite illogical. For example, here is another classic problem from research on judgment:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and she also participated in anti-nuclear demonstrations.
Which of the following is more likely to be true?
a. Linda is a bank teller.
b. Linda is a bank teller and is active in the feminist movement.
What’s your response? In many studies, a clear majority of participants (sometimes as high as 85%) choose option (b). Logically, though, this makes no sense. If Linda is a feminist bank teller, then she is still a bank teller. Therefore, there’s no way for option (b) to be true without option (a) also being true. Therefore, option (b) couldn’t possibly be more likely than option (a)! Choosing option (b), in other words, is akin to saying that if we randomly choose someone who lives in North America, the chance of that person being from Vermont is greater than the chance of that person being from the United States. Why, therefore, do so many people choose option (b)? This option makes sense if people are relying on the representativeness heuristic. In that case, they make the category judgment by asking themselves: “How much does Linda resemble my idea of a bank teller? How much does she resemble my idea of a feminist bank teller?” On this basis, they could easily be led to option (b), because the description of Linda does, in fact, encourage a particular view of her and her politics.
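One way to see why option (b) cannot be more likely is to think in terms of counts. The sketch below uses a tiny, made-up population (the names and numbers are ours, purely for illustration): every feminist bank teller is also a bank teller, so the conjunction can never outnumber, and therefore can never be more probable than, the single category.

```python
# A made-up population of five people, just to illustrate the conjunction rule.
people = [
    {"bank_teller": True,  "feminist": True},
    {"bank_teller": True,  "feminist": False},
    {"bank_teller": False, "feminist": True},
    {"bank_teller": False, "feminist": False},
    {"bank_teller": True,  "feminist": True},
]
tellers = sum(p["bank_teller"] for p in people)
feminist_tellers = sum(p["bank_teller"] and p["feminist"] for p in people)
print(tellers, feminist_tellers)    # 3 and 2: the conjunction is a subset
assert feminist_tellers <= tellers  # holds for any population whatsoever
```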
There is, however, another possibility. With options (a) and (b) sitting side-by-side, someone might say: “Well, if option (b) is talking about a bank teller who is a feminist, then option (a) must be talking about a bank teller who is not a feminist.” On that interpretation, choosing option (b) does seem reasonable. Is this how you interpreted option (a)?
You might spend a moment thinking about how to test this alternative interpretation-the idea that research participants interpret option (a) in this narrowed fashion. One strategy is to present option (a) to some participants and ask them how likely it is, and to present option (b) to other participants and ask them how likely it is. In this way, the options are never put side by side, so there’s never any implied contrast in the options. In this situation, then, there’s no reason at all for participants to interpret option (a) in the narrowed fashion. Even so, in studies using this alternative procedure, the group of participants seeing option (a) still rated it as less likely than the other group of participants rated the option they saw. Again, this makes no sense from the standpoint of logic, but it makes perfect sense if participants are using the representativeness heuristic.
Detecting Covariation
It can hardly be surprising that people often rely on mental shortcuts. After all, you don’t have unlimited time, and many of the judgments you make, day by day, are far from life-changing. It’s unsettling, though, that people use the same shortcuts when making deeply consequential judgments. And to make things worse, the errors caused by the heuristics can trigger other sorts of errors, including errors in judgments of covariation. This term has a technical meaning, but for our purposes we can define it this way: X and Y “covary” if X tends to be on the scene whenever Y is, and if X tends to be absent whenever Y is absent. For example, exercise and stamina covary: People who do the first tend to have a lot of the second. Years of education and annual salary also covary (and so people with more education tend to earn more), but the covariation is weaker than that between exercise and stamina. Notice, then, that covariation can be strong or weak, and it can also be negative or positive. Exercise and stamina, for example, covary positively (as exercise increases, so does stamina). Exercise and risk of heart attacks covary negatively (because exercise strengthens the heart muscle, decreasing the risk).
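A quick way to make the strong/weak and positive/negative distinctions concrete is to compute a correlation coefficient, one standard (though not the only) measure of covariation. The numbers below are invented solely for illustration.

```python
from statistics import correlation  # requires Python 3.10+

# Invented data: eight people, hours of weekly exercise plus two outcomes.
exercise_hours = [0, 1, 2, 3, 4, 5, 6, 7]
stamina_score  = [2, 3, 3, 5, 6, 6, 8, 9]   # rises with exercise
heart_risk     = [9, 8, 8, 6, 5, 5, 3, 2]   # falls with exercise

print(correlation(exercise_hours, stamina_score))  # near +1: strong, positive covariation
print(correlation(exercise_hours, heart_risk))     # near -1: strong, negative covariation
```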
Covariation is important for many reasons-including the fact that it’s what you need to consider when checking on a belief about cause and effect. For example, do you feel better on days when you eat a good breakfast? If so, then the presence or absence of a good breakfast in the morning should covary with how you feel as the day wears on. Similarly: Are you more likely to fall in love with someone tall? Does your car start more easily if you pump the gas pedal? These, too, are questions that hinge on covariation, leading us to ask: How accurately do people judge covariation?
Illusions of Covariation
People routinely “detect” covariation even where there is none. For example, many people are convinced there’s a relationship between someone’s astrological sign (e.g., whether the person is a Libra or a Virgo) and their personality, yet no serious study has documented this covariation. Likewise, many people believe they can predict the weather by paying attention to their arthritis pain (“My knee always acts up when a storm is coming”). This belief, too, turns out to be groundless. Other examples concern social stereotypes (e.g., the idea that being “moody” covaries with gender), superstitions (e.g., the idea that Friday the 13th brings bad luck), and more. (For some of the evidence, see King & Koehler, 2000; Redelmeier & Tversky, 1996; Shaklee & Mims, 1982.)
What causes illusions like these? One reason, which we’ve known about for years, is centered on the evidence people consider when judging covariation: In making these judgments, people seem to consider only a subset of the facts, and it’s a subset skewed by their prior expectations (Baron, 1988; Evans, 1989; Gilovich, 1991; Jennings, Amabile, & Ross, 1982). This virtually guarantees mistaken judgments, since even if the judgment process were 100% fair, a biased input would lead to a biased output.
Specifically, when judging covariation, your selection of evidence is likely to be guided by confirmation bias-a tendency to be more alert to evidence that confirms your beliefs rather than to evidence that might challenge them (Nisbett & Ross, 1980; Tweney, Doherty, & Mynatt, 1981). We’ll have more to say about confirmation bias later, but for now let’s note how confirmation bias can distort the assessment of covariation. Let’s say, for example, that you have the belief that big dogs tend to be vicious. With this belief, you’re more likely to notice big dogs that are, in fact, vicious and little dogs that are friendly. As a result, a biased sample of dogs is available to you, in the dogs you perceive and the dogs you remember. Therefore, if you try to estimate covariation between dog size and temperament, you’ll get it wrong. This isn’t because you’re ignoring the facts. The problem instead lies in your “data”; if the data are biased, so will be your judgment.
Base Rates
Assessment of covariation can also be pulled off track by another problem: neglect of base-rate information-information about how frequently something occurs in general. Imagine that we’re testing a new drug in the hope that it will cure the common cold. Here, we’re trying to find out if taking the drug covaries with a better medical outcome, and let’s say that our study tells us that 70% of patients taking the drug recover from their illness within 48 hours. This result is uninterpretable on its own, because we need the base rate: We need to know how often in general people recover from their colds in the same time span. If it turns out that the overall recovery rate within 48 hours is 70%, then our new drug is having no effect whatsoever.
Similarly, do good-luck charms help? Let’s say that you wear your lucky socks whenever your favorite team plays, and the team has won 85% of its games. Here, too, we need to ask about base rates: How many games has your team won over the last few years? Perhaps the team has won 90% overall. In that case, your socks are actually a jinx.
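The comparison the text is describing can be written out in a line or two of arithmetic. The sketch below is ours; the 90% team win rate is the hypothetical figure from the example.

```python
def effect_vs_base_rate(rate_with_factor, base_rate):
    """How much better (or worse) the outcome is when the factor is present,
    compared with how often the outcome happens in general."""
    return rate_with_factor - base_rate

# Cold drug: 70% recovery with the drug versus a 70% recovery rate overall.
print(round(effect_vs_base_rate(0.70, 0.70), 2))  # 0.0 -> the drug is doing nothing

# Lucky socks: 85% wins on sock days versus a 90% win rate overall.
print(round(effect_vs_base_rate(0.85, 0.90), 2))  # -0.05 -> if anything, a jinx
```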
Despite the importance of base rates, people often ignore them. In a classic study, Kahneman and Tversky (1973) asked participants this question: If someone is chosen at random from a group of 70 lawyers and 30 engineers, what is his profession likely to be? Participants understood perfectly well that in this setting the probability of the person being a lawyer is .70. Apparently, in some settings people are appropriately sensitive to base-rate information. Other participants did a similar task, but they were given the same base rates and also brief descriptions of certain individuals. Based on this information, they were asked whether each individual was more likely to be a lawyer or an engineer. These descriptions provide diagnostic information-information about the particular case-and some of the descriptions had been crafted (based on common stereotypes) to suggest that the person was a lawyer; some suggested an engineer; some were relatively neutral.
Participants understood the value of these descriptions and-as we’ve just seen-also seem to understand the value of base rates: They’re responsive to base-rate information if this is the only information they have. When given both types of information, therefore, we should expect that the participants will combine these inputs as well as they can. If both the base rate and the diagnostic information favor the lawyer response, participants should offer this response with confidence. If the base rate indicates one response and the diagnostic information the other response, participants should temper their estimates accordingly.
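What would “combining these inputs” actually look like? One standard answer (not spelled out in the chapter) is Bayes’ rule. The sketch below assumes an invented diagnostic strength: a description judged three times more likely to fit an engineer than a lawyer.

```python
def prob_engineer(base_rate_engineer, likelihood_ratio):
    """P(engineer | description) via Bayes' rule, in odds form.
    likelihood_ratio = how many times more likely the description is for an
    engineer than for a lawyer (3.0 here is an assumed, illustrative value)."""
    prior_odds = base_rate_engineer / (1 - base_rate_engineer)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# The same description, paired with the two base rates used in the study:
print(round(prob_engineer(0.30, 3.0), 2))  # 0.56 when only 30 of 100 are engineers
print(round(prob_engineer(0.70, 3.0), 2))  # 0.88 when 70 of 100 are engineers
```

A normative judge, in other words, should give noticeably different answers when the base rates are flipped.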
However, this isn’t what participants did. When provided with both types of information, they relied only on the descriptive information about the individual. In fact, when given both the base rate and diagnostic information, participants’ responses were the same whether the base rates were as already described (70 lawyers, 30 engineers) or reversed (30 lawyers, 70 engineers). This reversal had no impact on participants’ judgments, confirming that they were indeed ignoring the base rates. What produces this neglect of base rates? The answer, in part, is attribute substitution. When asked whether a particular person-Tom, let’s say-is a lawyer or an engineer, people seem to turn this question about category membership into a question about resemblance. (In other words, they rely on the representativeness heuristic.) Therefore, to ask whether Tom is a lawyer, they ask themselves how much Tom resembles (their idea of) a lawyer. This substitution is (as we’ve discussed) often helpful, but the strategy provides no role for base rates-and this guarantees that people will routinely ignore base rates. Consistent with this claim, base-rate neglect is widespread and can be observed both in laboratory tasks and in many real-world judgments. (For some indications, though, of when people do take base rates into account, see Griffin et al., 2012; Klayman & Brown, 1993; Pennycook, Trippas, Handley, & Thompson, 2014.)
Demonstration 12.3: “Man who” Arguments
The chapter suggests that people often rely on “man who” arguments. For example: “It’s crazy to think Japanese cars are reliable. I have a friend who owns a Toyota, and she’s had problem after problem with it, starting with the week she bought it!”
But are “man who” arguments really common? As an exercise, for the next week try to be on the lookout for these arguments. Bear in mind that there are many variations on this form: “I know a team that barely practices, and they win almost all their games.” “I have a classmate who parties every Friday, and he’s doing great in school, so why should I stay home on Friday nights?” Of course, what these variations have in common is that they draw a conclusion based on just one case. How often can you detect any of these variations in your day-to-day conversations or in things you read online?
One more question: Once you’re alert to “man who” arguments, and noticing them when they come into your view, does that help you to be on guard against such arguments? You might try responding, each time you encounter one of these arguments, with the simple assertion, “That’s just one single case; maybe it’s not in any way typical of the broader pattern.”
Demonstration 12.4: Applying Base Rates
Chapter 12 documents many errors in judgment, and it is deeply troubling that these errors can be observed even when knowledgeable experts are making judgments about domains that are enormously consequential. As an illustration, consider the following scenario.
Imagine that someone you care about-let’s call her Julia, age 42-is worried that she might have breast cancer. In thinking about Julia’s case, we might start by asking: How common is breast cancer for women of Julia’s age, with her family history, her dietary pattern, and so on? Let’s assume that for this group the statistics show an overall 3% likelihood of developing breast cancer. This should be reassuring to Julia, because there is a 97% chance that she is cancer free.
Of course, a 3% chance is still scary for this disease, so Julia decides to get a mammogram. When her results come back, the report is bad-indicating that she does have breast cancer. Julia quickly does some research to find out how accurate mammograms are, and she learns that the available data are something like this:
                           Mammogram indicates
                           Cancer      No cancer
Cancer actually present     85%         15%
Cancer actually absent      10%         90%
(We emphasize that these are fictitious numbers, created for this exercise. Even so, the reality is that mammograms are reasonably accurate, in the way shown.)
In light of all this information, what is your best estimate of the chance that Julia does, in fact, have breast cancer? She comes from a group that only has a 3% risk for cancer, but she’s gotten an abnormal mammogram result, and the test seems, according to her research, accurate. What should we conclude? Think about this for a few moments, and before reading on, estimate the percentage chance of Julia having breast cancer.
When medical doctors are asked questions like these, their answers are often wildly inaccurate, because they (like most people) fail to use base-rate information correctly. What was your estimate of the percentage chance of Julia having breast cancer? The correct answer is 20%. This is an awful number, given what’s at stake, and Julia would surely want to pursue further tests. But the odds are still heavily in Julia’s favor, with a 4-to-1 chance that she is entirely free of cancer.
Where does this answer come from? Let’s create a table using actual counts rather than the percentages shown in the previous table. Go get a piece of paper, and set up a table like this:
                           Mammogram indicates
                           Cancer      No cancer      Total number
Cancer actually present
Cancer actually absent
Let’s imagine that we’re considering 100,000 women with medical histories similar to Julia’s. We have already said that overall there’s a 3% chance of breast cancer in this group, and so 3,000 (3% of 100,000) of these women will have breast cancer. Fill that number in as the top number in the “Total number” column, and this will leave the rest of the overall group (97,000) as the bottom number in this column.
Now, let’s fill in the rows. There are 3,000 women counted in the top row, and we’ve already said that in this group the mammogram will (correctly) indicate that cancer is present in 85% of the cases. So the number for “Mammogram indicates cancer” in the top row will be 85% of the total in this row (3,000), or 2,550. The number of cases for “Mammogram indicates no cancer” in this row will be the remaining 15% of the 3,000, so let’s fill in that number-450.
Let’s now do the same for the bottom row. We’ve already said that there are 97,000 women represented in this row; of these, the mammogram will correctly indicate no cancer for 90% (87,300) and will falsely indicate cancer for 10% (9,700). Let’s now put those numbers in the appropriate positions.
Finally, let’s put these pieces together. According to our numbers, a total of 12,250 women will receive the horrid information that they have breast cancer. (That’s the total of the two numbers, 2,550 + 9,700, in the left column, “Mammogram indicates cancer.”) Within this group, this test result will be correct for 2,550 women (left column, top row). The result will be misleading for the remaining 9,700 (left column, bottom row). Therefore, of the 12,250 women receiving awful news from their mammogram, 2,550, or 20%, will actually have breast cancer; the remaining 80% will be cancer free.
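The same arithmetic can be written as a short calculation. This is just a restatement of the demonstration’s numbers (which, remember, are fictitious), not a clinical estimate.

```python
# Bayes-style count calculation with the demonstration's fictitious numbers.
population = 100_000
base_rate = 0.03          # 3% of women in this group actually have cancer
hit_rate = 0.85           # mammogram flags cancer when it is present
false_alarm_rate = 0.10   # mammogram flags cancer when it is absent

with_cancer = population * base_rate                 # 3,000
without_cancer = population - with_cancer            # 97,000

true_positives = with_cancer * hit_rate              # 2,550
false_positives = without_cancer * false_alarm_rate  # 9,700

flagged = true_positives + false_positives           # 12,250
print(true_positives / flagged)                      # about 0.208, i.e., roughly 20%
```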
Notice, then, that the mammogram is wrong far more often than it’s right. This isn’t because the mammogram is an inaccurate test. In fact, the test is rather accurate. However, if the test is used with patient groups for which the base rate is low, then false alarms will swamp correct detections: the mammogram might be wrong in only 10% of the cancer-free cases, but this will be 10% of a large number, producing a substantial number of horrifying false alarms. This is obviously a consequential example, because we’re discussing a disease that is lethal in many cases. It is therefore deeply troubling that even in this very important example, people still make errors of judgment. Worse, it’s striking that experienced physicians, when asked the same questions, also make errors-they, too, ignore the base rates and therefore give risk estimates that are off by a very wide margin.
At the same time, because this is a consequential example, let’s add some caution to these points. First, if a woman has a different background from Julia (our hypothetical patient), her overall risk for breast cancer may differ from Julia’s. In other words, the base rate for her group may be higher or lower (depending on the woman’s age, exposure to certain toxins, family history, and other factors), and this will have a huge impact on the calculations we’ve discussed here. Therefore, we cannot freely generalize from the numbers considered here to other cases; we would need to know the base rate for these other cases.
Second, even if Julia’s risk is 20%, this is still a high number, so Julia (or anyone in this situation) might pursue treatment for this life-threatening illness. A 1-in-5 chance of having a deadly disease must be taken seriously! However, this doesn’t change the fact that a 20% risk is very different from the 85% risk that one might fear if one considered only the mammogram results in isolation from the base rates. At 20%, the odds are good that Julia is safe; at 85%, she probably does have this disease. It seems certain that this is a difference that would matter for Julia’s subsequent steps, and it reminds us that medical decision making needs to be guided by full information-including, it seems, information about base rates.
Dual-Process Models
We seem to be painting a grim portrait of human judgment, and we can document errors even among experts-financial managers making large investments (e.g., Hilton, 2003; Kahneman, 2011) and physicians diagnosing cancer (but ignoring base rates; Eddy, 1982; also see Koehler, Brenner, & Griffin, 2002). The errors occur even when people are strongly motivated to be careful, with clear instructions and financial rewards offered for good performance (Arkes, 1991; Gilovich, 1991; Hertwig & Ortmann, 2003).
Could it be, then, that human judgment is fundamentally flawed? If so, this might explain why people are so ready to believe in telepathy, astrology, and a variety of bogus cures (Gilovich, 1991; King & Koehler, 2000). In fact, maybe these points help us understand why warfare, racism, neglect of poverty, and environmental destruction are so widespread; maybe these problems are the inevitable outcome of people’s inability to understand facts and to draw decent conclusions.
Before we make these claims, however, let’s acknowledge another side to our story: Sometimes human judgment rises above the heuristics we’ve described so far. People often rely on availability in judging frequency, but sometimes they seek other (more accurate) bases for making their judgments (Oppenheimer, 2004; Schwarz, 1998; Winkielman & Schwarz, 2001). Likewise, people often rely on the representativeness heuristic, and so (among other concerns) they draw conclusions from “man who” stories. But in other settings people are keenly sensitive to sample size, and they draw no conclusions if their sample is small or possibly biased. (For an early statement of this point, see Nisbett, Krantz, Jepson, & Kunda, 1983; for more recent discussion, see Kahneman, 2011.) How can we make sense of this mixed pattern?
Ways of Thinking: Type 1, Type 2
A number of authors have proposed that people have two distinct ways of thinking. One type of thinking is fast and easy; the heuristics we’ve described fall into this category. The other type is slower and more effortful, but also more accurate.
Researchers have offered various versions of this dual-process model, and different theorists use different terminology (Evans, 2006, 2012a; Ferreira, Garcia-Marques, Sherman, & Sherman, 2006; Kahneman, 2011; Pretz, 2008; Shafir & LeBoeuf, 2002). We’ll rely on rather neutral terms (initially proposed by Stanovich and West, 2000; Stanovich, 2012), so we’ll use Type 1 as the label for the fast, easy sort of thinking and Type 2 as the label for the slower, more effortful thinking. (Also see Figure 12.2.)
When do people use one type of thinking or the other? One hypothesis is that people choose when to rely on each system; presumably, they shift to the more accurate Type 2 when making judgments that really matter. As we’ve seen, however, people rely on Type 1 heuristics even when incentives are offered for accuracy, even when making important professional judgments, even when making medical diagnoses that may literally be matters of life and death. Surely people would choose to use Type 2 in these cases if they could, yet they still rely on Type 1 and fall into error. On these grounds, it’s difficult to argue that using Type 2 is a matter of deliberate choice.
Instead, evidence suggests that Type 2 is likely to come into play only if it’s triggered by certain cues and only if the circumstances are right. We’ve suggested, for example, that Type 2 judgments are slower than Type 1, and on this basis it’s not surprising that heuristic-based judgments (and, thus, heuristic-based errors) are more likely when judgments are made under time pressure (Finucane, Alhakami, Slovic, & Johnson, 2000). We’ve also said that Type 2 judgments require effort, so this form of thinking is more likely if the person can focus attention on the judgment being made (De Neys, 2006; Ferreira et al., 2006; for some complexity, though, see Chun & Kruglanski, 2006).
Triggers for Skilled Intuition
We need to be clear, though, that we cannot equate Type 1 thinking with “bad” or “sloppy” thinking, because fast-and-efficient thinking can be quite sophisticated if the environment contains the “right sort” of triggers. Consider base-rate neglect. We’ve already said that people often ignore base rates and, as a result, misinterpret the evidence they encounter. But sensitivity to base rates can also be demonstrated, even in cases involving Type 1 thinking (e.g., Pennycook et al., 2014). This mixed pattern is attributable, in part, to how the base rates are presented. Base-rate neglect is more likely if the relevant information is cast in terms of probabilities or proportions: “There is a .01 chance that people like Mary will have this disease”; “Only 5% of the people in this group are lawyers.” But base-rate information can also be conveyed in terms of frequencies, and it turns out that people often use the base rates if they’re conveyed in this way. For example, people are more alert to a base rate phrased as “12 out of every 1,000 cases” than they are to the same information cast as a percentage (1.2%) or a probability (.012). (See Gigerenzer & Hoffrage, 1995; also Brase, 2008; Cosmides & Tooby, 1996.) It seems, then, that much depends on how the problem is presented, with some presentations being more “user friendly” than others. (For more on the circumstances in which Type 1 thinking can be rather sophisticated, see Gigerenzer & Gaissmaier, 2011; Kahneman & Klein, 2009; Oaksford & Hall, 2016.)
The Role for Chance
Fast-but-accurate judgments are also more likely if the role of random chance is conspicuous in a problem. If this role is prominent, people are more likely to realize that the “evidence” they’re considering may just be a fluke or an accident, not an indication of a reliable pattern. With this, people are more likely to pay attention to the quantity of evidence, on the (sensible) idea that a larger set of observations is less vulnerable to chance fluctuations.
In one study, participants were asked about someone who evaluated a restaurant based on just one meal (Nisbett, Krantz, Jepson, & Kunda, 1983). This is, of course, a weak basis for judging the restaurant: If the diner had a great meal, maybe he was lucky and selected by chance the one entrée the chef knew how to cook. If the meal was lousy, maybe the diner happened to choose the weakest option on the menu. With an eye on these possibilities, we should be cautious in evaluating the diner’s report, based on his limited experience of just one dinner.
In one condition of the study, participants were told that the diner chose his entrée by blindly dropping a pencil onto the menu. This cue helped the participants realize that a different sample, and perhaps different views of the restaurant, might have emerged if the pencil had fallen on a different selection. As a result, these participants were appropriately cautious about the diner’s assessment based on just a single meal. (Also see Gigerenzer, 1991; Gigerenzer, Hell, & Blank, 1988; Tversky & Kahneman, 1982.)
Education
The quality of a person’s thinking is also shaped by the background knowledge that she or he brings to a judgment; Figure 12.3 provides an illustration of this pattern (after Nisbett et al., 1983). In addition, a person’s quality of thinking is influenced by education. For example, Fong, Krantz, & Nisbett (1986) conducted a telephone survey of “opinions about sports,” calling students who were taking an undergraduate course in statistics. Half of the students were contacted during the first week of the semester; half were contacted during the last week.
In their course, these students had learned about the importance of sample size. They’d been reminded that accidents do happen, but that accidents don’t keep happening over and over. Therefore, a pattern visible in a small sample of data might be the result of some accident, but a pattern evident in a large sample probably isn’t. Consequently, large samples are more reliable, more trustworthy, than small samples.
This classroom training had a broad impact. In the phone interview (which was-as far as the students knew-not in any way connected to their course), one of the questions involved a comparison between how well a baseball player did in his first year and how well he did in the rest of his career. This is essentially a question about sample size (with the first year being just a sample of the player’s overall performance). Did the students realize that sample size was relevant here? For those contacted early in the term, only 16% gave answers that showed any consideration of sample size. For those contacted later, the number of answers influenced by sample size more than doubled (to 37%).
It seems, then, that how well people think about evidence can be improved, and the improvement applies to problems in new domains and new contexts. Training in statistics, it appears, can have widespread benefits. (For more on education effects, see Ferreira et al., 2006; Gigerenzer, Gaissmaier, Kurz-Milcke, Schwartz, & Woloshin, 2008; Lehman & Nisbett, 1990.)
The Cognitive Reflection Test
Even with education, some people make judgment errors all the time, and part of the explanation is suggested by the Cognitive Reflection Test (CRT). This test includes just three questions (Figure 12.4), and for each one, there is an obvious answer that turns out to be wrong. To do well on the test, therefore, you need to resist the obvious answer and instead spend a moment reflecting on the question; if you do, the correct answer is readily available.
Many people perform poorly on the CRT, even when we test students at elite universities (Frederick, 2005). People who do well on the CRT, in contrast, are people who in general are more likely to rely on Type 2 thinking-and therefore likely to avoid the errors we’ve described in this chapter. In fact, people with higher CRT scores tend to have better scientific understanding, show greater skepticism about paranormal abilities, and even seem more analytic in their moral decisions (Baron, Scott, Fincher, & Metz, 2015; Pennycook, Cheyne, Koehler, & Fugelsang, 2016; Travers, Rolison, & Feeney, 2016). Let’s be clear, though, that no one is immune to the errors we’ve been discussing, but the risk of error does seem lower in people who score well on the CRT.
Demonstration 12.5: Frequencies Versus Percentages
This chapter argues that we can improve people’s judgments by presenting evidence to them in the right way. To see how this plays out, recruit a few friends. Ask some of them Question 1, and some Question 2:
1. Mr. Jones is a patient in a psychiatric hospital, and he has a history of violence. However, the time has come to consider discharging Mr. Jones from the hospital. He is therefore evaluated by several experts at the hospital, and they conclude: Patients with Mr. Jones’s profile are estimated to have a 10% probability of committing an act of violence against others during the first several months after discharge. How comfortable would you be in releasing Mr. Jones?
1 2 3 4 5 6 7
No way! Keep him in the hospital. Yes, he is certainly ready for discharge.
2. Mr. Jones is a patient in a psychiatric hospital, and he has a history of violence. However, the time has come to consider discharging Mr. Jones from the hospital. He is therefore evaluated by several experts at the hospital, and they conclude: Of every 100 patients similar to Mr. Jones, 10 are estimated to commit an act of violence against others during the first several months after discharge. How comfortable would you be in releasing Mr. Jones?
1 2 3 4 5 6 7
No way! Keep him in the hospital. Yes, he is certainly ready for discharge.
These two questions provide the same information (10% = 10 out of 100), but do your friends react in the same way? When experienced forensic psychologists were asked these questions, 41% of them denied the discharge when they saw the data in frequency format (10 out of 100), and only 21% denied the discharge when they saw the same data in percentage format (a 10% probability).
Of course, there’s room for debate about what the “right answer” is in this case. Therefore, we cannot conclude from this example that a frequency format improves reasoning. (Other evidence, though, does confirm this important point.) But this example does make clear that a change in format matters-with plainly different outcomes when information is presented as a frequency, rather than as a percentage.
Demonstration 12.6: Cognitive Reflection
People make many errors in judgment and reasoning. Are some people, however, more likely to make these errors? One line of evidence, discussed in the chapter, comes from the Cognitive Reflection Test (CRT). For each of the three questions on the test, there’s an obvious and intuitive answer that happens to be wrong. What the test really measures, therefore, is whether someone is inclined to quickly give that intuitive answer, or whether the person is instead inclined to pause and give a more reflective answer.
The CRT is widely used, and the questions are well publicized (in the media, on the Internet). If someone has seen the questions in one of these other settings, then the test loses all validity. To address this concern, some researchers have tried to develop variations on the CRT-still seeking to measure cognitive reflection, but using questions that may be less familiar.
Here are some examples; what is your answer to each question?
1. If you’re running a race and you pass the person in second place, what place are you in?
2. A farmer had 15 sheep, and all but 8 died. How many are left?
3. Emily’s father has three daughters. The first two are named April and May. What is the third daughter’s name?
4. How many cubic feet of dirt are there in a hole that is 3 feet deep, 3 feet wide, and 3 feet long?
Answer the questions before you read further.
For question 1, the intuitive answer is that you’re now in first place. The correct answer is that you’re actually in second place.
For question 2, the intuitive answer is 7. The correct answer is 8 (“all but 8 died”).
For question 3, the intuitive answer is June. But note that we’re talking about Emily’s father, so apparently Emily is the name of the third daughter!
For question 4, the intuitive answer is 27. But, of course, what makes a hole a hole is that all of the dirt has been removed from it. Therefore, the correct answer is “none.”
Even if you got these questions right, you probably felt the “tug” toward the obvious-but-incorrect answer. This tug is exactly what the test is trying to measure-by determining how often you give in to the tug!
Confirmation and Disconfirmation
In this chapter so far, we’ve been looking at a type of thinking that requires induction-the process through which you make forecasts about new cases, based on cases you’ve observed so far. Just as important, though, is deduction-a process in which you start with claims or assertions that you count as “given” and ask what follows from these premises. For example, perhaps you’re already convinced that red wine gives you headaches or that relationships based on physical attraction rarely last. You might want to ask: What follows from this? What implications do these claims have for your other beliefs or actions?
Deduction has many functions, including the fact that it helps keep your beliefs in touch with reality. After all, if deduction leads you to a prediction based on your beliefs and the prediction turns out to be wrong, this indicates that something is off track in your beliefs-so that claims you thought were solidly established aren’t so solid after all.
Does human reasoning respect this principle? If you encounter evidence confirming your beliefs, does this strengthen your convictions? If evidence challenging your beliefs should come your way, do you adjust?
Confirmation Bias
It seems sensible that in evaluating any belief, you’d want to take a balanced approach-considering evidence that supports the belief, and weighing that information against other evidence that might challenge the belief. And, in fact, evidence that challenges you is especially valuable; many authors argue that this type of evidence is more informative than evidence that seems to support you. (For the classic statement of this position, see Popper, 1934.) There’s a substantial gap, however, between these suggestions about what people should do and what they actually do. Specifically, people routinely display a pattern we’ve already mentioned, confirmation bias: a greater sensitivity to confirming evidence and a tendency to neglect disconfirming evidence. Let’s emphasize, however, that this is an “umbrella” term, because confirmation bias can take many different forms (see Figure 12.5). What all the forms have in common is the tendency to protect your beliefs from challenge. (See, among others, Gilovich, 1991; Kassin, Bogart, & Kerner, 2012; Schulz-Hardt, Frey, Lüthgens, & Moscovici, 2000; Stangor & McMillan, 1992.)
In an early demonstration of confirmation bias, Wason (1966, 1968) presented research participants with a series of numbers, such as “2, 4, 6.” The participants were told that this trio of numbers conformed to a specific rule, and their task was to figure out the rule. Participants were allowed to propose their own trios of numbers (“Does ‘8, 10, 12’ follow the rule?”), and in each case the experimenter responded appropriately (“Yes, it follows the rule” or “No, it doesn’t”). Then, once participants were satisfied that they had discovered the rule, they announced their “discovery.”
The rule was actually quite simple: The three numbers had to be in ascending order. For example, “1, 3, 5” follows the rule, but “6, 4, 2” does not, and neither does “10, 10, 10.” Despite this simplicity, participants had difficulty discovering the rule, often requiring many minutes. This was largely due to the type of information they requested as they tried to evaluate their hypotheses: To an overwhelming extent, they sought to confirm the rules they had proposed; requests for disconfirmation were relatively rare. And it’s noteworthy that those few participants who did seek out disconfirmation for their hypotheses were more likely to discover the rule. It seems, then, that confirmation bias was strongly present in this experiment and interfered with performance.
Reinterpreting Disconfirming Evidence
Here’s a different manifestation of confirmation bias. When people encounter information consistent with their beliefs, they’re likely to take the evidence at face value, accepting it without challenge or question. In contrast, when people encounter evidence that’s inconsistent with their beliefs, they’re often skeptical and scrutinize this new evidence, seeking flaws or ambiguities.
One study examined gamblers who bet on professional football games (Gilovich, 1983; see also Gilovich, 1991; Gilovich & Douglas, 1986). These people all believed they had good strategies for picking winning teams, and their faith in these strategies was undiminished by a series of losses. Why is this? It’s because the gamblers didn’t understand their losses as “losses.” Instead, they remembered them as flukes or oddball coincidences: “I was right. New York was going to win if it hadn’t been for that crazy injury to their running back”; “I was correct in picking St. Louis. They would have won except for that goofy bounce the ball took after the kickoff.” In this way, winning bets were remembered as wins; losing bets were remembered as “near wins.” No wonder, then, that the gamblers maintained their views despite the contrary evidence provided by their empty wallets.
Belief Perseverance
Even when disconfirming evidence is undeniable, people sometimes don’t use it, leading to a phenomenon called belief perseverance. Participants in a classic study were asked to read a series of suicide notes; their task was to figure out which notes were authentic,