Introduction to Hypothesis Testing
✪ A Hypothesis-Testing Example 108
✪ The Core Logic of Hypothesis Testing 109
✪ The Hypothesis-Testing Process 110
✪ One-Tailed and Two-Tailed Hypothesis Tests 119
✪ Controversy: Should Significance Tests Be Banned? 124
In this chapter, we introduce the crucial topic of hypothesis testing. A hypothesis is a prediction intended to be tested in a research study. The prediction may be based on informal observation (as in clinical or applied settings regarding a possible practical innovation), on related results of previous studies, or on a broader theory about what is being studied. You can think of a theory as a set of principles that attempt to explain an important psychological process. A theory usually leads to various specific hypotheses that can be tested in research studies.
This chapter focuses on the basic logic for analyzing results of a research study to test a hypothesis. The central theme of hypothesis testing has to do with the important distinction between sample and population discussed in the last chapter: hypothesis testing is a systematic procedure for deciding whether the results of a research study, which examines a sample, support a hypothesis that applies to a population. Hypothesis testing is the central theme in all the remaining chapters of this book, as it is in most research in psychology and related fields.
Many students find the most difficult part of the course to be mastering the basic logic of this chapter and the next two. This chapter in particular requires some mental gymnastics. Even if you follow everything the first time through, you will be wise to
✪ Hypothesis Tests in Research Articles 127
✪ Summary 128
✪ Key Terms 129
✪ Example Worked-Out Problems 129
✪ Practice Problems 131
✪ Chapter Notes 136
theory set of principles that attempt to explain one or more facts, relationships, or events; psychologists often derive specific predictions from theories that are then tested in research studies.
hypothesis prediction, often based on informal observation, previous research, or theory, that is tested in a research study.
hypothesis testing procedure for deciding whether the outcome of a study (results for a sample) supports a particular theory or practical innovation (which is thought to apply to a population).
Statistics for Psychology, Fifth Edition, by Arthur Aron, Elaine N. Aron, and Elliot J. Coups. Published by Prentice Hall. Copyright © 2009 by Pearson Education, Inc.
108 Chapter 4
review the chapter thoroughly. Hypothesis testing involves grasping ideas that make little sense when covered separately; so in this chapter you learn several new ideas all at once. However, once you understand the material in this chapter and the two that follow, your mind will be used to this sort of thing, and the rest of the course should seem easier.
At the same time, we have kept this introduction to hypothesis testing as simple as possible, putting off what we could for later chapters. For example, real-life psychology research involves samples of many individuals. However, to minimize how much you have to learn at one time, this chapter’s examples are about studies in which the sample is a single individual. To do this, we use some odd examples. Just remember that you are building a foundation that will, by Chapter 7, prepare you to understand hypothesis testing as it is actually done in real research.
A Hypothesis-Testing Example

Here is our first necessarily odd example that we made up to keep this introduction to hypothesis testing as straightforward as possible. A large research project has been going on for several years. In this project, new babies are given a particular vitamin and then the research team follows their development during the first 2 years of life. So far, the vitamin has not speeded up the development of the babies. The ages at which these and all other babies start to walk are shown in Figure 4–1. The mean is 14 months (μ = 14), the standard deviation is 3 months (σ = 3), and the ages follow a normal curve. Based on the normal curve percentages, you can figure that fewer than 2% of babies start walking before 8 months of age; these are the babies who are more than 2 standard deviations below the mean. [This fictional distribution is close to the true distribution psychologists have found for European babies, although that true distribution is slightly skewed to the right (Hindley et al., 1966).]
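The "about 2%" figure comes straight from the normal curve. If you want to check it numerically rather than from a normal curve table, here is a sketch using Python's standard-library NormalDist (our choice of tool for illustration; the book itself works from tables):

```python
from statistics import NormalDist

# Fictional walking-age distribution from the chapter:
# mean 14 months, standard deviation 3 months.
walking_ages = NormalDist(mu=14, sigma=3)

# Proportion of babies expected to start walking before 8 months,
# i.e., 2 or more standard deviations below the mean.
p_before_8 = walking_ages.cdf(8)
print(f"P(walking age < 8 months) = {p_before_8:.4f}")  # about 0.023
```

The exact value for 2 standard deviations below the mean is about 2.3%, which normal curve tables and the chapter treat as roughly 2%.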
One of the researchers working on the project has an idea. If the vitamin the babies are taking could be more highly refined, perhaps its effect would be dramatically increased: babies taking the highly purified version should start walking much earlier than other babies. (We will assume that the purification process could not possibly make the vitamin harmful.) However, refining the vitamin in this way is extremely expensive for each dose; so the research team decides to try the procedure with just enough purified doses for one baby. A newborn in the project is then randomly selected to take the highly purified version of the vitamin, and the researchers then
Figure 4–1 Distribution of when babies begin to walk (fictional data): a normal curve with mean μ = 14 months and standard deviation σ = 3 months.
follow this baby’s progress for 2 years. What kind of result should lead the researchers to conclude that the highly purified vitamin allows babies to walk earlier?
This is a hypothesis-testing problem. The researchers want to draw a general conclusion about whether the purified vitamin allows babies in general to walk ear- lier. The conclusion will be about babies in general (a population of babies). How- ever, the conclusion will be based on results of studying a sample. In this example, the sample consists of a single baby.
The Core Logic of Hypothesis Testing

There is a standard way researchers approach any hypothesis-testing problem. For this example, it works as follows. Consider first the population of babies in general (those who are not given the specially purified vitamin). In this population, the chance of a baby’s starting to walk at age 8 months or earlier would be less than 2%. (As shown in Figure 4–1, the mean walking age is 14 months with a standard deviation of 3 months.) Thus, walking at 8 months or earlier is highly unlikely among such babies. But what if the randomly selected sample of one baby in our study does start walking by 8 months? If the specially purified vitamin had no effect on this particular baby’s walking age (which means that the baby’s walking age should be similar to that of babies who were not given the vitamin), it is highly unlikely (less than a 2% chance) that the particular baby we selected at random would start walking by 8 months. So, if the baby in our study does in fact start walking by 8 months, that allows us to reject the idea that the specially purified vitamin has no effect. And if we reject the idea that the specially purified vitamin has no effect, then we must also accept the idea that the specially purified vitamin does have an effect.
Using the same reasoning, if the baby starts walking by 8 months, we can reject the idea that this baby comes from a population of babies like that of the general population with a mean walking age of 14 months. We therefore conclude that babies given the specially purified vitamin will on the average start to walk before 14 months. Our explanation for the baby’s early walking age in the study is that the specially purified vitamin speeded up the baby’s development.
In this example, the researchers first spelled out what would have to happen for them to conclude that the special purification procedure makes a difference. Having laid this out in advance, the researchers then conducted their study. Conducting the study in this case meant giving the specially purified vitamin to a randomly selected baby and watching to see how early that baby walked. We supposed that the result of the study is that the baby started walking before 8 months. The researchers then concluded that it is unlikely the specially purified vitamin makes no difference and thus also that it does make a difference.
This kind of testing, with its opposite-of-what-you-predict, roundabout reasoning, is at the heart of inferential statistics in psychology. It is something like a double negative. One reason for this approach is that we have the information to figure the probability of getting a particular experimental result if the situation of there being no difference is true. In the purified vitamin example, the researchers know what the probabilities are of babies walking at different ages if the specially purified vitamin does not have any effect. The probabilities of babies walking at various ages are already known from studies of babies in general—that is, babies who have not received the specially purified vitamin. If the specially purified vitamin has no effect, then the ages at which babies start walking are the same with or without the specially purified vitamin. Thus, the distribution is that shown in Figure 4–1, based on ages at which babies start walking in general.
TIP FOR SUCCESS: This section, The Core Logic of Hypothesis Testing, is central to everything else we do in the book. Thus, you may want to read it a few times. You should also be certain that you understand the logic of hypothesis testing before reading later chapters.
Without such a tortuous way of going at the problem, in most cases you could not test hypotheses scientifically at all. In almost all psychology research, we base our conclusions on the question, “What is the probability of getting our research results if the opposite of what we are predicting were true?” That is, we usually predict an effect of some kind. However, we decide on whether there is such an effect by seeing if it is unlikely that there is not such an effect. If it is highly unlikely that we would get our research results if the opposite of what we are predicting were true, that finding allows us to reject the opposite prediction. If we reject the opposite prediction, we are able to accept our prediction. However, if it is likely that we would get our research results if the opposite of what we are predicting were true, we are not able to reject the opposite prediction. If we are not able to reject the opposite prediction, we are not able to accept our prediction.
The Hypothesis-Testing Process

Let’s look at our example again, this time going over each step in some detail. Along the way, we cover the special terminology of hypothesis testing. Most important, we introduce the five steps of hypothesis testing that you use for the rest of this book.
Step ❶: Restate the Question as a Research Hypothesis and a Null Hypothesis About the Populations

Our researchers are interested in the effects on babies in general (not just on this particular baby). That is, the purpose of studying samples is to know about populations. Thus, it is useful to restate the research question in terms of populations. In our example, we can think of two populations of babies:
Population 1: Babies who take the specially purified vitamin.
Population 2: Babies in general (that is, babies who do not take the specially purified vitamin).
Population 1 consists of babies who receive the experimental treatment (the specially purified vitamin). In our example, we use a sample of one baby to draw a conclusion about the age at which babies in Population 1 start to walk. Population 2 is a kind of comparison baseline of what is already known about babies in general.
The prediction of our research team is that Population 1 babies (those who take the specially purified vitamin) will on the average walk earlier than Population 2 babies (babies in general who do not take the specially purified vitamin). This prediction is based on the researchers’ theory of how these vitamins work. A prediction like this about the difference between populations is called a research hypothesis. Put more formally, the prediction is that the mean of Population 1 is lower (babies receiving the special vitamin walk earlier) than the mean of Population 2. In symbols, the research hypothesis for this example is μ₁ < μ₂.
The opposite of the research hypothesis is that the populations are not different in the way predicted. Under this scenario, Population 1 babies (those who take the specially purified vitamin) will on the average not walk earlier than Population 2 babies (babies in general—those who do not take the specially purified vitamin). That is, the prediction is that there is no difference in the ages at which Population 1 and Population 2 babies start walking. On the average, they start at the same time. A statement like this, about a lack of difference between populations, is the crucial opposite of the research hypothesis. It is called a null hypothesis. It has this name
μ₁ < μ₂
research hypothesis statement in hypothesis testing about the predicted relation between populations (often a prediction of a difference between population means).
null hypothesis statement about a relation between populations that is the opposite of the research hypothesis; statement that in the population there is no difference (or a difference opposite to that predicted) between populations; contrived statement set up to examine whether it can be rejected as part of hypothesis testing.
because it states the situation in which there is no difference (the difference is “null”) between the populations. In symbols, the null hypothesis is μ₁ = μ₂.¹
The research hypothesis and the null hypothesis are complete opposites: if one is true, the other cannot be. In fact, the research hypothesis is sometimes called the alternative hypothesis—that is, it is the alternative to the null hypothesis. This term is a bit ironic. As researchers, we care most about the research hypothesis. But when doing the steps of hypothesis testing, we use this roundabout method of seeing whether or not we can reject the null hypothesis so that we can decide about its alternative (the research hypothesis).
Step ❷: Determine the Characteristics of the Comparison Distribution

Recall that the overall logic of hypothesis testing involves figuring out the probability of getting a particular result if the null hypothesis is true. Thus, you need to know what the situation would be if the null hypothesis were true. In our example, we start out knowing the key information about Population 2, babies in the general population (see Figure 4–1): we know it follows a normal curve, μ = 14, and σ = 3. If the null hypothesis is true, Population 1 and Population 2 are the same; in our example, this would mean Populations 1 and 2 both follow a normal curve, μ = 14, and σ = 3.

In the hypothesis-testing process, you want to find out the probability that you could have gotten a sample score as extreme as what you got (say, a baby walking very early) if your sample were from a population with a distribution of the sort you would have if the null hypothesis were true. Thus, in this book we call this distribution a comparison distribution. (The comparison distribution is sometimes called a sampling distribution—an idea we discuss in Chapter 5.) That is, in the hypothesis-testing process, you compare the actual sample’s score to this comparison distribution.
In our vitamin example, the null hypothesis is that there is no difference in walking age between babies who take the specially purified vitamin (Population 1) and babies in general who do not take the specially purified vitamin (Population 2). The comparison distribution is the distribution for Population 2, since this population represents the walking age of babies if the null hypothesis is true. In later chapters, you will learn about different types of comparison distributions, but the same principle applies in all cases: The comparison distribution is the distribution that represents the population situation if the null hypothesis is true.
Step ❸: Determine the Cutoff Sample Score on the Comparison Distribution at Which the Null Hypothesis Should Be Rejected

Ideally, before conducting a study, researchers set a target against which they will compare their result: how extreme a sample score they would need to decide against the null hypothesis; that is, how extreme the sample score would have to be for it to be too unlikely that they could get such an extreme score if the null hypothesis were true. This is called the cutoff sample score. (The cutoff sample score is also known as the critical value.)
Consider our purified vitamin example, in which the null hypothesis is that walking age is not influenced by whether babies take the specially purified vitamin. The researchers might decide that, if the null hypothesis were true, a randomly
μ = 14, σ = 3
μ = 14, σ = 3
μ₁ = μ₂
comparison distribution distribution used in hypothesis testing. It represents the population situation if the null hypothesis is true. It is the distribution to which you compare the score based on your sample’s results.
cutoff sample score point in hypothesis testing, on the comparison distribution at which, if reached or exceeded by the sample score, you reject the null hypothesis. Also called critical value.
selected baby walking before 8 months would be very unlikely. With a normal distribution, being 2 or more standard deviations below the mean (walking by 8 months) could occur less than 2% of the time. Thus, based on the comparison distribution, the researchers set their cutoff sample score even before doing the study. They decide in advance that if the result of their study is a baby who walks by 8 months, they will reject the null hypothesis.
But what if the baby does not start walking until after 8 months? If that happens, the researchers will not be able to reject the null hypothesis.
When setting in advance how extreme a sample’s score needs to be to reject the null hypothesis, researchers use Z scores and percentages. In our purified vitamin example, the researchers might decide that if a result were less likely than 2%, they would reject the null hypothesis. Being in the bottom 2% of a normal curve means having a Z score of about –2 or lower. Thus, the researchers would set –2 as their Z-score cutoff point on the comparison distribution for deciding that a result is extreme enough to reject the null hypothesis. So, if the actual sample Z score is –2 or lower, the researchers will reject the null hypothesis. However, if the actual sample Z score is greater than –2, the researchers will not reject the null hypothesis.
Suppose that the researchers are even more cautious about too easily rejecting the null hypothesis. They might decide that they will reject the null hypothesis only if they get a result that could occur by chance 1% of the time or less. They could then figure out the Z-score cutoff for 1%. Using the normal curve table, to have a score in the lower 1% of a normal curve, you need a Z score of –2.33 or less. (In our example, a Z score of –2.33 means 7 months.) In Figure 4–2, we have shaded the 1% of the comparison distribution in which a sample would be considered so extreme that the possibility that it came from a distribution like this would be rejected. Now the researchers will reject the null hypothesis only if the actual sample Z score is –2.33 or lower—that is, if it falls in the shaded area in Figure 4–2. If the sample Z score falls outside the shaded area in Figure 4–2, the researchers will not reject the null hypothesis.
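The –2.33 cutoff can also be found by inverting the normal curve rather than reading a table. A sketch in Python (an illustration of ours, not the book's procedure):

```python
from statistics import NormalDist

# Z score below which only 1% of a standard normal distribution falls
z_cutoff = NormalDist().inv_cdf(0.01)
print(f"1% cutoff: Z = {z_cutoff:.2f}")  # about -2.33

# Converting the Z cutoff back to a raw walking age on the
# comparison distribution (mu = 14 months, sigma = 3 months)
raw_cutoff = 14 + z_cutoff * 3
print(f"Raw-score cutoff: {raw_cutoff:.1f} months")  # about 7 months
```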
In general, psychology researchers use a cutoff on the comparison distribution with a probability of 5% that a score will be at least that extreme if the null hypothesis were true. That is, researchers reject the null hypothesis if the probability of getting a sample score this extreme (if the null hypothesis were true) is less than 5%. This probability is usually written as p < .05. However, in some areas of research, or when researchers want to be especially cautious, they use a cutoff of 1% (p < .01).²
Figure 4–2 Distribution of when babies begin to walk, with bottom 1% shaded (fictional data).
These are called conventional levels of significance. They are described as the .05 significance level and the .01 significance level. We also refer to them as the 5% significance level and the 1% significance level. (We discuss in more detail in Chapter 6 the issues in deciding on the significance level to use.) When a sample score is so extreme that researchers reject the null hypothesis, the result is said to be statistically significant (or significant, as it is often abbreviated).
Step ❹: Determine Your Sample’s Score on the Comparison Distribution

The next step is to carry out the study and get the actual results for your sample. Once you have the results for your sample, you figure the Z score for the sample’s raw score based on the population mean and standard deviation of the comparison distribution.
Assume that the researchers did the study and the baby who was given the specially purified vitamin started walking at 6 months. The mean of the comparison distribution to which we are comparing these results is 14 months and the standard deviation is 3 months. That is, μ = 14 and σ = 3. Thus, a baby who walks at 6 months is 8 months below the population mean. This puts the baby 2⅔ standard deviations below the population mean. The Z score for this sample baby on the comparison distribution is thus −2.67 [that is, Z = (6 − 14)/3 = −2.67]. Figure 4–3 shows the score of our sample baby on the comparison distribution.
Step ❺: Decide Whether to Reject the Null Hypothesis

To decide whether to reject the null hypothesis, compare your actual sample’s Z score (from Step ❹) to the cutoff Z score (from Step ❸). In our example, the actual result was Z = −2.67. Let’s suppose the researchers had decided in advance that they would reject the null hypothesis if the sample’s Z score was below −2. Since −2.67 is below −2, the researchers would reject the null hypothesis.
Alternatively, suppose the researchers had used the more conservative 1% sig- nificance level. The needed Z score to reject the null hypothesis would then have
statistically significant conclusion that the results of a study would be unlikely if in fact the sample studied represents a population that is no different from the population in general; an outcome of hypothesis testing in which the null hypothesis is rejected.
conventional levels of significance (p < .05, p < .01) levels of significance widely used in psychology.
Figure 4–3 Distribution of when babies begin to walk, showing both the bottom 1% (cutoff Z score = −2.33) and the single baby who is the sample studied (Z = −2.67) (fictional data).
TIP FOR SUCCESS: If you are unsure about these symbols for population parameters (μ, σ), be sure to review Table 3–2 on p. 87.
been −2.33 or lower. But, again, the actual Z for the randomly selected baby was −2.67 (a more extreme score than −2.33). Thus, even with this more conservative
cutoff, they would still reject the null hypothesis. This situation is shown in Figure 4–3. As you can see in the figure, the bottom 1% of the distribution is shaded. We recommend that you always draw such a picture of the distribution. Be sure to shade in the part of the distribution that is more extreme (that is, farther out in the tail) than the cutoff sample score. If your actual sample Z score falls within the shaded region, you can reject the null hypothesis. Since the sample Z score in this example falls within the shaded tail region, the researchers can reject the null hypothesis.
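Steps ❹ and ❺ for the vitamin example reduce to a short calculation. A sketch in Python (our illustration; the variable names are ours):

```python
from statistics import NormalDist

mu, sigma = 14, 3    # comparison distribution: babies in general
sample_score = 6     # the study baby started walking at 6 months

# Step 4: the sample's Z score on the comparison distribution
z_sample = (sample_score - mu) / sigma  # (6 - 14) / 3 = -2.67

# Step 5: compare to the 1% cutoff in the lower tail
z_cutoff = NormalDist().inv_cdf(0.01)   # about -2.33
reject_null = z_sample <= z_cutoff
print(f"Z = {z_sample:.2f}, cutoff = {z_cutoff:.2f}, reject null: {reject_null}")
```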
If the researchers reject the null hypothesis, what remains is the research hypothesis. In this example, the research team can conclude that the results of their study support the research hypothesis that babies who take the specially purified vitamin walk earlier than babies in general.
Implications of Rejecting or Failing to Reject the Null Hypothesis

It is important to emphasize two points about the conclusions you can make from the hypothesis-testing process. First, when you reject the null hypothesis, all you are saying is that your results support the research hypothesis (as in our example). You would not go on to say that the results prove the research hypothesis or that the results show that the research hypothesis is true. Terms such as prove and true are too strong because the results of research studies are based on probabilities. Specifically, they are based on the probability being low of getting your result if the null hypothesis were true. Proven and true are okay terms in logic and mathematics, but to use these words in conclusions from scientific research is unprofessional. (It is okay to use true when speaking hypothetically—for example, “if this hypothesis were true, then . . .”—but not when speaking of conclusions about an actual result.) What you do say when you reject the null hypothesis is that the results are statistically significant. You can also say that the results “support” or “provide evidence for” the research hypothesis.
Second, when a result is not extreme enough to reject the null hypothesis, you do not say that the result supports the null hypothesis. You simply say the result is not statistically significant.
A result that is not strong enough to reject the null hypothesis means the study was inconclusive. The results may not be extreme enough to reject the null hypothesis, but the null hypothesis might still be false (and the research hypothesis true). Suppose in our example that the specially purified vitamin had only a slight but still real effect. In that case, we would not expect to find a baby who is given the purified vitamin to be walking a lot earlier than babies in general. Thus, we would not be able to reject the null hypothesis, even though it is false. (You will learn more about such situations in the Decision Errors section in Chapter 6.)
Showing the null hypothesis to be true would mean showing that there is absolutely no difference between the populations. It is always possible that there is a difference between the populations but that the difference is much smaller than the particular study was able to detect. Therefore, when a result is not extreme enough to reject the null hypothesis, the results are said to be inconclusive. Sometimes, however, if studies have been done using large samples and accurate measuring procedures, evidence may build up in support of something close to the null hypothesis—that there is at most very little difference between the populations. (We have more to say
on this important issue later in this chapter and in Chapter 6.) The basic logic of hypothesis testing is summarized in Table 4–1, which also includes the logic for our example of a baby who is given a specially purified vitamin.
Summary of Steps of Hypothesis Testing

Here is a summary of the five steps of hypothesis testing.
❶ Restate the question as a research hypothesis and a null hypothesis about the populations.
❷ Determine the characteristics of the comparison distribution.
❸ Determine the cutoff sample score on the comparison distribution at which the null hypothesis should be rejected.
❹ Determine your sample’s score on the comparison distribution.
❺ Decide whether to reject the null hypothesis.
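For a single score from a known normal comparison distribution, the five steps can be wrapped in one small function. This is our own illustrative sketch (the function name and parameters are not from the book); Steps ❶ and ❷ are the caller's job: state the hypotheses and supply the comparison distribution that would hold if the null hypothesis were true.

```python
from statistics import NormalDist

def one_tailed_z_test(sample_score, mu, sigma, significance=0.05, tail="low"):
    """Sketch of the chapter's five steps for a single-score, one-tailed test.

    mu and sigma describe the comparison distribution (the population
    situation if the null hypothesis is true).
    """
    # Step 3: cutoff Z score on the comparison distribution
    if tail == "low":
        z_cutoff = NormalDist().inv_cdf(significance)
    else:
        z_cutoff = NormalDist().inv_cdf(1 - significance)

    # Step 4: the sample's Z score on the comparison distribution
    z_sample = (sample_score - mu) / sigma

    # Step 5: reject the null only if the sample is at least as extreme
    # as the cutoff, in the predicted direction
    if tail == "low":
        reject = z_sample <= z_cutoff
    else:
        reject = z_sample >= z_cutoff
    return z_sample, z_cutoff, reject

# Vitamin example: baby walks at 6 months, 1% significance level, lower tail
print(one_tailed_z_test(6, mu=14, sigma=3, significance=0.01, tail="low"))
```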
A Second Example

Here is another fictional example. Two happy-go-lucky personality psychologists are examining the theory that happiness comes from positive experiences. In particular, these researchers argue that if people have something very fortunate happen to them, they become very happy and will still be happy 6 months later. So the researchers plan the following experiment: a person will be randomly selected from the North American adult public and given $10 million. Six months later, the person’s happiness will be measured. It is already known (in this fictional example) what the distribution of happiness is like in the general population of North American adults, and this is shown in Figure 4–4. On the test being used, the mean happiness score is 70, the standard deviation is 10, and the distribution is approximately normal.
Table 4–1 The Basic Logic of Hypothesis Testing, Including the Logic for the Example of the Effect of a Specially Purified Vitamin on the Age That Babies Begin to Walk

Focus of Research
  Basic logic: A sample is studied.
  Baby example: A baby is given the specially purified vitamin and the age of walking is observed.

Question
  Basic logic: Is the sample typical of the general population?
  Baby example: Is this baby’s walking age typical of babies in general?

Answer
  Basic logic: Very unlikely.
  Baby example: Very unlikely.

Conclusion
  Basic logic: The sample is probably not from the general population; it is probably from a different population.
  Baby example: This baby is probably not from the general population of babies, because its walking age is much lower than for babies in general. Therefore, babies who take the specially purified vitamin will probably begin walking at an earlier age than babies in the general population.
The psychologists now carry out the hypothesis-testing procedure. That is, the researchers consider how happy the person would have to be before they can confidently reject the null hypothesis that receiving so much money does not make people happier 6 months later. If the researchers’ result shows a very high level of happiness, the psychologists will reject the null hypothesis and conclude that getting $10 million probably does make people happier 6 months later. But if the result is not very extreme, the researchers will conclude that there is not sufficient evidence to reject the null hypothesis, and the results of the experiment are inconclusive.
Now let us consider the hypothesis-testing procedure in more detail in this example, following the five steps.
❶ Restate the question as a research hypothesis and a null hypothesis about the populations. There are two populations of interest:
Population 1: People who 6 months ago received $10 million.
Population 2: The general population (consisting of people who 6 months ago did not receive $10 million).
The prediction of the personality psychologists, based on their theory of happiness, is that Population 1 people will on the average be happier than Population 2 people: in symbols, μ₁ > μ₂. The null hypothesis is that Population 1 people (those who get $10 million) will not be happier than Population 2 people (people in general who do not get $10 million).
❷ Determine the characteristics of the comparison distribution. The comparison distribution is the distribution that represents the population situation if the null hypothesis is true. If the null hypothesis is true, the distributions of Populations 1 and 2 are the same. We know Population 2’s distribution (it is normally distributed with μ = 70 and σ = 10); so we can use it as the comparison distribution.
❸ Determine the cutoff sample score on the comparison distribution at which the null hypothesis should be rejected. What kind of result would be extreme enough to convince us to reject the null hypothesis? In this example, assume that the researchers decided the following in advance: they will reject the null hypothesis as too unlikely if the results would occur less than 5% of the time if this null hypothesis were true. We know that the comparison distribution is a normal curve. Thus, we can figure that the top 5% of scores from the normal
Figure 4–4 Distribution of happiness scores (fictional data): a normal curve with mean 70 and standard deviation 10, with the corresponding Z scores marked.
curve table begin at a Z score of about 1.64. Thus the researchers set as the cutoff point for rejecting the null hypothesis a result in which the sample's Z score on the comparison distribution is at or above 1.64. (The mean of the comparison distribution is 70 and the standard deviation is 10. Therefore, the null hypothesis will be rejected if the sample result is at or above 86.4.)
❹ Determine your sample's score on the comparison distribution. Now for the results: 6 months after giving the randomly selected person $10 million, the now very wealthy research participant takes the happiness test. The person's score is 80. As you can see from Figure 4–4, a score of 80 has a Z score of +1 on the comparison distribution.
❺ Decide whether to reject the null hypothesis. The Z score of the sample individual is +1. The researchers set the minimum Z score to reject the null hypothesis at +1.64. Thus, the sample score is not extreme enough to reject the null hypothesis. The experiment is inconclusive; researchers would say the results are "not statistically significant." Figure 4–5 shows the comparison distribution with the top 5% shaded and the location of the sample participant who received $10 million.
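The figuring in steps ❸ through ❺ can be sketched in a few lines of code. The following is our own illustrative Python sketch (not part of the original example); it uses the standard library's NormalDist in place of a normal curve table, and the variable names are ours.

```python
from statistics import NormalDist

# Fictional happiness example: the comparison distribution is Population 2,
# normally distributed with mean 70 and standard deviation 10.
mu, sigma = 70.0, 10.0
alpha = 0.05                                    # 5% significance level, one-tailed

# Step 3: cutoff Z score marking off the top 5% of the normal curve
cutoff_z = NormalDist().inv_cdf(1 - alpha)      # about 1.64
cutoff_raw = mu + cutoff_z * sigma              # about 86.4 on the happiness scale

# Step 4: the sample participant's Z score on the comparison distribution
sample_z = (80.0 - mu) / sigma                  # (80 - 70) / 10 = 1.0

# Step 5: reject the null hypothesis only if the sample is beyond the cutoff
reject = sample_z >= cutoff_z                   # 1.0 < 1.64, so inconclusive
print(cutoff_z, cutoff_raw, sample_z, reject)
```

Note that `inv_cdf(0.95)` returns the same value (about 1.64) that you would look up in a normal curve table for the top 5%.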
You may be interested to know that Brickman et al. (1978) carried out a more elaborate study based on the same question. They studied lottery winners as examples of people suddenly having a very positive event happen to them. Their results were similar to those in our fictional example: those who won the lottery were not much happier 6 months later than people who did not win the lottery. Also, another group they studied, people who had become paraplegics through a random accident, were not much less happy than other people 6 months later. These researchers concluded that if a major event does have a lasting effect on happiness, it is probably not a very big one. This conclusion is consistent with the findings of more recent studies (e.g., Suh et al., 1996). Indeed, in recent years, a great deal of research has examined what factors contribute to people's level of happiness. If you are interested in knowing more about this topic, we highly recommend an article by Diener and colleagues (2006) and social psychologist Daniel Gilbert's (2006) engaging best seller, Stumbling on Happiness.
Figure 4–5 Distribution of happiness scores with the upper 5% shaded, showing the cutoff (Z = 1.64) and the location of the sample participant (Z = 1; fictional data).
How are you doing?
1. A sample of rats in a laboratory is given an experimental treatment intended to make them learn a maze faster than other rats. State (a) the null hypothesis and (b) the research hypothesis.
2. (a) What is a comparison distribution? (b) What role does it play in hypothesis testing?
3. What is the cutoff sample score?
4. Why do we say that hypothesis testing involves a double negative logic?
5. What can you conclude when (a) a result is so extreme that you reject the null hypothesis and (b) a result is not very extreme so that you cannot reject the null hypothesis?
6. A training program to increase friendliness is tried on one individual randomly selected from the general public. Among the general public (who do not get this training program), the mean on the friendliness measure is 30 with a standard deviation of 4. The researchers want to test their hypothesis at the 5% significance level. After going through the training program, this individual takes the friendliness measure and gets a score of 40. What should the researchers conclude?
1. (a) The population of rats like those that get the experimental treatment score the same on the time to learn the maze as the population of rats in general that do not get the experimental treatment. (b) The population of rats like those that get the experimental treatment learn the maze faster than the population of rats in general that do not get the experimental treatment.
2. (a) A comparison distribution is a distribution to which you compare the results of your study. (b) In hypothesis testing, the comparison distribution is the distribution for the situation when the null hypothesis is true. To decide whether to reject the null hypothesis, you check how extreme the score of your sample is on this comparison distribution; that is, how likely it would be to get a sample with a score this extreme if your sample came from this comparison distribution.
3. The cutoff sample score is the Z score on the comparison distribution beyond which, if your sample's Z score is more extreme, you reject the null hypothesis.
4. We say that hypothesis testing involves a double negative logic because we are interested in the research hypothesis, but we test whether it is true by seeing if we can reject its opposite, the null hypothesis.
5. (a) The research hypothesis is supported when a result is so extreme that you reject the null hypothesis; the result is statistically significant. (b) The result is not statistically significant when a result is not very extreme; the result is inconclusive.
6. The training program increases friendliness. (The cutoff sample Z score on the comparison distribution is 1.64. The actual sample's Z score of 2.50 is more extreme, that is, farther in the tail, than the cutoff Z score. Therefore, reject the null hypothesis; the research hypothesis is supported; the result is statistically significant.)
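As a check on answer 6, here is the same figuring as a minimal Python sketch (our own illustration; the variable names are ours, not the book's):

```python
from statistics import NormalDist

mu, sigma = 30.0, 4.0                    # friendliness in the general public
cutoff_z = NormalDist().inv_cdf(0.95)    # one-tailed, 5% level; about 1.64
sample_z = (40.0 - mu) / sigma           # (40 - 30) / 4 = 2.5
print(sample_z >= cutoff_z)              # True: reject the null hypothesis
```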
directional hypothesis: research hypothesis predicting a particular direction of difference between populations; for example, a prediction that the population like the sample studied has a higher mean than the population in general.
one-tailed test: hypothesis-testing procedure for a directional hypothesis; the situation in which the region of the comparison distribution in which the null hypothesis would be rejected is all on one side (tail) of the distribution.
nondirectional hypothesis: research hypothesis that does not predict a particular direction of difference between the population like the sample studied and the population in general.
One-Tailed and Two-Tailed Hypothesis Tests

In our examples so far, the researchers were interested in only one direction of result. In our first example, researchers tested whether babies given the specially purified vitamin would walk earlier than babies in general. In the happiness example, the personality psychologists predicted the person who received $10 million would be happier than other people. The researchers in these studies were not interested in the possibility that giving the specially purified vitamin would cause babies to start walking later or that people getting $10 million might become less happy.
Directional Hypotheses and One-Tailed Tests

The purified vitamin and happiness studies are examples of testing a directional hypothesis. Both studies focused on a specific direction of effect. When a researcher makes a directional hypothesis, the null hypothesis is also, in a sense, directional. Suppose the research hypothesis is that getting $10 million will make a person happier than the general population. The null hypothesis, then, is that the money will either have no effect or make the person less happy. [In symbols, if the research hypothesis is μ₁ > μ₂, then the null hypothesis is μ₁ ≤ μ₂ ("≤" is the symbol for less than or equal to).] Thus, in Figure 4–5, to reject the null hypothesis, the sample has to have a score in one tail of the comparison distribution: the upper extreme or tail (in this example, the top 5%) of the comparison distribution. (When it comes to rejecting the null hypothesis with a directional hypothesis, a score at the other tail is the same as a score in the middle; that is, such a score does not allow you to reject the null hypothesis.) For this reason, the test of a directional hypothesis is called a one-tailed test. A one-tailed test can be one-tailed in either direction. In the happiness study example, the tail for the predicted effect was at the high end. In the baby study example, the tail for the predicted effect was at the low end (that is, the prediction tested was that babies given the specially purified vitamin would start walking unusually early).
Nondirectional Hypotheses and Two-Tailed Tests

Sometimes, a research hypothesis states that an experimental procedure will have an effect, without saying whether it will produce a very high score or a very low score. Suppose an organizational psychologist is interested in how a new social skills program will affect productivity. The program could either improve productivity by making the working environment more pleasant or hurt productivity by encouraging people to socialize instead of work. The research hypothesis is that the social skills program changes the level of productivity; the null hypothesis is that the program does not change productivity one way or the other. In symbols, the research hypothesis is μ₁ ≠ μ₂ ("≠" is the symbol for not equal); the null hypothesis is μ₁ = μ₂.
When a research hypothesis predicts an effect but does not predict a direction for the effect, it is called a nondirectional hypothesis. To test the significance of a nondirectional hypothesis, you have to consider the possibility that the sample could be extreme at either tail of the comparison distribution. Thus, this is called a two-tailed test.
two-tailed test: hypothesis-testing procedure for a nondirectional hypothesis; the situation in which the region of the comparison distribution in which the null hypothesis would be rejected is divided between the two sides (tails) of the distribution.
Determining Cutoff Scores with Two-Tailed Tests

There is a special complication in a two-tailed test. You have to divide the significance percentage between the two tails. For example, with a 5% significance level, you reject a null hypothesis only if the sample is so extreme that it is in either the top 2.5% or the bottom 2.5% of the comparison distribution. This keeps the overall level of significance at a total of 5%.
Note that a two-tailed test makes the cutoff Z scores for the 5% level +1.96 and −1.96. For a one-tailed test at the 5% level, the cutoff is not so extreme: only +1.64 or −1.64. But with a one-tailed test, only one side of the distribution is considered. These situations are shown in Figure 4–6a.
Using the 1% significance level, a two-tailed test (.5% at each tail) has cutoffs of +2.58 and −2.58, while a one-tailed test's cutoff is either +2.33 or −2.33. These situations are shown in Figure 4–6b. The Z score cutoffs for one-tailed and two-tailed tests for the .05 and .01 significance levels are also summarized in Table 4–2.
Figure 4–6 Significance level cutoffs for one-tailed and two-tailed tests: (a) .05 significance level; (b) .01 significance level. (The one-tailed tests in these examples assume the prediction was for a high score. You could instead have a one-tailed test where the prediction is for the lower, left tail.)
Table 4–2 One-Tailed and Two-Tailed Cutoff Z Scores for the .05 and .01 Significance Levels

Significance Level    One-Tailed          Two-Tailed
.05                   −1.64 or +1.64      −1.96 and +1.96
.01                   −2.33 or +2.33      −2.58 and +2.58
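The cutoffs in Table 4–2 can be recovered from the normal curve itself rather than from a table. This Python sketch (ours, for illustration) shows why the two-tailed cutoffs are more extreme: the significance percentage is halved before finding the Z score.

```python
from statistics import NormalDist

z = NormalDist().inv_cdf    # inverse of the cumulative normal curve

for level in (0.05, 0.01):
    one_tailed = z(1 - level)        # the whole percentage in one tail
    two_tailed = z(1 - level / 2)    # the percentage split between both tails
    print(f"{level}: one-tailed {one_tailed:.2f}, two-tailed {two_tailed:.2f}")
# prints:
# 0.05: one-tailed 1.64, two-tailed 1.96
# 0.01: one-tailed 2.33, two-tailed 2.58
```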
When to Use One-Tailed or Two-Tailed Tests

If the researcher decides in advance to use a one-tailed test, then the sample's score does not need to be so extreme to be significant compared to what would be needed with a two-tailed test. Yet there is a price. If the result is extreme in the direction opposite to what was predicted, no matter how extreme, the result cannot be considered statistically significant.
In principle, you plan to use a one-tailed test when you have a clearly directional hypothesis and a two-tailed test when you have a clearly nondirectional hypothesis. In practice, the decision is not so simple. Even when a theory clearly predicts a particular result, the actual result may come out opposite to what you expected. Sometimes, the opposite may be more interesting than what you had predicted. (For example, what if, as in all the fairy tales about wish-granting genies and fish, receiving $10 million and being able to fulfill almost any desire had made that individual miserable?) By using one-tailed tests, we risk having to ignore possibly important results.
For these reasons, researchers disagree about whether one-tailed tests should be used, even when there is a clearly directional hypothesis. To be safe, many researchers use two-tailed tests for both nondirectional and directional hypotheses. If the two-tailed test is significant, then the researcher looks at the result to see the direction and considers the study significant in that direction. In practice, always using two-tailed tests is a conservative procedure because the cutoff scores are more extreme for a two-tailed test and so it is less likely that a two-tailed test will give a significant result. Thus, if you do get a significant result with a two-tailed test, you are more confident about the conclusion. In fact, in most psychology research articles, unless the researcher specifically states that a one-tailed test was used, it is assumed that the test was two-tailed.
In practice, however, our experience is that most research results are either so extreme that they will be significant whether you use a one-tailed or two-tailed test or so far from extreme that they would not be significant in either kind of test. But what happens when a result is less certain? The researcher's decision about one- or two-tailed tests now can make a big difference. In this situation the researcher tries to use the type of test that will give the most accurate and noncontroversial conclusion. The idea is to let nature, not a researcher's decisions, determine the conclusion as much as possible. Further, whenever a result is less than completely clear one way or the other, most researchers are not comfortable drawing strong conclusions until more research is done.
Example of Hypothesis Testing with a Two-Tailed Test

Here is one more fictional example, this time using a two-tailed test. Clinical psychologists at a residential treatment center have developed a new type of therapy to reduce depression that they believe is more effective than the current therapy.
Figure 4–7 Distribution of depression scores at 4 weeks after admission for diagnosed depressed psychiatric patients receiving the standard therapy (fictional data).
However, as with any treatment, it could make patients’ depression worse. Thus, the clinical psychologists make a nondirectional hypothesis.
The psychologists randomly select an incoming patient to receive the new form of therapy instead of the usual therapy. (In a real study, of course, more than one patient would be selected, but let's assume that only one person has been trained to do the new therapy and she has time to treat only one patient.) After 4 weeks, the patient fills out a standard depression scale that is given automatically to all patients after 4 weeks. The standard scale has been given at this treatment center for a long time. Thus, the psychologists know in advance the distribution of depression scores at 4 weeks for those who receive the usual therapy: it follows a normal curve with a mean of 69.5 and a standard deviation of 14.1. [These figures correspond roughly to the depression scores found in a national survey of 75,000 psychiatric patients given a widely used standard test (Dahlstrom et al., 1986).] This distribution is shown in Figure 4–7.
The clinical psychologists then carry out the five steps of hypothesis-testing.
❶ Restate the question as a research hypothesis and a null hypothesis about the populations. There are two populations of interest:
Population 1: Patients diagnosed as depressed who receive the new therapy.
Population 2: Patients diagnosed as depressed in general (who receive the usual therapy).
The research hypothesis is that when measured on depression 4 weeks after admission, patients who receive the new therapy (Population 1) will on the average score differently from patients who receive the current therapy (Population 2). In symbols, the research hypothesis is μ₁ ≠ μ₂. The opposite of the research hypothesis, the null hypothesis, is that patients who receive the new therapy will have the same average depression level as the patients who receive the usual therapy. (That is, the depression level measured after 4 weeks will have the same mean for Populations 1 and 2.) In symbols, the null hypothesis is μ₁ = μ₂.
❷ Determine the characteristics of the comparison distribution. If the null hypothesis is true, the distributions of Populations 1 and 2 are the same. We know
TIP FOR SUCCESS
Remember that the research hypothesis and null hypothesis must always be complete opposites. Researchers specify the research hypothesis, and this determines the null hypothesis that goes with it.
the distribution of Population 2 (it is the one shown in Figure 4–7). Thus, we can use Population 2 as our comparison distribution. As noted, it follows a normal curve, with μ = 69.5 and σ = 14.1.
❸ Determine the cutoff sample score on the comparison distribution at which the null hypothesis should be rejected. The clinical psychologists select the 5% significance level. They have made a nondirectional hypothesis and will therefore use a two-tailed test. Thus, they will reject the null hypothesis only if the patient's depression score is in either the top or bottom 2.5% of the comparison distribution. In terms of Z scores, these cutoffs are +1.96 and −1.96 (see Figure 4–6 and Table 4–2).
❹ Determine your sample's score on the comparison distribution. The patient who received the new therapy was measured 4 weeks after admission. The patient's score on the depression scale was 41, which is a Z score on the comparison distribution of −2.02. That is, Z = (X − M)/SD = (41 − 69.5)/14.1 = −2.02. Figure 4–8 shows the distribution of Population 2 for this study, with the upper and lower 2.5% areas shaded; the depression score of the sample patient is also shown.
➎ Decide whether to reject the null hypothesis. A Z score of −2.02 is slightly more extreme than a Z score of −1.96, which is where the lower 2.5% of the comparison distribution begins. Notice in Figure 4–8 that the Z score of −2.02 falls within the shaded area in the left tail of the comparison distribution. This Z score of −2.02 is a result so extreme that it is unlikely to have occurred if this patient were from a population no different from Population 2. Therefore, the clinical psychologists reject the null hypothesis. The result is statistically significant, and it supports the research hypothesis that depressed patients receiving the new therapy have different depression levels than depressed patients in general who receive the usual therapy.
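The whole two-tailed decision in this example reduces to a few lines of code. This is our own illustrative Python sketch of the figuring, not part of the original study:

```python
from statistics import NormalDist

mu, sigma = 69.5, 14.1                          # usual-therapy depression scores
cutoff_z = NormalDist().inv_cdf(1 - 0.05 / 2)   # about 1.96 (5% level, two-tailed)
sample_z = (41.0 - mu) / sigma                  # (41 - 69.5) / 14.1, about -2.02
reject = abs(sample_z) >= cutoff_z              # extreme in either tail counts
print(round(sample_z, 2), reject)               # -2.02 True
```

Taking the absolute value of the sample's Z score before comparing it to the cutoff is exactly what "either tail" means in a two-tailed test.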
Figure 4–8 Distribution of depression scores with upper and lower 2.5% shaded, showing the cutoff Z scores (−1.96 and +1.96) and the sample patient who received the new therapy (depression score = 41, Z = −2.02; fictional data).
TIP FOR SUCCESS
When carrying out the five steps of hypothesis testing, always draw a figure like Figure 4–8. Be sure to include the cutoff score(s) and shade the appropriate tail(s). If the sample score falls inside a shaded tail region, you can reject the null hypothesis and the result is statistically significant. If the sample score does not fall inside a shaded tail region, you cannot reject the null hypothesis.
How are you doing?
1. What is a nondirectional hypothesis test?
2. What is a two-tailed test?
3. Why do you use a two-tailed test when testing a nondirectional hypothesis?
4. What is the advantage of using a one-tailed test when your theory predicts a particular direction of result?
5. Why might you use a two-tailed test even when your theory predicts a particular direction of result?
6. A researcher predicts that making people hungry will affect how they do on a coordination test. A randomly selected person is asked not to eat for 24 hours before taking a standard coordination test and gets a score of 400. For people in general of this age group and gender, tested under normal conditions, coordination scores are normally distributed with a mean of 500 and a standard deviation of 40. Using the .01 significance level, what should the researcher conclude?
1. A nondirectional hypothesis test is a hypothesis test in which you do not predict a particular direction of difference.
2. A two-tailed test is one in which the overall percentage for the cutoff is divided between the two tails of the comparison distribution. A two-tailed test is used to test the significance of a nondirectional hypothesis.
3. You use a two-tailed test when testing a nondirectional hypothesis because an extreme result in either direction supports the research hypothesis.
4. The cutoff for a one-tailed test is not so extreme; thus, if your result comes out in the predicted direction, it is more likely to be significant. The cutoff is not so extreme because the entire percentage (say 5%) is put in one tail instead of being divided between two tails.
5. It lets you count as significant an extreme result in either direction; if you used a one-tailed test and the result came out opposite to the prediction, it could not be called statistically significant.
6. The cutoffs are +2.58 and −2.58. The sample person's Z score is (400 − 500)/40 = −2.5. The result is not significant; the study is inconclusive.
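For answer 6, the same figuring as an illustrative Python sketch (ours):

```python
from statistics import NormalDist

cutoff_z = NormalDist().inv_cdf(1 - 0.01 / 2)   # about 2.58 (.01 level, two-tailed)
sample_z = (400 - 500) / 40                     # -2.5
print(abs(sample_z) >= cutoff_z)                # False: inconclusive
```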
Controversy: Should Significance Tests Be Banned?

In recent years, there has been a major controversy about significance testing itself, with a concerted movement on the part of a small but vocal group of psychologists to ban significance tests completely! This is a radical suggestion with far-reaching implications: for at least half a century, nearly every research study in psychology has used significance tests. There probably has been more written in the major psychology journals in the last dozen years or so about this controversy than ever before in history about any issue having to do with statistics.
The discussion has gotten so heated that one article began as follows:
It is not true that a group of radical activists held 10 statisticians and six editors hostage at the . . . convention of the American Psychological Society and chanted, “Support the total test ban!” and “Nix the null!” (Abelson, 1997, p. 12)
Since this is by far the most important controversy in years regarding statistics as used in psychology, we discuss the issues in at least three different places. In this chapter we focus on some basic challenges to hypothesis testing. In Chapters 5 and 6, we cover other topics that relate to aspects of hypothesis testing that you will learn about in those chapters.
Before discussing this controversy, you should be reassured that you are not learning about hypothesis testing for nothing. Whatever happens in the future, you absolutely have to understand hypothesis testing to make sense of virtually every research article published in the past. Further, in spite of the controversy that has raged for more than a decade, it is extremely rare to see new articles that do not use significance testing. Thus, it is doubtful that any major shifts will occur in the near future. Finally, even if hypothesis testing is completely abandoned, the alternatives (which involve procedures you will learn about in Chapters 5 and 6) require understanding virtually all of the logic and procedures we are covering here.
So what is the big controversy? Some of the debate concerns subtle points of logic. For example, one issue relates to whether it makes sense to worry about rejecting the null hypothesis when a hypothesis of no effect whatsoever is extremely unlikely to be true. Another issue is about the foundation of hypothesis testing in terms of populations and samples, since in most experiments the samples we use are not randomly selected from any definable population. We discussed some points relating to this issue in Chapter 3. Finally, some have questioned the appropriateness of concluding that if the data are inconsistent with the null hypothesis, this should be counted as evidence for the research hypothesis. This controversy becomes rather technical, but our own view is that, given recent considerations of the issues, the way researchers in psychology use hypothesis testing is reasonable (Balluerka et al., 2005; Iacobucci, 2005; Nickerson, 2000).
However, the biggest complaint against significance tests, and the one that has received almost universal agreement, is that they are misused (Balluerka et al., 2005). In fact, opponents of significance tests argue that even if there were no other problems with the tests, they should be banned simply because they are so often and so badly misused. They are misused in two main ways: one we can consider now; the other must wait until we have covered a topic you learn in Chapter 6.
A major misuse of significance tests is the tendency for researchers to decide that if a result is not significant, the null hypothesis is shown to be true (see Box 4–1). We have emphasized that when you can't reject the null hypothesis, the results are simply inconclusive. The error of concluding the null hypothesis is true from failing to reject it is extremely serious, because important theories and methods may be considered false just because a particular study did not get strong enough results. [You learn in Chapter 6 that it is quite easy for a true research hypothesis not to come out significant just because there were too few people in the study or the measures were not very accurate. In fact, Hunter (1997) argues that in about 60% of psychology studies, we are likely to get nonsignificant results even when the research hypothesis is actually true.]
What should be done? The general consensus seems to be that we should keep significance tests, but better train our students not to misuse them (hence the emphasis on these points in this book). We should not, as it were, throw the baby out with the bathwater. To address this controversy, the American Psychological Association (APA) established a committee of eminent psychologists renowned for their statistical expertise. The committee met over a two-year period, circulated a preliminary report, and considered reactions to it from a large number of researchers. In the end, they strongly condemned various misuses of significance testing of the kind we have
been discussing, but they left its use up to the decision of each researcher. In their report they concluded:
Some had hoped that this task force would vote to recommend an outright ban on the use of significance tests in psychology journals. Although this might eliminate some abuses, the committee thought there were enough counterexamples (e.g., Abelson, 1997) to justify forbearance. (Wilkinson & Task Force on Statistical Inference, 1999, pp. 602–603)
Balluerka and colleagues (2005) reviewed the arguments for and against significance testing. Their conclusion, with which we agree (as do probably most psychology researchers), is that ". . . rigorous research activity requires the use of . . . [significance testing] in the appropriate context, the complementary use of other methods which provide information about aspects not addressed by . . . [significance testing], and adherence to a series of recommendations which promote its rational use in psychological research" (p. 55).
BOX 4–1 Jacob Cohen, the Ultimate New Yorker: Funny, Pushy, Brilliant, and Kind

New Yorkers can be proud of Jacob Cohen, who single-handedly introduced to behavioral and social scientists some of our most important statistical tools. Never worried about being popular (although he was), he almost single-handedly forced the current debate over significance testing, which he liked to joke was entrenched like a "secular religion." About the asterisk that accompanies a significant result, he said the religion must be "of Judeo-Christian derivation, as it employs as its most powerful icon a six-pointed cross" (1990, p. 1307).

Jacob entered graduate school at New York University (NYU) in clinical psychology in 1947 and three years later had a masters and a doctorate. He then worked in rather lowly roles for the Veterans Administration, doing research on various practical topics, until he returned to NYU in 1959. There he became a very famous faculty member because of his creative, offbeat ideas about statistics. Amazingly, he made his contributions having no mathematics training beyond high school algebra.

But a lack of formal training may have been Jacob Cohen's advantage because he emphasized looking at data and thinking about them, not just applying a standard analysis. In particular, he demonstrated that the standard methods were not working very well, especially for the "soft" fields of psychology such as clinical, personality, and social psychology. Many of his ideas were hailed as great breakthroughs. Starting in the 1990s he really began to force the issue of the mindless use of significance testing. But he still used humor to tease behavioral and social scientists for their failure to see the problems inherent in the arbitrary yes-no decision feature of null hypothesis testing. For example, he liked to remind everyone that significance testing came out of Sir Ronald Fisher's work in agriculture (see Box 9–1), in which the decisions were yes-no matters such as whether a crop needed manure. He pointed out that behavioral and social scientists "do not deal in manure, at least not knowingly" (Cohen, 1990, p. 1307)! He really disliked the fact that Fisher-style decision making is used to determine the fate of not only doctoral dissertations, research funds, publications, and promotions, "but whether to have a baby just now" (1990, p. 1307). And getting more serious, he charged that significance testing's "arbitrary unreasonable tyranny has led to data fudging of varying degrees of subtlety, from grossly altering data to dropping cases where there 'must have been' errors" (p. 1307).

Cohen was active in many social causes, especially desegregation in the schools and fighting discrimination in police departments. He cared passionately about everything he did. He was deeply loved. And he suffered from major depression, becoming incapacitated by it four times in his life.

Got troubles? Got no more math than high school algebra? It doesn't have to stop you from contributing to science.
Hypothesis Tests in Research Articles

In general, hypothesis testing is reported in research articles using one of the specific methods of hypothesis testing you learn in later chapters. For each result of interest, the researcher usually first indicates whether the result was statistically significant. (Note that, as with the first of the following examples, the researcher will not necessarily use the word significant; so look out for other indicators, such as reporting that scores on a variable decreased, increased, or were associated with scores on another variable.) Next, the researcher usually gives the symbol associated with the specific method used in figuring the probability that the result would have occurred if the null hypothesis was true, such as t, F, r, or χ² (see Chapters 7 to 13). Finally, there will be an indication of the significance level, such as p < .05 or p < .01. (The researcher will usually also provide other information, such as the mean and standard deviation of sample scores.) For example, in a study of competitive Scrabble players, Halpern and Wai (2007) reported: "Contrary to expectations, the number of correctly defined words correlated significantly with participants' official Scrabble rating, r = .45, p < .05, showing a moderate relationship (Cohen & Cohen, 1983), with higher-rated players defining more words correctly." There is a lot here that you will learn about in later chapters, but the key thing to understand now about this result is the "p < .05." This means that the probability of the results if the null hypothesis were true is less than .05 (5%).
When a result is close but does not reach the significance level chosen, it may be reported anyway as a "near significant trend" or as having "approached significance." When a result is not even close to being extreme enough to reject the null hypothesis, it may be reported as "not significant," or the abbreviation ns will be used. Finally, whether or not a result is significant, it is increasingly common for researchers to report the exact p level, such as p = .03 or p = .27 (these are given in computer outputs of hypothesis testing results). The p reported here is based on the proportion of the comparison distribution that is more extreme than the sample score; this is information that you could figure from the Z score for your sample and a normal curve table.
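The exact p levels described above follow from the same normal-curve logic used throughout this chapter. As a rough illustration only (not the output of any particular statistical package; the function name exact_p is ours), the proportion of a normal comparison distribution more extreme than a given Z score can be computed with Python's standard library:

```python
from statistics import NormalDist

def exact_p(z, tails=2):
    """Proportion of the normal comparison distribution more extreme
    than z: one tail for a one-tailed test, both tails for two-tailed."""
    one_tail_area = 1 - NormalDist().cdf(abs(z))  # area beyond |z| in one tail
    return tails * one_tail_area

# A Z score of 2 is slightly beyond the two-tailed 5% cutoff of 1.96,
# so its exact two-tailed p comes out a little under .05:
print(exact_p(2.0))           # two-tailed exact p
print(exact_p(2.0, tails=1))  # one-tailed exact p
```

This mirrors what you could do by hand with a normal curve table: look up the area beyond the Z score, and double it for a two-tailed test.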
A researcher will usually note if he or she used a one-tailed test. When reading research articles, assume the researcher used a two-tailed test if nothing is said otherwise. Even though a researcher has chosen a significance level in advance, such as .05, the researcher may note that results meet more rigorous standards. Thus, in the same article, you may see some results noted as "p < .05," others as "p < .01," and still others as "p < .001."
Finally, often researchers show hypothesis testing results only as asterisks (stars) in a table of results. In such tables, a result with an asterisk means it is significant, while a result without an asterisk is not. For example, Table 4–3 shows the results of part of a study by Bohnert and colleagues (2007) comparing various aspects of social adjustment to college of male and female college students during the summer before their first year of college (Time 1) and 10 months later (Time 2). The table gives figures for means, standard deviations, and t statistics; the "t(83)" is about details of the specific hypothesis testing procedure used in this study, called a t test, which you will learn in Chapters 7 and 8. The important things to look at now are the asterisks (and the notes at the bottom of the table that go with them). The asterisks tell you the significance levels for the various comparisons. For example, females had a higher level of friendship quality at Time 1 (M = 2.82) than males (M = 2.49); thus there are three asterisks at the end of the row for this result, which the note at the bottom tells you means that the probability of getting this big a difference
if the null hypothesis was true is less than one in a thousand (.001). At Time 1, males reported being more lonely (M = 39.30) than females (M = 34.78), but you can see that there was no significant gender difference in loneliness at Time 2 (the means were 37.88 and 34.71, and the lack of an asterisk in this row indicates that these were not different enough to be significant in this study). At Time 2, females again reported a significantly higher level of friendship quality (M = 3.21) than males (M = 2.84); the asterisks show that the difference was significant at the .001 (one in a thousand) level.
In reporting results of significance testing, researchers rarely talk explicitly about the research hypothesis or the null hypothesis, nor do they describe any of the other steps of the process in detail. It is assumed that readers of psychology research understand all of this very well.
Table 4–3 Means and Standard Deviations for Main Study Variables by Gender

                              Total (n = 85)   Males (n = 31)   Females (n = 54)
                              M       SD       M       SD       M       SD        t(83)
Adolescence (Time 1)
  Friendship quality          2.70    0.40     2.49    0.46     2.82    0.32     13.98***
  Loneliness                 36.39    8.71    39.30    9.98    34.78    7.56      5.47*
Emerging adulthood (Time 2)
  Friendship quality          3.10    0.48     2.84    0.57     3.21    0.38     11.31***
  Loneliness                 35.84    9.98    37.88   11.38    34.71    9.21      1.76
  Activities: Intensity       8.09    8.27    10.00   10.19     7.18    7.18      0.98
  Activities: Breadth         1.71    1.06     1.84    1.18     1.65    1.01      0.51

*p < .05. **p < .01. ***p < .001.
Source: Bohnert, A. M., Aikins, J. W., & Edidin, J. (2007). The role of organized activities in facilitating social adaptation across the transition to college. Journal of Adolescent Research, 22, 189–208. Sage Publications, Ltd. Reprinted by permission of Sage Publications, Thousand Oaks, London, and New Delhi.
Summary

1. Hypothesis testing considers the probability that the result of a study could have come about even if the experimental procedure had no effect. If this probability is low, the scenario of no effect is rejected and the hypothesis behind the experimental procedure is supported.
2. The expectation of an effect is the research hypothesis, and the hypothetical situation of no effect is the null hypothesis.
3. When a result (that is, a sample score) is so extreme that the result would be very unlikely if the null hypothesis were true, the researcher rejects the null hypothesis and describes the research hypothesis as supported. If the result is not that extreme, the researcher does not reject the null hypothesis, and the study is inconclusive.
4. Psychologists usually consider a result too extreme if it is less likely than 5% (that is, a significance level of p < .05) to have come about if the null hypothesis were true. Psychologists sometimes use a more stringent 1% (p < .01 significance level), or even .1% (p < .001 significance level), cutoff.
5. The cutoff percentage is the probability of the result being extreme in a predicted direction in a directional or one-tailed test. The cutoff percentages are the probability of the result being extreme in either direction in a nondirectional or two-tailed test.
6. The five steps of hypothesis testing are:
❶ Restate the question as a research hypothesis and a null hypothesis about the populations.
❷ Determine the characteristics of the comparison distribution.
❸ Determine the cutoff sample score on the comparison distribution at which the null hypothesis should be rejected.
❹ Determine your sample's score on the comparison distribution.
❺ Decide whether to reject the null hypothesis.
7. There has been much controversy about significance tests, including critiques of the basic logic and, especially, that they are often misused. One major way researchers misuse significance tests is by interpreting not rejecting the null hypothesis as demonstrating that the null hypothesis is true.
8. Research articles typically report the results of hypothesis testing by saying a result was or was not significant and giving the probability level cutoff (usually 5% or 1%) that the decision was based on.
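As a quick check on the conventional cutoffs summarized above, the Z score marking off a given significance level can be found by inverting the normal curve. This is an illustrative sketch using Python's standard library (the function name cutoff_z is ours, not a standard one):

```python
from statistics import NormalDist

def cutoff_z(significance_level, tails=2):
    """Z score beyond which the most extreme `significance_level`
    proportion of the normal comparison distribution lies: the whole
    area in one tail (tails=1) or split across both tails (tails=2)."""
    tail_area = significance_level / tails
    return NormalDist().inv_cdf(1 - tail_area)

print(round(cutoff_z(0.05), 2))           # two-tailed 5% cutoff
print(round(cutoff_z(0.01), 2))           # two-tailed 1% cutoff
print(round(cutoff_z(0.05, tails=1), 2))  # one-tailed 5% cutoff
```

These are the familiar values from the chapter's normal curve table work: roughly 1.96 for a two-tailed 5% test, 2.58 for a two-tailed 1% test, and 1.64 for a one-tailed 5% test.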
Key Terms

hypothesis testing (p. 107)
hypothesis (p. 107)
theory (p. 107)
research hypothesis (p. 110)
null hypothesis (p. 110)
comparison distribution (p. 111)
cutoff sample score (p. 111)
conventional levels of significance (p < .05, p < .01) (p. 113)
statistically significant (p. 113)
directional hypothesis (p. 119)
one-tailed test (p. 119)
nondirectional hypothesis (p. 119)
two-tailed test (p. 119)
Example Worked-Out Problems

A randomly selected individual, after going through an experimental treatment, has a score of 27 on a particular measure. The scores of people in general on this measure are normally distributed with a mean of 19 and a standard deviation of 4. The researcher predicts an effect, but does not predict a particular direction of effect. Using the 5% significance level, what should you conclude? Solve this problem explicitly using all five steps of hypothesis testing and illustrate your answer with a sketch showing the comparison distribution, the cutoff (or cutoffs), and the score of the sample on this distribution.

Answer
❶ Restate the question as a research hypothesis and a null hypothesis about the populations. There are two populations of interest:

Population 1: People who go through the experimental procedure.
Population 2: People in general (that is, people who do not go through the experimental procedure).

The research hypothesis is that Population 1 will score differently than Population 2 on the particular measure. The null hypothesis is that the two populations are not different on the measure.
Figure 4–9 Diagram for Example Worked-Out Problem showing the comparison distribution, the cutoffs (cutoff Z scores = −1.96 and +1.96, with a 2.5% shaded area in each tail), and the sample score (raw score = 27, Z score = 2).
❷ Determine the characteristics of the comparison distribution: μ = 19, σ = 4, normally distributed.
❸ Determine the cutoff sample score on the comparison distribution at which the null hypothesis should be rejected. For a two-tailed test at the 5% level (2.5% at each tail), the cutoff scores are +1.96 and −1.96 (see Figure 4–6 or Table 4–2).
❹ Determine your sample's score on the comparison distribution. Z = (27 − 19)/4 = 2.
❺ Decide whether to reject the null hypothesis. A Z score of 2 is more extreme than the cutoff Z of ±1.96. Reject the null hypothesis; the result is significant. The experimental procedure affects scores on this measure. The diagram is shown in Figure 4–9.
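The five steps of the worked-out problem above can also be sketched as a short program. This is just an illustration of the logic with the problem's numbers (population mean 19, standard deviation 4, sample score 27, two-tailed 5% test), not the output of statistical software:

```python
from statistics import NormalDist

# Step 2: characteristics of the comparison distribution
# (the known population distribution, which applies if the null hypothesis is true)
mu, sigma = 19, 4

# Step 3: cutoff Z score for a two-tailed test at the 5% level (2.5% in each tail)
significance_level = 0.05
cutoff = NormalDist().inv_cdf(1 - significance_level / 2)  # about 1.96

# Step 4: the sample's score on the comparison distribution
sample_score = 27
z = (sample_score - mu) / sigma  # (27 - 19) / 4 = 2

# Step 5: decide whether to reject the null hypothesis
reject = abs(z) > cutoff  # Z = 2 is more extreme than +/- 1.96, so reject
print(z, reject)
```

Note that the decision in Step 5 compares the absolute value of the sample's Z score to the cutoff because this is a nondirectional (two-tailed) test; for a one-tailed test you would compare against a single cutoff in the predicted direction only.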
Outline for Writing Essays for Hypothesis-Testing Problems Involving a Single Sample of One Participant and a Known Population

1. Describe the core logic of hypothesis testing. Be sure to explain terminology such as research hypothesis and null hypothesis, and explain the concept of providing support for the research hypothesis when the study results are strong enough to reject the null hypothesis.
2. Explain the concept of the comparison distribution. Be sure to mention that it is the distribution that represents the population situation if the null hypothesis is true. Note that the key characteristics of the comparison distribution are its mean, standard deviation, and shape.
3. Describe the logic and process for determining (using the normal curve) the cutoff sample scores on the comparison distribution at which you should reject the null hypothesis.
4. Describe how to figure the sample's score on the comparison distribution.
5. Explain how and why the scores from Steps ❸ and ❹ of the hypothesis-testing process are compared. Explain the meaning of the result of this comparison with regard to the specific research and null hypotheses being tested.

Practice Problems

These problems involve figuring. Most real-life statistics problems are done on a computer with special statistical software. Even if you have such software, do these problems by hand to ingrain the method in your mind.
All data are fictional unless an actual citation is given.

Set I (for Answers to Set I Problems, see pp. 675–677)
1. Define the following terms in your own words: (a) hypothesis-testing procedure, (b) .05 significance level, and (c) two-tailed test.
2. When a result is not extreme enough to reject the null hypothesis, explain why it is wrong to conclude that your result supports the null hypothesis.
3. For each of the following, (a) say which two populations are being compared, (b) state the research hypothesis, (c) state the null hypothesis, and (d) say whether you should use a one-tailed or two-tailed test and why.
   i. Do Canadian children whose parents are librarians score higher than Canadian children in general on reading ability?
   ii. Is the level of income for residents of a particular city different from the level of income for people in the region?
   iii. Do people who have experienced an earthquake have more or less self-confidence than the general population?
4. Based on the information given for each of the following studies, decide whether to reject the null hypothesis. For each, give (a) the Z-score cutoff (or cutoffs) on the comparison distribution at which the null hypothesis should be rejected, (b) the Z score on the comparison distribution for the sample score, and (c) your conclusion. Assume that all populations are normally distributed.
5. Based on the information given for each of the following studies, decide whether to reject the null hypothesis. For each, give (a) the Z-score cutoff (or cutoffs) on the comparison distribution at which the null hypothesis should be rejected, (b) the Z score on the comparison distribution for the sample score, and (c) your conclusion. Assume that all populations are normally distributed.

Study    μ     σ     Sample Score    p      Tails of Test
A       10     2         14         .05    1 (high predicted)
B       10     2         14         .05    2
C       10     2         14         .01    1 (high predicted)
D       10     2         14         .01    2
E       10     4         14         .05    1 (high predicted)