Discrete Data In: Regression Diagnostics

By: John Fox Pub. Date: 2011 Access Date: October 16, 2019 Publishing Company: SAGE Publications, Inc. City: Thousand Oaks Print ISBN: 9780803939714 Online ISBN: 9781412985604 DOI: https://dx.doi.org/10.4135/9781412985604 Print pages: 62-66

© 1991 SAGE Publications, Inc. All Rights Reserved. This PDF has been generated from SAGE Research Methods. Please note that the pagination of the online version will vary from the pagination of the print book.

Discrete Data

Discrete independent and dependent variables often lead to plots that are difficult to interpret. A simple example of this phenomenon appears in Figure 8.1, the data for which are drawn from the 1989 General Social Survey conducted by the National Opinion Research Center. The independent variable, years of education completed, is coded from 0 to 20. The dependent variable is the number of correct answers to a 10-item vocabulary test; note that this variable is a disguised proportion—literally, the proportion correct × 10.

Figure 8.1. Scatterplot (a) and residual plot (b) for vocabulary score by year of education. The least-squares regression line is shown on the scatterplot.

The scatterplot in Figure 8.1a conveys the general impression that vocabulary increases with education. The plot is difficult to read, however, because most of the 968 data points fall on top of one another. The least-squares regression line, also shown on the plot, has the equation

$$\hat{V} = 1.13 + 0.374E$$

where V and E are, respectively, the vocabulary score and education.

Figure 8.1b plots residuals from the fitted regression against education. The diagonal lines running from upper left to lower right in this plot are typical of residuals for a discrete dependent variable: For any one of the 11 distinct y values, e.g., y = 5, the residual is $e = 5 - b_0 - b_1 x = 3.87 - 0.374x$, which is a linear function of x.

I noted a similar phenomenon in Chapter 6 for the plot of residuals against fitted values when y has a fixed minimum score. The diagonals from lower left to upper right are due to the discreteness of x.
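The diagonal pattern can be checked numerically. The following is a minimal Python sketch, not part of the original analysis. It uses the coefficients implied by the residual formula above (b0 = 5 - 3.87 = 1.13 and b1 = 0.374); the function name `residual` and the education values used in the loop are chosen only for illustration.

```python
# Coefficients implied by the text: e = 5 - b0 - b1*x = 3.87 - 0.374x
B0 = 1.13
B1 = 0.374


def residual(y, x, b0=B0, b1=B1):
    """Residual for an observation with response y and predictor x."""
    return y - b0 - b1 * x


# For a fixed vocabulary score, the residuals at successive education
# values fall on a straight line with slope -b1, whatever the score is.
for y in (0, 5, 10):
    line = [round(residual(y, x), 3) for x in (8, 12, 16)]
    print(y, line)
```

Each of the 11 possible scores produces its own parallel line, which is why the residual plot shows a band of downward-sloping diagonals.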

It also appears that the variation of the residuals in Figure 8.1b is lower for the largest and smallest values of education than for intermediate values. This pattern is consistent with the observation that the dependent variable is a disguised proportion: As the average number of correct answers approaches 0 or 10, the potential variation in vocabulary scores decreases. It is possible, however, that at least part of the apparent decrease in residual variation is due to the relative sparseness of data at the extremes of the education scale. Our eye is drawn to the range of residual values, especially because we cannot see most of the data points, and even when variance is constant, the range tends to increase with the amount of data.
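The last point, that the range grows with the number of observations even when the variance is fixed, can be illustrated with a small simulation. The sketch below is not from the original analysis. It assumes standard-normal data, and the function name `mean_range`, the sample sizes, and the number of replications are chosen only for illustration.

```python
import random


def mean_range(n, reps=500, seed=0):
    """Average range (max - min) of n standard-normal draws over reps samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += max(sample) - min(sample)
    return total / reps


# Same variance in both cases; only the sample size differs.
small = mean_range(5)
large = mean_range(200)
print(small, large)
```

Under these assumptions the average range for the larger samples is clearly wider, so a sparse region of a plot can look less variable than a dense one without any change in variance.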

These issues are addressed in Figure 8.2, where each data point has been randomly “jittered” both vertically and horizontally: Specifically, a uniform random variable on the interval [-1/2, 1/2] was added to each education and vocabulary score. This approach to plotting discrete data was suggested by Chambers, Cleveland, Kleiner, and Tukey (1983). The plot also shows the fitted regression line for the original data, along with lines tracing the median and first and third quartiles of the distribution of jittered vocabulary scores for each value of education; I excluded education values below six from the median and quartile traces because of the sparseness of data in this region.
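The jittering step described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used to draw Figure 8.2. The function name `jitter`, the seeds, and the six (education, vocabulary) pairs are hypothetical and are not the General Social Survey data.

```python
import random


def jitter(values, half_width=0.5, seed=None):
    """Add independent uniform noise on [-half_width, half_width] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-half_width, half_width) for v in values]


# Hypothetical scores with heavy overplotting at (12, 6) and (16, 8).
education = [12, 12, 12, 16, 16, 8]
vocabulary = [6, 6, 5, 8, 8, 4]

jittered_x = jitter(education, seed=1)
jittered_y = jitter(vocabulary, seed=2)
for x, y in zip(jittered_x, jittered_y):
    print(round(x, 2), round(y, 2))
```

Because the noise never exceeds half a unit, each jittered point stays inside the unit square centered on its original integer coordinates, so the points remain attributable to the correct category.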

Several features of Figure 8.2 are worth highlighting: (a) It is clear from the jittered data that the observations are particularly dense at 12 years of education, corresponding to high-school graduation; (b) the median trace is quite close to the linear least-squares regression line; and (c) the quartile traces indicate that the spread of y does not decrease appreciably at high values of education.
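Median and quartile traces of the kind shown in Figure 8.2 amount to computing three quantiles of y at each distinct x. The sketch below is one way to do this with the Python standard library. The function name `quartile_trace`, the `min_x` parameter (which mimics the exclusion of sparse education values), and the choice of the "inclusive" quantile method are assumptions made for illustration; the original text does not say which quantile definition was used.

```python
from collections import defaultdict
from statistics import quantiles


def quartile_trace(x, y, min_x=None):
    """Map each distinct x to (Q1, median, Q3) of the y values observed there."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        if min_x is None or xi >= min_x:
            groups[xi].append(yi)
    trace = {}
    for xi in sorted(groups):
        if len(groups[xi]) >= 2:  # quantiles() needs at least two data points
            q1, med, q3 = quantiles(groups[xi], n=4, method="inclusive")
            trace[xi] = (q1, med, q3)
    return trace


# Hypothetical scores: education values below 6 are dropped from the trace.
x = [4, 4, 12, 12, 12, 12, 12]
y = [1, 2, 1, 2, 3, 4, 5]
print(quartile_trace(x, y, min_x=6))
```

To follow the text more closely, the y values passed in would be the jittered vocabulary scores rather than the raw integers.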

A discrete dependent variable violates the assumption that the error in the regression model is normally distributed with constant variance. This problem, like that of a limited dependent variable, is only serious in extreme cases—for example, when there are very few response categories, or where a large proportion of observations is in a small number of categories, conditional on the values of the independent variables.

In contrast, discrete independent variables are perfectly consistent with the regression model, which makes no distributional assumptions about the xs other than that they are uncorrelated with the error. Indeed, a discrete x makes possible a straightforward hypothesis test of nonlinearity, sometimes called a test for “lack of fit.” Likewise, it is relatively simple to test for nonconstant error variance across categories of a discrete independent variable (see below).

Figure 8.2. “Jittered” scatterplot for vocabulary score by education. A small random quantity is added to each horizontal and vertical coordinate. The dashed line is the least-squares regression line for the unjittered data. The solid lines are median and quartile traces for the jittered vocabulary scores.

Testing for Nonlinearity

Suppose, for example, that we model education with a set of dummy regressors rather than specify a linear relationship between vocabulary score and education. Although there are 21 conceivable education scores, ranging from 0 through 20, none of the individuals in the sample has 2 years of education, yielding 20 categories and 19 dummy regressors. The model becomes

$$y_i = \beta_0 + \gamma_1 d_{i1} + \gamma_2 d_{i2} + \cdots + \gamma_{19} d_{i,19} + \varepsilon_i \qquad (8.1)$$

TABLE 8.1 Analysis of Variance for Vocabulary-Test Score, Showing the Incremental F Test for Nonlinearity of the Relationship Between Vocabulary and Education

Contrasting this model with

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad (8.2)$$

produces a test for nonlinearity, because Equation 8.2, specifying a linear relationship, is a special case of Equation 8.1, which captures any pattern of relationship between E(y) and x. The resulting incremental F test for nonlinearity appears in the analysis of variance of Table 8.1. There is, therefore, very strong evidence of a linear relationship between vocabulary and education, but little evidence of nonlinearity.
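For a single discrete x, the incremental F test can be computed directly, because the residual sum of squares of the dummy-regressor model is simply the within-category sum of squares about the category means. The following Python sketch illustrates the arithmetic under that observation. The function name `lack_of_fit_f` and the small data set are hypothetical, and the sketch returns only the F statistic and its degrees of freedom, not a p-value.

```python
from collections import defaultdict


def lack_of_fit_f(x, y):
    """Incremental F for nonlinearity: linear model vs. one mean per x category.

    Assumes at least three distinct x values and some within-category variation.
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    rss_linear = syy - sxy ** 2 / sxx  # residual SS of the straight-line fit

    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    rss_dummy = 0.0  # residual SS of the dummy-regressor model
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss_dummy += sum((yi - group_mean) ** 2 for yi in ys)

    q = len(groups)
    df_num = q - 2  # (q - 1) dummy coefficients minus 1 slope
    df_den = n - q
    f = ((rss_linear - rss_dummy) / df_num) / (rss_dummy / df_den)
    return f, df_num, df_den


# Hypothetical data in which the category means rise and then fall.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 5, 6, 1, 2]
print(lack_of_fit_f(x, y))
```

With q = 20 education categories and 968 observations, the same degrees-of-freedom formulas give 18 and 948 for the nonlinearity test.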

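The F test for nonlinearity can easily be extended to a discrete independent variable (say, $x_1$) in a multiple-regression model. Here, we contrast the more general model

$$y_i = \beta_0 + \gamma_1 d_{i1} + \cdots + \gamma_{q-1} d_{i,q-1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

with a model specifying a linear effect of $x_1$,

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

where $d_1, \ldots, d_{q-1}$ are dummy regressors constructed to represent the q categories of $x_1$.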

Testing for Nonconstant Error Variance

A discrete x (or combination of xs) partitions the data into q groups. Let $y_{ij}$ denote the jth of $n_i$ dependent-variable scores in the ith group. If the error variance is constant, then the within-group variance estimates

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{n_i - 1}$$

should be similar. Here, $\bar{y}_i$ is the mean in the ith group. Tests that examine the $s_i^2$ directly, such as Bartlett’s (1937) commonly employed test, do not maintain their validity well when the errors are non-normal.

Many alternative tests have been proposed. In a large-scale simulation study, Conover, Johnson, and Johnson (1981) demonstrate that the following simple F test is both robust and powerful: Calculate the values $z_{ij} = |y_{ij} - \tilde{y}_i|$, where $\tilde{y}_i$ is the median y within the ith group. Then perform a one-way analysis of variance of the variable z over the q groups. If the error variance is not constant across the groups, then the group means of z will tend to differ, producing a large value of the F test statistic. For the vocabulary data, for example, where education partitions the 968 observations into q = 20 groups, this test gives $F_{19,948} = 1.48$, p = .08, providing weak evidence of nonconstant spread.
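The procedure just described (absolute deviations from group medians, followed by a one-way analysis of variance) can be sketched as follows. This is an illustrative Python implementation, not the code behind the reported result. The function name `median_spread_f` and the two small groups are hypothetical, and the sketch returns the F statistic with its degrees of freedom but no p-value.

```python
from statistics import median


def mean(values):
    return sum(values) / len(values)


def median_spread_f(groups):
    """One-way ANOVA F on z_ij = |y_ij - median_i| for a list of groups.

    Assumes at least two groups and some variation in z within groups.
    """
    z_groups = []
    for ys in groups:
        group_median = median(ys)
        z_groups.append([abs(y - group_median) for y in ys])

    n = sum(len(z) for z in z_groups)
    q = len(z_groups)
    grand_mean = sum(sum(z) for z in z_groups) / n

    # Between-group and within-group sums of squares of z.
    ss_between = sum(len(z) * (mean(z) - grand_mean) ** 2 for z in z_groups)
    ss_within = sum(sum((v - mean(z)) ** 2 for v in z) for z in z_groups)

    f = (ss_between / (q - 1)) / (ss_within / (n - q))
    return f, q - 1, n - q


# Hypothetical groups with the same median but different spread.
tight = [4, 5, 6]
wide = [0, 5, 10]
print(median_spread_f([tight, wide]))
```

With q = 20 groups and 968 observations, the degrees of freedom returned by this calculation are 19 and 948, matching those reported for the vocabulary data.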

http://dx.doi.org/10.4135/9781412985604.n8
