# Nonlinearity

Nonlinearity In: Regression Diagnostics

By: John Fox Pub. Date: 2011 Access Date: October 16, 2019 Publishing Company: SAGE Publications, Inc. City: Thousand Oaks Print ISBN: 9780803939714 Online ISBN: 9781412985604 DOI: https://dx.doi.org/10.4135/9781412985604 Print pages: 54-61

© 1991 SAGE Publications, Inc. All Rights Reserved. This PDF has been generated from SAGE Research Methods. Please note that the pagination of the online version will vary from the pagination of the print book.


The assumption that $E(\varepsilon)$ is everywhere zero implies that the specified regression surface captures the dependency of y on the xs. Violating the assumption of linearity therefore implies that the model fails to capture the systematic pattern of relationship between the dependent and independent variables. For example, a partial relationship specified to be linear may be nonlinear, or two independent variables specified to have additive partial effects may interact in determining y. Nevertheless, the fitted model is frequently a useful approximation even if the regression surface $E(y)$ is not precisely captured. In other instances, however, the model can be extremely misleading.

The regression surface is generally high dimensional, even after accounting for regressors (such as polynomial terms, dummy variables, and interactions) that are functions of a smaller number of fundamental independent variables. As in the case of nonconstant error variance, therefore, it is necessary to focus on particular patterns of departure from linearity. The graphical diagnostics discussed in this chapter represent two-dimensional views of the higher-dimensional point-cloud of observations $\{y_i, x_{1i}, \ldots, x_{ki}\}$. With modern computer graphics, the ideas here can usefully be extended to three dimensions, permitting, for example, the detection of two-way interactions between independent variables (Monette, 1990).

Residual and Partial-Residual Plots

Although it is useful in multiple regression to plot y against each x, these plots do not tell the whole story—and can be misleading—because our interest centers on the partial relationship between y and each x, controlling for the other xs, not on the marginal relationship between y and a single x. Residual-based plots are consequently more relevant in this context.

Plotting residuals or studentized residuals against each x, perhaps augmented by a lowess smooth (see Appendix A6.1), is helpful for detecting departures from linearity. As Figure 7.1 illustrates, however, residual plots cannot distinguish between monotone (i.e., strictly increasing or decreasing) and nonmonotone (e.g., falling and then rising) nonlinearity. The distinction is lost in the residual plots because the least-squares fit ensures that the residuals are linearly uncorrelated with each x. The distinction is important, because, as I shall explain below, monotone nonlinearity frequently can be corrected by simple transformations. In Figure 7.1, for example, case a might be modeled by $y = \beta_0 + \beta_1 x^2 + \varepsilon$, whereas case b cannot be linearized by a power transformation of x and might instead be dealt with by a quadratic specification, $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$. (Case b could, however, be accommodated by a more complex transformation of x: $y = \beta_0 + \beta_1 (x - \alpha)^2 + \varepsilon$; I shall not pursue this approach here.)


Figure 7.1. Scatterplots (a and b) and corresponding residual plots (a′ and b′) in simple regression. The residual plots do not distinguish between (a) a nonlinear but monotone relationship, and (b) a nonlinear, nonmonotone relationship.
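The point made by Figure 7.1 can be checked numerically. The following is a minimal sketch, not a reproduction of the figure: the two small data sets are hypothetical, chosen so that one relationship rises throughout and the other falls and then rises. In both cases the least-squares residuals have zero covariance with x, and here they even turn out to be identical, so the residuals alone cannot say which kind of nonlinearity is present.

```python
# Sketch: least-squares residuals are uncorrelated with x whether the
# true relationship is monotone or not. All data values are hypothetical.

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals from the simple least-squares regression of y on x."""
    b0, b1 = simple_ols(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def covariance(u, v):
    """Sample covariance of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

# Monotone case: y = x^2 rises throughout the observed range of x.
x_mono = [1.0, 2.0, 3.0, 4.0, 5.0]
y_mono = [xi ** 2 for xi in x_mono]

# Nonmonotone case: y = (x - 1)^2 falls and then rises.
x_non = [-1.0, 0.0, 1.0, 2.0, 3.0]
y_non = [(xi - 1.0) ** 2 for xi in x_non]

e_mono = residuals(x_mono, y_mono)
e_non = residuals(x_non, y_non)
cov_mono = covariance(x_mono, e_mono)
cov_non = covariance(x_non, e_non)

print(e_mono, cov_mono)
print(e_non, cov_non)
```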

In contrast to simple residual plots, partial-regression plots, introduced in Chapter 4 for detecting influential data, can reveal nonlinearity and suggest whether a relationship is monotone. These plots are not always useful for locating a transformation, however: The partial-regression plot adjusts $x_j$ for the other xs, but it is the unadjusted $x_j$ that is transformed in respecifying the model. Partial-residual plots, also called component-plus-residual plots, are often an effective alternative. Partial-residual plots are not as suitable as partial-regression plots for revealing leverage and influence.

Define the partial residual for the jth regressor as

$$e_i^{(j)} = e_i + b_j x_{ji}$$

In words, add back the linear component of the partial relationship between y and $x_j$ to the least-squares residuals, which may include an unmodeled nonlinear component. Then plot $e^{(j)}$ versus $x_j$. By construction, the multiple-regression coefficient $b_j$ is the slope of the simple linear regression of $e^{(j)}$ on $x_j$, but nonlinearity should be apparent in the plot as well. Again, a lowess smooth may help in interpreting the plot.
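The construction just described can be sketched in a few lines. This is an illustration under stated assumptions, not the computation behind the figures in this chapter: the data are synthetic, and `solve` and `ols` are small hypothetical helpers written for the example. The sketch fits a two-regressor model, forms the partial residuals for the first regressor, and confirms that their simple-regression slope on that regressor reproduces the multiple-regression coefficient.

```python
# Sketch: partial (component-plus-residual) residuals for one regressor.
# Data are synthetic; solve() and ols() are illustrative helpers only.

def solve(a, b):
    """Solve the square linear system a z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data: y is quadratic in x1 and linear in x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [1.0 + 0.5 * a ** 2 + 2.0 * c for a, c in zip(x1, x2)]

# Fit the (misspecified) linear model and take ordinary residuals.
b = ols([x1, x2], y)
fitted = [b[0] + b[1] * a + b[2] * c for a, c in zip(x1, x2)]
e = [yi - fi for yi, fi in zip(y, fitted)]

# Partial residuals for x1: add the linear component b1 * x1 back in.
partial_res_1 = [ei + b[1] * a for ei, a in zip(e, x1)]

# The simple-regression slope of the partial residuals on x1 equals b1.
slope_check = ols([x1], partial_res_1)[1]
print(b[1], slope_check)
```

Plotting `partial_res_1` against `x1` would show the curvature left out of the linear fit, since the residuals `e` still carry the quadratic component.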

The partial-residual plots in Figure 7.2 are for a regression of the rated prestige P of 102 Canadian occupations (from Pineo and Porter, 1967) on the average education (E) in years, average income (I) in dollars, and percentage of women (W) in the occupations in 1971. (Related results appear in Fox and Suschnigg; cf. Duncan’s regression for similar U.S. data reported in Chapter 4.) A lowess smooth is shown in each plot. The results of the regression are as follows:

$$\hat{P} = \underset{(3.24)}{-6.79} + \underset{(0.39)}{4.19}\,E + \underset{(0.00028)}{0.00131}\,I - \underset{(0.0304)}{0.00891}\,W$$

$$R^2 = 0.80 \qquad s = 7.85$$

Note that the magnitudes of the regression coefficients should not be compared, because the independent variables are measured in different units: In particular, the unit for income is small—the dollar—and that for education is comparatively large—the year. Interpreting the regression coefficients in light of the units of the corresponding independent variables, the education and income coefficients are both substantial, whereas the coefficient for percentage of women is very small.
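To make the point about units concrete, the coefficients can be rescaled to increments of a more comparable size. The sketch below uses the coefficients reported above; the increments themselves (one year of education, $1,000 of income, and the full 0 to 100 range of percentage of women) are illustrative choices made for this example, not values taken from the text.

```python
# Coefficients from the fitted equation reported in the text.
b_education = 4.19     # prestige points per year of education
b_income = 0.00131     # prestige points per dollar of income
b_women = -0.00891     # prestige points per percentage point of women

# Illustrative increments (assumptions chosen for this sketch).
effect_one_year = b_education * 1          # one additional year
effect_1000_dollars = b_income * 1000      # an additional $1,000
effect_full_range_women = b_women * 100    # 0% women versus 100% women

print(round(effect_one_year, 2))
print(round(effect_1000_dollars, 2))
print(round(effect_full_range_women, 3))
```

On these scales the education and income effects are each more than a prestige point, whereas the percentage-of-women effect stays below one point even across its entire range.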

There is apparent monotone nonlinearity in the partial-residual plots for education and, much more strongly, income (Figure 7.2, parts a and b); there also is a small apparent tendency for occupations with intermediate percentages of women to have lower prestige, controlling for income and educational levels (Figure 7.2c). To my eye, the patterns in the partial-residual plots for education and percentage of women are not easily discernible without the lowess smooth: The departure from linearity is not great. The nonlinear patterns for income and percentage of women are simple: In the first case, the lowess curve opens downwards; in the second case, it opens upwards. For education, however, the direction of curvature changes, producing a more complex nonlinear pattern.


Figure 7.2. Partial-residual plots for the regression of the rated prestige of 102 Canadian occupations on 1971 occupational characteristics: (a) education, (b) income, and (c) percentage of women. The observation index is plotted for each point. In each graph, the linear least-squares fit (broken line) and the lowess smooth (solid line for f = 0.5 with 2 robustness iterations) are shown.

SOURCE: Data taken from B. Blishen, W. Carroll, and C. Moore, personal communication; Census of Canada (Statistics Canada, 1971, Part 6, pp. 19.1–19.21); Pineo and Porter (1967).

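Mallows (1986) has suggested a variation on the partial-residual plot that sometimes reveals nonlinearity more clearly: First, add a quadratic term in $x_j$ to the model, which becomes

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_j x_{ji} + \gamma x_{ji}^2 + \cdots + \beta_k x_{ki} + \varepsilon_i$$

Then, after fitting the model, form the “augmented” partial residual

$$e_i^{\prime(j)} = e_i^{\prime} + b_j x_{ji} + c\,x_{ji}^2$$

where $e_i^{\prime}$ is the residual from the augmented fit and c is the least-squares estimate of $\gamma$. Note that in general $b_j$ differs from the regression coefficient for $x_j$ in the original model, which does not include the squared term. Finally, plot $e^{\prime(j)}$ versus $x_j$.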

Transformations for Linearity


To consider how power transformations can serve to linearize a monotone nonlinear relationship, examine Figure 7.3. Here, I have plotted $y = \frac{1}{5}x^2$ for x = 1, 2, 3, 4, 5. By construction, the relationship can be linearized by taking $x' = x^2$, in which case $y = \frac{1}{5}x'$; or by taking $y' = \sqrt{5y}$, in which case $y' = x$. Figure 7.3 reveals how each transformation serves to stretch one of the axes differentially, pulling the curve into a straight line.
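The two linearizations in this example are easy to verify directly. The short sketch below recomputes the five plotted points and checks that each transformation yields an exactly linear relationship; the variable names are chosen for the illustration.

```python
import math

# The five points plotted in the example: y = (1/5) x^2.
x = [1, 2, 3, 4, 5]
y = [xi ** 2 / 5 for xi in x]

# Option 1: transform x up the ladder, x' = x^2, so that y = (1/5) x'.
x_prime = [xi ** 2 for xi in x]
ratio_after_x = [yi / xp for yi, xp in zip(y, x_prime)]

# Option 2: transform y down the ladder, y' = sqrt(5 y), so that y' = x.
y_prime = [math.sqrt(5 * yi) for yi in y]
diff_after_y = [yp - xi for yp, xi in zip(y_prime, x)]

print(ratio_after_x)   # constant ratio, so y is linear in x'
print(diff_after_y)    # zero differences, so y' equals x
```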

As illustrated in Figure 7.4, there are four simple patterns of monotone nonlinear relationships. Each can be straightened by moving y, x, or both up or down the ladder of powers and roots: The direction of curvature determines the direction of movement on the ladder; Tukey (1977) calls this the “bulging rule.” Specific transformations to linearity can be located by trial and error (but see Chapter 9 for an analytic approach).
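Trial and error along the ladder can be mechanized in a rough way. The sketch below is one possible approach rather than a procedure given in the text: it tries a handful of powers of x, treating the power 0 as the log by the usual convention, and scores each by the absolute correlation between transformed x and y. Both the candidate powers and the use of correlation as a criterion are assumptions made for the illustration; in practice the partial-residual plots should be inspected as well.

```python
import math

def power(v, p):
    """Ladder-of-powers transform of a positive value; p = 0 means log."""
    return math.log(v) if p == 0 else v ** p

def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Same toy relationship as in the example: y = (1/5) x^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 / 5 for xi in x]

# Candidate rungs of the ladder (an illustrative selection).
ladder = [-1, -0.5, 0, 0.5, 1, 2, 3]
scores = {p: abs(correlation([power(xi, p) for xi in x], y)) for p in ladder}
best_p = max(scores, key=scores.get)

for p in ladder:
    print(p, round(scores[p], 4))
print("best power:", best_p)
```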

In multiple regression, the bulging rule may be applied to the partial-residual plots. Generally, we transform $x_j$ in preference to y, because changing the scale of y disturbs its relationship to other regressors and because transforming y changes the error distribution. An exception occurs when similar nonlinear patterns are observed in all of the partial-residual plots. Furthermore, the logit transformation often helps for dependent variables that are proportions.

Figure 7.3. How a transformation of y (a to b) or x (a to c) can make a simple monotone nonlinear relationship linear.

As suggested in connection with Figure 7.1b, nonmonotone nonlinearity (and some complex monotone patterns) frequently can be accommodated by fitting polynomial functions in an x; quadratic specifications are often useful in applications. As long as the model remains linear in its parameters, it may be fit by linear least-squares regression.
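A quadratic specification illustrates why linearity in the parameters is what matters: the squared term is just another column of the design matrix. The following sketch, with synthetic noise-free data and a hypothetical `solve` helper, fits $y = \beta_0 + \beta_1 x + \beta_2 x^2$ by ordinary least squares through the normal equations and recovers the generating coefficients.

```python
# Sketch: a quadratic in x is linear in its parameters, so ordinary
# least squares applies. Data are synthetic; solve() is a helper.

def solve(a, b):
    """Solve the square linear system a z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# A nonmonotone relationship: falls and then rises over the range of x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [3.0 - 2.0 * xi + 0.5 * xi ** 2 for xi in x]

# Design matrix with columns 1, x, and x^2.
design = [[1.0, xi, xi ** 2] for xi in x]
k = 3
xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]

b0, b1, b2 = solve(xtx, xty)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```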

Trial-and-error experimentation with the Canadian occupational prestige data leads to the log transformation of income. The possibly curvilinear partial relationship of prestige to the percentage of women in the occupations suggests the inclusion of linear and quadratic terms for this independent variable. These changes produce a modest, though discernible, improvement in the fit of the model:

$$\hat{P} = \underset{(15)}{-111} + \underset{(0.35)}{3.77}\,E + \underset{(1.30)}{9.36}\,\log_2 I - \underset{(0.087)}{0.139}\,W + \underset{(0.00094)}{0.00215}\,W^2$$

$$R^2 = 0.84 \qquad s = 6.95$$

Figure 7.4. Determining a transformation to linearity by the “bulging rule.”

Note the statistically significant quadratic term for percentage of women. The partial effect of this variable is relatively small, however, ranging from a minimum of -2.2 prestige points for an occupation with 32% women to 7.6 points for a hypothetical occupation consisting entirely of women. Because the nonlinear pattern in the partial-residual plot for education is complex, a power transformation of this independent variable is not promising: Trial and error suggests that the best that we can do is to increase $R^2$ to 0.85 by squaring education.
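The range quoted for the partial effect of percentage of women follows from the two fitted coefficients. As a worked check, the sketch below evaluates $-0.139W + 0.00215W^2$, which measures the effect relative to an occupation with no women, at the vertex of the parabola and at W = 100. The function name is chosen for the illustration.

```python
# Coefficients for W and W^2 from the respecified regression in the text.
b_w, b_w2 = -0.139, 0.00215

def partial_effect(w):
    """Partial effect of percentage of women, relative to W = 0."""
    return b_w * w + b_w2 * w ** 2

# The quadratic opens upward, so its minimum is at W = -b_w / (2 * b_w2).
w_min = -b_w / (2 * b_w2)
effect_min = partial_effect(w_min)
effect_all_women = partial_effect(100)

print(round(w_min), round(effect_min, 1), round(effect_all_women, 1))
```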

In transforming data or respecifying the functional form of the model, there should be an interplay between substantive and modeling considerations. We must recognize, however, that social theories are almost never mathematically concrete: Theory may tell us that prestige should increase with income, but it does not specify the functional form of the relationship.


Still, in certain contexts, specific transformations may have advantages of interpretability. For example, log transformations often can be given meaningful substantive interpretation: To increase $\log_2 x$ by 1, for instance, represents a doubling of x. In the respecified Canadian occupational prestige regression, therefore, doubling income is associated on average with a 9-point increment in prestige, holding education and gender composition constant.
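The doubling interpretation can be confirmed with the fitted coefficient for $\log_2 I$. In the sketch below, the base income of 5,000 dollars is a hypothetical value; the result does not depend on it, because only the ratio of the two incomes enters the difference in logs.

```python
import math

# Coefficient of log2(income) from the respecified regression in the text.
b_log2_income = 9.36

def income_component(income):
    """Contribution of income to fitted prestige, in prestige points."""
    return b_log2_income * math.log2(income)

base = 5000.0  # hypothetical income in dollars; any positive value works

change_doubling = income_component(2 * base) - income_component(base)
change_quadrupling = income_component(4 * base) - income_component(base)

print(round(change_doubling, 2), round(change_quadrupling, 2))
```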

Likewise, the square root of an area or cube root of a volume can be interpreted as a linear measure of distance or length, the inverse of the amount of time required to traverse a particular distance is speed, and so on. If both y and $x_j$ are log-transformed, then the regression coefficient for $x'_j$ is interpretable as the “elasticity” of y with respect to $x_j$, that is, the approximate percentage change in y corresponding to a 1% change in $x_j$. In many contexts, a quadratic relationship will have a clear substantive interpretation (in the example, occupations with a gender mix appear to pay a small penalty in prestige), but a fourth-degree polynomial may not.
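The elasticity interpretation can be illustrated with a small synthetic example. The data below are generated from $y = 3x^{0.7}$, an arbitrary choice for this sketch, so the log-log slope should equal 0.7, and a 1% increase in x should raise y by approximately 0.7%.

```python
import math

def simple_slope(u, v):
    """Slope of the simple least-squares regression of v on u."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    return suv / suu

# Synthetic, noise-free data from y = 3 * x^0.7 (an assumption of the sketch).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * xi ** 0.7 for xi in x]

# Regressing log y on log x recovers the exponent as the slope.
elasticity = simple_slope([math.log(a) for a in x], [math.log(b) for b in y])

# Direct check: percentage change in y when x increases by 1%.
x0 = 10.0
pct_change_y = 100 * ((1.01 * x0) ** 0.7 / x0 ** 0.7 - 1)

print(round(elasticity, 6), round(pct_change_y, 4))
```

The direct percentage change is slightly below the elasticity, which is why the interpretation is described as approximate.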

Finally, although it is desirable to maintain simplicity and interpretability, it is not reasonable to distort the data by insisting on a functional form that is clearly inadequate. It is possible, in any event, to display the fitted relationship between y and an x graphically or in a table, using the original scales of the variables if they have been transformed, or to describe the effect at a few strategic x values (see, e.g., the brief description above of the partial effect of percentage of women on occupational prestige).

http://dx.doi.org/10.4135/9781412985604.n7
