# Non-Normally Distributed Errors

Non-Normally Distributed Errors
In: Regression Diagnostics
By: John Fox
Pub. Date: 2011
Access Date: October 16, 2019
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9780803939714
Online ISBN: 9781412985604
DOI: https://dx.doi.org/10.4135/9781412985604
Print pages: 41-48
The assumption of normally distributed errors is almost always arbitrary. Nevertheless, the central-limit
theorem assures that under very broad conditions inference based on the least-squares estimators is
approximately valid in all but small samples. Why, then, should we be concerned about non-normal errors?
First, although the validity of least-squares estimation is robust—as stated, the levels of tests and confidence
intervals are approximately correct in large samples even when the assumption of normality is violated—the
method is not robust in efficiency: The least-squares estimator is maximally efficient among unbiased
estimators when the errors are normal. For some types of error distributions, however, particularly those with
heavy tails, the efficiency of least-squares estimation decreases markedly. In these cases, the least-squares
estimator becomes much less efficient than alternatives (e.g., so-called robust estimators, or least-squares
augmented by diagnostics). To a substantial extent, heavy-tailed error distributions are problematic because
they give rise to outliers, a problem that I addressed in the previous chapter.
A commonly quoted justification of least-squares estimation—called the Gauss-Markov theorem—states
that the least-squares coefficients are the most efficient unbiased estimators that are linear functions of
the observations yi. This result depends on the assumptions of linearity, constant error variance, and
independence, but does not require normality (see, e.g., Fox, 1984, pp. 42–43). Although the restriction to
linear estimators produces simple sampling properties, it is not compelling in light of the vulnerability of least
squares to heavy-tailed error distributions.
Second, highly skewed error distributions, aside from their propensity to generate outliers in the direction of
the skew, compromise the interpretation of the least-squares fit. This fit is, after all, a conditional mean (of y
given the xs), and the mean is not a good measure of the center of a highly skewed distribution. Consequently,
we may prefer to transform the data to produce a symmetric error distribution.
Finally, a multimodal error distribution suggests the omission of one or more qualitative variables that
divide the data naturally into groups. An examination of the distribution of residuals may therefore motivate
respecification of the model.
Although there are tests for non-normal errors, I shall describe here instead graphical methods for examining
the distribution of the residuals (but see Chapter 9). These methods are more useful for pinpointing the
character of a problem and for suggesting solutions.
Normal Quantile-Comparison Plot of Residuals
One such graphical display is the quantile-comparison plot, which permits us to compare visually the
cumulative distribution of an independent random sample—here of studentized residuals—to a cumulative
reference distribution—the unit-normal distribution. Note that approximations are implied, because the
studentized residuals are t distributed and dependent, but generally the distortion is negligible, at least for
moderate-sized to large samples.
To construct the quantile-comparison plot:
1.
Arrange the studentized residuals in ascending order: t(1), t(2), …, t(n). By convention, the ith
smallest studentized residual, t(i), has gi = (i − 1/2)/n proportion of the data below it. This convention
avoids cumulative proportions of zero and one by (in effect) counting half of each observation
below and half above its recorded value. Cumulative proportions of zero and one would be
problematic because the normal distribution, to which we wish to compare the distribution of the
residuals, never quite reaches cumulative probabilities of zero or one.
2.
Find the quantile of the unit-normal distribution that corresponds to a cumulative probability of gi
— that is, the value zi from Z ∼ N(0, 1) for which Pr(Z < zi) = gi.
3.
Plot the t(i) against the zi.
If the ti were drawn from a unit-normal distribution, then, within the bounds of sampling error, t(i) = zi.
Consequently, we expect to find an approximately linear plot with zero intercept and unit slope, a line that can
be placed on the plot for comparison. Nonlinearity in the plot, in contrast, is symptomatic of non-normality.
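The three steps above can be sketched in a few lines of code. The following is an illustrative sketch (not part of the original text), using only the Python standard library's NormalDist for the unit-normal quantile function:

```python
from statistics import NormalDist

def normal_quantile_pairs(residuals):
    """Pair ordered studentized residuals t(i) with unit-normal
    quantiles z_i, where z_i satisfies Pr(Z < z_i) = (i - 1/2)/n."""
    n = len(residuals)
    t_ordered = sorted(residuals)                 # step 1: t(1) <= ... <= t(n)
    g = [(i - 0.5) / n for i in range(1, n + 1)]  # cumulative proportions g_i
    z = [NormalDist().inv_cdf(gi) for gi in g]    # step 2: unit-normal quantiles
    return list(zip(z, t_ordered))                # step 3: plot t(i) against z_i

# Hypothetical residuals for illustration only:
pairs = normal_quantile_pairs([-1.2, 0.3, -0.4, 1.8, 0.1, -0.7, 0.9, 2.5])
```

The resulting (zi, t(i)) pairs can be passed to any plotting routine; points falling near the line t(i) = zi indicate approximately normal residuals, while systematic curvature signals non-normality.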
It is sometimes advantageous to adjust the fitted line for the observed center and spread of the residuals. To
understand how the adjustment may be accomplished, suppose more generally that a variable X is normally
distributed with mean μ and variance σ². Then, for an ordered sample of values, approximately x(i) = μ +
σzi, where zi is defined as before. In applications, we need to estimate μ and σ, preferably robustly, because
the usual estimators—the sample mean and standard deviation—are markedly affected by extreme values.
Generally effective choices are the median of x to estimate μ and (Q3 − Q1)/1.349 to estimate σ, where Q1 and
Q3 are, respectively, the first and third quartiles of x: The median and quartiles are not sensitive to outliers.
Note that 1.349 is the number of standard deviations separating the quartiles of a normal distribution. Applied
to the studentized residuals, we have the fitted line t̂(i) = median(t) + {[Q3(t) − Q1(t)]/1.349} × zi. The normal
quantile-comparison plots in this monograph employ the more general procedure.
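As a minimal sketch (mine, not the author's), the robustly fitted reference line reduces to two numbers, an intercept and a slope; note that statistics.quantiles uses one of several common quartile conventions, so results may differ slightly from other software:

```python
from statistics import median, quantiles

def robust_reference_line(t):
    """Return (intercept, slope) of the robust reference line
    t_hat(i) = median(t) + {[Q3(t) - Q1(t)]/1.349} * z_i."""
    q1, _, q3 = quantiles(t, n=4)   # first and third sample quartiles
    intercept = median(t)           # robust estimate of the center (mu)
    slope = (q3 - q1) / 1.349       # robust estimate of the spread (sigma)
    return intercept, slope
```

Because the median and quartiles ignore the extremes, one or two wild residuals do not drag the reference line toward themselves, which is exactly the property wanted here.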
Several illustrative normal quantile-comparison plots for simulated data are shown in Figure 5.1. In parts a and b of
the figure, independent samples of size n = 25 and n = 100, respectively, were drawn from a unit-normal
distribution. In parts c and d, samples of size n = 100 were drawn from the highly positively skewed χ₄²
distribution and the heavy-tailed t₂ distribution, respectively. Note how the skew and heavy tails show up as
departures from linearity in the normal quantile-comparison plots. Outliers are discernible as unusually large
or small values in comparison with corresponding normal quantiles.
Judging departures from normality can be assisted by plotting information about sampling variation. If the
studentized residuals were drawn independently from a unit-normal distribution, then

SE(t(i)) ≈ (1/ϕ(zi)) √[gi(1 − gi)/n],

where ϕ(zi) is the probability density (i.e., the “height”) of the unit-normal distribution at Z = zi. Thus, zi ± 2 ×
SE(t(i)) gives a rough 95% confidence interval around the fitted line t̂(i) = zi in the quantile-comparison plot.
If the slope of the fitted line is taken as (Q3 − Q1)/1.349 rather than 1, then the estimated standard error
may be multiplied by this slope. As an alternative to computing standard errors, Atkinson (1985) has suggested a
computationally intensive simulation procedure that does not treat the studentized residuals as independent
and normally distributed.
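The confidence band can be sketched as follows (my illustration, using the order-statistic standard-error formula above; the `multiplier` argument is a hypothetical convenience for rescaling by the robust slope):

```python
from math import sqrt
from statistics import NormalDist

def qq_confidence_band(n, multiplier=1.0):
    """Rough 95% limits z_i +/- 2*SE(t(i)) around the line t_hat(i) = z_i.

    SE(t(i)) = multiplier * sqrt(g_i * (1 - g_i) / n) / phi(z_i), where
    multiplier rescales SE when the slope (Q3 - Q1)/1.349 replaces 1."""
    dist = NormalDist()
    band = []
    for i in range(1, n + 1):
        g = (i - 0.5) / n
        z = dist.inv_cdf(g)
        se = multiplier * sqrt(g * (1 - g) / n) / dist.pdf(z)
        band.append((z, z - 2 * se, z + 2 * se))
    return band
```

Because ϕ(zi) is small in the tails, the band fans out at the extremes, reflecting the greater sampling variability of extreme order statistics.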
Figure 5.1. Illustrative normal quantile-comparison plots. (a) For a sample of n = 25 from N(0, 1). (b)
For a sample of n = 100 from N(0, 1). (c) For a sample of n = 100 from the positively skewed χ₄². (d)
For a sample of n = 100 from the heavy-tailed t₂.
Figure 5.2 shows a normal quantile-comparison plot for the studentized residuals from Duncan’s regression of
rated prestige on occupational income and education levels. The plot includes a fitted line with two-standard-error
limits. Note that the residual distribution is reasonably well behaved.
Figure 5.2. Normal quantile-comparison plot for the studentized residuals from the regression of
occupational prestige on income and education. The plot shows a fitted line, based on the median
and quartiles of the ts, and approximate ±2SE limits around the line.
Histograms of Residuals
A strength of the normal quantile-comparison plot is that it retains high resolution in the tails of the distribution,
where problems often manifest themselves. A weakness of the display, however, is that it does not convey a
good overall sense of the shape of the distribution of the residuals. For example, multiple modes are difficult
to discern in a quantile-comparison plot.
Histograms (frequency bar graphs), in contrast, have poor resolution in the tails or wherever data are
sparse, but do a good job of conveying general distributional information. The arbitrary class boundaries,
arbitrary intervals, and roughness of histograms sometimes produce misleading impressions of the data,
however. These problems can partly be addressed by smoothing the histogram (see Silverman, 1986, or Fox,
1990). Generally, I prefer to employ stem-and-leaf displays—a type of histogram (Tukey, 1977) that records
the numerical data values directly in the bars of the graph—for small samples (say n < 100), smoothed
histograms for moderate-sized samples (say 100 ≤ n ≤ 1,000), and histograms with relatively narrow bars for
large samples (say n > 1,000).
Figure 5.3. Stem-and-leaf display of studentized residuals from the regression of occupational
prestige on income and education.
A stem-and-leaf display of studentized residuals from the Duncan regression is shown in Figure 5.3. The
display reveals nothing of note: There is a single mode, the distribution appears reasonably symmetric, and
there are no obvious outliers, although the largest value (3.1) is somewhat separated from the next-largest
value (2.0).
Each data value in the stem-and-leaf display is broken into two parts: The leading digits comprise the stem;
the first trailing digit forms the leaf; and the remaining trailing digits are discarded, thus truncating rather
than rounding the data value. (Truncation makes it simpler to locate values in a list or table.) For studentized
residuals, it is usually sensible to make this break at the decimal point. For example, for the residuals shown
in Figure 5.3: 0.3039 → 0 |3; 3.1345 → 3 |1; and −0.4981 → −0 |4. Note that each stem digit appears twice,
implicitly producing bins of width 0.5: Stems marked with asterisks (e.g., 1*) take leaves 0–4; stems
marked with periods (e.g., 1.) take leaves 5–9. (For further discussion of stem-and-leaf displays, see,
e.g., Velleman and Hoaglin or Fox.)
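The truncation-based stem-and-leaf construction can be sketched in code. This is a simplified illustration of mine (one stem per integer, rather than the split * / . stems of Figure 5.3):

```python
from collections import defaultdict
from math import trunc

def stem_and_leaf(values):
    """Break each value at the decimal point: the integer part is the stem,
    the first decimal digit is the leaf; remaining digits are truncated."""
    rows = defaultdict(list)
    for v in values:
        stem = trunc(v)                 # e.g., 3.1345 -> stem 3
        leaf = trunc(abs(v) * 10) % 10  # first trailing digit: 1
        # A separate "-0" stem keeps, e.g., -0.4981 distinct from 0.4981:
        key = "-0" if v < 0 and stem == 0 else stem
        rows[key].append(leaf)
    return {k: sorted(ls) for k, ls in rows.items()}

display = stem_and_leaf([0.3039, 3.1345, -0.4981, 0.72, 1.05])
```

Truncating (rather than rounding) reproduces the property noted in the text: each displayed value can be matched directly against a sorted listing of the data.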
Figure 5.4. The family of powers and roots. The transformation labeled “p” is actually y′ = (y^p − 1)/p;
for p = 0, y′ = log_e y.
SOURCE: Adapted with permission from Figure 4-1 from Hoaglin, Mosteller, and Tukey (eds.), Understanding
Robust and Exploratory Data Analysis, © 1983 by John Wiley and Sons, Inc.
Correcting Asymmetry by Transformation
A frequently effective approach to a variety of problems in regression analysis is to transform the data so that
they conform more closely to the assumptions of the linear model. In this and later chapters I shall introduce
transformations to produce symmetry in the error distribution, to stabilize error variance, and to make the
relationship between y and the xs linear.
In each of these cases, we shall employ the family of powers and roots, replacing a variable y (used here
generically, because later we shall want to transform xs as well) by y′ = y^p. Typically, p = −2, −1, −1/2, 1/2, 2, or
3, although sometimes other powers and roots are considered. Note that p = 1 represents no transformation.
In place of the 0th power, which would be useless because y^0 = 1 regardless of the value of y, we take y′ =
log y, usually using base 2 or 10 for the log function. Because logs to different bases differ only by a constant
factor, we can select the base for convenience of interpretation. Using the log transformation as a “zeroth
power” is reasonable, because the closer p gets to zero, the more y^p looks like the log function (formally,
lim_{p→0} [(y^p − 1)/p] = log_e y, where the log to the base e ≈ 2.718 is the so-called “natural” logarithm). Finally, for
negative powers, we take y′ = −y^p, preserving the order of the y values, which would otherwise be reversed.
As we move away from p = 1 in either direction, the transformations get stronger, as illustrated in Figure 5.4.
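The family described above can be written compactly in code. This sketch of mine implements the simple variant from the text (y^p with the log as the zeroth power and a sign flip for negative powers), not the rescaled (y^p − 1)/p form of Figure 5.4:

```python
from math import log

def power_transform(y, p):
    """Family of powers and roots: y' = y**p for p > 0, log(y) for p = 0,
    and -(y**p) for p < 0 (the minus sign preserves order). Requires y > 0."""
    if y <= 0:
        raise ValueError("power transformations require positive values; add a start")
    if p == 0:
        return log(y)      # the 'zeroth power' is the log
    if p < 0:
        return -(y ** p)   # e.g., p = -1 gives y' = -1/y
    return y ** p
```

For instance, without the minus sign the reciprocal would map 2 and 3 to 0.5 and 0.333, reversing their order; with it, −0.5 < −0.333 and the original ordering survives.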
The effect of some of these transformations is shown in Table 5.1a. Transformations “up the ladder” of powers
and roots (a term borrowed from Tukey, 1977)—that is, toward y^2—serve differentially to spread out large
values of y relative to small ones; transformations “down the ladder”—toward log y—have the opposite effect.
To correct a positive skew (as in Table 5.1b), it is therefore necessary to move down the ladder; to correct a
negative skew (Table 5.1c), which is less common in applications, move up the ladder.
I have implicitly assumed that all data values are positive, a condition that must hold for power transformations
to maintain order. In practice, negative values can be eliminated prior to transformation by adding a small
constant, sometimes called a “start,” to the data. Likewise, for power transformations to be effective, the ratio
of the largest to the smallest data value must be sufficiently large; otherwise the transformation will be too
nearly linear. A small ratio can be dealt with by using a negative start.
In the specific context of regression analysis, a skewed error distribution, revealed by examining the
distribution of the residuals, can often be corrected by transforming the dependent variable. Although more
sophisticated approaches are available (see, e.g., Chapter 9), a good transformation can be located by trial
and error.
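The trial-and-error search can be automated by scoring each candidate power with a symmetry measure. This is my illustrative sketch, not a procedure from the text; it uses a quartile-based (Bowley) skewness, which, like the median and quartiles earlier, is robust to outliers:

```python
from math import log
from statistics import quantiles

def quartile_skew(data):
    """Bowley's quartile skewness: 0 for a symmetric distribution,
    positive for a positive skew, negative for a negative skew."""
    q1, q2, q3 = quantiles(data, n=4)
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

def best_ladder_power(y, powers=(-1, -0.5, 0, 0.5, 1)):
    """Pick the ladder power whose transform is closest to symmetric."""
    def transform(v, p):
        if p == 0:
            return log(v)
        return -(v ** p) if p < 0 else v ** p
    return min(powers,
               key=lambda p: abs(quartile_skew([transform(v, p) for v in y])))

# Hypothetical positively skewed data: the chosen power should lie
# below 1, i.e., down the ladder, as the text prescribes.
skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
p_best = best_ladder_power(skewed)
```

This merely mechanizes the visual trial and error described in the text; in practice one would also inspect the residual distribution at the chosen power rather than trust the score alone.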
Dependent variables that are bounded below, and hence that tend to be positively skewed, often respond
well to transformations down the ladder of powers. Power transformations usually do not work well, however,
when many values stack up against the boundary, a situation termed truncation or censoring (see, e.g., Tobin
for a treatment of “limited” dependent variables in regression). As well, data that are bounded both
above and below—such as proportions and percentages—generally require another approach. For example,
the logit or “log odds” transformation, given by y′ = log[y/(1 − y)], often works well for proportions.
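The logit transformation is straightforward to compute (sketch mine); note that it is undefined at exactly 0 or 1, where a small start of the kind discussed earlier is again needed:

```python
from math import log

def logit(y):
    """Log-odds transform y' = log[y/(1 - y)] for a proportion 0 < y < 1."""
    if not 0 < y < 1:
        raise ValueError("logit requires 0 < y < 1; adjust boundary values first")
    return log(y / (1 - y))

# The transform is symmetric about y = 0.5, where logit(0.5) = 0,
# and stretches the scale near both boundaries.
```

Unlike a single power transformation, the logit spreads out values near both 0 and 1 simultaneously, which is why it suits doubly bounded data such as proportions.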