# serialCorrelationTest: Test for the Presence of Serial Correlation In EnvStats: Package for Environmental Statistics, Including US EPA Guidance

## Description

serialCorrelationTest is a generic function used to test for the presence of lag-one serial correlation using either the rank von Neumann ratio test, the normal approximation based on the Yule-Walker estimate of lag-one correlation, or the normal approximation based on the MLE of lag-one correlation. The function invokes particular methods which depend on the class of the first argument.

Currently, there is a default method and a method for objects of class "lm".

## Usage

 1 2 3 4 5 6 7 8 9  serialCorrelationTest(x, ...) ## Default S3 method: serialCorrelationTest(x, test = "rank.von.Neumann", alternative = "two.sided", conf.level = 0.95, ...) ## S3 method for class 'lm' serialCorrelationTest(x, test = "rank.von.Neumann", alternative = "two.sided", conf.level = 0.95, ...) 

## Arguments

 x numeric vector of observations, a numeric univariate time series of class "ts", or an object of class "lm". Undefined (NaN) and infinite (Inf, -Inf) values are not allowed for x when x is a numeric vector or time series, nor for the residuals associated with x when x is an object of class "lm". When test="AR1.mle", missing (NA) values are allowed, otherwise they are not allowed. When x is a numeric vector of observations or a numeric univariate time series of class "ts", it must contain at least 3 non-missing values. When x is an object of class "lm", the residuals must contain at least 3 non-missing values. Note: when x is an object of class "lm", the linear model should have been fit using the argument na.action=na.exclude in the call to lm in order to correctly deal with missing values. test character string indicating which test to use. The possible values are: "rank.von.Neumann" (rank von Neumann ratio test; the default), "AR1.yw" (z-test based on Yule-Walker lag-one estimate of correlation), and "AR1.mle" (z-test based on MLE of lag-one correlation). alternative character string indicating the kind of alternative hypothesis. The possible values are "two.sided" (the default), "greater", and "less". conf.level numeric scalar between 0 and 1 indicating the confidence level associated with the confidence interval for the population lag-one autocorrelation. The default value is conf.level=0.95. ... optional arguments for possible future methods. Currently not used.

## Details

Let \underline{x} = x_1, x_2, …, x_n denote n observations from a stationary time series sampled at equispaced points in time with normal (Gaussian) errors. The function serialCorrelationTest tests the null hypothesis:

H_0: ρ_1 = 0 \;\;\;\;\;\; (1)

where ρ_1 denotes the true lag-1 autocorrelation (also called the lag-1 serial correlation coefficient). Actually, the null hypothesis is that the lag-k autocorrelation is 0 for all values of k greater than 0 (i.e., the time series is purely random).

In the case when the argument x is a linear model, the function serialCorrelationTest tests the null hypothesis (1) for the residuals.

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

H_a: ρ_1 > 0 \;\;\;\;\;\; (2)

the lower one-sided alternative (alternative="less"):

H_a: ρ_1 < 0 \;\;\;\;\;\; (3)

and the two-sided alternative:

H_a: ρ_1 \ne 0 \;\;\;\;\;\; (4)

Testing the Null Hypothesis of No Lag-1 Autocorrelation
There are several possible methods for testing the null hypothesis (1) versus any of the three alternatives (2)-(4). The function serialCorrelationTest allows you to use one of three possible tests:

• The rank von Neuman ratio test.

• The test based on the normal approximation for the distribution of the Yule-Walker estimate of lag-one correlation.

• The test based on the normal approximation for the distribution of the maximum likelihood estimate (MLE) of lag-one correlation.

Each of these tests is described below.

Test Based on Yule-Walker Estimate (test="AR1.yw")
The Yule-Walker estimate of the lag-1 autocorrelation is given by:

\hat{ρ}_1 = \frac{\hat{γ}_1}{\hat{γ}_0} \;\;\;\;\;\; (5)

where

\hat{γ}_k = \frac{1}{n} ∑_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x}) \;\;\;\;\;\; (6)

is the estimate of the lag-k autocovariance. (This estimator does not allow for missing values.)

Under the null hypothesis (1), the estimator of lag-1 correlation in Equation (5) is approximately distributed as a normal (Gaussian) random variable with mean 0 and variance given by:

Var(\hat{ρ}_1) \approx \frac{1}{n} \;\;\;\;\;\; (7)

(Box and Jenkins, 1976, pp.34-35). Thus, the null hypothesis (1) can be tested with the statistic

z = √{n} \hat{ρ_1} \;\;\;\;\;\; (8)

which is distributed approximately as a standard normal random variable under the null hypothesis that the lag-1 autocorrelation is 0.

Test Based on the MLE (test="AR1.mle")
The function serialCorrelationTest the R function arima to compute the MLE of the lag-one autocorrelation and the estimated variance of this estimator. As for the test based on the Yule-Walker estimate, the z-statistic is computed as the estimated lag-one autocorrelation divided by the square root of the estimated variance.

Test Based on Rank von Neumann Ratio (test="rank.von.Neumann")
The null distribution of the serial correlation coefficient may be badly affected by departures from normality in the underlying process (Cox, 1966; Bartels, 1977). It is therefore a good idea to consider using a nonparametric test for randomness if the normality of the underlying process is in doubt (Bartels, 1982).

Wald and Wolfowitz (1943) introduced the rank serial correlation coefficient, which for lag-1 autocorrelation is simply the Yule-Walker estimate (Equation (5) above) with the actual observations replaced with their ranks.

von Neumann et al. (1941) introduced a test for randomness in the context of testing for trend in the mean of a process. Their statistic is given by:

V = \frac{∑_{i=1}^{n-1}(x_i - x_{i+1})^2}{∑_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (9)

which is the ratio of the square of successive differences to the usual sums of squared deviations from the mean. This statistic is bounded between 0 and 4, and for a purely random process is symmetric about 2. Small values of this statistic indicate possible positive autocorrelation, and large values of this statistics indicate possible negative autocorrelation. Durbin and Watson (1950, 1951, 1971) proposed using this statistic in the context of checking the independence of residuals from a linear regression model and provided tables for the distribution of this statistic. This statistic is therefore often called the “Durbin-Watson statistic” (Draper and Smith, 1998, p.181).

The rank version of the von Neumann ratio statistic is given by:

V_{rank} = \frac{∑_{i=1}^{n-1}(R_i - R_{i+1})^2}{∑_{i=1}^n (R_i - \bar{R})^2} \;\;\;\;\;\; (10)

where R_i denotes the rank of the i'th observation (Bartels, 1982). (This test statistic does not allow for missing values.) In the absence of ties, the denominator of this test statistic is equal to

∑_{i=1}^n (R_i - \bar{R})^2 = \frac{n(n^2 - 1)}{12} \;\;\;\;\;\; (11)

The range of the V_{rank} test statistic is given by:

[\frac{12}{(n)(n+1)} , 4 - \frac{12}{(n)(n+1)}] \;\;\;\;\;\; (12)

if n is even, with a negligible adjustment if n is odd (Bartels, 1982), so asymptotically the range is from 0 to 4, just as for the V test statistic in Equation (9) above.

Bartels (1982) shows that asymptotically, the rank von Neumann ratio statistic is a linear transformation of the rank serial correlation coefficient, so any asymptotic results apply to both statistics.

For any fixed sample size n, the exact distribution of the V_{rank} statistic in Equation (10) above can be computed by simply computing the value of V_{rank} for all possible permutations of the serial order of the ranks. Based on this exact distribution, Bartels (1982) presents a table of critical values for the numerator of the RVN statistic for sample sizes between 4 and 10.

Determining the exact distribution of V_{rank} becomes impractical as the sample size increases. For values of n between 10 and 100, Bartels (1982) approximated the distribution of V_{rank} by a beta distribution over the range 0 to 4 with shape parameters shape1=ν and shape2=ω and:

ν = ω = \frac{5n(n+1)(n-1)^2}{2(n-2)(5n^2 - 2n - 9)} - \frac{1}{2} \;\;\;\;\;\; (13)

Bartels (1982) checked this approximation by simulating the distribution of V_{rank} for n=25 and n=50 and comparing the empirical quantiles at 0.005, 0.01, 0.025, 0.05, and 0.1 with the approximated quantiles based on the beta distribution. He found that the quantiles agreed to 2 decimal places for eight of the 10 values, and differed by 0.01 for the other two values.

Note: The definition of the beta distribution assumes the random variable ranges from 0 to 1. This definition can be generalized as follows. Suppose the random variable Y has a beta distribution over the range a ≤ y ≤ b, with shape parameters ν and ω. Then the random variable X defined as:

X = \frac{Y-a}{b-a} \;\;\;\;\;\; (14)

has the “standard beta distribution” as described in the help file for Beta (Johnson et al., 1995, p.210).

Bartels (1982) shows that asymptotically, V_{rank} has normal distribution with mean 2 and variance 4/n, but notes that a slightly better approximation is given by using a variance of 20/(5n + 7).

To test the null hypothesis (1) when test="rank.von.Neumann", the function
serialCorrelationTest does the following:

• When the sample size is between 3 and 10, the exact distribution of V_{rank} is used to compute the p-value.

• When the sample size is between 11 and 100, the beta approximation to the distribution of V_{rank} is used to compute the p-value.

• When the sample size is larger than 100, the normal approximation to the distribution of V_{rank} is used to compute the p-value. (This uses the variance 20/(5n + 7).)

When ties are present in the observations and midranks are used for the tied observations, the distribution of the V_{rank} statistic based on the assumption of no ties is not applicable. If the number of ties is small, however, they may not grossly affect the assumed p-value.

When ties are present, the function serialCorrelationTest issues a warning. When the sample size is between 3 and 10, the p-value is computed based on rounding up the computed value of V_{rank} to the nearest possible value that could be observed in the case of no ties.

Computing a Confidence Interval for the Lag-1 Autocorrelation
The function serialCorrelationTest computes an approximate 100(1-α)\% confidence interval for the lag-1 autocorrelation as follows:

[\hat{ρ}_1 - z_{1-α/2}\hat{σ}_{\hat{ρ}_1}, \hat{ρ}_1 + z_{1-α/2}\hat{σ}_{\hat{ρ}_1}] \;\;\;\;\;\; (15)

where \hat{σ}_{\hat{ρ}_1} denotes the estimated standard deviation of the estimated of lag-1 autocorrelation and z_p denotes the p'th quantile of the standard normal distribution.

When test="AR1.yw" or test="rank.von.Neumann", the Yule-Walker estimate of lag-1 autocorrelation is used and the variance of the estimated lag-1 autocorrelation is approximately:

Var(\hat{ρ}_1) \approx \frac{1}{n} (1 - ρ_1^2) \;\;\;\;\;\; (16)

(Box and Jenkins, 1976, p.34), so

\hat{σ}_{\hat{ρ}_1} = √{\frac{1 - \hat{ρ}_1^2}{n}} \;\;\;\;\;\; (17)

When test="AR1.mle", the MLE of the lag-1 autocorrelation is used, and its standard deviation is estimated with the square root of the estimated variance returned by arima.

## Value

A list of class "htest" containing the results of the hypothesis test. See the help file for htest.object for details.

## Note

Data collected over time on the same phenomenon are called a time series. A time series is usually modeled as a single realization of a stochastic process; that is, if we could go back in time and repeat the experiment, we would get different results that would vary according to some probabilistic law. The simplest kind of time series is a stationary time series, in which the mean value is constant over time, the variability of the observations is constant over time, etc. That is, the probability distribution associated with each future observation is the same.

A common concern in applying standard statistical tests to time series data is the assumption of independence. Most conventional statistical hypothesis tests assume the observations are independent, but data collected sequentially in time may not satisfy this assumption. For example, high observations may tend to follow high observations (positive serial correlation), or low observations may tend to follow high observations (negative serial correlation). One way to investigate the assumption of independence is to estimate the lag-one serial correlation and test whether it is significantly different from 0.

The null distribution of the serial correlation coefficient may be badly affected by departures from normality in the underlying process (Cox, 1966; Bartels, 1977). It is therefore a good idea to consider using a nonparametric test for randomness if the normality of the underlying process is in doubt (Bartels, 1982). Knoke (1977) showed that under normality, the test based on the rank serial correlation coefficient (and hence the test based on the rank von Neumann ratio statistic) has asymptotic relative efficiency of 0.91 with respect to using the test based on the ordinary serial correlation coefficient against the alternative of first-order autocorrelation.

Bartels (1982) performed an extensive simulation study of the power of the rank von Neumann ratio test relative to the standard von Neumann ratio test (based on the statistic in Equation (9) above) and the runs test (Lehmann, 1975, 313-315). He generated a first-order autoregressive process for sample sizes of 10, 25, and 50, using 6 different parent distributions: normal, Cauchy, contaminated normal, Johnson, Stable, and exponential. Values of lag-1 autocorrelation ranged from -0.8 to 0.8. Bartels (1982) found three important results:

• The rank von Neumann ratio test is far more powerful than the runs test.

• For the normal process, the power of the rank von Neumann ratio test was never less than 89% of the power of the standard von Neumann ratio test.

• For non-normal processes, the rank von Neumann ratio test was often much more powerful than of the standard von Neumann ratio test.

## Author(s)

Steven P. Millard ([email protected])

## References

Bartels, R. (1982). The Rank Version of von Neumann's Ratio Test for Randomness. Journal of the American Statistical Association 77(377), 40–46.

Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.

Box, G.E.P., and G.M. Jenkins. (1976). Time Series Analysis: Forecasting and Control. Prentice Hall, Englewood Cliffs, NJ, Chapter 2.

Cox, D.R. (1966). The Null Distribution of the First Serial Correlation Coefficient. Biometrika 53, 623–626.

Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.69-70;181-192.

Durbin, J., and G.S. Watson. (1950). Testing for Serial Correlation in Least Squares Regression I. Biometrika 37, 409–428.

Durbin, J., and G.S. Watson. (1951). Testing for Serial Correlation in Least Squares Regression II. Biometrika 38, 159–178.

Durbin, J., and G.S. Watson. (1971). Testing for Serial Correlation in Least Squares Regression III. Biometrika 58, 1–19.

Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.250–253.

Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapter 25.

Knoke, J.D. (1975). Testing for Randomness Against Autocorrelation Alternatives: The Parametric Case. Biometrika 62, 571–575.

Knoke, J.D. (1977). Testing for Randomness Against Autocorrelation Alternatives: Alternative Tests. Biometrika 64, 523–529.

Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, 457pp.

von Neumann, J., R.H. Kent, H.R. Bellinson, and B.I. Hart. (1941). The Mean Square Successive Difference. Annals of Mathematical Statistics 12(2), 153–162.

Wald, A., and J. Wolfowitz. (1943). An Exact Test for Randomness in the Non-Parametric Case Based on Serial Correlation. Annals of Mathematical Statistics 14, 378–388.

htest.object, acf, ar, arima, arima.sim, ts.plot, plot.ts, lag.plot, Hypothesis Tests.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250  # Generate a purely random normal process, then use serialCorrelationTest # to test for the presence of correlation. # (Note: the call to set.seed allows you to reproduce this example.) set.seed(345) x <- rnorm(100) # Look at the data #----------------- dev.new() ts.plot(x) dev.new() acf(x) # Test for serial correlation #---------------------------- serialCorrelationTest(x) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: Rank von Neumann Test for # Lag-1 Autocorrelation # (Beta Approximation) # #Estimated Parameter(s): rho = 0.02773737 # #Estimation Method: Yule-Walker # #Data: x # #Sample Size: 100 # #Test Statistic: RVN = 1.929733 # #P-value: 0.7253405 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.1681836 # UCL = 0.2236584 # Clean up #--------- rm(x) graphics.off() #========== # Now use the R function arima.sim to generate an AR(1) process with a # lag-1 autocorrelation of 0.8, then test for autocorrelation. set.seed(432) y <- arima.sim(model = list(ar = 0.8), n = 100) # Look at the data #----------------- dev.new() ts.plot(y) dev.new() acf(y) # Test for serial correlation #---------------------------- serialCorrelationTest(y) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: Rank von Neumann Test for # Lag-1 Autocorrelation # (Beta Approximation) # #Estimated Parameter(s): rho = 0.835214 # #Estimation Method: Yule-Walker # #Data: y # #Sample Size: 100 # #Test Statistic: RVN = 0.3743174 # #P-value: 0 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.7274307 # UCL = 0.9429973 #---------- # Clean up #--------- rm(y) graphics.off() #========== # The data frame Air.df contains information on ozone (ppb^1/3), # radiation (langleys), temperature (degrees F), and wind speed (mph) # for 153 consecutive days between May 1 and September 30, 1973. # First test for serial correlation in (the cube root of) ozone. # Note that we must use the test based on the MLE because the time series # contains missing values. Serial correlation appears to be present. # Next fit a linear model that includes the predictor variables temperature, # radiation, and wind speed, and test for the presence of serial correlation # in the residuals. There is no evidence of serial correlation. # Look at the data #----------------- Air.df # ozone radiation temperature wind #05/01/1973 3.448217 190 67 7.4 #05/02/1973 3.301927 118 72 8.0 #05/03/1973 2.289428 149 74 12.6 #05/04/1973 2.620741 313 62 11.5 #05/05/1973 NA NA 56 14.3 #... #09/27/1973 NA 145 77 13.2 #09/28/1973 2.410142 191 75 14.3 #09/29/1973 2.620741 131 76 8.0 #09/30/1973 2.714418 223 68 11.5 #---------- # Test for serial correlation #---------------------------- with(Air.df, serialCorrelationTest(ozone, test = "AR1.mle")) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: z-Test for # Lag-1 Autocorrelation # (Wald Test Based on MLE) # #Estimated Parameter(s): rho = 0.5641616 # #Estimation Method: Maximum Likelihood # #Data: ozone # #Sample Size: 153 # #Number NA/NaN/Inf's: 37 # #Test Statistic: z = 7.586952 # #P-value: 3.28626e-14 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.4184197 # UCL = 0.7099034 #---------- # Next fit a linear model that includes the predictor variables temperature, # radiation, and wind speed, and test for the presence of serial correlation # in the residuals. Note setting the argument na.action = na.exclude in the # call to lm to correctly deal with missing values. #---------------------------------------------------------------------------- lm.ozone <- lm(ozone ~ radiation + temperature + wind + I(temperature^2) + I(wind^2), data = Air.df, na.action = na.exclude) # Now test for serial correlation in the residuals. #-------------------------------------------------- serialCorrelationTest(lm.ozone, test = "AR1.mle") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: z-Test for # Lag-1 Autocorrelation # (Wald Test Based on MLE) # #Estimated Parameter(s): rho = 0.1298024 # #Estimation Method: Maximum Likelihood # #Data: Residuals # #Data Source: lm.ozone # #Sample Size: 153 # #Number NA/NaN/Inf's: 42 # #Test Statistic: z = 1.285963 # #P-value: 0.1984559 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.06803223 # UCL = 0.32763704 # Clean up #--------- rm(lm.ozone)