# predIntNpar: Nonparametric Prediction Interval for a Continuous... In EnvStats: Package for Environmental Statistics, Including US EPA Guidance

## Description

Construct a nonparametric prediction interval to contain at least k out of the next m future observations with probability (1-α)100% for a continuous distribution.

## Usage

    predIntNpar(x, k = m, m = 1,
        lpl.rank = ifelse(pi.type == "upper", 0, 1),
        n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1),
        lb = -Inf, ub = Inf, pi.type = "two-sided")

## Arguments

- x: a numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
- k: positive integer specifying the minimum number of future observations out of m that should be contained in the prediction interval. The default value is k=m.
- m: positive integer specifying the number of future observations. The default value is m=1.
- lpl.rank: positive integer indicating the rank of the order statistic to use for the lower bound of the prediction interval. If pi.type="two-sided" or pi.type="lower", the default value is lpl.rank=1 (implying the minimum value of x is used as the lower bound of the prediction interval). If pi.type="upper", this argument is set equal to 0 and the value of lb is used as the lower bound of the prediction interval.
- n.plus.one.minus.upl.rank: positive integer related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default when pi.type="two-sided" or pi.type="upper") means use the largest value, and in general a value of n.plus.one.minus.upl.rank=i means use the i'th largest value. If pi.type="lower", this argument is set equal to 0 and the value of ub is used as the upper bound of the prediction interval.
- lb, ub: scalars indicating lower and upper bounds on the distribution. By default, lb=-Inf and ub=Inf. If you are constructing a prediction interval for a distribution that you know has a lower bound other than -Inf (e.g., 0), set lb to this value. Similarly, if you know the distribution has an upper bound other than Inf, set ub to this value. The argument lb is ignored if pi.type="two-sided" or pi.type="lower". The argument ub is ignored if pi.type="two-sided" or pi.type="upper".
- pi.type: character string indicating what kind of prediction interval to compute. The possible values are "two-sided" (the default), "lower", and "upper".

## Details

### What is a Nonparametric Prediction Interval?

A nonparametric prediction interval for some population is an interval on the real line constructed so that it will contain at least k of m future observations from that population with some specified probability (1-α)100%, where 0 < α < 1 and k and m are pre-specified positive integers with k ≤ m. The quantity (1-α)100% is called the confidence coefficient or confidence level associated with the prediction interval.
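The definition above can be checked empirically. The following is a minimal simulation sketch in base R, not part of EnvStats: the lognormal distribution, the seed, and the number of replications are arbitrary assumptions chosen for illustration. It estimates how often all m = 5 future observations fall between the minimum and maximum of n = 20 past observations.

```r
# Monte Carlo sketch: how often do all m future observations fall inside
# [min(x), max(x)] built from n past observations?
set.seed(42)    # arbitrary seed, for reproducibility only
n <- 20         # size of the observed sample
m <- 5          # number of future observations
n.sim <- 20000  # number of simulated data sets (arbitrary choice)

covered <- replicate(n.sim, {
  x <- rlnorm(n)       # past observations from a continuous distribution
  future <- rlnorm(m)  # future observations from the same distribution
  all(future >= min(x) & future <= max(x))
})

# Proportion of simulated data sets in which the interval held all m values
sim.conf.level <- mean(covered)
sim.conf.level
```

The simulated proportion should land close to the 63.33333% confidence level reported for predIntNpar(dat, m = 5) in the Examples section, and swapping rlnorm for another continuous distribution should not change it materially, which is what "nonparametric" means here.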

### The Form of a Nonparametric Prediction Interval

Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of n independent observations from some continuous distribution, and let $x_{(i)}$ denote the i'th order statistic in $\underline{x}$. A two-sided nonparametric prediction interval is constructed as:

$$[x_{(u)}, x_{(v)}] \qquad (1)$$

where u and v are positive integers between 1 and n, and u < v. That is, u denotes the rank of the lower prediction limit, and v denotes the rank of the upper prediction limit. To make it easier to write some equations later on, we can also write the prediction interval (1) in a slightly different way as:

$$[x_{(u)}, x_{(n + 1 - w)}] \qquad (2)$$

where

$$w = n + 1 - v \qquad (3)$$

so that w is a positive integer between 1 and n-1, and u < n+1-w. In terms of the arguments to the function predIntNpar, the argument lpl.rank corresponds to u, and the argument n.plus.one.minus.upl.rank corresponds to w.

If we allow u=0 and w=0 and define lower and upper bounds as:

$$x_{(0)} = \text{lb} \qquad (4)$$

$$x_{(n+1)} = \text{ub} \qquad (5)$$

then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval. That is, a one-sided lower nonparametric prediction interval is constructed as:

$$[x_{(u)}, x_{(n + 1)}] = [x_{(u)}, \text{ub}] \qquad (6)$$

and a one-sided upper nonparametric prediction interval is constructed as:

$$[x_{(0)}, x_{(n + 1 - w)}] = [\text{lb}, x_{(n + 1 - w)}] \qquad (7)$$

Usually, lb = -∞ or lb = 0 and ub = ∞.
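As a rough illustration of Equations (2) and (4) through (7), the sketch below picks the prediction limits directly from the order statistics. The helper npar.limits and the data vector are hypothetical and made up for this illustration; this is not how predIntNpar is implemented internally, only a base R rendering of the definitions above.

```r
# Hypothetical helper (not part of EnvStats): select prediction limits
# from the order statistics, following Equations (2) and (4)-(7).
#   u = rank of the lower limit (lpl.rank); u = 0 means use lb
#   w = n + 1 - rank of the upper limit (n.plus.one.minus.upl.rank);
#       w = 0 means use ub
npar.limits <- function(x, u = 1, w = 1, lb = -Inf, ub = Inf) {
  x <- sort(x[is.finite(x)])  # drop NA, NaN, Inf, -Inf, then order
  n <- length(x)
  stopifnot(u >= 0, w >= 0, u < n + 1 - w)
  lpl <- if (u == 0) lb else x[u]          # x_(0) = lb
  upl <- if (w == 0) ub else x[n + 1 - w]  # x_(n+1) = ub
  c(LPL = lpl, UPL = upl)
}

# Made-up observations, for illustration only
x <- c(7, 6.5, 8, 5, 12, 10.5, 9, 6)

npar.limits(x)                        # two-sided: minimum and maximum
npar.limits(x, u = 0, w = 1, lb = 0)  # one-sided upper: [lb, maximum]
npar.limits(x, u = 2, w = 0)          # one-sided lower: [2nd smallest, ub]
```

Setting u = 0 or w = 0 is what turns the two-sided form into a one-sided interval, which mirrors the defaults of lpl.rank and n.plus.one.minus.upl.rank for pi.type="upper" and pi.type="lower".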

### Constructing Nonparametric Prediction Intervals for Future Observations

Danziger and Davis (1964) show that the probability that at least k out of the next m observations will fall in the interval defined in Equation (2) is given by:

$$1 - \alpha = \frac{\sum_{i=k}^{m} \binom{m-i+u+w-1}{m-i} \binom{i+n-u-w}{i}}{\binom{n+m}{m}} \qquad (8)$$

(Note that computing a nonparametric prediction interval for the case k = m = 1 is equivalent to computing a nonparametric β-expectation tolerance interval with coverage (1-α)100%; see tolIntNpar.)
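Equation (8) is straightforward to evaluate with the base R function choose. The helper below, npar.conf.level, is a hypothetical name introduced for this sketch and is not part of EnvStats (the package's own function for this purpose is predIntNparConfLevel, whose interface is not reproduced here). The sketch is only meant to show the arithmetic.

```r
# Hypothetical helper (not part of EnvStats): evaluate Equation (8).
#   n = sample size, k = minimum number of future observations to contain,
#   m = number of future observations, u = lpl.rank,
#   w = n.plus.one.minus.upl.rank
npar.conf.level <- function(n, k = m, m = 1, u = 1, w = 1) {
  i <- k:m
  sum(choose(m - i + u + w - 1, m - i) * choose(i + n - u - w, i)) /
    choose(n + m, m)
}

# Two-sided interval [minimum, maximum] with n = 20
npar.conf.level(n = 20)                # 0.9047619
npar.conf.level(n = 20, m = 5)         # 0.6333333
npar.conf.level(n = 20, k = 3, m = 5)  # 0.9837945

# One-sided upper limit at the maximum (u = 0, w = 1) with n = 18, m = 4
npar.conf.level(n = 18, m = 4, u = 0, w = 1)  # 0.8181818
```

These four values agree with the confidence levels printed by predIntNpar in the Examples section (90.47619%, 63.33333%, 98.37945%, and 81.81818%).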

### The Special Case of Using the Minimum and the Maximum

Setting u = w = 1 implies using the smallest and largest observed values as the prediction limits. In this case, it can be shown that the probability that at least k out of the next m observations will fall in the interval

$$[x_{(1)}, x_{(n)}] \qquad (9)$$

is given by:

$$1 - \alpha = \frac{\sum_{i=k}^{m} (m-i+1) \binom{n+i-2}{i}}{\binom{n+m}{m}} \qquad (10)$$

Setting k=m in Equation (10), the probability that all of the next m observations will fall in the interval defined in Equation (9) is given by:

$$1 - \alpha = \frac{n(n-1)}{(n+m)(n+m-1)} \qquad (11)$$

For one-sided prediction limits, the probability that all m future observations will fall below $x_{(n)}$ (upper prediction limit; pi.type="upper") and the probability that all m future observations will fall above $x_{(1)}$ (lower prediction limit; pi.type="lower") are both given by:

$$1 - \alpha = \frac{n}{n+m} \qquad (12)$$
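The closed forms in Equations (11) and (12) can be compared numerically with the general expression in Equation (8). The function names below are hypothetical and exist only for this sketch. With k = m, only the i = m term of the sum remains and its first binomial coefficient equals 1, so Equation (8) reduces to a ratio of two binomial coefficients.

```r
# Closed forms for limits at the minimum and/or maximum with k = m
conf.two.sided <- function(n, m) n * (n - 1) / ((n + m) * (n + m - 1))  # Eq. (11)
conf.one.sided <- function(n, m) n / (n + m)                            # Eq. (12)

# Equation (8) with k = m: the single remaining term of the sum
conf.general <- function(n, m, u, w) {
  choose(m + n - u - w, m) / choose(n + m, m)
}

# Two-sided, n = 20, m = 5
conf.two.sided(20, 5)                # 0.6333333
conf.general(20, 5, u = 1, w = 1)    # same value via Equation (8)

# One-sided upper, n = 18, m = 4
conf.one.sided(18, 4)                # 0.8181818
conf.general(18, 4, u = 0, w = 1)    # same value via Equation (8)
```

Both closed forms show why the confidence level drops quickly as m grows for a fixed n: in the one-sided case, for example, n must be 19 times m before n/(n+m) reaches 95%.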

### Constructing Nonparametric Prediction Intervals for Future Medians

To construct a nonparametric prediction interval for a future median based on s future observations, where s is odd, note that this is equivalent to constructing a nonparametric prediction interval that must contain at least k = (s+1)/2 of the next m = s future observations.
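As a small worked sketch of this equivalence, the code below converts a median of order s = 3 into k and m and then evaluates Equation (8) for a one-sided upper limit at the maximum of n = 24 observations (u = 0, w = 1). The variable names are chosen for this illustration only.

```r
# A future median of order s corresponds to "at least k of the next m"
s <- 3            # number of future observations in the median (odd)
k <- (s + 1) / 2  # the median is inside the interval if at least k are
m <- s

# One-sided upper limit at the maximum of n = 24 observations
n <- 24
u <- 0  # lower limit is lb, not an order statistic
w <- 1  # upper limit is the largest observation

# Equation (8)
i <- k:m
conf.level <- sum(choose(m - i + u + w - 1, m - i) *
                  choose(i + n - u - w, i)) / choose(n + m, m)
conf.level  # 0.991453
```

This reproduces the 99.1453% confidence level shown for the xylene data (Example 18-4 of USEPA, 2009) in the Examples section, where predIntNpar is called with k = 2, m = 3, and pi.type = "upper".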

## Value

a list of class "estimate" containing the prediction interval and other information. See the help file for estimate.object for details.

## Note

Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).

## Author(s)

Steven P. Millard

## References

Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.

Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.

Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.

Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.

Davis, C.B., and R.J. McNichols. (1994b). Ground Water Monitoring Statistics Update: Part II: Nonparametric Prediction Limits. Ground Water Monitoring and Remediation 14(4), 159–175.

Davis, C.B., and R.J. McNichols. (1999). Simultaneous Nonparametric Prediction Limits (with Discussion). Technometrics 41(2), 89–112.

Gibbons, R.D. (1987a). Statistical Prediction Intervals for the Evaluation of Ground-Water Quality. Ground Water 25, 455–465.

Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.

Gibbons, R.D., and J. Baker. (1991). The Properties of Various Statistical Prediction Intervals for Ground-Water Detection Monitoring. Journal of Environmental Science and Health A26(4), 535–553.

Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.

Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.

Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.

Hall, I.J., R.R. Prairie, and C.K. Motlagh. (1975). Non-Parametric Prediction Intervals. Journal of Quality Technology 7(3), 109–114.

Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

## See Also

estimate.object, predIntNparN, predIntNparConfLevel, plotPredIntNparDesign.
## Examples

    # Load the package that provides predIntNpar, rlnormMixAlt, longToWide,
    # and the EPA example data sets
    library(EnvStats)

    # Generate 20 observations from a lognormal mixture distribution with
    # parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1.  Use
    # predIntNpar to construct a two-sided prediction interval using the
    # minimum and maximum observed values.  Note that the associated confidence
    # level is 90%.  A larger sample size is required to obtain a larger
    # confidence level (see the help file for predIntNparN).
    # (Note: the call to set.seed simply allows you to reproduce this example.)

    set.seed(250)
    dat <- rlnormMixAlt(n = 20, mean1 = 1, cv1 = 0.5,
      mean2 = 5, cv2 = 1, p.mix = 0.1)

    predIntNpar(dat)

    #Results of Distribution Parameter Estimation
    #--------------------------------------------
    #
    #Assumed Distribution:            None
    #
    #Data:                            dat
    #
    #Sample Size:                     20
    #
    #Prediction Interval Method:      Exact
    #
    #Prediction Interval Type:        two-sided
    #
    #Confidence Level:                90.47619%
    #
    #Prediction Limit Rank(s):        1 20
    #
    #Number of Future Observations:   1
    #
    #Prediction Interval:             LPL = 0.3647875
    #                                 UPL = 1.8173115

    #----------

    # Repeat the above example, but specify m=5 future observations should be
    # contained in the prediction interval.  Note that the confidence level is
    # now only 63%.

    predIntNpar(dat, m = 5)

    #Results of Distribution Parameter Estimation
    #--------------------------------------------
    #
    #Assumed Distribution:            None
    #
    #Data:                            dat
    #
    #Sample Size:                     20
    #
    #Prediction Interval Method:      Exact
    #
    #Prediction Interval Type:        two-sided
    #
    #Confidence Level:                63.33333%
    #
    #Prediction Limit Rank(s):        1 20
    #
    #Number of Future Observations:   5
    #
    #Prediction Interval:             LPL = 0.3647875
    #                                 UPL = 1.8173115

    #----------

    # Repeat the above example, but specify that a minimum of k=3 observations
    # out of a total of m=5 future observations should be contained in the
    # prediction interval.  Note that the confidence level is now 98%.

    predIntNpar(dat, k = 3, m = 5)

    #Results of Distribution Parameter Estimation
    #--------------------------------------------
    #
    #Assumed Distribution:            None
    #
    #Data:                            dat
    #
    #Sample Size:                     20
    #
    #Prediction Interval Method:      Exact
    #
    #Prediction Interval Type:        two-sided
    #
    #Confidence Level:                98.37945%
    #
    #Prediction Limit Rank(s):        1 20
    #
    #Minimum Number of
    #Future Observations
    #Interval Should Contain:         3
    #
    #Total Number of
    #Future Observations:             5
    #
    #Prediction Interval:             LPL = 0.3647875
    #                                 UPL = 1.8173115

    #==========

    # Example 18-3 of USEPA (2009, p.18-19) shows how to construct
    # a one-sided upper nonparametric prediction interval for the next
    # 4 future observations of trichloroethylene (TCE) at a downgradient well.
    # The data for this example are stored in EPA.09.Ex.18.3.TCE.df.
    # There are 6 monthly observations of TCE (ppb) at 3 background wells,
    # and 4 monthly observations of TCE at a compliance well.

    # Look at the data
    #-----------------

    EPA.09.Ex.18.3.TCE.df

    #   Month Well  Well.type TCE.ppb.orig TCE.ppb Censored
    #1      1 BW-1 Background           <5     5.0     TRUE
    #2      2 BW-1 Background           <5     5.0     TRUE
    #3      3 BW-1 Background            8     8.0    FALSE
    #...
    #22     4 CW-4 Compliance           <5     5.0     TRUE
    #23     5 CW-4 Compliance            8     8.0    FALSE
    #24     6 CW-4 Compliance           14    14.0    FALSE

    longToWide(EPA.09.Ex.18.3.TCE.df, "TCE.ppb.orig", "Month", "Well",
      paste.row.name = TRUE)

    #        BW-1 BW-2 BW-3 CW-4
    #Month.1   <5    7   <5
    #Month.2   <5  6.5   <5
    #Month.3    8   <5 10.5  7.5
    #Month.4   <5    6   <5   <5
    #Month.5    9   12   <5    8
    #Month.6   10   <5    9   14

    # Construct the prediction limit based on the background well data
    # using the maximum value as the upper prediction limit.
    # Note that since all censored observations are censored at one
    # censoring level and the censoring level is less than all of the
    # uncensored observations, we can just supply the censoring level
    # to predIntNpar.
    #-----------------------------------------------------------------

    with(EPA.09.Ex.18.3.TCE.df,
      predIntNpar(TCE.ppb[Well.type == "Background"],
        m = 4, pi.type = "upper", lb = 0))

    #Results of Distribution Parameter Estimation
    #--------------------------------------------
    #
    #Assumed Distribution:            None
    #
    #Data:                            TCE.ppb[Well.type == "Background"]
    #
    #Sample Size:                     18
    #
    #Prediction Interval Method:      Exact
    #
    #Prediction Interval Type:        upper
    #
    #Confidence Level:                81.81818%
    #
    #Prediction Limit Rank(s):        18
    #
    #Number of Future Observations:   4
    #
    #Prediction Interval:             LPL =  0
    #                                 UPL = 12

    # Since the value of 14 ppb for Month 6 at the compliance well exceeds
    # the upper prediction limit of 12, we might conclude that there is
    # statistically significant evidence of an increase over background
    # at CW-4.  However, the confidence level associated with this
    # prediction limit is about 82%, which implies a Type I error level of
    # 18%.  This means there is nearly a one in five chance of a false positive.
    # Only additional background data and/or use of a retesting strategy
    # (see predIntNparSimultaneous) would lower the false positive rate.

    #==========

    # Example 18-4 of USEPA (2009, p.18-19) shows how to construct
    # a one-sided upper nonparametric prediction interval for the next
    # median of order 3 of xylene at a downgradient well.
    # The data for this example are stored in EPA.09.Ex.18.4.xylene.df.
    # There are 8 monthly observations of xylene (ppb) at 3 background wells,
    # and 3 monthly observations of xylene at a compliance well.

    # Look at the data
    #-----------------

    EPA.09.Ex.18.4.xylene.df

    #   Month   Well  Well.type Xylene.ppb.orig Xylene.ppb Censored
    #1      1 Well.1 Background              <5        5.0     TRUE
    #2      2 Well.1 Background              <5        5.0     TRUE
    #3      3 Well.1 Background             7.5        7.5    FALSE
    #...
    #30     6 Well.4 Compliance              <5        5.0     TRUE
    #31     7 Well.4 Compliance             7.8        7.8    FALSE
    #32     8 Well.4 Compliance            10.4       10.4    FALSE

    longToWide(EPA.09.Ex.18.4.xylene.df, "Xylene.ppb.orig", "Month", "Well",
      paste.row.name = TRUE)

    #        Well.1 Well.2 Well.3 Well.4
    #Month.1     <5    9.2     <5
    #Month.2     <5     <5    5.4
    #Month.3    7.5     <5    6.7
    #Month.4     <5    6.1     <5
    #Month.5     <5      8     <5
    #Month.6     <5    5.9     <5     <5
    #Month.7    6.4     <5     <5    7.8
    #Month.8      6     <5     <5   10.4

    # Construct the prediction limit based on the background well data
    # using the maximum value as the upper prediction limit.
    # Note that since all censored observations are censored at one
    # censoring level and the censoring level is less than all of the
    # uncensored observations, we can just supply the censoring level
    # to predIntNpar.
    #
    # To compute a prediction interval for a median of order 3 (i.e.,
    # a median based on 3 observations), this is equivalent to
    # constructing a nonparametric prediction interval that must hold
    # at least 2 of the next 3 future observations.
    #-----------------------------------------------------------------

    with(EPA.09.Ex.18.4.xylene.df,
      predIntNpar(Xylene.ppb[Well.type == "Background"],
        k = 2, m = 3, pi.type = "upper", lb = 0))

    #Results of Distribution Parameter Estimation
    #--------------------------------------------
    #
    #Assumed Distribution:            None
    #
    #Data:                            Xylene.ppb[Well.type == "Background"]
    #
    #Sample Size:                     24
    #
    #Prediction Interval Method:      Exact
    #
    #Prediction Interval Type:        upper
    #
    #Confidence Level:                99.1453%
    #
    #Prediction Limit Rank(s):        24
    #
    #Minimum Number of
    #Future Observations
    #Interval Should Contain:         2
    #
    #Total Number of
    #Future Observations:             3
    #
    #Prediction Interval:             LPL = 0.0
    #                                 UPL = 9.2

    # The Month 8 observation at the Compliance well is 10.4 ppb of Xylene,
    # which is greater than the upper prediction limit of 9.2 ppb, so
    # conclude there is evidence of contamination at the
    # 100% - 99% = 1% Type I Error Level

    #==========

    # Cleanup
    #--------

    rm(dat)