Estimate the mean and standard deviation of a normal distribution, and construct a simultaneous prediction interval for the next r sampling “occasions”, based on one of three possible rules: k-of-m, California, or Modified California.
predIntNormSimultaneous(x, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m",
    delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95,
    K.tol = .Machine$double.eps^0.5)

x 
a numeric vector of observations, or an object resulting from a call to an estimating
function that assumes a normal (Gaussian) distribution (e.g., enorm).
n.mean 
positive integer specifying the sample size associated with the future averages.
The default value is n.mean=1.
k 
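for the k-of-m rule (rule="k.of.m"), positive integer specifying the minimum number of future observations (or averages), out of the m taken on one future sampling “occasion”, that the prediction interval should contain. The default value is k=1.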
m 
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is m=2.
r 
positive integer specifying the number of future sampling “occasions”.
The default value is r=1.
rule 
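character string specifying which rule to use. The possible values are
"k.of.m" (k-of-m rule; the default), "CA" (California rule), and
"Modified.CA" (Modified California rule).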
delta.over.sigma 
numeric scalar indicating the ratio Δ/σ. The quantity
Δ (delta) denotes the difference between the mean of the population
that was sampled to construct the prediction interval, and the mean of the
population that will be sampled to produce the future observations. The quantity
σ (sigma) denotes the population standard deviation for both populations.
See the DETAILS section below for more information. The default value is
delta.over.sigma=0.
pi.type 
character string indicating what kind of prediction interval to compute.
The possible values are "upper" (the default) and "lower".
conf.level 
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95.
K.tol 
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute K. The default value is K.tol=.Machine$double.eps^0.5.
What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations from that population
with some specified probability (1-α)100%, where
0 < α < 1 and k is some prespecified positive integer.
The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function predIntNorm computes a standard prediction
interval based on a sample from a normal distribution.
The function predIntNormSimultaneous computes a simultaneous prediction
interval that will contain a certain number of future observations with probability
(1-α)100% for each of r future sampling “occasions”,
where r is some prespecified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNormSimultaneous computes a simultaneous prediction
interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100% on each of the r future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once
k of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, the results of predIntNormSimultaneous
are equivalent to the results of predIntNorm.
For the California rule (rule="CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
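The three rules can be written as pass/fail checks for a single sampling occasion. The following is a minimal base R sketch for an upper prediction limit. The function names pass_k_of_m, pass_CA, and pass_modified_CA are hypothetical helpers for illustration and are not part of EnvStats. The sketch assumes all of the observations for the occasion are available, whereas in practice sampling can stop as soon as the outcome is decided.

```r
# obs: future observations for one sampling occasion, in the order taken.
# upl: upper prediction limit; an observation is "in bounds" when obs <= upl.

# k-of-m rule: at least k of the m observations are in bounds.
pass_k_of_m <- function(obs, upl, k) {
  sum(obs <= upl) >= k
}

# California rule: the first observation is in bounds, or else
# all of the remaining m-1 observations are in bounds.
pass_CA <- function(obs, upl) {
  obs[1] <= upl || all(obs[-1] <= upl)
}

# Modified California rule: the first observation is in bounds, or else
# at least 2 of the next 3 observations are in bounds.
pass_modified_CA <- function(obs, upl) {
  stopifnot(length(obs) == 4)
  obs[1] <= upl || sum(obs[2:4] <= upl) >= 2
}

# Illustrative (made-up) values: first observation out of bounds,
# then two of the three resamples in bounds.
obs <- c(12, 9, 11, 8)
upl <- 10
pass_k_of_m(obs, upl, k = 1)   # TRUE
pass_CA(obs, upl)              # FALSE
pass_modified_CA(obs, upl)     # TRUE
```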
Simultaneous prediction intervals can be extended to using averages (means) in place
of single observations (USEPA, 2009, Chapter 19). That is, you can create a
simultaneous prediction interval
that will contain a specified number of averages (based on which rule you choose) on
each of r future sampling occasions, where each average is based on
w individual observations. For the function predIntNormSimultaneous,
the argument n.mean corresponds to w.
The Form of a Prediction Interval
Let \underline{x} = x_1, x_2, \ldots, x_n denote a vector of n
observations from a normal distribution with parameters
mean=μ and sd=σ. Also, let w denote the
sample size associated with the future averages (i.e., n.mean=w).
When w=1, each average is really just a single observation, so in the rest of
this help file the term “averages” will replace the phrase
“observations or averages”.
For a normal distribution, the form of a two-sided (1-α)100% prediction interval is:
[\bar{x} - Ks, \bar{x} + Ks] \;\;\;\;\;\; (1)
where \bar{x} denotes the sample mean:
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\;\;\; (2)
s denotes the sample standard deviation:
s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\;\;\; (3)
and K denotes a constant that depends on the sample size n, the
confidence level, the number of future sampling occasions r, and the
sample size associated with the future averages, w. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k) in the k-of-m rule. The symbol K is used here
to be consistent with the notation used for tolerance intervals
(see tolIntNorm).
Similarly, the form of a one-sided lower prediction interval is:
[\bar{x} - Ks, \infty] \;\;\;\;\;\; (4)
and the form of a one-sided upper prediction interval is:
[-\infty, \bar{x} + Ks] \;\;\;\;\;\; (5)
Note: For simultaneous prediction intervals, only lower
(pi.type="lower") and upper (pi.type="upper") prediction
intervals are available.
The derivation of the constant K is explained in the help file for
predIntNormSimultaneousK.
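As an illustration of the one-sided forms in equations (4) and (5), the sketch below builds the limits from a baseline sample and a supplied multiplier K. The helper pred_limits is hypothetical (it is not an EnvStats function), and the value of K used here is an arbitrary placeholder; in practice K would be obtained from predIntNormSimultaneousK for the chosen rule, n, r, w, and confidence level.

```r
# Form one-sided prediction limits from a baseline sample x and a multiplier K.
# K is treated as given; this sketch does not derive it.
pred_limits <- function(x, K, pi.type = c("upper", "lower")) {
  pi.type <- match.arg(pi.type)
  xbar <- mean(x)   # sample mean, equation (2)
  s <- sd(x)        # sample standard deviation, equation (3)
  if (pi.type == "upper") {
    c(LPL = -Inf, UPL = xbar + K * s)   # equation (5)
  } else {
    c(LPL = xbar - K * s, UPL = Inf)    # equation (4)
  }
}

x <- c(1, 2, 3, 4, 5)             # toy baseline data
pred_limits(x, K = 2)             # upper limit: mean(x) + 2 * sd(x)
pred_limits(x, K = 2, "lower")    # lower limit: mean(x) - 2 * sd(x)
```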
Prediction Intervals are Random Intervals
A prediction interval is a random interval; that is, the lower and/or
upper bounds are random variables computed based on sample statistics in the
baseline sample. Prior to taking one specific baseline sample, the probability
that the prediction interval will perform according to the rule chosen is
(1-α)100%. Once a specific baseline sample is taken and the prediction
interval based on that sample is computed, the probability that that prediction
interval will perform according to the rule chosen is not necessarily
(1-α)100%, but it should be close. See the help file for
predIntNorm for more information.
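The distinction between the probability before and after the baseline sample is drawn can be seen in a small simulation. The sketch below uses only base R and the simplest case, k = m = 1 and r = 1, for which the upper limit is the standard one-sided prediction limit for a single future observation, with K = t(1-α; n-1) * sqrt(1 + 1/n). The population values (mean 10, sd 2, n = 8) and the number of replicates are assumptions chosen for illustration.

```r
set.seed(1)
n <- 8
mu <- 10
sigma <- 2
conf.level <- 0.95

# Multiplier for an upper prediction limit for one future observation
K <- qt(conf.level, df = n - 1) * sqrt(1 + 1 / n)

# For each simulated baseline sample, compute the probability that a future
# observation falls below that sample's upper prediction limit.
cond_cover <- replicate(20000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  upl <- mean(x) + K * sd(x)
  pnorm(upl, mean = mu, sd = sigma)
})

mean(cond_cover)    # averages out to roughly conf.level
range(cond_cover)   # but varies from one baseline sample to another
```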
If x is a numeric vector, predIntNormSimultaneous returns a list of
class "estimate" containing the estimated parameters, the prediction interval,
and other information. See the help file for estimate.object for details.
If x is the result of calling an estimation function,
predIntNormSimultaneous returns a list whose class is the same as x.
The list contains the same components as x, as well as a component called
interval containing the prediction interval information.
If x already has a component called interval, this component is
replaced with the prediction interval information.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at hazardous and solid waste facilities is the requirement of testing several wells and several constituents at each well on each sampling occasion. This is an obvious multiple comparisons problem, and the naive approach of using a standard t-test at a conventional α-level (e.g., 0.05 or 0.01) for each test leads to a very high probability of at least one significant result on each sampling occasion, when in fact no contamination has occurred. This problem was pointed out years ago by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of controlling the facility-wide false positive rate (FWFPR) while maintaining adequate power to detect contamination in the groundwater. Because of the ubiquitous presence of spatial variability, it is usually best to use simultaneous prediction intervals at each well (Davis, 1998a); that is, to construct prediction intervals based on background (pre-landfill) data on each well, and to compare future observations at a well to the prediction interval for that particular well. In each of these cases, the individual α-level at each well is equal to the FWFPR divided by the product of the number of wells and constituents.
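The arithmetic behind these two paragraphs can be sketched in a few lines of base R. The numbers of wells and constituents and the FWFPR of 0.1 are borrowed from the USEPA example used later in this help file; treating the individual tests as independent is a simplifying assumption made only for this illustration.

```r
nw <- 50        # number of wells
nc <- 10        # number of constituents per well
FWFPR <- 0.1    # target facility-wide false positive rate
n_tests <- nw * nc

# Naive approach: every test run at alpha = 0.05. Under independence, the
# chance of at least one false positive per sampling occasion is essentially 1.
naive_fwfpr <- 1 - (1 - 0.05)^n_tests
naive_fwfpr

# Individual alpha-level obtained by dividing the FWFPR among all tests
alpha_divided <- FWFPR / n_tests
alpha_divided   # 2e-04

# The closely related exact form recommended by USEPA (2009)
alpha_exact <- 1 - (1 - FWFPR)^(1 / n_tests)
alpha_exact
1 - alpha_exact   # corresponding per-test confidence level
```

The two individual α-levels are nearly identical here; the division is a slightly more conservative approximation of the exact form.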
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the 1-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule
For the k-of-m rule, Davis and McNichols (1987) give tables with
“optimal” choices of k (in terms of best power for a given overall
confidence level) for selected values of m, r, and n. They found
that the optimal ratios of k to m (i.e., k/m) are generally small,
in the range of 15-50%.
The California Rule
The California rule was mandated in that state for groundwater monitoring at waste
disposal facilities when resampling verification is part of the statistical program
(Barclay's Code of California Regulations, 1991). The California code mandates a
“California” rule with m ≥ 3. The motivation for this rule may have
been a desire to have a majority of the observations in bounds (Davis, 1998a). For
example, for a k-of-m rule with k=1 and m=3, a monitoring
location will pass if the first observation is out of bounds, the second resample
is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations
are out of bounds. For the California rule with m=3, either the first
observation must be in bounds, or the next 2 observations must be in bounds in order
for the monitoring location to pass.
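The comparison above can be checked by enumerating every in-bounds/out-of-bounds pattern for three observations. This is a small base R sketch; the column names are chosen here for illustration.

```r
# All 8 patterns for three observations (TRUE = in bounds)
patterns <- expand.grid(first  = c(TRUE, FALSE),
                        second = c(TRUE, FALSE),
                        third  = c(TRUE, FALSE))
in_bounds <- as.matrix(patterns)

# 1-of-3 rule: pass when at least one observation is in bounds
patterns$pass_1_of_3 <- rowSums(in_bounds) >= 1

# California rule with m = 3: pass when the first observation is in bounds,
# or when both of the next two observations are in bounds
patterns$pass_CA <- in_bounds[, "first"] |
  (in_bounds[, "second"] & in_bounds[, "third"])

patterns
sum(patterns$pass_1_of_3)   # 7 of the 8 patterns pass
sum(patterns$pass_CA)       # 5 of the 8 patterns pass
```

The pattern (out, out, in) passes the 1-of-3 rule but fails the California rule, which is the case described in the text.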
Davis (1998a) states that if the FWFPR is kept constant, then the California rule
offers little increased power compared to the k-of-m rule, and can
actually decrease the power of detecting contamination.
The Modified California Rule
The Modified California Rule was proposed as a compromise between a 1-of-m
rule and the California rule. For a given FWFPR, the Modified California rule
achieves better power than the California rule, and still requires at least as many
observations in bounds as out of bounds, unlike a 1-of-m rule.
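To see where the Modified California rule sits between the other two, the sketch below computes the probability that a single sampling occasion passes under each rule when the limit is held fixed and each observation is independently in bounds with probability p. The function pass_prob is a hypothetical helper, and this is only a rough illustration: it does not hold the FWFPR constant across rules, so it does not reproduce the power comparison described above.

```r
# Probability that one sampling occasion passes, for a fixed limit and
# independent observations that are each in bounds with probability p
# (p is a single number).
pass_prob <- function(p) {
  c(
    one_of_4    = 1 - (1 - p)^4,        # at least 1 of 4 in bounds
    CA_m4       = p + (1 - p) * p^3,    # first in, or all of next 3 in
    modified_CA = p + (1 - p) * (3 * p^2 * (1 - p) + p^3)  # first in, or >= 2 of next 3
  )
}

pass_prob(0.5)   # 0.9375, 0.5625, 0.7500
```

For the same limit, the Modified California rule is harder to pass than a 1-of-4 rule and easier to pass than the California rule with m=4.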
Different Notations Between Different References
For the k-of-m rule described in this help file, both
Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable
p instead of k to represent the minimum number
of future observations the interval should contain on each of the r sampling
occasions.
Gibbons et al. (2009, Chapter 1) present extensive lists of the value of K for both k-of-m rules and California rules. Gibbons et al.'s notation reverses the meaning of k and r compared to the notation used in this help file. That is, in Gibbons et al.'s notation, k represents the number of future sampling occasions or monitoring wells, and r represents the minimum number of observations the interval should contain on each sampling occasion.
USEPA (2009, Chapter 19) uses p in place of k.
Steven P. Millard (EnvStats@ProbStatInfo.com)
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least p Out of m Future Observations From a Normal Population. Technometrics 19, 167–177.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878–898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668–1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115–125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195–206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at Least m Out of k Future Observations. Technometrics 15, 897–914.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNormSimultaneousK, predIntNormSimultaneousTestPower, predIntNorm,
predIntLnormSimultaneous, tolIntNorm, Normal, estimate.object, enorm
library(EnvStats)  # provides predIntNormSimultaneous, enorm, and the EPA example data
# Generate 8 observations from a normal distribution with parameters
# mean=10 and sd=2, then use predIntNormSimultaneous to estimate the
# mean and standard deviation of the true distribution and construct an
# upper 95% prediction interval to contain at least 1 out of the next
# 3 observations.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(479)
dat <- rnorm(8, mean = 10, sd = 2)
predIntNormSimultaneous(dat, k = 1, m = 3)
#Results of Distribution Parameter Estimation
#
#
#Assumed Distribution: Normal
#
#Estimated Parameter(s): mean = 10.269773
# sd = 2.210246
#
#Estimation Method: mvue
#
#Data: dat
#
#Sample Size: 8
#
#Prediction Interval Method: exact
#
#Prediction Interval Type: upper
#
#Confidence Level: 95%
#
#Minimum Number of
#Future Observations
#Interval Should Contain: 1
#
#Total Number of
#Future Observations: 3
#
#Prediction Interval: LPL = -Inf
# UPL = 11.4021
#
# Repeat the above example, but do it in two steps. First create a list called
# est.list containing information about the estimated parameters, then create the
# prediction interval.
est.list <- enorm(dat)
est.list
#Results of Distribution Parameter Estimation
#
#
#Assumed Distribution: Normal
#
#Estimated Parameter(s): mean = 10.269773
# sd = 2.210246
#
#Estimation Method: mvue
#
#Data: dat
#
#Sample Size: 8
predIntNormSimultaneous(est.list, k = 1, m = 3)
#Results of Distribution Parameter Estimation
#
#
#Assumed Distribution: Normal
#
#Estimated Parameter(s): mean = 10.269773
# sd = 2.210246
#
#Estimation Method: mvue
#
#Data: dat
#
#Sample Size: 8
#
#Prediction Interval Method: exact
#
#Prediction Interval Type: upper
#
#Confidence Level: 95%
#
#Minimum Number of
#Future Observations
#Interval Should Contain: 1
#
#Total Number of
#Future Observations: 3
#
#Prediction Interval: LPL = -Inf
# UPL = 11.4021
#
# Compare the 95% 1-of-3 upper prediction interval to the California and
# Modified California prediction intervals. Note that the upper prediction
# bound for the Modified California rule is between the bound for the
# 1-of-3 rule and the bound for the California rule.
predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"]
# UPL
#11.4021
predIntNormSimultaneous(dat, m = 3, rule = "CA")$interval$limits["UPL"]
# UPL
#13.03717
predIntNormSimultaneous(dat, rule = "Modified.CA")$interval$limits["UPL"]
# UPL
#12.12201
#
# Show how the upper bound on an upper 95% simultaneous prediction limit increases
# as the number of future sampling occasions r increases. Here, we'll use the
# 1-of-3 rule.
predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"]
# UPL
#11.4021
predIntNormSimultaneous(dat, k = 1, m = 3, r = 10)$interval$limits["UPL"]
# UPL
#13.28234
#
# Compare the upper simultaneous prediction limit for the 1-of-3 rule
# based on individual observations versus based on means of order 4.
predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"]
# UPL
#11.4021
predIntNormSimultaneous(dat, n.mean = 4, k = 1,
m = 3)$interval$limits["UPL"]
# UPL
#11.26157
#==========
# Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an
# upper simultaneous prediction limit for the 1-of-3 rule for
# r = 2 future sampling occasions. The data for this example are
# stored in EPA.09.Ex.19.1.sulfate.df.
# We will pool data from 4 background wells that were sampled on
# a number of different occasions, giving us a sample size of
# n = 25 to use to construct the prediction limit.
# There are 50 compliance wells and we will monitor 10 different
# constituents at each well at each of the r=2 future sampling
# occasions. To determine the confidence level we require for
# the simultaneous prediction interval, USEPA (2009) recommends
# setting the individual Type I Error level at each well to
# 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells))
# which translates to setting the confidence level to
# (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells))
# where SWFPR = sitewide false positive rate. For this example, we
# will set SWFPR = 0.1. Thus, the confidence level is given by:
nc <- 10
nw <- 50
SWFPR <- 0.1
conf.level <- (1 - SWFPR)^(1 / (nc * nw))
conf.level
#[1] 0.9997893
#
# Look at the data:
names(EPA.09.Ex.19.1.sulfate.df)
#[1] "Well" "Month" "Day"
#[4] "Year" "Date" "Sulfate.mg.per.l"
#[7] "log.Sulfate.mg.per.l"
EPA.09.Ex.19.1.sulfate.df[,
c("Well", "Date", "Sulfate.mg.per.l", "log.Sulfate.mg.per.l")]
#   Well       Date Sulfate.mg.per.l log.Sulfate.mg.per.l
#1  GW01 1999-07-08             63.0             4.143135
#2  GW01 1999-09-12             51.0             3.931826
#3  GW01 1999-10-16             60.0             4.094345
#4  GW01 1999-11-02             86.0             4.454347
#5  GW04 1999-07-09            104.0             4.644391
#6  GW04 1999-09-14            102.0             4.624973
#7  GW04 1999-10-12             84.0             4.430817
#8  GW04 1999-11-15             72.0             4.276666
#9  GW08 1997-10-12             31.0             3.433987
#10 GW08 1997-11-16             84.0             4.430817
#11 GW08 1998-01-28             65.0             4.174387
#12 GW08 1999-04-20             41.0             3.713572
#13 GW08 2002-06-04             51.8             3.947390
#14 GW08 2002-09-16             57.5             4.051785
#15 GW08 2002-12-02             66.8             4.201703
#16 GW08 2003-03-24             87.1             4.467057
#17 GW09 1997-10-16             59.0             4.077537
#18 GW09 1998-01-28             85.0             4.442651
#19 GW09 1998-04-12             75.0             4.317488
#20 GW09 1998-07-12             99.0             4.595120
#21 GW09 2000-01-30             75.8             4.328098
#22 GW09 2000-04-24             82.5             4.412798
#23 GW09 2000-10-24             85.5             4.448516
#24 GW09 2002-12-01            188.0             5.236442
#25 GW09 2003-03-24            150.0             5.010635
# Construct the upper simultaneous prediction limit for the
# 1-of-3 plan based on the log-transformed sulfate data
log.Sulfate <- EPA.09.Ex.19.1.sulfate.df$log.Sulfate.mg.per.l
pred.int.list.log <-
  predIntNormSimultaneous(x = log.Sulfate, k = 1, m = 3, r = 2,
    rule = "k.of.m", pi.type = "upper", conf.level = conf.level)
pred.int.list.log
#Results of Distribution Parameter Estimation
#
#
#Assumed Distribution: Normal
#
#Estimated Parameter(s): mean = 4.3156194
# sd = 0.3756697
#
#Estimation Method: mvue
#
#Data: log.Sulfate
#
#Sample Size: 25
#
#Prediction Interval Method: exact
#
#Prediction Interval Type: upper
#
#Confidence Level: 99.97893%
#
#Minimum Number of
#Future Observations
#Interval Should Contain
#(per Sampling Occasion): 1
#
#Total Number of
#Future Observations
#(per Sampling Occasion): 3
#
#Number of Future
#Sampling Occasions: 2
#
#Prediction Interval: LPL = -Inf
# UPL = 5.072355
# Now exponentiate the prediction interval to get the limit on
# the original scale
exp(pred.int.list.log$interval$limits["UPL"])
# UPL
#159.5497
#==========
# Cleanup
#
rm(dat, est.list, nc, nw, SWFPR, conf.level, log.Sulfate,
pred.int.list.log)
