SampStop | R Documentation |
Compute the probability that continuing data collection will lead to a change in the value of an estimated mean.
SampStop(lm.obj, formula, n1.data, yvar, n2.data, p = NULL, delta = NULL, seed = NULL)
lm.obj |
object of class |
formula |
righthand side of the formula in |
n1.data |
data frame containing units in the part of the sample that has been completed; includes |
yvar |
name or number of column in |
n2.data |
data frame containing units in the part of the sample that is yet to be completed; includes only covariates in |
p |
Vector of anticipated response probabilities for the n2 sample; 0 < |
delta |
vector of potential differences in the estimated means for the n1 and n2 samples. |
seed |
random number seed for selecting sample from incomplete cases. |
SampStop allows an evaluation to be made of whether data collection can be stopped, without substantially affecting the value of an estimated mean, prior to completing collection for all units. Suppose that a sample of size n is divided between the n_1 units whose collection has been completed and the remaining n_2 = n - n_1 units that are yet to be completed. The function computes Pr(|e_1 - e_2| < \delta), where e_1 - e_2 is the potential difference (delta) between the estimated mean based on the completed sample and the estimated mean for the full sample if all units were to be completed. For e_1 the mean is estimated after imputing the y's for the n_2 incomplete units. The estimated mean e_2 is computed assuming that an additional n_2 * p units are completed, and the y's for the remaining n_2 - n_2*p incomplete units are imputed. Estimating the variance of e_1 - e_2 involves selecting a sample from n2.data using the random number seed in seed.
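The two estimated means described above can be illustrated with a small base-R sketch on simulated data. This is not the SampStop implementation: the data, the model `y ~ x`, and the names `done`, `y.imp`, and `resp` are all hypothetical, and as a simplification the model is not refitted after the additional units respond.

```r
# Illustrative sketch only: simulated data, not the PracTools code.
set.seed(1)
n    <- 200
x    <- runif(n, 10, 100)
y    <- 2 * x + rnorm(n, sd = 5)
n1   <- 120                         # units already completed
n2   <- n - n1                      # units not yet completed
done <- seq_len(n) <= n1            # TRUE for the n_1 completed units

# Working linear model fitted to the completed units only
fit <- lm(y ~ x, data = data.frame(x = x[done], y = y[done]))

# e_1: observed y for completed units, imputed y for all n_2 incomplete units
y.imp <- predict(fit, newdata = data.frame(x = x[!done]))
e1 <- (sum(y[done]) + sum(y.imp)) / n

# e_2 for one hypothetical outcome: a fraction p of the n_2 units respond,
# and the rest keep their imputed values (model not refitted here)
p    <- 0.4
resp <- sample(n2, size = round(n2 * p))
y2   <- y.imp
y2[resp] <- y[!done][resp]          # responders contribute their observed y
e2 <- (sum(y[done]) + sum(y2)) / n

c(e1 = e1, e2 = e2, diff = e1 - e2)
```

Repeating the last step over many random choices of `resp` would give a rough sense of how variable e_1 - e_2 is for a given p.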
The parameter p is the response rate that is anticipated for the n_2 uncompleted units. The usual situation is that there is some uncertainty about p, which can be accounted for by inputting a vector of p's. \delta is a difference in estimates that, if not exceeded, would lead to stopping data collection. For an acceptably small value of delta, if Pr(|e_1 - e_2| < \delta) is large enough, the decision can be made to stop data collection. The variable y in yvar is assumed to follow the linear model in lm.obj. A model with independent errors (or a simple random sample) is assumed for calculations.
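The stopping decision over several values of p and delta can be sketched as follows. This assumes, for illustration only, that e_1 - e_2 is approximately normal with mean zero, so that the probability is 2 * pnorm(delta / se) - 1; the exact formula used by SampStop is not stated here. The standard errors and the 0.90 threshold are made-up values.

```r
# Illustrative sketch only: hypothetical standard errors, normal approximation assumed.
p     <- c(0.2, 0.3, 0.4)       # anticipated response rates
delta <- c(5, 10, 20)           # tolerable differences in the estimated mean

# One row per (p, delta) combination: length(p) * length(delta) rows
grid <- expand.grid(delta = delta, p = p)

# Hypothetical standard error of e_1 - e_2 for each p (made-up numbers)
se.by.p <- c(6, 8, 10)
grid$se <- se.by.p[match(grid$p, p)]

# Z-score and Pr(|e_1 - e_2| < delta) under the assumed normal approximation
grid$z    <- grid$delta / grid$se
grid$prob <- 2 * pnorm(grid$z) - 1

# Combinations where stopping would be judged acceptable at a 0.90 threshold
stop.ok <- subset(grid, prob >= 0.90)
stop.ok
```

If every plausible p yields a high probability for a delta that is acceptably small, that would support stopping data collection.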
Matrix with length(p)*length(delta) rows and the following columns:
Pr(response) |
Probability of response by each of the remaining |
Exp no. resps |
Expected number of respondents among the remaining |
, i.e. n_2*p
y1 mean |
Mean of the |
diff in means |
Value of the input parameter |
se of diff |
Standard error of the difference |
z-score |
Z-score for computing |
Pr(smaller diff) |
|
George Zipf, Richard Valliant
Wagner, J. and Raghunathan, T. (2010). A new stopping rule for surveys. Statistics in Medicine, 29(9), 1014-1024.
library(PracTools)
# Model with quantitative covariates
data(hospital)
HOSP <- hospital
HOSP$sqrt.x <- sqrt(HOSP$x)
sam <- sample(nrow(HOSP), 50)
N1 <- HOSP[sam, ]
N2 <- HOSP[-sam, ]
## Create lm object using "known" data; no intercept model
lm.obj <- lm(y ~ 0 + sqrt.x + x, data = N1)
del <- mean(HOSP$y) - mean(HOSP$y) * seq(.6, 1, by=0.05)
SampStop(lm.obj = lm.obj,
         formula = ~ 0 + sqrt.x + x,
         n1.data = N1,
         yvar = "y",
         n2.data = N2,
         p = seq(0.2, 0.6, by=0.05),
         delta = del,
         seed = .Random.seed[413])
# Model with factors
data(labor)
sam <- sample(nrow(labor), 50)
n1.vars <- c("WklyWage", "HoursPerWk", "agecat", "sex")
n2.vars <- c("HoursPerWk", "agecat", "sex")
N1 <- labor[sam, n1.vars]
N2 <- labor[-sam, n2.vars]
lm.obj <- lm(WklyWage ~ HoursPerWk + as.factor(agecat) + as.factor(sex), data = labor)
del <- mean(N1$WklyWage) - mean(N1$WklyWage) * seq(.75, .95, by=0.05)
result <- SampStop(lm.obj = lm.obj,
                   formula = ~ HoursPerWk + as.factor(agecat) + as.factor(sex),
                   n1.data = N1,
                   yvar = "WklyWage",
                   n2.data = N2,
                   p = seq(0.2, 0.4, by=0.05),
                   delta = del,
                   seed = .Random.seed[78])
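# ggplot() expects a data frame; this conversion is harmless if result already is one
result <- as.data.frame(result)
# Label each curve by response probability and expected number of respondents
p.nresp <- paste(result[,1], result[,2], sep=", ")
library(ggplot2)
# Plot Pr(|e1-e2| < delta) (column 7) against delta (column 4)
ggplot2::ggplot(result, aes(result[,4], result[,7], colour = factor(p.nresp))) +
    geom_point() +
    geom_line(linewidth=1.1) +
    labs(x = "delta", y = "Pr(|e1-e2| < delta)", colour = "Pr(resp), n.resp")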