knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Applications of pooled or group testing include two general aims: estimation of the population individual-level probability of positivity; and cost-efficient schemes of identification of positive individuals. The PooledInfRate package is concerned only with estimation applications. Functionality includes point and interval estimation of a proportion from binary (positive / negative) samples that have been pooled or grouped before assessment of positivity, which is determined only at the pool level. Estimation applications date back at least to @ChiRee1962, who developed asymptotic likelihood-based methods for pools of equal size in application to estimation of virus infection in mosquito collections. These authors also extended their analyses to two different pool sizes under a simplifying assumption to make the mathematical development tractable, but they were constrained from further extension by computational tools available at the time. The package name "PooledInfRate" is based on the use of "infection rate" in entomological applications and follows the author's development of the Microsoft Excel™ (@msexcel) add-in of the same name (https://www.cdc.gov/westnile/resourcepages/mosqSurvSoft.html).
The computational functionality of routines in this package for point estimation are the standard maximum likelihood estimate (MLE); the bias corrected estimate based on the methods of Gart (see @Gar1991), as implemented by @Hep2005; the bias-reduced estimate based on Firth's correction (see @Fir1993 and @HepBig2017); and the traditional minimum infection rate (MIR) estimate used in entomology. Confidence intervals (CI) included are the score, skewness-corrected score, and bias-and-skewness-corrected score intervals following Gart (@Gar1991) [see @Hep2005]; the interval obtained by inverting the likelihood ratio test; the Wald interval; and an interval based on the MIR (see @Hep2005 and @Big2008 for evaluations of these intervals). Also included are routines to compute point and CI estimates for the differences of proportions estimated from pooled samples, as detailed in @Big2008.
To complement applications involving entomological surveillance for pathogens in disease vectors such as mosquitoes, the package provides routines to compute the Vector Index (@FauEtAl2015), a measure incorporating both population size or density and of vector infection prevalences ("infection rate"). Its computation is outlined below.
The methods mentioned so far assume that perfect tests or assays are used in testing the pools for "positivity." Extending this, functionality is included for point estimation when using an imperfect test with known sensitivity and specificity. Following @HepBig2021, methods based both on Firth's correction (@Fir1993) and on the standard MLE are included. There is currently a lack of theoretical work in this area for CI estimation in the case of imperfect tests, so none are presently included; future theoretical and corresponding package development in this area is anticipated.
A variety of data structures may be used with estimation functions in PooledInfRate. As with R functionality in general, either individual vector objects or data.frame objects supply the data to the functions. The data must have at a minimum a recording of the number of positive pools, x, and the pool sizes, m, corresponding to those pools. A variable giving the number of pools, n, corresponding to those records may also be specified, and when none is given, the values for x are assumed to be 0/1 and each n is assumed to be 1 by default. To carry out the same analyses for multiple groups at once, a grouping variable, either as a factor or a numeric vector that will be treated as a factor may be used. Illustrations for acceptable data formats are
x <- c(1,0,0,0) m <- c(50,25,10,5)
and
ex1.dat <- data.frame(Pos = c(1,0,0,0), PoolSize = c(50,25,10,5)) ex1.dat
and
ex2.dat <- data.frame(Pos = c(2,1,0,1), PoolSize = c(50,25,10,5), NumPools = c(10,10,5,2), Location = c("A","A","B","B")) ex2.dat
and if there are different groups, say Species,
mosq.dat <- data.frame(Zone = c(1,1,2,2, 1,1,1,2,2, 1,1,2), Species = rep(c("Culex pipiens","Culex tarsalis", "Culex quinquefaciatus"), c(4,5,3)), Pos = c(1,0,0,1, 1,0,1,0,0, 0,1,0), PoolSize = c(100,50,25,5, 200,100,50,25,10, 25,10,5), Nights = c(5,5,3,3,5,5,5,3,3,5,5,3)) mosq.dat
Other data may naturally be included in data frames, and we will include these data sets and needed extensions of these data sets in certain examples below.
The function pooledBin ("pooled binary") provides the interface to both point and CI estimates, and there are both default and formula interfaces provided. The function pIR ("pooled infection rate") is a copy of pooledBin with a slightly different and shorter name merely for convenience; in all the examples below, pooledBin and pIR can be used interchangeably. For single samples, the default methods are straightforward, with printed output being the point estimate and computed CI:
library(PooledInfRate) pooledBin(x,m) with(ex2.dat, pooledBin(Pos, PoolSize, NumPools)) with(mosq.dat, pIR(Pos, PoolSize))
When there is a grouping variable, as with the mosq.dat data set, an optional group= may be used to obtain individual results for each group:
with(mosq.dat, pooledBin(Pos, PoolSize, group = Species))
The formula interface in R provides a convenient, generally consistent interface to modeling function expressions, and this interface has been adapted to ease use of the functions in PooledInfRate. As an example,
pooledBin(Pos ~ PoolSize, mosq.dat) # the second function argument is data=
Specification of a grouping variable is made using a vertical bar (|) in the formula:
pooledBin(Pos ~ PoolSize | Species, mosq.dat) # the second function argument is data=
This example data set mosq.dat does not have a "number of pools" variable, so the above evaluations do not need this specified, since by default this is a vector of 1s. Use of formula interface to specify such a variable is required for the ex2.dat data set, however, and this is accommodated using "special" functions m() and n() (echoing the mathematical exposition in @HepBig2017). As an illustration, the following incorporates this variable:
pooledBin(Pos ~ m(PoolSize) + n(NumPools), ex2.dat)
Because the pool size variable [m()] is required for any pooled estimation (and because it is easier when there is less typing), the m() specification is optional, so that
pooledBin(Pos ~ PoolSize + n(NumPools), ex2.dat)
gives the same result. Further, since the m() and n() identifiers communicate to the fitting functions the roles of the variables, the order does not matter in the formula, so that the following all provide the same result:
pooledBin(Pos ~ m(PoolSize) + n(NumPools), ex2.dat) pooledBin(Pos ~ PoolSize + n(NumPools), ex2.dat) pooledBin(Pos ~ n(NumPools) + m(PoolSize), ex2.dat) pooledBin(Pos ~ n(NumPools) + PoolSize , ex2.dat)
Specification of the grouping variable is as above:
pooledBin(Pos ~ PoolSize + n(NumPools) | Location, ex2.dat) # and use the (significant) 'digits' argument of print() to make for easier reading print(pooledBin(Pos ~ PoolSize + n(NumPools) | Location, ex2.dat), digits = 3)
With package version 1.2, results for the combinations of multiple grouping variables may be obtained by specifying the grouping variables after the | as above and separated by * characters. As an example using the mosq.dat data set above,
pooledBin(Pos ~ PoolSize | Zone * Species, mosq.dat)
Beginning with package version 1.1, returned objects of class pooledBin/pIR can be accessed like data frames, using both [. and $ extractor methods. Examples are
# original call pooledBin(Pos ~ PoolSize | Zone, mosq.dat) # just Zone 2 pooledBin(Pos ~ PoolSize | Zone, mosq.dat)[2,] # just P pooledBin(Pos ~ PoolSize | Zone, mosq.dat)[,"P"] pooledBin(Pos ~ PoolSize | Zone, mosq.dat)$P # assign and just CIs pb.out <- pooledBin(Pos ~ PoolSize | Zone * Species, mosq.dat) pb.out[,3:4] pb.out[c("Lower","Upper")]
Beginning with package version 1.4, returned objects of class pooledBin/pIR can be converted to data frames, using as.data.frame. This can be helpful when wanting to use estimates for other purposes, such as plotting.
To print out more detailed information on the estimation results, summary methods are provided. The resulting output contains details including the estimates themselves, estimation methods used, the total number of individuals, the total numbers of pools, and the total number of positive pools.
pir.combined.out <- pIR(Pos ~ PoolSize + n(NumPools), ex2.dat) summary(pir.combined.out) pir.location.out <- pIR(Pos ~ PoolSize + n(NumPools) | Location, ex2.dat) summary(pir.location.out) pir.mosq.out <- pIR(Pos ~ PoolSize | Species, mosq.dat) print(summary(pir.mosq.out), digits=3)
Further, the summary methods have an argument simple that can be used to return the detailed results as a data frame. To return the summary data frame, set simple = TRUE (default is simple = FALSE):
summary(pir.mosq.out) summary(pir.mosq.out, simple = TRUE)
A plot method of the diagnostic tool described by @CheSwa1990 to evaluate the suitability of the binomial model is available.
plot(pir.combined.out)
and this also works when there are groups (though adjustment of the layout parameter may be required).
Records with missing data on any of the variables used in the function calls are removed for estimation. For example,
x.na <- c(1,0,0,NA,1) m.na <- c(10,50,25,10,10) pooledBin(x.na, m.na) summary(pooledBin(x.na,m.na))
and this is also true using the formula method interface.
As described above, there are several point and CI methods available, and the choices are specified with the pt.method and ci.method parameters.
pIR(Pos ~ PoolSize, mosq.dat, pt.method = "firth", ci.method = "skew-score") # the defaults pIR(Pos ~ PoolSize, mosq.dat, pt.method = "gart", ci.method = "score") pIR(Pos ~ PoolSize, mosq.dat, pt.method = "mle", ci.method = "wald") # specification of "mir" for either sets the other to "mir" pIR(Pos ~ PoolSize, mosq.dat, pt.method = "mir", ci.method = "mir")
Also, note that when all pool sizes are 1, then there is no pooling, so the standard Wilson score interval is returned, and pt.method is set to "mle" and ci.method is set to "score" in this case.
For users of the Microsoft Excel™ PooledInfRate add-in: the default point estimation method for this R package is Firth's method, which is different than the one in Excel™, which is Gart's method. This is because Firth's method was shown in @HepBig2017 to perform generally better.
Estimated values for the prevalence are often very small in pooled testing applications, so a scale parameter is provided to facilitate reading output (scale = 1000 is often used in entomological applications). Results are printed using the specified scale, both with the standard print and summary methods.
pIR(Pos ~ PoolSize, mosq.dat) # default scale = 1 pIR(Pos ~ PoolSize, mosq.dat, scale=1000)
To set the desired confidence level for CIs, the parameter alpha, with default alpha = 0.05, is used, with the confidence level equal to 100(1-alpha)%.
The estimation methods used require iterative numerical computation, and the parameter tol (for "tolerance") is used to indicate how precisely estimation is required; the default is .Machine$double.eps^0.5.
The "equal" pair of functions pooledBinDiff / pIRDiff compute confidence intervals for differences of proportions, as detailed in @Big2008. The interface for the default call is an expansion of the one-sample call: pooledBin(x1,m1,x2,m2). For the formula interface, these functions are exactly the same as for the one-sample case, only a single group variable is permitted and it is in this case required. Estimates of differences and CIs for all pairwise differences of the levels of the grouping variable are computed, and a summary method provides detail, including individual group estimates. Examples follow.
x1 <- c(1,0,0,0) m1 <- c(100,50,25,10) x2 <- c(1,1,0,0) m2 <- c(50,40,30,20) pooledBinDiff(x1,m1,x2,m2) n1 <- c(10,20,30,40) n2 <- rep(1,4) pooledBinDiff(x1,m1,x2,m2,n1,n2) pooledBinDiff(Pos ~ PoolSize + n(NumPools) | Location, ex2.dat) mdiff.out <- pIRDiff(Pos ~ PoolSize | Species, mosq.dat, scale=1000) # scale for easier interpretation # Print fewer digits for easier reading print(summary(mdiff.out), digits = 3) # as with the one-sample results, one can return the detailed summary as a data frame summary(mdiff.out, simple = TRUE)
Beginning with package version 1.1, returned objects of class pooledBinDiff/pIRDiff can be accessed like data frames, using both [. and $ extractor methods. Examples are
# original call pooledBinDiff(Pos ~ PoolSize | Species, mosq.dat) # just the Culex pipiens - Culex quinquefaciatus comparison pooledBinDiff(Pos ~ PoolSize | Species, mosq.dat)[2,] # just the differences pooledBinDiff(Pos ~ PoolSize | Species, mosq.dat)[,"Diff"] pooledBinDiff(Pos ~ PoolSize | Species, mosq.dat)$Diff # assign and just CIs pbd.out <- pooledBinDiff(Pos ~ PoolSize | Species, mosq.dat) pbd.out[,3:4] pbd.out[c("Lower","Upper")]
As with the one-sample functions beginning with package version 1.4, returned objects of class pooledBinDiff/pIRDiff can be converted to data frames, using as.data.frame. This can be helpful when wanting to use estimates for other purposes, such as plotting.
As with the one-sample functions, records with missing data on any of the variables used in the function calls are removed for estimation.
Options scale, alpha, and tol are the same as for the one-sample functions.
In mosquito-borne disease surveillance, the Vector Index (VI) is a measure used in evaluating risk of infection in a human population, often to aid decisions on community interventions, such as area-wide vector mosquito abatement (see, e.g., @FauEtAl2015). Primarily used in mosquito-borne disease surveillance, the VI is the sum of the products of a measure of population "size" and the infection rate over vector species. Because this definition requires the specification of "vector species" over which to perform these computations, an additional variable is needed beyond those specified in the pooledBin / pIR functions above. This is accommodated in the default and formula interfaces as follows.
Field collections of mosquitoes used for such surveillance can result in a wide range of data configurations, and the individuals that contribute to the population size measures may not be the same as those used in estimating the infection rate (prevalence).
The most straightforward data situation is when all individuals in the collection are used in both components, the population measure and the infection rate estimate. Data in this case may be formatted as above, say with the example mosq.dat data set and the example just shown.
For the default interface, a vector parameter is used, so using the mosq.dat example data set above,
vi.out <- with(mosq.dat, vectorIndex(Pos,PoolSize,vector=Species)) # or use VI() vi.out
and there is a summary method available to provide detail:
summary(vi.out)
Using the formula interface, the vector species variable is specified after a forward slash symbol (/), written after the main part of the formula:
vectorIndex(Pos ~ PoolSize / Species, mosq.dat)
Computing the VI by groups is available using the same interface as groups for the one- and two-sample functions. Note, however, that using the formula interface, the group variable must come before the slash (/) indicating the vector species variable. Examples using the mosq.dat data set are
with(mosq.dat, vectorIndex(Pos, PoolSize, vector=Species, group=Zone)) VI(Pos ~ PoolSize | Zone / Species, mosq.dat)
The above calculations assume that collection effort is constant (i.e., represents the same time in the environment for collection) across traps. Mosquito trapping effort, often expressed "per trap night" to reflect the duration between retrievals of specimens, may differ by trap, however, and when pools are trap-specific or contain individuals caught in traps using the same effort, a trapping effort variable provided to specify this. For the default interface, a trap.time variable is specified, while for the formula interface the variable is included after the Vector variable, separated by a colon (:). Examples using the mosq.dat data set are
with(mosq.dat, vectorIndex(Pos, PoolSize, vector=Species, trap.time = Nights, group=Zone)) VI(Pos ~ PoolSize | Zone / Species:Nights, mosq.dat)
It may be that field collections used for computing the measure of population size differ, or that individual mosquitoes are pooled from different traps possibly representing different collection efforts. If the counts of mosquitoes caught are aggregated separately by collection effort, and otherwise some or all of them are pooled for testing and infection rate estimation, these pieces of data may be combined in computing the VI by formatting the data using missing values (NA) for the response data to be used in computing the population size measure. This is possible in both the default and formula interfaces by correct specification of the n.use.traps and n.use.na parameter options, which indicate what subsets of the data are to be used to compute the population measure (reflected in the "n.use." in the parameter names).
As an example data set, the document West Nile Virus in the United States: Guidelines for Surveillance, Prevention, and Control of the US Centers for Disease Control and Prevention, Division of Vector-Borne Diseases, contains this example data set in Appendix 2 (written here in aggregated form):
pools.dat <- data.frame(Species=c("Cx. tarsalis","Cx. pipiens"), Pos=c(1,1), PoolSize=c(50,50), NumPools=c(6,5)) pools.dat
In addition to the pool testing data, the total number of Cx. tarsalis caught for 6 traps was 442, and the total number of Cx. pipiens caught for 6 traps was 233. To include this information in the pools.dat data set, (1) enter missing values (NA) for the "number positive" variable (x); (2) treat the "pool size" variable (m) as the collection count; and augment "number of pools" variable (n) as 1.
traps.dat <- data.frame(Species = c("Cx. tarsalis","Cx. pipiens"), Pos = c(NA,NA), PoolSize=c(442, 233), NumPools=c(1,1)) # note NumPools should be 1 for each entry vi.dat <- rbind(pools.dat, traps.dat) vi.dat VI(Pos ~ PoolSize + n(NumPools) / Species, vi.dat, n.use.traps = FALSE, n.use.na = TRUE) summary(VI(Pos ~ PoolSize + n(NumPools) / Species, vi.dat, n.use.traps = FALSE, n.use.na = TRUE))
Finally, to account for trapping effort, recall that these vector mosquito counts are from 6 traps (assumed to be for 1 night each). Augment the vi.dat data set as
# put 1s in for the pool data for later; use values for trap nights in application vi.dat$TrapNights <- c(1,1,6,6) vi.dat VI(Pos ~ PoolSize + n(NumPools) / Species:TrapNights, vi.dat, n.use.traps = FALSE, n.use.na = TRUE) summary(VI(Pos ~ PoolSize + n(NumPools) / Species:TrapNights, vi.dat, n.use.traps = FALSE, n.use.na = TRUE))
The Avg N values reported in the summary result match the CDC result. To match the VI result exactly, note that the CDC computation used the MIR as the estimate of the infection rate, so compute this using
VI(Pos ~ PoolSize + n(NumPools) / Species:TrapNights, vi.dat, n.use.traps = FALSE, n.use.na = TRUE, pt.method="mir") summary(VI(Pos ~ PoolSize + n(NumPools) / Species:TrapNights, vi.dat, n.use.traps = FALSE, n.use.na = TRUE, pt.method="mir"))
Finally, if the individuals in the pools are not also counted in the "not-tested" data set (since doing so would over-count the collected number of individuals), these counts can be included in the computation of the population size measure, making sure that the collection effort variable (trap.time; TrapNights in the example) is correct for all the individuals in the pools. To include these in the computation, simply set the n.use.traps variable to TRUE:
# not trying to match the CDC guidelines report here, since in the example more individuals are # included in the population size (mosquito density) measure VI(Pos ~ PoolSize + n(NumPools) / Species:TrapNights, vi.dat, n.use.traps = TRUE, n.use.na = TRUE) summary(VI(Pos ~ PoolSize + n(NumPools) / Species:TrapNights, vi.dat, n.use.traps = TRUE, n.use.na = TRUE))
Beginning with package version 1.1, returned objects of class vectorIndex/VI can be accessed like data frames, using both [. and $ extractor methods. Examples are
# original call VI(Pos ~ PoolSize | Zone / Species, mosq.dat) # just Zone 2 VI(Pos ~ PoolSize | Zone / Species, mosq.dat)[2,] # just the VIs VI(Pos ~ PoolSize | Zone / Species, mosq.dat)[["VI"]] VI(Pos ~ PoolSize | Zone / Species, mosq.dat)$VI
As with the estimation functions above beginning with package version 1.4, returned objects of class vectorIndex/VI can be converted to data frames, using as.data.frame. This can be helpful when wanting to use estimates for other purposes, such as plotting.
Other options to the vectorIndex / VI functions are the same as those for pooledBin, so that the user may specify the method for point estimation.
The VI is a measure used for evaluation of infection risk of some human pathogen, but in principle it is not estimating a population quantity, though it is expected to be proportional to the number of infected individual vectors. Because of this, at this time functionality is not included to compute CIs (which are inferential beasts, after all) to accompany VI computations. Inclusion of CI computations to provide some measure of "uncertainty" or "variability" of the VI computed is being evaluated, and this would be straightforward conditional on the collection counts. A more complete measure of such variability would include uncertainty in the counts, too, and this is not presently available (but is being evaluated).
The methods used for point estimation of population individual-level prevalence using pooledBin assume that the test used to assess the "positivity" of a pool are perfect. @HepBig2021 extend the methods of @HepBig2017 to include imperfect tests, quantified through known values for test sensitivity and specificity. This functionality is available via the function ipooledBin; there is not at present a corresponding ipIR, as further development will see this extension incorporated directly into pooledBin.
The principle difference between pooledBin and ipooledBin is the specification of the parameters sens and spec, which both default to 1, a default perfect test. Using the mosq.dat data set from above:
ipooledBin(Pos ~ PoolSize | Species, mosq.dat, sens=0.9, spec=0.95) summary(ipooledBin(Pos ~ PoolSize | Species, mosq.dat, sens=0.9, spec=0.95))
Estimates based on the uncorrected MLE for imperfect tests are also available by setting the pt.method parameter:
ipooledBin(Pos ~ PoolSize | Species, mosq.dat, sens=0.9, spec=0.95, pt.method="mle") summary(ipooledBin(Pos ~ PoolSize | Species, mosq.dat, sens=0.9, spec=0.95, pt.method="mle"))
Multiple grouping variables may be specified as with the perfect test functions:
ipooledBin(Pos ~ PoolSize | Species * Zone, mosq.dat, sens=0.9, spec=0.95)
Compared to pooledBin, the additional parameter p.start gives the user the options of specifying a starting starting value for the numerical algorithm (Newton-Raphson) used in estimation; there is a default value set for this when p.start = NULL, as is the default.
Confidence intervals are not included as a option for ipooledBin, because at present there is not a theoretically recommended method. (This is why this functionally has not yet been incorporated into the base pooledBin function.)
Finally, a word of caution: as noted in @HepBig2021, convergence of the computational algorithm in the presence of an imperfect test is not assured (either computationally or theoretically) for every value of sensitivity and specificity for a given data set, as some data are simply incompatible with some test performance specifications. Should the user encounter convergence issues using specified values for sens and spec, the recommendation is to evaluate estimates for a variety of these parameter specifications, perhaps beginning by assuming a perfect test, to see how estimates are impacted by such specifications.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.