View source: R/CheckInputData.R
PrepareClusterData | R Documentation |
Helper function to prepare data for analysis with HierPoolPrev().
PrepareClusterData(data, result, poolSize, hierarchy = NULL, ...)
data |
A |
result |
The name of the column with the result of each test on each pooled sample. Results must be stored as 1 for a positive test and 0 for a negative test. |
poolSize |
The name of the column with the number of specimens/isolates/insects in each pool. |
hierarchy |
The name(s) of the column(s) indicating group membership, ordered from largest to smallest. |
... |
Optional name(s) of columns with variables to stratify the data by. |
When including information about a hierarchical sampling frame (e.g. where units were sampled from sites and sites were selected from a broader region), it is critical that each cluster can be uniquely identified. It's not enough for the combination of columns specifying the hierarchy to be unique. Each different location needs to have a unique label.
This function is provided to assist users by detecting certain types of nesting and clustering issues within data. As we do not make assumptions about the hierarchical/clustering scheme, we cannot guarantee that this function will detect all errors in hierarchical/clustering schemes. This function checks for the most common type of error, which is multiple different locations (e.g., collection sites) having the same value within a hierarchical/clustering column.
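The kind of error described above, where the same label is reused for different locations, can be illustrated with a short base R sketch. This is not the package's implementation: `find_reused_labels()` and the `survey` data frame are hypothetical names introduced only for illustration, and the check shown is one plausible way to spot reused labels.

```r
# Hypothetical helper (not part of the package): return the labels in
# column `child` that occur under more than one distinct combination of
# the `parents` columns.
find_reused_labels <- function(data, parents, child) {
  # One row per distinct (parents, child) combination
  combos <- unique(data[, c(parents, child), drop = FALSE])
  # Count how many distinct parent combinations each child label has
  counts <- table(combos[[child]])
  names(counts)[as.vector(counts) > 1]
}

# Illustrative data: village and site labels are reused across regions
survey <- data.frame(
  Region  = rep(c("A", "B"), each = 4),
  Village = rep(rep(c("W", "X"), each = 2), 2),
  Site    = c(1:4, 4:1)
)

reused_villages <- find_reused_labels(survey, parents = "Region",
                                      child = "Village")
reused_sites <- find_reused_labels(survey, parents = c("Region", "Village"),
                                   child = "Site")
```

Here every village label and every site label appears under more than one parent, so neither column uniquely identifies a location on its own.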
An object of class data.frame, identical to the input data. If there were issues with nesting inside the hierarchy, the output will have a single additional column, PoolTestR_ID, containing a unique identifier for each location in the survey (created by concatenating the hierarchy column values within each row).
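As a rough illustration of what concatenating hierarchy columns into a unique identifier looks like, here is a minimal base R sketch. The helper `make_location_id()`, the column name `LocationID`, and the `"_"` separator are assumptions made for this example; the exact format the package uses for PoolTestR_ID is not stated here.

```r
# Hypothetical helper (not part of the package): paste the hierarchy
# columns together row by row to form one label per location.
make_location_id <- function(data, hierarchy, sep = "_") {
  # unname() prevents column names from being matched to paste() arguments
  cols <- unname(as.list(data[hierarchy]))
  do.call(paste, c(cols, sep = sep))
}

# Illustrative data with reused village and site labels
survey <- data.frame(
  Region  = rep(c("A", "B"), each = 4),
  Village = rep(rep(c("W", "X"), each = 2), 2),
  Site    = c(1:4, 4:1)
)

# Assumed column name and separator, chosen only for this sketch
survey$LocationID <- make_location_id(survey, c("Region", "Village", "Site"))
```

Because the identifier combines every level of the hierarchy, two sites that share a label but sit in different villages or regions receive different identifiers.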
This function can be used to check levels of the hierarchical/clustering scheme. For example, the SimpleExampleData has the scheme Region > Village > Site. The full geographical hierarchy/clustering scheme can be tested using hierarchy = c("Region", "Village", "Site"). This function can also be used to check only the levels of the hierarchical/clustering scheme that will be included in prevalence estimates, e.g., if planning to stratify SimpleExampleData by Region, the hierarchy will be hierarchy = c("Village", "Site").
Functions in PoolTools do not make assumptions about the number of levels present or the names of hierarchical columns. They can be applied in any case where a hierarchical sampling frame is involved.
CheckInputData, HierPoolPrev, getPrevalence, SimpleExampleData
# Check whether the SimpleExampleData is nested
# appropriately for estimating prevalence in
# HierPoolPrev()
SimpleExample_output <- PrepareClusterData(
  data = SimpleExampleData,
  result = "Result", poolSize = "NumInPool",
  hierarchy = c("Region", "Village", "Site")
)
# No errors/warnings were raised
identical(SimpleExample_output, SimpleExampleData)
# The hierarchical scheme is formatted properly so
# the output is identical to the input
## Not run:
# Checking another example data set for clustering issues
# Create a test data frame that has incorrectly nested
# Village and Site variables
check_data <- data.frame(
  Region = rep(c("A", "B"), each = 4),
  Village = rep(rep(c("W", "X"), each = 2), 2),
  Site = c(1:4, 4:1),
  Year = rep(0, 8),
  NumInPool = rep(10, 8),
  Result = rep(0, 8)
)
# Test whether the data.frame is formatted appropriately
# for HierPoolPrev()
check_output <- PrepareClusterData(
  data = check_data,
  result = "Result", poolSize = "NumInPool",
  hierarchy = c("Region", "Village", "Site")
)
# New column has been added with unique identifier for
# each location
check_output$PoolTestR_ID
## End(Not run)