PrepareClusterData: Prepare hierarchical/clustered survey data for analysis

View source: R/CheckInputData.R

PrepareClusterData    R Documentation

Prepare hierarchical/clustered survey data for analysis

Description

Helper function to prepare data for analysis with HierPoolPrev().

Usage

PrepareClusterData(data, result, poolSize, hierarchy = NULL, ...)

Arguments

data

A data.frame with one row for each pooled sample, and columns for the size of the pool (i.e. the number of specimens / isolates / insects pooled to make that particular pool) and the result of the test of the pool. It may also contain additional columns with further information (e.g. the location where the pool was taken), which can optionally be used for stratifying the data into smaller groups and calculating prevalence by group (e.g. calculating prevalence for each location).

result

The name of the column with the result of each test on each pooled sample. The result must be stored with 1 indicating a positive test result and 0 indicating a negative test result.

poolSize

The name of the column with the number of specimens/isolates/insects in each pool.

hierarchy

The name(s) of the column(s) indicating group membership, ordered from the largest grouping to the smallest.

...

Optional name(s) of columns with variables to stratify the data by.
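The input layout described by the arguments above can be sketched in base R. The data frame pools and all of its values are made up for illustration, and the two sanity checks are an assumption about what a user might verify by hand; they are not a description of what PrepareClusterData() does internally.

```r
# Hypothetical pooled-test data: one row per pool (all values are invented)
pools <- data.frame(
  Region    = c("A", "A", "B", "B"),
  Site      = c("A-1", "A-2", "B-1", "B-2"),
  NumInPool = c(10, 10, 5, 8),   # number of specimens in each pool
  Result    = c(0, 1, 0, 0)      # 1 = positive test, 0 = negative test
)

# The result column should contain only 0 and 1
results_ok <- all(pools$Result %in% c(0, 1))

# Assumed check: pool sizes are positive whole numbers
sizes_ok <- all(pools$NumInPool > 0 &
                pools$NumInPool == round(pools$NumInPool))

results_ok
sizes_ok
```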

Details

When including information about a hierarchical sampling frame (e.g. where units were sampled from sites and sites were selected from a broader region), it is critical that each cluster can be uniquely identified. It's not enough for the combination of columns specifying the hierarchy to be unique. Each different location needs to have a unique label. For example, a site labelled 1 in one village and a different site labelled 1 in another village must be given distinct labels.

This function is provided to assist users by detecting certain types of nesting and clustering issues within data. As we do not make assumptions about the hierarchical/clustering scheme, we cannot guarantee that this function will detect all errors in hierarchical/clustering schemes. This function checks for the most common type of error, which is multiple different locations (e.g., collection sites) having the same value within a hierarchical/clustering column.
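The kind of error described above (different locations sharing one label) can be illustrated with a short base R sketch. The helper count_parents() is hypothetical and is not part of PoolTestR; it only counts how many distinct parent groupings each label appears under, and it should not be read as the check the package actually performs.

```r
# Test data with incorrectly nested Village and Site labels
check_data <- data.frame(
  Region  = rep(c("A", "B"), each = 4),
  Village = rep(rep(c("W", "X"), each = 2), 2),
  Site    = c(1:4, 4:1)
)

# Hypothetical helper: for each label in `child`, count the number of
# distinct combinations of the `parents` columns it appears under
count_parents <- function(data, child, parents) {
  combos <- unique(data[, c(parents, child), drop = FALSE])
  table(combos[[child]])
}

village_parents <- count_parents(check_data, "Village", "Region")
site_parents    <- count_parents(check_data, "Site", c("Region", "Village"))

# A count above 1 means one label is shared by different locations
any(village_parents > 1)
any(site_parents > 1)
```

Here every Village label occurs in both regions and every Site label occurs under two different villages, so both checks flag a problem.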

Value

An object of class data.frame. If no nesting issues are detected, the output is identical to the input data. If there were issues with nesting inside the hierarchy, the output has a single additional column, PoolTestR_ID, containing a unique identifier for each location in the survey (created by concatenating the hierarchy column values within each row).
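As a rough sketch of what concatenating the hierarchy column values within each row can look like, the base R code below builds a per-row identifier. The column name location_id and the "_" separator are assumptions made for illustration; the exact format of PoolTestR_ID is not specified here.

```r
# Test data with incorrectly nested Village and Site labels
check_data <- data.frame(
  Region  = rep(c("A", "B"), each = 4),
  Village = rep(rep(c("W", "X"), each = 2), 2),
  Site    = c(1:4, 4:1)
)

hierarchy <- c("Region", "Village", "Site")

# Paste the hierarchy columns together row by row
# (the "_" separator is an assumption, not the package's documented format)
check_data$location_id <- do.call(paste, c(check_data[hierarchy], sep = "_"))

check_data$location_id
# Each of the 8 rows is a different location, so all 8 identifiers differ
length(unique(check_data$location_id))
```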

This function can be used to check levels of the hierarchical/clustering scheme. For example, the SimpleExampleData has the scheme Region > Village > Site. The full geographical hierarchy/clustering scheme can be tested using hierarchy = c("Region", "Village", "Site"). This function can also be used to check only the levels of the hierarchical/clustering scheme that will be included in prevalence estimates, e.g., if planning to stratify SimpleExampleData by Region, the hierarchy will be hierarchy = c("Village", "Site").

Functions in PoolTools do not make assumptions about the number of levels present or the names of hierarchical columns. They can be applied in any case where a hierarchical sampling frame is involved.

See Also

CheckInputData, HierPoolPrev, getPrevalence, SimpleExampleData

Examples

# Check whether the SimpleExampleData is nested 
# appropriately for estimating prevalence in 
# HierPoolPrev()
SimpleExample_output <- 
  PrepareClusterData(
    data = SimpleExampleData, 
    result = "Result", poolSize = "NumInPool", 
    hierarchy = c("Region", "Village", "Site") 
  )
# No errors/warnings were raised
identical(SimpleExample_output, SimpleExampleData)
# The hierarchical scheme is formatted properly so
# the output is identical to the input


## Not run: 
  # Checking another example data set for clustering issues
  # Create a test data frame that has incorrectly nested 
  # Village and Site variables
  check_data <- data.frame(
    Region = rep(c("A", "B"), each = 4),
    Village = rep(rep(c("W", "X"), each = 2), 2),
    Site = c(1:4, 4:1),
    Year = rep(0, 8),
    NumInPool = rep(10, 8),
    Result = rep(0, 8)
  )
  # Test whether the data.frame is formatted appropriately
  # for HierPoolPrev()
  check_output <- PrepareClusterData(
    data = check_data, 
    result = "Result", poolSize = "NumInPool", 
    hierarchy = c("Region", "Village", "Site")
  )
  # New column has been added with unique identifier for
  # each location
  check_output$PoolTestR_ID

## End(Not run)


AngusMcLure/PoolTestR documentation built on Jan. 16, 2025, 4:35 p.m.