
fabricated

Tools to Audit Survey Data Quality

The fabricated package provides tools to identify potential data quality issues in survey data. Currently, it includes tools to detect deviations from the uniform distribution in the first decimal place and to detect "number-bunching."

Installation

You can install fabricated from GitHub with:

install.packages("devtools")
devtools::install_github("josh-mc/fabricated")

Example

library(fabricated)

Currently, fabricated includes functions for two data quality audit methods. The first audit method relies on the assumption that the digit in the first decimal place should be uniformly distributed. If this assumption holds for the data generating process for a particular variable, then significant deviations from the uniform distribution could indicate that enumerators are deviating from survey protocols. This could include inattention to decimals, rounding decimals to multiples of five, or wholesale fabrication.

Here we work with the dataset 'bodyweight', which is drawn from a normal distribution with a mean of 75 and a standard deviation of 10. This dataset is designed to approximate a sample of adult weights expressed in kilograms.

The 'hist_digits' function generates a histogram of the digits in the first decimal place. Here we generate a histogram for each 'group.' In this case, group would likely indicate enumerators.

hist_digits(bodyweight, obs, ~group)

The 'count_digits' function generates a tibble that displays the counts for each digit in the first decimal place.

count_digits(bodyweight, obs, group) 
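
As a rough illustration of what such a count involves, the base R sketch below extracts the first decimal digit and tabulates it. The helper 'first_decimal()' and the simulated vector are hypothetical stand-ins, not part of fabricated, and the sketch assumes values are recorded to one decimal place.

```r
# Hypothetical helper (not part of fabricated): first decimal digit of each value.
# Assumes values are recorded to one decimal place; the small tolerance guards
# against floating-point error (e.g. 1.4 * 10 evaluating slightly below 14).
first_decimal <- function(x) {
  as.integer(floor(abs(x) * 10 + 1e-8) %% 10)
}

set.seed(1)
# Simulated weights rounded to one decimal, mimicking the description of 'bodyweight'.
x <- round(rnorm(500, mean = 75, sd = 10), 1)

# Tabulate digits 0-9, keeping digits that never occur as zero counts.
counts <- table(factor(first_decimal(x), levels = 0:9))
print(counts)
```

Using a factor with fixed levels keeps all ten digits in the table, which matters when the counts are later compared with a uniform expectation.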

Finally, the 'unif_digits()' function acts as a wrapper for 'count_digits()' and the 'chisq.test()' function from the stats package. It returns a tibble with columns for the chi-square statistic, the p value computed from the chi-square statistic, and the mean average deviation (MAD) from the expected distribution. Other distance measures are also available. When "group" is used, the chi-square test is applied separately to each group.

unif_digits(bodyweight, obs, group)
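
The underlying test can be sketched in base R with 'chisq.test()' alone. The counts below are made up for illustration, and the distance measure is computed as a mean absolute deviation between observed and expected proportions, which is one common reading of MAD; whether it matches the package's exact definition is an assumption.

```r
# Made-up counts for digits 0-9 (illustrative only, not package output).
counts <- c(48, 52, 55, 47, 50, 49, 53, 51, 46, 49)
names(counts) <- 0:9

# Chi-square goodness-of-fit test against a uniform distribution over ten digits.
test <- chisq.test(counts, p = rep(1 / 10, 10))
chi_sq <- unname(test$statistic)
p_value <- test$p.value

# Mean absolute deviation of observed proportions from the expected 0.1
# (an assumed definition of MAD, see the note above).
mad <- mean(abs(counts / sum(counts) - 1 / 10))

print(c(chi_sq = chi_sq, p_value = p_value, mad = mad))
```

To mirror the per-group behavior described above, the same steps could be applied to each group's counts separately, for example with 'split()' and 'lapply()'.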

For this dataset, the distribution of digits in the first decimal place is relatively uniform. No group (enumerator) yields a p value less than or equal to 0.05.

Number bunching

Second, fabricated includes a number of functions to implement audits for "number-bunching" as described by Uri Simonsohn. At first glance, number-bunching is similar to heaping, but while heaping relies on distributional assumptions (e.g. that numbers ending in zero do not occur more often than numbers not ending in zero), number-bunching relies on the assumption that the relation between integers and decimals is random.

The 'average_fre()' function calculates a measure of the frequency with which digits in the first decimal place are paired with certain integers. The number of times each pair (integer and first decimal) occurs is calculated. This number appears once in the numerator for each occurrence. These numbers are summed and divided by n.

For example, for the vector c(1.4, 0.4, 1.4, 2.0) the average frequency is calculated by (2 + 1 + 2 + 1) / 4 = 1.5. Again, when the group variable is specified, this is calculated separately for each group.

average_fre(bodyweight, obs, group)
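
The worked example above can be reproduced with a few lines of base R. The function 'average_frequency()' is a hypothetical re-implementation written for illustration, not the package's code, and it assumes non-negative values recorded to one decimal place.

```r
# Hypothetical re-implementation of the average frequency (not fabricated's code).
average_frequency <- function(x) {
  # Express each value in tenths, e.g. 1.4 -> 14; the tolerance guards
  # against floating-point error.
  tenths <- floor(abs(x) * 10 + 1e-8)
  integer_part <- tenths %/% 10
  decimal_digit <- tenths %% 10

  # Label each observation by its (integer, first decimal) pair.
  pair <- paste(integer_part, decimal_digit, sep = ".")

  # For every observation, count how often its pair occurs, then average.
  pair_count <- ave(rep(1, length(pair)), pair, FUN = sum)
  mean(pair_count)
}

average_frequency(c(1.4, 0.4, 1.4, 2.0))  # (2 + 1 + 2 + 1) / 4 = 1.5
```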

By itself, this number is not particularly informative, but the 'shuffle()' function allows us to develop a test based on permutations. Leveraging the assumption that the relation between integers and the digit in the first decimal place is random, 'shuffle()' randomly reassigns digits in the first decimal place to integers and calculates the average frequency for the shuffled data.

The 'average_fre_p()' function is a wrapper for 'shuffle()' that calculates p-values for a one-sided hypothesis test based on n random permutations. Below, we run 1,000 permutations. As above, we are interested in results per group, so we name the group variable. We also set a seed so that our results are reproducible.

set.seed(799)

average_fre_p(bodyweight, obs, group, reps = 1000)
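
The logic of the permutation test can be sketched in base R as follows. The helpers 'avg_freq_parts()' and 'perm_p()' are hypothetical and are not the package's implementation. The sketch assumes that the one-sided p-value is the share of shuffled datasets whose average frequency is at least as large as the observed one.

```r
# Hypothetical helper: average frequency from separate integer and decimal parts.
avg_freq_parts <- function(int, dec) {
  pair <- paste(int, dec, sep = ".")
  mean(ave(rep(1, length(pair)), pair, FUN = sum))
}

# Hypothetical permutation test (assumed one-sided direction: more bunching
# than expected under random pairing of integers and decimals).
perm_p <- function(x, reps = 1000) {
  tenths <- floor(abs(x) * 10 + 1e-8)
  int <- tenths %/% 10
  dec <- tenths %% 10
  observed <- avg_freq_parts(int, dec)

  # Reassign decimal digits to integers at random and recompute each time.
  shuffled <- replicate(reps, {
    avg_freq_parts(int, dec[sample.int(length(dec))])
  })

  # Share of shuffles at least as bunched as the observed data.
  mean(shuffled >= observed)
}

set.seed(799)
# Simulated data with no built-in bunching.
x_random <- round(rnorm(200, mean = 75, sd = 10), 1)
p_random <- perm_p(x_random, reps = 200)

# Artificial data where each integer always carries the same decimal digit.
x_bunched <- rep(c(70.5, 71.2, 80.0), each = 20)
p_bunched <- perm_p(x_bunched, reps = 200)

print(c(p_random = p_random, p_bunched = p_bunched))
```

In the artificial bunched vector, shuffling almost always breaks the fixed pairing between integers and decimals, so the observed average frequency sits far in the upper tail of the shuffled values.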


josh-mc/fabricated documentation built on April 25, 2022, 1:31 p.m.