padding_test: Performs padding test vs simulations of Benford conforming...

View source: R/padding_test.R

padding_test    R Documentation

Performs padding test vs simulations of Benford conforming datasets via percentile

Description

Performs padding test vs simulations of Benford conforming datasets via percentile

Usage

padding_test(
  digitdata,
  data_columns = "all",
  max_length = 8,
  num_digits = 5,
  N = 10000,
  simulate = TRUE,
  omit_05 = NA,
  break_out = NA,
  break_out_grouping = NA,
  category = NA,
  category_grouping = NA,
  distribution = "Benford",
  contingency_table = NA,
  suppress_first_division_plots = NA,
  plot = TRUE
)

Arguments

digitdata

An object of class DigitAnalysis.

data_columns

The names of the numeric data columns to analyze. Either 'all' (the default), which uses every data column in the numbers data frame of digitdata; an array of column names, as characters; or a single column name, as a character.

max_length

The length of the longest numbers considered. Defaults to 8.

num_digits

The total number of digits, aligned from the right, to be analyzed. Defaults to 5, meaning the digit places from the 1s to the 10ks are analyzed.
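
To make "aligned from the right" concrete, here is a minimal base R sketch. The helper right_aligned_digits is hypothetical and not part of digitanalysis; it assumes positive integers and marks places a number does not reach as NA, which may differ from how the package handles short numbers.

```r
# Hypothetical helper (not a digitanalysis function): return the rightmost
# `num_digits` digits of each positive integer, one column per digit place.
right_aligned_digits <- function(x, num_digits = 5) {
  places <- 10^((num_digits - 1):0)            # e.g. 10000, 1000, 100, 10, 1
  # Digit at each place: integer-divide by the place value, then take mod 10
  digits <- outer(x, places, function(v, p) (v %/% p) %% 10)
  # Assumption: places beyond the length of a number are treated as missing
  too_short <- outer(x, places, function(v, p) v < p)
  digits[too_short] <- NA
  colnames(digits) <- paste0(format(places, scientific = FALSE, trim = TRUE), "s")
  digits
}

x <- c(123456, 907, 5)
right_aligned_digits(x, num_digits = 5)
```

With num_digits = 5, the leading 1 of 123456 falls outside the analyzed places, while 907 and 5 only fill the rightmost three and one places respectively.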

N

The number of Benford conforming datasets to simulate.

  • Runtime is about 2400 seconds for N = 10,000 on data of dimension 4000 x 5 total digits.

simulate

TRUE or FALSE: If TRUE, simulates the datasets and generates p-values. If FALSE, only produces diff_in_mean and plots. Overrides N.
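
The general idea of a percentile-based Monte Carlo p-value can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the null model here draws digits 0 to 9 uniformly as a stand-in for Benford-conforming data, the statistic is the mean digit in a single place, the p-value is one-sided, and diff_in_mean is taken as observed minus expected. All variable names are illustrative.

```r
set.seed(1)

# Stand-in null model (assumption): uniform digits 0-9, not the
# Benford-based probabilities used by the package.
simulate_mean <- function(n) mean(sample(0:9, n, replace = TRUE))

# Hypothetical observed digits in one digit place, heavy on 0s and 5s
observed <- c(0, 0, 5, 0, 5, 0, 0, 5, 1, 0, 0, 5, 0, 2, 0, 5, 0, 0, 5, 0)
observed_mean <- mean(observed)
expected_mean <- mean(0:9)                  # 4.5 under the uniform stand-in
diff_in_mean <- observed_mean - expected_mean

N <- 1000                                   # number of simulated datasets
simulated_means <- replicate(N, simulate_mean(length(observed)))

# Percentile of the observed mean among the simulated means (lower tail)
p_value <- mean(simulated_means <= observed_mean)

diff_in_mean
p_value
```

With simulate = FALSE, only the quantity corresponding to diff_in_mean would be available, since no simulated datasets exist to compare against.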

omit_05

Whether to omit 0, or both 0 and 5. To omit both 0 and 5, pass in c(0,5) or c(5,0); to omit only 0, pass in 0 or c(0); to omit neither, pass in NA. Defaults to NA.
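
The three accepted forms of omit_05 can be illustrated with a small base R sketch. The helper drop_digits is hypothetical and only mimics the described behavior on a plain vector of digits; how the package applies the omission internally is not shown here.

```r
# Hypothetical helper: drop the digits listed in omit_05; NA keeps everything.
drop_digits <- function(digits, omit_05 = NA) {
  if (all(is.na(omit_05))) return(digits)
  digits[!(digits %in% omit_05)]
}

digits <- c(0, 3, 5, 7, 0, 5, 9)

drop_digits(digits, NA)        # omit neither: all seven digits kept
drop_digits(digits, 0)         # omit only 0:  3 5 7 5 9
drop_digits(digits, c(5, 0))   # omit 0 and 5: 3 7 9
```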

break_out

  • The data column (non-numeric!) used to split the dataset by the categories in that column, specified as a character.

  • The first division (usually the x-axis) shown in plots.

  • Defaults to NA.

break_out_grouping

A list of arrays, or NA by default. Only effective if break_out is not NA.

  • The name of each element in the list is the break_out group name

  • Each array contains the values belonging to that break_out group

  • If left as NA while break_out is not NA, every individual item in break_out is placed in its own separate group.
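
A grouping list of this shape can be built and inspected in base R as sketched below. The column values and the group names north and south are made up for illustration, and the lookup code only demonstrates what the list encodes; it is not how the package resolves groups.

```r
# Hypothetical values of a break_out column
district <- c("A", "B", "C", "D", "A", "C")

# Element names are the group names; each array lists the values in that group
break_out_grouping <- list(
  north = c("A", "B"),
  south = c("C", "D")
)

# Map each value to the name of the group that contains it
lookup <- stack(break_out_grouping)          # columns: values, ind
group_of <- as.character(lookup$ind[match(district, lookup$values)])
group_of

# Equivalent of the NA default: every distinct value is its own group
default_grouping <- setNames(as.list(unique(district)), unique(district))
names(default_grouping)
```

The same list structure applies to category_grouping.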

category

The column for splitting the data into sectors for separate analysis. The second division (usually variables) shown in plots.

category_grouping

A list of arrays, or NA by default. Only effective if category is not NA.

  • The name of each element in the list is the category group name

  • Each array contains the values belonging to that category group

  • If left as NA while category is not NA, every individual item in category is placed in its own separate group.

  • e.g. category_grouping = list(group_1=c(category_1, category_2, ...), group_2=c(category_10, ...), group_3=c(...))

distribution

'Benford' or 'Uniform'. Case insensitive. Specifies the distribution that the chi square test tests against. Defaults to 'Benford'.

contingency_table

A user-supplied probability table for an arbitrary distribution. Overrides distribution if not NA. Must be a data frame in the same form as benford_table. Defaults to NA.

  • Run load(file = "data/benford_table.RData") to see the format of benford_table

suppress_first_division_plots

TRUE or FALSE: If TRUE, suppresses the display of all plots on the first and second divisions. If TRUE, suppress_second_division_plots is also set to TRUE.

plot

TRUE, FALSE, or 'Save': If TRUE, displays the plots and returns them. If 'Save', returns the plots but suppresses display. If FALSE, no plot is produced. Defaults to TRUE.

Value

A list with 4 elements:

  • A list of p-values from the Monte Carlo simulation for each category

  • A list of the differences in mean between observed_mean and expected_mean for each category

  • A sample size value that corresponds to N if simulate = TRUE

  • Plots for each category if plot = TRUE or 'Save'

Examples

padding_test(digitdata, omit_05=c(0,5), simulate=FALSE)
padding_test(digitdata, data_columns=c('col_name1', 'col_name2'), break_out='col_name')
padding_test(digitdata, N=100, break_out='col_name', distribution='uniform', plot='Save')
padding_test(digitdata, max_length=10, num_digits=3, omit_05=0, break_out='col_name', category='category_name')

jlederluis/digitanalysis documentation built on Nov. 5, 2023, 11:46 a.m.