study_hack: Find smallest subset to exclude from sample for...


Description

The function iteratively searches for the smallest set of observations that must be excluded from the data for the statistic of interest to reach a conservative 'goal value'. It relies on a genetic algorithm, which efficiently explores the (usually vast) space of possible subsets. The result can uncover influential subsamples and inform discussions of robustness. Required arguments are the data frame, a function that computes the statistic of interest ('statistic_computation'; see Examples), and the goal value.

Usage

study_hack(data = NULL, goal_value = NULL,
  statistic_computation = NULL, max_exclusions = NULL, pop = 500,
  max_generations = 2000, exclusion_cost = 0.01,
  prop_included_cases = 0.95, chance_of_mutation = 0.02,
  stop_search = 100, random_seed = 42)

Arguments

data

A data.frame containing the observations as rows.

goal_value

The conservative value (e.g., a small effect size) that the statistic should reach.

statistic_computation

A function that takes 'data' as input and returns the statistic of interest.

max_exclusions

Maximum number of cases to be excluded.

pop

Number of 'individuals' in each generation of the genetic algorithm.

max_generations

Maximum number of generations the algorithm runs.

exclusion_cost

Used to calibrate the fitness function.

prop_included_cases

Initial proportion of included cases (e.g. .90).

chance_of_mutation

Chance that a gene mutates (e.g. .02). Higher values are slower but more accurate.

stop_search

Number of generations without change after which the 'converged' result is returned.

random_seed

Seed for replicability.
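
The page does not spell out how 'exclusion_cost' enters the fitness function. As an illustration only, the sketch below shows one way a per-case penalty could trade off against the distance to 'goal_value'. The function name sketch_fitness, the absolute-distance formula, and the "lower is better" convention are all assumptions and are not taken from the package source.

```r
# Hypothetical fitness sketch; NOT the package's actual implementation.
# `genome` is a 0/1 vector in which 1 marks an excluded row
# (the same coding as the value returned by study_hack).
sketch_fitness <- function(genome, data, statistic_computation,
                           goal_value, exclusion_cost = 0.01) {
  kept <- data[genome == 0, , drop = FALSE]
  distance <- abs(statistic_computation(kept) - goal_value)
  # Assumed convention: lower is better. Each excluded case adds a penalty.
  distance + exclusion_cost * sum(genome)
}

cor_stat <- function(data) cor(data$Sepal.Length, data$Petal.Width)

no_exclusions  <- rep(0, nrow(iris))
drop_first_ten <- c(rep(1, 10), rep(0, nrow(iris) - 10))

sketch_fitness(no_exclusions, iris, cor_stat, goal_value = 0.2)
sketch_fitness(drop_first_ten, iris, cor_stat, goal_value = 0.2)
```

Under this assumed form, a larger 'exclusion_cost' would favour candidate subsets that exclude fewer cases, even if they end up slightly further from the goal value.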

Value

Vector of zeros and ones with length equal to the number of observations in data. Ones indicate exclusion.
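
Because ones mark excluded rows, the returned vector can be used directly to subset the data. The sketch below uses a made-up exclusion vector (an assumption for illustration, not real output of study_hack) so that it runs without the package installed.

```r
# Made-up exclusion vector for illustration; a real one would come from study_hack().
exclusions <- rep(0, nrow(iris))
exclusions[c(5, 20, 101)] <- 1

# Row indices flagged for exclusion.
excluded_rows <- which(exclusions == 1)

# Keep only the rows that were not flagged.
retained <- iris[exclusions == 0, ]

nrow(retained)
# Recompute the statistic on the retained rows to compare it with the goal value.
cor(retained$Sepal.Length, retained$Petal.Width)
```

Recomputing the statistic on the retained rows is a quick check of how close the search came to the goal value.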

Examples

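library(studybreak)

# Statistic of interest: correlation between sepal length and petal width.
coefficient_computation <- function(data) {
  cor(data$Sepal.Length, data$Petal.Width)
}

# Search for a small set of rows whose exclusion brings the correlation to 0.2.
filter <- study_hack(data = iris, statistic_computation = coefficient_computation,
                     goal_value = 0.2, max_generations = 500)
print(filter)  # 0/1 vector; ones mark excluded rows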

hannesrosenbusch/studybreak documentation built on June 20, 2019, 7:15 p.m.