sampleStrat: Stratified randomization

sampleStrat    R Documentation

Stratified randomization

Description

This function scrambles values of a given column of a data frame in a stratified manner with respect to one or more other "covariate" columns. The covariate columns can be specified, as well as the width of the range of each covariate around each focal value from which to sample candidates for swapping.

Usage

sampleStrat(
  x,
  col,
  w = function(x) stats::sd(x, na.rm = TRUE)/(max(x, na.rm = TRUE) - min(x, na.rm =
    TRUE)),
  d = 0.1,
  by = "all",
  permuteBy = TRUE
)

Arguments

x

Data frame containing at least two columns, one with numeric values and at least one more with numeric or factor values.

col

Character or integer. Name or index of the column in x whose values are to be swapped.

w

Function or numeric value > 0. Sets the window size for non-factor covariates as a proportion of their range. A function must accept a vector of values. It helps if the function also accepts the argument 'na.rm=TRUE', to avoid problems with NAs in the column specified by col. The default is the standard deviation divided by the range. This reduces the correlation between previously perfectly correlated variables to ~0.80 (on average). Ignored for covariates that are factors.
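
As a rough illustration of what the default w evaluates to, the sketch below applies the function from the Usage section to a made-up covariate. The names wDefault, covariate, and halfWidth are hypothetical, and reading w times the range as the distance searched on each side of a focal value follows the Details section rather than the package source.

```r
# Default window function copied from the Usage section: SD divided by range.
wDefault <- function(x) {
  stats::sd(x, na.rm = TRUE) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Hypothetical covariate containing one missing value.
covariate <- c(1:20, NA)

# Window size as a proportion of the covariate's range (about 0.31 here).
w <- wDefault(covariate)

# The same window expressed in covariate units (assumed to be the distance
# searched on each side of a focal value). With the default w this is
# simply the standard deviation of the covariate.
halfWidth <- w * diff(range(covariate, na.rm = TRUE))
```

Because the default divides by the range and the window is then scaled by the range again, the default amounts to searching within about one standard deviation of the focal covariate value.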

d

Numeric > 0. If no swappable value is found within w * (max(col) - min(col)), then w is expanded by a factor of 1 + d iteratively until a value is found. Ignored for covariates that are factors.
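
The expansion rule can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: findCandidates and its arguments are hypothetical, the window is assumed to be symmetric around the focal covariate value, and w is assumed to be multiplied by 1 + d on each failed attempt.

```r
# Hypothetical helper (not part of statisfactory): return the rows whose
# covariate value falls inside the window around the focal row, widening
# the window by a factor of (1 + d) until at least one candidate is found.
findCandidates <- function(covariate, focal, available, w, d = 0.1) {
  # Rows that can still be swapped, excluding the focal row and NAs.
  pool <- setdiff(available, focal)
  pool <- pool[!is.na(covariate[pool])]
  if (length(pool) == 0 || is.na(covariate[focal])) {
    return(list(candidates = integer(0), w = w))
  }
  rangeWidth <- diff(range(covariate, na.rm = TRUE))
  repeat {
    halfWidth <- w * rangeWidth
    candidates <- pool[abs(covariate[pool] - covariate[focal]) <= halfWidth]
    if (length(candidates) > 0) break
    w <- w * (1 + d)  # no candidate yet: widen the window and retry
  }
  list(candidates = candidates, w = w)
}

# Made-up covariate with an isolated high value in row 5.
covariate <- c(1, 2, 3, 10, 20)

# Row 1 has a neighbor within the initial window, so w is unchanged.
near <- findCandidates(covariate, focal = 1, available = 1:5, w = 0.1, d = 0.5)

# Row 5 is isolated, so w grows until row 4 falls inside the window.
far <- findCandidates(covariate, focal = 5, available = 1:5, w = 0.1, d = 0.5)
```

A larger d reaches a usable window in fewer iterations but can overshoot, pulling in candidates that are further from the focal value than strictly necessary.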

by

Character vector or integers. Name(s) or column number(s) of the covariates by which to stratify the target column. Can also be 'all' (default) to stratify by all columns of class numeric, integer, or factor except the target column.

permuteBy

Logical. If TRUE, then in each step the order of the covariates in by is scrambled. If FALSE, then strata are considered for each covariate in the order listed in by. This argument has no effect if by has just one value.

Details

The function starts by randomly selecting a value v_i from the target column. It then finds the value of covariate c_j that is associated with v_i; call this value c_j:i. If c_j is a continuous variable, the function finds all values of c_j that fall within [c_j:i - w, c_j:i + w], where w is a proportion of the range of c_j.
The function then randomly selects a value v_k from those associated with this range of c_j and swaps v_i with it. Depending on the random draw, v_k can be v_i itself. If no values of c_j other than the one associated with v_i are found within this range, then the window is expanded iteratively, replacing w with w * (1 + d), until it contains at least one more value that has yet to be swapped. The procedure then finds a window around v_k as described above (or randomly selects a new v_i if v_i was v_k) and continues. If there is an odd number of values, then the last value is kept as is (not scrambled).
If c_j is a categorical variable (a factor), then the function finds all values of v in the same factor level as v_i. Swaps of v occur within this level of c_j. However, if there are fewer than 2 values in the level (including the value associated with v_i), then the function looks to the next factor level. The "next" level is the one with the smallest difference between v_i and the average of the values of v associated with that level. The "window" for a factor level is thus the level plus one or more levels with the closest average values of v, such that the group contains more than one value of v that has yet to be swapped.
If there is more than one covariate, then these steps are repeated iteratively for each covariate (i.e., selecting values of v given the stratum identified in covariate c_j, then among these values those also in the stratum identified in covariate c_k, and so on). In this case the order in which the covariates are listed in by can affect the outcome. The order can be permuted for each value of v_i if permuteBy is TRUE.
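
The handling of a factor covariate can be sketched as below. This is an illustrative reading of the description above rather than the package's code: poolLevels and its arguments are hypothetical, and the level averages are assumed to be computed over all values of v in a level (the text does not say whether already-swapped values are excluded).

```r
# Hypothetical helper (not part of statisfactory): starting from the focal
# row's factor level, add the levels whose mean of v is closest to the focal
# value until at least two unswapped rows (focal included) are available.
poolLevels <- function(v, f, focal, available) {
  f <- as.factor(f)
  avail <- union(available, focal)
  focalLevel <- as.character(f[focal])
  # Mean of v within each level (assumption: all rows, swapped or not).
  levelMeans <- tapply(v, f, mean, na.rm = TRUE)
  # Remaining levels, ordered by how close their mean is to the focal value.
  others <- setdiff(names(levelMeans), focalLevel)
  others <- others[order(abs(levelMeans[others] - v[focal]))]
  pooled <- focalLevel
  repeat {
    rows <- avail[as.character(f[avail]) %in% pooled]
    if (length(rows) >= 2 || length(others) == 0) break
    pooled <- c(pooled, others[1])  # borrow the next-closest level
    others <- others[-1]
  }
  list(levels = pooled, rows = rows)
}

# Made-up data: level "c" holds a single value, so it must borrow a level.
v <- c(1, 2, 3, 10, 11, 50)
f <- factor(c("a", "a", "a", "b", "b", "c"))

# Row 1 sits in a level with three values, so no pooling is needed.
withinLevel <- poolLevels(v, f, focal = 1, available = 1:6)

# Row 6 is alone in level "c"; level "b" has the closer mean, so it is added.
borrowed <- poolLevels(v, f, focal = 6, available = 1:6)
```

With several covariates, the rows returned for one covariate would be intersected with those for the next, which is why the order of by can matter.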

Value

A data frame with one column swapped in a stratified manner relative to another column or set of columns.

See Also

sample

Examples


# Example #1: Scramble column 1 with respect to columns 2 and 3.
# Note in the output high values of "a" tend to be associated with
# high values of "b" and low values of "c". This tendency decreases as "w" increases.

x <- data.frame(a=1:20, b=1:20, c=20:1, d=c(rep('a', 10), rep('b', 10)))
x$d <- as.factor(x$d)
x

# scramble by all other columns
sampleStrat(x=x, col=1, w=0.2, by='all', d=0.1)

# scramble by column "d"
sampleStrat(x=x, col=1, w=0.2, by='d', d=0.1)

# Example #2: The target variable and covariate are equal
# (perfectly collinear). How wide must the window (set by
# argument "w") be to reduce the average correlation
# between them to an arbitrary low level?

df <- data.frame(a=1:100, b=1:100)
cor(df) # perfect correlation

corFrame <- data.frame()
for (w in seq(0.1, 1, 0.1)) {
    for (countRep in 1:10) {
       df2 <- sampleStrat(x=df, col=1, w=w)
       corFrame <- rbind(corFrame, data.frame(w=w, cor=cor(df2)[1, 2]))
    }
}

boxplot(cor ~ w, data=corFrame, xlab='w', ylab='correlation coefficient')


adamlilith/statisfactory documentation built on Jan. 3, 2024, 10:37 p.m.