genSample: Generate synthetic observations using inverse-probability...
In sambia: A Collection of Techniques Correcting for Sample Selection Bias

Description Usage Arguments Value Author(s) References Examples

This method corrects a given data set for sample selection bias by generating new observations via stochastic inverse-probability oversampling or parametric inverse-probability sampling, using inverse-probability weights and information on the covariance structure of the given strata (Krautenbacher et al., 2017).
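As a rough illustration of what using the covariance structure of a stratum can mean, the following base-R sketch fits a multivariate normal to one stratum and draws new rows from it. It is not sambia's implementation; the helper `draw_stratum` and the toy data are hypothetical and assume all columns are numeric.

```r
# Hypothetical helper, not part of sambia: draw n.new synthetic rows from a
# multivariate normal fitted to the numeric columns of a single stratum.
draw_stratum <- function(x, n.new) {
  x <- as.matrix(x)
  mu <- colMeans(x)    # stratum mean vector
  sigma <- cov(x)      # stratum covariance matrix
  z <- matrix(rnorm(n.new * ncol(x)), nrow = n.new)
  # chol() returns an upper-triangular R with t(R) %*% R == sigma,
  # so the rows of z %*% R have covariance sigma
  out <- sweep(z %*% chol(sigma), 2, mu, "+")
  colnames(out) <- colnames(x)
  out
}

set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50, mean = 2))
toy$b <- toy$b + 0.5 * toy$a   # induce some correlation between a and b
synthetic <- draw_stratum(toy, n.new = 200)
dim(synthetic)
```

In a stratified setting, one would presumably call such a helper once per stratum, with `n.new` derived from that stratum's weights.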

genSample(data, strata.variables = NULL, stratum = NULL, weights = rep(1, nrow(data)), distr = "mvnorm", type = c("parIP", "stochIP"))

`data`	a data frame containing the observations rowwise, along with their corresponding categorical strata feature.
`strata.variables`	a character vector giving the names of the columns in data that hold the categorical stratum features.
`stratum`	a numerical vector of the length of the number of rows of the data specifying the stratum ID. Either 'strata.variables' or 'stratum' has to be provided. This vector will not be included as a column in the resulting data set.
`weights`	a numerical vector whose length must equal the number of rows of data. The i-th value is the inverse-probability weight of the i-th observation, i.e. it determines how often that observation shall be replicated.
`distr`	a character string naming the distribution; the default is "mvnorm".
`type`	a character string selecting the method used to correct the given data set for sample selection bias: stochastic inverse-probability oversampling if type = 'stochIP', or parametric inverse-probability bagging if type = 'parIP'.
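To make the replication reading of `weights` concrete, here is a minimal base-R sketch that repeats each row according to its rounded weight. It only illustrates the phrase "how often the i-th observation shall be replicated" and is not what `genSample` does internally; the toy data and weights are made up.

```r
# Toy data and hypothetical inverse-probability weights
toy <- data.frame(id = 1:4, x = c(0.3, 1.2, -0.7, 2.1))
w <- c(1, 3.4, 2, 0.6)

# Round the weights to whole replication counts: 1, 3, 2, 1
times <- round(w)

# Repeat each row index as often as its count says
replicated <- toy[rep(seq_len(nrow(toy)), times = times), ]
nrow(replicated)
```

Rounding is one simple way to turn weights into counts; it discards the fractional part of each weight, which a real correction method may handle differently.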

$data data frame containing synthetic data, corrected for sample selection bias by generating new observations via stochastic inverse-probability oversampling or parametric inverse-probability oversampling.

$orig.data original data frame that is to be corrected

$stratum vector containing the stratum of each observation

$method a character string indicating which method was used. If method = 'stochIP', stochastic inverse-probability oversampling was used; if method = 'parIP', parametric inverse-probability sampling was used.

$strata.tbl a data frame containing all variables and their feature occurrences

$N number of rows in data

$n number of rows in original data

Norbert Krautenbacher, Kevin Strauss, Maximilian Mandl, Christiane Fuchs

Krautenbacher, N., Theis, F. J., & Fuchs, C. (2017). Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies. Computational and Mathematical Methods in Medicine, 2017.

## simulate data for a population
require(pROC)

set.seed(1342334)
N = 100000
x1 <- rnorm(N, mean=0, sd=1) 
x2 <- rt(N, df=25)
x3 <- x1 + rnorm(N, mean=0, sd=.6)
x4 <- x2 + rnorm(N, mean=0, sd=1.3)
x5 <- rbinom(N, 1, prob=.6)
x6 <- rnorm(N, 0, sd = 1) # noise not known as variable
x7 <- x1*x5 # interaction
x <- cbind(x1, x2, x3, x4, x5, x6, x7)

## stratum variable (covariate)
xs <- c(rep(1,0.1*N), rep(0,(1-0.1)*N))

## effects
beta <- c(-1, 0.2, 0.4, 0.4, 0.5, 0.5, 0.6)
beta0 <- -2

## generate binary outcome
linpred.slopes <-  log(0.5)*xs + c(x %*% beta)
eta <-  beta0 + linpred.slopes

p <- 1/(1+exp(-eta)) # probability P(Y=1|X); the binary outcome is drawn from it below
y <- rbinom(n=N, size=1, prob=p)

population <- data.frame(y,xs,x)

#### draw "given" data set 
sel.prob <- rep(1,N)
sel.prob[population$xs == 1] <- 9
sel.prob[population$y == 1] <- 8
sel.prob[population$y == 1 & population$xs == 1] <- 150
ind <- sample(1:N, 200, prob = sel.prob)

data = population[ind, ]

## calculate weights from original numbers for xs and y
w.matrix <- table(population$y, population$xs)/table(data$y, data$xs)
w <- rep(NA, nrow(data))
w[data$y==0 & data$xs ==0] <- w.matrix[1,1]
w[data$y==1 & data$xs ==0] <- w.matrix[2,1]
w[data$y==0 & data$xs ==1] <- w.matrix[1,2]
w[data$y==1 & data$xs ==1] <- w.matrix[2,2]
## parametric IP bootstrap sample
sample1 <- sambia::genSample(data=data, strata.variables = c('y', 'xs'),
                          weights = w, type='parIP')
## stochastic IP oversampling; treating 'y' and 'xs' as usual input variable
## but using strata info unambiguously defined by the weights w                        
sample2 <- sambia::genSample(data=data,
                            weights = w, type='stochIP', stratum= round(w))
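The cell-by-cell weight assignment in the example can also be written with a single matrix lookup. The sketch below is self-contained and uses a small deterministic toy population (`pop`) and sample (`smp`), both hypothetical, rather than the simulated data above. It assumes every (y, xs) cell is present in the sample; otherwise the division yields infinite or undefined weights.

```r
# Deterministic toy population with four (y, xs) cells of size 600, 100, 250, 50
pop <- data.frame(y  = rep(c(0, 1), times = c(700, 300)),
                  xs = rep(c(0, 1, 0, 1), times = c(600, 100, 250, 50)))

# Toy "given" data set: the first 10 rows of every (y, xs) cell
cells <- split(seq_len(nrow(pop)), list(pop$y, pop$xs))
smp <- pop[unlist(lapply(cells, head, 10)), ]

# Weight per cell = population count / sample count
w.matrix <- table(pop$y, pop$xs) / table(smp$y, smp$xs)

# Look up each sampled row's weight by its (y, xs) labels in one step
w <- w.matrix[cbind(as.character(smp$y), as.character(smp$xs))]

# The weights of the sampled rows add up to the population size
sum(w)
```

Indexing the table with a two-column character matrix matches each row against the table's dimnames, so no cell has to be addressed by hand.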