smoteMod: smoteMod is a modified version of the 'synthetic minority...
In sambia: A Collection of Techniques Correcting for Sample Selection Bias


View source: R/smoteMod.R

This method adapts SMOTE to the context of stratified random samples. Rather than enlarging only the minority class, smoteMod generates synthetic data for all strata with a weight greater than 1. Note: with H strata, this function has to apply SMOTE H-1 times:

1. Subsample the data to the smallest stratum and one stratum to oversample.
2. Oversample with the modified SMOTE function according to the weight of that stratum.
3. Repeat this for the other H-2 subsamples.
4. Build a new data set in which H-1 strata contain synthetic data (the stratum with the smallest weight remains as is).
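The procedure above can be illustrated with a minimal base-R sketch. It is not the sambia implementation: the helper names `smote_stratum_sketch` and `smote_all_strata_sketch` are hypothetical, neighbors are searched within the stratum being oversampled, and the number of synthetic rows per stratum is assumed to be `round(n_h * (w_h - 1))`, i.e. the weight is read as a replication factor.

```r
# Hypothetical helper (not part of sambia): SMOTE-style interpolation
# between a row and one of its K nearest neighbours in the same stratum.
smote_stratum_sketch <- function(x, n_new, K) {
  x <- as.matrix(x)
  n <- nrow(x)
  if (n_new <= 0 || n < 2) {
    return(x[0, , drop = FALSE])      # nothing to generate
  }
  k <- min(K, n - 1)
  d <- as.matrix(dist(x))             # pairwise Euclidean distances
  diag(d) <- Inf                      # a row is not its own neighbour
  base <- sample.int(n, n_new, replace = TRUE)
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(x),
                  dimnames = list(NULL, colnames(x)))
  for (j in seq_len(n_new)) {
    i <- base[j]
    nn <- order(d[i, ])[seq_len(k)]   # indices of the k nearest rows
    nb <- nn[sample.int(k, 1)]        # pick one neighbour at random
    synth[j, ] <- x[i, ] + runif(1) * (x[nb, ] - x[i, ])
  }
  synth
}

# Hypothetical wrapper: loops over the H-1 strata that are not the
# smallest-weight stratum and appends synthetic rows for each of them.
smote_all_strata_sketch <- function(x, stratum, weights, K = 5) {
  x <- as.matrix(x)
  ids <- as.character(stratum)
  w_by_stratum <- vapply(split(as.numeric(weights), ids),
                         function(v) v[1], numeric(1))
  ref <- names(w_by_stratum)[which.min(w_by_stratum)]  # stays as is
  out_x <- x
  out_ids <- ids
  for (h in setdiff(names(w_by_stratum), ref)) {
    w_h <- w_by_stratum[[h]]
    if (w_h <= 1) next                # only weights greater than 1 grow
    rows <- which(ids == h)
    # Assumption: the weight acts as a replication factor
    n_new <- round(length(rows) * (w_h - 1))
    synth <- smote_stratum_sketch(x[rows, , drop = FALSE], n_new, K)
    out_x <- rbind(out_x, synth)
    out_ids <- c(out_ids, rep(h, nrow(synth)))
  }
  list(x = out_x, stratum = out_ids)
}

# Usage on toy data: two strata of 10 rows with weights 1 and 3
set.seed(1)
toy_x <- matrix(rnorm(40), ncol = 2)
toy_res <- smote_all_strata_sketch(toy_x,
                                   stratum = rep(c(1, 2), each = 10),
                                   weights = rep(c(1, 3), each = 10),
                                   K = 3)
table(toy_res$stratum)
```

Because every synthetic row lies on a segment between two observed rows of the same stratum, the generated values never leave the observed range of that stratum.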

smoteMod(data.x, stratum, weights, data.y = NULL, K)

`data.x`	A data frame or matrix containing the numeric attributes of the data set.
`stratum`	A numeric vector with one entry per row of data.x. Each combination of the levels of the stratification variables is assigned a numeric class id; the i-th entry of stratum holds the class id (and therefore the class membership) of the i-th row (= observation) of data.x.
`weights`	A numeric vector with one entry per row of data.x. The i-th value contains the inverse sampling probability, i.e. it determines how often the i-th observation of data.x is to be replicated.
`data.y`	A vector holding the target class attribute corresponding to data.x. Defaults to NULL.
`K`	The number of nearest neighbors used during the sampling process.
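As an illustration of the `stratum` argument, the following sketch shows one way to turn several stratification variables into a single numeric class id using base R's `interaction()`. The data frame `toy` and its values are made up for this example, and the package itself may expect ids built differently.

```r
# Toy data with two stratification variables (hypothetical values)
toy <- data.frame(y  = c(0, 0, 1, 1, 0, 1),
                  xs = c(0, 1, 0, 1, 0, 1))

# One integer id per observed (y, xs) combination
toy$stratum <- as.integer(interaction(toy$y, toy$xs, drop = TRUE))

toy
```

Rows that share the same combination of `y` and `xs` receive the same id, so the resulting vector has the same length as the number of rows, as the argument description requires.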

Norbert Krautenbacher, Kevin Strauss, Maximilian Mandl, Christiane Fuchs

## simulate data for a population
library(sambia) # provides smoteMod
require(pROC)

set.seed(1342334)
N = 100000
x1 <- rnorm(N, mean=0, sd=1) 
x2 <- rt(N, df=25)
x3 <- x1 + rnorm(N, mean=0, sd=.6)
x4 <- x2 + rnorm(N, mean=0, sd=1.3)
x5 <- rbinom(N, 1, prob=.6)
x6 <- rnorm(N, 0, sd = 1) # noise not known as variable
x7 <- x1*x5 # interaction
x <- cbind(x1, x2, x3, x4, x5, x6, x7)

## stratum variable (covariate)
xs <- c(rep(1,0.1*N), rep(0,(1-0.1)*N))

## effects
beta <- c(-1, 0.2, 0.4, 0.4, 0.5, 0.5, 0.6)
beta0 <- -2

## generate binary outcome
linpred.slopes <-  log(0.5)*xs + c(x %*% beta)
eta <-  beta0 + linpred.slopes

p <- 1/(1+exp(-eta)) # probability P(Y=1|X); the binary outcome is drawn from it below
y <- rbinom(n=N, size=1, prob=p)

population <- data.frame(y,xs,x)

#### draw "given" data set for training
sel.prob <- rep(1,N)
sel.prob[population$xs == 1] <- 9
sel.prob[population$y == 1] <- 8
sel.prob[population$y == 1 & population$xs == 1] <- 150
ind <- sample(1:N, 200, prob = sel.prob)

data = population[ind, ]

## calculate weights from original numbers for xs and y
w.matrix <- table(population$y, population$xs)/table(data$y, data$xs)
w <- rep(NA, nrow(data))
w[data$y==0 & data$xs ==0] <- w.matrix[1,1]
w[data$y==1 & data$xs ==0] <- w.matrix[2,1]
w[data$y==0 & data$xs ==1] <- w.matrix[1,2]
w[data$y==1 & data$xs ==1] <- w.matrix[2,2]

### draw a test data set
newdata = population[sample(1:N, size=200 ), ]

K = 5
## the weight vector w is also passed as stratum: it takes one value per (y, xs) cell
genData = smoteMod(data.x = data[ , -which(colnames(data) %in% c('y', 'xs'))],
                   stratum = w, data.y = data$y, weights = w, K = K)
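
The weight calculation in the example divides the population count of each (y, xs) cell by the sample count of that cell. A small deterministic sketch, with made-up cell sizes and the hypothetical names `pop`, `smp` and `w_toy`, shows the arithmetic and a simple sanity check: the weights of the sampled rows sum to the population size.

```r
# Hypothetical population with four (y, xs) cells of known size
cell_sizes <- c(600, 300, 70, 30)
pop <- data.frame(y  = rep(c(0, 1, 0, 1), times = cell_sizes),
                  xs = rep(c(0, 0, 1, 1), times = cell_sizes))

# Deterministic "sample": 60, 60, 35 and 30 rows from the four cells
smp <- pop[c(1:60, 601:660, 901:935, 971:1000), ]

# Inverse-probability weight per cell: population count / sample count
w_cell <- unclass(table(pop$y, pop$xs) / table(smp$y, smp$xs))

# Look up the weight of each sampled row by its (y, xs) labels
w_toy <- w_cell[cbind(as.character(smp$y), as.character(smp$xs))]

# Each cell contributes n_cell * (N_cell / n_cell) = N_cell to the sum
sum(w_toy) == nrow(pop)
```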