hutch_sampling: Balancing of DTM

View source: R/underSam.R

hutch_sampling R Documentation

Balancing of DTM

Description

hutch_sampling returns a balanced document-term matrix (DTM) obtained by Random Undersampling (RUS) or Random Oversampling (ROS).

Usage

hutch_sampling(X, Y, type = "ROS", perc = 50, k_pos = 0, w = NULL,
  verbose = TRUE)

Arguments

X

A document-term matrix (DTM) of class c('DocumentTermMatrix', 'simple_triplet_matrix').

Y

The response variable of the unbalanced dataset. It should be a factor with two levels (binary).

type

The balancing technique: "ROS" for Random Oversampling, or one of two variants of Random Undersampling (RUS). "RUS_under" applies the undersampling percentage with respect to the majority class ("percUnder"), while "RUS_Pos" applies it with respect to the minority class ("percPos").

perc

Used only for types "RUS_under" and "RUS_Pos": the percentage used to sample the majority class, interpreted according to the chosen RUS type.

k_pos

Used only for type "ROS": the number of times the positive (minority) instances are to be generated.

w

Used only for types "RUS_under" and "RUS_Pos": weights for undersampling the majority class. If NULL, sampling is done with equal weights.

verbose

Used only for type "ROS". If TRUE, prints extra information.
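
To make the role of perc more concrete, the sketch below computes how many majority documents would be kept under each RUS type. It is only an illustration under stated assumptions: the helper majority_kept is hypothetical and not part of this package, and the two formulas assume that "RUS_Pos" treats perc as the share of minority documents wanted in the balanced data, while "RUS_under" treats perc as the share of majority documents to keep. Check the ubUnder documentation before relying on these semantics.

```r
# Hypothetical helper (not part of the package): number of majority
# documents kept for a given `perc`.
# n_pos = number of minority documents, n_neg = number of majority documents.
majority_kept <- function(n_pos, n_neg, perc,
                          type = c("RUS_Pos", "RUS_under")) {
  type <- match.arg(type)
  kept <- if (type == "RUS_Pos") {
    # Assumption: perc is the share (%) of minority documents
    # wanted in the balanced data.
    floor(n_pos * (100 - perc) / perc)
  } else {
    # Assumption: perc is the share (%) of majority documents to keep.
    floor(perc / 100 * n_neg)
  }
  min(kept, n_neg)  # cannot keep more documents than are available
}

majority_kept(n_pos = 20, n_neg = 180, perc = 50, type = "RUS_Pos")    # 20
majority_kept(n_pos = 20, n_neg = 180, perc = 25, type = "RUS_Pos")    # 60
majority_kept(n_pos = 20, n_neg = 180, perc = 50, type = "RUS_under")  # 90
```

Under these assumptions, perc = 50 with "RUS_Pos" yields a fully balanced dataset, whereas perc = 50 with "RUS_under" simply halves the majority class.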

Details

This function balances the document-term matrix by applying Random Undersampling or Random Oversampling through the functions ubUnder and ubOver from the unbalanced package.
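
The two techniques can be illustrated with a minimal base-R sketch on a toy matrix. This is not the package's implementation: random_under and random_over are hypothetical helpers, the labels are assumed to be coded 0 (majority) and 1 (minority), a plain matrix stands in for the DTM, and the oversampling rule (minority class drawn with replacement until it has k times its original size) is an assumption made for the example.

```r
set.seed(1)  # reproducible toy example

# Toy stand-in for a DTM: 10 documents x 3 terms, 2 positives, 8 negatives.
X <- matrix(seq_len(30), nrow = 10)
Y <- factor(c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0))

# Random undersampling: keep every minority row and a random subset of
# n_keep majority rows. `w` gives optional sampling weights for the
# majority rows; NULL means equal weights.
random_under <- function(X, Y, n_keep, w = NULL) {
  pos <- which(Y == 1)
  neg <- which(Y == 0)
  kept_neg <- neg[sample.int(length(neg), n_keep, prob = w)]
  keep <- sort(c(pos, kept_neg))
  list(X = X[keep, , drop = FALSE], Y = Y[keep],
       id.rm = setdiff(seq_along(Y), keep))
}

# Random oversampling: keep every majority row and draw minority rows
# with replacement until there are k times as many as originally.
random_over <- function(X, Y, k) {
  pos <- which(Y == 1)
  neg <- which(Y == 0)
  over_pos <- pos[sample.int(length(pos), k * length(pos), replace = TRUE)]
  idx <- c(neg, over_pos)
  list(X = X[idx, , drop = FALSE], Y = Y[idx])
}

rus <- random_under(X, Y, n_keep = 2)
table(rus$Y)  # 2 negatives, 2 positives
rus$id.rm     # indices of the 6 removed documents

ros <- random_over(X, Y, k = 2)
table(ros$Y)  # 8 negatives, 4 positives
```

The undersampling helper returns a third element with the removed row indices, while the oversampling helper returns only the matrix and the response, since oversampling removes nothing.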

Value

If type = "RUS_Pos" or "RUS_under", the value is a list of three elements. The first element, X, is the balanced DTM, of the same class as the input DTM, i.e. c('DocumentTermMatrix', 'simple_triplet_matrix'). The second element, Y, contains the response variable of the balanced data as a factor. The third element is a vector identifying the removed documents.

If type = "ROS", the value is a list of two elements. The first element, X, is the balanced DTM, of the same class as the input DTM, i.e. c('DocumentTermMatrix', 'simple_triplet_matrix'). The second element, Y, contains the response variable of the balanced data as a factor.

Examples

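library(tm)          # DTM classes and meta()
library(unbalanced)  # provides ubUnder and ubOver

# Response: the real labels stored in the corpus metadata, as a factor
y <- factor(meta(liu_corpus)$real_label)
x <- liu_dtm

# Random undersampling ("RUS_Pos"); k_pos and verbose apply to "ROS" only
rus <- hutch_sampling(x, y, type = "RUS_Pos", perc = 50, k_pos = 0,
  w = NULL, verbose = FALSE
)

rus$X      # balanced DTM
rus$Y      # response of the balanced data
rus$id.rm  # removed documents

# Random oversampling ("ROS"); perc and w apply to the RUS types only
ros <- hutch_sampling(x, y, type = "ROS", perc = 50, k_pos = 2,
  w = NULL, verbose = FALSE
)

ros$X      # balanced DTM
ros$Y      # response of the balanced data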

UBESP-DCTV/costumer documentation built on Feb. 1, 2023, 4:52 a.m.