DTsubsample: data.table subsampling (nearly) without copy

Description

This function attempts to subsample a data.table without making copies. Compared to direct subsampling, this can yield up to 1.1x better memory efficiency. In most cases, however, memory efficiency is actually WORSE, even with frequent garbage collection. Use this only if you are working with very large datasets that fill up your RAM.
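Because the gain is small and often negative, it may be worth measuring peak memory on your own data before adopting this function. The sketch below measures direct subsampling only; `peak_mb` is a hypothetical helper (not part of this package) that relies on the "max used" counters reported by base R's `gc()`.

```r
library(data.table)

# Hypothetical helper: peak memory (in Mb) recorded by gc() while `expr` runs
peak_mb <- function(expr) {
  invisible(gc(reset = TRUE))   # reset the "max used" counters
  force(expr)                   # evaluate the lazily passed expression
  stats <- gc()
  sum(stats[, ncol(stats)])     # last column holds "max used" in Mb
}

DT <- data.table(a = runif(1e5), b = runif(1e5))
kept <- sample(1e5, 8e4, FALSE)

# Peak memory of direct subsampling, to compare against DTsubsample()
peak_direct <- peak_mb(DT[kept, ])
```

Wrapping a call to DTsubsample in the same helper gives a comparable figure. The numbers include everything else held in the R session, so only the difference between two runs is meaningful.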

Usage

DTsubsample(DT, kept, remove = FALSE, low_mem = FALSE, collect = 0,
  silent = TRUE)

Arguments

DT

Type: data.table. The data.table to subsample.

kept

Type: vector of integers. The rows to select for subsampling.

remove

Type: boolean. Whether kept lists the rows to remove instead of the rows to keep (when TRUE, all rows which are not in kept are returned). Defaults to FALSE.

low_mem

Type: boolean. When set to TRUE, prevents DT from residing (up to) twice in memory by deleting DT (WARNING: empties your DT), which saves memory. When set to FALSE, DT may reside (up to) twice in memory, so memory usage increases. Defaults to FALSE.

collect

Type: integer. Forces a garbage collection every collect iterations to free memory. Setting this to 1 along with low_mem = TRUE gives the lowest possible memory usage when subsampling a data.table. It also prints verbose information about the process every time it garbage collects. Setting this to 0 disables garbage collection. Lower nonzero values increase the time required to subsample the data.table. Defaults to 0.

silent

Type: boolean. Forces silence during garbage collection iterations at no speed cost. Defaults to TRUE.
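To show how these arguments could interact, here is a minimal sketch of column-by-column subsampling. It is an assumption about the general approach, not the package's actual code: `subsample_by_column` is a hypothetical function, and treating each column as one "iteration" for collect is a guess.

```r
library(data.table)

# Hypothetical illustration only; NOT the implementation of DTsubsample()
subsample_by_column <- function(DT, kept, remove = FALSE, low_mem = FALSE,
                                collect = 0, silent = TRUE) {
  # With remove = TRUE, `kept` lists the rows to drop instead of keep
  rows <- if (remove) setdiff(seq_len(nrow(DT)), kept) else kept
  cols <- copy(names(DT))   # copy, since DT's names may change below
  out <- NULL
  for (i in seq_along(cols)) {
    values <- DT[[cols[i]]][rows]
    if (is.null(out)) {
      out <- as.data.table(setNames(list(values), cols[i]))
    } else {
      set(out, j = cols[i], value = values)   # add column by reference
    }
    # low_mem: delete the source column by reference (empties the caller's DT)
    if (low_mem) set(DT, j = cols[i], value = NULL)
    # collect: garbage collect every `collect` columns (assumed meaning)
    if (collect > 0 && i %% collect == 0) {
      gc(verbose = FALSE)
      if (!silent) message("Garbage collected after column ", i)
    }
  }
  out
}

DT <- data.table(a = 1:10, b = letters[1:10], c = (1:10) / 2)
sub <- subsample_by_column(DT, c(2, 4, 6), collect = 1)        # keep 3 rows
dropped <- subsample_by_column(DT, c(2, 4, 6), remove = TRUE)  # drop 3 rows
emptied <- subsample_by_column(DT, 1:3, low_mem = TRUE)        # DT is now empty
```

In this sketch, only one extra column-sized vector exists at a time when low_mem = TRUE, which is the kind of saving the arguments above describe. The price is that the original DT no longer holds any data afterwards.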

Details

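Warning: DT is passed as a pointer (by reference), not as a copy, even though you pass the object itself to this function. Changes made inside the function (such as emptying DT when low_mem = TRUE) therefore affect your original object. This is how memory efficiency is achieved.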
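The pointer behavior comes from data.table itself and can be seen in a short example. `drop_first_column` is a hypothetical function written for this illustration; `set()` and `copy()` are standard data.table functions.

```r
library(data.table)

# Hypothetical function that deletes a column by reference
drop_first_column <- function(x) {
  set(x, j = 1L, value = NULL)   # modifies the caller's object in place
  invisible(x)
}

DT <- data.table(a = 1:3, b = 4:6)
backup <- copy(DT)     # deep copy, unaffected by later in-place changes

drop_first_column(DT)
names(DT)              # "b": the original object was modified
names(backup)          # "a" "b": the copy is intact
```

If you still need the full table after calling DTsubsample with low_mem = TRUE, take a copy() first, keeping in mind that the copy uses the memory you were trying to save.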

Value

The subsampled data.table.

Examples

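library(data.table)
library(Laurae) # provides DTsubsample
# Build a 5,000,000 x 10 data.table filled with 1
DT <- data.frame(matrix(nrow = 5000000, ncol = 10))
DT <- setDT(DT)
DT[is.na(DT)] <- 1
colnames(DT) <- paste(colnames(DT), "xx", sep = "")
# Keep 4,000,000 random rows, garbage collecting every 5 iterations
DT_sub <- DTsubsample(DT, sample(5e6, 4e6, FALSE), collect = 5, silent = TRUE)
# DT_sub <- DT[sample(5e6, 4e6, FALSE), ] # direct subsampling, for comparison
# Same idea with low_mem = TRUE (WARNING: this empties DT)
DT_sub <- DTsubsample(DT, sample(4e6, 3e6, FALSE), low_mem = TRUE, collect = 5, silent = TRUE)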

Laurae2/Laurae documentation built on May 8, 2019, 7:59 p.m.