DTrbind: data.table row binding (nearly without) copy


Description

This function attempts to row-bind two data.tables without making copies. Compared to rbind, it can be up to 3X more memory efficient. With frequent garbage collection, expect at least a 2X improvement.

Usage

DTrbind(dt1, dt2, low_mem = FALSE, collect = 0, silent = TRUE)

Arguments

dt1

Type: data.table. The data.table to combine on.

dt2

Type: data.table. The data.table to "copy" onto dt1.

low_mem

Type: boolean. When set to TRUE, prevents dt1 and dt2 from residing twice in memory by deleting dt1 and dt2 (WARNING: this empties your dt2), which saves memory. When set to FALSE, dt1 and dt2 are allowed to reside twice in memory, so memory usage increases. Defaults to FALSE.

collect

Type: integer. Forces a garbage collection every collect iterations to free memory. Setting this to 1 along with low_mem = TRUE gives the lowest memory usage this function can achieve when binding two data.tables. It also prints verbose information about the process every time it garbage collects (see silent). Setting this to 0 disables garbage collection. Lower nonzero values increase the time required to bind the data.tables. Defaults to 0.

silent

Type: boolean. Forces silence during garbage collection iterations at no speed cost. Defaults to TRUE.
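
To illustrate how low_mem and collect could interact, here is a minimal sketch of a column-by-column bind. It is not the DTrbind source: the helper name bind_by_column and its internals are hypothetical, and it assumes both tables have the same column names. Only one column is duplicated at any time, and both inputs are emptied as the loop progresses.

```r
library(data.table)

# Hypothetical helper, NOT the DTrbind implementation: binds column by column,
# dropping each column from both inputs once it has been combined.
bind_by_column <- function(dt1, dt2, collect = 0L) {
  cols <- copy(names(dt1))            # copy, because dt1's names change as columns are dropped
  out <- vector("list", length(cols))
  names(out) <- cols
  for (k in seq_along(cols)) {
    col <- cols[k]
    out[[k]] <- c(dt1[[col]], dt2[[col]])  # only this column exists twice for a moment
    set(dt1, j = col, value = NULL)        # delete the column from dt1 by reference
    set(dt2, j = col, value = NULL)        # delete it from dt2 as well (this empties dt2)
    if (collect > 0L && k %% collect == 0L) {
      gc(verbose = FALSE)                  # periodic garbage collection, as collect describes
    }
  }
  setDT(out)                               # turn the list into a data.table without copying
  out
}

a <- data.table(x = 1:3, y = c("a", "b", "c"))
b <- data.table(x = 4:5, y = c("d", "e"))
res <- bind_by_column(a, b, collect = 1L)
```

After the call, res holds all five rows while a and b have no columns left, which mirrors the warning attached to low_mem = TRUE.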

Details

Warning: dt1 and dt2 are treated as pointers (references) only, even though you pass the objects themselves to this function. This is how memory efficiency is achieved. dt1 and dt2 get overwritten on the fly.
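
The warning above follows from data.table's reference semantics. The short sketch below uses only standard data.table operations (the helper drop_col is a hypothetical name for illustration) to show that a table modified by reference inside a function also changes for the caller, and that copy() is what you need if you want to keep an untouched version.

```r
library(data.table)

dt <- data.table(a = 1:3)
alias <- dt                 # no copy is made: both names refer to the same table
alias[, b := 4:6]           # adding a column through the alias also changes dt

# Hypothetical helper for illustration: deletes a column by reference.
drop_col <- function(x, col) {
  set(x, j = col, value = NULL)
  invisible(NULL)
}
drop_col(dt, "a")           # the caller's table loses column "a"

safe <- copy(dt)            # copy() creates an independent table
safe[, c := 1L]             # this change does not reach dt
```

If you need dt2 after calling DTrbind with low_mem = TRUE, taking a copy() beforehand is one option, at the cost of the memory the function is trying to save.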

Value

A data.table based on dt1.

Examples

library(data.table)
library(Laurae) # DTrbind is provided by the Laurae package
df1 <- data.frame(matrix(nrow = 50000, ncol = 1000))
df2 <- data.frame(matrix(nrow = 50000, ncol = 1000))
setDT(df1)
setDT(df2)
df1[is.na(df1)] <- 1
gc()
df2[is.na(df2)] <- 2
gc() # look at memory usage
# open a task manager to check current RAM usage
df1 <- DTrbind(df1, df2, low_mem = TRUE, collect = 20, silent = FALSE)
# check RAM usage in a task manager: it is identical to what we had previously!
gc() # gives no gain
df3 <- data.frame(matrix(nrow = 50000, ncol = 1000))
setDT(df3)
# check the current RAM usage in a task manager
#df1 <- rbind(df1, df3) # RAM usage explodes!
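
The example above relies on a task manager to observe memory. As a rough in-R alternative, the sketch below reads the "max used" figures reported by base R's gc(). The helper name peak_mb is hypothetical, and the number it returns only covers memory managed by R, so it may not match what a task manager shows.

```r
# Hypothetical helper: peak R memory (in Mb) reached while evaluating expr.
peak_mb <- function(expr) {
  invisible(gc(reset = TRUE))  # reset the "max used" statistics
  force(expr)                  # evaluate the expression passed by the caller
  g <- gc()
  sum(g[, ncol(g)])            # last column is "max used" in Mb (Ncells + Vcells)
}

# Small usage example: peak memory while allocating and concatenating vectors.
p <- peak_mb({
  x <- numeric(1e6)
  y <- c(x, x)
})
```

Wrapping the DTrbind call and the commented-out rbind call in such a helper would be one way to compare the two approaches numerically.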

Laurae2/Laurae documentation built on May 8, 2019, 7:59 p.m.