DTfillNA: data.table NA fill (nearly without) copy (or data.frame)


Description

This function fills NA values in a data.table. Compared to DT[is.na(DT)] <- value, it is stated to be at least 3X more memory efficient. With the default settings, the memory efficiency is at least 2X when garbage collection is frequent.

Usage

DTfillNA(DT, value = 0, low_mem = FALSE, collect = 0, silent = TRUE)

Arguments

DT

Type: data.table (or a data.frame, partially supported). The data.table to fill NAs on.

value

Type: vector of length 1 or of length ncol(DT). If a vector of length 1 is supplied, NA values are replaced by that value. Otherwise, attempts to replace values by matching the column number with the vector. Defaults to 0.
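The two accepted shapes of value can be illustrated with a minimal sketch. It assumes the fill is done column by column with data.table::set(); fill_na_sketch is a hypothetical helper written for this illustration and is not the package's implementation.

```r
library(data.table)

# Hypothetical helper (not part of the package): fills NAs in place.
# A length-1 `value` is recycled over all columns; otherwise value[j]
# is used for column j.
fill_na_sketch <- function(DT, value = 0) {
  stopifnot(length(value) %in% c(1L, ncol(DT)))
  if (length(value) == 1L) value <- rep(value, ncol(DT))
  for (j in seq_len(ncol(DT))) {
    rows <- which(is.na(DT[[j]]))
    # Only touch the column when it actually contains NA values.
    if (length(rows) > 0L) set(DT, i = rows, j = j, value = value[[j]])
  }
  invisible(DT)
}

dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, NA))
fill_na_sketch(dt, value = c(-1, 99))  # -1 for column a, 99 for column b
print(dt)
```

Matching by column position keeps the interface simple, but it means the order of value must follow the column order of DT.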

low_mem

Type: boolean. When set to TRUE, DT is modified in place so that it never resides twice in memory (WARNING: your original DT is overwritten). When set to FALSE, DT is allowed to reside twice in memory, so memory usage increases. Defaults to FALSE.

collect

Type: integer. Forces a garbage collect every collect iterations to free memory. Setting this to 1 along with low_mem = TRUE gives the lowest memory usage this function can reach when filling NA values. It also prints verbose information about the process every time it garbage collects (unless silent = TRUE). Setting this to 0 disables garbage collection. Lower non-zero values increase the time required to fill the data.table. Defaults to 0.
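As a rough sketch of what a periodic garbage collect could look like, the loop below calls gc() after every collect columns. It assumes one iteration corresponds to one column; fill_na_gc_sketch is a hypothetical name used only here, and the real function may be organised differently.

```r
library(data.table)

# Hypothetical helper illustrating the `collect` and `silent` arguments.
fill_na_gc_sketch <- function(DT, value = 0, collect = 0, silent = TRUE) {
  for (j in seq_len(ncol(DT))) {
    rows <- which(is.na(DT[[j]]))
    if (length(rows) > 0L) set(DT, i = rows, j = j, value = value)
    # Collect garbage every `collect` columns; collect = 0 disables it.
    if (collect > 0 && j %% collect == 0) {
      gc(verbose = FALSE)
      if (!silent) cat("Garbage collected after column", j, "of", ncol(DT), "\n")
    }
  }
  invisible(DT)
}

dt <- as.data.table(matrix(NA_real_, nrow = 10, ncol = 6))
fill_na_gc_sketch(dt, value = 0, collect = 2, silent = FALSE)
```

Each gc() call has a fixed cost, which is why small non-zero values of collect trade speed for a lower memory peak.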

silent

Type: boolean. Forces silence during garbage collection iterations at no speed cost. Defaults to TRUE.

Details

Warning: DT is a pointer only and is directly modified.
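The warning above follows from data.table's reference semantics, which the short sketch below tries to show. It uses only data.table's own set(), copy() and address(); zero_fill is a hypothetical function for illustration.

```r
library(data.table)

dt <- data.table(x = c(NA, 2, NA))
before <- address(dt)

# A function that calls set() on its argument changes the caller's object:
# no copy of DT is made when it is passed in.
zero_fill <- function(DT) {
  set(DT, i = which(is.na(DT[["x"]])), j = "x", value = 0)
  invisible(NULL)
}
zero_fill(dt)

print(dt$x)                          # the NAs in the caller's dt are gone
print(identical(before, address(dt)))  # compare memory addresses

# To keep the original values, take an explicit copy first.
dt2 <- data.table(x = c(NA, 2, NA))
kept <- copy(dt2)
zero_fill(dt2)
print(anyNA(kept$x))                 # the copy still holds its NAs
```

Taking a copy() beforehand doubles the memory used by the table, which is the cost the in-place approach is meant to avoid.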

Value

A data.table with filled NA values (if low_mem is set to TRUE).

Examples

library(data.table)
df1 <- data.frame(matrix(nrow = 50000, ncol = 1000))
df2 <- data.frame(matrix(nrow = 50000, ncol = 1000))
setDT(df1)
setDT(df2)
gc() # check memory usage...380MB?
DTfillNA(df2, value = 0, low_mem = TRUE, collect = 20, silent = TRUE)
gc() # check memory usage peak... 600MB?
rm(df2)
gc() # 200MB only, let's try with only 1 frame left...
df1[is.na(df1)] <- 0
gc() # with 1 data.table less, memory still peaked at 850MB (200MB->850MB)
# i.e. it took at least 3.5X more memory than the object alone

df2 <- data.frame(matrix(nrow = 50000, ncol = 1000))
setDT(df2)
DTfillNA(df2, value = 0, low_mem = TRUE, collect = 20, silent = TRUE)
gc() # all good
identical(df1, df2) # TRUE => the same...

rm(df1, df2)
gc(reset = TRUE)

# Let's try to make a copy
df1 <- data.frame(matrix(nrow = 50000, ncol = 1000))
df2 <- DTfillNA(df1, value = 99, low_mem = FALSE, collect = 50, silent = TRUE)
gc() # only 650MB, much better than doing df2 <- df1; df2[is.na(df2)] <- 99

rm(df1, df2)
gc(reset = TRUE)

# This can't be done "easily" in R without hacky workarounds (fill 1 to 1000 by column)
df1 <- data.frame(matrix(nrow = 50000, ncol = 1000))
df2 <- DTfillNA(df1, value = 1:1000, low_mem = FALSE, collect = 50, silent = TRUE)
gc() # only 650MB

# You can do this on data.frame too...
# It will NOT coerce to data.table
# Just remember it doesn't update in real time in RStudio
df2 <- data.frame(matrix(nrow = 50000, ncol = 1000))
DTfillNA(df2, value = 1:1000, low_mem = TRUE, collect = 50, silent = TRUE)
head(df2)
is.data.table(df2) # FALSE, we did in-place replacement without parent.env hehe

Laurae2/Laurae documentation built on May 8, 2019, 7:59 p.m.