| splitH | R Documentation |
Reads or writes data files from/to disk in disjoint subsets. This is a two-stage function (see Examples).
splitH(readpath, writepath = NULL)
readpath character, length 1. Full path to the source file
writepath character, length 1. Full path to the destination file
The above arguments apply to Stage 1 only. The arguments of the Stage 2 function, which is the output of Stage 1, are the following:
rows integer, length 1. Number of rows per subset. When rows = Inf, the data can be either
copied as is or moved to a new location
seq logical, default TRUE: read discrete (disjoint) subsets. When FALSE, read progressively
appended subsets, from the first to the current one
dropcols character of length < ncol(data). Columns to drop. Works only when rows is
finite. Replaces argument select from data.table::fread
how symbol. Works only when rows = Inf and writepath location is given.
Options: how = scp, data file is copied as is to writepath location;
how = mv, data file is moved to writepath location
print logical, default TRUE: each subset written to disk is shown in the console. Setting print to
FALSE could increase writing speed
orn logical, default FALSE. When TRUE, the original data row numbers
are shown in each subset
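As a rough illustration of the seq argument, the sketch below computes which source rows each call would cover for a 32-row file read 11 rows at a time. The helper chunk_rows is hypothetical and not part of the package. It reflects one reading of the description above: seq = TRUE gives disjoint blocks, while seq = FALSE gives blocks that grow from the first row to the current one.

```r
# Hypothetical helper (not part of splitH): row indices covered by call k
chunk_rows <- function(k, rows, n, seq = TRUE) {
  last  <- min(k * rows, n)                    # last source row reached at call k
  first <- if (seq) (k - 1) * rows + 1 else 1  # disjoint block vs. start of file
  first:last
}

n <- 32                                        # e.g. nrow(mtcars)
disjoint   <- lapply(1:3, chunk_rows, rows = 11, n = n, seq = TRUE)
cumulative <- lapply(1:3, chunk_rows, rows = 11, n = n, seq = FALSE)

range(disjoint[[2]])    # second call, disjoint: rows 12 to 22
range(cumulative[[2]])  # second call, appended: rows 1 to 22
```

Under this reading, the last disjoint block is shorter than rows whenever the file length is not a multiple of rows.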
The main purpose of this utility is to bring manageable subsets from very large data into the working
environment for further processing when writepath = NULL. When orn = TRUE, each subset
receives a new column named "srn" showing source data row numbers. This column
is absent from subsets written to disk regardless of orn value. The source data file can be any
type of file readable by data.table::fread.
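To make the role of the "srn" column concrete, here is a small base R sketch of what a second subset of 11 rows might look like with source row numbers attached. It works on the in-memory mtcars data rather than a file, and the variable names are illustrative assumptions, not the package's internals.

```r
rows  <- 11
k     <- 2                                  # second subset
first <- (k - 1) * rows + 1                 # first source row of this subset
idx   <- first:(first + rows - 1)           # source rows 12 to 22

chunk <- mtcars[idx, c("mpg", "cyl")]
rownames(chunk) <- NULL                     # row numbers within a subset restart at 1
chunk$srn <- idx                            # source row numbers, as with orn = TRUE
head(chunk, 3)
```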
At the first stage, the utility retrieves information about the source data without loading them into memory and returns a new function. At the second stage, that function either:
reads source data in successive disjoint subsets (rows < Inf) and brings them into the working
environment (writepath = NULL), or
writes subsets to the writepath location, appending them automatically to the destination file.
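The two-stage pattern can be sketched in base R as a function factory whose returned closure remembers how many rows have already been read. This is only a minimal illustration under stated assumptions: make_reader is a hypothetical name, it uses read.csv rather than data.table::fread, and it is not the splitH source code.

```r
# Hypothetical function factory; a base R sketch, not the splitH implementation
make_reader <- function(path) {
  header <- names(read.csv(path, nrows = 1))    # stage 1: column names only
  pos <- 0                                      # data rows consumed so far
  function(rows) {
    chunk <- tryCatch(
      read.csv(path, skip = pos + 1, nrows = rows,
               header = FALSE, col.names = header),
      error = function(e) NULL                  # nothing left to read
    )
    if (is.null(chunk) || nrow(chunk) == 0) stop("no rows left")
    pos <<- pos + nrow(chunk)                   # remember position for the next call
    chunk
  }
}

tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)
nxt <- make_reader(tmp)                         # stage 1: returns a closure
a <- nxt(11); b <- nxt(11); d <- nxt(11)        # stage 2: three successive subsets
unlink(tmp)
```

Each call advances the stored position, so the subsets are disjoint, and the final one holds only the rows that remain.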
During writing, if print = TRUE, the displayed subsets are just printouts (class "NULL"). When
writepath = NULL, the displayed subsets are objects.
There is a functional difference between rows = Inf and rows = nrow(data):
When rows = Inf, the size of the source data is irrelevant. The file can be either copied (how = scp)
or moved (how = mv) to the writepath destination without being loaded into memory.
When rows has a finite value, the size of the source data is relevant and data columns can be dropped.
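The rows = Inf case acts on the file itself rather than on its contents. The sketch below shows the difference between copying and moving with base R's file.copy and file.rename. These are stand-ins chosen for portability, on the assumption that they behave like the scp and mv options; the package itself refers to the Linux commands.

```r
src <- tempfile(fileext = ".csv")
write.csv(mtcars, src, row.names = FALSE)

# Copy: the source stays in place (stand-in for how = scp)
dst_copy <- tempfile(fileext = ".csv")
copied <- file.copy(src, dst_copy)

# Move: the source no longer exists afterwards (stand-in for how = mv)
dst_move <- tempfile(fileext = ".csv")
moved <- file.rename(src, dst_move)

# Which of source, copy, and moved file exist now
state <- file.exists(c(src, dst_copy, dst_move))
unlink(c(dst_copy, dst_move))
```

Neither operation parses the data, which is why the size of the source file does not matter here.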
At stage 1, the function displays information about the source data and returns a function (a closure). At stage 2, it returns a "data.table" class subset of the data, or a printout of that subset when it is written to disk.
Part of the internal code for Stage 1 was inspired by data.table issue #7169
Linux commands scp and mv
if (interactive()) {
# Make a 'csv' file
data(mtcars)
tmpf = tempfile(fileext = '.csv')
write.table(mtcars, tmpf, sep = ',', row.names = FALSE, quote = FALSE)
# 1. Read data file step by step
# 1.1 Get information on data
r = splitH(readpath = tmpf) # stage 1
class(r) # function
# 1.2 Read data iteratively # stage 2
a = r(rows = 11, dropcols = c('am', 'vs')) # iter1 no original row numbers
b = r(rows = 11, dropcols = c('am', 'vs'), orn = TRUE) # iter2 w. original row numbers
c = r(rows = 11, dropcols = c('am', 'vs')) # iter3 the last subset
## Not run:
# d = r(rows = 11, dropcols = c('am', 'vs')) # iter4 stop! Return to stage 1
## End(Not run)
print(list(a, b, c))
# 2. Read data file completely
r = splitH(readpath = tmpf) # stage 1
n = ceiling(32/13) # read 13 rows at a time
a = replicate(n, r(rows = 13), simplify = FALSE) # read file
class(a) # list
print(a) # a list of tables
tmpf1 = tempfile(fileext = '.csv') # new location
# 3. Iteratively write to new location
r = splitH(readpath = tmpf, writepath = tmpf1) # stage 1
n = ceiling(32/11) # 11 rows each time
invisible(
replicate(n, r(rows = 11), simplify = FALSE) # write to new location
)
a = data.table::fread(tmpf1) # check result
dim(a)
print(head(a))
unlink(tmpf1)
tmpf2 = tempfile(fileext = '.csv') # new location
# 4. Move file from tmpf to another location
r = splitH(readpath = tmpf, writepath = tmpf2) # stage 1
r(rows = Inf, how = mv, print = FALSE) # move to new location
a = data.table::fread(tmpf2) # check result
print(head(a))
unlink(tmpf)
unlink(tmpf2)
}