splitH: Read Or Write Subsets Of Data Files From Or To Disk

View source: R/seqRead.R

splitH    R Documentation

Read Or Write Subsets Of Data Files From Or To Disk

Description

Reads or writes data files from/to disk in disjoint subsets. This is a two-stage function (see Examples).

Usage

splitH(readpath, writepath = NULL)

Arguments

readpath

character, length 1. Full path to the source file.

writepath

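character, length 1. Full path to the destination file. With the default, NULL, subsets are brought into the working environment instead of being written to disk.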

Details

The arguments above apply to Stage 1 only. The Stage 2 function, which is the output of Stage 1, takes the following arguments:

rows integer, length 1. Number of rows per subset. When rows = Inf, the data file is either copied as is or moved to a new location (see how).

seq logical, default TRUE. When TRUE, each call reads a discrete (disjoint) subset. When FALSE, each call returns the subsets appended progressively, from the first to the current one.

dropcols character, of length < ncol(data). Names of the columns to drop. Works only when rows is finite. Replaces the argument select of data.table::fread.

how symbol. Works only when rows = Inf and a writepath location is given. Options: how = scp, the data file is copied as is to the writepath location; how = mv, the data file is moved to the writepath location.

print logical, default TRUE. When TRUE, each subset written to disk is also shown in the console. Setting print = FALSE could increase writing speed.

orn logical, default FALSE. When TRUE, the original data row numbers are shown in each subset.

The main purpose of this utility is to bring manageable subsets of very large data into the working environment for further processing, which is what happens when writepath = NULL. When orn = TRUE, each subset receives a new column named "srn" holding the row numbers of the source data. This column is absent from subsets written to disk, regardless of the value of orn. The source data file can be of any type readable by data.table::fread.
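To make the idea of source row numbers concrete, the following base R sketch tags the second 11-row subset of mtcars with the rows it came from. It is an illustration only and not the package's code: the names src, rows, k, idx and subset_k are hypothetical, and placing "srn" as the first column is an assumption.

```r
# Illustration only: mimic an "srn" column for one subset, in base R.
src <- mtcars                       # stands in for the source data
rownames(src) <- NULL               # drop car names so only positions remain

rows <- 11L                         # rows per subset
k    <- 2L                          # which subset to take (the second one)

# Source row positions covered by subset k, capped at the end of the data
idx <- seq.int((k - 1L) * rows + 1L, min(k * rows, nrow(src)))

# Assumption: "srn" is placed first; the real column position may differ
subset_k <- cbind(srn = idx, src[idx, ])
rownames(subset_k) <- NULL

print(head(subset_k, 3))            # srn starts at 12 for the second subset
```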

At the first stage, the utility retrieves information about the source data without loading them into memory and returns a new function. At the second stage, this function either:

  • reads the source data in successive disjoint subsets (rows < Inf) and brings them into the working environment (writepath = NULL), or

  • writes the subsets to the writepath location, appending them automatically to the destination file. During writing, if print = TRUE, the displayed subsets are just printouts (class "NULL"). When writepath = NULL, the displayed subsets are objects.
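The two-stage pattern described above can be sketched with a closure in base R. This is a minimal sketch under stated assumptions, not the implementation of splitH: make_reader is a hypothetical helper, it uses read.csv rather than data.table::fread, and it supports only the reading case (no writepath, dropcols, seq or orn).

```r
# Hypothetical helper for illustration only; not part of akin.
make_reader <- function(path) {
  # Stage 1: inspect the file cheaply (header plus one row), load nothing else.
  header <- names(read.csv(path, nrows = 1))
  offset <- 0L                       # data rows consumed so far

  # Stage 2: the returned closure remembers 'path', 'header' and 'offset'.
  function(rows) {
    chunk <- tryCatch(
      read.csv(path, skip = offset + 1L, nrows = rows,
               header = FALSE, col.names = header),
      error = function(e) NULL       # read.csv errors when no lines remain
    )
    if (is.null(chunk) || nrow(chunk) == 0L) stop("no rows left to read")
    offset <<- offset + nrow(chunk)  # advance so the next subset is disjoint
    chunk
  }
}

tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)

r  <- make_reader(tmp)               # stage 1
s1 <- r(rows = 11)                   # stage 2: source rows 1 to 11
s2 <- r(rows = 11)                   # source rows 12 to 22
s3 <- r(rows = 11)                   # source rows 23 to 32, the last subset
```

Because the offset lives in the closure's environment, each call continues where the previous one stopped, and a further call after the last subset stops with an error.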

There is a functional difference between rows = Inf and rows = nrow(data):

  • when rows = Inf, the size of the source data is irrelevant. The file can be either copied (how = scp) or moved (how = mv) to the writepath destination without being loaded into memory.

  • when rows has a finite value, the size of the source data is relevant and data columns can be dropped.
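The copy and move cases never parse the file, which is why its size does not matter. As a rough analogy in base R, with file.copy and file.rename standing in for the scp and mv commands (an assumption made for illustration; this is not necessarily what splitH calls internally), the difference looks like this:

```r
# Analogy only: file-level copy versus move, without reading the data.
src <- tempfile(fileext = ".csv")
write.csv(mtcars, src, row.names = FALSE)

dst_copy <- tempfile(fileext = ".csv")
dst_move <- tempfile(fileext = ".csv")

# Like how = scp: the destination is created and the source stays in place
copied <- file.copy(src, dst_copy)

# Like how = mv: the destination is created and the source disappears
moved <- file.rename(src, dst_move)

file.exists(c(src, dst_copy, dst_move))
```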

Value

At stage 1, displayed information and a function (a closure). At stage 2, a subset of the data of class "data.table", or a printout of that subset when it is written to disk.

References

Part of the internal code for Stage 1 was inspired by data.table issue #7169.

See Also

Linux commands scp and mv

Examples


if (interactive()) {

# Make a 'csv' file

data(mtcars)
tmpf = tempfile(fileext = '.csv')
write.table(mtcars, tmpf, sep = ',', row.names = FALSE, quote = FALSE)

# 1. Read data file step by step

# 1.1 Get information on data
r = splitH(readpath = tmpf)                             # stage 1
class(r)                                                # function

# 1.2 Read data iteratively                             # stage 2
a = r(rows = 11, dropcols = c('am', 'vs'))              # iter1 no original row numbers
b = r(rows = 11, dropcols = c('am', 'vs'), orn = TRUE)  # iter2 w. original row numbers
c = r(rows = 11, dropcols = c('am', 'vs'))              # iter3 the last subset
## Not run: 
d = r(rows = 11, dropcols = c('am', 'vs'))              # iter4 stop! Return to stage 1

## End(Not run)
print(list(a, b, c))

# 2. Read data file completely

r = splitH(readpath = tmpf)                             # stage 1
n = ceiling(32/13)                                      # read 13 rows at a time
a = replicate(n, r(rows = 13), simplify = FALSE)        # read file
class(a)                                                # list
print(a)                                                # a list of tables

tmpf1 = tempfile(fileext = '.csv')                      # new location

# 3. Iteratively write to new location

r = splitH(readpath = tmpf, writepath = tmpf1)          # stage 1
n = ceiling(32/11)                                      # 11 rows each time
invisible(
 replicate(n, r(rows = 11) , simplify = FALSE)          # write to new location
 )
a = data.table::fread(tmpf1)                            # check result
dim(a)
print(head(a))

unlink(tmpf1)

tmpf2 = tempfile(fileext = '.csv')                      # new location

# 4. Move file from tmpf to another location

r = splitH(readpath = tmpf, writepath = tmpf2)          # stage 1
r(rows = Inf, how = mv, print = FALSE)                  # move to new location
a = data.table::fread(tmpf2)                            # check result
print(head(a))

unlink(tmpf)
unlink(tmpf2)

}


akin documentation built on May 19, 2026, 5:07 p.m.