comm.fread: comm.fread

comm.fread    R Documentation

comm.fread

Description

Given a directory, comm.fread() reads the CSV files in it that match pattern, in parallel across the available readers.

Usage

comm.fread(
  dir,
  pattern = "*.csv$",
  shcom = NULL,
  readers = comm.size(),
  verbose = 0,
  ...
)

Arguments

dir

A directory containing the files to be read. The directory should be accessible to all readers.

pattern

The pattern that file names must match in order to be read.

shcom

Additional shell command passed to fread's cmd parameter as fread(cmd = paste(shcom, file)), where file is assigned based on rank. For example, if grep <pattern> <file> is needed, set shcom = "grep <pattern>". (Lustre note: On our Lustre file system, first reads of a 5 GB file with fread interacted oddly with the file system and were 5x faster with shcom = "cat" than without it. Second reads were 10x faster. No such slowdown was observed on NFS file systems.)

readers

The number of readers. Defaults to comm.size(), the number of MPI ranks.

verbose

Determines the verbosity level. Acceptable values are 0, 1, 2, and 3 for least to most verbosity.

...

Additional arguments to be passed to data.table::fread().
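
As an illustration of the shcom description above, the following base R sketch shows how the command string would be composed from shcom and a file name. The helper build_cmd(), the file path, and the grep filter are hypothetical names chosen for this example; only paste(shcom, file) and fread's cmd parameter come from the text above.

```r
# Hypothetical helper (not part of pbdIO): compose the shell command
# that would be handed to data.table::fread(cmd = ...).
build_cmd <- function(shcom, file) {
  paste(shcom, file)  # paste() joins the two parts with a single space
}

file <- "/tmp/read/a.csv"  # illustrative path only

# Filter out comment lines before parsing (illustrative filter)
cmd_grep <- build_cmd("grep -v '^#'", file)

# The Lustre workaround mentioned in the shcom note
cmd_cat <- build_cmd("cat", file)

print(cmd_grep)
print(cmd_cat)

# With data.table installed and the file present, the read would be:
# X <- data.table::fread(cmd = cmd_cat)
```

The fread() call is left commented out so that the sketch runs without data.table or the input file.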

Details

Each MPI rank reads different but entire files. The best load balance is achieved when the number of files is divisible by the number of ranks and the files are approximately the same size. All files are assumed to contain the same columns. See the note under the shcom argument if you are working with a Lustre parallel file system.
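
To make the load-balance remark concrete, here is a minimal base R sketch of one way whole files could be dealt out to ranks. The round-robin rule and the helper files_for_rank() are assumptions made for illustration; the documentation above does not state which assignment comm.fread() actually uses.

```r
# Hypothetical helper (not part of pbdIO): files a given rank would read
# under a round-robin assignment. rank is 0-based, as in MPI.
files_for_rank <- function(files, rank, readers) {
  files[(seq_along(files) - 1L) %% readers == rank]
}

files <- c("a.csv", "b.csv", "c.csv", "d.csv", "e.csv")  # illustrative names
readers <- 2L

# One entry per rank: rank 0 first, then rank 1
assignment <- lapply(seq_len(readers) - 1L,
                     function(r) files_for_rank(files, r, readers))
counts <- lengths(assignment)

print(assignment)
print(counts)
```

With five files and two readers, one rank ends up with three files and the other with two, which is the kind of imbalance that arises when the file count is not divisible by the number of ranks.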

Value

TODO

Examples

## Not run: 
### Save code in a file "demo.r" and run with 2 processors by
### SHELL> mpiexec -np 2 Rscript demo.r
library(pbdMPI)
library(pbdIO)

path <- "/tmp/read"
comm.print(dir(path))
## [1] "a.csv" "b.csv"

X <- comm.fread(path)

comm.print(X, all.rank=TRUE)
## COMM.RANK = 0
##    a b c
## 1: 1 2 3
## COMM.RANK = 1
##    a b c
## 1: 2 3 4

finalize()

## End(Not run)
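
The example above expects a directory holding a.csv and b.csv. A minimal base R sketch for creating such inputs follows; it writes to a temporary directory rather than /tmp/read (an assumption made here to avoid touching existing files), with values taken from the printed output above.

```r
# Create a scratch directory for the two input files
path <- file.path(tempdir(), "read")
dir.create(path, showWarnings = FALSE, recursive = TRUE)

# One row per file, same columns in both, matching the example output
write.csv(data.frame(a = 1, b = 2, c = 3),
          file.path(path, "a.csv"), row.names = FALSE)
write.csv(data.frame(a = 2, b = 3, c = 4),
          file.path(path, "b.csv"), row.names = FALSE)

print(dir(path))
```

When running under mpiexec, the directory must be visible to every rank, so a shared location would replace tempdir() on a multi-node system.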


RBigData/pbdIO documentation built on July 22, 2023, 2:25 p.m.