View source: R/comm.fread.r

comm.fread

R Documentation

Description

Given a directory, comm.fread() reads all CSV files in it in parallel, using the available resources.

Usage

comm.fread(
  dir,
  pattern = "*.csv$",
  shcom = NULL,
  readers = comm.size(),
  verbose = 0,
  ...
)

Arguments

`dir`	The directory containing the files to read. The directory should be accessible to all readers.
`pattern`	The pattern that selects which files in `dir` are read.
`shcom`	An additional shell command passed to `fread`'s `cmd` parameter as `fread(cmd = paste(shcom, file))`, where `file` is assigned based on rank. For example, if `grep <pattern> <file>` is needed, set `shcom = "grep <pattern>"`. (Lustre note: on our Lustre file system, first reads of a 5 GB file with `fread` interacted oddly with the file system and ran 5x faster with `shcom = "cat"` than without it. Second reads were 10x faster. No such slowdown was observed on NFS file systems.)
`readers`	The number of readers.
`verbose`	Determines the verbosity level. Acceptable values are 0, 1, 2, and 3 for least to most verbosity.
`...`	Additional arguments to be passed to `data.table::fread()`.
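
The `shcom` description says the shell command is built as `paste(shcom, file)`. A minimal sketch of that string construction follows; the `grep` filter and the file path are hypothetical values chosen for illustration, and the `fread` call itself is left out so the sketch runs without the file or `data.table`.

```r
# Hypothetical inputs: neither the filter nor the path comes from pbdIO.
shcom <- "grep -v '^#'"
file  <- "/tmp/read/a.csv"

# Per the shcom description, the command is assembled with paste(),
# which joins the two pieces with a single space.
cmd <- paste(shcom, file)
print(cmd)

# The resulting string is what would be handed on as fread(cmd = cmd).
```

Because `paste()` only appends the file name, `shcom` has to be a command that accepts the file as its last argument, as `grep <pattern>` and `cat` do.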

Details

Each MPI rank reads different but entire files. Load balance is best when the number of files is divisible by the number of ranks and the files are approximately the same size. All files are assumed to contain the same columns. If you are working with a Lustre parallel file system, see the note under the `shcom` argument.
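
To see why divisibility matters for load balance, here is a small sketch that deals whole files out to ranks. The helper `assign_files()` and its round-robin scheme are assumptions made for illustration; the scheme pbdIO actually uses is not described here and may differ.

```r
# Hypothetical helper: deal whole files out to ranks round-robin.
# This is an assumed scheme for illustration, not pbdIO's actual assignment.
assign_files <- function(files, n_ranks) {
  rank_of_file <- (seq_along(files) - 1L) %% n_ranks
  split(files, factor(rank_of_file, levels = seq_len(n_ranks) - 1L))
}

# Made-up file names.
files <- sprintf("part%02d.csv", 1:5)

# 5 files on 2 ranks: rank 0 gets 3 files, rank 1 gets 2 (uneven).
print(lengths(assign_files(files, 2)))

# 5 files on 5 ranks: one file per rank (even).
print(lengths(assign_files(files, 5)))
```

With equally sized files, the run time is set by the rank holding the most files, so the 2-rank case above takes as long as reading 3 files rather than 2.5.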

Value

TODO

Examples

## Not run: 
### Save code in a file "demo.r" and run with 2 processors by
### SHELL> mpiexec -np 2 Rscript demo.r
library(pbdMPI)
library(pbdIO)

path <- "/tmp/read"
comm.print(dir(path))
## [1] "a.csv" "b.csv"

X <- comm.fread(path)

comm.print(X, all.rank=TRUE)
## COMM.RANK = 0
##    a b c
## 1: 1 2 3
## COMM.RANK = 1
##    a b c
## 1: 2 3 4

finalize()

## End(Not run)
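
The example assumes that the directory already holds `a.csv` and `b.csv`. A minimal sketch for creating two such files is shown below, with contents chosen to match the printed output; it writes to a temporary directory rather than `/tmp/read`, which is an assumption made so the sketch runs anywhere.

```r
# Use a temporary directory here; the example above assumes "/tmp/read".
path <- file.path(tempdir(), "read")
dir.create(path, showWarnings = FALSE)

# One-row files whose values match the output printed in the example.
write.csv(data.frame(a = 1, b = 2, c = 3),
          file.path(path, "a.csv"), row.names = FALSE)
write.csv(data.frame(a = 2, b = 3, c = 4),
          file.path(path, "b.csv"), row.names = FALSE)

# List what was written.
print(dir(path))
```

For an MPI run, the directory would need to be visible to every rank, as the `dir` argument requires.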

RBigData/pbdIO documentation built on July 22, 2023, 2:25 p.m.