global reading | R Documentation |
These functions perform a global read from the specified file.
comm.read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".",
na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE,
blank.lines.skip = TRUE, comment.char = "#",
allowEscapes = FALSE,
flush = FALSE,
fileEncoding = "", encoding = "unknown",
read.method = .pbd_env$SPMD.IO$read.method[1],
balance.method = .pbd_env$SPMD.IO$balance.method[1],
comm = .pbd_env$SPMD.CT$comm)
comm.read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...,
read.method = .pbd_env$SPMD.IO$read.method[1],
balance.method = .pbd_env$SPMD.IO$balance.method[1],
comm = .pbd_env$SPMD.CT$comm)
comm.read.csv2(file, header = TRUE, sep = ";", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...,
read.method = .pbd_env$SPMD.IO$read.method[1],
balance.method = .pbd_env$SPMD.IO$balance.method[1],
comm = .pbd_env$SPMD.CT$comm)
file |
as in |
header |
as in |
sep |
as in |
quote |
as in |
dec |
as in |
na.strings |
as in |
colClasses |
as in |
nrows |
as in |
skip |
as in |
check.names |
as in |
fill |
as in |
strip.white |
as in |
blank.lines.skip |
as in |
comment.char |
as in |
allowEscapes |
as in |
flush |
as in |
fileEncoding |
as in |
encoding |
as in |
... |
as in |
read.method |
either "gbd" or "common". |
balance.method |
balance method for |
comm |
a communicator number. |
These functions apply read.table() locally and sequentially from rank 0, 1, 2, ...
By default, for small datasets (see .pbd_env$SPMD.IO$max.read.size), only rank 0 reads the file and then scatters the data to the other ranks when read.method = "gbd" (or broadcasts it to the others when read.method = "common").
As the dataset size increases, the reading is performed by each rank, and each rank reads a portion of the rows in "gbd" format, as described in the pbdDEMO vignettes and used in pmclust.
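The idea of each rank reading only its own portion of rows can be sketched in base R without MPI. This is a minimal illustration under stated assumptions, not pbdMPI's actual implementation: the ranks are simulated in a loop, and the names n.rank, counts, and skips are hypothetical. It relies only on the nrows and skip arguments of read.table().

```r
### Write a small tab-separated file to read back in pieces.
tmp <- tempfile(fileext = ".txt")
write.table(iris[, 1:4], file = tmp, sep = "\t", quote = FALSE,
            row.names = FALSE)

n.rank <- 3L                              # simulated communicator size
n.row <- length(readLines(tmp)) - 1L      # data rows, excluding the header
col.names <- names(read.table(tmp, header = TRUE, sep = "\t", nrows = 1))

### Rows per simulated rank; any remainder goes to the higher ranks.
counts <- rep(n.row %/% n.rank, n.rank)
extra <- n.row %% n.rank
if(extra > 0){
  id <- (n.rank - extra + 1L):n.rank
  counts[id] <- counts[id] + 1L
}

### Lines to skip for each rank; the leading 1 skips the header line.
skips <- 1L + c(0L, cumsum(counts)[-n.rank])

### Each simulated rank reads only its own rows.
parts <- lapply(seq_len(n.rank), function(i){
  read.table(tmp, header = FALSE, sep = "\t", skip = skips[i],
             nrows = counts[i], col.names = col.names)
})

### Stacking the pieces in rank order recovers the whole dataset.
da <- do.call(rbind, parts)
```

In a real SPMD run each rank would hold only its own piece, and the pieces would not be combined on a single process.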
comm.load.balance() is called for the "gbd" method when nrows = -1 and skip = 0 are set. Note that the default method "block" is generally the better choice for performance: it distributes rows equally and leaves the residuals evenly on the higher ranks. "block0" is the other way around. "block.cyclic" is only useful for converting to ddmatrix as in pbdDMAT.
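The difference between "block" and "block0" as described above can be shown with a small base R sketch. The helper block.counts() is hypothetical and written only for illustration; it is not part of pbdMPI and assumes that the residual rows are handed out one per rank.

```r
### Hypothetical helper: number of rows each rank would own.
block.counts <- function(n, size, method = c("block", "block0")){
  method <- match.arg(method)
  counts <- rep(n %/% size, size)         # equal share for every rank
  extra <- n %% size                      # residual rows to hand out
  if(extra > 0){
    ### "block": residuals on the higher ranks; "block0": on the lower ranks.
    id <- if(method == "block") (size - extra + 1):size else 1:extra
    counts[id] <- counts[id] + 1
  }
  counts
}

block.counts(7, 4, "block")    # 1 2 2 2
block.counts(7, 4, "block0")   # 2 2 2 1
```

With 7 rows on 4 ranks, "block" gives the extra rows to ranks 1 to 3, while "block0" gives them to ranks 0 to 2.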
A distributed data.frame is returned.
All factors are disabled: such columns are read as characters or as whatever type the data should be.
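Because factors are disabled, a categorical column comes back as character and has to be converted explicitly if a factor is needed. The sketch below uses base R read.table() with stringsAsFactors = FALSE as a stand-in, on the assumption that the distributed readers behave analogously; the variable species is hypothetical.

```r
### Write iris to a tab-separated file, then read it back without factors.
tmp <- tempfile(fileext = ".txt")
write.table(iris, file = tmp, sep = "\t", quote = FALSE, row.names = FALSE)
da <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                 stringsAsFactors = FALSE)

### The Species column is a character vector, not a factor.
class(da$Species)

### Convert explicitly when a factor is required.
species <- factor(da$Species)
levels(species)
```

In a distributed setting each rank may see only some of the levels, so a common set of levels would need to be agreed on across ranks before converting.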
Wei-Chen Chen wccsnow@gmail.com, George Ostrouchov, Drew Schmidt, Pragneshkumar Patel, and Hao Yu.
Programming with Big Data in R Website: https://pbdr.org/
comm.load.balance() and comm.write.table()
## Not run:
### Save code in a file "demo.r" and run with 2 processors by
### SHELL> mpiexec -np 2 Rscript demo.r
spmd.code <- "
### Initialize
suppressMessages(library(pbdMPI, quietly = TRUE))
### Check.
if(comm.size() != 2){
comm.stop(\"2 processors are required.\")
}
### Manually distributed iris.
da <- iris[get.jid(nrow(iris)),]
### Dump data.
comm.write.table(da, file = \"iris.txt\", quote = FALSE, sep = \"\\t\",
row.names = FALSE)
### Read back in.
da.gbd <- comm.read.table(\"iris.txt\", header = TRUE, sep = \"\\t\",
quote = \"\")
comm.print(c(nrow(da), nrow(da.gbd)), all.rank = TRUE)
### Read in common.
da.common <- comm.read.table(\"iris.txt\", header = TRUE, sep = \"\\t\",
quote = \"\", read.method = \"common\")
comm.print(c(nrow(da.common), sum(da.common != iris)))
### Finish.
finalize()
"
# execmpi(spmd.code, nranks = 2L)
## End(Not run)