ddfply: ddfply


Description

Performs chunk processing or split-apply-combine on the data in a distributed data frame (ddf).

Usage

ddfply(ddfdir, groupby, fun = identity, collect = "none",
  temploc = getwd(), nbins = 10, chunk = 50000, spill = 1e+06,
  cores = 1, buffer = 1e+09, ...)

Arguments

ddfdir

(string) path of ddf directory

groupby

(character vector) Names of the columns used to split the data (if missing, fun is applied to each chunk)

fun

(object of class function) function to apply on each subset after the split

collect

(string) How to collect the result: "list", "dataframe" or "none". "none" keeps the resulting ddo on disk.

temploc

(string) Path where intermediary files are kept

nbins

(positive integer) Number of directories into which the distributed dataframe (ddf) or distributed data object (ddo) is distributed

chunk

(positive integer) Number of rows of the file to be read at a time

spill

(positive integer) Maximum number of rows of any subset resulting from split

cores

(positive integer) Number of cores to be used in parallel

buffer

(positive integer) Size of the batches of key-value pairs that are passed to the map, flushed from the map output to intermediate storage, or sent to the reduce

...

Arguments to be passed to the data.table function as is.
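
As an illustration of what "fun is applied to each chunk" means when groupby is missing, here is a minimal in-memory sketch in base R. It is only a conceptual analogue: the small `chunk` value and the row-slicing below are assumptions made for the example, not the way ddfply reads a ddf from disk.

```r
# Conceptual sketch only: chunk-wise processing of an in-memory data frame.
# ddfply itself works on a ddf stored on disk; this just mimics the idea.
chunk <- 10                       # hypothetical chunk size, chosen for illustration
fun   <- function(df) nrow(df)    # any function of a data frame would do

# starting row of each chunk
starts <- seq(1, nrow(mtcars), by = chunk)

# cut the data into consecutive blocks of at most `chunk` rows
chunks <- lapply(starts, function(s) {
  mtcars[s:min(s + chunk - 1, nrow(mtcars)), ]
})

# apply fun to each chunk; the result is one element per chunk
res <- lapply(chunks, fun)
```

Because fun sees each chunk independently, results that depend on all rows of a group are only meaningful when groupby is supplied.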

Details

See fileply.
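
For readers unfamiliar with split-apply-combine, the idea behind groupby, fun and collect can be sketched in memory with base R. This is an assumed analogue for illustration, not the package's implementation; the summary function is hypothetical.

```r
# Conceptual sketch only: in-memory split-apply-combine with base R.
groupby <- c("gear")

# hypothetical function applied to each subset
fun <- function(df) data.frame(n = nrow(df), mean_mpg = mean(df$mpg))

# split: one data frame per distinct combination of the groupby columns
subsets <- split(mtcars, mtcars[groupby], drop = TRUE)

# apply: run fun on every subset (comparable to collecting as a list)
res_list <- lapply(subsets, fun)

# combine: bind the per-subset results (comparable to collecting as a dataframe)
res_df <- do.call(rbind, res_list)
```

Here `res_list` has one element per value of gear, and `res_df` stacks those elements into a single data frame.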

Value

A list, a dataframe, or TRUE (when collect is "none").

Examples

write.table(mtcars, "mtcars.csv", row.names = FALSE, sep = ",")
# create a ddf and keep it on disk by setting `keepddf = TRUE`
co <- capture.output(temp <- fileply("mtcars.csv"
                                     , groupby = c("carb", "gear")
                                     , fun     = identity
                                     , collect = "list"
                                     , sep     =  ","
                                     , header  = TRUE
                                     , keepddf = TRUE)
                     , file = NULL
                     , type = "message"
                     )
# use the ddf instead of reading the CSV again
# (the ddf path is parsed from the sixth line of the captured messages)
temp2 <- ddfply(file.path(strsplit(co[6], ": ")[[1]][2], "data")
                , groupby = c("gear")
                , fun     = identity
                , collect = "list"
                , sep     =  ","
                , header  = TRUE
                )
temp2
unlink("mtcars.csv")
unlink(strsplit(co[6], ": ")[[1]][2], recursive = TRUE)

fileplyr documentation built on May 2, 2019, 4:03 p.m.