ffdfdply: Performs a split-apply-combine on an ffdf

Description

Performs a split-apply-combine on an ffdf: splits the ffdf x according to split, applies FUN to the data, and stores the result of FUN in an ffdf.
Note that this function does not actually split the data by individual split level. To reduce the number of times data is loaded into RAM when there are many split levels, the function extracts groups of split elements that fit into RAM according to BATCHBYTES. Make sure your FUN handles the fact that several split elements can be in one chunk of data on which FUN is applied.
Note also that NAs in split are not treated as a split level, so FUN is not applied to them.
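
Because one chunk can hold several split elements, FUN usually has to group by the split column itself. The following is a minimal base-R sketch of such a function; the name mean_pw_by_species and the choice of summary are hypothetical and only illustrate the pattern. It uses a plain data.frame to simulate a chunk, so it runs without ff or ffbase.

```r
# Hypothetical FUN: mean Petal.Width per species.
# It does not assume that the chunk contains a single split element.
mean_pw_by_species <- function(chunk) {
  # Re-split the chunk by the key, dropping levels absent from this chunk
  pieces <- split(chunk, droplevels(chunk$Species))
  out <- lapply(pieces, function(g) {
    data.frame(Species = g$Species[1], mean_pw = mean(g$Petal.Width))
  })
  # Return one data.frame, as ffdfdply requires of FUN
  do.call(rbind, out)
}

# Simulate a chunk that holds two split elements at once
chunk <- iris[iris$Species %in% c("setosa", "versicolor"), ]
res <- mean_pw_by_species(chunk)
print(res)
```

A function written this way gives the same per-species rows whether it sees one split element or several, so it should be usable as FUN = mean_pw_by_species in a call like the ones in the Examples section.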

Usage

ffdfdply(
  x,
  split,
  FUN,
  BATCHBYTES = getOption("ffbatchbytes"),
  RECORDBYTES = sum(.rambytes[vmode(x)]),
  trace = TRUE,
  ...
)

Arguments

x

an ffdf

split

an ff vector which is part of the ffdf x

FUN

the function to apply to each split. This function must return a data.frame.

BATCHBYTES

integer scalar limiting the number of bytes to be processed in one chunk

RECORDBYTES

optional integer scalar representing the bytes needed to process one row of x

trace

logical indicating whether to show which split the function is currently working on

...

other parameters passed on to FUN
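
To get a feel for how BATCHBYTES and RECORDBYTES interact, the sketch below packs split elements greedily into batches. It is an illustration under stated assumptions, not the package's actual implementation: it assumes that roughly BATCHBYTES / RECORDBYTES rows fit in one batch, that a double costs 8 bytes and an integer-coded factor 4 bytes, and it uses a hypothetical helper assign_batches.

```r
# Assumed per-row cost for iris: four doubles plus one integer-coded factor
record_bytes <- 4 * 8 + 4
batch_bytes  <- 5000
max_rows     <- batch_bytes / record_bytes   # rows that fit in one batch

# Hypothetical helper: start a new batch whenever adding the next
# split element would exceed max_rows
assign_batches <- function(sizes, max_rows) {
  batch   <- integer(length(sizes))
  current <- 1L
  running <- 0
  for (i in seq_along(sizes)) {
    if (running > 0 && running + sizes[i] > max_rows) {
      current <- current + 1L
      running <- 0
    }
    running  <- running + sizes[i]
    batch[i] <- current
  }
  batch
}

sizes   <- c(setosa = 50, versicolor = 50, virginica = 50)
batches <- assign_batches(sizes, max_rows)
print(batches)
```

Under these assumptions two species share the first batch and the third gets its own, which is consistent with the "2 split elements" and "1 split elements" trace lines in the example output for the Species split.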

Value

an ffdf

See Also

grouprunningcumsum, table

Examples

library(ffbase)  # also loads ff and bit

data(iris)
ffiris <- as.ffdf(iris)

# Demo aggregator: receives a data.frame chunk that may hold several split elements
youraggregatorFUN <- function(x){
  # flag repeated (Species, Petal.Width) pairs, in the chunk's original row order
  dup <- duplicated(x[c("Species", "Petal.Width")])
  o <- order(x$Petal.Width)
  # note: the mask is applied by position to the reordered rows
  lowest_pw <- x[rev(o),][!dup,]
  highest_pw <- x[o,][!dup,]
  lowest_pw$group <- factor("lowest", levels=c("lowest", "highest"))
  highest_pw$group <- factor("highest", levels=c("lowest", "highest"))
  rbind(lowest_pw, highest_pw)
}

# Split on a factor column
result <- ffdfdply(x = ffiris, split = ffiris$Species,
                   FUN = function(x) youraggregatorFUN(x),
                   BATCHBYTES = 5000, trace = TRUE)
dim(result)
dim(iris)
result[1:10,]

# Split on an integer key, converted to character
ffiris$integerkey <- with(ffiris, as.integer(Sepal.Length))
result <- ffdfdply(x = ffiris, split = as.character(ffiris$integerkey),
                   FUN = function(x) youraggregatorFUN(x),
                   BATCHBYTES = 5000, trace = TRUE)

# Split on a date key stored as an integer ff vector
ffiris$datekey <- ff(as.Date(ffiris$Sepal.Length[], origin = "1970-01-01"),
                     vmode = "integer")
result <- ffdfdply(x = ffiris, split = as.character(ffiris$datekey),
                   FUN = function(x) youraggregatorFUN(x),
                   BATCHBYTES = 5000, trace = TRUE)

Example output

Loading required package: ff
Loading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit

Attaching package: 'bit'

The following object is masked from 'package:base':

    xor

Attaching package ff
- getOption("fftempdir")=="/work/tmp/tmp/RtmpmI9C0Z"

- getOption("ffextension")=="ff"

- getOption("ffdrop")==TRUE

- getOption("fffinonexit")==TRUE

- getOption("ffpagesize")==65536

- getOption("ffcaching")=="mmnoflush"  -- consider "ffeachflush" if your system stalls on large writes

- getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system

- getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system


Attaching package: 'ff'

The following objects are masked from 'package:bit':

    clone, clone.default, clone.list

The following objects are masked from 'package:utils':

    write.csv, write.csv2

The following objects are masked from 'package:base':

    is.factor, is.ordered


Attaching package: 'ffbase'

The following objects are masked from 'package:ff':

    [.ff, [.ffdf, [<-.ff, [<-.ffdf

The following objects are masked from 'package:base':

    %in%, table

2018-12-06 15:16:41, calculating split sizes
2018-12-06 15:16:41, building up split locations
2018-12-06 15:16:41, working on split 1/2, extracting data in RAM of 2 split elements, totalling, 0 GB, while max specified data specified using BATCHBYTES is 0 GB
2018-12-06 15:16:41, ... applying FUN to selected data
2018-12-06 15:16:41, ... appending result to the output ffdf
2018-12-06 15:16:41, working on split 2/2, extracting data in RAM of 1 split elements, totalling, 0 GB, while max specified data specified using BATCHBYTES is 0 GB
2018-12-06 15:16:41, ... applying FUN to selected data
2018-12-06 15:16:41, ... appending result to the output ffdf
[1] 54  6
[1] 150   5
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  group
1           5.9         3.2          4.8         1.8 versicolor lowest
2           6.7         3.1          4.7         1.5 versicolor lowest
3           5.4         3.0          4.5         1.5 versicolor lowest
4           6.2         2.2          4.5         1.5 versicolor lowest
5           6.2         2.9          4.3         1.3 versicolor lowest
6           5.0         2.3          3.3         1.0 versicolor lowest
7           5.0         3.5          1.6         0.6     setosa lowest
8           5.1         3.3          1.7         0.5     setosa lowest
9           5.4         3.4          1.5         0.4     setosa lowest
10          5.4         3.9          1.3         0.4     setosa lowest
2018-12-06 15:16:41, calculating split sizes
2018-12-06 15:16:41, building up split locations
2018-12-06 15:16:41, working on split 1/2, extracting data in RAM of 2 split elements, totalling, 0 GB, while max specified data specified using BATCHBYTES is 0 GB
2018-12-06 15:16:41, ... applying FUN to selected data
2018-12-06 15:16:41, ... appending result to the output ffdf
2018-12-06 15:16:41, working on split 2/2, extracting data in RAM of 2 split elements, totalling, 0 GB, while max specified data specified using BATCHBYTES is 0 GB
2018-12-06 15:16:41, ... applying FUN to selected data
2018-12-06 15:16:41, ... appending result to the output ffdf
2018-12-06 15:16:41, calculating split sizes
2018-12-06 15:16:41, building up split locations
2018-12-06 15:16:41, working on split 1/2, extracting data in RAM of 1 split elements, totalling, 0 GB, while max specified data specified using BATCHBYTES is 0 GB
2018-12-06 15:16:41, ... applying FUN to selected data
2018-12-06 15:16:41, ... appending result to the output ffdf
2018-12-06 15:16:41, working on split 2/2, extracting data in RAM of 3 split elements, totalling, 0 GB, while max specified data specified using BATCHBYTES is 0 GB
2018-12-06 15:16:41, ... applying FUN to selected data
2018-12-06 15:16:41, ... appending result to the output ffdf

ffbase documentation built on Feb. 27, 2021, 5:06 p.m.