divide: Divide a Distributed Data Object

Description

Divide a ddo/ddf object into subsets based on different criteria

Usage

divide(data, by = NULL, spill = 1000000, filterFn = NULL, bsvFn = NULL,
  output = NULL, overwrite = FALSE, preTransFn = NULL,
  postTransFn = NULL, params = NULL, packages = NULL, control = NULL,
  update = FALSE, verbose = TRUE)

Arguments

data

an object of class "ddf" or "ddo"; in the latter case, preTransFn must be specified to coerce each subset into a data frame

by

specification of how to divide the data - conditional (factor-level or shingles), random replicate, or near-exact replicate (to come) – see details

spill

integer specifying how many rows of data the division method should collect before spilling over into a new key-value pair

filterFn

a function applied to each candidate output key-value pair; the pair becomes part of the resulting division only if the function returns TRUE

bsvFn

a function to be applied to each subset that returns a list of between subset variables (BSVs)

output

a "kvConnection" object indicating where the output data should reside (see localDiskConn, hdfsConn). If NULL (default), output will be an in-memory "ddo" object.

overwrite

logical; should existing output location be overwritten? (also can specify overwrite = "backup" to move the existing output to _bak)

preTransFn

a transformation function (if desired) to apply to each subset prior to division. Note: this is deprecated; use addTransform prior to calling divide instead

postTransFn

a transformation function (if desired) to apply to each post-division subset

params

a named list of objects external to the input data that are needed in the distributed computing (these are usually detected automatically, so this rarely needs to be specified)

packages

a vector of R package names that contain functions used in fn (these are usually detected automatically, so this rarely needs to be specified)

control

parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl

update

logical; should a MapReduce job be run to obtain additional attributes for the result data prior to returning?

verbose

logical - print messages about what is being done
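
As a hedged illustration of the by and bsvFn arguments, the sketch below divides iris by Species using the explicit condDiv form and attaches one between-subset variable to each subset. It assumes the bsv() and getBsvs() helpers behave as described in the package's own help pages, and meanSLFn is a hypothetical name chosen for this example. The requireNamespace guard is only there so that the snippet does nothing when datadr is not installed.

```r
# Guarded so the snippet is a no-op when datadr is not installed
if (requireNamespace("datadr", quietly = TRUE)) {
  library(datadr)

  # bsvFn receives one subset (a data frame) and returns a named list of BSVs
  # (meanSLFn is a hypothetical name used only in this sketch)
  meanSLFn <- function(x) {
    list(meanSL = bsv(mean(x$Sepal.Length), desc = "mean sepal length"))
  }

  # by = condDiv("Species") is the explicit form of by = "Species"
  bySpecies <- divide(iris, by = condDiv("Species"), bsvFn = meanSLFn)

  # inspect the BSVs attached to the first subset
  # (assumes getBsvs() accepts a key-value pair, as in the bsv help page)
  getBsvs(bySpecies[[1]])
}
```

Because no output connection is given, the result stays in memory, matching the default described for the output argument.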

Details

The division methods this function will support include conditioning variable division for factors (implemented – see condDiv), conditioning variable division for numerical variables through shingles, random replicate (implemented – see rrDiv), and near-exact replicate. If by is a vector of variable names, the data will be divided by these variables. Alternatively, this can be specified by e.g. condDiv(c("var1", "var2")).
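
To make the two implemented division types concrete, here is a conceptual sketch in base R only. It is not datadr's implementation: cond_divide and rr_divide are hypothetical helper names, and the sketch ignores key-value pairs, spilling, and backends. It only shows what "one subset per combination of conditioning levels" and "random subsets of roughly n rows" mean for an ordinary data frame.

```r
# Conceptual base-R sketch; datadr's own implementation differs.

# Conditioning-variable division: one subset per combination of levels
cond_divide <- function(df, vars) {
  split(df, df[vars], drop = TRUE)
}

# Random-replicate division: roughly `nrows` rows per subset
rr_divide <- function(df, nrows, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n_sub <- max(1, round(nrow(df) / nrows))
  # shuffle balanced subset labels so each row lands in a random subset
  idx <- sample(rep_len(seq_len(n_sub), nrow(df)))
  split(df, idx)
}

by_species <- cond_divide(iris, "Species")
iris_rr <- rr_divide(iris, 30, seed = 1)

names(by_species)
sapply(iris_rr, nrow)
```

In this sketch the random-replicate subsets come out exactly balanced because the labels are recycled before shuffling; the "~30 rows per subset" wording on this page suggests datadr only targets an approximate size.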

Value

an object of class "ddf" if the resulting subsets are data frames. Otherwise, an object of class "ddo".

Author(s)

Ryan Hafen

See Also

recombine, ddo, ddf, condDiv, rrDiv

Examples

# divide iris data by Species by passing in a data frame
bySpecies <- divide(iris, by = "Species")
bySpecies

# divide iris data into random partitioning of ~30 rows per subset
irisRR <- divide(iris, by = rrDiv(30))
irisRR

# any ddf can be passed into divide:
irisRR2 <- divide(bySpecies, by = rrDiv(30))
irisRR2
bySpecies2 <- divide(irisRR2, by = "Species")
bySpecies2

# splitting on multiple columns
byEdSex <- divide(adult, by = c("education", "sex"))
byEdSex
byEdSex[[1]]

# splitting on a numeric variable
bySL <- ddf(iris) %>%
  addTransform(function(x) {
    x$slCut <- cut(x$Sepal.Length, 10)
    x
  }) %>%
  divide(by = "slCut")
bySL
bySL[[1]]

Example output

Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+----------------------------------------------------------------
 names          | Sepal.Length(num), Sepal.Width(num), Petal.Length(num), and 1 more
 nrow           | 150
 size (stored)  | 11.48 KB
 size (object)  | 11.48 KB
 # subsets      | 3

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary
* Conditioning variables: Species

* Input data is not 'ddf' - attempting to cast it as such
* Verifying parameters...
* Applying division...

Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+----------------------------------------------------------------
 names          | Sepal.Length(num), Sepal.Width(num), Petal.Length(num), and 2 more
 nrow           | 150
 size (stored)  | 13.02 KB
 size (object)  | 13.02 KB
 # subsets      | 5

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary
* Approx. number of rows in each division: 30

* Verifying parameters...
* Applying division...

Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+----------------------------------------------------------------
 names          | Sepal.Length(num), Sepal.Width(num), Petal.Length(num), and 2 more
 nrow           | 150
 size (stored)  | 13.02 KB
 size (object)  | 13.02 KB
 # subsets      | 5

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary
* Approx. number of rows in each division: 30

* Verifying parameters...
* Applying division...

Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+----------------------------------------------------------------
 names          | Sepal.Length(num), Sepal.Width(num), Petal.Length(num), and 1 more
 nrow           | 150
 size (stored)  | 10.71 KB
 size (object)  | 10.71 KB
 # subsets      | 3

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary
* Conditioning variables: Species


Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+----------------------------------------------------------------
 names          | age(int), workclass(cha), fnlwgt(int), and 11 more
 nrow           | 32561
 size (stored)  | 2.94 MB
 size (object)  | 2.94 MB
 # subsets      | 32

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary
* Conditioning variables: education, sex

$key
[1] "education=10th|sex=Female"

$value
  age workclass fnlwgt educationnum            marital    occupation
1  17         ? 304873            6      Never-married             ?
2  60         ?  24215            6           Divorced             ?
3  59 Local-gov 171328            6            Widowed Other-service
4  36   Private 348022            6 Married-civ-spouse Other-service
5  33   Private 228528            6      Never-married  Craft-repair
   relationship               race capgain caploss hoursperweek nativecountry
1     Own-child              White   34095       0           32 United-States
2 Not-in-family Amer-Indian-Eskimo       0       0           10 United-States
3     Unmarried              Black       0       0           30 United-States
4          Wife              White       0       0           24 United-States
5     Unmarried              White       0       0           35 United-States
  income incomebin
1  <=50K         0
2  <=50K         0
3  <=50K         0
4  <=50K         0
5  <=50K         0
...

*** finding global variables used in 'fn'... [none]
*** testing 'fn' on a subset... ok
* Verifying parameters...
* Applying division...

Distributed data frame backed by 'kvMemory' connection

 attribute      | value
----------------+----------------------------------------------------------------
 names          | Sepal.Length(num), Sepal.Width(num), Petal.Length(num), and 2 more
 nrow           | 150
 size (stored)  | 28.56 KB
 size (object)  | 28.56 KB
 # subsets      | 10

* Other attributes: getKeys()
* Missing attributes: splitSizeDistn, splitRowDistn, summary
* Conditioning variables: slCut

$key
[1] "slCut=(4.3,4.66]"

$value
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.6         3.1          1.5         0.2  setosa
2          4.6         3.4          1.4         0.3  setosa
3          4.4         2.9          1.4         0.2  setosa
4          4.3         3.0          1.1         0.1  setosa
5          4.6         3.6          1.0         0.2  setosa
...

datadr documentation built on May 1, 2019, 8:06 p.m.