sample: Do random sampling from an Xdf file


Description

Draw a random sample of rows from an Xdf file: a fixed number of rows with sample_n, or a fixed fraction of rows with sample_frac.

Usage

## S3 method for class 'RxXdfData'
sample_n(tbl, size = 1, replace = FALSE,
  weight = NULL, .env = NULL)

## S3 method for class 'RxXdfData'
sample_frac(tbl, size = 1, replace = FALSE,
  weight = NULL, .env = NULL)

## S3 method for class 'grouped_tbl_xdf'
sample_n(tbl, size = 1, replace = FALSE,
  weight = NULL, .env = NULL)

## S3 method for class 'grouped_tbl_xdf'
sample_frac(tbl, size = 1, replace = FALSE,
  weight = NULL, .env = NULL)

Arguments

tbl

An Xdf file or a tbl wrapping the same.

size

For sample_n, the number of rows to select. For sample_frac, the fraction of rows to select. For a grouped dataset, size applies to each group.

replace, weight, .env

Not used.
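
As an illustration of how size is interpreted for grouped data, the following base R sketch draws rows group by group from an ordinary data frame. It is a data-frame analogue only and is not how dplyrXdf works internally; the helpers sample_n_by_group and sample_frac_by_group are hypothetical names introduced for this example, and rounding the per-group row count is an assumption.

```r
# Data-frame analogue only: shows the per-group meaning of `size`,
# not the mechanism dplyrXdf uses on Xdf files.
set.seed(42)

# Hypothetical helper: draw `size` rows from each group, without replacement.
sample_n_by_group <- function(df, group, size) {
  pieces <- lapply(split(df, df[[group]]), function(g) {
    g[sample.int(nrow(g), size), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Hypothetical helper: draw a fraction of each group's rows.
# Rounding the per-group count is an assumption made for this sketch.
sample_frac_by_group <- function(df, group, frac) {
  pieces <- lapply(split(df, df[[group]]), function(g) {
    g[sample.int(nrow(g), round(nrow(g) * frac)), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

by_n <- sample_n_by_group(mtcars, "vs", 5)
table(by_n$vs)      # 5 rows from each of the two vs groups

by_frac <- sample_frac_by_group(mtcars, "vs", 0.5)
table(by_frac$vs)   # half of each group's rows
```

The total number of rows returned therefore depends on the number of groups as well as on size.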

Details

Sampling from Xdf files is slightly more limited than sampling from data frames. Only unweighted sampling without replacement is supported, and attempts to specify otherwise will result in a warning. Unlike the other single-table dplyr verbs, sample_n and sample_frac do not delete tbl inputs, because a sample is unlikely to be intended as a complete replacement for the input data.

Currently, sampling on HDFS data works in the local compute context (on the edge node) but not in the Hadoop or Spark compute contexts.

Value

An Xdf tbl.

See Also

sample_frac and sample_n in package dplyr, sample

Examples

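library(dplyr)      # group_by, filter, %>%
library(dplyrXdf)   # as_xdf and the Xdf methods for sample_n / sample_frac

# create an Xdf file from the built-in mtcars data frame
mtx <- as_xdf(mtcars, overwrite=TRUE)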

tbl <- sample_n(mtx, 10)
nrow(tbl)

tbl2 <- sample_frac(mtx, 0.5)
nrow(tbl2)

tbl3 <- group_by(mtx, vs) %>% sample_frac(0.5)
nrow(tbl3)

# to get an _approximate_ sample, use filter()
# each row is kept independently with probability 0.4, so the row count varies
tbl4 <- filter(mtx, runif(.rxNumRows) < 0.4)  # keep roughly 40% of rows in the data
nrow(tbl4)
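
The difference between an exact sample and the approximate filter() approach can be seen on an ordinary data frame. The sketch below uses base R only and does not touch Xdf files; it simply contrasts a fixed-size draw with independent per-row selection.

```r
set.seed(1)
n <- nrow(mtcars)

# Exact sample: always round(0.4 * n) rows, here 13 of 32.
exact <- mtcars[sample.int(n, round(0.4 * n)), ]
nrow(exact)

# Approximate sample: each row is kept independently with probability 0.4,
# so the number of rows kept varies from run to run.
counts <- replicate(1000, sum(runif(n) < 0.4))
range(counts)
mean(counts)   # close to 0.4 * n on average
```

The approximate approach needs no knowledge of the total row count in advance, at the cost of a sample size that is only correct on average.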

RevolutionAnalytics/dplyrXdf documentation built on June 3, 2019, 9:08 p.m.