stratified: Take a Stratified Sample From a Dataset

Description

The stratified function samples from a data.table in which one or more columns can be used as a "stratification" or "grouping" variable. The result is a new data.table with the specified number of samples from each group.

Usage

stratified(indt, group, size, select = NULL, replace = FALSE,
  keep.rownames = FALSE, bothSets = FALSE, ...)

Arguments

indt

The input data.table. A data.frame is also accepted and is converted with as.data.table (see keep.rownames).

group

The column or columns used to create the groups. Can be a character vector of column names (recommended) or a numeric vector of column positions. If you use more than one variable to create your "strata", list them in order from slowest varying to quickest varying.

size

The desired sample size.

  • If size is a value between 0 and 1 expressed as a decimal, size is set to be proportional to the number of observations per group.

  • If size is a single positive integer, it will be assumed that you want the same number of samples from each group.

  • If size is a named vector, the function will check to see whether the length of the vector matches the number of groups and that the names match the group names.
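The three forms of size can be illustrated with a small base R sketch. The helper resolve_size() and the group counts are hypothetical, introduced only for illustration; this is not the package's implementation, and using round() for the proportional case is an assumption that is consistent with the Note section.

```r
# Hypothetical helper: NOT the package's code, only an illustration of the
# three documented forms of `size`. `counts` is a named vector of group sizes.
resolve_size <- function(counts, size) {
  groups <- names(counts)
  if (length(size) == 1 && is.null(names(size))) {
    if (size > 0 && size < 1) {
      # Proportion: scale each group's count, then round (assumed rounding)
      return(round(counts * size))
    }
    # Single positive integer: the same number of samples from every group
    return(setNames(rep(size, length(counts)), groups))
  }
  # Named vector: the length and the names must both match the groups
  if (length(size) != length(counts)) {
    stop("length of 'size' does not match the number of groups")
  }
  if (is.null(names(size)) || !setequal(names(size), groups)) {
    stop("names of 'size' do not match the group names")
  }
  size[groups]
}

counts <- c(CA = 40, NY = 30, TX = 30)  # made-up group counts

resolve_size(counts, 0.1)                        # proportional to group size
resolve_size(counts, 5)                          # same size for every group
resolve_size(counts, c(TX = 2, CA = 5, NY = 3))  # per-group sizes, any order
```

In this sketch an unnamed vector such as c(5, 3, 2) is rejected because its names cannot be matched to the groups, which mirrors the "will produce errors" calls in the Examples section.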

select

A named list containing levels from the "group" variables in which you are interested. The list names must be present as variable names for the input dataset.

replace

Logical. Should sampling be with replacement? Defaults to FALSE.

keep.rownames

Logical. If the input is a data.frame with rownames, as.data.table would normally drop them. If TRUE, the rownames are retained in a column named rn. Defaults to FALSE.

bothSets

Logical. Should both the sampled and non-sampled sets be returned as a list? Defaults to FALSE.

...

Optional arguments to base::sample().

Value

If bothSets = TRUE, a list of two data.tables; otherwise, a data.table.
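A common use of bothSets = TRUE is splitting data into a sample and a holdout set. The sketch below assumes the splitstackshape package is installed and uses a made-up data.frame; it inspects the returned list by row count and makes no assumption about the names or order of its elements.

```r
if (requireNamespace("splitstackshape", quietly = TRUE)) {
  library(splitstackshape)

  set.seed(1)
  # Made-up data: 100 rows spread over three groups in column D
  DF <- data.frame(ID = 1:100,
                   D = rep(c("CA", "NY", "TX"), length.out = 100))

  # Ask for 5 rows per group and keep the rows that were not drawn
  both <- stratified(DF, "D", size = 5, bothSets = TRUE)

  # One element holds the sampled rows, the other the non-sampled rows
  sapply(both, nrow)

  # Without replacement, the two sets together account for every input row
  sum(sapply(both, nrow)) == nrow(DF)
}
```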

Note

Slightly different sizes than requested: Because of how computers handle floating-point arithmetic, and because R rounds halves to the nearest even number ("round to even"), the size per stratum for a proportionate sample may be one higher or lower than you expected.
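The effect described in this note can be reproduced in base R. The sketch below assumes the per-group size is obtained by rounding the group count times the proportion; the stratum names and counts are made up for illustration.

```r
# Halves are rounded to the nearest even integer
round(c(0.5, 1.5, 2.5, 3.5))
# → 0 2 2 4

# Made-up strata sizes, sampled at 10%
n_per_group <- c(small = 5, medium = 15, large = 25)
requested <- n_per_group * 0.1   # 0.5, 1.5, and 2.5 rows
drawn <- round(requested)
drawn
# → small 0, medium 2, large 2
```

Under this assumption the stratum with 5 rows contributes no rows at all, while the strata with 15 and 25 rows contribute the same number, so it is worth checking the per-group counts of the result when strata are small.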

Author(s)

Ananda Mahto

See Also

sampling::strata() from the "sampling" package; dplyr::sample_n() and dplyr::sample_frac() from "dplyr".

Examples

# Generate a sample data.frame to play with
set.seed(1)
DF <- data.frame(
  ID = 1:100,
  A = sample(c("AA", "BB", "CC", "DD", "EE"), 100, replace = TRUE),
  B = rnorm(100), C = abs(round(rnorm(100), digits=1)),
  D = sample(c("CA", "NY", "TX"), 100, replace = TRUE),
  E = sample(c("M", "F"), 100, replace = TRUE))

# Take a 10% sample from all -A- groups in DF
stratified(DF, "A", .1)

# Take a 10% sample from only "AA" and "BB" groups from -A- in DF
stratified(DF, "A", .1, select = list(A = c("AA", "BB")))

# Take 5 samples from all -D- groups in DF, specified by column number
stratified(DF, group = 5, size = 5)

# Use a two-column strata: -E- and -D-
stratified(DF, c("E", "D"), size = .15)

# Use a two-column strata (-E- and -D-) but only use cases where -E- == "M"
stratified(DF, c("E", "D"), .15, select = list(E = "M"))

# As above, but where -E- == "M" and -D- == "CA" or "TX"
stratified(DF, c("E", "D"), .15, select = list(E = "M", D = c("CA", "TX")))

# Use a three-column strata: -E-, -D-, and -A-
stratified(DF, c("E", "D", "A"), size = 2)

## Not run: 
# The following will produce errors
stratified(DF, "D", c(5, 3))
stratified(DF, "D", c(5, 3, 2))

## End(Not run)

# Sizes using a named vector
stratified(DF, "D", c(CA = 5, NY = 3, TX = 2))

# Works with multiple groups as well
stratified(DF, c("D", "E"), 
           c("NY F" = 2, "NY M" = 3, "TX F" = 1, "TX M" = 1,
             "CA F" = 5, "CA M" = 1))

mrdwab/splitstackshape documentation built on May 23, 2019, 7:16 a.m.