select: 'emil' and 'dplyr' integration
In emil: Evaluation of Modeling without Information Leakage


Modeling results can be converted to tabular format and manipulated using dplyr and other Hadleyverse packages. This is accomplished by a class-specific select_ method whose syntax differs somewhat from the default select_.

## S3 method for class 'list'
select_(.data, ..., .dots)

.data

Modeling results, as returned by evaluate.

...

Not used, kept for consistency with dplyr.

.dots

Indices to select on each level of .data, i.e. the first index specifies which top-level elements of .data to select, the second specifies which second-level elements, and so on. The last index must select elements that can be converted to a data frame. If the desired bottom-level element is related to the observations of a modeling task, e.g. the predictions of a test set, you must supply the resampling scheme used to produce .data at the appropriate level (see the examples).

The names of the ... arguments specify the column names of the resulting data frame. Unnamed arguments are used to traverse the data but are not returned as columns.

In summary, the ... indices can take the following forms:

Simple indices: Anything that can be used to subset objects, e.g. integers, logicals, or characters.
Functions: A function that produces a data frame, vector or factor.
Resampling schemes: The same resampling scheme that was used to produce the modeling results.
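
The level-by-level traversal described above can be pictured with a small base R sketch. It does not call emil: the nested list `toy`, its element names, and its numbers are made up to mimic results organized as fold, then method, then performance measure, and the code only imitates what `select(fold = TRUE, "nsc", error = "error")` is described as doing.

```r
# Toy stand-in for modeling results (hypothetical structure and values):
# folds at the top level, methods on the second level, measures below.
toy <- list(
    fold1 = list(nsc = list(error = 0.10), rf = list(error = 0.05)),
    fold2 = list(nsc = list(error = 0.20), rf = list(error = 0.15))
)

# Level 1: TRUE keeps every fold; a named index would become a column.
folds <- toy[TRUE]

# Level 2: "nsc" picks one method. An unnamed index only steers the traversal.
# Level 3: "error" picks the element that ends up as a column.
errors <- vapply(folds, function(f) f[["nsc"]][["error"]], numeric(1))

# Assemble a long-format data frame with one row per fold.
long <- data.frame(fold = names(folds), error = unname(errors))
print(long)
```

Replacing `"nsc"` with a named `TRUE` index would, by the same logic, keep both methods and add a column identifying them.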

A data.frame in long format.
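
For observation-level elements such as predictions, each fold only covers its own test set, so the resampling scheme is what ties each value back to an observation. The base R sketch below shows the idea under a simplifying assumption: the scheme is represented as one logical test-set indicator per fold, which is not necessarily how emil stores it. The names `obs`, `scheme`, and `prob` and all values are hypothetical.

```r
# Hypothetical observation names, in the style of the examples.
obs <- c("orchid001", "orchid002", "orchid003", "orchid004")

# Assumed representation of a resampling scheme:
# TRUE marks the test-set observations of each fold.
scheme <- list(fold1 = c(TRUE, FALSE, TRUE, FALSE),
               fold2 = c(FALSE, TRUE, FALSE, TRUE))

# Made-up class probabilities, one value per test observation in each fold.
prob <- list(fold1 = c(0.9, 0.2), fold2 = c(0.4, 0.7))

# Use the scheme to label each prediction with its observation,
# then stack the folds into one long-format data frame.
pieces <- lapply(names(scheme), function(f) {
    data.frame(fold = f, id = obs[scheme[[f]]], probability = prob[[f]])
})
long_prob <- do.call(rbind, pieces)
print(long_prob)
```

Without the scheme, the two probabilities in each fold could not be matched to rows of the original data, which is why it has to be passed at the fold level.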

Christofer Bäcklin

subtree

# Load the packages used below; %>% and select() come from dplyr
library(emil)
library(dplyr)

# Produce some results
x <- iris[-5]
y <- iris$Species
names(y) <- sprintf("orchid%03i", seq_along(y))
cv <- resample("crossvalidation", y, nfold=3, nrepeat=2)
procedures <- list(nsc = modeling_procedure("pamr"),
                   rf = modeling_procedure("randomForest"))
result <- evaluate(procedures, x, y, resample=cv)

# Get the foldwise error for the NSC method
result %>% select(fold = TRUE, "nsc", error = "error")

# Compare both methods, first side by side (one column per method) ...
library(tidyr)
result %>%
    select(fold = TRUE, method = TRUE, error = "error") %>%
    spread(method, error)
# ... then as the mean error per method
library(dplyr)
result %>%
    select(fold = TRUE, method = TRUE, error = "error") %>%
    group_by(method) %>% summarize(mean_error = mean(error))

# Investigate the variability in estimated class 2 probability across folds
result %>%
    select(fold = cv, "nsc", "prediction", probability = function(x) x$probability[,2]) %>%
    spread(fold, probability)