In martin-ueding/paramvalf: Parameter Value Analysis Framework

knitr::opts_chunk$set(echo = TRUE)

library(dplyr)
library(ggplot2)
library(paramvalf)

Motivation

During my course of studies I have often faced the following situation: I would analyse a data set and make some choices about the methedology along the way. My advisor would then ask me to chance some facet within the analysis code. The chance usually was easy to do, I re-run the whole analysis again and showed the data. Then I was asked to compare both variants, for instance in a combined plot.

The trouble is that I can do both variants individually, but not both at the same time. There are options which make this possible, but do not scale:

One could duplicate the whole analysis and just change the facet. One would have to rename all the variables but then one has all the data in one R sessions.
Alternatively one could run the analysis once, store everything to an analysis-1.Rdata file, change the facet, re-run and store everything to analysis-2.Rdata. Then load those files and perform the comparison.

These ways turn bad rather quickly as each bifurcation introduced increases the number of combinatoric combinations possible.

This analysis framework is an answer to this problem. Every step of the analysis is performed with the possibility of adding more parameters in mind.

Data structure

The data structure of the framework are “parameter value containers” which are a simple list with the fields param and value. Both are a data.frame and the number of rows must match. The value may be omitted if there are only parameters. The names of the colums should be unique throughout the whole analysis.

Example

The following will be a contrieved example to show how the framework works. The statistical content of this example should not be taken seriously.

Let's use the mtcars data set that is shipped with R. It contains the “Motor Trend Car Road Tests” and is a data frame with various data points of 32 cars. First we want to see whether there is a correlation between displacement (engine volume) and power in four cylinder cars. The number 4 is our first parameter.

To inject these parameters, we need to create a “parameter value container” which only contains parameters.

cylinders <- list(param = data.frame(cyl = c(4)))

Now we can do a correlation. In order to work with the container, we need to write a special function and then use the function pv_call.

do_correlation <- function(param, value) {
    # We only want to take the rows from the `mtcars` data set which correspond
    # to a specific number of cylinders as specified in the parameters.
    data <- mtcars %>%
        filter(cyl == param$cyl)

    list(corr_disp_hp = cor(data$disp, data$hp),
         corr_disp_wt = cor(data$disp, data$wt))
}

The functions in this framework always have the same structure:

Their parameters are always param and value, no matter how many variables go into this computation. This means that dependencies are not explicit here.
The return value is always a list. The names of the variables are those which will be available in value in later parts of the analysis. These need to be unique in the whole analysis.

In order to call this functions, we use the framework function.

correlation <- pv_call(do_correlation, cylinders, serial = TRUE)

This object now contains the correlation for each set of the parameters.

print(correlation)

Since we only have one parameter so far, there is only one row in it.

Now we perhaps wonder what the correlation is if we used the Spearman and not the Person (default) correlation definition. We can easily create another parameter for this.

correlation_method <- list(param = data.frame(correlation_method = c('pearson', 'spearman'),
                                              stringsAsFactors = FALSE))

do_correlation <- function(param, value) {
    # We only want to take the rows from the `mtcars` data set which correspond
    # to a specific number of cylinders as specified in the parameters.
    data <- mtcars %>%
        filter(cyl == param$cyl)

    list(corr_disp_hp = cor(data$disp, data$hp, method = param$correlation_method),
         corr_disp_wt = cor(data$disp, data$wt, method = param$correlation_method))
}

correlation <- pv_call(do_correlation, cylinders, correlation_method, serial = TRUE)

Now we have a more interesting result.

print(correlation)

This might be the point where one becomes interested in the other cylinder counts. So we redefine the cylinders container such that it contains them.

cylinders <- list(param = data.frame(cyl = c(4, 6, 8)))

We can just re-run the functions because they were built with this extension in mind already.

correlation <- pv_call(do_correlation, cylinders, correlation_method, serial = TRUE)
print(correlation)

The result has 6 rows already because we have the outer product of the number of cylinders and the correlation method. This is the long data format and it makes adding more parameters straightforward.

Our values so far are just atomics, but correlation$value is not a simple data frame any more. This is the problem of this approach, because it is rather a list of lists which happen to have the same names. Therefore one needs to be a but more careful. In order to access the elements, it might be needed to use value[, 'COL'] instead of just value$COL. This is unfortunate, but so far I have not learned a better way.

We want to build a summary now. This is a single large data frame which contains all the parameters and new summarizing columns. The number of rows can be arbitrary for each parameter set. Basically it will take the summary data frame for each parameter set, bind the parameters to each row and then bind all the resulting data frames to one long one.

Here we extract the two correlations that we have computed and put them into one column corr. A second column type keeps track of the variable that we the correlation is formed with. For each parameter set we will have a data frame with two rows.

do_corr_summary <- function(param, value) {
    summary <- data.frame(corr = c(value$corr_disp_hp,
                                   value$corr_disp_wt),
                          variable = rep(c('HP', 'Weight'), each = length(value$corr_disp_hp)))

    list(summary = summary)
}

corr_summary <- pv_call(do_corr_summary, correlation, serial = TRUE)

The resulting data frame has two rows per parameter set. This is flexible, so you can have arbitrary many rows for each parameter set.

print(corr_summary)

This format lends itself for plotting with ggplot2. We can place parameters and variables on various axes.

ggplot(corr_summary,
       aes(x = cyl,
           y = corr,
           color = correlation_method,
           shape = variable)) +
    geom_point(position = position_dodge(width = 0.4))

Summary

We have covered the pv_call function which takes a function and one or more “parameter value container” objects and calls the function on each combination of the parameters.

Outlook

At some point it might be needed to convert a former parameter to a value column. This will reduce the number of rows but makes each more complex. For this the pv_call_group function takes a character vector with the parameter names and does exactly this.

martin-ueding/paramvalf documentation built on Sept. 4, 2020, 1:27 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com