knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 5 )
library(WoodSimulatR) library(magrittr) library(ggplot2) pander::panderOptions('knitr.auto.asis', FALSE);
The WoodSimulatR
package provides functions for generating artificial datasets of
sawn wood properties obtained by destructive and non-destructive testing.
An existing dataset containing some of these properties can be enriched by adding simulated values for the missing properties.
This document aims to provide an overview of the capabilities
of the WoodSimulatR
package.
On the one hand, we will simulate a dataset with varying parameters,
highlighting both the capabilities of the WoodSimulatR
functions and the
direction where it should go.
On the other hand, we will illustrate the capabilities of WoodSimulatR
with
respect to simulating grade determining properties for a dataset with different
pre-existing variables.
As a quick summary of each dataset, we will show mean and CoV for all variables, split by country and subsample, and we will show the matrix of correlations.
For this, we define the following function:
summ_fun <- function(ds, grp = c('country', 'subsample', 'loadtype')) { grp <- intersect(grp, names(ds)); v <- setdiff(names(ds), grp); r <- cor(ds[v]); ds <- tibble::add_column(ds, n = 1); v <- c('n', v); ds <- tidyr::gather(ds, 'property', 'value', !!! rlang::syms(v)); ds <- dplyr::mutate( ds, property = factor( property, levels=v, labels=ifelse(v=='n', v, paste0(v, '_mean')), ordered = TRUE ) ); grp <- c(grp, 'property'); ds <- dplyr::group_by(ds, !!! rlang::syms(grp)); summ <- dplyr::summarise( ds, res = if (property[1] == 'n') sprintf('%.0f', sum(value)) else sprintf( if(property[1] %in% c('f_mean', 'ip_f_mean')) '%.1f (%.0f)' else '%.0f (%.0f)', mean(value), 100*sd(value)/mean(value)), .groups = 'drop_last' ); pander::pander( tidyr::spread(summ, property, res), split.tables = Inf ); pander::pander(r) invisible(summ); } compare_with_def <- function(ds, ssd, target = c('mean', 'cov')) { target <- match.arg(target); ds <- dplyr::group_by(ds, country); summ <- dplyr::summarise( ds, f_mean.ach = mean(f), f_cov.ach = sd(f) / f_mean.ach, E_mean.ach = mean(E), E_cov.ach = sd(E) / E_mean.ach, rho_mean.ach = mean(rho), rho_cov.ach = sd(rho) / rho_mean.ach, .groups = 'drop_last' ); stopifnot(!anyDuplicated(ssd$country)); summ <- dplyr::left_join( summ, dplyr::select( dplyr::mutate(ssd, f_cov = f_sd / f_mean, E_cov = E_sd / E_mean, rho_cov = rho_sd / rho_mean), country, f_mean, f_cov, E_mean, E_cov, rho_mean, rho_cov ), by = 'country' ); summ <- tidyr::pivot_longer( summ, -country, names_to = c('gdpname', '.value'), names_sep = '_' ); summ <- dplyr::mutate( summ, gdpname = factor(gdpname, levels = c('f', 'E', 'rho'), ordered = TRUE) ); if (target == 'mean') { ggplot(data = summ, aes(mean.ach, mean)) + geom_abline(slope = 1, intercept = 0) + geom_text(aes(label = country)) + geom_point(alpha = 0.5) + facet_wrap(vars(gdpname), scales = 'free') + theme(axis.text.x = element_text(angle = 90)); } else { ggplot(data = summ, aes(cov.ach, cov)) + geom_abline(slope = 1, intercept = 0) + geom_text(aes(label = country)) + geom_point(alpha = 0.5) + facet_wrap(vars(gdpname), scales = 'free') + theme(axis.text.x = element_text(angle = 90)); } }
The main function for dataset simulation is simulate_dataset()
.
It can be called without any further arguments to yield a "default" dataset.
For reproducibility, we will call it with the extra argument
random_seed = 12345
. This means that we will always generate the same
random numbers.
dataset_0 <- simulate_dataset(random_seed = 2345); summ_fun(dataset_0);
The meaning of the properties in dataset_0
is as follows:
simulate_dataset()
returns
generic country names "C1", "C2" etc., and sets the subsample names equal to
the country names.
Further options for countries and subsamples are explained below.All properties except E_dyn_u are to be taken as measured on the dry timber and corrected to a moisture content of 12%.
The default dataset created above relies on the following assumptions:
All of these assumptions can be modified more or less freely.
For convenience, the WoodSimulatR
package contains tables with means and
standard deviations for f, E and rho for different countries,
obtained in the research projects SiOSiP and
GradeWood [@Ranta_Maunus_et_al_2011_] or reported in scientific papers
[@Stapel_et_al_2014_; @Rohanova_2014_].
Data from both destructive tension and bending tests are available.
Currently, the data is restricted to European spruce (Picea abies, PCAB).
get_subsample_definitions(loadtype = 't') %>% dplyr::select(-species, -loadtype) %>% dplyr::arrange(country) %>% pander::pander(split.table = Inf);
get_subsample_definitions(loadtype = 'be') %>% dplyr::select(-species, -loadtype) %>% dplyr::arrange(country) %>% pander::pander(split.table = Inf);
ssd_c <- get_subsample_definitions( country = c('at', 'de', 'fi', 'pl', 'se', 'si', 'sk'), loadtype = 't' ); dataset_c <- simulate_dataset( random_seed = 12345, n = 5000, subsets = ssd_c ); summ_fun(dataset_c);
Compare achieved means with the defined values. It can be seen that the means of $f$ are met exactly, while the means of $E$ and $rho$ are only met approximately, which is the desideratum when we are dealing with simulation.
compare_with_def(dataset_c, ssd_c, 'm')
Compare achieved coefficients of variation with the defined values. Again, we have undesirable exact values for $f$.
compare_with_def(dataset_c, ssd_c, 'cov')
ssd_cn <- get_subsample_definitions( country = c(at = 1, de = 3, fi = 1.5, pl = 2, se = 3, si = 1, sk = 1), loadtype = 't' ); dataset_cn <- simulate_dataset( random_seed = 12345, n = 5000, subsets = ssd_cn ); summ_fun(dataset_cn);
In a similar manner to the predefined country specifications, we can also define
our own.
Since Version 0.6.0, we can also used different sample identifier columns
(instead of the standard "country" and "subsample")
-- for details, check the help on simulate_dataset()
.
As an example, we define different properties for boards with different cross sections (width and thickness, given in mm).
ssd_custom <- tibble::tribble( ~width, ~thickness, ~f_mean, ~f_sd, 80, 40, 27.5, 9.0, 140, 40, 29.4, 9.7, 160, 60, 31.6, 9.3, 200, 50, 30.2, 11.4, 240, 95, 25.5, 4.8, 250, 40, 25.3, 11.2 ); dataset_custom <- simulate_dataset( random_seed = 12345, n = 5000, subsets = ssd_custom ); summ_fun(dataset_custom, grp = c('width', 'thickness', 'loadtype'));
simbase_covar()
For adding simulated values to a dataset, we first need to establish the relationship between these values and some variables in the dataset.
In WoodSimulatR
, relationships are established in the following way:
simbase_covar()
;
the resulting simbase has class "simbase_covar".simbase_covar()
;
the resulting simbase has class "simbase_list".For both these options, it is possible to transform the involved variables.
To visualise the result of the simulation, we use scatterplots and define them in the following function:
plot_sim_gdp <- function(ds, simb, simulated_vars, ...) { extra_aes <- rlang::enexprs(...); ds <- dplyr::rename(ds, f_ref = f, E_ref = E, rho_ref = rho); if (!any(simulated_vars %in% names(ds))) ds <- simulate_conditionally(data = ds, simbase = simb); ds <- tidyr::pivot_longer( data = ds, cols = tidyselect::any_of(c('f_ref', 'E_ref', 'rho_ref', simulated_vars)), names_to = c('name', '.value'), names_sep = '_' ); ds <- dplyr::mutate( ds, name = factor(name, levels = c('f', 'E', 'rho'), ordered = TRUE) ); simname <- names(ds); simname <- simname[dplyr::cumany(simname == 'name')]; simname <- setdiff(simname, c('name', 'ref')); stopifnot(length(simname) == 1); ggplot(data = ds, mapping = aes(.data[[simname]], ref, !!!extra_aes)) + geom_point(alpha = .2, shape = 20) + geom_abline(slope = 1, intercept = 0, alpha = .5, linetype = 'twodash') + facet_wrap(vars(name), scales = 'free') + theme(axis.text.x = element_text(angle = 90)); } # undebug(plot_sim_gdp)
simbase_covar
without transformationThe main approach in WoodSimulatR
is to conditionally simulate
based on the means and the covariance matrix.
As a start, we create basis data
for the simulation without applying any transformation.
As we later want to add simulated GDP values to a dataset which already
contains GDP values, we rename the GDP values for the simbase_covar
to some
other names not yet present in the target dataset, by suffixing with _siml
(for SIMulation with Linear relationships)
sb_untransf <- dataset_0 %>% dplyr::rename(f_siml = f, E_siml = E, rho_siml = rho) %>% simbase_covar( variables = c('f_siml', 'E_siml', 'rho_siml', 'ip_f', 'E_dyn', 'ip_rho') ); sb_untransf;
Adding the simulated GDP values to a dataset is done by calling
simulate_conditionally()
.
dataset_c_sim <- simulate_conditionally(dataset_c, sb_untransf); names(dataset_c_sim) %>% pander::pander();
For a visual comparison:
plot_sim_gdp(dataset_c_sim, sb_untransf, c('f_siml', 'E_siml', 'rho_siml'));
This looks good for $E$ and $\rho$, but wrong in the $f$ simulation.
simbase_covar
with log-transformed $f$We might try using transforms
to improve the result. For this, we have to pass
a list with named entries corresponding to the GDP we want to transform.
The entry itself must be an object of class "trans"
(from the package scales
).
As we want to use a log-transform, the required entry is scales::log_trans()
.
sb_transf <- dataset_0 %>% dplyr::rename(f_simt = f, E_simt = E, rho_simt = rho) %>% simbase_covar( variables = c('f_simt', 'E_simt', 'rho_simt', 'ip_f', 'E_dyn', 'ip_rho'), transforms = list(f_simt = scales::log_trans()) ); dataset_c_sim <- simulate_conditionally(dataset_c_sim, sb_transf); plot_sim_gdp(dataset_c_sim, sb_transf, c('f_simt', 'E_simt', 'rho_simt'));
Now, this looks much better (which is no surprise, as dataset_c
itself has
been simulated with lognormal $f$).
simbase_covar
with log-transformed $f$ and derived on a grouped datasetIf we group the reference dataset (dataset_0
), e.g. by country, we get an
object of class "simbase_list" with separate simbases for each group
(technically, this is a tibble
with the grouping variables and an extra
column .simbase
which contains several objects of class "simbase_covar").
sb_group <- dataset_0 %>% dplyr::group_by(country) %>% dplyr::rename(f_simg = f, E_simg = E, rho_simg = rho) %>% simbase_covar( variables = c('f_simg', 'E_simg', 'rho_simg', 'ip_f', 'E_dyn', 'ip_rho'), transforms = list(f_simg = scales::log_trans()) ); sb_group
If we add variables to a dataset using such a "simbase_list", it is required that all grouping variables stored in the "simbase_list" object are also available in this dataset.
In our case: the dataset must contain the variable "country". Values of
"country" which do not also exist in our "simbase" object will result in
NA
values for the variables to be simulated.
Therefore, we add the variables in this case not to the dataset dataset_c
(which has different values for "country") but to the dataset_0
itself.
dataset_0_sim <- simulate_conditionally(dataset_0, sb_group); plot_sim_gdp(dataset_0_sim, sb_group, c('f_simg', 'E_simg', 'rho_simg'), colour=country);
simbase_list
objectSimbase objects of class "simbase_list" can also be used for simulating an
entire dataset, as long as the "simbase_list" only has the grouping variable(s)
"country" and/or "subsample", and as long as the value combinations in
"country"/"subsample" match those given in the "subsets" argument to the
function simulate_dataset
.
To demonstrate, we calculate a "simbase_list" based on the dataset_c
created
above. Here, we do not rename any of the variables.
sb_group_c <- dataset_c %>% dplyr::group_by(country) %>% simbase_covar( variables = c('f', 'E', 'rho', 'ip_f', 'E_dyn', 'ip_rho'), transforms = list(f = scales::log_trans()) ); sb_group_c
This "simbase_list" is now used as input to simulate_dataset
with the subset
definitions used previously (ssd_cn
).
dataset_cn2 <- simulate_dataset( random_seed = 12345, n = 5000, subsets = ssd_cn, simbase = sb_group_c ); summ_fun(dataset_cn2);
The package WoodSimulatR
has functions for simulating entire datasets of sawn
timber properties, both
based on internal definitions and on externally supplied base data.
WoodSimulatR
also has functions for adding
simulated grade determining properties (or other properties)
to a given dataset, based on a covariance matrix approach.
The functions for adding simulated variables are suitable for all kinds of
datasets, if one calculates an appropriate simbase_covar
object oneself, by a
call to simbase_covar
using reference data.
The simulation methods also support variable transformations to accommodate non-normally distributed variables.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.