knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"#,
  #fig.path = "man/figures/assess-normality-"#,
  # out.width = "100%"
)

options(knitr.kable.NA = '')
hcolor <- "#363B74"
tcolor <- "#EEEEEE"

knitr::opts_chunk$set(dpi = 300, fig.width = 7)

`%>%` <- dplyr::`%>%`
library(MetaPipe)

MetaPipe assesses the normality of variables (traits) by performing a Shapiro-Wilk test on the raw data (see Load Raw Data and Replace Missing Data). Based on whether or not the data approximates a normal distribution, an array of transformations will be computed, and the normality assessed one more time.

The diagram below shows the tree of transformations that can be performed, the user can specify the transformation values passing a vector with the argument transf_vals to the function assess_normality; by default, [2, e, 3, 4, 5, 6, 7, 8, 9, 10].

DiagrammeR::grViz(system.file("figures/assess-normality-graph.gv", package = "MetaPipe"), height = "200px", width = "100%")

The function call is as follows:

assess_normality(raw_data = raw_data, 
                 excluded_columns = c(2, 3, ..., M), 
                 # Optional
                 cpus = 1, 
                 out_prefix = "metapipe", 
                 plots_dir = tempdir(), 
                 transf_vals = c(2, exp(1), 3, 4, 5, 6, 7, 8, 9, 10),
                 alpha = 0.05,
                 pareto_scaling = FALSE,
                 show_stats = TRUE)

where raw_data is a data frame containing the raw data, as described in Load Raw Data and excluded_columns is a vector containing the indices of the properties, e.g. c(2, 3, ..., M). The other arguments are optional, cpus is the number of cores to use, in other words, the number of concurrent traits to process, out_prefix is the prefix for output files, plots_dir is the output directory where the plots will be stored, and transf_vals is a vector containing the transformation values to be used when transforming the original data.

Example

The following histogram shows a sample data obtained from a normal distribution with the command rnorm, but it was transformed using the power (base 2) function; thus, the data seems to be skewed:

set.seed(123)
test_data <- rnorm(500)
MetaPipe:::generate_hist(2^test_data, NULL, xlab = latex2exp::TeX("2^{data}"), save = FALSE, alpha = 1, fill = hcolor)

Using MetaPipe we can find an optimal transformation that "normalises" this data set:

example_data <- data.frame(ID = 1:500,
                           T1 = test_data,
                           T2 = 2^test_data)
normalised_data <- MetaPipe::assess_normality(example_data, c(1))

normalised_data_norm <- normalised_data$norm
normalised_data_skew <- normalised_data$skew

transformed_data <- read.csv("metapipe_raw_data_normalised_all.csv")

The output of this function is a long table of each trait, with the following format:

nrows <- 1
raw_df <- data.frame(A = rep("&nbsp;", nrows),
                     B = rep(NA, nrows),
                     C = rep(NA, nrows),
                     D = rep(NA, nrows),
                     E = rep(NA, nrows),
                     G = rep(NA, nrows))
colnames(raw_df) <- c("index", "trait", "values", "flag", "transf", "transf_val")

knitr::kable(raw_df, format = "html", escape = FALSE, table.attr = "style='width:100%;'", align = 'c') %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  kableExtra::row_spec(0, extra_css = "vertical-align: middle;", background = hcolor, color = tcolor) %>%
  kableExtra::column_spec(1:6, border_left = TRUE, border_right = TRUE)

where index is a simple numeric value to uniquely identify each trait, trait is the trait/variable name, values is the actual entry, flag indicates whether the entry is parametric (Normal) or skewed (Non-normal), transf is the transformation function (empty for untransformed traits), and transf_val is the transformation value used.

From the previous example, the top 5 entries for the trait T1 are:

knitr::kable(head(transformed_data), format = "html", escape = FALSE, table.attr = "style='width:100%;'", align = 'c') %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  kableExtra::row_spec(0, extra_css = "vertical-align: middle;", background = hcolor, color = tcolor) %>%
  kableExtra::column_spec(1:6, border_left = TRUE, border_right = TRUE)

And for trait T2:

tmp <- head(transformed_data[-c(1:500), ])
rownames(tmp) <- c(1:6)
knitr::kable(tmp, format = "html", escape = FALSE, table.attr = "style='width:100%;'", align = 'c') %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  kableExtra::row_spec(0, extra_css = "vertical-align: middle;", background = hcolor, color = tcolor) %>%
  kableExtra::column_spec(1:6, border_left = TRUE, border_right = TRUE)
knitr::kable(head(transformed_data_norm), format = "html", escape = FALSE, table.attr = "style='width:100%;'", align = 'c') %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  kableExtra::row_spec(0, extra_css = "vertical-align: middle;", background = hcolor, color = tcolor)

As expected both tables show the same entries; however, the latter indicates that T2 was transformed using $\log_2$. The function will generate histograms for all the traits, the naming convention used is:

For the previous data set HIST_1_NORM_T1.png and HIST_2_LOG_2_T2.png:

MetaPipe:::generate_hist(test_data, NULL, xlab = latex2exp::TeX("data"), save = FALSE, alpha = 1, fill = hcolor)

MetaPipe:::generate_hist(log(2^test_data), NULL, xlab = latex2exp::TeX("log_2(2^{data})"), save = FALSE, alpha = 1, fill = hcolor) 
filenames <- c(list.files(".", "metapipe_*"))
output <- lapply(filenames, file.remove)


villegar/MetaPipe documentation built on Nov. 22, 2022, 10:44 p.m.