In Aariq/holodeck: A Tidy Interface for Simulating Multivariate Data

downloads

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

holodeck: A Tidy Interface For Simulating Multivariate Data

holodeck allows quick and simple creation of simulated multivariate data with variables that co-vary or discriminate between levels of a categorical variable. The resulting simulated multivariate dataframes are useful for testing the performance of multivariate statistical techniques under different scenarios, power analysis, or just doing a sanity check when trying out a new multivariate method.

Installation

From CRAN:

install.packages("holodeck)

Development version from r-universe:

install.packages('holodeck', repos = c('https://aariq.r-universe.dev', 'https://cloud.r-project.org'))

Load packages

holodeck is built to work with dplyr functions, including group_by() and the pipe (%>%). purrr is helpful for iterating simulated data. For these examples I'll use ropls for PCA and PLS-DA.

library(holodeck)
library(dplyr)
library(tibble)
library(purrr)
library(ropls)

Example 1: Investigating PCA and PLS-DA

Let's say we want to learn more about how principal component analysis (PCA) works. Specifically, what matters more in terms of creating a principal component---variance or covariance of variables? To this end, you might create a dataframe with a few variables with high covariance and low variance and another set of variables with low covariance and high variance

Generate data

set.seed(925)
df1 <- 
  sim_covar(n_obs = 20, n_vars = 5, cov = 0.9, var = 1, name = "high_cov") %>%
  sim_covar(n_vars = 5, cov = 0.1, var = 2, name = "high_var")

Explore covariance structure visually. The diagonal is variance.

df1 %>% 
  cov() %>%
  heatmap(Rowv = NA, Colv = NA, symm = TRUE, margins = c(6,6), main = "Covariance")

Now let's make this dataset a little more complex. We can add a factor variable, some variables that discriminate between the levels of that factor, and add some missing values.

set.seed(501)
df2 <-
  df1 %>% 
  sim_cat(n_groups = 3, name = "factor") %>% 
  group_by(factor) %>% 
  sim_discr(n_vars = 5, var = 1, cov = 0, group_means = c(-1.3, 0, 1.3), name = "discr") %>% 
  sim_discr(n_vars = 5, var = 1, cov = 0, group_means = c(0, 0.5, 1), name = "discr2") %>% 
  sim_missing(prop = 0.1) %>% 
  ungroup()
df2

PCA

pca <- opls(select(df2, -factor), fig.pdfC = "none", info.txtC = "none")

plot(pca, parAsColFcVn = df2$factor, typeVc = "x-score")

getLoadingMN(pca) %>%
  as_tibble(rownames = "variable") %>% 
  arrange(desc(abs(p1)))

It looks like PCA mostly picks up on the variables with high covariance, not the variables that discriminate among levels of factor. This makes sense, as PCA is an unsupervised analysis.

PLS-DA

plsda <- opls(select(df2, -factor), df2$factor, predI = 2, permI = 10, fig.pdfC = "none", info.txtC = "none")

plot(plsda, typeVc = "x-score")

getVipVn(plsda) %>% 
  tibble::enframe(name = "variable", value = "VIP") %>% 
  arrange(desc(VIP))

PLS-DA, a supervised analysis, finds discrimination among groups and finds that the discriminating variables we generated are most responsible for those differences.

Aariq/holodeck documentation built on Aug. 28, 2023, 10:06 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com