knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 12, fig.height = 6 ) options(tibble.print_min = 6L, tibble.print.max = 6L, digits = 3)
In this vignette a PARAFAC model is created for upper jaw lingual data from the vanderPloeg2024
study. This is done by first processing the count data. Subsequently, the appropriate number of components are determined. Then the PARAFAC model is created and visualized.
library(parafac4microbiome) library(dplyr) library(ggplot2) library(ggpubr)
The data cube in vanderPloeg2024$upper_jaw_lingual$data
contains unprocessed counts. The function processDataCube()
performs the processing of these counts with the following steps:
vanderPloeg2024
we can take the RFgroups groups into account for feature selection. We do this by calculating the sparsity for each feature in each subject group and compare those against the sparsity threshold that we set. If a feature passes the threshold in either group, it is selected.compositions::clr()
function with a pseudo-count of one (on all features, prior to selection based on sparsity).The outcome of processing is a new version of the dataset called processedPloeg
. Please refer to the documentation of processDataCube()
for more information.
processedPloeg = processDataCube(vanderPloeg2024$upper_jaw_lingual, sparsityThreshold=0.50, considerGroups=TRUE, groupVariable="RFgroup", CLR=TRUE, centerMode=1, scaleMode=2)
A critical aspect of PARAFAC modelling is to determine the correct number of components. We have developed the functions assessModelQuality()
and assessModelStability()
for this purpose. First, we will assess the model quality and specify the minimum and maximum number of components to investigate and the number of randomly initialized models to try for each number of components.
Note: this vignette reflects a minimum working example for analyzing this dataset due to computational limitations in automatic vignette rendering. Hence, we only look at 1-3 components with 5 random initializations each. These settings are not ideal for real datasets. Please refer to the documentation of assessModelQuality()
for more information.
# Setup # For computational purposes we deviate from the default settings minNumComponents = 1 maxNumComponents = 3 numRepetitions = 3 # number of randomly initialized models numFolds = 5 # number of jack-knifed models ctol = 1e-5 maxit = 200 numCores = 1 colourCols = c("RFgroup", "Phylum", "") legendTitles = c("RF group", "Phylum", "") xLabels = c("Subject index", "Feature index", "Time index") legendColNums = c(3,5,0) arrangeModes = c(TRUE, TRUE, FALSE) continuousModes = c(FALSE,FALSE,TRUE) # Assess the metrics to determine the correct number of components qualityAssessment = assessModelQuality(processedPloeg$data, minNumComponents, maxNumComponents, numRepetitions, ctol=ctol, maxit=maxit, numCores=numCores)
The overview plot showcases the number of iterations, the sum-of-squared error, the CORCONDIA and the variance explained for 1-3 components.
qualityAssessment$plots$overview
This seems a clear-cut case for a two-component model, as the three-component models have a CORCONDIA near zero. The maximum amount of variation we can describe is ~20%. We skip the stability assessment since the appropriate number of components is clearly two.
We have decided that a two-component model is the most appropriate for the vanderPloeg2024
dataset. We can now select one of the random initializations from the assessModelQuality()
output as our final model. We're going to select the random initialization that corresponded the maximum amount of variation explained for two components.
numComponents = 2 modelChoice = which(qualityAssessment$metrics$varExp[,numComponents] == max(qualityAssessment$metrics$varExp[,numComponents])) finalModel = qualityAssessment$models[[numComponents]][[modelChoice]]
Finally, we visualize the model using plotPARAFACmodel()
.
plotPARAFACmodel(finalModel$Fac, processedPloeg, 2, colourCols, legendTitles, xLabels, legendColNums, arrangeModes, continuousModes, overallTitle = "vanderPloeg2024 PARAFAC model")
You will observe that the loadings for some modes in some components are all negative. This is due to sign flipping: two modes having negative loadings cancel out but describe the same thing as two positive loadings. The flipLoadings()
function automatically performs this procedure and also sorts the components by how much variation they describe.
finalModel = flipLoadings(finalModel, processedPloeg$data) plotPARAFACmodel(finalModel$Fac, processedPloeg, 2, colourCols, legendTitles, xLabels, legendColNums, arrangeModes, continuousModes = c(FALSE,FALSE,TRUE), overallTitle = "Ploeg PARAFAC model")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.