knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  error = FALSE,
  message = FALSE
)

Overview {-}

The poplin package aims to provide data processing utilities (imputation, normalization, and dimension reduction) for LC/MS metabolomics data and a S4 class container for storing and retrieving the resulting outputs, motivated by the r Biocpkg("SingleCellExperiment").

The poplin class

The poplin class, an extension of the SummarizedExperiment class, provides additional containers for data processing results and dimension-reduced presentations of metabolomics data. poplin objects can be created via the constructor of the same name. Posit that we have a feature intensity matrix generated from spectral data processing packages, such as r Biocpkg("xcms"). Note that rows represent features and columns represent samples like SummarizedExperiment objects.

library(poplin)
nsample <- 20
nfeature <- 100
features <- rlnorm(nsample * nfeature, 10, 1)
fmat <- matrix(features, ncol = nsample)
cn <- paste0("S", seq_len(nsample))
rn <- paste0("F", seq_len(nfeature))
colnames(fmat) <- cn
rownames(fmat) <- rn
pd <- poplin(
    assays = list(raw = fmat),
    colData = DataFrame(sample_id = cn),
    rowData = DataFrame(feature_id = rn)
    )
pd

Alternatively, a poplin object can be constructed by coercing an existing SummarizedExperiment object.

se <- SummarizedExperiment(
    assays = list(raw = fmat),
    colData = DataFrame(sample_id = cn),
    rowData = DataFrame(feature_id = rn)
    )
pd <- as(se, "poplin")

# any opration applied to "se" also works on "pd"
assays(pd)
dim(pd)
head(rowData(pd), 3)
head(colData(pd), 3)

To illustrate the methods of the class, we will use the faahko_poplin data included in the poplin package. This data set was generated from the faahko3 object in the r Biocpkg("faahKO") package, which consists of 12 samples (6 wild-type and 6 FAAH knockout mice) and 206 LC/MS peaks.

data(faahko_poplin)
faahko_poplin

The poplin class have three data containers: poplinRaw, poplinData, poplinReduced.

poplinRaw corresponds to assays in the SummarizedExperiment class and is intended to store raw intensity data. To retrieve the data in this container, one can use poplin_raw_list() accessor. This is an alias of assays() methods from the SummarizedExperiment class.

## Get a list of raw intensity data sets.
poplin_raw_list(faahko_poplin) # alias of assays()

## Get the names of data sets
poplin_raw_names(faahko_poplin) # alias of assayNames()

## Get indvidual entries
head(poplin_raw(faahko_poplin, 1), 3) # alias of assay()

poplinData is intended to store processed data that are typically returned by utility functions in the poplin package. To retrieve the data in this container, one can use poplin_data_list() accessor. Note that each entry must have the same dimension as returned by dim().

poplin_data_list(faahko_poplin)
poplin_data_names(faahko_poplin)
head(poplin_data(faahko_poplin, "knn"), 3)

poplinReduced is intended to store dimension-reduced data. To retrieve the data in this container, one can use poplin_reduced_list() accessor. Note that each entry must have the same number of rows as returned by ncol().

poplin_reduced_list(faahko_poplin)
poplin_reduced_names(faahko_poplin)
head(poplin_reduced(faahko_poplin, "pca"), 3)

In the poplin class, each of these accessors has setter methods so that users can assign data to individual containers.

## Operations also work on poplinRaw and poplinReduced containers.
knn <- poplin_data(faahko_poplin, "knn")
empty <- faahko_poplin
poplin_data_list(empty) <- list() # replace with empty data
poplin_data_list(empty) <- list(knn1 = knn, knn2 = knn) # add data list
poplin_data(empty, "knn3") <- knn # add data
poplin_data_names(empty)
poplin_data_names(empty) <- c("imp1", "imp2", "imp3") # change names
poplin_data_names(empty)

Missing value imputation

In the poplin package, commonly used missing value imputation algorithms are available via the poplin_impute() function, which included k-nearest neighbor (using Gower's distance), random forest, PCA-based methods (e.g., NIPALS PCA, Bayesian PCA, Probabilistic PCA), and univariate replacement (e.g., half-minimum, median, mean). poplin_impute() can be applied either to a poplin object or matrix. Please refer to the [Visualization] section to visualize the missingness of the data.

## missing % of raw intensity matrix
m <- poplin_raw(faahko_poplin, "raw")
100 * sum(is.na(m)) / prod(dim(faahko_poplin))

## apply half-mininum imputation to a poplin object
res <- poplin_impute(x = faahko_poplin, xin = "raw", xout = "halfmin",
                     method = "univariate", type = "halfmin")

## apply random forest imputation to a matrix
poplin_data(res, "rf") <- poplin_impute(x = m, method = "randomforest")

poplin_data_list(res)

Normalization

In metabolomics analysis, the data may need to be normalized to reduce unwanted sample-to-sample variability. The poplin_normalize() function provides several data-driven normalization approaches, such as probabilistic quotient normalization (PQN), cyclic LOESS normalization, variance stabilizing normalization (generalized log transformation), sum normalization, median normalization, feature-based scaling (e.g., auto scaling, pareto scaling, level scaling, and etc.).

## Apply sum normalization to a poplin object
res <- poplin_normalize(x = faahko_poplin, method = "sum", 
                        xin = "knn", xout = "knn_pqn")

## Apply VSN normalization to a matrix
m <- poplin_data(faahko_poplin, "knn")
poplin_data(res, "knn_vsn") <- poplin_normalize(x = m, method = "vsn")

poplin_data_list(res)

Dimension reduction

In metabolomics, dimension reduction methods are often used for modeling and visualization. Currently, the poplin package supports three dimension-reduction methods: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and partial least squares-discriminant analysis (PLS-DA). The poplin_reduce function perform dimension reduction of the data and store the result to the poplinReduced container.

empty <- faahko_poplin
poplin_reduced_list(empty) <- list()
poplin_reduced_names(empty)

## Apply PCA to a poplin object
res <- poplin_reduce(x = empty, xin = "knn_cyclic", xout = "pca", 
                     method = "pca", ncomp = 3)

## Apply t-SNE to a matrix
poplin_reduced(res, "tsne") <- poplin_reduce(m, method = "tsne", 
                                             ncomp = 3, perplexity = 3)

## Apply PLS-DA to a poplin object
y <- factor(colData(res)$sample_group, levels = c("WT", "KO"))
res <- poplin_reduce(x = res, xin = "knn_cyclic", xout = "plsda", 
                     method = "plsda", y = y, ncomp = 3)

The poplin_reduce() function returns the result containing custom attributes that are used for summary and visualization. Please refer to [Visualization] for details.

summary(poplin_reduced(res, "pca"))
summary(poplin_reduced(res, "tsne"))
summary(poplin_reduced(res, "plsda"))

Visualization

The poplin package provide common visualization for metabolomics data based on r CRANpkg("ggplot2"). The plot functions in poplin package can be applied either to a poplin object or a matrix.

poplin_naplot

The poplin_naplot() helps to visually inspect missingness of the data.

poplin_naplot(x = faahko_poplin, xin = "raw")

poplin_corplot

The poplin_corplot() visualizes correlations between samples (or features) to quickly check the grouping structure in the data.

poplin_corplot(x = faahko_poplin, xin = "knn_cyclic")

poplin_boxplot

The poplin_boxplot() produces a box-and-whisker plot of the feature intensity values across the samples.

group <- colData(faahko_poplin)$sample_group

## distribution of intensities before normalization
poplin_boxplot(faahko_poplin, xin = "knn", group = group, 
               pre_log2 = TRUE)

## distribution of intensities after cyclic LOESS normalization
poplin_boxplot(faahko_poplin, xin = "knn_cyclic", group = group)

poplin_scoreplot

The poplin_scoreplot() function visualizes the data onto a lower-dimensional space using the poplin_reduce() output.

group <- colData(faahko_poplin)$sample_group

## PCA output
poplin_scoreplot(faahko_poplin, xin = "pca", group = group, 
                 ellipse = TRUE)

## PLS-DA output 
poplin_scoreplot(faahko_poplin, xin = "plsda", ellipse = TRUE,
                 label = TRUE)

poplin_biplot

The poplin_biplot() function Visualize an overlay of a score plot and a loading plot using the poplin_reduce() output.

group <- colData(faahko_poplin)$sample_group

poplin_biplot(faahko_poplin, xin = "pca", group = group, 
              ellipse = TRUE, arrow_col = "orange", 
              arrow_label_subset = c("FT039", "FT042", "FT046", "FT098"),
              arrow_label_col = "gray20")

Session info {-}

sessionInfo()


jaehyunjoo/poplin documentation built on Jan. 8, 2022, 1:13 a.m.