README.md
In ahnjedid/MetaConIdentifier: Metadata Confounding Identifier

MetaConIdentifier

MetaConIdentifier is an R package for exploratory analysis that identifies and visualizes potential confounding factors in differentially expressed gene (DEG) studies. It does this by running a multivariate technique called correspondence analysis (CA) on RNASeq metadata and generating a matrix of factor scores that can be used to rerun the differential expression analysis (e.g. the DESeq2 likelihood ratio test). Since it is rarely practical to isolate the condition in question from every other variable, it is important to identify whether extraneous variables such as age and sex may also be contributing to differences in gene expression.

Several existing packages each perform an individual component of the pipeline, such as ours for transformations, missMDA for imputation, and ExPosition for the correspondence analysis. However, no existing package harmonizes the workflow or is tailored towards RNASeq metadata, which contains a combination of categorical, ordinal, and numeric variables along with inconsistencies and sporadic missingness.

The package is geared towards scientists, researchers, and students who are performing differential expression analysis on RNASeq count data and have access to the corresponding metadata. It was developed using R version 4.0.3 on Windows 10.

To install the latest version of the package:

require("devtools")
devtools::install_github("ahnjedid/MetaConIdentifier", build_vignettes = TRUE)
library("MetaConIdentifier")

To run the shinyApp:

Not available yet!

To list the functions and sample datasets available in the package:

ls("package:MetaConIdentifier")
data(package = "MetaConIdentifier") # optional

MetaConIdentifier contains 7 functions for running the pipeline, including 4 plotting functions to aid the exploratory analysis. For best results, run the functions sequentially in the following order:

1. The investigate_metadata function supports exploratory review of the metadata dataset. RNASeq metadata in particular can be very messy due to the sheer number of variables, lack of annotation, and widespread missingness. The function reports which variables should be dropped due to significant missingness or lack of variance, and plots the missingness.
2. The standardize_metadata function cleans and standardizes the raw RNASeq metadata by identifying each variable as one of three types (categorical, ordinal, numeric) and converting all missing values to NA as a common format (e.g. a value of UNKNOWN or UNDETERMINED is replaced with NA). It returns an object of class data.frame and metaStandard.
3. The run_ca function is the core function. It runs the correspondence analysis (CA) to generate a matrix of factor scores for rerunning differential expression analysis. Beforehand, it preprocesses the metadata through transformations and imputation, which are required to recode it into one common format and into a variable type compatible with CA.
4. The plot_components function generates component plots for both the variables and the observations. These allow the user to see whether particular variable values are grouped closely together, which may indicate potential confounding factors at play.
5. The identify_elbow function computationally determines the optimal number of factors to extract from the matrix. It also generates a scree plot so that the elbow can be identified visually. If the two numbers differ, following the scree plot is recommended.
6. The plot_factor_scores function plots the full or truncated matrix of factor scores as a heatmap. The heatmap allows the user to identify which groups of observations strongly influence a particular factor. If a group of observations shares a common variable value, this may indicate potential confounding of the differential expression study.
7. The analyze_factor function outputs the corresponding metadata for groups of observations defined by a score threshold.

The package also contains raw RNASeq metadata from The Cancer Genome Atlas (TCGA) in tcga_meta_original and clean metadata in tcga_meta_clean. tcga_variable_subset and tcga_variable_type_vec are also available as example input to the standardize_metadata function. Refer to the package vignettes for more details:

browseVignettes("MetaConIdentifier")
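
As a rough illustration of what the first two steps in the list above involve, the base R sketch below recodes placeholder strings as NA and then flags variables with heavy missingness or no variance. It does not call investigate_metadata or standardize_metadata, whose arguments are documented in the vignettes. The toy data frame, the missing_tokens vector, and the 50% missingness cutoff are all assumptions made for this example.

```r
# Toy metadata (hypothetical, not the packaged TCGA data)
raw_meta <- data.frame(
  age   = c("45", "UNKNOWN", "61", "38"),
  sex   = c("F", "M", "UNDETERMINED", "F"),
  batch = c("A", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# Placeholder strings that should be treated as missing (assumed list)
missing_tokens <- c("UNKNOWN", "UNDETERMINED", "")

# Recode every placeholder as NA, column by column
clean_meta <- raw_meta
clean_meta[] <- lapply(raw_meta, function(col) {
  col[col %in% missing_tokens] <- NA
  col
})

# Fraction of missing values per variable
missing_fraction <- colMeans(is.na(clean_meta))

# Number of distinct observed values per variable (NA excluded)
n_distinct <- vapply(
  clean_meta,
  function(col) length(unique(stats::na.omit(col))),
  integer(1)
)

# Flag variables that are mostly missing (assumed cutoff: more than 50%)
# or that have fewer than two distinct observed values
drop_candidates <- names(clean_meta)[missing_fraction > 0.5 | n_distinct < 2]
print(drop_candidates)
```

Here only batch is flagged, because it takes a single value across all samples and therefore cannot explain any difference between them.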

An overview of the package is illustrated below.

The author of the package is Jedid Ahn.

The investigate_metadata function makes use of the tidyverse packages dplyr, ggplot2, and tidyr, as well as the na.omit function from stats.

The standardize_metadata function consists mostly of manual validation, so no external packages are required.

The run_ca function uses the ours package for transformations, the missMDA package for imputation, and the ExPosition package for the correspondence analysis.
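
To make the correspondence analysis step more concrete, here is a minimal base R sketch of the textbook CA computation: standardized residuals of a non-negative matrix are decomposed with an SVD, and the row factor scores are derived from the result. This is not the ExPosition implementation and not the code inside run_ca. The helper ca_row_scores, the toy metadata, and the indicator coding are assumptions for illustration, and the transformation and imputation steps are left out.

```r
# Hypothetical helper: plain correspondence analysis of a non-negative matrix
ca_row_scores <- function(X) {
  X <- as.matrix(X)
  stopifnot(all(X >= 0), all(rowSums(X) > 0), all(colSums(X) > 0))
  P  <- X / sum(X)                      # correspondence matrix
  r  <- rowSums(P)                      # row masses
  cm <- colSums(P)                      # column masses
  expected <- outer(r, cm)              # values expected under independence
  S  <- (P - expected) / sqrt(expected) # standardized residuals
  sv <- svd(S)
  keep <- sv$d > 1e-10                  # drop numerically zero dimensions
  # Row factor scores: scale left singular vectors by the singular values,
  # then divide each row by the square root of its mass
  scores <- sweep(sv$u[, keep, drop = FALSE], 2, sv$d[keep], "*") / sqrt(r)
  rownames(scores) <- rownames(X)
  list(scores = scores, eigenvalues = sv$d[keep]^2)
}

# Toy categorical metadata (hypothetical)
meta <- data.frame(
  sex   = c("F", "M", "F", "M", "F"),
  stage = c("I", "II", "I", "III", "II"),
  stringsAsFactors = FALSE
)

# Recode each variable as 0/1 indicator columns, one column per level
indicator <- do.call(cbind, lapply(names(meta), function(v) {
  f <- factor(meta[[v]])
  m <- outer(as.integer(f), seq_along(levels(f)), "==") * 1
  colnames(m) <- paste(v, levels(f), sep = "_")
  m
}))
rownames(indicator) <- paste0("sample", seq_len(nrow(meta)))

res <- ca_row_scores(indicator)
print(round(res$scores, 3))
```

Each row of res$scores corresponds to one sample and each column to one factor, which is the shape of matrix the later heatmap and thresholding steps operate on.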

The plot_components function makes use of the ExPosition package by using the epGraphs function to plot the top two components for both the variables and the observations.

The identify_elbow function makes use of the findElbowPoint function from PCAtools to computationally determine the optimal number of factors to extract.
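
For intuition about what an automated elbow search does, the sketch below implements one common heuristic in base R: it picks the point on the scree curve that lies farthest from the straight line joining the first and last points. This is a swapped-in illustration, not the code of findElbowPoint or identify_elbow, and it may pick a different point than they do. The helper name and the variance values are hypothetical.

```r
# Hypothetical helper: elbow as the point farthest from the chord that
# joins the first and last points of the scree curve
elbow_by_chord_distance <- function(variance) {
  stopifnot(is.numeric(variance), length(variance) >= 3)
  n <- length(variance)
  x <- seq_len(n)
  x1 <- 1; y1 <- variance[1]
  x2 <- n; y2 <- variance[n]
  # Perpendicular distance from each point (x, variance) to the chord
  distance <- abs((y2 - y1) * x - (x2 - x1) * variance + x2 * y1 - y2 * x1) /
    sqrt((y2 - y1)^2 + (x2 - x1)^2)
  which.max(distance)
}

# Made-up percentages of variance explained, in decreasing order
variance <- c(50, 25, 10, 5, 4, 3, 2, 1)
elbow <- elbow_by_chord_distance(variance)
print(elbow)  # 3
```

Because the result depends on the relative scale of the two axes, it is worth checking it against the scree plot, in line with the recommendation to follow the plot when the two disagree.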

The plot_factor_scores function makes use of three packages: heatmaply to make the heatmap interactive, as well as grDevices and RColorBrewer to create a red to blue heatmap.
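
A red-to-blue colour ramp of the kind described here can be built with grDevices alone. The sketch below is an assumption-laden example: it uses three hand-picked anchor colours rather than an RColorBrewer palette, draws a static base R heatmap instead of an interactive heatmaply one, and uses random numbers in place of real factor scores.

```r
# Palette function that interpolates red -> white -> blue (assumed anchors)
red_to_blue <- grDevices::colorRampPalette(c("red", "white", "blue"))
palette_5 <- red_to_blue(5)
print(palette_5)

# Random stand-in for a matrix of factor scores (hypothetical data)
set.seed(1)
toy_scores <- matrix(
  rnorm(20), nrow = 5,
  dimnames = list(paste0("sample", 1:5), paste0("factor", 1:4))
)

# Draw the heatmap only in an interactive session; no clustering or scaling
if (interactive()) {
  stats::heatmap(toy_scores, Rowv = NA, Colv = NA, scale = "none",
                 col = red_to_blue(256))
}
```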

The analyze_factor function also does not depend on any external packages.
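
The idea of pulling out metadata for observations that pass a score threshold can be sketched in base R as follows. The helper select_by_score, its arguments, the use of absolute scores, and the toy data are all hypothetical. The real analyze_factor function may define its threshold and output differently.

```r
# Hypothetical helper: return metadata rows whose absolute score on one
# factor meets or exceeds a threshold
select_by_score <- function(scores, metadata, factor_index = 1, threshold = 1) {
  stopifnot(nrow(scores) == nrow(metadata))
  hit <- abs(scores[, factor_index]) >= threshold
  cbind(metadata[hit, , drop = FALSE], score = scores[hit, factor_index])
}

# Toy factor scores and matching metadata (made up for this example)
toy_scores <- matrix(
  c(1.8, -0.2, 1.5, 0.1, -1.6,
    0.3,  0.9, -0.4, 1.2,  0.0),
  ncol = 2,
  dimnames = list(paste0("sample", 1:5), c("factor1", "factor2"))
)
toy_meta <- data.frame(
  sex = c("F", "M", "F", "M", "F"),
  age_group = c("old", "young", "old", "young", "young"),
  row.names = rownames(toy_scores),
  stringsAsFactors = FALSE
)

high_f1 <- select_by_score(toy_scores, toy_meta, factor_index = 1, threshold = 1.5)
print(high_f1)
```

In this toy case the three selected samples all have sex equal to F, which is the kind of shared variable value that would prompt a closer look at sex as a potential confounder.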

Beaton, D., Chin Fatt, C. R., & Abdi, H. (2014). An ExPosition of multivariate analysis with the singular value decomposition in R. Computational Statistics & Data Analysis, 72, 176–189.

Blighe, K., & Lun, A. (2020). PCAtools: Everything Principal Components Analysis. R package version 2.2.0.

Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., et al. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, 17(1), 13.

Galili, T., O’Callaghan, A., Sidi, J., & Sievert, C. (2018). heatmaply: an R package for creating interactive cluster heatmaps for online publishing. Bioinformatics, 34(9), 1600–1602.

Josse, J., & Husson, F. (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1–31.

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.

Neuwirth, E. (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2.

Sunderland, K. M., Beaton, D., Fraser, J., Kwan, D., McLaughlin, P. M., Montero-Odasso, M., Peltsch, A. J., et al. (2019). The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project. BMC Medical Research Methodology, 19(1), 102.

Tummers, J., Speelman, D., & Geeraerts, D. (2012). Multiple Correspondence Analysis as heuristic tool to unveil confounding variables in corpus linguistics.

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.

This package was developed as part of an assessment for 2021 BCB410H: Applied Bioinformatics, University of Toronto, Toronto, CANADA.

ahnjedid/MetaConIdentifier documentation built on Dec. 18, 2021, 11:26 p.m.
