README.md

MetaConIdentifier

Description

MetaConIdentifier is an R package for exploratory analysis that identifies and visualizes potential confounding factors in differentially expressed gene (DEG) studies. It does this by performing a correspondence analysis (CA), a multivariate statistical technique, on RNASeq metadata and generating a matrix of factor scores that can be used to rerun a differential expression analysis (e.g. the DESeq2 Likelihood Ratio Test). Since it is rarely practical to design a study in which only the condition in question varies, it is important to identify whether extraneous variables such as age and sex also contribute to differences in gene expression.
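
The idea of rerunning the analysis with a factor score as a covariate can be illustrated with a small likelihood ratio test. The sketch below is not the package's code: it uses simulated counts for a single gene and a Poisson GLM from base R as a simplified stand-in for the negative binomial models that DESeq2 fits. The names condition, factor1, and counts are illustrative assumptions.

```r
set.seed(1)
n <- 40
condition <- factor(rep(c("control", "treated"), each = n / 2))
factor1 <- rnorm(n)  # hypothetical factor score from the CA step

# Simulated counts for one gene that depend on the factor score only
counts <- rpois(n, lambda = exp(3 + 0.8 * factor1))

# Full model includes the factor score; reduced model drops it
full <- glm(counts ~ condition + factor1, family = poisson)
reduced <- glm(counts ~ condition, family = poisson)

# Likelihood ratio test: does the factor score explain expression
# beyond what the condition explains?
lrt <- anova(reduced, full, test = "Chisq")
p_value <- lrt[["Pr(>Chi)"]][2]
print(p_value)
```

A small p-value here suggests that the factor captures variation in expression that the condition alone does not, which is the signal of a potential confounder. With DESeq2 itself, the comparable comparison is made by passing test = "LRT" and a reduced design formula to DESeq().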

Several existing packages each perform an individual component of the pipeline, such as the ours package for transformations, missMDA for imputation, and ExPosition for the correspondence analysis. However, no package harmonizes the workflow or is tailored to RNASeq metadata, which contains a combination of categorical, ordinal, and numeric variables along with inconsistencies and sporadic missingness.
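
To show what mixed variable types and sporadic missingness look like in practice, here is a minimal base R sketch that classifies each column of a toy metadata table and reports its proportion of missing values. The data frame and its column names are made up for illustration, and the sketch does not reproduce any function in the package.

```r
# Toy metadata with mixed types and sporadic missing values (illustrative only)
metadata <- data.frame(
  sex = factor(c("F", "M", NA, "F", "M")),
  stage = factor(c("I", "II", "II", NA, "III"),
                 levels = c("I", "II", "III"), ordered = TRUE),
  age = c(34, NA, 51, 47, 62)
)

# Classify each column as ordinal, categorical, or numeric
var_type <- vapply(metadata, function(x) {
  if (is.ordered(x)) {
    "ordinal"
  } else if (is.factor(x) || is.character(x)) {
    "categorical"
  } else if (is.numeric(x)) {
    "numeric"
  } else {
    "other"
  }
}, character(1))

# Proportion of missing values per column
prop_missing <- colMeans(is.na(metadata))

summary_table <- data.frame(type = var_type, prop_missing = prop_missing)
print(summary_table)
```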

The R package is geared towards scientists, researchers, and students who are performing differential expression analysis on RNASeq count data and have access to the corresponding metadata. The package was developed using R version 4.0.3 on the Windows platform (Windows 10).

Installation

To install the latest version of the package:

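# Install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")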
devtools::install_github("ahnjedid/MetaConIdentifier", build_vignettes = TRUE)
library("MetaConIdentifier")

To run the shinyApp:

Not available yet!

Overview

ls("package:MetaConIdentifier")
data(package = "MetaConIdentifier") # optional

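MetaConIdentifier contains 7 functions for running the pipeline, with 4 plotting functions in total to aid the exploratory analysis. The functions should be run sequentially in the following order for optimal analysis: investigate_metadata, standardize_metadata, run_ca, plot_components, identify_elbow, plot_factor_scores, and analyze_factor.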

browseVignettes("MetaConIdentifier")

An overview of the package is illustrated below.

Contributions

The author of the package is Jedid Ahn.

The investigate_metadata function makes use of tidyverse packages, including dplyr, ggplot2, and tidyr, and also uses the na.omit function from stats.

The standardize_metadata function consists mostly of manual validation, so no external packages are required.

The run_ca function uses the ours package for transformations, the missMDA package for imputation, and the ExPosition package for the correspondence analysis.
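
For readers unfamiliar with correspondence analysis, the core computation can be sketched in base R with a singular value decomposition. This is an illustration under simplifying assumptions, not what run_ca does internally: the toy metadata is complete and purely categorical, so the transformation and imputation steps are skipped, and ExPosition is not used. The helper one_hot and all variable names are hypothetical.

```r
# Toy categorical metadata (illustrative; complete, so no imputation is needed)
metadata <- data.frame(
  sex   = factor(c("F", "M", "F", "M", "F", "M")),
  batch = factor(c("A", "A", "B", "B", "C", "C"))
)

# One-hot (disjunctive) coding: one indicator column per level
one_hot <- function(f, name) {
  m <- sapply(levels(f), function(l) as.numeric(f == l))
  colnames(m) <- paste(name, levels(f), sep = ".")
  m
}
Z <- cbind(one_hot(metadata$sex, "sex"), one_hot(metadata$batch, "batch"))

# Correspondence matrix with its row and column masses
P <- Z / sum(Z)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# Standardized residuals from the independence model
expected <- outer(row_mass, col_mass)
S <- (P - expected) / sqrt(expected)
dec <- svd(S)

# Keep components with non-zero singular values
keep <- dec$d > 1e-8
eigenvalues <- dec$d[keep]^2

# Factor scores for the observations (principal coordinates)
row_scores <- sweep(dec$u[, keep, drop = FALSE], 2, dec$d[keep], "*") / sqrt(row_mass)
rownames(row_scores) <- paste0("sample", seq_len(nrow(row_scores)))
print(round(eigenvalues, 3))
print(round(row_scores, 3))
```

Each row of row_scores gives one sample's position on the extracted factors. A matrix of this kind is what can be carried forward into a differential expression model.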

The plot_components function makes use of the ExPosition package, using its epGraphs function to plot the top two components for both the variables and the observations.

The identify_elbow function makes use of the findElbowPoint function from PCAtools to computationally determine the optimal number of factors to extract.
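
As an illustration of what "finding the elbow" means, the sketch below implements one common heuristic in base R: draw a line from the first to the last point of the scree curve and pick the component furthest from that line. This is an assumption made for illustration and is not claimed to match the internals of findElbowPoint. The eigenvalues are made-up numbers and find_elbow is a hypothetical helper.

```r
# Hypothetical eigenvalues (variance explained per component), in decreasing order
eigenvalues <- c(0.42, 0.23, 0.11, 0.04, 0.03, 0.02, 0.02, 0.01)

find_elbow <- function(y) {
  x <- seq_along(y)
  n <- length(y)
  # Line through the first and last points of the scree curve
  dx <- x[n] - x[1]
  dy <- y[n] - y[1]
  # Perpendicular distance of every point from that line
  dist <- abs(dy * (x - x[1]) - dx * (y - y[1])) / sqrt(dx^2 + dy^2)
  which.max(dist)
}

elbow <- find_elbow(eigenvalues)
print(elbow)
```

Note that this distance depends on the relative scaling of the two axes, so the result is best treated as a starting point to check against the scree plot rather than a definitive cut-off.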

The plot_factor_scores function makes use of three packages: heatmaply to make the heatmap interactive, as well as grDevices and RColorBrewer to create a red to blue heatmap.
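
A static version of such a heatmap can be sketched with base R alone. The example below is an assumption-laden illustration rather than the package's implementation: the factor scores are random numbers, the three hex colours are picked by hand instead of being taken from RColorBrewer, and the base heatmap function stands in for the interactive heatmaply output.

```r
set.seed(1)
# Hypothetical factor scores: 6 samples by 3 factors
scores <- matrix(rnorm(18), nrow = 6,
                 dimnames = list(paste0("sample", 1:6), paste0("factor", 1:3)))

# Diverging palette running from blue (low) through white to red (high)
palette_fun <- grDevices::colorRampPalette(c("#2166AC", "#F7F7F7", "#B2182B"))
colours <- palette_fun(101)

# Symmetric breaks so that a score of zero maps to the midpoint colour
limit <- max(abs(scores))
breaks <- seq(-limit, limit, length.out = length(colours) + 1)

# Static heatmap without clustering or row scaling
heatmap(scores, Rowv = NA, Colv = NA, scale = "none",
        col = colours, breaks = breaks)
```

Centring the colour scale on zero makes it easy to see which samples sit at opposite ends of a factor.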

The analyze_factor function also does not depend on any external packages.

References

Beaton, D., Chin Fatt, C. R., & Abdi, H. (2014). An ExPosition of multivariate analysis with the singular value decomposition in R. Computational statistics & data analysis, 72, 176–189.

Blighe, K., & Lun, A. (2020). PCAtools: Everything Principal Components Analysis. R package version 2.2.0.

Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., et al. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, 17(1), 13.

Galili, T., O’Callaghan, A., Sidi, J., & Sievert, C. (2018). heatmaply: an R package for creating interactive cluster heatmaps for online publishing. Bioinformatics, 34(9), 1600–1602.

Josse, J., & Husson, F. (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1–31.

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.

Neuwirth, E. (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2.

Sunderland, K. M., Beaton, D., Fraser, J., Kwan, D., McLaughlin, P. M., Montero-Odasso, M., Peltsch, A. J., et al. (2019). The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project. BMC Medical Research Methodology, 19(1), 102.

Tummers, J., Speelman, D., & Geeraerts, D. (2012). Multiple Correspondence Analysis as heuristic tool to unveil confounding variables in corpus linguistics.

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.

Acknowledgements

This package was developed as part of an assessment for 2021 BCB410H: Applied Bioinformatics, University of Toronto, Toronto, Canada.



ahnjedid/MetaConIdentifier documentation built on Dec. 18, 2021, 11:26 p.m.