MetaConIdentifier is an R package for exploratory analysis that identifies and visualizes potential confounding factors in differentially expressed gene (DEG) studies. It does this by performing a multivariate statistical technique called correspondence analysis (CA) on RNASeq metadata and generating a matrix of factor scores that can be used to rerun the differential expression analysis (e.g. the DESeq2 Likelihood Ratio Test). Since it is rarely practical to design a study in which only the condition in question varies, it is important to identify whether extraneous variables such as age and sex may also be contributing to differences in gene expression.
Several packages each perform an individual component of the pipeline, such as the ours package for transformations, missMDA for imputation, and ExPosition for the correspondence analysis. However, no existing package harmonizes the workflow or is tailored towards RNASeq metadata, which contains a combination of categorical, ordinal, and numeric variables along with inconsistencies and sporadic missingness.
The R package is geared towards scientists, researchers, and students performing differential expression analysis of their RNASeq count data with access to the corresponding metadata. The package was developed using R version 4.0.3 on a Windows 10 platform.
To install the latest version of the package:
require("devtools")
devtools::install_github("ahnjedid/MetaConIdentifier", build_vignettes = TRUE)
library("MetaConIdentifier")
To run the shinyApp:
Not available yet!
ls("package:MetaConIdentifier")
data(package = "MetaConIdentifier") # optional
MetaConIdentifier contains 7 functions for running the pipeline, with 4 plotting functions in total to aid the exploratory analysis. The functions should be run sequentially in the following order for optimal analysis:
The investigate_metadata function will allow for exploratory learning of the metadata dataset. RNASeq metadata in particular can be very messy due to the sheer number of variables, lack of annotation, and widespread missingness. It will provide information on which variables should be dropped due to significant missingness and lack of variance while providing a visual plot of the missingness.
The standardize_metadata function will clean and standardize the raw RNASeq metadata by identifying each variable as one of three types (categorical, ordinal, numeric) and converting all missing values into NA as a common format (e.g. a value of UNKNOWN or UNDETERMINED is replaced with NA). It will return an object of class data.frame and metaStandard.
The run_ca function is the core function which runs the correspondence analysis (CA) to generate a matrix of factor scores for rerunning differential expression analysis. It will preprocess the metadata beforehand through transformations and imputation which are required to recode it into one common format and into a variable type compatible with CA.
The plot_components function will generate component plots for both the variables and the observations. These allow the user to identify whether any particular variable values are grouped closely together, which may indicate potential confounding factors at play.
The identify_elbow function will computationally determine the optimal number of factors to extract from the matrix. A scree plot is also generated to visualize the elbow manually. If the numbers differ, following the scree plot is recommended.
The plot_factor_scores function will plot the full or truncated matrix of factor scores as a heatmap. The heatmap will allow the user to identify which groups of observations strongly influence a particular factor. If that group of observations share a common variable value, this may indicate potential confounding of the differential expression study.
The analyze_factor function will output the corresponding metadata for common groups of observations defined by a score threshold.
The package also contains raw RNASeq metadata from The Cancer Genome Atlas (TCGA) in tcga_meta_original and clean metadata as tcga_meta_clean. tcga_variable_subset and tcga_variable_type_vec are also available as example input to the standardize_metadata function. Refer to the package vignettes for more details.
browseVignettes("MetaConIdentifier")
An overview of the package is illustrated below.
The author of the package is Jedid Ahn.
The investigate_metadata function makes use of tidyverse packages, which include dplyr, ggplot2, and tidyr, while also using stats for the na.omit function.
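The kind of check that investigate_metadata is described as performing can be sketched in base R. This is only an illustration under stated assumptions: the toy data frame, its column names, and the 50% missingness cut-off are hypothetical and are not taken from the package.

```r
# Hypothetical toy metadata; column names are illustrative only.
meta <- data.frame(
  age   = c(34, NA, 51, 47, NA),
  sex   = c("F", "M", NA, "F", "M"),
  batch = c("A", "A", "A", "A", "A"),  # a single observed value (no variance)
  stage = c(NA, NA, NA, NA, "II"),     # mostly missing
  stringsAsFactors = FALSE
)

# Proportion of missing values per variable.
prop_missing <- colMeans(is.na(meta))

# Number of distinct observed (non-missing) values per variable.
n_distinct <- vapply(
  meta,
  function(x) length(unique(stats::na.omit(x))),
  integer(1)
)

summary_tbl <- data.frame(
  variable     = names(meta),
  prop_missing = unname(prop_missing),
  n_distinct   = unname(n_distinct),
  stringsAsFactors = FALSE
)

# Assumed rule: flag a variable when more than half of its values are
# missing or when it has fewer than two distinct observed values.
max_missing <- 0.5
summary_tbl$drop <- summary_tbl$prop_missing > max_missing |
  summary_tbl$n_distinct < 2

print(summary_tbl)
```

A bar chart of prop_missing would give a rough visual equivalent of the missingness plot mentioned in the overview.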
The standardize_metadata function consists mostly of manual validation, so no external packages are required.
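As a rough idea of what such manual validation might look like, the base R sketch below recodes missing-value tokens to NA and converts each column according to a declared type. The token list, the types vector, and the ordinal_levels list are assumptions made for this example; they do not reflect the actual arguments or input format of standardize_metadata.

```r
# Hypothetical raw metadata with inconsistent missing-value tokens.
raw <- data.frame(
  age   = c("34", "UNKNOWN", "51", ""),
  sex   = c("F", "M", "UNDETERMINED", "F"),
  grade = c("low", "high", "N/A", "mid"),
  stringsAsFactors = FALSE
)

# Assumed declaration of one type per variable (illustrative format).
types <- c(age = "numeric", sex = "categorical", grade = "ordinal")

# Assumed level ordering for each ordinal variable.
ordinal_levels <- list(grade = c("low", "mid", "high"))

# Tokens treated as missing (an assumption; extend as needed).
na_tokens <- c("", "UNKNOWN", "UNDETERMINED", "N/A")

# Validate the declarations before converting anything.
stopifnot(
  setequal(names(types), names(raw)),
  all(types %in% c("categorical", "ordinal", "numeric"))
)

clean <- raw
for (v in names(clean)) {
  x <- trimws(clean[[v]])
  x[toupper(x) %in% na_tokens] <- NA  # common missing-value format
  clean[[v]] <- switch(
    types[[v]],
    numeric     = as.numeric(x),
    categorical = factor(x),
    ordinal     = factor(x, levels = ordinal_levels[[v]], ordered = TRUE)
  )
}

str(clean)
```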
The run_ca function uses the ours package for transformations, the missMDA package for imputation, and the ExPosition package for the correspondence analysis.
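To make the idea of a factor score matrix concrete, the sketch below runs a textbook correspondence analysis on a 0/1 indicator table using only base R and the singular value decomposition. It is not the ExPosition implementation and it skips the transformation and imputation steps; the toy metadata and all object names are hypothetical.

```r
# Hypothetical, already-clean categorical metadata (no missing values).
meta <- data.frame(
  sex   = factor(c("F", "M", "F", "M", "F", "M")),
  stage = factor(c("I", "I", "II", "II", "III", "III")),
  site  = factor(c("A", "A", "A", "B", "B", "B"))
)

# Recode one factor into 0/1 indicator columns, one column per level.
one_hot <- function(x, name) {
  m <- outer(as.character(x), levels(x), "==") * 1
  colnames(m) <- paste(name, levels(x), sep = ".")
  m
}
indicator <- do.call(cbind, unname(Map(one_hot, meta, names(meta))))
rownames(indicator) <- rownames(meta)

# Correspondence matrix with its row and column masses.
P  <- indicator / sum(indicator)
r  <- rowSums(P)
cm <- colSums(P)

# Standardized residuals from the independence model, then their SVD.
expected <- r %o% cm
S   <- (P - expected) / sqrt(expected)
dec <- svd(S)

# Keep only dimensions with a non-negligible singular value.
keep        <- dec$d > 1e-8
eigenvalues <- dec$d[keep]^2

# Row (observation) factor scores: left singular vectors scaled by the
# singular values and by the inverse square root of the row masses.
row_scores <- sweep(dec$u[, keep, drop = FALSE], 2, dec$d[keep], "*") / sqrt(r)
dimnames(row_scores) <- list(
  rownames(indicator),
  paste0("factor", seq_len(sum(keep)))
)

print(round(row_scores, 3))
```

Each row of row_scores is one observation and each column one factor, which is the general shape of matrix that could then be supplied as covariates when rerunning a differential expression analysis.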
The plot_components function makes use of the ExPosition package by using the epGraphs function to plot the top two components for both the variables and the observations.
The identify_elbow function makes use of the findElbowPoint function from PCAtools to computationally determine the optimal number of factors to extract.
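For intuition about what an elbow in a scree plot means, the base R sketch below uses one common heuristic: it picks the point that lies furthest from the straight line joining the first and last values. This is a generic illustration and may not match how findElbowPoint works internally; the eigenvalues and the find_elbow helper are hypothetical.

```r
# Hypothetical proportions of variance per factor, in decreasing order.
eig <- c(0.50, 0.25, 0.08, 0.06, 0.05, 0.04, 0.02)

# Return the index of the value furthest from the line that connects
# the first and last points of the curve.
find_elbow <- function(values) {
  n <- length(values)
  stopifnot(n >= 3)
  x  <- seq_len(n)
  dx <- x[n] - x[1]
  dy <- values[n] - values[1]
  # Perpendicular distance of every point from that line.
  dist <- abs(dy * (x - x[1]) - dx * (values - values[1])) /
    sqrt(dx^2 + dy^2)
  which.max(dist)
}

elbow <- find_elbow(eig)
print(elbow)
```

Because the distances depend on the scale of the values, a heuristic like this can disagree with a visual reading of the scree plot, which is one reason to check both.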
The plot_factor_scores function makes use of three packages: heatmaply to make the heatmap interactive, as well as grDevices and RColorBrewer to create a red to blue heatmap.
The analyze_factor function also does not depend on any external packages.
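Selecting observations by a score threshold needs nothing beyond base R subsetting, as the sketch below shows. The score matrix, the metadata, the select_by_score helper, and the choice of a "greater than or equal to" comparison are all assumptions for illustration rather than the behaviour of analyze_factor itself.

```r
# Hypothetical factor scores (rows = observations, columns = factors).
scores <- matrix(
  c( 1.8, -0.2,
     1.5,  0.1,
    -0.3,  0.9,
    -0.1, -1.2,
     1.6,  0.3),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("sample", 1:5), c("factor1", "factor2"))
)

# Hypothetical metadata for the same observations.
meta <- data.frame(
  sex  = c("F", "F", "M", "M", "F"),
  site = c("A", "B", "A", "B", "A"),
  row.names = rownames(scores)
)

# Return the metadata of observations whose score on one factor is at
# or above the threshold (assumed direction of the comparison).
select_by_score <- function(scores, meta, factor, threshold) {
  hits <- rownames(scores)[scores[, factor] >= threshold]
  meta[hits, , drop = FALSE]
}

group <- select_by_score(scores, meta, "factor1", threshold = 1)
print(group)

# Tabulate each variable within the group to look for shared values.
print(lapply(group, table))
```

If every selected observation shares one value of a variable, as happens for sex in this toy data, that variable is a candidate confounder worth examining.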
Beaton, D., Chin Fatt, C. R., & Abdi, H. (2014). An ExPosition of multivariate analysis with the singular value decomposition in R. Computational statistics & data analysis, 72, 176–189.
Blighe, K., & Lun, A. (2020). PCAtools: Everything Principal Components Analysis. R package version 2.2.0.
Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., et al. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, 17(1), 13.
Galili, T., O’Callaghan, A., Sidi, J., & Sievert, C. (2018). heatmaply: an R package for creating interactive cluster heatmaps for online publishing. Bioinformatics, 34(9), 1600–1602.
Josse, J., & Husson, F. (2016). missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software, 70(1), 1–31.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
Neuwirth, E. (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2.
Sunderland, K. M., Beaton, D., Fraser, J., Kwan, D., McLaughlin, P. M., Montero-Odasso, M., Peltsch, A. J., et al. (2019). The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project. BMC Medical Research Methodology, 19(1), 102.
Tummers, J., Speelman, D., & Geeraerts, D. (2012). Multiple Correspondence Analysis as heuristic tool to unveil confounding variables in corpus linguistics.
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.
This package was developed as part of an assessment for 2021 BCB410H: Applied Bioinformatics, University of Toronto, Toronto, CANADA.