In ahnjedid/MetaConIdentifier: Metadata Confounding Identifier

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Preprocessing stage

library(MetaConIdentifier)

Start by investigating the original TCGA metadata. Overall, we see a lot of clinically irrelevant variables, lack of annotation, and sporadic missingness, which is common in RNASeq metadata due to lack of quality control standards.

head(tcga_meta_original)

Let's run the investigation function provided in the package, which will provide a plot of the missingness.

meta_info <- investigate_metadata(tcga_meta_original, first_column_as_id = FALSE)

missingness_plot

With the metadata info generated, we will explore the full metadata. Column cgc_case_primary_therapy_outcome_success is of particular interest, as it represents the outcome of the treatment and condition after treatment. Despite a missingness percentage of over 80% and despite the previous algorithm recommending that it be omitted, it cannot be dropped as it is clinically very important.

meta_info$missing_percent_col[["cgc_case_primary_therapy_outcome_success"]]

We will look into the missingness of 6 other clinically relevant variables. Based on the results below, the other 6 variables look okay with less than 20% missingness.

for (var in tcga_variable_subset){
  print(paste(var, ": ", meta_info$missing_percent_col[[var]]))
}

Since we are keeping the primary variable, we will need to drop all subjects lacking a primary outcome value. In addition, due to the high number of subjects and vast number of different tissues analyzed in the TCGA library, it is most optimal to separate the metadata by their tissue type. cgc_case_primary_site represents the tissue type for each subject. Let's subset for prostate.

tcga_meta_sub <- tcga_meta_original[!is.na(tcga_meta_original$cgc_case_primary_therapy_outcome_success) , ]
tcga_meta_sub <- tcga_meta_sub[ toupper(tcga_meta_sub$cgc_case_primary_site) == "PROSTATE" , ]

Then, we can run the standardize algorithm. Looking at the clean metadata, all values in cgc_case_race are NA, so we should drop the variable.

meta_clean <- standardize_metadata(tcga_meta_sub, first_column_as_id = FALSE, variable_subset = tcga_variable_subset, variable_type_vec = tcga_variable_type_vec)

print(all(is.na(meta_clean$cgc_case_race)))

meta_clean <- meta_clean[ , -c(6)]

IMPORTANT: Every time the cleaned metadata is edited manually, it needs to be rerun through standardize_metadata() before running the correspondence analysis.

tcga_variable_subset <- tcga_variable_subset[-c(6)]
tcga_variable_type_vec <- tcga_variable_type_vec[-c(6)]

meta_clean <- standardize_metadata(meta_clean, first_column_as_id = FALSE, variable_subset = tcga_variable_subset, variable_type_vec = tcga_variable_type_vec)

Core function

With the metadata cleaned up and standardized, we can finally run the core function. It will perform transformations and imputation on the metadata as preprocessing steps beforehand as it is necessary to recode the metadata into one common format and into a variable type compatible with CA. Once this is complete, the CA will be run.

ca_info <- run_ca(tcga_meta_clean)

The return value will contain 3 objects:

ca_obj - The CA object returned from the ExPosition function. It is used to generate the component plots.
fi_var - A vector representing the percentage of explained variance. It is used to generate the scree plot.
fi_mat - A matrix of factor scores for the observations/rows. It is used to generate the intearctive heatmap of factor scores.

Exploratory analysis

The exploratory analysis starts by generating component plots for the variables and observations both. They will allow the user to identify whether there are any particular variable values that are grouped closely together, which may indicate potential confounding factors at play.

plot_components(ca_info$ca_obj)

component_plot

Observing the component plot, there are two defined groups of variable values.

Partial remission/response (treatment outcome) with three different source sites including University of Kansas, Melbourne Health, and University Medical Center Hamburg.
Stable disease (treatment outcome) with two different source sites including the ProCure BioBank, and Institute for Medical Research.

This indicates that the tissue source site may play a role in the outcome of prostate cancer treatment. This may be due to different reasons, whether it's environmental factors such as the climate of the location (assuming that subjects lived in a region near the site) or processing of the tissue itself. Overall, this component plot provides a first indication and sense of what kind of confounding factors may be at play.

Moving forward with the exploratory analysis, we will determine the optimal number of factors to extract from the matrix of factor scores. This is necessary as the objective is to capture as much variance as possible while selecting the fewest number of factors. A scree plot is generated to visualize the elbow manually along with a computationally derived value. If the numbers differ, following the scree plot is recommended.

num_factors <- identify_elbow(ca_info$fi_var)

print(num_factors)

scree_plot

Observing the scree plot, we see that the optimal number of factors is 6. The computationally derived value (30) differs from our value, so we will use the scree plot. It will serve as input to the plot_factor_scores function.

plot_factor_scores(ca_info$fi_mat, num_factors = 6)

factor_score_plot

Observing the factor score plot allows for a general idea of how many observations (positive and negative scores) are contributing to each factor. We see that factors 1 and 2 have many observations with negative factor scores, so let's analyze their metadata more carefully for any patterns. We will use a score threshold of 1 (so -1 for negative groups and 1 for positive groups).

factor_1 <- analyze_factor(tcga_meta_clean, ca_info$fi_mat, factor_num = 1, score_threshold = 1)

print(nrow(factor_1$positive_group))
print(nrow(factor_1$negative_group))

With no subjects in the positive group, we will only look at the negative group. Looking at the negative group below, we see that all participants have stable disease after treatment, the same histological diagnosis, and similar age group. The only variable value that differs is the source site.

print(unique(factor_1$negative_group$cgc_case_primary_therapy_outcome_success))
print(unique(factor_1$negative_group$cgc_case_histological_diagnosis))

print(factor_1$negative_group)

Looking at the source sites in more detail, it's clear that there's a correlation between Procure BioBank and stable disease after treatment. This lines up with the results of our component plot.

print(table(factor_1$negative_group$cgc_sample_tissue_source_site))

Analyzing factor 2, we see similar patterns to factor 1 in that there are only 2 observations in the positive group but 15 observations in the negative group.

factor_2 <- analyze_factor(tcga_meta_clean, ca_info$fi_mat, factor_num = 2, score_threshold = 1)

print(nrow(factor_2$positive_group))
print(nrow(factor_2$negative_group))

Looking at the negative group below, we see that all participants have partial remission/response to treatment, the same histological diagnosis, and similar age group. The only variable value that differs is the source site once again.

print(unique(factor_2$negative_group$cgc_case_primary_therapy_outcome_success))
print(unique(factor_2$negative_group$cgc_case_histological_diagnosis))

print(factor_2$negative_group)

Looking at the source sites in more detail, it's clear that there's a correlation between University Medical Center Hamburg and partial remission/response after treatment. This lines up with the results of our component plot as well.

print(table(factor_2$negative_group$cgc_sample_tissue_source_site))

Conclusion and next steps

Based on the results, we know that there are two possible correlations:

Procure BioBank and stable disease after treatment.
University Medical Center Hamburg and partial remission/response after treatment.

However, this may or may not indicate causation. This is where the scope of the package ends, as it is up to the user to determine whether there are latent variables at play or if it's merely a coincidence. If it's the former, then differential expression analysis should be rerun using the matrix of factor scores (e.g. DESeq 2 Likelihood Ratio Test) and reducing for source site. If it's the latter, then the current differential expression analysis results will suffice. Nonetheless, these results allow for validation and/or improve the quality and robustness of differential expression analyses.

Package References

[Ahn, J. (2021) MetaConIdentifier: An R package for identifying potential confounding factors in RNASeq metadata. Unpublished.] (https://github.com/ahnjedid/MetaConIdentifier)

Other References

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/