Introduction to xCell 2.0 and Key Features

Cell type enrichment analysis and cellular deconvolution are computational techniques aimed to decipher the cellular heterogeneity in bulk transcriptomics samples.

xCell2 is an R package for cell type enrichment analysis that builds upon the original xCell methodology (Aran et al., 2017), offering new features and improved performance.

This vignette introduces the key features of xCell2 and provides step-by-step guidance for utilizing its functions and performing cell type enrichment analysis. Whether you are new or an experienced bioinformatician, this vignette will help you effectively use xCell2 to advance your research.

For more details on xCell2 and one of its applications in predicting immune checkpoint blockade responses, see the xCell2 preprint (Angel et al., 2024).

Custom Reference Training

The most significant advancements in xCell2 is its flexibility, allowing users to train custom reference objects tailored to their specific datasets.

Unlike its predecessor, xCell2 supports references derived from various data types, including single-cell RNA-Seq (scRNA-Seq), bulk RNA-Seq, and microarrays. This adaptability makes it possible to generate reference objects that align closely with the biological context and experimental conditions of your study.

Advantages of Custom Reference Training

Training a custom reference offers several advantages:

  1. Compatibility with Diverse Data Types: Whether your data originates from scRNA-Seq, bulk RNA-Seq, or microarrays, xCell2 can adapt to the unique characteristics of each platform.

  2. Biological Context Relevance: Leveraging data tailored to your organism, tissue type, or experimental condition ensures that the resulting xCell2 reference object accurately reflects the relevant cellular diversity and characteristics.

  3. Improved Accuracy: Users can further improve the precision of enrichment analysis by training an xCell2 reference object that include only the genes shared with their bulk mixture. This ensures that the cell type signatures are exclusively constructed from genes present in the bulk mixture, addressing potential limitations of pre-trained references where some signature genes may be absent.

  4. Expanding Repository of References: Custom xCell2 reference objects contributed by the developers and the community will be curated and be accessible within the package. Those references will makes it convenient to conduct cross-analyses for the same bulk mixtures by simply loading references that are customized and shared by researchers in your field.

When to Use Pre-trained References

Deciding whether to use a pre-trained reference or to train your own custom reference depends on the specifics of your study and the data you are analyzing. Below, we outline key considerations where pre-trained references may be more suitable.

Ontological Integration

Another of the key advancements in xCell2 is its ability to leverage cell ontology to account for hierarchical relationships between cell types.

What is Ontological Integration?

Ontological integration uses a structured vocabulary, such as the Cell Ontology (CL), to map relationships between cell types. These relationships may include:
- Ancestry: A broader cell type (e.g., "T cells") encompassing a more specific cell type (e.g., "CD4+ T cells").
- Functional Hierarchies: Specialized subtypes within a functional group (e.g., "Memory CD4+ T cells" as a subset of "CD4+ T cells").

By incorporating these relationships, xCell2 considers biological lineage when generating cell type signatures and ensures that spillover correction respects the hierarchical dependencies of related cell types.

How Does it Work?

When useOntology is set to TRUE (default), xCell2Train:
1. Identifies cell type dependencies in the reference using lineage relationships.
2. Uses this information to adjust gene signatures generation.
3. Incorporates the dependencies during spillover correction to prevent correction between lineage-related cell types.

Generating Lineage Information

Users can manually inspect and refine lineage relationships using the xCell2GetLineage function. xCell2GetLineage generates a file with descendants and ancestors for each cell type, which can be reviewed and provided to the xCell2Train function via the lineageFile parameter.

Disabling Ontological Integration

While ontological integration is recommended, it can be disabled by setting useOntology to FALSE.

This may be suitable for:

For more information on the implementation and benefits of ontological integration, refer to the xCell2 paper.

Spillover Correction {#how-does-xcell2-correct-spillover}

Spillover correction is a useful feature of the xCell and xCell2 algorithms, addressing a challenge in cell type enrichment analysis.

What is Spillover?

Spillover occurs when the same signature genes are associated with multiple cell types, especially those that are closely related. For instance, some signature genes of CD4+ T cells may also be expressed in CD8+ T cells. This overlap can artificially inflate the enrichment scores of CD8+ T cells, reducing the specificity of the analysis.

How Does xCell2 Correct Spillover?

  1. Simulating Scores: Synthetic mixtures are generated to model the spillover of the enrichment scores between all cell types in the reference.

  2. Creating a Spillover Matrix: This matrix captures the overall degree to which the enrichment score of one cell type influenced by the signatures of the others.

  3. Incorporating Ontology: When ontological integration is enabled, xCell2 uses lineage relationships between cell types to avoid spillover correction. For example, no correction is preformed between parent and child cell types (e.g., T cells and CD4+ T cells).

  4. Applying Spillover Compensation: A linear compensation method, similar to approaches used in flow cytometry, adjusts the enrichment scores. This reduces the contribution of shared genes while preserving the unique expression signals of each cell type.

  5. Maintaining Balance: Overcorrection is avoided by controlling the correction strength with the spilloverAlpha parameter.

Controlling Spillover Correction Strength

The strength of spillover correction in xCell2 can be controlled using the spilloverAlpha parameter:

Practical Considerations

For further details, refer to the first xCell paper

Installation

From Bioconductor (Coming Soon)

To install xCell2 from Bioconductor, use:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
 install.packages("BiocManager")
}
BiocManager::install("xCell2")

From GitHub (Development Version)

To install the development version from GitHub, use:

if (!requireNamespace("devtools", quietly = TRUE)) {
 install.packages("devtools")
}
devtools::install_github("AlmogAngel/xCell2")

After installation, load the package with:

library(xCell2)

Creating a Custom Reference with xCell2Train

The most significant advancement of xCell2 is the ability to create custom reference objects tailored to your specific research needs. This section will guide you through the process of generating a custom xCell2 reference object using the xCell2Train function.

Why Create a Custom Reference?

Training a custom reference object allows you to:

See also - Custom Reference Training

Preparing the Input Data

Before using xCell2Train, you need to prepare the input data:

1. Reference Gene Expression Matrix

2. Labels Data Frame

This data frame contain information about each sample or cell in your reference.

It should include the following columns:

Using SummarizedExperiment or SingleCellExperiment Objects

You can provide a SummarizedExperiment or SingleCellExperiment object as input.

Assays:
- Gene expression data should be stored in one of the following assay slots:
- "tpm" (Transcripts Per Million, recommended)
- "logcounts" (log-transformed counts)
- "normcounts" (normalized counts)
- "counts" (raw counts).

Metadata:
- Sample metadata (labels) should be stored in colData(object) as a data frame with 4 required columns:
- "ont" - Ontology ID (e.g., "CL:0000236"). Use NA if unavailable.
- "label" - Cell type name.
- "sample" - Sample or cell identifier matching column names in the assay.
- "dataset" - Dataset source for each sample (use a constant value if not applicable).

Notes:
- Labels with underscores ("_") in cell type names are automatically replaced with dashes ("-").
- Microarray references must use "counts" as the assay slot.

Example: preparing the input data {#example-preparing-the-input-data}

Let's walk through an example using a subset of the Database of Immune Cell Expression (DICE):

library(xCell2)

# Load the demo data
data(dice_demo_ref, package = "xCell2")

# Extract reference gene expression matrix
# (Note that values are in TPM but log transformed)
dice_ref <- SummarizedExperiment::assay(dice_demo_ref, "logcounts")
colnames(dice_ref) <- make.unique(colnames(dice_ref)) # Ensure unique sample names

# Extract reference metadata
dice_labels <- SummarizedExperiment::colData(dice_demo_ref)
dice_labels <- as.data.frame(dice_labels) # "label" column already exists

# Prepare labels data frame (the "label" column already exist)
dice_labels$ont <- NA # Here we assign the cell ontology (next section)
dice_labels$sample <- colnames(dice_ref)
dice_labels$dataset <- "DICE"

Assigning Cell Type Ontology (optional but recommended) {#assigning-ontologies}

Assigning cell type ontologies improves the quality of signatures generated by xCell2Train. This step is essential for accurate identification of cell type spesific genes, as it helps account for lineage dependencies of related cell types (e.g., "T cells, CD4+" and "T cells, CD4+, memory").

When to Skip Ontology Assignment

You may skip this step if:

Assigning Ontologies

You can assign cell type ontologies using a controlled vocabulary like the Cell Ontology (CL):

  1. Visit the EMBL-EBI Ontology Lookup Service.
  2. Search for your cell type (e.g., "T CD4+ memory").
  3. Identify the best match (e.g., "CD4-positive, alpha-beta memory T cell" with ID "CL:0000897").
  4. Assign the appropriate ontology ID to each cell type in the "labels" data frame under the "ont" column.

Example: assigning cell type ontology {#example-assigning-cell-type-ontology}

Run previous example first.

# Assign ontologies to cell types
dice_labels[dice_labels$label == "B cells", ]$ont <- "CL:0000236"
dice_labels[dice_labels$label == "Monocytes", ]$ont <- "CL:0000576"
dice_labels[dice_labels$label == "NK cells", ]$ont <- "CL:0000623"
dice_labels[dice_labels$label == "T cells, CD8+", ]$ont <- "CL:0000625"
dice_labels[dice_labels$label == "T cells, CD4+", ]$ont <- "CL:0000624"
dice_labels[dice_labels$label == "T cells, CD4+, memory", ]$ont <- "CL:0000897"

Example: checking lineage relationships {checking-lineage-relationships}

Use the xCell2GetLineage function to identify and inspect cell type dependencies based on the Cell Ontology tree. This function can either output a file in TSV format or return a list mapping cell types to their ancestors and descendants.

xCell2GetLineage(labels = dice_labels, outFile = "demo_dice_dep.tsv") 

After running the command:

  1. Open the generated file (demo_dice_dep.tsv) in a text editor or spreadsheet tool.
  2. Verify the lineage assignments. For example:
  3. "T cells, CD4+, memory" should be assigned as a descendant of "T cells, CD4+".
  4. Manually adjust the file if necessary, and save the changes.

The lineage file can then be used in the xCell2Train function via the lineageFile parameter to enhance the signatures generation process with user-curated dependencies.

lineage_list <- xCell2GetLineage(labels = dice_labels)

Interpreting the Output

If a TSV file is generated (outFile provided):

If a list is returned (no outFile specified):

Generating the xCell2 Reference Object

Once your inputs are prepared, you can create a custom xCell2 reference object using the xCell2Train function. This object will serve as the basis for cell type enrichment analysis in your downstream workflows and is a required input for the xCell2Analysis function.

Key Parameters of xCell2Train

Example: generating xCell2 reference object

The following code demonstrates how to generate a reference object. Before running this example make sure you have dice_ref and dice_lables loaded in R by running previous examples of "Preparing the Input Data" and "Assigning Cell Type Ontology".

library(BiocParallel)
parallel_param <- MulticoreParam(workers = 2) # Adjust workers as needed

set.seed(123) # (optional) For reproducibility

DICE_demo.xCell2Ref <- xCell2Train(
  ref = dice_ref,
  labels = dice_labels,
  refType = "rnaseq",
  BPPARAM = parallel_param
)

Why Use set.seed()?

The set.seed() function is used to ensure reproducibility. This is only relevant when working with scRNA-Seq data, as pseudo-bulk samples are generated by randomly sampling cells. Setting a seed ensures that the same random cells are selected each time the code is run, allowing you to replicate results and maintain consistency across analyses. If you are not working with scRNA-Seq data or if exact reproducibility is not a concern, setting the seed is optional.

Sharing Your Custom xCell2 Reference Object

Sharing your custom xCell2 reference object with the scientific community can greatly benefit researchers working in similar fields. Here’s how you can contribute:

  1. Save Your Reference Object
    Save your newly generated using the recommended saveRDS() function:

    r saveRDS(DICE_demo.xCell2Ref, file = "DICE_demo.xCell2Ref.rds")

  2. Upload to the xCell2 References Repository

    • Visit the xCell2 References Repository
    • Navigate to the "references" directory
    • Click "Add file" > "Upload files" and select your .rds xCell2 reference object file
  3. Update the README

    To help others understand and use your reference:

    • Return to the main page of the xCell2 References Repository
    • Open the README.md file
    • Click the pencil icon (Edit this file) in the top right corner
    • Add an entry to the references table
  4. Submit a Pull Request

    • Click on "Commit changes..."
    • Select "Create a new branch for this commit and start a pull request"
    • Click "Propose changes"
    • On the next page, click "Create pull request"

The repository maintainers will review your submission and merge it if everything is in order.

By sharing your custom reference, you contribute to the scientific community. Your reference object could help researchers in related fields achieve more accurate and relevant results in their studies.

Next Steps

After creating your custom reference, you can use it for cell type enrichment analysis with xCell2Analysis. We'll cover this in the next section.

Using Pre-trained xCell2 References

xCell2 provides a selection of pre-trained reference objects that can be directly used in your analysis. These references cover diverse cell types and are based on well-curated datasets. Pre-trained references are ideal for researchers who prefer not to train custom references or need a starting point for their analysis.

Available Pre-trained References

The following table provides an overview of the available pre-trained references as to January 2025.

| Dataset | Study | Species | Normalization | nSamples/Cells | nCellTypes | Platform | Tissues | |---------|---------|---------|---------|---------|---------|---------|---------| | BlueprintEncode | Martens JHA and Stunnenberg HG (2013), The ENCODE Project Consortium (2012), Aran D (2019) | Homo Sapiens | TPM | 259 | 43 | RNA-seq | Mixed | | ImmGenData | The Immunological Genome Project Consortium (2008), Aran D (2019) | Mus Musculus | RMA | 843 | 19 | Microarray | Immune/Blood | | Immune Compendium | Zaitsev A (2022) | Homo Sapiens | TPM | 3626 | 40 | RNA-seq | Immune/Blood | | LM22 | Chen B (2019) | Homo Sapiens | RMA | 113 | 22 | Microarray | Mixed | | MouseRNAseqData | Benayoun B (2019) | Mus Musculus | TPM | 358 | 18 | RNA-seq | Mixed | | Pan Cancer | Nofech-Mozes I (2023) | Homo Sapiens | Counts | 25084 | 29 | scRNA-seq | Tumor | | Tabula Muris Blood | The Tabula Muris Consortium (2018) | Mus Musculus | Counts | 11145 | 6 | scRNA-seq | Bone Marrow, Spleen, Thymus | | Tabula Sapiens Blood | The Tabula Sapiens Consortium (2022) | Homo Sapiens | Counts | 11921 | 18 | scRNA-seq | Blood, Lymph_Node, Spleen, Thymus, Bone Marrow | | TME Compendium | Zaitsev A (2022) | Homo Sapiens | TPM | 8146 | 25 | RNA-seq | Tumor |

Accessing Pre-trained References

From the Package Popular pre-trained references are included directly in the xCell2 package. You can load them using the data() function:

# Load the BlueprintEncode reference
data(BlueprintEncode.xCell2Ref, package = "xCell2")

From the xCell2 References Repository You can download additional pre-trained references from the xCell2 References Repository.

Use the following code to download a pre-trained reference within R:

# Set the URL of the pre-trained reference
ref_url <- "https://dviraran.github.io/xCell2refs/references/BlueprintEncode.xCell2Ref.rds"
# Set the local filename to save the reference
local_filename <- "BlueprintEncode.xCell2Ref.rds"
# Download the file
download.file(ref_url, local_filename, mode = "wb")
# Load the downloaded reference
BlueprintEncode.xCell2Ref <- readRDS(local_filename)

Choosing the Right Reference

Selecting an appropriate reference is crucial for accurate cell type enrichment analysis. Consider the following:

Next Steps

After loading a pre-trained reference, you can proceed to perform cell type enrichment analysis with the xCell2Analysis function. For details, see the Performing Cell Type Enrichment Analysis section.

Calculating Cell Type Enrichment with xCell2Analysis

After training or obtaining an xCell2 reference object, the next step is to use it for cell type enrichment analysis on your bulk transcriptomics data. This section will guide you through using the xCell2Analysis function and interpreting its results.

Preparing the Input Data {#preparing-the-input-data2}

Before running the analysis, ensure you have:

  1. An xCell2 reference object (see Creating a Custom Reference or Using Pre-trained References).
  2. A bulk gene expression matrix to analyze.

Ensuring Gene Compatibility Between Bulk Mixture and Reference

For accurate cell type enrichment analysis, it is essential to verify compatibility between the genes in your bulk mixture (mix) and the reference object (xcell2object). Compatibility means that the gene nomenclature (e.g., gene symbols, HUGO IDs, Ensembl IDs) match and there is sufficient overlap between the genes in the bulk mixture and the reference.

The getGenesUsed function allows you to access the genes used for the reference training:

library(xCell2)
data("DICE_demo.xCell2Ref", package = "xCell2")
genes_ref <- getGenesUsed(DICE_demo.xCell2Ref)
genes_ref[1:10]

You can calculate the percentage of overlapping genes between your bulk expression matrix and the reference. For example:

# Load demo bulk gene expression mixture
data("mix_demo", package = "xCell2")
genes_mix <- rownames(mix_demo)

# Calculate the percentage of overlapping genes
overlap_percentage <- round(length(intersect(genes_mix, genes_ref)) / length(genes_ref) * 100, 2)
print(paste0("Overlap between mixture and reference genes is: ", overlap_percentage, "%"))

Ensure the overlap is sufficient (default minimum: 90%). You can adjust this threshold in xCell2Analysis using the minSharedGenes parameter:

xcell2_results <- xCell2Analysis(
  mix = mix_demo,
  xcell2object = DICE_demo.xCell2Ref,
  minSharedGenes = 0.8 # Allow for a lower overlap threshold
)

If your expression matrix and the reference share a low percentage of common genes, consider creating a custom reference using the xCell2Train function. Custom references ensure better alignment between your bulk mixture and the reference, leading to more accurate enrichment results.

Key Parameters of xCell2Analysis

Calculating Cell Type Enrichment

To calculate the cell type enrichment, use the following command*:

xcell2_results <- xCell2Analysis(
 mix = mix_demo,
 xcell2object = DICE_demo.xCell2Ref
)

xcell2_results

The xCell2Analysis function return a matrix of cell type enrichment scores with the following structure:

Understanding Enrichment Scores

  1. Enrichment Scores Are Not Proportions:
  2. While the raw enrichment scores in xCell2 undergo a linear transformation that makes them resemble proportions, they are not actual proportions.
  3. The transformation ensures that scores are on a linear scale within the range of 0 to 1, but they might not sum to 1 across all cell types in a sample.

  4. Differences from Cellular Deconvolution Methods:

  5. Cellular deconvolution methods typically model bulk expression data as a linear combination of all reference cell type profiles, which forces the estimated proportions to sum to 1.
  6. However, this assumption is often unrealistic, as reference datasets rarely capture all cell type in a complex tissue sample. Missing cell types in the reference can lead to skewed or inaccurate proportions.

  7. Interpreting the Scale of Scores:

  8. Enrichment scores fall between 0 and 1, where higher scores indicate a stronger enrichment of a cell type in the corresponding sample relative to the others.
  9. Use the scores to compare cell type compositions across samples and identify differences rather than report their absolute value.

Advanced Analysis with xCell2 Results

The output of xCell2Analysis, a matrix of cell type enrichment scores, can be utilized for a variety of advanced analyses to uncover valuable biological insights. Below are examples for some common applications.

Normalizing Enrichment Scores by Tumor Purity

What is Tumor Purity? Tumor purity refers to the proportion of tumor cells within a sample, compared to the surrounding non-tumor cells (the tumor microenvironment), such as immune and stromal cells. Tumor purity typically varies across samples and can influence the interpretation of enrichment scores.

Why Normalize by Tumor Purity? In tumor samples, differences in tumor purity can skew the enrichment scores of non-tumor components like immune or stromal cells. For example, a sample with high tumor purity (e.g., 80%) contains a smaller proportion of tumor microenvironment (TME) cells compared to a sample with lower tumor purity (e.g., 40%). This makes direct comparisons of immune cell enrichment scores between these samples challenging. By normalizing the enrichment scores by tumor purity, we can isolate and compare the cellular composition of the TME alone, providing a clearer picture of immune or stromal cell dynamics.

How to Obtain Tumor Purity Values Tumor purity values can be estimated using external tools or methods. Some common approaches include: - Genomic Analysis Tools: Tools like ABSOLUTE or ESTIMATE can compute tumor purity values based on DNA sequencing or gene expression data. - Published Datasets: Tumor purity estimates are available for many publicly available datasets, such as The Cancer Genome Atlas (TCGA). - Future xCell2 Feature: In a future update, xCell2 will introduce a feature to compute the tumor microenvironment (TME) score. This score has been shown to have a negative correlation with tumor purity, as demonstrated in the study "Systematic pan-cancer analysis of tumour purity" (DOI: 10.1038/ncomms9971).

Normalization Formula The normalized enrichment score for a cell type is calculated as:

[ \text{Normalized Score} = \frac{\text{Enrichment Score}}{1 - \text{Tumor Purity}} ] where ( 1 - \text{Tumor Purity} ) represents the fraction of non-tumor cells in the sample.

Interpreting Normalized Scores After normalization, the enrichment scores reflect the relative abundance of the cell type within the TME, accounting for the sample’s tumor purity. This adjustment allows for more meaningful comparisons of immune or stromal cell enrichment between samples with varying tumor purity levels. For example, in the plot above, normalized scores for CD8+ T cells highlight differences that might otherwise be obscured in raw scores due to variations in tumor purity.

Comparing Cellular Enrichment Across Conditions

For one cell type

Compare enrichment scores between groups (e.g., healthy vs. diseased) using statistical tests like the Wilcoxon rank-sum test.

library(ggplot2, quietly = TRUE)

# Generate pseudo enrichment scores for demonstration
set.seed(123)
conditions <- c(rep("Healthy", 5), rep("Disease", 5))
enrichment_scores <- data.frame(
  Condition = conditions,
  T_cells_CD8 = c(runif(5, 0.4, 0.6), runif(5, 0.6, 0.8)),
  T_cells_CD4 = c(runif(5, 0.3, 0.5), runif(5, 0.5, 0.7))
)

# Wilcoxon rank-sum test for CD8+ T cells
wilcox_test <- wilcox.test(T_cells_CD8 ~ Condition, data = enrichment_scores)
print(wilcox_test)

# Boxplot with p-value
ggplot(enrichment_scores, aes(x = Condition, y = T_cells_CD8)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  labs(
    title = "CD8+ T Cells Enrichment Across Conditions",
    x = "Condition", y = "Enrichment Score"
  ) +
  annotate("text", x = 1.5, y = max(enrichment_scores$T_cells_CD8), 
           label = paste("p =", signif(wilcox_test$p.value, 3)))

For many cell types

Visualize differential cell type enrichment between conditions using a volcano plot with Wilcoxon rank-sum test.

library(EnhancedVolcano, quietly = TRUE)
library(dplyr, quietly = TRUE, verbose = FALSE)

# Simulate enrichment scores for 50 cell types
set.seed(123) 
cell_types <- paste0("CellType_", 1:50)

enrichment_scores_many <- data.frame(
  Sample = paste0("Sample", 1:10),
  Condition = factor(rep(c("Healthy", "Disease"), each = 5))
)

for (cell_type in cell_types) {
  # Healthy group: random around baseline
  healthy <- rnorm(5, mean = 0.5, sd = 0.1)
  # Disease group: add variability and directional shifts
  disease <- rnorm(5, mean = sample(c(0.4, 0.5, 0.6), 1), sd = 0.1)
  enrichment_scores_many[[cell_type]] <- c(healthy, disease)
}

special_cells <- sample(cell_types, 2)
for (cell_type in special_cells) {
  enrichment_scores_many[1:5, cell_type] <- rnorm(5, mean = 0.4, sd = 0.05) # Lower in healthy
}

# Calculate log fold change (LFC) and p-values for each cell type using Wilcoxon test
differential_results <- enrichment_scores_many %>%
  tidyr::pivot_longer(cols = starts_with("CellType"), names_to = "CellType", values_to = "Enrichment") %>%
  group_by(CellType) %>%
  summarise(
    LFC = log2(median(Enrichment[Condition == "Disease"]) / median(Enrichment[Condition == "Healthy"])),
    p_value = wilcox.test(Enrichment ~ Condition)$p.value
  ) %>%
  ungroup()

# Prepare data for EnhancedVolcano
volcano_data <- differential_results %>%
  mutate(
    neg_log10_p = -log10(p_value),
    Significance = ifelse(p_value < 0.05, "Significant", "Not Significant")
  ) %>%
  rename(
    log2FC = LFC,
    pval = p_value
  )

# Plot using EnhancedVolcano
EnhancedVolcano(
  volcano_data,
  lab = volcano_data$CellType,
  x = "log2FC",
  y = "pval",
  pCutoff = 0.05,
  FCcutoff = 0.5,
  title = "Volcano Plot of Cell Type Enrichment",
  subtitle = "Simulated Data for 50 Cell Types",
  xlab = "Log2 Fold Change (LFC)",
  ylab = "-Log10(p-value)",
  pointSize = 2.0,
  labSize = 3.0,
  xlim = c(-1, 1),
  ylim = c(0, 3)
)

Correlating Enrichment Scores with Clinical or Molecular Features

Correlate enrichment scores with clinical outcomes or molecular features, such as mutation burden.

# Simulated clinical feature (e.g., mutation burden)
set.seed(123)

mutation_burden <- runif(10, min = 0, max = 100)

# Calculate Pearson and Spearman correlation
pearson_corr <- cor.test(enrichment_scores$T_cells_CD8, mutation_burden, method = "pearson")
spearman_corr <- cor.test(enrichment_scores$T_cells_CD8, mutation_burden, method = "spearman")

# Scatterplot of the correlation with annotations for Pearson and Spearman coefficients
ggplot(enrichment_scores, aes(x = mutation_burden, y = T_cells_CD8)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Correlation Between Mutation Burden and CD8+ T Cells Enrichment",
    x = "Mutation Burden",
    y = "CD8+ T Cells Enrichment Score"
  ) +
  annotate("text", x = 15, y = 0.7, 
           label = paste0("Pearson: ", signif(pearson_corr$estimate, 3), 
                          "\nP-value: ", signif(pearson_corr$p.value, 3)),
           hjust = 0, color = "darkred") +
  annotate("text", x = 15, y = 0.65, 
           label = paste0("Spearman: ", signif(spearman_corr$estimate, 3), 
                          "\nP-value: ", signif(spearman_corr$p.value, 3)),
           hjust = 0, color = "darkgreen")

Clustering and Dimension Reduction

Use enrichment scores to group samples and visualize their cellular composition landscape.

# Simulated enrichment scores for demonstration
set.seed(123) 
enrichment_scores <- data.frame(
  Sample = paste0("Sample", 1:10),
  Condition = rep(c("Healthy", "Disease"), each = 5),
  CD8_T_cells = runif(10, min = 0.1, max = 0.8),
  B_cells = runif(10, min = 0.2, max = 0.6),
  NK_cells = runif(10, min = 0.05, max = 0.5)
)

# Perform PCA on the simulated enrichment scores
pca <- prcomp(enrichment_scores[, 3:5], scale. = TRUE)

# Create a data frame for PCA results
pca_df <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  Condition = enrichment_scores$Condition
)

# Visualize the first two principal components
ggplot(pca_df, aes(x = PC1, y = PC2, color = Condition)) +
  geom_point(size = 3) +
  labs(
    title = "PCA of Enrichment Scores",
    x = "PC1 (Principal Component 1)",
    y = "PC2 (Principal Component 2)"
  ) +
  theme_minimal()

Perform k-means clustering on the original enrichment scores

kmeans_result <- kmeans(enrichment_scores[, 3:5], centers = 2, nstart = 25)

# Add cluster assignment to the PCA results
pca_df <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  Condition = enrichment_scores$Condition,
  Cluster = as.factor(kmeans_result$cluster)
)

# Visualize the PCA results with k-means clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster, shape = Condition)) +
  geom_point(size = 3) +
  labs(
    title = "PCA of Enrichment Scores with K-Means Clusters",
    x = "PC1 (Principal Component 1)",
    y = "PC2 (Principal Component 2)"
  ) +
  scale_color_manual(values = c("1" = "blue", "2" = "red")) +
  theme_minimal()

Using Enrichment Scores as Features for Predictive Modeling

Use enrichment scores as predictive features in machine learning models to classify conditions.

library(randomForest, quietly = TRUE)

# Simulated enrichment scores for demonstration
set.seed(123)

enrichment_scores <- data.frame(
  Sample = paste0("Sample", 1:10),
  Condition = factor(rep(c("Healthy", "Disease"), each = 5)),
  CD8_T_cells = runif(10, min = 0.1, max = 0.8),
  B_cells = runif(10, min = 0.2, max = 0.6),
  NK_cells = runif(10, min = 0.05, max = 0.5)
)

# Train a random forest model
rf_model <- randomForest(Condition ~ CD8_T_cells + B_cells + NK_cells, 
                         data = enrichment_scores, 
                         ntree = 500, importance = TRUE)


# Assess variable importance
var_importance <- data.frame(
  Variable = rownames(importance(rf_model)),
  Importance = importance(rf_model)[, 1]
)

# Plot variable importance
ggplot(var_importance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Feature Importance from Random Forest Model",
    x = "Features",
    y = "Importance Score"
  ) +
  theme_minimal()

Parallelization in xCell2

xCell2 leverages the BiocParallel package for parallel processing, enabling computationally intensive tasks to run faster by utilizing multiple processors.

By default, xCell2 performs computations sequentially using BiocParallel::SerialParam.

To enable parallel processing, users can specify a BiocParallelParam object via the BPPARAM argument in functions like xCell2Train and xCell2Analysis.

To use parallelization, simply create a BiocParallelParam instance:

library(BiocParallel)

# Example: Setting up parallel processing with 2 workers
BPPARAM <- MulticoreParam(workers = 2) # Adjust the number of workers as needed

If you are using an operating system where MulticoreParam is unavailable (e.g., certain Windows configurations), SnowParam can be used as an alternative:

BPPARAM <- SnowParam(workers = 2)

Example: Generating the xCell2 Reference Object with Parallelization

# Generate the xCell2 reference object with parallel processing
DICE_demo.xCell2Ref <- xCell2Train(
  ref = dice_ref,
  labels = dice_labels,
  refType = "rnaseq",
  BPPARAM = BPPARAM
)

Example: Generating the xCell2 Reference Object with Parallelization

# Perform cell type enrichment analysis with parallel processing
xcell2_results <- xCell2Analysis(
  mix = mix_demo,
  xcell2object = DICE_demo.xCell2Ref,
  BPPARAM = BPPARAM
)

For more information on BiocParallel and its options, visit the BiocParallel package documentation.

Citing xCell2

If you use xCell2 in your research, please cite: Angel A, Naom L, Nabel-Levy S, Aran D. xCell 2.0: Robust Algorithm for Cell Type Proportion Estimation Predicts Response to Immune Checkpoint Blockade. bioRxiv 2024.

Referece

R Session Info

sessionInfo()


AlmogAngel/xCell2 documentation built on April 13, 2025, 6:15 a.m.