Table of Contents

  1. Facilitate Gene Naming Conversions
  2. Compute Statistics Grouped by Clusters
  3. Assess Proportion of Positive Cells in Clusters
  4. Run Standard Seurat Pipeline

Facilitate Gene Naming Conversions {#facilitate-gene-naming-conversions}

Introduction

Gene naming conventions can vary significantly between organisms and databases, presenting a common challenge in scRNA-seq data analysis. SeuratExtend includes several functions to facilitate the conversion between human and mouse gene symbols and Ensembl IDs, as well as conversions between human and mouse homologous gene symbols. These functions leverage the biomaRt database for conversions but improve on reliability and performance by localizing the most commonly used databases, thus eliminating the need for internet connectivity and addressing the frequent instability issues with biomaRt.

Functions for Gene Naming Conversions

The functions provided for these conversions are:

These functions share a similar usage pattern, as detailed below using HumanToMouseGenesymbol as an example.

Getting Started with Examples

First, let's retrieve a few human gene symbols from a dataset as an example:

library(Seurat)
library(SeuratExtend)

human_genes <- VariableFeatures(pbmc)[1:6]
print(human_genes)

Default Usage

By default, HumanToMouseGenesymbol returns a data frame showing how human gene symbols (HGNC) match with mouse gene symbols (MGI):

HumanToMouseGenesymbol(human_genes)

This table indicates that not all human genes have direct mouse homologs, and some human genes may correspond to multiple mouse genes.

Simplified Output

If you prefer a simpler vector output without the matching details:

HumanToMouseGenesymbol(human_genes, match = FALSE)

One-to-One Correspondence

For cases where you require a one-to-one correspondence:

HumanToMouseGenesymbol(human_genes, keep.seq = TRUE)

Converting Gene Expression Matrices

These functions can also directly convert human gene expression matrices to their mouse counterparts:

# Create an example gene expression matrix
human_matr <- GetAssayData(pbmc)[human_genes, 1:4]
print(human_matr)

# Convert to a mouse gene expression matrix
HumanToMouseGenesymbol(human_matr)

Other Gene Naming Conversion Functions

The usage patterns for the other conversion functions in SeuratExtend, such as MouseToHumanGenesymbol, GenesymbolToEnsembl, and EnsemblToGenesymbol, are similar to those already discussed. These functions also leverage local databases to enhance performance and reliability but provide options to use online databases via biomaRt if necessary.

Here are some examples demonstrating the use of other gene naming conversion functions:

# Converting mouse gene symbols to human
mouse_genes <- c("Cd14", "Cd3d", "Cd79a")
MouseToHumanGenesymbol(mouse_genes, match = FALSE)

# Converting human gene symbols to Ensembl IDs
human_genes <- c("PPBP", "LYZ", "S100A9", "IGLL5", "GNLY", "FTL")
GenesymbolToEnsembl(human_genes, spe = "human", keep.seq = TRUE)

# Converting mouse gene symbols to Ensembl IDs
GenesymbolToEnsembl(mouse_genes, spe = "mouse", keep.seq = TRUE)

# Converting Ensembl IDs to human gene symbols
EnsemblToGenesymbol(c("ENSG00000163736", "ENSG00000090382"), spe = "human", keep.seq = TRUE)

# Converting Ensembl IDs to mouse gene symbols
EnsemblToGenesymbol(c("ENSMUSG00000051439", "ENSMUSG00000032094"), spe = "mouse", keep.seq = TRUE)

Using Online Resources

While SeuratExtend typically uses localized databases for conversions, you have the option to directly fetch results from biomaRt databases if required. This can be useful when working with less common genes or newer annotations not yet available in the local database:

# Fetching Ensembl IDs for human genes directly from biomaRt
GenesymbolToEnsembl(human_genes, spe = "human", local.mode = FALSE, keep.seq = TRUE)

Converting UniProt IDs to Gene Symbols

In addition to facilitating gene symbol and Ensembl ID conversions between human and mouse, SeuratExtend also includes functionality to convert UniProt IDs, which are widely used in proteomic databases, to gene symbols. This can be particularly useful when integrating proteomic and genomic data or when working with databases that use UniProt identifiers.

The function UniprotToGenesymbol in SeuratExtend provides a straightforward way to translate UniProt IDs into gene symbols. This function supports both human and mouse species, accommodating research that spans multiple types of biological data. Here's how you can convert UniProt IDs to gene symbols for both human and mouse:

# Converting UniProt IDs to human gene symbols
UniprotToGenesymbol(c("Q8NF67", "Q9NPB9"), spe = "human")

# Converting UniProt IDs to mouse gene symbols
UniprotToGenesymbol(c("Q9R1C8", "Q9QY84"), spe = "mouse")

Compute Statistics Grouped by Clusters {#compute-statistics-grouped-by-clusters}

The CalcStats function from the SeuratExtend package provides a comprehensive approach to compute various statistics, such as mean, median, z-scores, or LogFC, for genomic data. This function can handle data stored in Seurat objects or standard matrices, allowing for versatile analyses tailored to single-cell datasets.

Whether you're analyzing genes or pathways, CalcStats simplifies the task by computing statistics for selected features across different cell groups or clusters.

Using a Seurat Object

Begin by selecting a subset of features, such as genes. For this example, let's pick the first 20 variable features from a Seurat object:

library(Seurat)
library(SeuratExtend)

genes <- VariableFeatures(pbmc)[1:20]

Using CalcStats, compute your desired metric, like z-scores, for each feature across different cell clusters:

genes.zscore <- CalcStats(pbmc, features = genes, method = "zscore", group.by = "cluster")
head(genes.zscore)

Display the computed statistics using a heatmap:

Heatmap(genes.zscore, lab_fill = "zscore")

Select more genes and retain the top 4 genes of each cluster, sorted by p-value. This can be a convenient method to display the top marker genes of each cluster:

genes <- VariableFeatures(pbmc)
genes.zscore <- CalcStats(
  pbmc, features = genes, method = "zscore", group.by = "cluster", 
  order = "p", n = 4)
Heatmap(genes.zscore, lab_fill = "zscore")

Using Matrices as Input

For instance, you might perform Enrichment Analysis (GSEA) using the Hallmark 50 geneset and obtain the AUCell matrix (rows represent pathways, columns represent cells):

pbmc <- GeneSetAnalysis(pbmc, genesets = hall50$human)
matr <- pbmc@misc$AUCell$genesets

Using the matrix, compute the z-scores for the genesets across various cell clusters:

gsea.zscore <- CalcStats(matr, f = pbmc$cluster, method = "zscore")

Present the z-scores using a heatmap:

Heatmap(gsea.zscore, lab_fill = "zscore")

Assess Proportion of Positive Cells in Clusters {#assess-proportion-of-positive-cells-in-clusters}

This section describes how to utilize the feature_percent function in the SeuratExtend package to determine the proportion of positive cells within specified clusters or groups based on defined criteria. This function is particularly useful for identifying the expression levels of genes or other features within subpopulations of cells in scRNA-seq datasets.

Basic Usage

To calculate the proportion of positive cells for the top 5 variable features in a Seurat object:

library(SeuratExtend)

genes <- VariableFeatures(pbmc)[1:5]

# Default usage
proportions <- feature_percent(pbmc, feature = genes)
print(proportions)

This will return a matrix where rows are features and columns are clusters, showing the proportion of cells in each cluster where the feature's expression is above the default threshold (0).

Adjusting the Expression Threshold

To count a cell as positive only if its expression is above a value of 2:

proportions_above_2 <- feature_percent(pbmc, feature = genes, above = 2)
print(proportions_above_2)

Targeting Specific Clusters

To calculate proportions for only a subset of clusters:

proportions_subset <- feature_percent(pbmc, feature = genes, ident = c("B cell", "CD8 T cell"))
print(proportions_subset)

Grouping by Different Metadata

If you wish to group cells by a different variable other than the default cluster identities:

proportions_by_ident <- feature_percent(pbmc, feature = genes, group.by = "orig.ident")
print(proportions_by_ident)

Proportions Relative to Total Cell Numbers

To also check the proportion of expressed cells in total across selected clusters:

proportions_total <- feature_percent(pbmc, feature = genes, total = TRUE)
print(proportions_total)

Logical Output for Expression

For scenarios where you need a logical output indicating whether a significant proportion of cells are expressing the feature above a certain level (e.g., 20%):

expressed_logical <- feature_percent(pbmc, feature = genes, if.expressed = TRUE, min.pct = 0.2)
print(expressed_logical)

Run Standard Seurat Pipeline {#run-standard-seurat-pipeline}

The RunBasicSeurat function in the SeuratExtend package automates the execution of a standard Seurat pipeline for single-cell RNA sequencing data analysis. This comprehensive function includes steps such as normalization, PCA, clustering, and optionally integrates batch effects using Harmony. This automation is designed to streamline the analysis process, making it more efficient and reproducible.

Overview of the Pipeline

The Seurat pipeline typically includes the following steps, which are all encapsulated within the RunBasicSeurat function:

  1. Calculating Percent Mitochondrial Content: Identifying and filtering cells based on the proportion of mitochondrial genes, which is a common quality control metric.
  2. Normalization: Scaling data to account for cell-specific differences in library size.
  3. PCA: Performing principal component analysis to reduce dimensionality and highlight the major sources of variation.
  4. Clustering: Grouping cells based on their gene expression profiles to identify distinct cell types or states.
  5. UMAP Visualization: Projecting the high-dimensional data into two dimensions for visualization.
  6. Batch Integration (Optional): Using Harmony to correct for batch effects, ensuring that variations driven by experimental conditions are minimized.

For a comprehensive tutorial on the standard Seurat workflow, refer to the official Seurat PBMC tutorial.

Using RunBasicSeurat

Below are examples demonstrating how to use the RunBasicSeurat function to process scRNA-seq data:

library(SeuratExtend)

# Run the full pipeline with forced normalization and default parameters
pbmc <- RunBasicSeurat(pbmc, force.Normalize = TRUE)

# Visualize the clusters using DimPlot
DimPlot2(pbmc, group.by = "cluster")

Parameters and Customization

The function allows for extensive customization of each step through various parameters:

Conditional Execution

RunBasicSeurat intelligently decides whether to re-run certain steps based on parameter changes or previous executions:

Conclusion

The RunBasicSeurat function simplifies the execution of a comprehensive scRNA-seq data analysis pipeline, incorporating advanced features such as conditional execution and batch effect integration. This function ensures that users can efficiently process their data while maintaining flexibility to adapt the analysis to specific requirements.

sessionInfo()


huayc09/SeuratExtend documentation built on July 15, 2024, 6:22 p.m.