Gene naming conventions can vary significantly between organisms and databases, presenting a common challenge in scRNA-seq data analysis. SeuratExtend
includes several functions to facilitate the conversion between human and mouse gene symbols and Ensembl IDs, as well as conversions between human and mouse homologous gene symbols. These functions leverage the biomaRt
database for conversions but improve on reliability and performance by localizing the most commonly used databases, thus eliminating the need for internet connectivity and addressing the frequent instability issues with biomaRt
.
The functions provided for these conversions are:
HumanToMouseGenesymbol
MouseToHumanGenesymbol
EnsemblToGenesymbol
GenesymbolToEnsembl
These functions share a similar usage pattern, as detailed below using HumanToMouseGenesymbol
as an example.
First, let's retrieve a few human gene symbols from a dataset as an example:
library(Seurat) library(SeuratExtend) human_genes <- VariableFeatures(pbmc)[1:6] print(human_genes)
By default, HumanToMouseGenesymbol
returns a data frame showing how human gene symbols (HGNC) match with mouse gene symbols (MGI):
HumanToMouseGenesymbol(human_genes)
This table indicates that not all human genes have direct mouse homologs, and some human genes may correspond to multiple mouse genes.
If you prefer a simpler vector output without the matching details:
HumanToMouseGenesymbol(human_genes, match = FALSE)
For cases where you require a one-to-one correspondence:
HumanToMouseGenesymbol(human_genes, keep.seq = TRUE)
These functions can also directly convert human gene expression matrices to their mouse counterparts:
# Create an example gene expression matrix human_matr <- GetAssayData(pbmc)[human_genes, 1:4] print(human_matr) # Convert to a mouse gene expression matrix HumanToMouseGenesymbol(human_matr)
The usage patterns for the other conversion functions in SeuratExtend
, such as MouseToHumanGenesymbol
, GenesymbolToEnsembl
, and EnsemblToGenesymbol
, are similar to those already discussed. These functions also leverage local databases to enhance performance and reliability but provide options to use online databases via biomaRt
if necessary.
Here are some examples demonstrating the use of other gene naming conversion functions:
# Converting mouse gene symbols to human mouse_genes <- c("Cd14", "Cd3d", "Cd79a") MouseToHumanGenesymbol(mouse_genes, match = FALSE) # Converting human gene symbols to Ensembl IDs human_genes <- c("PPBP", "LYZ", "S100A9", "IGLL5", "GNLY", "FTL") GenesymbolToEnsembl(human_genes, spe = "human", keep.seq = TRUE) # Converting mouse gene symbols to Ensembl IDs GenesymbolToEnsembl(mouse_genes, spe = "mouse", keep.seq = TRUE) # Converting Ensembl IDs to human gene symbols EnsemblToGenesymbol(c("ENSG00000163736", "ENSG00000090382"), spe = "human", keep.seq = TRUE) # Converting Ensembl IDs to mouse gene symbols EnsemblToGenesymbol(c("ENSMUSG00000051439", "ENSMUSG00000032094"), spe = "mouse", keep.seq = TRUE)
While SeuratExtend
typically uses localized databases for conversions, you have the option to directly fetch results from biomaRt
databases if required. This can be useful when working with less common genes or newer annotations not yet available in the local database:
# Fetching Ensembl IDs for human genes directly from biomaRt GenesymbolToEnsembl(human_genes, spe = "human", local.mode = FALSE, keep.seq = TRUE)
In addition to facilitating gene symbol and Ensembl ID conversions between human and mouse, SeuratExtend
also includes functionality to convert UniProt IDs, which are widely used in proteomic databases, to gene symbols. This can be particularly useful when integrating proteomic and genomic data or when working with databases that use UniProt identifiers.
The function UniprotToGenesymbol
in SeuratExtend
provides a straightforward way to translate UniProt IDs into gene symbols. This function supports both human and mouse species, accommodating research that spans multiple types of biological data. Here's how you can convert UniProt IDs to gene symbols for both human and mouse:
# Converting UniProt IDs to human gene symbols UniprotToGenesymbol(c("Q8NF67", "Q9NPB9"), spe = "human") # Converting UniProt IDs to mouse gene symbols UniprotToGenesymbol(c("Q9R1C8", "Q9QY84"), spe = "mouse")
The CalcStats
function from the SeuratExtend
package provides a comprehensive approach to compute various statistics, such as mean, median, z-scores, or LogFC, for genomic data. This function can handle data stored in Seurat objects or standard matrices, allowing for versatile analyses tailored to single-cell datasets.
Whether you're analyzing genes or pathways, CalcStats
simplifies the task by computing statistics for selected features across different cell groups or clusters.
Begin by selecting a subset of features, such as genes. For this example, let's pick the first 20 variable features from a Seurat object:
library(Seurat) library(SeuratExtend) genes <- VariableFeatures(pbmc)[1:20]
Using CalcStats
, compute your desired metric, like z-scores, for each feature across different cell clusters:
genes.zscore <- CalcStats(pbmc, features = genes, method = "zscore", group.by = "cluster") head(genes.zscore)
Display the computed statistics using a heatmap:
Heatmap(genes.zscore, lab_fill = "zscore")
Select more genes and retain the top 4 genes of each cluster, sorted by p-value. This can be a convenient method to display the top marker genes of each cluster:
genes <- VariableFeatures(pbmc) genes.zscore <- CalcStats( pbmc, features = genes, method = "zscore", group.by = "cluster", order = "p", n = 4) Heatmap(genes.zscore, lab_fill = "zscore")
For instance, you might perform Enrichment Analysis (GSEA) using the Hallmark 50 geneset and obtain the AUCell matrix (rows represent pathways, columns represent cells):
pbmc <- GeneSetAnalysis(pbmc, genesets = hall50$human) matr <- pbmc@misc$AUCell$genesets
Using the matrix, compute the z-scores for the genesets across various cell clusters:
gsea.zscore <- CalcStats(matr, f = pbmc$cluster, method = "zscore")
Present the z-scores using a heatmap:
Heatmap(gsea.zscore, lab_fill = "zscore")
This section describes how to utilize the feature_percent
function in the SeuratExtend
package to determine the proportion of positive cells within specified clusters or groups based on defined criteria. This function is particularly useful for identifying the expression levels of genes or other features within subpopulations of cells in scRNA-seq datasets.
To calculate the proportion of positive cells for the top 5 variable features in a Seurat object:
library(SeuratExtend) genes <- VariableFeatures(pbmc)[1:5] # Default usage proportions <- feature_percent(pbmc, feature = genes) print(proportions)
This will return a matrix where rows are features and columns are clusters, showing the proportion of cells in each cluster where the feature's expression is above the default threshold (0).
To count a cell as positive only if its expression is above a value of 2:
proportions_above_2 <- feature_percent(pbmc, feature = genes, above = 2) print(proportions_above_2)
To calculate proportions for only a subset of clusters:
proportions_subset <- feature_percent(pbmc, feature = genes, ident = c("B cell", "CD8 T cell")) print(proportions_subset)
If you wish to group cells by a different variable other than the default cluster identities:
proportions_by_ident <- feature_percent(pbmc, feature = genes, group.by = "orig.ident") print(proportions_by_ident)
To also check the proportion of expressed cells in total across selected clusters:
proportions_total <- feature_percent(pbmc, feature = genes, total = TRUE) print(proportions_total)
For scenarios where you need a logical output indicating whether a significant proportion of cells are expressing the feature above a certain level (e.g., 20%):
expressed_logical <- feature_percent(pbmc, feature = genes, if.expressed = TRUE, min.pct = 0.2) print(expressed_logical)
The RunBasicSeurat
function in the SeuratExtend
package automates the execution of a standard Seurat pipeline for single-cell RNA sequencing data analysis. This comprehensive function includes steps such as normalization, PCA, clustering, and optionally integrates batch effects using Harmony. This automation is designed to streamline the analysis process, making it more efficient and reproducible.
The Seurat pipeline typically includes the following steps, which are all encapsulated within the RunBasicSeurat
function:
For a comprehensive tutorial on the standard Seurat workflow, refer to the official Seurat PBMC tutorial.
RunBasicSeurat
Below are examples demonstrating how to use the RunBasicSeurat
function to process scRNA-seq data:
library(SeuratExtend) # Run the full pipeline with forced normalization and default parameters pbmc <- RunBasicSeurat(pbmc, force.Normalize = TRUE) # Visualize the clusters using DimPlot DimPlot2(pbmc, group.by = "cluster")
The function allows for extensive customization of each step through various parameters:
spe
: Specifies the species (human or mouse) for mitochondrial calculations.nFeature_RNA.min
and nFeature_RNA.max
: Define the range of RNA features considered for each cell.percent.mt.max
: Sets the maximum allowed mitochondrial gene expression percentage.dims
: Determines the number of dimensions used in PCA and neighbor finding.resolution
: Adjusts the granularity of the clustering algorithm.reduction
: Chooses the dimensional reduction technique, with options for PCA or Harmony.harmony.by
: Specifies the metadata column for batch correction when using Harmony.RunBasicSeurat
intelligently decides whether to re-run certain steps based on parameter changes or previous executions:
force.*
Parameters: Each force
parameter (e.g., force.Normalize
, force.RunPCA
) overrides the function's internal checks, ensuring that specific steps are executed regardless of prior results. This feature is particularly useful when parameters are adjusted or when updates to the dataset require reanalysis.The RunBasicSeurat
function simplifies the execution of a comprehensive scRNA-seq data analysis pipeline, incorporating advanced features such as conditional execution and batch effect integration. This function ensures that users can efficiently process their data while maintaining flexibility to adapt the analysis to specific requirements.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.