knitr::opts_chunk$set(echo = T, fig.width = 7, fig.height = 5)
Term frequency–inverse document frequency (tf-idf) is an NLP technique to identify words or phrases that are enriched in one document relative to some other larger set of documents.
In our case, the words come from the non-standardized cell labels and the "documents" are the clusters. The goal is to find words that are enriched in each cluster relative to all the other clusters. This can be thought of as an NLP equivalent of finding gene markers for each cluster.
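To make the idea concrete, here is a minimal base R sketch of tf-idf computed over clusters treated as documents. The labels and cluster assignments are made up for illustration, and the scoring uses one common definition (term frequency times the natural log of inverse document frequency); it is not necessarily how scNLP computes its scores internally.

```r
# Toy cell labels and cluster assignments (made-up data for illustration).
labels <- c("excitatory neuron", "inhibitory neuron", "cortical neuron",
            "astrocyte", "cortical astrocyte",
            "microglia", "cortical microglia")
clusters <- c("1", "1", "1", "2", "2", "3", "3")

# Tokenize each label into lowercase words, keeping track of the cluster.
tokens <- strsplit(tolower(labels), "[^a-z0-9]+")
words <- data.frame(
  cluster = rep(clusters, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Count each word within each cluster (clusters play the role of documents).
counts <- as.data.frame(table(cluster = words$cluster, word = words$word),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
names(counts)[names(counts) == "Freq"] <- "n"

# tf: share of a cluster's words that are this word.
counts$tf <- counts$n / ave(counts$n, counts$cluster, FUN = sum)

# idf: log(number of clusters / number of clusters containing the word).
n_clusters <- length(unique(counts$cluster))
clusters_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_clusters / clusters_with_word)

counts$tf_idf <- counts$tf * counts$idf

# Highest-scoring word in each cluster.
top_words <- do.call(rbind, lapply(split(counts, counts$cluster), function(d) {
  d[which.max(d$tf_idf), c("cluster", "word", "tf_idf")]
}))
print(top_words)
```

In this toy data, "cortical" appears in every cluster, so its idf (and therefore its tf-idf) is zero, while "neuron", "astrocyte" and "microglia" each score highest in their own cluster.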
library(scNLP)
data("pseudo_seurat")
If you don't already have a Seurat object with reduced dimensions and cluster assignments, you can generate a new one with the following support function.
## Create some mock raw data
counts <- Seurat::GetAssayData(pseudo_seurat)
meta.data <- pseudo_seurat@meta.data
processed_seurat <- seurat_pipeline(counts = counts, meta.data = meta.data)
run_tfidf() will run tf-idf on each cluster and put the results in the enriched_words and tf_idf columns of the meta.data.
pseudo_seurat_tfidf <- run_tfidf(object = pseudo_seurat, reduction = "UMAP", cluster_var = "cluster", label_var = "celltype")
head(pseudo_seurat_tfidf@meta.data)
You can also plot the results in reduced dimensional space (e.g. UMAP).
plot_tfidf() will produce a list with three items.
- data: The processed data used to create the plot.
- tfidf_df: The full per-cluster TF-IDF enrichment results.
- plot: The ggplot.
Seurat input
res <- plot_tfidf(object = pseudo_seurat, label_var = "celltype", cluster_var = "cluster", show_plot = T)
You can color the points by other metadata attributes instead.
res <- plot_tfidf(object = pseudo_seurat, label_var = "celltype", cluster_var = "cluster", color_var = "batch", show_plot = T)
SingleCellExperiment input
plot_tfidf() can also take in an object of class SingleCellExperiment.
data("pseudo_sce")
res <- plot_tfidf(object = pseudo_sce, label_var = "celltype", cluster_var = "cluster", show_plot = T)
list input
Lastly, if your data doesn't fit the above example data types, you can simply supply a named list with metadata and embeddings.
data_list <- list(metadata = SingleCellExperiment::colData(pseudo_sce),
                  embeddings = SingleCellExperiment::colData(pseudo_sce)[,c("UMAP.1","UMAP.2")])
res <- plot_tfidf(object = data_list, label_var = "celltype", cluster_var = "cluster", show_plot = T)
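If your data lives in ordinary data frames rather than a SingleCellExperiment, a list of the same shape could be assembled as sketched below. The cell names, labels and coordinates are invented for illustration, and it is an assumption (not confirmed above) that plot_tfidf() accepts plain data.frames in place of the DataFrame returned by colData(); the final call is therefore left commented out.

```r
set.seed(1)

# Hypothetical per-cell metadata: one row per cell, with label and cluster columns.
metadata <- data.frame(
  celltype = c("excitatory neuron", "inhibitory neuron",
               "astrocyte", "reactive astrocyte",
               "microglia", "activated microglia"),
  cluster = c("1", "1", "2", "2", "3", "3"),
  row.names = paste0("cell", 1:6),
  stringsAsFactors = FALSE
)

# Hypothetical 2D embedding, using the same column names as the example above.
embeddings <- data.frame(
  UMAP.1 = rnorm(nrow(metadata)),
  UMAP.2 = rnorm(nrow(metadata)),
  row.names = rownames(metadata)
)

# Both elements should describe the same cells in the same order.
stopifnot(identical(rownames(metadata), rownames(embeddings)))

data_list <- list(metadata = metadata, embeddings = embeddings)
str(data_list, max.level = 1)

# Assumption: plain data.frames are accepted here.
# res <- scNLP::plot_tfidf(object = data_list, label_var = "celltype",
#                          cluster_var = "cluster", show_plot = T)
```

Keeping the row names aligned between the two elements avoids any ambiguity about which coordinates belong to which cell.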
You can also create an interactive version of this plot.
res <- plot_tfidf(object = pseudo_seurat_tfidf,
                  label_var = "celltype",
                  cluster_var = "cluster",
                  interact = T,
                  show_plot = T,
                  ### Add other metadata vars you want in the hover label like so:
                  species = "species",
                  dataset = "dataset",
                  enriched_words = "enriched_words",
                  tf_idf = "tf_idf")
You can also show the per-cluster tf-idf results as a wordcloud.
wordcloud_res <- wordcloud_tfidf(object = pseudo_seurat, label_var = "celltype", cluster_var = "cluster", terms_per_cluster = 10)
print(wordcloud_res$tfidf_df)
utils::sessionInfo()