knitr::opts_chunk$set(echo = T, fig.width = 7, fig.height = 5)

Intro

Term frequency–inverse document frequency (tf-idf) is an NLP technique to identify words or phrases that are enriched in one document relative to some other larger set of documents.

In our case, our words are within the non-standardized cell labels and our "documents" are the clusters. The goals is to find words that are enriched in each cluster relative to all the other clusters. This can be thought of as an NLP equivalent of finding gene markers for each cluster.

Examples

library(scNLP) 
data("pseudo_seurat")

Preprocessing

If you don't already have a Seurat object with reduced dimensions and cluster assignments, you can generate a new one with the following support function.

## Create some mock raw data 
counts <- Seurat::GetAssayData(pseudo_seurat)
meta.data <- pseudo_seurat@meta.data

processed_seurat <- seurat_pipeline(counts = counts, 
                                    meta.data = meta.data)

td-idf annotation

seurat_tfidf will run tf-idf on each cluster and put the results in the enriched_words and tf_idf cols of the meta.data.

pseudo_seurat_tfidf <- run_tfidf(object = pseudo_seurat,
                                 reduction = "UMAP",
                                 cluster_var = "cluster",
                                 label_var = "celltype") 
head(pseudo_seurat_tfidf@meta.data)

td-idf scatter plot

You can also plot the results in reduced dimensional space (e.g. UMAP). plot_tfidf() will produce a list with three items. - data: The processed data used to create the plot. - tfidf_df: The full per-cluster TF-IDF enrichment results. - plot: The ggplot.

Seurat input

res <- plot_tfidf(object = pseudo_seurat, 
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  show_plot = T)

You can color the point by other metadata attributes instead.

res <- plot_tfidf(object = pseudo_seurat, 
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  color_var = "batch",
                  show_plot = T)

SingleCellExperiment input

plot_tfidf() can also take in an object of class SingleCellExperiment.

data("pseudo_sce")

res <- plot_tfidf(object = pseudo_sce, 
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  show_plot = T)

list input

Lastly, if your data doesn't fit the above example data types, you can simply supply a named list with metadata and embeddings.

data_list <- list(metadata = SingleCellExperiment::colData(pseudo_sce),
                  embeddings = SingleCellExperiment::colData(pseudo_sce)[,c("UMAP.1","UMAP.2")])

res <- plot_tfidf(object = data_list,
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  show_plot = T)

Interactive mode

You can also create an interactive version of this plot.

res <- plot_tfidf(object = pseudo_seurat_tfidf,
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  interact = T,
                  show_plot = T,
                  ### Add other metadata vars you want in the hover label like so:
                  species="species",
                  dataset="dataset", 
                  enriched_words="enriched_words",
                  tf_idf="tf_idf") 

tf-idf wordcloud

You can also show the per-cluster tf-idk results as a wordcloud.

wordcloud_res <- wordcloud_tfidf(object=pseudo_seurat,
                                 label_var = "celltype", 
                                 cluster_var = "cluster", 
                                 terms_per_cluster=10)
print(wordcloud_res$tfidf_df)

Session Info

utils::sessionInfo()



neurogenomics/scNLP documentation built on Oct. 8, 2024, 5:30 p.m.