In neurogenomics/scNLP: NLP for Single-cell Omics.

knitr::opts_chunk$set(echo = T, fig.width = 7, fig.height = 5)

Intro

Term frequency–inverse document frequency (tf-idf) is an NLP technique to identify words or phrases that are enriched in one document relative to some other larger set of documents.

In our case, our words are within the non-standardized cell labels and our "documents" are the clusters. The goals is to find words that are enriched in each cluster relative to all the other clusters. This can be thought of as an NLP equivalent of finding gene markers for each cluster.

Examples

library(scNLP) 
data("pseudo_seurat")

Preprocessing

If you don't already have a Seurat object with reduced dimensions and cluster assignments, you can generate a new one with the following support function.

## Create some mock raw data 
counts <- Seurat::GetAssayData(pseudo_seurat)
meta.data <- pseudo_seurat@meta.data

processed_seurat <- seurat_pipeline(counts = counts, 
                                    meta.data = meta.data)

td-idf annotation

seurat_tfidf will run tf-idf on each cluster and put the results in the enriched_words and tf_idf cols of the meta.data.

pseudo_seurat_tfidf <- run_tfidf(object = pseudo_seurat,
                                 reduction = "UMAP",
                                 cluster_var = "cluster",
                                 label_var = "celltype") 
head(pseudo_seurat_tfidf@meta.data)

td-idf scatter plot

You can also plot the results in reduced dimensional space (e.g. UMAP). plot_tfidf() will produce a list with three items. - data: The processed data used to create the plot. - tfidf_df: The full per-cluster TF-IDF enrichment results. - plot: The ggplot.

`Seurat` input

res <- plot_tfidf(object = pseudo_seurat, 
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  show_plot = T)

You can color the point by other metadata attributes instead.

res <- plot_tfidf(object = pseudo_seurat, 
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  color_var = "batch",
                  show_plot = T)

`SingleCellExperiment` input

plot_tfidf() can also take in an object of class SingleCellExperiment.

data("pseudo_sce")

res <- plot_tfidf(object = pseudo_sce, 
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  show_plot = T)

`list` input

Lastly, if your data doesn't fit the above example data types, you can simply supply a named list with metadata and embeddings.

data_list <- list(metadata = SingleCellExperiment::colData(pseudo_sce),
                  embeddings = SingleCellExperiment::colData(pseudo_sce)[,c("UMAP.1","UMAP.2")])

res <- plot_tfidf(object = data_list,
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  show_plot = T)

Interactive mode

You can also create an interactive version of this plot.

res <- plot_tfidf(object = pseudo_seurat_tfidf,
                  label_var = "celltype", 
                  cluster_var = "cluster", 
                  interact = T,
                  show_plot = T,
                  ### Add other metadata vars you want in the hover label like so:
                  species="species",
                  dataset="dataset", 
                  enriched_words="enriched_words",
                  tf_idf="tf_idf")

tf-idf wordcloud

You can also show the per-cluster tf-idk results as a wordcloud.

wordcloud_res <- wordcloud_tfidf(object=pseudo_seurat,
                                 label_var = "celltype", 
                                 cluster_var = "cluster", 
                                 terms_per_cluster=10)
print(wordcloud_res$tfidf_df)

Session Info

utils::sessionInfo()

neurogenomics/scNLP documentation built on Oct. 8, 2024, 5:30 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

neurogenomics/scNLP
NLP for Single-cell Omics.

In neurogenomics/scNLP: NLP for Single-cell Omics.

Intro

Examples

Preprocessing

td-idf annotation

td-idf scatter plot

`Seurat` input

`SingleCellExperiment` input

`list` input

Interactive mode

tf-idf wordcloud

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

neurogenomics/scNLP NLP for Single-cell Omics.

In neurogenomics/scNLP: NLP for Single-cell Omics.

Intro

Examples

Preprocessing

td-idf annotation

td-idf scatter plot

Seurat input

SingleCellExperiment input

list input

Interactive mode

tf-idf wordcloud

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

neurogenomics/scNLP
NLP for Single-cell Omics.

`Seurat` input

`SingleCellExperiment` input

`list` input