knitr::opts_chunk$set(echo = TRUE)
To do for all notebooks: Explain what the code does, just as I did in the BBB notebook.
library(tidyverse)
library(lubridate)
library(preText)
library(quanteda)
library(stm)
library(text2vec)
library(lexvarsdatr)
library(ggrepel)
library(sf)
library(furrr)
library(stminsights)
library(uwot)
library(tidytext)
library(broom)
library(gam)
library(igraph)
library(ggraph)
library(bluster)
library(tidygraph)
devtools::load_all()
plan(multisession)
theme_set(theme_bw())
data("lcm_abstracts") data("lcm_city_gc")
pubs_on_map(lcm_abstracts, lcm_city_gc, plot = "hexbin", label_insts = FALSE, label_cities = TRUE, n_label = 7)
I'm not sure if this is a good idea, since it plots the number of publications per unit area. Per capita is confusing here, again because researchers move a lot and it's not clear who the "locals" are. Is per unit area more reasonable? Maybe; it at least helps when there's overplotting in the scatter plot.
pubs_on_map(lcm_abstracts, city_gc = lcm_city_gc, label_cities = TRUE, n_label = 7)
pubs_per_cat(lcm_abstracts, country, n_top = 20)
pubs_per_cat(lcm_abstracts, city, n_top = 20)
I can try to get city population data, though a complication is that some city names have different spellings, and sometimes multiple cities share the same name (trust me, there are cities named Los Angeles outside California). I could geocode those cities with the Google API and extract Google's standardized names. But I don't think this is all that important, so I'm not doing it right now.
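If I ever do, a minimal sketch might look like this; it assumes the tidygeocoder package (not loaded above) and a Google API key in the GOOGLEGEOCODE_API_KEY environment variable:

library(tidygeocoder)
# Hypothetical sketch: geocode the city strings with Google and keep Google's
# standardized formatted address to disambiguate cities that share a name.
city_std <- lcm_abstracts %>%
  distinct(city2) %>%
  geocode(address = city2, method = "google", full_results = TRUE) %>%
  select(city2, lat, long, formatted_address)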
Unfortunately, I can't check institutions here, since I haven't figured out a way to automatically extract and standardize institution names from the address field; the addresses don't follow a standardized format. Given more time, I probably could figure that out, but I'd rather not for now. The cities do tell a lot, though. Los Angeles means UCLA, USC, or hospitals like Cedars-Sinai. Ithaca means Cornell. New York means Columbia, NYU, the Cornell medical school, or Rockefeller. Boston means Harvard Medical School, or maybe Boston College or Boston University. Bethesda means the NIH. Again, all the usual suspects.
pubs_per_capita(lcm_abstracts)
pubs_per_capita(lcm_abstracts, plot = "bar")
That's very different from other areas of spatial transcriptomics.
pubs_on_map(lcm_abstracts, lcm_city_gc, "europe", label_cities = TRUE)
pubs_per_capita(lcm_abstracts, "europe")
pubs_per_capita(lcm_abstracts, "europe", plot = "bar")
Denmark never showed up for the other areas in spatial transcriptomics.
pubs_on_map(lcm_abstracts, lcm_city_gc, "usa", label_cities = TRUE)
Orangeburg? I've never heard of it. It's home to the Nathan Kline Institute, which I'd also never heard of. I think this shows how LCM differs from other fields of spatial transcriptomics; the other fields are more confined to the very elite institutions.
pubs_per_capita(lcm_abstracts, "usa")
pubs_per_capita(lcm_abstracts, "usa", plot = "bar")
I think that's because of the NIH's labs in Bethesda and Baltimore. Apparently the NIH has a lot of interest in LCM, though less interest in other fields of spatial transcriptomics than some of the most elite institutions. Anyway, I don't think this says much about the states themselves: academics are kind of like modern-day foragers, not anchored in a city early in their careers, so the people doing the research are quite likely from somewhere else. I'm not convinced that Kansas and Virginia are really better than California in this field. But I still don't think I should plot raw numbers on the choropleth, since that creates an illusion that physically larger states have more publications than they actually do, just because they're larger.
How about over time?
pubs_per_year(lcm_abstracts, binwidth = 150)
This is not manually curated; it's just what PubMed returned from a search. Curiously, there's a first peak, and then a taller second peak that is still rising. I wonder why. When text mining, I can try to compare the two peaks.
I'll text mine the titles, keywords, or maybe abstracts as well. I can also try to extract country, state, city, institution, and department from the address field.
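As a rough sketch of what that extraction could look like (parse_address is a hypothetical helper; it assumes the country is the last comma-separated field of the affiliation string, which is often but not always true for PubMed):

# Hypothetical sketch: take the last comma-separated field of each affiliation
# as the country and the second-to-last as a city candidate. Real PubMed
# affiliations are messier (emails, zip codes, multiple affiliations), so this
# is only a starting point.
parse_address <- function(address) {
  parts <- str_split(address, ",\\s*")
  tibble(
    country = map_chr(parts, ~ str_remove(last(.x), "\\.$")),
    city_candidate = map_chr(parts, ~ if (length(.x) > 1) nth(.x, -2) else NA_character_)
  )
}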
I think the city (a convenient proxy for institution) and the journal have a lot to do with the topic of an abstract. But if many cities and journals have only one or two abstracts, including them makes the model unnecessarily complex without making it more predictive, given the relatively small corpus. Yes, I'm fitting this model to explain the importance of the covariates, but for me to trust that explanation, the model should capture the important features without too much noise; in other words, it should not overfit.
length(unique(lcm_abstracts$city2))
length(unique(lcm_abstracts$journal))
lcm_abstracts %>% count(city2) %>% ggplot(aes(n)) + stat_ecdf() + labs(x = "Number of publications per city", y = "F") + scale_x_continuous(breaks = scales::breaks_width(2)) + scale_y_continuous(breaks = scales::breaks_width(0.2))
Lumping every city with fewer than 4 publications into "Other" leaves only about 30% of cities.
Another question: what percentage of cities accounts for what percentage of publications? Rather than an ECDF of the number of publications per city, I can plot the cumulative number or proportion of all publications against city rank.
df <- lcm_abstracts %>% count(city2) %>% arrange(desc(n)) %>% mutate(rank = row_number(desc(n)), rel_rank = rank/max(rank), cum_n = cumsum(n), cum_prop = cum_n/sum(n)) ggplot(df, aes(rel_rank, cum_prop)) + geom_line() + scale_x_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) + scale_y_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) + labs(x = "City percentile", y = "Percent of publications")
When I lump cities with fewer than 3 publications, the remaining 30% of cities have 70% of the publications. But is it helpful at all that a bit over 30% of publications end up in the "Other" category? Probably better than not using cities at all: in practice, topics have a lot to do with the interests of individual labs, so it makes sense to include city as a covariate. A quick way to check those proportions is sketched below.
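This is a minimal sketch, with min_n as a placeholder threshold (adjust it to match the lumping actually used):

# Sketch: for a given lumping threshold, what fraction of cities survive,
# and what fraction of publications do they account for? min_n is a placeholder.
min_n <- 3
lcm_abstracts %>%
  count(city2) %>%
  summarise(prop_cities_kept = mean(n >= min_n),
            prop_pubs_kept = sum(n[n >= min_n]) / sum(n))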
lcm_abstracts %>% filter(journal != "bioRxiv") %>% count(journal) %>% ggplot(aes(n)) + stat_ecdf() + labs(x = "Number of publications per journal", y = "F") + scale_x_continuous(breaks = scales::breaks_width(2)) + scale_y_continuous(breaks = scales::breaks_width(0.2))
Lumping journals with fewer than 3 papers into "Other" leaves about 25% of journals.
df <- lcm_abstracts %>% filter(journal != "bioRxiv") %>% count(journal) %>% arrange(desc(n)) %>% mutate(rank = row_number(desc(n)), rel_rank = rank/max(rank), cum_n = cumsum(n), cum_prop = cum_n/sum(n)) ggplot(df, aes(rel_rank, cum_prop)) + geom_line() + scale_y_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) + scale_x_continuous(labels = scales::percent) + labs(x = "Journal percentile", y = "Percent of publications")
The remaining top 25% of journals have 65% of papers.
This curve rises more sharply than the city curve.
abstracts <- lcm_abstracts %>% mutate(city_eff = fct_lump_min(city2, 5), journal_eff = fct_lump_min(journal, 5))
ac <- corpus(abstracts, docid_field = "title", text_field = "abstract")
n_words <- summary(ac, n = Inf)
hist(n_words$Tokens)
hist(n_words$Types)
hist(n_words$Sentences)
Did the number of words in abstracts change over time?
ggplot(n_words, aes(date_published, Tokens)) + geom_point() + geom_smooth(method = "lm")
Nope.
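Since broom is loaded, the flat trend can also be checked numerically; a minimal sketch, assuming date_published is the docvar used in the plot above:

# Sketch: regress abstract length on publication date; lm() treats the Date
# column as numeric (days), so the slope is in tokens per day.
tidy(lm(Tokens ~ date_published, data = n_words))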
I'm debating whether I should stem the words. It probably doesn't matter much here; I don't know any keyword that has multiple meanings in the context of LCM. Anyway, there are more burning questions, like whether I should remove punctuation. I don't think it matters in this context, but just to check, I'll randomly pick 200 abstracts and use the preText package to get a sense of the effects of those preprocessing choices.
ac_sub <- corpus_sample(ac, size = 200)
ac_fp <- factorial_preprocessing(ac_sub, parallel = FALSE, intermediate_directory = ".", return_results = FALSE, infrequent_term_threshold = 0.005)
head(ac_fp$choices)
preText_results <- preText( ac_fp, dataset_name = "LCM abstracts", distance_method = "cosine", num_comparisons = 50)
preText_score_plot(preText_results)
The options with lower scores have less unusual results.
regression_coefficient_plot(preText_results, remove_intercept = TRUE)
Here, this means removing stopwords and punctuation makes the results less likely to be unusual, removing infrequent terms may lead to more unusual results, and the other options might not significantly affect the results. Note that preText only looks at how these preprocessing choices influence pairwise distances between abstracts. I see how that's relevant to political science, for which the package was written, but how relevant is pairwise distance to LCM abstracts?
For the first trial, I'll use only unigrams, stem the words, remove stopwords, punctuation, and numbers, and lowercase everything, for the simplest model. Then I can try not removing stopwords, not removing punctuation, and removing infrequent words, and see how each choice affects the topic modeling results.
What are the collocations, i.e. common phrases?
ac_col <- tokens(ac, split_hyphens = TRUE, remove_symbols = TRUE) %>% tokens_remove(c(stopwords(), "sup", "alt")) %>% tokens_remove(c("^c_", "^m_", "^o_"), valuetype = "regex") %>% tokens_wordstem() %>% textstat_collocations(min_count = 20, size = 2:4) %>% filter(count_nested < count) %>% arrange(desc(count))
ac_col
OK, now I know which phrases are common. While many of the phrases here are relevant and usually used as if they were single words, some are not, so I'll curate the list of phrases based on my background knowledge.
phrases <- str_split(ac_col$collocation, pattern = " ", simplify = FALSE)
ac_tokens <- tokens(ac, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE, split_hyphens = TRUE) %>% tokens_tolower(keep_acronyms = TRUE) %>% tokens_remove(c(stopwords(), "laser", "capture", "microdissection", "use", "using", "used", "uses", "LCM", "sup", "LMD", "gene", "genes", "width", "height", "src", "figdir", "alt")) %>% tokens_remove(c("^c_", "^m_", "^o_"), valuetype = "regex") %>% tokens_wordstem() %>% tokens_compound(phrases)
lcm_tokens <- ac_tokens usethis::use_data(lcm_tokens, overwrite = TRUE)
ac_dfm <- dfm(ac_tokens)
ac_dfm2 <- dfm_trim(ac_dfm, min_docfreq = 0.001, docfreq_type = "prop")
Here "spectral" means using a spectral, NMF-based initialization. I debated for a while whether I should use the covariates, and eventually decided that I should: otherwise the prevalence model is just an intercept, and at worst the covariate effects turn out not to be significant. I do think some covariates are relevant to topic prevalence, such as journal: the journal Plant Cell is obviously about plants and not human cancer. That said, it seems that adding the journal covariate only hurts the held-out likelihood, whether or not I lump low-frequency journals into "Other", so I won't use it. I lumped cities with fewer than 5 publications into "Other", since smaller thresholds hurt the held-out likelihood, but I still want to see city effects, so I don't want to lump too much. Over 80% of cities and about half of publications are now in "Other", so this only shows the effects of the top 15% or so of cities.
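For reference, here's a minimal sketch of how that held-out comparison can be done with stm's make.heldout() and eval.heldout(); K = 30 is a placeholder rather than the final choice, and the covariates are assumed to be in the dfm's docvars, as in the searchK call below:

# Sketch: compare held-out likelihood with vs. without the journal covariate.
stm_in <- convert(ac_dfm2, to = "stm")
ho <- make.heldout(stm_in$documents, stm_in$vocab, seed = 1891)
fit_j <- stm(ho$documents, ho$vocab, K = 30, data = stm_in$meta,
             prevalence = ~ date_num + city_eff + journal_eff,
             gamma.prior = "L1", init.type = "Spectral", verbose = FALSE)
fit_nj <- stm(ho$documents, ho$vocab, K = 30, data = stm_in$meta,
              prevalence = ~ date_num + city_eff,
              gamma.prior = "L1", init.type = "Spectral", verbose = FALSE)
c(with_journal = eval.heldout(fit_j, ho$missing)$expected.heldout,
  without_journal = eval.heldout(fit_nj, ho$missing)$expected.heldout)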
k_result2 <- my_searchK(ac_dfm2, K = seq(5, 60, by = 5), prevalence = ~ date_num + city_eff + journal_eff, gamma.prior = "L1", verbose = FALSE, seed = 1891)
plot_k_result(k_result2)
It looks like the number of topics should be 25 to 40 according to the residuals, but held-out likelihood and semantic coherence suggest a smaller number. The "residuals" here means the dispersion of the residuals, which ideally should be 1.
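Once a model is fit, that dispersion can be checked directly with stm's checkResiduals(); a sketch, where fit stands in for any fitted model from the search:

# Sketch: estimated dispersion of the multinomial residuals; values well above
# 1 suggest K is too small. `fit` is a placeholder for a fitted stm model.
checkResiduals(fit, convert(ac_dfm2, to = "stm")$documents)$dispersion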
plot_ec(k_result2, seq(30, 60, by = 10))
As expected, fewer topics result in higher semantic coherence and lower exclusivity. It's really a trade-off. As a compromise, I choose the model with K = 50. Since I held out part of the data when choosing K, I'll use this K to fit the model on the full data.
stm_res <- stm(ac_dfm2, K = 50, prevalence = ~ date_num + city_eff + journal_eff, gamma.prior = "L1", seed = 1919)
plot(stm_res, n = 10)
Upon manual inspection, 50 topics, with a spline for the date, seem to give more reasonable results.
saveRDS(k_result2, "k_result_lcm.rds")
saveRDS(stm_res, "stm_res_lcm.rds")
Tidyverse version: again, I'm copying and pasting code from Julia Silge's blog; I learned a lot about text mining from her. However, I don't like her colorful bar charts, so I changed her code to make the lollipop plot here.
findThoughts(stm_res, lcm_abstracts$abstract, 50, n = 10, thresh = 0.9)
plot_topic_words(stm_res)
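For reference, a minimal sketch of the kind of tidytext-based lollipop plot that plot_topic_words() draws, adapted from Julia Silge's workflow (the exact aesthetics in plot_topic_words() may differ):

# Sketch: tidy the topic-word (beta) matrix and show top words per topic as
# lollipops instead of bars. reorder_within() and scale_y_reordered() are from
# tidytext; the 7 words per topic shown here is arbitrary.
td_beta <- tidy(stm_res)
td_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 7) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term)) +
  geom_segment(aes(xend = 0, yend = term), color = "gray70") +
  geom_point() +
  scale_y_reordered() +
  facet_wrap(~ topic, scales = "free_y")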
lcm_dfm2 <- ac_dfm2 usethis::use_data(lcm_dfm2, overwrite = TRUE) usethis::use_data(stm_res, overwrite = TRUE)