knitr::opts_chunk$set(echo = TRUE)

To do for all notebooks: Explain what the code does, just as I did in the BBB notebook.

library(tidyverse)
library(lubridate)
library(preText)
library(quanteda)
library(stm)
library(text2vec)
library(lexvarsdatr)
library(ggrepel)
library(sf)
library(furrr)
library(stminsights)
library(uwot)
library(tidytext)
library(broom)
library(gam)
library(igraph)
library(ggraph)
library(bluster)
library(tidygraph)
devtools::load_all()
plan(multisession)
theme_set(theme_bw())
data("lcm_abstracts")
data("lcm_city_gc")
pubs_on_map(lcm_abstracts, lcm_city_gc, plot = "hexbin", label_insts = FALSE, 
            label_cities = TRUE, n_label = 7)

Not sure if this is a good idea, since it's plotting number of publications per unit area. I think per capita is confusing, again, because researchers move a lot and I'm not sure who are the "locals". Is per unit area more reasonable? Maybe. It helps when there's overplotting in the scatter plot.

pubs_on_map(lcm_abstracts, city_gc = lcm_city_gc, label_cities = TRUE, n_label = 7)
pubs_per_cat(lcm_abstracts, country, n_top = 20)
pubs_per_cat(lcm_abstracts, city, n_top = 20)

I can try to get city population data, though a complication is that some city names have different spellings and sometimes there are multiple cities with the same name (e.g. trust me, there are Los Angeles's outside California). Well, I can geocode those cities with the Google API and extract Google's standards. But I don't think this is all that important, so I'm not doing it right now.

Unfortunately, I can't check the institutions here since I haven't figured out a way to automatically extract the institution names and standardize them from the address given that the addresses here do not have a standardized format. OK, given more time, I probably can figure that out, but I'd rather not do it for now. But the cities do tell a lot. Los Angeles means either UCLA or USC or some hospitals like Cedars Sinai. Ithaca means Cornell. New York means Columbia, NYU, Cornell medical school, or Rockefeller. Boston means Harvard Medical School or maybe Boston College or Boston University. Bethesda means NIH. Again, all the usual suspects.

pubs_per_capita(lcm_abstracts)
pubs_per_capita(df_1st, plot = "bar")

That's very different from other areas of spatial transcriptomics.

pubs_on_map(lcm_abstracts, lcm_city_gc, "europe", label_cities = TRUE)
pubs_per_capita(lcm_abstracts, "europe")
pubs_per_capita(lcm_abstracts, "europe", plot = "bar")

Denmark never showed up for the other areas in spatial transcriptomics.

pubs_on_map(lcm_abstracts, lcm_city_gc, "usa", label_cities = TRUE)

Orangeburg. Never heard about it. It's the Nathan Kline Institute, which again, I've never heard of. I think this shows how LCM is different from other fields of spatial transcriptomics; the other fields are more confined to the very elite institutions.

pubs_per_capita(lcm_abstracts, "usa")
pubs_per_capita(lcm_abstracts, "usa", plot = "bar")

I think that's because of NIH's labs in Bestheda and Baltimore. Apparently NIH has a lot interest in LCM, though not as much in other fields of spatial transcriptomics compared to some other most elite institutions. Anyway, I don't think this says too much about those states, since we know that academics are kind of like modern day foragers not anchored in a city early in their careers. Since the same people move around, I don't think this data says too much about the states themselves per se since the people doing the research are quite likely to be from somewhere else. I'm not convinced that Kansas and Virginia are really better than California in this field. But I still don't think I should plot the raw numbers on the choropleth, since it creates an illusion that physically larger states have more publications than they actually do just because they're physically larger.

How about over time?

pubs_per_year(lcm_abstracts, binwidth = 150)

This is not manually curated; it's just whatever PubMed gave me from a search. Curiously, there's a first peak, and a rising and taller second peak. I wonder why. When text mining, I can try to compare the two peaks.

I'll text mine the titles, keywords, or maybe abstracts as well. I can also try to extract country, state, city, institution, and department from the address field.

Text mining abstracts

I think the city (a convenient proxy of institution) and journal have a lot to do with the topic of an abstract. But meanwhile, if a lot of cities and journal only have 1 or 2 abstracts, then it makes the model unnecessarily complex given the relatively small corpus and won't make it more predictive. I know, I fit this model for explanation of importance of the covariates, but at least the model should capture the important features and not too much noise, in other words, not overfit, for me to trust the explanation.

length(unique(lcm_abstracts$city2))
length(unique(lcm_abstracts$journal))
lcm_abstracts %>% 
  count(city2) %>% 
  ggplot(aes(n)) +
  stat_ecdf() +
  labs(x = "Number of publications per city", y = "F") +
  scale_x_continuous(breaks = scales::breaks_width(2)) +
  scale_y_continuous(breaks = scales::breaks_width(0.2))

By lumping everything below 4 as "Other", there're only about 30% of cities left.

Another question: What percent of cities have what percent of publications? Rather than an ECDF of the number of publications per city, I can plot the cumulative number/proportion of all publications vs. rank of city.

df <- lcm_abstracts %>% 
  count(city2) %>% 
  arrange(desc(n)) %>% 
  mutate(rank = row_number(desc(n)),
         rel_rank = rank/max(rank),
         cum_n = cumsum(n),
         cum_prop = cum_n/sum(n))
ggplot(df, aes(rel_rank, cum_prop)) +
  geom_line() +
  scale_x_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) +
  scale_y_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) +
  labs(x = "City percentile", y = "Percent of publications")

When I do lump cities with fewer than 3 publications, the remaining 30% of cities have 70% of publications. But is it helpful at all that a bit over 30% of publications are in the "Other" category? Well, maybe better than not using cities at all, since given that in practice, topics have a lot to do with the interests of individual labs, it makes sense to include city as a covariate.

lcm_abstracts %>% 
  filter(journal != "bioRxiv") %>% 
  count(journal) %>% 
  ggplot(aes(n)) +
  stat_ecdf() +
  labs(x = "Number of publications per journal", y = "F") +
  scale_x_continuous(breaks = scales::breaks_width(2)) +
  scale_y_continuous(breaks = scales::breaks_width(0.2))

By lumping journals with fewer than 3 papers into "Other", there're about 25% of journals left.

df <- lcm_abstracts %>% 
  filter(journal != "bioRxiv") %>% 
  count(journal) %>% 
  arrange(desc(n)) %>% 
  mutate(rank = row_number(desc(n)),
         rel_rank = rank/max(rank),
         cum_n = cumsum(n),
         cum_prop = cum_n/sum(n))
ggplot(df, aes(rel_rank, cum_prop)) +
  geom_line() +
  scale_y_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Journal percentile", y = "Percent of publications")

The remaining top 25% of journals have 65% of papers.

This curve rises more sharply than the city curve.

abstracts <- lcm_abstracts %>% 
  mutate(city_eff = fct_lump_min(city2, 5),
         journal_eff = fct_lump_min(journal, 5))
ac <- corpus(abstracts, docid_field = "title", text_field = "abstract")
n_words <- summary(ac, n = Inf)
hist(n_words$Tokens)
hist(n_words$Types)
hist(n_words$Sentences)

Did the number of words in abstracts change over time?

ggplot(n_words, aes(date_published, Tokens)) +
  geom_point() +
  geom_smooth(method = "lm")

Nope.

Tuning stm model

I'm debating whether I should stem the words. Apparently here this doesn't really matter. I don't know any keyword that has multiple meanings in the context of LCM. Anyway, there're more burning questions like whether I should remove punctuation. I don't think it matters in this context, but just to check. I'll randomly pick 200 abstracts and use the preText package to get a sense of the effects of those preprocessing choices.

ac_sub <- corpus_sample(ac, size = 200)
ac_fp <- factorial_preprocessing(ac_sub, parallel = FALSE, 
                                 intermediate_directory = ".", 
                                 return_results = FALSE,
                                 infrequent_term_threshold = 0.005)
head(ac_fp$choices)
preText_results <- preText(
  ac_fp,
  dataset_name = "LCM abstracts",
  distance_method = "cosine",
  num_comparisons = 50)
preText_score_plot(preText_results)

The options with lower scores have less unusual results.

regression_coefficient_plot(preText_results,
                            remove_intercept = TRUE)

Here this means removing stopwords and punctuation makes the results less likely to be unusual, while removing infrequent terms may lead to more unusual results, while the other options might not significantly affect the results. Here preText only looked at how these preprocessing choices influence pairwise distances between abstracts. I see how this is relevant to political science, for which that package was written. How relevant is pairwise distance to LCM abstracts?

I think in the first trial, I'll only use unigrams, do stem the words, do remove stopwords and punctuation, do remove numbers, and do turn everything into lower case, for the simplest model. Then I can try what if I don't remove stopwords, what if I don't remove punctuation, and what if I do remove infrequent words and see how that would affect the topic modeling results.

What are the collocations, i.e. common phrases?

ac_col <- tokens(ac, split_hyphens = TRUE, remove_symbols = TRUE) %>% 
  tokens_remove(c(stopwords(), "sup", "alt")) %>% 
  tokens_remove(c("^c_", "^m_", "^o_"), valuetype = "regex") %>% 
  tokens_wordstem() %>% 
  textstat_collocations(min_count = 20, size = 2:4) %>% 
  filter(count_nested < count) %>% 
  arrange(desc(count)) 
ac_col

OK, now I know which phrases are common. While many of the phrases here are relevant and are usually used as if they were words, some are not, so I'll use a list of phrases based on my background.

phrases <- str_split(ac_col$collocation, pattern = " ", simplify = FALSE)
ac_tokens <- tokens(ac, remove_punct = TRUE, 
                    remove_symbols = TRUE, remove_url = TRUE,
                    split_hyphens = TRUE) %>% 
  tokens_tolower(keep_acronyms = TRUE) %>% 
  tokens_remove(c(stopwords(), "laser", "capture", "microdissection", "use", 
                  "using", "used", "uses", "LCM", "sup", "LMD", "gene", "genes",
                  "width", "height", "src", "figdir", "alt")) %>% 
  tokens_remove(c("^c_", "^m_", "^o_"), valuetype = "regex") %>% 
  tokens_wordstem() %>% 
  tokens_compound(phrases)
lcm_tokens <- ac_tokens
usethis::use_data(lcm_tokens, overwrite = TRUE)
ac_dfm <- dfm(ac_tokens)
ac_dfm2 <- dfm_trim(ac_dfm, min_docfreq = 0.001, docfreq_type = "prop")

Here spectral means using NMF for initialization. I debated for a while whether I should use the covariates. I eventually decided that I should. Otherwise it's just an intercept. At worst the effects of the covariates are not significant. But I do think some of the covariates are relevant to topic prevalence, such as journal -- the journal Plant Cell obviously is about plants and not about human cancer. OK, it seems that adding the journal covariate only hurts the held out likelihood whether I lump low frequency journals into "Other" or not, so I'll not use it. I lumped cities with fewer than 5 publications into "Other", since using smaller numbers hurts the held-out likelihood, but I still do want to see the effects of city so I don't want to lump too much. Over 80% of cities and about half of publications are now in "Other", so this only shows the effects of the top 15% or so cities.

k_result2 <- my_searchK(ac_dfm2, K = seq(5, 60, by = 5), 
                       prevalence = ~ date_num + city_eff + journal_eff,
                       gamma.prior = "L1",
                       verbose = FALSE, seed = 1891)
plot_k_result(k_result2)

Looks like the number of topics should be 25 to 40 according to the residuals, but held-out likelihood and semantic coherence. suggest a smaller number. The "residuals" here means dispersion of the residuals, and ideally it should be 1.

plot_ec(k_result2, seq(30, 60, by = 10))

As expected, fewer topics results into higher semantic coherence and lower exclusivity. This is really a trade off. Sort of a compromise, here I choose the model with K = 50. Since I have held out part of the data when choosing K, I'll use this K to fit the model with the full data.

stm_res <- stm(ac_dfm2, K = 50, prevalence = ~ date_num + city_eff + journal_eff,
               gamma.prior = "L1", seed = 1919)
plot(stm_res, n = 10)

Upon manual inspection, 50 topics, with the spline for date, seem to give more reasonable results.

saveRDS(k_result2, "k_result_lcm.rds")
saveRDS(stm_res, "stm_res_lcm.rds")

Tidyverse version: Again, I'm copying and pasting code from Julia Silge's blog. I learnt a lot about text mining from her. However, I don't like her colorful barcharts, so I changed her code to make the lollipop plot here.

findThoughts(stm_res, lcm_abstracts$abstract, 50, n = 10, thresh = 0.9)
plot_topic_words(stm_res)
  1. Stem cell and fetal development
  2. GWAS, genetic screens, and genetics of complex phenotypes
  3. Biomechanics, ECM, eye lens, muscles, and morphogenesis
  4. Data analysis, especially of RNA-seq, but also of 3D genome structure and microarray
  5. miRNAs in cancer
  6. Quantitative analyses of cancer, clinical and bioinformatic
  7. Hippocampus and Alzheimer's disease, sometimes related to Down syndrome
  8. Prostate cancer and other stuff in molecular biology and biochemistry, probably because some prostate cancer papers have an emphasis on molecular biology
  9. Plant embryos, plant development, and some stuff about evolution and ecology related to plants
  10. Proteomics, especially in cancer
  11. Cancer progression and diagnostics, especially lung cancer
  12. Inflammation and immunology, especially in skin diseases
  13. Breast cancer and liver cancer, with an emphasis in data analysis
  14. Neural circuitry, neural plasticity, brain injury, and behavior
  15. Plant gamitogenesis and reproduction
  16. Spasmolytic polypeptide-expressing metaplasia (SPEM), oncogenes, KRAS
  17. Endometrium and implantation. Somehow the top 2 entries are about hearing loss. Why? Epithelium?
  18. Cell cycle, also hepatic zonation and circadian rhythm (the latter is also a cycle, so...)
  19. Neurons, especially dopaminergic
  20. Tumor stroma and microenvironment
  21. Plant roots
  22. Intestine, especially microbiome and immune response
  23. Hypothalamus, obesity, and appetite
  24. PDAC, and some stuff about glioma and prostate cancer
  25. ALS, and other neurodegenerative diseases affecting motor neurons
  26. Epigenetics
  27. Tumor single cell profiling and cellular heterogeneity
  28. Tissue isolation and preparation
  29. Bone growth plate, especially recovery after radiotherapy, and some other stuff like oocytes, glaucoma, and epithelial injury
  30. Pancreas and diabetes, especially T2D
  31. Lymphocytes, lymphatic and blood vessels
  32. Prefrontal cortex and schizophrenia
  33. Synapses, dendritic spines, neuron potentiation, sometimes related to memory
  34. Cancer genomics, mutations, and phylogeny
  35. Bone formation, but also some other stuff about cancer and kidneys
  36. Neurodegenerative diseases, Alzheimer's, Parkinson's, and multiple system atrophy
  37. Spatial single cell techniques and imaging
  38. Connective tissues and ECM, and some other stuff about circadian rhythms
  39. Stem cells and development
  40. Plant seed development and reproduction
  41. RNA extraction and amplification, especially in microarray, but also in RNA-seq
  42. Lots of different stuff about epithelium
  43. Plant leaves, but also other stuff about gamitogenesis
  44. Cancer pathway analyses and molecular and cellular mechanisms
  45. Plant nitrogen fixation and soil microbiome
  46. Lots of different stuff related to fibrosis and fibroblasts, such as in lung diseases and graft rejection
  47. Neuron morphogenesis, axon guidance, somehow also angiogenesis, protein signaling
  48. Inflammation, immune response, especially in atherosclerosis, though there's some other stuff about blood vessels
  49. Model organisms and in vitro model systems
  50. Intrahepatic cholangiocarcinoma (ICC)
lcm_dfm2 <- ac_dfm2
usethis::use_data(lcm_dfm2, overwrite = TRUE)
usethis::use_data(stm_res, overwrite = TRUE)


pachterlab/museumst documentation built on April 20, 2024, 11:26 p.m.