knitr::opts_chunk$set(echo = TRUE)
To do for all notebooks: Explain what the code does, just as I did in the BBB notebook.
library(tidyverse)
library(lubridate)
library(preText)
library(quanteda)
library(stm)
library(text2vec)
library(lexvarsdatr)
library(ggrepel)
library(sf)
library(furrr)
library(stminsights)
library(uwot)
library(tidytext)
library(broom)
library(gam)
library(igraph)
library(ggraph)
library(bluster)
library(tidygraph)
devtools::load_all()
plan(multisession)
theme_set(theme_bw())
data("lcm_abstracts") data("lcm_city_gc")
pubs_on_map(lcm_abstracts, lcm_city_gc, plot = "hexbin", label_insts = FALSE, label_cities = TRUE, n_label = 7)
I'm not sure if this is a good idea, since it plots the number of publications per unit area. Per capita is confusing here, again because researchers move a lot and it's not clear who the "locals" are. Is per unit area more reasonable? Maybe; it at least helps when there's overplotting in the scatter plot.
pubs_on_map(lcm_abstracts, city_gc = lcm_city_gc, label_cities = TRUE, n_label = 7)
pubs_per_cat(lcm_abstracts, country, n_top = 20)
pubs_per_cat(lcm_abstracts, city, n_top = 20)
I can try to get city population data, though a complication is that some city names have different spellings, and sometimes multiple cities share the same name (trust me, there are cities named Los Angeles outside California). I could geocode those cities with the Google API and extract Google's standardized names. But I don't think this is all that important, so I'm not doing it right now.
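If I ever do, a minimal sketch might look like this; it assumes the tidygeocoder package (not loaded above) and a Google API key in the GOOGLEGEOCODE_API_KEY environment variable:

library(tidygeocoder)
# Hypothetical sketch: geocode the city strings with Google and keep Google's
# standardized formatted address to disambiguate cities that share a name.
city_std <- lcm_abstracts %>%
  distinct(city2) %>%
  geocode(address = city2, method = "google", full_results = TRUE) %>%
  select(city2, lat, long, formatted_address)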
Unfortunately, I can't check institutions here, since I haven't figured out a way to automatically extract and standardize institution names from the address field; the addresses don't follow a standardized format. Given more time, I probably could figure that out, but I'd rather not for now. The cities do tell a lot, though. Los Angeles means UCLA, USC, or hospitals like Cedars-Sinai. Ithaca means Cornell. New York means Columbia, NYU, the Cornell medical school, or Rockefeller. Boston means Harvard Medical School, or maybe Boston College or Boston University. Bethesda means the NIH. Again, all the usual suspects.
pubs_per_capita(lcm_abstracts)
pubs_per_capita(lcm_abstracts, plot = "bar")
That's very different from other areas of spatial transcriptomics.
pubs_on_map(lcm_abstracts, lcm_city_gc, "europe", label_cities = TRUE)
pubs_per_capita(lcm_abstracts, "europe")
pubs_per_capita(lcm_abstracts, "europe", plot = "bar")
Denmark never showed up for the other areas in spatial transcriptomics.
pubs_on_map(lcm_abstracts, lcm_city_gc, "usa", label_cities = TRUE)
Orangeburg? I've never heard of it. It's home to the Nathan Kline Institute, which I'd also never heard of. I think this shows how LCM differs from other fields of spatial transcriptomics; the other fields are more confined to the very elite institutions.
pubs_per_capita(lcm_abstracts, "usa")
pubs_per_capita(lcm_abstracts, "usa", plot = "bar")
I think that's because of the NIH's labs in Bethesda and Baltimore. Apparently the NIH has a lot of interest in LCM, though less interest in other fields of spatial transcriptomics than some of the most elite institutions. Anyway, I don't think this says much about the states themselves: academics are kind of like modern-day foragers, not anchored in a city early in their careers, so the people doing the research are quite likely from somewhere else. I'm not convinced that Kansas and Virginia are really better than California in this field. But I still don't think I should plot raw numbers on the choropleth, since that creates an illusion that physically larger states have more publications than they actually do, just because they're larger.
How about over time?
pubs_per_year(lcm_abstracts, binwidth = 150)
This is not manually curated; it's just what PubMed returned from a search. Curiously, there's a first peak, and then a taller second peak that is still rising. I wonder why. When text mining, I can try to compare the two peaks.
I'll text mine the titles, keywords, or maybe abstracts as well. I can also try to extract country, state, city, institution, and department from the address field.
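As a rough sketch of what that extraction could look like (parse_address is a hypothetical helper; it assumes the country is the last comma-separated field of the affiliation string, which is often but not always true for PubMed):

# Hypothetical sketch: take the last comma-separated field of each affiliation
# as the country and the second-to-last as a city candidate. Real PubMed
# affiliations are messier (emails, zip codes, multiple affiliations), so this
# is only a starting point.
parse_address <- function(address) {
  parts <- str_split(address, ",\\s*")
  tibble(
    country = map_chr(parts, ~ str_remove(last(.x), "\\.$")),
    city_candidate = map_chr(parts, ~ if (length(.x) > 1) nth(.x, -2) else NA_character_)
  )
}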
I think the city (a convenient proxy for institution) and the journal have a lot to do with the topic of an abstract. But if many cities and journals have only one or two abstracts, including them makes the model unnecessarily complex without making it more predictive, given the relatively small corpus. Yes, I'm fitting this model to explain the importance of the covariates, but for me to trust that explanation, the model should capture the important features without too much noise; in other words, it should not overfit.
length(unique(lcm_abstracts$city2))
length(unique(lcm_abstracts$journal))
lcm_abstracts %>% count(city2) %>% ggplot(aes(n)) + stat_ecdf() + labs(x = "Number of publications per city", y = "F") + scale_x_continuous(breaks = scales::breaks_width(2)) + scale_y_continuous(breaks = scales::breaks_width(0.2))
Lumping every city with fewer than 4 publications into "Other" leaves only about 30% of cities.
Another question: what percentage of cities accounts for what percentage of publications? Rather than an ECDF of the number of publications per city, I can plot the cumulative number or proportion of all publications against city rank.
df <- lcm_abstracts %>% count(city2) %>% arrange(desc(n)) %>% mutate(rank = row_number(desc(n)), rel_rank = rank/max(rank), cum_n = cumsum(n), cum_prop = cum_n/sum(n)) ggplot(df, aes(rel_rank, cum_prop)) + geom_line() + scale_x_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) + scale_y_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) + labs(x = "City percentile", y = "Percent of publications")
When I lump cities with fewer than 3 publications, the remaining 30% of cities have 70% of the publications. But is it helpful at all that a bit over 30% of publications end up in the "Other" category? Probably better than not using cities at all: in practice, topics have a lot to do with the interests of individual labs, so it makes sense to include city as a covariate. A quick way to check those proportions is sketched below.
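This is a minimal sketch, with min_n as a placeholder threshold (adjust it to match the lumping actually used):

# Sketch: for a given lumping threshold, what fraction of cities survive,
# and what fraction of publications do they account for? min_n is a placeholder.
min_n <- 3
lcm_abstracts %>%
  count(city2) %>%
  summarise(prop_cities_kept = mean(n >= min_n),
            prop_pubs_kept = sum(n[n >= min_n]) / sum(n))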
lcm_abstracts %>% filter(journal != "bioRxiv") %>% count(journal) %>% ggplot(aes(n)) + stat_ecdf() + labs(x = "Number of publications per journal", y = "F") + scale_x_continuous(breaks = scales::breaks_width(2)) + scale_y_continuous(breaks = scales::breaks_width(0.2))
Lumping journals with fewer than 3 papers into "Other" leaves about 25% of journals.
df <- lcm_abstracts %>% filter(journal != "bioRxiv") %>% count(journal) %>% arrange(desc(n)) %>% mutate(rank = row_number(desc(n)), rel_rank = rank/max(rank), cum_n = cumsum(n), cum_prop = cum_n/sum(n)) ggplot(df, aes(rel_rank, cum_prop)) + geom_line() + scale_y_continuous(breaks = scales::breaks_width(0.2), labels = scales::percent) + scale_x_continuous(labels = scales::percent) + labs(x = "Journal percentile", y = "Percent of publications")
The remaining top 25% of journals have 65% of papers.
This curve rises more sharply than the city curve.
abstracts <- lcm_abstracts %>% mutate(city_eff = fct_lump_min(city2, 5), journal_eff = fct_lump_min(journal, 5))
ac <- corpus(abstracts, docid_field = "title", text_field = "abstract")
n_words <- summary(ac, n = Inf)
hist(n_words$Tokens)
hist(n_words$Types)
hist(n_words$Sentences)
Did the number of words in abstracts change over time?
ggplot(n_words, aes(date_published, Tokens)) + geom_point() + geom_smooth(method = "lm")
Nope.
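Since broom is loaded, the flat trend can also be checked numerically; a minimal sketch, assuming date_published is the docvar used in the plot above:

# Sketch: regress abstract length on publication date; lm() treats the Date
# column as numeric (days), so the slope is in tokens per day.
tidy(lm(Tokens ~ date_published, data = n_words))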
I'm debating whether I should stem the words. It probably doesn't matter much here; I don't know any keyword that has multiple meanings in the context of LCM. Anyway, there are more burning questions, like whether I should remove punctuation. I don't think it matters in this context, but just to check, I'll randomly pick 200 abstracts and use the preText package to get a sense of the effects of those preprocessing choices.
ac_sub <- corpus_sample(ac, size = 200)
ac_fp <- factorial_preprocessing(ac_sub, parallel = FALSE, intermediate_directory = ".", return_results = FALSE, infrequent_term_threshold = 0.005)
head(ac_fp$choices)
preText_results <- preText( ac_fp, dataset_name = "LCM abstracts", distance_method = "cosine", num_comparisons = 50)
preText_score_plot(preText_results)
The options with lower scores have less unusual results.
regression_coefficient_plot(preText_results, remove_intercept = TRUE)
Here, this means removing stopwords and punctuation makes the results less likely to be unusual, removing infrequent terms may lead to more unusual results, and the other options might not significantly affect the results. Note that preText only looks at how these preprocessing choices influence pairwise distances between abstracts. I see how that's relevant to political science, for which the package was written, but how relevant is pairwise distance to LCM abstracts?
For the first trial, I'll use only unigrams, stem the words, remove stopwords, punctuation, and numbers, and lowercase everything, for the simplest model. Then I can try not removing stopwords, not removing punctuation, and removing infrequent words, and see how each choice affects the topic modeling results.
What are the collocations, i.e. common phrases?
ac_col <- tokens(ac, split_hyphens = TRUE, remove_symbols = TRUE) %>% tokens_remove(c(stopwords(), "sup", "alt")) %>% tokens_remove(c("^c_", "^m_", "^o_"), valuetype = "regex") %>% tokens_wordstem() %>% textstat_collocations(min_count = 20, size = 2:4) %>% filter(count_nested < count) %>% arrange(desc(count))
ac_col
OK, now I know which phrases are common. While many of the phrases here are relevant and usually used as if they were single words, some are not, so I'll curate the list of phrases based on my background knowledge.
phrases <- str_split(ac_col$collocation, pattern = " ", simplify = FALSE)
ac_tokens <- tokens(ac, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE, split_hyphens = TRUE) %>% tokens_tolower(keep_acronyms = TRUE) %>% tokens_remove(c(stopwords(), "laser", "capture", "microdissection", "use", "using", "used", "uses", "LCM", "sup", "LMD", "gene", "genes", "width", "height", "src", "figdir", "alt")) %>% tokens_remove(c("^c_", "^m_", "^o_"), valuetype = "regex") %>% tokens_wordstem() %>% tokens_compound(phrases)
lcm_tokens <- ac_tokens usethis::use_data(lcm_tokens, overwrite = TRUE)
ac_dfm <- dfm(ac_tokens)
ac_dfm2 <- dfm_trim(ac_dfm, min_docfreq = 0.001, docfreq_type = "prop")
Here "spectral" means using a spectral, NMF-based initialization. I debated for a while whether I should use the covariates, and eventually decided that I should: otherwise the prevalence model is just an intercept, and at worst the covariate effects turn out not to be significant. I do think some covariates are relevant to topic prevalence, such as journal: the journal Plant Cell is obviously about plants and not human cancer. That said, it seems that adding the journal covariate only hurts the held-out likelihood, whether or not I lump low-frequency journals into "Other", so I won't use it. I lumped cities with fewer than 5 publications into "Other", since smaller thresholds hurt the held-out likelihood, but I still want to see city effects, so I don't want to lump too much. Over 80% of cities and about half of publications are now in "Other", so this only shows the effects of the top 15% or so of cities.
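For reference, here's a minimal sketch of how that held-out comparison can be done with stm's make.heldout() and eval.heldout(); K = 30 is a placeholder rather than the final choice, and the covariates are assumed to be in the dfm's docvars, as in the searchK call below:

# Sketch: compare held-out likelihood with vs. without the journal covariate.
stm_in <- convert(ac_dfm2, to = "stm")
ho <- make.heldout(stm_in$documents, stm_in$vocab, seed = 1891)
fit_j <- stm(ho$documents, ho$vocab, K = 30, data = stm_in$meta,
             prevalence = ~ date_num + city_eff + journal_eff,
             gamma.prior = "L1", init.type = "Spectral", verbose = FALSE)
fit_nj <- stm(ho$documents, ho$vocab, K = 30, data = stm_in$meta,
              prevalence = ~ date_num + city_eff,
              gamma.prior = "L1", init.type = "Spectral", verbose = FALSE)
c(with_journal = eval.heldout(fit_j, ho$missing)$expected.heldout,
  without_journal = eval.heldout(fit_nj, ho$missing)$expected.heldout)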
k_result2 <- my_searchK(ac_dfm2, K = seq(5, 60, by = 5), prevalence = ~ date_num + city_eff + journal_eff, gamma.prior = "L1", verbose = FALSE, seed = 1891)
plot_k_result(k_result2)
It looks like the number of topics should be 25 to 40 according to the residuals, but held-out likelihood and semantic coherence suggest a smaller number. The "residuals" here means the dispersion of the residuals, which ideally should be 1.
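Once a model is fit, that dispersion can be checked directly with stm's checkResiduals(); a sketch, where fit stands in for any fitted model from the search:

# Sketch: estimated dispersion of the multinomial residuals; values well above
# 1 suggest K is too small. `fit` is a placeholder for a fitted stm model.
checkResiduals(fit, convert(ac_dfm2, to = "stm")$documents)$dispersion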
plot_ec(k_result2, seq(30, 60, by = 10))
As expected, fewer topics result in higher semantic coherence and lower exclusivity. It's really a trade-off. As a compromise, I choose the model with K = 50. Since I held out part of the data when choosing K, I'll use this K to fit the model on the full data.
stm_res <- stm(ac_dfm2, K = 50, prevalence = ~ date_num + city_eff + journal_eff, gamma.prior = "L1", seed = 1919)
plot(stm_res, n = 10)
Upon manual inspection, 50 topics, with a spline for the date, seem to give more reasonable results.
saveRDS(k_result2, "k_result_lcm.rds")
saveRDS(stm_res, "stm_res_lcm.rds")
Tidyverse version: again, I'm copying and pasting code from Julia Silge's blog; I learned a lot about text mining from her. However, I don't like her colorful bar charts, so I changed her code to make the lollipop plot here.
findThoughts(stm_res, lcm_abstracts$abstract, 50, n = 10, thresh = 0.9)
plot_topic_words(stm_res)
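For reference, a minimal sketch of the kind of tidytext-based lollipop plot that plot_topic_words() draws, adapted from Julia Silge's workflow (the exact aesthetics in plot_topic_words() may differ):

# Sketch: tidy the topic-word (beta) matrix and show top words per topic as
# lollipops instead of bars. reorder_within() and scale_y_reordered() are from
# tidytext; the 7 words per topic shown here is arbitrary.
td_beta <- tidy(stm_res)
td_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 7) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term)) +
  geom_segment(aes(xend = 0, yend = term), color = "gray70") +
  geom_point() +
  scale_y_reordered() +
  facet_wrap(~ topic, scales = "free_y")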
lcm_dfm2 <- ac_dfm2 usethis::use_data(lcm_dfm2, overwrite = TRUE) usethis::use_data(stm_res, overwrite = TRUE)