knitr::opts_chunk$set( message = FALSE, warning = FALSE, comment = "#>", collapse = TRUE, cache = FALSE )
In this tutorial, we describe how to perform basic text mining using line-level data from NSSP-ESSENCE. This example uses line-level data details from the Chief Complaint Query Validation (CCQV) data source for the CDC Coronavirus-DD definition, limited to emergency department (ED) visits (Has Been Emergency = "Yes").
The CCQV data source includes the chief complaint and discharge diagnosis fields and does not include facility location, patient location, or other demographic information. This data source was created with the NSSP Community of Practice (CoP) so that users can test and create new syndrome categories using a large corpus of chief complaint and diagnosis code data with more variation than would be present in any one site. Some sites have opted out of contributing their data to the Query Validation data source, including Arizona, Idaho, Illinois, Marion County (Indiana), Massachusetts, North Dakota, and Ohio.
We start this tutorial by loading the Rnssp package and all other necessary packages.
library(Rnssp)
library(dplyr)   # used below for mutate(), count(), and top_n()
library(stringr) # used below for str_to_upper()
library(tidyr)
library(widyr)
library(tidytext)
library(ggplot2)
library(ggthemes)
library(ggpubr)
library(forcats)
library(igraph)
library(ggraph)
Next, we create an NSSP user profile.
myProfile <- readRDS("~/myProfile.rds")
# Creating an ESSENCE user profile
myProfile <- create_profile()

# Save profile object to file for future use
# save(myProfile, file = "myProfile.rda")
# saveRDS(myProfile, "myProfile.rds")

# Load profile object
# load("myProfile.rda")
# myProfile <- readRDS("myProfile.rds")
The above code needs to be executed only once. Upon execution, it prompts the user to provide their username and password.
Alternatively, the myProfile object can be saved to a file as an .RData (.rda) or .rds file and loaded later to automate R Markdown report generation.
We now use the myProfile object to authenticate to NSSP-ESSENCE and pull the data.
url <- "https://essence2.syndromicsurveillance.org/nssp_essence/api/dataDetails/csv?endDate=31Dec2020&ccddCategory=cdc%20coronavirus-dd%20v1&timeResolution=weekly&hasBeenE=1&percentParam=noPercent&userId=2362&datasource=va_erccdd&aqtTarget=DataDetails&detector=nodetectordetector&startDate=30Dec2020" # Data Pull from NSSP-ESSENCE api_data <- get_api_data(url, fromCSV = TRUE) # or myProfile$get_api_data(url, fromCSV = TRUE) # glimpse(api_data)
The Rnssp::clean_text() function performs a standard series of text cleaning and pre-processing operations: it removes meaningless patterns and common stop words, standardizes the free text, and returns a cleaned version of the original data frame that is well suited to summarizing a large corpus.
data <- api_data %>%
  clean_text() %>%
  mutate(linenumber = row_number()) # row id used later as the visit-level grouping variable
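To see what clean_text() changed, you can compare the raw chief complaint with its cleaned counterpart side by side. In the sketch below, chief_complaint_parsed is the cleaned field used throughout this tutorial, while ChiefComplaintOrig is an assumed name for the raw field; adjust it to whichever column your data details pull actually contains.

# Compare raw vs. cleaned chief complaints for a few visits
# NOTE: ChiefComplaintOrig is an assumed column name -- check names(data)
data %>%
  select(ChiefComplaintOrig, chief_complaint_parsed) %>%
  head(5)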
After cleaning and preparing free-text fields, the next step in most natural language processing tasks is tokenization. Tokenization is the process of partitioning sentences, paragraphs, or documents into smaller units such as individual words (unigrams) or contiguous sequences of n words (n-grams). These smaller units are referred to as tokens and provide a feature-based representation of the text. The unnest_tokens function from tidytext splits a column into tokens, resulting in a one-token-per-row data frame. Its key arguments are output (the name of the new token column), input (the free-text column to split), token (the tokenization unit, such as "words" or "ngrams"), and, for n-grams, n (the number of words per token).
cc_unigrams <- data %>%
  unnest_tokens(output = word, input = chief_complaint_parsed, token = "words") %>%
  filter(!is.na(word)) %>%
  anti_join(stop_words)

cc_bigrams <- data %>%
  unnest_tokens(output = bigram, input = chief_complaint_parsed, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ", remove = FALSE) %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

cc_trigrams <- data %>%
  unnest_tokens(output = trigram, input = chief_complaint_parsed, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ", remove = FALSE) %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word)
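For intuition about the one-token-per-row output, here is a minimal, self-contained illustration of unnest_tokens on a made-up two-row tibble (not the ESSENCE data):

# Toy example: tokenize two fake chief complaints
toy <- tibble::tibble(
  linenumber = 1:2,
  text = c("fever and cough", "shortness of breath")
)

# One row per word (unigrams)
toy %>%
  unnest_tokens(output = word, input = text, token = "words")

# One row per pair of adjacent words (bigrams)
toy %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)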
To summarize the top 10 chief complaint unigram and bigram frequencies, we can use the count and top_n functions from dplyr and visualize the results in horizontal bar charts. For the bars to appear in descending order, we use fct_reorder from forcats to convert the n-grams to a factor whose levels are ordered by their frequencies; by default, ggplot arranges the bars alphabetically unless the categorical variable is factored this way. After creating the two separate ggplot objects, we use ggarrange from ggpubr to arrange the plots in a horizontal grid.
pal <- economist_pal()(2)

cc_unigram_freq <- cc_unigrams %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  top_n(10) %>%
  mutate(
    word = str_to_upper(word),
    word = fct_reorder(word, n)
  ) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(show.legend = FALSE, color = "black", fill = pal[1]) +
  scale_y_continuous(labels = scales::comma) +
  theme_few() +
  labs(title = "Top 10 Chief Complaint Unigram Frequencies", x = "Unigram", y = "Frequency") +
  coord_flip() +
  theme(plot.margin = unit(c(1, 1, 1, 1), "cm"), title = element_text(size = 9))

cc_bigram_freq <- cc_bigrams %>%
  count(bigram, sort = TRUE) %>%
  ungroup() %>%
  top_n(10) %>%
  mutate(
    bigram = str_to_upper(bigram),
    bigram = fct_reorder(bigram, n)
  ) %>%
  ggplot(aes(x = bigram, y = n)) +
  geom_col(show.legend = FALSE, color = "black", fill = pal[1]) +
  scale_y_continuous(labels = scales::comma) +
  theme_few() +
  labs(title = "Top 10 Chief Complaint Bigram Frequencies", x = "Bigram", y = "Frequency") +
  coord_flip() +
  theme(plot.margin = unit(c(1, 1, 1, 1), "cm"), title = element_text(size = 9))

ggarrange(cc_unigram_freq, cc_bigram_freq, nrow = 1, ncol = 2)
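As a quick aside, the toy example below (made-up counts, not the ESSENCE data) isolates the fct_reorder point: without it, geom_col orders the bars alphabetically rather than by frequency.

# Toy demonstration of fct_reorder: bars ordered by frequency, not alphabetically
toy_counts <- tibble::tibble(
  word = c("cough", "ache", "fever"),
  n = c(5, 2, 9)
)

ggplot(toy_counts, aes(x = fct_reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency")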
To visualize which terms tend to appear together in the same chief complaint, we count how often each pair of unigrams co-occurs within a visit using pairwise_count from widyr, keep the most frequent pairs, and draw the result as a network with igraph and ggraph.

cc_cooccurrence_net <- cc_unigrams %>%
  pairwise_count(word, linenumber, sort = TRUE, upper = FALSE) %>%
  top_n(200) %>%
  slice(1:200) %>%
  mutate(item1 = toupper(item1), item2 = toupper(item2)) %>%
  graph_from_data_frame()

ggraph(cc_cooccurrence_net, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = pal[1], show.legend = FALSE) +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
  theme_void()
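It can also help to look at the underlying pair counts directly; the snippet below reuses the same pairwise_count call and prints the ten most frequent co-occurring pairs.

# Ten most frequently co-occurring unigram pairs within a chief complaint
cc_unigrams %>%
  pairwise_count(word, linenumber, sort = TRUE, upper = FALSE) %>%
  head(10)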
While co-occurrence measures how frequently two terms appear together in the same chief complaint, pairwise_cor from widyr provides a measure of correlation between terms. Specifically, pairwise_cor computes the phi coefficient between two words, $w_1$ and $w_2$, defined as
$$
\phi = \frac{n_{11}n_{00} - n_{10}n_{01}}{\sqrt{n_{1 \bullet}\, n_{0 \bullet}\, n_{\bullet 0}\, n_{\bullet 1}}}.
$$
In this context,
\begin{table}[]
\begin{tabular}{llll}
 & Contains $w_2$ & Does not contain $w_2$ & Total \\
Contains $w_1$ & $n_{11}$ & $n_{10}$ & $n_{1 \bullet}$ \\
Does not contain $w_1$ & $n_{01}$ & $n_{00}$ & $n_{0 \bullet}$ \\
Total & $n_{\bullet 1}$ & $n_{\bullet 0}$ & $n$
\end{tabular}
\end{table}
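As a concrete check of the formula, the sketch below computes the phi coefficient by hand from made-up cell counts for a hypothetical pair of words; this is the same quantity pairwise_cor returns for each word pair.

# Worked example: phi coefficient from hypothetical 2x2 counts
n11 <- 120 # visits containing both w1 and w2
n10 <- 80  # visits containing w1 but not w2
n01 <- 60  # visits containing w2 but not w1
n00 <- 740 # visits containing neither

n1_ <- n11 + n10 # total visits containing w1
n0_ <- n01 + n00 # total visits not containing w1
n_1 <- n11 + n01 # total visits containing w2
n_0 <- n10 + n00 # total visits not containing w2

phi <- (n11 * n00 - n10 * n01) / sqrt(n1_ * n0_ * n_1 * n_0)
phi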
cc_correlation_net <- cc_unigrams %>%
  group_by(word) %>%
  filter(n() >= 20) %>%              # keep words appearing at least 20 times
  pairwise_cor(word, linenumber, sort = TRUE) %>%
  filter(correlation > 0.5) %>%      # keep strongly correlated pairs
  graph_from_data_frame()

ggraph(cc_correlation_net, layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), edge_colour = pal[1], show.legend = FALSE) +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
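To explore correlations for a single symptom term rather than the whole network, you can keep the pairwise_cor output as its own object and filter it; "cough" below is just an illustrative choice.

# Correlations involving one term of interest ("cough" is only an example)
cc_word_cors <- cc_unigrams %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, linenumber, sort = TRUE)

cc_word_cors %>%
  filter(item1 == "cough") %>%
  head(10)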