library(devtools) library(dplyr) library(readr) library(stringr)
# Read in latest data pull raw_data <- read_csv("latest_nhc_data.csv") print(raw_data)
# WIP Functions I'll probably include in the package # once I'm done para_preclean <- function(x) { # # Remove normal line returns # out <- gsub("\n", "", x) # the apostrophe in Xi'an and others that throws weird errors out <- gsub("\u0092", "'", x) return(out) } para_split <- function(x) { # Split on carriage returns splits <- strsplit(x, "\n") # Remove empty sets out <- lapply(splits, function(x) x[x != ""]) return(out) }
I'll try just splitting each article by carriage return.
Then, I take the oldest and latest example of each "article type" by paragraph count where n_content shows how many bulletins had the same number of paragraphs (n_para) So we can see if there was an iterative change in design over time or if they're interspersed.
Ideally, what we see is a linear (or at least consistent) usage of a particular type over time
para_summary_df <- raw_data %>% mutate( content = para_preclean(content), para_splits = para_split(content), n_para = vapply(para_splits, length, integer(1)) ) %>% group_by(n_para) %>% mutate(n_content = n()) %>% arrange(desc(date)) %>% filter(date %in% range(date)) %>% ungroup() print(para_summary_df)
Not exactly as expected, but there is a pattern, and the vast majority can be split into 5 paragraphs.
That style of bulletin apparently first began in 2/14/2020, so that's positive. We could set an initial value for each region and just write a parser for those bulletins and ignore the rest for the most part.
Let's take a random bulletin and get the data into tabular shape.
to_parse <- raw_data %>% mutate( content = para_preclean(content), para_splits = para_split(content), n_para = vapply(para_splits, length, integer(1)) ) %>% filter(n_para == 5) %>% arrange(desc(date)) %>% slice(1) cat(to_parse$content)
The easiest thing to do here is probably to latch onto "in" and then use the right side as the name, and the left side numbers as the value.
first_para <- to_parse$para_splits[[1]][1] str_match_all(first_para, "(([\\d,]*) in ([^\\d,]*) ([,\\)]*|and)*)")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.