knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE )
library("castarter") library("dplyr") library("ggplot2") # dataset <- tifkremlinen::kremlin_en
Textual datasets generated from online resources may have some issues, including missing documents (e.g. due to server issues at time of download), duplicate items, "empty" items (either because the given page actually does not have textual contents, but possibly due to errors), etc.
Some such errors (e.g. missing paragraphs) may be difficult to ascertain, while others may be easier to spot.
This vignette illustrates some common checks that may be run after creating a dataset to find some common issues, based on a dataset of items published on the Kremlin's website. Some of the summary statistics generated in the process may usefully be added to the dataset if distributed.
An excessive number of publications recorded as published on a given date may be a hint that further checks are needed.
dataset %>% group_by(date) %>% count(sort = TRUE)
Especially on institutional websites, it is not uncommon to have more items with exactly the same title. However, an excessive number of such occurrences may deserve an additional check.
dataset %>% group_by(title) %>% count(sort = TRUE)
n_days <- 90 dataset %>% dplyr::filter(is.na(date) == FALSE) %>% group_by(date) %>% count(name = "n") %>% ungroup() %>% mutate(n = slider::slide_period_dbl( .x = n, .i = date, .period = "day", .f = mean, .before = n_days / 2, .after = n_days / 2 )) %>% ggplot(mapping = aes(x = date, y = n)) + geom_line() + labs( title = "Number of publications per day", caption = paste("* Calculated on a rolling mean of", sum(n_days, 1), "days") )
Using an interactive timeline may make it easier to check if changes correspond to significant dates. As appears from this case, the number of publications grew significantly in early 2008. This may however not be particularly surprising, as it corresponds with the time when Dmitri Medvedev became president.
dataset %>% group_by(date) %>% count(name = "n") %>% ungroup() %>% mutate(n = slider::slide_period_dbl( .x = n, .i = date, .period = "day", .f = mean, .before = n_days / 2, .after = n_days / 2 )) %>% mutate(string = "Publications") %>% cas_show_ts_dygraph()
Is there any item with no contents at all?
dataset %>% dplyr::filter(is.na(text))
dataset %>% dplyr::filter(text == "")
Or is there any surprisingly short post?
dataset %>% mutate(nchar = nchar(text)) %>% arrange(nchar) %>% select(date, title, nchar)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.