In giocomai/castarter: Content Analysis Starter Toolkit

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

library("castarter")
library("dplyr")
library("ggplot2")

# dataset <- tifkremlinen::kremlin_en

Textual datasets generated from online resources may have a number of issues, including missing documents (e.g. due to server issues at the time of download), duplicate items, and "empty" items (either because the given page really has no textual content, or because of errors during download or extraction).

Some such errors (e.g. missing paragraphs) may be difficult to ascertain, while others may be easier to spot.

This vignette illustrates some checks that may be run after creating a dataset to spot common issues, based on a dataset of items published on the Kremlin's website. Some of the summary statistics generated in the process may usefully be added to the dataset if it is distributed.

Check number of publications by day

An excessive number of publications recorded as published on a given date may be a hint that further checks are needed.

dataset %>%
  group_by(date) %>%
  count(sort = TRUE)
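As a rough sketch of how such days could be flagged automatically rather than by reading the sorted table, the example below compares each daily count with the median. The `toy_dataset` tibble is a hypothetical stand-in for `dataset`, and the factor of 3 is an arbitrary assumption that should be adapted to the source at hand.

```r
library("dplyr")

# Hypothetical toy data standing in for `dataset` (only `date` and `title`)
toy_dataset <- tibble::tibble(
  date = as.Date(c(
    rep("2020-01-01", 6), "2020-01-02",
    "2020-01-03", "2020-01-03", "2020-01-04"
  )),
  title = paste("Item", 1:10)
)

# Number of items recorded for each date
daily_counts <- toy_dataset %>%
  count(date, name = "n")

# Keep only days with far more items than the median day
# (the factor 3 is an arbitrary assumption, not a recommended value)
suspicious_days <- daily_counts %>%
  filter(n > 3 * median(n))

suspicious_days
```

Days returned by this filter are only candidates for a manual check: a busy day may be genuine, or it may indicate that a default date was assigned when the real one could not be extracted.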

Check if there are many publications with exactly the same title

Especially on institutional websites, it is not uncommon to have several items with exactly the same title. However, an excessive number of such occurrences may deserve an additional check.

dataset %>%
  group_by(title) %>%
  count(sort = TRUE)
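A repeated title is not necessarily a duplicate item. One possible way to tell the two apart, sketched here on a hypothetical `toy_dataset` with the same `date`, `title` and `text` columns, is to look for items that share both title and text.

```r
library("dplyr")

# Hypothetical toy data standing in for `dataset`
toy_dataset <- tibble::tibble(
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03")),
  title = c("Press statement", "Press statement", "Press statement", "Working meeting"),
  text = c("Same text.", "Same text.", "Different text.", "Another text.")
)

# Items that appear more than once with identical title and text
duplicated_items <- toy_dataset %>%
  add_count(title, text, name = "n_copies") %>%
  filter(n_copies > 1) %>%
  arrange(title, date)

duplicated_items
```

In this toy example the third "Press statement" is not reported, because its text differs from the other two.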

Check if the distribution of publications over time looks unusual

n_days <- 90

dataset %>%
  dplyr::filter(!is.na(date)) %>%
  group_by(date) %>%
  count(name = "n") %>%
  ungroup() %>%
  mutate(n = slider::slide_period_dbl(
    .x = n,
    .i = date,
    .period = "day",
    .f = mean,
    .before = n_days / 2,
    .after = n_days / 2
  )) %>%
  ggplot(mapping = aes(x = date, y = n)) +
  geom_line() +
  labs(
    title = "Number of publications per day",
    caption = paste("* Calculated on a rolling mean of", sum(n_days, 1), "days")
  )
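Note that days without any publication do not appear in the counts at all, so a rolling mean computed on them only averages over days with at least one item. If days without publications should count as zero, one option is to fill in the missing dates first. The sketch below uses `tidyr::complete()` on a hypothetical `toy_dataset`; whether gaps should be treated as zeros or as missing data depends on the source.

```r
library("dplyr")
library("tidyr")

# Hypothetical toy data standing in for `dataset` (only `date` is needed here)
toy_dataset <- tibble::tibble(
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-04", NA))
)

# Daily counts, with an explicit zero for every day without items
daily_counts <- toy_dataset %>%
  filter(!is.na(date)) %>%
  count(date, name = "n") %>%
  complete(
    date = seq(min(date), max(date), by = "day"),
    fill = list(n = 0L)
  )

daily_counts
```

The resulting data frame has one row per calendar day between the first and last date, and can be passed to the same rolling mean and plotting code.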

Using an interactive timeline may make it easier to check whether changes correspond to significant dates. In this case, the number of publications grew significantly in early 2008. This is not particularly surprising, however, as it corresponds to the time when Dmitri Medvedev became president.

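dataset %>%
  # drop items without a date, as the rolling window needs a complete index
  dplyr::filter(!is.na(date)) %>%
  group_by(date) %>%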
  count(name = "n") %>%
  ungroup() %>%
  mutate(n = slider::slide_period_dbl(
    .x = n,
    .i = date,
    .period = "day",
    .f = mean,
    .before = n_days / 2,
    .after = n_days / 2
  )) %>%
  mutate(string = "Publications") %>%
  cas_show_ts_dygraph()

Length of posts

Is there any item with no contents at all?

dataset %>%
  dplyr::filter(is.na(text))

dataset %>%
  dplyr::filter(text == "")
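The two filters above miss items whose text consists only of spaces or line breaks. A minimal sketch that catches missing, empty and whitespace-only texts in one step, using base R's `trimws()` on a hypothetical `toy_dataset`, could look like this:

```r
library("dplyr")

# Hypothetical toy data standing in for `dataset`
toy_dataset <- tibble::tibble(
  title = c("A", "B", "C", "D"),
  text = c("Some actual text.", "", "  \n ", NA)
)

# Items with missing, empty, or whitespace-only text
empty_items <- toy_dataset %>%
  filter(is.na(text) | trimws(text) == "")

empty_items
```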

Or are there any surprisingly short posts?

dataset %>%
  mutate(nchar = nchar(text)) %>%
  arrange(nchar) %>%
  select(date, title, nchar)
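The checks above can also be condensed into a small table of summary statistics, which may be useful to include if the dataset is distributed. The following is only a sketch on a hypothetical `toy_dataset`; the choice of statistics and the column names (`n_items`, `n_missing_date`, `n_empty_text`, `median_nchar`) are assumptions, not part of castarter.

```r
library("dplyr")

# Hypothetical toy data standing in for `dataset`
toy_dataset <- tibble::tibble(
  date = as.Date(c("2020-01-01", "2020-01-02", NA, "2020-01-04")),
  title = c("A", "B", "C", "D"),
  text = c("A reasonably long text.", "", "Short", NA)
)

# One-row overview of the main quality indicators
summary_stats <- toy_dataset %>%
  summarise(
    n_items = n(),
    n_missing_date = sum(is.na(date)),
    n_empty_text = sum(is.na(text) | trimws(text) == ""),
    median_nchar = median(nchar(text), na.rm = TRUE)
  )

summary_stats
```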

giocomai/castarter documentation built on June 12, 2025, 8:49 p.m.
