knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  fig.align = "center"
)

This vignette outlines the basic workflow for accessing Estonian Health Statistics and Health Research (TAI) datasets with the boulder package.

Download list of available datasets

Data tables available in the TAI database are listed in the data frame returned by the get_all_tables() function. By default, the list of tables is loaded from a local copy supplied with the package (local = TRUE). Set the argument local = FALSE to download a fresh list of tables from TAI.

The columns Database and Node describe a table's address in the database tree. Database names are available only in Estonian.

Updated and url contain the date of the last update and the data table URL, respectively.

library(dplyr)
library(stringr)
library(tidyr)
library(ggplot2)
library(viridis)
library(boulder)
tabs <- get_all_tables(lang = "en")
tabs

Table descriptions are stored in the Title column. For better matching, we convert the Title column to lower case.

When looking for tables of interest, it may help to consult the TAI database website: although database names, node names, and table titles are human-readable, the naming sometimes feels inconsistent and unintuitive (at least to the author).

Cancer incidence

Cancer morbidity data is stored under the Malignant neoplasms node.

tabs %>% 
  mutate(descr = str_to_lower(Title)) %>% 
  filter(str_detect(descr, "neoplasm")) %>% 
  select(Name, Title, Node, Updated) %>% 
  knitr::kable()
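
When several keywords need to be tried, the search above can be wrapped into a small helper. The sketch below is an illustration only: find_tables() is a hypothetical helper, not part of boulder, and toy_tabs is a made-up stand-in for the data frame returned by get_all_tables(). Only the column names Name, Title, Node and Updated are taken from the code above.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for the table list; names, titles and dates are illustrative
toy_tabs <- tibble(
  Name = c("AA01", "AA02", "AA03"),
  Title = c("Incidence of Malignant Neoplasms by site",
            "Deaths by cause of death",
            "Malignant neoplasms by age group"),
  Node = c("Node A", "Node B", "Node A"),
  Updated = c("2019-01-01", "2019-01-01", "2019-01-01")
)

# Hypothetical helper: case-insensitive keyword search in the Title column
find_tables <- function(tabs, keyword) {
  tabs %>%
    filter(str_detect(str_to_lower(Title), fixed(str_to_lower(keyword)))) %>%
    select(Name, Title, Node, Updated)
}

hits <- find_tables(toy_tabs, "Neoplasm")
hits
```

Lower-casing both the titles and the keyword means the caller does not have to remember how a title is capitalised.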

We are interested in table PK30: Age-specific incidence rate of malignant neoplasms per 100 000 inhabitants by site and sex.

pk30 <- pull_table("PK30", lang = "en")
pk30

Let's see if we can find data for colon cancer incidence. The anatomical site of the cancer is stored in the Site variable. Incidence values are available for `r n_distinct(select(pk30, Site))` anatomical sites.

pk30 %>% 
  select(Site) %>% 
  distinct()

After some digging, we can see that colon cancer is nested under Digestive organs (C15-C26): ..Colon (C18). Now we can pull out the colon data by matching the regular expression "colon" against the lower-cased Site values.

colo <- pk30 %>% 
  mutate_at("Site", str_to_lower) %>% 
  filter(str_detect(Site, "colon"))
colo
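
Matching on the site name is one option; matching on the ICD code in parentheses is another. The sketch below compares the two on a made-up vector of site labels that imitates the dotted nesting described above. The "..Rectum (C20)" label is an assumption added for illustration and may not match the wording used in PK30.

```r
library(dplyr)
library(stringr)

# Made-up site labels imitating the nesting seen in the Site column
toy_sites <- tibble(
  Site = c("Digestive organs (C15-C26)", "..Colon (C18)", "..Rectum (C20)")
)

# Match by name, ignoring case, so no separate lower-casing step is needed
by_name <- toy_sites %>%
  filter(str_detect(Site, regex("colon", ignore_case = TRUE)))

# Match by ICD code; parentheses are escaped because they are regex metacharacters
by_code <- toy_sites %>%
  filter(str_detect(Site, "\\(C18\\)"))

by_name
by_code
```

Matching on the code avoids depending on the spelling of the site name, while matching on the name keeps the original capitalisation of the labels intact.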

We have incidence rates for different age groups for men and women from 2000 to 2015. Let's plot incidence (value) against time (Year) to see whether incidence rates changed during this period. First, we need to convert Year from character to numeric and Age group to a factor to keep its correct ordering.

colo %>% 
  mutate(Year = as.numeric(Year),
         `Age group` = factor(`Age group`, levels = unique(colo$`Age group`)),
         Site = gsub("^[.]+([a-z])", "\\U\\1", Site, perl = TRUE)) %>% 
  ggplot(aes(Year, value, group = `Age group`, color = `Age group`)) +
  geom_line() +
  facet_wrap(~ Sex) +
  scale_color_viridis(discrete = TRUE) +
  guides(color = guide_legend(ncol = 2)) +
  labs(title = "Incidence of colon (C18) cancer",
       y = "Age-specific incidence per 100 000")
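
Two steps in the mutate() call above are easy to overlook: taking factor levels from unique() and tidying the lower-cased site label with gsub(). The following sketch shows both on small made-up inputs; the age-group labels are illustrative and are not taken from the TAI table.

```r
# Age-group labels in the order they appear in the data (illustrative values)
age <- c("0-4", "5-9", "10-14", "0-4", "5-9", "10-14")

# Default factor levels are sorted as text, which typically puts "10-14" before "5-9"
default_levels <- levels(factor(age))

# Taking the levels from unique() keeps the order of first appearance
ordered_levels <- levels(factor(age, levels = unique(age)))

# Strip leading dots and capitalise the first letter of a lower-cased label;
# "\\U" is a Perl-style case conversion, hence perl = TRUE
label <- gsub("^[.]+([a-z])", "\\U\\1", "..colon (c18)", perl = TRUE)

default_levels
ordered_levels
label
```

This only works when the rows arrive in the intended age order, which is an assumption about the downloaded table and worth checking before plotting.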

We can see that, as expected, cancer incidence is higher in older age groups (coloring) and colon cancer incidence is generally higher in men. Interestingly, colon cancer incidence in men has been increasing during the observed period.

Cancer mortality

Mortality data is stored under the Deaths node.

tabs %>% 
  mutate(Node = str_to_lower(Node)) %>% 
  filter(str_detect(Node, "deaths"))

We are interested in table SD22 titled Deaths per 100 000 inhabitants by cause of death, sex and age group:

sd22 <- pull_table("SD22", lang = "en")
sd22

It's a relatively large table with `r nrow(sd22)` rows. Let's extract the colon cancer data using its ICD code. From the incidence table we saw that colon cancer is listed under code C18.

mort <- sd22 %>% 
  mutate(`Cause of death` = str_to_lower(`Cause of death`)) %>% 
  filter(str_detect(`Cause of death`, "c18"))
mort

Let's drop the data from before 2000 and keep only the rows for individual age groups. Mortality is recorded separately for infants under 1 year old. For comparison, however, we are mostly interested in the age groups over 40, as cancer incidence is generally very low in younger people. We filter by Year and Age group:

colo_mort <- mort %>% 
  # convert Year explicitly so the comparison is numeric, not text-based
  filter(as.numeric(Year) >= 2000, 
         `Age group` %in% colo$`Age group`,
         !is.na(value))
colo_mort
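
A note on filtering by Year: assuming Year is stored as character here, as it is in the incidence table, a comparison with a number is carried out on text, because R coerces the number to a string. The sketch below uses made-up year values to show where the two kinds of comparison differ.

```r
# Illustrative character values; "999" is deliberately not a four-digit year
years <- c("999", "1999", "2005")

# Character vs number: 2000 is coerced to "2000" and the comparison is textual
as_text <- years >= 2000

# Explicit conversion gives a numeric comparison
as_number <- as.numeric(years) >= 2000

as_text
as_number
```

For four-digit years both comparisons agree, so the difference only matters if the column holds anything else, but converting first makes the intent explicit.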

Now we can create a plot similar to the one for incidence:

colo_mort %>% 
  mutate(Year = as.numeric(Year),
         `Age group` = factor(`Age group`, levels = unique(colo_mort$`Age group`)),
         `Cause of death` = gsub("^[.]+([a-z])", "\\U\\1", `Cause of death`, perl = TRUE)) %>% 
  ggplot(aes(Year, value, group = `Age group`, color = `Age group`)) +
  geom_line() +
  facet_wrap(~ Sex) +
  scale_color_viridis(discrete = TRUE) +
  guides(color = guide_legend(ncol = 2)) +
  labs(title = "Colon cancer (C18) mortality",
       y = "Deaths per 100 000")

We can see that colon cancer deaths have increased considerably in the 80-84 age group in men, whereas mortality among women has remained much the same.

Incidence versus mortality

Cancer mortality can be considered a proxy for treatment efficacy. Therefore, let's put the two datasets side by side to get a better view of the gap between morbidity and mortality, and of how deadly this disease really is in Estonia.

First, we need to merge the incidence and mortality datasets:

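c18 <- inner_join(
  colo %>% select(-Site) %>% rename(incidence = value), 
  colo_mort %>% select(-`Cause of death`) %>% rename(mortality = value),
  # name the join keys explicitly instead of relying on all shared columns
  by = c("Year", "Sex", "Age group")
  ) %>% 
  filter(incidence != 0)
c18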
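
To make the effect of the join and the filter concrete, here is a minimal sketch on two made-up data frames. All values are invented for illustration; only the column names follow the tables used above.

```r
library(dplyr)

# Invented incidence rows: 2001 has zero incidence, 2002 has no mortality match
inc <- tibble(
  Year = c("2000", "2001", "2002"),
  Sex = "Men",
  `Age group` = "60-64",
  incidence = c(50, 0, 55)
)

# Invented mortality rows
mor <- tibble(
  Year = c("2000", "2001"),
  Sex = "Men",
  `Age group` = "60-64",
  mortality = c(30, 28)
)

# inner_join() keeps only key combinations present in both tables;
# the filter then drops rows with zero incidence
joined <- inner_join(inc, mor, by = c("Year", "Sex", "Age group")) %>%
  filter(incidence != 0)

joined
```

Only the year 2000 row survives here: 2002 is dropped by the join and 2001 by the filter, so it is worth checking how many rows are lost at each step on the real data.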

Now we can plot colon cancer incidence versus mortality, but first we gather the incidence and mortality values into a common column for easier plot setup:

c18 %>% 
  mutate(Year = as.numeric(Year)) %>% 
  gather(key, value, -c("Year", "Sex", "Age group")) %>%
  ggplot(aes(Year, value, color = key)) +
  geom_line() +
  facet_grid(Sex ~ `Age group`) +
  scale_color_viridis(discrete = TRUE, direction = -1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, size = 6),
        legend.title = element_blank(),
        legend.position = "bottom") +
  labs(y = "Incidence vs. mortality per 100 000")
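
The reshaping step can also be written with pivot_longer(), which newer versions of tidyr (1.0.0 and later) recommend over gather(). The sketch below is an alternative under that assumption, shown on made-up values rather than on the c18 table.

```r
library(dplyr)
library(tidyr)

# Invented wide table with one column per measure
toy <- tibble(
  Year = c(2000, 2001),
  Sex = "Men",
  `Age group` = "60-64",
  incidence = c(50, 52),
  mortality = c(30, 28)
)

# Stack the two measure columns into key/value pairs, one row per measure
long <- toy %>%
  pivot_longer(cols = c(incidence, mortality),
               names_to = "key", values_to = "value")

long
```

Naming the measure columns explicitly, rather than excluding the identifier columns, keeps the call correct if further identifier columns are added later.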

Summary

OK, this is basically what working with TAI data looks like: find your table and download it. It is advisable to compare the downloaded table with the one available on the TAI webpage. Downloaded tables may need a lot of cleaning (e.g. to get rid of summary rows). Also pay attention to variable names, which are human-readable but contain whitespace and may contain apostrophes.
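
As a small illustration of the point about variable names, the sketch below shows two ways of dealing with names that contain whitespace: quoting them with backticks, or renaming all columns once. The data frame is made up, and converting names to snake case is only one possible convention.

```r
library(dplyr)
library(stringr)

# Invented table with TAI-style column names
toy <- tibble(
  `Age group` = c("60-64", "65-69"),
  `Cause of death` = c("a", "b"),
  value = c(1, 2)
)

# Option 1: wrap names containing spaces in backticks
age_only <- toy %>% select(`Age group`)

# Option 2: convert all names to lower snake case once, then use them unquoted
clean <- toy %>%
  rename_with(~ str_replace_all(str_to_lower(.x), "\\s+", "_"))

names(clean)
```

Renaming up front makes later code shorter, at the cost of column names that no longer match those shown on the TAI website.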



tpall/boulder documentation built on May 6, 2019, 11:47 a.m.