knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

DDSQLtools is a package that aims to make downloading data from the UNPD SQL API into R easy. In this vignette we will discuss the different ways in which you can explore and download this data. Make sure you have the latest version of DDSQLtools by installing it from Github:

devtools::install_github("timriffe/DDSQLtools")

and load the package:

library(DDSQLtools)

Quick recipes

Before we begin, some of the output might show numbers with scientific notation depending on your R setup (2.56×10 instead of 0.00000256). You can suppress scientific notation by typing this into your R console: options(scipen = 9999).

If you're already familiar with concepts from the UNPD SQL API, then you'll only have to interact with the get_recorddata function. For example, here's how to get data on population by age and sex for Egypt with different options:

egypt_str <- get_recorddata(dataProcessTypeIds = "Census",
                            indicatorTypeIds = "Population by age and sex",
                            locIds = "Egypt",
                            locAreaTypeIds = "Whole area",
                            subGroupIds = "Total or All groups",
                            isComplete = "Abridged")

get_recorddata accepts either string names or codes. For example, this will return the same thing as above:

egypt_num <- get_recorddata(dataProcessTypeIds = 2,
                            indicatorTypeIds = 8,
                            locIds = 818,
                            locAreaTypeIds = 2,
                            subGroupIds = 2,
                            isComplete = 0)

Note that using numeric codes is faster than using string names because we don't have to make code translations behind the scenes (if you're unsure what these codes mean, please refer to the section Exploring the UNPD SQL API from R below). If you want to figure out which names to use, each of these arguments has an equivalent lookup function. For example, for figuring out the possible values in locIds, use get_locations(). You can see all arguments and their equivalent lookup functions in the documentation of ?get_recorddata. Do note that these lookup functions have both the 'Name' and a 'ShortName' of each value. The translation will only accept string names from the 'Name' column.

Here are a few other examples:

Population by age and sex with census data

china <- get_recorddata(dataProcessTypeIds = "Census",
                        indicatorTypeIds = "Population by age and sex",
                        locIds = "China",
                        locAreaTypeIds = "Whole area",
                        subGroupIds = "Total or All groups",
                        isComplete = "Abridged")

head(china)

Births by age of mother

get_recorddata(dataProcessTypeIds = "Census",
               indicatorTypeIds = "Births by age of mother",
               locIds = "Togo",
               locAreaTypeIds = "Whole area",
               subGroupIds = "Total or All groups",
               isComplete = "Abridged")

See the error? This means that using "Census" for this set of options is unavailable. We need to check which other data processes have data for these options (for example, using get_dataprocesstype()). Since this is a trial and error process, we've nailed it down to "Survey: cross-sectional survey (generic/national)":

togo <- get_recorddata(dataProcessTypeIds = "Survey: cross-sectional survey (generic/national)",
                       indicatorTypeIds = "Births by age of mother",
                       locIds = "Togo",
                       locAreaTypeIds = "Whole area",
                       subGroupIds = "Total or All groups",
                       isComplete = "Abridged")

head(togo)

Deaths by age and sex using register data with uncertainty

You can also request uncertainty estimates with includeUncertainty = TRUE

oman <- get_recorddata(dataProcessTypeIds = "Register",
                       indicatorTypeIds = "Deaths by age and sex",
                       locIds = "Oman",
                       locAreaTypeIds = "Whole area",
                       subGroupIds = "Total or All groups",
                       isComplete = "Abridged",
                       includeUncertainty = TRUE)

head(oman)

By default, the results from get_recorddata exclude any uncertainty columns such as StandardErrorValue, ConfidenceInterval, etc. By setting includeUncertainty = TRUE, the resulting data frame adds new fields which have uncertainty for the requested values. These fields are all next to each other so by limiting the data to the fields next to StandardErrorValue, the user can quickly see the uncertainty columns.

Additional queries with dataTypeGroupIds and dataTypeGroupId2s

get_recorddataadditional allows to use additional arguments such as dataTypeGroupIds and dataTypeGroupId2s.For a list of all arguments you can use, see here. For example:

## For dataTypeGroupIds
dt <-
  get_recorddataadditional(
    dataTypeGroupIds = "Direct", # or 3
    indicatorTypeIds = 8,
    isComplete = 0,
    locIds = 818,
    locAreaTypeIds = 2,
    subGroupIds = 2
  )

## For dataTypeGroupId2s
dt <-
  get_recorddataadditional(
    dataTypeGroupId2s = "Population (sample tabulation)", # or 11
    indicatorTypeIds = 8,
    isComplete = 0,
    locIds = 818,
    locAreaTypeIds = 2,
    subGroupIds = 2
  )

get_datatypes can show you which dataTypeIds are within the umbrella of aggregated categories such as dataTypeGroupIds and dataTypeGroupId2s.

Change the IP of the UNPD SQL API

Currently the IP of the server is automatically set (to https://popdiv.dfs.un.org/DemoData/api/) but can be changed simply by adding options(unpd_server = "your server") to the beginning of your script:

# Change this to the desired server. This is the server
# which is set by default.
options(unpd_server = "https://popdiv.dfs.un.org/DemoData/api/")

get_recorddata(dataProcessTypeIds = "Register",
               indicatorTypeIds = "Deaths by age and sex",
               locIds = "Oman",
               locAreaTypeIds = "Whole area",
               subGroupIds = "Total or All groups",
               isComplete = "Abridged")

Downloading many countries

Due to the nature of the UNPD API, it's not possible to download over 30 countries in the same API request. For example, if we wanted to download data on these countries:

myLocations <- c(28, 492, 570, 674, 308, 96, 196, 8,
                 376, 662, 670, 642, 84, 188, 442, 100,
                 192, 170, 414, 616, 320, 480, 218,
                 818, 222, 300, 558, 40, 52, 348, 470,
                 620, 702, 858, 56, 780, 388, 246, 352,
                 591, 144, 862)

get_recorddata(dataProcessTypeIds = "Register",
               indicatorTypeIds = "Deaths by age and sex",
               locIds = myLocations,
               locAreaTypeIds = "Whole area",
               subGroupIds = "Total or All groups",
               isComplete = "Abridged",
               startYear = 1920,
               endYear = 2020)

You'll get an error saying that DDSQLtools could not connect to the API. When that happens you can loop over chunks of countries to get it faster. The first thing you want to do is to copy the optimized query printed by get_recorddata that already translated the string codes to numbers.

After that, you need to loop through groups of countries and make requests for each group separately. Below is the same example using base R and the tidyverse. On average, downloading these countries takes between 25 to 30 minutes.

Using base R
myLocations <- c(28, 492, 570, 674, 308, 96, 196, 8,
                 376, 662, 670, 642, 84, 188, 442, 100,
                 192, 170, 414, 616, 320, 480, 218,
                 818, 222, 300, 558, 40, 52, 348, 470,
                 620, 702, 858, 56, 780, 388, 246, 352,
                 591, 144, 862)

# Here replace with number of desired chunk of countries
n_chunks <- 3
chunk_groups <- rep(1:n_chunks, length.out = length(myLocations))
cnty_groups <- split(myLocations, chunk_groups)

# Loop through each chunk of countries with `lapply`
deaths <- lapply(cnty_groups, function(x) {

  res <- get_recorddata(dataProcessTypeIds = 9,
                        startYear = 1920,
                        endYear = 2020,
                        indicatorTypeIds = 20,
                        isComplete = 0,
                        locIds = x,
                        locAreaTypeIds = 2,
                        subGroupIds = 2)

  # Informative message to keep track of which chunk
  # of countries have been completed.
  cat("Countries", paste0(x, collapse = ", "), "done")
  res
})

# Row-bind all the data frames into one data frame
deaths <- do.call(rbind, deaths)
head(deaths)
Using the tidyverse
library(tidyverse)

myLocations <- c(28, 492, 570, 674, 308, 96, 196, 8,
                 376, 662, 670, 642, 84, 188, 442, 100,
                 192, 170, 414, 616, 320, 480, 218,
                 818, 222, 300, 558, 40, 52, 348, 470,
                 620, 702, 858, 56, 780, 388, 246, 352,
                 591, 144, 862)

n_chunks <- 3
chunk_groups <- rep(1:n_chunks, length.out = length(myLocations))

cnty_groups <- split(myLocations, chunk_groups)

# Loop through each location with `purrr`
deaths <- map(cnty_groups, ~ {

  res <- get_recorddata(dataProcessTypeIds = 9,
                        startYear = 1920,
                        endYear = 2020,
                        indicatorTypeIds = 20,
                        isComplete = 0,
                        locIds = .x,
                        locAreaTypeIds = 2,
                        subGroupIds = 2)

  cat("Countries", paste0(x, collapse = ", "), " done")
  res
})

deaths <-
  deaths %>%
  bind_rows() %>%
  as_tibble()

deaths

Identifying ID and Names of values

Whenever you download some data, you'll see that there are several columns that end with Name. This columns contain, for example, the name of the location (LocName) or the name of the subgroup (SubGroupName). To be able to see the equivalent ID of these names, there are two different strategies using the haven package.

First, let's get some data:

res <- get_recorddata(
    dataProcessTypeIds = 2, # Census
    indicatorTypeIds = 8, # Population by age and sex - abridged
    locIds = 818, # Egypt
    locAreaTypeIds = 2, # Whole area
    subGroupIds = 2, # Total or All groups
    isComplete = 0,
    collapse_id_name = TRUE
  )

Base R

You can use print_labels from haven on the column of interest and it will show you the ID:

library(haven)
print_labels(res$LocName)

If you want to flip the labels for the entire data frame such that the ID is current value and not the name, you can do that with as_factor:

res_edit <- as_factor(res)

# No long showing the name but the ID
head(res_edit["LocName"])

Tidyverse

Using the tidyverse, the solution from base R applies as well. However, if you convert your data frame to a tibble, it will automatically show you the labels in the data frame.

library(tidyverse)

res_tb <- res %>% as_tibble()
# See the code next to the name? No need to print it
# separately, `tibble` print boths things together.
res_tb %>%
  select(ends_with("Name"))

If you're interested in keeping either the codes or the names, you can do that with as_factor:

res_tb %>%
  # You can extract just the ID
  as_factor(levels = "labels") %>%
  select(ends_with("Name"))
res_tb %>%
  # Or just the values
  as_factor(levels = "values") %>%
  select(ends_with("Name"))

Since get_recorddataadditional is a cousing of get_recorddatal with additional options to specify, all of the above applies to get_recorddataadditional.

Exploring the UNPD SQL API from R

The previous examples assume you're familiar with the names or codes of the parameters of the API: locations, subgroups, etc... Whenever you don't know these values, you need to resort to the list of available values. Each parameter defined above has an equivalent get_* function to extract all possible values. The current list is:

For example, to see all possible indicator types:

get_indicatortypes()

Here we can see both the code and the name of the indicator. Both values are accepted by get_recorddata. Another example is getting the list of location types. get_locationtypes() will give you those:

get_locationtypes()

In order to search for a specific combination of values ("Egypt", "Deaths by age and sex", etc...), the process is only possible through trial and error. For example, if we wanted to explore Deaths by age and sex for Oman, we'd have to explore whether there is data for Semi-rural (locAreaTypeIds) and Christian Orthodox (subGroupIds).

For now, let's try this example for "Semi-rural":

get_recorddata(dataProcessTypeIds = "Register",
               indicatorTypeIds = "Deaths by age and sex",
               locIds = "Oman",
               locAreaTypeIds = "Semi-rural",
               subGroupIds = "Total or All groups",
               isComplete = "Abridged")

This error is just saying that the combination of data that you've specified is not available (in this case, Semi-rural is not available for register data on deaths by age and sex in Oman for all groups abridged).

Figuring out which specific set of values returns what the user want is a tedious process. If the user doesn't know the specific combination of values by memory, the most optimal strategy that we advise is to explore the available values in the front end of the API and then use these in the request.

Let's do one example. Follow the hyperlink and click on "Get Started - Select Indicator & Country". Then select "Deaths by age and sex" -> "Oman". All possible data sets for Oman and "Deaths by age and sex" will appear automatically. We can filter down the options in the left-most menu and use these values to populate our get_recorddata call.

Looking at the possible values of "Type of location", only "Whole area" is available. This means that we can't specify "Semi-rural" (as we saw in the example above). Looking at "Type of subgroup", the only available group is Total or All groups, meaning that we can't specify "Christian Orthodox" either.

Using this strategy is much easier and less time consuming that trying out different values.

Exploring indicators

To explore all possible indicatorIds, indicatorTypeIds and components, get_iitypes can be of use. Although get_indicators and get_indicatortypes are very helpful for looking at the entire list of indicators, a user often wants to combine the two. That's where get_iitypes comes in. For example:

all_iitypes <- get_iitypes()
head(all_iitypes)

The output is all IndicatorTypeID, together with the IndicatorIds and ComponentIDs. The advantage of get_iitypes is that you can select any of then with the function:

get_iitypes(indicatorIds = 239)
get_iitypes(indicatorTypeIds = 2)

or ComponentIds:

get_iitypes(componentIds = 4)

Although this is handy for exploration, what's even more handy is the combination of parameters:

get_iitypes(componentIds = 4, indicatorTypeIds = 38)

This approach to exploring indicators is much more rich that selecting indicators manually with the get_* functions.



timriffe/DDSQLtools documentation built on Oct. 6, 2021, 5:34 p.m.