hildar

HILDA is a large panel survey with 20 waves (2001 to 2020) and counting! Some waves have more than 5,000 variables, which makes reading them into R a little challenging (personally, I think it is way too slow).

The goal of this package is to provide a quick and easy way to query HILDA data in R. It does this by converting each wave of HILDA from its Stata file (.dta), one of the three formats HILDA provides, to the fst format. fst is a binary format that can be read much faster than .dta in R.

| Function name | Description |
|---------------|-------------|
| hil_setup() | Sets up HILDA fst files for hil_fetch() to use. |
| hil_fetch() | Fetches HILDA records based on query options. |
| hil_dict() | Shows the HILDA data glossary and the waves each variable is available in. This provides a convenient way to select multiple variables based on their description by passing them to hil_fetch(). |
| hil_vars() | Returns all variables whose names match a regular expression. |
| hil_labs() | Returns all variables whose labels match a regular expression. |
| hil_browse() | Opens the HILDA data dictionary page in your default web browser. |
| hil_crosswave_info() | Takes a variable name and searches for its cross-wave information. |
| hil_var_details() | Similar to hil_crosswave_info(), but it searches for a variable's details. |

Installation

Install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("asiripanich/hildar")

Setup

1) Store HILDA as .fst files

Use hil_setup() to read the HILDA Stata (.dta) files and save them as .fst files. The fst format is a binary data format that can be read much faster than .dta. hil_setup() also creates a HILDA dictionary file that you can later access with hil_dict(). hil_vars() and hil_labs() rely on that dictionary to search variable names and labels with regular expressions.
Once the setup is done, hil_fetch() can quickly query HILDA data from all waves.

hil_setup(
  read_dir = "/path/to/your/hilda-stata-files", 
  save_dir = "/path/to/save/hilda-fst-files"
)
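
For intuition on why the conversion pays off, the sketch below uses the fst package directly on a small made-up data frame. It is not hildar's internal code, and the column names and values are only illustrative. The point is that read_fst() can load just the columns you ask for instead of parsing the whole file.

```r
library(fst)

# Toy stand-in for one wave of survey data (values are made up).
set.seed(1)
toy_wave <- data.frame(
  xwaveid = sprintf("%07d", 1:1000),
  hgage   = sample(15:90, 1000, replace = TRUE),
  hgsex   = sample(1:2, 1000, replace = TRUE)
)

# Write the data once in fst's binary format.
fst_path <- tempfile(fileext = ".fst")
write_fst(toy_wave, fst_path)

# Later reads can pull only the needed columns from disk.
age_only <- read_fst(fst_path, columns = c("xwaveid", "hgage"))
str(age_only)
```

With thousands of variables per wave, reading only the requested columns is likely where most of the time saving comes from.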

To speed up the setup process, choose a parallel backend from the future package with plan() before calling hil_setup().

library(future)
plan(multisession, workers = 2)

# `hil_setup()` can take several minutes to finish.
# To monitor its progress, wrap the call in
# `progressr::with_progress({...})` as shown below.
progressr::with_progress({
  hil_setup(read_dir = "...", save_dir = "...")
})

2) Tell hildar where the HILDA .fst files are stored

hil_fetch() needs to know where the HILDA fst files generated in the previous step are stored. You can set HILDA_FST either as a global option or as an R environment variable. Making this setting persistent across all your R sessions makes hil_fetch() more convenient to use. Alternatively, you can pass the location in each call using the hilda_fst_dir argument of hil_fetch().
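
As a sketch of the two approaches, the snippet below sets HILDA_FST both ways using base R. This README only gives the name HILDA_FST, so the exact key that hildar looks up for the global option is an assumption here; check the hil_fetch() help page to confirm. The path is a placeholder.

```r
# Placeholder path: replace with the `save_dir` you passed to hil_setup().
fst_dir <- "/path/to/save/hilda-fst-files"

# Option 1: an R environment variable (lasts for the current session).
Sys.setenv(HILDA_FST = fst_dir)

# Option 2: a global option (assumed to use the same name, HILDA_FST).
options(HILDA_FST = fst_dir)

# Check what the current session sees.
Sys.getenv("HILDA_FST")
getOption("HILDA_FST")
```

To make either setting persistent, the usual R approach is to put a `HILDA_FST=...` line in `~/.Renviron` or the `options()` call in `~/.Rprofile`.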

Example

Once the setup is complete, you can start fetching HILDA data with hildar!

library(hildar)
library(magrittr) # provides %>% in case it is not already available

# hil_fetch() removes the HILDA wave prefix from the names of all
# selected variables (e.g. axxx = 2001, bxxx = 2002).
hil_fetch(years = 2001:2003, add_geography = TRUE) %>%
  summary()
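
The prefix convention mentioned in the comment above can be sketched in base R. The helpers wave_letter() and strip_wave_prefix() are hypothetical and not part of hildar, and the variable name ahgage is only an illustration; hil_fetch() may handle the renaming differently.

```r
# Map a survey year to its wave letter: 2001 -> "a", 2002 -> "b", ...
wave_letter <- function(year) letters[year - 2000L]

# Drop the leading wave letter from a wave-specific variable name.
# `year` is expected to be a single year here.
strip_wave_prefix <- function(var, year) {
  sub(paste0("^", wave_letter(year)), "", var)
}

wave_letter(c(2001, 2002, 2020))
strip_wave_prefix("ahgage", 2001)
strip_wave_prefix("bhgage", 2002)
```

Once the prefix is removed, the same variable has the same name in every wave, which is what allows several waves to be stacked into one table.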

The add_basic_vars option adds basic demographic variables to the data. It is set to TRUE by default.

hil_fetch(years = 2001, add_basic_vars = TRUE) %>%
  names()

How about a quick search to find the variables you want? hil_dict() returns a data.table that you can use to search or view HILDA variables without going to their documentation webpage.

hilda_dictionary <- hil_dict()
head(hilda_dictionary)

Let's say we want to select all variables related to 'employment'. Here is how to pass the matching employment variables to hil_fetch().

hilda_data <- hil_fetch(years = 2001:2003, vars = hil_labs("employment"))
dim(hilda_data)

Or, if you know the prefix of a subject area that you would like to query, you can use hil_vars(pattern) to get all variable names that match the pattern. For example, hil_vars("^ff") will get all variables in subject area 'Health' and nested area 'Health - diet'.

hilda_data <- hil_fetch(years = 2001:2003, vars = hil_vars("^ff"))
dim(hilda_data)
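
To see what the two kinds of pattern matching mean, here is a base R sketch on a tiny made-up dictionary. The variable names and labels are invented, and grepl() only stands in for whatever hil_vars() and hil_labs() do internally.

```r
# Made-up dictionary: names and labels are for illustration only.
toy_dict <- data.frame(
  var   = c("ffbfast", "ffveg", "esbrd", "hgage"),
  label = c("Diet: eats breakfast", "Diet: serves of vegetables",
            "Employment status (broad)", "Age at last birthday")
)

# Match on variable names, in the spirit of hil_vars("^ff").
toy_dict$var[grepl("^ff", toy_dict$var)]

# Match on labels, in the spirit of hil_labs("employment").
# ignore.case = TRUE is a choice made for this sketch.
toy_dict$var[grepl("employment", toy_dict$label, ignore.case = TRUE)]
```

The `^` anchor matters: without it, "ff" would also match names that contain those letters anywhere, not only names that start with them.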

To set default variables to be loaded every time you call hil_fetch(), see hil_user_default_vars().

Here is a summary of the dimensions of our HILDA data files.

# the number of variables and rows in each wave
nrows_by_wave <-
  hil_fetch(years = 2001:2020, add_basic_vars = FALSE) %>%
  .[, .(number_of_rows = .N), by = wave]

hilda_dictionary[, unlist(wave), by = .(var, label)] %>%
  data.table::setnames("V1", "wave") %>%
  .[!is.na(wave), .(number_of_variables = .N), by = wave] %>%
  merge(nrows_by_wave, by = "wave")


asiripanich/hildar documentation built on July 24, 2022, 8:21 p.m.