knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
HILDA survey data is a large panel survey 20 waves (2001 - 2020) and counting! Some waves have more than 5000 variables, which means reading them into R is a little challenging (personally I think it is wayyyyyyyy too slow).
The goal of this package is to provide a quick and easy way to query HILDA data into R.
This is possible by converting each wave of HILDA from its STATA file (.dta
), one of the three formats HILDA provides, to fst
format. fst is a binary format and can be read much much quicker than .dta
in R.
| Function name | Description |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| hil_setup()
| Setup HILDA fst files for hil_fetch() to use
. |
| hil_fetch()
| Fetches HILDA records based on query options. |
| hil_dict()
| Shows HILDA data glossary and waves each variable is available in. This provides a convenient way to select multiple variables based on their description by passing it to hil_fetch()
. |
| hil_vars()
| Returns all variables where their variable names match a regular expression. |
| hil_labs()
| Returns all variables where their labels match a regular expression. |
| hil_browse()
| Opens up the HILDA data dictionary page on your default web browser. |
| hil_crosswave_info()
| Takes a variable name and search for its cross wave information. |
| hil_var_details()
| Similar to hil_crosswave_info()
but it searches for a variable's details. |
The development version from GitHub with:
# install.packages("remotes") remotes::install_github("asiripanich/hildar")
.fst
filesUse hil_setup()
to read HILDA STATA (.dta) files and save them as .fst
files.
.fst
is a binary data format that can be read very quickly, a lot faster than .dta
.
An additional benefit of hil_setup()
is that it creates a HILDA dictionary file that you can later call using hil_dict()
.
Having a functional hil_dict()
allows the user to use hil_vars()
and hil_labs()
for searching variable names using a regular expression.
This will allow you to fast query HILDA data from all waves using hil_fetch()
.
hil_setup( read_dir = "/path/to/your/hilda-stata-files", save_dir = "/path/to/save/hilda-fst-files" )
To speed up the setup process, you can select a future
parallel backend and call it before running your hil_setup()
.
library(future) plan(multisession, workers = 2) # `hil_setup()` can take several minutes to finish. # To monitor its progress, you can wrap the function in # `progressr::with_progress({...}}` like below. progressr::with_progress({ hil_setup(read_dir = "...", save_dir = "...") })
hildar
where the HILDA .fst
files are stored at.hil_fetch()
requires the user to specify where the HILDA fst files generated in the previous step are stored.
You can either set this HILDA_FST
as a global option or an R environment variable.
Setting this as a persistent option for all your R sessions will make hil_fetch()
more convinient to use.
Alternatively, you can manually set it in each call using hilda_fst_dir
argument in hil_fetch()
.
Once the setup is completed, you can now start fetching HILDA data with hildar
!
library(hildar) # fetch removes the HILDA year prefix from all the selected variable # (e.g. axxx = 2001, bxxx = 2002). hil_fetch(years = 2001:2003, add_geography = T) %>% summary()
There is a quick option to add basic demographic variables to the data, which is set to TRUE by default.
hil_fetch(years = 2001, add_basic_vars = T) %>% names()
How about doing a quick search to find variables that you want? Use hil_dict
which is a data.table that you can search or view HILDA variables without going to their documentation webpage.
hilda_dictionary <- hil_dict() head(hilda_dictionary)
Let say we want to select all variables that are related to 'employment'. Here is how we can easily use the selected employment variables in hil_fetch()
.
hilda_data <- hil_fetch(years = 2001:2003, vars = hil_labs("employment")) dim(hilda_data)
Or if you know the prefix of a subject area that you like to query, you can use hil_vars(pattern)
to query all variable names that match the pattern.
For example, hil_vars("^ff")
will get all variables in subject area 'Health' and nested area 'Heath - diet'.
hilda_data <- hil_fetch(years = 2001:2003, vars = hil_vars("^ff")) dim(hilda_data)
To set default variables to be loaded every time you call hil_fetch()
, see hil_user_default_vars()
.
Here is a summary of the dimensions of our HILDA data files.
# the number of variables and rows in each wave nrows_by_wave <- hil_fetch(years = 2001:2020, add_basic_vars = F) %>% .[, .(number_of_rows = .N), by = wave] hilda_dictionary[, unlist(wave), by = .(var, label)] %>% data.table::setnames("V1", "wave") %>% .[!is.na(wave), .(number_of_variables = .N), by = wave] %>% merge(nrows_by_wave, by = "wave")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.