knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
Dummy variables (binary, 0/1 variables) are a frequent part of analyses in the social sciences. But making them can be an arduous process in base R, which tends to rely on numerous ifelse() statements. This can be especially trying and headache-inducing if you have a variable with several values (e.g., race, education level, employment status, political party identification, etc.).
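For comparison, the base R approach described above might look something like the following sketch. The data frame, column names, and values are made up for illustration; nothing here comes from dumdum itself.

```r
# Toy data; the column name and values are invented for illustration.
df <- data.frame(party = c("Dem", "Rep", "Ind", "Rep", NA))

# One ifelse() call per category. Rows where party is NA stay NA
# in every new column.
df$party_dem <- ifelse(df$party == "Dem", 1, 0)
df$party_rep <- ifelse(df$party == "Rep", 1, 0)
df$party_ind <- ifelse(df$party == "Ind", 1, 0)

df
```

Every additional category means another line to write and another column name to keep straight, which is the tedium the rest of this README is about.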
dumdum cuts down on the work, gives you the ability to easily rename the new, dummied variables, and also makes selecting reference categories simple. It also allows you to make dummies across many variables at once.
There are currently a few other packages that help users make dummy variables, such as ml, caret, tidymodels, and fastDummies. What sets dumdum apart is that it was designed with social scientists and non-machine-learning folk in mind. Its functions are flexible and intuitive. It also has no package dependencies; all you need to have installed are the base R packages!
You can install the latest version from GitHub with:
# install.packages("devtools")
devtools::install_github("prlitics/dumdum")
or
# install.packages("remotes")
remotes::install_github("prlitics/dumdum")
There are two functions in dumdum:
dummify(), which allows you to make dummy variables from a specified data frame and variable. (This is the main function in dumdum.)
dummify_across(), a wrapper for dummify() that lets users dummify multiple variables at once.
dummify()
dummify() has four arguments: two mandatory and two optional. The optional arguments default to NULL.
dummify(data, var, reference = NULL, dumNames = NULL)
data & var requirements
dummify() accepts data frame objects in data.
var accepts either the name of the column or the column index.
dummify() currently accepts integer, factor, and character vectors.
reference and dumNames options
reference allows you to decide whether to leave out a reference category from your values.
If TRUE, it will leave out the first value it encounters.
If given specific values (e.g., c("Adelie", "Gentoo")), it will leave out all of the specified values.
dumNames has dummify() rename the variables for you, so you don't have to do it afterwards.
By default, new columns are named variableName_DUM_Value. So a column for Adelie penguins will be species_DUM_Adelie. All of the Adelie penguins will have a 1 for this variable and the other penguins will have a 0.
dumNames can accept a character vector that will rename all the columns in the order given (dumNames = c("Ad", "Ge", "Ch")).
dumNames can also accept a named vector so that you don't have to worry about making sure the names are in the right order (dumNames = c("Ad" = "Adelie", "Ge" = "Gentoo", "Ch" = "Chinstrap")).
These examples are going to use the Palmer Penguins dataset because there are a number of "dummy-able" variables in it. (And, also, like, penguins!!)
library(dumdum)
penguins <- palmerpenguins::penguins
knitr::kable(head(penguins))
dummify
Let's say that you want to make dummy variables for the penguins' sex because you plan to run a regression to check whether sex is predictive of body mass. To make dummy variables out of sex, you could use dummify():
penguins <- palmerpenguins::penguins
penguins_dummied <- dummify(data = penguins, var = "sex")
knitr::kable(head(penguins_dummied))
If you wanted to set "male" as the reference category, you could do:
penguins_dummied<-dummify(data = penguins, var = "sex", reference = "male")
knitr::kable(head(penguins_dummied))
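To see what leaving out a reference category buys you in the regression scenario mentioned earlier, here is a minimal base R sketch. It uses a tiny invented data frame rather than the penguins data, and builds the dummy by hand instead of calling dumdum, so treat it as an illustration of the idea and not of the package's internals.

```r
# Toy data; the values are invented for illustration only.
toy <- data.frame(
  sex  = c("female", "male", "female", "male"),
  mass = c(3400, 4000, 3600, 4200)
)

# Hand-built dummy for "female", named after the default naming pattern.
# "male" gets no column of its own, so it acts as the reference category.
toy$sex_DUM_female <- ifelse(toy$sex == "female", 1, 0)

fit <- lm(mass ~ sex_DUM_female, data = toy)
coef(fit)
# (Intercept) is the mean mass of the reference group (males);
# sex_DUM_female is the female mean minus the male mean.
```

Including a dummy for every category alongside an intercept would make the columns perfectly collinear, which is why one category is usually left out.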
The default naming convention is meant to make sure that the user knows what the 1 refers to in each column. You can also rename the columns.
penguins_dummied<-dummify(data = penguins, var = "sex", reference = "male", dumNames = c("f","unknown"))
knitr::kable(head(penguins_dummied))
If you didn't want to worry about putting the column names in the right order, you could pass a named vector instead:
penguins_dummied<-dummify(data = penguins, var = "sex", reference = "male", dumNames = c("f"="female","unknown"=NA))
knitr::kable(head(penguins_dummied))
dummify_across
Let's say that there are multiple variables that you want to dummify. In the case of the penguins, you might want to dummify species, as well as island and sex. You can do so with dummify_across().
dummify_across() is a wrapper for dummify() that allows you to pass multiple variables at once. Like dummify(), you specify a data frame object and a set of variables (vars) that you want to be dummified. These can either be names or column indices.
penguins_dummied<-dummify_across(data = penguins, vars = c("sex","species","island"))
knitr::kable(head(penguins_dummied))
You can also pass along whether or not you want dummify_across() to leave out a reference column for the variables you selected:
penguins_dummied<-dummify_across(data = penguins, vars = c("sex","species","island"), reference = TRUE)
knitr::kable(head(penguins_dummied))
Currently, dummify_across() will only leave out the first encountered value of each variable as a reference. Future updates to the package will allow you to specify which variables you want to have reference categories for, as well as the values for those references.
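As a rough picture of what "leave out the first encountered value" means across several variables, here is a base R sketch. The helper make_dummies() is hypothetical and written only for this illustration; it is not part of dumdum and does not claim to match how dummify_across() is implemented (for example, it keeps the original columns and makes no column for missing values, which may differ from the package).

```r
# Hypothetical helper, NOT part of dumdum: adds one 0/1 column per observed
# value, named with the variableName_DUM_Value pattern described above.
make_dummies <- function(data, var, drop_first = FALSE) {
  values <- unique(as.character(data[[var]]))
  values <- values[!is.na(values)]
  # Dropping the first value encountered makes it the reference category.
  if (drop_first) values <- values[-1]
  for (v in values) {
    data[[paste0(var, "_DUM_", v)]] <- as.integer(data[[var]] == v)
  }
  data
}

# Toy data, invented for illustration.
toy <- data.frame(
  sex    = c("male", "female", "female"),
  island = c("Biscoe", "Dream", "Biscoe")
)

# Apply the helper to each variable in turn.
out <- toy
for (v in c("sex", "island")) {
  out <- make_dummies(out, v, drop_first = TRUE)
}
names(out)
```

Here "male" and "Biscoe" are the first values seen in their columns, so they become the references and only the other values get dummy columns.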
I personally am a huge fan of the tidyverse; it's what allowed me to get my feet wet with R before I could truly dive into it. I know a lot of potential dumdum users would also use the tidyverse, so it was important to me that dumdum functions were pipe-able.
library(magrittr)
pen_df <- penguins %>% dummify("sex")
knitr::kable(head(pen_df))
pen_df <- penguins %>% dummify_across(c("sex","island","species"))
knitr::kable(head(pen_df))
If you find any bugs or have suggestions, let me know! Always happy for constructive feedback.
Huge thanks to Sabrina Marasa, who tested the package on the Mac version of R.
This package is distributed under an MIT license.
citation("dumdum")
Data & packages used in this readme.
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
Stefan Milton Bache and Hadley Wickham (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5. https://CRAN.R-project.org/package=magrittr