An overview of *dformula*

knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, 
                      fig.width = 6, fig.height = 5, 
                      message = FALSE, warning = FALSE)

Introduction

dformula makes it easy to modify, transform, add and extract data using the basic R formula. The operations on the data are the following:

| Operation | Function |
|------------|------------|
| Add new variables | add() |
| Transform existing variables | transform() |
| Rename existing variables | rename() |
| Select rows and columns | select() |
| Remove rows and columns | remove() |

The formula is composed of two parts:

$$ \text{column\_names} \sim \text{new\_variables} $$

the left-hand side shows the names of the columns of the data and the right-hand side the transformations or the new variables to insert in the data.

I() is used on the right-hand side to indicate the type of transformation of the existing variable. Inside this function, we can insert a logical statement, a function implemented in R or a user-built function.

For example:

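$$ \text{var\_name\_1} + \text{var\_name\_2} \sim I(\log(\text{var\_name\_1})) + I(\text{var\_name\_2} == \text{"something"}) $$

the first term takes the logarithm of var_name_1 and the second tests whether var_name_2 is equal to "something"; depending on the function, the result is used to build a variable or to select rows.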
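
To make the two parts of such a formula concrete, here is a minimal base R sketch that splits a formula into its left-hand and right-hand sides and evaluates the right-hand terms against a data frame. It does not use dformula and is not meant to show how the package works internally; the data frame toy and its values are invented for illustration.

```r
# Toy data; the values are made up for this sketch.
toy <- data.frame(var_name_1 = c(1, 10, 100),
                  var_name_2 = c("something", "other", "something"))

f <- var_name_1 + var_name_2 ~ I(log(var_name_1)) + I(var_name_2 == "something")

# Left-hand side: the columns the formula refers to.
lhs_vars <- all.vars(f[[2]])

# Right-hand side: one term per transformation, as character strings.
rhs_terms <- attr(terms(f), "term.labels")

# Evaluate each right-hand term using the data frame as the environment.
results <- lapply(rhs_terms, function(term) eval(parse(text = term), toy))
names(results) <- rhs_terms
str(results)
```

Here lhs_vars holds the two column names and results holds one vector per right-hand term.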

As in SQL, the from argument specifies the input data, and the as argument gives the new names of the variables after transformation, selection or addition.



The CRAN version can be loaded with

library('dformula')

or the development version can be installed from GitHub:

remotes::install_github('dataallaround/dformula')

The data available in the package will be used in this overview

data("population_data")
pop_data <- population_data

which describes the Population and Area of world countries.

str(pop_data)

Adding variables

The add() function inserts new variables starting from the existing columns in the data.

Suppose we want to calculate population density and attach this to the original dataset

new_pop <- add(from = pop_data, formula = ~ I(Population / Area))
head(new_pop)

and give a name to this new variable

new_pop <- add(from = pop_data, formula = ~ I(Population / Area), as = "pop_density")
head(new_pop)

Multiple variables can be added with a single formula

new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)))
head(new_pop)

and with new names

new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)), 
               as = c("pop_density", "log_area"))
head(new_pop)
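
For comparison, the same kind of result can be sketched in base R without dformula. The data frame toy below is a made-up stand-in for pop_data (three invented countries), so the numbers carry no meaning beyond the illustration.

```r
# Toy stand-in for pop_data; the values are invented for illustration.
toy <- data.frame(Country    = c("A", "B", "C"),
                  Population = c(2e8, 5e6, 1.5e8),
                  Area       = c(9e6, 4e4, 3e5))

# Append the two derived columns on the right, with explicit names.
new_toy <- cbind(toy,
                 pop_density = toy$Population / toy$Area,
                 log_area    = log(toy$Area))
head(new_toy)
```

The formula interface expresses the same two derived columns in a single call, with the names supplied through as.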

If we have one transformation applied to a group of variables, we do not need to specify the function multiple times

new_pop <- add(from = pop_data, formula = Population + Area ~ log())
head(new_pop)

and with names

new_pop <- add(from = pop_data, formula = Population + Area ~ log(),
               as = c("log_pop", "log_area"))
head(new_pop)

Suppose we want to add a numerical id for the countries at the beginning of the dataset, using the position argument

new_pop <- add(from = pop_data, 
               formula = ~ I(1:nrow(pop_data)), 
               position = "left", as = "id")
head(new_pop)

We can also add a constant variable, for example the year of the observation

new_pop <- add(from = pop_data, formula = ~ C("2020"), position = "left")
head(new_pop)

or both

new_pop <- add(from = pop_data, 
               formula = ~ I(1:nrow(pop_data)) + C("2020"), 
               position = "left", as = c("ids", "year"))
head(new_pop)

The C() construct adds a constant to all the rows.

We may want a dummy variable, i.e. a variable equal to $1$ if some event happens and $0$ otherwise. For example, suppose we want a dummy variable that flags the most populated countries, say those with more than $100$ million people.

new_pop <- add(from = pop_data, formula =  ~ I(Population > 100000000))
head(new_pop)

or two variables, one flagging the most populated countries and the other the largest countries by area

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000) + I(Area > 8000000))
head(new_pop)

or a single variable flagging the countries that are both among the most populated and among the largest

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)

If we want a boolean vector instead, as the answer to a query, setting the logic_convert argument to FALSE makes the function return TRUE/FALSE values

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000), 
               logic_convert = FALSE, as = "most_populated")
head(new_pop)
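
The difference between a 0/1 dummy and a boolean column can be sketched in base R as follows. This only illustrates the two representations; it does not use dformula, and the data frame toy is invented.

```r
# Toy stand-in for pop_data; the values are invented for illustration.
toy <- data.frame(Country    = c("A", "B", "C"),
                  Population = c(2e8, 5e6, 1.5e8))

# The comparison itself yields a logical (TRUE/FALSE) vector.
is_most_populated <- toy$Population > 100000000

# Dummy coding: convert the logical vector to 1/0.
toy$most_populated_dummy <- as.integer(is_most_populated)

# Boolean coding: keep the logical vector as it is.
toy$most_populated <- is_most_populated
str(toy)
```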

Transform variables

The transform() function modifies existing variables in the dataset.

Suppose we want to change the scale on the Population

new_pop <- transform(from = pop_data, 
                     formula =  Population ~ I(Population/10000))
head(new_pop)

or we want a logarithmic transformation, renaming the variable

new_pop <- transform(from = pop_data, 
                     formula =  Population ~ I(log(Population)), 
                     as = "log_pop")
head(new_pop)

With a single formula, multiple variables can be transformed, as shown before.

new_pop <- transform(from = pop_data, 
                     formula = Population + Area ~ I(log()))
head(new_pop)

We can also transform multiple variables with multiple transformations

new_pop <- transform(from = pop_data, 
                     formula =  Population + Area ~ I(Population > 100000000) + I(log(Area)))
head(new_pop)
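
As a point of comparison, overwriting existing columns can be sketched in base R like this. The sketch avoids dformula entirely (and also avoids base R's own transform(), to prevent confusion with the package function of the same name); toy is an invented stand-in for pop_data.

```r
# Toy stand-in for pop_data; the values are invented for illustration.
toy <- data.frame(Country    = c("A", "B", "C"),
                  Population = c(2e8, 5e6, 1.5e8),
                  Area       = c(9e6, 4e4, 3e5))

# One transformation applied to a group of columns, replacing them in place.
log_toy <- toy
cols <- c("Population", "Area")
log_toy[cols] <- lapply(log_toy[cols], log)

# A different transformation for each column.
mixed_toy <- toy
mixed_toy$Population <- as.integer(toy$Population > 100000000)
mixed_toy$Area <- log(toy$Area)
head(mixed_toy)
```

Unlike adding variables, the number of columns stays the same: the original values are replaced.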

Rename variables

The rename() function may be used to change names of existing variables, for example

new_pop <- rename(from = pop_data, formula = Population ~ pop)
head(new_pop)

or multiple variables

new_pop <- rename(from = pop_data, formula = Population + Area ~ pop + area)
head(new_pop)
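
The mapping from old names on the left-hand side to new names on the right-hand side can be sketched in base R as below. The vectors old_names and new_names and the data frame toy are hypothetical names introduced only for this illustration.

```r
# Toy stand-in for pop_data; the values are invented for illustration.
toy <- data.frame(Country    = c("A", "B", "C"),
                  Population = c(2e8, 5e6, 1.5e8),
                  Area       = c(9e6, 4e4, 3e5))

old_names <- c("Population", "Area")  # left-hand side of the formula
new_names <- c("pop", "area")         # right-hand side of the formula

# Replace each old name with the new name in the same position.
new_toy <- toy
names(new_toy)[match(old_names, names(new_toy))] <- new_names
names(new_toy)
```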

Select variables and rows

As in SQL, the select() function first selects the rows that satisfy a condition and then returns the selected variables.

The left-hand side of the formula lists the columns to select, as in the previous functions, and the right-hand side, the condition part, selects the rows.

Suppose we want to select only the most populated countries

new_pop <- select(from = pop_data, 
                  formula =  . ~ I(Population > 100000000))
head(new_pop)

You can also write . on the left-hand side, instead of leaving it empty, to return all variables.

Suppose we want only the names of the most populated countries

new_pop <- select(from = pop_data, 
                  formula =  Country ~ I(Population > 100000000))
head(new_pop)

We might be interested in only the countries that are both highly populated and large

new_pop <- select(from = pop_data, 
                  formula = . ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)

or in the countries that satisfy at least one of the two conditions

new_pop <- select(from = pop_data, 
                  formula = ~ I(Population > 100000000 | Area > 8000000))
head(new_pop)

by selecting only the names

new_pop <- select(from = pop_data, 
                  formula = Country ~ I(Population > 100000000 | Area > 8000000))
head(new_pop)
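
The two steps, filtering rows by a condition and then keeping the requested columns, can be sketched in base R as follows. This is only an illustration of the idea with ordinary indexing, not dformula's implementation; toy is an invented stand-in for pop_data.

```r
# Toy stand-in for pop_data; the values are invented for illustration.
toy <- data.frame(Country    = c("A", "B", "C"),
                  Population = c(2e8, 5e6, 1.5e8),
                  Area       = c(9e6, 4e4, 3e5))

# Row conditions: both must hold (&) versus at least one must hold (|).
both_rows   <- toy$Population > 100000000 & toy$Area > 8000000
either_rows <- toy$Population > 100000000 | toy$Area > 8000000

# All columns for rows meeting both conditions.
both_toy <- toy[both_rows, , drop = FALSE]

# Only the Country column for rows meeting at least one condition.
either_names <- toy[either_rows, "Country", drop = FALSE]
either_names
```

With these toy values, only country A meets both conditions, while A and C each meet at least one.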

Remove variables

The remove() function has the same syntax as the select() function, but the matching rows and columns are removed instead of returned.

new_pop <- remove(from = pop_data, 
                  formula =  Area ~ I(Population > 100000000))
head(new_pop)
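
Assuming the call above drops the Area column together with the rows where the condition holds (this is a reading of the description, not something checked against the package source), a base R equivalent could be sketched as follows. The data frame toy is invented for illustration.

```r
# Toy stand-in for pop_data; the values are invented for illustration.
toy <- data.frame(Country    = c("A", "B", "C"),
                  Population = c(2e8, 5e6, 1.5e8),
                  Area       = c(9e6, 4e4, 3e5))

# Rows matching the condition are the ones to drop.
drop_rows <- toy$Population > 100000000

# Keep the remaining rows and every column except Area.
keep_cols <- setdiff(names(toy), "Area")
new_toy <- toy[!drop_rows, keep_cols, drop = FALSE]
new_toy
```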

Missing values

In all the functions except rename(), setting the na.remove argument to TRUE removes all the rows with missing values after adding, transforming or selecting.

The remove() function can be used to remove all the rows with at least one missing observation,

data("airquality")
dt <- airquality

dt_new <- remove(from = dt, formula = . ~ ., na.remove = TRUE)
head(dt_new)

If we are interested in the observations with missing values, the na.return = TRUE argument of the select() function returns only the incomplete rows after the selection

dt_new <- select(from = dt, formula = ~ I(Temp > 50), na.return = TRUE)
head(dt_new)
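
The two views of the data, complete rows only and incomplete rows only, can be sketched in base R with complete.cases(). This is an illustration of the idea on the same airquality data, not a reproduction of what the package does internally.

```r
data("airquality")
dt <- airquality

# Rows without any missing value (the complement of the incomplete rows).
complete_rows <- dt[complete.cases(dt), ]

# Select rows by a condition first, then keep only the incomplete ones.
selected <- dt[dt$Temp > 50, ]
incomplete_rows <- selected[!complete.cases(selected), ]

head(complete_rows)
head(incomplete_rows)
```

complete.cases() returns one logical value per row, so negating it picks out exactly the rows with at least one missing value.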


dformula documentation built on July 2, 2020, 3:37 a.m.