An overview on dformula
In dformula: Data Manipulation using Formula

knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, 
                      fig.width = 6, fig.height = 5, 
                      message = FALSE, warning = FALSE)

Introduction

dformula allows to easily modify, transform, add and extrapolate using the basic R formula. The operations on the data are the following:

| Operation | Function| ------------|------------| | Add new variables | add() | | Transform existing variables| transform()| | Rename existing variables | rename() | | Selection rows and columns | select()| | Removing row and column | remove| ------------------------------------------

The formula is composed of two part:

$$ column_names \sim new_variables $$

the right-hand side shows the names of the columns of the data and the left-hand side the transformation or the new variables to insert in the data.

The I() is used in the right-hand side to indicate the type of transformation of the existing variable. In this function, we can insert logical statement, function implemented in R or user build function.

For example:

$$ var_name_1 + var_name_2 \sim I(log(var_name_1)) + I(var_name_2 == "something") $$

the two variable $var_name_1$ and $var_name_2$ are transformed in $log(var_name_1)$ or selected to be equal to $"something"$.

In the same fashion of SQL, we have the from argument, the input data, and the as argument, the new name of the variables, after transformation, selection or addition.

The CRAN version can be loaded

library('dformula')

or the development version from GitHub:

remotes::install_github('serafinialessio/dformula')

The data are available in the package will be used in this overview

data("population_data")
pop_data <- population_data

which describes the Population and Area of world countries

str(pop_data)

`Adding` variables

The add() function inserts new variables starting from the existing columns in the data.

Suppose we want to calculate population density and attach this to the original dataset

new_pop <- add(from = pop_data, formula = ~ I(Population / Area))
head(new_pop)

and give a name to this new variable

new_pop <- add(from = pop_data, formula = ~ I(Population / Area), as = "pop_density")
head(new_pop)

Multiple variable can be added with a single formula

new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)))
head(new_pop)

and with new names

new_pop <- add(from = pop_data, formula = ~ I(Population / Area) + I(log(Area)), 
               as = c("pop_density", "log_area"))
head(new_pop)

If we have one transformation applied to a group of variables, we do not specify the function multiple times

new_pop <- add(from = pop_data, formula = Population + Area ~ log())
head(new_pop)

and with new column names

new_pop <- add(from = pop_data, formula = Population + Area ~ log(),
               as = c("log_pop", "log_area"))
head(new_pop)

Suppose we want to add a numerical id for the countries at the beginning of the dataset, using the position argument

new_pop <- add(from = pop_data, 
               formula = ~ I(1:nrow(new_pop)), 
               position = "left", as = "id")
head(new_pop)

We can also add a constant variable. For example the year of the observation

new_pop <- add(from = pop_data, formula = ~ C("2020"), position = "left")
head(new_pop)

or both

new_pop <- add(from = pop_data, 
               formula = ~ I(1:nrow(new_pop)) + C("2020"), 
               position = "left", as = c("ids", "year"))
head(new_pop)

The C() construct add a constant for all the rows

We can be interested in having a dummy variable, i.e. a variable equal to $1$ if some event happen or $0$ otherwise. For example, we suppose to build a dummy variables with the most populated countries. In this we suppose countries with more than $100$ million of people.

new_pop <- add(from = pop_data, formula =  ~ I(Population > 100000000))
head(new_pop)

or two variables one with the most populated countries and the other with the biggest extended countries

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000) + I(Area > 8000000))
head(new_pop)

or a variable indicating the most populated and the biggest countries togheter

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000 & Area > 8000000))
head(new_pop)

If we want obtain a boolean vector, as an interrogation, setting to TRUE the argument logic_convert the function will return a boolean vector

new_pop <- add(from = pop_data, 
               formula =  ~ I(Population > 100000000), 
               logic_convert = FALSE, as = "most_populated")
head(new_pop)

`Transform` variables

The transform() function modifies existing variables in the dataset.

Suppose we want to change the scale on the Population

new_pop <- transform(from = pop_data, 
                     formula =  Population ~ I(Population/10000))
head(new_pop)

or we want a logarithmic transformation, renaming the variable

new_pop <- transform(from = pop_data, 
                     formula =  Population ~ I(log(Population)), 
                     as = "log_pop")
head(new_pop)

With a single formula multiple variables can be transformed, as showed before.

new_pop <- transform(from = pop_data, 
                     formula =  Population  + Area~ I(log()))
head(new_pop)

We can also transformed multiple variables with multiple transformations

new_pop <- transform(from = pop_data, 
                     formula =  Population + Area ~ I(Population > 100000000) + I(log(Area)))
head(new_pop)

`Rename` variables

The rename() function may be used to change names of existing variables, for example

new_pop <- rename(from = pop_data, formula =  Population  ~ pop )
head(new_pop)

or multiple variables

new_pop <- rename(from = pop_data, formula =  Population  + Area ~ pop + area)
head(new_pop)

`Select` variables and rows

In the same fashion of SQL, the select() function first select the rows, given a statement, and then shows the select variables.

The first part of the formula are the columns to select, as the previous functions, and the right-hand side of the formula, the condition part, will select the rows.

Suppose to want to select only the most populated countries

new_pop <- select(from = pop_data, 
                  formula =  . ~ I(Population > 100000000))
head(new_pop)

you can also add . to returns all variables instead of nothing.

We want only the name of the most populated countries

new_pop <- select(from = pop_data, 
                  formula =  Country ~ I(Population > 100000000))
head(new_pop)

We might be interest in only the most populated and biggest countries

new_pop <- select(from = pop_data, 
               formula = . ~ I(Population > 100000000 & Area > 8000000)) 
head(new_pop)

or both

new_pop <- select(from = pop_data, 
               formula = ~ I(Population > 100000000 | Area > 8000000)) 
head(new_pop)

by selecting only the names

new_pop <- select(from = pop_data, 
               formula = Country ~ I(Population > 100000000 | Area > 8000000)) 
head(new_pop)

`Remove` variables

The remove() function has the same syntax of select() function, but now the rows and columns will be removed.

new_pop <- remove(from = pop_data, 
                  formula =  Area ~ I(Population > 100000000))
head(new_pop)

Handling Missing Values

In all the functions, except for rename, the argument na.remove will remove all the rows with missing values, after adding, transforming or selecting the rows.

The remove function, can be employed to remove all the rows with at least a missing observation,

data("airquality")
dt <- airquality

dt_new <- remove(from = dt,formula = .~., na.remove = TRUE)
head(dt_new)

If we are interested to focus on the observation with missing values, the na.return = TRUE arguments of select function will return only the incomplete rows after the selection

dt_new <- select(from = dt,formula = ~ I(Temp > 50), na.return = TRUE)
head(dt_new)

Any scripts or data that you put into this service are public.

dformula documentation built on May 29, 2024, 1:37 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dformula
Data Manipulation using Formula

An overview on dformula
In dformula: Data Manipulation using Formula

Introduction

`Adding` variables

`Transform` variables

`Rename` variables

`Select` variables and rows

`Remove` variables

Handling Missing Values

Try the dformula package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dformula Data Manipulation using Formula

An overview on *dformula* In dformula: Data Manipulation using Formula

Introduction

Adding variables

Transform variables

Rename variables

Select variables and rows

Remove variables

Handling Missing Values

Try the dformula package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

dformula
Data Manipulation using Formula

An overview on dformula
In dformula: Data Manipulation using Formula

`Adding` variables

`Transform` variables

`Rename` variables

`Select` variables and rows

`Remove` variables