Common tasks when using dplyr

library(knitr)
opts_chunk$set(out.extra = 'style="display:block; margin: auto"'
    #, fig.align = "center"
    , fig.width = 4.3, fig.height = 3.2, dev.args = list(pointsize = 10)
    )
knit_hooks$set(spar = function(before, options, envir) {
    if (before) {
        par( las = 1 )                   #also y axis labels horizontal
        par(mar = c(2.0,3.3,0,0) + 0.3 ) #margins
        par(tck = 0.02 )                 #axe-tick length inside plots             
        par(mgp = c(1.1,0.2,0) )  #positioning of axis title, axis labels, axis
     }
})

Introduction

The tidyverse with the dplyr package in combination with purrr mapping and tidyr data.frames is agreat way of managing data in R.

Using the package, however, require repeated coding for some common issues. This package provides routines for some of those issues. We hope that most of the issues are delat with in dplyr directly, and this package to become obsolete then.

require(dplyr, quietly = TRUE)
require(dplyrUtil, quietly = TRUE)

split-map-combine

One common operation is the split-map-combine pattern. A tbl, i.e. tabular data such as a data.frame or a tibble, is subset, then for each subset a function is called that returns a tbl, and finally the resulting tbls are row-bind again.

While the tidyverse favours the group_by-nest-map-unnest idiom of this pattern, the resulting code is hard to understand for people not familiar with this idiom. Moreover, the nested data.frames do not contain the grouping columns which are sometimes required by the function working on the subset.

dplyUtil provides function mapGroups that provides a dplyr like call interface for the base split function.

The following example applies a function on each kind of food subset that uses the number of records in the subset and the grouping variable, kind.

ds <- tibble(
  food = c("banana","strawberry","pear","bread","corn flakes")
  , kind = factor(c(rep("fruit",3), rep("cereal",2)))
  )
ans <- ds %>% group_by(kind) %>% mapGroups(function(dss){
  dss %>% mutate(
    countInKind = paste(first(kind),1:n())
  )
}, .omitWarning = TRUE)
# or shorter with formula-functions
ans <- ds %>% group_by(kind) %>% mapGroups(
  ~mutate(., countInKind = paste(first(kind),1:n())), .omitWarning = TRUE)
ans

If the data.frames to combine hold inconsistent factor levels, by default a warning is shown and the column is converted to character. If instead, one wants to combine all factor levels, there is argument .isFactorReleveled

ans <- ds %>% group_by(kind) %>% mapGroups(function(dss){
  dss %>% mutate(
    countInKind = factor(paste(first(kind),1:n()))
  )
}
, .isFactorReleveled = TRUE
, .noWarningCols = c("countInKind")
, .omitWarning = TRUE)
ans

Better use split - map_dfr

However, try learning the split-map idiom to replace calls to mapGroups and may spare package dependencies.

library(purrr)
ds %>% group_by(kind) %>% split(group_indices(.)) %>% map_dfr(
  ~mutate(., countInKind = paste(first(kind),1:n())))

In order to users to better use this pattern, gapGroups issues a warning. The warning can be avoided by argument .omitWarning = TRUE.

Recent versions of map_dfr and bind_rows also work with inconsistent factor levels:

ansList <- ds %>% group_by(kind) %>% split(group_indices(.)) %>% map(
  ~mutate(., countInKind = factor(paste(first(kind),1:n())))) 
ansList[[1]]$countInKind # note that factor levels do not include fruits
# countInKind is converted to character in older dplyr versions
ansList %>% bind_rows()  

mutate acting only on the rows satisfying the condition

Default mutate acts on all rows. For each modified variable one needs to use ifelse. Function mutate_cond helps in those cases.

The following example modifies the Petal.Length column, but only for a given Species.

ans <- iris %>% as_tibble() %>% 
  mutate_cond(
    Species == "setosa"
    , Petal.Length = 1.0
  )
rbind(
ans %>% filter(Species == "setosa") %>% 
  select(Species, Petal.Length) %>% head(3) # 1.0
, ans %>% filter(Species != "setosa") %>% 
  select(Species, Petal.Length) %>% head(3) # orig
) 

Joining with Replacement

A join is often used to add new columns to a dataset. If the same information is joined several times, by default the column is creted with a modified name. If one instead wants to replace the joined column, function left_joinReplace helps with that.

The following example joines the info column again, because it was modified after the first join. It demonstrates the different results between left_join and left_joinReplace.

kindInfo <- tibble(
  kind = factor(c("fruit","cereal"))
  ,info = c("rich in vitamins","rich in carbohydrates")
)
ans1 <- ds %>% left_join(kindInfo, by = "kind")
kindInfo$info[1] <- "juicy"
ans1 %>% left_join(kindInfo, by = "kind") # duplicated info column
ans1 %>% left_joinReplace(kindInfo, by = "kind") # modified info column

Dealing with unequal factor levels

Superseded: This functionality has been implemented in recent dpylr versions.

df1 <- tibble(f = factor(c("D","D","C")))
df2 <- tibble(f = factor(c("C","C","A"))
                  , desc = c("forC1","forC2","forA1"))
bind_rows(df1,df2) # f is converted to character in older dplyr versions
try({
  left_join(df1,df2, by = "f") # fails in older dplyr version
})

With older version of dplyr

In case of unequal factor levels in row-binding tbls or joins, factors are converted to character. Often one wants to combine the factor levels instead. Function expandAllInconsistentFactorLevels helps with that.

In the following example, the levels of factor f do not match. Therefore, bind_rows warns and coerces to character. With expanding the factors before, the combined data.frame factor f is still a factor but with modified levels.

df1 <- tibble(f = factor(c("D","D","C")))
df2 <- tibble(f = factor(c("C","C","A"))
                  , desc = c("forC1","forC2","forA1"))
bind_rows(df1,df2) # f is converted to character in older dplyr versions
expandAllInconsistentFactorLevels(df1,df2, .omitWarning = TRUE) %>% bind_rows()

Function expandFactorLevels helps to relevel a specific column across several tbls.

Both functions proviede the result as a list of tbls. This type of result can be handled by bind_rows, but other functions require single data.frames, i.e. the components of the result ([[1]], [[2]]).

For convenience function left_joinFactors helps joining tbls with different factor levels.

left_joinFactors(df1,df2, by = "f", .noWarningCols = c("f"), .omitWarning = TRUE)


bgctw/dplyrUtil documentation built on Nov. 11, 2020, 12:25 a.m.