library(knitr)
opts_chunk$set(
  out.extra = 'style="display:block; margin: auto"'
  #, fig.align = "center"
  , fig.width = 4.3, fig.height = 3.2, dev.args = list(pointsize = 10)
)
knit_hooks$set(spar = function(before, options, envir) {
  if (before) {
    par( las = 1 )                     #also y axis labels horizontal
    par(mar = c(2.0,3.3,0,0) + 0.3 )   #margins
    par(tck = 0.02 )                   #axe-tick length inside plots
    par(mgp = c(1.1,0.2,0) )           #positioning of axis title, axis labels, axis
  }
})
The tidyverse, with the dplyr package in combination with purrr mapping and tidyr data.frames, is a great way of managing data in R.
Using these packages, however, requires repeated coding for some common issues. This package provides routines for some of those issues. We hope that most of them will eventually be dealt with in dplyr directly, making this package obsolete.
require(dplyr, quietly = TRUE)
require(dplyrUtil, quietly = TRUE)
One common operation is the split-map-combine pattern. A tbl, i.e. tabular data such as a data.frame or a tibble, is split into subsets, then for each subset a function is called that returns a tbl, and finally the resulting tbls are row-bound again.
While the tidyverse favours the group_by-nest-map-unnest idiom of this pattern, the resulting code is hard to understand for people not familiar with this idiom. Moreover, the nested data.frames do not contain the grouping columns which are sometimes required by the function working on the subset.
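For readers who have not met the group_by-nest-map-unnest idiom, the following is a minimal sketch using only dplyr, tidyr, and purrr. It rebuilds the small food table used in the examples of this vignette; the column name idInKind and the object names nested and ansNest are chosen here for illustration only. It also shows that the nested tibbles hold only the non-grouping columns.

```r
library(dplyr)
library(tidyr)
library(purrr)

ds <- tibble(
  food = c("banana", "strawberry", "pear", "bread", "corn flakes"),
  kind = factor(c(rep("fruit", 3), rep("cereal", 2)))
)

# nest: one row per group, the subset is stored in the list column `data`
nested <- ds %>% group_by(kind) %>% nest() %>% ungroup()

# the nested tibbles do not contain the grouping column `kind`
names(nested$data[[1]])

# map a function over the subsets, then unnest to row-bind the results
ansNest <- nested %>%
  mutate(data = map(data, ~ mutate(.x, idInKind = seq_len(nrow(.x))))) %>%
  unnest(cols = data)
ansNest
```

Because kind is missing inside each subset, a function that needs the grouping value has to receive it some other way, which is the inconvenience described above.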
dplyrUtil provides the function mapGroups, which offers a dplyr-like call interface for the base split function.
The following example applies a function to the subset of each kind of food. The function uses the number of records in the subset and the grouping variable, kind.
ds <- tibble(
  food = c("banana","strawberry","pear","bread","corn flakes")
  , kind = factor(c(rep("fruit",3), rep("cereal",2)))
)
ans <- ds %>%
  group_by(kind) %>%
  mapGroups(function(dss){
    dss %>% mutate( countInKind = paste(first(kind),1:n()) )
  }, .omitWarning = TRUE)
# or shorter with formula-functions
ans <- ds %>%
  group_by(kind) %>%
  mapGroups(
    ~mutate(., countInKind = paste(first(kind),1:n())),
    .omitWarning = TRUE)
ans
If the data.frames to combine hold inconsistent factor levels, by default a warning is shown and the column is converted to character. If one instead wants to combine all factor levels, there is the argument .isFactorReleveled.
ans <- ds %>%
  group_by(kind) %>%
  mapGroups(function(dss){
    dss %>% mutate( countInKind = factor(paste(first(kind),1:n())) )
  }
  , .isFactorReleveled = TRUE
  , .noWarningCols = c("countInKind")
  , .omitWarning = TRUE)
ans
However, consider learning the split-map idiom to replace calls to mapGroups; doing so may spare a package dependency.
library(purrr)
ds %>%
  group_by(kind) %>%
  split(group_indices(.)) %>%
  map_dfr( ~mutate(., countInKind = paste(first(kind),1:n())))
To encourage users to adopt this pattern, mapGroups issues a warning. The warning can be suppressed with the argument .omitWarning = TRUE.
Recent versions of map_dfr and bind_rows also work with inconsistent factor levels:
ansList <- ds %>%
  group_by(kind) %>%
  split(group_indices(.)) %>%
  map( ~mutate(., countInKind = factor(paste(first(kind),1:n()))))
ansList[[1]]$countInKind # note that factor levels do not include fruits
# countInKind is converted to character in older dplyr versions
ansList %>% bind_rows()
By default, mutate acts on all rows. To modify only some rows, one needs to use ifelse for each modified variable. Function mutate_cond helps in those cases.
The following example modifies the Petal.Length column, but only for a given Species.
ans <- iris %>%
  as_tibble() %>%
  mutate_cond(
    Species == "setosa"
    , Petal.Length = 1.0
  )
rbind(
  ans %>% filter(Species == "setosa") %>% select(Species, Petal.Length) %>% head(3) # 1.0
  , ans %>% filter(Species != "setosa") %>% select(Species, Petal.Length) %>% head(3) # orig
)
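For comparison, the same modification can be written with plain dplyr and ifelse, the approach that mutate_cond is meant to shorten. This is only a sketch of the equivalent result for this one column; the object name ansIfelse is chosen here for illustration.

```r
library(dplyr)

# the condition has to be repeated inside ifelse for every modified column
ansIfelse <- iris %>%
  as_tibble() %>%
  mutate(Petal.Length = ifelse(Species == "setosa", 1.0, Petal.Length))

# setosa rows are set to 1.0, all other rows keep their original values
ansIfelse %>%
  group_by(Species) %>%
  summarise(minLength = min(Petal.Length), maxLength = max(Petal.Length))
```

With several columns to modify, each one needs its own ifelse call that repeats the condition, which is where a conditional mutate becomes more readable.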
A join is often used to add new columns to a dataset. If the same information is joined several times, by default the column is created with a modified name. If one instead wants to replace the joined column, function left_joinReplace helps with that.
The following example joins the info column again, because it was modified after the first join. It demonstrates the different results of left_join and left_joinReplace.
kindInfo <- tibble(
  kind = factor(c("fruit","cereal"))
  , info = c("rich in vitamins","rich in carbohydrates")
)
ans1 <- ds %>% left_join(kindInfo, by = "kind")
kindInfo$info[1] <- "juicy"
ans1 %>% left_join(kindInfo, by = "kind")         # duplicated info column
ans1 %>% left_joinReplace(kindInfo, by = "kind")  # modified info column
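A similar result can be sketched with plain dplyr by dropping the outdated column before joining again. This is an assumption about the effect of left_joinReplace, not a description of its implementation; the object names ansDup and ansReplaced are chosen here for illustration.

```r
library(dplyr)

ds <- tibble(
  food = c("banana", "strawberry", "pear", "bread", "corn flakes"),
  kind = factor(c(rep("fruit", 3), rep("cereal", 2)))
)
kindInfo <- tibble(
  kind = factor(c("fruit", "cereal")),
  info = c("rich in vitamins", "rich in carbohydrates")
)
ans1 <- ds %>% left_join(kindInfo, by = "kind")
kindInfo$info[1] <- "juicy"

# a second plain left_join keeps both versions, as info.x and info.y
ansDup <- ans1 %>% left_join(kindInfo, by = "kind")
names(ansDup)

# dropping the old column first lets the join supply the updated one
ansReplaced <- ans1 %>%
  select(-info) %>%
  left_join(kindInfo, by = "kind")
ansReplaced
```

The manual version requires knowing in advance which columns the right-hand table will contribute, which becomes tedious when several columns are refreshed at once.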
Superseded: This functionality has been implemented in recent dplyr versions.
df1 <- tibble(f = factor(c("D","D","C")))
df2 <- tibble(f = factor(c("C","C","A")), desc = c("forC1","forC2","forA1"))
bind_rows(df1,df2) # f is converted to character in older dplyr versions
try({
  left_join(df1,df2, by = "f") # fails in older dplyr version
})
In case of unequal factor levels in row-binding tbls or joins, factors are converted to character. Often one wants to combine the factor levels instead. Function expandAllInconsistentFactorLevels helps with that.
In the following example, the levels of factor f do not match. Therefore, bind_rows warns and coerces to character. When the factor levels are expanded beforehand, column f of the combined data.frame is still a factor, but with modified levels.
df1 <- tibble(f = factor(c("D","D","C")))
df2 <- tibble(f = factor(c("C","C","A")), desc = c("forC1","forC2","forA1"))
bind_rows(df1,df2) # f is converted to character in older dplyr versions
expandAllInconsistentFactorLevels(df1,df2, .omitWarning = TRUE) %>%
  bind_rows()
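The idea of expanding factor levels before binding can be sketched in base R and dplyr by giving both columns the union of their levels. This illustrates the principle only; it is not taken from the package source, and the level order that expandAllInconsistentFactorLevels produces may differ. The names allLevels, df1r, df2r, and combined are chosen here for illustration.

```r
library(dplyr)

df1 <- tibble(f = factor(c("D", "D", "C")))
df2 <- tibble(f = factor(c("C", "C", "A")),
              desc = c("forC1", "forC2", "forA1"))

# collect every level that occurs in either tbl
allLevels <- union(levels(df1$f), levels(df2$f))

# re-create both factors with the common set of levels
df1r <- df1 %>% mutate(f = factor(f, levels = allLevels))
df2r <- df2 %>% mutate(f = factor(f, levels = allLevels))

# with identical levels, f stays a factor after row-binding
combined <- bind_rows(df1r, df2r)
levels(combined$f)
```

Doing this by hand for every factor column in several tbls is repetitive, which is the case the helper functions in this section address.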
Function expandFactorLevels helps to relevel a specific column across several tbls.
Both functions provide the result as a list of tbls. This type of result can be handled by bind_rows, but other functions require single data.frames, i.e. the components of the result ([[1]], [[2]]).
For convenience, function left_joinFactors helps with joining tbls with different factor levels.
left_joinFactors(df1,df2, by = "f", .noWarningCols = c("f"), .omitWarning = TRUE)