In krlmlr/pdlyr: Compatibility package to access both plyr and dplyr

library(magrittr)
knit_print.data.frame = function(x, ...) {
  res = paste(c('', '', knitr::kable(x)), collapse = '\n')
  knitr::asis_output(res)
}
`knit_print.{` = function(x, ...) {
  res <- paste(c("```r", paste0(c("", rep("  ", length(x)-1)), as.character(x)), "}", "```", ""), collapse = '\n')
  knitr::asis_output(res)
}

Problem statement

When loading both plyr and dplyr, the last package loaded overwrites symbols exported by the package loaded first:

library(plyr)
library(dplyr)

Currently, the following symbols are affected:

plyr_exports <- ls("package:plyr")
dplyr_exports <- ls("package:dplyr")
(both_exports <- intersect(plyr_exports, dplyr_exports))

This means that existing projects that use plyr cannot simply load dplyr using library(dplyr) without potentially breaking existing code. There are workarounds, but all of them seem to have specific disadvantages:

Load only one of both packages, access functionality from the other package via explicit qualification (e.g., dplyr::summarise).
- Cumbersome
Don't load, always access via explicit qualification
- Requires extensive code rewriting for existing projects
Load dplyr after plyr, modify usage of conflicting symbols in the code
- Doesn't safeguard new code that attempts to use dplyr primitives with plyr

This document explores alternative solutions.

Analysis

Let's take a closer look at the interface of the functions exported from both packages:

formals_df <-
  both_exports %>%
  setNames(nm = .) %>%
  lapply(
    . %>% {
      lapply(
        c("plyr", "dplyr") %>% setNames(nm = .),
        function (x) {
          get(., sprintf("package:%s", x)) %>%
            formals %>% names %>% paste(collapse = ", ")
        }
      )
    }
  ) %>%
  plyr::ldply(data.frame, stringsAsFactors = FALSE, .id = "name") %>%
  plyr::mutate(data_first = grepl("^[.]data,", dplyr) | grepl("^df,", plyr),
               identical = plyr == dplyr) %>%
  plyr::arrange(!data_first, !identical, name)

formals_df

We can split the overlaps in two groups:

The first argument is a data frame or table object
No data frame or table object, but identical formals

Data is first argument

formals_df %>%
  subset(data_first) %>%
  plyr::mutate(data_first = NULL)

Here, mutate, summari[sz]e and arrange mean the same in both plyr and dplyr, although plyr::summari[sz]e on a grouped tbl_df seems to destroy the grouping. The count function is similar, too, only that plyr::count is more similar to dplyr::count_, and that for some reason the first argument of dplyr::count is called x and not .data. Also, plyr::rename seems to be more similar to dplyr::rename_.

Identical formals

formals_df %>%
  subset(!data_first & identical) %>%
  plyr::mutate(data_first = NULL, identical = NULL)

Here, we take a look at the bodies. We compare them for textual identity.

body_l <-
  formals_df %>%
  subset(!data_first & identical) %>%
  extract2("name") %>%
  as.character %>%
  setNames(nm = .) %>%
  lapply(
    . %>% {
      lapply(
        c("plyr", "dplyr") %>% setNames(nm = .),
        function (x) {
          get(., sprintf("package:%s", x)) %>%
            body
        }
      )
    }
  )

body_l %>%
  llply(. %>% { identical(as.character(.$plyr), as.character(.$dplyr)) } ) %>%
  ldply(. %>% data.frame(identical_body = .), .id = "name")

Only failwith is different, but the effect seems to be the same:

body_l$failwith$plyr
body_l$failwith$dplyr

Concept

For fully seamless transition between plyr and dplyr, a compatibility package looks like a possible option. It would provide the union of the exported functions of both packages, and compatibility wrappers for the two functions count and rename that need special attention.

Deprecating the functions count and rename in the plyr package seems simpler. For an even more radical solution, all overlapping functions could be deprecated, referring to identical functionality in dplyr:

mutate -> dplyr::mutate
- row names are lost
summari[sz]e -> dplyr::summari[sz]e
arrange(df, ...) -> dplyr::arrange(.data, ...)
count(df, vars, wt_var) -> dplyr::count_(x, vars, wt, sort = FALSE)
- modified call
- rename n column to freq
rename(x, replace, warn_missing) -> dplyr::rename_(.data, replace)
- replace needs to be massaged
- warn_missing is always TRUE
desc(), failwith(), id() -> dplyr::...

Summary

For practical use, a thin compatibility layer seems to work reasonably well for a project that was created using plyr and is now transitioning towards dplyr:

attach(pdlyr::plyr_compat)

Tests in the pdlyr package will assure that this package can be safely loaded with warn.conflicts = FALSE. The original plyr implementations can be accessed via shortcuts (prefix p).

A few usage examples:

mtcars %>% mutate(lphkm = 100 * 3.785411784 / 1.609344 / mpg) %>% head
mtcars %>% plyr::mutate(lphkm = 100 * 3.785411784 / 1.609344 / mpg) %>% head
mtcars %>% ddply("cyl", summarize, mean_mpg = mean(mpg))
mtcars %>% ddply("cyl", plyr::summarise, mean_mpg = mean(mpg))
mtcars %>% arrange(-wt) %>% head
mtcars %>% plyr::arrange(-wt) %>% head
mtcars %>% count("gear")
mtcars %>% plyr::count("gear")
mtcars %>% rename(list(mpg = "miles_per_gallon")) %>% extract(1:2) %>% head
mtcars %>% plyr::rename(list(mpg = "miles_per_gallon")) %>% extract(1:2) %>% head