In HughParsonage/hutils: Miscellaneous R Functions and Aliases

`hutils` package

My name is Hugh. I've written some miscellaneous functions that don't seem to belong in a particular package. I've usually put these in R/utils.R when I write a package. Thus, hutils.

This vignette just goes through each exported function.

library(knitr)
suggested_packages <- c("geosphere", "nycflights13", "dplyr", "ggplot2", "microbenchmark")
opts_chunk$set(eval = all(vapply(suggested_packages, requireNamespace, quietly = TRUE, FUN.VALUE = FALSE)))

tryCatch({
  library(geosphere)
  library(nycflights13)
  library(dplyr, warn.conflicts = FALSE)
  library(ggplot2)
  library(microbenchmark)
  library(data.table, warn.conflicts = FALSE)
  library(magrittr)
  library(hutils, warn.conflicts = FALSE)
}, 
# requireNamespace does not detect errors like
# package ‘dplyr’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version
error = function(e) {
  opts_chunk$set(eval = FALSE)
})
options(digits = 4)

Aliases

These are simple additions to magrittr's aliases: AND and OR, capitalized forms of and and or that invoke && and || (the 'long-form' logical operators), plus the nor / neither functions.

The main motivation is to make the source code easier to indent. I occasionally find such source code easier to use.

OR(OR(TRUE,
      stop("Never happens")),  ## short-circuits
   AND(FALSE,
       stop("Never happens")))

nor (or neither which is identical) returns TRUE if and only if both arguments are FALSE.
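The truth table for nor can be sketched in base R. The helper name `nor_sketch` is hypothetical and this is not the hutils source, only an illustration of the rule stated above.

```r
# Hypothetical helper: TRUE only when both inputs are FALSE.
nor_sketch <- function(x, y) !(x || y)

nor_results <- c(
  nor_sketch(FALSE, FALSE),
  nor_sketch(TRUE, FALSE),
  nor_sketch(FALSE, TRUE),
  nor_sketch(TRUE, TRUE)
)
print(nor_results)
```

Only the first combination, where both arguments are FALSE, yields TRUE.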

`coalesce` and `if_else`

These are near drop-in replacements for the equivalent functions from dplyr. They are included here because they are very useful outside of the tidyverse, but may be required in circumstances where importing dplyr (with all of its dependencies) would be inappropriate.

They attempt to be drop-in replacements but:

hutils::if_else only works with logical, integer, double, and character type vectors. Lists and factors won't work.
hutils::coalesce short-circuits on its first argument; if there are no NAs in x then x is returned, even if the other vectors are the wrong length or type.
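The short-circuit described above can be sketched in base R. `coalesce_sketch` is a hypothetical name and an assumption about the general approach, not the actual hutils implementation.

```r
# Hypothetical sketch of a coalesce that short-circuits on its first argument.
coalesce_sketch <- function(x, ...) {
  # Nothing to fill: return x without looking at the other arguments.
  if (!anyNA(x)) return(x)
  for (v in list(...)) {
    na <- is.na(x)
    if (!any(na)) break
    # Length-one replacements are recycled; otherwise take matching positions.
    x[na] <- if (length(v) == 1L) v else v[na]
  }
  x
}

filled <- coalesce_sketch(c("a", NA, NA), c("x", "y", NA), "z")
# x has no NAs, so the wrong-length, wrong-type second argument is ignored.
untouched <- coalesce_sketch(c("a", "b", "c"), 1:2)
print(filled)
print(untouched)
```

The second call shows the consequence of short-circuiting: a malformed replacement vector goes unnoticed when there is nothing to replace.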

In addition, hutils::if_else is generally faster than dplyr::if_else:

my_check <- function(values) {
  all(vapply(values[-1], function(x) identical(values[[1]], x), logical(1)))
}

set.seed(2)
cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print

cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 1, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 1, replace = TRUE)

microbenchmark(dplyr =  dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print

This speed advantage also appears to be true of coalesce:

x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters, NA), size = 100e3, replace = TRUE)
B <- sample(c(letters, NA), size = 100e3, replace = TRUE)
C <- sample(c(letters, NA), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print

especially during short-circuits:

x <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print

x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print

Drop columns

To drop a column from a data.table, you set it to NULL:

DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT[, A := NULL]

There's nothing wrong with this, but I've found the following a useful alias, especially in a magrittr pipe.

DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_col("A") %>%
  drop_col("B")

# or
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_cols(c("A", "B"))

These functions simply invoke the canonical form, so won't be any faster.

Additionally, one can drop columns by a regular expression using drop_colr:

flights <- as.data.table(flights)

flights %>%
  drop_colr("time") %>%
  drop_colr("arr(?!_delay)", perl = TRUE)
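To see which names a Perl-style pattern with a negative lookahead would select, the pattern can be tried on a plain character vector first. This sketch uses only base R's grepl; the column names are a hand-picked subset chosen for illustration.

```r
# A few column names in the style of nycflights13::flights.
cols <- c("arr_time", "arr_delay", "sched_arr_time", "carrier", "dest")

# "arr" not immediately followed by "_delay"; lookaheads need perl = TRUE.
matched <- grepl("arr(?!_delay)", cols, perl = TRUE)
to_drop <- cols[matched]
print(to_drop)
```

Note that carrier also contains "arr", so a pattern like this matches more than the arrival columns.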

`drop_constant_cols`

When a table is filtered, some columns of the result are often constant and therefore redundant. drop_constant_cols removes them.

flights %>%
  .[origin == "JFK"] %>%
  drop_constant_cols
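The idea can be sketched on a plain data.frame in base R. This is an assumption about the general approach (a column is constant if it has at most one distinct value), not the hutils implementation, and the toy data are made up.

```r
# Toy data: after filtering to one origin, origin and year carry no information.
df <- data.frame(origin = c("JFK", "JFK", "JFK"),
                 dest = c("LAX", "SFO", "LAX"),
                 year = c(2013L, 2013L, 2013L))

# Flag columns with at most one distinct value, then keep the rest.
is_constant <- vapply(df, function(v) length(unique(v)) <= 1L, logical(1))
kept <- df[, !is_constant, drop = FALSE]
print(names(kept))
```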

`drop_empty_cols`

This function drops columns in which all the values are NA.

planes %>% 
  as.data.table %>% 
  .[!complete.cases(.)]

planes %>% 
  as.data.table %>% 
  .[!complete.cases(.)] %>% 
  # drops speed
  drop_empty_cols

`duplicated_rows`

There are many useful functions for detecting duplicates in R. However, in interactive use, I often want not merely to see which values are duplicated, but also to compare them to the original. This is especially true when I am comparing duplicates across a subset of columns in a data.table.

flights %>%
  # only the 'second' of the duplicates is returned
  .[duplicated(., by = c("origin", "dest"))]  

flights %>%
  # Both rows are returned and (by default)
  # duplicates are presented adjacently
  duplicated_rows(by = c("origin", "dest"))
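A base R sketch of the same idea on a data.frame: combining duplicated() scanned from both ends flags every member of a duplicated group, not just the later ones. The toy data and variable names are made up, and this is not the hutils source.

```r
df <- data.frame(origin = c("JFK", "LGA", "JFK", "EWR"),
                 dest = c("LAX", "ORD", "LAX", "ORD"),
                 flight = 1:4)
by <- c("origin", "dest")

# Flag a row if it duplicates an earlier row OR a later one.
is_dup <- duplicated(df[by]) | duplicated(df[by], fromLast = TRUE)
dups <- df[is_dup, ]

# Sort by the 'by' columns so that duplicates sit next to each other.
dups <- dups[do.call(order, unname(as.list(dups[by]))), ]
print(dups)
```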

Haversine distance

To emphasize the miscellany of this package, I now present haversine_distance which simply returns the distance between two points on the Earth, given their latitude and longitude.

I prefer this to other packages' implementations. Although the geosphere package can do a lot more than calculate distances between points, I find the interface for distHaversine unfortunate as it cannot be easily used inside a data.frame. In addition, I've found the arguments clearer in hutils::haversine_distance rather than trying to remember whether to use byrow inside the matrix function while passing to distHaversine.

DT1 <- data.table(lat_orig = runif(1e5, -80, 80),
                  lon_orig = runif(1e5, -179, 179),
                  lat_dest = runif(1e5, -80, 80),
                  lon_dest = runif(1e5, -179, 179))

DT2 <- copy(DT1)

microbenchmark(DT1[, distance := haversine_distance(lat_orig, lon_orig,
                                                    lat_dest, lon_dest)],

               # geosphere expects (lon, lat) pairs: origin first, then destination
               DT2[, distance := distHaversine(cbind(lon_orig, lat_orig),
                                               cbind(lon_dest, lat_dest))])
rm(DT1, DT2)
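For reference, the haversine formula itself is short enough to sketch in base R. The function name `haversine_km`, the Earth radius of 6371 km, and the choice of kilometres as the unit are assumptions made for this illustration; the argument order mirrors the call above.

```r
# Hypothetical helper: great-circle distance via the haversine formula.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# A quarter of the way around the equator, and a point compared with itself.
quarter_equator <- haversine_km(0, 0, 0, 90)
same_point <- haversine_km(-33.9, 151.2, -33.9, 151.2)
print(c(quarter_equator, same_point))
```

Because every operation is vectorized, a function of this shape can be called directly on columns inside a data.table or data.frame.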

`mutate_other`

There may be occasions where a categorical variable in a data.table needs to be modified to reduce the number of distinct categories. For example, you may want to plot a chart with a set number of facets, or ensure the smooth operation of randomForest, which accepts no more than 32 levels in a feature.

mutate_other keeps the n most common categories and changes the other categories to Other.

set.seed(1)
DT <- data.table(Fruit = sample(c("apple", "pear", "orange", "tomato", "eggplant"),
                                size = 20,
                                prob = c(0.45, 0.25, 0.15, 0.1, 0.05),
                                replace = TRUE),
                 Price = rpois(20, 10))

kable(mutate_other(DT, "Fruit", n = 3)[])
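The lumping step can be sketched on a plain character vector in base R. This is an illustration of the idea only, with made-up data chosen to avoid ties; how hutils breaks ties is not covered here.

```r
fruit <- c(rep("apple", 4), rep("pear", 3), rep("orange", 2),
           "tomato", "eggplant")

# Count each category, keep the names of the three most common.
counts <- sort(table(fruit), decreasing = TRUE)
top <- names(counts)[seq_len(3)]

# Everything outside the top three becomes "Other".
lumped <- ifelse(fruit %in% top, fruit, "Other")
print(table(lumped))
```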

`ngrep`

This is a 'dumb' negation of grep. In recent versions of R, grep has an invert argument (FALSE by default). A slight advantage of ngrep is that it's shorter to type. But if you don't have arthritis, best use invert = TRUE or !grepl.
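The two base R alternatives look like this; both return the elements that do not match the pattern.

```r
x <- c("apple", "pear", "kiwi", "fig")

# invert = TRUE returns the non-matching elements (value = TRUE gives the strings).
via_invert <- grep("a", x, invert = TRUE, value = TRUE)

# Equivalent: negate the logical vector from grepl.
via_grepl <- x[!grepl("a", x)]
print(via_invert)
```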

`notin` `ein` `enotin` `pin`

These functions provide complementary functionality to %in%:

`%notin%`

%notin% is the negation of %in%, but also uses the package fastmatch to increase the speed of the operation.

`%ein%` and `%enotin%`

The functions %ein% and %enotin% are motivated by a different sort of problem. Consider the following statement:

iris <- as.data.table(iris)
iris[Species %in% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)

On the face of it, this appears to give the average ratio of Iris setosa and Iris versicolour irises. However, it only gives the average ratio of setosa irises, as the correct spelling is Iris versicolor, not -our. This particular error is easy to make (in fact, when I wrote this vignette, the first Google hit for iris dataset made the same spelling error), but it's easy to imagine similar mistakes, such as mistaking the capitalization of a value. The functions %ein% and %enotin% strive to reduce the occurrence of this mistake. They operate exactly the same as %in% and %notin% but raise an error if any of the values on the right-hand side is not present in the left-hand side:

iris <- as.data.table(iris)
iris[Species %ein% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)

The e stands for 'exists'; i.e. they should be read as "exists and in" and "exists and not in".
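The check can be sketched as a small base R operator. `%ein_sketch%` is a hypothetical name and the error message is made up; this only illustrates the "exists and in" rule described above.

```r
# Hypothetical operator: like %in%, but every right-hand value must occur in x.
`%ein_sketch%` <- function(x, table) {
  absent <- setdiff(table, x)
  if (length(absent) > 0L) {
    stop("Not present in x: ", paste(absent, collapse = ", "))
  }
  x %in% table
}

species <- c("setosa", "versicolor", "virginica")
ok <- species %ein_sketch% c("setosa", "versicolor")

# The misspelling is caught instead of silently matching nothing.
err <- tryCatch(species %ein_sketch% c("setosa", "versicolour"),
                error = function(e) conditionMessage(e))
print(ok)
print(err)
```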

`%pin%`

This performs a partial match (i.e. grepl) but with a possibly more readable or intuitive syntax:

identical(iris[grep("v", Species)],
          iris[Species %pin% "v"])

If the RHS has more than one element, the matching is done on alternation (i.e. OR):

iris[Species %pin% c("ver", "vir")] %>%
  head

There is an important qualification: if the RHS is NULL, then the result will be TRUE along the length of x, contrary to the behaviour of %in%. This is not entirely unexpected as NULL could legitimately be interpreted as ε, the empty regular expression, which occurs in every string.
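Both the alternation and the NULL case fall out naturally if the patterns are joined with "|" before matching. The operator `%pin_sketch%` is a hypothetical base R illustration of that approach, not the hutils source.

```r
# Hypothetical operator: join the patterns with "|" and match with grepl.
`%pin_sketch%` <- function(x, patterns) {
  grepl(paste0(patterns, collapse = "|"), x)
}

species <- c("setosa", "versicolor", "virginica")
either <- species %pin_sketch% c("ver", "vir")

# Collapsing NULL gives the empty pattern "", which matches every string.
with_null <- species %pin_sketch% NULL
print(either)
print(with_null)
```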

`provide.dir`

This is the same as dir.create but checks whether the target exists or not and does nothing if it does. Motivated by \providecommand in LaTeX, which creates a macro only if it does not exist already.
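A minimal base R sketch of the "create only if absent" pattern follows. The name `provide_dir_sketch`, the use of recursive = TRUE, and the return value are assumptions made for this example.

```r
# Hypothetical helper: create the directory only when it does not exist yet.
provide_dir_sketch <- function(path) {
  if (dir.exists(path)) return(invisible(path))
  dir.create(path, recursive = TRUE)
  invisible(path)
}

target <- file.path(tempdir(), "hutils-sketch", "nested")
provide_dir_sketch(target)
# The second call is a no-op; plain dir.create would warn that the
# directory already exists.
provide_dir_sketch(target)
print(dir.exists(target))
```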

`select_which`

This plays a similar role to dplyr::select_if but was originally part of package:grattan so has a different name. It simply returns the columns whose values return TRUE when Which is applied. Additional columns (which may or may not satisfy Which) may be included by using .and.dots. (To remove columns, you can use drop_col.)

DT <- data.table(x = 1:5,
                 y = letters[1:5],
                 AB = c(NA, TRUE, FALSE, TRUE, FALSE))
select_which(DT, anyNA, .and.dots = "y")

`set_cols_first`

Up to and including data.table 1.10.4, one could only reorder the columns by supplying all the columns. You can use set_cols_first and set_cols_last to put columns first or last without supplying all the columns.

Unique keys

In some circumstances, you need to know that the key of a data.table is unique. For example, you may expect a join to be performed later, without specifying mult='first' or permitting Cartesian joins. data.table does not require a key to be unique and does not supply tools to check the uniqueness of keys. hutils supplies two simple functions. has_unique_key, when applied to a data.table, returns TRUE if and only if the table has a key and it is unique.

set_unique_key does the same as setkey but will error if the resultant key is not unique.
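The uniqueness test itself can be sketched on a plain data.frame with anyDuplicated. The helper name `has_unique_cols` is hypothetical, and unlike has_unique_key it takes the candidate key columns explicitly because a data.frame has no key.

```r
# Hypothetical helper: TRUE if no combination of the given columns repeats.
has_unique_cols <- function(df, cols) {
  anyDuplicated(df[cols]) == 0L
}

df <- data.frame(id = c(1, 2, 3), grp = c("a", "a", "b"))
unique_by_id <- has_unique_cols(df, "id")
unique_by_grp <- has_unique_cols(df, "grp")
print(c(unique_by_id, unique_by_grp))
```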

`hutils` v1.1.0

`auc`

The area under the (ROC) curve gives a single value to measure the tradeoff between true positives and false positives.

dt <- data.table(y = !sample(0:1, size = 100, replace = TRUE), 
                 x = runif(100))
dt[, pred := predict(lm(y ~ x, data = .SD), newdata = .SD)]

dt[, auc(y, pred)]
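One common way to compute the area under the ROC curve is the rank-based (Mann-Whitney) identity, sketched here in base R. `auc_sketch` is a hypothetical name; whether hutils uses this method internally is not stated in this vignette.

```r
# Hypothetical helper: AUC as the probability that a random positive
# outscores a random negative, computed from ranks.
auc_sketch <- function(actual, score) {
  n_pos <- sum(actual)
  n_neg <- sum(!actual)
  r <- rank(score)  # ties receive average ranks
  (sum(r[actual]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.1, 0.2, 0.8, 0.9)
perfect <- auc_sketch(c(FALSE, FALSE, TRUE, TRUE), score)
inverted <- auc_sketch(c(TRUE, TRUE, FALSE, FALSE), score)
mixed <- auc_sketch(c(FALSE, TRUE, FALSE, TRUE), score)
print(c(perfect, inverted, mixed))
```

A value of 1 means the scores separate the classes perfectly, 0.5 is no better than chance, and 0 means the ranking is exactly backwards.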

`select_grep`

To select columns matching a regular expression:

flights %>%
  select_grep("arr")

You can use the additional arguments .and and .but.not to override the patterns.

flights %>%
  select_grep("arr", .and = "year", .but.not = "arr_time")

`hutils` v1.2.0

`RQ`

This is simply a shorthand to test whether a package needs installing. The package name need not be quoted, for convenience.

RQ(dplyr, "dplyr must be installed")
RQ("dplyr", "dplyr needs installing", "dplyr installed.")

`ahull`

This locates the biggest rectangle beneath a curve:

if (!identical(Sys.info()[["sysname"]], "Darwin"))
  ggplot(data.table(x = c(0, 1, 2, 3, 4), y = c(0, 1, 2, 0.1, 0))) +
  geom_area(aes(x, y)) +
  geom_rect(data = ahull(, c(0, 1, 2, 3, 4), c(0, 1, 2, 0.1, 0)),
            aes(xmin = xmin,
                xmax = xmax,
                ymin = ymin,
                ymax = ymax),
            color = "red")

set.seed(101)
ahull_dt <-
  data.table(x = c(0:100) / 100,
             y = cumsum(rnorm(101, 0.05)))
if (!identical(Sys.info()[["sysname"]], "Darwin"))
ggplot(ahull_dt) +
  geom_area(aes(x, y)) + 
  geom_rect(data = ahull(ahull_dt), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "red") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "blue") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE,
                         minH = 4), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "green") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE,
                         minW = 0.25), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "white",
            fill = NA)

`hutils` v1.3.0

`weighted_quantile`

Simply a version of quantile supporting weighted values:

x <- 1:10
w <- c(rep(1, 5), rep(2, 5))
quantile(x, probs = c(0.25, 0.75), names = FALSE)

weighted_quantile(x, w, p = c(0.25, 0.75))

`mutate_ntile`

To add a column of ntiles (say, for later summarizing):

flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile(distance, n = 5L)

You can use non-standard evaluation (as above) or you can quote the col argument. Use character.only = TRUE to ensure the column argument is interpreted only as a character string.

flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile(distance, n = 5L)

flights %>%
  as.data.table %>%
  mutate_ntile("distance",
               n = 5L,
               character.only = TRUE) %>%
  .[, dep_delay := coalesce(dep_delay, 0)] %>%
  .[, .(avgDelay = mean(dep_delay)), keyby = "distanceQuintile"]

`longest_affix`

Trimming common affixes can be useful during data cleaning:

trim_common_affixes(c("CurrentHousingCosts(weekly)",
                      "CurrentFuelCosts(weekly)"))

`hutils` v1.4.0

`%<->%`

Referred to as swap in the documentation. Used to swap the values of two objects:

a <- 1
b <- 2
a %<->% b
identical(c(a, b), c(2, 1))

`average_bearing`

Determine the average bearing of vectors. This is slightly more difficult than simply taking the average modulo 360, since the bearing that bisects the most acute sector is desired.

average_bearing(0, 270)  # NW
mean(c(0, 270))          # SE (i.e. wrong)
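One way to get the bisector of the acute sector is to average the unit vectors rather than the angles. The following base R sketch takes that approach; `average_bearing_sketch` is a hypothetical name, and this is not necessarily how hutils computes it.

```r
# Hypothetical helper: circular mean of two bearings given in degrees.
average_bearing_sketch <- function(theta1, theta2) {
  rad <- c(theta1, theta2) * pi / 180
  # Sum the unit vectors, then convert the resultant back to a bearing.
  (atan2(sum(sin(rad)), sum(cos(rad))) * 180 / pi) %% 360
}

north_and_west <- average_bearing_sketch(0, 270)
ten_and_fifty <- average_bearing_sketch(10, 50)
print(c(north_and_west, ten_and_fifty))
```

Diametrically opposite bearings (for example 0 and 180) have no well-defined average under this method, since the unit vectors cancel.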

`dir2`

This is a faster version of list.files for Windows only, utilizing the dir command on the command prompt.

`Mode`

The statistical mode; the most common element.

Mode(c(1, 1, 1, 2, 3))
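Base R has no built-in function for the statistical mode (base::mode reports an object's storage mode instead). A common base R idiom is sketched below; `mode_sketch` is a hypothetical name, and returning the first value on ties is a choice made for this example.

```r
# Hypothetical helper: the most frequent value; on ties, the first one seen.
mode_sketch <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

most_common_number <- mode_sketch(c(1, 1, 1, 2, 3))
most_common_letter <- mode_sketch(c("b", "a", "b", "c"))
print(most_common_number)
print(most_common_letter)
```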

`replace_pattern_in`

A cousin of find_pattern_in, but instead of collecting the results, it replaces the contents sought with the replacement provided.

`samp`

A safer version of sample. I use it because I found the following behaviour of sample surprising.

DT <- data.table(x = c(5, 2, 3),
                 y = c(5, 3, 4))
DT[, .(Base = sample(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
DT[, .(Base = samp(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
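The surprise comes from a documented convenience in base::sample: when x is a single number n >= 1, sampling is done from 1:n rather than from x itself. In the example above, the group with x = 5 and y = 5 passes 5:5, a vector of length one. The base R sketch below shows the effect and one way to avoid it; `safe_shuffle` is a hypothetical helper, not the source of samp.

```r
set.seed(1)

# 5:5 has length one, so sample() permutes 1:5 instead of returning 5.
surprising <- sample(5:5)

# Hypothetical helper: always permute the elements actually supplied.
safe_shuffle <- function(x) x[sample.int(length(x))]
safe <- safe_shuffle(5:5)

print(length(surprising))
print(safe)
```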

HughParsonage/hutils documentation built on Feb. 12, 2023, 8:26 a.m.
