
Overview

rsalad, like any other salad, is a mixture of different healthy vegetables that you should be having frequently and that can make your life much better. Except that instead of vegetables, rsalad provides you with R functions.

This package was born as a result of me constantly breaking the DRY principle by copy-and-pasting functions from old projects into new ones. Hence, the functions in rsalad do not share a single common topic: each one either manipulates data.frames or serves as a general productivity utility.

Analysis

This vignette will introduce all the families of functions available in rsalad, but will not dive too deeply into any one specific function. To demonstrate all the functionality, we will use the nycflights13::flights dataset (information about ~335k flights departing from NYC) to visualize the 50 most common destinations of flights out of NYC. While the analysis is not particularly exciting, it will show how to use rsalad proficiently.

Load packages

Before beginning any analysis using rsalad, the first step is to load the package. We'll also load dplyr, a package that every analysis workflow should use.

library(rsalad)
library(dplyr)

Load data

The first step is to load the flights dataset and have a peek at how it looks.

fDat <- nycflights13::flights
head(fDat)
knitr::kable(head(fDat))

%nin% operator and notIn()

Let's say that for some reason we aren't interested in flights operated by United Airlines (UA), Delta Airlines (DL) and American Airlines (AA). To choose only the carriers that are not part of that group, we can use the %nin% operator, which is also aliased to notIn().

fDat2 <- fDat %>% filter(carrier %nin% c("UA", "DL", "AA"))
# pull() extracts a single column as a plain vector
allCarriers <- fDat %>% pull(carrier) %>% unique
myCarriers <- fDat2 %>% pull(carrier) %>% unique

paste0("All carriers: ", paste(allCarriers, collapse = ", "))
paste0("My carriers: ", paste(myCarriers, collapse = ", "))

The %nin% operator is simply the negation of %in%, but can be a handy shortcut. lhs %nin% rhs is equivalent to notIn(lhs, rhs). The following code would have the same result as above:

fDat2_2 <- fDat %>% filter(notIn(carrier, c("UA", "DL", "AA")))
identical(fDat2, fDat2_2)

For more information, see ?rsalad::notIn.
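
As a rough illustration of the equivalence described above, the operator can be emulated in base R by negating %in%. The name %notin_sketch% is hypothetical and is not part of rsalad; the package's own implementation may differ.

```r
# Hypothetical stand-in for %nin%, built by negating base R's %in%.
# This is a sketch under that assumption, not rsalad's actual source.
`%notin_sketch%` <- function(lhs, rhs) !(lhs %in% rhs)

carriers <- c("UA", "B6", "DL", "EV", "AA")
excluded <- c("UA", "DL", "AA")

# TRUE marks the carriers that are NOT in the excluded group
keep <- carriers %notin_sketch% excluded
carriers[keep]
```

The logical vector it returns can be used anywhere a %in% result would be, such as inside filter().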

move functions: move columns to front/back

The move family of functions can be used to rearrange the column order of a data.frame by moving specific columns to be the first (moveFront() and moveFront_()) or last (moveBack() and moveBack_()) columns.
The order in which the columns are passed in as arguments determines the order in which the columns will be in the resulting data.frame, regardless of whether the columns are moved to the front or back.

These functions support non-standard evaluation (see function documentation for more details).

For brevity, we will only keep a few columns in the data.

fDat3 <- fDat2 %>% select(carrier, flight, origin, dest)
head(fDat3)
knitr::kable(head(fDat3))

Now let's rearrange the columns to be in this order: dest, origin, carrier, flight.

fDat4 <- fDat3 %>% moveFront(dest, origin)
head(fDat4)
knitr::kable(head(fDat4))

The same result can be achieved in different ways using other move functions.

fDat4_2 <- fDat3 %>% moveFront_(c("dest", "origin"))
fDat4_3 <- fDat3 %>% moveBack(carrier, flight) %>% moveFront(dest)

all(identical(fDat4, fDat4_2), identical(fDat4, fDat4_3))

For more information, see ?rsalad::move.
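
To make the reordering rule concrete, here is a minimal base R sketch of the same idea. The helpers move_front_sketch() and move_back_sketch() are hypothetical names that take column names as strings; they are not rsalad's implementation and do not support non-standard evaluation.

```r
# Hypothetical helpers that reorder columns by name (base R only).
# The listed columns keep the order in which they were passed.
move_front_sketch <- function(df, cols) {
  df[, c(cols, setdiff(names(df), cols)), drop = FALSE]
}
move_back_sketch <- function(df, cols) {
  df[, c(setdiff(names(df), cols), cols), drop = FALSE]
}

toy <- data.frame(carrier = "B6", flight = 725L, origin = "JFK", dest = "BQN",
                  stringsAsFactors = FALSE)

front <- move_front_sketch(toy, c("dest", "origin"))
back_then_front <- move_front_sketch(move_back_sketch(toy, c("carrier", "flight")), "dest")

names(front)
names(back_then_front)
```

Both calls end with the columns in the order dest, origin, carrier, flight, mirroring the two routes shown above.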

dfFactorize(): convert data.frame columns to factors

Sometimes you want to convert all the character columns of a data.frame into factors. In our current data, we have three character variables (dest, origin, carrier), but they all make more sense as factors. Rather than converting each column manually, we can use the dfFactorize() function.

str(fDat4)
fDat5 <- fDat4 %>% dfFactorize()
str(fDat5)

As you can see, calling dfFactorize() with no additional arguments converted all potential factor columns into factors. Note that the integer column was unaffected.

By default, all character columns are coerced to factors, but we can also specify which columns to convert or which columns to leave unaffected.

str(fDat4 %>% dfFactorize(only = "origin"))
str(fDat4 %>% dfFactorize(ignore = c("origin", "dest")))

For more information, see ?rsalad::dfFactorize.
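
For comparison, the manual conversion that dfFactorize() saves you from might look like the following base R sketch. The helper factorize_sketch() and its only/ignore arguments are hypothetical and merely mimic the behaviour described above; the real function may treat additional column types differently.

```r
# Hypothetical helper: convert character columns to factors (base R only).
# `only` restricts the conversion to the named columns, `ignore` excludes columns.
factorize_sketch <- function(df, only = names(df), ignore = character(0)) {
  is_chr <- vapply(df, is.character, logical(1))
  target <- setdiff(intersect(names(df)[is_chr], only), ignore)
  for (col in target) {
    df[[col]] <- factor(df[[col]])
  }
  df
}

toy <- data.frame(dest = c("BQN", "ORD"), origin = c("JFK", "LGA"),
                  flight = c(725L, 1L), stringsAsFactors = FALSE)

all_cols <- sapply(factorize_sketch(toy), class)
only_origin <- sapply(factorize_sketch(toy, only = "origin"), class)
all_cols
only_origin
```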

dfCount(): count number of rows per group

Our goal is to see which destinations were the most common, so the next step is to count how many observations we have for each destination. This can be achieved using the base R function table():

head(table(fDat5$dest))

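However, I perform this task so often that I was not happy with the result table() gives.
Specifically, table() returns a table object rather than a data.frame, and it orders the counts by level rather than by frequency.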

The dfCount() function provides an alternative way to count the data in a data.frame column in an efficient way, sorts the results, and returns a data.frame.
Let's use dfCount to count the number of flights for each destination.

countDat <- fDat5 %>% dfCount("dest")
head(countDat)
knitr::kable(head(countDat))

Now our count data is in a nice data.frame format that can play nicely with other data.frames, and can be easily merged/joined into the original dataset if we wanted to.

Since we only want to see the 50 most common destinations, and the count data is sorted in descending order, we can now easily retain only the 50 destinations that appeared the most.

countDat2 <- slice(countDat, 1:50)

For a performance analysis of dfCount vs base::table, see the dfCount performance vignette.

For more information, see ?rsalad::dfCount.
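
To show what "count, sort, return a data.frame" means in practice, here is a small base R sketch built on table(). The helper count_sketch() and its output column name total are assumptions made for illustration; this is not how dfCount() is implemented, and it makes no attempt to match its performance.

```r
# Hypothetical helper: count rows per value of one column and
# return a data.frame sorted by descending frequency (base R only).
count_sketch <- function(df, col) {
  counts <- as.data.frame(table(df[[col]]), stringsAsFactors = FALSE)
  names(counts) <- c(col, "total")  # "total" is an assumed column name
  counts <- counts[order(counts$total, decreasing = TRUE), , drop = FALSE]
  rownames(counts) <- NULL
  counts
}

toy <- data.frame(dest = c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL"))
res <- count_sketch(toy, "dest")
res
```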

Visual analysis

The ggExtra package has several functions that can be used to plot the resulting data more efficiently. These functions used to be part of this package, but are now in their own dedicated package.

Other functions

tolowerfirst(): convert first character to lower case

rsalad provides another function that can sometimes come in handy. tolowerfirst() converts the first letter of a string (or of each string in a vector) to lower case. This can be useful, for example, when the columns of a data.frame do not follow a consistent capitalization and you would like to lower-case all first letters.

df <- data.frame(StudentName = character(0), ExamGrade = numeric(0))
(colnames(df) <- tolowerfirst(colnames(df)))

For more information, see ?rsalad::tolowerfirst.
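
The same transformation can be sketched in base R with substr() and tolower(). The helper lower_first_sketch() is a hypothetical name used only for illustration and is not rsalad's source.

```r
# Hypothetical helper: lower-case only the first character of each string.
lower_first_sketch <- function(x) {
  substr(x, 1, 1) <- tolower(substr(x, 1, 1))
  x
}

lowered <- lower_first_sketch(c("StudentName", "ExamGrade"))
lowered
```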

setdiffsym(): symmetric set difference

When you want to know the difference between two sets, the base R function setdiff() may not do exactly what you want because it is asymmetric: the result depends on the order of the two vectors passed in, which is often not the desired behaviour. setdiffsym() implements the symmetric set difference, which returns the elements found in exactly one of the two sets regardless of argument order.

setdiff(1:5, 2:4)
setdiff(2:4, 1:5)
setdiffsym(1:5, 2:4)
setdiffsym(2:4, 1:5)

For more information, see ?rsalad::setdiffsym.
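
The symmetric difference can be written in base R as the union of the two one-sided differences. The helper setdiffsym_sketch() below is a hypothetical name for this sketch and is not necessarily how rsalad implements it.

```r
# Hypothetical helper: elements that appear in exactly one of the two sets.
setdiffsym_sketch <- function(x, y) {
  union(setdiff(x, y), setdiff(y, x))
}

a_then_b <- setdiffsym_sketch(1:5, 2:4)
b_then_a <- setdiffsym_sketch(2:4, 1:5)
a_then_b
b_then_a
```

Unlike setdiff(), both argument orders return the same elements.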

%btwn% operator and between()

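Determine whether a numeric value falls within the specified range. By default, the range is inclusive of the endpoints. Because dplyr also exports a function named between(), the examples below call rsalad::between() explicitly.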

5 %btwn% c(1, 10)
c(5, 20) %btwn% c(5, 10)
rsalad::between(5, c(5, 10))
rsalad::between(5, c(5, 10), inclusive = FALSE)

For more information, see ?rsalad::between.
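
The inclusive and exclusive variants boil down to a pair of comparisons, as in this base R sketch. The helper between_sketch() is hypothetical; its range and inclusive arguments only mirror the calls shown above and are not taken from rsalad's source.

```r
# Hypothetical helper: test whether each value of x lies in c(lower, upper).
between_sketch <- function(x, range, inclusive = TRUE) {
  if (inclusive) {
    x >= range[1] & x <= range[2]
  } else {
    x > range[1] & x < range[2]
  }
}

inside <- between_sketch(5, c(1, 10))
vectorised <- between_sketch(c(5, 20), c(5, 10))
on_edge_exclusive <- between_sketch(5, c(5, 10), inclusive = FALSE)
inside
vectorised
on_edge_exclusive
```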



daattali/rsalad documentation built on Oct. 28, 2019, 12:16 p.m.