knitr::opts_chunk$set(tidy = FALSE, comment = "#>")
gumbo
, like any other soup, is a mixture of different ingredients.
This one in particular is made with a dark roux, vegetables, chicken,
sausage, shrimp and served over rice. Except that instead of tasty food
and fresh ingredients, rgumbo
provides you with R functions.
This package is a result of me constantly breaking the DRY principle by
copy-and-pasting functions from old projects into new ones.
Hence, the functions in rgumbo
do not have a single common topic,
but they are all either related to manipulating genomic data or general
productivity utilities.
This vignette will introduce all the families of functions available in
rsalad
, but will not dive too deeply into any one specific function. To
demonstrate all the functionality, we will use the nycflights13::flights
dataset (information about ~335k flights departing from NYC) to visualize the
50 most common destinations of flights out of NYC. While the analysis is not
particularly exciting, it will show how to use rsalad
proficiently.
Before beginning any analysis using rsalad
, the first step is to load the
package. We'll also load dplyr
, a package that every analysis workflow should use.
library(rsalad) library(dplyr)
First step is to load the flights
dataset and have a peak at how it looks
fDat <- nycflights13::flights head(fDat)
knitr::kable(head(fDat))
%nin%
operator and notIn()
Let's say that for some reason we aren't interested in flights operated by
United Airlines (UA), Delta Airlines (DL) and American Airlines (AA). To choose
only carrier that are not part of that group, we can use the %nin%
operator, which is also aliased to notIn()
.
fDat2 <- fDat %>% filter(carrier %nin% c("UA", "DL", "AA")) allCarriers <- fDat %>% select(carrier) %>% first %>% unique myCarriers <- fDat2 %>% select(carrier) %>% first %>% unique paste0("All carriers: ", paste(allCarriers, collapse = ", ")) paste0("My carriers: ", paste(myCarriers, collapse = ", "))
The %nin%
operator is simply the negation of %in%
, but can be a handy
shortcut. lhs %nin% rhs
is equivalent to notIn(lhs, rhs)
. The following
code would have the same result as above:
fDat2_2 <- fDat %>% filter(notIn(carrier, c("UA", "DL", "AA"))) identical(fDat2, fDat2_2)
For more information, see ?rsalad::notIn
.
move
functions: move columns to front/backThe move
family of functions can be used to rearrange the column order of a
data.frame by moving specific columns to be the first (moveFront()
and
moveFront_()
) or last (moveBack()
and moveBack_()
) columns.
The order in which the columns are passed in as arguments determines the order
in which the columns will be in the resulting data.frame, regardless of whether
the columns are moved to the front or back.
These functions support non-standard evaulation (see function documentation for more details).
For brevity, we will only keep a few columns in the data.
fDat3 <- fDat2 %>% select(carrier, flight, origin, dest) head(fDat3)
knitr::kable(head(fDat3))
Now let's rearrange the columns to be in this order: dest, origin, carrier, flight.
fDat4 <- fDat3 %>% moveFront(dest, origin) head(fDat4)
knitr::kable(head(fDat4))
The same result can be achieved in different ways using other move
functions.
fDat4_2 <- fDat3 %>% moveFront_(c("dest", "origin")) fDat4_3 <- fDat3 %>% moveBack(carrier, flight) %>% moveFront(dest) all(identical(fDat4, fDat4_2), identical(fDat4, fDat4_3))
For more information, see ?rsalad::move
.
dfFactorize()
: convert data.frame columns to factorsSometimes you want to convert all the character columns of a data.frame
into factors. In our current data, we have three character variables
(dest, origin, carrier), but they all make more sense as factors. Rather
than converting each column manually, we can use the dfFactorize()
function.
str(fDat4) fDat5 <- fDat4 %>% dfFactorize() str(fDat5)
As you can see, calling dfFactorize()
with no additional arguments converted
all potential factor columns into factors. Note that the integer column was
unaffected.
By default, all character columns are coerced to factors, but we can also specify which columns to convert or which columns to leave unaffected.
str(fDat4 %>% dfFactorize(only = "origin")) str(fDat4 %>% dfFactorize(ignore = c("origin", "dest")))
For more information, see ?rsalad::dfFactorize
.
dfCount()
: count number of rows per groupOur goal is to see which destinations were the most common, so the next step
is to count how many observations we have for each destination. This can be
achieved using the base R function table()
:
head(table(fDat5$dest))
However, this is such a common task for me that I was not happy with the result
table()
gives.
Specifically:
table()
returns a table
object rather than the much more uesful
data.frame
. table()
does not sort the resulting counts. table()
performs very slowly on large datasets, especially if the data is
numeric (see Performance section below). The dfCount()
function provides an alternative way to count the data in a
data.frame column in an efficient way, sorts the results, and returns a
data.frame.
Let's use dfCount to count the number of flights for each destination.
countDat <- fDat5 %>% dfCount("dest") head(countDat)
knitr::kable(head(countDat))
Now our count data is in a nice data.frame format that can play nicely with other data.frames, and can be easily merged/joined into the original dataset if we wanted to.
Since we only want to see the 50 most common destinations, and the count data is sorted in descending order, we can now easily retain only the 50 destinations that appeared the most.
countDat2 <- slice(countDat, 1:50)
For a performance analysis of dfCount
vs base::table
, see the
dfCount performance vignette.
For more information, see ?rsalad::dfCount
.
The ggExtra
package has several
functions that can be used to plot the resulting data more efficiently. These
functions used to be part of this package, but are now in their own dedicated
package.
tolowerfirst()
: convert first character to lower casersalad
provides another function that can sometimes become handy.
tolowerfirst()
can be used to convert the first letter of a string (or a
vector of strings) into lower case. This can be useful, for example, when
columns of a data.frame do not follow a consistent capitalization and you would
like to lower-case all first letters.
df <- data.frame(StudentName = character(0), ExamGrade = numeric(0)) (colnames(df) <- tolowerfirst(colnames(df)))
For more information, see ?rsalad::tolowerfirst
.
setdiffsym()
: symmetric set differenceWhen wanting to know the difference between two sets, the base R function
setdiff()
unfortunately does not do exactly what you want because it is
asymmetric. This means that the results depend on the order of the two
vectors passed in, which is often not the desired behaviour. setdiffsym
implements symmetric set difference, whiich is a more intuitive set difference.
setdiff(1:5, 2:4) setdiff(2:4, 1:5) setdiffsym(1:5, 2:4) setdiffsym(2:4, 1:5)
For more information, see ?rsalad::setdiffsym
.
%btwn%
operator and between()
Determine if a numeric value is between the specified range. By default, the range is inclusive of the endpoints.
5 %btwn% c(1, 10) c(5, 20) %btwn% c(5, 10) rsalad::between(5, c(5, 10)) rsalad::between(5, c(5, 10), inclusive = FALSE)
For more information, see ?rsalad::between
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.