In mdelcorvo/rgumbo: A soup of useful R functions for genomics data analysis

knitr::opts_chunk$set(tidy = FALSE, comment = "#>")

Overview

Gumbo, like any other soup, is a mixture of different ingredients. This one in particular is made with a dark roux, vegetables, chicken, sausage, and shrimp, and is served over rice. Except that instead of tasty food and fresh ingredients, rgumbo provides you with R functions.

This package is a result of me constantly breaking the DRY principle by copy-and-pasting functions from old projects into new ones. Hence, the functions in rgumbo do not have a single common topic, but they are all either related to manipulating genomic data or general productivity utilities.

Analysis

This vignette will introduce all the families of functions available in rsalad, but will not dive too deeply into any one specific function. To demonstrate all the functionality, we will use the nycflights13::flights dataset (information about ~335k flights departing from NYC) to visualize the 50 most common destinations of flights out of NYC. While the analysis is not particularly exciting, it will show how to use rsalad proficiently.

Load packages

Before beginning any analysis using rsalad, the first step is to load the package. We'll also load dplyr, a package that every analysis workflow should use.

library(rsalad)
library(dplyr)

Load data

The first step is to load the flights dataset and have a peek at how it looks.

fDat <- nycflights13::flights
head(fDat)

knitr::kable(head(fDat))

`%nin%` operator and `notIn()`

Let's say that for some reason we aren't interested in flights operated by United Airlines (UA), Delta Airlines (DL) and American Airlines (AA). To choose only the carriers that are not part of that group, we can use the %nin% operator, which is also aliased to notIn().

fDat2 <- fDat %>% filter(carrier %nin% c("UA", "DL", "AA"))
allCarriers <- fDat %>% select(carrier) %>% first %>% unique
myCarriers <- fDat2 %>% select(carrier) %>% first %>% unique

paste0("All carriers: ", paste(allCarriers, collapse = ", "))
paste0("My carriers: ", paste(myCarriers, collapse = ", "))

The %nin% operator is simply the negation of %in%, but can be a handy shortcut. lhs %nin% rhs is equivalent to notIn(lhs, rhs). The following code would have the same result as above:

fDat2_2 <- fDat %>% filter(notIn(carrier, c("UA", "DL", "AA")))
identical(fDat2, fDat2_2)

For more information, see ?rsalad::notIn.
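
To make the "negation of %in%" idea concrete, here is a minimal base R sketch. The operator name `%nin_sketch%` is hypothetical and chosen to avoid clashing with the package; the actual rsalad definition may differ.

```r
# Hypothetical stand-in for %nin%: simply negate base R's %in%.
# This is an illustration, not the package's source code.
`%nin_sketch%` <- function(x, table) {
  !(x %in% table)
}

# A small made-up vector of carrier codes
carriers <- c("UA", "B6", "DL", "EV", "AA")

carriers %nin_sketch% c("UA", "DL", "AA")
#> [1] FALSE  TRUE FALSE  TRUE FALSE

carriers[carriers %nin_sketch% c("UA", "DL", "AA")]
#> [1] "B6" "EV"
```

Because the result is an ordinary logical vector, it can be used anywhere %in% can, including inside dplyr::filter().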

`move` functions: move columns to front/back

The move family of functions can be used to rearrange the column order of a data.frame by moving specific columns to be the first (moveFront() and moveFront_()) or last (moveBack() and moveBack_()) columns.
The order in which the columns are passed in as arguments determines the order in which the columns will be in the resulting data.frame, regardless of whether the columns are moved to the front or back.

These functions support non-standard evaluation (see function documentation for more details).

For brevity, we will only keep a few columns in the data.

fDat3 <- fDat2 %>% select(carrier, flight, origin, dest)
head(fDat3)

knitr::kable(head(fDat3))

Now let's rearrange the columns to be in this order: dest, origin, carrier, flight.

fDat4 <- fDat3 %>% moveFront(dest, origin)
head(fDat4)

knitr::kable(head(fDat4))

The same result can be achieved in different ways using other move functions.

fDat4_2 <- fDat3 %>% moveFront_(c("dest", "origin"))
fDat4_3 <- fDat3 %>% moveBack(carrier, flight) %>% moveFront(dest)

all(identical(fDat4, fDat4_2), identical(fDat4, fDat4_3))

For more information, see ?rsalad::move.
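
The reordering itself can be expressed in base R by indexing with a rearranged vector of column names. The sketch below uses hypothetical helper names (`move_front_sketch`, `move_back_sketch`) that take column names as strings; it only illustrates the idea and is not how the package necessarily implements it.

```r
# Hypothetical helpers: put the named columns first (or last) and keep
# the remaining columns in their original relative order.
move_front_sketch <- function(df, cols) {
  df[, c(cols, setdiff(names(df), cols)), drop = FALSE]
}
move_back_sketch <- function(df, cols) {
  df[, c(setdiff(names(df), cols), cols), drop = FALSE]
}

# One made-up row with the same four columns used above
toy <- data.frame(carrier = "B6", flight = 725L, origin = "JFK", dest = "BQN",
                  stringsAsFactors = FALSE)

front <- move_front_sketch(toy, c("dest", "origin"))
identical(names(front), c("dest", "origin", "carrier", "flight"))
#> [1] TRUE

# Moving carrier and flight to the back, then dest to the front,
# ends up with the same column order.
combo <- move_front_sketch(move_back_sketch(toy, c("carrier", "flight")), "dest")
identical(front, combo)
#> [1] TRUE
```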

`dfFactorize()`: convert data.frame columns to factors

Sometimes you want to convert all the character columns of a data.frame into factors. In our current data, we have three character variables (dest, origin, carrier), but they all make more sense as factors. Rather than converting each column manually, we can use the dfFactorize() function.

str(fDat4)
fDat5 <- fDat4 %>% dfFactorize()
str(fDat5)

As you can see, calling dfFactorize() with no additional arguments converted all potential factor columns into factors. Note that the integer column was unaffected.

By default, all character columns are coerced to factors, but we can also specify which columns to convert or which columns to leave unaffected.

str(fDat4 %>% dfFactorize(only = "origin"))
str(fDat4 %>% dfFactorize(ignore = c("origin", "dest")))

For more information, see ?rsalad::dfFactorize.
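
For readers without the package installed, the same behaviour can be approximated in base R. The function name `df_factorize_sketch` is hypothetical, and the exact semantics of `only` and `ignore` here (restrict to, or exclude, the named character columns) are an assumption based on the description above.

```r
# Hypothetical approximation: convert character columns to factors,
# optionally restricted to `only` and skipping anything in `ignore`.
df_factorize_sketch <- function(df, only = names(df), ignore = character(0)) {
  is_chr <- vapply(df, is.character, logical(1))
  targets <- setdiff(intersect(names(df)[is_chr], only), ignore)
  for (col in targets) {
    df[[col]] <- factor(df[[col]])
  }
  df
}

# Made-up data with two character columns and one integer column
toy <- data.frame(dest = c("BQN", "ORD"), origin = c("JFK", "LGA"),
                  flight = c(725L, 301L), stringsAsFactors = FALSE)

all_cols <- df_factorize_sketch(toy)
is.factor(all_cols$dest)     # character column was converted
#> [1] TRUE
is.integer(all_cols$flight)  # integer column was left alone
#> [1] TRUE

only_origin <- df_factorize_sketch(toy, only = "origin")
is.character(only_origin$dest)  # not listed in `only`, so unchanged
#> [1] TRUE
```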

`dfCount()`: count number of rows per group

Our goal is to see which destinations were the most common, so the next step is to count how many observations we have for each destination. This can be achieved using the base R function table():

head(table(fDat5$dest))

However, this is such a common task for me that I was not happy with the result table() gives. Specifically:

table() returns a table object rather than the much more useful data.frame.
table() does not sort the resulting counts.
table() performs very slowly on large datasets, especially if the data is numeric (see the dfCount performance vignette).

The dfCount() function provides an efficient alternative: it counts the values in a data.frame column, sorts the results, and returns a data.frame.
Let's use dfCount() to count the number of flights for each destination.

countDat <- fDat5 %>% dfCount("dest")
head(countDat)

knitr::kable(head(countDat))

Now our count data is in a nice data.frame format that can play nicely with other data.frames, and can be easily merged/joined into the original dataset if we wanted to.

Since we only want to see the 50 most common destinations, and the count data is sorted in descending order, we can now easily retain only the 50 destinations that appeared the most.

countDat2 <- slice(countDat, 1:50)

For a performance analysis of dfCount vs base::table, see the dfCount performance vignette.

For more information, see ?rsalad::dfCount.
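
The shape of the result (a data.frame sorted by descending count) can be reproduced in base R as follows. The name `df_count_sketch` and the count column name `total` are assumptions made for illustration. The sketch is built on table(), so it mimics only the output format, not the performance benefit described above.

```r
# Hypothetical approximation of the output format only:
# one row per distinct value, sorted from most to least frequent.
df_count_sketch <- function(df, col) {
  counts <- table(df[[col]])
  out <- data.frame(names(counts), as.integer(counts), stringsAsFactors = FALSE)
  names(out) <- c(col, "total")  # "total" is an assumed column name
  out <- out[order(out$total, decreasing = TRUE), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Made-up destinations: ORD appears 3 times, ATL twice, LAX once
toy <- data.frame(dest = c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL"),
                  stringsAsFactors = FALSE)

toy_counts <- df_count_sketch(toy, "dest")
toy_counts$dest
#> [1] "ORD" "ATL" "LAX"
toy_counts$total
#> [1] 3 2 1
```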

Visual analysis

The ggExtra package has several functions that can be used to plot the resulting data more efficiently. These functions used to be part of this package, but are now in their own dedicated package.

Other functions

`tolowerfirst()`: convert first character to lower case

rsalad provides another function that can sometimes come in handy. tolowerfirst() can be used to convert the first letter of a string (or a vector of strings) into lower case. This can be useful, for example, when columns of a data.frame do not follow a consistent capitalization and you would like to lower-case all first letters.

df <- data.frame(StudentName = character(0), ExamGrade = numeric(0))
(colnames(df) <- tolowerfirst(colnames(df)))

For more information, see ?rsalad::tolowerfirst.
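
The transformation is simple enough to sketch in base R: lower-case the first character and paste the rest of the string back on. The name `to_lower_first_sketch` is hypothetical, and the package's own function may treat edge cases such as NA differently.

```r
# Hypothetical approximation: lower-case only the first character.
# Note that paste0() would turn NA into the string "NA" here.
to_lower_first_sketch <- function(x) {
  paste0(tolower(substr(x, 1, 1)), substring(x, 2))
}

fixed <- to_lower_first_sketch(c("StudentName", "ExamGrade"))
identical(fixed, c("studentName", "examGrade"))
#> [1] TRUE
```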

`setdiffsym()`: symmetric set difference

When you want the difference between two sets, the base R function setdiff() may not do what you expect, because it is asymmetric: the result depends on the order of the two vectors passed in, which is often not the desired behaviour. setdiffsym() implements the symmetric set difference (the elements that appear in exactly one of the two sets), which is often more intuitive.

setdiff(1:5, 2:4)
setdiff(2:4, 1:5)
setdiffsym(1:5, 2:4)
setdiffsym(2:4, 1:5)

For more information, see ?rsalad::setdiffsym.
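
A symmetric difference can be built from the base R set functions by taking the union of the two one-sided differences. The sketch below uses the hypothetical name `set_diff_sym_sketch` and is only one possible way to write it.

```r
# Hypothetical approximation: elements in x but not y, plus
# elements in y but not x.
set_diff_sym_sketch <- function(x, y) {
  union(setdiff(x, y), setdiff(y, x))
}

# Base setdiff() depends on argument order
setdiff(1:5, 2:4)
#> [1] 1 5
setdiff(2:4, 1:5)
#> integer(0)

# The symmetric version contains the same elements either way
set_diff_sym_sketch(1:5, 2:4)
#> [1] 1 5
set_diff_sym_sketch(2:4, 1:5)
#> [1] 1 5
```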

`%btwn%` operator and `between()`

Determine whether a numeric value lies within the specified range. By default, the range is inclusive of the endpoints.

5 %btwn% c(1, 10)
c(5, 20) %btwn% c(5, 10)
rsalad::between(5, c(5, 10))
rsalad::between(5, c(5, 10), inclusive = FALSE)

For more information, see ?rsalad::between.
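
The range check amounts to two vectorised comparisons. The sketch below uses the hypothetical name `between_sketch`, with an `inclusive` argument mirroring the one shown above; it is an illustration and not necessarily how the package implements it.

```r
# Hypothetical approximation: `range` is a length-2 numeric vector
# giving the lower and upper bounds.
between_sketch <- function(x, range, inclusive = TRUE) {
  stopifnot(is.numeric(x), is.numeric(range), length(range) == 2)
  if (inclusive) {
    x >= range[1] & x <= range[2]
  } else {
    x > range[1] & x < range[2]
  }
}

between_sketch(5, c(1, 10))
#> [1] TRUE
between_sketch(c(5, 20), c(5, 10))
#> [1]  TRUE FALSE
between_sketch(5, c(5, 10), inclusive = FALSE)
#> [1] FALSE
```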

mdelcorvo/rgumbo documentation built on Jan. 3, 2025, 2:12 p.m.
