
Overview

rsalad, like any other salad, is a mixture of different healthy vegetables that you should be having frequently and that can make your life much better. Except that instead of vegetables, rsalad provides you with R functions.

This package was born as a result of me constantly breaking the DRY principle by copy-and-pasting functions from old projects into new ones. Hence, the functions in rsalad do not share a single common topic; they are all either tools for manipulating data.frames or general productivity utilities.

Analysis

This vignette will introduce all the families of functions available in rsalad, but will not dive too deeply into any one specific function. To demonstrate all the functionality, we will use the nycflights13::flights dataset (information about ~335k flights departing from NYC) to visualize the 50 most common destinations of flights out of NYC. While the analysis is not particularly exciting, it will show how to use rsalad proficiently.

Load packages

Before beginning any analysis using rsalad, the first step is to load the package. We'll also load dplyr and ggplot2 to make the analysis more complete.

library(rsalad)
library(dplyr)
library(ggplot2)

Load data

The first step is to load the flights dataset and have a peek at how it looks.

fDat <- nycflights13::flights
head(fDat)
#> year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
#> 2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227     1400    5     17
#> 2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227     1416    5     33
#> 2013     1   1      542         2      923        33      AA  N619AA   1141    JFK  MIA      160     1089    5     42
#> 2013     1   1      544        -1     1004       -18      B6  N804JB    725    JFK  BQN      183     1576    5     44
#> 2013     1   1      554        -6      812       -25      DL  N668DN    461    LGA  ATL      116      762    5     54
#> 2013     1   1      554        -4      740        12      UA  N39463   1696    EWR  ORD      150      719    5     54

%nin% operator and notIn()

Let's say that for some reason we aren't interested in flights operated by United Airlines (UA), Delta Airlines (DL) and American Airlines (AA). To keep only the carriers that are not part of that group, we can use the %nin% operator, which is also aliased to notIn().

fDat2 <- fDat %>% filter(carrier %nin% c("UA", "DL", "AA"))
# pull() extracts the column as a plain vector regardless of dplyr version
allCarriers <- fDat %>% pull(carrier) %>% unique()
myCarriers <- fDat2 %>% pull(carrier) %>% unique()

paste0("All carriers: ", paste(allCarriers, collapse = ", "))
paste0("My carriers: ", paste(myCarriers, collapse = ", "))

#> [1] "All carriers: UA, AA, B6, DL, EV, MQ, US, WN, VX, FL, AS, 9E, F9, HA, YV, OO"
#> [1] "My carriers: B6, EV, MQ, US, WN, VX, FL, AS, 9E, F9, HA, YV, OO"

The %nin% operator is simply the negation of %in%, but can be a handy shortcut. lhs %nin% rhs is equivalent to notIn(lhs, rhs). The following code would have the same result as above:

fDat2_2 <- fDat %>% filter(notIn(carrier, c("UA", "DL", "AA")))
identical(fDat2, fDat2_2)

#> [1] TRUE
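
To make the relationship to %in% concrete, here is a minimal sketch of how such an operator could be written in base R. The name %nin_sketch% is hypothetical and chosen to avoid clashing with the package; this is not necessarily how rsalad implements %nin%.

```r
# Hypothetical stand-in for %nin%: simply negate the result of %in%
`%nin_sketch%` <- function(lhs, rhs) {
  !(lhs %in% rhs)
}

carriers <- c("UA", "B6", "DL", "EV")
carriers %nin_sketch% c("UA", "DL", "AA")
#> [1] FALSE  TRUE FALSE  TRUE
```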

For more information, see ?rsalad::notIn.

move functions: move columns to front/back

The move family of functions can be used to rearrange the column order of a data.frame by moving specific columns to be the first (moveFront() and moveFront_()) or last (moveBack() and moveBack_()) columns. The order in which the columns are passed in as arguments determines the order in which the columns will be in the resulting data.frame, regardless of whether the columns are moved to the front or back.

These functions support non-standard evaluation (see function documentation for more details).

For brevity, we will only keep a few columns in the data.

fDat3 <- fDat2 %>% select(carrier, flight, origin, dest)
head(fDat3)
#> carrier flight origin dest
#>      B6    725    JFK  BQN
#>      B6    507    EWR  FLL
#>      EV   5708    LGA  IAD
#>      B6     79    JFK  MCO
#>      B6     49    JFK  PBI
#>      B6     71    JFK  TPA

Now let's rearrange the columns to be in this order: dest, origin, carrier, flight.

fDat4 <- fDat3 %>% moveFront(dest, origin)
head(fDat4)
#> dest origin carrier flight
#>  BQN    JFK      B6    725
#>  FLL    EWR      B6    507
#>  IAD    LGA      EV   5708
#>  MCO    JFK      B6     79
#>  PBI    JFK      B6     49
#>  TPA    JFK      B6     71

The same result can be achieved in different ways using other move functions.

fDat4_2 <- fDat3 %>% moveFront_(c("dest", "origin"))
fDat4_3 <- fDat3 %>% moveBack(carrier, flight) %>% moveFront(dest)

all(identical(fDat4, fDat4_2), identical(fDat4, fDat4_3))

#> [1] TRUE
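
For readers curious about what the reordering amounts to, the idea can be sketched in base R with plain column indexing. The helpers move_front_sketch() and move_back_sketch() are hypothetical names that take column names as strings; they only illustrate the idea and are not rsalad's implementation.

```r
# Hypothetical helpers: reorder columns by name using base R only
move_front_sketch <- function(df, cols) {
  # requested columns first (in the order given), then everything else
  df[, c(cols, setdiff(names(df), cols)), drop = FALSE]
}

move_back_sketch <- function(df, cols) {
  # everything else first, then the requested columns (in the order given)
  df[, c(setdiff(names(df), cols), cols), drop = FALSE]
}

toy <- data.frame(carrier = "B6", flight = 725L, origin = "JFK", dest = "BQN",
                  stringsAsFactors = FALSE)

names(move_front_sketch(toy, c("dest", "origin")))   # dest, origin, carrier, flight
names(move_back_sketch(toy, c("carrier", "flight"))) # origin, dest, carrier, flight
```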

For more information, see ?rsalad::move.

dfFactorize(): convert data.frame columns to factors

Sometimes you want to convert all the character columns of a data.frame into factors. In our current data, we have three character variables (dest, origin, carrier), but they all make more sense as factors. Rather than converting each column manually, we can use the dfFactorize() function.

str(fDat4)

#> Classes 'tbl_df', 'tbl' and 'data.frame':    197272 obs. of  4 variables:
#>  $ dest   : chr  "BQN" "FLL" "IAD" "MCO" ...
#>  $ origin : chr  "JFK" "EWR" "LGA" "JFK" ...
#>  $ carrier: chr  "B6" "B6" "EV" "B6" ...
#>  $ flight : int  725 507 5708 79 49 71 1806 371 4650 343 ...

fDat5 <- fDat4 %>% dfFactorize()
str(fDat5)

#> Classes 'tbl_df', 'tbl' and 'data.frame':    197272 obs. of  4 variables:
#>  $ dest   : Factor w/ 94 levels "ABQ","ACK","ALB",..: 12 32 38 48 63 90 11 32 4 63 ...
#>  $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 2 1 3 2 2 2 2 3 3 1 ...
#>  $ carrier: Factor w/ 13 levels "9E","AS","B6",..: 3 3 4 3 3 3 3 3 8 3 ...
#>  $ flight : int  725 507 5708 79 49 71 1806 371 4650 343 ...

As you can see, calling dfFactorize() with no additional arguments converted all potential factor columns into factors. Note that the integer column was unaffected.

By default, all character columns are coerced to factors, but we can also specify which columns to convert or which columns to leave unaffected.

str(fDat4 %>% dfFactorize(only = "origin"))

#> Classes 'tbl_df', 'tbl' and 'data.frame':    197272 obs. of  4 variables:
#>  $ dest   : chr  "BQN" "FLL" "IAD" "MCO" ...
#>  $ origin : Factor w/ 3 levels "EWR","JFK","LGA": 2 1 3 2 2 2 2 3 3 1 ...
#>  $ carrier: chr  "B6" "B6" "EV" "B6" ...
#>  $ flight : int  725 507 5708 79 49 71 1806 371 4650 343 ...

str(fDat4 %>% dfFactorize(ignore = c("origin", "dest")))

#> Classes 'tbl_df', 'tbl' and 'data.frame':    197272 obs. of  4 variables:
#>  $ dest   : chr  "BQN" "FLL" "IAD" "MCO" ...
#>  $ origin : chr  "JFK" "EWR" "LGA" "JFK" ...
#>  $ carrier: Factor w/ 13 levels "9E","AS","B6",..: 3 3 4 3 3 3 3 3 8 3 ...
#>  $ flight : int  725 507 5708 79 49 71 1806 371 4650 343 ...
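
The behaviour shown above can be approximated in base R. The sketch below uses a hypothetical helper, factorize_sketch(), and assumes that only restricts conversion to the named character columns while ignore excludes the named columns; the exact semantics of these arguments in dfFactorize() may differ.

```r
# Hypothetical helper: convert character columns of a data.frame to factors.
# The meaning given to `only` and `ignore` here is an assumption.
factorize_sketch <- function(df, only = NULL, ignore = NULL) {
  # start from every character column
  cols <- names(df)[vapply(df, is.character, logical(1))]
  if (!is.null(only)) cols <- intersect(cols, only)
  if (!is.null(ignore)) cols <- setdiff(cols, ignore)
  for (col in cols) {
    df[[col]] <- factor(df[[col]])
  }
  df
}

toy <- data.frame(dest = c("BQN", "FLL"), origin = c("JFK", "EWR"),
                  flight = c(725L, 507L), stringsAsFactors = FALSE)

sapply(factorize_sketch(toy), class)                  # dest and origin become factors
sapply(factorize_sketch(toy, only = "origin"), class) # only origin becomes a factor
sapply(factorize_sketch(toy, ignore = "origin"), class) # only dest becomes a factor
```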

For more information, see ?rsalad::dfFactorize.

dfCount(): count number of rows per group

Our goal is to see which destinations were the most common, so the next step is to count how many observations we have for each destination. This can be achieved using the base R function table():

head(table(fDat5$dest))

#> 
#>  ABQ  ACK  ALB  ATL  AUS  AVL 
#>  254  265  439 6541 1047  275

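However, this is such a common task for me that I was not happy with the result table() gives: the counts are ordered by level name rather than by frequency, and they come back as a table object rather than a data.frame.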

The dfCount() function is an efficient alternative: it counts the values in a data.frame column, sorts the results, and returns a data.frame. Let's use dfCount() to count the number of flights for each destination.

countDat <- fDat5 %>% dfCount("dest")
head(countDat)
#> dest total
#>  CLT 14062
#>  BOS  9739
#>  DCA  9701
#>  RDU  8162
#>  FLL  6563
#>  ATL  6541

Now our count data is in a nice data.frame format that can play nicely with other data.frames, and can be easily merged/joined into the original dataset if we wanted to.
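
As a rough illustration of the output shape, a sorted count data.frame can be built from table() in base R. The helper count_sketch() is a hypothetical name; it makes no attempt at the efficiency of dfCount() and is not how rsalad does it.

```r
# Hypothetical helper: count values in one column and return a sorted data.frame
count_sketch <- function(df, col) {
  counts <- table(df[[col]])
  out <- data.frame(value = names(counts),
                    total = as.integer(counts),
                    stringsAsFactors = FALSE)
  names(out)[1] <- col
  # most frequent values first
  out <- out[order(-out$total), , drop = FALSE]
  rownames(out) <- NULL
  out
}

toy <- data.frame(dest = c("ATL", "BOS", "BOS", "CLT", "CLT", "CLT"),
                  stringsAsFactors = FALSE)
count_sketch(toy, "dest") # rows: CLT 3, BOS 2, ATL 1
```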

Since we only want to see the 50 most common destinations, and the count data is sorted in descending order, we can now easily retain only the 50 destinations that appeared the most.

countDat2 <- slice(countDat, 1:50)

For a performance analysis of dfCount vs base::table, see the dfCount performance vignette.

For more information, see ?rsalad::dfCount.

Visual analysis

The ggExtra package has several functions that can be used to plot the resulting data more efficiently. These functions used to be part of this package, but are now in their own dedicated package.

Other functions

spinMyR(): create markdown/HTML reports from R scripts with no hassle

See the spinMyR vignette for information about this function.

tolowerfirst(): convert first character to lower case

rsalad provides another function that can come in handy. tolowerfirst() converts the first letter of a string (or of each string in a vector) to lower case. This is useful, for example, when the column names of a data.frame do not follow a consistent capitalization and you would like to lower-case all first letters.

df <- data.frame(StudentName = character(0), ExamGrade = numeric(0))
(colnames(df) <- tolowerfirst(colnames(df)))

#> [1] "studentName" "examGrade"
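
The same transformation can be sketched in base R by lower-casing the first character and pasting the remainder back on. The name to_lower_first_sketch() is hypothetical, and this is only one possible way to write it, not necessarily how tolowerfirst() is implemented.

```r
# Hypothetical helper: lower-case the first character of each string.
# NA values are not handled specially in this sketch.
to_lower_first_sketch <- function(x) {
  paste0(tolower(substring(x, 1, 1)), substring(x, 2))
}

to_lower_first_sketch(c("StudentName", "ExamGrade")) # "studentName", "examGrade"
```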

For more information, see ?rsalad::tolowerfirst.

setdiffsym(): symmetric set difference

The base R function setdiff() is asymmetric: its result depends on the order of the two vectors passed in, which is often not the desired behaviour when you want to know how two sets differ. setdiffsym() implements the symmetric set difference, which returns the elements that appear in exactly one of the two sets.

setdiff(1:5, 2:4)

#> [1] 1 5

setdiff(2:4, 1:5)

#> integer(0)

setdiffsym(1:5, 2:4)

#> [1] 1 5

setdiffsym(2:4, 1:5)

#> [1] 1 5
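
A symmetric set difference can be expressed with base R set operations as the union of the two one-sided differences. The helper below, setdiffsym_sketch(), is a hypothetical name used for illustration; the ordering of its result is an artefact of this particular sketch.

```r
# Hypothetical helper: elements that appear in exactly one of x and y
setdiffsym_sketch <- function(x, y) {
  union(setdiff(x, y), setdiff(y, x))
}

setdiffsym_sketch(1:5, 2:4)
#> [1] 1 5
setdiffsym_sketch(2:4, 1:5)
#> [1] 1 5
```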

For more information, see ?rsalad::setdiffsym.

%btwn% operator and between()

The %btwn% operator and the between() function determine whether a numeric value lies within the specified range. By default, the range is inclusive of the endpoints.

5 %btwn% c(1, 10)

#> [1] TRUE

c(5, 20) %btwn% c(5, 10)

#> [1]  TRUE FALSE

rsalad::between(5, c(5, 10))

#> [1] TRUE

rsalad::between(5, c(5, 10), inclusive = FALSE)

#> [1] FALSE
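
The range check itself is a pair of comparisons. The sketch below uses a hypothetical between_sketch() and assumes the range is given as a length-2 vector of the form c(lower, upper), as in the examples above; it is not rsalad's implementation.

```r
# Hypothetical helper: vectorised range check.
# Assumes `range` is c(lower, upper).
between_sketch <- function(x, range, inclusive = TRUE) {
  lower <- range[1]
  upper <- range[2]
  if (inclusive) {
    x >= lower & x <= upper
  } else {
    x > lower & x < upper
  }
}

between_sketch(c(5, 20), c(5, 10))
#> [1]  TRUE FALSE
between_sketch(5, c(5, 10), inclusive = FALSE)
#> [1] FALSE
```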

For more information, see ?rsalad::between.



daattali/rsalad documentation built on Oct. 28, 2019, 12:16 p.m.