knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Use tidyfst just like dplyr

This part of the vignette follows dplyr's introductory vignette at https://dplyr.tidyverse.org/articles/dplyr.html and tries to reproduce its results. First, load the required packages.

library(tidyfst)
library(nycflights13)
library(data.table)

data.table(flights)

Filter rows with filter_dt()

filter_dt(flights, month == 1 & day == 1)

Note that conditions cannot be separated by commas; combine them with & instead. For example, filter_dt(flights, month == 1, day == 1) returns an error.
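For comparison, plain data.table also joins row conditions with &, because a comma inside [ ] separates the row argument from the column argument. Below is a minimal sketch on a small made-up table (the object toy and its values are hypothetical, not taken from flights):

```r
library(data.table)

# Toy stand-in for flights; the values are made up for illustration.
toy <- data.table(
  month     = c(1L, 1L, 2L),
  day       = c(1L, 2L, 1L),
  dep_delay = c(2, 4, -3)
)

# Both conditions go into the row argument, joined with `&`.
jan_first <- toy[month == 1 & day == 1]
jan_first
```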

Arrange rows with arrange_dt()

arrange_dt(flights, year, month, day)

Use - (minus symbol) to order a column in descending order:

arrange_dt(flights, -arr_delay)

Select columns with select_dt()

select_dt(flights, year, month, day)

select_dt(flights, year:day) and select_dt(flights, -(year:day)) are not supported. However, I have added support for selecting columns with a regular expression, so you can write:

select_dt(flights, "^dep")

Renaming works almost the same way as in dplyr:

select_dt(flights, tail_num = tailnum)
rename_dt(flights, tail_num = tailnum)

Add new columns with mutate_dt()

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

However, a column created inside a mutate_dt() call cannot be used again within that same call, so split such steps into separate calls. The following code does not work:

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

Instead, use:

mutate_dt(flights,gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))
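The same two-step pattern can be sketched in plain data.table, where a column created with := in one call is likewise not visible to the other expressions of that call. The table toy and its values are made up for illustration, and it is only an assumption that mutate_dt() inherits this restriction from data.table:

```r
library(data.table)

# Toy stand-in for flights; the values are made up for illustration.
toy <- data.table(
  arr_delay = c(10, -5),
  dep_delay = c(4, 1),
  air_time  = c(120, 60)
)

# Chain two `[` calls so that `gain` exists before it is used.
toy[, gain := arr_delay - dep_delay][, gain_per_hour := gain / (air_time / 60)]
toy
```

Chaining keeps each step simple and makes the dependency between the two columns explicit.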

If you only want to keep the new variables, use transmute_dt():

transmute_dt(flights,
  gain = arr_delay - dep_delay
)

Summarise values with summarise_dt()

summarise_dt(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)

Randomly sample rows with sample_n_dt() and sample_frac_dt()

sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)

Grouped operations

Consider the following dplyr code:

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

The tidyfst equivalent is:

flights %>% 
  summarise_dt(
    count = .N,
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE),
    by = tailnum
  ) %>% 
  filter_dt(count > 20 & dist < 2000)
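For comparison, here is a plain data.table sketch of the full dplyr pipeline above, including its final filter step. It runs on a small made-up table (toy, res and the values are hypothetical), and the count threshold is scaled down from 20 to 2 to suit the toy data:

```r
library(data.table)

# Toy stand-in for flights; the values are made up for illustration.
toy <- data.table(
  tailnum   = c("A1", "A1", "A1", "B2", "B2"),
  distance  = c(500, 700, 600, 2500, 2700),
  arr_delay = c(5, NA, 15, 0, 20)
)

# Summarise per tailnum, then filter the summary in a second `[` call.
res <- toy[, .(
  count = .N,
  dist  = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
), by = tailnum][count > 2 & dist < 2000]
res
```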

summarise_dt() (or summarize_dt()) takes a "by" parameter that specifies the grouping variables. For example, we can find the number of planes and the number of flights that go to each possible destination:

# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

summarise_dt(flights,planes = uniqueN(tailnum),flights = .N,by = dest) %>% 
  arrange_dt(dest)

To group by several variables, wrap them in .():

# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N)

# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights))

# (per_year  <- summarise(per_month, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights)) %>% 
  summarise_dt(by = .(year),flights = sum(flights))

Comparison with data.table syntax

tidyfst provides a tidy syntax for data.table. By design, tidyfst never runs faster than the equivalent data.table code. Nevertheless, it lets dplyr users gain most of that performance right away and guides them toward learning data.table itself for more speed. Below, we compare the syntax of tidyfst and data.table, following the "Introduction to data.table" vignette. This shows how the two differ and lets users choose whichever they prefer. Ideally, tidyfst will lead more users to learn about data.table and its wonderful features, and to design more extensions for tidyfst in the future.

Data

Because we want a more stable data source, we use the flights data from the nycflights13 package loaded above.

library(tidyfst)
library(data.table)
library(nycflights13)

flights = data.table(flights) %>% na.omit()

Subset rows

# data.table
head(flights[origin == "JFK" & month == 6L])
flights[1:2]
flights[order(origin, -dest)] 

# tidyfst
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  head()
flights %>% slice_dt(1:2)
flights %>% arrange_dt(origin,-dest)

Select column(s)

# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)

Mixed computation

# data.table
flights[, sum( (arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]

# tidyfst
flights %>% summarise_dt(sum( (arr_delay + dep_delay) < 0))
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  nrow()
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  count_dt()
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  summarise_dt(.N)

The examples above show that data.table symbols such as .N can still be used in tidyfst.

Refer to columns by names

# data.table
flights[, c("arr_delay", "dep_delay")]

select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]

flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]

# returns year,month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))

select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)

flights %>% select_dt(-arr_delay,-dep_delay)

flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))

Aggregations

# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin,dest)
flights %>% filter_dt(carrier == "AA") %>% 
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))

Note that keyby is not currently used in tidyfst. This feature might be included in the future for better performance in order-independent tasks. Moreover, count_dt() automatically sorts its result by the count; this can be controlled with the "sort" parameter.

# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  count_dt(origin,dest,sort = FALSE) %>% 
  arrange_dt(origin,-dest)
flights %>% 
  summarise_dt(.N,by = .(dep_delay>0, arr_delay>0))
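As background on the keyby remark above, this sketch shows how by and keyby differ in plain data.table. The table toy is made up for illustration:

```r
library(data.table)

# Toy stand-in for flights; the values are made up for illustration.
toy <- data.table(origin = c("LGA", "EWR", "LGA", "JFK"))

# `by` returns groups in order of first appearance.
by_first_seen <- toy[, .N, by = origin]

# `keyby` returns groups sorted by the grouping column and sets it as the key.
by_sorted <- toy[, .N, keyby = origin]

by_first_seen
by_sorted
```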

Now let's try a more complex example:

# data.table
flights[carrier == "AA", 
        lapply(.SD, mean), 
        by = .(origin, dest, month), 
        .SDcols = c("arr_delay", "dep_delay")] 

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean)
           )

Let me explain what happens here, especially inside group_dt(). First, the rows are filtered by the condition carrier == "AA". Next, the data are grouped by three variables: origin, dest and month. Finally, the columns with "_delay" in their names are summarised by taking the mean of each. This design builds on .SD in data.table and extends group_by() from dplyr: there is never a need to ungroup, because the grouped operations are simply placed inside group_dt(). You can also pipe inside group_dt(). Let's take it a little further:

flights %>% 
  filter_dt(carrier == "AA") %>% 
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean) %>% 
      mutate_dt(sum = dep_delay + arr_delay)
           )

However, I don't recommend piping inside group_dt() unless the step actually needs to run per group; otherwise, start another pipe after group_dt(). Now let's end with some easy examples:

# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>% 
  group_dt(by = month,head(2))

Under the hood, tidyfst builds on dplyr and data.table and uses stringr to provide flexible APIs, so as to bring out the strengths of each.



hope-data-science/tidyfst documentation built on Sept. 23, 2024, 8:05 p.m.