Example 1: Basic usage
In tidyfst: Tidy Verbs for Fast Data Manipulation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)

Use tidyfst just like dplyr

This part of the vignette follows dplyr's introductory vignette (https://dplyr.tidyverse.org/articles/dplyr.html) and tries to reproduce all of its results. First, load the needed packages.

library(tidyfst)
library(nycflights13)
library(data.table)

data.table(flights)

Filter rows with `filter_dt()`

filter_dt(flights, month == 1 & day == 1)

Note that commas cannot be used to separate conditions; combine them with & instead. This means filter_dt(flights, month == 1,day == 1) would return an error.
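As a point of comparison, the same single-expression filter can be written directly in data.table, where the condition goes in the first argument of `[`. The sketch below uses a small made-up table (`toy`, not part of nycflights13) and only data.table itself; it is an illustration, not a description of how filter_dt() is implemented.

```r
library(data.table)

# Toy stand-in for flights; the values are invented for illustration
toy <- data.table(
  month  = c(1L, 1L, 2L),
  day    = c(1L, 2L, 1L),
  flight = c(10L, 20L, 30L)
)

# Both conditions are combined into one logical vector with `&`
jan_first <- toy[month == 1 & day == 1]
jan_first
```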

Arrange rows with `arrange_dt()`

arrange_dt(flights, year, month, day)

Use - (minus symbol) to order a column in descending order:

arrange_dt(flights, -arr_delay)

Select columns with `select_dt()`

select_dt(flights, year, month, day)

select_dt(flights, year:day) and select_dt(flights, -(year:day)) are not supported. However, select_dt() can select columns with a regular expression, which means you can write:

select_dt(flights, "^dep")
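For readers who want to see the idea behind pattern-based selection, here is a minimal sketch that does the same thing with base R's grep() and data.table. The table `toy` and its columns are made up, and this is only an assumption about one way to get the result, not the code select_dt() runs.

```r
library(data.table)

# Toy table with two columns that start with "dep" (invented values)
toy <- data.table(dep_time = 1:2, dep_delay = c(3, 4), arr_time = 5:6)

# Find the column names that match the regular expression
dep_cols <- grep("^dep", names(toy), value = TRUE)

# with = FALSE tells data.table to treat dep_cols as a vector of column names
sel <- toy[, dep_cols, with = FALSE]
sel
```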

The rename process is almost the same as that in dplyr:

select_dt(flights, tail_num = tailnum)
rename_dt(flights, tail_num = tailnum)

Add new columns with `mutate_dt()`

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

However, a column that is created in a mutate_dt() call cannot be used in the same call; split the steps instead. The following code would not work:

mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)

Instead, use:

mutate_dt(flights,gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))
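The same two-step pattern appears when using data.table's `:=` directly: the second assignment is chained so that it can see the column created by the first. This is a sketch on a made-up table (`toy`), using only data.table; copy() is used so that the original table is left unchanged.

```r
library(data.table)

# Toy data; the values are invented for illustration
toy <- data.table(
  arr_delay = c(10, -5),
  dep_delay = c(4, 0),
  air_time  = c(120, 60)
)

# Step 1 creates gain; step 2 (a separate, chained call) can then use it
out <- copy(toy)[, gain := arr_delay - dep_delay][
  , gain_per_hour := gain / (air_time / 60)]
out
```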

If you only want to keep the new variables, use transmute_dt():

transmute_dt(flights,
  gain = arr_delay - dep_delay
)

Summarise values with `summarise_dt()`

summarise_dt(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)

Randomly sample rows with `sample_n_dt()` and `sample_frac_dt()`

sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)

Grouped operations

Consider the following dplyr code:

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

The same result can be obtained with:

flights %>% 
  summarise_dt(count = .N,
               dist = mean(distance, na.rm = TRUE),
               delay = mean(arr_delay, na.rm = TRUE),
               by = tailnum) %>% 
  filter_dt(count > 20 & dist < 2000)

summarise_dt() (or summarize_dt()) has a "by" parameter that specifies the grouping variable(s). For example, we can find the number of planes and the number of flights that go to each possible destination:

# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

summarise_dt(flights,planes = uniqueN(tailnum),flights = .N,by = dest) %>% 
  arrange_dt(dest)
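For orientation, a grouped summary followed by a filter on the summarised columns looks like this in plain data.table. The table `toy` and its values are made up, and the thresholds are lowered to fit the tiny example; treat it as a sketch rather than the exact equivalent of the code above.

```r
library(data.table)

# Toy data: three tail numbers with invented distances and delays
toy <- data.table(
  tailnum   = c("A", "A", "A", "B", "B", "C"),
  distance  = c(100, 200, 300, 2500, 2700, 500),
  arr_delay = c(5, NA, 7, 1, 3, 0)
)

# Summarise per tail number, then filter the summary in a chained call
res <- toy[, .(count = .N,
               dist  = mean(distance, na.rm = TRUE),
               delay = mean(arr_delay, na.rm = TRUE)),
           by = tailnum][count > 1 & dist < 2000]
res
```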

If you need to group by many variables, use:

# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N)

# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights))

# (per_year  <- summarise(per_month, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights)) %>% 
  summarise_dt(by = .(year),flights = sum(flights))
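The pipelines above recompute the daily counts each time. If the intermediate tables are needed more than once, they can be stored and rolled up step by step. The sketch below shows that idea in plain data.table on a made-up table (`toy`); the names per_day, per_month and per_year mirror the commented dplyr code.

```r
library(data.table)

# Toy data: four flights over two months (invented)
toy <- data.table(
  year  = 2013L,
  month = c(1L, 1L, 1L, 2L),
  day   = c(1L, 1L, 2L, 1L)
)

# Count per day, then sum the counts at each coarser level
per_day   <- toy[, .(flights = .N), by = .(year, month, day)]
per_month <- per_day[, .(flights = sum(flights)), by = .(year, month)]
per_year  <- per_month[, .(flights = sum(flights)), by = year]
per_year
```

Each roll-up only touches the already aggregated table, so the yearly total still equals the number of rows in the original data.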

Comparison with data.table syntax

tidyfst provides a tidy syntax for data.table. By design, tidyfst never runs faster than the analogous data.table code. Nevertheless, it lets dplyr users gain data.table's computational performance with little effort and guides them to learn more about data.table for speed. Below, we compare the syntax of tidyfst and data.table (following the "Introduction to data.table" vignette). This shows how the two differ and lets users choose the one they prefer. Ideally, tidyfst will lead more users to learn about data.table and its features, and in turn to design more extensions for tidyfst in the future.

Data

Because we want a more stable data source, we use the flights data from the nycflights13 package loaded above.

library(tidyfst)
library(data.table)
library(nycflights13)

flights = data.table(flights) %>% na.omit()

Subset rows

# data.table
head(flights[origin == "JFK" & month == 6L])
flights[1:2]
flights[order(origin, -dest)] 

# tidyfst
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  head()
flights %>% slice_dt(1:2)
flights %>% arrange_dt(origin,-dest)

Select column(s)

# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)

Mixed computation

# data.table
flights[, sum( (arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]

# tidyfst
flights %>% summarise_dt(sum( (arr_delay + dep_delay) < 0))
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  nrow()
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  count_dt()
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  summarise_dt(.N)

The examples above show that data.table's special symbols, such as .N, can still be used in tidyfst.

Refer to columns by names

# data.table
flights[, c("arr_delay", "dep_delay")]

select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]

flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]

# returns year,month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))

select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)

flights %>% select_dt(-arr_delay,-dep_delay)

flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))

Aggregations

# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin,dest)
flights %>% filter_dt(carrier == "AA") %>% 
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))

Note that keyby is currently not used in tidyfst. This feature might be included in the future for better performance in order-independent tasks. Moreover, count_dt() sorts its result by the count by default; this can be controlled with the "sort" parameter.

# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  count_dt(origin,dest,sort = FALSE) %>% 
  arrange_dt(origin,-dest)
flights %>% 
  summarise_dt(.N,by = .(dep_delay>0, arr_delay>0))
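On the keyby remark above: in data.table itself, `by` returns groups in order of first appearance, while `keyby` returns them sorted and sets a key on the result. A minimal sketch on a made-up table (`toy`), using only data.table:

```r
library(data.table)

# Toy data; LGA appears first and twice (invented)
toy <- data.table(origin = c("LGA", "JFK", "LGA", "EWR"))

# by: groups appear in the order they are first seen
by_first_seen <- toy[, .N, by = origin]

# keyby: groups are sorted by origin and the result is keyed
by_sorted <- toy[, .N, keyby = origin]

by_first_seen
by_sorted
```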

Now let's try a more complex example:

# data.table
flights[carrier == "AA", 
        lapply(.SD, mean), 
        by = .(origin, dest, month), 
        .SDcols = c("arr_delay", "dep_delay")] 

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean)
           )

Here is what happens, especially inside group_dt(). First, the rows are filtered by the condition carrier == "AA". Then the data is grouped by three variables: origin, dest and month. Finally, the columns with "_delay" in their names are summarised by their mean. This design uses .SD from data.table and extends the idea of group_by() in dplyr: you never need to ungroup, because the grouped operations are simply placed inside group_dt(). You can also use a pipe inside group_dt(). Let's take it a little further:

flights %>% 
  filter_dt(carrier == "AA") %>% 
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean) %>% 
      mutate_dt(sum = dep_delay + arr_delay)
           )

However, I don't recommend doing this unless the operation actually needs to be computed by group; otherwise, just start another pipe after group_dt(). Now let's end with an easy example:

# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>% 
  group_dt(by = month,head(2))
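Since group_dt() is described above as building on .SD, a short sketch of .SD in plain data.table may help. Within each group, .SD is the subset of the data for that group, optionally restricted to the columns named in .SDcols. The table `toy` is made up, and the column pattern is matched with base R's grep(), which is only one possible way to pick the columns.

```r
library(data.table)

# Toy data with two "_delay" columns (invented values)
toy <- data.table(
  origin    = c("JFK", "JFK", "LGA"),
  arr_delay = c(10, 20, 5),
  dep_delay = c(0, 4, 1),
  distance  = c(100, 200, 300)
)

# Columns whose names contain "_delay"
delay_cols <- grep("_delay", names(toy), value = TRUE)

# Mean of each selected column within each origin
means <- toy[, lapply(.SD, mean), by = origin, .SDcols = delay_cols]

# First row of every group; .SD here holds all non-grouping columns
first_rows <- toy[, head(.SD, 1), by = origin]

means
first_rows
```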

At its core, tidyfst grew out of dplyr and data.table, and it uses stringr to provide flexible APIs, so as to bring out the strengths of both.


tidyfst documentation built on Sept. 16, 2024, 9:06 a.m.
