knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE )
This part of the vignette follows dplyr's vignette at https://dplyr.tidyverse.org/articles/dplyr.html. We'll try to reproduce all of its results. First, load the needed packages.
library(tidyfst)
library(nycflights13)
library(data.table)

data.table(flights)
filter_dt()
filter_dt(flights, month == 1 & day == 1)
Note that commas cannot be used to separate conditions in these expressions, which means filter_dt(flights, month == 1, day == 1) would return an error. Combine conditions with & instead.
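A minimal sketch of combining conditions in a single expression, run on a small made-up table (`toy` and its values are assumptions for illustration, not taken from nycflights13):

```r
library(data.table)
library(tidyfst)

# Hypothetical toy data standing in for flights
toy <- data.table(
  month = c(1L, 1L, 2L),
  day   = c(1L, 2L, 1L),
  id    = c("a", "b", "c")
)

# Both conditions must hold: join them with &
jan_first <- filter_dt(toy, month == 1 & day == 1)

# Either condition may hold: join them with |
jan_or_first <- filter_dt(toy, month == 1 | day == 1)
```

Writing the whole filter as one logical expression keeps it close to what data.table itself expects in the row position of `[`.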
arrange_dt()
arrange_dt(flights, year, month, day)
Use - (the minus sign) to order a column in descending order:
arrange_dt(flights, -arr_delay)
select_dt()
select_dt(flights, year, month, day)
select_dt(flights, year:day) and select_dt(flights, -(year:day)) are not supported. However, select_dt() can select columns by regular expression, which means you can write:
select_dt(flights, "^dep")
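As a small illustration of selecting by pattern, the sketch below applies the same regular expression to a made-up table and compares it with a plain data.table form. The table `toy` is hypothetical, and `patterns()` in `.SDcols` assumes data.table 1.12.0 or later:

```r
library(data.table)
library(tidyfst)

# Hypothetical toy data with two "dep" columns and one other column
toy <- data.table(
  dep_time  = c(517L, 533L),
  dep_delay = c(2, 4),
  arr_time  = c(830L, 850L)
)

# tidyfst: keep the columns whose names start with "dep"
dep_tidyfst <- select_dt(toy, "^dep")

# data.table: the same selection through .SDcols
dep_dt <- toy[, .SD, .SDcols = patterns("^dep")]
```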
Renaming works almost the same as in dplyr:
select_dt(flights, tail_num = tailnum)
rename_dt(flights, tail_num = tailnum)
mutate_dt()
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
However, a column that has just been created cannot be referred to within the same call, so split such steps into separate calls. The following code would not work:
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
Instead, use:
mutate_dt(flights, gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))
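For comparison, the same two-step pattern can be sketched in plain data.table by chaining two `:=` calls. The toy table and its values are assumptions for illustration, and `copy()` is used so the original table is left unmodified:

```r
library(data.table)

# Hypothetical toy data using the same column names as flights
toy <- data.table(
  arr_delay = c(10, -5),
  dep_delay = c(4, 5),
  air_time  = c(120, 60)
)

# Step 1 creates gain; step 2 runs afterwards and can refer to it
res <- copy(toy)[, gain := arr_delay - dep_delay][
  , gain_per_hour := gain / (air_time / 60)]
```

Each bracket is evaluated in turn, so `gain` already exists by the time the second expression needs it.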
If you only want to keep the new variables, use transmute_dt():
transmute_dt(flights, gain = arr_delay - dep_delay )
summarise_dt()
summarise_dt(flights, delay = mean(dep_delay, na.rm = TRUE) )
sample_n_dt() and sample_frac_dt()
sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)
For the dplyr code below:
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
The tidyfst equivalent is:
flights %>%
  summarise_dt(
    count = .N,
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE),
    by = tailnum
  ) %>%
  filter_dt(count > 20 & dist < 2000)
summarise_dt (or summarize_dt) has a parameter "by", with which you can specify the grouping variables.
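A minimal sketch of how the "by" parameter lines up with data.table's own `by`, using a made-up table (`toy` and its values are hypothetical):

```r
library(data.table)
library(tidyfst)

# Hypothetical toy data: two flights for plane N1, one for N2
toy <- data.table(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 300, 500)
)

# tidyfst: the grouping variable is passed through by
res_tidyfst <- summarise_dt(toy, count = .N, dist = mean(distance), by = tailnum)

# data.table: the same aggregation written directly
res_dt <- toy[, .(count = .N, dist = mean(distance)), by = tailnum]
```

Both calls return one row per value of `tailnum`, with the grouping column first.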
We could find the number of planes and the number of flights that go to each possible destination:
# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

summarise_dt(flights, planes = uniqueN(tailnum), flights = .N, by = dest) %>%
  arrange_dt(dest)
If you need to group by many variables, use:
# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day <- summarise(daily, flights = n()))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N)

# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights))

# (per_year <- summarise(per_month, flights = sum(flights)))
flights %>%
  summarise_dt(by = .(year, month, day), flights = .N) %>%
  summarise_dt(by = .(year, month), flights = sum(flights)) %>%
  summarise_dt(by = .(year), flights = sum(flights))
tidyfst provides a tidy syntax for data.table. By design, tidyfst never runs faster than the analogous data.table code. Nevertheless, it lets dplyr users gain much of that computational performance right away and guides them to learn more about data.table for speed.
Below, we'll compare the syntax of tidyfst and data.table (referring to Introduction to data.table). This shows how they differ and lets users choose whichever they prefer. Ideally, tidyfst will lead even more users to learn about data.table and its wonderful features, and to design more extensions for tidyfst in the future.
Because we want a more stable data source, we'll use the flights data from the nycflights13 package loaded above.
library(tidyfst)
library(data.table)
library(nycflights13)

flights = data.table(flights) %>% na.omit()
# data.table
head(flights[origin == "JFK" & month == 6L])
flights[1:2]
flights[order(origin, -dest)]

# tidyfst
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% head()
flights %>% slice_dt(1:2)
flights %>% arrange_dt(origin, -dest)
# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)
# data.table
flights[, sum((arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
        .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]

# tidyfst
flights %>% summarise_dt(sum((arr_delay + dep_delay) < 0))
flights %>%
  filter_dt(origin == "JFK" & month == 6L) %>%
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% nrow()
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% count_dt()
flights %>% filter_dt(origin == "JFK" & month == 6L) %>% summarise_dt(.N)
The examples above show that data.table's own symbols, such as .N, can still be used in tidyfst.
# data.table
flights[, c("arr_delay", "dep_delay")]

select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]

flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]

# returns year, month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))

select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)

flights %>% select_dt(-arr_delay, -dep_delay)

flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))
# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin, dest)
flights %>%
  filter_dt(carrier == "AA") %>%
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))
Note that keyby is currently not used in tidyfst. This feature might be included in the future for better performance in order-independent tasks. Moreover, count_dt sorts its result by the count automatically; this can be controlled with the parameter "sort".
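To make the ordering differences concrete, here is a data.table-only sketch on a made-up table (`toy` is hypothetical). It contrasts `by`, `keyby`, and an explicit sort on the count, the last being the kind of ordering described for count_dt:

```r
library(data.table)

# Hypothetical toy data: LGA appears three times, JFK twice, EWR once
toy <- data.table(origin = c("LGA", "JFK", "LGA", "EWR", "LGA", "JFK"))

# by: groups are returned in order of first appearance
by_res <- toy[, .N, by = origin]

# keyby: groups are sorted by the grouping column, and the result is keyed on it
keyby_res <- toy[, .N, keyby = origin]

# Explicit sort by the count, largest first
sorted_res <- by_res[order(-N)]
```

With `by`, the row order depends on the input; with `keyby`, it depends only on the group values.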
# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  count_dt(origin, dest, sort = FALSE) %>%
  arrange_dt(origin, -dest)
flights %>% summarise_dt(.N, by = .(dep_delay>0, arr_delay>0))
Now let's try a more complex example:
# data.table
flights[carrier == "AA",
        lapply(.SD, mean),
        by = .(origin, dest, month),
        .SDcols = c("arr_delay", "dep_delay")]

# tidyfst
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean)
  )
Let me explain what happens here, especially in group_dt. First, filter by the condition carrier == "AA"; then group by three variables: origin, dest and month. Last, summarise the columns with "_delay" in their names by taking the mean of each. This design builds on .SD in data.table and upgrades the group_by function in dplyr: you never need to ungroup, because the grouped operations are simply placed inside group_dt. You can also pipe inside group_dt. Let's play with it a little further:
flights %>%
  filter_dt(carrier == "AA") %>%
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay", summarise_dt, mean) %>%
      mutate_dt(sum = dep_delay + arr_delay)
  )
However, I don't recommend doing this unless you actually need the extra step to run within each group; otherwise, just start another pipe after group_dt.
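As a rough data.table-only sketch of that advice, the per-group means can be computed with `.SD` first, and the extra column added in a separate step afterwards rather than inside the group. The table `toy` and the column name `total_delay` are hypothetical, and `patterns()` in `.SDcols` assumes data.table 1.12.0 or later:

```r
library(data.table)

# Hypothetical toy data: two JFK flights and one LGA flight
toy <- data.table(
  origin    = c("JFK", "JFK", "LGA"),
  arr_delay = c(10, 20, 30),
  dep_delay = c(0, 4, 6)
)

# Grouped step: mean of every column whose name contains "_delay"
means <- toy[, lapply(.SD, mean), by = origin, .SDcols = patterns("_delay")]

# Ungrouped follow-up step: it does not depend on the grouping,
# so it runs once on the summarised table
means[, total_delay := dep_delay + arr_delay]
```

The follow-up column is a plain row-wise sum, so computing it once on the summarised table gives the same values as computing it inside each group.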
Now let's end with some easy examples:
# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>% group_dt(by = month, head(2))
Deep inside, tidyfst is born from dplyr and data.table, and uses stringr to provide flexible APIs, so as to bring the strengths of both into full play.