knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This post has referred to a vignette from dplyr, you can find it in https://dplyr.tidyverse.org/articles/two-table.html. We'll try to display how to join data tables in this vignette. First, load the packages we need and get some data.

library(tidyfst)
library(nycflights13)

flights2 <- flights %>% 
  select_dt(year,month,day, hour, origin, dest, tailnum, carrier)

Do a left join with a simple:

flights2 %>% 
  left_join_dt(airlines)

Controlling how the tables are matched

Join works the same as dplyr:

flights2 %>% left_join_dt(weather)
flights2 %>% left_join_dt(planes, by = "tailnum")
flights2 %>% left_join_dt(airports, c("dest" = "faa"))
flights2 %>% left_join_dt(airports, c("origin" = "faa"))

Types of join

df1 <- data.table(x = c(1, 2), y = 2:1)
df2 <- data.table(x = c(1, 3), a = 10, b = "a")

df1 %>% inner_join_dt(df2) 
df1 %>% left_join_dt(df2)
df1 %>% right_join_dt(df2)
df1 %>% full_join_dt(df2)

If all you have is a data.frame or tibble, you have no need to change the format. Feed the data directly:

df1 <- data.frame(x = c(1, 1, 2), y = 1:3)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))

df1 %>% left_join_dt(df2)

The "_dt" suffix should remind you that this is backed up by data.table and will always return a data.table in the end.

Filtering joins

Filtering joins have also been supported in tidyfst.

flights %>% 
  anti_join_dt(planes, by = "tailnum") %>% 
  count_dt(tailnum, sort = TRUE)

Other examples (semi_join_dt() and anti_join_dt() never duplicate; they only ever remove observations.):

df1 <- data.frame(x = c(1, 1, 3, 4), y = 1:4)
df2 <- data.frame(x = c(1, 1, 2), z = c("a", "b", "a"))

# Four rows to start with:
df1 %>% nrow()

# And we get four rows after the join
df1 %>% inner_join_dt(df2, by = "x") %>% nrow()

# But only two rows actually match
df1 %>% semi_join_dt(df2, by = "x") %>% nrow()

Set operations

For set operations, wrap data.table's function directly, but the functions will automatically turn any data.frame into data.table. Examples are listed as below:

x = iris[c(2,3,3,4),]
x2 = iris[2:4,]
y = iris[c(3:5),]

intersect_dt(x, y)            # intersect
intersect_dt(x, y, all=TRUE)  # intersect all
setdiff_dt(x, y)              # except
setdiff_dt(x, y, all=TRUE)    # except all
union_dt(x, y)                # union
union_dt(x, y, all=TRUE)      # union all
setequal_dt(x, x2, all=FALSE) # setequal
setequal_dt(x, x2)     

For more details, just find the help from data.table using ?setops.



hope-data-science/tidyfst documentation built on Sept. 23, 2024, 8:05 p.m.