fast_left_join: Fast left join for large tables

View source: R/pipeline-helpers.R

fast_left_join    R Documentation

Fast left join for large tables

Description

The dplyr join functions are a little on the slow side for very large tables. This version converts its inputs to data.table structures and uses that package's faster indexing capabilities to do a faster join.

Usage

fast_left_join(left, right, by)

Arguments

left

The left-side table to join. Any class inheriting from data.frame is acceptable.

right

The right-side table to join. Any class inheriting from data.frame is acceptable.

by

Character vector of column names to join by.

Details

Because there is some overhead associated with setting up and indexing the data.table structures, this function is only useful when the right-side table is big enough that the savings in the join make up for the overhead. Therefore, this function should only be used for joins that are demonstrably causing bottlenecks due to the size of the tables involved. This version should never be the first choice in development. As a rule of thumb, any join that takes more than 500ms using the dplyr join functions is a candidate for this function.
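One way to apply the rule of thumb above is to time the existing dplyr join before switching. The following is only a sketch: the table sizes, the column names (id, x, y), and the variable names are made up for illustration, and a single timing is a rough measurement, not a proper benchmark.

```r
library(dplyr)

# Synthetic tables; sizes and column names are illustrative assumptions.
set.seed(1)
n <- 1e5
left  <- data.frame(id = sample(n, n, replace = TRUE), x = runif(n))
right <- data.frame(id = seq_len(n), y = runif(n))

# Wall-clock time, in seconds, of the ordinary dplyr join.
elapsed <- system.time(
  joined <- dplyr::left_join(left, right, by = "id")
)[["elapsed"]]

# Rule of thumb from the text: more than 500ms makes the join a candidate.
is_candidate <- elapsed > 0.5
```

If is_candidate comes out FALSE for a realistic table size, the plain dplyr join is probably the better choice.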

When using this function, be aware that data.table has slightly different conventions for handling duplicated columns that are not being joined on. Suppose we have tables A and B, both of which have a column value that is not being joined on (that is, value is not listed in by). Then, AB <- dplyr::left_join(A, B, by) will have columns AB$value.x with the values from table A and AB$value.y with the values from table B. In AB <- gcamdata::fast_left_join(A, B, by), the corresponding columns will be AB$i.value for the values from table A, and AB$value (sic) for the values from table B. This function makes no attempt to correct the column names in the result to conform to the dplyr convention, and is therefore not exactly a drop-in replacement for left_join. However, it is usually easy enough to make corrections on the returned value.
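The naming difference and the after-the-fact correction can be sketched with data.table directly. This example assumes the join is performed with the right table as the data.table being indexed and the left table as the index, which is what produces the i. prefix described above; it does not call gcamdata itself, and the tables A and B and their id column are invented for illustration.

```r
library(data.table)

# Both tables carry a non-join column called "value".
A <- data.frame(id = c(1, 2, 3), value = c(10, 20, 30))
B <- data.frame(id = c(1, 2),    value = c(100, 200))

# Right table indexed by the left table: every row of A is kept.
AB <- as.data.frame(data.table(B)[data.table(A), on = "id"])

# At this point "value" holds B's values and "i.value" holds A's values.
# Rename to the dplyr convention: value.x from A, value.y from B.
names(AB)[names(AB) == "value"]   <- "value.y"
names(AB)[names(AB) == "i.value"] <- "value.x"
```

The same two renaming lines should work unchanged on the tibble returned by fast_left_join, since they only touch the names attribute.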

Since this function is intended only for specialized use, we don't provide any of the other join variants like first-only or error-no-match. The cases where that extra functionality is needed and the tables involved are too large for the slower version of join are uncommon enough that they can be handled on a case by case basis. (That's documentation-speak for "You're on your own.")

Value

The left join of left and right. It will be returned as a tbl_df, irrespective of the type of the inputs.
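To make the overall behaviour concrete, here is a minimal sketch of the approach this page describes: convert both inputs to data.table, join, and return a tibble. The helper name sketch_fast_left_join and the example tables are hypothetical, and this is not the gcamdata source; the real function may key or order columns differently.

```r
library(data.table)
library(tibble)

# Hypothetical stand-in for illustration only, not the gcamdata implementation.
sketch_fast_left_join <- function(left, right, by) {
  stopifnot(is.data.frame(left), is.data.frame(right), is.character(by))

  dt_left  <- data.table::as.data.table(left)
  dt_right <- data.table::as.data.table(right)

  # Indexing the right table by the left one keeps every row of the left
  # table, which gives left-join semantics.
  joined <- dt_right[dt_left, on = by, allow.cartesian = TRUE]

  # Return a tibble regardless of the input classes.
  tibble::as_tibble(joined)
}

left   <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right  <- data.frame(id = c(1, 2),    y = c(TRUE, FALSE))
result <- sketch_fast_left_join(left, right, by = "id")
```

Here result keeps all three rows of left, with y missing for the id that has no match in right. Column order in the result may differ from what dplyr::left_join produces.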


JGCRI/gcamdata documentation built on March 21, 2023, 2:19 a.m.