cross_join: Cross Join


View source: R/cross_join.R

Description

A cross join returns every combination of the rows of x and y, i.e. a dataset whose number of rows is the number of rows in x multiplied by the number of rows in y. This kind of result is called the Cartesian product.

Usage

cross_join(
  x,
  y,
  copy = FALSE,
  suffix = c("_x", "_y"),
  ...,
  na_matches = c("never", "na")
)

## S3 method for class 'tbl_lazy'
cross_join(
  x,
  y,
  copy = FALSE,
  suffix = c("_x", "_y"),
  ...,
  na_matches = c("never", "na")
)

## S3 method for class 'data.frame'
cross_join(
  x,
  y,
  copy = FALSE,
  suffix = c("_x", "_y"),
  ...,
  na_matches = c("na", "never")
)

Arguments

x, y

A pair of tbl_sparks or data.frames.

copy

If x and y are not from the same data source, and copy is TRUE, then y will be copied into a temporary table in the same database as x. *_join() will automatically run ANALYZE on the created table in the hope that this will make your queries as efficient as possible by giving more data to the query planner.

This allows you to join tables across srcs, but it's a potentially expensive operation, so you must opt into it.

suffix

If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.

...

Other parameters passed on to methods.

na_matches

Should NA (NULL) values match one another? "never" is how databases usually work and is the default for the generic and the tbl_lazy method. "na" makes the joins behave like the dplyr join functions, merge(), match(), and %in%, and is the default for the data.frame method.

Details

From Spark 2.1, spark.sql.crossJoin.enabled must be set to true before a cross join can be used; otherwise an exception is thrown. Cartesian products are very slow. More importantly, they can consume a lot of memory and trigger an out-of-memory (OOM) error. If the join type is not Inner, Spark SQL could use a Broadcast Nested Loop Join even if neither side of the join is small enough to broadcast, which can also cause a lot of unwanted network traffic.
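
The option mentioned above could be set when the connection is created. The following is only a sketch: it assumes the connection is made through sparklyr and that a local Spark installation is available. The use of spark_config() and the master value "local" are assumptions for illustration, not something this page prescribes.

```r
library(sparklyr)

# Assumption: Spark is installed locally and reachable through sparklyr.
config <- spark_config()

# Allow Cartesian products; without this, Spark 2.x raises an exception.
config[["spark.sql.crossJoin.enabled"]] <- "true"

# "local" is an illustrative master; use the value for your own cluster.
sc <- spark_connect(master = "local", config = config)
```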

Examples

x <- data.frame(
  id = c("id1", "id2", "id3", "id4", "id5"),
  val = c(2, 7, 11, 13, 17),
  stringsAsFactors = FALSE
)
cross_join(x, x)
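
To check the row count stated in the Description without a Spark connection, the same Cartesian product can be sketched in base R. This uses merge() with by = NULL as a stand-in and is not the sparkplugs implementation; the name crossed is chosen here for illustration.

```r
# Base R sketch of a Cartesian product; not the sparkplugs implementation.
x <- data.frame(
  id = c("id1", "id2", "id3", "id4", "id5"),
  val = c(2, 7, 11, 13, 17),
  stringsAsFactors = FALSE
)

# by = NULL asks merge() for every pairing of rows; the suffixes
# disambiguate the columns that appear in both inputs.
crossed <- merge(x, x, by = NULL, suffixes = c("_x", "_y"))

nrow(crossed)   # 5 * 5 = 25 rows
names(crossed)  # "id_x" "val_x" "id_y" "val_y"
```

Because both inputs share the column names id and val, every output column receives a suffix, which mirrors the role of the suffix argument described above.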

nathaneastwood/sparkplugs documentation built on Feb. 28, 2021, 4:57 p.m.