inexact_join: Join two data frames inexactly


Description

These functions are modifications of the standard dplyr join functions. They differ in that they allow a variable of an ordered type (like date or numeric) in x to be matched in inexact ways to variables in y.

Usage

inexact_inner_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_right_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_full_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_semi_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_nest_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  keep = FALSE,
  name = NULL,
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

inexact_anti_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  ...,
  var = NULL,
  jvar = NULL,
  method,
  exact = TRUE
)

Arguments

x, y, by, copy, suffix, keep, name, ...

Arguments to be passed to the relevant join function.

var

Quoted or unquoted variable from the x data frame which is to be indirectly matched.

jvar

Quoted or unquoted variable(s) from the y data frame which are to be indirectly matched. These cannot be variable names also in x or var.

method

The approach to be taken in performing the indirect matching.

exact

A logical, where TRUE indicates that exact matches are acceptable. For example, if method = 'last', x contains var = 2, and y contains jvar = 1 and jvar = 2, then exact = TRUE will match with the jvar = 2 observation, and exact = FALSE will match with the jvar = 1 observation. If jvar contains two variables and you want them treated differently, set to c(TRUE,FALSE) or c(FALSE,TRUE).
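
The effect of exact under method = 'last' can be sketched in base R. This is only a conceptual illustration of the rule described above, not pmdplyr's implementation; match_last is a hypothetical helper introduced for this sketch.

```r
# Hypothetical helper (not part of pmdplyr): for each value in var, return
# the largest jvar value that is <= var (exact = TRUE) or strictly < var
# (exact = FALSE). Returns NA when no jvar value qualifies.
match_last <- function(var, jvar, exact = TRUE) {
  vapply(var, function(v) {
    candidates <- if (exact) jvar[jvar <= v] else jvar[jvar < v]
    if (length(candidates) == 0) NA_real_ else max(candidates)
  }, numeric(1))
}

jvar <- c(1, 2)
match_last(2, jvar, exact = TRUE)   # picks the jvar = 2 observation
match_last(2, jvar, exact = FALSE)  # picks the jvar = 1 observation
```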

Details

This allows matching, for example, if one data set contains data from multiple days in the week, while the other data set is weekly. Another example might be matching an observation in one data set to the *most recent* previous observation in the other.
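
As a rough sketch of the daily-to-weekly case, the base R code below attaches to each daily row the most recent weekly row on or before it. The data frames daily and weekly and their column names are made up for illustration, and findInterval() stands in for the matching logic; it is not how pmdplyr implements the join.

```r
# Illustrative data (invented for this sketch): daily observations and
# weekly values keyed by the first day of each week.
daily <- data.frame(
  date = as.Date(c("2020-01-06", "2020-01-08", "2020-01-13", "2020-01-17"))
)
weekly <- data.frame(
  week_start = as.Date(c("2020-01-06", "2020-01-13")),
  weekly_value = c(10, 20)
)

# findInterval() returns, for each date, the index of the last week_start
# that is on or before it (0 if the date precedes every week_start).
# week_start must be sorted in increasing order.
idx <- findInterval(as.numeric(daily$date), as.numeric(weekly$week_start))
idx[idx == 0] <- NA  # no earlier week available

daily$week_start <- weekly$week_start[idx]
daily$weekly_value <- weekly$weekly_value[idx]
daily
```

Based on the signatures under Usage, the corresponding call with these functions would presumably look like inexact_left_join(daily, weekly, var = date, jvar = week_start, method = "last").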

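The matching approach is chosen with method. The examples below use method = "last", which matches var to the most recent value of jvar, and method = "between", which takes two variables in jvar marking the start and end of a range and matches var to the range it falls in.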

Note that if var has no valid match under the chosen method, it will be merged with any is.na(jvar[1]) values, that is, rows of y where the first jvar variable is missing.

Examples

library(pmdplyr)
library(dplyr) # attached here so that %>% is available
data(Scorecard)
# We also have this data on the December unemployment rate for US college grads nationally
# but only every other year
unemp_data <- data.frame(
  unemp_year = c(2006, 2008, 2010, 2012, 2014, 2016, 2018),
  unemp = c(.017, .036, .048, .040, .028, .025, .020)
)
# I want to match the most recent unemployment data I have to each college
Scorecard <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "last",
    var = year,
    jvar = unemp_year
  )

# Or perhaps I want to find the most recent lagged value (i.e. no exact matches, only recent ones)
data(Scorecard)
Scorecard <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "last",
    var = year,
    jvar = unemp_year,
    exact = FALSE
  )

# Another way to do the same thing would be to specify the range of unemp_years I want exactly
data(Scorecard)
unemp_data$unemp_year2 <- unemp_data$unemp_year + 2
Scorecard <- Scorecard %>%
  inexact_left_join(unemp_data,
    method = "between",
    var = year,
    jvar = c(unemp_year, unemp_year2)
  )
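
The range matching in the last example can be sketched in base R as follows. This is a conceptual illustration only: match_between is a hypothetical helper, and it assumes the start of each range is inclusive and the end exclusive, which corresponds to one of the exact settings described under Arguments, exact = c(TRUE, FALSE).

```r
# Hypothetical helper (not part of pmdplyr): for each value in var, return
# the index of the range [start, end) that contains it, or NA if none does.
# Assumes the ranges do not overlap.
match_between <- function(var, start, end) {
  vapply(var, function(v) {
    hit <- which(start <= v & v < end)
    if (length(hit) == 0) NA_integer_ else hit[1]
  }, integer(1))
}

unemp_year <- c(2006, 2008, 2010)
unemp_year2 <- unemp_year + 2

# 2007 falls in the first range, 2008 in the second, 2012 in none
match_between(c(2007, 2008, 2012), unemp_year, unemp_year2)
```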

pmdplyr documentation built on July 2, 2020, 4:08 a.m.