View source: R/derive_merged.R
derive_vars_merged | R Documentation |
Add new variable(s) to the input dataset based on variables from another
dataset. The observations to merge can be selected by a condition
(filter_add
argument) and/or selecting the first or last observation for
each by group (order
and mode
argument).
derive_vars_merged(
dataset,
dataset_add,
by_vars,
order = NULL,
new_vars = NULL,
filter_add = NULL,
mode = NULL,
exist_flag = NULL,
true_value = "Y",
false_value = NA_character_,
missing_values = NULL,
check_type = "warning",
duplicate_msg = NULL,
relationship = NULL
)
dataset |
Input dataset The variables specified by the
|
dataset_add |
Additional dataset The variables specified by the
|
by_vars |
Grouping variables The input dataset and the selected observations from the additional dataset are merged by the specified variables. Variables can be renamed by naming the element, i.e.
|
order |
Sort order If the argument is set to a non-null value, for each by group the first or last observation from the additional dataset is selected with respect to the specified order. Variables defined by the For handling of
|
new_vars |
Variables to add The specified variables from the additional dataset are added to the output
dataset. Variables can be renamed by naming the element, i.e., For example And Values of the added variables can be modified by specifying an expression.
For example, If the argument is not specified or set to
|
filter_add |
Filter for additional dataset ( Only observations fulfilling the specified condition are taken into account for merging. If the argument is not specified, all observations are considered. Variables defined by the
|
mode |
Selection mode Determines if the first or last observation is selected. If the If the
|
exist_flag |
Exist flag If the argument is specified (e.g.,
|
true_value |
True value The value for the specified variable
|
false_value |
False value The value for the specified variable
|
missing_values |
Values for non-matching observations For observations of the input dataset (
|
check_type |
Check uniqueness? If If the
|
duplicate_msg |
Message of unique check If the uniqueness check fails, the specified message is displayed.
|
relationship |
Expected merge-relationship between the This argument is passed to the
|
The new variables (new_vars
) are added to the additional dataset
(dataset_add
).
The records from the additional dataset (dataset_add
) are restricted
to those matching the filter_add
condition.
If order
is specified, for each by group the first or last observation
(depending on mode
) is selected.
The variables specified for new_vars
are merged to the input dataset
using left_join()
. I.e., the output dataset contains all observations
from the input dataset. For observations without a matching observation in
the additional dataset the new variables are set as specified by
missing_values
(or to NA
for variables not in missing_values
).
Observations in the additional dataset which have no matching observation
in the input dataset are ignored.
The output dataset contains all observations and variables of the
input dataset and additionally the variables specified for new_vars
from
the additional dataset (dataset_add
).
derive_vars_joined()
The question between using derive_vars_merged()
or the more powerful
derive_vars_joined()
comes down to how you need to select the observations
to be merged.
If the observations from dataset_add
to merge can be selected
by a condition (filter_add
) using only variables from dataset_add
, then
always use derive_vars_merged()
as it requires less resources (time and
memory). A common example of this would be a randomization date in ADSL
,
where you are simply merging on a date from DS
according to a certain
DSDECOD
condition such as DSDECOD == "RANDOMIZATION"
.
However, if the selection of the observations from dataset_add
can depend
on variables from both datasets, then use derive_vars_joined()
. An
example of this would be assigning period variables from ADSL
to an ADAE
,
where you now need to check each adverse event start date against the period
start and end dates to decide which period value to join.
Merge all demographic variables onto a vital signs dataset.
The variable DOMAIN
exists in both datasets so note the use of
select(dm, -DOMAIN)
in the dataset_add
argument. Without this an error
would be issued to notify the user.
library(tibble) library(dplyr, warn.conflicts = FALSE) vs <- tribble( ~DOMAIN, ~USUBJID, ~VSTESTCD, ~VISIT, ~VSSTRESN, ~VSDTC, "VS", "01", "HEIGHT", "SCREENING", 178.0, "2013-08-20", "VS", "01", "WEIGHT", "SCREENING", 81.9, "2013-08-20", "VS", "01", "WEIGHT", "BASELINE", 82.1, "2013-08-29", "VS", "01", "WEIGHT", "WEEK 2", 81.9, "2013-09-15", "VS", "01", "WEIGHT", "WEEK 4", 82.6, "2013-09-24", "VS", "02", "WEIGHT", "BASELINE", 58.6, "2014-01-11" ) %>% mutate(STUDYID = "AB42") dm <- tribble( ~DOMAIN, ~USUBJID, ~AGE, ~AGEU, "DM", "01", 61, "YEARS", "DM", "02", 64, "YEARS", "DM", "03", 85, "YEARS" ) %>% mutate(STUDYID = "AB42") derive_vars_merged( vs, dataset_add = select(dm, -DOMAIN), by_vars = exprs(STUDYID, USUBJID) ) %>% select(USUBJID, VSTESTCD, VISIT, VSSTRESN, AGE, AGEU) #> # A tibble: 6 × 6 #> USUBJID VSTESTCD VISIT VSSTRESN AGE AGEU #> <chr> <chr> <chr> <dbl> <dbl> <chr> #> 1 01 HEIGHT SCREENING 178 61 YEARS #> 2 01 WEIGHT SCREENING 81.9 61 YEARS #> 3 01 WEIGHT BASELINE 82.1 61 YEARS #> 4 01 WEIGHT WEEK 2 81.9 61 YEARS #> 5 01 WEIGHT WEEK 4 82.6 61 YEARS #> 6 02 WEIGHT BASELINE 58.6 64 YEARS
order
and mode
)Merge the last occurring weight for each subject to the demographics dataset.
To enable sorting by visit date convert_dtc_to_dtm()
is used to convert
to a datetime, within the order
argument.
Then the mode
argument is set to "last"
to ensure the last sorted value
is taken. Be cautious if NA
values are possible in the order
variables -
see Sort Order.
The filter_add
argument is used to restrict the vital signs records only
to weight assessments.
derive_vars_merged( dm, dataset_add = vs, by_vars = exprs(STUDYID, USUBJID), order = exprs(convert_dtc_to_dtm(VSDTC)), mode = "last", new_vars = exprs(LSTWT = VSSTRESN), filter_add = VSTESTCD == "WEIGHT" ) %>% select(USUBJID, AGE, AGEU, LSTWT) #> # A tibble: 3 × 4 #> USUBJID AGE AGEU LSTWT #> <chr> <dbl> <chr> <dbl> #> 1 01 61 YEARS 82.6 #> 2 02 64 YEARS 58.6 #> 3 03 85 YEARS NA
check_type
)The source records are checked regarding duplicates with respect to the by variables and the order specified. By default, a warning is issued if any duplicates are found. Note the results here with a new vital signs dataset containing a duplicate last weight assessment date.
vs_dup <- tribble( ~DOMAIN, ~USUBJID, ~VSTESTCD, ~VISIT, ~VSSTRESN, ~VSDTC, "VS", "01", "WEIGHT", "WEEK 2", 81.1, "2013-09-24", "VS", "01", "WEIGHT", "WEEK 4", 82.6, "2013-09-24" ) %>% mutate(STUDYID = "AB42") derive_vars_merged( dm, dataset_add = vs_dup, by_vars = exprs(STUDYID, USUBJID), order = exprs(convert_dtc_to_dtm(VSDTC)), mode = "last", new_vars = exprs(LSTWT = VSSTRESN), filter_add = VSTESTCD == "WEIGHT" ) %>% select(USUBJID, AGE, AGEU, LSTWT) #> # A tibble: 3 × 4 #> USUBJID AGE AGEU LSTWT #> <chr> <dbl> <chr> <dbl> #> 1 01 61 YEARS 82.6 #> 2 02 64 YEARS NA #> 3 03 85 YEARS NA #> Warning: Dataset contains duplicate records with respect to `STUDYID`, `USUBJID`, and #> `convert_dtc_to_dtm(VSDTC)` #> i Run `admiral::get_duplicates_dataset()` to access the duplicate records
For investigating the issue, the dataset of the duplicate source records
can be obtained by calling get_duplicates_dataset()
:
get_duplicates_dataset() #> Duplicate records with respect to `STUDYID`, `USUBJID`, and #> `convert_dtc_to_dtm(VSDTC)`. #> # A tibble: 2 × 9 #> STUDYID USUBJID convert_dtc_to_dtm(VSDT…¹ DOMAIN VSTESTCD VISIT VSSTRESN VSDTC #> * <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <chr> #> 1 AB42 01 2013-09-24 00:00:00 VS WEIGHT WEEK… 81.1 2013… #> 2 AB42 01 2013-09-24 00:00:00 VS WEIGHT WEEK… 82.6 2013… #> # i abbreviated name: ¹`convert_dtc_to_dtm(VSDTC)` #> # i 1 more variable: LSTWT <dbl>
Common options to solve the issue:
Specifying additional variables for order
- this is the most common approach,
adding something like a sequence variable.
Restricting the source records by specifying/updating the filter_add
argument.
Setting check_type = "none"
to ignore any duplicates, but then in this case
the last occurring record would be chosen according to the sort order of the
input dataset_add
. This is not often advisable, unless the order has no impact
on the result, as the temporary sort order can be prone to variation across
an ADaM script.
new_vars
and missing_values
)For the last occurring weight for each subject, add a categorization of which visit it occurred at to the demographics dataset.
In the new_vars
argument, other functions can be utilized to modify the
merged values. For example, in the below case we want to categorize the
visit as "BASELINE"
or "POST-BASELINE"
using if_else()
.
The missing_values
argument assigns a specific value for subjects with
no matching observations - see subject "03"
in the below example.
derive_vars_merged( dm, dataset_add = vs, by_vars = exprs(STUDYID, USUBJID), order = exprs(convert_dtc_to_dtm(VSDTC)), mode = "last", new_vars = exprs( LSTWTCAT = if_else(VISIT == "BASELINE", "BASELINE", "POST-BASELINE") ), filter_add = VSTESTCD == "WEIGHT", missing_values = exprs(LSTWTCAT = "MISSING") ) %>% select(USUBJID, AGE, AGEU, LSTWTCAT) #> # A tibble: 3 × 4 #> USUBJID AGE AGEU LSTWTCAT #> <chr> <dbl> <chr> <chr> #> 1 01 61 YEARS POST-BASELINE #> 2 02 64 YEARS BASELINE #> 3 03 85 YEARS MISSING
exist_flag
, true_value
and false_value
)Similar to the above example, now we prefer to have a separate flag variable to show whether a selected record was merged.
The name of the new variable is set with the exist_flag
argument.
The values of this new variable are assigned via the true_value
and
false_value
arguments.
derive_vars_merged( dm, dataset_add = vs, by_vars = exprs(STUDYID, USUBJID), order = exprs(convert_dtc_to_dtm(VSDTC)), mode = "last", new_vars = exprs( LSTWTCAT = if_else(VISIT == "BASELINE", "BASELINE", "POST-BASELINE") ), filter_add = VSTESTCD == "WEIGHT", exist_flag = WTCHECK, true_value = "Y", false_value = "MISSING" ) %>% select(USUBJID, AGE, AGEU, LSTWTCAT, WTCHECK) #> # A tibble: 3 × 5 #> USUBJID AGE AGEU LSTWTCAT WTCHECK #> <chr> <dbl> <chr> <chr> <chr> #> 1 01 61 YEARS POST-BASELINE Y #> 2 02 64 YEARS BASELINE Y #> 3 03 85 YEARS <NA> MISSING
new_vars
)Derive treatment start datetime and associated imputation flags.
In this example we first impute exposure datetime and associated flag
variables as a separate first step to be used in the order
argument.
In the new_vars
arguments, you can see how both datetime and the date and
time imputation flags are all merged in one call.
ex <- tribble( ~DOMAIN, ~USUBJID, ~EXSTDTC, "EX", "01", "2013-08-29", "EX", "01", "2013-09-16", "EX", "02", "2014-01-11", "EX", "02", "2014-01-25" ) %>% mutate(STUDYID = "AB42") ex_ext <- derive_vars_dtm( ex, dtc = EXSTDTC, new_vars_prefix = "EXST", highest_imputation = "M" ) #> The default value of `ignore_seconds_flag` will change to "TRUE" in admiral #> 1.4.0. derive_vars_merged( dm, dataset_add = ex_ext, by_vars = exprs(STUDYID, USUBJID), new_vars = exprs(TRTSDTM = EXSTDTM, TRTSDTF = EXSTDTF, TRTSTMF = EXSTTMF), order = exprs(EXSTDTM), mode = "first" ) %>% select(USUBJID, TRTSDTM, TRTSDTF, TRTSTMF) #> # A tibble: 3 × 4 #> USUBJID TRTSDTM TRTSDTF TRTSTMF #> <chr> <dttm> <chr> <chr> #> 1 01 2013-08-29 00:00:00 <NA> H #> 2 02 2014-01-11 00:00:00 <NA> H #> 3 03 NA <NA> <NA>
Further example usages of this function can be found in the Generic Derivations vignette.
General Derivation Functions for all ADaMs that returns variable appended to dataset:
derive_var_extreme_flag()
,
derive_var_joined_exist_flag()
,
derive_var_merged_ef_msrc()
,
derive_var_merged_exist_flag()
,
derive_var_merged_summary()
,
derive_var_obs_number()
,
derive_var_relative_flag()
,
derive_vars_cat()
,
derive_vars_computed()
,
derive_vars_joined()
,
derive_vars_joined_summary()
,
derive_vars_merged_lookup()
,
derive_vars_transposed()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.