mergeCheck: Merge, order, and check resulting rows and columns.
In NMdata: Preparation, Checking and Post-Processing Data for PK/PD Modeling

Description

Stop checking that the number of rows is unchanged after a merge. 'mergeCheck' checks what you really want: x is extended with columns from y, all rows in x are retained, and no new rows are created (plus some more checks). 'mergeCheck' is not a merge implementation; it is a wrapper around merge. Its advantage over the much more flexible merge or join functions lies in the fully automated checking that the result is consistent with the simple merge described above.
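
To see why an unchanged row count is not a sufficient check, consider the following minimal sketch using base R's merge(). The data and the object names (x, y, res) are made up for illustration and are not part of NMdata.

```r
## Hypothetical data: one record for each of subjects 1, 2 and 3
x <- data.frame(ID = c(1, 2, 3), DV = c(0.5, 1.2, 0.8))
## Hypothetical covariates: subject 1 is duplicated, subject 3 is missing
y <- data.frame(ID = c(1, 1, 2), WT = c(70, 70, 82))

res <- merge(x, y, by = "ID")

## The number of rows is unchanged...
nrow(res) == nrow(x)
## ...but the record for ID 3 is lost and the record for ID 1 is duplicated
res$ID
```

A check on dimensions alone would accept this result, which is the situation 'mergeCheck' is meant to catch.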

Usage

mergeCheck(
  x,
  y,
  by,
  by.x,
  by.y,
  common.cols = base::warning,
  ncols.expect,
  track.msg = FALSE,
  quiet,
  df1,
  df2,
  subset.x,
  fun.na.by = base::stop,
  as.fun,
  fun.commoncols,
  ...
)

Arguments

`x`	A data.frame. The result of the merge must have exactly the rows of x, and it will be ordered like x.
`y`	A data.frame that will be merged onto x.
`by`	The column(s) to merge by. Character string (vector). Either by, or both by.x and by.y, must be supplied.
`by.x`	The column(s) in x to merge by. Use together with by.y if the columns to merge by are named differently in x and y. Either by, or both by.x and by.y, must be supplied.
`by.y`	The column(s) in y to merge by. Use together with by.x if the columns to merge by are named differently in x and y. Either by, or both by.x and by.y, must be supplied.
`common.cols`	What to do if x and y have common columns that are not used in 'by'. Merging such data creates columns named like col.x and col.y in the result (see ?merge). This is often a mistake, so the default is to throw a warning. If using 'mergeCheck' in programming, you may want to make sure this does not happen and use common.cols=stop. If you want nothing to happen, use common.cols=NULL. Use 'common.cols="drop.x"' to drop the "non-by" columns in 'x' that share names with columns in 'y', or 'common.cols="drop.y"' to drop them in 'y' instead; either way the conflict is avoided. The last option is 'common.cols="merge.by"', which automatically extends 'by' to include all common column names.
`ncols.expect`	Optional check of the number of columns added to 'x'. If ncols.expect=1, the resulting data must have exactly one column more than 'x'; if not, an error is thrown.
`track.msg`	If using 'mergeCheck' inside other functions, it can be useful to use track.msg=TRUE. This will add information to messages/warnings/errors that they came from 'mergeCheck()'.
`quiet`	If FALSE, the names of the added columns are reported. Default value controlled by NMdataConf.
`df1`	Deprecated. Use x.
`df2`	Deprecated. Use y.
`subset.x`	Not implemented.
`fun.na.by`	If NA's are found in (matched) by columns in both x and y, what should we do? This could be OK, but in many cases, it's because something unexpected is happening. Use fun.na.by=NULL if you don't want to be notified and want to go ahead regardless.
`as.fun`	The default is to return a data.table if x is a data.table and return a data.frame in all other cases. Pass a function in as.fun to convert to something else.
`fun.commoncols`	Deprecated. Please use 'common.cols'.
`...`	Additional arguments passed to merge for data.tables (merge.data.table). If all is among them, an error is thrown.
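
As an illustration of how common.cols and ncols.expect can be combined, here is a small sketch. The data sets and column names (TIME, WT, AGE) are made up for the example, and the call uses only the arguments documented above.

```r
library(NMdata)

## Hypothetical data: repeated records per subject in x
x <- data.frame(ID = c(1, 1, 2, 2),
                TIME = c(0, 1, 0, 1),
                WT = c(70, 70, 82, 82))
## Hypothetical covariates: one row per subject. WT is also in x
## and is not used to merge by.
y <- data.frame(ID = c(1, 2),
                WT = c(70, 82),
                AGE = c(34, 51))

## Drop WT from y to avoid columns WT.x and WT.y, and require that
## exactly one column (AGE) is added to x.
res <- mergeCheck(x, y, by = "ID",
                  common.cols = "drop.y",
                  ncols.expect = 1,
                  quiet = TRUE)
```

With the default common.cols, the same call should instead warn about WT being present in both data sets.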

Details

Besides merging and checking rows, 'mergeCheck' makes sure the order in x is retained in the resulting data (both rows and column order). Also, a warning is given if column names are overlapping, making merge create new column names like col.x and col.y. Merges and other operations are done using data.table. If x is a data.frame (and not a data.table), it will internally be converted to a data.table, and the resulting data.table will be converted back to a data.frame before returning.

'mergeCheck' is for the kind of merges where we think of x as the data to be enriched with columns from y, with rows unchanged. This is more restrictive than a left join, where a row in x can be matched multiple times. A common use of 'mergeCheck' is adding covariates to a PK/PD data set. We do not want that to remove or duplicate doses, observations, or simulation records. In those cases, 'mergeCheck' does all needed checks, and you can run full speed without checking dimensions (which is anyway not exactly the right thing to do in the general case) or worrying that something might go wrong.

Checks performed:

x has >0 rows
by columns are present in x and y
Merge is not performed on NA values. If by=ID and both x$ID and y$ID contain NA's, an error is thrown (see argument fun.na.by).
Merge is done by all common column names in x and y. A warning is thrown if there are common column names that are not being used to merge by. These would result in two columns named like BW.x and BW.y, which is often unintended.
Before merging, a row counter is added to x. After the merge, the result is checked to have exactly one occurrence of each of the values of the row counter in x.

Moreover, row and column order from x is retained in the result.
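
The row counter check and the restoring of order can be sketched in base R as follows. This is only an illustration of the idea under simplifying assumptions, not the NMdata implementation. The function name merge_keep_rows and the counter column .row are hypothetical.

```r
## Hypothetical helper illustrating the row counter check
merge_keep_rows <- function(x, y, by) {
  stopifnot(nrow(x) > 0, all(by %in% names(x)), all(by %in% names(y)))
  ## For simplicity, this sketch does not allow common non-by columns
  stopifnot(length(setdiff(intersect(names(x), names(y)), by)) == 0)
  cols.x <- names(x)
  ## Add a row counter to x before merging
  x$.row <- seq_len(nrow(x))
  res <- merge(x, y, by = by)
  ## Each value of the row counter must occur exactly once in the result
  if (!identical(sort(res$.row), x$.row)) {
    stop("The merge removed or duplicated rows of x.")
  }
  ## Restore the row order of x, and put the columns of x first
  res <- res[order(res$.row), ]
  res$.row <- NULL
  rownames(res) <- NULL
  res[, c(cols.x, setdiff(names(res), cols.x)), drop = FALSE]
}

## Made-up data: x is deliberately not sorted by ID
x <- data.frame(ID = c(2, 1, 2), DV = c(1.2, 0.5, 0.9))
y.ok <- data.frame(ID = c(1, 2), WT = c(70, 82))
res <- merge_keep_rows(x, y.ok, by = "ID")

## A duplicated ID in y would duplicate rows of x, so an error is thrown
y.dup <- data.frame(ID = c(1, 2, 2), WT = c(70, 82, 83))
err <- try(merge_keep_rows(x, y.dup, by = "ID"), silent = TRUE)
```

Comparing the sorted row counter against the original catches both lost and duplicated rows, which a comparison of dimensions cannot do.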

Value

A data.frame resulting from merging x and y. Class as defined by as.fun.

Examples

 ## mergeCheck and dims are provided by NMdata
 library(NMdata)

 df1 <- data.frame(x = 1:10,
                   y=letters[1:10],
                   stringsAsFactors=FALSE)
 df2 <- data.frame(y=letters[1:11],
                   x2 = 1:11,
                   stringsAsFactors=FALSE)

 mc1 <- mergeCheck(x=df1,y=df2,by="y")

## Notice that, as opposed to most merge/join algorithms, `mergeCheck`
## by default retains both row and column order from x
library(data.table)
merge(as.data.table(df1),as.data.table(df2))
## Here we get a duplicate of a df1 row in the result. If we only
## check dimensions, we make a mistake. `mergeCheck` captures the
## error and tells us where to find the problem (ID 31 and 180):
## Not run: 
pk <- readRDS(file=system.file("examples/data/xgxr2.rds",package="NMdata"))
dt.cov <- pk[,.(ID=unique(ID))]
dt.cov[,COV:=sample(1:5,size=.N,replace=TRUE)]
dt.cov <- dt.cov[c(1,1:(.N-1))]
res.merge <- merge(pk,dt.cov,by="ID")
dims(pk,dt.cov,res.merge)
mergeCheck(pk,dt.cov,by="ID")

## End(Not run)

NMdata documentation built on April 4, 2025, 2:11 a.m.