mergeCheck: Merge, order, and check resulting rows and columns.

View source: R/mergeCheck.R

mergeCheckR Documentation

Merge, order, and check resulting rows and columns.

Description

Stop checking that the number of rows is unchanged after a merge - mergeCheck checks what you really want - i.e. x is extended with columns from y while all rows in x are retained, and no new rows are created (plus some more checks). mergeCheck is not a merge implementation - it is a useful merge wrapper. The advantage over using much more flexible merge or join function lies in the fully automated checking that the results are consistent with the simple merge described above.

Usage

mergeCheck(
  x,
  y,
  by,
  by.x,
  by.y,
  fun.commoncols = base::warning,
  ncols.expect,
  track.msg = FALSE,
  quiet,
  df1,
  df2,
  fun.na.by = base::stop,
  as.fun,
  ...
)

Arguments

x

A data.frame with the number of rows must should be obtained from the merge. The resulting data.frame will be ordered like x.

y

A data.frame that will be merged onto x.

by

The column(s) to merge by. Character string (vector). by or by.x and by.y must be supplied.

by.x

If the columns to merge by in x and y are named differently. by or by.x and by.y must be supplied.

by.y

If the columns to merge by in x and y are named differently. by or by.x and by.y must be supplied.

fun.commoncols

If common columns are found in x and y, and they are not used in by, this will create columns named like col.x and col.y in result (see ?merge). Often, this is a mistake, and the default is to throw a warning if this happens. If using mergeCheck in a function, you may want to make sure this is not happening and use fun.commoncols=stop. If you want nothing to happen, you can do fun.commoncols=NULL.

ncols.expect

If you want to include a check of the number of columns being added to the dimensions of x. So if ncols.expect=1, the resulting data must have exactly one column more than x - if not, an error will be returned.

track.msg

If using mergeCheck inside other functions, it can be useful to use track.msg=TRUE. This will add information to messages/warnings/errors that they came from mergCheck.

quiet

If FALSE, the names of the added columns are reported. Default value controlled by NMdataConf.

df1

Deprecated. Use x.

df2

Deprecated. Use y.

fun.na.by

If NA's are found in (matched) by columns in both x and why, what should we do? This could be OK, but in many cases, it's because something unexpected is happening. Use fun.na.by=NULL if you don't want to be notified and want to go ahead regardless.

as.fun

The default is to return a data.table if x is a data.table and return a data.frame in all other cases. Pass a function in as.fun to convert to something else.

...

additional arguments passed to data.table::merge. If all is among them, an error will be returned.

Details

Besides merging and checking rows, mergeCheck makes sure the order in x is retained in the resulting data (both rows and column order). Also, a warning is given if column names are overlapping, making merge create new column names like col.x and col.y. Merges and other operations are done using data.table. If x is a data.frame (and not a data.table), it will internally be converted to a data.table, and the resulting data.table will be converted back to a data.frame before returning.

mergeCheck is for the kind of merges where we think of x as the data to be enriched with columns from y - rows unchanged. This is even further limited than a left join where you can match rows multiple times. A common example of the use of mergeCheck is for adding covariates to a pk/pd data set. We do not want that to remove or duplicate doses, observations, or simulation records. In those cases, mergeCheck does all needed checks, and you can run full speed without checking dimensions (which is anyway not exactly the right thing to do in the general case) or worry that something might go wrong.

Checks performed:

  • x has >0 rows

  • by columns are present in x an y

  • Merge is not performed on NA values. If by=ID and both x$ID and y$ID contain NA's, an error is thrown (see argument fun.na.by).

  • Merge is done by all common column names in x and y. A warning is thrown if there are column names that are not being used to merge by. This will result in two columns named like BW.x and BW.y and is often unintended.

  • Before merging a row counter is added to x. After the merge, the result is assured to have exactly one occurrence of each of the values of the row counter in x.

Moreover, row and column order from x is retained in the result.

Value

a data.frame resulting from merging x and y. Class as defined by as.fun.

See Also

Other DataCreate: NMorderColumns(), NMstamp(), NMwriteData(), addTAPD(), findCovs(), findVars(), flagsAssign(), flagsCount(), tmpcol()

Examples

 df1 <- data.frame(x = 1:10,
                   y=letters[1:10],
                   stringsAsFactors=FALSE)
 df2 <- data.frame(y=letters[1:11],
                   x2 = 1:11,
                   stringsAsFactors=FALSE)

 mc1 <- mergeCheck(x=df1,y=df2,by="y")

## Notice as opposed to most merge/join algorithms, mergeCheck by
#default retains both row and column order from x
library(data.table)
merge(as.data.table(df1),as.data.table(df2))
## Here we get a duplicate of a df1 row in the result. If we only
## check dimensions, we make a mistake. mergeCheck captures the
## error - and tell us where to find the problem (ID 31 and 180):
## Not run: 
pk <- readRDS(file=system.file("examples/data/xgxr2.rds",package="NMdata"))
dt.cov <- pk[,.(ID=unique(ID))]
dt.cov[,COV:=sample(1:5,size=.N,replace=TRUE)]
dt.cov <- dt.cov[c(1,1:(.N-1))]
dim(pk)
res.merge <- merge(pk,dt.cov,by="ID")
dim(res.merge)
mergeCheck(pk,dt.cov,by="ID")

## End(Not run)

NMdata documentation built on Nov. 11, 2023, 5:07 p.m.