MergeDataSetsByCase: Merge Data Sets by Case

View source: R/mergedatasetsbycase.R

MergeDataSetsByCase    R Documentation

Merge Data Sets by Case

Description

Merges multiple data sets by case where the data sets contain similar variables but different cases, e.g., data sets from different time periods.

Usage

MergeDataSetsByCase(
  data.set.names,
  merged.data.set.name = NULL,
  auto.select.what.to.match.by = TRUE,
  match.by.variable.names = TRUE,
  match.by.variable.labels = TRUE,
  match.by.value.labels = TRUE,
  ignore.case = TRUE,
  ignore.non.alphanumeric = TRUE,
  min.match.percentage = 90,
  variables.to.combine = NULL,
  variables.to.not.combine = NULL,
  variables.to.keep = NULL,
  variables.to.omit = NULL,
  include.merged.data.set.in.output = FALSE,
  when.multiple.labels.for.one.value = "Create new values for the labels",
  use.names.and.labels.from = "First data set",
  data.sets.whose.variables.are.kept = seq_along(data.set.names),
  min.value.label.match.percentage = 90
)

Arguments

data.set.names

A character vector of names of data sets from the Displayr cloud drive to merge (if run from Displayr) or file paths of local data sets.

merged.data.set.name

A character scalar of the name of the merged data set in the Displayr cloud drive (if run from Displayr) or the local file path of the merged data set.

auto.select.what.to.match.by

If TRUE, the metadata to match by is chosen automatically, whereas if FALSE, the metadata to match by is specified by setting the flags match.by.variable.names, match.by.variable.labels and match.by.value.labels.

match.by.variable.names

Logical scalar indicating whether to match using variable names.

match.by.variable.labels

Logical scalar indicating whether to match using variable labels.

match.by.value.labels

Logical scalar indicating whether to match using value labels of categorical variables.
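As a hedged sketch of switching off automatic selection, the call below matches on variable names and variable labels only. It reuses two of the example files shipped with flipData and is guarded so that the merge only runs when the package is installed. The particular combination of flags is illustrative, not a recommendation.

```r
# Flags for a merge that matches on variable names and labels only.
# This combination is an illustrative assumption.
merge.args <- list(
    auto.select.what.to.match.by = FALSE,
    match.by.variable.names = TRUE,
    match.by.variable.labels = TRUE,
    match.by.value.labels = FALSE
)

# Run the merge only if flipData is available.
if (requireNamespace("flipData", quietly = TRUE)) {
    merge.args$data.set.names <- c(
        system.file("examples", "cola1.sav", package = "flipData"),
        system.file("examples", "cola2.sav", package = "flipData")
    )
    result <- do.call(flipData::MergeDataSetsByCase, merge.args)
    print(result$merged.names)
}
```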

ignore.case

Logical scalar indicating whether to ignore case when matching text (variable names and labels and value labels).

ignore.non.alphanumeric

Logical scalar indicating whether to ignore non-alphanumeric characters when matching text (variable names and labels and value labels). When numeric characters appear both before and after non-alphanumeric characters, e.g., "24 - 29", the characters are still ignored but the separation between the numbers is noted.
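The effect of ignore.case and ignore.non.alphanumeric can be pictured with a small base R sketch. The helper normalizeText is hypothetical and is not flipData's implementation. In particular, it does not reproduce the special handling of numbers separated by non-alphanumeric characters described above.

```r
# Hypothetical helper, NOT the package's code: normalizes text so that
# comparisons ignore case and non-alphanumeric characters.
normalizeText <- function(x, ignore.case = TRUE,
                          ignore.non.alphanumeric = TRUE) {
    if (ignore.case)
        x <- tolower(x)
    if (ignore.non.alphanumeric)
        x <- gsub("[^[:alnum:]]", "", x)
    x
}

# These two labels compare equal after normalization.
normalizeText("Coca-Cola (Diet)") == normalizeText("coca cola diet")
```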

min.match.percentage

A numeric scalar of a percentage (number from 0 to 100) which determines how close matches need to be in order for matches to be accepted. Applies to variable names and labels and value labels.
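To give a feel for what a percentage threshold means, the sketch below scores two strings with an edit-distance-based similarity. This metric is an assumption made for illustration, and the measure flipData actually uses may differ. matchPercentage is a hypothetical helper.

```r
# Hypothetical similarity measure based on edit distance (utils::adist).
# The metric used by the package may be different.
matchPercentage <- function(a, b) {
    longest <- max(nchar(a), nchar(b))
    if (longest == 0)
        return(100)
    distance <- as.numeric(utils::adist(a, b))
    100 * (longest - distance) / longest
}

min.match.percentage <- 90

# One extra character out of ten still meets a 90% threshold.
matchPercentage("Coke Zero", "Coke Zeros") >= min.match.percentage

# Unrelated labels fall well below it.
matchPercentage("Coke Zero", "Pepsi Max") >= min.match.percentage
```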

variables.to.combine

A character vector of comma-separated variable names indicating which variables are to appear together. Ranges of variables can be specified by separating variable names by '-'. Variables can be specified from specific data sets by appending '(x)' to the variable name where x is the data set index.

variables.to.not.combine

A character vector of comma-separated variable names specifying variables that should never be combined together. To specify variables from a specific data set, suffix variable names with the data set index in parentheses, e.g., 'Q2(3)'.
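The 'name(index)' syntax can be illustrated with a short base R parser. parseVariableSpec is a hypothetical helper written for this sketch and is not part of flipData. The variable names are likewise made up.

```r
# Hypothetical parser for one specification such as "Q2(3)".
# Returns the variable name and the data set index (NA if none is given).
parseVariableSpec <- function(spec) {
    spec <- trimws(spec)
    m <- regmatches(spec, regexec("^(.*)\\(([0-9]+)\\)$", spec))[[1]]
    if (length(m) == 3) {
        list(name = trimws(m[2]), data.set.index = as.integer(m[3]))
    } else {
        list(name = spec, data.set.index = NA_integer_)
    }
}

# Split a comma-separated element, then parse each part.
specs <- strsplit("Q2(3), Q2_new", ",", fixed = TRUE)[[1]]
parsed <- lapply(specs, parseVariableSpec)
str(parsed)
```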

variables.to.keep

Character vector of variable names to keep in the merged data set. To specify variables from a specific data set, suffix the name with the data set index in parentheses, e.g., 'Q2(3)'. Ranges of variables can be specified by separating variable names by '-'. Wildcard matching of names is supported using the asterisk character '*'. This parameter is only useful when data.sets.whose.variables.are.kept is used (i.e., when variables are left out).

variables.to.omit

Character vector of variable names to omit from the merged data set. To specify variables from a specific data set, suffix the name with the data set index in parentheses, e.g., 'Q2(3)'. Ranges of variables can be specified by separating variable names by '-'. Wildcard matching of names is supported using the asterisk character '*'.
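Wildcards and ranges can be pictured with base R. The sketch below uses utils::glob2rx to expand '*' and assumes that a range covers every variable between its endpoints in data set order. Both the variable names and that reading of ranges are assumptions, and the package's own handling may differ.

```r
# Hypothetical variable names in one data set, in data set order.
variable.names <- c("Q1", "Q4_A_1", "Q4_A_2", "Q4_B_1", "Q5")

# utils::glob2rx converts a wildcard pattern into a regular expression.
pattern <- utils::glob2rx("Q4_A_*")
matched <- variable.names[grepl(pattern, variable.names)]
matched

# Assumed reading of the range "Q4_A_1-Q4_B_1": all variables from the
# first endpoint to the second, inclusive.
endpoints <- match(c("Q4_A_1", "Q4_B_1"), variable.names)
in.range <- variable.names[seq(endpoints[1], endpoints[2])]
in.range
```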

include.merged.data.set.in.output

A logical scalar which controls whether to include the merged data set in the output object, which can be used for diagnostic purposes in R.

when.multiple.labels.for.one.value

Character scalar that is either "Use label from preferred data set" or "Create new values for the labels". With the former, the label is taken from the earliest data set if use.names.and.labels.from is "First data set" and from the latest data set if it is "Last data set". With the latter, new values are generated for the extra labels.

use.names.and.labels.from

Character scalar that is either "First data set" or "Last data set". This sets the preference for either the first or last data set when choosing which names and labels to use in the merged data set.
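The two strategies for conflicting labels can be sketched on a toy example in which value 1 is labelled differently in two data sets. resolveLabels is a hypothetical helper, the data are invented, and the rule for numbering new values (continuing after the largest existing value) is an assumption, since the numbering scheme is not specified here.

```r
# Hypothetical value attributes (label = value) from two data sets.
first <- c("Coke" = 1, "Pepsi" = 2)
last  <- c("Coca-Cola" = 1, "Pepsi" = 2)

resolveLabels <- function(first, last, when.multiple.labels.for.one.value,
                          use.names.and.labels.from = "First data set") {
    use.first <- use.names.and.labels.from == "First data set"
    preferred <- if (use.first) first else last
    other <- if (use.first) last else first
    # Labels that appear only in the non-preferred data set.
    extra <- other[!names(other) %in% names(preferred)]
    clash <- extra %in% preferred
    if (when.multiple.labels.for.one.value ==
        "Use label from preferred data set") {
        # Drop extra labels whose value is already labelled.
        c(preferred, extra[!clash])
    } else {
        # Assumption: new values continue after the largest existing value.
        extra[clash] <- max(preferred, other) + seq_len(sum(clash))
        c(preferred, extra)
    }
}

resolveLabels(first, last, "Use label from preferred data set")
resolveLabels(first, last, "Create new values for the labels")
```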

data.sets.whose.variables.are.kept

An integer vector of data set indices. Merged variables are included only if they contain input variables from these data sets.

min.value.label.match.percentage

Numeric scalar of the minimum percentage match for value labels to be considered the same when combining value attributes from different variables.

Value

A list with the following elements:

  • merged.data.set If include.merged.data.set.in.output is TRUE, this is a data frame of the merged data set.

  • input.data.sets.metadata A list containing metadata on the input data sets such as variable names, labels etc. See the function metadataFromDataSets for more information.

  • merged.data.set.metadata A list containing metadata on the merged data set such as variable names, labels etc. See the function metadataFromDataSet for more information.

  • matched.names A character matrix whose rows correspond to the variables in the merged data set. The elements in each row correspond to the input data sets and contain the names of the variables from the input data sets that have been combined together to create a merged variable. This matrix also has the attributes "is.fuzzy.match" and "matched.by". is.fuzzy.match is a logical matrix of the same size as matched.names indicating if an input variable was matched using fuzzy matching. matched.by is a character matrix of the same size as matched.names containing the strings "Variable name", "Variable label", "Value label" and "Manual" indicating what data was used to match an input variable or if the variable was matched manually.

  • merged.names A character vector containing the names of the variables in the merged data set.

  • omitted.variable.names.list A list whose elements correspond to the input data sets. Each element is a character vector that contains the names of variables from an input data set that have been omitted from the merged data set.

  • input.value.attributes.list A list whose elements correspond to the variables in the merged data set. Each element is another list whose elements correspond to the input data sets, with each of these elements containing a named numeric vector representing the values and value labels of a categorical input variable. This is NULL if the input variable is not categorical.

  • is.saved.to.cloud Logical scalar that indicates whether the merged data set was saved to the Displayr cloud drive.
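The matched.names element and its attributes can be inspected with ordinary matrix operations. The sketch below builds a small mock of the documented structure (two merged variables, two input data sets) rather than calling the function, so every name and value in it is hypothetical.

```r
# Mock of the documented structure; all contents are hypothetical.
matched.names <- matrix(c("Q1",     "Q1",
                          "Q4_A_3", "Q4_A_3_new"),
                        nrow = 2, byrow = TRUE)
attr(matched.names, "is.fuzzy.match") <-
    matrix(c(FALSE, FALSE,
             FALSE, TRUE),
           nrow = 2, byrow = TRUE)
attr(matched.names, "matched.by") <-
    matrix(c("Variable name", "Variable name",
             "Variable name", "Variable label"),
           nrow = 2, byrow = TRUE)

# Rows (merged variables) that involve at least one fuzzy match.
is.fuzzy <- attr(matched.names, "is.fuzzy.match")
fuzzy.rows <- which(rowSums(is.fuzzy) > 0)
matched.names[fuzzy.rows, , drop = FALSE]

# How often each kind of metadata was used for matching.
matched.by <- attr(matched.names, "matched.by")
table(matched.by)
```

With a real result, the same queries would apply to result$matched.names.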

Examples

data.set.names <- c(system.file("examples", "cola1.sav", package = "flipData"),
                    system.file("examples", "cola2.sav", package = "flipData"),
                    system.file("examples", "cola5.sav", package = "flipData"),
                    system.file("examples", "cola8.sav", package = "flipData"))

print(MergeDataSetsByCase(data.set.names = data.set.names,
                          data.sets.whose.variables.are.kept = 1,
                          variables.to.combine = "Q4_A_3,Q4_A_3_new"))

NumbersInternational/flipData documentation built on March 2, 2024, 10:52 a.m.