consolidate: Consolidate datacube into a single dataset

consolidate R Documentation

Consolidate datacube into a single dataset

Description

This function consolidates the datasets in a 'many*' package datacube into a single dataset, combining the rows, columns, and observations of those datasets according to the chosen join and resolve options.

Usage

consolidate(
  datacube,
  join = c("full", "inner", "left"),
  resolve = "coalesce",
  key = NULL
)

Arguments

datacube

A datacube from one of the many* packages.

join

Which join procedure to use. The default, "full", retains all observations. Other options are "left", which bases the consolidated dataset on the observations present in the first dataset (reorder the datasets to favour a different one), and "inner", which keeps only observations that are present in all datasets.
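The three join options can be illustrated with base R's merge() on toy data. This is only a sketch of the join semantics described above, not the internals of consolidate(); the data frames a and b and their columns are hypothetical.

```r
# Two toy datasets sharing an ID column (hypothetical data)
a <- data.frame(ID = c("x", "y", "z"), Begin = c(1, 2, 3))
b <- data.frame(ID = c("y", "z", "w"), End = c(20, 30, 40))

full  <- merge(a, b, by = "ID", all = TRUE)    # IDs found in either dataset
inner <- merge(a, b, by = "ID")                # IDs found in both datasets
left  <- merge(a, b, by = "ID", all.x = TRUE)  # IDs found in the first dataset

nrow(full)   # 4 rows: w, x, y, z
nrow(inner)  # 2 rows: y, z
nrow(left)   # 3 rows: x, y, z
```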

resolve

How (potentially conflicting) values of shared variables should be resolved. Options include:

  • "coalesce" (default): uses the first non-NA value (if available) for each observation, essentially favouring the order of the datasets in the datacube.

  • "unite": combines the unique values for each observation across datasets as a set (separated by commas and surrounded by braces), which can be useful for retaining information.

  • "random": selects values at random from among the observations from each dataset that observed that variable, of particular use for exploring the implications of dataset-related variation.

  • "precise": selects the value that has the highest precision from among the observations from each dataset (see resolving_precision()), which favours more precise data.

  • "min", "max": these options return the minimum or maximum values respectively, which can be useful for conservative temporal fixing.

To resolve different variables with different functions, pass a named vector (e.g. resolve = c(var1 = "min", var2 = "max")). Variables not named in the vector will be resolved according to the default ("coalesce").
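To make the resolve options concrete, the sketch below applies hand-written base R equivalents of "coalesce", "unite", "min", and "max" to the candidate values for a single observation. The helper names coalesce_first() and unite_set() are hypothetical, and the exact output format of consolidate() (for example, spacing or ordering inside the braces) may differ.

```r
# Candidate values for one observation of one variable, in datacube order.
# Hypothetical values; NA means that dataset did not code the observation.
vals <- c(NA, 1950, 1948, 1950)

# "coalesce": first non-NA value, in dataset order (hypothetical helper)
coalesce_first <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# "unite": unique non-NA values as a brace-delimited set (hypothetical helper)
unite_set <- function(x) {
  paste0("{", paste(unique(x[!is.na(x)]), collapse = ","), "}")
}

coalesce_first(vals)     # 1950
unite_set(vals)          # "{1950,1948}"
min(vals, na.rm = TRUE)  # 1948
max(vals, na.rm = TRUE)  # 1950
```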

key

An ID column to collapse by. By default "manyID". Users can also specify multiple key variables as a character vector (e.g. key = c("key1", "key2")); these key variables must be present in all the datasets in the datacube. For equivalent key columns with different names across datasets, matching is possible if the keys are declared as a named vector (e.g. key = c("key1" = "key2")). Observations with missing values in the key variable are removed.
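The key behaviour described above (matching equivalent key columns that have different names, and removing observations with missing keys) can be sketched in base R as follows. This is an illustration of the idea under assumed toy data, not the implementation used by consolidate(); the column names key1 and key2 simply mirror the example in the argument description.

```r
# Two toy datasets whose ID columns have different names (hypothetical data)
a <- data.frame(key1 = c("x", "y", NA), Begin = c(1, 2, 3))
b <- data.frame(key2 = c("x", "y", "z"), End = c(10, 20, 30))

# Remove observations with a missing key before joining
a <- a[!is.na(a$key1), ]
b <- b[!is.na(b$key2), ]

# Match key1 in the first dataset to key2 in the second (full join)
out <- merge(a, b, by.x = "key1", by.y = "key2", all = TRUE)
out
```

Here out has one row per distinct key (x, y, z), with NA in Begin for z because only the second dataset observed it.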

Details

The function includes separate arguments for how the datasets are joined (join), which ID column or columns to collapse by (key), and how to resolve conflicts for observations across datasets (resolve). This provides users with considerable flexibility in how they combine data. For example, users may wish to stick to units that appear in every dataset, or to retain units that appear in any dataset. Even then there may be conflicts, as the actual unit-variable observations may differ from dataset to dataset. We offer a number of resolve methods that enable users to choose how conflicts between observations are resolved.
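The interaction between joining and resolving can be sketched with base R: two toy datasets share a variable (Begin) and disagree on one unit, so the joined result needs a resolution rule. This is a minimal sketch under assumed data, not the package's implementation; the suffixes .a and .b are arbitrary.

```r
# Two toy datasets that both code Begin, but disagree for unit "y"
a <- data.frame(ID = c("x", "y"), Begin = c(1900, 1905))
b <- data.frame(ID = c("y", "z"), Begin = c(1910, 1920))

# Full join: keep units that appear in any dataset
m <- merge(a, b, by = "ID", all = TRUE, suffixes = c(".a", ".b"))

# "coalesce"-style resolution: prefer the first dataset, fall back to the second
m$coalesced <- ifelse(is.na(m$Begin.a), m$Begin.b, m$Begin.a)

# "max"-style resolution: take the larger of the available values
m$maximum <- pmax(m$Begin.a, m$Begin.b, na.rm = TRUE)

m[, c("ID", "coalesced", "maximum")]
```

For unit y the two rules give different answers (1905 versus 1910), which is the kind of dataset-related variation the resolve argument lets users control.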

Text variables are dropped for more efficient consolidation.

Value

A single tibble/data frame.

Examples


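# Keep every observation; take the first non-NA value in dataset order
consolidate(emperors, join = "full", resolve = "coalesce", key = "ID")
# Keep only observations present in all datasets; take the minimum value
consolidate(emperors, join = "inner", resolve = "min", key = "ID")
# Keep observations present in the first dataset; take the maximum value
consolidate(emperors, join = "left", resolve = "max", key = "ID")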


manydata documentation built on April 4, 2025, 5:25 a.m.