compare_dataset_versions: Compare two versions of a dataset

View source: R/quality_assurance.R

compare_dataset_versions    R Documentation

Compare two versions of a dataset

Description

This function compares two versions of a dataset using the daff package and returns the dataset with the added, removed or changed rows identified. The compared dataset can then be exported to an Excel spreadsheet, where conditional formatting on text containing # makes it quick to see which values have changed.
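As a rough illustration of the kind of comparison the daff package produces, the sketch below diffs two small made-up data frames with daff::diff_data and writes the result to a temporary CSV file with daff::write_diff. The data frames and object names are assumptions for illustration only, and this is not necessarily how compare_dataset_versions calls daff internally.

```r
library(daff)

# Made-up example data: row 2 changes, row 3 is removed, row 4 is added
old_version <- data.frame(id = c(1L, 2L, 3L), body_mass = c(4700, 4775, 4300))
new_version <- data.frame(id = c(1L, 2L, 4L), body_mass = c(4700, 4.775, 4100))

# Compute the difference between the two versions
version_diff <- diff_data(old_version, new_version)

# Write the difference to a temporary CSV file and read it back to inspect it
diff_file <- tempfile(fileext = ".csv")
write_diff(version_diff, diff_file)
read.csv(diff_file, check.names = FALSE)
```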

Usage

compare_dataset_versions(old_version, new_version)

Arguments

old_version

The earlier version of the dataset as a data frame.

new_version

The later version of the dataset as a data frame.

Details

Before comparing versions, check that the column names are identical and that no columns have been added or removed between dataset versions, so that the schema can be aligned if necessary. This check can be done using the compare function in the waldo package.
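Where waldo is not available, the same schema check can be approximated in base R. The sketch below uses a hypothetical helper, check_same_columns, which is not part of this package, together with made-up data frames; it only compares column names and does not check column types.

```r
# Hypothetical helper (not part of this package): report column name differences
check_same_columns <- function(old_version, new_version) {
  old_names <- names(old_version)
  new_names <- names(new_version)
  list(
    removed = setdiff(old_names, new_names),      # only in the old version
    added = setdiff(new_names, old_names),        # only in the new version
    identical = identical(old_names, new_names)   # same names in the same order
  )
}

# Made-up example data with a renamed column
old_version <- data.frame(id = 1:2, body_mass_g = c(4700, 4775))
new_version <- data.frame(id = 1:2, body_mass_kg = c(4.7, 4.775))

schema_check <- check_same_columns(old_version, new_version)
schema_check$removed
schema_check$added
schema_check$identical
```

If removed or added is non-empty, the columns can be renamed or dropped so that both versions share the same schema before calling compare_dataset_versions.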

Value

A data frame with an additional difference column that indicates new, removed or updated rows, with changes highlighted using #.
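The Excel export mentioned in the Description could look something like the sketch below, which uses the openxlsx package to highlight every cell containing #. The compared data frame here is a made-up stand-in, so the real output of compare_dataset_versions may be laid out differently, and the sheet name and fill colour are arbitrary choices.

```r
library(openxlsx)

# Made-up stand-in for the output of compare_dataset_versions();
# the layout of the real result may differ
compared <- data.frame(
  difference = c("", "#", ""),
  id = 1:3,
  body_mass = c("4700", "# 4.775", "4300"),
  stringsAsFactors = FALSE
)

wb <- createWorkbook()
addWorksheet(wb, "comparison")
writeData(wb, "comparison", compared)

# Fill any cell whose text contains "#"
changed_style <- createStyle(bgFill = "#FFC7CE")
conditionalFormatting(
  wb, "comparison",
  cols = seq_len(ncol(compared)),
  rows = seq_len(nrow(compared)) + 1, # + 1 skips the header row
  rule = "#",
  style = changed_style,
  type = "contains"
)

out_file <- tempfile(fileext = ".xlsx")
saveWorkbook(wb, out_file, overwrite = TRUE)
```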

Examples

suppressPackageStartupMessages({
  suppressWarnings({
    library(palmerpenguins)
    library(dplyr)
  })
})

# select top 5 heaviest penguins from each species on each island
heaviest_penguins <- penguins %>%
  select(species, island, body_mass_g) %>%
  group_by(species, island) %>%
  arrange(desc(body_mass_g)) %>%
  slice_head(n = 5) %>%
  ungroup()
heaviest_penguins

## each version requires a unique identifier
heaviest_penguins <- heaviest_penguins %>%
  mutate(id = row_number()) %>%
  relocate(id)

## old_version: exclude Chinstrap penguins
heaviest_penguins_old <- heaviest_penguins %>%
  filter(species != "Chinstrap")

## new_version: exclude Gentoo penguins and convert body mass to kilograms
heaviest_penguins_new <- heaviest_penguins %>%
  filter(species != "Gentoo") %>%
  mutate(body_mass_g = body_mass_g / 1000) %>%
  rename(body_mass_kg = body_mass_g)

# check columns and column names are identical between versions
waldo::compare(heaviest_penguins_old, heaviest_penguins_new)

# make column names the same between versions
heaviest_penguins_old <- heaviest_penguins_old %>%
  rename(body_mass = body_mass_g)

heaviest_penguins_new <- heaviest_penguins_new %>%
  rename(body_mass = body_mass_kg)

# compare versions of dataset
suppressWarnings(compare_dataset_versions(heaviest_penguins_old, heaviest_penguins_new))

gcfrench/store documentation built on May 17, 2024, 5:52 p.m.