copies: Check the number of copies/duplicated rows in a data frame.


copies R Documentation

Check the number of copies/duplicated rows in a data frame.

Description

Checks a data frame for copied/duplicated rows, based either on the columns specified via ... or on all columns if none are specified. The output can also be filtered to retain all records with copy number information, a subset of distinct records, or a subset of duplicated records. This makes copies similar to both get_dupes() and distinct(), while offering a larger array of output options and competitive performance by using data.table as a backend. dupes is also available as a convenience shortcut for copies(filter = "dupes", sort_by_copies = TRUE).

Usage

copies(
  data,
  ...,
  filter = c("all", "dupes", "first", "last", "unique"),
  keep_all_cols = TRUE,
  sort_by_copies = FALSE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)

Arguments

data

a data frame, tibble, or data.table.

...

This special argument accepts any number of unquoted column names (also present in the data source) to use when searching for duplicates, e.g. x, y, z. Also accepts a character vector of column names or index numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats in the same call. If no column names are specified, all columns will be used.
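
The three accepted column formats might be used as sketched below. This assumes the elucidate package is installed and that its bundled pdata example data has the g and high_low columns used in the Examples section. The requireNamespace() guard and the choice of c(1, 2) (simply the first two columns, whatever they are) are illustrative additions, not part of the package documentation.

```r
# Sketch only: the calls run if elucidate is available, otherwise nothing happens.
if (requireNamespace("elucidate", quietly = TRUE)) {
  library(elucidate)

  # 1. unquoted column names
  by_names <- copies(pdata, g, high_low)

  # 2. character vector of column names
  by_chr <- copies(pdata, c("g", "high_low"))

  # 3. numeric column indices (here: the first two columns of pdata)
  by_idx <- copies(pdata, c(1, 2))

  # no columns specified: all columns are used for the check
  by_all <- copies(pdata)
}
```

As stated above, the formats should not be mixed within a single call.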

filter

Shortcuts for filtering (retaining a subset of) the rows of the output based on the number of copies detected. Options include: "all" = all rows that were present in the input (default), "dupes" = only rows that were found to be duplicated (mimics the behaviour of get_dupes), "unique" = only rows that appear as a single copy (not duplicated at all), "first" = keeps the 1st copy in cases where duplicates are detected (mimics the behaviour of distinct & unique), and "last" = keeps the last copy in cases where duplicates are detected. Note: if "dupes" is selected a message will be printed to the console indicating whether or not duplicates were detected.
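
To make the five options concrete, here is a minimal base R sketch of which rows each option is described as retaining. It does not use copies() or reproduce its data.table implementation. The toy data frame df and the helper vector n are hypothetical names introduced for illustration.

```r
# Toy data: key "a" appears 3 times, "b" once, "c" twice.
df <- data.frame(key = c("a", "b", "a", "c", "c", "a"),
                 val = 1:6)

# For every row, the number of rows that share its key.
n <- ave(seq_along(df$key), df$key, FUN = length)

all_rows    <- df                                          # "all": every input row
dupes_rows  <- df[n > 1, ]                                 # "dupes": keys seen more than once
unique_rows <- df[n == 1, ]                                # "unique": keys seen exactly once
first_rows  <- df[!duplicated(df$key), ]                   # "first": 1st copy of each key
last_rows   <- df[!duplicated(df$key, fromLast = TRUE), ]  # "last": last copy of each key
```

Note that "unique" and "first" differ: "unique" drops every row whose key is repeated, whereas "first" keeps one row per key.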

keep_all_cols

If column names are specified using ..., setting this to FALSE drops the unspecified columns from the output, similar to the .keep_all argument of `dplyr::distinct()`. Default is TRUE.

sort_by_copies

Only applicable to the "all" & "dupes" filtering options. If TRUE, sorts the results by the number of copies, in order specified by the order argument. Default is FALSE to maximize performance.

order

Only applicable to the "all" & "dupes" filtering options. If sort_by_copies is set to TRUE, this controls whether the results should be sorted in order of descending/decreasing = "d" (the default) or ascending/increasing = "a" or "i" copy numbers.
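
The interaction between sort_by_copies and order might look like the sketch below. As before, it assumes elucidate is installed and that pdata contains a g column. The requireNamespace() guard and the result names are illustrative.

```r
# Sketch only: the calls run if elucidate is available, otherwise nothing happens.
if (requireNamespace("elucidate", quietly = TRUE)) {
  library(elucidate)

  # rows with the most copies of g listed first (order = "d" is the default)
  most_first <- copies(pdata, g, filter = "all", sort_by_copies = TRUE)

  # rows with the fewest copies of g listed first
  fewest_first <- copies(pdata, g, filter = "all",
                         sort_by_copies = TRUE, order = "a")
}
```

Leaving sort_by_copies at its default of FALSE skips the sort, which is why it is documented as the faster choice.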

na_last

should rows of the specified columns with missing values be listed below non-missing values (TRUE/FALSE)? Default is FALSE.

output

"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.

Value

If the filter argument is set to "all", returns a modified version of the input data frame with two additional columns added to the end/right side:

- `copy_number` = the row copy number, which is included to allow
subsequent filtering based on the 1st or last copy detected.

- `n_copies` = the total number of copies detected

If filter is set to "dupes", only the n_copies column is appended and only duplicated rows are returned. If any of the other filter options is chosen, only the chosen subset of rows and columns is returned.
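
The meaning of the two appended columns can be illustrated with a base R sketch. It only mimics the described columns on a hypothetical toy data frame (df, with a single key column standing in for the columns passed via ...). It is not the package's own implementation.

```r
# Toy data: key "a" appears 3 times, "b" once, "c" twice.
df <- data.frame(key = c("a", "b", "a", "c", "c", "a"),
                 val = 1:6)

# copy_number: running count of each key in row order (1 = 1st copy, 2 = 2nd, ...)
df$copy_number <- ave(seq_along(df$key), df$key, FUN = seq_along)

# n_copies: total number of rows sharing the key
df$n_copies <- ave(seq_along(df$key), df$key, FUN = length)

# Subsequent filtering based on the 1st or last copy detected
first_copy <- df[df$copy_number == 1, ]
last_copy  <- df[df$copy_number == df$n_copies, ]
```

With both columns available, the 1st copy of each row is the one where copy_number equals 1, and the last copy is the one where copy_number equals n_copies.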

Author(s)

Craig P. Hutton, craig.hutton@gov.bc.ca

See Also

duplicated, get_dupes, distinct, unique, dupes

Examples


# check based on one variable & return all rows with copy indicators
copies(pdata, g, filter = "all") #the default

# check based on one variable & return duplicated rows only
copies(pdata, g, filter = "dupes")

# check based on one variable & return distinct/unique rows only
copies(pdata, g, filter = "unique")

# check based on one variable & return the 1st detected copy for cases where
# more than one copy is detected (like `dplyr::distinct()` or `unique()`)
copies(pdata, g, filter = "first")

# check based on one variable & return the last detected copy for cases where
# more than one copy is detected (like `unique()` with `fromLast = TRUE`)
copies(pdata, g, filter = "last")

## Not run: 
copies(pdata, high_low, g) #check based on 2 variables

copies(pdata) #check based on all columns

## End(Not run)


bcgov/elucidate documentation built on Sept. 3, 2022, 7:16 p.m.