dupes: Check the number of duplicated rows in a data frame.

View source: R/copies.R

dupes R Documentation

Description

Checks a data frame for duplicated rows based on the columns specified via ..., or on all columns if none are specified. dupes is a convenience shortcut for copies with the "filter" argument set to "dupes" and the "sort_by_copies" argument set to TRUE by default. For greater flexibility in checking row copy numbers or filtering for distinct rows, use copies instead. dupes behaves similarly to get_dupes() but is substantially faster because it uses data.table as a backend.

Usage

dupes(
  data,
  ...,
  keep_all_cols = TRUE,
  sort_by_copies = TRUE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)

Arguments

data

A data frame, tibble, or data.table.

...

This special argument accepts any number of unquoted column names (also present in the data source) to use when searching for duplicates, e.g. x, y, z. Also accepts a character vector of column names or index numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats in the same call. If no column names are specified, all columns will be used.
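The column-selection forms described above might be used as in the following sketch. It assumes the elucidate package is installed and that its pdata example data set (used in the Examples section) is available; the object names by_names, by_strings, and by_all are hypothetical and chosen only for illustration.

```r
# Sketch only: runs the calls if elucidate is installed, otherwise does nothing.
if (requireNamespace("elucidate", quietly = TRUE)) {
  library(elucidate)

  # unquoted column names passed via ...
  by_names <- dupes(pdata, high_low, g)

  # the same columns given as a character vector
  by_strings <- dupes(pdata, c("high_low", "g"))

  # no columns specified: all columns are used (fully duplicated rows)
  by_all <- dupes(pdata)
}
```

Pick one format per call; the documentation above notes that mixing formats in the same call is not supported.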

keep_all_cols

If column names are specified using ..., setting this to FALSE drops the unspecified columns from the output; TRUE (the default) keeps all columns. This is similar to the .keep_all argument of dplyr::distinct().

sort_by_copies

If TRUE (the default), sorts the results by the number of copies, in the order specified by the order argument.

order

If sort_by_copies is set to TRUE, this controls the sort direction for copy numbers: "d" for descending/decreasing (the default), or "a" or "i" for ascending/increasing.

na_last

Should rows with missing values in the specified columns be listed below rows with non-missing values (TRUE/FALSE)? Default is FALSE.

output

"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.
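As a rough illustration of how the optional arguments combine, the sketch below changes the sort direction, the output class, and the columns that are kept. It assumes the elucidate package and its pdata example data are installed; asc and slim are hypothetical object names.

```r
# Sketch only: runs the calls if elucidate is installed, otherwise does nothing.
if (requireNamespace("elucidate", quietly = TRUE)) {
  library(elucidate)

  # sort by ascending copy number and return a data.table
  asc <- dupes(pdata, g, order = "a", output = "dt")

  # drop columns not used in the duplicate search and return a tibble
  slim <- dupes(pdata, g, keep_all_cols = FALSE, output = "tibble")
}
```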

Value

A subset of the input data frame consisting of duplicated rows that were detected based on specified variables used to condition the search. A message will also be printed to the console indicating whether or not duplicates were detected. An n_copies column is appended specifying the total number of copies of each row that were detected.
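To make the shape of the return value concrete, here is a minimal base R sketch of the same idea: count the copies of each row for a set of key columns, keep rows with more than one copy, and sort by copy number. This is not the package's data.table implementation, and it does not reproduce the console message or the na_last handling; toy and count_dupes() are hypothetical names used only for illustration.

```r
# Illustrative only: a base R approximation of the idea behind dupes().
toy <- data.frame(
  id = c(1, 2, 2, 3, 3, 3),
  g  = c("a", "b", "b", "c", "c", "c"),
  stringsAsFactors = FALSE
)

count_dupes <- function(data, cols = names(data)) {
  # build one key string per row from the chosen columns
  key <- do.call(paste, c(unname(as.list(data[cols])), sep = "\r"))
  # total number of rows that share each key
  data$n_copies <- as.integer(ave(seq_along(key), key, FUN = length))
  # keep only rows that occur more than once
  out <- data[data$n_copies > 1, , drop = FALSE]
  # sort by copy number, most-duplicated rows first
  out <- out[order(-out$n_copies), , drop = FALSE]
  rownames(out) <- NULL
  out
}

result <- count_dupes(toy, "g")
print(result)
```

In this toy data, the row with g equal to "a" is unique and is dropped, while the "b" and "c" rows are kept with n_copies recording how many times each occurs.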

Author(s)

Craig P. Hutton, craig.hutton@gov.bc.ca

See Also

copies, get_dupes

Examples


# check for duplicates based on one variable, "g" in this case
dupes(pdata, g)

## Not run: 
# check based on 2 variables
dupes(pdata, high_low, g)

# check based on all variables, i.e. fully duplicated rows
dupes(pdata)

## End(Not run)


bcgov/elucidate documentation built on Sept. 3, 2022, 7:16 p.m.