copies | R Documentation |
Checks a data frame for copied/duplicated rows based on the variables specified via ..., or on all columns if none are specified. Also allows filtering of the output to retain all records with copy number information, a subset of distinct records, or a subset of duplicated records. This flexibility makes copies similar to both get_dupes() & distinct(), while offering a larger array of output options and competitive performance by using data.table as a backend. dupes is also available as a convenience shortcut for copies(filter = "dupes", sort_by_copies = TRUE).
copies(
  data,
  ...,
  filter = c("all", "dupes", "first", "last", "unique"),
  keep_all_cols = TRUE,
  sort_by_copies = FALSE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)
data |
a data frame, tibble, or data.table. |
... |
This special argument accepts any number of unquoted column names
(also present in the data source) to use when searching for duplicates,
e.g. |
filter |
Shortcuts for filtering (retaining a subset of) the rows of the
output based on the number of copies detected. Options include: |
keep_all_cols |
If column names are specified using |
sort_by_copies |
Only applicable to the "all" & "dupes" filtering
options. If TRUE, sorts the results by the number of copies, in order
specified by the |
order |
Only applicable to the "all" & "dupes" filtering options. If sort_by_copies is set to TRUE, this controls whether the results are sorted by copy number in descending/decreasing order ("d", the default) or ascending/increasing order ("a" or "i"). |
na_last |
Should rows with missing values in the specified columns be listed below rows with non-missing values (TRUE/FALSE)? Default is FALSE. |
output |
"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data. |
If the filter argument is set to "all", returns a modified version of the input data frame with two additional columns added to the end/right side:
- `copy_number` = the row copy number, which is included to allow subsequent filtering based on the 1st or last copy detected.
- `n_copies` = the total number of copies detected.
If filter is set to "dupes", then only the n_copies column is appended and only duplicated rows are returned. If any of the other filter options is chosen, only the chosen subset of the rows & columns will be returned.
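To make the two appended columns concrete, here is a minimal base-R sketch of what `copy_number` and `n_copies` represent. It is not the package's implementation (which uses data.table as a backend); the toy data frame `df` and the use of `ave()` are assumptions chosen purely for illustration.

```r
# Toy data (hypothetical): check for copies based on column g
df <- data.frame(g = c("a", "b", "a", "c", "a", "b"),
                 x = 1:6,
                 stringsAsFactors = FALSE)

# Running count of each value of g in row order: 1st copy, 2nd copy, ...
df$copy_number <- ave(seq_along(df$g), df$g, FUN = seq_along)

# Total number of rows sharing the same value of g
df$n_copies <- ave(seq_along(df$g), df$g, FUN = length)

# Rough analogue of filter = "dupes": keep only rows whose value of g occurs
# more than once, and keep n_copies but not copy_number
dupes_only <- df[df$n_copies > 1, setdiff(names(df), "copy_number")]
```

In this sketch, a row with `copy_number == 1` is the first copy detected, and a row with `copy_number == n_copies` is the last one, which is what makes later filtering on the 1st or last copy possible.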
Craig P. Hutton, craig.hutton@gov.bc.ca
duplicated, get_dupes, distinct, unique, dupes
# check based on one variable & return all rows with copy indicators
copies(pdata, g, filter = "all") # the default

# check based on one variable & return duplicated rows only
copies(pdata, g, filter = "dupes")

# check based on one variable & return distinct/unique rows only
copies(pdata, g, filter = "unique")

# check based on one variable & return the 1st detected copy for cases where
# more than one copy is detected (like `dplyr::distinct()` or `unique()`)
copies(pdata, g, filter = "first")

# check based on one variable & return the last detected copy for cases where
# more than one copy is detected (like `unique()` with `fromLast = TRUE`)
copies(pdata, g, filter = "last")

## Not run:
copies(pdata, high_low, g) # check based on 2 variables
copies(pdata) # check based on all columns
## End(Not run)
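The comments above compare the "first" and "last" filters to `unique()` and to `unique()` with `fromLast = TRUE`. As a rough point of reference, the base-R sketch below shows that behaviour on a single checking column using `duplicated()`. The toy data frame `df` is hypothetical, and the sketch assumes the comparison in the comments holds exactly; it does not call copies() itself.

```r
# Toy data (hypothetical): g has repeated values
df <- data.frame(g = c("a", "b", "a", "c", "b"),
                 x = 1:5,
                 stringsAsFactors = FALSE)

# First detected copy of each value of g (compare filter = "first")
first_copy <- df[!duplicated(df$g), ]

# Last detected copy of each value of g (compare filter = "last")
last_copy <- df[!duplicated(df$g, fromLast = TRUE), ]
```

Both results keep one row per distinct value of g; they differ only in which of the copies is retained.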