elucidate: Convenience Functions to Help Researchers Elucidate Patterns in Their Data

copies

R Documentation

Check the number of copies/duplicated rows in a data frame.

Description

Checks a data frame for copied/duplicated rows based on the variables specified via ..., or on all columns if none are specified. The output can also be filtered to retain all records with copy number information, a subset of distinct records, or a subset of duplicated records. This makes copies similar to both get_dupes() and distinct(), while offering a larger array of output options and competitive performance by using data.table as a backend. dupes is also available as a convenience shortcut for copies(filter = "dupes", sort_by_copies = TRUE).

Usage

copies(
  data,
  ...,
  filter = c("all", "dupes", "first", "last", "unique"),
  keep_all_cols = TRUE,
  sort_by_copies = FALSE,
  order = c("d", "a", "i"),
  na_last = FALSE,
  output = c("same", "tibble", "dt", "data.frame")
)

Arguments

`data`	a data frame, tibble, or data.table.
`...`	This special argument accepts any number of unquoted column names (also present in the data source) to use when searching for duplicates, e.g. `x, y, z`. Also accepts a character vector of column names or index numbers, e.g. c("x", "y", "z") or c(1, 2, 3), but not a mixture of formats in the same call. If no column names are specified, all columns will be used.
`filter`	Shortcuts for filtering (retaining a subset of) the rows of the output based on the number of copies detected. Options include: `"all"` = all rows that were present in the input (default), `"dupes"` = only rows that were found to be duplicated (mimics the behaviour of `get_dupes`), `"unique"` = only rows that appear as a single copy (not duplicated at all), `"first"` = keeps the 1st copy in cases where duplicates are detected (mimics the behaviour of `distinct` & `unique`), and `"last"` = keeps the last copy in cases where duplicates are detected. Note: if `"dupes"` is selected a message will be printed to the console indicating whether or not duplicates were detected.
`keep_all_cols`	If column names are specified using `...`, this allows you to drop unspecified columns, similarly to the `.keep_all` argument for `dplyr::distinct()`.
`sort_by_copies`	Only applicable to the "all" & "dupes" filtering options. If TRUE, sorts the results by the number of copies, in order specified by the `order` argument. Default is FALSE to maximize performance.
`order`	Only applicable to the "all" & "dupes" filtering options. If sort_by_copies is set to TRUE, this controls whether the results should be sorted in order of descending/decreasing = "d" (the default) or ascending/increasing = "a" or "i" copy numbers.
`na_last`	Should rows of the specified columns with missing values be listed below rows with non-missing values (TRUE/FALSE)? Default is FALSE.
`output`	"tibble" for tibble, "dt" for data.table, or "data.frame" for a data frame. "same", the default option, returns the same format as the input data.
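
The argument descriptions above can be illustrated with a short sketch. The data frame `df` and its columns `g`, `h`, and `x` are made up for this example, and the calls are wrapped in a `requireNamespace()` check so the snippet does nothing if elucidate is not installed. It only uses arguments listed in the Usage section and makes no claim about the printed output.

```r
# Toy data frame; the column names g, h, and x are hypothetical.
df <- data.frame(
  g = c("a", "b", "a", "c", "a", "c"),
  h = c(1, 2, 1, 3, 1, 4),
  x = 1:6
)

# Only run the calls if the elucidate package is available.
if (requireNamespace("elucidate", quietly = TRUE)) {
  # Three ways to pass columns via `...` (not to be mixed in one call).
  by_name  <- elucidate::copies(df, g, h)          # unquoted names
  by_chr   <- elucidate::copies(df, c("g", "h"))   # character vector
  by_index <- elucidate::copies(df, c(1, 2))       # column indices

  # Keep only the checked column, return duplicated rows only,
  # and sort them by descending number of copies as a tibble.
  sorted <- elucidate::copies(
    df, g,
    filter = "dupes",
    keep_all_cols = FALSE,
    sort_by_copies = TRUE,
    order = "d",
    output = "tibble"
  )

  # Per the Description, dupes() is a shortcut for
  # copies(filter = "dupes", sort_by_copies = TRUE).
  shortcut <- elucidate::dupes(df, g)
}
```

Leaving `sort_by_copies` at its default of FALSE skips the sorting step, which the documentation describes as the faster option.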

Value

If filter argument is set to "all", returns a modified version of the input data frame with two additional columns added to the end/right side:

- `copy_number` = the row copy number, which is included to allow subsequent filtering based on the 1st or last copy detected.

- `n_copies` = the total number of copies detected

If filter is set to "dupes", only the n_copies column is appended and only duplicated rows are returned. If any of the other filter options is chosen, only the chosen subset of rows and columns is returned.
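
To make the meaning of the two added columns concrete, here is a minimal base R sketch that computes analogous values for a toy data frame. This is not how copies is implemented (the package uses data.table as a backend); the data frame `df` and the helper objects are hypothetical and only show what `copy_number` and `n_copies` represent and how the filter options relate to them.

```r
# Toy data: "a" appears 3 times, "c" twice, "b" once in column g.
df <- data.frame(
  g = c("a", "b", "a", "c", "a", "c"),
  x = 1:6,
  stringsAsFactors = FALSE
)

# Key built from the columns used for duplicate checking (here only g).
key <- df$g

# copy_number: running count of each key in row order (1 = first copy seen).
df$copy_number <- ave(seq_along(key), key, FUN = seq_along)

# n_copies: total number of rows that share the same key.
df$n_copies <- ave(seq_along(key), key, FUN = length)

# Rough base R analogues of the filter options described above.
dupes_rows  <- df[df$n_copies > 1, ]                # like filter = "dupes"
unique_rows <- df[df$n_copies == 1, ]               # like filter = "unique"
first_rows  <- df[df$copy_number == 1, ]            # like filter = "first"
last_rows   <- df[df$copy_number == df$n_copies, ]  # like filter = "last"
```

In this sketch the "first" and "last" subsets also contain the rows that occur only once, matching the documented comparison with `distinct()` and `unique()`.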

Author(s)

Craig P. Hutton, craig.hutton@gov.bc.ca

Examples


# check based on one variable & return all rows with copy indicators
copies(pdata, g, filter = "all") #the default

# check based on one variable & return duplicated rows only
copies(pdata, g, filter = "dupes")

# check based on one variable & return distinct/unique rows only
copies(pdata, g, filter = "unique")

# check based on one variable & return the 1st detected copy for cases where
# more than one copy is detected (like `dplyr::distinct()` or `unique()`)
copies(pdata, g, filter = "first")

# check based on one variable & return the last detected copy for cases where
# more than one copy is detected (like `unique()` with `fromLast = TRUE`)
copies(pdata, g, filter = "last")

## Not run: 
copies(pdata, high_low, g) #check based on 2 variables

copies(pdata) #check based on all columns

## End(Not run)

bcgov/elucidate documentation built on Sept. 3, 2022, 7:16 p.m.
